Vision-Language Models that See, Understand, and Speak the Language of Loss Prevention

Imagine a retail store that can see and understand what’s happening around it—as clearly as a store associate—and respond intelligently in real time. That is the promise of Vision-Language Models (VLMs). VLMs are “multi-modal” models because they can simultaneously analyze video and language data to summarize what is happening at a location as it happens.

VLMs are an exciting area of artificial intelligence that combines computer vision with natural language processing (NLP) to interpret images and videos in context. Computer vision is almost self-explanatory—it is when computers use video or other image data to “see” the world. Computer vision has been the talk of the LP industry for years because of the opportunities it can create in threat detection, investigations, and many other aspects of LP.

However, VLMs are exciting because they extend the utility of computer vision. In LP, computer vision models have been used primarily to detect and report the presence of objects and basic activities in video. VLMs layer in NLP to provide more in-depth information from visual data. NLP is a field of artificial intelligence that helps computers understand, interpret, and respond to human language. Therefore, when NLP is applied to the output from computer vision models, computers can synthesize much richer descriptions of a scenario or activity, and its context.

To understand what VLMs are, one should first understand image captioning. Image captioning is the process of converting an input image into a text description of the image. This process is shown below; if you have ever created an image using generative AI, it is essentially the process in reverse.
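As a sketch, the captioning step described above can be reduced to a single function call. The code below is a minimal illustration, not a product recommendation: `captioner` is a placeholder for any image-to-text model, and `stub_captioner` exists only to show the call shape.

```python
from typing import Callable, Dict, List

def caption_image(image_path: str,
                  captioner: Callable[[str], List[Dict[str, str]]]) -> str:
    """Return the top text caption for an image.

    `captioner` stands in for any image-to-text model. With the
    Hugging Face transformers library, for example, it could be
    built as: captioner = pipeline("image-to-text"). Any callable
    with the same input/output shape works.
    """
    results = captioner(image_path)
    return results[0]["generated_text"]

# A stub captioner, used here only to illustrate the interface.
def stub_captioner(path: str) -> List[Dict[str, str]]:
    return [{"generated_text": "a person placing an item into a backpack"}]

caption = caption_image("aisle_cam_frame.jpg", stub_captioner)
```

Generative image models run this same mapping in the opposite direction: text in, image out.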

For retailers focused on LP, VLMs represent a groundbreaking opportunity to enhance security, detect a range of suspicious visitor behaviors, and potentially make smarter LP decisions. However, VLMs have many other retail applications beyond LP. But before we get too far, let’s break VLMs down further.

Breaking It Down: What Are VLMs?

VLMs are AI systems that combine two key capabilities:

1. Vision Understanding: The ability to analyze images or videos, identifying objects, people, actions, or behaviors.

2. Language Understanding: The ability to understand and generate text, including labels or descriptions of what the model is seeing, as in image captioning systems.

Together, these models can look at an image or video and describe what’s happening—almost like a person would. For instance, a VLM might look at security footage and say, “A person wearing a hoodie is concealing an item in the electronics aisle.” A software program could then be written to flag the model’s concerning outputs, e.g., descriptions associated with various kinds of theft: cart pushout, concealment, etc. This also means VLMs could be used to detect violent incidents or tools associated with them, such as guns or blunt objects.
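The flagging program described above can be sketched in a few lines. The keyword-to-category mapping below is illustrative only; a real deployment would use a much richer taxonomy, or classify the caption with a model rather than string matching.

```python
# Minimal sketch: flag VLM captions that suggest theft-related activity.
# The term list is an illustrative assumption, not an exhaustive taxonomy.
THEFT_TERMS = {
    "concealing": "concealment",
    "hiding": "concealment",
    "pushout": "cart pushout",
    "pushing a full cart": "cart pushout",
}

def flag_caption(caption: str) -> list[str]:
    """Return the theft categories suggested by a caption, if any."""
    text = caption.lower()
    return sorted({label for term, label in THEFT_TERMS.items() if term in text})

alerts = flag_caption("A person wearing a hoodie is concealing an item "
                      "in the electronics aisle.")
```

A caption with no matching terms simply produces an empty list, so ordinary shopping activity generates no alert.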

Why Are VLMs a Game-Changer for Loss Prevention?

Traditionally, LP programs have relied on a variety of technologies, processes, human vigilance, and human expertise to prevent as well as investigate theft, fraud, or violence. Cameras, whether analog or IP, have long played an important role in LP. Unfortunately, as LPRC research shows, video review is one of the most time-consuming and inefficient aspects of the investigator’s job. Furthermore, humans have limited abilities to effectively analyze the amount of video that retail stores collect at any given time. VLMs operating on camera video streams take things to the next level by:

1. Providing Contextual Understanding: Unlike traditional cameras paired with a video management system (VMS) that simply records footage, VLMs can interpret what they see. They don’t just capture footage—they can recognize unusual behavior and alert store associates based on predefined prompts or questions. Unsurprisingly, this is called Visual Question Answering (VQA). For example, you might ask a VLM, “Is anyone walking out of the store carrying merchandise that is not in a bag?” and it can provide an answer.

2. Reducing False Alarms: By understanding the context of a situation through a sufficiently descriptive prompt or question, VLMs can reduce the number of false positives, such as mistaking a big sales event that draws a crowd for a flash rob.

3. Supporting Decision-Making: VLMs don’t just detect problems; they describe them, helping LP teams decide how to respond once alerted to any detected and described event class of interest. In other words, they can play an important role in understanding a business problem before blindly throwing solutions at it.
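The VQA pattern above amounts to posing a question about a frame and acting on the answer. As a hedged sketch, `vqa_model` below is a placeholder for any VQA backend (the article names no specific one), and `stub_vqa` only demonstrates the flow.

```python
from typing import Callable

def ask_frame(frame_id: str, question: str,
              vqa_model: Callable[[str, str], str]) -> bool:
    """Pose a yes/no question about a video frame to a VQA model.

    `vqa_model` stands in for any visual-question-answering backend:
    it takes (frame_id, question) and returns a short text answer.
    """
    answer = vqa_model(frame_id, question).strip().lower()
    return answer.startswith("yes")

# Stub backend used only to show the flow end to end.
def stub_vqa(frame_id: str, question: str) -> str:
    return "Yes, one person is carrying unbagged merchandise."

should_alert = ask_frame(
    "front_door_cam_0142",
    "Is anyone walking out of the store carrying merchandise "
    "that is not in a bag?",
    stub_vqa,
)
```

The boolean result is what a downstream alerting system would consume; the free-text answer itself can also be logged to support the decision-making use case above.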

Examples of Cutting-Edge Use Cases for VLMs in Retail Loss Prevention

Let’s explore some of the most exciting ways VLMs could be used to enhance other LP innovations:

1. Detecting Concealment Shoplifting Behaviors or Other Criminal Warning Indicators

VLMs can analyze live video feeds to identify actions like:

  • Concealing items in bags, pockets, or under clothing.
  • Tampering with price tags, products, or security devices.
  • Concealing identity with masks, hoodies, bandanas, or sunglasses worn indoors.

Example: A VLM sees a person repeatedly looking over their shoulder while handling a product. It flags this behavior as unusual and alerts store associates to check in.
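The word “repeatedly” in this example matters: a single glance over the shoulder is normal, so an alerting system would typically require the behavior to recur within a short window. A minimal sketch of that logic, with illustrative (not field-tuned) thresholds:

```python
from collections import deque

class BehaviorWindow:
    """Flag a behavior only when it recurs within a short time window.

    The 60-second window and threshold of 3 sightings are illustrative
    assumptions; a real deployment would tune both per behavior and store.
    """
    def __init__(self, window_seconds: float = 60.0, threshold: int = 3):
        self.window = window_seconds
        self.threshold = threshold
        self.events: deque = deque()  # timestamps of recent sightings

    def observe(self, timestamp: float) -> bool:
        """Record one sighting; return True once the threshold is hit."""
        self.events.append(timestamp)
        # Drop sightings that have aged out of the window.
        while self.events and timestamp - self.events[0] > self.window:
            self.events.popleft()
        return len(self.events) >= self.threshold

watcher = BehaviorWindow()
hits = [watcher.observe(t) for t in (0.0, 10.0, 20.0)]  # third sighting trips it
```

Each VLM detection of the behavior feeds one `observe()` call; only sustained repetition raises the alert, which also helps with the false-alarm concern discussed earlier.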

2. Monitoring Self-Checkout Fraud

Self-checkouts are convenient but prone to fraud, such as scanning lower-priced items instead of the real product. VLMs can:

  • Detect mismatches between scanned barcodes and the items being bagged, but, more importantly, they can also describe which mismatches occurred and potentially how the mismatch happened.
  • Identify attempts to skip scanning altogether.

Example: A customer scans a water bottle but places an electronic gadget in the bag. A system connected to the VLM alerts an associate before the customer leaves.
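The comparison in this example can be sketched as a simple label check. Plain string equality is an assumption made here for brevity; a real system would compare product categories or embeddings, since a VLM will rarely emit the exact catalog name.

```python
from typing import Optional

def check_scan(scanned_label: str, seen_label: str) -> Optional[str]:
    """Compare the scanned product label with what the VLM saw bagged.

    Returns a human-readable mismatch description, or None if they agree.
    Exact-text matching is a simplification; production systems would
    match on product category or visual-embedding similarity instead.
    """
    if scanned_label.strip().lower() == seen_label.strip().lower():
        return None
    return (f"Mismatch at self-checkout: scanned '{scanned_label}' "
            f"but bagged item looks like '{seen_label}'.")

alert = check_scan("water bottle", "electronic gadget")
```

Because the function returns a description rather than a bare flag, the associate who responds knows which mismatch occurred, matching the descriptive advantage noted above.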

3. Enhancing Crowd Monitoring

VLMs can monitor crowd behaviors to prevent flash robs or violent incidents by:

  • Detecting large groups of people entering together with their faces obscured.
  • Recognizing signs of agitation or escalating tension, such as crowds running.

Example: A group of individuals wearing hoodies and masks runs through the front entrance. The VLM flags this as a potential flash rob scenario and contacts store security or law enforcement.
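The group-entry signal in this example combines two of the cues above: obscured faces and arrivals clustered in time. A minimal sketch, with a window and headcount threshold that are illustrative assumptions only:

```python
from collections import deque
from typing import List, Tuple

def flash_rob_risk(entries: List[Tuple[float, bool]],
                   window_seconds: float = 30.0,
                   threshold: int = 4) -> bool:
    """Return True if `threshold` or more face-obscured people entered
    within any `window_seconds` span.

    `entries` holds (timestamp, face_obscured) pairs, e.g. produced by
    a VLM describing people at the entrance. The 30-second window and
    threshold of 4 are illustrative, not field-tuned values.
    """
    recent: deque = deque()
    for ts, obscured in sorted(entries):
        if not obscured:
            continue
        recent.append(ts)
        while recent and ts - recent[0] > window_seconds:
            recent.popleft()
        if len(recent) >= threshold:
            return True
    return False

risk = flash_rob_risk([(0, True), (2, True), (3, False), (5, True), (8, True)])
```

Isolated masked entries spread across the day would never trip the threshold, which again helps keep false alarms down.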

Use Cases for VLMs Beyond Loss Prevention

Many retail divisions, departments, and teams use video, and many of those uses center on mitigating some threat, risk, or liability. However, VLMs make it possible for retailers to efficiently use video for many other purposes as well.

This is probably the most exciting aspect of VLMs: the video data that retailers spend so much money to acquire is becoming much more valuable, and the retailers that succeed tomorrow will be the ones who quickly understand and leverage it. More important, the LP leaders who will make a name for themselves tomorrow are the ones thinking today about how they can help their business partners achieve their goals. It is therefore important to consider how a technology applies to the business as a whole while building the business case for it.

For example, VLMs could be used to identify and alert store employees when:

  • An elderly customer is having trouble removing an item from a shelf—a potential safety incident.
  • A case or shelf has not been stocked prior to opening—an opportunity to avoid missed sales.
  • A person has spent time removing, inspecting, and replacing multiple items from a shelf—an opportunity to help a customer with product selection.

The point is that VLMs connect visual and language data. This creates incredibly powerful opportunities to protect people, places, and property and also adds to the overall value proposition of retailers’ camera systems and the video data they create.

Conclusion

It is still very early days for VLMs. As VLMs evolve, they will become indispensable tools for retail security teams, offering smarter, faster, and more proactive ways to combat theft, fraud, or violence while improving overall store operations. VLMs represent cutting-edge AI innovation in LP. By combining the ability to see with the ability to understand language, VLMs empower retailers to detect, describe, and respond to security threats with unprecedented accuracy and efficiency. As these models continue to advance, they promise to reduce losses and create safer, smarter, and more secure shopping environments.


Caleb Bowyer is a PhD candidate in engineering at the University of Florida. His dissertation research is on how to best train cooperative autonomous agents from noisy and partially observed data for decision-making tasks, primarily focusing on localization applications. He is expected to defend and graduate with his PhD in summer 2025. At the LPRC, he is head of the LPRC’s DETECT research initiative and facilitator of the Data Analytics Working Group (DAWG), which meets monthly. He researches, develops, and implements AI solutions related to detecting theft, fraud, or violence attempts—earlier and further away from retail stores. His goal at the LPRC is to integrate all of the sensors and AI models across the five zones of influence into a production-ready system for better alerting of offenders in real time, detecting repeat offenders on their journey to commit harm, and then deflecting that harm.
