Since its launch in 2018, Amazon's Just Walk Out technology has transformed the shopping experience by allowing customers to enter a store, pick up items, and walk out without waiting in line to check out. You can find this checkout-free technology in more than 180 third-party stores around the world, including travel retailers, sports stadiums, entertainment venues, conference centers, theme parks, convenience stores, hospitals, and college campuses. The end-to-end Just Walk Out system automatically determines which products each customer picked up in-store and provides a digital receipt.
In this post, we showcase the latest generation of Amazon's Just Walk Out technology, powered by a multimodal foundation model (FM). We designed this multimodal FM for brick-and-mortar stores using a transformer-based architecture similar to the one underpinning many generative artificial intelligence (AI) applications. The model helps retailers generate highly accurate shopping receipts using data from multiple inputs, including a network of overhead video cameras, specialized weight sensors on shelves, digital floor plans, and product catalog images. In simple terms, a multimodal model is one that combines data from multiple types of input.
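To make the idea concrete, the following is a minimal sketch, not Amazon's actual model, of how several retail input modalities can be projected into a shared token space and fused by a single transformer. All module names, feature dimensions, and hyperparameters here are hypothetical.

```python
# Illustrative sketch only, not Amazon's actual model: several retail input
# modalities are projected into a shared token space and fused by one
# transformer. All dimensions and hyperparameters are hypothetical.
import torch
import torch.nn as nn

class MultimodalReceiptModel(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        # One lightweight encoder per modality, all mapping to d_model tokens.
        self.video_proj = nn.Linear(1024, d_model)   # overhead camera features
        self.weight_proj = nn.Linear(16, d_model)    # shelf weight-sensor readings
        self.layout_proj = nn.Linear(64, d_model)    # digital floor-plan features
        self.catalog_proj = nn.Linear(768, d_model)  # catalog image embeddings
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, video, weights, layout, catalog):
        # Concatenate all modality tokens into one sequence for joint attention.
        tokens = torch.cat([
            self.video_proj(video),
            self.weight_proj(weights),
            self.layout_proj(layout),
            self.catalog_proj(catalog),
        ], dim=1)
        return self.transformer(tokens)
```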
Our investments in state-of-the-art multimodal FM research and development (R&D) enable the Just Walk Out system to be deployed in a wide range of shopping situations with greater accuracy and at a lower cost. Similar to large language models (LLMs) that generate text, the new Just Walk Out system is designed to generate an accurate sales receipt for every shopper who visits the store.
The challenge: addressing complex long-tail purchasing scenarios
Because of their innovative checkout-free environment, Just Walk Out stores presented us with a unique technical challenge. Retailers and shoppers, as well as Amazon, demand nearly 100 percent receipt accuracy, even in the most complex shopping situations, including unusual shopping behaviors that produce long, complicated sequences of activity requiring extra effort to analyze.
Previous generations of the Just Walk Out system used a modular architecture: they addressed complex shopping situations by breaking the shopper's visit down into discrete tasks, such as detecting shopper interactions, tracking items, identifying products, and counting what was selected. These individual components were then chained into sequential pipelines to provide the overall functionality of the system. While this approach produced highly accurate receipts, significant engineering effort was required to handle new, previously unseen situations and complex shopping scenarios, which limited the scalability of the approach.
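For contrast, here is a hypothetical sketch of that modular style: discrete, hand-engineered stages chained into a pipeline. The stage names, signatures, and stub outputs are illustrative only, not the production interfaces.

```python
# Hypothetical sketch of a modular pipeline in the spirit of the earlier
# system: discrete, hand-engineered stages chained sequentially. The stage
# names, signatures, and stub outputs are illustrative only.
def detect_interactions(video_streams):
    return [{"shopper": 1, "shelf": "A3"} for _ in video_streams]

def track_items(interactions):
    return [{**ev, "track_id": i} for i, ev in enumerate(interactions)]

def identify_products(tracks):
    return [{**t, "product_id": "sku-001"} for t in tracks]

def count_selections(products):
    counts = {}
    for p in products:
        counts[p["product_id"]] = counts.get(p["product_id"], 0) + 1
    return counts

def generate_receipt_modular(video_streams):
    # Each stage feeds the next; a new edge case in any one stage requires
    # targeted engineering work, which is what limited scalability.
    return count_selections(
        identify_products(track_items(detect_interactions(video_streams)))
    )

print(generate_receipt_modular(["cam-1", "cam-2"]))  # {'sku-001': 2}
```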
The solution: Just Walk Out multimodal AI
To address these challenges, we are introducing a new multimodal FM designed specifically for retail store environments, enabling Just Walk Out technology to handle real-world, complex shopping scenarios. The new multimodal FM further enhances the capabilities of the Just Walk Out system by generalizing more effectively to new store formats, products, and customer behaviors, which is crucial for scaling Just Walk Out technology.
Incorporating continuous learning allows the model to automatically adapt to and learn from new challenging scenarios as they arise. This self-improving capability helps ensure that the system maintains high performance even as purchasing environments continue to evolve.
Thanks to this combination of end-to-end learning and improved generalizability, the Just Walk Out system can address a wider range of dynamic and complex retail environments. Retailers can deploy this technology with confidence, knowing that it will provide their customers with a seamless checkout experience.
The following video shows our system architecture in action.
Key elements of our Just Walk Out multimodal AI model include:
- Flexible data inputs – The system tracks shopper interactions with products and fixtures such as shelves or refrigerators. It relies primarily on multi-view video streams as inputs, using weight sensors only to track small items. The model maintains a 3D digital representation of the store and can consult catalog images to identify products, even if a shopper returns an item to the wrong shelf location.
- Multimodal AI tokens to represent shopper journeys – Multimodal input data is processed by encoders, which compress it into transformer tokens, the basic unit of input for the receipt model. This allows the model to interpret hand movements, differentiate between items, and quickly and accurately count the number of items picked up or returned to the shelf.
- Continuous updating of receipts – The system uses these tokens to create a digital receipt for each shopper. It can distinguish between concurrent shopping sessions and dynamically updates each session's receipt as customers pick up or return items, as sketched in the example after this list.
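The following is a minimal sketch, with invented event fields and session IDs, of how per-session receipts might be updated continuously as predicted pick and return events stream in.

```python
# Minimal sketch with invented event fields: one running receipt per
# shopping session, updated as pick/return predictions arrive.
from collections import defaultdict

class ReceiptTracker:
    def __init__(self):
        # session_id -> {product_id: count}
        self.receipts = defaultdict(lambda: defaultdict(int))

    def update(self, session_id, product_id, action):
        """Apply one predicted shopper action to the session's receipt."""
        delta = 1 if action == "pick" else -1
        count = self.receipts[session_id][product_id] + delta
        # A returned item can zero out a line, but never go negative.
        self.receipts[session_id][product_id] = max(0, count)

    def finalize(self, session_id):
        # Keep only products with a positive count when the shopper exits.
        return {p: n for p, n in self.receipts.pop(session_id).items() if n > 0}

tracker = ReceiptTracker()
tracker.update("session-42", "sparkling-water", "pick")
tracker.update("session-42", "sparkling-water", "pick")
tracker.update("session-42", "sparkling-water", "return")
print(tracker.finalize("session-42"))  # {'sparkling-water': 1}
```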
Just Walk Out FM training
By feeding large amounts of multimodal data into the Just Walk Out FM, we found that it could consistently generate (or, technically, "predict") accurate receipts for shoppers. To improve accuracy, we designed more than 10 auxiliary tasks, such as detection, tracking, image segmentation, grounding (linking abstract concepts to real-world objects), and activity recognition. All of these tasks are learned within a single model, improving the model's ability to handle new, never-before-seen store formats, products, and customer behaviors, which is crucial for bringing Just Walk Out technology to new locations.
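One standard way to learn many tasks in a single model is to train on a weighted sum of per-task losses. The sketch below illustrates that pattern; the specific tasks, loss functions, and weights are assumptions for illustration, not the production configuration.

```python
# Illustrative multi-task objective: the receipt loss plus weighted auxiliary
# losses are summed so one shared model learns every task. The task list,
# loss functions, and weights are assumptions, not the production setup.
import torch.nn.functional as F

TASK_LOSSES = {
    "receipt": F.cross_entropy,   # primary task: predicted receipt tokens
    "detection": F.mse_loss,      # auxiliary: bounding-box regression
    "activity": F.cross_entropy,  # auxiliary: pick / return / browse labels
}
TASK_WEIGHTS = {"receipt": 1.0, "detection": 0.3, "activity": 0.3}

def multi_task_loss(outputs, targets):
    # Weighted sum of per-task losses, backpropagated through the one model.
    return sum(TASK_WEIGHTS[t] * TASK_LOSSES[t](outputs[t], targets[t])
               for t in TASK_WEIGHTS)
```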
AI model training, in which curated data is fed into selected algorithms, helps the system refine itself to produce accurate results. We quickly discovered that we could speed up our model training by using a data flywheel, which continuously extracts and labels high-quality data in a self-reinforcing cycle. The system is designed to integrate these incremental improvements with minimal manual intervention. The following diagram illustrates the process.
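As a complement to that diagram, here is a schematic sketch of one flywheel iteration; every function is a placeholder standing in for the real (unpublished) components.

```python
# Schematic data flywheel iteration: mine hard cases from production,
# auto-label them, fold them into the training set, retrain. Every function
# here is a placeholder for the real components.
import random

def model_confidence(session):
    return random.random()  # stand-in for the model's own confidence score

def auto_label(session):
    return {"session": session, "label": "auto"}  # stand-in for labeling models

def retrain(training_set):
    print(f"retraining on {len(training_set)} examples")

def flywheel_iteration(production_sessions, training_set):
    # Low-confidence sessions are the "challenging" cases worth learning from.
    hard_cases = [s for s in production_sessions if model_confidence(s) < 0.1]
    training_set.extend(auto_label(s) for s in hard_cases)
    retrain(training_set)

flywheel_iteration(production_sessions=range(1000), training_set=[])
```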
To effectively train an FM, we invested in a robust infrastructure that can efficiently process the massive amounts of data required to train high-capacity neural networks that mimic human decision-making. We built the infrastructure for our Just Walk Out model with the help of several Amazon Web Services (AWS) offerings, including Amazon Simple Storage Service (Amazon S3) for data storage and Amazon SageMaker for training.
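As an illustration, this is roughly what launching a distributed training job with the SageMaker Python SDK against data in Amazon S3 looks like. The entry point, IAM role, bucket, and instance choices below are placeholders, not the actual Just Walk Out training setup.

```python
# Illustrative SageMaker training job reading data from Amazon S3.
# Role ARN, bucket, script, and instance choices are all placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",          # your training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=8,                # distributed training across nodes
    instance_type="ml.p4d.24xlarge", # GPU instances for large models
    framework_version="2.1",
    py_version="py310",
)
# Training data is streamed directly from Amazon S3.
estimator.fit({"training": "s3://my-bucket/multimodal-training-data/"})
```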
Below are some key steps we follow in training our FM:
- Selecting challenging data sources – To train our AI model for Just Walk Out technology, we focus on training data from particularly challenging shopping scenarios that test the limits of the model. Although these complex cases make up only a small fraction of shopping data, they are the most valuable for helping the model learn from its mistakes.
- Taking advantage of automatic labeling – To increase operational efficiency, we develop algorithms and models that automatically assign meaningful labels to the data. In addition to receipt prediction, our automated labeling algorithms cover the auxiliary tasks, helping ensure that the model acquires comprehensive multimodal understanding and reasoning capabilities.
- Model pre-training – Our FM is pre-trained on a large collection of multimodal data across a wide range of tasks, improving the model’s ability to generalize to new, never-before-seen store environments.
- Fine-tuning the model – Finally, we further refine the model and apply quantization techniques to create a smaller, more efficient model that can run on edge computing hardware; see the quantization sketch after this list.
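The post doesn't specify which quantization technique was used, so the following sketch shows one common route, PyTorch post-training dynamic quantization, purely as an illustration of shrinking a model for edge deployment.

```python
# One common post-training route: PyTorch dynamic quantization converts
# Linear weights to int8 for a smaller, faster model at the edge. This is
# an illustration; the technique actually used is not specified in the post.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 128))
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized)  # Linear layers replaced by dynamic quantized versions
```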
As the data flywheel continues to run, it progressively identifies and incorporates more high-quality, complex cases that test the robustness of the model. These additional complex samples are then folded into the training set, further improving the accuracy and applicability of the model in new brick-and-mortar environments.
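One simple way to implement that selection step, sketched here with a placeholder scoring function, is to rank candidate sessions by how hard the current model finds them and promote the top k into the training set.

```python
# Sketch of one possible hard-example selector: rank candidate sessions by
# a difficulty score and promote the top k. The loss-based score here is a
# placeholder for however "challenging" is actually measured in production.
import heapq

def select_hard_examples(sessions, score_fn, k=100):
    return heapq.nlargest(k, sessions, key=score_fn)

sessions = [{"id": i, "loss": i * 0.01} for i in range(1000)]
hardest = select_hard_examples(sessions, score_fn=lambda s: s["loss"], k=100)
```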
Conclusion
In this post, we showed how our multimodal AI system opens significant new possibilities for Just Walk Out technology. With our innovative approach, we are moving away from modular AI systems that rely on human-defined subcomponents and interfaces, and toward simpler, more scalable AI systems that can be trained end to end. Although we are just getting started, multimodal AI has raised the bar for our already highly accurate receipt system and will enable us to improve the shopping experience in more Just Walk Out technology stores around the world.
Visit About Amazon to read the official announcement about the new multimodal AI system and learn more about the latest improvements to Just Walk Out technology.
To find Just Walk Out technology locations, visit Just Walk Out technology Locations Near You. Learn more about how to power your store or location with Amazon's Just Walk Out technology on the Just Walk Out technology product page.
Visit Build and scale the next wave of AI innovation on AWS to learn more about how AWS can help you reinvent customer experiences with the most comprehensive set of AI and ML services.
About the authors
Tian Lan is a Principal Scientist at AWS. He is currently leading the research efforts to develop the next-generation Just Walk Out 2.0 technology, transforming it into a multimodal, store-domain-focused, end-to-end learning-based model.
Chris Broaddus is a Senior Manager at AWS. He currently manages all the research efforts for Just Walk Out technology, including the multimodal AI model and other projects, such as deep learning for human pose estimation and receipt prediction using radio frequency identification (RFID).