In my role as Chief Scientist at Zippin, I get asked a lot about the myriad cameras and sensors in checkout-free stores. If you’re amazed at the design and the amount of information they’re able to collect, you’re not alone. It’s a synergistic combination of three complex technologies: deep learning, sensor fusion, and computer vision. In this three-part blog series, we deep-dive into each of these core technologies and explore their role in enabling checkout-free retail.
Deep learning is at the heart of all modern AI advances, ranging from protein folding and speech understanding to medical imaging and driverless vehicles. Let’s take a look at the underlying principles behind this technology and explore its requirements, applications, and, most importantly, its limitations. In particular, we’ll examine how deep learning is already revolutionizing checkout-free brick-and-mortar food service and retail in a big way, with lots of untapped potential.
Simply stated, machine learning is the science of recognizing patterns. These patterns may come from everyday human activities such as reading, writing, hearing, and visual perception, or from more complex activities like playing a game of Go, understanding the human genome, or predicting protein structure.
I started my AI career in the late 1990s doing machine learning. Back in those days, we spent a major chunk of our research time designing and handcrafting distinguishing features that would help us recognize these patterns. Often this required lots of domain expertise and pondering over lots and lots of data. I still remember analyzing endless hours of drone video to solve the problem of detecting and tracking moving people and vehicles that are just a few tens of pixels across in a high-resolution image. The trick was to have an intuition for what kinds of features would help with the specific problem at hand and then tweak those features to get the best performance numbers.
As a result of the domain expertise needed, machine learning was pretty fragmented across domains such as natural language processing, speech understanding, and computer vision. The underlying theory was the same in each, but researchers remained largely confined to their own domains, with cross-talk happening only occasionally at core AI conferences.
Deep learning is the natural evolution of machine learning, driven largely by the ability to collect more data and by advances in computing. Instead of relying on feature engineering for a specific domain, deep learning attempts to find patterns by crunching numbers over large volumes of data and their associated labels. Deep learning models these pattern recognition problems using layers and layers of neurons. The more layers, the deeper the network and the greater its ability to model complex non-linear patterns.
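To make the “layers of neurons” idea concrete, here is a minimal sketch (illustrative only, not any production model): a tiny two-layer network trained with plain NumPy on the classic XOR pattern, which no single linear layer can capture. Stacking a non-linear hidden layer before the output is what gives the network its extra modeling power.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: the output is 1 exactly when the two inputs differ.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

# Layer weights: 2 inputs -> 8 hidden units -> 1 output.
W1 = rng.normal(0, 1, (2, 8))
b1 = np.zeros((1, 8))
W2 = rng.normal(0, 1, (8, 1))
b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for step in range(10000):
    # Forward pass: each layer transforms the previous layer's output.
    h = np.tanh(X @ W1 + b1)      # hidden layer (non-linear)
    p = sigmoid(h @ W2 + b2)      # output layer (probability of "1")

    # Backward pass: squared-error gradients for each weight.
    dz2 = (p - y) * p * (1 - p)
    dW2 = h.T @ dz2
    db2 = dz2.sum(axis=0, keepdims=True)
    dh = dz2 @ W2.T * (1 - h ** 2)
    dW1 = X.T @ dh
    db1 = dh.sum(axis=0, keepdims=True)

    # Gradient descent update (in place).
    for param, grad in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        param -= lr * grad

h = np.tanh(X @ W1 + b1)
pred = sigmoid(h @ W2 + b2)
print(np.round(pred).ravel())
```

A single linear layer fit to the same data cannot separate XOR at all; adding one non-linear hidden layer solves it, which is the whole point of going deep.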
At its core, deep learning has three key elements: models, data, and compute.
In order to apply deep learning to any pattern recognition task, we need a good architecture, also called a model, describing the neurons and how they are connected. Over the last 10 years, there have been constant innovations in models, and we have reached a point where architectures are more or less standardized and, to some extent, the choice of model is not that important overall. Moreover, with transformer models that use self-attention, we have reached a juncture where models across different domains (audio/video/text) share a more or less standardized architecture as well.
Perhaps one of the biggest bottlenecks in applying deep learning comes from the data side. Not only do we need large amounts of training data for each task, but for most use cases this data has to be annotated to guide the tuning of the neurons with the right feedback. Recent work in self-supervised learning tries to get by with minimal annotation, but significant performance gaps remain between supervised and self-supervised methods. This requirement for data and annotation is perhaps the biggest roadblock to rapid adoption of deep learning, and as a result, the data collection and annotation sector is seeing massive growth, with various startups providing these services.
Finally, on the computing side, even though training these models is a one-time compute cost, companies have to budget for these resources, and the costs can spiral quickly. For instance, according to some estimates, GPT-3, a powerful attention-based language model, costs tens of millions of dollars to train and produces CO2 emissions equivalent to 100 cars being driven for a year. Many companies are trying to bring down this footprint, for both training and inference, by developing specialized hardware accelerators for neural networks. That makes this yet another sector undergoing a massive growth spurt!
Deep learning can be applied to all sorts of pattern recognition tasks as long as the three requirements above are met. Common applications include language modeling for chatbots, search, question answering, and language translation, to name a few.
Deep learning has found extensive applications in the domain of computer vision. For example, self-driving vehicles apply deep learning to detect road signs, other vehicles, and pedestrians and, in general, to semantically understand the world around the vehicle using cameras and lidars. Deep learning models have also found extensive applications in automating agriculture, such as identifying weeds or ripe crops, as well as in understanding satellite imagery. In the field of radiology, they have become an indispensable tool for identifying anomalies.
Other applications of deep learning include speech processing and transcription, as well as more niche applications such as protein folding and game playing.
Since deep learning relies heavily on the availability of large amounts of training data, that reliance is also its main limitation. Collecting and annotating large amounts of data is resource- and time-intensive.
In general, deep learning works only when the data seen in real life (inference mode) follows the same distribution as the data seen during training. This means that deep learning does not fare well on edge cases that fall outside the original distribution. Any time we face a long-tailed distribution, deep learning will perform poorly on the rare cases in the tail. This is perhaps the single biggest reason why deep learning hasn’t reached the level of reliability needed for fully autonomous driving.
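The distribution-mismatch problem can be seen even in the simplest possible setting. Here is a minimal sketch using synthetic data (hypothetical, not from any real deployment): a model fit on inputs from one range performs well on new inputs from that same range, but degrades sharply on inputs outside it.

```python
import numpy as np

rng = np.random.default_rng(42)

def truth(x):
    # The "real world" pattern the model is trying to learn.
    return np.sin(x)

# Training data drawn only from [0, 2] -- the head of the distribution.
x_train = rng.uniform(0.0, 2.0, 200)
y_train = truth(x_train) + rng.normal(0, 0.05, x_train.size)

# Fit a cubic polynomial (a stand-in for any learned model).
coefs = np.polyfit(x_train, y_train, deg=3)

def mse(x):
    # Mean squared error of the fitted model against the true pattern.
    return float(np.mean((np.polyval(coefs, x) - truth(x)) ** 2))

x_in = np.linspace(0.0, 2.0, 100)    # same distribution as training
x_out = np.linspace(4.0, 6.0, 100)   # long-tail inputs never seen in training

print(f"in-distribution MSE:     {mse(x_in):.4f}")
print(f"out-of-distribution MSE: {mse(x_out):.4f}")
```

The model is excellent where its training data lived and wildly wrong outside it — which is exactly what happens, at much larger scale, when a deep network meets the skewed tail of a long-tailed distribution.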
Checkout-free retail has all the challenges that researchers are trying to solve for driverless vehicles without the associated risk that someone might be seriously injured if things don’t go as planned. This makes it the perfect market for the application of deep learning.
The main challenge in checkout-free retail is to figure out “who picked what?” This involves detecting, tracking, and identifying customers, recognizing their activities and interactions, and recognizing the products they interact with. All of this can be done very aptly using deep learning.
At Zippin, we strongly believe in the potential of deep learning and are actively pursuing its application in all aspects of checkout-free retail. In fact, these different sub-problems are ripe candidates for deep learning and will keep getting better as we collect and feed the right kind of data into each of these enormous number crunchers.
Hopefully, now that you have a better understanding of deep learning and how it works, you won’t be full of questions when you glance at the ceiling of a checkout-free store. Be sure to come back for the next post in this series exploring the technologies that power checkout-free shopping.