In my role as Chief Scientist at Zippin, I get asked a lot about the myriad cameras and sensors in checkout-free stores. If you’re amazed at the design and the amount of information they’re able to collect, you’re not alone. It’s a synergistic combination of three complex technologies: deep learning, sensor fusion, and computer vision. In this three-part blog series, we deep-dive into each of these core technologies and explore their role in enabling checkout-free retail.
Deep learning is at the heart of all modern AI advances, ranging from protein folding and speech understanding to medical imaging and driverless vehicles. Let’s take a look at the underlying principles behind this technology and explore its requirements, applications, and, most importantly, its limitations. In particular, we’ll examine how deep learning is already revolutionizing checkout-free brick-and-mortar food service and retail in a big way, with lots of untapped potential.
Simply stated, machine learning is the science of recognizing patterns. These patterns may come from everyday human activities such as reading, writing, hearing, and visual perception, or from more complex activities like playing a game of Go, understanding the human genome, or predicting protein structure.
I started my AI career in the late 1990s doing machine learning. Back in those days, we spent a major chunk of our research time designing and handcrafting distinguishing features that would help us recognize these patterns. Often this required lots of domain expertise and pondering over lots and lots of data. I still remember analyzing endless hours of drone video to solve the problem of detecting and tracking moving people and vehicles that are just a few tens of pixels across in a high-resolution image. The trick was to have an intuition for what kinds of features would help with the specific problem at hand and then tweak those features to get the best performance numbers.
As a result of the domain expertise needed, machine learning was pretty fragmented across domains such as natural language processing, speech understanding, and computer vision. The underlying theory was the same in each, but researchers remained largely confined to their own domains, with cross-talk happening only occasionally at core AI conferences.
Deep learning is the natural evolution of machine learning, driven largely by the ability to collect more data and by advances in computing. Instead of relying on feature engineering for a specific domain, deep learning attempts to find patterns by crunching numbers over large volumes of data and their associated labels. Deep learning models these pattern recognition problems using layers and layers of neurons. The more layers, the deeper the network and the greater its ability to model complex non-linear patterns.
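To make the “layers of neurons” idea concrete, here is a minimal sketch (illustrative only, not any production model): a tiny two-layer network trained with plain NumPy on the classic XOR pattern, which no single linear layer can capture. Stacking a non-linear hidden layer before the output is what gives the network its extra modeling power.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: the output is 1 exactly when the two inputs differ.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

# Layer weights: 2 inputs -> 8 hidden units -> 1 output.
W1 = rng.normal(0, 1, (2, 8))
b1 = np.zeros((1, 8))
W2 = rng.normal(0, 1, (8, 1))
b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for step in range(10000):
    # Forward pass: each layer transforms the previous layer's output.
    h = np.tanh(X @ W1 + b1)      # hidden layer (non-linear)
    p = sigmoid(h @ W2 + b2)      # output layer (probability of "1")

    # Backward pass: squared-error gradients for each weight.
    dz2 = (p - y) * p * (1 - p)
    dW2 = h.T @ dz2
    db2 = dz2.sum(axis=0, keepdims=True)
    dh = dz2 @ W2.T * (1 - h ** 2)
    dW1 = X.T @ dh
    db1 = dh.sum(axis=0, keepdims=True)

    # Gradient descent update (in place).
    for param, grad in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        param -= lr * grad

h = np.tanh(X @ W1 + b1)
pred = sigmoid(h @ W2 + b2)
print(np.round(pred).ravel())
```

A single linear layer fit to the same data cannot separate XOR at all; adding one non-linear hidden layer solves it, which is the whole point of going deep.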
At its core, deep learning has three key elements: models, data, and compute.
In order to apply deep learning to any pattern recognition task, we need a good architecture, also called a model, describing the neurons and how they are connected. Over the last 10 years, there have been constant innovations in models, and we have reached a point where architectures are more or less standardized and, to some extent, the choice of model is not that important overall. Moreover, with transformer models that use self-attention, we have reached a juncture where models across different domains (audio/video/text) share a more or less standardized architecture as well.
Perhaps one of the biggest bottlenecks in applying deep learning comes from the data side. Not only do we need large amounts of training data for each task, but for most use cases this data has to be annotated to guide the tuning of the neurons with the right feedback. Recent work in self-supervised learning tries to get by with minimal annotation, but significant performance gaps remain between supervised and self-supervised methods. This requirement for data and annotation is perhaps the biggest roadblock to rapid adoption of deep learning, and as a result, the data collection and annotation sector is seeing massive growth, with various startups providing these services.
Finally, on the computing side, even though training these models is a one-time compute cost, companies have to budget for these resources, and the costs can spiral quickly. For instance, according to some estimates, GPT-3, a powerful attention-based language model, costs tens of millions of dollars to train and produces CO2 emissions equivalent to 100 cars being driven for a year. Many companies are trying to bring down this footprint, for both training and inference, by developing specialized hardware accelerators for neural networks. That makes this yet another sector undergoing a massive growth spurt!
Deep learning can be applied to all sorts of pattern recognition tasks as long as the three requirements above are met. Common applications include language modeling for chatbots, search, question answering, and language translation, to name a few.
Deep learning has found extensive applications in the domain of computer vision. For example, self-driving vehicles apply deep learning to detect road signs, other vehicles, and pedestrians and, in general, to semantically understand the world around the vehicle using cameras and lidars. Deep learning models have also found extensive applications in automating agriculture, such as identifying weeds or ripe crops, as well as in understanding satellite imagery. In the field of radiology, they have become an indispensable tool for identifying anomalies.
Other applications of deep learning include speech processing and transcription, as well as more niche applications such as protein folding and game playing.
Since deep learning relies heavily on the availability of large amounts of training data, that reliance is also its main limitation. Collecting and annotating large amounts of data is resource- and time-intensive.
In general, deep learning works only when the data seen in real life (inference mode) follows the same distribution as the data seen during training. This means that deep learning does not fare well on edge cases that fall outside the original distribution. Any time we face a long-tailed distribution, deep learning will perform poorly on the rare cases in the tail. This is perhaps the single biggest reason why deep learning hasn’t reached the level of reliability needed for fully autonomous driving.
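The distribution-mismatch problem can be seen even in the simplest possible setting. Here is a minimal sketch using synthetic data (hypothetical, not from any real deployment): a model fit on inputs from one range performs well on new inputs from that same range, but degrades sharply on inputs outside it.

```python
import numpy as np

rng = np.random.default_rng(42)

def truth(x):
    # The "real world" pattern the model is trying to learn.
    return np.sin(x)

# Training data drawn only from [0, 2] -- the head of the distribution.
x_train = rng.uniform(0.0, 2.0, 200)
y_train = truth(x_train) + rng.normal(0, 0.05, x_train.size)

# Fit a cubic polynomial (a stand-in for any learned model).
coefs = np.polyfit(x_train, y_train, deg=3)

def mse(x):
    # Mean squared error of the fitted model against the true pattern.
    return float(np.mean((np.polyval(coefs, x) - truth(x)) ** 2))

x_in = np.linspace(0.0, 2.0, 100)    # same distribution as training
x_out = np.linspace(4.0, 6.0, 100)   # long-tail inputs never seen in training

print(f"in-distribution MSE:     {mse(x_in):.4f}")
print(f"out-of-distribution MSE: {mse(x_out):.4f}")
```

The model is excellent where its training data lived and wildly wrong outside it — which is exactly what happens, at much larger scale, when a deep network meets the skewed tail of a long-tailed distribution.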
Checkout-free retail has all the challenges that researchers are trying to solve for driverless vehicles without the associated risk that someone might be seriously injured if things don’t go as planned. This makes it the perfect market for the application of deep learning.
The main challenge in checkout-free retail is to figure out “who picked what?” This involves detecting, tracking, and identifying customers, recognizing their activities and interactions, and recognizing the products they interact with. All of this can be done very aptly using deep learning.
At Zippin, we strongly believe in the potential of deep learning and are actively pursuing its application in all aspects of checkout-free retail. In fact, these different sub-problems are ripe candidates for deep learning and will keep getting better as we collect and feed the right kind of data into each of these enormous number crunchers.
Hopefully, now that you have a better understanding of deep learning and how it works, you won’t be full of questions when you glance at the ceiling of a checkout-free store. Be sure to come back for the next post in this series exploring the technologies that power checkout-free shopping.