In my last post, I described deep learning as the heart of all modern AI advances. In part two, I will discuss the second core AI technology, sensor fusion, and its important role in enabling high accuracy for end customers and lowering the cost of deployment for retailers.
To introduce sensor fusion, an artificial intelligence term, I look to one of Mother Nature's wonderful creatures: the honey bee. Perhaps the most impressive of all social insects, this tiny creature plays a profound role in our ecology. It is able to navigate and localize itself while foraging for nectar at speeds up to 20 mph, an enviable feat for most micro unmanned aerial vehicle researchers! Bees do this through a multitude of sensors: two large compound eyes, three simple eyes (ocelli), two antennae, and a brain. The brain takes in all the data collected from these sources so the honey bee can sense color, motion, light, fragrance, and wind speed.
Simply stated, sensor fusion is the combining of information from multiple sensors, and it is prevalent in all forms of life. Sensor fusion is necessary because each individual sensor may be limited in its scope, noisy, or unreliable (including outright failure). By fusing information from different sensors, we hope to gather more precise and robust information about the world in a fail-safe manner.
Sensing for sensor fusion can be accomplished in one of three ways. The simplest form is when multiple sensors of the same kind are used in a non-overlapping way. This increases the overall amount of information available without affecting the quality of each measurement. In the bee's case, the two compound eyes are more or less non-overlapping and gather information from the left and right sides, allowing it to navigate and change direction.
Another way sensor fusion is accomplished is when these sensors overlap. This also allows us to gather more precise information from each sensor. For example, stereoscopic cameras enable better depth perception. In the case of the bees, the three ocelli, with their wide fields of view and triangular arrangement, help them better triangulate the direction of the sun.
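For readers who like to see the math, here is a minimal sketch of how two overlapping views sharpen a measurement, using the textbook stereo relation depth = focal length × baseline / disparity; the function name and the numbers are purely illustrative.

```python
def stereo_depth(focal_length_px, baseline_m, disparity_px):
    """Estimate the depth (in meters) of a point seen by two overlapping cameras.

    Uses the standard pinhole stereo relation depth = f * B / d, where f is the
    focal length in pixels, B is the distance between the cameras, and d is how
    far the point shifts horizontally between the left and right images.
    """
    if disparity_px <= 0:
        raise ValueError("The point must appear shifted between the two views")
    return focal_length_px * baseline_m / disparity_px

# Illustrative numbers only: a 700 px focal length, cameras 12 cm apart,
# and a feature that shifts 35 px between the two images.
print(stereo_depth(700, 0.12, 35))  # -> 2.4 meters away
```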
The two forms of sensing detailed above involve one kind of sensor (unimodal sensing), but require more of them to capture more data and/or to increase the precision of the data. The third way, multimodal sensing, involves fusing information from different modalities. For example, bees fuse input from the compound eyes, simple eyes, and antennae to find flower petals. Multimodal sensing is interesting because it allows one to gather information from modes that, by themselves, may be less reliable and imprecise. Thus the whole is greater than the sum of the parts.
Besides the sensing modalities, sensor fusion also has nuances depending on how and where the fusion happens, resulting in early-, mid-, or late-fusion AI algorithms. The sensor inputs feed into a central processor (the bee's brain) for processing. Early fusion combines the information as is, without any pre-processing, resulting in a tight integration of the sensors. In contrast, the late-fusion approach processes each sensory input independently, and the processed outputs of each mode are loosely combined to solve the task at hand (e.g., navigation for bees). While it is not entirely clear how the brains of most living creatures perform this fusion, research points to it happening somewhere in between (mid-fusion).
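The distinction is easier to see in code. Below is a toy sketch, with made-up feature vectors and stand-in models rather than any real algorithm, contrasting early fusion, which hands the raw concatenated inputs to a single model, with late fusion, which combines only the per-modality outputs.

```python
import numpy as np

# Hypothetical per-moment readings from two different modalities.
camera_features = np.array([0.9, 0.1, 0.4])   # e.g., visual features
antenna_features = np.array([0.2, 0.7])       # e.g., scent/wind features

def early_fusion(feat_a, feat_b, joint_model):
    """Early fusion: concatenate the raw features and let one model see everything."""
    return joint_model(np.concatenate([feat_a, feat_b]))

def late_fusion(feat_a, feat_b, model_a, model_b, weight_a=0.5):
    """Late fusion: each modality gets its own model; only their outputs are combined."""
    return weight_a * model_a(feat_a) + (1 - weight_a) * model_b(feat_b)

# Stand-in "models": any callable mapping features to a score would do here.
def visual_model(f):
    return float(f.mean())

def scent_model(f):
    return float(f.max())

def joint_model(f):
    return float(f.sum() / len(f))

print(early_fusion(camera_features, antenna_features, joint_model))
print(late_fusion(camera_features, antenna_features, visual_model, scent_model))
```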
Over the last few decades of my career in AI, I've been fortunate enough to work on a variety of use cases involving sensor fusion. During graduate school at the University of Maryland, we built a laboratory for capturing human motion with a hundred cameras looking inward from a 3D dome. We developed early-fusion algorithms to produce detailed 3D models of humans. Multi-camera sensor fusion is now prevalent in a variety of use cases, from video security to driverless vehicles.
Multimodal sensing is, of course, widely used in driverless vehicles, where cameras are combined with LIDAR (3D sensing with lasers) to provide robust person and vehicle detection. One of the more interesting applications of multimodal sensing I've worked on was in robotics, where we combined torque and pressure sensors with vision to perform fine-grained manipulation and grasping with a robotic arm. One task was to insert a key into a lock and open the door using the handle. Without the torque and pressure sensors, there is no way to do this task reliably.
Multimodal sensing with late fusion has been adopted widely in areas that need multi-factor authentication. High-security biometrics applications, for instance, that use voice, fingerprint, retinal scan, and face ID to authenticate a particular user benefit greatly from this approach, because combining these modes increases the overall accuracy of the system. Although checkout-free retail has less stringent accuracy requirements than these high-security biometric applications, multimodal sensor fusion can play a critical role.
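To get a feel for why combining modes raises overall accuracy, here is a back-of-the-envelope calculation with made-up error rates; it assumes the factors fail independently, which real systems only approximate.

```python
# Toy illustration: if each factor alone falsely accepts an impostor with some
# small probability, requiring every factor to agree multiplies those
# probabilities together (under the independence assumption noted above).
false_accept_rates = {
    "voice": 0.02,
    "fingerprint": 0.001,
    "face": 0.01,
}

combined = 1.0
for rate in false_accept_rates.values():
    combined *= rate

print(f"Combined false-accept rate: {combined:.0e}")  # 2e-07 with these toy numbers
```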
For checkout-free retail, sensor fusion can be applied both to identifying customers and to recognizing products. By placing multiple cameras throughout the store, customers can be identified and tracked robustly while still maintaining privacy. Fusing information from multiple cameras with overlapping fields of view allows us to localize these customers in the store accurately. The result is akin to a bird's-eye view of the store, with customers appearing as dots on a map. These cameras can also be used to recognize the products customers pick up.
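As a simple illustration (a sketch, not a description of any production system), once each camera projects its detection onto the floor plan, the per-camera estimates can be fused with something as simple as a confidence-weighted average to produce that single dot on the map.

```python
import numpy as np

def fuse_ground_positions(detections):
    """Fuse one shopper's position estimates from several overlapping cameras.

    `detections` is a list of (x_meters, y_meters, confidence) tuples, where each
    camera has already projected its detection onto the store floor plan. A
    confidence-weighted average yields a single dot on the bird's-eye map.
    """
    points = np.array([[x, y] for x, y, _ in detections])
    weights = np.array([c for _, _, c in detections])
    return (points * weights[:, None]).sum(axis=0) / weights.sum()

# Illustrative readings: three cameras see the same shopper near one aisle.
print(fuse_ground_positions([(4.1, 7.9, 0.9), (4.3, 8.2, 0.6), (3.9, 8.0, 0.8)]))
```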
Recognizing products is harder, because retail stores usually carry a variety of products ranging in size from Tic Tacs to suitcases. Beyond size, products vary by flavor (often with only slightly different packaging), organic vs. non-organic, made-to-order food and meat, and apparel that comes in various prints and sizes. Recognizing products across this gamut is not child's play. A further challenge comes from the fact that a shopper may hold a product in a way that obscures it from the camera's view, partially or, for small products, completely. This is technically known as occlusion.
Occlusion is perhaps the predominant limiting factor in recognizing products with high accuracy from cameras alone. Sensor fusion (when done right) can alleviate these occlusion issues and allow us to provide a highly accurate checkout-free experience.
Products that are larger than the human hand will still have portions visible from certain vantage points. For these larger products, fusing information from multiple cameras can provide the required accuracy boost. Of course, there is a lot more to this: one has to carefully select the best cameras, their precise placement, and the right fusion algorithm to benefit from multiple sensors. The cost of deployment, or capital expense, is an important consideration here as well; as the number of cameras increases, so does the compute required to process the video feeds.
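One simple way to picture this fusion (a sketch with invented product names and scores, not our deployed algorithm) is a visibility-weighted vote across cameras, where heavily occluded views count for less.

```python
def fuse_product_votes(per_camera_scores):
    """Combine product-recognition scores from cameras with different vantage points.

    `per_camera_scores` maps a camera id to (visibility, {product: score}), where
    visibility reflects how unoccluded the product is from that camera. Heavily
    occluded views contribute less to the final decision.
    """
    totals = {}
    for visibility, scores in per_camera_scores.values():
        for product, score in scores.items():
            totals[product] = totals.get(product, 0.0) + visibility * score
    return max(totals, key=totals.get)

# Illustrative scenario: the shopper's hand blocks camera B almost entirely.
votes = {
    "cam_A": (0.8, {"cereal_box": 0.7, "oat_milk": 0.3}),
    "cam_B": (0.1, {"cereal_box": 0.2, "oat_milk": 0.8}),
    "cam_C": (0.6, {"cereal_box": 0.6, "oat_milk": 0.4}),
}
print(fuse_product_votes(votes))  # -> "cereal_box"
```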
Products that are smaller than the human hand are, however, more challenging to recognize using cameras alone. Since these products can be completely covered by the hand, there is no way to determine the product and its quantity through cameras alone, and achieving high accuracy for them with (multiple) cameras as the only source is futile. This is where multimodal sensor fusion comes to the rescue! Here, one can take advantage of other sensing modalities developed over the years, including pressure sensors, load sensors, and range sensors. Recent technological advances have made these sensors widely available, highly accessible (thanks to IoT advances), reliable, and low-cost.
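As a rough sketch of how such a fusion could work (the catalog, weights, and tolerance below are all hypothetical), a shelf's weight change can disambiguate between the candidates the cameras shortlisted and even reveal the quantity taken.

```python
def resolve_small_item(camera_shortlist, weight_delta_g, catalog, tolerance_g=3.0):
    """Resolve a pick the cameras could not fully see, using a shelf weight change.

    `camera_shortlist` is the set of products the cameras believe might have been
    taken from that shelf section, `weight_delta_g` is the drop measured by the
    load sensor, and `catalog` maps each product to its unit weight in grams.
    Returns the (product, quantity) whose expected weight best matches the drop.
    """
    best = None
    for product in camera_shortlist:
        unit = catalog[product]
        quantity = max(1, round(weight_delta_g / unit))
        error = abs(weight_delta_g - quantity * unit)
        if error <= tolerance_g and (best is None or error < best[2]):
            best = (product, quantity, error)
    return (best[0], best[1]) if best else None

# Illustrative catalog and reading: a 98 g drop with two candidate products.
catalog = {"mint_tin": 49.0, "gum_pack": 31.0}
print(resolve_small_item({"mint_tin", "gum_pack"}, 98.0, catalog))  # -> ('mint_tin', 2)
```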
At Zippin, we strongly believe in the power of multimodal sensor fusion and have been pairing cameras with shelf sensors since our inception to provide the highest level of accuracy to the shopper, thereby winning their trust. We have carefully selected the sensors and designed these “smart shelves” to optimize for cost and ease of deployment. Our retail partners trust our process and design as we continue to innovate and support other use cases of sensor fusion for retail.
Our retail partners work as hard as the bee gathering nectar to provide access to high-quality products at their stores, and we are proud to support them with the best sensor-fusion tech. Zippin-powered stores are all over the U.S., in stadiums from Spectrum Center in Charlotte, NC to NRG Park in Houston. Visit one near you to experience the power of sensor fusion at its best.