AI solutions are proliferating, but many of them never reach consumers because the models behind them demand too much hardware. To scale our AI journey, we need solutions that are efficient, fast, and accurate enough to run on edge devices. This is where FOMO comes into the picture.

Object detection is a crucial aspect of computer vision that has been explored for many years. Deep learning and neural networks have revolutionized the field, enabling more accurate object detection. Deep learning-based architectures such as R-CNNs and their variants are now prevalent. However, feature-based methods like Haar Cascades, SIFT, SURF, and HOG still play a significant role in certain applications. The strengths and weaknesses of each method should be weighed when selecting an approach.

Object detection techniques have greatly benefited from Convolutional Neural Networks, but these networks typically require specialized hardware and substantial computational resources. TinyML has enabled deep learning on microcontrollers, making real-time multi-object detection possible on constrained devices. This breakthrough has opened up new possibilities for object detection applications, as deep learning models can now run directly on the devices that capture the images.

TinyML has made great strides in image classification, which predicts the presence of an object in an image. Object detection, however, requires identifying multiple objects and their bounding boxes, making it more complex and memory-intensive. Traditional object detection models processed each image multiple times, whereas newer models like YOLO use single-shot detection for near real-time results. Even so, these models still require large amounts of memory and large data sets, which makes it challenging to run them on small devices, and they struggle to detect small objects.

FOMO (Faster Objects, More Objects) is a concept that challenges the idea that all object-detection applications require high-precision output from deep learning models. It suggests that by balancing accuracy, speed, and memory, deep-learning models can be reduced to small sizes while remaining useful. One way this can be achieved is by predicting the object’s centre rather than detecting bounding boxes. Many object detection applications only require the location of objects in the frame, not their sizes, and detecting centroids is more compute-efficient than bounding box prediction while requiring less data.

[Figure: Structure of deep learning models for object detection]

FOMO changes the structure of deep learning models for object detection. Single-shot detectors use convolutional layers to extract features and fully-connected layers to predict bounding boxes. Successive convolutional layers detect increasingly complex features, from lines and corners to whole objects. Pooling layers reduce the output size and highlight important features. With more layers, feature maps can detect intricate things like faces.

Although an image classifier outputs a single label for the whole image (for example, “face” or “no face”), the underlying architecture is composed of convolutional layers, each of which creates a diffused lower-resolution image of the previous layer. In a standard image classification network, this locality decreases as you move deeper into the network: each unit’s receptive field grows and the spatial resolution shrinks until there is only one output. FOMO uses the same architecture but replaces the final layer with a per-region class probability map and a custom loss function that preserves locality in the final layer, resulting in a heatmap of object locations.
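To make the idea of a per-region class probability map concrete, here is a minimal sketch in plain Python. It assumes the network emits one vector of raw class scores per grid cell and that a softmax is applied to each cell independently; the class list, the 2x2 grid, and the score values are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical classes, borrowed from the lamp/plant example in this article.
CLASSES = ["background", "lamp", "plant"]

def softmax(logits):
    """Numerically stable softmax over one cell's raw class scores."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def class_probability_map(logit_grid):
    """Apply softmax to every cell of a rows x cols x classes grid."""
    return [[softmax(cell) for cell in row] for row in logit_grid]

def most_likely_class(probs):
    """Return the name of the highest-probability class for one cell."""
    return CLASSES[max(range(len(probs)), key=probs.__getitem__)]

# Made-up raw scores for a 2x2 output grid (not produced by a real model).
LOGITS = [
    [[4.0, 0.5, 0.2], [0.3, 3.5, 0.1]],
    [[3.8, 0.2, 0.4], [0.1, 0.2, 4.2]],
]

heatmap = class_probability_map(LOGITS)
for row in heatmap:
    print([most_likely_class(cell) for cell in row])
```

Each cell keeps its own probability distribution, which is what lets the output say where an object is rather than only whether one is present.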

[Figure: FOMO model object detection algorithm]

One limitation of using a heatmap is that each cell functions as an independent classifier. For instance, if the classes are “lamp,” “plant,” and “background,” each cell will be classified as exactly one of lamp, plant, or background. Consequently, objects whose centroids fall in the same cell cannot be detected separately.
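The sketch below shows one way such a heatmap could be turned into object counts, and why the limitation follows. It is an illustration under assumptions, not FOMO's actual post-processing: the 0.5 confidence threshold, the merging of adjacent same-class cells into one object, and the probability values are all hypothetical.

```python
def decode_heatmap(prob_map, class_names, threshold=0.5, background=0):
    """Give each cell its most likely class, or None for background or low confidence."""
    labels = []
    for row in prob_map:
        out_row = []
        for probs in row:
            best = max(range(len(probs)), key=probs.__getitem__)
            if best == background or probs[best] < threshold:
                out_row.append(None)
            else:
                out_row.append(class_names[best])
        labels.append(out_row)
    return labels

def count_objects(labels):
    """Count 4-connected groups of same-class cells, treating each group as one object."""
    rows, cols = len(labels), len(labels[0])
    seen = set()
    counts = {}
    for r in range(rows):
        for c in range(cols):
            name = labels[r][c]
            if name is None or (r, c) in seen:
                continue
            counts[name] = counts.get(name, 0) + 1
            stack = [(r, c)]
            seen.add((r, c))
            while stack:  # flood fill across neighbouring cells of the same class
                y, x = stack.pop()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and (ny, nx) not in seen and labels[ny][nx] == name):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
    return counts

CLASS_NAMES = ["background", "lamp", "plant"]

# Made-up per-cell probabilities for a 2x3 grid.
PROB_MAP = [
    [[0.10, 0.80, 0.10], [0.20, 0.70, 0.10], [0.90, 0.05, 0.05]],
    [[0.90, 0.05, 0.05], [0.80, 0.10, 0.10], [0.10, 0.10, 0.80]],
]

labels = decode_heatmap(PROB_MAP, CLASS_NAMES)
print(labels)
print(count_objects(labels))
```

Because `decode_heatmap` keeps only one class per cell, a lamp and a plant whose centroids land in the same cell can yield at most one detection, whichever class scores higher.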

Initial evaluation showed that although bounding boxes are a common output format for object detection models, they are not always necessary. In many cases, object size is not a concern because the camera is fixed and the objects have a consistent size, so all that is needed is each object's location and the overall count. The model has therefore been adapted to train on object centroids, which makes it easier to count closely located objects. Since the neural network architecture is convolutional, it naturally takes the area surrounding each centroid into account.
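As a rough illustration of training on centroids, the following sketch converts bounding-box annotations into a grid of per-cell labels. The `(x, y, width, height)` pixel format, the 96-pixel square image, the 12x12 grid, and the function names are assumptions made for this example rather than details taken from FOMO itself.

```python
def centroid_cell(box, image_size, grid_size):
    """Map a box (x, y, width, height) in pixels to the (row, col) cell holding its centre."""
    x, y, width, height = box
    centre_x = x + width / 2
    centre_y = y + height / 2
    cell_size = image_size / grid_size
    # Clamp so a centre lying exactly on the far edge still lands in the last cell.
    col = min(int(centre_x // cell_size), grid_size - 1)
    row = min(int(centre_y // cell_size), grid_size - 1)
    return row, col

def centroid_targets(annotations, image_size, grid_size, background="background"):
    """Build a grid_size x grid_size label grid with one labelled cell per object centroid."""
    grid = [[background] * grid_size for _ in range(grid_size)]
    for label, box in annotations:
        row, col = centroid_cell(box, image_size, grid_size)
        grid[row][col] = label  # a later object overwrites an earlier one in the same cell
    return grid

# Hypothetical annotations for a 96x96 image mapped onto a 12x12 grid (8-pixel cells).
ANNOTATIONS = [
    ("lamp", (10, 20, 12, 8)),
    ("plant", (60, 60, 20, 20)),
]

targets = centroid_targets(ANNOTATIONS, image_size=96, grid_size=12)
for label, box in ANNOTATIONS:
    print(label, centroid_cell(box, 96, 12))
```

Only the cell containing each centre is labelled, so box sizes drop out of the training target entirely and nearby objects remain distinct as long as their centres fall in different cells.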

In addition, FOMO can be used with any MobileNetV2 model, so you can select a model with a higher or lower alpha (the width multiplier that scales the number of filters in each layer) depending on the deployment requirements. Transfer learning is also possible, although base models must be trained specifically with FOMO in mind. This makes FOMO suitable for a wide range of hardware, from microcontrollers to gateways and GPUs.