An object-detection computer vision system identifies and locates multiple objects of interest in a digital image or video. It may be specialized to identify application-specific entities, such as people or faces in a people-counting application or cars in a traffic density-estimation system. Alternatively, it can be a general-purpose system that locates and labels different types of objects in a digital image.
Image classification is simply the labeling of an entire image. For example, you can train a computer vision system to differentiate between images that show elephants and images that show horses. The classification system does not identify the location of the object of interest. Localization, by contrast, refers to pinpointing where an object lies within an image.
In object detection, a single image generally contains many objects. The task is both to detect where the objects are located and to label them. Normally, the object detection system places rectangular bounding boxes around the detected objects.
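One way to picture a detector's output is as a list of labeled boxes, one per object found. The sketch below is a hypothetical representation: the `Detection` class, its field names, and the sample values are assumptions made for illustration and do not come from any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected object: a class label, a confidence score, and a box.

    Hypothetical structure for illustration. Coordinates are in pixels,
    with (x_min, y_min) the top-left and (x_max, y_max) the bottom-right
    corner of the bounding box.
    """
    label: str
    score: float
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def width(self) -> float:
        return self.x_max - self.x_min

    def height(self) -> float:
        return self.y_max - self.y_min

    def area(self) -> float:
        return self.width() * self.height()


# Classification would give a single label for the whole image;
# detection gives one labeled box per object. Values are made up.
detections = [
    Detection("elephant", 0.92, 40, 60, 300, 420),
    Detection("horse", 0.81, 320, 150, 500, 400),
]
```

Corner coordinates are only one convention. Some datasets and models store a box as a top-left corner plus width and height, or as a center point plus width and height.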
Object detection creates the foundation for many other computer vision applications. Here are a few examples.
Object detection can be used to track entities in videos; its applications include vehicle tracking, augmented reality, person tracking in video calls, and more.
Object counting uses object detection to estimate how many objects of a particular kind appear in a digital image. Examples include counting the number of people in a supermarket to locate highly visited areas of the store, counting vehicles to estimate traffic volumes, and counting products in manufacturing.
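Once a detector has produced labeled results, counting reduces to tallying them. The following is a minimal sketch, assuming the detector's output is available as `(label, score)` pairs; that input format, the `count_objects` name, and the 0.5 confidence threshold are all assumptions for illustration.

```python
from collections import Counter


def count_objects(detections, min_score=0.5):
    """Count detections per label, skipping low-confidence ones.

    `detections` is an iterable of (label, score) pairs. The default
    threshold of 0.5 is an arbitrary illustrative choice.
    """
    return Counter(label for label, score in detections if score >= min_score)


# Made-up detections for a single video frame in a store.
frame = [("person", 0.95), ("person", 0.88), ("cart", 0.70), ("person", 0.30)]
counts = count_objects(frame)
```

Here the third "person" detection falls below the threshold and is not counted. In practice the threshold would be tuned for the application, since a low value inflates counts with false positives and a high value misses real objects.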
Image retrieval systems use object detection to automatically identify and annotate images in a dataset.
Computer vision–based activity recognition systems use object detection algorithms for activity detection in videoconferencing, augmented reality systems, and more.
Object detection in computer vision systems has become substantially faster and remarkably more accurate, but challenges remain.
Developing datasets for object detection is tedious and costly. Object detection systems require not only annotated images, but also bounding rectangles around objects to indicate their location. Also, a large number of training examples of the same object is required to fully capture it from all angles and views under different lighting conditions.
Classifying objects correctly in unseen images (images the system has not encountered during training) is hard because these images can differ from the training images in basic ways. Differences in lighting, tones, and colors can lead to misclassification. Further, slight rotations, shifts, or changes in the scale of objects can reduce the accuracy of these systems.
All objects, even the simplest, come in different shapes and sizes. A table can be round, square, or rectangular and come in many different sizes, materials, and colors. It is a challenge to develop an automatic system that learns and generalizes from a limited number of specialized examples.
Small objects within large images present their own detection challenges. Isolating objects of interest and differentiating objects from an image background is a difficult task.
Time and memory constraints also present challenges to real-time object detection. As one example, all image processing algorithms must process a large number of pixels. A 512-by-512-pixel image has around 262K pixels, with each pixel having three color values that require processing.
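The arithmetic behind these figures is easy to check. The short sketch below works it through; the byte sizes (1 byte per value for 8-bit color, 4 bytes for 32-bit floats) and the 30 frames-per-second rate are assumptions added for illustration and are not figures from the text.

```python
def image_values(width, height, channels=3):
    """Number of color values in an image (pixels times channels)."""
    return width * height * channels


pixels = 512 * 512                  # 262,144 pixels ("around 262K")
values = image_values(512, 512)     # 786,432 color values per image
bytes_uint8 = values * 1            # 786,432 bytes at 1 byte per value
bytes_float32 = values * 4          # 3,145,728 bytes as 32-bit floats

# Assumed frame rate for a video stream; not a figure from the text.
FPS = 30
values_per_second = values * FPS    # 23,592,960 values to process per second
```

Even this modest image size means tens of millions of values per second for video, before any model computation is counted.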
Deep learning models, which are now widely used, also require large amounts of memory to store their parameters and take a long time to train on conventional computing machines.
Some of the many types of object detection methods are shown in the figure below.
Figure 1: A taxonomy of object detection methods. Source: Mehreen Saeed
Traditional methods mostly avoid neural networks and instead rely on feature extraction followed by a recognition stage.
While the Viola-Jones method was initially proposed as a general-purpose object detection algorithm, it is now known as one of the first methods to successfully detect faces within an image in real time. The method is based upon Haar features, which essentially use the observation that different facial features have a relative relationship to each other. For example, the nose lies below the eyes, and this region is brighter than the eyes.
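To make the brightness comparison concrete, here is a minimal sketch of a single two-rectangle Haar-like feature, computed with an integral image (a summed-area table, which Viola-Jones uses to evaluate rectangle sums quickly). The function names and the toy 4-by-4 patch are assumptions for illustration; the full method also selects features and chains classifiers, which this sketch omits.

```python
def integral_image(img):
    """Return a summed-area table with an extra leading row and column of zeros.

    ii[r][c] holds the sum of all pixels in img[0:r] and columns 0:c.
    """
    rows, cols = len(img), len(img[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle, using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_vertical(ii, top, left, height, width):
    """Haar-like feature: lower half minus upper half (height must be even)."""
    half = height // 2
    upper = rect_sum(ii, top, left, half, width)
    lower = rect_sum(ii, top + half, left, half, width)
    return lower - upper


# Toy patch: a dark upper band (like the eyes) above a brighter band.
patch = [
    [10, 10, 10, 10],
    [10, 10, 10, 10],
    [200, 200, 200, 200],
    [200, 200, 200, 200],
]
ii = integral_image(patch)
feature = two_rect_vertical(ii, 0, 0, 4, 4)  # 1600 - 80 = 1520
```

A large positive value indicates that the lower region is much brighter than the upper one, which is the kind of contrast pattern the detector looks for. Because each rectangle sum costs four lookups regardless of its size, many such features can be evaluated per window.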
There are object detection methods based on feature descriptors, including the histogram of oriented gradients (HOG) and the scale-invariant feature transform (SIFT). These features are extracted from reference images via different transformations and image computations and stored in a database, which an algorithm later uses to find candidate objects within unseen images.
With the advent of neural networks, researchers achieved new milestones in the area of object detection systems. Neural networks were inspired by the workings of the human brain and learn complex nonlinear mapping functions between inputs and outputs. Deep learning architectures are extensions of neural networks, with more layers. Here are just a few object detection methods based on deep learning.
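As an illustration of what such a descriptor computes, the sketch below builds the core ingredient of HOG: a histogram of gradient orientations for one small cell, weighted by gradient magnitude. It is a simplified sketch under stated assumptions (central-difference gradients, nine unsigned-orientation bins, border pixels skipped, no bin interpolation or block normalization), not a full HOG implementation.

```python
import math


def cell_histogram(img, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations.

    `img` is a 2D list of grayscale values for one cell. Orientations are
    folded into [0, 180) degrees and split into `bins` equal bins.
    """
    rows, cols = len(img), len(img[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = img[r][c + 1] - img[r][c - 1]   # horizontal change
            gy = img[r + 1][c] - img[r - 1][c]   # vertical change
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    return hist


# A vertical edge: intensity changes only from left to right, so every
# gradient points horizontally and all the weight lands in the first bin.
cell = [[0, 0, 100, 100] for _ in range(4)]
hist = cell_histogram(cell)
```

Concatenating such histograms over a grid of cells yields a vector that summarizes local edge structure, which is what a later recognition stage compares against stored reference features.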
Constructing a region proposal involves generating various regions within an image that may contain an object of interest. R-CNN has demonstrated success in this area. It consists of three steps or modules. The first one searches for different regions within the image that can be possible candidate objects. The second step employs a large convolutional neural network (CNN) that extracts features from these regions. As a final step, a support vector machine (SVM) predicts whether an object exists within that region and classifies it.
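The three steps can be sketched as a structural skeleton. Everything here is a stand-in: the sliding-window proposals are a naive substitute for the proposal method R-CNN actually used (selective search), and `extract_features` and `classify` are hypothetical callables where a CNN and an SVM would go.

```python
def sliding_window_proposals(img_w, img_h, box=64, stride=32):
    """Step 1 stand-in: return (x, y, w, h) candidate regions.

    A fixed-size sliding window is a naive substitute for a real
    region proposal method; box and stride values are arbitrary.
    """
    return [(x, y, box, box)
            for y in range(0, img_h - box + 1, stride)
            for x in range(0, img_w - box + 1, stride)]


def detect(image, img_w, img_h, extract_features, classify, min_score=0.5):
    """Propose regions, extract features per region, then classify each."""
    results = []
    for region in sliding_window_proposals(img_w, img_h):
        features = extract_features(image, region)   # step 2: a CNN in R-CNN
        label, score = classify(features)            # step 3: an SVM in R-CNN
        if label != "background" and score >= min_score:
            results.append((region, label, score))
    return results


# Dummy components so the skeleton runs end to end (hypothetical).
def fake_features(image, region):
    x, y, w, h = region
    return [x, y]


def fake_classifier(features):
    return ("car", 0.9) if features == [32, 32] else ("background", 0.99)


found = detect(None, 128, 128, fake_features, fake_classifier)
```

The skeleton shows why this family of methods is comparatively slow: the feature extractor runs once for every proposed region, and a real proposal stage can yield thousands of regions per image.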
There are many variations of R-CNN, including Fast R-CNN, Faster R-CNN, and feature pyramid networks.
As opposed to region proposal-based methods, single-stage neural networks use just a single network for both object localization and labeling. These methods are fast and efficient for real-time use.
YOLO (you only look once) is a fast algorithm that uses a single neural network to find regions within the image that may contain an object, predict bounding boxes around possible objects, and assign each box a probability value.
The single shot detector (SSD) also uses a single neural network to locate and identify objects in an image simultaneously. Its authors reported higher speed than the original YOLO.
Facebook’s detection transformer (DETR) was one of the first transformer-based methods to be applied successfully to object detection. As a first step, it employs a CNN for feature extraction. It then passes the extracted features through an encoder-decoder network that outputs both the bounding boxes of objects within an image and their corresponding class labels.
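To show how one network pass can yield both boxes and probabilities, the sketch below decodes a grid of predictions in the style of the original YOLO formulation, where the image is divided into a grid and each cell predicts a box, a confidence, and class probabilities. It is a simplified sketch: it assumes one box per cell, a made-up prediction layout, and an arbitrary 0.5 threshold, and it omits the post-processing that removes duplicate boxes. Later YOLO versions differ in their details.

```python
def decode_grid(pred, classes, img_w, img_h, min_score=0.5):
    """Turn per-cell predictions into labeled pixel-coordinate boxes.

    pred[row][col] = [x, y, w, h, conf, p_class0, p_class1, ...] where
    x, y are the box center's offsets inside the cell (0 to 1) and
    w, h are fractions of the whole image. This layout is an assumption.
    """
    s = len(pred)  # grid is s-by-s
    boxes = []
    for row in range(s):
        for col in range(s):
            x, y, w, h, conf, *probs = pred[row][col]
            best = max(range(len(probs)), key=probs.__getitem__)
            score = conf * probs[best]       # class-specific confidence
            if score < min_score:
                continue
            cx = (col + x) / s * img_w       # box center in pixels
            cy = (row + y) / s * img_h
            bw, bh = w * img_w, h * img_h    # box size in pixels
            boxes.append((classes[best], score,
                          cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2))
    return boxes


# A 2-by-2 grid over a 100-by-100 image; only one cell holds an object.
empty = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
pred = [[empty, empty],
        [empty, [0.5, 0.5, 0.5, 0.25, 0.8, 0.25, 0.75]]]
boxes = decode_grid(pred, ["cat", "dog"], 100, 100)
```

In this example the bottom-right cell produces a single "dog" box centered at pixel (75, 75), and the three empty cells are filtered out by the score threshold.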
Building a large database of annotated images is challenging. You can find large-scale, freely available datasets that were used in object detection challenges and for benchmarking and establishing different state-of-the-art methods. Here are a few datasets that you can download and use to train and test your system.
The Pascal VOC dataset has more than 11,000 images with more than 27,000 annotated objects and 6,900 segmentations.
The ImageNet dataset is a large database of annotated images, many of which include bounding boxes. This dataset was used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), whose classification benchmark has more than 1.2 million training images across 1,000 object classes.
The Microsoft Common Objects in Context (MS-COCO) dataset contains images that are not only annotated with bounding boxes but also segmented for precise location of objects within the image. This dataset has more than 330K images.
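When mixing datasets, note that they do not all store boxes the same way. COCO annotations store a box as `[x, y, width, height]` measured from the top-left corner, whereas Pascal VOC annotations store corner coordinates (`xmin`, `ymin`, `xmax`, `ymax`). The sketch below converts between the two; the function names and the sample annotation values are made up for illustration.

```python
def coco_to_corners(bbox):
    """Convert [x, y, width, height] to (x_min, y_min, x_max, y_max)."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


def corners_to_coco(box):
    """Convert (x_min, y_min, x_max, y_max) to [x, y, width, height]."""
    x_min, y_min, x_max, y_max = box
    return [x_min, y_min, x_max - x_min, y_max - y_min]


# A made-up annotation using COCO-style field names.
annotation = {"image_id": 1, "category_id": 18, "bbox": [100.0, 50.0, 80.0, 40.0]}
corners = coco_to_corners(annotation["bbox"])
```

Converting every box to one convention at load time avoids subtle errors later, since a width mistaken for an x-coordinate still looks like a plausible number.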
Research is taking object detection technology in several new directions. These include:
With weakly supervised learning systems, training is performed using so-called weak examples. These are noisy examples, which are either partially labeled, inaccurately labeled, or labeled using heuristics, and are not necessarily 100% accurate. Training a machine learning system for object detection and recognition requires a massive number of annotated images. Using weakly supervised learning can reduce the burden of manually annotating and labeling images.
Edge computing performs data computations at the source. One example is a mobile phone that captures a video and processes it locally in real time. While there has been tremendous progress in object detection technology, there is still room for further research on how to make it lightweight and efficient enough for mobile and edge devices.
Another area of research that’s moving forward is having an object detection system output a complete and detailed English description of objects in an image and their relationship to each other.
While there have been significant performance improvements in object detection technology, identifying small items or objects in a larger scene remains challenging. Designing systems that can recognize even the smallest details is another promising research area.
With the advent of deep learning methods and transformer architectures, object detection is faster and more accurate than ever before. Now that you understand the basics, you’re ready to decide how to proceed. You can develop your own object detection system, use an existing AI service for object detection, or tune an existing model for your application.