Data labeling is the process of structuring data in a way that an algorithm can interpret. Labels for computer vision should correspond to what the model is intended to predict, and the quality of those labels is directly linked to how well a model can perform. So it’s vital that your machine learning datasets include high-quality labeling.
Here’s what you need to know about the types of data annotation available for computer vision, the most common annotation techniques, and how data for computer vision is typically annotated.
The majority of data annotation types fall into four groups: categorization, segmentation, sequencing, and mapping.
Categorization is one of the most common forms of data annotation. With this approach, each item in a dataset is given one or more category labels. These labels can be either binary choices or multi-class labels. For binary categorization, the label might note, for instance, whether or not a picture contains a vehicle. A picture containing a vehicle would have one label, while a picture not containing a vehicle would have another. A multi-class example, on the other hand, might note whether a picture contains a car, a truck, or a motorcycle. Categorization can also support taxonomy hierarchies, where each piece of data is associated with multiple labels of increasing specificity. If a piece of data were labeled as representing a vehicle, it could also be labeled with the type of vehicle and that vehicle’s make and model.
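These label structures can be sketched as simple records. The following is a minimal illustration assuming a dict-based format; the field names, file names, and the specific make and model are invented for the example.

```python
# Binary categorization: one label noting whether a vehicle is present.
binary_label = {"image": "img_001.jpg", "contains_vehicle": True}

# Multi-class categorization: one label from a fixed set of classes.
multiclass_label = {"image": "img_002.jpg", "vehicle_type": "truck"}

# Taxonomy hierarchy: labels of increasing specificity for the same item,
# ordered from most general to most specific.
hierarchical_label = {
    "image": "img_003.jpg",
    "labels": ["vehicle", "car", "Toyota", "Corolla"],
}

print(hierarchical_label["labels"][0])   # most general label
print(hierarchical_label["labels"][-1])  # most specific label
```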
With segmentation labeling, data is broken into segments. This approach can be applied to a variety of types of data. When working with text, a paragraph might be broken into sentences or into pieces that express different ideas. With image data, segmentation identifies which pixels in an image belong to a specific object or object class. For example, in a medical scan, segmentation might separately label different organs.
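A pixel-level segmentation label can be sketched as a mask: a grid the same shape as the image, with a class ID per pixel. This is a minimal, stdlib-only illustration; the class IDs (0 for background, 1 and 2 for two organs) are invented for the example.

```python
# A tiny 3x4 segmentation mask: each entry is the class of that pixel.
# 0 = background, 1 = organ A, 2 = organ B (illustrative classes).
mask = [
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [2, 2, 0, 0],
]

def pixels_of_class(mask, class_id):
    """Return (row, col) coordinates of every pixel with the given class."""
    return [(r, c) for r, row in enumerate(mask)
                   for c, value in enumerate(row) if value == class_id]

print(pixels_of_class(mask, 2))  # [(2, 0), (2, 1)]
```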
Sequencing describes the progression of items in a series of data. This approach is especially common when using time series modeling to predict future events. A series of human portraits might be labeled with the time when they were taken, for example, and then used to predict a future portrait.
Sequencing can also be applied to any type of data that has a beginning and an end. When analyzing text, sequencing might be used to predict what comes next in a sentence.
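A sequencing label, in the simplest case, is just a position or timestamp attached to each item so the series can be put in order. A sketch following the portrait example above, with invented file names and years:

```python
# Items labeled with when they were taken; the label defines their order.
portraits = [
    {"image": "portrait_b.jpg", "taken": 2015},
    {"image": "portrait_a.jpg", "taken": 2005},
    {"image": "portrait_c.jpg", "taken": 2020},
]

# Sorting by the sequencing label recovers the series order.
ordered = sorted(portraits, key=lambda item: item["taken"])
print([p["image"] for p in ordered])
# ['portrait_a.jpg', 'portrait_b.jpg', 'portrait_c.jpg']
```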
Mapping sets up labels that map one piece of data to another. This labeling technique is common for language-to-language translation, where a word in one language is mapped to a similar word in another language.
It is also used to identify associations between data. For example, digital images might be mapped to other images with similar themes.
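Mapping labels can be sketched as key-value pairs linking one piece of data to another. The word pairs and image IDs below are invented for the example.

```python
# Language-to-language mapping: each source word paired with a counterpart.
translation_map = {"cat": "gato", "dog": "perro", "bird": "pájaro"}

# The same idea for image associations: each image ID mapped to the IDs of
# images with similar themes.
similar_images = {"beach_01.jpg": ["beach_07.jpg", "coast_02.jpg"]}

print(translation_map["dog"])           # perro
print(similar_images["beach_01.jpg"])   # ['beach_07.jpg', 'coast_02.jpg']
```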
Any data that can be stored as a collection of pixels or that is derived from collections of pixels can be used for computer vision. The most common forms include 2D and 3D images and video files, but any visual representation of data can be used for computer vision. While any data in this format can be used for unsupervised learning, labeled data is required to perform supervised learning techniques.
The types of image annotation used with computer vision are extensions of the general types of data annotation described above. Common techniques for computer vision include image classification, object detection, segmentation, and spatial annotation.
With image classification, an image is reduced to one or more labels. These labels can be binary or multi-class. With a binary classification label, an image might be labeled as either “cat” or “not cat.” A multi-class scheme, on the other hand, might label an image as “cat,” “dog,” or “bird.”
Image classification can also support taxonomy hierarchies, where each image has multiple labels of varying specificity. For instance, at the highest level, a Fuji apple might be labeled as grocery. As the labels grow more specific, it could also have the labels of produce, fruit, apple, and Fuji apple.
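One way such a hierarchy can be stored is as a child-to-parent mapping, from which the full label set for an image can be expanded. This is a sketch using the Fuji apple example from the text; the data structure itself is an assumption, not a standard format.

```python
# Taxonomy stored as child -> parent relations.
parent = {
    "Fuji apple": "apple",
    "apple": "fruit",
    "fruit": "produce",
    "produce": "grocery",
}

def full_hierarchy(label):
    """Expand a leaf label into all of its labels, most specific first."""
    labels = [label]
    while labels[-1] in parent:
        labels.append(parent[labels[-1]])
    return labels

print(full_hierarchy("Fuji apple"))
# ['Fuji apple', 'apple', 'fruit', 'produce', 'grocery']
```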
Whereas image classification identifies only whether an image contains an object, object detection identifies the presence and location of each instance of that object. An image containing multiple cats, for example, would be labeled with the location of each cat.
Segmentation partitions images into pieces. Segmentation requires pixel-level annotations, where each pixel is labeled with a given item. The two main types of segmentation used in computer vision are semantic segmentation and instance segmentation.
Semantic segmentation separates an image into object classes without identifying specific objects. If semantic segmentation were used to identify cars and buildings, for example, the algorithm would identify all the pixels that corresponded to cars and buildings. Each pixel that corresponded to a car would be assigned the same label, while each pixel that corresponded to a building would be assigned another. The remaining pixels would be labeled as neither car nor building.
Rather than grouping objects by class, instance segmentation identifies individual instances of an object. If instance segmentation were used to identify a row of cars, for example, each identified car would be given its own individual label. Each pixel that corresponded to a specific car would be assigned that car’s label. The remaining pixels would be labeled as not a car.
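The difference between the two can be sketched over a single row of pixels. This is an illustrative toy example, not a real mask format: semantic labels collapse both cars into one class, while instance labels keep them distinct.

```python
# The same five pixels labeled two ways ("bg" = background).
semantic_row = ["car", "car", "bg", "car", "car"]          # class per pixel
instance_row = ["car_1", "car_1", "bg", "car_2", "car_2"]  # instance per pixel

# Semantic segmentation groups all car pixels under one label...
car_pixels = semantic_row.count("car")

# ...while instance segmentation distinguishes the individual cars.
distinct_cars = len({p for p in instance_row if p != "bg"})

print(car_pixels, distinct_cars)  # 4 2
```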
Spatial annotation labels an object’s location in 3D space. This type of annotation can be accomplished using a mixture of automated and manual approaches. Sensor technologies such as lidar, for example, use a mixture of lasers, scanners, and GPS receivers to calculate the distance to an object. Spatial annotation can also be performed manually by fusing 3D scenes with their corresponding 2D images.
After deciding on a suitable annotation type for a computer vision project, you can choose from several annotation techniques for recording those labels.
Bounding boxes are simply drawn around the target object. When creating a model that identifies cars, for example, you’d draw a box around each car in each image. By doing so, you tell the model what parts of the image to look at.
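A recorded bounding box is typically a set of corner coordinates. The sketch below assumes the common `(x_min, y_min, x_max, y_max)` convention, though exact formats vary between annotation tools, and the file name and values are invented.

```python
# One bounding-box annotation on one image.
annotation = {
    "image": "street_04.jpg",
    "label": "car",
    "box": (40, 60, 180, 140),  # (x_min, y_min, x_max, y_max) in pixels
}

def box_area(box):
    """Area of an axis-aligned box in square pixels."""
    x_min, y_min, x_max, y_max = box
    return (x_max - x_min) * (y_max - y_min)

print(box_area(annotation["box"]))  # 11200
```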
Polygonal segmentation works similarly to bounding boxes but allows for more precise identification of objects. Because polygons are complex shapes, they can be used to more accurately identify the boundaries of an object. This approach is useful for objects with complex shapes that don’t map neatly to a box, since it eliminates unrelated pixels from the label. For example, polygonal segmentation can be useful for identifying complex company logos or symbols.
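A polygonal annotation is stored as an ordered list of vertices traced around the object. As a sketch, the shoelace formula computes the area such a polygon encloses; the triangle here is purely illustrative.

```python
# Vertices traced around an object, in order, as (x, y) pairs.
polygon = [(0, 0), (4, 0), (0, 3)]

def polygon_area(vertices):
    """Shoelace formula: area enclosed by an ordered vertex list."""
    area = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

print(polygon_area(polygon))  # 6.0
```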
Polylines mark continuous lines or edges. This annotation technique is primarily used to identify linear objects where only the edge or boundary matters. For example, this technique is used by autonomous vehicles to identify lines on the road or the edge of the sidewalk. This annotation can often be performed automatically using edge-detection algorithms.
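A polyline annotation is just an ordered list of points along the line. A small sketch, with invented coordinates standing in for a lane marking:

```python
import math

# A lane-marking polyline: ordered (x, y) points along the line.
lane_line = [(0, 0), (3, 4), (6, 8)]

def polyline_length(points):
    """Total length: sum of distances between consecutive points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))

print(polyline_length(lane_line))  # 10.0
```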
Landmarking identifies points of interest within an image. With this approach, each defined point of interest is manually marked with a dot. When trying to label human expressions, for example, one might mark the pupils and points along the edge of the mouth. This technique tends to be fairly error-prone, since it is difficult to mark the same points consistently across images. More general landmarking can also be performed with feature detection algorithms such as corner detection.
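Landmark labels can be sketched as named points, from which derived measurements are computed. The landmark names and coordinates below are invented for the example.

```python
import math

# Named facial landmarks as (x, y) pixel coordinates.
landmarks = {
    "left_pupil": (110, 95),
    "right_pupil": (170, 95),
    "mouth_left": (120, 160),
    "mouth_right": (160, 160),
}

# A derived measurement, such as the distance between the pupils, is one
# common use of landmark annotations.
ipd = math.dist(landmarks["left_pupil"], landmarks["right_pupil"])
print(ipd)  # 60.0
```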
Tracking plots an object’s movement over time, typically through subsequent frames in a video. This technique can be performed manually by identifying an object in each frame of the video but is often automated with tools that can identify and annotate objects in new frames. Automating this part of the process, however, requires robust object detection that can accurately identify the object in various frames.
Data for computer vision is typically annotated by humans. There are a few different methods for obtaining annotations performed by humans, including crowdsourcing, in-house annotation, and outsourcing. For certain computer vision problems, there are also automated solutions that can annotate data based on pre-trained models.
Because real-world data is messy, automated annotation and annotation by hand both come with a variety of difficulties. Since most categories come with a wide variety of edge cases, it can sometimes be difficult for a person to come up with an appropriate label, let alone an algorithm. Often, accurate labeling requires complex instructions for how to proceed.
To see how complicated data labeling can get in the real world, consider the simple example of defining a chair. A dining room chair or a desk chair would be easy to classify, but what about a bar stool, bean bag, or loveseat? Even a problem as seemingly simple as a chair requires a robust set of instructions for both humans and automated systems.
With crowdsourcing, a company creates a collection of tasks for workers to perform. For example, these tasks could each include some of the images that need to be labeled. Crowdsourcing can be performed through platforms where workers are paid small sums of money based on each task they complete. Crowdsourcing is an easy, low-cost way to get labeled data, but it can sometimes result in low-quality, inconsistent labeling.
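One common way to compensate for inconsistent crowdsourced labels is to collect several annotations per item and keep the majority vote. A minimal sketch, with invented image names and labels:

```python
from collections import Counter

# Several workers label each image; their answers may disagree.
crowd_labels = {
    "img_001.jpg": ["cat", "cat", "dog"],
    "img_002.jpg": ["bird", "bird", "bird"],
}

def majority_vote(labels):
    """Return the most frequent label among the collected annotations."""
    return Counter(labels).most_common(1)[0][0]

final = {img: majority_vote(votes) for img, votes in crowd_labels.items()}
print(final)  # {'img_001.jpg': 'cat', 'img_002.jpg': 'bird'}
```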
With in-house annotation, images are labeled by employees of the company itself. While this can work well for small datasets, labeling in-house is often an inefficient use of staff, since the task is usually assigned to employees who have more pressing things to do. Hiring a team of annotators typically provides better results and higher quality control, but as an in-house solution, this approach is often neither scalable nor cost-effective.
Another option is to outsource the data annotation to a company that focuses specifically on data labeling. With this approach, a company’s data is labeled by an external team of experts who focus on providing data annotation services with an emphasis on quality control.
While some outsourcing companies focus on providing their customers with labeling software to automate the annotation process, others pursue a more manual approach, where humans add labels to the dataset directly.
Other companies combine these two approaches by taking advantage of automated data annotation tools while still involving human experts in the annotation process. These companies might use tools like lidar annotation and sensor fusion, which provide automated annotations that can be easily verified by experts.
Computer vision is applicable to a wide variety of technologies, from self-driving cars to robotic bin picking to medical image analysis. Organizations that work with image data can often benefit from automated solutions built using computer vision technology. Business leaders and engineers should look for tasks within their company that can be simplified using computer vision solutions. From there, they can begin the process of building and labeling datasets for this technology.