Recent advances in computer vision have led to the development of increasingly powerful, highly accurate object detection techniques. When applied to image segmentation, however, the performance of these approaches remains inadequate. Standard image segmentation techniques can produce a coarse mask over an object, but they don’t always capture the object’s finer details. Masks frequently extend past the boundaries of an object or miss higher-frequency details.
As described in the paper “Mask Transfiner for High-Quality Instance Segmentation,” researchers from ETH Zürich, HKUST, and Kuaishou Technology developed Mask Transfiner, a technique for performing high-quality image segmentation. The paper discusses the architecture of this model and evaluates it against existing state-of-the-art segmentation techniques. In both quantitative and qualitative analysis, Mask Transfiner consistently outperformed standard segmentation techniques.
The most difficult object regions to segment tend to be those that contain high-resolution features, such as the headlights on a car or the narrow branches of a tree. Mask Transfiner detects these difficult-to-segment sections, then performs a deeper analysis to refine those regions. By applying this targeted refinement only to error-prone areas, Mask Transfiner greatly improves final mask quality with little increase in computation or memory requirements.
First, Mask Transfiner identifies incoherent or error-prone regions. When segmenting an image, high-resolution information is lost because the object mask is predicted at a lower resolution than the image itself. This introduces error into the segmentation, since details finer than the mask's grid cannot be captured. This is demonstrated in Figure 1.
Figure 1. Visualization of incoherent regions based on a simulation of information loss during mask application. Source: “Mask Transfiner for High-Quality Instance Segmentation”; used with permission
The incoherent regions, where high-resolution information has been lost, need to be analyzed at a finer granularity to create higher-quality segmentations. These regions typically fall along the boundaries of the object or in areas containing high-frequency details.
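The idea behind incoherent regions can be made concrete with a short simulation, in the spirit of Figure 1: downsample a high-resolution mask to a coarse grid, upsample it back, and mark every pixel the coarse grid failed to reproduce. This is a hypothetical re-implementation of the concept, not the paper's code; the majority-vote downsampling and nearest-neighbor upsampling operators are our assumptions.

```python
import numpy as np

def incoherent_regions(mask: np.ndarray, factor: int = 4) -> np.ndarray:
    """Simulate the information lost when a high-resolution mask is
    represented on a grid that is `factor` times coarser.
    (Illustrative sketch; the exact operators are assumptions.)
    """
    h, w = mask.shape
    # Downsample: each coarse cell takes the majority vote of its block.
    coarse = mask.reshape(h // factor, factor,
                          w // factor, factor).mean(axis=(1, 3)) >= 0.5
    # Upsample back with nearest-neighbor interpolation.
    recon = np.repeat(np.repeat(coarse, factor, axis=0), factor, axis=1)
    # Incoherent pixels are those the coarse grid could not represent.
    return mask.astype(bool) ^ recon

# Example: a disc-shaped mask. Incoherence concentrates along the
# boundary, while the interior and background remain coherent.
yy, xx = np.mgrid[:64, :64]
disc = ((yy - 32) ** 2 + (xx - 32) ** 2) < 20 ** 2
inc = incoherent_regions(disc)
```

Running this on the disc mask marks only a thin band of pixels near the circle's edge, matching the intuition that detail is lost where the mask boundary cuts through coarse cells.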
To identify these regions, Mask Transfiner’s incoherent-region detector takes as input the initial object mask along with features at various scales. A convolutional network predicts which cells within the mask show the greatest loss of information. These areas are then upsampled, fused with high-resolution neighborhood features, and passed to another convolution layer, which again flags the cells with the greatest information loss. These are the cells that are refined in the stages that follow.
Once the incoherent regions are detected, the researchers build a quadtree from them. The incoherent cells of the coarsest mask form the quadtree’s root nodes, and at each descending level of the tree, the algorithm subdivides each incoherent cell into four quadrants. This breaks the incoherent regions into progressively finer subsections, giving Mask Transfiner the level of detail necessary for highly accurate segmentation.
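The subdivision step above can be sketched as follows. This is a simplified reading of the paper, not its implementation: we assume an incoherence map is available at each resolution level, make each incoherent cell of the coarsest map a root node, and split a node into the four quadrant cells of the next, twice-as-fine map, keeping a child only if it is still incoherent there.

```python
import numpy as np

def build_quadtree(incoherence_maps):
    """Build a quadtree over incoherent cells (illustrative sketch).

    incoherence_maps: list of boolean arrays, each level twice the
    resolution of the previous one.
    Returns a dict mapping node keys (level, y, x) -> list of child keys.
    """
    tree = {}
    coarse = incoherence_maps[0]
    # Root nodes: incoherent cells of the coarsest map.
    frontier = [(0, y, x) for y, x in zip(*np.nonzero(coarse))]
    for node in frontier:
        tree[node] = []
    while frontier:
        level, y, x = frontier.pop()
        if level + 1 >= len(incoherence_maps):
            continue  # leaf: no finer map to subdivide into
        finer = incoherence_maps[level + 1]
        # Each cell covers a 2x2 block of quadrant cells one level down.
        for dy in (0, 1):
            for dx in (0, 1):
                cy, cx = 2 * y + dy, 2 * x + dx
                if finer[cy, cx]:
                    child = (level + 1, cy, cx)
                    tree[(level, y, x)].append(child)
                    tree[child] = []
                    frontier.append(child)
    return tree

# Toy example: one incoherent root cell on a 2x2 coarse map; two of its
# four quadrants remain incoherent on the 4x4 map.
level0 = np.zeros((2, 2), bool); level0[0, 0] = True
level1 = np.zeros((4, 4), bool); level1[0, 0] = level1[0, 1] = True
tree = build_quadtree([level0, level1])
```

Because the tree only ever grows into cells that are still incoherent, the number of nodes stays small relative to the full pixel grid, which is where the efficiency claim comes from.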
After the researchers build the quadtree, Mask Transfiner refines the object mask by performing a finer analysis on the incoherent regions. The refinement network takes the quadtree as input and acts directly on its nodes, predicting object probabilities for each incoherent region at increasingly smaller scales.
This refinement network contains three primary components: a node encoder, a sequence encoder, and a pixel decoder. The node encoder builds the feature embedding for each incoherent point within the quadtree. The sequence encoder takes these feature embeddings and processes them across multiple quadtree levels, generating an output for each node. Finally, the pixel decoder decodes this output and predicts the final mask label for each pixel.
Because Mask Transfiner acts only on a small subset of image features—those included in the quadtree—it’s possible to perform this high-resolution analysis without large increases in computation or memory.
After developing Mask Transfiner, the researchers compared its performance against state-of-the-art techniques using benchmark datasets such as COCO, Cityscapes, and BDD100K. On all three datasets, Mask Transfiner outperformed standard techniques.
On the COCO dataset, for example, it outperformed RefineMask by 1.3 average precision (AP), BCNet by 0.9 AP, and QueryInst by 1.7 AP. On the Cityscapes dataset, Mask Transfiner excelled at improving precision along object boundaries, outperforming PointRend by 1.3 AP and BMask R-CNN by 2.3 AP.
This improvement in precision was also visible when qualitatively comparing the outputs of Mask Transfiner to the outputs of other techniques. These differences were especially noticeable when examining difficult-to-segment object regions, such as the giraffes’ legs in Figure 2. Other regions with large differences in performance are marked with boxes:
Figure 2. Segmentation of images in the COCO validation set, using Mask R-CNN, SOLQ, PointRend, and Mask Transfiner, respectively. Mask Transfiner provides better results in highly detailed regions compared to other techniques. Regions with large performance differences are marked with boxes. Source: “Mask Transfiner for High-Quality Instance Segmentation”; used with permission.
Mask Transfiner improves on state-of-the-art image segmentation techniques by providing high-quality object segmentation that captures the fine details of an object. Because of its highly efficient approach, Mask Transfiner produces higher-quality results at very little additional computational cost. The technique makes it possible to automatically detect the boundaries of objects with little manual correction needed.
Try Mask Transfiner for yourself; the code and trained models from the study are available here.