Computer vision researchers have made substantial progress on attribute prediction in recent years, but current state-of-the-art techniques are still far from perfect. Visual attribute prediction is challenging because an object’s attributes, such as its color, shape, state, or action, depend on several kinds of context at once.
Predicting them often requires a global understanding of the entire scene, but it can also depend on information near the object or within the object itself. Building a reliable attribute predictor that captures all of this context is difficult.
In our paper, “GlideNet: Global, Local and Intrinsic based Dense Embedding NETwork for Multi-category Attributes Prediction,” we, a team of research engineers at Scale AI and Pennsylvania State University, propose a novel attribute prediction architecture called GlideNet. We have since presented our findings at CVPR 2022, the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
We trained and evaluated our model on the Cityscapes Attributes Recognition (CAR) and Visual Attributes in the Wild (VAW) datasets, two challenging datasets that are commonly used for attribute prediction, and found that GlideNet outperformed existing state-of-the-art solutions. Here I’ll discuss how GlideNet predicts object attributes.
To predict object attributes, GlideNet starts by extracting features from a scene. It then passes those features to a feature composition step, which creates a single category embedding for each object. Finally, this category embedding is passed to an interpreter layer, which predicts the important attributes and removes those that are unnecessary. We’ll discuss this process in three stages—feature extraction, feature composition, and feature interpretation.
The feature extraction process includes three distinct feature extractors, each with its own purpose. The first is a global feature extractor that describes individual objects in the image, including their locations and category. The next is a local feature extractor that captures the area surrounding the identified objects, to describe attributes that are related to those objects.
Finally, the intrinsic feature extractor identifies information about the intrinsic attributes of the identified objects themselves. This extractor uses a convolution layer to estimate features in regions with low pixel counts.
With these three separate feature extractors, GlideNet can interpret both the global context of the scene and the features associated with more minor scene details.
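To make the three scopes concrete, here is a minimal sketch of how inputs at each scope could be prepared from one image. It is an illustration only, not GlideNet’s implementation: the helper names (`crop`, `expand_box`, `apply_mask`, `three_views`), the `margin` parameter, and the choice to use a padded crop for the local view and a masked crop for the intrinsic view are all assumptions made for this example.

```python
from typing import List, Tuple

Image = List[List[float]]        # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by the box."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def expand_box(box: Box, margin: int, height: int, width: int) -> Box:
    """Grow the box by `margin` pixels on each side, clipped to the image."""
    top, left, bottom, right = box
    return (max(0, top - margin), max(0, left - margin),
            min(height, bottom + margin), min(width, right + margin))


def apply_mask(image: Image, mask: List[List[int]]) -> Image:
    """Zero out every pixel where the binary mask is 0."""
    return [[p * m for p, m in zip(prow, mrow)]
            for prow, mrow in zip(image, mask)]


def three_views(image: Image, box: Box, mask: List[List[int]], margin: int = 1):
    """Hypothetical inputs for the global, local, and intrinsic extractors."""
    height, width = len(image), len(image[0])
    global_view = image                                               # whole scene
    local_view = crop(image, expand_box(box, margin, height, width))  # object plus surroundings
    intrinsic_view = apply_mask(crop(image, box), crop(mask, box))    # object pixels only
    return global_view, local_view, intrinsic_view
```

In a real model, each view would be fed to its own learned feature extractor; the sketch only shows how the three scopes differ in what they can see.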
Once the features have been extracted, the feature composition step combines the embeddings from each feature extractor. With this approach, GlideNet can learn from the information acquired from all three extraction types. To ensure that we’re including valuable information from all three feature extractors, we need a well-balanced composition approach.
To create this feature composition, we use a self-attention technique to learn the appropriate weights for each feature extractor. We use a combination of a binary mask and the self-learned category embedding to fuse these embeddings into a single object description, based on the object’s category. Finally, we use a gating mechanism to fine-tune each feature’s contributions, producing our final feature embedding.
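The idea of weighting and gating the three embeddings can be sketched in a few lines. This is a deliberately simplified stand-in, not the layer from the paper: it uses a softmax over dot-product scores against a category embedding instead of full self-attention, it leaves out the binary mask, and the names `compose` and `gate_logits` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def compose(
    embeddings: Sequence[Sequence[float]],  # one embedding per extractor
    category_embedding: Sequence[float],    # assumed to be learned per object category
    gate_logits: Sequence[float],           # assumed to be learned, one per dimension
) -> Tuple[List[float], List[float]]:
    """Fuse extractor embeddings into one vector (simplified illustration)."""
    # Score each extractor by its similarity to the category embedding.
    weights = softmax([dot(e, category_embedding) for e in embeddings])
    dim = len(category_embedding)
    # Weighted sum of the extractor embeddings.
    fused = [sum(w * e[i] for w, e in zip(weights, embeddings)) for i in range(dim)]
    # A sigmoid gate scales each dimension's contribution to between 0 and 1.
    gated = [sigmoid(g) * f for g, f in zip(gate_logits, fused)]
    return weights, gated
```

Because the weights come from a softmax, they always sum to one, so no single extractor can dominate unless its score is much larger than the others.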
Once we have the feature embedding, GlideNet’s work still isn’t complete. Now, we pass the embedding to the interpreter layer, which identifies the important features from that embedding. As shown in Figure 1, we use a multihead technique for this final stage, with an independent layer for each object category.
Figure 1. Interpreter structure for the CAR and VAW datasets. Source: “GlideNet: Global, Local and Intrinsic based Dense Embedding NETwork for Multi-category Attributes Prediction”
This multihead approach allows us to determine the importance of attributes based on object category, and it also accounts for different categories having different numbers of attributes. The interpreter predicts the important attributes for each category embedding and removes any unnecessary ones. In this way, the interpreter translates each embedding into meaningful object attributes.
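A per-category head can be pictured as a small lookup: the object’s category selects a head, and each head scores only its own list of attributes. The sketch below is a toy version under stated assumptions. The categories, attribute names, weights, and the 0.5 threshold are all made up for illustration, and each head is a single linear layer followed by a sigmoid, which may differ from the interpreter in the paper.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical heads: each category has its own attribute list (of a different
# length) and its own weights. All values here are invented for the example.
HEADS: Dict[str, dict] = {
    "vehicle": {
        "attributes": ["parked", "moving", "occluded"],
        "weights": [[2.0, 0.0], [-2.0, 0.0], [0.0, -1.0]],
        "bias": [0.0, 0.0, 0.0],
    },
    "traffic_light": {
        "attributes": ["red", "green"],
        "weights": [[0.0, 2.0], [0.0, -2.0]],
        "bias": [0.0, 0.0],
    },
}


def interpret(embedding: Sequence[float], category: str, threshold: float = 0.5) -> List[str]:
    """Return the attributes whose predicted probability reaches the threshold."""
    head = HEADS[category]
    kept = []
    for name, weights, bias in zip(head["attributes"], head["weights"], head["bias"]):
        logit = sum(w * x for w, x in zip(weights, embedding)) + bias
        probability = 1.0 / (1.0 + math.exp(-logit))
        if probability >= threshold:
            kept.append(name)
    return sorted(kept)
```

With this layout, adding a new category only means adding a new head with its own attribute list; the other heads are untouched.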
To evaluate our model’s performance for large-scale attribute prediction, we applied it to the VAW and CAR datasets. Both of these datasets provide challenging data for attribute prediction, so they allowed us to evaluate our model against existing state-of-the-art solutions. In Figure 2, we show GlideNet’s performance on both datasets compared to existing solutions.
Figure 2. GlideNet provided better results on the CAR and VAW datasets than other state-of-the-art solutions. Source: “GlideNet: Global, Local and Intrinsic based Dense Embedding NETwork for Multi-category Attributes Prediction”
On the VAW dataset, GlideNet demonstrated a 5% gain on the mean recall (mR) metric compared to the second-best solution. GlideNet showed an even greater improvement on the CAR dataset, displaying an 8% mR increase.
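For readers unfamiliar with the metric, the sketch below shows one common way to compute mean recall: take the recall of each attribute separately, then average across attributes. It assumes binary labels and already-thresholded predictions; the exact evaluation protocol used for CAR and VAW (thresholds, handling of unlabeled attributes) is described in the paper and may differ from this simplified version.

```python
from typing import List, Sequence


def mean_recall(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Average per-attribute recall over attributes with at least one positive label.

    Rows are objects, columns are attributes, and values are 0 or 1.
    """
    n_attributes = len(y_true[0])
    recalls: List[float] = []
    for j in range(n_attributes):
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        false_neg = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        if true_pos + false_neg > 0:  # skip attributes that never occur
            recalls.append(true_pos / (true_pos + false_neg))
    return sum(recalls) / len(recalls) if recalls else 0.0


# Toy example: attribute 0 is always recovered, attribute 1 only half the time.
labels = [[1, 0], [1, 1], [0, 1]]
predictions = [[1, 0], [1, 1], [0, 0]]
print(mean_recall(labels, predictions))  # 0.75
```

Averaging per attribute, rather than over all labels at once, keeps rare attributes from being drowned out by frequent ones.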
We found that GlideNet was especially successful at predicting attributes in regions with low pixel counts, thanks to our inclusion of the intrinsic feature extractor. Similarly, the global feature extractor helped GlideNet perform well on attributes that require an understanding of global context.
GlideNet provides a solution for visual attribute prediction that improves on existing state-of-the-art models. It accomplishes this by capturing both local and global context relating to objects in a scene, before interpreting the most relevant features. In our paper, we explored its performance on the CAR and VAW datasets, but this architecture can be adapted to other visual recognition problems as well!
To try GlideNet out for yourself, check out the code and models from our implementation. You can easily get started with our model using the publicly available CAR dataset. If you’d like to replicate our results, we’ve also released a supplementary document that contains even more details about our specific architecture. With the GlideNet architecture, you can improve your own attribute extraction techniques!