OpenAI’s CLIP model is another historic demonstration of the value of large-scale models, much like GPT-3 in the NLU domain. This one is not for language but for vision: specifically, image classification. Contrastive Language-Image Pre-training, more succinctly known as CLIP, demonstrates yet again how far both computer vision and natural language processing have come as a result of the transformer architecture.
As reported in the paper “Learning Transferable Visual Models from Natural Language Supervision,” the researchers compared vision transformers and ResNets as image encoders. Zero-shot CLIP and its “linear probe” variant solidly outperformed both ResNet50 and the previous state of the art, BiT-M and EfficientNet-NoisyStudent, by at least five percentage points in average classification accuracy. CLIP also outperformed numerous other highly accurate models with roughly a fourfold improvement in training efficiency.
Thus, if you want to build your own machine learning vision classifier without training, start with CLIP as a “rough and ready,” accurate zero-shot model and add a linear regression as your output layer. If you do have time to train on your own dataset, you can be confident you’re most likely outperforming every other state-of-the-art vision classifier. Building a competitive vision AI system just got a whole lot easier.
Zero-shot learning is a process in which a model is trained on an independent corpus of data but can still be evaluated on standard benchmark datasets without ever having “seen,” or been trained on, those canonical datasets.
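Mechanically, CLIP’s zero-shot prediction reduces to a nearest-neighbor search in a shared embedding space: embed the image, embed one text prompt per candidate label, and pick the label whose text embedding has the highest cosine similarity to the image embedding. Here is a minimal numpy sketch of that scoring step; the embeddings below are toy stand-ins (a real CLIP model produces 512-dimensional vectors from its image and text encoders):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels):
    """Pick the label whose text embedding is most similar to the image.

    image_emb: (d,) image feature vector
    text_embs: (n_labels, d) one text feature per candidate label
    """
    # L2-normalize so dot products equal cosine similarities
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                      # cosine similarity per label
    return labels[int(np.argmax(sims))], sims

# Toy stand-in embeddings (a real CLIP model would produce these)
labels = ["a photo of a cat", "a photo of a dog"]
text_embs = np.array([[1.0, 0.1, 0.0],
                      [0.0, 1.0, 0.2]])
image_emb = np.array([0.9, 0.2, 0.05])   # closest to the "cat" prompt

pred, sims = zero_shot_classify(image_emb, text_embs, labels)
print(pred)  # a photo of a cat
```

Because the label set is just a list of text prompts, swapping in a new classification task requires no retraining at all, only new prompts.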
In the case of CLIP, the model is trained on a dataset of 400 million images “crawled” from around the web, paired with text drawn from their metadata, such as HTML labels and filenames. What’s surprising about CLIP is that on a representative sample of 27 datasets, it performs better on most of them than standard ResNet-style models that were trained repeatedly on the benchmark datasets themselves.
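The “contrastive” part of the name refers to the training objective: for a batch of N image-text pairs, the model is pushed to maximize the similarity of the N matched pairs while minimizing the similarity of all mismatched combinations, via a symmetric cross-entropy over the pairwise similarity matrix. A numpy sketch of that loss, assuming pre-normalized embedding batches (the temperature value is illustrative):

```python
import numpy as np

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.

    Both inputs are (N, d) L2-normalized batches where row i of each
    is a matched image-text pair.
    """
    logits = image_embs @ text_embs.T / temperature  # (N, N) similarities

    def cross_entropy(l):  # mean CE with the diagonal as the target class
        l = l - l.max(axis=1, keepdims=True)         # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image->text and text->image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Toy batch: matched pairs are identical vectors, so the loss is small
rng = np.random.default_rng(0)
embs = rng.normal(size=(4, 8))
embs /= np.linalg.norm(embs, axis=1, keepdims=True)
print(clip_contrastive_loss(embs, embs))  # near zero: perfect alignment
```

Noisy web captions work as supervision here precisely because the objective only asks which caption goes with which image, not for an exact label.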
CLIP shows promise as a new standard baseline model; its accuracy typically only improves when the final output layer is replaced with a simple linear regression trained on a benchmark dataset such as Oxford Pets.
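That “linear probe” setup freezes CLIP’s image encoder and fits only a linear classifier on its output features. Assuming you have already extracted feature and label arrays, the probe itself is a one-line solve; here is a numpy sketch using ridge-regularized least squares on one-hot targets (the paper uses logistic regression, which would be a drop-in replacement):

```python
import numpy as np

def fit_linear_probe(features, labels, n_classes, l2=1e-3):
    """Ridge-regularized least squares to one-hot class targets.

    features: (N, d) frozen CLIP image features
    labels:   (N,) integer class ids
    Returns a (d, n_classes) weight matrix.
    """
    Y = np.eye(n_classes)[labels]                    # one-hot targets
    A = features.T @ features + l2 * np.eye(features.shape[1])
    return np.linalg.solve(A, features.T @ Y)

def predict(weights, features):
    return np.argmax(features @ weights, axis=1)

# Toy "frozen features": two well-separated clusters
rng = np.random.default_rng(1)
feats = np.vstack([rng.normal(0, 0.1, (20, 4)) + [1, 0, 0, 0],
                   rng.normal(0, 0.1, (20, 4)) + [0, 1, 0, 0]])
labels = np.array([0] * 20 + [1] * 20)
W = fit_linear_probe(feats, labels, n_classes=2)
print((predict(W, feats) == labels).mean())  # 1.0 on this toy data
```

Because only the small weight matrix is trained, the probe is cheap to fit even on modest hardware once features are cached.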
The authors (all OpenAI employees) determined that a simplified version of GPT-3 is the best fit for the text encoder, a full GPT-scale model being too large to train cheaply and efficiently, while vision transformers (ViTs) consistently outperform a standard ResNet-101 model as the image encoder.
One oddity of CLIP is that the text prompts it is queried with are often prefaced with “a photo of …,” “a type of …,” or even “a centered satellite photo of …”. And yet CLIP isn’t perfect: it performs worse on the standard MNIST dataset of handwritten digits than even a simple linear regression model, and more broadly the authors identify several typical conditions under which zero-shot CLIP fails.
Beyond traditional performance metrics, CLIP also absorbs certain biases commonly found in web images and their descriptions, although the authors claim that on bias benchmark datasets such as FairFace, CLIP does better than several alternatives. Still, images of people of certain ages, genders, and races are more likely to be misclassified into certain “animal” classes, and people of certain ages and races are more likely to be categorized as a “thief” or “criminal.”
That said, the authors encourage further research into transfer learning and training runs on specific datasets in order to reduce the biases inherent in CLIP.
If you’re curious to try CLIP yourself, check out the Colab notebook from OpenAI. Or, if you’d rather read the full paper, you can find it on arXiv. OpenAI also wrote a helpful summary blog post about the paper.
And stay tuned for ways in which you can query image data with natural language. In the coming weeks and months, look for updates to various dataset management products in which you can form your own natural language queries of image datasets.
The following engineers at Scale AI contributed to this report: Sasha Harrison, Albert Zhang, and Diego Ardila.