Many natural language processing (NLP) models can understand language but know nothing about images. Vision models understand visual patterns, but only at the pixel level, with no language to describe what they see.
CLIP (Contrastive Language-Image Pre-training) is a neural network that connects images to text. The original model by OpenAI (January 2021) was trained on 400 million image-text pairs. It uses a technique called "contrastive learning" that embeds images and text in a common space, where the representations from the two modalities can be compared directly.
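To make that concrete, here is a minimal sketch of that shared embedding space, assuming the Hugging Face `transformers` implementation of CLIP and the public `openai/clip-vit-base-patch32` checkpoint (the image path and captions are just placeholders):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # hypothetical local image
captions = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    # Each modality goes through its own encoder...
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# ...but both land in the same space, so a plain cosine similarity
# tells you how well each caption matches the image.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = image_emb @ text_emb.T
print(similarity)  # higher score = better image-text match
```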
CLIP can perform zero-shot learning: it can recognize a concept it was never explicitly trained to classify (e.g., identifying a narwhal) just from a text description of the label. It can also do one-shot learning, where a single example of a new concept (e.g., a new font style) is enough for it to recognize that concept later.
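Here is a sketch of what zero-shot classification looks like in practice: the "classifier" is just a set of text prompts, so no task-specific training happens at all (same assumed checkpoint as above; the image file and labels are hypothetical):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# The candidate classes are written as natural-language prompts.
labels = ["a narwhal", "a dolphin", "a submarine"]
prompts = [f"a photo of {label}" for label in labels]

inputs = processor(
    text=prompts,
    images=Image.open("mystery_animal.jpg"),  # hypothetical image
    return_tensors="pt",
    padding=True,
)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-to-text similarity scores

probs = logits.softmax(dim=-1).squeeze()
for label, p in zip(labels, probs):
    print(f"{label}: {p.item():.2%}")
```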
CLIP has always been important in image models: Stable Diffusion uses a CLIP model, and Stable Diffusion XL uses two. But now it’s even more important as models become multi-modal. LLaVA, an open-source multi-modal LLM, uses an open-source version of CLIP, and I imagine DALLE-3/GPT-4 uses a more advanced internal version.
There are more specialized versions of CLIP, like MedCLIP (adapted to medical images and text). Fine-tuning CLIP is doable (but not as easy as you’d think) and could lead to interesting results. There’s OpenCLIP, which is an open-source implementation of OpenAI’s CLIP.
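Loading one of OpenCLIP's models looks roughly like this: a sketch assuming the `open_clip` package and one of its published LAION-trained checkpoints (the image file and captions are placeholders):

```python
import torch
from PIL import Image
import open_clip

# ViT-B-32 with LAION-2B weights is one of the commonly available configs.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # hypothetical image
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Scaled cosine similarities, softmaxed into per-caption probabilities.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)
```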