Vision Language Models

📅 2025 · #computer-vision #deep-learning #multimodal

In the evolving landscape of AI, the distinction between "seeing" and "reading" is vanishing. Vision Language Models (VLMs) bridge this gap by mapping visual features into the same semantic space as text embeddings.

Terminology

Before diving into architectures, we must distinguish between different flavors of multimodal systems:

  • VLM: refers specifically to models that use a vision encoder (such as CLIP) to feed visual tokens into a language model.
  • MLLM: Multimodal Large Language Models that can handle interleaved data (images mixed with text).

Vision Encoders & Projection

Modern VLMs typically utilize a pre-trained Vision Transformer (ViT). However, the dimensions of visual tokens often don't match the LLM's hidden dimension. This requires a projection layer.

ViT Backbones

We typically use ViT-L/14 variants for the best trade-off between speed and accuracy.
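The "/14" in ViT-L/14 is the patch size: a square input is cut into 14×14-pixel patches, and each patch becomes one visual token. A quick sketch of the resulting token counts (the resolutions below are illustrative, not prescribed by any particular checkpoint):

```python
# Number of visual tokens a ViT-L/14 backbone produces for a square input.
def num_patches(image_size: int, patch_size: int = 14) -> int:
    assert image_size % patch_size == 0, "input must tile evenly into patches"
    return (image_size // patch_size) ** 2

print(num_patches(224))  # 16 x 16 = 256 patch tokens
print(num_patches(336))  # 24 x 24 = 576 patch tokens
```

This is why input resolution matters so much for VLM cost: the token count grows quadratically with image side length.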

Projection Layers

The projector can be a simple Linear layer or a C-Abstractor for better feature compression.
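In the simplest case, the projector is just a learned affine map from the vision encoder's token dimension to the LLM's hidden dimension. A minimal NumPy sketch, with illustrative dimensions (ViT-L emits 1024-d tokens; many open LLMs use a 4096-d hidden state):

```python
import numpy as np

# Sketch of a linear projector: ViT token dim -> LLM hidden dim.
# The weights here are random; in a real VLM they are trained.
rng = np.random.default_rng(0)
vision_dim, llm_dim = 1024, 4096
W = rng.standard_normal((vision_dim, llm_dim)) * 0.02
b = np.zeros(llm_dim)

visual_tokens = rng.standard_normal((2, 256, vision_dim))  # (batch, patches, dim)
projected = visual_tokens @ W + b
print(projected.shape)  # (2, 256, 4096): now the same width as text embeddings
```

A C-Abstractor additionally compresses along the patch axis, trading spatial detail for a shorter visual token sequence.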

The Contrastive Objective

Most VLMs are grounded in contrastive learning. The goal is to maximize the cosine similarity between an image embedding $v$ and its corresponding text embedding $l$.

The loss function often used is the InfoNCE loss. For a batch of $N$ image-text pairs, the loss for the $i$-th pair is defined as:

$$L_i = -\log \frac{\exp(\text{sim}(v_i, l_i) / \tau)}{\sum_{j=1}^N \exp(\text{sim}(v_i, l_j) / \tau)}$$

Where $\text{sim}(u, w) = \frac{u \cdot w}{\|u\| \|w\|}$ is the cosine similarity and $\tau$ is a learnable temperature parameter.

InfoNCE Loss

This ensures that positive pairs are pulled together while negative samples are pushed away in the embedding space.
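The formula above translates directly into a few lines of NumPy. The embeddings and the temperature value below are made up for illustration; only the loss computation itself follows the definition:

```python
import numpy as np

def info_nce(v, l, tau=0.07):
    # Normalize rows so the dot product equals cosine similarity.
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    l = l / np.linalg.norm(l, axis=1, keepdims=True)
    logits = (v @ l.T) / tau  # entry (i, j) is sim(v_i, l_j) / tau
    # -log softmax over texts, read off at the matching pair (the diagonal)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()

rng = np.random.default_rng(0)
v = rng.standard_normal((4, 8))            # 4 toy image embeddings
l = v + 0.1 * rng.standard_normal((4, 8))  # matching texts, slightly perturbed
print(info_nce(v, l))
```

Because each $v_i$ is scored against every $l_j$ in the batch, the other captions act as in-batch negatives for free; this is why contrastive pretraining benefits from very large batch sizes.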

Open VLMs

The open-source community has made massive strides. Models like PaliGemma and LLaVA have democratized access to high-performance multimodal reasoning.

Model      | Architecture   | Release Year
LLaVA 1.5  | Vicuna + CLIP  | 2023
PaliGemma  | SigLIP + Gemma | 2024

Video Modalities

Temporal Aggregation

For video, we don't just process one frame. We stack frames $F_1, F_2, ..., F_n$ and apply a temporal pooling layer to capture motion features.

Note: Video LLMs require significantly higher compute due to the $N \times \text{tokens}$ complexity.
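Mean pooling over time is one simple form of the temporal aggregation described above. A sketch with illustrative shapes ($n$ frames, each already encoded into per-patch tokens):

```python
import numpy as np

# Temporal mean pooling over stacked per-frame visual tokens F_1..F_n.
rng = np.random.default_rng(0)
n_frames, patches, dim = 8, 256, 1024
frame_tokens = rng.standard_normal((n_frames, patches, dim))

pooled = frame_tokens.mean(axis=0)  # collapse the time axis: (patches, dim)
print(pooled.shape)   # (256, 1024)

# Without pooling, the LLM would see n_frames * patches visual tokens:
print(n_frames * patches)  # 2048 tokens, vs. 256 after pooling
```

Pooling discards fine-grained motion ordering, which is why more elaborate schemes (e.g. temporal attention) are often preferred when motion matters; the token-count arithmetic above is the compute pressure the note refers to.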