ViTs vs CNNs

Assil | Jan 17, 2024

In the field of computer vision, adapting Transformer models, originally designed for natural language processing, has opened new frontiers in image recognition.


  • 2017: The release of Attention Is All You Need established the Transformer as the dominant architecture in natural language processing.
  • 2020: The release of An Image is Worth 16x16 Words marked the adaptation of Transformers for image classification.
    • Goal of this paper: apply the standard Transformer, with as few modifications as possible, to computer vision.

Model Architecture

The adaptation involves the following key components:

Image Patching

  • Split the image into fixed-size, non-overlapping patches that together cover the entire image.
  • Each patch is flattened and mapped through a learned linear projection to a patch embedding, to which a position embedding is added.
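The patching step above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation: the projection matrix `W` is random here, standing in for a learned parameter, and the image sizes (224×224, 16×16 patches, model dimension 768) are the common ViT-Base defaults.

```python
import numpy as np

def image_to_patches(img, patch_size):
    """Split an (H, W, C) image into non-overlapping, flattened patches."""
    h, w, c = img.shape
    p = patch_size
    # Reshape into a grid of patches, then flatten each patch to a vector.
    patches = img.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)
    return patches

img = np.random.rand(224, 224, 3)
patches = image_to_patches(img, 16)        # shape (196, 768): 14x14 patches

# Learned linear projection to the model dimension (random stand-in here).
d_model = 768
W = np.random.randn(patches.shape[1], d_model) * 0.02
tokens = patches @ W                       # shape (196, 768): one token per patch
```

A 224×224 image with 16×16 patches yields a sequence of 14 × 14 = 196 tokens, which is what the Transformer encoder then consumes.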

Flattening for 2D Images

  • Because the Transformer operates on a 1D sequence of tokens, each 2D patch is flattened into a vector.
  • Standard learnable 1D position embeddings are used, as more advanced 2D-aware embeddings did not show significant performance gains.
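Continuing the sketch, the 1D position embeddings are simply added to the patch tokens, and a learnable classification token is prepended to the sequence (as in the paper). All parameters below are random placeholders for what would be learned during training.

```python
import numpy as np

n_patches, d_model = 196, 768
tokens = np.random.randn(n_patches, d_model)     # patch embeddings from the previous step

# Learnable [class] token, prepended to the patch sequence (random init here).
cls_token = np.random.randn(1, d_model)
seq = np.concatenate([cls_token, tokens], axis=0)  # shape (197, 768)

# Standard learnable 1D position embeddings: one vector per sequence position,
# with no explicit encoding of the 2D patch grid.
pos_embed = np.random.randn(n_patches + 1, d_model) * 0.02
seq = seq + pos_embed
```

Note that the position embedding is indexed only by the token's position in the flattened sequence; any 2D spatial relationships must be learned from data.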

Transformer Design

  • The model design closely follows the original Transformer encoder.
  • This makes setup straightforward: existing, efficient Transformer implementations can be reused almost out of the box.

Attention Mechanism

  • Self-attention computes attention maps that identify the image regions most relevant to the task at hand.
  • Weighting values by these attention maps lets the model capture long-range dependencies across the entire image.

Vision Transformers vs. Convolutional Neural Networks

  • Vision Transformers (ViT) have much less image-specific inductive bias than Convolutional Neural Networks (CNNs).
  • ViT focuses on capturing global context, while CNNs build in assumptions about 2D neighborhood structure, locality, and translation equivariance.

In ViT:

  • Only the multilayer perceptron (MLP) layers are local and translationally equivariant.
  • The self-attention layers are global, making very little use of the 2D structure of the image.


Vision Transformers excel in capturing global context, enabling them to handle long-range dependencies effectively in image recognition tasks.

Note: This blog post draws inspiration from the advancements discussed in the original paper, highlighting the impact of adapting Transformers for image classification.