In the field of computer vision, adapting Transformer models, originally designed for natural language processing, has opened new frontiers in image recognition.
Background
- 2017: The release of Attention Is All You Need established the Transformer as the dominant architecture in natural language processing.
- 2020: The release of An Image is Worth 16x16 Words adapted the Transformer to image classification, introducing the Vision Transformer (ViT).
- Goal of the paper: apply a standard Transformer directly to images, with as few vision-specific modifications as possible.
Model Architecture
The adaptation involves the following key components:
Image Patching
- Split the image into fixed-size, non-overlapping patches that together cover the entire image.
- Each patch is flattened and linearly projected into an embedding vector, to which a position embedding is added (a minimal sketch follows this list).
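To make the patching step concrete, here is a minimal PyTorch sketch of how the projection could be implemented. The class name PatchEmbed, the default sizes, and the convolution trick are illustrative assumptions, not code from the paper.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into fixed-size patches and project each one to an embedding."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A convolution with kernel = stride = patch_size is equivalent to slicing
        # non-overlapping patches and applying the same linear projection to each.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                   # x: (B, 3, 224, 224)
        x = self.proj(x)                    # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)    # (B, 196, 768) -- one token per patch
        return x
```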
Handling 2D Images
- Each 2D patch is flattened into a 1D vector, so the image becomes a sequence of tokens the Transformer can process.
- Standard learnable 1D position embeddings are used, since more advanced 2D-aware embeddings do not show significant performance gains (a sketch follows this list).
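The sketch below shows how 1D position embeddings might be added to the patch tokens. The paper also prepends a learnable [class] token, included here for completeness; the variable names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

embed_dim, num_patches = 768, 196

# Learnable [class] token and 1D position embeddings
# (one position per patch, plus one for the class token).
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

def add_positions(patch_tokens):                     # patch_tokens: (B, 196, 768)
    B = patch_tokens.shape[0]
    cls = cls_token.expand(B, -1, -1)                # (B, 1, 768)
    tokens = torch.cat([cls, patch_tokens], dim=1)   # (B, 197, 768)
    # Positions are simply added as a 1D sequence; no 2D structure is encoded.
    return tokens + pos_embed
```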
Transformer Design
- The model design closely resembles the original Transformer.
- Because of this intentionally simple setup, existing, well-optimized Transformer implementations can be used almost out of the box (see the sketch after this list).
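As an illustration of "out of the box", the sketch below stacks a standard PyTorch Transformer encoder with ViT-Base-like dimensions (768-dim embeddings, 12 heads, 12 layers, 3072-dim MLP); the exact hyperparameters and the use of nn.TransformerEncoder are assumptions for this example.

```python
import torch
import torch.nn as nn

# Because the design stays so close to the original Transformer encoder,
# an off-the-shelf implementation can be reused almost unchanged.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=768, nhead=12, dim_feedforward=3072,
    activation="gelu", batch_first=True, norm_first=True,  # pre-norm variant
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)

tokens = torch.randn(2, 197, 768)   # (batch, patches + class token, embed dim)
out = encoder(tokens)               # same shape: (2, 197, 768)
```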
Attention Mechanism
- Self-attention is used to compute attention maps, identifying the most relevant image regions for a specific task.
- The token values are then aggregated using these attention weights, allowing the model to capture long-range dependencies across the image (a sketch follows this list).
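Below is a minimal single-head self-attention sketch over patch tokens, just to make the attention-map computation explicit; the function name and the plain matrix-multiply projections are illustrative assumptions rather than the paper's multi-head implementation.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a sequence of patch tokens.

    x: (B, N, D) patch embeddings; w_q / w_k / w_v: (D, D) projection matrices.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Attention map: every patch attends to every other patch, so the weighting
    # can link regions that are far apart in the image.
    attn = F.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)  # (B, N, N)
    return attn @ v                                                          # (B, N, D)
```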
Vision Transformers vs. Convolutional Neural Networks
- Less image-specific inductive bias in Vision Transformers (ViT) compared to Convolutional Neural Networks (CNNs).
- ViT focuses on capturing global context, while CNNs build in assumptions about 2D neighborhood structure, locality, and translation equivariance.
In ViT:
- Only the multilayer perceptron (MLP) layers are local and translation-equivariant.
- The self-attention layers are global, making very little use of the image's 2D structure.
Conclusion
Vision Transformers excel in capturing global context, enabling them to handle long-range dependencies effectively in image recognition tasks.
Note: This blog post draws on the original paper, highlighting the impact of adapting Transformers to image classification.