In the field of computer vision, adapting Transformer models, originally designed for natural language processing, has opened new frontiers in image recognition.
Background
- 2017: The release of Attention Is All You Need established the Transformer as the dominant architecture in natural language processing.
- 2020: The release of An Image is Worth 16x16 Words adapted the Transformer to image classification, introducing the Vision Transformer (ViT).
- Goal of the paper: apply a standard Transformer directly to images, with as few vision-specific modifications as possible.
Model Architecture
The adaptation involves the following key components:
Image Patching
- Split the image into fixed-size, non-overlapping patches that together cover the entire image.
- Each patch is flattened and linearly projected into an embedding vector, to which a position embedding is added (a minimal sketch follows this list).
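To make the patching step concrete, here is a minimal PyTorch sketch of how the projection could be implemented. The class name PatchEmbed, the default sizes, and the convolution trick are illustrative assumptions, not code from the paper.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into fixed-size patches and project each one to an embedding."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A convolution with kernel = stride = patch_size is equivalent to slicing
        # non-overlapping patches and applying the same linear projection to each.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                   # x: (B, 3, 224, 224)
        x = self.proj(x)                    # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)    # (B, 196, 768) -- one token per patch
        return x
```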
Handling 2D Images
- Each 2D patch is flattened into a 1D vector, so the image becomes a sequence of tokens the Transformer can process.
- Standard learnable 1D position embeddings are used, since more advanced 2D-aware embeddings do not show significant performance gains (a sketch follows this list).
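The sketch below shows how 1D position embeddings might be added to the patch tokens. The paper also prepends a learnable [class] token, included here for completeness; the variable names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

embed_dim, num_patches = 768, 196

# Learnable [class] token and 1D position embeddings
# (one position per patch, plus one for the class token).
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

def add_positions(patch_tokens):                     # patch_tokens: (B, 196, 768)
    B = patch_tokens.shape[0]
    cls = cls_token.expand(B, -1, -1)                # (B, 1, 768)
    tokens = torch.cat([cls, patch_tokens], dim=1)   # (B, 197, 768)
    # Positions are simply added as a 1D sequence; no 2D structure is encoded.
    return tokens + pos_embed
```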
Transformer Design
- The model design closely resembles the original Transformer.
- Because of this intentionally simple setup, existing, well-optimized Transformer implementations can be used almost out of the box (see the sketch after this list).
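As an illustration of "out of the box", the sketch below stacks a standard PyTorch Transformer encoder with ViT-Base-like dimensions (768-dim embeddings, 12 heads, 12 layers, 3072-dim MLP); the exact hyperparameters and the use of nn.TransformerEncoder are assumptions for this example.

```python
import torch
import torch.nn as nn

# Because the design stays so close to the original Transformer encoder,
# an off-the-shelf implementation can be reused almost unchanged.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=768, nhead=12, dim_feedforward=3072,
    activation="gelu", batch_first=True, norm_first=True,  # pre-norm variant
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)

tokens = torch.randn(2, 197, 768)   # (batch, patches + class token, embed dim)
out = encoder(tokens)               # same shape: (2, 197, 768)
```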
Attention Mechanism
- Self-attention is used to compute attention maps, identifying the most relevant image regions for a specific task.
- The token values are then aggregated using these attention weights, allowing the model to capture long-range dependencies across the image (a sketch follows this list).
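Below is a minimal single-head self-attention sketch over patch tokens, just to make the attention-map computation explicit; the function name and the plain matrix-multiply projections are illustrative assumptions rather than the paper's multi-head implementation.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a sequence of patch tokens.

    x: (B, N, D) patch embeddings; w_q / w_k / w_v: (D, D) projection matrices.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Attention map: every patch attends to every other patch, so the weighting
    # can link regions that are far apart in the image.
    attn = F.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)  # (B, N, N)
    return attn @ v                                                          # (B, N, D)
```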
Vision Transformers vs. Convolutional Neural Networks
- Less image-specific inductive bias in Vision Transformers (ViT) compared to Convolutional Neural Networks (CNNs).
- ViT focuses on capturing global context, while CNNs build in assumptions about 2D neighborhood structure, locality, and translation equivariance.
In ViT:
- Only the multilayer perceptron (MLP) layers are local and translation-equivariant.
- The self-attention layers are global, making very little use of the image's 2D structure.
Conclusion
Vision Transformers excel in capturing global context, enabling them to handle long-range dependencies effectively in image recognition tasks.
Note: This blog post draws on the original paper, highlighting the impact of adapting Transformers to image classification.