Paper Notes: Attention Is All You Need
Apr 19, 2026
TL;DR
- This paper replaces RNNs/CNNs with a fully attention-based architecture for sequence modeling.
- The Transformer processes all positions of a sequence in parallel and connects any two positions in a constant number of operations, which makes long-range dependencies easier to learn.
- The key takeaway is that self-attention alone is sufficient and more scalable than recurrence.
Bibliographic Snapshot
| Field | Detail |
|---|---|
| Citation | Vaswani et al., NeurIPS 2017 |
| Keywords | transformer, attention, sequence modeling |
| Dataset / Benchmarks | WMT 2014 En-De, En-Fr |
| Code / Repo | https://github.com/tensorflow/tensor2tensor |
Problem Statement
Traditional sequence models rely on RNNs or CNNs. RNNs process tokens one at a time, which limits parallelism within a training example and makes training slow on long sequences. CNNs parallelize better, but the number of operations needed to relate two positions grows with the distance between them, which makes long-range dependencies harder to learn. The paper aims to design a model that removes recurrence and convolution while maintaining strong performance on sequence transduction tasks such as machine translation.
Core Idea
- Replace recurrence and convolution with self-attention
- Use encoder-decoder architecture with stacked layers
- Key components:
  - Multi-head self-attention
  - Position-wise feed-forward networks
  - Residual connections + layer normalization
- Attention formula:
  - Scaled Dot-Product Attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
- Positional encoding injects sequence order using sinusoidal functions
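
The attention formula above can be sketched in a few lines of plain Python. This is a minimal illustration on nested lists rather than a tensor library; the function names `softmax` and `scaled_dot_product_attention` and the boolean `mask` convention are illustrative assumptions, not the paper's reference code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors.

    Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v.
    mask[i][j] == False (assumed convention) stops query i from
    attending to key j; every query must keep at least one key.
    Returns (outputs, weights).
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if mask is not None:
            scores = [s if keep else float("-inf")
                      for s, keep in zip(scores, mask[i])]
        w = softmax(scores)
        weights.append(w)
        # Output row i is the attention-weighted average of the value rows.
        outputs.append([sum(w_j * v[c] for w_j, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights


# Toy example: 2 positions, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

# Causal mask as used in the decoder: position i sees only positions <= i.
causal = [[j <= i for j in range(len(K))] for i in range(len(Q))]
out, w = scaled_dot_product_attention(Q, K, V, mask=causal)
print(w[0])    # the first position can only attend to itself
print(out[0])  # so its output equals V[0]
```

Passing `mask=None` gives the unmasked attention used in the encoder; the causal mask corresponds to the decoder's masked self-attention.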
Visual / Diagram Notes
- Figure 1 (page 3) shows the full Transformer architecture:
  - Encoder: repeated self-attention + FFN blocks
  - Decoder: masked self-attention + encoder-decoder attention + FFN blocks
- Figure 2 (page 4) explains scaled dot-product attention and multi-head attention
  - Key intuition: multiple heads learn different relationships in parallel
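
The "multiple heads in parallel" intuition can be sketched as follows: each head projects the input into its own smaller subspace, runs scaled dot-product attention there, and the head outputs are concatenated and mixed by an output projection. This is a rough sketch with random, untrained weights; `init_params`, the parameter layout, and the toy sizes are assumptions made for illustration.

```python
import math
import random


def matmul(A, B):
    """Multiply matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    weights = [softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                        for k in K]) for q in Q]
    return matmul(weights, V)


def init_params(d_model, num_heads, seed=0):
    """Hypothetical helper: random projections with d_k = d_model / num_heads."""
    rng = random.Random(seed)
    d_k = d_model // num_heads

    def rand(rows, cols):
        return [[rng.gauss(0.0, rows ** -0.5) for _ in range(cols)]
                for _ in range(rows)]

    return {
        "heads": [(rand(d_model, d_k), rand(d_model, d_k), rand(d_model, d_k))
                  for _ in range(num_heads)],
        "W_o": rand(num_heads * d_k, d_model),
    }


def multi_head_self_attention(X, params):
    """X is n x d_model; each head attends in its own d_k-dim subspace."""
    head_outputs = []
    for W_q, W_k, W_v in params["heads"]:
        head_outputs.append(
            attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)))
    # Concatenate the heads feature-wise, then mix them with W_o.
    concat = [[x for head in head_outputs for x in head[i]]
              for i in range(len(X))]
    return matmul(concat, params["W_o"])


# Toy sizes for readability; the paper's base model uses d_model = 512, 8 heads.
params = init_params(d_model=8, num_heads=2)
X = [[float(i + j) / 10 for j in range(8)] for i in range(3)]  # 3 tokens
Y = multi_head_self_attention(X, params)
print(len(Y), len(Y[0]))  # same shape as X: 3 x 8
```

Because each head works in d_model / num_heads dimensions, the total cost stays close to that of a single full-width head.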
Key Results
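- Achieves 28.4 BLEU on WMT 2014 En-De, more than 2 BLEU above the previous best results, including ensembles
- Training cost is a small fraction of that of the best prior RNN/CNN-based models (the big model trained for 3.5 days on 8 P100 GPUs)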
- Strong parallelization → significantly faster training
- Ablation: single-head attention is 0.9 BLEU worse than the best multi-head setting, and quality also drops with too many heads
- Limitation: self-attention cost is quadratic in sequence length (O(n^2 * d) per layer)
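
The quadratic cost noted above can be made concrete with the per-layer operation counts from the paper's complexity comparison: on the order of n^2 * d for self-attention versus n * d^2 for a recurrent layer. The helpers below are back-of-the-envelope arithmetic only (constants ignored, function names illustrative).

```python
def self_attention_ops(n, d):
    """Per-layer cost of self-attention, up to constants: O(n^2 * d)."""
    return n * n * d


def recurrent_ops(n, d):
    """Per-layer cost of a recurrent layer, up to constants: O(n * d^2)."""
    return n * d * d


d_model = 512
for n in (50, 512, 4096):
    ratio = self_attention_ops(n, d_model) / recurrent_ops(n, d_model)
    # The ratio reduces to n / d: below 1 for n < d, above 1 for n > d.
    print(f"n={n:5d}  attention/recurrent cost ratio = {ratio:.3f}")
```

Since the ratio is n / d, self-attention is the cheaper layer while sequences are shorter than the model width, which is the usual regime for sentence-level translation, and the more expensive one for very long inputs.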
Personal Analysis
What worked:
- Elegant removal of recurrence simplifies architecture
- Parallelism is a major practical advantage
- Attention provides interpretable alignment patterns
What puzzled me:
- Quadratic complexity limits scaling to very long sequences
- The sinusoidal positional encoding feels somewhat heuristic; the paper reports that learned positional embeddings give nearly identical results
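
One way to make the sinusoidal encoding feel less arbitrary: the paper defines PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), and notes that for any fixed offset k, PE(pos + k) is a linear function of PE(pos). The sketch below builds the table and checks that property numerically by rotating each sin/cos pair. The function names are illustrative, and an even d_model is assumed.

```python
import math


def positional_encoding(max_len, d_model):
    """Sinusoidal table: even columns hold sin, odd columns hold cos."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for col in range(0, d_model, 2):
            angle = pos / (10000 ** (col / d_model))
            table[pos][col] = math.sin(angle)
            if col + 1 < d_model:
                table[pos][col + 1] = math.cos(angle)
    return table


def shift(encoding, k, d_model):
    """Map PE(pos) to PE(pos + k) using only k (assumes even d_model).

    Each (sin, cos) pair is rotated by the angle k / 10000^(col/d_model),
    which is a linear map that does not depend on pos.
    """
    shifted = list(encoding)
    for col in range(0, d_model - 1, 2):
        theta = k / (10000 ** (col / d_model))
        s, c = encoding[col], encoding[col + 1]
        shifted[col] = s * math.cos(theta) + c * math.sin(theta)
        shifted[col + 1] = c * math.cos(theta) - s * math.sin(theta)
    return shifted


d_model = 8
pe = positional_encoding(max_len=16, d_model=d_model)
predicted = shift(pe[3], k=5, d_model=d_model)
max_err = max(abs(a - b) for a, b in zip(predicted, pe[8]))
print(max_err < 1e-9)  # True: PE(3) rotated by offset 5 matches PE(8)
```

The rotation depends only on the offset, which is the paper's stated reason for hypothesizing that the model can easily learn to attend by relative position.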
Connections & Related Work
- Replaces RNN-based seq2seq (e.g., LSTM, GRU)
- Extends the attention mechanisms previously used alongside recurrence in RNN encoder-decoder models
- Foundation of modern LLMs (GPT, BERT, etc.)
Implementation Sketch
- Framework: PyTorch / TensorFlow
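- Steps:
  - Tokenize input (BPE)
  - Add positional encoding to the token embeddings
  - Stack encoder layers (self-attention + FFN)
  - Stack decoder layers (masked self-attention + cross-attention + FFN)
  - Train with Adam + learning rate warmup (4,000 warmup steps in the paper, then inverse square-root decay)
- Hyperparameters (base model):
  - d_model = 512
  - heads = 8 (d_k = d_v = 64)
  - layers = 6 each for encoder and decoder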
- Compute: multi-GPU recommended
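
The warmup in the training step follows the schedule given in the paper: lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5), with warmup_steps = 4000 and Adam settings beta1 = 0.9, beta2 = 0.98, epsilon = 1e-9. A minimal sketch of the schedule alone (the function name `transformer_lr` is illustrative):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup for warmup_steps, then inverse square-root decay."""
    step = max(step, 1)  # avoid 0 ** -0.5 on the very first call
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


for step in (1, 1000, 4000, 16000, 100000):
    print(f"step {step:6d}: lr = {transformer_lr(step):.2e}")
# The rate peaks at step == warmup_steps (about 7.0e-4 for these defaults).
```

In PyTorch, one plausible way to wire this in would be `torch.optim.lr_scheduler.LambdaLR` with the optimizer's base learning rate set to 1.0, so that the function's return value becomes the effective rate; that wiring is a suggestion, not something the paper specifies.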
Open Questions / Next Actions
- Can attention be made more efficient than O(n^2)?
- How to scale to longer contexts?
- Explore sparse or linear attention variants
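
One simple sparse variant, which the paper itself mentions as future work, restricts each position to a neighborhood of size r around it, reducing the per-layer cost from O(n^2 * d) to O(r * n * d) at the price of a longer maximum path length, O(n / r). The sketch below only builds such a mask and counts the surviving query-key pairs; `local_attention_mask`, `count_pairs`, and the symmetric-window convention are assumptions for illustration, not a specific published method.

```python
def local_attention_mask(n, radius):
    """mask[i][j] is True when |i - j| <= radius, i.e. j is in i's window."""
    return [[abs(i - j) <= radius for j in range(n)] for i in range(n)]


def count_pairs(mask):
    """Number of query-key pairs that still need a dot product."""
    return sum(sum(row) for row in mask)


n = 1024
full = n * n
local = count_pairs(local_attention_mask(n, radius=64))
print(full, local)  # local grows roughly as n * (2 * radius + 1), not n^2
```

A mask like this only marks which pairs are allowed; an actual speedup requires an implementation that skips the masked pairs instead of computing and discarding them.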
Glossary
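- Self-attention: mechanism in which every position of a sequence attends to all positions of the same sequence
- Multi-head attention: several attention functions run in parallel on different learned projections of Q, K, and V, with their outputs concatenated and projected
- Positional encoding: vector added to each token embedding to encode the token's position in the sequence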