Justin Lin | Portfolio

Paper Notes: Attention Is All You Need

Apr 19, 2026

TL;DR

This paper replaces RNNs/CNNs with a fully attention-based architecture for sequence modeling.
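The Transformer processes all positions in parallel and connects any two positions in a constant number of sequential steps, which makes long-range dependencies easier to learn.
The key takeaway is that self-attention alone, without recurrence or convolution, is sufficient for state-of-the-art translation and trains faster than recurrent models.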

Bibliographic Snapshot

Field	Detail
Citation	`Vaswani et al., NeurIPS 2017`
Keywords	transformer, attention, sequence modeling
Dataset / Benchmarks	WMT 2014 En-De, En-Fr
Code / Repo	https://github.com/tensorflow/tensor2tensor

Problem Statement

Traditional sequence transduction models rely on RNNs or CNNs. RNNs process tokens one at a time, which prevents parallelization within a training example and makes training slow on long sequences. CNNs parallelize well, but the number of operations needed to relate two distant positions grows with the distance between them, which makes long-range dependencies harder to learn. The paper aims to design a model that removes recurrence and convolution while maintaining strong performance on sequence transduction tasks like machine translation.

Core Idea

Replace recurrence and convolution with self-attention
Use encoder-decoder architecture with stacked layers
Key components:
- Multi-head self-attention
- Position-wise feed-forward networks
- Residual connections + layer normalization
Attention formula:
- Scaled Dot-Product Attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
Positional encoding injects sequence order using sinusoidal functions
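
To make the attention formula above concrete, here is a minimal sketch in plain Python (standard library only, list-of-lists matrices). The function names and toy values are my own for illustration; a real implementation would use batched tensor operations in PyTorch or TensorFlow.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v.
    Returns (output, weights), where weights is n_q x n_k.
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    weights = []
    for q in Q:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights.append(softmax(scores))
    d_v = len(V[0])
    # Each output row is the attention-weighted average of the value rows.
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(d_v)]
        for row in weights
    ]
    return output, weights

# Toy example (made-up values): 2 queries, 3 key/value pairs, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, attn = scaled_dot_product_attention(Q, K, V)
```

Each row of `attn` sums to 1, and the first query weights keys 0 and 2 equally because both have the same dot product with it.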

Visual / Diagram Notes

Figure 1 (page 3) shows the full Transformer architecture:
- Encoder: repeated self-attention + FFN blocks
- Decoder: masked self-attention + encoder-decoder attention + FFN blocks
Figure 2 (page 4) explains scaled dot-product attention and multi-head attention
Key intuition: multiple heads learn different relationships in parallel
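
The decoder's masked attention can be sketched as follows. The paper describes setting illegal (future) positions to negative infinity before the softmax; the helper names and the uniform toy scores below are my own assumptions for illustration.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over one row of scores; disallowed positions get zero weight."""
    masked = [s if ok else NEG_INF for s, ok in zip(scores, allowed)]
    m = max(masked)  # finite, because every position may attend to itself
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

n = 4
mask = causal_mask(n)
uniform_scores = [[0.0] * n for _ in range(n)]  # hypothetical raw scores
weights = [masked_softmax(row, allowed)
           for row, allowed in zip(uniform_scores, mask)]
```

With uniform scores, position 0 puts all its weight on itself, while position 3 spreads weight evenly over positions 0 to 3, so no position ever sees a later token.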

Key Results

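Achieves 28.4 BLEU on WMT 2014 En-De (big model), more than 2 BLEU above the previous best results, including ensembles
Lower training cost than prior RNN/CNN-based models: the big model trained for 3.5 days on 8 P100 GPUs
Strong parallelization → significantly faster training
Ablation: single-head attention is 0.9 BLEU worse than the best multi-head setting, though quality also drops with too many heads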
Limitation: quadratic complexity in sequence length (O(n^2))
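
A back-of-the-envelope sketch of what the quadratic term means in practice. It counts only the attention-score entries for encoder self-attention on a single sequence; the function name is hypothetical, and the defaults of 8 heads and 6 layers are taken from the base configuration.

```python
def attention_score_entries(seq_len, num_heads=8, num_layers=6):
    """Count the attention-score entries across all heads and layers.

    Each head in each layer builds a seq_len x seq_len score matrix.
    """
    return seq_len * seq_len * num_heads * num_layers

for n in (512, 1024, 2048):
    print(n, attention_score_entries(n))
```

Doubling the sequence length quadruples the count, which is why very long contexts become expensive.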

Personal Analysis

What worked:

Elegant removal of recurrence simplifies architecture
Parallelism is a major practical advantage
Attention provides interpretable alignment patterns

What puzzled me:

Quadratic complexity limits scaling to very long sequences
Positional encoding feels somewhat heuristic
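
For reference, the sinusoidal encoding is PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The paper motivates it by noting that the encoding at a fixed offset is a linear function of the encoding at the current position, and reports that learned positional embeddings gave nearly identical results. A minimal sketch, with a function name and toy sizes of my own choosing:

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(max_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(max_len=50, d_model=8)
```

Row `pe[pos]` is added to the embedding of the token at position `pos` before the first layer.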

Connections & Related Work

Replaces RNN-based seq2seq (e.g., LSTM, GRU)
Extends attention mechanisms previously used alongside recurrent networks (e.g., Bahdanau-style attention in seq2seq)
Foundation of modern LLMs (GPT, BERT, etc.)

Implementation Sketch

Framework: PyTorch / TensorFlow
Steps:
1. Tokenize input (BPE)
2. Add positional encoding
3. Stack encoder layers (self-attention + FFN)
4. Stack decoder layers (masked + cross attention)
5. Train with Adam + learning rate warmup
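Hyperparameters (base model):
- d_model = 512
- d_ff = 2048
- heads = 8 (d_k = d_v = 64)
- layers = 6 (encoder and decoder each)
- dropout = 0.1
Compute: multi-GPU recommended (the paper trained on 8 NVIDIA P100 GPUs)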
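
The warmup schedule in step 5 can be sketched as below. The paper uses Adam (beta1 = 0.9, beta2 = 0.98, epsilon = 1e-9) with lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5) and warmup_steps = 4000. The function name is my own, and clamping step to at least 1 is an assumption to avoid dividing by zero.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given optimizer step.

    Rises linearly for warmup_steps steps, then decays as 1 / sqrt(step).
    """
    step = max(step, 1)  # assumption: treat step 0 as step 1
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate peaks at the end of warmup.
peak = transformer_lr(4000)
```

In PyTorch, a function like this could be wrapped in a `LambdaLR` scheduler with the optimizer's base learning rate set to 1.0, so the returned value is used as the rate directly.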

Open Questions / Next Actions

Can attention be made more efficient than O(n^2)?
How to scale to longer contexts?
Explore sparse or linear attention variants

Glossary

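Self-attention: attention in which queries, keys, and values all come from the same sequence, so every token can attend to every other token
Multi-head attention: several attention functions run in parallel on separately learned linear projections of Q, K, and V, with their outputs concatenated and projected again
Positional encoding: vectors added to the input embeddings so that the otherwise order-agnostic model can use token positions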
