Justin Lin | Portfolio

Paper Notes: SELF-RAG

Apr 19, 2026

TL;DR

Standard RAG retrieves a fixed set of passages for every input, whether or not they help, which can hurt generation quality.
SELF-RAG adds on-demand retrieval and self-reflection, both driven by special reflection tokens.
Key takeaway: the LLM learns when to retrieve and how to critique its own outputs.

Bibliographic Snapshot

Field	Detail
Citation	`Asai et al., 2023`
Keywords	RAG, self-reflection, retrieval
Dataset / Benchmarks	PopQA, TriviaQA, ARC, ASQA
Code / Repo	https://selfrag.github.io

Problem Statement

RAG improves factuality but retrieves information indiscriminately, leading to noise and inefficiency. Additionally, LLMs are not trained to verify whether outputs are supported by retrieved data. The paper aims to make retrieval adaptive and self-aware.

Core Idea

Introduce reflection tokens:
- Retrieve (when to retrieve)
- ISREL (relevance)
- ISSUP (support)
- ISUSE (utility)
Pipeline:
- Decide → Retrieve → Generate → Critique
Train LLM to:
- Retrieve on demand
- Evaluate its own outputs
Inference:
- Segment-level beam search, with candidates ranked by critique-token scores
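
The Decide → Retrieve → Generate → Critique flow above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: `ToyModel`, `toy_retrieve`, every method name, and the 0.5 threshold are hypothetical stand-ins so the loop runs without a real LLM or retriever. The critique here is a toy word-overlap proxy, not the learned ISREL/ISSUP/ISUSE tokens.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    passage: Optional[str]  # None when the model answered without retrieval
    text: str
    critique: float         # higher is better


class ToyModel:
    """Hypothetical stand-in for a SELF-RAG generator (not a real API)."""

    def retrieve_probability(self, prompt: str) -> float:
        # Assumption: pretend factual-sounding questions trigger retrieval.
        starts = ("who", "when", "where")
        return 0.9 if prompt.lower().startswith(starts) else 0.1

    def generate(self, prompt: str, passage: Optional[str]) -> str:
        return f"answer based on: {passage}" if passage else "answer from memory"

    def critique(self, prompt: str, passage: Optional[str], text: str) -> float:
        # Toy proxy for the critique tokens: prompt/passage word overlap.
        if passage is None:
            return 0.0
        prompt_words = set(prompt.lower().split())
        overlap = prompt_words & set(passage.lower().split())
        return len(overlap) / max(len(prompt_words), 1)


def toy_retrieve(prompt: str) -> List[str]:
    """Hypothetical retriever that returns a fixed passage list."""
    return ["the eiffel tower is in paris", "bananas are yellow"]


def self_rag_step(model: ToyModel, prompt: str, threshold: float = 0.5) -> Candidate:
    # 1. Decide: skip retrieval when the model does not ask for it.
    if model.retrieve_probability(prompt) < threshold:
        text = model.generate(prompt, None)
        return Candidate(None, text, model.critique(prompt, None, text))
    # 2. Retrieve, then 3. Generate one candidate per passage.
    candidates = []
    for passage in toy_retrieve(prompt):
        text = model.generate(prompt, passage)
        # 4. Critique each candidate so the best one can be kept.
        score = model.critique(prompt, passage, text)
        candidates.append(Candidate(passage, text, score))
    return max(candidates, key=lambda c: c.critique)
```

Calling `self_rag_step(ToyModel(), "where is the eiffel tower")` keeps the Paris passage, while a prompt like "write a haiku" takes the no-retrieval branch.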

Visual / Diagram Notes

Figure 1 (page 2):
- RAG vs SELF-RAG comparison
- Shows adaptive retrieval + critique loop
Training pipeline (page 5):
- Reflection tokens inserted into training data
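
To make "reflection tokens inserted into training data" concrete, here is a rough sketch of how one annotated example might be serialized. The `Segment` class, the `augment` helper, and the literal token strings are illustrative assumptions on my part, not a confirmed copy of the released data format. Emitting the support token only for relevant passages is also my simplification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    text: str
    passage: Optional[str] = None   # retrieved passage, if retrieval was requested
    relevant: bool = False          # ISREL label for the passage
    support: str = "No support"     # ISSUP label for the generated text


def augment(segments: List[Segment], utility: int) -> str:
    """Interleave reflection tokens with the original output segments."""
    parts = []
    for seg in segments:
        if seg.passage is None:
            parts.append(f"[No Retrieval]{seg.text}")
            continue
        parts.append(f"[Retrieval]<paragraph>{seg.passage}</paragraph>")
        parts.append("[Relevant]" if seg.relevant else "[Irrelevant]")
        parts.append(seg.text)
        if seg.relevant:
            # Simplification: only judge support against relevant passages.
            parts.append(f"[{seg.support}]")
    # One overall usefulness rating closes the example.
    parts.append(f"[Utility:{utility}]")
    return "".join(parts)


example = augment(
    [
        Segment("Paris is the capital of France.",
                passage="France's capital is Paris.",
                relevant=True,
                support="Fully supported"),
        Segment(" It is a popular destination."),
    ],
    utility=5,
)
print(example)
```

The generator is then trained on strings like `example` with an ordinary next-token objective, so predicting a reflection token is no different from predicting any other token.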

Key Results

Outperforms ChatGPT and RAG baselines on multiple tasks
Improves factuality + citation accuracy
Better long-form generation quality
Limitation: higher inference complexity

Personal Analysis

What worked:

Elegant integration of retrieval + reasoning + evaluation
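Reflection tokens are a powerful abstraction: one vocabulary extension covers both retrieval control and self-evaluation
Improves controllability at inference, since critique-token weights can be tuned without retraining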

What puzzled me:

Complex training pipeline (critic + generator)
Increased inference cost due to multi-pass reasoning

Connections & Related Work

Extends RAG → makes it adaptive
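Related to RLHF: both rely on learned feedback, but here the critique is emitted as tokens during generation rather than applied through a reward model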
Similar to agent-style reasoning loops

Implementation Sketch

Train:
- Critic model (generates reflection labels offline for the training corpus)
- Generator model (predicts text tokens and reflection tokens with a standard next-token objective)
Inference:
- Conditional retrieval
- Multi-candidate generation
- Rank with critique scores
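
The "rank with critique scores" step could look roughly like the sketch below. As I read the paper, each candidate's score is the generator's own score plus a weighted sum of normalized probabilities for the most desirable critique tokens. The weights, dictionary keys, and the use of a log-probability as the generation score are my assumptions for illustration, not values from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical weights for each critique group (tunable at inference time).
WEIGHTS = {"ISREL": 1.0, "ISSUP": 1.0, "ISUSE": 0.5}
# The most desirable value within each critique group.
DESIRED = {"ISREL": "relevant", "ISSUP": "fully", "ISUSE": "5"}


def critique_score(token_probs: Dict[str, Dict[str, float]]) -> float:
    """Weighted sum of normalized probabilities of the desired critique tokens."""
    total = 0.0
    for group, weight in WEIGHTS.items():
        probs = token_probs[group]
        total += weight * probs[DESIRED[group]] / sum(probs.values())
    return total


def rank(candidates: List[dict]) -> List[Tuple[str, float]]:
    """Sort candidates by generation score plus critique score, best first."""
    scored = [
        (c["text"], c["logprob"] + critique_score(c["critique"]))
        for c in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


candidates = [
    {
        "text": "a",
        "logprob": -1.0,
        "critique": {
            "ISREL": {"relevant": 0.9, "irrelevant": 0.1},
            "ISSUP": {"fully": 0.8, "partially": 0.1, "none": 0.1},
            "ISUSE": {"5": 0.5, "4": 0.3, "3": 0.1, "2": 0.05, "1": 0.05},
        },
    },
    {
        "text": "b",
        "logprob": -0.5,
        "critique": {
            "ISREL": {"relevant": 0.2, "irrelevant": 0.8},
            "ISSUP": {"fully": 0.1, "partially": 0.2, "none": 0.7},
            "ISUSE": {"5": 0.2, "4": 0.2, "3": 0.2, "2": 0.2, "1": 0.2},
        },
    },
]
ranked = rank(candidates)
```

Candidate "b" is the more fluent one by log-probability, but "a" ranks first because its passage is judged relevant and its text well supported. Shifting `WEIGHTS` trades fluency against support without touching the model.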

Open Questions / Next Actions

Can reflection tokens generalize to other tasks?
How to reduce inference cost?
Combine with MemGPT-style memory?

Glossary

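Reflection tokens: special tokens added to the vocabulary for retrieval decisions and self-evaluation
ISREL: critique token marking whether a retrieved passage is relevant to the input (relevant / irrelevant)
ISSUP: critique token marking whether the generated segment is supported by the passage (fully / partially / no support)
ISUSE: critique token rating how useful the overall response is (1 to 5)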
