Paper Notes: Extracting Training Data from Large Language Models
Feb 22, 2026
Original Paper: Extracting Training Data from Large Language Models
TL;DR
- This paper demonstrates that large language models (LLMs) can leak verbatim training data through black-box querying.
- The authors design a practical extraction attack that recovers hundreds of memorized sequences from GPT-2, including contact information, UUIDs, URLs, and source code.
- Crucially, memorization occurs without overfitting, and larger models exhibit significantly greater privacy leakage.
Bibliographic Snapshot
| Field | Detail |
|---|---|
| Citation | Carlini et al., USENIX Security 2021 |
| Keywords | memorization, training data extraction, membership inference, LLM privacy |
| Dataset / Benchmarks | GPT-2 (XL, Medium, Small); OpenWebText (implicitly) |
| Code / Repo | Not publicly released (evaluation done in collaboration with OpenAI) |
Problem Statement
Large language models are often trained on massive datasets scraped from the web, potentially containing sensitive or personal information. The prevailing belief was that modern LLMs do not meaningfully memorize training data because they exhibit minimal train–test loss gaps.
This paper challenges that assumption by asking:
Can an adversary extract verbatim training examples from a large language model using only black-box access?
The threat model assumes:
- Black-box access (API-level queries).
- No knowledge of model weights.
- No direct access to training data.
The attack goal is indiscriminate extraction, not targeted recovery of specific secrets.
Core Idea
1. Attack Pipeline
The attack consists of two major phases:
Step 1 — Generate Candidate Sequences
Generate 200,000 samples from GPT-2 using:
- Top-n sampling
- Decaying temperature sampling
- Internet-conditioned prefix sampling
Each sample is 256 tokens long.
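The decaying-temperature strategy can be sketched in plain Python. This is my reading of the paper's schedule (temperature starts at 10 and decays to 1 over roughly the first 20 tokens, then stays at 1); the function and parameter names are my own:

```python
import math
import random

def decayed_temperature(step: int, t_start: float = 10.0,
                        t_end: float = 1.0, decay_steps: int = 20) -> float:
    """Linearly decay temperature from t_start to t_end over decay_steps tokens."""
    if step >= decay_steps:
        return t_end
    frac = step / decay_steps
    return t_start + frac * (t_end - t_start)

def sample_with_temperature(logits, temperature: float, rng=random):
    """Sample a token index from raw logits after temperature scaling (softmax)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    r = rng.random()
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e / total
        if r <= acc:
            return i
    return len(exps) - 1
```

High early temperature encourages diverse prefixes; dropping to 1 lets the model settle into high-confidence (potentially memorized) continuations.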
Step 2 — Rank by Memorization Likelihood
The authors introduce improved membership inference metrics:
- Raw perplexity
- Perplexity ratio vs GPT-2 Small
- Perplexity ratio vs GPT-2 Medium
- zlib compression entropy ratio
- Lowercased perplexity ratio
- Sliding-window minimum perplexity
Key insight:
Memorized sequences have anomalously low perplexity compared to reference models or compression metrics.
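Two of the metrics above can be sketched in a few lines. The per-token log-probabilities are assumed to come from the model; the exact ranking formulas are my reading of the paper (lower scores flag likelier memorization):

```python
import math
import zlib

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

def zlib_score(text: str, ppl: float) -> float:
    """Log-perplexity divided by zlib-compressed byte length (an entropy proxy).
    Memorized text tends to get low model perplexity despite high zlib entropy,
    so a low score flags candidates for inspection."""
    entropy = len(zlib.compress(text.encode("utf-8")))
    return math.log(ppl) / entropy

def model_ratio_score(ppl_large: float, ppl_small: float) -> float:
    """Likelihood-ratio-style score: log-perplexity of the large model over
    that of a smaller reference model (e.g. GPT-2 Small)."""
    return math.log(ppl_large) / math.log(ppl_small)
```

The common idea: a sequence the large model finds "too easy" relative to an independent reference (a smaller model, or a compressor) is an anomaly worth inspecting.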
2. Formal Definition: k-Eidetic Memorization
They define:
- A string is extractable if there exists a prefix that causes the model to generate it.
- A string is k-eidetic memorized if it appears in at most k training documents.
Small k ⇒ stronger privacy violation.
This definition shifts memorization from vague intuition to a measurable property.
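Given access to the training corpus, the k-eidetic check reduces to counting containing documents. A minimal sketch (exact-substring match; the paper also considers normalized matches):

```python
def eidetic_count(s: str, documents) -> int:
    """Number of training documents containing string s.
    An extractable string is k-eidetic memorized if this count is at most k."""
    return sum(1 for doc in documents if s in doc)
```

For example, a string with `eidetic_count == 1` is 1-eidetic memorized: the model reproduced content seen in a single document, the strongest privacy violation under this definition.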
3. Key Quantitative Results
- 1,800 candidates manually inspected.
- 604 unique memorized sequences recovered.
- Best attack configuration: 67% precision.
- Extracted content included:
- 32 contact information entries
- 46 named individuals
- 35 high-entropy UUID/base64 strings
- 31 source code snippets
- 50 valid URLs
Some sequences appeared in only one training document.
Visual / Diagram Notes
Figure 2 (Attack Workflow): LM → 200k generations → rank via metric → deduplicate → manually inspect → verify against training data.
This makes clear the attack is entirely black-box and requires no gradient access.
Figure 3 (zlib vs GPT-2 Perplexity Scatter Plot):
- Most samples lie on a diagonal.
- Memorized samples appear as outliers where GPT-2 assigns low perplexity but zlib is surprised.
- Indicates memorization detection via likelihood ratio anomaly.
Table 2: Internet-conditioned prompts + zlib ratio yielded highest precision (67%).
Key Results
- 604 memorized training examples extracted.
- Larger models memorize significantly more.
- GPT-2 XL memorizes URLs after ~33 insertions.
- Medium and Small models show much weaker memorization.
- Memorization increases with model size.
Important empirical claim:
Memorization does NOT require overfitting.
Train/test gap was minimal, yet extraction succeeded.
Personal Analysis
What worked
- The likelihood-ratio idea (compare to smaller model or compression metric) is elegant.
- The formalization of k-eidetic memorization is useful for auditing.
- The empirical study is methodologically strong (manual verification + OpenAI collaboration).
What puzzled me
- The paper assumes memorization is inherently undesirable, but some memorization (e.g., public licenses, digits of π) is expected.
- The boundary between “knowledge” and “memorization” is philosophically unclear.
- It is not obvious how scalable manual verification would be for GPT-3/4-scale models.
Connections & Related Work
- Extends “The Secret Sharer” (Carlini et al., 2019) to large-scale LLMs.
- Related to membership inference attacks (Shokri et al.).
- Connects to differential privacy (DP-SGD).
- Relevant to scaling laws (Kaplan et al.) — larger models ⇒ greater memorization.
In modern context:
- Directly relevant to API-based LLM deployment.
- Impacts training data governance for foundation models.
Implementation Sketch
If reproducing the work:
Dependencies
- GPT-2 models (XL, Medium, Small)
- zlib compression
- Perplexity computation utilities
- Web scraping for prefix generation
Steps
- Generate 200k samples per strategy.
- Compute perplexity and ratio metrics.
- Rank samples by anomaly score.
- Deduplicate via trigram overlap.
- Manually search substrings online.
- Confirm with training data (if available).
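The deduplication step can be sketched as a greedy filter on word-trigram overlap. The overlap formula and threshold are assumptions for illustration; the paper only specifies that near-duplicates share trigrams:

```python
def trigrams(text: str):
    """Set of word-level trigrams in a sample."""
    words = text.split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def deduplicate(samples, threshold: float = 0.5):
    """Keep a sample only if its trigram overlap with every already-kept
    sample stays at or below `threshold` (overlap normalized by the
    smaller trigram set)."""
    kept, kept_tris = [], []
    for s in samples:
        t = trigrams(s)
        is_dup = any(
            t and kt and len(t & kt) / min(len(t), len(kt)) > threshold
            for kt in kept_tris
        )
        if not is_dup:
            kept.append(s)
            kept_tris.append(t)
    return kept
```

This is quadratic in the number of samples, which is tolerable at the 200k scale used here but would need indexing (e.g. MinHash) for larger runs.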
Compute budget:
- Modest by modern standards (GPT-2 scale).
- Manual inspection is the bottleneck.
Open Questions / Next Actions
- How does memorization scale for GPT-3, GPT-4, or modern 70B+ models?
- Can we automate memorization auditing?
- Is differential privacy viable at trillion-token scale?
- Can we design architectural inductive biases to reduce memorization?
- How does fine-tuning affect inherited memorization?
Glossary
k-Eidetic Memorization
A string extractable from the model that appears in ≤ k training documents.
Perplexity
Exponential of average negative log-likelihood; lower means higher model confidence.
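As a formula (standard definition, not specific to this paper): for a sequence x = x_1 … x_n,

PPL(x) = exp( −(1/n) · Σ_{i=1}^{n} log p(x_i | x_{<i}) )

For example, a model that assigns every token probability 1/2 has perplexity exactly 2.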
Membership Inference
Attack to determine whether data was in training set.
Differential Privacy (DP-SGD)
Training method that provides formal privacy guarantees via noise injection.
Contextual Integrity
Privacy framework stating data misuse occurs when information appears outside intended context.
Personal Takeaway
This paper is important because it exposes a structural privacy weakness in LLMs. From my own experience with GPT-3.5 in 2022, I was able to elicit sensitive outputs such as Windows activation codes and callable phone numbers, which aligns with the authors' core claim: memorization is not accidental but a byproduct of the next-token prediction objective. Rare or uniquely structured training examples such as addresses or contacts can receive disproportionately low training loss, making them statistically attractive continuations under targeted prompt engineering. This vulnerability is serious because, as the paper shows, attackers face essentially no technical barriers to extracting data from LLMs. I would rate this paper 5/5. Two questions remain for me:
- Legacy open-source LLMs such as GPT-2 were trained without privacy filtering. What practical mitigation strategies can reduce the ongoing privacy risk, given that retraining from scratch and deleting every vulnerable copy may not be feasible?
- Do modern LLM-based agent systems introduce new privacy vulnerabilities beyond model memorization, such as tool misuse or retrieval-layer leakage?