Paper Notes: Model Extraction Attacks & Defenses for LLMs

Mar 20, 2026

TL;DR

  • This paper studies how attackers can extract functionality, training data, or prompts from large language models via API access.
  • It provides a structured taxonomy of attacks and defenses, highlighting the unique vulnerabilities of generative transformer models.
  • The key takeaway is that LLM capabilities can be replicated surprisingly cheaply, creating serious risks for intellectual property and privacy.

Bibliographic Snapshot

Field                  Detail
Citation               Zhao et al., KDD 2025
Keywords               LLM security, model extraction, prompt stealing, privacy
Dataset / Benchmarks   None specific (survey paper)
Code / Repo            N/A

Problem Statement

The paper addresses the growing threat of model extraction attacks (MEAs) against large language models deployed via APIs. Attackers with only black-box query access can replicate model functionality, recover training data, or infer proprietary prompts. The challenge is amplified by LLM-specific properties such as generative outputs, large-scale memorization, and standardized transformer architectures. The paper aims to systematically categorize these threats and evaluate corresponding defenses under realistic deployment constraints.

Core Idea

The paper proposes a taxonomy of LLM extraction attacks and defenses:

1. Attack Categories

  • Functionality Extraction
    • API-based knowledge distillation
    • Direct query-based extraction
    • Parameter/architecture recovery
  • Training Data Extraction
    • Prompt-based memorization extraction
    • Private text reconstruction (e.g., activation inversion)
  • Prompt-targeted Attacks
    • Prompt stealing
    • Prompt reconstruction

2. Defense Categories

  • Model Protection
    • Architectural defenses (e.g., watermarking)
    • Output control (response perturbation)
  • Data Privacy Protection
    • Training-time protection (e.g., differential privacy)
    • Output sanitization
  • Prompt Protection
    • Prompt watermarking
    • Query monitoring
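Of these defenses, output sanitization is the most directly implementable. A minimal sketch of the idea (the regexes and placeholder tokens here are my own illustration, not from the paper):

```python
import re

# Hypothetical sanitizer: redacts email addresses and long digit runs
# (e.g. phone numbers) from model responses before they leave the API.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
DIGITS_RE = re.compile(r"\d{7,}")

def sanitize(response: str) -> str:
    """Replace likely-sensitive substrings with placeholder tokens."""
    response = EMAIL_RE.sub("[EMAIL]", response)
    response = DIGITS_RE.sub("[NUMBER]", response)
    return response

print(sanitize("Contact alice@example.com or call 5551234567."))
# → Contact [EMAIL] or call [NUMBER].
```

This directly targets the "structured data (e.g., emails)" leakage the paper flags as especially extractable, at the cost of occasionally redacting benign content.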

3. Evaluation Framework

  • Attack effectiveness:
    • Functional similarity
    • Data recovery rate
  • Defense performance:
    • Security metrics (e.g., detection rate)
    • Utility metrics (e.g., performance degradation)
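Functional similarity can be approximated very crudely as exact-match agreement over a shared probe set; the paper's actual metrics are richer (task accuracy deltas, semantic similarity), but a toy version makes the idea concrete:

```python
def functional_similarity(target_outputs, surrogate_outputs):
    """Fraction of probe prompts on which the surrogate's output exactly
    matches the target's. A crude proxy: real evaluations would use task
    accuracy deltas or semantic similarity rather than string equality."""
    assert len(target_outputs) == len(surrogate_outputs)
    matches = sum(t == s for t, s in zip(target_outputs, surrogate_outputs))
    return matches / len(target_outputs)

print(functional_similarity(["a", "b", "c", "d"], ["a", "b", "x", "d"]))  # → 0.75
```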

Visual / Diagram Notes

  • Figure 1 (page 3): Shows the full model extraction pipeline:
    • User queries → API → target model → responses → attacker trains surrogate model
  • Figure 2 (page 4): Taxonomy tree:
    • Clean separation between attack types and defense strategies
    • Useful for structuring slides or system design

Key Results

  • LLMs can be effectively cloned via API queries, even without internal access.
  • Modern attacks are increasingly query-efficient, requiring fewer interactions.
  • Training data extraction is especially effective for:
    • rare patterns
    • structured data (e.g., emails)
  • Defense trade-offs:
    • Strong protection → reduced utility
    • Weak protection → high extraction risk
  • Table (page 9) shows:
    • No single defense is universally effective across all attack types

Personal Analysis

What worked:

  • Clear taxonomy makes it easy to reason about different attack surfaces.
  • Good separation between:
    • functionality vs data vs prompt extraction
  • Evaluation metrics are well-defined for generative models.

What puzzled me:

  • No concrete quantitative benchmark comparing all attacks/defenses.
  • Some defenses (e.g., watermarking) seem fragile against adaptive attackers.
  • Lack of real-world deployment evaluation.

Connections & Related Work

  • Closely related to:
    • Data extraction / memorization papers (e.g., Carlini et al.)
    • Prompt injection & jailbreak research
  • Compared to my OSINT pipeline:
    • This paper extracts model knowledge
    • My system extracts human knowledge
  • Shared insight:
    • Risk comes from aggregation + synthesis, not raw data

Implementation Sketch

If reproducing an extraction attack:

  • Use API access to target LLM
  • Generate diverse query dataset
  • Collect (prompt, response) pairs
  • Train surrogate model:
    • fine-tune smaller LLM
    • or distillation-based training
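The four attack steps above can be sketched end to end. Here `query_target` is a hypothetical stand-in for a real API client, and the JSONL output is one common format for the supervised fine-tuning step:

```python
import json

def query_target(prompt: str) -> str:
    """Placeholder for the target model's API; swap in a real HTTP
    client when reproducing the attack in a sandboxed setting."""
    return "<response for: " + prompt + ">"

def collect_pairs(prompts):
    """Steps 2-3: issue diverse queries and log (prompt, response) pairs."""
    return [{"prompt": p, "response": query_target(p)} for p in prompts]

# Step 4 would fine-tune a smaller open model on this file via
# standard supervised fine-tuning or distillation on the pairs.
pairs = collect_pairs(["Summarize X.", "Translate Y."])
with open("extraction_dataset.jsonl", "w") as f:
    for ex in pairs:
        f.write(json.dumps(ex) + "\n")
```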

For defense:

  • Add output filtering layer
  • Implement query monitoring:
    • detect abnormal query patterns
  • Optionally embed watermark signals
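Query monitoring could start as simply as a per-client sliding-window rate check; the class and thresholds below are illustrative, not from the paper. A production monitor would also track query diversity, since extraction attacks tend to maximize coverage of the input space:

```python
from collections import defaultdict, deque

class QueryMonitor:
    """Toy anomaly detector: flags clients that issue too many queries
    within a sliding time window."""

    def __init__(self, window: float = 60.0, max_queries: int = 100):
        self.window = window            # window length in seconds
        self.max_queries = max_queries  # allowed queries per window
        self.history = defaultdict(deque)  # client_id -> timestamps

    def record(self, client_id: str, timestamp: float) -> bool:
        """Log a query; return True if the client now looks abnormal."""
        q = self.history[client_id]
        q.append(timestamp)
        # Drop timestamps that have aged out of the window.
        while q and q[0] < timestamp - self.window:
            q.popleft()
        return len(q) > self.max_queries
```

Usage: call `record` on every incoming query and rate-limit (or CAPTCHA-gate) clients for which it returns True.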

Open Questions / Next Actions

  • Can multiple attack methods be combined for stronger extraction?
  • How to design defenses that:
    • don’t require retraining
    • maintain high utility?
  • Could my OSINT pipeline be adapted for:
    • model extraction auditing?
  • Explore:
    • prompt leakage detection in real systems

Glossary

  • MEA (Model Extraction Attack): Replicating a model via query access
  • Distillation: Training a model to mimic another model’s outputs
  • Prompt Stealing: Extracting hidden system prompts
  • Output Sanitization: Filtering sensitive outputs
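To make the Distillation entry concrete, here is a minimal soft-label distillation loss: cross-entropy between temperature-softened teacher and student distributions. This is the standard formulation, not something specific to this paper:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """Cross-entropy between softened teacher and student distributions.
    Minimized when the student exactly matches the teacher."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))
```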