Paper Notes: Model Extraction Attacks & Defenses for LLMs

Mar 20, 2026

TL;DR

  • This paper studies how attackers can extract functionality, training data, or prompts from large language models via API access.
  • It provides a structured taxonomy of attacks and defenses, highlighting the unique vulnerabilities of generative transformer models.
  • The key takeaway is that LLM capabilities can be replicated surprisingly cheaply, creating serious risks for intellectual property and privacy.

Bibliographic Snapshot

Field                  Detail
Citation               Zhao et al., KDD 2025
Keywords               LLM security, model extraction, prompt stealing, privacy
Dataset / Benchmarks   None specific (survey paper)
Code / Repo            N/A

Problem Statement

The paper addresses the growing threat of model extraction attacks (MEAs) against large language models deployed via APIs. Attackers with only black-box query access can replicate model functionality, recover training data, or infer proprietary prompts. The challenge is amplified by LLM-specific properties such as generative outputs, large-scale memorization, and standardized transformer architectures. The paper aims to systematically categorize these threats and evaluate corresponding defenses under realistic deployment constraints.

Core Idea

The paper proposes a taxonomy of LLM extraction attacks and defenses:

1. Attack Categories

  • Functionality Extraction
    • API-based knowledge distillation
    • Direct query-based extraction
    • Parameter/architecture recovery
  • Training Data Extraction
    • Prompt-based memorization extraction
    • Private text reconstruction (e.g., activation inversion)
  • Prompt-targeted Attacks
    • Prompt stealing
    • Prompt reconstruction

2. Defense Categories

  • Model Protection
    • Architectural defenses (e.g., watermarking)
    • Output control (response perturbation)
  • Data Privacy Protection
    • Training-time protection (e.g., differential privacy)
    • Output sanitization
  • Prompt Protection
    • Prompt watermarking
    • Query monitoring
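Of these defenses, output sanitization is the most directly implementable. A minimal sketch of the idea (the regexes and placeholder tokens here are my own illustration, not from the paper):

```python
import re

# Hypothetical sanitizer: redacts email addresses and long digit runs
# (e.g. phone numbers) from model responses before they leave the API.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
DIGITS_RE = re.compile(r"\d{7,}")

def sanitize(response: str) -> str:
    """Replace likely-sensitive substrings with placeholder tokens."""
    response = EMAIL_RE.sub("[EMAIL]", response)
    response = DIGITS_RE.sub("[NUMBER]", response)
    return response

print(sanitize("Contact alice@example.com or call 5551234567."))
# → Contact [EMAIL] or call [NUMBER].
```

This directly targets the "structured data (e.g., emails)" leakage the paper flags as especially extractable, at the cost of occasionally redacting benign content.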

3. Evaluation Framework

  • Attack effectiveness:
    • Functional similarity
    • Data recovery rate
  • Defense performance:
    • Security metrics (e.g., detection rate)
    • Utility metrics (e.g., performance degradation)
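Functional similarity can be approximated very crudely as exact-match agreement over a shared probe set; the paper's actual metrics are richer (task accuracy deltas, semantic similarity), but a toy version makes the idea concrete:

```python
def functional_similarity(target_outputs, surrogate_outputs):
    """Fraction of probe prompts on which the surrogate's output exactly
    matches the target's. A crude proxy: real evaluations would use task
    accuracy deltas or semantic similarity rather than string equality."""
    assert len(target_outputs) == len(surrogate_outputs)
    matches = sum(t == s for t, s in zip(target_outputs, surrogate_outputs))
    return matches / len(target_outputs)

print(functional_similarity(["a", "b", "c", "d"], ["a", "b", "x", "d"]))  # → 0.75
```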

Visual / Diagram Notes

  • Figure 1 (page 3): Shows the full model extraction pipeline:
    • User queries → API → target model → responses → attacker trains surrogate model
  • Figure 2 (page 4): Taxonomy tree:
    • Clean separation between attack types and defense strategies
    • Useful for structuring slides or system design

Key Results

  • LLMs can be effectively cloned via API queries, even without internal access.
  • Modern attacks are increasingly query-efficient, requiring fewer interactions.
  • Training data extraction is especially effective for:
    • rare patterns
    • structured data (e.g., emails)
  • Defense trade-offs:
    • Strong protection → reduced utility
    • Weak protection → high extraction risk
  • Table (page 9) shows:
    • No single defense is universally effective across all attack types

Personal Analysis

What worked:

  • Clear taxonomy makes it easy to reason about different attack surfaces.
  • Good separation between:
    • functionality vs data vs prompt extraction
  • Evaluation metrics are well-defined for generative models.

What puzzled me:

  • No concrete quantitative benchmark comparing all attacks/defenses.
  • Some defenses (e.g., watermarking) seem fragile against adaptive attackers.
  • Lack of real-world deployment evaluation.

Connections & Related Work

  • Closely related to:
    • Data extraction / memorization papers (e.g., Carlini et al.)
    • Prompt injection & jailbreak research
  • Compared to my OSINT pipeline:
    • This paper extracts model knowledge
    • My system extracts human knowledge
  • Shared insight:
    • Risk comes from aggregation + synthesis, not raw data

Implementation Sketch

If reproducing an extraction attack:

  • Use API access to target LLM
  • Generate diverse query dataset
  • Collect (prompt, response) pairs
  • Train surrogate model:
    • fine-tune smaller LLM
    • or distillation-based training
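The four attack steps above can be sketched end to end. Here `query_target` is a hypothetical stand-in for a real API client, and the JSONL output is one common format for the supervised fine-tuning step:

```python
import json

def query_target(prompt: str) -> str:
    """Placeholder for the target model's API; swap in a real HTTP
    client when reproducing the attack in a sandboxed setting."""
    return "<response for: " + prompt + ">"

def collect_pairs(prompts):
    """Steps 2-3: issue diverse queries and log (prompt, response) pairs."""
    return [{"prompt": p, "response": query_target(p)} for p in prompts]

# Step 4 would fine-tune a smaller open model on this file via
# standard supervised fine-tuning or distillation on the pairs.
pairs = collect_pairs(["Summarize X.", "Translate Y."])
with open("extraction_dataset.jsonl", "w") as f:
    for ex in pairs:
        f.write(json.dumps(ex) + "\n")
```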

For defense:

  • Add output filtering layer
  • Implement query monitoring:
    • detect abnormal query patterns
  • Optionally embed watermark signals
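Query monitoring could start as simply as a per-client sliding-window rate check; the class and thresholds below are illustrative, not from the paper. A production monitor would also track query diversity, since extraction attacks tend to maximize coverage of the input space:

```python
from collections import defaultdict, deque

class QueryMonitor:
    """Toy anomaly detector: flags clients that issue too many queries
    within a sliding time window."""

    def __init__(self, window: float = 60.0, max_queries: int = 100):
        self.window = window            # window length in seconds
        self.max_queries = max_queries  # allowed queries per window
        self.history = defaultdict(deque)  # client_id -> timestamps

    def record(self, client_id: str, timestamp: float) -> bool:
        """Log a query; return True if the client now looks abnormal."""
        q = self.history[client_id]
        q.append(timestamp)
        # Drop timestamps that have aged out of the window.
        while q and q[0] < timestamp - self.window:
            q.popleft()
        return len(q) > self.max_queries
```

Usage: call `record` on every incoming query and rate-limit (or CAPTCHA-gate) clients for which it returns True.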

Open Questions / Next Actions

  • Can multiple attack methods be combined for stronger extraction?
  • How to design defenses that:
    • don’t require retraining
    • maintain high utility?
  • Could my OSINT pipeline be adapted for:
    • model extraction auditing?
  • Explore:
    • prompt leakage detection in real systems

Glossary

  • MEA (Model Extraction Attack): Replicating a model via query access
  • Distillation: Training a model to mimic another model’s outputs
  • Prompt Stealing: Extracting hidden system prompts
  • Output Sanitization: Filtering sensitive outputs
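To make the Distillation entry concrete, here is a minimal soft-label distillation loss: cross-entropy between temperature-softened teacher and student distributions. This is the standard formulation, not something specific to this paper:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """Cross-entropy between softened teacher and student distributions.
    Minimized when the student exactly matches the teacher."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))
```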