Paper Notes: SoK – TEE Design Choices
Feb 24, 2026
TL;DR (3 sentences max)
- This paper systematizes the design space of hardware-based server-side Trusted Execution Environments (TEEs).
- It proposes TRAF (TEE Runtime Architectural Framework) to analyze how TEEs split runtime resource management between the Trusted Computing Base (TCB) and the untrusted host OS.
- The key takeaway: most TEE vulnerabilities stem from how runtime management tasks (CPU, memory, I/O) are divided across trust boundaries, especially when using unprotected or partially guarded modes.
Bibliographic Snapshot
| Field | Detail |
|---|---|
| Citation | Li et al., ASIA CCS 2024 |
| Keywords | TEE, Confidential Computing, SGX, SEV, TDX, TRAF |
| Platforms Covered | Intel SGX/TDX, AMD SEV/SEV-ES/SEV-SNP, ARM CCA, IBM PEF, Keystone, Penglai, CURE |
| Code / Repo | N/A (SoK paper) |
Problem Statement
Server-side TEEs enable secure remote execution (SRE) in cloud environments, protecting confidentiality and integrity of workloads against a malicious cloud provider. However, modern TEEs vary significantly in design choices (e.g., SGX vs. SEV vs. TDX), making it difficult to reason systematically about security trade-offs. The central question the paper tackles:
How can TEE designs safeguard resources used by TEE instances while still allowing an untrusted OS to manage computing resources efficiently?
The threat model assumes a privileged adversary who controls the host OS and may have physical access; it typically excludes availability attacks and most side channels.
Core Idea
1. TRAF (TEE Runtime Architectural Framework)
TRAF decomposes TEE runtime into OS-style resource management tasks:
- CPU management
  - Scheduling
  - Context switching
  - Interrupt & instruction emulation
- Memory management
  - Virtual memory
  - Physical allocation
  - Page fault handling
  - Memory encryption
- I/O management
  - Data transmission
  - I/O operations
2. Four Runtime Protection Modes
For each task, TEEs choose one of four protection modes:
- Unprotected Mode
  - Host OS fully manages resource.
  - Best performance, largest attack surface.
  - Example: CPU scheduling in most TEEs.
- RTPM-only Mode
  - Managed entirely by Runtime Protection Module (Manufacturer TCB).
  - Strong security, larger TCB, potential performance cost.
  - Example: Context switch in SEV-SNP, TDX.
- RTPM-guarded Mode
  - Host performs management; RTPM verifies correctness.
  - Balance between security and efficiency.
  - Example: Memory allocation in SEV-SNP.
- Instance-assisted Mode
  - TEE instance participates in resource management.
  - Used for virtual memory (e.g., Keystone page handling).
  - Improves isolation but increases complexity.
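To keep the four modes straight, here is a minimal Python sketch that records, for each mode, who performs the management and who (if anyone) checks it. The names `ProtectionMode`, `ModeRoles`, `ROLES`, and `host_acts_unchecked` are my own and do not come from the paper; the role assignments only restate the bullets above, and treating the TEE instance as the "manager" in instance-assisted mode is a simplification.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ProtectionMode(Enum):
    UNPROTECTED = "unprotected"
    RTPM_ONLY = "rtpm-only"
    RTPM_GUARDED = "rtpm-guarded"
    INSTANCE_ASSISTED = "instance-assisted"


@dataclass(frozen=True)
class ModeRoles:
    manager: str             # who performs the management task
    verifier: Optional[str]  # who checks the result, if anyone
    example: str             # example taken from the notes above


# Role assignments restate the notes; they are not a formal definition.
ROLES = {
    ProtectionMode.UNPROTECTED: ModeRoles(
        "host OS", None, "CPU scheduling in most TEEs"),
    ProtectionMode.RTPM_ONLY: ModeRoles(
        "RTPM", None, "context switch in SEV-SNP, TDX"),
    ProtectionMode.RTPM_GUARDED: ModeRoles(
        "host OS", "RTPM", "memory allocation in SEV-SNP"),
    ProtectionMode.INSTANCE_ASSISTED: ModeRoles(
        "TEE instance", None, "Keystone page handling"),
}


def host_acts_unchecked(mode: ProtectionMode) -> bool:
    """True when the untrusted host manages the resource with no verifier."""
    roles = ROLES[mode]
    return roles.manager == "host OS" and roles.verifier is None
```

Under this encoding, only `ProtectionMode.UNPROTECTED` makes `host_acts_unchecked` return `True`, which matches the "largest attack surface" remark.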
Visual / Diagram Notes
Figure 3 (Runtime Events)
Shows how TEE instances interact with:
- CPU scheduling & context switching
- Page table updates & page faults
- I/O data paths via shared memory
This makes clear that TEEs are fundamentally about re-partitioning OS responsibilities across trust boundaries.
Figure 4 (Four Modes)
Graphical depiction of:
- Privileged SW
- RTPM
- TEE Instance
- Runtime resources
It clarifies who controls what in each mode, which makes it an extremely useful mental model.
Figure 5 (Timeline of Design Choices)
Shows evolution from:
- SEV (weak protection)
- SEV-ES
- SEV-SNP (stronger RTPM-only transitions)
- TDX, ARM CCA
Bug icons highlight where design flaws were later exploited.
Trend: Newer TEEs increasingly shift critical operations into RTPM-only mode after real-world attacks.
Key Results
1. Most TEEs follow similar patterns
- CPU scheduling → Unprotected mode
- Context switch → RTPM-only
- Memory allocation → RTPM-guarded
- I/O → Mostly unprotected
2. Vulnerabilities Cluster Around:
- Nested Page Tables (NPT) in SEV
- TLB handling
- Instruction emulation (e.g., CPUID spoofing)
- Unencrypted register state in early SEV
3. Case Study: AMD SEV Evolution
| Version | Weakness | Fix |
|---|---|---|
| SEV | Unencrypted registers | SEV-ES encrypts register state |
| SEV-ES | TLB poisoning | SEV-SNP adds hardware-enforced TLB protection |
| SEV-SNP | (No weakness listed; adds CPUID filtering and RMP checks) | Stronger integrity enforcement |
Big insight: SEV’s early designs relied heavily on unprotected or weakly guarded modes, which directly enabled attacks like:
- SEVered (NPT remapping)
- TLB poisoning
- Ciphertext side channels
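A toy model helps show why an unchecked host mapping enables SEVered-style remapping and how a guarded check blocks it. This is a minimal sketch, not AMD's actual RMP logic: `ToyReverseMap`, `ToyNestedPageTable`, and `host_map` are hypothetical names, and the only assumption is that the RTPM records which guest page each physical page was assigned to and rejects host mappings that disagree with that record.

```python
class ToyReverseMap:
    """Toy stand-in for an RTPM-maintained ownership table (hypothetical)."""

    def __init__(self):
        # physical page -> (owner, guest page) recorded at assignment time
        self._entries = {}

    def assign(self, phys_page, owner, guest_page):
        if phys_page in self._entries:
            raise ValueError("physical page already assigned")
        self._entries[phys_page] = (owner, guest_page)

    def check(self, phys_page, owner, guest_page):
        return self._entries.get(phys_page) == (owner, guest_page)


class ToyNestedPageTable:
    """Host-managed guest-page -> physical-page mapping for one guest."""

    def __init__(self, owner, reverse_map=None):
        self.owner = owner
        self.reverse_map = reverse_map  # None models the unprotected mode
        self.mapping = {}

    def host_map(self, guest_page, phys_page):
        """Host OS proposes a mapping; the check (if present) may reject it."""
        if self.reverse_map is not None and not self.reverse_map.check(
            phys_page, self.owner, guest_page
        ):
            return False
        self.mapping[guest_page] = phys_page
        return True


# Unprotected: the host redirects guest page 0x10 to a different physical page.
unprotected = ToyNestedPageTable("vm1")
unprotected.host_map(0x10, 0xA)
remap_unprotected = unprotected.host_map(0x10, 0xB)  # accepted

# Guarded: the same remap is rejected, since 0xB was assigned to guest page 0x20.
rmap = ToyReverseMap()
rmap.assign(0xA, "vm1", 0x10)
rmap.assign(0xB, "vm1", 0x20)
guarded = ToyNestedPageTable("vm1", rmap)
first_map = guarded.host_map(0x10, 0xA)      # accepted
remap_guarded = guarded.host_map(0x10, 0xB)  # rejected
```

The host still does the mapping work in both cases; the guarded variant only adds a verification step, which is the essence of RTPM-guarded mode as described in these notes.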
Personal Analysis
What worked
- TRAF is a clean abstraction layer. It feels like an OS textbook classification applied to confidential computing.
- The CPU/Memory/I/O breakdown is pedagogically powerful.
- The SEV case study shows concrete evolution under attack pressure — great for understanding real-world design iteration.
What puzzled me
- TRAF does not deeply model microarchitectural side channels, even though they dominate practical attacks.
- TCB size comparison remains qualitative due to proprietary vendor implementations.
Connections & Related Work
- Connects directly to:
  - SGX Explained (Costan & Devadas)
  - SEVered (Morbitzer et al.)
  - SEV-SNP whitepapers
  - Controlled-channel attacks (Xu et al.)
  - Spectre/Meltdown class attacks
- Conceptually related to:
  - Microkernel vs monolithic kernel trust partitioning
  - Virtualization security models
  - Formal SRE definitions (Subramanyan et al.)
Implementation Sketch
If reproducing a research prototype inspired by this:
- Choose platform:
  - SGX (process-based)
  - SEV-SNP (VM-based)
  - Keystone (RISC-V)
- Map runtime tasks:
  - Which are unprotected?
  - Which use guarded checks?
- Evaluate:
  - TCB size
  - Privilege transitions
  - Page fault latency
  - Interrupt handling path
- Attack surface evaluation:
  - Can OS manipulate PTE?
  - Can ASID/TLB be misused?
  - Is instruction emulation trusted?
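The "map runtime tasks" step could start from a small script like the one below. It is a sketch under assumptions: `profile` is a hypothetical example that follows the common pattern listed under Key Results, not a verified description of any real platform, and `HOST_CONTROL_ORDER` (ranking modes from most to least host control) is my own heuristic rather than something the paper defines.

```python
from collections import defaultdict

# Modes ordered from most to least host control (my own heuristic ordering).
HOST_CONTROL_ORDER = [
    "unprotected",
    "rtpm-guarded",
    "instance-assisted",
    "rtpm-only",
]

# Hypothetical task -> mode profile for the platform under study.
profile = {
    "cpu_scheduling": "unprotected",
    "context_switch": "rtpm-only",
    "memory_allocation": "rtpm-guarded",
    "io": "unprotected",
}


def group_by_mode(task_modes):
    """Group task names by the protection mode assigned to them."""
    groups = defaultdict(list)
    for task, mode in task_modes.items():
        if mode not in HOST_CONTROL_ORDER:
            raise ValueError(f"unknown mode: {mode}")
        groups[mode].append(task)
    return dict(groups)


def review_order(task_modes):
    """Return tasks sorted so the most host-exposed ones are audited first."""
    return sorted(
        task_modes,
        key=lambda task: (HOST_CONTROL_ORDER.index(task_modes[task]), task),
    )


groups = group_by_mode(profile)
order = review_order(profile)
```

With this profile, `order` starts with the two unprotected tasks and ends with the context switch, which gives a rough priority list for the attack-surface questions.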
Open Questions / Next Actions
- Can we formally verify correct coordination in RTPM-guarded mode?
- What is the minimal TCB architecture for VM-based TEEs?
- Can we eliminate controlled-channel attacks without massive performance cost?
- How does TRAF extend to accelerator-attached TEEs (e.g., GPU confidential computing)?
Glossary
TEE – Trusted Execution Environment
TCB – Trusted Computing Base
RTPM – Runtime Protection Module (Manufacturer TCB component)
NPT – Nested Page Table
ASID – Address Space Identifier
Controlled-channel attack – Attack exploiting OS-controlled resources (e.g., page faults, interrupts)
SRE – Secure Remote Execution
Personal Takeaway
This paper matters to me because it reshaped how I think about confidential computing from a cloud infrastructure perspective. In conventional server environments, the hypervisor and host OS are trusted to manage CPU scheduling, memory, and I/O; this paper shows how TEEs instead treat them as adversaries. The TRAF framework helped me understand that many TEE vulnerabilities result from how runtime resource management is divided across trust boundaries. I'm interested in how different TEEs trade off performance against security, especially when deciding whether to leave components in unprotected mode. Two questions I would like to leave open: How should TEEs adapt to modern cloud environments that rely heavily on GPUs for confidential workloads? Are some security risks unavoidable in TEEs because cloud systems must share and manage resources efficiently?