Quad-Core Attention Accelerator
Mar 22, 2026
Project Overview
This project is a quad-core hardware attention accelerator implemented in Verilog RTL and released as a public snapshot of the final Step 6 design. The repository packages the main compute blocks, top-level integration RTL, representative testbenches, sparse dataset generation scripts, simulation vectors, and selected synthesis and place-and-route artifacts.
- GitHub Repository: JustinLinKK/quad-core-attention-accelerator
The design focuses on scaling an attention-style compute pipeline beyond a single core while still handling practical implementation constraints such as memory overlap, timing pressure, sparse execution behavior, and cross-core accumulation.
Objectives
- Scale Attention Compute: Extend a single-core datapath into a grouped quad-core architecture for higher throughput.
- Improve Data Movement: Reduce idle time with double-buffered loading for the major on-chip memories.
- Support Sparse Execution: Add zero-skipping and B1/B2 control behavior to avoid unnecessary switching.
- Preserve Timing Feasibility: Clean up the datapath and add selective pipelining for a 1 GHz design target.
- Publish End-to-End Evidence: Package verification, synthesis, and physical-design artifacts alongside the RTL.
Architecture Highlights
- Quad-Core Integration: `core.v` and `fullchip.v` compose the accelerator into a multi-core top level with adaptive grouping modes.
- Adaptive Grouping: Supports enabled-core layouts such as `[2,2]`, `[1,1,2]`, and full 4-core operation.
- Cross-Core Accumulation: `cross_core_adder.v` provides the accumulation path across cores.
- Compute Fabric: Built around `mac_array`, `mac_col`, and `sfp_optimized` for the main attention datapath.
- Memory / Control Path: Includes scratchpad-style local storage, output FIFO support, and clock-domain-crossing logic.
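The grouping and cross-core accumulation ideas above can be sketched as a small behavioral model. Everything here is illustrative, not the RTL interface: the function names, the dot-product workload, and the even slicing across cores are all assumptions. The point is just that cores in the same group produce partial sums that a cross-core adder stage combines.

```python
# Behavioral sketch of adaptive core grouping plus cross-core accumulation.
# All names and the dot-product workload are hypothetical, not the RTL API.

def core_partial_sum(q_rows, k_rows):
    """One core's contribution: sum of elementwise Q*K products over its slice."""
    return sum(q * k for q, k in zip(q_rows, k_rows))

def run_grouped(q, k, grouping):
    """Split the workload across four cores per `grouping` (e.g. [2, 2] or
    [1, 1, 2]), then accumulate each group's partials, mimicking the role
    of cross_core_adder in the real design."""
    assert sum(grouping) == 4, "quad-core device: enabled cores must total 4"
    chunk = len(q) // 4          # each core works on an equal slice (assumed)
    results, start = [], 0
    for group_size in grouping:
        partials = []
        for core in range(start, start + group_size):
            lo, hi = core * chunk, (core + 1) * chunk
            partials.append(core_partial_sum(q[lo:hi], k[lo:hi]))
        results.append(sum(partials))   # cross-core accumulation per group
        start += group_size
    return results

q = [1, 2, 3, 4, 5, 6, 7, 8]
k = [1, 1, 1, 1, 1, 1, 1, 1]
print(run_grouped(q, k, [2, 2]))     # two groups of two cores -> [10, 26]
print(run_grouped(q, k, [1, 1, 2]))  # three groups -> [3, 7, 26]
```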
Key Design Improvements
1. Double-Buffered Loading
The final design adds double buffering in `pmem`, `kmem`, and `qmem` so the next tile can be loaded while the current tile is still being processed. This improves overlap between memory movement and compute.
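The ping-pong pattern can be sketched in a few lines. This is a sequential model of the buffer bookkeeping only (names are illustrative); in hardware the "load next tile" and "compute current tile" steps run concurrently, which is where the overlap benefit comes from.

```python
# Double-buffer (ping-pong) sketch: while compute consumes the active buffer,
# the load side fills the shadow buffer, then the roles swap each tile.
# In hardware the two steps inside the loop happen in parallel.

def double_buffered_run(tiles, process):
    buffers = [None, None]
    buffers[0] = tiles[0]                    # prefetch the first tile
    results = []
    for i in range(len(tiles)):
        active, shadow = i % 2, (i + 1) % 2
        if i + 1 < len(tiles):
            buffers[shadow] = tiles[i + 1]   # "load" next tile (overlaps compute)
        results.append(process(buffers[active]))  # "compute" on current tile
    return results

print(double_buffered_run([[1, 2], [3, 4], [5, 6]], sum))  # -> [3, 7, 11]
```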
2. Sparse B1 / B2 Control
The accelerator includes sparsity-aware behavior for B1/B2 filtering experiments, along with skip-normalization flow in the optimized sparse processing path. This lets the design model more realistic sparse attention workloads instead of only dense matrix flow.
3. Zero-Skipping and Gating
All-zero Q or K rows can suppress unnecessary register and MAC activity, reducing wasted switching and improving efficiency on sparse inputs.
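A minimal behavioral sketch of the zero-skipping idea, with a counter standing in for the register and MAC toggling that the gating avoids. The row-granularity skip and all names are assumptions for illustration.

```python
# Zero-skipping sketch: an all-zero Q or K row contributes nothing to the
# accumulation, so the MAC work for that row is skipped entirely instead of
# clocking multiplies by zero. `macs_fired` models avoided switching activity.

def mac_with_zero_skip(q_rows, k_rows):
    acc = 0
    macs_fired = 0
    for q_row, k_row in zip(q_rows, k_rows):
        if not any(q_row) or not any(k_row):
            continue                          # gate off: no activity this row
        acc += sum(a * b for a, b in zip(q_row, k_row))
        macs_fired += 1
    return acc, macs_fired

q = [[1, 2], [0, 0], [3, 0]]
k = [[1, 1], [5, 5], [0, 0]]
print(mac_with_zero_skip(q, k))  # only the first row fires -> (3, 1)
```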
4. Timing-Aware Datapath Cleanup
The published Step 6 RTL adds a small pipeline stage in `mac_col` to improve timing near the 1 GHz target, accepting a modest extra cycle of output latency in exchange for cleaner critical-path behavior.
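The latency-for-timing tradeoff can be modeled behaviorally. Where the pipeline cut actually lands in `mac_col` is not stated in the repo; the multiply-then-accumulate split below is an assumption chosen to show the one-cycle delay.

```python
# Sketch of inserting a pipeline register mid-datapath: the same accumulated
# results emerge one cycle later, trading latency for a shorter critical path.
# The stage boundary (after the multiply) is an assumed, illustrative cut.

def mac_col_pipelined(a_stream, b_stream):
    stage_reg = None              # the inserted pipeline register
    acc = 0
    outputs = []
    for a, b in zip(a_stream, b_stream):
        if stage_reg is not None:
            acc += stage_reg      # stage 2: accumulate last cycle's product
        outputs.append(acc)
        stage_reg = a * b         # stage 1: multiply, register the result
    if stage_reg is not None:
        acc += stage_reg          # extra cycle to flush the final product
        outputs.append(acc)
    return outputs

# First output is 0: the one-cycle latency cost of the new stage.
print(mac_col_pipelined([1, 2, 3], [4, 5, 6]))  # -> [0, 4, 14, 32]
```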
5. Parallel Core Loading
Input-duplication muxing allows grouped cores to be loaded in parallel from a shared input path, which makes the multi-core modes more practical from a system-integration standpoint.
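The broadcast idea reduces to very little code in a behavioral sketch: one shared input word fans out to every enabled core per "cycle" instead of cores being loaded serially. The names here are hypothetical.

```python
# Input-duplication mux sketch: each word arriving on the shared input path is
# duplicated to all enabled cores in the group, so grouped cores fill their
# local memories in parallel rather than one core at a time.

def broadcast_load(words, enabled_cores):
    core_mem = {c: [] for c in enabled_cores}
    for word in words:              # one shared input path
        for c in enabled_cores:     # mux duplicates the word to each core
            core_mem[c].append(word)
    return core_mem

mem = broadcast_load([10, 20], enabled_cores=[0, 1])
print(mem)  # -> {0: [10, 20], 1: [10, 20]}
```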
Repository Contents
- RTL Sources: top-level and block-level Verilog modules for the accelerator
- Testbenches: representative verification benches for single-core, double-buffered, sparse, and full-chip cases
- Dataset Scripts: Python utilities for generating sparse attention-style vectors and B2-oriented datasets
- Simulation Assets: filelists and saved simulation outputs
- Synthesis Collateral: selected block-level synthesis deliverables
- PnR Collateral: packaged place-and-route outputs for key submodules
Verification and Reproducibility
The repo includes sample sparse manifests, full-chip B2 datasets, and simulator filelists that document how the published test cases were assembled. Representative verification coverage spans blocks such as:
`sfp_optimized`, `mac_array`, `mac_col`, `core`, `fullchip`, and `cross_core_adder`.
This makes the project more than a code dump; it also captures the experimental setup used to validate the architecture direction.
Physical Design Status
One of the strongest parts of this release is that it includes physical-design evidence, not just RTL. The repository shows that several standalone blocks are in good shape relative to the target, while acknowledging that hierarchical reintegration and top-level timing closure remain the main follow-on challenges.
That makes the project valuable both as:
- a working RTL accelerator design, and
- a documented hardware implementation study showing where scaling pressure appears in practice.
Outcomes
- Built a quad-core attention accelerator RTL with adaptive multi-core grouping
- Added double-buffered memory flow to improve tile-to-tile overlap
- Implemented sparsity-aware skip behavior and B1/B2 control paths
- Published testbenches, datasets, simulation assets, synthesis, and PnR evidence in one repository
- Captured the real tradeoff between block-level timing success and full-chip integration difficulty
Why It Matters
This project sits at the intersection of AI acceleration and physical hardware realization. Instead of stopping at algorithm ideas or high-level architecture diagrams, it pushes through RTL integration, verification infrastructure, and implementation evidence.
That makes it a strong systems project: it demonstrates not only how an attention accelerator can be structured, but also what it takes to move that design toward a realistic silicon-oriented workflow.