uarchlabs

What RVA23 Actually Asks of a Decoder

2026-05-18T00:00:00+00:00


There is a version of RISC-V processor design that sounds straightforward: pick the extensions you need, implement them, ship. The modular ISA (Instruction Set Architecture) is one of RISC-V’s genuine strengths — you are not dragged into supporting instructions your workload will never use, and the base integer ISA is clean enough that a minimal implementation is genuinely minimal.

The version of RISC-V processor design I am actually doing is different. I am building a server-class out-of-order core targeting the RVA23 application processor profile. That changes the problem considerably.

What a Profile Is and Why It Matters

RISC-V defines mandatory extension sets through profiles rather than through the base ISA. A profile is a named, versioned set of extension requirements — a processor claiming RVA23 compliance must implement a specific list of extensions, all mandatory, no omissions. For the application processor tier, RVA23 is the current generation target: it is what Linux distributions and toolchains can assume is present on a conformant system. To be precise, I am building the RVA23S64 profile [1]. This is a 64-bit profile with supervisor instructions targeting server-class machines. I will use RVA23 throughout as shorthand for RVA23S64.

The mandatory extension list for RVA23 is not short. At minimum it includes the base integer, multiply-divide, and atomic extensions (RV64IMA), single- and double-precision floating point (FD), compressed instructions (C) plus the Zcb additions, bit-manipulation extensions (Zba, Zbb, Zbs), the vector extension (V), vector half-precision float (Zvfhmin), and a collection of smaller extensions covering CSR instructions, cache block operations, additional floating-point instructions, and scalar half-precision float (Zicsr, Zicbom, Zicbop, Zicboz, Zfa, Zfhmin). The hypervisor extension (H) is also required.

For a software developer, this is a feature list. For the processor implementation team, each item on that list is a set of instructions the decode stage must handle correctly, in parallel, at the target fetch width.

Because RVA23 mandates all of these extensions without exception, a conformant processor does not actually need per-extension enable logic at the decoder level — either the full profile is implemented or it is not. I added an extension enable mechanism anyway, as a deliberate engineering choice for validation and silicon bring-up. Being able to disable individual extensions at the decoder level — flagging their instructions as ILLEGAL — is useful during integration testing even when the final product will always run with all extensions active. More on this when we reach that experiment.
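The enable mechanism described above can be pictured with a toy model. This is a minimal Python sketch, not the Pacino RTL: the function names, the string "ILLEGAL" marker, and the three-way classification are all assumptions made for illustration, and a real decoder would classify far more encodings.

```python
# Toy extension-enable gate. Hypothetical names; not the Pacino RTL.
ILLEGAL = "ILLEGAL"

def owning_extension(insn: int) -> str:
    """Classify a 32-bit instruction by extension (toy: three cases only)."""
    opcode = insn & 0x7F
    funct7 = (insn >> 25) & 0x7F
    if opcode == 0x57:                     # OP-V major opcode
        return "V"
    if opcode == 0x33 and funct7 == 0x01:  # OP with funct7=0000001: mul/div group
        return "M"
    return "I"                             # everything else treated as base here

def gate(insn: int, enabled: frozenset) -> str:
    """Return the owning extension, or ILLEGAL if that extension is disabled."""
    ext = owning_extension(insn)
    return ext if ext in enabled else ILLEGAL

MUL_A0_A0_A1 = 0x02B50533   # mul a0, a0, a1
print(gate(MUL_A0_A0_A1, frozenset({"I", "M", "V"})))  # extension enabled
print(gate(MUL_A0_A0_A1, frozenset({"I", "V"})))       # M disabled
```

With M removed from the enabled set, the same encoding is reported as illegal, which is the behavior a bring-up team would use to isolate one extension at a time.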

Instruction Encoding Formats

RISC-V instructions follow a small number of fixed-width encoding formats. Base ISA instructions are 32 bits wide; the C extension adds 16-bit encodings.

Within the 32-bit word, the lower 7 bits are always the opcode, and the remaining fields carry operand register indices, immediate values, and disambiguation fields. The fields that matter most for the decoder are:

funct3 (bits [14:12]): A 3-bit field that distinguishes instructions within the same opcode group
funct6 (bits [31:26]) and funct7 (bits [31:25]): Higher-order disambiguation fields used extensively by the vector extension and arithmetic operations respectively

See [2] for funct3 and funct6 documentation.
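To make the bit ranges above concrete, here is a small Python sketch that extracts the same fields from a 32-bit instruction word. The helper names `bits` and `fields` are hypothetical; the bit positions are the ones listed above.

```python
def bits(word: int, hi: int, lo: int) -> int:
    """Return word[hi:lo], inclusive, like a Verilog part-select."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def fields(insn: int) -> dict:
    """Pull out the decoder-relevant fields of a 32-bit instruction."""
    return {
        "opcode": bits(insn, 6, 0),
        "funct3": bits(insn, 14, 12),
        "funct6": bits(insn, 31, 26),
        "funct7": bits(insn, 31, 25),
    }

SUB_A0_A0_A1 = 0x40B50533   # sub a0, a0, a1 (scalar, funct7 distinguishes it from add)
VADD_VV = 0x022180D7        # vadd.vv v1, v2, v3 (vector, funct6 selects the operation)

for name, insn in (("sub", SUB_A0_A0_A1), ("vadd.vv", VADD_VV)):
    print(name, {k: hex(v) for k, v in fields(insn).items()})
```

Note that funct6 and funct7 overlap: funct7 is funct6 plus bit 25, which the vector encoding uses as the mask bit (vm).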

The vector ALU decode work described in the next post depends almost entirely on funct3 and funct6: funct3 selects the instruction group (integer, floating-point, or mixed), and funct6 selects the specific operation within that group. Getting those field values right from the specification rather than from model training data was a central discipline of the prompts written for the Implementation Assistant (IA), the Claude Code session that produces the RTL.

See [3] for the vector instruction documentation.

The Decode Problem at 8 Instructions Per Cycle

A high-performance out-of-order core does not decode one instruction per cycle. The Pacino target is a fetch bundle of eight 32-bit instructions decoded simultaneously, producing results in a single cycle. Every instruction in the bundle must be identified, its operands extracted, its type classified, and its output routed to the correct downstream packet — all in parallel, all in the same cycle.

At this width, the decoder is not a lookup table with some muxes around it. It is eight parallel decoders operating on independent instructions, sharing only the structural definitions they decode into. Any serial dependency across slots — any logic that says “look at slot N before deciding what to do with slot N+1” — is a potential speed path.

This constraint is not just a performance requirement. It shapes every architectural decision made during decoder implementation.

Why Compressed Instructions Complicate the Front End

The C extension allows 16-bit instruction encodings as compressed forms of common 32-bit instructions. A fetch bundle on an RVA23 processor can contain a mix of 16-bit and 32-bit instructions packed together in memory, and a 32-bit instruction may start on any 16-bit boundary rather than only on a 32-bit one.
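A short Python sketch shows why the mixed stream is awkward. It relies on one fact from the ISA manual: an instruction whose two low bits are not 11 is 16 bits long, otherwise it is 32 bits (longer encodings are ignored here, since RVA23 does not use them). The function name and the tuple layout are hypothetical.

```python
def split_bundle(parcels, base_pc):
    """Walk 16-bit parcels and return (pc, length_in_bytes, raw_bits) per instruction."""
    out = []
    i = 0
    while i < len(parcels):
        pc = base_pc + 2 * i
        lo = parcels[i]
        if (lo & 0b11) != 0b11:
            out.append((pc, 2, lo))                      # compressed instruction
            i += 1
        elif i + 1 < len(parcels):
            out.append((pc, 4, lo | (parcels[i + 1] << 16)))  # 32-bit instruction
            i += 2
        else:
            break   # 32-bit instruction straddles the end of the bundle
    return out

# c.addi a0,1 ; addi a0,a0,1 ; c.addi a0,-1, packed back to back at 0x1000
for pc, length, raw in split_bundle([0x0505, 0x0513, 0x0015, 0x157D], 0x1000):
    print(hex(pc), length, hex(raw))
```

The loop is deliberately serial: where instruction N+1 starts depends on the length of instruction N. That is exactly the cross-slot dependency a wide decoder has to break up in hardware.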

The RISC-V RVC encodings were deliberately designed so that each compressed instruction maps to a single 32-bit instruction: same semantics, denser encoding. This design choice means hardware can expand 16-bit instructions to their 32-bit equivalents early in the pipeline, after which the backend sees only 32-bit instructions and requires no knowledge of the compressed encoding. It simplifies functional unit implementation at the cost of an expander in the front end.

The expansion introduces a bookkeeping obligation: a 16-bit instruction at address 0x1000 that is expanded to 32 bits must retain its original PC and its original 2-byte length throughout the pipeline. The expanded instruction cannot be treated as if it were a native 32-bit instruction at that address, because the sequential next PC is 0x1002, not 0x1004. Precise exceptions, branch targets, and debug information all depend on the original PC. The fetch bundle must carry both the expanded instruction bits and the address of the 16-bit encoding that produced them.
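As an illustration of that bookkeeping, the Python sketch below expands one compressed instruction, C.ADDI, into its ADDI equivalent while keeping the original PC and length alongside the expanded bits. The `FetchSlot` record and function names are hypothetical, and only the plain C.ADDI case is handled (the rd=0 and imm=0 special cases are ignored).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FetchSlot:
    pc: int      # address of the original encoding
    length: int  # original length in bytes (2 for compressed)
    insn: int    # 32-bit instruction bits seen by the backend

def sign_extend(value: int, width: int) -> int:
    sign = 1 << (width - 1)
    return (value ^ sign) - sign

def expand_c_addi(pc: int, half: int) -> FetchSlot:
    """Expand C.ADDI (quadrant 01, funct3 000) to ADDI rd, rd, imm."""
    if (half & 0b11) != 0b01 or (half >> 13) != 0b000:
        raise ValueError("not a C.ADDI encoding")
    rd = (half >> 7) & 0x1F
    imm = sign_extend((((half >> 12) & 1) << 5) | ((half >> 2) & 0x1F), 6)
    insn = ((imm & 0xFFF) << 20) | (rd << 15) | (0b000 << 12) | (rd << 7) | 0x13
    return FetchSlot(pc=pc, length=2, insn=insn)

slot = expand_c_addi(0x1000, 0x0505)          # c.addi a0, 1
print(hex(slot.insn), hex(slot.pc + slot.length))  # expanded bits, sequential next PC
```

The backend sees an ordinary ADDI, but the next sequential PC is still computed from the 2-byte original length.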

The Zcb extension adds additional compressed instruction variants beyond base C. Some Zcb encodings share bit patterns with base C instructions and are distinguished only by specific field values — a subtlety that affected both the expander logic and coverage validation tooling during the IA implementation.

Why the Vector Extension Changes the Architecture

The RVA23 vector extension (RVV) is not simply more instructions. It introduces a separate register file (32 vector registers), a type system for element width and grouping (vtype), and a length register (vl) that affects how many elements each instruction processes. None of these exist in the scalar ISA.

More concretely for the decoder: vector instructions need different information extracted from the encoding than scalar instructions do, they consume different register names, and they go to different execution resources. A dual-packet output architecture — one packet stream for scalar instructions, a separate one for vector — is the natural response. But it means the decode stage produces two parallel output bundles instead of one, and every downstream stage from rename to commit must consume both.

The vtype dependency is particularly interesting. Instructions like vsetvl, vsetvli, and vsetivli set the current vector type (element width, grouping, tail and mask policy), and every subsequent vector instruction consumes that type. This is a data dependency that flows through a special register rather than a general-purpose register, and tracking it correctly matters for performance. I implemented a dedicated combinational pre-decode block that scans the fetch bundle before the main decoder runs, identifies the vsetvl-family instructions, and annotates each slot with vtype dependency information. This keeps the main decoder stateless while giving the rename stage the information it needs to track the dependency correctly.
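A behavioral sketch of such a scan, in Python, might look like the following. It assumes the RVV 1.0 encoding in which the vsetvl family uses the OP-V major opcode (0x57) with funct3 = 111; the annotation format (index of the most recent in-bundle producer, or None) is an assumption for illustration, not the format used in Pacino.

```python
OP_V = 0x57

def is_vset(insn: int) -> bool:
    """True for vsetvl / vsetvli / vsetivli: OP-V with funct3 == 0b111."""
    return (insn & 0x7F) == OP_V and ((insn >> 12) & 0x7) == 0b111

def annotate_vtype(bundle):
    """For each slot, give the index of the latest earlier vset* in the bundle.

    None means the slot's vtype comes from before this bundle.
    """
    producer = None
    notes = []
    for slot, insn in enumerate(bundle):
        notes.append(producer)   # dependency as seen by this slot
        if is_vset(insn):
            producer = slot      # later slots depend on this one
    return notes

VADD_VV = 0x022180D7    # vadd.vv v1, v2, v3
VSETVLI = 0x0D0572D7    # vsetvli t0, a0, e32, m1, ta, ma
ADDI = 0x00150513       # addi a0, a0, 1

print(annotate_vtype([VADD_VV, VSETVLI, VADD_VV, ADDI]))
```

The Python loop is sequential for readability; a combinational block would compute the same result as a priority or prefix structure across the slots.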

A full branch detection pre-decode stage — capable of providing early branch information to the branch predictor with the resolution needed by the BPU and FTQ — is planned but deferred to the fetch unit design phase. The pre-decode block implemented here carries a conservative may_be_branch hint signal set by opcode alone as a placeholder. The full branch pre-decode design requires the fetch unit interface to be defined first, since its output format is tightly coupled to how the BPU consumes early prediction targets.

The Encoding Overlap Problem

Here is a problem that does not appear in extension feature lists but absolutely appears in implementation: vector load and store instructions share opcodes with scalar floating-point loads and stores.

In the RISC-V encoding, opcode 0x07 is OP_LOAD_FP — the scalar floating point load opcode. It is also used by vector load instructions. Opcode 0x27 is OP_STORE_FP and is similarly shared. The disambiguation between scalar FP and vector memory operations happens at the width field within the instruction, not at the opcode level.

This means the decoder cannot route these instructions by opcode alone. For 0x07 and 0x27, it must inspect the width field first, then decide which decode path applies. The scalar FP path must be preserved exactly; the vector memory path must extract entirely different information from the same instruction bits.

This is not a theoretical edge case. vle32.v — load a vector of 32-bit elements — uses opcode 0x07. Without explicit disambiguation logic, a decoder would misidentify every vector load as a scalar FP load. Getting this right while keeping the scalar path unchanged is one of the more interesting problems in building an RVA23 decoder.
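The disambiguation can be sketched in a few lines of Python. The width encodings come from the vector specification [3]: width values 001, 010, 011, and 100 are scalar FP accesses, while 000, 101, 110, and 111 are vector element widths of 8, 16, 32, and 64 bits. The function name and the returned strings are hypothetical.

```python
OP_LOAD_FP = 0x07
OP_STORE_FP = 0x27

# width field, bits [14:12]
VECTOR_WIDTHS = {0b000: 8, 0b101: 16, 0b110: 32, 0b111: 64}
SCALAR_FP_WIDTHS = {0b001: 16, 0b010: 32, 0b011: 64, 0b100: 128}  # 128 (Q) is not in RVA23

def classify_fp_mem(insn: int) -> str:
    """Route a LOAD-FP / STORE-FP instruction to the scalar FP or vector path."""
    opcode = insn & 0x7F
    if opcode not in (OP_LOAD_FP, OP_STORE_FP):
        raise ValueError("not a LOAD-FP/STORE-FP opcode")
    kind = "load" if opcode == OP_LOAD_FP else "store"
    width = (insn >> 12) & 0x7
    if width in VECTOR_WIDTHS:
        return f"vector_{kind}"
    return f"scalar_fp_{kind}"

print(classify_fp_mem(0x02056087))  # vle32.v v1, (a0)
print(classify_fp_mem(0x00052507))  # flw fa0, 0(a0)
```

Both example instructions carry opcode 0x07; only the width field separates them.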

What We Set Out to Build

Given all of this, Pacino’s decoder implementation had four concrete requirements.

First, eight-instruction parallel decode with single-cycle latency. No serial dependencies between slots.

Second, complete RVA23S64 coverage: every mandatory instruction from every mandatory extension, plus correct handling of disabled extensions, which produce an ILLEGAL decode packet for downstream exception handling. This last constraint was a self-imposed, forward-looking feature intended to assist bring-up.

Third, a dual-packet output architecture. Scalar instructions produce decode_pkt_t; vector instructions produce vec_decode_pkt_t. A steering signal tells downstream stages which packet to consume for each slot.
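A rough Python model of the dual-packet idea follows. The class names echo decode_pkt_t and vec_decode_pkt_t, but every field in them is an illustrative assumption, as is the rule used to steer (OP-V, or LOAD-FP/STORE-FP with a vector width). The real packets carry much more.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class DecodePkt:        # stands in for decode_pkt_t; fields are illustrative
    rd: int
    rs1: int
    rs2: int

@dataclass
class VecDecodePkt:     # stands in for vec_decode_pkt_t; fields are illustrative
    vd: int
    vs1: int
    vs2: int
    funct6: int

def is_vector(insn: int) -> bool:
    opcode = insn & 0x7F
    width = (insn >> 12) & 0x7
    return opcode == 0x57 or (opcode in (0x07, 0x27) and width in (0b000, 0b101, 0b110, 0b111))

def decode_slot(insn: int) -> Tuple[bool, Optional[DecodePkt], Optional[VecDecodePkt]]:
    """Decode one slot; the first element is the steering signal."""
    a, b, c = (insn >> 7) & 0x1F, (insn >> 15) & 0x1F, (insn >> 20) & 0x1F
    if is_vector(insn):
        return True, None, VecDecodePkt(vd=a, vs1=b, vs2=c, funct6=(insn >> 26) & 0x3F)
    return False, DecodePkt(rd=a, rs1=b, rs2=c), None

def decode_bundle(bundle):
    # Each slot is decoded from its own bits only: no cross-slot dependency.
    return [decode_slot(insn) for insn in bundle]

for steer, spkt, vpkt in decode_bundle([0x00B50533, 0x022180D7]):
    print("vector" if steer else "scalar", spkt or vpkt)
```

Because `decode_slot` looks only at its own instruction, the eight slots can be evaluated independently, which mirrors the first requirement.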

Fourth, a combinational pre-decode block that identifies vtype-producing instructions before the main decoder runs, annotates the bundle with dependency information, and provides a clean interface for the rename stage to track vtype without the main decoder holding any state.

The implementation used a structured AI co-design methodology with a dual assistant architecture: Claude.ai for architectural planning and experiment design, Claude Code for RTL implementation and automated verification. Experiments were isolated to single sessions with defined hypotheses, explicit deliverables, and ground-truth verification against the riscv-opcodes [4] repository rather than model training data.

The next post describes how I built the scalar foundation and worked through the full vector ALU instruction space.

References

[1] RVA23 Profiles
    https://github.com/riscv/riscv-profiles/blob/main/src/rva23-profile.adoc
    accessed 2026.05.01
[2] RISC-V Unprivileged ISA Specification 
    https://github.com/riscv/riscv-isa-manual
    accessed 2026.05.01
[3] RISC-V Vector Extension Specification (RVV 1.0)
    https://github.com/riscvarchive/riscv-v-spec/blob/master/v-spec.adoc
    accessed 2026.05.01
[4] riscv-opcodes 
    https://github.com/riscv/riscv-opcodes
    accessed 2026.05.01

Jeff Nye is a microprocessor architect with 35 years of industry experience spanning performance modeling, RTL implementation, and architecture for high-performance OOO processors. He has contributed RTL to Pentium 4, ARM V7, TI C6x and RISC-V designs, and recently served as sole architect and full-stack implementer of the TAGE-SC-L + ITTAGE branch prediction cluster in an 8-issue RVA23 RISC-V processor, from research through timing closure at 2.75 GHz. He holds 20+ issued patents in processor design, architecture, and hardware virtualization. He is the author of Pacino and the uarchlabs methodology documented here.

Connect on LinkedIn.

Project Rationale

2026-05-11T00:00:00+00:00


The emergence of LLMs raises a practical question for high-performance, large-scale processor development: can a standards-compliant, competitively performant design be built by a very small team on a small budget? If yes, the implications for corporate development in team size, schedule, and cost are significant. This project is designed to explore these questions.

To investigate this I am building Pacino — an RVA23S64 8-issue OOO RISC-V processor targeting competitive SPECint2006 performance. The target is deliberately ambitious. A simple pipeline would not stress the methodology; a design at this complexity will expose where AI assistance genuinely helps and where it breaks down.

Efficiently evaluating AI-generated RTL requires domain expertise, both to direct the work and to judge the output. Microarchitecture depth, verification strategy, and performance correlation methodology are each useful prerequisites for an honest assessment. They also lower token consumption, which keeps the method efficient in both outcome and cost.

For other scope-related discussions there is an FAQ.

Goals

Defining “practicality” requires a specific focus on methodology. This work is driven by a central investigative question:

What prompting structures and methodology processes yield the best results when using LLMs for the co-design of a high-performance RISC-V processor?

In addition to the primary goal, I am evaluating other qualitative and quantitative characteristics:

Context Management: Developing repeatable mechanisms for managing context in the Planning Assistant (PA) and Implementation Assistant (IA).
Task Scaling: Establishing an intuition for the size of a design task relative to the context required.
Human-in-the-Loop Requirements: Determining the level of human interaction and domain expertise necessary to achieve functional results.
Future Impact: Assessing how this methodology might reshape the workflow and composition of future microprocessor design teams.

Methodology

I structured the approach around four complementary elements: a dual AI assistant architecture that separates strategic planning from implementation, a context isolation strategy that keeps individual experiments clean, a structured prompt template that enables automated results reporting and analysis, and a structured handoff process that preserves continuity across planning sessions.

Dual AI Assistant Architecture

The methodology utilizes two distinct Claude interfaces:

Claude.ai (Web): Serves as the Planning Assistant (PA).
Claude Code (Terminal): Serves as the Implementation Assistant (IA).

The roles were assigned based on the native capabilities of each interface. This approach addresses the fundamental challenge of maintaining both strategic architectural thinking and detailed implementation capability within the constraints of AI context windows.

Claude.ai (Web Interface) — Planning Assistant (PA)

The PA serves as the strategic actor. Its primary functions include high-level architectural guidance, experimental methodology design, structured prompt generation for implementation work, results evaluation, and session-to-session knowledge transfer via handoff documents.

In this role, the PA is responsible for:

Design space exploration and trade-off analysis.
Interface specification and module boundary decisions.
Experimental planning and hypothesis formation.
Cross-session state management via structured documentation.
Quality assessment of implementation results, contrasted with the user's own assessment.

For context management, the PA maintains conversational history for architectural reasoning, accesses past session data through search tools when needed, preserves design rationale and decision context, and tracks experimental methodology evolution.

User interaction is central to this phase. The user makes the final decisions on the order and scope of implementation tasks and on questions of standards compliance, and works interactively with the PA to generate specifications and design rules. The PA has no access to the IA file system or source control repositories.

Claude Code (Terminal Interface) — Implementation Assistant (IA)

The IA serves as the execution actor. Its primary functions include direct SystemVerilog RTL generation and modification, file system access for reading and writing project files, compilation, linting, and testing through Verilator integration, and testbench creation and verification.

In this role, the IA is responsible for:

Production-quality RTL code generation.
Adherence to coding style and structural requirements.
Integration with existing build and verification flows.
Technical constraint satisfaction (timing, area, and functionality).

For context management, the IA reads project guidelines from CLAUDE.md automatically but maintains no persistent state between sessions. The IA operates with a “clean context” for each task. It is the responsibility of the PA session to declare the Minimal Viable Context required explicitly through the prompt for any given implementation task.

The IA currently has read/write privileges to the file system but has no knowledge of the source control system (Git) or of the repo.

Workflow Integration Pattern

Strategic Planning Phase (PA/User) — I analyze requirements and constraints with the PA, review previous session results and lessons learned, define the experimental hypothesis and success criteria, and generate a structured implementation prompt with complete context specification. This is also the phase where specifications are developed as context. Domain knowledge informs the scope and order of tasks throughout.
Transfer Phase (User-mediated) — I chose to keep this as a manual step due to permissions and security considerations. The PA has no access to the file system. I make the PA-generated task file available to the IA environment, ensure all referenced files and contexts are accessible, update the repo with the latest accepted edits, and initiate the implementation session.
Implementation Phase (IA) — The IA executes RTL implementation per the structured prompt, performs compilation, linting, and basic verification, generates a results summary identifying any issues, and produces deliverables ready for integration. The IA populates a structured results section in the task file and reports a summary to the console.
Evaluation Phase (PA/User) — I review implementation results against the IA run — time, context used, model, completion status — and write an assessment of the results. I provide my analysis and the IA results to the PA for further analysis. Status and technical debt are recorded and I plan the next experimental phase or iteration.
Knowledge Preservation Phase (User-mediated) — I judge the PA’s remaining context and effectiveness. If warranted I initiate a session handoff — refreshing context with the previous handoff document and requesting that the PA produce the handoff document for the next session, recording architectural decisions, rationale, and updates to project status and planning documents.

Workflow Summary

1. With the PA, I discuss the next tasks or experiments, agree on scope, and provide any implementation specifications, interfaces, etc.
2. I provide the PA with the task template, and the PA populates the IA session prompt.
   - These task files use a numbering scheme: DECODE-001.md, etc.
3. I transfer the populated task file to the IA file system at ./prompts.
4. I start a fresh Claude Code session.
   - claude
   - There are additional options to control Claude automation: --auto-accept-edits or --dangerously-skip-permissions.
   - This is a user choice. It is independent of the methodology.
5. I specify the /run command.
   - /run
   - The run command locates the task file, verifies its format, extracts the prompt, and executes the instructions.
6. The IA runs, reports summary results to the console, and writes to the ::RESULTS CAPTURE:: section of the task file.
7. I populate the header data fields with run statistics, optionally edit the User Assessment section, and paste the IA console output into the task file.
   - This step supports the experimental record; it is part of the methodology documentation, not the design flow itself.
8. I share the completed task file with the PA, discuss results, record decisions, and plan the next task.
   - This is interactive and can generate a number of actions, technical debt, additional or clarified documents, or occasionally require updates to CLAUDE.md.
9. Once ready, I commit the Git repo changes.
   - The IA does not have knowledge of the repo. This is a deliberate design choice.
   - I also mirror the repo on a separate file system distinct from the file system the IA has access to.
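The task file numbering scheme lends itself to a small helper. The Python sketch below picks the next task file name for a module, assuming names of the form MODULE-NNN.md with a three-digit counter as in DECODE-001.md. The regular expression, the function name, and the idea of automating this step are all assumptions; in the workflow described above the naming is done by the PA and the user.

```python
import re

# Assumed pattern: upper-case module tag, dash, three digits, ".md"
TASK_RE = re.compile(r"^([A-Z][A-Z0-9_]*)-(\d{3})\.md$")

def next_task_name(existing, module):
    """Return the next unused task file name for `module`."""
    numbers = [
        int(m.group(2))
        for name in existing
        if (m := TASK_RE.match(name)) and m.group(1) == module
    ]
    return f"{module}-{max(numbers, default=0) + 1:03d}.md"

names = ["DECODE-001.md", "DECODE-002.md", "RENAME-001.md", "notes.txt"]
print(next_task_name(names, "DECODE"))
print(next_task_name(names, "FETCH"))
```

In practice the list of names would come from a directory listing of ./prompts.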

Because the PA also has context limits, a session handoff eventually becomes necessary. The usual indications are incomplete or inaccurate answers from the PA, forgotten instructions from earlier in the session, and similar symptoms.

In this case, I supply the PA with the SESSION_HANDOFF.md template, a copy of the previous session handoff file, and the current and next session numbers, and ask the PA to generate the next session handoff document. The PA produces session_handoff-NNN.md with:

Key architectural decisions and their reasoning
Technical debt inventory
Tools status and known issues
Next steps in priority order
Anything not captured elsewhere in the repo

When starting the next session, supply PROJECT_STATUS.md and the latest session_handoff-NNN.md file. If flows or CLAUDE.md were changed in the last session, supply PROJECT_CORE.md and/or CLAUDE.md as well.

Methodology Mechanics

MD support files

MD files form the conventional basis for interacting with IA and PA.

File / Directory	Description
`./CLAUDE.md`	Canonical baseline context, constant across IA sessions. Covers purpose, text output rules, fixed constraints (e.g. read fully before write), and how the IA should respond to conflicting or poorly defined requirements.
`./pa_handoffs/`	Previous PA session handoff files.
`./planning/PROJECT_CORE.md`	High level description of project intent, scope, roles, workflow, conventions, and 3rd party tool status. Supplied to PA only when project-level changes occur — new steps, new tools, methodology changes.
`./planning/PROJECT_STATUS.md`	Current project state: module status, technical debt, development and design open items, SV package conventions, key cluster/module parameters, prompt generation guide, architecture decisions, and prompt decomposition list. Used in handoff and planning sessions.
`./planning/arch/`	Contains documentation of architecture decisions and guidance. These documents are tactically supplied as reference context in IA prompts.
`./planning/interfaces/`	Contains definition of module ports necessary for sharing between modules and subsystems. This is the primary mechanism to ensure minimal issues with interoperability. These documents are tactically supplied as reference context in IA prompts.
`./planning/testbenches/`	Contains context for test bench guidance.
`./planning/tools/`	3rd-party tool capabilities and usage. These are not Claude tools or skills.
`./prompts/`	IA task files generated by PA using the task template, labeled by module and iteration e.g. `DECODE-002.md`.
`./templates/TASK_TEMPLATE.md`	Structured document populated by PA with goals and IA prompt. Contains task header (ID, context stats, runtime, model, resume SHA, status), a user assessment section, the extracted IA prompt, and a results capture section the IA populates. Once populated, labeled by module and iteration (e.g. `DECODE-002.md`) and stored in `./prompts/`.
`./templates/SESSION_HANDOFF.md`	Structured document populated by PA at session handoff. Records session progress, decisions carried forward, prompts generated, and PROJECT_STATUS.md updates. PROJECT_CORE.md and PROJECT_STATUS.md updates are applied manually.

Evaluation Criteria

The primary evaluation metric is projected SPEC CPU2006 and CPU2017 IPC, derived from a validated C++ performance model executing SimPoints. Model validation is established by correlating against the RTL using a common microarchitectural event schema anchored to the RISC-V Hardware Performance Monitor specification, with RISC-V micro-benchmarks, Dhrystone, and CoreMark as the correlation workloads. Linux boot on an FPGA platform is anticipated as a further correctness validation and provides a natural environment for HPM counter verification. PPA characterization and silicon measurement remain open for future work.
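For readers unfamiliar with SimPoint-style projection, the arithmetic is a weighted combination of per-interval results. The Python sketch below assumes each simulation point represents an equal-length instruction interval with a weight equal to the fraction of the program it stands for, so whole-program CPI is the weighted sum of interval CPIs and IPC is its reciprocal. The function name and the numbers are hypothetical, not Pacino results.

```python
def projected_ipc(simpoints):
    """Combine (weight, ipc) pairs into a whole-program IPC estimate.

    IPC values are averaged harmonically: weights apply to instruction
    counts, so they weight cycles-per-instruction, not IPC directly.
    """
    total_weight = sum(w for w, _ in simpoints)
    cpi = sum(w / ipc for w, ipc in simpoints) / total_weight
    return 1.0 / cpi

# Two hypothetical simulation points with equal weight
points = [(0.5, 2.0), (0.5, 4.0)]
print(round(projected_ipc(points), 3))
```

Averaging the two IPC values arithmetically would give 3.0; the weighted-CPI form gives a lower figure because the slower interval consumes more cycles.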

Summary

The dual assistant architecture is a practical solution to a real problem: maintaining architectural coherence across a long, complex design while keeping individual implementation sessions clean and reproducible. The PA handles design reasoning and continuity. The IA handles execution. The user owns every decision. Whether this approach can produce a competitive 8-issue OOO processor is the question this project is designed to answer. The methodology, the prompts, the failures, and the results will all be published. That transparency is part of the point.


Introducing uarchlabs and the Pacino blog series

2026-05-10T00:00:00+00:00

uarchlabs is an open source hardware design organization. We build high performance processors and publish everything — the RTL, the design decisions, and the AI-assisted methodology used to produce them.

The first project is Pacino: an 8-issue out-of-order RISC-V processor targeting the RVA23S64 profile with competitive SPECint2006 performance as the design goal. The target is deliberately ambitious — a simple pipeline would not stress the methodology.

This blog series documents the design and the process. Posts cover architectural decisions, experiment results, and the AI co-design flow as it develops. The methodology, the prompts, the successes and failures, and the results will be published. We are making transparency part of the point.

The next post covers the project rationale, methodology, and workflow in detail. A FAQ covering scope, tooling, and design targets is at uarchlabs.com/faq.html.

Jeff Nye is a microprocessor architect with 35 years of industry experience spanning performance modeling, RTL implementation, and architecture for high-performance OOO processors. He has contributed RTL to Pentium 4, ARM V7, TI C6x and RISC-V designs, and recently served as sole architect and full-stack implementer of the TAGE-SC-L + ITTAGE branch prediction cluster in an 8-issue RVA23 RISC-V processor — from research through timing closure at 2.75 GHz. He holds 20+ issued patents in processor design, architecture, and hardware virtualization.
