<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://uarchlabs.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://uarchlabs.github.io/" rel="alternate" type="text/html" /><updated>2026-05-18T21:23:43+00:00</updated><id>https://uarchlabs.github.io/feed.xml</id><title type="html">uarchlabs</title><subtitle>Open source high performance hardware design</subtitle><entry><title type="html">What RVA23 Actually Asks of a Decoder</title><link href="https://uarchlabs.github.io/blog/what-rva23-actually-asks-of-a-decoder/" rel="alternate" type="text/html" title="What RVA23 Actually Asks of a Decoder" /><published>2026-05-18T00:00:00+00:00</published><updated>2026-05-18T00:00:00+00:00</updated><id>https://uarchlabs.github.io/blog/what-rva23-actually-asks-of-a-decoder</id><content type="html" xml:base="https://uarchlabs.github.io/blog/what-rva23-actually-asks-of-a-decoder/"><![CDATA[<h1 id="what-rva23-actually-asks-of-a-decoder">What RVA23 Actually Asks of a Decoder</h1>

<p>There is a version of RISC-V processor design that sounds straightforward:
pick the extensions you need, implement them, ship. The modular ISA 
(Instruction Set Architecture) is
one of RISC-V’s genuine strengths — you are not dragged into supporting
instructions your workload will never use, and the base integer ISA is
clean enough that a minimal implementation is genuinely minimal.</p>

<p>The version of RISC-V processor design I am actually doing is different.
I am building a server-class out-of-order core targeting the RVA23
application processor profile. That changes the problem considerably.</p>

<h2 id="what-a-profile-is-and-why-it-matters">What a Profile Is and Why It Matters</h2>

<p>RISC-V defines mandatory extension sets through profiles rather than
through the base ISA. A profile is a named, versioned set of extension
requirements — a processor claiming RVA23 compliance must implement a
specific list of extensions, all mandatory, no omissions. For the
application processor tier, RVA23 is the current generation target: it
is what Linux distributions and toolchains can assume is present on a
conformant system. To be precise, I am targeting the RVA23S64 profile [1],
a 64-bit profile that adds supervisor-mode requirements and is aimed at
server-class machines. I will use RVA23 throughout as shorthand for RVA23S64.</p>

<p>The mandatory extension list for RVA23 is not short. At minimum it
includes the base integer, multiply-divide, and atomic extensions (RV64IMA),
single and double precision floating point (FD), compressed instructions
(C) plus the Zcb additions, bit-manipulation extensions (Zba, Zbb, Zbs),
the vector extension (V), vector half-precision float (Zvfhmin), and a
collection of smaller extensions covering CSR instructions (Zicsr), cache
block operations (Zicbom, Zicbop, Zicboz), scalar half-precision float
(Zfhmin), and additional floating-point instructions (Zfa). The hypervisor
extension (H) is also required.</p>

<p>For a software developer, this is a feature list. For the processor
implementation team, each item on that list is a set of instructions the
decode stage must handle correctly, in parallel, at the target fetch width.</p>

<p>Because RVA23 mandates all of these extensions without exception, a
conformant processor does not actually need per-extension enable logic at
the decoder level — either the full profile is implemented or it is not.
I added an extension enable mechanism anyway, as a deliberate engineering
choice for validation and silicon bring-up. Being able to disable
individual extensions at the decoder level — flagging their instructions
as ILLEGAL — is useful during integration testing even when the final
product will always run with all extensions active. More on this when we
reach that experiment.</p>
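<p>As a rough illustration of the enable mechanism, the sketch below gates a
decode result on a per-extension enable mask. It is a software model under
stated assumptions, not Pacino's RTL: the <code class="language-plaintext highlighter-rouge">Ext</code>
flags, <code class="language-plaintext highlighter-rouge">owning_ext</code>, and
<code class="language-plaintext highlighter-rouge">decode_status</code> are hypothetical names, and the
classification looks only at the major opcode (bits [6:0]) and funct7 (bits [31:25]),
which is far coarser than a real decoder.</p>

```python
from enum import Flag, auto


class Ext(Flag):
    """Hypothetical enable bits; a real decoder would carry one per extension."""
    I = auto()
    M = auto()
    V = auto()


def owning_ext(word: int) -> Ext:
    """Coarse classification by opcode and funct7 only (illustrative, not complete)."""
    opcode = word & 0x7F
    funct7 = (word >> 25) & 0x7F
    if opcode == 0x57:                             # OP-V major opcode
        return Ext.V
    if opcode in (0x33, 0x3B) and funct7 == 0x01:  # OP / OP-32 with the MULDIV funct7
        return Ext.M
    return Ext.I                                   # everything else lumped into the base ISA


def decode_status(word: int, enabled: Ext) -> str:
    """Return LEGAL if the owning extension is enabled, otherwise ILLEGAL."""
    return "LEGAL" if owning_ext(word) in enabled else "ILLEGAL"


MUL_X3_X1_X2 = 0x022081B3  # mul x3, x1, x2

print(decode_status(MUL_X3_X1_X2, Ext.I | Ext.M | Ext.V))  # all extensions on
print(decode_status(MUL_X3_X1_X2, Ext.I | Ext.V))          # M disabled for bring-up
```

<p>With M removed from the mask, the same instruction bits decode as ILLEGAL
without any change to the classification logic, which is the property that
makes the mechanism useful during integration testing.</p>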

<h2 id="instruction-encoding-formats">Instruction Encoding Formats</h2>

<p>RISC-V instructions follow a small number of fixed-width encoding formats.
Base ISA instructions are 32 bits wide; the C extension adds 16-bit encodings.</p>

<p>Within the 32-bit word, the lower 7 bits are always the opcode, and the remaining fields carry operand register indices, immediate values, and disambiguation fields. The fields that matter most for the decoder are:</p>

<ul>
  <li><strong>funct3</strong> (bits [14:12]): A 3-bit field that distinguishes instructions
within the same opcode group</li>
  <li><strong>funct6</strong> (bits [31:26]) and <strong>funct7</strong> (bits [31:25]): Higher-order
disambiguation fields used extensively by the vector extension and
arithmetic operations respectively</li>
</ul>
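<p>The field positions above can be sketched as a small extraction helper.
This is a minimal Python model, not decoder RTL; the helper names
<code class="language-plaintext highlighter-rouge">bits</code> and
<code class="language-plaintext highlighter-rouge">decode_fields</code> are illustrative,
and the two example encodings were assembled by hand from the base ISA and
vector formats.</p>

```python
def bits(word: int, hi: int, lo: int) -> int:
    """Return word[hi:lo], inclusive at both ends."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_fields(word: int) -> dict:
    """Pull out the opcode and the disambiguation fields of a 32-bit instruction."""
    return {
        "opcode": bits(word, 6, 0),
        "funct3": bits(word, 14, 12),
        "funct6": bits(word, 31, 26),
        "funct7": bits(word, 31, 25),
    }


SUB_X3_X1_X2 = 0x402081B3      # sub x3, x1, x2
VSUB_VV_V1_V2_V3 = 0x0A2180D7  # vsub.vv v1, v2, v3 (unmasked)

for name, word in [("sub", SUB_X3_X1_X2), ("vsub.vv", VSUB_VV_V1_V2_V3)]:
    fields = decode_fields(word)
    print(name, {k: hex(v) for k, v in fields.items()})
```

<p>For <code class="language-plaintext highlighter-rouge">sub</code> the interesting field is
funct7 (0x20), which separates it from <code class="language-plaintext highlighter-rouge">add</code>;
for <code class="language-plaintext highlighter-rouge">vsub.vv</code> it is funct6 (0x02), because
bit 25 of a vector arithmetic instruction is the mask bit rather than part of the function code.</p>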

<p>See [2] for funct3 and funct6 documentation.</p>

<p>The vector ALU decode work described in the next post depends almost
entirely on funct3 and funct6: funct3 selects the instruction group
(integer, floating-point, or mixed), and funct6 selects the specific
operation within that group. Getting those field values right from the
specification rather than from model training data was a central
discipline of the prompts written for the Implementation Assistant (IA),
the AI coding assistant that generated the RTL.</p>

<p>See [3] for the vector instruction documentation.</p>

<p><img src="/assets/diagrams/scalar_encoding_formats.svg" alt="RISC-V base and compressed instruction formats" /></p>

<p><img src="/assets/diagrams/vector_encoding_formats.svg" alt="Vector V extension instruction encoding formats" /></p>

<h2 id="the-decode-problem-at-8-instructions-per-cycle">The Decode Problem at 8 Instructions Per Cycle</h2>

<p>A high-performance out-of-order core does not decode one instruction per
cycle. The Pacino target is a fetch bundle of eight 32-bit instructions decoded
simultaneously, producing results in a single cycle. Every instruction in
the bundle must be identified, its operands extracted, its type classified,
and its output routed to the correct downstream packet — all in parallel,
all in the same cycle.</p>

<p>At this width, the decoder is not a lookup table with some muxes around
it. It is eight parallel decoders operating on independent instructions,
sharing only the structural definitions they decode into. Any serial
dependency across slots — any logic that says “look at slot N before
deciding what to do with slot N+1” — is a potential speed path.</p>

<p>This constraint is not just a performance requirement. It shapes every
architectural decision made during decoder implementation.</p>

<h2 id="why-compressed-instructions-complicate-the-front-end">Why Compressed Instructions Complicate the Front End</h2>

<p>The C extension allows 16-bit instruction encodings as compressed forms
of common 32-bit instructions. A fetch bundle on an RVA23 processor
can contain a mix of 16-bit and 32-bit instructions packed back to back:
instructions are aligned only to 16-bit boundaries, so a 32-bit
instruction can start at any 2-byte-aligned address.</p>
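<p>To make the packing concrete, the sketch below finds instruction
boundaries in a run of 16-bit parcels using the standard length rule: the
two low bits of the first parcel are 0b11 for a 32-bit instruction and
anything else for a 16-bit one. It assumes only 16-bit and 32-bit encodings
are present, and it walks the parcels serially for clarity; a wide hardware
front end would have to resolve the same boundaries in parallel. The function
names are illustrative.</p>

```python
from typing import List, Tuple


def inst_length(first_parcel: int) -> int:
    """Length in bytes, decided by bits [1:0] of the first 16-bit parcel."""
    return 4 if (first_parcel & 0b11) == 0b11 else 2


def split_bundle(parcels: List[int]) -> List[Tuple[int, int]]:
    """Return (parcel_index, length_in_bytes) for each instruction start."""
    starts = []
    i = 0
    while i < len(parcels):
        length = inst_length(parcels[i])
        if length == 4 and i + 1 >= len(parcels):
            break  # upper half is in the next fetch; not resolved in this sketch
        starts.append((i, length))
        i += length // 2
    return starts


# c.addi a0, 1 ; addi a0, a0, 1 (low parcel first) ; c.nop
parcels = [0x0505, 0x0513, 0x0015, 0x0001]
print(split_bundle(parcels))
```

<p>The parcel 0x0015 is the upper half of the 32-bit
<code class="language-plaintext highlighter-rouge">addi</code>. Read in isolation its low bits
would look like a compressed instruction, which is why a boundary cannot be
decided without knowing where the previous instruction ended.</p>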

<p>The RISC-V RVC encodings were deliberately designed so that each compressed
instruction expands to exactly one 32-bit instruction: same semantics,
denser encoding. This design choice means hardware
can expand 16-bit instructions to their 32-bit equivalents early in the
pipeline, after which the backend sees only 32-bit instructions and
requires no knowledge of the compressed encoding. It simplifies
functional unit implementation at the cost of an expander in the front
end.</p>

<p>The expansion introduces a bookkeeping obligation: a 16-bit instruction
at address 0x1000 that is expanded to 32 bits must retain its original PC
and its original 2-byte length throughout the pipeline. The expanded
instruction cannot be treated as if it were a native 32-bit instruction at
that address: the next sequential instruction is at 0x1002, not 0x1004.
Precise exceptions, branch targets, and debug information all depend on
the original PC. The fetch bundle must carry both the expanded instruction
bits and the address of the 16-bit encoding that produced them.</p>
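<p>A minimal sketch of that bookkeeping, using
<code class="language-plaintext highlighter-rouge">c.addi</code> as the only supported
instruction. The <code class="language-plaintext highlighter-rouge">ExpandedInst</code> record
and its field names are assumptions made for illustration; they are not the
fetch bundle format used in Pacino.</p>

```python
from dataclasses import dataclass


@dataclass
class ExpandedInst:
    pc: int      # address of the original encoding
    bits: int    # 32-bit instruction handed to the decoder
    length: int  # 2 when expanded from a compressed form, 4 otherwise


def sign_extend(value: int, width: int) -> int:
    """Interpret the low `width` bits of value as a two's-complement number."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def expand_c_addi(pc: int, parcel: int) -> ExpandedInst:
    """Expand c.addi rd, imm (quadrant 1, funct3 000) into addi rd, rd, imm."""
    assert (parcel & 0b11) == 0b01 and (parcel >> 13) == 0b000
    rd = (parcel >> 7) & 0x1F
    imm = sign_extend((((parcel >> 12) & 1) << 5) | ((parcel >> 2) & 0x1F), 6)
    bits = ((imm & 0xFFF) << 20) | (rd << 15) | (rd << 7) | 0x13
    return ExpandedInst(pc=pc, bits=bits, length=2)


inst = expand_c_addi(0x1000, 0x0505)  # c.addi a0, 1
print(hex(inst.bits), hex(inst.pc), hex(inst.pc + inst.length))
```

<p>The expanded bits are identical to a native
<code class="language-plaintext highlighter-rouge">addi a0, a0, 1</code>, so only the carried
length distinguishes the two cases when the next sequential PC is computed.</p>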

<p>The Zcb extension adds additional compressed instruction variants beyond
base C. Some Zcb encodings share bit patterns with base C instructions
and are distinguished only by specific field values — a subtlety that
affected both the expander logic and coverage validation tooling during the IA
implementation.</p>

<h2 id="why-the-vector-extension-changes-the-architecture">Why the Vector Extension Changes the Architecture</h2>

<p>The RVA23 vector extension (RVV) is not simply more instructions. It
introduces a separate register file (32 vector registers), a type system
for element width and grouping (vtype), and a length register (vl) that
affects how many elements each instruction processes. None of these exist
in the scalar ISA.</p>

<p>More concretely for the decoder: vector instructions need different
information extracted from the encoding than scalar instructions do, they
consume different register names, and they go to different execution
resources. A dual-packet output architecture — one packet stream for scalar
instructions, a separate one for vector — is the natural response. But it
means the decode stage produces two parallel output bundles instead of one,
and every downstream stage from rename to commit must consume both.</p>

<p>The vtype dependency is particularly interesting. Instructions like vsetvl
and vsetvli set the current vector type — element width, grouping, tail and
mask policy — and every subsequent vector instruction consumes that type.
This is a data dependency that flows through a special register rather than
a general-purpose register, and tracking it correctly matters for
performance. I implemented a dedicated combinational pre-decode block that
scans the fetch bundle before the main decoder runs, identifies vsetvl
instructions, and annotates each slot with vtype dependency information.
This keeps the main decoder stateless while giving the rename stage the
information it needs to track the dependency correctly.</p>
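<p>The idea can be sketched in a few lines of Python. The sketch assumes
that a vtype producer is any instruction with the OP-V major opcode (0x57)
and funct3 of 0b111, which is the encoding shared by
<code class="language-plaintext highlighter-rouge">vsetvli</code>,
<code class="language-plaintext highlighter-rouge">vsetivli</code>, and
<code class="language-plaintext highlighter-rouge">vsetvl</code>. The annotation format, a
pair of (is producer, index of the nearest older producer in the bundle), is
a hypothetical stand-in for whatever the real pre-decode block emits.</p>

```python
from typing import List, Optional, Tuple

OP_V = 0x57            # major opcode for vector arithmetic and configuration
FUNCT3_OPCFG = 0b111   # funct3 value used by the vset* family


def is_vset(word: int) -> bool:
    """True for vsetvli, vsetivli, and vsetvl."""
    return (word & 0x7F) == OP_V and ((word >> 12) & 0x7) == FUNCT3_OPCFG


def predecode_vtype(bundle: List[int]) -> List[Tuple[bool, Optional[int]]]:
    """Annotate each slot with (is_producer, nearest older producer slot or None)."""
    notes = []
    last_producer = None
    for slot, word in enumerate(bundle):
        producer = is_vset(word)
        notes.append((producer, last_producer))
        if producer:
            last_producer = slot
    return notes


bundle = [
    0x00150513,  # addi a0, a0, 1
    0x0D0572D7,  # vsetvli t0, a0, e32, m1, ta, ma
    0x022180D7,  # vadd.vv v1, v2, v3
    0x0A2180D7,  # vsub.vv v1, v2, v3
]
for slot, note in enumerate(predecode_vtype(bundle)):
    print(slot, note)
```

<p>A producer index of <code class="language-plaintext highlighter-rouge">None</code> means
the slot depends on whatever vtype was live before the bundle. In hardware
the loop becomes a prefix scan across the eight slots, which is cheap because
each slot contributes a single bit.</p>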

<p>A full branch detection pre-decode stage — capable of providing early
branch information to the branch predictor with the resolution needed by
the BPU and FTQ — is planned but deferred to the fetch unit design phase.
The pre-decode block implemented here carries a conservative
<code class="language-plaintext highlighter-rouge">may_be_branch</code> hint signal set by opcode alone as a placeholder. The
full branch pre-decode design requires the fetch unit interface to be
defined first, since its output format is tightly coupled to how the
BPU consumes early prediction targets.</p>

<h2 id="the-encoding-overlap-problem">The Encoding Overlap Problem</h2>

<p>Here is a problem that does not appear in extension feature lists but
absolutely appears in implementation: vector load and store instructions
share opcodes with scalar floating-point loads and stores.</p>

<p>In the RISC-V encoding, opcode 0x07 is OP_LOAD_FP — the scalar floating
point load opcode. It is also used by vector load instructions. Opcode
0x27 is OP_STORE_FP and is similarly shared. The disambiguation between
scalar FP and vector memory operations happens at the width field within
the instruction, not at the opcode level.</p>

<p>This means the decoder cannot route these instructions by opcode alone.
For 0x07 and 0x27, it must inspect the width field first, then decide
which decode path applies. The scalar FP path must be preserved exactly;
the vector memory path must extract entirely different information from
the same instruction bits.</p>

<p>This is not a theoretical edge case. vle32.v — load a vector of 32-bit
elements — uses opcode 0x07. Without explicit disambiguation logic, a
decoder would misidentify every vector load as a scalar FP load. Getting
this right while keeping the scalar path unchanged is one of the more
interesting problems in building an RVA23 decoder.</p>
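<p>The disambiguation can be sketched as follows. The width field sits in
bits [14:12]: the scalar floating-point loads and stores use 001, 010, 011,
and 100, while the vector memory instructions use 000, 101, 110, and 111 for
8, 16, 32, and 64-bit elements. The function name and the returned labels
are illustrative only.</p>

```python
OP_LOAD_FP = 0x07
OP_STORE_FP = 0x27

# width field (bits [14:12]) -> access size in bits
SCALAR_FP_WIDTHS = {0b001: 16, 0b010: 32, 0b011: 64, 0b100: 128}
VECTOR_WIDTHS = {0b000: 8, 0b101: 16, 0b110: 32, 0b111: 64}


def classify_fp_mem(word: int) -> str:
    """Route a LOAD-FP / STORE-FP instruction to the scalar or vector path."""
    opcode = word & 0x7F
    if opcode not in (OP_LOAD_FP, OP_STORE_FP):
        return "other"
    width = (word >> 12) & 0x7
    return "vector" if width in VECTOR_WIDTHS else "scalar_fp"


FLW_FA0_0_A0 = 0x00052507   # flw fa0, 0(a0)
VLE32_V8_A0 = 0x02056407    # vle32.v v8, (a0)

print(classify_fp_mem(FLW_FA0_0_A0))
print(classify_fp_mem(VLE32_V8_A0))
```

<p>Both encodings share opcode 0x07 and differ in the width field (010
versus 110). The scalar width 100 belongs to the quad-precision extension,
which RVA23 does not require, so a decoder for this profile would still need
to reject it after routing it to the scalar path.</p>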

<h2 id="what-we-set-out-to-build">What We Set Out to Build</h2>

<p>Given all of this, Pacino’s decoder implementation had four concrete
requirements.</p>

<p>First, eight-instruction parallel decode with single-cycle latency. No
serial dependencies between slots.</p>

<p>Second, complete RVA23S64 coverage. Every mandatory instruction from every
mandatory extension, plus correct handling of disabled extensions: their
instructions produce an ILLEGAL decode packet for downstream exception handling. This last constraint was a self-imposed, forward-looking feature intended to assist bring-up.</p>

<p>Third, a dual-packet output architecture. Scalar instructions produce
decode_pkt_t; vector instructions produce vec_decode_pkt_t. A steering
signal tells downstream stages which packet to consume for each slot.</p>
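<p>A toy model of the dual-packet idea is shown below. The packet classes
stand in for <code class="language-plaintext highlighter-rouge">decode_pkt_t</code> and
<code class="language-plaintext highlighter-rouge">vec_decode_pkt_t</code>, but their fields
are invented for illustration, and the routing rule (everything with the
OP-V opcode goes to the vector packet, vector loads and stores ignored) is a
simplifying assumption rather than the rule used in Pacino.</p>

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScalarPkt:   # stand-in for decode_pkt_t; fields are hypothetical
    opcode: int
    rd: int
    rs1: int


@dataclass
class VecPkt:      # stand-in for vec_decode_pkt_t; fields are hypothetical
    funct6: int
    vd: int
    vm: int


@dataclass
class SlotOut:
    is_vec: bool                 # steering signal for downstream stages
    scalar: Optional[ScalarPkt]
    vector: Optional[VecPkt]


def decode_slot(word: int) -> SlotOut:
    """Decode one slot using only that slot's bits."""
    opcode = word & 0x7F
    if opcode == 0x57:  # OP-V
        pkt = VecPkt((word >> 26) & 0x3F, (word >> 7) & 0x1F, (word >> 25) & 1)
        return SlotOut(True, None, pkt)
    pkt = ScalarPkt(opcode, (word >> 7) & 0x1F, (word >> 15) & 0x1F)
    return SlotOut(False, pkt, None)


def decode_bundle(bundle: List[int]) -> List[SlotOut]:
    """Each slot is decoded independently; no slot reads another slot's result."""
    return [decode_slot(word) for word in bundle]


for out in decode_bundle([0x00150513, 0x0A2180D7]):  # addi ; vsub.vv
    print(out)
```

<p>Because <code class="language-plaintext highlighter-rouge">decode_slot</code> takes a
single instruction word and nothing else, the model also reflects the first
requirement: there is no path by which one slot's outcome can delay another's.</p>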

<p>Fourth, a combinational pre-decode block that identifies vtype-producing
instructions before the main decoder runs, annotates the bundle with
dependency information, and provides a clean interface for the rename
stage to track vtype without the main decoder holding any state.</p>

<p>The implementation used a structured AI co-design methodology with a dual
assistant architecture: Claude.ai for architectural planning and experiment
design, Claude Code for RTL implementation and automated verification.
Experiments were isolated to single sessions with defined hypotheses,
explicit deliverables, and ground-truth verification against the
riscv-opcodes [4] repository rather than model training data.</p>

<p>The next post describes how I built the scalar foundation and worked
through the full vector ALU instruction space.</p>

<h2 id="references">References</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1] RVA23 Profiles
    https://github.com/riscv/riscv-profiles/blob/main/src/rva23-profile.adoc
    accessed 2026.05.01
[2] RISC-V Unprivileged ISA Specification 
    https://github.com/riscv/riscv-isa-manual
    accessed 2026.05.01
[3] RISC-V Vector Extension Specification (RVV 1.0)
    https://github.com/riscvarchive/riscv-v-spec/blob/master/v-spec.adoc
    accessed 2026.05.01
[4] riscv-opcodes 
    https://github.com/riscv/riscv-opcodes
    accessed 2026.05.01
</code></pre></div></div>
<hr />
<p><em>Jeff Nye is a microprocessor architect with 35 years of industry experience 
spanning performance modeling, RTL implementation, and architecture for 
high-performance OOO processors. He has contributed RTL to Pentium 4, ARMv7, TI C6x, and RISC-V designs, and recently served as sole architect and full-stack implementer of the TAGE-SC-L + ITTAGE branch prediction cluster in an 8-issue RVA23 RISC-V processor, from research through timing closure at 2.75 GHz. He holds 20+ issued patents in processor design, architecture, and hardware 
virtualization. He is the author of Pacino and the uarchlabs methodology documented here.</em></p>

<p><em>Connect on <a href="https://www.linkedin.com/in/jeff-nye-21353926">LinkedIn</a>.</em></p>]]></content><author><name>Jeff Nye</name></author><summary type="html"><![CDATA[The mandatory extension list for RVA23S64 is not short. A discussion of the profile and decoding nuances.]]></summary></entry><entry><title type="html">Project Rationale</title><link href="https://uarchlabs.github.io/blog/project-rationale/" rel="alternate" type="text/html" title="Project Rationale" /><published>2026-05-11T00:00:00+00:00</published><updated>2026-05-11T00:00:00+00:00</updated><id>https://uarchlabs.github.io/blog/project-rationale</id><content type="html" xml:base="https://uarchlabs.github.io/blog/project-rationale/"><![CDATA[<h1 id="project-rationale">Project Rationale</h1>

<p>The emergence of LLMs raises a practical question for high-performance,
large-scale processor development: can a standards-compliant, competitively
performant design be built by a very small team on a small budget? If yes, the
implications for corporate development in team size, schedule, and cost are
significant. This project is designed to explore that question.</p>

<p>To investigate this I am building Pacino — an RVA23S64 8-issue OOO RISC-V
processor targeting competitive SPECint2006 performance. The target is
deliberately ambitious. A simple pipeline would not stress the methodology; a
design at this complexity will expose where AI assistance genuinely helps and
where it breaks down.</p>

<p>Efficiently evaluating AI-generated RTL requires domain expertise, both to direct
the work and to judge the output. Depth in microarchitecture, verification
strategy, and performance correlation methodology each help produce an honest
assessment while consuming fewer tokens, which keeps the method efficient in
both outcome and cost.</p>

<p>For other scope-related discussions there is an <a href="https://uarchlabs.com/faq.html">FAQ</a>.</p>

<h3 id="goals">Goals</h3>

<p>Defining “practicality” requires a specific focus on methodology. This work is
driven by a central investigative question:</p>

<blockquote>
  <p><strong>What prompting structures and methodology processes yield the best results when using LLMs for the co-design of a high-performance RISC-V processor?</strong></p>
</blockquote>

<p>In addition to the primary goal, I am evaluating other qualitative and quantitative characteristics:</p>

<ul>
  <li><strong>Context Management</strong>: Developing repeatable mechanisms for managing context in the Planning Assistant (PA) and Implementation Assistant (IA).</li>
  <li><strong>Task Scaling</strong>: Establishing an intuition for the size of a design task relative to the context required.</li>
  <li><strong>Human-in-the-Loop Requirements</strong>: Determining the level of human interaction and domain expertise necessary to achieve functional results.</li>
  <li><strong>Future Impact</strong>: Assessing how this methodology might reshape the workflow and composition of future microprocessor design teams.</li>
</ul>

<h2 id="methodology">Methodology</h2>

<p>I structured the approach around four complementary elements: a dual AI
assistant architecture that separates strategic planning from implementation, a
context isolation strategy that keeps individual experiments clean, a
structured prompt template that enables automated results reporting and
analysis, and a structured handoff process that preserves continuity across
planning sessions.</p>

<h3 id="dual-ai-assistant-architecture">Dual AI Assistant Architecture</h3>

<p>The methodology utilizes two distinct Claude interfaces:</p>
<ul>
  <li><strong>Claude.ai (Web)</strong>: Serves as the <strong>Planning Assistant (PA)</strong>.</li>
  <li><strong>Claude Code (Terminal)</strong>: Serves as the <strong>Implementation Assistant (IA)</strong>.</li>
</ul>

<p>The roles were assigned based on the native capabilities of each interface.
This approach addresses the fundamental challenge of maintaining both strategic
architectural thinking and detailed implementation capability within the
constraints of AI context windows.</p>

<h4 id="claudeai-web-interface--planning-assistant-pa">Claude.ai (Web Interface) — Planning Assistant (PA)</h4>

<p>The PA serves as the strategic actor. Its primary functions include high-level
architectural guidance, experimental methodology design, structured prompt
generation for implementation work, results evaluation, and session-to-session
knowledge transfer via handoff documents.</p>

<p>In this role, the PA is responsible for:</p>
<ul>
  <li>Design space exploration and trade-off analysis.</li>
  <li>Interface specification and module boundary decisions.</li>
  <li>Experimental planning and hypothesis formation.</li>
  <li>Cross-session state management via structured documentation.</li>
  <li>Quality assessment of implementation results, contrasted with the user-developed assessment.</li>
</ul>

<p>For context management, the PA maintains conversational history for
architectural reasoning, accesses past session data through search tools when
needed, preserves design rationale and decision context, and tracks
experimental methodology evolution.</p>

<p>User interaction is central to this phase. The user makes the final decisions
on the order and scope of implementation tasks and on questions of standards
compliance, and works interactively with the PA to generate specifications and
design rules. The PA has no access to the IA file system or source control
repositories.</p>

<h4 id="claude-code-terminal-interface--implementation-assistant-ia">Claude Code (Terminal Interface) — Implementation Assistant (IA)</h4>

<p>The IA serves as the execution actor. Its primary functions include direct
SystemVerilog RTL generation and modification, file system access for reading
and writing project files, compilation, linting, and testing through Verilator
integration, and testbench creation and verification.</p>

<p>In this role, the IA is responsible for:</p>
<ul>
  <li>Production-quality RTL code generation.</li>
  <li>Adherence to coding style and structural requirements.</li>
  <li>Integration with existing build and verification flows.</li>
  <li>Technical constraint satisfaction (timing, area, and functionality).</li>
</ul>

<p>For context management, the IA reads project guidelines from CLAUDE.md
automatically but maintains no persistent state between sessions.  The IA
operates with a “clean context” for each task. It is the responsibility of the
PA session to declare the <strong>Minimal Viable Context</strong> required explicitly 
through the prompt for any given implementation task.</p>

<p>The IA currently has read/write privileges to the file system but has no
knowledge of the source control system (Git) or of the repo.</p>

<h3 id="workflow-integration-pattern">Workflow Integration Pattern</h3>

<ol>
  <li>
    <p><strong>Strategic Planning Phase (PA/User)</strong> — I analyze requirements and
  constraints with the PA, review previous session results and lessons
learned, define the experimental hypothesis and success criteria, and generate
a structured implementation prompt with complete context specification. This is
also the phase where specifications are developed as context. Domain knowledge
informs the scope and order of tasks throughout.</p>
  </li>
  <li>
    <p><strong>Transfer Phase (User-mediated)</strong> — I chose to keep this as a manual step
  due to permissions and security considerations. The PA has no access to the
file system. I make the PA-generated task file available to the IA environment,
ensure all referenced files and contexts are accessible, update the repo with
the latest accepted edits, and initiate the implementation session.</p>
  </li>
  <li>
    <p><strong>Implementation Phase (IA)</strong> — The IA executes RTL implementation per the
  structured prompt, performs compilation, linting, and basic verification,
generates a results summary identifying any issues, and produces deliverables
ready for integration. The IA populates a structured results section in the
task file and reports a summary to the console.</p>
  </li>
  <li>
    <p><strong>Evaluation Phase (PA/User)</strong> — I review the implementation results
alongside the IA run statistics (time, context used, model, completion status) and write an
assessment of the results. I provide my analysis and the IA results to the PA
for further analysis. Status and technical debt are recorded, and I plan the
next experimental phase or iteration.</p>
  </li>
  <li>
    <p><strong>Knowledge Preservation Phase (User-mediated)</strong> — I judge the PA’s
  remaining context and effectiveness. If warranted I initiate a session
handoff — refreshing context with the previous handoff document and requesting
that the PA produce the handoff document for the next session, recording
architectural decisions, rationale, and updates to project status and planning
documents.</p>
  </li>
</ol>

<h2 id="workflow-summary">Workflow Summary</h2>

<p><img src="/assets/diagrams/pa_ia_workflow.svg" alt="PA/IA Workflow" /></p>

<ol>
  <li>
    <p>I discuss the next tasks or experiments with the PA, agree on scope, and provide any implementation specifications, interfaces, etc.</p>
  </li>
  <li>I provide the PA with the task template, and the PA populates the IA session prompt.
    <ul>
      <li>These task files use a numbering scheme: DECODE-001.md, etc.</li>
    </ul>
  </li>
  <li>
    <p>I transfer the populated task file to the IA file system at ./prompts</p>
  </li>
  <li>A fresh Claude Code session is started
    <ul>
      <li><code class="language-plaintext highlighter-rouge">claude</code></li>
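      <li>There are additional options to control Claude Code automation:
<code class="language-plaintext highlighter-rouge">--auto-accept-edits</code> or <code class="language-plaintext highlighter-rouge">--dangerously-skip-permissions</code>.</li>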
      <li>This is a user choice. It is independent of the methodology</li>
    </ul>
  </li>
  <li>I specify the /run command
    <ul>
      <li><code class="language-plaintext highlighter-rouge">/run &lt;task id&gt; </code></li>
      <li>The run command locates the task file, verifies its format, extracts
the prompt, and executes the instructions.</li>
    </ul>
  </li>
  <li>
    <p>The IA will run and report summary results to the console and write to the
  ::RESULTS CAPTURE:: section of the task file.</p>
  </li>
  <li>I populate the header data fields with run statistics, and optionally 
edit the User Assessment section and paste the IA console output into the task file.
    <ul>
      <li>This step supports the experimental record — it is part of the methodology
documentation, not the design flow itself.</li>
    </ul>
  </li>
  <li>
    <p>I share the completed task file with PA, discuss results, record
     decisions, plan next task</p>
    <ul>
      <li>This is interactive and can generate a number of actions, technical debt,
additional or clarified documents, or occasionally require updates to
CLAUDE.md</li>
    </ul>
  </li>
  <li>Once ready I commit the git repo changes
    <ul>
      <li>The IA does not have knowledge of the repo. This is a deliberate design choice.</li>
      <li>I also mirror the repo on a separate file system distinct from the file system the IA has access to.</li>
    </ul>
  </li>
</ol>
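<p>For readers who want a feel for the checks behind the
<code class="language-plaintext highlighter-rouge">/run</code> step, here is a small Python
sketch of resolving a task ID to a file under
<code class="language-plaintext highlighter-rouge">./prompts</code> and confirming the results
section marker is present. This is not the actual
<code class="language-plaintext highlighter-rouge">/run</code> command. The ID pattern
(upper-case module name, hyphen, three digits) is inferred from the
DECODE-001.md example, and both function names are hypothetical.</p>

```python
import re
from pathlib import PurePosixPath

# Assumed ID shape, inferred from names such as DECODE-001
TASK_ID = re.compile(r"^(?P<module>[A-Z][A-Z0-9_]*)-(?P<num>\d{3})$")
RESULTS_MARKER = "::RESULTS CAPTURE::"


def task_path(task_id: str, prompts_dir: str = "./prompts") -> PurePosixPath:
    """Map a task ID to its expected task file path, rejecting malformed IDs."""
    if TASK_ID.match(task_id) is None:
        raise ValueError(f"malformed task id: {task_id!r}")
    return PurePosixPath(prompts_dir) / f"{task_id}.md"


def has_results_section(task_text: str) -> bool:
    """Check that the task file contains the section the IA writes results into."""
    return RESULTS_MARKER in task_text


print(task_path("DECODE-001"))
print(has_results_section("# DECODE-001\n\n::RESULTS CAPTURE::\n"))
```

<p>Rejecting a malformed ID before any file is opened keeps a typo from
silently running the wrong task, which matters when each session starts from
a clean context.</p>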

<p>Since the PA also has context limits, at some point it will be necessary to
perform a session handoff. This is usually indicated by incomplete or inaccurate
answers from the PA, forgotten instructions from earlier in the session, etc.</p>

<p>In this case, I supply the PA with the SESSION_HANDOFF.md template and a copy
of the previous session handoff file, and ask that the PA generate the next
session handoff document. I also supply the current session number and the
next. The PA will produce session_handoff-NNN.md with:</p>

<ul>
  <li>Key architectural decisions and their reasoning</li>
  <li>Technical debt inventory</li>
  <li>Tools status and known issues</li>
  <li>Next steps in priority order</li>
  <li>Anything not captured elsewhere in the repo</li>
</ul>

<p>When starting the next session, I supply PROJECT_STATUS.md and the latest
session_handoff-NNN.md file. If the workflow or CLAUDE.md changed in the
last session, I supply PROJECT_CORE.md and/or CLAUDE.md as well.</p>

<h2 id="methodology-mechanics">Methodology Mechanics</h2>

<h3 id="md-support-files">MD support files</h3>
<p>Markdown (MD) files form the conventional basis for interacting with the IA and PA.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">File / Directory</th>
      <th style="text-align: left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">./CLAUDE.md</code></td>
      <td style="text-align: left">Canonical baseline context, constant across IA sessions. Covers purpose, text output rules, fixed constraints (e.g. read fully before write), and how the IA should respond to conflicting or poorly defined requirements.</td>
    </tr>
    <tr>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">./pa_handoffs/</code></td>
      <td style="text-align: left">Previous PA session handoff files.</td>
    </tr>
    <tr>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">./planning/PROJECT_CORE.md</code></td>
      <td style="text-align: left">High level description of project intent, scope, roles, workflow, conventions, and 3rd party tool status. Supplied to PA only when project-level changes occur — new steps, new tools, methodology changes.</td>
    </tr>
    <tr>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">./planning/PROJECT_STATUS.md</code></td>
      <td style="text-align: left">Current project state: module status, technical debt, development and design open items, SV package conventions, key cluster/module parameters, prompt generation guide, architecture decisions, and prompt decomposition list. Used in handoff and planning sessions.</td>
    </tr>
    <tr>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">./planning/arch/</code></td>
      <td style="text-align: left">Contains documentation of architecture decisions and guidance. These documents are tactically supplied as reference context in IA prompts.</td>
    </tr>
    <tr>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">./planning/interfaces/</code></td>
      <td style="text-align: left">Contains definition of module ports necessary for sharing between modules and subsystems. This is the primary mechanism to ensure minimal issues with interoperability. These documents are tactically supplied as reference context in IA prompts.</td>
    </tr>
    <tr>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">./planning/testbenches/</code></td>
      <td style="text-align: left">Contains context for test bench guidance.</td>
    </tr>
    <tr>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">./planning/tools</code></td>
      <td style="text-align: left">Third-party tool capabilities, usage, etc. These are not Claude tools or skills.</td>
    </tr>
    <tr>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">./prompts/</code></td>
      <td style="text-align: left">IA task files generated by PA using the task template, labeled by module and iteration e.g. <code class="language-plaintext highlighter-rouge">DECODE-002.md</code>.</td>
    </tr>
    <tr>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">./templates/TASK_TEMPLATE.md</code></td>
      <td style="text-align: left">Structured document populated by PA with goals and IA prompt. Contains task header (ID, context stats, runtime, model, resume SHA, status), a user assessment section, the extracted IA prompt, and a results capture section the IA populates. Once populated, labeled <code class="language-plaintext highlighter-rouge">&lt;Module&gt;-&lt;ID&gt;.md</code> and stored in <code class="language-plaintext highlighter-rouge">./prompts/</code>.</td>
    </tr>
    <tr>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">./templates/SESSION_HANDOFF.md</code></td>
      <td style="text-align: left">Structured document populated by PA at session handoff. Records session progress, decisions carried forward, prompts generated, and PROJECT_STATUS.md updates. PROJECT_CORE.md and PROJECT_STATUS.md updates are applied manually.</td>
    </tr>
  </tbody>
</table>

<h2 id="evaluation-criteria">Evaluation Criteria</h2>

<p>The primary evaluation metric is projected SPEC CPU2006 and CPU2017 IPC, derived from a validated C++ performance model executing SimPoints. Model validation is established by correlating against the RTL using a common microarchitectural event schema anchored to the RISC-V Hardware Performance Monitor specification, with RISC-V micro-benchmarks, Dhrystone, and CoreMark as the correlation workloads. Linux boot on an FPGA platform is anticipated as a further correctness validation and provides a natural environment for HPM counter verification. PPA characterization and silicon measurement remain open for future work.</p>

<h2 id="summary">Summary</h2>

<p>The dual assistant architecture is a practical solution to a real problem:
maintaining architectural coherence across a long, complex design while keeping
individual implementation sessions clean and reproducible. The PA handles
design reasoning and continuity. The IA handles execution. The user owns every
decision. Whether this approach can produce a competitive 8-issue OOO
processor is the question this project is designed to answer. The methodology,
the prompts, the failures, and the results will all be published. That
transparency is part of the point.</p>

<hr />
<p><em>Jeff Nye is a microprocessor architect with 35 years of industry experience
spanning performance modeling, RTL implementation, and architecture for
high-performance OOO processors. He has contributed RTL to Pentium 4, ARM V7, TI C6x and RISC-V designs, and recently served as sole architect and full-stack implementer of the TAGE-SC-L + ITTAGE branch prediction cluster in an 8-issue RVA23 RISC-V processor, from research through timing closure at 2.75 GHz. He holds 20+ issued patents in processor design, architecture, and hardware
virtualization. He is the author of Pacino and the uarchlabs methodology documented here.</em></p>

<p><em>Connect on <a href="https://www.linkedin.com/in/jeff-nye-21353926">LinkedIn</a>.</em></p>]]></content><author><name>Jeff Nye</name></author><summary type="html"><![CDATA[The emergence of LLMs raises a practical question for high performance large-scale processor development: can a standards-compliant, competitively performant design be built by a very small team on a small budget?]]></summary></entry><entry><title type="html">Introducing uarchlabs and the Pacino blog series</title><link href="https://uarchlabs.github.io/blog/uarchlabs-blog-live/" rel="alternate" type="text/html" title="Introducing uarchlabs and the Pacino blog series" /><published>2026-05-10T00:00:00+00:00</published><updated>2026-05-10T00:00:00+00:00</updated><id>https://uarchlabs.github.io/blog/uarchlabs-blog-live</id><content type="html" xml:base="https://uarchlabs.github.io/blog/uarchlabs-blog-live/"><![CDATA[<p><a href="https://uarchlabs.com">uarchlabs</a> is an open source hardware design organization. We build high performance processors and publish everything — the RTL, the design decisions, and the AI-assisted methodology used to produce them.</p>

<p>The first project is <a href="https://github.com/uarchlabs/pacino">Pacino</a>: an 8-issue
out-of-order RISC-V processor targeting the RVA23S64 profile with competitive
SPECint2006 performance as the design goal. The target is deliberately
ambitious — a simple pipeline would not stress the methodology.</p>

<p>This blog series documents the design and the process. Posts cover
architectural decisions, experiment results, and the AI co-design flow as it
develops. The methodology, the prompts, the successes and failures, and the
results will be published. We are making transparency part of the point.</p>

<p>The next post covers the project rationale, methodology, and workflow in detail. A FAQ covering scope, tooling, and design targets is at <a href="https://uarchlabs.com/faq.html">uarchlabs.com/faq.html</a>.</p>

<hr />

<p><em>Jeff Nye is a microprocessor architect with 35 years of industry experience spanning performance modeling, RTL implementation, and architecture for high-performance OOO processors. He has contributed RTL to Pentium 4, ARM V7, TI C6x and RISC-V designs, and recently served as sole architect and full-stack implementer of the TAGE-SC-L + ITTAGE branch prediction cluster in an 8-issue RVA23 RISC-V processor — from research through timing closure at 2.75 GHz. He holds 20+ issued patents in processor design, architecture, and hardware virtualization.</em></p>

<p><em>Connect on <a href="https://www.linkedin.com/in/jeff-nye-21353926">LinkedIn</a>.</em></p>]]></content><author><name>Jeff Nye</name></author><summary type="html"><![CDATA[uarchlabs is an open source hardware design organization. We build high performance processors and publish everything — the RTL, the design decisions, and the AI-assisted methodology used to produce them.]]></summary></entry></feed>