Day 10: Numerical Architectures & Design Trade-offs¶
Course: Accelerated HDL for Digital System Design¶
Week 3, Session 10 of 16¶
Student Learning Objectives¶
- SLO 10.1: Compare adder architectures (ripple-carry, behavioral `+`, carry-lookahead concept) and explain their PPA trade-offs.
- SLO 10.2: Implement a sequential shift-and-add multiplier and explain when sequential multiplication is preferable to combinational.
- SLO 10.3: Implement fixed-point (Q-format) arithmetic and handle bit growth in multiplication.
- SLO 10.4: Perform structured PPA analysis: measure LUTs, FFs, and Fmax across design variants using `yosys stat` and `nextpnr`.
- SLO 10.5: Read a nextpnr timing report and identify whether timing constraints are met.
- SLO 10.6: Articulate the PPA trade-off triangle: performance vs. power vs. area, and why optimizing all three simultaneously is impossible.
Pre-Class Video (~55 min) ★ Revised lecture¶
| # | Segment | Duration | File |
|---|---|---|---|
| 1 | Timing & constraints essentials: setup/hold, critical path, nextpnr reports | 15 min | video/day10_seg1_timing_essentials.mp4 |
| 2 | Numerical architecture trade-offs: adders, multipliers, `+` and `*` operators | 20 min | video/day10_seg2_numerical_architectures.mp4 |
| 3 | PPA — Performance, Power, Area: the three axes of digital design | 15 min | video/day10_seg3_ppa_intro.mp4 |
| 4 | OpenROAD/OpenLane: open-source ASIC PPA (aspirational context) | 5 min | video/day10_seg4_asic_ppa_context.mp4 |
Segment 2 key points:
- Ripple-carry (Day 2 review) → carry-lookahead (concept) → the + operator (what does the tool build?)
- Writing assign sum = a + b; lets the tool choose the architecture — understanding the choice is the designer's job
- Multiplication: shift-and-add → why * on iCE40 uses pure LUT logic (no DSP blocks)
- Fixed-point Q-format: Q4.4 = 4 integer + 4 fractional bits
Segment 3 key points:
- FPGA PPA proxies: Fmax (performance), LUT/FF count (area), toggle rate × capacitance (power, conceptual)
- ASIC PPA: gate count, standard cells, process node, leakage
- The trade-off triangle: pipelining helps Fmax but costs FFs; parallelism costs area
- if/else vs case revisited from PPA perspective
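The pipelining trade-off in the third point can be made concrete with a small sketch. The two modules below compute the same three-operand sum; the module and signal names are illustrative and not taken from the course materials. Both register their inputs so that the place-and-route tool has a register-to-register path to time.

```verilog
// Two ways to compute a + b + c (sketch; module and signal names are illustrative).

// Variant 1: both additions sit in a single register-to-register path.
module add3_flat #(parameter WIDTH = 16) (
    input  wire             clk,
    input  wire [WIDTH-1:0] a, b, c,
    output reg  [WIDTH+1:0] sum        // two extra bits hold the worst-case sum
);
    reg [WIDTH-1:0] a_r, b_r, c_r;     // input registers

    always @(posedge clk) begin
        a_r <= a;
        b_r <= b;
        c_r <= c;
        sum <= a_r + b_r + c_r;        // both additions happen between registers
    end
endmodule

// Variant 2: the same function split into two pipeline stages.
module add3_pipe #(parameter WIDTH = 16) (
    input  wire             clk,
    input  wire [WIDTH-1:0] a, b, c,
    output reg  [WIDTH+1:0] sum
);
    reg [WIDTH-1:0] a_r, b_r, c_r;     // input registers
    reg [WIDTH:0]   ab_r;              // stage 1 result: a + b
    reg [WIDTH-1:0] c_d;               // c delayed one cycle to stay aligned with ab_r

    always @(posedge clk) begin
        a_r  <= a;
        b_r  <= b;
        c_r  <= c;
        ab_r <= a_r + b_r;             // stage 1: one addition per path
        c_d  <= c_r;
        sum  <= ab_r + c_d;            // stage 2: one addition per path
    end
endmodule
```

As written, the pipelined variant declares 2 × WIDTH + 1 more flip-flops than the flat one and adds one cycle of latency. Whether Fmax actually improves depends on what the tools build, so it is worth measuring both rather than assuming.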
Session Timeline¶
| Time | Activity | Duration |
|---|---|---|
| 0:00 | Warm-up: pre-class questions, timing concepts check | 5 min |
| 0:05 | Mini-lecture: timing, numerical demo, PPA thinking | 30 min |
| 0:35 | Lab Exercise 1: Adder architecture comparison | 30 min |
| 1:05 | Lab Exercise 2: Shift-and-add multiplier | 30 min |
| 1:35 | Break | 5 min |
| 1:40 | Lab Exercise 3: Fixed-point exercise | 20 min |
| 2:00 | Lab Exercise 4: Timing constraint exercise | 10 min |
| 2:10 | Lab Exercise 5 (Stretch): PLL / CDC | 10 min |
| 2:20 | Wrap-up and Day 11 preview | 10 min |
In-Class Mini-Lecture (30 min)¶
Quick Timing Check (5 min)¶
- Read a nextpnr timing report together
- Key line: "Max frequency for clock 'clk': XX.XX MHz (PASS at 25.00 MHz)"
- What's the critical path? Where does delay accumulate?
Numerical Architectures Live Demo (15 min)¶
- Synthesize `assign sum = a + b;` at 4-bit, 8-bit, 16-bit, 32-bit widths
- Run `yosys stat` for each and plot LUT count vs. width (should be roughly linear)
- Synthesize `assign product = a * b;` at 4-bit, 8-bit to show the LUT explosion (quadratic growth, no DSP blocks on iCE40)
- Compare: `a + b` vs. `a + b + c`. Does the tool chain adders or use something smarter?
- Inspect with `yosys show`: what does the synthesized adder actually look like?
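For the width sweep, one option is to wrap each operator in a parameterized module so the same source can be re-synthesized at every width. This is a minimal sketch; the module names `demo_add` and `demo_mul` and the `WIDTH` parameter are assumptions made for illustration.

```verilog
// Width-parameterized wrappers for the + and * demo.
// Module and parameter names are illustrative, not taken from the course repo.
module demo_add #(
    parameter WIDTH = 8
) (
    input  wire [WIDTH-1:0] a,
    input  wire [WIDTH-1:0] b,
    output wire [WIDTH:0]   sum        // one extra bit keeps the carry-out
);
    assign sum = a + b;
endmodule

module demo_mul #(
    parameter WIDTH = 8
) (
    input  wire [WIDTH-1:0]   a,
    input  wire [WIDTH-1:0]   b,
    output wire [2*WIDTH-1:0] product  // full-width product, no truncation
);
    assign product = a * b;
endmodule
```

The output ports are deliberately wider than the inputs: Verilog sizes the expression to the widest operand or target, so the carry bit and the upper product bits are kept rather than silently dropped. Yosys's `chparam` command is one way to override `WIDTH` between runs without editing the file.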
PPA Thinking (10 min)¶
- Design decision framework: "For this application, do I care more about Fmax, area, or power?"
- FPGA vs. ASIC: on FPGA, LUTs are fixed-size so "area" is really "LUT utilization"; on ASIC, a 2-input gate is physically smaller than a 4-input gate
- Real example: a 32-bit multiplier on iCE40 HX1K uses ~30% of available LUTs; on 28nm ASIC, it's a tiny fraction
- Brief aside: "The PPA habits you build with `yosys stat` transfer directly to ASIC flows like OpenROAD/OpenLane."
Lab Exercises¶
Exercise 1: Adder Architecture Comparison (30 min)¶
Objective (SLO 10.1, 10.4): Compare manual and behavioral adder implementations using PPA metrics.
Tasks:
1. Manual ripple-carry adder: Chain full-adder instances (reuse Day 2 code) at 8-bit and 16-bit widths.
2. Behavioral adder: assign sum = a + b; at the same widths.
3. For each variant and width, record:
- LUT count (yosys stat)
- Schematic structure (yosys show)
- Fmax (nextpnr timing report)
4. Fill in a comparison table:
| Variant | Width | LUTs | FFs | Fmax (MHz) |
|---|---|---|---|---|
| Ripple-carry | 8 | | | |
| Ripple-carry | 16 | | | |
| Behavioral `+` | 8 | | | |
| Behavioral `+` | 16 | | | |
- Analysis question: Does the synthesis tool produce the same circuit for both? When would you manually implement an adder instead of using `+`?
Checkpoint: Comparison table filled in with real data. At least one observation about the results.
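A possible starting point for the manual variant is sketched below. The `full_adder` module stands in for the Day 2 code, whose actual port names may differ, and `adder_timing_top` is a hypothetical wrapper added so that nextpnr has a register-to-register path from which to report Fmax.

```verilog
// 1-bit full adder (stand-in for the Day 2 module; port names are assumed).
module full_adder (
    input  wire a, b, cin,
    output wire sum, cout
);
    assign sum  = a ^ b ^ cin;
    assign cout = (a & b) | (cin & (a ^ b));
endmodule

// N-bit ripple-carry adder built by chaining full_adder instances.
module ripple_adder #(parameter WIDTH = 8) (
    input  wire [WIDTH-1:0] a, b,
    output wire [WIDTH-1:0] sum,
    output wire             cout
);
    wire [WIDTH:0] carry;          // carry[i] feeds bit i; carry[WIDTH] is the carry-out
    assign carry[0] = 1'b0;

    genvar i;
    generate
        for (i = 0; i < WIDTH; i = i + 1) begin : g_fa
            full_adder fa (
                .a(a[i]), .b(b[i]), .cin(carry[i]),
                .sum(sum[i]), .cout(carry[i+1])
            );
        end
    endgenerate

    assign cout = carry[WIDTH];
endmodule

// Registered wrapper: inputs and output pass through flip-flops so the
// adder sits on a clocked path that the timing analysis can measure.
module adder_timing_top #(parameter WIDTH = 8) (
    input  wire             clk,
    input  wire [WIDTH-1:0] a, b,
    output reg  [WIDTH:0]   result
);
    reg  [WIDTH-1:0] a_r, b_r;
    wire [WIDTH-1:0] s;
    wire             co;

    ripple_adder #(.WIDTH(WIDTH)) dut (.a(a_r), .b(b_r), .sum(s), .cout(co));

    always @(posedge clk) begin
        a_r    <= a;
        b_r    <= b;
        result <= {co, s};
    end
endmodule
```

If you use a wrapper like this, apply the same one to the behavioral variant so the flip-flops it adds show up equally in both rows of the table.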
Exercise 2: Shift-and-Add Multiplier (30 min)¶
Objective (SLO 10.2, 10.4): Implement a sequential multiplier and compare to combinational.
Tasks:
1. Implement an 8-bit unsigned shift-and-add multiplier:
- FSM controls the operation (IDLE → COMPUTE → DONE)
- Shift register holds the multiplier, accumulator holds the partial product
- Takes up to 8 clock cycles to complete
2. Write a testbench: verify at least 10 test cases including 0×N, N×0, 1×N, max×max.
3. Synthesize the sequential multiplier. Record LUTs, FFs, Fmax.
4. Synthesize assign product = a * b; (8-bit, combinational). Record LUTs, FFs, Fmax.
5. Compare:
| Variant | LUTs | FFs | Fmax | Latency |
|---|---|---|---|---|
| Shift-and-add (sequential) | | | | 8 cycles |
| Behavioral `*` (combinational) | | | | 1 cycle |
- Discussion: When would you choose sequential over combinational? (High area pressure, relaxed latency requirements.)
Checkpoint: Working multiplier. Comparison table with both variants.
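One way the IDLE → COMPUTE → DONE structure could look is sketched below. The port names, the one-cycle `done` pulse, and the synchronous active-high reset are assumptions made for this example, not requirements from the exercise. This version always spends eight cycles in COMPUTE and does not exit early when the remaining multiplier bits are zero.

```verilog
// Sequential 8-bit unsigned shift-and-add multiplier (sketch).
// Port names and state encoding are illustrative assumptions.
module shift_add_mult (
    input  wire        clk,
    input  wire        rst,      // synchronous, active high
    input  wire        start,    // sampled in IDLE; begins a multiplication
    input  wire [7:0]  a,        // multiplicand
    input  wire [7:0]  b,        // multiplier
    output reg  [15:0] product,
    output reg         done      // high for one cycle when product is valid
);
    localparam IDLE = 2'd0, COMPUTE = 2'd1, DONE = 2'd2;

    reg [1:0]  state;
    reg [15:0] mcand;    // multiplicand, shifted left each step
    reg [7:0]  mplier;   // multiplier, shifted right each step
    reg [15:0] acc;      // running sum of partial products
    reg [2:0]  count;    // index of the multiplier bit being processed

    always @(posedge clk) begin
        if (rst) begin
            state   <= IDLE;
            done    <= 1'b0;
            product <= 16'd0;
        end else begin
            done <= 1'b0;                       // default: done is a single-cycle pulse
            case (state)
                IDLE: if (start) begin
                    mcand  <= {8'd0, a};
                    mplier <= b;
                    acc    <= 16'd0;
                    count  <= 3'd0;
                    state  <= COMPUTE;
                end
                COMPUTE: begin
                    if (mplier[0])              // add the shifted multiplicand when the bit is 1
                        acc <= acc + mcand;
                    mcand  <= mcand << 1;
                    mplier <= mplier >> 1;
                    count  <= count + 3'd1;
                    if (count == 3'd7)          // eighth bit handled this cycle
                        state <= DONE;
                end
                DONE: begin
                    product <= acc;
                    done    <= 1'b1;
                    state   <= IDLE;
                end
                default: state <= IDLE;
            endcase
        end
    end
endmodule
```

The 16-bit accumulator is wide enough for the largest case, 255 × 255 = 65025. A testbench can wait for `done` and compare `product` against `a * b` computed in the testbench itself.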
Exercise 3: Fixed-Point Arithmetic (20 min)¶
Objective (SLO 10.3): Handle the practical challenges of fixed-point computation.
Tasks:
1. Implement a Q4.4 fixed-point adder: two 8-bit inputs (4 integer, 4 fractional), 9-bit output (with carry).
2. Implement a Q4.4 fixed-point multiplier: two 8-bit inputs → 16-bit product (Q8.8).
- Key challenge: The product of two Q4.4 numbers is Q8.8, so the integer part sits in the upper 8 bits (`[15:8]`) and the fractional part in the lower 8 bits (`[7:0]`). You need to extract the right bits for the integer part.
3. Drive the integer part of the result to the 7-seg display.
4. Test with known values: e.g., 2.5 × 3.0 = 7.5 → integer part = 7 on display.
Checkpoint: Fixed-point multiplication gives correct integer result on 7-seg.
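The bit positions can be sketched as a small combinational module. It assumes unsigned Q4.4 values, which the exercise does not state explicitly, and the module and port names are illustrative.

```verilog
// Q4.4 unsigned fixed-point adder and multiplier (sketch; names are illustrative).
// A Q4.4 value v is stored as the 8-bit integer v * 16.
module q44_arith (
    input  wire [7:0]  a,          // Q4.4
    input  wire [7:0]  b,          // Q4.4
    output wire [8:0]  sum,        // Q5.4: one extra integer bit for the carry
    output wire [15:0] product,    // Q8.8: full-precision product
    output wire [7:0]  prod_int,   // integer part of the product
    output wire [7:0]  prod_frac   // fractional part of the product
);
    assign sum       = a + b;      // binary points already line up, so add directly
    assign product   = a * b;      // fractional bits add: 4 + 4 = 8
    assign prod_int  = product[15:8];
    assign prod_frac = product[7:0];
endmodule
```

Working the test case by hand: 2.5 is stored as 40 (`8'h28`) and 3.0 as 48 (`8'h30`). Their product is 1920, or `16'h0780`, so `prod_int` is `8'h07` and `prod_frac` is `8'h80`, which is 128/256 = 0.5. A single hex digit on the 7-seg can show only `product[11:8]`, so integer results above 15 need a second digit.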
Exercise 4: Timing Constraint Exercise (10 min)¶
Objective (SLO 10.5): Practice reading timing reports.
Tasks:
1. Add a timing constraint to an existing design (e.g., the shift-and-add multiplier).
2. Run place-and-route with nextpnr. Read the timing report.
3. Does timing pass at 25 MHz? At 50 MHz? At 100 MHz?
4. Identify what limits Fmax in your design.
Checkpoint: Timing report read and Fmax identified.
Exercise 5 (Stretch): PLL & CDC (10 min)¶
Objective (SLO 10.5): Explore clock generation and domain crossing.
Tasks:
1. Instantiate SB_PLL40_CORE to generate a different frequency from 25 MHz.
2. Build a 2-FF synchronizer to pass a signal between the two clock domains.
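For task 2, a two-flip-flop synchronizer can be as small as the sketch below. The module and signal names are illustrative. It is suitable only for a single-bit signal that changes slowly relative to the destination clock; multi-bit buses need a different technique such as Gray coding or a handshake.

```verilog
// Two-flip-flop synchronizer for a single-bit signal (sketch; names are illustrative).
module sync_2ff (
    input  wire clk_dst,    // destination-domain clock
    input  wire async_in,   // signal launched from the other clock domain
    output wire sync_out    // safe to use in the clk_dst domain
);
    reg meta   = 1'b0;      // first stage: may go metastable
    reg stable = 1'b0;      // second stage: gives the first stage a full cycle to settle

    always @(posedge clk_dst) begin
        meta   <= async_in;
        stable <= meta;
    end

    assign sync_out = stable;
endmodule
```

Use only `sync_out` in downstream logic. Tapping `meta` directly defeats the purpose of the second stage.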
Deliverable¶
- Adder/multiplier PPA comparison table with real data (LUTs, FFs, Fmax for each variant).
- Working shift-and-add multiplier on the FPGA.
Assessment Mapping¶
| Exercise | SLOs Assessed | Weight |
|---|---|---|
| 1 — Adder comparison | 10.1, 10.4 | Core |
| 2 — Shift-and-add multiplier | 10.2, 10.4 | Core |
| 3 — Fixed-point | 10.3 | Core |
| 4 — Timing constraints | 10.5 | Core |
| 5 — PLL / CDC | 10.5 | Stretch (bonus) |
⚠️ Common Pitfalls & FAQ¶
Day 10 is about numerical architectures and your first structured PPA analysis. Understanding the trade-offs here is directly relevant to your final project.
- Behavioral `+` is just as good as my ripple-carry adder? Often yes, and that's the point. Yosys optimizes behavioral operators aggressively. A hand-built ripple-carry adder may use the same or more LUTs than `assign sum = a + b;`. The lesson: let the tool work for you, but understand what it produces so you can make informed decisions when it matters.
- Shift-and-add multiplier FSM is complex: where do I start? Draw the block diagram first: FSM controller, shift register (for the multiplier), accumulator (for the partial product), and a bit counter. Label the connections between them. Then code each block separately before wiring them together.
- Fixed-point: extracting the wrong bits? In Exercise 3, two Q4.4 operands give a 16-bit Q8.8 product, so the integer part is bits [15:8]. If you scale up to Q8.8 operands (8 integer bits, 8 fractional bits), the full product is 32 bits wide (Q16.16) and the low 8 integer bits are in [23:16], not [31:24] or [15:8]. Draw out the bit positions on paper, labeling the integer and fractional portions, before writing the extraction code.
- Where do I find the timing report? After `nextpnr` runs, look for the "Max frequency" line in the output and the critical path description. The path name tells you which flip-flops are at the start and end of the longest combinational delay.
🔗 Bigger Picture¶
This is the anchor day for PPA analysis. The comparison tables you build in today's exercises establish the exact format you'll use in your final project PPA report (Days 15–16).
Preview: Day 11¶
UART TX — your first real communication interface. Everything from Weeks 1–2 comes together: FSMs, counters, shift registers, and testbenches, all combined to send data from the Go Board to your PC.