Day 10 · Timing, Numerics & PPA

Numerical Architecture Trade-offs

Video 2 of 4 · ~15 minutes

Dr. Mike Borowczak · Electrical & Computer Engineering · CECS · UCF


🌍 Where This Lives

In Industry

Every DSP chip markets by “multiply-accumulate operations per second.” A Qualcomm Hexagon DSP does ~10^12 MACs/s because each multiplier is a dedicated block. A Google TPU has 65,536 multipliers in one systolic array. Even your phone's voice assistant runs on custom arithmetic units optimized for matrix math. Architecture choices at this layer determine how many FLOPS per watt — the central metric of the AI era.

In This Course

Your Day 5 counter uses a + 1. Your Day 11 UART uses baud_cnt < HALF_PERIOD. Your capstone with filters or transforms will need real multipliers. Choosing the right arithmetic architecture for the job is the difference between “fits at 25 MHz” and “doesn't fit.”

⚠️ + Is Not One Thing

❌ Wrong Model

“The + operator in Verilog always produces the same hardware. What else could ‘addition’ be?”

✓ Right Model

Synthesizers choose among several adder architectures based on context, target, and width: ripple-carry (simple, slow — O(N) delay), carry-lookahead (fast, larger — O(log N) delay), carry-select (middle ground), Kogge-Stone (fastest, largest — usually ASIC-only). On iCE40, Yosys uses the dedicated SB_CARRY chain — essentially fast ripple with hardware acceleration.

The receipt: A 32-bit Kogge-Stone adder is ~4× larger but ~4× faster than a 32-bit ripple adder. Different architectures for different optimization targets.

👁️ I Do — Adder Architectures

| Architecture         | Delay (N-bit) | Area (N-bit) | Use when                      |
|----------------------|---------------|--------------|-------------------------------|
| Ripple-carry         | O(N)          | O(N)         | Narrow (<16 bits), low clock  |
| iCE40 SB_CARRY chain | O(N), fast    | O(N)         | Default on iCE40 — use this   |
| Carry-lookahead      | O(log N)      | O(N log N)   | ASIC, wide, high clock        |
| Carry-select         | O(√N)         | O(N)         | Middle-ground FPGA designs    |
| Kogge-Stone          | O(log N)      | O(N log N)   | Very wide, aggressive ASIC    |
My thinking: On iCE40, always write c = a + b and let Yosys use SB_CARRY. The dedicated carry chain is fast (propagates in dedicated routing, not LUTs). Don't hand-roll ripple-carry adders — they're strictly worse. Trust the tool.
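The O(N) delay in the table comes from the carry chain: bit i's carry depends on bit i-1's. A short Python sketch (a software model for intuition, not synthesizable code) makes that dependency explicit — the loop body is one full adder, and the critical path is N of them in series:

```python
def ripple_add(a: int, b: int, width: int):
    """Bit-level ripple-carry model: returns (sum, carry_out)."""
    carry = 0
    result = 0
    for i in range(width):            # carry ripples from bit 0 to bit N-1
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        s = ai ^ bi ^ carry                              # full-adder sum
        carry = (ai & bi) | (ai & carry) | (bi & carry)  # full-adder carry
        result |= s << i
    return result, carry

# 8'hFF + 1 wraps to 0 with carry-out 1, exactly like the hardware adder
print(ripple_add(0xFF, 0x01, 8))   # -> (0, 1)
```

Lookahead and Kogge-Stone adders exist precisely to break this serial carry dependency into a log-depth tree.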

👁️ I Do — Multiplication: The LUT Explosion

// 8-bit unsigned multiplier
assign product = a * b;   // compact source, expensive silicon

On iCE40 (no dedicated multipliers):

  • 4×4 multiply: ~20 LUTs
  • 8×8 multiply: ~80 LUTs
  • 16×16 multiply: ~350 LUTs (27% of iCE40 HX1K!)
  • 32×32 multiply: does not fit
The reality: Multiplication is fundamentally O(N²) in area for a parallel implementation. On a chip with only 1280 LUTs, you can only afford 1-2 wide multipliers. This is why DSP-heavy FPGAs (Xilinx 7-series, Intel Cyclone) include hardened multiplier blocks (DSP48, DSP slice) — to offload the area cost from general-purpose LUTs.
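The O(N²) cost is easy to see in a software model: a parallel array multiplier is N rows of AND-gated partial products plus the adder tree that sums them, so both the gates and the adders scale with N². A Python sketch (illustrative only, not hardware):

```python
def array_multiply(a: int, b: int, width: int) -> int:
    """Model of a parallel array multiplier: N shifted partial products."""
    partials = []
    for i in range(width):
        if (b >> i) & 1:              # one row of AND gates per bit of b
            partials.append(a << i)   # shifted partial product
    return sum(partials)              # the adder tree the LUTs pay for

# width^2 partial-product bits exist in hardware regardless of the inputs:
# 8x8 -> 64 bits of partial products, 16x16 -> 256, 32x32 -> 1024
print(array_multiply(13, 11, 8))   # -> 143
```

Doubling the width quadruples the partial-product array — which is why 16×16 costs ~4× the LUTs of 8×8 in the numbers above.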

🤝 We Do — Sequential Shift-and-Add Multiplier

module mul_seq #(parameter W = 8) (
    input  wire           i_clk, i_start,
    input  wire [W-1:0]   i_a, i_b,
    output reg  [2*W-1:0] o_p, output reg o_done
);
    reg [W-1:0]   r_a, r_mask;
    reg [2*W-1:0] r_sum;
    reg [$clog2(W)-1:0] r_step;
    reg r_busy;
    // Current partial product: a, zero-extended and shifted by the step count
    wire [2*W-1:0] w_ext    = {{W{1'b0}}, r_a};
    wire [2*W-1:0] w_addend = r_mask[0] ? (w_ext << r_step) : {(2*W){1'b0}};
    always @(posedge i_clk) begin
        if (i_start) begin
            r_a <= i_a; r_sum <= 0; r_step <= 0; r_busy <= 1; o_done <= 0;
            r_mask <= i_b;
        end else if (r_busy) begin
            r_sum  <= r_sum + w_addend;
            r_mask <= r_mask >> 1;
            r_step <= r_step + 1;
            if (r_step == W-1) begin
                r_busy <= 0;
                o_done <= 1;
                o_p    <= r_sum + w_addend;  // include the final partial product
            end
        end
    end
endmodule
Together: Instead of 80 LUTs in parallel, we use ~20 LUTs across 8 clock cycles. Saves 75% area at the cost of 8× latency. This is the classic throughput-vs-area tradeoff. Perfect for low-throughput paths; wrong for 1-sample-per-clock.
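A cycle-by-cycle Python model of the same datapath is a handy golden model for checking the RTL in a testbench (assumption: one loop iteration corresponds to one clock after i_start):

```python
def mul_seq_model(a: int, b: int, w: int = 8):
    """Software model of the shift-and-add multiplier: W cycles per product."""
    r_mask, r_sum = b, 0
    trace = []
    for step in range(w):              # one iteration = one clock cycle
        if r_mask & 1:
            r_sum += a << step         # add the shifted partial product
        r_mask >>= 1
        trace.append(r_sum)            # r_sum after each cycle
    return r_sum, trace                # final o_p and the per-cycle history

product, trace = mul_seq_model(13, 11)
print(product)   # -> 143, after 8 "cycles"
```

Comparing `trace` against the simulated `r_sum` each cycle catches off-by-one bugs (like dropping the final partial product) that an end-to-end check can miss.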

Fixed-Point Arithmetic: Q-Format

Integers can't represent fractions, but full floating-point is expensive (~500 LUTs for a 32-bit FMUL). Fixed-point is the middle way:

Q4.12 format: 16 bits total = 4 integer bits + 12 fractional bits
  Range: -8.0 to +7.9998
  Resolution: 2^-12 = 0.000244

  Example: decimal 3.5 → binary 0011.100000000000 = 0x3800
           (3.5 * 2^12 = 14336 = 0x3800)
Key insight: A Q4.12 multiply is just a 16×16 integer multiply, followed by a right-shift of 12 to rescale. No special hardware needed — just integer operations and careful scaling. This is how virtually all signal processing on FPGAs and DSPs works (and how all of it worked before floating point became cheap enough for real time).
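The scaling rule is easy to verify in a few lines of Python (a software model of the Q4.12 datapath, not hardware):

```python
FRAC = 12  # Q4.12: 12 fractional bits

def to_q412(x: float) -> int:
    """Encode a real number as a Q4.12 integer."""
    return round(x * (1 << FRAC))

def q412_mul(a: int, b: int) -> int:
    """Q4.12 multiply: plain integer multiply, then rescale by 2^-12."""
    return (a * b) >> FRAC

a = to_q412(3.5)           # 3.5 * 2^12 = 14336 = 0x3800, as in the example
b = to_q412(2.0)
p = q412_mul(a, b)
print(hex(a), p / (1 << FRAC))   # -> 0x3800 7.0
```

The raw product of two Q4.12 values is Q8.24; the `>> FRAC` discards the extra fractional bits to get back to Q4.12 (a truncating rounding choice — real filters often add half an LSB before shifting).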

🧪 You Do — Signed Arithmetic Gotcha

reg  [7:0] a = 8'hFF;    // declared unsigned
wire [8:0] sum = a + 1;  // what is sum?

What does sum equal?

Answer: sum = 9'h100 = 256. a is unsigned, so 8'hFF = 255. Plus 1 = 256. The 9-bit result correctly extends to hold the carry.
The trap: If you'd declared a as reg signed [7:0] a = 8'hFF;, then a = -1 (two's complement), and sum = 9'h000 = 0. Same bits, different interpretation. Always use the signed keyword explicitly when you mean signed arithmetic; never rely on the implicit default.
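The same bits-vs-interpretation trap can be reproduced in Python (a software model; Python integers are unbounded, so bit widths must be applied explicitly with masks):

```python
def as_signed(value: int, bits: int) -> int:
    """Reinterpret a raw bit pattern as two's-complement signed."""
    if value & (1 << (bits - 1)):      # MSB set -> negative
        return value - (1 << bits)
    return value

a = 0xFF
unsigned_sum = (a + 1) & 0x1FF                 # 255 + 1 = 256 (9'h100)
signed_sum   = (as_signed(a, 8) + 1) & 0x1FF   # -1 + 1 = 0   (9'h000)
print(unsigned_sum, signed_sum)   # -> 256 0
```

Identical input bits, two different 9-bit results — the declaration, not the data, decides which one you get.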
▶ LIVE DEMO

Adder & Multiplier Synthesis Comparison

~5 minutes

▸ COMMANDS

cd labs/week3_day10/ex2_numerics/
make stat WIDTH=8
make stat WIDTH=16
make stat WIDTH=32
# vs sequential multiplier:
make stat_seq WIDTH=16

▸ EXPECTED STDOUT

Parallel mult:
  8-bit:  82 LUTs
  16-bit: 348 LUTs
  32-bit: FAILS (too big)

Sequential mult:
  16-bit: 58 LUTs, 16 cycles

▸ KEY OBSERVATION

Same function, 6× area reduction by accepting 16× latency. When to pick which? Throughput requirement. One-sample-per-100-cycles audio processing? Sequential. One-sample-per-cycle filter tap? Parallel.
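The throughput decision is arithmetic you can do in two lines. A Python back-of-envelope check (the 48 kHz audio rate and 25 MHz clock are illustrative figures, not from the lab):

```python
# How many clock cycles are available per sample, and how many 16-cycle
# sequential multiplies fit in that budget?
clk_hz, sample_hz = 25_000_000, 48_000
cycles_per_sample = clk_hz // sample_hz      # cycle budget per sample
seq_mult_cycles   = 16                       # latency of the sequential mult
muls_per_sample   = cycles_per_sample // seq_mult_cycles
print(cycles_per_sample, muls_per_sample)    # -> 520 32
```

With 520 cycles per sample, one sequential multiplier can serve ~32 filter taps — plenty for audio. A one-sample-per-cycle path has a budget of exactly 1, which forces the parallel version.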

🤖 Check the Machine

Ask AI: “I need a 24×24 fixed-point Q8.16 multiplier. Parallel won't fit on iCE40. Design a sequential shift-and-add multiplier and estimate area and latency.”

TASK

AI designs a sequential wide multiplier.

BEFORE

Predict: ~70 LUTs + 24 cycles. Must document Q8.16 scaling.

AFTER

Strong AI discusses Q-format scaling. Weak AI treats it as integer-only.

TAKEAWAY

Fixed-point requires explicit scaling discussion. AI skipping it = bug.

Key Takeaways

  • + is a family of architectures. On iCE40, use SB_CARRY (write +, trust the tool).
  • Multiplication is O(N²) in area. Wide multipliers fill small FPGAs fast.
  • Sequential multiplier: time-for-area. Essential when parallel won't fit.
  • Fixed-point (Q-format) = integer math + virtual decimal point + right-shift.
  • Signed arithmetic: always declare signed explicitly.

Match the arithmetic architecture to the throughput requirement. Never default to parallel.

🔗 Transfer

PPA: Performance, Power, Area

Video 3 of 4 · ~11 minutes

▸ WHY THIS MATTERS NEXT

You've now made timing and arithmetic choices. Video 3 steps back: every hardware choice sits on three axes — Performance (Fmax, throughput), Power (static + dynamic), Area (LUTs, EBRs, dollars). Real engineering lives in that three-dimensional space. You'll learn to chart your design, do design-space exploration, and report honestly.