Video 2 of 4 · ~15 minutes
Dr. Mike Borowczak · Electrical & Computer Engineering · CECS · UCF
Every DSP chip markets by “multiply-accumulate operations per second.” A Qualcomm Hexagon DSP does ~10^12 MACs/s because each multiplier is a dedicated block. A Google TPU has 65,536 multipliers in one systolic array. Even your phone's voice assistant runs on custom arithmetic units optimized for matrix math. Architecture choices at this layer determine how many FLOPS per watt — the central metric of the AI era.
Your Day 5 counter uses a + 1. Your Day 11 UART uses baud_cnt < HALF_PERIOD. Your capstone with filters or transforms will need real multipliers. Choosing the right arithmetic architecture for the job is the difference between “fits at 25 MHz” and “doesn't fit.”
+ Is Not One Thing
“The + operator in Verilog always produces the same hardware. What else could ‘addition’ be?”
Synthesizers choose among several adder architectures based on context, target, and width: ripple-carry (simple, slow — O(N) delay), carry-lookahead (fast, larger — O(log N) delay), carry-select (medium), Kogge-Stone (fastest, largest, ASIC-only usually). On iCE40, Yosys uses the dedicated SB_CARRY chain — essentially fast ripple with hardware acceleration.
| Architecture | Delay (N-bit) | Area (N-bit) | Use when |
|---|---|---|---|
| Ripple-carry | O(N) | O(N) | Narrow (<16 bits), low clock |
| iCE40 SB_CARRY chain | O(N), fast | O(N) | Default on iCE40 — use this |
| Carry-lookahead | O(log N) | O(N log N) | ASIC, wide, high clock |
| Carry-select | O(√N) | O(N) | Middle-ground FPGA designs |
| Kogge-Stone | O(log N) | O(N log N) | Very wide, aggressive ASIC |
Write c = a + b and let Yosys use SB_CARRY. The dedicated carry chain is fast (it propagates through dedicated routing, not through LUTs). Don't hand-roll ripple-carry adders — they're strictly worse than what the tool infers. Trust the tool.
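A minimal sketch of that advice (module and signal names are illustrative, not from the lab files): a registered 16-bit adder written with plain +. On iCE40, Yosys maps this onto the SB_CARRY chain with no extra effort in the source.

```verilog
// Registered 16-bit adder: plain `+` is all the tool needs.
module add16 (
    input  wire        i_clk,
    input  wire [15:0] i_a, i_b,
    output reg  [16:0] o_sum          // one extra bit captures the carry-out
);
    always @(posedge i_clk)
        o_sum <= i_a + i_b;           // Yosys infers the SB_CARRY chain here
endmodule
```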
// 8-bit unsigned multiplier
assign product = a * b; // compact source, expensive silicon
On iCE40 (no dedicated multipliers):
module mul_seq #(parameter W = 8) (
    input  wire           i_clk, i_start,
    input  wire [W-1:0]   i_a, i_b,
    output reg  [2*W-1:0] o_p,
    output reg            o_done
);
    reg [W-1:0]         r_a, r_mask;
    reg [2*W-1:0]       r_sum;
    reg [$clog2(W)-1:0] r_step;
    reg                 r_busy;

    // Partial sum after this step: add the shifted multiplicand only
    // when the current multiplier bit is set.
    wire [2*W-1:0] w_next = r_mask[0] ? r_sum + ({{W{1'b0}}, r_a} << r_step)
                                      : r_sum;

    always @(posedge i_clk) begin
        if (i_start) begin
            r_a <= i_a; r_mask <= i_b;
            r_sum <= 0; r_step <= 0; r_busy <= 1; o_done <= 0;
        end else if (r_busy) begin
            r_sum  <= w_next;
            r_mask <= r_mask >> 1;
            r_step <= r_step + 1;
            // Capture w_next, not r_sum: the non-blocking update to r_sum
            // hasn't landed yet, so o_p <= r_sum here would silently drop
            // the final partial product.
            if (r_step == W-1) begin r_busy <= 0; o_done <= 1; o_p <= w_next; end
        end
    end
endmodule
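A quick testbench sketch for mul_seq (signal and module names here are illustrative, not part of the lab): pulse i_start, wait for o_done, and compare o_p against the expected product.

```verilog
// Drives mul_seq with 13 * 20 and checks for 260.
module tb_mul_seq;
    reg         clk = 0, start = 0;
    reg  [7:0]  a, b;
    wire [15:0] p;
    wire        done;

    mul_seq #(.W(8)) dut (.i_clk(clk), .i_start(start),
                          .i_a(a), .i_b(b), .o_p(p), .o_done(done));

    always #5 clk = ~clk;             // 100 MHz-ish free-running clock

    initial begin
        a = 8'd13; b = 8'd20;
        @(negedge clk) start = 1;     // one-cycle start pulse
        @(negedge clk) start = 0;
        wait (done);                  // W cycles later
        if (p !== 16'd260) $display("FAIL: p = %d", p);
        else               $display("PASS: 13 * 20 = %d", p);
        $finish;
    end
endmodule
```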
Integers can't represent fractions, but full floating-point is expensive (~500 LUTs for a 32-bit FMUL). Fixed-point is the middle way:
Q4.12 format: 16 bits total = 4 integer bits + 12 fractional bits
Range: -8.0 to +7.9998
Resolution: 2^-12 = 0.000244
Example: decimal 3.5 → binary 0011.100000000000 = 0x3800
(3.5 * 2^12 = 14336 = 0x3800)
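In RTL, Q4.12 math is just integer math plus a renormalizing shift. A sketch under assumed signal names: multiplying two Q4.12 values yields a Q8.24 intermediate, and an arithmetic right-shift by 12 brings it back to Q4.12.

```verilog
// Q4.12 multiply: 3.5 * 2.0 = 7.0
wire signed [15:0] q_a    = 16'sh3800;    // 3.5 in Q4.12
wire signed [15:0] q_b    = 16'sh2000;    // 2.0 in Q4.12
wire signed [31:0] q_full = q_a * q_b;    // Q8.24 intermediate
wire signed [15:0] q_out  = q_full >>> 12; // renormalize: 0x7000 = 7.0 in Q4.12
```

Note the >>> (arithmetic shift) to preserve the sign bit; this sketch truncates, while production designs often round and saturate instead.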
reg [7:0] a = 8'hFF; // declared unsigned
wire [8:0] sum = a + 1; // what is sum?
What does sum equal?
a is unsigned, so 8'hFF = 255. Plus 1 = 256. The 9-bit result correctly extends to hold the carry.
But declare a as reg signed [7:0] a = 8'hFF; and now a = -1 (two's complement), so sum = 9'h000 = 0. Same bits, different interpretation. Always use the signed keyword explicitly when you mean signed arithmetic; never leave it to the implied default.
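The signed variant of the quiz above, spelled out (note that sum must also be declared signed, or the expression reverts to unsigned context):

```verilog
reg  signed [7:0] a   = 8'hFF;   // same bits as before, now interpreted as -1
wire signed [8:0] sum = a + 1;   // -1 + 1 = 0, so sum = 9'h000
```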
~5 minutes
▸ COMMANDS
cd labs/week3_day10/ex2_numerics/
make stat WIDTH=8
make stat WIDTH=16
make stat WIDTH=32
# vs sequential multiplier:
make stat_seq WIDTH=16
▸ EXPECTED STDOUT
Parallel mult:
8-bit: 82 LUTs
16-bit: 348 LUTs
32-bit: FAILS (too big)
Sequential mult:
16-bit: 58 LUTs, 16 cycles
▸ KEY OBSERVATION
Same function, 6× area reduction by accepting 16× latency. When to pick which? Throughput requirement. One-sample-per-100-cycles audio processing? Sequential. One-sample-per-cycle filter tap? Parallel.
Ask AI: “I need a 24×24 fixed-point Q8.16 multiplier. Parallel won't fit on iCE40. Design a sequential shift-and-add multiplier and estimate area and latency.”
TASK
AI designs a sequential wide multiplier.
BEFORE
Predict: ~70 LUTs + 24 cycles. Must document Q8.16 scaling.
AFTER
Strong AI discusses Q-format scaling. Weak AI treats it as integer-only.
TAKEAWAY
Fixed-point requires explicit scaling discussion. AI skipping it = bug.
① + is a family of architectures. On iCE40, use SB_CARRY (write +, trust tool).
② Multiplication is O(N²) area. Wide multipliers fill small FPGAs fast.
③ Sequential multiplier: time-for-area. Essential when parallel won't fit.
④ Fixed-point (Q-format) = integer math + virtual decimal + right-shift.
⑤ Signed arithmetic: always declare signed explicitly.
🔗 Transfer
Video 3 of 4 · ~11 minutes
▸ WHY THIS MATTERS NEXT
You've now made timing and arithmetic choices. Video 3 steps back: every hardware choice sits on three axes — Performance (Fmax, throughput), Power (static + dynamic), Area (LUTs, EBRs, dollars). Real engineering lives in that three-dimensional space. You'll learn to chart your design, do design-space exploration, and report honestly.