Day 10 · Timing, Numerics & PPA

Timing & Constraints Essentials

Video 1 of 4 · ~12 minutes

Dr. Mike Borowczak · Electrical & Computer Engineering · CECS · UCF

Timing · Numerics · PPA · ASIC Context

🌍 Where This Lives

In Industry

Tapeouts are timing closure. A modern SoC has hundreds of thousands of timing paths, and every one must meet setup and hold. Teams of timing engineers spend months closing timing at the target frequency. A chip that fails timing at the corners ships late, ships slow, or doesn't ship at all. When Intel's Pentium 4 topped out near 4 GHz instead of the projected 10 GHz, timing was a primary culprit (leakage plus critical paths too long to hide).

In This Course

Your Go Board runs at 25 MHz (40 ns period). Every design so far has stayed comfortably under that — we've never shown you a failing timing report. Today changes that. You'll learn to read nextpnr's report, identify critical paths, and apply the three-move playbook when timing fails.

⚠️ Combinational Depth Kills Clock Speed

❌ Wrong Model

“If I want my design to run at a higher clock rate, I just set a faster clock. The logic is the logic.”

✓ Right Model

The maximum clock rate is set by the longest combinational path between any two flip-flops. That path has a delay (tclk-to-Q + Σ tLUT + tsetup), and your clock period must be at least that long. A deep combinational chain = long delay = slow maximum clock. To go faster: shorten the chains by inserting pipeline flops.

The receipt: A 32-bit ripple-carry adder on iCE40 has a critical path of ~12 ns. Max clock: ~80 MHz. A pipelined or carry-lookahead adder reaches 200+ MHz on the same chip, same inputs. Same function, different architecture, different timing.

Setup & Hold — The Timing Contract

    clk ────┐           ┌───
            └───────────┘

    D   ────────∎────────────
           setup│    │hold
                │    │
                └──edge
    
  • Setup time (tsetup): D must be stable before the clock edge (iCE40 ~0.4 ns)
  • Hold time (thold): D must remain stable after the clock edge (iCE40 ~0.3 ns)
  • Clock-to-Q (tclk-to-Q): propagation from clock edge to Q output (iCE40 ~0.8 ns)
Setup timing equation: Tclk ≥ tclk-to-Q + tcomb(FF→FF) + tsetup (the clock period must accommodate all three terms)
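Plugging the iCE40 numbers above into the setup equation gives the combinational budget per cycle at 25 MHz. A back-of-envelope sketch, not a substitute for the tool's report:

```python
T_CLK2Q = 0.8        # iCE40 clock-to-Q, ns (approx.)
T_SETUP = 0.4        # iCE40 setup time, ns (approx.)
PERIOD_25MHZ = 40.0  # 1 / 25 MHz, in ns

# Rearranging Tclk >= t_clk2q + t_comb + t_setup for the comb term:
comb_budget = PERIOD_25MHZ - T_CLK2Q - T_SETUP
print(round(comb_budget, 1))  # → 38.8 ns of combinational delay available
```

At 25 MHz the flop overhead is tiny; almost the whole period is yours to spend on logic. At 200 MHz (5 ns period) the same 1.2 ns of overhead eats a quarter of the budget.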

👁️ I Do — Reading the nextpnr Timing Report

Info: Critical path report for clock 'clk_25mhz' (posedge -> posedge):
Info: curr total
Info: 0.8  0.8   Source u_counter.r_count_reg[5]_DFFR_Q
Info: 1.4  2.2   Net count[5]                              (fanout = 3)
Info: 1.1  3.3   Source u_alu.u_add.SB_LUT4_I3_O
Info: 1.4  4.7   Net adder_inter[5]                        (fanout = 2)
Info: 1.1  5.8   Source u_alu.u_mux.SB_LUT4_I0_O
Info: 0.4  6.2   Sink u_out.r_out_reg[5]_DFFR_D (setup)
Info: 6.2 ns delay estimate, frequency 161.3 MHz
My thinking: Read from top: starts at r_count[5]'s flop output, wire delay (Net), through an adder LUT, wire, through a mux LUT, to r_out[5]'s flop D input with setup. Total: 6.2 ns. Max frequency: 1/6.2 ns = 161 MHz. At 25 MHz (40 ns period), we have 33.8 ns of slack — plenty.
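The report's bottom line is just arithmetic on the path total; a quick check with the numbers above:

```python
path_delay_ns = 6.2   # total from the critical path report
period_ns = 40.0      # 25 MHz target clock

fmax_mhz = 1000.0 / path_delay_ns      # ns → MHz
slack_ns = period_ns - path_delay_ns   # margin at the target clock
print(round(fmax_mhz, 1), round(slack_ns, 1))  # → 161.3 33.8
```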

🤝 We Do — When Timing Fails: Three Moves

  1. Pipeline the critical path. Insert a register mid-path to split it into two shorter paths. Adds one cycle of latency but roughly halves the delay. Canonical fix for arithmetic.
  2. Reduce combinational width. Replace a 32-bit ripple-carry adder with a tree or carry-lookahead structure. Or use DSP blocks for multiplication. Often free if you write the idiom.
  3. Reduce fanout. A signal driving hundreds of loads has long routing delay. Clone the driver so each copy drives fewer loads, or add a pipeline flop to break fanout.
Together: In practice, “pipeline it” is the answer 80% of the time. Adding a pipeline register is cheap (one flop per bit) and pays dividends immediately. The cost is latency — an extra cycle of delay — which is usually affordable.

🧪 You Do — Diagnose This Failure

Info: Critical path report:
Info: 0.8  0.8   Source r_a_reg[0]_DFFR_Q
Info: 1.5  2.3   Net a[0] -> adder input
Info: 1.1  3.4   Source u_add.full_adder_0.SB_LUT4_I1_O
Info: 1.4  4.8   Net carry[1] -> full_adder_1.cin
Info: 1.1  5.9   Source u_add.full_adder_1.SB_LUT4_I1_O
Info: ... (30 more full_adder stages ...)
Info: 1.1 45.3   Source u_add.full_adder_31.SB_LUT4_I1_O
Info: 0.4 45.7   Sink r_sum_reg[31]_DFFR_D (setup)
Info: 45.7 ns delay, frequency 21.9 MHz

ERROR: max frequency for clock 'clk_25mhz' is 21.9 MHz, target 25 MHz

What's wrong, and which fix?

Diagnosis: 32-bit ripple-carry adder. The carry chain is the critical path — 32 LUT stages in series. Fix: pipeline the adder (register the low half's sum and carry between bits 15 and 16, adding 1 cycle of latency but roughly halving the critical path), or rewrite as assign sum = a + b and let Yosys map it onto the dedicated SB_CARRY fast-carry chain.
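A behavioral model of the pipelined fix, in Python rather than RTL so the data movement is explicit. The 16/16 split point and names are illustrative — the real fix lives in adder.v:

```python
MASK16 = (1 << 16) - 1

def pipelined_add32(a, b):
    """Model of a two-stage pipelined 32-bit add: cycle 1 adds the low
    16 bits and registers the carry; cycle 2 finishes the high 16 bits.
    Each stage's carry chain is ~half the original, so Fmax ~doubles."""
    # --- cycle 1: low half into pipeline registers ---
    low = (a & MASK16) + (b & MASK16)
    r_low, r_carry = low & MASK16, low >> 16
    r_a_hi, r_b_hi = (a >> 16) & MASK16, (b >> 16) & MASK16  # ride along
    # --- cycle 2: high half consumes the registered carry ---
    high = (r_a_hi + r_b_hi + r_carry) & MASK16
    return (high << 16) | r_low

assert pipelined_add32(0xFFFF_FFFF, 1) == 0          # wraps like 32-bit HW
assert pipelined_add32(0x0000_FFFF, 1) == 0x1_0000   # carry crosses the cut
```

Note the high halves of a and b must be registered too ("ride along"), so every input reaches stage 2 with consistent timing — forgetting those flops is the classic pipelining bug.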
▶ LIVE DEMO

Live Timing Closure

~5 minutes — make a design fail, then fix it

▸ COMMANDS

cd labs/week3_day10/ex1_timing/
make timing      # fails at 25 MHz
# edit adder.v — add pipeline reg
make timing      # passes
diff adder_slow.v adder_fast.v

▸ EXPECTED STDOUT

BEFORE:
  Fmax: 21.9 MHz — FAIL

AFTER (pipelined):
  Fmax: 85.3 MHz — PASS
  (1 cycle extra latency)

▸ KEY OBSERVATION

Same inputs, same output. One register added. Fmax went from 21.9 → 85.3 MHz. Latency: one extra cycle. This is the full toolkit: sacrifice latency for throughput when you need to hit a timing target.

🔧 Where Timing Lives in the Flow

  Tool           Does What                                 Output You Care About
  yosys          RTL → gate-level netlist                  Cell count (from stat)
  nextpnr-ice40  Place & route, timing analysis            Fmax (critical path)
  icetime        Static timing analysis of placed design   Path-by-path timing report
Checkpoint: Yosys says “how many cells.” nextpnr says “how fast can they run.” Both metrics matter. Today's focus: the second one. Tomorrow: balancing both.

🤖 Check the Machine

Ask AI: “My 32-bit ripple adder has Fmax 22 MHz, I need 50 MHz. Show me a pipelined version with the minimum number of pipeline stages to meet timing.”

TASK

Ask AI for a pipelined adder.

BEFORE

Predict: 1 pipeline stage ≈ doubles Fmax. So 1 stage gets to ~44 MHz. Need 2 stages for 50 MHz safely.

AFTER

Strong AI picks proper split points. Weak AI just says “add a register” with no breakdown.

TAKEAWAY

Pipelining is a quantitative decision — estimate before implementing.
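The BEFORE-step prediction can itself be a quick model. This sketch assumes each cut splits the combinational delay evenly (optimistic — real split points rarely land exactly mid-path) and uses approximate iCE40 flop overhead:

```python
T_OVERHEAD = 0.8 + 0.4   # iCE40 clk-to-Q + setup, ns (approx.)

def stages_needed(path_ns, target_mhz):
    """Smallest number of pipeline cuts so each segment meets the
    period, assuming an even split of the combinational delay."""
    period = 1000.0 / target_mhz
    comb = path_ns - T_OVERHEAD
    cuts = 0
    while comb / (cuts + 1) + T_OVERHEAD > period:
        cuts += 1
    return cuts

print(stages_needed(45.7, 50))  # → 2 (one cut only reaches ~43 MHz)
```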

Key Takeaways

 Fmax = 1 / (longest combinational path between flops).

 Read nextpnr's critical path report — it names your bottleneck.

 Three fixes: pipeline, reduce width, reduce fanout.

 Pipelining trades latency for throughput. Usually a worthy trade.

Every design has a critical path. You just haven't met yours yet.

🔗 Transfer

Numerical Architecture Tradeoffs

Video 2 of 4 · ~15 minutes

▸ WHY THIS MATTERS NEXT

Timing tells you “how fast.” Video 2 asks “why so slow?” — and answers with adder architectures, multiplier explosions, fixed-point arithmetic, and the tricks that turn a 32-cycle math operation into a 1-cycle one. By the end you'll know which + the synthesizer built and what you could have built instead.