Day 14 · Advanced Verification & Road Ahead

PPA Methodology

Video 3 of 4 · ~12 minutes

Dr. Mike Borowczak · Electrical & Computer Engineering · CECS · UCF

Assertions · AI Verification · PPA · Road Ahead

🌍 Where This Lives

In Industry

Every chip has PPA goals before design starts. “4 GHz @ 50W @ 50 mm²” is the sentence that opens an Apple CPU spec. The design process is then a structured hunt against those numbers: measure, optimize the worst metric, re-measure, decide when good enough. At Intel, this is formalized as the “PPA dashboard” tracked weekly; at AMD, it's the “area/power/frequency budget”; at every company, it's a methodology, not a vibe.

In This Course

Your Day 10 PPA intro gave you the triangle. Today's methodology video makes it actionable: the three-step loop (measure → identify bottleneck → optimize), the reports you actually read, and the tradeoffs that matter for your capstone integration.

⚠️ “Optimize Everything”

❌ Wrong Model

“I'll make the design as fast, as small, and as low-power as I can.”

✓ Right Model

You can't optimize all three. They are in active tension. The methodology is: measure all three, find the binding constraint, push on only that one. If your design is area-limited (won't fit on the iCE40 HX1K), don't chase Fmax. If you're meeting Fmax with 70% utilization, don't aggressively shrink — you'll hurt timing. PPA is navigation, not optimization.

The receipt: Every optimization trades one metric against another. Pipelining adds Fmax but costs area. Operator sharing saves area but costs Fmax. Clock gating saves power but complicates timing analysis. There are no free wins. The methodology helps you pick which trade to make.

👁️ I Do — The Three-Step PPA Loop

  1. Measure. Run synthesis + place-and-route. Collect: cell count (SB_LUT4, SB_DFF, SB_CARRY totals), max frequency (from nextpnr report), and toggle-rate-weighted activity estimate (for power).
  2. Identify the binding constraint.
    • Fmax below target? → timing-critical
    • Cell count near device max? → area-critical
    • Battery life below target? → power-critical
    • None binding? → you're done; ship.
  3. Optimize the constraint. Apply targeted transformations: pipelining for timing, operator sharing for area, clock gating for power. Re-measure after each transformation.
My thinking: The loop is boring, but boringness is the point. Each pass is small, measured, and reversible. The opposite — “let me rewrite this big chunk to be faster” — often breaks other things. Measure small changes.
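The identify step can be sketched as a tiny triage function. This is an illustrative sketch, not a real tool: the 85% utilization limit and the timing > area > power priority are assumptions you'd tune per project.

```python
def binding_constraint(fmax_mhz, target_mhz, luts_used, luts_total,
                       power_mw=None, budget_mw=None, util_limit=0.85):
    """Return the one metric to optimize next, or None if nothing binds.

    Priority (timing > area > power) and the 85% utilization limit are
    illustrative assumptions, not universal rules.
    """
    if fmax_mhz < target_mhz:
        return "timing"                      # Fmax below target
    if luts_used / luts_total > util_limit:
        return "area"                        # placement/routing will struggle
    if power_mw is not None and budget_mw is not None and power_mw > budget_mw:
        return "power"                       # over the power budget
    return None                              # nothing binding: ship

# The report from the "We Do" section below: 43.2 MHz vs 25 MHz, 487/1280 LUTs
print(binding_constraint(43.2, 25, 487, 1280))  # prints None
```

The hard-coded order is the design choice worth arguing about: a battery-powered product might put power first.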

🤝 We Do — Reading a Real Report

$ make stat && make place
=== Synthesis (Yosys → iCE40) ===
  SB_LUT4:      487   (38% of HX1K's 1280 LUTs)
  SB_DFF:       124
  SB_CARRY:     42
  SB_RAM40_4K:  2 of 16 used
  Total cells:  653

=== Placement + Routing (nextpnr) ===
  Device: iCE40 HX1K-TQ144
  Utilization: LUTs 38%, DFFs 9%, BRAM 12%
  Max frequency: 43.2 MHz (target: 25 MHz) ✓
  Critical path: 23.1 ns
      SB_IO (i_data[0]) → 5.2 ns
      SB_LUT4 (datapath[0]_cmp) → 7.4 ns
      SB_CARRY chain (18 stages) → 8.9 ns
      SB_DFF (result[31]) → 1.6 ns
Together — diagnosis: Fmax = 43.2 MHz against a 25 MHz target → timing is fine, not the constraint. LUT utilization = 38% → area is fine. The design ships as-is. If the target had been 100 MHz, that 18-stage carry chain would be the first bottleneck: splitting it with a pipeline register halves its 8.9 ns contribution and pushes Fmax into the mid-50s; reaching 100 MHz would take further pipelining of the 7.4 ns LUT stage too. Measure → identify → target.
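If you run this flow often, it's worth scraping the numbers instead of reading them by eye. A minimal sketch, assuming the report text matches the format shown above; real Yosys/nextpnr wording varies by version, so treat the patterns as a template.

```python
import re

# Hypothetical report text mirroring the output shown above; real tool
# logs word these lines differently, so adjust the patterns to match.
report = """\
  SB_LUT4:      487   (38% of HX1K's 1280 LUTs)
  Max frequency: 43.2 MHz (target: 25 MHz)
"""

luts = int(re.search(r"SB_LUT4:\s+(\d+)", report).group(1))
fmax = float(re.search(r"Max frequency:\s*([\d.]+)\s*MHz", report).group(1))
target = float(re.search(r"target:\s*([\d.]+)\s*MHz", report).group(1))

print(f"LUTs={luts}, Fmax={fmax} MHz, margin={fmax / target:.2f}x")
```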

🧪 You Do — Pick the Optimization

Your UART + FIFO echo design reports:

  • Fmax: 48 MHz (target: 25 MHz)
  • Utilization: LUTs 72% of HX1K, DFFs 45%
  • You want to double the FIFO depth from 16 to 32

Question: which metric is most likely to block the change, and what's the mitigation?

Answer: Area. At 72% LUT utilization, doubling FIFO depth pushes area toward 85-90%; nextpnr will start struggling to place, and timing may degrade. Mitigation: use BRAM (SB_RAM40_4K) instead of a LUT-based FIFO. One 4 Kbit block holds a 32-deep × 8-wide FIFO (256 bits) with no LUTs spent on storage; only the small pointer/control logic stays in fabric. This is the textbook case for moving from distributed to block memory.
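A quick capacity check makes the BRAM argument concrete. Back-of-envelope sketch only: the 4096-bit capacity is the SB_RAM40_4K block size, and the helper name is made up for illustration.

```python
BRAM_BITS = 4096  # one iCE40 SB_RAM40_4K block (4 Kbit)

def fits_in_one_bram(depth, width):
    """Does depth x width storage fit in a single 4 Kbit block?
    Sketch: ignores the block's fixed width/depth configurations."""
    return depth * width <= BRAM_BITS

print(fits_in_one_bram(32, 8))    # 256 bits  -> True
print(fits_in_one_bram(512, 8))   # 4096 bits -> True (exactly full)
print(fits_in_one_bram(1024, 8))  # 8192 bits -> False (needs 2 blocks)
```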
▶ LIVE DEMO

PPA Sweep: Pipelining an Adder

~6 minutes

▸ COMMANDS

cd labs/week4_day14/ex3_ppa/
for p in 0 1 2 4; do
  make clean
  make all PIPE=$p
  grep -E 'Max freq|Cells' logs/report.txt
done
python scripts/plot_ppa.py logs/*.txt

▸ EXPECTED STDOUT

PIPE=0:  44 MHz,  98 cells
PIPE=1:  82 MHz, 132 cells
PIPE=2: 121 MHz, 167 cells
PIPE=4: 143 MHz, 245 cells
# Fmax vs cells — diminishing
# returns past PIPE=2

▸ THE PARETO FRONTIER

Plot Fmax vs. cells — you get a classic Pareto curve. Beyond PIPE=2 the curve flattens: you're spending more area for less speed. The knee of the curve is almost always the right answer. This is what an engineer means by “I found the sweet spot.”
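One way to locate the knee programmatically: keep adding stages while the marginal gain in MHz per extra cell stays above a cutoff. The data is the sweep above; the 0.5 MHz-per-cell cutoff is an arbitrary illustrative threshold, not a standard.

```python
# Sweep results from the live demo above: PIPE stages -> (Fmax MHz, cells)
sweep = {0: (44, 98), 1: (82, 132), 2: (121, 167), 4: (143, 245)}

def knee(points, min_mhz_per_cell=0.5):
    """Last config whose marginal MHz-per-cell gain beats the cutoff.
    The default cutoff is an illustrative choice."""
    stages = sorted(points)
    best = stages[0]
    for prev, cur in zip(stages, stages[1:]):
        d_mhz = points[cur][0] - points[prev][0]
        d_cells = points[cur][1] - points[prev][1]
        if d_mhz / d_cells < min_mhz_per_cell:
            break                # gain has flattened: stop here
        best = cur
    return best

print(knee(sweep))  # prints 2: PIPE=2 -> 4 gains only 22 MHz for 78 cells
```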

🔧 Top-5 FPGA PPA Moves (Memorize These)

  Goal      Move                                                    Cost
  ↑ Fmax    Add a pipeline register in the critical path            +1 flop per stage, +1 cycle latency
  ↑ Fmax    Use SB_CARRY chain instead of generic LUTs for adders   Already automatic; check the synthesis report
  ↓ Area    Move FIFOs/ROMs to SB_RAM40_4K BRAM                     Only pays off for ≥8×8 storage
  ↓ Area    Share operators across mutually exclusive uses          Extra muxes, possible Fmax hit
  ↓ Power   Clock-gate idle logic                                   Requires careful timing; tricky on iCE40
Why this matters: These five cover 80% of the cases you'll see in senior design and capstone projects. Not every FPGA has the same primitives, but the category of moves is universal.

🤖 Check the Machine

Ask AI: “Here's my synthesis report: 72% LUT usage, Fmax 28 MHz, target 50 MHz. My critical path is a 32-bit ripple adder. Give me 3 optimization strategies ordered by expected Fmax gain per area cost.”

TASK

AI ranks PPA moves by ROI.

BEFORE

Predict: pipeline (cheapest), carry-select (middle), Kogge-Stone (expensive).

AFTER

Strong AI gives expected Fmax numbers + area deltas. Weak AI just lists moves without quantification.

TAKEAWAY

Good optimization advice has numbers attached. Without them it's just opinions.

Key Takeaways

  • PPA is navigation, not unbounded optimization. Measure, identify the binding constraint, target it.

  • Every optimization trades one metric against another. No free wins.

  • The Pareto curve's knee is almost always the right answer.

  • Five moves (pipeline, carry chain, BRAM, operator sharing, clock gating) cover most cases.

Measure. Optimize the binding constraint. Stop at the knee.

🔗 Transfer

Coverage & The Road Ahead

Video 4 of 4 · ~10 minutes

▸ WHY THIS MATTERS NEXT

You've learned the tools of verification. Video 4 addresses the completeness question: how do you know you've verified enough? Coverage analysis answers this. And then we look at where the field is going — formal verification, UVM, HLS, open-source silicon, and what HDL careers look like in 2026 and beyond.