Day 10 · Timing, Numerics & PPA

PPA — Performance, Power, Area

Video 3 of 4 · ~11 minutes

Dr. Mike Borowczak · Electrical & Computer Engineering · CECS · UCF

🌍 Where This Lives

In Industry

Every chip tapeout is a PPA negotiation. Apple's A-series wants maximum performance-per-watt. ARM Cortex-M0+ wants minimum area. NVIDIA H100 wants maximum compute in a given power envelope. Product managers set targets; designers iterate architectures; physical designers tune implementation; a spreadsheet tracks every variant against P-P-A and only one ships. Engineering careers ARE PPA optimization.

In This Course

Every design you've built can be measured on these three axes. Today's lab has you produce a PPA report for three FSM variants. Your capstone will be judged on PPA, and the Day 14 capstone retrospective reviews each design against its PPA targets.

⚠️ There Is No Free Lunch — PPA Is a Triangle

❌ Wrong Model

“Good engineering produces a chip that's fast, low-power, and small. Better engineers optimize all three.”

✓ Right Model

PPA is a triangle: improving one axis usually costs another. Pipelining = more performance, more area (extra flops), more power (extra flops switch). Clock gating = less power, more control logic. Lower voltage = less power, lower Fmax. Real design picks the operating point. “Optimizing all three” means picking the best tradeoff, not dominating each individually.

The receipt (approximate figures): Apple's M3 is roughly 15% faster than M2 but draws roughly 20% more power. NVIDIA's H100 delivers roughly 3× the performance of A100 at close to 2× the power (700 W vs. 400 W rated for the SXM parts). No free lunches; measurable tradeoffs.
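The tradeoffs in the Right Model line up with the standard first-order relation for CMOS dynamic power. This is a textbook approximation added here for reference, not a formula from the lecture itself; the symbols are the usual ones: activity factor α, switched capacitance C, supply voltage V_DD, and clock frequency f.

```latex
P_{\text{dyn}} = \alpha \, C \, V_{DD}^{2} \, f
```

Read against the triangle: extra pipeline flops raise C, clock gating lowers α, and lowering V_DD cuts power quadratically while also slowing the gates, which is why Fmax drops.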

👁️ I Do — FPGA PPA Proxies

Axis	ASIC Metric	FPGA Proxy
Performance	Fmax (MHz), Cycles/operation	nextpnr Fmax, RTL cycles
Power	Static (leakage) + Dynamic (activity)	Cell count × Fmax × activity (rough)
Area	mm² of silicon, gates	LUTs + FFs + EBRs (from yosys stat)

My thinking: On FPGAs, exact power numbers require vendor tools (Lattice Diamond, Intel Quartus). For education, cell count × clock × activity factor is a reasonable proxy: doubling cells with everything else equal → ~2× dynamic power. Absolute accuracy isn't the point; relative comparisons across your own design variants are.
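A minimal sketch of that proxy, assuming a unitless score is enough for relative comparisons. The function name `power_proxy`, the cell counts, the 25 MHz clock, and the 0.2 activity factor are all hypothetical values chosen for illustration.

```python
def power_proxy(cells: int, clock_mhz: float, activity: float) -> float:
    """Unitless dynamic-power proxy: cell count x clock x activity factor."""
    if not 0.0 <= activity <= 1.0:
        raise ValueError("activity must be a fraction between 0 and 1")
    return cells * clock_mhz * activity


# Two hypothetical variants of one design at the same clock and activity.
baseline = power_proxy(cells=100, clock_mhz=25.0, activity=0.2)
doubled = power_proxy(cells=200, clock_mhz=25.0, activity=0.2)

# Only the ratio is meaningful; the absolute score has no physical unit.
ratio = doubled / baseline
print(f"doubling cells changes the proxy by {ratio:.1f}x")
```

The absolute score means nothing on its own; compare it only between variants measured under the same clock and activity assumptions.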

🤝 We Do — The Tradeoff in Action

Same 16-bit FIR filter, 4 variants:

Variant	LUTs	EBRs	Fmax	Cycles/sample	Throughput
Fully-serial	90	0	180 MHz	16	11 Msps
Fully-parallel	820	0	105 MHz	1	105 Msps
Pipelined parallel	860	0	185 MHz	1 (+5 lat)	185 Msps
BRAM-stored coeffs	300	1	150 MHz	4	37 Msps

Together: Four points in PPA space for the same filter. Best throughput: pipelined parallel (185 Msps). Smallest: fully-serial (90 LUTs). Best throughput-per-LUT: pipelined parallel (~215 ksps/LUT); the other three cluster around 125 ksps/LUT. Best use of iCE40: the BRAM variant, because it uses otherwise-idle EBRs. “Best” depends on the requirements.
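The derived columns can be recomputed from the raw ones, which is a useful sanity check on any PPA table. The sketch below uses the four rows from the table above; the `Variant` class and its property names are illustrative, and throughput is taken as Fmax divided by cycles per sample (the table rounds the results down to whole Msps).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    luts: int
    ebrs: int
    fmax_mhz: float
    cycles_per_sample: int

    @property
    def throughput_msps(self) -> float:
        # One sample every `cycles_per_sample` clock cycles.
        return self.fmax_mhz / self.cycles_per_sample

    @property
    def ksps_per_lut(self) -> float:
        # Area efficiency: throughput (in ksps) per LUT used.
        return self.throughput_msps * 1000.0 / self.luts


# Raw numbers copied from the FIR table.
VARIANTS = [
    Variant("fully-serial", 90, 0, 180.0, 16),
    Variant("fully-parallel", 820, 0, 105.0, 1),
    Variant("pipelined parallel", 860, 0, 185.0, 1),
    Variant("BRAM-stored coeffs", 300, 1, 150.0, 4),
]

for v in VARIANTS:
    print(f"{v.name:20s} {v.throughput_msps:7.2f} Msps {v.ksps_per_lut:7.1f} ksps/LUT")

most_efficient = max(VARIANTS, key=lambda v: v.ksps_per_lut)
print("highest throughput per LUT:", most_efficient.name)
```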

🧪 You Do — Pick the Winner

Given the four variants above, pick the right one for:

1. Audio processing: 48 ksps in, 100 mW power budget
2. Video pre-processing: 100 Msps in, area is no object
3. Teaching lab on Go Board: must fit alongside other logic
4. Latency-critical control loop: 10 ns from input to output

Answers:

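1. Fully-serial: 48 ksps << 11 Msps; minimizes area and power
2. Pipelined parallel: 185 Msps > 100 Msps, meets throughput with margin (fully-parallel clears it by only 5 Msps)
3. BRAM variant: 300 LUTs leaves room; uses free EBRs
4. Fully-parallel (no pipeline): 1 cycle at 105 MHz ≈ 9.5 ns; the pipelined variant's 5 extra cycles alone cost ~27 ns at 185 MHz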
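The hard constraints in these scenarios (minimum throughput, maximum latency) can be checked mechanically before any judgment call. The sketch below is a feasibility filter only: it does not choose among the survivors. The latency cycle counts are assumptions: the non-pipelined variants are taken to have latency equal to their cycles per sample, and the pipelined variant is taken as 1 + 5 = 6 cycles.

```python
# (name, Fmax in MHz, cycles per sample, input-to-output latency in cycles)
VARIANTS = [
    ("fully-serial", 180.0, 16, 16),        # assumed latency = cycles/sample
    ("fully-parallel", 105.0, 1, 1),
    ("pipelined parallel", 185.0, 1, 6),    # assumed 1 + 5 pipeline cycles
    ("BRAM-stored coeffs", 150.0, 4, 4),    # assumed latency = cycles/sample
]


def throughput_msps(fmax_mhz: float, cycles_per_sample: int) -> float:
    return fmax_mhz / cycles_per_sample


def latency_ns(fmax_mhz: float, latency_cycles: int) -> float:
    # One cycle at f MHz lasts 1000 / f nanoseconds.
    return latency_cycles * 1000.0 / fmax_mhz


def feasible(min_msps: float = 0.0, max_latency_ns: float = float("inf")):
    """Names of variants that meet both hard constraints."""
    return [
        name
        for name, fmax, cps, lat in VARIANTS
        if throughput_msps(fmax, cps) >= min_msps
        and latency_ns(fmax, lat) <= max_latency_ns
    ]


audio = feasible(min_msps=0.048)         # 48 ksps
video = feasible(min_msps=100.0)         # 100 Msps
control = feasible(max_latency_ns=10.0)  # 10 ns input to output

print("audio  :", audio)
print("video  :", video)
print("control:", control)
```

When several variants survive the filter, the soft goals (power budget, leftover area, idle EBRs, margin) decide the pick, which is the part that still needs an engineer.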

Structured PPA Reporting

For any design, your PPA report should include:

Target: chip family, clock rate, throughput requirement
Area: LUTs, FFs, EBRs (from make stat); utilization % (from nextpnr)
Performance: Fmax (nextpnr), cycles/operation, end-to-end latency
Power proxy: cell count × Fmax × activity estimate
Variants considered: at least 2, with side-by-side PPA numbers
Recommendation: which variant you'd ship and why

Pro tip: Include a “requirements vs. measured” table at the top. If a measurement misses its requirement (Fmax below target, or area/power over budget), the design does not ship. If it beats the requirement by a wide margin, you've over-engineered (wasted area/power). If it lands right on the requirement, you're a genius. Include the delta in the report.
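One way to build the “requirements vs. measured” table is to tag each requirement as a floor (measured must be at least this, such as Fmax) or a ceiling (measured must be at most this, such as LUTs). The sketch below is illustrative: the `compare` helper, the 100 MHz target, and the LUT and FF budgets are hypothetical, while the measured values are the binary-encoded FSM numbers from the live demo.

```python
def compare(metric: str, required: float, measured: float, kind: str) -> dict:
    """Build one table row. kind is 'at_least' (floor) or 'at_most' (ceiling)."""
    if kind not in ("at_least", "at_most"):
        raise ValueError(f"unknown requirement kind: {kind}")
    passed = measured >= required if kind == "at_least" else measured <= required
    return {
        "metric": metric,
        "kind": kind,
        "required": required,
        "measured": measured,
        "delta": measured - required,  # signed; interpret using `kind`
        "pass": passed,
    }


rows = [
    compare("Fmax_MHz", 100, 165, "at_least"),  # hypothetical 100 MHz target
    compare("LUTs", 20, 12, "at_most"),         # hypothetical 20-LUT budget
    compare("FFs", 2, 2, "at_most"),            # hypothetical 2-FF budget
]

print(f"{'metric':10s} {'kind':9s} {'required':>8s} {'measured':>8s} {'delta':>6s} pass")
for r in rows:
    print(f"{r['metric']:10s} {r['kind']:9s} {r['required']:8.0f} "
          f"{r['measured']:8.0f} {r['delta']:+6.0f} {r['pass']}")

design_ships = all(r["pass"] for r in rows)
print("design ships:", design_ships)
```

A large positive delta on a floor, or a large negative delta on a ceiling, is the over-engineering signal worth calling out in the report.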

▶ LIVE DEMO

PPA of Three FSM Variants

~6 minutes — binary vs one-hot vs gray

▸ COMMANDS

cd labs/week3_day10/ex3_ppa/
make ppa_report   # produces CSV
cat report.csv

▸ EXPECTED OUTPUT

variant,LUT,FF,Fmax_MHz
binary, 12, 2, 165
onehot, 15, 4, 180  ← fastest
gray,   12, 2, 162
(same FSM, 3 encodings,
 measurable differences)
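Once the CSV exists, a few lines of scripting turn it into comparisons. The sketch below embeds a copy of the expected output instead of reading `report.csv` from disk, and it assumes the arrow and the parenthetical note are slide annotations that do not appear in the real file. The `load_report` helper is a hypothetical name.

```python
import csv
import io

# Copy of the expected output, minus the slide annotations (assumption).
REPORT = """variant,LUT,FF,Fmax_MHz
binary, 12, 2, 165
onehot, 15, 4, 180
gray,   12, 2, 162
"""


def load_report(text: str) -> list:
    """Parse the PPA CSV into typed rows, tolerating padding after commas."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        rows.append({
            "variant": raw["variant"].strip(),
            "LUT": int(raw["LUT"]),            # int() ignores surrounding spaces
            "FF": int(raw["FF"]),
            "Fmax_MHz": float(raw["Fmax_MHz"]),
        })
    return rows


rows = load_report(REPORT)
by_name = {r["variant"]: r for r in rows}

fastest = max(rows, key=lambda r: r["Fmax_MHz"])
extra_luts = by_name["onehot"]["LUT"] - by_name["binary"]["LUT"]
extra_mhz = by_name["onehot"]["Fmax_MHz"] - by_name["binary"]["Fmax_MHz"]

print("fastest encoding:", fastest["variant"])
print(f"one-hot vs binary: {extra_luts:+d} LUTs, {extra_mhz:+.0f} MHz")
```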

Design-Space Exploration

Rather than pick one variant, build several and plot them. Classic DSE visualization:

  Throughput
    ▲
    │   ●────● ← pipelined parallel
    │  /
    │ ●  ← fully parallel
    │/
    │●  ← BRAM variant
    │●  ← fully serial
    └──────────────────▶ Area (LUTs)

Pareto frontier: The outer edge, made up of variants that no other variant dominates. A dominated (inner) point is no better than some Pareto point on every axis and strictly worse on at least one. Only ship from the Pareto frontier. Publish both your measurements and the frontier you chose from.
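A minimal sketch of the dominance test on the two plotted axes (fewer LUTs is better, more throughput is better). The four FIR variants come from the earlier table; the fifth point, "hypothetical bloated", is invented here purely to show what a dominated design looks like.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    luts: int      # area axis: lower is better
    msps: float    # throughput axis: higher is better


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.luts <= b.luts and a.msps >= b.msps
    strictly_better = a.luts < b.luts or a.msps > b.msps
    return no_worse and strictly_better


def pareto_front(points: list) -> list:
    """Keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


POINTS = [
    Point("fully-serial", 90, 11.25),
    Point("BRAM-stored coeffs", 300, 37.5),
    Point("fully-parallel", 820, 105.0),
    Point("pipelined parallel", 860, 185.0),
    Point("hypothetical bloated", 900, 100.0),  # invented dominated example
]

front = sorted(pareto_front(POINTS), key=lambda p: p.luts)
for p in front:
    print(f"{p.name:22s} {p.luts:4d} LUTs {p.msps:7.2f} Msps")
```

On these two axes all four real variants sit on the frontier, since each one buys its extra throughput with extra LUTs. Adding a third axis such as the power proxy can only keep or grow the frontier, never shrink it.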

🤖 Check the Machine

Ask AI: “Compare the PPA of a 16-bit parallel multiplier versus a 16-bit sequential shift-and-add multiplier for iCE40 HX1K. Give me a tradeoff table.”

TASK

AI produces a PPA tradeoff table.

BEFORE

Predict: parallel ~350 LUTs, 1 cyc, ~140 MHz. Sequential ~60 LUTs, 16 cyc, ~180 MHz.

AFTER

Strong AI computes throughput/LUT. Weak AI just lists numbers without comparison.

TAKEAWAY

PPA tables should include derived metrics (efficiency), not just raw numbers.

Key Takeaways

① PPA = Performance, Power, Area. Three axes, no free lunches.

② FPGA proxies: Fmax (nextpnr) for performance, cell count (yosys stat) for area, cells × clock × activity for power.

③ Always build multiple variants. Pick from the Pareto frontier.

④ PPA report = requirements + measurements + variants + recommendation.

Measure first. Argue second. Ship third.

🔗 Transfer

Open-Source ASIC PPA

Video 4 of 4 · ~8 minutes

▸ WHY THIS MATTERS NEXT

Your PPA work so far targets FPGA. But the same RTL can target real silicon through open-source ASIC flows (OpenROAD, OpenLane). Video 4 ends the PPA day with a sneak peek: what happens when your Verilog becomes a chip? You'll see how FPGA PPA translates (and doesn't) to ASIC PPA, and why both matter.