Day 10 · Timing, Numerics & PPA

PPA — Performance, Power, Area

Video 3 of 4 · ~11 minutes

Dr. Mike Borowczak · Electrical & Computer Engineering · CECS · UCF


🌍 Where This Lives

In Industry

Every chip tapeout is a PPA negotiation. Apple's A-series wants maximum performance-per-watt. ARM Cortex-M0+ wants minimum area. NVIDIA H100 wants maximum compute in a given power envelope. Product managers set targets; designers iterate architectures; physical designers tune implementation; a spreadsheet tracks every variant against P-P-A and only one ships. Engineering careers ARE PPA optimization.

In This Course

Every design you've built can be measured on these three axes. Today's lab has you produce a PPA report for three FSM variants. Your capstone will be judged on PPA, and the Day 14 capstone retrospective reviews designs against their PPA targets.

⚠️ There Is No Free Lunch — PPA Is a Triangle

❌ Wrong Model

“Good engineering produces a chip that's fast, low-power, and small. Better engineers optimize all three.”

✓ Right Model

PPA is a triangle: improving one axis usually costs another. Pipelining = more performance, more area (extra flops), more power (extra flops switch). Clock gating = less power, more control logic. Lower voltage = less power, lower Fmax. Real design picks the operating point. “Optimizing all three” means picking the best tradeoff, not dominating each individually.

The receipt: Apple's M3 is roughly 15% faster than M2 but consumes roughly 20% more power. NVIDIA's H100 delivers roughly 3× the performance of A100 but consumes nearly 2× the power. No free lunches; measurable tradeoffs.
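
The voltage and pipelining tradeoffs in the Right Model follow from the standard first-order CMOS dynamic power relation, P ≈ α · C · V² · f. Here is a minimal sketch of that scaling. The function name and every numeric value are assumptions chosen only to show the trend, not measurements of any real chip:

```python
def dynamic_power(activity, capacitance_f, voltage_v, freq_hz):
    """First-order CMOS dynamic power: P = alpha * C * V^2 * f (watts)."""
    return activity * capacitance_f * voltage_v ** 2 * freq_hz

# Illustrative baseline values (assumed, not measured).
base = dynamic_power(activity=0.1, capacitance_f=1e-9, voltage_v=1.2, freq_hz=100e6)

# Lower the supply by 10% at the same clock: power falls with V squared.
low_v = dynamic_power(activity=0.1, capacitance_f=1e-9, voltage_v=1.08, freq_hz=100e6)

# Pipelining-style change: 5% more switched capacitance, 75% higher clock.
piped = dynamic_power(activity=0.1, capacitance_f=1.05e-9, voltage_v=1.2, freq_hz=175e6)

print(f"baseline       : {base * 1e3:.2f} mW")
print(f"-10% voltage   : {low_v / base:.2f}x baseline")
print(f"pipelined-style: {piped / base:.2f}x baseline")
```

The sketch holds the clock fixed when the voltage drops. In practice the lower supply also reduces the achievable Fmax, which is the cost named above.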

👁️ I Do — FPGA PPA Proxies

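Axis        | ASIC Metric                           | FPGA Proxy
------------|---------------------------------------|--------------------------------------
Performance | Fmax (MHz), cycles/operation          | nextpnr Fmax, RTL cycles
Power       | Static (leakage) + dynamic (activity) | Cell count × Fmax × activity (rough)
Area        | mm² of silicon, gates                 | LUTs + FFs + EBRs (from yosys stat)
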
My thinking: On FPGAs, exact power numbers require vendor tools (Lattice Diamond, Intel Quartus). For education, cell count × clock × activity factor is a reasonable proxy: doubling cells with everything else equal → ~2× dynamic power. Absolute accuracy isn't the point; relative comparisons across your own design variants are.
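
A minimal sketch of that proxy, assuming a hypothetical helper `power_proxy` and placeholder numbers. The result is unitless and only meaningful as a ratio between variants of the same design:

```python
def power_proxy(cells, clock_mhz, activity):
    """Unitless relative dynamic-power proxy: cells x clock x activity."""
    return cells * clock_mhz * activity

# Hypothetical variants of one design; the numbers are placeholders.
baseline = power_proxy(cells=100, clock_mhz=25, activity=0.2)
doubled = power_proxy(cells=200, clock_mhz=25, activity=0.2)

# Doubling cells with everything else equal doubles the proxy.
print(f"doubled / baseline = {doubled / baseline:.1f}x")
```

The activity factor is the hardest input to estimate. Holding it constant across variants is a simplification that keeps the comparison relative.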

🤝 We Do — The Tradeoff in Action

Same 16-bit FIR filter, 4 variants:

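Variant            | LUTs | EBRs | Fmax    | Cycles/sample | Throughput
-------------------|------|------|---------|---------------|-----------
Fully-serial       | 90   | 0    | 180 MHz | 16            | 11 Msps
Fully-parallel     | 820  | 0    | 105 MHz | 1             | 105 Msps
Pipelined parallel | 860  | 0    | 185 MHz | 1 (+5 lat)    | 185 Msps
BRAM-stored coeffs | 300  | 1    | 150 MHz | 4             | 37 Msps
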
Together: Four points in PPA space for the same filter. Best throughput: pipelined parallel (185 Msps). Smallest: fully-serial (90 LUTs). Best throughput-per-LUT: pipelined parallel (about 215 ksps/LUT; the other three land between roughly 120 and 130 ksps/LUT). Best use of iCE40: the BRAM variant, because it uses otherwise-idle EBRs. “Best” depends on the requirements.
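
The derived columns can be recomputed from the raw ones: throughput is Fmax divided by cycles per sample, and efficiency is throughput divided by LUTs. A short sketch using the table's numbers (the helper name `derived` is an assumption):

```python
# (name, LUTs, Fmax in MHz, cycles per sample), copied from the table above.
variants = [
    ("Fully-serial", 90, 180, 16),
    ("Fully-parallel", 820, 105, 1),
    ("Pipelined parallel", 860, 185, 1),
    ("BRAM-stored coeffs", 300, 150, 4),
]

def derived(luts, fmax_mhz, cycles):
    """Return (throughput in Msps, efficiency in ksps per LUT)."""
    msps = fmax_mhz / cycles
    return msps, msps * 1000 / luts

results = {name: derived(luts, fmax, cyc) for name, luts, fmax, cyc in variants}

for name, (msps, ksps_per_lut) in results.items():
    print(f"{name:20s} {msps:7.2f} Msps {ksps_per_lut:7.1f} ksps/LUT")

# Variant with the highest throughput per LUT.
best = max(results, key=lambda n: results[n][1])
print("best throughput per LUT:", best)
```

Recomputing derived metrics from raw measurements is a cheap sanity check on any PPA table before drawing conclusions from it.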

🧪 You Do — Pick the Winner

Given the four variants above, pick the right one for:

  1. Audio processing: 48 ksps in, 100 mW power budget
  2. Video pre-processing: 100 Msps in, area is no object
  3. Teaching lab on Go Board: must fit alongside other logic
  4. Latency-critical control loop: 10 ns from input to output
Answers:
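  1. Fully-serial: 48 ksps << 11 Msps; minimizes area and power
  2. Pipelined parallel: 185 Msps > 100 Msps with margin; fully-parallel's 105 Msps barely clears the bar
  3. BRAM variant: 300 LUTs leaves room; uses free EBRs
  4. Fully-parallel (no pipeline): 1 cycle at 105 MHz is about 9.5 ns, inside the 10 ns budget; the pipelined variant's 5 cycles of latency at 185 MHz take about 27 ns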
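
The latency question in item 4 comes down to converting cycles to nanoseconds at each variant's Fmax. The sketch below makes two assumptions: non-pipelined variants deliver a result after their cycles/sample count, and the pipelined variant's "+5 lat" is read as 5 cycles of latency. The function name is hypothetical.

```python
def latency_ns(latency_cycles, fmax_mhz):
    """Input-to-output latency in ns for a given cycle count and clock."""
    return latency_cycles * 1000.0 / fmax_mhz

# Assumption: cycle counts as described in the lead-in, clocks from the table.
candidates = {
    "Fully-serial": latency_ns(16, 180),
    "Fully-parallel": latency_ns(1, 105),
    "Pipelined parallel": latency_ns(5, 185),
    "BRAM-stored coeffs": latency_ns(4, 150),
}

budget_ns = 10.0
meets = [name for name, ns in candidates.items() if ns <= budget_ns]

for name, ns in candidates.items():
    print(f"{name:20s} {ns:6.1f} ns")
print("within 10 ns:", meets)
```

The margin is thin: one cycle at 105 MHz leaves under 0.5 ns of slack, so a small Fmax regression would break the budget.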

Structured PPA Reporting

For any design, your PPA report should include:

  1. Target: chip family, clock rate, throughput requirement
  2. Area: LUTs, FFs, EBRs (from make stat); utilization % (from nextpnr)
  3. Performance: Fmax (nextpnr), cycles/operation, end-to-end latency
  4. Power proxy: cell count × Fmax × activity estimate
  5. Variants considered: at least 2, with side-by-side PPA numbers
  6. Recommendation: which variant you'd ship and why
Pro tip: Include a “requirements vs. measured” table at the top. If measured performance falls short of the requirement (or measured area/power exceeds its budget), the design does not ship. If measured performance far exceeds the requirement, you've over-engineered (wasted area/power). If measured = required, you're a genius. Include the delta in the report.
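
One way to lay out a “requirements vs. measured” table is to record, per metric, whether higher or lower is better, so the delta is read in the right direction. This is a sketch with placeholder requirements and a hypothetical `check` helper, not a prescribed report format:

```python
# Each row: (metric, required, measured, higher_is_better).
# All numbers are hypothetical placeholders for illustration.
rows = [
    ("Fmax (MHz)", 100.0, 105.0, True),
    ("Throughput (Msps)", 100.0, 105.0, True),
    ("LUTs", 1000.0, 820.0, False),
]

def check(required, measured, higher_is_better):
    """Return (delta, meets) where delta = measured - required."""
    delta = measured - required
    meets = delta >= 0 if higher_is_better else delta <= 0
    return delta, meets

report = [(m, req, meas, *check(req, meas, hib)) for m, req, meas, hib in rows]

print(f"{'metric':20s} {'required':>9s} {'measured':>9s} {'delta':>8s}  status")
for metric, req, meas, delta, meets in report:
    status = "meets" if meets else "MISSES"
    print(f"{metric:20s} {req:9.1f} {meas:9.1f} {delta:+8.1f}  {status}")
```

A large positive delta on a performance row, or a large negative delta on an area row, is the over-engineering signal.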
▶ LIVE DEMO

PPA of Three FSM Variants

~6 minutes — binary vs one-hot vs gray

▸ COMMANDS

cd labs/week3_day10/ex3_ppa/
make ppa_report   # produces CSV
cat report.csv

▸ EXPECTED OUTPUT

variant,LUT,FF,Fmax_MHz
binary, 12, 2, 165
onehot, 15, 4, 180  ← fastest
gray,   12, 2, 162
(same FSM, 3 encodings,
 measurable differences)
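
Once the CSV exists, picking out the fastest or smallest variant can be scripted. The sketch below assumes the real report.csv matches the expected output shown here without the slide annotations, and embeds that text as a sample string instead of reading the file:

```python
import csv
import io

# Sample text mirroring the expected output above, minus the slide
# annotations; the real report.csv layout is assumed to match.
sample = """variant,LUT,FF,Fmax_MHz
binary, 12, 2, 165
onehot, 15, 4, 180
gray,   12, 2, 162
"""

# skipinitialspace drops the padding after each comma.
reader = csv.DictReader(io.StringIO(sample), skipinitialspace=True)
rows = [
    {"variant": r["variant"], "LUT": int(r["LUT"]),
     "FF": int(r["FF"]), "Fmax_MHz": float(r["Fmax_MHz"])}
    for r in reader
]

fastest = max(rows, key=lambda r: r["Fmax_MHz"])
smallest = min(rows, key=lambda r: r["LUT"] + r["FF"])  # first row wins ties
print("fastest :", fastest["variant"], fastest["Fmax_MHz"], "MHz")
print("smallest:", smallest["variant"], smallest["LUT"] + smallest["FF"], "cells")
```

To run it against the lab output, replace `io.StringIO(sample)` with an open file handle for report.csv.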

Design-Space Exploration

Rather than pick one variant, build several and plot them. Classic DSE visualization:

  Throughput
    ▲
    │   ●────● ← pipelined parallel
    │  /
    │ ●  ← fully parallel
    │/
    │●  ← BRAM variant
    │●  ← fully serial
    └──────────────────▶ Area (LUTs)
Pareto frontier: The outer edge, made up of the variants that no other variant dominates. A dominated (inner) point is matched or beaten on every axis, and strictly beaten on at least one, by some point on the frontier. Only ship from the Pareto frontier. Publish both your measurements and the frontier you chose from.
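
A minimal sketch of a Pareto filter over the two plotted axes (area in LUTs, lower is better; throughput in Msps, higher is better). It uses the FIR table's numbers, and the function names and the fifth "bloated" variant are hypothetical:

```python
# (name, area in LUTs, throughput in Msps) from the FIR table.
points = [
    ("Fully-serial", 90, 11),
    ("BRAM-stored coeffs", 300, 37),
    ("Fully-parallel", 820, 105),
    ("Pipelined parallel", 860, 185),
]

def dominates(a, b):
    """True if a is no worse than b on both axes and better on at least one."""
    _, a_area, a_tput = a
    _, b_area, b_tput = b
    no_worse = a_area <= b_area and a_tput >= b_tput
    better = a_area < b_area or a_tput > b_tput
    return no_worse and better

def pareto_front(pts):
    """Keep the points that no other point dominates."""
    return [p for p in pts if not any(dominates(q, p) for q in pts)]

front = pareto_front(points)
print([name for name, _, _ in front])

# A hypothetical fifth variant that is larger and slower than fully-parallel.
extra = ("Hypothetical bloated", 900, 100)
print(extra in pareto_front(points + [extra]))
```

In this two-axis view none of the four FIR variants dominates another, so all four sit on the frontier, while the hypothetical fifth point is filtered out.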

🤖 Check the Machine

Ask AI: “Compare the PPA of a 16-bit parallel multiplier versus a 16-bit sequential shift-and-add multiplier for iCE40 HX1K. Give me a tradeoff table.”

TASK

AI produces a PPA tradeoff table.

BEFORE

Predict: parallel ~350 LUTs, 1 cyc, ~140 MHz. Sequential ~60 LUTs, 16 cyc, ~180 MHz.

AFTER

Strong AI computes throughput/LUT. Weak AI just lists numbers without comparison.

TAKEAWAY

PPA tables should include derived metrics (efficiency), not just raw numbers.

Key Takeaways

 PPA = Performance, Power, Area. Three axes, no free lunches.

 FPGA proxies: Fmax (nextpnr), cell count (yosys), activity × cells.

 Always build multiple variants. Pick from the Pareto frontier.

 PPA report = requirements + measurements + variants + recommendation.

Measure first. Argue second. Ship third.

🔗 Transfer

Open-Source ASIC PPA

Video 4 of 4 · ~8 minutes

▸ WHY THIS MATTERS NEXT

Your PPA work so far targets FPGA. But the same RTL can target real silicon through open-source ASIC flows (OpenROAD, OpenLane). Video 4 ends the PPA day with a sneak peek: what happens when your Verilog becomes a chip? You'll see how FPGA PPA translates (and doesn't) to ASIC PPA, and why both matter.