Day 9 · Memory Architecture

Practical Memory Applications

Video 4 of 4 · ~9 minutes

Dr. Mike Borowczak · Electrical & Computer Engineering · CECS · UCF


🌍 Where This Lives

In Industry

Every LED matrix display — the ones in elevators, billboards, stadium scoreboards — is a ROM-driven pattern sequencer. Every video card has a character ROM for text overlays. Every phone has sine tables for audio synthesis. Every modem has constellation ROMs for symbol decoding. The “counter addresses ROM” pattern is the canonical architecture for any computed-then-replayed signal.

In This Course

Your Day 11 UART HELLO demo uses this pattern to drive the character stream. Your capstone melody player (a common project) uses it for sine tables. Your Go Board's LED patterns, 7-seg sequences, and scrolling text all follow the same shape.

⚠️ Compute or Look Up?

❌ Wrong Model

“Runtime computation is always better than a lookup table — it uses less memory and is more 'clever.'”

✓ Right Model

Runtime compute costs logic (LUTs, DSPs). Table lookup costs memory (EBRs). FPGAs have abundant memory that would otherwise be wasted. For fixed functions — sine, log, gamma correction, CRC tables — lookup beats compute every time. The choice isn't about cleverness; it's about resource allocation.

The receipt: A 1024-entry × 10-bit sine table fits in 3 EBRs. Computing sine at runtime via iterative CORDIC takes ~200-400 LUTs plus multiple cycles per sample. The ROM is cheaper in area and faster.
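That arithmetic generalizes. A quick sanity-check sketch in Python, assuming the iCE40's 4 Kbit EBR size and counting raw bits only (real EBRs also have aspect-ratio limits, so this is a lower bound; the helper name is ours):

```python
import math

EBR_BITS = 4096  # one iCE40 EBR = 4 Kbit

def ebr_cost(depth, width):
    # Lower bound from raw bit capacity; ignores EBR aspect-ratio limits.
    return math.ceil(depth * width / EBR_BITS)

print(ebr_cost(1024, 10))  # 10 Kbit sine table -> 3
print(ebr_cost(768, 8))    # 6 Kbit RAM -> 2 (second EBR half-wasted)
print(ebr_cost(256, 16))   # 4 Kbit CRC table -> exactly 1
```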

👁️ I Do — Pattern Sequencer Architecture

module pattern_sequencer #(
    parameter STEP_LEN    = 10_000_000,   // cycles per pattern step (0.4s @ 25 MHz)
    parameter PATTERN_LEN = 16,            // number of steps in sequence
    parameter INIT_FILE   = "pattern.hex"
) (
    input  wire       i_clk, i_reset,
    output wire [7:0] o_leds
);
    localparam STEP_W = $clog2(STEP_LEN);
    localparam ADDR_W = $clog2(PATTERN_LEN);

    // Timer: counts STEP_LEN cycles, then pulses step_tick
    reg [STEP_W-1:0] r_step_counter;
    wire             w_step_tick = (r_step_counter == STEP_LEN - 1);
    always @(posedge i_clk) r_step_counter <= i_reset ? 0 : (w_step_tick ? 0 : r_step_counter + 1);

    // Address counter: advances on each step_tick
    reg [ADDR_W-1:0] r_addr;
    always @(posedge i_clk) r_addr <= i_reset ? 0 : (w_step_tick ? r_addr + 1 : r_addr);

    // ROM: 16 × 8 patterns
    rom_array #(.ADDR_W(ADDR_W), .DATA_W(8), .INIT_FILE(INIT_FILE))
      u_rom (.i_clk(i_clk), .i_addr(r_addr), .o_data(o_leds));
endmodule

My thinking: three sub-blocks: (1) a step timer (creates a slow tick), (2) an address counter (walks the pattern space), (3) a ROM (the pattern contents). Compose them with hierarchical instantiation. Change pattern.hex and you get a different animation, with no RTL changes.
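The hex file itself can come from a few lines of Python. A sketch for the 16-step ping-pong walk used in the live demo (the file name matches the INIT_FILE default above; the exact layout is one plausible choice):

```python
# 16-step ping-pong walk across 8 LEDs: 01 02 ... 80, then 80 40 ... 01.
steps = [1 << i for i in range(8)] + [1 << i for i in reversed(range(8))]

with open("pattern.hex", "w") as f:
    for s in steps:
        f.write(f"{s:02x}\n")  # one 8-bit pattern per line for $readmemh
```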

🤝 We Do — Sine Table Lookup

For an audio synthesizer, we need sin(x) at 1024 points per period with 10-bit precision:

// Python: generate the table at build time
// for i in range(1024): print(f"{int((math.sin(2*math.pi*i/1024)+1)/2 * 1023):03x}")

module sine_rom (
    input wire i_clk,
    input wire [9:0] i_phase,    // 0..1023 (full period)
    output reg [9:0] o_sample    // 0..1023 (scaled)
);
    reg [9:0] mem [0:1023];
    initial $readmemh("sine_1024x10.hex", mem);
    always @(posedge i_clk) o_sample <= mem[i_phase];
endmodule

Together: 1024 × 10 bits = 10 Kbit ≈ 3 EBRs (4 Kbit each). Cost: 3 EBRs plus 10 output flops. Equivalent compute (iterative CORDIC sine) would consume ~200-400 LUTs plus multiple cycles per sample. Memory wins on both area and speed. This is the FPGA trick.
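The one-line generator in the comment expands to a complete script along these lines (the file name matches the $readmemh argument; the offset-binary scaling of [-1, 1] onto [0, 1023] is the same as in the comment):

```python
import math

# Build a 1024-entry, 10-bit sine table, offset-binary: -1 -> 0, +1 -> 1023.
N, FULL_SCALE = 1024, 1023
samples = [int((math.sin(2 * math.pi * i / N) + 1) / 2 * FULL_SCALE)
           for i in range(N)]

with open("sine_1024x10.hex", "w") as f:
    f.write("\n".join(f"{s:03x}" for s in samples))  # 3 hex digits per line
```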

🧪 You Do — Design a 7-Seg Decoder ROM

Your 7-seg display needs to show hex digits 0-F. Input: 4-bit value. Output: 7-bit segment pattern. Is this a good ROM candidate?

Analysis:
  • Input space: 4 bits = 16 entries
  • Output: 7 bits
  • Total: 16 × 7 = 112 bits. Way smaller than 1 EBR.
  • Conclusion: Case-based ROM wins at this size (6-8 LUTs vs. 1 whole EBR). Small ROMs belong in LUTs; use EBRs only when the LUT cost exceeds ~100 cells.
Lesson: EBRs aren't always the right answer. Tiny lookup tables belong in LUTs. “Always use block RAM” is as wrong as “never use block RAM.” Pick by size.
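Even when a table this small stays in LUTs as a case statement, its contents can still be generated and checked offline. A sketch, assuming active-high segments in gfedcba order (bit 0 = segment a); these encodings are conventional, not taken from the lab files:

```python
# Hypothetical 7-seg encodings for hex digits 0-F: active-high, bit0=a..bit6=g.
SEG = [0x3f, 0x06, 0x5b, 0x4f, 0x66, 0x6d, 0x7d, 0x07,
       0x7f, 0x6f, 0x77, 0x7c, 0x39, 0x5e, 0x79, 0x71]

print("\n".join(f"{p:02x}" for p in SEG))  # 16 entries x 7 bits = 112 bits
```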

▶ LIVE DEMO

Pattern Sequencer on the Go Board

~5 minutes

▸ COMMANDS

cd labs/week3_day09/ex4_sequencer/
cat pattern.hex    # 16 × 8-bit patterns
make sim
make prog
# watch LEDs cycle
# edit pattern.hex, reprogram,
#   new animation — no RTL change

▸ EXPECTED BEHAVIOR

pattern.hex = ping-pong walk:
01 → 02 → 04 → 08 → 10 → 20 → 40 → 80 → 80 → 40 ...

Board shows one lit LED
sweeping right, then back;
pattern wraps every 6.4 sec
(16 steps × 0.4s)

▸ KEY OBSERVATION

Edit the hex file, reprogram — new animation. Content and architecture are decoupled. Adding a new pattern = 30 seconds of editing + 10 seconds of programming. No synthesis, no Verilog changes. This is the elegance of lookup-based designs.

More Memory-Driven Applications

Application | Pattern | Memory Cost
Audio sine synth | phase counter → sine ROM | 1-3 EBRs
7-seg message scroller | char counter → char ROM → 7-seg decoder | 1-2 EBRs
Microcoded controller | state counter → microcode ROM | 1-4 EBRs (depending on instruction width)
CRC lookup | byte in → CRC table → XOR accumulator | 1 EBR (256×16 table)
Gamma correction | pixel → gamma ROM | 1-2 EBRs
Font ROM (VGA text) | char + row → font ROM → pixel stream | 1-2 EBRs (128 chars × 8×8 pixels)

Your capstone probably uses 1-3 of these. If you're building anything that plays, displays, or transforms data according to a fixed rule, a ROM is almost certainly involved.
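The CRC row assumes the classic 256-entry byte-at-a-time table. A minimal generator sketch, using CRC-16/CCITT (polynomial 0x1021, MSB-first) as the example; the polynomial choice is ours, not from the course:

```python
POLY = 0x1021  # CRC-16/CCITT; any 16-bit polynomial works the same way

def crc16_table():
    # One entry per input byte: what 8 MSB-first shift/XOR steps do to it.
    table = []
    for byte in range(256):
        crc = byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ POLY if crc & 0x8000 else crc << 1) & 0xFFFF
        table.append(crc)
    return table

table = crc16_table()  # 256 x 16 bits = 4 Kbit -> exactly 1 EBR
```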

🤖 Check the Machine

Ask AI: “Design a Verilog pattern sequencer that drives 8 LEDs with a heartbeat pattern (pulse, pulse, pause, pulse, pulse, pause) at 1 Hz on a 25 MHz clock. Include Python code to generate the hex file.”

TASK

AI designs complete sequencer + table.

BEFORE

Predict: timer + address counter + array ROM + Python hex generator.

AFTER

Strong AI pairs Verilog and Python. Weak AI writes only Verilog with hard-coded tables.

TAKEAWAY

Separation of content and architecture = Python + Verilog.
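One plausible shape for the Python half of a strong answer (the 8-step layout and 125 ms step timing are our assumptions, not actual AI output):

```python
# Heartbeat as 8 steps x 125 ms = 1 s period: lub, gap, dub, long pause.
ON, OFF = 0xFF, 0x00  # all 8 LEDs pulse together (our choice)
heartbeat = [ON, OFF, ON, OFF, OFF, OFF, OFF, OFF]

with open("heartbeat.hex", "w") as f:
    f.write("\n".join(f"{s:02x}" for s in heartbeat))
# Pair with STEP_LEN = 3_125_000 (125 ms @ 25 MHz) and PATTERN_LEN = 8.
```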

Key Takeaways

 Pattern sequencer = timer + address counter + ROM.

 Lookup beats compute for fixed functions — memory is abundant, logic is dear.

 Tiny tables (< 100 bits) → LUTs. Large tables (> few hundred bits) → EBRs.

 Content (hex files) and architecture (RTL) should be decoupled.

If you can name what it should do and the rule is fixed, it probably belongs in a ROM.

Pre-Class Self-Check

Q1: What triggers block RAM inference on iCE40?

Idiomatic array + initial $readmemh (for ROM) or sync write + sync read (for RAM). The key: synchronous read via a registered output.

Q2: How many EBRs does a 768×8 RAM consume?

768 × 8 = 6 Kbit. Doesn't fit in one 4 Kbit EBR → needs 2 EBRs. The second one is half-wasted; you “pay” for 8 Kbit.

Pre-Class Self-Check (cont.)

Q3: Why is lookup usually better than runtime compute on FPGAs?

Memory (EBRs) is abundant and otherwise unused; logic (LUTs) is the scarce resource. Lookups consume “free” silicon; compute consumes the resource you need for everything else.

Q4: When does the case-ROM pattern still make sense?

For tables smaller than ~32 entries where the content is fixed and the readability of case is worth more than EBR allocation. 7-seg decoders, opcode decoders, small FSM outputs — all case-based.

🔗 End of Day 9

Tomorrow: Timing & PPA

Day 10 · Timing, Numerics, Performance-Power-Area

▸ WHY THIS MATTERS NEXT

You now know what your chip has (LUTs + EBRs) and how to use memory idioms. Day 10 answers the next question: how fast can it go, and what does it cost? You'll learn to read timing reports, understand numerical architecture tradeoffs, and measure PPA (Performance / Power / Area) — the three axes every real design is evaluated on.