Day 9 · Memory Architecture

ROM in Verilog

Video 1 of 4 · ~10 minutes

Dr. Mike Borowczak · Electrical & Computer Engineering · CECS · UCF

ROM · RAM · iCE40 EBRs · Applications

🌍 Where This Lives

In Industry

Most CPUs and SoCs start from a boot ROM that holds the first instructions executed after reset (Apple silicon, for example, runs an on-die boot ROM before handing off to later boot stages such as iBoot). Text-mode video hardware uses a character ROM to turn character codes into pixels. DSPs keep coefficient tables, sine tables, and windowing functions in ROM. GPU and FPU function units commonly use small lookup tables to seed approximations such as reciprocal and square root. “Pre-compute and look up” is often faster and lower-power than computing at runtime, so ROMs show up wherever silicon meets math.

In This Course

Today's ROM patterns appear in Day 9.4 (LED pattern sequencer), Day 10.2 (lookup-based multipliers), Day 11.3 (UART character ROM for HELLO demo), and every capstone design. ROM isn't a nice-to-have — it's the first memory you'll reach for.

⚠️ “The Tools Decide the Memory Resource”

❌ Wrong Model

“To use block RAM, I need a special SB_RAM40_4K primitive. I should instantiate it directly like any other module.”

✓ Right Model

You write a standard Verilog memory pattern (reg [7:0] mem [0:255]; ... data <= mem[addr];) and the synthesizer infers the resource. Small ROMs (around 16 entries) become LUT logic. Medium ROMs (around 256 entries) become block RAM, or distributed LUT-RAM on FPGA families that have it (iCE40 does not). Large ROMs become block RAM. Idiomatic code → tool-chosen target.

The receipt: Vendor primitives (SB_RAM40_4K for iCE40, RAMB36E2 for Xilinx) work but lock you to a chip family. Idiomatic patterns are portable across vendors, and mainstream synthesis tools recognize them.

👁️ I Do — Approach 1: case-Based ROM

module rom_case (
    input  wire [2:0] i_addr,
    output reg  [7:0] o_data
);
    always @(*) begin
        case (i_addr)
            3'd0: o_data = 8'h48;   // 'H'
            3'd1: o_data = 8'h45;   // 'E'
            3'd2: o_data = 8'h4C;   // 'L'
            3'd3: o_data = 8'h4C;   // 'L'
            3'd4: o_data = 8'h4F;   // 'O'
            default: o_data = 8'h00;
        endcase
    end
endmodule
My thinking: Combinational (always @(*)), no clock needed, because the ROM's contents are fixed at synthesis time. For 5 entries this is perfectly readable; at 256 entries it becomes a maintenance nightmare. The default case covers the unused addresses (3'd5 through 3'd7), so o_data is assigned on every path and no latch is inferred.

🤝 We Do — Approach 2: Array + $readmemh

module rom_array #(
    parameter ADDR_W = 8,
    parameter DATA_W = 8,
    parameter INIT_FILE = "rom_contents.hex"
) (
    input  wire              i_clk,
    input  wire [ADDR_W-1:0] i_addr,
    output reg  [DATA_W-1:0] o_data
);
    reg [DATA_W-1:0] mem [0:(2**ADDR_W)-1];

    initial $readmemh(INIT_FILE, mem);      // synth-time content load

    always @(posedge i_clk) o_data <= mem[i_addr];  // synchronous read
endmodule
Together: Three features: (1) parameterized depth and width; (2) content loaded from an external hex file, using the same $readmemh you saw in Day 6 testbenches, but here the synthesizer reads it to initialize the memory; (3) synchronous read, so o_data appears one cycle after i_addr. The synchronous read is the key: block RAM has a clocked read port, so a clocked read is what lets the tool infer it.
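The one-cycle read latency is easy to see in simulation. Below is a minimal testbench sketch, not the course's tb_rom: the module name tb_rom_latency is made up for illustration, and it assumes rom_array from above is compiled alongside it and that hello.hex holds the bytes 48 45 4C 4C 4F, one per line.

```verilog
`timescale 1ns/1ps
// Hypothetical testbench: shows that o_data updates only on the clock edge
// after the address changes. Compile together with rom_array.v.
// Assumption: hello.hex contains 48 45 4C 4C 4F (one byte per line).
module tb_rom_latency;
    reg        clk  = 1'b0;
    reg  [7:0] addr = 8'd0;
    wire [7:0] data;
    integer    i;

    rom_array #(
        .ADDR_W(8),
        .DATA_W(8),
        .INIT_FILE("hello.hex")
    ) dut (
        .i_clk (clk),
        .i_addr(addr),
        .o_data(data)
    );

    always #5 clk = ~clk;               // 10 ns clock period

    initial begin
        for (i = 0; i < 5; i = i + 1) begin
            @(negedge clk) addr = i[7:0];   // change address away from the sampling edge
            #1 $display("addr=%0d before edge: data=%h", addr, data);
            @(posedge clk);                 // the ROM samples i_addr here
            #1 $display("addr=%0d after edge:  data=%h (%c)", addr, data, data);
        end
        $finish;
    end
endmodule
```

Before each rising edge, data still shows the byte from the previous address; the new byte appears only after the edge. With Icarus Verilog, something like iverilog -o tb tb_rom_latency.v rom_array.v followed by vvp tb should run it (the testbench file name is an assumption).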

🧪 You Do — Identify the Bug

// Intended: a 1024-entry ROM that should infer block RAM
module rom_bad (input wire [9:0] addr, output wire [7:0] data);
    reg [7:0] mem [0:1023];
    initial $readmemh("rom.hex", mem);

    assign data = mem[addr];        // ← combinational read
endmodule

Why will the tool not infer block RAM here?

Answer: Combinational read (assign, not a clocked always block). iCE40 block RAM reads require a clock edge, so the synthesizer cannot use an EBR and must build this 1024-entry array out of LUTs as combinational logic. For 8 Kbit of content that is on the order of hundreds of LUT4s, versus zero LUTs when the ROM maps to block RAM. Fix: add a clock input (input wire clk), declare data as output reg, and read with always @(posedge clk) data <= mem[addr];.
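Applied to the module above, the fix might look like the following sketch. The name rom_fixed is chosen here for illustration; the file name rom.hex is carried over from rom_bad.

```verilog
// Sketch of the corrected ROM: same array and init file as rom_bad,
// but with a clock port and a registered (synchronous) read.
module rom_fixed (
    input  wire       clk,      // added: block RAM reads need a clock
    input  wire [9:0] addr,
    output reg  [7:0] data      // reg, because it is assigned in a clocked block
);
    reg [7:0] mem [0:1023];

    initial $readmemh("rom.hex", mem);

    // Synchronous read: data is valid one cycle after addr.
    always @(posedge clk) data <= mem[addr];
endmodule
```

The cost of the fix is one cycle of read latency, which any logic that consumes data has to account for.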
▶ LIVE DEMO

Case vs Array ROM: Same Output, Different Silicon

~5 minutes

▸ COMMANDS

cd labs/week3_day09/ex1_rom/
cat hello.hex    # 'H','E','L','L','O'
make stat_case    # tiny case-ROM
make stat_array   # synchronous array ROM
make sim
gtkwave tb_rom.vcd &

▸ EXPECTED STDOUT

=== rom_case ===
  SB_LUT4: 5   SB_DFF: 0

=== rom_array ===
  SB_LUT4: 0   SB_DFF: 8
  SB_RAM40_4K: 1  ← block RAM!

5 × 8 = 40 bits stored in a 4 Kbit EBR.

▸ KEY OBSERVATION

The 5-entry case ROM costs 5 LUTs. The 5-entry array ROM costs 1 block RAM. For this size, case is cheaper. But the same array code scales to 512 × 8-bit entries inside that one 4 Kbit EBR with zero more LUTs, and larger tables just add EBRs (1024 × 8 needs 2). The case ROM, by contrast, grows with every entry. The array pattern wins at scale.

🔧 What Did the Tool Build?

$ yosys -p "read_verilog rom_array.v; chparam -set ADDR_W 10 rom_array; synth_ice40 -top rom_array; stat"

=== rom_array ===    # ADDR_W=10, 1024 × 8-bit ROM (8 Kbit)
   Number of wires:                 15
   Number of cells:                 10
     SB_DFFE                         8     ← output register
     SB_RAM40_4K                     2     ← 2 block RAMs @ 4Kbit each
                                            = 8 Kbit total ✓
What to notice: 1024 entries × 8 bits = 8 Kbit, needs 2 EBRs (each 4 Kbit). Cost: 2 of iCE40 HX1K's 16 EBRs (12.5% of block RAM). Zero LUTs for the ROM itself — the ROM “disappeared” into dedicated memory silicon.
Preview: Video 3 shows iCE40's memory resources in detail: 16 EBRs, aspect-ratio options, budgeting for complete designs. You'll learn to plan memory usage.
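
The EBR arithmetic above can be reproduced with a small simulation-only helper. This is a rough sketch: the module and function names (ebr_estimate, ebr_count) are made up for illustration, and the estimate only divides total bits by 4 Kbit, so treat it as a lower bound. The real count also depends on the EBR aspect-ratio options covered in Video 3.

```verilog
// Simulation-only sketch: estimate EBR usage from the total bit count.
// Assumption: each iCE40 EBR holds 4096 bits (4 Kbit), as stated above.
module ebr_estimate;
    localparam integer EBR_BITS = 4096;

    // Ceiling division of total ROM bits by bits per EBR.
    function integer ebr_count;
        input integer depth;
        input integer width;
        begin
            ebr_count = (depth * width + EBR_BITS - 1) / EBR_BITS;
        end
    endfunction

    initial begin
        $display(" 256 x 8 -> %0d EBR", ebr_count(256, 8));         // prints 1
        $display("1024 x 8 -> %0d EBR", ebr_count(1024, 8));        // prints 2
        $display("HX1K budget: 16 EBR = %0d bits", 16 * EBR_BITS);  // prints 65536
        $finish;
    end
endmodule
```

The 1024 × 8 case matches the two SB_RAM40_4K cells in the report, and the budget line shows why a handful of such tables is affordable on an HX1K while a very large one is not.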

🤖 Check the Machine

Ask AI: “Write a synchronous-read ROM in Verilog with 256 entries of 16 bits, loaded from a hex file. Include a testbench.”

TASK

Ask AI for a parameterized BRAM-inferred ROM.

BEFORE

Predict: array, $readmemh, clocked read with <=, parameters.

AFTER

Strong AI uses synchronous read. Weak AI uses assign — won't infer BRAM.

TAKEAWAY

Verify with make stat that SB_RAM40_4K appears.

Key Takeaways

Write idiomatic RTL; let the tool pick the resource.

Case ROM: readable for small tables, doesn't scale.

Array + $readmemh: scales up to the chip's block-RAM budget, with content kept in an external file.

Synchronous read is the trigger for block-RAM inference.

If it's more than 16 entries, use array + hex file + clocked read.

🔗 Transfer

RAM in Verilog

Video 2 of 4 · ~10 minutes

▸ WHY THIS MATTERS NEXT

ROM is read-only — great for fixed content. Real designs also need read-write memory: FIFOs for buffering, register files for CPUs, frame buffers for video. Video 2 shows the RAM pattern that infers the same block-RAM silicon, plus the read-before-write vs. write-first choice that shapes your next-cycle behavior.