Day 11 · UART Transmitter

UART TX Implementation

Video 3 of 4 · ~12 minutes

Dr. Mike Borowczak · Electrical & Computer Engineering · CECS · UCF

Protocol · TX Architecture · Implementation · PC Connection

🌍 Where This Lives

In Industry

Nearly every embedded engineer you'll work with has written a UART TX. It's a rite of passage, the “hello world” of serial communication. On your first FPGA job, you will likely be asked to either (a) write one, or (b) debug one someone else wrote. Textbook implementations run on billions of chips worldwide. An elegant UART TX = competent RTL engineer; a messy UART TX = red flag.

In This Course

Today we write the whole thing. ~60 lines total. You'll compile it, simulate it, synthesize it, watch the waveform. Video 4 connects it to your PC. Your capstone protocol layer uses this exact pattern. If you internalize the idioms now, they become automatic for every subsequent protocol block.

⚠️ Start Simple — Add Features Only When Tested

❌ Wrong Approach

“I'll code everything in one go: FSM + datapath + baud gen + valid/busy + FIFO + parity. Then debug the whole mess together.”

✓ Right Approach

Iterative integration: (1) FSM alone with fake outputs, (2) add baud counter, (3) add shift register, (4) add valid/busy handshake, (5) add the stop bit. Test each addition before continuing. When a bug appears, you know which component is wrong, because the previous ones already passed.

The receipt: 5-step iterative integration = 5 short debug sessions. Monolithic build = 1 long, painful debug session where everything could be wrong at once.
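
As a sketch of step (1), here is one way the FSM-only stage might look before any timing or data logic exists. The module name uart_tx_step1 and the exposed o_state port are assumptions made for illustration; each state lasts a single clock and o_tx carries placeholder values.

```verilog
// Step 1 sketch (hypothetical module): state sequencing only.
// No baud counter and no shift register yet, so every state lasts one clock.
module uart_tx_step1 (
    input  wire       i_clk, i_reset,
    input  wire       i_valid,
    output reg  [1:0] o_state,   // exposed so a testbench can check sequencing
    output reg        o_tx
);
    localparam [1:0] S_IDLE = 2'd0, S_START = 2'd1, S_DATA = 2'd2, S_STOP = 2'd3;

    always @(posedge i_clk) begin
        if (i_reset) begin
            o_state <= S_IDLE;
            o_tx    <= 1'b1;                 // idle level
        end else case (o_state)
            S_IDLE:  begin o_tx <= 1'b1; if (i_valid) o_state <= S_START; end
            S_START: begin o_tx <= 1'b0; o_state <= S_DATA; end
            S_DATA:  begin o_tx <= 1'b1; o_state <= S_STOP; end  // fake data bit
            S_STOP:  begin o_tx <= 1'b1; o_state <= S_IDLE; end
        endcase
    end
endmodule
```

Once IDLE → START → DATA → STOP → IDLE sequences correctly in simulation, the baud counter can be added to stretch each state, and any new bug is confined to that addition.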

👁️ I Do — Complete UART TX

module uart_tx #(
    parameter CLKS_PER_BIT = 217       // 25 MHz / 115200 baud
) (
    input  wire       i_clk, i_reset,
    input  wire       i_valid,
    input  wire [7:0] i_data,
    output reg        o_busy,
    output reg        o_tx
);
    localparam [1:0] S_IDLE = 2'd0, S_START = 2'd1, S_DATA = 2'd2, S_STOP = 2'd3;
    localparam       CNT_W  = $clog2(CLKS_PER_BIT);

    reg [1:0]       r_state;
    reg [CNT_W-1:0] r_baud;     // 0..CLKS_PER_BIT-1 per bit
    reg [2:0]       r_bit;      // 0..7, which data bit
    reg [7:0]       r_shift;    // the byte being shifted out

    always @(posedge i_clk) begin
        if (i_reset) begin
            r_state  <= S_IDLE; r_baud <= 0; r_bit <= 0;
            o_tx     <= 1'b1;   o_busy <= 1'b0;
        end else case (r_state)
            S_IDLE: begin
                o_tx   <= 1'b1;
                o_busy <= 1'b0;
                if (i_valid) begin
                    r_shift <= i_data;
                    r_state <= S_START;  r_baud <= 0;  o_busy <= 1'b1;
                end
            end
            S_START: begin
                o_tx <= 1'b0;     // start bit
                if (r_baud == CLKS_PER_BIT-1) begin
                    r_state <= S_DATA; r_baud <= 0; r_bit <= 0;
                end else r_baud <= r_baud + 1'b1;
            end
            S_DATA: begin
                o_tx <= r_shift[0];   // LSB first
                if (r_baud == CLKS_PER_BIT-1) begin
                    r_baud  <= 0;
                    r_shift <= {1'b1, r_shift[7:1]};  // shift right, fill with idle
                    if (r_bit == 3'd7) r_state <= S_STOP;
                    else              r_bit  <= r_bit + 1'b1;
                end else r_baud <= r_baud + 1'b1;
            end
            S_STOP: begin
                o_tx <= 1'b1;     // stop bit
                if (r_baud == CLKS_PER_BIT-1) begin
                    r_state <= S_IDLE; o_busy <= 1'b0;
                end else r_baud <= r_baud + 1'b1;
            end
        endcase
    end
endmodule
My thinking: about 55 lines, everything in one always block. A single-block FSM is acceptable here because state and outputs are tightly coupled to timing, so this module deliberately departs from the usual 3-block pattern. CLKS_PER_BIT parameterizes the baud rate. The other familiar idioms all appear: $clog2, localparam, named state constants.
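
The value 217 comes from dividing the clock frequency by the baud rate. A minimal sketch of deriving it at elaboration time, assuming a hypothetical wrapper named uart_tx_hz with CLK_HZ and BAUD parameters (neither is part of the module above):

```verilog
// Hypothetical wrapper: computes CLKS_PER_BIT from clock and baud rate,
// then instantiates the uart_tx module from this video unchanged.
module uart_tx_hz #(
    parameter CLK_HZ = 25_000_000,
    parameter BAUD   = 115_200
) (
    input  wire       i_clk, i_reset,
    input  wire       i_valid,
    input  wire [7:0] i_data,
    output wire       o_busy,
    output wire       o_tx
);
    // Round to nearest: (25_000_000 + 57_600) / 115_200 = 217
    localparam CLKS_PER_BIT = (CLK_HZ + BAUD/2) / BAUD;

    uart_tx #(.CLKS_PER_BIT(CLKS_PER_BIT)) u_tx (
        .i_clk(i_clk), .i_reset(i_reset),
        .i_valid(i_valid), .i_data(i_data),
        .o_busy(o_busy), .o_tx(o_tx)
    );
endmodule
```

With these defaults the actual rate is 25,000,000 / 217 ≈ 115,207 baud, an error below 0.01%. Adding BAUD/2 before dividing rounds to the nearest integer instead of truncating, which matters more when the ratio is small.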

🤝 We Do — Reading the Shift Register

r_shift <= {1'b1, r_shift[7:1]};   // shift right, LSB drops off
o_tx    <= r_shift[0];              // output the current LSB
Together: Classic PISO pattern from Week 2. Each baud tick: LSB goes out on o_tx, the rest of the byte shifts right by 1 (putting the old bit 1 in position 0, old bit 2 in position 1, etc.), and a 1 (idle value) fills in at bit 7. After 8 ticks, the byte has been fully transmitted and the shift register holds all 1's, matching the idle line level.
Timing subtlety: both nonblocking assignments sample their right-hand sides before either register updates, so o_tx captures the LSB from before the shift. Because o_tx is itself a register, the line changes one clock after r_shift does. Every bit still lasts exactly CLKS_PER_BIT clocks; the whole frame simply appears on o_tx one clock later than the state transitions.
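
To see the nonblocking behavior in isolation, here is a small simulation-only sketch. The module name nba_demo and the test value 8'h52 are chosen for illustration; 8'h52 is used because its bit 0 and bit 1 differ, so the two possible outcomes are distinguishable.

```verilog
// Simulation-only demo: both right-hand sides are sampled before
// either register is updated.
module nba_demo;
    reg       clk     = 1'b0;
    reg [7:0] r_shift = 8'h52;   // 0101_0010: bit 0 = 0, bit 1 = 1
    reg       o_tx    = 1'b1;

    always @(posedge clk) begin
        r_shift <= {1'b1, r_shift[7:1]};   // shift right, fill with 1
        o_tx    <= r_shift[0];             // captures the pre-shift LSB
    end

    initial begin
        #1 clk = 1'b1;                     // one rising edge
        #1 $display("o_tx=%b r_shift=%b", o_tx, r_shift);
        $finish;
    end
endmodule
```

This prints o_tx=0 r_shift=10101001. o_tx holds the old bit 0 (0), not the bit that just moved into position 0 (1), regardless of the order the two statements are written in.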

🧪 You Do — Predict the Trace

Reset, then pulse i_valid high for one clock with i_data=8'b01010011 (0x53 = 'S'). CLKS_PER_BIT = 4 (simplified for simulation). Sketch the first 50 cycles of o_tx.

Answer (cycle 0 = the first clock on which o_tx shows the start bit):
Cycles 0-3:   o_tx = 0   (start bit)
Cycles 4-7:   o_tx = 1   (D0 = 0b01010011[0] = 1)
Cycles 8-11:  o_tx = 1   (D1 = bit 1 of data = 1)
Cycles 12-15: o_tx = 0   (D2 = bit 2 = 0)
Cycles 16-19: o_tx = 0   (D3 = 0)
Cycles 20-23: o_tx = 1   (D4 = 1)
Cycles 24-27: o_tx = 0   (D5 = 0)
Cycles 28-31: o_tx = 1   (D6 = 1)
Cycles 32-35: o_tx = 0   (D7 = 0)
Cycles 36-39: o_tx = 1   (stop bit)
Cycles 40+:   o_tx = 1   (idle)
Total frame: 40 cycles = 10 bit-times × 4 cycles/bit. ✓
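
One way to check this prediction in simulation is a small self-checking testbench that samples o_tx in the middle of each bit. This is a sketch, not the course testbench: the name tb_predict, the 10 ns clock period, and the messages are assumptions, and it expects the uart_tx module above to be compiled alongside it.

```verilog
`timescale 1ns/1ps
// Hypothetical testbench: sends 0x53 with CLKS_PER_BIT = 4 and
// reconstructs the byte by sampling mid-bit, LSB first.
module tb_predict;
    localparam CLKS_PER_BIT = 4;
    localparam CLK_PERIOD   = 10;                        // ns, arbitrary
    localparam BIT_TIME     = CLKS_PER_BIT * CLK_PERIOD;

    reg        clk = 1'b0, reset = 1'b1, valid = 1'b0;
    reg  [7:0] data = 8'h53;
    wire       busy, tx;
    reg  [7:0] rx;
    integer    i;

    uart_tx #(.CLKS_PER_BIT(CLKS_PER_BIT)) dut (
        .i_clk(clk), .i_reset(reset), .i_valid(valid), .i_data(data),
        .o_busy(busy), .o_tx(tx)
    );

    always #(CLK_PERIOD/2) clk = ~clk;

    initial begin
        repeat (2) @(posedge clk);
        reset <= 1'b0;
        @(posedge clk) valid <= 1'b1;     // one-clock valid pulse
        @(posedge clk) valid <= 1'b0;

        @(negedge tx);                    // start bit begins
        #(BIT_TIME/2);                    // middle of the start bit
        if (tx !== 1'b0) $display("FAIL: start bit");
        for (i = 0; i < 8; i = i + 1) begin
            #(BIT_TIME) rx[i] = tx;       // middle of data bit i
        end
        #(BIT_TIME);                      // middle of the stop bit
        if (tx !== 1'b1) $display("FAIL: stop bit");
        if (rx === 8'h53) $display("PASS: received 0x%h", rx);
        else              $display("FAIL: received 0x%h", rx);
        $finish;
    end
endmodule
```

With the module as written, this should report PASS: received 0x53. An MSB-first transmitter would instead yield the bit-reversed value 0xca, so the same check also catches bit-order mistakes.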
▶ LIVE DEMO

Build UART TX from Scratch

~7 minutes — live coding

▸ COMMANDS

cd labs/week3_day11/ex3_impl/
# Start from empty uart_tx.v skeleton
# Add FSM → test → add datapath → test
make sim        # self-checks byte = 'A'
make wave
make stat       # ~30 cells
gtkwave tb.vcd &

▸ EXPECTED STDOUT

PASS: IDLE after reset
PASS: o_busy after valid
PASS: start bit = 0
PASS: data bits LSB first
PASS: stop bit = 1
PASS: IDLE after 10 ticks
=== 32 passed, 0 failed ===

  SB_DFFE:  20
  SB_LUT4:  12

▸ GTKWAVE

Signals: r_state · r_baud · r_bit · r_shift · o_tx · o_busy. Expand to see the byte 'A' (0x41 = 01000001) being shifted out LSB-first: 1 0 0 0 0 0 1 0.
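
To see where that bit order comes from, the full 10-bit frame can be written as a single vector and read from bit 0 upward. A simulation-only sketch (the module name frame_demo is made up for illustration):

```verilog
// Simulation-only demo: prints the 10 bits of a UART frame for 'A'
// in the order they appear on the wire.
module frame_demo;
    reg [9:0] frame;
    integer   i;

    initial begin
        frame = {1'b1, 8'h41, 1'b0};      // {stop, data[7:0], start}
        for (i = 0; i < 10; i = i + 1)
            $write("%b ", frame[i]);      // bit 0 (start) goes out first
        $write("\n");
        $finish;
    end
endmodule
```

This prints 0 1 0 0 0 0 0 1 0 1: the start bit, the eight data bits 1 0 0 0 0 0 1 0 seen in the waveform, then the stop bit.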

🔧 What Did the Tool Build?

$ yosys -p "read_verilog uart_tx.v; synth_ice40 -top uart_tx; stat" -q

=== uart_tx ===  (CLKS_PER_BIT=217 at 115200 baud / 25 MHz clk)
   Number of wires:                 49
   Number of cells:                 32
     SB_CARRY                        8    ← counter carry chain
     SB_DFFE                        20    ← state + baud + bit + shift + outputs
     SB_LUT4                        12    ← FSM + comparator + mux
What to notice: 32 cells = 2.5% of an iCE40 HX1K. UART TX is tiny. By raw cell count, about 40 of them would fit on one chip. This is why every system has UART: it's essentially free silicon.
Reflection: The Week 2 modules you built (shift register, mod-N counter, FSM template) each cost 5-10 cells. Compose them: ~30 cells for a complete UART TX. Composition doesn't add much overhead.

🤖 Check the Machine

Ask AI: “Write a complete UART TX in Verilog, parameterized for baud rate and clock frequency, with a valid/busy handshake. Include a self-checking testbench.”

TASK

AI writes a complete UART TX + TB.

BEFORE

Predict: 50-80 lines RTL, FSM-driven, LSB-first, parameterized.

AFTER

Strong AI gets LSB-first and idle=1. Weak AI gets MSB-first or idle=0 (both wrong).

TAKEAWAY

UART is common enough in AI training data that most models do well; always verify protocol details such as bit order and idle level.

Key Takeaways

 Complete UART TX = ~60 lines, ~32 cells. Tiny.

 Iterative integration beats monolithic build every time.

 LSB-first shift-out. Idle = 1. Start = 0. Stop = 1.

 Parameterize CLKS_PER_BIT: one module covers any clock/baud pair, as long as the rounded ratio keeps the baud error small.

Build in stages. Test each stage. Ship the whole.

🔗 Transfer

Connecting to a PC

Video 4 of 4 · ~8 minutes

▸ WHY THIS MATTERS NEXT

Simulation is great, but chips that only talk to simulators don't do real work. Video 4 hooks your UART TX to a USB-serial adapter and your laptop. By the end of the video, your Go Board is transmitting “HELLO” that shows up in your terminal. First time your Verilog talks to the outside world.