
Hello, UART: Your First Kernel Output via the PL011 on QEMU

Learn how to write a bare-metal C driver for the PL011 UART on AArch64. Master Memory-Mapped I/O (MMIO), the volatile keyword in C, and building a kprint function for QEMU virt without a standard library.

Terminal showing 'Hello, Kernel!' output from a bare-metal ARM kernel over UART

Your kernel boots. It reaches C. But it’s completely silent. Every time you’ve run make run since, you’ve stared at a blank terminal and had to trust a GDB breakpoint to confirm anything actually happened. That’s about to change. By the end of this post, your kernel will finally print to the screen. We will not print through a standard library, an OS, or any framework. We will print by writing bytes directly to a memory address.

This is what memory-mapped I/O means in practice. And it’s the first time your kernel will have a voice.


What We’re Building Today

We’re wiring up the PL011 UART that QEMU’s virt machine provides at address 0x09000000. By the end of this post, we will have:

  • A uart.h header defining the PL011 register map
  • A uart.c file with uart_putc, uart_puts, and kprint
  • An updated kernel.c that calls kprint("Hello, Kernel!\n")
  • Output that actually appears in your terminal!

The UART lives at 0x09000000. Your kernel lives at 0x40000000. These are two very different regions of the memory map: one is hardware, and the other is RAM.

Everything between 0x0 and 0x09000000, and between 0x09001000 and 0x40000000, is either unmapped or reserved. The address space is mostly empty at this stage, but it will matter a lot when we get to the MMU in Post 6.

bios@confessions ~/memory-map · Physical address space: kernel and UART before the MMU is enabled
PL011 UART (MMIO)    0x09000000 → 0x09001000 (4 KB)
Kernel .text         0x40000000 → 0x40010000 (64 KB)
Kernel .data/.bss    0x40010000 → 0x40020000 (64 KB)
Stack                0x40020000 → 0x40021000 (4 KB)

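The layout in the figure above can also be written down as data. The following is a minimal sketch, not part of the kernel we are building: the struct mem_region type and the region_name helper are hypothetical names introduced only for illustration, and the ranges are copied from the figure.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry in the physical memory map (half-open: base <= addr < end). */
struct mem_region {
    uint64_t    base;
    uint64_t    end;
    const char *name;
};

/* Ranges copied from the memory-map figure; everything else is unmapped. */
static const struct mem_region memory_map[] = {
    { 0x09000000ULL, 0x09001000ULL, "PL011 UART (MMIO)" },
    { 0x40000000ULL, 0x40010000ULL, "Kernel .text"      },
    { 0x40010000ULL, 0x40020000ULL, "Kernel .data/.bss" },
    { 0x40020000ULL, 0x40021000ULL, "Stack"             },
};

/* Hypothetical helper: name the region containing addr, or "unmapped". */
static const char *region_name(uint64_t addr) {
    for (size_t i = 0; i < sizeof memory_map / sizeof memory_map[0]; i++) {
        if (addr >= memory_map[i].base && addr < memory_map[i].end)
            return memory_map[i].name;
    }
    return "unmapped";
}
```

With this table, region_name(0x09000018) names the UART region, while an address such as 0x09001000 falls just past its end and is reported as unmapped.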


What Is Memory-Mapped I/O?

This is the insight that made hardware programming click for me: on modern CPUs, talking to a peripheral is not fundamentally different from reading and writing memory. There is no special “talk to the UART” instruction. Instead, hardware devices expose their registers as locations in the physical address space. Write to the right address, and the hardware acts. Read from it, and the hardware reports its state.

This is called memory-mapped I/O, or MMIO. The QEMU virt machine defines a fixed memory map where every device has an address. The PL011 UART lives at 0x09000000. The Generic Interrupt Controller starts at 0x08000000. When your kernel writes a byte to 0x09000000, the UART transmits it. When your kernel reads from 0x09000018, it gets the UART’s flag register, which includes a bit that tells you whether the transmit buffer is full.
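In C, "write to the right address" comes down to casting an integer to a pointer and dereferencing it. Here is a minimal sketch of that idea, assuming 32-bit register accesses. The names mmio_read32, mmio_write32, PL011_DR, PL011_FR, and pl011_send_unchecked are hypothetical and are not the macros this post defines later; the volatile qualifier they rely on gets its own section below.

```c
#include <stdint.h>

/* Hypothetical helpers: one 32-bit MMIO read or write at an absolute address. */
static inline uint32_t mmio_read32(uintptr_t addr) {
    return *(volatile uint32_t *)addr;
}

static inline void mmio_write32(uintptr_t addr, uint32_t value) {
    *(volatile uint32_t *)addr = value;
}

/* Addresses taken from the QEMU virt memory map described above. */
#define PL011_BASE  ((uintptr_t)0x09000000u)
#define PL011_DR    (PL011_BASE + 0x000u)   /* data register */
#define PL011_FR    (PL011_BASE + 0x018u)   /* flag register */

/* On the target only: hand one byte to the UART without checking the flags. */
static inline void pl011_send_unchecked(uint8_t byte) {
    mmio_write32(PL011_DR, byte);
}
```

The helpers themselves are ordinary C and behave the same on any address. Only the constants tie them to the UART, which is why pl011_send_unchecked is meaningful only when running on the virt machine.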

The documentation for the QEMU virt machine’s memory layout isn’t in one tidy place, but QEMU’s own source code (hw/arm/virt.c) defines VIRT_UART at base address 0x09000000 with a size of 0x00001000. The PL011 UART specification from ARM defines exactly what those 4 KB of registers do.


The PL011 Register Map

The PL011 is a standard ARM serial controller. The registers we need for transmitting are:

| Register | Offset | Purpose |
|---|---|---|
| UARTDR | 0x000 | Data Register. Write a byte here to send it. Read from here to receive. |
| UARTFR | 0x018 | Flag Register. Status bits that answer: is the FIFO full? Is the UART busy? |
| UARTIBRD | 0x024 | Integer Baud Rate Divisor. The 16-bit integer part of the baud rate divisor. |
| UARTFBRD | 0x028 | Fractional Baud Rate Divisor. The 6-bit fractional part of the baud rate divisor. |
| UARTLCR_H | 0x02C | Line Control Register. Data width, parity, stop bits. |
| UARTCR | 0x030 | Control Register. Enable UART, TX, and RX. |

The two flag bits we care most about in UARTFR are:

  • Bit 5 — TXFF (Transmit FIFO Full): When this bit is set, the transmit buffer is full, and you must wait before writing another byte. Writing while TXFF is set causes the character to be dropped silently.
  • Bit 3 — BUSY: The UART is currently transmitting. Relevant if you need to wait until the last byte has actually left the wire before doing something else.

For our purposes, we only need TXFF. When sending characters from a kernel, we need to poll TXFF before each write. When it clears, the FIFO has space, and we can send data.
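Testing these bits is plain masking. As a small sketch, assuming a value already read from UARTFR, the two checks could look like this. The names FR_TXFF, FR_BUSY, fr_tx_full, and fr_busy are hypothetical and chosen so they do not clash with the header defined later in this post.

```c
#include <stdbool.h>
#include <stdint.h>

#define FR_BUSY  (1u << 3)   /* bit 3: UART is still shifting data out */
#define FR_TXFF  (1u << 5)   /* bit 5: transmit FIFO is full           */

/* True if a write to UARTDR should wait (FIFO has no free slot). */
static inline bool fr_tx_full(uint32_t fr) {
    return (fr & FR_TXFF) != 0;
}

/* True if the UART has not finished transmitting. */
static inline bool fr_busy(uint32_t fr) {
    return (fr & FR_BUSY) != 0;
}
```

For example, a flag value of 0x20 has only TXFF set, so fr_tx_full returns true and fr_busy returns false for it.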


Why do we need to use volatile?

Before we write any code, we need to talk about volatile. For us, the use of volatile is not optional. Without it, your UART code can work correctly in debug builds and silently break in optimised ones.

The C compiler assumes that memory behaves predictably: if you write the same value to the same address twice, it can eliminate the second write. If you read a value from a variable and nothing in your function modifies it, it can cache the value in a register and never reload it. These are legal, correct optimisations for normal memory.

However, hardware registers are not normal memory. Writing the character 'H' to UARTDR twice sends two bytes. Reading UARTFR in a loop is the only way to know when the transmit FIFO has space. Each read reflects the current hardware state, not a cached version. If the compiler decides to hoist the UARTFR read out of the loop (“nothing modifies this between iterations, so I’ll read it once and reuse the value”), your busy-wait becomes an infinite loop or a no-op.

The volatile qualifier tells the compiler: “do not optimise away accesses to this object.” Every read in the source generates a real load instruction. Every write generates a real store instruction. The order and count of volatile accesses are preserved relative to one another.

/* Without volatile, the compiler may read UARTFR once and reuse the value */
uint32_t *fr_plain = (uint32_t *)(UART_BASE + UARTFR);
while (*fr_plain & UARTFR_TXFF) {} /* "nothing changes this": the load may be hoisted out */

/* With volatile, every iteration reads fresh hardware state */
volatile uint32_t *fr = (volatile uint32_t *)(UART_BASE + UARTFR);
while (*fr & UARTFR_TXFF) {} /* correct: reads UARTFR on every iteration */

In a kernel, there’s no OS to protect against this. You are the OS that needs to get things right. Use volatile for every MMIO access.


uart.h — The Register Map in Code

Create the following uart.h in your OS directory:

#ifndef UART_H
#define UART_H

#include <stdint.h>

/* PL011 UART base address on QEMU virt */
#define UART_BASE       0x09000000UL

/* Register offsets */
#define UARTDR          0x000   /* Data Register      */
#define UARTFR          0x018   /* Flag Register      */
#define UARTIBRD        0x024   /* Int Baud Rate Div  */
#define UARTFBRD        0x028   /* Frac Baud Rate Div */
#define UARTLCR_H       0x02C   /* Line Control       */
#define UARTCR          0x030   /* Control Register   */

/* UARTFR bits */
#define UARTFR_TXFF     (1 << 5)    /* Transmit FIFO full */
#define UARTFR_BUSY     (1 << 3)    /* UART busy          */

/* UARTLCR_H bits */
#define UARTLCR_FEN     (1 << 4)    /* Enable FIFOs       */
#define UARTLCR_WLEN8   (3 << 5)    /* 8-bit word length  */

/* UARTCR bits */
#define UARTCR_UARTEN   (1 << 0)    /* UART enable        */
#define UARTCR_TXE      (1 << 8)    /* Transmit enable    */
#define UARTCR_RXE      (1 << 9)    /* Receive enable     */

/* Helper macro: access (read or write) a PL011 register */
#define UART_REG(offset) \
    (*(volatile uint32_t *)(UART_BASE + (offset)))

void uart_init(void);
void uart_putc(char c);
void uart_puts(const char *s);
void kprint(const char *s);

#endif /* UART_H */

The UART_REG(offset) macro is the entire MMIO interface. It takes an offset, adds it to the base address, casts the result to a volatile uint32_t *, and dereferences it. Every read and write to a PL011 register goes through this macro, which ensures the volatile qualifier is never forgotten.


uart.c — Initialization and Output

Create the file uart.c:

#include "uart.h"

void uart_init(void) {
    /* Disable the UART before reconfiguring */
    UART_REG(UARTCR) = 0;

    /*
     * Set baud rate. QEMU's virt machine uses a 24 MHz UART clock.
     * For 115200 baud:   divisor = 24000000 / (16 * 115200) = 13.020833...
     *   Integer part:   13   → UARTIBRD = 13
     *   Fractional part: 0.020833 * 64 ≈ 1  → UARTFBRD = 1
     *
     * On QEMU, the UART works even without this — QEMU ignores baud
     * rate configuration. But real hardware requires it, so we do it right.
     */
    UART_REG(UARTIBRD)  = 13;
    UART_REG(UARTFBRD)  = 1;

    /*
     * Line control: 8-bit word length, FIFOs enabled, 1 stop bit, no parity.
     * The UARTLCR_H write must happen AFTER setting baud rate — writing
     * UARTLCR_H latches the baud rate divisors into the internal registers.
     */
    UART_REG(UARTLCR_H) = UARTLCR_WLEN8 | UARTLCR_FEN;

    /* Re-enable: UART on, TX on, RX on */
    UART_REG(UARTCR) = UARTCR_UARTEN | UARTCR_TXE | UARTCR_RXE;
}

void uart_putc(char c) {
    /* Spin until the transmit FIFO has space */
    while (UART_REG(UARTFR) & UARTFR_TXFF)
        ;
    UART_REG(UARTDR) = (uint32_t)c;
}

void uart_puts(const char *s) {
    while (*s) {
        uart_putc(*s++);
    }
}

void kprint(const char *s) {
    uart_puts(s);
}

uart_init follows the PL011’s required initialisation sequence: disable first, configure baud rate and line format, then re-enable. On QEMU, the emulated UART is usable as soon as your kernel starts, so uart_init is technically optional. On real hardware, such as a Raspberry Pi or a production embedded ARM board, skipping uart_init can leave the UART in an undefined state. We call it anyway so the driver follows the sequence real hardware expects. Keep in mind that a real board typically has a different UART base address and clock, so UART_BASE and the divisor values would need adjusting before this code runs there.
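The divisor arithmetic in the uart_init comment can be done in integer maths, which is handy when the UART clock is not 24 MHz. This is a sketch under stated assumptions: pl011_divisors is a hypothetical helper, not part of the driver above, and it assumes baud is non-zero. It computes the divisor scaled by 64 (clock × 4 / baud, rounded to nearest) and then splits it into the integer and 6-bit fractional parts.

```c
#include <stdint.h>

/*
 * Hypothetical helper: compute the PL011 baud rate divisors.
 *   divisor      = clock_hz / (16 * baud)
 *   divisor * 64 = clock_hz * 4 / baud   (rounded to nearest)
 * The upper bits go to UARTIBRD, the low 6 bits to UARTFBRD.
 */
static void pl011_divisors(uint32_t clock_hz, uint32_t baud,
                           uint32_t *ibrd, uint32_t *fbrd) {
    /* 64-bit intermediate so clock_hz * 4 cannot overflow */
    uint64_t div64 = ((uint64_t)clock_hz * 4u + baud / 2u) / baud;

    *ibrd = (uint32_t)(div64 >> 6);      /* integer part         */
    *fbrd = (uint32_t)(div64 & 0x3Fu);   /* fractional part / 64 */
}
```

For a 24 MHz clock and 115200 baud, this yields an integer part of 13 and a fractional part of 1, matching the values hardcoded in uart_init.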

The core of the driver is uart_putc. Two operations: spin on TXFF, then write to UARTDR. That’s the entire protocol for sending a byte over a PL011 UART.


The Transmit Loop in Detail

The two-line loop in uart_putc hides a surprising amount of what “talking to hardware” actually means. Below is roughly the assembly your compiler generates for it, with each instruction annotated. The exact registers and address-loading instructions vary with the compiler and its flags.

bios@confessions ~/uart.c (compiled)
  // while (UART_REG(UARTFR) & UARTFR_TXFF)
  ldr  x0, =0x09000018    // Load UARTFR address (base + 0x018)
.poll:
  ldr  w1, [x0]           // Read UARTFR: 32-bit load from 0x09000018
  tst  w1, #0x20          // Test bit 5 (TXFF = 1 << 5 = 0x20)
  b.ne .poll              // Branch if TXFF set: FIFO full, keep waiting

  // UART_REG(UARTDR) = (uint32_t)c
  ldr  x0, =0x09000000    // Load UARTDR address (base + 0x000)
  str  w2, [x0]           // 32-bit store of the character (held in w2 here) to UARTDR

The generated assembly makes the memory access pattern explicit. Every trip around the .poll loop is a real load from a hardware register over the physical memory bus. On QEMU, it’s fast. On real silicon at 24 MHz with an external peripheral, each of these loads takes real time. The busy-wait is correct here; it is also fine since we have nothing better to do yet. When we add a scheduler later, we’ll revisit the loop’s blocking nature.
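One way to make the blocking behaviour explicit is to bound the wait. The sketch below is only an illustration of that idea, not something this series has introduced: uart_putc_bounded, its regs parameter, and the max_spins limit are all hypothetical. Passing the register block as a pointer also lets the function be exercised against an ordinary array on a development machine.

```c
#include <stdint.h>

#define PL011_DR_INDEX  (0x000u / 4u)   /* UARTDR as a 32-bit word index */
#define PL011_FR_INDEX  (0x018u / 4u)   /* UARTFR as a 32-bit word index */
#define PL011_FR_TXFF   (1u << 5)       /* transmit FIFO full            */

/*
 * Hypothetical bounded variant of uart_putc.
 * Polls UARTFR until TXFF clears, giving up once max_spins retries
 * have been used. Returns 0 if the byte was written, -1 on timeout.
 */
static int uart_putc_bounded(volatile uint32_t *regs, char c,
                             uint32_t max_spins) {
    while (regs[PL011_FR_INDEX] & PL011_FR_TXFF) {
        if (max_spins-- == 0)
            return -1;              /* FIFO stayed full: report it */
    }
    regs[PL011_DR_INDEX] = (uint32_t)(unsigned char)c;
    return 0;
}
```

On the target, regs would be (volatile uint32_t *)0x09000000UL. The caller then decides what a timeout means, which is the kind of decision a scheduler can eventually take over.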


Updating kernel.c

Replace the contents of kernel.c with:

#include "uart.h"

void kernel_main(void) {
    uart_init();
    kprint("Hello, Kernel!\n");

    /* Spin — nothing else to do yet */
    while (1);
}

Two lines of actual work. That’s it. The real work happened in uart.c and boot.S, which we covered in the posts before this one. kernel_main is now a coordinator: initialise the UART, send a greeting, stop.

The \n matters here because without it, some terminal emulators buffer the line, and you see nothing until a newline flushes the output.


Updating the Makefile

Add uart.o to the object list:

OBJS    = boot.o uart.o kernel.o

The full Makefile doesn’t change structurally. The %.o: %.c pattern rule already handles uart.c. You’re just declaring a new dependency. Run make clean first to ensure the old kernel.o gets recompiled with the new #include:

make clean && make

There are three object files now. The linker combines them in order: boot.o first (so _start sits at 0x40000000), then uart.o and kernel.o. The order of the last two doesn’t matter since neither has a .text.boot section.


Running It

make run

“Hello, Kernel!” printed by your code, running at EL1 on a virtual ARM processor, talking directly to a memory-mapped UART register at physical address 0x09000000. No standard library. No OS. No printf. Just a few stores to a hardware register.

Press Ctrl+A, then X to exit QEMU.


What Broke (And Why)

I did not just write a whole section on why we need to use volatile because I was smart enough to think about this in advance. I made the mistake of not including it initially myself. The code looked identical to the correct version. It compiled without warnings. In a debug build (-O0), it worked perfectly. The compiler generates straightforward code at -O0 with no caching, no reordering.

I only discovered the bug after adding -O2 to the compiler flags. Suddenly, the kernel printed the first character and hung. GDB showed execution stuck in uart_putc, in the flag-check loop, testing the same value in w1 on every iteration without ever reloading it from UARTFR.

The compiler had hoisted the UARTFR load above the loop. It read the flag register once, kept the result in a register, and checked that stale copy on every iteration. With the load hoisted, the loop’s outcome is decided by that single read: if it sees TXFF clear, the loop is skipped entirely, and if it sees TXFF set, the loop can never exit. The compiler’s optimisation was “correct” from its point of view: nothing between the iterations modified the variable. Without volatile, the compiler had no reason to believe the hardware could change the value under its feet.

Adding volatile uint32_t * to the cast in UART_REG fixed it immediately. One keyword. The difference between code that works in debug and silently hangs in release.


What’s Next

Our kernel can talk now, but it’s still using a memory layout that is a bit of a coincidence. We know the UART is at 0x09000000 and the kernel is at 0x40000000 because QEMU’s documentation says so and because our linker script hardcodes 0x40000000. We don’t really understand yet why the address space looks this way.

The next post fixes that. We’re going to look at the linker script properly: what BSS, stack, and heap regions actually are, how the address space gets divided, and how to set up the regions you’ll need for a real kernel. This is the conceptual foundation that we will need when setting up the MMU later in this series. At that point, the linker script stops being boilerplate and starts being something you truly understand.


Sources