
The ARM Boot Process: From Reset Vector to kernel_main

Learn how to write an ARM64 bootloader stub in assembly. We cover AArch64 exception levels (EL2 to EL1), zeroing BSS, stack setup, and jumping to a C kernel_main on bare metal.

Diagram of ARM exception levels stacked from EL3 at the top to EL0 at the bottom

In the last post, we set up the toolchain, built a minimal boot.S, and got a clean exit out of QEMU. The binary did almost nothing; it only set up a stack, called a semihosting exit, and disappeared. But everything worked end-to-end, and that mattered.

This post is where it starts to feel real. We’re going to understand exactly how ARM hands control to your code. We will rewrite boot.S to handle the full boot sequence properly, and jump into C for the first time. By the end, your kernel will boot into a C function called kernel_main. The C function will have no output yet, but there will be something genuinely running, and we’ll use GDB to prove it.

Before we write a single line of assembly, we need to understand the privilege hierarchy. ARM’s exception levels are the reason the first five lines of every real OS boot sequence look the way they do.


What We’re Building Today

Today, we are bridging the gap between raw hardware and structured C code. We are moving away from a binary that “just exists” to building a proper, reproducible runtime environment.

Specifically, we will build:

  • An Exception Level Handler: Code that detects whether QEMU or real hardware dropped us off at EL2 or EL1, safely dropping us to EL1 if needed.
  • The C Runtime Environment: Setting up a stack pointer so functions can actually return, and manually zeroing out the BSS section to respect C’s global variable guarantees.
  • The Jump to C: Officially calling kernel_main and setting up an infinite safety-net loop to catch the CPU if it ever tries to escape.

By the end of this post, you won’t see a “Hello World” printed on your screen yet, but you will use GDB to look under the hood and verify that your C code is actively running on the processor.


The Privilege Hierarchy

The ARM boot process isn’t a single event. It’s a handoff, from firmware to hypervisor to kernel, and each handoff crosses a privilege boundary.

The CPU exits reset at the highest privilege level, EL3. Secure firmware, such as ARM Trusted Firmware (TF-A), runs here. It owns the hardware and sets up the secure world. On QEMU virt without firmware, this stage is handled invisibly by QEMU itself.

ARM defines four exception levels, numbered 0 through 3. A higher number means more privilege.

  • EL0 is where user programs run. Browsers, text editors, and shell scripts all run at EL0. Code here has no direct access to hardware, can’t modify page tables, and can’t mess with interrupt configuration. It’s a sandbox by design.
  • EL1 is where your kernel runs. EL1 code can configure the MMU, handle exceptions, manage memory, and interact with hardware via memory-mapped I/O. This is where we will live for the rest of this series.
  • EL2 is the hypervisor level. When you run virtual machines, the hypervisor runs at EL2, managing multiple guest kernels that each run at EL1. If you’re not doing virtualisation, you can skip EL2 entirely, but you still need to know it exists. Depending on how QEMU is launched or on the hardware you’re running, you might boot into EL2 rather than EL1.
  • EL3 is secure firmware. ARM Trusted Firmware (TF-A) runs here. It’s responsible for establishing the secure world, initialising cryptographic hardware, and managing the transition between the secure and non-secure worlds. On real hardware, EL3 is the first code to execute after a CPU reset. On QEMU without firmware, QEMU handles the EL3 stage invisibly.

Moving between exception levels is tightly controlled. EL0 code can’t spontaneously jump to EL1. The only way to go up in privilege is through an exception: an svc instruction, an interrupt, or a hardware fault, each of which the kernel handles. The only way to go down is through eret.


The Reset Vector

When a CPU resets, it needs to know where to start. The address of the first instruction is called the reset vector.

On AArch64, the reset vector address is implementation-defined: the chip designer decides where the core fetches its first instruction. On real ARM silicon, it’s typically either 0x0 or an address inside a boot ROM, and software can read it back from the RVBAR_ELx register of the highest implemented exception level. In practice, firmware runs from that address and handles the very early boot, so your kernel never executes from the reset vector itself.

For us on QEMU, none of this matters directly. QEMU acts as the firmware. When you pass -kernel kernel.elf, QEMU parses our ELF file, finds the entry point symbol specified by ENTRY(_start) in our linker script, loads our code segments into memory at 0x40000000, and jumps straight to _start. The reset vector ceremony is abstracted away.
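
To make the “finds the entry point” step concrete, here is a minimal sketch in C of reading the entry address out of an ELF64 header. It is not QEMU’s loader: elf64_entry is a hypothetical helper that relies only on the standard ELF64 layout, where the 8-byte e_entry field sits at byte offset 24, and it assumes a little-endian image.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ELF64 header layout: 16 bytes of e_ident, then e_type (2 bytes),
 * e_machine (2), e_version (4), which puts e_entry at offset 24. */
#define ELF64_E_ENTRY_OFFSET 24u
#define ELF64_HEADER_SIZE    64u

/* Hypothetical helper: returns the entry address stored in a
 * little-endian ELF64 image, or 0 if the buffer is not one. */
uint64_t elf64_entry(const uint8_t *image, size_t len)
{
    assert(image != NULL);

    if (len < ELF64_HEADER_SIZE)
        return 0;
    /* Magic bytes: 0x7f 'E' 'L' 'F' */
    if (image[0] != 0x7f || image[1] != 'E' ||
        image[2] != 'L'  || image[3] != 'F')
        return 0;
    /* e_ident[4] == 2: 64-bit class. e_ident[5] == 1: little-endian. */
    if (image[4] != 2 || image[5] != 1)
        return 0;

    /* Assemble the 8-byte field byte by byte, least significant first,
     * so the result does not depend on the host's endianness. */
    uint64_t entry = 0;
    for (unsigned i = 0; i < 8; i++)
        entry |= (uint64_t)image[ELF64_E_ENTRY_OFFSET + i] << (8 * i);
    return entry;
}
```

For the kernel.elf built in this series, the value should match the address of _start, which you can cross-check against the disassembly from make dump.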

What’s left for us to understand is that QEMU drops us at EL1 by default. QEMU’s virt machine without -machine virtualization=on or -machine secure=on starts your kernel at EL1 with a known, clean initial state. But on real hardware like a Raspberry Pi 3 or 4, you typically land at EL2. Our boot stub needs to handle both.


Checking the Exception Level

The first thing a portable AArch64 boot stub does is ask the CPU: “Where am I?” The CurrentEL system register answers that question. It’s a read-only register where bits [3:2] encode the current exception level:

Bits [3:2]   Value   Exception level
0b01         1       EL1 (OS kernel)
0b10         2       EL2 (hypervisor)
0b11         3       EL3 (secure firmware)

Reading it is a single instruction:

mrs  x0, CurrentEL    // Move system register into x0
lsr  x0, x0, #2      // Shift right 2 bits to isolate EL number

After the shift, x0 holds 1, 2, or 3. We compare against 1 to check if we’re already at EL1:

cmp  x0, #1
beq  .el1_ready       // Already at EL1 — skip the drop

If QEMU dropped us at EL1, we skip everything in the next section. If we are on a Raspberry Pi and land at EL2, the branch isn’t taken, and we fall through to the EL2 drop sequence.
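
The same check can be written in C, which is handy for reasoning about raw CurrentEL values you see in a debugger. This is a sketch that takes the register value as a parameter instead of reading it with mrs; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* CurrentEL holds the exception level in bits [3:2]. */
unsigned el_from_currentel(uint64_t currentel)
{
    return (unsigned)((currentel >> 2) & 0x3u);
}

/* Mirrors the cmp/beq pair: nonzero means the EL2 drop can be skipped. */
int already_at_el1(uint64_t currentel)
{
    return el_from_currentel(currentel) == 1u;
}

/* Self-check against the raw values the register can hold at boot. */
void currentel_self_check(void)
{
    assert(el_from_currentel(0x4) == 1);   /* 0b0100: EL1 */
    assert(el_from_currentel(0x8) == 2);   /* 0b1000: EL2 */
    assert(el_from_currentel(0xc) == 3);   /* 0b1100: EL3 */
    assert(already_at_el1(0x4));
    assert(!already_at_el1(0x8));
}
```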


Dropping from EL2 to EL1

Moving down a privilege level always happens through the exception return instruction, eret. When eret executes, it atomically performs two actions: it sets the program counter to the address in ELR_ELx (the Exception Link Register) and restores the processor state from SPSR_ELx (the Saved Program Status Register). Both need to be set up before eret executes.

The three registers that control the transition from EL2 to EL1 are:

HCR_EL2: Hypervisor Configuration Register. Bit 31 is the RW (Register Width) bit. Setting it to 1 tells the CPU that EL1 will run in AArch64 mode, not in 32-bit AArch32 mode. Without this, EL1 may be left configured as AArch32, and an eret that asks for AArch64 EL1h is then treated as an illegal exception return, so the drop to EL1 never completes and nothing works as expected.

SPSR_EL2: The processor state we want after eret. We encode this as a bit field. The value we use, 0x3c5, breaks down like this:

Bit(s)   Field   Value    Meaning
[9]      D       1        Debug exceptions masked
[8]      A       1        SError exceptions masked
[7]      I       1        IRQs masked
[6]      F       1        FIQs masked
[3:0]    M       0b0101   EL1h (EL1 using SP_EL1)

The mode bits 0b0101 are what select EL1h. EL1h means we use the dedicated EL1 stack pointer SP_EL1. The alternative, EL1t (“thread”), would use SP_EL0 even at EL1. EL1h is what every OS kernel uses.

ELR_EL2: Where to resume execution after eret. We point this at the label immediately after the EL2 setup code.
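
Magic numbers like 0x3c5 are easier to trust when they are rebuilt from named fields. The sketch below does that for the two constants used in the drop sequence; the macro and function names are illustrative, not taken from any header.

```c
#include <assert.h>
#include <stdint.h>

#define HCR_EL2_RW   (UINT64_C(1) << 31)  /* EL1 executes in AArch64 */

#define SPSR_D       (UINT64_C(1) << 9)   /* debug exceptions masked */
#define SPSR_A       (UINT64_C(1) << 8)   /* SError masked */
#define SPSR_I       (UINT64_C(1) << 7)   /* IRQs masked */
#define SPSR_F       (UINT64_C(1) << 6)   /* FIQs masked */
#define SPSR_M_EL1H  UINT64_C(0x5)        /* M[3:0] = 0b0101 */

/* Value written to SPSR_EL2 before eret. */
uint64_t spsr_el2_for_el1h(void)
{
    return SPSR_D | SPSR_A | SPSR_I | SPSR_F | SPSR_M_EL1H;
}

/* Value written to HCR_EL2 before eret. */
uint64_t hcr_el2_for_aarch64_el1(void)
{
    return HCR_EL2_RW;
}

/* The named fields must add up to the literals used in boot.S. */
void drop_constants_self_check(void)
{
    assert(spsr_el2_for_el1h() == UINT64_C(0x3c5));
    assert(hcr_el2_for_aarch64_el1() == UINT64_C(0x80000000));
}
```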

Here’s what these registers look like right before eret fires. The ones we explicitly set up are HCR_EL2, SPSR_EL2, and ELR_EL2:

EL2 register state, right before eret

CurrentEL   0x0000000000000008
HCR_EL2     0x0000000080000000
SPSR_EL2    0x00000000000003c5
ELR_EL2     0x000000004000001c
SP_EL2      0x0000000000000000
SP_EL1      0x0000000000000000
PC          0x0000000040000018
LR          0x0000000000000000

After eret: PC ← ELR_EL2 (jump to .el1_ready) and PSTATE ← SPSR_EL2 (enter EL1h with all exceptions masked). The CPU is now at EL1 with a clean, known state. SP_EL1 is still zero. We will fix that in the next step.
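
As a mental model, eret can be pictured as two assignments that happen together. The following C sketch is a toy model, not an emulator: struct cpu_model and its fields are invented for illustration, and PSTATE is simplified to hold the SPSR-format value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of the state eret touches when executed at EL2. */
struct cpu_model {
    uint64_t pc;
    uint64_t pstate;     /* simplified: stored in SPSR format */
    uint64_t elr_el2;
    uint64_t spsr_el2;
};

/* Exception level encoded in M[3:2] of an SPSR-format value. */
unsigned el_of_pstate(uint64_t pstate)
{
    return (unsigned)((pstate >> 2) & 0x3u);
}

/* eret at EL2: PC <- ELR_EL2 and PSTATE <- SPSR_EL2, as one step. */
void model_eret_from_el2(struct cpu_model *cpu)
{
    assert(cpu != NULL);
    cpu->pc = cpu->elr_el2;
    cpu->pstate = cpu->spsr_el2;
}
```

Loading the model with the ELR_EL2 and SPSR_EL2 values from the register listing and calling model_eret_from_el2 leaves pc at 0x4000001c, and el_of_pstate reports exception level 1.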

The assembly for this transition is short: three register writes followed by eret.

mov  x0, #(1 << 31)     // HCR_EL2.RW = 1 — EL1 is AArch64
msr  hcr_el2, x0

mov  x0, #0x3c5         // SPSR: EL1h, D/A/I/F masked
msr  spsr_el2, x0

adr  x0, .el1_ready     // Point ELR_EL2 at the label after eret
msr  elr_el2, x0

eret                     // Jump to EL1. No return.

adr (Address of label Relative to current) generates a PC-relative address without needing a literal pool. It’s the right choice here because we need the address of a nearby label, not a 64-bit constant.

After eret, the CPU resumes at .el1_ready, running as EL1.


Setting Up the Stack

We’ve already covered this in the previous post, but it’s worth anchoring it in context: on bare metal, the stack doesn’t exist until you create it.

Your C functions save return addresses and local variables on the stack. The stack grows downward, so if sp is 0x0, the first push wraps around to the very top of the address space, and every later access lands wherever that happens to point. Things might even appear to work for a while, until they catastrophically don’t. With no exception handlers installed, the crash will be completely silent. Setting the stack pointer is non-negotiable.

ldr  x0, =_stack_top
mov  sp, x0

_stack_top is the symbol we defined in link.ld in the last post: the address immediately above the 4 KB reserved stack region. The stack grows downward on AArch64, so pointing sp at the top of the region is correct.
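
To see why sp starts at the top of the region, here is a small host-side model in C. It is a sketch under simplifying assumptions: the 4 KB region is an ordinary array, stack_model and its functions are hypothetical names, and push_pair imitates what a pre-indexed stp x29, x30, [sp, #-16]! does in a typical function prologue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_WORDS 512u   /* 512 * 8 bytes = 4 KB */

struct stack_model {
    uint64_t region[STACK_WORDS];
    uint64_t *sp;
};

/* Like boot.S: point sp just past the end of the region (its "top"). */
void stack_init(struct stack_model *s)
{
    assert(s != NULL);
    s->sp = s->region + STACK_WORDS;
}

/* Move sp down 16 bytes first, then store two 8-byte values there. */
void push_pair(struct stack_model *s, uint64_t first, uint64_t second)
{
    assert(s != NULL);
    assert(s->sp - s->region >= 2);   /* room left in the region */
    s->sp -= 2;
    s->sp[0] = first;
    s->sp[1] = second;
}
```

After stack_init, sp points one element past the array, a location that is never written. The first push_pair lands in the last two slots of the region, and each later push moves further down.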


Zeroing the BSS Section

The C standard guarantees that uninitialised global and static variables are initialised to 0. In a normal system, the runtime that launches your program handles this. On bare metal, you are the runtime.

If you skip BSS zeroing, global variables will have whatever garbage happened to be in RAM at boot. This can be power-on residue or remnants from a previous boot. Your kernel will misbehave in ways specific to a single run and not reproducible on the next, which is the worst kind of debugging situation.

The BSS region is bounded by __bss_start and __bss_end. Both symbols are defined in your link.ld linker script. Zeroing it is a straightforward loop:

ldr  x1, =__bss_start
ldr  x2, =__bss_end
.zero_bss:
    cmp  x1, x2
    b.hs .call_kernel     // Addresses are unsigned: stop once x1 >= x2
    str  xzr, [x1], #8    // *x1 = 0; x1 += 8
    b    .zero_bss

xzr is AArch64’s zero register. It always reads as 0, and writes to it are discarded. Using it here instead of loading an explicit 0 into a register saves an instruction.

In the instruction str xzr, [x1], #8, the #8 suffix is a post-index offset. The instruction stores zero to the address held in x1, then adds 8 to x1. One instruction, two effects. We step through BSS in 8-byte chunks, zeroing as we go.
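
For comparison, the same loop in C looks like this. It is a sketch meant to run on any host, so it takes the region bounds as parameters; in the kernel those would be the addresses of __bss_start and __bss_end, and the function name is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Zero every 8-byte word in [start, end), like the assembly loop:
 * compare, store zero, advance by 8, repeat. Both bounds are expected
 * to be 8-byte aligned, which uint64_t pointers already imply. */
void zero_words(uint64_t *start, uint64_t *end)
{
    assert(start != NULL && end != NULL);
    while (start < end)
        *start++ = 0;      /* str xzr, [x1], #8 */
}
```

The loop condition is checked before the first store, so an empty region (start equal to end) writes nothing, which matches the cmp at the top of .zero_bss.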


Jumping Into C

After the stack is up and BSS is clean, we can call C:

bl   kernel_main
b    .               // Spins forever and should never reach here

bl jumps to kernel_main and stores the return address in lr. If kernel_main ever returns, its ret jumps to the address lr holds, which is the b . infinite loop.

The spin loop is not dead code. Without it, execution would fall through into whatever random bytes follow bl kernel_main in memory. Depending on your kernel layout, that might be data, another function, or unmapped memory. The infinite loop turns a potential mystery crash into a deliberate halt.


The Full boot.S, Annotated

Here’s the complete assembly stub for this post:

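.section .text.boot
.global _start

_start:
  mrs  x0, CurrentEL        // Read the current exception level
  lsr  x0, x0, #2           // Bits [3:2] hold the EL number
  cmp  x0, #1
  beq  .el1_ready           // Already at EL1: skip the drop

  mov  x0, #(1 << 31)       // HCR_EL2.RW = 1: EL1 runs in AArch64
  msr  hcr_el2, x0

  mov  x0, #0x3c5           // SPSR_EL2: EL1h, D/A/I/F masked
  msr  spsr_el2, x0

  adr  x0, .el1_ready       // ELR_EL2: where eret resumes
  msr  elr_el2, x0

  eret                      // Drop to EL1. No return.

.el1_ready:
  ldr  x0, =_stack_top      // Stack grows down from _stack_top
  mov  sp, x0

  ldr  x1, =__bss_start
  ldr  x2, =__bss_end
.zero_bss:
  cmp  x1, x2
  b.hs .call_kernel         // Done once x1 >= __bss_end (unsigned)
  str  xzr, [x1], #8        // *x1 = 0; x1 += 8
  b    .zero_bss

.call_kernel:
  bl   kernel_main
  b    .                    // Halt if kernel_main ever returns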

The C Entry Point

Create kernel.c alongside boot.S:

void kernel_main(void) {
    /* UART isn't set up yet.
     * For now, if you reach this, your boot stub worked. */
    while (1);
}

That’s the entire file. No includes, no main, no standard library. Just a C function that spins. This is also the last time in this series that the kernel does nothing. The next post wires up UART and gives your kernel a voice.

The function signature is deliberately simple. We don’t pass any arguments because we don’t have device tree parsing yet. Later posts will add parameters as we discover what information the bootloader can pass us.


Updating the Makefile

Add C compilation to the skeleton of the Makefile from the last post:

CROSS   ?= aarch64-elf-

AS      = $(CROSS)as
CC      = $(CROSS)gcc
LD      = $(CROSS)ld
OBJDUMP = $(CROSS)objdump

ASFLAGS = -g
CFLAGS  = -g -ffreestanding -nostdlib -mcpu=cortex-a53 -O2
LDFLAGS = -T link.ld

TARGET  = kernel.elf
OBJS    = boot.o kernel.o

all: $(TARGET)

$(TARGET): $(OBJS) link.ld
	$(LD) $(LDFLAGS) -o $@ $(OBJS)

%.o: %.S
	$(AS) $(ASFLAGS) -o $@ $<

%.o: %.c
	$(CC) $(CFLAGS) -c -o $@ $<

run: $(TARGET)
	qemu-system-aarch64 \
		-M virt \
		-cpu cortex-a53 \
		-nographic \
		-kernel $(TARGET)

gdb: $(TARGET)
	qemu-system-aarch64 \
		-M virt \
		-cpu cortex-a53 \
		-nographic \
		-kernel $(TARGET) \
		-S -gdb tcp::1234

dump: $(TARGET)
	$(OBJDUMP) -d $(TARGET)

clean:
	rm -f $(OBJS) $(TARGET)

.PHONY: all run gdb dump clean

Some key additions and changes we made compared to the Makefile in the previous post:

  • -ffreestanding tells GCC that the C standard library isn’t available and that the program might not have a main. Without this, GCC makes assumptions about the runtime environment that will cause linker errors or silent misbehaviour.
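  • -nostdlib prevents GCC from automatically linking libc, the startup files, and other defaults that don’t exist in our environment. It only takes effect when GCC drives the link; this Makefile links with ld directly, so the flag matters only if you later switch to linking through gcc.
  • -mcpu=cortex-a53 targets our specific CPU, enabling relevant optimisations.
  • -O2 enables optimisation. It isn’t needed for correctness, since unoptimised code only requires the valid stack pointer that boot.S already provides, but it keeps the generated code compact and easier to follow in a disassembly.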
  • We also drop -semihosting from the QEMU flags. The kernel_main in this post never exits, so we no longer need QEMU’s semihosting support. Instead, we’ll use Ctrl+A X to kill QEMU manually or attach GDB.
  • The new gdb target adds -S (start paused, waiting for the debugger) and -gdb tcp::1234 (expose a GDB server on port 1234). This is the right way to debug a kernel that produces no output.

Running It

First, build the kernel using:

make

There are two object files now. The linker combines them into a single ELF using our linker script. Run that with:

make run

QEMU starts, then… nothing. No output, no exit. Your kernel is spinning inside kernel_main, waiting for a UART that doesn’t exist yet. This is expected. To kill QEMU: press Ctrl+A, then X.

To actually prove the kernel reached kernel_main, use GDB. In one terminal, run the command:

make gdb

QEMU starts paused, waiting for a debugger. Now, in a second terminal:

aarch64-elf-gdb kernel.elf

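At the GDB prompt, connect to the server QEMU opened on port 1234 with target remote :1234, set a breakpoint with break kernel_main, resume with continue, and then print the registers with info registers.

The output should show sp at 0x40001000, aka the top of our 4 KB stack, exactly where link.ld placed it, and pc inside kernel_main. Every step of the boot sequence executed correctly.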

To confirm that the BSS loop ran, compare x1 with the address of __bss_end (print/x $x1 and print/x &__bss_end): the loop exits only once x1 has reached __bss_end, and nothing in kernel_main touches x1 afterwards. Try adding a global int counter = 0; to kernel.c and printing its value in GDB. You’ll see it sit at zero even though no C code has run to initialise it.


What Broke (And Why)

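What messed me up building this was the BSS loop alignment. I was zeroing 8 bytes at a time (64-bit str), but my linker script didn’t guarantee that BSS started on an 8-byte boundary. If __bss_start isn’t a multiple of 8, str xzr, [x1], #8 performs an unaligned access. With the MMU off, memory is treated as Device memory, where unaligned accesses always fault, and with no exception handlers installed the CPU silently hangs. The fix lives in link.ld:

.bss : {
    . = ALIGN(8);       /* add this: BSS starts on an 8-byte boundary */
    __bss_start = .;
    *(.bss)
    . = ALIGN(8);       /* and this: the loop stops exactly at __bss_end */
    __bss_end = .;
}

This ensures BSS starts at a multiple of 8 before we try to write 8-byte words into it, and that it ends on one too, so the last store can’t spill past __bss_end. Check your current link.ld and add both lines if they’re missing.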
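
ALIGN(8) rounds the location counter up to the next multiple of 8. The arithmetic behind it fits in one line of C; align_up below is a hypothetical helper written for illustration, and it assumes the alignment is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to the next multiple of align (a power of two). */
uint64_t align_up(uint64_t addr, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~(align - 1);
}
```

For example, align_up(0x40002003, 8) gives 0x40002008, while an address that is already aligned, such as 0x40002008, is returned unchanged.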

If you’re trying to run this on a Raspberry Pi, you might forget to set HCR_EL2.RW. If that bit happens to be 0, the boot stub asks eret for AArch64 EL1h while HCR_EL2 still says EL1 is AArch32. The architecture treats that mismatch as an illegal exception return: the core does not enter EL1 at all. It stays at EL2 with the PSTATE.IL bit set and takes an Illegal Execution state exception on the very next instruction. Because we haven’t installed a vector table yet, it ends up executing whatever sits at an uninitialised VBAR_EL2.

Debugging this might make you go insane. QEMU starts us at EL1, where the drop sequence is skipped, so everything works there, while the Pi just goes silent with no output to tell you why. The bug won’t be where the hang appears; it will be one missing register write earlier in the boot stub.


What’s Next

Your kernel boots. It reaches C, but it doesn’t say anything yet.

The next post fixes that: we’ll wire up the PL011 UART provided by QEMU’s virtual machine and write a kprint function. No standard library, no printf, no dependencies, just your code writing bytes directly to a memory-mapped register and watching characters appear in your terminal. This will be the very first output from your kernel.

