Toolchain & Environment Setup: Your First ARM Binary

Step-by-step guide to installing an AArch64 cross-compiler (GCC) and QEMU on macOS and Linux. Write a bare-metal linker script and Makefile to boot your first ARM64 program.

Terminal window showing a cross-compiler build pipeline for ARM

At the end of the previous post, we promised that this one would install a cross-compiler, get QEMU running, provide a Makefile skeleton that will grow with you through the series, and get your first binary executing on a virtual ARM machine.

So that is exactly what we will do. There’s nothing conceptually difficult here. The more complex processes, such as the ARM boot process, exception levels, and page tables, will come later. This post is about getting your hands on the right tools and confirming they work, so that every post after this starts from a clean, verified baseline. Think of it as tuning an instrument before the concert.


What We’re Building Today

By the end of this post, we will have three files in our project:

  • boot.S: a minimal AArch64 assembly stub
  • link.ld: a linker script that places your code at the right address
  • Makefile: a skeleton build system with make, make run, and make clean targets

Running make run assembles, links, and runs your binary in QEMU. The binary will boot, ask QEMU to exit cleanly, and return you to the shell prompt with exit code 0. This is proof that our toolchain works end-to-end; there is no UART and no output yet.

That 0 is the whole point of this post. QEMU exited cleanly because your binary requested it. Every tool in the chain worked.


Why a Cross-Compiler?

Here’s something that trips people up: you can’t use your system’s default gcc to build bare-metal AArch64 code.

Your system compiler, such as gcc on Linux or clang on macOS, is built to produce binaries for your machine and operating system. It knows how to link against your system libraries, generate the right calling convention for your OS’s ABI, and produce executables your kernel’s loader can run. That’s all completely useless for OS development.

What you need is a cross-compiler: a version of GCC that runs on your machine but generates code for a different target. For our OS, that means bare-metal AArch64 with no OS assumptions, no standard library, and no startup code you didn’t write yourself. It’s just you and the hardware.

The two relevant toolchain triples for this series are:

  • aarch64-elf: targets AArch64 bare metal (ELF output, no OS). This is what Homebrew provides directly, and what we’ll use on macOS.
  • aarch64-linux-gnu: targets AArch64 Linux userspace, but can produce bare-metal code with the right flags. This is what apt provides on Ubuntu and Debian.

Both produce correct bare-metal binaries for our purposes. The Makefile will abstract the difference with a CROSS variable you set once.


Installing the Toolchain

macOS

Homebrew has aarch64-elf-gcc as a first-class formula, no taps, no third-party repositories:

brew install aarch64-elf-gcc qemu

This installs QEMU with AArch64 support included and the full aarch64-elf-* toolchain:

  • aarch64-elf-gcc
  • aarch64-elf-as
  • aarch64-elf-ld
  • aarch64-elf-objdump

After installing, verify the version with:

aarch64-elf-gcc --version
qemu-system-aarch64 --version

You should see GCC 14.x or later, and QEMU 9.x or later. If either command fails, check that Homebrew’s bin directory is in your PATH.

Ubuntu and Debian

Ubuntu 22.04 and later include an AArch64 cross toolchain in the standard repositories:

sudo apt update
sudo apt install gcc-aarch64-linux-gnu binutils-aarch64-linux-gnu qemu-system-arm

A note on the naming: aarch64-linux-gnu is technically a Linux-target toolchain triple. It assumes you’re building Linux userspace programs. In practice, it produces correct bare-metal binaries when you pass the right flags and use your own linker script. Since we only use the assembler and linker directly, the distinction is invisible here, and both toolchains produce equivalent binaries for our purposes.

If you prefer an explicitly bare-metal variant, some distributions package one under their own name, and the Arm-published toolchain uses the triple aarch64-none-elf. Any of these prefixes works; the Makefile’s CROSS variable handles the difference.

Verify:

aarch64-linux-gnu-gcc --version
qemu-system-aarch64 --version
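
Because the prefix differs between platforms, a small shell function can report which one is actually installed. This is only a sketch: the name detect_cross is hypothetical, and it does nothing more than probe PATH for an assembler under each candidate prefix.

```shell
# Hypothetical helper: print the first toolchain prefix whose assembler
# is found on PATH. Pass the candidate prefixes as arguments.
detect_cross() {
    for prefix in "$@"; do
        if command -v "${prefix}as" >/dev/null 2>&1; then
            printf '%s\n' "$prefix"
            return 0
        fi
    done
    echo "no AArch64 cross toolchain found" >&2
    return 1
}

# Example: detect_cross aarch64-elf- aarch64-linux-gnu- aarch64-none-elf-
```

The printed value is the prefix to use for the CROSS variable.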

The Project Layout

Our OS is called purgatory, because that’s where our processes go when they get stuck in an infinite loop. A directory of the same name is the project root, and we keep extending it in each post. To start, create it and change into it:

mkdir -p purgatory
cd purgatory

By the end of this post, it should look like the following:

purgatory/
├── boot.S
├── link.ld
└── Makefile

The directory will grow across posts, but it’ll always have this shape at its core: a source file, a linker script, and a Makefile. This series is accompanied by this GitHub repo, where you can find a working solution. Each post has a dedicated branch in the git repo.


The Linker Script

Before we write any assembly, we need to tell the linker where to put things. QEMU’s virt machine loads our ELF binary at the addresses the ELF specifies and starts execution at its entry point. The RAM on the virt machine starts at 0x40000000. If we don’t place our code there, the linker starts at address 0x0, where the virt machine has no RAM, and QEMU silently hangs.

Create link.ld:

ENTRY(_start)

SECTIONS {
    . = 0x40000000;

    .text : {
        *(.text.boot)
        *(.text)
    }

    .data : { *(.data) }

    .bss : {
        __bss_start = .;
        *(.bss)
        __bss_end = .;
    }

    /* Reserve 4KB for the stack */
    . = ALIGN(16);
    . += 4096;
    _stack_top = .;
}

What each part does

1. The Entry Point

ENTRY(_start)

This line names the symbol where execution should begin. The linker records the address of _start as the entry point in the ELF header, and QEMU jumps to that address after loading the binary. The processor has to be told where its very first instruction is; here, that is the label _start.

2. Setting the Base Address

SECTIONS {
    . = 0x40000000;

The dot is called the location counter; here, we are manually setting it to 0x40000000. It tells the linker, “Start placing the following sections of code at this specific memory address.” It is the start of our physical RAM.

3. The .text Section

Here we are creating the .text block in our final file.

  • *(.text.boot) takes the startup code from all input files and puts it first.
  • *(.text) takes all the rest of the compiled functions and code and places them right after.

The order matters because the linker lays out input sections in the order they are listed. By putting .text.boot first, we guarantee that our startup code sits at the very beginning of the image, ahead of the compiled C++ functions that later posts add.

4. The .data Section

.data : { *(.data) }

In the data section, we are gathering all global or static variables that already have a value assigned to them. These values will occupy actual space in the binary file, allowing them to be loaded into RAM.

5. The .bss Section

.bss : {
    __bss_start = .;
    *(.bss)
    __bss_end = .;
}

In this section, we are gathering all global variables that are not initialised or are initialised to zero. Unlike the .data section, none of these values are stored in the binary file. The linker just marks the start (__bss_start) and end (__bss_end) of this section. When your OS boots, the startup assembly code is expected to read these two markers and loop through that memory, zeroing it out manually. Today’s stub has no such variables, so it skips that step.

6. The Stack Allocation

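. = ALIGN(16); rounds the location counter up to a multiple of 16 bytes. The AArch64 procedure call standard requires the stack pointer to be 16-byte aligned, and the hardware can fault on a misaligned stack access, so this is a correctness requirement rather than an optimisation. The . += 4096; moves the location counter forward by 4096 bytes (4 KB), leaving empty space. Lastly, _stack_top = .; creates a symbol pointing to the very end of that 4 KB block. The stack grows downward, so this is the value the stack pointer starts with.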
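
The arithmetic behind the three stack lines of the linker script can be checked in the shell. The sketch below assumes, purely for illustration, that .bss ends at 0x40000035; the real address depends on what you link.

```shell
# Assumed end of .bss; an illustrative value, not taken from a real build.
bss_end=$(( 0x40000035 ))

# . = ALIGN(16): round up to the next multiple of 16.
aligned=$(( (bss_end + 15) & ~15 ))

# . += 4096: reserve 4 KB above the aligned address.
stack_top=$(( aligned + 4096 ))

printf 'aligned   = 0x%x\n' "$aligned"
printf 'stack_top = 0x%x\n' "$stack_top"
```

With that assumed starting point, the aligned address is 0x40000040 and _stack_top lands at 0x40001040.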


The Entry Point

boot.S is the entire program, a minimal assembly stub that sets up the stack and asks QEMU to exit cleanly:

.section .text.boot
.global _start

_start:
    /* Point the stack pointer at the top of our reserved stack */
    ldr x0, =_stack_top
    mov sp, x0

    /*
     * QEMU semihosting — SYS_EXIT call.
     *
     * Semihosting lets bare-metal code make requests of the host
     * (QEMU, in this case) through a special trap instruction.
     * We use it here to ask QEMU to exit cleanly with code 0.
     * This is only useful for testing — real kernels don't exit.
     *
     * The parameter block lives on the stack:
     *   [sp+0]  reason:      0x00020026  (ADP_Stopped_ApplicationExit)
     *   [sp+8]  exit_status: 0x00000000  (success)
     */
    mov x1, #0x26
    movk x1, #0x2, lsl #16     /* x1 = 0x00020026 */
    str x1, [sp, #-16]!        /* push reason; pre-decrement sp by 16 */
    mov x0, #0
    str x0, [sp, #8]           /* push exit status */

    mov x1, sp                 /* x1 → start of parameter block */
    mov w0, #0x18              /* w0 = SYS_EXIT */
    hlt #0xf000                /* semihosting trap — A64 encoding */

There are a few things that we should understand here because they’ll come back in every post that follows:

.section .text.boot
.global _start
_start:

This part explicitly places this chunk of code in the .text.boot section, which the linker script puts first. If you forget this and use .text instead, your code is placed along with every other .text section, and once the project has more than one source file, _start might no longer be at address 0x40000000. It will build fine and fail mysteriously at runtime.

/* Point the stack pointer at the top of our reserved stack */
ldr x0, =_stack_top
mov sp, x0

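ldr x0, =_stack_top is a pseudo-instruction that uses the assembler’s literal pool to load a 64-bit address into a register. AArch64 can’t encode a full 64-bit immediate in a single instruction, so the assembler stores the address in a nearby literal pool and emits a PC-relative load to fetch it. The mov then copies the value into sp.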

mov x1, #0x26
movk x1, #0x2, lsl #16     /* x1 = 0x00020026 */

These two instructions build the reason code 0x00020026 that goes into the parameter block. The request itself is made through a mechanism called semihosting. Because our bare-metal machine doesn’t have a screen, keyboard, or hard drive of its own, it can make a special request to the QEMU emulator to borrow the host operating system’s resources. Semihosting is a debug mechanism defined by ARM that lets bare-metal code communicate with a host environment.
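
The mov/movk pair shown earlier assembles the value in two halves, because 0x20026 does not fit into the 16-bit immediate that a single mov can carry. As a quick sanity check, the same composition can be reproduced with shell arithmetic; the variable names here are only illustrative.

```shell
# mov  x1, #0x26          -> sets the low 16 bits, clears the rest
low=$(( 0x26 ))

# movk x1, #0x2, lsl #16  -> keeps the low bits, writes 0x2 into bits 16..31.
# Those bits are zero after the mov, so a bitwise OR gives the same result.
reason=$(( low | (0x2 << 16) ))

printf 'reason = 0x%08x\n' "$reason"
```

This prints reason = 0x00020026, the value named ADP_Stopped_ApplicationExit in the comments of boot.S.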

mov w0, #0x18              /* w0 = SYS_EXIT */
hlt #0xf000                /* semihosting trap — A64 encoding */

Finally, we tell the program to exit. mov w0, #0x18 puts the operation number for SYS_EXIT (0x18) into register w0. hlt #0xf000 is a halt instruction carrying a specific immediate (0xf000) that marks it as a semihosting request. On real hardware without a debugger attached, this would just cause a crash. However, QEMU watches for this specific immediate. When QEMU sees hlt #0xf000, it pauses execution, reads the operation number in w0, examines the parameter block that x1 points to on the stack, and then cleanly shuts down the virtual machine.


The Makefile

Create the following Makefile:

# CROSS — set to the prefix of your toolchain:
#   macOS (Homebrew):      aarch64-elf-
#   Ubuntu / Debian:       aarch64-linux-gnu-
CROSS ?= aarch64-elf-

AS      = $(CROSS)as
LD      = $(CROSS)ld
OBJDUMP = $(CROSS)objdump

ASFLAGS = -g
LDFLAGS = -T link.ld

TARGET  = kernel.elf
OBJS    = boot.o

all: $(TARGET)

$(TARGET): $(OBJS) link.ld
	$(LD) $(LDFLAGS) -o $@ $(OBJS)

%.o: %.S
	$(AS) $(ASFLAGS) -o $@ $<

run: $(TARGET)
	qemu-system-aarch64 \
		-M virt \
		-cpu cortex-a53 \
		-nographic \
		-kernel $(TARGET) \
		-semihosting

dump: $(TARGET)
	$(OBJDUMP) -d $(TARGET)

clean:
	rm -f $(OBJS) $(TARGET)

.PHONY: all run dump clean

The CROSS ?= aarch64-elf- line is a key detail: ?= assigns the default only if CROSS is not already set. On Ubuntu, where the tools are prefixed aarch64-linux-gnu- (as in aarch64-linux-gnu-as), you need to override it when you call make with:

make CROSS=aarch64-linux-gnu-

Or you can export it once and never have to think about it again:

export CROSS=aarch64-linux-gnu-
make run

The dump target will be your best friend for debugging later in the series. make dump runs objdump -d on the ELF and shows you the disassembly of your kernel. This is often the only way to figure out what actually made it into the binary versus what you thought you wrote.

The QEMU flags also deserve a quick explanation:

  • -M virt selects the virtual machine type. The virt machine is a clean ARM board defined by QEMU and not a simulation of any real hardware. It has a documented, stable memory map, which is why we use it.
  • -cpu cortex-a53 sets the CPU model. The Cortex-A53 is a real AArch64 core found in the Raspberry Pi 3 (the Raspberry Pi 4 uses the newer Cortex-A72). QEMU models its instruction set closely enough that CPU-level code written for it should run on the real core, although the board around it differs.
  • -nographic disables the graphical display and keeps everything in the terminal. We don’t have a framebuffer driver.
  • -kernel kernel.elf tells QEMU to load our ELF, parse the program headers to find where to place each segment, and jump to the entry point.
  • -semihosting enables the host-side of the semihosting protocol. Without this flag, hlt #0xf000 will cause an exception, and QEMU will hang instead of exiting.

Running It

make run

QEMU starts, loads your kernel at 0x40000000, and jumps to _start. Your code sets up the stack and makes the semihosting exit call, and QEMU terminates. The terminal returns to your shell prompt. There is no output because we haven’t wired up UART yet.

You can also check the exit code right afterwards with:

echo $?
# 0

That 0 is your binary talking to you through QEMU’s exit mechanism. It means every piece worked: the assembler converted your source to machine code, the linker placed it at the right address, QEMU loaded and executed it, and the semihosting call went through correctly.

If you want to see what you actually compiled, run:

make dump

You’ll see the disassembly of kernel.elf: a handful of instructions, ending with hlt #0xf000. That’s your entire kernel right now, and it’s enough to prove the toolchain works.


What Broke (And Why)

The first time I ran make run on macOS, I got this:

qemu-system-aarch64: -kernel kernel.elf: could not load kernel 'kernel.elf'

QEMU couldn’t load the kernel. The binary existed. The path was right. The issue was that kernel.elf had no program headers, which QEMU needs in order to know where in memory to place the code. The file was a valid ELF object, but not a loadable executable.

The cause was in the Makefile: in an early version, I had accidentally wired the link step to the assembler, $(AS), instead of the cross-linker, $(LD). The assembler produced an object file that merely carried the .elf name, and QEMU couldn’t load it. Running file kernel.elf reported ELF 64-bit LSB relocatable, where it should have said ELF 64-bit LSB executable. Relocatable means an unlinked object. Executable means linked, load-addressed, and ready to run.
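
If the file utility isn’t at hand, the same distinction can be read straight from the ELF header. The sketch below is a hypothetical helper, not part of the project: it inspects the low byte of e_type, the 16-bit field at byte offset 16, and assumes the input is a little-endian ELF file without verifying the magic number.

```shell
# Hypothetical helper: report the ELF file type from the e_type field.
# Only the low byte at offset 16 is read, which suffices for little-endian ELF.
elf_type() {
    etype=$(od -An -tu1 -j16 -N1 "$1" | tr -d ' \n')
    case $etype in
        1) echo "relocatable" ;;    # ET_REL: an unlinked object file
        2) echo "executable" ;;     # ET_EXEC: linked at a fixed address
        3) echo "shared object" ;;  # ET_DYN
        *) echo "unknown" ;;
    esac
}

# Example: elf_type kernel.elf
```

A correctly linked kernel.elf should report executable; an object file that was merely renamed reports relocatable.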

Another issue that can cost a lot of debugging time is forgetting the .text.boot section qualifier and using .text instead. Everything assembles and links cleanly. QEMU loads the binary. Then nothing happens: no output, no exit, just a hung process. The entry point isn’t at 0x40000000; another section got there first. Running make dump and checking the address of _start will confirm it immediately: if _start isn’t at 0x40000000, the section ordering is wrong.
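
The address check can also be scripted against the symbol table. This is a sketch under assumptions: check_start is a hypothetical helper, and it expects nm-style lines of the form "address type name" on standard input, which is what the nm tool from the cross binutils prints.

```shell
# Hypothetical helper: read nm output on stdin and report where _start is.
check_start() {
    awk -v want="0000000040000000" '
        $3 == "_start" { found = 1; addr = $1 }
        END {
            if (!found)                 print "_start not found"
            else if ((addr "") == want) print "_start at 0x40000000"
            else                        print "_start misplaced at 0x" addr
        }'
}

# Example: "${CROSS}nm" kernel.elf | check_start
```

The example line assumes CROSS is exported in your shell; nm is not one of the tools the Makefile in this post defines.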


What’s Next

The next post starts to feel like OS development. We’ll look at how ARM actually boots: exception levels (EL0 through EL3), the reset vector, and what happens in the nanoseconds between power-on and your first instruction. Then we’ll write an assembly stub that does more than exit: it jumps to a C++ kernel_main function.


Sources