Toolchain & Environment Setup: Your First ARM Binary
Step-by-step guide to installing an AArch64 cross-compiler (GCC) and QEMU on macOS and Linux. Write a bare-metal linker script and Makefile to boot your first ARM64 program.
At the end of the previous post, we promised that this one would install a cross-compiler, get QEMU running, provide a Makefile skeleton that will grow with you through the series, and have your first binary executing on a virtual ARM machine.
So that is exactly what we will do. There’s nothing conceptually difficult here. The more complex topics, such as the ARM boot process, exception levels, and page tables, will come later. This post is about getting your hands on the right tools and confirming they work, so that every post after this starts from a clean, verified baseline. Think of it as tuning an instrument before the concert.
What We’re Building Today
By the end of this post, we will have three files in our project:
- boot.S: a minimal AArch64 assembly stub
- link.ld: a linker script that places your code at the right address
- Makefile: a skeleton build system with make, make run, and make clean targets
Running make run compiles, links, and runs your binary in QEMU. The binary will boot,
ask QEMU to exit cleanly, and return you to the shell prompt with exit code 0.
This is proof that our toolchain works end-to-end; there is no UART or output yet.
That 0 is the whole point of this post. QEMU exited cleanly because your binary requested it. Every tool in the chain worked.
Why a Cross-Compiler?
Here’s something that trips people up: you can’t use your system’s default gcc to build bare-metal AArch64 code.
Your system compiler, such as gcc on Linux or clang on macOS, is built to produce binaries for your machine and
operating system. It knows how to link against your system libraries, generate the right calling convention for
your OS’s ABI, and produce executables your kernel’s loader can run. That’s all completely useless for OS development.
What you need is a cross-compiler: a version of GCC that runs on your machine but generates code for a different target. For our OS, that means bare-metal AArch64 with no OS assumptions, no standard library, and no startup code you didn’t write yourself. It’s just you and the hardware.
The two relevant toolchain triples for this series are:
- aarch64-elf: targets AArch64 bare metal (ELF output, no OS). This is what Homebrew provides directly, and what we’ll use on macOS.
- aarch64-linux-gnu: targets AArch64 Linux userspace, but can produce bare-metal code with the right flags. This is what apt provides on Ubuntu and Debian.
Both produce correct bare-metal binaries for our purposes. The Makefile will abstract the difference with a CROSS variable you set once.
Installing the Toolchain
macOS
Homebrew has aarch64-elf-gcc as a first-class formula, no taps, no third-party repositories:
brew install aarch64-elf-gcc qemu
This installs QEMU with AArch64 support included and the full aarch64-elf-* toolchain:
- aarch64-elf-gcc
- aarch64-elf-as
- aarch64-elf-ld
- aarch64-elf-objdump
After installing, verify the version with:
aarch64-elf-gcc --version
qemu-system-aarch64 --version
You should see GCC 14.x or later, and QEMU 9.x or later. If either command fails, check that Homebrew’s bin directory is in your PATH.
Ubuntu and Debian
Ubuntu 22.04 and later include an AArch64 cross toolchain in the standard repositories:
sudo apt update
sudo apt install gcc-aarch64-linux-gnu binutils-aarch64-linux-gnu qemu-system-arm
A note on the naming: aarch64-linux-gnu is technically a Linux-target toolchain triple. It assumes you’re building Linux userspace programs.
In practice, it produces correct bare-metal binaries when you pass the right flags and use your own linker script. Since we only
use the assembler and linker directly, the distinction is invisible, and both produce identical output.
If you prefer an explicitly bare-metal toolchain, the Arm-published GNU toolchain uses the triple aarch64-none-elf, and some distributions package a bare-metal variant of their own.
Any of these prefixes works; the Makefile’s CROSS variable handles the difference.
Verify:
aarch64-linux-gnu-gcc --version
qemu-system-aarch64 --version
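If you would rather run one check than several commands by hand, a small helper along these lines reports which tools are missing from your PATH. This is only a sketch: missing_tools and PREFIX are hypothetical names, not part of any toolchain, and the default prefix assumes the Ubuntu and Debian packages above.

```shell
#!/bin/sh
# missing_tools (hypothetical helper): print each named tool that is NOT on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed prefix; set PREFIX=aarch64-elf- for the Homebrew toolchain.
PREFIX=${PREFIX:-aarch64-linux-gnu-}
missing=$(missing_tools "${PREFIX}as" "${PREFIX}ld" "${PREFIX}objdump" qemu-system-aarch64)

if [ -z "$missing" ]; then
    echo "all tools found"
else
    echo "missing tools:"
    echo "$missing"
fi
```

An empty result means the assembler, linker, objdump, and QEMU are all reachable, which is everything the Makefile in this post calls.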
The Project Layout
We will call our new OS purgatory, because that’s where our processes go when they get stuck in an infinite loop. The directory of the same name is the project root, and we keep extending it in each post. To start, create it and change into it:
mkdir -p purgatory
cd purgatory
By the end of this post, it should look like the following:
purgatory/
├── boot.S
├── link.ld
└── Makefile
The directory will grow across posts, but it’ll always have this shape at its core: a source file, a linker script, and a Makefile. This series is accompanied by this GitHub repo, where you can find a working solution. Each post has a dedicated branch in the git repo.
The Linker Script
Before we write any assembly, we need to tell the linker where to put things. QEMU’s virt machine loads our ELF binary and
starts execution at whatever address the ELF says. The RAM on the virt machine starts at 0x40000000.
If we don’t place our code there, QEMU tries to run it from address 0x0, where there’s nothing, and silently hangs.
Create link.ld:
ENTRY(_start)

SECTIONS {
    . = 0x40000000;

    .text : {
        *(.text.boot)
        *(.text)
    }

    .data : { *(.data) }

    .bss : {
        __bss_start = .;
        *(.bss)
        __bss_end = .;
    }

    /* Reserve 4KB for the stack */
    . = ALIGN(16);
    . += 4096;
    _stack_top = .;
}
What each part does
1. The Entry point
ENTRY(_start)
This line tells the linker which symbol marks the first instruction of the program. The linker records that symbol’s address as the entry point in the ELF header, and QEMU jumps there after loading the file. Here, the symbol is _start.
2. Setting the Base Address
SECTIONS {
. = 0x40000000;
The dot is called the location counter; here, we are manually setting it to 0x40000000. It tells the linker, “Start placing the following sections of code at this specific memory address.” It is the start of our physical RAM.
3. The .text Section
Here we are creating the .text block in our final file.
- *(.text.boot) takes the startup code from all input files and puts it first.
- *(.text) takes all the rest of the compiled functions and code and places them right after.
The order matters because execution starts at the base address.
By putting .text.boot first, we guarantee that our startup code sits at 0x40000000 and runs before any regular C++ functions.
4. The .data Section
.data : { *(.data) }
In the data section, we are gathering all global or static variables that already have a value assigned to them. These values will occupy actual space in the binary file, allowing them to be loaded into RAM.
5. The .bss Section
.bss : {
__bss_start = .;
*(.bss)
__bss_end = .;
}
In this section, we gather all global variables that are uninitialised or initialised to zero.
Unlike the .data section, none of these values are stored in the binary file.
The linker just marks the start (__bss_start) and end (__bss_end) of this section. When your OS boots,
the startup assembly code is expected to read these two markers and loop through that memory range, zeroing it out manually. Today’s boot.S has no variables yet, so it skips that step.
6. The Stack Allocation
. = ALIGN(16); rounds the location counter up to a multiple of 16 bytes.
The AArch64 procedure call standard requires the stack pointer to be 16-byte aligned, and the hardware can fault on memory accesses through a misaligned sp. The . += 4096; moves the location
counter forward by 4096 bytes (4KB), leaving empty space. Lastly, _stack_top = .;
creates a symbol pointing to the very end of that 4KB block. The stack grows downward, so this is the value boot.S loads into sp.
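To make the location-counter arithmetic concrete, here is the same calculation done with shell arithmetic. It is a sketch with a made-up input: the address where .bss ends (0x40000041) is invented for illustration, and align16 is a hypothetical helper that mirrors what ALIGN(16) does.

```shell
#!/bin/sh
# align16 (hypothetical helper): round an address up to the next multiple of 16,
# mirroring what ". = ALIGN(16);" does to the location counter.
align16() {
    echo $(( ($1 + 15) & ~15 ))
}

# Made-up example: suppose .bss ended at 0x40000041.
bss_end=$(( 0x40000041 ))
stack_bottom=$(align16 "$bss_end")
stack_top=$(( stack_bottom + 4096 ))            # ". += 4096;"

printf 'stack bottom: 0x%x\n' "$stack_bottom"   # prints 0x40000050
printf 'stack top:    0x%x\n' "$stack_top"      # prints 0x40001050
```

Because 4096 is itself a multiple of 16, the top of the stack stays 16-byte aligned whenever the bottom is.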
The Entry Point
boot.S is the entire program: a minimal assembly stub that sets up the stack and asks QEMU to exit cleanly.
    .section .text.boot
    .global _start

_start:
    /* Point the stack pointer at the top of our reserved stack */
    ldr     x0, =_stack_top
    mov     sp, x0

    /*
     * QEMU semihosting: SYS_EXIT call.
     *
     * Semihosting lets bare-metal code make requests of the host
     * (QEMU, in this case) through a special trap instruction.
     * We use it here to ask QEMU to exit cleanly with code 0.
     * This is only useful for testing; real kernels don't exit.
     *
     * The parameter block lives on the stack:
     *   [sp+0] reason:      0x00020026 (ADP_Stopped_ApplicationExit)
     *   [sp+8] exit_status: 0x00000000 (success)
     */
    mov     x1, #0x26
    movk    x1, #0x2, lsl #16   /* x1 = 0x00020026 */
    str     x1, [sp, #-16]!     /* push reason; pre-decrement sp by 16 */
    mov     x0, #0
    str     x0, [sp, #8]        /* store exit status at [sp+8] */
    mov     x1, sp              /* x1 → start of parameter block */
    mov     w0, #0x18           /* w0 = SYS_EXIT */
    hlt     #0xf000             /* semihosting trap, A64 encoding */
There are a few things that we should understand here because they’ll come back in every post that follows:
.section .text.boot
.global _start
_start:
This part explicitly flags this chunk of code for the .text.boot section. If you forget this and use
.text instead, your code ends up somewhere after other .text sections, and _start might not
be at address 0x40000000. It will compile fine and fail mysteriously at runtime.
/* Point the stack pointer at the top of our reserved stack */
ldr x0, =_stack_top
mov sp, x0
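The ldr x0, =_stack_top form is an assembler pseudo-instruction that loads a 64-bit address into a register through the literal pool. AArch64 can’t encode a full 64-bit immediate in a single instruction, so the assembler stores the address as data near the code and emits a PC-relative load that fetches it. The mov sp, x0 then copies it into the stack pointer.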
mov x1, #0x26
movk x1, #0x2, lsl #16 /* x1 = 0x00020026 */
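These two instructions build the 32-bit reason code in two 16-bit pieces: mov writes 0x26 into the low bits of x1, and movk inserts 0x2 into bits 16 to 31 while keeping the rest. The value is part of a semihosting request. Because our bare-metal machine has no screen, keyboard, or disk driver yet, it can ask the QEMU emulator to perform a service on its behalf using the host’s resources. This mechanism is called semihosting: a debug facility defined by ARM that lets bare-metal code communicate with a host environment.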
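If the way the mov and movk pair combines into one constant feels opaque, the same bit arithmetic can be checked in the shell. This is only an illustration of the arithmetic, not something the build needs; the variable names are made up.

```shell
#!/bin/sh
# mov  x1, #0x26           -> x1 = 0x26, all other bits cleared
# movk x1, #0x2, lsl #16   -> keep the low 16 bits, write 0x2 into bits 16-31
low=$(( 0x26 ))
high=$(( 0x2 << 16 ))
reason=$(( high | low ))

printf '0x%08x\n' "$reason"   # prints 0x00020026
```

The result matches the ADP_Stopped_ApplicationExit value quoted in the boot.S comment.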
mov w0, #0x18 /* w0 = SYS_EXIT */
hlt #0xf000 /* semihosting trap — A64 encoding */
Finally, we tell the program to exit. mov w0, #0x18 puts the command for SYS_EXIT (0x18)
into register w0. hlt #0xf000 is a processor halt instruction with a specific tag (#0xf000).
Under normal conditions on real hardware without a debugger attached, this would just cause a crash.
However, QEMU watches for this specific halt code. When QEMU sees hlt #0xf000,
it pauses execution, reads the command in w0, examines the parameters pointed to
by x1 on the stack, and then cleanly shuts down the virtual machine.
The Makefile
Create the following Makefile:
# CROSS: set to the prefix of your toolchain:
#   macOS (Homebrew):  aarch64-elf-
#   Ubuntu / Debian:   aarch64-linux-gnu-
CROSS ?= aarch64-elf-

AS      = $(CROSS)as
LD      = $(CROSS)ld
OBJDUMP = $(CROSS)objdump

ASFLAGS = -g
LDFLAGS = -T link.ld

TARGET = kernel.elf
OBJS   = boot.o

all: $(TARGET)

# Recipe lines below must start with a tab character, not spaces.
$(TARGET): $(OBJS) link.ld
	$(LD) $(LDFLAGS) -o $@ $(OBJS)

%.o: %.S
	$(AS) $(ASFLAGS) -o $@ $<

run: $(TARGET)
	qemu-system-aarch64 \
	    -M virt \
	    -cpu cortex-a53 \
	    -nographic \
	    -kernel $(TARGET) \
	    -semihosting

dump: $(TARGET)
	$(OBJDUMP) -d $(TARGET)

clean:
	rm -f $(OBJS) $(TARGET)

.PHONY: all run dump clean
The CROSS ?= aarch64-elf- line is a key detail. On Ubuntu, where the binary is named aarch64-linux-gnu-gcc,
you need to override it when you call make with:
make CROSS=aarch64-linux-gnu-
Or you can export it once and never have to think about it again:
export CROSS=aarch64-linux-gnu-
make run
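If you move between macOS and Linux machines, you could also let a small script pick the prefix for you. The sketch below assumes only the two prefixes used in this post; detect_cross is a hypothetical helper name, and it treats a prefix as usable when its assembler is on PATH.

```shell
#!/bin/sh
# detect_cross (hypothetical helper): print the first prefix whose assembler is on PATH.
detect_cross() {
    for prefix in "$@"; do
        if command -v "${prefix}as" >/dev/null 2>&1; then
            printf '%s\n' "$prefix"
            return 0
        fi
    done
    return 1
}

# Try the two prefixes used in this post, in order.
if CROSS=$(detect_cross aarch64-elf- aarch64-linux-gnu-); then
    export CROSS
    echo "using CROSS=$CROSS"
else
    echo "no AArch64 cross toolchain found on PATH"
fi
```

Sourcing a file like this from your shell profile has the same effect as the manual export, without hard-coding which machine you are on.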
The dump target will be your best friend for debugging later in the series. make dump pipes
the ELF through objdump -d, and shows you the disassembly of your kernel. This is
often the only way to figure out what actually made it into the binary versus what you thought you wrote.
The QEMU flags also deserve a quick explanation:
- -M virt selects the virtual machine type. The virt machine is a clean ARM board defined by QEMU and not a simulation of any real hardware. It has a documented, stable memory map, which is why we use it.
- -cpu cortex-a53 sets the CPU model. The Cortex-A53 is a real AArch64 core found in the Raspberry Pi 3 (the Pi 4 uses the newer Cortex-A72). QEMU implements it faithfully enough that code written for it runs on the real chip without changes.
- -nographic disables the graphical display and keeps everything in the terminal. We don’t have a framebuffer driver.
- -kernel kernel.elf tells QEMU to load our ELF, parse the program headers to find where to place each segment, and jump to the entry point.
- -semihosting enables the host side of the semihosting protocol. Without this flag, hlt #0xf000 will cause an exception, and QEMU will hang instead of exiting.
Running It
make run
QEMU starts, loads your kernel at 0x40000000, and jumps to _start. Your code sets up the stack and makes the semihosting exit call,
and QEMU terminates. The terminal returns to your shell prompt. There is no output because we haven’t wired up UART yet.
You can also check the exit code with:
echo $?
# 0
That 0 is your binary talking to you through QEMU’s exit mechanism. It means every piece worked: the assembler
converted your source to machine code, the linker placed it at the right address, QEMU loaded and
executed it, and the semihosting call went through correctly.
If you want to see what you actually compiled, run:
make dump
You’ll see the disassembly of kernel.elf: a handful of instructions, ending with hlt #0xf000.
That’s your entire kernel right now, and it’s enough to prove the toolchain works.
What Broke (And Why)
The first time I ran make run on macOS, I got this:
qemu-system-aarch64: -kernel kernel.elf: could not load kernel 'kernel.elf'
QEMU couldn’t load the kernel. The binary existed. The path was right. The issue was that kernel.elf
was an unlinked object file with no program headers. QEMU needs those headers to know where in memory to
place the code. The file was a valid ELF object, but not a loadable executable.
I had accidentally wired the link step to the assembler, $(AS), instead of the cross-linker, $(LD),
in an early version of the Makefile.
The assembler produced an object file that was merely named kernel.elf, and QEMU couldn’t load it. Running file kernel.elf showed
ELF 64-bit LSB relocatable, where it should have said ELF 64-bit LSB executable.
Relocatable means an unlinked object. Executable means linked, load-addressed, and ready to run. Pointing the link rule back at $(LD) fixed it.
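The file command is the quickest way to see this, but the same check can be scripted by reading the ELF header directly. The sketch below is an alternative to file, not what I used at the time. It assumes a little-endian ELF (the "LSB" in the output above) and does not validate the ELF magic bytes; elf_type is a hypothetical helper name.

```shell
#!/bin/sh
# elf_type (hypothetical helper): classify a little-endian ELF by its e_type field.
# e_type is a 16-bit value at byte offset 16 of the ELF header; for the values
# we care about, the low byte (stored first in little-endian order) is enough.
elf_type() {
    t=$(od -An -tu1 -j16 -N1 "$1" | tr -d '[:space:]')
    case "$t" in
        1) echo relocatable ;;   # ET_REL: unlinked object
        2) echo executable ;;    # ET_EXEC: linked at a fixed address
        *) echo other ;;
    esac
}

# Only runs the check if a build output is present in the current directory.
if [ -f kernel.elf ]; then
    elf_type kernel.elf
fi
```

A check like this could sit in front of the run target to catch a mis-wired link rule before QEMU ever starts.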
Another issue that can cost a lot of debugging time is forgetting the .text.boot section qualifier
and using .text instead. Everything compiles and links cleanly. QEMU loads the binary. Then nothing happens. No output, no exit, just a hung process.
The entry point isn’t at 0x40000000; another section got there first. Running make dump and checking the address of
_start will confirm it immediately: if _start isn’t at 0x40000000, the section ordering is wrong.
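If you would rather not scan a disassembly by eye, the symbol table gives the same answer. The sketch below is an alternative to make dump. It assumes your toolchain's nm (part of binutils, installed alongside as and ld) and its usual "address type name" line format; start_addr is a hypothetical helper, and the input line is illustrative rather than captured from a real build.

```shell
#!/bin/sh
# start_addr (hypothetical helper): read nm-style "address type name" lines on
# stdin and print the address of the _start symbol.
start_addr() {
    awk '$3 == "_start" { print $1 }'
}

# Illustrative input line in nm's format, not captured from a real build.
addr=$(printf '%s\n' '0000000040000000 T _start' | start_addr)

if [ "$addr" = "0000000040000000" ]; then
    echo "_start is at the load address"
else
    echo "_start is at 0x$addr, expected 0x40000000"
fi
```

Against a real build, the input would come from something like ${CROSS}nm kernel.elf | start_addr, with CROSS set to the same prefix the Makefile uses.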
What’s Next
The next post starts to feel like OS development. We’ll look at how ARM actually boots: exception levels (EL0 through EL3), the reset vector,
and what happens in the nanoseconds between power-on and your first instruction. Then we’ll write an assembly stub that does
something more useful than exit: it jumps to a C++ kernel_main function.
Sources
- Arm GNU Toolchain Install Guide — Arm Developer - official ARM guide for installing the GNU cross-compiler toolchain on macOS and Linux.
- messense/homebrew-macos-cross-toolchains — GitHub - Homebrew tap providing cross-compiler toolchains for macOS, referenced for alternative toolchain setups.
- aarch64-elf-gcc — Homebrew Formulae - Homebrew formula for the bare-metal AArch64 GCC toolchain used in the macOS installation steps.
- QEMU AArch64 Virt Bare Bones — OSDev Wiki - OSDev community reference for running bare-metal AArch64 on the QEMU virt machine, used when designing the Makefile.
- QEMU ARM System Emulation Documentation - official QEMU documentation explaining the virt machine’s memory map and the meaning of each QEMU flag used in the Makefile.
- AArch64 Semihosting Exit — cirosantilli/linux-kernel-module-cheat - reference implementation of the semihosting SYS_EXIT call used to cleanly exit QEMU from bare-metal code.
- Cross-compiling for AArch64 on Debian/Ubuntu — Jensd’s I/O buffer - guide to setting up the aarch64-linux-gnu toolchain on Debian and Ubuntu referenced in the Linux installation section.
- QEMU About: Emulation - explains how QEMU emulates hardware and why the virt machine behaves differently from real silicon.