Breeding Better Binaries: Shackleton and the Case for Evolutionary Optimizations in LLVM

Home > Blog > Breeding Better Binaries

Posted: 2026-05-30 |

Author: sid |

Category: Compilers, Research

read a paper from Arm and Michigan State about using linear genetic programming to optimize LLVM pass sequences, as someone who took bio in 12th grade , the idea kinda stuck with me.

this post is long. i'm going to explain everything from the ground up , what llvm passes are, how genetic programming works, how shackleton puts them together, and whether any of it matters in practice.

part 1: what does a compiler actually do

before we talk about optimization, we need to be clear on what a compiler does. you write source code in C or Rust or whatever. the compiler translates it to machine code. in the old days (and still in GCC's architecture), this was a straightforward pipeline: lexing, parsing, code generation, done.

llvm is different. it inserts an intermediate step called IR (intermediate representation). your source code gets compiled to IR first. then llvm runs a bunch of transformations on that IR. then the IR gets compiled to machine code for your specific target. the reason for this three-stage design is reuse , the same IR transformations work whether your source is C, Rust, Swift, or Julia, and they work whether your target is x86, ARM, or RISC-V. you write the optimization once, it works everywhere.

each transformation is called a pass. there are two kinds:

analysis passes read the IR and compute some property , things like which variables are never used (dead code analysis), which memory addresses might point to the same location (alias analysis), or what the loop structure of a function is (loop analysis). they don't change anything, they just produce information that other passes consume. transform passes change the IR , eliminate dead code, inline functions, unroll loops, simplify expressions. each transform pass typically needs results from one or more analysis passes to know what to change.

this creates dependencies. if you run an analysis pass, then a transform pass, then another analysis pass, the second analysis might produce different results than the first because the IR changed. passes invalidate each other's analyses. so the order you run them in directly affects what the compiler can figure out.

the default -O2 pipeline in llvm is about 100 passes long, give or take. it was hand-crafted by llvm developers over many years. it's pretty good for most code. but "pretty good" isn't the same as "optimal for your specific program on your specific hardware."

part 2: why finding the best pass sequence is hard

formalizing the problem: a pass sequence is an ordered list of passes:

$$P = \langle p_1, p_2, \dots, p_n \rangle$$

where each $p_i$ belongs to the set $\mathcal{P}$ of all available llvm passes (roughly 200 of them). the length $n$ isn't fixed , for -O2 it's around 100. the search space is all possible sequences up to some maximum length $L$:

$$\mathcal{S} = \bigcup_{n=1}^{L} \mathcal{P}^n$$

the size of this space is $\sum_{n=1}^{L} |\mathcal{P}|^n$. for $|\mathcal{P}| = 200$ and $L = 100$, the first term alone is $200^{100}$.

to give you a sense of scale: the number of atoms in the observable universe is about $10^{80}$. $200^{100}$ is roughly $10^{230}$. brute force welpsies :((.

you could try to reason about it analytically maybe . researchers have attempted to model pass interactions as a graph, where edges represent "this pass enables that optimization." but the interactions are nonlinear , pass A enables pass B which enables pass C, but running C first disables both A and B. the fitness landscape is epistatic, meaning the effect of changing one pass depends on the state of all the other passes around it.

so we need a search strategy. could try random search (generate random sequences and keep the best). orr you could try hill climbing (start from -O2 and make small changes, keep the improvements). you could try reinforcement learning (train a model to predict good sequences). or you could try what shackleton does drum roll plsss: evolution.

part 3: a crash course in genetic programming

genetic programming is a subset of evolutionary computation. the core idea is older (not as old as lisp but yea wtv) than most programmers realize , john holland wrote about genetic algorithms in 1975, and john koza popularized genetic programming in the 1990s.

you start with a population of candidate solutions, measure how good each one is with a fitness function, select the best ones to be parents, create offspring by mixing parents together (crossover) and randomly tweaking them (mutation), then replace the worst members of the population with the new offspring and repeat for many generations.

in traditional genetic algorithms, solutions are fixed-length bit strings. in genetic programming, solutions are programs , usually represented as tree structures (like abstract syntax trees) or linear sequences (like assembly code). shackleton uses the linear representation, hence linear genetic programming (LGP).

the linear representation has advantages for compiler problems. compilers process a linear sequence of instructions. the IR itself is a linear instruction stream (with control flow superimposed on it). evolving linear sequences maps naturally onto the problem. tree-based GP would represent a pass sequence as a tree, which is awkward because there's no natural tree structure to pass ordering.

the two main variation operators in LGP are:

crossover takes two parent sequences, picks a random cut point in each, and swaps the tails , producing two children that combine segments from both parents. mutation randomly modifies the offspring: insert a new instruction, delete an instruction, replace one with another, or swap two adjacent instructions.

single-point crossover. each child keeps its parent's length , only the tails trade places.

the interplay between these operators determines how the search explores the space. crossover exploits existing good building blocks by recombining them. mutation explores new territory by introducing random changes. getting the balance right depends on the problem , too much mutation and you're doing random search, too little and you converge prematurely on a mediocre solution.

part 4: what's in the shackleton repo

the osaka list structure (OSL)

at the core of shackleton's representation is the Osaka List Structure, or OSL. this is the data structure that holds sequences of genetic objects , whether those are aarch64 instructions or llvm passes. the name comes from the city where the initial design discussions happened.

an OSL is a doubly linked list where each node contains:

an OSL with three pass nodes.

each node contains a generic genetic object (GGO) (the actual instruction or pass), a pointer to the next node (forward link), a pointer to the previous node (backward link), and a sequence-level metadata block with the generation number, unique individual ID, and parent IDs for ancestry tracking.

the reason shackleton uses a linked list instead of an array is that insert and delete mutations are O(1) , you just relink a few pointers. an array-based representation would require O(n) shifts for every insert or delete, and since mutation is applied per-position with probability $p_m$, those shifts add up fast over a population of $N$ individuals across $G$ generations.

but linked lists have a cost: traversal is O(n) with a pointer-chasing overhead that arrays don't have. for short sequences (under 100 passes) this doesn't matter much. for longer sequences, the cache miss penalty of walking a linked list starts to show. shackleton doesn't seem to hit this wall because the experiments cap sequence length at well under 200.

the metadata per individual includes a fitness value (best measurement so far, cached to avoid re-evaluation), generation of birth, parent IDs for tracing lineages, sequence length, and an eval count for how many times this genome has been evaluated for averaging.

the generic genetic object (GGO)

the GGO is shackleton's abstraction for "a thing that can be evolved" , a struct with an opcode (which pass or instruction), a pointer to domain-specific data (threshold values, target features), and a mutation behavior function pointer so different pass types can mutate differently.

the function pointer for mutation is the key abstraction. shackleton's core evolutionary loop doesn't need to know what a GGO represents , it just calls the mutation function and lets the domain-specific code handle the details. this is how the same framework supports both aarch64 instruction evolution and llvm pass sequence optimization. the core loop is generic; the GGO implementation is domain-specific.

for the LLVM pass domain, each GGO wraps a single pass invocation. the opcode maps to a pass name like instcombine or gvn or licm. the mutation function for an LLVM-pass GGO randomly picks a different pass from $\mathcal{P}$ and replaces its opcode. insert and delete mutations operate on the OSL level, not on individual GGOs.

population dynamics

the population is an array of OSL pointers , flat for fast random access during tournament selection. each generation, shackleton evaluates all individuals that haven't been evaluated yet (or whose cached fitness is stale), sorts by fitness descending, copies the top $e\%$ directly into the next generation (elitism), fills the remaining slots by selecting parents via tournament and applying crossover with probability $p_c$ and mutation with probability $p_m$ per position, then assigns each new individual a unique ID and records its generation of birth and parent IDs.

the selection pressure is controlled by the tournament size $k$. larger $k$ means the best individuals are more likely to be selected as parents, which speeds up convergence but increases the risk of getting stuck in a local optimum. smaller $k$ preserves diversity but slows down progress. shackleton defaults to $k = 3$, which is standard in the GP literature , it provides moderate selection pressure without collapsing diversity too quickly.

the fitness landscape

the fitness function $\Phi: \mathcal{S} \to \mathbb{R}$ maps from sequence space to real-valued fitness. the landscape has four properties that matter: ruggedness , small changes in sequence cause large fitness changes, the landscape is all sharp peaks and deep valleys. neutrality , many sequences produce identical fitness (especially compiler-crashing ones, all $\Phi = 0$), creating flat networks the search drifts across without improving. deceptiveness , combining two good sequences doesn't give a better one because pass interactions are nonlinear, violating the building block hypothesis. noise , the same sequence evaluated twice gives different fitness, so the search works with an approximation of the true landscape.

these properties make pass sequence optimization hard for any search algorithm, including GP. they also explain why shackleton works best with genetic improvement , starting from a known-good sequence keeps you in the smoother part of the landscape near a working point.

the llvm optimization mode

the llvm integration is enabled with the -llvm_optimize flag. shackleton prompts you for the source files to optimize (placed in src/files/llvm/), creates an initial population of random pass sequences, compiles each with clang using the evolved sequence, runs the binary, measures execution time, and evolves toward faster sequences.

the compilation uses llvm's opt and clang under the hood. shackleton generates the pass sequence as an opt invocation, pipes the IR through it, then compiles the result with clang and benchmarks it.

the parameters

shackleton exposes a lot of knobs. you can set them interactively or via a parameters.txt file:

population size controls how many individuals per generation , larger populations explore more but cost more per generation. maximum generations sets when to stop if we haven't converged. crossover rate is the probability two selected parents produce offspring via crossover instead of just cloning. mutation rate is the probability per position that mutation fires , higher values increase exploration but disrupt good sequences. elitism percentage determines how many of the best individuals survive unchanged, preventing loss of the best solution found so far. initial sequence length and maximum sequence length bound the search space. number of evaluations per fitness controls noise handling by averaging multiple benchmark runs.

the -cache flag enables caching of evaluation results. without it, if two different sequences produce the same binary (possible if the difference between them gets optimized away), shackleton re-evaluates from scratch. with caching, it stores (sequence_hash, fitness) pairs and skips redundant evaluations. the cache is just a flat file , there's no database or serialization framework involved.

part 5: the evolutionary algorithm in detail; TW(i really suck at maths , apologies in advance if there's a mistake)

we generate a population of $N$ individuals. each individual is a sequence of passes. in genetic improvement mode, one individual is the default pipeline; the rest are mutated copies of it. in pure GP mode, each pass is chosen uniformly at random from $\mathcal{P}$, and sequence lengths are drawn uniformly from $[L_{\min}, L_{\max}]$.

$$G_i = [p_1, p_2, \dots, p_{L_i}], \quad L_i \sim U(L_{\min}, L_{\max}), \quad p_j \sim U(\mathcal{P})$$

fitness evaluation

for each individual $i$, compile the benchmark $f$ with pass sequence $G_i$:

$$\text{binary}_i = \text{compile}(f, G_i)$$

run the binary $R$ times and record execution times:

$$T_i^{(1)}, T_i^{(2)}, \dots, T_i^{(R)}$$

compute the mean:

$$\hat{T}_i = \frac{1}{R} \sum_{r=1}^{R} T_i^{(r)}$$

fitness relative to baseline:

$$\Phi_i = \frac{\hat{T}_{\text{-O2}}}{\hat{T}_i}$$

if the binary doesn't compile or crashes at runtime, $\Phi_i = 0$. this creates strong selection pressure against broken sequences.

selection

shackleton uses tournament selection. pick $k$ individuals at random from the population, and keep the one with the highest fitness. repeat until you have enough parents. tournament size $k$ controls selection pressure , larger $k$ means stronger pressure toward the fittest individuals, which speeds up convergence but increases the risk of premature convergence on a local optimum.

$$P(\text{select } i) = \frac{k \cdot \Phi_i}{\sum_{j=1}^{N} \Phi_j} \quad \text{(approximately)}$$

crossover

with probability $p_c$, two parents produce two children via single-point crossover. given parents $A$ and $B$ with lengths $L_A$ and $L_B$, choose a crossover point $c$ uniformly from $[1, \min(L_A, L_B) - 1]$:

$$\text{child}_1 = [A_1, \dots, A_c, B_{c+1}, \dots, B_{L_B}]$$

$$\text{child}_2 = [B_1, \dots, B_c, A_{c+1}, \dots, A_{L_A}]$$

the children have lengths $c + (L_B - c) = L_B$ and $c + (L_A - c) = L_A$ respectively. so crossover preserves the length of each parent in the corresponding child. length variation happens through insert/delete mutations instead.

mutation

after crossover, each child is mutated with probability $p_m$ per position. the mutation operator is chosen uniformly from the four options (replace, insert, delete, swap). insert and delete change the sequence length by $\pm 1$. the framework enforces hard bounds $[L_{\min}, L_{\max}]$ , if a delete would take it below $L_{\min}$, it's ignored, and if an insert would take it above $L_{\max}$, it's ignored.

replacement

the new offspring replace the worst individuals in the population. elitism preserves the top $e\%$ unchanged. the remaining population slots are filled with offspring from tournament selection of the survivors.

the loop stops when the maximum generation count $G_{\max}$ is reached, or when fitness hasn't improved by more than $\epsilon$ for $S$ consecutive generations, or when a target fitness threshold (like 5% improvement over -O2) is reached.

part 6: what the paper actually found

the paper tests two benchmarks. i wish they had tested more but it is what it is . sob sob

benchmark 1: DSP kernel. a simple digital signal processing loop. about 20 lines of C, tight loop with multiply-accumulate operations. shakleton converged within 20 generations to sequences that matched -O2 performance. the best evolved sequences were slightly different from the default pipeline , they used fewer passes in a different order , but produced binaries within 1-2% of the same runtime. this suggests that for simple kernels, the default pipeline is already near-optimal and evolution confirms what the llvm devs already knew.

benchmark 2: cryptographic workload. a larger program with complex control flow, multiple functions, and data-dependent branching. this took 50+ generations and the results were less consistent. different runs of the evolutionary algorithm produced different best sequences, with fitness values varying by as much as 10% across runs. the best sequences matched -O2 but never significantly beat it. the authors note that longer sequences (above 40 passes) become exponentially harder to optimize because the epistasis problem compounds.

the paper also reports that mutation rate had a strong effect on convergence. low mutation rates ($p_m < 0.01$) caused premature convergence , the population collapsed to a few similar sequences and stopped exploring. high mutation rates ($p_m > 0.1$) turned the search into random walk , good sequences were disrupted before they could propagate. the sweet spot was around $p_m = 0.05$, which is consistent with the gp literature.

population size mattered too. smaller populations ($N < 50$) converged quickly but to poor solutions. larger populations ($N > 200$) explored more thoroughly but the per-generation cost became prohibitive , each generation took hours because every individual had to be compiled and benchmarked. the sweet spot was $N = 100$ to $N = 150$.

part 7: why this is hard in practice

reading the paper and the code, i see several practical problems that the framework doesn't fully solve.

measurement noise. runtime measurements on a modern cpu are incredibly noisy. frequency scaling, cache warming, background processes, thermal throttling , all of these add variance. shackleton runs each binary $R \geq 3$ times and averages, but the paper doesn't report confidence intervals or statistical significance tests. if the variance is high, a sequence that appears to be 2% faster than -O2 might just be luck. you'd need $R = 30$ or more to get reliable measurements, which multiplies the cost.

compilation failures. most random pass sequences don't produce working binaries. the paper doesn't report exact numbers, but in practice, especially without genetic improvement, 80-90% of random sequences might fail. each failure still costs the compilation time. the fitness cache helps once you've evaluated a sequence, but the first evaluation of each sequence is wasted if it crashes.

generalization. the evolved sequence is specific to one benchmark on one input. there's no guarantee it will work well on different inputs to the same program. a sequence that optimizes a hot loop for input size 1024 might make it slower for input size 512, because different passes trigger different unrolling decisions. the paper doesn't test this.

epistasis. i keep coming back to this because it's the fundamental challenge. the fitness function is not smooth. small changes can have large nonlinear effects. two sequences that differ by one pass can produce binaries with wildly different performance, making the fitness landscape essentially random at small scales. genetic algorithms rely on the assumption that similar genomes have similar fitness (the "building block hypothesis"). if that assumption is violated , and in pass optimization, it often is , the search degrades to random sampling.

$$\Phi(P) \not\approx \Phi(P') \quad \text{even when } \text{diff}(P, P') = 1$$

part 8: how shackleton compares to other approaches

other cool projects written by really really smark people to tackle the same problem differently.

Approach	Method	Pros	Cons
Shackleton	LGP on pass sequences	no training data needed, works per-benchmark, simple setup	doesn't generalize, expensive evaluation, noisy fitness
MLGO (Google)	neural network policy for pass decisions	generalizes across programs, learns IR features	needs training corpus, GPU time, ML expertise
AutoTuner (LLVM)	beam search + cost model	integrated into LLVM, deterministic	limited to known pass orderings, no novelty
Opt-Bisect	binary search for crash-causing passes	built into LLVM, zero setup	only finds crashes, not performance
Random search	generate random sequences, keep best	trivially parallelizable, baseline for comparison	sample-inefficient, no learning

google's mlgo takes a different angle , train a neural network to look at the IR and decide which pass to run next. it feeds on features like instruction counts, loop depths, and memory access patterns. learning from features means it can generalize to programs it hasn't seen before. the catch is you need a large training corpus of (program, good_sequence) pairs and GPU time to train the thing.

shackleton just needs a compiler and a benchmark, but the pass sequence it finds for your DSP kernel won't help your crypto workload at all.

people tried evolving compiler passes on GCC back in the early 2000s. shackleton's twist is targeting LLVM and packaging it as a reusable framework rather than a one-off script.

part 9: how i think you'd use this(could only think this up while writing this,will surely come back again )

shackleton makes most sense when you've got one program, one target.a sequence that works for your firmware on your weird risc-v core is all that matters.

it dumps your source through random pass sequences, compiles each one, benchmarks the result, and keeps whatever works best. over time the population drifts toward sequences that happen to be good for your specific setup. it's guided by measurement, not guesses.

whether the result beats -O2 is a crapshoot. the paper saw modest wins on simple kernels and nothing on complex ones. the editor tool lets you hand-tweak whatever evolution finds, which matters because evolution usually gets close but not all the way there.

part 10: what i think

the niche i see is embedded targets with weird architectures. if you're writing firmware for a custom risc-v core, the -O2 pipeline was tuned for x86 desktop chips, not your particular cache layout or instruction latencies. mlgo needs a team and a training pipeline. shackleton just needs a compiler and an overnight run. you evolve once and ship that sequence to every device. the compilation cost is front-loaded.

genetic improvement saves this whole thing. start from -O2 and at least everything compiles. now evolution just needs to find the extra few percent instead of figuring out how to start a fucking fire.

what i'd really like to see is someone runs shackleton against random search with the same budget of compilations, tests whether the best sequence works across different inputs to the same program, and does a case study on an actual embedded board with a known bottleneck. if it can beat -O2 by 5% on a Cortex-M4 or a real risc-v core.

paper and repo

Paper: arXiv:2201.13305 (Jan 2022)
Repo: ARM-software/Shackleton-Framework , 25 stars, C (96.6%), last commit 2022
Authors: Hannah Peeler (Arm), Shuyue Stella Li (Johns Hopkins), Andrew N. Sloss (Arm), Kenneth N. Reid (Michigan State), Yuan Yuan (MSU), Wolfgang Banzhaf (MSU)

Back to All Posts

Filed under: Compilers, Genetic Programming, LLVM