Chasing Bit-Equality Through a Wasm Actor Runtime

This is the second post in a series on Driftwood, a runtime for stateful Wasm actors with hardware-accelerated inference on Apple Silicon. The first post covered the zero-copy GPU path: a Wasm actor's linear memory can be shared with the GPU without copying. This one covers a different question: does an actor's full state (the model's, the runtime's, and the guest's) round-trip through snapshot and restore, byte-for-byte?

What bit-equality means here

A snapshot is the durable byte-image of a Wasm actor: linear memory, globals, and the metadata to reconstruct them. Restoring an actor rebuilds that image into a fresh Wasmtime store and continues execution as if nothing happened.

The strongest version of that promise is bit-equality. Run an actor through a multi-turn conversation, fork at a checkpoint. In one branch, run one more turn normally. In the other, snapshot, kill the process, restore, and run that same turn. Then compare every byte: KV cache, globals, linear memory, generated tokens. They should all match.

Driftwood is bit-reproducibly resumable on the same machine, and the guarantee comes in two parts: what the runtime can promise unconditionally, and what additionally holds when the guest cooperates. The work wasn't in making it deterministic, but in finding where the line between those two halves had to be drawn, and making both sides airtight.

The hard part isn't the model itself. The usual LLM nondeterminism problem, batch-invariance failure (see this piece by Thinking Machines), is that a kernel's output for a given input depends on what else is in the batch, because reduction order changes with batch shape. Driftwood sidesteps it entirely: inference runs at batch size one on a single MLX thread. No batching means no batch variance. The motivation was thread safety; determinism was a side effect. What's left to test is whether the runtime around the model (serialize, snapshot, restore, replay) round-trips its state byte-for-byte.
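The reduction-order point is easy to reproduce outside any model. A minimal sketch, using illustrative f64 values rather than anything from Driftwood's kernels: floating-point addition is not associative, so summing the same numbers in a different grouping can change the low bits of the result.

```rust
/// Sums with the grouping ((a + b) + c).
fn sum_left(xs: [f64; 3]) -> f64 {
    (xs[0] + xs[1]) + xs[2]
}

/// Sums the same values with the grouping (a + (b + c)).
fn sum_right(xs: [f64; 3]) -> f64 {
    xs[0] + (xs[1] + xs[2])
}

/// True when the two groupings produce different bit patterns.
fn grouping_changes_bits(xs: [f64; 3]) -> bool {
    sum_left(xs).to_bits() != sum_right(xs).to_bits()
}
```

For the inputs 0.1, 0.2, 0.3 the two groupings disagree in the last bit. A fixed grouping, which is what a fixed batch shape gives you, always agrees with itself.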

The investigation

Before running anything, I mapped the layers that could be controlled (the inference math, the serialization path, the snapshot writer), and checked them from the bottom up ... a byte difference at the top is uninterpretable unless you've ruled out everything beneath it.

The bottom layers were uneventful. The quantized matmul is bit-identical across invocations; process startup injects no nondeterminism. Both were expected, both were confirmed; the first real failure was the snapshot writer.

The test:

run inference
snapshot the KV cache to safetensors
kill the process
restore
run again.

Two snapshots of the same logical state (roughly 1 MB each in this test), and the first byte difference was inside the JSON header that names the tensors: the diff was in the order of the tensor keys.

Safetensors stores tensors as a map from name to descriptor; the key order in that JSON is whatever order the serializer emitted. The map was backed by a Rust HashMap, where the iteration order is (by language design) unspecified and randomized per process.

So the easy fix was switching from HashMap to BTreeMap, which iterates in key order and is deterministic by construction. On re-run, both snapshots were bit-identical. The bug had survived because it never affected correctness: the map round-trips losslessly either way, and the right tensors load. With that fixed, snapshotting in one process and restoring in another came out clean too.
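The effect of the switch can be shown without the safetensors crate. A minimal sketch, assuming a simplified descriptor that holds only a byte range (the real header also carries dtype and shape, and header_json is a hypothetical helper, not Driftwood's serializer):

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified tensor descriptor: just a (start, end) byte range.
type Offsets = (usize, usize);

/// Emits a JSON-like header. BTreeMap iterates in sorted key order,
/// so the output depends only on the contents, not on insertion order.
fn header_json(tensors: &BTreeMap<String, Offsets>) -> String {
    let fields: Vec<String> = tensors
        .iter()
        .map(|(name, (start, end))| {
            format!("\"{name}\":{{\"data_offsets\":[{start},{end}]}}")
        })
        .collect();
    format!("{{{}}}", fields.join(","))
}
```

Two maps filled in opposite orders produce the same string here. The same code over a HashMap gives no such guarantee, because its iteration order is unspecified.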

Then, the test that includes guest memory: run a Wasm actor through a multi-turn conversation, snapshot it, kill the process, restore from disk, run the same next turn: is every byte identical to what continuous execution would have produced? Four of five properties matched. Linear memory failed, by 43 bytes: one byte just below the 1 MiB boundary, then 42 bytes just above it. The obvious read was that they belonged together.
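Locating a difference like this is mostly a matter of grouping mismatched offsets into runs. A minimal sketch, assuming both memory images are available as byte slices of equal length (diff_runs is a hypothetical helper, not part of Driftwood):

```rust
use std::ops::Range;

/// Groups the differing byte positions of two equal-length images into
/// contiguous runs, returned as half-open ranges.
fn diff_runs(a: &[u8], b: &[u8]) -> Vec<Range<usize>> {
    assert_eq!(a.len(), b.len(), "images must be the same size");
    let mut runs: Vec<Range<usize>> = Vec::new();
    for i in 0..a.len() {
        if a[i] == b[i] {
            continue;
        }
        // Extend the previous run when this byte is adjacent to it.
        if let Some(run) = runs.last_mut() {
            if run.end == i {
                run.end = i + 1;
                continue;
            }
        }
        runs.push(i..i + 1);
    }
    runs
}
```

Reporting runs rather than a single count makes it easier to see that a one-byte difference and a 42-byte difference may be separate regions, even when they sit next to each other.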

It turned out they didn't, and the single byte was the easier one. My first thought was "heap residue", but no: the compiled guest was ~7 KB and dlmalloc, even stripped, is 20–30 KB, so there is no heap. My second thought was "live stack residue", but no again: actors snapshot in Idle, between host calls, with no live frames, so there is nothing behind the stack pointer to leak. The residue wasn't from the snapshot moment, but from an earlier execution that had already unwound. I needed the stack as a region, not a pointer value.

My next instinct was to bracket the region with the guest's exported __data_end and __heap_base symbols ... surely, the stack lives between them? But no, it doesn't. That mental model (stack near the data, growing toward the heap) is an x86 convention. Wasm32 has no fixed stack location; the toolchain decides, and Rust's default for wasm32-unknown-unknown is --stack-first: the stack sits at the lowest addresses, below the data section. Those two symbols bracket the heap, which in this guest is empty.

Figure: The layout I assumed versus the one Rust's wasm32-unknown-unknown target actually uses. --stack-first puts the stack below the data section; the symbols I'd planned to bracket it with sit at the top, around an empty heap.

The correct region is [0, first_data_segment_offset), and its upper bound is not a runtime-readable global. The offset has to be parsed out of the wasm binary: read the data section, take the lowest segment start. A small parser does this once at instantiation; the snapshot writer zeros that region before serializing. The byte below the 1 MiB boundary disappeared, but the 42 above it remained.
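What such a parser has to do can be sketched by hand. This is an illustration under stated assumptions, not Driftwood's code: every function name is hypothetical, and it handles only active segments whose offset is a plain `i32.const` expression, giving up (None) on anything else. In practice a crate such as wasmparser could do the parsing instead.

```rust
/// Reads an unsigned LEB128 u32, advancing `pos`.
fn read_u32(bytes: &[u8], pos: &mut usize) -> Option<u32> {
    let (mut result, mut shift) = (0u32, 0u32);
    loop {
        let byte = *bytes.get(*pos)?;
        *pos += 1;
        if shift >= 32 { return None; }
        result |= ((byte & 0x7f) as u32) << shift;
        if byte & 0x80 == 0 { return Some(result); }
        shift += 7;
    }
}

/// Reads a signed LEB128 i32 (the immediate of `i32.const`), advancing `pos`.
fn read_i32(bytes: &[u8], pos: &mut usize) -> Option<i32> {
    let (mut result, mut shift) = (0i64, 0u32);
    loop {
        let byte = *bytes.get(*pos)?;
        *pos += 1;
        if shift >= 35 { return None; }
        result |= ((byte & 0x7f) as i64) << shift;
        shift += 7;
        if byte & 0x80 == 0 {
            if byte & 0x40 != 0 { result |= -1i64 << shift; } // sign-extend
            return Some(result as i32);
        }
    }
}

/// Returns the lowest start offset among active data segments, if any.
fn lowest_data_offset(wasm: &[u8]) -> Option<u32> {
    if wasm.len() < 8 || !wasm.starts_with(b"\0asm") { return None; }
    let mut pos = 8; // skip magic and version
    while pos < wasm.len() {
        let id = wasm[pos];
        pos += 1;
        let size = read_u32(wasm, &mut pos)? as usize;
        let end = pos.checked_add(size)?;
        if end > wasm.len() { return None; }
        if id == 11 { return scan_data_section(&wasm[pos..end]); } // data section
        pos = end;
    }
    None
}

fn scan_data_section(sec: &[u8]) -> Option<u32> {
    let mut pos = 0;
    let count = read_u32(sec, &mut pos)?;
    let mut lowest: Option<u32> = None;
    for _ in 0..count {
        let flag = read_u32(sec, &mut pos)?;
        if flag == 2 { read_u32(sec, &mut pos)?; } // explicit memory index
        if flag == 0 || flag == 2 {
            if *sec.get(pos)? != 0x41 { return None; } // expect i32.const
            pos += 1;
            let offset = read_i32(sec, &mut pos)? as u32;
            if *sec.get(pos)? != 0x0b { return None; } // expect end
            pos += 1;
            lowest = Some(lowest.map_or(offset, |l| l.min(offset)));
        } else if flag != 1 {
            return None; // flag 1 is a passive segment: no offset to record
        }
        let len = read_u32(sec, &mut pos)? as usize;
        pos = pos.checked_add(len)?;
        if pos > sec.len() { return None; }
    }
    lowest
}

/// Zeros [0, upper) in a memory image before it is serialized.
fn zero_stack_region(memory: &mut [u8], upper: u32) {
    let end = (upper as usize).min(memory.len());
    memory[..end].fill(0);
}
```

Doing the parse once at instantiation and caching the offset keeps the snapshot path to a single fill over the stack region.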

So, I printed them:

0x100039   "user:Two plus two is..."
           └─> this is the first user turn, verbatim!

They were the literal text of the first user turn, role prefix included, sitting in PREFILL_BUF: a static .bss buffer the guest uses during cold-restore replay to assemble each transcript turn before calling infer().

At this point ... I stopped trying to fix it any further.

The fresh-execution control never touched PREFILL_BUF; cold-restore replay is the only path that uses it. Zeros in the control snapshot (BSS init), content in the restored one (restore had just used it). The two snapshots differed because they took different code paths to what I'd been calling "the same logical state."

Not residue or scribble, and not a leaked frame: a buffer one path uses and the other doesn't ... which is the actual finding here 😄

The runtime can see which bytes changed. It cannot see whether they're allowed to.

A guest could write a BSS region only during restore and intend it as durable (e.g. a replay counter, or a generation marker) and the runtime would see exactly the same write. Scratch versus durable is defined by what the guest's code does with the region, and that is not recoverable from the memory image alone.

Two halves of one contract

Byte-equality at the actor level isn't one property the runtime owes, but two (let's call them "Tier 1" and "Tier 2") and they belong to different parties: Tier 1 is the runtime's to guarantee, Tier 2 only the guest can complete.

Figure: The contract, mapped. Tier 1 is what the runtime guarantees automatically; Tier 2 is what becomes possible when a cooperating guest does its half. The boundary between them is the line the investigation found.

Tier 1 is everything the runtime writes (KV cache, globals, snapshot metadata, the stack region, serialization order), byte-identical between a fresh session and a restored one that reach the same logical state, no cooperation required. After the BTreeMap fix and the stack-zeroing fix, this half was airtight. It sounds like the smaller claim, but it's the harder one: every float op reproducible, every serialization order canonical, every time-dependent operation absent or explicitly managed.

Tier 2 is the full address space round-tripping exactly: every byte of guest memory, not just observable behavior. Only the guest can deliver this, because whether two BSS regions are equal depends on how the guest used them, which is exactly the question the runtime can't answer. Consider narrower use cases: reproducible research, audit trails, attestation that hashes the full memory image ... all of them need the guest to cooperate.

A guest cooperates in one of three ways:

It can be written so end-of-turn memory is a pure function of the durable transcript, with no scratch surviving between turns.
It can zero its scratch explicitly. The demo guest does this: after cold-restore replay, zero_prefill_buf() wipes PREFILL_BUF back to BSS-init, and since the fresh path never touched the buffer, both paths end identical.
It can declare which regions are durable and which are scratch and let the runtime handle them: linker sections, memory classes, before-snapshot hooks. Driftwood doesn't support this third way (yet!).
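The second option is small enough to model directly. A minimal sketch, assuming a toy stand-in for the guest's memory: the GuestMemory struct, its fields, and the run_turn and replay_turn methods are hypothetical, and only the names PREFILL_BUF and zero_prefill_buf() come from the demo guest.

```rust
/// Hypothetical model of guest memory: one scratch buffer (standing in for
/// PREFILL_BUF) and one durable field.
#[derive(Clone, PartialEq, Debug)]
struct GuestMemory {
    prefill_buf: [u8; 64], // scratch, all zeros at BSS init
    turns_seen: u32,       // durable state
}

impl GuestMemory {
    fn new() -> Self {
        GuestMemory { prefill_buf: [0; 64], turns_seen: 0 }
    }

    /// Fresh path: the turn is consumed directly, no staging buffer involved.
    fn run_turn(&mut self, _turn: &str) {
        self.turns_seen += 1;
    }

    /// Cold-restore path: the transcript turn is staged in scratch first.
    fn replay_turn(&mut self, turn: &str) {
        let n = turn.len().min(self.prefill_buf.len());
        self.prefill_buf[..n].copy_from_slice(&turn.as_bytes()[..n]);
        self.turns_seen += 1;
    }

    /// Wipes the scratch buffer back to its BSS-init state.
    fn zero_prefill_buf(&mut self) {
        self.prefill_buf.fill(0);
    }
}
```

In this model, a fresh image and a replayed image differ right after replay and compare equal once zero_prefill_buf() has run. That is the same shape as the 42-byte difference and its fix.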

The bit-equality result proves the contract is implementable, and the contract itself is the actual discovery here: determinism in a system with split ownership has to be split along the ownership line. The runtime's job is to make its half airtight and not overpromise the other. The property exists, the contract is precise, and the runtime now ships with both halves of the split made legible to anyone using it.

Now, everything here was proven on the easy case: one machine, identical hardware on both sides. The hard case is cross-machine restore, and the difficulty isn't the different machine, it's the different GPU. Same Apple Silicon generation barely differs from cross-process; bit-identical floats from an M1 to an M4 is the real frontier. That's what I'll try next.