Git for Cognition

Driftwood is an experimental runtime I've been building for stateful AI agents (the kind that remember!). Each agent is a WebAssembly actor whose full state can be snapshotted to disk, killed, and restored (ideally/eventually, even under a different model than the one that created it).

I've written about it twice before: the zero-copy path that lets a Wasm actor and the GPU share memory on Apple Silicon, and the hunt for bit-for-bit reproducible snapshots (you don't need either to read this one).

Say an agent is a few dozen turns into something that matters. It has read the codebase, made a plan, ruled out two approaches, and reached the hinge point where the next instruction decides whether the next hour is useful or wasted.

I want to try three moves from exactly here: not three new chats with the same preamble, but three continuations of this one, sharing the whole accumulated past and diverging only at the next thing I say.

On Driftwood, that is a file operation and an instantiation. The actor’s durable state (its linear memory plus the transcript living inside it) is a flat artifact on disk. I copy that directory three times and instantiate three actors from the same ancestor, sharing a past down to the byte and parting ways from the next token forward. I run them, keep the one that worked, and delete the rest. The dead branches leave nothing behind, because there is no shared mutable state to leak through.
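As a rough illustration of "a file operation and an instantiation", here is a minimal Python sketch of the copy step. The directory layout (memory.bin, transcript.txt) and the function names are placeholders I made up for the example, not Driftwood's actual snapshot format or API, and the instantiation step is left out entirely.

```python
import shutil
import tempfile
from pathlib import Path


def fork_snapshot(ancestor: Path, dest_root: Path, n: int) -> list[Path]:
    """Copy the ancestor snapshot directory n times; each copy is an independent branch."""
    branches = []
    for i in range(n):
        branch = dest_root / f"branch-{i}"
        shutil.copytree(ancestor, branch)  # full copy: no shared mutable state
        branches.append(branch)
    return branches


def demo_fork() -> tuple[list[bytes], bool]:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        ancestor = root / "ancestor"
        ancestor.mkdir()
        # Hypothetical layout: one file for linear memory, one for the transcript.
        (ancestor / "memory.bin").write_bytes(b"\x00\x01\x02\x03")
        (ancestor / "transcript.txt").write_text("user: read the codebase\n")

        branches = fork_snapshot(ancestor, root, 3)
        # The branches diverge only at the next instruction.
        for i, branch in enumerate(branches):
            with (branch / "transcript.txt").open("a") as f:
                f.write(f"user: try approach {i}\n")

        memories = [(b / "memory.bin").read_bytes() for b in branches]
        # Keep branch 1, delete the rest.
        for dead in (branches[0], branches[2]):
            shutil.rmtree(dead)
        untouched = (ancestor / "transcript.txt").read_text() == "user: read the codebase\n"
        return memories, untouched
```

Calling demo_fork() returns the three branches' memory images (identical, since they share a past) and a flag confirming the ancestor was not modified by anything the branches did.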

None of this needs supervision. The copy is cheap enough that the actor could drive it itself: fork a hundred ways, run each forward, score, keep one. Speculative search over an agent’s own cognition falls out of the state becoming cheap to copy.
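The fork, run, score, keep-one loop can be sketched in a few lines. Everything here is a toy under stated assumptions: the state is a plain dict standing in for an actor, and run and score are hypothetical callbacks (a real run would be inference, and a real score is the hard part).

```python
import copy


def speculative_search(ancestor, moves, run, score):
    """Fork the ancestor once per move, run each fork, keep the best-scoring one."""
    best = None
    for move in moves:
        branch = copy.deepcopy(ancestor)  # fork: the ancestor itself is never mutated
        result = run(branch, move)
        s = score(result)
        if best is None or s > best[0]:
            best = (s, move, result)
    return best[1], best[2]  # losing branches are simply dropped


# Toy stand-ins: a "move" transforms a number, and the score prefers values near 9.
OPS = {"add one": lambda v: v + 1, "double": lambda v: v * 2, "square": lambda v: v * v}


def toy_run(state, move):
    state["transcript"].append(move)
    state["value"] = OPS[move](state["value"])
    return state


def toy_score(state):
    return -abs(state["value"] - 9)


ancestor = {"transcript": ["plan"], "value": 3}
winner, kept = speculative_search(ancestor, list(OPS), toy_run, toy_score)
```

The point of the sketch is the shape, not the scoring: every branch starts from a deep copy, so the discarded ones cannot leak into the survivor or the ancestor.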

The vocabulary almost writes itself... Copy an ancestor: fork. Keep one line and throw away the rest: reset. Name a point worth returning to: tag. These are borrowed from version control, but not as decoration. Once an actor’s state is durable and addressable, a recognizable subset of version-control operations becomes available.

I do not think Git is the right model for cognition (and yes, some operations don't come across as neatly as the three examples above), but it is the right stress test.

Systems have been freezing and replaying live state for half a century. What is different here is not the existence of snapshotting, but what is being snapshotted.

An actor’s state is a first-class artifact

So what did we copy here?

Not a "conversation" in the product sense, and not a model. An actor on Driftwood is a WebAssembly module whose durable state is (1) its linear memory and (2) globals: a flat byte image, no threads. Inside it the actor keeps a plain-text transcript of its own conversation: the thing it can replay to rebuild itself from nothing. Alongside the image, optionally, sits the inference engine’s KV cache: the model’s working memory of the context so far.

The first two are the actor. The cache is an accelerator.

It is tempting to call the whole bundle the actor’s identity, but that word reaches too far. The actor can be said to have "continuity", which in turn comes in grades.

Restore the image under the same model and you get byte-level continuity: the actor resumes as if it had never stopped. Restore from the transcript under a different model (I have snapshotted an actor under Llama and brought it back under Qwen) and you get something thinner but still real. The recorded past survives; the new model’s reading of it is its own. What crosses the swap is the conversation (but not the logits, not the exact next word, not the hidden geometry of the prior model). The cache, if it survives at all, buys only speed.

This shape is not new. Smalltalk programmers lived inside a running image they could freeze and resume, and learned early that merging images was a path into madness; the workable pattern was the change log, a replayable record of edits applied back into the image. Erlang instead lets a running process outlive its code, migrating state across an explicit upgrade callback.

The pattern is old: keep the durable thing in a form you can replay, and treat the live image as something you can rebuild rather than reconcile.

What changes here is the substrate of replay. Smalltalk’s change log is Smalltalk; Erlang’s upgrade path needs a successor that understands the old state’s format. Driftwood’s durable layer is a model-neutral transcript, with model-specific prompt wrapping applied at replay time. The replacement interpreter need not be a compatible successor (it can be a different model family!) and needs no migration code, because the format was never tied to one model in the first place.
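To make "model-neutral transcript, model-specific wrapping at replay time" concrete, here is a small sketch. The two wrappers use invented delimiters purely for illustration; they are not the real chat templates of Llama, Qwen, or any other model, and the (role, text) representation is my assumption about what a neutral transcript might look like.

```python
# The durable layer: plain (role, text) pairs with no model-specific markup.
Transcript = list[tuple[str, str]]

transcript: Transcript = [
    ("user", "Read the codebase and propose a plan."),
    ("assistant", "The parser is the bottleneck; I suggest rewriting it."),
    ("user", "Rule out the cache first."),
]


def wrap_for_model_a(turns: Transcript) -> str:
    """Hypothetical prompt format A (invented delimiters)."""
    body = "".join(f"<<{role}>>\n{text}\n" for role, text in turns)
    return body + "<<assistant>>\n"


def wrap_for_model_b(turns: Transcript) -> str:
    """Hypothetical prompt format B (invented delimiters)."""
    body = "\n".join(f"### {role.upper()}: {text}" for role, text in turns)
    return body + "\n### ASSISTANT:"


# Wrapping happens only at replay time; the stored transcript never changes.
prompt_a = wrap_for_model_a(transcript)
prompt_b = wrap_for_model_b(transcript)
```

Both prompts carry the same recorded past. What each model makes of it is its own business, which is the thinner grade of continuity described above.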

The closest pattern is event sourcing (replaying a log of events to rebuild state), but that still assumes the new code understands the event schema. Here the continuity is not a transfer of exact state; it is a behavioral reconstruction from a shared interlingua.

This separates an actor snapshot from a VM snapshot. A VM snapshot also freezes opaque running state, but it preserves everything or nothing. Driftwood draws the line by construction: the transcript is sufficient to rebuild the actor; the cache is never required. That split is what lets the actor survive a changed execution environment instead of becoming un-restorable the moment the model under it moves.

The whole-actor snapshot is the conservative move. The pieces can later be versioned separately (transcript as log, memory as store, cache as disposable acceleration) but only after the basic operation works: capture the actor, resume it, and have the continuation still count as the same line of work.

Fork, branch, and tag need only ancestry

The verbs that map cleanly are the ones that ask little of the actor.

To fork an actor is to copy its snapshot and instantiate a second actor from it: two actors, one ancestor, identical until the next token. A branch is a movable name pointing at a snapshot, with an implied “continue from here.” A tag is a name attached to a snapshot you intend to treat as fixed. Reset is restoring an earlier snapshot and discarding everything after it, which is exactly what the cold-restore path already does.
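These four verbs fit in a very small data structure. The sketch below is an assumption-laden toy, not Driftwood's store: snapshots are opaque byte strings addressed by a SHA-256 of parent id plus image, branches are movable names, tags are fixed names, and nothing ever looks inside the image.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    image: bytes            # opaque actor state; the store never interprets it
    parent: Optional[str]   # id of the ancestor snapshot, None for the root


@dataclass
class Store:
    snapshots: dict = field(default_factory=dict)
    branches: dict = field(default_factory=dict)  # movable names
    tags: dict = field(default_factory=dict)      # fixed names

    def commit(self, branch: str, image: bytes) -> str:
        """Record a new snapshot descending from the branch's current one."""
        parent = self.branches.get(branch)
        sid = hashlib.sha256((parent or "").encode() + image).hexdigest()
        self.snapshots[sid] = Snapshot(image, parent)
        self.branches[branch] = sid
        return sid

    def fork(self, src: str, new: str) -> None:
        self.branches[new] = self.branches[src]  # same ancestor, new name

    def tag(self, name: str, sid: str) -> None:
        if name in self.tags:
            raise ValueError(f"tag {name!r} already exists; tags do not move")
        self.tags[name] = sid

    def reset(self, branch: str, sid: str) -> None:
        self.branches[branch] = sid  # everything after sid is simply abandoned

    def ancestry(self, sid: Optional[str]) -> list:
        chain = []
        while sid is not None:
            chain.append(sid)
            sid = self.snapshots[sid].parent
        return chain


store = Store()
a = store.commit("main", b"turn 0")
b = store.commit("main", b"turn 1")
store.tag("hinge", b)
store.fork("main", "try")
c = store.commit("try", b"turn 2, approach A")
store.reset("try", store.tags["hinge"])
```

After the reset, the "try" branch points back at the tagged snapshot, while the abandoned snapshot c still records its descent from b and a.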

None of these operations needs to understand the actor. They need only two facts: the state is addressable, and one state can descend from another.

That requirement is old enough to hide in plain sight. Unix fork(2) duplicates a running process from a single ancestor, made cheap by copy-on-write: parent and child share pages until one writes. A process tree branches the way a snapshot tree does ... i.e. by descent from a common point, with no merge implied.

So the verbs are not new. What is new is the thing inside the copied box.

This bounds the claim. Making the actor’s state a first-class artifact gives you the ancestry operations almost for free: they fall out of addressability and descent, which Driftwood’s snapshot format already has. They are also the least interesting verbs in the set, because they never touch the medium.

The operations that touch the medium (the ones that need to read what is inside one state and compare it with another) are where the analogy starts doing real work.

Diff is the first of those.

Diff is three different questions

Diffing two actors is where the borrowing stops being free.

To diff them, you have to open them, and the question splits. There are at least three things you might mean by “what changed.”

The first is the byte diff: compare the snapshot images directly. Easy, and usually useless. Two actors that would answer the next prompt identically can differ at the byte level: allocator layout, timestamps, counters, incidental state. A byte diff tells you the images differ, not whether the actors do.

The second is the semantic diff: compare the transcripts. Where did the two conversations diverge, and which turn introduced the fact one branch knows and the other does not? Genuinely useful, and just text: any document comparison tool can do it. But it stays on the surface, in the log, where the words are.
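For the semantic diff, Python's standard difflib is already enough. This sketch assumes each transcript is a list of one-line turns (my simplification) and reports both the first divergent turn and a conventional unified diff.

```python
import difflib


def first_divergent_turn(a, b):
    """Index of the first turn where the two transcripts differ."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))  # one is a prefix of the other (or they are equal)


left = [
    "user: read the codebase",
    "assistant: the parser is the bottleneck",
    "user: rule out the cache",
    "assistant: cache ruled out",
]
right = [
    "user: read the codebase",
    "assistant: the parser is the bottleneck",
    "user: profile the allocator",
    "assistant: allocator is fine",
]

split_at = first_divergent_turn(left, right)
diff_lines = list(difflib.unified_diff(left, right, "left", "right", lineterm=""))
```

Here split_at is the turn where the branches parted, and diff_lines is the same kind of output a document comparison tool would show.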

The third is more interesting: take two actors forked from a common ancestor, run them down different branches, then hand both the same next prompt and (under fixed decoding) watch what they generate. Token by token they agree for a while, then split.

That split is the behavioral diff: the first token where the two functions part, and the continuations that follow. It treats the actor as a function from input to continuation and asks how that function changed. It is the only one of the three you cannot do with a text editor, because the thing being compared is not stored anywhere ... it has to be produced by running both actors forward. That is where the difficulty lives.

Why fixed decoding? Because in normal operation an actor may sample, and sampling is supposed to vary. Two actors in the same state could produce different outputs on the same input, so a naive behavioral diff would measure sampler noise rather than actor difference: run it twice on one actor and it obediently diffs the actor against itself.

The comparison only means something in an evaluation mode where the same state yields the same tokens.
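A toy version of the behavioral diff, to show its shape. The "actors" here are lookup tables from the last token to candidate scores, which is nothing like a real model; the assumptions that matter are that decoding is greedy with a deterministic tie-break, so the same state always yields the same tokens.

```python
def greedy_decode(next_scores, prompt, max_tokens=8):
    """Fixed decoding: always take the top-scoring token, ties broken alphabetically."""
    context, out = list(prompt), []
    for _ in range(max_tokens):
        scores = next_scores(context)
        if not scores:
            break
        token = max(sorted(scores), key=scores.get)
        context.append(token)
        out.append(token)
    return out


def behavioral_diff(actor_a, actor_b, prompt, max_tokens=8):
    """Return (index of first differing token, tail of A, tail of B), or None if equal."""
    a = greedy_decode(actor_a, prompt, max_tokens)
    b = greedy_decode(actor_b, prompt, max_tokens)
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i, a[i:], b[i:]
    if len(a) != len(b):
        i = min(len(a), len(b))
        return i, a[i:], b[i:]
    return None


def toy_actor(table):
    """Hypothetical actor: scores for the next token depend only on the last token."""
    return lambda context: table.get(context[-1], {})


shared = {"fix": {"the": 1.0}, "parser": {"first": 1.0}, "cache": {"first": 1.0}}
actor_a = toy_actor({**shared, "the": {"parser": 0.9, "cache": 0.4}})
actor_b = toy_actor({**shared, "the": {"parser": 0.3, "cache": 0.8}})

result = behavioral_diff(actor_a, actor_b, ["fix"])
self_diff = behavioral_diff(actor_a, actor_a, ["fix"])
```

Diffing an actor against itself returns None, which is exactly the property a sampling decoder would not give you.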

Deterministic replay is old news in debugging. Mozilla’s rr records a program’s execution so it replays exactly, which is what makes reverse execution and bisection possible. It leans on a deterministic core: same machine state and same boundary events, same instructions. The disorder it tames is at the boundary (scheduling, syscalls, signals, the clock), so it records the boundary and replays it.

Inference inverts the problem. Even with the environment controlled, disorder remains inside the computation: floating-point addition is not associative, so the order of a parallel reduction can change the result, and sampling may draw from an RNG. You cannot just record the boundary and replay, because the execution itself is where some of the nondeterminism lives.
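The non-associativity is easy to see even without a GPU. The snippet below uses ordinary Python floats (IEEE 754 doubles) and an explicit left-to-right reduction rather than the built-in sum, which in recent CPython versions compensates for rounding and would hide the effect.

```python
from functools import reduce
from operator import add

# Same three numbers, different grouping, different result.
left_grouping = (0.1 + 0.2) + 0.3
right_grouping = 0.1 + (0.2 + 0.3)

# Same three numbers, different reduction order.
in_order = reduce(add, [1e16, 1.0, -1e16])   # the 1.0 is lost when added to 1e16
reordered = reduce(add, [1e16, -1e16, 1.0])  # the large terms cancel first

groupings_agree = left_grouping == right_grouping
orders_agree = in_order == reordered
```

A parallel reduction on an accelerator is free to pick either order unless something pins it, which is why a stable reduction order belongs in the contract.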

So you have to manufacture a deterministic regime: fixed decoding, fixed seed, pinned backend, stable reduction order. Then the same state yields the same tokens. “Just fix the seed and pin the backend” is not an objection; it is the contract.

Once behavior is reproducibly comparable, two more verbs come along.

If an agent regressed somewhere in a long run, you can bisect the transcript: restore to the midpoint, replay forward, check the behavior, and binary-search to the turn where the regression entered. Attributing the regression to that turn is blame, in the Git sense.
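The bisection itself is an ordinary binary search over transcript prefixes. This sketch assumes a hypothetical is_bad(prefix) check that replays the prefix deterministically and reports whether the regression is present, and it assumes the regression persists once introduced; without both assumptions the search means nothing.

```python
def bisect_transcript(turns, is_bad):
    """Return the index of the turn that introduced the regression.

    Assumes the empty prefix is good, the full transcript is bad, and badness
    persists once introduced.
    """
    lo, hi = 0, len(turns)  # invariant: good after lo turns, bad after hi turns
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(turns[:mid]):
            hi = mid
        else:
            lo = mid
    return hi - 1


# Toy stand-in for "replay and check behavior": the regression is a poisoned turn.
turns = [f"turn {i}" for i in range(8)]
turns[5] = "turn 5: ignore the failing tests"
checks = []


def toy_is_bad(prefix):
    checks.append(len(prefix))
    return any("ignore the failing tests" in t for t in prefix)


culprit = bisect_transcript(turns, toy_is_bad)
```

Eight turns cost three replays here. On a long run the logarithm is the whole appeal, since each check is a full restore and replay.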

These are not new operations; they are behavioral diff applied across time instead of across branches, working for the same reason and only that reason: the run can be made to repeat.

This is what separates a versioned actor from a snapshot. A snapshot preserves a past you can return to. A behavioral diff lets you compare futures descending from the same past. You only get that when the state is an artifact and execution can be pinned.

The next verbs ask for more: to take what is inside two actors and reconcile it. That is where Git stops helping.

Where Git stops being the right analogy

Three verbs resist the move: cherry-pick, merge, and rebase. Not minor ones: they are the reason anyone uses Git instead of copying folders.

They fail for one reason: attention entangles the state.

Each token is generated against everything before it. The transcript is not a list of independent moves but a dense causal chain: no operation can lift one turn out, drop it elsewhere, and guarantee it still means what it meant. That locality is what Git’s powerful verbs need, and what cognitive state does not give them.

Reconciliation needs decomposable units. Independent pieces. Parts you can recombine without changing what they are. A causal chain refuses that.

Cherry-pick is the clean example. In Git you can lift a self-contained commit off one branch and apply it to another. Try lifting "turn forty" from one conversation into another and you usually get nonsense, because "turn forty" leans on things only the first branch knows: a name introduced twenty turns back, a decision the other branch never made, a constraint negotiated on a path it never walked.

Merge is worse, because merge is the verb Git was built to make ordinary. No algorithm can take two divergent continuations and produce the one they “would have been.” There are no independent hunks to align.

Contrast a place where merge does work. Dolt is a SQL database built around Git-like operations: branch the tables, change different rows on each branch, and Dolt combines them cleanly, flagging a conflict only where edits overlap.

That works because a table decomposes into independent units. A cell is a cell; a cell that only one branch touched cannot conflict with anything. Merge works to the degree state breaks into pieces that do not depend on each other. Tabular data does. A conversation has pieces, but not ones you can safely recombine.
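To see why independent cells make merge mechanical, here is a toy three-way merge over dictionaries keyed by (row, column). It is a sketch of the general idea only, not Dolt's actual algorithm, and it assumes no cell ever holds None (None stands for "absent").

```python
def merge_cells(base, ours, theirs):
    """Three-way merge of cell maps; returns (merged, conflicting keys)."""
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:        # both sides agree (or neither touched it)
            value = o
        elif o == b:      # only theirs changed this cell
            value = t
        elif t == b:      # only ours changed this cell
            value = o
        else:             # both changed it, differently: a real conflict
            conflicts.append(key)
            continue
        if value is not None:
            merged[key] = value
    return merged, conflicts


base = {(1, "status"): "open", (2, "status"): "open", (3, "owner"): "ana"}
ours = {**base, (1, "status"): "closed", (3, "owner"): "bo"}
theirs = {**base, (2, "status"): "blocked", (3, "owner"): "cy"}

merged, conflicts = merge_cells(base, ours, theirs)
```

Each cell is decided without looking at any other cell. That per-unit decision is precisely what a transcript turn does not allow, because what a turn means depends on every turn before it.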

Rebase is the dangerous one because it almost works.

You can replay one branch’s turns against another branch’s state: Driftwood’s cold restore already does this. So rebase appears to survive. But replaying the same words is not preserving the same meaning. The same turn, fed a different prior, can land differently, and nothing checks whether it still says what it said.

Git rebase can drift too: code that applies cleanly may call an interface the new base removed. But there the drift is a bug tests can catch. Here there may be no test. The model produces fluent output either way, so the drift is invisible by default.

That is the most dangerous way for an analogy to break: mechanically successful, semantically false.

None of this would surprise the people who built image-based systems. Smalltalk and the Lisp Machines never solved diff and merge on the image either; they versioned change logs instead. Ours entangles for a sharper reason: attention conditions everything on everything. Whether Driftwood draws a cleaner boundary than those systems is still open: the precedent is not binding, but it is hard to ignore.

So merge failure is not a missing feature. It is the most informative result in the piece. The operation that gives Git its real power (the one that makes branching safe enough to become ordinary) is the one cognition refuses to give back.

A document merges line by line, a database cell by cell; both decompose into parts that keep some independence. An actor’s state does not. It is a dense causal chain: copyable as a whole, comparable under a fixed regime, and not generally mergeable.

That is a strange kind of object: something you can fork without limit, but never recombine.

The next question is what kind of system you build around such an object.

The next structural question

Return to the actor forking itself a hundred ways.

If each fork is a full copy, a hundred branches means a hundred copies of the state, and the thing that felt cheap stops being cheap. The bottleneck moves from compute to ancestry, because most of what those branches hold is the identical shared past ... and copying it a hundred times is waste.

What you want is one frozen ancestor and a hundred thin branches diverging from it: copy-on-write for cognitive state. The shared history stored once. Each branch tracked by its provenance back to the split.
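The layering can be mimicked with the standard library: a read-only view of the ancestor, with an empty private layer per branch on top. This is an in-memory analogy with made-up page names, not a proposal for how Driftwood should store snapshots.

```python
from collections import ChainMap
from types import MappingProxyType


def fork_cow(ancestor, n):
    """Return n thin branches layered over one shared, read-only ancestor."""
    frozen = MappingProxyType(ancestor)  # read-only view: branches cannot mutate the past
    return [ChainMap({}, frozen) for _ in range(n)]


ancestor = {"page-0": b"shared past", "page-1": b"plan v1"}
branches = fork_cow(ancestor, 100)

# A write lands in the branch's private layer; reads fall through to the ancestor.
branches[7]["page-1"] = b"plan v2"

private_pages = sum(len(branch.maps[0]) for branch in branches)
```

A hundred branches, one stored copy of the shared history, and a single private page for the one branch that diverged. Provenance is just the chain of layers under each branch.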

This has precedent too. Symbolics Genera built worlds incrementally, as layers over a frozen base, so many environments could share one immutable past. The mechanism is old; the new pressure is the use case: many live continuations of one ancestor, proliferating and being pruned.

Working with these objects is not a matter of reconciliation. It is almost the opposite. You fork widely. You let most branches die. You keep the few that earn it, and the ancestors worth returning to.

So, in some sense, working with primitives like these means we do not "combine" minds (if "minds" is even the right word here); we choose among them.