
Your Agent's Tools Are Dead Data

Today's AI agents can call tools. They cannot evolve them.

An agent built on any of the major frameworks operates in a hard split. On one side, there's the model: a reasoning engine that processes language, maintains context, and makes decisions. On the other, there's a set of tool definitions, usually expressed as JSON schemas, that describe what the agent can do. These two sides communicate through a narrow protocol: the model emits a structured request, a runtime executes it, and a result comes back.

If you build agents, you've felt this split. You've written the scaffolding that bridges it: the prompt templates, the state machines, the retry loops that exist only because the model and the runtime don't share a language. This works. It works surprisingly well for a first generation of agentic systems. But there's a structural problem hiding in this architecture, and it becomes visible the moment you ask a simple question: can the agent reason about its own capabilities?

I'm not talking about just picking the right tool from a list, but rather: can it inspect the set of tools available to it and notice what's missing? Can it look at a sequence of past actions and recognize a recurring pattern? Can it abstract that pattern into a new, reusable capability? Can it modify its own action space based on experience?

In principle, nothing prevents this. Models can output valid JSON, generate new schemas, and reflect on execution logs ... but only by simulation and convention. In practice, these operations are non-native: they require the model to work around its representations rather than within them. The result is structurally fragile: every act of introspection or composition requires ad hoc scaffolding that breaks the moment the task changes shape.

This is a representational problem.

Where it breaks

Consider a scenario you might encounter while building an agent, say an agent that books travel. It has tools for searching flights, searching hotels, and sending confirmation emails. A user asks it to plan a trip: find a flight, find a hotel near the airport, and send a confirmation with both itineraries.

The agent does this successfully. It calls search_flights, calls search_hotels with the destination, calls send_email with a summary. Three tool calls, good result.

Now the same user asks it to do this for a team of eight people, each with different departure cities but the same destination and dates. The agent needs to do the same flight-hotel-email pattern eight times, varying only the origin city and the recipient.

A person would notice the repetition and abstract it: one procedure, applied once per traveler. The agent can only replay the three calls eight times, re-deriving the plan from prose each round. What it cannot do is share execution semantics with the runtime.

That's the core architectural flaw. The tool definitions describe behavior, but they don't participate in it: they can't be quoted, composed, passed as arguments, or evaluated. The model proposes structure; the runtime executes it; but these two activities happen in different ontologies, glued together by convention.

To be precise: the seam isn't in the model's experience of the data ... to a transformer, everything is token sequences. The seam is in the orchestrator's state machine, which segments the world into tool schemas, execution results, planning prompts, and error envelopes that have no structural relationship to each other.

The model bridges these segments by inference every time. That's the fragility.

In a system where action definitions, execution traces, and new capability definitions share a common representation, the agent could inspect what it just did, recognize the pattern, bind the varying parts as parameters, and define a reusable book-team-member-travel operation. Then it executes that operation eight times. If one fails (e.g. a hotel search returns no results near one airport) it can inspect the failure, see which step broke, and adapt.

The new composite capability is data the agent can reason about, just like everything else: it can be bound to a symbol, persisted to its environment, and invoked later without leaving its native notation.
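To make the contrast concrete, here is a minimal Python sketch of such an action space. This is not the proposed notation, and every name in it (`run`, `registry`, `book-member`) is invented for illustration: the point is only that definitions, invocations, and composites share one list-shaped representation.

```python
# Hypothetical action space: every expression is a plain list [op, *args],
# and composites are stored in the same registry as primitive tools.
registry = {
    "search-flights": lambda origin, dest: {"flight": f"{origin}->{dest}"},
    "search-hotels":  lambda dest: {"hotel": f"near {dest}"},
    "send-email":     lambda to, body: {"sent": to},
}

def run(expr, env):
    # Symbols resolve through the environment; unknown symbols are literals.
    if isinstance(expr, str):
        return env.get(expr, expr)
    op, *args = expr
    if op == "do":                       # sequence: evaluate in order
        result = None
        for sub in args:
            result = run(sub, env)
        return result
    if op == "define":                   # bind a new composite, stored as data
        name, params, body = args
        registry[name] = (params, body)
        return name
    target = registry[op]
    vals = [run(a, env) for a in args]
    if isinstance(target, tuple):        # composite: evaluate its stored body
        params, body = target
        return run(body, {**env, **dict(zip(params, vals))})
    return target(*vals)                 # primitive: a host function

# The agent abstracts its repeated flight-hotel-email pattern into a
# named capability, then applies it per team member:
run(["define", "book-member", ["origin", "email"],
     ["do",
      ["search-flights", "origin", "SFO"],
      ["search-hotels", "SFO"],
      ["send-email", "email", "itinerary"]]], {})

results = [run(["book-member", origin, email], {})
           for origin, email in [("NYC", "a@team.com"), ("BOS", "b@team.com")]]
# results == [{"sent": "a@team.com"}, {"sent": "b@team.com"}]
```

The composite lives in the registry as the same structure used to invoke it, so the runtime can inspect or re-evaluate it per traveler instead of hoping the model re-derives the plan each round.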

In the current architecture, none of this is impossible. But notice what it actually requires. In practice you solve this with a loop node, scratchpad state, and prompt templates ... maybe a LangGraph subgraph, maybe a planner that re-derives the three-step sequence for each person. Or perhaps you give the agent a Python REPL tool so it can write a for loop on the fly ... but as we'll see, handing the model a raw execution environment is a surrender of system governance, not an architectural solution.

Either way, the "composite tool" exists only as latent plan text, not as a first-class object the runtime can validate or execute. It's like writing reusable functions as comments and hoping the compiler re-derives them each run. If the hotel search fails, the error comes back as an envelope the runtime can't relate back to the action term in any enforceable way ... so the model (once again, that's all it can do) bridges it by inference.

It works. It's also not a foundation you can build endogenous capability growth on.

This is why agents plateau. You can scale context and add tools, but you don't get cumulative capability. Without a native "action algebra", experience doesn't crystallize into new affordances; it evaporates back into prompt text. Change one tool's return shape, and your latent "macro" shatters across prompts, state, and retries.

The separation

This is a familiar problem in the history of programming languages. It's the difference between a language where code is a special, opaque category and a language where code is data: where programs can inspect, generate, and transform other programs using the same constructs they use for everything else.

The property has a name: homoiconicity. And its absence in agent architectures is, I think, the structural bottleneck that better prompting and bigger context windows won't resolve.

The model thinks in token sequences. Its tools are described in JSON. Its execution history is a mix of structured logs and serialized return values. Its planning (if it plans at all) happens in yet another format (perhaps ... natural language packed into a system prompt!). None of these can interoperate natively, and the tool-calling contract doesn't give you compositional operators. There's no way to say "this tool, but with retry logic" or "these three tools, sequenced, with the output of the first feeding the second" in a form the runtime can directly evaluate. You can describe these things in natural language and hope the model re-derives the right sequence of calls. But description is not composition.
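As a sketch of what such operators could look like, here is a hypothetical mini-interpreter (the `retry` and `pipe` forms and the `evaluate` function are assumptions, not any existing API) in which "this tool, but with retry" and "these tools, sequenced" are data the runtime evaluates directly:

```python
# Hypothetical operators: "retry" and "pipe" are expression forms the
# runtime evaluates directly, not behaviors described in prose.
attempts = {"n": 0}

def flaky_fetch(url):
    # Simulated tool that fails twice before succeeding.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("timeout")
    return f"data from {url}"

tools = {"fetch": flaky_fetch, "summarize": lambda text: text.upper()}

def evaluate(expr):
    op, *args = expr
    if op == "retry":                  # ["retry", n, sub-expr]
        n, sub = args
        for i in range(n):
            try:
                return evaluate(sub)
            except RuntimeError:
                if i == n - 1:
                    raise
    elif op == "pipe":                 # each step consumes the previous output
        value = evaluate(args[0])
        for step in args[1:]:
            value = tools[step](value)
        return value
    else:
        return tools[op](*args)        # a plain tool call

result = evaluate(
    ["pipe", ["retry", 3, ["fetch", "http://example.com"]], "summarize"])
# result == "DATA FROM HTTP://EXAMPLE.COM"
```

Because the composition is an expression rather than a description, the runtime can evaluate it without the model re-deriving the call sequence, and the same expression can be inspected, stored, or wrapped again.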

What's actually missing

Zooming out from the travel example, there are four capabilities that the representation gap makes structurally fragile:

Introspection. "What tools do I have? What did I just do? What has worked before?" This requires the agent to access its own action definitions and execution history as structured, queryable data, not as a prose summary crammed into a context window, but as something with the same representational status as the rest of its reasoning.

Composition. "I keep doing A, then B, then C in sequence. Let me make that a single operation." This requires combining existing action definitions into new ones, programmatically, within the same notation that defines individual actions.

Abstraction. "This pattern of API call, error check, retry with backoff appears in half my tasks. Let me name it." This requires recognizing structural patterns in behavior and elevating them into reusable definitions: an agent, in effect, writing its own tools.

Self-modification. "Given what I've learned about this environment, I should change how I approach this class of problems." This requires rewriting the action space itself: not selecting from a fixed menu, but altering the menu.

None of these are exotic capabilities. Any competent professional does all four constantly. Current architectures can approximate each of them through increasingly elaborate scaffolding: retrieval-augmented tool selection for introspection, prompt chaining for composition, code generation for abstraction. But the scaffolding is always external to the action representation itself, which means it's brittle, task-specific, and invisible to the agent's own reasoning. The agent doesn't know it's being scaffolded. It can't improve the scaffolding. It can't even see it.

To see the representational difference concretely, consider how a tool is defined today versus how it might look in a homoiconic notation. A simplified "send email" tool in a typical function-calling schema:

{
  "name": "send_email",
  "description": "Send an email to a recipient",
  "parameters": {
    "type": "object",
    "properties": {
      "to":      { "type": "string" },
      "subject": { "type": "string" },
      "body":    { "type": "string" }
    },
    "required": ["to", "subject", "body"]
  }
}

And the same tool as a symbolic expression:

(def send-email
  (action [to :- Str, subject :- Str, body :- Str]
    (effect :email {:to to :subject subject :body body})))

The difference isn't cosmetic. The JSON version is inert metadata — a description of a tool that is not treated as an executable term in the agent/runtime contract. It buries three lines of intent inside thirteen lines of structural boilerplate, and every nested brace is a hallucination opportunity for a token predictor. The symbolic version is the tool: an expression in the same notation the agent uses for everything else, where the AST is the syntax. An agent could quote it, inspect its parameter structure, wrap it in retry logic, or compose it with other actions to build a new capability, all using the same operations it uses to reason about anything.
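A rough illustration in Python of why this matters: hold the symbolic definition above as plain nested lists, and inspection and wrapping become ordinary list operations. The `action` and `retry` shapes here are hypothetical, invented to mirror the pseudocode.

```python
# Hypothetical representation of the symbolic definition as nested lists.
send_email = ["action",
              [["to", "Str"], ["subject", "Str"], ["body", "Str"]],
              ["effect", "email"]]

# Introspection is a list operation, not schema parsing:
params = [name for name, _ in send_email[1]]
# params == ["to", "subject", "body"]

# Composition preserves shape: the wrapped tool is still inspectable.
reliable_send = ["retry", 3, send_email]
```

No string parsing, no schema round-trip: the wrapped tool is the original expression, still visible at `reliable_send[2]`.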

A representational answer

If the diagnosis is right ... i.e. if the bottleneck is representational rather than algorithmic ... then the response should be representational too. Not another framework. Not a better tool-selection algorithm or a more sophisticated planning module. A notation: a representational system in which tool definitions, execution traces, error reports, and self-modification instructions are all first-class expressions in the same language.

What would such a notation need? Surprisingly little. The structural substrate is just atoms, lists, and maps. On top of that, a small core of irreducible primitives ... quoting, evaluation, binding, abstraction ... and the rest is library. Critically, these aren't free-floating: they operate within a typed environment where expressions carry constraints, provenance, and capability gates. The point isn't eval in the wild. It's eval within a structurally typed, capability-governed algebra.

Everything else (tool invocation, schema validation, trace access, capability creation) can be built from these primitives rather than added as separate machinery. The key property is what you might call role-uniformity: a tool definition, a plan, a result, and an error should all have the same shape. They should all be expressions in the same notation, manipulable with the same operations. An agent shouldn't need to context-switch between "this is a description of what I can do" and "this is a record of what I did" and "this is an instruction for what to do next." They should be the same kind of thing.
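Role-uniformity can be sketched in a few lines. If a tool, a plan, a result, and an error are all tagged lists (hypothetical shapes, purely for illustration), one structural query works across every role:

```python
# Hypothetical tagged-list shapes: every role is [tag, ...] in one notation.
tool   = ["tool",   "send-email", ["to", "subject", "body"]]
plan   = ["plan",   ["send-email", "a@x.com", "hi", "see itinerary"]]
result = ["result", "send-email", {"sent": "a@x.com"}]
error  = ["error",  "send-email", "smtp timeout"]

def mentions(expr, name):
    """Structural search: the same walk works on any role."""
    if expr == name:
        return True
    if isinstance(expr, list):
        return any(mentions(sub, name) for sub in expr)
    return False

related = [e[0] for e in (tool, plan, result, error)
           if mentions(e, "send-email")]
# related == ["tool", "plan", "result", "error"]
```

The same traversal relates a definition to its uses, its outcomes, and its failures, which is exactly what the current patchwork of schemas, prompts, and logs can't offer.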

If this sounds like Lisp, that's not an accident. The Lisp family's core properties ... homoiconicity, minimal syntax, code-as-data ... were designed for exactly this class of problem: systems that need to represent, inspect, and transform their own behavior. The application is new, but the insight is sixty-seven years old.

What homoiconicity doesn't solve

I should be honest about the limits of this argument, because the history of self-modifying systems is not reassuring.

Even in Lisp, code-as-data doesn't automatically yield good abstraction. Metaprogramming often becomes chaos. Self-modifying systems are notoriously difficult to debug, reason about, or make safe. Making it possible for an agent to rewrite its own action space is very different from making it wise.

A homoiconic notation is a necessary substrate, but it's not sufficient. The harder problems are downstream: stability of self-modification (how do you prevent an agent from rewriting itself into incoherence?), credit assignment over evolving action spaces (how does an agent know which of its self-created abstractions are actually useful?), governance of learned capabilities (who decides what an agent is allowed to teach itself?), and runtime safety guarantees (what happens when a self-composed tool does something its constituent parts never would have?).

These are real problems, and a notation doesn't solve them. What a notation does is make them possible to address within a single system rather than across a patchwork of incompatible representations. You can't build governance over action evolution if actions aren't represented in a format that supports inspection and constraint. The notation is the floor, not the ceiling.

The rebuttals I expect

There are three objections smart readers will raise, and they're worth addressing head-on.

The first is that code generation already solves this: an agent can write Python, register new tools, and update its own tool registry. This is the strongest existing counterexample, and it deserves a real answer. Systems like Voyager (Wang et al., 2023) demonstrate exactly this pattern: an LLM agent generating JavaScript skill functions, storing them in a vector database, retrieving and composing them into increasingly complex behaviors. It works, and it works well for capability accumulation in bounded environments.

But there's a structural difference between generating code strings and manipulating an action algebra. When an agent writes a Python function that calls three other functions, that composition is opaque text until evaluated. The orchestrator can't safely inspect the sub-components, enforce constraints on what the composed tool is allowed to do, or step through it without parsing the string back into an AST ... and even then, the AST is Python's, not the agent's own action representation. You get a growing library but not a governed one.

In a homoiconic system, the agent isn't writing raw text to disk; it's manipulating bounded, typed expressions that the runtime can inherently validate, restrict, decompose, and step through. The difference is between "the agent wrote some code that seems to work" and "the agent composed a new capability whose structure and constraints are visible to the entire system."
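A toy example of the governance this enables, assuming a hypothetical expression shape in which side effects appear as `["effect", kind]` nodes: because a composed capability is a tree, the runtime can enumerate its effects and refuse it before anything executes.

```python
def effects(expr):
    """Enumerate every effect kind in an action-expression tree."""
    found = []
    if isinstance(expr, list):
        if expr and expr[0] == "effect":
            found.append(expr[1])
        for sub in expr:
            found.extend(effects(sub))
    return found

# An agent-composed capability: an email send plus a retried filesystem write.
composed = ["do",
            ["action", [], ["effect", "email"]],
            ["retry", 3, ["action", [], ["effect", "filesystem"]]]]

ALLOWED = {"email", "http"}                 # capability gate
violations = [e for e in effects(composed) if e not in ALLOWED]
# violations == ["filesystem"]: the runtime can refuse before evaluating
```

Doing the same over a generated Python string would require parsing it back into Python's AST and reasoning about an open-ended host language; here the check is a ten-line walk over the agent's own action representation.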

The second is that introspection and composition belong in the orchestrator, not the model. This is a valid design choice, especially for safety. But notice what it concedes: it's an explicit decision to keep the agent exogenous, to have its action space managed for it rather than by it. That may be wise in many settings. But it's also an admission that the agent itself has no endogenous relationship to its own capabilities. If you're comfortable with that, the current architecture is fine. If you want something more, you need a different representation.

The third is that homoiconicity isn't the bottleneck: evaluation and verification are. This is the strongest objection, and it's largely correct: having a unified notation means nothing if you can't verify that self-composed tools are safe, correct, and bounded. But verification becomes coherent only when the objects being verified ... plans, tools, traces, errors ... live in a single representation with shared structure. You can't write verification over a patchwork of JSON schemas, prompt text, and serialized logs. The notation doesn't replace verification. It makes verification expressible.

Two important caveats

The argument has two structural objections worth addressing directly.

The first is the interpreter gap. An LLM is a probabilistic token predictor, not an evaluator. Even if its context window is filled with elegant S-expressions, the model cannot execute them. It still requires an external runtime ... a REPL, an interpreter, some execution environment ... to actually evaluate code. This is true: the claim here is not that homoiconicity turns the LLM into an interpreter. Rather, a homoiconic representation allows the model and the external runtime to share the exact same language. The model generates expressions; the runtime evaluates them; the results come back as more expressions in the same notation. The model doesn't need to execute anything. It needs to read, write, and reason about execution artifacts in the same representational system it uses for everything else. That's the property current architectures lack.

The second is the translation boundary. The world's APIs speak JSON. Databases expect SQL. HTTP is text over a socket. No matter how elegant the internal notation, an agent eventually has to reach across the boundary and touch something that doesn't speak its language. This is also true, and it's actually a clarifying constraint rather than a defeater. Every real system has boundaries where representations change ... the question is where you draw those boundaries and how much of the agent's internal reasoning can happen in a unified format before crossing them. In the current architecture, the boundary is everywhere: the model crosses a representational boundary every time it formulates an action, reads a result, or looks at its own history. The argument isn't that the boundary disappears. It's that it should be pushed to the edge (to the point of contact with external systems) rather than running through the middle of the agent's own cognitive loop.

Four claims

There are really four nested claims here, in increasing order of ambition.

The narrow claim is that JSON function-calling schemas are a structurally hostile substrate for agent self-reasoning. Not inexpressive ... after all, JSON can represent any AST, and people have built impressive things on top of it. But when your "compiler" is a probabilistic token predictor, structural density matters: the ratio of intent to syntax, the token overhead of representing composition in a format designed for data serialization. JSON buries structural intent in boilerplate. S-expressions are their own AST, which means higher intent density per token. Today's models may actually track JSON braces more reliably than they balance Lisp parentheses, simply because their pre-training diet is overwhelmingly JSON. But that's a data artifact, not an architectural truth. A token predictor natively benefits from syntactic regularity.

Engineers will rightly point out that grammar-constrained decoding (e.g. Outlines, XGrammar, OpenAI's Structured Outputs) has largely solved the "missing bracket" problem by masking logits at the inference level. But look at the cost: constrained decoding solves structural fragility by calcifying the schema. The orchestrator's safety net mathematically prohibits the model from generating keys outside the pre-registered grammar. You get perfect JSON, but you guarantee the action space remains strictly exogenous. The very mechanism that makes JSON reliable makes it incapable of evolving. I expect the field to run into this tension on its own.

The medium claim is that agent action spaces should be homoiconic ... that the representation of actions should be the same kind of thing as the actions themselves. This is a stronger architectural position with real design consequences, and it's the one I'm most interested in testing.

The broad claim is that the Lisp genotype (not any particular dialect, but the family of properties including homoiconicity, minimal syntax, and code-as-data) represents something like the natural substrate for machine cognition.

Yes, yes, yes ... this is frankly speculative. But the alignment is hard to ignore: a token predictor benefits from syntactic regularity (fewer structural tokens to get wrong), high intent density (more semantic content per generation step), and uniform structure (the same parse strategy works on every expression). These are precisely the properties the Lisp genotype optimizes for. Every property that made Lisp powerful for human metaprogramming becomes more valuable when the programmer is a language model.

The fourth claim is the one I think matters most in the long run: that what we're really talking about is the difference between exogenous and endogenous agency. Current agents have externally defined control surfaces: their action space is configured for them, not by them. A fully closed cognitive loop would require the agent's action space to be native to its reasoning substrate, not imported from outside it. That's a claim about architecture, not notation, and it points toward something more like cybernetics than programming language theory. But you need the notation first. More on this soon.