
Agents Need Environments, Not Just Memories


This is the third in a series. The first, "Your Agent's Tools Are Dead Data," diagnosed a representational problem in agent architectures. The second, "What It Means for an Agent to Think About Its Own Actions," argued that agent action spaces should be homoiconic.


I initially felt that the claim that agent action spaces should use an executable, self-inspectable representation (rather than inert JSON schemas) was an isolated viewpoint ... but I no longer think that.

Pel, a language out of CMU published in mid-2025, uses Lisp-inspired s-expressions with a minimal grammar designed for constrained LLM generation; its argument for homoiconicity tracks mine almost exactly. CodeAct demonstrated empirically that executable Python outperforms JSON for tool calls across nearly every benchmark. And Anthropic's Model Context Protocol has become the de facto standard for agent-tool integration (adopted by OpenAI, Google, AWS). The wire-protocol question is largely settled.

However, none of these projects fully addresses the thing I was focusing on. They've validated the surface of the argument: (1) executable representations beat inert schemas, (2) s-expressions are a reasonable substrate for constrained generation, (3) the notation layer is real. Several groups arrived there independently. But what remains unaddressed is the layer underneath. And that's the part I was thinking about.


The "stateless-competence" problem

Here is the thing that current agent architectures take for granted: every invocation starts from scratch.

An agent receives a task. It inspects its available tools: defined by a developer, loaded from a config file, registered through a protocol. It reasons, selects, executes, returns. The cycle ends. The next invocation loads the same tool definitions, the same capabilities, the same starting conditions. Whatever the agent learned about the shape of the problem (which combinations worked, what intermediate structures proved useful) is gone.

Yes, there are memory systems. RAG retrieves relevant past episodes. Vector databases store embeddings. Chat histories persist across turns. But these are recollection mechanisms. The agent remembers that something happened. It does not carry forward the "competence" that resulted.

Consider a software developer. After debugging a complex integration issue, they don't just remember that they fixed it. They've developed a new intuition about how two systems interact, and they've written a utility function, a test fixture, or a runbook that makes the next encounter cheaper. The artifact of their past work has become part of their working environment. It's not stored in a filing cabinet of memories. It's on the workbench, available for use and further modification.

Current agent systems have the filing cabinet. None of them treat the workbench as the primary architectural layer. That's the claim here, anyway, and I'll try to be precise about what I mean.

What "environmental" means

The natural misreading of this thesis is "persistent memory," and that misreading collapses it into something much less interesting. So I want to be precise.

When I say an agent's competence should become environmental, I mean the structures an agent produces in the course of solving problems (composed tool definitions, intermediate abstractions, error-recovery procedures, task decompositions that proved useful) should remain "live", re-composable, and causally active in the space where the agent's future actions occur. Not logged. Not embedded in a vector. Not summarized into a paragraph for a future prompt. Live.

An environment is not a database you query; it's a space you inhabit. Its contents are not retrieved on request; they shape what actions are available, what compositions are natural, what costs are low. When a programmer opens a project with a well-factored codebase, they don't need to "remember" the utility functions: the functions are there, in scope, discoverable, callable. The codebase is an environment. The programmer's prior decisions have causal force without requiring retrieval.

Pel generates s-expressions, executes them, and discards them. CodeAct emits Python, runs it, returns results. MCP routes tool calls through a standardized protocol. All of them treat each invocation as a fresh evaluation in a static environment defined by the developer at design time. The agent's competence at time t has no structural effect on the environment at time t+1.
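The structural difference can be sketched in a few lines. This is a toy model, not any of those systems: `solve` is a stand-in for an agent run that returns its result plus any procedures it composed along the way, and all names are hypothetical.

```python
# A minimal sketch of "fresh evaluation in a static environment" versus an
# environment that accumulates the agent's own compositions.

def solve(task, env):
    # Pretend the agent composes a new operator while solving the task.
    def double_check(x):
        return env["check"](x) and env["check"](x)
    return env["check"](task), {"double_check": double_check}

def stateless_run(task, base_tools):
    env = dict(base_tools)              # fresh copy every invocation
    result, authored = solve(task, env)
    return result                       # authored operators are discarded

def environmental_run(task, env):
    result, authored = solve(task, env)
    env.update(authored)                # authored operators remain live
    return result

tools = {"check": lambda x: x > 0}
env = dict(tools)
environmental_run(1, env)
assert "double_check" in env            # competence persisted
assert "double_check" not in tools      # the stateless baseline never grows
```

The only difference between the two loops is one line, but it is the line that determines whether competence at time t has any structural effect at time t+1.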

Why storage isn't enough

The obvious reaction is: "So persist things. Store the agent's intermediate artifacts in a database and load them back next time." But what you persist, and in what form, determines everything.

Persist natural-language summaries and the agent can recall that it once solved a problem but can't reuse the solution as a callable operator. It has to reconstruct the approach from a description.

Persist structured artifacts as JSON and the agent can load them but can't manipulate their internal structure with the same operations it uses for everything else. The stored artifacts are inert relative to the agent's reasoning machinery: exactly the representation boundary I diagnosed in the first post, just at a different layer.

Persist executable expressions in the same notation the agent uses for action, planning, and introspection: and something different happens. The difference between "a tool the developer defined" and "a procedure the agent authored last Tuesday" vanishes. They're the same kind of expression, subject to the same operations, available through the same inspection mechanisms. The agent's action space has grown, and it grew from the inside.
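The three persistence forms can be set side by side. A sketch with hypothetical contents; the point is what each form lets the agent do on the next invocation.

```python
# 1. Natural-language summary: recallable, not callable. The agent must
#    reconstruct the approach from a description.
summary = "Checked each URL and reported any non-200 status."

# 2. JSON-style structure: loadable, but inert relative to the agent's
#    reasoning machinery; its internals need a separate interpreter.
record = {"tool": "fetch-url", "condition": "status != 200", "action": "report"}

# 3. Executable expression in the agent's own notation: a live operator,
#    directly applicable and recomposable.
def check_urls(urls, fetch):
    return [u for u in urls if fetch(u) != 200]

# Only the third form can be applied without reconstruction.
fetch = lambda u: 200 if u == "https://ok.example" else 503
failures = check_urls(["https://ok.example", "https://down.example"], fetch)
assert failures == ["https://down.example"]
```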

This is where homoiconicity stops being a programming-language preference and becomes an architectural requirement (which is what I was circling around in the first two posts, but didn't explain well enough). The representation must be simultaneously the action format, the storage format, and the inspection format: not because that's elegant, but because any gap between these roles reintroduces the translation boundaries that prevent accumulated work from remaining causally active.

(I want to flag the obvious: this is a claim about a structural property of the architecture, not an argument for any particular language. Lisp is the canonical example of a homoiconic system, but the requirement is representational uniformity, and there are multiple ways to achieve it.)

An obvious objection: none of this matters if the LLM can’t perceive the environment. A transformer is stateless. It has no persistent connection to a live runtime. Whatever environment you construct, the model sees only what fits in its context window (!): serialized into a one-dimensional token stream. If the accumulated workbench grows beyond context length, it must be searched, filtered, retrieved. So ... doesn’t that collapse back into RAG?

I'd say no, because the serialization bottleneck is orthogonal to the representation problem. Yes, you will always need retrieval as the environment grows. That constraint is real and permanent. But what you retrieve determines everything.

Retrieving a callable procedure that the agent can execute, inspect, decompose, and recompose is fundamentally different from retrieving a natural-language summary of what the agent did last time.

The context window is a bandwidth constraint; representational uniformity determines what that bandwidth buys you. A system that retrieves live, composable operators is not the same as a system that retrieves descriptions of past work, even if both systems use vector search to decide what to load.

A concrete model

Imagine an agent tasked with monitoring web services for downtime. On day one, it has a basic tool: fetch-url. The agent composes a procedure: fetch each URL, check status codes, report failures. That procedure is itself an expression: a composition of existing tools with conditional logic.
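That composed procedure can be sketched concretely. Here nested Python lists stand in for the s-expression notation, with a tiny evaluator; `fetch-url` is mocked and every name is hypothetical.

```python
def fetch_url(url):
    # Stand-in for the day-one tool; returns an HTTP-style status code.
    return 200 if "ok" in url else 503

TOOLS = {"fetch-url": fetch_url, "!=": lambda a, b: a != b}

def evaluate(expr, env):
    # Tiny evaluator: a list is an application, anything else is a literal.
    if isinstance(expr, list):
        op, *args = expr
        return env[op](*[evaluate(a, env) for a in args])
    return expr

def monitor(urls, env=TOOLS):
    # The per-URL check is itself data: the agent can store it, inspect it,
    # and later edit it with ordinary list operations.
    return [u for u in urls if evaluate(["!=", ["fetch-url", u], 200], env)]

assert monitor(["http://ok.example", "http://down.example"]) == ["http://down.example"]
```

The iteration lives in host code here for brevity; the essential property is that the health check is an expression the agent can take apart, not an opaque callable.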

In a conventional architecture, the procedure lives in the orchestration layer or in a prompt. Tomorrow, the same task arrives and the procedure is either re-derived from scratch or hardcoded by a developer who reviewed the agent's output. The agent's competence has been extracted, not preserved.

In an environmental architecture, the composed procedure persists as an expression in the agent's environment. Tomorrow it's there, not as a memory to be recalled but as an operator to be applied. The agent can also inspect it, notice it doesn't handle timeouts, and compose a refined version. The refined version sits alongside the original, and the agent can reason about the relationship between them, because they're both expressions in the same notation it uses for everything else.

Now extend one step further. The agent encounters a different task: monitoring database response times. It inspects its environment and finds the monitoring procedure. The structure is relevant: iterate over targets, check a health signal, report failures. The specific tools are different, but the procedural skeleton is reusable. The agent composes a new procedure by analogy, substituting components while preserving structure, using the same expression-manipulation operations it uses for everything else.
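Composition by analogy is just structural substitution over expressions. A sketch, with hypothetical expression forms: keep the skeleton, swap the components.

```python
# The stored monitoring procedure, as data: probe each target, test a
# health signal, collect failures.
url_monitor = ["monitor", {"probe": "fetch-url", "healthy": ["=", "status", 200]}]

def substitute(expr, mapping):
    # Recursive substitution over nested expressions, using the same list
    # and dict operations the agent uses for everything else.
    if isinstance(expr, list):
        return [substitute(e, mapping) for e in expr]
    if isinstance(expr, dict):
        return {k: substitute(v, mapping) for k, v in expr.items()}
    return mapping.get(expr, expr)

# New task: database response times. Same skeleton, different components.
db_monitor = substitute(url_monitor,
                        {"fetch-url": "query-latency", "status": "latency-ms",
                         "=": "<", 200: 250})

assert db_monitor == ["monitor", {"probe": "query-latency",
                                  "healthy": ["<", "latency-ms", 250]}]
```

Because the procedure is data, "compose by analogy" is an operation the agent can perform mechanically, rather than a feat of recollection.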

This is competence compression: prior work compressed into a reusable form that reduces future search. Not through fine-tuning. Not through prompt engineering. Through the accumulation of live, inspectable, composable structure in a persistent environment (yes, I'm aware nothing ships this today ... it's an architectural pattern that doesn't exist yet).

Three layers, not one

Current thinking about agent memory tends to conflate distinct concerns. There are at least three layers, and the failure to distinguish them is a source of real confusion.

The first is episodic memory: what happened. RAG, chat history, interaction logs. It answers "what did I do before?" Systems like MemGPT operate here.

The second is semantic memory: what's true. Knowledge graphs, ontologies, extracted facts. It answers "what do I know?"

The third is what I'm calling environmental memory: what I can do. Accumulated capability, live procedures, agent-authored operators that remain available for future use. It answers "what competences have I developed?" This layer is almost entirely absent from current architectures!

The environmental layer is distinct from the others in a critical way: it's not about information. It's about affordance: possibilities for action that exist in the environment as handles, operators, structures that enable something not previously available. When an agent authors a procedure and that procedure remains live, it has created a new affordance. The action space has expanded. The agent is more capable not because it knows more, but because it can do more.

Storage preserves information. Environment-alization preserves affordance. The distinction is between remembering that you once built a bridge and having the bridge still be there when you need to cross the river again.
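The three layers can be contrasted in one hypothetical agent state. The structure is invented for illustration; what matters is the type of each layer's contents.

```python
agent_state = {
    # Episodic: what happened. Records, retrievable but not executable.
    "episodes": ["2025-06-01: debugged the billing webhook timeout"],
    # Semantic: what's true. Facts, also inert.
    "facts": {"billing-webhook-timeout-s": 30},
    # Environmental: what the agent can do. Live, callable operators.
    "operators": {"with-retry": lambda f, n=3: next(
        filter(lambda r: r is not None, (f() for _ in range(n))), None)},
}

# The first two layers inform; only the third expands the action space.
flaky_calls = iter([None, None, "ok"])
result = agent_state["operators"]["with-retry"](lambda: next(flaky_calls))
assert result == "ok"
```

Querying the first two layers yields information; applying the third yields work done. That is the affordance distinction in miniature.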

What I think this means

I've been writing this series in a deliberately "diagnostic" way, so let me summarize more directly.

I believe the agent systems we're building now will hit a competence ceiling. It won't be about reasoning quality or context length or tool availability. It will be about the inability of agents to compound their own work. Every invocation from scratch, every capability re-derived, every lesson re-learned from a natural-language summary rather than a live artifact. The waste is invisible because we don't yet have systems that show us the alternative. But the ceiling is structural, and no amount of better prompting or longer context will remove it.

The fix is architectural. The environment in which an agent operates needs to be a persistent, evaluable space: not a runtime sandbox, not a tool registry, not contextual memory, but an evolving collection of authored operators, abstractions, wrappers, decompositions, and parameterized procedures that remain callable and editable across tasks. The representation used for actions, storage, and inspection needs to be the same representation, because any translation boundary between these roles creates a gap where accumulated competence falls through.

There are open problems I'm not solving here, so I'll call them out.

A live environment where the agent authors its own operators is also an environment that can accumulate broken, brittle, or hallucinated structures. Curation, validation, and decay are real questions, and I suspect they're harder than the representational ones.

There's also the question of whether current models can reliably do what this architecture demands of them: inspecting a procedure's structure, recognizing its analogy to a new problem, composing a variant by principled substitution rather than guess-and-check. Today's LLMs can do this kind of structural reasoning over symbolic expressions sometimes, but not yet reliably. That's an empirical bet.

I'd rather treat these as the next layer of work than as objections that defeat the thesis.

What homoiconicity was always for (even though the programming-language community has mostly treated it as a metaprogramming convenience) was systems that modify and extend themselves. That we're now building systems that need exactly this capability is not a coincidence. You can think of it as the use case arriving sixty years late!

Current agents can remember. They cannot inherit themselves. Until that changes, accumulation will be simulated through recollection, and the ceiling will hold. Maybe this is something to build soon, maybe later ... we'll see.