Why stateless agents beat stateful ones
When designing the sverm agent architecture, we had a choice: stateful or stateless agents. Stateful agents carry their own memory, maintain local state, and feel more “agent-like.” Every AI agent framework we looked at took this approach.
We chose stateless. Here’s why.
What a stateful agent looks like
Section titled “What a stateful agent looks like”A stateful agent is an object that lives in memory:
class StatefulAgent: def __init__(self): self.memory = [] self.current_task = None self.status = "idle"
async def run(self, task): self.current_task = task while self.status != "done": response = await self.llm.generate(self.memory + [task]) self.memory.append(response) # ... complex state machineThis feels natural. The agent “remembers” things. It has a lifecycle. It’s what most tutorials show you.
The problems:
- Can’t restart. If the process dies, all memory is lost. No recovery.
- Can’t scale. You can’t just spin up another instance — it doesn’t have the first one’s memory.
- Can’t test deterministically. The agent’s behavior depends on its internal state, which is hard to set up and even harder to assert against.
- Hard to observe. What is the agent thinking right now? You have to instrument the object and hope you catch it at the right moment.
What a stateless agent looks like
Section titled “What a stateless agent looks like”A stateless agent is a pure function: messages in, decisions out.
async def stateless_agent(messages: list[dict], tools: list[Tool]) -> AgentThought: response = await llm.generate(messages, tools) return parse_thought(response)All state lives in the message bus. The agent reads new messages addressed to it, thinks, and publishes its response. That’s it.
What you get:
- Crash recovery for free. The agent dies? Restart it. The full conversation is in the message bus. Replay the messages, and it picks up where it left off.
- Horizontal scaling is trivial. Need more agents? Start more processes. They all read from and write to the same message bus.
- Testing is deterministic. Give the agent the same input messages, and you get the same output. No setup, no teardown, no mocking internal state.
- Full observability. The message bus IS the state. Every thought, every tool call, every delegation is a message. You can inspect, replay, and audit the entire swarm.
But what about memory?
Section titled “But what about memory?”“But agents need memory!” Yes — but memory doesn’t have to be local.
We use two levels of shared memory:
- Redis Streams for the conversation history (short-term)
- Redis JSON for shared working memory (the swarm’s “notes”)
Both survive agent restarts. Both are inspectable from outside.
The agent doesn’t need to “remember” — it just needs to read what’s in the bus.
The trade-off
Section titled “The trade-off”Stateless isn’t free. Every agent call requires re-reading from the bus. That means more network calls than a stateful agent that holds everything in local memory.
But network calls to a local Redis instance cost microseconds. LLM calls cost seconds. The overhead is negligible.
What you gain — reliability, scalability, testability, observability — is worth orders of magnitude more than what you pay in Redis round-trips.
What we’ve learned
Section titled “What we’ve learned”After running thousands of swarms:
- Zero data loss from agent crashes. The message bus design works.
- Scaling is boring. Adding workers is a docker-compose config change.
- Testing caught bugs we’d never have found otherwise. Deterministic replay of agent conversations found 3 logic errors in our first month.
- Observability changed how we debug. Instead of adding log statements to agent code, we just query the message bus.
Stateless isn’t just a design choice. It’s a debugging superpower.
This is the first post in our architecture series. Next: how we built the cost router that keeps swarms under $0.10.