Production AI Agents
A book by Omer Haderi

Building systems that survive real users.

Orchestration, evals, observability, cost, security, rollout: the careful machinery that takes your agent from a working demo to a deployed system.

PDF · Companion repo

13 chapters · 11 git-tagged checkpoints · One synthetic chaos environment · 61 passing tests

Your agent demo works.
Your production agent doesn't.

Most agent tutorials end at the working demo. This book starts there.

You'll build a real Site Reliability Engineering agent, one that diagnoses incidents under live fault injection, is scored by an eval harness, blocks its own regressions in CI, traces every step, stays inside a token budget, survives prompt injection from hostile logs, and rolls out one action at a time as it earns trust. Not a notebook demo. The kind of agent you could put on call.


Who this book is for

Backend engineers
Adding an agent to a real product, not a demo repo.
ML / AI engineers
Shipping LLM systems past the proof-of-concept.
SRE / DevOps
Adding LLM-driven automation to your toolkit, carefully.
Tech leads & founders
Betting a product on an agent and wanting to sleep at night.
Platform engineers
Building the rails other teams will run their agents on.

If you're comfortable with Python, Docker, and the basic shape of a microservice, you have everything this book needs.

# the agent grows one capability per chapter, tagged in git
$ git checkout ch07
$ make agent-run

▸ recall similar past incidents      ok (0)
▸ gather: promql_query, log_search   ok
▸ hypothesis: slow-query on orders   ok
▸ propose: rollout-restart orders    proposed
▸ remember incident for next time    ok

$ make agent-eval RUNS=5
judge agreement 1.0  · trustworthy
correctness 0.6 · safety 1.0 · 8 steps

What you'll build

An SRE agent that runs against a live synthetic chaos environment.

  • Six-service microservice topology with live fault injection
  • Full telemetry stack: Prometheus, Grafana, Loki, Tempo
  • Durable orchestrator with crash recovery and replay
  • Three-tier state: task log, conversation, incident memory
  • Six defensive tools with schemas and contract tests
  • An eval harness with a validated judge
  • A CI deploy gate that blocks the agent's own regressions
  • End-to-end OpenTelemetry tracing across agent and services
  • Per-incident token budgets and cache-friendly context
  • Guardrails that survive prompt injection from hostile logs
  • Progressive rollout from shadow to autonomous, per action
  • A verifier agent, justified only where it earns its cost
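
To make one of these items concrete, here is a minimal sketch of a per-incident token budget. It is an illustration under stated assumptions, not the book's implementation: the `TokenBudget` class, its field names, and the reserve-for-wrap-up idea are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Hypothetical per-incident budget; not the book's actual class."""
    limit: int        # total tokens allowed for one incident
    reserve: int = 0  # tokens held back for the final wrap-up summary
    spent: int = 0

    def charge(self, tokens: int) -> None:
        """Record the tokens consumed by one model call."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        self.spent += tokens

    @property
    def remaining(self) -> int:
        """Tokens left, never reported as negative."""
        return max(self.limit - self.spent, 0)

    def should_wrap_up(self) -> bool:
        """True once only the reserve (or less) is left."""
        return self.remaining <= self.reserve


# Illustrative numbers only.
budget = TokenBudget(limit=10_000, reserve=2_000)
budget.charge(5_000)
mid_run = budget.should_wrap_up()   # 5,000 left, above the reserve
budget.charge(3_500)
near_end = budget.should_wrap_up()  # 1,500 left, inside the reserve
```

Holding back a reserve is one way to let an agent stop investigating while it can still afford to summarize what it found, rather than failing mid-step when the budget runs out.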

What's inside

The first two chapters set the scene. From chapter 3 onward, each chapter adds a layer that lives behind a git tag, so you can check the agent out as it stood at the end of any chapter.

  • ch01
    What Production-Grade Means for Agents
    The 02:47 incident that opens the book, and what it tells us about the work an SRE agent actually has to do. Why most agents that work in a demo never make it to this kind of on-call shift.
  • ch02
    The Anatomy of a Production Agent
    The eight components of a production agent (rollout, guardrails, orchestrator, state, executor, tools, observability, evals) and the failure mode each one prevents. Sets up the reference agent the rest of the book builds.
  • ch03
    Designing for Capability and Boundary
    Scope config as the agent's first artifact: in-scope, frontier, out-of-scope, forbidden actions. The agent can state what it will and will not do.
  • ch04
    Orchestration and Control Flow
    A durable orchestrator and executor with checkpoint and replay. The agent runs a checkpointed investigation and survives a worker crash mid-run.
  • ch05
    State and Memory
    Task state in Postgres, conversation in Redis, long-term memory in a vector store. The agent recovers coherent state and recalls past incidents.
  • ch06
    Tools and Integrations
    Six defensive tool wrappers with schema validation and contract tests. The agent queries metrics, logs, traces, deploys, and runbooks through tested boundaries.
  • ch07
    Building an Eval Harness
    Step and trajectory evals scored against the chaos scenarios. The agent is now measurable on diagnosis correctness, safety, and efficiency.
  • ch08
    Evals as Deployment Gates
    A CI gate with baselines and per-dimension thresholds, plus production sampling. The agent blocks its own deploys on a regression.
  • ch09
    Observability and Tracing
    OpenTelemetry tracing, semantic logs, two-family drift detection. Reconstruct any run; surface drift before failures get loud.
  • ch10
    Cost and Latency Engineering
    Prompt-cache-friendly context assembly, model routing, per-incident token budgets. Diagnose under a budget; wrap up gracefully when it's spent.
  • ch11
    Security and Guardrails
    Input guardrails, credential-level permission scoping, action gates. Injection becomes survivable: the deterministic layers contain it.
  • ch12
    Human-in-the-Loop and Rollout
    Shadow, assisted, and autonomous modes per action, with an approval surface. Actions graduate individually as they earn trust.
  • ch13
    When Multi-Agent Earns Its Cost
    A targeted verifier agent, invoked only on the hard incidents. Measure whether a second agent is worth its tokens, and where.
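
As a rough illustration of the ch08 idea (a gate with baselines and per-dimension thresholds), the sketch below compares a candidate's eval scores against a stored baseline. The `Threshold` type, the `gate` function, and all numbers are assumptions made for this example; the book's harness may be structured quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical gate rule for one eval dimension."""
    floor: float     # absolute minimum score
    max_drop: float  # largest tolerated drop versus the baseline


def gate(baseline: dict[str, float],
         candidate: dict[str, float],
         thresholds: dict[str, Threshold]) -> list[str]:
    """Return the reasons to block a deploy; an empty list means pass."""
    failures = []
    for dim, rule in thresholds.items():
        score = candidate[dim]
        if score < rule.floor:
            failures.append(f"{dim}: {score:.2f} is below floor {rule.floor:.2f}")
        elif baseline[dim] - score > rule.max_drop:
            failures.append(f"{dim}: dropped from {baseline[dim]:.2f} to {score:.2f}")
    return failures


# Illustrative numbers only, not results from the book.
thresholds = {
    "correctness": Threshold(floor=0.5, max_drop=0.1),
    "safety": Threshold(floor=1.0, max_drop=0.0),
}
baseline = {"correctness": 0.6, "safety": 1.0}
regressed = {"correctness": 0.4, "safety": 1.0}

blocked = gate(baseline, regressed, thresholds)  # one failure: correctness
passed = gate(baseline, baseline, thresholds)    # empty list
```

In a CI job, a non-empty result would typically be printed and turned into a non-zero exit code, which is what stops the deploy. Giving safety a zero tolerance while allowing small swings in correctness is one plausible way to set per-dimension thresholds.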

Three Field Notes chapters run the same chaos day against the agent at three milestones: after the architecture is complete (ch06), after it's proven and observable (ch09), and after the shipped rollout (ch12).

What this book is, and what it isn't.

This book builds one specific agent: a Site Reliability Engineering agent that diagnoses incidents in a synthetic microservice environment. That agent is a vehicle, not the destination.

The destination is everything the chapters teach you while building it. Orchestration, durable state, defensive tools, eval harnesses, deploy gates, observability, cost discipline, guardrails, rollout policy. These transfer to whatever agent you're shipping next, whether it answers support tickets, drafts contracts, or runs your build pipeline.

You'll learn production agent engineering by building one in the open.

Built with: Python 3.11 · Anthropic Claude · OpenTelemetry · Postgres · Redis · Prometheus · Grafana · Loki · Tempo · Docker
Optional offline mode runs every chapter without an API key.

What you need

  • Working Python, enough to read and modify a small package.
  • Docker installed, and a laptop with at least 16 GB of RAM to run the chaos environment locally.
  • Familiarity with the shape of a microservice, an API, and a metrics dashboard.
  • A Claude or OpenAI API key (optional). Every chapter ships a scripted planner that runs offline.

No SRE background required. The book treats SRE as the worked example, not as a prerequisite.


Omer Haderi

About the author

I've been writing software professionally for over twenty years. Most of that time has been spent watching well-designed systems behave badly under load, and building the careful, boring machinery (orchestrators, eval harnesses, deploy gates, observability) that lets you trust them in production.

This book is what I want to hand someone who's just built an agent that works and is now staring at the next ten miles of road.

Get the book

$29.99

PDF · Companion repo on GitHub.

  • PDF, ideal for reading on a second monitor while you build
  • Companion repo with the full SRE agent, chapter by chapter

Common questions

How is this different from the agent tutorials I've already read?
Most agent tutorials stop where this book starts: at a working demo. The book's premise is that the production work (durable state, eval harnesses, observability, cost discipline, rollout policy) is most of the actual job, and nobody is teaching it end to end against a real-feeling environment.
Do I need to be an SRE?
No. The agent in the book happens to do SRE work because it's a domain with rich, messy signals and real consequences for being wrong. The production techniques transfer to any agent you're building. The book teaches you what each tool does as it introduces it.
Which LLM does the book use?
The default planner targets Anthropic Claude, but every chapter also ships a scripted planner that runs offline with no API key. You can work through the entire book without spending a cent on inference if you want to.
Will I need cloud infrastructure?
No. The whole environment (six microservices, Prometheus, Grafana, Loki, Tempo, Postgres, Redis, the agent itself) runs locally under docker compose. A 16 GB laptop is enough.
Where can I buy it?
Direct from this site via Payhip (PDF, instant download, $29.99) or on Amazon. Buying direct skips the Amazon cut, so the author keeps more of what you pay. You also get the companion GitHub repo.
What format does it come in?
PDF when you buy direct, plus access to the public companion GitHub repo with the full SRE agent code chapter by chapter. Also available on Amazon if you prefer that route.

The next ten miles between your agent demo and a system you'd trust on call. That's the book.

Instant download · PDF