A book by Omer Haderi

Building systems that survive real users.

Orchestration, evals, observability, cost, security, rollout: the careful machinery that takes your agent from a working demo to a deployed system.

Get the book for $29.99 · Buy on Amazon

PDF · Companion repo

Your agent demo works.
Your production agent doesn't.

Most agent tutorials end at the working demo. This book starts there.

You'll build a real Site Reliability Engineering agent, one that diagnoses incidents under live fault injection, is scored by an eval harness, blocks its own regressions in CI, traces every step, stays inside a token budget, survives prompt injection from hostile logs, and rolls out one action at a time as it earns trust. Not a notebook demo. The kind of agent you could put on call.

Who this book is for

Backend engineers

Adding an agent to a real product, not a demo repo.

ML / AI engineers

Shipping LLM systems past the proof-of-concept.

SRE / DevOps

Adding LLM-driven automation to your toolkit, carefully.

Tech leads & founders

Betting a product on an agent and wanting to sleep at night.

Platform engineers

Building the rails other teams will run their agents on.

If you're comfortable with Python, Docker, and the basic shape of a microservice, you have everything this book needs.

# the agent grows one capability per chapter, tagged in git
$ git checkout ch07
$ make agent-run

▸ recall similar past incidents      ok (0)
▸ gather: promql_query, log_search   ok
▸ hypothesis: slow-query on orders   ok
▸ propose: rollout-restart orders    proposed
▸ remember incident for next time    ok

$ make agent-eval RUNS=5
judge agreement 1.0  · trustworthy
correctness 0.6 · safety 1.0 · 8 steps

What you'll build

An SRE agent that runs against a live synthetic chaos environment.

✓ Six-service microservice topology with live fault injection
✓ Full telemetry stack: Prometheus, Grafana, Loki, Tempo
✓ Durable orchestrator with crash recovery and replay
✓ Three-tier state: task log, conversation, incident memory
✓ Six defensive tools with schemas and contract tests
✓ An eval harness with a validated judge
✓ A CI deploy gate that blocks the agent's own regressions
✓ End-to-end OpenTelemetry tracing across agent and services
✓ Per-incident token budgets and cache-friendly context
✓ Guardrails that survive prompt injection from hostile logs
✓ Progressive rollout from shadow to autonomous, per action
✓ A verifier agent, justified only where it earns its cost

What's inside

The first two chapters set the scene. From chapter 3 onward, each chapter adds a layer that lives behind a git tag, so you can check the agent out as it stood at the end of any chapter.

ch01

What Production-Grade Means for Agents

The 02:47 incident that opens the book, and what it tells us about the work an SRE agent actually has to do. Why most agents that work in a demo never reach this kind of shift.
ch02

The Anatomy of a Production Agent

The eight components of a production agent (rollout, guardrails, orchestrator, state, executor, tools, observability, evals) and the failure mode each one prevents. Sets up the reference agent the rest of the book builds.
ch03

Designing for Capability and Boundary

Scope config as the agent's first artifact: in-scope, frontier, out-of-scope, forbidden actions. The agent can state what it will and will not do.
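
As a rough illustration of the idea, not the book's code: a scope config can be as small as a few named sets plus a lookup. The class names and the action names here are assumptions (the first three loosely echo the sample run above, and `delete_namespace` is invented for the example).

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    IN_SCOPE = "in_scope"
    FRONTIER = "frontier"
    OUT_OF_SCOPE = "out_of_scope"
    FORBIDDEN = "forbidden"


@dataclass(frozen=True)
class ScopeConfig:
    in_scope: frozenset[str]
    frontier: frozenset[str]
    forbidden: frozenset[str]

    def classify(self, action: str) -> Scope:
        # Forbidden is checked first so it wins even if an action
        # was also listed elsewhere by mistake.
        if action in self.forbidden:
            return Scope.FORBIDDEN
        if action in self.in_scope:
            return Scope.IN_SCOPE
        if action in self.frontier:
            return Scope.FRONTIER
        # Anything not named explicitly is treated as out of scope.
        return Scope.OUT_OF_SCOPE


# Hypothetical example config; the action names are illustrative only.
EXAMPLE_SCOPE = ScopeConfig(
    in_scope=frozenset({"promql_query", "log_search"}),
    frontier=frozenset({"rollout_restart"}),
    forbidden=frozenset({"delete_namespace"}),
)
```

Treating every unlisted action as out of scope means the config fails closed: a new tool does nothing until someone places it in a tier.
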
ch04

Orchestration and Control Flow

A durable orchestrator and executor with checkpoint and replay. The agent runs a checkpointed investigation and survives a worker crash mid-run.
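
One way to picture checkpoint and replay, as a minimal sketch rather than the orchestrator the chapter builds: record each step's result under its name, and on restart skip any step that already has a recorded result. The in-memory dict stands in for a durable store, and the function and step names are hypothetical.

```python
from typing import Callable

# A step is a (name, function) pair; names must be unique within a run.
Step = tuple[str, Callable[[], str]]


def run_investigation(steps: list[Step], checkpoint: dict[str, str]) -> dict[str, str]:
    """Run steps in order, skipping any whose result is already checkpointed."""
    for name, fn in steps:
        if name in checkpoint:
            continue  # replay: reuse the recorded result instead of re-running
        checkpoint[name] = fn()  # recorded only after the step succeeds
    return checkpoint
```

If a step raises partway through, the checkpoint keeps every result recorded so far, and calling `run_investigation` again with the same checkpoint resumes at the step that failed.
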
ch05

State and Memory

Task state in Postgres, conversation in Redis, long-term memory in a vector store. The agent recovers coherent state and recalls past incidents.
ch06

Tools and Integrations

Six defensive tool wrappers with schema validation and contract tests. The agent queries metrics, logs, traces, deploys, and runbooks through tested boundaries.
ch07

Building an Eval Harness

Step and trajectory evals scored against the chaos scenarios. The agent is now measurable on diagnosis correctness, safety, and efficiency.
ch08

Evals as Deployment Gates

A CI gate with baselines and per-dimension thresholds, plus production sampling. The agent blocks its own deploys on a regression.
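
A gate with per-dimension thresholds might look something like the sketch below. It is an assumption about the shape, not the book's implementation: scores are plain floats per dimension, and `max_drop` (a hypothetical name) says how far each dimension may fall below its baseline before the gate fails.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop: dict[str, float],
) -> list[str]:
    """Return the dimensions where the candidate fell too far below baseline."""
    failed = []
    for dimension, allowed in max_drop.items():
        drop = baseline[dimension] - candidate[dimension]
        if drop > allowed:
            failed.append(dimension)
    return failed
```

In CI, a non-empty result would fail the job, for example by exiting with a non-zero status. Setting a dimension's allowed drop to `0.0` (a natural choice for safety) makes any decrease on it a blocking regression.
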
ch09

Observability and Tracing

OpenTelemetry tracing, semantic logs, two-family drift detection. Reconstruct any run; surface drift before failures get loud.
ch10

Cost and Latency Engineering

Prompt-cache-friendly context assembly, model routing, per-incident token budgets. Diagnose under a budget; wrap up gracefully when it's spent.
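
A per-incident budget with a graceful wrap-up could be sketched as follows. The class and field names are hypothetical, and the idea of holding back a reserve for the final summary is an assumption about one reasonable design, not a description of the book's code.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    limit: int            # total tokens allowed for this incident
    wrap_up_reserve: int  # tokens held back for the final summary
    spent: int = 0

    def charge(self, tokens: int) -> None:
        """Record tokens used by one model call."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        self.spent += tokens

    @property
    def remaining(self) -> int:
        # Never report a negative balance, even after an overshoot.
        return max(self.limit - self.spent, 0)

    @property
    def should_wrap_up(self) -> bool:
        # True once only the reserve (or less) is left.
        return self.remaining <= self.wrap_up_reserve
```

The planner loop would check `should_wrap_up` before each step and, once it turns true, stop gathering evidence and spend the reserve on summarizing what it has.
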
ch11

Security and Guardrails

Input guardrails, credential-level permission scoping, action gates. Injection becomes survivable: the deterministic layers contain it.
ch12

Human-in-the-Loop and Rollout

Shadow, assisted, and autonomous modes per action, with an approval surface. Actions graduate individually as they earn trust.
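
Per-action rollout can be pictured as a lookup from action name to mode. This is a sketch under stated assumptions: the `Mode` and `decide` names, the returned strings, and the choice to default unknown actions to shadow are all illustrative, not taken from the book.

```python
from enum import Enum


class Mode(Enum):
    SHADOW = "shadow"          # log what the agent would have done
    ASSISTED = "assisted"      # execute only after a human approves
    AUTONOMOUS = "autonomous"  # execute without waiting


def decide(policy: dict[str, Mode], action: str, approved: bool = False) -> str:
    """Return what to do with a proposed action under the current policy."""
    # Actions with no entry start in the most conservative mode.
    mode = policy.get(action, Mode.SHADOW)
    if mode is Mode.AUTONOMOUS:
        return "execute"
    if mode is Mode.ASSISTED:
        return "execute" if approved else "await_approval"
    return "log_only"
```

Graduating an action is then a one-line policy change for that action alone, which leaves every other action in its current mode.
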
ch13

When Multi-Agent Earns Its Cost

A targeted verifier agent, invoked only on the hard incidents. Measure whether a second agent is worth its tokens, and where.

Three Field Notes chapters run the same chaos day against the agent at three milestones: after the architecture is complete (ch06), after it's proven and observable (ch09), and after the rollout has shipped (ch12).

What this book is, and what it isn't.

This book builds one specific agent: a Site Reliability Engineering agent that diagnoses incidents in a synthetic microservice environment. That agent is a vehicle, not the destination.

The destination is everything the chapters teach you while building it. Orchestration, durable state, defensive tools, eval harnesses, deploy gates, observability, cost discipline, guardrails, rollout policy. These transfer to whatever agent you're shipping next, whether it answers support tickets, drafts contracts, or runs your build pipeline.

You'll learn production agent engineering by building a production agent in the open.

What you need

✓ Working Python, enough to read and modify a small package.
✓ Docker installed, and a laptop with at least 16 GB of RAM to run the chaos environment locally.
✓ Familiarity with the shape of a microservice, an API, and a metrics dashboard.
✓ A Claude or OpenAI API key (optional). Every chapter ships a scripted planner that runs offline.

No SRE background required. The book treats SRE as the worked example, not as a prerequisite.

About the author

I've been writing software professionally for over twenty years. Most of that time has been spent watching well-designed systems behave badly under load, and building the careful, boring machinery (orchestrators, eval harnesses, deploy gates, observability) that lets you trust them in production.

This book is what I want to hand someone who's just built an agent that works and is now staring at the next ten miles of road.

Get the book

$29.99

PDF · Companion repo on GitHub.

✓ PDF, ideal for reading on a second monitor while you build
✓ Companion repo with the full SRE agent, chapter by chapter

Buy on Payhip for $29.99 · Buy on Amazon

Common questions

How is this different from the agent tutorials I've already read?

Most agent tutorials stop where this book starts: at a working demo. The book's premise is that the production work (durable state, eval harnesses, observability, cost discipline, rollout policy) is most of the actual job, and nobody is teaching it end to end against a realistic environment.

Do I need to be an SRE?

No. The agent in the book happens to do SRE work because it's a domain with rich, messy signals and real consequences for being wrong. The production techniques transfer to any agent you're building, and the book explains each tool as it is introduced.

Which LLM does the book use?

The default planner targets Anthropic Claude, but every chapter also ships a scripted planner that runs offline with no API key. You can work through the entire book without spending a cent on inference if you want to.

Will I need cloud infrastructure?

No. The whole environment (six microservices, Prometheus, Grafana, Loki, Tempo, Postgres, Redis, the agent itself) runs locally under Docker Compose. A 16 GB laptop is enough.

Where can I buy it?

Direct from this site via Payhip (PDF, instant download, $29.99) or on Amazon. Buying direct skips the Amazon cut, so the author keeps more of what you pay. You also get the companion GitHub repo.

What format does it come in?

PDF when you buy direct, plus access to the public companion GitHub repo with the full SRE agent code chapter by chapter. Also available on Amazon if you prefer that route.

The next ten miles between your agent demo and a system you'd trust on call. That's the book.