We used Antimetal to resolve 76% of our production errors in one day

We pointed Antimetal at our production environment to see how many errors it could resolve on its own.

Last month, we pointed Antimetal at our own production environment, seven services and about 3,000 error logs a day, to see how much the agent could resolve on its own. We gave it no runbook and no list of known issues. It cut total error volume by 76% in a single day, down to about 720 logs, and over the next 30 days volume stayed near that new baseline.

Antimetal's Triage agent clustered the error logs into issues, investigated each one against a world model of the running system, opened PRs with tests, routed them to the owning engineer, and verified the fixes in production. An engineer reviewed every change before it shipped.

It found six issues. Three did not require changing the service that logged the error: one was an expected failure logged too loudly, one was a customer sending malformed requests because of a misconfigured credential, and one was noise from our own agent. The other three needed code changes: a duplicate parser, a cross-service retry storm, and a Postgres race. In every case, the cause sat outside the code path where the error appeared.

That pattern is why this work is hard for a coding agent, which can write a fix that compiles and reads well but cannot tell whether the fix is right. The deciding evidence lives outside the repo, in deploy timelines, cross-service traces, ownership, and whether the error comes back after the fix merges. In each case below, that evidence decided the cause, the owner, or whether the fix was done.

The first job was triage

The first thing we learned was that resolving an error does not always mean writing code. Every error in production was in scope, whether or not our code caused it, so the useful question for each issue was what kind of action it needed.

The first three issues all showed up in our service logs, but none of them needed a change to the service that logged them. One needed a severity reclassification, one needed to be routed to a customer, and one needed a prompt fix. Resolving these removed roughly 1,150 logs a day.

What showed up in production	What Antimetal found	Action
Sentry returned HTTP 429, but the response was logged as an application error	The response was retryable. The problem was log severity, not product behavior.	Reclassify. Observed drop: about 512 logs/day
A customer request failed validation before it reached our service logic	The customer credential was configured incorrectly, so the request body was already wrong on arrival.	Route to the customer to reconfigure credentials. Observed drop: about 358 logs/day
Grafana's MCP server rejected an agent tool call	Our prompt had drifted from Grafana's registered tools, so the agent called a tool the server did not expose.	Fix the prompt and tighten the schema. Observed drop: about 281 logs/day
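
The reclassification in the first row can be sketched as a small mapping from an upstream HTTP status to a log severity. This is an illustration, not code from our services: the function name `classifyUpstreamFailure`, the `LogLevel` type, and the choice of which statuses count as retryable are all assumptions.

```typescript
// Hypothetical sketch: names and the retryable status set are assumptions.
type LogLevel = "warn" | "error";

interface Classification {
  level: LogLevel;
  retryable: boolean;
}

// Statuses treated as retryable in this sketch:
// 429 (Too Many Requests) and 503 (Service Unavailable).
const RETRYABLE_STATUSES: ReadonlySet<number> = new Set([429, 503]);

// Intended for non-2xx responses from an upstream dependency.
function classifyUpstreamFailure(status: number): Classification {
  if (RETRYABLE_STATUSES.has(status)) {
    // Expected, retryable failure: log it quietly instead of as an application error.
    return { level: "warn", retryable: true };
  }
  return { level: "error", retryable: false };
}
```

With a mapping like this, a 429 from an upstream service would be logged as a warning and retried, while other failures would still surface as errors.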

The next three were different. Each one needed a code change.

Bug 1: A duplicate parser the stack trace could not see

Our Model Context Protocol endpoint was intermittently throwing "stream is not readable". We had chalked it up to flaky transport, which felt reasonable. SSE is fragile, and much of the ecosystem is already moving off it. Idle connections drop, and a session resume comes back as a 404. Looking at the throw site, the obvious fix was defensive: wrap the read in try/catch, add a retry, and suppress the symptom.

A defensive fix like that would have hidden the real cause, because the stack trace blamed the read while the actual problem was in the body parsing behind it. And there was one detail a code-only view would miss: the errors all started at a single deploy.

The deploy timeline was what gave it away. The /mcp route parser had been in place for six weeks without this error, and the errors only started after a later deploy added global JSON parsing before requests reached individual routes. From then on, /mcp requests could be parsed twice. The global parser consumed the body first, the route parser then tried to read the same stream, and since request bodies can only be consumed once, it threw.

Instead of adding defensive code, Antimetal deleted the /mcp route parser that had been added six weeks earlier, leaving the global parser as the only code reading the request body. That removed roughly 247 error logs a day and touched none of the code the stack trace pointed at.
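
The double-parse failure and its fix can be reproduced in a few lines. The sketch below is a simplified model, not our server code: `OneShotBody`, `McpRequest`, and the handler names are hypothetical, and the body is modeled as an object that can be read exactly once rather than as a real HTTP stream.

```typescript
// Hypothetical model of a request body: like a real request stream,
// it can be read exactly once.
class OneShotBody {
  private consumed = false;
  private readonly raw: string;

  constructor(raw: string) {
    this.raw = raw;
  }

  read(): string {
    if (this.consumed) {
      throw new Error("stream is not readable");
    }
    this.consumed = true;
    return this.raw;
  }
}

interface McpRequest {
  body: OneShotBody;
  parsed?: unknown;
}

// Stand-in for the global JSON parser added by the later deploy.
function globalJsonParser(req: McpRequest): void {
  req.parsed = JSON.parse(req.body.read());
}

// Stand-in for the older route-level /mcp parser, which reads the stream itself.
function mcpRouteParser(req: McpRequest): unknown {
  return JSON.parse(req.body.read());
}

// Before the fix: both parsers run, and the second read throws.
function handleMcpBefore(req: McpRequest): unknown {
  globalJsonParser(req);
  return mcpRouteParser(req);
}

// After the fix: the route parser is gone and the handler uses the global result.
function handleMcpAfter(req: McpRequest): unknown {
  globalJsonParser(req);
  return req.parsed;
}
```

In this model, `handleMcpBefore` throws "stream is not readable" on every request, while `handleMcpAfter` returns the parsed body. Wrapping the second read in try/catch would only hide the throw; the body would still have been consumed by the time the route parser ran.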

Bug 2: A retry storm between two services

Our Slack app calls our backend on every request and retries when a call fails. One backend call that could never succeed turned into about 446 error logs a day, because the app kept retrying it.

From the app's repo, the bug looked local. The retry loop was right there, and the obvious fix was to cap the retries or add backoff. That would have cut the errors, passed review, and left the real problem untouched.

The backend answered every failure with the same generic error, with no signal for whether it was worth retrying. So the app treated a permanent failure exactly like a temporary one.

Antimetal pulled the trace across both services, and it showed what the repo could not:

Trace fact	What it proved
One interaction carried a single correlation ID across the retries	The burst came from one user action
Repeated backend spans carried the same payload fingerprint	The app was retrying the same failing request
Each backend span returned the same terminal validation result	The backend had already made a final decision
The backend response exposed only a generic error, with no retryability marker	The app saw "backend failed", not "do not retry this"
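
The first three trace facts amount to a grouping rule that can be checked mechanically. The sketch below shows one way to express it; the `BackendSpan` shape and its field names are assumptions for illustration, not a real tracing schema or Antimetal's implementation.

```typescript
// Hypothetical span shape: field names are assumptions, not a real tracing schema.
interface BackendSpan {
  correlationId: string;
  payloadFingerprint: string;
  result: string; // e.g. "VALIDATION_FAILED"
  terminal: boolean; // whether the backend treated the result as final
}

// Returns one correlation ID per (correlation ID, payload) group whose spans
// are repeated attempts that all ended in the same terminal result.
function findFutileRetryBursts(spans: BackendSpan[], minAttempts = 2): string[] {
  const groups = new Map<string, BackendSpan[]>();
  for (const span of spans) {
    const key = `${span.correlationId}|${span.payloadFingerprint}`;
    const group = groups.get(key) ?? [];
    group.push(span);
    groups.set(key, group);
  }

  const bursts: string[] = [];
  for (const group of groups.values()) {
    const first = group[0];
    if (first === undefined) continue;
    const sameTerminalResult = group.every(
      (s) => s.terminal && s.result === first.result
    );
    if (group.length >= minAttempts && sameTerminalResult) {
      bursts.push(first.correlationId);
    }
  }
  return bursts;
}
```

A group that passes this check is a request the backend had already rejected for good, which is the signal that retrying it was never going to help.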

Because the actual problem sat in the contract between the two services, that is where the fix went. The backend started telling the app which failures were terminal, the app stopped retrying requests that could never succeed, and the bursts disappeared from production.
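
A minimal sketch of that contract change, under stated assumptions: the `BackendResult` shape, the `retryable` field, and `callWithRetry` are hypothetical names, and the loop is synchronous with no backoff to keep the example short. A real client would await each call and wait between attempts.

```typescript
// Hypothetical contract: field names are assumptions for illustration.
type BackendResult =
  | { ok: true; value: string }
  | { ok: false; code: string; retryable: boolean };

interface RetryOutcome {
  result: BackendResult;
  attempts: number;
}

// Retries only while the backend says the failure is worth retrying.
// Synchronous and without backoff for brevity.
function callWithRetry(call: () => BackendResult, maxAttempts: number): RetryOutcome {
  let attempts = 1;
  let result = call();
  while (!result.ok && result.retryable && attempts < maxAttempts) {
    attempts += 1;
    result = call();
  }
  return { result, attempts };
}
```

The design choice is that the backend, which knows why a request failed, decides whether a retry can help. The app no longer has to guess from a generic error, so a terminal validation failure costs one call instead of a burst.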

Bug 3: A Postgres race that came back after the first fix

The last bug looked finished when the fix merged. CI passed, the fix made sense in review, and the alert went quiet. The next day, production showed the failure had only moved.

The symptom was a race on a Postgres write, roughly 436 error logs a day. Several workers wrote the same row at the same time. Each row ID was a content hash, so two workers that saw the same content computed the same ID. The old code read first and inserted second. Under Postgres's default READ COMMITTED, two transactions could both see no row, both insert, and one would hit Prisma's P2002 duplicate key error.

The first fix collapsed that read and insert into a single createMany with skipDuplicates. It addressed the race Antimetal had found.
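
The race and the single-step fix can be simulated without a database. The sketch below is an in-memory model under stated assumptions: `UniqueTable` stands in for a table with a unique key, the string "P2002" mirrors the Prisma error code named above, and the two workers' steps are interleaved by hand rather than run concurrently.

```typescript
// In-memory stand-in for a table with a unique primary key (hypothetical).
class UniqueTable {
  private rows = new Set<string>();

  exists(id: string): boolean {
    return this.rows.has(id);
  }

  // Plain insert: fails on a duplicate key, like a unique constraint.
  insert(id: string): void {
    if (this.rows.has(id)) {
      throw new Error("P2002");
    }
    this.rows.add(id);
  }

  // Single-step insert that skips duplicates; returns whether a row was written.
  insertSkipDuplicates(id: string): boolean {
    if (this.rows.has(id)) return false;
    this.rows.add(id);
    return true;
  }
}

// Read-then-insert, with both workers reading before either inserts.
function racyInterleaving(table: UniqueTable, id: string): string[] {
  const outcomes: string[] = [];
  const workerASawRow = table.exists(id);
  const workerBSawRow = table.exists(id);
  for (const sawRow of [workerASawRow, workerBSawRow]) {
    if (sawRow) {
      outcomes.push("skipped");
      continue;
    }
    try {
      table.insert(id);
      outcomes.push("inserted");
    } catch (e) {
      outcomes.push(e instanceof Error ? e.message : String(e));
    }
  }
  return outcomes;
}

// The same two workers using the single-step insert.
function singleStepInterleaving(table: UniqueTable, id: string): string[] {
  const outcomes: string[] = [];
  for (let worker = 0; worker < 2; worker++) {
    outcomes.push(table.insertSkipDuplicates(id) ? "inserted" : "skipped");
  }
  return outcomes;
}
```

In this model, `racyInterleaving` yields `["inserted", "P2002"]` and `singleStepInterleaving` yields `["inserted", "skipped"]`. The difference is that the second version has no gap between the check and the write for another worker to land in.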

If Antimetal had stopped watching production at merge, this would have looked solved. Instead, it kept a scoped monitor on that exact P2002 class, and the next day the monitor fired again.

Antimetal reopened the issue it had just closed. The first fix removed the race from one write path. Another write later in the same flow still checked for an existing row before inserting, so concurrent workers could collide there too. The tests missed the failure class because they exercised each write path one worker at a time.
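
A scoped monitor of this kind reduces to a narrow filter: one error class, one service, only after the fix merged. The sketch below assumes a hypothetical `LogEvent` shape and field names; it illustrates the idea and is not Antimetal's monitor.

```typescript
// Hypothetical log and monitor shapes: field names are assumptions.
interface LogEvent {
  timestampMs: number;
  service: string;
  errorCode: string;
}

interface ScopedMonitor {
  service: string;
  errorCode: string; // the exact failure class to watch, e.g. "P2002"
  watchFromMs: number; // when the fix merged
}

// Logs that match the watched failure class after the fix merged.
function recurrencesAfterFix(monitor: ScopedMonitor, logs: LogEvent[]): LogEvent[] {
  return logs.filter(
    (log) =>
      log.service === monitor.service &&
      log.errorCode === monitor.errorCode &&
      log.timestampMs >= monitor.watchFromMs
  );
}

// Any recurrence of the same class is enough to reopen the issue.
function shouldReopen(monitor: ScopedMonitor, logs: LogEvent[]): boolean {
  return recurrencesAfterFix(monitor, logs).length > 0;
}
```

Scoping the monitor to the exact class keeps it quiet for unrelated errors, so a single hit after merge is a meaningful signal rather than noise.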

Rather than keep patching individual write sites, the agent moved the write to a content-addressed store, where the same content maps to the same object, so a duplicate write is a no-op instead of a collision. Then it checked the rest of the flow for other writes that read before inserting and found nothing else exposed. For the next 30 days, the same P2002 failure class did not return.
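
The property that makes a content-addressed write safe under concurrency is idempotence: writing the same content twice leaves the store unchanged. A minimal in-memory sketch, with assumptions labeled: `ContentStore` and `contentId` are hypothetical names, and the FNV-1a-style hash over UTF-16 code units is a small stand-in for a real cryptographic content hash.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units.
// A stand-in for a real cryptographic content hash; not collision-resistant.
function contentId(content: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < content.length; i++) {
    hash ^= content.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

// Hypothetical in-memory content-addressed store.
class ContentStore {
  private objects = new Map<string, string>();

  // Same content always maps to the same ID, so a repeated write is a no-op.
  put(content: string): { id: string; created: boolean } {
    const id = contentId(content);
    if (this.objects.has(id)) {
      return { id, created: false };
    }
    this.objects.set(id, content);
    return { id, created: true };
  }

  get size(): number {
    return this.objects.size;
  }
}
```

Two workers that call `put` with the same content get the same ID back and the store ends up with one object, so there is no duplicate-key error for either of them to hit.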

The first fix was reasonable, just incomplete. It matched the race Antimetal had found at the time, and the scoped monitor was what caught the same failure class returning from another path.

Toward autonomous production engineering

Across all six issues, writing the fix was rarely the hard part. The hard part was knowing the fix was right, and in every case that answer came from production: a deploy timeline, a cross-service trace, or continuous monitoring after merge.

We have started calling that loop autonomous production engineering. It means diagnosing from production, shipping a fix, and watching the exact failure class after merge. An engineer still makes the call on whether to merge and ship, but as the proof gets more conclusive, more of the loop runs on its own, and engineers get their time back for the work that needs them.
