Spacetime Studios

AI Agents · Engineering

AI agent maintenance: why your agents break in 90 days (and how to prevent it)

Haven Vu, Founder & CEO of Spacetime · 4 min read

TL;DR

AI agent maintenance means treating agents like production software: version your prompts/tools, monitor every run, write contract tests for APIs and data formats, and schedule regular model + dependency reviews. Most “it broke” incidents come from silent upstream change, not your logic. Build for observability and rollback from day one.

Why this matters for ops + eng leaders at 10–200 person teams

If you’re running agents in customer-facing workflows or internal ops, a “small” failure doesn’t stay small.

  • A lead enrichment agent silently changes a field name. Your routing rules fail. SDRs stop getting leads.
  • A browser automation breaks after a UI update. Now invoices don’t get reconciled. Month-end slips.
  • A model update shifts output format just enough to break downstream parsing. Your CRM fills with garbage.

At this company size, you don’t have a dedicated SRE team watching automations. The automation either runs… or it becomes one more thing your best people babysit.

In the demo and the first few weeks, everything works. But after 90 days, reality shows up: APIs change, data shifts, auth expires, vendors tweak UIs, and edge cases accumulate. Maintenance becomes the hidden tax.

Actionable steps: a practical AI agent maintenance system

You don’t need a big process. You need a repeatable one.

1) Put your agent on a “software contract”

Define what “correct” looks like, in writing.

  • Inputs: required fields, allowed formats, max payload size
  • Outputs: schema, required keys, error codes, confidence signals
  • Side effects: which systems it may write to, and which fields it’s allowed to modify

If your agent outputs JSON, enforce it. If it outputs free text, wrap it with a parser + validator and treat validation failures as first-class errors.
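Here's a minimal sketch of what "enforce it" can look like, using only the standard library. The field names (`lead_id`, `score`, `route_to`) are hypothetical placeholders for whatever your contract specifies:

```python
import json

# Required keys and expected types for the agent's output.
# These names are illustrative; substitute your real contract.
OUTPUT_SCHEMA = {"lead_id": str, "score": float, "route_to": str}

class OutputValidationError(ValueError):
    """Validation failures are first-class errors, not warnings."""

def parse_agent_output(raw: str) -> dict:
    """Parse agent text as JSON and enforce the output contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise OutputValidationError(f"not valid JSON: {e}") from e
    for key, typ in OUTPUT_SCHEMA.items():
        if key not in data:
            raise OutputValidationError(f"missing required key: {key}")
        if not isinstance(data[key], typ):
            raise OutputValidationError(
                f"wrong type for {key}: {type(data[key]).__name__}"
            )
    return data
```

The point is the exception class: downstream code catches `OutputValidationError` and routes the run to retry or human review, instead of writing garbage into your CRM.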

2) Add observability before you add more capabilities

Minimum viable observability for agents:

  • A run ID for every execution
  • Structured logs: tool calls, inputs, outputs, latency, token usage
  • Failure reasons: validation failed, tool 429, auth expired, selector missing, etc.
  • Alerting on error rate and “stuck” runs
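Minimum viable can be genuinely minimal. One structured line per event, tied to a run ID, gets you most of the way. A sketch (the event names and fields are illustrative, not a standard):

```python
import json
import time
import uuid

def log_event(run_id: str, event: str, **fields) -> dict:
    """Emit one structured log line per tool call, failure, or completion."""
    record = {"run_id": run_id, "ts": time.time(), "event": event, **fields}
    print(json.dumps(record, sort_keys=True))
    return record  # returned so callers/tests can inspect it

# One run ID per execution; every line carries it.
run_id = str(uuid.uuid4())
log_event(run_id, "tool_call", tool="crm.update", latency_ms=412, tokens=1850)
log_event(run_id, "failure", reason="tool_429", tool="crm.update")
```

Because every line is JSON and shares a `run_id`, any log aggregator can reconstruct a full run and alert on failure reasons without custom parsing.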

If you can’t answer “what changed?” within 10 minutes, you don’t have an agent. You have a liability.

3) Version everything that can change

Pin versions where possible. Track the rest.

  • Prompt templates and system instructions
  • Tool schemas
  • Model name + parameters
  • Dependency versions
  • Integration config (field mappings, IDs, selectors)

Treat prompt changes like code changes. PR it. Review it. Ship it.
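One way to make this reviewable is to pin everything behavior-affecting into a single config and log a fingerprint of it with every run. The config values below are placeholders, not recommendations:

```python
import hashlib
import json

# Everything that can change the agent's behavior, pinned in one reviewable
# place. Values are illustrative; a real setup would load this from a
# version-controlled file.
AGENT_CONFIG = {
    "prompt_version": "2025-06-01.3",
    "system_prompt": "You are a lead-enrichment assistant. Output JSON only.",
    "model": "example-model-v1",   # hypothetical pinned model name
    "temperature": 0.0,
    "tool_schemas": {"enrich_lead": {"required": ["email"]}},
    "field_mappings": {"company": "properties.company_name"},
}

def config_fingerprint(config: dict) -> str:
    """Stable hash, logged with every run, so 'what changed?' is answerable."""
    blob = json.dumps(config, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]
```

When a run misbehaves, you diff fingerprints between the last good run and the first bad one. If the fingerprint is unchanged, the world changed, not your config.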

4) Write the tests that catch 90-day failures

You want tests that fail when the world changes, not only when your code changes.

Recommended test types:

  1. Contract tests (APIs): call the upstream API in a sandbox and assert key fields still exist.
  2. Golden-run tests (LLM): keep a small set of representative inputs and verify the output still validates.
  3. Data-format tests: verify the shape of inbound data from your warehouse/CRM/export hasn’t drifted.
  4. Browser smoke tests: run one canary flow daily and alert on selector/step failure.
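A contract test can be a few lines. The sketch below fakes the upstream call with a stub; in a real test, `fetch_lead_sandbox` would hit your vendor's sandbox API, and the field names are hypothetical:

```python
# Contract test: fails when the upstream changes, not when our code does.

REQUIRED_FIELDS = {"id", "email", "company", "created_at"}

def fetch_lead_sandbox() -> dict:
    # Stand-in for a real sandbox API call, e.g.
    # requests.get(SANDBOX_URL, headers=auth).json()
    return {"id": "L-1", "email": "a@b.com",
            "company": "Acme", "created_at": "2025-01-01"}

def test_lead_contract():
    lead = fetch_lead_sandbox()
    missing = REQUIRED_FIELDS - lead.keys()
    assert not missing, f"upstream dropped fields: {missing}"
```

Run it on a schedule (daily, not just on deploy) so a vendor's Tuesday release pages you before your SDRs notice.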

If you only test on deploy, you’ll miss the breakage that happens on Tuesday because someone else shipped.

5) Build rollback + “safe mode” paths

Rollbacks are what make maintenance cheap.

  • Keep the last known-good prompt/tool config
  • Support a “no-write” mode for CRMs and billing systems
  • On repeated failure, switch to a fallback: queue for human review or a simpler deterministic flow
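The degradation logic can be a tiny state machine. A sketch, with an illustrative failure threshold:

```python
from enum import Enum

class Mode(Enum):
    NORMAL = "normal"              # full read/write
    NO_WRITE = "no_write"          # reads allowed; writes queued, not sent
    HUMAN_REVIEW = "human_review"  # everything queued for a person

FAILURE_THRESHOLD = 3  # illustrative; tune to your tolerance

def next_mode(consecutive_failures: int) -> Mode:
    """Degrade gracefully instead of retrying forever."""
    if consecutive_failures == 0:
        return Mode.NORMAL
    if consecutive_failures < FAILURE_THRESHOLD:
        return Mode.NO_WRITE
    return Mode.HUMAN_REVIEW
```

The key design choice: failure moves you to `NO_WRITE` immediately, so a flaky agent stops touching billing and CRM records long before a human gets paged.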

A reliable agent sometimes says: “I can’t safely do this right now.”

6) Assign ownership and schedule maintenance

Create a lightweight cadence:

  • Weekly (15 minutes): check error rate, top failure modes, alerts, and “unknown unknowns”
  • Monthly (60 minutes): review upstream API changes, vendor UI changes, auth/permissions drift
  • Quarterly (90 minutes): re-run eval set, reassess model choice, refactor brittle steps

Also: make one person the owner. Not a committee. A name.

A maintenance checklist you can steal

  • Area: API integrations | What breaks: field removed/renamed, auth expiry | What to do: contract tests + alert on schema changes | Frequency: daily/weekly
  • Area: Browser automation | What breaks: UI change, selector drift | What to do: canary run + resilient selectors | Frequency: daily
  • Area: LLM outputs | What breaks: format drift, refusal drift | What to do: schema validation + golden tests | Frequency: weekly
  • Area: Data inputs | What breaks: missing fields, new nulls | What to do: data validation + upstream checks | Frequency: weekly
  • Area: Costs | What breaks: token spikes, runaway loops | What to do: budgets + rate limits + anomaly alerts | Frequency: weekly
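The cost row deserves one concrete guard. A minimal per-run budget check (the limits are illustrative, not recommendations):

```python
# Per-run budget guard: kill runaway loops before they kill the monthly bill.
MAX_TOKENS_PER_RUN = 50_000  # illustrative limit
MAX_STEPS_PER_RUN = 25       # illustrative limit

def check_budget(tokens_used: int, steps: int) -> None:
    """Raise instead of letting a looping agent spend unbounded money."""
    if tokens_used > MAX_TOKENS_PER_RUN:
        raise RuntimeError(f"token budget exceeded: {tokens_used}")
    if steps > MAX_STEPS_PER_RUN:
        raise RuntimeError(f"step budget exceeded: {steps}")
```

Call it after every tool step. An agent that dies loudly at 50k tokens is an alert; one that loops quietly overnight is an invoice.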

What most teams get wrong

They optimize for “it worked once”

A demo is a controlled environment. Production is adversarial.

If your agent only works when:

  • the data is perfect,
  • the UI doesn’t change,
  • the model behaves exactly the same,

…then it’s not automation. It’s a fragile script with a chatbot bolted on.

They treat prompts like copy instead of code

Prompt edits feel harmless, so teams change them casually. That’s how you get silent regressions.

If the prompt determines a downstream write into HubSpot, Salesforce, or a database, it is code. No exceptions.

They add more tools instead of adding reliability

More tools increase the surface area for failure: more APIs, more auth, more rate limits, more timeouts.

A boring, observable agent beats a magical one that breaks.

Bottom line

“Set it and forget it” was always a lie.

If you want agents that survive past the first quarter, build them like software: contracts, tests, monitoring, and rollbacks. Then maintenance is predictable and cheap.

If you want a second set of eyes on your agent architecture, observability, and maintenance plan, book a call and I’ll tell you what I’d fix first: https://calendar.app.google/fvvhoEcfBzupGyC27

