Why we don’t use ChatGPT for ops

It’s not the model. It’s the lack of plumbing around it. Here’s what we use instead.

Every week we get on a discovery call with someone who tells us, with a slightly defeated tone, that they tried to use AI for their ops and it didn’t work.

We ask what they tried. They tried ChatGPT. They opened the chat window, they typed in their problem, they got a great answer, they couldn’t figure out how to make the great answer happen automatically every Monday morning, and they gave up.

The conclusion they’ve drawn is that AI isn’t ready for their business yet. The actual conclusion is that chat isn’t how you run an operation.

Chat is the demo. Not the product.

ChatGPT is great. We use it. We use Claude. We use whatever’s good. The model is rarely the bottleneck.

But a chat interface is, by design, missing every property an operational system needs:

It has no memory of what it did yesterday.
It can’t be triggered by an event — only by a human typing.
It can’t call your tools, only describe what it would do if it could.
It has no idempotency. Ask it twice, do the work twice.
It logs nothing in a way you can audit.
It can’t hand off to a human when it’s unsure.
It can’t be rolled back when it does something stupid.

Every one of those is solvable. None of them is solved by adding a better prompt to a chat window.
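To make the idempotency point concrete, here is a minimal sketch of "ask twice, do the work once." Everything in it is an assumption for illustration: the names runKey, RunLedger, and runOnce are hypothetical, and the in-memory Set stands in for whatever durable store you would really use (for example, a unique constraint on a Postgres table).

```typescript
// Hypothetical sketch: names and the in-memory store are illustrative only.
type RunKey = string;

// Derive a stable key from the job name and the day the run covers,
// so a retry of the same run produces the same key.
function runKey(job: string, periodStart: Date): RunKey {
  return `${job}:${periodStart.toISOString().slice(0, 10)}`;
}

class RunLedger {
  private done = new Set<RunKey>();

  // Returns true only the first time a key is claimed.
  claim(key: RunKey): boolean {
    if (this.done.has(key)) return false;
    this.done.add(key);
    return true;
  }
}

// Runs `work` at most once per key. Note the key is claimed before the work
// starts, so a crashed run stays claimed; a real system would also record
// completion and decide how to handle retries.
function runOnce(ledger: RunLedger, key: RunKey, work: () => void): "ran" | "skipped" {
  if (!ledger.claim(key)) return "skipped";
  work();
  return "ran";
}
```

The second trigger for the same Monday finds the key already claimed and does nothing, which is exactly the property a chat window cannot give you.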

What ops actually needs

We build agents around five components. The model is one of them. The other four are what make the difference between “cool demo” and “runs your invoices for two years without supervision.”

1. A trigger

The thing that wakes the agent up. Cron (“every Monday at 9”), webhook (“when a new ticket comes in”), queue message (“when the previous step finished”), event stream. Pick one. Boring infrastructure.
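One way to keep the trigger boring is to normalize whatever woke the agent into a single request shape, so the rest of the code never cares where the wake-up came from. This is a sketch under assumptions: the Trigger and RunRequest types and the toRunRequest function are hypothetical names, not part of any scheduler or queue library.

```typescript
// Hypothetical shapes for the three trigger styles mentioned above.
type Trigger =
  | { kind: "cron"; firedAt: string }
  | { kind: "webhook"; firedAt: string; payload: { ticketId: string } }
  | { kind: "queue"; firedAt: string; previousStep: string };

// The one shape the agent's entry point accepts.
interface RunRequest {
  reason: string;
  firedAt: string;
}

// Map each trigger kind to a RunRequest. The switch is exhaustive over
// Trigger, so adding a new kind is a compile error until it is handled.
function toRunRequest(t: Trigger): RunRequest {
  switch (t.kind) {
    case "cron":
      return { reason: "scheduled run", firedAt: t.firedAt };
    case "webhook":
      return { reason: `new ticket ${t.payload.ticketId}`, firedAt: t.firedAt };
    case "queue":
      return { reason: `step "${t.previousStep}" finished`, firedAt: t.firedAt };
  }
}
```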

2. A read of the world

The agent has to know what’s happening before it acts. Database queries, API calls, file reads. This is also where MCP (the Model Context Protocol) comes in: an open protocol that gives a model a clean way to call your tools without you wiring custom code per tool. We use MCP for everything new now.

3. A rules layer

Deterministic logic that decides whether to act, which action to take, and when to escalate. Not the LLM. Code. The reason: when something goes wrong at 2 AM, you need to be able to read a single function and say “ah, that’s why.”
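As a sketch of what "a single function you can read at 2 AM" might look like, here is a hypothetical rules layer for overdue invoices. The Invoice fields, the threshold, and the decide function are all assumptions made up for illustration; your own rules will differ.

```typescript
// Hypothetical invoice shape and threshold, for illustration only.
interface Invoice {
  id: string;
  amountCents: number;
  daysOverdue: number;
  disputed: boolean;
}

type Decision =
  | { action: "send_reminder" }
  | { action: "escalate"; reason: string }
  | { action: "skip"; reason: string };

const ESCALATE_ABOVE_CENTS = 500_000; // assumed limit for unattended sends

// Pure function: same invoice in, same decision out. No model involved.
function decide(inv: Invoice): Decision {
  if (inv.daysOverdue <= 0) return { action: "skip", reason: "not overdue" };
  if (inv.disputed) return { action: "escalate", reason: "invoice is disputed" };
  if (inv.amountCents > ESCALATE_ABOVE_CENTS) {
    return { action: "escalate", reason: "amount above auto-send limit" };
  }
  return { action: "send_reminder" };
}
```

Because every branch carries a reason, the decision can go straight into a log entry.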

4. The model — for words and judgment

Now the LLM does what it’s actually good at: writing the email, summarizing the thread, picking between two reasonable choices, classifying a ticket. The model is a function call inside your system. Not the system itself.
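Here is one way "the model is a function call" could look for ticket classification. It is a sketch, not a real provider integration: ModelFn is a hypothetical adapter type that any SDK call could be wrapped to fit, and the label set is invented. The point it illustrates is that the model's free-text answer is forced back into a closed set before the rest of the system sees it.

```typescript
// Assumed label set; anything the model says outside it becomes "other".
const LABELS = ["billing", "bug", "how_to", "other"] as const;
type Label = (typeof LABELS)[number];

// Hypothetical adapter: wrap whichever provider SDK you use to fit this shape.
type ModelFn = (prompt: string) => Promise<string>;

// Deterministic cleanup of the model's raw reply.
function parseLabel(raw: string): Label {
  const cleaned = raw.trim().toLowerCase();
  const match = LABELS.find((l) => l === cleaned);
  return match ?? "other";
}

// The model is injected, so tests can pass a stub instead of a real call.
async function classifyTicket(model: ModelFn, ticketText: string): Promise<Label> {
  const prompt =
    `Classify this support ticket as one of: ${LABELS.join(", ")}.\n` +
    `Reply with the label only.\n\n${ticketText}`;
  return parseLabel(await model(prompt));
}
```

Swapping models then means swapping the function you pass in, and nothing else.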

5. Guardrails and logs

Before any action with consequences, a final deterministic check. After every action, a log entry that tells you what the agent saw, what it decided, what it did, and why. When you wake up to a customer complaint, you need to be able to answer the question “what did the agent do and why” in under sixty seconds.
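A minimal sketch of the last check plus the log entry, assuming the consequential action is sending an email. The field names (saw, decided, did, why) mirror the four questions above; the allowed domain, the length cap, and the function names are hypothetical.

```typescript
// One structured record per action: what the agent saw, decided, did, and why.
interface AgentLogEntry {
  runId: string;
  at: string;
  saw: unknown;
  decided: string;
  did: string;
  why: string;
}

interface PlannedEmail {
  to: string;
  body: string;
}

type Verdict = { ok: true } | { ok: false; why: string };

const ALLOWED_DOMAIN = "example.com"; // assumed allow-list of one domain
const MAX_BODY_CHARS = 2000; // assumed cap

// Final deterministic check before anything leaves the building.
function checkEmail(email: PlannedEmail): Verdict {
  if (!email.to.endsWith(`@${ALLOWED_DOMAIN}`)) {
    return { ok: false, why: "recipient outside allowed domain" };
  }
  if (email.body.length > MAX_BODY_CHARS) return { ok: false, why: "body too long" };
  return { ok: true };
}

// `send` and `log` are injected so the sketch stays independent of any
// mail or logging library. Logs both the blocked and the sent case.
function guardedSend(
  runId: string,
  email: PlannedEmail,
  send: (e: PlannedEmail) => void,
  log: (entry: AgentLogEntry) => void,
): boolean {
  const verdict = checkEmail(email);
  const at = new Date().toISOString();
  if (verdict.ok === false) {
    log({ runId, at, saw: { to: email.to }, decided: "block", did: "nothing sent", why: verdict.why });
    return false;
  }
  send(email);
  log({ runId, at, saw: { to: email.to }, decided: "send", did: "email sent", why: "passed all checks" });
  return true;
}
```

With entries shaped like this, answering "what did the agent do and why" is a query by runId rather than an archaeology project.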

What we use

For the model, we lean on Claude (Sonnet for anything that needs judgment or long context, Haiku for high-volume copy generation, Opus when the work is hard and rare). We’re model-pluralists in principle — if a different model wins on a specific task, we’ll use it. We’re not loyal to any of these companies, only to the work coming out clean.

For building the agents, we use Claude Code as our IDE. Most of the agents we ship for clients were written, debugged, and deployed without leaving it. The point of Claude Code isn’t the chat — it’s that it can read your codebase, run your tests, edit files, run shell commands, and do all that under your eye. It flips writing software from “you describe, then translate to code” into “you steer, the model types.”

For tool calling, we use MCP wherever the source system has a server (or where it’s worth ten minutes to write one). For the rest, plain function calls in a TypeScript codebase. Nothing fancy.

For hosting, Vercel functions for low-volume agents, a small Postgres for state, Inngest or a simple queue for anything multi-step. We don’t use a vector database for most agents — the work usually doesn’t need one. When it does, we use whatever the data already lives in.

For logs, anywhere structured. Axiom is fine. Datadog is fine. A Postgres table is fine. Just structured. Not console-logs.

What this looks like in practice

Take the “please summarize my support inbox every morning” thing. With ChatGPT, you forward emails, paste, ask. It works, but only while you’re sitting there doing it.

With the agent shape: a cron at 8 AM (trigger). It reads yesterday’s inbox via the Gmail API (read of the world). A rules layer filters spam, internal threads, and anything from a do-not-summarize list. The model summarizes the rest by category, with example tickets. A guardrail caps length and removes anything that looks like a password or a key. The summary lands in your Slack at 8:05. Logs go to Postgres so if you ever want to re-run yesterday with new rules, you can.
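The two deterministic pieces of that pipeline, the rules filter and the guardrail, could be sketched like this. The domain, the do-not-summarize list, the spam flag, and especially the secret-detection patterns are assumptions for illustration; real redaction needs patterns tuned to the credentials you actually handle.

```typescript
// Hypothetical email shape; `spam` is assumed to be set upstream.
interface Email {
  from: string;
  subject: string;
  body: string;
  spam: boolean;
}

const INTERNAL_DOMAIN = "ourcompany.example"; // assumed
const DO_NOT_SUMMARIZE = new Set<string>(["legal@partner.example"]); // assumed

// Rules layer: drop spam, internal threads, and anything on the list.
function shouldSummarize(e: Email): boolean {
  if (e.spam) return false;
  if (e.from.endsWith(`@${INTERNAL_DOMAIN}`)) return false;
  if (DO_NOT_SUMMARIZE.has(e.from)) return false;
  return true;
}

// Illustrative patterns only: prefixed key-like tokens and "password: value".
const SECRET_PATTERNS: RegExp[] = [
  /\b(?:sk|pk|api|key)[-_][A-Za-z0-9_-]{16,}\b/g,
  /\bpassword\s*[:=]\s*\S+/gi,
];

// Guardrail: redact anything secret-looking, then cap the length.
function applyGuardrail(summary: string, maxChars = 1500): string {
  let out = summary;
  for (const pattern of SECRET_PATTERNS) {
    out = out.replace(pattern, "[redacted]");
  }
  return out.length > maxChars ? out.slice(0, maxChars - 3) + "..." : out;
}
```

The model sits between these two functions: it only ever sees emails that passed shouldSummarize, and its output only reaches Slack after applyGuardrail.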

Same model. Same prompts. Wildly different system. The difference is the four pieces around the model.

If you took one thing away

It would be: stop blaming the model. The model is fine. The model has been fine for two years. What you don’t have is the plumbing. Trigger, read, rules, model, guardrails. Five pieces. None of them are exotic. Most of them aren’t even AI — they’re the kind of code your team already writes.

That’s the work. The model is the cherry. Most teams skip the cake.