“We want to build an agent” is the most common sentence we hear on a first call, and it’s almost never specific enough to scope. The word agent has been stretched to cover a support chatbot, a search box over a wiki, and a system that reschedules deliveries by itself. Those three things share a sentence and almost nothing else. They have different users, different failure modes, different evaluation strategies, and prices that differ by an order of magnitude.
So before we talk budget, we sort the idea into one of three shapes. The taxonomy isn’t academic. It’s the fastest way we know to turn a vague ambition into a scoped, fixed-price engagement, and to set the right expectations about what “good” will mean once it ships.
Why the shape decides everything
An agent’s shape is defined by one question: what does it touch? A concierge touches a customer. A reference touches a body of documents. An operator touches your systems of record. The blast radius of a mistake grows across that list, and so does the engineering you have to do to make the mistake survivable. That single axis predicts the eval metric, the guardrails, the timeline, and the bill.
| Shape | Touches | Optimized for | Worst failure | Typical band |
|---|---|---|---|---|
| Concierge | A customer | Time-to-resolution | Confidently wrong, in public | CA$28–72k |
| Reference | Your documents | Citation accuracy | A hallucinated source | CA$28–72k |
| Operator | Your systems | Tool-call reliability | A silent partial action | CA$28–72k+ |
The bands overlap because complexity, not shape, sets the final number. What the shape changes is where the effort goes. Below, the three in detail.
1. The concierge
A concierge is the customer-facing one: the chat bubble on a website, the assistant inside a product, the bot that handles the first round of inbound support. Its job is to get a stranger to an answer or an outcome as quickly as possible, and to know precisely when to hand off to a human.
The defining constraint is that it speaks to people who don’t work for you and don’t owe you any patience. That makes tone, latency, and graceful escalation as important as raw accuracy. The metrics that matter are not “was the answer technically correct” but time-to-resolution and escalation rate: how quickly a real person reaches the end of the conversation satisfied, and how cleanly the cases it can’t handle get routed to a human.
How a concierge fails
A concierge fails by being confidently wrong in public. It invents a refund policy, promises a delivery date it can’t know, or argues with a frustrated customer. The fix is rarely a smarter model. It’s a tighter scope (answer only from approved content), a low threshold for saying “let me get a person,” and an eval set built from real transcripts of your hardest conversations, not the happy path.
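As a rough illustration of "answer only from approved content" combined with an eager handoff, here is a minimal sketch. Everything in it is an assumption made for the example: the `approved` mapping of topics to vetted replies, the injected `scorer` function (a stand-in for whatever matching or retrieval a real build would use), and the 0.8 threshold.

```python
from __future__ import annotations

from typing import Callable, Mapping

HANDOFF = "Let me get a person to help with that."


def answer_or_escalate(
    question: str,
    approved: Mapping[str, str],          # hypothetical: topic -> vetted reply
    scorer: Callable[[str, str], float],  # hypothetical: match score in [0, 1]
    threshold: float = 0.8,               # assumed value; tune on real transcripts
) -> tuple[str, bool]:
    """Return (reply, escalated). Replies only ever come from `approved`."""
    if not approved:
        return HANDOFF, True
    # Pick the approved topic that best matches the question.
    best_topic = max(approved, key=lambda topic: scorer(question, topic))
    # A high bar for answering is the same thing as a low bar for escalating.
    if scorer(question, best_topic) < threshold:
        return HANDOFF, True
    return approved[best_topic], False
```

The design choice worth noticing is that the function has no path that composes a new answer. It either returns vetted text or hands off, which is what keeps an invented refund policy out of the conversation.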
2. The reference
A reference answers questions from a body of knowledge you already own: policies, handbooks, contracts, tickets, a decade of internal memos. This is the shape most people mean when they say “a chatbot over our docs,” and it’s usually built on retrieval — the model is handed the relevant passages at question time and asked to answer from them, with citations.
The metric here is citation accuracy: when the agent answers, does the cited source actually say what the agent claims, and when the corpus doesn’t contain the answer, does it say “I don’t know” instead of guessing? A reference agent that confidently answers questions outside its corpus is worse than useless, because it teaches the team to stop checking.
How a reference fails
The signature failure is the hallucinated citation: a fluent, plausible answer attached to a source that doesn’t support it, or doesn’t exist. Most reference failures are retrieval failures, not generation failures. If the right passage never makes it into the context window, no amount of model quality saves the answer. That’s why we spend the build budget on the retrieval layer and the “I don’t know” behaviour, and evaluate against a labelled set of real questions with known-correct sources.
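One way to score a reference agent against a labelled set is sketched below. The `EvalCase` structure, the source identifiers, and the two metric names are assumptions for illustration. The sketch also only checks that the cited source IDs fall inside the labelled set; confirming that a passage actually supports the claim still needs a human or a separate check.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    question: str
    gold_sources: frozenset[str]   # empty means the corpus has no answer
    cited_sources: frozenset[str]  # empty means the agent said "I don't know"


def score(cases: list[EvalCase]) -> dict[str, float]:
    answerable = [c for c in cases if c.gold_sources]
    unanswerable = [c for c in cases if not c.gold_sources]
    # Correct only if the agent cited something and every citation is labelled.
    correct = sum(
        1 for c in answerable
        if c.cited_sources and c.cited_sources <= c.gold_sources
    )
    # On out-of-corpus questions, the right behaviour is to cite nothing.
    abstained = sum(1 for c in unanswerable if not c.cited_sources)
    return {
        "citation_accuracy": correct / len(answerable) if answerable else 0.0,
        "abstention_rate": abstained / len(unanswerable) if unanswerable else 0.0,
    }


cases = [
    EvalCase("What is the refund window?",
             frozenset({"policy.md#refunds"}), frozenset({"policy.md#refunds"})),
    EvalCase("Who approves travel?",
             frozenset({"handbook.md#travel"}), frozenset({"handbook.md#expenses"})),
    EvalCase("What is the CEO's shoe size?", frozenset(), frozenset()),
    EvalCase("When is the office moving?", frozenset(), frozenset({"memo-2019.md"})),
]
print(score(cases))  # {'citation_accuracy': 0.5, 'abstention_rate': 0.5}
```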
The line that matters
A concierge and a reference both read. An operator writes. The moment an agent can change the state of your business — send the email, update the record, issue the credit — you are building a different, more expensive animal, and you should price and evaluate it as one.
3. The operator
An operator does things. It takes multi-step tasks and runs them across your tools: pull the document, extract the fields, check them against a record, update the system, notify a person. It’s the shape behind “structured extraction,” “intake triage,” and most of what people now call “agentic” work. It plans, calls tools, observes the result, and loops until the task is done.
Because an operator has side effects, the stakes change. A wrong answer from a reference agent wastes a minute. A wrong action from an operator corrupts a record, double-sends an invoice, or quietly skips a step and tells no one. The metrics are task completion and tool-call reliability: of the runs it attempted, how many finished correctly end-to-end, and of the tool calls it made, how many succeeded without a silent failure.
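The two ratios described above can be computed from run logs roughly as follows. The `Run` and `ToolCall` shapes are hypothetical, and `verified_correct` assumes some independent check of the end state exists, whether a human review or an automated one.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    ok: bool  # True only if the call succeeded and the result was confirmed


@dataclass
class Run:
    calls: list[ToolCall]
    verified_correct: bool  # end state checked independently of the agent


def operator_metrics(runs: list[Run]) -> dict[str, float]:
    all_calls = [call for run in runs for call in run.calls]
    return {
        # Of the runs attempted, how many finished correctly end-to-end.
        "task_completion": (
            sum(r.verified_correct for r in runs) / len(runs) if runs else 0.0
        ),
        # Of the tool calls made, how many succeeded.
        "tool_call_reliability": (
            sum(c.ok for c in all_calls) / len(all_calls) if all_calls else 0.0
        ),
    }


runs = [
    Run([ToolCall("pull", True), ToolCall("extract", True), ToolCall("update", True)], True),
    Run([ToolCall("pull", True), ToolCall("update", False)], False),
]
print(operator_metrics(runs))  # {'task_completion': 0.5, 'tool_call_reliability': 0.8}
```

The example shows why both numbers are tracked: four of five tool calls succeeded, yet only half the tasks finished correctly.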
How an operator fails
The dangerous failure is the silent partial: the agent completes four of five steps, the fifth fails, and nothing flags it. The defenses are unglamorous and non-negotiable — a confidence floor that routes uncertain cases to a human, retries with backoff, hard cost ceilings, an audit log of every action, and a human-in-the-loop on the edge cases from day one. This is the engineering that separates a demo from a system, and it’s why operators sit at the top of the price band.
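A minimal sketch of three of those defenses (retries with backoff, an audit log of every attempt, and a loud stop instead of a silent partial) might look like this. The function name, the log format, and the default retry settings are all assumptions; the confidence floor and cost ceiling are left out to keep the example short.

```python
import time


class StepFailed(Exception):
    """Raised when a step exhausts its retries, so the run cannot end silently."""


def run_steps(steps, audit_log, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run (name, callable) pairs in order, recording every attempt in audit_log."""
    for name, action in steps:
        for attempt in range(1, max_attempts + 1):
            try:
                result = action()
            except Exception as exc:
                audit_log.append(
                    {"step": name, "attempt": attempt, "status": "error", "error": repr(exc)}
                )
                if attempt == max_attempts:
                    # Stop the whole run; later steps never execute.
                    raise StepFailed(f"{name} failed after {max_attempts} attempts") from exc
                # Exponential backoff: base_delay, then 2x, then 4x, and so on.
                sleep(base_delay * 2 ** (attempt - 1))
            else:
                audit_log.append(
                    {"step": name, "attempt": attempt, "status": "ok", "result": result}
                )
                break
```

Retrying is only safe when the action is idempotent. A step like "send the invoice" needs an idempotency key or a check-before-send, otherwise the retry itself becomes the double-send. In a real system the caller would catch `StepFailed` and route the run to a human along with the audit log.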
Hybrids, and how to tell which you need
Plenty of real systems are two shapes stapled together. A customer-facing assistant that can also issue a refund is a concierge wearing an operator’s tool belt, and it inherits the operator’s guardrails the instant it can move money. An internal assistant that answers from the handbook and files the ticket is a reference plus an operator. The point of the taxonomy isn’t to force every system into one box. It’s to make you name each capability, because each one you add changes the eval set and the risk.
The quick test, on a first call: who is on the other end, and can the agent change anything? A stranger and no side effects is a concierge. Your own team and no side effects is a reference. Side effects, for anyone, is an operator, and it should be scoped, priced, and hardened like one.
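Written as a decision rule, the quick test is small enough to fit in a few lines. The audience labels `"customer"` and `"team"` are assumed names for "a stranger" and "your own team".

```python
def classify_shape(audience: str, has_side_effects: bool) -> str:
    """Apply the two-question test: who is on the other end, and can it change anything?"""
    # Side effects win regardless of audience.
    if has_side_effects:
        return "operator"
    if audience == "customer":
        return "concierge"
    if audience == "team":
        return "reference"
    raise ValueError(f"unknown audience: {audience!r}")


print(classify_shape("customer", False))  # concierge
print(classify_shape("team", False))      # reference
print(classify_shape("customer", True))   # operator
```

The side-effects check comes first on purpose: a customer-facing assistant that can issue a refund is classified as an operator, which matches how the hybrid case is treated above.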
What this means for a build
Once the shape is named, the rest of the engagement falls out of it. The shape tells us which eval set to write in week two, where the hardening budget goes, and which number we’ll be holding ourselves to at handoff. A reference build lives or dies on retrieval and the “I don’t know” rate; an operator build lives or dies on the audit log and the human-in-the-loop. Putting the same effort in the wrong place is how a project ships a beautiful demo and an unusable system.
So when we scope a Chatbots & Agents engagement, the first thing we write down isn’t the model or the framework. It’s the shape. Everything else — the price band, the calendar, the metric on the dashboard — is a consequence of that one word.
If you can already say which of the three you need, you’re most of the way to a scoped engagement. If you can’t, that’s exactly what the first week is for.
Tell us the task, and we’ll tell you which shape it is and what it costs.