What is the difference between a reference agent and an operator agent?

A reference agent reads: it retrieves from a document corpus and answers questions with citations, and its worst failure is a confident wrong answer. An operator agent writes: it takes actions through tools, such as updating a record or sending an email, and its worst failure is a silent partial action. Reference agents are evaluated on citation accuracy; operator agents on task completion and tool-call success.

Which AI agent type is the most expensive to build?

Operator agents cost the most to build, because side effects raise the stakes. They need confidence floors, retries, cost ceilings, an audit log, and a human-in-the-loop on edge cases before they touch production. A concierge or reference agent can ship in two to four weeks; a production operator agent is usually a four-to-eight-week engagement.

Concierge, reference, operator: the three agent shapes

The short version. Most production AI agents are one of three shapes. A concierge is customer-facing and optimized for time-to-resolution. A reference answers internal questions from your own documents and is optimized for citation accuracy. An operator executes multi-step tasks across your tools and is optimized for tool-call reliability. Decide the shape first; it determines everything downstream.

“We want to build an agent” is the most common sentence we hear on a first call, and it’s almost never specific enough to scope. The word agent has been stretched to cover a support chatbot, a search box over a wiki, and a system that reschedules deliveries by itself. Those three things share a sentence and almost nothing else. They have different users, different failure modes, different evaluation strategies, and budgets that get spent in very different places.

So before we talk budget, we sort the idea into one of three shapes. The taxonomy isn’t academic. It’s the fastest way we know to turn a vague ambition into a scoped, fixed-price engagement, and to set the right expectations about what “good” will mean once it ships.

Why the shape decides everything

An agent’s shape is defined by one question: what does it touch? A concierge touches a customer. A reference touches a body of documents. An operator touches your systems of record. The blast radius of a mistake grows across that list, and so does the engineering you have to do to make the mistake survivable. That single axis predicts the eval metric, the guardrails, the timeline, and the bill.

Shape	Touches	Optimized for	Worst failure	Typical price band
Concierge	A customer	Time-to-resolution	Confidently wrong, in public	CA$28–72k
Reference	Your documents	Citation accuracy	A hallucinated source	CA$28–72k
Operator	Your systems	Tool-call reliability	A silent partial action	CA$28–72k+

The bands overlap because complexity, not shape, sets the final number. What the shape changes is where the effort goes. Below, the three in detail.

1. The concierge

A concierge is the customer-facing one: the chat bubble on a website, the assistant inside a product, the bot that handles the first round of inbound support. Its job is to get a stranger to an answer or an outcome as quickly as possible, and to know precisely when to hand off to a human.

The defining constraint is that it speaks to people who don’t work for you and don’t owe you any patience. That makes tone, latency, and graceful escalation as important as raw accuracy. The metric that matters is not “was the answer technically correct” but time-to-resolution and escalation rate: how often a real person reaches the end of the conversation satisfied, and how cleanly the cases it can’t handle get routed to a human.
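As a rough illustration, both numbers can be computed from conversation logs. This is a minimal sketch, not a prescribed method: the `Conversation` record and its field names are hypothetical, and it assumes you already have a way to mark a conversation as resolved (a survey, a tag, a manual review).

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Conversation:
    # Hypothetical log record; the field names are illustrative only.
    seconds_to_close: float  # first customer message to end of conversation
    escalated: bool          # True if the agent handed off to a human
    resolved: bool           # True if the customer's issue was resolved


def concierge_metrics(conversations: list[Conversation]) -> dict[str, float]:
    """Summarize time-to-resolution and escalation rate for a batch."""
    if not conversations:
        raise ValueError("need at least one conversation")
    total = len(conversations)
    # Conversations the agent closed on its own, with no human involved.
    self_served = [c for c in conversations if c.resolved and not c.escalated]
    escalated = [c for c in conversations if c.escalated]
    # Median is less sensitive than the mean to a handful of very long chats.
    times = [c.seconds_to_close for c in self_served]
    return {
        "median_seconds_to_resolution": median(times) if times else float("nan"),
        "escalation_rate": len(escalated) / total,
        "self_serve_resolution_rate": len(self_served) / total,
    }
```

Reporting the median rather than the mean is a design choice here: one customer who leaves a chat open overnight should not move the headline number.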

How a concierge fails

A concierge fails by being confidently wrong in public. It invents a refund policy, promises a delivery date it can’t know, or argues with a frustrated customer. The fix is rarely a smarter model. It’s a tighter scope (answer only from approved content), a low threshold for saying “let me get a person,” and an eval set built from real transcripts of your hardest conversations, not the happy path.

2. The reference

A reference answers questions from a body of knowledge you already own: policies, handbooks, contracts, tickets, a decade of internal memos. This is the shape most people mean when they say “a chatbot over our docs,” and it’s usually built on retrieval — the model is handed the relevant passages at question time and asked to answer from them, with citations.

The metric here is citation accuracy: when the agent answers, does the cited source actually say what the agent claims, and when the corpus doesn’t contain the answer, does it say “I don’t know” instead of guessing? A reference agent that confidently answers questions outside its corpus is worse than useless, because it teaches the team to stop checking.

How a reference fails

The signature failure is the hallucinated citation: a fluent, plausible answer attached to a source that doesn’t support it, or doesn’t exist. Most reference failures are retrieval failures, not generation failures. If the right passage never makes it into the context window, no amount of model quality saves the answer. That’s why we spend the build budget on the retrieval layer and the “I don’t know” behaviour, and evaluate against a labelled set of real questions with known-correct sources.
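A labelled evaluation of that kind might be scored along these lines. This is a sketch under stated assumptions: the `LabelledCase` record is hypothetical, and "the cited source supports the claim" is approximated by checking that every cited source ID appears in the known-correct set. A real evaluation may also need a human or a second model to confirm that the passage says what the answer claims.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelledCase:
    # Hypothetical eval record; source IDs are whatever your corpus uses.
    question: str
    gold_sources: frozenset[str]   # known-correct sources; empty = not in corpus
    cited_sources: frozenset[str]  # sources the agent cited in its answer
    abstained: bool                # True if the agent said "I don't know"


def _ratio(numerator: int, denominator: int) -> float:
    return numerator / denominator if denominator else float("nan")


def reference_metrics(cases: list[LabelledCase]) -> dict[str, float]:
    answerable = [c for c in cases if c.gold_sources]
    unanswerable = [c for c in cases if not c.gold_sources]
    answered = [c for c in answerable if not c.abstained]
    # Proxy for "the cited source supports the claim": the agent cited at
    # least one source, and every source it cited is a known-correct one.
    well_cited = [
        c for c in answered
        if c.cited_sources and c.cited_sources <= c.gold_sources
    ]
    correctly_declined = [c for c in unanswerable if c.abstained]
    return {
        "citation_accuracy": _ratio(len(well_cited), len(answered)),
        "correct_abstention_rate": _ratio(len(correctly_declined), len(unanswerable)),
    }
```

Keeping unanswerable questions in the set matters as much as the answerable ones: without them, the "I don't know" behaviour is never measured.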

The line that matters

A concierge and a reference both read. An operator writes. The moment an agent can change the state of your business — send the email, update the record, issue the credit — you are building a different, more expensive animal, and you should price and evaluate it as one.

3. The operator

An operator does things. It takes multi-step tasks and runs them across your tools: pull the document, extract the fields, check them against a record, update the system, notify a person. It’s the shape behind “structured extraction,” “intake triage,” and most of what people now call “agentic” work. It plans, calls tools, observes the result, and loops until the task is done.

Because an operator has side effects, the stakes change. A wrong answer from a reference agent wastes a minute. A wrong action from an operator corrupts a record, double-sends an invoice, or quietly skips a step and tells no one. The metric is tool-call reliability and task completion: of the runs it attempted, how many finished correctly end-to-end, and of the tool calls it made, how many succeeded without a silent failure.
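One way to put numbers on those two questions is to score recorded runs. The sketch below is illustrative only: the `Run` and `ToolCall` records are hypothetical, and it assumes each run is checked independently after the fact, so that what the agent reported can be compared with what actually happened.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str
    ok: bool


@dataclass
class Run:
    calls: list[ToolCall]
    reported_done: bool      # what the agent claimed
    verified_complete: bool  # what an independent check of the end state found


def operator_metrics(runs: list[Run]) -> dict[str, float]:
    if not runs:
        raise ValueError("need at least one run")
    all_calls = [call for run in runs for call in run.calls]
    return {
        "task_completion_rate": sum(r.verified_complete for r in runs) / len(runs),
        "tool_call_success_rate": (
            sum(c.ok for c in all_calls) / len(all_calls) if all_calls else float("nan")
        ),
        # Runs that claimed success but did not pass verification.
        "silent_partials": float(
            sum(r.reported_done and not r.verified_complete for r in runs)
        ),
    }
```

The gap between `reported_done` and `verified_complete` is the number to watch: a run that fails loudly is an inconvenience, while a run that fails and reports success is the one that corrupts a record.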

How an operator fails

The dangerous failure is the silent partial: the agent completes four of five steps, the fifth fails, and nothing flags it. The defenses are unglamorous and non-negotiable — a confidence floor that routes uncertain cases to a human, retries with backoff, hard cost ceilings, an audit log of every action, and a human-in-the-loop on the edge cases from day one. This is the engineering that separates a demo from a system, and it’s why operators sit at the top of the price band.
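To make those defenses concrete, here is a minimal sketch of a wrapper that applies all of them to a single step. Every name in it (`Guard`, `NeedsHuman`, `run_step`) and every default value is hypothetical, not taken from any framework. It also assumes the wrapped action is idempotent: retrying a non-idempotent action, such as sending an invoice, is itself a way to cause the double-send described above.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable


class NeedsHuman(Exception):
    """Raised when a step must be routed to a person instead of executed."""


@dataclass
class Guard:
    confidence_floor: float = 0.8   # below this, a human decides
    max_attempts: int = 3
    cost_ceiling: float = 5.0       # hypothetical units, e.g. dollars per run
    base_delay: float = 0.5         # seconds; doubled after each failed attempt
    spent: float = 0.0
    audit_log: list[dict[str, Any]] = field(default_factory=list)
    sleep: Callable[[float], None] = time.sleep

    def _record(self, step: str, outcome: str, **extra: Any) -> None:
        # Every decision is logged, including the ones that stop the run.
        self.audit_log.append(
            {"step": step, "outcome": outcome, "at": time.time(), **extra}
        )

    def run_step(self, step: str, action: Callable[[], Any],
                 confidence: float, cost: float) -> Any:
        if confidence < self.confidence_floor:
            self._record(step, "routed_to_human", confidence=confidence)
            raise NeedsHuman(f"{step}: confidence {confidence:.2f} below floor")
        for attempt in range(1, self.max_attempts + 1):
            if self.spent + cost > self.cost_ceiling:
                self._record(step, "cost_ceiling_hit", spent=self.spent)
                raise NeedsHuman(f"{step}: cost ceiling reached")
            self.spent += cost
            try:
                result = action()
            except Exception as exc:
                self._record(step, "failed", attempt=attempt, error=repr(exc))
                if attempt < self.max_attempts:
                    # Exponential backoff: base, 2x base, 4x base, ...
                    self.sleep(self.base_delay * 2 ** (attempt - 1))
            else:
                self._record(step, "succeeded", attempt=attempt)
                return result
        self._record(step, "gave_up", attempts=self.max_attempts)
        raise NeedsHuman(f"{step}: failed after {self.max_attempts} attempts")
```

The important property is that no path through `run_step` ends quietly: a step either returns a result or raises `NeedsHuman`, and both outcomes leave an entry in the audit log.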

Hybrids, and how to tell which you need

Plenty of real systems are two shapes stapled together. A customer-facing assistant that can also issue a refund is a concierge wearing an operator’s tool belt, and it inherits the operator’s guardrails the instant it can move money. An internal assistant that answers from the handbook and files the ticket is a reference plus an operator. The point of the taxonomy isn’t to force every system into one box. It’s to make you name each capability, because each one you add changes the eval set and the risk.

The quick test, on a first call: who is on the other end, and can the agent change anything? A stranger and no side effects is a concierge. Your own team and no side effects is a reference. Side effects, for anyone, is an operator, and it should be scoped, priced, and hardened like one.
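The quick test is simple enough to write down as a decision rule. The sketch below is only a restatement of it in code: the function and parameter names are hypothetical, and the metric labels are the ones from the table earlier in this article.

```python
from enum import Enum


class Shape(Enum):
    CONCIERGE = "concierge"
    REFERENCE = "reference"
    OPERATOR = "operator"


# The metric each shape is optimized for, as listed in the table above.
PRIMARY_METRIC = {
    Shape.CONCIERGE: "time-to-resolution",
    Shape.REFERENCE: "citation accuracy",
    Shape.OPERATOR: "tool-call reliability",
}


def classify(audience_is_external: bool, has_side_effects: bool) -> Shape:
    """Apply the two-question test: who is on the other end, and can it change anything?"""
    # Side effects win regardless of audience.
    if has_side_effects:
        return Shape.OPERATOR
    return Shape.CONCIERGE if audience_is_external else Shape.REFERENCE
```

For example, `classify(audience_is_external=True, has_side_effects=True)` returns `Shape.OPERATOR`: a customer-facing assistant that can issue a refund is hardened as an operator, even though it still needs a concierge's tone and escalation behaviour.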

What this means for a build

Once the shape is named, the rest of the engagement falls out of it. The shape tells us which eval set to write in week two, where the hardening budget goes, and which number we’ll be holding ourselves to at handoff. A reference build lives or dies on retrieval and the “I don’t know” rate; an operator build lives or dies on the audit log and the human-in-the-loop. Putting the same effort in the wrong place is how a project ships a beautiful demo and an unusable system.

So when we scope a Chatbots & Agents engagement, the first thing we write down isn’t the model or the framework. It’s the shape. Everything else — the price band, the calendar, the metric on the dashboard — is a consequence of that one word.

If you can already say which of the three you need, you’re most of the way to a scoped engagement. If you can’t, that’s exactly what the first week is for.

Tell us the task, and we’ll tell you which shape it is and what it costs.
