TLDR:
- SCOPE is a five-part framework for evaluating agentic AI vendors before contract signing, covering Scope, Cost, Outcome, Performance, and Exposure.
- Unlike traditional software, AI agents act autonomously, making procurement questions around seat counts and uptime SLAs insufficient for assessing real-world risk and value.

Introduction to SCOPE for AI agent procurement
Most enterprise software procurement runs on a settled set of questions. How many seats? What’s the renewal term? Where does the data reside? What’s the uptime SLA?
Those questions worked because the software waited for a human to act, and value scaled with the number of humans using it.
Agentic AI changes the terms. An agent doesn’t wait; it acts, decides, retries, and sometimes improvises, often without a human in the loop for any given step.
The same workflow can cost three pence on one run and three pounds on the next, depending on the choices the agent makes at runtime.
A vendor demo that resolves a support ticket in four seconds tells you almost nothing about what the same agent does across 40,000 tickets in your environment, against your data, under your compliance regime.
The procurement team’s job hasn’t changed, though the questions it needs to ask have. SCOPE is a framework for asking the right ones before you sign: Scope, Cost, Outcome, Performance, Exposure.
Each dimension maps to a specific category of vendor claim, and each carries a failure mode that might surface only after the contract goes live.
Used well, SCOPE gives both sides a shared language, because a vendor confident in their agent will welcome every one of these questions.
S for Scope
What is this agent designed to do?
Begin here, because vendors pitch their broadest claims at the level of scope, and the later problems with cost and performance both originate from it.
A useful principle from AI philosophy, the Floridi Conjecture, holds that breadth of scope and reliability pull against each other, so you can’t maximise both at once.
Picture a cardiologist who has spent twenty years mastering one organ system: deep proficiency in one place means declining proficiency everywhere else.
An AI agent built around one well-defined task can reach high reliability on it (approaching 100% success), while an agent pitched as handling everything across a sprawling workflow produces uncertain output throughout, its certainty spread too thin to hold anywhere.
This bears on a procurement conversation, because vendors tend to sell breadth, since breadth demos well and supports a larger contract.
A claim like “our agent handles your entire customer service operation” sounds like more value than “our agent resolves billing queries under £500 without escalation”.
Yet the narrower offer makes the better purchase, because you can hold it to a measurable standard, whereas the broader one is the cardiologist expected to treat heart disease and paediatric cancer alike.
Three questions help draw the boundary:
- What single task is this agent optimised for, and what does it deliberately exclude? A vendor who can’t draw that line may not have defined the agent tightly enough to make it reliable.
- When a request falls outside the defined task, does the agent escalate to a human, decline, or attempt it regardless? That third behaviour drives both hallucination and runtime cost.
- How does accuracy change as scope widens? Ask for accuracy figures on the narrow core task and on the broad version, since the difference between the two numbers is what you’re buying.
| Scope width | Typical reliability | Procurement implication |
| --- | --- | --- |
| Single defined task (e.g. resolve refund under a set threshold) | High, measurable, contractable | Hold to a clear standard; tie payment to it |
| Bounded domain (e.g. all billing queries) | Moderate, varies by query type | Require per-category accuracy data |
| Open-ended (“handles customer service”) | Low and unpredictable per request | Treat capability claims as unproven until tested |
During the vendor meeting, keep in mind that an AI tool and an AI agent are not the same thing. A tool executes a defined instruction and returns a result, while a human owns the decision; an agent makes the decision and acts on it.
The narrower the scope, the more an agent behaves like a tool (i.e., more deterministically), and the easier it becomes to hold accountable. Vendors often use the two words interchangeably, so the contract should state which one you’re buying.
C for Cost
What will you pay, and when?
Agentic AI pricing has three layers, and vendors tend to quote only the first.
Compute cost
The first layer covers the visible compute cost, the tokens themselves. Most model providers price per token, the chunks of text an agent reads in and writes out, so a single interaction might run a few pence.
But multiplied across an enterprise’s monthly volume, those token costs can reach five or six figures, and they don’t rise in a straight line.
As tasks grow more complex, the agent reasons more, and reasoning consumes tokens faster than task volume alone would predict.
Runtime behaviour cost
The second layer covers runtime behaviour cost, the layer that tends to surprise finance teams. Because an agent decides its own path at runtime, the same task can cost wildly different amounts on different runs.
A retrieval step might call one internal system or seven. A vague user query like “can I get that thing from last week” forces the agent to reconstruct context by pulling transcripts, ranking them, retrieving documents, rephrasing, and only then answering.
Three behaviours drive most of this hidden spend:
- Agents caught in reasoning loops retry and rephrase against an unhelpful tool until something works.
- Agents over-reason on simple requests, running a full investigation on “where’s my order” when a tracking link would do.
- Agents over-use tools, firing a five-step chain to book a meeting and then retrying the whole chain when a single step fails.
A demo won’t surface any of this, while the monthly invoice will.
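To make the runtime spread concrete, here is a minimal sketch of how one task can cost very different amounts on two runs. The per-token prices, the token counts, and the names `Step` and `run_cost_pence` are all illustrative assumptions, not any vendor’s actual rates or API; swap in the figures from the proposal in front of you.

```python
from dataclasses import dataclass

# Hypothetical prices in pence per 1,000 tokens; substitute the vendor's real rates.
PRICE_IN_PER_1K = 0.2
PRICE_OUT_PER_1K = 0.8


@dataclass
class Step:
    """One model call inside an agent run."""
    input_tokens: int
    output_tokens: int


def run_cost_pence(steps: list[Step]) -> float:
    """Sum the token cost of every model call the agent made in one run."""
    return sum(
        s.input_tokens / 1000 * PRICE_IN_PER_1K
        + s.output_tokens / 1000 * PRICE_OUT_PER_1K
        for s in steps
    )


# The same task on two runs: a direct answer, versus seven calls in which
# each retry re-reads a context that has grown by 4,000 tokens.
direct_run = [Step(input_tokens=1_500, output_tokens=300)]
looping_run = [Step(input_tokens=1_500 + 4_000 * i, output_tokens=600) for i in range(7)]

print(f"direct:  {run_cost_pence(direct_run):.2f}p")
print(f"looping: {run_cost_pence(looping_run):.2f}p")
```

Under these made-up numbers the looping run costs roughly forty times the direct one, even though both count as a single completed task. The useful request to a vendor is the real distribution of steps per run, since that distribution drives the bill.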
Operational costs
The third layer covers operational costs that fall outside the model entirely: the cloud and platform bill, plus the cost of any human-in-the-loop review the agent triggers, plus the audit and compliance overhead the deployment creates.
The savings case for AI workflow automation holds up, though it describes a blended figure, so procurement should ask to see the full cost stack rather than the token line alone.
The pricing model the vendor offers determines how much of this risk falls to the buyer.
| Pricing model | What you pay for | Where the risk falls |
| --- | --- | --- |
| Per seat / digital FTE | A fixed price per agent, framed as headcount | Predictable, though it may misvalue actual use |
| Usage-based (per token, query, run) | Metered consumption | On the buyer; spikes are yours unless capped |
| Task-based | Each completed task | On the buyer; you pay even when the task didn’t help |
| Outcome-based | Only results that meet a defined standard | On the vendor in principle; attribution disputes can return it to the buyer |
| Hybrid | A baseline plus overage | Shared; usually the most workable for enterprise |
The most important thing to ask of any usage-based or hybrid contract: what visibility and control do we get?
A responsible vendor provides real-time consumption dashboards, spend forecasting, usage alerts before you reach plan limits, soft caps, and grace periods for short surges.
Where these tools are missing, the buyer carries an uncapped liability, so their presence belongs on the evaluation checklist as much as the headline price. You can’t manage costs you can’t see.
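As a rough illustration of what usage alerts, soft caps, and a grace allowance amount to in practice, the sketch below classifies month-to-date spend against a plan limit. The thresholds (alert at 80% of the limit, grace up to 110%), the hard stop beyond the grace allowance, and the names `SpendStatus` and `spend_status` are assumptions made for the example; the real values belong in the contract.

```python
from enum import Enum


class SpendStatus(Enum):
    OK = "ok"
    ALERT = "alert"          # approaching the plan limit
    SOFT_CAP = "soft_cap"    # over the limit but inside the grace allowance
    HARD_STOP = "hard_stop"  # grace allowance exhausted (an assumed behaviour)


def spend_status(spend: float, plan_limit: float,
                 alert_ratio: float = 0.8, grace_ratio: float = 1.1) -> SpendStatus:
    """Classify month-to-date spend against a plan limit.

    alert_ratio and grace_ratio are illustrative defaults, not vendor terms.
    """
    if spend < alert_ratio * plan_limit:
        return SpendStatus.OK
    if spend < plan_limit:
        return SpendStatus.ALERT
    if spend <= grace_ratio * plan_limit:
        return SpendStatus.SOFT_CAP
    return SpendStatus.HARD_STOP


# Example against a hypothetical £10,000 monthly plan.
for spend in (5_000, 8_500, 10_500, 12_000):
    print(spend, spend_status(spend, plan_limit=10_000).value)
```

The logic is trivial, which is the point: if a vendor can’t show something equivalent running against live consumption data, the cap exists only on paper.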
O for Outcome
How is success defined, measured, and verified?
The industry holds up outcome-based pricing as the fairest model, because it ties what you pay to what you get.
Intercom’s support agent, Fin, charges per resolved conversation (what they call Outcomes) rather than per reply, which aligns the vendor’s revenue with the customer’s result.
The idea looks clean on paper, though it’s the hardest model to implement well, with the entire difficulty resting on how the definitions are drawn.
Two questions decide whether outcome-based pricing protects the buyer or exposes them:
What counts as the outcome?
“Resolved” can mean the ticket was marked resolved, or it can mean the customer’s problem was solved to their satisfaction, measured by a CSAT score above a threshold, and those are two different products at one price.
For a sales agent, does “qualified lead” mean a meeting booked or a meeting attended, and if the prospect no-shows, who absorbs that?
Pin the definition down in writing, with the measurement method attached, before the contract is signed.
Who gets credit?
Most enterprise outcomes involve several touches. An agent sends a strong opening email, a human rep follows up by phone, a demo happens, and weeks later the deal closes; the agent’s opening email contributed to the close, and so did every human step after it.
Outcome-based pricing therefore needs an attribution agreement reached upfront, so the question doesn’t return for relitigation every billing cycle.
Two mechanisms make this workable:
- An attribution window, where the outcome counts if it occurs within a set number of days of the agent’s action (similar to how paid ads work)
- An ownership threshold, where the agent must own a defined share of the process to qualify for payment.
Without one or both, every invoice becomes a negotiation.
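The two mechanisms can be written down as a single billing rule. This is a sketch under stated assumptions: the 30-day window, the 50% ownership threshold, and measuring ownership as the agent’s share of process steps are all placeholders, and `is_billable` is a hypothetical helper rather than anything a vendor ships.

```python
from datetime import datetime, timedelta


def is_billable(agent_action_at: datetime, outcome_at: datetime,
                agent_steps: int, total_steps: int,
                window_days: int = 30, min_ownership: float = 0.5) -> bool:
    """Apply both tests: the outcome falls inside the attribution window
    AND the agent owned a large enough share of the process."""
    if total_steps <= 0:
        return False
    elapsed = outcome_at - agent_action_at
    in_window = timedelta(0) <= elapsed <= timedelta(days=window_days)
    owns_enough = agent_steps / total_steps >= min_ownership
    return in_window and owns_enough


# The agent sent the opening email on 1 March; the deal closed on 20 March.
sent = datetime(2025, 3, 1)
closed = datetime(2025, 3, 20)

# The agent handled 1 of 4 steps: inside the window, but below the ownership threshold.
print(is_billable(sent, closed, agent_steps=1, total_steps=4))
# The agent handled 3 of 4 steps: both tests pass.
print(is_billable(sent, closed, agent_steps=3, total_steps=4))
```

Whatever the agreed numbers turn out to be, a rule this explicit can be checked against the audit trail line by line.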
Verification comes next: whatever the outcome and attribution rules, both sides need an audit trail of transcripts, timestamps, and system activity logs, the evidence that confirms what the agent did and what followed.
Put the requirement in the contract, and a vendor confident in their outcomes will already have the dashboards built.
Your broader enterprise AI governance posture pays off here, because the same reporting discipline behind a structured AI audit report carries straight over to verifying an outcome-based invoice.
P for Performance
Can they prove it works at your scale, in your environment?
Demo performance and production performance produce different numbers, and the distance between them accounts for most failed AI deployments.
Non-determinism explains the difference. A traditional software function given the same input returns the same output every time, which makes performance easy to predict.
An agent given the same input may take a different path on each run, because it decides at runtime based on dynamic inputs. That same property makes agents useful and makes their performance hard to forecast.
A demo shows one hand-picked run on clean data, while production means forty thousand runs across messy data, ambiguous queries, and the edge cases a demo avoids.
The principle that AI scales your mistakes as fast as your wins applies directly here: an agent that errs 5% of the time in a demo will, at best, err 5% of the time across the full volume, which at 40,000 tickets means 2,000 errors.
Three requirements separate a believable performance claim from a hopeful one:
- Require accuracy figures on a workload that resembles yours rather than the vendor’s reference case, since volume, data quality, and query mix all move the number.
- Require latency distributions rather than averages, because an average response time conceals the long tail, and that long tail produces the abandoned interactions and the mid-chain timeouts.
- Ask for failure rates and failure modes together, because a clean escalation to a human counts as a manageable failure, whereas a confident wrong answer counts as a costly one.
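The second requirement is easy to demonstrate. The sketch below uses invented response times (95 fast requests and 5 that stall) to show how an average can look acceptable while the tail does not; the numbers are purely illustrative.

```python
import statistics

# Hypothetical response times in seconds: most requests are fast, a few stall.
latencies = [2.0] * 95 + [40.0] * 5

mean = statistics.fmean(latencies)
# quantiles(n=100) returns 99 cut points; cuts[k - 1] is the k-th percentile.
cuts = statistics.quantiles(latencies, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"mean={mean:.1f}s  p50={p50:.1f}s  p95={p95:.1f}s  p99={p99:.1f}s")
```

Here the mean sits just under four seconds and the median at two, yet one request in twenty takes around forty. Asking for p50, p95, and p99 on a workload like yours exposes that tail; a single average hides it.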
A second performance trap is specific to agentic systems, and it comes from Amdahl’s Law applied to AI agents. The law, originally about parallel computing, says that adding more processors to a task yields less and less benefit, because some part of every task is sequential and can’t be divided.
Picture a factory floor: ten workers assembling a product in parallel speed up output significantly, but if every finished unit still passes through one quality inspector before it ships, the inspector becomes the constraint. Adding an eleventh worker won’t change much.
Applied to agents, with human attention as the bottleneck, the practical finding is that two or three coordinated agents produce real speedup, while ten or more produce diminishing returns and, once the cost of running and coordinating them is counted, negative ROI.
An agent that performs well alone can degrade once a vendor chains it into a larger orchestration. So when a vendor proposes a fifteen-agent system, ask them to show it outperforms a tighter three-agent design, since the maths suggests it often won’t.
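Amdahl’s Law gives the speedup from n workers as 1 / ((1 − p) + p / n), where p is the fraction of the work that can be divided. The sketch below applies it to agents under one loudly stated assumption: that 60% of the workflow can be split across agents and the remaining 40% queues behind a single human reviewer. That split is invented for illustration; the real figure depends on your process.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's Law: speedup from `workers` when only `parallel_fraction`
    of the work can be divided among them."""
    if not 0.0 <= parallel_fraction <= 1.0 or workers < 1:
        raise ValueError("parallel_fraction must be in [0, 1] and workers >= 1")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


# Assumption: 60% of the workflow is divisible; 40% waits on one human reviewer.
P = 0.6
for n in (1, 3, 10, 15):
    s = amdahl_speedup(P, n)
    print(f"{n:>2} agents: speedup {s:.2f}x, per agent {s / n:.2f}x")
```

With these inputs, three agents deliver about a 1.7x speedup and fifteen deliver about 2.3x, against a ceiling of 2.5x however many agents are added. The law itself only predicts diminishing returns; returns turn negative once each extra agent’s running and coordination cost exceeds the sliver of speedup it adds.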
A cheaper, narrower system that performs to scope outperforms an expensive, sprawling one running at 60%. This brings our framework to its central point: the test is performance to scope, with price as a secondary consideration.
E for Exposure
What are you liable for when it goes wrong?
The other four dimensions assume the agent performs. This one assumes that on some request, eventually, it won’t, and asks who carries the consequence when that happens. Exposure comes in four kinds (financial, operational, reputational, and regulatory), and a complete contract addresses all four.
Financial exposure
Financial exposure covers the unexpected bill: the usage spike, the runaway reasoning loop, the month the agent’s behaviour costs three times the forecast. Capped pricing, real-time alerts, and soft limits control it, and a well-handled cost section covers most of it already.
Operational exposure
Operational exposure covers the agent failing inside a workflow it was trusted to run: the hallucinated answer, the missed escalation, or the action taken on a false premise.
An agent that wrongly classifies a customer as a VIP and promises a refund outside policy creates a real cost, whether you honour the promise (revenue lost) or refuse it (angry customer).
Human-in-the-loop design therefore belongs in the procurement conversation as much as the implementation one, so the relevant questions become where the checkpoints are, what triggers them, and who owns the decision when the agent steps back.
Reputational exposure
Reputational exposure takes operational failure into public view, where a customer-facing agent that errs visibly does brand damage no SLA credit can repair. The question to settle, then, is what the agent may do unsupervised in a public channel, with a conservative answer the safer default.
Regulatory exposure
Regulatory exposure tends to be the most underpriced of the four, and for a UK enterprise it carries statutory teeth. An agent processing personal data falls under UK GDPR; an agent making or materially influencing decisions about people may come within the scope of the EU AI Act if you operate in or serve the EU; and sector regulators add their own requirements on top.
Governance frameworks warrant close review: the discipline of enterprise AI governance, the specific risk of shadow AI entering through unsanctioned vendor tools, and the ISO/IEC 42001 certification standard all bear directly on what you’re signing up to.
Procurement should ask the vendor to state, in writing, which regulatory regimes they’ve designed for and which they leave to the buyer.
The contract should address, at minimum, SLAs defined against outcomes rather than uptime alone, liability caps and allocations for errors driven by hallucination, mandatory audit logging, and data residency and processing terms.
When an agent acts on the buyer’s behalf and gets it wrong, the consequence belongs to the buyer unless the contract allocates it elsewhere, so the contract deserves a reading that assumes occasional failure rather than uninterrupted success.

Using SCOPE in practice for agentic AI procurement
Take two vendors selling a customer service agent, evaluated side by side.
Vendor A offers usage-based pricing with a full real-time dashboard, publishes accuracy figures on a workload close to yours, and prices low, which makes the offer strong on Cost and Performance.
Pressed on Outcome, though, Vendor A defines “resolved” as “marked resolved by the agent,” with no satisfaction measure, and on Exposure offers no liability allocation for out-of-policy actions while leaving all regulatory compliance to the buyer.
So two of the five dimensions fail, and both failures might only surface months into the deployment.
Vendor B prices higher on an outcome basis, defines “resolved” as a CSAT threshold with a clear attribution window, provides audit logging by default, and states which regulatory regimes it has built for.
Vendor B looks weaker on raw cost, and its agent runs a narrower scope, though that narrow scope is the reason the vendor can stand behind the outcome definition and the liability terms.
The cheaper agent makes the more expensive purchase once the system goes live, because Vendor A’s unresolved dimensions convert into the buyer’s costs over time.
The SCOPE judgment comes down to one comparison: a narrow agent that performs to a standard the buyer can hold, against a broad agent that runs at 60% and returns the consequences to the buyer.
Agentic AI procurement tool
Weight the dimensions to your own risk appetite: a regulated financial institution should weight Exposure and Outcome heavily, while an internal-only productivity deployment can weight Cost and Performance and treat reputational exposure as minor.
Five questions, one per dimension, help set those weights:
- Scope: How tightly defined must the agent’s task boundary be for your use case?
- Cost: How sensitive is your organisation to unpredictable or variable billing?
- Outcome: How important is it that the vendor’s payment ties directly to verified results?
- Performance: How critical is proven, production-grade performance at your scale?
- Exposure: How much regulatory, reputational, or operational liability does your organisation carry?
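Weighting the dimensions to your risk appetite can be reduced to a weighted average. The sketch below is one way to do it, not a prescribed method: the 0 to 5 scores loosely follow the Vendor A and Vendor B example earlier in this guide, and both the scores and the two weight profiles are invented for illustration.

```python
DIMENSIONS = ("scope", "cost", "outcome", "performance", "exposure")


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-dimension scores (0-5), normalised by total weight."""
    total_weight = sum(weights[d] for d in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight


# Hypothetical 0-5 scores for the two vendors.
vendor_a = {"scope": 2, "cost": 5, "outcome": 1, "performance": 4, "exposure": 1}
vendor_b = {"scope": 4, "cost": 2, "outcome": 4, "performance": 3, "exposure": 4}

# Hypothetical weight profiles: a regulated buyer versus an internal-only deployment.
regulated = {"scope": 2, "cost": 1, "outcome": 3, "performance": 2, "exposure": 4}
internal = {"scope": 1, "cost": 4, "outcome": 1, "performance": 4, "exposure": 1}

for name, weights in (("regulated", regulated), ("internal", internal)):
    a = weighted_score(vendor_a, weights)
    b = weighted_score(vendor_b, weights)
    print(f"{name}: vendor A {a:.2f}, vendor B {b:.2f}")
```

With these made-up inputs the ranking flips between the two profiles: the regulated buyer favours Vendor B, the internal-only deployment favours Vendor A. The scores only structure the conversation; they don’t replace the contract review.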
A shorthand version to take into vendor meetings:
| Dimension | The core question | A failing answer reads like |
| --- | --- | --- |
| Scope | What single task is this optimised for, and what does it exclude? | “It handles everything” |
| Cost | What’s the full cost stack, and what real-time visibility do we get? | A token quote and no dashboard |
| Outcome | How is success defined, attributed, and verified, in writing? | “Resolved means resolved” |
| Performance | Can you prove it at our scale, on our data, with failure modes? | A single clean demo run |
| Exposure | Who carries the financial, operational, reputational, and regulatory consequence? | Silence on liability and compliance |
Where this leaves you on AI agent procurement
SCOPE doesn’t make the buying decision on the team’s behalf. Rather, it gives the vendor conversation a clearer structure, replacing the settled SaaS questions that no longer fit agentic systems with five that do.
A vendor building reliable agents can answer all five comfortably, because the same questions guided how they built the product, so the framework functions as common ground between buyer and vendor rather than a test imposed on one side.
This guide draws on my AI consulting and procurement advisory work. For teams running a live vendor evaluation, a guided SCOPE assessment maps each dimension against specific vendor proposals and contract terms. Get in touch.