TLDR: Agentic commerce is here, but mostly in narrow, assisted forms. Agents can already help users discover products, compare options, build baskets, and complete some purchases, but reliable autonomous buying is still limited to low-risk, bounded flows.
In a January 2025 livestream, an OpenAI researcher held up a photo of a handwritten grocery list and handed it to the company’s new agent, Operator.
The agent read the list, opened Instacart, built the basket, and booked a delivery slot. A person gave the task, and software went shopping.
That clip did the rounds because it showed the big promise of an agent acting on your behalf. Shopping is one of the obvious tests for agents because it turns intent into action.
This is the first piece in a series on the seven layers of agentic commerce, starting with execution. From a retail perspective, an agent reads what you want, opens a store, fills a basket, and pays. The question for this layer is: what can these agents buy today, and how well do they do it?
The answer has two halves.
Demand signals have climbed fast. AI-driven traffic to US retail sites rose 693.4% year over year during the 2025 holiday season, and AI referrals converted 31% better than other sources. Salesforce estimated that AI and agents influenced 20% of global online holiday sales.
That doesn’t mean agents completed all those purchases, but it shows that shoppers are already bringing AI into the buying journey.
In the UK, the share of shoppers using AI assistants more than doubled from 12% to 28% in a year, with 44% now saying they’d let an agent handle the whole process once they’ve set a budget and brand.
But execution itself is still brittle. On a careful shopping benchmark, the strongest model scored 17.76% against human experts’ 30.02%, and passed safety checks only 35.42% of the time. So there’s still a wide gap between the demo and the daily experience.

What can retail shopping agents buy today?
Execution within agentic commerce has moved from research preview to live product in under a year. Here’s where it stands.
| Tool | What it does now | Agentic depth | Current limits |
| --- | --- | --- | --- |
| ChatGPT Instant Checkout | Lets users buy from eligible Etsy sellers inside ChatGPT, with Shopify support planned | Checkout-native | Single-item purchases, US only |
| Amazon Buy for Me | Lets Amazon’s agent buy selected products from outside brand sites inside the Amazon app | App-mediated checkout | Select US customers, selected brands and products, no promo codes, beta |
| Perplexity Instant Buy | Lets users search for products and buy from merchants directly on Perplexity | Checkout-native | US users only, eligible products only |
| Perplexity Comet | Uses an agentic browser to research products and help with shopping tasks across the web | Browser-driven | Site blocking, checkout friction, safety and reliability issues |
| Operator / ChatGPT agent | Drives a browser to shop across websites | Browser-driven | Slow, gets stuck, sometimes blocked by sites |
| Instacart in ChatGPT | Lets users browse groceries, build a cart, and check out inside ChatGPT | Checkout-native | Grocery only, available through supported retailers and user accounts |
| DoorDash in ChatGPT | Turns recipe ideas into grocery lists and sends users to DoorDash checkout | App-mediated checkout | Grocery only, select users at launch |
On most of these platforms, the agent handles discovery, basket building, and payment, while the merchant keeps fulfilment and returns. For now, the limits are tight: ChatGPT Instant Checkout, for example, handles only single-item purchases.
The fridge that orders its own milk
The fridge that notices low milk and reorders it has been the stock demo of automated shopping for a decade. The shipped version is more modest.
Amazon’s Auto Buy places an order when a price drops below a threshold you set, and Subscribe & Save runs on a fixed schedule. Neither one ‘reasons’ per se; both simply follow a rule you wrote in advance.
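To make the difference between a rule and reasoning concrete, here is a minimal sketch of the two rule types described above. The class names, prices, and dates are hypothetical and chosen for illustration; this is not how Amazon implements either feature.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class PriceRule:
    """Hypothetical rule: order when the price drops below a set threshold."""
    threshold: float

    def should_order(self, current_price: float) -> bool:
        return current_price < self.threshold


@dataclass
class ScheduleRule:
    """Hypothetical rule: reorder once a fixed number of days has passed."""
    interval_days: int
    last_order: date

    def should_order(self, today: date) -> bool:
        return today - self.last_order >= timedelta(days=self.interval_days)


# Illustrative values only.
price_rule = PriceRule(threshold=4.00)
schedule_rule = ScheduleRule(interval_days=30, last_order=date(2025, 1, 1))

print(price_rule.should_order(3.79))                  # True: price is below the threshold
print(price_rule.should_order(4.25))                  # False: price is above the threshold
print(schedule_rule.should_order(date(2025, 1, 31)))  # True: 30 days have passed
```

Each check is a single comparison. Nothing here weighs alternatives, reads a product page, or interprets what the shopper meant, which is the work an agent is expected to do.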
Groceries are the natural first home for execution because they’re repeat purchases. You buy the same milk, eggs, and coffee every week, so the agent has little to get wrong and a clear record to copy.
That’s why Instacart and DoorDash were among the first to wire recipe-to-cart flows into ChatGPT. Repeat purchases give an agent a safe place to start.
UK shoppers have noticed. Nearly a quarter (23%) expect at least 10% of their purchases to be AI-driven within a year, and 46% would let an agent switch brands for a better-value option. The appetite has arrived ahead of the plumbing.
Broken execution
Such demos pull attention because they compress effort. Ask an agent to “find a work laptop under £800 with 16GB of RAM and good battery life,” and it scans more listings than a person would sift through by hand. The promise is search without the slog, and a basket that fills itself.
But it’s early days, so these agents aren’t perfect. Tasks sometimes run slowly, and the agent often gets stuck part-way through. Hand it your card and you’re trusting it not to buy 1,000 pairs of socks instead of 10.
On WebMall, a four-shop comparison test, the strongest agent handled add-to-cart and checkout tasks without trouble but completed under 65% of the harder jobs, like finding the cheapest offer across shops or reading vague requirements. On DeepShop, the top system reached only 20% on hard queries.
The benchmarks don’t all measure the same thing, but they point in the same direction: agents do better with bounded tasks and worse when the purchase requires judgment, substitution, compatibility checks, or safety awareness.
Reliability is climbing: the length of tasks an agent can finish at even odds (a 50% success rate) has been doubling roughly every seven months (perhaps a new Moore’s Law for AI?).
But for now, a model that succeeds nine times in ten and fails unpredictably on the tenth makes a useful assistant and a poor autonomous buyer, because that tenth time could involve a $10,000 purchase mistake.
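To put the seven-month doubling in concrete terms, the sketch below projects how the achievable task length would grow if the trend simply continued. The 60-minute starting point is an assumed figure for illustration, not a measured one, and the function name is hypothetical.

```python
def task_horizon(initial_minutes: float, months_elapsed: float,
                 doubling_months: float = 7.0) -> float:
    """Task length an agent can finish at even odds, assuming steady doubling."""
    return initial_minutes * 2 ** (months_elapsed / doubling_months)


start_minutes = 60.0  # assumed starting horizon, for illustration only

for months in (0, 7, 14, 21, 28):
    hours = task_horizon(start_minutes, months) / 60
    print(f"after {months:2d} months: {hours:.0f} hour(s)")
```

Under those assumptions, the horizon grows sixteenfold in 28 months. The projection says nothing about the odds themselves: each point on the curve is still a task the agent finishes only half the time.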

What agentic commerce execution needs next
A demo runs well on a controlled stage with one cooperative store. But production agents need to run on millions of different storefronts, each with its own buttons, login walls, and checkout quirks.
For an agent to buy reliably across all of them, the stores themselves have to become readable by machines. The agent can’t carry the whole burden alone.
That’s the next layer: infrastructure.