AI agents in production: the model is no longer the bottleneck

95%[1]

of organizations see no return from GenAI. It isn’t the model, it’s the approach.

150k→2k[4]

tokens on the same workflow, with MCP servers presented as code instead of direct tool calls.

49.7%[6]

of agentic calls sit in one sector. Every other vertical is white space.

In summary

The model is a commodity: the bottleneck is communication with the expert and workflow redesign.
The treasure is what the expert doesn’t mention, not what they ask for. Trust is the technical condition.
Connect the data and own the sources of truth: every well-modeled piece of data is a brick that stays yours.
Redesign the workflow the “ignorant” way, AI-native. The moat is the workflow plus the expertise.
Make quality predictable with evals and the expert’s signature. Own the harness, swap the model.
Start from a deliverable worth weeks and do it in days. AI is a multiplier, not a discount.

No LLM handed me this knowledge. We learned it in the field, with real people and real projects, getting it wrong and building on the mistakes.

The feedback that told me we were on the right track came from a colleague: “with this new AI-native workflow we calmly handle deadlines that used to be structurally impossible”. That sentence is worth more than any benchmark, because it describes the only thing that counts: an agent that isn’t a demo, but something a company actually uses, every day.

The starting point

The model has stopped being the problem

Let’s start with an uncomfortable number. MIT, in its State of AI in Business 2025 report, writes that against 30-40 billion dollars invested in GenAI, 95% of organizations see no return[1]. The sentence that matters is a different one: this gap doesn’t depend on model quality, but on approach. McKinsey gets to the same point by another road: workflow redesign is the lever with the biggest effect on AI’s economic impact[5]. And the MAST study, analyzing over 1,600 traces of multi-agent systems, concludes that the failures are failures of design, not of model capability[2].

Yesterday

The bottleneck was the model

How capable the technology is: model quality, context, reliability.

→ moved

Today

Communication with the domain expert

Transferring years of hard-won expertise in days, with real trust.

Workflow redesign

Rebuilding a human-born process into one designed for agents.

The model is now a powerful commodity. The gap between reaching production and staying a demo isn’t driven by model quality but by approach (MIT), and multi-agent failures come from design, not capability (MAST).

Translated: the model is now an extraordinarily powerful commodity. The bottleneck has moved to two deeply human things. Communication with whoever knows the domain, and the ability to redesign a process born for humans into one meant to work with agents.

That’s why staying current with the state of the art has become a prerequisite, not an advantage. Knowing MCP, skills, and long-horizon agentic flows serves one purpose only: to stop thinking about technology. When implementation is no longer the problem, all attention goes to how the work should be done.

It’s a paradox only in appearance. The more technical you are, the less technology is your job.

The model isn’t the moat. The workflow and the expertise are.

The hard part

The hard part is talking to each other

The domain expert is the most valuable person in the room and, almost always, the one who struggles most to change their mental model. Not because they’re closed off: their standards have settled over years of work, and those standards are their guarantee of quality. Now they have to transfer those same years of expertise in a few days, at the same quality as always, through a different way of working.

Hence the first predictable mistake: the expert asks for the features they think they want, usually to automate what they already see. The real treasure is what they don’t mention, because they’re so used to the old system they can’t imagine it being touched. The parts most in need of AI are often the ones the expert doesn’t even realize they do sub-optimally.

The question I always ask, at the first meeting, is this: what’s the thing that would change your life and that we haven’t even named, because we assume it’s impossible? That’s how you find what a person believes is unfeasible and desperately needs.

Then communication has to be fed with honest feedback from both sides. Trust is the technical condition, not an extra. The two common fears, “I’ll lose my job” and “I’ll work more with tighter deadlines”, both come from AI used badly: it opens a token hole in the budget, pushes people away, and misses the point.

What would change your life that we assume is impossible? That’s where you start.

The foundation

Connecting the data levels the field

When communication holds, the technical work starts with data. a16z put it in black and white: enterprise agents often don’t work for lack of context, because “revenue is a business definition, it isn’t hard-coded in a data warehouse”[3]. Company data lives scattered, and an LLM dropped onto fragmented data hallucinates. In one of our projects on asset data, a model went as far as saying a holding owned 257% of itself. The fault wasn’t the model’s: it was the way the data reached it.
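A check of this kind can run before the data ever reaches the model. The following is a minimal sketch, assuming ownership stakes arrive as (owner, owned, percent) rows merged from several sources. The names `Stake` and `find_ownership_anomalies` and the sample rows are invented for illustration and are not taken from the project described above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Stake:
    owner: str
    owned: str
    percent: float


def find_ownership_anomalies(stakes, tolerance=0.01):
    """Return human-readable problems found in a list of ownership stakes."""
    problems = []
    totals = defaultdict(float)
    for s in stakes:
        # A single stake can never be negative or above 100%.
        if not 0 <= s.percent <= 100:
            problems.append(f"{s.owner} -> {s.owned}: {s.percent}% is out of range")
        totals[s.owned] += s.percent
    # All stakes in the same entity must add up to at most 100%.
    for owned, total in sorted(totals.items()):
        if total > 100 + tolerance:
            problems.append(f"{owned}: stakes add up to {total:.1f}%")
    return problems


# Invented rows: overlapping records from different systems inflate the total.
rows = [
    Stake("HoldCo", "HoldCo", 57.0),
    Stake("HoldCo", "HoldCo", 100.0),
    Stake("HoldCo", "HoldCo", 100.0),
    Stake("HoldCo", "OpCo", 80.0),
]
print(find_ownership_anomalies(rows))  # ['HoldCo: stakes add up to 257.0%']
```

The point of the sketch is where the check sits: in the data layer, so that an impossible number is flagged as a modeling problem instead of surfacing as a confident answer.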

Connecting the sources levels the field, and it isn’t trivial. You have to figure out which data to connect, how to model it so the LLM keeps performing, where to keep the single source of truth (SSoT), and how to transform it to build proprietary datasets over time. This is a compounding competitive advantage: every piece of data you bring in and model well is a brick that stays yours.

There’s a recurring mistake, especially in agencies: experts already use powerful platforms in the usual “human” way and almost never notice that those same vendors have already shipped AI or MCP features that would change everything. I’ve often discovered that the feature we needed was already inside a tool the company had been paying for, for years. Connecting systems the right way is worth more than building a new one the wrong way.

The same goes for the agent’s tools. A tool must be designed for the agent, not as a thin wrapper over an API, and the agent-computer interface deserves the same care as interfaces built for humans. By presenting MCP servers as code the agent can call, instead of as direct tool calls, Anthropic documented a workflow going from 150,000 to 2,000 tokens[4]. It’s a vendor number, and I say so for honesty, but the direction is right.
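To make “designed for the agent” concrete, here is a minimal sketch of a tool with a strict contract, in plain Python with no MCP library involved. Everything in it is hypothetical: the `campaign_report` tool, the `ToolResult` type, the metric names and the row format are assumptions made for illustration. It shows three choices: validated input, error messages that tell the agent how to recover, and a compact answer instead of raw rows.

```python
from dataclasses import dataclass

ALLOWED_METRICS = {"spend", "clicks", "conversions"}


@dataclass(frozen=True)
class ToolResult:
    ok: bool
    content: str  # what the agent reads: short and already summarized


def campaign_report(campaign_id: str, metric: str, rows: list) -> ToolResult:
    """Hypothetical agent-facing tool: one question in, one compact answer out."""
    if metric not in ALLOWED_METRICS:
        # The error says how to recover, not only that the call failed.
        allowed = ", ".join(sorted(ALLOWED_METRICS))
        return ToolResult(False, f"Unknown metric '{metric}'. Use one of: {allowed}.")
    values = [r[metric] for r in rows if r["campaign_id"] == campaign_id]
    if not values:
        return ToolResult(False, f"No rows for campaign '{campaign_id}'.")
    # Return the aggregate instead of dumping raw API rows into the context.
    return ToolResult(True, f"{metric} for {campaign_id}: total={sum(values)}, days={len(values)}")


# Invented sample data standing in for a platform export.
sample = [
    {"campaign_id": "c1", "spend": 10, "clicks": 3, "conversions": 0},
    {"campaign_id": "c1", "spend": 5, "clicks": 1, "conversions": 1},
    {"campaign_id": "c2", "spend": 7, "clicks": 2, "conversions": 0},
]
print(campaign_report("c1", "spend", sample).content)  # spend for c1: total=15, days=2
```

A thin wrapper would have returned all three rows and left the arithmetic, and the tokens, to the model.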

Every piece of data you bring in and model well is a brick that stays yours.

The method

Redesigning the workflow the “ignorant” way

At this point you design the flow. I call my preferred way “ignorant”, with affection: ignore the existing traditional system and try to rebuild it from scratch under one forced constraint, “one single person has to be able to do this, with these AI tools”. AI-native thinking naturally develops an AI-centered workflow that replaces the human-centered one. With the expert’s help, the result is better, faster, and genuinely cheaper.

At Intarget this step has a name we chose to say out loud, in a lecture at Bocconi: Fullstack AI Company. A company rebuilt around AI, not a company that uses AI.

People

Domain expertise, taste, judgment, accountability.

Workflow

Focus groups, feedback loops, shared methods.

Agents

Specialists and orchestrators. The model lives here: one piece, not the system.

Infrastructure

Architecture, data, observability, governance.

The takeaway: the model isn’t the moat, the workflow plus the expertise are. People on top, foundations at the bottom; the model is just one piece of the agent layer.

The same goes for how you build. Personally I no longer write code line by line: I write architecture, constraints, acceptance criteria. AI writes the code, I review, iterate, ship. What used to take months becomes weeks, then days.

A real case, instructive precisely because it failed three times before it worked. At Intarget the strategic knowledge about the most important clients lived in people’s heads: every meeting with a C-level executive took hours of rebuilding context, and when someone left, the memory left with them. Three attempts based on hand-filled PowerPoints had died for the same reason: no update ritual, no owner, an unmaintainable format. The AI-native version flipped the constraint: the Business Partner doesn’t fill in a dossier, they answer questions, and the system keeps a structured knowledge base alive on its own (a company brain fed by the CRM and public sources). The deliverable is no longer a file. It’s a system that feeds itself.

The deliverable is no longer a file. It’s a system that feeds itself.

Quality

The expert’s signature against the token lottery

Here I get to the part that separates serious work from AI slop. A powerful LLM, well guided and with the right data, does almost everything. But what distinguishes a reproducible delivery from winning the token lottery is the signature of an expert who knows more than the model and oversees standards, output, and continuous updating. A brilliant model tells you how things are done based on its training data. The secret sauce only comes from someone who actually does the job.

For this to work, quality has to be made predictable, not just checked at the output. The tool is called an eval: sets of tests that measure the system’s reliability on real scenarios.

The durable asset

The harness

Tools with strict contracts
Evals and verification (pass^k)
Domain context
Observability

Swappable

The model

Claude, GPT, Gemini

↺ swap it, the harness stays

Quality must be made predictable, not just checked at the output. pass^k measures the probability of succeeding across all k independent tries, not a single shot: the difference between a lucky demo and a reliable system.
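A small sketch of how pass^k can be estimated from recorded eval runs, assuming each scenario was run `trials` times and `successes` of those runs passed. The estimator used here, C(successes, k) / C(trials, k), is the chance that k runs drawn without replacement from the recorded ones all passed; the function names and the numbers are illustrative assumptions, not figures from our evals.

```python
from math import comb


def pass_at_1(successes: int, trials: int) -> float:
    """Plain success rate: the number a single lucky demo reports."""
    return successes / trials


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k runs drawn from the recorded trials ALL succeed."""
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    return comb(successes, k) / comb(trials, k)


# Invented numbers: one scenario run 10 times, 9 runs passed.
print(pass_at_1(9, 10))      # 0.9
print(pass_hat_k(9, 10, 5))  # 0.5
```

The same system that looks 90% reliable on a single shot succeeds five times in a row only half the time, which is the gap between a demo and something a team can depend on every day.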

The clearest example of an operational signature is one of our agents that does quality control on ad creatives before they go live.

01 Technical blockers

What would stop the upload: formats, specs, platform requirements.

→

02 Content safety

Brand safety, policy, claims: what can’t go live.

→

03 Quality and effectiveness

The expert’s standard: performance, consistency, creative strength.

→

Output: annotated file + final verdict

The expert’s standard is no longer a manual check bolted on at the end. It’s encoded inside the flow and runs on every execution, the same every time.
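The shape of that flow can be sketched as an ordered list of gates, where the first gate with findings stops the run and returns its annotations. This is a simplified illustration, not the production agent: the `Creative` and `Verdict` types, the allowed sizes, the banned words and the length limit are placeholder rules standing in for the expert’s real standards.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Creative:
    filename: str
    width: int
    height: int
    copy: str


@dataclass
class Verdict:
    approved: bool
    stopped_at: Optional[str]
    notes: List[str] = field(default_factory=list)


def technical_blockers(c: Creative) -> List[str]:
    # Placeholder spec: only two sizes are accepted.
    if (c.width, c.height) not in {(1080, 1080), (1200, 628)}:
        return [f"unsupported size {c.width}x{c.height}"]
    return []


def content_safety(c: Creative) -> List[str]:
    # Placeholder policy: a short list of claims that cannot go live.
    banned = ("guaranteed", "miracle")
    return [f"banned claim: '{w}'" for w in banned if w in c.copy.lower()]


def quality(c: Creative) -> List[str]:
    # Placeholder for the expert's standard.
    return ["copy longer than 90 characters"] if len(c.copy) > 90 else []


STAGES = [
    ("01 technical blockers", technical_blockers),
    ("02 content safety", content_safety),
    ("03 quality and effectiveness", quality),
]


def review(creative: Creative) -> Verdict:
    """Run the gates in order; the first one with findings stops the flow."""
    for name, check in STAGES:
        notes = check(creative)
        if notes:
            return Verdict(False, name, notes)
    return Verdict(True, None, [])


print(review(Creative("banner.png", 1080, 1080, "Guaranteed results in 7 days")))
```

Because the stages are data, adding or tightening a rule is a change to the list, and every execution applies the same standard in the same order.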

There’s a detail I particularly love. When one of our orchestrators produced a presentation by coordinating six specialist agents, every slide carried the signature of the agent that wrote it. Radical transparency: every piece has an owner, and quality can be traced.

Own the harness, swap the model. The durable asset is the environment, not the model.
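One way to picture the split is a harness that owns the domain context and the evals and receives the model as a parameter. The sketch below is an assumption-laden illustration: `Model`, `Harness`, `EvalCase` and the two stub models are hypothetical names, and the stubs stand in for real provider clients, which are not shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol


class Model(Protocol):
    """Anything that turns a prompt into text can be plugged in."""
    def complete(self, prompt: str) -> str: ...


@dataclass
class EvalCase:
    task: str
    passes: Callable[[str], bool]


@dataclass
class Harness:
    domain_context: str
    evals: List[EvalCase]

    def run(self, model: Model, task: str) -> str:
        # The harness, not the model, decides what context is sent.
        return model.complete(f"{self.domain_context}\n\nTask: {task}")

    def score(self, model: Model) -> float:
        # The same eval set is run against whichever model is plugged in.
        passed = sum(case.passes(self.run(model, case.task)) for case in self.evals)
        return passed / len(self.evals)


class StubModelA:
    def complete(self, prompt: str) -> str:
        return "OK"


class StubModelB:
    def complete(self, prompt: str) -> str:
        return "not sure"


harness = Harness(
    domain_context="Reply with the single word OK.",
    evals=[EvalCase("Acknowledge the brief.", lambda out: out.strip() == "OK")],
)
print(harness.score(StubModelA()))  # 1.0
print(harness.score(StubModelB()))  # 0.0
```

Swapping the model is a one-argument change, and the evals say immediately whether the swap kept quality where it was.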

How to start

Start from a deliverable worth weeks, do it in days

Theory becomes practice when you tackle a very specific task. Pick a deliverable that today takes several people, weeks of manual and research work, needs careful validation, and produces high value. Start there with the AI-native flow, the ignorant way, and do it in days. Watch the quality and prepare the evals, so it stays predictably high over time.

To pick the right deliverable, the matrix from the first article in the series comes in handy: you cross how much it matters to the business with how standardized it already is. Top right, high-impact and already standardized processes, is the “start here”.

A case, anonymized, from Intarget’s Innovation Hub: a seventy-page strategic pitch for a big education-sector brief, with competitive analysis, personas, and insights. Normally a team of six to eight people does it over several weeks. Two people built it with a content-strategy orchestrator, saving 50-60% of the time versus the historical baseline. And here’s the important thing: quality went up, not down. The senior’s comment was “insight the team couldn’t have generated on its own”. AI didn’t remove value from the work. It removed the mechanical part.

AI didn’t remove value from the work. It removed the mechanical part.

The return

A multiplier, not a discount on margin

The biggest mistake is treating AI as a way to do the same things a bit faster and shave a few points of margin. Andrej Karpathy puts it well: AI unlocks what you were never able to do, and thinking it can just copy-paste what you already do makes you miss its real value. If you aim only at efficiency, it’s easier for AI to eat your margin than to grow it. McKinsey notes that the organizations getting the most out of AI set goals of growth and innovation, not just savings.

If instead you aim at “impossible before, easy now”, the returns can take off, and the next bottleneck becomes go-to-market. Production stops being the limit, and marketing and sales become it.

Share of agentic tool calls

Software engineering: 49.7%
Education: 1.8% (white space)
Healthcare: white space
Legal: 0.9% (white space)

Half of agentic calls sit in one sector; every other vertical is under 9%. When production stops being the limit, the bottleneck becomes go-to-market: whoever brings an agent into a vertical works on white space. Anthropic data, read by Garry Tan (Y Combinator).

That’s why I come back to the colleague’s sentence, the one about “structurally impossible” deadlines becoming manageable. It doesn’t describe a saving: it describes a threshold that shifts. AI didn’t make consulting less human, it made it less mechanical.

It doesn’t describe a saving. It describes a threshold that shifts.

In summary

In production, not in slides

Putting agents into production, effectively, is less a matter of model and more a matter of method. Stay current with the state of the art, so you can put communication and domain expertise first. Earn the experts’ trust and dig out what they consider impossible. Connect the data and own your sources of truth. Redesign the workflow the ignorant way. Make quality predictable with evals and the signature of those who know. Start from a deliverable that’s worth it, and do it in days.

One chapter deserves an article of its own: how you actually capture AI’s value before bad governance eats it. I’ll tackle that another time. For now the principle we use as a compass is enough: use AI as a system, not as a session.

The AI others leave you in slides, we put into production.

Do you have a deliverable that costs weeks today?

On a call we figure out whether it’s the right candidate and what the first concrete step would be. At Yempik we build custom agents and automations, with a fixed price and code that stays yours. If you’d rather get a sense of costs first, see our pricing.

FAQ

The questions we get asked most

If the model isn’t the problem, does the model still matter?

It matters, but it has become a powerful commodity: Claude, GPT, and Gemini do things that were science fiction two years ago. The difference between a demo and a system in production isn’t the model, it’s the environment around it: connected data, tools with contracts, evals, and an expert’s signature. Own the harness, and you can swap the model whenever you want.

Where do you start to put an agent into production?

From a single deliverable that today takes weeks of manual work, needs careful validation, and is worth a lot to the business. You rebuild it the AI-native way and do it in days, with evals that keep quality predictable over time. You don’t start from technology, you start from the process.

Do people need to be replaced?

No. The domain expert is the most valuable person: it’s their signature that makes quality reproducible. AI removes the mechanical part, not the value. The fears “I’ll lose my job” or “I’ll work more” come from AI used badly; used well, it shifts the threshold of what the team can do.

Is this a Yempik or an Intarget project?

It’s a point of view by Simone Bova. The cases come from his work as an AI Engineer at Intarget; Simone is also a co-founder of Yempik, which builds custom AI agents and automations for companies, from prototype to production. It isn’t a Yempik engagement: it’s the method, told by someone who practices it in the field.

How much does it cost and how long does it take?

It depends on the deliverable, but the logic is “done in days, not weeks”. At Yempik we work with a fixed price and stated timelines, and the source code stays yours. The first step is a call to figure out whether the process is the right candidate.

Transparency note

I wrote this article myself. The method, the cases, and the opinions come from my work as an AI Engineer at Intarget and from Yempik, which I co-founded. It isn’t a Yempik engagement: it’s a point of view. For the writing I got help from Claude on editing, clarity, and layout; the substance is mine, the tool is declared.

Sources

[1] MIT Project NANDA, “The GenAI Divide: State of AI in Business 2025”. nanda.media.mit.edu
[2] Cemri et al., “Why Do Multi-Agent LLM Systems Fail?” (MAST), NeurIPS 2025. arxiv.org
[3] Andreessen Horowitz (a16z), “Your Data Agents Need Context”. a16z.com
[4] Anthropic, “Code execution with MCP: building more efficient AI agents”. www.anthropic.com
[5] McKinsey QuantumBlack, “The State of AI in 2025”. www.mckinsey.com
[6] Anthropic Economic Index (March 2026), on real-world agent use by sector, read by Garry Tan (Y Combinator). www.anthropic.com
