
Why AI agents fail in real workflows


Analysis


Mar 24, 2026

19:00

Disruption snapshot


  • AI agents are now deployed in real business tools, but performance breaks down under real conditions. Companies respond by adding retries, validation, and human approval, trading autonomy for consistency.


  • Winners: companies that own the full stack, like Google. Losers: API-heavy ecosystems, like OpenAI deployments, which face integration chaos and unpredictable behavior across external systems.


  • Watch how much human review remains in workflows. A steady drop in manual approval steps signals that agents are becoming trusted enough to run tasks end to end.

A year ago, AI agents looked like slick demos. Today, they’re showing up inside the tools companies actually use.

 

Back in November 2023, OpenAI rolled out Assistants with tool use. Since then, those features have turned into multi-step workflows running inside customer support platforms, internal dashboards, and custom-built apps.

 

Google pushed Gemini into Workspace, so it can read, write, and take action across Gmail, Docs, and Sheets. And Alibaba has been building enterprise agents through Alibaba Cloud, aimed at logistics, finance, and internal operations.

 

This is the shift people have been talking about for years. Agents aren’t something you watch anymore. They’re something you deploy inside real workflows that matter.

 

And that’s exactly when things get messy.

 

In a demo, every agent looks sharp. They plan, use tools, and handle multi-step tasks with ease. But once real data, edge cases, and business pressure hit, the cracks start to show. In some cases, those failures are not just inconvenient but costly, as seen when an AI bot accidentally sent $250,000 in crypto instead of a $500 tip.

 

In production, the same agent can succeed in the morning and fail in the afternoon on the exact same task.

 

Consider a simple workflow that shows up in countless demos. Generate a weekly report from internal data, format it, and send it to a team.

 

In testing, it works.

 

In production, the same task runs into friction immediately. The database query hits a rate limit. A schema changed two days ago and one field no longer exists. The user running the task does not have access to a required table. One API returns data in a slightly different format than expected. The final document has to match an internal template that is not publicly documented.

 

None of these are rare failures. They are normal conditions.

 

A product team at a mid-sized fintech company described this shift after deploying an internal reporting agent. In testing, the agent completed the workflow end to end in over 90 percent of runs. After connecting it to live systems, reliability dropped below 60 percent within the first week. The failures were not catastrophic, but partial. Reports were generated with missing sections. Some runs stalled midway. Others completed with formatting that required manual cleanup.

 

The team did not remove the agent. They added retries, validation steps, and a human approval layer before sending outputs.
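That mitigation pattern, retries with backoff, output validation, and a human approval gate, can be sketched in a few lines. This is a minimal illustration, not the fintech team's actual code; the function names and the report structure are hypothetical.

```python
import time

def run_with_guardrails(agent_step, validate, max_retries=3):
    """Run a hypothetical agent step with retries and validation.

    Nothing is sent automatically: outputs that pass validation are
    only marked as ready for human approval.
    """
    last_error = None
    for attempt in range(max_retries):
        try:
            result = agent_step()
            if validate(result):
                # passed validation -> queue for a human reviewer
                return {"status": "needs_approval", "result": result}
            last_error = "validation failed"
        except Exception as exc:
            last_error = str(exc)
        if attempt < max_retries - 1:
            time.sleep(2 ** attempt)  # back off before retrying
    return {"status": "failed", "error": last_error}
```

Used this way, a run that stalls or produces a report with missing sections surfaces as an explicit failure instead of a half-finished document quietly landing in someone's inbox.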

 

The agent still saved time. It just stopped being autonomous.

 

This pattern is showing up everywhere.

 

Google’s advantage is not that its agents are more capable. The real advantage is that they operate inside a system Google controls.

 

Inside Workspace, identity, permissions, file formats, and core application logic are standardized.


A Gemini agent working across Docs and Sheets is not stitching together external services with unknown behavior.


The agent is operating inside a closed environment where most variables are already defined.

 

That does not eliminate errors, but it changes their shape.

 

Failures are more predictable. Outputs are more consistent. Recovery is easier to design.

 

In production systems, that matters more than peak capability. A system that works the same way every time is more valuable than one that is occasionally brilliant and occasionally unusable.

 

OpenAI has taken the opposite approach. Its agents are not tied to a single environment.

 

OpenAI agents are deployed across thousands of different systems through APIs and developer integrations.


This gives OpenAI unmatched reach.


Its agents can operate inside customer support platforms, internal tools, SaaS products, and consumer applications. It also helps explain why some believe AI agents could finally make OpenAI profitable.

 

It also means every deployment is different.

 

Authentication works one way in one system and another way elsewhere. APIs return inconsistent outputs. Permission boundaries are unclear or poorly documented. Small changes in upstream systems introduce silent failures.
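A common defense against this drift is a thin normalization layer per integration, so the agent only ever sees one internal shape. A minimal sketch, with entirely hypothetical source systems and field names:

```python
def normalize_ticket(raw: dict, source: str) -> dict:
    """Map inconsistent upstream payloads into one internal shape.

    The two "helpdesk" systems here are illustrative: one flat
    payload, one nested, with different keys for the same data.
    """
    if source == "helpdesk_a":
        return {"id": raw["ticket_id"], "body": raw["description"]}
    if source == "helpdesk_b":
        # system B nests the text and uses a different id key
        return {"id": raw["id"], "body": raw["fields"]["text"]}
    raise ValueError(f"unknown source: {source}")
```

The point of the adapter is that when an upstream system changes its schema, the failure happens loudly in one small function rather than silently inside an agent's reasoning.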

 

OpenAI does not control these environments; it absorbs them.

 

This makes the problem fundamentally different. The challenge is not building a smarter agent. It is making that agent behave consistently across systems that were never designed to work together.

 

Alibaba’s position is shaped by its cloud business.

 

It operates closer to the systems where work actually happens, including ERP platforms, logistics infrastructure, and internal enterprise tools.

 

That proximity reduces some integration friction. It gives Alibaba better access to data and execution layers.

 

It does not simplify the environment.

 

Enterprise systems are deeply customized, often built over many years, and full of exceptions. Two companies using the same software can have completely different configurations. Internal processes change faster than documentation.

 

Alibaba’s agents encounter the same instability seen elsewhere. The difference is that they meet it at a deeper layer of the stack.

 

Across teams, a consistent sequence is emerging.

 

An agent completes a full workflow in a controlled environment. It is deployed into production. Reliability drops as soon as it interacts with real systems. Engineers add constraints, validation, and fallback logic. Human review is reintroduced at key steps.

 

The end result is not the original demo, but a narrower, more structured system that works more often.

 

This is what production looks like.

 

Companies do not need more impressive agents. They need agents they can trust.

 

Trust in this context has very little to do with intelligence.


It comes from predictability.


Systems need to behave consistently under changing conditions. They need to fail in ways that are visible and recoverable. Outputs need to be usable without constant verification.

 

These are engineering properties.

 

They depend on stable integrations, well-defined data contracts, clear permission handling, and reliable execution layers. Improvements in model reasoning help, but they do not solve these problems.
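A data contract, in practice, can be as simple as an explicit schema that malformed output cannot silently pass through. A sketch of the idea, with illustrative field names rather than any real system's schema:

```python
from dataclasses import dataclass

@dataclass
class ReportRow:
    """An explicit contract for one row of agent output.

    Field names are illustrative, not from any real system.
    """
    account: str
    revenue: float

def enforce_contract(rows):
    """Coerce rows into the contract, or fail loudly and visibly."""
    checked = []
    for i, row in enumerate(rows):
        try:
            checked.append(ReportRow(account=str(row["account"]),
                                     revenue=float(row["revenue"])))
        except (KeyError, TypeError, ValueError) as exc:
            raise ValueError(f"row {i} violates contract: {exc}") from exc
    return checked
```

This is the "fail in ways that are visible and recoverable" property in miniature: a missing field stops the pipeline with a pointer to the offending row, instead of producing a report that needs manual verification.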

 

The bottleneck has shifted away from the model.

 

The next phase of AI will not be determined by benchmark scores or incremental model upgrades.

 

Google is reducing the problem by controlling the environment. OpenAI is expanding across environments and taking on their complexity. Alibaba is embedding closer to enterprise infrastructure while dealing with its variability.

 

Each approach has advantages. None has solved the core issue.

 

The company that does will not just have better agents. It will own the layer that companies rely on to get work done.
