Neither ChatGPT nor Gemini—autonomous AI agents have a serious reliability problem, and they have documented it with compelling figures

by Laura M.
July 3, 2025
in Technology

“AI is going to take our jobs” is something we’ve heard endlessly. But are you sure about that? AI agents aren’t that specialized and they fail a lot, so for now we can breathe easy: our jobs are not in danger. That’s what Carnegie Mellon University (CMU) and Duke University have revealed after analyzing how these agents actually behave on real work tasks (for now, at least).

In the best-case scenario, they only manage to complete about one third of the tasks. In the worst case, they don’t even reach a 10% success rate! The future of automation is still looming, but for now, they’ll need to sharpen their aim if they want to get there.

Even though automation is still on the table, this experiment debunks many of the expectations built around the idea that AI will take over everything. Yes, we dream of not having to work, but for now it seems more like fiction than reality.

What are AI agents?

They are programs that act autonomously to carry out complex tasks, without a human constantly supervising the process. Unlike traditional assistants (like Siri or Alexa), which respond to specific commands, one of these agents can make decisions, plan steps, and coordinate multiple actions on its own: everything the tech revolution promises.
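To picture what that means in practice, here is a minimal, purely illustrative sketch of the plan-act-observe loop these agents typically run. The call_llm stub and the search_files tool are hypothetical stand-ins, not any real product’s API.

# Minimal sketch of an agent's plan-act-observe loop (illustrative only).
# call_llm and search_files are hypothetical stand-ins, not a real API.

def call_llm(prompt: str) -> str:
    """Stand-in for a model call; a real agent would query an LLM here."""
    return "ACTION: search_files('quarterly report')"

def search_files(query: str) -> str:
    """Hypothetical tool the agent is allowed to invoke."""
    return f"found 3 files matching '{query}'"

TOOLS = {"search_files": search_files}

def run_agent(goal: str, max_steps: int = 5) -> None:
    history = [f"GOAL: {goal}"]
    for _ in range(max_steps):
        # 1. Plan: ask the model for the next action, given everything so far.
        decision = call_llm("\n".join(history))
        if decision.startswith("DONE"):
            break
        # 2. Act: parse the chosen tool and its argument, then call it.
        name, arg = decision.removeprefix("ACTION: ").rstrip(")").split("('")
        observation = TOOLS[name](arg.rstrip("'"))
        # 3. Observe: feed the result back so the next step can build on it.
        history.append(f"OBSERVATION: {observation}")
    print("\n".join(history))

run_agent("Collect last quarter's reports")

Real agents hand this loop many more tools and far more context, and that, according to the study, is exactly where things start to break down.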

And what’s going on with them?

Maybe they’re not as autonomous as we think… To test this, the researchers created a fictional company called The Agent Company, where AI agents had to use services like GitLab, ownCloud, or Rocket.Chat to do their jobs.

But the results were disastrous…

Disappointment

The researchers used two test environments, OpenHands CodeAct and OWL-Roleplay, and the results were a disaster. The best performer was Claude Sonnet 4, completing 33.1% of tasks. Behind it came Claude 3.7 Sonnet (30.9%) and Gemini 2.5 Pro (30.3%), and much further back GPT-4o (8.6%), Llama-3.1-405b (7.4%), Qwen-2.5-72b (5.7%), and Amazon Nova Pro v1.0 (1.7%). Catastrophic!

The bottom line: sure, the best of them get about 30% of tasks right… but the remaining 70% is constant failure. You can relax, George, they’re not going to take your office job. As of now, no model is ready to handle complex tasks autonomously.

Problems with AI

During the tests, all kinds of errors were recorded: agents that didn’t know how to send a simple message, couldn’t deal with pop-up windows, or made up ridiculous solutions that had nothing to do with the original task! One even changed a username to “simulate” having contacted the right person.

These failures, while sometimes amusing, show a serious lack of contextual understanding and poor execution ability… which casts doubt on their readiness for real responsibilities.

Do they work for anything?

Yes, but with a lot of caveats… They fail a lot, and researchers admit they can be useful for very small tasks, but not for fully replacing human jobs.

The future isn’t here yet

Sure, everything improves over time, but even in repeated tests the results weren’t very encouraging (they went from 24% to 34% success rates… and of course, they still don’t beat human capabilities).

What are the risks?

Assigning delicate tasks to an agent, like sending emails or managing customer relationships, can be a disaster if every step isn’t monitored… That’s why experts recommend applying standards like the Model Context Protocol (MCP) to improve communication between systems and reduce errors.
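For a rough sense of what that standardization looks like, here is a small, non-authoritative sketch of the shape an MCP tool-call request takes (MCP is built on JSON-RPC 2.0); the tool name and arguments are hypothetical, chosen only to illustrate the structure.

import json

# Rough shape of an MCP-style tool-call request (JSON-RPC 2.0).
# The tool name and arguments are hypothetical, for illustration only.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "send_email",  # hypothetical tool exposed by a server
        "arguments": {
            "to": "client@example.com",
            "subject": "Quarterly summary",
        },
    },
}

print(json.dumps(request, indent=2))

The idea is that if every tool speaks the same protocol, the agent has fewer opportunities to improvise, and fewer opportunities to get it wrong.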

AI is not ready yet

A second study by Salesforce tested these agents in CRM contexts. They only reached 58% success on simple tasks, and performance dropped to 35% when tasks required multiple steps. The conclusion: these agents are not prepared or qualified for complex jobs!

Gartner predicts massive cancellations

According to data from consulting firm Gartner, over 40% of AI agent projects will be canceled by the end of 2027. Why? Many are based on hype more than technical feasibility: experiments with no real application, built on a technology that still isn’t ready.

So, AI is still far from replacing us in the most complex tasks. Humans 1 – AI 0! (For now, no hard feelings!)
