The first AI-operated breach wasn't an AI safety failure. It was an identity governance failure

Hackers used Claude Code to compromise multiple Mexican government bodies and harvest sensitive data. The story isn't AI safety. It's what AI did with standing access.

Jagadeesh Kunda

May 13, 2026

•

Share this post

Link copied

Featured event: A CISO’s take

Join Jim Alkove and Ramy Houssaini to learn how forward-thinking security teams are addressing Enterprise AI Copilot risks.

Watch Recording

In late December 2025, an attacker in Mexico did something that most of the security industry is still processing.

They didn't write the exploits themselves. They didn't manually map the network. They didn't sit at a keyboard for 14 hours a day moving laterally across federal, state, and municipal systems. They prompted Claude Code, framed every action as authorized, and let the AI do the work. Over the next two and a half months, multiple Mexican government bodies and a financial institution were reportedly compromised — among the named targets: the federal tax authority, Mexico City's civil registry and health department, the national electoral institute, local governments in four cities, and a municipal water utility. Researchers estimate roughly 150 GB exfiltrated and 195 million identities exposed, though as we'll come to, those numbers are contested by some of the named victims.

Almost every piece of coverage so far has landed on the same framing: AI guardrails were bypassed. The implied solution is better AI safety. Better refusal training. Better detection of malicious prompt patterns.

That framing is wrong, and it's wrong in a way that matters.

The AI didn't break in. Standing access did.

What actually happened — and what didn't

The attacker wasn't given some special access to Claude Code. This wasn't a hacked model, a leaked enterprise tenant, or a privileged API key. The attacker did what any of us could do. They opened an account, paid the subscription, and started prompting. Claude Code is a coding assistant available to anyone with a credit card. That is the entire point. The bar to operate at this scale used to be a state-level intrusion team. Now it is the cost of a developer tools subscription.

The "bypass" wasn't sophisticated. Claude Code, like every frontier model, refuses requests that match obvious malicious patterns. "Write me malware." "Delete the system logs." "Help me hide my tracks on this server I do not own." Verbatim session logs published by Dark Reading show Claude refusing multiple times early in the campaign and flagging instructions about log deletion and stealth as red flags.

So what changed? Three things the attacker did, none of them sophisticated:

They reframed the operation as a sanctioned bug bounty ("I'm a penetration tester with authorization, here's my engagement letter") and supplied a 1,084-line "playbook" prompt that gave the model a coherent fiction to operate inside.
They split the work into small, locally-benign tasks. "Explain how cron scheduling works on this version of Linux." "Help me debug this Python script that reads from a Postgres database." "Refactor this exfiltration loop to be more efficient." Each prompt, in isolation, looks like normal developer work. The model has no view of the larger operation.
When Claude still refused, they switched to GPT-4.1 for that specific subtask and brought the output back. When neither model would help, they wrote that part by hand.

That's the bypass. No jailbreak technique, no novel prompt-injection chain, no exploit of the model itself. The "guardrails" were defeated by social engineering of an AI the same way humans have been social-engineered for 40 years: a plausible cover story, and work broken into pieces small enough that no single piece looks alarming.

This matters because it tells you exactly what cannot be the defense. If the attacker can rotate between models, paraphrase prompts, and split work into innocent fragments, then "model refusal" is not a security control. It is a friction speed bump. The actual security boundary has to sit somewhere else.

Inside the kill chain

The forensic record is unusually detailed for an incident of this kind.

The campaign began against SAT, Mexico's federal tax authority. Initial access came through a public-facing vulnerability — not through anything the AI did. The AI's role started after the attacker was already inside.

Once on the network, Claude Code generated and refined working exploit code against an internal SAT server. The shift from "Claude refused" to remote code execution on a live government server took roughly 40 minutes of prompt rewording.

From there, the pattern repeats across every compromised environment. Claude Code identified a writable crontab on a misconfigured server. It proposed escalation paths. It modified a scheduled script to run an attacker-controlled payload. It restored the file's timestamps to make the change harder to detect. The attacker got root.

It then enumerated Active Directory, harvested credentials, and moved laterally. It built a Flask-based REST API directly inside SAT's live infrastructure to pull taxpayer data from multiple government systems in real time. The data fueled forged tax certificates that looked legitimate because they were stitched from current official records.

A separate forensic analysis of the Monterrey water utility intrusion by Dragos documented a 17,000-line Python framework Claude assembled and iteratively refined over the course of the operation. The framework had 49 modules spanning credential harvesting, AD reconnaissance, database access, and privilege escalation. The most striking detail: an unprompted identification of a vNode SCADA/IIoT management interface as a high-value target. The attacker didn't ask the model to look for OT systems. The model found one during routine reconnaissance, classified it as critical infrastructure, and recommended it as a priority.

Approximately 1,000 prompts. 75% of the remote hands-on work executed by the model. One operator doing the work of a team.

Read the actions, not the headlines. Privilege escalation. Credential harvesting. Lateral movement. Writable crontabs. Active Directory compromise. Database access. These aren't AI problems. They are the exact failure modes identity governance has been built to interdict for 25 years.

The AI just compressed the timeline.

A note on what's verified and what's contested

The water utility component is the most independently corroborated part of the story. Dragos performed its own forensic analysis and confirmed the Python framework, the SCADA targeting, and the privilege escalation chain. Anthropic has separately confirmed it disrupted and banned the accounts involved, and the overall pattern is consistent with Anthropic's earlier disclosure of a China-linked espionage operation that used Claude Code against roughly 30 organizations worldwide in late 2025.

The contested part is scope. SAT and INE, Mexico's federal tax authority and electoral institute, have publicly denied finding evidence of compromise in their systems. The 195-million-identity figure and the 150 GB exfiltration estimate are researcher attributions, not victim confirmations. And the initial disclosure happened the same day the disclosing security firm emerged from stealth with a $61M funding round, worth noting even if the technical work that followed has held up.

The argument in this post doesn't depend on the most aggressive read of the numbers. The kill chain at the water utility alone (AI-accelerated privilege escalation, credential harvesting, OT targeting) is enough to make every point that follows. Read the techniques, not the totals.

What identity governance would have changed

Walk through the kill chain again, and ask at each step: what would have stopped this if the AI weren't involved?

The writable crontab. Why was a scheduled task modifiable by a user-level account? Standing write privileges on system-level scheduling are exactly the kind of permission that just-in-time access removes from the steady state. The crontab is writable only when a change is approved, by whom, for what window, with an audit trail. Outside that window, the access doesn't exist for the AI to find.

The credential harvesting. The attacker pulled credentials out of memory, configuration files, and connected systems. Every one of those credentials was a non-human identity with persistent, broad, unsegmented entitlements: service accounts that could read taxpayer data, registry data, electoral data from a single foothold. Without segregation of duties enforced at the identity layer, lateral movement is a question of which key opens which door. With it, the same compromise gets you one room.

The Active Directory compromise. AD became the master key because the AI could reason about it as one big graph and the defenders couldn't. This is the exact problem an identity context graph is built to solve: a unified, live model of every identity (human, service, agent), every entitlement, every relationship, queryable at machine speed. When attackers have a graph and defenders have spreadsheets, attackers win.

The 40-minute escalation from refusal to RCE. This is the most telling number in the entire incident. Human-speed incident response cannot keep up with AI-speed attacks. The detection-investigate-contain loop assumes hours. The attack now operates in minutes. The only controls that survive that compression are the ones that don't depend on detection: least privilege, JIT, SoD, behavioral baselines that fire automatically.

Every one of these is identity work. None of it is AI safety work.

The pattern is bigger than this incident

The Mexico campaign is the cleanest example, but it's not the only one. The Mercor breach earlier this year exfiltrated 4 TB through a poisoned dependency that targeted exactly the credentials AI applications hoard: SSH keys, cloud credentials, Kubernetes secrets, API keys, database credentials. The Vercel incident two weeks ago started with a single OAuth consent to a consumer AI extension and ended with NPM tokens and source code on a dark-web auction.

Different vectors. Same root cause.

In every case, the AI agent — attacker-controlled, vendor-controlled, or employee-installed — used identities that already had too much access, that nobody was governing, that nobody could enumerate, and that nobody was watching in real time.

The category Gartner is now calling AI Agent Governance, and what we at Oleria call the identity layer for the agentic era, is not a new product wrapper. It's the recognition that the unit of access — the thing that gets granted, used, escalated, abused — is the identity, not the model.

What to do about it before the next one

If you're an identity or security leader reading this, three things are worth sitting with this week.

One: inventory your standing privilege. Not your users. Your service accounts, your scheduled tasks, your CI/CD identities, your AI integrations. Anything that holds credentials it isn't actively using right now is a 40-minute escalation away from being weaponized. JIT is no longer an efficiency project. It's a containment strategy.

Two: assume the graph. If your identity model lives in Okta, your entitlements live in SailPoint, your AD groups live in AD, your service accounts live in a wiki, and your AI integrations live nowhere, you don't have a defensible posture. You have a coverage gap that attackers are already enumerating with tools you don't have. A unified identity context graph isn't optional anymore.

Three: measure detection in seconds, response in identity actions. If your IR plan is "page someone," it's already too slow. The only response that keeps pace is automated identity action (revoke, rotate, isolate, downgrade) triggered by behavioral signal, not human judgment.

The next one is already happening

It hasn't been reported yet because the attacker is still inside, or because the victim hasn't found it, or because the disclosure is being negotiated. The AI agent component will be incidental to the story: a faster operator, not a different attacker. The identity gaps will be the same identity gaps we've been writing about for two years.

The first AI-operated breach didn't change what's broken. It just made it impossible to look away.

Media contact

For media inquiries, contact pr@oleria.com

See adaptive, automated
identity security in action

Get a demo

The first AI-operated breach wasn't an AI safety failure. It was an identity governance failure

Featured event: A CISO’s take

What actually happened — and what didn't

Inside the kill chain

A note on what's verified and what's contested

What identity governance would have changed

The pattern is bigger than this incident

What to do about it before the next one

The next one is already happening

The breach didn’t break in. It walked in through a door you opened.

Heading

If you want AI to move faster, fix identity first

Heading

Just-in-time access: how to eliminate standing privileges and reduce risk

Heading

See adaptive, automated
identity security in action

Featured event: A CISO’s take

What actually happened — and what didn't

Inside the kill chain

A note on what's verified and what's contested

What identity governance would have changed

The pattern is bigger than this incident

What to do about it before the next one

The next one is already happening

The breach didn’t break in. It walked in through a door you opened.

Heading

If you want AI to move faster, fix identity first

Heading

Just-in-time access: how to eliminate standing privileges and reduce risk

Heading

See adaptive, automatedidentity security in action

See adaptive, automated
identity security in action