Zendoric

A public CTF to break AI agents: security as a game with a live scoreboard

🕒 Published on Zendoric: July 3, 2026 · 01:20

Declaw has opened a public arena where anyone can try to make an AI agent leak sensitive data or give up secrets through a root shell. The scoreboard is the demonstration: with no defenses, 47% of attempts succeed; with the full set of policies in place, 0%.

By Declaw · July 2, 2026.

Declaw Arena is a modest Show HN (barely a couple of points and no comments when it was posted), but the concept deserves a close read because it goes straight to one of the most urgent fronts of agentic AI: how an agent with access to real data or systems defends itself against someone trying to manipulate it. The proposal is simple and elegant: the same agent, the same isolated environment (a microVM) and the same secret to protect, with only the level of Declaw policies varying; those policies act as a security layer between the user and the model. The user chooses the challenge, from convincing a 'data analyst' to reveal a Social Security number or a credit card number, to starting from a root-privileged shell and stealing an API key or a cloud account's credentials via the metadata endpoint, a vector that directly recalls the notorious 2019 Capital One incident.
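To make the metadata-endpoint vector concrete, here is a minimal sketch of the kind of egress check a policy layer could apply before an agent makes an outbound request. It is not Declaw's implementation: the function name `is_blocked_destination` and the `METADATA_HOSTS` set are hypothetical, and the only facts assumed are that cloud metadata services are commonly served from the link-local address 169.254.169.254 and, on Google Cloud, from the hostname `metadata.google.internal`.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical policy data for illustration; not Declaw's configuration.
METADATA_HOSTS = {"metadata.google.internal"}


def is_blocked_destination(url: str) -> bool:
    """Return True if the URL points at a metadata hostname or a
    link-local/loopback IP address."""
    host = urlparse(url).hostname
    if host is None:
        return True  # fail closed when no host can be parsed
    if host.lower() in METADATA_HOSTS:
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. A real policy would also resolve the name
        # and re-check the resulting addresses; this sketch does not.
        return False
    return addr.is_link_local or addr.is_loopback


if __name__ == "__main__":
    for target in (
        "http://169.254.169.254/latest/meta-data/",
        "https://example.com/report",
    ):
        print(target, "blocked" if is_blocked_destination(target) else "allowed")
```

A check like this only inspects the URL the agent asks for. It says nothing about redirects or DNS tricks, which is one reason blocking egress at the network level is a stronger control than filtering requests in the application.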

The numbers the arena itself displays are the real message. With no defense at all, 47% of attempts achieve their goal; with partial policies (PII redaction plus an injection judge that evaluates each request against the agent's task) the rate drops to 41%; and with Declaw 'at full power' (a judge on every turn, network egress blocked to the model itself, a strict posture) the scoreboard falls to 0% across 63 attempts. It is, in essence, a product announcement turned into a public red-teaming experiment: instead of promising security, Declaw exposes its defenses to being broken and publishes the result in real time.
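As an illustration of the first of those layers, the sketch below shows one simple way PII redaction can work on an agent's output: a pattern for US Social Security numbers in the common `NNN-NN-NNNN` form, and a pattern for card-like digit runs that is only redacted when the digits pass the Luhn checksum. The patterns, the placeholder strings and the function names are assumptions made for this example; the article does not describe how Declaw's redaction is built.

```python
import re

# Illustrative patterns only; real redaction needs far broader coverage.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")  # 13 to 19 digits


def luhn_ok(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact(text: str) -> str:
    """Replace SSN-shaped strings and Luhn-valid card numbers."""
    text = SSN_RE.sub("[REDACTED-SSN]", text)

    def _card(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group(0))
        # Leave digit runs that fail the checksum (order IDs, etc.).
        return "[REDACTED-CARD]" if luhn_ok(digits) else match.group(0)

    return CARD_RE.sub(_card, text)


if __name__ == "__main__":
    print(redact("SSN 123-45-6789, card 4111 1111 1111 1111."))
```

Pattern matching of this kind catches verbatim leaks but not a secret that the attacker asks the agent to spell out, encode or split across turns, which is consistent with the scoreboard moving only from 47% to 41% under partial policies.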

This connects with something we have been pointing out in agent cybersecurity: saturated benchmarks say nothing; what matters is measuring against tasks that truly discriminate between levels of attack and defense capability. A public arena with increasing difficulty levels (no defenses, PII redaction, an injection judge, network blocking) is exactly the kind of granular evidence that lets you distinguish marketing from real security. It is not enough to say 'our agent is secure'; you have to show the scoreboard after dozens of people have tried to bring it down, each in an isolated ten-minute session with no registration required.
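How much a scoreboard reading of 0% actually shows depends on the number of attempts behind it. A short calculation, offered as a sketch under the simplifying assumption that the 63 attempts are independent and share one underlying success probability (which a public arena with attackers of mixed skill does not guarantee), gives the largest true success rate still compatible with seeing zero successes. The function name is ours.

```python
def zero_success_upper_bound(n: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on the true success rate p after observing
    0 successes in n independent attempts.

    Solves (1 - p) ** n = 1 - confidence for p.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


if __name__ == "__main__":
    bound = zero_success_upper_bound(63)
    print(f"95% upper bound after 0/63: {bound:.1%}")
```

Under those assumptions the bound comes out at roughly 4.6%, close to the "rule of three" approximation 3/63. In other words, 0 out of 63 is strong evidence that the full policy set is far better than the undefended 47%, but it is not yet evidence that the attack success rate is literally zero.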

Our reading is that this kind of initiative, however small its traction (and this one is small, an almost anonymous Show HN), points to an underlying trend that does matter: the security of AI agents is professionalizing as an engineering discipline, with defense in depth (redaction, injection judges, network egress control) instead of trusting that the system prompt will suffice. It is the same pattern we have seen in the maturing of agent memory: when a capability's failure modes start to be documented, catalogued and measured, that is a sign it has stopped being a demo trick. Here the short-term risk is concrete and already present (PII leakage, cloud credential theft, exfiltration via shell), not a distant hypothesis about superintelligence. The good news, in line with our underlying thesis, is that the more people attack these systems in public, and the more the defenses that work are documented, the faster the trust infrastructure will mature that makes it possible to delegate real tasks to agents without handing over the keys to the house along the way.

Sources & references