Why Not LLM Pentesting? The Architecture Behind Deterministic AI

Token economics, hallucinations, data sovereignty — five structural reasons LLMs can't replace deterministic agent-based pentesting.

SwarmHack Team · 2026-04-29 · 9 min

Parts 1 and 2 showed an autonomous engagement: 11 findings, 35 crown jewels, 6 minutes, zero humans, bit-for-bit reproducible across 11 runs. The natural follow-up question: *"Couldn't you do all this with GPT-4 / Claude / a fine-tuned model?"*

The honest answer is no — and the reasons aren't about today's models being too small. They're structural.

  • The five inherent limits of LLM-based pentesting
  • Why "AI-Native" ≠ "LLM-driven" at SwarmHack
  • The architecture trade-offs that make autonomous validation actually enterprise-ready

1. Token Economics Don't Scale

A single LLM-based pentest against a moderately complex web app burns 50,000–200,000+ tokens. At current API pricing that's $5–$60 per scan, before any infrastructure cost.

Run 100 apps daily and the *token bill alone* is $500–$6,000 per day, or $15,000–$180,000 over a 30-day month. The bill is also unpredictable, because cost scales with target complexity, response sizes, and chain-of-thought depth.
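
The monthly figure follows directly from the per-scan range. A minimal sketch of that arithmetic, assuming one scan per app per day and a 30-day month (the function name is illustrative, not part of any SwarmHack or vendor API):

```python
def monthly_token_bill(cost_per_scan_usd: float, scans_per_day: int, days: int = 30) -> float:
    """Projected token spend for a fleet that is scanned once per day."""
    return cost_per_scan_usd * scans_per_day * days


# Per-scan range quoted above: $5 to $60.
low = monthly_token_bill(5, 100)
high = monthly_token_bill(60, 100)
print(f"${low:,.0f} to ${high:,.0f} per month")  # $15,000 to $180,000 per month
```

The spread between the two ends is the practical problem: the same fleet can cost twelve times more in one month than another without any change in scan schedule.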

SwarmHack runs as a compiled Rust binary calling 14 standard CLI tools. Cost per scan = CPU + RAM. Running 100 scans costs 100× the infrastructure of running 1 — no API metering, no surprise invoices.

Case study: Anthropic's Claude Mythos Preview on OpenBSD

The clearest public data point on what frontier-LLM pentesting actually costs comes from Anthropic itself. In April 2026 Anthropic announced Claude Mythos Preview — a gated research model that found a 27-year-old SACK bug in OpenBSD's TCP stack, plus zero-days in FFmpeg, Linux, FreeBSD's NFS server, and every major browser.

It's an impressive capability demonstration. It's also a textbook illustration of why this approach doesn't scale to continuous validation:

*"This was the most critical vulnerability we discovered in OpenBSD with Mythos Preview after a thousand runs through our scaffold. Across a thousand runs through our scaffold, the total cost was under $20,000 and found several dozen more findings. While the specific run that found the bug above cost under $50, that number only makes sense with full hindsight. Like any search process, we can't know in advance which run will succeed."*

Unpack that:

  • Under $20 per agent run on average (under $20,000 across 1,000 runs), and the most critical bug only surfaced somewhere in those ~1,000 runs
  • Non-deterministic search — Anthropic explicitly says you can't predict which run wins
  • One target, one project — $20K bought OpenBSD coverage; multiply by every codebase you actually own

Mythos rate card vs. the rest of the Claude family

Mythos is invitation-only and Anthropic has publicly described it as *"very expensive for us to serve, and very expensive for our customers to use."* Published and leaked rate-card data places it well above Opus:

| Model | Input ($/MTok) | Output ($/MTok) | Notes |
| ------- | ---------------: | ----------------: | ------- |
| Claude Haiku 4.5 | $1 | $5 | Cheapest public Claude |
| Claude Sonnet 4.6 | $3 | $15 | Workhorse tier |
| Claude Opus 4.6 | $5 | $25 | Current flagship |
| Claude Mythos (low est.) | $10 | $50 | 2× Opus |
| Claude Mythos (high est.) | $25 | $125 | 5× Opus; matches retired Opus 4.1 |

Sources: Anthropic pricing page and the publicly archived Mythos rate-card analysis.

At Mythos rates, a *single* deep agentic run against a moderately complex target lands in the $20–$200 range, and Anthropic's own account above describes roughly a thousand runs behind its most critical OpenBSD find. Run that pipeline daily across an enterprise estate and you're looking at a six- to seven-figure annual token bill, before any human triage cost.
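
To see how per-token rates turn into a per-run cost, here is a rough sketch built on the rate card above. The token counts are hypothetical placeholders chosen only for illustration, not measurements from Anthropic or SwarmHack; real agentic runs vary widely.

```python
# (input, output) rates in USD per million tokens, copied from the rate card above.
RATES = {
    "Claude Haiku 4.5": (1, 5),
    "Claude Sonnet 4.6": (3, 15),
    "Claude Opus 4.6": (5, 25),
    "Claude Mythos (low est.)": (10, 50),
    "Claude Mythos (high est.)": (25, 125),
}


def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one agent run at the listed per-million-token rates."""
    in_rate, out_rate = RATES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


# ASSUMPTION: hypothetical token counts for one long agentic run.
INPUT_TOKENS = 2_000_000
OUTPUT_TOKENS = 200_000

for model in RATES:
    print(f"{model:<28} ${run_cost(model, INPUT_TOKENS, OUTPUT_TOKENS):,.2f}")
```

Under those assumed counts the two Mythos estimates come out at $30 and $75 per run, inside the $20–$200 range quoted above. Multiplying by the number of runs and targets is what moves the total into six or seven figures.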

By contrast, SwarmHack's Part 1 engagement — 11 findings, 35 crown jewels, 6 minutes, full kill chain — runs on a single CPU. The same scan tomorrow costs the same as today. The same scan against 100 targets costs 100× the CPU minutes, not 100× a $20-per-run lottery.

Mythos is a remarkable research instrument for *deep, one-shot vulnerability hunts on critical open-source*. It is not — and Anthropic is careful not to claim it is — a continuous, reproducible, on-premise pentesting platform. Those are different problems with different cost curves.

2. LLMs Aren't Repeatable

Even with temperature=0, transformer inference can produce tiny floating-point variations between runs, and a small numeric difference is enough to flip a close token choice. With temperature > 0 (necessary for "creative" attack reasoning), the same prompt produces materially different outputs.
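
One commonly cited source of such variation is that floating-point addition is not associative, so summing the same numbers in a different order (as parallel hardware may do) can change the last bits of a result. The snippet below is a plain-Python illustration of that property only; it does not reproduce any model's inference path.

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # add a and b first
right = a + (b + c)  # add b and c first

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```

The difference is around 1e-16, which is harmless in isolation. It matters when two candidate tokens have nearly identical scores and the rounding decides which one wins.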

A SQL injection found on Tuesday may not appear Wednesday. After a developer ships a "fix", how do you verify it landed if the tool finds different things every run?

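SwarmHack's CMDI agent always sends `; echo SWMHK$(expr 7777 + 4242)CK` and always checks for `SWMHK12019CK`. The marker appears only if the server actually evaluated the arithmetic (7777 + 4242 = 12019); a payload that is merely echoed back never contains it. Same target, same configuration → same findings. 11 consecutive runs produced exactly 11 findings and 35 crown jewels.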
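
The marker check can be sketched in a few lines. This is an illustrative Python reconstruction of the idea, not SwarmHack's Rust source; the function names are hypothetical, while the payload and marker strings are the ones quoted above.

```python
from typing import Tuple


def build_probe(a: int = 7777, b: int = 4242,
                prefix: str = "SWMHK", suffix: str = "CK") -> Tuple[str, str]:
    """Return (payload, expected_marker) for an arithmetic-marker probe."""
    payload = f"; echo {prefix}$(expr {a} + {b}){suffix}"
    marker = f"{prefix}{a + b}{suffix}"  # only a shell that ran expr produces this
    return payload, marker


def is_confirmed(response_body: str, marker: str) -> bool:
    """Confirmed only when the evaluated marker is present in the response."""
    return marker in response_body


payload, marker = build_probe()
print(payload)  # ; echo SWMHK$(expr 7777 + 4242)CK
print(marker)   # SWMHK12019CK

# A page that merely reflects the payload does not count as a finding.
print(is_confirmed(f"You searched for: {payload}", marker))  # False
print(is_confirmed("result: SWMHK12019CK", marker))          # True
```

Because both the payload and the check are fixed strings, two runs against an unchanged target make the same decision, which is what makes a retest after a fix meaningful.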

3. LLMs Aren't Consistent

Hallucination is a documented property of transformer generation, not a bug to be patched. In pentesting that means:

  • Vulnerabilities reported that don't exist
  • Severity scores that drift between runs (critical one day, medium the next)
  • Coverage gaps because the attention mechanism "decided" to skip a class of test
  • Evidence that sounds plausible but isn't replayable

SwarmHack runs 32 agents with deterministic logic. Confidence floors are computed, not generated:

| Method | Confidence | Why |
| -------- | ----------- | ----- |
| Marker-based (CMDI) | 0.99 | Arithmetic value reflected by server |
| Tautology heuristic (SQLi) | 0.75 | Response-change observation |
| In-band heuristic (XXE) | 0.60 | File content detected, no unique marker |

A reviewer can audit every score by replaying the payload and reading the response.
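
One way to read "computed, not generated" is as a fixed lookup from detection method to score. A hedged sketch follows, with the enum, class, and field names invented for illustration and the three values taken from the table above:

```python
from dataclasses import dataclass
from enum import Enum


class Method(Enum):
    MARKER = "marker-based"
    TAUTOLOGY = "tautology heuristic"
    IN_BAND = "in-band heuristic"


# Values from the confidence table; the mapping structure itself is an assumption.
CONFIDENCE_FLOOR = {
    Method.MARKER: 0.99,
    Method.TAUTOLOGY: 0.75,
    Method.IN_BAND: 0.60,
}


@dataclass(frozen=True)
class Finding:
    vuln_class: str
    method: Method
    payload: str   # what was sent, so a reviewer can replay it
    evidence: str  # what came back

    @property
    def confidence(self) -> float:
        # The score depends only on how the finding was confirmed.
        return CONFIDENCE_FLOOR[self.method]


finding = Finding("CMDI", Method.MARKER,
                  "; echo SWMHK$(expr 7777 + 4242)CK", "SWMHK12019CK")
print(finding.confidence)  # 0.99
```

Since the score is a function of the detection method alone, the same finding cannot be critical one day and medium the next.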

4. LLMs Aren't Enterprise-Ready

Three blockers, none of which a bigger model fixes:

Data sovereignty. Every LLM-based scan ships target data (HTML pages with credentials, internal IPs, exfiltrated DB rows) to a third-party API. SOC 2, ISO 27001, PCI DSS, HIPAA, and data-residency rules are all reasons your security team will say no.

Air-gapped networks. Financial, government, defense and critical-infrastructure networks run air-gapped. API-hosted LLM tools simply don't function there: every detection step requires outbound HTTPS to the model provider.

Rate limits. Tokens-per-minute and requests-per-minute caps throttle rapid-fire payload testing by design. A 6-minute SwarmHack scan would stretch into hours when every decision is an API round-trip.
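
A back-of-the-envelope lower bound shows how provider caps alone can stretch a scan. Every number in this sketch is hypothetical (call count, tokens per call, and both caps); it illustrates the shape of the calculation, not any provider's published limits.

```python
def min_scan_minutes(calls: int, tokens_per_call: int,
                     rpm_cap: int, tpm_cap: int) -> float:
    """Lower bound on wall-clock minutes when every decision is one API call."""
    by_requests = calls / rpm_cap                  # limited by requests per minute
    by_tokens = calls * tokens_per_call / tpm_cap  # limited by tokens per minute
    return max(by_requests, by_tokens)             # the tighter cap wins


# ASSUMPTION: illustrative values only.
minutes = min_scan_minutes(calls=3_000, tokens_per_call=4_000,
                           rpm_cap=50, tpm_cap=40_000)
print(f"at least {minutes / 60:.1f} hours")  # at least 5.0 hours
```

Network latency and retries come on top of this bound, so the real figure would be higher.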

SwarmHack runs entirely on-premise. The binary executes on the operator's machine; HTTP requests go directly to the target; nothing leaves the network.

5. The Depth Problem

LLMs generate text from training data — they don't execute payloads in a stateful loop. Compare:

*"Generate a SQL injection payload for this form."*

vs.

*"Test 58 payloads across 3 detection methods, fingerprint the database engine, enumerate columns via ORDER BY (11 attempts × 3 variants), execute UNION-based extraction across 8 tables, and harvest credentials."*

The second task requires:

1. State across 150+ HTTP requests within a single agent
2. Adaptation based on intermediate results (error messages choose extraction technique)
3. Cross-agent chaining (CMDI feeds SSH; SSH feeds tunneling; tunneling feeds internal scan)

This is what produced the kill chain in Part 2. It's outside the architectural model of per-prompt LLM execution.
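
To make "state" and "adaptation" concrete, here is a small sketch of ORDER BY column counting against a simulated target. It is not SwarmHack code: the comment-suffix variants, the function names, and the fake target are all assumptions made for the example, and only the 11-attempt, 3-variant shape comes from the text above.

```python
from typing import Callable, Optional

# ASSUMPTION: three hypothetical SQL comment styles standing in for "3 variants".
COMMENT_VARIANTS = ["-- -", "#", "/*"]


def count_columns(send: Callable[[str], bool], max_columns: int = 11) -> Optional[int]:
    """Find the column count by raising ORDER BY n until the target errors.

    send(payload) returns True for a normal response, False for an error.
    """
    for suffix in COMMENT_VARIANTS:
        last_ok = 0
        for n in range(1, max_columns + 1):
            if not send(f"' ORDER BY {n}{suffix}"):
                break          # first failure: the previous n was the last valid one
            last_ok = n        # state carried from one request to the next
        else:
            continue           # never errored within the budget: inconclusive
        if last_ok > 0:
            return last_ok     # this variant worked, so stop trying others
    return None


def fake_target(real_columns: int, working_suffix: str) -> Callable[[str], bool]:
    """Simulated endpoint used only to exercise the loop above."""
    def send(payload: str) -> bool:
        if not payload.endswith(working_suffix):
            return False       # wrong comment style breaks the query
        n = int(payload.split("ORDER BY ")[1][: -len(working_suffix)])
        return n <= real_columns
    return send


print(count_columns(fake_target(4, "#")))  # 4
```

Each request depends on the outcome of the previous one, and a failing variant changes which payloads are sent next. That feedback loop is what a single prompt-and-response exchange does not provide on its own.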

Side-by-Side

| Dimension | LLM Pentesting | SwarmHack |
| ----------- | ---------------- | ----------- |
| Cost per scan | $5–60 (token-based, unpredictable) | Fixed infrastructure cost |
| Reproducibility | Non-deterministic | Same input → same output |
| False-positive rate | High (model "reasons" about vulns) | <1% (evidence-based confirmation) |
| Data sovereignty | Sent to cloud API | 100% on-premise, air-gap ready |
| Compliance | SOC 2 / PCI risk | Full compliance |
| Scan speed | Hours (rate limits) | 6 minutes (local execution) |
| Kill-chain depth | Stateless per-prompt | 7-phase, cross-agent intel |
| Lateral movement | Cannot pivot | Real SSH, real tunnels |
| Evidence | Generated text | Actual HTTP responses |
| Confidence | Model opinion | Calibrated (0.60–1.0) |

So What *Is* the AI in SwarmHack?

SwarmHack is AI-Native — without being LLM-driven:

| Capability | Mechanism |
| ----------- | ----------- |
| Goal planning | GOAP with A* search over kill-chain phases |
| Self-learning (optional) | SONA / ReasoningBank for payload effectiveness over time |
| Pattern recognition | Neural classifiers on response signatures |
| Cross-agent intel | Shared Intelligence Bus with 12 credential regex patterns |
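
For readers unfamiliar with GOAP (goal-oriented action planning), the sketch below runs A* over sets of known facts. The action names, preconditions, and costs are hypothetical and loosely follow the CMDI → SSH → tunnel → internal scan chain described earlier; this is not SwarmHack's planner.

```python
import heapq
from itertools import count
from typing import FrozenSet, List, NamedTuple, Optional


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]  # facts required before the action can run
    add: FrozenSet[str]  # facts established by the action
    cost: int


def plan(start: FrozenSet[str], goal: FrozenSet[str],
         actions: List[Action]) -> Optional[List[str]]:
    """A* over fact sets. The heuristic is the number of missing goal facts,
    which is admissible here because each action adds one fact and costs >= 1."""
    tie = count()  # unique tie-breaker keeps heap ordering deterministic
    frontier = [(len(goal - start), next(tie), 0, start, [])]
    best = {start: 0}
    while frontier:
        _, _, g, state, path = heapq.heappop(frontier)
        if goal <= state:
            return path
        for act in actions:
            if act.pre <= state:
                nxt, ng = state | act.add, g + act.cost
                if ng < best.get(nxt, float("inf")):
                    best[nxt] = ng
                    heapq.heappush(
                        frontier,
                        (ng + len(goal - nxt), next(tie), ng, nxt, path + [act.name]),
                    )
    return None


f = frozenset
ACTIONS = [  # ASSUMPTION: illustrative kill-chain actions and costs
    Action("external_scan", f(), f({"app_mapped"}), 1),
    Action("sqli_extract", f({"app_mapped"}), f({"credentials"}), 3),
    Action("cmdi_exploit", f({"app_mapped"}), f({"credentials"}), 2),
    Action("ssh_login", f({"credentials"}), f({"ssh_session"}), 1),
    Action("open_tunnel", f({"ssh_session"}), f({"tunnel"}), 1),
    Action("internal_scan", f({"tunnel"}), f({"internal_mapped"}), 2),
]
GOAL = f({"internal_mapped"})

result = plan(f(), GOAL, ACTIONS)
print(result)
```

With these costs the planner prefers the cheaper `cmdi_exploit` route to credentials over `sqli_extract`, and the same inputs always yield the same plan.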

LLMs are powerful for code generation and document analysis. Real-time security testing isn't one of those tasks. It needs deterministic execution, auditable evidence, stateful chaining, and on-premise data handling.
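
Stateful chaining between agents can be pictured as a small publish/subscribe bus: one agent extracts credentials from its output with fixed patterns, and another agent reacts to them. The sketch below is an assumption-laden illustration; the two regexes, the class, and the topic name are hypothetical stand-ins, not the 12 patterns or the bus SwarmHack ships.

```python
import re
from collections import defaultdict
from typing import Callable, DefaultDict, List, Tuple

# ASSUMPTION: two example patterns, each with one capture group.
CREDENTIAL_PATTERNS = {
    "password_assignment": re.compile(r"(?i)\bpassword\s*[=:]\s*(\S+)"),
    "aws_access_key_id": re.compile(r"\b(AKIA[0-9A-Z]{16})\b"),
}

Credential = Tuple[str, str]  # (pattern name, captured value)


class IntelBus:
    """Minimal in-process bus: agents publish findings, others subscribe."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, item) -> None:
        for handler in self._subscribers[topic]:
            handler(item)


def harvest(text: str, bus: IntelBus) -> None:
    """Scan agent output for credentials and publish each match."""
    for name, pattern in CREDENTIAL_PATTERNS.items():
        for match in pattern.finditer(text):
            bus.publish("credential", (name, match.group(1)))


bus = IntelBus()
ssh_queue: List[Credential] = []  # stands in for an SSH agent's work queue
bus.subscribe("credential", ssh_queue.append)

harvest("password=hunter2\nkey AKIAABCDEFGHIJKLMNOP", bus)
print(ssh_queue)
```

Because extraction is a fixed set of patterns, a reviewer can check exactly why a given string was treated as a credential.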

Wrapping Up the Series

Across three posts we walked one validated engagement end-to-end:

  • Part 1 — the architecture and the external scan
  • Part 2 — automatic credential correlation, SSH lateral movement, and tunnel pivoting
  • Part 3 (this post) — why this whole stack is intentionally not an LLM

One command. Full kill chain. Reproducible. Auditable. On-premise. That's autonomous penetration testing — not autonomous *prompting*.