Prancer Blog / SwarmHack Network Pentest

Under the Hood — The Architecture Behind Autonomous Network Pentesting

GOAP planning, the Intelligence Bus, deterministic agents, an SSH-ControlMaster Metasploit replacement, and five structural reasons LLMs can't do this job.

SwarmHack Team · 2026-05-06 · 11 min

TL;DR

- How 32 deterministic agents are orchestrated by a GOAP planner instead of an LLM
- The Intelligence Bus that lets agents share credentials and topology in real time
- A Metasploit replacement for session handling, built on SSH ControlMaster, with ~0 MB overhead
- Five structural reasons LLM-based pentesting does not work at enterprise scale

This is Part 3 of 3. In Part 1 we built a segmented Docker lab; in Part 2 we ran the full kill chain with one command. Now we open the engine.

Series:

- Part 1 — Build the lab

- Part 2 — Run the engagement

- Part 3 — Under the hood (you are here)

1. Agent orchestration: GOAP, not prompts

SwarmHack registers 32 vulnerability types across two agent families:

- 24 web-layer agents: SQLi, XSS, CSRF, CMDI, SSRF, LFI, SSTI, XXE, JWT, CORS, deserialization, HTTP smuggling, file upload, open redirect, IDOR, session fixation, …
- 8 network-layer agents: SSH, database (MySQL/Postgres/MSSQL), Redis, SNMP, DNS, SMTP, FTP, privilege escalation

A GOAP (Goal-Oriented Action Planning) engine uses A\* search to order them by kill-chain phase:

Phase 1: Reconnaissance     → WebCrawler (sequential, 90s timeout, max 100 pages)
Phase 2: Discovery          → All exploit agents launched in parallel
Phase 3: Exploitation       → Deep exploitation on confirmed vulnerabilities
Phase 4: Post-Exploitation  → SSH/DB playbooks, credential harvesting, pivoting
Phase 5: Reporting          → OCSF 1.1.0 JSON with crown jewel mapping
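To make the planning step concrete, here is a minimal GOAP sketch using only the Rust standard library. It is not SwarmHack's planner: the fact flags, the `Action` struct, the costs, and `kill_chain_actions()` are all hypothetical names chosen for illustration. It only shows how A\* over a small world state can recover the phase order from preconditions and effects.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashSet};

// World-state facts as bit flags (hypothetical, one per kill-chain phase).
pub const RECON_DONE: u8 = 1 << 0;
pub const VULNS_FOUND: u8 = 1 << 1;
pub const EXPLOITED: u8 = 1 << 2;
pub const POST_EXPLOITED: u8 = 1 << 3;
pub const REPORTED: u8 = 1 << 4;

#[derive(Clone, Copy)]
pub struct Action {
    pub name: &'static str,
    pub pre: u8,   // facts that must already hold
    pub add: u8,   // facts this action makes true
    pub cost: u32, // relative cost, assumed >= 1
}

// Illustrative action set; names and costs are assumptions.
pub fn kill_chain_actions() -> Vec<Action> {
    vec![
        Action { name: "crawl", pre: 0, add: RECON_DONE, cost: 1 },
        Action { name: "discover", pre: RECON_DONE, add: VULNS_FOUND, cost: 2 },
        Action { name: "exploit", pre: VULNS_FOUND, add: EXPLOITED, cost: 3 },
        Action { name: "post_exploit", pre: EXPLOITED, add: POST_EXPLOITED, cost: 3 },
        Action { name: "report", pre: POST_EXPLOITED, add: REPORTED, cost: 1 },
    ]
}

/// A* over world states. Returns the cheapest action sequence reaching `goal`.
pub fn plan(start: u8, goal: u8, actions: &[Action]) -> Option<Vec<&'static str>> {
    // Heuristic: number of goal facts still missing. Admissible here because
    // every action costs at least 1 and adds at most one goal fact.
    let h = |s: u8| (goal & !s).count_ones();
    let mut open = BinaryHeap::new();
    let mut closed: HashSet<u8> = HashSet::new();
    open.push(Reverse((h(start), 0u32, start, Vec::<&'static str>::new())));
    while let Some(Reverse((_f, g, state, path))) = open.pop() {
        if (state & goal) == goal {
            return Some(path);
        }
        if !closed.insert(state) {
            continue; // already expanded via a path that was at least as cheap
        }
        for a in actions {
            let next = state | a.add;
            if (state & a.pre) == a.pre && next != state {
                let mut next_path = path.clone();
                next_path.push(a.name);
                let next_g = g + a.cost;
                open.push(Reverse((next_g + h(next), next_g, next, next_path)));
            }
        }
    }
    None
}
```

Calling `plan(0, REPORTED, &kill_chain_actions())` returns the five actions in kill-chain order even if the action list is shuffled, because the order comes from preconditions rather than from the list.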

Inside a phase, agents run in parallel via tokio::task::JoinSet, gated by a 25-slot semaphore. A GlobalRateLimiter caps the swarm at 50 concurrent HTTP requests and 50 req/s so that parallel execution doesn't trip WAF rules.
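The req/s half of that cap can be pictured as a token bucket. The sketch below is an assumption about the general technique, not the GlobalRateLimiter source: `TokenBucket`, `try_acquire`, and the explicit `now_ms` argument are hypothetical, and the concurrency semaphore is left out. Time is passed in as milliseconds so the behaviour stays deterministic.

```rust
/// Token bucket with integer math: tokens are tracked in thousandths.
pub struct TokenBucket {
    capacity_milli: u64, // bucket size, in milli-tokens
    tokens_milli: u64,   // current fill, in milli-tokens
    rate_per_sec: u64,   // refill rate, tokens per second
    last_ms: u64,        // timestamp of the last refill
}

impl TokenBucket {
    /// Starts full, with the clock at 0 ms.
    pub fn new(capacity: u64, rate_per_sec: u64) -> Self {
        TokenBucket {
            capacity_milli: capacity * 1000,
            tokens_milli: capacity * 1000,
            rate_per_sec,
            last_ms: 0,
        }
    }

    /// Refills for the elapsed time, then tries to take one token.
    pub fn try_acquire(&mut self, now_ms: u64) -> bool {
        let elapsed_ms = now_ms.saturating_sub(self.last_ms);
        // rate (tokens/s) * elapsed (ms) is exactly the refill in milli-tokens.
        let refill = elapsed_ms.saturating_mul(self.rate_per_sec);
        self.tokens_milli = self
            .tokens_milli
            .saturating_add(refill)
            .min(self.capacity_milli);
        self.last_ms = self.last_ms.max(now_ms);
        if self.tokens_milli >= 1000 {
            self.tokens_milli -= 1000;
            true
        } else {
            false
        }
    }
}
```

With `TokenBucket::new(50, 50)`, a burst of 60 calls at the same instant lets 50 through, and 100 ms later exactly five more tokens are available.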

The critical design rule: each agent stops on first confirmed exploit, then goes deep instead of wide. That's the entire difference between a pentest and a vulnerability scan.

2. The Intelligence Bus — how findings cross agent boundaries

The reason a single command can chain web → SSH → tunnel → internal scan is the Intelligence Bus: a shared pub/sub bus where every agent's discoveries become every other agent's input.

When the CMDI agent reads .env and finds:

SSH_USER=pentest
SSH_PASS=pentest123

…the bus immediately broadcasts a credential event tagged protocol=ssh. The SSH lateral-movement agent is already subscribed and starts attempting connection — no orchestrator polling, no human in the loop.
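A rough way to picture that hand-off, using only standard-library channels: the sketch below is not the Intelligence Bus itself, and `IntelBus`, `CredentialEvent`, `subscribe`, and `publish` are hypothetical names. It shows the one property that matters here, which is that a subscriber receives the event without anyone polling for it.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

#[derive(Clone, Debug, PartialEq)]
pub struct CredentialEvent {
    pub protocol: String,
    pub user: String,
    pub secret: String,
}

/// Topic-keyed pub/sub: one list of senders per protocol tag.
#[derive(Default)]
pub struct IntelBus {
    subscribers: HashMap<String, Vec<Sender<CredentialEvent>>>,
}

impl IntelBus {
    /// Registers interest in one protocol and returns the receiving end.
    pub fn subscribe(&mut self, protocol: &str) -> Receiver<CredentialEvent> {
        let (tx, rx) = channel();
        self.subscribers
            .entry(protocol.to_string())
            .or_default()
            .push(tx);
        rx
    }

    /// Delivers the event to every live subscriber of its protocol.
    /// Returns how many subscribers received it.
    pub fn publish(&mut self, event: CredentialEvent) -> usize {
        match self.subscribers.get_mut(&event.protocol) {
            Some(subs) => {
                // A failed send means the receiver was dropped; forget it.
                subs.retain(|tx| tx.send(event.clone()).is_ok());
                subs.len()
            }
            None => 0,
        }
    }
}
```

In this model the SSH agent would hold the receiver for `"ssh"` and block on it in its own task, so the credential found by the CMDI agent arrives as soon as it is published.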

Twelve regex patterns extract credentials from response bodies, .env files, and structured output. The patterns are deterministic — same input bytes, same matches, every time.
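The twelve patterns themselves are not listed in this post, so the sketch below is a stand-in rather than a copy of them. It uses plain string parsing (the standard library has no regex engine) to pair `<PREFIX>_USER` with `<PREFIX>_PASS` lines from a `.env` file. The function name `extract_credentials` and the pairing rule are assumptions; the determinism property is the point.

```rust
use std::collections::BTreeMap;

/// Returns (protocol, user, password) triples, sorted by protocol.
/// Pairs KEY=VALUE lines whose keys end in _USER and _PASS.
pub fn extract_credentials(env_text: &str) -> Vec<(String, String, String)> {
    let mut users: BTreeMap<String, String> = BTreeMap::new();
    let mut passes: BTreeMap<String, String> = BTreeMap::new();
    for line in env_text.lines() {
        let line = line.trim();
        if line.starts_with('#') {
            continue; // skip comments
        }
        if let Some((key, value)) = line.split_once('=') {
            let (key, value) = (key.trim(), value.trim());
            if let Some(prefix) = key.strip_suffix("_USER") {
                users.insert(prefix.to_lowercase(), value.to_string());
            } else if let Some(prefix) = key.strip_suffix("_PASS") {
                passes.insert(prefix.to_lowercase(), value.to_string());
            }
        }
    }
    // BTreeMap iteration is ordered, so the output order never varies.
    users
        .into_iter()
        .filter_map(|(proto, user)| {
            passes.get(&proto).map(|pass| (proto.clone(), user, pass.clone()))
        })
        .collect()
}
```

Fed the two lines from the lab's `.env`, this yields a single `("ssh", "pentest", "pentest123")` triple on every call.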

3. Pentest-mode semantics — why "stop on first" is the magic

Most scanners are wide: test every endpoint, every parameter, every payload. Useful, but it produces noise and no evidence of impact.

SwarmHack agents are depth-first:

1. Score every candidate endpoint for the agent's vulnerability class
2. Test the highest-priority candidate
3. On confirmation, stop scanning new endpoints for that class
4. Shift into deep exploitation: extract data, harvest credentials, map crown jewels
5. Publish discoveries to the Intelligence Bus so other agents can act
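The first three steps reduce to a small loop. This is a sketch under assumptions: `Candidate`, its `score` field, and `first_confirmed` are hypothetical, and the real scoring logic is not shown in this post. The `confirm` closure stands in for whatever test an agent runs against one endpoint.

```rust
#[derive(Clone)]
pub struct Candidate {
    pub endpoint: &'static str,
    pub score: u32, // higher means more likely to be exploitable
}

/// Tests candidates from highest to lowest score and stops at the first
/// confirmation. Returns the confirmed endpoint (if any) and how many
/// candidates were actually tested.
pub fn first_confirmed<F>(
    mut candidates: Vec<Candidate>,
    mut confirm: F,
) -> (Option<&'static str>, usize)
where
    F: FnMut(&str) -> bool,
{
    // Sort by score descending; break ties by name so the order is stable.
    candidates.sort_by(|a, b| {
        b.score
            .cmp(&a.score)
            .then_with(|| a.endpoint.cmp(b.endpoint))
    });
    let mut tested = 0;
    for c in &candidates {
        tested += 1;
        if confirm(c.endpoint) {
            return (Some(c.endpoint), tested); // stop: go deep, not wide
        }
    }
    (None, tested)
}
```

Everything after the early return is where the remaining budget goes: deep exploitation of the one confirmed endpoint rather than more probing of the others.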

The CMDI agent in our lab confirmed RCE on ping.php after a few requests, then spent the rest of its budget extracting .env, /etc/passwd, network interfaces, and crafting six exfiltration PoCs — 15 crown jewels from one endpoint, instead of "we found CMDI on 4 endpoints" with no proof of impact.

4. Smart timeouts — kill stalls, not slow targets

The original reliability problem: a hard 120-second tokio::time::timeout would kill an SQLi agent mid-UNION extraction on a slow target, and all partial findings disappeared.

The fix is the ActivityTracker (164 lines of Rust) — a stall-based watchdog instead of a hard kill:

pub struct ActivityTracker {
    last_activity_ms: Arc<AtomicU64>,   // updated on every HTTP response
    stall_threshold_secs: u64,          // per agent (e.g. 300s SQLi, 90s XSS)
}

Every HTTP response calls record_activity() (one atomic store, no mutex). A watchdog task periodically calls is_stalled(). The agent is killed only when no activity has happened for the configured stall window — not because the target is slow.

A generous 3× ceiling provides an absolute upper bound, so a wedged agent can never run forever.
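Filling in the methods around that struct might look like the sketch below. Only `last_activity_ms`, `stall_threshold_secs`, `record_activity()`, and `is_stalled()` come from the post. The `started_ms` and `ceiling_secs` fields, `over_ceiling()`, `should_kill()`, and the explicit `now_ms` parameter are assumptions added so the example is deterministic; the post does not say what the 3× ceiling is measured against, so the ceiling is left as a caller-supplied value.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

#[derive(Clone)]
pub struct ActivityTracker {
    last_activity_ms: Arc<AtomicU64>, // updated on every HTTP response
    stall_threshold_secs: u64,        // per agent (e.g. 300s SQLi, 90s XSS)
    started_ms: u64,                  // assumption: when the agent started
    ceiling_secs: u64,                // assumption: absolute upper bound
}

impl ActivityTracker {
    pub fn new(now_ms: u64, stall_threshold_secs: u64, ceiling_secs: u64) -> Self {
        ActivityTracker {
            last_activity_ms: Arc::new(AtomicU64::new(now_ms)),
            stall_threshold_secs,
            started_ms: now_ms,
            ceiling_secs,
        }
    }

    /// One atomic store, no mutex.
    pub fn record_activity(&self, now_ms: u64) {
        self.last_activity_ms.store(now_ms, Ordering::Relaxed);
    }

    /// True when nothing has happened for the whole stall window.
    pub fn is_stalled(&self, now_ms: u64) -> bool {
        let last = self.last_activity_ms.load(Ordering::Relaxed);
        now_ms.saturating_sub(last) >= self.stall_threshold_secs * 1000
    }

    /// True once the agent has run past its absolute ceiling.
    pub fn over_ceiling(&self, now_ms: u64) -> bool {
        now_ms.saturating_sub(self.started_ms) >= self.ceiling_secs * 1000
    }

    /// What a watchdog would poll: stalled, or past the ceiling.
    pub fn should_kill(&self, now_ms: u64) -> bool {
        self.is_stalled(now_ms) || self.over_ceiling(now_ms)
    }
}
```

A slow target that keeps answering keeps resetting the stall clock, so only the ceiling can end that agent; a wedged agent is caught by the stall window long before the ceiling.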

5. Session management — Metasploit's job, done with SSH

SwarmHack does not ship a Meterpreter equivalent. The NetworkSessionManager (635 lines) implements session persistence using SSH ControlMaster and standard CLI tools.

open_ssh()           → spawns: ssh -o ControlMaster=auto -N -f
                       creates /tmp/swmhk_sessions_<mission>/<host>_<port>_<user>.sock
ssh_exec(cmd)        → ssh -o ControlPath=<socket> user@host cmd
ssh_exec_batch(cmds) → joins commands with ';' — single round-trip
is_session_alive()   → ssh -O check
close_session()      → ssh -O exit
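As a rough illustration of how those calls map onto `ssh` argument vectors, here is a sketch that only builds the arguments and never spawns a process. The function names (`socket_path`, `open_args`, `exec_batch_args`, `control_args`) are hypothetical; the `ssh` options used (`-o ControlMaster=auto`, `-o ControlPath=`, `-N`, `-f`, `-p`, `-O check`, `-O exit`) are standard OpenSSH. The exact flags SwarmHack passes beyond those listed above are not known from this post.

```rust
fn to_strings(parts: &[&str]) -> Vec<String> {
    parts.iter().map(|p| p.to_string()).collect()
}

/// Socket location, following the layout shown in the listing above.
pub fn socket_path(mission: &str, host: &str, port: u16, user: &str) -> String {
    format!("/tmp/swmhk_sessions_{}/{}_{}_{}.sock", mission, host, port, user)
}

/// Arguments for the background master connection.
pub fn open_args(socket: &str, host: &str, port: u16, user: &str) -> Vec<String> {
    let control_path = format!("ControlPath={}", socket);
    let port = port.to_string();
    let target = format!("{}@{}", user, host);
    to_strings(&[
        "-o", "ControlMaster=auto",
        "-o", control_path.as_str(),
        "-p", port.as_str(),
        "-N", "-f",
        target.as_str(),
    ])
}

/// Several commands over the existing socket in one round-trip.
/// The joined string is a single argv element, interpreted by the remote shell.
pub fn exec_batch_args(socket: &str, host: &str, user: &str, cmds: &[&str]) -> Vec<String> {
    let control_path = format!("ControlPath={}", socket);
    let target = format!("{}@{}", user, host);
    let script = cmds.join("; ");
    to_strings(&["-o", control_path.as_str(), target.as_str(), script.as_str()])
}

/// `op` is "check" (is the master alive?) or "exit" (tear it down).
pub fn control_args(op: &str, socket: &str, host: &str, user: &str) -> Vec<String> {
    let control_path = format!("ControlPath={}", socket);
    let target = format!("{}@{}", user, host);
    to_strings(&["-O", op, "-o", control_path.as_str(), target.as_str()])
}
```

Each vector would be handed to a process spawner as-is, for example `Command::new("ssh").args(&args)`, so no local shell ever parses the command string.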

A side-by-side with Metasploit makes the trade-off concrete:

| Capability | Metasploit | SwarmHack |
| ------------ | ----------- | ----------- |
| Session persistence | Meterpreter agent | SSH ControlMaster (native) |
| Post-exploitation | 500+ post modules | Batched shell scripts with section markers |
| Pivoting | route add + socks_proxy | ssh -D (SOCKS5) / ssh -L (port forward) |
| Database access | Auxiliary modules | Direct CLI piping (mysql, psql, tsql) |
| Memory footprint | ~500 MB (Ruby + PostgreSQL) | ~0 MB overhead (no daemon) |
| Detection surface | Meterpreter binary on target | No implant; uses the target's own SSH server |

You lose Metasploit's module ecosystem. You gain something pretty valuable in exchange: nothing to install on the target, no daemon to keep running, no implant for blue team to find.

6. The toolchain — 14 boring CLI tools, wrapped in typed Rust

| Runner | Binary | Purpose |
| -------- | -------- | --------- |
| CurlRunner | curl | HTTP, FTP anonymous, REST |
| NcRunner | nc/ncat | Port probing, banner grabbing, Redis/Memcached |
| OpensslRunner | openssl | TLS protocol & cipher enumeration |
| SshRunner | ssh | Credential testing, exec, tunneling |
| DbRunner | mysql/psql/tsql | DB credentials, query exec, extraction |
| RedisRunner | redis-cli/nc | AUTH, key enum, config dump |
| SnmpRunner | snmpwalk | Community strings, MIB walks |
| DnsRunner | dig | Zone transfer, version queries |
| FtpRunner | curl | Anonymous login, listing, retrieval |
| SmtpRunner | nc | Open relay, VRFY/EXPN |
| LdapRunner | ldapsearch | Anonymous bind, base-DN enum |
| SmbRunner | smbclient | Null session, share enum |
| IpmiRunner | ipmitool | Cipher zero, user enum |
| KerberosRunner | kinit | AS-REP roasting, ticket requests |

All standard utilities. No Metasploit, no nmap, no privileged dependencies. Five of the binaries (curl, nc, openssl, ssh, dig) are typically pre-installed on macOS and most Linux distributions, and that's enough to run the entire engagement from Part 2.

Each runner builds safe argument vectors (no shell interpolation), spawns via tokio::process::Command, captures stdout/stderr, and parses results into structured Rust types. Tool availability is detected at runtime — if snmpwalk isn't installed, the SNMP agent skips its SNMP techniques and reports what it could test.
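Both ideas fit in a few lines of standard-library Rust. The sketch below is illustrative only: `find_in_path` and `dig_axfr_args` are hypothetical helpers, not the runners' real API. The `dig @server zone AXFR` form is ordinary dig syntax for a zone-transfer request.

```rust
use std::path::PathBuf;

/// Looks for `binary` in each directory of a PATH-style string.
/// Note: this checks that a regular file exists, not that it is executable.
pub fn find_in_path(binary: &str, path_var: &str) -> Option<PathBuf> {
    std::env::split_paths(path_var)
        .map(|dir| dir.join(binary))
        .find(|candidate| candidate.is_file())
}

/// Argument vector for a zone-transfer attempt: dig @<server> <zone> AXFR.
/// Each value is its own argv element, so nothing is ever shell-parsed.
pub fn dig_axfr_args(server: &str, zone: &str) -> Vec<String> {
    vec![format!("@{}", server), zone.to_string(), "AXFR".to_string()]
}
```

A runner built this way would check `find_in_path("snmpwalk", &path)` once at start-up and skip its SNMP techniques on `None`. Because the arguments go to the process as a vector, a hostile value such as `10.0.0.2; rm -rf /` stays a single, harmless argument.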

7. Why not LLM-based pentesting?

A growing category of security tools uses LLMs (GPT-class, Claude-class) as the brain of the engagement. The pitch is appealing. In practice it has five structural problems that better engineering will not fix.

SwarmHack uses AI techniques — GOAP planning, self-learning via the SONA ReasoningBank, neural pattern recognition — but never an LLM for vulnerability detection. Here's why.

7.1 Token economics don't scale

A single LLM-based pentest of a moderately complex web app burns 50,000–200,000+ tokens, costing $5–$60 in API fees per scan. For a security team running daily scans across 100 apps, that's $15K–$180K per month in API charges alone — and the cost is unpredictable (it varies with target size and chain-of-thought depth).
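The monthly range follows from simple multiplication. The sketch below spells it out, assuming one scan per app per day and a 30-day month (both assumptions are implied by the figures rather than stated); `monthly_api_cost_usd` is a hypothetical helper.

```rust
/// Monthly API spend in whole dollars.
pub fn monthly_api_cost_usd(
    apps: u64,
    scans_per_app_per_day: u64,
    days_per_month: u64,
    cost_per_scan_usd: u64,
) -> u64 {
    apps * scans_per_app_per_day * days_per_month * cost_per_scan_usd
}
```

With 100 apps, `monthly_api_cost_usd(100, 1, 30, 5)` gives 15,000 and `monthly_api_cost_usd(100, 1, 30, 60)` gives 180,000, which are the two ends of the range quoted above.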

SwarmHack's cost is fixed per scan: CPU and memory for the Rust binary. 100 scans cost 100× the infrastructure of 1 scan. No token budget, no surprise invoices.

7.2 Not repeatable

LLM outputs are non-deterministic even at temperature 0. With temperature > 0 (necessary for "creative" attack reasoning), the same target produces different findings across runs. Regression testing becomes impossible: how do you verify a fix when the tool finds different things every time?

SwarmHack's CMDI agent always sends ; echo SWMHK$(expr 7777 + 4242)CK and always checks for SWMHK12019CK. Across 11 consecutive runs of the lab from Part 1, the team measured exactly 11 findings and 35 crown jewels every time.
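The reason the marker check is so reliable is that the expected string never appears in the payload itself; it only exists if a shell on the target evaluated the arithmetic. A minimal sketch of that idea follows, with `cmdi_probe` and `is_confirmed` as hypothetical helper names:

```rust
/// Builds the injection payload and the marker it should produce
/// if, and only if, the target shell evaluates `expr a + b`.
pub fn cmdi_probe(a: u32, b: u32) -> (String, String) {
    let payload = format!("; echo SWMHK$(expr {} + {})CK", a, b);
    let marker = format!("SWMHK{}CK", a + b);
    (payload, marker)
}

/// Confirmed only when the computed marker shows up in the response.
pub fn is_confirmed(response_body: &str, marker: &str) -> bool {
    response_body.contains(marker)
}
```

A page that merely echoes the input back contains `SWMHK$(expr 7777 + 4242)CK`, not `SWMHK12019CK`, so simple reflection cannot produce a false positive.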

7.3 Not consistent

LLM hallucination is a documented architectural property. Applied to pentesting it produces:

Findings reported without ever sending an HTTP request
Severity ratings that drift between runs
Plausible-sounding evidence that's actually fabricated
Coverage gaps when the model "decides" to skip a vulnerability class

SwarmHack confidence scores are computed from evidence quality, not opinion:

| Detection method | Confidence | Basis |
| ------------------ | ----------- | ------- |
| Marker-based (CMDI) | 0.99 | Arithmetic marker reflected; zero FP surface |
| Reflected payload (XSS) | 0.90 | Payload appears unencoded in body |
| Tautology (SQLi) | 0.75 | Response-change heuristic |
| Cross-origin POST (CSRF) | 0.70 | Forged action accepted |
| In-band file content (XXE) | 0.60 | File content detected, no unique marker |

A reviewer can replay the payload and verify the score independently. That's auditability.
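In code, a scheme like this is just a fixed lookup from detection method to score. The sketch below mirrors the table; the `Detection` enum and `confidence` function are hypothetical names, and the real implementation may weigh additional evidence.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Detection {
    Marker,            // computed marker found in the response (CMDI)
    ReflectedPayload,  // payload appears unencoded in the body (XSS)
    Tautology,         // response-change heuristic (SQLi)
    CrossOriginPost,   // forged action accepted (CSRF)
    InBandFileContent, // file content seen, no unique marker (XXE)
}

/// Fixed confidence per detection method, taken from the table above.
pub fn confidence(method: Detection) -> f64 {
    match method {
        Detection::Marker => 0.99,
        Detection::ReflectedPayload => 0.90,
        Detection::Tautology => 0.75,
        Detection::CrossOriginPost => 0.70,
        Detection::InBandFileContent => 0.60,
    }
}
```

Because the mapping is a pure function of the detection method, two runs that gather the same evidence cannot disagree on the score.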

7.4 Not enterprise-ready

- Data sovereignty. LLM-based tools that call a hosted model send target response bodies, including credentials, internal IPs, and extracted data, to a third-party API. Teams bound by SOC 2, ISO 27001, PCI DSS, or HIPAA typically say no.
- Air-gapped environments. Defense, finance, and critical-infrastructure operators run networks with no internet egress. Tools that depend on a cloud API simply do not function there.
- API rate limits. A test that fires 58 SQLi payloads in seconds is throttled by token-per-minute caps. A 6-minute SwarmHack engagement becomes hours when every decision is a round-trip to a cloud API.

SwarmHack runs entirely on-premise. The Rust binary calls the target directly. No data leaves the network.

7.5 The depth problem

"Generate a SQL injection payload for this form" is one prompt. Real exploitation is something else: 58 payloads × 3 detection methods, fingerprint the database engine, enumerate column count via ORDER BY, run UNION-based extraction across 8 tables, harvest credentials, then feed those credentials to the SSH agent.

That requires maintaining state across hundreds of HTTP requests, adapting strategy based on intermediate results, and chaining findings across multiple agents and phases. LLMs operate on a per-prompt or per-context-window basis — multi-step, state-dependent exploitation chains are outside the model.
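To give a feel for one link in that chain, here is a sketch of column-count enumeration via ORDER BY. It is a simplified illustration, not the SQLi agent's code: `count_columns` is a hypothetical helper, and the `order_by_ok` closure stands in for "send `ORDER BY n` and report whether the response still looks normal". Each probe depends on the result of the previous one, which is the kind of state the paragraph above describes.

```rust
/// Finds the column count by probing ORDER BY 1, 2, 3, ... until one fails.
/// `order_by_ok(n)` should return true while n is a valid column index.
/// Returns None if even ORDER BY 1 fails (probably not injectable this way).
pub fn count_columns<F>(mut order_by_ok: F, max_columns: u32) -> Option<u32>
where
    F: FnMut(u32) -> bool,
{
    let mut last_ok = None;
    for n in 1..=max_columns {
        if order_by_ok(n) {
            last_ok = Some(n);
        } else {
            break; // first failing index: the previous n is the column count
        }
    }
    last_ok
}
```

The result feeds directly into the next step: a UNION-based extraction needs exactly that many columns in its injected SELECT.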

7.6 The structural comparison

| Dimension | LLM-based | SwarmHack |
| ----------- | ----------- | ----------- |
| Cost per scan | $5–60, unpredictable | Fixed infrastructure cost |
| Reproducibility | Non-deterministic | Same input = same output |
| Hallucination risk | High | None (evidence-based) |
| Data sovereignty | Cloud API | 100% on-premise, air-gap ready |
| Compliance | Cloud API = SOC 2/PCI risk | On-premise = full compliance |
| Scan speed | Hours (rate limits) | 6 minutes (local) |
| Kill-chain depth | Stateless | Stateful, 7-phase, cross-agent |
| Lateral movement | Cannot SSH, cannot pivot | Real SSH, real tunnels |
| Confidence scoring | Model opinion | Evidence-calibrated, auditable |

LLMs are great at code generation, document analysis, and natural-language reasoning. Real-time security testing isn't on that list.

8. The OCSF report — what your SIEM ingests

Every finding follows the OCSF 1.1.0 Vulnerability Finding schema, plus SwarmHack-specific extensions for crown jewels and exploitation details:

{
  "title": "Command Injection in parameter 'host'",
  "severity": "critical",
  "activity_id": 5,                  // 5 = Create (vuln confirmed)
  "type_uid": 200178,
  "cwe":   { "uid": "CWE-78" },
  "attacks": ["T1059", "T1190", "T1059.004"],
  "risk_score": 10.0,
  "confidence": 1.0,
  "evidence": "Marker 'SWMHK12019CK' found in response",
  "tested_payloads": {
    "count": 1,
    "samples": ["; echo SWMHK$(expr 7777 + 4242)CK"]
  },
  "exploitation_summary": "15 crown jewels extracted!",
  "exploitation_details": {
    "vectors_analyzed": 9,
    "vectors_exploitable": 9,
    "crown_jewels_count": 15,
    "impact_assessment": "CRITICAL: System data extracted"
  },
  "crown_jewels": [
    { "category": "api_key",     "value": "DB_HOST=localhost (+8 more lines)" },
    { "category": "credential",  "value": "26 accounts, 2 with shell access"  },
    { "category": "system_info", "value": "3 interfaces: eth 172.20.0.10..."  }
  ]
}

Tunnel-discovered findings carry generation: 1, so a single SIEM filter answers "what did we reach only by pivoting?"
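On the consumer side that filter is a one-liner. The sketch below assumes that findings reached directly carry `generation: 0`, which this post does not state explicitly; the `Finding` struct and `reached_by_pivot` are hypothetical and carry only the two fields needed here.

```rust
#[derive(Debug, PartialEq)]
pub struct Finding {
    pub title: &'static str,
    pub generation: u32, // assumption: 0 = reached directly, 1 = via a tunnel
}

/// Keeps only the findings that were reachable solely through a pivot.
pub fn reached_by_pivot(findings: &[Finding]) -> Vec<&Finding> {
    findings.iter().filter(|f| f.generation >= 1).collect()
}
```

The same predicate translates directly into a SIEM query on the `generation` field.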

An optional PCI compliance layer adds per-finding extensions (auto-fail rules, CVSS 3.1 vector, requirement mapping) without changing the core schema — toggle pci.enabled = true and your downstream consumers keep working.

9. Wrapping the series

Across three posts you've now:

1. Built a realistic two-network Docker lab with enforced segmentation
2. Attacked it with one command and watched 7 phases chain automatically into a 35-crown-jewel kill chain
3. Looked inside the engine: GOAP planning, the Intelligence Bus, deterministic agents, SSH-based session management, and a clear-eyed argument for *not* using LLMs

The headline number is still the simplest one to remember:

One command. 11 findings. 35 crown jewels. 6 minutes 7 seconds. Zero humans.

That's what autonomous network pentesting actually looks like when the architecture is built for it from the ground up.

If you want to go deeper, the Docs cover agent configuration, custom payload sets, and the OCSF report schema. If you want to bring this into your own environment, the Demo page is the right next step.

Thanks for reading.