Sovereign AI: Inside the 18.07 TB CAI Dataset

Modern cybersecurity faces a structural dilemma: to stay competitive, red teams, penetration testers, and SecOps experts rely heavily on the execution speed of Artificial Intelligence assistants. However, daily use of general-purpose commercial tools routinely forces professionals to hand critical operational context, terminal session logs, and live network inventories to external servers controlled by third parties.

To resolve this dilemma at its core and protect organizational confidentiality, Alias Robotics has led an unprecedented engineering effort. Following a multi-year program of continuous framework development, we are proud to announce the publication of our newest peer-reviewed scientific paper: "Cybersecurity AI (CAI) Dataset: A 14-Month Corpus of LLM-Driven Hacker Trajectories".

Through our open-source CAI agent framework, we have channeled the voluntary collaboration of the global security community to assemble the largest described training corpus in Europe specialized in offensive and defensive security trajectories.

The CAI Dataset: Real Telemetry from the Algorithmic Frontline

This volume of data does not come from generic web scraping or synthetic text. The dataset from the official Cybersecurity Datasets by Alias Robotics captures genuine human-machine interactions, failed execution steps, raw operating system outputs, and security tool interpretations extracted directly from live penetration testing engagements.

Aggregated over 428 days of responsible data curation, the 18.07 TB of real adversarial telemetry breaks down into the following infrastructure metrics:

Infrastructure Metric	Total Volume Recorded in the Corpus
Durable Storage	18.07 TB of production-grade security data
Captured Session Logs	230,935 trajectories recording full intrusion chains
Intercepted User Prompts	26,027,742 individual messages from security operators
Geographic Source Footprint	Collected across 16,768 distinct source IPs spanning 123 countries
Model & Target Diversity	4,187 unique LLM identifiers executed against 23,147 unique domains

Comprehensive overview of the CAI Dataset volume, prompt-level role distribution, and model provider call configurations.

Anatomy of the Prompt: The Value of Real Context

Our research demonstrates that the actual post-training bottleneck for specialized cybersecurity models is the scarcity of human expert trajectories, not base-model scale. The data harvested by the CAI framework reflects a fascinating evolution: the weekly mean prompt length has drifted from approximately 150 characters in the early CLI era to a stable range of 400 to 1,300 characters in recent months.
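The weekly mean prompt length mentioned above can be computed with a simple grouping step. The following is a minimal sketch, assuming each prompt is available as an (ISO 8601 timestamp, text) pair; the function name and the sample prompts are hypothetical and are not taken from the corpus or from the CAI analysis pipeline.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def weekly_mean_prompt_length(prompts):
    """Group (ISO timestamp, text) pairs by ISO week and average the character counts."""
    buckets = defaultdict(list)
    for timestamp, text in prompts:
        year, week, _ = datetime.fromisoformat(timestamp).isocalendar()
        buckets[(year, week)].append(len(text))
    # Sort by (year, week) so the drift over time is easy to read.
    return {key: mean(lengths) for key, lengths in sorted(buckets.items())}


# Hypothetical sample data: two short CLI-style prompts and one long brief.
sample = [
    ("2025-01-06T09:00:00", "nmap -sV 10.0.0.5"),
    ("2025-01-07T10:30:00", "run gobuster on the web root"),
    ("2025-06-02T14:00:00", "x" * 900),
]
result = weekly_mean_prompt_length(sample)
for (year, week), avg in result.items():
    print(f"{year}-W{week:02d}: {avg:.1f} characters")
```

Averaging characters rather than tokens keeps the sketch independent of any particular tokenizer, which matches the character-based figures quoted above.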

Operators are no longer typing isolated single-line commands; they are appending large deployment briefs that contain full ntlmrelayx pivot states or active tmux session resumptions. Our data analysis pipeline identified four primary ingress pathways through which sensitive context reaches the prompt body:

Cleartext .env Assignments (0.46% of sessions): Operators pasting entire environment files containing functional API keys for commercial providers.
Burp Suite Interceptions: Raw HTTP requests captured from local proxy setups that contain active Bearer JWT authorization tokens within the session headers.
Bug Bounty Platform Identification: Researchers pasting live HTTP request targets that explicitly carry cleartext platform handles (e.g., X-HackerOne-Research).
Scraped HTML Payloads: Script code pulled automatically by agent web tools from target pages, containing embedded, active Google Maps API keys.
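The four pathways above lend themselves to pattern-based detection before a prompt leaves the operator's machine. The sketch below is illustrative only: the regular expressions, the `INGRESS_PATTERNS` table, and the `classify_ingress` and `redact` helpers are assumptions made for this example, not the rules used by the CAI pipeline, and a production scanner would need a far broader rule set.

```python
import re

# Illustrative patterns only (assumed for this example).
INGRESS_PATTERNS = {
    # KEY=value style lines whose variable name ends in KEY, TOKEN, or SECRET.
    "env_assignment": re.compile(
        r"^[A-Z][A-Z0-9_]*(?:KEY|TOKEN|SECRET)\s*=\s*\S+", re.MULTILINE
    ),
    # Bearer token shaped like a JWT: three base64url segments, the first starting with "eyJ".
    "bearer_jwt": re.compile(r"Bearer\s+eyJ[\w-]+\.[\w-]+\.[\w-]+"),
    # Header carrying a bug bounty platform handle.
    "bug_bounty_header": re.compile(
        r"^X-HackerOne-Research:\s*\S+", re.MULTILINE | re.IGNORECASE
    ),
    # Commonly used shape for Google API keys: "AIza" followed by 35 URL-safe characters.
    "google_api_key": re.compile(r"AIza[0-9A-Za-z_-]{35}"),
}


def classify_ingress(prompt):
    """Return the sorted names of all pathways detected in the prompt."""
    return sorted(name for name, pat in INGRESS_PATTERNS.items() if pat.search(prompt))


def redact(prompt):
    """Replace every detected span with a labelled placeholder."""
    for name, pat in INGRESS_PATTERNS.items():
        prompt = pat.sub(f"[REDACTED:{name}]", prompt)
    return prompt


# Hypothetical prompt body; none of these values are real credentials.
sample = "\n".join([
    "OPENAI_API_KEY=sk-example-not-a-real-key",
    "GET /api/v1/users HTTP/1.1",
    "Authorization: Bearer eyJhbGciOiJub25lIn0.eyJzdWIiOiJkZW1vIn0.c2ln",
    "X-HackerOne-Research: demo_handle",
    '<script src="https://maps.googleapis.com/maps/api/js?key=AIza' + "A" * 35 + '"></script>',
])
print(classify_ingress(sample))
print(redact(sample))
```

Running the classifier again on the redacted text is a cheap sanity check that no pattern still matches.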

Structural Blueprint: Inside a Hacker Trajectory Log

Taxonomy of context ingress pathways showing how environmental variables, authorization headers, and scraped payloads enter the prompt body.

To understand why this dataset is such effective fuel for Supervised Fine-Tuning (SFT), data scientists can examine the structure of a single record. Rather than a static raw text file, every trajectory is stored as an ordered, typed array of tool invocations, system responses, and behavioral self-corrections:

JSON

{
  "trajectory_id": "cai_log_98431_offensive",
  "environment": { "os": "kali-linux-2026.2", "harness": "cai-cli-v3.1" },
  "turns": [
    { "role": "operator", "content": "Analyze target network privilege vectors on 10.10.11.24" },
    { "role": "agent_action", "tool": "nmap", "args": "-sV --script vuln 10.10.11.24" },
    { "role": "harness_observation", "status": "success", "stdout": "Port 8080 open: Apache Tomcat 9.0.35" },
    { "role": "agent_revision", "thought": "Tomcat 9.0.35 falls within the range affected by CVE-2020-13935 (WebSocket denial of service). Initiating exploit verification path." }
  ]
}
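A record of this shape can be loaded into typed objects and split into training examples. The following is a minimal sketch, assuming the four role names shown in the sample are the complete set (the full schema may define more); the `Turn` and `Trajectory` classes and the `load_trajectory` and `to_sft_pairs` helpers are hypothetical names introduced for illustration.

```python
import json
from dataclasses import dataclass, field

# Role names taken from the sample record; assumed, not confirmed, to be exhaustive.
ALLOWED_ROLES = {"operator", "agent_action", "harness_observation", "agent_revision"}


@dataclass
class Turn:
    role: str
    payload: dict


@dataclass
class Trajectory:
    trajectory_id: str
    environment: dict
    turns: list = field(default_factory=list)


def load_trajectory(raw):
    """Parse one JSON record and reject turns with an unknown role."""
    record = json.loads(raw)
    turns = []
    for index, item in enumerate(record["turns"]):
        role = item.get("role")
        if role not in ALLOWED_ROLES:
            raise ValueError(f"turn {index}: unknown role {role!r}")
        turns.append(Turn(role, {k: v for k, v in item.items() if k != "role"}))
    return Trajectory(record["trajectory_id"], record.get("environment", {}), turns)


def to_sft_pairs(trajectory):
    """Pair each agent_action (the target) with all turns that preceded it (the context)."""
    return [
        (trajectory.turns[:i], turn)
        for i, turn in enumerate(trajectory.turns)
        if turn.role == "agent_action"
    ]


# Abbreviated copy of the sample record above.
RAW = """
{
  "trajectory_id": "cai_log_98431_offensive",
  "environment": { "os": "kali-linux-2026.2", "harness": "cai-cli-v3.1" },
  "turns": [
    { "role": "operator", "content": "Analyze target network privilege vectors on 10.10.11.24" },
    { "role": "agent_action", "tool": "nmap", "args": "-sV --script vuln 10.10.11.24" },
    { "role": "harness_observation", "status": "success", "stdout": "Port 8080 open" },
    { "role": "agent_revision", "thought": "Service identified; selecting a verification path." }
  ]
}
"""
traj = load_trajectory(RAW)
pairs = to_sft_pairs(traj)
print(len(traj.turns), "turns,", len(pairs), "SFT pair(s)")
```

Keeping the turn order intact is the important design choice here: the observation and revision turns that follow an action are what distinguish a trajectory from a flat list of commands.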

Breaking the Centralized Systemic Failure Surface

By breaking down model invocations at the provider level, our study exposes a significant infrastructure centralization risk across the modern SecOps ecosystem. Most automated agent activity is funneled into a highly concentrated pool of external cloud APIs. The six providers below account for roughly 87.6% of observed calls, and the two largest alone for more than 40%:

AI Infrastructure Provider	Observed Identifier Families in Logs	Total Processed Calls	Provider Share
Anthropic	Opus 4.x, Sonnet 4.x, Claude 3.7 series	7,340,499	20.79%
OpenAI	GPT-5.x namespaces, GPT-4o, o3-mini	7,062,591	20.00%
Alias Robotics	alias0, alias1, alias2, alias2-mini flagships	5,964,072	16.89%
Alibaba (Qwen)	Qwen3.6 series, qwen3-8b-grpo, pentestr1 fine-tunes	3,968,661	11.24%
DeepSeek	DeepSeek V4 flash/pro, DeepSeek-R1 variants	3,299,690	9.34%
Google	Gemini 3 preview, Gemini 2.5 Pro/Flash lines	3,296,727	9.34%
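The concentration in the table can be checked with a few lines of arithmetic. This is a rough sketch: the total number of calls is not stated above, so it is estimated here from the Anthropic row (calls divided by its rounded 20.79% share), which means the results are approximate to within rounding error.

```python
# Call counts copied from the table above.
calls = {
    "Anthropic": 7_340_499,
    "OpenAI": 7_062_591,
    "Alias Robotics": 5_964_072,
    "Alibaba (Qwen)": 3_968_661,
    "DeepSeek": 3_299_690,
    "Google": 3_296_727,
}

# Assumption: the overall total is inferred from one row's rounded share.
implied_total = calls["Anthropic"] / 0.2079

# Recompute each provider's share against the inferred total.
shares = {name: count / implied_total for name, count in calls.items()}

# Cumulative share of the six listed providers and of the two largest.
listed_share = sum(shares.values())
top_two_share = sum(sorted(shares.values(), reverse=True)[:2])

for name, share in shares.items():
    print(f"{name:<16} {share:6.2%}")
print(f"Six listed providers: {listed_share:.1%}")
print(f"Two largest providers: {top_two_share:.1%}")
```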

As documented in our previous analysis on "The Cloud AI Trap: Your Supply Chain is Your Vulnerability", this massive reliance on commercial clouds creates a critical vulnerability:

"Aggregated across the industry, this trade-off concentrates a substantial fraction of the world's offensive and defensive operator context inside a handful of frontier-model API providers—a single failure surface whose breach or politically motivated repurposing could cascade into nation- and enterprise-scale disruption."

Real-World Vulnerability Footprint: The CVE Focus

The analysis of vulnerability disclosures (4,532 unique CVE identifiers across the corpus) dispels the industry myth that autonomous agents spend their execution windows hunting for exotic zero-day exploits. Instead, the empirical data shows that expert operators rely on AI to aggressively identify and target legacy, unpatched systems still lingering within corporate infrastructure boundaries.

The three most frequently mentioned CVE identifiers in the dataset illustrate this operational reality:

CVE-2019-11248 (47,904 mentions): An information disclosure vulnerability in which the Kubernetes kubelet exposes its /debug/pprof debugging endpoint over the unauthenticated healthz port.
CVE-2017-10271 (20,163 mentions): A critical XML deserialization flaw permitting Remote Code Execution (RCE) in Oracle WebLogic Server.
CVE-2021-41773 (16,270 mentions): A widely exploited path traversal and RCE vulnerability affecting Apache HTTP Server 2.4.49.

Powering Sovereign European On-Premise AI

The CAI Dataset is our direct answer to this geopolitical concentration risk. We are building a compliant infrastructure designed to address the critical digital sovereignty challenges Europe refuses to face when training defensive technologies.

Sovereign by Design: Because 85.7% of the unique contributors in this corpus originate from European infrastructure and IP ranges, this dataset provides a natural foundation for training models under strict NIS2 and GDPR requirements.

Global geographic distribution of telemetry contributors, highlighting the dominant sovereign European data footprint.

This dataset provides the multi-turn expert trajectories (action-observation-revision loops) required to execute Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) directly inside your private trust boundary. To prevent trained agents from overfitting to a specific environment interface (harness overfit), our data engineering pipeline enforces a rigorous three-step mitigation recipe:

Abstract Action Alignment: Mapping raw text inputs into standardized, harness-agnostic schemas compatible with the Agent Data Protocol (ADP).
Agnostic Noise Injection: Balancing the fine-tuning mixture by blending in 20% to 30% non-cybersecurity agent data sources (such as SWE-Gym or CodeActInstruct).
Interface Reliance Auditing: Validating the resulting models against PIPE-style interface rewrites (shuffling tool arguments, altering kwarg structures) to ensure the agent learns deep security semantics rather than static shell syntax.
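Two of the steps above reduce to small, checkable operations: sizing the non-cybersecurity share of the mixture, and perturbing a tool call's interface without changing its meaning. The sketch below is a loose interpretation under stated assumptions; the function names, the tool-call dictionary layout, and the rename map are hypothetical and do not reproduce the ADP schema or the PIPE procedure.

```python
import math
import random


def general_samples_needed(n_security, general_fraction):
    """Non-security samples required so they make up `general_fraction` of the final mixture."""
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in [0, 1)")
    # Solve g / (n_security + g) = general_fraction for g, rounding up.
    return math.ceil(n_security * general_fraction / (1 - general_fraction))


def perturb_tool_call(call, rename, rng):
    """Rename keyword arguments and shuffle their order while keeping the values unchanged."""
    items = [(rename.get(key, key), value) for key, value in call["kwargs"].items()]
    rng.shuffle(items)
    return {"tool": call["tool"], "kwargs": dict(items)}


# Example: 1,000 security trajectories with a 25% general-purpose share.
extra = general_samples_needed(1000, 0.25)
print(f"Blend in {extra} general agent samples")

# Hypothetical harness-agnostic tool call and rename map.
call = {"tool": "nmap", "kwargs": {"target": "10.10.11.24", "flags": "-sV"}}
perturbed = perturb_tool_call(call, {"target": "host"}, random.Random(0))
print(perturbed)
```

A model that has learned the security semantics should score similarly on the original and the perturbed call; a large gap between the two suggests reliance on one harness's surface syntax.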

Mapping the Core Licensing Tracks

To accelerate deployment across different security environments, Alias Robotics grants data access across three distinct, pre-redacted continuous licensing streams:

Data Segment Name	Structural Content Window	Intended Operational Use Case
CAI Dataset 10	10 high-complexity expert sessions	Evaluation baseline, pipeline probing, and tool harness validation tests.
CAI Dataset 1k	1,000 curated multi-turn trajectories	Mid-scale fine-tuning experiments, local ablation tracking, and hyperparameter optimization.
CAI Dataset 200k	~200,000 comprehensive operator logs	Production-scale foundational post-training to build private, domain-specific security LLMs.

Our collection infrastructure runs under the EIC Accelerator project RIS (GA 101161136). Our goal is to provide the operational data needed to help your organization escape the cloud AI trap. Stop feeding public infrastructure with your enterprise data assets; start training your own secure, private, on-premise SuperIntelligence.

Ready to fuel your sovereign cybersecurity models?

License the Continuous CAI Dataset Slices: Get in touch with our research team to sponsor a benchmark, request customized pipeline slicing, or secure an on-premise post-training license.
Deploy the CSI Suite On-Premise: Protect your core intellectual property by serving cybersecurity-specialized models privately inside your local network infrastructure via our Inference-time Steering services.