Every skill is analyzed for 6 threat classes using static pattern matching, ML classification, and LLM-based semantic analysis.
Detection is structured around the AI Agent Traps taxonomy (Franklin, Tomašev, Jacobs, Leibo & Osindero, 2026, Google DeepMind), which defines 6 threat classes targeting AI agents: Content Injection (Perception), Semantic Manipulation (Reasoning), Cognitive State (Memory), Behavioural Control (Action), Systemic (Multi-Agent), and Human-in-the-Loop (Human Overseer).
Content Injection (Perception): Attacks that manipulate what the agent perceives by injecting hidden or disguised instructions into its input stream.
Semantic Manipulation (Reasoning): Attacks that exploit the agent's reasoning process through framing, persona manipulation, or authoritative language.
Cognitive State (Memory): Attacks that corrupt the agent's memory, knowledge bases, or learned policies to influence future behaviour.
Behavioural Control (Action): Attacks that cause the agent to take harmful actions — executing commands, exfiltrating data, or spawning attacker-controlled sub-agents.
Systemic (Multi-Agent): Attacks targeting multi-agent systems — cascading failures, compositional fragment attacks across agent boundaries.
Human-in-the-Loop (Human Overseer): Attacks that exploit the human approver — approval fatigue, social engineering the human via agent output.
Semantic analysis covers the Systemic (Multi-Agent) class through LLM-based detection of multi-agent attack patterns.
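The static pattern-matching layer mentioned above can be sketched as a set of regular expressions applied to a skill's text. This is a minimal illustration for the Content Injection (Perception) class only; the pattern names, rules, and return shape are hypothetical and are not the scanner's actual detection set.

```python
import re

# Illustrative content-injection indicators (assumed, not the real rule set):
# zero-width characters used to hide text, HTML comments carrying override
# verbs, and explicit "ignore previous instructions" phrasing.
INJECTION_PATTERNS = {
    "zero_width_chars": re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]"),
    "hidden_html_comment": re.compile(
        r"<!--.*?(ignore|disregard|instead).*?-->",
        re.IGNORECASE | re.DOTALL,
    ),
    "override_phrase": re.compile(
        r"\b(ignore|disregard)\s+(all\s+)?(previous|prior|above)\s+instructions\b",
        re.IGNORECASE,
    ),
}

def scan_for_content_injection(text: str) -> list[str]:
    """Return the names of injection patterns matched in a skill's text."""
    return [
        name
        for name, pattern in INJECTION_PATTERNS.items()
        if pattern.search(text)
    ]
```

A static layer like this is cheap and deterministic, which is why it is typically run first; the ML and LLM-based layers then handle the semantic attacks (framing, persona manipulation, multi-agent patterns) that no fixed regex can capture.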