Every skill is analyzed for 6 threat classes using static pattern matching, ML classification, and LLM-based semantic analysis.
Detection is structured around the AI Agent Traps taxonomy (Franklin, Tomašev, Jacobs, Leibo & Osindero, 2026, Google DeepMind), which defines 6 threat classes targeting AI agents: Content Injection (Perception), Semantic Manipulation (Reasoning), Cognitive State (Memory), Behavioural Control (Action), Systemic (Multi-Agent), and Human-in-the-Loop (Human Overseer).
Content Injection (Perception): Attacks that manipulate what the agent perceives by injecting hidden or disguised instructions into its input stream.
Semantic Manipulation (Reasoning): Attacks that exploit the agent's reasoning process through framing, persona manipulation, or authoritative language.
Cognitive State (Memory): Attacks that corrupt the agent's memory, knowledge bases, or learned policies to influence future behaviour.
Behavioural Control (Action): Attacks that cause the agent to take harmful actions — executing commands, exfiltrating data, or spawning attacker-controlled sub-agents.
Systemic (Multi-Agent): Attacks targeting multi-agent systems — cascading failures, compositional fragment attacks across agent boundaries.
Human-in-the-Loop (Human Overseer): Attacks that exploit the human approver — approval fatigue, social engineering the human via agent output.
Semantic analysis covers the Systemic (Multi-Agent) class through LLM-based detection of multi-agent attack patterns.
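The static pattern-matching layer mentioned above can be sketched as a set of regular expressions applied to a skill's text. This is a minimal illustration for the Content Injection (Perception) class only; the pattern names, rules, and return shape are hypothetical and are not the scanner's actual detection set.

```python
import re

# Illustrative content-injection indicators (assumed, not the real rule set):
# zero-width characters used to hide text, HTML comments carrying override
# verbs, and explicit "ignore previous instructions" phrasing.
INJECTION_PATTERNS = {
    "zero_width_chars": re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]"),
    "hidden_html_comment": re.compile(
        r"<!--.*?(ignore|disregard|instead).*?-->",
        re.IGNORECASE | re.DOTALL,
    ),
    "override_phrase": re.compile(
        r"\b(ignore|disregard)\s+(all\s+)?(previous|prior|above)\s+instructions\b",
        re.IGNORECASE,
    ),
}

def scan_for_content_injection(text: str) -> list[str]:
    """Return the names of injection patterns matched in a skill's text."""
    return [
        name
        for name, pattern in INJECTION_PATTERNS.items()
        if pattern.search(text)
    ]
```

A static layer like this is cheap and deterministic, which is why it is typically run first; the ML and LLM-based layers then handle the semantic attacks (framing, persona manipulation, multi-agent patterns) that no fixed regex can capture.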