AI Agent Safety and Guardrails
Bottom line: An agent with tools is a program that can take real action. Safety is not an afterthought. It is part of the architecture.
The agent threat model
Agents face the same risks as any automated system, plus risks from the language model itself. A malicious or confused agent can leak data, call the wrong tool, or execute destructive actions.
Defend against prompt injection
Prompt injection happens when untrusted input overrides your system prompt. It is one of the hardest problems in agent security.
- Separate system instructions from user data.
- Mark untrusted content clearly, for example with XML tags.
- Use output filtering to detect instruction overrides.
- Avoid giving the agent access to sensitive tools based only on user text.
Sandbox every tool
Run tool code in isolated environments. Use containers, restricted processes, or function-as-a-service with minimal permissions. Never let a tool run arbitrary shell commands unless a human explicitly approves each command.
Human-in-the-loop approvals
Require human approval for high-stakes actions. Common candidates include sending messages, making purchases, deleting records, and deploying infrastructure. Make the approval request clear and show exactly what the agent plans to do.
Output and input filtering
Filter inputs to block known attack patterns and toxic content. Filter outputs to prevent PII leakage, harmful instructions, or off-brand responses. Run these filters outside the model so they cannot be bypassed by prompt injection.
Monitoring and kill switches
Log every tool call, model input, and model output. Set up alerts for unusual patterns such as repeated failures, high token usage, or denied actions. Provide a kill switch that can stop the agent immediately.
Safety checklist
- Define what the agent is allowed and not allowed to do.
- Run tools with least privilege in sandboxed environments.
- Require approval for destructive or sensitive actions.
- Validate inputs with strict schemas.
- Monitor logs and set up alerts.
- Test against adversarial inputs regularly.
Published 2026-06-12
Related Resources
Prompt Security Auditor
SkillAudit AI systems for prompt injection vulnerabilities, jailbreak risks, and output safety issues.
Prompt Injection Defender
PromptDesign robust defense mechanisms against prompt injection attacks, jailbreaks, and adversarial inputs. Implement multi-layered security for AI systems handling untrusted user input.
Filesystem
MCP ServerSecure file operations with configurable access controls
Prompt Injection
GlossaryAn attack where malicious input overrides or leaks system instructions.
Deep Security Code Auditor
PromptPerforms comprehensive security audits of codebases, identifying vulnerabilities across the entire application with context-aware analysis.