What are agent guardrails?

Guardrails are controls that keep an agent within safe boundaries. They include input filters, output filters, tool permissions, approval flows, and monitoring.

What is prompt injection?

Prompt injection is an attack where untrusted input overrides the agent instructions. It can trick the agent into leaking data or running harmful tools.

How do I sandbox agent tools?

Run tools in isolated processes with limited permissions, deny network access when it is not needed, and validate all inputs before execution.

When should an agent ask for human approval?

Require approval for irreversible, expensive, or sensitive actions such as sending emails, making payments, deleting data, or deploying code.

AI Agent Safety and Guardrails

Bottom line: An agent with tools is a program that can take real action. Safety is not an afterthought. It is part of the architecture.

The agent threat model

Agents face the same risks as any automated system, plus risks from the language model itself. A malicious or confused agent can leak data, call the wrong tool, or execute destructive actions.

Defend against prompt injection

Prompt injection happens when untrusted input overrides your system prompt. It is one of the hardest problems in agent security.

Separate system instructions from user data.
Mark untrusted content clearly, for example with XML tags.
Use output filtering to detect instruction overrides.
Avoid giving the agent access to sensitive tools based only on user text.

Sandbox every tool

Run tool code in isolated environments. Use containers, restricted processes, or function-as-a-service with minimal permissions. Never let a tool run arbitrary shell commands unless a human explicitly approves each command.

Human-in-the-loop approvals

Require human approval for high-stakes actions. Common candidates include sending messages, making purchases, deleting records, and deploying infrastructure. Make the approval request clear and show exactly what the agent plans to do.

Output and input filtering

Filter inputs to block known attack patterns and toxic content. Filter outputs to prevent PII leakage, harmful instructions, or off-brand responses. Run these filters outside the model so they cannot be bypassed by prompt injection.

Monitoring and kill switches

Log every tool call, model input, and model output. Set up alerts for unusual patterns such as repeated failures, high token usage, or denied actions. Provide a kill switch that can stop the agent immediately.

Safety checklist

Define what the agent is allowed and not allowed to do.
Run tools with least privilege in sandboxed environments.
Require approval for destructive or sensitive actions.
Validate inputs with strict schemas.
Monitor logs and set up alerts.
Test against adversarial inputs regularly.

Published 2026-06-12

Related Resources

Prompt Security Auditor

Skill

Audit AI systems for prompt injection vulnerabilities, jailbreak risks, and output safety issues.

Prompt Injection Defender

Prompt

Design robust defense mechanisms against prompt injection attacks, jailbreaks, and adversarial inputs. Implement multi-layered security for AI systems handling untrusted user input.

Filesystem

MCP Server

Secure file operations with configurable access controls

Prompt Injection

Glossary

An attack where malicious input overrides or leaks system instructions.

Deep Security Code Auditor

Prompt

Performs comprehensive security audits of codebases, identifying vulnerabilities across the entire application with context-aware analysis.