Skip to main content
VePrompts

AI Agent Safety and Guardrails

Bottom line: An agent with tools is a program that can take real action. Safety is not an afterthought. It is part of the architecture.

The agent threat model

Agents face the same risks as any automated system, plus risks from the language model itself. A malicious or confused agent can leak data, call the wrong tool, or execute destructive actions.

Defend against prompt injection

Prompt injection happens when untrusted input overrides your system prompt. It is one of the hardest problems in agent security.

  • Separate system instructions from user data.
  • Mark untrusted content clearly, for example with XML tags.
  • Use output filtering to detect instruction overrides.
  • Avoid giving the agent access to sensitive tools based only on user text.

Sandbox every tool

Run tool code in isolated environments. Use containers, restricted processes, or function-as-a-service with minimal permissions. Never let a tool run arbitrary shell commands unless a human explicitly approves each command.

Human-in-the-loop approvals

Require human approval for high-stakes actions. Common candidates include sending messages, making purchases, deleting records, and deploying infrastructure. Make the approval request clear and show exactly what the agent plans to do.

Output and input filtering

Filter inputs to block known attack patterns and toxic content. Filter outputs to prevent PII leakage, harmful instructions, or off-brand responses. Run these filters outside the model so they cannot be bypassed by prompt injection.

Monitoring and kill switches

Log every tool call, model input, and model output. Set up alerts for unusual patterns such as repeated failures, high token usage, or denied actions. Provide a kill switch that can stop the agent immediately.

Safety checklist

  • Define what the agent is allowed and not allowed to do.
  • Run tools with least privilege in sandboxed environments.
  • Require approval for destructive or sensitive actions.
  • Validate inputs with strict schemas.
  • Monitor logs and set up alerts.
  • Test against adversarial inputs regularly.

Published 2026-06-12

Related Resources