Skip to main content
VePrompts
AI Safety / Guardrails

Defense Guardrails

Production-ready strategies for securing AI applications against prompt attacks

Priority levels: criticalhighmediumlow

Input Validation & Sanitization

critical

Validate and sanitize all user inputs before processing. Remove or escape delimiter sequences, encoding patterns, and system-like markers.

Implementation Checklist

  • Strip markdown code blocks containing 'system' or 'instructions'
  • Detect and decode base64, rot13, hex, and unicode escape sequences
  • Block input containing XML-like tags: <system>, <instructions>, [INST]
  • Limit input length to prevent context window stuffing
  • Rate limit requests per user/session
Code Example
// Detect encoded payloads
function detectEncoding(input) {
  const patterns = [
    /^[A-Za-z0-9+/]{20,}={0,2}$/, // base64
    /^[A-Fa-f0-9]{20,}$/, // hex
    /\\u[0-9a-fA-F]{4}/, // unicode escapes
  ];
  return patterns.some(p => p.test(input));
}

Instruction Hierarchy

critical

Implement a clear hierarchy where system instructions cannot be overridden by user input. Use signed or encrypted system prompts.

Implementation Checklist

  • Separate system and user prompts at the API level
  • Cryptographically sign system prompts to detect tampering
  • Use different delimiters for system vs user content
  • Never concatenate user input directly into system prompt templates
Code Example
// Signed system prompt
const systemPrompt = {
  text: 'You are a helpful assistant...',
  signature: 'sha256:abc123...',
  timestamp: Date.now()
};

function verifySystemPrompt(prompt) {
  return verifySignature(prompt.text, prompt.signature);
}

Output Filtering & Moderation

high

Apply safety filters to model outputs regardless of input sanitization. Catch harmful content that slips through input controls.

Implementation Checklist

  • Run outputs through a secondary moderation model
  • Block outputs containing PII, API keys, or system prompts
  • Implement regex patterns for common injection signatures
  • Log and alert on suspicious output patterns
Code Example
// Output moderation pipeline
async function moderateOutput(output) {
  const checks = [
    detectPII(output),
    detectSystemPromptLeak(output),
    detectHarmfulContent(output),
    detectEncoding(output)
  ];
  const results = await Promise.all(checks);
  return results.every(r => r.passed);
}

Context Isolation

high

Isolate different data sources and conversation contexts to prevent cross-contamination and information leakage.

Implementation Checklist

  • Process external documents in isolated sandboxes
  • Clear context between unrelated user sessions
  • Never include one user's data in another user's context
  • Implement conversation-level boundaries
Code Example
// Isolated document processing
async function processDocument(doc, userId) {
  const sandbox = createSandbox(userId);
  const sanitized = await sanitizeDocument(doc);
  const result = await sandbox.run(sanitized);
  sandbox.destroy();
  return result;
}

Monitoring & Logging

high

Comprehensive logging of all inputs, outputs, and safety events for audit, forensics, and continuous improvement.

Implementation Checklist

  • Log all inputs with user attribution and timestamps
  • Flag and log blocked attempts for pattern analysis
  • Monitor for repeated attack patterns from same source
  • Set up alerts for high-frequency suspicious activity
Code Example
// Security event logging
function logSecurityEvent(event) {
  const record = {
    timestamp: new Date().toISOString(),
    userId: event.userId,
    eventType: event.type, // 'blocked', 'flagged', 'allowed'
    inputHash: hashInput(event.input),
    reason: event.reason
  };
  securityLog.insert(record);
}

Rate Limiting & Throttling

medium

Prevent brute-force attacks and automated exploitation by limiting request frequency and implementing progressive delays.

Implementation Checklist

  • Limit requests per IP address per minute
  • Implement exponential backoff for repeated violations
  • Require CAPTCHA after suspicious activity
  • Block IPs with excessive blocked request rates
Code Example
// Rate limiter
const rateLimits = new Map();

function checkRateLimit(ip) {
  const now = Date.now();
  const window = 60 * 1000; // 1 minute
  const maxRequests = 30;
  
  const requests = rateLimits.get(ip) || [];
  const recent = requests.filter(t => now - t < window);
  
  if (recent.length >= maxRequests) {
    return { allowed: false, retryAfter: window - (now - recent[0]) };
  }
  
  recent.push(now);
  rateLimits.set(ip, recent);
  return { allowed: true };
}

Continuous Adversarial Testing

high

Regularly test your AI systems against known attack patterns and emerging threats. Red-team your own applications before attackers do.

Implementation Checklist

  • Run automated prompt injection tests weekly
  • Subscribe to security advisories for AI/ML systems
  • Participate in bug bounty programs
  • Conduct quarterly third-party security audits
Code Example
// Automated safety test
async function runSafetyTests() {
  const testCases = loadTestCases();
  const results = [];
  
  for (const test of testCases) {
    const response = await model.generate(test.input);
    const passed = !test.failurePattern.test(response);
    results.push({ test: test.id, passed, response });
  }
  
  return generateReport(results);
}

Human-in-the-Loop for Sensitive Operations

medium

Require human approval for high-risk actions like data deletion, API key generation, or access to sensitive systems.

Implementation Checklist

  • Classify operations by risk level
  • Require MFA for administrative actions
  • Implement approval workflows for destructive operations
  • Maintain audit trail of all human approvals
Code Example
// Risk-based approval
async function executeOperation(operation, user) {
  const risk = assessRisk(operation);
  
  if (risk.level === 'critical') {
    const approval = await requestApproval(user.manager, operation);
    if (!approval.granted) {
      throw new Error('Operation requires approval');
    }
  }
  
  return await operation.execute();
}

Test Your Defenses

Now that you know the guardrails, test your implementation against real attack patterns.

Run Test Suite