AI guardrail policies

AI guardrail policies define the runtime security controls that AQtive Guard applies to LLM interactions. Each policy specifies which guardrails scan messages, how findings are classified by severity, and what actions (block or redact) are triggered when thresholds are met.

Guardrail policies complement AI-SPM rules by shifting from static asset analysis to active, runtime enforcement. While rules flag historical issues, guardrail policies enforce real-time safety controls on live AI traffic.

Understanding policy structure

A guardrail policy is defined in JSON and consists of three core sections:

  • Guardrail detectors – specify the types of content analysis performed on LLM interactions.
  • Severity mapping – assigns a severity level to findings.
  • Triggers – define the actions taken when a finding meets or exceeds a severity threshold.

The following sections provide details for each of these.

Guardrail detectors

The detectors object defines which guardrail detectors are enabled in the policy. Detectors are grouped by message role (user and assistant), allowing you to apply different checks depending on who produced the content:

  • user — scans input messages from users before they reach the LLM.
  • assistant — scans output messages from the LLM before they’re returned to the user.

Each entry specifies a guardrail type and a display name:

JSON
"detectors": {
  "user": [
    { "type": "jailbreak", "name": "Jailbreak detection" },
    { "type": "pii", "name": "PII filter" },
    { "type": "secret", "name": "Secrets scanner" }
  ],
  "assistant": [
    { "type": "toxicity", "name": "Toxicity filter" },
    { "type": "pii", "name": "PII filter" }
  ]
}

Available guardrail types

  • jailbreak – Detects prompt injection and jailbreak attempts.
  • toxicity – Detects toxic, harmful, or inappropriate content.
  • pii – Detects sensitive personal data (email, credit card, SSN, phone, IP address).
  • secret – Detects API keys, tokens, and credentials.
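To illustrate how detector types map to content checks, the following is a minimal, hypothetical sketch. It is not AQtive Guard's actual implementation: real guardrails use far more robust detection than these toy regexes, and the patterns and function names here are illustrative assumptions.

```python
import re

# Hypothetical, simplified patterns -- purely illustrative, not the
# detection logic AQtive Guard actually uses.
TOY_DETECTORS = {
    "pii": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),               # email only
    "secret": re.compile(r"\b(?:sk|api|token)[-_][A-Za-z0-9]{16,}\b"),
}

def scan(text, detector_types):
    """Return the detector types that flag the given text."""
    return [t for t in detector_types
            if t in TOY_DETECTORS and TOY_DETECTORS[t].search(text)]

print(scan("contact alice@example.com", ["pii", "secret"]))  # ['pii']
```

In a real policy, the enabled detector types per role come from the detectors object shown above, and each flagged category then feeds into severity mapping.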

Tip

You can apply the same guardrail type to both input and output directions. For example, PII detection is commonly applied in both directions to prevent sensitive data from being sent to or returned from the LLM.

Severity mapping

Severity mapping assigns a severity level to guardrail categories when a finding is discovered. This enables you to adjust the urgency of flagged issues to align with your internal risk tolerance and security policies. Available severity levels from most to least severe are:

  • Critical
  • High
  • Medium
  • Low
  • Info
JSON
"severity_mapping": {
  "jailbreak": "critical",
  "toxicity": "high",
  "secret": "critical",
  "pii/email": "medium",
  "pii/credit_card": "high",
  "pii/phone": "medium",
  "pii/ip_address": "low",
  "pii/ssn": "critical"
}

Note

PII guardrails support subcategories (such as pii/email, pii/credit_card, or pii/ssn) that can each be assigned a different severity level, giving you granular control over how different types of personal data are treated.
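A severity lookup over this mapping can be sketched as follows. This is an illustrative model only; in particular, the fallback from a subcategory such as pii/email to a base pii key is an assumption, not documented gateway behavior.

```python
SEVERITY_MAPPING = {
    "jailbreak": "critical",
    "toxicity": "high",
    "secret": "critical",
    "pii/email": "medium",
    "pii/credit_card": "high",
    "pii/ssn": "critical",
}

def severity_for(category, mapping, default="info"):
    """Look up a finding's severity. Falling back to the base category
    (e.g. 'pii/email' -> 'pii') and then to a default is an assumed
    behavior for this sketch."""
    if category in mapping:
        return mapping[category]
    base = category.split("/")[0]
    return mapping.get(base, default)

print(severity_for("pii/ssn", SEVERITY_MAPPING))   # critical
print(severity_for("pii/name", SEVERITY_MAPPING))  # info (no base 'pii' key here)
```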

Triggers

Triggers define the enforcement action when a finding meets a severity threshold:

  • Block – Stops the message entirely, preventing it from being sent or delivered.
  • Redact – Masks the detected sensitive content while allowing the rest of the message to pass.
JSON
"triggers": [
  {
    "type": "redact",
    "name": "Redact medium and above",
    "severity": "medium"
  },
  {
    "type": "block",
    "name": "Block critical findings",
    "severity": "critical"
  }
]

Important

Block triggers take precedence over redact triggers. If a message matches both a block and a redact trigger, the message is blocked.

For example, given a block trigger at critical and a redact trigger at medium, the effective behavior is:

  Severity   Action
  --------   --------
  Critical   Blocked
  High       Redacted
  Medium     Redacted
  Low        Allowed
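The precedence rule above can be sketched as follows. This is an illustrative model, not the gateway's actual code: findings at or above a trigger's severity threshold activate it, and block wins over redact.

```python
# Severities ordered from least to most severe.
SEVERITY_ORDER = ["info", "low", "medium", "high", "critical"]

TRIGGERS = [
    {"type": "redact", "severity": "medium"},
    {"type": "block", "severity": "critical"},
]

def action_for(finding_severity, triggers):
    """Return 'block', 'redact', or 'allow' for a finding severity."""
    rank = SEVERITY_ORDER.index(finding_severity)
    fired = {t["type"] for t in triggers
             if rank >= SEVERITY_ORDER.index(t["severity"])}
    if "block" in fired:   # block takes precedence over redact
        return "block"
    if "redact" in fired:
        return "redact"
    return "allow"

for s in ["critical", "high", "medium", "low"]:
    print(s, "->", action_for(s, TRIGGERS))
```

Running the loop reproduces the table above: critical is blocked, high and medium are redacted, and low is allowed.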

Example

The following example policy applies jailbreak, PII, and secrets guardrails to user input, and toxicity and PII guardrails to LLM output. Critical findings are blocked, and findings at medium severity or above are redacted:

JSON
{
  "id": "production_policy",
  "name": "Production Guardrail Policy",
  "detectors": {
    "user": [
      { "type": "jailbreak", "name": "Jailbreak detection" },
      { "type": "pii", "name": "PII filter (input)" },
      { "type": "secret", "name": "Secrets scanner" }
    ],
    "assistant": [
      { "type": "toxicity", "name": "Toxicity filter" },
      { "type": "pii", "name": "PII filter (output)" }
    ]
  },
  "severity_mapping": {
    "jailbreak": "critical",
    "toxicity": "high",
    "secret": "critical",
    "pii/email": "medium",
    "pii/credit_card": "high",
    "pii/phone": "medium",
    "pii/ip_address": "low",
    "pii/ssn": "critical"
  },
  "triggers": [
    {
      "type": "redact",
      "name": "Redact medium and above",
      "severity": "medium"
    },
    {
      "type": "block",
      "name": "Block critical findings",
      "severity": "critical"
    }
  ]
}

Activate policies on the gateway

Only one policy can be active at a time. Activating a new policy automatically replaces the previous one.

The AI Gateway retrieves the active policy in one of two ways, depending on the WEB_API_BASE_URL variable in the gateway’s .env file:

  • Remote (default) – The gateway fetches the active policy from your AQG instance using WEB_API_BASE_URL and WEB_API_KEY.
  • Local – If WEB_API_BASE_URL is left blank, the gateway loads policies from a local project_db.json file on the gateway host. This is useful for air-gapped environments or testing.
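The selection between the two modes can be modeled as the following sketch. Only WEB_API_BASE_URL, WEB_API_KEY, and project_db.json come from this document; the function name and return values are illustrative assumptions, and the actual network call is omitted.

```python
import json
import os

def load_active_policy():
    """Illustrative sketch of the gateway's policy-source selection.

    A non-empty WEB_API_BASE_URL selects remote mode; a blank value
    means the policy is read from a local project_db.json file.
    """
    base_url = os.environ.get("WEB_API_BASE_URL", "").strip()
    if base_url:
        # Remote mode: the gateway would fetch the active policy from
        # the AQG instance using these values (HTTP call omitted here).
        api_key = os.environ["WEB_API_KEY"]
        return ("remote", base_url, api_key)
    # Local mode: read policies from the local file on the gateway host.
    with open("project_db.json") as f:
        return ("local", json.load(f))
```

Because this decision is made when the gateway process starts, changing either the environment variables or the active policy requires a restart, as noted below.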

Important

The gateway loads its policy at startup. After switching the active policy in AQG, restart the gateway for the change to take effect. Refer to Safe policy switching for the recommended procedure.

To manage policies in the AQG console, refer to Managing guardrail policies.