Add a guardrail taxonomy + queryable type metadata (grouping; enables guardrail sequencing #26) #182

@dni138

Motivation

any-guardrail now ships 20+ guardrails that do very different jobs — prompt-injection classifiers, content-safety judges, RAG-groundedness checks, off-topic/relevance, generic LLM-as-judge, and hosted moderation APIs. But there is no machine-readable way to ask "which guardrails detect prompt injection?" or "which ones run on the model's output vs the user's input?". That knowledge lives only in docstrings, CLAUDE.md prose, and the manual docs grouping — none of it queryable.

This issue proposes a guardrail taxonomy plus structured, queryable type metadata on every guardrail, so the library can:

  1. Discover / filter: AnyGuardrail.list(category=GuardrailCategory.PROMPT_INJECTION).
  2. Group — for docs (auto-generate the "Prompt injection" / "Content safety" sections instead of hand-maintaining docs/SUMMARY.md), for the README, and for the cookbook.
  3. Enable guardrail sequencing (Iterative Guardrail Calls #26) — a cascade/chain needs to know each guardrail's risk category (to chain like-for-like, cheap-permissive → expensive-precise) and its stage (route input-guards pre-call, output-guards post-call). This metadata is the prerequisite that turns Iterative Guardrail Calls #26 from "hard-coded chains" into "select all PROMPT_INJECTION + INPUT guardrails and order them by cost."
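To make the sequencing use case concrete, here is a rough sketch of "select all PROMPT_INJECTION + INPUT guardrails and order them by cost". Everything in it is illustrative: the `Entry` record, the `COST_RANK` ordering (which assumes local encoders are cheapest and hosted APIs most expensive), and the rule that an EITHER guardrail qualifies as an input guard are assumptions, not part of the proposal. The four sample entries reuse values from the mapping table in this issue.

```python
from dataclasses import dataclass
from enum import Enum


class GuardrailCategory(Enum):
    PROMPT_INJECTION = "prompt-injection"
    CONTENT_SAFETY = "content-safety"


class GuardrailStage(Enum):
    INPUT = "input"
    OUTPUT = "output"
    EITHER = "either"


class BackendType(Enum):
    LOCAL_ENCODER = "local-encoder"
    LOCAL_DECODER = "local-decoder"
    HOSTED_API = "hosted-api"


# Hypothetical cost ranking; the proposal does not define a cost field.
COST_RANK = {
    BackendType.LOCAL_ENCODER: 0,
    BackendType.LOCAL_DECODER: 1,
    BackendType.HOSTED_API: 2,
}


@dataclass(frozen=True)
class Entry:
    name: str
    categories: frozenset
    stages: frozenset
    backend: BackendType


C, S, B = GuardrailCategory, GuardrailStage, BackendType
ENTRIES = [
    Entry("AzurePromptShields", frozenset({C.PROMPT_INJECTION}), frozenset({S.INPUT}), B.HOSTED_API),
    Entry("GraniteGuardian", frozenset({C.CONTENT_SAFETY, C.PROMPT_INJECTION}),
          frozenset({S.INPUT, S.OUTPUT}), B.LOCAL_DECODER),
    Entry("Protectai", frozenset({C.PROMPT_INJECTION}), frozenset({S.EITHER}), B.LOCAL_ENCODER),
    Entry("HarmGuard", frozenset({C.CONTENT_SAFETY}), frozenset({S.EITHER}), B.LOCAL_ENCODER),
]


def input_chain(category):
    """Names of input-capable guardrails for one category, cheapest backend first."""
    input_capable = {S.INPUT, S.EITHER}  # assumption: EITHER can run pre-call
    selected = [e for e in ENTRIES if category in e.categories and e.stages & input_capable]
    return [e.name for e in sorted(selected, key=lambda e: COST_RANK[e.backend])]


chain = input_chain(C.PROMPT_INJECTION)
```

A sequencer for #26 could then walk `chain` in order and escalate to the next guardrail only when needed; how it decides to escalate is out of scope for this issue.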

Complements #178 / #177 (the GuardrailOutput standard): GuardrailOutput.categories records per-call, per-category results at the output level. This issue adds guardrail-level capability metadata — what a guardrail is designed to detect and how it runs — which is a different, static axis.


Proposed taxonomy

Guardrails vary on several orthogonal axes; a single "type" field would be lossy (Granite Guardian alone does harm + bias + jailbreak + RAG-groundedness + function-calling). Proposed dimensions, each backed by a new enum:

1. GuardrailCategory: what it detects (multi-valued; a guardrail may have several)

  • PROMPT_INJECTION — incl. jailbreak / instruction-override
  • CONTENT_SAFETY — harm: violence, sexual, self-harm, dangerous, criminal
  • TOXICITY — hate / harassment / profanity
  • PII — sensitive-data / personal-data detection
  • HALLUCINATION — groundedness / RAG-faithfulness
  • OFF_TOPIC — topical relevance / answer-relevance
  • BIAS — social bias / fairness
  • TOOL_USE — function-calling / agent-action validity
  • GENERAL_JUDGE — open-ended rubric / quality scoring (bring-your-own-criteria)

2. GuardrailStage: where it runs (load-bearing for #26 sequencing)

  • INPUT — screens the user prompt (pre-call)
  • OUTPUT — screens the model response (post-call)
  • RAG_CONTEXT — needs the retrieved document/context (groundedness)
  • EITHER — runs on input or output text (most moderation classifiers)

3. OutputShape — decision form (aligns with the GuardrailOutput fields it populates)

  • BINARY · MULTI_LABEL · CATEGORICAL (taxonomy/S-codes) · SCORE (scalar risk) · RUBRIC (judge score) · SPAN (offsets; forward-looking PII)

4. BackendType — how it executes

  • LOCAL_ENCODER (HF/encoderfile classifier) · LOCAL_DECODER (HF/llamafile decoder LLM) · HOSTED_API (needs a key/endpoint)

Secondary metadata fields

requires_api_key: bool, multilingual: bool, multimodal: bool, vendor: str, default_license: str.
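
For concreteness, a minimal sketch of the four enums and the metadata record. It uses a frozen stdlib dataclass in place of the Pydantic model this issue proposes, purely so it runs without dependencies. The enum string values (borrowed from the lowercase labels in the mapping table in this issue), the set-valued field names `categories`, `stages`, and `output_shapes`, and the field defaults are assumptions, not settled API.

```python
from dataclasses import dataclass
from enum import Enum


class GuardrailCategory(str, Enum):
    PROMPT_INJECTION = "prompt-injection"
    CONTENT_SAFETY = "content-safety"
    TOXICITY = "toxicity"
    PII = "pii"
    HALLUCINATION = "hallucination"
    OFF_TOPIC = "off-topic"
    BIAS = "bias"
    TOOL_USE = "tool-use"
    GENERAL_JUDGE = "general-judge"


class GuardrailStage(str, Enum):
    INPUT = "input"
    OUTPUT = "output"
    RAG_CONTEXT = "rag-context"
    EITHER = "either"


class OutputShape(str, Enum):
    BINARY = "binary"
    MULTI_LABEL = "multi-label"
    CATEGORICAL = "categorical"
    SCORE = "score"
    RUBRIC = "rubric"
    SPAN = "span"


class BackendType(str, Enum):
    LOCAL_ENCODER = "local-encoder"
    LOCAL_DECODER = "local-decoder"
    HOSTED_API = "hosted-api"


@dataclass(frozen=True)
class GuardrailMetadata:
    # Set-valued fields are an assumption; see the open design questions.
    categories: frozenset
    stages: frozenset
    output_shapes: frozenset
    backend: BackendType
    requires_api_key: bool = False
    multilingual: bool = False
    multimodal: bool = False
    vendor: str = ""
    default_license: str = ""


# Example entry, using the LlamaGuard row of the mapping table.
LLAMA_GUARD = GuardrailMetadata(
    categories=frozenset({GuardrailCategory.CONTENT_SAFETY}),
    stages=frozenset({GuardrailStage.INPUT, GuardrailStage.OUTPUT}),
    output_shapes=frozenset({OutputShape.CATEGORICAL}),
    backend=BackendType.LOCAL_DECODER,
)
```

Freezing the record keeps registry entries hashable and prevents accidental mutation at runtime; a Pydantic model with `frozen=True` would give the same property plus validation.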


Current-guardrail mapping (the metadata we'd encode)

| Guardrail | Category(ies) | Stage | Output shape | Backend |
| --- | --- | --- | --- | --- |
| Protectai | prompt-injection | either | binary | local-encoder |
| Deepset | prompt-injection | either | binary | local-encoder |
| Jasper | prompt-injection | either | binary | local-encoder |
| Sentinel | prompt-injection | either | binary | local-encoder |
| Pangolin | prompt-injection | either | binary | local-encoder |
| InjecGuard | prompt-injection | either | binary | local-encoder |
| HarmGuard | content-safety | either | binary | local-encoder |
| OffTopic | off-topic | input | binary/score | local-encoder |
| DuoGuard | content-safety, toxicity | either | multi-label | local-decoder |
| ShieldGemma | content-safety | either | binary | local-decoder |
| LlamaGuard | content-safety | input, output | categorical (S-codes) | local-decoder |
| GraniteGuardian | content-safety, bias, prompt-injection, hallucination, off-topic, tool-use | input, output, rag-context | categorical/score | local-decoder |
| Glider | general-judge | either | rubric | local-decoder |
| Flowjudge | general-judge | either | rubric | local-decoder |
| AnyLlm | general-judge | either | rubric/binary | hosted-api |
| Alinia | content-safety, toxicity | either | categorical | hosted-api |
| AzureContentSafety | content-safety, toxicity | either | categorical (+severity) | hosted-api |
| AzurePromptShields | prompt-injection | input | binary | hosted-api |
| BedrockGuardrails | content-safety, pii, off-topic | either | categorical | hosted-api |
| OpenAIModeration | content-safety, toxicity | either | categorical | hosted-api |
| LakeraGuard | prompt-injection, content-safety, pii | either | categorical | hosted-api |

(Categories are illustrative — finalize per model card during implementation.)


Mechanism: store it so it's easy to query

Goal: queryable without importing heavy backends (so list/group_by don't spin up transformers/torch).

  • New GuardrailMetadata Pydantic model + the enums above, in a dependency-free module (e.g. src/any_guardrail/taxonomy.py).
  • A central, import-free registry keyed by the existing enum: GUARDRAIL_METADATA: dict[GuardrailName, GuardrailMetadata]. This is the source of truth for queries — filtering it imports no model code.
  • Each guardrail class also exposes METADATA: ClassVar[GuardrailMetadata] (referencing the registry entry) for co-located discoverability, with a unit test enforcing every GuardrailName has exactly one metadata entry and the ClassVar matches the registry. This guarantees the table can't drift as guardrails are added (mirrors the existing "every new guardrail needs a GuardrailName + docs entry" checklist in CLAUDE.md).
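
A rough sketch of the registry, the ClassVar mirror, and the parity test described above. The two-member `GuardrailName` enum, its string values, the trimmed-down `GuardrailMetadata`, and the `GUARDRAIL_CLASSES` lookup are stand-ins invented for illustration; the real enum and class lookup live in the library and may look different.

```python
from dataclasses import dataclass
from enum import Enum
from typing import ClassVar


class GuardrailName(str, Enum):  # stand-in for the library's existing enum
    PROTECTAI = "protectai"
    LLAMA_GUARD = "llama_guard"


@dataclass(frozen=True)
class GuardrailMetadata:  # trimmed to two fields for brevity
    categories: frozenset
    backend: str


# Canonical registry: filtering this dict imports no model code.
GUARDRAIL_METADATA = {
    GuardrailName.PROTECTAI: GuardrailMetadata(frozenset({"prompt-injection"}), "local-encoder"),
    GuardrailName.LLAMA_GUARD: GuardrailMetadata(frozenset({"content-safety"}), "local-decoder"),
}


class Protectai:
    # Mirror of the registry entry, co-located with the class.
    METADATA: ClassVar[GuardrailMetadata] = GUARDRAIL_METADATA[GuardrailName.PROTECTAI]


class LlamaGuard:
    METADATA: ClassVar[GuardrailMetadata] = GUARDRAIL_METADATA[GuardrailName.LLAMA_GUARD]


# Hypothetical name -> class lookup, used only by the parity test.
GUARDRAIL_CLASSES = {
    GuardrailName.PROTECTAI: Protectai,
    GuardrailName.LLAMA_GUARD: LlamaGuard,
}


def test_metadata_parity():
    """GuardrailName <-> registry <-> ClassVar must agree exhaustively."""
    assert set(GUARDRAIL_METADATA) == set(GuardrailName), "registry is missing or has extra names"
    assert set(GUARDRAIL_CLASSES) == set(GuardrailName), "class lookup is missing or has extra names"
    for name, cls in GUARDRAIL_CLASSES.items():
        # Identity check: the ClassVar must be the registry object, not a copy.
        assert cls.METADATA is GUARDRAIL_METADATA[name], f"{name.value} metadata drifted"
```

Only the parity test needs to import the guardrail classes; `list`/`group_by` would read `GUARDRAIL_METADATA` alone, which keeps queries free of heavy imports.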

Query / grouping API on the factory (AnyGuardrail)

# discovery / filtering (no model imports)
AnyGuardrail.metadata(GuardrailName.LLAMA_GUARD)          # -> GuardrailMetadata
AnyGuardrail.list(category=GuardrailCategory.PROMPT_INJECTION)   # -> [GuardrailName, ...]
AnyGuardrail.list(stage=GuardrailStage.OUTPUT, backend=BackendType.LOCAL_ENCODER)
AnyGuardrail.group_by("category")                          # -> dict[GuardrailCategory, list[GuardrailName]]

Filters combine with AND across dimensions; a guardrail with multiple categories matches if any of its categories is in the requested set.
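
One possible reading of these filter semantics, sketched as free functions over a small stand-in registry. The string keys stand in for `GuardrailName` members, the function names `list_guardrails` and `group_by_category` are placeholders for the proposed `AnyGuardrail.list` / `AnyGuardrail.group_by`, and the rule that an EITHER guardrail matches a query for INPUT or OUTPUT is an assumption this issue has not decided. The four entries reuse rows from the mapping table.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class GuardrailCategory(Enum):
    PROMPT_INJECTION = "prompt-injection"
    CONTENT_SAFETY = "content-safety"
    PII = "pii"


class GuardrailStage(Enum):
    INPUT = "input"
    OUTPUT = "output"
    EITHER = "either"


class BackendType(Enum):
    LOCAL_ENCODER = "local-encoder"
    LOCAL_DECODER = "local-decoder"
    HOSTED_API = "hosted-api"


@dataclass(frozen=True)
class GuardrailMetadata:
    categories: frozenset
    stages: frozenset
    backend: BackendType


def _meta(categories, stages, backend):
    return GuardrailMetadata(frozenset(categories), frozenset(stages), backend)


C, S, B = GuardrailCategory, GuardrailStage, BackendType
REGISTRY = {
    "protectai": _meta({C.PROMPT_INJECTION}, {S.EITHER}, B.LOCAL_ENCODER),
    "llama_guard": _meta({C.CONTENT_SAFETY}, {S.INPUT, S.OUTPUT}, B.LOCAL_DECODER),
    "azure_prompt_shields": _meta({C.PROMPT_INJECTION}, {S.INPUT}, B.HOSTED_API),
    "lakera_guard": _meta({C.PROMPT_INJECTION, C.CONTENT_SAFETY, C.PII}, {S.EITHER}, B.HOSTED_API),
}


def _as_set(value):
    """Normalize None, a single enum member, or an iterable of members."""
    if value is None:
        return None
    return {value} if isinstance(value, Enum) else set(value)


def _stage_matches(stages, wanted):
    if stages & wanted:
        return True
    # Assumption: EITHER satisfies a query for INPUT or OUTPUT.
    return S.EITHER in stages and bool(wanted & {S.INPUT, S.OUTPUT})


def list_guardrails(category=None, stage=None, backend=None):
    """AND across dimensions; any-match within a multi-valued dimension."""
    cats, stgs, bks = _as_set(category), _as_set(stage), _as_set(backend)
    names = []
    for name, meta in REGISTRY.items():
        if cats is not None and not (meta.categories & cats):
            continue
        if stgs is not None and not _stage_matches(meta.stages, stgs):
            continue
        if bks is not None and meta.backend not in bks:
            continue
        names.append(name)
    return names


def group_by_category():
    """A guardrail with several categories appears under each of them."""
    groups = defaultdict(list)
    for name, meta in REGISTRY.items():
        for cat in meta.categories:
            groups[cat].append(name)
    return dict(groups)
```

Without the EITHER rule, `list_guardrails(stage=S.OUTPUT, backend=B.LOCAL_ENCODER)` would return nothing for the current mapping table, since every local-encoder guardrail is tagged either or input; that is part of why stage granularity is listed as an open question.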


Implementation checklist

  • Add GuardrailCategory, GuardrailStage, OutputShape, BackendType enums + GuardrailMetadata model (new taxonomy.py; re-export from types.py).
  • Add the GUARDRAIL_METADATA registry covering all current GuardrailName entries (table above).
  • Add METADATA: ClassVar[GuardrailMetadata] to each guardrail + a parity test (GuardrailName ↔ registry ↔ ClassVar, exhaustive).
  • Add AnyGuardrail.metadata(), .list(**filters), .group_by().
  • Use the metadata to auto-group the generated API docs (scripts/generate_api_docs.py) and docs/SUMMARY.md instead of the hand-maintained ordering.
  • Extend the "Adding a new guardrail" steps in CLAUDE.md to require a metadata entry.
  • (Optional) Export the registry to a JSON file (like schemas/guardrail_output.schema.json) so external tooling can query the taxonomy without importing the package.
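
The optional JSON export could be as small as the sketch below. The payload layout, the string registry keys (standing in for `GuardrailName` members), and the helper name `registry_to_json` are assumptions; no output path or schema is implied.

```python
import json
from dataclasses import dataclass
from enum import Enum


class GuardrailCategory(str, Enum):
    PROMPT_INJECTION = "prompt-injection"
    CONTENT_SAFETY = "content-safety"
    PII = "pii"


class BackendType(str, Enum):
    LOCAL_ENCODER = "local-encoder"
    HOSTED_API = "hosted-api"


@dataclass(frozen=True)
class GuardrailMetadata:  # trimmed to three fields for brevity
    categories: frozenset
    backend: BackendType
    requires_api_key: bool = False


REGISTRY = {
    "protectai": GuardrailMetadata(
        frozenset({GuardrailCategory.PROMPT_INJECTION}),
        BackendType.LOCAL_ENCODER,
    ),
    "lakera_guard": GuardrailMetadata(
        frozenset({
            GuardrailCategory.PROMPT_INJECTION,
            GuardrailCategory.CONTENT_SAFETY,
            GuardrailCategory.PII,
        }),
        BackendType.HOSTED_API,
        requires_api_key=True,
    ),
}


def registry_to_json(registry):
    """Serialize the registry with enum values as plain strings.

    Sets are emitted as sorted lists so the output is stable across runs
    and diffs cleanly when checked into the repo.
    """
    payload = {
        name: {
            "categories": sorted(c.value for c in meta.categories),
            "backend": meta.backend.value,
            "requires_api_key": meta.requires_api_key,
        }
        for name, meta in registry.items()
    }
    return json.dumps(payload, indent=2, sort_keys=True)


exported = registry_to_json(REGISTRY)
```

External tooling can then read the file with any JSON parser and filter on the string values without installing the package.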

Open design questions

  1. Single vs multi category — recommend multi-valued categories: set[GuardrailCategory] (Granite Guardian / Lakera need it). Confirm.
  2. Source of truth — central registry (import-free, recommended) vs per-class ClassVar as primary. Proposal above uses the registry as canonical with a ClassVar mirror + parity test.
  3. Stage granularity — is EITHER enough, or do we want explicit {INPUT, OUTPUT} sets? RAG_CONTEXT guards (groundedness) also take extra kwargs (output_text, context) — should the metadata also record the required validate() kwargs so a sequencer knows what to feed each guardrail?
  4. Should GuardrailCategory reuse / align with any existing risk taxonomy (MLCommons, OWASP LLM Top 10) for interoperability?
