Motivation
any-guardrail now ships 20+ guardrails that do very different jobs — prompt-injection classifiers, content-safety judges, RAG-groundedness checks, off-topic/relevance, generic LLM-as-judge, and hosted moderation APIs. But there is no machine-readable way to ask "which guardrails detect prompt injection?" or "which ones run on the model's output vs the user's input?". That knowledge lives only in docstrings, CLAUDE.md prose, and the manual docs grouping — none of it queryable.
This issue proposes a guardrail taxonomy plus structured, queryable type metadata on every guardrail, so the library can:
Group — for docs (auto-generate the "Prompt injection" / "Content safety" sections instead of hand-maintaining docs/SUMMARY.md), for the README, and for the cookbook.
Enable guardrail sequencing (Iterative Guardrail Calls #26) — a cascade/chain needs to know each guardrail's risk category (to chain like-for-like, cheap-permissive → expensive-precise) and its stage (route input-guards pre-call, output-guards post-call). This metadata is the prerequisite that turns Iterative Guardrail Calls #26 from "hard-coded chains" into "select all PROMPT_INJECTION + INPUT guardrails and order them by cost."
Complements #178 / #177 (the GuardrailOutput standard): GuardrailOutput.categories records per-call, per-category results at the output level. This issue adds guardrail-level capability metadata — what a guardrail is designed to detect and how it runs — which is a different, static axis.
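As a rough illustration of the sequencing use case, the sketch below selects every PROMPT_INJECTION + INPUT guardrail and orders the result cheapest first. It uses the enum names proposed later in this issue. The cost ranking by backend, the registry entries, the guardrail names, and the `build_chain` helper are all assumptions made for the example; the issue does not define a cost field.

```python
from dataclasses import dataclass
from enum import Enum


class GuardrailCategory(str, Enum):
    PROMPT_INJECTION = "prompt_injection"
    CONTENT_SAFETY = "content_safety"


class GuardrailStage(str, Enum):
    INPUT = "input"
    OUTPUT = "output"


class BackendType(str, Enum):
    LOCAL_ENCODER = "local_encoder"
    LOCAL_DECODER = "local_decoder"
    HOSTED_API = "hosted_api"


# Hypothetical cost ranking (lower runs first); the issue does not define one.
COST_RANK = {
    BackendType.LOCAL_ENCODER: 0,
    BackendType.LOCAL_DECODER: 1,
    BackendType.HOSTED_API: 2,
}


@dataclass(frozen=True)
class GuardrailMetadata:
    categories: frozenset
    stage: GuardrailStage
    backend: BackendType


PI = GuardrailCategory.PROMPT_INJECTION

# Illustrative entries keyed by made-up names, not the real guardrail mapping.
REGISTRY = {
    "hosted_injection_api": GuardrailMetadata(
        frozenset({PI}), GuardrailStage.INPUT, BackendType.HOSTED_API
    ),
    "small_injection_classifier": GuardrailMetadata(
        frozenset({PI}), GuardrailStage.INPUT, BackendType.LOCAL_ENCODER
    ),
    "decoder_safety_judge": GuardrailMetadata(
        frozenset({GuardrailCategory.CONTENT_SAFETY}),
        GuardrailStage.OUTPUT,
        BackendType.LOCAL_DECODER,
    ),
    "decoder_injection_judge": GuardrailMetadata(
        frozenset({PI}), GuardrailStage.INPUT, BackendType.LOCAL_DECODER
    ),
}


def build_chain(category, stage):
    """Select guardrails matching category and stage, cheapest backend first."""
    selected = [
        name
        for name, meta in REGISTRY.items()
        if category in meta.categories and meta.stage is stage
    ]
    return sorted(selected, key=lambda name: COST_RANK[REGISTRY[name].backend])
```

A sequencer could then iterate over `build_chain(GuardrailCategory.PROMPT_INJECTION, GuardrailStage.INPUT)` instead of a hard-coded list. Whether a real chain should rank by backend type or by a dedicated cost field is left open here.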
Proposed taxonomy
Guardrails vary on several orthogonal axes; a single "type" field would be lossy (Granite Guardian alone does harm + bias + jailbreak + RAG-groundedness + function-calling). Proposed dimensions, each backed by a new enum:
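1. GuardrailCategory — what it detects (multi-valued; a guardrail may have several)
  PROMPT_INJECTION — incl. jailbreak / instruction-override
  CONTENT_SAFETY — harm: violence, sexual, self-harm, dangerous, criminal
  TOXICITY — hate / harassment / profanity
  PII — sensitive-data / personal-data detection
  HALLUCINATION — groundedness / RAG-faithfulness
  OFF_TOPIC — topical relevance / answer-relevance
  BIAS — social bias / fairness
  TOOL_USE — function-calling / agent-action validity
  GENERAL_JUDGE — open-ended rubric / quality scoring (bring-your-own-criteria)
2. GuardrailStage — where it runs (load-bearing for #26 sequencing)
  INPUT — screens the user prompt (pre-call)
  OUTPUT — screens the model response (post-call)
  RAG_CONTEXT — needs the retrieved document/context (groundedness)
  EITHER — runs on input or output text (most moderation classifiers)
3. OutputShape — decision form (aligns with the GuardrailOutput fields it populates)
  BINARY · MULTI_LABEL · CATEGORICAL (taxonomy/S-codes) · SCORE (scalar risk) · RUBRIC (judge score) · SPAN (offsets; forward-looking PII)
4. BackendType — how it executes
  LOCAL_ENCODER (HF/encoderfile classifier) · LOCAL_DECODER (HF/llamafile decoder LLM) · HOSTED_API (needs a key/endpoint)
Secondary metadata fields
requires_api_key: bool, multilingual: bool, multimodal: bool, vendor: str, default_license: str.
Current-guardrail mapping (the metadata we'd encode)
Guardrails: Protectai, Deepset, Jasper, Sentinel, Pangolin, InjecGuard, HarmGuard, OffTopic, DuoGuard, ShieldGemma, LlamaGuard, GraniteGuardian, Glider, Flowjudge, AnyLlm, Alinia, AzureContentSafety, AzurePromptShields, BedrockGuardrails, OpenAIModeration, LakeraGuard
(Categories are illustrative — finalize per model card during implementation.)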
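The four proposed dimensions could be written as plain standard-library enums. This is a minimal sketch: the member names come from this issue, while the lowercase string values and the `str` mixin (chosen so members serialize as plain strings) are assumptions.

```python
from enum import Enum


class GuardrailCategory(str, Enum):
    """What a guardrail detects; a guardrail may carry several."""

    PROMPT_INJECTION = "prompt_injection"
    CONTENT_SAFETY = "content_safety"
    TOXICITY = "toxicity"
    PII = "pii"
    HALLUCINATION = "hallucination"
    OFF_TOPIC = "off_topic"
    BIAS = "bias"
    TOOL_USE = "tool_use"
    GENERAL_JUDGE = "general_judge"


class GuardrailStage(str, Enum):
    """Where a guardrail runs relative to the model call."""

    INPUT = "input"
    OUTPUT = "output"
    RAG_CONTEXT = "rag_context"
    EITHER = "either"


class OutputShape(str, Enum):
    """Form of the decision a guardrail returns."""

    BINARY = "binary"
    MULTI_LABEL = "multi_label"
    CATEGORICAL = "categorical"
    SCORE = "score"
    RUBRIC = "rubric"
    SPAN = "span"


class BackendType(str, Enum):
    """How a guardrail executes."""

    LOCAL_ENCODER = "local_encoder"
    LOCAL_DECODER = "local_decoder"
    HOSTED_API = "hosted_api"
```

Because the module imports only `enum`, it can be loaded without pulling in any model backend, which is the property the mechanism section relies on.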
Mechanism: store it so it's easy to query
Goal: queryable without importing heavy backends (so list/group_by don't spin up transformers/torch).
New GuardrailMetadata Pydantic model + the enums above, in a dependency-free module (e.g. src/any_guardrail/taxonomy.py).
A central, import-free registry keyed by the existing enum: GUARDRAIL_METADATA: dict[GuardrailName, GuardrailMetadata]. This is the source of truth for queries — filtering it imports no model code.
Each guardrail class also exposes METADATA: ClassVar[GuardrailMetadata] (referencing the registry entry) for co-located discoverability, with a unit test enforcing every GuardrailName has exactly one metadata entry and the ClassVar matches the registry. This guarantees the table can't drift as guardrails are added (mirrors the existing "every new guardrail needs a GuardrailName + docs entry" checklist in CLAUDE.md).
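The registry-plus-mirror arrangement and its parity check might look roughly like the sketch below. Several things here are assumptions: a frozen dataclass stands in for the proposed Pydantic model to keep the example dependency-free, `GuardrailName` is a two-member stand-in for the library's existing enum, the `NAME` ClassVar and `check_parity` helper are hypothetical, and the category and stage values are illustrative rather than the real mapping.

```python
from dataclasses import dataclass
from enum import Enum
from typing import ClassVar


class GuardrailName(str, Enum):  # stand-in for the library's existing enum
    DEEPSET = "deepset"
    LLAMA_GUARD = "llama_guard"


class GuardrailCategory(str, Enum):
    PROMPT_INJECTION = "prompt_injection"
    CONTENT_SAFETY = "content_safety"


class GuardrailStage(str, Enum):
    INPUT = "input"
    EITHER = "either"


@dataclass(frozen=True)
class GuardrailMetadata:  # dataclass used here in place of a Pydantic model
    categories: frozenset
    stage: GuardrailStage
    requires_api_key: bool = False


# Central registry: importing or filtering it touches no model code.
# Values are illustrative only.
GUARDRAIL_METADATA = {
    GuardrailName.DEEPSET: GuardrailMetadata(
        frozenset({GuardrailCategory.PROMPT_INJECTION}), GuardrailStage.INPUT
    ),
    GuardrailName.LLAMA_GUARD: GuardrailMetadata(
        frozenset({GuardrailCategory.CONTENT_SAFETY}), GuardrailStage.EITHER
    ),
}


class Deepset:  # hypothetical guardrail class
    NAME: ClassVar[GuardrailName] = GuardrailName.DEEPSET
    METADATA: ClassVar[GuardrailMetadata] = GUARDRAIL_METADATA[NAME]


class LlamaGuard:  # hypothetical guardrail class
    NAME: ClassVar[GuardrailName] = GuardrailName.LLAMA_GUARD
    METADATA: ClassVar[GuardrailMetadata] = GUARDRAIL_METADATA[NAME]


GUARDRAIL_CLASSES = [Deepset, LlamaGuard]


def check_parity(classes):
    """Raise AssertionError if names, registry and ClassVars disagree."""
    # Every GuardrailName has exactly one registry entry (dict keys are unique).
    assert set(GUARDRAIL_METADATA) == set(GuardrailName), "registry not exhaustive"
    # Every GuardrailName is claimed by exactly one class.
    names = [cls.NAME for cls in classes]
    assert len(names) == len(set(names)), "duplicate NAME on guardrail classes"
    assert set(names) == set(GuardrailName), "class list not exhaustive"
    # Each ClassVar mirror is the registry object itself, not a copy.
    for cls in classes:
        assert cls.METADATA is GUARDRAIL_METADATA[cls.NAME], cls.__name__
```

In a test suite, `check_parity(GUARDRAIL_CLASSES)` would sit in a single unit test, so adding a `GuardrailName` member without a registry entry or a matching class fails CI.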
Query / grouping API on the factory (AnyGuardrail)
```python
# discovery / filtering (no model imports)
AnyGuardrail.metadata(GuardrailName.LLAMA_GUARD)  # -> GuardrailMetadata
AnyGuardrail.list(category=GuardrailCategory.PROMPT_INJECTION)  # -> [GuardrailName, ...]
AnyGuardrail.list(stage=GuardrailStage.OUTPUT, backend=BackendType.LOCAL_ENCODER)
AnyGuardrail.group_by("category")  # -> dict[GuardrailCategory, list[GuardrailName]]
```
Filters combine with AND across dimensions; for the multi-valued category dimension, a guardrail matches if any of its categories is in the requested set.
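The filter semantics described above can be sketched as module-level functions over a small registry. The function names `list_guardrails` and `group_by_category`, the three-member `GuardrailName` stand-in, and the registry values are assumptions for illustration; the proposal places the real methods on `AnyGuardrail`. Stage matching is exact in this sketch, so whether EITHER should also satisfy an INPUT or OUTPUT filter remains an open question.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class GuardrailName(str, Enum):  # stand-in for the library's existing enum
    DEEPSET = "deepset"
    LLAMA_GUARD = "llama_guard"
    GRANITE_GUARDIAN = "granite_guardian"


class GuardrailCategory(str, Enum):
    PROMPT_INJECTION = "prompt_injection"
    CONTENT_SAFETY = "content_safety"
    HALLUCINATION = "hallucination"


class GuardrailStage(str, Enum):
    INPUT = "input"
    EITHER = "either"


class BackendType(str, Enum):
    LOCAL_ENCODER = "local_encoder"
    LOCAL_DECODER = "local_decoder"


@dataclass(frozen=True)
class GuardrailMetadata:
    categories: frozenset
    stage: GuardrailStage
    backend: BackendType


C, S, B = GuardrailCategory, GuardrailStage, BackendType

# Illustrative values only, not the finalized mapping.
REGISTRY = {
    GuardrailName.DEEPSET: GuardrailMetadata(
        frozenset({C.PROMPT_INJECTION}), S.INPUT, B.LOCAL_ENCODER
    ),
    GuardrailName.LLAMA_GUARD: GuardrailMetadata(
        frozenset({C.CONTENT_SAFETY}), S.EITHER, B.LOCAL_DECODER
    ),
    GuardrailName.GRANITE_GUARDIAN: GuardrailMetadata(
        frozenset({C.CONTENT_SAFETY, C.PROMPT_INJECTION, C.HALLUCINATION}),
        S.EITHER,
        B.LOCAL_DECODER,
    ),
}


def _as_set(value):
    """Normalize None, one enum member, or an iterable of members."""
    if value is None:
        return None
    if isinstance(value, Enum):  # check first: str-mixin members are iterable
        return {value}
    return set(value)


def list_guardrails(category=None, stage=None, backend=None):
    """Return names matching every given filter (AND across dimensions)."""
    cats, stages, backends = _as_set(category), _as_set(stage), _as_set(backend)
    matches = []
    for name, meta in REGISTRY.items():
        # Multi-valued dimension: any overlap with the requested set matches.
        if cats is not None and not (meta.categories & cats):
            continue
        if stages is not None and meta.stage not in stages:
            continue
        if backends is not None and meta.backend not in backends:
            continue
        matches.append(name)
    return matches


def group_by_category():
    """Map each category to the guardrails that declare it."""
    groups = defaultdict(list)
    for name, meta in REGISTRY.items():
        for cat in meta.categories:
            groups[cat].append(name)
    return dict(groups)
```

A multi-category guardrail such as Granite Guardian appears under each of its categories in `group_by_category()`, so the groups overlap rather than partition the registry.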
Implementation checklist
Add GuardrailCategory, GuardrailStage, OutputShape, BackendType enums + GuardrailMetadata model (new taxonomy.py; re-export from types.py).
Add the GUARDRAIL_METADATA registry covering all current GuardrailName entries (table above).
Add METADATA: ClassVar[GuardrailMetadata] to each guardrail + a parity test (GuardrailName ↔ registry ↔ ClassVar, exhaustive).
Add AnyGuardrail.metadata(), .list(**filters), .group_by().
Use the metadata to auto-group the generated API docs (scripts/generate_api_docs.py) and docs/SUMMARY.md instead of the hand-maintained ordering.
Extend the "Adding a new guardrail" steps in CLAUDE.md to require a metadata entry.
(Optional) Export the registry to a JSON file (like schemas/guardrail_output.schema.json) so external tooling can query the taxonomy without importing the package.
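For the optional JSON export, one possible shape is a flat object keyed by guardrail name. The `registry_to_json` helper, the field selection, and the registry values below are assumptions; the issue does not specify the file layout or its path.

```python
import json
from dataclasses import dataclass
from enum import Enum


class GuardrailCategory(str, Enum):
    PROMPT_INJECTION = "prompt_injection"
    CONTENT_SAFETY = "content_safety"


class GuardrailStage(str, Enum):
    INPUT = "input"
    EITHER = "either"


@dataclass(frozen=True)
class GuardrailMetadata:
    categories: frozenset
    stage: GuardrailStage
    requires_api_key: bool = False


# Illustrative entries keyed by plain strings; the real registry would be
# keyed by GuardrailName.
REGISTRY = {
    "deepset": GuardrailMetadata(
        frozenset({GuardrailCategory.PROMPT_INJECTION}), GuardrailStage.INPUT
    ),
    "llama_guard": GuardrailMetadata(
        frozenset({GuardrailCategory.CONTENT_SAFETY}), GuardrailStage.EITHER
    ),
}


def registry_to_json(registry):
    """Serialize the registry to a stable, diff-friendly JSON string."""
    payload = {
        name: {
            # Sets are unordered, so sort for deterministic output.
            "categories": sorted(cat.value for cat in meta.categories),
            "stage": meta.stage.value,
            "requires_api_key": meta.requires_api_key,
        }
        for name, meta in registry.items()
    }
    return json.dumps(payload, indent=2, sort_keys=True)
```

Sorting keys and category lists keeps the generated file stable across runs, which makes a "regenerate and check for a diff" CI step practical.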
Open design questions
Single vs multi category — recommend multi-valued categories: set[GuardrailCategory] (Granite Guardian / Lakera need it). Confirm.
Source of truth — central registry (import-free, recommended) vs per-class ClassVar as primary. Proposal above uses the registry as canonical with a ClassVar mirror + parity test.
Stage granularity — is EITHER enough, or do we want explicit {INPUT, OUTPUT} sets? RAG_CONTEXT guards (groundedness) also take extra kwargs (output_text, context) — should the metadata also record the required validate() kwargs so a sequencer knows what to feed each guardrail?
Should GuardrailCategory reuse / align with any existing risk taxonomy (MLCommons, OWASP LLM Top 10) for interoperability?
Related
Enables #26 (Iterative Guardrail Calls) — sequencing/cascades consume category (chain like-for-like) and stage (route input vs output). This metadata is the missing prerequisite for selecting and ordering a chain programmatically rather than hard-coding it.
Complements #178 / #177 (the GuardrailOutput standard) — output-level per-category results vs this issue's guardrail-level capability metadata.