Add a guardrail taxonomy + queryable type metadata (grouping; enables guardrail sequencing #26)

## Motivation

`any-guardrail` now ships 20+ guardrails that do very different jobs — prompt-injection classifiers, content-safety judges, RAG-groundedness checks, off-topic/relevance, generic LLM-as-judge, and hosted moderation APIs. But there is **no machine-readable way to ask "which guardrails detect prompt injection?"** or "which ones run on the model's *output* vs the user's *input*?". That knowledge lives only in docstrings, `CLAUDE.md` prose, and the manual docs grouping — none of it queryable.

This issue proposes a **guardrail taxonomy** plus **structured, queryable type metadata** on every guardrail, so the library can:

1. **Discover / filter** — `AnyGuardrail.list(category=GuardrailCategory.PROMPT_INJECTION)`.
2. **Group** — for docs (auto-generate the "Prompt injection" / "Content safety" sections instead of hand-maintaining `docs/SUMMARY.md`), for the README, and for the cookbook.
3. **Enable guardrail sequencing (#26)** — a cascade/chain needs to know each guardrail's **risk category** (to chain like-for-like, cheap-permissive → expensive-precise) and its **stage** (route input-guards pre-call, output-guards post-call). This metadata is the prerequisite that turns #26 from "hard-coded chains" into "select all `PROMPT_INJECTION` + `INPUT` guardrails and order them by cost."

Complements #178 / #177 (the `GuardrailOutput` standard): `GuardrailOutput.categories` records *per-call, per-category results at the output level*. This issue adds *guardrail-level capability metadata* — what a guardrail is **designed to detect** and **how it runs** — which is a different, static axis.

---

## Proposed taxonomy

Guardrails vary on several orthogonal axes; a single "type" field would be lossy (Granite Guardian alone does harm + bias + jailbreak + RAG-groundedness + function-calling). Proposed dimensions, each backed by a new enum:

### 1. `GuardrailCategory` — *what* it detects (multi-valued; a guardrail may have several)
- `PROMPT_INJECTION` — incl. jailbreak / instruction-override
- `CONTENT_SAFETY` — harm: violence, sexual, self-harm, dangerous, criminal
- `TOXICITY` — hate / harassment / profanity
- `PII` — sensitive-data / personal-data detection
- `HALLUCINATION` — groundedness / RAG-faithfulness
- `OFF_TOPIC` — topical relevance / answer-relevance
- `BIAS` — social bias / fairness
- `TOOL_USE` — function-calling / agent-action validity
- `GENERAL_JUDGE` — open-ended rubric / quality scoring (bring-your-own-criteria)

### 2. `GuardrailStage` — *where* it runs (load-bearing for #26 sequencing)
- `INPUT` — screens the user prompt (pre-call)
- `OUTPUT` — screens the model response (post-call)
- `RAG_CONTEXT` — needs the retrieved document/context (groundedness)
- `EITHER` — runs on input or output text (most moderation classifiers)

### 3. `OutputShape` — decision form (aligns with the `GuardrailOutput` fields it populates)
- `BINARY`
- `MULTI_LABEL`
- `CATEGORICAL` (taxonomy/S-codes)
- `SCORE` (scalar risk)
- `RUBRIC` (judge score)
- `SPAN` (offsets; forward-looking PII)

### 4. `BackendType` — how it executes
- `LOCAL_ENCODER` (HF/encoderfile classifier)
- `LOCAL_DECODER` (HF/llamafile decoder LLM)
- `HOSTED_API` (needs a key/endpoint)

### Secondary metadata fields
`requires_api_key: bool`, `multilingual: bool`, `multimodal: bool`, `vendor: str`, `default_license: str`.
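
A minimal sketch of how the four enums and the metadata record could look. It uses a frozen dataclass as a dependency-free stand-in for the proposed Pydantic model. The enum string values (copied from the spelling used in the mapping table), the set-valued `categories` / `stages` / `output_shapes` fields, and the default values are all assumptions, since several of them are still open design questions.

```python
from dataclasses import dataclass
from enum import Enum


class GuardrailCategory(str, Enum):
    PROMPT_INJECTION = "prompt-injection"
    CONTENT_SAFETY = "content-safety"
    TOXICITY = "toxicity"
    PII = "pii"
    HALLUCINATION = "hallucination"
    OFF_TOPIC = "off-topic"
    BIAS = "bias"
    TOOL_USE = "tool-use"
    GENERAL_JUDGE = "general-judge"


class GuardrailStage(str, Enum):
    INPUT = "input"
    OUTPUT = "output"
    RAG_CONTEXT = "rag-context"
    EITHER = "either"


class OutputShape(str, Enum):
    BINARY = "binary"
    MULTI_LABEL = "multi-label"
    CATEGORICAL = "categorical"
    SCORE = "score"
    RUBRIC = "rubric"
    SPAN = "span"


class BackendType(str, Enum):
    LOCAL_ENCODER = "local-encoder"
    LOCAL_DECODER = "local-decoder"
    HOSTED_API = "hosted-api"


@dataclass(frozen=True)
class GuardrailMetadata:
    # Set-valued fields are an assumption (see the open design questions).
    categories: frozenset[GuardrailCategory]
    stages: frozenset[GuardrailStage]
    output_shapes: frozenset[OutputShape]
    backend: BackendType
    # Secondary fields; the defaults here are illustrative only.
    requires_api_key: bool = False
    multilingual: bool = False
    multimodal: bool = False
    vendor: str = ""
    default_license: str = ""


# Example entry, using the (illustrative) LlamaGuard row of the mapping table.
llama_guard = GuardrailMetadata(
    categories=frozenset({GuardrailCategory.CONTENT_SAFETY}),
    stages=frozenset({GuardrailStage.INPUT, GuardrailStage.OUTPUT}),
    output_shapes=frozenset({OutputShape.CATEGORICAL}),
    backend=BackendType.LOCAL_DECODER,
)
```

Freezing the record and using `frozenset` keeps entries hashable and prevents accidental mutation of shared registry data.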

---

## Current-guardrail mapping (the metadata we'd encode)

| Guardrail | Category(ies) | Stage | Output shape | Backend |
|---|---|---|---|---|
| `Protectai` | prompt-injection | either | binary | local-encoder |
| `Deepset` | prompt-injection | either | binary | local-encoder |
| `Jasper` | prompt-injection | either | binary | local-encoder |
| `Sentinel` | prompt-injection | either | binary | local-encoder |
| `Pangolin` | prompt-injection | either | binary | local-encoder |
| `InjecGuard` | prompt-injection | either | binary | local-encoder |
| `HarmGuard` | content-safety | either | binary | local-encoder |
| `OffTopic` | off-topic | input | binary/score | local-encoder |
| `DuoGuard` | content-safety, toxicity | either | multi-label | local-decoder |
| `ShieldGemma` | content-safety | either | binary | local-decoder |
| `LlamaGuard` | content-safety | input, output | categorical (S-codes) | local-decoder |
| `GraniteGuardian` | content-safety, bias, prompt-injection, hallucination, off-topic, tool-use | input, output, rag-context | categorical/score | local-decoder |
| `Glider` | general-judge | either | rubric | local-decoder |
| `Flowjudge` | general-judge | either | rubric | local-decoder |
| `AnyLlm` | general-judge | either | rubric/binary | hosted-api |
| `Alinia` | content-safety, toxicity | either | categorical | hosted-api |
| `AzureContentSafety` | content-safety, toxicity | either | categorical (+severity) | hosted-api |
| `AzurePromptShields` | prompt-injection | input | binary | hosted-api |
| `BedrockGuardrails` | content-safety, pii, off-topic | either | categorical | hosted-api |
| `OpenAIModeration` | content-safety, toxicity | either | categorical | hosted-api |
| `LakeraGuard` | prompt-injection, content-safety, pii | either | categorical | hosted-api |

*(Categories are illustrative — finalize per model card during implementation.)*

---

## Mechanism: store it so it's easy to query

Goal: queryable **without importing heavy backends** (so `list`/`group_by` don't spin up `transformers`/`torch`).

- New `GuardrailMetadata` Pydantic model + the enums above, in a dependency-free module (e.g. `src/any_guardrail/taxonomy.py`).
- A central registry that imports no model code, keyed by the existing enum: `GUARDRAIL_METADATA: dict[GuardrailName, GuardrailMetadata]`. This is the **source of truth for queries**: filtering it never triggers a backend import.
- Each guardrail class also exposes `METADATA: ClassVar[GuardrailMetadata]` (referencing the registry entry) for co-located discoverability, with a unit test enforcing **every `GuardrailName` has exactly one metadata entry** and the ClassVar matches the registry. This guarantees the table can't drift as guardrails are added (mirrors the existing "every new guardrail needs a `GuardrailName` + docs entry" checklist in `CLAUDE.md`).
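
The registry, the `ClassVar` mirror, and the parity check described above could be sketched roughly as follows. Everything here is a trimmed stand-in: the `GuardrailName` enum is reduced to two members (only `LLAMA_GUARD` appears in this issue; `PROTECTAI` is an assumed member name), the metadata record carries a single field, and `GUARDRAIL_CLASSES` is a hypothetical name-to-class map standing in for however the factory resolves classes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import ClassVar


class GuardrailName(str, Enum):  # stand-in for the library's existing enum
    PROTECTAI = "protectai"      # assumed member name
    LLAMA_GUARD = "llama_guard"


class GuardrailCategory(str, Enum):
    PROMPT_INJECTION = "prompt-injection"
    CONTENT_SAFETY = "content-safety"


@dataclass(frozen=True)
class GuardrailMetadata:  # trimmed to one field for the sketch
    categories: frozenset[GuardrailCategory]


# Canonical registry: plain data, no model code imported.
GUARDRAIL_METADATA: dict[GuardrailName, GuardrailMetadata] = {
    GuardrailName.PROTECTAI: GuardrailMetadata(
        frozenset({GuardrailCategory.PROMPT_INJECTION})
    ),
    GuardrailName.LLAMA_GUARD: GuardrailMetadata(
        frozenset({GuardrailCategory.CONTENT_SAFETY})
    ),
}


class Protectai:
    # The ClassVar references the registry entry rather than redefining it.
    METADATA: ClassVar[GuardrailMetadata] = GUARDRAIL_METADATA[GuardrailName.PROTECTAI]


class LlamaGuard:
    METADATA: ClassVar[GuardrailMetadata] = GUARDRAIL_METADATA[GuardrailName.LLAMA_GUARD]


# Hypothetical lookup used only by the parity check.
GUARDRAIL_CLASSES = {
    GuardrailName.PROTECTAI: Protectai,
    GuardrailName.LLAMA_GUARD: LlamaGuard,
}


def check_metadata_parity() -> list[str]:
    """Return a list of problems; an empty list means the three views agree."""
    problems = []
    for name in GuardrailName:
        entry = GUARDRAIL_METADATA.get(name)
        if entry is None:
            problems.append(f"{name.name}: missing registry entry")
            continue
        cls = GUARDRAIL_CLASSES.get(name)
        if cls is None or getattr(cls, "METADATA", None) is not entry:
            problems.append(f"{name.name}: ClassVar does not match registry")
    return problems
```

In a test suite this would reduce to `assert check_metadata_parity() == []`. Because the registry is a dict keyed by the enum, "exactly one entry" only needs the missing-entry check; duplicates cannot occur.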

### Query / grouping API on the factory (`AnyGuardrail`)
```python
# discovery / filtering (no model imports)
AnyGuardrail.metadata(GuardrailName.LLAMA_GUARD)          # -> GuardrailMetadata
AnyGuardrail.list(category=GuardrailCategory.PROMPT_INJECTION)   # -> [GuardrailName, ...]
AnyGuardrail.list(stage=GuardrailStage.OUTPUT, backend=BackendType.LOCAL_ENCODER)
AnyGuardrail.group_by("category")                          # -> dict[GuardrailCategory, list[GuardrailName]]
```
Filters combine with AND across dimensions. For the multi-valued `category` dimension, a guardrail matches if *any* of its categories is in the requested set.
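
The filter semantics can be illustrated with a small standalone sketch. The function names `list_guardrails` and `group_by_category` are placeholders for the proposed `AnyGuardrail.list` / `AnyGuardrail.group_by`, the registry holds four illustrative rows from the mapping table, and every `GuardrailName` member other than `LLAMA_GUARD` is an assumed name. Stage matching is literal here, so `EITHER` does not match a request for `INPUT`; whether it should is one of the open design questions.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class GuardrailName(str, Enum):  # member names other than LLAMA_GUARD are assumed
    PROTECTAI = "protectai"
    AZURE_PROMPT_SHIELDS = "azure_prompt_shields"
    LLAMA_GUARD = "llama_guard"
    LAKERA_GUARD = "lakera_guard"


class GuardrailCategory(str, Enum):
    PROMPT_INJECTION = "prompt-injection"
    CONTENT_SAFETY = "content-safety"
    PII = "pii"


class GuardrailStage(str, Enum):
    INPUT = "input"
    OUTPUT = "output"
    EITHER = "either"


class BackendType(str, Enum):
    LOCAL_ENCODER = "local-encoder"
    LOCAL_DECODER = "local-decoder"
    HOSTED_API = "hosted-api"


@dataclass(frozen=True)
class GuardrailMetadata:
    categories: frozenset
    stages: frozenset
    backend: BackendType


C, S, B, N = GuardrailCategory, GuardrailStage, BackendType, GuardrailName
REGISTRY = {
    N.PROTECTAI: GuardrailMetadata(
        frozenset({C.PROMPT_INJECTION}), frozenset({S.EITHER}), B.LOCAL_ENCODER),
    N.AZURE_PROMPT_SHIELDS: GuardrailMetadata(
        frozenset({C.PROMPT_INJECTION}), frozenset({S.INPUT}), B.HOSTED_API),
    N.LLAMA_GUARD: GuardrailMetadata(
        frozenset({C.CONTENT_SAFETY}), frozenset({S.INPUT, S.OUTPUT}), B.LOCAL_DECODER),
    N.LAKERA_GUARD: GuardrailMetadata(
        frozenset({C.PROMPT_INJECTION, C.CONTENT_SAFETY, C.PII}),
        frozenset({S.EITHER}), B.HOSTED_API),
}


def _as_set(value):
    """Accept None, a single enum member, or an iterable of members."""
    if value is None:
        return None
    if isinstance(value, Enum):  # checked first: str-enums are also iterable
        return {value}
    return set(value)


def list_guardrails(category=None, stage=None, backend=None):
    """AND across dimensions; any-overlap within a multi-valued dimension."""
    categories, stages, backends = _as_set(category), _as_set(stage), _as_set(backend)
    matches = []
    for name, meta in REGISTRY.items():
        if categories is not None and not (meta.categories & categories):
            continue
        if stages is not None and not (meta.stages & stages):
            continue
        if backends is not None and meta.backend not in backends:
            continue
        matches.append(name)
    return matches


def group_by_category():
    """A guardrail with several categories appears in each of its groups."""
    groups = defaultdict(list)
    for name, meta in REGISTRY.items():
        for category in meta.categories:
            groups[category].append(name)
    return dict(groups)
```

For example, `list_guardrails(category=C.PROMPT_INJECTION, backend=B.HOSTED_API)` keeps only the hosted prompt-injection guardrails, and a guardrail such as Lakera shows up under three keys of `group_by_category()`.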

---

## Implementation checklist
- [ ] Add `GuardrailCategory`, `GuardrailStage`, `OutputShape`, `BackendType` enums + `GuardrailMetadata` model (new `taxonomy.py`; re-export from `types.py`).
- [ ] Add the `GUARDRAIL_METADATA` registry covering all current `GuardrailName` entries (table above).
- [ ] Add `METADATA: ClassVar[GuardrailMetadata]` to each guardrail + a parity test (`GuardrailName` ↔ registry ↔ ClassVar, exhaustive).
- [ ] Add `AnyGuardrail.metadata()`, `.list(**filters)`, `.group_by()`.
- [ ] Use the metadata to **auto-group** the generated API docs (`scripts/generate_api_docs.py`) and `docs/SUMMARY.md` instead of the hand-maintained ordering.
- [ ] Extend the "Adding a new guardrail" steps in `CLAUDE.md` to require a metadata entry.
- [ ] (Optional) Export the registry to a JSON file (like `schemas/guardrail_output.schema.json`) so external tooling can query the taxonomy without importing the package.
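
For the optional JSON export, one possible shape is a plain mapping from guardrail name to its metadata fields. This is only a sketch: the function name `registry_to_json`, the trimmed metadata record, the `LAKERA_GUARD` member name, and the output layout are all assumptions, and no file path or schema is implied.

```python
import json
from dataclasses import dataclass
from enum import Enum


class GuardrailName(str, Enum):
    LLAMA_GUARD = "llama_guard"
    LAKERA_GUARD = "lakera_guard"  # assumed member name


class GuardrailCategory(str, Enum):
    PROMPT_INJECTION = "prompt-injection"
    CONTENT_SAFETY = "content-safety"
    PII = "pii"


class BackendType(str, Enum):
    LOCAL_DECODER = "local-decoder"
    HOSTED_API = "hosted-api"


@dataclass(frozen=True)
class GuardrailMetadata:  # trimmed for the sketch
    categories: frozenset
    backend: BackendType
    requires_api_key: bool = False


REGISTRY = {
    GuardrailName.LLAMA_GUARD: GuardrailMetadata(
        frozenset({GuardrailCategory.CONTENT_SAFETY}), BackendType.LOCAL_DECODER
    ),
    GuardrailName.LAKERA_GUARD: GuardrailMetadata(
        frozenset(
            {
                GuardrailCategory.PROMPT_INJECTION,
                GuardrailCategory.CONTENT_SAFETY,
                GuardrailCategory.PII,
            }
        ),
        BackendType.HOSTED_API,
        requires_api_key=True,
    ),
}


def registry_to_json(registry) -> str:
    """Serialize the registry; sets are sorted so the output is deterministic."""
    payload = {
        name.value: {
            "categories": sorted(c.value for c in meta.categories),
            "backend": meta.backend.value,
            "requires_api_key": meta.requires_api_key,
        }
        for name, meta in registry.items()
    }
    return json.dumps(payload, indent=2, sort_keys=True)
```

Sorting the set-valued fields and the keys keeps the exported file stable across runs, which matters if it is checked into the repository and diffed in review.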

## Open design questions
1. **Single vs multi category** — recommend multi-valued `categories: set[GuardrailCategory]` (Granite Guardian / Lakera need it). Confirm.
2. **Source of truth** — central registry (import-free, recommended) vs per-class `ClassVar` as primary. Proposal above uses the registry as canonical with a ClassVar mirror + parity test.
3. **Stage granularity** — is `EITHER` enough, or do we want explicit `{INPUT, OUTPUT}` sets? `RAG_CONTEXT` guards (groundedness) also take extra kwargs (`output_text`, context) — should the metadata also record the **required `validate()` kwargs** so a sequencer knows what to feed each guardrail?
4. Should `GuardrailCategory` reuse / align with any existing risk taxonomy (MLCommons, OWASP LLM Top 10) for interoperability?

## Related
- **Enables #26 (Iterative Guardrail Calls)** — sequencing/cascades consume `category` (chain like-for-like) and `stage` (route input vs output). This metadata is the missing prerequisite for selecting and ordering a chain programmatically rather than hard-coding it.
- Complements #178 / #177 (`GuardrailOutput` standard) — output-level per-category *results* vs this issue's guardrail-level capability *metadata*.
- Could feed #20 (prompt registry) and the docs grouping.
