blind-eye

A modular CLI tool for privacy-preserving text processing through composable redaction and validation pipelines.

Description

blind-eye provides a Unix-style pipeline. Each component reads from stdin and writes to stdout, enabling flexible data processing workflows for PII detection, redaction, and validation.
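Every stage follows the same contract: consume lines on stdin, emit lines on stdout. A minimal illustrative sketch of such a stage (not the actual blind-eye implementation):

```python
# Minimal sketch of a stdin -> stdout pipeline stage in the same Unix style
# blind-eye uses; illustrative only, not the project's actual code.
import sys

def run_stage(transform, stdin=sys.stdin, stdout=sys.stdout):
    """Read lines from stdin, apply a transform, write results to stdout."""
    for line in stdin:
        stdout.write(transform(line.rstrip("\n")) + "\n")

if __name__ == "__main__":
    # Example transform: uppercase every line.
    run_stage(str.upper)
```

Because each stage only touches stdin/stdout, stages compose with ordinary shell pipes and can be swapped or cached independently.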

Setup

make # runs setup, dependencies and tools

# make setup
# Installs pipx and poetry for python3 dependency management

# make dependencies
# Installs dependencies (poetry deps + models for spacy validation)

# make tools
# Installs pre-commit hooks and code quality tools

Usage

The CLI consists of three composable stages:

./blind hf-input [options] | ./blind redact [options] | ./blind validate [options]

Commands

hf-input

Fetches sample datasets from Hugging Face.

./blind hf-input --sample-size 128 --dataset ai4privacy/pii-masking-400k
./blind hf-input --help

--name           TEXT  Dataset name [default: ai4privacy/pii-masking-400k]
--filter-key     TEXT  Filter column name [default: language]
--filter-value   TEXT  Filter column value [default: en]
--source-column  TEXT  Source text column [default: source_text]
--sample-size    TEXT  Number of samples [default: 128]
--clear-cache    BOOL  Clears pre-processed cache

Some datasets are quite large and pre-processing takes some time, so this command caches its output by default; running it again with the same arguments reuses the cache.
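One way such an argument-keyed cache can work is to hash the canonicalized arguments into a cache filename; a hypothetical sketch (the real cache layout in blind-eye may differ, and `.cache` is an assumed location):

```python
# Hypothetical sketch of a result cache keyed on the CLI arguments:
# identical arguments hash to the same key, so repeated runs hit the cache.
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".cache")  # assumed location, not taken from the project

def cache_key(**args) -> str:
    """Canonicalize the arguments so key order does not change the hash."""
    canonical = json.dumps(args, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def cached_fetch(fetch, cache_dir=CACHE_DIR, **args) -> str:
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / cache_key(**args)
    if path.exists():
        return path.read_text()  # cache hit: skip the expensive fetch
    result = fetch(**args)
    path.write_text(result)
    return result
```

A `--clear-cache` flag would then simply delete the cache directory before fetching.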

redact

Applies redaction models (fetched from Hugging Face) to input text.

./blind redact --model-id my-NER-model
./blind redact --help

--model-id          TEXT     [default: iiiorg/piiranha-v1-detect-personal-information]
--batch-size        INTEGER  [default: 64]
--confidence        FLOAT    [default: 0.5]
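The `--confidence` threshold decides which detected entities actually get masked. A sketch of that thresholding step, assuming entities in the Hugging Face token-classification output shape (start/end character offsets plus a score); the model inference itself is not shown:

```python
# Sketch of confidence-thresholded redaction. The entity dicts mimic
# Hugging Face token-classification output (start/end offsets, score);
# how blind-eye actually post-processes model output may differ.

def redact_text(text, entities, confidence=0.5, mask="[REDACTED]"):
    """Replace detected spans whose score meets the threshold."""
    out = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        if ent["score"] < confidence:
            continue  # low-confidence detections are left untouched
        out.append(text[cursor:ent["start"]])
        out.append(mask)
        cursor = ent["end"]
    out.append(text[cursor:])
    return "".join(out)
```

Raising `--confidence` trades recall for precision: fewer false redactions, but more risk of PII slipping through to the validate stage.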

validate

Validates the redacted output.

./blind validate
./blind validate --help

--level                [critical]         [default: critical]
--output-format        [json|error_rate]  [default: error_rate]
--confidence           FLOAT              [default: 0.5]
--batch-size           INTEGER            [default: 64]
--language             TEXT               [default: en]

--level currently only supports critical: no existing project I could find distinguishes severity levels for PII/PHI leakage, and training my own model for that would be out of scope for this project.
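The error_rate output format can be thought of as the fraction of redacted samples in which the validator still detects PII above the confidence threshold. A hypothetical sketch (the detector callable stands in for the spacy-based validation model):

```python
# Hypothetical sketch of the error_rate metric: share of redacted samples
# that still contain PII detections at or above the confidence threshold.
# The `detector` callable stands in for the actual validation model.

def error_rate(samples, detector, confidence=0.5):
    """samples: redacted texts; detector: text -> list of (label, score)."""
    if not samples:
        return 0.0
    leaked = sum(
        1 for text in samples
        if any(score >= confidence for _, score in detector(text))
    )
    return leaked / len(samples)
```

An error_rate of 0.0 means the validator found no residual PII in any sample; the json format would presumably expose the per-sample detections instead of this single number.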

Examples

# Full pipeline
./blind hf-input --sample-size 100 | ./blind redact | ./blind validate
# Cache intermediate results
./blind hf-input --sample-size 128 > raw.txt
cat raw.txt | ./blind redact > redacted.txt
cat redacted.txt | ./blind validate
# Use custom data
cat my_data.txt | ./blind redact | ./blind validate
# Or skip validation to see the redacted output
echo "My name is John Doe and I live in Amsterdam" | ./blind redact

About

Sample project to vet PHI/PII model performance
