Data Caterer - Test Data Management Tool

Overview

A test data management tool with automated data generation, validation and clean up.

Generate data for databases, files, messaging systems or HTTP requests via the UI, the Scala/Java SDK or YAML input, all executed via Spark. Run data validations after generating data to ensure it is consumed correctly. Clean up generated data, or the data consumed into downstream data sources, to keep your environments tidy. Define alerts to get notified when failures occur, and dig into issues from the generated report.

Full docs can be found here.

Scala/Java examples can be found here.

A demo of the UI can be found here.

Features

Quick Start

Java/Scala API (Recommended)

git clone git@github.com:data-catering/data-caterer.git
cd data-caterer/example
./run.sh

It will run the DocumentationPlanRun class. Press Enter to run the default example. Check results at docker/sample/report/index.html.

YAML

git clone git@github.com:data-catering/data-caterer.git
cd data-caterer/example
./run.sh csv.yaml

It will run the csv.yaml plan file and the csv_transaction_file task file. Check results at docker/data/custom/report/index.html.

UI

docker run -d -p 9898:9898 -e DEPLOY_MODE=standalone --name datacaterer datacatering/data-caterer:0.18.0

Open http://localhost:9898.

Full quick start guide

Integrations

Supported data sources

Data Caterer supports the data sources below. Check here for the full roadmap.

Data Source Type	Data Source	Support
Cloud Storage	AWS S3	✅
Cloud Storage	Azure Blob Storage	✅
Cloud Storage	GCP Cloud Storage	✅
Database	BigQuery	✅
Database	Cassandra	✅
Database	MySQL	✅
Database	Postgres	✅
Database	Elasticsearch	❌
Database	MongoDB	❌
File	CSV	✅
File	Delta Lake	✅
File	JSON	✅
File	Iceberg	✅
File	ORC	✅
File	Parquet	✅
File	Hudi	❌
HTTP	REST API	✅
Messaging	Kafka	✅
Messaging	RabbitMQ	✅
Messaging	Solace	✅
Messaging	ActiveMQ	❌
Messaging	Pulsar	❌
Metadata	Data Contract CLI	✅
Metadata	Great Expectations	✅
Metadata	JSON Schema	✅
Metadata	Marquez	✅
Metadata	OpenAPI/Swagger	✅
Metadata	OpenMetadata	✅
Metadata	Open Data Contract Standard (ODCS)	✅
Metadata	Amundsen	❌
Metadata	Datahub	❌
Metadata	Solace Event Portal	❌

Additional Details

Run Configurations

Different ways to run Data Caterer based on your use case:

Design

Design motivations and details can be found here.

Roadmap

Check here for the full list of roadmap items.

Mildly Quick Start

Generate and validate data

I want to generate data in Postgres

postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")  //name and url

But I want `account_id` to follow a pattern and be unique

postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .fields(field.name("account_id").regex("ACC[0-9]{10}").unique(true))
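
To make the pattern concrete, here is a minimal sketch in plain Scala of what the regex and the uniqueness requirement mean for generated values. It does not use Data Caterer; the object name, helper names and sample values are hypothetical and only illustrate the rule `ACC` followed by exactly 10 digits.

```scala
// Conceptual illustration only; this is not Data Caterer code.
object AccountIdPatternSketch {
  // Same pattern as in the field definition: "ACC" followed by exactly 10 digits.
  private val AccountIdPattern = "ACC[0-9]{10}".r

  /** True when the whole string matches the pattern. */
  def matchesPattern(value: String): Boolean =
    AccountIdPattern.pattern.matcher(value).matches()

  /** True when no value appears more than once. */
  def allUnique(values: Seq[String]): Boolean =
    values.distinct.size == values.size

  def main(args: Array[String]): Unit = {
    // Hypothetical sample values: two valid ids and one that is too short.
    val sample = Seq("ACC0000000001", "ACC0000000002", "ACC12345")
    sample.foreach(v => println(s"$v matches: ${matchesPattern(v)}"))
    println(s"all unique: ${allUnique(sample)}")
  }
}
```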

I then want to test my job ingests all the data after generating

val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .fields(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(validation.count().isEqual(1000))

I want to make sure all the `account_id` values in Postgres are in the Parquet file

val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .fields(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(
     validation.upstreamData(postgresTask)
       .joinFields("account_id")
       .withValidation(validation.count().isEqual(1000))
  )
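
Conceptually, the upstream validation joins the two datasets on `account_id` and checks how many records line up. The sketch below shows that idea with plain Scala collections. It assumes inner-join semantics and unique upstream ids, neither of which the text above spells out, and every name in it is hypothetical rather than part of the Data Caterer API.

```scala
// Conceptual illustration only; this is not how Data Caterer is implemented.
object UpstreamJoinSketch {
  final case class Account(accountId: String)

  /** Number of downstream records whose account_id also exists upstream
    * (inner-join semantics, assuming upstream ids are unique). */
  def joinedCount(upstream: Seq[Account], downstream: Seq[Account]): Int = {
    val upstreamIds = upstream.map(_.accountId).toSet
    downstream.count(d => upstreamIds.contains(d.accountId))
  }

  /** Upstream ids that never arrived downstream. */
  def missingDownstream(upstream: Seq[Account], downstream: Seq[Account]): Set[String] =
    upstream.map(_.accountId).toSet -- downstream.map(_.accountId).toSet

  def main(args: Array[String]): Unit = {
    val postgresRows = Seq(Account("ACC0000000001"), Account("ACC0000000002"))
    val parquetRows  = Seq(Account("ACC0000000001"))
    println(s"joined count: ${joinedCount(postgresRows, parquetRows)}")
    println(s"missing downstream: ${missingDownstream(postgresRows, parquetRows)}")
  }
}
```

If the joined count equals the number of generated records, every generated `account_id` was found downstream; a lower count points at the ids reported as missing.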

I want to start validating once the Parquet file is available

val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .fields(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(
     validation.upstreamData(postgresTask)
       .joinFields("account_id")
       .withValidation(validation.count().isEqual(1000))
  )
  .validationWait(waitCondition.file("/data/parquet/customer"))
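
A file wait condition amounts to "do not validate until the path exists". The following is a rough sketch of that idea using only the JDK. The polling strategy, parameter names and limits are assumptions made for illustration; the source does not describe how Data Caterer waits internally.

```scala
import java.nio.file.{Files, Path}
import scala.annotation.tailrec

// Conceptual illustration only; polling details are assumed, not taken from Data Caterer.
object WaitForPathSketch {
  /** Checks for `path` up to `maxAttempts` times, sleeping `intervalMillis`
    * between checks. Returns true as soon as the path exists. */
  def waitForPath(path: Path, maxAttempts: Int, intervalMillis: Long): Boolean = {
    @tailrec
    def loop(remaining: Int): Boolean =
      if (Files.exists(path)) true
      else if (remaining <= 1) false
      else {
        Thread.sleep(intervalMillis)
        loop(remaining - 1)
      }

    maxAttempts > 0 && loop(maxAttempts)
  }
}
```

A caller would run the validations only when `waitForPath` returns true, and report a timeout otherwise.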

Generate same data across data sources

I also want to generate events in Kafka

kafka("my_kafka", "localhost:29092")
  .topic("account-topic")
  .fields(...)

But I want the same `account_id` to show in Postgres and Kafka

val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .fields(field.name("account_id").regex("ACC[0-9]{10}"))

val kafkaTask = kafka("my_kafka", "localhost:29092")
  .topic("account-topic")
  .fields(...)

plan.addForeignKeyRelationship(
   postgresTask, List("account_id"),
   List(kafkaTask -> List("account_id"))
)
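
The effect of the foreign key relationship is that every `account_id` written to Postgres also appears in the Kafka events. As a small sketch of that outcome in plain Scala (the case classes, the `eventType` value and the helper are hypothetical and not part of the Data Caterer API):

```scala
// Conceptual illustration only; record shapes and names are made up.
object SharedKeySketch {
  final case class PostgresRow(accountId: String, name: String)
  final case class KafkaEvent(accountId: String, eventType: String)

  /** Builds one event per row, reusing the row's account_id as the shared key. */
  def eventsFor(rows: Seq[PostgresRow]): Seq[KafkaEvent] =
    rows.map(row => KafkaEvent(row.accountId, "account_created"))

  def main(args: Array[String]): Unit = {
    val rows = Seq(PostgresRow("ACC0000000001", "Ann"), PostgresRow("ACC0000000002", "Bob"))
    eventsFor(rows).foreach(println)
  }
}
```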

Generate data and clean up

I want to generate 5 transactions per `account_id`

postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerField(5, "account_id"))

Randomly generate 1 to 5 transactions per `account_id`

postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerFieldGenerator(generator.min(1).max(5), "account_id"))
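
To show what the two count settings imply for record totals, here is a minimal sketch in plain Scala. It assumes both `min` and `max` are inclusive, which the text above suggests ("1 to 5") but does not state outright; the function names are hypothetical.

```scala
import scala.util.Random

// Conceptual illustration only; this is not Data Caterer code.
object RecordsPerFieldSketch {
  /** Exactly `perAccount` (accountId, sequenceNumber) pairs for each account. */
  def fixedPerAccount(accountIds: Seq[String], perAccount: Int): Seq[(String, Int)] =
    for {
      id <- accountIds
      n  <- 1 to perAccount
    } yield (id, n)

  /** Between `min` and `max` pairs per account, both bounds inclusive (assumed). */
  def randomPerAccount(accountIds: Seq[String], min: Int, max: Int, rng: Random): Seq[(String, Int)] =
    accountIds.flatMap { id =>
      val howMany = min + rng.nextInt(max - min + 1)
      (1 to howMany).map(n => (id, n))
    }

  def main(args: Array[String]): Unit = {
    val ids = Seq("ACC0000000001", "ACC0000000002")
    println(s"fixed total: ${fixedPerAccount(ids, 5).size}") // 2 accounts * 5 records each
    println(s"random total: ${randomPerAccount(ids, 1, 5, new Random()).size}")
  }
}
```

With a fixed count the total is simply the number of accounts multiplied by the per-account count; with a generator the total varies from run to run within the bounds.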

I want to delete the generated data

val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerFieldGenerator(generator.min(0).max(5), "account_id"))

val conf = configuration
  .enableDeleteGeneratedRecords(true)
  .enableGenerateData(false)

I also want to delete the data in Cassandra, because my job consumed the data from Postgres and pushed it to Cassandra

val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerFieldGenerator(generator.min(0).max(5), "account_id"))

val cassandraTxns = cassandra("ingested_data", "localhost:9042")
  .table("account", "transactions")

val deletePlan = plan.addForeignKeyRelationship(
   postgresTask, List("account_id"),
   List(),                                    // no links used for generation
   List(cassandraTxns -> List("account_id"))  // links used only for deletion
)

val conf = configuration
  .enableDeleteGeneratedRecords(true)
  .enableGenerateData(false)

But Cassandra only stores an `account_number` derived from the `account_id`

val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerFieldGenerator(generator.min(0).max(5), "account_id"))

val cassandraTxns = cassandra("ingested_data", "localhost:9042")
  .table("account", "transactions")

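val deletePlan = plan.addForeignKeyRelationship(
   postgresTask, List("account_id"),
   List(),  // no links used for generation
   // SQL expression that derives account_number from account_id when deleting
   List(cassandraTxns -> List("SUBSTR(account_id, 3) AS account_number"))
)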

val conf = configuration
  .enableDeleteGeneratedRecords(true)
  .enableGenerateData(false)
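
When writing the derivation expression, it helps to check what `SUBSTR` returns for a sample id. The sketch below assumes the expression is evaluated with Spark SQL semantics, where the position argument is 1-based; that is an assumption, since the text above does not say which engine evaluates it. The helper `sqlSubstr` is hypothetical and only mirrors the two-argument form for positive positions.

```scala
// Conceptual illustration only; mirrors SQL SUBSTR(str, pos) for pos >= 1.
object SubstrSketch {
  /** Returns the part of `str` starting at 1-based position `pos`. */
  def sqlSubstr(str: String, pos: Int): String = {
    require(pos >= 1, "this sketch only handles positive, 1-based positions")
    if (pos > str.length) "" else str.substring(pos - 1)
  }

  def main(args: Array[String]): Unit = {
    val accountId = "ACC0123456789" // hypothetical value matching ACC[0-9]{10}
    println(sqlSubstr(accountId, 3)) // starts at the third character
    println(sqlSubstr(accountId, 4)) // starts at the first digit
  }
}
```

Under that assumption, position 3 keeps the final `C` of the prefix, while position 4 yields only the digits, so pick the position that matches how your job actually builds `account_number`.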

Generate data with schema from metadata source

I have a data contract using the Open Data Contract Standard (ODCS) format

parquet("customer_parquet", "/data/parquet/customer")
  .fields(metadataSource.openDataContractStandard("/data/odcs/full-example.odcs.yaml"))

I have an OpenAPI/Swagger doc

http("my_http")
  .fields(metadataSource.openApi("/data/http/petstore.json"))

Validate data using validations from metadata source

I have expectations from Great Expectations

parquet("customer_parquet", "/data/parquet/customer")
  .validations(metadataSource.greatExpectations("/data/great-expectations/taxi-expectations.json"))
