SeedStream

High-performance, seed-based test data generator for enterprise applications. Generate realistic, reproducible data to Kafka, databases, and files using simple YAML configuration.


Features

  • 🚀 High Performance: Multi-threaded generation — 12–258M records/sec for primitives, 25–33K rec/sec for realistic Datafaker data
  • 🔄 Reproducible: Same seed → identical output, byte-for-byte, across machines and thread counts
  • 🌍 Locale-Aware: 62 locales supported via Datafaker (Italian names, US addresses, etc.)
  • 📝 Multiple Formats: JSON (NDJSON), CSV (RFC 4180), Protobuf (binary), Avro (OCF + Confluent Schema Registry wire format), CBEFF (biometric envelope)
  • 💾 Multiple Destinations: File (NIO, gzip), Kafka (SASL/SSL, async/sync), JDBC databases (HikariCP, nested decomposition)
  • 🔗 Foreign Key References: ref[table.field, min..count] — FK columns that scale automatically with --count
  • ⚙️ YAML Configuration: Declarative structure and job definitions — no code required
  • 🔌 Extensible Type System: 48+ Datafaker semantic types with runtime registration (DatafakerRegistry)
  • 🔐 Secret Management: AES-256-GCM encrypted credentials in YAML; HashiCorp Vault, AWS Secrets Manager, Azure Key Vault backends
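
The reproducibility claim above (same seed, same output, regardless of thread count) can be illustrated with a minimal sketch. It assumes, without confirmation from SeedStream's source, that each record derives its own random stream from the master seed and its record index, so the result cannot depend on which thread produced it. The class name, method names, and record layout are hypothetical; DESIGN.md describes the project's actual model.

```java
import java.util.List;
import java.util.SplittableRandom;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

final class SeededRecords {
    // Mixes the master seed with the record index (SplitMix64-style finalizer)
    // so that neighbouring indices get unrelated random streams.
    static long mix(long seed, long index) {
        long z = seed + (index + 1) * 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    // Builds one record from (seed, index) only; no shared mutable state.
    static String record(long seed, long index) {
        SplittableRandom rnd = new SplittableRandom(mix(seed, index));
        return index + "," + rnd.nextInt(18, 100) + "," + rnd.nextBoolean();
    }

    // Generates records sequentially or in parallel; encounter order is preserved.
    static List<String> generate(long seed, int count, boolean parallel) {
        LongStream indices = LongStream.range(0, count);
        if (parallel) {
            indices = indices.parallel();
        }
        return indices.mapToObj(i -> record(seed, i)).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> sequential = generate(12345L, 1000, false);
        List<String> parallel = generate(12345L, 1000, true);
        System.out.println(sequential.equals(parallel)); // prints true
    }
}
```

Because every record is a pure function of the seed and its index, adding or removing worker threads changes only who computes a record, not what it contains.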

Requirements

  • Java 21+ (Amazon Corretto, OpenJDK, or GraalVM)
  • Gradle 9.5+ wrapper included — no system install needed
  • Docker (optional, for integration tests with Testcontainers)
  • JDBC driver (optional, for database destination — drop into extras/)

Quick Start

Option 1 — Fat JAR (no build required)

Download the release JAR and run it without building anything. The example job configs live in the repository, so clone it first:

git clone https://github.com/mferretti/SeedStream.git && cd SeedStream
wget https://github.com/mferretti/SeedStream/releases/latest/download/seedstream-0.5.0.jar
java -jar seedstream-0.5.0.jar execute --job config/jobs/file_address.yaml --count 100

Option 2 — Distribution zip

wget https://github.com/mferretti/SeedStream/releases/latest/download/cli-0.5.0.zip
unzip cli-0.5.0.zip
# Point to your own job configs or clone the repo for examples
cli-0.5.0/bin/datagenerator execute --job /path/to/job.yaml --count 100

Option 3 — Build from source

git clone https://github.com/mferretti/SeedStream.git && cd SeedStream
./gradlew :cli:run --args="execute --job config/jobs/file_address.yaml --count 100"

Common examples

# Generate 10,000 US customers as CSV
./gradlew :cli:run --args="execute --job config/jobs/file_customer.yaml --format csv --count 10000"

# Stream 1M events to Kafka with 8 threads
./gradlew :cli:run --args="execute --job config/jobs/kafka_events_env_seed.yaml --count 1000000 --threads 8"

# Reproducible output — same seed, same data every time
./gradlew :cli:run --args="execute --job config/jobs/file_address.yaml --seed 12345 --count 1000"

# Validate a configuration without running
./gradlew :cli:run --args="validate --job config/jobs/file_invoice.yaml"

# Encrypt a credential for embedding in job YAML
export SEEDSTREAM_ENCRYPTION_KEY=$(openssl rand -hex 32)
./gradlew :cli:run --args="encrypt my-db-password"
# Output already includes the AES256GCM: prefix, e.g.:  AES256GCM:BASE64CIPHERTEXT...
# Paste it verbatim into job YAML as: password: "${SECRET:enc:<output>}"

CLI options

| Option | Default | Description |
| --- | --- | --- |
| --job | required | Path to job YAML |
| --format | json | json, csv, protobuf, avro, avro-registry, cbeff |
| --count | 100 | Records to generate |
| --seed | from config | Override seed for this run |
| --threads | CPU cores | Worker threads |
| --verbose | off | Detailed logging |
| --debug | off | Enables sampled TRACE logging (see --trace-sample) |
| --trace-sample | 10 | TRACE sampling rate 1–100 (percentage); only effective with --debug |

Performance

Validated throughput from JMH benchmarks (March 2026):

| Data type | Throughput |
| --- | --- |
| Primitive (int, boolean) | 12–258M records/sec |
| Datafaker (names, emails, etc.) | 13–154K records/sec |
| Real-world (10-field customer, E2E) | ~25–33K records/sec |
| File I/O | 600–800 MB/s |

Scaling: 3.7× speedup with 4 workers (92% efficiency). Datafaker workloads are I/O-bound — 4 threads is usually optimal regardless of core count.
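
The efficiency figure follows from the usual definition of parallel efficiency: speedup divided by worker count. As a worked check (the symbols S, N, and E are introduced here for illustration only):

```latex
E = \frac{S}{N} = \frac{3.7}{4} = 0.925
```

That is roughly the 92% quoted above; the small difference comes from the speedup itself being a rounded figure.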

See PERFORMANCE.md for full benchmarks, tuning guide, and hardware recommendations.


Architecture

cli → destinations → formats → generators → schema → core
              (benchmarks: JMH harness, depends on core + generators)

Seven modules — six in the runtime dependency chain plus benchmarks (JMH micro-benchmarks, excluded from production artifacts). Each layer is pluggable: add a destination by implementing DestinationAdapter, a format by implementing FormatSerializer, or a new semantic type by registering it with DatafakerRegistry.
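
To make the pluggable-layer idea concrete, here is a minimal sketch of a pipeline that depends only on two small interfaces. The interfaces below are hypothetical stand-ins defined for this example; they are not SeedStream's DestinationAdapter or FormatSerializer, whose real method signatures are in the repository and may look quite different.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class PluggableLayers {
    /** Hypothetical stand-in for a format layer (NOT SeedStream's FormatSerializer). */
    interface SketchSerializer {
        String serialize(Map<String, Object> row);
    }

    /** Hypothetical stand-in for a destination layer (NOT SeedStream's DestinationAdapter). */
    interface SketchDestination {
        void write(String payload);
    }

    /** Example format plug-in: joins field values with a pipe character. */
    static final class PipeSerializer implements SketchSerializer {
        @Override
        public String serialize(Map<String, Object> row) {
            return row.values().stream()
                    .map(String::valueOf)
                    .collect(Collectors.joining("|"));
        }
    }

    /** Example destination plug-in: keeps payloads in memory. */
    static final class InMemoryDestination implements SketchDestination {
        final List<String> lines = new ArrayList<>();

        @Override
        public void write(String payload) {
            lines.add(payload);
        }
    }

    /** The pipeline sees only the two interfaces, so either side can be swapped. */
    static void run(List<Map<String, Object>> rows, SketchSerializer format, SketchDestination destination) {
        for (Map<String, Object> row : rows) {
            destination.write(format.serialize(row));
        }
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("name", "Ada");
        row.put("age", 36);
        InMemoryDestination sink = new InMemoryDestination();
        run(List.of(row), new PipeSerializer(), sink);
        System.out.println(sink.lines); // prints [Ada|36]
    }
}
```

Swapping PipeSerializer for another implementation changes the output format without touching the destination, which is the property the layered design is after.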

See DESIGN.md for architecture decisions, the multi-threading reproducibility model, and extension points.


Documentation

| Document | Contents |
| --- | --- |
| config/README.md | Type system reference, job/structure examples, Kafka & database config |
| docs/DESIGN.md | Architecture, threading model, reproducibility, extensibility |
| docs/PERFORMANCE.md | Benchmarks, tuning guide, hardware recommendations |
| docs/TROUBLESHOOTING.md | Common errors, debug mode, FAQ |
| docs/CONTRIBUTING.md | Setup, development workflow, code standards |
| docs/QUALITY.md | Coverage, SpotBugs, Spotless configuration |
| CHANGELOG.md | Release history and roadmap |

Secret Management

Database passwords, Kafka credentials, and other secrets can be stored securely instead of in plaintext YAML.

Option 1 — AES-256-GCM inline encryption

# Generate a key (store it safely — you need it to decrypt)
export SEEDSTREAM_ENCRYPTION_KEY=$(openssl rand -hex 32)

# Encrypt a credential
./seedstream encrypt "my-db-password"
# → AES256GCM:BASE64CIPHERTEXT...

Paste the output into your job YAML:

conf:
  password: "${SECRET:enc:AES256GCM:BASE64CIPHERTEXT...}"
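
For readers curious what an AES-256-GCM token of this kind might contain, the following is a minimal sketch using the standard javax.crypto API. The payload layout (a random 12-byte nonce followed by ciphertext and tag, Base64-encoded after the AES256GCM: prefix) is an assumption made for illustration; SeedStream's actual encoding is not specified here and may differ. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Base64;
import java.util.HexFormat;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

final class AesGcmSketch {
    static final String PREFIX = "AES256GCM:";
    private static final int IV_BYTES = 12;  // commonly recommended GCM nonce size
    private static final int TAG_BITS = 128; // full-length authentication tag

    /** Parses 64 hex characters (as produced by `openssl rand -hex 32`) into an AES-256 key. */
    static SecretKeySpec keyFromHex(String hex) {
        byte[] raw = HexFormat.of().parseHex(hex);
        if (raw.length != 32) {
            throw new IllegalArgumentException("expected 32 key bytes, got " + raw.length);
        }
        return new SecretKeySpec(raw, "AES");
    }

    /** ASSUMED layout: PREFIX + Base64(nonce || ciphertext || tag). */
    static String encrypt(String plaintext, SecretKeySpec key) {
        try {
            byte[] iv = new byte[IV_BYTES];
            new SecureRandom().nextBytes(iv); // fresh nonce per encryption
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] sealed = cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
            byte[] out = ByteBuffer.allocate(iv.length + sealed.length).put(iv).put(sealed).array();
            return PREFIX + Base64.getEncoder().encodeToString(out);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("encryption failed", e);
        }
    }

    /** Reverses encrypt(); fails if the token was tampered with or the key is wrong. */
    static String decrypt(String token, SecretKeySpec key) {
        if (!token.startsWith(PREFIX)) {
            throw new IllegalArgumentException("missing " + PREFIX + " prefix");
        }
        try {
            byte[] in = Base64.getDecoder().decode(token.substring(PREFIX.length()));
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, in, 0, IV_BYTES));
            byte[] plain = cipher.doFinal(in, IV_BYTES, in.length - IV_BYTES);
            return new String(plain, StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("decryption failed", e);
        }
    }

    public static void main(String[] args) {
        SecretKeySpec key = keyFromHex("00".repeat(32)); // demo key only, never use in practice
        String token = encrypt("my-db-password", key);
        System.out.println(decrypt(token, key)); // prints my-db-password
    }
}
```

GCM authenticates as well as encrypts, so a token edited by hand in the YAML file fails to decrypt instead of yielding a silently wrong password.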

Option 2 — Environment variable substitution

conf:
  password: "${ENV:DB_PASSWORD}"
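
Placeholder substitution of this kind can be sketched in a few lines. The example below is an illustration, not SeedStream's resolver: the class name, the accepted variable-name pattern, and the choice to fail on an unset variable are all assumptions. The environment is passed in as a map so the logic can be exercised without touching the real process environment; in practice one would pass System.getenv().

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EnvPlaceholders {
    // Matches ${ENV:NAME}, where NAME looks like a conventional environment variable name.
    private static final Pattern ENV_REF =
            Pattern.compile("\\$\\{ENV:([A-Za-z_][A-Za-z0-9_]*)\\}");

    /** Replaces every ${ENV:NAME} in value with env.get(NAME); throws if NAME is unset. */
    static String resolve(String value, Map<String, String> env) {
        return ENV_REF.matcher(value).replaceAll(match -> {
            String name = match.group(1);
            String resolved = env.get(name);
            if (resolved == null) {
                throw new IllegalArgumentException("Unset environment variable: " + name);
            }
            // quoteReplacement keeps '$' and '\' in the secret from being read as group references.
            return Matcher.quoteReplacement(resolved);
        });
    }

    public static void main(String[] args) {
        Map<String, String> env = Map.of("DB_PASSWORD", "s3cr$t");
        System.out.println(resolve("${ENV:DB_PASSWORD}", env)); // prints s3cr$t
    }
}
```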

Option 3 — Cloud secret backends

secrets:
  type: vault          # or: aws | azure | encrypted-file
  address: "https://vault.example.com"
  token: "${ENV:VAULT_TOKEN}"

Supported backends: HashiCorp Vault (KV v1/v2), AWS Secrets Manager, Azure Key Vault, encrypted file.

See config/README.md for full secret configuration reference.


Security

SeedStream runs an OWASP Dependency-Check scan on every push (CVSS threshold ≥ 7.0).

Known open issues (as of June 2026):

| Dependency | CVE | Status |
| --- | --- | --- |
| kafka-clients 4.3.0 | CVE-2026-41115 | No fix available yet; producer-only usage |
| azure-identity 1.18.3 | CVE-2026-33117 | No fix available yet; startup secret resolution only |
| azure-core / azure-json | CVE-2026-33117 | Transitive from azure-identity; no fix yet |
| netty 4.1.131–132 | CVE-2026-42xxx, CVE-2026-44248 | Transitive from Azure SDK; no fix yet |
| azure-identity 1.18.3 | CVE-2023-36415, CVE-2024-35255 | Likely false positive — version post-dates fix |
| msal4j 1.23.1 | CVE-2024-35255 | Likely false positive — version post-dates fix |

All suppressions expire on 2026-07-05. CI will fail again on that date, forcing a review. No permanent suppressions exist in this project. When a patched version ships, the dependency is upgraded and the suppression is removed.

To report a vulnerability, open a GitHub issue marked security.


Contributing

Contributions welcome — bug reports, new generators, destinations, or formats.

git clone https://github.com/mferretti/SeedStream.git
cd SeedStream
./gradlew build test

See CONTRIBUTING.md for setup, workflow, and code standards.


License

Copyright 2024-2026 Marco Ferretti

Licensed under the Apache License 2.0.
