A self-service platform for spinning up isolated temporary environments, deploying apps, simulating outages, monitoring health, and auto-destroying everything. Think miniature internal Heroku with a chaos engineering toggle.
Every environment is short-lived by design.
                              ┌─────────────────────────────────────────────────┐
                              │  Linux VM / Host                                │
                              │                                                 │
  User / CI                   │ ┌─────────────────────────────────────────────┐ │
        │                     │ │  Docker Engine                              │ │
        │  make / curl        │ │                                             │ │
        ▼                     │ │  ┌──────────┐   ┌──────────┐    ┌─────────┐ │ │
  ┌───────────┐               │ │  │          │   │          │    │         │ │ │
  │ Makefile  │──────────────▶│ │  │  nginx   │   │   API    │    │ daemon  │ │ │
  │ (make up) │               │ │  │  :8080   │   │  :7000   │    │(cleanup)│ │ │
  └───────────┘               │ │  │          │   │          │    │         │ │ │
                              │ │  └────┬─────┘   └────┬─────┘    └────┬────┘ │ │
  ┌───────────┐               │ │       │              │               │      │ │
  │ REST API  │──────────────▶│ │       │   sandbox-nginx-net          │      │ │
  │   /envs   │               │ │       │─────────────────────         │      │ │
  └───────────┘               │ │                                      │      │ │
                              │ │  ┌─────────────────────────────────┐ │      │ │
                              │ │  │  Sandbox Environments           │ │      │ │
                              │ │  │                                 │ │      │ │
                              │ │  │  ┌───────────┐  ┌───────────┐   │ │      │ │
                              │ │  │  │env-abc123 │  │env-def456 │   │◀┘      │ │
                              │ │  │  │ app:5000  │  │ app:5000  │   │        │ │
                              │ │  │  │ net: own  │  │ net: own  │   │        │ │
                              │ │  │  └───────────┘  └───────────┘   │        │ │
                              │ │  └─────────────────────────────────┘        │ │
                              │ │                                             │ │
                              │ │  ┌──────────┐  logs/ ──────────────────────▶│ │
                              │ │  │ monitor  │  envs/ (state files)          │ │
                              │ │  │(health)  │  nginx/conf.d/ (per-env)      │ │
                              │ │  └──────────┘                               │ │
                              │ └─────────────────────────────────────────────┘ │
                              └─────────────────────────────────────────────────┘
Request flow:
Browser → Nginx (:8080) → upstream sandbox-app-<env_id>:5000
Nginx routes by Host header: env-abc123.sandbox.local → env-abc123 container
Data flow:
create_env.sh → Docker network + container + Nginx conf + state file
cleanup_daemon.sh → reads envs/*.json → calls destroy_env.sh when TTL expired
health_monitor.py → polls localhost:<port>/health every 30s → writes health.log
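
The TTL check in the data flow above can be sketched in Python. This is an illustration only, not the code in cleanup_daemon.sh: the state file schema is not shown in this README, so the field names `created_at` (Unix timestamp) and `ttl` (seconds), and the use of the file name as the env ID, are assumptions.

```python
import json
import time
from pathlib import Path
from typing import List, Optional


def seconds_remaining(state: dict, now: float) -> float:
    """Return seconds left before the env expires (zero or negative once expired)."""
    # Assumed schema: created_at is a Unix timestamp, ttl is in seconds.
    return state["created_at"] + state["ttl"] - now


def expired_env_ids(envs_dir: str, now: Optional[float] = None) -> List[str]:
    """Scan envs/*.json and return the IDs whose TTL has elapsed."""
    now = time.time() if now is None else now
    expired = []
    for state_file in sorted(Path(envs_dir).glob("*.json")):
        state = json.loads(state_file.read_text())
        if seconds_remaining(state, now) <= 0:
            # In this sketch the file stem doubles as the env ID.
            expired.append(state_file.stem)
    return expired
```

A daemon loop would call `expired_env_ids("envs")` on each pass and hand every returned ID to destroy_env.sh. The same `seconds_remaining` value is what a "TTL remaining" field in an API listing could report.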
- Docker ≥ 24.x and Docker Compose ≥ 2.20
- Python 3.11+ (on the host, for the health monitor)
- Bash 4+, GNU Make
- A Linux VM (tested on Ubuntu 22.04/24.04)
Zero to first running environment in 5 commands:
git clone https://github.com/YOUR_USERNAME/devops-sandbox.git
cd devops-sandbox
cp .env.example .env # review defaults; edit ports if needed
make up # starts Nginx, API, cleanup daemon, monitor
make create # prompts for name + TTL, creates env

After make create you'll see:
╔══════════════════════════════════════════╗
║ Environment Ready! ║
║ ID: env-1716000000-a1b2c3 ║
║ URL: http://localhost:8412 ║
║ TTL: 1800s (expires in 30 min) ║
╚══════════════════════════════════════════╝
make create
# name: myapp
# ttl: 300 (5 minutes for demo)

Or via the API:
curl -s -X POST http://localhost:7000/envs \
-H 'Content-Type: application/json' \
-d '{"name":"myapp","ttl":300}' | python3 -m json.tool

make status
# or
curl -s http://localhost:7000/envs | python3 -m json.tool

Hit the app directly (port shown at creation time):
curl http://localhost:<PORT>/health
# {"status": "ok", "env_id": "env-...", ...}

make health
# or via API (last 10 results):
curl -s http://localhost:7000/envs/<ENV_ID>/health | python3 -m json.tool

Crash the container:
make simulate ENV=env-1716000000-a1b2c3 MODE=crash

Pause it (freeze, not kill):
make simulate ENV=env-1716000000-a1b2c3 MODE=pause

Network isolation:
make simulate ENV=env-1716000000-a1b2c3 MODE=network

Or via the API:
curl -s -X POST http://localhost:7000/envs/<ENV_ID>/outage \
-H 'Content-Type: application/json' \
-d '{"mode":"crash"}'

The health monitor polls every 30s, so within about 90 seconds it records 3 consecutive failures and the status becomes degraded:
make health
# status=degraded
curl -s http://localhost:7000/envs/<ENV_ID>/health

Watch the health log live:
tail -f logs/<ENV_ID>/health.log

make simulate ENV=env-1716000000-a1b2c3 MODE=recover
# or
curl -s -X POST http://localhost:7000/envs/<ENV_ID>/outage \
-H 'Content-Type: application/json' \
-d '{"mode":"recover"}'

The health monitor detects the recovery and resets the status to running.
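
The degraded/running transitions amount to a small counter-based state machine. The sketch below assumes only what the text states (3 consecutive failures mark an env degraded, a successful check resets it to running). The class and method names are illustrative and are not taken from health_monitor.py.

```python
from dataclasses import dataclass

FAILURE_THRESHOLD = 3  # consecutive failed polls before "degraded" (from the text)


@dataclass
class HealthTracker:
    """Tracks consecutive health-check failures for one environment."""

    consecutive_failures: int = 0
    status: str = "running"

    def record(self, check_ok: bool) -> str:
        """Feed in one poll result and return the resulting status."""
        if check_ok:
            # Any successful poll counts as recovery and resets the counter.
            self.consecutive_failures = 0
            self.status = "running"
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= FAILURE_THRESHOLD:
                self.status = "degraded"
        return self.status
```

Feeding it three failed polls followed by one successful poll yields running, running, degraded, running, which matches the crash-then-recover walkthrough.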
make logs ENV=env-1716000000-a1b2c3
# tails logs/<ENV_ID>/app.log live
# or via API (last 100 lines):
curl -s http://localhost:7000/envs/<ENV_ID>/logs

make destroy ENV=env-1716000000-a1b2c3

If you set a short TTL (e.g. 60s), the cleanup daemon destroys the env automatically. Watch it happen:
tail -f logs/cleanup.log

| Method | Endpoint | Description |
|---|---|---|
| POST | /envs | Create env (body: {name, ttl}) |
| GET | /envs | List all active envs + TTL remaining |
| GET | /envs/:id | Get single env details |
| DELETE | /envs/:id | Destroy env |
| GET | /envs/:id/logs | Last 100 lines of app.log |
| GET | /envs/:id/health | Last 10 health check results |
| POST | /envs/:id/outage | Trigger simulation (body: {mode}) |
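
For scripting against these endpoints without curl, a small standard-library client might look like the sketch below. The base URL and request bodies follow the curl examples in this README. The helper names are hypothetical, and the sketch assumes every endpoint answers with a JSON body, which the README does not confirm for DELETE.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7000"  # API port used throughout this README


def build_request(method: str, path: str, body=None) -> urllib.request.Request:
    """Build (but do not send) a request for the sandbox API."""
    data = None
    headers = {}
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


def call(method: str, path: str, body=None, timeout: float = 10.0):
    """Send the request and decode the response (assumed to be JSON)."""
    request = build_request(method, path, body)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def create_env(name: str, ttl: int):
    return call("POST", "/envs", {"name": name, "ttl": ttl})


def list_envs():
    return call("GET", "/envs")


def trigger_outage(env_id: str, mode: str):
    return call("POST", f"/envs/{env_id}/outage", {"mode": mode})


def destroy_env(env_id: str):
    return call("DELETE", f"/envs/{env_id}")
```

With the platform running, `create_env("myapp", 300)` is the equivalent of the POST /envs curl call shown earlier.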

| Target | Description |
|---|---|
| make up | Start Nginx, daemon, API, monitor |
| make down | Stop everything, destroy all envs |
| make create | Interactive: create new env |
| make destroy ENV=<id> | Destroy specific env |
| make logs ENV=<id> | Tail env app.log (live) |
| make health | Show all env health statuses |
| make simulate ENV=<id> MODE=<mode> | Run outage simulation |
| make status | List envs via API (JSON) |
| make clean | Wipe all state, logs, archives |

| Mode | Effect | Recovery |
|---|---|---|
| crash | docker kill (hard stop) | MODE=recover |
| pause | docker pause (freeze process) | MODE=recover |
| network | Disconnect from Docker networks | MODE=recover |
| recover | Unpause / restart / reconnect as needed | - |
| stress | CPU spike via stress-ng or Python burner (60s) | Self-resolving |
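
The mode table maps onto ordinary docker CLI commands. The sketch below shows one plausible mapping and is not the contents of simulate_outage.sh. It only builds argument lists and runs nothing. It disconnects only sandbox-nginx-net because the name of each env's own network is not given in this README, it treats "restart" after a crash as `docker start`, and it leaves out stress mode. All three are assumptions.

```python
from typing import List

NGINX_NETWORK = "sandbox-nginx-net"  # shared network named in this README


def container_name(env_id: str) -> str:
    """Container naming scheme used in the request-flow section."""
    return f"sandbox-app-{env_id}"


def outage_commands(env_id: str, mode: str) -> List[List[str]]:
    """Return docker CLI invocations that trigger the given outage mode."""
    name = container_name(env_id)
    if mode == "crash":
        return [["docker", "kill", name]]
    if mode == "pause":
        return [["docker", "pause", name]]
    if mode == "network":
        # Assumption: only the shared Nginx network is disconnected here.
        return [["docker", "network", "disconnect", NGINX_NETWORK, name]]
    raise ValueError(f"unsupported mode in this sketch: {mode}")


def recovery_commands(env_id: str, previous_mode: str) -> List[List[str]]:
    """Return docker CLI invocations that undo a previous outage mode."""
    name = container_name(env_id)
    if previous_mode == "crash":
        return [["docker", "start", name]]
    if previous_mode == "pause":
        return [["docker", "unpause", name]]
    if previous_mode == "network":
        return [["docker", "network", "connect", NGINX_NETWORK, name]]
    raise ValueError(f"unsupported mode in this sketch: {previous_mode}")
```

Each returned list can be passed to `subprocess.run(argv, check=True)`. Keeping command construction separate from execution makes the mapping easy to test without a Docker daemon.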
Nginx is the front door for all environments. Each create_env.sh call writes a config to nginx/conf.d/<ENV_ID>.conf and runs nginx -s reload. On destroy, the file is removed and Nginx is reloaded again.
Routing strategy: Host-header based. Each env gets a virtual server name <ENV_ID>.sandbox.local. For local testing, hit each env directly on its random host port. For hostname routing, add entries to /etc/hosts or use a wildcard DNS entry.
Network: Nginx runs in sandbox-nginx-net. App containers are also joined to this network at creation time, so Nginx can upstream to sandbox-app-<ENV_ID>:5000 by container name.
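
A per-env config of the kind described here could be rendered as in the sketch below. The server name and upstream follow the naming in this README. The `listen` port, the proxy headers, and the function names are assumptions, and the real template in create_env.sh may differ.

```python
from pathlib import Path


def render_nginx_conf(env_id: str, app_port: int = 5000, listen_port: int = 80) -> str:
    """Render a Host-header-routed server block for one environment."""
    # listen_port is an assumption; the README only states the published port (:8080).
    upstream = f"sandbox-app-{env_id}:{app_port}"
    return (
        "server {\n"
        f"    listen {listen_port};\n"
        f"    server_name {env_id}.sandbox.local;\n"
        "\n"
        "    location / {\n"
        f"        proxy_pass http://{upstream};\n"
        "        proxy_set_header Host $host;\n"
        "    }\n"
        "}\n"
    )


def write_nginx_conf(conf_dir: str, env_id: str) -> Path:
    """Write nginx/conf.d/<ENV_ID>.conf and return its path (reload is left to the caller)."""
    conf_path = Path(conf_dir) / f"{env_id}.conf"
    conf_path.write_text(render_nginx_conf(env_id))
    return conf_path
```

After writing the file, the platform would still need to run nginx -s reload for the new server block to take effect.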
Approach A (implemented): At container creation, the command docker logs -f <container> >> logs/<ENV_ID>/app.log & is started in the background and its PID is saved to logs/<ENV_ID>/log_shipper.pid. On destroy, this PID is killed before container removal so that no stray log-shipper process is left behind.
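
The same start-and-track pattern, expressed in Python rather than shell, might look like this sketch. The file names (app.log, log_shipper.pid) come from this README. The function names and the `command` override, which exists so the sketch can be exercised without Docker, are hypothetical.

```python
import os
import signal
import subprocess
from pathlib import Path


def start_log_shipper(container: str, log_dir: str, command=None) -> subprocess.Popen:
    """Append the container's log stream to app.log and record the shipper PID."""
    log_path = Path(log_dir)
    log_path.mkdir(parents=True, exist_ok=True)
    argv = command or ["docker", "logs", "-f", container]
    # Like the shell's >> redirect, only stdout is appended to the file.
    with open(log_path / "app.log", "ab") as log_file:
        proc = subprocess.Popen(argv, stdout=log_file)
    (log_path / "log_shipper.pid").write_text(f"{proc.pid}\n")
    return proc


def stop_log_shipper(log_dir: str) -> bool:
    """Terminate the recorded shipper process; return True if a signal was sent."""
    pid_file = Path(log_dir) / "log_shipper.pid"
    if not pid_file.exists():
        return False
    pid = int(pid_file.read_text().strip())
    try:
        os.kill(pid, signal.SIGTERM)
        stopped = True
    except ProcessLookupError:
        # The shipper already exited, for example because the container stopped.
        stopped = False
    pid_file.unlink()
    return stopped
```

Calling `stop_log_shipper` before removing the container mirrors the destroy order described above.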
Logs are archived to logs/archived/<ENV_ID>/ on destroy and remain queryable.
Netdata is an optional add-on that gives you a live dashboard of every container's CPU, memory, network I/O, and disk — with zero configuration. It auto-discovers all sandbox containers via the Docker socket the moment they start.
Start:
make monitoring-up
# Dashboard: http://localhost:19999

Stop:
make monitoring-down

What you get instantly, with no setup:
- Per-container CPU and memory graphs, including each sandbox-app-<id> as it's created
- Host-level system metrics (load, disk, network)
- Built-in alerts for memory pressure and high CPU
- Live log of containers appearing and disappearing as you create/destroy envs
Metrics are retained in Docker volumes (netdata-lib, netdata-cache) so they survive restarts. Configuration is in monitor/netdata/netdata.conf; the defaults are fine for local use.
devops-sandbox/
├── platform/
│ ├── create_env.sh # Spin up environment
│ ├── destroy_env.sh # Tear down environment
│ ├── cleanup_daemon.sh # TTL auto-expire loop
│ ├── simulate_outage.sh # Chaos injection
│ ├── api.py # Flask REST API
│ └── lib/
│ └── common.sh # Shared functions (state, docker, nginx helpers)
├── apps/
│ └── demo/
│ ├── app.py # Demo HTTP server (/ /health /info)
│ └── Dockerfile
├── nginx/
│ ├── nginx.conf # Main config (includes conf.d/)
│ └── conf.d/ # Auto-generated per-env configs (gitignored)
├── monitor/
│ ├── health_monitor.py # 30s health poller → health.log
│ └── netdata/
│ └── netdata.conf # Netdata config (update_every, retention, plugins)
├── scripts/
│ ├── inspect.sh # Pretty-print env runtime state
│ ├── list_envs.sh # Formatted table of all active envs
│ ├── build_demo_app.sh # Build sandbox-demo-app:latest
│ ├── export_logs.sh # Tarball logs for any env
│ ├── prune_archives.sh # Remove archived logs older than N days
│ └── reset_platform.sh # Nuclear wipe with confirmation
├── tests/
│ ├── test_api.sh # 12 API integration assertions
│ ├── test_lifecycle.sh # Full create→crash→recover→destroy cycle
│ ├── test_outage.sh # All outage modes tested end-to-end
│ └── test_cleanup_daemon.sh # TTL auto-expiry test (~2 min)
├── logs/ # gitignored
│ ├── cleanup.log
│ ├── <env_id>/
│ │ ├── app.log
│ │ └── health.log
│ └── archived/
├── envs/ # gitignored — runtime state JSONs
├── .env.example
├── .gitignore
├── docker-compose.yml # Core platform (nginx, api, daemon, monitor)
├── docker-compose.monitoring.yml # Optional Netdata (make monitoring-up)
├── Dockerfile.api
├── Makefile
├── requirements.txt
├── CONTRIBUTING.md
└── README.md
- Single VM only. No distributed scheduling — everything runs on one host. This is by design.
- Port allocation. Env ports are random (8100–9000). High concurrency could exhaust this range. For >900 envs, widen the range.
- No auth on the API. The API is unauthenticated. Do not expose port 7000 publicly without adding auth middleware.
- Demo app is ephemeral. The bundled Python HTTP server is not production-grade; it's a placeholder to satisfy /health. Swap it for your own image via create_env.sh.
- Nginx config reload is not atomic. Between rm and nginx -s reload, Nginx may briefly serve a 502 for that env. For production, use nginx -t validation before reloading.
- Log shipper depends on docker logs. On high-throughput containers this can lag. For production use Approach B (Loki/Fluentd via Docker socket).
- Cleanup daemon requires bash + python3 in the daemon container. The docker:24-cli image installs these at startup, which adds ~5s cold start.
- No TLS. All traffic is plain HTTP. Add Certbot + nginx SSL termination for production use.
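
To make the port-allocation limitation concrete, here is a sketch of random allocation within the 8100-9000 range that skips ports already recorded for other envs and raises once the range is exhausted. It is not the logic in create_env.sh. The function names are hypothetical, and the bind test against 127.0.0.1 only approximates what Docker will see when it publishes the port.

```python
import random
import socket

PORT_RANGE = range(8100, 9001)  # 8100-9000 inclusive, per the README


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Best-effort check: try to bind the port and release it immediately."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def pick_port(in_use, rng=random, is_free=port_is_free) -> int:
    """Pick a random port from PORT_RANGE that is neither recorded nor bound."""
    candidates = [port for port in PORT_RANGE if port not in in_use]
    rng.shuffle(candidates)
    for port in candidates:
        if is_free(port):
            return port
    raise RuntimeError("no free port left in 8100-9000; widen PORT_RANGE")
```

There is still a window between the check and the moment Docker binds the port, so a caller would want to retry with another port if container creation fails on a port conflict.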