Modern AI Agent Runtime with Hardware-Level Isolation (Beta v0.2.0)
Aether is a runtime for AI agents with secure isolation, intelligent orchestration, and observability. Built on Firecracker microVMs, Aether is designed to run untrusted workloads safely and efficiently.
Think Docker for AI agents, but with security and multi-tenancy from day one.
⚠️ Project Status: Beta v0.2.0. Aether has reached beta with the core control plane integrated: HTTP API, distributed scheduler, PostgreSQL persistence, OpenTelemetry observability, and Kubernetes/Terraform deployment on AWS. Kafka messaging, HashiCorp Vault secrets, and GCE node provisioning are present but experimental. Not yet recommended for production workloads. Targeting v1.0 in Q3 2026; see docs/V1_SCOPE.md for the v1.0 scope and shipping criteria.
- Hardware-Level Isolation: Firecracker microVMs with KVM virtualization
- Multi-Tenant Architecture: Complete tenant isolation (network, compute, data)
- Secrets Management: Designed for HashiCorp Vault integration
- Authentication: JWT + API keys with RBAC
- High Availability: Multi-AZ deployment with automatic failover
- Disaster Recovery: Automated backups, point-in-time recovery
- Observability: Distributed tracing (Jaeger), metrics (Prometheus), logs
- Scalability: Designed for 10,000+ concurrent agents
- Smart Scheduling: Bin-packing, spread, and best-fit placement strategies
- Auto-Scaling: Policy-based horizontal scaling (planned)
- Resource Quotas: Per-tenant CPU, memory, disk limits
- Rate Limiting: Token bucket algorithm with multi-tier support
- AI Agent Platforms: Run LLM agents, autonomous systems, AI assistants
- Code Execution Services: Sandboxed code execution (e.g., Jupyter, REPL)
- CI/CD Runners: Isolated build environments
- Function-as-a-Service: Serverless function runtime
- Multi-Tenant SaaS: Any workload requiring strong isolation
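The Smart Scheduling feature listed above names bin-packing and spread placement. The sketch below illustrates the difference under stated assumptions: the `Node` type, the `Place` function, and the memory-only fit check are hypothetical and are not taken from Aether's scheduler. Bin-packing is modeled here as "choose the fitting node with the least free memory" and spread as "choose the fitting node with the most free memory". How Aether distinguishes best-fit from bin-packing is not stated in this README, so best-fit is left out.

```go
package main

import "fmt"

// Node is a hypothetical view of a compute node; only free memory is modeled.
type Node struct {
	Name   string
	FreeMB int
}

// Strategy selects which of the fitting nodes wins.
type Strategy int

const (
	BinPack Strategy = iota // prefer the fitting node with the least free memory
	Spread                  // prefer the fitting node with the most free memory
)

// Place returns the index of the chosen node, or -1 if no node can fit needMB.
func Place(nodes []Node, needMB int, s Strategy) int {
	best := -1
	for i, n := range nodes {
		if n.FreeMB < needMB {
			continue // node cannot host the workload
		}
		if best == -1 {
			best = i
			continue
		}
		switch s {
		case BinPack:
			if n.FreeMB < nodes[best].FreeMB {
				best = i
			}
		case Spread:
			if n.FreeMB > nodes[best].FreeMB {
				best = i
			}
		}
	}
	return best
}

func main() {
	nodes := []Node{{"node-1", 512}, {"node-2", 2048}, {"node-3", 1024}}
	fmt.Println("bin-pack:", nodes[Place(nodes, 600, BinPack)].Name)
	fmt.Println("spread:  ", nodes[Place(nodes, 600, Spread)].Name)
}
```

Bin-packing tends to fill nodes before touching empty ones, which helps when idle nodes can be scaled down; spread keeps headroom on every node at the cost of lower density.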
┌─────────────────────────────────────────────────────────────┐
│                        Load Balancer                        │
│                 (TLS, WAF, DDoS Protection)                 │
└──────────────────────────────┬──────────────────────────────┘
                               │
           ┌───────────────────┼───────────────────┐
           │                   │                   │
   ┌───────▼───────┐   ┌───────▼───────┐   ┌───────▼───────┐
   │ API Server 1  │   │ API Server 2  │   │ API Server 3  │
   │  (Stateless)  │   │  (Stateless)  │   │  (Stateless)  │
   └───────┬───────┘   └───────┬───────┘   └───────┬───────┘
           │                   │                   │
           └───────────────────┼───────────────────┘
                               │
           ┌───────────────────┼───────────────────┐
           │                   │                   │
   ┌───────▼───────┐   ┌───────▼───────┐   ┌───────▼───────┐
   │   Scheduler   │   │   Scheduler   │   │   Scheduler   │
   │   (Leader)    │◄──│  (Follower)   │◄──│  (Follower)   │
   └───────┬───────┘   └───────────────┘   └───────────────┘
           │
           │  Placement Decisions
           │
 ┌─────────▼─────────────────────────────────┐
 │       Compute Nodes (10-50+ nodes)        │
 │   ┌───────────┐   ┌───────────┐           │
 │   │  Node 1   │   │  Node 2   │  ...      │
 │   │ VM1   VM2 │   │ VM3   VM4 │           │
 │   └───────────┘   └───────────┘           │
 └─────────────────────────────┬─────────────┘
                               │
           ┌───────────────────┼───────────────────┐
           │                   │                   │
   ┌───────▼───────┐   ┌───────▼───────┐   ┌───────▼───────┐
   │  PostgreSQL   │   │     Redis     │   │     etcd      │
   │  (Multi-AZ)   │   │  (Multi-AZ)   │   │   (Cluster)   │
   └───────────────┘   └───────────────┘   └───────────────┘
Key Components:
- API Servers: HTTP REST API (functional, fully wired) ✅
- Schedulers: Distributed scheduler with leader election (functional) ✅
- Compute Nodes: Firecracker VM management (functional, integrated) ✅
- PostgreSQL: Durable state, audit logs (functional with state store) ✅
- Redis: Cache, distributed locks, rate limiting (integrated) ✅
- etcd: Leader election, distributed coordination (integrated) ✅
- OS: Linux with KVM support (or macOS for development without VMs)
- Go: 1.24 or later
- Docker: For dependencies (PostgreSQL, Redis, etcd)
- Firecracker (optional): For full VM functionality on Linux
# 1. Clone and build
git clone https://github.com/dnakitare/aether.git
cd aether
go mod download
go build -o aether ./cmd/aether
# 2. Start infrastructure
docker-compose -f deployments/docker/docker-compose.dev.yml up -d
# Wait for PostgreSQL to be ready
sleep 5
# 3. Set environment variables
export DATABASE_URL="postgres://postgres:postgres@localhost:5432/aether?sslmode=disable"
export JWT_SECRET="your-secret-key-change-in-production"
export SERVER_ADDRESS=":8080"
# 4. Start the Aether server
./aether server
# Server will start on http://localhost:8080
# Logs will show: "Aether server started successfully"

# In another terminal
export DATABASE_URL="postgres://postgres:postgres@localhost:5432/aether?sslmode=disable"
# Create and start an agent
./aether agent create --name "my-first-agent" --image "python:3.11"
# List agents
./aether agent list
# View agent logs
./aether agent logs <agent-id>
# Stop agent
./aether agent stop <agent-id>
# Clean up
./aether agent destroy <agent-id>

# Start the runtime daemon without the HTTP API server.
# Optionally set DATABASE_URL to enable PostgreSQL persistence.
./aether daemon

# Apply all pending migrations
./aether migrate up
# Roll back the last migration
./aether migrate down
# Show current schema version
./aether migrate version

# Create a checkpoint of agent state
./aether agent checkpoint create <agent-id>
# List checkpoints for an agent
./aether agent checkpoint list <agent-id>
# Restore from a specific checkpoint version
./aether agent checkpoint restore <agent-id> --version 3
# Delete a checkpoint
./aether agent checkpoint delete <agent-id> <version>

# Unit tests (fast, no infrastructure required)
go test -short ./...
# Integration tests (requires Docker infrastructure)
docker-compose -f docker-compose.test.yml up -d
go test ./tests/integration/...
# E2E tests (validates complete workflow)
go test -v ./tests/integration/e2e_workflow_test.go
# Comprehensive test suites
go test -v ./internal/scheduler/... -run Comprehensive
go test -v ./internal/backup/... -run Comprehensive
go test -v ./internal/ha/... -run Comprehensive

✅ Core Functionality (End-to-End Working):
- Agent Lifecycle: Create, start, stop, destroy agents with PostgreSQL persistence
- HTTP API Server: Fully wired REST API with all components integrated
- Firecracker VM Management: Complete VM lifecycle with proper configuration
- JWT Authentication: Token generation, validation, and API key management
- Distributed Scheduler: Bin-packing, spread, and best-fit placement strategies with anti-affinity constraints
- PostgreSQL State Store: Durable agent state with CRUD operations
- Redis Integration: Caching, distributed locks, rate limiting
- HA Leader Election: etcd-based consensus for multi-instance deployments
- Rate Limiting: Token bucket algorithm with multi-tier support
- Backup/Restore: Automated PostgreSQL + Redis backup and recovery
- Security: Input validation, injection prevention, RBAC, tenant isolation
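The token bucket algorithm named in the list above can be sketched in a few lines. This is a minimal single-tier, in-memory illustration, not Aether's Redis-backed implementation: the `Bucket` type, its fields, and the `burstThenRefill` scenario are hypothetical names introduced for the example. Time is passed in explicitly so the behavior is deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// Bucket is a single-tier, in-memory token bucket (hypothetical type).
// It is not safe for concurrent use; a real limiter would add locking
// or keep the state in a shared store.
type Bucket struct {
	capacity float64   // maximum burst size
	rate     float64   // tokens added per second
	tokens   float64   // tokens currently available
	last     time.Time // time of the last refill
}

// NewBucket returns a full bucket.
func NewBucket(capacity, ratePerSec float64, now time.Time) *Bucket {
	return &Bucket{capacity: capacity, rate: ratePerSec, tokens: capacity, last: now}
}

// Allow refills the bucket for the elapsed time, then spends one token if available.
func (b *Bucket) Allow(now time.Time) bool {
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens += elapsed * b.rate
		if b.tokens > b.capacity {
			b.tokens = b.capacity
		}
		b.last = now
	}
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

// burstThenRefill sends three requests at once against a bucket of
// capacity 2 refilling at 1 token/s, then a fourth one second later.
func burstThenRefill() [4]bool {
	start := time.Unix(0, 0)
	b := NewBucket(2, 1, start)
	return [4]bool{
		b.Allow(start),
		b.Allow(start),
		b.Allow(start),
		b.Allow(start.Add(time.Second)),
	}
}

func main() {
	fmt.Println(burstThenRefill())
}
```

A multi-tier setup could keep one such bucket per tier (for example per tenant and per API key) and admit a request only when every applicable bucket allows it; that composition is an assumption about what "multi-tier" means here.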
✅ Testing:
- E2E Integration Tests: Complete agent lifecycle validation
- Comprehensive Test Suites: Scheduler, HA, backup, auth, rate limiting
- Infrastructure-Aware: Tests skip gracefully when dependencies unavailable
- CI-Ready: Short mode for fast CI runs, full mode for local testing
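One common way to get the "skip gracefully" and "short mode" behavior described above is a small helper that each integration test calls first. The sketch below is an assumption about how such a helper could look; `reachable`, `requireService`, and the 500 ms timeout are hypothetical and not taken from Aether's test suite. It relies only on the standard `testing.Short`, `TB.Skipf`, and `net.DialTimeout` APIs.

```go
package main

import (
	"fmt"
	"net"
	"testing"
	"time"
)

// dialTimeout bounds how long a reachability probe may take (assumed value).
const dialTimeout = 500 * time.Millisecond

// reachable reports whether a TCP endpoint accepts connections.
func reachable(addr string) bool {
	conn, err := net.DialTimeout("tcp", addr, dialTimeout)
	if err != nil {
		return false
	}
	_ = conn.Close()
	return true
}

// requireService skips the calling test when running with -short
// or when the named dependency is not reachable at addr.
func requireService(t testing.TB, name, addr string) {
	t.Helper()
	if testing.Short() {
		t.Skipf("skipping %s-dependent test in -short mode", name)
	}
	if !reachable(addr) {
		t.Skipf("skipping: %s not reachable at %s", name, addr)
	}
}

func main() {
	// Outside of tests, the probe can be used on its own.
	fmt.Println("postgres reachable:", reachable("localhost:5432"))
}
```

A test would then begin with `requireService(t, "postgres", "localhost:5432")`, so `go test -short ./...` stays fast and a full run reports skips rather than failures on machines without the Docker infrastructure.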
🚧 Known Limitations:
- Firecracker requires Linux with KVM (development on macOS skips VM operations)
- Checkpoint/restore saves metadata state only (full VM snapshot via CRIU is deferred to v1.1)
- Auto-scaling policies are defined, but the evaluation loop is not yet production-tested
❌ Not Yet Implemented:
- Multi-region support
- Full CRIU-based VM checkpoint/restore
- Advanced auto-scaling evaluation at scale
- Architecture Overview - System design and components
- Architecture Decision Records (ADRs) - Design rationale
Key ADRs:
- ADR-001: Firecracker for VM Isolation
- ADR-002: Distributed Scheduler
- ADR-003: PostgreSQL + Redis State Management
- ADR-004: JWT Authentication
- Security: See SECURITY.md for security architecture
- Upgrade Guide: See UPGRADE_GUIDE.md for version migration
# Clone and build
git clone https://github.com/dnakitare/aether.git
cd aether
go build -o aether ./cmd/aether
# Run tests with coverage
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out

aether/
├── cmd/aether/           # CLI: server, daemon, agent, migrate commands
├── internal/
│   ├── api/              # HTTP REST API server, handlers, middleware
│   ├── audit/            # Immutable audit logging
│   ├── auth/             # JWT, API keys, RBAC
│   ├── backup/           # PostgreSQL + Redis backup/restore
│   ├── cli/              # Terminal UI helpers (spinners, tables)
│   ├── config/           # Viper-based configuration loading
│   ├── database/         # Migration runner (golang-migrate)
│   ├── ha/               # High availability, leader election
│   ├── observability/    # OpenTelemetry tracing, Prometheus metrics
│   ├── optimization/     # VM pre-warming pool
│   ├── ratelimit/        # Token bucket rate limiting (Redis-backed)
│   ├── recovery/         # Agent state checkpointing
│   ├── runtime/          # Agent + VM lifecycle management
│   ├── scaler/           # Policy-based auto-scaling
│   ├── scheduler/        # Placement strategies + distributed scheduler
│   ├── state/            # PostgreSQL + Redis persistence
│   ├── tenant/           # Multi-tenant quota management
│   └── ...               # messaging, retry, routing, secrets, shutdown
├── pkg/api/              # Public API types and interfaces
├── deployments/
│   ├── docker/           # Dockerfile, Docker Compose (dev/test)
│   ├── kubernetes/       # K8s manifests + Kustomize
│   ├── terraform/        # AWS/GCP/Azure infrastructure
│   ├── prometheus/       # Prometheus + AlertManager config
│   └── grafana/          # Dashboards + datasource provisioning
├── helm/aether/          # Helm chart with PostgreSQL/Redis deps
├── migrations/           # Embedded SQL schema migrations
├── docs/                 # Architecture, ADRs, API reference, guides
└── tests/
    ├── integration/      # E2E + component integration tests
    ├── security/         # Auth, injection, tenant isolation tests
    ├── chaos/            # Chaos testing helpers
    └── load/             # Load and performance tests
| Component | Status | Coverage | Notes |
|---|---|---|---|
| Core Runtime | ✅ Complete | 65% | Full agent lifecycle integrated |
| HTTP API Server | ✅ Complete | 58% | All components wired |
| Scheduler | ✅ Complete | 82% | Bin-packing, spread, best-fit |
| VM Lifecycle | ✅ Complete | 60% | Firecracker integrated |
| PostgreSQL State | ✅ Complete | 72% | Full CRUD operations |
| HA/Leader Election | ✅ Complete | 71% | etcd-based consensus |
| Auth (JWT/API Key) | ✅ Complete | 78% | RBAC, token management |
| Rate Limiting | ✅ Complete | 85% | Token bucket algorithm |
| Backup/Restore | ✅ Complete | 68% | PostgreSQL + Redis backup |
| E2E Tests | ✅ Complete | 75% | Full lifecycle validation |
| Checkpointing | 🟡 Partial | 40% | Metadata checkpoint/restore; full VM snapshot planned |
| Observability | ✅ Complete | 60% | OpenTelemetry tracing, Prometheus metrics, structured logging |
| CLI Tool | ✅ Complete | 50% | server, daemon, agent, migrate, checkpoint commands |
| Kafka Integration | ✅ Complete | 55% | Distributed queue with DLQ and in-memory fallback |
| Deployment | ✅ Complete | n/a | Dockerfile, Helm, Kubernetes, Terraform (AWS) |
Overall Test Coverage: ~35% (measured); the v1.0 target is ≥ 70% on the in-scope surface (see docs/V1_SCOPE.md)
- ✅ Phase 1: Security (auth, isolation, validation) - Complete
- ✅ Phase 4: High Availability - Complete
- ✅ Phase 5: Disaster Recovery - Complete
- ✅ Phase 6: Observability (design) - Complete
- ✅ Phase 7: Test Coverage & Integration - Complete
- ✅ Alpha Integration: All core components wired and functional
Focus: Minimal end-to-end agent lifecycle
- Wire API server to scheduler
- Complete VM lifecycle integration
- Basic CLI commands
- End-to-end tests (create, run, destroy agent)
- Developer documentation
- PostgreSQL state persistence
- Firecracker VM management
Status: ✅ Complete (February 15, 2026)
Focus: Core control plane integrated
- Observability stack (OpenTelemetry tracing, Prometheus metrics, Grafana dashboards)
- Kafka distributed scheduling queue with DLQ (experimental; see v1.0 below)
- Resource quotas and tenant management
- Deployment automation (Terraform, Kubernetes, Helm)
- Database migrations CLI (`migrate up/down/version`)
- Checkpoint metadata save/restore
Focus: Single-region, hardware-isolated agent runtime, hardened to a defensible release.
The supported v1.0 surface is the integrated control plane that exists today: HTTP API, JWT/RBAC auth, distributed scheduler, PostgreSQL state, Redis quotas, OpenTelemetry observability, and single-cloud (AWS) deployment via Helm + Terraform. See docs/V1_SCOPE.md for the full scope document.
Shipping criteria (each measurable):
- ≥ 70% test coverage on the in-scope surface (experimental and deferred packages excluded from the denominator)
- One external security review pass covering auth, multi-tenant isolation, and the Firecracker boundary; P0/P1 findings closed before ship
- 1,000-agent load test on a single AWS region, raw numbers published in `docs/V1_LOAD_TEST.md`
- One real external adopter running v1.0-rc on a non-toy workload for ≥ 2 weeks
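To make the coverage criterion concrete, "excluded from the denominator" means that statements in experimental or deferred packages count neither as covered nor as total. The sketch below shows that arithmetic. The `PackageCoverage` type, the `InScopeCoverage` function, and every figure in `main` are hypothetical illustrations, not measurements from this repository.

```go
package main

import "fmt"

// PackageCoverage holds statement counts for one package.
// All values used in main are made-up illustrations.
type PackageCoverage struct {
	Path    string
	Covered int  // statements exercised by tests
	Total   int  // statements in the package
	InScope bool // false for experimental or deferred packages
}

// InScopeCoverage returns covered/total as a percentage,
// counting only packages that are part of the supported surface.
func InScopeCoverage(pkgs []PackageCoverage) float64 {
	var covered, total int
	for _, p := range pkgs {
		if !p.InScope {
			continue // excluded from numerator and denominator
		}
		covered += p.Covered
		total += p.Total
	}
	if total == 0 {
		return 0
	}
	return 100 * float64(covered) / float64(total)
}

func main() {
	pkgs := []PackageCoverage{
		{"internal/scheduler", 800, 1000, true},
		{"internal/auth", 400, 500, true},
		{"internal/messaging", 100, 500, false}, // experimental in this example
	}
	fmt.Printf("in-scope coverage: %.1f%%\n", InScopeCoverage(pkgs))
}
```

With these invented numbers the in-scope figure is 1200/1500 = 80%, while counting every package would give 1300/2000 = 65%, which is why the criterion states its denominator explicitly.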
Experimental in v1.0 (in repo, opt-in, not part of the supported surface):
- Kafka messaging (in-process queue is the supported default)
- HashiCorp Vault secrets (env vars / Kubernetes secrets are the supported defaults)
- GCE node provisioning (EC2 is the only supported provisioner)
Deferred:
- Full CRIU-based VM checkpoint/restore: v1.1 (metadata-level lands in v1.0)
- Azure deployment: v1.1+
- Multi-region support: v2.0
If the shipping criteria are not met by Q3 2026, the date slips. A v1.0 that means something is more valuable than a v1.0 that ships on schedule.
Aether implements defense in depth:
- Application: Input validation, injection prevention, RBAC
- Authentication: JWT with short expiry, API key rotation
- Multi-Tenancy: Tenant isolation in all queries
- Network: TLS 1.3, VPC isolation (in production design)
- VM Isolation: Firecracker hardware virtualization
- Infrastructure: Encrypted at rest/transit, secrets management
Current State: Security foundations complete (auth, validation, isolation design). Production hardening is planned for v1.0, including the external security review listed in the shipping criteria.
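As an illustration of the "JWT with short expiry" layer, here is a minimal HS256 sketch built only on the Go standard library. It is not Aether's auth code, which may well use a JWT library: the `Claims` fields (in particular `tenant`), the `Issue` and `Verify` names, and the 15-minute lifetime are assumptions made for the example.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

// Claims is a hypothetical claim set; "tenant" is an illustrative custom claim.
type Claims struct {
	Subject string `json:"sub"`
	Tenant  string `json:"tenant"`
	Expiry  int64  `json:"exp"` // Unix seconds
}

var enc = base64.RawURLEncoding

// sign returns the base64url-encoded HMAC-SHA256 of msg.
func sign(secret []byte, msg string) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(msg))
	return enc.EncodeToString(mac.Sum(nil))
}

// Issue returns an HS256-signed token carrying c.
func Issue(secret []byte, c Claims) (string, error) {
	payload, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	head := enc.EncodeToString([]byte(`{"alg":"HS256","typ":"JWT"}`))
	msg := head + "." + enc.EncodeToString(payload)
	return msg + "." + sign(secret, msg), nil
}

// Verify checks the signature, then the expiry against nowUnix.
// The header's "alg" is never trusted: the signature is always
// recomputed as HS256, so an "alg":"none" token is rejected.
func Verify(secret []byte, token string, nowUnix int64) (Claims, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return Claims{}, errors.New("malformed token")
	}
	want := sign(secret, parts[0]+"."+parts[1])
	if !hmac.Equal([]byte(want), []byte(parts[2])) {
		return Claims{}, errors.New("bad signature")
	}
	payload, err := enc.DecodeString(parts[1])
	if err != nil {
		return Claims{}, err
	}
	var c Claims
	if err := json.Unmarshal(payload, &c); err != nil {
		return Claims{}, err
	}
	if nowUnix >= c.Expiry {
		return Claims{}, errors.New("token expired")
	}
	return c, nil
}

func main() {
	secret := []byte("your-secret-key-change-in-production")
	now := time.Now().Unix()
	tok, err := Issue(secret, Claims{Subject: "user-1", Tenant: "tenant-a", Expiry: now + 15*60})
	if err != nil {
		panic(err)
	}
	c, err := Verify(secret, tok, now)
	fmt.Println(c.Tenant, err)
}
```

Carrying the tenant in a signed claim gives the query layer a value it can trust when it scopes every statement to one tenant; the short lifetime limits how long a leaked token remains useful.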
Contributions are welcome! This project is currently in beta and maintained by a solo developer.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Follow Go best practices (`go vet`, `golangci-lint`)
- Write tests for new features (aim for 60%+ coverage)
- Update documentation for user-facing changes
- Run `go test -short ./...` before submitting a PR
- 🔴 High Priority: End-to-end integration, API endpoint implementation
- 🟡 Medium Priority: CLI tool, observability integration
- 🟢 Low Priority: Documentation improvements, test coverage
Aether is licensed under the Apache License 2.0.
This means you can:
- ✅ Use it commercially
- ✅ Modify it
- ✅ Distribute it
- ✅ Use it privately
You must:
- Include the license and copyright notice
- State significant changes made to the code
See LICENSE for the full license text.
Why Apache 2.0? Patent protection, enterprise-friendly, compatible with commercial use.
- Firecracker - The microVM foundation
- etcd - Distributed consensus
- PostgreSQL - Reliable data persistence
- Redis - Fast caching and coordination
- OpenTelemetry - Observability standards
- Documentation: docs/
- Issues: GitHub Issues
- Questions: Open a discussion on GitHub
- Language: Go 1.24
- Lines of Code: ~48,000 (including tests)
- Test Coverage: ~35% (targeting ≥ 70% on the in-scope v1.0 surface; see docs/V1_SCOPE.md)
- Test Functions: 400+
- Dependencies: 30+ (see `go.mod`)
- Development Status: Beta v0.2.0 (All core components integrated)
- Current Release: v0.2.0-beta (April 2026)
Built for the AI agent ecosystem
Architecture • Contributing • Security
✨ Beta v0.2.0: All core components integrated and functional.