Dnakitare/aether


Aether

Modern AI Agent Runtime with Hardware-Level Isolation (Beta v0.2.0)

Aether is a runtime for AI agents with secure isolation, intelligent orchestration, and observability. Built on Firecracker microVMs, Aether is designed to run untrusted workloads safely and efficiently.

Think Docker for AI agents – but with security and multi-tenancy from day one.

⚠️ Project Status: Beta v0.2.0. Aether has reached beta with the core control plane integrated: HTTP API, distributed scheduler, PostgreSQL persistence, OpenTelemetry observability, and Kubernetes/Terraform deployment on AWS. Kafka messaging, HashiCorp Vault secrets, and GCE node provisioning are present but experimental. Not yet recommended for production workloads. Targeting v1.0 in Q3 2026; see docs/V1_SCOPE.md for the v1.0 scope and shipping criteria.


✨ Vision

🔒 Security First

  • Hardware-Level Isolation: Firecracker microVMs with KVM virtualization
  • Multi-Tenant Architecture: Complete tenant isolation (network, compute, data)
  • Secrets Management: Designed for HashiCorp Vault integration
  • Authentication: JWT + API keys with RBAC

🚀 Production Goals

  • High Availability: Multi-AZ deployment with automatic failover
  • Disaster Recovery: Automated backups, point-in-time recovery
  • Observability: Distributed tracing (Jaeger), metrics (Prometheus), logs
  • Scalability: Designed for 10,000+ concurrent agents

🧠 Intelligent Orchestration

  • Smart Scheduling: Bin-packing, spread, and best-fit placement strategies
  • Auto-Scaling: Policy-based horizontal scaling (planned)
  • Resource Quotas: Per-tenant CPU, memory, disk limits
  • Rate Limiting: Token bucket algorithm with multi-tier support
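The three placement strategies named above can be sketched as scoring functions over candidate nodes. This is an illustration under stated assumptions, not the scheduler's actual code: the `Node`, `Request`, and `Place` names are hypothetical, and the interpretation used here is that bin-packing prefers the most utilized node that fits, spread prefers the least utilized, and best-fit prefers the node with the least normalized capacity left after placement.

```go
package main

import "fmt"

// Node is a hypothetical view of a compute node; totals are assumed > 0.
type Node struct {
	Name              string
	CPUTotal, CPUUsed int // millicores
	MemTotal, MemUsed int // MiB
}

// Request is the resource demand of one agent VM.
type Request struct{ CPU, Mem int }

// fits reports whether the node has enough free CPU and memory for r.
func (n Node) fits(r Request) bool {
	return n.CPUTotal-n.CPUUsed >= r.CPU && n.MemTotal-n.MemUsed >= r.Mem
}

// utilization is the mean of CPU and memory utilization, in [0, 1].
func (n Node) utilization() float64 {
	cpu := float64(n.CPUUsed) / float64(n.CPUTotal)
	mem := float64(n.MemUsed) / float64(n.MemTotal)
	return (cpu + mem) / 2
}

// leftover is the normalized free capacity that would remain after placing r.
func (n Node) leftover(r Request) float64 {
	cpu := float64(n.CPUTotal-n.CPUUsed-r.CPU) / float64(n.CPUTotal)
	mem := float64(n.MemTotal-n.MemUsed-r.Mem) / float64(n.MemTotal)
	return (cpu + mem) / 2
}

type Strategy int

const (
	BinPack Strategy = iota // fill the busiest node first
	Spread                  // prefer the emptiest node
	BestFit                 // minimize capacity left over after placement
)

// Place returns the name of the chosen node, or ok=false if nothing fits.
func Place(nodes []Node, r Request, s Strategy) (name string, ok bool) {
	var bestScore float64
	for _, n := range nodes {
		if !n.fits(r) {
			continue
		}
		var score float64 // lower is better
		switch s {
		case BinPack:
			score = -n.utilization()
		case Spread:
			score = n.utilization()
		case BestFit:
			score = n.leftover(r)
		}
		if !ok || score < bestScore {
			name, bestScore, ok = n.Name, score, true
		}
	}
	return name, ok
}

func main() {
	nodes := []Node{
		{Name: "node-a", CPUTotal: 16000, CPUUsed: 9600, MemTotal: 32000, MemUsed: 19200},
		{Name: "node-b", CPUTotal: 8000, CPUUsed: 1000, MemTotal: 16000, MemUsed: 2000},
		{Name: "node-c", CPUTotal: 2000, CPUUsed: 1000, MemTotal: 4000, MemUsed: 2000},
	}
	req := Request{CPU: 800, Mem: 1600}
	for _, s := range []Strategy{BinPack, Spread, BestFit} {
		name, _ := Place(nodes, req, s)
		fmt.Println(name) // node-a, node-b, node-c respectively
	}
}
```

In this example the strategies disagree because the nodes differ in size: the large node is the busiest, while the small node ends up with the tightest fit. Anti-affinity constraints would act as an extra filter alongside `fits`.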

🎯 Use Cases

  • AI Agent Platforms: Run LLM agents, autonomous systems, AI assistants
  • Code Execution Services: Sandboxed code execution (e.g., Jupyter, REPL)
  • CI/CD Runners: Isolated build environments
  • Function-as-a-Service: Serverless function runtime
  • Multi-Tenant SaaS: Any workload requiring strong isolation

🏗️ Architecture

┌──────────────────────────────────────────────────────────────┐
│                        Load Balancer                         │
│                 (TLS, WAF, DDoS Protection)                  │
└──────────────────────────────┬───────────────────────────────┘
                               │
          ┌────────────────────┼────────────────────┐
          │                    │                    │
 ┌────────▼────────┐  ┌────────▼────────┐  ┌────────▼────────┐
 │  API Server 1   │  │  API Server 2   │  │  API Server 3   │
 │   (Stateless)   │  │   (Stateless)   │  │   (Stateless)   │
 └────────┬────────┘  └────────┬────────┘  └────────┬────────┘
          │                    │                    │
          └────────────────────┼────────────────────┘
                               │
          ┌────────────────────┼────────────────────┐
          │                    │                    │
 ┌────────▼────────┐  ┌────────▼────────┐  ┌────────▼────────┐
 │    Scheduler    │  │    Scheduler    │  │    Scheduler    │
 │    (Leader)     │──│   (Follower)    │──│   (Follower)    │
 └────────┬────────┘  └─────────────────┘  └─────────────────┘
          │
          │ Placement Decisions
          │
 ┌────────▼──────────────────────────────────────────────────┐
 │               Compute Nodes (10-50+ nodes)                │
 │  ┌────────────┐  ┌────────────┐                           │
 │  │   Node 1   │  │   Node 2   │  ...                      │
 │  │┌────┐┌────┐│  │┌────┐┌────┐│                           │
 │  ││VM1 ││VM2 ││  ││VM3 ││VM4 ││                           │
 │  │└────┘└────┘│  │└────┘└────┘│                           │
 │  └────────────┘  └────────────┘                           │
 └─────────────────────────────┬─────────────────────────────┘
                               │
          ┌────────────────────┼────────────────────┐
          │                    │                    │
 ┌────────▼────────┐  ┌────────▼────────┐  ┌────────▼────────┐
 │   PostgreSQL    │  │      Redis      │  │      etcd       │
 │   (Multi-AZ)    │  │   (Multi-AZ)    │  │    (Cluster)    │
 └─────────────────┘  └─────────────────┘  └─────────────────┘

Key Components:

  • API Servers: HTTP REST API (functional, fully wired) ✅
  • Schedulers: Distributed scheduler with leader election (functional) ✅
  • Compute Nodes: Firecracker VM management (functional, integrated) ✅
  • PostgreSQL: Durable state, audit logs (functional with state store) ✅
  • Redis: Cache, distributed locks, rate limiting (integrated) ✅
  • etcd: Leader election, distributed coordination (integrated) ✅

🚀 Quick Start

Prerequisites

  • OS: Linux with KVM support (or macOS for development without VMs)
  • Go: 1.24 or later
  • Docker: For dependencies (PostgreSQL, Redis, etcd)
  • Firecracker (optional): For full VM functionality on Linux

Beta Quick Start (5 minutes)

# 1. Clone and build
git clone https://github.com/dnakitare/aether.git
cd aether
go mod download
go build -o aether ./cmd/aether

# 2. Start infrastructure
docker-compose -f deployments/docker/docker-compose.dev.yml up -d

# Wait for PostgreSQL to be ready
sleep 5

# 3. Set environment variables
export DATABASE_URL="postgres://postgres:postgres@localhost:5432/aether?sslmode=disable"
export JWT_SECRET="your-secret-key-change-in-production"
export SERVER_ADDRESS=":8080"

# 4. Start the Aether server
./aether server

# Server will start on http://localhost:8080
# Logs will show: "Aether server started successfully"

Running Your First Agent (CLI)

# In another terminal
export DATABASE_URL="postgres://postgres:postgres@localhost:5432/aether?sslmode=disable"

# Create and start an agent
./aether agent create --name "my-first-agent" --image "python:3.11"

# List agents
./aether agent list

# View agent logs
./aether agent logs <agent-id>

# Stop agent
./aether agent stop <agent-id>

# Clean up
./aether agent destroy <agent-id>

Daemon Mode (No HTTP API)

# Start the runtime daemon without the HTTP API server.
# Optionally set DATABASE_URL to enable PostgreSQL persistence.
./aether daemon

Database Migrations

# Apply all pending migrations
./aether migrate up

# Roll back the last migration
./aether migrate down

# Show current schema version
./aether migrate version

State Checkpoints

# Create a checkpoint of agent state
./aether agent checkpoint create <agent-id>

# List checkpoints for an agent
./aether agent checkpoint list <agent-id>

# Restore from a specific checkpoint version
./aether agent checkpoint restore <agent-id> --version 3

# Delete a checkpoint
./aether agent checkpoint delete <agent-id> <version>
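The checkpoint commands above imply a per-agent sequence of numbered versions. The sketch below models that behavior with an in-memory store so the semantics are easy to see. It is not the recovery package's implementation: `Store`, `Metadata`, and the rule that version numbers are never reused after a delete are all assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

// Metadata is a hypothetical snapshot of agent metadata (not full VM state).
type Metadata map[string]string

var ErrNotFound = errors.New("checkpoint not found")

// Store keeps versioned metadata snapshots per agent, in memory only.
type Store struct {
	mu   sync.Mutex
	next map[string]int              // last version issued per agent
	data map[string]map[int]Metadata // agent ID -> version -> snapshot
}

func NewStore() *Store {
	return &Store{next: map[string]int{}, data: map[string]map[int]Metadata{}}
}

// clone copies a snapshot so callers cannot mutate stored state.
func clone(m Metadata) Metadata {
	out := make(Metadata, len(m))
	for k, v := range m {
		out[k] = v
	}
	return out
}

// Create stores a snapshot and returns its version (1, 2, 3, ...).
func (s *Store) Create(agentID string, m Metadata) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.next[agentID]++
	v := s.next[agentID]
	if s.data[agentID] == nil {
		s.data[agentID] = map[int]Metadata{}
	}
	s.data[agentID][v] = clone(m)
	return v
}

// List returns the existing versions for an agent in ascending order.
func (s *Store) List(agentID string) []int {
	s.mu.Lock()
	defer s.mu.Unlock()
	versions := make([]int, 0, len(s.data[agentID]))
	for v := range s.data[agentID] {
		versions = append(versions, v)
	}
	sort.Ints(versions)
	return versions
}

// Restore returns a copy of the snapshot stored under the given version.
func (s *Store) Restore(agentID string, v int) (Metadata, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	m, ok := s.data[agentID][v]
	if !ok {
		return nil, ErrNotFound
	}
	return clone(m), nil
}

// Delete removes one version; the number is not handed out again.
func (s *Store) Delete(agentID string, v int) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.data[agentID][v]; !ok {
		return ErrNotFound
	}
	delete(s.data[agentID], v)
	return nil
}

func main() {
	s := NewStore()
	s.Create("agent-1", Metadata{"status": "running"})
	v2 := s.Create("agent-1", Metadata{"status": "stopped"})
	m, _ := s.Restore("agent-1", 1)
	fmt.Println(v2, s.List("agent-1"), m["status"]) // 2 [1 2] running
}
```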

Running Tests

# Unit tests (fast, no infrastructure required)
go test -short ./...

# Integration tests (requires Docker infrastructure)
docker-compose -f docker-compose.test.yml up -d
go test ./tests/integration/...

# E2E tests (validates complete workflow)
go test -v ./tests/integration/e2e_workflow_test.go

# Comprehensive test suites
go test -v ./internal/scheduler/... -run Comprehensive
go test -v ./internal/backup/... -run Comprehensive
go test -v ./internal/ha/... -run Comprehensive

What Works in Beta v0.2.0

✅ Core Functionality (End-to-End Working):

  • Agent Lifecycle: Create, start, stop, destroy agents with PostgreSQL persistence
  • HTTP API Server: Fully wired REST API with all components integrated
  • Firecracker VM Management: Complete VM lifecycle with proper configuration
  • JWT Authentication: Token generation, validation, and API key management
  • Distributed Scheduler: Bin-packing, spread, and best-fit placement strategies with anti-affinity constraints
  • PostgreSQL State Store: Durable agent state with CRUD operations
  • Redis Integration: Caching, distributed locks, rate limiting
  • HA Leader Election: etcd-based consensus for multi-instance deployments
  • Rate Limiting: Token bucket algorithm with multi-tier support
  • Backup/Restore: Automated PostgreSQL + Redis backup and recovery
  • Security: Input validation, injection prevention, RBAC, tenant isolation

✅ Testing:

  • E2E Integration Tests: Complete agent lifecycle validation
  • Comprehensive Test Suites: Scheduler, HA, backup, auth, rate limiting
  • Infrastructure-Aware: Tests skip gracefully when dependencies unavailable
  • CI-Ready: Short mode for fast CI runs, full mode for local testing

🚧 Beta Limitations:

  • Firecracker requires Linux with KVM (development on macOS skips VM operations)
  • Checkpoint/restore saves metadata state only (full VM snapshot via CRIU is deferred to v1.1)
  • Auto-scaling policies defined but evaluation loop not yet production-tested

❌ Not Yet Implemented:

  • Multi-region support
  • Full CRIU-based VM checkpoint/restore
  • Advanced auto-scaling evaluation at scale

📚 Documentation

Architecture & Design

Key ADRs:

Development


🛠️ Development

Building from Source

# Clone and build
git clone https://github.com/dnakitare/aether.git
cd aether
go build -o aether ./cmd/aether

# Run tests with coverage
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out

Project Structure

aether/
├── cmd/aether/              # CLI: server, daemon, agent, migrate commands
├── internal/
│   ├── api/                 # HTTP REST API server, handlers, middleware
│   ├── audit/               # Immutable audit logging
│   ├── auth/                # JWT, API keys, RBAC
│   ├── backup/              # PostgreSQL + Redis backup/restore
│   ├── cli/                 # Terminal UI helpers (spinners, tables)
│   ├── config/              # Viper-based configuration loading
│   ├── database/            # Migration runner (golang-migrate)
│   ├── ha/                  # High availability, leader election
│   ├── observability/       # OpenTelemetry tracing, Prometheus metrics
│   ├── optimization/        # VM pre-warming pool
│   ├── ratelimit/           # Token bucket rate limiting (Redis-backed)
│   ├── recovery/            # Agent state checkpointing
│   ├── runtime/             # Agent + VM lifecycle management
│   ├── scaler/              # Policy-based auto-scaling
│   ├── scheduler/           # Placement strategies + distributed scheduler
│   ├── state/               # PostgreSQL + Redis persistence
│   ├── tenant/              # Multi-tenant quota management
│   └── ...                  # messaging, retry, routing, secrets, shutdown
├── pkg/api/                 # Public API types and interfaces
├── deployments/
│   ├── docker/              # Dockerfile, Docker Compose (dev/test)
│   ├── kubernetes/          # K8s manifests + Kustomize
│   ├── terraform/           # AWS/GCP/Azure infrastructure
│   ├── prometheus/          # Prometheus + AlertManager config
│   └── grafana/             # Dashboards + datasource provisioning
├── helm/aether/             # Helm chart with PostgreSQL/Redis deps
├── migrations/              # Embedded SQL schema migrations
├── docs/                    # Architecture, ADRs, API reference, guides
└── tests/
    ├── integration/         # E2E + component integration tests
    ├── security/            # Auth, injection, tenant isolation tests
    ├── chaos/               # Chaos testing helpers
    └── load/                # Load and performance tests

📊 Current Status

Beta v0.2.0 (April 2026)

| Component          | Status      | Coverage | Notes                                                         |
|--------------------|-------------|----------|---------------------------------------------------------------|
| Core Runtime       | ✅ Complete | 65%      | Full agent lifecycle integrated                               |
| HTTP API Server    | ✅ Complete | 58%      | All components wired                                          |
| Scheduler          | ✅ Complete | 82%      | Bin-packing, spread, best-fit                                 |
| VM Lifecycle       | ✅ Complete | 60%      | Firecracker integrated                                        |
| PostgreSQL State   | ✅ Complete | 72%      | Full CRUD operations                                          |
| HA/Leader Election | ✅ Complete | 71%      | etcd-based consensus                                          |
| Auth (JWT/API Key) | ✅ Complete | 78%      | RBAC, token management                                        |
| Rate Limiting      | ✅ Complete | 85%      | Token bucket algorithm                                        |
| Backup/Restore     | ✅ Complete | 68%      | PostgreSQL + Redis backup                                     |
| E2E Tests          | ✅ Complete | 75%      | Full lifecycle validation                                     |
| Checkpointing      | 🟡 Partial  | 40%      | Metadata checkpoint/restore; full VM snapshot planned         |
| Observability      | ✅ Complete | 60%      | OpenTelemetry tracing, Prometheus metrics, structured logging |
| CLI Tool           | ✅ Complete | 50%      | server, daemon, agent, migrate, checkpoint commands           |
| Kafka Integration  | ✅ Complete | 55%      | Distributed queue with DLQ and in-memory fallback             |
| Deployment         | ✅ Complete | n/a      | Dockerfile, Helm, Kubernetes, Terraform (AWS)                 |

Overall Test Coverage: ~35% (measured); the v1.0 target is ≥ 70% on the in-scope surface (see docs/V1_SCOPE.md)

Completion Summary

  • ✅ Phase 1: Security (auth, isolation, validation) - Complete
  • ✅ Phase 4: High Availability - Complete
  • ✅ Phase 5: Disaster Recovery - Complete
  • ✅ Phase 6: Observability (design) - Complete
  • ✅ Phase 7: Test Coverage & Integration - Complete
  • ✅ Alpha Integration: All core components wired and functional

🗺️ Roadmap

✅ Alpha v0.1.0 (Released: February 2026)

Focus: Minimal end-to-end agent lifecycle

  • Wire API server to scheduler
  • Complete VM lifecycle integration
  • Basic CLI commands
  • End-to-end tests (create, run, destroy agent)
  • Developer documentation
  • PostgreSQL state persistence
  • Firecracker VM management

Status: ✅ Complete (February 15, 2026)

✅ Beta v0.2.0 (Released: April 2026)

Focus: Core control plane integrated

  • Observability stack (OpenTelemetry tracing, Prometheus metrics, Grafana dashboards)
  • Kafka distributed scheduling queue with DLQ (experimental; see v1.0 below)
  • Resource quotas and tenant management
  • Deployment automation (Terraform, Kubernetes, Helm)
  • Database migrations CLI (migrate up/down/version)
  • Checkpoint metadata save/restore

Production v1.0 (Target: Q3 2026)

Focus: Single-region, hardware-isolated agent runtime, hardened to a defensible release.

The supported v1.0 surface is the integrated control plane that exists today: HTTP API, JWT/RBAC auth, distributed scheduler, PostgreSQL state, Redis quotas, OpenTelemetry observability, and single-cloud (AWS) deployment via Helm + Terraform. See docs/V1_SCOPE.md for the full scope document.

Shipping criteria (each measurable):

  • ≥ 70% test coverage on the in-scope surface (experimental and deferred packages excluded from the denominator)
  • One external security review pass covering auth, multi-tenant isolation, and the Firecracker boundary; P0/P1 findings closed before ship
  • 1,000-agent load test on a single AWS region, raw numbers published in docs/V1_LOAD_TEST.md
  • One real external adopter running v1.0-rc on a non-toy workload for ≥ 2 weeks

Experimental in v1.0 (in repo, opt-in, not part of the supported surface):

  • Kafka messaging (in-process queue is the supported default)
  • HashiCorp Vault secrets (env vars / Kubernetes secrets are the supported defaults)
  • GCE node provisioning (EC2 is the only supported provisioner)

Deferred:

  • Full CRIU-based VM checkpoint/restore → v1.1 (metadata-level lands in v1.0)
  • Azure deployment → v1.1+
  • Multi-region support → v2.0

If the shipping criteria are not met by Q3 2026, the date slips. A v1.0 that means something is more valuable than a v1.0 that ships on schedule.


🔐 Security

Aether implements defense in depth:

  1. Application: Input validation, injection prevention, RBAC
  2. Authentication: JWT with short expiry, API key rotation
  3. Multi-Tenancy: Tenant isolation in all queries
  4. Network: TLS 1.3, VPC isolation (in production design)
  5. VM Isolation: Firecracker hardware virtualization
  6. Infrastructure: Encrypted at rest/transit, secrets management

Current State: Security foundations complete (auth, validation, isolation design). Production hardening, including an external security review, is planned for v1.0.


🤝 Contributing

Contributions are welcome! This project is currently in beta (v0.2.0) and maintained by a solo developer.

How to Contribute

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Guidelines

  • Follow Go best practices (go vet, golangci-lint)
  • Write tests for new features (aim for 60%+ coverage)
  • Update documentation for user-facing changes
  • Run go test -short ./... before submitting PR

Priority Areas for Contributors

  • 🔴 High Priority: End-to-end integration, API endpoint implementation
  • 🟡 Medium Priority: CLI tool, observability integration
  • 🟢 Low Priority: Documentation improvements, test coverage

📜 License

Aether is licensed under the Apache License 2.0.

This means you can:

  • ✅ Use it commercially
  • ✅ Modify it
  • ✅ Distribute it
  • ✅ Use it privately

You must:

  • 📄 Include the license and copyright notice
  • 📄 State significant changes made to the code

See LICENSE for the full license text.

Why Apache 2.0? Patent protection, enterprise-friendly, compatible with commercial use.


🙏 Acknowledgments


📞 Support


📈 Project Stats

  • Language: Go 1.24
  • Lines of Code: ~48,000 (including tests)
  • Test Coverage: ~35% (targeting ≥ 70% on the in-scope v1.0 surface; see docs/V1_SCOPE.md)
  • Test Functions: 400+
  • Dependencies: 30+ (see go.mod)
  • Development Status: Beta v0.2.0 (All core components integrated)
  • Current Release: v0.2.0-beta (April 2026)

Built for the AI agent ecosystem 🚀

Architecture • Contributing • Security

✨ Beta v0.2.0: All core components integrated and functional.

⚠️ Not yet recommended for production workloads. v1.0 targeting Q3 2026.

About

Modern AI agent runtime with hardware-level isolation using Firecracker microVMs. Production-grade orchestration, security, and observability for running untrusted AI workloads at scale.
