enterprise-network-platform

Senior-level enterprise networking reference platform — Containerlab multi-site simulation (datacenter spine-leaf with EVPN-VXLAN, enterprise WAN, hybrid AWS connectivity), Nornir/Ansible automation, Batfish + pyATS CI validation, Prometheus observability.

Reference Architecture Notice: Terraform apply is manual-dispatch only — no live AWS account is required or assumed. The Containerlab simulation is fully runnable locally with make lab-up. See ADR-006.


What This Is

A production-grade reference platform simulating a multi-site enterprise network for ~5,000 users:

  • Data center fabric: 2-spine / 4-leaf topology with eBGP underlay and EVPN-VXLAN overlay
  • Enterprise WAN: HQ + Branch routers running OSPF multi-area
  • Hybrid cloud: IPSec/BGP VPN from on-premises cloud-edge to AWS Transit Gateway
  • Multi-region AWS: Two VPCs per region, cross-region TGW peering
  • Automation: Nornir and Ansible scripts for provisioning, validation, change management
  • CI validation: Batfish (offline config analysis) + pyATS (live lab assertions)
  • Observability: Prometheus + SNMP Exporter + Grafana dashboards

Architecture

                   ┌───────────────────────────────────────┐
                   │           AWS Cloud                   │
                   │  TGW us-east ◄──peer──► TGW us-west   │
                   │      │                       │        │
                   │  VPC prod-east         VPC prod-west  │
                   └──────┬────────────────────────────────┘
                          │ Site-to-Site VPN (IKEv2)
                   ┌──────┴────────────────────────────────┐
                   │    On-Premises                        │
                   │  [cloud-edge ASN 65300]               │
                   │         │ OSPF                        │
                   │  [wan-hq ASN 65200] ── [wan-branch]   │
                   │         │ eBGP                        │
                   │  [spine1] ── [spine2]  ASN 65100      │
                   │   (EVPN Route Reflectors)             │
                   │  [leaf1][leaf2][leaf3][leaf4]         │
                   │  [host1][host2][host3][host4]         │
                   └───────────────────────────────────────┘

For detailed diagrams see diagrams/ (Mermaid source) or rendered PNGs.
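
As a rough illustration of the fabric portion of the diagram, the sketch below enumerates the leaf-to-spine links of a 2-spine / 4-leaf fabric. The node names come from the diagram. That every leaf connects to every spine is the usual spine-leaf assumption and has not been checked against the lab's topology file.

```python
from itertools import product

# Node names taken from the architecture diagram.
SPINES = ["spine1", "spine2"]
LEAVES = ["leaf1", "leaf2", "leaf3", "leaf4"]


def fabric_links(spines, leaves):
    """Return every (leaf, spine) pair, assuming a full leaf-to-spine mesh."""
    return [(leaf, spine) for leaf, spine in product(leaves, spines)]


links = fabric_links(SPINES, LEAVES)
print(len(links))  # 4 leaves x 2 spines = 8 underlay links
for leaf, spine in links:
    print(f"{leaf} <-> {spine}")
```

Under the same assumption, each additional leaf adds one link (and one eBGP session) per spine.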

Tech Stack

| Layer | Technology |
| --- | --- |
| Routing | FRRouting (FRR) v9.1 |
| Overlay | EVPN-VXLAN (RFC 7432) |
| Underlay | eBGP (RFC 7938) |
| WAN | OSPFv2 multi-area |
| IPSec | IKEv2, AES-256-GCM |
| Cloud IaC | Terraform ≥ 1.6 |
| AWS | Transit Gateway, VPC, Site-to-Site VPN |
| Automation | Nornir 3.x + Ansible 9.x |
| Testing | Batfish, pyATS/Genie |
| Observability | Prometheus, Grafana, SNMP Exporter |
| Lab | Containerlab 0.54+ |

Quickstart

Prerequisites: Docker, Containerlab, Python 3.11+, make

git clone https://github.com/Kanim21/enterprise-network-platform
cd enterprise-network-platform

# Boot the full lab (spines, leaves, WAN, cloud-edge, hosts)
make lab-up

# Validate BGP convergence
make validate

# Run full test suite
make test

# Start observability stack (the subshell keeps you in the repo root for later make targets)
(cd observability && docker compose up -d)

# Tear down
make lab-down
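
What `make validate` runs internally is not shown here. As one possible approach, a convergence check can parse the JSON that FRR prints for `show bgp summary json` (via `vtysh`) and flag any peer that is not `Established`. The sample payload below is illustrative and trimmed; its addresses and AS numbers are made up rather than captured from this lab.

```python
import json

# Illustrative, trimmed sample of FRR `show bgp summary json` output.
# Addresses and AS numbers are invented for the example.
SAMPLE = """
{
  "ipv4Unicast": {
    "routerId": "10.0.0.11",
    "as": 65001,
    "peers": {
      "10.1.1.0": {"remoteAs": 65100, "state": "Established", "pfxRcd": 6},
      "10.1.2.0": {"remoteAs": 65100, "state": "Active"}
    }
  }
}
"""


def unestablished_peers(summary):
    """Return sorted (address family, peer) pairs whose state is not Established."""
    down = []
    for afi, data in summary.items():
        for peer, info in data.get("peers", {}).items():
            if info.get("state") != "Established":
                down.append((afi, peer))
    return sorted(down)


down = unestablished_peers(json.loads(SAMPLE))
print(down)  # [('ipv4Unicast', '10.1.2.0')]
```

In a live lab the JSON would come from running `vtysh -c "show bgp summary json"` inside each router container; the container names depend on the lab definition.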

Real-World Scenario

This platform models a 5,000-user multi-site enterprise with:

  • DC HQ: Primary compute fabric (EVPN-VXLAN for VM and container workloads)
  • Branch office: 500 users on a stub WAN segment; default route via HQ
  • AWS us-east-1: Primary cloud workloads (prod VPC + transit VPC)
  • AWS us-west-2: DR region, reached via TGW peering
  • Hybrid connectivity: Encrypted IPSec VPN with BGP route exchange; on-prem workloads reach AWS and vice versa

Trade-offs

| Decision | Trade-off |
| --- | --- |
| eBGP-only fabric (no OSPF/IS-IS in DC) | Simpler, loop-free, scales to hundreds of leaves; slightly more config per leaf |
| EVPN instead of OTV | Standards-based, multi-vendor; requires jumbo frames (MTU 9000) |
| FRR instead of vendor images | No licensing cost, fully open source; syntax differs from Cisco/Juniper |
| Hub-and-spoke TGW | Fewer connections, centralized routing; ~1 ms additional latency vs. VPC peering |
| Batfish in CI | Catches config bugs without a live lab; covers control plane only, not data plane |
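
The jumbo-frame requirement follows from VXLAN encapsulation overhead. The arithmetic below is a minimal sketch assuming an IPv4 underlay and no inner VLAN tag; an IPv6 underlay or tagged inner frames would add more bytes.

```python
# Header sizes in bytes for VXLAN over an IPv4 underlay (no inner VLAN tag).
INNER_ETHERNET = 14
VXLAN_HEADER = 8
UDP_HEADER = 8
OUTER_IPV4 = 20

# Bytes the underlay IP MTU must carry on top of the tenant IP packet.
VXLAN_OVERHEAD = INNER_ETHERNET + VXLAN_HEADER + UDP_HEADER + OUTER_IPV4  # 50


def max_tenant_mtu(underlay_mtu):
    """Largest tenant IP MTU that fits in the underlay without fragmentation."""
    return underlay_mtu - VXLAN_OVERHEAD


def required_underlay_mtu(tenant_mtu):
    """Smallest underlay IP MTU that carries the given tenant IP MTU."""
    return tenant_mtu + VXLAN_OVERHEAD


print(max_tenant_mtu(9000))         # 8950
print(required_underlay_mtu(1500))  # 1550
```

With the underlay at MTU 9000, tenant interfaces can stay at the standard 1500 with plenty of headroom.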

Failure Scenarios

| Failure | Impact | Recovery |
| --- | --- | --- |
| Single spine down | 50% bandwidth (ECMP halved); fabric remains reachable | BFD detects in <1s; runbook: bgp-session-flap |
| Single leaf down | Hosts on that leaf unreachable | Restart container; re-provision config |
| WAN-HQ down | Branch + cloud isolated from DC | Failover to secondary WAN path (if deployed) |
| Cloud-edge down | Hybrid VPN down; DC continues | Restart strongSwan; runbook: hybrid-vpn-down |
| AWS AZ failure | NAT GW + EC2 fail over via ASG; TGW is AZ-resilient | AWS-managed HA |
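
The spine-failure row can be checked with two small calculations: the share of leaf uplink bandwidth that survives, and the approximate BFD detection time. This is a sketch under stated assumptions. Uplinks are taken to be equal-speed and equal-cost, and the 300 ms interval with a multiplier of 3 matches FRR's BFD defaults but may not be what this lab configures.

```python
def remaining_capacity(total_spines, failed_spines):
    """Fraction of leaf uplink bandwidth left, assuming equal-speed ECMP uplinks."""
    if total_spines <= 0:
        raise ValueError("total_spines must be positive")
    if not 0 <= failed_spines <= total_spines:
        raise ValueError("failed_spines must be between 0 and total_spines")
    return (total_spines - failed_spines) / total_spines


def bfd_detection_ms(interval_ms, multiplier):
    """Approximate BFD detection time: negotiated interval times detect multiplier."""
    return interval_ms * multiplier


print(remaining_capacity(2, 1))  # 0.5
print(bfd_detection_ms(300, 3))  # 900
```

Under these assumed timers, detection lands just inside the <1s figure quoted in the table.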

Scaling Story

  • More leaves: Add a leaf config and a Nornir inventory entry; spines auto-accept new eBGP peers
  • More VLANs: Run change_vlan.py --commit or apply the Ansible vlan role
  • More AWS regions: New terraform/environments/<region>/ directory; add TGW peering
  • More branch sites: Add WAN router to OSPF area 1 (stub); no core changes needed
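
For the "more leaves" case, the inventory side might look like the sketch below, which builds a host entry in the shape Nornir's SimpleInventory expects (`hostname`, `groups`, `data`). The helper name, the group name, the keys under `data`, and the example address and ASN are all hypothetical; the repo's real inventory layout may differ.

```python
def new_leaf_entry(mgmt_ip, asn, groups=("leaf",)):
    """Build a SimpleInventory-style host entry for a new leaf.

    The group name and the `role`/`asn` keys under `data` are hypothetical;
    match them to whatever the repo's inventory actually uses.
    """
    return {
        "hostname": mgmt_ip,
        "groups": list(groups),
        "data": {"role": "leaf", "asn": asn},
    }


# Example values only; not taken from the lab's addressing or ASN plan.
hosts = {"leaf5": new_leaf_entry("172.20.20.15", 65005)}
print(hosts["leaf5"]["groups"])  # ['leaf']
```

Serialized to YAML, a dictionary like `hosts` corresponds to one more entry in the inventory's hosts file.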

Documentation

Contributing

See CONTRIBUTING.md. Security issues: see SECURITY.md.
