A comprehensive lab implementation of a modern data center fabric using EVPN-VxLAN overlay technology, designed to support high-performance AI/ML workloads. Built on the NVIDIA Air platform, with a full automation and testing suite.
- Overview
- Features
- Architecture
- Prerequisites
- Quick Start
- Lab Components
- Testing & Validation
- Performance Results
- Troubleshooting
- Use Cases
- Contributing
- Resources
- Author
This project demonstrates the design, implementation, and validation of an enterprise-grade EVPN-VxLAN network fabric optimized for AI infrastructure. It simulates the networking requirements of modern GPU clusters, including the demanding east-west traffic patterns of distributed AI training.
Modern AI training requires:
- Ultra-low latency for collective operations (AllReduce, AllGather)
- Non-blocking bandwidth between GPU nodes
- Scalability to thousands of endpoints
- Multi-tenancy for resource sharing
- Resilience to failures without disrupting training
This lab addresses all these requirements using industry-standard protocols and best practices aligned with NVIDIA's DGX BasePOD architecture.
- Clos Leaf-Spine Architecture - Non-blocking, scalable fabric design
- EVPN-VxLAN Overlay - Modern overlay solution for multi-tenancy
- eBGP Underlay - Simple, scalable routing with fast convergence
- Multi-Tenant Isolation - Secure segmentation for different workloads
- Python Test Automation - Comprehensive testing framework
- Performance Analytics - Bandwidth and latency measurements
- Failure Simulation - Automated resilience testing
- AI Traffic Patterns - AllReduce and broadcast simulation
- Real-time Metrics - Performance dashboards
- Topology Visualization - Network diagram generation
- Automated Reporting - Test results and health checks
- Troubleshooting Tools - Debug scripts and guides
┌───────────────────────────────────────────────────────────────┐
│                      EVPN-VxLAN Overlay                       │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│         ┌───────────┐                   ┌───────────┐         │
│         │  Spine-1  │                   │  Spine-2  │         │
│         │ AS 65001  │                   │ AS 65002  │         │
│         │ 10.0.0.1  │                   │ 10.0.0.2  │         │
│         └─────┬─────┘                   └─────┬─────┘         │
│               │                               │               │
│       ┌───────┴───────┬───────────────┬───────┴───────┐       │
│       │               │               │               │       │
│ ┌─────┴─────┐   ┌─────┴─────┐   ┌─────┴─────┐   ┌─────┴─────┐ │
│ │  Leaf-1   │   │  Leaf-2   │   │  Leaf-3   │   │  Leaf-4   │ │
│ │ AS 65011  │   │ AS 65012  │   │ AS 65013  │   │ AS 65014  │ │
│ │ 10.0.0.11 │   │ 10.0.0.12 │   │ 10.0.0.13 │   │ 10.0.0.14 │ │
│ └─────┬─────┘   └─────┬─────┘   └─────┬─────┘   └─────┬─────┘ │
│       │               │               │               │       │
│ ┌─────┴─────┐   ┌─────┴─────┐   ┌─────┴─────┐   ┌─────┴─────┐ │
│ │   GPU-1   │   │   GPU-2   │   │   GPU-3   │   │   GPU-4   │ │
│ │  VLAN 10  │   │  VLAN 10  │   │  VLAN 20  │   │  VLAN 20  │ │
│ └───────────┘   └───────────┘   └───────────┘   └───────────┘ │
│                                                               │
└───────────────────────────────────────────────────────────────┘
- eBGP Everywhere - Unique AS per node for optimal path selection
- /31 P2P Links - Efficient IP utilization
- Loopback VTEPs - VTEP availability independent of physical interfaces
- ECMP Load Balancing - Utilize all available paths
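The addressing principles above can be sketched in a few lines of Python. This is a minimal illustration, not the lab's actual plan: the AS numbers are taken from the topology diagram, but the `10.1.0.0/24` point-to-point pool, the node names, and the `plan_p2p_links` helper are assumptions introduced here.

```python
import ipaddress

SPINES = ["spine-1", "spine-2"]
LEAVES = ["leaf-1", "leaf-2", "leaf-3", "leaf-4"]

# AS numbers as shown in the topology diagram: one unique AS per node.
ASN = {
    "spine-1": 65001, "spine-2": 65002,
    "leaf-1": 65011, "leaf-2": 65012, "leaf-3": 65013, "leaf-4": 65014,
}

# Hypothetical pool for point-to-point links; the lab's real range is not stated.
P2P_POOL = ipaddress.ip_network("10.1.0.0/24")


def plan_p2p_links(spines, leaves, pool):
    """Assign one /31 to every spine-leaf link (two usable addresses each)."""
    subnets = pool.subnets(new_prefix=31)
    plan = []
    for spine in spines:
        for leaf in leaves:
            subnet = next(subnets)
            plan.append({
                "spine": spine,
                "leaf": leaf,
                "subnet": str(subnet),
                "spine_ip": f"{subnet[0]}/31",  # first address of the /31
                "leaf_ip": f"{subnet[1]}/31",   # second address of the /31
                "spine_as": ASN[spine],
                "leaf_as": ASN[leaf],
            })
    return plan


if __name__ == "__main__":
    for link in plan_p2p_links(SPINES, LEAVES, P2P_POOL):
        print(f'{link["spine"]} ({link["spine_ip"]}, AS {link["spine_as"]}) <-> '
              f'{link["leaf"]} ({link["leaf_ip"]}, AS {link["leaf_as"]})')
```

With two spines and four leaves this yields eight links, so eight /31 subnets cover the whole fabric where /30 subnets would consume twice the address space.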
- Basic understanding of BGP and EVPN
- Familiarity with Linux CLI
- Python programming basics (for automation)
- NVIDIA Air account (free)
- SSH client
- Python 3.8+ (for automation scripts)
- Web browser for Air GUI
- CCNP/JNCIP level networking knowledge
- Experience with data center fabrics
- Understanding of overlay technologies
git clone https://github.com/scthornton/evpn-vxlan-lab.git
cd evpn-vxlan-lab
# Login to NVIDIA Air
# Create new simulation from EVPN template or upload topology.dot
# Start the simulation
# Run the automated deployment script
./scripts/deploy_base_config.sh
# Or manually configure each device
ssh cumulus@<device-ip>
sudo net add configuration...
# Install Python dependencies
pip install -r requirements.txt
# Run comprehensive test suite
python tests/evpn_tester.py --topology configs/topology.json --tests all
# Check test report
cat evpn_test_report.txt
# View performance visualizations
open bandwidth_heatmap.png
- topology.json - Lab topology definition
- spine_config.txt - Spine switch configurations
- leaf_config.txt - Leaf switch configurations
- host_config.txt - Host configurations

- deploy_base_config.sh - Automated deployment
- health_check.sh - Quick health validation
- traffic_generator.py - AI traffic pattern simulation
- failover_test.sh - Resilience testing

- evpn_tester.py - Main test automation framework
- performance_baseline.py - Performance benchmarking
- security_validation.py - Multi-tenancy verification

- troubleshooting_guide.md - Common issues and solutions
- design_decisions.md - Architecture rationale
- performance_tuning.md - Optimization guide
The lab includes comprehensive testing for:
# BGP Underlay Testing
- Verify all BGP sessions established
- Check route propagation
- Validate ECMP functionality
# EVPN Overlay Testing
- Confirm EVPN neighbor relationships
- Verify MAC/IP advertisement
- Check VNI configuration
# Connectivity Testing
- End-to-end ping tests
- Multi-tenant isolation verification
- Broadcast/multicast functionality
# Performance Testing
- Bandwidth measurements (iperf3)
- Latency profiling (netperf)
- Jitter and packet loss analysis
# Failure Scenarios
- Single spine failure
- Leaf switch failure
- Link flapping
- Recovery time measurement
# Traffic Patterns
- AllReduce communication pattern
- Broadcast operations
- Ring-based collective operations
- Parameter server patterns
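The traffic patterns listed above differ mainly in which host pairs exchange data. The sketch below enumerates the flows for three of them; it is an illustration only, since the README does not show how `traffic_generator.py` builds its flows. The host names and function names are hypothetical.

```python
from itertools import permutations

# Hypothetical host names matching the four GPU nodes in the topology diagram.
HOSTS = ["gpu-1", "gpu-2", "gpu-3", "gpu-4"]


def all_to_all_flows(hosts):
    """Every ordered (src, dst) pair, as in a naive all-to-all AllReduce exchange."""
    return list(permutations(hosts, 2))


def ring_flows(hosts):
    """Each host sends to its successor, wrapping around at the end of the list."""
    return [(hosts[i], hosts[(i + 1) % len(hosts)]) for i in range(len(hosts))]


def broadcast_flows(hosts, root):
    """The root host sends to every other host."""
    return [(root, host) for host in hosts if host != root]


if __name__ == "__main__":
    print("all-to-all:", len(all_to_all_flows(HOSTS)), "flows")
    print("ring:", ring_flows(HOSTS))
    print("broadcast:", broadcast_flows(HOSTS, "gpu-1"))
```

For four hosts this gives 12 all-to-all flows, 4 ring flows, and 3 broadcast flows, which is why an all-to-all exchange stresses the fabric far more than a ring.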
# Run all tests
python tests/evpn_tester.py --topology configs/topology.json --tests all
# Run specific test category
python tests/evpn_tester.py --topology configs/topology.json --tests bgp evpn
# Generate performance baseline
python tests/performance_baseline.py --duration 300
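The multi-tenant isolation check from the test list can be reasoned about as an expected reachability matrix. The sketch below assumes that VLAN 10 and VLAN 20 are separate tenants with no routing between them, which the README implies but does not state outright. The host names and both helper functions are hypothetical and are not taken from `security_validation.py`.

```python
# VLAN membership as shown in the topology diagram.
HOST_VLAN = {"gpu-1": 10, "gpu-2": 10, "gpu-3": 20, "gpu-4": 20}


def expected_reachability(host_vlan):
    """Map each ordered host pair to True when both hosts share a VLAN."""
    return {
        (src, dst): host_vlan[src] == host_vlan[dst]
        for src in host_vlan
        for dst in host_vlan
        if src != dst
    }


def isolation_violations(expected, observed):
    """Return the pairs whose observed reachability differs from the expectation."""
    return sorted(
        pair for pair, want in expected.items() if observed.get(pair) != want
    )


if __name__ == "__main__":
    expected = expected_reachability(HOST_VLAN)
    # Simulated ping results with one leak between tenants (illustrative data only).
    observed = dict(expected)
    observed[("gpu-2", "gpu-4")] = True
    print(isolation_violations(expected, observed))
```

In a real run, `observed` would be filled from ping results between hosts; a non-empty result means either a tenant leak or a broken path inside a tenant.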
Metric | Target | Achieved | Status |
---|---|---|---|
East-West Bandwidth | > 9 Gbps | 9.4 Gbps | PASS |
RTT Latency | < 0.5 ms | 0.248 ms | PASS |
Convergence Time | < 3 s | 0.9 s | PASS |
Packet Loss | 0% | 0% | PASS |

Pattern | Total BW | Per-Flow BW | Efficiency |
---|---|---|---|
AllReduce | 112 Gbps | 9.3 Gbps | 93% |
Broadcast | 28 Gbps | 9.4 Gbps | 94% |
Ring | 37 Gbps | 9.2 Gbps | 92% |
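
The efficiency column is consistent with dividing per-flow bandwidth by a 10 Gbps line rate, and the totals are consistent with 12, 3, and 4 concurrent flows respectively. The README does not state either figure, so both are assumptions in the sketch below, which simply reproduces the arithmetic from the table values.

```python
# Rows copied from the traffic pattern table: (total Gbps, per-flow Gbps).
PATTERNS = {
    "AllReduce": (112.0, 9.3),
    "Broadcast": (28.0, 9.4),
    "Ring": (37.0, 9.2),
}

# Assumption: 10 Gbps host links; the line rate is not stated in the README.
LINE_RATE_GBPS = 10.0


def efficiency_pct(per_flow_gbps, line_rate_gbps=LINE_RATE_GBPS):
    """Per-flow throughput as a rounded percentage of the assumed line rate."""
    return round(per_flow_gbps / line_rate_gbps * 100)


def implied_flows(total_gbps, per_flow_gbps):
    """Number of concurrent flows implied by the total and per-flow figures."""
    return round(total_gbps / per_flow_gbps)


if __name__ == "__main__":
    for name, (total, per_flow) in PATTERNS.items():
        print(f"{name}: {efficiency_pct(per_flow)}% efficiency, "
              f"about {implied_flows(total, per_flow)} concurrent flows")
```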
Scenario | Detection | Convergence | Total |
---|---|---|---|
Spine Failure | 0.3s | 0.6s | 0.9s |
Leaf Failure | 0.2s | 0.4s | 0.6s |
Link Failure | 0.1s | 0.3s | 0.4s |
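
In the failure table, the total for each scenario is the detection time plus the convergence time. The sketch below recomputes the totals and compares them with the 3-second convergence target from the key metrics table. The `estimate_outage` helper shows one common way to approximate an outage from lost pings; it is an assumption, as the README does not describe how recovery time was measured.

```python
# Rows copied from the failure table: (detection seconds, convergence seconds).
FAILURES = {
    "Spine Failure": (0.3, 0.6),
    "Leaf Failure": (0.2, 0.4),
    "Link Failure": (0.1, 0.3),
}

# Target from the key metrics table: convergence in under 3 seconds.
TARGET_SECONDS = 3.0


def total_recovery(detection_s, convergence_s):
    """Total outage is detection plus convergence, rounded to milliseconds."""
    return round(detection_s + convergence_s, 3)


def estimate_outage(lost_probes, interval_s):
    """Hypothetical estimate: consecutive lost probes times the probe interval."""
    return round(lost_probes * interval_s, 3)


if __name__ == "__main__":
    for scenario, (detect, converge) in FAILURES.items():
        total = total_recovery(detect, converge)
        verdict = "within target" if total < TARGET_SECONDS else "over target"
        print(f"{scenario}: {total} s ({verdict})")
```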
# Check connectivity
ping -I swp1 <neighbor-ip>
# Verify BGP configuration
net show bgp summary
net show bgp neighbor <neighbor-ip>
# Check logs
sudo journalctl -u frr -f
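When checking many switches, the BGP summary can be inspected programmatically instead of by eye. The sketch below assumes the JSON form of the summary follows the FRR layout, with a `peers` object under each address family and a `state` field per peer. The sample document is hand-written for illustration, not captured from the lab, and the interface names are hypothetical.

```python
import json

# Hand-written sample in the assumed FRR JSON layout (not real lab output).
SAMPLE_SUMMARY = """
{
  "ipv4Unicast": {
    "peers": {
      "swp51": {"remoteAs": 65001, "state": "Established"},
      "swp52": {"remoteAs": 65002, "state": "Active"}
    }
  }
}
"""


def down_peers(summary, address_family="ipv4Unicast"):
    """Return the sorted names of peers that are not in the Established state."""
    peers = summary.get(address_family, {}).get("peers", {})
    return sorted(
        name for name, info in peers.items()
        if info.get("state") != "Established"
    )


if __name__ == "__main__":
    summary = json.loads(SAMPLE_SUMMARY)
    print("Peers not established:", down_peers(summary))
```

A peer stuck in `Active` or `Idle` usually points to a reachability or AS-number mismatch on that link, which is what the connectivity and neighbor commands above help confirm.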
# Verify EVPN activation
net show bgp l2vpn evpn summary
# Check VNI advertisement
net show evpn vni
net show bgp l2vpn evpn route
# Verify VTEP configuration
net show vxlan vtep
# Check MAC learning
net show evpn mac vni all
# Verify VLAN to VNI mapping
net show bridge vlan
# Test underlay connectivity
traceroute -s <loopback-ip> <remote-loopback>
For detailed troubleshooting, see docs/troubleshooting_guide.md
- Distributed training clusters
- GPU-to-GPU communication optimization
- Parameter server architectures
- Inference serving networks
- Kubernetes cluster networking
- Microservices communication
- Container overlay networks
- Service mesh integration
- Multi-tenant isolation
- Workload mobility
- Disaster recovery
- Hybrid cloud connectivity
Contributions are welcome! Please feel free to submit issues, fork the repository, and create pull requests.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
- Follow PEP 8 for Python code
- Use meaningful variable names
- Add comments for complex logic
- Include docstrings for functions
- NVIDIA Air Platform Guide
- Cumulus Linux Documentation
- RFC 7432 - BGP MPLS-Based Ethernet VPN
- NVIDIA DGX BasePOD Reference Architecture
Scott Thornton
- LinkedIn: @scthornton
- GitHub: @scthornton
- Blog: EVPN-VxLAN
Senior Infrastructure & Security Architect with 20+ years of experience in enterprise networking. Currently focused on AI infrastructure and security at Palo Alto Networks. Recent certifications include NVIDIA InfiniBand, RDMA Programming, and AI Infrastructure Operations.
This project is licensed under the MIT License - see the LICENSE file for details.
- NVIDIA Air team for the excellent platform
- Cumulus Linux community for documentation
- EVPN/VxLAN pioneers who created these protocols
- The AI/ML community driving infrastructure innovation
If you find this project helpful, please consider giving it a star!
Last updated: December 2024