scthornton/evpn-vxlan-ai-fabric

🚀 AI-Ready Data Center Network: EVPN-VxLAN Lab

A lab implementation of a modern data center fabric using EVPN-VxLAN overlay technology, designed to support high-performance AI/ML workloads. Built on the NVIDIA Air platform, with an automation and testing suite.

🎯 Overview

This project demonstrates the design, implementation, and validation of an enterprise-grade EVPN-VxLAN network fabric optimized for AI infrastructure. It simulates the networking requirements of modern GPU clusters, including the demanding east-west traffic patterns of distributed AI training.

Why This Matters

Modern AI training requires:

  • Ultra-low latency for collective operations (AllReduce, AllGather)
  • Non-blocking bandwidth between GPU nodes
  • Scalability to thousands of endpoints
  • Multi-tenancy for resource sharing
  • Resilience to failures without disrupting training

This lab models each of these requirements at small scale, using industry-standard protocols and practices aligned with NVIDIA's DGX BasePOD architecture.

✨ Features

Network Design

  • πŸ—οΈ CLOS Leaf-Spine Architecture - Non-blocking, scalable fabric design
  • πŸ”„ EVPN-VxLAN Overlay - Modern overlay solution for multi-tenancy
  • 🚦 eBGP Underlay - Simple, scalable routing with fast convergence
  • πŸ” Multi-Tenant Isolation - Secure segmentation for different workloads

Automation & Testing

  • 🐍 Python Test Automation - Comprehensive testing framework
  • πŸ“Š Performance Analytics - Bandwidth and latency measurements
  • πŸ”₯ Failure Simulation - Automated resilience testing
  • πŸ“ˆ AI Traffic Patterns - AllReduce and broadcast simulation

Monitoring & Visualization

  • πŸ“‰ Real-time Metrics - Performance dashboards
  • πŸ—ΊοΈ Topology Visualization - Network diagram generation
  • πŸ“ Automated Reporting - Test results and health checks
  • πŸ” Troubleshooting Tools - Debug scripts and guides

🏛️ Architecture

┌──────────────────────────────────────────────────────────────┐
│                      EVPN-VxLAN Overlay                      │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│         ┌───────────┐                  ┌───────────┐         │
│         │  Spine-1  │                  │  Spine-2  │         │
│         │ AS 65001  │                  │ AS 65002  │         │
│         │ 10.0.0.1  │                  │ 10.0.0.2  │         │
│         └─────┬─────┘                  └─────┬─────┘         │
│               │                              │               │
│        ┌──────┴───────┬──────────────┬───────┴──────┐        │
│        │              │              │              │        │
│  ┌─────┴─────┐  ┌─────┴─────┐  ┌─────┴─────┐  ┌─────┴─────┐  │
│  │  Leaf-1   │  │  Leaf-2   │  │  Leaf-3   │  │  Leaf-4   │  │
│  │ AS 65011  │  │ AS 65012  │  │ AS 65013  │  │ AS 65014  │  │
│  │ 10.0.0.11 │  │ 10.0.0.12 │  │ 10.0.0.13 │  │ 10.0.0.14 │  │
│  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘  │
│        │              │              │              │        │
│  ┌─────┴─────┐  ┌─────┴─────┐  ┌─────┴─────┐  ┌─────┴─────┐  │
│  │   GPU-1   │  │   GPU-2   │  │   GPU-3   │  │   GPU-4   │  │
│  │  VLAN 10  │  │  VLAN 10  │  │  VLAN 20  │  │  VLAN 20  │  │
│  └───────────┘  └───────────┘  └───────────┘  └───────────┘  │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Key Design Principles

  1. eBGP Everywhere - Unique AS per node for optimal path selection
  2. /31 P2P Links - Efficient IP utilization
  3. Loopback VTEPs - VTEP availability independent of physical interfaces
  4. ECMP Load Balancing - Utilize all available paths
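
The addressing principles above can be sketched with Python's standard `ipaddress` module. The loopbacks and AS numbers come from the architecture diagram; the `10.1.0.0/24` point-to-point pool, the full spine-to-leaf mesh, and the function name `allocate_p2p_links` are assumptions made for illustration and are not taken from the lab's configuration files.

```python
import ipaddress

# Hypothetical pool for point-to-point links; the README does not state one.
P2P_POOL = ipaddress.ip_network("10.1.0.0/24")

# Loopbacks and AS numbers as shown in the architecture diagram.
SPINES = {"spine-1": ("10.0.0.1", 65001), "spine-2": ("10.0.0.2", 65002)}
LEAVES = {
    "leaf-1": ("10.0.0.11", 65011),
    "leaf-2": ("10.0.0.12", 65012),
    "leaf-3": ("10.0.0.13", 65013),
    "leaf-4": ("10.0.0.14", 65014),
}


def allocate_p2p_links(pool, spines, leaves):
    """Assign one /31 per spine-leaf link: first address to the spine, second to the leaf."""
    subnets = pool.subnets(new_prefix=31)
    links = []
    for spine in spines:
        for leaf in leaves:
            net = next(subnets)
            links.append({
                "spine": spine, "leaf": leaf, "subnet": str(net),
                "spine_ip": str(net[0]), "leaf_ip": str(net[1]),
            })
    return links


links = allocate_p2p_links(P2P_POOL, SPINES, LEAVES)
for link in links:
    print(f"{link['spine']} {link['spine_ip']} <-> {link['leaf_ip']} {link['leaf']} ({link['subnet']})")

# Every node has its own AS number, so every spine-leaf session is eBGP.
all_asns = [asn for _, asn in list(SPINES.values()) + list(LEAVES.values())]
assert len(all_asns) == len(set(all_asns))
```

Under these assumptions, eight links consume sixteen addresses, with none lost to network or broadcast addresses as they would be with /30 subnets.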

📚 Prerequisites

Required Knowledge

  • Basic understanding of BGP and EVPN
  • Familiarity with Linux CLI
  • Python programming basics (for automation)

Technical Requirements

  • NVIDIA Air account (free)
  • SSH client
  • Python 3.8+ (for automation scripts)
  • Web browser for Air GUI

Recommended Background

  • CCNP/JNCIP level networking knowledge
  • Experience with data center fabrics
  • Understanding of overlay technologies

🚀 Quick Start

1. Clone the Repository

git clone https://github.com/scthornton/evpn-vxlan-ai-fabric.git
cd evpn-vxlan-ai-fabric

2. Set Up NVIDIA Air Environment

# Login to NVIDIA Air
# Create new simulation from EVPN template or upload topology.dot
# Start the simulation

3. Deploy Base Configuration

# Run the automated deployment script
./scripts/deploy_base_config.sh

# Or manually configure each device
ssh cumulus@<device-ip>
sudo net add configuration...

4. Run Validation Tests

# Install Python dependencies
pip install -r requirements.txt

# Run comprehensive test suite
python tests/evpn_tester.py --topology configs/topology.json --tests all

5. View Results

# Check test report
cat evpn_test_report.txt

# View performance visualizations (open is macOS-specific; use xdg-open on Linux)
open bandwidth_heatmap.png

🔧 Lab Components

/configs

  • topology.json - Lab topology definition
  • spine_config.txt - Spine switch configurations
  • leaf_config.txt - Leaf switch configurations
  • host_config.txt - Host configurations

/scripts

  • deploy_base_config.sh - Automated deployment
  • health_check.sh - Quick health validation
  • traffic_generator.py - AI traffic pattern simulation
  • failover_test.sh - Resilience testing

/tests

  • evpn_tester.py - Main test automation framework
  • performance_baseline.py - Performance benchmarking
  • security_validation.py - Multi-tenancy verification

/docs

  • troubleshooting_guide.md - Common issues and solutions
  • design_decisions.md - Architecture rationale
  • performance_tuning.md - Optimization guide

🧪 Testing & Validation

Automated Test Suite

The lab includes comprehensive testing for:

1. Control Plane Validation

# BGP Underlay Testing
- Verify all BGP sessions established
- Check route propagation
- Validate ECMP functionality

# EVPN Overlay Testing  
- Confirm EVPN neighbor relationships
- Verify MAC/IP advertisement
- Check VNI configuration
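
As a minimal sketch of the "all BGP sessions established" check, the function below walks a summary dictionary and reports peers that are not in the `Established` state. The layout of the dictionary (address family, then `peers`, then a `state` field) is an assumption about the JSON an FRR-based switch returns and should be checked against real output; `unestablished_peers` and the sample data are hypothetical and not part of `evpn_tester.py`.

```python
def unestablished_peers(summary):
    """Return (address family, peer) pairs whose session state is not 'Established'."""
    problems = []
    for family, data in summary.items():
        for peer, info in data.get("peers", {}).items():
            if info.get("state") != "Established":
                problems.append((family, peer))
    return sorted(problems)


# Hypothetical sample; the key names are assumed, not captured from a device.
sample = {
    "ipv4Unicast": {"peers": {"swp1": {"state": "Established"}, "swp2": {"state": "Active"}}},
    "l2VpnEvpn": {"peers": {"swp1": {"state": "Established"}, "swp2": {"state": "Idle"}}},
}

print(unestablished_peers(sample))
```

Covering the underlay and EVPN address families in one pass keeps the two control-plane checks listed above in a single report.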

2. Data Plane Validation

# Connectivity Testing
- End-to-end ping tests
- Multi-tenant isolation verification
- Broadcast/multicast functionality

# Performance Testing
- Bandwidth measurements (iperf3)
- Latency profiling (netperf)
- Jitter and packet loss analysis
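
The multi-tenant isolation check can be expressed as an expected reachability matrix. The sketch below uses the host-to-VLAN assignment from the architecture diagram and assumes strict isolation, meaning no routing between VLAN 10 and VLAN 20; the function names and the `observed` results are hypothetical.

```python
from itertools import permutations

# Host-to-VLAN assignment as drawn in the architecture diagram.
HOST_VLAN = {"gpu-1": 10, "gpu-2": 10, "gpu-3": 20, "gpu-4": 20}


def expected_reachability(host_vlan):
    """Map each ordered (src, dst) pair to True when both hosts share a VLAN."""
    return {
        (src, dst): host_vlan[src] == host_vlan[dst]
        for src, dst in permutations(host_vlan, 2)
    }


def isolation_violations(expected, observed):
    """List pairs whose observed ping result differs from the expectation."""
    return sorted(pair for pair, ok in expected.items() if observed.get(pair) != ok)


expected = expected_reachability(HOST_VLAN)

# Hypothetical ping results: everything as expected except one cross-tenant leak.
observed = dict(expected)
observed[("gpu-1", "gpu-3")] = True

print(isolation_violations(expected, observed))
```

Treating an unexpected success as a failure matters as much as catching a lost ping, since a leak between tenants is the case a plain connectivity test would miss.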

3. Resilience Testing

# Failure Scenarios
- Single spine failure
- Leaf switch failure  
- Link flapping
- Recovery time measurement
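
One simple way to measure recovery time is to send probes at a fixed interval during the failure and look for the longest gap between replies. The sketch below assumes timestamps in integer milliseconds; `longest_outage` and the sample timestamps are hypothetical and do not describe how `failover_test.sh` works.

```python
def longest_outage(reply_times_ms, interval_ms):
    """Return the longest gap between consecutive replies, minus the normal probe interval."""
    worst = 0
    for earlier, later in zip(reply_times_ms, reply_times_ms[1:]):
        gap = later - earlier
        if gap > interval_ms:
            worst = max(worst, gap - interval_ms)
    return worst


# Hypothetical capture: probes every 100 ms, replies missing between 200 ms and 1100 ms.
replies = [0, 100, 200, 1100, 1200]
print(longest_outage(replies, 100))  # 800
```

The resolution of this method is bounded by the probe interval, so a shorter interval is needed to resolve sub-second convergence.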

4. AI Workload Simulation

# Traffic Patterns
- AllReduce communication pattern
- Broadcast operations
- Ring-based collective operations
- Parameter server patterns
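
The traffic patterns above differ mainly in which host pairs exchange data. The following sketch generates the sender-receiver pairs for three of them over the four GPU hosts in the diagram. It treats the hosts as one group, leaves tenant placement aside, and models AllReduce as a plain all-to-all exchange, which is only one of several ways the collective can be implemented; the function names are hypothetical and not taken from `traffic_generator.py`.

```python
from itertools import permutations

HOSTS = ["gpu-1", "gpu-2", "gpu-3", "gpu-4"]


def all_to_all(hosts):
    """Every host sends to every other host (a naive AllReduce exchange)."""
    return list(permutations(hosts, 2))


def broadcast(hosts, root):
    """The root host sends to every other host."""
    return [(root, host) for host in hosts if host != root]


def ring(hosts):
    """Each host sends to its successor, and the last host wraps to the first."""
    return [(hosts[i], hosts[(i + 1) % len(hosts)]) for i in range(len(hosts))]


for name, flows in [
    ("all-to-all", all_to_all(HOSTS)),
    ("broadcast", broadcast(HOSTS, "gpu-1")),
    ("ring", ring(HOSTS)),
]:
    print(f"{name}: {len(flows)} flows")
```

With four hosts this yields 12, 3, and 4 concurrent flows respectively, which is why the all-to-all pattern places the heaviest load on the fabric.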

Running Tests

# Run all tests
python tests/evpn_tester.py --topology configs/topology.json --tests all

# Run specific test category
python tests/evpn_tester.py --topology configs/topology.json --tests bgp evpn

# Generate performance baseline
python tests/performance_baseline.py --duration 300
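
For readers writing their own runner, the command-line shape used above (one `--topology` path and one or more `--tests` values) can be reproduced with `argparse`. This is a hypothetical sketch of a compatible interface, not the actual argument handling in `evpn_tester.py`; the default of `all` is an assumption.

```python
import argparse


def build_parser():
    """Build a parser accepting the two options used in the commands above."""
    parser = argparse.ArgumentParser(description="EVPN-VxLAN lab test runner (sketch)")
    parser.add_argument("--topology", required=True, help="path to the topology JSON file")
    parser.add_argument("--tests", nargs="+", default=["all"],
                        help="test categories to run, e.g. all, bgp, evpn")
    return parser


# Parse the same arguments as the "specific test category" command above.
args = build_parser().parse_args(
    ["--topology", "configs/topology.json", "--tests", "bgp", "evpn"]
)
print(args.topology, args.tests)
```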

📊 Performance Results

Baseline Performance Metrics

| Metric | Target | Achieved | Status |
|---|---|---|---|
| E-W Bandwidth | > 9 Gbps | 9.4 Gbps | ✅ PASS |
| RTT Latency | < 0.5 ms | 0.248 ms | ✅ PASS |
| Convergence Time | < 3 s | 0.9 s | ✅ PASS |
| Packet Loss | 0% | 0% | ✅ PASS |

AI Workload Performance

| Pattern | Total BW | Per-Flow BW | Efficiency |
|---|---|---|---|
| AllReduce | 112 Gbps | 9.3 Gbps | 93% |
| Broadcast | 28 Gbps | 9.4 Gbps | 94% |
| Ring | 37 Gbps | 9.2 Gbps | 92% |
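
The figures in this table can be cross-checked with a little arithmetic. The sketch below infers the number of concurrent flows from total and per-flow bandwidth, and recomputes efficiency as per-flow bandwidth over line rate. The 10 Gbps line rate is an assumption, since the README does not state the host link speed, and `summarize` is a hypothetical helper.

```python
LINE_RATE_GBPS = 10.0  # Assumption: the README does not state the host link speed.

# Total and per-flow bandwidth in Gbps, copied from the table above.
RESULTS = {
    "AllReduce": {"total": 112.0, "per_flow": 9.3},
    "Broadcast": {"total": 28.0, "per_flow": 9.4},
    "Ring": {"total": 37.0, "per_flow": 9.2},
}


def summarize(results, line_rate):
    """Infer the flow count and efficiency percentage for each traffic pattern."""
    rows = {}
    for pattern, bw in results.items():
        rows[pattern] = {
            "flows": round(bw["total"] / bw["per_flow"]),
            "efficiency_pct": round(100 * bw["per_flow"] / line_rate),
        }
    return rows


for pattern, row in summarize(RESULTS, LINE_RATE_GBPS).items():
    print(f"{pattern}: {row['flows']} flows, {row['efficiency_pct']}% of line rate")
```

Under that assumption the efficiency column is reproduced, and the inferred flow counts (12, 3, and 4) match what a four-host all-to-all, broadcast, and ring exchange would generate.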

Failure Recovery Times

| Scenario | Detection | Convergence | Total |
|---|---|---|---|
| Spine Failure | 0.3 s | 0.6 s | 0.9 s |
| Leaf Failure | 0.2 s | 0.4 s | 0.6 s |
| Link Failure | 0.1 s | 0.3 s | 0.4 s |

🔧 Troubleshooting

Common Issues

BGP Sessions Not Establishing

# Check connectivity
ping -I swp1 <neighbor-ip>

# Verify BGP configuration
net show bgp summary
net show bgp neighbor <neighbor-ip>

# Check logs (on Cumulus Linux, bgpd runs under the frr service)
sudo journalctl -u frr -f

EVPN Routes Not Propagating

# Verify EVPN activation
net show bgp l2vpn evpn summary

# Check VNI advertisement
net show evpn vni
net show bgp l2vpn evpn route

# Verify VTEP configuration
net show vxlan vtep

No End-to-End Connectivity

# Check MAC learning
net show evpn mac vni all

# Verify VLAN to VNI mapping
net show bridge vlan

# Test underlay connectivity
traceroute -s <loopback-ip> <remote-loopback>

For detailed troubleshooting, see docs/troubleshooting_guide.md

💡 Use Cases

1. AI/ML Infrastructure

  • Distributed training clusters
  • GPU-to-GPU communication optimization
  • Parameter server architectures
  • Inference serving networks

2. Cloud-Native Applications

  • Kubernetes cluster networking
  • Microservices communication
  • Container overlay networks
  • Service mesh integration

3. Traditional Enterprise

  • Multi-tenant isolation
  • Workload mobility
  • Disaster recovery
  • Hybrid cloud connectivity

🀝 Contributing

Contributions are welcome! Please feel free to submit issues, fork the repository, and create pull requests.

How to Contribute

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Code Style

  • Follow PEP 8 for Python code
  • Use meaningful variable names
  • Add comments for complex logic
  • Include docstrings for functions

📚 Resources

Documentation

Related Projects

Learning Resources

👤 Author

Scott Thornton

Background

Senior Infrastructure & Security Architect with 20+ years of experience in enterprise networking. Currently focused on AI infrastructure and security at Palo Alto Networks. Recent certifications include NVIDIA InfiniBand, RDMA Programming, and AI Infrastructure Operations.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • NVIDIA Air team for the excellent platform
  • Cumulus Linux community for documentation
  • EVPN/VxLAN pioneers who created these protocols
  • The AI/ML community driving infrastructure innovation

⭐ If you find this project helpful, please consider giving it a star!

Last updated: December 2024
