Building UltraBalancer: A Load Balancer That Handles 1M+ RPS
The Problem
Most modern load balancers are either too complex, too slow, or both. When you’re building systems that need to handle millions of requests per second, every microsecond counts. I wanted something that could:
- Handle 1M+ requests per second
- Maintain sub-millisecond latency
- Use minimal resources
- Be simple to deploy and configure
So I built UltraBalancer.
Architecture
UltraBalancer is written in C/C++ and built around three core principles:
1. Lock-Free Data Structures
Traditional load balancers protect shared state with mutexes, and lock contention becomes a bottleneck as core counts grow. UltraBalancer uses lock-free data structures for:
- Request queuing
- Backend server health tracking
- Connection pooling
This removes lock contention from the hot path and lets throughput scale close to linearly with CPU cores.
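To make the idea concrete, here is a minimal sketch of a lock-free request queue. It is not UltraBalancer's actual code: the `SpscQueue` name and its single-producer/single-consumer design are assumptions chosen because they are the simplest correct case to show. One thread pushes, one thread pops, and the only synchronization is a pair of atomic indices.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// Hypothetical single-producer/single-consumer ring buffer (illustration only).
// Capacity must be a power of two so index wrapping is a cheap bit mask.
template <typename T, std::size_t Capacity>
class SpscQueue {
    static_assert((Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    // Called only by the producer thread. Returns false when the queue is full.
    bool try_push(const T& value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (tail - head == Capacity) return false;
        slots_[tail & (Capacity - 1)] = value;
        // Release: the slot write above becomes visible before the new tail.
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

    // Called only by the consumer thread. Returns nullopt when the queue is empty.
    std::optional<T> try_pop() {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head == tail) return std::nullopt;
        T value = slots_[head & (Capacity - 1)];
        // Release: the slot read above completes before the slot is handed back.
        head_.store(head + 1, std::memory_order_release);
        return value;
    }

private:
    std::array<T, Capacity> slots_{};
    // Separate cache lines so producer and consumer do not false-share.
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};
};
```

Neither side ever blocks: a full or empty queue is reported to the caller, who decides whether to retry, drop, or do other work. Queues with multiple producers or consumers need compare-and-swap loops and are considerably harder to get right.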
2. Zero-Copy Networking
Every byte copied is wasted CPU time. UltraBalancer uses:
- io_uring for async I/O on Linux
- sendfile() for efficient data transfer
- TCP splicing where possible
This reduces CPU usage by ~40% compared to a traditional approach that copies every payload through user-space buffers.
3. NUMA-Aware Architecture
On multi-socket systems, memory access patterns matter. UltraBalancer:
- Pins worker threads to specific NUMA nodes
- Allocates memory locally to each worker
- Uses per-core caching to avoid cross-NUMA traffic
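The pinning step can be sketched with the Linux affinity API. This is an illustration under assumptions, not UltraBalancer's code: the helper names are hypothetical, and the sketch pins to a single CPU rather than to a NUMA node, because mapping CPUs to nodes requires libnuma or parsing sysfs, which is left out here.

```cpp
#include <sched.h>  // sched_setaffinity, sched_getaffinity, cpu_set_t (Linux-specific)

// Hypothetical helper: restricts the calling thread to a single CPU.
// Returns true on success.
bool pin_current_thread_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // A pid of 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Hypothetical helper: returns the lowest-numbered CPU the calling thread
// is currently allowed to run on, or -1 on error.
int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
    return -1;
}
```

Pinning also helps with the memory side: Linux's default policy allocates a page on the node of the CPU that first touches it, so a worker that is pinned before it allocates and initializes its buffers tends to end up with node-local memory.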
Load Balancing Algorithms
UltraBalancer supports multiple algorithms:
enum class Algorithm {
    ROUND_ROBIN,
    LEAST_CONNECTIONS,
    WEIGHTED_ROUND_ROBIN,
    LEAST_RESPONSE_TIME,
    IP_HASH
};
The LEAST_RESPONSE_TIME algorithm tracks backend latency in real time and routes each request to the available backend that is currently responding fastest. In our tests, this alone reduced P99 latency by 60%.
Performance Numbers
Here’s what UltraBalancer can do on a single server (AMD EPYC 7763, 64 cores):
- Throughput: 1.2M RPS sustained
- Latency:
  - P50: 0.3 ms
  - P99: 0.8 ms
  - P99.9: 1.2 ms
- CPU Usage: 45% at 1M RPS
- Memory: 2 GB RAM for 100k concurrent connections
SSL/TLS Termination
UltraBalancer handles TLS termination efficiently using:
- OpenSSL with hardware acceleration (AES-NI)
- Session resumption via tickets
- OCSP stapling
- Support for TLS 1.3
We achieve ~800k TLS handshakes per second on the same hardware.
Health Checks & Failover
The health check system runs independently:
- Configurable check intervals (default: 5s)
- Multiple check types: TCP, HTTP, custom scripts
- Automatic failover in less than 100ms
- Circuit breaker pattern so a failing or recovering backend is not flooded with requests (thundering herd)
backends:
  - host: 10.0.1.10
    port: 8080
    weight: 100
    health_check:
      type: http
      path: /health
      interval: 5s
      timeout: 1s
      unhealthy_threshold: 3
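A plausible reading of `unhealthy_threshold` is "mark the backend down after this many consecutive failed checks". The sketch below implements that reading. The `HealthTracker` struct is hypothetical, and the rule that a single passing check restores the backend is an assumption, since the config shows no separate recovery threshold.

```cpp
// Hypothetical tracker for one backend's health-check results.
struct HealthTracker {
    int unhealthy_threshold = 3;   // mirrors unhealthy_threshold in the config
    int consecutive_failures = 0;
    bool healthy = true;

    // Feeds in the outcome of one health check and returns the resulting state.
    bool record(bool check_passed) {
        if (check_passed) {
            // Assumption: one success is enough to restore the backend.
            consecutive_failures = 0;
            healthy = true;
        } else if (++consecutive_failures >= unhealthy_threshold) {
            healthy = false;
        }
        return healthy;
    }
};
```

Counting consecutive failures rather than total failures keeps one dropped probe from ejecting a healthy backend. Note that with the values shown (a 5s interval and a threshold of 3), active checks alone would need about 15 seconds to mark a backend down, so the sub-100ms failover figure presumably describes rerouting once a failure is detected, or detection from live traffic.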
Configuration
UltraBalancer uses a simple YAML config:
listeners:
  - address: 0.0.0.0
    port: 80
    protocol: http
algorithm: least_response_time
backends:
  - host: 10.0.1.10
    port: 8080
  - host: 10.0.1.11
    port: 8080
  - host: 10.0.1.12
    port: 8080
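The `algorithm` value in the config has to map onto the `Algorithm` enum shown earlier. A minimal sketch of that mapping follows. Only `least_response_time` appears in the config above; the other lowercase spellings and the `parse_algorithm` name are assumptions made by analogy.

```cpp
#include <optional>
#include <string_view>

enum class Algorithm {
    ROUND_ROBIN,
    LEAST_CONNECTIONS,
    WEIGHTED_ROUND_ROBIN,
    LEAST_RESPONSE_TIME,
    IP_HASH
};

// Hypothetical helper: converts a config string to an Algorithm.
// Returns nullopt for unknown names so the caller can reject the config
// instead of silently falling back to a default.
std::optional<Algorithm> parse_algorithm(std::string_view name) {
    if (name == "round_robin") return Algorithm::ROUND_ROBIN;
    if (name == "least_connections") return Algorithm::LEAST_CONNECTIONS;
    if (name == "weighted_round_robin") return Algorithm::WEIGHTED_ROUND_ROBIN;
    if (name == "least_response_time") return Algorithm::LEAST_RESPONSE_TIME;
    if (name == "ip_hash") return Algorithm::IP_HASH;
    return std::nullopt;
}
```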
Observability
Built-in Prometheus metrics for everything:
- Request rate and latency histograms
- Backend health status
- Connection pool stats
- CPU and memory usage per worker
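For readers who have not seen it, Prometheus scrapes metrics as plain text: a `# HELP` line, a `# TYPE` line, then one sample per line with optional labels. The sketch below renders a single gauge in that format. The `render_gauge` helper and the `ultrabalancer_backend_up` metric name are made up for illustration and are not UltraBalancer's real metric names.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: renders one gauge sample with a single label in the
// Prometheus text exposition format. Assumes the label value contains no
// backslashes, double quotes, or newlines, which would need escaping.
std::string render_gauge(const std::string& name, const std::string& help,
                         const std::string& label_key,
                         const std::string& label_value, double value) {
    std::ostringstream out;
    out << "# HELP " << name << ' ' << help << '\n'
        << "# TYPE " << name << " gauge\n"
        << name << '{' << label_key << "=\"" << label_value << "\"} "
        << value << '\n';
    return out.str();
}
```

Calling `render_gauge("ultrabalancer_backend_up", "Whether the backend passed its last health check.", "backend", "10.0.1.10:8080", 1)` produces the three lines a scraper would read for that backend.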
What’s Next
Currently working on:
- HTTP/3 support with QUIC
- gRPC load balancing with connection pooling
- eBPF integration for kernel-level packet processing
- WebAssembly plugins for custom routing logic
Try It
UltraBalancer is open source and production-ready:
- GitHub: github.com/bas3line/ultrabalancer
- Docs: ultrabalancer.io
If you’re building systems that need to scale, give it a shot. Would love to hear your feedback.
Building fast systems is hard. But it’s also fun as hell.