Building UltraBalancer: A Load Balancer That Handles 1M+ RPS

The Problem

Most modern load balancers are either too complex, too slow, or both. When you’re building systems that need to handle millions of requests per second, every microsecond counts. I wanted something that could:

Handle 1M+ requests per second
Maintain sub-millisecond latency
Use minimal resources
Be simple to deploy and configure

So I built UltraBalancer.

Architecture

UltraBalancer is written in C/C++ and built around three core principles:

1. Lock-Free Data Structures

Traditional load balancers often guard shared state with mutexes, and lock contention kills performance at scale. UltraBalancer uses lock-free data structures for:

Request queuing
Backend server health tracking
Connection pooling

This eliminates lock contention and lets us scale nearly linearly with CPU cores.
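To make the idea concrete, here is a minimal sketch of lock-free backend tracking built on std::atomic. It is an illustration, not UltraBalancer's actual code: the names BackendState and BackendTable are hypothetical, and a real implementation would also need to handle backends being added or removed at runtime.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-backend state. Every field is an atomic, so readers and
// writers never take a mutex.
struct BackendState {
    std::atomic<bool> healthy{true};
    std::atomic<std::uint32_t> active_connections{0};
};

class BackendTable {
public:
    explicit BackendTable(std::size_t n) : backends_(n) { assert(n > 0); }

    // Called by the health checker thread.
    void set_healthy(std::size_t i, bool ok) {
        backends_[i].healthy.store(ok, std::memory_order_release);
    }

    // Called by any worker thread: round-robin over healthy backends.
    // Returns the backend index, or -1 if no backend is healthy.
    int pick() {
        const std::size_t n = backends_.size();
        for (std::size_t attempt = 0; attempt < n; ++attempt) {
            // fetch_add hands each caller a distinct ticket without locking.
            const std::size_t i =
                cursor_.fetch_add(1, std::memory_order_relaxed) % n;
            if (backends_[i].healthy.load(std::memory_order_acquire)) {
                backends_[i].active_connections.fetch_add(
                    1, std::memory_order_relaxed);
                return static_cast<int>(i);
            }
        }
        return -1;
    }

    // Called when a request to backend i finishes.
    void release(std::size_t i) {
        backends_[i].active_connections.fetch_sub(1, std::memory_order_relaxed);
    }

    std::uint32_t active(std::size_t i) const {
        return backends_[i].active_connections.load(std::memory_order_relaxed);
    }

private:
    std::vector<BackendState> backends_;
    std::atomic<std::size_t> cursor_{0};
};
```

Workers still touch the same cache lines for the cursor and counters, so this removes blocking rather than all sharing. Per-core counters would be the next step if that sharing shows up in profiles.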

2. Zero-Copy Networking

Every byte copied is wasted CPU time. UltraBalancer uses:

io_uring for async I/O on Linux
sendfile() for efficient data transfer
TCP splicing where possible

This reduces CPU usage by ~40% compared to a traditional copy-based I/O path.
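As an illustration of the sendfile() part, here is a small Linux-only sketch that streams a file to a descriptor without passing the bytes through a user-space buffer. The function name send_file_zero_copy is hypothetical, and it assumes a blocking out_fd. A real event-driven server would handle EAGAIN and resume later.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Sends the whole file at `path` to `out_fd` using sendfile(2), so the data
// is moved inside the kernel. Returns the number of bytes sent, or -1 on error.
ssize_t send_file_zero_copy(int out_fd, const char* path) {
    assert(path != nullptr);

    const int in_fd = open(path, O_RDONLY);
    if (in_fd < 0) return -1;

    struct stat st;
    if (fstat(in_fd, &st) != 0) {
        close(in_fd);
        return -1;
    }

    off_t offset = 0;  // sendfile advances this as it transfers data
    while (offset < st.st_size) {
        const ssize_t n = sendfile(out_fd, in_fd, &offset,
                                   static_cast<size_t>(st.st_size - offset));
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted by a signal, retry
            close(in_fd);
            return -1;
        }
        if (n == 0) break;  // file was truncated while sending
    }

    close(in_fd);
    return static_cast<ssize_t>(offset);
}
```

Note that sendfile() needs a file-like source, so it fits serving static content. Proxying socket to socket is where splice() comes in.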

3. NUMA-Aware Architecture

On multi-socket systems, memory access patterns matter. UltraBalancer:

Pins worker threads to specific NUMA nodes
Allocates memory locally to each worker
Uses per-core caching to avoid cross-NUMA traffic
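The pinning step can be sketched with plain Linux affinity calls. This only covers CPU affinity. Mapping CPUs to NUMA nodes and binding memory (for example with libnuma) is left out, and both function names are hypothetical. Under Linux's default local allocation policy, memory a pinned thread touches first is normally placed on that thread's node, which is why pinning before allocating matters.

```cpp
#include <cassert>
#include <sched.h>

// Pins the calling thread to a single CPU. Returns true on success.
bool pin_current_thread_to_cpu(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // A pid of 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Returns the lowest-numbered CPU the calling thread may run on, or -1 if the
// affinity mask cannot be read.
int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
    return -1;
}
```

Each worker would call pin_current_thread_to_cpu() once at startup, before allocating its buffers.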

Load Balancing Algorithms

UltraBalancer supports multiple algorithms:

enum class Algorithm {
    ROUND_ROBIN,
    LEAST_CONNECTIONS,
    WEIGHTED_ROUND_ROBIN,
    LEAST_RESPONSE_TIME,
    IP_HASH
};

The LEAST_RESPONSE_TIME algorithm tracks backend latency in real time and routes each request to the fastest available backend. This alone cut P99 latency by 60% in our tests.
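One common way to track "fastest backend" is an exponentially weighted moving average (EWMA) of observed latency. The sketch below assumes that approach. The class name LatencyTracker and the smoothing factor alpha are hypothetical, and the code is single-threaded for clarity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Keeps a smoothed latency estimate per backend and picks the lowest one.
class LatencyTracker {
public:
    explicit LatencyTracker(std::size_t n, double alpha = 0.2)
        : ewma_ms_(n, 0.0), seen_(n, false), alpha_(alpha) {
        assert(n > 0 && alpha > 0.0 && alpha <= 1.0);
    }

    // Record the latency of one completed request.
    void record(std::size_t backend, double latency_ms) {
        if (!seen_[backend]) {
            // First sample seeds the average.
            ewma_ms_[backend] = latency_ms;
            seen_[backend] = true;
        } else {
            ewma_ms_[backend] =
                alpha_ * latency_ms + (1.0 - alpha_) * ewma_ms_[backend];
        }
    }

    // Returns the backend with the lowest estimate. Backends with no samples
    // yet have an estimate of 0, so they get tried first.
    std::size_t pick() const {
        std::size_t best = 0;
        for (std::size_t i = 1; i < ewma_ms_.size(); ++i) {
            if (ewma_ms_[i] < ewma_ms_[best]) best = i;
        }
        return best;
    }

    double ewma(std::size_t backend) const { return ewma_ms_[backend]; }

private:
    std::vector<double> ewma_ms_;
    std::vector<bool> seen_;
    double alpha_;
};
```

A higher alpha reacts faster to a backend slowing down but is noisier. A lower alpha is smoother but slower to notice.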

Performance Numbers

Here’s what UltraBalancer can do on a single server (AMD EPYC 7763, 64 cores):

Throughput: 1.2M RPS sustained
Latency:
- P50: 0.3ms
- P99: 0.8ms
- P99.9: 1.2ms
CPU Usage: 45% at 1M RPS
Memory: 2GB RAM for 100k concurrent connections

SSL/TLS Termination

UltraBalancer handles TLS termination efficiently using:

OpenSSL with hardware acceleration (AES-NI)
Session resumption via tickets
OCSP stapling
Support for TLS 1.3

We achieve ~800k TLS handshakes per second on the same hardware.

Health Checks & Failover

The health check system runs independently:

Configurable check intervals (default: 5s)
Multiple check types: TCP, HTTP, custom scripts
Automatic failover in less than 100ms
Circuit breaker pattern to avoid thundering herd

backends:
  - host: 10.0.1.10
    port: 8080
    weight: 100
    health_check:
      type: http
      path: /health
      interval: 5s
      timeout: 1s
      unhealthy_threshold: 3
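The unhealthy_threshold setting can be read as "mark the backend down after this many consecutive failed checks". Here is a minimal sketch of that logic. HealthState is a hypothetical name, and recovering on the first successful check is my assumption, since the config above does not show a separate recovery threshold.

```cpp
#include <cassert>

// Tracks consecutive health check failures for one backend.
class HealthState {
public:
    explicit HealthState(int unhealthy_threshold)
        : threshold_(unhealthy_threshold) {
        assert(unhealthy_threshold > 0);
    }

    // Feed in the result of one health check.
    // Returns whether the backend is considered healthy afterwards.
    bool on_check(bool success) {
        if (success) {
            failures_ = 0;
            healthy_ = true;  // assumption: a single success restores it
        } else if (++failures_ >= threshold_) {
            healthy_ = false;
        }
        return healthy_;
    }

    bool healthy() const { return healthy_; }

private:
    int threshold_;
    int failures_ = 0;
    bool healthy_ = true;
};
```

Requiring consecutive failures keeps one dropped probe from pulling a good backend out of rotation.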

Configuration

UltraBalancer uses a simple YAML config:

listeners:
  - address: 0.0.0.0
    port: 80
    protocol: http
    algorithm: least_response_time

backends:
  - host: 10.0.1.10
    port: 8080
  - host: 10.0.1.11
    port: 8080
  - host: 10.0.1.12
    port: 8080
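The algorithm value in the config has to map onto the Algorithm enum somehow. Here is one way that could look. Only least_response_time appears in the config above, so the other lowercase names are assumed by analogy, and both function names are hypothetical.

```cpp
#include <cassert>
#include <initializer_list>
#include <optional>
#include <string_view>

enum class Algorithm {
    ROUND_ROBIN,
    LEAST_CONNECTIONS,
    WEIGHTED_ROUND_ROBIN,
    LEAST_RESPONSE_TIME,
    IP_HASH
};

// Name used in the YAML config for each algorithm (assumed spelling).
const char* to_config_name(Algorithm a) {
    switch (a) {
        case Algorithm::ROUND_ROBIN:          return "round_robin";
        case Algorithm::LEAST_CONNECTIONS:    return "least_connections";
        case Algorithm::WEIGHTED_ROUND_ROBIN: return "weighted_round_robin";
        case Algorithm::LEAST_RESPONSE_TIME:  return "least_response_time";
        case Algorithm::IP_HASH:              return "ip_hash";
    }
    assert(false && "unhandled Algorithm value");
    return "";
}

// Parses a config string. Returns std::nullopt for unknown names so the
// caller can reject the config instead of silently falling back.
std::optional<Algorithm> parse_algorithm(std::string_view name) {
    for (Algorithm a : {Algorithm::ROUND_ROBIN, Algorithm::LEAST_CONNECTIONS,
                        Algorithm::WEIGHTED_ROUND_ROBIN,
                        Algorithm::LEAST_RESPONSE_TIME, Algorithm::IP_HASH}) {
        if (name == to_config_name(a)) return a;
    }
    return std::nullopt;
}
```

Keeping the names in a single switch means the parser and any config dump stay in sync when a new algorithm is added.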

Observability

Built-in Prometheus metrics for everything:

Request rate and latency histograms
Backend health status
Connection pool stats
CPU and memory usage per worker
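For a sense of what a scrape returns, here is a small sketch that renders two metrics in the Prometheus text exposition format. The metric names ultrabalancer_requests_total and ultrabalancer_backend_up are made up for this example and may not match what UltraBalancer actually exports.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Renders one counter with its HELP and TYPE header lines.
std::string render_counter(const std::string& name, const std::string& help,
                           std::uint64_t value) {
    assert(!name.empty());
    std::string out;
    out += "# HELP " + name + " " + help + "\n";
    out += "# TYPE " + name + " counter\n";
    out += name + " " + std::to_string(value) + "\n";
    return out;
}

// Renders one sample line of a labeled gauge: 1 if the backend is up, else 0.
// The HELP and TYPE lines are emitted once per metric, not once per backend,
// so they are left out here.
std::string render_backend_up(const std::string& backend, bool up) {
    assert(!backend.empty());
    return "ultrabalancer_backend_up{backend=\"" + backend + "\"} " +
           (up ? "1" : "0") + "\n";
}
```

Calling render_counter("ultrabalancer_requests_total", "Total requests handled.", 42) produces a three-line block ending in "ultrabalancer_requests_total 42".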

What’s Next

Currently working on:

HTTP/3 support with QUIC
gRPC load balancing with connection pooling
eBPF integration for kernel-level packet processing
WebAssembly plugins for custom routing logic

Try It

UltraBalancer is open source and production-ready:

GitHub: github.com/bas3line/ultrabalancer
Docs: ultrabalancer.io

If you’re building systems that need to scale, give it a shot. Would love to hear your feedback.

Building fast systems is hard. But it’s also fun as hell.