Building UltraBalancer: A Load Balancer That Handles 1M+ RPS

· 3 min read ·
C/C++ · Systems Programming · Performance · Load Balancing

The Problem

Most modern load balancers are either too complex, too slow, or both. When you’re building systems that need to handle millions of requests per second, every microsecond counts. I wanted something that could:

  • Handle 1M+ requests per second
  • Maintain sub-millisecond latency
  • Use minimal resources
  • Be simple to deploy and configure

So I built UltraBalancer.

Architecture

UltraBalancer is written in C/C++ and built around three core principles:

1. Lock-Free Data Structures

Many load balancers guard shared state with mutexes, which become a bottleneck once many cores contend for the same lock. UltraBalancer uses lock-free data structures for:

  • Request queuing
  • Backend server health tracking
  • Connection pooling

This removes lock contention from the hot path and lets throughput scale close to linearly with CPU cores.
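To make the idea concrete, here is a minimal sketch of one common lock-free building block: a single-producer/single-consumer ring buffer built on std::atomic. The name SpscQueue and its interface are hypothetical and chosen for illustration; this is not UltraBalancer's actual queue, which may use a different design (for example, a multi-producer queue).

```cpp
#include <atomic>
#include <cstddef>
#include <optional>

// Hypothetical single-producer/single-consumer ring buffer (illustration only).
// Exactly one thread may call try_push and exactly one may call try_pop.
template <typename T, std::size_t Capacity>
class SpscQueue {
    static_assert(Capacity > 0 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    // Producer side. Returns false when the queue is full.
    bool try_push(const T& value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (tail - head == Capacity) return false;
        slots_[tail & (Capacity - 1)] = value;
        // Release: the slot write becomes visible before the new tail does.
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

    // Consumer side. Returns std::nullopt when the queue is empty.
    std::optional<T> try_pop() {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head == tail) return std::nullopt;
        T value = slots_[head & (Capacity - 1)];
        // Release: the slot read finishes before the producer may reuse it.
        head_.store(head + 1, std::memory_order_release);
        return value;
    }

private:
    T slots_[Capacity]{};
    // Separate cache lines so producer and consumer do not false-share.
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};
};
```

Neither side ever blocks: a full or empty queue is reported to the caller, which can retry, drop the item, or apply backpressure.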

2. Zero-Copy Networking

Every byte copied is wasted CPU time. UltraBalancer uses:

  • io_uring for async I/O on Linux
  • sendfile() for efficient data transfer
  • TCP splicing where possible

This reduces CPU usage by ~40% compared with copying each payload through user-space buffers.

3. NUMA-Aware Architecture

On multi-socket systems, memory access patterns matter. UltraBalancer:

  • Pins worker threads to specific NUMA nodes
  • Allocates memory locally to each worker
  • Uses per-core caching to avoid cross-NUMA traffic
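The pinning step can be sketched with the Linux affinity API. This is only the CPU-pinning half of the story and rests on assumptions: the helper names are hypothetical, and mapping CPUs to NUMA nodes or allocating on a specific node (for example with libnuma's numa_alloc_onnode) is not shown. On Linux, memory first touched by a pinned thread is normally placed on that thread's local node under the default policy, which is what makes pinning worthwhile.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // CPU_SET and sched_setaffinity are GNU/Linux extensions
#endif
#include <sched.h>

// Hypothetical helper: restricts the calling thread to a single CPU.
// Returns true on success.
bool pin_current_thread_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // A pid of 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Hypothetical helper: returns the lowest-numbered CPU the calling thread is
// allowed to run on, or -1 if the affinity mask cannot be read.
int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
    return -1;
}
```

A worker would call pin_current_thread_to_cpu once at startup, before allocating its per-worker buffers.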

Load Balancing Algorithms

UltraBalancer supports multiple algorithms:

enum class Algorithm {
    ROUND_ROBIN,
    LEAST_CONNECTIONS,
    WEIGHTED_ROUND_ROBIN,
    LEAST_RESPONSE_TIME,
    IP_HASH
};

The LEAST_RESPONSE_TIME algorithm tracks backend latency in real time and routes each request to the fastest available backend. This alone improved P99 latency by 60% in our tests.
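One plausible way to implement least-response-time routing is to keep an exponentially weighted moving average (EWMA) of each backend's latency and pick the healthy backend with the lowest value. The sketch below assumes that approach; the class name LeastResponseTimePicker, the smoothing factor of 0.2, and the single-threaded design are illustrative assumptions, not details taken from UltraBalancer.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical per-backend state (illustration only).
struct BackendStats {
    double ewma_ms = 0.0;     // smoothed response time in milliseconds
    bool has_sample = false;  // false until the first latency is recorded
    bool healthy = true;
};

class LeastResponseTimePicker {
public:
    explicit LeastResponseTimePicker(std::size_t backend_count, double alpha = 0.2)
        : stats_(backend_count), alpha_(alpha) {}

    // Folds a new latency sample into the backend's moving average.
    void record(std::size_t backend, double latency_ms) {
        BackendStats& s = stats_[backend];
        s.ewma_ms = s.has_sample
                        ? alpha_ * latency_ms + (1.0 - alpha_) * s.ewma_ms
                        : latency_ms;
        s.has_sample = true;
    }

    void set_healthy(std::size_t backend, bool healthy) {
        stats_[backend].healthy = healthy;
    }

    // Returns the index of the healthy backend with the lowest average.
    // Backends without samples count as 0 ms so that they get tried.
    // Returns size() when no backend is healthy.
    std::size_t pick() const {
        std::size_t best = stats_.size();
        double best_ms = std::numeric_limits<double>::infinity();
        for (std::size_t i = 0; i < stats_.size(); ++i) {
            if (!stats_[i].healthy) continue;
            if (stats_[i].ewma_ms < best_ms) {
                best_ms = stats_[i].ewma_ms;
                best = i;
            }
        }
        return best;
    }

    std::size_t size() const { return stats_.size(); }

private:
    std::vector<BackendStats> stats_;
    double alpha_;
};
```

A higher alpha reacts faster to latency spikes but is noisier. A concurrent version would need atomics or per-worker copies of the stats.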

Performance Numbers

Here’s what UltraBalancer can do on a single server (AMD EPYC 7763, 64 cores):

  • Throughput: 1.2M RPS sustained
  • Latency:
    • P50: 0.3ms
    • P99: 0.8ms
    • P99.9: 1.2ms
  • CPU Usage: 45% at 1M RPS
  • Memory: 2GB RAM for 100k concurrent connections

SSL/TLS Termination

UltraBalancer handles TLS termination efficiently using:

  • OpenSSL with hardware acceleration (AES-NI)
  • Session resumption via tickets
  • OCSP stapling
  • Support for TLS 1.3

We achieve ~800k TLS handshakes per second on the same hardware.

Health Checks & Failover

The health check system runs independently:

  • Configurable check intervals (default: 5s)
  • Multiple check types: TCP, HTTP, custom scripts
  • Automatic failover in less than 100ms
  • Circuit breaker pattern to avoid a thundering herd

Health checks are configured per backend:

backends:
  - host: 10.0.1.10
    port: 8080
    weight: 100
    health_check:
      type: http
      path: /health
      interval: 5s
      timeout: 1s
      unhealthy_threshold: 3
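The unhealthy_threshold setting suggests that a backend is only marked down after several consecutive failed checks. Below is a minimal sketch of that logic under that assumption. The class name HealthTracker and the healthy_threshold parameter are hypothetical; the config above only shows unhealthy_threshold.

```cpp
#include <cstdint>

// Hypothetical consecutive-failure tracker for one backend (illustration only).
class HealthTracker {
public:
    explicit HealthTracker(std::uint32_t unhealthy_threshold,
                           std::uint32_t healthy_threshold = 1)
        : unhealthy_threshold_(unhealthy_threshold),
          healthy_threshold_(healthy_threshold) {}

    // Feeds in the result of one health check.
    void on_check(bool success) {
        if (success) {
            failures_ = 0;  // any success breaks the failure streak
            if (successes_ < healthy_threshold_) ++successes_;
            if (successes_ >= healthy_threshold_) healthy_ = true;
        } else {
            successes_ = 0;  // any failure breaks the success streak
            if (failures_ < unhealthy_threshold_) ++failures_;
            if (failures_ >= unhealthy_threshold_) healthy_ = false;
        }
    }

    bool healthy() const { return healthy_; }

private:
    std::uint32_t unhealthy_threshold_;
    std::uint32_t healthy_threshold_;
    std::uint32_t failures_ = 0;
    std::uint32_t successes_ = 0;
    bool healthy_ = true;  // backends start out healthy in this sketch
};
```

With unhealthy_threshold set to 3, a single timed-out check does not remove a backend from rotation; three in a row do.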

Configuration

UltraBalancer uses a simple YAML config:

listeners:
  - address: 0.0.0.0
    port: 80
    protocol: http
    algorithm: least_response_time
    
backends:
  - host: 10.0.1.10
    port: 8080
  - host: 10.0.1.11
    port: 8080
  - host: 10.0.1.12
    port: 8080

Observability

Built-in Prometheus metrics for everything:

  • Request rate and latency histograms
  • Backend health status
  • Connection pool stats
  • CPU and memory usage per worker
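For readers unfamiliar with how a latency histogram reaches Prometheus, here is a small sketch that records observations and renders them in the Prometheus text exposition format, where bucket counts are cumulative. The class name, the bucket bounds, and any metric name passed in are hypothetical; UltraBalancer's real metric names and buckets are not specified here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical bucket upper bounds in seconds, with their label text.
struct Bound { double upper; const char* label; };
constexpr std::array<Bound, 3> kBounds{{
    {0.0005, "0.0005"}, {0.001, "0.001"}, {0.005, "0.005"}}};

class LatencyHistogram {
public:
    explicit LatencyHistogram(std::string name) : name_(std::move(name)) {}

    // Counts the observation in the first bucket whose bound is >= seconds.
    void observe(double seconds) {
        std::size_t i = 0;
        while (i < kBounds.size() && seconds > kBounds[i].upper) ++i;
        ++counts_[i];  // index kBounds.size() is the overflow (+Inf) bucket
        sum_ += seconds;
        ++total_;
    }

    // Renders the histogram in Prometheus text format (cumulative buckets).
    std::string render() const {
        std::ostringstream out;
        out << "# TYPE " << name_ << " histogram\n";
        std::uint64_t cumulative = 0;
        for (std::size_t i = 0; i < kBounds.size(); ++i) {
            cumulative += counts_[i];
            out << name_ << "_bucket{le=\"" << kBounds[i].label << "\"} "
                << cumulative << "\n";
        }
        out << name_ << "_bucket{le=\"+Inf\"} " << total_ << "\n";
        out << name_ << "_sum " << sum_ << "\n";
        out << name_ << "_count " << total_ << "\n";
        return out.str();
    }

private:
    std::string name_;
    std::array<std::uint64_t, kBounds.size() + 1> counts_{};
    double sum_ = 0.0;
    std::uint64_t total_ = 0;
};
```

Serving the string returned by render() from an HTTP endpoint is enough for a Prometheus server to scrape it.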

What’s Next

Currently working on:

  • HTTP/3 support with QUIC
  • gRPC load balancing with connection pooling
  • eBPF integration for kernel-level packet processing
  • WebAssembly plugins for custom routing logic

Try It

UltraBalancer is open source and production-ready:

If you’re building systems that need to scale, give it a shot. Would love to hear your feedback.


Building fast systems is hard. But it’s also fun as hell.