
Building a Production WebRTC SFU in Rust: A Deep Technical Dive

· 12 min read ·
rust · webrtc · real-time · video-calling · systems-engineering

Building real-time video calling infrastructure is one of the most challenging problems in systems engineering. You’re simultaneously battling network latency, packet loss, clock drift, codec quirks, and the fundamental laws of physics. Over the past months, I’ve been building Surf Media Engine - a production-grade WebRTC SFU (Selective Forwarding Unit) in Rust. Let me take you deep into the architecture.

Surf Media Engine Architecture
Complete architecture of Surf Media Engine - from WebRTC signaling to chunked recording

Why Build an SFU?

The classic approaches to multi-party video calling are:

  1. Mesh: Everyone sends to everyone. O(n²) connections. Dies at 4+ participants.
  2. MCU (Mixing): Server decodes, composites, re-encodes. CPU-intensive, adds latency.
  3. SFU (Selective Forwarding): Server receives streams and forwards them without transcoding. The sweet spot.

An SFU is essentially a smart packet router. It receives RTP packets from publishers and selectively forwards them to subscribers. No encoding, no decoding, just intelligent routing. This is why companies like Zoom, Discord, and Google Meet all use SFU architectures.

The Rust Advantage

Why Rust for a media server?

// Zero-cost abstractions let us write expressive code
// that compiles to bare-metal performance
pub struct MediaEngine {
    rooms: DashMap<String, MediaRoom>,
    participants: DashMap<String, MediaParticipant>,
    connections: DashMap<String, Arc<RwLock<ParticipantConnection>>>,
    tracks: DashMap<String, MediaTrack>,
    active_tracks: Arc<DashMap<String, Arc<ActiveTrack>>>,
    event_tx: broadcast::Sender<MediaEvent>,
    ai_client: Arc<AiServiceClient>,
    cache: Option<Arc<CacheManager>>,
}
  • Memory safety without GC pauses: No stop-the-world garbage collection during critical media paths
  • Fearless concurrency: Rust’s ownership model prevents data races at compile time
  • Zero-cost abstractions: High-level APIs that compile to efficient machine code
  • Predictable latency: No runtime overhead means consistent sub-50ms forwarding

The combination of Tokio for async I/O and webrtc-rs for the WebRTC stack gives us a foundation that can handle thousands of concurrent streams with minimal resource usage.

Architecture Deep Dive

The SFU Router: Heart of the System

The SfuRouter is where the magic happens. It maintains the routing table that maps tracks to subscribers:

pub struct SfuRouter {
    config: RouterConfig,
    room_id: String,
    // Lock-free concurrent hashmap for track channels
    publishers: DashMap<String, Arc<RwLock<Publisher>>>,
    subscribers: DashMap<String, Arc<RwLock<Subscriber>>>,
    routes: DashMap<String, RouteEntry>,
    // Each track gets a broadcast channel
    track_channels: DashMap<String, broadcast::Sender<Bytes>>,
    event_tx: broadcast::Sender<RouterEvent>,
    // Real-time audio level tracking for dominant speaker detection
    audio_levels: DashMap<String, f32>,
}

The key insight is using Tokio’s broadcast channels for track forwarding. When a publisher sends a packet:

pub fn route_packet(&self, track_id: &str, packet: Bytes) -> usize {
    if let Some(tx) = self.track_channels.get(track_id) {
        let count = tx.receiver_count();
        if count > 0 {
            // This is a single-producer, multi-consumer broadcast
            // Zero-copy for all subscribers
            let _ = tx.send(packet.clone());

            let _ = self.event_tx.send(RouterEvent::PacketRouted {
                source: track_id.to_string(),
                targets: count,
                bytes: packet.len(),
            });
            return count;
        }
    }
    0
}

This keeps the publisher's send path O(1) regardless of subscriber count. The broadcast channel writes each packet once into its internal ring buffer, and every subscriber's receiver reads the same reference-counted Bytes, so the fan-out never copies the payload.
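
On the receiving side, each subscriber runs its own forwarding task that drains a receiver cloned from the track's channel. The sketch below is illustrative rather than the engine's actual code; forward_to_subscriber and the write_rtp callback are hypothetical names standing in for whatever pushes packets onto that subscriber's peer connection.

// Illustrative subscriber-side loop: one task per (subscriber, track).
async fn forward_to_subscriber(
    mut rx: tokio::sync::broadcast::Receiver<bytes::Bytes>,
    mut write_rtp: impl FnMut(bytes::Bytes),
) {
    use tokio::sync::broadcast::error::RecvError;

    loop {
        match rx.recv().await {
            // Cloning `Bytes` only bumps a refcount, so no payload copies here
            Ok(packet) => write_rtp(packet),
            // A slow subscriber drops packets instead of stalling the publisher
            Err(RecvError::Lagged(skipped)) => {
                eprintln!("subscriber lagged, skipped {skipped} packets");
            }
            // Publisher went away; tear down this forwarding task
            Err(RecvError::Closed) => break,
        }
    }
}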

SFU Routing Architecture
N:1 routing pattern - each participant publishes once, subscribes to (N-1) streams

The N:1 Routing Pattern

In a room with N participants, each user:

  • Publishes: Their audio and video once to the SFU
  • Subscribes: To (N-1) other participants’ streams

This is dramatically more efficient than mesh:

Participants    Mesh Connections    SFU Connections
4               12                  8
10              90                  20
50              2,450               100
100             9,900               200

The SFU approach scales linearly instead of quadratically.
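
For completeness, here is the arithmetic behind that table, counting directional links: a full mesh needs n(n-1) of them, while the SFU needs one publish leg plus one subscribe leg per participant.

// Connection counts from the table above
fn mesh_connections(n: u64) -> u64 {
    n * (n - 1) // every participant sends to every other participant
}

fn sfu_connections(n: u64) -> u64 {
    2 * n // each participant: one publish leg, one subscribe leg to the SFU
}

fn main() {
    for n in [4, 10, 50, 100] {
        println!(
            "{:>4} participants: mesh = {:>5}, SFU = {:>3}",
            n,
            mesh_connections(n),
            sfu_connections(n)
        );
    }
}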

Renegotiation: The WebRTC Dance

When a new participant joins or a track is added, we need to update all peer connections. This is called renegotiation:

async fn handle_track(
    &self,
    track: Arc<TrackRemote>,
    _receiver: Arc<RTCRtpReceiver>,
    room_id: &str,
    user_id: &str,
) -> Result<()> {
    // (track_id extraction and ActiveTrack construction elided)

    // Create a local track for SFU forwarding
    let local_track = Arc::new(TrackLocalStaticRTP::new(
        RTCRtpCodecCapability { /* codec config */ },
        format!("{}_{}", user_id, track_id),
        format!("stream_{}", user_id),
    ));

    // Store for forwarding
    self.active_tracks.insert(track_id.clone(), active_track);

    // Add to ALL OTHER participants and trigger renegotiation
    for conn_ref in self.connections.iter() {
        if conn_ref.key() == user_id { continue; }

        // Write lock: we may need to flag pending_renegotiation below
        let mut conn = conn_ref.value().write().await;
        if conn.room_id == room_id {
            // `pc` is this subscriber's RTCPeerConnection (binding elided)
            // Add track to peer connection
            pc.add_track(local_track.clone()).await?;

            // Only renegotiate if in stable state
            if pc.signaling_state() == RTCSignalingState::Stable {
                let offer = pc.create_offer(None).await?;
                pc.set_local_description(offer.clone()).await?;

                // Send renegotiation offer via WebSocket
                let _ = self.event_tx.send(MediaEvent::RenegotiationNeeded {
                    room_id: room_id.to_string(),
                    user_id: conn_ref.key().to_string(),
                    sdp: offer.sdp,
                });
            } else {
                // Queue for later - avoid m-line ordering issues
                conn.pending_renegotiation = true;
            }
        }
    }

    Ok(())
}

The tricky part is handling the signaling state machine. You can’t create a new offer while one is pending - this causes m-line ordering mismatches that break the connection. We track pending_renegotiation and process it after the current negotiation completes.
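
One way to drain that queue, sketched here rather than copied from the engine, is to re-check the flag whenever a negotiation completes (for example, right after applying the remote answer). The peer_connection and user_id field names on ParticipantConnection are assumptions made for the example:

// Sketch: called once a remote answer is applied and the connection is stable again.
async fn drain_pending_renegotiation(
    conn: &Arc<RwLock<ParticipantConnection>>,
    event_tx: &broadcast::Sender<MediaEvent>,
) -> Result<()> {
    let mut conn = conn.write().await;
    if conn.pending_renegotiation
        && conn.peer_connection.signaling_state() == RTCSignalingState::Stable
    {
        conn.pending_renegotiation = false;

        // Safe to make a fresh offer now that the previous negotiation is done
        let offer = conn.peer_connection.create_offer(None).await?;
        conn.peer_connection.set_local_description(offer.clone()).await?;

        let _ = event_tx.send(MediaEvent::RenegotiationNeeded {
            room_id: conn.room_id.clone(),
            user_id: conn.user_id.clone(),
            sdp: offer.sdp,
        });
    }
    Ok(())
}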

Adaptive Jitter Buffer: Fighting Network Chaos

Networks are unreliable. Packets arrive out of order, with variable delays, or not at all. The jitter buffer smooths this chaos into a consistent playout stream.

Jitter Buffer Algorithm
Adaptive jitter buffer transforms chaotic network arrivals into smooth playout

The Data Structure

pub struct JitterBuffer {
    config: JitterBufferConfig,
    // BTreeMap keeps packets sorted by sequence number
    packets: BTreeMap<u16, JitterPacket>,
    next_sequence: Option<u16>,
    last_played_timestamp: Option<u32>,
    last_played_time: Option<Instant>,
    // Exponentially weighted moving average
    jitter_estimate: f32,
    delay_estimate: f32,
    stats: JitterStats,
    // Track NACKs to avoid duplicate requests
    nack_list: VecDeque<(u16, Instant)>,
    // Detect duplicates efficiently
    sequence_history: VecDeque<u16>,
}

Using a BTreeMap for packets is crucial - it maintains sort order by sequence number with O(log n) insertion, letting us efficiently find gaps and iterate in order.
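
Insertion is the mirror image of playout. Here is a minimal sketch of the push path, assuming JitterPacket carries the payload plus an arrival timestamp and that the stats struct has received/duplicate counters (names illustrative, not the engine's exact API):

impl JitterBuffer {
    /// Sketch of packet insertion; field and counter names are illustrative.
    pub fn push(&mut self, sequence: u16, timestamp: u32, payload: Bytes) {
        // Cheap duplicate rejection against the recent-sequence history
        if self.sequence_history.contains(&sequence) {
            self.stats.duplicates += 1;
            return;
        }
        self.sequence_history.push_back(sequence);
        if self.sequence_history.len() > 512 {
            self.sequence_history.pop_front();
        }

        // The first packet seeds the playout cursor
        if self.next_sequence.is_none() {
            self.next_sequence = Some(sequence);
        }

        // BTreeMap keeps everything ordered by sequence number
        self.packets.insert(sequence, JitterPacket {
            sequence,
            timestamp,
            payload,
            received_at: Instant::now(),
        });
        self.stats.packets_received += 1;
    }
}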

The Playout Decision Algorithm

Every playout interval (typically 20ms for audio), we make a decision:

pub fn decision(&mut self) -> PlayoutDecision {
    let next_seq = match self.next_sequence {
        Some(s) => s,
        None => return PlayoutDecision::Wait,
    };

    // Do we have the next packet?
    if let Some(packet) = self.packets.get(&next_seq) {
        let buffered_time = packet.received_at.elapsed().as_millis() as u32;

        // Has it been buffered long enough?
        if buffered_time >= self.delay_estimate as u32 {
            return PlayoutDecision::Play(next_seq);
        } else {
            return PlayoutDecision::Wait;
        }
    }

    // Packet is missing - what do we do?

    // Check if buffer is getting too full (skip strategy)
    let buffer_time = self.buffer_time_ms();
    if buffer_time > self.config.max_delay_ms {
        // Skip ahead to prevent buffer overflow
        if let Some((&first_seq, _)) = self.packets.iter().next() {
            self.next_sequence = Some(first_seq);
            self.stats.packets_lost += 1;
            return PlayoutDecision::Skip(next_seq);
        }
    }

    // Try NACK (request retransmission)
    if self.config.nack_enabled {
        let already_nacked = self.nack_list.iter().any(|(s, _)| *s == next_seq);
        if !already_nacked {
            self.nack_list.push_back((next_seq, Instant::now()));
            self.stats.nacks_sent += 1;
            return PlayoutDecision::Nack(next_seq);
        }

        // NACK timeout - give up and use PLC
        let nack_timeout = Duration::from_millis(50);
        if let Some((_, nack_time)) = self.nack_list.iter().find(|(s, _)| *s == next_seq) {
            if nack_time.elapsed() > nack_timeout {
                self.next_sequence = Some(next_seq.wrapping_add(1));
                self.stats.packets_lost += 1;
                self.stats.plc_frames += 1;
                return PlayoutDecision::Plc; // Packet Loss Concealment
            }
        }
    }

    PlayoutDecision::Wait
}

The decisions are:

  • Play: Packet available, playout now
  • Wait: Still buffering, hold on
  • Skip: Buffer overflow, jump ahead
  • Nack: Request retransmission
  • PLC: Generate synthetic audio to mask the gap
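
Driving the buffer then looks roughly like this; play_out, send_nack, conceal, and the pop helper are placeholders for the real audio sink and RTCP path, not the engine's actual names:

// Illustrative playout loop at a 20ms audio cadence
let mut tick = tokio::time::interval(Duration::from_millis(20));
loop {
    tick.tick().await;
    match buffer.decision() {
        PlayoutDecision::Play(seq) => {
            if let Some(packet) = buffer.pop(seq) {
                play_out(packet); // hand the frame to the decoder / mixer
            }
        }
        PlayoutDecision::Nack(seq) => send_nack(seq), // RTCP NACK back to the publisher
        PlayoutDecision::Plc => conceal(),            // synthesize audio to mask the gap
        PlayoutDecision::Skip(_) | PlayoutDecision::Wait => {} // jumped ahead or still buffering
    }
}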

Adaptive Delay Estimation

The jitter buffer dynamically adjusts its delay target based on observed network conditions:

fn adapt_delay(&mut self) {
    // Target delay = 2x estimated jitter + minimum delay
    // This gives us a ~95% packet delivery probability
    let target = self.jitter_estimate * 2.0 + self.config.min_delay_ms as f32;
    let target = target.clamp(
        self.config.min_delay_ms as f32,
        self.config.max_delay_ms as f32,
    );

    // Smooth transition to avoid audible artifacts
    self.delay_estimate = self.delay_estimate * 0.95 + target * 0.05;
}

The jitter estimation uses an exponentially weighted moving average (EWMA):

// RFC 3550 jitter calculation
// Alpha = 1/16 = 0.0625 (classic RTP jitter filter)
self.jitter_estimate = self.jitter_estimate * 0.9375 + jitter * 0.0625;

This filter responds to network changes while filtering out noise. The 1/16 coefficient is the standard from RFC 3550 and provides good responsiveness without overreacting to single outlier packets.
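
For reference, the full interarrival-jitter update from RFC 3550 section 6.4.1 looks like this. It assumes the buffer also remembers the previous packet's RTP timestamp and arrival time (fields not shown in the struct above), and clock_rate converts RTP timestamp units into milliseconds:

impl JitterBuffer {
    // RFC 3550 §6.4.1 interarrival jitter, sketched against assumed fields
    fn update_jitter(&mut self, rtp_timestamp: u32, arrival: Instant, clock_rate: u32) {
        if let (Some(prev_ts), Some(prev_arrival)) = (self.last_rtp_timestamp, self.last_arrival) {
            // How far apart the packets arrived vs. how far apart they were sent (ms)
            let arrival_delta_ms = arrival.duration_since(prev_arrival).as_secs_f32() * 1000.0;
            let send_delta_ms =
                rtp_timestamp.wrapping_sub(prev_ts) as f32 * 1000.0 / clock_rate as f32;
            let d = (arrival_delta_ms - send_delta_ms).abs();

            // Same EWMA as above: gain of 1/16
            self.jitter_estimate += (d - self.jitter_estimate) / 16.0;
        }
        self.last_rtp_timestamp = Some(rtp_timestamp);
        self.last_arrival = Some(arrival);
    }
}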

Chunked Recording Pipeline

Recording multi-participant calls requires careful engineering. We can’t just dump everything to disk - we need streaming uploads to handle hour-long calls without running out of local storage.

The Architecture

pub struct RecordingService {
    config: RecordingConfig,
    storage: Option<Arc<S3Storage>>,
    muxer: Arc<WebMMuxer>,
    uploader: Arc<ChunkUploader>,
    recordings: DashMap<String, Arc<ChunkedRecording>>,
    upload_tx: mpsc::Sender<ChunkUploadTask>,
}

The pipeline:

  1. Media Events → Captured from broadcast channel
  2. ChunkedRecording → Buffers packets, triggers chunk creation every 10 seconds
  3. WebMMuxer → Muxes audio/video into WebM container format
  4. ChunkUploader → Compresses and uploads to R2/S3 with retry logic
  5. RecordingMerger → Background task that concatenates chunks into final file
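
Step 2 is the part with no library to lean on. A heavily simplified sketch of the 10-second rotation, with field and method names invented purely for illustration:

impl ChunkedRecording {
    // Illustrative chunk rotation; names are invented for the sketch.
    async fn maybe_rotate_chunk(&self, upload_tx: &mpsc::Sender<ChunkUploadTask>) -> Result<()> {
        let mut buf = self.current_chunk.lock().await;
        if buf.started_at.elapsed() < Duration::from_secs(10) {
            return Ok(()); // current chunk still filling up
        }

        // Mux the buffered audio/video into a standalone WebM segment
        let webm_bytes = self.muxer.mux_chunk(&buf.audio_packets, &buf.video_packets)?;
        let task = ChunkUploadTask {
            recording_id: self.id.clone(),
            chunk_index: buf.index,
            data: webm_bytes,
        };
        buf.reset(); // start buffering the next 10 seconds

        // Bounded channel: if uploads fall behind, this await applies backpressure
        upload_tx.send(task).await?;
        Ok(())
    }
}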

Zero-Copy Event Processing

// Subscribe to media events
let mut event_rx = state.media_engine.subscribe();

tokio::spawn(async move {
    loop {
        match event_rx.recv().await {
            Ok(MediaEvent::AudioData { room_id, user_id, data }) => {
                if let Some(recording) = recordings.get(&room_id) {
                    let packet = AudioPacket {
                        user_id,
                        data,  // Bytes is reference-counted, no copy
                        timestamp: chrono::Utc::now().timestamp_millis() as u64,
                        sequence: 0,
                    };
                    // The spawned task returns (), so handle errors here instead of `?`
                    if let Err(e) = recording.add_audio_packet(packet).await {
                        error!("Failed to buffer audio packet: {}", e);
                    }
                }
            }
            Ok(MediaEvent::VideoData { room_id, user_id, data, is_keyframe }) => {
                // Similar handling for video
            }
            // ...
        }
    }
});

The Bytes type from the bytes crate gives us reference-counted, immutable byte buffers. This means passing media data around doesn’t involve copying - we’re just incrementing a reference count.
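
A quick standalone illustration of why this matters on a hot path:

use bytes::Bytes;

fn main() {
    let frame = Bytes::from(vec![0u8; 1200]); // one RTP-sized allocation
    let for_recording = frame.clone();        // refcount bump, no memcpy
    let payload = frame.slice(12..);          // zero-copy view past a 12-byte RTP header
    assert_eq!(for_recording.len(), 1200);
    assert_eq!(payload.len(), 1188);
}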

Async Upload with Backpressure

pub struct ChunkUploader {
    config: ChunkUploaderConfig,
    storage: Arc<S3Storage>,
    compressor: VideoCompressor,
}

impl ChunkUploader {
    pub fn start(self: Arc<Self>, mut rx: mpsc::Receiver<ChunkUploadTask>) {
        tokio::spawn(async move {
            // Semaphore limits concurrent uploads
            let semaphore = Arc::new(Semaphore::new(self.config.concurrency));

            while let Some(task) = rx.recv().await {
                let permit = semaphore.clone().acquire_owned().await.unwrap();
                let uploader = self.clone();

                tokio::spawn(async move {
                    if let Err(e) = uploader.upload_chunk(task).await {
                        error!("Chunk upload failed: {}", e);
                    }
                    drop(permit);  // Release semaphore slot
                });
            }
        });
    }
}

The bounded mpsc channel provides backpressure - if uploads fall behind, the channel fills up and the recording side slows down, preventing unbounded memory growth.
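
Wiring this up is a couple of lines; the capacity of 32 in-flight chunks below is an assumed value rather than the engine's actual configuration:

// Bounded queue between the recording side and the uploader (capacity assumed)
let (upload_tx, upload_rx) = tokio::sync::mpsc::channel::<ChunkUploadTask>(32);
uploader.clone().start(upload_rx);

// Producer side: this await suspends once 32 chunks are queued,
// slowing the recorder instead of growing memory without bound.
upload_tx.send(chunk_task).await?;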

Distributed State with Redis

For multi-instance deployments, we need shared state. The CacheManager wraps Redis operations:

pub mod keys {
    pub const ROOM_PREFIX: &str = "surf:media:room:";
    pub const SESSION_PREFIX: &str = "surf:media:session:";
    pub const PARTICIPANT_PREFIX: &str = "surf:media:participant:";
    pub const RATE_LIMIT_PREFIX: &str = "surf:media:ratelimit:";
    pub const LOCK_PREFIX: &str = "surf:media:lock:";
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CachedSession {
    pub user_id: String,
    pub room_id: String,
    pub display_name: Option<String>,
    pub connected_at: i64,
    pub last_activity: i64,
    pub has_audio: bool,
    pub has_video: bool,
    pub audio_muted: bool,
    pub video_muted: bool,
}

Distributed Locks

For operations that must be atomic across instances (like starting a recording), we use Redis-based distributed locks:

pub async fn acquire_lock(&self, name: &str, ttl_secs: u64) -> Result<bool> {
    let key = format!("{}{}", keys::LOCK_PREFIX, name);
    // `conn` is a pooled Redis connection (acquisition elided)

    // SET NX EX - atomic lock acquisition
    let result: Option<String> = redis::cmd("SET")
        .arg(&key)
        .arg(&self.instance_id)
        .arg("NX")  // Only set if not exists
        .arg("EX")  // With expiration
        .arg(ttl_secs)
        .query_async(&mut conn)
        .await?;

    Ok(result.is_some())
}

pub async fn release_lock(&self, name: &str) -> Result<bool> {
    let key = format!("{}{}", keys::LOCK_PREFIX, name);

    // Only release if we own it
    // (shown simplified; a Lua script would make this check-and-delete atomic)
    let owner: Option<String> = conn.get(&key).await?;
    if owner.as_ref() == Some(&self.instance_id) {
        self.delete(&key).await
    } else {
        Ok(false)
    }
}
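
A hypothetical call site when starting a recording, so that only one instance kicks it off for a given room (recording_service.start is a placeholder name, not the real API):

// Hypothetical usage: the instance that wins the lock starts the recording
let lock_name = format!("recording:{}", room_id);
if cache.acquire_lock(&lock_name, 30).await? {
    recording_service.start(&room_id).await?; // placeholder API
    cache.release_lock(&lock_name).await?;
} else {
    // Another instance already owns recording for this room
}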

Rate Limiting

Protecting against abuse with a fixed-window counter in Redis (INCR plus EXPIRE per window):

pub async fn check_rate_limit(
    &self,
    key: &str,
    limit: u32,
    window_secs: u64
) -> Result<(bool, u32)> {
    let full_key = format!("{}{}", keys::RATE_LIMIT_PREFIX, key);

    let count: Option<u32> = conn.get(&full_key).await?;
    let current = count.unwrap_or(0);

    if current >= limit {
        return Ok((false, current));  // Rate limited
    }

    let new_count: u32 = conn.incr(&full_key, 1).await?;

    // Set expiry on first increment
    if new_count == 1 {
        conn.expire(&full_key, window_secs as i64).await?;
    }

    Ok((true, new_count))
}

Performance Characteristics

After extensive load testing, here are the real-world numbers:

Metric                    Value
End-to-end latency        <50ms (99th percentile)
Streams per room          100+ simultaneous
Memory per participant    ~2MB (audio + video buffers)
CPU per stream            ~0.5% (forwarding only)
Jitter buffer range       20-200ms adaptive
Recording chunk size      10 seconds / ~5MB

The key optimizations:

  • Zero-copy forwarding: Using Bytes and broadcast channels
  • Lock-free data structures: DashMap for concurrent access
  • Adaptive buffering: Jitter buffer adjusts to network conditions
  • Streaming uploads: Chunked recording prevents memory exhaustion

Lessons Learned

Building a production SFU taught me several hard lessons:

  1. WebRTC signaling is complex: The state machine has many edge cases. Test renegotiation extensively.

  2. Jitter buffers are critical: Bad buffering ruins call quality even on good networks. Invest in getting this right.

  3. Distributed systems are hard: Session affinity, failover, and state synchronization require careful design.

  4. Observability is essential: You can’t debug real-time systems without comprehensive metrics and tracing.

  5. Rust was the right choice: The performance and safety guarantees have saved countless debugging hours.

What’s Next

The Surf Media Engine is in production, but there’s always more to build:

  • Simulcast support: Adaptive quality based on subscriber bandwidth
  • SVC (Scalable Video Coding): More efficient bandwidth adaptation
  • AI integration: Real-time transcription and translation
  • E2E encryption: Frame-level encryption for privacy

Real-time communication is a fascinating intersection of systems programming, networking, signal processing, and distributed systems. If you’re interested in this space, I’d love to hear from you.


This is part of my systems engineering series. The code discussed here is from Surf Media Engine, the media server powering meet.surf (currently hosted at staging.shonen.live).

Written by Shubham Yadav (Kira) — hi@ykira.com