Building a Production WebRTC SFU in Rust: A Deep Technical Dive
Building real-time video calling infrastructure is one of the most challenging problems in systems engineering. You’re simultaneously battling network latency, packet loss, clock drift, codec quirks, and the fundamental laws of physics. Over the past months, I’ve been building Surf Media Engine - a production-grade WebRTC SFU (Selective Forwarding Unit) in Rust. Let me take you deep into the architecture.
Why Build an SFU?
The classic approaches to multi-party video calling are:
- Mesh: Everyone sends to everyone. O(n²) connections. Dies at 4+ participants.
- MCU (Mixing): Server decodes, composites, re-encodes. CPU-intensive, adds latency.
- SFU (Selective Forwarding): Server receives streams and forwards them without transcoding. The sweet spot.
An SFU is essentially a smart packet router. It receives RTP packets from publishers and selectively forwards them to subscribers. No encoding, no decoding, just intelligent routing. This is why companies like Zoom, Discord, and Google Meet all use SFU architectures.
The Rust Advantage
Why Rust for a media server?
```rust
// Zero-cost abstractions let us write expressive code
// that compiles to bare-metal performance
pub struct MediaEngine {
    rooms: DashMap<String, MediaRoom>,
    participants: DashMap<String, MediaParticipant>,
    connections: DashMap<String, Arc<RwLock<ParticipantConnection>>>,
    tracks: DashMap<String, MediaTrack>,
    active_tracks: Arc<DashMap<String, Arc<ActiveTrack>>>,
    event_tx: broadcast::Sender<MediaEvent>,
    ai_client: Arc<AiServiceClient>,
    cache: Option<Arc<CacheManager>>,
}
```
- Memory safety without GC pauses: No stop-the-world garbage collection during critical media paths
- Fearless concurrency: Rust’s ownership model prevents data races at compile time
- Zero-cost abstractions: High-level APIs that compile to efficient machine code
- Predictable latency: No runtime overhead means consistent sub-50ms forwarding
The combination of Tokio for async I/O and webrtc-rs for the WebRTC stack gives us a foundation that can handle thousands of concurrent streams with minimal resource usage.
Architecture Deep Dive
The SFU Router: Heart of the System
The SfuRouter is where the magic happens. It maintains the routing table that maps tracks to subscribers:
```rust
pub struct SfuRouter {
    config: RouterConfig,
    room_id: String,
    // Lock-free concurrent maps for publisher, subscriber, and route state
    publishers: DashMap<String, Arc<RwLock<Publisher>>>,
    subscribers: DashMap<String, Arc<RwLock<Subscriber>>>,
    routes: DashMap<String, RouteEntry>,
    // Each track gets a broadcast channel
    track_channels: DashMap<String, broadcast::Sender<Bytes>>,
    event_tx: broadcast::Sender<RouterEvent>,
    // Real-time audio level tracking for dominant speaker detection
    audio_levels: DashMap<String, f32>,
}
```
The key insight is using Tokio’s broadcast channels for track forwarding. When a publisher sends a packet:
```rust
pub fn route_packet(&self, track_id: &str, packet: Bytes) -> usize {
    if let Some(tx) = self.track_channels.get(track_id) {
        let count = tx.receiver_count();
        if count > 0 {
            // This is a single-producer, multi-consumer broadcast.
            // Cloning Bytes only bumps a reference count - zero-copy for all subscribers.
            let _ = tx.send(packet.clone());
            let _ = self.event_tx.send(RouterEvent::PacketRouted {
                source: track_id.to_string(),
                targets: count,
                bytes: packet.len(),
            });
            return count;
        }
    }
    0
}
```
This keeps the publisher's send path O(1) regardless of subscriber count. The broadcast channel stores each packet once in its ring buffer and hands every subscriber a reference-counted Bytes clone on receive, so the payload itself is never copied during fan-out.
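On the receiving side of that fan-out, each subscription is driven by a small forwarding task. Here is a minimal sketch of the pattern, where the mpsc sender stands in for the subscriber's outbound track (in the real engine this would be a write to the subscriber's TrackLocalStaticRTP):

```rust
use bytes::Bytes;
use tokio::sync::{broadcast, mpsc};
use tracing::warn;

// Illustrative per-subscriber forwarding task: reads packets from a track's
// broadcast channel and pushes them toward the subscriber's outbound track.
async fn forward_to_subscriber(mut rx: broadcast::Receiver<Bytes>, out: mpsc::Sender<Bytes>) {
    loop {
        match rx.recv().await {
            // recv() yields a cheap Bytes clone of the stored packet (refcount bump)
            Ok(packet) => {
                if out.send(packet).await.is_err() {
                    break; // subscriber went away
                }
            }
            // Slow subscriber: the ring buffer lapped us; drop and keep going
            Err(broadcast::error::RecvError::Lagged(n)) => {
                warn!("subscriber lagged, dropped {} packets", n);
            }
            Err(broadcast::error::RecvError::Closed) => break,
        }
    }
}
```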
The N:1 Routing Pattern
In a room with N participants, each user:
- Publishes: Their audio and video once to the SFU
- Subscribes: To (N-1) other participants’ streams
This is dramatically more efficient than mesh:
| Participants | Mesh Connections | SFU Connections |
|---|---|---|
| 4 | 12 | 8 |
| 10 | 90 | 20 |
| 50 | 2,450 | 100 |
| 100 | 9,900 | 200 |
The SFU approach scales linearly instead of quadratically: a mesh needs n(n-1) directed links, while the SFU needs only two connections per participant (one publishing, one subscribing), i.e. 2n in total.
Renegotiation: The WebRTC Dance
When a new participant joins or a track is added, we need to update all peer connections. This is called renegotiation:
```rust
async fn handle_track(
    &self,
    track: Arc<TrackRemote>,
    _receiver: Arc<RTCRtpReceiver>,
    room_id: &str,
    user_id: &str,
) -> Result<()> {
    let track_id = track.id();

    // Create a local track for SFU forwarding
    let local_track = Arc::new(TrackLocalStaticRTP::new(
        RTCRtpCodecCapability { /* codec config */ },
        format!("{}_{}", user_id, track_id),
        format!("stream_{}", user_id),
    ));

    // Store for forwarding (ActiveTrack wraps local_track plus metadata; construction elided)
    self.active_tracks.insert(track_id.clone(), active_track);

    // Add to ALL OTHER participants and trigger renegotiation
    for conn_ref in self.connections.iter() {
        if conn_ref.key() == user_id {
            continue;
        }
        let mut conn = conn_ref.value().write().await;
        if conn.room_id == room_id {
            let pc = &conn.peer_connection; // the subscriber's RTCPeerConnection

            // Add track to peer connection
            pc.add_track(local_track.clone()).await?;

            // Only renegotiate if in stable state
            if pc.signaling_state() == RTCSignalingState::Stable {
                let offer = pc.create_offer(None).await?;
                pc.set_local_description(offer.clone()).await?;

                // Send renegotiation offer via WebSocket
                let _ = self.event_tx.send(MediaEvent::RenegotiationNeeded {
                    room_id: room_id.to_string(),
                    user_id: conn_ref.key().clone(),
                    sdp: offer.sdp,
                });
            } else {
                // Queue for later - avoid m-line ordering issues
                conn.pending_renegotiation = true;
            }
        }
    }

    Ok(())
}
```
The tricky part is handling the signaling state machine. You can’t create a new offer while one is pending - this causes m-line ordering mismatches that break the connection. We track pending_renegotiation and process it after the current negotiation completes.
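The flip side is draining that flag. Once the remote answer has been applied and the connection is back in the stable state, a hook along these lines re-runs the queued offer (the name handle_answer_applied and the peer_connection field are illustrative; the event shape matches the one above):

```rust
// Illustrative: called after a subscriber's SDP answer has been applied
// and the signaling state machine is back in Stable.
async fn handle_answer_applied(&self, user_id: &str) -> Result<()> {
    if let Some(conn_ref) = self.connections.get(user_id) {
        let mut conn = conn_ref.value().write().await;
        if conn.pending_renegotiation
            && conn.peer_connection.signaling_state() == RTCSignalingState::Stable
        {
            conn.pending_renegotiation = false;
            // Re-run the offer that was queued while the previous negotiation was in flight
            let offer = conn.peer_connection.create_offer(None).await?;
            conn.peer_connection.set_local_description(offer.clone()).await?;
            let _ = self.event_tx.send(MediaEvent::RenegotiationNeeded {
                room_id: conn.room_id.clone(),
                user_id: user_id.to_string(),
                sdp: offer.sdp,
            });
        }
    }
    Ok(())
}
```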
Adaptive Jitter Buffer: Fighting Network Chaos
Networks are unreliable. Packets arrive out of order, with variable delays, or not at all. The jitter buffer smooths this chaos into a consistent playout stream.
The Data Structure
```rust
pub struct JitterBuffer {
    config: JitterBufferConfig,
    // BTreeMap keeps packets sorted by sequence number
    packets: BTreeMap<u16, JitterPacket>,
    next_sequence: Option<u16>,
    last_played_timestamp: Option<u32>,
    last_played_time: Option<Instant>,
    // Exponentially weighted moving average
    jitter_estimate: f32,
    delay_estimate: f32,
    stats: JitterStats,
    // Track NACKs to avoid duplicate requests
    nack_list: VecDeque<(u16, Instant)>,
    // Detect duplicates efficiently
    sequence_history: VecDeque<u16>,
}
```
Using a BTreeMap for packets is crucial - it maintains sort order by sequence number with O(log n) insertion, letting us efficiently find gaps and iterate in order.
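Insertion is the other half. Here is a rough sketch of an insert path over these structures; the duplicates and packets_received counters, and the JitterPacket fields other than received_at, are illustrative rather than the exact ones in the engine:

```rust
impl JitterBuffer {
    /// Sketch: insert a packet, rejecting duplicates and seeding playout state.
    pub fn insert(&mut self, seq: u16, timestamp: u32, payload: Bytes) {
        // Duplicate? sequence_history is a small ring of recently seen numbers.
        if self.sequence_history.contains(&seq) {
            self.stats.duplicates += 1;
            return;
        }
        self.sequence_history.push_back(seq);
        if self.sequence_history.len() > 128 {
            self.sequence_history.pop_front();
        }

        // First packet seeds the expected playout sequence number.
        if self.next_sequence.is_none() {
            self.next_sequence = Some(seq);
        }

        // BTreeMap keeps packets ordered by sequence number (O(log n) insert).
        self.packets.insert(seq, JitterPacket {
            timestamp,
            payload,
            received_at: Instant::now(),
        });
        self.stats.packets_received += 1;
    }
}
```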
The Playout Decision Algorithm
Every playout interval (typically 20ms for audio), we make a decision:
```rust
pub fn decision(&mut self) -> PlayoutDecision {
    let next_seq = match self.next_sequence {
        Some(s) => s,
        None => return PlayoutDecision::Wait,
    };

    // Do we have the next packet?
    if let Some(packet) = self.packets.get(&next_seq) {
        let buffered_time = packet.received_at.elapsed().as_millis() as u32;
        // Has it been buffered long enough?
        if buffered_time >= self.delay_estimate as u32 {
            return PlayoutDecision::Play(next_seq);
        } else {
            return PlayoutDecision::Wait;
        }
    }

    // Packet is missing - what do we do?
    // Check if buffer is getting too full (skip strategy)
    let buffer_time = self.buffer_time_ms();
    if buffer_time > self.config.max_delay_ms {
        // Skip ahead to prevent buffer overflow
        if let Some((&first_seq, _)) = self.packets.iter().next() {
            self.next_sequence = Some(first_seq);
            self.stats.packets_lost += 1;
            return PlayoutDecision::Skip(next_seq);
        }
    }

    // Try NACK (request retransmission)
    if self.config.nack_enabled {
        let already_nacked = self.nack_list.iter().any(|(s, _)| *s == next_seq);
        if !already_nacked {
            self.nack_list.push_back((next_seq, Instant::now()));
            self.stats.nacks_sent += 1;
            return PlayoutDecision::Nack(next_seq);
        }

        // NACK timeout - give up and use PLC
        let nack_timeout = Duration::from_millis(50);
        if let Some((_, nack_time)) = self.nack_list.iter().find(|(s, _)| *s == next_seq) {
            if nack_time.elapsed() > nack_timeout {
                self.next_sequence = Some(next_seq.wrapping_add(1));
                self.stats.packets_lost += 1;
                self.stats.plc_frames += 1;
                return PlayoutDecision::Plc; // Packet Loss Concealment
            }
        }
    }

    PlayoutDecision::Wait
}
```
The decisions are:
- Play: Packet available, playout now
- Wait: Still buffering, hold on
- Skip: Buffer overflow, jump ahead
- Nack: Request retransmission
- PLC: Generate synthetic audio to mask the gap
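Downstream, a fixed-interval tick drives these decisions. A simplified driver loop is sketched below, assuming a hypothetical pop helper that removes a packet by sequence number; the real decode/forward path is elided:

```rust
// Sketch of a 20ms playout driver for one audio track.
async fn playout_loop(mut buffer: JitterBuffer) {
    let mut tick = tokio::time::interval(Duration::from_millis(20));
    loop {
        tick.tick().await;
        match buffer.decision() {
            PlayoutDecision::Play(seq) => {
                // Hand the packet to the decoder / forwarder
                let _frame = buffer.pop(seq);
            }
            PlayoutDecision::Nack(seq) => {
                // Emit an RTCP NACK for this sequence number toward the sender
                debug!("requesting retransmission of seq {}", seq);
            }
            PlayoutDecision::Plc => {
                // Generate ~20ms of concealment audio to mask the gap
            }
            PlayoutDecision::Skip(seq) => {
                debug!("skipping past missing seq {}", seq);
            }
            PlayoutDecision::Wait => {} // keep buffering
        }
    }
}
```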
Adaptive Delay Estimation
The jitter buffer dynamically adjusts its delay target based on observed network conditions:
```rust
fn adapt_delay(&mut self) {
    // Target delay = 2x estimated jitter + minimum delay
    // This gives us a ~95% packet delivery probability
    let target = self.jitter_estimate * 2.0 + self.config.min_delay_ms as f32;
    let target = target.clamp(
        self.config.min_delay_ms as f32,
        self.config.max_delay_ms as f32,
    );

    // Smooth transition to avoid audible artifacts
    self.delay_estimate = self.delay_estimate * 0.95 + target * 0.05;
}
```
The jitter estimation uses an exponentially weighted moving average (EWMA):
```rust
// RFC 3550 jitter calculation
// Alpha = 1/16 = 0.0625 (classic RTP jitter filter)
self.jitter_estimate = self.jitter_estimate * 0.9375 + jitter * 0.0625;
```
This filter responds to network changes while filtering out noise. The 1/16 coefficient is the standard from RFC 3550 and provides good responsiveness without overreacting to single outlier packets.
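For reference, the jitter sample fed into that filter is the RFC 3550 interarrival jitter: the difference between how far apart two packets actually arrived and how far apart their RTP timestamps say they should be. A sketch of that calculation (clock_rate would be 48 000 for Opus audio):

```rust
use std::time::Instant;

// RFC 3550 interarrival jitter sample, in milliseconds.
// `arrival` is when the packet hit the socket; `rtp_ts` is the packet's RTP timestamp.
fn interarrival_jitter_ms(
    prev_arrival: Instant,
    arrival: Instant,
    prev_rtp_ts: u32,
    rtp_ts: u32,
    clock_rate: f32, // e.g. 48_000.0 for Opus audio
) -> f32 {
    let arrival_delta_ms = arrival.duration_since(prev_arrival).as_secs_f32() * 1000.0;
    // wrapping_sub handles the 32-bit RTP timestamp rolling over
    let ts_delta_ms = rtp_ts.wrapping_sub(prev_rtp_ts) as f32 / clock_rate * 1000.0;
    (arrival_delta_ms - ts_delta_ms).abs()
}
```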
Chunked Recording Pipeline
Recording multi-participant calls requires careful engineering. We can’t just dump everything to disk - we need streaming uploads to handle hour-long calls without running out of local storage.
The Architecture
```rust
pub struct RecordingService {
    config: RecordingConfig,
    storage: Option<Arc<S3Storage>>,
    muxer: Arc<WebMMuxer>,
    uploader: Arc<ChunkUploader>,
    recordings: DashMap<String, Arc<ChunkedRecording>>,
    upload_tx: mpsc::Sender<ChunkUploadTask>,
}
```
The pipeline:
- Media Events → Captured from broadcast channel
- ChunkedRecording → Buffers packets, triggers chunk creation every 10 seconds
- WebMMuxer → Muxes audio/video into WebM container format
- ChunkUploader → Compresses and uploads to R2/S3 with retry logic
- RecordingMerger → Background task that concatenates chunks into final file
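The 10-second boundary is enforced in the ChunkedRecording step of that pipeline. A rough sketch of the rotation, with illustrative field and helper names (buffer, started_at, build_upload_task) and assuming the recording holds a clone of the upload_tx sender:

```rust
impl ChunkedRecording {
    /// Sketch: buffer an incoming packet and cut a new chunk every 10 seconds.
    pub async fn add_audio_packet(&self, packet: AudioPacket) -> Result<()> {
        let mut buf = self.buffer.lock().await; // illustrative: Mutex<ChunkBuffer>
        buf.audio.push(packet);

        if buf.started_at.elapsed() >= Duration::from_secs(10) {
            // Hand the finished chunk to the muxer/uploader and start a fresh window
            let audio = std::mem::take(&mut buf.audio);
            buf.started_at = Instant::now();
            let task = self.build_upload_task(audio); // illustrative helper
            self.upload_tx.send(task).await?;
        }
        Ok(())
    }
}
```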
Zero-Copy Event Processing
```rust
// Subscribe to media events
let mut event_rx = state.media_engine.subscribe();

tokio::spawn(async move {
    loop {
        match event_rx.recv().await {
            Ok(MediaEvent::AudioData { room_id, user_id, data }) => {
                if let Some(recording) = recordings.get(&room_id) {
                    let packet = AudioPacket {
                        user_id,
                        data, // Bytes is reference-counted, no copy
                        timestamp: chrono::Utc::now().timestamp_millis() as u64,
                        sequence: 0,
                    };
                    if let Err(e) = recording.add_audio_packet(packet).await {
                        error!("Failed to buffer audio packet: {}", e);
                    }
                }
            }
            Ok(MediaEvent::VideoData { room_id, user_id, data, is_keyframe }) => {
                // Similar handling for video
            }
            Ok(_) => {} // other events are not recorded
            Err(broadcast::error::RecvError::Lagged(_)) => continue,
            Err(broadcast::error::RecvError::Closed) => break,
        }
    }
});
```
The Bytes type from the bytes crate gives us reference-counted, immutable byte buffers. This means passing media data around doesn’t involve copying - we’re just incrementing a reference count.
Async Upload with Backpressure
```rust
pub struct ChunkUploader {
    config: ChunkUploaderConfig,
    storage: Arc<S3Storage>,
    compressor: VideoCompressor,
}

impl ChunkUploader {
    pub fn start(self: Arc<Self>, mut rx: mpsc::Receiver<ChunkUploadTask>) {
        tokio::spawn(async move {
            // Semaphore limits concurrent uploads
            let semaphore = Arc::new(Semaphore::new(self.config.concurrency));
            while let Some(task) = rx.recv().await {
                let permit = semaphore.clone().acquire_owned().await.unwrap();
                let uploader = self.clone();
                tokio::spawn(async move {
                    if let Err(e) = uploader.upload_chunk(task).await {
                        error!("Chunk upload failed: {}", e);
                    }
                    drop(permit); // Release semaphore slot
                });
            }
        });
    }
}
```
The bounded mpsc channel provides backpressure - if uploads fall behind, the channel fills up and the recording side slows down, preventing unbounded memory growth.
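That bound is set when the channel is created. A small illustration of the wiring, reusing the start method above (the capacity of 32 is arbitrary):

```rust
use std::sync::Arc;
use tokio::sync::mpsc;

// Illustrative wiring between the recording side and the uploader.
fn wire_uploader(uploader: Arc<ChunkUploader>) -> mpsc::Sender<ChunkUploadTask> {
    // Bounded queue: at most 32 chunks may be queued at once.
    let (upload_tx, upload_rx) = mpsc::channel::<ChunkUploadTask>(32);
    uploader.start(upload_rx);
    // Producer side: `upload_tx.send(task).await` suspends the recording task
    // whenever all 32 slots are full, so memory use stays bounded.
    upload_tx
}
```

When dropping a chunk is preferable to stalling the recorder, try_send can be used instead and the Full error handled explicitly.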
Distributed State with Redis
For multi-instance deployments, we need shared state. The CacheManager wraps Redis operations:
```rust
pub mod keys {
    pub const ROOM_PREFIX: &str = "surf:media:room:";
    pub const SESSION_PREFIX: &str = "surf:media:session:";
    pub const PARTICIPANT_PREFIX: &str = "surf:media:participant:";
    pub const RATE_LIMIT_PREFIX: &str = "surf:media:ratelimit:";
    pub const LOCK_PREFIX: &str = "surf:media:lock:";
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CachedSession {
    pub user_id: String,
    pub room_id: String,
    pub display_name: Option<String>,
    pub connected_at: i64,
    pub last_activity: i64,
    pub has_audio: bool,
    pub has_video: bool,
    pub audio_muted: bool,
    pub video_muted: bool,
}
```
Distributed Locks
For operations that must be atomic across instances (like starting a recording), we use Redis-based distributed locks:
```rust
pub async fn acquire_lock(&self, name: &str, ttl_secs: u64) -> Result<bool> {
    let key = format!("{}{}", keys::LOCK_PREFIX, name);
    let mut conn = self.connection().await?; // pooled Redis connection (helper elided)

    // SET NX EX - atomic lock acquisition
    let result: Option<String> = redis::cmd("SET")
        .arg(&key)
        .arg(&self.instance_id)
        .arg("NX") // Only set if not exists
        .arg("EX") // With expiration
        .arg(ttl_secs)
        .query_async(&mut conn)
        .await?;

    Ok(result.is_some())
}

pub async fn release_lock(&self, name: &str) -> Result<bool> {
    let key = format!("{}{}", keys::LOCK_PREFIX, name);
    let mut conn = self.connection().await?;

    // Only release if we own it
    let owner: Option<String> = conn.get(&key).await?;
    if owner.as_ref() == Some(&self.instance_id) {
        self.delete(&key).await
    } else {
        Ok(false)
    }
}
```
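One subtlety worth noting: the GET-then-DEL in release_lock is two round trips, so in principle another instance could take the lock in between. The usual fix is a small Lua script that compares and deletes atomically on the Redis server; a sketch using redis-rs's Script type, with the same illustrative connection() helper:

```rust
// Compare-and-delete in one atomic step on the Redis server.
pub async fn release_lock_atomic(&self, name: &str) -> Result<bool> {
    let key = format!("{}{}", keys::LOCK_PREFIX, name);
    let script = redis::Script::new(
        r#"
        if redis.call('GET', KEYS[1]) == ARGV[1] then
            return redis.call('DEL', KEYS[1])
        else
            return 0
        end
        "#,
    );
    let mut conn = self.connection().await?;
    let deleted: i32 = script
        .key(&key)
        .arg(&self.instance_id)
        .invoke_async(&mut conn)
        .await?;
    Ok(deleted == 1)
}
```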
Rate Limiting
Protecting against abuse with windowed rate limiting - a per-key counter that expires with the window:
```rust
pub async fn check_rate_limit(
    &self,
    key: &str,
    limit: u32,
    window_secs: u64,
) -> Result<(bool, u32)> {
    let full_key = format!("{}{}", keys::RATE_LIMIT_PREFIX, key);
    let mut conn = self.connection().await?; // pooled Redis connection (helper elided)

    let count: Option<u32> = conn.get(&full_key).await?;
    let current = count.unwrap_or(0);
    if current >= limit {
        return Ok((false, current)); // Rate limited
    }

    let new_count: u32 = conn.incr(&full_key, 1).await?;

    // Set expiry on first increment
    if new_count == 1 {
        let _: bool = conn.expire(&full_key, window_secs as i64).await?;
    }

    Ok((true, new_count))
}
```
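At the signaling edge this is keyed per client, for example per IP on join attempts (the key shape and limits here are illustrative):

```rust
// Illustrative: allow at most 30 join attempts per client IP per 60-second window.
let (allowed, attempts) = cache
    .check_rate_limit(&format!("join:{}", client_ip), 30, 60)
    .await?;
if !allowed {
    warn!("rate limited join from {} ({} attempts in window)", client_ip, attempts);
    // reject the signaling request here
}
```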
Performance Characteristics
After extensive load testing, here are the real-world numbers:
| Metric | Value |
|---|---|
| End-to-end latency | <50ms (99th percentile) |
| Streams per room | 100+ simultaneous |
| Memory per participant | ~2MB (audio + video buffers) |
| CPU per stream | ~0.5% (forwarding only) |
| Jitter buffer range | 20-200ms adaptive |
| Recording chunk size | 10 seconds / ~5MB |
The key optimizations:
- Zero-copy forwarding: using Bytes and broadcast channels
- Lock-free data structures: DashMap for concurrent access
- Adaptive buffering: the jitter buffer adjusts to network conditions
- Streaming uploads: chunked recording prevents memory exhaustion
Lessons Learned
Building a production SFU taught me several hard lessons:
- WebRTC signaling is complex: The state machine has many edge cases. Test renegotiation extensively.
- Jitter buffers are critical: Bad buffering ruins call quality even on good networks. Invest in getting this right.
- Distributed systems are hard: Session affinity, failover, and state synchronization require careful design.
- Observability is essential: You can’t debug real-time systems without comprehensive metrics and tracing.
- Rust was the right choice: The performance and safety guarantees have saved countless debugging hours.
What’s Next
The Surf Media Engine is in production, but there’s always more to build:
- Simulcast support: Adaptive quality based on subscriber bandwidth
- SVC (Scalable Video Coding): More efficient bandwidth adaptation
- AI integration: Real-time transcription and translation
- E2E encryption: Frame-level encryption for privacy
Real-time communication is a fascinating intersection of systems programming, networking, signal processing, and distributed systems. If you’re interested in this space, I’d love to hear from you.
This is part of my systems engineering series. The code discussed here is from Surf Media Engine, the media server powering meet.surf (currently hosted at staging.shonen.live).
Written by Shubham Yadav (Kira) — hi@ykira.com