FusionFlow: WebRTC in Production

Introduction

WebRTC stops feeling simple the moment real networks get involved.

In the previous article, I described the architecture of FusionFlow and why I separated media from control.

This is where that separation stops being architectural hygiene and starts being survival.

WebRTC feels easy at first. Two browsers exchange some blobs and suddenly there is video. It almost feels suspiciously simple.

It is not simple.

The difficult part is not rendering video. The difficult part is rendering video reliably when users are behind CGNAT, corporate VPNs, hotel WiFi, or networks that treat UDP like a personal insult.

I did not want heuristics and hope. I wanted explicit state and predictable behavior. That decision shaped everything that follows.

This second article focuses on what breaks once WebRTC leaves the happy path and starts dealing with real networks.

What this article covers

Why SDP and ICE are only the starting point
How FusionFlow keeps signaling explicit and backend-driven
Why TURN becomes unavoidable in production
Why bandwidth and network behavior become the real constraints

WebRTC Is Not a Protocol

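WebRTC is not a single protocol. It is a collection of mechanisms working together: SDP to describe the session, ICE to find a network path, and DTLS-SRTP to encrypt the media.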

The connection starts with SDP: Session Description Protocol.

SDP is just structured text describing capabilities. Codecs. Encryption fingerprints. Media directions. Transport details. An "offer" is an SDP document. An "answer" is another SDP document responding to it.

A simplified fragment looks like this:

v=0
o=- 46117317 2 IN IP4 127.0.0.1
s=-
t=0 0
m=video 9 UDP/TLS/RTP/SAVPF 96
a=rtpmap:96 VP8/90000
a=ice-ufrag:abc123
a=ice-pwd:def456
a=fingerprint:sha-256 12:34:56:...

You don't need to memorize SDP. But you do need to respect that this blob defines the contract between peers.
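To make the "structured text" point concrete, here is a minimal sketch that pulls a few fields out of an SDP blob. It is not a full SDP parser and not FusionFlow code; the `summarizeSdp` function and the `SdpSummary` shape are hypothetical names chosen for illustration.

```typescript
// Hypothetical summary shape: only the fields discussed in the text above.
interface SdpSummary {
  mediaKinds: string[];          // from "m=" lines, e.g. "video"
  codecs: string[];              // from "a=rtpmap:" lines, e.g. "VP8/90000"
  iceUfrag?: string;             // from "a=ice-ufrag:"
  fingerprintAlgorithm?: string; // from "a=fingerprint:", e.g. "sha-256"
}

// Walks the SDP line by line. Each line is "<type>=<value>", so simple
// prefix checks are enough for a read-only summary.
function summarizeSdp(sdp: string): SdpSummary {
  const summary: SdpSummary = { mediaKinds: [], codecs: [] };
  for (const rawLine of sdp.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line.startsWith("m=")) {
      // "m=video 9 UDP/TLS/RTP/SAVPF 96" -> "video"
      const kind = line.slice(2).split(" ")[0];
      if (kind) summary.mediaKinds.push(kind);
    } else if (line.startsWith("a=rtpmap:")) {
      // "a=rtpmap:96 VP8/90000" -> "VP8/90000"
      const codec = line.split(" ")[1];
      if (codec) summary.codecs.push(codec);
    } else if (line.startsWith("a=ice-ufrag:")) {
      summary.iceUfrag = line.slice("a=ice-ufrag:".length);
    } else if (line.startsWith("a=fingerprint:")) {
      // "a=fingerprint:sha-256 12:34:..." -> "sha-256"
      summary.fingerprintAlgorithm = line.slice("a=fingerprint:".length).split(" ")[0];
    }
  }
  return summary;
}
```

Logging a summary like this for every offer and answer is a cheap way to see what the two peers actually agreed on, without reading raw SDP by eye.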

After SDP comes ICE: Interactive Connectivity Establishment.

ICE is responsible for discovering viable network paths between peers. Each peer gathers multiple candidates and tests them until one works.

ICE Candidate Types (Clean Mental Model)

Instead of thinking about ICE as magic, think about it as a prioritized path search.

flowchart TD
    A["Peer gathers candidates"]

    H["Host candidate<br/>(local IP)"]
    S["Server reflexive<br/>(via STUN)"]
    R["Relay candidate<br/>(via TURN)"]

    A --> H
    A --> S
    A --> R

ICE then prioritizes them roughly in this order, so the cheapest viable path is tried first:

flowchart LR
    Host["Host (Local)"] --> Reflexive["Reflexive (STUN)"] --> Relay["Relay (TURN)"]

Host candidates are local addresses. They work beautifully on localhost or a shared LAN and almost nowhere else.

Server reflexive candidates are discovered via STUN. They represent the public address and port your NAT exposes for the peer.

Relay candidates come from TURN. They are slower and more expensive, but they work.

On localhost, host candidates win.
On the real internet, relay candidates win more often than people expect.
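The ordering above falls out of how candidate priorities are computed. The sketch below applies the recommended formula and type preferences from RFC 8445 (section 5.1.2), assuming a single network interface (local preference 65535) and the RTP component (ID 1). The function names are my own, and real ICE agents sort candidate pairs rather than single candidates, so treat this as a mental model rather than a description of any browser's internals.

```typescript
type CandidateType = "host" | "prflx" | "srflx" | "relay";

// Recommended type preferences from RFC 8445. Higher means "try this first".
const TYPE_PREFERENCE: Record<CandidateType, number> = {
  host: 126,
  prflx: 110,
  srflx: 100,
  relay: 0,
};

// priority = 2^24 * typePreference + 2^8 * localPreference + (256 - componentId)
// Multiplication is used instead of bit shifts because the result exceeds
// the signed 32-bit range that JavaScript shift operators work in.
function candidatePriority(
  type: CandidateType,
  localPreference: number = 65535, // assumption: one interface
  componentId: number = 1,         // assumption: RTP component
): number {
  return 2 ** 24 * TYPE_PREFERENCE[type] + 2 ** 8 * localPreference + (256 - componentId);
}

// Returns a new array ordered from highest to lowest priority.
function sortByPriority(types: CandidateType[]): CandidateType[] {
  return [...types].sort((a, b) => candidatePriority(b) - candidatePriority(a));
}

// candidatePriority("host")  -> 2130706431
// candidatePriority("srflx") -> 1694498815
// candidatePriority("relay") -> 16777215
```

Those three numbers are the ones that typically show up in the priority field of candidates in browser logs, which makes them handy for spotting the candidate type at a glance.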

Signaling: Backend-Driven and Boring on Purpose

WebRTC does not define signaling. That is the application's responsibility.

In FusionFlow, signaling runs on STOMP over WebSocket. The backend is not just forwarding blobs. It maintains canonical room state and peer roles.

Only one peer generates the offer.

When the backend emits peer-joined, it includes an initiatorId. Only that client creates the offer. This prevents dual-offer chaos, race conditions, and what I call "SDP optimism".

Here is the simplified signaling flow:

sequenceDiagram
  participant Viewer
  participant Streamer
  participant Backend

  Viewer->>Backend: /app/webrtc/join
  Backend-->>Viewer: joined + room-state

  Streamer->>Backend: /app/webrtc/join
  Backend-->>Streamer: joined + room-state
  Backend-->>Viewer: peer-joined (initiatorId)

  Streamer->>Backend: /app/webrtc/signal (offer)
  Backend-->>Viewer: signal (offer)

  Viewer->>Backend: /app/webrtc/signal (answer)
  Backend-->>Streamer: signal (answer)

  Note over Viewer,Streamer: ICE candidates exchanged

  Streamer->>Backend: /app/webrtc/media-state
  Backend-->>Viewer: media-state

Signaling is explicit. Offers, answers, ICE candidates, and media-state updates are intentional. Nothing is inferred. Nothing is guessed.
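The single-initiator rule can be sketched as two small client-side checks. Only `initiatorId` comes from the flow described above; the other payload fields, the type names, and both functions are assumptions made for this illustration.

```typescript
// Assumed payload shape. Only initiatorId is taken from the article.
interface PeerJoinedEvent {
  roomId: string;      // hypothetical field
  peerId: string;      // hypothetical field: the peer that just joined
  initiatorId: string; // the only peer allowed to create the offer
}

type SignalingAction = "create-offer" | "wait-for-offer";

// The client never decides on its own who offers. It only compares its own
// id with the initiator chosen by the backend.
function decideRole(event: PeerJoinedEvent, selfId: string): SignalingAction {
  return event.initiatorId === selfId ? "create-offer" : "wait-for-offer";
}

// Defensive check on the receiving side: an offer is accepted only when it
// comes from the backend-designated initiator and we are not that initiator.
function shouldAcceptOffer(fromPeerId: string, initiatorId: string, selfId: string): boolean {
  return fromPeerId === initiatorId && selfId !== initiatorId;
}
```

Because both peers derive their role from the same backend field, two simultaneous offers cannot happen unless the backend itself is inconsistent, which is a much easier thing to test.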

Boring is good here.

Backend as the Source of Truth

Many WebRTC examples let the frontend infer too much.

If video does not appear, reset everything.
If tracks disappear, renegotiate blindly.

That works until it does not.

In FusionFlow, the backend is authoritative for:

Room membership
Peer roles
Expected camera and microphone state

Whenever a peer toggles camera or mic, the frontend sends a media-state update. The backend updates its canonical state and broadcasts it.

The frontend compares two things:

Expected state from the backend
Actual received tracks

If the backend says video should be on but no tracks arrive, the client requests a replay instead of panicking.

stateDiagram-v2
  [*] --> Connected
  Connected --> LiveMedia: tracks received
  LiveMedia --> Connected: media toggled off
  Connected --> ReplayRequested: expected media missing
  ReplayRequested --> Connected: replay succeeds

Renegotiation becomes state-driven instead of timer-driven.

It sounds small. It removes an entire category of instability.
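A minimal sketch of that comparison, assuming the backend state reduces to two booleans and the client can count its live remote tracks. The type names and the `reconcile` function are hypothetical, and the replay request itself is left out because its wire format is not described here.

```typescript
// Expected state, as broadcast by the backend (assumed shape).
interface ExpectedMedia {
  video: boolean;
  audio: boolean;
}

// What the client actually observes on the peer connection (assumed shape).
interface ActualMedia {
  videoTracks: number;
  audioTracks: number;
}

type MediaKind = "audio" | "video";

type ReconcileAction =
  | { kind: "ok" }
  | { kind: "request-replay"; missing: MediaKind[] };

// Compares expectation with reality. A replay is requested only when the
// backend says a kind of media should be on and no track of that kind exists.
function reconcile(expected: ExpectedMedia, actual: ActualMedia): ReconcileAction {
  const missing: MediaKind[] = [];
  if (expected.audio && actual.audioTracks === 0) missing.push("audio");
  if (expected.video && actual.videoTracks === 0) missing.push("video");
  return missing.length === 0 ? { kind: "ok" } : { kind: "request-replay", missing };
}
```

The function is pure on purpose: it can run on every media-state message and every track event without side effects, and the caller decides how to throttle the resulting replay requests.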

Failure Modes I Have Actually Seen

The most confusing production issues were not ICE failures.

They were state inconsistencies.

One scenario looked like this.

Peer A was absolutely convinced it was sharing video. Camera light on. Track active. No errors.

Peer B was receiving signaling correctly. ICE completed. Connection state was connected.

And yet, zero media frames arrived.

From the outside, everything looked healthy:

Connection: connected
ICE: completed
Signaling: fine
Media: silent

What actually happened was not a transport issue. It was state drift.

A signaling message was lost during a reconnect edge case. The backend believed media was enabled. The receiving peer never renegotiated the updated track parameters.

Both sides believed they were correct.

This is the worst category of real-time bugs. Nothing crashes. Nothing throws. Metrics look fine. Users just stare at a frozen avatar wondering why the other side is "not sharing."

That is precisely why media expectations must be reconciled explicitly against actual received tracks.

Once I enforced backend-driven media state reconciliation and replay requests, this entire class of ghost failures disappeared.
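The "connected but silent" case can also be caught from statistics, not only from track presence. The sketch below assumes two snapshots of the `framesDecoded` counter, which in a browser would come from the `inbound-rtp` reports returned by `RTCPeerConnection.getStats()`. The snapshot shape, the function name, and the two-second window are illustrative assumptions, not values from FusionFlow.

```typescript
// Assumed snapshot of one inbound video stream at a point in time.
interface InboundVideoSnapshot {
  timestampMs: number;   // when the sample was taken
  framesDecoded: number; // cumulative counter from the inbound-rtp stats
}

// Returns true when video is expected but the decoder has made no progress
// for at least minIntervalMs. Shorter gaps are ignored to avoid false alarms.
function isVideoSilent(
  previous: InboundVideoSnapshot,
  current: InboundVideoSnapshot,
  expectedVideoOn: boolean,
  minIntervalMs: number = 2000, // assumption: tune to your own tolerance
): boolean {
  if (!expectedVideoOn) return false;
  const elapsedMs = current.timestampMs - previous.timestampMs;
  if (elapsedMs < minIntervalMs) return false;
  return current.framesDecoded <= previous.framesDecoded;
}
```

A check like this only has meaning next to the backend's expected state: a paused camera and a broken stream produce the same flat counter, and only the expectation tells them apart.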

These bugs do not show up in tutorials.

They show up when real users refresh at exactly the wrong moment.

ICE in the Real World

WebRTC is described as peer-to-peer.

Technically correct. Practically conditional.

Mobile carriers use CGNAT.
Corporate networks block UDP.
VPNs rewrite routes.
Some routers behave creatively.

Without TURN properly configured, a noticeable percentage of users simply cannot establish media.

The first time you see ICE complete successfully but receive zero frames, you gain a new appreciation for relay servers.
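One cheap diagnostic is to look at which candidate types a client actually gathered. The sketch below parses the `typ` token out of candidate strings, such as the `candidate` field of an ICE candidate event in the browser, and flags the case where no relay candidate exists. The function names and the summary shape are hypothetical, and the addresses in the usage example are documentation ranges.

```typescript
type GatheredType = "host" | "srflx" | "prflx" | "relay";

// Extracts the candidate type from a candidate line, or null if absent.
// Example line: "candidate:3 1 udp 16777215 198.51.100.20 49160 typ relay ..."
function candidateType(candidate: string): GatheredType | null {
  const match = / typ (host|srflx|prflx|relay)\b/.exec(candidate);
  return match ? (match[1] as GatheredType) : null;
}

interface GatheringSummary {
  counts: Record<GatheredType, number>;
  hasRelay: boolean; // false usually means TURN was unreachable or rejected us
}

// Counts gathered candidates by type. Lines without a "typ" token are skipped.
function summarizeCandidates(candidates: string[]): GatheringSummary {
  const counts: Record<GatheredType, number> = { host: 0, srflx: 0, prflx: 0, relay: 0 };
  for (const line of candidates) {
    const type = candidateType(line);
    if (type !== null) counts[type] += 1;
  }
  return { counts, hasRelay: counts.relay > 0 };
}

// Usage with made-up candidates:
const exampleSummary = summarizeCandidates([
  "candidate:1 1 udp 2130706431 192.168.1.10 54321 typ host",
  "candidate:2 1 udp 1694498815 203.0.113.7 54321 typ srflx raddr 192.168.1.10 rport 54321",
]);
// exampleSummary.hasRelay is false here: no TURN fallback was gathered.
```

A client that reports `hasRelay: false` on a restrictive network is a strong hint to check TURN reachability and credentials before looking anywhere else.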

TURN: Where Theory Ends

TURN is where WebRTC becomes infrastructure.

Coturn must be configured correctly:

External IP must match public IP
TLS must be valid
Realm must align with authentication
Relay ports must be open

If TURN is misconfigured, signaling works. SDP exchange works. ICE appears successful.

But media never flows.

Everything looks fine. Nothing works. Those are the most educational bugs.
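A practical way to catch this class of bug early is to force every connection through the relay during testing. The sketch below builds an object in the shape of the browser's `RTCConfiguration`, with `iceTransportPolicy` set to `"relay"` when requested. The local interfaces mirror the browser types so the snippet stays self-contained; the host name, the credentials, and the ports (3478 and 5349, the usual TURN and TURN-over-TLS defaults) are assumptions about a typical deployment, not FusionFlow's actual settings.

```typescript
// Local mirrors of the browser's RTCIceServer / RTCConfiguration shapes.
interface IceServer {
  urls: string[];
  username?: string;
  credential?: string;
}

interface IceConfig {
  iceServers: IceServer[];
  iceTransportPolicy: "all" | "relay";
}

interface TurnOptions {
  turnHost: string;     // hypothetical, e.g. "turn.example.com"
  username: string;
  credential: string;
  forceRelay?: boolean; // true = only relay candidates are used
}

// Offers TURN over UDP, TCP, and TLS so that clients on UDP-hostile
// networks still have a path. STUN is assumed to run on the same host.
function buildIceConfig(options: TurnOptions): IceConfig {
  const { turnHost, username, credential, forceRelay = false } = options;
  return {
    iceServers: [
      { urls: [`stun:${turnHost}:3478`] },
      {
        urls: [
          `turn:${turnHost}:3478?transport=udp`,
          `turn:${turnHost}:3478?transport=tcp`,
          `turns:${turnHost}:5349?transport=tcp`,
        ],
        username,
        credential,
      },
    ],
    iceTransportPolicy: forceRelay ? "relay" : "all",
  };
}
```

With `forceRelay: true`, a misconfigured TURN server fails loudly during a test call, because there is no host or reflexive path left to hide behind.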

In FusionFlow, TURN runs independently from the backend.

flowchart LR
  Viewer -->|Media| TURN
  Streamer -->|Media| TURN
  Viewer -->|Signaling| Backend
  Streamer -->|Signaling| Backend

Media and control remain separate. That keeps backend CPU predictable and prevents accidental bottlenecks.

Bandwidth Is the Real Scaling Constraint

Once TURN is involved, WebRTC is no longer purely peer-to-peer. Media is relayed.

Relay means bandwidth.

WebRTC is not expensive because of signaling. It becomes expensive because of relay bandwidth.

One HD stream relayed through TURN is manageable. Multiply that by concurrent sessions and it becomes infrastructure planning.
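The multiplication is worth writing down, even roughly. The sketch below is a back-of-the-envelope estimate, assuming every relayed stream enters the TURN server once and leaves it once. The bitrate and the relayed fraction are inputs you would measure yourself; the numbers in the comment are illustrative, not FusionFlow measurements.

```typescript
interface RelayLoadInput {
  concurrentStreams: number; // media streams active at the same time
  bitrateMbps: number;       // average bitrate per stream (measure, do not guess)
  relayedFraction: number;   // share of streams that end up on TURN, from 0 to 1
}

interface RelayLoad {
  ingressMbps: number;       // traffic arriving at the TURN server
  egressMbps: number;        // traffic leaving the TURN server
  egressGbPerHour: number;   // decimal gigabytes, the unit most providers bill in
}

// Each relayed stream is received once and forwarded once, so ingress and
// egress are equal in this simple one-to-one model.
function estimateRelayLoad(input: RelayLoadInput): RelayLoad {
  const relayedMbps = input.concurrentStreams * input.bitrateMbps * input.relayedFraction;
  return {
    ingressMbps: relayedMbps,
    egressMbps: relayedMbps,
    // Mbit/s * 3600 s = Mbit per hour; / 8 = MB; / 1000 = GB
    egressGbPerHour: (relayedMbps * 3600) / 8 / 1000,
  };
}

// Illustrative input: 200 streams at 2.5 Mbit/s with a quarter of them relayed
// gives 125 Mbit/s each way and 56.25 GB of egress per hour.
```

The model ignores protocol overhead and fan-out to multiple viewers, both of which push the real number higher, so it is best read as a lower bound.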

Running TURN yourself gives control. It also makes you responsible for the bill.

That trade-off was intentional.

Closing Thoughts

This architecture does not eliminate complexity. It contains it.

Signaling is explicit.
State is authoritative.
Media is separate from control.
Renegotiation is intentional.

WebRTC itself is stable.

Uncontrolled state and unpredictable networks are not.

FusionFlow is designed around that assumption.