Introduction
WebRTC stops feeling simple the moment real networks get involved.
In the previous article, I described the architecture of FusionFlow and why I separated media from control.
This is where that separation stops being architectural hygiene and starts being survival.
WebRTC feels easy at first. Two browsers exchange some blobs and suddenly there is video. It almost feels suspiciously simple.
It is not simple.
The difficult part is not rendering video. The difficult part is rendering video reliably when users are behind CGNAT, corporate VPNs, hotel WiFi, or networks that treat UDP like a personal insult.
I did not want heuristics and hope. I wanted explicit state and predictable behavior. That decision shaped everything that follows.
This second article focuses on what breaks once WebRTC leaves the happy path and starts dealing with real networks.
What this article covers
- Why SDP and ICE are only the starting point
- How FusionFlow keeps signaling explicit and backend-driven
- Why TURN becomes unavoidable in production
- Why bandwidth and network behavior become the real constraints
WebRTC Is Not a Protocol
WebRTC is a collection of mechanisms working together.
The connection starts with SDP: Session Description Protocol.
SDP is just structured text describing capabilities. Codecs. Encryption fingerprints. Media directions. Transport details. An "offer" is an SDP document. An "answer" is another SDP document responding to it.
A simplified fragment looks like this:
v=0
o=- 46117317 2 IN IP4 127.0.0.1
s=-
t=0 0
m=video 9 UDP/TLS/RTP/SAVPF 96
a=rtpmap:96 VP8/90000
a=ice-ufrag:abc123
a=ice-pwd:def456
a=fingerprint:sha-256 12:34:56:...
You don't need to memorize SDP. But you do need to respect that this blob defines the contract between peers.
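To make the "contract" idea concrete, here is a minimal sketch that pulls a few fields out of an SDP blob so they can be logged or compared between offer and answer. It is not a full SDP parser, and `SdpSummary` and `summarizeSdp` are hypothetical names chosen for illustration only.

```typescript
// A rough SDP summary, just enough to inspect what an offer or answer promises.
interface SdpSummary {
  mediaKinds: string[];          // from "m=" lines, e.g. "video"
  codecs: string[];              // from "a=rtpmap:" lines, e.g. "VP8/90000"
  iceUfrag?: string;             // from "a=ice-ufrag:"
  fingerprintAlgorithm?: string; // from "a=fingerprint:", e.g. "sha-256"
}

function summarizeSdp(sdp: string): SdpSummary {
  const summary: SdpSummary = { mediaKinds: [], codecs: [] };
  for (const rawLine of sdp.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line.startsWith("m=")) {
      // "m=video 9 UDP/TLS/RTP/SAVPF 96" -> "video"
      const kind = line.slice(2).split(" ")[0];
      if (kind) summary.mediaKinds.push(kind);
    } else if (line.startsWith("a=rtpmap:")) {
      // "a=rtpmap:96 VP8/90000" -> "VP8/90000"
      const codec = line.slice("a=rtpmap:".length).split(" ")[1];
      if (codec) summary.codecs.push(codec);
    } else if (line.startsWith("a=ice-ufrag:")) {
      summary.iceUfrag = line.slice("a=ice-ufrag:".length);
    } else if (line.startsWith("a=fingerprint:")) {
      // "a=fingerprint:sha-256 12:34:..." -> "sha-256"
      const algorithm = line.slice("a=fingerprint:".length).split(" ")[0];
      if (algorithm) summary.fingerprintAlgorithm = algorithm;
    }
  }
  return summary;
}
```

Run against the fragment above, this would report one video section, the VP8/90000 codec, the `abc123` ufrag, and a sha-256 fingerprint. Logging a summary like this on both sides is a cheap way to notice when an answer does not match what the offer proposed.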
After SDP comes ICE: Interactive Connectivity Establishment.
ICE is responsible for discovering viable network paths between peers. Each peer gathers multiple candidates and tests them until one works.
ICE Candidate Types (Clean Mental Model)
Instead of thinking about ICE as magic, think about it as a prioritized path search.
flowchart TD
A["Peer gathers candidates"]
H["Host candidate<br/>(local IP)"]
S["Server reflexive<br/>(via STUN)"]
R["Relay candidate<br/>(via TURN)"]
A --> H
A --> S
A --> R
ICE then pairs local and remote candidates and checks the pairs by priority, which works out to roughly this order:
flowchart LR
Host["Host (Local)"] --> Reflexive["Reflexive (STUN)"] --> Relay["Relay (TURN)"]
Host candidates are local addresses. They work beautifully on localhost or within the same LAN, and almost nowhere else.
Server reflexive candidates are discovered via STUN. They are the public address and port your peer appears to have from outside its NAT.
Relay candidates come from TURN. They are slower and more expensive, but they work.
On localhost, host candidates win.
On the real internet, relay candidates win more often than people expect.
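The "prioritized path search" comes from a formula in the ICE specification (RFC 8445): each candidate gets a numeric priority built from a type preference, a local preference, and the component ID. The sketch below uses the type preferences the RFC recommends. The function name and the defaults are my own, and real ICE agents rank candidate pairs rather than single candidates, so treat this as a mental model rather than a description of what a browser does internally.

```typescript
type CandidateType = "host" | "prflx" | "srflx" | "relay";

// Type preferences recommended by RFC 8445 (higher means preferred).
const TYPE_PREFERENCE: Record<CandidateType, number> = {
  host: 126,
  prflx: 110,
  srflx: 100,
  relay: 0,
};

// priority = 2^24 * typePreference + 2^8 * localPreference + (256 - componentId)
function candidatePriority(
  type: CandidateType,
  localPreference: number = 65535, // assumption: a single interface, maximum preference
  componentId: number = 1          // 1 = RTP
): number {
  return 16777216 * TYPE_PREFERENCE[type] + 256 * localPreference + (256 - componentId);
}

// Sort candidate types from most to least preferred.
function rankCandidateTypes(types: CandidateType[]): CandidateType[] {
  return [...types].sort((a, b) => candidatePriority(b) - candidatePriority(a));
}
```

With these defaults a host candidate scores 2130706431, a server reflexive candidate 1694498815, and a relay candidate 16777215, which is why relay is tried last and still ends up being the path that works on hostile networks.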
Signaling: Backend-Driven and Boring on Purpose
WebRTC does not define signaling. That is the application's responsibility.
In FusionFlow, signaling runs over STOMP via WebSocket. The backend is not just forwarding blobs. It maintains canonical room state and peer roles.
Only one peer generates the offer.
When the backend emits peer-joined, it includes an initiatorId. Only that client creates the offer. This prevents dual-offer chaos, race conditions, and what I call "SDP optimism".
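On the client, the single-initiator rule can be reduced to one comparison. The sketch below assumes a message shape for peer-joined: only `initiatorId` comes from the description above, while `roomId`, `peerId`, and the function name are hypothetical.

```typescript
// Assumed shape of the backend's peer-joined event.
// Only initiatorId is taken from the article; the other fields are illustrative.
interface PeerJoinedMessage {
  type: "peer-joined";
  roomId: string;      // hypothetical
  peerId: string;      // hypothetical: the peer that just joined
  initiatorId: string; // chosen by the backend
}

type OfferDecision = "create-offer" | "wait-for-offer";

// The client never guesses: it creates an offer only when the backend names it.
function decideOnPeerJoined(msg: PeerJoinedMessage, localPeerId: string): OfferDecision {
  return msg.initiatorId === localPeerId ? "create-offer" : "wait-for-offer";
}
```

Because both clients evaluate the same backend-chosen ID, at most one of them can return "create-offer" for a given event, which is the property that rules out dual offers.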
Here is the simplified signaling flow:
sequenceDiagram
participant Viewer
participant Streamer
participant Backend
Viewer->>Backend: /app/webrtc/join
Backend-->>Viewer: joined + room-state
Streamer->>Backend: /app/webrtc/join
Backend-->>Streamer: joined + room-state
Backend-->>Viewer: peer-joined (initiatorId)
Streamer->>Backend: /app/webrtc/signal (offer)
Backend-->>Viewer: signal (offer)
Viewer->>Backend: /app/webrtc/signal (answer)
Backend-->>Streamer: signal (answer)
Note over Viewer,Streamer: ICE candidates exchanged
Streamer->>Backend: /app/webrtc/media-state
Backend-->>Viewer: media-state
Signaling is explicit. Offers, answers, ICE candidates, and media-state updates are intentional. Nothing is inferred. Nothing is guessed.
Boring is good here.
Backend as the Source of Truth
Many WebRTC examples let the frontend infer too much.
If video does not appear, reset everything.
If tracks disappear, renegotiate blindly.
That works until it does not.
In FusionFlow, the backend is authoritative for:
- Room membership
- Peer roles
- Expected camera and microphone state
Whenever a peer toggles camera or mic, the frontend sends a media-state update. The backend updates its canonical state and broadcasts it.
The frontend compares two things:
- Expected state from the backend
- Actual received tracks
If the backend says video should be on but no tracks arrive, the client requests a replay instead of panicking.
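The comparison itself can be a small pure function. This is a sketch under assumptions: the type names, the `request-replay` action label, and the boolean representation of "tracks arrived" are all hypothetical, not FusionFlow's actual types.

```typescript
type StreamKind = "video" | "audio";

// What the backend says should be flowing.
interface ExpectedMedia { video: boolean; audio: boolean; }

// What the client actually sees: true if at least one live track of that kind arrived.
interface ReceivedMedia { video: boolean; audio: boolean; }

type ReconcileResult =
  | { action: "none" }
  | { action: "request-replay"; missing: StreamKind[] };

function reconcileMedia(expected: ExpectedMedia, received: ReceivedMedia): ReconcileResult {
  const kinds: StreamKind[] = ["video", "audio"];
  // Only expected-but-missing media triggers a replay.
  // Tracks that are present but not expected are ignored in this sketch.
  const missing = kinds.filter((kind) => expected[kind] && !received[kind]);
  return missing.length > 0 ? { action: "request-replay", missing } : { action: "none" };
}
```

Keeping the decision free of side effects makes it easy to call from both a track event handler and a media-state handler, and easy to test without a browser.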
stateDiagram-v2
[*] --> Connected
Connected --> LiveMedia: tracks received
LiveMedia --> Connected: media toggled off
Connected --> ReplayRequested: expected media missing
ReplayRequested --> Connected: replay succeeds
Renegotiation becomes state-driven instead of timer-driven.
It sounds small. It removes an entire category of instability.
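The state diagram above maps directly onto a transition table. The sketch below encodes exactly those four transitions. The event names are my own labels for the diagram's edges, and any event the diagram does not cover leaves the state unchanged, which is an assumption on my part.

```typescript
type MediaPhase = "Connected" | "LiveMedia" | "ReplayRequested";

type PhaseEvent =
  | "tracks-received"
  | "media-toggled-off"
  | "expected-media-missing"
  | "replay-succeeds";

// One entry per edge in the state diagram.
const TRANSITIONS: Record<MediaPhase, Partial<Record<PhaseEvent, MediaPhase>>> = {
  Connected: {
    "tracks-received": "LiveMedia",
    "expected-media-missing": "ReplayRequested",
  },
  LiveMedia: {
    "media-toggled-off": "Connected",
  },
  ReplayRequested: {
    "replay-succeeds": "Connected",
  },
};

// Events with no matching edge are ignored rather than treated as errors.
function nextPhase(phase: MediaPhase, phaseEvent: PhaseEvent): MediaPhase {
  return TRANSITIONS[phase][phaseEvent] ?? phase;
}
```

There is no timer anywhere in this table. A replay is requested only because a specific mismatch was observed, not because some interval elapsed.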
Failure Modes I Have Actually Seen
The most confusing production issues were not ICE failures.
They were state inconsistencies.
One scenario looked like this.
Peer A was absolutely convinced it was sharing video. Camera light on. Track active. No errors.
Peer B was receiving signaling correctly. ICE completed. Connection state was connected.
And yet, zero media frames arrived.
From the outside, everything looked healthy:
- Connection: connected
- ICE: completed
- Signaling: fine
- Media: silent
What actually happened was not a transport issue. It was state drift.
A signaling message was lost during a reconnect edge case. The backend believed media was enabled. The receiving peer never renegotiated the updated track parameters.
Both sides believed they were correct.
This is the worst category of real-time bugs. Nothing crashes. Nothing throws. Metrics look fine. Users just stare at a frozen avatar wondering why the other side is "not sharing."
That is precisely why media expectations must be reconciled explicitly against actual received tracks.
Once I enforced backend-driven media state reconciliation and replay requests, this entire class of ghost failures disappeared.
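One way to catch the "connected but silent" case is to compare decoded-frame counters over time against what the backend expects. The sketch below is a pure function over two samples. In a browser, the samples would come from the inbound-rtp entries of `RTCPeerConnection.getStats()`, which expose a `framesDecoded` counter. The function name, the sample shape, and the two-second window are assumptions for illustration.

```typescript
// A reduced view of an inbound video stats entry.
interface InboundVideoSample {
  timestampMs: number;
  framesDecoded: number;
}

type PeerConnectionState =
  | "new" | "connecting" | "connected" | "disconnected" | "failed" | "closed";

// True when everything claims to be healthy but no new frames were decoded.
function isGhostStream(
  expectedVideo: boolean,
  connectionState: PeerConnectionState,
  previous: InboundVideoSample,
  current: InboundVideoSample,
  minIntervalMs: number = 2000 // assumption: shorter windows are too noisy to judge
): boolean {
  if (!expectedVideo || connectionState !== "connected") return false;
  const elapsedMs = current.timestampMs - previous.timestampMs;
  if (elapsedMs < minIntervalMs) return false; // not enough time has passed
  return current.framesDecoded <= previous.framesDecoded;
}
```

A positive result here is a signal to request a replay, not proof of a transport failure. In the scenario above the transport was fine and only the state had drifted.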
These bugs do not show up in tutorials.
They show up when real users refresh at exactly the wrong moment.
ICE in the Real World
WebRTC is described as peer-to-peer.
Technically correct. Practically conditional.
Mobile carriers use CGNAT.
Corporate networks block UDP.
VPNs rewrite routes.
Some routers behave creatively.
Without TURN properly configured, a noticeable percentage of users simply cannot establish media.
The first time you see ICE complete successfully but receive zero frames, you gain a new appreciation for relay servers.
TURN: Where Theory Ends
TURN is where WebRTC becomes infrastructure.
Coturn must be configured correctly:
- External IP must match public IP
- TLS must be valid
- Realm must align with authentication
- Relay ports must be open
If TURN is misconfigured, signaling works. SDP exchange works. ICE appears successful.
But media never flows.
Everything looks fine. Nothing works. Those are the most educational bugs.
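A practical way to find out whether TURN really works is to build a peer configuration that allows only relay candidates, so nothing else can mask a broken relay. The sketch below produces an object in the shape that `RTCPeerConnection` accepts (`iceServers` and `iceTransportPolicy` are standard fields). The helper name, the local interfaces, and the host and credentials are placeholders. Ports 3478 and 5349 are the conventional STUN/TURN and TURN-over-TLS ports, which your deployment may override.

```typescript
// Local mirrors of the standard WebRTC configuration shape, so the sketch is self-contained.
interface IceServerConfig {
  urls: string | string[];
  username?: string;
  credential?: string;
}

interface PeerConfig {
  iceServers: IceServerConfig[];
  iceTransportPolicy: "all" | "relay";
}

function buildPeerConfig(
  turnHost: string,   // placeholder, e.g. "turn.example.com"
  username: string,
  credential: string,
  forceRelay: boolean = false
): PeerConfig {
  return {
    iceServers: [
      { urls: `stun:${turnHost}:3478` },
      {
        urls: [
          `turn:${turnHost}:3478?transport=udp`,
          `turn:${turnHost}:3478?transport=tcp`,
          `turns:${turnHost}:5349?transport=tcp`, // TLS, for networks that block plain TURN
        ],
        username,
        credential,
      },
    ],
    // "relay" drops host and server reflexive candidates, so media can only flow via TURN.
    iceTransportPolicy: forceRelay ? "relay" : "all",
  };
}
```

If a call connects with `forceRelay` set to `true`, the relay path is functional. If it only connects with `false`, the TURN setup is the first thing to check.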
In FusionFlow, TURN runs independently from the backend.
flowchart LR
Viewer -->|Media| TURN
Streamer -->|Media| TURN
Viewer -->|Signaling| Backend
Streamer -->|Signaling| Backend
Media and control remain separate. That keeps backend CPU predictable and prevents accidental bottlenecks.
Bandwidth Is the Real Scaling Constraint
Once TURN is involved, WebRTC is no longer purely peer-to-peer. Media is relayed.
Relay means bandwidth.
WebRTC is not expensive because of signaling. It becomes expensive because of relay bandwidth.
One HD stream relayed through TURN is manageable. Multiply that by concurrent sessions and it becomes infrastructure planning.
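The arithmetic behind "infrastructure planning" is simple enough to sketch. Every relayed stream enters the TURN server once and leaves it once, so relay traffic in each direction is roughly streams × bitrate × the fraction of streams that need a relay. The function names are hypothetical, and the numbers in the usage note are illustrative assumptions, not measurements from FusionFlow.

```typescript
interface RelayLoad {
  ingressMbps: number; // traffic arriving at the TURN server
  egressMbps: number;  // traffic leaving the TURN server
}

// Each relayed stream is received once and forwarded once by the relay.
function estimateRelayLoad(
  concurrentStreams: number,
  bitrateMbps: number,
  relayedFraction: number // 0..1, share of streams that fall back to TURN
): RelayLoad {
  const relayedMbps = concurrentStreams * relayedFraction * bitrateMbps;
  return { ingressMbps: relayedMbps, egressMbps: relayedMbps };
}

// Convert a sustained rate in megabits per second to gigabytes per hour (decimal units).
function gigabytesPerHour(mbps: number): number {
  const megabytesPerSecond = mbps / 8;
  return (megabytesPerSecond * 3600) / 1000;
}
```

As an illustration, 100 concurrent streams at an assumed 2.5 Mbps with an assumed 25% relayed comes to 62.5 Mbps in each direction, or about 28 GB of egress per hour. Egress is usually the number that shows up on the bill.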
Running TURN yourself gives control. It also makes you responsible for the bill.
That trade-off was intentional.
Closing Thoughts
This architecture does not eliminate complexity. It contains it.
Signaling is explicit.
State is authoritative.
Media is separate from control.
Renegotiation is intentional.
WebRTC itself is stable.
Uncontrolled state and unpredictable networks are not.
FusionFlow is designed around that assumption.
