
Building a Real-Time AI Video Pipeline with WebSockets

How to stream AI-generated video frames to the browser in real time — protocol choice, buffering strategy, and latency optimization.


Most AI video demos show a progress bar and a download link. Real applications need frames arriving in the browser as they’re generated — think live previews, interactive editing, and collaborative workflows where waiting 30 seconds for a full render kills the experience.

This guide covers the architecture for streaming AI-generated video frames over WebSockets, from server-side generation to client-side rendering.

Architecture Overview

The pipeline has three stages: generation, transport, and rendering. The generation server produces frames (either from a model running locally or by polling a remote API). The transport layer streams those frames to clients. The client assembles frames into a playable sequence.

Each stage has different latency characteristics, and the overall experience is only as fast as the slowest one. In practice, transport is rarely the bottleneck — generation speed and client-side decode are where you spend most of your optimization budget.

Protocol Choice

WebSockets vs SSE vs HTTP Polling

Server-Sent Events (SSE) work for text streams but hit limitations with binary data — you end up base64-encoding frames, which adds 33% overhead. HTTP polling introduces unnecessary latency. WebSockets give you bidirectional binary transport with no encoding overhead and built-in connection management.
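The ~33% figure is just base64 arithmetic — every 3 bytes of binary become 4 characters of text. A quick sanity check:

```typescript
// base64 encodes each 3-byte group of input as 4 output characters
// (with padding), so binary data grows by roughly a third when
// embedded in a JSON/SSE text stream.
function base64Length(binaryBytes: number): number {
  return 4 * Math.ceil(binaryBytes / 3);
}

// A 300 KB WebP frame becomes ~400 KB of text, before JSON framing overhead.
const frameBytes = 300_000;
const encoded = base64Length(frameBytes);
console.log(encoded); // 400000
```

At 24 frames per second, that extra third is pure wasted bandwidth — which is why binary WebSocket messages win for frame data.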

The tradeoff: WebSockets require more infrastructure (load balancer support, sticky sessions or connection state), but for frame-rate video data, nothing else comes close.

Server Setup

import { WebSocketServer } from "ws";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (ws) => {
  ws.on("message", async (data) => {
    const request = JSON.parse(data.toString());

    // Start generation and stream frames as they arrive
    const stream = generateVideoFrames(request);

    for await (const frame of stream) {
      if (ws.readyState === ws.OPEN) {
        // Send frame as binary — no base64 encoding
        ws.send(frame.buffer, { binary: true });
      }
    }

    // Signal completion (guard again — the client may have disconnected
    // while the last frames were streaming)
    if (ws.readyState === ws.OPEN) {
      ws.send(JSON.stringify({ type: "done" }));
    }
  });
});

The key detail: send frames as binary messages, not JSON with embedded base64. This cuts bandwidth by a third and eliminates encode/decode overhead on both sides.
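The `generateVideoFrames` call above is left abstract. One plausible shape is an async generator that yields encoded frames as they become available — sketched here with an injected `pollNextFrame` helper (a hypothetical function, not part of any real SDK) so the example stays self-contained:

```typescript
// Hypothetical shape for generateVideoFrames: an async generator that
// yields encoded frames until the source reports completion.
// pollNextFrame is an assumed helper that resolves with the next encoded
// frame, or null once generation is done (e.g. by polling a remote API).
async function* generateVideoFrames(
  pollNextFrame: () => Promise<Uint8Array | null>
): AsyncGenerator<Uint8Array> {
  while (true) {
    const frame = await pollNextFrame();
    if (frame === null) return; // generation finished
    yield frame; // hand each frame to the WebSocket loop as it arrives
  }
}
```

The async-generator shape is what makes the server's `for await` loop work: each iteration pulls exactly one frame, so polling stays serialized with sending instead of racing ahead of it.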

Client Integration

const ws = new WebSocket("wss://api.wavespeed.ai/stream");
ws.binaryType = "arraybuffer";

const frames: ImageBitmap[] = [];

let decodeChain = Promise.resolve();

ws.onmessage = (event) => {
  if (typeof event.data === "string") {
    const msg = JSON.parse(event.data);
    // Wait for in-flight decodes before starting playback
    if (msg.type === "done") decodeChain.then(() => startPlayback(frames));
    return;
  }

  // Chain decodes so frames stay in arrival order even when
  // individual decode times vary
  decodeChain = decodeChain.then(async () => {
    // Decode binary frame off the main thread
    const blob = new Blob([event.data], { type: "image/webp" });
    const bitmap = await createImageBitmap(blob);
    frames.push(bitmap);

    // Optionally show latest frame as preview
    drawPreview(bitmap);
  });
};

Use createImageBitmap instead of creating Image objects. It decodes off the main thread, which prevents frame drops when frames arrive faster than the browser can paint.

Buffering Strategy

Don’t start playback on the first frame. Buffer 5–10 frames before beginning, then play at a fixed interval while continuing to receive. This absorbs network jitter without adding noticeable delay.

If the buffer runs dry during playback, pause and show the last rendered frame rather than looping or showing a spinner. Users perceive a brief pause as “loading” — a spinner tells them something broke.
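The rules above — hold until a few frames have arrived, drain one per tick, pause rather than spin when dry — fit in a small class. A minimal sketch, with an illustrative threshold:

```typescript
// Minimal buffering sketch. MIN_BUFFER is illustrative; tune it against
// your actual generation rate and network jitter.
const MIN_BUFFER = 5;

class FrameBuffer<T> {
  private frames: T[] = [];
  private started = false;

  push(frame: T): void {
    this.frames.push(frame);
  }

  // Called once per playback interval. Returns the next frame to draw,
  // or null when playback should hold on the last rendered frame
  // (still buffering, or the buffer ran dry).
  nextFrame(): T | null {
    if (!this.started) {
      if (this.frames.length < MIN_BUFFER) return null; // still buffering
      this.started = true;
    }
    return this.frames.shift() ?? null; // null = buffer ran dry, pause
  }
}
```

Drive it from a `setInterval` at your target frame rate: when `nextFrame()` returns null after playback has started, simply don't redraw — the canvas keeps showing the last frame, which reads as a brief pause rather than a failure.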

Latency Optimization

Three changes that matter most in production:

  • Frame format: WebP at quality 75 hits the best size/quality tradeoff for streaming. JPEG is faster to encode but larger; PNG is lossless but too heavy.
  • Resolution scaling: stream at 480p for preview, generate the full-resolution version in the background. Most users can’t tell the difference during playback.
  • Connection reuse: keep WebSocket connections alive between generations. The handshake adds 100–300ms that you’ll notice on rapid iterations.

Production Considerations

WebSocket connections are stateful, which complicates horizontal scaling. Use a connection registry (Redis pub/sub works well) so any server instance can route frames to the right client. Health checks should ping the WebSocket endpoint, not just the HTTP server.
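The registry pattern can be sketched with the pub/sub layer behind an interface — in production that interface would be backed by Redis pub/sub (e.g. via a client like ioredis); the names here are illustrative:

```typescript
// Abstract pub/sub layer — Redis pub/sub in production, but kept as an
// interface here so the routing logic stands alone.
interface PubSub {
  publish(channel: string, message: Uint8Array): void;
  subscribe(channel: string, handler: (message: Uint8Array) => void): void;
}

// Each server instance subscribes to a channel per locally-connected client.
// Any instance can then publish frames for any client without knowing
// which instance holds its WebSocket.
class FrameRouter {
  constructor(private bus: PubSub) {}

  // Called by the instance that owns the client's WebSocket
  register(clientId: string, deliver: (frame: Uint8Array) => void): void {
    this.bus.subscribe(`frames:${clientId}`, deliver);
  }

  // Called by whichever instance produced (or received) the frame
  route(clientId: string, frame: Uint8Array): void {
    this.bus.publish(`frames:${clientId}`, frame);
  }
}
```

The `deliver` callback is where you'd call `ws.send(frame, { binary: true })` on the client's actual socket, plus cleanup (unsubscribe) on disconnect — omitted here for brevity.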

Memory management matters: each connected client accumulates frames in server memory until the generation completes. Set a maximum frame buffer per connection and apply backpressure (slow down generation) if the client can’t keep up.
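One way to express the "slow down generation" rule is hysteresis on the socket's send queue — `bufferedAmount` (available on both the browser WebSocket and the `ws` library) reports bytes queued but not yet flushed. The thresholds below are illustrative:

```typescript
// Backpressure with hysteresis: pause generation when the send queue
// grows past a high-water mark, and only resume once it has drained
// below a lower mark, so you don't flap between states.
// Thresholds are illustrative — tune to your frame size and rate.
const HIGH_WATER = 4 * 1024 * 1024; // 4 MB queued: stop generating
const LOW_WATER = 1 * 1024 * 1024;  // 1 MB queued: safe to resume

function shouldPauseGeneration(bufferedAmount: number, paused: boolean): boolean {
  if (paused) return bufferedAmount > LOW_WATER; // stay paused until drained
  return bufferedAmount > HIGH_WATER;            // pause above high-water
}
```

In the server's frame loop, check `shouldPauseGeneration(ws.bufferedAmount, paused)` before each send and skip or delay polling the generator while it returns true.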

Conclusion

Real-time video streaming isn’t fundamentally hard — it’s just different from the request/response model most AI integrations use. The WebSocket layer is straightforward; the work is in buffering, backpressure, and making the experience feel smooth even when the network isn’t.

Start with the simplest version — binary frames over a WebSocket, a small buffer, and canvas rendering — and add complexity only when your metrics tell you to.
