BoxCast Team • January 22, 2021
It happens about once a week: Someone starts their very first stream with a camera pointed at themselves. While watching it on their computer or tablet, they’re surprised to discover they’re seeing themselves from about 30 seconds ago, and they call us to ask why their stream is so delayed.
Many streaming newcomers are used to tools like Zoom or FaceTime that allow them to collaborate with others in real time and feel a lot like talking on the phone or in person. So why is streaming different?
In this post, we’ll answer this question and provide a deeper understanding of live video streaming and how it works.
First and foremost, it’s important to distinguish web streaming from conferencing tools.
Tools like BoxCast, Facebook Live, and YouTube Live fall primarily into the former camp, while tools like FaceTime, Skype, and Zoom fall into the latter.
The primary difference in the design of these tools is whether the content is primarily meant to be a broadcast (a small number of presenters to a potentially large number of viewers in a one-way fashion) or a two-way collaboration among a limited number of participants.
Although this distinction may seem trivial, it becomes very important when the number of participants or viewers scales to a large number. Keeping the delay between participants low enough for collaboration requires tightly coupled computing services — and tightly coupled services don't scale to large numbers of participants.
For the remainder of this post, we’ll focus on streaming services, which are meant to be a means of broadcasting content to a large, globally distributed audience.
Here are some streaming terms you might not be familiar with. We've defined them here so you can reference them as you go:
Latency: The delay between the moment something happens in front of the camera and the moment a viewer sees it on their screen.
Video Distribution Service (VDS): The cloud service that receives your encoded stream and distributes it to viewers. BoxCast, Facebook Live, and YouTube Live are examples.
Adaptive Bitrate (ABR): Delivering the same content at multiple quality levels so each viewer's player can switch to the level best suited to their device and connection.
Transcoding and transrating: Converting an incoming stream into other formats (transcoding) and other bitrates or resolutions (transrating) to produce those quality levels.
Segment: A short chunk of the media stream, typically a few seconds long, used by HTTP-based delivery protocols.
If your viewers aren't physically attending your live event, latency may actually not be that important. Whether two seconds or two minutes, if a viewer isn't present in person, they’ll be blissfully unaware that there's any latency at all.
Sometimes, though, latency is an issue. For example, live attendees might be tweeting updates, or you may be providing live score and stat info for a sporting event. If your latency is too long, viewers may read about something before they see and hear it happen, which is not ideal. So we should try to keep the latency as low as possible.
You can learn more on our blog about how different compression standards, like AVC and HEVC, affect video quality, compression efficiency, and latency.
Let’s look at how a typical live streaming system works and examine how latency is introduced at each step:
Whether you’re using a single camera or a sophisticated video mixing system, taking a live image and turning it into digital signals takes some time. At minimum, it takes the duration of a single captured video frame (1/30th of a second at a 30fps frame rate).
More advanced systems, such as video mixers, will introduce additional latency for decoding, processing, re-encoding, and retransmitting. Your video capture and processing requirements will determine this value.
Minimum: About 33 milliseconds
Maximum: Hundreds of milliseconds
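To put that frame-duration floor in concrete terms, here's a quick calculation (Python, purely illustrative):

```python
# The capture stage can't hand off a picture until at least one full frame
# interval has elapsed, so the frame rate sets a hard floor on this latency.
for fps in (24, 30, 60):
    print(f"{fps} fps -> one frame every {1000 / fps:.1f} ms")
```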
When encoding in software (on a PC or Mac) or using a hardware encoder (like a BoxCaster, Teradek, etc.), it takes time to convert the raw image signal into a compressed format suitable for transmission across the internet. This latency can range from extremely low (thousandths of a second) to values closer to the duration of a video frame. Changing encoding parameters can lower this value at the expense of encoded video quality.
Minimum: About 1 millisecond
Maximum: About 40–50 milliseconds
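As an illustration of that trade-off, here's a hypothetical low-latency encoder invocation. This sketch assumes ffmpeg with the x264 software encoder; the ingest URL and test source are placeholders, and none of this is BoxCast-specific:

```python
# Sketch: launching ffmpeg with settings that favor low encoding delay.
# "zerolatency" disables lookahead and B-frames, which shortens the encoder's
# internal pipeline at the cost of some compression efficiency (quality per bit).
import subprocess

INGEST_URL = "rtmp://ingest.example.com/live/your-stream-key"  # placeholder

subprocess.run([
    "ffmpeg",
    "-f", "lavfi", "-i", "testsrc=size=1280x720:rate=30",  # synthetic test video
    "-c:v", "libx264",
    "-preset", "veryfast",   # faster presets start emitting frames sooner
    "-tune", "zerolatency",  # minimal internal buffering inside the encoder
    "-b:v", "3M",
    "-t", "10",              # stream for 10 seconds in this demo
    "-f", "flv", INGEST_URL,
])
```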
The encoded video takes time to transmit over the internet to a VDS. This latency is affected by the encoded media bitrate (lower bitrate usually means lower latency), the latency and bandwidth of the internet connection, and the proximity (over the internet) to the VDS.
Minimum: About 5–10 milliseconds
Maximum: Hundreds of milliseconds
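One way to see why bitrate matters: the closer your encoded bitrate is to your available upload bandwidth, the longer each second of video takes to leave the encoder, and the more delay you accumulate. A rough sketch, with made-up numbers:

```python
# How long it takes to push one second's worth of encoded video up to the VDS,
# ignoring protocol overhead. If this approaches 1.0, queues build and latency grows.
def seconds_to_upload_one_second(video_bitrate_bps: float, uplink_bps: float) -> float:
    return video_bitrate_bps / uplink_bps

print(seconds_to_upload_one_second(4_000_000, 10_000_000))  # 0.4 -> plenty of headroom
print(seconds_to_upload_one_second(4_000_000, 4_500_000))   # ~0.89 -> little headroom
```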
Since the internet is a massively connected series of digital communication routes, the encoded video data may take one of many different routes to the VDS, and this route may change over time. Because these routes take different amounts of time to traverse (and the data may be queued anywhere along the route), it may arrive at the VDS out of order. A special software component called a jitter buffer reorders the arriving data so it can be properly decoded.
When configuring the jitter buffer, you must choose a maximum window of time within which data can be reordered. That window determines the jitter buffer's latency: the lower you set it, the greater the risk of discarding data that arrives late, while a higher setting recovers more late data at the cost of added delay (see the sketch below).
Minimum: Typically no less than 100 milliseconds
Maximum: Several seconds
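To make the reordering idea concrete, here's a minimal jitter-buffer sketch in Python. It's purely illustrative (not BoxCast's implementation): packets are held for at most max_delay seconds and released in sequence order, and anything arriving later than that window is skipped.

```python
import heapq
import time

class JitterBuffer:
    """Hold out-of-order packets briefly and release them in sequence order."""

    def __init__(self, max_delay: float = 0.2):
        self.max_delay = max_delay   # reorder window, in seconds
        self._heap = []              # entries: (sequence_number, arrival_time, payload)
        self._next_seq = 0

    def push(self, seq: int, payload: bytes) -> None:
        if seq < self._next_seq:
            return                   # arrived too late; we've already moved past it
        heapq.heappush(self._heap, (seq, time.monotonic(), payload))

    def pop_ready(self):
        """Yield payloads that are next in order, or that have waited out the window."""
        now = time.monotonic()
        while self._heap:
            seq, arrived, payload = self._heap[0]
            if seq == self._next_seq or now - arrived >= self.max_delay:
                heapq.heappop(self._heap)
                self._next_seq = seq + 1   # skip any gap we've given up waiting for
                yield payload
            else:
                break
```

The max_delay value is exactly the dial described above: raise it and more straggling packets make it into the decoded stream; lower it and the stream moves sooner but drops more late data.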
Your viewers watch from many kinds of devices (PCs, Macs, tablets, phones, TVs, and set-top boxes) over many types of networks (LAN/Wi-Fi, 4G LTE, 3G, etc.). To deliver a quality viewing experience across that range of devices and networks, a good streaming provider should offer ABR.
There are two general ways to accomplish this: Either the encoder streams multiple quality levels to the VDS (which are directly relayed to viewers), or the encoder sends a single high-quality stream to the VDS, which then transcodes and transrates it to multiple levels. Typically, the transcoding and transrating take about as long as a segment of encoded video (more about segments later), but they can be faster at smaller resolutions and lower bitrates.
Minimum: About 1 second
Maximum: About 10 seconds
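For a sense of what those multiple levels look like, here's an illustrative rendition ladder (all values are made up; real ladders depend on the provider and the source):

```python
# Each rendition is the same content, transcoded/transrated to a different
# resolution and bitrate, so players can step up or down as conditions change.
ABR_LADDER = [
    {"name": "1080p", "resolution": (1920, 1080), "video_kbps": 4500},
    {"name": "720p",  "resolution": (1280, 720),  "video_kbps": 2500},
    {"name": "480p",  "resolution": (854, 480),   "video_kbps": 1200},
    {"name": "360p",  "resolution": (640, 360),   "video_kbps": 600},
]
```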
There are two categories of protocols for viewing live video content: non-HTTP-based and HTTP-based. The two differ in their latency and scalability. Understanding these differences is integral to choosing a streaming solution.
Non-HTTP-based protocols (such as RTSP and RTMP) use a combination of TCP and UDP communications to send media to viewers. They can potentially achieve very low latency (as low as the network latency from the VDS to the viewer); however, their support for adaptive streaming is spotty at best. Furthermore, scaling these protocols to large numbers of viewers becomes very difficult and expensive.
HTTP-based protocols (such as HLS, HDS, MSS, and MPEG-DASH) are designed to take advantage of standard web servers and content distribution networks, which scale to many (thousands to millions of) simultaneous users. They also have built-in support for adaptive playback and broader native support on mobile devices.
These HTTP-based protocols work by breaking the continuous media stream into segments that are typically 2–10 seconds long. These segments can then be served to viewers by a standard web server or content distribution network.
HTTP-based protocols are generally better suited to most live streaming scenarios due to better feature support and scalability. The disadvantage of these protocols is that the latency is at least as long as the segment length, and can be as bad as 3–4 times the segment length (for example, iOS devices buffer 3–4 segments before even beginning to play the video).
Minimum (for non-HTTP-based protocols): About 5–10 milliseconds
Minimum (for HTTP-based protocols): About 2 seconds
Maximum (for HTTP-based protocols): About 30–40 seconds
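Because players typically buffer several full segments before starting playback, segment duration largely determines this step's latency. A rough back-of-the-envelope, assuming (as noted above) a player that buffers three segments:

```python
# Player-side latency for segmented HTTP delivery (HLS, DASH, etc.) scales with
# segment duration times the number of segments the player buffers before playing.
def segmented_latency_seconds(segment_seconds: float, buffered_segments: int = 3) -> float:
    return segment_seconds * buffered_segments

print(segmented_latency_seconds(2))    # 6 seconds with 2-second segments
print(segmented_latency_seconds(10))   # 30 seconds with 10-second segments
```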
Whether viewing on a phone, a computer, or a TV, it takes time to decompress the media data and render it on the screen. In the best case, this can be as low as a single frame duration (1/30th of a second at 30fps), but typical values are 2–5 times the duration of a video frame. This latency is determined by the capabilities of the viewing device.
Minimum: About 33 milliseconds
Maximum: Hundreds of milliseconds
A streaming solution that uses non-HTTP-based protocols can achieve lower latency. Per the estimates above, latency will likely fall in the range of about 1.2–17 seconds; realistically, it will typically be about 5–10 seconds. However, this solution will not scale well beyond about 50–100 simultaneous viewers.
A streaming solution that uses HTTP-based adaptive bitrate mechanisms will have a slightly higher latency range (about 3.2–56 seconds). Realistically, it will usually be in the 15–45 second range. Since this approach uses HTTP-based mechanisms that can leverage off-the-shelf CDNs, it can theoretically support a very large number of simultaneous viewers without difficulty.
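For the curious, here's how those end-to-end ranges fall out of the per-stage estimates in this post. Where the post says "hundreds of milliseconds" we assume roughly 0.5 seconds, and for the jitter buffer's "several seconds" we assume roughly 5; these are illustrative stand-ins, not measurements:

```python
# Per-stage (min_seconds, max_seconds) estimates for an HTTP-based ABR pipeline,
# taken from the figures quoted earlier in this post.
STAGES = {
    "capture":           (0.033, 0.5),
    "encode":            (0.001, 0.05),
    "send_to_vds":       (0.005, 0.5),
    "jitter_buffer":     (0.1,   5.0),
    "transcode_abr":     (1.0,   10.0),
    "http_delivery":     (2.0,   40.0),
    "decode_and_render": (0.033, 0.5),
}

total_min = sum(lo for lo, _ in STAGES.values())
total_max = sum(hi for _, hi in STAGES.values())
print(f"{total_min:.1f} to {total_max:.1f} seconds")  # roughly 3.2 to 56, matching the range above
```

Substituting the non-HTTP delivery figure (about 5–10 milliseconds instead of whole segments) into the same sum lands near the roughly 1.2-second lower bound quoted for that approach.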
Some attributes of your total latency may be within your control. Your encoder settings, the jitter buffer, the transcoding and transrating profiles, and segment duration may be configurable. Keep in mind, though: While lower latency may sound ideal, change these settings with caution and test thoroughly, as each choice can bring other negative consequences.
At BoxCast, we take great pains to automate as many of these choices as possible to maximize your stream quality and ensure a delightful viewing experience.
In addition to automating these choices, we make it possible for you to broadcast high-quality video, even when you're set up in less-than-ideal networking conditions. Learn more about how you can enhance your streaming experience with BoxCast Flow Control, which lets you deliver high-quality content and adjust latency.
BoxCast automates your streaming experience so your viewers get the best quality possible. To learn more about how we protect your live streams, check out BoxCast Flow Vs. RTMP: A Comparison of Streaming Protocols.
Happy streaming!