EE Times-Asia

Achieve proper audio-video synchronization in digital broadcast

Posted: 01 Jul 2007

Keywords: A/V synchronization, digital broadcast, audio-video

By Ke Ning, Gabby Yi and Rick Gentile
Analog Devices Inc.

Technologies such as DTV, DVD, Direct Broadcast Satellite (DBS) and Digital Cable use compression techniques to deliver extremely high-quality programming to consumers. With the introduction of advanced video/audio compression standards to transmit audio and video content, there is an increased awareness of the timing relationship between audio and video.

One of the overarching goals of a digital broadcasting system is to deliver audio and video to the viewer in proper synchronization. This is a challenge because each digital audio and video component in the chain, from production to reception, imposes some degree of latency on the signals passing through it. The delays imposed on the audio and video signals are typically unequal, so each component has the potential to cause an audio-video synchronization error at its output.

Audio and video streams must be decoded and presented in tandem to ensure that "lip synchronization" is maintained. The audio and video decoders can be implemented on independent processors or, if the processing capability is available, integrated into a single processor. In either case, some form of A/V synchronization must be implemented to keep them in lockstep. Failure to keep the audio and video decoding streams synchronized results in audible effects occurring before or after they should, relative to the associated video frame. This is not acceptable from a viewer's perspective.

Media path and delay
A typical media-based digital system is built up as a set of encoders and decoders, as shown in Figure 1. The system encoder performs the compression function, and the system decoder provides decompression for viewing purposes.

Media data is transmitted between the encoder system and the decoder system. This data contains both an audio and a video component. Because the audio and video components have inherently different characteristics, the synchronization approach differs between the two subsystems.

Figure 1: Typical Steps in an encoding/decoding application

In general, audio operations are very low latency and are not very computationally intensive. Typical functions such as compression, equalization and mixing can be accomplished in under 1ms in the digital domain.

Generally, no compensation for the audio delay needs to be added because the latency through the system is so low. Video processing, on the other hand, takes significantly more time and computational power. Its latency is more likely to be measured in units of at least one video frame.

As is the case with audio, any time a video signal is digitized and buffered, processing operations on that signal will take longer than if the operation were done in the analog domain. Because most video effects can't be performed in the analog domain, digital processing is required and thus some "system" delay is inevitable.

It is interesting to note how processing audio and video signals impacts the latency of the overall system. Video processing takes longer than audio processing because there is more data to process and the processing requirements are greater. This increase in processing with respect to the audio processing in turn ensures the video signals will be delayed with respect to the audio signals. As a result, audio and video components are "out of sync" from the beginning of the encode process.

Industry standards
Compensation must be provided for any video device that has a delay in excess of a few milliseconds; otherwise the offset will be noticeable to the viewer and the perceived quality of delivery will suffer. To prevent this, an equal amount of delay should be applied to the audio path. The ITU has made a further recommendation as part of ITU-R BT.1377. Specifically, the recommendation is that audio and video frames be "labeled" to indicate processing delay.
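The audio-path compensation described above amounts to a fixed delay line. The following Python sketch illustrates the idea under stated assumptions; it is not taken from any particular product. The class name AudioDelayLine, the 48 kHz default sample rate, and the choice to pre-fill the line with silence are all illustrative.

```python
from collections import deque


class AudioDelayLine:
    """Delays an audio sample stream by a fixed time (illustrative sketch)."""

    def __init__(self, delay_ms, sample_rate_hz=48000):
        # Convert the video latency to be matched into a whole number of samples.
        self.delay_samples = round(delay_ms * sample_rate_hz / 1000)
        # Pre-fill with silence so the first outputs are zeros.
        self.fifo = deque([0] * self.delay_samples)

    def process(self, samples):
        """Push new samples in and return the same number of delayed samples."""
        out = []
        for s in samples:
            self.fifo.append(s)
            out.append(self.fifo.popleft())
        return out
```

For example, matching two frames of video latency at 30 frames/s (about 66.7ms) at 48 kHz would require a delay of roughly 3,200 samples per channel.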

Human perception is much more forgiving of sound lagging behind sight, because this is what we are used to experiencing in everyday life. The International Telecommunication Union (ITU) released ITU-R BT.1359-1 in 1998. It was based on research showing that the thresholds for reliable detection of A/V sync errors were 45ms with audio leading video and 125ms with audio lagging behind video. Those figures apply to detection only; the acceptability region, and therefore the recommended maximum, is quite a bit wider. In summary, the recommendation states that the tolerance from the point of capture to the viewer and/or listener shall be no more than 90ms (audio leading video) and 185ms (audio lagging behind video). In reality, this range is probably too wide for truly acceptable performance, and tighter tolerances are generally observed.

With the growth of digital media broadcasting, the tolerances in ITU-R BT.1359-1 came to be considered inadequate for audio and video synchronization in DTV broadcasting. The Advanced Television Systems Committee (ATSC) is an industry organization for DTV standards. Its Implementation Subcommittee investigated the issue and published a finding, IS/191, "Relative Timing of Sound and Vision for Broadcast Operations." The subcommittee finds that under all operational situations, at the inputs to the DTV encoding devices, the sound program should be tightly synchronized to the video program. On this basis, it recommends that the sound program should never lead the video program by more than 15ms and should never lag the video program by more than 45ms.
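The limits quoted in this section can be collected into a small lookup for checking a measured offset. The Python sketch below uses only the figures given in the text; the function names, the dictionary keys, and the sign convention (positive means audio leads video) are assumptions made for illustration.

```python
# (maximum audio lead, maximum audio lag) in milliseconds, as quoted in the text.
LIMITS_MS = {
    "itu_detect": (45, 125),   # ITU-R BT.1359-1 detectability thresholds
    "itu_accept": (90, 185),   # ITU-R BT.1359-1 acceptability limits
    "atsc_is191": (15, 45),    # ATSC IS/191 recommendation
}


def within_limits(skew_ms, limit_name):
    """Return True if skew_ms falls inside the named window.

    skew_ms > 0 means audio leads video; skew_ms < 0 means audio lags.
    """
    max_lead, max_lag = LIMITS_MS[limit_name]
    return -max_lag <= skew_ms <= max_lead


def tightest_limit_met(skew_ms):
    """Return the name of the strictest window that skew_ms satisfies, or None."""
    for name in ("atsc_is191", "itu_detect", "itu_accept"):
        if within_limits(skew_ms, name):
            return name
    return None
```

The asymmetry of every window reflects the point made earlier: lagging audio is tolerated more readily than leading audio.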

A/V Sync in the encoder system
The following figure is a simplified block diagram of a typical set of digital media encoding and processing units. It includes video and audio capture devices, which sample information from the real world and convert it into the digital domain. Intermediate buffers for both video and audio data are needed immediately after the capture devices. The central parts are the video encoder and audio encoder units, which encode the digital data into compressed form. The multiplexer is responsible for muxing the information from both channels and sending it to the transmission channel.

A/V Sync in encoder system

Video and audio encoding each take a certain amount of time to complete for a given frame size, and the multiplexer must know exactly how long that is. This delay depends on the manufacturer of the equipment, but the value is crucial for assigning the presentation timestamp (PTS) values correctly. Many of the A/V sync problems encountered can be attributed to these delays not being properly accounted for, or not being set at all. The overall audio-video synchronization error is the algebraic sum of the individual synchronization errors encountered in the chain. If properly calculated, the adjusted PTS value should compensate for the delay before the video/audio data reaches the video/audio processing units.
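The "algebraic sum" of per-stage errors can be illustrated with a few lines of Python. This is a sketch only: the stage names and delay values below are invented for illustration and do not describe any real equipment, and the sign convention (positive means audio leads video) is an assumption.

```python
def net_av_skew_ms(stages):
    """Sum the per-stage sync errors; a positive result means audio leads video.

    Each stage is (name, audio_delay_ms, video_delay_ms). A stage that delays
    video more than audio pushes audio ahead, so its error is video - audio.
    """
    return sum(video_ms - audio_ms for _name, audio_ms, video_ms in stages)


# Hypothetical chain; every number here is made up for the example.
chain = [
    ("capture", 1, 17),
    ("encode", 5, 100),
    ("decode", 3, 50),
    ("audio delay compensation", 150, 0),
]

print(net_av_skew_ms(chain))  # prints 8
```

Because the errors carry signs, a deliberate audio delay in one stage can cancel video latency accumulated in others, which is why each stage's delay needs to be known rather than guessed.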

In many cases, the Society of Motion Picture and Television Engineers (SMPTE) timestamp can be applied to the audio and video encoders and can be used by the multiplexer to calculate exact PTS values, thereby removing encoder delay as a source of error.

The MPEG specification provides the tools needed to achieve correct A/V synchronization. Each audio and video frame has a PTS that allows the decoder to reconstruct the sound and pictures in sync. These PTS values are assigned by the multiplexer in the MPEG encoder. The decoder receives the audio and video data ahead of the times indicated by the PTS values and can therefore use these values to present audio and video in sync.
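As a rough illustration of how PTS values relate to frame timing, the Python sketch below computes the PTS of the n-th frame of a stream. It relies on two facts from the MPEG-2 Systems specification that the article does not state: PTS values count in units of a 90 kHz clock and are 33 bits wide. The function name frame_pts and its parameters are hypothetical, and a real multiplexer would also add the encoder and buffer delays discussed above.

```python
from fractions import Fraction

PTS_CLOCK_HZ = 90_000      # PTS resolution defined by MPEG-2 Systems
PTS_MODULUS = 1 << 33      # PTS is a 33-bit counter and wraps around


def frame_pts(frame_index, frames_per_second, first_pts=0):
    """Return the PTS of frame number frame_index (0-based).

    frames_per_second may be an int or a Fraction such as
    Fraction(30000, 1001) for 29.97 frames/s video.
    """
    ticks = Fraction(frame_index * PTS_CLOCK_HZ) / Fraction(frames_per_second)
    return (first_pts + round(ticks)) % PTS_MODULUS
```

With this convention one frame of 30 frames/s video spans 3,000 ticks, and one AC-3 frame of 1,536 samples at 48 kHz (31.25 frames/s) spans 2,880 ticks, so audio and video PTS values advance at different step sizes on the same timeline.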

For this to work, it is imperative that audio and video adhere to the Dolby Digital (AC-3) and MPEG-2 specifications. The MPEG-2 specification models the end-to-end delay from an encoder's signal input to a decoder's signal output as constant. This end-to-end delay is the sum of the delays for the encoding, encoder buffering, multiplexing, transmission, de-multiplexing, decoder buffering, decoding, and presentation.

Presentation time stamps are required in the MPEG bit stream at intervals not exceeding 700ms. The MPEG System Target Decoder model allows a maximum decoder buffer delay of one second. Audio and video presentation units that represent sound and pictures that are to be presented simultaneously may be separated in time within the transport stream by as much as one second. In order to produce a synchronized output, the receiver must recover the encoder's System Time Clock (STC) and use the Presentation Time Stamps (PTS) to present the audio-video content to the viewer with a tolerance of 15ms of the time indicated by PTS.
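The presentation rule above can be sketched as a comparison between a unit's PTS and the recovered STC. The Python below is a minimal illustration, not a decoder implementation. It assumes both values are expressed in 90 kHz ticks and wrap at 33 bits (per MPEG-2 Systems); the function names and the three-way "wait", "present", "late" result are hypothetical.

```python
PTS_CLOCK_HZ = 90_000
PTS_MODULUS = 1 << 33
HALF_RANGE = PTS_MODULUS // 2


def ticks_until_due(pts, stc):
    """Signed distance from STC to PTS in ticks, tolerant of 33-bit wraparound."""
    return ((pts - stc + HALF_RANGE) % PTS_MODULUS) - HALF_RANGE


def presentation_action(pts, stc, tolerance_ms=15):
    """Decide what to do with a decoded unit given the current STC."""
    tolerance = tolerance_ms * PTS_CLOCK_HZ // 1000  # 15ms is 1,350 ticks
    delta = ticks_until_due(pts, stc)
    if delta > tolerance:
        return "wait"      # too early: hold the unit in its buffer
    if delta < -tolerance:
        return "late"      # missed its slot: caller must decide how to recover
    return "present"       # within tolerance of the time indicated by PTS
```

How a "late" unit is handled (repeat, skip, or resynchronize) is a design decision this sketch deliberately leaves to the caller.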

Audio decoding is considered the more critical of the two, because any discontinuities in the audio are readily apparent, whereas video is less sensitive (e.g. a dropped video frame is less noticeable to a viewer). The video decoder can also be adjusted more easily than the audio decoder because a video field is emitted only once every 60th of a second (approximately 16.7ms). Even so, the adjustment must never result in a skipped or dropped video field. It is recommended that the audio and video STC values be synchronized at least once per second.

Because the data streams are broadcast from the data source (such as the digital broadcast satellite head-end) without any form of flow control, the decoder must be adjusted to avoid accumulated clock error. Typically less than a second's worth of MPEG buffers will be in the system. This tight decoding tolerance requires that the decoder crystal be continually adjusted to compensate for any rate inaccuracies. In the push-model case, MPEG data also contains periodic system clock reference time stamps (SCRs) that provide clues about whether to increase or decrease the rate at which the system time clock (STC) advances. The rate of the STC determines the rate of decoding. If the decoder runs out of data or fills all of its input buffers, there will be visible decoding discontinuities as the picture decoding halts or incoming data is discarded. Both of these cases are catastrophic.
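One way to picture the rate adjustment described above is as a smoothed estimate of how fast the sender's clock runs relative to the local one. The Python sketch below is an assumption-laden illustration, not the method of any particular decoder: the class name ClockRecovery, the first-order smoothing, and the gain value are all hypothetical, and timestamp wraparound is ignored for brevity.

```python
class ClockRecovery:
    """Tracks the ratio of the sender's clock rate to the local clock rate."""

    def __init__(self, gain=0.1):
        self.gain = gain      # smoothing factor between 0 and 1 (assumed value)
        self.rate = 1.0       # estimated sender ticks per local tick
        self._prev = None     # previous (reference, local) timestamp pair

    def update(self, reference_ticks, local_ticks):
        """Feed one clock reference and the local time at which it arrived."""
        if self._prev is not None:
            d_ref = reference_ticks - self._prev[0]
            d_local = local_ticks - self._prev[1]
            if d_local > 0:
                measured = d_ref / d_local
                # Move a fraction of the way toward the new measurement so
                # that arrival jitter does not cause abrupt rate changes.
                self.rate += self.gain * (measured - self.rate)
        self._prev = (reference_ticks, local_ticks)
        return self.rate
```

A rate above 1.0 suggests the local clock is running slow and should be sped up; below 1.0, the reverse. In hardware this estimate would typically drive the oscillator control rather than a software multiplier.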

In our decoder example, audio buffers and video frames are created in external memory. As these output buffers are written to larger external memory such as SDRAM or DDR, the timestamp from the encoded stream is "associated" with each buffer and frame. In addition, the processor needs to keep track of its own time base. Before each decoded video frame and audio buffer is sent out for display, the processor performs a time check and finds the appropriate data match from each buffer. There are multiple ways to accomplish this task with a DMA controller, but the best way is to have the descriptors already built up and then, depending on which packet time matches the current processor time, adjust the DMA write pointer to the appropriate descriptor.
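The descriptor matching step can be modeled in a few lines. The Python sketch below only simulates the selection logic; it does not program a real DMA controller. The Descriptor fields, the function name select_descriptor, and the tolerance parameter are hypothetical, and timestamps are assumed to share one tick unit with the processor's time base.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    pts: int          # timestamp associated with the decoded buffer, in ticks
    buffer_addr: int  # hypothetical address of the buffer in external memory


def select_descriptor(ring, now, tolerance):
    """Return the index of the descriptor whose PTS best matches `now`.

    Returns None if no descriptor lies within `tolerance` ticks, in which
    case the caller might repeat the previous buffer instead.
    """
    best = None
    for i, desc in enumerate(ring):
        err = abs(desc.pts - now)
        if err <= tolerance and (best is None or err < abs(ring[best].pts - now)):
            best = i
    return best
```

In a real system the returned index would be used to point the DMA engine at the pre-built descriptor, so that no descriptor has to be constructed at presentation time.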

Audio and video synchronization is a critical component of a multimedia system. The digital audio-video production, distribution, and broadcast system is a complex array of digital processing, compression, decompression, and transmission. Each component in the system imposes latency on the audio and/or video signals flowing through it. The delays imposed on the audio and video signals are often unequal, and these unequal delays compromise audio-video synchronization. Steps must be taken to ensure that the audio and video signals delivered at the output of each stage are synchronized within a tight tolerance in order to comply with industry standards.

About the authors
Ke Ning, Gabby Yi and Rick Gentile are senior applications engineers at Analog Devices Inc.
