Newsgroups: comp.graphics,comp.graphics.animation,comp.compression,comp.multimedia,alt.binaries.multimedia,alt.binaries.pictures.utilities,alt.binaries.pictures,alt.binaries.pictures.d,alt.answers,comp.answers,news.answers
Subject: MPEG-FAQ: multimedia compression [ /9]
Followup-To: alt.binaries.multimedia
Reply-To: mpegfaq@powerweb.de
Keywords: MPEG, FAQ, Compression
Expires: 31 Dec 1996 12:00:00 GMT
Summary: This is the summary about the ISO video and audio formats MPEG 1, 2 and 4
Approved: news-answers-request@MIT.EDU

Archive-name: mpeg-faq/part0
Last-modified: 1996/06/02
Version: v 4.1 96/06/02
Posting-Frequency: bimonthly

===========================================================================
~Subject: SECTION 0. - INTRO

        ====================================================
        THE MPEG-FAQ            [Version 4.1 - 1. June 1996]
        ====================================================

PHADE Software
Inh. Dipl-Inform. Frank Gadegast
Leibnizstr. 30
10625 Berlin, GERMANY

Fon/Fax    ++ 49 30 3128103
E-mail     phade@powerweb.de
Web site   http://www.powerweb.de/mpeg

This is the eighth publication of this file. Lots of information has changed (which has surely brought errors with it, see Murphy's Law). This eighth compilation is very different from the previous one, Version 4.0.

First:
The location of this file is:
   Text-Version : URL: ftp://ftp.powerweb.de/mpeg/faq/mpegfa41.zip [194.77.15.46]
   HTML-Version : URL: http://www.powerweb.de/mpeg/faq/
My MPEG-related software and my DOS-ports of several programs can be found there too.

Second:
"The Internet MPEG Audio Archive" is there! Our brilliant collection of everything that belongs to MPEG audio. For only DM 49,-! Get it! More than 400 MB of songs, documentation and utilities! Read below about how to order!

Third:
"The Internet MPEG CD-Rom" is still available! The unique collection of everything that belongs to MPEG. For only DM 49,90! Get it! More than 600 MB of movies, songs, documentation and utilities!
Read below about how to order! Another CD-Rom containing material for MPEG-2 is about to be released! It will be called the "MPEG-2 Movie Toolbox".

Fourth:
This FAQ and the famous MPEG Archive have a completely new home now on the PowerWeb site! The newest FAQ and other MPEG-related information and utilities for all platforms can always be loaded using WWW from:
   URL=http://www.powerweb.de/mpeg
And surely, there are more interesting things to find ;o)

I add my comments in brackets []; lines (---- or ====) separate the chapters and questions.

Please try to find out more information yourself. I had enough to do getting and preparing this information. And only bother me with file requests if it is not possible for you to get the file somewhere else !!!

If you want to contribute to this FAQ in any way, please e-mail me directly (for example by replying to this posting):
   mpegfaq@powerweb.de

If you want to contribute to the MPEG Archive, please upload via ftp to
   ftp://ftp.powerweb.de/incoming/mpeg
and notify mpeg@powerweb.de via e-mail about your contribution.

Other useful information related to MPEG can be e-mailed to
   mpeg@powerweb.de
Or send any additional information via fax or e-mail.

Enjoy MPEG,
KeyJ "MPEG" Phade (Frank Gadegast)

-------------------------------------------------------------------------------
~Subject: Disclaimer

I HAVE NOTHING TO DO WITH THE NAMED COMPANIES, NO BUSINESS, IT'S JUST MY PERSONAL INTEREST. COMPANIES ARE NAMED BECAUSE THEY ARE THE FIRST BRINGING REAL MULTIMEDIA TO THE WORLD. SURE, I MAKE ADVERTS FOR THEM WITH THIS FAQ, BUT HOPEFULLY YOU, AS A READER OF THIS FAQ, WILL FORCE THEM TO PRODUCE MORE AND BETTER PRODUCTS.

MOST ADDITIONAL INFORMATION IS WRITTEN AS PERSONAL COMMENT, AND SHOULD NOT BE TAKEN AS PROVEN FACTS. INFORMATION IS PRESENTED "AS IS", COULD BE OUT OF DATE AND CANNOT BE GUARANTEED TO BE THE TRUTH.
THIS INFORMATION COMES WITHOUT WARRANTY OF ANY KIND, INCLUDING WITHOUT LIMITATION WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.

UNDER NO CIRCUMSTANCES AND UNDER NO LEGAL THEORY, TORT, CONTRACT, OR OTHERWISE, SHALL THE AUTHOR BE LIABLE TO YOU OR ANY OTHER PERSON FOR ANY INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES OF ANY CHARACTER INCLUDING, WITHOUT LIMITATION, DAMAGES FOR LOSS OF GOODWILL, WORK STOPPAGE, COMPUTER FAILURE OR MALFUNCTION, OR ANY AND ALL OTHER COMMERCIAL DAMAGES OR LOSSES.

Frank Gadegast

-------------------------------------------------------------------------------
~Subject: Copyright information

THIS COMPILATION OF INFORMATION IS COPYRIGHTED BY THE AUTHOR AND MAINTAINER, CURRENTLY FRANK GADEGAST.

ANY NON-COMMERCIAL USE OF IT, OR OF PARTS OF IT, IS ALLOWED AS LONG AS THE USE IS REPORTED TO THE AUTHOR AND THE COMPILATION IS KEPT UNCHANGED. ADDITIONALLY, IF PARTS OF IT ARE USED, INFORMATION HAS TO BE ADDED TO THAT PART STATING WHO THE AUTHOR OF THAT PART IS, THAT IT BELONGS TO THE COMPLETE COMPILATION, AND WHERE TO FIND THE COMPLETE COMPILATION.

COMMERCIAL USE CAN BE GRANTED IN SPECIAL CIRCUMSTANCES. FEEL FREE TO ASK, AND SEND A DESCRIPTION OF THE INTENDED USE, TO RECEIVE A CERTIFICATION. ANY NON-REPORTED OR NON-CERTIFIED COMMERCIAL USE OF THIS COMPILATION IS A VIOLATION OF GERMAN COPYRIGHT LAW !

ANY RE-PUBLICATION OF THE INFORMATION IN THIS COMPILATION SHOULD BE REPORTED TO THE AUTHOR AND SHOULD BE QUOTED IN THE NEW PUBLICATION.

ANY RE-DISTRIBUTION OF THE COMPLETE FILE ON NON-COMMERCIAL ARCHIVES, LIKE FTP- OR FAQ-MIRRORS, IS ALLOWED.

-------------------------------------------------------------------------------
~Subject: Digest format

It should be possible to read this FAQ with a threaded newsreader or with emacs in FAQ-mode, so that you can jump from one question to another, because this FAQ is organized as a digest.
You can move to the next question with the digest commands in gnus, rn or other newsreaders, or with a regex search for ^~Subject or ^--.

-------------------------------------------------------------------------------
~Subject: Recommendations

Well, to stop some of the most annoying questions from those who do not read this FAQ at all, I recommend the following players/decoders and encoders. Search the FAQ for these words and download the programs BEFORE e-mailing me !

   DOS:     VMPEG, MAPLAYPC and CMPEG, ENC11BIN
   Windows: VMPEG, SoftPeg, COOL 1.5.3 and Maplay 1.2 for Win32
   Unix:    XMPLAY and VCR

CD-I's and Video-CDs are currently only supported by VMPEG and SoftPeg !

-------------------------------------------------------------------------------
~Subject: What questions are getting answered in this FAQ ?

SECTION 0. - INTRO
     Disclaimer Copyright information Digest format What questions are getting answered in this FAQ ?
SECTION 1. - WHAT IS MPEG-VIDEO/VIDEO
     What is MPEG ? What is MPEG-Audio then ? What is the Audio Layer 3 then ? What is MPEG-1+ ? What is MPEG-2 ? What happened at the MPEG - NY meeting ? What's about Video-CD and CD-I ?
SECTION 2. - PROFESSIONAL SOFTWARE
  SUBSECTION - DOS
     MPEG Encoder by Xing
  SUBSECTION - WINDOWS
     MPEG ARCADE(TM) XingSound XingCD
  SUBSECTION - UNIX
     Xing Distributed Media Architecture NVR Research Kit Demo of NVR Digital Media Development Kit How will I get the NVR-Software ?
SECTION 3. - FREE AVAILABLE SOFTWARE
  SUBSECTION - DOS
     layr_100 mpeg2ppm vmpeg cmpeg dmpeg secmpeg mpegstat enc11dos pvrg MPEG
  SUBSECTION - Windows
     XingIt mpgaudio
  SUBSECTION - WINDOWS-NT
     mpeg2ply mpegplay
  SUBSECTION - OS/2
     mp
  SUBSECTION - X-WINDOWS and UNIX
     Berkeley's MPEG Tools MPEG-1 Video Software Encoder MPEG Video Software Decoder MPEG Video Software Analyzer MPEG Blocks Analyzer MPEG Video Software Statistics Gatherer xmg mpegstat mplex xmplay xplayer xmpeg.tk mpeg2encode / mpeg2decode mpegaudio maplay Scanning MPEG's ... MPEG decoder... MPEGTool What is "SECMPEG" ?
     PVRG-MPEG Codec wdgt
  SUBSECTION - VMS
     vms MPEG
  SUBSECTION - MacIntosh
     Sparcle Qt2MPEG Audio on Macintosh ?!
  SUBSECTION - Atari
  SUBSECTION - Amiga
     MPEG2DCTV
  SUBSECTION - NeXT
     MPEG_Play.app mpegnext
  SUBSECTION - SGI
SECTION 4. - MPEG-RELATED HARDWARE
     MPEG audio Layer-3 Video-Maker Some MPEG chips Optibase ReelMagic Cinerama XingIt!-card MPEG-decompression hardware list Amiga CD32
SECTION 5. - MAILBOX-ACCESS
     Genoabox Xing Technologies BBS and fax
SECTION 6. - FTP-ACCESS
     FTP-ACCESS - Overview MPEG-2 validation bitstreams Audio streams and utils Accessing Aminet Where will I find test-material for MPEG-encoders ?
SECTION 7. - WWW-ACCESS
     Where is the WWW-home of this FAQ ? An Interactive Explanation on the Web ? Where is the WWW-demo of "The Internet MPEG CD-Rom" ? Which archive is mostly related to MPEG-Audio ? What's with Bryan Woodworth ftp-area ? Rock'n'Roll stored in MPEG on the Web ? Where can I find space movies coded in MPEG ? Movies on Web-site Where can I find fractal movies coded in MPEG ? Is qt2mpeg on the Web ? What are other good URL's ?
SECTION 8. - MAIL ORDER
     The Internet MPEG CD-Rom Conversion, WWW and CD-Rom production service How can I order information from C-CUBE ?
SECTION 9. - ADDITIONAL INFORMATION
     What are the MPEG standard documents ? So, the Xing decoder is cheating, right ? What is Aware Inc. doing ? Will MPEG be included in QuickTime ? What's about MPEG-2 software ? What about good MPEG Hardware encoders (Optivision) ? What's about CD-I ? What is the PCMotion Player ? What is the MPEG-2 ISO number ? Some papers about MPEG-audio Where can I find more documents about what Berkeley is doing ? Is there a book about MPEG ? Who are CD-I producers ? Where can I get VideoCD and CD-I coding ? Where can I do MPEG encoding ? What the problem with all these file extensions for MPEG-files ? How can I do RTP encapsulation of MPEG1/MPEG2 ? Wo kann ich den MPEG-standard bestellen ? [Where can I order the MPEG standard ?]
SECTION 10. - WHERE TO FIND MORE INFOS
     What newsgroups discuss MPEG ?
     How can 'archie' help me ?
SECTION 11. - QUESTIONS

===========================================================================
~Subject: SECTION 1. - WHAT IS MPEG-VIDEO/VIDEO

-------------------------------------------------------------------------------
~Subject: What is MPEG ?

From comp.compression Mon Oct 19 15:38:38 1992
Sender: news@chorus.chorus.fr
Author: Mark Adler

[71] Introduction to MPEG (long)
     What is MPEG?
     Does it have anything to do with JPEG?
     Then what's JBIG and MHEG?
     What has MPEG accomplished?
     So how does MPEG I work?
     What about the audio compression?
     So how much does it compress?
     What's phase II?
     When will all this be finished?
     How do I join MPEG?
     How do I get the documents, like the MPEG I standard?

[ There is no newer version of this part so far. Whoever wants to update ]
[ this description should do the job and send it over.                   ]

Written by Mark Adler.

Q. What is MPEG?

A. MPEG is a group of people that meet under ISO (the International Organization for Standardization) to generate standards for digital video (sequences of images in time) and audio compression. In particular, they define a compressed bit stream, which implicitly defines a decompressor. However, the compression algorithms are up to the individual manufacturers, and that is where proprietary advantage is obtained within the scope of a publicly available international standard. MPEG meets roughly four times a year for roughly a week each time. In between meetings, a great deal of work is done by the members, so it doesn't all happen at the meetings. The work is organized and planned at the meetings.

Q. So what does MPEG stand for?

A. Moving Picture Experts Group.

Q. Does it have anything to do with JPEG?

A. Well, it sounds the same, and they are part of the same subcommittee of ISO along with JBIG and MHEG, and they usually meet at the same place at the same time.
However, they are different sets of people with few or no common individual members, and they have different charters and requirements. JPEG is for still image compression.

Q. Then what's JBIG and MHEG?

A. Sorry I mentioned them. Ok, I'll simply say that JBIG is for binary image compression (like faxes), and MHEG is for multi-media data standards (like integrating stills, video, audio, text, etc.). For an introduction to JBIG, see question 74 below.

Q. Ok, I'll stick to MPEG. What has MPEG accomplished?

A. So far (as of January 1996), they have completed the standard for MPEG phase I, colloquially called MPEG I. This defines a bit stream for compressed video and audio optimized to fit into a bandwidth (data rate) of 1.5 Mbits/s. This rate is special because it is the data rate of (uncompressed) audio CD's and DAT's. The standard is in three parts, video, audio, and systems, where the last part gives the integration of the audio and video streams with the proper timestamping to allow synchronization of the two. They have also gotten well into MPEG phase II, whose task is to define a bitstream for video and audio coded at around 3 to 10 Mbits/s.

Q. So how does MPEG I work?

A. First off, it starts with a relatively low resolution video sequence (possibly decimated from the original) of about 352 by 240 pixels by 30 frames/s (in the US; the numbers differ for Europe), but original high (CD) quality audio. The images are in color, but converted to YUV space, and the two chrominance channels (U and V) are decimated further to 176 by 120 pixels. It turns out that you can get away with a lot less resolution in those channels and not notice it, at least in "natural" (not computer generated) images.

The basic scheme is to predict motion from frame to frame in the temporal direction, and then to use DCT's (discrete cosine transforms) to organize the redundancy in the spatial directions.
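
To make the spatial half of that scheme concrete, here is a minimal Python sketch of a two-dimensional DCT applied to one block. It is only an illustration under stated assumptions: the 8x8 block of smooth brightness values is made up, the name dct_2d is hypothetical, and the direct four-loop transform is far slower than anything a real encoder would use. It suggests why the DCT helps: for smooth image data nearly all of the energy ends up in a few low-frequency coefficients.

```python
import math

N = 8  # DCT block size used by MPEG I video


def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block (direct, unoptimised)."""
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = scale(u) * scale(v) * total
    return out


# Hypothetical test data: a smooth diagonal brightness ramp.
block = [[100 + 4 * x + 2 * y for y in range(N)] for x in range(N)]
coeffs = dct_2d(block)

pixel_energy = sum(v * v for row in block for v in row)
total_energy = sum(c * c for row in coeffs for c in row)
low_energy = sum(coeffs[u][v] ** 2 for u in range(2) for v in range(2))
energy_share = low_energy / total_energy

print(f"energy in the 2x2 low-frequency corner: {energy_share:.4f} of total")
print("coefficients with magnitude >= 1:",
      sum(abs(c) >= 1.0 for row in coeffs for c in row), "of", N * N)
```

Because this particular transform is orthonormal, the total energy of the coefficients equals that of the pixels; the transform itself loses nothing, it only rearranges the data so that later stages can discard the small coefficients.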
The DCT's are done on 8x8 blocks, and the motion prediction is done in the luminance (Y) channel on 16x16 blocks. In other words, given the 16x16 block in the current frame that you are trying to code, you look for a close match to that block in a previous or future frame (there are backward prediction modes where later frames are sent first to allow interpolating between frames). The DCT coefficients (of either the actual data, or the difference between this block and the close match) are "quantized", which means that you divide them by some value to drop bits off the bottom end. Hopefully, many of the coefficients will then end up being zero. The quantization can change for every "macroblock" (a macroblock is 16x16 of Y and the corresponding 8x8's in both U and V). The results of all of this, which include the DCT coefficients, the motion vectors, and the quantization parameters (and other stuff), are Huffman coded using fixed tables. The DCT coefficients have a special Huffman table that is "two-dimensional" in that one code specifies a run-length of zeros and the non-zero value that ended the run. Also, the motion vectors and the DC DCT components are DPCM coded (each is sent as the difference from the previous one).

Q. So is each frame predicted from the last frame?

A. No. The scheme is a little more complicated than that. There are three types of coded frames. There are "I" or intra frames. They are simply a frame coded as a still image, not using any past history. You have to start somewhere. Then there are "P" or predicted frames. They are predicted from the most recently reconstructed I or P frame. (I'm describing this from the point of view of the decompressor.) Each macroblock in a P frame can either come with a vector and difference DCT coefficients for a close match in the last I or P, or it can just be "intra" coded (like in the I frames) if there was no good match.

Lastly, there are "B" or bidirectional frames.
They are predicted from the closest two I or P frames, one in the past and one in the future. You search for matching blocks in those frames, and try three different things to see which works best. (Now I have the point of view of the compressor, just to confuse you.) You try using the forward vector, the backward vector, and you try averaging the two blocks from the future and past frames, and subtracting that from the block being coded. If none of those works well, you can intra-code the block.

The sequence of decoded frames usually goes like:

   IBBPBBPBBPBBIBBPBBPB...

where there are 12 frames from I to I (for the US and Japan, anyway). This is based on a random access requirement that you need a starting point at least once every 0.4 seconds or so. The ratio of P's to B's is based on experience.

Of course, for the decoder to work, you have to send that first P *before* the first two B's, so the compressed data stream ends up looking like:

   0xx312645...

where those are frame numbers. xx might be nothing (if this is the true starting point), or it might be the B's of frames -2 and -1 if we're in the middle of the stream somewhere.

You have to decode the I, then decode the P, keep both of those in memory, and then decode the two B's. You probably display the I while you're decoding the P, and display the B's as you're decoding them, and then display the P as you're decoding the next P, and so on.

Q. You've got to be kidding.

A. No, really!

Q. Hmm. Where did they get 352x240?

A. That derives from the CCIR-601 digital television standard which is used by professional digital video equipment. It is (in the US) 720 by 243 by 60 fields (not frames) per second, where the fields are interlaced when displayed. (It is important to note though that fields are actually acquired and displayed a 60th of a second apart.) The chrominance channels are 360 by 243 by 60 fields a second, again interlaced.
This degree of chrominance decimation (2:1 in the horizontal direction) is called 4:2:2. The source input format for MPEG I, called SIF, is CCIR-601 decimated by 2:1 in the horizontal direction, 2:1 in the time direction, and an additional 2:1 in the chrominance vertical direction. And some lines are cut off to make sure things divide by 8 or 16 where needed.

Q. What if I'm in Europe?

A. For 50 Hz display standards (PAL, SECAM) change the number of lines in a field from 243 or 240 to 288, and change the display rate to 50 fields/s or 25 frames/s. Similarly, change the 120 lines in the decimated chrominance channels to 144 lines. Since 288*50 is exactly equal to 240*60, the two formats have the same source data rate.

Q. You didn't mention anything about the audio compression.

A. Oh, right. Well, I don't know as much about the audio compression. Basically they use very carefully developed psychoacoustic models derived from experiments with the best obtainable listeners to pick out pieces of the sound that you can't hear. There are what are called "masking" effects where, for example, a large component at one frequency will prevent you from hearing lower energy parts at nearby frequencies, where the relative energy vs. frequency that is masked is described by some empirical curve. There are similar temporal masking effects, as well as some more complicated interactions where a temporal effect can unmask a frequency, and vice-versa.

The sound is broken up into spectral chunks with a hybrid scheme that combines sine transforms with subband transforms, and the psychoacoustic model is written in terms of those chunks. Whatever can be removed or reduced in precision is, and the remainder is sent. It's a little more complicated than that, since the bits have to be allocated across the bands. And, of course, what is sent is entropy coded.

Q. So how much does it compress?

A. As I mentioned before, audio CD data rates are about 1.5 Mbits/s.
You can compress the same stereo program down to 256 Kbits/s with no loss in discernible quality. (So they say. For the most part it's true, but every once in a while a weird thing might happen that you'll notice. However the effect is very small, and it takes a listener trained to notice these particular types of effects.) That's about 6:1 compression. So, a CD MPEG I stream would have about 1.25 Mbits/s left for video. The number I usually see though is 1.15 Mbits/s (maybe you need the rest for the system data stream). You can then calculate the video compression ratio from the numbers here to be about 26:1. If you step back and think about that, it's little short of a miracle. Of course, it's lossy compression, but it can be pretty hard sometimes to see the loss, if you're comparing the SIF original to the SIF decompressed. There is, however, a very noticeable loss if you're coming from CCIR-601 and have to decimate to SIF, but that's another matter. I'm not counting that in the 26:1.

The standard also provides for other bit rates ranging from 32 Kbits/s for a single channel, up to 448 Kbits/s for stereo.

Q. What's phase II?

A. As I said, there is a considerable loss of quality in going from CCIR-601 to SIF resolution. For entertainment video, it's simply not acceptable. You want to use more bits and code all or almost all the CCIR-601 data. From subjective testing at the Japan meeting in November 1991, it seems that 4 Mbits/s can give very good quality compared to the original CCIR-601 material. The objective of phase II is to define a bit stream optimized for these resolutions and bit rates.

Q. Why not just scale up what you're doing with MPEG I?

A. The main difficulty is the interlacing. The simplest way to extend MPEG I to interlaced material is to put the fields together into frames (720x486x30/s). This results in bad motion artifacts that stem from the fact that moving objects are in different places in the two fields, and so don't line up in the frames.
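
The misalignment is easy to reproduce on a toy picture. The following Python sketch is only an illustration: the picture size, the object and its speed are all made up, and the helper names (render_field, weave, woven_frame, mismatched_line_pairs) are hypothetical. It draws a small object into two fields taken one field period apart and then interleaves ("weaves") the fields into a single frame, as the naive frame-based approach would.

```python
WIDTH, LINES = 16, 8   # tiny hypothetical picture
OBJECT_WIDTH = 4
START_X = 2


def render_field(object_x, parity):
    """Lines of one field: only the rows with row % 2 == parity."""
    return {
        row: "".join("#" if object_x <= col < object_x + OBJECT_WIDTH else "."
                     for col in range(WIDTH))
        for row in range(parity, LINES, 2)
    }


def weave(field_a, field_b):
    """Interleave two fields into one frame."""
    merged = {**field_a, **field_b}
    return [merged[row] for row in range(LINES)]


def woven_frame(speed):
    """Frame built from two fields; the object moves `speed` pixels between them."""
    first = render_field(START_X, parity=0)
    second = render_field(START_X + speed, parity=1)
    return weave(first, second)


def mismatched_line_pairs(frame):
    """Number of adjacent line pairs that differ (a crude 'comb' measure)."""
    return sum(frame[r] != frame[r + 1] for r in range(len(frame) - 1))


print("\n".join(woven_frame(speed=4)))
print("mismatched pairs, moving object:", mismatched_line_pairs(woven_frame(4)))
print("mismatched pairs, static object:", mismatched_line_pairs(woven_frame(0)))
```

With a static object the two fields agree and the woven frame is clean; with motion every second line is shifted, which produces the comb-like edges and a lot of artificial vertical detail for the coder to deal with.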
Compressing and decompressing without somehow taking that misalignment into account tends to muddle the objects in the two different fields.

The other thing you might try is to code the even and odd field streams separately. This avoids the motion artifacts, but as you might imagine, doesn't get very good compression since you are not using the redundancy between the even and odd fields where there is not much motion (which is typically most of the image).

Or you can code it as a single stream of fields. Or you can interpolate lines. Or, etc. etc. There are many things you can try, and the point of MPEG II is to figure out what works well. MPEG II is not limited to considering only derivations of MPEG I. There were several non-MPEG I-like schemes in the competition in November, and some aspects of those algorithms may or may not make it into the final standard for entertainment video compression.

Q. So what works?

A. Basically, derivations of MPEG I worked quite well, with one that used wavelet subband coding instead of DCT's that also worked very well. Also among the worked-very-well's was a scheme that did not use B frames at all, just I and P's. All of them, except maybe one, did some sort of adaptive frame/field coding, where a decision is made on a macroblock basis as to whether to code that one as one frame macroblock or as two field macroblocks. Some other aspects are how to code I-frames: some suggest predicting the even field from the odd field. Or you can predict evens from evens and odds or odds from evens and odds or any field from any other field, etc.

Q. So what works?

A. Ok, we're not really sure what works best yet. The next step is to define a "test model" to start from, that incorporates most of the salient features of the worked-very-well proposals in a simple way. Then experiments will be done on that test model, making a mod at a time, and seeing what makes it better and what makes it worse. Example experiments are, B's or no B's, DCT vs.
wavelets, various field prediction modes, etc. The requirements, such as implementation cost, quality, random access, etc. will all feed into this process as well.

Q. When will all this be finished?

A. I don't know. I'd have to hope in about a year or less.

Q. How do I join MPEG?

A. You don't join MPEG. You have to participate in ISO as part of a national delegation. How you get to be part of the national delegation is up to each nation. I only know the U.S., where you have to attend the corresponding ANSI meetings to be able to attend the ISO meetings. Your company or institution has to be willing to sink some bucks into travel since, naturally, these meetings are held all over the world. (For example, Paris, Santa Clara, Kurihama Japan, Singapore, Haifa Israel, Rio de Janeiro, London, etc.)

Q. Well, then how do I get the documents, like the MPEG I standard ?

A. MPEG is an ISO standard. Its exact name is ISO CD 11172. The standard consists of three parts: System, Video, and Audio. The System part (11172-1) deals with synchronization and multiplexing of audio-visual information, while the Video (11172-2) and Audio (11172-3) parts address the video and the audio compression techniques respectively.

You may order it from your national standards body (e.g. ANSI in the USA) or buy it from companies like

   OMNICOM
   phone +44 438 742424
   FAX   +44 438 740154

Or from 'ISO Online' at http://www.iso.ch/welcome.html

-------------------------------------------------------------------------------
~Subject: What is MPEG-Audio then ?

From: "Harald Popp"
From: mortenh@oslonett.no
Date: Fri, 25 Mar 1994 19:09:06 +0100

Q. What is MPEG?

A. MPEG is an ISO committee that proposes standards for compression of Audio and Video. MPEG deals with 3 issues: Video, Audio, and System (the combination of the two into one stream). You can find more info on the MPEG committee in other parts of this document.

Q. I've heard about MPEG Video. So this is the same compression applied to audio?

A.
Definitely not. The eye and the ear, even if they are only a few centimeters apart, work very differently. The ear has a much higher dynamic range and resolution. It can pick out more details but it is "slower" than the eye. The MPEG committee chose to recommend 3 compression methods and named them Audio Layer-1, Layer-2, and Layer-3.

Q. What does it mean exactly?

A. MPEG-1, IS 11172-3, describes the compression of audio signals using high performance perceptual coding schemes. It specifies a family of three audio coding schemes, simply called Layer-1,-2,-3, with increasing encoder complexity and performance (sound quality per bitrate). The three codecs are compatible in a hierarchical way, i.e. a Layer-N decoder is able to decode bitstream data encoded in Layer-N and all Layers below N (e.g., a Layer-3 decoder may accept Layer-1,-2 and -3, whereas a Layer-2 decoder may accept only Layer-1 and -2.)

Q. So we have a family of three audio coding schemes. What does the MPEG standard define, exactly?

A. For each Layer, the standard specifies the bitstream format and the decoder. It does *not* specify the encoder, to allow for future improvements, but an informative chapter gives an example of an encoder for each Layer.

Q. What do the three audio Layers have in common?

A. All Layers use the same basic structure. The coding scheme can be described as "perceptual noise shaping" or "perceptual subband / transform coding". The encoder analyzes the spectral components of the audio signal by calculating a filterbank or transform and applies a psychoacoustic model to estimate the just noticeable noise-level. In its quantization and coding stage, the encoder tries to allocate the available number of data bits in a way that meets both the bitrate and masking requirements. The decoder is much less complex. Its only task is to synthesize an audio signal out of the coded spectral components.

All Layers use the same analysis filterbank (polyphase with 32 subbands).
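
As a rough worked example of what 32 subbands mean in practice, the Python sketch below computes the nominal width of one subband and the subband a given tone falls into. It rests on simplifying assumptions: the bands are treated as ideal, equal-width and non-overlapping slices of the range from 0 Hz to half the sampling frequency (the real polyphase filters overlap), the function names are hypothetical, and the 6500 Hz tone is just an example value.

```python
NUM_SUBBANDS = 32


def subband_width_hz(sampling_rate_hz):
    """Nominal width of one subband: (fs / 2) split into 32 equal parts."""
    return sampling_rate_hz / 2 / NUM_SUBBANDS


def subband_index(frequency_hz, sampling_rate_hz):
    """Zero-based index of the (idealised) subband containing frequency_hz."""
    return int(frequency_hz // subband_width_hz(sampling_rate_hz))


for fs in (32_000, 44_100, 48_000):  # the three MPEG-1 audio sampling rates
    print(f"fs = {fs} Hz: subband width = {subband_width_hz(fs)} Hz")

print("6500 Hz at 48 kHz falls into subband", subband_index(6500, 48_000))
```

Under these assumptions a subband is 500 Hz wide at 32 kHz and 750 Hz wide at 48 kHz, which is coarse compared with the ear's frequency resolution at low frequencies.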
Layer-3 adds an MDCT transform to increase the frequency resolution.

All Layers use the same "header information" in their bitstream, to support the hierarchical structure of the standard.

All Layers use a bitstream structure that contains parts that are more sensitive to bit errors ("header", "bit allocation", "scalefactors", "side information") and parts that are less sensitive ("data of spectral components").

All Layers may use 32, 44.1 or 48 kHz sampling frequency.

All Layers are allowed to work with similar bitrates:

   Layer-1: from 32 kbps to 448 kbps
   Layer-2: from 32 kbps to 384 kbps
   Layer-3: from 32 kbps to 320 kbps

Q. What are the main differences between the three Layers, from a global view?

A. From Layer-1 to Layer-3, complexity increases (mainly true for the encoder), overall codec delay increases, and performance increases (sound quality per bitrate).

Q. Which Layer should I use for my application?

A. Good question. Of course, it depends on all your requirements. But as a first approach, you should consider the available bitrate of your application, as the Layers have been designed to support certain ranges of bitrates most efficiently, i.e. with a minimum drop in sound quality.

Let us look a little closer at the strong domains of each Layer.

Layer-1: Its ISO target bitrate is 192 kbps per audio channel. Layer-1 is a simplified version of Layer-2. It is most useful at "high" bitrates of around 192 kbps or above. A version of Layer-1 is used as "PASC" with the DCC recorder.

Layer-2: Its ISO target bitrate is 128 kbps per audio channel. Layer-2 is identical with MUSICAM. It has been designed as a trade-off between sound quality per bitrate and encoder complexity. It is most useful at "medium" bitrates of 128 or even 96 kbps per audio channel. The DAB (EU 147) proponents have decided to use Layer-2 in the future Digital Audio Broadcasting network.

Layer-3: Its ISO target bitrate is 64 kbps per audio channel.
Layer-3 merges the best ideas of MUSICAM and ASPEC. It has been designed for best performance at "low" bitrates around 64 kbps or even below. The Layer-3 format specifies a set of advanced features that all address one goal: to preserve as much sound quality as possible even at rather low bitrates. Today, Layer-3 is already in use in various telecommunication networks (ISDN, satellite links, and so on) and speech announcement systems.

Q. So how does MPEG audio work?

A. Well, first you need to know how sound is stored in a computer. Sound is pressure differences in air. When picked up by a microphone and fed through an amplifier this becomes voltage levels. The voltage is sampled by the computer a number of times per second. For CD audio quality you need to sample 44100 times per second and each sample has a resolution of 16 bits. In stereo this gives you about 1.4 Mbit per second and you can probably see the need for compression.

To compress audio, MPEG tries to remove the irrelevant parts of the signal and the redundant parts of the signal. Parts of the sound that we do not hear can be thrown away. To do this MPEG Audio uses psychoacoustic principles.

Q. Tell me more about sound quality. How good is MPEG audio compression? And how do you assess that?

A. Today, there is no alternative to expensive listening tests. During the ISO-MPEG-1 process, 3 international listening tests were performed, with a lot of trained listeners, supervised by Swedish Radio. They took place in July 1990, March 1991 and November 1991. Another international listening test was performed by CCIR, now ITU-R, in 1992.

All these tests used the "triple stimulus, hidden reference" method and the so-called CCIR impairment scale to assess the audio quality. The listening sequence is "ABC", with A = original, BC = pair of original / coded signal in random order, and the listener has to evaluate both B and C with a number between 1.0 and 5.0.
The meaning of these values is:

   5.0 = transparent (this should be the original signal)
   4.0 = perceptible, but not annoying (first differences noticeable)
   3.0 = slightly annoying
   2.0 = annoying
   1.0 = very annoying

With perceptual codecs (like MPEG audio), all traditional parameters (like SNR, THD+N, bandwidth) are practically useless.

Fraunhofer-IIS (among others) also works on objective quality assessment tools, like the NMR meter (Noise-to-Mask-Ratio). If you need more information about NMR, please contact nmr@iis.fhg.de

Q. Now that I know how to assess quality, come on, tell me the results of these tests.

A. Well, for details you should study one of those AES papers listed below. One main result is that for low bitrates (60 or 64 kbps per channel, i.e. a compression ratio of around 12:1), Layer-2 scored between 2.1 and 2.6, whereas Layer-3 scored between 3.6 and 3.8. This is a significant increase in sound quality, indeed! Furthermore, the selection process for critical sound material showed that it was rather difficult to find worst-case material for Layer-3 whereas it was not so hard to find such items for Layer-2.

For medium and high bitrates (120 kbps or more per channel), Layer-2 and Layer-3 scored rather similarly, i.e. even trained listeners found it difficult to detect differences between original and reconstructed signal.

Q. So how does MPEG achieve this compression ratio?

A. Well, with audio you basically have two alternatives. Either you sample less often or you sample with less resolution (less than 16 bits per sample). If you want quality you can't do much with the sample frequency. Humans can hear sounds with frequencies from about 20 Hz to 20 kHz. According to the Nyquist theorem you must sample at a rate of at least twice the highest frequency you want to reproduce. Allowing for imperfect filters, a 44.1 kHz sampling rate is a fair minimum. So you either set out to prove the Nyquist theorem is wrong or go to work on reducing the resolution.
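
To see why sampling less often is not an option, here is a small Python sketch of what happens to a tone above half the sampling rate. It is an illustration only: the 30 kHz tone is a made-up example and the helper names are hypothetical. Sampled at 44.1 kHz, that tone produces the same sample values as a 14.1 kHz tone, so after sampling the two can no longer be told apart (this is called aliasing).

```python
import math

FS = 44_100  # CD sampling rate in Hz


def sample_tone(freq_hz, count, fs=FS):
    """The first `count` samples of a cosine tone of the given frequency."""
    return [math.cos(2 * math.pi * freq_hz * n / fs) for n in range(count)]


def alias_frequency(freq_hz, fs=FS):
    """Frequency between 0 and fs/2 that yields the same samples as freq_hz."""
    folded = freq_hz % fs
    return fs - folded if folded > fs / 2 else folded


too_high = sample_tone(30_000, 64)
alias = sample_tone(alias_frequency(30_000), 64)
max_difference = max(abs(a - b) for a, b in zip(too_high, alias))

print("30 kHz sampled at 44.1 kHz looks like", alias_frequency(30_000), "Hz")
print("largest sample difference:", max_difference)
```

Anything up to 22.05 kHz survives sampling at 44.1 kHz unchanged, which leaves a margin of about 2 kHz above the 20 kHz hearing limit for the filters to do their work.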
The MPEG committee chose the latter.

Now, the real reason for using 16 bits is to get a good signal-to-noise (s/n) ratio. The noise we're talking about here is quantization noise from the digitizing process. For each bit you add, you get 6 dB better s/n. (A step of 6 dB corresponds to a doubling of the signal amplitude.) CD-audio achieves about 90 dB s/n. This matches the dynamic range of the ear fairly well. That is, you will not hear any noise coming from the system itself (well, there are still some people arguing about that, but let's not worry about them for the moment).

So what happens when you sample to 8 bit resolution? You get a very noticeable noise floor in your recording. You can easily hear this in silent moments in the music or between words or sentences if your recording is a human voice. Waitaminnit. You don't notice any noise in loud passages, right? This is the masking effect and is the key to MPEG Audio coding. Stuff like the masking effect belongs to a science called psychoacoustics that deals with the way the human brain perceives sound. And MPEG uses psychoacoustic principles when it does its thing.

Q. Explain this masking effect.

A. OK, say you have a strong tone with a frequency of 1000 Hz. You also have a tone nearby of say 1100 Hz. This second tone is 18 dB lower. You are not going to hear this second tone. It is completely masked by the first 1000 Hz tone. As a matter of fact, any relatively weak sound near a strong sound is masked. If you introduce another tone at 2000 Hz, also 18 dB below the first 1000 Hz tone, you will hear this. You will have to turn down the 2000 Hz tone to something like 45 dB below the 1000 Hz tone before it will be masked by the first tone. So the further you get from a sound the less masking effect it has. The masking effect means that you can raise the noise floor around a strong sound because the noise will be masked anyway.
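
[The 6 dB-per-bit rule of thumb mentioned above can be turned into a quick calculation. This is a rough sketch: 6 dB per bit is the approximation used in this text, real converters differ slightly, and the function names are invented for illustration.]

```python
# Sketch only: 6 dB per bit is the rule-of-thumb value from the text.
DB_PER_BIT = 6

def quantization_snr_db(bits):
    """Approximate signal-to-quantization-noise ratio for a word length."""
    return DB_PER_BIT * bits

def noise_floor_rise_db(bits_before, bits_after):
    """How far the noise floor rises when the word length is shortened."""
    return quantization_snr_db(bits_before) - quantization_snr_db(bits_after)

snr_16 = quantization_snr_db(16)   # 96 dB, same order as the "about 90 dB" of CD audio
snr_8 = quantization_snr_db(8)     # 48 dB, hence the audible noise floor
rise = noise_floor_rise_db(16, 8)  # noise floor is 48 dB higher with 8-bit samples

print(snr_16, snr_8, rise)
```

Read the other way round, every 6 dB by which masking allows the noise floor to rise saves roughly one bit per sample.
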
And raising the noise floor is the same as using fewer bits, and using fewer bits is the same as compression. Do you get it?

Q. I don't get it.

A. Well, let me try to explain how the MPEG Audio Layer-2 encoder goes about its thing. It divides the frequency spectrum (20 Hz to 20 kHz) into 32 subbands. Each subband holds a little slice of the audio spectrum. Say, in the upper region of subband 8, a 6500 Hz tone with a level of 60 dB is present. OK, the coder calculates the masking effect of this sound and finds that there is a masking threshold for the entire 8th subband (all sounds w. a frequency...) 35 dB below this tone. The acceptable s/n ratio is thus 60 - 35 = 25 dB. At about 6 dB per bit, this equals 4 bit resolution. In addition there are masking effects on bands 9-13 and on bands 5-7, the effect decreasing with the distance from band 8.

In a real-life situation you have sounds in most bands and the masking effects are additive. In addition the coder considers the sensitivity of the ear for various frequencies. The ear is a lot less sensitive in the high and low frequencies. Peak sensitivity is around 2 - 4 kHz, the same region that the human voice occupies.

The subbands should match the ear, that is, each subband should consist of frequencies that have the same psychoacoustic properties. In MPEG Layer 2, each subband is 750 Hz wide (with 48 kHz sampling frequency). It would have been better if the subbands were narrower in the low frequency range and wider in the high frequency range. That is the trade-off Layer-2 took in favour of a simpler approach.

Layer-3 has a much higher frequency resolution (18 times more) - and that is one of the reasons why Layer-3 has a much better low bitrate performance than Layer-2.

But there is more to it. I have explained concurrent masking, but the masking effect also occurs before and after a strong sound (pre- and postmasking).

Q. Before?

A. Yes, if there is a significant (30 - 40 dB) shift in level.
The reason is believed to be that the brain needs some processing time. Premasking lasts only about 2 to 5 ms. Postmasking can last up to 100 ms.

Other bit-reduction techniques involve considering tonal and non-tonal components of the sound. For a stereo signal you may have a lot of redundancy between channels. All MPEG Layers may exploit these stereo effects by using a "joint-stereo" mode, with the most flexible approach in Layer-3. Furthermore, only Layer-3 further reduces the redundancy by applying Huffman coding.

Q. What is the downside?

A. The coder calculates masking effects by an iterative process until it runs out of time. It is up to the implementor to spend bits in the least obtrusive fashion. For Layer 2 and Layer 3, the encoder works on 24 ms of sound (1152 samples at fs = 48 kHz) at a time. For some material, the time window can be a problem. This is normally in a situation with transients where there are large differences in sound level over the 24 ms. The masking is calculated on the strongest sound and the weak parts will drown in quantization noise. This is perceived as a "noise-echo" by the ear. Layer 3 addresses this problem specifically by using a smaller analysis window (4 ms) if the encoder encounters an "attack" situation.

Q. Tell me about the complexity. What are the hardware demands?

A. Alright. First, we have to distinguish between decoder and encoder. Remember: the MPEG coding is done asymmetrically, with a much larger workload on the encoder than on the decoder.

For a stereo decoder, various real-time implementations exist for Layer-2 and Layer-3. They are either based on single-DSP solutions or on dedicated MPEG audio decoder chips. So you need not worry about decoder complexity.

For a stereo Layer-2 encoder, various DSP-based solutions with one or more DSPs exist (with different quality, also). For a stereo Layer-3 encoder achieving ISO reference quality, the current real-time implementations use two DSP32C and two DSP56002.

Q.
How many audio channels?

A. MPEG-1 allows for two audio channels. These can be either single (mono), dual (two mono channels), stereo or joint stereo (intensity stereo (Layer-2 and Layer-3) or m/s-stereo (Layer-3 only)). In normal (l/r) stereo one channel carries the left audio signal and one channel carries the right audio signal. In m/s stereo one channel carries the sum signal (l+r) and the other the difference (l-r) signal. In intensity stereo the high frequency part of the signal (above 2 kHz) is combined. The stereo image is preserved but only the temporal envelope is transmitted. In addition MPEG allows for pre-emphasis, copyright marks and original/copy marks. MPEG-2 allows for several channels in the same stream.

Q. What about the audio codec delay?

A. Well, the standard gives some figures of the theoretical minimum delay:

  Layer-1: 19 ms (<50 ms)
  Layer-2: 35 ms (100 ms)
  Layer-3: 59 ms (150 ms)

The practical values are significantly above that. As they depend on the implementation, exact figures are hard to give. So the figures in brackets are just rough rule-of-thumb values.

Yes, for some applications, a very short delay is of critical importance. E.g. in a feedback link, a reporter can only talk intelligibly if the overall delay is below around 10 ms. If broadcasters want to apply MPEG audio coding, they have to use "N-1" switches in the studio to overcome this problem (or appropriate echo-cancellers) - or they have to forget about MPEG altogether.

But with most applications, these figures are small enough to present no extra problem. At least, if one can accept a Layer-2 delay, one can most likely also accept the higher Layer-3 delay.

Q. OK, I am hooked! Where can I find more technical information about MPEG audio coding, especially about Layer-3?

A. Well, there is a variety of AES papers, e.g.

  K. Brandenburg, G. Stoll, ...: "The ISO/MPEG-Audio Codec: A Generic Standard for Coding of High Quality Digital Audio", 92nd AES, Vienna 1992, pp. 3336

  E. Eberlein, H.
Popp, ...: "Layer-3, a Flexible Coding Standard", 94th AES, Berlin 93, pp. 3493

  K. Brandenburg, G. Zimmer, ...: "Variable Data-Rate Recording on a PC Using MPEG-Audio Layer-3", 95th AES, New York 93

  B. Grill, J. Herre, ...: "Improved MPEG-2 Audio Multi-Channel Encoding", 96th AES, Amsterdam 94

And for further information, please contact layer3@iis.fhg.de

Q. Where can I get more details about MPEG audio?

A. Still more details? No shit. You can get the full ISO spec from Omnicom. The specs do a fairly good job of obscuring exactly how these things are supposed to work... Jokes aside, there is no description of the coder in the specs. The specs describe the bitstream in great detail and suggest psychoacoustic models.

Originally written by Morten Hjerde <100034,663@compuserve.com>, modified and updated by Harald Popp (layer3@iis.fhg.de).

Harald Popp
Audio & Multimedia ("Music is the *BEST*" - F. Zappa)
Fraunhofer-IIS-A, Weichselgarten 3, D-91058 Erlangen, Germany
Phone: +49-9131-776-340
Fax: +49-9131-776-399
email: popp@iis.fhg.de

-------------------------------------------------------------------------------
~Subject: What is the Audio Layer 3 then ?

Information about MPEG Audio Layer-3
Version 1.51 - January 1995

This text is organized as a kind of Mini-FAQ (Frequently Asked Questions). It covers several topics:

  1. ISO-MPEG Standard
  2. MPEG Audio Codec Family ("Layer 1, 2, 3")
  3. Applications
  4. Products
  5. Support by Fraunhofer-IIS
  6. Shareware Information

For further comments and questions regarding Layer-3, please contact:
  - layer3@iis.fhg.de
For further information about MPEG, you may also like to contact:
  - phade@powerweb.de

1. ISO-MPEG Standard

Q: What is MPEG, exactly?

A: MPEG is the "Moving Picture Experts Group", working under the joint direction of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). This group works on standards for the coding of moving pictures and associated audio.
Q: What is the status of MPEG's work, then? What about MPEG-1, -2, and so on?

A: MPEG approaches the growing need for multimedia standards step-by-step. Today, three "phases" are defined:

  MPEG-1: "Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 MBit/s"
          Status: International Standard IS-11172, completed in October 1992
  MPEG-2: "Generic Coding of Moving Pictures and Associated Audio"
          Status: International Standard IS-13818, completed in November 1994
  MPEG-3: no longer exists (has been merged into MPEG-2)
  MPEG-4: "Very Low Bitrate Audio-Visual Coding"
          Status: Call for Proposals, first deadline 1 October 1995

Q: MPEG-1 and MPEG-2 are ready for use. What do the standards look like?

A: Both standards consist of 4 main parts. The structure is the same for MPEG-1 and MPEG-2.

  -1: System describes synchronization and multiplexing of video and audio
  -2: Video describes compression of video signals
  -3: Audio describes compression of audio signals
  -4: Compliance Testing describes procedures for determining the characteristics of coded bitstreams and the decoding process and for testing compliance with the requirements stated in the other parts.

Q: How do I get the MPEG documents?

A: You order them from your national standards body. E.g., in Germany, please contact:

  DIN-Beuth Verlag, Auslandsnormen
  Mrs. Niehoff, Burggrafenstr. 6, D-10772 Berlin, Germany
  Phone: +49-30-2601-2757, Fax: +49-30-2601-1231

2. MPEG Audio Codec Family ("Layer 1, 2, 3")

Q: Talking about MPEG audio coding, I heard a lot about "Layer 1, 2 and 3". What does it mean, exactly?

A: MPEG describes the compression of audio signals using high performance perceptual coding schemes. It specifies a family of three audio coding schemes, simply called Layer-1, -2, -3, with increasing encoder complexity and performance (sound quality per bitrate) from 1 to 3. The three codecs are compatible in a hierarchical way, i.e.
a Layer-N decoder is able to decode bitstream data encoded in Layer-N and all Layers below N (e.g., a Layer-3 decoder may accept Layer-1, -2 and -3, whereas a Layer-2 decoder may accept only Layer-1 and -2.)

Q: So we have a family of three audio coding schemes. What does the MPEG standard define, exactly?

A: For each Layer, the standard specifies the bitstream format and the decoder. To allow for future improvements, it does *not* specify the encoder, but an informative chapter gives an example of an encoder for each Layer.

Q: What do the three audio Layers have in common?

A: All Layers use the same basic structure. The coding scheme can be described as "perceptual noise shaping" or "perceptual subband / transform coding". The encoder analyzes the spectral components of the audio signal by calculating a filterbank or transform and applies a psychoacoustic model to estimate the just noticeable noise level. In its quantization and coding stage, the encoder tries to allocate the available number of data bits in a way to meet both the bitrate and masking requirements. The decoder is much less complex. Its only task is to synthesize an audio signal out of the coded spectral components.

All Layers use the same analysis filterbank (polyphase with 32 subbands). Layer-3 adds an MDCT transform to increase the frequency resolution.

All Layers use the same "header information" in their bitstream, to support the hierarchical structure of the standard.

All Layers have a similar sensitivity to bit errors. They use a bitstream structure that contains parts that are more sensitive to bit errors ("header", "bit allocation", "scalefactors", "side information") and parts that are less sensitive ("data of spectral components").

All Layers support the insertion of program-associated information ("ancillary data") into their audio data bitstream.

All Layers may use 32, 44.1 or 48 kHz sampling frequency.
All Layers are allowed to work with similar bitrates:

  Layer-1: from 32 kbps to 448 kbps
  Layer-2: from 32 kbps to 384 kbps
  Layer-3: from 32 kbps to 320 kbps

The last two statements refer to MPEG-1; with MPEG-2, there is an extension for the sampling frequencies and bitrates (see below).

Q: What are the main differences between the three Layers, from a global view?

A: From Layer-1 to Layer-3, complexity increases (mainly true for the encoder), overall codec delay increases, and performance increases (sound quality per bitrate).

Q: What are the main differences between MPEG-1 and MPEG-2 in the audio part?

A: MPEG-1 and MPEG-2 use the same family of audio codecs, Layer-1, -2 and -3. The new audio features of MPEG-2 are:

  - a "low sample rate extension" to address very low bitrate applications with limited bandwidth requirements (the new sampling frequencies are 16, 22.05 or 24 kHz, the bitrates extend down to 8 kbps),
  - a "multichannel extension" to address surround sound applications with up to 5 main audio channels (left, center, right, left surround, right surround) and optionally 1 extra "low frequency enhancement (LFE)" channel for subwoofer signals;
  - in addition, a "multilingual extension" allows the inclusion of up to 7 more audio channels.

Q: A lot of new stuff! Is this all compatible with each other?

A: Well, more or less, yes - with the exception of the low sample rate extension. Obviously, a pure MPEG-1 decoder is not able to handle the new "half" sample rates.

Q: You mean: compatible!? With all these extra audio channels? Please explain!

A: Compatibility has been a major topic during the MPEG-2 definition phase. The main idea is to use the same basic bitstream format as defined in MPEG-1, with the main data field carrying two audio signals (called L0 and R0) as before, and the ancillary data field carrying the multichannel extension information.
Without going further into details, three terms can be explained here:

  "forwards compatible": the MPEG-2 decoder has to accept any MPEG-1 audio bitstream (that represents one or two audio channels)

  "backwards compatible": the MPEG-1 decoder should be able to decode the audio signals in the main data field (L0 and R0) of the MPEG-2 bitstream

  "Matrixing" may be used to get the surround information into L0 and R0:

    L0 = left signal + a * center signal + b * left surround signal
    R0 = right signal + a * center signal + b * right surround signal

Therefore, an MPEG-1 decoder can reproduce a comprehensive downmix of the full 5-channel information. An MPEG-2 decoder uses the multichannel extension information (3 more audio signals) to reconstruct the five surround channels.

Q: I heard something about a new NBC mode for MPEG-2 audio? What does it mean?

A: "NBC" stands for "non-backwards compatible". During the development of the backwards compatible MPEG-2 standard, the experts encountered some trouble with the compatibility matrix. The introduced quantisation noise may become audible after dematrixing. Although some clever strategies have been devised to overcome this problem, the question remained how much better a non-compatible multichannel codec might perform. So ISO-MPEG decided to address that issue in an "NBC" working group - among the proponents are AT&T, Dolby, Fraunhofer, IRT, Philips, and Sony. Their work will lead to an addendum to the MPEG-2 standard (13818-8).

Q: O.K., that should do for a first overview. Are there some papers with more detailed information?

A: Sure! You'll find more technical information about MPEG audio coding in a variety of AES papers (AES = Audio Engineering Society). The AES organizes two conventions per year, and perceptual audio coding has been a topic since the middle of the 80s. Some interesting papers might be:

  K. Brandenburg, G.
Stoll, et al.: "The ISO/MPEG-Audio Codec: A Generic Standard for Coding of High Quality Digital Audio", 92nd AES, Vienna Mar. 92, pp. 3336; revised version ("ISO-MPEG-1 Audio: A Generic Standard...") published in the Journal of AES, Vol. 42, No. 10, Oct. 94

  S. Church, B. Grill, et al.: "ISDN and ISO/MPEG Layer-3 Audio Coding: Powerful New tools for Broadcast and Audio Production", 95th AES, New York Oct. 93, pp. 3743

  E. Eberlein, H. Popp, et al.: "Layer-3, a Flexible Coding Standard", 94th AES, Berlin Mar. 93, pp. 3493

  B. Grill, J. Herre, et al.: "Improved MPEG-2 Audio Multi-Channel Encoding", 96th AES, Amsterdam Feb. 94, pp. 3865

  J. Herre, K. Brandenburg, et al.: "Second Generation ISO/MPEG Audio Layer-3 Coding", 98th AES, Paris Feb. 95

  F.-O. Witte, M. Dietz, et al.: "Single Chip Implementation of an ISO/MPEG Layer-3 Decoder", 96th AES, Amsterdam Feb. 94, pp. 3805

For ordering information, contact:

  AES
  60 East 42nd Street, Suite 2520
  New York, NY 10165-2520, USA
  phone: (212) 661-8528, fax: (212) 682-0477

Another interesting publication: the "Proceedings of the Sixth Tirrenia International Workshop on Digital Communications", Tirrenia Sep. 93, Elsevier Science B.V. Amsterdam 94 (ISBN 0 444 81580 5).

An excellent tutorial about MPEG-2 has recently been published in a German technical journal (Fernseh- und Kino-Technik); part 4, by E. F. Schroeder and J. Spille, talks about the audio part (7/8 94, p. 364 ff).

And for further information, please feel free to contact layer3@iis.fhg.de.

3. Applications

Q: O.K., let us concentrate on one or two audio channels. Which Layer shall I use for my application?

A: Good question. Of course, it depends on all your requirements. But as a first approach, you should consider the available bitrate of your application, as the Layers have been designed to support certain ranges of bitrates most effectively.
Roughly, today you can achieve a data reduction of around

  1:4 with Layer-1 (or 192 kbps per audio channel),
  1:6..8 with Layer-2 (or 128..96 kbps per audio channel), and
  1:10..12 with Layer-3 (or 64..56 kbps per audio channel),

and still the reconstructed audio signal will maintain a "CD-like" sound quality. This may be used as a first rule of thumb - let's talk about details later on.

Q: Why does the performance increase with the number of the Layer? Why does the standard define a family of audio codecs instead of one single powerful algorithm?

A: Well, the MPEG standard has forged together two main coding schemes that offered advantages either in complexity (MUSICAM) or in performance (ASPEC).

Layer-2 is identical with the MUSICAM format. It has been designed as a trade-off between sound quality per bitrate and encoder complexity. So it is most useful for the "medium" range of bitrates (96..128 kbps per channel).

For higher bitrates, even a simplified version, Layer-1, performs well enough. Layer-1 was originally developed for a target bitrate of 192 kbps per channel. It is used as "PASC" within the DCC recorder.

For lower bitrates (64 kbps per channel or even less), the Layer-2 format suffers from its built-in limitations, and with decreasing bitrate, artefacts become more and more audible. This is the domain of the most powerful MPEG audio format, Layer-3. It specifies a set of unique features that all address one goal: to preserve as much sound quality as possible even at very low bitrates.

Q: Wait a second! I understand that Layer-3 has been an important asset to the MPEG-1 standard, to address the high-quality low bitrate applications. With the advent of the "low sample rate extension (LSF)" in MPEG-2, is it still necessary to rely on Layer-3 to achieve a high-quality sound at low bitrates?

A: Yes, for sure! Please, don't mix up MPEG-1 and MPEG-2 LSF. MPEG-2 LSF is useful only for applications with limited bandwidth (11.25 kHz, at best).
For applications with full bandwidth, MPEG-1 Layer-3 at 64 or 56 kbps per channel achieves the best sound quality of all ISO codecs. For applications with limited bandwidth, MPEG-2 LSF Layer-3 provides an excellent sound quality at 56 kbps for monophonic speech signals and still a good sound quality at only 64 kbps total bitrate for stereo music signals (with around 10 kHz bandwidth). The latest MPEG ISO listening test (in September 94 at NTT Japan, doc. MPEG 94/437) proved the superior performance of Layer-3 in MPEG-1 and MPEG-2 LSF.

Q: Tell me more about sound quality. How do you assess that?

A: Today, there is no alternative to expensive listening tests. During the ISO-MPEG process, a number of international listening tests have been performed, with a lot of trained listeners. All these tests used the "triple stimulus, hidden reference" method and the "CCIR impairment scale" to assess the sound quality.

The listening sequence is "ABC", with A = original, BC = pair of original / coded signal with random sequence, and the listener has to evaluate both B and C with a number between 1.0 and 5.0. The meaning of these values is:

  5.0 = transparent (this should be the original signal)
  4.0 = perceptible, but not annoying (first differences noticeable)
  3.0 = slightly annoying
  2.0 = annoying
  1.0 = very annoying

Q: Is there really no alternative to listening tests?

A: No, there is not. With perceptual codecs, all traditional "quality" parameters (like SNR, THD+N, bandwidth) are rather useless, as any codec may introduce noise and distortions as long as they do not affect the perceived sound quality. So, listening tests are necessary, and, if carefully prepared and performed, lead to rather reliable results.

Nevertheless, Fraunhofer-IIS works on objective sound quality assessment tools, too. There is already a first product available, the NMR meter, a real-time DSP-based measurement tool that nicely supports the analysis of perceptual audio codecs.
If you need more information about the Noise-to-Mask-Ratio (NMR) technology, feel free to contact nmr@iis.fhg.de.

Q: O.K., back to these listening tests. Come on, tell me some results.

A: Well, for details you should study one of those AES papers or MPEG documents listed above. The main result is that for low bitrates (64 kbps per channel or below), Layer-3 always scored significantly better than Layer-2.

Another important conclusion is the draft recommendation of the task group TG 10/2 within the ITU-R. It recommends the use of low bitrate audio coding schemes for digital sound-broadcasting applications (doc. BS.1115).

Q: Very interesting! Tell me more about this recommendation!

A: The task group TG 10/2 concluded its work in October 93. The draft recommendation defines three fields of broadcast applications:

  - distribution and contribution links (20 kHz bandwidth, no audible impairments with up to 5 cascaded codecs)
    Recommendation: Layer-2 with 180 kbps per channel
  - emission (20 kHz bandwidth)
    Recommendation: Layer-2 with 128 kbps per channel
  - commentary links (15 kHz bandwidth)
    Recommendation: Layer-3 with 60 kbps for monophonic and 120 kbps for stereophonic signals

Q: I see. Medium bitrates - Layer-2, low bitrates - Layer-3. What about a bitrate of 96 kbps per channel that seems to be "somewhere in between" Layer-2 and Layer-3 domains?

A: Interesting question. In fact, a total bitrate of 192 kbps for stereo music is useful for real applications, e.g. emission via satellite channels. The ITU-R required that emission codecs should score at least 4.0 on the CCIR impairment scale, even for the most critical material. At 128 kbps per channel, Dolby's AC-2, Layer-2 and Layer-3 fulfilled this requirement. Finally, Layer-2 got the recommendation mainly because of its "commonality with the distribution and contribution application". Further tests for emission were performed at 192 kbps joint-stereo coding.
Layer-3 clearly met the requirements, Layer-2 fulfilled them only marginally, with doubts remaining during further tests with cascaded codecs in 1993. In the end, the task group decided to pronounce no recommendation for emission at 192 kbps.

Q: Someone told me that in the ITU-R tests, there was some trouble with Layer-3, specifically on male voice in the German language. Still, Layer-3 got the recommendation for "commentary links". Can you explain that?

A: Yes. For commentary links, the quality requirements for speech were to be equivalent to 14-bit linear PCM, and for music, some perceptible impairments were to be tolerated. In the test in 1992, Layer-3 was the only codec that fulfilled these requirements (e.g. overall monophonic, Layer-3 scored 3.6 in contrast to Layer-2 at 2.05 - and for male German speech, Layer-3 scored 4.4 in contrast to Layer-2 at 2.4).

Further tests were performed in 1993 using headphones. They showed that MPEG-1 Layer-3 with monophonic speech (the test item is German male voice) at 60 kbps did not fully meet the quality requirements. The ITU decided to recommend Layer-3 and to include a temporary footnote that will be removed as soon as an improved Layer-3 codec fulfills the requirements completely, i.e. even with that well-known critical male German speech item (for many other speech items, Layer-3 has no trouble at all).

Q: O.K., a Layer-2 codec at low bitrates may sound poor today, but couldn't that be improved in the future? I guess you just told me before that the encoder is not fixed in the standard.

A: Good thinking! As the sound quality mainly depends on the encoder implementation, it is true that there is no such thing as a "Layer-N" quality. So we definitely only know the performance of the reference codecs used during the international tests. Who knows what will happen in the future? What we do know now is:

Today, in MPEG-1 and MPEG-2, Layer-3 provides the best sound quality at low bitrates, by far better than Layer-2.
Tomorrow, both Layers may improve. Layer-2 has been designed as a trade-off between quality and complexity, so the bitstream format allows only limited innovations. In contrast, even the current reference Layer-3 codec does not exploit all of the powerful mechanisms inside the Layer-3 bitstream format.

Q: What other topics do I have to keep in mind? Tell me about the complexity of Layer-3.

A: O.K. First, we have to distinguish between decoder and encoder, as the workload is distributed asymmetrically between them, i.e. the encoder needs much more computation power than the decoder.

For a stereo Layer-3 decoder, you may either use a DSP (e.g. one DSP56002 from Motorola) or an "ASIC", like the masc-programmed DSP chip MAS 3503 C from Intermetall, ITT. Some rough requirements are:

  computation power  around 12 MIPS
  Data ROM           2.5 Kwords
  Data RAM           4.5 Kwords
  Program ROM        2 to 4 Kwords
  word length        at least 20 bit

Intermetall (ITT) estimated an overhead of around 30 % chip area for adding the necessary Layer-3 modules to a Layer-2 decoder. So you need not worry too much about decoder complexity.

For a stereo Layer-3 encoder achieving reference quality, our current real-time implementations use two DSP32C (AT&T) and one DSP56002. With the advent of the 21060 (Analog Devices), even a single-chip stereo encoder comes into view.

Q: Quality, complexity - what about the codec delay?

A: Well, the standard gives some figures of the theoretical minimum delay:

  Layer-1: 19 ms (<50 ms)
  Layer-2: 35 ms (100 ms)
  Layer-3: 59 ms (150 ms)

The practical values are significantly above that. As they depend on the implementation, exact figures are hard to give. So the figures in brackets are just rough rule-of-thumb values - real codecs may show significantly higher values.

Q: For some applications, a very short delay is of critical importance: e.g. in a feedback link, a reporter can only talk intelligibly if the overall delay is below around 10 ms. Here, do I have to forget about MPEG audio altogether?
A: Not necessarily. In this application, broadcasters may use "N-1" switches in the studio to overcome this problem - or they may use equipment with appropriate echo-cancellers.

But with many applications, these delay figures are small enough to present no extra problem. At least, if one can accept a Layer-2 delay, one can most likely also accept the higher Layer-3 delay.

Q: Someone told me that, with Layer-3, the codec delay would depend on the actual audio signal, varying over time. Is this really true?

A: No. The codec delay does not depend on the audio signal. With all Layers, the delay depends on the actual implementation used in a specific codec, so different codecs may have different delays. Furthermore, the delay depends on the actual sample rate and bitrate of your codec.

Q: All in all, you sound as if anybody should use Layer-3 for low bitrates. Why on earth do some vendors still offer only Layer-2 equipment for these applications?

A: Well, maybe because they started to design and develop their systems rather early, e.g. in 1990. As Layer-2 is identical with MUSICAM, it has been available since summer of 1990, at the latest. In that year, Layer-3 development started and could be successfully finished at the end of 1991. So, for a certain time, vendors could only exploit the already existing part of the new MPEG standard.

Now the situation has changed. All Layers are available, the standard is completed, and new systems may capitalize on the full features of MPEG audio.

4. Products

Q: What are the main fields of application for Layer-3?

A: Simply put: all applications that need high-quality sound at very low bitrates to store or transmit music signals.
Some examples are:

  - high-quality music links via ISDN phone lines (basic rate)
  - sound broadcasting via low bitrate satellite channels
  - music distribution in computer networks with low demands for channel bandwidth and memory capacity
  - music memories for solid state recorders based on ROM chips

Q: What kind of Layer-3 products are already available?

A: An increasing number of applications benefit from the advanced features of MPEG audio Layer-3. Here is a list of companies that currently sell Layer-3 products. For further information, please contact these companies directly.

Layer-3 Codecs for Telecommunication:

  - AETA, 361 Avenue du Gal de Gaulle (*)
    F-92140 Clamart, France
    Fax: +33-1-4136-1213 (Mr. Fric)
    (*) products announced for 1995
  - Dialog 4 System Engineering GmbH, Monreposstr. 57
    D-71634 Ludwigsburg, Germany
    Fax: +49-7141-22667 (Mr. Burkhardtsmaier)
  - PKI Philips Kommunikations Industrie, Thurn-und-Taxis-Str. 14
    D-90411 Nuernberg, Germany
    Fax: +49-911-526-3795 (Mr. Konrad)
  - Telos Systems, 2101 Superior Avenue
    Cleveland, OH 44114, USA
    Fax: +1-216-241-4103 (Mr. Church)

Speech Announcement Systems:

  - Meister Electronic GmbH, Koelner Str. 37
    D-51149 Koeln, Germany
    Fax: +49-2203-1701-30 (Mr. Seifert)

PC Cards (Hardware and/or Software):

  - Dialog 4 System Engineering GmbH, Monreposstr. 57
    D-71634 Ludwigsburg, Germany
    Fax: +49-7141-22667 (Mr. Burkhardtsmaier)
  - Proton Data, Marrensdamm 12 b
    D-24944 Flensburg, Germany
    Fax: +49-461-38169 (Mr. Nissen)

Layer-3-Decoder-Chips:

  - ITT Intermetall GmbH, Hans-Bunte-Str. 19
    D-79108 Freiburg, Germany
    Fax: +49-761-517-2395 (Mrs. Mayer)

Layer-3 Shareware Encoder/Decoder:

  - Mailbox System Nuernberg (MSN), Innerer Kleinreuther Weg 21
    D-90408 Nuernberg, Germany
    Fax: +49-911-9933661 (Mr. Hanft)

Shareware (version 1.50) is available for:

  - IBM-PCs or Compatibles with MS-DOS: L3ENC.EXE and L3DEC.EXE should work on practically any PC with 386 type CPU or better. For the encoder, a 486DX33 or better is recommended.
On a 486DX2/66 the current shareware decoder performs in 1:3 real-time, and the shareware encoder in 1:14 real-time (with stereo signals sampled at 44.1 kHz).
- Sun workstations: On a SPARC station 10, the decoder works in real time, the encoder performs in 1:5 real-time.

For more information, refer to chapter 6.

5. Support by Fraunhofer-IIS

Q: I understand that Fraunhofer-IIS has been the main developer of MPEG audio Layer-3. What can they do for me?
A: The Fraunhofer-IIS focuses on applied research. Its engineers have profound expertise in real-time implementations of signal-processing algorithms, especially of Layer-3. The IIS may support a specific Layer-3 application in various ways:
- detailed information
- technical consulting
- advanced C sources for encoder and decoder
- training-on-the-job
- research and development projects on contract basis.

For more information, feel free to contact:
- Fraunhofer-IIS, Weichselgarten 3
  D-91058 Erlangen, Germany
  Fax: +49-9131-776-399 (Mr. Popp)

Q: What are the latest audio demonstrations disclosed by Fraunhofer-IIS?
A: At the Tonmeistertagung (November 1994) in Karlsruhe, Germany, the IIS demonstrated:
- real-time Layer-3 decoder software (mono, 32 kHz fs) including sound output on ProAudioSpectrum running on a 486DX2/66
- playback of Layer-3 stereo files from a CD-ROM that has been produced by Intermetall and contains Layer-3 data of up to 15 h of stereo music (among others, all Beethoven symphonies); the decoder is a small board that is connected to the parallel printer port. It mainly carries 3 chips: a PLD as data interface, the MAS 3503 C stereo decoder chip, and the ASCO Digital-Analog-Converter. The board has two cinch adapters that allow a very simple connection to the usual stereo amplifier.
- music-from-silicon demonstration by using the standard 1 Mbyte EPROMs to store 1.5 minutes of CD-like quality stereo music
- music link (with around 6 kHz bandwidth) via V.34 modem at 28.8 kbps and one analog phone line

6. Shareware Information

The Layer 3 Shareware is copyright Fraunhofer - IIS 1994,1995. The shareware packages are available:

- via anonymous ftp from fhginfo.fhg.de (153.96.1.4)
  You may download our Layer-3 audio software package from the directory /pub/layer3. You will find the following files:

  For IBM PCs:
  l3v150d1.txt          a short description of the files found in l3v150.zip
  l3v150d1.zip          encoder, decoder and documentation
  l3v150d2.txt          a short description of the files found in l3v150n.zip
  l3v150d2.zip          sample bitstreams

  For SUN workstations:
  l3v150.sun.txt        short description of the files found in l3v100.sun.tar.gz
  l3v150.sun.tar.gz     encoder, decoder and documentation
  l3v150bit.sun.txt     short description of the files found in l3v150bit.sun.tar.gz
  l3v150bit.sun.tar.gz  sample bitstreams

- via direct modem download (up to 14,400 bps)
  Modem telephone number: +49 911 9933662 Name: FHG
  Packet switching network: (0) 262 45 9110 10290 Name: FHG
  (For the telephone number, replace "+" with your appropriate international dial prefix, e.g. "011" for the USA.)
  Follow the menus as desired.

- via shipment of diskettes (only including registration)
  You may order a diskette directly from:
  Mailbox System Nuernberg (MSN)
  Hanft & Hartmann
  Innerer Kleinreuther Weg 21
  D-90408 Nuernberg, Germany
  Please note: MSN will only ship a diskette after the registration fee has been paid. The registration fee is 85 Deutsche Mark (about 50 US$) (plus sales tax, if applicable) for one copy of the package. The preferred method of payment is via credit card. Currently, MSN accepts VISA, Master Card / Eurocard / Access credit cards. For details see the file REGISTER.TXT found in the shareware package.
  You may reach MSN also
  via Internet: msn@iis.fhg.de
  or via Fax: +49 911 9933661
  or via BBS: +49 911 9933662 Name: FHG
  or via X25: 0262 45 9110 10290 Name: FHG
  (e.g. in USA, please replace "+" with "011")

- via email
  You may get our shareware also by a direct request to msn@iis.fhg.de.
In this case, the shareware is split into about 30 small uuencoded parts... SOFTWARE: MPEG Audio Layer 3 Shareware Codec and Windows Realtime Player ---------------------------------------------------------------- MPEG Audio Codec and Windows REALTIME Player from Fraunhofer IIS ---------------------------------------------------------------- Fraunhofer IIS announces l3enc/l3dec V2.00 and WinPlay3 V1.00. For high quality audio compression, the shareware l3enc/l3dec V2.00 package is available for Linux, SUN, NeXT and DOS on Versions for SGI and HP will follow soon. The shareware package for DOS includes a demo version of WinPlay3, a Windows MPEG Audio Layer 3 realtime-player. With MPEG Audio Layer 3 you can get a 12:1 compression with a CD like quality. Instead of 12 MByte / minute (stereo 44.1 kHz) you only need about 1 Mbyte / minute! More information can be found on or contact - via direct modem download (up to 14.400 bps) Modem telephone number : +49 911 9933662 Name: FHG Packet switching network: (0) 262 45 9110 10290 Name: FHG (For the telephone number, replace "+" with your appropriate international dial prefix, e.g. "011" for the USA.) Follow the menus as desired. - via shipment of diskettes (only including registration) You may order a diskette directly from: Mailbox System Nuernberg (MSN) Hanft & Hartmann Innerer Kleinreuther Weg 21 D-90408 Nuernberg, Germany Please note: MSN will only ship a diskette if they get paid for the registration fee before. The registration fee is 85 Deutsche Mark (about 50 US$) (plus sales tax, if applicable) for one copy of the package. The preferred method of payment is via credit card. Currently, MSN accepts VISA, Master Card / Eurocard / Access credit cards. For details see the file REGISTER.TXT found in the shareware package. You may reach MSN also via Internet: msn@iis.fhg.de or via Fax: +49 911 9933661 or via BBS: +49 911 9933662 Name: FHG or via X25: 0262 45 9110 10290 Name: FHG (e.g. 
in USA, please replace "+" with "011")
- via email
  You may get our shareware also by a direct request to msn@iis.fhg.de. In this case, the shareware is split into about 30 small uuencoded parts...

Harald Popp
Audio & Multimedia ("Music is the *BEST*" - F. Zappa)
Fraunhofer-IIS-A, Weichselgarten 3, D-91058 Erlangen, Germany
Phone: +49-9131-776-340
Fax: +49-9131-776-399
email: popp@iis.fhg.de

P.S.: Look out for planetoid #3834!

-------------------------------------------------------------------------------
~Subject: What is MPEG-1+ ?

This was a little mail-talk between harti@harti.de (Stefan Hartmann) and hgordon@system.xingtech.com.

Q: What is MPEG-1+ ?
It's MPEG-1 at MPEG-2 (CCIR) resolution. It may be used in TV set-top boxes for broadcasting or video-on-demand projects to enhance the picture quality.

Q: I see. Is this a new standard ?
No. MPEG-1 allows the definition of frames up to 4000x4000 pixels, but that is usually not used.

Q: So what's different ?
I understand that the effective resolution is approximately 550 x 480. Typical datarates are 3.5Mbps - 5.5Mbps (sports programming and perhaps movies are higher).

Q: Is the video quality lower than with real MPEG-2 movies ?
The quality is better than cable TV, and in my area, we don't have cable. They de-interlace and compress the full frames. My understanding is that this is about 5%-10% less efficient than taking advantage of MPEG-2 interfield motion vectors.

Q: If the fields are deinterlaced, do you see the interlace artifacts, so that a moving object in one field has already moved further in one direction than in the other field ?
Probably the TV receiver also outputs it interlaced again to the TV set, so this does not produce the interlace artifact seen on PCs with live video windows displaying both fields....

Q: Can you record this anyhow on a VCR ? Does the SAT-Receiver have a video-output, so you can record movies to tape ?
You should be able to record to tape, though they may have some record blocking hardware which has to be overcome with video stabilizing hardware.

Q: What kind of realtime encoders do they use at the broadcast station ?
CLI (Compression Labs) is the manufacturer, using C-Cube chipsets (10 CL-4000's per MPEG-1+ encoder).

Q: Is there any written info about this MPEG-1 Plus technology available on the net ?
Not that I'm aware of. Maybe C-Cube has a Web site.

[So it's up to you, dear reader, to find more and to tell me where it is ;o) ]

Frank Gadegast, phade@powerweb.de

-------------------------------------------------------------------------------
~Subject: What is MPEG-2?

MPEG-2 FAQ version 3.7 (May 11, 1995) by Chad Fogg (cfogg@chromatic.com)

The MPEG (Moving Pictures Experts Group) committee began its life in late 1988 by the hand of Leonardo Chiariglione and Hiroshi Yasuda with the immediate goal of standardizing video and audio for compact discs. Over the next few years, participation amassed from international technical experts in the areas of Video, Audio, and Systems, reaching over 200 participants by 1992. By the end of the third year (1990), a syntax emerged, which when applied to code SIF video and compact disc audio sample rates at a combined coded bitrate of 1.5 Mbit/sec, approximated the perceptual quality of consumer video tape (VHS). After demonstrations proved that the syntax was generic enough to be applied to bit rates and sample rates far higher than the original primary target application, a second phase (MPEG-2) was initiated within the committee to define a syntax for efficient representation of broadcast video. Efficient representation of interlaced (broadcast) video signals was more challenging than the progressive (non-interlaced) signals coded by MPEG-1. Similarly, MPEG-1 audio was capable of only directly representing two channels of sound. MPEG-2 would introduce a scheme to decorrelate multichannel discrete surround sound audio.
Need for a third phase (MPEG-3) was anticipated in 1991 for High Definition Television, although it was later discovered by late 1992 and 1993 that the MPEG-2 syntax simply scaled with the bit rate, obviating the third phase. MPEG-4 was launched in late 1992 to explore the requirements of a more diverse set of applications, while finding a more efficient means of coding low bit rate/low sample rate video and audio signals. Today, MPEG (video and systems) is the exclusive syntax of the United States Grand Alliance HDTV specification, the European Digital Video Broadcasting Group, and the high density compact disc (led by rivals Sony/Philips and Toshiba).

What is MPEG video syntax ?

MPEG video syntax provides an efficient way to represent image sequences in the form of more compact coded data. The language of the coded bits is the syntax. For example, a few tokens can represent an entire block of 64 samples. MPEG also describes a decoding (reconstruction) process where the coded bits are mapped from the compact representation into the original, raw format of the image sequence. For example, a flag in the coded bitstream signals whether the following bits are to be decoded with a DCT algorithm or with a prediction algorithm. The algorithms comprising the decoding process are regulated by the semantics defined by MPEG. This syntax can be applied to exploit common video characteristics such as spatial redundancy, temporal redundancy, uniform motion, spatial masking, etc.

MPEG Myths

A brief summary of myths.

1. Compression Ratios over 100:1

Articles in the press and marketing literature will often make the claim that MPEG can achieve high quality video with compression ratios over 100:1. These figures often include the oversampling factors in the source video. In reality, the coded sample rate specified in an MPEG image sequence is usually not much larger than 30 times the specified bit rate.
Pre-compression through subsampling is chiefly responsible for 3 digit ratios for all video coding methods, including those of the non-MPEG variety.

2. MPEG-1 is 352x240

Both MPEG-1 and MPEG-2 video syntax can be applied at a wide range of bitrates and sample rates. The MPEG-1 that most people are familiar with has parameters of 30 SIF pictures (352 pixels x 240 lines) per second and a bitrate less than 1.86 megabits/sec, a combination known as "Constrained Parameters Bitstreams". This popular interoperability point is promoted by Compact Disc Video (White Book). In fact, it is syntactically possible to encode picture dimensions as high as 4095 x 4095 and bitrates up to 100 Mbit/sec. With the advent of the MPEG-2 specification, the most popular combinations have coagulated into Levels, which are described later in this text. The two most common are affectionately known as SIF (e.g. 352 pixels x 240 lines x 30 frames/sec), or Low Level, and CCIR 601 (e.g. 720 pixels/line x 480 lines x 30 frames/sec), or Main Level.

3. Motion Compensation displaces macroblocks from previous pictures

Macroblock predictions are formed out of arbitrary 16x16 pixel (or 16x8 in MPEG-2) areas from previously reconstructed pictures. There are no boundaries which limit the location of a macroblock prediction within the previous picture, other than the edges of the picture.

4. Display picture size is the same as the coded picture size

In MPEG, the display picture size and frame rate may differ from the size (resolution) and frame rate encoded into the bitstream. For example, a regular pattern of pictures in a source image sequence may be dropped (decimated), and then each picture may itself be filtered and subsampled prior to encoding. Upon reconstruction, the picture may be interpolated and upsampled back to the source size and frame rate. In fact, the three fundamental phases (Source Rate, Coded Rate, and Display Rate) may differ by several parameters.
The MPEG syntax can separately describe Coded and Display Rates through sequence_headers, but the Source Rate is known only by the encoder.

5. Picture coding types (I, P, B) all consist of the same macroblock types.

All macroblocks within an I picture must be coded Intra (like a baseline JPEG picture). However, macroblocks within a P picture may either be coded as Intra or Non-intra (temporally predicted from a previously reconstructed picture). Finally, macroblocks within the B picture can be independently selected as either Intra, Forward predicted, Backward predicted, or both forward and backward (Interpolated) predicted. The macroblock header contains an element, called macroblock_type, which can flip these modes on and off like switches. macroblock_type is possibly the single most powerful element in the whole of video syntax. Picture types (I, P, and B) merely enable macroblock modes by widening the scope of the semantics. The component switches are:

1. Intra or Non-intra
2. Forward temporally predicted (motion_forward)
3. Backward temporally predicted (motion_backward) (2+3 in combination represent "Interpolated")
4. conditional replenishment (macroblock_pattern).
5. adaptation in quantization (macroblock_quantizer).
6. temporally predicted without motion compensation

The first 5 switches are mostly orthogonal (the 6th is derived from the 1st and 2nd in P pictures, and does not exist in B pictures). Some switches are non-applicable in the presence of others. For example, in an Intra macroblock, all 6 blocks by definition contain DCT data, therefore there is no need to signal either the macroblock_pattern or any of the temporal prediction switches. Likewise, when there is no coded prediction error information in a Non-intra macroblock, the macroblock_quantizer signal would have no meaning.

6. Sequence structure is fixed to a specific I,P,B frame pattern.
A sequence may consist of almost any pattern of I, P, and B pictures (there are a few minor semantic restrictions on their placement). It is common in industrial practice to have a fixed pattern (e.g. IBBPBBPBBPBBPBB), however, more advanced encoders will attempt to optimize the placement of the three picture types according to local sequence characteristics in the context of more global characteristics. Each picture type carries a penalty when coupled with the statistics of a particular picture (temporal masking, occlusion, motion activity, etc.). The variable length codes of the macroblock_type switch provide a direct clue, but it is the full scope of semantics of each picture type that spells out the costs and benefits. For example, if the image sequence changes little from frame-to-frame, it is sensible to code more B pictures than P. Since B pictures by definition are never fed back into the prediction loop (i.e. not used as prediction for future pictures), bits spent on the picture are wasted in a sense (B pictures are like temporal spackle). Application requirements also govern picture type placement: random access points, mismatch/drift reduction, channel hopping, program indexing, and error recovery & concealment.

The 6 Steps to Claiming Bogously High Compression Ratios:

MPEG video is often quoted as achieving compression ratios over 100:1, when in reality the sweet spot rests between 8:1 and 30:1. Here's how the fabled greater than 100:1 reduction ratio is derived for the popular Compact Disc Video (White Book) bitrate of 1.15 Mbit/sec.

Step 1. Start with the oversampled rate

Most MPEG video sources originate at a higher sample rate than the "target" sample rate encoded into the final MPEG bitstream. The most popular studio signal, known canonically as D-1 or CCIR 601 digital video, is coded at 270 Mbit/sec.
The constant, 270 Mbit/sec, can be derived as follows:

Luminance (Y): 858 samples/line x 525 lines/frame x 30 frames/sec x 10 bits/sample ~= 135 Mbit/sec
R-Y (Cr): 429 samples/line x 525 lines/frame x 30 frames/sec x 10 bits/sample ~= 68 Mbit/sec
B-Y (Cb): 429 samples/line x 525 lines/frame x 30 frames/sec x 10 bits/sample ~= 68 Mbit/sec
Total: 27 million samples/sec x 10 bits/sample = 270 Mbit/sec.

So, our compression ratio is: 270/1.15... an amazing 235:1 !!

Step 2. Include blanking intervals

Only 720 out of the 858 luminance samples per line contain active picture information. In fact, the debate over the true number of active samples is the cause of many hair-pulling cat-fights at TV engineering seminars and conventions, so it is safer to say that the number lies somewhere between 704 and 720. Likewise, only 480 lines out of the 525 lines contain active picture information. Again, the actual number is somewhere between 480 and 496. For the purposes of MPEG-1's and MPEG-2's famous conformance points (Constrained Parameters Bitstreams and Main Level, respectively), the number shall be 704 samples x 480 lines for luminance, and 352 samples x 480 lines for each of the two chrominance pictures. Recomputing the source rate with the upper figure of 720 active luminance samples per line (360 per chrominance component), we arrive at:

(luminance) 720 samples/line x 480 lines x 30 fps x 10 bits/sample ~= 104 Mbit/sec
(chrominance) 2 components x 360 samples/line x 480 lines x 30 fps x 10 bits/sample ~= 104 Mbit/sec
Total: ~ 207 Mbit/sec

(With 704 and 352 samples per line, each of the two figures would be about 101 Mbit/sec.)

The ratio (207/1.15) is now only 180:1

Step 3. Include higher bits/sample

The MPEG sample precision is 8 bits. Studio equipment often quantizes samples with 10 bits of accuracy. The 2-bit improvement to the dynamic range is considered useful for suppressing noise in multi-generation video.

The ratio is now only 180 * (8/10), or 144:1

Step 4. Include higher chroma ratio

The famous CCIR-601 studio signal represents the chroma signals (Cb, Cr) with half the horizontal sample density of the luminance signal, but with full vertical resolution. This particular ratio of subsampled components is known as 4:2:2. However, MPEG-1 and MPEG-2 Main Profile specify the exclusive use of the 4:2:0 format, deemed sufficient for consumer applications, where both chrominance signals have exactly half the horizontal and vertical resolution of the luminance (the MPEG Studio Profile, however, centers around the 4:2:2 macroblock structure). Seen from the perspective of pixels being comprised of samples from multiple components, the 4:2:2 signal can be expressed as having an average of 2 samples per pixel (1 for Y, 0.5 for Cb, and 0.5 for Cr). Thanks to the reduction in the vertical direction (resulting in a 352 x 240 chrominance frame), the 4:2:0 signal would, in effect, have an average of 1.5 samples per pixel (1 for Y, and 0.25 for Cb and Cr each). Our source video bit rate may now be recomputed as:

720 pixels x 480 lines x 30 fps x 8 bits/sample x 1.5 samples/pixel = 124 Mbit/sec

... and the ratio is now 108:1.

Step 5. Include pre-subsampled image size

As a final act of pre-compression, the CCIR 601 frame is converted to the SIF frame by a subsampling of 2:1 in both the horizontal and vertical directions.... or 4:1 overall. Quality horizontal subsampling can be achieved by the application of a simple FIR filter (7 or 4 taps, for example), and vertical subsampling by either dropping every other field (in effect, dropping every other line) or again by an FIR filter (regulated by an interfield motion detection algorithm). Our ratio now becomes:

352 pixels x 240 lines x 30 fps x 8 bits/sample x 1.5 samples/pixel ~= 30 Mbit/sec !!

... and the ratio is now only 26:1

Thus, the true A/B comparison should be between the source sequence at the 30 Mbit/sec stage (the actual sample rate specified in the MPEG bitstream) and the reconstructed sequence produced from the 1.15 Mbit/sec coded bitstream.

Step 6. Don't forget the 3:2 pulldown

A majority of high-end programs originate from film. Most of the movies encoded onto Compact Disc Video were captured and reproduced at 24 frames/sec. So, in such an image sequence, 6 out of the 30 frames every second are in fact redundant and need not be coded into the MPEG bitstream, leading to the shocking discovery that the actual source bit rate has really been 24 Mbit/sec all along, and the compression ratio a mere 21:1 !!!

Even at the seemingly modest 20:1 ratio, discrepancies will appear between the 24 Mbit/sec source sequence and the reconstructed sequence. Only conservative ratios in the neighborhood of 8:1 have demonstrated true transparency for sequences with complex spatial-temporal characteristics (i.e. rapid, divergent motion and sharp edges, textures, etc.). However, if the video is carefully encoded by means of pre-processing and intelligent distribution of bits, higher ratios can be made to appear at least artifact-free.

What are the parts of the MPEG document?

The MPEG-1 specification (official title: ISO/IEC 11172 Information technology--Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s, Copyright 1993.) consists of five parts. Each document is a part of the ISO/IEC number 11172. The first three parts reached International Standard in 1993. Part 4 reached IS in 1994. In mid 1995, Part 5 will go IS.

Part 1---Systems: The first part of the MPEG standard has two primary purposes: 1). a syntax for transporting packets of audio and video bitstreams over digital channels and storage mediums (DSM), 2). a syntax for synchronizing video and audio streams.
Part 2---Video: describes syntax (header and bitstream elements) and semantics (algorithms telling what to do with the bits). Video breaks the image sequence into a series of nested layers, each containing a finer granularity of sample clusters (sequence, picture, slice, macroblock, block, sample/coefficient). At each layer, algorithms are made available which can be used in combination to achieve efficient compression. The syntax also provides a number of different means for assisting decoders in synchronization, random access, buffer regulation, and error recovery. The highest layer, sequence, defines the frame rate and picture pixel dimensions for the encoded image sequence.

Part 3---Audio: describes syntax and semantics for three classes of compression methods. Known as Layers I, II, and III, the classes trade increased syntax and coding complexity for improved coding efficiency at lower bitrates. Layer II is the industrial favorite, applied almost exclusively in satellite broadcasting (Hughes DSS) and compact disc video (White Book). Layer I has similarities in terms of complexity, efficiency, and syntax to the Sony MiniDisc and the Philips Digital Compact Cassette (DCC). Layer III has found a home in ISDN, satellite, and Internet audio applications. The sweet spots for the three layers are 384 kbit/sec (DCC), 224 kbit/sec (CD Video, DSS), and 128 kbit/sec (ISDN/Internet), respectively.

Part 4---Conformance: (circa 1992) defines the meaning of MPEG conformance for all three parts (Systems, Video, and Audio), and provides two sets of test guidelines for determining compliance in bitstreams and decoders. MPEG does not directly address encoder compliance.

Part 5---Software Simulation: Contains an example ANSI C language software encoder and compliant decoder for video and audio. An example systems codec is also provided which can multiplex and demultiplex separate video and audio elementary streams contained in computer data files.
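
As a cross-check of the six-step compression ratio walk-through given earlier under "The 6 Steps to Claiming Bogously High Compression Ratios", the arithmetic can be replayed in a few lines of Python. This is only a sketch: the function and variable names are invented for illustration, and stage 2 assumes 720 active luminance samples per line, which is the figure that reproduces the ~207 Mbit/sec total quoted there.

```python
# Hypothetical helper names; the numbers are the ones quoted in the six steps.
CODED_RATE = 1.15e6  # bit/sec, Compact Disc Video (White Book) video bitrate


def source_rate(samples_per_line, lines, fps, bits_per_sample, samples_per_pixel):
    """Uncompressed bit rate in bit/sec for one stage of the derivation."""
    return samples_per_line * lines * fps * bits_per_sample * samples_per_pixel


STAGES = [
    # (label, samples/line, lines, frames/sec, bits/sample, samples/pixel)
    ("1. full D-1 raster incl. blanking, 4:2:2", 858, 525, 30, 10, 2.0),
    ("2. active picture only (720 assumed)", 720, 480, 30, 10, 2.0),
    ("3. 8 bits/sample instead of 10", 720, 480, 30, 8, 2.0),
    ("4. 4:2:0 chroma (1.5 samples/pixel)", 720, 480, 30, 8, 1.5),
    ("5. SIF picture size", 352, 240, 30, 8, 1.5),
    ("6. film source at 24 frames/sec", 352, 240, 24, 8, 1.5),
]


def ratios():
    """Return (label, Mbit/sec, ratio against CODED_RATE) for every stage."""
    result = []
    for label, *params in STAGES:
        rate = source_rate(*params)
        result.append((label, rate / 1e6, rate / CODED_RATE))
    return result


if __name__ == "__main__":
    for label, mbit, ratio in ratios():
        print(f"{label:45s} {mbit:6.1f} Mbit/sec  {ratio:5.0f}:1")
```

Each stage only changes one factor of the product, which makes it easy to see that most of the "compression" happens before the MPEG encoder ever sees a sample.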
As of March 1995, the MPEG-2 volume consists of a total of 9 parts under ISO/IEC 13818. Part 2 was jointly developed with the ITU-T, where it is known as recommendation H.262. The full title is: Information Technology--Generic Coding of Moving Pictures and Associated Audio. ISO/IEC 13818. The first five parts are organized in the same fashion as MPEG-1 (System, Video, Audio, Conformance, and Software). The four additional parts are listed below:

Part 6 Digital Storage Medium Command and Control (DSM-CC): provides a syntax for controlling VCR-style playback and random-access of bitstreams encoded onto digital storage mediums such as compact disc. Playback commands include Still frame, Fast Forward, Advance, Goto.

Part 7 Non-Backwards Compatible Audio (NBC): addresses the need for a new syntax to efficiently decorrelate discrete multichannel surround sound audio. By contrast, MPEG-2 audio (13818-3) attempts to code the surround channels as ancillary data to the MPEG-1 backwards-compatible Left and Right channels. This allows existing MPEG-1 decoders to parse and decode only the two primary channels while ignoring the side channels (parse to /dev/null). This is analogous to the Base Layer concept in MPEG-2 Scalable video. NBC candidates include non-compatible syntaxes such as Dolby AC-3. Final document is not expected until 1996.

Part 8 10-bit video extension. Introduced in late 1994, this extension to the video part (13818-2) describes the syntax and semantics for the coded representation of video with 10 bits of sample precision. The primary application is studio video (distribution, editing, archiving). Methods have been investigated by Kodak and Tektronix which employ Spatial scalability, where the 8-bit signal becomes the Base Layer, and the 2-bit differential signal is coded as an Enhancement Layer. Final document is not expected until 1997 or 1998.
[Part 8 will be withdrawn]

Part 9 Real-time Interface (RTI): defines a syntax for video on demand control signals between set-top boxes and head-end servers.

What is the evolution of an MPEG/ISO document?

In chronological order:

Abbr.  ISO/Committee notation            Author's notation
-----  --------------------------------  -----------------------------
-      Problem (unofficial first stage)  barroom witticism or dare
NI     New work Item                     Napkin Item
NP     New Proposal                      Need Permission
WD     Working Draft                     We're Drunk
CD     Committee Draft                   Calendar Deadlock
DIS    Draft International Standard      Doesn't Include Substance
IS     International Standard            Induced patent Statements

Introductory paper to MPEG?

Didier Le Gall, "MPEG: A Video Compression Standard for Multimedia Applications," Communications of the ACM, April 1991, Vol.34, No.4, pp. 47-58

MPEG in periodicals?

The following journals and conferences have been known to contain information relating to MPEG:

IEEE Transactions on Consumer Electronics
IEEE Transactions on Broadcasting
IEEE Transactions on Circuits and Systems for Video Technology
Advanced Electronic Imaging
Electronic Engineering Times (EE Times)
IEEE Int'l Conference on Acoustics, Speech, and Signal Processing (ICASSP)
International Broadcasting Convention (IBC)
Society of Motion Pictures and Television Engineers Journal (SMPTE)
SPIE conference on Visual Communications and Image Processing

MPEG Book?

Several MPEG books are under development. An MPEG book will be produced by the same team behind the JPEG book: Joan Mitchell and Bill Pennebaker.... along with Didier Le Gall. It is expected to be a tutorial on MPEG-1 video and some MPEG-2 video, to be published by Van Nostrand Reinhold in 1995. A book in the Japanese language has already been published (ISBN: 4-7561-0247-6). It is titled MPEG and published by ASCII publishing. Keith Jack's second edition of Video Demystified, to be published in August 1995, will feature a large chapter on MPEG video.
Information: ftp://ftp.pub.netcom/pub/kj/kjack/

MPEG is a DCT based scheme?

The DCT and Huffman algorithms receive the most press coverage (e.g. "MPEG is a DCT based scheme with Huffman coding"), but are in fact less significant when compared to the variety of coding modes signaled to the decoder as context-dependent side information. The MPEG-1 and MPEG-2 IDCT has the same definition as that of H.261, H.263, and JPEG.

What are constant and variable bitrate streams?

Constant bitrate streams are buffer regulated to allow continuous transfer of coded data across a constant rate channel without causing an overflow or underflow to a buffer on the receiving end. It is the responsibility of the encoder's Rate Control stage to generate bitstreams which prevent buffer overflow and underflow. The constant bit rate encoding can be modeled as a reservoir: variable sized coded pictures flow into the bit reservoir, but the reservoir is drained at a constant rate into the communications channel. The most challenging aspect of a constant rate encoder is, yes, to maintain constant channel rate (without overflowing or underflowing a buffer of a fixed depth) while maintaining constant perceptual picture quality.

In the simplest form, variable rate bitstreams do not obey any buffer rules, but will maintain constant picture quality. Constant picture quality is easiest to achieve by holding the macroblock quantizer step size constant (e.g. level 16 of 31). In its most advanced form, a variable bitrate stream may be more difficult to generate than constant bitrate streams. In advanced variable bitrate streams, the instantaneous bit rate (piece-wise bit rate) may be controlled by factors such as:

1. local activity measured against activity over large time intervals (e.g. the full span of a movie), or
2. instantaneous bandwidth availability of a communications channel.
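
The reservoir picture of constant bitrate coding described above can be illustrated with a small simulation. The sketch below is not the normative MPEG buffer model; it is a simplified encoder-side reservoir in which each coded picture is added in one step and the channel drains a fixed number of bits per picture period. The function name, the event labels, and all picture sizes in the usage example are assumptions chosen for illustration.

```python
def simulate_reservoir(picture_bits, channel_rate, fps, depth):
    """Track the fullness of a bit reservoir drained at a constant rate.

    picture_bits: coded size of each picture, in bits (variable)
    channel_rate: constant channel rate in bit/sec
    fps:          pictures per second
    depth:        reservoir capacity in bits

    Returns (trace, events): the fullness after each picture period and a
    list of (picture index, "overflow" or "underflow") tuples.
    """
    drain_per_picture = channel_rate / fps
    fullness = 0.0
    trace, events = [], []
    for index, bits in enumerate(picture_bits):
        fullness += bits                      # coded picture flows in
        if fullness > depth:
            events.append((index, "overflow"))
        fullness -= drain_per_picture         # channel drains at constant rate
        if fullness < 0:
            # Nothing left to send: a real encoder would insert stuffing.
            events.append((index, "underflow"))
            fullness = 0.0
        trace.append(fullness)
    return trace, events


if __name__ == "__main__":
    # Hypothetical picture sizes for an IBBPBBPBBPBBPBB pattern.
    size = {"I": 150_000, "P": 50_000, "B": 20_000}
    pattern = "IBBPBBPBBPBBPBB" * 4
    trace, events = simulate_reservoir(
        [size[t] for t in pattern], channel_rate=1.15e6, fps=30, depth=327_680
    )
    print("peak fullness:", max(trace), "events:", events)
```

A rate control stage would react to the trace, for example by coarsening the quantizer as the reservoir fills and refining it as the reservoir empties.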
Summary of bitstream types

Bitrate type            Applications
----------------------  ------------------------------------------------------
constant-rate           fixed-rate communications channels like the original Compact Disc, digital video tape, single channel-per-carrier broadcast signal, hard disk storage
simple variable-rate    software decoders where the bitstream buffer (VBV) is the storage medium itself (very large). Macroblock quantization scale is typically held constant over a large number of macroblocks.
complex variable-rate   statistical multiplexing (multiple-channel-per-carrier broadcast signals), compact discs and hard disks where the servo mechanisms can be controlled to increase or decrease the channel delivery rate, networked video where overall channel rate is constant but demand is variably shared by multiple users, bitstreams which achieve average rates over very long time averages

What is statistical multiplexing ?

Progressive explanation:

In the simplest coded bitstream, a PCM (Pulse Code Modulation) digital signal, all samples have an equal number of bits. Bit distribution in a PCM image sequence is therefore not only uniform within a picture (bits distributed along zero dimensions), but is also uniform across the full sequence of pictures.

Audio coding algorithms such as MPEG-1's Layer I and II are capable of distributing bits over a one dimensional space, spanned by a frame. In Layer II, for example, an audio channel coded at a bitrate of 128 kbit/sec and sample rate of 44.1 kHz will have frames (which consist of 1152 subband coefficients each) coded with approximately 3344 bits (1152 / 44100 sec x 128000 bit/sec). Some subbands will receive more bits than others.

In block-based still image compression methods which employ 2-D transform coding methods, bits are distributed over a 2 dimensional space (horizontal and vertical) within the block. Further, blocks throughout the picture may contain a varying number of bits as a result, for example, of adaptive quantization.
For example, background sky may contain an average of only 50 bits per block, whereas complex areas containing flowers or text may contain more than 200 bits per block. In the typical adaptive quantization scheme, more bits are allocated to perceptually more complex areas in the picture. The quantization stepsizes can be selected against an overall picture normalization constant, to achieve a target bit rate for the whole picture.

An encoder which generates coded image sequences comprised of independently coded still pictures, such as Motion JPEG video or MPEG Intra picture sequences, will typically generate coded pictures of equal bit size.

MPEG non-intra coding introduces the concept of the distribution of bits across multiple pictures, augmenting the distribution space to 3 dimensions. Bits are now allocated to more complex pictures in the image sequence, normalized by the target bit size of the group of pictures, while at a lower layer, bits within a picture are still distributed according to more complex areas within the picture. Yet in most applications, especially those of the Constant Bitrate class, a restriction is placed in the encoder which guarantees that after a period of time, e.g. 0.25 seconds, the coded bitstream achieves a constant rate (in MPEG, the Video Buffering Verifier regulates the variable-to-constant rate mapping). The mapping of an inherently variable bitrate coded signal to a constant rate allows consistent delivery of the program over a fixed-rate communications channel.

Statistical multiplexing takes the bit distribution model to 4 dimensions: horizontal, vertical, temporal, and program axis. The 4th dimension is enabled by the practice of multiplexing multiple programs (each, for example, with respective video and audio bitstreams) on a common data carrier.
In the Hughes DSS system, a single data carrier is modulated with a payload capacity of 23 Mbits/sec, but a typical program will be transported at an average bit rate of 6 Mbit/sec. In the 4-D model, bits may be distributed according to the relative complexity of each program against the complexities of the other programs of the common data carrier. For example, a program undergoing a rapid scene change will be assigned the highest bit allocation priority, whereas the program with a near-motionless scene will receive the lowest priority, or fewest bits.

How does MPEG achieve compression?

Here are some typical statistical conditions addressed by specific syntax and semantic tools:

1. Spatial correlation: transform coding with 8x8 DCT.

2. Human Visual Response---less acuity for higher spatial frequencies: lossy scalar quantization of the DCT coefficients.

3. Correlation across wide areas of the picture: prediction of the DC coefficient in the 8x8 DCT block.

4. Statistically more likely coded bitstream elements/tokens: variable length coding of macroblock_address_increment, macroblock_type, coded_block_pattern, motion vector prediction error magnitude, DC coefficient prediction error magnitude.

5. Quantized blocks with sparse quantized matrix of DCT coefficients: end_of_block token (variable length symbol).

6. Spatial masking: macroblock quantization scale factor.

7. Local coding adapted to overall picture perception (content dependent coding): macroblock quantization scale factor.

8. Adaptation to local picture characteristics: block based coding, macroblock_type, adaptive quantization.

9. Constant stepsizes in adaptive quantization: new quantization scale factor signaled only by special macroblock_type codes (adaptive quantization scale not transmitted by default).

10. Temporal redundancy: forward, backwards macroblock_type and motion vectors at macroblock (16x16) granularity.

11. Perceptual coding of macroblock temporal prediction error: adaptive quantization and quantization of DCT transform coefficients (same mechanism as Intra blocks).

12. Low quantized macroblock prediction error: No prediction error for the macroblock may be signaled within macroblock_type. This is the macroblock_pattern switch.

13. Finer granularity coding of macroblock prediction error: Each of the blocks within a macroblock may be coded or not coded. Selective on/off coding of each block is achieved with the separate coded_block_pattern variable-length symbol, which is present in the macroblock only if the macroblock_pattern switch has been set.

14. Uniform motion vector fields (smooth optical flow fields): prediction of motion vectors.

15. Occlusion: forwards or backwards temporal prediction in B pictures. Example: an object becomes temporarily obscured by another object within an image sequence. As a result, there may be an area of samples in a previous picture (forward reference/prediction picture) which has similar energy to a macroblock in the current picture (thus it is a good prediction), but no areas within a future picture (backward reference) are similar enough. Therefore only forwards prediction would be selected by the macroblock_type of the current macroblock. Likewise, a good prediction may only be found in a future picture, but not in the past. In most cases, the object, or correlation area, will be present in both forward and backward references. macroblock_type can select the best of the three combinations.

16. Sub-sample temporal prediction accuracy: bi-linearly interpolated (filtered) "half-pel" block predictions. Real world motion displacements of objects (correlation areas) from picture-to-picture do not fall on integer pel boundaries, but at irrational positions. Half-pel interpolation attempts to extract the true object to within one order of approximation, often improving compression efficiency by at least 1 dB.

17. Limited motion activity in P pictures: skipped macroblocks. A macroblock is skipped when the motion vector is zero for both the horizontal and vertical vector components, and no quantized prediction error for the current macroblock is present. Skipped macroblocks are the most desirable element in the bitstream since they consume no bits, except for a slight increase in the bits of the next non-skipped macroblock.

18. Co-planar motion within B pictures: skipped macroblocks. A macroblock is skipped when the motion vector is the same as the previous macroblock's, and no quantized prediction error for the current macroblock is present.

What is the difference between MPEG-1 and MPEG-2 syntax?

Section D.9 of ISO/IEC 13818-2 is an informative piece of text describing the differences between MPEG-1 and MPEG-2 video syntax. The following is a little more informal.

Sequence layer:

MPEG-2 can represent interlaced or progressive video sequences, whereas MPEG-1 is strictly meant for progressive sequences since the target application was Compact Disc video coded at 1.2 Mbit/sec.

MPEG-2 changed the meaning behind the aspect_ratio_information variable, while significantly reducing the number of defined aspect ratios in the table. In MPEG-2, aspect_ratio_information refers to the overall display aspect ratio (e.g. 4:3, 16:9), whereas in MPEG-1, the ratio refers to the particular pixel. The reduction in the entries of the aspect ratio table also helps interoperability by limiting the number of possible modes to a practical set, much like frame_rate_code limits the number of display frame rates that can be represented. Optional picture header variables called display_horizontal_size and display_vertical_size can be used to code unusual display sizes.

frame_rate_code in MPEG-2 refers to the intended display rate, whereas in MPEG-1 it referred to the coded frame rate. In film source video, there are often 24 coded frames per second.
Prior to bitstream coding, a good encoder will eliminate the redundant 6 frames or 12 fields from a 30 frame/sec video signal which encapsulates an inherently 24 frame/sec video source. The MPEG decoder or display device will then repeat frames or fields to recreate or synthesize the 30 frame/sec display rate. In MPEG-1, the decoder could only infer the intended frame rate, or derive it based on the Systems layer time stamps. MPEG-2 provides specific picture header variables called repeat_first_field and top_field_first which explicitly signal which frames or fields are to be repeated, and how many times.

To address the concern of software decoders which may operate at rates lower or different than the common television rates, two new variables in MPEG-2 called frame_rate_extension_d and frame_rate_extension_n can be combined with frame_rate_code to specify a much wider variety of display frame rates. However, in the current set of defined profiles and levels, these two variables are not allowed to change the value specified by frame_rate_code. Future extensions or Profiles of MPEG may enable them.

In interlaced sequences, the coded macroblock height (mb_height) of a picture must be a multiple of 32 pixels, while the width, like MPEG-1, is a coded multiple of 16 pixels. A discrepancy between the coded width and height of a picture and the variables horizontal_size and vertical_size, respectively, occurs when either variable is not an integer multiple of macroblocks. All pixels must be coded within macroblocks, since there cannot be such a thing as fractional macroblocks. Never intended for display, these overhang pixels or lines exist along the right and bottom edges of the coded picture. The sample values within these trims can be arbitrary, but they can affect the values of samples within the current picture, and especially future coded pictures.
In the current picture, pixels which reside within the same 8x8 block as the overhang pixels are affected by the ripples of DCT quantization error. In future coded pictures, their energy can propagate anywhere within an image sequence as a result of motion compensated prediction. An encoder should fill in values which are easy to code, and should probably avoid creating motion vectors which would cause the Motion Compensated Prediction stage to extract samples from these areas. The application should probably select horizontal_size and vertical_size that are already multiples of 16 (or 32 in the vertical case of interlaced sequences) to begin with.

Group of Pictures:

The concept of the Group of Pictures layer does not exist in MPEG-2. There it is an optional header useful only for establishing a SMPTE time code or for indicating that certain B pictures at the beginning of an edited sequence comprise a broken_link. This occurs when the forward reference frame (previous in time to the current picture) which the current B picture requires for prediction has been removed from the bitstream by an editing process. In MPEG-1, the Group of Pictures header is mandatory, and must follow a sequence header.

Picture layer:

In MPEG-2, a frame may be coded progressively or interlaced, signaled by the progressive_frame variable. In interlaced frames (progressive_frame==0), frames may then be coded as either a frame picture (picture_structure==frame) or as two separately coded field pictures (picture_structure==top_field or picture_structure==bottom_field). Progressive frames are a logical choice for video material which originated from film, where all pixels are integrated or captured at the same time instant. Most electronic cameras today capture pictures in two separate stages: a top field consisting of all odd lines of the picture is captured at nearly one time instant, followed by a bottom field of all even lines.
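Returning to the coded picture size rule from the Sequence layer notes above (width a multiple of 16, height a multiple of 16 for progressive and 32 for interlaced sequences), the arithmetic can be sketched as follows. The function name coded_size and the flag name progressive_sequence are chosen for this illustration only; the rounding is an assumption that simply rounds each dimension up to the next allowed multiple.

```python
def coded_size(horizontal_size, vertical_size, progressive_sequence):
    """Return (coded_width, coded_height) in pixels for a given display size."""
    mb_width = (horizontal_size + 15) // 16  # round width up to whole macroblocks
    if progressive_sequence:
        mb_height = (vertical_size + 15) // 16
    else:
        # Interlaced: each field needs a whole number of macroblock rows,
        # so the frame height is rounded up to a multiple of 32 lines.
        mb_height = 2 * ((vertical_size + 31) // 32)
    return mb_width * 16, mb_height * 16


if __name__ == "__main__":
    for size in [(720, 480), (352, 240), (1920, 1080)]:
        for progressive in (True, False):
            print(size, "progressive" if progressive else "interlaced",
                  "->", coded_size(size[0], size[1], progressive))
```

With 352x240, for example, the progressive case needs no overhang lines, while the interlaced case is coded as 352x256, leaving 16 overhang lines for the encoder to fill with easy-to-code values.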
Frame pictures provide the option of coding each macroblock locally as either field or frame. An encoder may choose field pictures to save memory storage or reduce the end-to-end encoder-decoder delay by one field period.

There is no longer such a thing as D pictures in MPEG-2 syntax. However, Main Profile @ Main Level MPEG-2 decoders, for example, are still required to decode D pictures at Main Level (e.g. 720x480x30 Hz). The usefulness of D pictures, a concept from the year 1990, had evaporated by the time MPEG-2 solidified in 1993.

repeat_first_field was introduced in MPEG-2 to signal that a field or frame from the current frame is to be repeated for purposes of frame rate conversion (as in the 30 Hz display vs. 24 Hz coded example above). On average in a 24 frame/sec coded sequence, every other coded frame would signal the repeat_first_field flag. Thus the 24 frame/sec (or 48 field/sec) coded sequence would become a 30 frame/sec (60 field/sec) display sequence. This process has been known for decades as 3:2 Pulldown. Most movies seen on NTSC displays since the advent of television have been displayed this way. Only within the past decade has it become possible to interpolate motion to create 30 truly unique frames from the original 24. Since the repeat_first_field flag is independently determined in every frame structured picture, the actual pattern can be irregular (it doesn't have to be every other frame literally). An irregularity would occur during a scene cut, for example.

Slice:

To aid implementations which break the decoding process into parallel operations along horizontal strips within the same picture, MPEG-2 introduced a general, mandatory semantic requirement that all macroblock rows must start and end with at least one slice. Since a slice commences with a start code, it can be identified by inexpensively parsing through the bitstream along byte boundaries.
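To illustrate how cheap such a byte-aligned scan is, here is a minimal Python sketch. It assumes the MPEG video start code convention of a 00 00 01 (hex) prefix followed by a one-byte code, with code values 01 through AF used for slices and that byte carrying the slice_vertical_position; these values are not stated in this FAQ and the function name find_slice_start_codes is made up for the example. The test data is synthetic, not a real bitstream.

```python
START_CODE_PREFIX = b"\x00\x00\x01"


def find_slice_start_codes(data: bytes):
    """Return a list of (byte_offset, code_value) for every slice start code."""
    hits = []
    pos = data.find(START_CODE_PREFIX)
    while pos != -1 and pos + 3 < len(data):
        code = data[pos + 3]  # the byte following the prefix identifies the element
        if 0x01 <= code <= 0xAF:  # assumed slice start code range
            hits.append((pos, code))
        pos = data.find(START_CODE_PREFIX, pos + 3)
    return hits


if __name__ == "__main__":
    # Synthetic stream: picture start code, two slices, sequence end code.
    stream = (b"\x00\x00\x01\x00" + b"\xAA" +
              b"\x00\x00\x01\x01" + b"\x12\x34" +
              b"\x00\x00\x01\x02" + b"\x56" +
              b"\x00\x00\x01\xB7")
    for offset, code in find_slice_start_codes(stream):
        print(f"slice start code at byte {offset}, vertical position {code}")
```

No variable length decoding is involved: a parallel decoder could hand each located slice to a separate worker without first parsing the macroblocks in front of it.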
Previously, under MPEG-1, an implementation might have had to parse all the variable length tokens between each slice (thereby completing a significant stage of the decoding process in advance) to know the exact position of each macroblock within the bitstream. In MPEG-1, it was possible to code a picture with only a single slice. Naturally, the mandatory slice per macroblock row restriction also facilitates error recovery.

MPEG-2 also added the concept of the slice_id. This optional 6-bit element signals which picture a particular slice belongs to. In badly mangled bitstreams, the location of the picture headers could become garbled. slice_id allows a decoder to place a slice in the proper location within a sequence. Other elements in the slice header, such as slice_vertical_position, and the macroblock_address_increment of the first macroblock in the slice uniquely identify the exact macroblock position of the slice within the picture. Thus within a window of 64 pictures, a lost slice can find its way.

Macroblock:

Motion vectors are now always represented along a half-pel grid. The usefulness of an integer-pel grid (an option in MPEG-1) diminished with practice. Making half-pel accuracy intrinsic encourages encoders to exploit the significant coding gain which half-pel interpolation offers.

In both MPEG-1 and MPEG-2, the dynamic range of motion vectors is specified on a picture basis. A set of pictures corresponding to a rapid motion scene may need a motion vector range of up to +/- 64 integer pixels. A slower moving interval of pictures may need only a +/- 16 range. Due to the syntax by which motion vectors are signaled in a bitstream, pictures with little motion would suffer unnecessary bit overhead in describing motion vectors in a coordinate system established for a much wider range. MPEG-1's f_code picture header element prescribed a radius shared by horizontal and vertical motion vector components alike.
It later became practice in industry to have a greater horizontal search range (motion vector radius) than vertical, since motion tends to be more prominent across the screen than up or down (vertical). Secondly, a decoder has a limited frame buffer size in which to store both the current picture under decoding and the set of pictures (forward, backward) used for prediction (reference) by subsequent pictures. A decoder can write over the pixels of the oldest reference picture as soon as it no longer is needed by subsequent pictures for prediction. A restricted vertical motion vector range creates a sliding window, which starts at the top of the reference picture and moves down as the macroblocks in the current picture are decoded in raster order. The moment a strip of pixels passes outside this window, they have ended their life in the MPEG decoding loop. As a result of all this, MPEG-2 created separate horizontal and vertical range specifiers (f_code[][0] for horizontal, and f_code[][1] for vertical), and placed greater restrictions on the maximum vertical range than on the horizontal range. In Main Level frame pictures, this range is [-128,+127.5] vertically, and [-1024,+1023.5] horizontally. In field pictures, the vertical range is restricted to [-64,+63.5].

Macroblock stuffing is now illegal in MPEG-2. The original intent behind stuffing in MPEG-1 was to provide a means for finer rate control adjustment at the macroblock layer. Since no self-respecting encoder would waste bits on such an element (it does not contribute to the refinement of the reconstructed video signal), and since this unlimited loop of stuffing variable length codes represents a significant headache for hardware implementations which have a fixed window of time in which to parse and decode a macroblock in a pipeline, the element was eliminated in January 1993 from the MPEG-2 syntax.
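The motion vector ranges quoted above for Main Level fit a simple doubling rule, sketched below in Python. The sketch assumes the usual MPEG-2 convention (not spelled out in this FAQ) that f_code 1 covers [-8,+7.5] pixels and that each increment of f_code doubles the range; the function name motion_vector_range is made up for the illustration.

```python
def motion_vector_range(f_code):
    """Return (lowest, highest) motion vector component in pixels for an f_code."""
    if not 1 <= f_code <= 9:
        raise ValueError("f_code is assumed to lie in 1..9")
    half_pel_span = 16 << (f_code - 1)  # range size in half-pel units, doubles per step
    low = -half_pel_span / 2            # convert half-pel units to pixels
    high = (half_pel_span - 1) / 2
    return low, high


if __name__ == "__main__":
    for f_code in range(1, 10):
        low, high = motion_vector_range(f_code)
        print(f"f_code {f_code}: [{low:+.1f}, {high:+.1f}] pixels")
```

Under this assumption, f_code 5 gives the vertical frame picture range [-128,+127.5], f_code 8 the horizontal range [-1024,+1023.5], and f_code 4 the field picture vertical range [-64,+63.5], which shows why a picture with little motion is better served by a small f_code.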
Some feel that macroblock stuffing was beneficial since it permitted macroblocks to be coded along byte boundaries. A good compromise could have been a limited number of stuffs per macroblock. If stuffing is needed for purposes of rate control, an encoder can pad extra zero bytes before the start code of the next slice. If stuffing is required in the last row of macroblocks of the picture, the picture start code of the next picture can be padded with an arbitrary number of bytes. If the picture happens to be the last in the sequence, the sequence_end_code can be stuffed with zero bytes.

The dct_type flag in both Intra and non-Intra coded macroblocks of frame structured pictures signals that the reconstructed samples output by the IDCT stage shall be organized in field or frame order. This flag provides an encoder with a sort of poor man's motion_type by adapting to the interparity (i.e. interfield) characteristics of the macroblock without signaling a need for motion vectors via the macroblock_type variable. dct_type plays an essential role in Intra frame pictures by organizing lines of a common parity together when there is significant interfield motion within the macroblock. This increases the decorrelation efficiency of the DCT stage. For non-intra macroblocks, dct_type organizes the 16 lines (... luminance, 8 lines chrominance) of the macroblock prediction error. In combination with motion_type, the meaning....

dct_type   motion_format     interpretation
--------   ---------------   --------------
frame      Intra coded       block data is frame correlated
field      Intra coded       block data is more strongly correlated along lines of the same parity
field      Frame predicted   1. a low-cost encoder which only possesses frame motion estimation may use dct_type to decorrelate the prediction error of a prediction which is inherently field by characteristic
                             2. an intelligent encoder realizes that it is more bit efficient to signal frame prediction with field dct_type for the prediction error, than it is to signal a field prediction.
field      Field predicted   A typical scenario. A field prediction tends to form a field-correlated prediction error.
frame      Frame predicted   A typical scenario. A frame prediction tends to form a frame-correlated prediction error.
frame      Field predicted   Makes little sense. If the encoder went through the trouble of finding a field prediction in the first place, why select frame organization for the prediction error?

Prediction modes now include field, frame, Dual Prime, and 16x8 MC. The combinations for Main Profile and Simple Profile are shown below.

Frame pictures

motion_type   motion vectors   fundamental prediction       interpretation
              per MB           block size (after half-pel)
-----------   --------------   --------------------------   --------------
Frame         1                16x16                        same as MPEG-1, with possibly different treatment of prediction error via dct_type
Field         2                16x8                         Two independently coded predictions are made: one for the 8 lines which correspond to the top field, another for the 8 bottom field lines.
Dual Prime    1                16x8                         Uses averaging of two 16x8 prediction blocks from fields of opposite parity to form a prediction for the top and bottom 8 lines. A second vector is derived from the first vector coded in the bitstream.

Field pictures

motion_type   motion vectors   fundamental prediction       interpretation
              per MB           block size (after half-pel)
-----------   --------------   --------------------------   --------------
Field         1                16x16                        same as MPEG-1, with possibly different treatment of prediction error via dct_type
16x8          2                16x8                         Two independently coded predictions are made: one for the upper 8 lines of the macroblock, another for the lower 8 lines.
Dual Prime    1                16x16                        A single prediction is constructed from the average of two 16x16 predictions taken from fields of opposite parity.

Concealment motion vectors can be transmitted in the headers of intra macroblocks to help error recovery.
When the macroblock data that the concealment motion vectors are intended for becomes corrupt, these vectors can be used to specify a concealment 16x16 area to be extracted from the previous picture. These vectors do not affect the normal decoding process, except for motion vector predictions.

Additional chroma_format values were introduced for 4:2:2 and 4:4:4 pictures. Like MPEG-1, Main Profile syntax is strictly limited to 4:2:0 format; however, the 4:2:2 format is the basis of the 4:2:2 Profile (aka Studio Profile). In 4:2:2 mode, all syntax essentially remains the same except where matters of block count are concerned. A coded_block_pattern extension was added to handle signaling of the extra two prediction error blocks. The 4:4:4 format is currently undefined in any Profile.

chroma_format       multiplex order within Macroblock   Application
-----------------   ---------------------------------   -----------
4:2:0 (6 blocks)    YYYYCbCr                            main stream television, consumer entertainment.
4:2:2 (8 blocks)    YYYYCbCrCbCr                        studio production environments, professional editing equipment, distribution and servers
4:4:4 (12 blocks)   YYYYCbCrCbCrCbCrCbCr                computer graphics

Non-linear macroblock quantization was introduced in MPEG-2 to increase the precision of quantization at high bit rates, while increasing the dynamic range for low bit rate use where larger step size is needed. The quantization_scale_code may be selected b