Integrated System Laboratory C3I
Swiss Federal Institute of Technology, EPFL


The first question we would like to answer is: what do we mean nowadays by video processing? In the past, roughly until the end of the 80's, there were two distinct worlds: an analog TV world and a digital computer world. All TV processing from the camera to the receiver was based on analog processing, analog modulation and analog recording. With the progress of digital technology, part of the analog processing could be implemented by digital circuits, with considerable advantages in terms of circuit reproducibility, leading to cost and stability benefits, and of noise sensitivity, leading to quality benefits. At the end of the 80's, completely new video processing possibilities became feasible with digital circuits. Today, image compression and decompression is the dominant digital video processing in terms of importance and complexity in the whole TV chain.

Figure 1. Schematic representation of a TV chain.

In the near future, digital processing will be used to move from standard-resolution TV to HDTV, for which compression and decompression is a must, considering the bandwidth that uncompressed transmission would require. Other applications will be found at the level of the camera, to increase image quality by increasing the number of bits per pixel from 8 to 10 or 12, or by using appropriate processing to compensate for sensor limitations (image enhancement by non-linear filtering and processing). Digital processing will also enter the studio for digital recording, editing and 50/60 Hz standard conversions. Today, the high communication bandwidth required between studio devices by uncompressed digital video, necessary for editing and recording operations, limits the use of fully digital video and digital video processing at the studio level.


Video Compression

Why has video compression become the dominant video processing application of TV? An analog TV channel only needs a 5 MHz analog channel for transmission; conversely, digital video with an 8-bit A/D converter and 720 pixels on 576 lines (54 MHz sampling rate) requires a transmission channel with a capacity of about 168.8 Mbit/s! For digital HDTV with a 10-bit A/D converter and 1920 pixels on 1152 lines, the required capacity rises to about 1.1 Gbit/s! No affordable applications, in terms of cost, are thus possible without video compression.
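The arithmetic behind these figures can be sketched in a few lines; the frame sizes and the 4:2:2 sampling assumption below are illustrative choices that reproduce the order of magnitude quoted above, not the exact studio format accounting:

```python
def uncompressed_bitrate(width, height, fps, bits_per_sample, samples_per_pixel):
    """Raw bit rate of uncompressed digital video, in Mbit/s."""
    return width * height * fps * bits_per_sample * samples_per_pixel / 1e6

# Standard-definition TV: 720x576 active pixels, 25 frames/s, 8-bit samples,
# 4:2:2 chroma (on average 2 samples per pixel: one luma, one chroma).
sd = uncompressed_bitrate(720, 576, 25, 8, 2)

# HDTV: 1920x1152 pixels, 25 frames/s, 10-bit samples, same 4:2:2 assumption.
hd = uncompressed_bitrate(1920, 1152, 25, 10, 2)

print(f"SD: {sd:.1f} Mbit/s")   # on the order of 166 Mbit/s
print(f"HD: {hd:.0f} Mbit/s")   # on the order of 1.1 Gbit/s
```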

These reasons have also raised the need for worldwide video compression standards, so as to achieve interoperability and compatibility among devices and operators. H.261 is the name given to the first digital video compression standard, specifically designed for videoconference applications; MPEG-1 is the one designed for CD storage applications (up to 1.5 Mbit/s); MPEG-2 targets digital TV and HDTV, respectively from 4 up to 9 Mbit/s for TV and up to 20 Mbit/s for HDTV; H.263 targets videoconferencing at very low bit rates (16-128 kbit/s). All these standards can best be considered as a family of standards sharing quite similar processing algorithms and features.

All of them are based on the same basic philosophy:

For TV and HDTV, while there are very few encoders, used by broadcasting companies (in the limit just one per channel), there must be a decoder in each TV set. This implies that any compressed video bit-stream must be decodable without ambiguity, yielding the same video result everywhere: a decoder must be able to decode any bit-stream that respects the decoding syntax, and an encoder must encode the video content into a conformant syntax. Encoding algorithms, on the other hand, are a competitive issue: encoders can be optimized to achieve higher compressed video quality, or simplified so as to obtain a cheap encoder. It also means that in the future, with more processing power available, more and more sophisticated and demanding encoding algorithms can be used to find the best choices within the available encoding syntax. These basic principles of the video compression standards clearly have strong consequences on the architectures implementing video compression. To understand the main processing and architectural issues in video compression, we briefly analyze in more detail the basic processing of the MPEG-2 standard.

MPEG-2 Video Compression

MPEG-2 is a complete standard that specifies all stages from video acquisition up to the interface with the communication protocols. Figure 2 shows a schematic diagram of how MPEG-2 provides a transport layer after compression. Compressed audio and video bit-streams are multiplexed and packetized in a suitable transport format. This part of the processing cannot be classified as video processing and is not considered here in detail.

Figure 2. MPEG-2 transport stream diagram.
Figure 3. Basic processing for MPEG-2 compression.
Figure 4. MPEG-2 pre-filtering and spatial redundancy reduction by DCT.
Figure 5. MPEG-2 spatial redundancy reduction by quantization and entropy coding.
The basic video processing algorithms of MPEG-2 are reported in Figure 3. These algorithms are also found, with some variants, in all the other compression standards mentioned before. The first stage is the conversion of the image from the RGB format to the YUV format, with subsequent filtering and sub-sampling of the chrominance components to yield smaller color images. Images are then partitioned into blocks of 8x8 pixels, and blocks are grouped into macro-blocks of 16x16 pixels. Two main processes are then applied: one is the reduction of spatial redundancy, the other the reduction of temporal redundancy.
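This first stage can be sketched as follows; the BT.601-style conversion coefficients and the simple 2x2 averaging are illustrative choices, not the exact filters an encoder would use:

```python
def rgb_to_ycbcr(r, g, b):
    """Illustrative ITU-R BT.601-style luma/chroma conversion (full range)."""
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.169 * r - 0.331 * g + 0.500 * b + 128
    cr =  0.500 * r - 0.419 * g - 0.081 * b + 128
    return y, cb, cr

def subsample_420(chroma):
    """Average 2x2 neighborhoods: half resolution in both directions (4:2:0)."""
    h, w = len(chroma), len(chroma[0])
    return [[(chroma[y][x] + chroma[y][x + 1] +
              chroma[y + 1][x] + chroma[y + 1][x + 1]) / 4
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

# Pure white maps to maximum luma and neutral (mid-range) chroma.
y, cb, cr = rgb_to_ycbcr(255, 255, 255)
```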
Figure 6. MPEG-2 temporal redundancy reduction by motion compensated prediction.
Spatial redundancy is reduced by applying the DCT transform to blocks and then entropy coding the quantized transform coefficients with Huffman tables. Temporal redundancy is reduced by motion compensation applied to macro-blocks according to the IBBP group-of-pictures structure.

In more detail (see Figures 4 and 5), spatial redundancy is reduced by applying an 8x1 DCT transform 8 times horizontally and 8 times vertically. The transform coefficients are then quantized, which reduces small high-frequency coefficients to zero, scanned in zig-zag order starting from the DC coefficient at the upper-left corner of the block, and coded using Huffman tables, also referred to as Variable Length Coding (VLC).
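A minimal sketch of two of these ingredients, the separable DCT and the zig-zag scan, follows; the DCT here is the naive formula, not one of the fast factorizations discussed later:

```python
import math

def dct_1d(x):
    """Naive 8x1 DCT-II; the separable 2-D DCT applies this to each
    row of the 8x8 block, then to each column of the result."""
    n = len(x)
    return [math.sqrt((1 if k == 0 else 2) / n) *
            sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i in range(n))
            for k in range(n)]

def zigzag_order(n=8):
    """(row, col) pairs in zig-zag order for an n x n block, starting
    from the DC coefficient at the upper-left corner."""
    order = []
    for s in range(2 * n - 1):            # s = row + col: one anti-diagonal
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        if s % 2 == 0:
            diag.reverse()                # even diagonals run bottom-up
        order.extend(diag)
    return order

# A constant block has all its energy in the DC coefficient,
# which the zig-zag scan visits first.
print(dct_1d([10] * 8)[0])     # DC term; the 7 AC terms are ~0
print(zigzag_order()[:4])      # [(0, 0), (0, 1), (1, 0), (2, 0)]
```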

The reduction of temporal redundancy is the process that drastically reduces the bit rate and enables high compression ratios. It is based on the principle of finding the current macro-block in already transmitted pictures, either at the same position in the image or displaced by a so-called "motion vector" (see Figure 6). Since an exact copy of the macro-block is not guaranteed to be found, the macro-block with the lowest average error is chosen as the reference macro-block. The "error macro-block" is then processed to reduce its spatial redundancy, if any, by means of the above-mentioned procedure, and transmitted; the desired macro-block can then be reconstructed given the "motion vector" indicating the reference and the relative error.
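The search for the reference macro-block can be sketched as an exhaustive ("full search") block matching; the sum-of-absolute-differences error used below differs from the average error mentioned above only by a constant factor:

```python
def sad(block, ref, py, px):
    """Sum of absolute differences between a 16x16 macro-block and the
    reference picture area whose top-left corner is at (py, px)."""
    return sum(abs(block[y][x] - ref[py + y][px + x])
               for y in range(16) for x in range(16))

def full_search(block, ref, cy, cx, search_range):
    """Try every displacement (dy, dx) around the co-located position
    (cy, cx) and keep the motion vector with the lowest matching error."""
    best_vec, best_err = (0, 0), float("inf")
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            py, px = cy + dy, cx + dx
            if 0 <= py <= len(ref) - 16 and 0 <= px <= len(ref[0]) - 16:
                err = sad(block, ref, py, px)
                if err < best_err:
                    best_vec, best_err = (dy, dx), err
    return best_vec, best_err
```

Full search is the baseline that dedicated architectures parallelize; practical encoders replace it with faster, suboptimal search strategies.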

Figure 7 reports the so-called MPEG-2 Group of Pictures structure, which shows that images are classified as I (Intra), P (Predicted) and B (Bi-directionally interpolated). The standard specifies that Intra image macro-blocks can only be processed to reduce spatial redundancy; P image macro-blocks can also be processed to reduce temporal redundancy, referring only to past I or P frames; B image macro-blocks can in addition be processed using an interpolation of past and future reference macro-blocks. Obviously, a B macro-block can also be coded as Intra or Predicted if this is found to be convenient for compression. Note that since B pictures can use both past and future I or P frames as references, the MPEG-2 image transmission order differs from the display order: B pictures are transmitted in the compressed bit-stream after their relative I and P pictures.
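The reordering can be illustrated with a toy function; the picture labels are arbitrary, and the sketch assumes the GOP segment ends with a reference picture so no B picture is left without a future reference:

```python
def transmission_order(display_gop):
    """Reorder a GOP from display order to transmission order:
    each reference picture (I or P) is sent before the B pictures
    that are bi-directionally predicted from it."""
    out, pending_b = [], []
    for pic in display_gop:
        if pic[0] in ("I", "P"):
            out.append(pic)          # send the new reference first ...
            out.extend(pending_b)    # ... then the B pictures that needed it
            pending_b = []
        else:
            pending_b.append(pic)
    return out + pending_b

print(transmission_order(["I0", "B1", "B2", "P3", "B4", "B5", "P6"]))
# ['I0', 'P3', 'B1', 'B2', 'P6', 'B4', 'B5']
```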

Figure 7. Structure of an MPEG-2 GOP, showing the reference pictures for motion compensated prediction of P and B pictures.

Complexity of MPEG Video Processing

At the end of the 80's there were many discussions about the complexity of implementing DCT transforms in real time at video rate. Blocks of 8x8 were chosen instead of 16x16 in order to reduce the complexity of the transform. The main objective was to avoid complex processing at the decoder side. With this goal, many optimized DCT implementations have appeared, both as dedicated chips and as software, using a reduced number of multiplications and additions.

Nowadays, digital technology has made such progress in speed and processing performance that DCT coding and decoding are no longer a critical issue. Figure 8 shows a schematic block diagram of an MPEG-2 decoder, which is very similar to those of the other compression standards. A buffer is needed to receive at a constant bit rate the compressed bits, which during decoding are not "consumed" at a constant rate. VLD is a relatively simple process that can be implemented by means of look-up tables or memories. Being a bit-wise process, it cannot be parallelized and is quite inefficient to implement on general-purpose processors. This is the reason why new multimedia processors such as the Philips "Trimedia" use specific VLC/VLD units for entropy coding. The most costly elements of an MPEG-2 decoder are the memories for the storage of past and future reference frames, and the handling of the data flow between the Motion Compensated Interpolator unit and the reference video memories.

Figure 8. Block diagram of an MPEG-2 decoder.

For an MPEG-2 encoder (see Figure 9) the situation is very different. First of all, we can recognize a path that implements a complete MPEG-2 decoder, necessary to reconstruct the reference images as they are found at the decoder side. Then we have a motion estimation block (the bi-directional motion estimator), whose goal is to find the motion vectors, and a block that selects and controls the macro-block encoding modes. As discussed in the previous paragraphs, how to find the best motion vectors, as well as how to choose the right coding mode for each macro-block, is not specified by the standard. Therefore, very simple algorithms (with limited quality performance) or extremely complex ones (with high quality performance) can be implemented for these functions. Moreover, MPEG-2 allows the dynamic definition of the GOP structure, opening up many coding-mode possibilities. In general, there are two critical issues in an MPEG-2 encoder: the motion estimation processor, and the handling of the complex data flow, with the related bandwidth problems, between the original and coded frame memories, the motion estimation processor and the coding control unit.

We also have to mention that the coding modes of MPEG-2 are much more complex than this brief description might suggest. In fact, existing TV is based on interlaced images, and all coding modes can be applied in distinct ways to "frame" blocks and macro-blocks or to "field" blocks and macro-blocks. The same applies to motion estimation, for which both field-based and frame-based vectors can be used. Moreover, all references for predictions can be made on true image pixels or on "virtual" image pixels obtained by bilinear interpolation, as shown in Figure 10.

Figure 9. Block diagram of an MPEG-2 encoder.


Figure 10. MPEG-2 macro-block references can also be made on "virtual" pixels (in red) obtained by bilinear interpolation, instead of image pixels from the original raster (gray).

In this case, motion vectors with half-pixel precision also need to be estimated. The possibility of using all these encoding modes largely increases the quality of the compressed video, but it can become extremely demanding in terms of processing complexity.
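Half-pel addressing can be sketched with coordinates expressed in half-pixel units; the plain /2 and /4 averaging below is the bilinear interpolation idea, without the integer rounding an actual MPEG-2 implementation applies:

```python
def half_pel(ref, y2, x2):
    """Sample the reference picture at half-pixel coordinates.
    (y2, x2) are in half-pel units: even values fall on real pixels,
    odd values on "virtual" pixels obtained by bilinear averaging."""
    y, x = y2 // 2, x2 // 2
    fy, fx = y2 % 2, x2 % 2
    if fy == 0 and fx == 0:
        return ref[y][x]                               # integer position
    if fy == 0:
        return (ref[y][x] + ref[y][x + 1]) / 2         # horizontal half-pel
    if fx == 0:
        return (ref[y][x] + ref[y + 1][x]) / 2         # vertical half-pel
    return (ref[y][x] + ref[y][x + 1] +
            ref[y + 1][x] + ref[y + 1][x + 1]) / 4     # diagonal half-pel
```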

The challenge for MPEG-2 encoder designers is to find the best trade-off between the complexity of the implemented algorithms and the quality of the compressed video. Architectural and algorithmic issues are very tightly related in MPEG-2 encoder architectures.


Digital Video and Computer Graphics

In the past, digital video on computers was equivalent to computer graphics. Unlike in the TV world, all processing was obviously digital, mainly treating synthetic images from 2-D or 3-D models. The concept of a real-time computer graphics application was very approximate, since the application was usually intended to run as fast as possible on the available processors, using graphics accelerators in parallel for the arithmetic operations on pixels.

Figure 11. Sequence of typical computer graphic processing steps.
Figure 11 shows a schematic diagram of the basic computer graphics operations. For each image, 2-D or 3-D models composed of triangles or polygons are placed in the virtual space by the considered application, which can be interactive. The position of each vertex is calculated according to the geometric transformation of the object and projected onto the screen. The texture mapped onto each polygon is transformed according to the light model corresponding to the position of the polygon in space. The screen pixels corresponding to the screen raster are obtained from the "original" texture pixels on the polygon by appropriate filtering operations. Finally, the polygon is displayed on the screen.
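The geometric part of this pipeline (place, transform, project) can be sketched with a rotation and a pinhole projection; the camera distance and focal length below are arbitrary illustrative values:

```python
import math

def rotate_y(v, angle):
    """Geometric transformation step: rotate a vertex about the Y axis."""
    x, y, z = v
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)

def project_vertex(v, camera_z=5.0, focal=2.0):
    """Perspective projection of a 3-D vertex onto the screen plane:
    x_screen = focal * x / (z + camera_z), and the same for y."""
    x, y, z = v
    depth = z + camera_z
    return (focal * x / depth, focal * y / depth)

# Place a triangle in the virtual space, rotate it, then project it.
triangle = [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
screen = [project_vertex(rotate_y(v, math.pi / 6)) for v in triangle]
```

Texture mapping and lighting then operate on the pixels covered by each projected polygon, which is where the bulk of the per-pixel work lies.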
Figure 12. Processing requirements of 3-D graphics content in terms of pixels and polygons per second.
Computer graphics applications strongly rely on the performance of acceleration cards, which are specialized to treat in parallel, with deep pipelines, all these numerous but simple pixel operations. Figure 12 reports a diagram of the processing requirements, in terms of polygons/s and pixels/s, of various graphics contents.


TV, Computer Graphics and Multimedia: MPEG-4?

The new MPEG-4 multimedia standard, defined as a draft ISO international standard in October 1998, takes on the ambitious challenge of bringing together the world of natural video and TV with the world of computers and computer graphics.

In MPEG-4 we can in fact find both natural compressed video and 2-D and 3-D models. The standard is based on the concept of elementary streams, each of which represents and carries the information of a single "object" that can be of any type, "natural" or "synthetic", audio or video.

Figure 13 reports an example of what the content of an MPEG-4 scene can be. Natural and 2-D and 3-D synthetic audio-visual objects are received and composed into a scene as seen by a hypothetical viewer.

Figure 13. Example of the content and construction of an MPEG-4 scene.
Figure 14. Diagram of MPEG-4 System layer and interface with the network layer.
Two virtual levels are necessary to interface the "elementary stream" level with the network level. The first is necessary to multiplex/demultiplex each communication stream into packets, the second to synchronize each packet and build the "elementary streams" carrying the "object" information, as shown in Figure 14.

The processing related to the MPEG-4 Systems layer cannot be considered as video processing, and is very similar to the packet processing typical of network communication.

An MPEG-4 terminal can be schematized as shown in Figure 15. The communication network provides the stream, which is demultiplexed into a set of "elementary streams". Each "elementary stream" is decoded into audio/video objects. Using the scene description transmitted with the elementary streams, all objects are "composed" together in the video memory, according to their size, view angle and position in space, and then "rendered" on the display. The terminal can be interactive, originating upstream data due to user interaction that is sent back to the MPEG-4 encoder.

MPEG-4 systems therefore implement not only the classical MPEG-2-like compression/decompression processing and functionality, but also computer graphics processing such as "composition" and "rendering". The main difference compared to the natural video of MPEG-1, MPEG-2 and H.263 is the introduction of "shape coding", enabling the use of arbitrarily shaped video objects, as illustrated in Figure 16. Shape coding information is based on macro-block data structures and arithmetic coding of the contour information associated with each boundary block.

Figure 15. Illustration of the processing and functionality implemented in an MPEG-4 terminal.
Figure 16. Compressed shape information is necessary for arbitrarily shaped objects.
Figure 17. MPEG-4 decoder block diagram; shape information is coded in parallel with the DCT-based texture coding. Shape coding can be of "Intra" type, or use motion compensation and prediction error like texture coding.
The block diagram of an MPEG-4 encoder is depicted in Figure 17. Architecturally, it is in general very similar to an MPEG-2 encoder block diagram. We can notice a new "shape coding" block in the motion estimation loop, which produces shape coding information transmitted in parallel with the classical texture coding information.


Video Processing Architectures: Generalities

In general, we can classify the circuits implementing video processing into four families:

- Application-Specific Integrated Circuits (ASICs). To this group belong all hardwired circuits specifically designed for a single processing task. The level of programmability is very low, and the circuits are usually clocked at the input/output data sampling rate or multiples of it.
- Application-Specific DSPs (AS-DSPs). These architectures are based on a DSP core plus special functions (such as 1-D and 2-D filters, FFT, graphics accelerators, block-matching engines) that are specific to a selected application.
- Programmable DSPs. These are the classical processor architectures, specialized and efficient for multiply-accumulate operations on 16-, 24- or 32-bit data. The classical well-known families are those of Motorola and Texas Instruments. The level of programmability of these processors is very high, and they are also employed for real-time applications with constant input/output rates.
- General-Purpose Processors (GPPs). These are the classical PC processors (Intel, IBM PowerPC) and workstation processors (Digital Alpha, Sun UltraSparc). Originally designed for general-purpose software applications, they are in general, although very powerful, not really suited to video processing. Moreover, the operating systems employed are not real-time OSs. The design of real-time video applications on these architectures is not as simple a task as it might appear.
Considering the video processing implementations of the last years, in general we can observe the trend over time illustrated in Figure 18. If we consider different video processing algorithms (indicated as Proc. 1, Proc. 2, etc., in order of increasing complexity), such as the DCT on an 8x8 block for instance, we find that implementations based on ASIC architectures appear first in time. After some years, with the evolution of IC technology, these functions can then be implemented in real time by AS-DSPs, then by standard DSPs, and then by GPPs. This trend corresponds to the desire of transferring the complexity of the processing from the hardware architecture to the software implementation. However, this trend does not present only advantages, and does not apply to all implementation cases. Figures 19, 22 and 23 report an illustration of the advantages and disadvantages of each class of architectures, which should be considered case by case. Let us analyze and discuss each feature in detail.
Figure 18. Trend of algorithm implementations versus the time on different architectures.
Figure 19. Conflicting trade-offs for architecture families.
Figure 19 shows how the various families of architectures behave with respect to the two conflicting requirements of real-time performance and flexibility/programmability. For highly resource-demanding processing, there is no doubt that dedicated circuits can be orders of magnitude faster than GPPs, but the advantages of programmability, with the possibility of changing the software to implement new processing capabilities, become attractive for some applications. For instance, a GPP can decode any video standard (H.261, H.263, MPEG-1 and MPEG-2) just by changing the software, depending on the application. On the other hand, real-time performance is not so easy to guarantee on most GPP platforms, and the difficulty of handling real-time processing and other processes at the same time has to be carefully evaluated and verified. Figure 20 illustrates the concept with a simple FIR filter example. In a dedicated implementation (ASIC), a filter can be implemented with simple and extremely fast circuitry: simple architectures, based on registers and multipliers of just the size and speed necessary for the processing at hand, are employed, and real-time processing is easily guaranteed by appropriately clocking the system to the input data. Conversely, a programmable solution is much more complex. Figure 21 reports the various processing elements that are usually found: ALUs, memories for the data and the program instructions, communication buses, etc. Moreover, even simple processing algorithms such as FIR filtering need to access the data and program memories several times, as reported in the instruction example.
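The FIR example can be made concrete; in this direct-form sketch, every output sample implies exactly the tap loop of coefficient and data accesses that a DSP pays for in memory traffic, while an ASIC maps the same loop onto a fixed chain of registers and multipliers:

```python
def fir(x, coeffs):
    """Direct-form FIR filter: each output sample is a weighted sum of
    the current and previous inputs, y[n] = sum_k c[k] * x[n-k].
    The inner loop over taps is what costs a programmable processor
    repeated data/coefficient memory accesses per output sample."""
    taps = len(coeffs)
    return [sum(coeffs[k] * x[n - k] for k in range(taps) if n - k >= 0)
            for n in range(len(x))]

# Impulse followed by a scaled impulse reproduces the 3-tap response.
print(fir([1, 0, 0, 0, 2], [1, 2, 1]))   # [1, 2, 1, 0, 2]
```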

These considerations lead to clear cost advantages for ASICs when high volumes are required (see Figure 23). Simpler circuits requiring smaller silicon area are the right solution for set-top boxes and other high-volume applications (MPEG-2 decoders for digital TV broadcasting, for instance). In these cases, the high development costs and the lack of debugging and software tools for simulation and design do not constitute a serious drawback. Modifications of the algorithms and the introduction of new versions are not possible, but neither are they required by this kind of application. Conversely, for low-volume applications, programmable solutions immediately available on the market, well supported by compilers, debuggers and simulation tools that can effectively reduce development time and cost, might be the right choice. The much higher cost of the programmable processor can become acceptable for relatively low device volumes.

Another conflict between hardwired and programmable solutions arises from the need to design low-power solutions, required by the increasing importance of portable device applications and necessary to reduce the increasing power dissipated by high-performance processors (see Figure 24). This need conflicts with the desire to transfer the increasing complexity of processing algorithms from the architecture to the software, which is much easier and faster to modify, correct and debug. The optimization of memory size and accesses, clock frequency, and other architectural features that yield low power consumption is only possible on ASIC architectures.

What range of power consumption reduction can be reached by passing from a GPP to an ASIC? It is difficult to answer this question with a single figure: it depends on the architecture and on the processing. For instance, Figure 24 reports the power dissipation of a 2-D convolution with a 3x3 filter kernel on a 256x256 image on three different architectures. The result is that an ARM RISC implementation, besides being slower than the other alternatives and thus providing an under-estimated result, is about 3 times more power-demanding than an FPGA implementation and 18 times more than an ASIC-based one. The example of the IMAGE motion estimation chip reported at the end of this document shows that much higher reduction factors (even more than two orders of magnitude) can be reached by low-power optimized ASIC architectures for specific processing tasks, when compared to GPPs providing the same performance.

Figure 20. Example of FIR filtering implementation on a dedicated architecture.
Figure 21. Example of a FIR filtering implementation on a DSP architecture.
Figure 22. Conflicting trade-off for architecture families.
Figure 23. Conflicting trade-off for architecture families.
Figure 24. Power dissipation reduction for the same processing (2-D convolution 3x3) on three different architectures.
A last general consideration about the efficiency of the various architectures for video processing regards memory usage. Video processing applications, as we have seen in more detail for MPEG-2, require the handling of very large amounts of data (pixels) that need to be processed and accessed several times in a video encoder or decoder. Images are filtered, coded, decoded, and used as references for motion compensation and motion estimation of different frames; in other words, they are accessed, in order or "randomly", several times in a compression/decompression stage. If we observe the speed of processors and the speed of access to cache SRAM and synchronous DRAM over the last years, we observe two distinct trends (see Figure 25). Processor speed was similar to memory access speed in 1990, but now it is more than double, and the trend is towards even higher speed ratios. This means that the performance bottleneck of today's video processing architectures is the efficiency of the data flow. A correct design of the software for GPPs, and a careful evaluation of the achievable memory bandwidth of the various data exchanges, is necessary to avoid the risk that the largest fraction of time is spent by the processing unit just waiting for the correct data to be processed. For graphics accelerators, data-flow handling is the basic objective of the processing. Figure 26 reports the performance of some state-of-the-art devices versus the graphics content.
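An order-of-magnitude sketch shows why this traffic matters; the accounting below (one write per decoded pixel, up to two reference reads for B pictures, 8-bit luma only, no chroma or display refresh) is a deliberately simplified assumption, not a measured decoder profile:

```python
def decoder_memory_traffic(width, height, fps):
    """Very rough per-second memory traffic of an MPEG-2 decoder, in
    bytes/s: every decoded pixel is written once, and motion
    compensation reads back up to two 16x16 reference areas per
    macro-block (the B-picture worst case). Luma only, 8 bits/pixel."""
    pixels_per_s = width * height * fps
    writes = pixels_per_s        # store each decoded pixel
    reads = 2 * pixels_per_s     # forward + backward reference reads
    return writes + reads

print(f"{decoder_memory_traffic(720, 576, 25) / 1e6:.0f} MB/s")  # luma alone
```

Even this simplified count lands in the tens of MB/s, before chroma, the encoder-side motion estimation reads, or the bus overhead of non-sequential accesses.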
Figure 25. Evolution of the processing speed of processors, SRAM and Synch. DRAM in the last years. Memory access speed has become the performance bottleneck of data-intensive processing systems.
Figure 26. Performance and power dissipation of state-of-the-art graphics accelerators (AS-DSPs) versus polygons/s and pixels/s.

Motion Estimation Case Study

Block motion estimation for high-quality video compression applications (i.e. digital TV broadcasting, multimedia content production, etc.) is a typical example for which GPP architectures are not a good implementation choice.

Motion estimation is indeed the most computationally demanding stage of video compression at the encoder side. For normal-resolution TV we have to encode 1620 macro-blocks per frame, at 25 frames per second. Roughly, to evaluate the error of one candidate motion vector we need to perform about 510 arithmetic operations on data of 8 to 16 bits. The number of vector displacements depends on the search window size, which should be large to guarantee high-quality coding; for sport sequences, for instance, a size of about 100x100 is required. This leads to about 206 x 10^9 arithmetic operations per second on 8- to 16-bit data. Even if we are able to select an "intelligent" search algorithm that reduces the number of search points by one to two orders of magnitude, the number of operations remains extremely high and is not feasible for state-of-the-art GPPs. Moreover, 32- or 64-bit arithmetic cores are wasted when operations on only 8 to 16 bits are necessary. Completely different architectures, implementing a high level of parallelism at the bit level, are necessary.
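The figure quoted above follows directly from these numbers:

```python
def motion_search_ops(mb_per_frame=1620, fps=25,
                      window=100, ops_per_vector=510):
    """Order-of-magnitude operation count for exhaustive motion
    estimation: every macro-block tries every displacement in a
    window x window search area, each costing ops_per_vector
    arithmetic operations (defaults taken from the text)."""
    return mb_per_frame * fps * window * window * ops_per_vector

print(f"{motion_search_ops():,} ops/s")   # about 206 x 10^9 per second
```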

To be more accurate, we can notice that B pictures require both forward and backward motion estimation, and that, for TV applications for instance, each macro-block can use the best among frame-based and field-based motion vectors, at full- or half-pixel resolution. Therefore, the real processing needs can increase by more than a factor of 10 if all possible motion vectors are estimated.

Another reason why ASICs or AS-DSPs are an interesting and topical choice for motion estimation is the still unsolved need for motion estimation in TV displays. Large TV displays require doubling the refresh rate to avoid the annoying flickering phenomenon appearing on the side portions of large screens. A conversion of interlaced content from 50 to 100 Hz by the simple doubling of each field provides satisfactory results if there is no motion. In the case of moving objects, the image quality provided by field doubling is low, and motion compensated interpolation is necessary to reconstruct the movement phase of the interpolated images. An efficient and low-cost motion estimation stage is necessary for high-quality up-conversion on TV displays.


IMAGE: a Motion Estimation Chip for MPEG-2 Applications

We briefly describe the characteristics of a motion estimation chip designed in the C3I laboratory of EPFL in the framework of the ATLANTIC European project, in collaboration with the BBC, CSELT, Snell & Wilcox and the Fraunhofer Institute. IMAGE is an acronym for Integrated MIMD Architecture for Genetic motion Estimation. The requirements for the chip were to provide estimations for MPEG-2 encoders over very large search windows, for forward, backward, field-based, frame-based, full-pel and half-pel precision motion vectors. Figures 27 and 28 report the MPEG-2 broadcasting chain and the main input/output specifications of the chip. The same chip is also required to provide the result of the candidate motion compensation modes (forward, backward, field, frame, intra) and the selection of the corresponding best coding decision. Since all these operations are macro-block based, they share the same level of parallelism as the motion estimation algorithms.

The basic architectural idea has been to design a processing engine extremely efficient at computing the mean absolute difference between macro-blocks (the matching error), with fast access to a large image section (the search window). By extremely efficient, we mean exploiting as much as possible the parallelism intrinsic to pixel operations on 16x16 blocks of pixels, and being able to access randomly any position in the search window without useless waiting times (i.e. providing the engine with sufficient memory bandwidth to fully exploit its processing power). Figure 29 reports the block diagram of the "block-matching" engine. We can notice in the center the "pixel processor" for the parallel execution of the macro-block difference, two cache memory banks for the storage of the current macro-block and of the search window reference, and a RISC processor for the handling of the genetic motion estimation algorithm and for the communications between processing units. The basic processing unit of Figure 29 is then found in the general architecture of the chip, reported in Figure 30. We can notice two macro-block processing units in parallel, the various I/O modules for the communication with the external frame memory, and the communication interfaces for cascading chips for forward and backward motion estimation and for larger search window sizes. As mentioned in the discussion of data-intensive applications, one of the main difficulties of the chip design is the correct balancing of the processing times of the various units and the optimization of the various communications between modules. It is fundamental that all modules are scheduled so as to avoid wait times, and that the communication buses have the necessary bandwidth.

The low-power optimizations are summarized in Figure 31. Deactivation of idle processing units, local gated clocks and the implementation of a low-power internal SRAM as cache memory made it possible to keep power dissipation below 1 W. Figure 32 reports the final layout of the chip with the main design parameters.

In conclusion, the IMAGE chip can be classified as an AS-DSP, given its high programmability, where the application-specific function implemented in special hardware is the calculation of macro-block differences. Its motion estimation performance is much higher than that of any state-of-the-art GPP, and is obtained with a relatively small chip dissipating less than 1 W while providing real-time motion estimation for MPEG-2 video compression. More details about the IMAGE chip can be found in: F. Mombers, M. Gumm et al., "IMAGE: a low-cost low-power video processor for high quality motion estimation in MPEG-2 encoding", IEEE Trans. on Consumer Electronics, Vol. 44, No. 3, August 1998, pp. 774-783.

Figure 27. Block diagram of a TV broadcasting chain based on MPEG-2 compression.
Figure 28. Requirements of a motion estimation/prediction selection chip for MPEG-2 encoding.
Figure 29. Block diagram of the "block matching" processor.
Figure 30. High level architecture of the IMAGE chip with the indication of the critical communication paths.
Figure 31. Low power optimizations achieved on the IMAGE chip.
Figure 32. Main design data of the IMAGE chip.