
Tuesday, 31 December 2013

Abstract DIP model - Part II


An abstract model for Digital Image Processing (DIP) came to my mind. It contains four sections, viz. Acquire, Transfer, Display and Interpret. This post is the second part of the series. Please refer to the earlier post (November 2013) for a detailed introduction and a description of the Acquire block.

Transfer Block:
It sits between the Acquire and Display blocks. As shown in Figure 1, the input and output of the Transfer block are digital data. It acts as a data compressor on the transmitting side and as a data expander on the receiving side (Display block), because sending digital data without compression is a bad idea. Figure 1 describes a scenario: a person shoots an image with a digital camera and transfers it to a hard disk in JPEG file format; later it is opened and displayed on a computer screen. Here the camera functions as the Acquire block as well as the transmitter side of the Transfer block, while the computer functions as the receiver side of the Transfer block as well as the Display block.
Figure 1 Digital Camera and Computer Interface
The transfer of digital data between the Transfer sub-blocks may occur through a communication channel, over a wired (cable) or wireless medium. By nature all communication channels are noisy, meaning that data sent over the channel may be corrupted (a 1 becomes a 0, or a 0 becomes a 1). A channel is considered good if only one bit out of a million goes corrupt (1 out of 1,000,000). Various measures are taken to minimize or eliminate the impact of noise; adding extra bits to detect as well as correct the corrupt bits is one such measure. In CDs and DVDs, Reed-Solomon codes are extensively used. For further information, search Google with the keyword "Channel Coding." One may wonder how a CD qualifies as a communication channel. Normally, transfer of data from point 'A' to point 'B' takes place via a channel in finite time (on the order of milliseconds). If the transfer takes near-infinite time, the data can be considered stored. From this viewpoint, transmission of data and storage (on CD or DVD) are functionally the same.
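To make the idea concrete, here is a minimal Python sketch of a classic error-correcting code, Hamming(7,4), which protects four data bits with three parity bits and can correct any single flipped bit. (This only illustrates the principle; CDs and DVDs use the far more powerful Reed-Solomon codes.)

def hamming74_encode(d):
    """Encode 4 data bits as the 7-bit codeword [p1 p2 d1 p3 d2 d3 d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4               # parity over codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4               # parity over positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4               # parity over positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct up to one flipped bit, then return the 4 data bits."""
    c = c[:]                                 # work on a copy
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3          # 0 = clean, else 1-based error position
    if syndrome:
        c[syndrome - 1] ^= 1                 # flip the corrupt bit back
    return [c[2], c[4], c[5], c[6]]

word = hamming74_encode([1, 0, 1, 1])
word[5] ^= 1                                 # the noisy channel flips one bit
print(hamming74_decode(word))                # [1, 0, 1, 1] -- corrected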

Why compression?
Photons are converted into charge by the image sensor. All the charges are read out and together form an electrical signal. This analog signal is sampled and quantized to produce a digital signal. Because of spatial sampling, the resultant data is voluminous: each sensor element's photon accumulation is represented as the digital value of one pixel.

An image is made up of rows and columns of pixels, so the data can be represented in matrix form; programmers treat an image as an array. Each pixel may require anywhere from one bit to multiple bytes. A pixel from a two-coloured image (say black and white) requires only one bit. A gray-scale image uses 8 bits per pixel to represent shades of gray; black-and-white TV images fall under this category. A colour image is composed of red, green and blue components, each requiring 8 bits, so 24 bits (3 bytes) are needed per pixel. An HDTV-size frame (image) possesses 1920 x 1080 pixels and requires 6075 KB (1920 x 1080 x 3 bytes) of storage. One minute of video requires about 8900 MB (6075 KB x 25 frames x 60 seconds). Thus half a minute of video will gobble up one DVD, and a single movie would need about 170 DVDs. One may wonder, then, how an entire Hollywood movie (nearly 90 to 100 minutes) fits on a single DVD. The answer is compression. The main objective of this example is to make us realize the mammoth data size of images.
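A quick back-of-the-envelope check of these figures in Python (treating 1 KB as 1024 bytes and a single-layer DVD as 4.7 GB):

frame_kb = 1920 * 1080 * 3 / 1024        # bytes per HDTV frame, in KB
print(round(frame_kb))                   # 6075 KB per frame
minute_mb = frame_kb * 25 * 60 / 1024    # 25 frames/s for 60 seconds
print(round(minute_mb))                  # ~8899 MB per minute
dvds = 90 * minute_mb / (4.7 * 1024)     # 4.7 GB single-layer DVD
print(round(dvds))                       # ~166 DVDs for a 90-minute movie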

Solution: Remove Redundancy 
A high degree of correlation exists between pixels in continuous-tone images (typical images from a digital camera). Thus one can guess a pixel's value by knowing the values of its neighbouring pixels. Put another way, the difference between a pixel and its neighbours will be very small. Engineers exploit this feature to compress an entire Hollywood movie onto a DVD.
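A tiny NumPy example illustrates the point: in a smooth run of pixels, the neighbour differences are far smaller than the pixel values themselves, so they need fewer bits to represent.

import numpy as np

row = np.array([100, 102, 103, 103, 105, 106, 108, 110])   # a smooth pixel run
print(np.diff(row))    # [2 1 0 2 1 2 2] -- small differences, cheap to code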

Redundancy can be classified into interpixel redundancy, temporal redundancy and psychovisual redundancy. Temporal (time) redundancy exists only in video, not in still images. Our eyes are more sensitive to gray-scale variation than to colour variation; this is an example of psychovisual redundancy. By reducing redundancy, high compression can be achieved. Transform coding converts spatial domain signals (the image) into spatial frequency domain signals. In the spatial frequency domain, the first few coefficients have large amplitudes (values) and the rest have very small amplitudes. In Figure 2, bar height represents value: the white bars represent pixels and the pink bars represent DCT coefficients. The real compression occurs through proper quantization of the coefficient amplitudes. The low-frequency components, i.e. the first few coefficients, are mildly quantized; the high-frequency coefficients, i.e. the rest, are severely quantized, and their values reach near zero. High-frequency signals are thus highly attenuated (suppressed) but not eliminated. The feel of image crispness arises from the presence of high spatial frequency components; once they are removed, an image becomes blurred. In JPEG, colour images are mapped into one luminance and two chrominance layers (YCbCr), and the Cb and Cr layers are heavily quantized (exploiting psychovisual redundancy) to achieve high compression.

Figure 2 Spatial domain and Spatial Frequency domain. Courtesy [1] hdtvprimer.com
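Below is a minimal sketch of transform coding on a single 8x8 block, assuming NumPy and SciPy are available. A real JPEG encoder uses a standard 8x8 quantization matrix; a single scalar step is used here for simplicity.

import numpy as np
from scipy.fftpack import dct, idct

def dct2(b):
    """2-D DCT: transform the rows, then the columns."""
    return dct(dct(b, axis=0, norm='ortho'), axis=1, norm='ortho')

def idct2(c):
    return idct(idct(c, axis=0, norm='ortho'), axis=1, norm='ortho')

x = np.linspace(0, 1, 8)
block = 100 + 40 * np.outer(x, x)        # a smooth, continuous-tone 8x8 block

coeff = dct2(block)
step = 20                                # one coarse scalar quantization step
quantized = np.round(coeff / step)
print(np.count_nonzero(quantized))       # only a handful of coefficients survive

restored = idct2(quantized * step)       # dequantize, then inverse transform
print(np.abs(block - restored).max())    # modest error on the 0-255 pixel scale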
The quantized coefficients are coded using a Variable Length Code (VLC) and then sent to the receiver or put into a storage device. In a VLC, frequently occurring symbols are allotted fewer bits and rarely occurring symbols are allotted more bits. A very good example of a VLC is Morse code. The well-known Save Our Souls (SOS) signal is represented as dot dot dot dash dash dash dot dot dot (...---...). In English, S and O occur frequently, so they are given shorter codes; rarely occurring letters like X and Q get longer codes. Huffman code is a VLC that provides very good compression.
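Here is a minimal Huffman-coding sketch (illustrative only; JPEG and MPEG normally use predefined code tables rather than building a tree per image):

import heapq
from collections import Counter

def huffman_codes(symbols):
    """Map each symbol to a prefix-free bit string, shorter for frequent symbols."""
    heap = [[count, i, {sym: ''}] for i, (sym, count) in
            enumerate(Counter(symbols).items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        lo[2] = {s: '0' + c for s, c in lo[2].items()}    # left branch
        hi[2] = {s: '1' + c for s, c in hi[2].items()}    # right branch
        heapq.heappush(heap, [lo[0] + hi[0], i, {**lo[2], **hi[2]}])
        i += 1
    return heap[0][2]

print(huffman_codes("SOS SOS SOSX"))   # frequent S, O get shorter codes than rare X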

On the receiver side, the VLC is decoded, the reverse operation of quantization is applied, and the transformed coefficients are converted back into spatial signals to produce the reconstructed image. The more severe the quantization, the smaller the file size and the poorer the image quality. Quantization causes an irrecoverable loss of signal, i.e. it is impossible to recover the original signal from the quantized one. Even so, to our eyes a compressed JPEG image and the original are practically indistinguishable.

Compression of images by quantizing spatial frequency coefficients is called lossy compression. This method is not permitted for medical images and scanned legal documents, so lossless compression is used there instead. An image with a 100 KB file size might be compressed to 5 KB using lossy compression, whereas lossless compression might achieve only 35 KB. Both lossy and lossless compression are possible with JPEG. The advanced version of JPEG is JPEG 2000, which uses the Wavelet transform instead of the Discrete Cosine Transform (DCT).

Spatial Domain Compression
Transform coding performs poorly on cartoon images with limited colours and on line-art images. The exploitation of correlation can instead be carried out in the spatial domain itself. VLCs can be used to compress this sort of image, but the underlying source probabilities are required for efficient compression. To overcome this problem, dictionary codes are used; all ZIP compression applications use dictionary coding. This coding method was developed by Lempel and Ziv back in 1977 and named LZ77; LZ78 arrived the next year. Later, Welch modified LZ78 to make it much more efficient, and the result was named LZW. The Graphics Interchange Format (GIF), introduced on the Internet in 1987, extensively used LZW. A few years later, people came to know that LZW was a patented technique, which sent jitters among Web developers and users. Programmers came out with an alternative image standard called Portable Network Graphics (PNG) to subdue GIF's dominance; PNG uses the LZ77 technique and is patent-free. In dictionary coding, encoders search for patterns and then code the patterns: the longer the patterns, the better the compression.
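A minimal LZW encoder sketch shows the idea: the dictionary starts with all single characters and learns longer patterns as it goes. (Illustrative only; GIF additionally packs these indices into variable-width bit codes.)

def lzw_encode(data):
    """Return a list of dictionary indices for a non-empty input string."""
    dictionary = {chr(i): i for i in range(256)}         # all single bytes
    pattern, output = '', []
    for ch in data:
        if pattern + ch in dictionary:
            pattern += ch                                # grow the known pattern
        else:
            output.append(dictionary[pattern])
            dictionary[pattern + ch] = len(dictionary)   # learn a new pattern
            pattern = ch
    output.append(dictionary[pattern])
    return output

print(lzw_encode("ABABABABAB"))   # 10 characters become just 6 codes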

Knowledge of Information Theory is required to understand and evaluate the various VLCs. Information theory is an application of probability theory. What is information? If a man bites a dog, it is news, because the chance of such an event occurring is very low and it instills interest in the reader. Put another way, its information value is very high. (Please don't confuse this with the computer scientist's usage of 'information'.)
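The idea is captured by self-information: an event of probability p carries -log2(p) bits. A two-line Python check:

import math

def info_bits(p):
    """Self-information of an event with probability p, in bits."""
    return -math.log2(p)

print(info_bits(0.5))     # 1.0 bit: "dog bites man" is common, little information
print(info_bits(0.001))   # ~10 bits: "man bites dog" is rare, high information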
Digital content duplication is a simple affair, so content creators were forced to find ways and means to curb piracy. Digital watermarking is one such solution: copyright information is stored inside the image itself. The TV logo present on television programmes is a good example; invisible watermarking schemes are also available. Steganography is the art of hiding text in images. Functionally, digital watermarking and steganography are similar, but their objectives are totally different.
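As an illustration of the hiding principle, here is a minimal least-significant-bit (LSB) sketch that hides one byte in eight gray pixels; real watermarking and steganographic schemes are considerably more sophisticated.

import numpy as np

def hide_byte(pixels, byte):
    """Overwrite the least significant bit of 8 gray pixels with one byte."""
    bits = np.array([(byte >> (7 - i)) & 1 for i in range(8)], dtype=np.uint8)
    return (pixels & 0xFE) | bits

def reveal_byte(pixels):
    bits = pixels & 1
    return int(''.join(str(int(b)) for b in bits), 2)

pixels = np.array([120, 121, 119, 118, 122, 120, 121, 119], dtype=np.uint8)
stego = hide_byte(pixels, ord('A'))
print(stego ^ pixels)            # only the LSBs differ -- invisible to the eye
print(chr(reveal_byte(stego)))   # 'A'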

Note
The objective of this post is to give an overview of the Transfer block. For more information, please Google the highlighted phrases.


Source

1. What is exactly ATSC? [Online]. Available: http://www.hdtvprimer.com/issues/what_is_atsc.html

Monday, 31 December 2012

Video Compression Basics

In olden days, video transmission and storage were in the analog domain. Popular analog transmission standards were NTSC, PAL and SECAM. Video tapes were used as storage media, with the VHS and Betamax standards. Later, video transmission and storage moved to the digital domain. Digital signals are immune to noise and require less power to transmit than analog signals, but they require more bandwidth. In communication engineering, power and bandwidth are scarce commodities. Compression is employed to reduce the bandwidth requirement by removing the redundancy present in digital signals; from a mathematical point of view, it decorrelates the data. The following case study highlights the need for compression. One second of digitized NTSC video requires a data rate of 165 Mbits/sec, so a 90-minute uncompressed NTSC video generates about 110 gigabytes [1]. Around 23 DVDs would be required to hold this huge amount of data, yet one comes across DVDs that contain four 90-minute movies. This is possible only because of efficient video compression techniques.
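A quick Python check of these case-study numbers:

bits = 165e6 * 90 * 60           # 90 minutes at 165 Mbit/s
gigabytes = bits / 8 / 1e9
print(round(gigabytes))          # ~111 GB, in line with the ~110 GB above
print(round(gigabytes / 4.7))    # ~24 single-layer (4.7 GB) DVDs, close to the 23 quoted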

Television (TV) signals are a combination of video, audio and synchronization signals. When the general public refer to video, they usually mean TV signals; in the technical literature, TV signals and video are different things. If 30 still images (assume each image is slightly different from the next) are shown within a second, an illusion of motion is created in the eyes of the observer. This phenomenon is called 'persistence of vision'. In video technology a still image is called a frame. Eight frames per second are sufficient to create the illusion of motion, but 24 frames are required for smooth motion as in movies.

Figure 1 Two adjacent frames (top); temporal-redundancy-removed image (bottom)

Compression can be classified into two broad categories: transform coding and statistical (source) coding. In transform coding, the Discrete Cosine Transform (DCT) and Wavelet transforms are extensively used for image and video compression. In source coding, Huffman coding and arithmetic coding are extensively used. First, transform coding is applied to the digital video signal; then source coding is applied to the coefficients of the transform-coded signal. This strategy is common to both image and video signals. For further details, read [2].

In video compression, intra-frame coding and inter-frame coding are employed. Intra-frame coding is similar to JPEG coding, while inter-frame coding exploits the redundancy present among adjacent frames. Five to fifteen frames form a Group of Pictures (GOP). In Figure 2 the GOP size is seven, and it contains one Intra (I) frame, two Predicted (P) frames and four Bi-directionally predicted (B) frames. In an I frame, spatial redundancy alone is exploited, very much as in JPEG compression. In P and B frames, both spatial and temporal (time) redundancy are removed. A temporal-redundancy-removed image can be seen in Figure 1. In Figure 2, P frames are present in the 4th and 7th positions. The P1 frame in the fourth position contains the difference between the I frame and the 4th frame; only this difference, the prediction error, is coded. To regenerate the 4th frame, the I frame and the P1 frame are required. Likewise, the 2nd frame (B1) is coded as the prediction error relative to its surrounding reference frames, I and P1. The decoding sequence is therefore I, P1, B1, B2, P2, B3, B4: each reference frame must be decoded before the B frames that depend on it.

Figure 2 Group of Pictures (GOP)
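A small Python sketch of this reordering, assuming the display order I B B P B B P implied by Figure 2 (each B frame is held back until the reference frame it depends on has been decoded):

display_order = ['I', 'B1', 'B2', 'P1', 'B3', 'B4', 'P2']

def decode_order(frames):
    """Move each I/P reference frame ahead of the B frames that need it."""
    out, held_b = [], []
    for f in frames:
        if f.startswith('B'):
            held_b.append(f)       # B frames wait for their reference
        else:
            out.append(f)          # the reference goes first
            out.extend(held_b)     # then the B frames that depended on it
            held_b = []
    return out + held_b

print(decode_order(display_order))   # ['I', 'P1', 'B1', 'B2', 'P2', 'B3', 'B4']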

One may wonder why a GOP is limited to about 15 frames. We know that more P and B frames result in more efficient compression. The flip side is that if there is an error in the I frame, the dependent P and B frames cannot be decoded properly. This results in a partially decoded still image (i.e. the I frame) being shown to the viewer for the entire duration of the GOP; with 15 frames, one may experience a still image for half a second, and beyond this duration the viewer is annoyed. An increase in GOP size also increases decoding time, which counts towards latency, and real-time systems require very low latency.

In a typical TV soap-opera episode, very few scene changes occur within a given duration. Take two adjacent frames: objects (like a face, a car, etc.) in the first frame will have moved only slightly in the second frame. If we know the direction and quantum of motion, we can move the first frame's objects accordingly to recreate the second frame. The idea is simple to comprehend, but the implementation is very taxing. Each frame is divided into a number of macroblocks, each containing 16x16 pixels (in JPEG, a group of 8x8 pixels is called a block, which is why 16x16 pixels are called a macroblock). Macroblocks are chosen one by one in the current frame (in our example, the 2nd frame in Figure 1) and the 'best matching' macroblock is sought in the reference frame (the first frame in Figure 1). The difference between the best-matching macroblock and the chosen macroblock is called the motion compensation (the prediction residual). The positional difference between the two blocks is represented by a motion vector. This process of searching for the best-matching macroblock is called motion estimation [3].
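Here is a minimal full-search motion estimation sketch, using the common Sum of Absolute Differences (SAD) matching criterion on a toy 4x4 block (real encoders use 16x16 macroblocks and much faster search strategies than exhaustive search):

import numpy as np

def best_match(block, reference, top, left, search=4):
    """Exhaustively search a +/-4-pixel window for the lowest-SAD match."""
    n = block.shape[0]
    best_mv, best_sad = (0, 0), float('inf')
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if 0 <= y <= reference.shape[0] - n and 0 <= x <= reference.shape[1] - n:
                cand = reference[y:y + n, x:x + n].astype(int)
                sad = np.abs(block.astype(int) - cand).sum()
                if sad < best_sad:
                    best_mv, best_sad = (dy, dx), sad
    return best_mv, best_sad

rng = np.random.default_rng(0)
reference = rng.integers(0, 256, (16, 16), dtype=np.uint8)
current = np.roll(reference, (1, 2), axis=(0, 1))   # whole scene shifted by (1, 2)
mv, sad = best_match(current[4:8, 4:8], reference, top=4, left=4)
print(mv, sad)   # (-1, -2) with SAD 0: a perfect match; only the vector is sent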


Figure 3 Motion Vector and Macroblocks

       
A closer look at the first and second frames in Figure 1 offers the following inferences: (1) there is a slight colour difference between the first and second frames; (2) the pixel located at (3,3) in the first frame is the (0,0)th pixel in the second frame.
In Figure 3, a small portion of a frame is shown divided into macroblocks: there are 16 macroblocks, in four rows and four columns.

Groups of macroblocks are combined to form a Slice.

Further Information:  
  • Display systems like TVs and computer monitors incorporate the additive colour mixing concept; the primary colours are Red, Green and Blue. In printing, the subtractive colour mixing concept is used, and the primary colours are Cyan, Magenta, Yellow and Black (CMYK).
  • The human eye is more sensitive to brightness variation than to colour variation. To exploit this feature, the YCbCr model is used: Y -> Luminance, Cb -> Chrominance Blue, Cr -> Chrominance Red. Please note that Chrominance Red ≠ Red. (A conversion sketch follows this list.)
  • To conserve bandwidth, analog TV systems use Vestigial Sideband Modulation, a variant of Amplitude Modulation (AM), and incorporate the interlaced scanning method.
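A small sketch of the RGB-to-YCbCr mapping using the BT.601 coefficients (the exact constants vary between standards):

def rgb_to_ycbcr(r, g, b):
    """BT.601 full-range conversion from RGB to YCbCr."""
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luminance
    cb = 128 + 0.564 * (b - y)              # Chrominance Blue
    cr = 128 + 0.713 * (r - y)              # Chrominance Red
    return y, cb, cr

print(rgb_to_ycbcr(255, 0, 0))   # pure red: Y is low, Cr is high -- Cr is not Red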

Note: This article was written to give the reader some idea about video compression within a short span of time. It has been written carefully, but accuracy cannot be guaranteed, so please read books and understand the concepts properly.
Sources:
[2] Salent-Compression-Report.pdf [Online]. Available: http://www.salent.co.uk/downloads/Salent-Compression-Report.pdf (PDF, 1921 KB)
[3] Iain E. G. Richardson, "H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia," John Wiley & Sons Ltd, 2003. (The examples are very good. ISBN 0-470-84837-5)