Wednesday, 30 January 2013

Video Compression Standards


In the earlier post, the basics of digital video compression were discussed. This post covers five broad application areas of video and the H.26x and MPEG families of video compression standards.
  In all of these applications, video frames are acquired by a camera, compressed and stored in non-volatile memory. Depending on the requirement, the compressed video is then transmitted elsewhere. Video quality depends on the following factors: the number of frames per second, the frame resolution and the pixel depth. For example, a High Definition (HD) video may use 25 frames per second, with each frame containing 1920 x 1080 pixels and each pixel consuming 24 bits to represent its RGB values. Video quality has a direct bearing on cost, so one has to understand the requirement first and then fix the video quality needed to keep the cost at a minimum.

1. Studio:
In video and film production, video shot on a set or on location is called raw footage. The footage is then taken up for editing, where video processing operations such as colour space conversion, noise reduction, motion compensation and frame rate conversion are carried out as required. After this the director and the editor sit together, remove unwanted portions and rearrange the footage into a movie. Some quality is lost during editing, so to compensate for this the raw footage should be of the highest possible quality.

2. Television:
         Digital television signals are broadcast through terrestrial transmitters or by satellite transponders. Digital terrestrial broadcasting is popular in the USA and Europe. Digital video is also economically stored and distributed on Video Compact Disc (VCD) and Digital Versatile Disc (DVD). In news clips the frame-to-frame changes are small, whereas in sports and action movies they are large. Digital video signals were originally optimised for standard-definition TVs (the old NTSC, PAL and SECAM formats), for which the MPEG-1 video compression standard was used. Now MPEG-2 is used to get HD (HD720, HD1080) quality.

Figure 1.  Frame sizes used by different TV standards. 
3. Internet streaming:
 In video streaming, data is sent continuously to the client over the Internet and the user is able to decode and view the video in near real time. The Internet is slowly becoming a large video server; YouTube, Metacafe and Dailymotion are a few examples of popular video services. The file formats used by these servers include MOV, FLV, MPEG-4 and 3GP. These are container (wrapper) formats, and they carry metadata along with the compressed streams. The video codecs used include MPEG-4, H.264, Sorenson Spark and VC-1 [1]. Typical resolutions offered are 240p, 360p and HD. In streaming, latency (time delay) is the biggest problem, a problem that is unheard of in broadcast technologies. Streaming servers normally do not allow the content to be stored, although online tools are available to save it to a local hard disk.

4. Video conferencing:
        The next big thing is video conferencing, which may be one-to-one or a multi-party conference call. The foundation for video telephony was laid about 40 years ago. Integrated Services Digital Network (ISDN) technology was built to handle video telephony, and a new compression standard, H.261, was created for it. At that time video telephony was not commercially successful and remained a technological feat only. The advent of third generation (3G) wireless technologies revived interest in video telephony and conferencing. A video conferencing system has much more stringent latency requirements: humans can tolerate a loss in visual quality but not high latency. Today H.264 is widely used, and a typical resolution is 352 x 288 (CIF), i.e. one quarter of a full PAL TV frame.

5. Surveillance:
     Falling prices of surveillance video systems and their proven ability in crime prevention and detection have led to wider deployment. The video should be of high enough quality to recognise a suspect's face, and the content should not be altered; if it is altered it will not be accepted as evidence in a court of law. Motion JPEG, H.264 and MPEG standards are used for recording surveillance video: real-time monitoring systems use H.264 and MPEG codecs, while frame-capture systems employ the Motion JPEG (MJPEG) codec. The requirements of the entertainment and surveillance industries are totally different; poor lighting and 24x7 storage requirements are unique to surveillance applications [2].

Video Compression standard interrelationships:

          There is a long list of video compression standards available, but a careful study reveals a lot of commonality among them. The MPEG and H.26x families stand out as the top contenders.

I. MPEG:
            The Moving Picture Experts Group (MPEG) is a study group that develops digital video standards for the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). These standards were built mainly for the entertainment industry. In 1993 MPEG-1 was introduced to store digital video with a quality equal to that of a VHS tape. In 1994 MPEG-2 was developed to handle HD video; MPEG-3 was merged into MPEG-2. MPEG-4 was introduced in 1999; its core video coder is still DCT based, with a wavelet transform used for still-texture coding, and it added object-based coding tools. Related standards such as MPEG-7 and MPEG-21 are also available [3], [4].

II. H.26x:
           The H.26x series of standards was developed by the Telecommunication Standardization Sector of the International Telecommunication Union (ITU-T). These standards were built to handle video calls, where the frame-to-frame changes are small: most of the time the scene is a human face that moves only mildly. H.26x is network resilient and has very low latency; to reduce latency, 'B' frames are avoided in the coded stream. As it evolved from telephone systems, it works in multiples of 64 kbit/s channels (p x 64 kbit/s). Here also the DCT is used.
            Later H.262 was developed, which is essentially the same as MPEG-2. Then came H.263. The developers of H.26x, the Video Coding Experts Group (VCEG), later joined with ISO/IEC MPEG to form the Joint Video Team (JVT). Together they built H.264/MPEG-4 Part 10, otherwise called Advanced Video Coding (AVC); MPEG-4 has a number of parts, and Part 10 deals with video coding. 4:2:0 sampling is used, and both progressive and interlaced scanning are permitted.

III. Motion JPEG:
          Motion JPEG was developed in 1992 and uses only intra-frame coding: put simply, each frame is a JPEG image, and inter-frame coding is never used. Because of this the compression efficiency is poor, but latency is relatively low and the stream is more resilient to errors. One may wonder how MJPEG differs from JPEG: in MJPEG a sequence of JPEG frames (for example 16 per second) is shown to create an illusion of motion. It consumes more storage but it also retains more information, so its frames can be used as evidence in a court of law. MPEG systems, by contrast, send only two to four full frames (I-frames) per second to the receiver. MJPEG2000, based on JPEG2000, has since been introduced; it uses the wavelet transform instead of the DCT, which is computationally heavier but gives higher compression efficiency [4].

Source:
 [1] Yassine Bechqito, High Definition Video Streaming Using H.264 Video Compression, Master's Thesis, Helsinki Metropolia University of Applied Sciences. pg.18, 21(PDF, 3642 KB)
[2] http://www.initsys.net/attachments/Compression and DigitisationPDF.pdf (PDF, 242 KB)
[3] Iain E. G. Richardson, “H.264 and MPEG-4 Video Compression Video Coding for Next-generation Multimedia,” John Wiley & Sons Ltd, 2003. (Examples are very good.  ISBN 0-470-84837-5)
[4] Salent-Compression-Report.pdf,  http://www.salent.co.uk/downloads/Salent-Compression-Report.pdf  (PDF, 1921 KB)

Monday, 31 December 2012

Video Compression Basics

In the old days, video transmission and storage were in the analog domain. Popular analog transmission standards were NTSC, PAL and SECAM, and video tapes following the VHS and Betamax standards were used as the storage media. Later, video transmission and storage moved to the digital domain. Digital signals are immune to noise and need less power to transmit than analog signals, but they require more bandwidth. In communication engineering, power and bandwidth are scarce commodities. Compression is employed to reduce the bandwidth requirement by removing the redundancy present in the digital signal; from a mathematical point of view, it decorrelates the data. The following case study highlights the need for compression. One second of digitised NTSC video requires a data rate of about 165 Mbit/s, so a 90-minute uncompressed NTSC video generates roughly 110 gigabytes [1]. Around 23 DVDs would be required to hold this huge amount of data, yet one comes across DVDs that contain four 90-minute movies. This is possible only because of efficient video compression techniques.
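As a quick check of this case study, the short C sketch below works out the storage requirement from the quoted 165 Mbit/s rate; the 4.7 GB single-layer DVD capacity is an assumption introduced here only for the comparison.

#include <stdio.h>

int main(void)
{
    double rate_bits_per_s = 165e6;          /* uncompressed NTSC, as quoted above */
    double duration_s      = 90.0 * 60.0;    /* a 90-minute movie                  */
    double total_gb        = rate_bits_per_s * duration_s / 8.0 / 1e9;
    double dvd_gb          = 4.7;            /* single-layer DVD (assumed here)    */

    printf("Uncompressed size : about %.0f GB\n", total_gb);        /* ~111 GB */
    printf("Single-layer DVDs : about %.1f\n", total_gb / dvd_gb);  /* ~23.7   */
    return 0;
}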

Television (TV) signals are a combination of video, audio and synchronisation signals. When the general public refer to video they usually mean TV signals, but in the technical literature TV signals and video are different things. If 30 still images (each slightly different from the next) are shown within a second, an illusion of motion is created in the eyes of the observer. This phenomenon is called 'persistence of vision'. In video technology a still image is called a frame. Eight frames per second are enough to suggest motion, but about 24 frames per second are required to create smooth motion as in the movies.

Figure 1. Two adjacent frames (top); the image after temporal redundancy removal (bottom)

Compression can be classified into two broad categories: transform coding and statistical (entropy) coding. In transform coding, the Discrete Cosine Transform (DCT) and wavelet transforms are extensively used for image and video compression. In statistical coding, Huffman coding and arithmetic coding are extensively used. Transform coding is applied to the digital video signal first, and statistical coding is then applied to the coefficients of the transformed signal. This strategy is common to both image and video compression. For further details read [2].
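Where the transform step fits can be illustrated with a direct 2-D DCT of an 8x8 block, the block size used for intra coding in JPEG and MPEG. The sketch below is an unoptimised textbook implementation of the forward DCT-II; practical codecs use fast factorised versions, and the quantisation and entropy-coding stages that follow it are omitted here.

#include <math.h>
#include <stdio.h>

#define N 8
static const double PI = 3.14159265358979323846;

/* Forward 2-D DCT-II of one 8x8 block of pixels (direct form). */
void dct8x8(const double in[N][N], double out[N][N])
{
    for (int u = 0; u < N; u++) {
        for (int v = 0; v < N; v++) {
            double cu = (u == 0) ? sqrt(0.5) : 1.0;
            double cv = (v == 0) ? sqrt(0.5) : 1.0;
            double sum = 0.0;
            for (int x = 0; x < N; x++)
                for (int y = 0; y < N; y++)
                    sum += in[x][y]
                         * cos((2 * x + 1) * u * PI / (2.0 * N))
                         * cos((2 * y + 1) * v * PI / (2.0 * N));
            out[u][v] = 0.25 * cu * cv * sum;   /* scale factor 2/N = 1/4 for N = 8 */
        }
    }
}

int main(void)
{
    double block[N][N], coeff[N][N];

    /* A flat block: all energy should end up in the DC coefficient. */
    for (int x = 0; x < N; x++)
        for (int y = 0; y < N; y++)
            block[x][y] = 128.0;

    dct8x8(block, coeff);
    printf("DC coefficient = %.1f\n", coeff[0][0]);   /* 128 * 8 = 1024 */
    return 0;
}

The statistical (entropy) coding stage would then operate on the quantised versions of these coefficients.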

Video compression employs both intra-frame and inter-frame coding. Intra-frame coding is similar to JPEG coding, while inter-frame coding exploits the redundancy present among adjacent frames. Five to fifteen frames form a Group of Pictures (GOP). In figure 2 the GOP size is seven and it contains one intra (I) frame, two predicted (P) frames and four bi-directionally predicted (B) frames. In an I frame only spatial redundancy is exploited, very much as in JPEG compression; in P and B frames both spatial and temporal (time) redundancy are removed. The temporal-redundancy-removed image can be seen in figure 1. In figure 2 the P frames are in the 4th and 7th positions. The P1 frame in the fourth position contains the difference between the I frame and the 4th frame, and only this difference (prediction error) is coded; to regenerate the 4th frame, both the I frame and the P1 frame are required. Likewise the 2nd frame (B1) is coded as the prediction error with respect to the I frame and P1. The decoding order therefore differs from the display order: the I frame is decoded first, then P1, then the two B frames between them, then P2, and finally the remaining two B frames.

Figure 2 Group of Pictures (GOP)
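The decode-order reordering described above can be sketched in C. The GOP of figure 2 is written in display order as the string "IBBPBBP"; every anchor frame (I or P) is emitted before the B frames that precede it in display order, and the numbers in the output refer to display position.

#include <stdio.h>
#include <string.h>

/* Print the decode (bitstream) order of a GOP given in display order.
 * Each B frame needs the anchor (I or P) on both sides, so every anchor
 * is moved ahead of the B frames that precede it in display order.     */
void decode_order(const char *gop)
{
    int n = (int)strlen(gop);
    int last_anchor = -1;                 /* index of previous I or P   */

    for (int i = 0; i < n; i++) {
        if (gop[i] == 'I' || gop[i] == 'P') {
            printf("%c%d ", gop[i], i + 1);          /* anchor first    */
            for (int j = last_anchor + 1; j < i; j++)
                printf("%c%d ", gop[j], j + 1);      /* then its B's    */
            last_anchor = i;
        }
    }
    /* Trailing B frames (none in a GOP that ends with an anchor). */
    for (int j = last_anchor + 1; j < n; j++)
        printf("%c%d ", gop[j], j + 1);
    printf("\n");
}

int main(void)
{
    decode_order("IBBPBBP");   /* prints: I1 P4 B2 B3 P7 B5 B6 */
    return 0;
}

Real encoders signal this order through frame types and timestamps; the sketch only illustrates why a decoder needs the anchor frames before the B frames that depend on them.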

One may wonder why a GOP is limited to about 15 frames. More P and B frames result in more efficient compression, but the flip side is that if there is an error in the I frame, the dependent P and B frames cannot be decoded properly. The result is a partially decoded still image (the corrupted I frame) shown to the viewer for the entire duration of the GOP; with 15 frames this can mean a frozen image for roughly half a second, beyond which the viewer becomes annoyed. Increasing the GOP size also increases decoding delay, which counts towards the total latency, and real-time systems require very low latency.

In a typical TV soap opera very few scene changes occur within a given duration. Take two adjacent frames: objects (a face, a car, etc.) in the first frame will have moved only slightly in the second frame. If we know the direction and amount of motion, we can shift the objects of the first frame accordingly to recreate the second frame. The idea is simple to comprehend but the implementation is computationally demanding. Each frame is divided into macroblocks of 16x16 pixels (in JPEG an 8x8 pixel unit is called a block, which is why the 16x16 unit is called a macroblock). Each macroblock of the current frame (the 2nd frame in figure 1 in our example) is taken in turn and the 'best matching' macroblock is searched for in the reference frame (the first frame). The difference between the best-matching macroblock and the chosen macroblock is the motion-compensated prediction error, and the positional difference between the two blocks is represented by a motion vector. The process of searching for the best-matching macroblock is called motion estimation [3].
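A minimal full-search block-matching sketch in C is given below: for one 16x16 macroblock of the current frame it scans a window of +/- 8 pixels in the reference frame, uses the sum of absolute differences (SAD) as the matching cost and returns the best motion vector. The frame layout, search range and cost function are illustrative choices; real encoders use fast search patterns and sub-pixel refinement.

#include <stdlib.h>
#include <limits.h>

#define MB     16        /* macroblock size             */
#define RANGE   8        /* search range: +/- 8 pixels  */

/* Sum of absolute differences between the macroblock at (cx,cy) in the
 * current frame and a candidate block at (rx,ry) in the reference frame. */
static long sad(const unsigned char *cur, const unsigned char *ref,
                int width, int cx, int cy, int rx, int ry)
{
    long cost = 0;
    for (int y = 0; y < MB; y++)
        for (int x = 0; x < MB; x++)
            cost += abs(cur[(cy + y) * width + cx + x] -
                        ref[(ry + y) * width + rx + x]);
    return cost;
}

/* Full-search motion estimation for one macroblock.  Writes the motion
 * vector (*mvx, *mvy) of the best match inside the search window.       */
void motion_estimate(const unsigned char *cur, const unsigned char *ref,
                     int width, int height, int cx, int cy,
                     int *mvx, int *mvy)
{
    long best = LONG_MAX;
    *mvx = *mvy = 0;

    for (int dy = -RANGE; dy <= RANGE; dy++) {
        for (int dx = -RANGE; dx <= RANGE; dx++) {
            int rx = cx + dx, ry = cy + dy;
            /* keep the candidate block inside the reference frame */
            if (rx < 0 || ry < 0 || rx + MB > width || ry + MB > height)
                continue;
            long cost = sad(cur, ref, width, cx, cy, rx, ry);
            if (cost < best) {
                best = cost;
                *mvx = dx;
                *mvy = dy;
            }
        }
    }
}

The function would be called once per macroblock; the prediction error between the macroblock and its best match is what actually gets transform coded.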


Figure 3 Motion Vector and Macroblocks

       
         A closer look at the first and second frames in figure 1 offers the following inferences: (1) there is a slight colour difference between the two frames, and (2) the pixel located at (3,3) in the first frame appears at (0,0) in the second frame.
         In figure 3 a small portion of a frame is shown divided into macroblocks; there are 16 macroblocks arranged in four rows and four columns.

A group of macroblocks is combined to form a slice.

Further Information:  
  • Display systems like TVs and computer monitors use the additive colour mixing concept, with red, green and blue as the primary colours. Printing uses subtractive colour mixing, with cyan, magenta, yellow and black (CMYK) as the primaries.
  • The human eye is more sensitive to brightness variation than to colour variation. To exploit this, the YCbCr model is used: Y is luminance, Cb is chrominance-blue and Cr is chrominance-red. Please note that chrominance-red ≠ red. (A small conversion sketch is given after this list.)
  • To conserve bandwidth, analog TV systems use Vestigial Sideband Modulation, a variant of Amplitude Modulation (AM), and employ the interlaced scanning method.
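Here is a small sketch of the RGB to YCbCr conversion mentioned in the second bullet, using the commonly quoted BT.601 8-bit weights; the exact constants vary slightly between standards, so treat them as illustrative.

#include <stdio.h>

/* Clamp a value to the 8-bit range and round it. */
static unsigned char clamp8(double v)
{
    if (v < 0.0)   return 0;
    if (v > 255.0) return 255;
    return (unsigned char)(v + 0.5);
}

/* Convert one 8-bit RGB pixel to YCbCr (BT.601-style weights). */
void rgb_to_ycbcr(unsigned char r, unsigned char g, unsigned char b,
                  unsigned char *y, unsigned char *cb, unsigned char *cr)
{
    *y  = clamp8( 0.299 * r + 0.587 * g + 0.114 * b);          /* luminance        */
    *cb = clamp8(-0.169 * r - 0.331 * g + 0.500 * b + 128.0);  /* blue difference  */
    *cr = clamp8( 0.500 * r - 0.419 * g - 0.081 * b + 128.0);  /* red difference   */
}

int main(void)
{
    unsigned char y, cb, cr;
    rgb_to_ycbcr(255, 0, 0, &y, &cb, &cr);        /* a pure red pixel            */
    printf("Y=%d Cb=%d Cr=%d\n", y, cb, cr);      /* roughly Y=76, Cb=85, Cr=255 */
    return 0;
}

Note how a pure red pixel still has a substantial luminance value, which is what the second bullet means by chrominance-red not being the same as red.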

Note: This article is written to give the reader some idea about video compression within a short span of time. It has been written carefully, but accuracy cannot be guaranteed, so please read books to understand the concepts properly.
Sources:
[2]  Salent-Compression-Report.pdf, http://www.salent.co.uk/downloads/Salent-Compression-Report.pdf  (PDF, 1921 KB)
[3]  Iain E. G. Richardson, “H.264 and MPEG-4 Video Compression Video Coding for Next-generation Multimedia,” John Wiley & Sons Ltd, 2003. (Examples are very good.  ISBN 0-470-84837-5)

Wednesday, 28 November 2012

Multicore Software Technologies


Powerful multicore processors have arrived in the market, but programmers with sound enough knowledge of the hardware to harness their full potential are in short supply. The obvious solution is to produce a good number of programmers with sufficient knowledge of multicore architecture. Another solution is to create software that converts code meant for a single processor into multicore-compatible code. In this article the second method is discussed in detail; the reasons why the first method is not feasible are left to the reader as an exercise.

It is a well known fact that user friendliness and code efficiency do not go hand in hand. For example, a factorial program written in assembly language will produce the smallest executable or binary file (.exe in Windows). The same algorithm implemented in higher-level languages (Fortran, MATLAB) will produce a larger file, and one can expect a moderate binary size in C. Writing a program in assembly language is tedious, while in MATLAB it is easy.



The graph above shows the relationship between the effort required to write a program and the computational speed-up achieved by that programming language; it captures the essence of the previous paragraph. Multicore software tools like Cilk++, Titanium and VSIPL++ offer user friendliness and at the same time are able to produce efficient applications. Is it not a "have your cake and eat it too" situation? Let us hope it will not take much time to reach the coveted 'green ellipse' position.

OpenMP (Open Multi-Processing) is an open standard and is supported by major computer manufacturers. Code written in Fortran or C/C++ can be converted into code that runs on multicore processors, and OpenMP-capable compilers are available for Windows, Linux and Apple Mac operating systems. The advantages of OpenMP are that it is easy to learn and compatible with different multicore architectures. Software tools like Unified Parallel C, Sequoia, Co-array Fortran, Titanium, Cilk++, pMatlab and Star-P are available as alternatives to OpenMP, while CUDA, Brook+ and OpenGL cater to Graphics Processing Unit (GPU) based systems.

Name                          Developed by                   Keywords       Language extension
Unified Parallel C (UPC)      UPC Consortium                 shared         C
Sequoia                       Stanford University            inner, leaf    C
Co-Array Fortran              ---                            ---            Fortran
Titanium                      ---                            ---            Java
pMatlab                       MIT Lincoln Laboratory         ddense, *p     MATLAB
Star-P                        Interactive Supercomputing     ---            MATLAB, Python
Parallel Computing Toolbox    MathWorks Inc.                 spmd, end      MATLAB

--- Data not available

Multicore-compatible code is developed in the following way. The code is written in a high-level language and compiled; this helps to rectify any errors in the program. Next, the code is analysed and wherever parallelism exists, that section of code is marked with a special keyword (see the table above) by the programmer. It is then compiled again with the multicore software tool, which automatically inserts the code needed to take care of memory management and data movement. In a multicore environment, the operating system creates a master thread at the time of execution. The master thread takes care of execution, and wherever a special keyword is encountered, threads are forked (created) and given to separate cores. After completion of the job, the threads are terminated.

  1.  #pragma omp parallel for \
  2.  private(n) \
  3.  shared(B, y, v)
  4.  for (n = 0; n < K; n++)
  5.      y[n] = B * cos(v * n);

In this paragraph the few-line sample code presented above is explained. The fragment has C syntax, and the loop variable n, the array y and the constants B, v and K are assumed to be declared earlier. Steps 4 and 5 alone are sufficient to generate a cosine signal; the first three lines are there to parallelise steps 4 and 5. In step 1, '#pragma' is a pre-processor directive in C/C++, 'omp' indicates an OpenMP directive, and 'parallel for' states that the for loop is to be parallelised. In step 3, the 'shared' clause lists the variables placed in global space so that all cores can access them: the amplitude B, the frequency term v and the array y. Each core maintains its own copy of n in its private space (step 2) so that it generates its portion of the cosine signal correctly.
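For completeness, a self-contained version of the same fragment is sketched below with illustrative values for K, B and v; it is compiled with an OpenMP-aware compiler (for example gcc -fopenmp, linking the math library with -lm).

#include <math.h>
#include <stdio.h>
#include <omp.h>

#define K 1000

int main(void)
{
    double y[K];
    double B = 2.0;            /* amplitude (illustrative value)    */
    double v = 0.01;           /* angular step (illustrative value) */
    int n;

    #pragma omp parallel for \
        private(n) \
        shared(B, y, v)
    for (n = 0; n < K; n++)
        y[n] = B * cos(v * n);            /* iterations shared among the cores */

    printf("y[0]=%f  y[%d]=%f  (%d threads available)\n",
           y[0], K - 1, y[K - 1], omp_get_max_threads());
    return 0;
}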

A program tailored to multicore generally has two sections: one section performs the task, and the other holds target-architecture-specific information such as the number of cores. Every tool assumes an underlying model; for further details read [1]. Multicore hardware architectures can be classified into two broad categories, homogeneous (e.g. x86 cores) and heterogeneous (e.g. GPU), and software tools are usually built to suit only one of them.

Middleware

Gone are the days when only supercomputers had multi-processor systems; today even embedded systems (e.g. smartphones) use multicore hardware. An embedded system is a combination of hardware and software. Typically any change in hardware requires some change in software, and likewise any upgrade of the software may require a change in hardware. To overcome this problem the "middleware" concept was introduced.

MIT Lincoln Laboratory developed a middleware named Parallel Vector Library (PVL) suited to real-time embedded signal and image processing systems. Likewise, VSIPL++ (Vector Signal and Image Processing Library) was developed and is maintained by the High Performance Embedded Computing Software Initiative (HPEC-SI). VSIPL++ is suited to homogeneous architectures; for heterogeneous architectures PVTOL (Parallel Vector Tile Optimizing Library) is used. Here also programs are compiled and then mapped to multicore systems.

Source

1. Hahn Kim, and R. Bond. “Multicore software technologies” IEEE Signal Processing Magazine, Vol. 26 No.6, 2009 pp. 80-89. http://hdl.handle.net/1721.1/52617 (PDF,629 KB)

2. Greg Slabaugh, Richard Boyes, Xiaoyun Yang, "Multicore Image Processing with OpenMP", Appeared in IEEE Signal Processing Magazine, March 2010, pp134-138. But it can be downloaded from  http://www.soi.city.ac.uk/~sbbh653/publications/OpenMP_SPM.pdf (PDF, 1160 KB)

Tuesday, 30 October 2012

Multicore Processors and Image Processing - II (GPU)



       The GPU has a humble origin. In earlier days, multichip 3D rendering engines were developed and used as add-on graphics accelerator cards in personal computers. Slowly all the functionality of the engine was fused into a single chip, processing power steadily increased, and in 2001 it blossomed into the Graphics Processing Unit (GPU). NVIDIA, a pioneer in GPUs, introduced the GeForce 3 in 2001; then came the GeForce 7800, and in 2006 the GeForce 8800 was available in the market. Present-day GPUs are capable of performing 3D graphics operations like transforms, lighting, rasterisation, texturing, depth testing and display.

        NVIDIA introduced the Fermi GPU in its GPGPU line. It consists of multiple streaming multiprocessors (SMs) supported by a cache, a host interface, the GigaThread scheduler and DRAM interfaces. Each SM contains 32 cores, each of which can execute one floating-point or integer operation per clock, and is supported by 16 load/store units, four special function units, a 32K-entry register file and 64 KB of on-chip RAM. The Fermi GPU adheres to the IEEE 754-2008 floating-point standard, which means it offers high-precision results, and it supports the fused multiply-add (FMA) operation [2].


GPU Market

    Most technical blogs and websites cite Jon Peddie Research as the source of their data, and I have done the same. In this post two data sources [4, 5] are given and, as usual, their figures do not match each other. In ref. [5] the market share is based on discrete add-on boards alone; if integrated graphics were counted, Intel would also have a share of the market.

The total number of units sold in the 2nd quarter (April to June) of 2012 was 14.76 million [5].

Company       Market share (%)
AMD                40.3
NVIDIA             39.3
MATROX              0.3
S3                  0.1

Differences between conventional CPU and GPU
  •  A typical CPU relies on speculative execution aids such as caches and branch prediction. This speculative optimisation strategy pays off for code with high data locality, an assumption that does not fit all algorithms.
  •  A CPU maximises single-threaded performance by increasing the raw clock speed, which results in hotter transistors, more current leakage and higher manufacturing cost.
  •  The metric used for a conventional CPU is raw clock speed. If performance is measured with metrics like GFLOPS (giga floating-point operations per second) per dollar or power usage in watts, the results are not impressive: for example, a Tesla GPU is about eight times more powerful than an Intel Xeon processor in terms of GFLOPS, yet they cost roughly the same.
  •  In a CPU most of the chip area is devoted to supporting speculative execution. The Core i7 (quad-core) processor based on Intel's Nehalem microarchitecture is fabricated in 45 nm technology, and only about 9% of its chip area is occupied by integer and floating-point execution units; the rest is devoted to the DRAM controller, L3 cache and so on. In a GPU, by contrast, most of the chip area is devoted to execution units.
  •  A GPU can never replace the CPU. Code that exhibits parallelism can be ported to the GPU and executed there efficiently.

Intel's multicore chips such as the Atom, Core 2, Core i7 (Nehalem architecture) and Xeon W5590 (quad-core, also based on Nehalem) are all optimised for speculative execution [2].

Programming GPU

General-purpose GPUs (GPGPUs) are graphics-optimised GPUs commissioned to perform non-graphics processing. Originally, one needed in-depth knowledge of GPU hardware and considerable software skill to run an algorithm on a GPGPU, a feat beyond typical software programmers. To enable programmers to exploit the power of the GPU, NVIDIA developed the CUDA (Compute Unified Device Architecture) toolkit, which helps software developers focus on their algorithm rather than spend their valuable time mapping it to hardware, thus improving productivity. CUDA is available for the C and Fortran programming languages, and the next generation of CUDA hardware (code named Fermi) supports languages like C, C++, Fortran, Java, MATLAB and Python. The CUDA toolkit is taught in more than 200 colleges throughout the world, and NVIDIA says it has sold more than 100 million CUDA-capable chips.

A GPU has several hundred cores; for example, the NVIDIA Tesla C2070 has 448. An algorithm to be executed on a GPU is partitioned into "host code" and "device code". The host code has one thread that persists throughout the execution of the algorithm; wherever many operations can be performed in parallel, that portion is marked as device code by the programmer. When this region is executed, multiple threads are created (the technical term is forked) and the GPU cores execute the code chunk; after completion the threads are destroyed automatically. In GPU literature the function executed by these threads is called a kernel, and related terms include thread block, warp (a group of 32 threads executed together) and grid (refer to page 19 of [2]).

NVCC is the CUDA C compiler developed by NVIDIA, and the Portland Group (PGI) has developed a CUDA-based Fortran compiler. GPU programming is not confined to the CUDA toolkit alone: other software developers have come out with their own packages to handle GPUs. OpenCL is an open GPU programming standard developed by the Khronos Group, the same group that maintains OpenGL; DirectCompute is a Microsoft product; and the HMPP (Hybrid Multicore Parallel Programming) workbench was developed by the French company CAPS entreprise.

Image Processing Case Study

In a CT scan, a series of X-ray projections is taken around the human body and 3D images are then reconstructed from these two-dimensional projections. Reconstruction is highly computation intensive, so GPUs are an obvious choice. It is reported that NVIDIA's GeForce 8800 GPU processes 625 projections, each of size 1024 x 768, to produce a reconstructed volume of 512x512x340 (i.e. 340 slices of 512x512 pixels) in 42 seconds; with a medical-grade GPU this can be reduced to 12 seconds. A medical-grade GPU should have 32-bit precision end-to-end to produce accurate results [3].

 Source

[1] White paper on “Easy Start to GPU Programming” by Fujitsu Incorporation, http://globalsp.ts.fujitsu.com/dmsp/Publications/public/wp-Easy-Start-to-GPU-Programming.pdf, (PDF, 280KB).
[2]  A white paper on “NVIDIA’s Fermi: The First Complete GPU Computing Architecture”  by Peter N. Glaskowsky,  http://cs.nyu.edu/courses/spring12/CSCI-GA.3033-012/Fermi-The_First_Complete_GPU_Architecture.pdf, (PDF, 1589 KB).
[3] White Paper on “Current and next-generation GPUs for accelerating CT reconstruction: quality, performance, and tuning” http://www.barco.com/en/products-solutions/~/media/Downloads/White papers/2007/Current and next-generation GPUs for accelerating CT reconstruction quality performance and tuning.pdf . 122KB
[4] http://www.jonpeddie.com/publications/market_watch/
[5] http://www.techspot.com/news/49946-discrete-gpu-shipments-down-in-q2-amd-regains-market-share.html



Sunday, 30 September 2012

Multicore Processors and Image Processing - I


         Quad-core based desktops and laptops have become the order of the day. These multi-core processors, after long years of academic captivity, have come into the limelight. The study of multi-core and multi-processor systems comes under the field of High Performance Computing.
          When two or more processor cores are fabricated in a single package, it is called a multi-core processor. Multi-processor systems and multi-processing (a single processor running multiple applications simultaneously) are different from multi-core processors. Dual-core and quad-core processors are the ones most commonly used by the general public. With prevailing fabrication technology, the race to increase the raw clock rate beyond about 3 GHz is nearing its end, so a further increase in computing power is possible only by deploying parallel computing concepts. This is the reason why all the manufacturers are introducing multi-core processors.

           Image processing algorithms exhibit a high degree of parallelism. Most algorithms have loops that iterate over a pixel, a row or an image region. Consider a loop that has to iterate 200 times: a single processor will execute all 200 iterations, whereas in a quad-core machine each core need iterate only 50 times, so the quad-core will finish the task much faster. To achieve this speed-up, programs have to be slightly modified and multicore-aware compilers are required.
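As a concrete illustration of this split, the short OpenMP sketch below (OpenMP is the API taken up in the later posts of this series) runs a 200-iteration pixel loop on four threads with a static schedule, so each thread handles a contiguous block of 50 iterations; the per-pixel "brightness tweak" is only a placeholder operation.

#include <stdio.h>
#include <omp.h>

#define N_PIXELS 200

int main(void)
{
    unsigned char pixel[N_PIXELS];
    int first[8] = {0}, count[8] = {0};   /* per-thread bookkeeping (up to 8 threads) */

    for (int i = 0; i < N_PIXELS; i++)
        pixel[i] = (unsigned char)i;

    /* Static schedule: with 4 threads each one gets a contiguous block of 50 iterations. */
    #pragma omp parallel for schedule(static) num_threads(4)
    for (int i = 0; i < N_PIXELS; i++) {
        int t = omp_get_thread_num();
        if (count[t]++ == 0)
            first[t] = i;                                /* first index seen by this thread */
        pixel[i] = (unsigned char)(pixel[i] / 2 + 64);   /* placeholder per-pixel work      */
    }

    for (int t = 0; t < 4; t++)
        if (count[t] > 0)
            printf("thread %d handled %d iterations starting at index %d\n",
                   t, count[t], first[t]);
    return 0;
}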


Image size      Time to execute (ms)
                Single-core    Multi-core
256x256              18             8
512x512              65            21
1024x1024           260            75

       The time needed to twist the Lenna image is given in the table above: the 1024x1024 image takes 260 ms on a single core and 75 ms on a multi-core processor. From this it is clear that the multi-core implementation is roughly three times faster, and the speed-up is maintained as the image size grows [1].

        Algorithms exhibit fine-grain, medium-grain or coarse-grain parallelism. Smoothing, sharpening, filtering and convolution operate on the entire image and are classified as fine-grain. Medium-grain parallelism is exhibited by the Hough transform, motion analysis and other functions that operate on part of an image. Position estimation and object recognition come under the coarse-grain class, where the exhibited parallelism is very low [2]. Algorithms can also be split into memory-bound and CPU-bound ones; in CPU-bound algorithms the number of cores helps to achieve a near-linear speed-up, subject to Amdahl's law [3].
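The Amdahl's law bound mentioned above is easy to evaluate: if a fraction p of the run time can be parallelised, N cores give a speed-up of at most 1 / ((1 - p) + p/N). The sketch below tabulates this for an assumed 90% parallel fraction; the 0.9 figure is purely illustrative.

#include <stdio.h>

/* Amdahl's law: maximum speed-up on n cores when a fraction p of the
 * sequential run time can be parallelised.                            */
static double amdahl(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    double p = 0.90;                       /* assumed parallel fraction */
    int cores[] = {1, 2, 4, 8, 16};

    for (int i = 0; i < 5; i++)
        printf("%2d cores -> speed-up %.2f\n", cores[i], amdahl(p, cores[i]));
    return 0;
}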

       Multicore processors developed by Intel and AMD are very popular. The contents of the following table are taken from [3].

Multicore processor        No. of cores    Clock speed
Intel Xeon E5649                12            2.53 GHz
AMD Opteron 6220                16            3.00 GHz
AMD Phenom X4 (9550)             4            2.20 GHz

        Intel's Xeon with simultaneous multithreading capability, branded Hyper-Threading, was launched in 2002. Intel has also launched the Core 2 Duo T7250, a low-voltage mobile processor running at 2 GHz. Sony's PlayStation 3 has a very powerful multi-core processor called the Cell Broadband Engine, developed jointly by Sony, Toshiba and IBM. It has a 64-bit PowerPC processor connected by a circular bus to eight RISC (Reduced Instruction Set Computing) co-processors with a 128-bit SIMD (Single Instruction Multiple Data) architecture. SIMD is well suited to exploiting fine-grain parallelism.


In part two of this series, the GPU and the OpenMP Application Program Interface (API) will be discussed.


Source

[1] Greg Slabaugh, Richard Boyes, Xiaoyun Yang, "Multicore Image Processing with OpenMP", Appeared in IEEE Signal Processing Magazine, March 2010, pp134-138. But it can be downloaded from  http://www.soi.city.ac.uk/~sbbh653/publications/OpenMP_SPM.pdf (PDF, 1160KB )
[2] Trupti Patil, "Evaluation of Multi-core Architecture for Image Processing,"  MS Thesis,  year 2009, at Graduate School of Clemson University (PDF, 1156KB) www.ces.clemson.edu/~stb/students/trupti_thesis.pdf

[3] OpenMP in GraphicsMagick,  http://www.graphicsmagick.org/OpenMP.html


Note
  • Please read [1]; it is insightful and the language used is simpler than that of standard technical articles.
  • The warped Lenna image and the table above were adapted from [1], not copied. GIMP was used to create the warp effect on the Lenna image.