Wednesday, 27 February 2013

Digital Television

         The public got its first chance to see television at the New York World’s Fair in 1939. It took two more years to develop a standard for black-and-white TV broadcasting. In the USA, the Color TV broadcast standard was developed by the National Television Systems Committee (NTSC) in 1953. It has 525 horizontal scanning lines, of which 480 are active (i.e. visible), and it uses interlaced scanning. This system is called Standard Definition TV (SDTV) and is referred to in the technical literature as 480i (480 lines, interlaced scanning) [1]. Early TVs were developed by the Marconi company and by John Logie Baird of England. Marconi developed interlaced scanning while Baird used progressive scanning. Marconi used a landscape format (width greater than height) whereas Baird used a portrait format to suit the talking head. The British Broadcasting Company (later Corporation) adopted the Marconi system [2]. Photographs of early televisions are available on the Internet [3], [4].

Digital Television
         The Federal Communications Commission (FCC) of the USA established the Advanced Television Systems Committee (ATSC) in 1995 to develop Digital Television (DTV) standards. The DTV standard was adopted in the USA in 1996 and was subsequently adopted by Canada, South Korea, Argentina and Mexico. As per the FCC's original plan, there would be no analog broadcasts in the USA after February 2009.

HDTV
One variant of DTV is High Definition Television (HDTV). ATSC proposed 18 different DTV formats, of which 12 are SDTV formats and the remaining six are HDTV formats. For a DTV format to qualify as HDTV, it should satisfy the following five parameters:

  • Number of active lines per frame (720), 
  • Active pixels per line (1280), 
  • Aspect ratio (16:9), 
  • Frame rate (30 fps; exceptions permitted), 
  • Pixel shape (square).

         Please note that a DTV format with a 16:9 aspect ratio and a 480x704 frame size is available, but it is not an HDTV format because it does not have the required number of active lines per frame.

  The six HDTV variants can be broadly classified into two groups based on frame size. The first group has a 1080x1920 frame size with 24p, 30p (progressive) and 30i (interlaced) frames per second (fps). The second group has a 720x1280 frame size with 24p, 30p and 60p fps. Square pixels are widely used in computer screens. Progressive scanning is suitable for slow-moving objects, while interlaced scanning is suited to fast-paced sequences; a minimum of 60 frames per second is required to compensate for this deficiency. DTV is designed to handle only 19.39 Mbps, a bit rate that is not sufficient to handle 1080p at 60 fps. DTV's SDTV formats can be broadly classified into three categories. The first has a 480x640 frame size, 4:3 aspect ratio, square pixels and 24p, 30p, 30i and 60p fps; this is very similar to the existing analog SDTV format. The remaining two use a 480x704 frame size.

HDTV camcorder
HDTV camcorders were introduced in 2003. A professional camera recorder (camcorder) has a 2/3-inch image sensor with three chips, a colour viewfinder, and a 10x or greater zoom lens. In professional camcorders, cable connectors are provided at the rear of the camera. The interface standards are IEEE 1394 (FireWire) or High Definition Serial Digital Interface (HD-SDI). Coaxial cables are connected and the content is sent to a monitor or a video tape recorder.

      A consumer-grade camcorder uses a 1/6-inch single-chip image sensor and a 3.5-inch LCD screen for viewing. It weighs around a kilogram and costs much less. Recorded content is stored on video cassette or hard disk.
A professional camera like the Dalsa Origin has an image sensor array of 4096 x 2048 pixels. At a rate of 24 fps it generates 420 MB/second [2, pg. 165], which means a CD would be filled in under two seconds. An image sensor array is made up of N x M pixels. These pixels may be made of Charge Coupled Device (CCD) or Complementary Metal Oxide Semiconductor (CMOS) elements. Consumer-grade camcorders and digital cameras use CMOS pixels. The following points compare CCD and CMOS [2].
  1. CMOS releases less heat and consumes about 100 times less power than CCD.
  2. CMOS can work well in low-light conditions.
  3. CMOS is less sensitive, so a transistor amplifier is provided with each pixel. In CCD there is no need for a per-pixel amplifier, so more of the pixel area is devoted to the sensitive area (refer to the figure).
  4. CMOS sensors can be fabricated on a normal silicon production line, the same kind used to fabricate microprocessors and memories; CCD needs a dedicated production line.
  5. CMOS sensors suffer from electrostatic discharge problems.
A three-chip camera is made up of a convex lens, a beam splitter and red, green and blue sensors. The convex lens focuses the incoming light into the camera. The beam splitter separates the colours optically; it is made of glass prisms joined by red and blue dichroic reflectors. A current proportional to the light falling on each sensor is produced. It is then A-to-D converted and either stored or sent as data via cable.

       The bottom-most layer of the image sensor array is the Printed Circuit Board (PCB). On top of it the pixel chips are mounted, and over each of those the light-sensitive sensor is placed (the light grey convex shape, not labelled in the figure). Please note that the entire pixel chip is not devoted to the sensor. The incoming light from the dichroic reflectors is focused onto each pixel chip by a lens. In a single-chip camera, a primary-colour filter is placed between the lens and the sensor's sensitive area.

A single-chip camera is made up of a convex lens, a colour filter and a sensor. The colour filter allows only one primary colour (red, green or blue) through to each sensor element. In the sensor array, half of the elements have a green filter and the remaining half is split equally between red and blue. If the colour filters were arranged sequentially, aliasing would come into the picture and result in Moiré patterns. Film is made up of very small, highly irregular silver grains, which helps suppress the appearance of regular patterns. It is not possible to create a truly random pattern in single-chip camera pixels. To create the effect of randomness, the pixel rows are arranged so that each row has a different distribution from the rows above and below it. This arrangement is called the Bayer pattern and it was developed at Kodak. When the pixel count and sampling rate are higher than any pattern frequency likely to appear (for example, a checked shirt), sequential filtering can be applied; the Panavision Genesis, for instance, has 12.4 million pixels. In sequential filtering each column contains one primary colour only.
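
The arrangement can be sketched in a few lines of C. This is only a minimal illustration, assuming the common RGGB variant of the Bayer layout; a real camera would follow it with a demosaicing step that interpolates the two missing primaries at every pixel.

    #include <stdio.h>

    /* Primary colour sampled by pixel (row, col) in an RGGB Bayer mosaic.
     * Even rows alternate R,G,R,G,... and odd rows alternate G,B,G,B,...
     * so adjacent rows have different distributions and half of all
     * sites are green. */
    static char bayer_colour(int row, int col)
    {
        if (row % 2 == 0)
            return (col % 2 == 0) ? 'R' : 'G';
        return (col % 2 == 0) ? 'G' : 'B';
    }

    int main(void)
    {
        /* Print the filter layout of an 8 x 8 patch of the sensor. */
        for (int r = 0; r < 8; r++) {
            for (int c = 0; c < 8; c++)
                printf("%c ", bayer_colour(r, c));
            printf("\n");
        }
        return 0;
    }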

DTV reception
Digital TVs are made to receive DTV broadcast signals. A set-top box is an external tuner that receives DTV signals and feeds an old analog television with analog signals. Colour TV took 10 years to achieve five percent market penetration; DTV achieved 20 percent market penetration in seven years. Most satellite broadcasts are in digital format, and the set-top box at the receiving end generates analog signals suited to analog TVs. The day is not far off when digital TV will be in use all over the world.

Source
  1. Chuck Gloman and Mark J. Pestacore, “Working with HDV : shoot, edit, and deliver your high definition video,” Focal press, 2007, ISBN 978-0-240-80888-8.
  2. Paul Wheeler, “High Definition Cinematography,” Focal Press, Second edition 2007. ISBN: 978-0-2405-2036-0
  3. http://www.scottpeter.pwp.blueyonder.co.uk/new_page_11.htm
  4. http://www.scottpeter.pwp.blueyonder.co.uk/Vintagetech.htm

Wednesday, 30 January 2013

Video Compression Standards


In the earlier post the basics of digital video compression were discussed. This post discusses five broad application areas of video and the H.26x and MPEG-4 video compression standards.
  In all cases video frames are acquired by a camera, compressed and stored in non-volatile memory. Based on the requirement, the compressed video is then transmitted elsewhere. Video quality depends upon the following factors: the number of frames per second, the frame resolution and the pixel depth. A High Definition (HD) video has 25 frames per second, each frame has 1920 x 1080 pixels, and each pixel consumes 24 bits to represent its RGB value. Video quality has a direct bearing on cost: one has to understand the requirement first and then finalize the video quality so that cost is kept to a minimum.

1. Studio:
In video and film production, video taken on a set or location is called raw footage. The footage is then taken up for editing. Here video processing operations like colour-space conversion, noise reduction, motion compensation and frame-rate conversion are carried out as required. After this, the director and editor of the movie sit together, remove unwanted portions and rearrange the footage into an order that makes the movie. Editing involves some loss of quality; to compensate, the raw footage should be of the highest quality.

2. Television:
         Digital television signals are broadcast through terrestrial transmitters or by satellite transponders. Digital terrestrial broadcasting is popular in the USA and Europe. Digital video is economically stored and distributed on Video Compact Disc (VCD) and Digital Versatile Disc (DVD). In news clips the frame-to-frame changes are small; in sports and action movies they are large. Digital video signals were originally optimized for standard-resolution TVs (old NTSC, PAL, SECAM). Earlier the MPEG-1 video compression standard was used; now MPEG-2 is used to get HD (HD720, HD1080) quality.

Figure 1.  Frame sizes used by different TV standards. 
3. Internet streaming:
 In video streaming, data is sent continuously to the client over the Internet and the user is able to decode and view the video in near real time. The Internet is slowly becoming a giant video server; YouTube, Metacafe and Dailymotion are a few examples of popular video servers. The file formats used by the servers are MOV, FLV, MPEG-4 and 3GP. These are called wrappers, and they contain metadata. The video codecs used are MPEG-4, H.264, Sorenson Spark, VC-1 etc. [1]. The available video resolutions are 240, 360 and HD. In streaming, latency (time delay) is the greatest problem; the problem of latency is unheard of in broadcast technologies. Streaming servers do not allow the content to be stored, but online tools are available to save it to a local hard disk.

4. Video conferencing:
        The next great thing is video conferencing, which may be one-to-one or a conference call. The foundations of video telephony were laid 40 years ago. Integrated Services Digital Network (ISDN) technology was built to handle video telephony, and a new compression standard, H.261, was created for it. At that time video telephony was not commercially successful and remained a technological feat only. The advent of third-generation (3G) wireless technologies recreated interest in video telephony and conferencing. A video conferencing system has much more stringent latency requirements: humans can tolerate a loss in visual quality but not high latency. Now the H.264 standard is widely used. The video resolution is typically 352 x 288, i.e. one quarter the size of a PAL TV frame.

5. Surveillance:
     Falling prices of surveillance video systems and their proven ability in crime prevention and detection have led to wider deployment. The video should be of high enough quality to recognize a suspect's face, and the video content should not be altered; if it is altered it will not be accepted as evidence in a court of law. Motion JPEG, H.264 and MPEG standards are used for recording surveillance video. Real-time monitoring systems use the H.264 and MPEG video codecs, while the Motion JPEG (MJPEG) codec is employed to capture individual frames. The requirements of the entertainment and surveillance industries are totally different: poor lighting and 24x7 storage requirements are unique to surveillance applications [2].

Video Compression standard interrelationships:

          A long list of video compression standards is available. Careful study of the various standards reveals a lot of commonality among them. MPEG and H.26x stand out as the top contenders.

I. MPEG:
            The Motion Pictures Experts Group (MPEG) is a study group that develops digital video standards for the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). These standards were built for the entertainment industry. In 1993 MPEG-1 was introduced to store digital video with a quality equal to VHS tape. In 1994 MPEG-2 was developed to handle HD video; MPEG-3 was merged into MPEG-2. MPEG-4 was introduced in 1999; its video coding is still DCT-based, while its still-texture coding uses the Wavelet transform. Related standards such as MPEG-7 (content description) and MPEG-21 (multimedia framework) also exist [3], [4].

II. H.26x:
           The International Telecommunication Union's telecommunication standardization sector (ITU-T) is responsible for the H.26x series of standards. These standards were built to handle video calls, in which frame-to-frame changes are small: most of the time the content is a human face that moves only mildly. H.26x is network resilient and has low latency; to reduce latency, 'B' frames are avoided in the coded stream. As it evolved from telephone systems, it works in multiples of 64 kbit/s. Here also the DCT is used.
            Later H.262 was developed, which is the same as MPEG-2 video. Then came H.263. In 1999 the developers of H.26x, the Video Coding Experts Group (VCEG), joined with ISO/IEC to form the Joint Video Team (JVT). They built H.264/MPEG-4 Part 10, otherwise called Advanced Video Coding (AVC). MPEG-4 has 16 parts, and the 10th part deals with video coding. 4:2:0 sampling is used, and both progressive and interlaced scanning are permitted (a short sketch of 4:2:0 subsampling is given below).
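
The following C fragment is a minimal sketch of what 4:2:0 subsampling does to one chroma plane: every 2 x 2 block of chroma samples is reduced to a single value (here by averaging), so the two chroma planes end up with a quarter of the luma resolution. The function name and the assumption of even width and height are illustrative.

    #include <stdint.h>

    /* Reduce one chroma plane (Cb or Cr) from width x height to
     * (width/2) x (height/2) by averaging each 2 x 2 block.
     * width and height are assumed to be even. */
    void subsample_420(const uint8_t *chroma, int width, int height,
                       uint8_t *out /* (width/2) x (height/2) */)
    {
        for (int y = 0; y < height; y += 2) {
            for (int x = 0; x < width; x += 2) {
                int sum = chroma[y * width + x]
                        + chroma[y * width + x + 1]
                        + chroma[(y + 1) * width + x]
                        + chroma[(y + 1) * width + x + 1];
                out[(y / 2) * (width / 2) + (x / 2)] = (uint8_t)(sum / 4);
            }
        }
    }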

III. Motion JPEG:
          Motion JPEG was developed in 1992. Only intra-frame coding is used: put simply, each frame is a JPEG image, and inter-frame coding is never used. Because of this the compression efficiency is poor, but it has relatively low latency and is more resilient to errors. One may wonder how MJPEG differs from JPEG: in MJPEG a sequence of JPEG frames (for example 16 per second) is shown to create an illusion of motion. It consumes more storage but it also retains more information, so its frames can be used as evidence in a court of law. MPEG systems, by contrast, send only two to four full frames (I-frames) per second to the receiver. Motion JPEG 2000, which is based on JPEG 2000, has since been introduced; it uses the Wavelet transform instead of the DCT and is computationally demanding, but its compression efficiency is high. [4]

Source:
 [1] Yassine Bechqito, High Definition Video Streaming Using H.264 Video Compression, Master's Thesis, Helsinki Metropolia University of Applied Sciences. pg.18, 21(PDF, 3642 KB)
[2] http://www.initsys.net/attachments/Compression and DigitisationPDF.pdf (PDF, 242 KB)
[3] Iain E. G. Richardson, “H.264 and MPEG-4 Video Compression Video Coding for Next-generation Multimedia,” John Wiley & Sons Ltd, 2003. (Examples are very good.  ISBN 0-470-84837-5)
[4] Salent-Compression-Report.pdf,  http://www.salent.co.uk/downloads/Salent-Compression-Report.pdf  (PDF, 1921 KB)

Monday, 31 December 2012

Video Compression Basics

In the past, video transmission and storage were in the analog domain. Popular analog transmission standards were NTSC, PAL and SECAM. Video tapes were used as the storage medium, with the VHS and Betamax formats. Later, video transmission and storage moved to the digital domain. Digital signals are more immune to noise and require less transmission power than analog signals, but they need more bandwidth, and in communication engineering power and bandwidth are scarce commodities. Compression is employed to reduce the bandwidth requirement by removing the redundancy present in the digital signal; from a mathematical point of view, it decorrelates the data. The following case study highlights the need for compression. Digitized NTSC video requires a data rate of 165 Mbit/s, so a 90-minute uncompressed NTSC video generates about 110 gigabytes [1]. Around 23 DVDs would be required to hold this huge amount of data, yet one comes across DVDs that contain four 90-minute movies. This is possible only because of efficient video compression techniques.
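
The arithmetic behind the case study can be checked with a few lines of C. The 165 Mbit/s rate comes from the text; the 4.7 GB single-layer DVD capacity is an assumption made here for the calculation.

    #include <stdio.h>

    int main(void)
    {
        double rate_bps = 165e6;                 /* digitized NTSC, bits per second */
        double seconds  = 90.0 * 60.0;           /* 90-minute programme             */
        double total_gb = rate_bps * seconds / 8.0 / 1e9;   /* gigabytes           */
        double dvd_gb   = 4.7;                   /* single-layer DVD capacity       */

        printf("Uncompressed size: %.0f GB\n", total_gb);        /* about 111 GB   */
        printf("DVDs required    : %.1f\n", total_gb / dvd_gb);  /* about 23.7     */
        return 0;
    }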

Television (TV) signals are a combination of video, audio and synchronization signals. When the general public says video, they actually mean TV signals; in the technical literature TV signals and video are different things. If 30 still images (each slightly different from the next) are shown within a second, an illusion of motion is created in the eyes of the observer. This phenomenon is called 'persistence of vision'. In video technology a still image is called a frame. Eight frames per second are sufficient to show the illusion of motion, but 24 frames are required to create smooth motion as in the movies.

Figure 1 Two adjacent frames   (Top)  Temporal redundancy removed  image (Bottom)

Compression can be classified into two broad categories: transform coding and statistical coding. In transform coding, the Discrete Cosine Transform (DCT) and Wavelet transforms are extensively used for image and video compression. In statistical coding, Huffman coding and arithmetic coding are extensively used. Transform coding is applied to the digital video signal first, and then statistical coding is applied to the transform coefficients. This strategy is common to image and video signals. For further details read [2].
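
As an illustration of the transform-coding stage, the following C function is a direct (unoptimised) implementation of the forward 8 x 8 DCT used on each block in intra-frame coding; the quantization and statistical-coding stages that follow it are omitted. Compile with -lm.

    #include <math.h>

    #define N  8
    #define PI 3.14159265358979323846

    /* Forward 8 x 8 DCT-II.  in: pixel block, out: transform coefficients. */
    void dct_8x8(const double in[N][N], double out[N][N])
    {
        for (int u = 0; u < N; u++) {
            for (int v = 0; v < N; v++) {
                double cu = (u == 0) ? 1.0 / sqrt(2.0) : 1.0;
                double cv = (v == 0) ? 1.0 / sqrt(2.0) : 1.0;
                double sum = 0.0;
                for (int x = 0; x < N; x++)
                    for (int y = 0; y < N; y++)
                        sum += in[x][y]
                             * cos((2 * x + 1) * u * PI / (2.0 * N))
                             * cos((2 * y + 1) * v * PI / (2.0 * N));
                out[u][v] = 0.25 * cu * cv * sum;  /* C(u)C(v)/4 scaling for N = 8 */
            }
        }
    }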

In video compression, intra-frame coding and inter-frame coding are employed. Intra-frame coding is similar to JPEG coding; inter-frame coding exploits the redundancy present among adjacent frames. Five to fifteen frames form a Group of Pictures (GOP). In the figure, the GOP size is seven and it contains one Intra (I) frame, two Prediction (P) frames and four Bi-directional prediction (B) frames. In an I frame, spatial redundancy alone is exploited and the coding is very similar to JPEG compression. In P and B frames both spatial and temporal (time) redundancy is removed. In Figure 1 a temporal-redundancy-removed image can be seen. In Figure 2, P frames are present in the 4th and 7th positions. The P1 frame in the 4th position contains the difference between the I frame and the 4th frame; only this difference (prediction error) is coded, so to regenerate the 4th frame both the I frame and the P1 frame are required. Likewise, the 2nd frame (B1) is coded as the prediction error with respect to the I and P1 frames. The decoding sequence is therefore I, P1, B1, B2, P2, B3, B4: each B frame can be decoded only after both of its reference frames.

Figure 2 Group of Pictures (GOP)

One may wonder why a GOP is limited to 15 frames. More P and B frames result in more efficient compression. The flip side is that if there is an error in the I frame, the dependent P and B frames cannot be decoded properly. This results in a partially decoded still image (the I frame) being shown to the viewer for the entire duration of the GOP; for a 15-frame GOP one may see a still image for half a second, and beyond this duration the viewer is annoyed at looking at a still image. Increasing the GOP size also increases decoding time, which is included in the latency calculation, and real-time systems require very low latency.

In a typical TV soap-opera episode, very few scene changes occur within a given duration. Take two adjacent frames: objects (a face, a car, etc.) in the first frame will have moved only slightly in the second frame. If we know the direction and quantum of the motion, then we can move the first frame's objects accordingly to recreate the second frame. The idea is simple to comprehend, but the implementation is computationally taxing. Each frame is divided into a number of macroblocks, each containing 16x16 pixels (in JPEG, 8x8 pixels are called a block, which is why 16x16 pixels are called a macroblock). Macroblocks are chosen one by one in the current frame (in our example, the 2nd frame in Figure 1) and the 'best matching' macroblock is found in the reference frame (the first frame in Figure 1). Using the best-matching macroblock as a prediction and coding only the difference is called motion compensation. The positional difference between the two blocks is represented by a motion vector, and the process of searching for the best-matching macroblock is called motion estimation [3]. A minimal sketch of this search is given below.
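
The sketch below shows an exhaustive ("full search") motion estimation for a single 16x16 macroblock using the sum of absolute differences (SAD) as the matching criterion. It is only an illustration: real encoders use much faster search strategies, sub-pixel refinement and rate-distortion criteria.

    #include <limits.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define MB 16   /* macroblock size */

    /* Full-search motion estimation for the macroblock whose top-left corner
     * is (bx, by) in the current frame.  cur and ref are the luma planes of
     * the current and reference frames (width x height).  The search covers
     * +/- range pixels around the same position in the reference frame; the
     * motion vector of the best match is returned in (*mvx, *mvy). */
    void motion_estimate(const uint8_t *cur, const uint8_t *ref,
                         int width, int height,
                         int bx, int by, int range, int *mvx, int *mvy)
    {
        long best_sad = LONG_MAX;
        *mvx = 0;
        *mvy = 0;

        for (int dy = -range; dy <= range; dy++) {
            for (int dx = -range; dx <= range; dx++) {
                int rx = bx + dx, ry = by + dy;
                if (rx < 0 || ry < 0 || rx + MB > width || ry + MB > height)
                    continue;                     /* candidate falls outside frame */

                long sad = 0;                     /* sum of absolute differences   */
                for (int y = 0; y < MB; y++)
                    for (int x = 0; x < MB; x++)
                        sad += labs((long)cur[(by + y) * width + bx + x]
                                  - (long)ref[(ry + y) * width + rx + x]);

                if (sad < best_sad) {             /* remember the best match so far */
                    best_sad = sad;
                    *mvx = dx;
                    *mvy = dy;
                }
            }
        }
    }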


Figure 3 Motion Vector and Macroblocks

       
         A closer look at the first and second frames in Figure 1 offers the following inferences: (1) there is a slight colour difference between the first and second frames; (2) the pixel located at (3,3) in the first frame is the (0,0) pixel in the second frame. 
         In Figure 3 a small portion of a frame is taken and its macroblocks are shown: there are 16 macroblocks, in four rows and four columns.

A group of macroblocks is combined to form a slice.

Further Information:  
  • Display systems like TVs and computer monitors use the additive colour mixing concept; the primary colours are Red, Green and Blue. Printing uses the subtractive colour mixing concept, with the primary colours Cyan, Magenta, Yellow and Black (CMYK). 
  • The human eye is more sensitive to brightness variation than to colour variation. To exploit this feature the YCbCr model is used: Y -> Luminance, Cb -> Chrominance Blue, Cr -> Chrominance Red. Please note Chrominance Red ≠ Red. (A conversion sketch is given after this list.)
  • To conserve bandwidth, analog TV systems use Vestigial Sideband Modulation, a variant of Amplitude Modulation (AM), and incorporate the interlaced scanning method.
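
The following C function is a minimal sketch of the RGB to YCbCr conversion, using the 8-bit full-range BT.601 (JFIF-style) coefficients; broadcast systems use slightly different, limited-range variants.

    #include <stdint.h>

    static uint8_t clamp8(double v)
    {
        if (v < 0.0)   return 0;
        if (v > 255.0) return 255;
        return (uint8_t)(v + 0.5);
    }

    /* RGB -> YCbCr, 8-bit full-range BT.601 coefficients.  Y carries the
     * luminance; Cb and Cr carry colour-difference signals centred on 128,
     * so a grey pixel maps to (Y, 128, 128). */
    void rgb_to_ycbcr(uint8_t r, uint8_t g, uint8_t b,
                      uint8_t *y, uint8_t *cb, uint8_t *cr)
    {
        *y  = clamp8( 0.299    * r + 0.587    * g + 0.114    * b);
        *cb = clamp8(-0.168736 * r - 0.331264 * g + 0.5      * b + 128.0);
        *cr = clamp8( 0.5      * r - 0.418688 * g - 0.081312 * b + 128.0);
    }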

Note: This article was written to give the reader an idea of video compression within a short span of time. It has been written carefully, but accuracy cannot be guaranteed, so please read the books and understand the concepts properly.
Sources:
[2]  Salent-Compression-Report.pdf, http://www.salent.co.uk/downloads/Salent-Compression-Report.pdf  (PDF, 1921 KB)
[3]  Iain E. G. Richardson, “H.264 and MPEG-4 Video Compression Video Coding for Next-generation Multimedia,” John Wiley & Sons Ltd, 2003. (Examples are very good.  ISBN 0-470-84837-5)

Wednesday, 28 November 2012

Multicore Software Technologies


Powerful multicore processors have arrived in the market, but programmers with sound enough knowledge of the hardware to harness their full potential are in short supply. The obvious solution is to produce a good number of programmers with sufficient knowledge of multicore architecture. Another solution is to create software that converts code meant for a single processor into multicore-compatible code. In this article the second method is discussed in detail; the reasons why the first method is not feasible are left to the reader as an exercise.

It is a well-known fact that user friendliness and code efficiency do not go hand in hand. For example, a factorial program written in assembly language will produce the smallest executable or binary file (.exe in Windows). The same algorithm implemented in higher-level languages (Fortran, Matlab) will produce a larger file, and one can expect a moderate binary size from C. Writing a program in assembly language is tedious; in Matlab it is easy.



A graph showing the relationship between the effort required to write a program and the computational speedup achieved by that programming language is shown above; the essence of the previous paragraph is neatly captured in it. Multicore software tools like Cilk++, Titanium and VSIPL++ offer user friendliness and at the same time are able to produce efficient applications. Is it not a "have your cake and eat it too" situation? Let us hope it will not take much time to reach the coveted 'green ellipse' position.

OpenMP (Open Multi-Processing) is an open standard and it is supported by major computer manufacturers. Code written in Fortran and C/C++ can be converted into code compatible with multicore processors. Multicore compilers are available for the Windows, Linux and Apple Mac operating systems. The advantages of OpenMP are that it is easy to learn and compatible with different multicore architectures. Software tools like Unified Parallel C, Sequoia, Co-array Fortran, Titanium, Cilk++, pMatlab and Star-P are available as alternatives to OpenMP, while CUDA, Brook+ and OpenGL are available to cater to Graphics Processing Unit (GPU) based systems.

Name                          Developed by                  Keywords      Language extension
Unified Parallel C (UPC)      UPC Consortium                shared        C
Sequoia                       Stanford University           inner, leaf   C
Co-Array Fortran              ---                           ---           Fortran
Titanium                      ---                           ---           Java
pMatlab                       MIT Lincoln Laboratory        ddense, *p    MATLAB
Star-P                        Interactive Supercomputing    ---           MATLAB, Python
Parallel Computing Toolbox    MathWorks Inc.                spmd, end     MATLAB

--- Data not available

Multicore-compatible code is developed in the following way. Code is written in a high-level language and compiled; this helps rectify any errors in the program. Next, the code is analyzed, and wherever parallelism is exhibited that section of code is marked with a special keyword (see the table above) by the programmer. It is then compiled again with the multicore software tool, which automatically inserts the code needed to take care of memory management and data movement. In a multicore environment, the operating system creates a master thread at the time of execution. The master thread takes care of the execution, and wherever a special keyword is spotted, threads are forked (i.e. created) and given to separate cores. After the job is completed, the threads are terminated.

  1. #pragma omp parallel for \
  2.     private(n) \
  3.     shared(B, y, v)
  4. for (n = 0; n < K; n++)
  5.     y[n] = B*cos(v*n);

In this paragraph the few-line code sample presented above is explained. The program has C-like syntax. Steps 4 and 5 are sufficient to generate a cosine signal; the first three lines exist to parallelize steps 4 and 5. In step 1, '#pragma' is a preprocessor directive in C/C++, 'omp' stands for OpenMP, and 'parallel for' states that the for loop is going to be parallelized. In step 3, the 'shared' clause states which variables are placed in global space where all cores can access them: the amplitude 'B', the array 'y' and the frequency 'v'. Each core keeps its own copy of 'n' in its private space so that it generates its portion of the cosine signal correctly. A complete, compilable version is sketched below.
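
The following is a minimal, self-contained version of the same example; the values of K, B and v are illustrative. With GCC it can be built with 'gcc -fopenmp cosine.c -lm'.

    #include <math.h>
    #include <omp.h>
    #include <stdio.h>

    #define K 1000              /* number of samples (illustrative) */

    int main(void)
    {
        double y[K];
        double B = 2.0;         /* amplitude (illustrative)         */
        double v = 0.1;         /* angular frequency (illustrative) */
        int n;

        /* Loop iterations are shared out among the available cores.
         * B, y and v live in shared memory; each thread has its own n. */
        #pragma omp parallel for \
            private(n) \
            shared(B, y, v)
        for (n = 0; n < K; n++)
            y[n] = B * cos(v * n);

        printf("y[10] = %f\n", y[10]);
        return 0;
    }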

A program tailored to multicore generally has two sections: one section performs the task and the other holds target-architecture-specific information such as the number of cores. Every software tool assumes a model; for further details read [1]. Multicore hardware architectures can be classified into two broad categories, viz. homogeneous (e.g., x86 cores) and heterogeneous (e.g., GPU). Software tools are built to suit only one of these architectures.

Middleware

Gone are the days when only supercomputers had multiprocessor systems. Today even embedded systems (e.g., smartphones) use multicore hardware. An embedded system is a combination of hardware and software: typically any change in the hardware requires some change in the software, and likewise any upgrade of the software may require a change in the hardware. To overcome this problem the "middleware" concept was introduced.

MIT Lincoln Laboratory developed a middleware named the Parallel Vector Library (PVL), suited to real-time embedded signal and image processing systems. Likewise, VSIPL++ (Vector Signal and Image Processing Library) was developed and is maintained by the High Performance Embedded Computing Software Initiative (HPEC-SI). VSIPL++ is suited to homogeneous architectures; for heterogeneous architectures PVTOL (Parallel Vector Tile Optimizing Library) is used. Here also, programs are compiled and then mapped to multicore systems.

Source

1. Hahn Kim, and R. Bond. “Multicore software technologies” IEEE Signal Processing Magazine, Vol. 26 No.6, 2009 pp. 80-89. http://hdl.handle.net/1721.1/52617 (PDF,629 KB)

2. Greg Slabaugh, Richard Boyes, Xiaoyun Yang, "Multicore Image Processing with OpenMP", Appeared in IEEE Signal Processing Magazine, March 2010, pp134-138. But it can be downloaded from  http://www.soi.city.ac.uk/~sbbh653/publications/OpenMP_SPM.pdf (PDF, 1160 KB)

Tuesday, 30 October 2012

Multicore Processors and Image Processing - II (GPU)



       The evolution of the GPU had a humble origin. In earlier days, multichip 3D rendering engines were developed and used as add-on graphics accelerator cards in personal computers. Slowly all the functionality of the engine was fused into one single chip. Processing power likewise steadily increased, and in 2001 it blossomed into the Graphics Processing Unit (GPU). NVIDIA, a pioneer in GPUs, introduced the GeForce 3 in 2001; then came the GeForce 7800 model, and in 2006 the GeForce 8800 was available in the market. Present-day GPUs are capable of performing 3D graphics operations like transforms, lighting, rasterization, texturing, depth testing and display.

        NVIDIA has introduced the Fermi GPU in its GPGPU fold. It consists of multiple streaming multiprocessors (SMs) supported by cache, a host interface, the GigaThread scheduler and DRAM interfaces. Each SM consists of 32 cores, each of which can execute one floating-point or integer operation per clock. Each SM is also supported by 16 load/store units, four special function units, a 32K-entry register file and 64K of on-chip RAM. The Fermi GPU adheres to the IEEE 754-2008 floating-point standard, which means that it offers high-precision results, and it supports the fused multiply-add (FMA) feature. [2]


GPU Market

    Most technical blogs and websites cite Jon Peddie Research as the source of their data, and I have done the same thing. In this blog two data sources [4, 5] are given, and as usual their data do not match each other. In ref. [5] only the market share of discrete add-on boards is considered; if integrated graphics were counted, Intel Inc. would also have a share of the market.

Total units sold in the 2nd quarter (April to June) of 2012 were 14.76 million [5].

Company     Market share (%)
AMD              40.3
NVIDIA           39.3
MATROX            0.3
S3                0.1

Differences between conventional CPU and GPU
  •  A typical CPU banks on speculative execution aids like caches and branch prediction. This speculative optimization strategy pays off for code with high data locality, an assumption that does not fit all algorithms.
  •  A CPU maximizes single-threaded performance by increasing the raw clock speed. This results in hotter transistors, more current leakage and higher manufacturing cost.
  • The metric used for a conventional CPU is raw clock speed. If performance is measured with metrics like GFLOPS (Giga Floating-point Operations Per Second) per dollar or power usage in watts, the results are not impressive. For example, a Tesla GPU is eight times more powerful than an Intel Xeon processor in terms of GFLOPS, but they cost more or less the same.
  • In a CPU most of the chip area is devoted to supporting speculative execution. The quad-core Core i7 processor, based on Intel's Nehalem microarchitecture, is fabricated using 45 nm technology; only 9% of its chip area is occupied by integer and floating-point execution units, with the remaining area devoted to the DRAM controller, L3 cache, etc. In a GPU, by contrast, most of the chip area is devoted to execution units.
  • A GPU can never replace the CPU. Code that exhibits parallelism can be ported to the GPU and executed there efficiently.

Intel's multicore chips like the Atom, Core 2, Core i7 (Nehalem architecture) and Xeon W5590 (quad core, also based on the Nehalem architecture) are all optimized for speculative execution. [2]

Programming GPU

General-purpose GPUs (GPGPU) are graphics-optimized GPUs commissioned to perform non-graphics processing. One needs in-depth knowledge of GPU hardware and software skills to run an algorithm on a GPGPU, a feat beyond typical software programmers. To enable programmers to exploit the power of the GPU, NVIDIA developed the CUDA (Compute Unified Device Architecture) toolkit, which helps software developers focus on their algorithm rather than spend their valuable time mapping the algorithm to hardware, thus improving productivity. CUDA is available for the C and Fortran programming languages. The next-generation CUDA architecture (code named Fermi) supports languages like C, C++, Fortran, Java, MATLAB and Python. The CUDA toolkit is taught in more than 200 colleges throughout the world, and NVIDIA says it has sold more than 100 million CUDA-capable chips.

A GPU has several hundred cores; for example, the NVIDIA Tesla C2070 has 448 cores. An algorithm to be executed on a GPU is partitioned into "host code" and "device code". The host code has one thread that persists throughout the execution of the algorithm. Wherever many operations are performed in parallel, that portion is marked as device code by the programmer. When this region is executed, multiple threads are created (the technical term is forked) and the GPU cores execute the code chunk; after completion, the threads are destroyed automatically. In GPU literature the function executed by these threads is called a kernel, and terms like thread block, warp (32 parallel threads scheduled together) and grid are used (refer to page 19 of [2]). A minimal sketch of this host/device split is given below.
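
The following CUDA C sketch illustrates the host-code/device-code split under simple assumptions: the frame size, gain value and kernel name are illustrative, and error checking is omitted. Each GPU thread brightens one pixel; the host allocates device memory, copies the data, launches a grid of thread blocks and copies the result back.

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Device code: each thread processes one pixel. */
    __global__ void brighten(unsigned char *pixels, int n, float gain)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
        if (i < n) {
            float v = pixels[i] * gain;
            pixels[i] = (v > 255.0f) ? 255 : (unsigned char)v;
        }
    }

    /* Host code: set up data, launch the kernel, collect the result. */
    int main(void)
    {
        const int n = 1920 * 1080;                 /* one HD frame (illustrative) */
        unsigned char *h = (unsigned char *)malloc(n);
        for (int i = 0; i < n; i++) h[i] = i % 256;

        unsigned char *d;
        cudaMalloc((void **)&d, n);
        cudaMemcpy(d, h, n, cudaMemcpyHostToDevice);

        int threads = 256;                         /* threads per block */
        int blocks  = (n + threads - 1) / threads; /* blocks in the grid */
        brighten<<<blocks, threads>>>(d, n, 1.2f);
        cudaDeviceSynchronize();

        cudaMemcpy(h, d, n, cudaMemcpyDeviceToHost);
        printf("first pixel after brightening: %d\n", h[0]);

        cudaFree(d);
        free(h);
        return 0;
    }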

NVCC is the CUDA C compiler developed by NVIDIA, and the Portland Group (PGI) has developed a CUDA-based Fortran compiler. GPU programming is not confined to the CUDA toolkit alone; other software developers have come out with their own packages to handle GPUs. OpenCL is an open GPU programming standard developed by the Khronos Group, the same group that develops OpenGL. DirectCompute is a Microsoft product. The HMPP (Hybrid Multicore Parallel Programming) workbench was developed by the France-based CAPS enterprise.

Image Processing Case Study

In a CT scan, a series of X-ray projections is taken around the human body, and 3D images are then reconstructed from the two-dimensional X-ray images. Reconstruction is highly computationally intensive, so GPUs are an obvious choice for the computation. It is reported that NVIDIA's GeForce 8800 GPU processes 625 projections, each with a size of 1024 x 768, to produce a 512 x 512 x 340 reconstructed volume in 42 seconds; if a medical-grade GPU is used, this can be reduced to 12 seconds. I presume 512 x 512 x 340 means 340 slices with a dimension of 512 x 512 pixels. A medical-grade GPU should have 32-bit precision end to end to produce accurate results. [3]

 Source

[1] White paper on “Easy Start to GPU Programming” by Fujitsu Incorporation, http://globalsp.ts.fujitsu.com/dmsp/Publications/public/wp-Easy-Start-to-GPU-Programming.pdf, (PDF, 280KB).
[2]  A white paper on “NVIDIA’s Fermi: The First Complete GPU Computing Architecture”  by Peter N. Glaskowsky,  http://cs.nyu.edu/courses/spring12/CSCI-GA.3033-012/Fermi-The_First_Complete_GPU_Architecture.pdf, (PDF, 1589 KB).
[3] White Paper on “Current and next-generation GPUs for accelerating CT reconstruction: quality, performance, and tuning” http://www.barco.com/en/products-solutions/~/media/Downloads/White papers/2007/Current and next-generation GPUs for accelerating CT reconstruction quality performance and tuning.pdf . 122KB
[4] http://www.jonpeddie.com/publications/market_watch/
[5] http://www.techspot.com/news/49946-discrete-gpu-shipments-down-in-q2-amd-regains-market-share.html