Tuesday 30 October 2012

Multicore Processors and Image Processing - II (GPU)



       The GPU has humble origins. In the early days, multi-chip 3D rendering engines were developed and sold as add-on graphics accelerator cards for personal computers. Gradually all the functionality of such an engine was fused into a single chip. Processing power grew steadily, and in 2001 these chips blossomed into the Graphics Processing Unit (GPU). NVIDIA, a pioneer in GPUs, introduced the GeForce 3 in 2001; the GeForce 7800 followed, and by 2006 the GeForce 8800 was on the market. Present-day GPUs can perform the full set of 3D graphics operations: transforms, lighting, rasterization, texturing, depth testing and display.

        NVIDIA has introduced the Fermi GPU in its GPGPU fold. It consists of multiple streaming multiprocessors (SMs) supported by a cache, a host interface, the GigaThread scheduler and DRAM interfaces. Each SM contains 32 cores, each of which can execute one floating-point or integer operation per clock. Each SM is further equipped with 16 load/store units, four special function units, a 32K-entry register file and 64 KB of local RAM. Fermi adheres to the IEEE 754-2008 floating-point standard, which means it offers high-precision results, and it supports the fused multiply-add (FMA) feature, which performs a multiply and an add with a single rounding step. [2]


GPU Market

    Most technical blogs and websites cite Jon Peddie Research as the source of their data, and I have done the same. Two data sources [4, 5] are given in this blog, and as usual their figures do not match each other. Ref. [5] considers market share based on discrete add-on boards alone; if integrated graphics were counted as well, Intel would also hold a share of the market.

A total of 14.76 million units were sold in the 2nd quarter (April to June) of 2012 [5].

Company           Market share (%)
AMD                       40.3
NVIDIA                    39.3
MATROX                     0.3
S3                         0.1

Differences between conventional CPU and GPU
  •  A typical CPU banks on speculative techniques such as caching and branch prediction. This speculative strategy pays off for code with high data locality, an assumption that does not hold for every algorithm.
  •  A CPU maximizes single-threaded performance by raising the raw clock speed. This results in hotter transistors, more current leakage and higher manufacturing cost.
  • The usual metric for a conventional CPU is raw clock speed. Measured instead by GFLOPS (giga floating-point operations per second) per dollar, or by power usage in watts, the results are not impressive. For example, a Tesla GPU is roughly eight times more powerful than an Intel Xeon processor in terms of GFLOPS, yet the two cost more or less the same.
  • In a CPU, most of the chip area is devoted to supporting speculative execution. The quad-core Core i7, based on Intel's Nehalem microarchitecture, is fabricated in a 45 nm process; only about 9% of its chip area is occupied by integer and floating-point execution units, with the rest given over to the DRAM controller, L3 cache and so on. In a GPU, by contrast, most of the chip area is devoted to execution units.
  • A GPU can never replace the CPU. Code that exhibits parallelism can be ported to the GPU and executed there efficiently.

Intel's multi-core chips, such as the Atom, Core 2, Core i7 (Nehalem architecture) and the quad-core Xeon W5590 (also Nehalem-based), are all optimized for speculative execution. [2]

Programming GPU

General-purpose GPUs (GPGPUs) are graphics-optimized GPUs pressed into service for non-graphics processing. Running an algorithm on a GPGPU demands in-depth knowledge of the GPU hardware along with considerable software skill, a feat beyond the typical programmer. To let programmers exploit the power of the GPU, NVIDIA developed the CUDA (Compute Unified Device Architecture) toolkit, which helps developers focus on their algorithm rather than spend valuable time mapping it onto the hardware, thereby improving productivity. CUDA is available for the C and Fortran programming languages. The next-generation CUDA architecture (code-named Fermi) supports C, C++, Fortran, Java, MATLAB and Python. The CUDA toolkit is taught in more than 200 colleges throughout the world, and NVIDIA says it has sold more than 100 million CUDA-capable chips.

A GPU has several hundred cores; the NVIDIA Tesla C2070, for example, has 448. An algorithm to be executed on a GPU is partitioned into "host code" and "device code". The host code runs as a single thread that persists throughout the execution of the algorithm. Wherever many operations can be performed at once, the programmer marks that portion as device code. When this region executes, multiple threads are created (the technical term is forked) and the GPU cores execute the code chunk; on completion the threads are destroyed automatically. In GPU literature the function that runs on the device is called a kernel, and each parallel instance of it runs as a thread. Threads are grouped into thread blocks and warps (32 threads executed together), and blocks are arranged in a grid (refer to page 19 of [2]).
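The host/device split described above can be sketched with the classic CUDA C vector addition. This is a minimal illustration under my own naming (vecAdd, the array names and sizes are all invented for the example), not code from any of the cited papers.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Device code: the kernel. Many threads run it in parallel,
   each computing a single element of the result. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    /* Host code: the single persistent CPU thread prepares the data. */
    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    /* Fork a grid of thread blocks: 256 threads per block is 8 warps
       of 32. The threads vanish when the kernel finishes. */
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %.1f\n", hc[0]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```

Compiled with nvcc, the __global__ qualifier and the <<<blocks, threads>>> launch syntax are what separate the device code from the ordinary C host code around it.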

NVCC is the CUDA C compiler developed by NVIDIA, and the Portland Group (PGI) has developed a CUDA Fortran compiler. GPU programming is not confined to the CUDA toolkit alone; other developers have come out with their own packages to handle GPUs. OpenCL is an open GPU programming standard developed by the Khronos Group, the same group that developed OpenGL. DirectCompute is a Microsoft product. The HMPP (Hybrid Multicore Parallel Programming) workbench was developed by the French company CAPS entreprise.

Image Processing Case Study

In a CT scan, a series of X-ray projections is taken around a human body, and 3D images are then reconstructed from these two-dimensional X-ray images. Reconstruction is highly compute-intensive, so GPUs are an obvious fit. It is reported that NVIDIA's GeForce 8800 GPU processes 625 projections, each 1024 x 768 pixels in size, into a reconstructed volume of 512x512x340 in 42 seconds; with a medical-grade GPU this can be reduced to 12 seconds. I presume 512x512x340 means 340 slices of 512x512 pixels each. A medical-grade GPU should maintain 32-bit precision end to end to produce accurate results. [3]

Sources

[1] White paper, "Easy Start to GPU Programming", Fujitsu, http://globalsp.ts.fujitsu.com/dmsp/Publications/public/wp-Easy-Start-to-GPU-Programming.pdf (PDF, 280 KB).
[2] White paper, "NVIDIA's Fermi: The First Complete GPU Computing Architecture", Peter N. Glaskowsky, http://cs.nyu.edu/courses/spring12/CSCI-GA.3033-012/Fermi-The_First_Complete_GPU_Architecture.pdf (PDF, 1589 KB).
[3] White paper, "Current and next-generation GPUs for accelerating CT reconstruction: quality, performance, and tuning", http://www.barco.com/en/products-solutions/~/media/Downloads/White papers/2007/Current and next-generation GPUs for accelerating CT reconstruction quality performance and tuning.pdf (PDF, 122 KB).
[4] http://www.jonpeddie.com/publications/market_watch/
[5] http://www.techspot.com/news/49946-discrete-gpu-shipments-down-in-q2-amd-regains-market-share.html