Wednesday, 28 November 2012

Multicore Software Technologies


Powerful multicore processors have arrived in the market, but programmers with enough hardware knowledge to harness their full potential are in short supply. The obvious solution is to produce a good number of programmers with sufficient knowledge of multicore architecture. Another solution is to create software that converts code meant for a single processor into multicore-compatible code. In this article the second method is discussed in detail; the reasons why the first method is not feasible are left to the reader as an exercise.

It is a well-known fact that user friendliness and code efficiency do not go hand in hand. For example, a factorial program written in assembly language will produce the smallest executable or binary file (.exe in Windows). The same algorithm implemented in higher-level languages (Fortran, MATLAB) will produce a much larger file, while C can be expected to give a moderate binary size. Writing a program in assembly language is tedious, whereas in MATLAB it is easy.



The graph above shows the relationship between the effort required to write a program and the computational speedup achieved by that programming language; it captures the essence of the previous paragraph. Multicore software tools like Cilk++, Titanium and VSIPL++ offer user friendliness and at the same time are able to produce efficient applications. Is it not a "have your cake and eat it too" situation? Let us hope it will not take much time to reach the coveted 'green ellipse' position.

OpenMP (Open Multi-Processing) is an open standard supported by major computer manufacturers. Code written in Fortran and C/C++ can be converted into code compatible with multicore processors. Multicore compilers are available for Windows, Linux and Apple Mac operating systems. The advantages of OpenMP are that it is easy to learn and that it is compatible with different multicore architectures. Software tools like Unified Parallel C, Sequoia, Co-array Fortran, Titanium, Cilk++, pMatlab and Star-P are available as alternatives to OpenMP. CUDA, Brook+ and OpenGL are available to cater to Graphics Processing Unit (GPU) based systems.

Name                          Developed by                  Keywords      Language extension

Unified Parallel C (UPC)      UPC Consortium                shared        C
Sequoia                       Stanford University           inner, leaf   C
Co-Array Fortran              ---                           ---           Fortran
Titanium                      ---                           ---           Java
pMatlab                       MIT Lincoln Laboratory        ddense, *p    MATLAB
Star-P                        Interactive Supercomputing    ---           MATLAB, Python
Parallel Computing Toolbox    MathWorks Inc.                spmd, end     MATLAB

--- Data not available

A multicore-compatible code is developed in the following way. The code is written in a high-level language and compiled; this helps to rectify any errors in the program. Next, the code is analysed, and wherever parallelism is present that section of code is marked with a special keyword (see the table above) by the programmer. It is then compiled again with the multicore software tool, which automatically inserts the code needed to take care of memory management and data movement. In a multicore environment, the operating system creates a master thread at the time of execution. The master thread takes care of the execution, and wherever a special keyword is spotted, threads are forked (i.e. created) and handed to separate cores. After completion of the job, the threads are terminated.
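To make the fork/join idea concrete, here is a minimal sketch of my own (not taken from [1]) showing a serial region followed by a parallel region, assuming an OpenMP-capable C compiler such as gcc with the -fopenmp flag:

  /* Minimal fork/join sketch: the master thread runs alone until the
     marked region, where the runtime forks a team of threads. */
  #include <stdio.h>
  #include <omp.h>

  int main(void)
  {
      printf("Master thread starts alone.\n");

      #pragma omp parallel          /* threads are forked here        */
      {
          printf("Hello from thread %d of %d\n",
                 omp_get_thread_num(), omp_get_num_threads());
      }                             /* threads join (terminate) here  */

      printf("Master thread continues alone.\n");
      return 0;
  }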

  1.  #pragma omp parallel for \
  2.      private(n) \
  3.      shared(B, y, v)
  4.  for (n = 0; n < K; n++)
  5.      y[n] = B * cos(v * n);

In this paragraph, the five-step sample code presented above is explained. The fragment has syntax similar to C. Steps 4 and 5 alone are sufficient to generate a cosine signal; the first three lines exist only to parallelize steps 4 and 5. In step 1, '#pragma' is a pre-processor directive in C/C++, 'omp' stands for OpenMP, and 'parallel for' states that the for loop is going to be parallelized. In step 3, 'shared' lists the variables that are placed in the global space so that all cores can access them: the amplitude 'B', the array 'y' and the frequency term 'v'. Each core maintains its own copy of the loop index 'n' in its private space (step 2), which is necessary to generate a proper cosine signal.
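For readers who want to run the fragment, a self-contained version is given below. The values chosen for K, B and v are my own placeholders, not taken from [1]; compile with something like "gcc -fopenmp cosine.c -lm".

  /* Self-contained version of the cosine example; K, B and v are
     placeholder values chosen for illustration. */
  #include <math.h>
  #include <stdio.h>

  #define K 1024                    /* number of samples (assumed)  */

  int main(void)
  {
      double B = 2.0;               /* amplitude (assumed)          */
      double v = 0.1;               /* angular increment (assumed)  */
      double y[K];
      int n;

      #pragma omp parallel for \
          private(n) \
          shared(B, y, v)
      for (n = 0; n < K; n++)
          y[n] = B * cos(v * n);

      printf("y[0] = %f, y[%d] = %f\n", y[0], K - 1, y[K - 1]);
      return 0;
  }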

A program tailored to multicore generally has two sections. One section performs the task, and the other section holds target-architecture-specific information such as the number of cores. Every software tool assumes a model; for further details read [1]. Multicore hardware architecture can be classified into two broad categories, namely homogeneous (e.g. x86 cores) and heterogeneous (e.g. GPU). Software tools are built to suit only one of these architectures.

Middleware

Gone are the days when only supercomputers had multiprocessor systems. Today even embedded systems (e.g. smartphones) use multicore hardware. An embedded system is a combination of hardware and software. Typically, any change in the hardware requires some changes in the software, and likewise any upgrade of the software requires changes in the hardware. To overcome this problem the "middleware" concept was introduced.

MIT Lincoln Laboratory developed a middleware named the Parallel Vector Library (PVL), suited for real-time embedded signal and image processing systems. Likewise, VSIPL++ (Vector Signal and Image Processing Library) was developed and is maintained by the High Performance Embedded Computing Software Initiative (HPEC-SI). VSIPL++ is suited to homogeneous architectures; for heterogeneous architectures PVTOL (Parallel Vector Tile Optimizing Library) is used. Here also, programs are compiled and then mapped to multicore systems.

Source

1. Hahn Kim and R. Bond, "Multicore Software Technologies", IEEE Signal Processing Magazine, Vol. 26, No. 6, 2009, pp. 80-89. http://hdl.handle.net/1721.1/52617 (PDF, 629 KB)

2. Greg Slabaugh, Richard Boyes and Xiaoyun Yang, "Multicore Image Processing with OpenMP", IEEE Signal Processing Magazine, March 2010, pp. 134-138. Also available at http://www.soi.city.ac.uk/~sbbh653/publications/OpenMP_SPM.pdf (PDF, 1160 KB)

Tuesday, 30 October 2012

Multicore Processors and Image Processing - II (GPU)



       The evolution of the GPU had a humble origin. In the earlier days, multichip 3D rendering engines were developed and used as add-on graphics accelerator cards in personal computers. Slowly all the functionality of the engine was fused into one single chip. Processing power steadily increased, and in 2001 it blossomed into the Graphics Processing Unit (GPU). NVIDIA, a pioneer in GPUs, introduced the GeForce 3 in 2001. Then came the GeForce 7800 model, and in 2006 the GeForce 8800 became available in the market. Present-day GPUs are capable of performing 3D graphics operations like transforms, lighting, rasterization, texturing, depth testing and display.

        NVIDIA has introduced the Fermi GPU in its GPGPU fold. It consists of multiple streaming multiprocessors (SMs), supported by cache, a host interface, the GigaThread scheduler and DRAM interfaces. Each SM consists of 32 cores, each of which can execute one floating-point or integer operation per clock. Each SM is further supported by 16 load/store units, four special function units, a 32K register file and 64 KB of local RAM. The Fermi GPU adheres to the IEEE 754-2008 floating-point standard, which means that it offers high-precision results, and it supports the fused multiply-add (FMA) feature [2].


GPU Market

    Most technical blogs and websites cite Jon Peddie Research as the source of their data, and I have done the same. In this blog two data sources [4, 5] are given, and as usual their figures do not match each other. In ref. [5] only the market share of discrete add-on boards is considered; if GPUs integrated with CPUs were included, Intel Inc. would also hold a share of the market.

The total number of units sold in the 2nd quarter (April to June) of 2012 was 14.76 million [5].

Company      Market share (%)
AMD               40.3
NVIDIA            39.3
MATROX             0.3
S3                 0.1

Differences between conventional CPU and GPU
  • A typical CPU banks on speculative execution mechanisms like caching and branch prediction. This speculative optimization strategy pays off for code with high data locality, an assumption that may not hold for all algorithms.
  • A CPU maximizes single-threaded performance by increasing the raw clock speed. This results in hotter transistors, more current leakage and higher manufacturing cost.
  • The metric used for a conventional CPU is raw clock speed. If performance is measured with metrics like GFLOPS (giga floating-point operations per second) per dollar or power usage in watts, the results are not impressive. For example, a Tesla GPU is eight times more powerful than an Intel Xeon processor in terms of GFLOPS, but they cost more or less the same.
  • In a CPU most of the chip area is devoted to supporting speculative execution. The Core i7 (quad-core) processor, based on Intel's Nehalem microarchitecture, is fabricated using 45 nm technology; only about 9% of its chip area is occupied by integer and floating-point execution units, with the remaining area devoted to the DRAM controller, L3 cache and so on. In a GPU, by contrast, most of the chip area is devoted to execution units.
  • A GPU can never replace the CPU. Code that exhibits parallelism can be ported to the GPU and executed there efficiently.

Intel's multicore chips, like the Atom, Core 2 and Core i7 (Nehalem architecture) processors and the Xeon W5590 (quad-core, based on the Nehalem architecture), are all optimized for speculative execution [2].

Programming GPU

General-purpose GPUs (GPGPUs) are graphics-optimized GPUs that are commissioned to perform non-graphics processing. One needs in-depth knowledge of GPU hardware and strong software skills to run an algorithm on a GPGPU, a feat beyond most typical software programmers. To enable programmers to exploit the power of the GPU, NVIDIA developed the CUDA (Compute Unified Device Architecture) toolkit, which helps software developers focus on their algorithm rather than spend their valuable time mapping the algorithm to hardware, thus improving productivity. CUDA is available for the C and Fortran programming languages. The next-generation CUDA architecture (code named Fermi) supports languages like C, C++, Fortran, Java, MATLAB and Python. The CUDA toolkit is taught in more than 200 colleges throughout the world, and NVIDIA says it has sold more than 100 million CUDA-capable chips.

A GPU has several hundred cores; the NVIDIA Tesla C2070, for example, has 448 cores. An algorithm to be executed on a GPU is partitioned into "host code" and "device code". The host code has one thread which persists throughout the execution of the algorithm. Wherever multiple operations are performed in parallel, that portion is marked as device code by the programmer. When this region is executed, multiple threads are created (the technical term is forked) and the GPU cores execute the code chunk; after completion, the threads are destroyed automatically. In GPU literature, the function that these threads execute is called a kernel, and related terms include thread block, warp (32 parallel threads scheduled together) and grid (refer to page 19 of [2]).

NVCC is the CUDA C compiler developed by NVIDIA, and the Portland Group (PGI) has developed a CUDA Fortran compiler. GPU programming is not confined to the CUDA toolkit alone; software developers have come up with their own packages to handle GPUs. OpenCL is an open GPU programming standard developed by the Khronos Group, the same group that developed OpenGL. DirectCompute is a Microsoft product, and the HMPP (Hybrid Multicore Parallel Programming) workbench was developed by the French company CAPS entreprise.

Image Processing Case Study

In a CT scan, a series of X-ray images is taken over a human body, and 3D images are then reconstructed from the two-dimensional X-ray projections. Reconstruction is highly computationally intensive, so GPUs are an obvious choice for the computation. It is reported that NVIDIA's GeForce 8800 GPU processes 625 projections, each with a size of 1024x768 pixels, to produce a 512x512x340 reconstructed volume in 42 seconds; with a medical-grade GPU this can be reduced to 12 seconds. I presume 512x512x340 means 340 slices, each 512x512 pixels. A medical-grade GPU should have 32-bit precision end-to-end to produce accurate results [3].

 Source

[1] White paper on “Easy Start to GPU Programming” by Fujitsu Incorporation, http://globalsp.ts.fujitsu.com/dmsp/Publications/public/wp-Easy-Start-to-GPU-Programming.pdf, (PDF, 280KB).
[2]  A white paper on “NVIDIA’s Fermi: The First Complete GPU Computing Architecture”  by Peter N. Glaskowsky,  http://cs.nyu.edu/courses/spring12/CSCI-GA.3033-012/Fermi-The_First_Complete_GPU_Architecture.pdf, (PDF, 1589 KB).
[3] White Paper on “Current and next-generation GPUs for accelerating CT reconstruction: quality, performance, and tuning” http://www.barco.com/en/products-solutions/~/media/Downloads/White papers/2007/Current and next-generation GPUs for accelerating CT reconstruction quality performance and tuning.pdf . 122KB
[4] http://www.jonpeddie.com/publications/market_watch/
[5] http://www.techspot.com/news/49946-discrete-gpu-shipments-down-in-q2-amd-regains-market-share.html



Sunday, 30 September 2012

Multicore Processors and Image Processing - I


         Quad-core based desktops and laptops have become the order of the day. These multicore processors, after long years of academic captivity, have come into the limelight. The study of multicore and multiprocessor systems comes under the field of High Performance Computing.
          When two or more processors are fabricated in a single package, it is called a multicore processor. Multiprocessor systems and multiprocessing (a single processor running multiple applications simultaneously) are different from multicore processors. Quad-core (four cores) and duo (two cores) are the multicore processors commonly used by the general public. With prevailing fabrication technology, the race to increase the raw clock speed beyond 3 GHz is nearing its end; further increases in computing power are possible only by deploying parallel computing concepts. This is the reason why all the manufacturers are introducing multicore processors.

           Image processing algorithms exhibit a high degree of parallelism. Most algorithms have loops, and they iterate over a pixel, a row or an image region. Consider a loop that has to iterate 200 times: a single processor executes all 200 iterations, while in a quad-core each core executes only 50 of them, so the quad-core obviously finishes the task much faster. To achieve this speedup, programs have to be slightly modified and multicore-aware compilers are required.
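As a rough illustration of my own (the loop body is an arbitrary stand-in for real per-row image work), the sketch below lets OpenMP split a 200-iteration loop across whatever cores are available; on a quad-core machine each thread ends up with roughly 50 iterations.

  /* 200 loop iterations divided among the available cores; the body is
     a dummy stand-in for real per-row image processing work. */
  #include <stdio.h>
  #include <omp.h>

  #define ROWS 200

  static int result[ROWS];

  int main(void)
  {
      #pragma omp parallel for      /* iterations are shared out here  */
      for (int row = 0; row < ROWS; row++)
          result[row] = row * row;  /* placeholder per-row computation */

      printf("Processed %d rows using up to %d threads.\n",
             ROWS, omp_get_max_threads());
      return 0;
  }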


Image size      Time to execute (ms)
                Single-core    Multi-core
256x256              18             8
512x512              65            21
1024x1024           260            75

       The amount of time needed to twist the Lenna image is given in the table above. The 1024x1024 Lenna image takes 260 ms on a single core and 75 ms on a multicore processor to perform the twist operation. From the table it is clear that execution time grows roughly fourfold each time the image dimensions double (i.e., in proportion to the pixel count) on both platforms, with the multicore version consistently about three times faster than the single-core version [1].

        Algorithms exhibit fine-grain, medium-grain or coarse-grain parallelism. Smoothing, sharpening, filtering and convolution functions operate on the entire image, and they are classified as fine-grain. Medium-grain parallelism is exhibited by the Hough transform, motion analysis and other functions that operate on part of an image. Position estimation and object recognition come under the coarse-grain class, where the parallelism exhibited is very low [2]. Algorithms can also be split into memory-bound and CPU-bound algorithms; in CPU-bound algorithms the number of cores helps to achieve linear speedup, subject to Amdahl's law [3].
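As a small example of fine-grain parallelism, the sketch below (my own, with an invented image size and contents) applies a 3x3 mean (smoothing) filter to an image and parallelizes the outer loop over rows with OpenMP.

  /* Fine-grain parallelism sketch: 3x3 mean (smoothing) filter with the
     row loop parallelized. Image size and data are made up. */
  #include <stdio.h>

  #define W 512
  #define H 512

  static float src[H][W];           /* input image (left at zero here)  */
  static float dst[H][W];           /* smoothed output image            */

  int main(void)
  {
      #pragma omp parallel for      /* each thread smooths its own rows */
      for (int y = 1; y < H - 1; y++)
          for (int x = 1; x < W - 1; x++) {
              float sum = 0.0f;
              for (int dy = -1; dy <= 1; dy++)
                  for (int dx = -1; dx <= 1; dx++)
                      sum += src[y + dy][x + dx];
              dst[y][x] = sum / 9.0f;
          }

      printf("Smoothed a %dx%d image.\n", W, H);
      return 0;
  }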

       Multicore processors developed by Intel and AMD are very popular. The contents of the following table are taken from [3].
Multicore processor        No. of cores    Clock speed
Intel Xeon E5649                12           2.53 GHz
AMD Opteron 6220                16           3.00 GHz
AMD Phenom X4 (9550)             4           2.20 GHz

        The Intel Xeon was launched in 2002. Intel introduced simultaneous multithreading capability and named it Hyper-Threading. It has also launched the Core 2 Duo T7250, a low-voltage mobile processor that runs at 2 GHz. Sony's PlayStation 3 has a very powerful multicore processor called the Cell Broadband Engine, developed jointly by Sony, Toshiba and IBM. It has a 64-bit PowerPC processor connected by a circular bus to eight RISC (Reduced Instruction Set Computing) co-processors with a 128-bit SIMD (Single Instruction, Multiple Data) architecture. SIMD is well suited to exploiting fine-grain parallelism.


In part two of the series, the GPU and the OpenMP Application Program Interface (API) will be discussed.


Source

[1] Greg Slabaugh, Richard Boyes and Xiaoyun Yang, "Multicore Image Processing with OpenMP", IEEE Signal Processing Magazine, March 2010, pp. 134-138. Also available at http://www.soi.city.ac.uk/~sbbh653/publications/OpenMP_SPM.pdf (PDF, 1160 KB)
[2] Trupti Patil, "Evaluation of Multi-core Architecture for Image Processing", MS Thesis, Graduate School of Clemson University, 2009. www.ces.clemson.edu/~stb/students/trupti_thesis.pdf (PDF, 1156 KB)

[3] OpenMP in GraphicsMagick,  http://www.graphicsmagick.org/OpenMP.html


Note
  • Please read [1]; it is insightful and the language used is simpler than that of standard technical articles.
  • The warped Lenna image and the above table were adapted from [1], not copied. GIMP software was used to create the warp effect on the Lenna image.

Saturday, 22 September 2012

RoboRetina


        The RoboRetina™ image sensor is capable of producing satisfactory images even under non-uniform illumination. Existing cameras require uniform illumination to produce satisfactory results. Photographers vary the camera's shutter speed to capture brightly illuminated or poorly lit scenes: the amount of light that falls on the image sensor (in olden days, the film) is proportional to the duration for which the shutter is open, so sunlit scenes need a short shutter-opening duration. In natural light, both bright and shadowed (poorly lit) regions are present simultaneously, but a camera can be made to capture either the bright region or the dark region, not both. Our eyes adjust to natural or non-uniform light with such ease that the feat is barely noticed, until an article like this is read.



A surveillance camera system that monitors an airport will fail to detect persons lurking in the shadows because of non-uniform illumination. Intrigue Technologies Incorporated, headquartered in Pittsburgh, Pennsylvania, USA, has come out with the RoboRetina™ image sensor, which tries to mimic the human eye. They have developed a prototype with a resolution of 320x240 that is capable of seeing things in the shadows, built using a standard CMOS fabrication process. The 320x240 resolution is sufficient for a robot mounted with RoboRetina to navigate in cloudy weather. The brightness-adaptation operation is carried out without traditional number crunching, a feature that will amuse us all, conditioned as we are to think only of digital processing. An array of photoreceptors is called an image sensor; in the prototype, each photoreceptor is paired with an analog circuit that is stimulated by the incoming light and controls the functioning of the photoreceptor.

Silicon-based integrated chips that try to mimic the working of the eye are called neuromorphic vision sensors. The term 'neuromorphic engineering' was coined by Carver Mead in the mid-1980s, when he was working at the California Institute of Technology, Pasadena, USA. Analog-circuit-based vision chips were developed by the University of Pennsylvania in Philadelphia, USA and Johns Hopkins University in Baltimore, USA. The analog circuit present in such a chip varies the sensitivity of the detector depending on the light falling on it. This is the concept extended in RoboRetina: here the light falling on the surrounding detectors also plays a vital role in adjusting the sensitivity of a photodetector. The success of RoboRetina depends on accurate estimation of the illumination field.

Around 2005, Intrigue's artificial eye was already available as an Adobe Photoshop plug-in, aptly named 'Shadow Illuminator'. Medical X-ray images were given as input, and the software was able to reveal unclear portions of the image. Photographers use this software to correct their photos, which is technically called 'enhancement'. Software that does not use the RoboRetina technique produces a "halo" effect at sharp discontinuities.

        The CEO of Intrigue, Vladimir Brajovic, is an alumnus of The Robotics Institute at Carnegie Mellon University, Pittsburgh, USA. RoboRetina received the Frost & Sullivan Technology Innovation Award for the year 2006. Frost & Sullivan [4] is a growth consulting company with more than 1,000 clients all over the world, so this award is a feather in the cap for Intrigue Technologies. RoboRetina is the first breakthrough since the emergence of the neuromorphic sensor concept. Let us hope it will lead to an autonomous vision system revolution that will greatly enhance the performance of automotive systems, surveillance systems and unmanned systems.


Source: 

[1] Intrigue Technologies, The Vision Sensor Company, Press Release, http://www.intriguetek.com/PR_020407.htm
[2] "Robotic Vision Gets Sharper" by Prachi Patel-Predd, IEEE Spectrum, March 2005, http://spectrum.ieee.org/biomedical/imaging/robotic-vision-gets-sharper
[3] Photo Courtesy: https://www.intrigueplugins.com/moreInfo.php?pID=CBB
[4] Frost & Sullivan, http://www.frost.com

Thanks:
       I want to personally thank Mr. B. Shrinath for emailing me the 'RoboRetina' article that was published in Spectrum Online.

Wednesday, 12 September 2012

Biscuit Inspection Systems

         We all know that if a company wants to stay in business, it has to manufacture quality products. Visual inspection of products is a well-known method of checking quality. In earlier days trained human beings were used for inspection; nowadays machine vision systems are employed, for many reasons. A machine can work throughout the day and night without a sign of fatigue, it can surpass human inspection speed, and it can use a wide-dynamic-range camera that helps to differentiate even a small change in colour. With human inspection it is not cost effective to check every manufactured product: a few samples from a batch are taken, quality testing is carried out on the samples, and statistical methods are employed to estimate the number of failed products from the failed samples. With online machine inspection, every product can be checked individually [1].

Consumers expect high-quality biscuits to have consistent size, shape, colour and flavour. Size and shape improve the aesthetics of a biscuit, while colour and flavour play a role in its taste. An electronic nose can be used to detect flavour; the use of electronic noses in the tea-processing industry has been reported in scientific papers, and articles on their use in biscuit manufacturing are to be found on the Internet. It is common knowledge that an over-baked biscuit will be dark brown in colour and an under-baked one will be light brown; this relationship is technically called the 'baking curve'. Image processing techniques are used to find the shade of biscuit colour, and classification is carried out by artificial neural networks, a method developed back in 1995 by Leonard G. C. Hamey [8]. In a typical cream-sandwich biscuit, the top and bottom layers are biscuits and the middle layer is a filling like cream or chocolate. The filling costs more than the biscuit, and over-filling means less profit for the company, so much care is taken to maintain the correct biscuit size and filling height.
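Purely to illustrate the idea of the baking curve (and not the neural-network method used in the actual systems), a biscuit's bake could be crudely judged from its average brightness; the thresholds and data below are invented.

  /* Toy baking-curve check: classify a biscuit from the mean grey level
     of its pixels. Thresholds and data are invented; real systems use
     trained classifiers such as neural networks. */
  #include <stdio.h>

  enum bake_class { UNDER_BAKED, WELL_BAKED, OVER_BAKED };

  static enum bake_class classify(const unsigned char *pixels, int count)
  {
      long sum = 0;
      for (int i = 0; i < count; i++)
          sum += pixels[i];
      int mean = (int)(sum / count);        /* average grey level 0..255 */

      if (mean > 180) return UNDER_BAKED;   /* too light                 */
      if (mean < 90)  return OVER_BAKED;    /* too dark                  */
      return WELL_BAKED;
  }

  int main(void)
  {
      unsigned char sample[4] = { 120, 130, 125, 128 };  /* dummy pixels */
      printf("Baking class: %d\n", classify(sample, 4));
      return 0;
  }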


In a typical production line, 30 rows of baked biscuits (120 biscuits per row) pass on a conveyor every minute, and all of them have to be inspected; this amounts to checking 3,600 biscuits per minute. Length, width and thickness are measured with an accuracy of ±0.17 mm. In addition, checks for cracks and splits are carried out, and any biscuit that fails to qualify is discarded.

In a typical biscuit inspection system, three cameras grab images of the moving biscuits, which are illuminated by special fluorescent lights, and the grabbed images are processed to obtain size and shape. A fourth camera, mounted at a 45-degree angle, captures the laser line projected onto the biscuits; when multiple laser-line images are combined, they give the 3D shape of the biscuit [2, 4]. For sample inspection pictures go to [7] and download the PDF file. The cameras are required to operate at an ambient temperature of 45 degrees Celsius. The captured images are transferred via GigE to the inspection room, which is 100 m away from the baking line. Special software displays the captured images on the computer screen with the necessary controls, and the images are stored for four years.

                                   
List of vision system manufacturers
o Machine Vision Technology in the United Kingdom [2]
o Hamey Vision Private Limited in Australia [4]
o Q-Bake from EyePro systems [6]

In India, way back in 2002, CEERI (Central Electronics Engineering Research Institute), present in the CSIR Madras complex, developed a biscuit inspection system with a budget of Rs. 20.7 lakh (1 lakh = 100,000, and Rs. 50 is approximately 1 US$). It got funding from the Department of Science and Technology, Government of India, and partnered with Britannia Industries, Chennai, to gather requirements [5].

Source
3. Biscuit Bake Colour Inspection System - Food Colour Inspection, http://www.hameyvision.com.au/biscuit-colour-inspection.html
4. Simac Masic,  http://www.simac.com 
5. CMC News, July – December, 2002, http://www.csirmadrascomplex.gov.in/jd02.pdf 
6. Q-Bake, Inspection Machine for Baked Goods, http://www.eyeprosystem.com/q-bake/index.html 
8. Pre-processing Colour Images with a Self-Organising Map:Baking Curve Identification and Bake Image Segmentation,  http://www.hameyvision.com.au/hamey-icpr98-som-baking.pdf

Courtesy
I want to thank Dr. A. Gopal, Senior Principal Scientist, CEERI, CSIR Madras Complex, Chennai, who gave a lecture on biscuit inspection systems at the national-level workshop on "Embedded Processors and Applications" held at SRM University, Chennai, on 31 August 2012. He inspired me to write this article.