Powerful multicore processors have arrived in the market, but programmers who know the hardware well enough to harness its full potential are in short supply. The obvious solution is to train a good number of programmers with sufficient knowledge of multicore architecture. Another solution is to create software that converts code meant for a single processor into multicore-compatible code. This article discusses the second method in detail. The reasons why the first method is not feasible are left to the reader as an exercise.
It is well known that user friendliness and code efficiency rarely go hand in hand. For example, a factorial program written in assembly language will typically produce the smallest executable or binary file (.exe on Windows). The same algorithm implemented in a higher-level language (Fortran, MATLAB) will typically produce a larger file, and a C implementation usually falls somewhere in between. On the other hand, writing a program in assembly language is tedious, while writing it in MATLAB is easy.
The graph above shows the relationship between the effort required to write a program and the computational speedup achieved with that programming language. It captures the essence of the previous paragraph. Multicore software tools like Cilk++, Titanium and VSIPL++ are user friendly and, at the same time, able to produce efficient applications. Is this not a "have your cake and eat it too" situation? Let us hope it will not take long to reach the coveted 'Green ellipse' position.
OpenMP (Open Multi-Processing) is an open standard supported by major computer manufacturers. Code written in Fortran or C/C++ can be converted into code suited to multicore processors. Compilers that support OpenMP are available for Windows, Linux and Apple Mac operating systems. The advantages of OpenMP are that it is easy to learn and that it is compatible with different multicore architectures. Software tools like Unified Parallel C, Sequoia, Co-Array Fortran, Titanium, Cilk++, pMatlab and Star-P are available as alternatives to OpenMP. CUDA, BROOK+ and OpenGL cater to Graphics Processing Unit (GPU) based systems.
Name                         Developed by                 Keywords       Language extended
Unified Parallel C (UPC)     UPC Consortium               shared         C
Sequoia                      Stanford University          inner, leaf    C
Co-Array Fortran             ---                          ---            Fortran
Titanium                     ---                          ---            Java
pMatlab                      MIT Lincoln Laboratory       ddense, *p     MATLAB
Star-P                       Interactive Supercomputing   ---            MATLAB, Python
Parallel Computing Toolbox   MathWorks Inc.               spmd, end      MATLAB

--- : data not available
Multicore-compatible code is developed in the following way. The code is first written in a high-level language and compiled, which helps to catch any errors in the program. Next, the code is analyzed, and wherever it exhibits parallelism the programmer marks that section with a special keyword (see the table above). It is then compiled again with the multicore software tool, which automatically inserts the code needed to take care of memory management and data movement. In a multicore environment, the operating system creates a master thread at the start of execution. The master thread then carries out the execution, and wherever a special keyword is encountered, threads are forked (i.e. created) and assigned to separate cores. After the job is completed, these threads are terminated.
    #pragma omp parallel for \
        private(n) \
        shared(B, y, v)
    for (n = 0; n < K; n++)        /* n must be declared before the pragma */
        y[n] = B*cos(v*n);         /* cos() needs <math.h> */
This paragraph explains the short sample code presented above, whose syntax is that of C. Lines 4 and 5, the for loop and its body, are sufficient on their own to generate a cosine signal. The first three lines are there to parallelize that loop. In line 1, ‘#pragma’ is a pre-processor directive in C/C++, ‘omp’ marks the directive as belonging to OpenMP, and ‘parallel for’ states that the for loop that follows is to be parallelized. In line 2, ‘private’ lists the variables of which each thread keeps its own copy: each core must maintain its own ‘n’ in its private space to generate the proper cosine signal. In line 3, ‘shared’ lists the variables that are placed in global space, where all the cores can access them: the amplitude ‘B’, the array ‘y’ and the variable ‘v’.
A program tailored to multicore generally has two sections. One section performs the task, and the other holds information specific to the target architecture, such as the number of cores. Every software tool assumes a programming model; for further details read [1]. Multicore hardware architectures can be classified into two broad categories: homogeneous (e.g. x86 cores) and heterogeneous (e.g. GPU). A software tool is generally built to suit only one of these architectures.
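The two-section structure can be sketched in C with OpenMP as follows. This is only an illustration under stated assumptions: the struct target_info and the function scale are hypothetical names, not part of any tool mentioned in this article. The num_threads clause is standard OpenMP.

```c
#include <assert.h>
#include <stddef.h>

/* Section 1: architecture-specific information (hypothetical struct).
 * Only this part changes when the program moves to another machine. */
struct target_info {
    int cores;                  /* number of cores to use */
};

/* Section 2: the task itself, written without a hard-coded core count.
 * Scales every element of x[0..len-1] by the factor a. */
void scale(double *x, int len, double a, const struct target_info *t)
{
    int i;

    assert(x != NULL && t != NULL && t->cores > 0);

    /* num_threads() requests t->cores threads for this loop;
     * without OpenMP the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for num_threads(t->cores)
    for (i = 0; i < len; i++)
        x[i] *= a;
}
```

Moving the program from a quad-core to a dual-core machine then means changing the value stored in target_info, while scale itself stays untouched.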
Middleware
Gone are the days when only supercomputers had multiprocessor systems. Today even embedded systems (e.g. smartphones) use multicore hardware. An embedded system is a combination of hardware and software. Typically, any change in the hardware requires some changes in the software. Likewise, any upgrade of the software may require changes in the hardware. To overcome this problem, the concept of “middleware” was introduced.
MIT Lincoln Laboratory developed a middleware named Parallel Vector Library (PVL), suited to real-time embedded signal and image processing systems. Likewise, VSIPL++ (Vector Signal and Image Processing Library) was developed and is maintained by the High Performance Embedded Computing Software Initiative (HPEC-SI). VSIPL++ is suited to homogeneous architectures. For heterogeneous architectures, PVTOL (Parallel Vector Tile Optimizing Library) is used. Here too, programs are compiled and then mapped to multicore systems.
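The idea behind middleware can be sketched in a few lines of C. The sketch below is a toy illustration only: every name in it (struct middleware, middleware_for, add_signals) is hypothetical, and it does not reproduce the API of PVL, VSIPL++ or PVTOL. It shows an application calling one fixed interface while the hardware-dependent choice is made in a single place.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical middleware interface: the application only sees this type. */
typedef void (*vadd_fn)(const float *a, const float *b, float *out, int n);

struct middleware {
    vadd_fn vadd;               /* element-wise vector addition */
};

/* Backend 1: plain serial loop, e.g. for a single-core target. */
static void vadd_serial(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Backend 2: the same operation spread across cores with OpenMP
 * (the pragma is ignored if the compiler has no OpenMP support). */
static void vadd_multicore(const float *a, const float *b, float *out, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Choosing a backend is the only hardware-dependent step. */
struct middleware middleware_for(int cores)
{
    struct middleware mw;

    assert(cores > 0);
    mw.vadd = (cores > 1) ? vadd_multicore : vadd_serial;
    return mw;
}

/* Application code: unchanged whichever backend was selected. */
void add_signals(const struct middleware *mw,
                 const float *a, const float *b, float *out, int n)
{
    mw->vadd(a, b, out, n);
}
```

When the hardware changes, only middleware_for needs to know about it; add_signals and everything built on top of it are left alone.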
Source
1. Hahn Kim and R. Bond, “Multicore software technologies”, IEEE Signal Processing Magazine, Vol. 26, No. 6, 2009, pp. 80-89. http://hdl.handle.net/1721.1/52617 (PDF, 629 KB)
2. Greg Slabaugh, Richard Boyes and Xiaoyun Yang, “Multicore Image Processing with OpenMP”, IEEE Signal Processing Magazine, March 2010, pp. 134-138. http://www.soi.city.ac.uk/~sbbh653/publications/OpenMP_SPM.pdf (PDF, 1160 KB)