I like C++, and after I switched from Fortran to C++ at the beginning of the nineties, I wanted to do everything in C++, including linear algebra. My first linear algebra library in C++ was GNUSSL by Robert D. Pierce. It was a nice template library that helped me a lot to understand how one can use templates creatively. Yet, at some point I decided to benchmark it against the Fortran libraries. At that time I used the code from Numerical Methods and Software a lot, as well as IMSL. While developing my open-source TDLIB, however, I wanted to base it on free libraries, so I decided to try LAPACK. The test showed that LAPACK is much faster than GNUSSL, and I have used LAPACK ever since. I should mention that LAPACK was faster not because it is written in Fortran but because it uses better algorithms. It may well be that nowadays one can find a linear algebra library written completely in C++ with the same performance (for example, I have heard an announcement about Eigen, http://eigen.tuxfamily.org/, but I have not tried it yet). The goal of this section is to show how one can interface LAPACK from within C++ as well as to demonstrate which technologies behind LAPACK are a must for any linear algebra library.
The LAPACK Users' Guide (online version http://www.netlib.org/lapack/lug/) describes the available functions. There are a lot of them, as LAPACK has functions for different types of matrices, and you can speed up your code if you employ a function that fits your particular matrix. The procedure for using LAPACK is as follows. You search for the functions you need, then you open the Fortran code, where you find a description of the function arguments, for example for DGETRF. What is left is to write a declaration and just use the function from C++, as has been shown in the previous section Using decomp and solve from C++. Two basic examples for DPOTRI could be found here. Somewhat better examples can be found in TDLIB, where there are inlined wrapper functions to simplify the use of LAPACK functions in the C++ code (see the ex subdirectory for examples of how to use the header).
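Such a declaration might look as follows. This is a sketch: the trailing underscore follows the gfortran naming convention, all arguments are passed by pointer as Fortran expects, and the matrices must be stored column-wise. The `idx` helper is not part of LAPACK; I add it only to show how a column-major element is addressed.

```cpp
#include <cassert>
#include <cstddef>

// Fortran routines take every argument by pointer and expect
// column-major (column after column) matrix storage.
extern "C" {
// LU factorization of a general m-by-n double matrix (LAPACK DGETRF)
void dgetrf_(const int* m, const int* n, double* a, const int* lda,
             int* ipiv, int* info);
// back substitution with the factors from dgetrf (LAPACK DGETRS);
// the hidden string-length argument for trans is omitted here,
// which works with the common calling conventions
void dgetrs_(const char* trans, const int* n, const int* nrhs,
             const double* a, const int* lda, const int* ipiv,
             double* b, const int* ldb, int* info);
}

// helper (not LAPACK): element (i, j) of a column-major matrix
// with leading dimension lda
inline std::size_t idx(std::size_t i, std::size_t j, std::size_t lda)
{
    return i + j * lda;
}
```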
Below I consider DGETRF for the LU decomposition of a general double precision matrix and DGETRS for the back substitution. To understand the technology, I will also interface DGETF2, the level 2 BLAS version of DGETRF. DGETF2 is similar to decomp, while DGETRF uses the block algorithm, and hence we can compare what the block algorithm brings to the LU decomposition performance in LAPACK.
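To make explicit what these routines compute, below is a pure C++ sketch of the unblocked LU decomposition with partial pivoting that DGETF2 implements, together with the substitution that DGETRS performs. The function names are mine, the pivot indices are kept 0-based (LAPACK's ipiv is 1-based), and this is for illustration only; in real code LAPACK should be called instead.

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Unblocked LU decomposition with partial pivoting of an n-by-n
// column-major matrix a (leading dimension n), in the spirit of dgetf2.
// On exit a holds L (unit diagonal not stored) and U; piv[k] is the
// row swapped with row k at step k (0-based, unlike LAPACK's ipiv).
void lu_factor(std::vector<double>& a, std::vector<int>& piv, int n)
{
    for (int k = 0; k < n; ++k) {
        // find the pivot: the largest |a(i,k)| for i >= k
        int p = k;
        for (int i = k + 1; i < n; ++i)
            if (std::abs(a[i + k*n]) > std::abs(a[p + k*n])) p = i;
        piv[k] = p;
        if (p != k)  // swap rows k and p of the whole matrix
            for (int j = 0; j < n; ++j) std::swap(a[k + j*n], a[p + j*n]);
        // eliminate below the diagonal and update the trailing submatrix
        for (int i = k + 1; i < n; ++i) {
            a[i + k*n] /= a[k + k*n];  // multiplier, stored in place of L
            for (int j = k + 1; j < n; ++j)
                a[i + j*n] -= a[i + k*n] * a[k + j*n];
        }
    }
}

// Back substitution as in dgetrs ('N'): solve A x = b with the factors;
// b is overwritten by the solution.
void lu_solve(const std::vector<double>& a, const std::vector<int>& piv,
              std::vector<double>& b, int n)
{
    for (int k = 0; k < n; ++k) std::swap(b[k], b[piv[k]]);  // apply P
    for (int k = 0; k < n; ++k)                  // forward: L y = P b
        for (int i = k + 1; i < n; ++i) b[i] -= a[i + k*n] * b[k];
    for (int k = n - 1; k >= 0; --k) {           // backward: U x = y
        b[k] /= a[k + k*n];
        for (int i = 0; i < k; ++i) b[i] -= a[i + k*n] * b[k];
    }
}
```

In the block algorithm of DGETRF, the inner loops above are replaced by calls to level 3 BLAS on submatrices, which is where the performance measured below comes from.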
I will use my matrix class from the previous section to keep the dense matrix column-wise in C++ (see also Class Matrix). At the end of matrix.h you will find declarations of DGETF2 and the other LAPACK functions as well as wrapper functions to use them with the Matrix class. The header also contains a compiler macro MKL that changes the LAPACK function names: gfortran compiles function names in lowercase with a trailing underscore, while in MKL on Windows they are uppercase and without the underscore. The C++ code is in main.cc, which is similar to main.cc for decomp and solve and to 02lu.py.
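The idea can be sketched as follows; the exact macro and class in matrix.h may differ, so treat this as an illustration of the two points: the name mangling and the column-wise storage that the Fortran routines expect.

```cpp
#include <cassert>
#include <vector>

// Fortran name mangling: gfortran appends an underscore to the
// lowercase name, while MKL on Windows exports uppercase names
// without the underscore.
#ifdef MKL
#define dgetrf_ DGETRF
#define dgetrs_ DGETRS
#define dgetf2_ DGETF2
#endif

// A minimal column-major matrix class in the spirit of Class Matrix:
// the storage order matches what LAPACK expects.
class Matrix {
    int m_, n_;
    std::vector<double> data_;  // column after column
public:
    Matrix(int m, int n) : m_(m), n_(n), data_(m * n) {}
    double& operator()(int i, int j) { return data_[i + j * m_]; }
    double operator()(int i, int j) const { return data_[i + j * m_]; }
    int rows() const { return m_; }
    int cols() const { return n_; }
    double* data() { return data_.data(); }  // pointer to pass to LAPACK
};
```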
First I compile LAPACK with the reference BLAS from Netlib, and then I use LAPACK with an optimized BLAS to show the difference. I use gcc/gfortran under Cygwin for the first goal and Microsoft VC + MKL for the second (MKL already includes LAPACK), but it should not be too difficult to switch to other compilers and another optimized BLAS.
To compile LAPACK from Netlib under Cygwin, use the following commands (I see that there is a new version of LAPACK with a new C interface; in principle it could be possible to use it):
$ wget http://www.netlib.org/lapack/lapack-3.3.0.tgz
$ tar zxvf lapack-3.3.0.tgz
$ cd lapack-3.3.0
$ cp INSTALL/make.inc.gfortran make.inc
$ make blaslib
$ make lapacklib
$ ls *.a
I compile only the libraries; plain make would also compile the tests. It would be possible to speed up the process by compiling only the double precision version, but at the expense of a couple of extra commands. The LINUX suffix in the library names comes from make.inc.gfortran; it could be removed there. Now let us just rename the libraries
$ mv blas_LINUX.a libblas.a
$ mv lapack_LINUX.a liblapack.a
and note the directory where the files are located (in my case $HOME/misc/lib/lapack-3.3.0, used in the link command below).
The first command, to compile the code with gcc and the reference BLAS built above, is as follows:
$ g++ main.cc -L $HOME/misc/lib/lapack-3.3.0 -llapack -lblas -lgfortran -o main
You need to change the path after -L to adjust it to your setup. The second command, to use Intel MKL, is
$ cl -O2 -EHsc -D_SECURE_SCL=0 -MD -DUSECLOCK -DMKL main.cc mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib -Femain2.exe
Here there are some additional options for MS VC to compile the C++ code and also two compiler macros: USECLOCK to use clock() in the Timing class, and MKL to change the LAPACK names.
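A minimal sketch of such a Timing class follows; the real class in main.cc may well differ, and the non-USECLOCK branch here (wall time from time()) is only an assumed fallback.

```cpp
#include <cassert>
#include <ctime>

// A minimal Timing class: with USECLOCK defined, the processor time
// from clock() is used; otherwise wall time from time() (an assumed
// fallback for this sketch).
class Timing {
    double start_;
    static double now()
    {
#ifdef USECLOCK
        return static_cast<double>(std::clock()) / CLOCKS_PER_SEC;
#else
        return static_cast<double>(std::time(0));
#endif
    }
public:
    Timing() : start_(now()) {}
    double elapsed() const { return now() - start_; }  // seconds
};
```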
Intel MKL is multithreaded and by default uses all available cores. To run the tests with only one core, I issue under tcsh
$ setenv OMP_NUM_THREADS 1
I will show parallel benchmarks later. The tables below contain times on my new HP 8540w notebook with an Intel Core i7 processor; the code is compiled 32-bit. gcc is 4.3.4 under Cygwin 1.7, and Intel MKL is 11.1. To make the comparison with decomp and lu_factor, I have run them as well on the new notebook. SciPy is 0.7 with NumPy 1.3 under Python 2.5.
| Matrix dimension | dgetf2 | dgetrf | dgetf2 (MKL) | dgetrf (MKL) | decomp | lu_factor |
The times for decomp, compared with those in the previous sections (Linear Solve in Python and Using decomp and solve from Fortran), went down, but the times for lu_factor are about the same. I do not know how to explain this: it could be because of the new hardware, or it could be due to a newer version of the compiler.
In any case, the old Forsythe's decomp is quite competitive with dgetf2 from the newest LAPACK. The LU decomposition as such has not changed since then; the changes are in the use of BLAS and in the block algorithms. The optimized BLAS alone does not change much: dgetf2 with it is just a bit faster than dgetf2 with the reference BLAS. It is the combination of the block algorithm with the optimized BLAS that makes the difference. Already when we compare dgetrf with the reference BLAS against dgetf2, one sees the difference. Finally, dgetrf with the optimized BLAS reduces the time almost 10 times as compared with dgetf2. I should mention that on other hardware I have observed that the difference between dgetf2 and dgetrf is even larger than what the optimized BLAS adds, but this seems to be hardware dependent.
The performance in Python is close to that with Intel MKL. I guess that the SciPy version I have employed uses an old ATLAS, hence some difference in the table above.
Now back to the multithreaded BLAS in Intel MKL. The table below shows the performance in seconds with 1, 2 and 4 cores. Four cores cut the time almost in half.
| Matrix dimension | 1 core | 2 cores | 4 cores |
Finally, I would like to mention an interesting project, FLAME (Formal Linear Algebra Methods Environment) by Prof. Robert A. van de Geijn. There is also a nice book that quite didactically describes the numerics as well as a nice new way to program linear algebra: The Science of Programming Matrix Computations. For those who know Russian, I have a short description of the book here.