Evgenii B. Rudnyi, 2008, (c) All rights reserved
Matrix multiplication is a common operation required to implement different numerical algorithms. Its definition is pretty simple and it is assumed that you already know it. If you are in doubt, please refresh your knowledge in Wikipedia
http://en.wikipedia.org/wiki/Matrix_multiplication#Ordinary_matrix_product
before proceeding further. You need only section 1 “Ordinary_matrix_product” and actually subsection 1.1 “Calculating directly from the definition” will suffice.
The goal of this chapter is to review how this can be efficiently implemented with a programming language. We start with an interactive system for numerical linear algebra (module NumPy in Python), then continue with compiled languages (Fortran, C, C++) and finally consider specialized libraries (BLAS and ATLAS).
I have used GCC and GNU Make to compile all examples but any other compiler should do the job as well. The version of gcc employed was 3.4.4 under Cygwin on Windows.
The easiest start for numerical linear algebra nowadays is through interactive systems, that is, by means of an interpreted language. I will use the module NumPy in Python to demonstrate such systems but the results obtained should be common to other systems, for example, Matlab or Mathematica. It should be not too difficult to perform these computational experiments with any other system.
Let us consider matrix A with dimension dim1 x
dim2 and matrix B with dimensions dim2 x
dim3. The matrix A and B should be
compatible with each other, that is, the number of columns in A
should be equal to the number of rows in B. This fact is
reflected by using the same name dim2 to this end. The result of
a product A B is the matrix C with dimensions
dim1 x dim3. This convention will be followed through the whole
chapter.
In NumPy the product of two matrices is expressed as
C = A*B
The script mm.py gives an example of using this construct for the case above. It requires Python with NumPy installed on your computer. The script initializes the matrices A and B of given dimensions with random number and then measure the time needed to compute the matrix product. For example, the command
$ python mm.py 50 100 200
generates matrix A(50, 100) and matrix B(100,
200) and then computes matrix C(50, 200). If one
specifies only one argument, then it will be assumed that dim1 = dim2 =
dim3.
On my notebook the next command that multiplies A(1000, 1000)
by B(1000, 1000)
$ python mm.py 1000
says that
time for C(1000, 1000) = A(1000, 1000) B(1000, 1000) is
0.962140160272 s
It is understandable that when I execute the command several times the time changes a bit.
Let us compare this time with the number of operations required to compute
the matrix product. The matrix C has 1000000 members and to
compute each member one has to make 1000 multiplications and 1000 additions.
This means, we need 2*109 FLoating-point Operations (FLOP, see
also http://en.wikipedia.org/wiki/FLOPS)
or 2 Gflop. Provided that my notebook has a 2.16 GHz processor, the time of 1
s seems to be reasonable.
We see that modern interpreters give us convenient notation coupled with very good performance. The question may arise at this point if we need a compiled language at all. Well, it is very easy to find an answer. Let us just implement the matrix multiplication directly in Python – see direct.py. Now on my computer
$ python mm.py 100
time for C(100, 100) = A(100, 100) B(100, 100) is 0.00118562554738
s
$ python direct.py 100
time for C(100, 100) = A(100, 100) B(100, 100) is 10.8527335178
s
One can see that the direct implementation is slower by a factor more than 10000. How it could be? How actually NumPy computes the matrix product in the first case? The answer is simple. In the first case, the matrix product is actually computed outside Python in the precompiled code. This is a common strategy for interpreters: to create an efficient precompiled code for operations that are used most often and then just call it from the interpreter part. As you see, the looping by itself in interpreters is rather slow.
In this section we will consider a direct implementation of the matrix multiplication in three popular programming languages, Fortran, C, C++. Short information about them can be found in Wikipedia
http://en.wikipedia.org/wiki/Fortran
http://en.wikipedia.org/wiki/C_%28programming_language%29
http://en.wikipedia.org/wiki/C%2B%2B
The first implementation of matrix multiplication according to its definition is given in direct1.f in Fortran 77. The memory in early Fortran is allocated statically, that is, when one needs to change the matrix dimension, it is necessary to recompile program. Alternatively it was possible to allocate a big chunk of memory during compilation and then use only part of it during runtime. Dynamic memory allocation is possible starting Fortran 90 but my knowledge of Fortran 77 is much better than of the next Fortran versions.
Another Fortran peculiarity is the special format for the line of the code. The first 5 symbols are reserved for the label, the sixth is the continuation sign and the statement by itself can take place from 6th to 72th symbols only. You have to remember that Fortran was developed in the age of punch cards (see for example, http://en.wikipedia.org/wiki/Punch_card to learn what a punch card is) and at that time this rule was actually very handy. I highly recommend you to read the text Real Programmers Don't Use Pascal
http://www.pbm.com/~lindahl/real.programmers.html
Then you will understand the Fortran background much better.
The code in direct1.f should be pretty obvious. Note that numbering of matrix elements starts from 1. The matrices A and B are initialized by random numbers at the beginning and then go three loops to implement the matrix multiplication according to its definition – see the first equation at
http://en.wikipedia.org/wiki/Matrix_multiplication
Note that etime function has not been standardized and you
may need to change it if you use not g77.
The main difference with interpreters is that one has first to compile the program
$ g77 direct1.f -o direct1-f.exe
This command should produce the executable with the name direct1-f.exe and now we can run it.
./direct1-f.exe
On my notebook it produced
time for C( 1000, 1000) = A( 1000, 1000) B( 1000, 1000) is
11.3590012 s
Something wrong. It is very slow. Well, the first reason should be evident: I have not used the optimization flag. Let us repeat it with the optimization turned on
$ g77 –O3 direct1.f -o direct1-f.exe
$./direct1-f.exe
time for C( 1000, 1000) = A( 1000, 1000) B( 1000, 1000) is
6.60900021 s
It is already better but still more than 6 times slower than what we have seen in Python.
A few words about optimization. If you are interested what the difference
between the compilation with and without –O3, you can
make
$ g77 -S direct1.f -o without.s
$ g77 –O3 -S direct1.f -o with.s
and then compare the assembly code between each other. Other useful flags
that can improve performance with g77 are
-funroll-all-loops, -fno-f2c, and if you use
gcc under Cygwin on Windows -mno-cygwin. On the
other hand, it is good to remember that sometimes the compiler optimization
breaks the code.
In our case the optimization flags in g77 do not help
considerably any more. To find the second reason we have to remember that the
processor has to fetch data from the memory and this is much slower than the
execution with data in the processor cache. What happens in direct1.f is that the processor needs to
often to fetch data from the memory and the reuse data in processor cache is
low. The new version in direct2.f partly
solves this problem. In order to understand the code, you have to recognize
that
B(k, j) is now a constant in the internal loop.When we run the program now
$ g77 -O3 direct2.f -o direct2-f.exe
$ direct2-f.exe
time for C( 1000, 1000) = A( 1000, 1000) B( 1000, 1000) is
2.04699993 s
we can see that the change in the order of the loops gave us speed up of three times. Still it is two times slower than the performance in Python. Note that on computers with slower memory this ratio could be much higher (see the Table 1 at the end of the chapter). This shows that one can think of the direct implementation as of a naive one. It is easy to implement it but it does not give the best performance. The performance will be improved in the next section and now the direct implementation will be repeated in C and C++.
I should confess that I feel myself more confident in C++, rather than in C. As a result, I have made just a very basic implementation of the matrix multiplication in direct1.c. The order of loops is similar to that in direct2.f. The difference is due to the fact, that in C the array is stored row wise. Another difference from Fortran is that the numbering of matrix elements starts from 0.
As the memory is allocated for matrices on the stack and the default stack size is not enough to keep all three matrices, we have to increase the stack size
$ gcc -O3 -Wl,--stack=50000000 direct1.c –o
direct1-c.exe
$ ./direct1-c.exe
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 2.469000
s
The timing is close to that in direct2.f but about 20 percent slower. It surprises me and all I can say here is that on other computers I have not seen such difference. Anyway, in direct2.c there is another implementation where the matrix is kept in the one-dimensional array and is stored column wise. In this case the timing on my notebook is closer to that in direct2.f:
$ gcc -O3 -Wl,--stack=50000000 direct2.c –o
direct2-c.exe
$ ./direct2-c.exe
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 2.125000
s
In C++ it is possible to make a user-defined data type. A primitive class
for a rectangular matrix is defined in matrix.h. It uses
vector<double> from the standard library as a
one-dimensional storage and stores the matrix column wise. In the header
there is also a timer that is based on gettimeofday. If it is not available
at your system, compile with #define USECLOCK.
In direct.cc I have made an implementation closer to the Python one, when one can specify the dimension of the matrices from the command line.
$ g++ -O3 direct.cc -o direct-cc.exe
$ ./direct-cc.exe 1000
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 2.093
s
One can see that the timing is very similar to that in direct2.f and direct2.c.
Fetching data from memory is a clear bottleneck for matrix multiplication of big matrices. The solution to this problem highly depends on the computer and the solution adapted in scientific computing is to introduce a standardized interface, Basic Linear Algebra Subprograms. See
http://en.wikipedia.org/wiki/BLAS
as short introduction and
http://www.netlib.org/lapack/lug/node145.html
as quick overview of available functions. The idea is that computer vendors will implement these subroutines in the most efficient way and this way one can combine portability of the user code with high efficiency.
In this section we will use general purpose BLAS implementation from Netlib
and Automatically Tuned Linear Algebra Software (ATLAS)
http://math-atlas.sourceforge.net/
The use of libraries makes the compilation process a bit more difficult and a good idea is to use make to automate it.
When the code is in one file it is quite easy to compile it. When the code is distributed among many files, the compilation process gets more difficult. The goal of make is to make it simple. The idea here is that there is one file in which the compilation process is described and then the tool make uses this file to perform operations that are needed to produce the executable.
An example of makefile is in the directory 2direct. Provided that you have GNU make on you path, a command performed in that directory
$ make clean
will remove executables compiled previously and then
$ make
will make all executables and run them.
If you open makefile in the text editor, this will give you an idea on how one can describe the compilation process. Short introduction to make is in Wikipedia
http://en.wikipedia.org/wiki/Make_%28software%29
as well as in the GNU make documentation
http://www.gnu.org/software/make/manual/html_node/Introduction.html
In directory make there is a simple example on how one can with make and preprocessor solve the problem that Fortran77 allocated memory statically. Now the command
$ make dim1=400 dim2=500 dim3=600
will compile and run the direct2.F with these values of parameters. Also it is possible to specify only one parameter
make dim1=1000
The reference BLAS implementation at Netlib does not give us good performance. Anyway it is good to compile it as an exercise.
Download the file. For example with wget
$ wget http://www.netlib.org/blas/blas.tgz
Unpack the archive and change the directory
$ tar zxvf blas.tgz
$ cd BLAS
Edit make.inc if necessary. By default it uses g77 with
–O3. Run make
$ make
This should produce the file blas.a (it may have a prefix
defined by PLAT in Makefile). Rename this file as
libblas.a and copy to some location of your choice. That’s
it.
The compilation of ATLAS is more complicated but I would advise you to do
it. Note that the newest ATLAS 3.8 requires gcc 4, if you have gcc 3 use
ATLAS 3.6. If successful it should produce libatlas.a,
libcblas.a, libf77blas.a and
liblapack.a. For this chapter we need the three first libraries.
Note that the process to generate ATLAS libraries is not only compilation.
ATLAS runs many internal tests in order to determine an optimal code to
compile. As a result, the process can take more than one hour.
If you have problems to compile ATLAS by yourself, in the directory lib/windows there are precompiled BLAS and ATLAS 3.6 libraries with gcc under cygwin on my notebook. However they may not be optimal for your computer.
For our purpose to multiply two general matrices we need to call DGEMM from BLAS
http://en.wikipedia.org/wiki/DGEMM
Calling DGEMM from Fortran is straightforward. Just one word about the
parameters LD* (so called leading dimension). As was mentioned
above the matrices are stored in Fortran column wise and it is possible to
allocate more memory at the beginning. For example,
double precision A(1000, 2000)
allocates memory for the matrix with 1000 rows and 2000 columns. However,
one can also use this matrix to keep smaller matrices A(m, n)
provided m <= 1000 and n <= 2000. This way
the storage in the underlying one-dimensional array will be discontinuous.
First go m values for the first column starting from the first
location, then go the next m values for the second column starting from
1000th location, and so on. That is, the position in the underlying
one-dimensional array is computed as i*1000 + j and the
subroutine should know not only the real dimension of the matrix but also the
number of rows declared in the statement above.
usedgemm.f demonstrates the call. Now
to compile the code it is necessary to specify the location of the libraries
with –L and the libraries with –l. This
command compiles the code with the reference BLAS library (you may need to
change the path after –L)
$ g77 usedgemm.f -L../../lib/windows -lblas
–ousedgemm-f-blas.exe
This compiles with ATLAS
$ g77 usedgemm.f -L../../lib/windows -lf77blas -lcblas -latlas
-ousedgemm-f-atlas.exe
Now one can see that the reference BLAS library is not optimized indeed
$ ./usedgemm-f-blas.exe
time for C( 1000, 1000) = A( 1000, 1000) B( 1000, 1000) is
2.04699993 s
The time is very similar to that of direct2.f. Actually if you look at the code
http://www.netlib.org/blas/dgemm.f
you find that direct2.f just reproduces a part related to not transposed matrices. Well, in dgemm.f there are more checks. On the other hand DGEMM from ATLAS
$ ./usedgemm-f-atlas.exe
time for C( 1000, 1000) = A( 1000, 1000) B( 1000, 1000) is
0.968999982 s
reproduces the time that we have seen in Numpy. No wonder, I guess that NumPy just uses ATLAS.
It is possible to call Fortran subroutines from C and C++ directly but in this case it is simpler to use CBLAS interface: it is included with ATLAS, see also
http://www.netlib.org/blas/blast-forum/cblas.tgz
The example for C++ is in usedgemm.cc. The call from C is similar,
one just does not need extern “C” over
#include “cblas.h” and the transformation to
double* for the array would be simpler.
As we need extra header to compile the file, one has specify its location
with -I
$ g++ -I../../lib/include usedgemm.cc -L../../lib/windows -lcblas
-latlas -ousedgemm-cc-atlas.exe
$ ./usedgemm-cc-atlas.exe 1000
time for C(1000,1000) = A(1000,1000) B(1000,1000) is 0.984
s
The time is again the same, as we have seen in Python.
makefile in this directory shows how to automate the compilation process.
In Table 1 there is a comparison between times measured on three different computers under Windows for the programs described in this chapter. Parameters of computers have been measured by CPU-Z.
| CPU Clock, MHz | 2160 | 1670 | 2600 |
| Memory Clock, MHz | 667 | 133 | 199 |
| L2 Cache, Kb | 4096 | 256 | 512 |
| direct1-f | 6.61 | 22.67 | 11.61 |
| direct2-f | 2.05 | 13.70 | 3.34 |
| direct1-c | 2.47 | 13.72 | 3.31 |
| direct2-c | 2.13 | 13.73 | 3.32 |
| direct-cc | 2.09 | 13.73 | 3.30 |
| usedgemm-f-atlas | 0.97 | 1.15 | 0.53 |
| usedgemm-cc-atlas | 0.98 | 1.15 | 0.53 |
One can see that the total performance depends on both processor and memory clock. This leads to what people call the megahertz myth
http://en.wikipedia.org/wiki/Megahertz_myth
Only when one removes the bottleneck of fetching data from the memory, as it happens in the optimized BLAS, the performance is proportional to the CPU clock.
I hope that this chapter has convinced you that it is very important to
use libraries even for seemingly simple things, like matrix multiplication.
Also the reading of the code from good libraries is a unique opportunity to
improve your programming skills. Another important point in numerical method
is robustness. Look at norm.cc. Here
dnrm2 returns correct values in both cases while
naive_dnrm2 in both cases fails (run make
norm.exe). (Small note: in this case –O3 makes
naive_dnrm2 works correctly).
Please post your comments, questions, suggestions to the discussion group at http://groups.google.com/group/matrixprogramming.
See also some discussion at sci.math.num-analysis: Matrix Multiplication.