- Introduction
- Interactive System for Numerical Linear Algebra
- Direct Implementation with Compiled Language
- Using Specialized Libraries: BLAS and ATLAS
- Conclusion
- Problems
- Discussion

## Introduction

Matrix multiplication is a common operation required to implement different numerical algorithms. Its definition is pretty simple and it is assumed that you already know it. If you are in doubt, please refresh your knowledge in Wikipedia

http://en.wikipedia.org/wiki/Matrix_multiplication#Ordinary_matrix_product

before proceeding further. You need only section 1 “Ordinary_matrix_product” and actually subsection 1.1 “Calculating directly from the definition” will suffice.

The goal of this chapter is to review how this can be efficiently implemented with a programming language. We start with an interactive system for numerical linear algebra (module NumPy in Python), then continue with compiled languages (Fortran, C, C++) and finally consider specialized libraries (BLAS and ATLAS).

I have used GCC and GNU Make to compile all examples but any other compiler should do the job as well. The version of gcc employed was 3.4.4 under Cygwin 1.5 on Windows.

## Interactive System for Numerical Linear Algebra

The easiest start for numerical linear algebra nowadays is through interactive systems, that is, by means of an interpreted language. I will use the module NumPy in Python to demonstrate such systems but the results obtained should be common to other systems, for example, Matlab or Mathematica. It should be not too difficult to perform these computational experiments with any other system.

Let us consider matrix `A`

with dimension `dim1 x dim2`

and matrix `B`

with dimensions `dim2 x dim3`

. The matrix `A`

and `B`

should be compatible with each other, that is, the number of columns in `A`

should be equal to the number of rows in `B`

. This fact is reflected by using the same name `dim2`

to this end. The result of a product `A B`

is the matrix `C`

with dimensions `dim1 x dim3`

. This convention will be followed through the whole chapter.

In NumPy the product of two matrices is expressed as

`C = A*B`

The script mm.py gives an example of using this construct for the case above. It requires Python with NumPy installed on your computer. The script initializes the matrices A and B of given dimensions with random number and then measure the time needed to compute the matrix product. For example, the command

`$ python mm.py 50 100 200`

generates matrix `A(50, 100)`

and matrix `B(100, 200)`

and then computes matrix `C(50, 200)`

. If one specifies only one argument, then it will be assumed that `dim1 = dim2 = dim3`

.

On my notebook the next command that multiplies `A(1000, 1000)`

by `B(1000, 1000)`

`$ python mm.py 1000`

says that

`time for C(1000, 1000) = A(1000, 1000) B(1000, 1000) is 0.962140160272 s`

It is understandable that when I execute the command several times the time changes a bit.

Let us compare this time with the number of operations required to compute the matrix product. The matrix `C`

has 1000000 members and to compute each member one has to make 1000 multiplications and 1000 additions. This means, we need 2*10^{9} FLoating-point Operations (FLOP, see also http://en.wikipedia.org/wiki/FLOPS) or 2 Gflop. Provided that my notebook has a 2.16 GHz processor, the time of 1 s seems to be reasonable.

We see that modern interpreters give us convenient notation coupled with very good performance. The question may arise at this point if we need a compiled language at all. Well, it is very easy to find an answer. Let us just implement the matrix multiplication directly in Python – see direct.py. Now on my computer

`$ python mm.py 100`

time for C(100, 100) = A(100, 100) B(100, 100) is 0.00118562554738 s

$ python direct.py 100

time for C(100, 100) = A(100, 100) B(100, 100) is 10.8527335178 s

One can see that the direct implementation is slower by a factor more than 10000. How it could be? How actually NumPy computes the matrix product in the first case? The answer is simple. In the first case, the matrix product is actually computed outside Python in the precompiled code. This is a common strategy for interpreters: to create an efficient precompiled code for operations that are used most often and then just call it from the interpreter part. As you see, the looping by itself in interpreters is rather slow.

## Direct Implementation with Compiled Language

In this section we will consider a direct implementation of the matrix multiplication in three popular programming languages, Fortran, C, C++. Short information about them can be found in Wikipedia

http://en.wikipedia.org/wiki/Fortran

http://en.wikipedia.org/wiki/C_%28programming_language%29

http://en.wikipedia.org/wiki/C%2B%2B

### Fortran

The first implementation of matrix multiplication according to its definition is given in direct1.f in Fortran 77. The memory in early Fortran is allocated statically, that is, when one needs to change the matrix dimension, it is necessary to recompile program. Alternatively it was possible to allocate a big chunk of memory during compilation and then use only part of it during runtime. Dynamic memory allocation is possible starting Fortran 90.

Another Fortran peculiarity is the special format for the line of the code. The first 5 symbols are reserved for the label, the sixth is the continuation sign and the statement by itself can take place from 6th to 72th symbols only. You have to remember that Fortran was developed in the age of punch cards (see for example, http://en.wikipedia.org/wiki/Punch_card to learn what a punch card is) and at that time this rule was actually very handy. I highly recommend you to read the text Real Programmers Don’t Use Pascal

http://www.pbm.com/~lindahl/real.programmers.html

Then you will understand the Fortran background much better.

The code in direct1.f should be pretty obvious. Note that numbering of matrix elements starts from 1. The matrices A and B are initialized by random numbers at the beginning and then go three loops to implement the matrix multiplication according to its definition – see the first equation at

http://en.wikipedia.org/wiki/Matrix_multiplication

Note that `etime`

function has not been standardized and you may need to change it if you use not `g77`

.

The main difference with interpreters is that one has first to compile the program

`$ g77 direct1.f -o direct1-f.exe`

This command should produce the executable with the name direct1-f.exe and now we can run it.

.`/direct1-f.exe`

On my notebook it produced

`time for C( 1000, 1000) = A( 1000, 1000) B( 1000, 1000) is 11.3590012 s`

Something wrong. It is very slow. Well, the first reason should be evident: I have not used the optimization flag. Let us repeat it with the optimization turned on

`$ g77 –O3 direct1.f -o direct1-f.exe`

$./direct1-f.exe

time for C( 1000, 1000) = A( 1000, 1000) B( 1000, 1000) is 6.60900021 s

It is already better but still more than 6 times slower than what we have seen in Python.

A few words about optimization. If you are interested what the difference between the compilation with and without `–O3`

, you can make

`$ g77 -S direct1.f -o without.s`

$ g77 –O3 -S direct1.f -o with.s

and then compare the assembly code between each other. Other useful flags that can improve performance with `g77`

are `-funroll-all-loops`

, `-fno-f2c`

, and if you use `gcc`

under Cygwin on Windows `-mno-cygwin`

. On the other hand, it is good to remember that sometimes the compiler optimization breaks the code.

In our case the optimization flags in `g77`

do not help considerably any more. To find the second reason we have to remember that the processor has to fetch data from the memory and this is much slower than the execution with data in the processor cache. What happens in direct1.f is that the processor needs to often to fetch data from the memory and the reuse data in processor cache is low. The new version in direct2.f partly solves this problem. In order to understand the code, you have to recognize that

- Three loops for the matrix multiplication can be freely exchanged with other,
- Fortran keeps the matrix in memory column wise,
`B(k, j)`

is now a constant in the internal loop.

When we run the program now

`$ g77 -O3 direct2.f -o direct2-f.exe`

$ direct2-f.exe

time for C( 1000, 1000) = A( 1000, 1000) B( 1000, 1000) is 2.04699993 s

we can see that the change in the order of the loops gave us speed up of three times. Still it is two times slower than the performance in Python. Note that on computers with slower memory this ratio could be much higher (see the Table 1 at the end of the chapter). This shows that one can think of the direct implementation as of a naive one. It is easy to implement it but it does not give the best performance. The performance will be improved in the next section and now the direct implementation will be repeated in C and C++.

### C

I should confess that I feel myself more confident in C++, rather than in C. As a result, I have made just a very basic implementation of the matrix multiplication in direct1.c. The order of loops is similar to that in direct2.f. The difference is due to the fact, that in C the array is stored row wise. Another difference from Fortran is that the numbering of matrix elements starts from 0.

As the memory is allocated for matrices on the stack and the default stack size is not enough to keep all three matrices, we have to increase the stack size

`$ gcc -O3 -Wl,--stack=50000000 direct1.c –o direct1-c.exe`

$ ./direct1-c.exe

time for C(1000,1000) = A(1000,1000) B(1000,1000) is 2.469000 s

The timing is close to that in direct2.f but about 20 percent slower. It surprises me and all I can say here is that on other computers I have not seen such difference. Anyway, in direct2.c there is another implementation where the matrix is kept in the one-dimensional array and is stored column wise. In this case the timing on my notebook is closer to that in direct2.f:

`$ gcc -O3 -Wl,--stack=50000000 direct2.c –o direct2-c.exe`

$ ./direct2-c.exe

time for C(1000,1000) = A(1000,1000) B(1000,1000) is 2.125000 s

### C++

In C++ it is possible to make a user-defined data type. A primitive class for a rectangular matrix is defined in matrix.h (see also Class Matrix). It uses `vector<double>`

from the standard library as a one-dimensional storage and stores the matrix column wise. In the header there is also a timer that is based on gettimeofday. If it is not available at your system, compile with `#define USECLOCK`

.

In direct.cc I have made an implementation closer to the Python one, when one can specify the dimension of the matrices from the command line.

`$ g++ -O3 direct.cc -o direct-cc.exe`

$ ./direct-cc.exe 1000

time for C(1000,1000) = A(1000,1000) B(1000,1000) is 2.093 s

One can see that the timing is very similar to that in direct2.f and direct2.c.

## Using Specialized Libraries: BLAS and ATLAS

Fetching data from memory is a bottleneck for matrix multiplication of big matrices. The solution to this problem highly depends on the computer and the solution adapted in scientific computing is to introduce a standardized interface, Basic Linear Algebra Subprograms. See

http://en.wikipedia.org/wiki/BLAS

as short introduction and

http://www.netlib.org/lapack/lug/node145.html

as quick overview of available functions. The idea is that computer vendors will implement these subroutines in the most efficient way and this way one can combine portability of the user code with high efficiency.

In this section we will use general purpose BLAS implementation from Netlib

and Automatically Tuned Linear Algebra Software (ATLAS)

http://math-atlas.sourceforge.net/

The use of libraries makes the compilation process a bit more difficult and a good idea is to use make to automate it.

### Using make

When the code is in one file it is quite easy to compile it. When the code is distributed among many files, the compilation process gets more difficult. The goal of make is to make it simple. The idea here is that there is one file in which the compilation process is described and then the tool make uses this file to perform operations that are needed to produce the executable.

An example of makefile is in the directory 2direct. Provided that you have GNU make on you path, a command performed in that directory

`$ make clean`

will remove executables compiled previously and then

`$ make`

will make all executables and run them.

If you open makefile in the text editor, this will give you an idea on how one can describe the compilation process. Short introduction to make is in Wikipedia

http://en.wikipedia.org/wiki/Make_%28software%29

as well as in the GNU make documentation

http://www.gnu.org/software/make/manual/html_node/Introduction.html

In the directory make there is a simple example on how one can with make and preprocessor solve the problem that Fortran77 allocated memory statically. Now the command

`$ make dim1=400 dim2=500 dim3=600`

will compile and run the direct2.F with these values of parameters. Also it is possible to specify only one parameter

`make dim1=1000`

### Compiling BLAS and ATLAS

The reference BLAS implementation at Netlib does not give us good performance. Anyway it is good to compile it as an exercise.

Download the file. For example with wget

`$ wget http://www.netlib.org/blas/blas.tgz`

Unpack the archive and change the directory

`$ tar zxvf blas.tgz`

$ cd BLAS

Edit make.inc if necessary. By default it uses `g77`

with `–O3`

. Run make

`$ make`

This should produce the file `blas.a`

(it may have a prefix defined by `PLAT`

in Makefile). Rename this file as `libblas.a`

and copy to some location of your choice. That’s it.

The compilation of ATLAS is more complicated but I would advise you to do it. Note that the newest ATLAS 3.8 requires gcc 4, if you have gcc 3 use ATLAS 3.6. If successful it should produce `libatlas.a`

, `libcblas.a`

, `libf77blas.a`

and `liblapack.a`

. For this chapter we need the three first libraries. Note that the process to generate ATLAS libraries is not only compilation. ATLAS runs many internal tests in order to determine an optimal code to compile. As a result, the process can take more than one hour.

If you have problems to compile ATLAS by yourself, there are precompiled BLAS and ATLAS 3.6 libraries with gcc under cygwin on my notebook. However they may not be optimal for your computer.

For our purpose to multiply two general matrices we need to call DGEMM from BLAS

http://en.wikipedia.org/wiki/DGEMM

### Calling DGEMM from Fortran

Calling DGEMM from Fortran is straightforward. Just one word about the parameters `LD*`

(so called leading dimension). As was mentioned above the matrices are stored in Fortran column wise and it is possible to allocate more memory at the beginning. For example,

`double precision A(1000, 2000)`

allocates memory for the matrix with 1000 rows and 2000 columns. However, one can also use this matrix to keep smaller matrices `A(m, n)`

provided `m <= 1000`

and `n <= 2000`

. This way the storage in the underlying one-dimensional array will be discontinuous. First go `m`

values for the first column starting from the first location, then go the next m values for the second column starting from 1000th location, and so on. That is, the position in the underlying one-dimensional array is computed as `i*1000 + j`

and the subroutine should know not only the real dimension of the matrix but also the number of rows declared in the statement above.

usedgemm.f demonstrates the call. Now to compile the code it is necessary to specify the location of the libraries with `–L`

and the libraries with `–l`

. This command compiles the code with the reference BLAS library (you may need to change the path after `–L`

)

`$ g77 usedgemm.f -L../../lib/windows -lblas –ousedgemm-f-blas.exe`

This compiles with ATLAS

`$ g77 usedgemm.f -L../../lib/windows -lf77blas -lcblas -latlas -ousedgemm-f-atlas.exe`

Now one can see that the reference BLAS library is not optimized indeed

`$ ./usedgemm-f-blas.exe`

time for C( 1000, 1000) = A( 1000, 1000) B( 1000, 1000) is 2.04699993 s

The time is very similar to that of direct2.f. Actually if you look at the code

http://www.netlib.org/blas/dgemm.f

you find that direct2.f just reproduces a part related to not transposed matrices. Well, in dgemm.f there are more checks. On the other hand DGEMM from ATLAS

`$ ./usedgemm-f-atlas.exe`

time for C( 1000, 1000) = A( 1000, 1000) B( 1000, 1000) is 0.968999982 s

reproduces the time that we have seen in Numpy. No wonder, I guess that NumPy just uses ATLAS.

### Calling DGEMM from C and C++

It is possible to call Fortran subroutines from C and C++ directly but in this case it is simpler to use CBLAS interface: it is included with ATLAS, see also

http://www.netlib.org/blas/blast-forum/cblas.tgz

The example for C++ is in usedgemm.cc. The call from C is similar, one just does not need `extern “C”`

over `#include “cblas.h”`

and the transformation to `double*`

for the array would be simpler.

As we need extra header to compile the file, one has specify its location with `-I`

`$ g++ -I../../lib/include usedgemm.cc -L../../lib/windows -lcblas -latlas -ousedgemm-cc-atlas.exe`

$ ./usedgemm-cc-atlas.exe 1000

time for C(1000,1000) = A(1000,1000) B(1000,1000) is 0.984 s

The time is again the same, as we have seen in Python.

makefile in this directory shows how to automate the compilation process.

## Conclusion

In Table 1 there is a comparison between times measured on three different computers under Windows for the programs described in this chapter. Parameters of computers have been measured by CPU-Z.

CPU Clock, MHz | 2160 | 1670 | 2600 |

Memory Clock, MHz | 667 | 133 | 199 |

L2 Cache, Kb | 4096 | 256 | 512 |

direct1-f | 6.61 | 22.67 | 11.61 |

direct2-f | 2.05 | 13.70 | 3.34 |

direct1-c | 2.47 | 13.72 | 3.31 |

direct2-c | 2.13 | 13.73 | 3.32 |

direct-cc | 2.09 | 13.73 | 3.30 |

usedgemm-f-atlas | 0.97 | 1.15 | 0.53 |

usedgemm-cc-atlas | 0.98 | 1.15 | 0.53 |

One can see that the total performance depends on both processor and memory clock. This leads to what people call the megahertz myth

http://en.wikipedia.org/wiki/Megahertz_myth

Only when one removes the bottleneck of fetching data from the memory, as it happens in the optimized BLAS, the performance is proportional to the CPU clock.

I hope that this chapter has convinced you that it is very important to use libraries even for seemingly simple things, like matrix multiplication. Also the reading of the code from good libraries is a unique opportunity to improve your programming skills. Another important point in numerical method is robustness. Look at norm.cc. Here `dnrm2`

returns correct values in both cases while `naive_dnrm2`

in both cases fails (run `make norm.exe`

). (Small note: in this case `–O3`

makes `naive_dnrm2`

works correctly).

## Problems

- Research on how the performance of matrix multiplication depends on the matrix dimension.
- Measure how much memory you need for the matrix multiplication. What matrix dimensions are necessary to reach the level when physical RAM will be not enough to keep three matrices? What will happens in this case?
- If you have an optimized BLAS for your system, use it for the matrix multiplication. Compare results with those of ATLAS.
- If possible, use different compilers to compile programs implementing the matrix multiplication directly and compare the performance.
- If you have dual boot, compile programs under different operating systems and compare performance.
- Call Fortran DGEMM from C and C++ directly.
- Try to write an optimized code (an equivalent DGEMM from ATLAS) for matrix multiplication by yourself.

## Discussion

Some discussion at sci.math.num-analysis: Matrix Multiplication.

## See also:

BLAS: Basic Linear Algebra System

Solving System of Linear Equations

Martin Thoma, Part I: Performance of Matrix multiplication in Python, Java and C++

http://martin-thoma.com/matrix-multiplication-python-java-cpp/

I potentially see one flaw here that is not addressed. When you compiled your programs did you enable auto-vectorization and auto-parallelization? I noticed in your compiler flags you did not include any cpu optimizations or anythign like that. I know from experience that enabling sse2 optimizations and autovectorization of these kinds of loops decreases compiler time significantly. The python modules would have certainly compiled with this in mind and in order to do a fair comparison you would need to ensure that you enable these types of optimizations. If you do this I think you will find that they all run about the same speed.

If I can find it I have a list of compiler flags to do this and then the test could be reapproached with them

You are right that it would be good to check all possible optimization flags. I have limited myself in this study by -O3 only.

In my view though, it is unlikely that by pure optimization at the compiler level one will achieve the performance of the optimized BLAS. Well, if you try it, please report your results here.

Is it possible there’s a bug in your code: usedgemm.cc. The last parameter should possibly be dim3, not dim1, right ?

The matrices are stored in the class Matrix by columns. Hence I would say that this is correct. Yet, it would be good to make a test.