21 December 2011 By Stuart Archibald

In my last post I briefly outlined some performance gains we were achieving in our tuned dense BLAS library. Evidently this was received with mixed opinions! Tuning Java does seem like an insane idea, and indeed some people pointed this out. But these are our constraints:

- Code has to be as fast as possible.
- Assume the hardware is fixed and tune to it.
- Code has to be in Java: no JNI, no hitting Fortran/BLAS/<insert thing we'd normally use here>.

So we don't really have a lot of choice! Once every piece of algorithmic tuning has been taken into account, the main source of performance death is going to be the JVM and the fact that the JVM doesn't provide the primitives those of us from the scientific computing world are used to; both of these are fixed constraints. However, forcing this combination to produce something reasonable performance-wise (as exhaustively pointed out by the people at LMAX) requires effort: an acknowledgment of the hardware and a trip back to basic computer science.

Having been relatively successful in increasing performance in dense BLAS kernels, we turned our attention to the invariably hard sparse BLAS operations: it is our notes and observations on this topic that form the bulk of this article. For those unfamiliar with the notion, a sparse matrix is categorised as a matrix with only a small number of non-zero elements (yes, defining "small" is part of the problem). Sparse matrices arise a lot in numerical methods, often as a result of mathematical operations which tend to create some form of diagonal dominance, though this is not strictly so. Additionally, there is no real condition on the pattern of the non-zero elements or what constitutes a sparse structure.

The reason that we categorise sparse matrices separately is for gains in both performance (we don't need to do operations involving zero entries) and memory footprint (we don't need to store zeros). For example, multiplying a sparse matrix by a vector should take time proportional to the number of non-zero elements of the matrix, as opposed to its dimension. These performance gains can be carried over into other linear algebra operations, but as we'll see later this is easier said than done, particularly given the utter lack of memory manipulation presented by the Java language.

First, we shall address the manner in which to store the sparse matrix data. Most Java programmers when asked this question reach for a Map implementation, and rightly so, a `Map<Pair<row number, column number>, value>`

would certainly suffice and be very easy to use. The problem, as always with built-in overkill, is that it is probably slow. Notably, the more we play with getting performance from Java, the more we have come to the conclusion that if the Java code doesn't look like a variant of Fortran, it's going to be slow. So although a Map has some advantages, as we'll see later, the bloat is a speed killer.
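One way to make the map approach a little less bloated, sketched below purely for illustration (this is not our production code), is to pack the (row, column) pair into a single long key rather than allocating a Pair object per entry:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a map-backed sparse matrix: (row, col) packed into one long key
// to avoid boxing a Pair per entry. Lookups of unstored entries default to zero.
Map<Long, Double> matrix = new HashMap<>();

// set A(0, 4) = 12 and A(2, 3) = 9
matrix.put((0L << 32) | 4L, 12.0);
matrix.put((2L << 32) | 3L, 9.0);

double a04 = matrix.getOrDefault((0L << 32) | 4L, 0.0); // stored entry
double a11 = matrix.getOrDefault((1L << 32) | 1L, 0.0); // structural zero
```

Even with the boxing of the Pair removed, every access still boxes a Long and a Double and chases hash buckets, which is exactly the kind of bloat that kills speed.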

From the idea of a map, we deduce that to describe the matrix all that is needed is a tuple of <row number, column number, value>. The Coordinate Format matrix (referred to as COO) is perhaps the most basic sparse matrix storage format and it directly reflects this tuple layout, storing the data as three vectors: two int arrays (one of row indices, another of column indices) and a double/float array of the corresponding values. Using the following matrix as an example:

```
1  3  0  0 12
0  4  6  8  0
2  0  0  9 13
0  5  7 10 14
0  0  0 11  0
```

A COO matrix would store the data as:

```java
int[] columnIndex = {0, 1, 4, 1, 2, 3, 0, 3, 4, 1, 2, 3, 4, 3};
int[] rowIndex    = {0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4};
double[] data     = {1, 3, 12, 4, 6, 8, 2, 9, 13, 5, 7, 10, 14, 11};
```

This example highlights two important points. First, the memory footprint of the stored data is larger than that of the original data (3*14 vs 5*5). Second, there is exploitable information redundancy in the rowIndex variable.
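For reference, a sparse matrix vector multiply (y := A*x) over the COO arrays above is a single accumulation loop over the stored non-zeros, a minimal sketch being:

```java
// COO arrays for the 5x5 example matrix above
int[] rowIndex = {0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4};
int[] columnIndex = {0, 1, 4, 1, 2, 3, 0, 3, 4, 1, 2, 3, 4, 3};
double[] data = {1, 3, 12, 4, 6, 8, 2, 9, 13, 5, 7, 10, 14, 11};

double[] x = {1, 2, 3, 4, 5};
double[] y = new double[5];

// y := A * x, one multiply-accumulate per stored non-zero
for (int i = 0; i < data.length; i++) {
    y[rowIndex[i]] += data[i] * x[columnIndex[i]];
}
```

Note that the work is linear in the number of non-zeros, exactly the property we wanted from a sparse format.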

Our next matrix format is the Compressed Sparse Row storage format (CSR), which takes advantage of this redundancy by solely storing a row pointer from which it is possible to compute an offset into the data. This row pointer is formed by keeping a cumulative count of the number of non-zero data elements encountered up to the start of a given row in the matrix. This is probably best described with an example, so continuing with the above matrix:

```java
int[] columnIndex = {0, 1, 4, 1, 2, 3, 0, 3, 4, 1, 2, 3, 4, 3};      // the same as for COO
int[] rowPtr      = {0, 3, 6, 9, 13, 14};                            // a cumulative count of non-zeros
double[] data     = {1, 3, 12, 4, 6, 8, 2, 9, 13, 5, 7, 10, 14, 11}; // the same as for COO
```

Note that the rowPtr variable contains (number of rows + 1) entries. This accounts for the number of elements in the final row and allows easy computation of the number of elements in any row, for the sake of induction variables that index in the column space. For example, to walk through the data in row 2 (assuming 0-based row indexing):

```java
int index = 2;
for (int i = rowPtr[index]; i < rowPtr[index + 1]; i++) {
    double value = data[i]; // accessing data belonging to row 2
}
```

From this, it is clear that, thanks to the additional entry stored in the rowPtr index, the same code can be used for all rows, which considerably simplifies algorithms.
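Converting from COO to CSR amounts to replacing rowIndex with the cumulative counts: one counting pass followed by a prefix sum. A sketch, assuming the COO entries are already sorted by row:

```java
// COO row indices for the example matrix, sorted by row
int[] rowIndex = {0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4};
int numRows = 5;

// count non-zeros per row into slots 1..numRows
int[] rowPtr = new int[numRows + 1];
for (int r : rowIndex) {
    rowPtr[r + 1]++;
}
// prefix sum turns counts into (numRows + 1) cumulative offsets
for (int r = 0; r < numRows; r++) {
    rowPtr[r + 1] += rowPtr[r];
}
```

For the example data this reproduces the rowPtr shown above; columnIndex and data carry over unchanged.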

In addition to CSR there is an equivalent storage format in which the rows and columns reverse roles, so that offsets are computed by column and row indices give the locations within each column. This format is known as Compressed Sparse Column (CSC). By inspection, the CSR and CSC forms of a non-symmetric matrix will in general have different total information content. However, both will have a lower information content than COO (in most real-world cases).
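Walking a column in CSC mirrors the CSR row walk exactly, with the column pointer and row index arrays swapped in (a sketch, using the same example matrix; the array names are illustrative):

```java
// CSC arrays for the same 5x5 example matrix
int[] rowIdx = {0, 2, 0, 1, 3, 1, 3, 1, 2, 3, 4, 0, 2, 3};
int[] colPtr = {0, 2, 5, 7, 11, 14};
double[] values = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14};

// walk the entries of column 3 (0-based), mirroring the CSR row walk
int col = 3;
double columnSum = 0;
for (int i = colPtr[col]; i < colPtr[col + 1]; i++) {
    columnSum += values[i]; // values[i] sits at row rowIdx[i] of column 3
}
```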

So we've visited the basic storage formats for sparse matrices. Unsurprisingly they are also present in Python (SciPy), GNU Octave and hardened system maths library collections like SuiteSparse. Now for the interesting bit...

Sparse matrices are an invariable pain to deal with. As demonstrated previously, it's possible to actually increase memory usage if the matrix is insufficiently sparse for the storage format chosen. However this isn't the main gripe with these matrices.

The main problem arises from the fact that the code will execute on hardware and this hardware behaves in a certain way. Without going into the ins and outs of processor behaviour and memory hierarchy performance, the full deviousness of tasks involving sparse matrices is hard to explain. However, for now we need to bear in mind two things:

1. Processors can quickly access data so long as the next element to be used is stored in memory directly after the element presently being accessed.
2. Operations are best pipelined by making use of a mix of instructions; for maximum throughput on the floating point units, the data should be coalesced into SIMD operations (as seen in my previous post, SSE isn't that great in Java).

Now, if we take as an example the CSR format matrix "A", containing the information previously presented, and want to do a sparse matrix vector multiply operation (SpMV, as in, result:=A*x), we can use the following code:
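In outline, that kernel looks something like this (a sketch using the array names discussed below, rather than the exact OpenGamma implementation):

```java
// CSR arrays for the example matrix A given earlier
int[] colIdx = {0, 1, 4, 1, 2, 3, 0, 3, 4, 1, 2, 3, 4, 3};
int[] rowPtr = {0, 3, 6, 9, 13, 14};
double[] values = {1, 3, 12, 4, 6, 8, 2, 9, 13, 5, 7, 10, 14, 11};

double[] x = {1, 2, 3, 4, 5};
double[] result = new double[5];

// result := A * x; each row's non-zeros sit contiguously in values[]
for (int row = 0; row < result.length; row++) {
    double sum = 0;
    for (int i = rowPtr[row]; i < rowPtr[row + 1]; i++) {
        sum += values[i] * x[colIdx[i]]; // x is accessed through colIdx, not stride 1
    }
    result[row] = sum;
}
```

With the example matrix and x = {1, 2, 3, 4, 5}, this produces result = {67, 58, 103, 141, 44}.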

Performing a bit of basic analysis on this algorithm with regard to points 1) and 2) above, we can note that the vector "x" is accessed at the index dereferenced from the column index. This does not bode well for point 1): although `values[]`, `result[]`, `rowPtr[]` and `colIdx[]` are all indexed in a stride-1 fashion, "x" is not, and random data access is rather slow (pipeline stalls, lack of hardware prefetching, etc.). As for point 2), if the data cannot be fed to the floating point units fast enough, their SIMD nature cannot be exploited (ignoring Java's lack of an easy way to express this; the problem is invariant of the language).

To attempt to improve the speed of this operation, we threw the standard set of techniques at the code (loop unwinding, strip mining, etc) and even weirder attempts like trying to force prefetching via loop skewing/peeling and reducing indexing to single bytes. However, none of this could overcome the inherent slowness caused by poor locality of reference.

For our own interest we compared our OpenGamma implementations of COO and CSR SpMV with equivalent operations from:

- Colt - has sparse support and uses a **threaded** SpMV()
- Commons Math - has sparse support
- Universal Java Matrix Package (UJMP) - has sparse support
- JScience - has sparse support

The following were not used (reasons given):

- Efficient Java Matrix Library (EJML) - no support
- Jama - no support
- jblas - wrapper to native code, cheating!
- Matrix Toolkit Java (MTJ) - no support
- OjAlgo - no support
- Parallel Colt - wrapper to native code, cheating!

For the tests we used a set of increasingly large matrices with a fixed sparsity and ran essentially the same setup as before (the harness code thrashes the cache, warms up the JVM and takes multiple timings; the machine is a reasonable desktop machine). The results we obtained are below, each pair of graphs corresponding to the percentage of non-zero entries given in their titles. The horizontal axis of each graph indicates the matrix size being tested (matrices were square, n x n). The graphs in the left column present the raw timings obtained, plotted with a log10 vertical axis. The graphs on the right show the relative speeds of the sparse matrices from the tested packages against the two OpenGamma sparse matrix types: the dashed lines compare against OpenGamma COO and the solid lines against OpenGamma CSR, again with a log10 vertical axis. The plots for 25% and 50% non-zeros are missing results from the JScience implementation, as the run times were so long (running into days) that we stopped them!

As a small post-analysis: maps have the advantage that adding elements is relatively easy, whereas adding elements to CSR or COO is less so and can involve some quite expensive **memcpy()**ing (evidently we haven't tested the performance of this yet!). However, if the data structure is immutable, as it often is in numerical applications (as a result of some properties of a numerical model, i.e. element density and ordering), then operations involving the fixed sparse matrix are of more importance. Furthermore, a large number of numerical methods rely on the repeated computation of matrix vector products, for example the huge family of Krylov subspace methods, and so it is important that this kernel is fast. It is this sort of operation that was tested above, and the OpenGamma implementations perform relatively well.

It is also interesting to note that throwing threads at the problem, as done in the Colt library, does not actually increase the performance of the operation unless sufficient arithmetic density is available within the data set. This can first be seen in the n=900 and n=1000 matrices with 10% non-zeros; it is at this point that sufficient work is present for the threads to be kept busy (perhaps as a result of less false sharing or better out-of-core cache characteristics). Finally, for reference, a screenshot of jConsole (below) shows that the CPU usage is incredibly poor, and this is likely due to the poor locality of reference which inherently hinders the floating point operations (the sudden jump in the number of threads used is due to the threaded Colt library).

This leaves us in an interesting place. It turns out that solving the locality of reference problem discussed previously also assists with other problems occurring in functions such as matrix decompositions. Given that most whole-matrix operations can be rewritten in terms of a row or column permutation of the matrix (assuming infinite precision arithmetic), we can try to apply techniques to create a form more amenable to a given performance model. For example, in the case of matrix vector multiply, it would be ideal if we could find a permutation that moved the data such that contiguous blocks formed near the diagonal, so that at least some stride-1 access in the "x" vector were possible. It turns out that similar behaviour is also needed for algorithms such as sparse LU decomposition and sparse Cholesky decomposition, this time to reduce the number of fill-ins (writes to positions that were originally zero in the matrix structure): basically the same permutation idea, but with a different optimisation criterion. Amusingly, it transpires that this problem is NP-Hard. However, in large part thanks to the GPU community, considerable research has continued in this area, as any poor locality of reference problem on a CPU is considerably worse on a GPU. Furthermore, a number of highly specialised storage formats have been derived, making use of combinations of compressed sparse rows, blocks and multiply-packed indexes. These formats can also increase performance, but generally rely on permutations found by attempting to solve the NP-Hard problem, and on a sufficiently high arithmetic intensity (due to a lot of data!) such that reordering and branching costs can be amortised.

To demonstrate a possible bandwidth reduction on the previously used matrix, the permutation A(P,P) as given below tightens the matrix around the diagonal thus increasing the possibility of performance gain.

```
P = { 5 1 3 4 2 }

A(P,P) =
 0  0  0 11  0
12  1  0  0  3
13  2  0  9  0
14  0  7 10  5
 0  0  6  8  4
```
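To see the effect numerically, a quick sketch (assuming MATLAB-style gather semantics for A(P,P), i.e. B[i][j] = A[P[i]][P[j]], with P converted to 0-based indices) compares the bandwidth, the maximum |i - j| over non-zeros, before and after permuting:

```java
// Dense copy of the example matrix A and the permutation P (0-based here)
double[][] A = {
    {1, 3, 0, 0, 12},
    {0, 4, 6, 8, 0},
    {2, 0, 0, 9, 13},
    {0, 5, 7, 10, 14},
    {0, 0, 0, 11, 0}
};
int[] P = {4, 0, 2, 3, 1}; // { 5 1 3 4 2 } converted to 0-based

// B = A(P,P) under gather semantics: B[i][j] = A[P[i]][P[j]]
double[][] B = new double[5][5];
for (int i = 0; i < 5; i++) {
    for (int j = 0; j < 5; j++) {
        B[i][j] = A[P[i]][P[j]];
    }
}

// bandwidth = max |i - j| over the non-zero entries
int bandA = 0, bandB = 0;
for (int i = 0; i < 5; i++) {
    for (int j = 0; j < 5; j++) {
        if (A[i][j] != 0) bandA = Math.max(bandA, Math.abs(i - j));
        if (B[i][j] != 0) bandB = Math.max(bandB, Math.abs(i - j));
    }
}
```

Under these assumptions the bandwidth drops from 4 to 3: the non-zeros sit closer to the diagonal after permutation.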

Essentially, we reach an impasse. We are currently working on bandwidth-reduction and fill-in reduction algorithms to try and increase the performance possibilities within the sparse classes. Sparse decompositions are also a rather challenging area programmatically, partly due to the algorithmic complexity required (building permutations, computing update trees, pruning trees, updating partial matrices, etc.), and partly due to the lack of pointers in Java making life generally hard work. Still, we shall persevere, and next time I write we should hopefully have some working prototypes.

This is the developer blog of OpenGamma. For more posts by the OpenGamma team, check out our main blog.