14 October 2011 By Stuart Archibald
I'm a new face here at OpenGamma, having joined this summer, and I wanted to introduce myself and what I'm doing here. My job here can be stated simply: make the OpenGamma maths library as fast as possible!
Having come to OpenGamma with a background in applications of high performance and accelerated computing in numerical modelling, I have a fair idea about how to make code performing mathematical operations run fast. However, doing this in Java as opposed to my beloved Fortran or C is a completely different game.
First, some comments about why on earth we are doing maths in Java. Well, "write once, run everywhere", coupled with the rather magic IDE behaviour of Eclipse (although I still prefer vi ;-)) means that the code written can be guaranteed to run without having experts in cross-platform compiling around to deal with any issues. The Eclipse editor also makes unit testing, debugging and general code writing considerably less error prone. Java also has the ability to call out to native code via JNI or JNA, both of which can be used in the future, should the maths library interfaces be written in a sane manner conducive to their integration.
There are a number of well-established maths libraries for use in Java. Two of the most commonly used are Colt, and Apache Commons Mathematics, and it is these libraries that we currently use in-house. Having briefly looked through the source code, there are some features/methods that are lacking (for example, more advanced gradient-based linear system solvers) and experience suggests that the performance (in terms of speed) of these libraries is not particularly good in comparison to native code (expected) or more heavily tuned Java code (as will be seen later). On the positive side, both libraries are tried and tested and have different, but relatively easy to use, APIs.
At OpenGamma speedy maths is essential. Financial analytics uses a wide range of numerical techniques from very simple models and operations (basic integration and curve fitting) through to the considerably heavier traditional numerical modelling techniques. Clearly these models have to run in near real time or at least fast enough that their execution completes between ticks so that new analytics can be displayed soon after data arrives down the wire. Obviously, to achieve this, speed is key.
A large number of mathematical operations are essentially matrix related and in acknowledgement of this, the industry standard BLAS APIs are quite wonderful as building blocks. Furthermore, evidence of the importance of this API comes from the chip vendor specific implementations of BLAS (along with extensions to sparse matrices and FFTs). Sadly, there is no complete and heavily optimised BLAS library for Java. This is where OpenGamma comes in.
In the last few months I have been working on identifying code patterns and techniques that the JVM can spot for optimisation. In general, these techniques are relatively simple and already well known, and are extensively used in the high performance scientific computing world. I've been concentrating on BLAS operations, specifically DGEMV() with alpha = 1 and beta = 0 such that the operation is essentially:
where y is a [m x 1] vector of doubles, A is a [m x n] matrix of doubles and x is a [n x 1] vector of doubles.
Following this, a naive implementation of the operation can be written, and it is this version that appears in many Java linear algebra libraries:
So, from here we applied a whole load of optimisation magic and we've outlined this in a paper that you can download from here. Have a read; most optimisations aren't too hard to do and we found out some interesting things about the JVM.
In general, it is nice to know that it is possible to improve the speed of code, and indeed knowing what happens in the metal is worthwhile. All the results relate to the computation
Y:=A*x for square matrix A of varying size. Implementation details can be found in the previously mentioned paper.
The purpose of this investigation was to find out if it was possible to improve on naive implementations of a simple BLAS operation, and to provide comparison to the Colt and Apache Commons Mathematics libraries. Results from racing OpenGamma code operating on fairly typical matrix sizes, against a) a naive
double backed matrix implementation b) Colt library c) Apache Commons Mathematics library are displayed below.
From this graph we can see that OpenGamma code runs on average at least 1.6x as fast as Apache Commons Mathematics. Furthermore, the OpenGamma code consistently runs at over 2x the speed of the Colt library code and peaks at around 2.4x. These comparisons demonstrate what is possible if you push Java hard and they give a hint of what is to come... watch this space, maths is about to get fast!
Based on a number of discussions on mailing lists and websites, we wanted to provide some common questions and answers.
Question: Why is OpenGamma not just using XYZ that already exists to provide BLAS?
Answer: In al cases we have found so far either the adaptability, the possibility of interfacing with native libraries, or the performance has been insufficient for what we want/need, hence us writing our own.
Question: Why not just use the netlib Java interface to native BLAS or the netlib f2j translation of BLAS?
Answer: We are fully aware of this project. To use a native BLAS, the calls have to go through the Java Native Interface (JNI). All calls through the JNI incur a penalty from crossing the JVM->native barrier and there is uncertainty over the cost of this jump. Clearly, for some BLAS1 (and perhaps some BLAS2/3) calls on smaller data sets, the cost of accessing native libraries will be prohibitive. Therefore, the interface would need tuning to be performance optimal, on top of resolving and difficulties with running native libraries alongside Java. As to using the f2j translation of BLAS or the higher level netlib-java API (MTJ), we really want to have something that performs access directly on native types and can be performance tuned and don't feel that MTJ offers this opportunity.
Question: Why not just write in native code/use a native BLAS?
Answer: We want to keep everything in Java for now; by doing this users can just download one package to get a system that JustWorks(TM). We are writing our library in such a way that jumping out to a tuned BLAS library when it becomes efficient to do so (in terms of setup time/user experience) will not be a problem.
Question: Why do I get different performance results on my machine?
Answer: Different hardware+JVM combinations are going to produce different performance results. The specification of the hardware+JVM we used is given in the above paper. The hardware we used is a fairly powerful number cruncher and so may well give better results for maths performance. Also, factors like how "warm" the JIT is, how much garbage collection goes on and how much heap is allocated will also affect the performance.
Question: These optimizations are really basic. So what?!
Answer: Correct, they are basic! It doesn't mean that they don't work though. Part of the investigation was to see which optimisations were available by default within the JIT and what could be coaxed out of the JIT with a bit of effort. We are fully aware that tuning to hardware is not a very Java like approach, but given the constraints (performance+staying in Java), it is something we are investigating and believe can be made to work.