
JNA, JNI and Raw Java Performance

25 May 2012

What was said last

In my previous blog post I wrote about sparse matrix support in the OG-Maths library and how we are working on sparse direct decompositions. This is still true, and they are still in the pipeline. Whilst writing these sparse algorithms we found that having a reference implementation of equivalent dense matrix code massively aided debugging. So we wrote dense implementations of the LU and singular value (SVD) decompositions, these being archetypes of the two fundamental decomposition strategies: direct permutation and orthogonal transformation.

Today's questions...

All went well until it came to the question of how we test the code, and the SVD results in particular. The Colt and Apache Commons Math SVDs often gave different answers (in cases where numerical issues should not have been a factor). Clearly, both couldn't be right! So we asked the question: how can we test our SVD code? Mathematically speaking, we can do some testing by checking against known results, but that doesn't really catch the edge cases where the more difficult problems invariably occur. So we asked another question: could we compare against the work-hardened LAPACK/BLAS system libraries, and if so, how?

It is these questions that form the basis for an article tuple, of which this post is the first.

So we face the problem of wanting results from decent implementations of LAPACK/BLAS obtained through Java calls, which clearly screams for native interfacing via the Java Native Interface (JNI). The following question was then asked: what is the penalty of jumping through the JNI to native code for the type of calls we are making? We therefore set up a simple test on the standard y:=Ax BLAS2 operation (see this post for details), called via the following: JNA, JNI non-critical access, JNI critical access, Java and, as a reference, direct calls from C code. Within this test we were looking for two things: ease of use (and hence implied speed of development), and raw performance in terms of the time taken to perform the call.
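For reference, the BLAS2 operation being timed is the matrix-vector product y := Ax. A naive Java sketch of it is shown below; this is purely illustrative (row-major storage is assumed for simplicity) and is not the OGMaths kernel used in the tests.

    // Naive y := A*x for an m-by-n matrix A held as a flat row-major array.
    // Purely illustrative; not the OGMaths DGEMV implementation.
    static void dgemv(double[] aMatrix, double[] x, double[] y, int m, int n) {
        for (int i = 0; i < m; i++) {
            double acc = 0.;
            for (int j = 0; j < n; j++) {
                acc += aMatrix[i * n + j] * x[j];
            }
            y[i] = acc;
        }
    }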

A little bit about each method:

  1. JNA, Java Native Access, is a massively useful, very easy-to-use interface to native libraries. The SLOC count and developer effort required to use it are really quite small; however, this ease appears to come with a time penalty. Data pushed through JNA calls appears to be heavily marshalled, and as a result a lot of work would need to be done on the native side to amortise the cost of passing the data.
  2. JNI, the Java Native Interface, is the classic route to native code. Within it there are a couple of ways for native code to grab data held by the JVM.
    • The first is via the nativePtr = (*env)->Get<PrimitiveType>ArrayElements(); family of calls (for example GetDoubleArrayElements()), which may or may not make a copy of the data requested from the JVM (in practice they pretty much always do) and return a pointer to that copy as opposed to the version the JVM holds. Obviously, working on a copy of the data is thread safe until the time comes to publish the results of the native code (if applicable) back to the JVM memory space, at which point the native memory clobbers whatever the JVM holds in the equivalent location.
    • The second JNI method is via the critical family of calls, (*env)->GetPrimitiveArrayCritical();, which potentially allows direct access to the data held by the JVM. There are two caveats to using these calls: 1) the access is only potentially direct, a copy might be returned, and definitely will be if type conversion is needed; and 2) the critical state cannot be held or persisted for "an extended period of time" (more info here) without calling the release functions. This method is only available if the OS on which the JVM is running supports memory pinning (most do!).

Set up

We use the same harness and system as described in this post and test the y:=Ax operation again. We ignore column- vs. row-major storage differences to ensure the level of work is the same, and we aren't really bothered about the actual answer, but rather the time and effort taken to get it. Therefore all data is assumed to be laid out in a manner that is conducive to just running the DGEMV call. We chose the following combinations of calling method and BLAS implementation (Netlib ATLAS and Intel MKL):

Calling code language   Calling method            BLAS supplied by
Java                    Java                      OGMaths BLAS
Java                    JNA                       Netlib Fedora 16 default build (ATLAS)
Java                    JNI non-critical calls    Netlib Fedora 16 default build (ATLAS)
Java                    JNI critical calls        Netlib Fedora 16 default build (ATLAS)
C                       Direct                    Netlib Fedora 16 default build (ATLAS)
Java                    JNI non-critical calls    Intel MKL 10.3
Java                    JNI critical calls        Intel MKL 10.3
C                       Direct                    Intel MKL 10.3

Note: the expected call from JNA to Intel's MKL could not be made to work despite numerous efforts involving LD_PRELOAD, LD_LIBRARY_PATH, -Djava.library.path and -Djna.library.path. Static linking and loading were also attempted and failed. I have little doubt we will get this to work one day, but given the time cost involved we just wanted some results and so abandoned it. Nor do we consider that the lack of this result detracts in any way from our findings, as they are replicated in full using the Fedora 16 ATLAS BLAS.

Code

One of the criteria for choosing the right method of making the jump to native code was the ease of coding it. JNA wins outright here: native libraries can be interfaced and handled rather cleanly with little effort.

JNA example

We first create an interface that extends the JNA Library class; for clarity we've missed out the try{}catch{} on failure to load.
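A minimal sketch of such an interface is below; the library name ("lapack") and the use of JNA's by-reference wrapper classes for the scalar Fortran arguments are assumptions made for illustration.

    import com.sun.jna.Library;
    import com.sun.jna.Native;
    import com.sun.jna.ptr.DoubleByReference;
    import com.sun.jna.ptr.IntByReference;

    public interface LAPACKLibrary extends Library {
        // Bind to the system BLAS/LAPACK; "lapack" is an assumed library name.
        LAPACKLibrary INSTANCE = (LAPACKLibrary) Native.loadLibrary("lapack", LAPACKLibrary.class);

        // Fortran DGEMV, y := alpha*op(A)*x + beta*y; every argument is passed by reference.
        void dgemv_(String trans, IntByReference m, IntByReference n,
                    DoubleByReference alpha, double[] a, IntByReference lda,
                    double[] x, IntByReference incx,
                    DoubleByReference beta, double[] y, IntByReference incy);
    }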

Then the call is made as

LAPACKLibrary.INSTANCE.dgemv_(...);

which is fantastically easy; there are even classes to aid with getting the right pointer types and mangling them coherently! Obviously the underscoring notation of the library needs to be accounted for, but that can be dealt with dynamically.
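Purely for illustration, a full call using the wrapper classes mentioned above might look like the following (the dimensions and data are arbitrary, and the interface sketched earlier is assumed):

    int m = 4, n = 3;
    double[] a = new double[m * n];   // column-major m-by-n matrix
    double[] x = new double[n];
    double[] y = new double[m];
    // y := 1.0 * A * x + 0.0 * y
    LAPACKLibrary.INSTANCE.dgemv_("N", new IntByReference(m), new IntByReference(n),
            new DoubleByReference(1.0), a, new IntByReference(m),
            x, new IntByReference(1), new DoubleByReference(0.0),
            y, new IntByReference(1));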

JNI example

On the other hand, calls through the JNI require considerably more effort than JNA. First, we write a wrapper to our native library call, for example:

    public static native void dgemv_critical_(...);

from which the header containing the corresponding signatures can be generated with the javah binary. We then have to implement the generated, name-mangled headers as C functions; in a very cut-down, pseudo-code-like form, each implementation grabs the Java arrays from the JVM (via the critical or non-critical calls described above), invokes dgemv_ from the underlying BLAS, and then releases the arrays so that any copies are published back.

We then compile this code into a shared library and write a static initialiser in our Java class to load the native library wrapper we just created.
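A minimal sketch, in which the wrapper library name ("ogblasjni") is an assumption made for illustration, is:

    static {
        // Load libogblasjni.so (name assumed) from java.library.path when the
        // wrapper class is first initialised.
        System.loadLibrary("ogblasjni");
    }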

Then, if we've done everything correctly, and the system library and java class paths are correct, we should just have to write the call to the function name in our wrapper library from the Java code:

dgemv_critical_(...);

This is evidently a lot more effort than using the JNA. However, in the case of LAPACK and BLAS it's likely the code could mostly be auto-generated (in fact, this is exactly what we do!).

Results

Running the same tests as in this post we obtain the following performance-related results:

Speed Comparison

The calls labelled "safe" are the JNI non-critical calls; those labelled "unsafe" are the JNI critical calls.

What do we learn from this graph?

  • That, as expected, Intel's MKL outperforms ATLAS BLAS.
  • That our OGMaths Java implementation of the DGEMV call is about as fast as ATLAS BLAS.
  • That critical calls to native code are considerably faster than non-critical calls.
  • That the cost of the non-critical calls appears to be proportional to the data size (no surprise given the memcpy()).
  • That JNA is a considerably slower method of reaching native libraries.

Note: it is nice that our DGEMV and ATLAS's run at approximately the same speed. This comes down to the quality of the generated instruction stream. For the heavy-lifting part, our Java code runs a set of register-rotated scalar SSE instructions. After running objdump on the ATLAS library, it appears that, in this case, the ATLAS code has a similar instruction blend (register-rotated scalar SSE), hence the similar speed.

What can we infer?

  • For ease of use, JNA wins. However, the compute cost on the native side would have to be considerable in proportion to the data set size to make it viable for use in terms of performance.
  • Critical calls are on the whole almost as fast as passing native pointers in terms of performance; this is a massive deal for us.
  • Writing JNI functions and interfaces in a safe and clean manner is hard and requires a lot of effort. For this reason it should only be undertaken if performance is critical, as JNA provides a considerably more convenient interface.

Conclusion

We confirmed what we expected: safe calls from Java to native libraries are expensive but, if the calls are made in a certain way, the cost is absolutely minimal. This means that the raw power of something like Intel's MKL can be harnessed from Java at very little overhead cost if systems require the performance. Finally, we can conclude that at OpenGamma we need unit tests that compare against dependable results, so calling native libraries is going to have to happen. My next post will be about the ways we test our maths libraries.

This is the developer blog of OpenGamma. For more posts by the OpenGamma team, check out our main blog.

About the Author

Stuart is a Member of Technical Staff at OpenGamma.
