Wednesday, 29 January 2014

Linux Alternatives and Oracle Java

If, like me, you prefer to run the Oracle version of Java on your Linux machine as the default JDK, you will often find that the Linux distro has other ideas.  Fedora, for example, ships a number of Java-based applications as part of the distribution, which brings in a dependency on OpenJDK.  When the distro installs OpenJDK it will generally be set up as the default for executing the various Java binaries (e.g. 'java', 'javac').  However, the team at Red Hat built a system called alternatives, which maintains a set of symbolic links that allows the user to switch between multiple implementations of a package that support the same functionality.  I've managed to understand enough about the alternatives package that I can now easily switch between the Oracle JDK and the OpenJDK.
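As a rough sketch of how that switch works on Fedora (the Oracle JDK path below is an assumption; adjust it to wherever your JDK is actually installed), you register the Oracle binaries with alternatives and then select them:

# Register the Oracle JDK binaries; the priority (20000) just needs to
# be higher than the OpenJDK entry's.
sudo alternatives --install /usr/bin/java java /usr/java/jdk1.7.0_51/bin/java 20000
sudo alternatives --install /usr/bin/javac javac /usr/java/jdk1.7.0_51/bin/javac 20000

# Interactively choose which registered implementation 'java' points to.
sudo alternatives --config java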

Sunday, 15 December 2013

An Alternative Multi-Producer Approach

Recently on InfoQ, Aliaksei Papou posted an article on some of his experiments with high-performance interchange of messages between threads.  There were a number of examples within the article, but I am going to focus on the multi-producer case.  One of the optimisations the article showed was that if you know the number of producers at initialisation time, you can build a structure that significantly reduces contention.  The existing MultiProducerSequencer does not have this constraint, which is essential for a large number of use cases.  However, I wanted to see what we could achieve if I applied this approach to the Disruptor.
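To illustrate the general idea with a minimal sketch (this is neither the article's code nor the Disruptor's API): if each producer is given its own single-producer channel, producers never contend with one another, and only the consumer has to visit every channel.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public final class PerProducerChannels<T>
{
    private final BlockingQueue<T>[] channels;

    @SuppressWarnings("unchecked")
    public PerProducerChannels(final int numProducers, final int capacity)
    {
        channels = new BlockingQueue[numProducers];
        for (int i = 0; i < numProducers; i++)
        {
            channels[i] = new ArrayBlockingQueue<T>(capacity);
        }
    }

    // Each producer writes only to its own channel, so the producer side
    // is free of producer-to-producer contention.
    public void publish(final int producerIndex, final T value) throws InterruptedException
    {
        channels[producerIndex].put(value);
    }

    // A single consumer polls each producer's channel in turn.
    public T poll()
    {
        for (final BlockingQueue<T> channel : channels)
        {
            final T value = channel.poll();
            if (value != null)
            {
                return value;
            }
        }
        return null;
    }
}

A real implementation would use a lock-free ring per producer rather than an ArrayBlockingQueue, but the shape is the same: the producer count is baked in at construction time.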

Thursday, 10 October 2013

Fonts on Fedora Core 19

I recently upgraded my personal workstation from Ubuntu 12.04 to Fedora Core 19.  I'm very happy with the change; I prefer the standard Gnome 3 to Canonical's Unity desktop.  Probably something to do with my workstation not being a tablet.  However, one of the annoyances with Fedora Core has always been its font rendering.  It was never as nice as a default Ubuntu install.  But after a couple of days of digging I've managed to figure out the magic incantation required to get all of my fonts looking good, including in Google Chrome.
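For the curious, the kind of thing involved is a per-user fontconfig file at ~/.config/fontconfig/fonts.conf.  The snippet below is a commonly used starting point (not necessarily the exact settings I ended up with) that turns on anti-aliasing, slight hinting and subpixel rendering:

<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">
<fontconfig>
  <match target="font">
    <edit name="antialias" mode="assign"><bool>true</bool></edit>
    <edit name="hinting" mode="assign"><bool>true</bool></edit>
    <edit name="hintstyle" mode="assign"><const>hintslight</const></edit>
    <edit name="rgba" mode="assign"><const>rgb</const></edit>
    <edit name="lcdfilter" mode="assign"><const>lcddefault</const></edit>
  </match>
</fontconfig>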

Thursday, 11 April 2013

Release of Disruptor 3.0.0

I've decided that I'm a bit bored of putting a beta tag on various versions of the Disruptor, so I've decided to send forth Disruptor 3.0.0 into the world. The big challenges of this release were to clean up the code and come up with a better algorithm for handling multiple producers, and, with a bit of luck, to make it even faster. I went down a couple of dark alleys initially with this release, but have come up for air with a version that is closer to 2.x than my earlier attempts, yet still brings some nice benefits.  I had wanted to implement a few more functional tests and improve the documentation, but I could be waiting forever to get those aspects to a level where I was 100% satisfied.
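For anyone picking up 3.0.0, here is a minimal sketch of the 3.x style of wiring and publishing (the ValueEvent type and handler are made up for illustration):

import com.lmax.disruptor.EventFactory;
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class Disruptor3Example
{
    static class ValueEvent
    {
        long value;
    }

    public static void main(final String[] args) throws Exception
    {
        final ExecutorService executor = Executors.newCachedThreadPool();
        final Disruptor<ValueEvent> disruptor = new Disruptor<ValueEvent>(
            new EventFactory<ValueEvent>()
            {
                public ValueEvent newInstance()
                {
                    return new ValueEvent();
                }
            },
            1024,     // ring buffer size, must be a power of 2
            executor);

        disruptor.handleEventsWith(new EventHandler<ValueEvent>()
        {
            public void onEvent(final ValueEvent event, final long sequence, final boolean endOfBatch)
            {
                System.out.println("Saw: " + event.value);
            }
        });

        // start() returns the ring buffer to publish to.
        final RingBuffer<ValueEvent> ringBuffer = disruptor.start();

        // The low-level 3.x publishing idiom: claim, write, publish.
        final long sequence = ringBuffer.next();
        try
        {
            ringBuffer.get(sequence).value = 42L;
        }
        finally
        {
            ringBuffer.publish(sequence);
        }

        disruptor.shutdown();
        executor.shutdown();
    }
}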

Wednesday, 7 November 2012

Speaking at Tech Mesh

I'm happy to announce that I will be speaking at Tech Mesh in December. I'll be speaking about the Disruptor from two perspectives: first, looking briefly back at some of the history and motivations behind the Disruptor; then spending some time explaining the challenges of building high-performance concurrent systems (like the Disruptor) and delving into how the JVM and hardware could change to support the development of these systems.

Friday, 19 October 2012

Talk from JAX London

Last week I gave a talk on non-blocking concurrency at JAX London. Here are the slides:

The video is also available.

Thursday, 30 August 2012

Arithmetic Overflow and Intrinsics

During a recent conversation on the LJC mailing list around a proposal for adding a library to the JDK that would add support for handling integer overflow, a question arose: would the JVM be able to optimise this code to make efficient use of the hardware support for overflow detection if this functionality were implemented as a library? I made the comment that this problem is probably one best solved using intrinsics, but in the course of writing an explanation I thought it would be better explained in a blog post, so here goes...

What is an Intrinsic?

From the JVM's perspective, an intrinsic is an identifiable code pattern (typically a method) where the JVM understands the intent, such that it can be compiled to more optimal machine-specific assembly. This is really useful when you have multiple target platforms and a subset of those targets contain instructions that may not be available on the others. For example, Intel's x86 instruction set is quite rich compared to that of a RISC-type processor such as ARM.

An Example using POPCNT

One of the simplest examples of an intrinsic is the Integer.bitCount() method and its optimisation into Intel's POPCNT instruction (available on Nehalem and later), partly because it can be disabled and the effects of it not being applied are easy to observe. Let's start with some simple code that calls the Integer.bitCount() method:
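The exact class from the post isn't reproduced above, so here is a minimal stand-in: it simply calls Integer.bitCount() enough times that Hotspot will compile the code making the calls.

public class PopCntTest
{
    public static void main(final String[] args)
    {
        int total = 0;
        // Loop well past the compile threshold so the loop gets compiled.
        for (int i = 0; i < 100000; i++)
        {
            total += Integer.bitCount(i);
        }
        // Print the result so the work can't be eliminated as dead code.
        System.out.println(total);
    }
}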

The implementation of Integer.bitCount() is a reasonably complex combination of arithmetic and bit shifting that calculates the number of bits set to 1 within a given int [1].
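For reference, this is the JDK's implementation:

public static int bitCount(int i)
{
    // HD, Figure 5-2 (the Hacker's Delight algorithm, see footnote [1])
    i = i - ((i >>> 1) & 0x55555555);
    i = (i & 0x33333333) + ((i >>> 2) & 0x33333333);
    i = (i + (i >>> 4)) & 0x0f0f0f0f;
    i = i + (i >>> 8);
    i = i + (i >>> 16);
    return i & 0x3f;
}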

If we run the PopCntTest class and print out the assembler generated by Hotspot, we can see that this results in quite a large number of instructions that need to be issued to the CPU. Running the class using the following command line (-XX:-UsePopCountInstruction disables the intrinsic):

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:-UsePopCountInstruction PopCntTest

Generates the following assembly code:

Now, let's look at what happens when we allow the intrinsic optimisation; this time the command line is:

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly PopCntTest

In this case the entire body of the Integer.bitCount() method and its associated ~17 assembly instructions are replaced with a single POPCNT instruction.

How it Works

Inside of Hotspot (other JVMs may work differently) a number of things happen to make this work. As Hotspot loads classes it builds an abstract syntax tree (AST) representation of the Java byte code. When executing the byte code, if the interpreter notices that a particular method has been called a certain number of times [2] (the default is 10000), Hotspot will look to optimise and JIT that method. Before optimising, the method signature is matched against the set of predefined intrinsics, declared in vmSymbols.hpp. If there is a match, Hotspot will replace the nodes in the AST with a set of nodes specific to the intrinsic that was matched. At some later point during the compile pass over the AST, it will see the new nodes and generate the optimised machine-specific assembly for that part of the tree and type of node.

Does it affect performance?

This can be tested with a simple benchmark. I've used Google's Caliper micro-benchmarking framework with 2 test methods. The first uses the built-in Integer.bitCount(); the second uses my own implementation of the bit count. My implementation copies the Java implementation, but puts it in a method whose signature will not match the JVM-defined intrinsic. Code:
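The original benchmark code isn't reproduced above; the sketch below shows the shape of it, assuming the old Caliper SimpleBenchmark API in which benchmark methods take a repetition count:

import com.google.caliper.Runner;
import com.google.caliper.SimpleBenchmark;

public class BitCountBenchmark extends SimpleBenchmark
{
    public int timeIntegerBitCount(final int reps)
    {
        int total = 0;
        for (int i = 0; i < reps; i++)
        {
            total += Integer.bitCount(i);
        }
        return total; // returned so the JIT can't treat the loop as dead code
    }

    public int timeMyBitCount(final int reps)
    {
        int total = 0;
        for (int i = 0; i < reps; i++)
        {
            total += myBitCount(i);
        }
        return total;
    }

    // Identical code to Integer.bitCount(), but in a method whose
    // signature will not match the JVM's intrinsic table.
    private static int myBitCount(int i)
    {
        i = i - ((i >>> 1) & 0x55555555);
        i = (i & 0x33333333) + ((i >>> 2) & 0x33333333);
        i = (i + (i >>> 4)) & 0x0f0f0f0f;
        i = i + (i >>> 8);
        i = i + (i >>> 16);
        return i & 0x3f;
    }

    public static void main(final String[] args)
    {
        Runner.main(BitCountBenchmark.class, args);
    }
}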

Results:

      benchmark    ns linear runtime
IntegerBitCount 0.571 ====
     MyBitCount 3.438 ==============================

Fairly conclusive: the built-in Integer.bitCount() is 6-7 times faster than the version that is not optimised with an intrinsic. Hotspot supports a fair number of intrinsics for a variety of operations in the JDK libraries. Need to reorder the bytes in an int? Integer.reverseBytes() will compile down to a BSWAP instruction on Intel. This is also the mechanism by which AtomicInteger.compareAndSet() becomes a LOCK CMPXCHG instruction. When you combine intrinsics with Hotspot's daddy optimisation, inlining, it can produce some fairly optimal machine code [3].

Back to Overflow

So how does this relate to the overflow question that turned up on the LJC list? I think (there may be subtleties that I'm missing) that if overflow checking were implemented as a JDK library function, it would be straightforward for the Hotspot (and other JVM) teams to implement an optimisation that allows overflow to be detected and raised by the hardware, running at "close to the metal" speeds. There are a number of good reasons why this approach is preferable. 1) It's easy to implement: adding a library feature is much easier than changing the language or the JVM. 2) Optimisation can come later: developers can start testing the functional behaviour of the code straight away. 3) Being library code, it is much easier to support the feature in alternative JVMs and to create back ports for older JVMs. Those whose primary concern is not performance get something that is functionally correct; when performance becomes a priority, they simply upgrade to the appropriate JVM.
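To make that concrete, a library version of overflow-checked addition could be as simple as the sketch below (the method name is illustrative, not an existing JDK API). An intrinsic-aware JVM could compile the whole test down to an add followed by a jump-on-overflow (ADD + JO on x86), rather than the explicit sign-bit arithmetic.

public static int addExact(final int x, final int y)
{
    final int result = x + y;
    // Overflow occurred iff both operands differ in sign from the result
    // (a standard Hacker's Delight style check).
    if (((x ^ result) & (y ^ result)) < 0)
    {
        throw new ArithmeticException("integer overflow");
    }
    return result;
}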

  [1] The implementation hails from Hacker's Delight by Henry S. Warren.
  [2] Actually I'm over-simplifying here; with tiered compilation this decision is a little more complex.
  [3] Intrinsics and inlining aren't the only optimisations that Hotspot can perform; there is a huge host of others that help to produce fast code.