Wednesday 7 November 2012

Speaking at Tech Mesh

I'm happy to announce that I will be speaking at Tech Mesh in December. I'll be speaking about the Disruptor from two perspectives: first, looking briefly back at some of the history and motivations behind the Disruptor; then spending some time explaining the challenges of building high-performance concurrent systems (like the Disruptor) and delving into how the JVM and hardware could change to support the development of these systems.

Friday 19 October 2012

Talk from JAX London

Last week I gave a talk on non-blocking concurrency at JAX London. Here are the slides:

The video is also available

Thursday 30 August 2012

Arithmetic Overflow and Intrinsics

During a recent conversation on the LJC mailing list, around a proposal for adding a library to the JDK that would add support for handling integer overflow, a question arose: would the JVM be able to optimise this code to make efficient use of the hardware support for overflow detection if this functionality were implemented as a library? I made the comment that this is a problem probably best solved using intrinsics, but in the course of writing an explanation I thought it would be better explained in a blog post, so here goes...

What is an Intrinsic?

From the JVM's perspective an intrinsic is an identifiable code pattern (typically a method) where the JVM understands the intent, such that it can be compiled to more optimal, machine-specific assembly. This is really useful when you have multiple target platforms and a subset of those targets contain instructions that may not be available on the others. E.g. Intel's x86 instruction set is quite rich when compared to a RISC-type processor, such as ARM.

An Example using POPCNT

One of the simplest examples of an intrinsic is the Integer.bitCount() method and its optimisation into Intel's POPCNT instruction (available on Nehalem and later), partly because it can be disabled and the effects of it not being applied are easy to observe. Let's start with some simple code that calls the Integer.bitCount() method:
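
(A sketch rather than the original listing, which isn't reproduced here; the loop and the accumulation are simply there to make the call site hot enough for Hotspot to JIT it.)

public class PopCntTest
{
    public static void main(String[] args)
    {
        long total = 0;

        // Enough iterations for Hotspot to decide this is hot code and compile it.
        for (int i = 0; i < 100000000; i++)
        {
            total += Integer.bitCount(i);
        }

        System.out.println("Total bits: " + total);
    }
}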

The implementation of Integer.bitCount() is a reasonably complex combination of arithmetic and bit shifting in order to calculate the number of bits set to 1 within a given int [1].
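
For reference, the OpenJDK implementation looks roughly like this:

public static int bitCount(int i)
{
    // Branch-free "population count": sum the bits in parallel using
    // shifts, masks and adds.
    i = i - ((i >>> 1) & 0x55555555);
    i = (i & 0x33333333) + ((i >>> 2) & 0x33333333);
    i = (i + (i >>> 4)) & 0x0f0f0f0f;
    i = i + (i >>> 8);
    i = i + (i >>> 16);
    return i & 0x3f;
}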

If we run the PopCntTest class and print out the assembler generated by Hotspot, we can see that this results in quite a large number of instructions that need to be issued to the CPU. Running the class using the following command line (-XX:-UsePopCountInstruction disables the intrinsic):

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:-UsePopCountInstruction PopCntTest

Generates the following assembly code:

Now, let's look at what happens when we allow the intrinsic optimisation. This time the command line is:

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly PopCntTest

In this case the entire body of the Integer.bitCount() method and its associated ~17 assembly instructions are replaced with a single instruction.

How it Works

Inside Hotspot (other JVMs may work differently) a number of things happen to make this work. As Hotspot loads classes it builds an abstract syntax tree (AST) representation of the Java byte code. When executing the byte code, if the interpreter notices that a particular method has been called a certain number of times [2] (the default is 10,000), then Hotspot will look to optimise and JIT that method. Before optimising, the method signature is matched against the set of predefined intrinsics, declared in vmSymbols.hpp. If there is a match, Hotspot will replace the nodes in the AST with a set of nodes specific to the intrinsic that was matched. At some later point during the compile pass of the AST, it will see the new nodes and generate the optimised, machine-specific assembly for that part of the tree and type of node.

Does it affect performance?

This can be tested with a simple benchmark. I've used Google's Caliper micro-benchmarking framework with two test methods. The first uses the built-in Integer.bitCount(); the second uses my own implementation of the bit count. My implementation copies the Java implementation, but puts it in a method whose signature will not match the JVM-defined intrinsic. Code:
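
(The original listing isn't reproduced here; the sketch below follows Caliper's old SimpleBenchmark convention, where methods named timeXxx(int reps) are measured. The class name and loop bodies are my assumptions.)

import com.google.caliper.SimpleBenchmark;

public class BitCountBenchmark extends SimpleBenchmark
{
    public int timeIntegerBitCount(int reps)
    {
        int total = 0;
        for (int i = 0; i < reps; i++)
        {
            total += Integer.bitCount(i);
        }
        return total;
    }

    public int timeMyBitCount(int reps)
    {
        int total = 0;
        for (int i = 0; i < reps; i++)
        {
            total += myBitCount(i);
        }
        return total;
    }

    // The same arithmetic as Integer.bitCount(), but in a method whose
    // signature won't match the JVM's intrinsic table.
    private static int myBitCount(int i)
    {
        i = i - ((i >>> 1) & 0x55555555);
        i = (i & 0x33333333) + ((i >>> 2) & 0x33333333);
        i = (i + (i >>> 4)) & 0x0f0f0f0f;
        i = i + (i >>> 8);
        i = i + (i >>> 16);
        return i & 0x3f;
    }
}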

Results:

      benchmark    ns linear runtime
IntegerBitCount 0.571 ====
     MyBitCount 3.438 ==============================

Fairly conclusive: the built-in Integer.bitCount() is 6-7 times faster than the version that is not optimised with an intrinsic. Hotspot supports a fair number of intrinsics for a variety of operations in the JDK libraries. Need to reorder the bytes in an int? Integer.reverseBytes() will compile down to a BSWAP instruction on Intel. This is also the mechanism by which AtomicInteger.compareAndSet() becomes a LOCK CMPXCHG instruction. When you combine intrinsics with Hotspot's daddy optimisation (inlining) it can produce some fairly optimal machine code [3].

Back to Overflow

So how does this relate to the overflow discussion that turned up on the LJC list? I think (there may be subtleties that I'm missing) that if overflow checking were implemented as a JDK library function, it would be straightforward for the Hotspot (and other JVM) teams to implement an optimisation that allows overflow to be detected and raised by the hardware, running at "close to the metal" speeds. There are a number of good reasons why this approach is preferable. 1) It's easy to implement: adding a library feature is much easier than changing the language or the JVM. 2) Optimisation can come later: developers can start testing the functional behaviour of the code straight away. 3) Being library code, it is much easier to support the feature in alternative JVMs and to create back ports for older JVMs. Those whose primary concern is not performance get something that is functionally correct; when performance becomes a priority, they simply upgrade to the appropriate JVM.
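
As an illustration only (the method name and signature are hypothetical, not taken from the actual proposal), such a library method might look something like this. The pure-Java check is exactly the sort of thing an intrinsic could later replace with a check of the CPU's overflow flag after the ADD instruction:

// Hypothetical library method for checked addition; illustrative only.
public static int addExact(int x, int y)
{
    int result = x + y;

    // Overflow occurred iff both operands have the same sign and the
    // result has the opposite sign.
    if (((x ^ result) & (y ^ result)) < 0)
    {
        throw new ArithmeticException("integer overflow");
    }

    return result;
}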

  [1] The implementation hails from Hacker's Delight by Henry S. Warren.
  [2] Actually I'm oversimplifying here; with tiered compilation this decision is a little more complex.
  [3] Intrinsics and inlining aren't the only optimisations that Hotspot can perform; there is a huge host of others that help to produce fast code.

Friday 27 July 2012

Disruptor v3 - Faster Hopefully

I've been working sporadically on the next major revision of the Disruptor, but am still making steady progress. I've merged my experimental branch into the main line and I'm working on ensuring comprehensive test coverage and re-implementing the Disruptor DSL.

As a matter of course I've been running performance tests to ensure that we don't regress performance. While I've not been focusing on performance, just some refactoring and simplification, I got a nice surprise. The new version is over twice as fast for the 1P1C simple test case: approximately 85M ops/sec versus 35M ops/sec. This is on my workstation, which has an Intel(R) Xeon(R) CPU E5620 @ 2.40GHz.

Current released version (2.10.1)

[barkerm@snake disruptor-2]$ taskset -c 4-8,12-15 ant througput:single-test
througput:single-test:
    [junit] Running
com.lmax.disruptor.OnePublisherToOneProcessorUniCastThroughputTest
    [junit] Started Disruptor run 0
    [junit] Disruptor=21,141,649 ops/sec
    [junit] Started Disruptor run 1
    [junit] Disruptor=20,597,322 ops/sec
    [junit] Started Disruptor run 2
    [junit] Disruptor=33,233,632 ops/sec
    [junit] Started Disruptor run 3
    [junit] Disruptor=32,883,919 ops/sec
    [junit] Started Disruptor run 4
    [junit] Disruptor=33,852,403 ops/sec
    [junit] Started Disruptor run 5
    [junit] Disruptor=32,819,166 ops/sec
    [junit] Started Disruptor run 6

Current trunk.

[barkerm@snake disruptor]$ taskset -c 4-8,12-15 ant througput:single-test
througput:single-test:
    [junit] Running
com.lmax.disruptor.OnePublisherToOneProcessorUniCastThroughputTest
    [junit] Started Disruptor run 0
    [junit] Disruptor=23,288,309 ops/sec
    [junit] Started Disruptor run 1
    [junit] Disruptor=23,573,785 ops/sec
    [junit] Started Disruptor run 2
    [junit] Disruptor=86,805,555 ops/sec
    [junit] Started Disruptor run 3
    [junit] Disruptor=87,183,958 ops/sec
    [junit] Started Disruptor run 4
    [junit] Disruptor=86,956,521 ops/sec
    [junit] Started Disruptor run 5
    [junit] Disruptor=87,260,034 ops/sec
    [junit] Started Disruptor run 6
    [junit] Disruptor=88,261,253 ops/sec
    [junit] Started Disruptor run 7

There is still more to come. In addition to the improvements for the single-producer use case, the next version will include a new multiple-producer algorithm that will replace the two we currently have and work better than both of them in all scenarios.

Sunday 13 May 2012

Disruptor 2.10 Release

The Disruptor version 2.10 has been released. It is available from the Google Code download page and has been submitted to Maven central.

Changes

  • Removed deprecated timeout methods.
  • Added OSGi metadata to the jar file.
  • Removed PaddedAtomicLong and used Sequence in all places.
  • Fixed various generics warnings.
  • Changed the Sequence implementation to work around an IBM JDK bug and improve performance by ~10%.
  • Added a remainingCapacity() call to the Sequencer class.

Tuesday 3 April 2012

Interview on the Distributed Podcast

Last month the guys from the Distributed Podcast interviewed me about LMAX and the Disruptor. Many thanks to Jonathan and Rinat, I had a great time.

Friday 10 February 2012

Slides From Recent Presentations

My slides from JAX London - Beginner's Guide to Hardcore Concurrency:

Video: LJC@Playfish, JAX London

My slides from Devoxx - Disruptor Tools In Action:

Video: Devoxx (Payment required)

Sunday 8 January 2012

Building a CPU Topology on MacOS X

Within a project I've been working on, I've had the need to simulate the capabilities of Linux's /proc/cpuinfo on Mac OS. Specifically, I needed to build a topology of the CPUs on a given system, i.e. map the operating system's processors to hardware threads, then build a picture of which cores and sockets those threads reside on. For example, my Mac looks something like this:

CPU0 (Thread 0) ---+
                   |---> Core 0 ---+
CPU1 (Thread 1) ---+               |
                                   | ----> Socket 0
CPU2 (Thread 2) ---+               |
                   |---> Core 1 ---+
CPU3 (Thread 3) ---+

While this sounds very simple, it's actually fraught with a number of little niggles. Not only did it require getting down and dirty with a bit of C++ and X86 Assembly, it also required writing a MacOS kernel extension.

The first step was to understand what information is available from the CPU. Intel exposes an instruction called CPUID. This is the primary mechanism for getting information about the CPU. There is a raft of information available, from a listing of the CPU features supported (e.g. hyperthreading) to the sizes of the various levels of cache and their associated cache lines. To access the CPUID instruction we need a little bit of inline assembler.
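
Something like the following (a sketch; the original listing isn't reproduced here, and the cpuid() helper is an illustrative wrapper that returns EAX, EBX, ECX and EDX in the data array):

#include <cstdio>
#include <cstring>
#include <cstdint>

// Issue CPUID with the given leaf in EAX and sub-leaf in ECX; the results
// come back in EAX, EBX, ECX and EDX, stored into data[0..3].
static void cpuid(uint32_t leaf, uint32_t subleaf, uint32_t data[4])
{
    asm volatile("cpuid"
                 : "=a" (data[0]), "=b" (data[1]), "=c" (data[2]), "=d" (data[3])
                 : "a" (leaf), "c" (subleaf));
}

int main()
{
    uint32_t data[4];
    cpuid(0, 0, data);

    // For leaf 0 the 12-byte vendor string is returned in the order EBX, EDX, ECX.
    char vendor[13];
    memcpy(&vendor[0], &data[1], 4);
    memcpy(&vendor[4], &data[3], 4);
    memcpy(&vendor[8], &data[2], 4);
    vendor[12] = '\0';

    printf("Vendor String: %s\n", vendor);
    return 0;
}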

The code shows how to get the vendor string from the CPU. On my Mac I get the following:

// Output:
Vendor String: GenuineIntel

For those unfamiliar with Intel inline assembly: the Intel CPU defines a number of registers. The ones used for the CPUID instruction are EAX, EBX, ECX, and EDX (referenced as RAX, RBX, etc. if using 64-bit instructions via the REX extension). These are used for both input and output. An inline asm segment consists of three parts. The first part is the instruction to be executed, in this case the "cpuid" instruction. The second part defines the output parameters. The snippet "=a" (data[0]) means store the result held in the EAX register in the variable data[0]; the "a" refers to the second letter of the register designation. The third and final part is the input parameters. The CPUID instruction takes two parameters, one in EAX and one in ECX.

The particular CPUID reference that provides the information needed to build the topology is 0xB (11) - the extended topology enumeration leaf. The data returned from this instruction is:

     0                   1                   2                   3   
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
EAX | Shift |                     Reserved                          |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
EBX |   No. Process at this level   |            Reserved           |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
ECX |   Level No.   |  Level type   |            Reserved           |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
EDX |                           x2APIC ID                           |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The extended topology enumeration leaf is one of the CPUID indexes that also makes use of ECX as an input parameter. This indicates the level of the CPU that you wish to work at. I started at level 0 and worked my way up. The 'Level type' describes whether the current level is a CPU thread (1), a physical core (2) or invalid (0). All values greater than 2 are reserved, presumably for later use. The two other useful values are the x2APIC ID and the Shift. The x2APIC ID is a unique identifier for the smallest logical unit in the CPU. The Shift value is used to group x2APIC ID values into units at the next logical level up; this is done by shifting the x2APIC ID right by the value specified by Shift. For example, running the following code on my workstation (2 sockets, 8 cores, 16 threads):
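
(The listing below is a sketch rather than the original code; it walks the levels of CPUID leaf 0xB, extracting each field according to the layout shown above.)

#include <cstdio>
#include <cstdint>

static void cpuid(uint32_t leaf, uint32_t subleaf, uint32_t data[4])
{
    asm volatile("cpuid"
                 : "=a" (data[0]), "=b" (data[1]), "=c" (data[2]), "=d" (data[3])
                 : "a" (leaf), "c" (subleaf));
}

int main()
{
    for (uint32_t level = 0; ; level++)
    {
        uint32_t data[4];
        cpuid(0xB, level, data);

        uint32_t shift     = data[0] & 0x1F;         // EAX[4:0]
        uint32_t count     = data[1] & 0xFFFF;       // EBX[15:0]
        uint32_t levelId   = data[2] & 0xFF;         // ECX[7:0]
        uint32_t levelType = (data[2] >> 8) & 0xFF;  // ECX[15:8]
        uint32_t x2apicId  = data[3];                // EDX

        // A level type of 0 (invalid) means we've run out of levels.
        if (levelType == 0)
        {
            break;
        }

        printf("Shift: %u, Count: %u, Level Id: %u, Level Type: %u, x2APIC: %u, Group: %u\n",
               shift, count, levelId, levelType, x2apicId, x2apicId >> shift);
    }

    return 0;
}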

Outputs the following:

Shift: 1, Count: 2, Level Id: 0, Level Type: 1, x2APIC: 33, Group: 16
Shift: 5, Count: 8, Level Id: 1, Level Type: 2, x2APIC: 33, Group: 1

This indicates that the hardware thread identified by x2APIC ID 33 has a core id of 16 and a socket id of 1. The socket and core ids aren't really references, but values indicating that threads which share the same id share that unit. This gets me most of the way there: I can group all of my threads into a hierarchy of cores and sockets. However, there is one small niggle remaining. I need this code to run on all of the CPUs in my system. On Linux this is reasonably simple: you count the number of CPUs, then iterate through all of them, using pthread_setaffinity_np(...) to specify which CPU to run the CPUID instructions on.
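
(A sketch of that Linux approach, not one of the original listings; run_on_each_cpu() and its callback are illustrative names.)

#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // for cpu_set_t, CPU_ZERO/CPU_SET and pthread_setaffinity_np
#endif
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

// Pin the calling thread to each online CPU in turn and run the supplied
// action there; the action would issue the CPUID calls shown above.
static void run_on_each_cpu(void (*action)(int cpu))
{
    int numCpus = (int) sysconf(_SC_NPROCESSORS_ONLN);

    for (int cpu = 0; cpu < numCpus; cpu++)
    {
        cpu_set_t cpuSet;
        CPU_ZERO(&cpuSet);
        CPU_SET(cpu, &cpuSet);

        pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuSet);
        action(cpu);
    }
}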

However, on Mac OS X actual thread affinity is not supported; there is just a mechanism for logically grouping or separating threads, powerful in its own right, but not what I'm after here. This is where the kernel module comes in. The XNU kernel defines a funky little method called mp_rendezvous(...). This method takes a number of function pointers. The key one (action_func) is run once on each of the CPUs. So to get the topology information for all of the CPUs we can use it like so:
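
(Again a sketch rather than the original listing; topology_entry, collect_topology() and gather_topology() are illustrative names.)

#include <stdint.h>

// mp_rendezvous() and cpu_number() aren't declared in headers that a kext
// can easily include, so we forward-declare them ourselves; the symbols are
// resolved when the kernel extension is linked.
extern "C" void mp_rendezvous(void (*setup_func)(void *),
                              void (*action_func)(void *),
                              void (*teardown_func)(void *),
                              void *arg);
extern "C" int cpu_number(void);

struct topology_entry
{
    uint32_t cpu;
    uint32_t x2apic_id;
};

static inline void cpuid(uint32_t leaf, uint32_t subleaf, uint32_t data[4])
{
    asm volatile("cpuid"
                 : "=a" (data[0]), "=b" (data[1]), "=c" (data[2]), "=d" (data[3])
                 : "a" (leaf), "c" (subleaf));
}

// action_func: runs once on every CPU, recording that CPU's x2APIC ID in
// the slot indexed by cpu_number().
static void collect_topology(void *arg)
{
    topology_entry *entries = (topology_entry *) arg;
    int cpu = cpu_number();
    uint32_t data[4];

    cpuid(0xB, 0, data);

    entries[cpu].cpu = (uint32_t) cpu;
    entries[cpu].x2apic_id = data[3];   // EDX holds the x2APIC ID.
}

static void gather_topology(topology_entry *entries)
{
    // No setup or teardown needed; just run the action on every CPU.
    mp_rendezvous(NULL, collect_topology, NULL, entries);
}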

Because mp_rendezvous() is defined in the kernel, the code above needs to be packaged as a kernel extension. Even then, getting access to the method is interesting. It's not defined in a header that can be easily included; however, it is available at link time when compiling a kernel module. Therefore, in order to compile correctly, it's necessary to provide your own forward declaration of the method. The same is true of cpu_number(). Calling into the kernel method from user space requires use of the IOKit framework, but I'll leave the details of that as an exercise for the reader.