Inspecting OpenJ9 performance with perf on Linux – JIT Compiled Methods

In the first post of this series we learned how to profile modules within the JVM itself using perf. This is useful for diagnosing JVM-related problems. In typical applications, however, user application code (JCL included) dominates the profile, so it is often desirable to profile that code as well.

In a typical workload, past the start-up phase of the JVM, nearly all methods being executed will have been JIT compiled. Unfortunately this poses a problem for perf: the program counter (PC) addresses recorded during perf record are mapped to symbols in a post-processing step by perf report, which examines the memory map of every binary loaded into the executable’s memory space during the record step. Because JIT compiled method bodies are generated at run time and do not belong to any binary on disk, perf is unable to resolve their symbols.

Fortunately OpenJ9 and the perf tool itself provide support for both de-mangling JIT method symbols and assembly-level annotation of JIT method bodies. We’ll explore how to collect such diagnostics in this post.

A Case Study

We’ll base this case study on a vanilla Ubuntu 18.04 image. To begin we’ll need a workload to investigate. Microbenchmarking dynamic runtimes is notoriously difficult to get right. Thankfully OpenJDK has us covered with the JMH microbenchmark suite. First, let’s install Maven:

# Install maven and the default JDK
> apt-get update
> apt-get -y install maven default-jdk

Next we’ll use the JMH archetype to create the boilerplate for our benchmark:

> mvn archetype:generate -DinteractiveMode=false -DarchetypeGroupId=org.openjdk.jmh -DarchetypeArtifactId=jmh-java-benchmark-archetype -DgroupId=org.openj9 -DartifactId=indexof-jmh -Dversion=1.0

A Maven project will be created in the indexof-jmh directory, set up with all the dependencies needed to run JMH. As the name of the project suggests, our case study benchmark will exercise the java.lang.String.indexOf API. Next, let’s open up the generated indexof-jmh/src/main/java/org/openj9/MyBenchmark.java benchmark and modify it slightly to make use of the indexOf API:

// ...

package org.openj9;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@Fork(value = 1)
@Warmup(iterations = 3)
@Measurement(iterations = 3)
public class MyBenchmark {
    public static String data = "Some long string which we are going to search for the character '#'";

    @Benchmark
    public void testMethod(Blackhole blackhole) {
        int indexOfHash = data.indexOf('#');
        blackhole.consume(indexOfHash);
    }
}

We can now build the Maven project and run the JMH benchmark using the default JDK available:

> cd indexof-jmh
> mvn clean install
> java -version
openjdk version "11.0.3" 2019-04-16
OpenJDK Runtime Environment (build 11.0.3+7-Ubuntu-1ubuntu218.04.1)
OpenJDK 64-Bit Server VM (build 11.0.3+7-Ubuntu-1ubuntu218.04.1, mixed mode, sharing)

> java -jar indexof-jmh/target/benchmarks.jar
...
Benchmark                Mode  Cnt         Score         Error  Units
MyBenchmark.testMethod  thrpt    3  37004629.640 ± 1644503.443  ops/s

Measuring OpenJ9

OpenJDK Java 11 with HotSpot achieved a score of 37,004,629 ops/s on our machine. How does OpenJDK Java 11 with OpenJ9 compare? To measure we’ll use the AdoptOpenJDK API to grab the latest nightly OpenJDK Java 11 with OpenJ9:

> curl -L 'https://api.adoptopenjdk.net/v2/binary/nightly/openjdk11?openjdk_impl=openj9&os=linux&arch=x64&release=latest&type=jdk&heap_size=normal' > openj9.tar.gz
> tar xvf openj9.tar.gz

And repeat the same measurement:

> jdk-11.0.4+10/bin/java -version
openjdk version "11.0.4" 2019-07-16
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.4+10-201907151830)
Eclipse OpenJ9 VM AdoptOpenJDK (build master-9409a4266, JRE 11 Linux amd64-64-Bit Compressed References 20190715_284 (JIT enabled, AOT enabled)
OpenJ9   - 9409a4266
OMR      - 4beca561
JCL      - 411f32dbbb based on jdk-11.0.4+10)

> jdk-11.0.4+10/bin/java -jar indexof-jmh/target/benchmarks.jar
...
Benchmark                Mode  Cnt         Score          Error  Units
MyBenchmark.testMethod  thrpt    3  90224587.381 ± 41943779.975  ops/s

OpenJ9 has achieved a much higher score compared to HotSpot for our microbenchmark. But why?

Investigating with Linux perf

perf is not installed on our vanilla image so let’s install it using the package manager:

# Install perf
> apt-get update
> apt-get -y install linux-tools-generic linux-tools-`uname -r`

We can now use the perf record utility to capture a profile of our JMH microbenchmark. Hopefully this will tell us where time is being spent, which may give us a hint as to why OpenJ9 is performing so well. We will collect a profile of the cpu-cycles hardware event since we are interested in the assembly-level annotation of the profile:

> perf record -k 1 -e cycles jdk-11.0.4+10/bin/java -jar indexof-jmh/target/benchmarks.jar

perf record will have generated a perf.data file which is our profile. This is the default input to the perf report utility which we will use to view the profile:

> perf report
# Overhead  Command          Shared Object       Symbol
# ........  ...............  ..................  ......................
#
     3.58%  org.openj9.MyBe  [JIT] tid 5465      [.] 0x00007f6ad2a24192
     2.93%  org.openj9.MyBe  [JIT] tid 5465      [.] 0x00007f6ad2a2418a
     2.82%  org.openj9.MyBe  [JIT] tid 5465      [.] 0x00007f6ad2a24199
     1.76%  org.openj9.MyBe  [JIT] tid 5465      [.] 0x00007f6ad2a252cd
     1.65%  org.openj9.MyBe  [JIT] tid 5465      [.] 0x00007f6ad2a2418e
     1.58%  org.openj9.MyBe  [JIT] tid 5465      [.] 0x00007f6ad2a25480
     1.54%  org.openj9.MyBe  [JIT] tid 5465      [.] 0x00007f6ad2a1ce97
     1.54%  org.openj9.MyBe  [JIT] tid 5465      [.] 0x00007f6ad2a25553
...
...
     0.09%  org.openj9.MyBe  perf-10944.map      [.] 0x00007f3bcd81d6cb                                                         
     0.09%  org.openj9.MyBe  perf-10944.map      [.] 0x00007f3bcd81bce1                                                         
     0.09%  Thread-6         libj9vm29.so        [.] VM_BytecodeInterpreter::run                                                
     0.09%  org.openj9.MyBe  perf-10944.map      [.] 0x00007f3bcd81dfd4                                                         
     0.09%  org.openj9.MyBe  perf-10944.map      [.] 0x00007f3bcd81dd21

Why are we not seeing valid symbol names? This is expected. As described earlier, perf is unable to map the profile ticks to symbols because of the JIT compiler. Note that the distances between the “symbols” in the perf report are quite small, sometimes only a few bytes apart. This is because perf report cannot group the profile ticks under one particular symbol (a method or function), so it cannot aggregate them. The “symbols” we are seeing here are actually instruction addresses within the JIT method bodies.

To fix this, perf provides built-in support for mapping JIT method symbols to address ranges. It does this by looking for a textual file /tmp/perf-<pid>.map which it uses to map addresses to symbols. The OpenJ9 JIT compiler provides the -Xjit:perfTool option, which generates such a file for perf report to use. Let’s try again:

> perf record -k 1 -e cycles jdk-11.0.4+10/bin/java -Xjit:perfTool -jar indexof-jmh/target/benchmarks.jar
> perf report
# Overhead  Command          Shared Object       Symbol
# ........  ...............  ..................  ......................
#
    31.63%  org.openj9.MyBe  perf-12610.map      [.] org/openj9/generated/MyBenchmark_testMethod_jmhTest.testMethod_thrpt_jmhStub(Lorg/openjdk/jmh/runner/InfraControl;Lorg/openjdk/jmh/results  
    15.75%  org.openj9.MyBe  perf-12610.map      [.] java/io/ObjectOutputStream$HandleTable.lookup(Ljava/lang/Object;)I_warm
    15.60%  org.openj9.MyBe  perf-12610.map      [.] org/openj9/generated/MyBenchmark_testMethod_jmhTest.testMethod_thrpt_jmhStub(Lorg/openjdk/jmh/runner/InfraControl;Lorg/openjdk/jmh/results  
    13.86%  org.openj9.MyBe  perf-12610.map      [.] java/net/PlainSocketImpl.socketClose0(Z)V_cold
     9.92%  org.openj9.MyBe  perf-12610.map      [.] java/lang/String.indexOf(II)I_warm
     3.46%  org.openj9.MyBe  perf-12610.map      [.] org/openj9/generated/MyBenchmark_testMethod_jmhTest.testMethod_thrpt_jmhStub(Lorg/openjdk/jmh/runner/InfraControl;Lorg/openjdk/jmh/results   
     1.26%  org.openj9.MyBe  perf-12610.map      [.] org/openjdk/jmh/infra/Blackhole.consume(I)V_warm
     0.89%  org.openj9.MyBe  perf-12610.map      [.] java/lang/String.indexOf(II)I_cold
     0.39%  main             libj9vm29.so        [.] VM_BytecodeInterpreter::run

Much better! Note that the symbols have a suffix “_cold”, “_warm”, etc. These represent the JIT compiler optimization level at which each method was compiled. Further note that the top method symbol in the profile is seen several times. This is because the JIT compiler has recompiled this method several times to improve performance.
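For reference, the /tmp/perf-<pid>.map file that -Xjit:perfTool generates has a simple textual format, documented in the perf source tree: each line holds a start address and a code size in hexadecimal, followed by the symbol name. The addresses and sizes below are illustrative, not taken from a real run:

```
7f6ad2a24000 2f0 java/lang/String.indexOf(II)I_warm
7f6ad2a25300 1a8 org/openjdk/jmh/infra/Blackhole.consume(I)V_warm
```

perf report resolves each recorded instruction address by finding the line whose [start, start + size) range contains it.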

Selecting the top method in the profile using the keyboard arrow keys and pressing the a hotkey will attempt to display the assembly annotation for the selected symbol. Because of the nature of the JIT, perf cannot locate the binary which corresponds to the symbol and is thus unable to annotate the hot spots within the assembly of the JIT compiled method.

Building the Latest perf From Source

As of perf version 4.0.0-rc6, the folks over at Google have contributed a patch to perf that supports assembly-annotated profiling of JIT compiled code via a built-in JVMTI agent, enabling profiling of the JVM out of the box. Unfortunately many Linux distributions do not ship the compiled JVMTI agent, so we will have to build perf from source:

# Install dependencies needed to build perf and the JVMTI agent
> apt-get -y install build-essential make gcc g++ flex bison libelf-dev libdw-dev libslang2-dev libssl-dev libiberty-dev default-jdk

# Clone the latest Linux kernel source (perf is included in the tools directory)
> git clone --depth 1 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

# Build perf
> cd linux/tools/perf
> make

The build process builds the perf JVMTI agent against the default JVM on the PATH and places it at linux/tools/perf/libperf-jvmti.so. During the perf record step this JVMTI agent queries the JVM, extracts the symbol and binary information of JIT method bodies, and stores it into the profile. A post-processing step then transforms this data so that the JIT method symbols look like any other native function, which enables us to view annotations of the JIT compiled methods.

Assembly Annotation of JIT Compiled Methods

We can now repeat the perf record step using the perf we just built and add the JVMTI agent to our java command:

> ./linux/tools/perf/perf record -k 1 -e cycles jdk-11.0.4+10/bin/java -agentpath:`pwd`/linux/tools/perf/libperf-jvmti.so -jar indexof-jmh/target/benchmarks.jar

Note that the -Xjit:perfTool option is no longer needed as the symbol mapping is taken care of by the JVMTI agent. We now need to post-process the profile to map the JIT symbols to assembly annotations:

> ./linux/tools/perf/perf inject -i perf.data --jit -o perf.data.jitted

This creates a new perf.data.jitted profile with the injected JIT symbols, which we can view by passing -i to perf report:

> ./linux/tools/perf/perf report -i perf.data.jitted

Again, select the top method using the arrow keys and press the a hotkey to annotate it. The disassembly of the hot spots within the JIT compiled method will be shown. Pressing the H hotkey navigates to the hottest instruction within the method, which appears to be in a tight loop:

  7.89 │ 1dc:   vmovdqu   0x8(%rsi,%rbx,2),%xmm0
  8.73 │        vpcmpeqw  %xmm1,%xmm0,%xmm0
 14.00 │        vpmovmskb %xmm0,%edx
 10.53 │        test      %edx,%edx
  0.01 │      ↓ jne       1f5
  7.80 │        add       $0x8,%ebx
  7.60 │        cmp       %eax,%ebx
  0.01 │      ↑ jl        1dc

Indeed, vpmovmskb is an AVX/SSE instruction the JIT has used to implement the core logic of the indexOf operation, which explains why OpenJ9 achieves such high throughput. Digging deeper into the OpenJ9 repository, we find this acceleration was implemented on x86 as part of eclipse/openj9#1681.
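To see what the vectorization buys us, here is a simplified scalar sketch of the equivalent character search (a hypothetical illustration, not OpenJ9’s actual implementation). Where this loop compares one char per iteration, the vectorized loop above compares eight 16-bit chars at a time:

```java
public class ScalarIndexOf {
    // Scalar equivalent of the hot loop: compare one char per iteration.
    // The JIT's AVX/SSE version instead loads 16 bytes (vmovdqu), compares
    // all eight chars against the target at once (vpcmpeqw), and extracts
    // a match bitmask (vpmovmskb) that is tested with a single branch.
    static int indexOf(String s, char target) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == target) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String data = "Some long string which we are going to search for the character '#'";
        // Agrees with data.indexOf('#')
        System.out.println(indexOf(data, '#'));
    }
}
```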

The perf JVMTI agent is spec. compliant, so the same assembly annotation approach will work on other spec. compliant JVMs such as HotSpot. As regular contributors to the JIT component of the OpenJ9 project, we deliberately avoid looking at the disassembly of JIT methods from other JVMs, to keep the competition healthy. I therefore leave it as an exercise to the reader to annotate the HotSpot JVM using the same approach and check whether it makes use of AVX/SSE instructions to accelerate the indexOf API.

Conclusion

In this post we learned some of the caveats of profiling dynamic runtimes, such as the JVM, and how to overcome them using the features provided alongside the Linux perf tool. We successfully used assembly annotations to identify the reason for an unexpected performance difference between the HotSpot and OpenJ9 JVMs. The same method described in this post can be used to diagnose any JIT related performance issue.
