In the first post of this series we learned how to profile modules within the JVM itself using perf. This is useful for diagnosing JVM related problems. However, in typical applications the user application code (JCL included) will dominate the profile, so it is often desirable to profile the application code as well.
In a typical workload, past the start-up phase of the JVM, nearly all methods being executed will have been JIT compiled. Unfortunately this poses a problem for perf, because the program counter (PC) addresses recorded during perf record are only mapped to symbols in a post-processing step by perf report, which examines the memory map of every binary loaded into the executable’s memory space during the record step. Because of this post-processing step and the nature of the JIT compiler, perf is unable to find the binaries corresponding to JIT method symbols, since those methods were generated just-in-time.
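To make the problem concrete: perf resolves a sampled address by finding the memory mapping that contains it and then looking up the offset in the symbol table of the backing binary. JIT compiled code, however, lives in anonymous executable mappings that have no backing file at all. The snippet below is purely illustrative (the addresses, inode, and paths are examples, not output to expect verbatim), but it shows the difference:
# Illustrative only: a file-backed mapping vs. the anonymous JIT code cache
> cat /proc/<java-pid>/maps | grep -E 'libj9jit|rwxp'
7f6acd000000-7f6acd8a0000 r-xp 00000000 08:01 1234567   /path/to/jdk-11.0.4+10/lib/default/libj9jit29.so
7f6ad2a00000-7f6ad2b00000 rwxp 00000000 00:00 0
Samples landing in the first mapping can be attributed to functions in libj9jit29.so; samples landing in the second have no symbol table to consult, which is exactly the problem we will run into below.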
Fortunately OpenJ9 and the perf tool itself provide support for both de-mangling of JIT method symbols and assembly level annotation of JIT method bodies. We’ll explore how to collect such diagnostics in this post.
A Case Study
We’ll be basing this case study on a vanilla Ubuntu 18.04 image. To begin we need a workload to investigate. Microbenchmarking dynamic runtimes is notoriously difficult to get right. Thankfully OpenJDK has us covered with the JMH microbenchmark harness. First let’s install maven:
# Install maven and the default JDK
> apt-get update
> apt-get -y install maven default-jdk
Next we’ll use the JMH archetype to create the boilerplate for our benchmark:
> mvn archetype:generate -DinteractiveMode=false -DarchetypeGroupId=org.openjdk.jmh -DarchetypeArtifactId=jmh-java-benchmark-archetype -DgroupId=org.openj9 -DartifactId=indexof-jmh -Dversion=1.0
A Maven project will be created in the indexof-jmh directory, set up with all the dependencies needed to run JMH. As the name of the project suggests, our case study benchmark will be exercising the java.lang.String.indexOf API. Next, let’s open the generated indexof-jmh/src/main/java/org/openj9/MyBenchmark.java benchmark and modify it slightly to make use of the indexOf API:
// ...
package org.openj9;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@Fork(value = 1)             // run the benchmark in a single forked JVM
@Warmup(iterations = 3)      // warmup iterations give the JIT time to compile the hot path
@Measurement(iterations = 3) // measured iterations used to compute the score
public class MyBenchmark {

    public static String data = "Some long string which we are going to search for the character '#'";

    @Benchmark
    public void testMethod(Blackhole blackhole) {
        int indexOfHash = data.indexOf('#');

        // Consuming the result prevents the JIT from eliminating the indexOf call as dead code
        blackhole.consume(indexOfHash);
    }
}
We can now build the Maven project and run the JMH benchmark using the default JDK available:
> cd indexof-jmh
> mvn clean install
> cd ..
> java -version
openjdk version "11.0.3" 2019-04-16
OpenJDK Runtime Environment (build 11.0.3+7-Ubuntu-1ubuntu218.04.1)
OpenJDK 64-Bit Server VM (build 11.0.3+7-Ubuntu-1ubuntu218.04.1, mixed mode, sharing)
> java -jar indexof-jmh/target/benchmarks.jar
...
Benchmark Mode Cnt Score Error Units
MyBenchmark.testMethod thrpt 3 37004629.640 ± 1644503.443 ops/s
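As an aside, the fork, warmup, and measurement settings baked into the annotations can also be overridden on the JMH command line once the benchmarks jar has been built, which is handy when experimenting. The flags below are standard JMH options and the values simply mirror our annotation defaults:
# Run only MyBenchmark with 1 fork, 3 warmup iterations and 3 measurement iterations
> java -jar indexof-jmh/target/benchmarks.jar MyBenchmark -f 1 -wi 3 -i 3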
Measuring OpenJ9
OpenJDK Java 11 with HotSpot achieved a score of 37004629 ops/s on our machine. How does OpenJDK Java 11 with OpenJ9 compare? To measure, we’ll use the AdoptOpenJDK API to grab the latest nightly OpenJDK Java 11 with OpenJ9 build:
> curl -L 'https://api.adoptopenjdk.net/v2/binary/nightly/openjdk11?openjdk_impl=openj9&os=linux&arch=x64&release=latest&type=jdk&heap_size=normal' > openj9.tar.gz
> tar xvf openj9.tar.gz
And repeat the same measurement:
> jdk-11.0.4+10/bin/java -version
openjdk version "11.0.4" 2019-07-16
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.4+10-201907151830)
Eclipse OpenJ9 VM AdoptOpenJDK (build master-9409a4266, JRE 11 Linux amd64-64-Bit Compressed References 20190715_284 (JIT enabled, AOT enabled)
OpenJ9 - 9409a4266
OMR - 4beca561
JCL - 411f32dbbb based on jdk-11.0.4+10)
> jdk-11.0.4+10/bin/java -jar indexof-jmh/target/benchmarks.jar
...
Benchmark Mode Cnt Score Error Units
MyBenchmark.testMethod thrpt 3 90224587.381 ± 41943779.975 ops/s
OpenJ9 has achieved a much higher score than HotSpot on our microbenchmark, roughly 2.4× the throughput. But why?
Investigating with Linux perf
perf is not installed on our vanilla image, so let’s install it using the package manager:
# Install perf
> apt-get update
> apt-get -y install linux-tools-generic linux-tools-`uname -r`
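Before going further, it is worth confirming that the installed perf matches the running kernel, since a mismatch between the linux-tools package and the kernel version is a common stumbling block (the version shown below is just an example):
> perf --version
perf version 4.15.18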
We can now use the perf record utility to capture a profile of our JMH microbenchmark. Hopefully this will tell us where time is being spent, which may give us a hint as to why OpenJ9 is performing so well. We will sample the cpu-cycles hardware event, since we are ultimately interested in an assembly level annotation of the profile. The -k 1 flag selects a monotonic clock for the sample timestamps, which the JIT injection step we use later in this post requires:
> perf record -k 1 -e cycles jdk-11.0.4+10/bin/java -jar indexof-jmh/target/benchmarks.jar
perf record will have generated a perf.data file, which is our profile. This is the default input to the perf report utility, which we will use to view the profile:
> perf report
# Overhead Command Shared Object Symbol
# ........ ............... .................. ......................
#
3.58% org.openj9.MyBe [JIT] tid 5465 [.] 0x00007f6ad2a24192
2.93% org.openj9.MyBe [JIT] tid 5465 [.] 0x00007f6ad2a2418a
2.82% org.openj9.MyBe [JIT] tid 5465 [.] 0x00007f6ad2a24199
1.76% org.openj9.MyBe [JIT] tid 5465 [.] 0x00007f6ad2a252cd
1.65% org.openj9.MyBe [JIT] tid 5465 [.] 0x00007f6ad2a2418e
1.58% org.openj9.MyBe [JIT] tid 5465 [.] 0x00007f6ad2a25480
1.54% org.openj9.MyBe [JIT] tid 5465 [.] 0x00007f6ad2a1ce97
1.54% org.openj9.MyBe [JIT] tid 5465 [.] 0x00007f6ad2a25553
...
...
0.09% org.openj9.MyBe perf-10944.map [.] 0x00007f3bcd81d6cb
0.09% org.openj9.MyBe perf-10944.map [.] 0x00007f3bcd81bce1
0.09% Thread-6 libj9vm29.so [.] VM_BytecodeInterpreter::run
0.09% org.openj9.MyBe perf-10944.map [.] 0x00007f3bcd81dfd4
0.09% org.openj9.MyBe perf-10944.map [.] 0x00007f3bcd81dd21
Why are we not seeing valid symbol names? This is expected. As described earlier, perf is unable to map the profile ticks to symbols because the hot code was JIT compiled. Note that the distances between the “symbols” in the perf report are quite small, sometimes only a few bytes apart. This is because perf report cannot group a set of profile ticks under one particular symbol (a method or function) and aggregate them; the “symbols” we are seeing here are actually individual instruction addresses within the JIT method bodies.
To fix this, perf provides built-in support for mapping JIT method symbols to address ranges. It does this by looking for a textual file, /tmp/perf-<pid>.map, which it uses to map addresses to symbols. The OpenJ9 JIT compiler has an option, -Xjit:perfTool, which generates such a file for perf report to consume; a sketch of the file format is shown below.
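Each line of the map file associates a starting address and a code size (both in hexadecimal) with a symbol name. A hypothetical excerpt (the addresses, sizes, and set of methods are illustrative) might look like this:
7f6ad2a1ce00 2c0 java/lang/String.indexOf(II)I_cold
7f6ad2a24040 460 java/lang/String.indexOf(II)I_warm
With -Xjit:perfTool added to the java command line, let’s try again: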
> perf record -k 1 -e cycles jdk-11.0.4+10/bin/java -Xjit:perfTool -jar indexof-jmh/target/benchmarks.jar
> perf report
# Overhead Command Shared Object Symbol
# ........ ............... .................. ......................
#
31.63% org.openj9.MyBe perf-12610.map [.] org/openj9/generated/MyBenchmark_testMethod_jmhTest.testMethod_thrpt_jmhStub(Lorg/openjdk/jmh/runner/InfraControl;Lorg/openjdk/jmh/results
15.75% org.openj9.MyBe perf-12610.map [.] java/io/ObjectOutputStream$HandleTable.lookup(Ljava/lang/Object;)I_warm
15.60% org.openj9.MyBe perf-12610.map [.] org/openj9/generated/MyBenchmark_testMethod_jmhTest.testMethod_thrpt_jmhStub(Lorg/openjdk/jmh/runner/InfraControl;Lorg/openjdk/jmh/results
13.86% org.openj9.MyBe perf-12610.map [.] java/net/PlainSocketImpl.socketClose0(Z)V_cold
9.92% org.openj9.MyBe perf-12610.map [.] java/lang/String.indexOf(II)I_warm
3.46% org.openj9.MyBe perf-12610.map [.] org/openj9/generated/MyBenchmark_testMethod_jmhTest.testMethod_thrpt_jmhStub(Lorg/openjdk/jmh/runner/InfraControl;Lorg/openjdk/jmh/results
1.26% org.openj9.MyBe perf-12610.map [.] org/openjdk/jmh/infra/Blackhole.consume(I)V_warm
0.89% org.openj9.MyBe perf-12610.map [.] java/lang/String.indexOf(II)I_cold
0.39% main libj9vm29.so [.] VM_BytecodeInterpreter::run
Much better! Note that the symbols have a suffix such as “_cold” or “_warm”. These indicate the optimization level at which the JIT compiler compiled each method. Further note that the top method symbol appears in the profile several times: the JIT compiler has recompiled this method several times (typically at progressively higher optimization levels) to improve performance, and each compiled body gets its own symbol.
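If you would like to see these recompilations as they happen, OpenJ9’s verbose JIT logging reports each compilation along with its optimization level. The invocation below is a sketch: the exact output format varies between releases and the addresses are illustrative:
> jdk-11.0.4+10/bin/java -Xjit:verbose -jar indexof-jmh/target/benchmarks.jar 2>&1 | grep 'String.indexOf(II)I'
+ (cold) java/lang/String.indexOf(II)I @ 00007F6AD2A1CE00-00007F6AD2A1D0C0 ...
+ (warm) java/lang/String.indexOf(II)I @ 00007F6AD2A24040-00007F6AD2A244A0 ...
Each “+” line corresponds to one compiled body, which is why the same method can show up more than once in the perf profile.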
Selecting the top method in the profile using the keyboard arrow keys and pressing the a hotkey will attempt to display the assembly annotation for the selected symbol. However, because of the nature of the JIT, perf cannot locate a binary corresponding to the symbol, and is therefore unable to annotate where the hot spots are within the assembly of the JIT compiled method.
Building the Latest perf From Source
As of perf version 4.0.0-rc6 the folks over at Google have committed a patch to perf that provides support for assembly annotated profiling of JITed code via a built-in JVMTI agent, allowing the JVM to be profiled out of the box. Unfortunately many Linux distributions do not ship the compiled JVMTI agent, so we will have to build perf from source:
# Install dependencies needed to build perf and the JVMTI agent
> apt-get -y install build-essential make gcc g++ flex bison libelf-dev libdw-dev libslang2-dev libssl-dev libiberty-dev default-jdk
# Clone the latest Linux kernel source (perf is included in the tools directory)
> git clone --depth 1 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
# Build perf
> cd linux/tools/perf
> make
The build process will have built the perf JVMTI agent against the default JVM on the PATH, and the agent will be located at linux/tools/perf/libperf-jvmti.so. This JVMTI agent queries the JVM during the perf record step, extracts the symbol and binary information of JIT method bodies, and stores it alongside the profile. A post-processing step then transforms this data so that JIT method symbols look like any other valid native function, which enables us to view the annotation of the JIT compiled methods.
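Concretely, the agent works by writing a “jitdump” file during perf record which records the name, address, and machine code of every JIT compiled method; the post-processing step we will run shortly converts those records into small ELF objects that perf can treat like ordinary shared libraries. The paths below are illustrative and may vary with the perf version and environment:
# Illustrative: the agent typically writes its jitdump under $HOME/.debug/jit/
> ls ~/.debug/jit/*/
jit-<pid>.dump
After the injection step in the next section, the same directory will also contain one small jitted-<pid>-<n>.so per compiled method, which is what the annotated disassembly is read from.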
Assembly Annotation of JIT Compiled Methods
We can now repeat the perf record step using the perf we just built, adding the JVMTI agent to our java command:
> ./linux/tools/perf/perf record -k 1 -e cycles jdk-11.0.4+10/bin/java -agentpath:`pwd`/linux/tools/perf/libperf-jvmti.so -jar indexof-jmh/target/benchmarks.jar
Note that the -Xjit:perfTool option is no longer needed, as the symbol mapping is now taken care of by the JVMTI agent. We next need to post-process the profile to map the JIT symbols to assembly annotations:
> ./linux/tools/perf/perf inject -i perf.data --jit -o perf.data.jitted
This creates a new perf.data.jitted profile with the injected JIT symbols, which we can view by passing it to the perf report tool with the -i option:
> ./linux/tools/perf/perf report -i perf.data.jitted
Again, use the arrow keys to select the top method and press the a hotkey to annotate it. The disassembly of the JIT compiled method will be shown, with the hot spots annotated. Pressing the H hotkey navigates to the hottest instruction within the method, which turns out to be in a tight loop:
7.89 │ 1dc: vmovdqu 0x8(%rsi,%rbx,2),%xmm0
8.73 │ vpcmpeqw %xmm1,%xmm0,%xmm0
14.00 │ vpmovmskb %xmm0,%edx
10.53 │ test %edx,%edx
0.01 │ ↓ jne 1f5
7.80 │ add $0x8,%ebx
7.60 │ cmp %eax,%ebx
0.01 │ ↑ jl 1dc
Indeed, vpmovmskb is an AVX/SSE instruction, and the loop above is a vectorized search: each iteration loads eight 16-bit chars at once (vmovdqu), compares them in parallel against the search character replicated in %xmm1 (vpcmpeqw), and extracts a bit mask (vpmovmskb) which is tested for a match. The JIT has used SIMD instructions to implement the core logic of the indexOf operation, which explains why OpenJ9 is able to achieve such high throughput. Digging deeper into the OpenJ9 repository we can find that this acceleration was implemented on x86 as part of eclipse/openj9#1681.
The perf JVMTI agent is spec compliant, so the same assembly annotation approach will work on other spec compliant JVMs such as HotSpot. As regular contributors to the JIT component of the OpenJ9 project we deliberately do not look at the disassembly of JIT methods from other JVMs, so as to keep the competition healthy. I therefore leave it as an exercise to the reader to annotate the HotSpot JVM using the same approach and check whether it makes use of AVX/SSE instructions to accelerate the indexOf API.
Conclusion
In this post we learned about some of the caveats of profiling dynamic runtimes such as the JVM, and how to overcome them using the features provided by and alongside the Linux perf tool. We were able to use assembly annotations to identify the reason for an unexpected performance difference between the HotSpot and OpenJ9 JVMs. The same method described in this post can be used to diagnose any JIT related performance issue.