Inspecting OpenJ9 performance with perf on Linux

Performance is critical to software: a product must outperform its competitors to gain a competitive edge and capture market share. Improving performance and resolving performance regressions in large software systems can be challenging and time consuming. Profiling tools such as Linux perf can substantially reduce the time needed to evaluate and study software performance. This blog provides instructions for building OpenJ9 for profiling and inspecting its performance with Linux perf.

Building OpenJ9 for analysis with Linux perf

Follow these steps:

1) Download the Linux 64-bit (x86_64) Dockerfile used to build OpenJDK11 with OpenJ9:


2) Create a Docker image from the Dockerfile:

docker build -t openj9-jdk11 -f Dockerfile .

3) Create a Docker container:

docker run --privileged --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -it openj9-jdk11

4) Get openj9-openjdk-jdk11 source:

git clone

5) Get OpenJ9 and OMR source:

cd openj9-openjdk-jdk11

6) Set environment variables to include compile options for supporting Linux perf:

export VMDEBUG=-fno-omit-frame-pointer
export VMLINK=-fno-omit-frame-pointer
export CFLAGS=-fno-omit-frame-pointer
export CXXFLAGS=-fno-omit-frame-pointer

7) Run configure with --with-extra-cflags="$CFLAGS" --with-extra-cxxflags="$CXXFLAGS":

bash configure --with-freemarker-jar=/root/freemarker.jar --with-extra-cflags="$CFLAGS" --with-extra-cxxflags="$CXXFLAGS"

In the configure log, warnings about environment variables being ignored can be disregarded.

8) Build OpenJDK11:

make all

9) After the build completes, zip the JDK and copy it out of the Docker container:

# Zip the JDK inside the Docker container.
cd build/linux-x86_64-normal-server-release/images
zip -r <JDK_ZIP_NAME>.zip jdk/
# Copy the zipped JDK from the Docker container to the host computer.
docker cp <CONTAINER_ID>:<PATH_TO_JDK_ZIP> <HOST_DESTINATION>

Next Steps: Setup

1) Make sure Linux perf is installed on the machine or in the Docker container where you want to inspect OpenJ9's performance.

2) It can be tricky to install and run perf in a Docker container. Here are instructions to set up perf in an Ubuntu Docker container:

# Install perf
apt-get update
apt-get install linux-tools-generic

# Update symlink for the perf executable if it doesn't work by default
rm -f /usr/bin/perf
ln -s /usr/lib/linux-tools/<KERNEL_VERSION>-generic/perf /usr/bin/perf

# Check that perf works; it should print a version string, e.g.:
perf -v
perf version 4.4.144
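The `<KERNEL_VERSION>` placeholder above can usually be derived from `uname -r` (a sketch; it assumes the linux-tools package that got installed matches the running kernel, and that on Ubuntu `uname -r` already includes the `-generic` flavour suffix):

```shell
# Build the expected perf path from the running kernel version.
PERF_PATH="/usr/lib/linux-tools/$(uname -r)/perf"
echo "expected perf binary: ${PERF_PATH}"
```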

3) If you are investigating a performance regression, you need two JDKs. One JDK acts as a baseline, and the other JDK shows symptoms of performance regression. Both JDKs should be built using the steps we’ve laid out. In this scenario you collect perf profiles for both JDKs. Then, you compare the two profiles to pinpoint the source of the performance regression.
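The comparison workflow described above can be sketched as follows (the JDK paths are assumptions; point them at the two builds produced by the earlier steps):

```shell
# Record one profile per JDK build, then diff the two profiles.
BASELINE=openj9-jdk8/bin/java
CANDIDATE=openj9-jdk11/bin/java
if command -v perf >/dev/null 2>&1 && [ -x "$BASELINE" ] && [ -x "$CANDIDATE" ]; then
    perf record --call-graph dwarf -o jdk8.data  -- "$BASELINE"  -version
    perf record --call-graph dwarf -o jdk11.data -- "$CANDIDATE" -version
    perf diff jdk8.data jdk11.data
else
    echo "skipping: perf or the JDK builds are not available here"
fi
```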

How to use Linux perf

1) Linux perf Examples:

2) Tutorial – Perf Wiki:

Performance Regression Example


When running java -version, OpenJ9-JDK11 performed poorly in comparison to OpenJ9-JDK8.

time openj9-jdk8/bin/java -version
real	0m0.057s
user	0m0.055s
sys	0m0.004s
time openj9-jdk11/bin/java -version
real    0m0.228s    # ~4x increase
user    0m0.092s    # ~2x increase
sys     0m0.140s    # ~35x increase
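The slowdown factors annotated above can be recomputed directly from the `time` measurements (values copied from the output):

```shell
# Ratio of JDK11 to JDK8 for each time component.
awk 'BEGIN { printf "real: %.1fx  user: %.1fx  sys: %.1fx\n",
             0.228/0.057, 0.092/0.055, 0.140/0.004 }'
# → real: 4.0x  user: 1.7x  sys: 35.0x
```

The disproportionate growth in sys time is the first hint that the extra cost is in kernel work, which the perf profiles below confirm.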

We decided to investigate the performance regression using Linux perf.

The perf profiles for OpenJ9-JDK8 and OpenJ9-JDK11 were collected using the following command:

perf record --call-graph dwarf -- <JAVA_COMMAND>

In OpenJ9, we used the JVM command-line option -Xjit:perfTool to help Linux perf map symbols for JIT-compiled methods in the code cache.
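Putting the two together, a profiled run looks roughly like this (the JDK path is an assumption; -Xjit:perfTool makes the JIT emit a /tmp/perf-&lt;pid&gt;.map file that perf reads to symbolize JIT-compiled code):

```shell
# Profile a run with JIT symbol mapping enabled.
if command -v perf >/dev/null 2>&1 && [ -x openj9-jdk11/bin/java ]; then
    perf record --call-graph dwarf -- openj9-jdk11/bin/java -Xjit:perfTool -version
else
    echo "skipping: perf or the JDK build is not available here"
fi
```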

The perf profiles were read using the following command:

perf report -i <PERF_DATA_FILE>

The perf profiles were compared using the following command:

perf diff -f

The perf diff report gave us the following information:

# Baseline  Delta Abs  Shared Object      Symbol                                                               
# ........  .........  .................  .........................................................
              +18.67%  [kernel.kallsyms]  [k] unmap_page_range
     1.90%    +17.44%       [.] VM_BytecodeInterpreter::run
    17.50%    -17.37%       [.] readAttributes
               +7.86%       [.] StringInternTable::areAddressesInSRPRange
     5.90%     -4.54%       [.] ClassFileOracle::walkMethodCodeAttributeCode
     4.57%     -4.48%       [.] j9bcutil_verifyCanonisizeAndCopyUTF8
               +3.70%       [.] VM_BytecodeInterpreter::inlineSendTarget
     0.40%     +3.38%       [.] ROMClassWriter::Helper::writeUTF8Block
               +2.84%       [.] VM_BytecodeInterpreter::putfield
               +2.70%  [kernel.kallsyms]  [k] syscall_return_via_sysret
     2.78%     -2.70%       [.] VM_BytecodeInterpreter::iinc
               +2.40%       [.] Cursor::writeBigEndianU16
               +1.86%  [kernel.kallsyms]  [k] __check_heap_object
               +1.72%      [.] omrmem_categories_increment_bytes
               +1.65%  [kernel.kallsyms]  [k] free_pgd_range
               +1.34%      [.] update_vmemIdentifier
               +1.27%       [.] VM_AtomicSupport::lockCompareExchange
               +1.15%       [.] VM_BytecodeInterpreter::getfieldLogic
     0.06%     +1.10%  [kernel.kallsyms]  [k] find_vma

In this report, many functions consume more CPU cycles in OpenJ9-JDK11 than in OpenJ9-JDK8. However, this information alone is not sufficient to resolve the performance regression; we also need to know the call paths involved. Call stack information can be accessed via perf report.

unmap_page_range accounts for 18.67% more of the CPU cycles in OpenJ9-JDK11, so let's focus on it for further analysis. The perf report provides the following call stack information:

    25.17%     0.14%  java             [.] omrvmem_free_memory
                                  |          sys_munmap
                                  |          vm_munmap
                                  |          |          
                                  |           --22.50%--do_munmap
                                  |                     |          
                                  |                     |--20.47%--unmap_region
                                  |                     |          |          
                                  |                     |          |--18.67%--unmap_vmas
                                  |                     |          |          unmap_single_vma
                                  |                     |          |          unmap_page_range
                                  |                     |          |          
                                  |                     |           --1.72%--free_pgtables
                                  |                     |                     |          
                                  |                     |                      --1.65%--free_pgd_range
                                  |                     |          
                                  |                      --1.12%--find_vma

From this call stack, we find that omrvmem_free_memory is the caller that leads to unmap_page_range, so we have now identified the OpenJ9 function related to the performance regression. After studying omrvmem_free_memory, we can infer that more time is spent freeing virtual memory.

Freeing virtual memory is interlinked with allocating virtual memory, so we should also study the call paths for virtual memory allocation. In the perf report, the following call stack corresponds to virtual memory allocation:

               |          MM_Configuration::createHeap
               |          MM_ConfigurationGenerational::createHeapWithManager
               |          MM_ConfigurationStandard::createHeapWithManager
               |          MM_HeapVirtualMemory::newInstance
               |          MM_HeapVirtualMemory::initialize
               |          MM_MemoryManager::createVirtualMemoryForHeap
               |          MM_VirtualMemory::newInstance
               |          MM_VirtualMemory::initialize
               |          MM_VirtualMemory::reserveMemory
               |          omrvmem_reserve_memory_ex
               |          getMemoryInRangeForDefaultPages
               |          |          
               |           --3.29%--default_pageSize_reserve_memory
               |                     |          
               |                     |--1.95%--__mmap64
               |                     |          |          
               |                     |           --1.81%--entry_SYSCALL_64_fastpath
               |                     |                     |          
               |                     |                      --1.74%--sys_mmap
               |                     |                                sys_mmap_pgoff
               |                     |                                vm_mmap_pgoff
               |                     |                                do_mmap
               |                     |                                |          
               |                     |                                 --1.34%--mmap_region
               |                      --1.34%--update_vmemIdentifier

In this call stack, we notice that the GC invokes omrvmem_reserve_memory_ex -> getMemoryInRangeForDefaultPages to reserve virtual memory while initializing the JVM heap.

Now, we have identified two code paths that cause the performance regression.

The next step is to study these code paths from an algorithmic perspective. The goal is to answer the following questions: Why is the regression happening? What caused the behaviour change? How can the regression be resolved?

For this performance regression, the issue and solution are described in greater detail here:


In large software systems such as OpenJ9, profiling tools such as Linux perf can greatly reduce the time needed to locate the source of a performance regression. With a little extra configuration during the build process, a Docker image provides a convenient, isolated environment for performance investigations.

2 Replies to “Inspecting OpenJ9 performance with perf on Linux”

  1. I think these two instructions are contradictory:

    1. Set environment variables to include [-fno-omit-frame-pointer] for supporting Linux perf:
    2. perf record --call-graph dwarf

    The default for --call-graph is "fp", which uses the RBP register that is saved with -fno-omit-frame-pointer; however, by specifying --call-graph dwarf, I believe you are disregarding RBP and simply depending on DWARF CFI information in the binaries to walk the call stack (which I believe is there due to -g, although I'm not sure how stripped debuginfo symbols play into this). I have tested this by simply running an original JDK without -fno-omit-frame-pointer but with --call-graph dwarf and it seems to have good output.
