Innovations for Java running in containers

This blog is co-authored by Hang Shao, Marius Pirvu, Tobi Ajila and Vijay Sundaresan.

1.     Introduction

Over a decade ago, the common form of deployment was large monolithic applications that handled all the business logic. In this model, applications typically had long deployment windows and were deployed on large on-prem server farms. Since the deployer owned the hardware, resource capacity was static: increasing it required purchasing more hardware, which did not happen overnight. To defend against outages caused by spikes in incoming load, businesses often over-provisioned resources. This resulted in waste during low-load periods, but given the static nature of resource provisioning it was a compromise many had to accept.

Nowadays, instead of having one large application that does everything, applications are modularized into many smaller services. Deployments are frequent and may happen multiple times a day. Applications are deployed on public clouds where resources are, for practical purposes, unlimited; as demand changes, one can easily scale up or down. In fact, cloud pricing is based on memory and time usage, so over-provisioning is a poor way of dealing with dynamic demand. The modularized micro-service approach has proven to be a more efficient mode of deployment than the monolithic paradigm.

Despite all these advancements, developers still face challenges when deploying applications. With micro-services there may be many dependencies to manage. Also, given the non-uniform nature of public clouds, there may be many environments in which an application could be deployed, and the computing resources (amount of RAM, number of CPUs) may not be the same from node to node. Finally, with dynamic resourcing policies there is a direct correlation between scaling time and the latency users experience when demand suddenly increases.

Containers provide solutions to some of these challenges. Tools like Docker enable developers to modularize and package their solutions into immutable entities, which means they get the same configuration in development, testing, and production. This solves the problem of managing multiple dependencies and makes it easier to deploy in different environments, thus reducing deployment risk. And since all dependencies are contained within the image, it also becomes easier to deploy to different clouds and platforms.

Containers are an effective means of addressing some, but not all, of the challenges in the cloud; cold starts and memory footprint remain problematic. Start-up time is important because with micro-service based solutions, many instances don't run for very long. With FaaS/serverless solutions, start-up time matters even more, as it represents a noticeable portion of the total application time. Also, with dynamic scaling policies, start-up time directly correlates with response time, because new instances need to spin up once demand increases. And since cloud pricing depends on memory usage, minimizing the footprint of the application becomes important as well. Tuning the JVM is essential in these cases.

Containers cannot solve these remaining challenges by themselves, but they do offer JVM vendors an avenue to address them: a container is a modular entity in which users can package their applications together with JVM enhancements that ensure optimal performance in the cloud environment. Next, we will describe some of these enhancements and how users can leverage them in container-based solutions.

2.     OpenJ9 enhancements for the cloud environment

In this section, we will introduce some OpenJ9 features that improve performance in a cloud environment, and we will evaluate their benefits through performance experiments.

2.1.           Reducing Class Verification Overhead

Class verification adds overhead during start-up, when many classes are loaded. Verification itself consumes CPU cycles, and it can trigger further class loading. To address this, OpenJ9 has a feature called the class relationship verifier, which omits the class loading operations that would otherwise occur during class verification. Instead, OpenJ9 records the class relationships and verifies those records later, at run time. With this feature, some classes may never be loaded at all, which can improve start-up time.

The command line option to enable this feature is: -XX:+ClassRelationshipVerifier. If you are interested in more details about this feature, you can check this blog post here.
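
As a minimal sketch of how you might enable it (the image name and jar path below are placeholders), the option can be passed directly on the java command line, or injected into an existing container through the standard JAVA_TOOL_OPTIONS environment variable:

# Enable the class relationship verifier for a local run
java -XX:+ClassRelationshipVerifier -jar app.jar

# Or inject the option into a containerized application without rebuilding the image
docker run --rm -e JAVA_TOOL_OPTIONS="-XX:+ClassRelationshipVerifier" my-openj9-app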

2.2.           Shared Class Cache and Dynamic AOT technology

One of OpenJ9's most powerful features for improving application start-up time is the Shared Class Cache (SCC) technology: an area of shared memory used to cache entities such as ROMClasses (the read-only representation of Java classes), AOT-compiled code, and interpreter profiling information. The SCC persists beyond the lifetime of the JVM. Since classes are loaded from memory instead of disk, and the ROMClasses are already pre-processed, class loading is much faster with the shared class cache.

Another key technology that improves start-up time is dynamic AOT. It is called dynamic because methods are compiled to a relocatable format at runtime and stored into the shared class cache. Any subsequent JVM instance can use these methods after a cheap relocation process, which is about 100 times faster than a JIT compilation. With AOT, the JVM can transition faster from interpreted code to machine code, which improves start-up time. The downside of AOT is that the generated code is slightly less optimized, because it has to be generic enough to work in many JVM instances. However, we can mitigate this issue by generating AOT code only during the start-up phase of an application and by later recompiling the AOT method bodies with the regular JIT compiler.

The command line option to enable the SCC and AOT is: -Xshareclasses.
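
As a sketch of the cold-run/warm-run cycle (the cache name, cache directory, and app.jar are all placeholders), the first invocation creates and populates a cache, subsequent invocations reuse it, and the printStats suboption shows what was cached:

# Cold run: creates the cache and populates it with ROMClasses and AOT code
java -Xshareclasses:name=demo_scc,cacheDir=/tmp/scc -jar app.jar

# Warm run: classes and AOT methods are now loaded from the cache
java -Xshareclasses:name=demo_scc,cacheDir=/tmp/scc -jar app.jar

# Inspect the cache contents
java -Xshareclasses:name=demo_scc,cacheDir=/tmp/scc,printStats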

2.3.           Embedded SCC in containers

The SCC is created and populated by a so-called "cold run". Start-up time improvements are realized only in subsequent "warm runs" that benefit from the existing cached data. This creates a problem when running inside a container, because the populated SCC disappears once the container is terminated. If we want to take full advantage of the cached data, we need to ensure some sort of persistence for the SCC across runs. One possible way of achieving this is to put the SCC in a volume. However, this approach is not user friendly, as it puts the burden of creating the SCC in a volume on users. Moreover, when machines are provisioned on the fly in the cloud, we would still hit the cold-SCC problem.

OpenJ9 solves the persistence issue by embedding the SCC into the OpenJ9 Docker image at image creation time. Since this SCC may be used by many different containerized applications built on top of OpenJ9, we populate it only with classes and AOT methods from the bootstrap classloader, and for this reason we call it the System-SCC. Starting with release 0.23.0, OpenJ9 Docker images with an embedded System-SCC can be downloaded from Docker Hub:

  • JDK8: https://hub.docker.com/r/adoptopenjdk/openjdk8-openj9
  • JDK11: https://hub.docker.com/r/adoptopenjdk/openjdk11-openj9
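
Building your application image on top of one of these images gives you the benefit of the embedded System-SCC without any extra configuration. A minimal Dockerfile sketch (the jar path is a placeholder):

FROM adoptopenjdk/openjdk11-openj9:jre-11.0.9_11_openj9-0.23.0
COPY target/app.jar /app/app.jar
# Bootstrap classes and AOT methods are served from the embedded System-SCC
CMD ["java", "-jar", "/app/app.jar"]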

2.4.           Multi-Layer SCC

As explained above, the SCC embedded in the OpenJ9 Docker image caches only bootstrap classes and methods. Your application's start-up time can be further improved if the application's classes and methods are cached too. When running on top of the OpenJ9 image, you could use the embedded SCC to cache such data. However, due to Docker's copy-on-write (COW) mechanism, a duplicate of the SCC is first created in the higher Docker layer, and only then is the additional data added. Things get worse if more layers store data into the SCC, because a duplicate SCC is created in each of those layers, leading to a much bigger image size. An additional hurdle is picking the optimal size for the OpenJ9 SCC: since applications running on top of the OpenJ9 image can potentially be very large, we would need to be very generous with the SCC in the OpenJ9 layer. However, this exacerbates the duplication problem stemming from the COW mechanism.

OpenJ9 solves these problems by introducing the multi-layer SCC, which makes the SCC work in a layered fashion. Instead of adding data to a single SCC in one layer, each Docker layer can package its data into a separate SCC file in its own layer. Docker's COW issue is avoided because the JVM never has to write into the SCC files belonging to lower layers. Moreover, with a multi-layer SCC we can size each cache layer individually, just right for the classes and AOT methods it needs to hold. This results in a smaller on-disk image size and faster pushing and pulling of the image.

The command line options related to the multi-layer SCC are: -Xshareclasses:createLayer and -Xshareclasses:layer=<n>. If you are interested in more details about this feature, you can check this blog post here.
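
As a sketch of how an application layer could be added on top of the System-SCC during an image build, assuming the base image keeps its cache under the name and directory shown by the options in section 2.5 (the jar path, cache size, and timeout below are placeholders):

FROM adoptopenjdk/openjdk11-openj9:jre-11.0.9_11_openj9-0.23.0
COPY target/app.jar /app/app.jar
# Create a new cache layer on top of the System-SCC and populate it by
# briefly running the application during the image build; "|| true" keeps
# the build going after timeout stops the application
RUN timeout 30 java -Xscmx60m \
      -Xshareclasses:name=openj9_system_scc,cacheDir=/opt/java/.scc,createLayer \
      -jar /app/app.jar || true
# At run time, the layered caches are opened read-only
CMD ["java", "-Xshareclasses:name=openj9_system_scc,cacheDir=/opt/java/.scc,readonly,nonfatal", "-jar", "/app/app.jar"]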

2.5.           Performance Results

Let's have a look at the performance improvements enabled by the OpenJ9 features introduced above. As a benchmark we'll use the well-known PetClinic application running on top of the very popular Spring Boot framework. All the experiments were conducted in Docker containers running on a machine with an Intel Core i7-6700K (Skylake) @ 4.00GHz, 16 GB of RAM, and Ubuntu 18.04.5 LTS. The base JVM layers for our PetClinic containers were pulled from Docker Hub and had the following tags:

  • adoptopenjdk/openjdk11:jre-nightly (pulled Nov. 2020)
  • adoptopenjdk/openjdk11-openj9:jre-11.0.7_10_openj9-0.20.0
  • adoptopenjdk/openjdk11-openj9:jre-11.0.9_11_openj9-0.23.0

As can be seen, we tested two different OpenJ9 images, because the image for release 0.23.0 embeds a System-SCC for the bootstrap classes/methods and also offers the ability to use a multi-layer SCC for caching application classes/methods. No JVM options were used in these experiments, except for the cases where we enabled the class relationship verifier with -XX:+ClassRelationshipVerifier. It should be noted, though, that under the hood the OpenJ9 images already enable the following options:

-XX:+IgnoreUnrecognizedVMOptions
-XX:+IdleTuningGcOnIdle
-Xshareclasses:name=openj9_system_scc,cacheDir=/opt/java/.scc,readonly,nonfatal

The performance metrics we are going to look at are:

  1. Start-up time
  2. Footprint (Resident Set Size after start-up)
  3. Container image size

Each experiment was repeated 20 times and we report the average values.

Figure 1. Experimental Results

In terms of start-up time, the class relationship verifier gives us about a 7% improvement over the OpenJ9 baseline shown in the orange bar. The System-SCC from OpenJ9 release 0.23.0 (yellow bar) improves performance by ~18%, and at this point OpenJ9 overtakes HotSpot. However, the big start-up time improvement comes from also caching the application's classes and methods: with a multi-layer SCC (generated using this script), performance improves by over 50%. The best result is achieved by enabling both the multi-layer SCC and the class relationship verifier, which makes OpenJ9 45% better than HotSpot in terms of start-up time.

With respect to footprint, all the OpenJ9 configurations outperformed HotSpot by a large margin. There is a small footprint increase when moving from the 0.20.0 release to the 0.23.0 release, which is due to the embedded SCC. However, such a small increase is well worth it considering the big improvement in start-up time.

In terms of image size, embedding a System-SCC or a multi-layer SCC increases the image slightly. It should be emphasized, though, that compared to using a single SCC to cache all the data, the multi-layer SCC configuration results in a smaller image size.

You can check this script if you are interested in how the results here were obtained.

2.6.           Optimizations in Resource-Constrained Environments

The JVM can experience an out-of-memory (OOM) event when it attempts to allocate an object, not enough free space is left on the heap, and the heap cannot be expanded any further. However, to function, a JVM needs not only the Java heap but also other internal data structures, which fall into the so-called native memory category. If the JVM experiences a native OOM event, it may be forced to shut down, or at the very least to forgo some of its functionality. The JIT compiler is a big consumer of native memory, which is no surprise given its complexity. In OpenJ9, the memory consumed by the JIT compiler is called scratch memory. Scratch memory is allocated on demand in 16MB increments, up to a limit of 256MB per compilation thread, and is fully released to the OS at the end of a compilation. Figure 2 shows the memory usage spikes caused by JIT compilation activity, which could generate a native OOM event.

Figure 2. OpenJ9 PetClinic Footprint During Load

When running applications inside containers, it's good practice to set memory limits for those containers. However, if the JVM attempts to allocate more memory than the container limit allows, the container will be terminated. As shown in Figure 2, the memory usage of the JVM can increase during JIT compilation, and this could trigger an OOM event.

OpenJ9 makes a best effort to avoid OOM events due to scratch memory consumption: it constantly monitors memory availability, taking into consideration the container memory limit, the amount of memory used by the JVM, and the amount of free memory on the machine. When available memory runs low, the JIT compiler may fail some compilations or turn off some compilation threads. However, this mechanism is not perfect, because memory availability can change soon after the readout has been performed. Therefore, despite our efforts, spurious OOM events can still happen.

Figure 3. PetClinic Footprint for 1024MB and 512MB limits
Figure 4. PetClinic Throughput for 1024MB and 512MB limits

Figure 3 and Figure 4 show OpenJ9's footprint and throughput in containers with 512MB and 1024MB memory limits. When the memory limit is 512MB, OpenJ9 reduces its memory consumption, trying to stay within the given limit, and the footprint spikes are much smaller. However, this is not always successful: in our experiment, the container with OpenJ9 was sometimes killed due to a native OOM (2 out of 10 attempts) when the limit was set to 512MB. As a sidebar, it should be noted that containers with HotSpot were killed every time; HotSpot simply cannot run this application with just 512MB of memory. The throughput of the 512MB configuration is lower than that of the 1024MB configuration because OpenJ9 downgrades compilations or shuts down compilation threads in order to stay within the limit. OpenJ9 prefers to operate at lower capacity rather than crash the application with an OOM event.

Despite this best effort, if not enough memory is available, OpenJ9 can still experience an OOM event. To make that event less likely, we have an add-on solution: the -Xtune:virtualized mode. This option is recommended in resource-constrained environments, which are typical in the cloud. In this mode, the JIT compiler reduces the aggressiveness of compilation, which results in less CPU and memory consumption, and caps the scratch memory usage at 16MB. The side effect is that some compilations are downgraded to a lower optimization level, so in some cases peak throughput may suffer. Figure 5 and Figure 6 illustrate the effect of the -Xtune:virtualized option on footprint and throughput, respectively, when running in small (512MB) containers. As can be seen, the footprint spikes are gone, while the throughput is actually better than in the default configuration. The explanation is that, in the default configuration, some expensive compilations bring the JVM very close to the memory exhaustion point and the JIT is forced to fail concurrent compilations in order to avoid an OOM event. In contrast, with -Xtune:virtualized the JVM does not approach the memory exhaustion point (as seen in the graph, the peak is around 400MB, well below the 512MB limit); most compilations can be performed within the 16MB scratch memory limit, while the handful that exceed it are retried at a lower optimization level.

Figure 5. PetClinic Footprint for 4P 512MB container limit
Figure 6. PetClinic Throughput for 4P 512MB container limit

The conclusion is that, if your application runs close to its memory limit, the virtualized mode can be a good solution, not only for avoiding OOM events but also for achieving a good enough level of throughput.

The command line option to enable virtualized mode is: -Xtune:virtualized.
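
For example, a Docker memory limit and the virtualized mode could be combined like this (the image name and jar path are placeholders):

# Cap the container at 512MB; OpenJ9 detects the limit automatically
docker run --rm -m 512m my-openj9-app \
    java -Xtune:virtualized -jar /app/app.jar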

2.7.           Portable AOT

As described in section 2.2, OpenJ9's AOT-compiled code, which improves start-up time, is stored inside the SCC. A cold run builds and populates the SCC; subsequent warm runs get the start-up benefit by loading the existing cached data. As mentioned in sections 2.3 and 2.5, we obtained start-up improvements by pre-building an embedded SCC into OpenJ9's Docker image. You may want to do the same thing in your cloud deployment and embed an SCC in your own Docker image. This way, your application will have better out-of-the-box start-up performance thanks to the cached data that ships with the image.

However, the embedded SCC needs to address one issue: the portability of the AOT code. The build-time environment may not be the same as the runtime environment, and many factors, such as different CPU features, different heap sizes, or different memory barrier requirements, could invalidate the AOT code. OpenJ9 introduced a new command line option, -XX:+PortableSharedCache, to generate portable AOT code. This option is turned on by default inside containers, since AOT code embedded in a container image may run on CPUs with different features. One nice aspect of OpenJ9 is its container awareness: OpenJ9 automatically generates portable AOT code when running inside a container, without the need to explicitly turn on this option. The real value of portable AOT is that it ensures AOT code can be used in more cases; in our experience, the start-up performance difference between using AOT code and not using it can be as much as 60%.

Command line option to turn on this feature: -XX:+PortableSharedCache (on by default in containers).
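
Outside a container, for example when populating an SCC on a build machine whose CPU differs from the machines you deploy to, the option can be enabled explicitly (the cache name, directory, and jar are placeholders):

# Generate AOT code that avoids CPU-specific features so the cache
# remains usable on machines with different processors
java -XX:+PortableSharedCache \
     -Xshareclasses:name=demo_scc,cacheDir=/tmp/scc \
     -jar app.jar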

3.     Summary

The ability to configure the JVM is an important tool for getting the best performance out of your application when running in the cloud. Fast start-up and low memory footprint are key characteristics of efficient cloud deployments. OpenJ9 is optimized for a low memory footprint, provides tuning options that enable faster start-up, and is container aware, with options that deliver high performance in resource-constrained environments. You can check this video if you want to see a demo of the experiments in this blog, and our website if you are interested in learning more about Eclipse OpenJ9.