In recent years, businesses have emphasized modernizing their software stacks by moving to cloud-based services. This transformation has generally meant breaking up large, complex, long-running modules into much smaller, short-lived units. These units are often packaged as containers that can be individually provisioned on worker nodes and scaled to meet demand. The days of long deployment windows and large on-premises server farms are slowly fading as serverless and FaaS approaches gain popularity. As a result, businesses can focus more of their resources on developing new products and features, and less on infrastructure.
Cloud-based deployment has introduced many benefits, such as dynamic scaling: pay for what you need, when you need it. However, it has also introduced some new challenges. For latency-sensitive applications, fast start-up is a requirement if dynamic scaling is used to provision resources. In addition, the typical pricing model for cloud resources is based on the amount of memory used multiplied by the duration for which it is in use. This means that being efficient with memory can have a positive impact on your bottom line.
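To make the memory-times-duration pricing model concrete, here is a minimal sketch of the arithmetic. The class name, the method names and the price per GB-second are all made up for illustration; real providers publish their own rates and rounding rules.

```java
// Hypothetical helper illustrating memory x duration billing.
// The rate used in main() is an invented number, not a real provider's price.
final class CloudCost {

    /** Billable usage in GB-seconds: memory (in GB) multiplied by duration (in seconds). */
    static double gbSeconds(double memoryMb, double durationMs) {
        return (memoryMb / 1024.0) * (durationMs / 1000.0);
    }

    /** Cost of one invocation at the given price per GB-second. */
    static double cost(double memoryMb, double durationMs, double pricePerGbSecond) {
        return gbSeconds(memoryMb, durationMs) * pricePerGbSecond;
    }

    public static void main(String[] args) {
        double price = 0.00002;   // assumed illustrative rate per GB-second
        double durationMs = 1000; // same running time for both configurations

        // Halving the memory footprint halves the bill for the same duration.
        System.out.printf("512 MB: %.8f%n", cost(512, durationMs, price));
        System.out.printf("256 MB: %.8f%n", cost(256, durationMs, price));
    }
}
```

Under this model, any reduction in either factor, memory or running time, lowers the cost proportionally.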
The microservice approach has been overwhelmingly driven by container technology. Containers have made it easy to capture all dependencies in a single image, thus guaranteeing consistent application behaviour in development, testing, and deployment. Container images are often transferred over the network via container registries, so minimizing image size can have a positive impact on deployment and the general workflow.
Many online businesses are built on application servers, and servers such as Tomcat, Liberty and JBoss are all built on JVM technology. The JVM offers many benefits to users, such as a rich class library, excellent throughput performance, debugging and tooling capabilities, and more. However, due to its interpreter and dynamic-compilation design, it does not perform as well at start-up. A typical JVM run starts by loading and initializing classes, then interprets methods until the Just-In-Time (JIT) compiler compiles them into machine code. The result is a slower start-up followed by a ramp-up phase as the JIT profiles and optimizes the code. After this phase the JVM can achieve peak performance.
This poses a challenge in the era of cloud-based workloads. As a result, improving start-up time has been a major area of focus for JVM providers. Existing techniques such as class metadata caching in the Shared Classes Cache (SCC) and dynamic ahead-of-time (AOT) compilation have shown great improvements to start-up time. Caching internal class metadata structures greatly reduces class-loading time, and dynamic AOT reuses compiled methods from a previous run to greatly reduce ramp-up time. While these techniques show positive improvements (<1 second start-up time for the Open Liberty application server using Eclipse OpenJ9), they still lag behind static compilation techniques.
Graal native image has made a big splash in recent years because of the extremely fast start-up time it enables, due to the fact that it performs static (or native) Java compilation. Native image uses a closed-world approach in which static analysis is first performed to determine all reachable paths for an application. Next, some static initializers are run at build time, after which the Java heap is saved, the rest of the application is compiled into machine code, and a native image is produced. This approach achieves very fast start-up times (<100 ms for some Quarkus applications) and a very small on-disk footprint. These performance characteristics check a lot of boxes for users who care about performance in the cloud; however, there are some drawbacks. The closed-world approach means that any class you may need at runtime must be made available to the compiler at build time. This has repercussions for features such as reflection, or loading libraries dynamically (JNI). There are also challenges with dynamic bytecodes such as invokedynamic and constantdynamic. In addition, because of the way the native image is produced, debugging is a challenge, as it is not possible to attach standard Java debuggers to the process. Lastly, there is evidence that peak throughput performance lags behind the traditional JVM mode. These challenges are a barrier for those who may want to migrate from a JVM-based application to a native-image-based application.
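The reflection issue is easiest to see with a small example. The sketch below, which runs on any ordinary JVM, loads a class whose name is only known at run time. The class name `ReflectiveLoad` and the `plugin.class` system property are hypothetical and exist only for this illustration.

```java
// Illustrative sketch: a class name that is only known at run time.
// "plugin.class" is a made-up system property used for this example.
final class ReflectiveLoad {

    /** Loads the named class reflectively and creates an instance with its no-arg constructor. */
    static Object instantiate(String className) {
        try {
            Class<?> type = Class.forName(className);
            return type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        // The name comes from configuration, so a build-time static analysis
        // cannot tell from the code alone which class will be requested.
        String name = System.getProperty("plugin.class", "java.util.ArrayList");
        Object plugin = instantiate(name);
        System.out.println(plugin.getClass().getName());
    }
}
```

On a JVM this works for any class on the class path. Under a closed-world build, the class chosen at run time has to be known and made available when the image is built, which is the restriction described above.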
The JVM-based approach offers all the capabilities that Java users are accustomed to. However, it lags in start-up time, on-disk footprint and memory usage at start-up. The native-image approach offers excellent start-up time and on-disk footprint but sacrifices commonly used JVM capabilities. In the remainder of this blog post, we discuss a third approach that has the potential to deliver many of the best characteristics of these two approaches. Checkpoint/Restore In Userspace (CRIU) is a Linux utility that allows users to checkpoint a running application from user space and then restore it at a later time or on a different machine. JVMs can leverage this to greatly improve start-up time by performing a “build run” in which classes are loaded, initializers are run, the JIT has compiled and optimized some methods, and the application has run to a point just before it is ready to serve requests. At this point the JVM is checkpointed and saved to disk. The resulting image can be used to deploy the application by performing a restore. Since all preparatory work (class loading, JIT compilation, and so on) has already been done, the application starts much faster than it would in JVM mode, and since it is still a JVM, there are no restrictions on the capabilities it provides. This means users can get fast start-up while keeping many of the features and JVM capabilities that they are already accustomed to.
The CRIU approach naturally generalizes to cloud environments. Cloud-native architectures often wrap services in containers that are managed and coordinated by a container engine (e.g., Podman) or an orchestrator (e.g., Kubernetes). Therefore, if one starts one of these applications in a container, checkpoints it at some well-defined point, and creates a new image, then new containers can be started by running the restore command. These new containers would not run the application from scratch; they would simply continue execution from the point at which the application was checkpointed, significantly reducing start-up time.
OpenJ9 has been investigating this approach and has built support into various components of the JDK, namely the VM, the JIT compiler and the class libraries (JCL), to allow Java applications to take advantage of the CRIU capability more easily. An integral part of this OpenJ9 CRIU support is an API that lets users create safe and robust checkpoint images that can be readily restored in cloud deployments. Using this API, the user can call a Java method at a program point of their choice to initiate a checkpoint, passing in the necessary arguments to CRIU. Other aspects of the OpenJ9 CRIU support include:
- a hook architecture and API that lets users register methods, and the order in which they should run, before a checkpoint is taken and upon restore
- compensation in common JCL classes, such as the time-related classes and Random, to account for differences between the checkpoint and restore environments
- a distinct approach to managing Java security components, to avoid embedding any sensitive information in CRIU checkpoint image files
- changes to the JIT so that the code it generates is portable enough to run on any version of the architecture to which the container image holding the CRIU checkpoint (which may include some JIT-compiled code) gets deployed
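To give a feel for the hook idea in the first bullet, here is a minimal, self-contained sketch of a registry that runs user-supplied actions in priority order before a checkpoint and after a restore. This is not the OpenJ9 API: the class `CheckpointHooks`, its method names and the "higher priority runs first" rule are all assumptions made for illustration, and the checkpoint itself is simulated by an ordinary callback.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a checkpoint hook registry. NOT the OpenJ9 API.
final class CheckpointHooks {

    /** A registered action together with its priority. */
    private static final class Hook {
        final int priority;
        final Runnable action;

        Hook(int priority, Runnable action) {
            this.priority = priority;
            this.action = action;
        }
    }

    private final List<Hook> preCheckpoint = new ArrayList<>();
    private final List<Hook> postRestore = new ArrayList<>();

    /** Registers an action to run before the checkpoint is taken. */
    void beforeCheckpoint(int priority, Runnable action) {
        preCheckpoint.add(new Hook(priority, action));
    }

    /** Registers an action to run after the process has been restored. */
    void afterRestore(int priority, Runnable action) {
        postRestore.add(new Hook(priority, action));
    }

    /** Runs pre-checkpoint hooks, then the checkpoint step, then post-restore hooks. */
    void checkpoint(Runnable checkpointStep) {
        runInPriorityOrder(preCheckpoint);
        // A real implementation would freeze the process here and resume on restore;
        // in this sketch the step is just a callback.
        checkpointStep.run();
        runInPriorityOrder(postRestore);
    }

    /** Assumed ordering rule: higher priority first, registration order on ties. */
    private static void runInPriorityOrder(List<Hook> hooks) {
        List<Hook> ordered = new ArrayList<>(hooks);
        ordered.sort(Comparator.comparingInt((Hook h) -> h.priority).reversed());
        for (Hook hook : ordered) {
            hook.action.run();
        }
    }

    public static void main(String[] args) {
        CheckpointHooks hooks = new CheckpointHooks();
        hooks.beforeCheckpoint(1, () -> System.out.println("close network connections"));
        hooks.beforeCheckpoint(2, () -> System.out.println("flush pending writes"));
        hooks.afterRestore(1, () -> System.out.println("reopen network connections"));
        hooks.checkpoint(() -> System.out.println("(checkpoint taken, later restored)"));
    }
}
```

The point of the pattern is that state which should not survive in the image, such as open connections, can be released before the checkpoint and re-created in the restore environment.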
While the main use case at present is cloud deployments, this technique also has potential in other areas, such as embedded systems where start-up matters and high-performance computing workloads in which periodic checkpoints aid fault tolerance. Early experiments with OpenJ9 CRIU support have shown up to a 10x start-up time improvement on Open Liberty applications in comparison to JVM-based modes (future blog posts will give more details on this and on the different aspects of OpenJ9 CRIU support mentioned earlier). Because CRIU is specific to Linux, other operating systems are not supported, but we don’t expect this to be a major issue given the ubiquity of Linux in cloud deployments.
This blog post has explained why start-up time matters in the cloud, and what options are available to those who deploy application servers based on JVM technology. To date, the choice has been between the traditional JVM-based approach and native images. OpenJ9 CRIU support offers a third option that provides fast start-up while preserving most of the JVM capabilities familiar to users of the language. If you are interested in trying this feature out, please read Getting started with OpenJ9 CRIU Support for step-by-step instructions. Additionally, if you are interested in the implementation details, please read OpenJ9 CRIU Support: A look under the hood and OpenJ9 CRIU Support: A look under the hood (part II).