A glimpse into performance of JITServer technology

In a previous blog we took a sneak peek at JITServer technology, which relieves the JVM of the negative effects of JIT compilation (interference due to the JIT's use of CPU and memory) by offloading compilation to a remote process. In this article we put the technology to the test, measuring a few key performance metrics for two Java EE benchmarks running on top of the Liberty application server.

Experimental setup

With JITServer technology, two factors pull in opposite directions: on one hand, the overall CPU and memory consumption at the client JVM is reduced (the JIT overhead has moved to the server); on the other hand, remote JIT compilations are expected to take longer because they are affected by network latency. To capture this trade-off, our experiments cover two different environments:

  1. Resource constrained environments where the containers running the Java application have small CPU and memory limits. This is where the negative effects of the JIT compilation are more prevalent, and we expect JITServer to pull ahead in such circumstances.
  2. Environments with generous CPU and memory limits. In this case the application is less hindered by the JIT compilation activity which, typically, is performed asynchronously. As such, we expect JITServer to perform slightly worse on some performance metrics, because the negative effects of network latency outweigh the benefits we can extract from offloading JIT compilation.

The two Java EE benchmarks we'll be using are AcmeAir, which simulates an airline reservation system, and Daytrader7, which implements an online stock-trading platform. Both run in Docker containers built on top of the websphere-liberty:19.0.0.9-webProfile7 image. open-liberty would have worked as well, but we preferred the freely available websphere-liberty images for convenience: our benchmarks require some older versions of Liberty features that can be easily installed with the installUtility tool (available in websphere-liberty, but not in open-liberty).
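For illustration, here is a minimal sketch of how such an image can be built, with the features baked in at build time via installUtility (the image tag and feature names below are placeholders, not the exact set our benchmarks use):

```bash
# Sketch only: extend the websphere-liberty base image and install the (older)
# Liberty features the benchmark needs. Feature names are illustrative.
docker build -t acmeair-liberty - <<'EOF'
FROM websphere-liberty:19.0.0.9-webProfile7
RUN installUtility install --acceptLicense servlet-3.1 jdbc-4.1
EOF
```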

As seen in Figure 1 below, the system under test (SUT) is a four-core (8 hardware threads) desktop machine with an Intel Core i7-6700K CPU and 16 GB of RAM, running Ubuntu 16.04. The database engine (MongoDB for AcmeAir and Db2 for Daytrader7) runs on a second desktop machine, while the JITServer (when enabled) and the JMeter application that puts load on the SUT run on a third desktop machine. In all experiments the shared class cache (SCC) technology was enabled, with the cache persisted in a separate Docker volume.

Figure 1. Architecture of the benchmarking environment
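One way to persist the SCC across container restarts is a named volume mounted over the cache directory; a sketch, assuming the image built above and that the Liberty image keeps its OpenJ9 cache under /output/.classCache (check your image's -Xshareclasses:cacheDir setting):

```bash
# Sketch: keep the shared class cache in a named volume so that warm runs
# can reuse AOT-compiled code. The cache directory path is an assumption.
docker volume create acmeair-scc
docker run -d --name acmeair \
    -v acmeair-scc:/output/.classCache \
    acmeair-liberty
```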

Start-up experiments

For start-up we use the AcmeAir benchmark. As shown in Table 1, in a resource-constrained environment (1 vCPU, 150 MB) JITServer technology improves start-up time by 6%. However, when enough computing resources are available (4 vCPUs, 512 MB), JITServer actually regresses start-up time by 12% because the high latency of remote compilations becomes the dominant factor: delaying the availability of compiled code keeps the application in the interpreter longer, which directly impacts start-up time.

|                                            | --cpus=1 --memory=150m | --cpus=4 --memory=512m |
| ------------------------------------------ | ---------------------- | ---------------------- |
| OpenJ9                                     | 3367 ms                | 2141 ms                |
| JITServer                                  | 3155 ms (-6%)          | 2394 ms (+12%)         |
| JITServer -Xjit:enableJITServerHeuristics  | 3261 ms (-3%)          | 2267 ms (+6%)          |

Table 1. Start-up time of AcmeAir in different configurations
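The two configurations in Table 1 correspond to Docker resource limits along these lines (image and container names are hypothetical):

```bash
# Resource-constrained scenario (1 vCPU, 150 MB):
docker run -d --name acmeair-small --cpus=1 --memory=150m acmeair-liberty
# Generously provisioned scenario (4 vCPUs, 512 MB):
docker run -d --name acmeair-large --cpus=4 --memory=512m acmeair-liberty
```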

In an attempt to minimize the effects of network latency, JITServer has a mode in which cheap compilations are performed locally at the client JVM and only expensive compilations are offloaded to the server. This mode, enabled with -Xjit:enableJITServerHeuristics, shrinks the start-up regression from +12% to just +6% when using four vCPUs. The drawback is that in the resource-constrained scenario the start-up advantage of JITServer also shrinks, from 6% to 3%. We think there is room for further improvement here.
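For reference, a sketch of the knobs involved: OpenJ9 builds that include JITServer ship a jitserver launcher, and the client JVM is pointed at it with -XX: options (the hostname below is a placeholder; 38400 is the default port):

```bash
# On the server machine: start the JITServer (listens on port 38400 by default).
jitserver

# On the client: enable JITServer mode, point at the server, and turn on the
# local/remote compilation heuristics discussed above.
export OPENJ9_JAVA_OPTIONS="-XX:+UseJITServer \
    -XX:JITServerAddress=jitserver.example.com \
    -XX:JITServerPort=38400 \
    -Xjit:enableJITServerHeuristics"
```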

Throughput experiments

Figures 2-3 show the throughput of the AcmeAir application under load. In a resource-constrained environment (see Figure 2), JITServer technology (blue line) improves ramp-up significantly for a cold run (a run with an empty shared class cache) and, to a smaller degree, for a warm run. In the latter case the JIT loads many AOT-compiled bodies from the shared class cache, a process that is fast and cheap, so the ramp-up curve is quite steep in the beginning whether or not JITServer is used. Then, as recompilations of those AOT bodies start to dominate the compilation mix, vanilla OpenJ9 begins to lag again, catching up only after ~180 seconds. In both the cold and the warm run, the CPU consumed by the compilation threads is greatly reduced (by 60-75%) with JITServer. Although in theory the JVM should spend no CPU compiling when using a JITServer, in practice the compilation threads in the client JVM must communicate with the server, an activity that consumes a non-trivial amount of resources.
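To observe this residual compilation activity yourself, OpenJ9's verbose JIT log is handy; a minimal sketch (the log path is arbitrary):

```bash
# Sketch: log every JIT compilation with timing details, then count how many
# compilations the client JVM recorded during the run.
export OPENJ9_JAVA_OPTIONS="-Xjit:verbose={compilePerformance},vlog=/tmp/vlog"
# After the run (the vlog file name gets a PID suffix appended):
grep -c '^+ (' /tmp/vlog*
```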

In an environment with plenty of computing resources (see Figure 3), the advantage of JITServer is noticeably reduced. With four vCPUs at its disposal the JVM can work through the backlog of compilation requests relatively quickly, so the negative effects of JIT compilation are felt only for a short period of time. Larger applications (with more classes and methods to compile) may see a larger benefit from JITServer. This is indeed the case for the Daytrader7 application, where even with four vCPUs it takes OpenJ9 about 200 seconds to reach the level of throughput achieved with JITServer (see Figure 5).

Effect of network latency

As explained in a previous blog, during a single JIT compilation many messages are exchanged between the client JVM and the JITServer, so network latency directly affects the duration of JIT compilations. To measure this effect we introduced an additional network switch between the SUT and the machine running the JITServer. As a result, the ping round-trip time increased from 250 to 350 microseconds, and even this small change produced a visible degradation of the ramp-up curve in AcmeAir (see Figure 6). As such, networks with latencies measured in milliseconds are probably not a good fit for JITServer technology.

Figure 6. Effect of network latency on AcmeAir rampup
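We added the extra latency with a physical switch, but a similar effect can be emulated in software with the Linux netem queueing discipline; a sketch (the interface name is a placeholder, and this is not how the measurement above was taken):

```bash
# Sketch: add 100 microseconds of artificial delay on the client's network
# interface (requires root; eth0 is a placeholder).
sudo tc qdisc add dev eth0 root netem delay 100us
# ... run the benchmark, then remove the artificial delay:
sudo tc qdisc del dev eth0 root
```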

Increasing the density of applications with JITServer

During the compilation of a method, JIT compilation threads need to allocate memory for their internal data structures; this memory is released back to the OS in its entirety at the end of the compilation. However, as shown in Figure 7, the transient memory consumption can push up the high-water mark of the process footprint, resulting in containers that are provisioned larger than they need to be.

Figure 7. Memory footprint for Daytrader7 application under load
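A quick way to observe that high-water mark for a running container is the memory cgroup Docker creates for it; a sketch assuming cgroup v1 paths and a hypothetical container name:

```bash
# Sketch: read the peak memory usage of the container's cgroup.
# Path assumes cgroup v1; on cgroup v2 read memory.peak instead.
CID=$(docker inspect --format '{{.Id}}' daytrader7)
cat /sys/fs/cgroup/memory/docker/${CID}/memory.max_usage_in_bytes
```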

To validate this point we ran Daytrader7 experiments in a configuration without swap space, as recommended by the OpenShift documentation in order to preserve QoS guarantees (according to that documentation, "Swap memory is disabled on all RHEL machines that you add to your OpenShift Container Platform cluster. You cannot enable swap memory on these machines"). We gradually increased the size of our containers (with the Docker option --memory=) until the application could run without being terminated by the out-of-memory (OOM) killer, as sketched below. We found that with OpenJ9 we needed at least 400 MB to run Daytrader7 for 10 minutes without crashes, while with JITServer technology this limit can be reduced to 310 MB.
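The search for the minimum viable limit can be expressed as a simple loop (container and image names, the step size, and the load driver are placeholders):

```bash
# Sketch: grow the memory limit in 10 MB steps until a 10-minute loaded run
# completes without the kernel's OOM killer terminating the container.
for mem in $(seq 300 10 420); do
    docker run -d --name daytrader7 --memory=${mem}m daytrader7-liberty
    ./apply_load.sh    # placeholder: drive 10 minutes of JMeter load
    oom=$(docker inspect --format '{{.State.OOMKilled}}' daytrader7)
    docker rm -f daytrader7
    if [ "$oom" = "false" ]; then
        echo "Daytrader7 survived with --memory=${mem}m"
        break
    fi
done
```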

In practice, though, users are likely to overprovision because compilation activity is unpredictable to them. In OpenJ9 a JIT compilation thread is allowed to allocate up to 256 MB of memory. To account for the possibility of a compilation approaching this limit, if Daytrader7 can run at steady state in about 310 MB, we need to set a container limit of about 550 MB. Depending on the appetite for risk, this limit could (or should) be increased further, to guard against several compilation threads approaching the limit more or less simultaneously. Adding an extra 20 MB of headroom in the JITServer scenario, for a total of 330 MB, it follows that JITServer allows containers that are (550 - 330)/550 = 40% smaller, and therefore lets you increase application density and reduce cost by the same amount.
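Spelled out as a quick check (all numbers from the text above):

```bash
# Without JITServer: ~310 MB steady state + up to 256 MB JIT scratch -> ~550 MB limit
# With JITServer:    ~310 MB steady state + ~20 MB safety headroom   -> ~330 MB limit
echo $(( 100 * (550 - 330) / 550 ))   # prints 40 (% reduction in container size)
```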

Conclusion

JIT compilers improve the performance of JVMs in the long run, but they need CPU and memory to do their job, and can therefore interfere with the smooth running of a Java application. By offloading JIT compilation to a remote JITServer process, this interference can be alleviated and the performance characteristics of the Java application, be it start-up time, ramp-up time, or peak memory footprint, can be improved. The improvements from JITServer technology are more significant for large Java applications that compile many methods and that run in resource-constrained environments. On the other hand, due to its reliance on network communication, JITServer is less suitable for environments with unreliable or high-latency network connections.
