This post is meant to serve as a technical overview of the Concurrent Scavenge (CS) Garbage Collection (GC) policy, from the perspective of a Just-In-Time (JIT) compiler. It explains various CS and GC concepts, serving as background to later discussion on implementation details in JIT compiler’s Code Generator (CodeGen) and Optimizer, as well as the interaction between GC and JIT-compiled code, for readers to understand the technical challenge and solution. We were able to achieve good performance results
in both throughput and response time, when compared to the non-CS GC.
For additional information, a YouTube presentation on this topic can be found here.
What is Concurrent Scavenge GC?
The Concurrent Scavenge (CS) GC policy, also known as “Pause-less GC”, is a new GC mode based on the existing gencon GC policy. It allows garbage collection to run virtually concurrently with application threads without having to pause the JVM, hence providing greatly reduced GC pauses.
Why have Concurrent Scavenge GC?
The Java language allows programmers to create objects without having to worry about freeing them. Inside the Java runtime, objects are allocated in a region of committed address space called heap, and heap memory management is handled by the garbage collector autonomously. A typical Java program allocates objects in the heap during execution. When the heap is full, the application needs to be paused, then the GC takes control of the JVM and frees up space in the heap. When GC finishes, it relinquishes control of the JVM and the application resumes execution. For the duration of each GC cycle, the application appears unresponsive. This unresponsiveness is especially detrimental to applications with strict response time requirements.
The CS GC aims to address this issue by minimizing the amount of time an application is paused due to GC. It enables the GC to run most of its operations concurrently with the application threads without the need to pause them. The benefit is greatly improved application response time.
Key Components in Concurrent Scavenge
CS is based on the generational collector that is part of the gencon GC policy. The generational collector is also known simply as the scavenger collector. CS utilizes all existing building blocks of scavenger, and implements its own technological components. Several important ones are discussed here:
In the generational collector, the heap is divided into two regions:
— the tenure space where old, long-living objects reside.
— the nursery space where new, short-lived objects reside.
The nursery space is further divided into two sub-sections:
— The allocate space where new objects are allocated in.
— The survivor space, where live objects from the allocate space are copied during GC.
Survivor space is reserved for GC use, application threads only have access to allocate space for new object allocations, and tenure space is used for special allocations such as large objects and arrays.
When the allocate space becomes full, GC need to find all live objects by tracing the object reference graph and relocate (i.e. copy) them to either the survivor space if an object is young, or to the tenure space when an object is old. Each time an object is copied from allocate space to survivor space, its age is incremented by one. If an object’s age reaches a certain threshold, it is considered old and will be copied to the tenure space in the next GC cycle. The age is encoded with bits in header of an object. A generational collection ends when all live objects have been copied to either survivor space or tenure space. At this point, the survivor space becomes the new allocate space, and the allocate space becomes the new survivor space. When the new allocate space runs out of memory, a new GC cycle kicks off and the same process repeats.
As mentioned above, a live object is collected by copying it to the appropriate destination space, either survivor or tenure. How is the relocation performed?
First step is to determine if a live object needs to be relocated. Only live objects in allocate space are required to be relocated. When GC reaches a live object while traversing the object reference graph, a range check is needed to determine its location. If the object is in tenure space, no action is required as tenure space is not collected in generational collection. In a well-tuned gencon GC, the tenure space is rarely expected to run out of memory as almost all garbage consists of short-lived nursery objects.
Each GC thread has its own thread-private chunk of free memory from survivor space. When an object is checked to be in allocate space, a GC thread simply copies the object (header and data) to its thread-private memory. The next step is installing a forwarding pointer at the old object’s header, which points to the new location of the object, using an atomic operation like compare-and-swap (CAS). The figures below illustrate the described object relocation algorithm.
In the case when multiple GC threads all reach the same object and try to copy it, each can proceed with memcopy without contention since the destination memory is thread-private. Each thread then executes CAS to install the forwarding pointer. Only one thread will succeed, it updates the reference field slot that points to the old location to the new location. All other threads that fail CAS are aware another thread has beaten them to it, they read the forwarding pointer value and update their references with the forwarding pointer. The forwarding pointer is low tagged to differentiate from regular reference fields. The benefit of this algorithm is it requires no extra memory; and all steps involved from the memcopy to updating the reference field slot are done quickly in one pass.
Pause Time Characteristic
As mentioned previously, scavenger is a stop-the-world collection where all application threads (blue sections) must be suspended during GC. This causes random, significant pauses (red section) during the application execution, resulting in non-deterministic behavior and occasional long response times.
Concurrent Scavenger Timing
With CS, the cost of a large stop-the-world pause is amortized across application threads (blue sections) and GC background threads (pink section). When allocate space runs out of memory, there is a very short initial stop-the-world pause (red section on left) in which the GC scans a root set of objects (thread stacks, static objects) to establish a starting point for live object tracing. The initial pause completes quickly, and application threads resume. Instead of GC performing the entire operation in one pause, each application thread is now required to perform object relocation: every time it loads a reference field, it performs a range check on the referenced object to determine if it is in the allocate space and needs to be relocated; if so, the application thread now acts on behalf of GC and executes the object relocation algorithm outlined previously (the green section). In addition to application threads helping collect live objects, there are background GC threads running at lower priority that collect any live objects not relocated by application threads. When all live objects are relocated, there is another short final stop-the-world pause (red section on right) where the GC performs bookkeeping tasks and tidies things up. With CS, the time when the application is paused is greatly reduced compared to scavenger.
Steps to Access a Reference Field under CS
With CS GC mode, loading a reference field is no longer a simple instruction. When a reference field is accessed, we must check to see whether GC’s involvement is required. If so, GC must be notified to perform object-copy and related book-keeping. These operations are encapsulated in a VM helper called “read barrier”. JIT can call this helper when it deems it as necessary.
From the JIT compiler’s perspective, the key to access a reference field is to determine whether JIT needs to call the VM read barrier helper. This is achieved by checking if the recently loaded reference points to a region specified by GC.
Hardware-assisted JIT Implementation
CS is first implemented on IBM z14™ hardware, which has Guarded Storage Facility. It guards up to 64 segments of contiguous memory and has two Guarded Loaded Instructions: Load Guarded and Load Logical and Shift Guarded. The former one is used under default mode (a.k.a. large-heap or non-compressedrefs) and the later one is used under compressedrefs mode.
The two Guarded load Instructions behave as ordinary loads except they trap when the loaded value is within one of the 64 guarded regions. In the case of CS, a specialized signal handler catches the signal then notifies GC if necessary.
Pure Software JIT Implementation
Read Barrier is introduced in the JIT to implement CS on systems without hardware assistance as described in the previous section. It is a generic concept for load with side-effects. For example, IBM z14™ Guarded Loaded Instructions are load instructions but may trap under certain conditions.
A group of new IL opcode are introduced to represent read barriers in the JIT, one read barrier corresponds to one load opcode. The side-effect of read barriers is to check the loaded value and then notify GC if necessary. The key to implement CS becomes replacing the correct set of load instructions with read barriers.
As GC only needs to be notified when reading a reference field, only two types of read barriers are needed: irdbari for compressedrefs mode and ardbari for default mode (a.k.a. large-heap or non-compressedrefs). The most intuitive way to generate read barriers is in the IL generation phase. However, for a set of new opcodes, especially a set of generic opcodes with arbitrary side-effects, it is difficult for the JIT optimizer to perform the most optimal transformations on the resulting IL. A key observation is that read barriers are actually not needed till the native code generation phase. Once all optimizations are completed, similar to compressedrefs, during the code generation phase, IL is traversed to find all aloadi nodes. These nodes are replaced with read barrier nodes if necessary. Thus, read barriers are transparent to the optimizer, it can perform the same transformations as before.
In summary, CS can be implemented by pure software based read barriers.
Code Generation Support for CS
Functions and Variables
A call to shouldGenerateReadBarriersForFieldLoads() is used as a check to see if CS is enabled when compiling a method. If CS is enabled, this call will return true. The JIT can use this to determine if it is necessary to generate runtime checks for CS and calls to the read barrier helper in the JIT compiled code.
CS also uses four VM Thread fields: readBarrierRangeCheckBase, readBarrierRangeCheckTop, readBarrierRangeCheckBaseCompressed and readBarrierRangeCheckTopCompressed. readBarrierRangeCheckBase and readBarrierRangeCheckTop are used to indicate the boundaries of the allocate space. readBarrierRangeCheckBase is the lowest address of the region while readBarrierRangeCheckTop is the highest address. If the address of a loaded object falls between these two values, it will be necessary to call GC’s read barrier helper.
readBarrierRangeCheckBaseCompressed and readBarrierRangeCheckTopCompressed are used when compressed references are enabled. They hold compressed versions of the allocate space boundaries which can be directly compared against loaded compressed objects.These four fields are updated during the short stop-the-world GC pauses before and after CS occurs.
When CS is not currently in progress, the four VM Thread fields are given special values that they can’t normally have. The two bases are set to the maximum unsigned values while the two tops are set to 0. As a result, it is possible for JIT generated code to determine if CS is currently in progress by checking for these special values.
J9ReadBarrier VM Helper
When JIT generated code loads an object that appears to be inside the allocate space while CS is in progress, it calls out to the J9ReadBarrier VM Helper. As input to the helper the call passes along the VM Thread and the address of the location it tried to load from. If it turns out that the object has already been moved, the load location is updated with the new address. If the object has not been moved yet, the object is first moved out of the allocate space before the load location is updated. After the helper returns to the main line code, the main line code needs to redo the load to get the updated location of the object.
Perform load: objAddr = obj.objField
if (objAddr >= vmThread.readBarrierRangeCheckBase) goto OutlinedReadBarrier
Rest of the code
if (objAddr > vmThread.readBarrierRangeCheckTop) goto End
Call J9ReadBarrier VM Helper
Redo load: objAddr = object.objField
This is pseudocode that roughly represents the code generated by the irdbar and ardbar evaluators. ardbar evaluator is used for non-compressed refs while irdbar is used for compressed refs. The first step is the load is performed to get the address of the object. The address of the object is checked against base to see if it should jump to the outlined code where it checks the object against top. If the object is between base and top, a call to the read barrier is made. After returning from the read barrier the load must be performed again to get the updated address of the object before exiting the outlined code section and continuing on.
readBarrierRangeCheckBase and readBarrierRangeCheckTop are used when handling non-compressed refs while readBarrierRangeCheckBaseCompressed and readBarrierRangeCheckTopCompressed are used when handling non-compressedrefs. As a result, it is not necessary to decompress the object before comparing it to the base and top.
Special Case: Compare and Swap Object
sun/misc/Unsafe.compareAndSwapObject(Object, long, Object, Object)Z
Loads that were converted into read barrier opcodes are not the only place where read barriers are needed. There are other places where JIT compiled code attempts to load an object that might require a read barrier. One of these is compare and swap object.
The first object and the long type offset are used to determine the address to load an object from. The loaded object is compared against the second object. If they are the same, the third object is atomically swapped into where the loaded object was loaded from. The second and third object will have already gone through a check against base and top when they were originally loaded. However, the loaded object that came from the first object and its offset still needs to be checked.
Perform load: objAddr = *(object + offset)
if (objAddr >= vmThread.readBarrierRangeCheckBase1) goto OutlinedReadBarrier
if (objAddr > vmThread.readBarrierRangeCheckTop1) goto End
Call J9ReadBarrier VM Helper
The code generated for the read barrier check for compare and swap is fairly similar to the code generated for the read barrier opcodes. A regular load is used to read the first object. It is checked against base and top to see if it is inside the allocate space. If so, the read barrier is called. Unlike the read barrier case, an extra reload is not necessary after returning to mainline code. The generated code to perform the compare and swap will already handle reloading the object.
Special Case: Reference Arraycopy
Another special case that requires read barriers is when copying an array of objects from one location to another. The generated code takes a different approach here. Instead of reading objects one at a time and checking if they are in the evacuate region there’s just a single check to see if CS is currently in progress and moving objects around. If so, the JIT generated code gives up on trying to perform the array copy by itself and instead calls the reference array copy helper in the VM.
if (0 != vmThread.readBarrierRangeCheckTop) goto OutlinedHelper
JIT code to perform arraycopy
Rest of the code
Call referenceArrayCopy VM Helper
This is pseudocode that represents the code generated by the arraycopy evaluator. It checks to see if top is equal to 0 to see if CS is in progress or not. Remember that top is only set to a special value of 0 when CS is not in progress. If CS is in progress, a call to the reference arraycopy VM helper is made which takes care of the entire arraycopy. If CS is not in progress, the JIT optimized code version of arraycopy is performed.
An industry-standard benchmark was used to measure throughput and response time impact of CS compared to scavenger. For both scores, the higher the better. The system is a 14-core Skylake machine with 52GB memory.
For throughput, CS performs ~8% worse than scavenger, due to the overhead of object range check and relocation by application threads. Response time is where CS shines, it achieves a ~40% uplift compared to scavenger, due to its low pause run-time characteristics.
Here we compare the pause time between scavenger and CS. In the table, the average pause time of scavenger is 143 msec, while that of CS is only 6.3 msec. In a graphical representation, the CS pause times are consistent around 10 msec mark, while scavenger pause times range from 60 to over 100 msec.
Special thanks to Victor Ding, Jimmy Kwa, and Aleks Micic as contributing authors to this content.