Optimizing throughput remains a primary goal for any application developer. Before I can talk about the details of DBF scan ordering, there are some terms and concepts that must be covered first. To understand DBF scan ordering, a base knowledge of Garbage Collection (GC), OpenJ9 GC policies, and the OpenJ9 Testarossa Just-In-Time (JIT) compiler is required. Let’s begin with a brief, high-level look at each of these concepts.
Firstly, GC is a form of automatic memory management: the garbage collector attempts to reclaim memory occupied by objects that are no longer in use by the program, removing the memory-management burden from the developer. There are five primary OpenJ9 GC policies. These policies dictate the heap layout, how frequently a GC is triggered, whether cycles are “stop-the-world” (i.e. all application threads are stopped) or concurrent, and various other GC behavioural features. The five OpenJ9 GC policies are:
- optthruput – “stop-the-world” parallel collector
- optavgpause – concurrent collector
- gencon – generational copying collector
- balanced – region-based generational copying collector
- metronome – incremental soft real-time collector
We must now discuss GC scan ordering and what it means for a copying collector. Firstly, what is a copying collector? At an abstract level, a copying collector finds all live objects in a memory space by starting from root nodes (mostly objects referenced from the virtual machine (VM) stack, that is, the local variable table in each stack frame) and exhaustively tracing out all references from each root. Once this trace of all live objects is complete, the live objects are copied into a different memory space, and the entire memory space where they used to reside is cleared. All live objects therefore survive, while the memory held by unreachable objects is reclaimed and becomes available for the program to allocate new objects.
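As a toy illustration of the tracing step, here is a Java sketch that computes the live set from a set of roots. The object graph is made up for illustration, with strings standing in for heap objects; this is not OpenJ9's actual marking code.

```java
import java.util.*;

public class LiveTraceDemo {
    // Toy heap: object -> objects it references (hypothetical graph).
    static final Map<String, List<String>> REFS = Map.of(
        "A", List.of("B", "C"),
        "B", List.of("D"),
        "X", List.of("Y"));          // X and Y are unreachable from the roots

    static Set<String> liveSet(Collection<String> roots) {
        Set<String> live = new HashSet<>();
        Deque<String> work = new ArrayDeque<>(roots);
        while (!work.isEmpty()) {
            String obj = work.pop();
            if (live.add(obj))       // first visit: mark live, trace its references
                work.addAll(REFS.getOrDefault(obj, List.of()));
        }
        return live;
    }

    public static void main(String[] args) {
        // With root A, objects A, B, C, D are live; X and Y are reclaimable.
        System.out.println(liveSet(List.of("A")));
    }
}
```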
The GC scan ordering policy comes into play during the scanning and copying of these live objects to a different memory space. Within a GC cycle of a copying collector, the scan ordering policy dictates the order in which live objects are scanned and copied. Let’s look at an example of a copying collector that uses a Breadth-First (BF) scan ordering policy. The following diagram shows a memory space that has been fully allocated, requiring a GC cycle to reclaim memory from unreachable objects. Figure 1 shows a visual trace of the objects within the memory space after the marking phase of the GC cycle, with GC roots, reachable objects, and unreachable objects labelled.
Ignoring Figure 1 for the moment, let’s take a look at a single root object ‘A’, with a live object trace as shown in Figure 2. Root object ‘A’ has two fields, ‘B’ and ‘C’. Object ‘B’ has two fields, ‘D’ and ‘E’, and object ‘C’ has two fields, ‘F’ and ‘G’. Copying this object graph from the full memory space to a new memory space with a BF scan ordering policy would result in objects being copied in the following order: A –> B –> C –> D –> E –> F –> G.
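The BF copy order above can be sketched with a toy work-queue model in Java. Strings stand in for heap objects and the field table mirrors Figure 2; this is an illustration of the ordering only, not OpenJ9's copy code.

```java
import java.util.*;

public class BfCopyDemo {
    // Object graph from Figure 2: object -> its reference fields, in field order.
    static final Map<String, List<String>> FIELDS = Map.of(
        "A", List.of("B", "C"),
        "B", List.of("D", "E"),
        "C", List.of("F", "G"));

    static List<String> breadthFirstCopyOrder(String root) {
        List<String> order = new ArrayList<>();
        Set<String> copied = new HashSet<>();
        Deque<String> scanQueue = new ArrayDeque<>();
        copied.add(root);
        order.add(root);                 // copy the root first
        scanQueue.add(root);
        while (!scanQueue.isEmpty()) {
            String obj = scanQueue.poll();              // scan oldest copied object
            for (String field : FIELDS.getOrDefault(obj, List.of())) {
                if (copied.add(field)) {                // copy each uncopied referent
                    order.add(field);
                    scanQueue.add(field);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(breadthFirstCopyOrder("A")); // prints [A, B, C, D, E, F, G]
    }
}
```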
Now that it is clear what a GC scan ordering policy dictates, let’s briefly discuss some key concepts of locality. Most readers are likely already familiar with these, so I will keep it short. Here are some definition refreshers:
- 90/10 rule: a program spends 90% of its time in 10% of its code
- Spatial locality: items whose addresses are near one another tend to be referenced close together in time
- Temporal locality: recently accessed items are likely to be accessed in the near future
- Caching: storing data in a temporary storage space or memory that allows fast access to the data
- Cache prefetching: fetching instructions or data from their original storage in slower memory to a faster local memory before this information is actually needed
- Caching hit-to-miss ratio: the ratio of cache hits to cache misses for cache memory requests
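To make the definitions above concrete, here is a toy direct-mapped cache simulation in Java. The cache geometry (64-byte lines, 256 lines) and the access patterns are made up for illustration; it shows why spatially local accesses produce a better hit-to-miss ratio than scattered ones.

```java
import java.util.*;

public class CacheSimDemo {
    // Hypothetical cache geometry: 64-byte lines, 256 direct-mapped entries.
    static final int LINE = 64, LINES = 256;

    // Fraction of accesses that hit in the simulated cache.
    static double hitRatio(long[] addresses) {
        long[] tags = new long[LINES];
        Arrays.fill(tags, -1);
        int hits = 0;
        for (long a : addresses) {
            long line = a / LINE;                 // which cache line this address uses
            int idx = (int) (line % LINES);       // direct-mapped slot
            if (tags[idx] == line) hits++;
            else tags[idx] = line;                // miss: fill the slot
        }
        return (double) hits / addresses.length;
    }

    public static void main(String[] args) {
        // Sequential (spatially local) accesses: neighbours share cache lines,
        // so 8 consecutive 8-byte reads hit the same 64-byte line.
        long[] seq = new long[4096];
        for (int i = 0; i < seq.length; i++) seq[i] = 8L * i;
        // Scattered accesses: nearly every access touches a fresh line.
        long[] scattered = new long[4096];
        Random r = new Random(42);
        for (int i = 0; i < scattered.length; i++) scattered[i] = r.nextLong() >>> 1;
        System.out.println(hitRatio(seq) + " vs " + hitRatio(scattered));
    }
}
```

The sequential pattern hits 7 times out of every 8 accesses (only the first touch of each line misses), while the scattered pattern misses almost every time.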
There is generally a correlation between an application’s tendency to follow typical locality concepts and its overall throughput, as systems are typically designed to exploit locality. The more an application exhibits locality tendencies, the more its performance can be optimized through existing locality-exploiting techniques (e.g. cache prefetching). According to the 90/10 rule, there are likely some very “hot” (frequently used) object access patterns and very “hot” object fields. The OpenJ9 Testarossa JIT compiler, OpenJ9’s dynamic optimizing compiler that turns Java bytecode into much faster native code, is aware of an object’s hot fields. However, this hot field information is currently not exploited to improve the spatial locality of an object and its hot fields: during a GC cycle, the GC makes no attempt to dynamically force object reference locality between hot object access patterns. This is exactly what DBF scan ordering achieves.
DBF scan ordering copies objects in a BF manner during a GC cycle; however, every object that gets copied is scanned for hot fields, and if hot fields are found, the GC copy algorithm dynamically and recursively copies them directly after the initial object is copied. This forces locality improvements. Let’s look at this further with an example. Take the single root object ‘A’ with the live object trace that we looked at before, but now with percentage distributions denoting how often each object’s fields are accessed (see Figure 3).
With a typical BF copy order, the objects in Figure 3 would be copied during a GC in the following order: A –> B –> C –> D –> E –> F –> G. However, this can leave objects that are frequently accessed together in different cache lines, resulting in a higher cache miss percentage. Copying objects that are frequently accessed together one after another raises the probability that they land in the same cache line, leading to a shorter average object access time.
This is the primary purpose and benefit of DBF scan ordering. With DBF scan ordering, every object that gets copied is scanned for hot fields, and if the object contains hot fields, the GC copy algorithm immediately and recursively depth-copies them.
Let’s look at an example to see exactly how this affects the order in which objects are copied, and how this scan ordering produces improved object locality compared to a BF scan ordering policy. For the object graph in Figure 3, the DBF scan ordering policy produces the following copy order: A –> C –> G –> B –> E –> D –> F. How did we get that?
1. The root object ‘A’ is copied. Scanning ‘A’ for hot fields returns ‘C’, so instead of continuing in a BF manner and copying ‘B’, we dynamically copy ‘C’.
2. Scanning ‘C’ for hot fields returns ‘G’, so we dynamically depth-copy ‘G’.
3. ‘G’ has no hot fields, so we resume from where we deviated from BF, i.e. after ‘A’ was copied, and copy ‘B’.
4. Scanning ‘B’ for hot fields returns ‘E’, so we dynamically copy ‘E’.
5. ‘E’ has no hot fields, so we resume from where we deviated from BF, i.e. after ‘B’ was copied. The next object, ‘C’, has already been copied, so we skip it and copy ‘D’.
6. ‘D’ has no hot fields, so we continue in a BF manner: ‘E’ is skipped (already copied), ‘F’ is copied, and finally ‘G’ is skipped (already copied).
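The walkthrough above can be modelled with a small Java sketch. Strings again stand in for heap objects, the hot-field map is a hypothetical stand-in for the JIT's hot-field data in Figure 3, and the exact bookkeeping is an assumption made for illustration; it is not OpenJ9's copy code.

```java
import java.util.*;

public class DbfCopyDemo {
    // Same object graph as Figure 2: object -> reference fields, in field order.
    static final Map<String, List<String>> FIELDS = Map.of(
        "A", List.of("B", "C"),
        "B", List.of("D", "E"),
        "C", List.of("F", "G"));

    // Hypothetical hot-field data (standing in for Figure 3): the field of each
    // object that is accessed most frequently, if it has one.
    static final Map<String, String> HOT_FIELD = Map.of(
        "A", "C",
        "B", "E",
        "C", "G");

    // Copy obj, then immediately depth-copy its chain of hot fields.
    static void hotChainCopy(String obj, List<String> order, Set<String> copied) {
        while (obj != null && copied.add(obj)) {
            order.add(obj);               // "copy" the object
            obj = HOT_FIELD.get(obj);     // then follow its hot field depth-first
        }
    }

    static List<String> dbfCopyOrder(String root) {
        List<String> order = new ArrayList<>();
        Set<String> copied = new HashSet<>();
        Set<String> queued = new HashSet<>();
        Deque<String> scanQueue = new ArrayDeque<>();
        hotChainCopy(root, order, copied);
        queued.add(root);
        scanQueue.add(root);
        while (!scanQueue.isEmpty()) {
            String obj = scanQueue.poll();
            for (String field : FIELDS.getOrDefault(obj, List.of())) {
                hotChainCopy(field, order, copied);   // copy on first encounter
                if (queued.add(field))
                    scanQueue.add(field);             // still scanned for cold fields
            }
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(dbfCopyOrder("A")); // prints [A, C, G, B, E, D, F]
    }
}
```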
To summarize, DBF scan ordering copies objects in a BF manner, deviating to depth-first copying whenever an object contains hot fields. This dynamic scan ordering policy is implemented efficiently and has increased throughput on various applications, with some seeing improvements of over 5% compared to a traditional BF scan ordering policy. The more an application exhibits locality tendencies, the more performance can be gained from a DBF scan ordering policy.
How To Get Started?
Enabling DBF scan ordering is done via the Java command-line interface; however, you must use either the ‘gencon’ or ‘balanced’ OpenJ9 GC policy. For the ‘balanced’ GC policy, DBF scan ordering is enabled by default. DBF scan ordering with ‘gencon’, on the other hand, will probably not be enabled by default for some time, as ‘gencon’ uses its own unique scan ordering policy (perhaps discussed in a future post). Currently, to try DBF scan ordering with ‘gencon’ (which may or may not be beneficial depending on your workload), you must add the command-line option “-Xgc:dynamicBreadthFirstScanOrdering”.
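For example, an invocation might look like the following (the application jar name is hypothetical; only the two GC options come from this post):

```shell
# With gencon, DBF scan ordering must be requested explicitly:
java -Xgcpolicy:gencon -Xgc:dynamicBreadthFirstScanOrdering -jar myapp.jar

# With balanced, DBF scan ordering is already on by default:
java -Xgcpolicy:balanced -jar myapp.jar
```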
To wrap up, DBF scan ordering makes it possible to better spatially localize, in memory, objects that are frequently accessed together. The DBF scan ordering policy will likely yield a better cache hit ratio than standard BF scan ordering, leading to increased overall application throughput at a low overhead cost. Enjoy!