Who am I, and why am I writing this?
My name is Cedric Hansen, and I’m currently in the final week of my summer internship on the OpenJ9 GC team in Ottawa. Coming from and living in Ottawa, my internship search quickly led me to IBM; so I applied, and here I am, writing this post to reflect on my experience and discuss the work I did this summer. I want not only to describe what I did for current J9 team members, but also to shed some light for potential future interns on the kinds of things being worked on.
What did I do?
For a quick reference, a good overview of what exactly the “GC” is can be found here: https://en.wikipedia.org/wiki/Garbage_collection_(computer_science) and, more specifically, the GC that the OpenJ9 team works on is described here: https://www-01.ibm.com/support/docview.wss?uid=swg27013824&aid=1 . In general, the GC team looks to add features, fix bugs, and make performance improvements to the OpenJ9/OMR GC. The work items I was tasked with spanned multiple areas, some of which are discussed below.
The very first task I tackled was to fix a GC issue which pretty much sounded like this: “The longer I run my app, the longer the GC pauses get, and they eventually get to be close to one second in length”. Several J9 team members had done some research on this prior to my arrival, and showed me roughly where the change had to be made. It was a reasonably simple change, but it resulted in a real issue being fixed, which was a great feeling. It was really a great way to get familiar with the tools and workflow I’d be using throughout the term.
Later in the term, I began working on something called “concurrent kickoff logic”. These changes were applied to a specific part of the GC; without getting into too many nitty-gritty details, the GC can begin a certain phase of its work before the entire application is stopped. What these changes do is look at the rate, and the variance, at which memory is being used in the heap, to determine the right time to begin that phase of the GC cycle. Many enterprise-grade applications use memory at drastically different rates throughout the lifetime of the application (i.e., they allocate memory very quickly at some points, and very slowly at others). The GC can very easily predict the right start time when memory is being steadily consumed, but this typically isn’t the case in full-scale applications. This means that when memory is not being consumed at a steady pace, the GC sometimes doesn’t have a good idea of when to start a certain phase, which can result in noticeable pauses. The changes made here look at different types and sizes of allocations, along with the variation in these numbers, to determine when the GC cycle should begin so that longer pauses don’t occur. I was very happy with this change, because it drastically reduced the number of long pauses in applications with irregular allocation patterns. It was also an interesting one for me, because it taught me a lot about how some basic statistics, along with some heuristics, can improve the performance of a piece of software. In this case, the changes resulted in fewer long GC pauses, at the expense of slightly more GC cycles (which we determined to be well worth it). See Eclipse/omr PR #4134 for a view of this change.
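To make the idea of using both the rate and its variation a bit more concrete, here is a minimal sketch in Java. It is not the code from the OMR PR; the class name, the method names, and the "mean plus a few standard deviations" rule are assumptions made purely for illustration. The sketch treats recent allocation-rate samples pessimistically: the more the rate varies, the more free heap it wants in reserve when the concurrent phase starts.

```java
import java.util.List;

// Hypothetical illustration only: not the actual OMR kickoff implementation.
public class KickoffSketch {

    /** Mean of the allocation-rate samples (bytes per second). */
    public static double mean(List<Double> samples) {
        double sum = 0.0;
        for (double s : samples) {
            sum += s;
        }
        return sum / samples.size();
    }

    /** Population standard deviation of the allocation-rate samples. */
    public static double stdDev(List<Double> samples) {
        double m = mean(samples);
        double squares = 0.0;
        for (double s : samples) {
            squares += (s - m) * (s - m);
        }
        return Math.sqrt(squares / samples.size());
    }

    /**
     * Free-heap threshold (bytes) at or below which the concurrent phase starts.
     * concurrentSeconds is an assumed estimate of how long the phase takes;
     * safetyFactor is an assumed tuning knob for how much weight variation gets.
     */
    public static double kickoffThreshold(List<Double> samples,
                                          double concurrentSeconds,
                                          double safetyFactor) {
        double pessimisticRate = mean(samples) + safetyFactor * stdDev(samples);
        return pessimisticRate * concurrentSeconds;
    }

    /** True when the remaining free heap no longer covers the pessimistic estimate. */
    public static boolean shouldKickOff(double freeBytes,
                                        List<Double> samples,
                                        double concurrentSeconds,
                                        double safetyFactor) {
        return freeBytes <= kickoffThreshold(samples, concurrentSeconds, safetyFactor);
    }

    public static void main(String[] args) {
        // Both workloads average 10 MB/s, but the second one is bursty.
        List<Double> steady = List.of(10e6, 10e6, 10e6, 10e6);
        List<Double> bursty = List.of(2e6, 18e6, 1e6, 19e6);
        System.out.println("steady threshold: " + kickoffThreshold(steady, 2.0, 2.0));
        System.out.println("bursty threshold: " + kickoffThreshold(bursty, 2.0, 2.0));
    }
}
```

With the same average rate, the bursty samples produce a higher threshold, so the phase would start earlier. That mirrors the trade-off described above: slightly more GC cycles in exchange for fewer long pauses.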
One of the hardest parts of working on GC code is that the GC is quite hard to test. The nature of the GC is quite different from most other types of software, because its behaviour depends on how another application is being used, along with the state of the machine running the program (available memory, number of CPU cores, etc.). A program called “SyntheticGCWorkload” (SGCW) exists, which allows developers to create custom memory allocation patterns through the use of simple configuration files. This gives GC developers a controlled way to put the GC in specific situations. The issue with this framework as it stood was that the allocations did not accurately simulate enterprise-grade applications with irregular allocation rates. Steady allocation rates were easy to configure with SGCW, but any sort of repetition or irregularity was incredibly cumbersome to configure. The two changes I made to this framework helped with this issue.
The first change, in AdoptOpenJDK/openjdk-tests PR #1244, was the ability to run SGCW as a “javaagent”. This change allows people to stack the allocation patterns specified in the SyntheticGCWorkload configurations ON TOP OF a regular application. Because of this change, it is now possible to see the effect of increasing allocation rates at certain points in time for a particular application. By creating a configuration file with a certain allocation pattern, let’s say 10 Mb/s for the lifetime of the application, we can now run the application with SGCW as an agent to increase the allocation rate by 10 Mb/s and observe the behaviour.
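For readers who haven’t seen a javaagent before, here is a minimal sketch of the general mechanism in Java. This is not SGCW’s code: the class name, the use of the agent argument as a bytes-per-second value, and the sleep-based pacing are all assumptions for illustration. The only standard part is the premain entry point from java.lang.instrument, which the JVM calls before the application’s own main method.

```java
import java.lang.instrument.Instrumentation;

// Hypothetical illustration only: not how SyntheticGCWorkload is implemented.
public class AllocationAgentSketch {

    // Keeps the latest chunk reachable so the allocation is not optimized away.
    static volatile byte[] sink;

    /** Size of each chunk so that chunksPerSecond chunks add up to the target rate. */
    public static int chunkSize(long bytesPerSecond, int chunksPerSecond) {
        return (int) (bytesPerSecond / chunksPerSecond);
    }

    /** Called by the JVM before the application's main() when loaded via -javaagent. */
    public static void premain(String agentArgs, Instrumentation inst) {
        // Assumption: the agent argument, if present, is a rate in bytes per second.
        long bytesPerSecond = (agentArgs == null || agentArgs.isEmpty())
                ? 10L * 1024 * 1024
                : Long.parseLong(agentArgs);
        Thread allocator = new Thread(() -> allocateForever(bytesPerSecond), "extra-allocator");
        allocator.setDaemon(true); // do not keep the JVM alive after the application exits
        allocator.start();
    }

    /** Allocates short-lived chunks at roughly the requested rate (pacing is approximate). */
    static void allocateForever(long bytesPerSecond) {
        final int chunksPerSecond = 100;
        final int size = chunkSize(bytesPerSecond, chunksPerSecond);
        try {
            while (true) {
                sink = new byte[size];
                Thread.sleep(1000 / chunksPerSecond);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

To try a sketch like this, the class would be packaged in a jar whose manifest contains a Premain-Class entry naming it, and the application would be started with the -javaagent option pointing at that jar. The application itself runs unmodified; the extra allocation simply happens alongside it.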
The second change, in AdoptOpenJDK/openjdk-tests PR #1245, is the addition of a few keywords in the configuration files, which allow allocation rates to be repeated at desired intervals. This change is quite significant for GC developers, because it allows repeated allocations to be used alongside steady memory allocations. A typical application will generally allocate lots of long-lived objects at the start (think constants, static fields, etc.), followed by several bursts throughout the application’s lifetime (think someone makes a database request, changes the screen, taps a button, etc.). The ability to add repeated allocations is what simulates the bursty behaviour of real-life applications.
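The shape of such a pattern, a steady baseline with bursts repeated at a fixed interval, can be sketched in a few lines of Java. The parameter names below are hypothetical and are not the actual keywords added to the SGCW configuration files; the sketch only shows how a repeating burst layered on a steady rate yields the combined allocation rate at a given moment.

```java
// Hypothetical illustration only: the parameter names are not SGCW keywords.
public class BurstScheduleSketch {

    /**
     * Allocation rate at time tMillis for a steady baseline plus a burst
     * that lasts burstMillis and repeats every periodMillis.
     */
    public static long rateAt(long tMillis,
                              long steadyRate,
                              long burstRate,
                              long periodMillis,
                              long burstMillis) {
        boolean inBurst = (tMillis % periodMillis) < burstMillis;
        return inBurst ? steadyRate + burstRate : steadyRate;
    }

    public static void main(String[] args) {
        // A steady 10 units/s, plus a 50 units/s burst for 1 s out of every 5 s.
        for (long t = 0; t <= 12_000; t += 1_000) {
            System.out.println(t + " ms -> " + rateAt(t, 10, 50, 5_000, 1_000));
        }
    }
}
```

Plotting the output over time gives a flat line with regular spikes, which is the kind of irregular demand that a purely steady configuration cannot reproduce.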
This picture (free heap space on the y axis, time on the x axis) shows the difference between running a traditional test program without SyntheticGCWorkload as a javaagent and without repetitions (the smooth solid line), and running it with the javaagent and repetitions (the spiky dotted line). The GC can easily predict when to start a cycle in test environments (the smooth line), but can sometimes have a tough time predicting when to start a cycle when there are big allocation spikes (the dotted line). With the help of these two changes in SGCW, along with the “concurrent kickoff logic” changes I mentioned earlier, my mentor and I were able to make changes so that the GC has far fewer aborts, percolates**, and failures (in the GC world, these things are very, very bad, and can result in application stalls of 0.5s to 10s in extremely bad cases).
One of the more unique tasks I was assigned was to investigate an issue where OpenJ9 took several minutes to boot up when the Libasan5 library was preloaded on CentOS. After several days of investigation and lots of tests, the root of the problem was found. It turns out that the Libasan5 library always occupies a certain address range in memory, and this heavily conflicts with OpenJ9, which tries to get an address range that overlaps with the range Libasan5 has already reserved, sending OpenJ9 on a long path to finding a suitable address range. The full discussion and explanation can be seen in Eclipse/openj9 Issue #6228. This was an incredible learning opportunity, because it exposed me to even more new parts of the codebase, and to more technologies (e.g., running a Docker container to simulate the behaviour of a CentOS system).
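As a toy model of why an overlapping reservation can be so costly, the Java sketch below counts how many candidate ranges a simple linear search has to reject before it finds free space. It is not OpenJ9’s allocation code: the linear stepping strategy, the names, and the numbers are assumptions for illustration only. The point is just that a large reserved block sitting on top of the preferred range turns one attempt into many.

```java
// Hypothetical illustration only: not OpenJ9's address-range search.
public class AddressProbeSketch {

    /** True when the half-open ranges [aStart, aEnd) and [bStart, bEnd) overlap. */
    public static boolean overlaps(long aStart, long aEnd, long bStart, long bEnd) {
        return aStart < bEnd && bStart < aEnd;
    }

    /**
     * Number of attempts a linear probe needs to find a range of the given size
     * that avoids [reservedStart, reservedEnd), or -1 if maxAttempts is exhausted.
     */
    public static long attemptsUntilFree(long preferredStart,
                                         long size,
                                         long step,
                                         long reservedStart,
                                         long reservedEnd,
                                         long maxAttempts) {
        long candidate = preferredStart;
        for (long attempt = 1; attempt <= maxAttempts; attempt++) {
            if (!overlaps(candidate, candidate + size, reservedStart, reservedEnd)) {
                return attempt;
            }
            candidate += step;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Nothing in the way: the preferred range is accepted on the first try.
        System.out.println(attemptsUntilFree(0x1000, 0x1000, 0x1000, 0x100000, 0x200000, 1000));
        // A reserved block covers the preferred range: several candidates are rejected first.
        System.out.println(attemptsUntilFree(0x1000, 0x1000, 0x1000, 0x1000, 0x5000, 1000));
    }
}
```

If each rejected attempt involves real work, such as asking the operating system for memory and releasing it again, the cost of a long run of rejections adds up quickly.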
What did I learn? (other than GC related things)
I’ll keep this section very brief, and aimed mostly toward prospective interns. First, I learned that a good GitHub workflow is very important. Learning how to open PRs, keep a branch up to date with the upstream repo, rebase, etc. is way more important than I initially thought. Second, basic shell scripting is a must, and it is well worth taking a few hours out of a workday to learn the basics. Lastly, I learned that it’s better to ask someone a question than to be stuck on a task for hours or days.
I want to issue a special thank you to my mentor, Aleksander Micic, for being incredibly supportive and a great teacher throughout the term. Many of the tasks I completed would not have been possible without his guidance and explanations. Also a big thank you to all members of the team for making this a great place to work and learn!
** From DeveloperWorks article Using “IBM Pattern Modeling and Analysis Tool for Java Garbage Collector”… : A scavenge which is converted into a global collection is called a percolate. Usually it’s an “aggressive” GC since a previous GC was unable to reclaim sufficient resources. It means that the GC will try as much as it can, including compaction, class unloading, softref clearing, etc.