Tuning Garbage Collection for Mission-Critical Java Applications
I recently had the opportunity to test and tune the performance of several shop and portal applications built with Java and running on the Sun/Oracle JVM, among them some of the most visited in Germany. In many cases garbage collection is a key aspect of Java server performance. In the following article we take a look at the state-of-the-art advanced GC algorithms and important tuning options and compare them for diverse real-world scenarios.
From the point of view of garbage collection, Java server applications have widely varying requirements:
- Some are high-traffic applications serving a huge number of requests and creating a huge number of objects. Sometimes, moderate-traffic applications using wasteful software frameworks do the same thing. Either way, cleaning up these objects efficiently is a challenge for the garbage collector.
- Others have extremely long uptimes and require a constant quality of service during that uptime without slow degradation or the risk of sudden deterioration.
- Some place tight limits on their user response times (as in the online gaming or betting area) which do not leave much room for extended GC pauses.
In many cases you will find a combination of several of these requirements with different priorities. Several of my sample shops and portals were very demanding with respect to the first requirement, and one put extreme priority on the second, but most applications are not extremely demanding in all three aspects at the same time. This leaves you the necessary room to choose the right tradeoffs.
Out-of-the-Box GC Performance
JVMs have improved a lot but still cannot do your job of optimizing the runtime for your application. Default JVM settings have a fourth priority in mind in addition to the 3 mentioned above: minimizing the memory footprint. They need to support millions of users who do not run Java on a server with plenty of memory. This is even true for many e-business products which are most of the time preconfigured to run on developer notebooks instead of production servers. As a consequence, if you run your server with a minimal set of heap and GC parameters like the following
java -Xmx1024m -XX:MaxPermSize=256m -cp Portal.jar my.portal.Portal
you will almost certainly obtain results which are not good enough for efficient server operation. As a first step, it is good practice to configure not only memory limits but also initial sizes to avoid costly step-by-step increases during server startup. Whenever you know how much memory is enough for your server (which you should try to find out in time) it is best to make initial sizes and limits equal by adding
-Xms1024m -XX:PermSize=256m
The last basic option frequently found in JVM configurations is a similar pair of settings, -XX:NewSize and -XX:MaxNewSize, which fix the initial and maximum size of the so-called New generation heap.
These and other more sophisticated settings are explained in the next sections, but let’s first look at how the garbage collector works with them in a load test for one of our portal samples on a rather slow test server:
The blue curve shows the occupied total heap as a function of time, vertical grey lines show the duration of GC pauses.
In addition to these graphs, key indicators of GC operation and performance are shown on the right-hand side. First we have a look at the average amount of garbage created (and collected) in this test run. The value of 30.5 MB/s is marked in yellow because it is a considerable but still moderate garbage creation rate, just about right for an introductory GC tuning example. Other values indicate how well the JVM copes with cleaning up that amount of garbage: 99.55% of that garbage is cleaned up in the New generation and only 0.45% in the Old generation which is rather good and therefore marked green.
Why this is good can be seen from the pauses the GC activity imposes on the JVM (and all the worker threads executing user requests): There are numerous and rather short New generation GC pauses. They occurred on average every 6 seconds and lasted less than 50 milliseconds. Such pauses stopped the JVM during 0.77% of wall time but any single pause is unnoticeable to the users waiting for the server’s response.
On the other hand, Old generation GC pauses stop the JVM during only 0.19% of the time. But in that time they clean up only 0.45% of the garbage, while 99.55% is cleaned up during the 0.77% of New generation pause time. This shows how extremely inefficient Old generation garbage collection is compared to New generation GC. In addition, Old generation pauses occurred less than once per hour on average but lasted almost 8 seconds on average, with a single outlier even reaching 19 seconds. As these are true pauses for all the JVM’s threads processing user requests, they should be as infrequent and short as possible.
From these observations follows the basic tuning goal for generational garbage collection:
Collect as much garbage as possible already in New generation and make Old generation pauses as infrequent and short as possible.
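The efficiency gap behind this goal can be made explicit with a little arithmetic. The following is only an illustrative sketch using the percentages quoted above; the class and method names are made up for this example and are not part of any GC tooling.

```java
// Illustrative arithmetic only: inputs are the percentages quoted in the text.
final class GcEfficiency {

    /** Share of all garbage collected per share of wall time spent in pauses. */
    static double garbagePerPauseShare(double garbagePercent, double pausePercent) {
        return garbagePercent / pausePercent;
    }

    /** How many times more garbage New generation GC removes per unit of pause time. */
    static double newVersusOld() {
        double newGen = garbagePerPauseShare(99.55, 0.77); // New generation figures
        double oldGen = garbagePerPauseShare(0.45, 0.19);  // Old generation figures
        return newGen / oldGen;
    }

    public static void main(String[] args) {
        System.out.printf("New generation GC is about %.0f times more efficient%n",
                newVersusOld());
    }
}
```

With the numbers from this test run, New generation collection removes roughly 55 times more garbage per unit of pause time than Old generation collection.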
Basic Ideas of Generational Garbage Collection and Heap Sizing
The Java heap is made up of the Perm, Old and New (sometimes called Young) generations. The New generation is further made up of Eden space where objects are created and Survivor spaces S0 and S1 where they are kept later for a limited number of New generation garbage collection cycles. If you want more details, you might want to read Sun/Oracle’s whitepaper “Memory Management in the Java HotSpot Virtual Machine”.
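To see these spaces in a running JVM, the standard java.lang.management API can list the memory pools. This is a minimal sketch; the class name HeapPools is made up for this example, and the pool names printed depend on the JVM version and the collector in use, so do not rely on specific names.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Lists the memory pools (Eden, Survivor, Old, ...) reported by the running JVM.
final class HeapPools {

    /** Names of all memory pools; the exact names vary by JVM and collector. */
    static List<String> poolNames() {
        List<String> names = new ArrayList<String>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            names.add(pool.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // pool is no longer valid
            }
            // getMax() returns -1 when no maximum is defined for the pool
            System.out.printf("%-32s %-16s used=%d KB max=%d KB%n",
                    pool.getName(), pool.getType(),
                    usage.getUsed() / 1024,
                    usage.getMax() < 0 ? -1 : usage.getMax() / 1024);
        }
    }
}
```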
By default the New generation as a whole, and the survivor spaces in particular, are too small to hold objects long enough until most of them are no longer needed and can be collected. Therefore, they are moved to the Old generation prematurely which will then fill up too fast and need to be cleaned up frequently which causes relatively many of the Full GC stops visible in figure 1 above.
Tuning the Generation Sizes
Tuning generational GC means making the New generation as a whole and in particular the survivor spaces larger than they are out-of-the-box. But to do this you also have to consider the GC algorithm used:
The default GC algorithm of a Sun/Oracle JVM running on today’s hardware is called ParallelGC and if it were not the default it could be configured explicitly using the JVM parameter
-XX:+UseParallelGC
This algorithm by default does not work with fixed sizes for Eden and the survivor spaces but uses a policy called “AdaptiveSizePolicy”, which is a feedback-driven automatic sizing strategy. As described above, it delivers reasonable behavior for many scenarios including non-server usage but it is not optimal for server operation. To switch it off and start setting your survivor sizes explicitly to fixed values use the following JVM configuration switch:
-XX:-UseAdaptiveSizePolicy
Once this has been done, we can not only further increase the New generation but also effectively set the survivor sizes to a suitable value:
-XX:NewSize=400m -XX:MaxNewSize=400m -XX:SurvivorRatio=6
“-XX:SurvivorRatio=6” means that each survivor space is 1/6 of Eden size or 1/8 of total New generation size (Eden plus two survivor spaces), which in this case means 50 MB, while adaptive sizing usually works with much smaller sizes in the range of only a few MB. By repeating the same load test as above with these settings we got the following result:
Note that during this test run of doubled duration there was on average almost the same garbage creation rate as before (30.2 compared to 30.5 MB/s). Nevertheless, there were only two Old generation (Full) GC pauses, no more than one in 25 hours. This was achieved by decreasing the rate of garbage ending up in the Old generation (the so-called promotion rate) from 137 kB/s to 6 kB/s or only 0.02% of all garbage. At the same time New generation GC pause duration increased only slightly from an average of 48 to 57 milliseconds and the average interval between pauses rose from 6 to 10 seconds. Altogether, switching off adaptive sizing and fine tuning the heap sizes decreased GC pause time from 0.95% to 0.59% of elapsed time which is an excellent result.
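The promotion percentages quoted here follow directly from the two rates. The sketch below redoes that arithmetic; it assumes 1 MB = 1000 kB for the conversion, and the class name is made up for this example.

```java
// Illustrative arithmetic only: relates the promotion rate to the garbage creation rate.
final class PromotionShare {

    /**
     * Percentage of all garbage that ends up in the Old generation.
     * Assumes 1 MB = 1000 kB when converting the garbage creation rate.
     */
    static double promotedPercent(double promotionKbPerSec, double garbageMbPerSec) {
        return 100.0 * promotionKbPerSec / (garbageMbPerSec * 1000.0);
    }

    public static void main(String[] args) {
        // Before tuning: 137 kB/s promoted at 30.5 MB/s garbage creation
        System.out.printf("before: %.2f%%%n", promotedPercent(137, 30.5));
        // After tuning: 6 kB/s promoted at 30.2 MB/s garbage creation
        System.out.printf("after:  %.2f%%%n", promotedPercent(6, 30.2));
    }
}
```

Under that assumption the results come out at about 0.45% before and 0.02% after tuning, matching the indicators discussed above.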
Similar results after tuning can be obtained with the ParNew algorithm as an alternative to the default ParallelGC. It was developed for compatibility with the CMS algorithm mentioned below and can be configured by -XX:+UseParNewGC. It does not use adaptive sizing but works with fixed values for the survivor sizes. Therefore, and with its default of -XX:SurvivorRatio=8, it usually delivers much better out-of-the-box results for server usage than the ParallelGC.
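The survivor space arithmetic used in this section can be sketched as follows. This is a simplified model: the New generation is treated as Eden plus two equal survivor spaces, and the alignment and rounding a real JVM applies are ignored. The class and method names are made up for this example.

```java
// Simplified model of how SurvivorRatio splits the New generation.
// Real JVMs align these sizes, so actual values may differ slightly.
final class SurvivorSizing {

    /** Size of one survivor space: New = Eden + 2 survivors, Eden = ratio * survivor. */
    static long survivorMb(long newGenMb, int survivorRatio) {
        return newGenMb / (survivorRatio + 2);
    }

    /** Eden is what remains after both survivor spaces are taken out. */
    static long edenMb(long newGenMb, int survivorRatio) {
        return newGenMb - 2 * survivorMb(newGenMb, survivorRatio);
    }

    public static void main(String[] args) {
        // The tuned configuration from the text: 400 MB New generation, SurvivorRatio=6
        System.out.println("ratio 6: survivor=" + survivorMb(400, 6)
                + " MB, eden=" + edenMb(400, 6) + " MB");
        // The same New generation with a SurvivorRatio of 8
        System.out.println("ratio 8: survivor=" + survivorMb(400, 8)
                + " MB, eden=" + edenMb(400, 8) + " MB");
    }
}
```

For the 400 MB New generation this gives 50 MB survivor spaces and a 300 MB Eden with a ratio of 6, and 40 MB survivor spaces with a 320 MB Eden with a ratio of 8.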
Getting rid of long Old Generation GC Pauses
The only remaining problem with the latest result above is the long Old generation (Full) GC pauses of about 8 seconds on average. These pauses have been made rare by proper generation tuning but when they occur they are still a nuisance to users because the JVM is not executing worker threads while they last (stop-the-world GC). In our case, these 8 seconds are caused by an old and slow test server and could be up to a factor of 3 faster on modern hardware. On the other hand, today’s applications typically also use larger heaps than 1 GB and have larger amounts of live objects in the heap than in this example. Web applications nowadays work with heaps up to 64 GB and (at least temporarily) need half of that for their live objects. In such cases, 8 seconds is short for Old generation pauses. They can easily come close to one minute which is totally unacceptable for an interactive web application.
One option to alleviate the problem is the use of parallel processing for Old generation GC. By default, the ParallelGC and ParNew GC algorithms in Java 6 used multiple GC threads only for New generation collections while Old generation collections were single-threaded. In the case of the ParallelGC collector this can be changed by adding -XX:+UseParallelOldGC. Since Java 7 this option is activated by default together with -XX:+UseParallelGC. However, even with 4 or 8 CPU cores in your system you should not expect much more than an improvement by a factor of 2, often less. In some cases, as in our 8 seconds example above, this can be a welcome improvement but in other more extreme cases it is not enough. The solution is to use low-latency GC algorithms.
The Concurrent Mark and Sweep (CMS) Collector
The CMS garbage collector is the first and most-widely used low-latency collector. It has been available since Java 1.4.2 but suffered from instability issues in the beginning. Solving them required quite a few Java 5 releases.
As indicated by its name the CMS collector uses a concurrent approach where most of the work is done by a GC thread that runs concurrently with the worker threads processing user requests. A single normal Old generation stop-the-world GC run is split up into two much shorter stop-the-world pauses plus 5 concurrent phases where worker threads are allowed to go on with their work. Find a more detailed description of the CMS in the article “Java SE 6 HotSpot Virtual Machine Garbage Collection Tuning”.
The CMS collector is activated by
-XX:+UseConcMarkSweepGC
Applying this to our sample application from above (under higher load than before) led to the following result:
It is visible that the Old generation pauses in the 8 seconds range are now gone. For each Old generation collection (in our case 5 of them in 50 hours) there are now two pauses and all of them are below 1 second.
By default, the CMS collector uses the ParNew collector to execute the New generation collections. If the ParNew collector runs together with the CMS its pauses tend to be a bit longer than when it runs without it because their cooperation requires some extra effort. In addition to the slightly higher average New generation pause times compared to the previous results, this can be seen from the frequent outliers in New generation pause times which reach up to 0.5 seconds in the test run shown. But they are all short enough to make the CMS/ParNew collector pair a good low-latency option for many applications.
A more important disadvantage of the CMS collector is related to the fact that it cannot be started when the Old generation heap is full. Once the Old generation is full, it is too late for the CMS and it must then fall back to the usual stop-the-world strategy (announced by a “concurrent mode failure” in the GC log). To reach its low-latency goal the CMS is started whenever Old generation occupation reaches a threshold set by
-XX:CMSInitiatingOccupancyFraction=80
With this setting, the CMS is started once 80% of the Old generation is occupied. For our application this reasonable value (which at the same time is also the default) worked well, but if the threshold is set too high a concurrent mode failure can bring back the long Old generation GC pauses at any time. If on the other hand it is set too low (below the size of the live part of the heap) the CMS might run concurrently all the time and thus consume the processing power of one CPU entirely. If an application experiences abrupt changes in its object creation and heap usage behavior, e.g. by the start of specialized tasks either interactively or by a timed trigger, it can be hard to set this threshold right to avoid both risks at all times.
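The tradeoff behind this threshold can be illustrated with a small model. This is not how the JVM implements the check; it is only a sketch of the rule described above, with made-up class and method names. The 800 MB Old generation in the example is an assumption (a 1200 MB heap minus a 400 MB New generation).

```java
// Simplified model of an occupancy threshold for starting a concurrent collection.
// Not the JVM's implementation; names and the example sizes are illustrative.
final class OccupancyThreshold {

    /** True once the occupied part of the Old generation reaches the threshold. */
    static boolean reached(long usedMb, long capacityMb, int thresholdPercent) {
        return usedMb * 100 >= capacityMb * (long) thresholdPercent;
    }

    /** Free space left at the moment the threshold is reached. */
    static long headroomMb(long capacityMb, int thresholdPercent) {
        return capacityMb - capacityMb * thresholdPercent / 100;
    }

    public static void main(String[] args) {
        long oldGenMb = 800; // assumed: 1200 MB heap minus 400 MB New generation
        System.out.println("start at 640 MB used: " + reached(640, oldGenMb, 80));
        System.out.println("headroom at 80%: " + headroomMb(oldGenMb, 80) + " MB");
    }
}
```

The headroom is what the concurrent phases have to work with: if promotion fills it before the collector finishes, the result is the concurrent mode failure described above. A lower threshold buys more headroom at the cost of more frequent concurrent runs.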
The Specter of Fragmentation
The biggest disadvantage of the CMS, however, is related to the fact that it does not compact the Old generation heap. It therefore carries the risk of heap fragmentation and severe degradation of operations over time. Two factors increase this risk: a tight Old generation heap and frequent CMS runs. The first factor can be improved by making the Old generation heap larger than what would be needed with the ParallelGC collector (which I did by increasing the heap from 1024 to 1200 MB, as can be seen in the previous figures). The second factor can be improved by proper generation sizing as described above. We actually saw how infrequent Old generation GC can be made by it. To demonstrate how essential it is to fine tune the generation sizes before switching to the CMS, let’s have a look at what might happen if we do not follow this rule and apply the CMS directly to the barely tuned heap of figure 1:
It is obvious that with these settings the JVM worked well for almost 14 hours under load test conditions (in production and with lower load this treacherously benign period may last much longer). Then suddenly there were very long GC pauses which actually stopped the JVM for about half of the remaining time. Not only did attempts to clean up the mess in the Old generation last more than 10 seconds, but even New generation GC pauses were in the seconds range because the collector spent a lot of time searching for space in the Old generation when it tried to promote objects from New to Old generation.
The fragmentation risk is the price to pay for the low-latency advantage of the CMS. This risk can be minimized but it is always there and it is hard to predict when it will strike. With proper GC tuning and monitoring, however, the risk can be managed.
The Promise of the Garbage First (G1) Collector
The G1 collector was designed to achieve low-latency behavior without the risk of heap fragmentation. Oracle has therefore announced it as a long-term replacement for the CMS collector. G1 avoids the fragmentation risks because it is a compacting collector. As far as GC pauses are concerned, it does not aim at the shortest possible pauses but at controlling pauses by placing an upper limit on their duration which is maintained in a best-effort approach. Readers can find more details about the G1 collector in the great tutorial “Getting Started with the G1 Garbage Collector”, German readers also in Angelika Langer’s article “Der Garbage-First Garbage Collector (G1) – Übersicht über die Funktionalität”.
Before we examine the current state of the G1 collector by comparing its performance on our sample application with the performance of the classic collectors described above, let me summarize two important pieces of information about the G1 collector:
- G1 is officially supported by Oracle since Java 7u4, but for G1 you should go for the most recent Java 7 update available. The Oracle GC team is working hard on G1 and improvements in recent Java updates (7u7 to 7u9) have been noticeable. On the other hand, G1 has been in no way production-ready in any Java 6 release and the by far superior Java 7 implementation will probably never be backported.
- The generation sizing approach I described above is obsolete with G1. Setting generation sizes is in conflict with setting pause time targets and will prevent the G1 collector from doing what it was designed for. With G1 you set the overall memory size using “-Xms” and “-Xmx” and (optionally) a GC pause time target and usually leave all the rest to the G1 collector. It follows an approach similar to the ParallelGC collector’s AdaptiveSizePolicy and automatically adjusts the generation sizes so as to meet the pause time target.
Once these guidelines were followed, the G1 collector delivered the following result out-of-the-box:
In this case, we used the default GC pause time target of 200 milliseconds. As can be seen from the indicators this target was almost met on average and the longest GC pauses were as good as with the CMS (figure 4). G1 apparently had very good control of GC pauses because outliers compared to the average duration were rather rare and limited.
On the other hand, average GC pause times were much longer than with the CMS collector (270 vs. 100 ms) and because they were even more frequent this also means that accumulated GC pause time, i.e. the overhead for GC itself, was more than 4 times higher than with CMS (6.96% vs. 1.66% of elapsed time).
Just like the CMS, the G1 works with GC pauses and with concurrent GC phases. Like the CMS, it starts concurrent phases based on an occupation threshold. It is visible in figure 6 that the available heap of 1 GB is far from fully used. This is because the G1’s default occupation threshold is much lower than the CMS’s threshold. It is also reported that the G1 in general tends to be satisfied with less heap than the other collectors.
Quantitative Comparison of Garbage Collectors
The following table summarizes some key performance indicators achieved with the 4 most important garbage collectors of Oracle Java 7 running the same load test on the same application but with different levels of load (indicated by the garbage creation rate shown in column 2):
All the collectors were run with about 1GB of total heap size; the traditional collectors (ParallelGC, ParNewGC and CMS) in addition used the following heap settings:
-XX:NewSize=400m -XX:MaxNewSize=400m -XX:SurvivorRatio=6
while the G1 collector ran without additional heap size settings and used the default pause time target of 200 milliseconds which can also be set explicitly by
-XX:MaxGCPauseMillis=200
As can be seen from this table the traditional collectors execute New generation collections (column 3) in similar time. This is true for the ParallelGC and the ParNewGC collectors but also for the CMS which in fact uses the ParNewGC to execute New generation collections. Promotion from new to Old generation, however, requires some coordination between ParNewGC and CMS during New generation GC pauses. This coordination creates an extra cost which translates into slightly longer New generation pauses for the CMS.
Column 7 summarizes the time lost in GC pauses as percentage of elapsed time. This number is a good measure of GC overhead because concurrent GC time (last column) and the CPU usage overhead it implies may be neglected. With heap sizes tuned as described above and thus with rare Old generation collections, column 7 is largely dominated by New generation pause time. New generation pause time is the product of New generation pause duration (column 3) and New generation pause frequency. New generation pause frequency is a function of the New generation size which was the same (400 MB) for all of the traditional collectors. Therefore and for these collectors column 7 more or less mirrors column 3 (for similar load).
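The relation between pause duration, pause frequency and overhead can be sketched in a few lines, together with a rough way to read the accumulated GC time from a running JVM through the standard java.lang.management API. The class and method names are made up for this example. Note that the collection time reported by the MXBeans is only approximate and, depending on the collector, may include more than pure stop-the-world time, so it is not a substitute for the GC log.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch: GC overhead as pause duration times pause frequency, plus a rough
// reading of accumulated collection time from the running JVM.
final class GcOverhead {

    /** Pause share of elapsed time in percent, from average pause and interval. */
    static double pausePercent(double pauseMillis, double intervalSeconds) {
        return 100.0 * pauseMillis / (intervalSeconds * 1000.0);
    }

    /** Accumulated collection time of all collectors as a percentage of JVM uptime. */
    static double reportedGcPercent() {
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime(); // -1 if the collector does not report it
            if (time > 0) {
                gcMillis += time;
            }
        }
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptimeMillis > 0 ? 100.0 * gcMillis / uptimeMillis : 0.0;
    }

    public static void main(String[] args) {
        // Figures from the earlier test runs: 48 ms every 6 s, then 57 ms every 10 s
        System.out.printf("untuned: %.2f%%%n", pausePercent(48, 6));
        System.out.printf("tuned:   %.2f%%%n", pausePercent(57, 10));
        System.out.printf("this JVM so far: %.2f%%%n", reportedGcPercent());
    }
}
```

With the averages from the first two test runs this gives about 0.8% and 0.57%, close to the New generation share of the measured pause times.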
The benefit of the CMS collector in this picture is evident from column 6: it trades much (one order of magnitude) shorter Old generation GC pauses against a slightly higher overhead. For many real world applications this is a very good deal.
How well does the G1 collector compete for our application? Column 6 (and 5) tells us that it successfully competes with the CMS in reducing Old generation GC pauses. But column 7 indicates that it pays a rather high price to achieve this: GC overhead was 7% compared to 1.6% for the CMS under the same load.
I will examine the conditions under which this higher overhead occurs as well as the strengths and weaknesses of the G1 compared to other collectors (in particular to the CMS collector) in a follow-up to this article as it is a vast and newsworthy subject in its own right.
Summary and Outlook
For all the classic Java GC algorithms (SerialGC, ParallelGC, ParNewGC and CMS) generation sizing is an essential tuning and fine tuning procedure which in many real-world applications is not practiced sufficiently. The consequences are suboptimal application performance and the risk of operations degradation (loss of performance and even application standstill for extended periods of time if it is not well monitored).
Generation sizing can improve application performance noticeably and reduce the occurrence of long GC pauses to a minimum. Elimination of long GC pauses, however, requires the usage of a low-latency collector. The preferred and most proven low-latency collector has been (and still is as of today) the CMS collector which in many cases does what is needed and, with proper tuning, also provides long-term stability in spite of its inherent heap fragmentation risk. The intended replacement, the G1 collector, is now (as of Java 7u9) a supported and usable alternative but there is still room for improvement. For many applications, it will deliver acceptable but not yet better results than the CMS collector. The details of its strengths and weaknesses deserve closer examination.