Benchmarking G1 and other Java 7 Garbage Collectors
As mentioned in the first post of this series, Oracle’s Garbage-First (G1) collector has been a supported option in Java 7 for some time. This post examines in more detail how the G1 garbage collector performs compared to the other collectors available in the Hotspot JVM. I used benchmark tests for this purpose instead of a real application because they can be executed and modified more easily. I found surprising strengths and weaknesses in several of Hotspot’s garbage collectors and even uncovered a fully-fledged bug.
This article requires some knowledge of garbage collection in general and of the Hotspot JVM’s garbage collector implementations. My previous article can serve as an introduction to the topic; in the meantime, Martin Thompson has also published a concise, rich and accurate compendium on garbage collection.
Evaluating Garbage Collectors using Benchmarks
Micro benchmarks are simple programs, often single classes, that create a special workload which is usually far from a realistic application workload. This post presents several homegrown micro benchmarks which I developed from very simple beginnings and elaborated step by step to reproduce more of the garbage collection aspects observed in real-world applications. On the one hand, this helps to better understand the key aspects of a real-world GC load. On the other hand, it shows how different garbage collector implementations handle such loads.
I also ran tests with a commercial Java benchmark to place my own micro benchmarks in a wider perspective. In doing so, I gained additional insight not only into the garbage collectors but also into the benchmark itself.
All the tests presented in this post were executed on a notebook (Intel i7-2860QM 2.5 GHz, 8 GB RAM) running Windows 7 with the Oracle Hotspot 64-bit JVM, versions 7u45 and 1.8.0-ea.
The tests were executed with 6 GB of total Java heap, the recommended minimal heap size for the G1 collector, which was designed for large heaps. Because “NewSize” is the second most important GC tuning knob, I varied the new generation heap size in many of the tests presented below by setting
-XX:MaxNewSize=x over a wide range of x values, usually from 100 MB to 3200 MB or more, the latter being more than half of the 6 GB total heap. In everyday JVM configuration with a real application nobody has the time to do that, but it is feasible with scripted benchmark execution.
The GarbageOnly Micro Benchmark
A very simple GC benchmark test is a class that creates as much garbage as possible, i.e. allocates as much memory as possible by creating objects without any references to them. Let us call this benchmark GarbageOnly (source code). Note that this benchmark scenario is extreme but not pointless: it is a fundamental assumption of generational garbage collection that a lot of objects are very short lived as depicted in the following figure:
The size of the objects is a parameter to the benchmark but always the same during a test run. Experimenting with that size showed that in a wide range its exact value had little impact on the results. Therefore, I picked 100 bytes as a reasonable value and used it for all of the tests. The following figure summarizes the 4 main Java 7 garbage collectors’ performance while running that micro benchmark:
With all garbage collectors, the garbage creation and collection rates in that benchmark are extreme: up to 9 GB per second. This is because the benchmark does nothing but create garbage and is limited only by the available memory bandwidth, which is huge for modern CPUs such as the i7. The garbage collection overhead is very small because marking the minimal reference graph requires hardly any time and collection takes place exclusively in the new generation.
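The core loop of such a benchmark can be pictured with a minimal sketch (my own reconstruction of the idea, not the linked source code): allocate fixed-size objects and drop every reference immediately, so each object becomes garbage for the very next young collection.

```java
// Minimal GarbageOnly-style sketch: all allocations die immediately.
public class GarbageOnlySketch {

    // 100-byte payload approximated by a byte array
    // (the object header adds a few extra bytes on top)
    private static final int OBJECT_SIZE = 100;

    /** Allocate numObjects objects without keeping any reference to them. */
    public static long allocate(long numObjects) {
        long checksum = 0;
        for (long i = 0; i < numObjects; i++) {
            byte[] garbage = new byte[OBJECT_SIZE]; // no reference survives this iteration
            checksum += garbage.length;             // prevents the JIT from eliminating the allocation
        }
        return checksum;                            // total bytes allocated
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long allocated = allocate(10_000_000L);
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("allocated ~%d MB in %.2f s%n", allocated / (1024 * 1024), seconds);
    }
}
```

Run with different collectors and `-XX:MaxNewSize` values to reproduce plots like figure 2.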
Very small Values of NewSize can cost Performance
For practical purposes, the takeaway from figure 2 is that there usually is a minimal NewSize below which you lose application throughput to garbage collection. For most collectors, 400-1000 MB are reasonable values, and even at 100 MB the losses are limited (< 7%). The CMS collector, however, is more sensitive: decreasing NewSize reduces its throughput much faster and to a much lower level. You get less than 50% of the optimal throughput with 100 MB of NewSize, 80% with 400 MB, a bit more than 90% with 1000 MB and about 97% with 2000 MB. Without explicit configuration, its default value for NewSize lets its performance fall far behind the other collectors. In the meantime, it has turned out that this strong NewSize dependence of the CMS collector is most probably a bug in the Java 7 implementation (unchanged in Java 8u5) because it cannot be observed in Java 6: with Java 6u35, the CMS plot lies between the ParNewGC and G1 plots of figure 2 for all values of NewSize. This bug has (like the bug from figure 11) been reported to the Java bug database and to the OpenJDK GC mailing list. This benchmark is extreme, however, and the dependence will be much weaker in other scenarios, as can be observed with the other benchmarks discussed later in this article.
The following figure shows that the loss of throughput for small NewSizes can be largely accounted for by growing pause time:
Assuming that GC pauses are the only factor and reduce throughput only by reducing the time when the benchmark code can actually run, the following relationship should hold for the memory allocation / garbage collection rate:
Rate = Rate_max * (1 - pause_time/100%)
This holds to great accuracy for the CMS collector with Rate_max = 9040 MB/s; for example, at 10% accumulated pause time the formula predicts 0.9 × 9040 ≈ 8136 MB/s. For the ParallelGC and ParNewGC collectors, the limiting value Rate_max for large values of NewSize apparently is the same (compare the upper right corner of figure 2), but on closer inspection their decrease at small NewSize is a bit steeper than that simple formula suggests.
The G1 collector again follows that same formula within the margin of measurement accuracy with a slightly but significantly lower limiting value Rate_max = 8870 MB/s. From this I conclude that the G1 compared to the other 3 collectors has a small extra cost that is created outside of GC pauses, i.e. during normal JVM code execution, most probably for more expensive bookkeeping during object creation, object referencing and other standard runtime operations.
The SPECjbb2005 Benchmark
The SPECjbb2005 benchmark is a well-known Java benchmark for multi-tiered Java business applications by the Standard Performance Evaluation Corporation (SPEC). It claims to produce a workload similar to that of the middle (application server) tier of a JEE business application and focuses not on GC performance but on the processing speed of business operations. It is mostly used by vendors of hardware, operating systems and JVM implementations to benchmark their products and compare them to the offerings of competitors. A long list of submitted benchmark results is publicly available for all kinds of research. The SPECjbb2005 benchmark has recently been retired and replaced by SPECjbb2013. Many thanks to SPEC for providing me with a non-profit license for these tests.
Instead of benchmarking hardware and system software, I used SPECjbb2005 to benchmark the different garbage collectors of the Oracle Hotspot JVM. The following figure summarizes the results obtained with the 4 garbage collectors and a total heap size of 6 GB:
Depending on the number of processors available (as returned by the Java call Runtime.availableProcessors() = 4 cores x 2 hyperthreads = 8 for the Intel i7 CPU used in these tests) the benchmark increases the number of ‘warehouses’ (which in fact are load generator threads) and measures the number of business operations executed in a certain interval of time. The overall SPECjbb2005 score (dashed horizontal lines) is then calculated as the average of the individual scores reached with 8 to 16 warehouses. The score calculated in this way is proportional to the garbage collection rate because the mix of business operations executed in a test cycle is always the same. For comparison with other benchmark results in this post I added that scale on the right.
CPU Cores count for Memory Operations
For all the garbage collectors, throughput increases almost linearly from 1 to 3 threads and reaches a maximum at around 5 threads which is only slightly higher than with 4 threads. Because we have seen much higher GC rates in the GarbageOnly benchmark we can conclude that memory bandwidth probably is not the bottleneck here but rather the hyperthreads’ inability to access memory in competition with their twins on the same core. What we see here is thus the processing power of the 4 cores while hyperthreads apparently contribute little. Beyond 5 threads, throughput is mostly flat with a small penalty paid for scheduling and context switching a growing number of threads.
Differences between Collectors are moderate in SPECjbb2005
While this general behavior is the same for all the garbage collectors considered, they do show different absolute levels of performance when used out-of-the-box (no heap sizing except
-Xms6g -Xmx6g): the parallel (stop-the-world) collectors come out on top, with the (default) ParallelGC collector in pole position and the ParNewGC collector a good second. The CMS collector has the lowest performance but is only about 15% behind the leader. However, it can almost reach the G1’s performance if
-XX:NewSize is explicitly set to a higher value such as 2 GB. We have seen before that the CMS suffers from a bad default value of NewSize compared to the other collectors. In addition, I also tested the G1 collector in Java 1.8.0 (early access build 109). This glimpse at the future promises a further slight performance increase which brings the G1 within reach of the ParNewGC.
SPECjbb2005 was not a GC Benchmark
Closer inspection of the GC log created during a SPECjbb2005 test cycle reveals that the benchmark produces a rather peculiar load for the garbage collectors:
The benchmark application creates a huge amount of short lived objects which the ParallelGC can easily collect as indicated by the performance indicators marked in green on the right. Note that it was started with a minimal heap configuration of
-Xms6g -Xmx6g and by default picked a NewSize of about 2 GB which made life rather easy for it. In contrast to that, the CMS collector by default worked with about 600 MB NewSize and therefore spent about 5.8% of the time with GC pauses. By explicitly setting NewSize to 2 GB this could be reduced to about 2% which led to the performance benefit shown in figure 4.
Besides those short-lived objects, the benchmark apparently creates a certain amount of objects which live as long as the test lasts, as can be seen from the stair-like growth of heap usage which I numbered at the bottom of figure 5. Because there are only few objects of intermediate lifetime, old generation garbage collections usually do not occur in that test (as in this example). As can be verified in the configurations listed with the submitted benchmark results, submitters usually guaranteed this by making the new generation heap large, excessively large compared to real-world productive installations. It is also clear that without old generation collections the parallel stop-the-world collectors ParallelGC and ParNewGC deliver the best results, because old generation collections are their biggest performance problem.
Now, let us look at how the G1 collector copes with the load created by SPECjbb2005:
Its internal adjustment control is smart enough to grow the new generation size to as much as 4 GB. In this way, it even manages to keep the pause time smaller than the ParallelGC collector’s: 1.03% vs. 1.17%. Nevertheless, its GC rate is 12% lower: 1325.3 MB/s vs. 1477.5 MB/s, and as shown in figure 4 this GC rate exactly mirrors the SPECjbb2005 throughput/score reached. This confirms that G1 operation has considerable cost outside of GC pauses. The stats at the right of figure 6 show that there were no concurrent GC runs which could explain that. Again, it looks like the G1 slows down normal JVM operation compared to the other collectors.
All in all, from a GC point of view the SPECjbb2005 benchmark is not very different from the GarbageOnly micro benchmark. Absolute GC rates are lower by a factor of 6 because much CPU power is consumed by the simulation of business operations, but in both cases all the GC strain is on the new generation collections. It is therefore natural that the relative performance of the collectors is very similar in both benchmarks.
Stressing the JVM with old generation Garbage Collections
In real-world applications (unlike both the GarbageOnly and SPECjbb2005 benchmarks) it is not a viable strategy to maximize the new generation heap because, first, there is a significant amount of objects with longer but finite lifetime. Such objects, e.g. session-scoped objects in Java web applications, require space in the old generation heap. Second, the overall heap is a limited resource which needs to be distributed among the new and old generation heap. We will see below that these real-world requirements are able to fundamentally change the game and its results when they are included in a benchmark.
The opposite scenario to the GarbageOnly benchmark is an application where no objects are short lived but all objects are kept live for a similar amount of time. In a simple implementation, this can be achieved by placing them all in a List object and keeping them there until they are finally replaced by other objects. The positions in the List may be recycled as the result of a linear walk through the List (which I will call the “LinearList” benchmark) or as the result of random access to the List (“RandomList” benchmark). The following figure shows the object lifetime distributions for these benchmarks:
In the LinearList sample all the objects have more or less the same lifetime because they become garbage as the result of a first-in-first-out process while for RandomList lifetimes are distributed around the same average value T0 but with a width in the same range as T0.
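Both list benchmarks can be sketched in a few lines (a reconstruction of the idea under the article's stated parameters, not the original benchmark source): keep a fixed number of objects alive in a List and replace entries either in a round-robin walk (LinearList) or at random positions (RandomList).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of the LinearList / RandomList idea: every object lives in the List
// until it is displaced, so there are no short-lived objects at all.
public class ListBenchmarkSketch {

    /** Fill a List with numLive 100-byte objects, then perform the given
     *  number of replacements; returns the number of objects replaced. */
    public static long run(int numLive, long replacements, boolean randomAccess) {
        List<byte[]> live = new ArrayList<>(numLive);
        for (int i = 0; i < numLive; i++) {
            live.add(new byte[100]);
        }
        Random rnd = new Random(42);
        int cursor = 0;
        long replaced = 0;
        for (long i = 0; i < replacements; i++) {
            int pos = randomAccess
                    ? rnd.nextInt(numLive)     // RandomList: lifetimes spread around T0
                    : cursor++ % numLive;      // LinearList: FIFO-like, uniform lifetime
            live.set(pos, new byte[100]);      // the displaced entry becomes middle-aged garbage
            replaced++;
        }
        return replaced;
    }
}
```

With 12.5 million entries the List alone pins roughly 1.25 GB of 100-byte objects in the old generation, which is what forces the Full GCs discussed below.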
When generational Garbage Collection does not work well
Note that both cases are a challenge for any generational garbage collector because there are actually no generations and the main prerequisite of generational garbage collection that “most objects die young” is badly violated because, in this scenario, most objects die middle-aged. As a corner case this situation is nevertheless interesting to investigate. In addition, we will see later that it is a useful building block for more complex benchmarks.
The next figure shows that in this benchmark actually all garbage collectors yield rather poor performance:
Object creation and garbage collection rates are down by factors of 10 to almost 100 compared to the GarbageOnly benchmark because a single huge List must be traversed to mark the live objects, and reclaiming the dead objects happens in the old generation heap and requires compaction. Any attempt of the CMS and G1 collectors to execute some work concurrently is doomed: concurrent mode failure. For all collectors, Full GCs are the only means to clean up the heap because there are hardly any dead objects to be found in the new generation. The ParNew collector performs best, but the Parallel collector can be improved considerably by switching off the ParallelOldGC option (default since Java 7u4, switched off with
-XX:-UseParallelOldGC). Most probably, marking the entries of a single huge List is the most expensive operation; it does not parallelize well, and even the attempt is harmful for performance. It is interesting that the CMS shows its relationship with the ParNewGC collector in these plots by imitating its NewSize dependence, albeit at a lower level, while the other two have individually shaped curves.
Note that the generally higher throughput in the LinearList benchmark compared to the RandomList benchmark is not GC related but mostly stems from the fact that random number generation consumes a considerable amount of CPU cycles which are taken away from object creation. This explains why the curves are somewhat squeezed together in the right-hand compared to the left-hand plot. Overall, G1 performance is the poorest and also suffers most from the randomization of List access. This is probably because randomization destroys the “garbage first” assumption that there are always regions which contain mostly garbage and can therefore be reclaimed cheaply. With randomization, most regions contain similar amounts of garbage, and reclaiming memory requires cleaning up many regions at high cost. For the other collectors, the curves are almost erratic, with steep slopes. If we ever met a real-world application with such behavior, it would be hard to find an optimal and stable configuration without extensive experiments.
By the way, the average object lifetime T0 mentioned above can easily be calculated from the object creation frequency and the size of the List. In the steady state, the object creation frequency is equal to the garbage collection rate divided by the object size (of 100 bytes). For the ParNew collector in the RandomList sample, this is about 4 million objects per second, while the List used in the benchmark had 12.5 million elements. The average object lifetime was therefore only about 3 seconds with the ParNew collector and about 12 seconds with the G1 collector. This is indeed only a medium lifetime compared to what we find in real-world applications. In the next section, I will show how to reach lifetimes which are up to a factor of 100 longer and thus reach the typical lifetimes of objects in the session scope of Java web applications.
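That back-of-the-envelope arithmetic can be written down explicitly (the 400 MB/s GC rate below is the approximate ParNew value implied by 4 million 100-byte objects per second, not an exact measurement):

```java
// Estimate of the average object lifetime T0 in a steady-state list benchmark.
public class LifetimeEstimate {

    /** T0 = listSize / (gcRate / objectSize): in the steady state, objects are
     *  created as fast as they are collected, so the List turns over completely
     *  once every T0 seconds. */
    public static double averageLifetimeSeconds(long listSize,
                                                double gcRateBytesPerSec,
                                                int objectSizeBytes) {
        double objectsPerSecond = gcRateBytesPerSec / objectSizeBytes;
        return listSize / objectsPerSecond;
    }

    public static void main(String[] args) {
        // ParNew in the RandomList run: ~400 MB/s, i.e. ~4 million objects/s
        double t0 = averageLifetimeSeconds(12_500_000L, 400e6, 100);
        System.out.printf("average object lifetime: ~%.1f s%n", t0);
    }
}
```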
Putting it all together to build an application-like GC Benchmark
The lifetime distributions of figures 1 and 7 suggest that a combination of the GarbageOnly and RandomList microbenchmarks can be used to create a benchmark with more real-world object lifetimes. The recipe is the following:
- Each thread of the benchmark creates the overwhelming majority of its objects without any references, as in the GarbageOnly benchmark. A small fraction of the objects is placed in a List, as in the RandomList benchmark. The smaller the fraction, the slower the replacement rate of the List entries and the longer the average lifetime of the objects in the List.
- The benchmark runs several threads each of which uses its own List and inserts its own fraction of objects into the List.
- The size of each thread’s List and the fraction of objects inserted determine the average lifetime Tn of that thread’s longer-lived objects. The sum of the individual lifetime distributions yields a quasi-continuous lifetime distribution for the benchmark as a whole which stretches from a dominant peak at 0 lifetime into the minutes range.
I call this benchmark MixedRandomList. The exact numbers for List sizes and object fractions can be found in the source code, which I ran with the parameters objectSize=100, numberOfThreads=8 and numLive (number of live objects) = 12500000.
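The recipe above can be sketched as follows (a minimal single-thread reconstruction under assumed list sizes and fractions; in the real benchmark each of the 8 threads runs its own instance with its own List and its own keep fraction, producing the quasi-continuous lifetime distribution):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of the MixedRandomList recipe: mostly unreferenced garbage, with
// every keepEveryNth-th object inserted at a random position of a List.
public class MixedRandomListSketch {

    private final List<byte[]> live;
    private final int keepEveryNth;   // the smaller the fraction 1/N, the longer the lifetime
    private final Random rnd = new Random(42);

    public MixedRandomListSketch(int listSize, int keepEveryNth) {
        this.keepEveryNth = keepEveryNth;
        this.live = new ArrayList<>(listSize);
        for (int i = 0; i < listSize; i++) {
            live.add(new byte[100]);
        }
    }

    /** Create n objects; returns how many of them were kept in the List. */
    public long createObjects(long n) {
        long kept = 0;
        for (long i = 1; i <= n; i++) {
            byte[] obj = new byte[100];                  // the vast majority dies immediately
            if (i % keepEveryNth == 0) {
                live.set(rnd.nextInt(live.size()), obj); // the displaced entry becomes middle-aged garbage
                kept++;
            }
        }
        return kept;
    }
}
```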
By instrumentation of the benchmark I measured the object lifetime distribution in a typical 20 min test run:
By construction of the benchmark, the distribution still has a singularity at 0 lifetime (the <10 ms bin) but is a continuous and strictly decreasing function at higher lifetimes. With this benchmark, heap usage and GC behavior already look very much as in real-world applications:
The ParallelGC collector at the top creates the typical superposition of two sawtooth patterns in the blue heap usage curve: one very fast, created by many new generation collections, the other slow, created by the fewer old generation collections (grey vertical lines). Pause times of new generation collections are in the 10-millisecond range, pause times of old generation collections in the seconds range for the ParallelGC. Less than 1% of all objects make it to the old generation, which is a good value for real-world applications. Only the absolute GC rate is still far above any reasonable real-world level because the JVM still does little but create objects and collect garbage.
The following figure shows the different garbage collectors’ performance as a function of NewSize when applied to the MixedRandomList benchmark:
Garbage Collectors are not free of Bugs
The most striking feature in figure 11 is the very poor performance of the ParallelGC collector above 1800 MB of NewSize. Unfortunately, close to the default value of NewSize=2g (one third of the 6 GB total heap size) it falls off a cliff and badly spoils its out-of-the-box performance. This is simply a bug, which I reported to the OpenJDK GC community and the Java bug database. Because ParallelGC is the default collector in Java 6 and 7, this bug means that you obtain the lowest performance of the whole plot and lose a factor of 10 compared to the optimum if you run the benchmark with
-Xmx as the only parameter! From this we learn that it is sometimes a good investment to play around with at least the
-XX:NewSize parameter or to select an alternative collector. I usually use ParNewGC to double-check because it has proven robust and generally delivers good performance (as in this case).
There usually is an optimum for NewSize
The second most striking feature in figure 11 is that in this benchmark it is no longer helpful to blindly maximize NewSize because the benchmark (by construction and similar to real applications) at any time has a considerable amount of live objects in its heap. In this case, there are about 3 GB of live objects as can also be estimated from figure 10. As soon as NewSize becomes large enough to leave less than these 3 GB for the old generation, throughput decreases sharply for all collectors because they have to give up on new generation collections and do only (more expensive) Full GCs (The ParallelGC collector above 1800 MB does the same thing, but too early, because erroneous statistics convince it that this cannot be avoided).
G1 is not quite as efficient as the classic Collectors
The third most striking feature is that the G1 collector, in both the Java 7u45 and the early access 1.8 JVM, now falls considerably behind the classic collectors and reaches only about 70% of their maximal throughput. Note, however, that its default performance (dashed horizontal lines) is better than the results reached with NewSize fixed to any constant value. It is therefore indeed a best practice to avoid setting NewSize with G1 and rather let it do the adjustment control job itself. But this control job apparently comes at considerable cost.
The following table summarizes key performance indicators of the different collectors from those runs where each delivered its best performance in figure 11:
It comes as a surprise that the CMS manages to outperform the other collectors in this benchmark in terms of throughput (second column). Key factors are the small amount of concurrent GC time (23.9%) compared to the G1 and the lowest percentage (6.5%) of accumulated pause time. One important advantage for the CMS should be mentioned: the absence of fragmentation in this benchmark. Because all the objects created have the same size, fragmentation is not an issue here. This not only avoids heap fragmentation and long-term performance degradation (which I have verified) but also, in every new generation collection, saves the potentially costly search for space of suitable size in the old generation. This argument does not explain why new generation pause durations and accumulated pause times are low for the CMS in this micro benchmark (which remains a surprisingly good result), but it does suggest that we should expect longer pauses, higher accumulated pause time and less throughput from the CMS in real-world applications, while the other 3 collectors compact their heap anyway and would not suffer from fragmentation.
As a last remark, it is worthwhile to compare the table in figure 12 with its equivalent, figure 7 of my previous post: one is from a relatively simple benchmark and the other from a real-world application, but they share many features. The most striking difference is that in the real-world sample GC pauses are much longer although heap sizes are much smaller. In real-world applications, with much larger and more complicated reference graphs than the handful of List objects used by the benchmark, we should expect more marking effort and thus considerably longer pauses.
Summary and Conclusions
With relatively simple micro benchmarks it is possible to reproduce many features of real-world Java garbage collection loads and gain insight into the weaknesses and strengths of different collector implementations. From the results presented above we can learn the following about the 4 main collectors of Oracle’s hotspot JVM:
- The G1 collector has good default settings and, as outlined in the official best practice guidelines, there is little to gain and much to lose from setting the new generation size explicitly.
- The G1 collector is very stable for its young age. I did not encounter a single JVM failure during hundreds of hours of benchmark execution (on 3 different hardware and OS platforms: Intel/Windows, Intel/Linux and SPARC/Solaris).
- The G1 does expensive processing compared to the other collectors which costs it throughput, in particular when it comes to old generation heap collections.
- For all other collectors it is advisable to configure NewSize explicitly instead of relying on the default.
- There is no single and optimal heap and GC configuration because there are many circumstances which favor one or the other collector, strategy and configuration.
- Performance of the CMS collector suffers most from too small values of NewSize. Because the impact is surprisingly strong and was not present in Java 6, this is probably a bug.
- The ParallelGC collector can deliver very good throughput if it does not fall into a pit, which usually stems from its smart statistics and heuristics features. Its
-XX:+UseParallelOldGC option is not always helpful (but has become the default in Java 7).
- The ParNewGC collector in most cases belongs to the best performers and rarely causes problems because it is straightforward and simple. It is a good choice if you want to double-check whether a GC problem is caused by the application or an unfortunate collector/configuration. It is regrettable that it will be deprecated in Java 8 except in combination with the CMS.
Conclusions 4 to 9 are corroborated not only by the above benchmark results but also by my previous experience with GC tuning of real-world applications.
I have gathered already and will present more benchmark results to address the following:
- Refine the benchmark to make the GC load look even more like that of real applications.
- Have a closer look at important heap and GC parameters other than NewSize.
- Move to a much larger heap (50 GB) to tackle the leading-edge range.
- At the same time, consider pause times more than mere throughput (which means focus on CMS vs. G1).
- In particular, have a closer look at the G1’s pause time control.
Beyond that, it would also be interesting to have a closer look at the influence of moderate heap fragmentation, which I have so far neglected to the advantage of the CMS collector.