Benchmarking G1 and other Java 7 Garbage Collectors

As mentioned in the first post of this series, Oracle's GarbageFirst (G1) collector has been a supported option in Java 7 for some time. This post examines in more detail the performance of the G1 garbage collector compared to the other collectors available in the Hotspot JVM. I used benchmark tests for this purpose instead of a real application because they can be executed and modified more easily. I found surprising strengths and weaknesses in several of Hotspot's garbage collectors and even uncovered a fully-fledged bug.

This article requires some knowledge about garbage collection in general and the Hotspot JVM’s garbage collector implementations. My previous article can serve as a useful introduction to this topic and, in the meantime, Martin Thompson has published a concise, rich and correct compendium on Garbage Collection, too.

Evaluating Garbage Collectors using Benchmarks

Micro benchmarks are simple programs, often single classes, that create a special workload which is usually far from a realistic application workload. This post presents several homegrown micro benchmarks which I developed from very simple beginnings and elaborated step by step to reproduce more of the garbage collection characteristics observed in real-world applications. On one hand, this helps to better understand the key aspects of a real-world GC load. On the other hand, it shows how different garbage collector implementations handle such loads.

I also ran tests with a commercial Java benchmark to place my own micro benchmarks in a wider perspective. This gave me additional insight not only into the garbage collectors but also into the benchmark itself.

All the tests presented in this post were executed on a notebook (Intel i7-2860QM 2.5 GHz, 8 GB RAM) under Windows 7 with the Oracle Hotspot 64-bit JVM, versions 7u45 and 1.8.0-ea.

The tests were executed with 6 GB of total Java heap, the recommended minimal heap size for the G1 collector, which was designed for large heaps. Because "NewSize" is the second most important GC tuning knob, I varied the new generation heap size in many of the tests presented below by setting -XX:NewSize=x and -XX:MaxNewSize=x over a wide range of x values, usually from 100 MB to 3200 MB or more, i.e. more than half of the 6 GB total heap. In everyday JVM configuration with a real application, nobody has the time to do that, but it is feasible with scripted benchmark execution.
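
As an illustration of such scripted execution, the following hypothetical Java driver launches one benchmark run per NewSize value. The class names BenchmarkRunner and GarbageOnly (the benchmark introduced in the next section) and the log file names are assumptions of this sketch; the JVM flags are the ones used throughout the text:

    import java.util.Arrays;
    import java.util.List;

    // Hypothetical driver that runs the benchmark once per NewSize value.
    public class BenchmarkRunner {
        public static void main(String[] args) throws Exception {
            List<Integer> newSizesMb = Arrays.asList(100, 200, 400, 800, 1600, 3200);
            for (int mb : newSizesMb) {
                ProcessBuilder pb = new ProcessBuilder(
                        "java",
                        "-Xms6g", "-Xmx6g",
                        "-XX:NewSize=" + mb + "m",
                        "-XX:MaxNewSize=" + mb + "m",
                        "-XX:+UseConcMarkSweepGC",      // or one of the other collector flags
                        "-Xloggc:gc-" + mb + "m.log",   // keep one GC log per run for later analysis
                        "GarbageOnly");
                pb.inheritIO();
                int exitCode = pb.start().waitFor();    // run the configurations one after another
                System.out.println("NewSize=" + mb + "m finished with exit code " + exitCode);
            }
        }
    }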

The GarbageOnly Micro Benchmark

A very simple GC benchmark test is a class that creates as much garbage as possible, i.e. allocates as much memory as possible by creating objects without keeping any references to them. Let us call this benchmark GarbageOnly (source code). Note that this benchmark scenario is extreme but not pointless: it is a fundamental assumption of generational garbage collection that a lot of objects are very short-lived, as depicted in the following figure:

Figure 1. Schematic object lifetime distribution as generally assumed for generational garbage collection (left) and produced by the GarbageOnly micro benchmark (right).
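
The core idea can be illustrated with a few lines of Java. This is a simplified, hypothetical sketch, not the actual GarbageOnly source code linked above; the class name, the fixed run time and the reporting are illustrative only:

    // Simplified sketch of a GarbageOnly-style loop: allocate objects of a fixed
    // size and drop every reference immediately, so each object becomes garbage
    // right after its creation.
    public class GarbageOnlySketch {
        private static final int OBJECT_SIZE = 100;   // bytes per object, as in the tests

        public static void main(String[] args) {
            long bytesAllocated = 0;
            long start = System.currentTimeMillis();
            while (System.currentTimeMillis() - start < 60_000) {   // run for one minute
                byte[] garbage = new byte[OBJECT_SIZE];              // never stored anywhere
                bytesAllocated += garbage.length;                    // trivial use of the new object
            }
            long seconds = (System.currentTimeMillis() - start) / 1000;
            System.out.println("Allocated about "
                    + bytesAllocated / (1024 * 1024) / seconds + " MB/s of garbage");
        }
    }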

The size of the objects is a parameter to the benchmark but always the same during a test run. Experimenting with that size showed that, over a wide range, its exact value had little impact on the results. Therefore, I picked 100 bytes as a reasonable value and used it for all of the tests. The following figure summarizes the 4 main Java 7 garbage collectors' performance while running that micro benchmark:

Figure 2. Performance of 4 garbage collectors in the GarbageOnly micro benchmark as a function of the new generation heap size with a total heap size of 6 GB ("-Xms6g -Xmx6g"). The CMS collector suffers most from (too) small values of NewSize.

With all garbage collectors the garbage creation and collection rates are extreme in this benchmark: up to 9 GB per second. This is because the benchmark does nothing but create garbage and is limited only by the available memory bandwidth, which is huge for modern CPUs such as the i7. The garbage collection overhead is very small because marking the minimal reference graph requires hardly any time and collection takes place exclusively in the new generation.

Very small Values of NewSize can cost Performance

For practical purposes, you should take away from figure 2 that there usually is a minimal NewSize below which you lose application throughput to garbage collection. For most collectors, 400-1000 MB are reasonable values and the losses even at 100 MB are limited (< 7%). The CMS collector, however, is more sensitive: decreasing NewSize reduces its throughput much more sharply. You get less than 50% of the optimal throughput with 100 MB, 80% with 400 MB, a bit more than 90% with 1000 MB and about 97% with 2000 MB of NewSize. Without explicit configuration, its default value for NewSize lets its performance fall far behind the other collectors. In the meantime, it has turned out that the strong NewSize dependence shown for the CMS collector is most probably a bug in the Java 7 implementation (and unchanged in Java 8u5) because it cannot be observed in Java 6: with Java 6u35, the CMS plot lies between the ParNewGC and G1 plots of figure 2 for all values of NewSize. This bug has, like the bug from figure 11, been reported to the Java bug database and to the OpenJDK GC mailing list. This benchmark is extreme, however, and the dependence will be much weaker in other scenarios, as can also be observed with the other benchmarks discussed later in this article.

The following figure shows that the loss of throughput for small NewSizes can be largely accounted for by growing pause time:

Figure 3. Time share spent in GC pauses as a function of the new generation heap size (from the same measurements as in figure 2).

Assuming that GC pauses are the only factor, i.e. that they reduce throughput only by reducing the time during which the benchmark code can actually run, the following relationship should hold for the memory allocation / garbage collection rate:

Rate = Rate_max * (1 - pause_time/100%)

This is the case to great accuracy for the CMS collector with Rate_max=9040 MB/s. For the ParallelGC and ParNewGC collectors the limiting value Rate_max for large values of NewSize apparently is the same (compare upper right corner of figure 2), but on closer inspection the decrease at small NewSize is a bit steeper for them than that simple formula suggests.
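
As a quick plausibility check of that formula with hypothetical numbers: a pause time share of 5% would predict Rate = 9040 MB/s * (1 - 0.05) ≈ 8590 MB/s, while a share of 50% would predict only about 4520 MB/s.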

The G1 collector again follows that same formula within the margin of measurement accuracy, with a slightly but significantly lower limiting value Rate_max = 8870 MB/s. From this I conclude that the G1, compared to the other 3 collectors, incurs a small extra cost outside of GC pauses, i.e. during normal JVM code execution, most probably for more expensive bookkeeping during object creation, object referencing and other standard runtime operations.

The SPECjbb2005 Benchmark

The SPECjbb2005 benchmark is a well-known Java benchmark for multi-tiered Java business applications by the Standard Performance Evaluation Corporation (SPEC). It claims to produce a workload similar to that of the middle (application server) tier of a business JEE application and does not focus on GC performance but on the processing speed of business operations. It is mostly used by vendors of hardware, operating systems and JVM implementations to benchmark their products and compare them to the offerings of competitors. A long list of submitted benchmark results is available to the public for all kinds of research. The SPECjbb2005 benchmark has recently been retired and replaced by SPECjbb2013. Many thanks to SPEC for providing me with a non-profit license for these tests.

Instead of benchmarking hardware and system software, I used SPECjbb2005 to benchmark the different garbage collectors of the Oracle Hotspot JVM. The following figure summarizes the results obtained with the 4 garbage collectors and a total heap size of 6 GB.

Figure 4. SPECjbb2005 benchmark results for the Hotspot JVM's 4 main garbage collectors.

Depending on the number of processors available (as returned by the Java call Runtime.getRuntime().availableProcessors(), i.e. 4 cores x 2 hyperthreads = 8 for the Intel i7 CPU used in these tests), the benchmark increases the number of ‘warehouses’ (which in fact are load generator threads) and measures the number of business operations executed in a certain interval of time. The overall SPECjbb2005 score (dashed horizontal lines) is then calculated as the average of the individual scores reached with 8 to 16 warehouses. The score calculated in this way is proportional to the garbage collection rate because the mix of business operations executed in a test cycle is always the same. For comparison with the other benchmark results in this post, I added that scale on the right.

CPU Cores count for Memory Operations

For all the garbage collectors, throughput increases almost linearly from 1 to 3 threads and reaches a maximum at around 5 threads, which is only slightly higher than with 4 threads. Because we have seen much higher GC rates in the GarbageOnly benchmark, we can conclude that memory bandwidth is probably not the bottleneck here, but rather the hyperthreads' limited ability to access memory in competition with their twins on the same core. What we see here is thus the processing power of the 4 cores, while hyperthreads apparently contribute little. Beyond 5 threads, throughput is mostly flat with a small penalty paid for scheduling and context switching a growing number of threads.

Differences between Collectors are moderate in SPECjbb2005

While this general behavior is the same for all the garbage collectors considered, they do show different absolute levels of performance when used out-of-the-box (no heap sizing except -Xms6g -Xmx6g): the parallel (stop-the-world) collectors come out on top, with the (default) ParallelGC collector in pole position and the ParNewGC collector a good second. The CMS collector has the lowest performance but is only about 15% behind the leader. However, it can almost reach the G1's performance if -XX:NewSize is explicitly set to a higher value like 2 GB. We have seen before that the CMS suffers from a bad default value for NewSize compared to the other collectors. In addition, I also tested the G1 collector in Java 1.8.0 (early access build 109). This glimpse at the future promises a further slight performance increase which brings the G1 within reach of the ParNewGC.

SPECjbb2005 was not a GC Benchmark

Closer inspection of the GC log created during a SPECjbb2005 test cycle reveals that the benchmark produces a rather peculiar load for the garbage collectors:

Figure 5. Heap usage and ParallelGC operation during a full SPECjbb2005 test cycle.

The benchmark application creates a huge number of short-lived objects which the ParallelGC can easily collect, as indicated by the performance indicators marked in green on the right. Note that it was started with a minimal heap configuration of -Xms6g -Xmx6g and by default picked a NewSize of about 2 GB, which made life rather easy for it. In contrast to that, the CMS collector by default worked with about 600 MB of NewSize and therefore spent about 5.8% of the time in GC pauses. By explicitly setting NewSize to 2 GB this could be reduced to about 2%, which led to the performance benefit shown in figure 4.

Besides those short-lived objects, the benchmark apparently creates a certain number of objects which live as long as the test lasts, as can be seen from the stair-like growth of heap usage numbered at the bottom of figure 5. Because there are only few objects of intermediate lifetime, old generation garbage collections usually do not occur in that test (as in this example). As can be verified in the configurations listed together with the submitted benchmark results, submitters usually guaranteed this by making the new generation heap large, excessively large compared to real-world productive installations. It is also clear that without old generation collections the parallel stop-the-world collectors ParallelGC and ParNewGC deliver the best results, because old generation collections are their biggest performance problem.

Now, let us look at how the G1 collector copes with the load created by SPECjbb2005:

Figure 6. Heap usage and G1 operation during the full SPECjbb2005 test cycle.

Its internal adjustment control is smart enough to make the new generation size as large as 4 GB. In this way, it even manages to make the pause time smaller than for the ParallelGC collector: 1.03% vs. 1.17%. Nevertheless, its GC rate is 12% lower: 1325.3 MB/s vs. 1477.5 MB/s, and as shown in figure 4 this GC rate exactly mirrors the SPECjbb2005 throughput/score reached. This confirms that G1 operation has considerable cost outside of GC pauses. The stats at the right of figure 6 show that there were no concurrent GC runs which could explain that. Again, it looks like the G1 slows down normal JVM operation compared to the other collectors.

All in all, from a GC point of view the SPECjbb2005 benchmark is not very different from the GarbageOnly micro benchmark. Absolute GC rates are lower by a factor of 6 because much CPU power is consumed by the simulation of business operations, but in both cases all the GC strain is on the new generation collections. Therefore, it is natural that the relative performance of the collectors is very similar in both benchmarks.

Stressing the JVM with old generation Garbage Collections

In real-world applications (unlike both the GarbageOnly and SPECjbb2005 benchmarks) it is not a viable strategy to maximize the new generation heap because, first, there is a significant amount of objects with longer but finite lifetime. Such objects, e.g. session-scoped objects in Java web applications, require space in the old generation heap. Second, the overall heap is a limited resource which needs to be distributed among the new and old generation heap. We will see below that these real-world requirements are able to fundamentally change the game and its results when they are included in a benchmark.

The opposite scenario to the GarbageOnly benchmark is an application where no objects are short lived but all objects are kept live for a similar amount of time. In a simple implementation, this can be achieved by placing them all in a List object and keeping them there until they are finally replaced by other objects. The positions in the List may be recycled as the result of a linear walk through the List (which I will call the “LinearList” benchmark) or as the result of random access to the List (“RandomList” benchmark). The following figure shows the object lifetime distributions for these benchmarks:

Figure 7. Theoretical object lifetime distributions for the LinearList and RandomList benchmarks.

In the LinearList sample, all objects have more or less the same lifetime because they become garbage as the result of a first-in-first-out process, while for RandomList the lifetimes are distributed around the same average value T0 but with a width of the same order as T0.
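
In code, the two recycling strategies can be sketched as follows. This is a simplified, hypothetical version rather than the actual benchmark source; the list size used here is only illustrative (the real tests used 12.5 million elements of 100 bytes each):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Hypothetical sketch of the LinearList / RandomList idea: all created objects
    // are kept in a pre-filled List and only become garbage when their slot is reused.
    public class ListRecyclingSketch {
        public static void main(String[] args) {
            final int numLive = 1_000_000;        // illustrative; the tests used 12.5 million
            final boolean randomAccess = args.length > 0 && args[0].equals("random");

            List<byte[]> live = new ArrayList<>(numLive);
            for (int i = 0; i < numLive; i++) {
                live.add(new byte[100]);          // fill the list with 100-byte objects
            }

            Random rnd = new Random();
            int cursor = 0;
            while (true) {                        // runs until stopped externally
                int index;
                if (randomAccess) {
                    index = rnd.nextInt(numLive); // RandomList: replace a random slot
                } else {
                    index = cursor;               // LinearList: walk the list in FIFO order
                    cursor = (cursor + 1) % numLive;
                }
                live.set(index, new byte[100]);   // the replaced object becomes garbage
            }
        }
    }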

When generational Garbage Collection does not work well

Note that both cases are a challenge for any generational garbage collector because there are actually no generations and the main prerequisite of generational garbage collection that “most objects die young” is badly violated because, in this scenario, most objects die middle-aged. As a corner case this situation is nevertheless interesting to investigate. In addition, we will see later that it is a useful building block for more complex benchmarks.

The next figure shows that in these benchmarks all garbage collectors actually yield rather poor performance:

Figure 8. Garbage collector performance in the LinearList and RandomList benchmarks.

Object creation and garbage collection rates are down by factors of 10 to almost 100 compared to the GarbageOnly benchmark, because a single huge List must be traversed to mark the live objects and because reclaiming the dead objects happens in the old generation heap and requires compaction. Any attempt of the CMS and G1 collectors to execute some work concurrently is doomed: concurrent mode failure. For all collectors, Full GCs are the only means to clean up the heap because there are hardly any dead objects to be found in the new generation. The ParNew collector performs best, but the Parallel collector can be improved considerably by switching off the ParallelOldGC option (default since Java 7u4, switch off with -XX:-UseParallelOldGC). Most probably, marking the entries of a single huge List is the most expensive operation; parallelization of this does not work well, and even the attempt is harmful to performance. It is interesting that the CMS shows its relationship with the ParNewGC collector in these plots by imitating its NewSize dependence, albeit at a lower level, while the other two have individually shaped curves.

Note that the generally higher throughput in the LinearList benchmark compared to the RandomList benchmark is not GC related but mostly stems from the fact that random number generation consumes a considerable amount of CPU cycles which are taken away from object creation. This explains why the curves are somewhat squeezed together in the right-hand plot compared to the left-hand one. Overall, G1 performance is the poorest and also suffers most from randomization of the List access. This is probably because randomization destroys the "Garbage First" assumption that there are always regions which contain mostly garbage and can therefore be reclaimed cheaply. With randomization, most regions contain similar amounts of garbage, and reclaiming some memory requires cleaning up many regions at high cost. For the other collectors, the curves are almost erratic with steep slopes. If we ever met a real-world application with such behavior, it would be hard to find an optimal and stable configuration without extensive experiments.

By the way, the above-mentioned average object lifetime T0 can easily be calculated from the object creation frequency and the size of the List. In the steady state, the object creation frequency is equal to the garbage collection rate divided by the object size (of 100 bytes). For the ParNew collector in the RandomList sample this is about 4 million objects per second, while the List used in the benchmark had 12.5 million elements. Therefore, the average object lifetime was only about 3 seconds with the ParNew collector and about 12 seconds with the G1 collector. This is indeed only a medium lifetime compared to what we find in real-world applications. In the next section, I will show how to reach lifetimes which are up to a factor of 100 longer and thus match the typical lifetimes of objects in the session scope of Java web applications.
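
Written as a simple formula with the numbers quoted above for the ParNew collector: T0 ≈ number of List elements / object creation frequency ≈ 12.5 million / 4 million per second ≈ 3 seconds.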

Putting it all together to build an application-like GC Benchmark

The lifetime distributions of figures 1 and 7 suggest that a combination of the GarbageOnly and RandomList microbenchmarks can be used to create a benchmark with more real-world object lifetimes. The recipe is the following:

  1. Each thread of the benchmark creates the overwhelming majority of objects without any references, as in the GarbageOnly benchmark. A small fraction of the objects is placed in a List, as in the RandomList benchmark. The smaller that fraction, the slower the replacement rate of the List entries and the longer the average lifetime of the objects in the List.
  2. The benchmark runs several threads, each of which uses its own List and inserts its own fraction of objects into that List.
  3. The size of each thread's List and the fraction of objects inserted determine the average lifetime Tn of each thread's longer-lived objects. The sum of the individual lifetime distributions yields a quasi-continuous lifetime distribution for the benchmark as a whole which stretches from a dominant peak at 0 lifetime into the minutes range.

I call this benchmark MixedRandomList. The exact numbers for List sizes and object fractions can be found in the source code, which I ran with the parameters objectSize=100, numberOfThreads=8 and numLive (number of live objects) = 12500000.
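
The recipe can be sketched in a few lines of Java. The following is a simplified, hypothetical version and not the actual MixedRandomList source code linked above; in particular, the way the per-thread insertion fraction (fractionLive, see the discussion in the comments below) is computed and applied here is only an illustration. It needs a heap of several GB, e.g. -Xmx6g as in the tests:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Hypothetical sketch of the MixedRandomList recipe: each thread mostly creates
    // garbage and only inserts every n-th object into its own long-lived List.
    public class MixedRandomListSketch {
        private static final int OBJECT_SIZE = 100;
        private static final int NUMBER_OF_THREADS = 8;
        private static final int NUM_LIVE = 12_500_000;   // live objects, shared among the threads

        public static void main(String[] args) {
            int livePerThread = NUM_LIVE / NUMBER_OF_THREADS;
            for (int t = 1; t <= NUMBER_OF_THREADS; t++) {
                // Larger fraction -> rarer insertions -> longer average lifetime in this List.
                final long fractionLive = (long) Math.pow(50, t);
                new Thread(() -> produce(livePerThread, fractionLive)).start();
            }
        }

        private static void produce(int liveObjects, long fractionLive) {
            List<byte[]> live = new ArrayList<>(liveObjects);
            for (int i = 0; i < liveObjects; i++) {
                live.add(new byte[OBJECT_SIZE]);
            }
            Random rnd = new Random();
            long counter = 0;
            while (true) {                                 // runs until stopped externally
                byte[] obj = new byte[OBJECT_SIZE];
                if (++counter % fractionLive == 0) {
                    live.set(rnd.nextInt(liveObjects), obj); // keep a small fraction alive for a while
                }                                            // all other objects are garbage immediately
            }
        }
    }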

By instrumenting the benchmark, I measured the object lifetime distribution in a typical 20 min test run:

Figure 9. Object lifetime distribution in a typical 20 min test run of the MixedRandomList benchmark. Lifetimes were measured with a resolution of 10 milliseconds and are displayed on a logarithmic scale.

By construction of the benchmark, the distribution still has a singularity at 0 lifetime (the <10 ms bin) but is a continuous and strictly decreasing function at higher lifetimes. With this benchmark, heap usage and GC behavior already look very much like they do in real-world applications:

Figure 10. Heap usage and GC activity in the MixedRandomList benchmark for the ParallelGC (top) and the G1 collector (bottom).

The ParallelGC collector at the top creates the typical superposition of two sawtooth patterns in the blue heap usage curve, one very fast and created by the many new generation collections, the other slow and created by the fewer old generation collections (grey vertical lines). Pause times of new generation collections are in the 10 milliseconds range; pause times of old generation collections are in the seconds range for the ParallelGC. Less than 1% of all objects make it to the old generation, which is a good value for real-world applications. Only the absolute GC rate is still far above any reasonable real-world level because the JVM still does little more than create objects and collect garbage.

The following figure shows the different garbage collectors’ performance as a function of NewSize when applied to the MixedRandomList benchmark:

Figure 11. Throughput of the Hotspot JVM's garbage collectors running the MixedRandomList benchmark as a function of NewSize (solid lines) and for unspecified NewSize (dashed lines).

Garbage Collectors are not free of Bugs

The most striking feature in figure 11 is the very poor performance of the ParallelGC collector above 1800 MB of NewSize. Unfortunately, close to the default value of NewSize=2g (one third of the 6 GB total heap size) it falls off a cliff and badly spoils its out-of-the-box performance. This is simply a bug, which I reported to the OpenJDK GC community and the Java bug database. Because ParallelGC is the default collector in Java 6 and 7, this bug means that you obtain the lowest performance of the whole plot and lose a factor of 10 compared to the optimum if you run the benchmark with -Xms and -Xmx as the only parameters! From this we learn that it is sometimes a good investment to play around with at least the -XX:NewSize parameter or to select an alternative collector. I usually use ParNewGC to double-check because it has proven robust and generally delivers good performance (as in this case).

There usually is an optimum for NewSize

The second most striking feature in figure 11 is that in this benchmark it is no longer helpful to blindly maximize NewSize, because the benchmark (by construction, and similar to real applications) at any time has a considerable amount of live objects in its heap. In this case, there are about 3 GB of live objects, as can also be estimated from figure 10. As soon as NewSize becomes large enough to leave less than these 3 GB for the old generation, throughput decreases sharply for all collectors because they have to give up on new generation collections and do only (more expensive) Full GCs. (The ParallelGC collector above 1800 MB does the same thing, but too early, because erroneous statistics convince it that this cannot be avoided.)
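
In numbers: with 6 GB of total heap and roughly 3 GB of live objects, that threshold lies around NewSize ≈ 6 GB - 3 GB = 3 GB, beyond which the old generation can no longer hold all live objects.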

G1 is not quite as efficient as the classic Collectors

The third most striking feature is that the G1 collector, in either the Java 7u45 or the early access 1.8 JVM, now falls considerably behind the classic collectors and reaches only about 70% of their maximal throughput. Note, however, that its default performance (dashed horizontal lines) is better than the results reached with NewSize fixed to any constant value. This confirms the best practice that with G1 you should avoid setting NewSize and rather let it do the adjustment job itself. But this adjustment apparently comes at considerable cost.

The following table summarizes key performance indicators of the different collectors from those runs where each delivered its best performance in figure 11:

Figure 12. Key performance indicators for the 4 collectors' best result from figure 11.

It comes as a surprise that the CMS manages to outperform the other collectors in this benchmark in terms of throughput (second column). Key factors are the small amount of concurrent GC time (23.9%) compared to the G1 and the lowest percentage (6.5%) of accumulated pause time. One important advantage for the CMS in this benchmark should be mentioned: the absence of fragmentation. Because all the objects created have the same size, fragmentation is not an issue. This not only avoids heap fragmentation and long-term performance degradation (which I have verified) but also, in every new generation collection, saves the potentially costly search for free space of suitable size in the old generation. This argument does not explain why new generation pause durations and accumulated pause times are low for the CMS in this micro benchmark (which remains a surprisingly good result), but it does tell us that we should expect longer pauses, higher accumulated pause time and less throughput from the CMS in real-world applications, while the other 3 collectors compact their heap anyway and would not suffer from fragmentation.

As a last remark, it is worthwhile comparing the table in figure 12 with its equivalent, figure 7 from my previous post: this one is from a relatively simple benchmark and the other from a real-world application, but they share many features. The most striking difference is that in the real-world sample GC pauses are much longer although heap sizes are much smaller. In real-world applications, with much larger and more complicated reference graphs than the handful of List objects used by the benchmark, we should expect more effort for marking and thus considerably longer pauses.

Summary and Conclusions

With relatively simple micro benchmarks it is possible to reproduce many features of real-world Java garbage collection loads and gain insight into the weaknesses and strengths of different collector implementations. From the results presented above we can learn the following about the 4 main collectors of Oracle's Hotspot JVM:

  1. The G1 collector has good default settings and, as outlined in the official best practice guidelines, there is little to gain and much to lose from setting the new generation size explicitly.
  2. The G1 collector is very stable for its young age. I did not encounter a single JVM failure during hundreds of hours of benchmark execution (on 3 different hardware and OS platforms: Intel/Windows, Intel/Linux and SPARC/Solaris).
  3. The G1 does expensive processing compared to the other collectors, which costs it throughput, in particular when it comes to old generation heap collections.
  4. For all other collectors it is advisable to configure NewSize explicitly instead of relying on the default.
  5. There is no single and optimal heap and GC configuration because there are many circumstances which favor one or the other collector, strategy and configuration.
  6. Performance of the CMS collector suffers most from too small values of NewSize. Because the impact is surprisingly strong and was not present in Java 6, this is probably a bug.
  7. The ParallelGC collector can deliver very good throughput if it does not fall into a pit, which usually stems from its smart statistics and heuristics features.
  8. Its -XX:+UseParallelOldGC option is not always helpful (but has become the default in Java 7).
  9. The ParNewGC collector in most cases belongs to the best performers and rarely causes problems because it is straightforward and simple. It is a good choice if you want to double-check whether a GC problem is caused by the application or an unfortunate collector/configuration. It is regrettable that it will be deprecated in Java 8 except in combination with the CMS.

Conclusions 4 to 9 are corroborated not only by the above benchmark results but also by my previous experience with GC tuning of real-world applications.

Outlook

I have already gathered, and will present, more benchmark results to address the following:

  • Refine the benchmark to make its GC load look more like that of real applications.
  • Have a closer look at important heap and GC parameters other than NewSize.
  • Move to a much larger heap (50 GB) to tackle the leading edge range.
  • At the same time, give pause times more weight than mere throughput (which means a focus on CMS vs. G1).
  • In particular, have a closer look at the G1’s pause time control.

Beyond that, it would also be interesting to have a closer look at the influence of moderate heap fragmentation, which I have neglected so far to the advantage of the CMS collector.

14 Responses to “Benchmarking G1 and other Java 7 Garbage Collectors”

  1. Thanks for the benchmarks, interesting reads. Two comments:

    a) there is one place where you name the benchmark GarbageFirst instead of GarbageOnly.

    b) you should specify what flags you used to run the different GCs. Your names do not really make sense. For example, ParNewGC is not a new version of ParGC, it is actually a flag for the new generation collection. Therefore you can use CMS+ParNewGC, which is a typical combination and seems not to be benchmarked in your case. In fact the 6 combinations are (maybe you can add which of the combinations are the ones you have named in your charts):

    Serial (DefNew)+
    Serial Mark-Sweep-Compact (PSOldGen)
    -XX:+UseSerialGC

    Parallel scavenge (PSYoungGen)+
    Serial Mark-Sweep-Compact (PSOldGen)
    -XX:+UseParallelGC

    Parallel scavenge (PSYoungGen)+
    Parallel Mark-Compact (ParOldGen)
    -XX:+UseParallelGC -XX:+UseParallelOldGC

    Serial (DefNew)+
    Concurrent Mark Sweep
    -XX:+UseConcMarkSweepGC -XX:-UseParNewGC

    Parallel (ParNew)+
    Concurrent Mark Sweep
    -XX:+UseConcMarkSweepGC -XX:+UseParNewGC

    G1GC
    -XX:+UseG1GC

    • Dr. Andreas Müller says:

      Thanks for the hints. You are right that several collectors for the new and old generations can be combined.
      Because I wanted to keep things simple and avoid discussing this multitude (plus all defaults and their changes from Java release to Java release) I applied a minimum of flags for all of the tests discussed here. Therefore, each of the 4 collectors named in the text corresponds to exactly one flag which I applied exclusively:
      a)-XX:+UseParallelGC or
      b)-XX:+UseParNewGC or
      c)-XX:+UseConcMarkSweepGC (abbreviated “CMS” in text and legends) or
      d)-XX:+UseG1GC.
      As your own list shows, these 4 flags are influenced by defaults:
      a) could be omitted because it has been the default in Java 5-8 for any relevant (“server grade”) hardware and by default it also includes -XX:+UseParallelOldGC since an early Java 7 release
      b) by default uses the SerialGC for old generation collections (it therefore is available independently of -XX:+UseConcMarkSweepGC but this very combination will be deprecated in Java 8 together with others from your list)
      c) by default includes -XX:+UseParNewGC to run the new generation collections as we both know
      d) is the only one of the 4 collectors which by itself covers both the new and the old generation in a single implementation.
      Altogether, the heap and GC part of the commandline in all cases looked like that:
      -Xms6g -Xmx6g (-XX:NewSize=x -XX:MaxNewSize=x) -XX:+Use[abcd].
      When I left out the NewSize configuration in brackets I marked it as “Default” with a dashed horizontal line in all plots where NewSize is on the horizontal axis.
      Wouldn’t it be desirable to work with a config as simple as that in 98% of the cases? I think that is possible and should be the aim of all people working on the subject.

  2. Ben Treeser says:

    Hi,
    could you also post the source code files of the OldOnlyLinear and OldOnlyRandom? I would really appreciate it.

    - You use Runtime.getRuntime().maxMemory()/objectSize/5 to calculate the number of objects to fill about half of the available memory. Why not use Runtime.getRuntime().freeMemory()? And why divide by 5?

    - For the GarbageProducer you use for fractionLive the value 50^8, where 8 is the 8th thread. Well, 50^8 is quite a big number: 39062500000000. Why not use an equal number for each thread? How did you come up with this equation and what is its purpose? As I understand it, with a higher value of fractionLive the program spends more time creating garbage than creating long-lived objects. Is that correct?

    • Dr. Andreas Müller says:

      Hi Ben,

      >could you also post the source code files of the OldOnlyLinear and OldOnlyRandom? I would really appreciate it
      I have not posted the source code of the LinearList and RandomList samples because they are less important than the other two and useful mainly as building blocks for more complex samples like the MixedRandomList sample. They are also very simple and you can easily deduce them from the MixedRandomList source code: use only one thread, set fractionLive to 1 and remove the index randomization in the LinearList case.

      >You use Runtime.getRuntime().maxMemory()/objectSize/5 to calculate the number of objects to fill about half of the available memory.
      As far as the factor is concerned, you would probably start with 2 to fill half the heap, but because a character uses two bytes the factor doubles to 4, and because each String created uses heap in addition to the char array you end up with about 5. By running the sample you can easily check that this number roughly fills half the available heap. But (as explained in a comment in the source) the value calculated like that is not exactly independent of the garbage collector chosen. To guarantee an equal load for all collectors I therefore always used an explicit and constant program parameter to override that default (as is also explained in the text and visible in the source).
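
      As a rough worked example for the heap used in this article: 6 GB / 100 bytes / 5 = 12 million objects, which is indeed close to the explicit value numLive=12500000 used for the tests.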

      >For the GarbageProducer you use for the fractionLive the value 50^8, where 8 is the 8th Thread. … Why not use an equal number for each thread? How do you came up with this equation and what is the purpose of this?
      To achieve a lifetime distribution from which no lifetimes are missing, it is necessary to use a wide range of fractionLive values for the different lists. The larger that value, the less frequently objects are inserted into one list and the longer those already in the list can survive. On the other hand, the lifetime distribution of one list also gets broader when the average lifetime grows. To achieve a smooth distribution, it therefore makes sense to use a power law for fractionLive (a^n where n is the number of the thread). You are right, however, that the base a=50 is a bit large and means that the 8th (and also the 7th and the 6th) thread only creates garbage during the execution time of the test. On the other hand, making it too small will produce too many long-lived objects from thread 1 and rather quickly break the concurrent garbage collectors, CMS and G1, just as in the LinearList and RandomList samples. I played a bit with that value to achieve GC behavior similar to what is observed in many normal applications, but my choice can certainly be further improved (or varied to cover other scenarios).
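
      As a worked example of that power law with a=50, reading fractionLive as "one out of fractionLive created objects is inserted into the list": thread 1 inserts every 50th object, thread 2 every 2500th, thread 3 every 125000th, and thread 8 only every 39062500000000th, so the highest-numbered threads effectively produce nothing but garbage during a 20 minute run.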

      >Your object life time distributions are just schematic, right?
      No, I have actually measured the lifetime distribution shown in figure 9 using instrumentation of the MixedRandomList sample. The result convinced me that the formula for fractionLive discussed above achieves what I wanted to achieve. Note that similar distributions (a nearly straight line in a double logarithmic plot) are common in many areas. Of course, the lifetime distribution on the left of figure 1 is schematic (from an original Sun publication), and the ones on the right and in figure 7 are schematic but rather trivial.

  3. Ben Treeser says:

    And one more question: Your object lifetime distributions are just schematic, right (you think they look like this)? Is there no way to get the real lifetime distributions?

  4. Steve Pedlar says:

    Hi,
    ref your comment:
    “This bug has also (as the bug from figure 11) been reported to the Java bug database and to the OpenJDK GC mailing list.”
    – Do you have any bug reference numbers provided by Oracle?
    Thanks
    Steve

    • Dr. Andreas Müller says:

      Yes, it was assigned the bug id 9009508 (which is not searchable from the bug database, yet).

  5. Ben Treeser says:

    Thanks for your explanation Andreas. I would like to investigate the dependency between the object lifetime distribution and the NewRatio of the heap. Distributions as in Fig. 1 (left) could tell you which ratio between the young and old generation to choose in order to minimize, e.g., the response time. Can you please explain how you measure the object lifetime distributions?

    • Dr. Andreas Müller says:

      You create a custom class which stores the value of System.currentTimeMillis() in an instance variable in the constructor. Before you replace an object of that class in any List or similar structure, you call a method which returns the elapsed time since creation of the object. You then increment a counter for an appropriate lifetime bin before you drop the reference to the object. Each thread holds an array of such counters which are added to the counters of the other threads only at the end, to avoid any thread synchronization overhead during test execution.
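
      A minimal sketch of that technique could look as follows; the class and method names are purely illustrative and not taken from the actual benchmark code:

      // Hypothetical sketch of the lifetime instrumentation described above.
      public class TimedObject {
          private final long createdAt = System.currentTimeMillis();  // stamped in the constructor
          private final byte[] payload = new byte[100];               // the actual benchmark data

          long ageMillis() {
              return System.currentTimeMillis() - createdAt;
          }
      }

      // Per-thread histogram of lifetimes, merged only after the test run.
      class LifetimeHistogram {
          private static final int BIN_MILLIS = 10;                   // 10 ms resolution as in figure 9
          private final long[] bins = new long[200_000];              // covers lifetimes up to ~33 minutes

          void recordBeforeReplacing(TimedObject obj) {
              int bin = (int) Math.min(obj.ageMillis() / BIN_MILLIS, bins.length - 1);
              bins[bin]++;                                            // count, then drop the reference
          }

          void mergeInto(long[] total) {                              // called once at the end, no locking during the run
              for (int i = 0; i < bins.length; i++) {
                  total[i] += bins[i];
              }
          }
      }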

  6. Brent Boyer says:

    Andreas: just wanted to say thanks for your detailed series of GC articles.

    I hope that you find the time to periodically followup with all the latest developments (e.g. when Java 8 finally comes out).

    I also hope that you can get your hands on Azul’s Zing and benchmark it as well.

  7. Brent Boyer says:

    Andreas: another technology that looks promising is the Shenandoah ultra-Low-Pause-Time Garbage Collector, which is like G1 but does concurrent evacuation:
    http://openjdk.java.net/jeps/189

    • Ryan Gardner says:

      What were the list of all the flags that you were tuning G1 with?

      It looks, at a quick glance, like you were setting an explicit new size for all your G1 runs?

      With G1, setting an explicit new size is a bad idea as it disables the built-in ergonomics. If you feel the need to tune it, start from the default settings.

      I’ve personally had very good success tuning G1 on applications with very large heaps and very taxing GC loads and object allocation rates with lots of variation in them. G1 will adjust the new size to meet the pause time targets very well.

      Your G1 results will be much better if you just use minimal args such as "-XX:+UseG1GC -XX:MaxGCPauseMillis=100" and let it tune itself – only tweak it if you absolutely must.

  8. orenkomp.ru says:

    What I have found is that concentrated analysis of GC logs is always key in determining the best garbage collector and arguments.

  9. Dr. Andreas Müller says:

    Hi Ryan,
    in some plots with NewSize on the x-axis there are also horizontal dashed lines which show the result when NewSize is left undefined (as you suggest). You can see that in most cases for G1 (unlike the other collectors) the default result is better than what you can get by setting NewSize explicitly. The most striking example is figure 11. This confirms nicely what you say and I also emphasized the same recommendation in conclusion 1.