Tuesday, August 30, 2011

JVM performance tuning part 3: JVM tuning

The Java virtual machine can be tuned in several ways. The three most important ones will be discussed in this blog entry:
  1. Java heap space
  2. Garbage collection tuning
  3. Garbage collection ergonomics

Java heap space
As discussed in an earlier blog post, objects reside in the Java heap space. The more objects exist, the more heap space is needed, so the most important part of JVM tuning is sizing the heap correctly. Only after the heap has the correct size should you start experimenting with garbage collection options to improve performance further and reach specific goals.

For Java server applications, Oracle recommends the following regarding the Java heap size:
Unless you have problems with pauses, try granting as much memory as possible to the virtual machine. The default size (64MB) is often too small.
Setting -Xms and -Xmx to the same value increases predictability by removing the most important sizing decision from the virtual machine. On the other hand, the virtual machine can't compensate if you make a poor choice.
Be sure to increase the memory as you increase the number of processors, since allocation can be parallelized.
This means that as soon as you start getting the error java.lang.OutOfMemoryError: Java heap space, it’s time to increase the heap. The easiest way to size the heap correctly is by monitoring garbage collection, but this will be discussed in the next blog entry.
Sizing the heap can be done using the parameters -Xms and -Xmx, defining the minimum and maximum heap space respectively. The memory set by -Xms is also called committed memory; the memory set by -Xmx is also called reserved memory, although only the committed memory is requested from the operating system at startup of the virtual machine. The difference between the maximum and the minimum memory is called virtual memory.
For server applications it’s best to set -Xms equal to -Xmx. This allows the JVM to reserve all the necessary virtual memory at startup of the virtual machine. The best way to explain this is using an example:
Suppose we have a server application that needs 800 MB of Java heap after startup. If we set -Xms256m and -Xmx1024m, the JVM asks the operating system for 256 MB of virtual memory for the heap just after creation of the virtual machine. As the server starts and needs more and more memory, the virtual machine asks the operating system for more memory. Then you can only hope the operating system is able to provide it. If not, the JVM starts throwing a java.lang.OutOfMemoryError. If we set -Xms1024m and -Xmx1024m, the JVM asks the operating system for enough virtual memory to create the maximum heap size. If that’s not possible the JVM simply won’t start, but you won’t get memory errors related to resizing the heap while the JVM is running.
Setting -Xms equal to -Xmx also makes the JVM start faster: the whole block is reserved at startup, so no overhead is added for allocating new memory blocks while the server starts.
Another effect of setting -Xms equal to -Xmx is a reduction in the number of garbage collections, at the cost of larger pause times for each garbage collection.
As a general rule, the larger the heap size, the larger the pause times will be.
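
To see the difference between committed and maximum heap on a running JVM, the standard java.lang.Runtime API can be queried. The sketch below is only an illustration: the class name HeapReport and its helper methods are made up for this example, and the values reported by totalMemory() may differ slightly from the -Xms setting depending on the collector in use.

```java
class HeapReport {
    // Converts a byte count to whole megabytes (1 MB = 1024 * 1024 bytes).
    static long toMb(long bytes) {
        return bytes / (1024L * 1024L);
    }

    // Builds a one-line summary of the heap as seen by this JVM.
    static String describe() {
        Runtime rt = Runtime.getRuntime();
        long committed = rt.totalMemory();       // heap currently committed by the JVM
        long max = rt.maxMemory();               // upper bound, normally governed by -Xmx
        long used = committed - rt.freeMemory(); // part of the committed heap in use
        return "used=" + toMb(used) + "MB committed=" + toMb(committed)
                + "MB max=" + toMb(max) + "MB uncommitted=" + toMb(max - committed) + "MB";
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running this once with different -Xms and -Xmx values and once with equal values should show the uncommitted part shrinking to roughly zero in the second case.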

Garbage collection tuning
Since JDK 1.5 update 6, four different types of garbage collectors are available:
  • The serial collector (the default one)
  • The parallel or throughput collector
  • The parallel compacting collector
  • The Concurrent Mark Sweep collector
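
To check which collectors a running JVM has actually selected, the standard java.lang.management API can be used. This is a minimal sketch; the class name ActiveCollectors is hypothetical, and the names that are printed depend on the JVM vendor and version (on HotSpot, for example, the serial collector typically shows up as "Copy" and "MarkSweepCompact").

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

class ActiveCollectors {
    // Returns the names of the garbage collectors registered in this JVM,
    // usually one for the young generation and one for the old generation.
    static List<String> names() {
        List<String> result = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            result.add(gc.getName());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String name : names()) {
            System.out.println(name);
        }
    }
}
```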
The serial collector
The serial collector is the default collector, in which both minor and major collections happen in a “stop the world” way. As the name implies, garbage collections run serially and use only one CPU core, even if more of them are available.

A young generation collection is done by copying the live objects from the “Eden space” to the empty survivor space (“To space” in the figures). Objects that are too big for the survivor space are copied directly to the tenured space. Relatively young objects in the occupied survivor space (“From space”) are copied to the other survivor space (“To space”), while relatively old objects are copied to the tenured space. When the “To space” becomes full, all remaining live objects in the “Eden space” or “From space” are copied to the tenured space as well. Objects that are still in the “Eden space” or the “From space” after the copy operation are dead objects and can be swept.
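
The copy step described above can be sketched as a toy model. This is not how HotSpot implements it: the class MinorGcSketch, the promotion threshold TENURE_AGE and the To space capacity TO_SPACE_SLOTS are all assumptions made for the example, object sizes are ignored, and liveness is given as a flag instead of being derived from references.

```java
import java.util.ArrayList;
import java.util.List;

class MinorGcSketch {
    static final int TENURE_AGE = 3;     // hypothetical age at which objects are promoted
    static final int TO_SPACE_SLOTS = 2; // hypothetical capacity of the To space

    static class Obj {
        final String name;
        final boolean live;
        int age; // number of young collections survived

        Obj(String name, boolean live, int age) {
            this.name = name;
            this.live = live;
            this.age = age;
        }
    }

    // Copies live objects out of Eden and From space and returns the new To space.
    // Old objects, and objects that no longer fit in To space, go to tenured.
    static List<Obj> collect(List<Obj> eden, List<Obj> from, List<Obj> tenured) {
        List<Obj> to = new ArrayList<>();
        List<Obj> candidates = new ArrayList<>(eden);
        candidates.addAll(from);
        for (Obj o : candidates) {
            if (!o.live) {
                continue; // dead objects are simply left behind
            }
            o.age++;
            if (o.age >= TENURE_AGE || to.size() >= TO_SPACE_SLOTS) {
                tenured.add(o);
            } else {
                to.add(o);
            }
        }
        eden.clear(); // whatever is left in Eden and From space is garbage
        from.clear();
        return to;
    }

    public static void main(String[] args) {
        List<Obj> eden = new ArrayList<>();
        eden.add(new Obj("a", true, 0));
        eden.add(new Obj("b", false, 0));
        eden.add(new Obj("c", true, 0));
        eden.add(new Obj("d", true, 0));
        List<Obj> from = new ArrayList<>();
        from.add(new Obj("e", true, 2));
        List<Obj> tenured = new ArrayList<>();
        List<Obj> to = collect(eden, from, tenured);
        System.out.println("to=" + to.size() + " tenured=" + tenured.size());
    }
}
```

In this run "a" and "c" end up in the To space, "d" is promoted because the To space is full, "e" is promoted because of its age, and "b" is dropped.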

An old generation collection makes use of a mark-sweep-compact algorithm. The mark phase determines the live objects. The sweep phase erases the dead objects. During compaction the objects that are still live are slid to the beginning of the tenured space. The result is a “full” region and an “empty” region in the tenured space. New objects too large for the “Eden space” can be allocated directly in the “empty” part of the tenured space.
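
The sweep and sliding compaction steps can be illustrated with a toy heap made of array slots. This is a sketch under simplifying assumptions, not the HotSpot implementation: the class SlidingCompaction is hypothetical, every object occupies exactly one slot, and the result of the mark phase is passed in as a boolean array.

```java
class SlidingCompaction {
    // Slides the marked (live) entries to the front of the heap, preserving
    // their order, and clears the remaining slots.
    // Returns the index at which the "empty" region starts.
    static int compact(String[] heap, boolean[] marked) {
        int free = 0; // next slot in the "full" region
        for (int i = 0; i < heap.length; i++) {
            if (marked[i]) {
                heap[free] = heap[i];
                marked[free] = true;
                free++;
            }
        }
        for (int i = free; i < heap.length; i++) {
            heap[i] = null; // dead or already moved: this slot is now free
            marked[i] = false;
        }
        return free;
    }

    public static void main(String[] args) {
        String[] heap = {"a", "b", "c", "d", "e"};
        boolean[] marked = {true, false, true, false, true};
        int firstFree = compact(heap, marked);
        System.out.println(java.util.Arrays.toString(heap) + " firstFree=" + firstFree);
    }
}
```

After the call the live objects "a", "c" and "e" occupy slots 0 to 2 and slots 3 and 4 form the empty region in which large new objects could be allocated.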

The serial collector can be activated by using -XX:+UseSerialGC. This collector is good for client-side applications that have no strong pause constraints.

The parallel or throughput collector
The design of this GC focuses on using multiple CPU cores during garbage collection, instead of leaving the other cores unused while one core does all the work.

The young generation collector is a parallel variant of the serial collector. Although it makes use of multiple cores, it’s still a “stop the world” garbage collector. The use of multiple cores decreases total GC time and increases throughput.

A collection of the old generation happens in the same way as a serial collection.

The throughput collector can be activated by using -XX:+UseParallelGC.

This collector is efficient on machines with more than one CPU, but still has the disadvantage of long pause times for a full GC.

The parallel compacting collector
This GC was introduced in JDK 1.5 update 6 and performs old generation collections in a parallel fashion.

A young generation collection is done the same way as a young generation collection in the throughput collector.

The old generation collection is also done in a “stop the world” fashion, but it runs in parallel and uses sliding compaction. The collection consists of three phases: mark, summary and compaction. First of all, the old generation is divided into regions of fixed size. During the mark phase the objects are divided among several GC threads, which mark all live objects. The summary phase determines the density of live data in each region; if the density is high enough, no compaction is performed on that region. As soon as a region is reached whose density is low enough to make compaction worthwhile (the cost of compaction is low enough), compaction is performed on that region and all subsequent regions, based on information from the summary phase. The mark and compaction phases run in parallel, while the summary phase runs serially.
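
The decision made in the summary phase can be sketched as a scan over the regions. This is a simplified illustration: the class SummaryPhaseSketch and the fixed minDensity threshold are assumptions for the example, whereas the real collector weighs the cost of compaction rather than comparing against a constant.

```java
class SummaryPhaseSketch {
    // Returns the index of the first region whose density of live data is
    // below minDensity. That region and all subsequent regions would be
    // compacted; the regions before it are left untouched.
    static int firstRegionToCompact(int[] liveBytes, int regionSize, double minDensity) {
        for (int i = 0; i < liveBytes.length; i++) {
            double density = (double) liveBytes[i] / regionSize;
            if (density < minDensity) {
                return i;
            }
        }
        return liveBytes.length; // every region is dense enough, nothing to compact
    }

    public static void main(String[] args) {
        int[] liveBytes = {100, 95, 40, 90}; // live bytes per region of 100 bytes
        System.out.println(firstRegionToCompact(liveBytes, 100, 0.9));
    }
}
```

With these numbers the first two regions are dense enough to be skipped and compaction would start at region index 2.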

The parallel compacting collector can be activated by using -XX:+UseParallelOldGC.

This collector is efficient on machines with more than one CPU and for applications with stricter requirements regarding pause times, since a full GC is done in parallel.

The Concurrent Mark Sweep collector
The focus of this GC is on reducing pause times rather than improving throughput. Some Java servers require a large heap, leading to major collections that can take a while to complete and therefore to long pause times. This GC was introduced to address that.

The young generation collection is done the same way as a young generation collection in the throughput collector.

The biggest part of the old generation collection runs concurrently with the application threads, resulting in shorter pause times. The CMS starts a GC before the tenured space is full; its goal is to finish the collection before the tenured space runs out of room. Due to fluctuations in the load of a server, the tenured space can fill up more quickly than the CMS GC can complete. At that moment the CMS GC is aborted and a serial “stop the world” GC takes place. The CMS has 3 major phases:
  • Initial mark: the application threads are stopped to determine which objects are directly reachable from the Java code.
  • Concurrent mark: during this phase the GC determines which objects are still reachable from the set found by the initial mark. Application threads keep on running during this phase, so it cannot be guaranteed that all reachable objects are found (the running application threads can create new objects and change references). For this reason a third phase is required.
  • Remark: an extra check is performed on the set of objects from the concurrent mark phase. Application threads are stopped again, but this phase makes use of multiple threads.
Since there is some overlap between the different phases, this GC introduces some overhead.

The CMS GC doesn’t make use of compacting, resulting in less time spent during GC, but adding additional cost during object allocation.


Another disadvantage is that this GC introduces floating garbage. During the concurrent mark phase the application threads are still running, so objects that were already marked as live (for example objects that have just been moved to the tenured space) can become dead before the collection ends. These dead objects can only be cleaned up during the next GC. As a result this GC requires additional heap space.
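
Floating garbage can be illustrated with plain sets. The sketch below is a toy model with hypothetical names (FloatingGarbageSketch, sweep, floatingGarbage): the heap is a set of object names, marked is what the concurrent mark phase found, and reachableNow is what is still reachable once the collection ends.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class FloatingGarbageSketch {
    // The sweep keeps every object that was marked, even if it died after marking.
    static Set<String> sweep(Set<String> heap, Set<String> marked) {
        Set<String> survivors = new HashSet<>(heap);
        survivors.retainAll(marked);
        return survivors;
    }

    // Floating garbage: objects that survived the sweep but are no longer reachable.
    static Set<String> floatingGarbage(Set<String> survivors, Set<String> reachableNow) {
        Set<String> floating = new HashSet<>(survivors);
        floating.removeAll(reachableNow);
        return floating;
    }

    public static void main(String[] args) {
        Set<String> heap = new HashSet<>(Arrays.asList("a", "b", "c"));
        Set<String> marked = new HashSet<>(Arrays.asList("a", "b")); // "c" was already dead
        Set<String> reachableNow = new HashSet<>(Arrays.asList("a")); // "b" died during marking
        Set<String> survivors = sweep(heap, marked);
        System.out.println("floating garbage: " + floatingGarbage(survivors, reachableNow));
    }
}
```

Object "b" stays in the heap after this cycle although it is dead, which is why extra heap space has to be reserved until the next collection.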

The CMS can be activated by using -XX:+UseConcMarkSweepGC.
This collector is efficient on machines with more than one CPU and for applications requiring low pause times rather than high throughput. It’s also possible to enable CMS for the permgen space by adding -XX:+CMSPermGenSweepingEnabled.
JVM Ergonomics
JVM ergonomics were introduced in JDK 5 and make the JVM perform a kind of “self tuning”. This is partially based on the underlying platform (hardware, OS, …): a specific GC and heap size are chosen automatically for the platform. It’s also possible to define a desired behavior (pause time and throughput), in which case the JVM sizes its heap automatically to meet that behavior as well as possible.

Regarding the platform, the JVM makes a distinction between a client class and a server class machine. A server class machine is a machine with at least 2 CPUs and at least 2 GB of RAM.

In case of a server class machine the following options will be chosen automatically:
  • Throughput GC
  • Min heap size = 1/64 of the available physical memory with a maximum of 1GB
  • Max heap size = ¼ of the available physical memory with a maximum of 1GB
  • Server runtime compiler
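
The rules above can be written down as simple arithmetic. The sketch below only restates the numbers given in this post; the class ErgonomicsDefaults and its methods are hypothetical, and the values a real JVM picks may differ by version and platform.

```java
class ErgonomicsDefaults {
    static final long MB = 1024L * 1024L;
    static final long GB = 1024L * MB;

    // Server class machine as described above: at least 2 CPUs and 2 GB of RAM.
    static boolean isServerClass(int cpus, long physicalBytes) {
        return cpus >= 2 && physicalBytes >= 2 * GB;
    }

    // Initial heap: 1/64 of physical memory, capped at 1 GB.
    static long defaultInitialHeap(long physicalBytes) {
        return Math.min(physicalBytes / 64, GB);
    }

    // Maximum heap: 1/4 of physical memory, capped at 1 GB.
    static long defaultMaxHeap(long physicalBytes) {
        return Math.min(physicalBytes / 4, GB);
    }

    public static void main(String[] args) {
        long physical = 8 * GB; // example machine with 8 GB of RAM
        System.out.println("server class: " + isServerClass(4, physical));
        System.out.println("initial heap MB: " + defaultInitialHeap(physical) / MB);
        System.out.println("max heap MB: " + defaultMaxHeap(physical) / MB);
    }
}
```

For a machine with 4 CPUs and 8 GB of RAM this gives an initial heap of 128 MB and a maximum heap of 1 GB, since a quarter of the physical memory would exceed the cap.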
The throughput collector and the parallel compacting collector allow the desired behavior to be defined. When these parameters are set, the JVM tries to size the heap and the GC parameters to meet the requirements as closely as possible. It’s important to know that the JVM treats them only as a hint; there is no guarantee that the JVM can meet the defined behavior.
  • The desired maximum pause time can be defined by setting -XX:MaxGCPauseMillis=<nnn>, where <nnn> is the time in milliseconds.
  • The desired throughput goal can be defined by setting -XX:GCTimeRatio=<nnn>. The ratio of GC time to application time is defined as 1 / (1 + <nnn>). This means that with -XX:GCTimeRatio=19 a maximum of 5% of the time will be spent on GC.
  • If both the maximum pause time and the throughput goal can be met, the JVM will try to meet the footprint goal by reducing the heap size.
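
The throughput goal is easy to verify with a few lines of code. The sketch below computes the GC time fraction that corresponds to a given -XX:GCTimeRatio value and, as a rough comparison, the fraction this JVM has actually spent in GC so far according to the standard java.lang.management API. The class GcTimeRatio and its method names are made up for the example, and the measured value is only an approximation because collectors report their time differently.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcTimeRatio {
    // Fraction of time the JVM may spend on GC for a given -XX:GCTimeRatio value.
    static double maxGcFraction(int gcTimeRatio) {
        return 1.0 / (1 + gcTimeRatio);
    }

    // Approximate fraction of the JVM uptime spent in GC so far.
    static double measuredGcFraction() {
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                gcMillis += t;
            }
        }
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptime > 0 ? (double) gcMillis / uptime : 0.0;
    }

    public static void main(String[] args) {
        System.out.println("goal for GCTimeRatio=19: " + maxGcFraction(19));
        System.out.println("measured so far: " + measuredGcFraction());
    }
}
```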
Author: Dimitri