CloudBees Jenkins JVM Troubleshooting


Purpose

This document is a reference for tuning and troubleshooting the Java Virtual Machine (JVM) in which Jenkins runs. It covers current best practices for JVM configuration and serves as a troubleshooting guide for Jenkins administrators facing performance issues.

Background

Jenkins is a self-contained Java-based program, ready to run out-of-the-box, with packages for Windows, Mac OS X and other Unix-like operating systems. The Jenkins application runs inside a Java Virtual Machine (JVM).

JVMs have two primary functions:

  1. To allow Java programs to run on any device or operating system (known as the "Write once, run anywhere" principle)

  2. To manage and optimize program memory.

There are several aspects to tuning a JVM for Jenkins:

  • A supported version of Java

  • Minimum JVM specifications

  • Recommended Jenkins JVM specifications

Supported Java versions

JVMs manage program resources during execution based on configurable settings, and it’s important to ensure that you’re running a supported version of Java.

Suggested Specifications for Jenkins

Heap size

Heap sizes are calculated based on the amount of memory on the machine unless the initial and maximum heap sizes are specified on the command line.

Setting               Default JVM Specifications (Out-of-the-box)   Recommended JVM Specifications
Initial Heap Size     1/64th of physical memory up to 1GB           Minimum of 2GB (4GB for Production Instances)
Maximum Heap Size     1/4 of physical memory up to 1GB              Maximum of 16GB (anything larger should be scaled horizontally)
Garbage Collection    ParallelGC                                    G1GC

You can specify the initial and maximum heap sizes using the flags -Xms (initial heap size) and -Xmx (maximum heap size). See Recommended Jenkins JVM Specifications.
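As a rough illustration of the default sizing rule described above, the following sketch computes the out-of-the-box initial and maximum heap for a given amount of physical memory. The function name default_heap_bytes is hypothetical, and the 1GB caps are taken from the table in this section; a real JVM may apply different limits depending on its version and mode.

```python
GIB = 1024 ** 3  # bytes in one gibibyte
MIB = 1024 ** 2  # bytes in one mebibyte


def default_heap_bytes(physical_bytes):
    """Return (initial, maximum) default heap sizes in bytes.

    Follows the rule stated in this section: 1/64th of physical memory
    for the initial heap and 1/4 for the maximum heap, each capped at 1GB.
    """
    initial = min(physical_bytes // 64, 1 * GIB)
    maximum = min(physical_bytes // 4, 1 * GIB)
    return initial, maximum


# Example: a machine with 8GB of physical memory.
initial, maximum = default_heap_bytes(8 * GIB)
print(f"initial: {initial // MIB} MB, maximum: {maximum // MIB} MB")
```

On such a machine the defaults fall well below the recommended 2GB minimum, which is why setting -Xms and -Xmx explicitly is suggested.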

Garbage Collection

In many traditional programming languages, program memory was managed by the programmer. Java uses a process called “garbage collection” to manage program memory. Garbage collection happens inside a running JVM and continuously identifies and eliminates unused objects in memory from Java programs. There are four different types of garbage collection algorithms:

Serial Collector
  • Designed for single-threaded environments

  • Freezes all application threads whenever it’s working

  • Does not make effective use of multiple processors so not a good fit for most Jenkins instances.

Parallel/Throughput Collector
  • Designed to minimize the amount of CPU time spent for garbage collection (maximum throughput) at the cost of sometimes long application pauses.

  • JVM default collector in Java 8 (In later versions of Java G1 is default)

  • Uses multiple threads to scan through and compact the heap

  • Stops application threads when performing either a minor or full garbage collection

  • Best suited for apps that can tolerate application pauses and are trying to optimize for lower CPU overhead

CMS Collector
  • concurrent-mark-sweep

  • Uses multiple threads (“concurrent”) to scan through the heap (“mark”) for unused objects that can be recycled (“sweep”)

  • Does not perform compaction of older objects—eventually the heap will become fragmented and have to do a slow Full GC.

  • Requires a lot of tuning to use with Jenkins, and tends to be more “brittle” than other collectors — if not carefully managed, it can cause serious problems. Recommended to not be used.

G1 Collector
  • The Garbage first collector (G1) aims to minimize GC pause times and adjust itself automatically rather than requiring specific tuning.

  • Uses parallel threads to collect young objects (stop the world pause), and collects older objects mostly without interrupting the application (concurrent GC).

  • Does not aim to collect all garbage at once — most collection cycles just remove young objects, and when older objects are collected they are usually done a part at a time.

  • Introduced in JDK 7 update 4

  • Designed to better support heaps larger than 4GB.

  • Utilizes multiple background threads to scan through the heap, and then it divides the heap into regions ranging from 1MB to 32MB (depending on the size of your heap)

  • Geared towards scanning the regions that contain the most garbage objects first (hence the “Garbage first” name).

  • Uses a bit of extra memory to support garbage collection.
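To give a feel for how the region size relates to the heap size, here is a minimal sketch. It assumes the commonly described heuristic in which G1 aims for roughly 2048 regions and uses power-of-two region sizes between 1MB and 32MB. The function name and the target_regions parameter are illustrative assumptions; the actual JVM calculation may differ, and the region size can also be set explicitly.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def approx_g1_region_size(heap_bytes, target_regions=2048):
    """Estimate the G1 region size in bytes for a given heap size.

    Assumption: the JVM targets about `target_regions` regions and
    rounds the region size down to a power of two in [1MB, 32MB].
    """
    raw = max(heap_bytes // target_regions, 1 * MIB)
    # Round down to the nearest power of two.
    size = 1 << (raw.bit_length() - 1)
    return min(size, 32 * MIB)


for heap_gb in (2, 4, 16):
    region = approx_g1_region_size(heap_gb * GIB)
    print(f"{heap_gb}GB heap -> about {region // MIB}MB regions")
```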

The Suggested Specifications for Jenkins section above describes the recommended minimum settings for a JVM and gives some context for the defaults and garbage collection choices.

This section covers how to configure the Jenkins JVM according to current best practices (as of late 2018) and how to configure it to use a garbage-collection strategy that aligns with your requirements.

The Jenkins service configuration file

If you installed Jenkins using system packages, the Jenkins JVM is controlled by a service configuration file, to which you can add Java arguments. The location of this file varies by product.

CloudBees Jenkins Platform Client Master

For the CloudBees Jenkins Platform Client Master, the service configuration file is located in:

  • Linux distributions: /etc/default/jenkins

  • RedHat/CentOS: /etc/sysconfig/jenkins

  • Windows: %ProgramFiles%\Jenkins\jenkins.xml

CloudBees Jenkins Operations Center

For the CloudBees Jenkins Operations Center, you can find the service configuration in:

  • Linux distributions: /etc/default/jenkins-oc

  • RedHat/CentOS: /etc/sysconfig/jenkins-oc

  • Windows: %ProgramFiles%\Jenkins-OC\jenkins.xml

Adding arguments, and supported Java arguments

This CloudBees Article explains how to add Java Arguments to the service configuration file.

While there are many additional JVM arguments, the following arguments are recommended:

-XX:+AlwaysPreTouch: pre-zeroes memory mapped pages on JVM startup – improves runtime performance, especially at startup and with large heaps.

-XX:+HeapDumpOnOutOfMemoryError: tells the JVM to automatically generate a heap dump file when a heap memory allocation can’t be satisfied. If you experience an OutOfMemoryError, you can provide the heap dump file to CloudBees and we can analyze it to determine if there is a memory leak or some other problem in Jenkins.

-XX:HeapDumpPath=: allows you to specify the directory path where the heap dump file should be written. Please ensure that the directory path specified can be written to by the Jenkins process and has adequate space to hold the heap dump files.

-verbose:gc: enables verbose logging of the garbage collector.

-Xloggc:$path/gc-%t.log: specifies where the verbose GC data is written to. The %t in the gc log file path will cause the JVM to generate a new file with each JVM restart with a timestamp.

-XX:NumberOfGCLogFiles=2 -XX:GCLogFileSize=100m -XX:+UseGCLogFileRotation: limits the number of files, their size, and rotates the GC log respectively.

-XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCCause -XX:+PrintTenuringDistribution -XX:+PrintReferenceGC -XX:+PrintAdaptiveSizePolicy: these parameters gather info on object age and reference GC time for further tuning if needed.

-XX:+PrintHeapAtGC: will generate a large amount of logging to the specified GC log file. Please ensure there is adequate file system space when enabling this option. All the information produced by this option is known to the JVM; therefore, there is no extra processing required as a result of enabling this parameter. However, there will be a slight amount of additional I/O overhead which should amount to less than 3%.

-XX:+LogVMOutput (requires -XX:+UnlockDiagnosticVMOptions): Logs all the VM output (like PrintCompilation) to the default hotspot.log file in current directory.

-XX:LogFile= (requires -XX:+UnlockDiagnosticVMOptions): specifies the path and name for hotspot.log.

  • Linux: -XX:LogFile=/var/log/jenkins/jvm.log

  • Windows: -XX:LogFile="%ProgramFiles%\Jenkins\jvm.log"

-XX:+UseG1GC: enables the G1 Garbage Collector.

-XX:+UseStringDeduplication: looks for strings with the same contents and deduplicates them. Can significantly reduce the memory used by Jenkins and may improve performance.

-XX:+ParallelRefProcEnabled: enables parallel reference processing, reducing young and old GC times.

-XX:+UnlockExperimentalVMOptions: allows the values of experimental flags to be changed by unlocking them.
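Because two of the arguments above only work together with -XX:+UnlockDiagnosticVMOptions, it can help to check an argument list before restarting Jenkins. The sketch below is a hypothetical helper, not part of any CloudBees tooling. It assumes that the unlock flag has to appear earlier in the argument list than the flags that depend on it.

```python
# Flags from the list above that are documented as requiring the unlock flag.
DIAGNOSTIC_FLAGS = ("-XX:+LogVMOutput", "-XX:LogFile=")
UNLOCK_FLAG = "-XX:+UnlockDiagnosticVMOptions"


def flags_missing_unlock(args):
    """Return diagnostic flags that are not preceded by the unlock flag."""
    unlocked = False
    problems = []
    for arg in args:
        if arg == UNLOCK_FLAG:
            unlocked = True
        elif arg.startswith(DIAGNOSTIC_FLAGS) and not unlocked:
            problems.append(arg)
    return problems


args = [
    "-XX:+HeapDumpOnOutOfMemoryError",
    "-XX:+LogVMOutput",
    "-XX:LogFile=/var/log/jenkins/jvm.log",
]
for flag in flags_missing_unlock(args):
    print(f"{flag} needs {UNLOCK_FLAG} before it")
```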

Memory Specifications

Default Maximum heap size “out-of-the-box” with Java is 1/4 of your physical memory. It’s not recommended to exceed ½ of your physical memory, noting that 16GB heaps are the largest supported. For small instances you always want at least 1GB of physical memory left empty for Metaspace, off-heap Java use, and operating system use.

Minimum physical memory

We generally suggest starting Jenkins with at least 2 GB of heap (-Xmx2g), and a minimum physical memory of 4GB. Smaller instances that only serve a small team can use less; however, we do not suggest running Jenkins with less than 1GB of heap and 2 GB of total system memory, because this requires careful tuning to avoid running out of memory. It is important to plan for Jenkins heap to be increased as usage increases, and to scale horizontally when the recommended maximum heap size of 16GB is reached.

Scaling memory in production

For a production environment with considerable workload, a good starting point would be -Xmx4g, which dictates a minimum physical memory of 8GB.

In the event your instance demands more than -Xmx16g, we encourage scaling horizontally by adding more masters using CloudBees Jenkins Operations Center, which divides the workload more efficiently. Another good reason to divide the workload at this threshold is to minimize a single point of failure. At 24 GB, this becomes extremely important because managing a master at this scale requires a specialized skill-set and incurs extra administrative problems. Besides reduced efficiency, it concentrates so much responsibility on one system that it is hard to schedule routine maintenance and harder to guarantee system stability.

How Do I Know If I Need to Increase Heap?

There are a few ways to tell if it’s time to increase the amount of heap allocated to Jenkins. If any of the following is not true, you may want to consider increasing the heap size:

  • Garbage collection should not be running more often than once every 15 seconds on average, with one collection per 30 seconds being a good number to aim for.

  • The GC throughput (the percentage of time spent running the application rather than doing garbage collection) should be above 98%.

  • The amount of the heap used (not just allocated) for old-generation data should not exceed 70% except for brief spikes and should be 50% or less in most cases.

By analyzing GC logs in gceasy.io, it is easy to evaluate these metrics.
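The three thresholds above can also be expressed as a small check. This is a minimal sketch with hypothetical function names; the input values would come from a GC log analysis, and the thresholds are the ones stated in this section.

```python
def gc_throughput_pct(total_runtime_s, total_gc_pause_s):
    """Percentage of time spent running the application rather than in GC."""
    return 100.0 * (total_runtime_s - total_gc_pause_s) / total_runtime_s


def heap_increase_signals(avg_seconds_between_gcs, throughput_pct, old_gen_used_pct):
    """Return the reasons, if any, to consider increasing the heap."""
    signals = []
    if avg_seconds_between_gcs < 15:
        signals.append("GC runs more often than once every 15 seconds")
    if throughput_pct <= 98:
        signals.append("GC throughput is not above 98%")
    if old_gen_used_pct > 70:
        signals.append("old-generation usage exceeds 70%")
    return signals


# Example: one hour of runtime with 90 seconds of total GC pause time.
throughput = gc_throughput_pct(3600, 90)
for reason in heap_increase_signals(12, throughput, 75):
    print(reason)
```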

Minimum and maximum heap sizes

Set your minimum heap size (-Xms) to at least 1/2 of your maximum heap size. Again, it is recommended that your maximum heap size not exceed 16GB (-Xmx16g).

Examples

  • -Xms8g: sets the initial (and minimum) heap size to 8GB

  • -Xmx16g: sets the maximum heap size to 16GB
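The sizing guidance in this section (maximum heap of at most 16GB and at most half of physical memory, minimum heap of at least half the maximum) can be checked with a short sketch like the one below. The function name and the whole-gigabyte inputs are assumptions made for illustration.

```python
MAX_RECOMMENDED_HEAP_GB = 16


def check_heap_settings(xms_gb, xmx_gb, physical_gb):
    """Return a list of deviations from the sizing guidance in this section."""
    problems = []
    if xmx_gb > MAX_RECOMMENDED_HEAP_GB:
        problems.append("maximum heap exceeds 16GB; consider scaling horizontally")
    if xmx_gb > physical_gb / 2:
        problems.append("maximum heap exceeds half of physical memory")
    if xms_gb < xmx_gb / 2:
        problems.append("minimum heap is less than half of maximum heap")
    return problems


# -Xms8g -Xmx16g on a machine with 32GB of physical memory.
print(check_heap_settings(8, 16, 32))
# -Xms1g -Xmx4g on a machine with 6GB of physical memory.
print(check_heap_settings(1, 4, 6))
```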

Garbage Collection Specifications

Currently it is recommended to use the G1 Garbage Collection algorithm.

G1 is superior to ParallelGC or CMS for a Jenkins JVM because it avoids long “stop the world” pauses, which can take many seconds in a large collection. It is also the default garbage collector in Java 9 and later.

Specify G1 GC as follows:

-XX:+UseG1GC
-XX:+UseStringDeduplication
-XX:+ParallelRefProcEnabled
-XX:+DisableExplicitGC
-XX:+UnlockDiagnosticVMOptions
-XX:+UnlockExperimentalVMOptions
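The service configuration file usually expects all Java arguments on a single line, so the flags above need to be joined together with the heap settings. The following sketch shows one way to build that line. The variable name JAVA_ARGS is an assumption for illustration; check the CloudBees article referenced earlier for the exact variable used by your package.

```python
# G1 flags exactly as listed above.
G1_FLAGS = [
    "-XX:+UseG1GC",
    "-XX:+UseStringDeduplication",
    "-XX:+ParallelRefProcEnabled",
    "-XX:+DisableExplicitGC",
    "-XX:+UnlockDiagnosticVMOptions",
    "-XX:+UnlockExperimentalVMOptions",
]


def jenkins_java_args(xms_gb, xmx_gb):
    """Join heap settings and the G1 flags into one argument string."""
    heap = [f"-Xms{xms_gb}g", f"-Xmx{xmx_gb}g"]
    return " ".join(heap + G1_FLAGS)


# JAVA_ARGS is an assumed variable name, not confirmed by this document.
print(f'JAVA_ARGS="{jenkins_java_args(4, 8)}"')
```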

Troubleshooting

Troubleshooting a performance issue requires the collection of data. Generally, this requires gathering several Java thread dumps, which are a way of finding out what every JVM thread is doing at a particular point in time. This is especially useful if the Java application seems to hang under load, as an analysis of the thread dumps will indicate where threads are stuck.

It is important that the data is collected while the performance issue is actively occurring on the JVM/Jenkins. If Jenkins is restarted before the data is collected, the active threads will be lost and it may be harder to pinpoint the issue.

There are two preferred methods of collecting data:

  • Generating a support bundle

  • Running the jenkinshangWithJstack.sh script

Generating a support bundle

This article explains how to generate a support bundle for CloudBees products.

Prerequisite:

Recommended Java Arguments are set per Recommended Jenkins JVM Options

jenkinshangWithJstack Script

The jenkinshangWithJstack script was developed to gather data for analysis while a performance issue is occurring on a Unix-like system. The script collects a thread dump via jstack and includes output from the top, vmstat, netstat, nfsiostat, nfsstat and iostat utilities.

Running the script

To run the script:

  1. Confirm that the tools to run the script are included in your $PATH:

    • iostat

    • netstat

    • nfsiostat

    • nfsstat

    • vmstat

    • top

  2. Download the jenkinshangWithJstack.sh script.

  3. Make the script executable with the command

    shell% chmod +x jenkinshangWithJstack.sh
  4. Retrieve the values of the variables $JENKINS_USER and $JENKINS_PID by using the echo command

  5. Make sure that the directory where the script is running has rw (read and write) permissions set on it

  6. As the same user that runs Jenkins, execute the script with the command

    shell% sudo -u $JENKINS_USER sh jenkinshangWithJstack.sh $JENKINS_PID 300 5

    Note that the “300” and “5” are arguments for “Length to run the script in seconds” and “Intervals to execute commands in seconds.”

    Tip
    It is highly recommended to practice executing the script in your environment to make sure everything works as expected. When performance issues arise, you will be able to execute the script right away rather than spending extra time to troubleshoot the script execution.

Data included in the script output

The following section contains more details on the information obtained by the script:

iostat

iostat (input/output statistics) is a system monitoring tool used to collect and show operating system storage input and output statistics. It is often used to identify performance issues with storage devices, including local disks, or remote disks accessed over network file systems such as NFS.

Example Output:

iostat output

jstack

The jstack command-line utility attaches to the specified process or core file and prints the stack traces of all threads that are attached to the virtual machine, including Java threads and VM internal threads, and optionally native stack frames. The utility also performs deadlock detection.

Example Output:

jstack output

netstat

Netstat is a command line TCP/IP networking utility available in most versions of Windows, Linux, UNIX and other operating systems. Netstat provides information and statistics about protocols in use and current TCP/IP network connections.

Example Output:

netstat output

nfsiostat

The sysstat family includes a utility called nfsiostat, that resembles iostat, but allows you to monitor the read and write usage on NFS mounted file systems.

Example Output:

nfsiostat output

nfsstat

nfsstat displays statistics kept about NFS client and server activity.

Example Output:

nfsstat output

top

The top command allows users to monitor processes and system resource usage on Linux. By combining a series of jstack and top (or, more likely, top -H) dumps, you can help identify Java operations that are consuming a lot of CPU time. Convert the IDs of threads that are using a lot of CPU in top to hexadecimal and look for them in the jstack dumps taken at about the same time. If the same thread ID appears as a high CPU user across multiple thread dumps, it is often the culprit. Be aware that these are all moment-in-time measurements, though: you may just happen to capture something that is briefly using a lot of CPU but is not a problem overall.

Example Output:

top output

top -H

The dash H (-H) flag starts top with all individual threads displayed. Otherwise, top displays a summation of all threads in a process.

Example Output:

top -H output

vmstat

vmstat (virtual memory statistics) is a computer system monitoring tool that collects and displays summary information about operating system memory, processes, interrupts, paging and block I/O.

Example Output:

vmstat output

Analyzing data collected by jenkinshangWithJstack.sh

The output of jenkinshangWithJstack.sh is compressed via tar and gzip. Once extracted, the output of the commands is viewable in a text editor and is grouped by folder:

Example:

JenkinsHangsWithJstack output

Understanding Thread Dumps

A web server uses tens to hundreds of threads to process a large number of concurrent users. If two or more threads utilize the same resources, a contention between the threads is inevitable, and sometimes deadlock occurs.

Thread contention is a status in which one thread is waiting for a lock, held by another thread, to be lifted. Different threads frequently access shared resources on a web application. For example, to record a log, the thread trying to record the log must obtain a lock and access the shared resources.

Deadlock is a special type of thread contention, in which two or more threads cannot progress at all because each holds a lock and is waiting on a lock the other thread already holds.

Different issues can arise from thread contention. To analyze such issues, you need to use a thread dump. A thread dump will give you the information on the exact status of each thread.
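A quick first look at a thread dump is often just a tally of thread states, since many BLOCKED threads hint at contention. The sketch below counts the java.lang.Thread.State lines in a dump. The sample text is made up for illustration and is not output from a real Jenkins instance.

```python
import re
from collections import Counter

# Matches the state line that jstack prints under each Java thread header.
STATE_RE = re.compile(r"java\.lang\.Thread\.State: (\w+)")


def count_thread_states(dump_text):
    """Count how many threads are in each state in a jstack dump."""
    return Counter(STATE_RE.findall(dump_text))


# Hypothetical, heavily shortened dump used only to exercise the function.
SAMPLE_DUMP = """\
"worker-1" #11 prio=5 os_prio=0 tid=0x01 nid=0x10 runnable [0x01]
   java.lang.Thread.State: RUNNABLE
"worker-2" #12 prio=5 os_prio=0 tid=0x02 nid=0x11 waiting for monitor entry [0x02]
   java.lang.Thread.State: BLOCKED (on object monitor)
"worker-3" #13 prio=5 os_prio=0 tid=0x03 nid=0x12 waiting for monitor entry [0x03]
   java.lang.Thread.State: BLOCKED (on object monitor)
"timer" #14 daemon prio=5 os_prio=0 tid=0x04 nid=0x13 waiting on condition [0x04]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
"""

for state, count in count_thread_states(SAMPLE_DUMP).most_common():
    print(state, count)
```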

Below is an example of a single thread entry from jstack output:

"Computer.threadPoolForRemoting [#25] for OperationsCenter2 connection from example.com/10.62.98.33:43856" #7992 daemon prio=5 os_prio=0 tid=0x00007fa7741bd000 nid=0x2c94 runnable [0x00007fa6d46da000]
   java.lang.Thread.State: RUNNABLE
    at org.mindrot.jbcrypt.BCrypt.key(BCrypt.java:556)
    at org.mindrot.jbcrypt.BCrypt.crypt_raw(BCrypt.java:629)
    at org.mindrot.jbcrypt.BCrypt.hashpw(BCrypt.java:692)
    at org.mindrot.jbcrypt.BCrypt.checkpw(BCrypt.java:763)
    at hudson.security.HudsonPrivateSecurityRealm$3.isPasswordValid(HudsonPrivateSecurityRealm.java:686)
    at hudson.security.HudsonPrivateSecurityRealm$4.isPasswordValid(HudsonPrivateSecurityRealm.java:707)
    at hudson.security.HudsonPrivateSecurityRealm$Details.isPasswordCorrect(HudsonPrivateSecurityRealm.java:517)
    at hudson.security.HudsonPrivateSecurityRealm.authenticate(HudsonPrivateSecurityRealm.java:184)
    at sun.reflect.GeneratedMethodAccessor351.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)

In this example, the following can be gathered:

Thread Name: "Computer.threadPoolForRemoting [#25] for OperationsCenter2 connection from example.com/10.62.98.33:43856"

  • When a thread is created with the java.lang.Thread class without an explicit name, it is named Thread-(number), whereas a thread created by the default java.util.concurrent thread factory is named pool-(number)-thread-(number).

Thread Type: daemon

  • Java threads can be divided into two:

    1. daemon threads;

    2. non-daemon threads.

Thread Priority: prio=5 os_prio=0

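Thread address: tid=0x00007fa7741bd000

  • This is the address of the JVM's internal structure for the thread, not the Java thread ID. The ID returned by java.lang.Thread.getId(), usually an auto-incrementing long 1..n, is the #7992 value printed after the thread name.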

Native Thread ID: nid=0x2c94

  • This is hexadecimal. If you convert it to a base 10 number, you can often find the thread in a ‘top’ dump taken at the same time.

  • This is crucial information: the native thread ID lets you correlate JVM threads with operating system threads, for example to see which threads are using the most CPU within your JVM.

Thread State and Detail: runnable [0x00007fa6d46da000]

  • Allows quick identification of Thread state and its potential current blocking condition

  • Thread States:

    • NEW: The thread has been created but has not been started yet.

    • RUNNABLE: The thread is executing in the JVM and processing a task. (It may still be waiting for operating system resources such as processor time.)

    • BLOCKED: The thread is waiting for a different thread to release its lock in order to get the monitor lock.

    • WAITING: The thread is waiting indefinitely by using a wait, join, or park method.

    • TIMED_WAITING: The thread is waiting for a limited time by using a sleep, wait, join or park method with a timeout.

Thread Stack Trace:

    at org.mindrot.jbcrypt.BCrypt.key(BCrypt.java:556)
    at org.mindrot.jbcrypt.BCrypt.crypt_raw(BCrypt.java:629)
    at org.mindrot.jbcrypt.BCrypt.hashpw(BCrypt.java:692)
    at org.mindrot.jbcrypt.BCrypt.checkpw(BCrypt.java:763)
    at hudson.security.HudsonPrivateSecurityRealm$3.isPasswordValid(HudsonPrivateSecurityRealm.java:686....

Each line in the stack trace identifies a method together with the source file and line number being executed. For example, the following frame points to line 686 of HudsonPrivateSecurityRealm.java:

hudson.security.HudsonPrivateSecurityRealm$3.isPasswordValid(HudsonPrivateSecurityRealm.java:686)
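The fields of the thread header line described above can be pulled apart with a regular expression. This is a minimal sketch that assumes the header layout shown in the example (the style produced by Java 8); other JVM versions may print additional fields, and lines that do not fit the pattern simply return None. The function and group names are hypothetical.

```python
import re

# Pattern for a header line in the style of the example above.
HEADER_RE = re.compile(
    r'^"(?P<name>.*)" #(?P<number>\d+) (?P<daemon>daemon )?'
    r"prio=(?P<prio>\d+) os_prio=(?P<os_prio>-?\d+) "
    r"tid=(?P<tid>0x[0-9a-fA-F]+) nid=(?P<nid>0x[0-9a-fA-F]+) "
    r"(?P<state>.*?) \[(?P<addr>0x[0-9a-fA-F]+)\]$"
)


def parse_thread_header(line):
    """Split a jstack thread header into its fields, or return None."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["daemon"] = fields["daemon"] is not None
    # Decimal form of nid, as shown in the PID column of top -H.
    fields["nid_decimal"] = int(fields["nid"], 16)
    return fields


HEADER = (
    '"Computer.threadPoolForRemoting [#25] for OperationsCenter2 connection '
    'from example.com/10.62.98.33:43856" #7992 daemon prio=5 os_prio=0 '
    "tid=0x00007fa7741bd000 nid=0x2c94 runnable [0x00007fa6d46da000]"
)
info = parse_thread_header(HEADER)
print(info["nid"], "->", info["nid_decimal"], info["state"])
```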

Identifying and fixing performance issues

Typically, the term “performance issue” can be traced back to one of several different arenas.

CPU/Memory Issues - these may be related:

Note that the issues are ordered from highest to lowest severity and that the previous issue can cause the symptoms of the subsequent issues to occur. For example, an Out of Memory Error can cause high CPU, a hang, and/or poor GC performance. Also note that if you are experiencing an Out of Memory Error along with high CPU & poor GC performance, the Out of Memory Error takes precedence. Once the Out of Memory Error is resolved it also resolves the other symptoms in most cases.

Out of Memory Error

It is critical to determine the type of Out of Memory error since each type requires different steps for debugging and resolution. For example, reviewing a heap dump should be done after confirming that a Java heap Out of Memory Error occurred via the log files or verbose GC data. This is because the JVM will generate a heap dump for any Out of Memory Error that occurs if the Java Argument -XX:+HeapDumpOnOutOfMemoryError is set. A Java heap dump also shows valid memory usage, so it is best to review the file generated when the Out of Memory Error is thrown if possible.

The following are common Out Of Memory Errors:

java.lang.OutOfMemoryError: unable to create new native thread: The Java application has hit the limit of how many threads it can launch. This can typically be resolved by increasing a ulimit setting. This error is often encountered because the user has created too many processes. The number and kind of running processes should be reviewed using top and ps.

java.lang.OutOfMemoryError: Java heap space: This is the Java Virtual Machine’s way to announce that there is no more room in the virtual machine heap area. You are trying to create a new object, but the amount of memory this newly created structure is about to consume is more than the JVM has in the heap. The fastest way to mitigate this error is to increase the heap via the -Xmx parameter, but keep in mind that increasing the heap space may only fix it temporarily and there may be underlying issues that need to be investigated further.

java.lang.OutOfMemoryError: GC overhead limit exceeded: By default the JVM is configured to throw this error if you are spending more than 98% of the total time in GC and less than 2% of the heap is recovered after the GC. Changes to GC settings will mitigate this error, but keep in mind there may be underlying issues that need to be investigated further.

Useful files for Out of Memory Errors:

  • about.md from support bundle (shows Java Virtual Machine (JVM) configuration)

  • verbose GC data (useful for Java heap, metaspace, GC overhead limit exceeded Out of Memory Errors)

  • heap dump (useful for Java heap Out of Memory Errors)

  • heap histogram (useful for Java heap Out of Memory Errors)

  • limits.txt file from support bundle (useful for Java native Out of Memory Errors)

  • nodes.md for agent JVM configuration (if the issue occurs there)

  • hs_err_pidXXXX.log file (useful for native Out of Memory Errors)
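Since each type of Out of Memory Error calls for different files, a small lookup can make triage faster. The sketch below classifies an error message and returns the files from the list above that apply to it. The category names and function names are hypothetical; the file associations are the ones given in this section.

```python
# Substrings of the error message mapped to a hypothetical category name.
OOM_PATTERNS = [
    ("java heap space", "java_heap"),
    ("gc overhead limit exceeded", "gc_overhead"),
    ("metaspace", "metaspace"),
    ("unable to create new native thread", "native"),
]

# about.md shows the JVM configuration, so it is useful in every case.
COMMON_FILES = ["about.md from support bundle"]

FILES_BY_TYPE = {
    "java_heap": ["verbose GC data", "heap dump", "heap histogram"],
    "gc_overhead": ["verbose GC data"],
    "metaspace": ["verbose GC data"],
    "native": ["limits.txt file from support bundle", "hs_err_pidXXXX.log file"],
}


def classify_oom(message):
    """Return the category of an OutOfMemoryError message, or 'unknown'."""
    text = message.lower()
    for needle, kind in OOM_PATTERNS:
        if needle in text:
            return kind
    return "unknown"


def files_to_collect(message):
    """Return the files worth gathering for the given error message."""
    return COMMON_FILES + FILES_BY_TYPE.get(classify_oom(message), [])


print(files_to_collect("java.lang.OutOfMemoryError: Java heap space"))
```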

Note that increasing the heap space will most likely resolve an Out of Memory Error only temporarily and may mask the root cause. Analyzing a thread dump is therefore always a best practice for finding larger issues, often associated with plugins, that still need to be addressed.

High CPU Usage

Tip
Low Java heap space will also cause high CPU usage, as CPU will be doing Garbage Collections constantly to free up memory. Before considering the issue as high CPU, please review “Out of Memory Error” above and make sure the issue is not caused by low memory.

Most of the time, servers only use a small fraction of their CPU power. When Jenkins is performing tasks, CPU usage will rise and/or spike temporarily. Once the CPU intensive process completes, the CPU usage should once again drop down to a lower level.

However, if you are receiving high CPU alerts and/or are experiencing application performance degradation, this may be because a Jenkins process is stuck in an infinite loop or the service has encountered an unexpected error. If your server is using close to 100% of the CPU, it will constantly have to free up processing power for different processes, which will slow the server down and may render the application unreachable.

In order to reduce the CPU usage, you first need to determine what processes are taxing your CPU. The best way of diagnosing this is by executing the jenkinshangWithJstack.sh script while the CPU usage is high, as it will deliver the outputs of top and top -H while the issue is occurring so you can see which threads are consuming the most CPU.

In the following example scenario, we have reports that the Jenkins UI has become unresponsive, and have subsequently run the jenkinshangWithJstack.sh script during this time to gather data. In the output of top we see that the JVM is consuming a high amount of CPU:

unresponsive jenkins ui top result

Additionally, we see in the output of “top -H” that there are a couple of “Hot Threads”, or threads that are consuming a large amount of CPU:

unresponsive jenkins ui top -H result

By obtaining the PIDs of these threads, we can convert each PID from decimal to hexadecimal (for example, with an online converter). The result is the native thread ID (nid) that we can then search for in the output of jstack.

Example Converting “15801” to Hexadecimal Value of “3DB9”:

binary hex conversion of 15801 to 3DB9

Using the new hexadecimal value, we can now search our jstack output for “3DB9” and find the following stack trace:

"Computer.threadPoolForRemoting [#25] for OperationsCenter2 connection from test/10.62.98.33:43856" #7992 daemon prio=5 os_prio=0 tid=0x00007fa7741bd000 nid=0x3DB9 runnable [0x00007fa6d46da000]
   java.lang.Thread.State: RUNNABLE
    at org.mindrot.jbcrypt.BCrypt.key(BCrypt.java:556)
    at org.mindrot.jbcrypt.BCrypt.crypt_raw(BCrypt.java:629)
    at org.mindrot.jbcrypt.BCrypt.hashpw(BCrypt.java:692)
    at org.mindrot.jbcrypt.BCrypt.checkpw(BCrypt.java:763)
    at hudson.security.HudsonPrivateSecurityRealm$3.isPasswordValid(HudsonPrivateSecurityRealm.java:686)
    at hudson.security.HudsonPrivateSecurityRealm$4.isPasswordValid(HudsonPrivateSecurityRealm.java:707)
    at hudson.security.HudsonPrivateSecurityRealm$Details.isPasswordCorrect(HudsonPrivateSecurityRealm.java:517)
    at hudson.security.HudsonPrivateSecurityRealm.authenticate(HudsonPrivateSecurityRealm.java:184)
    at sun.reflect.GeneratedMethodAccessor351.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)

The above findings show that BCrypt is causing the CPU usage to spike which implies that it is a CPU intensive operation. In this example, a bug report was created showing the information gathered and a fix was then implemented in a future release of Jenkins.
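The conversion and search steps in this example can also be done without an online converter. The sketch below is a hypothetical helper that turns a decimal thread PID from top -H into the hexadecimal nid form and finds the matching header lines in a dump. The match ignores case, because the hexadecimal digits may appear in either upper or lower case.

```python
import re


def find_thread_headers(dump_text, thread_pid):
    """Return the dump lines whose nid matches the decimal thread PID."""
    # Format the PID as hexadecimal and match it as a whole token.
    pattern = re.compile(rf"\bnid=0x{thread_pid:x}\b", re.IGNORECASE)
    return [line for line in dump_text.splitlines() if pattern.search(line)]


print(f"{15801:x}")

# Shortened header lines; the second one is made up for contrast.
DUMP = (
    '"Computer.threadPoolForRemoting [#25]" #7992 daemon prio=5 os_prio=0 '
    "tid=0x00007fa7741bd000 nid=0x3DB9 runnable [0x00007fa6d46da000]\n"
    '"other-thread" #7993 daemon prio=5 os_prio=0 '
    "tid=0x00007fa7741be000 nid=0x3db90 runnable [0x00007fa6d46db000]\n"
)
for line in find_thread_headers(DUMP, 15801):
    print(line)
```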

Fastthread.io

Once you have the output of jstack, for faster diagnosis you can use fastthread.io to analyze the thread dump:

Fastthread.io

Once uploaded, fastthread.io will provide an analysis, and typically a prescriptive action can be determined. Example:

Fastthread.io intelligence report

Pipeline Hang

Oftentimes the root cause of performance issues can be traced back to Pipelines that are not constructed properly. If you are experiencing a hanging Pipeline, it is recommended to troubleshoot the issue by providing a Support Bundle to the CloudBees Support Team for a review of the Pipeline-Thread-Dump.txt file included with the Support Bundle.

UI Hang/Unreachable UI

As we have discussed above, an unresponsive UI is typically a symptom of high CPU utilization and memory consumption. However, there are times when other factors can come into play such as networking latency or bad plugins. The following best practices should be observed when faced with an unresponsive or lagging UI.

Confirm Browser Functionality

Using Firefox or Google Chrome, ensure that your web browser is up-to-date, clear your cache and cookies, and enable Developer Tools to ensure a proxy or firewall is not blocking your connection to Jenkins. The example below shows a stable network connection returning 200 status when attempting to reach the Jenkins instance:

Operations Center showing 200 error

The example below shows a 404 status stating the server could not be reached. This would dictate further investigation as to why the server is unreachable. One way to investigate would be to provide a HAR file to CloudBees Support. It is advised to obtain the output of jenkinshangWithJstack.sh during this time to determine if high CPU, GC, or Out of Memory errors are present. Additionally, if you see requests with long response times, you may be experiencing a network-related issue. It is also advised to look over the slow response data of the Support Bundle to troubleshoot further.

Operations Center showing 404 error

Poor GC Performance / GC Thrashing

Very often, an Out Of Memory error is caused by incorrect GC settings or by GC “thrashing”. “Thrashing” refers to frequent garbage collection due to failure to allocate memory for an object, insufficient free memory, or insufficient contiguous free memory caused by memory fragmentation.

It’s important to remember that many GC events are “Stop The World” events. Therefore, ideally you want to limit these events to less than a couple of seconds to ensure that your application does not become unresponsive for your end-users. Very long GC pauses may also result in 503 errors and other instabilities. In extreme cases, very long GC pauses may make it appear that build agents have disconnected or become unresponsive. If the JVM’s garbage collection is not performing as desired, you will need to benchmark Jenkins and tune the JVM to reach an appropriate set of GC parameters.
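For a quick check before a full analysis, long pauses can be picked out of a GC log with a few lines of code. The sketch below assumes that each pause is reported on a line ending in a duration followed by "secs]", as in Java 8 style logs. The sample lines are illustrative only, the exact format depends on the JVM version and logging flags, and the 2 second limit is an assumed reading of "a couple of seconds".

```python
import re

# Assumed format: the pause duration appears as "<seconds> secs]".
PAUSE_RE = re.compile(r"(\d+\.\d+) secs\]")


def long_pauses(log_lines, limit_seconds=2.0):
    """Return (seconds, line) pairs for pauses longer than the limit."""
    found = []
    for line in log_lines:
        match = PAUSE_RE.search(line)
        if match and float(match.group(1)) > limit_seconds:
            found.append((float(match.group(1)), line.strip()))
    return found


# Illustrative lines only, not taken from a real Jenkins instance.
LOG_SAMPLE = [
    "12.345: [GC pause (G1 Evacuation Pause) (young), 0.0456789 secs]",
    "283.678: [Full GC (Allocation Failure)  3890M->3712M(4096M), 6.2345678 secs]",
    "290.001: [GC pause (G1 Evacuation Pause) (mixed), 0.1200000 secs]",
]
for seconds, line in long_pauses(LOG_SAMPLE):
    print(f"{seconds:.2f}s pause: {line}")
```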

The recommended tuning approach to these parameters is to evaluate the GC logs using GCEasy.io which is an online GC log analyzer.

Analysis Using GCEasy

It’s important to ensure that the GC logs are captured as part of your support bundle created from Jenkins. This is a selectable checkbox in the CloudBees Support section of Jenkins:

GC log collection

GCEasy.io

Once you have collected the GC logs, you can visit gceasy.io to upload the logs to the analyzer:

gceasy.io

Once the log is analyzed, you will be presented with a report:

gceasy.io report

Most GC issues can be resolved by examining the reports generated by GCEasy.io and making adjustments accordingly.

FAQ

Q: I found a number of “concurrent humongous allocation” error messages in my garbage collection log. What do I do?

Q: I saw this article that recommended GC setting “X” for better performance… should I use it?

A: Probably not, unless you see poor GC performance with the suggested settings. Jenkins has an unusual pattern of memory use, so a lot of common expert tuning options do not apply to it. The recommended settings above are used at many Fortune 500 companies with only minor customizations. However, if you do wish to tinker, please make sure to gather GC logs under real-world use before and after applying the settings so you can compare.

  • Recommended customization options people may explore, beyond max/min heap size: MetaspaceSize, region size, and on very large systems HugePages (requires OS-level support).

Q: Should I set minimum heap and maximum heap to the same value?

A: You certainly don’t have to, but this gives some modest performance benefits for Jenkins: resizing the heap can be somewhat expensive and in some specific cases may trigger higher SYS CPU use. It also ensures more consistent performance when the level of load varies widely, because heuristics won’t be able to shrink heap sizes in response to quieter periods.

Q: Should I use Concurrent Mark Sweep (CMS) GC for heaps from 2-4 GB?

A: No. Use G1 GC. CMS requires significant expert tuning to perform well for Jenkins and has several failure modes (for example, heap fragmentation) that can cause serious stability problems for Jenkins. We have seen big issues with some customers trying to use CMS GC.