CloudBees Jenkins JVM troubleshooting

This document is intended as a reference for tuning and troubleshooting the Java Virtual Machine within which Jenkins runs, using current best practices as well as a troubleshooting guide for performance issues for Jenkins Administrators.

Background

Jenkins is a self-contained Java-based program, ready to run out-of-the-box, with packages for Windows, Mac OS X and other Unix-like operating systems. The Jenkins application runs inside a Java Virtual Machine (JVM).

JVMs have two primary functions:

To allow Java programs to run on any device or operating system (known as the "Write once, run anywhere" principle)
To manage and optimize program memory.

There are several aspects to tuning a JVM for Jenkins:

A supported version of Java
Minimum JVM specifications
Recommended Jenkins JVM specifications

Supported Java versions

JVMs manage program resources during execution based on configurable settings, and it’s important to ensure that you’re running a supported version of Java.

Recommended specifications for Jenkins

Heap size

Heap sizes are calculated based on the amount of memory on the machine unless the initial and maximum heap sizes are specified on the command line.

	Default JVM Specifications (Out-of-the-box)	Recommended JVM Specifications for Jenkins
Initial Heap Size	1/64th of physical memory up to 1GB	A minimum of 2GB for the operations center and a minimum of 4GB for controllers in a production instance
Maximum Heap Size	1/4 of physical memory up to 1 GB	Maximum of 16GB (Anything larger should be scaled horizontally)
Garbage Collection	ParallelGC	G1GC

Default JVM Specifications (Out-of-the-box)

Recommended JVM Specifications for Jenkins

Initial Heap Size

1/64th of physical memory up to 1GB

A minimum of 2GB for the operations center and a minimum of 4GB for controllers in a production instance

Maximum Heap Size

1/4 of physical memory up to 1 GB

Maximum of 16GB (Anything larger should be scaled horizontally)

Garbage Collection

ParallelGC

G1GC

You can specify the initial and maximum heap sizes using the flags -Xms (initial heap size) and -Xmx (maximum heap size). See Recommended Jenkins JVM options. It is recommended to define the same value for both -Xms and -Xmx so that the memory is allocated on startup rather than runtime.

Garbage collection

In many traditional programming languages, program memory was managed by the programmer. Java uses a process called garbage collection to manage program memory. Garbage collection happens inside a running JVM and continuously identifies and eliminates unused objects in memory from Java programs. As of Java 8, there are four different types of garbage collection algorithms:

Serial collector

Designed for single-threaded environments
Freezes all application threads whenever it’s working
Does not make effective use of multiple processors so not a good fit for most Jenkins instances.

Parallel/Throughput collector

Designed to minimize the amount of CPU time spent for garbage collection (maximum throughput) at the cost of sometimes long application pauses.
JVM default collector in Java 8 (In later versions of Java G1 is default)
Uses multiple threads to scan through and compact the heap
Stops application threads when performing either a minor or full garbage collection
Best suited for apps that can tolerate application pauses and are trying to optimize for lower CPU overhead

CMS collector

concurrent-mark-sweep
Uses multiple threads (“concurrent”) to scan through the heap (“mark”) for unused objects that can be recycled (“sweep”)
Does not perform compaction of older objects—eventually the heap will become fragmented and have to do a slow Full GC.
Requires a lot of tuning to use with Jenkins, and tends to be more “brittle” than other collectors — if not carefully managed, it can cause serious problems. Recommended to not be used.

G1 collector

The Garbage first collector (G1) aims to minimize GC pause times and adjust itself automatically rather than requiring specific tuning.
Uses parallel threads to collect young objects (stop the world pause), and collects older objects mostly without interrupting the application (concurrent GC).
Does not aim to collect all garbage at once — most collection cycles just remove young objects, and when older objects are collected they are usually done a part at a time.
Introduced in JDK 7 update 4
Designed to better support heaps larger than 4GB.
Utilizes multiple background threads to scan through the heap, and then it divides the heap into regions ranging from 1MB to 32MB (depending on the size of your heap)
Geared towards scanning those regions that contain the most garbage objects first (thus Garbage first” name).
Uses a bit of extra memory to support garbage collection.

Recommended Jenkins JVM options

Recommended specifications for Jenkins describes the recommended minimum settings you need for a JVM, and gave some context for the defaults and garbage collection choices.

This section covers how to configure the Jenkins JVM according to current best practices (as of January 2020) and how to configure it to use a garbage-collection strategy that aligns with your requirements.

The Jenkins service configuration file

If you installed Jenkins using system packages, the Jenkins JVM is controlled by a service configuration file, to which you can add Java arguments. The location of this file varies by product.

CloudBees Jenkins Platform client controller

For the CloudBees Jenkins Platform client controller , the service configuration file is located in:

Linux distributions: /etc/default/jenkins
RedHat/CentOS: /etc/sysconfig/jenkins
Windows: %ProgramFiles%\Jenkins\jenkins.xml

Operations center

For the operations center, you can find the service configuration in:

Linux distributions: /etc/default/jenkins-oc
RedHat/CentOS: /etc/sysconfig/jenkins-oc
Windows: %ProgramFiles%\Jenkins-OC\jenkins.xml

Adding Java arguments to the Jenkins service configuration file

This CloudBees KB Article explains how to add Java arguments to the service configuration file.

Refer to the sections below for Supported Java 8 arguments and Supported Java 11 and 17 arguments. Using unsupported Java arguments will result in startup failure.

CloudBees strongly recommends that you upgrade to Java 17 as soon as possible for the best experience. CloudBees no longer supports Java 8 and in late 2024, support for Java 11 will be discontinued and you will be unable to run new versions of the product on Java 11. Releases that occurred during the support window for Java 11 will continue to run on Java 11.

Supported Java 8 arguments

While there are many additional JVM arguments, the following arguments are recommended to follow Best Practices for Java 8 users:

-XX:+AlwaysPreTouch: pre-zeroes memory mapped pages on JVM startup – improves runtime performance, especially at startup and with large heaps.

-XX:+HeapDumpOnOutOfMemoryError: tells the JVM to automatically generate a heap dump file when a heap memory allocation can’t be satisfied. If you experience an OutOfMemoryError, you can provide the heap dump file to CloudBees and we can analyze it to determine if there is a memory leak or some other problem in Jenkins.

-XX:HeapDumpPath=: allows you to specify the directory path where the heap dump file should be written. Please ensure that the directory path specified can be written to by the Jenkins process and has adequate space to hold the heap dump files.

-verbose:gc: enables verbose logging of the garbage collector.

-Xloggc:${LOGDIR}/gc-%t.log: specifies where the verbose GC data is written to. The %t in the gc log file path will cause the JVM to generate a new file with each JVM restart with a timestamp.

-XX:NumberOfGCLogFiles=2 -XX:GCLogFileSize=100m -XX:+UseGCLogFileRotation: limits the number of files, their size, and rotates the GC log respectively.

-XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCCause -XX:+PrintTenuringDistribution -XX:+PrintReferenceGC -XX:+PrintAdaptiveSizePolicy: these parameters gather info on object age and reference GC time for further tuning.

-XX:+PrintHeapAtGC: will generate a large amount of logging to the specified GC log file. Please ensure there is adequate file system space when enabling this option. All the information produced by this option is known to the JVM; therefore, there is no extra processing required as a result of enabling this parameter. However, there will be a slight amount of additional I/O overhead which should amount to less than 3%.

-XX:+LogVMOutput (requires -XX:+UnlockDiagnosticVMOptions): Logs all the VM output (like PrintCompilation) to the default hotspot.log file in current directory.

-XX:LogFile= (requires -XX:+UnlockDiagnosticVMOptions): Specifies the path and name for hotspot.log Linux: -XX:LogFile=/var/log/jenkins/jvm.log Windows: -XX:LogFile="%ProgramFiles%\Jenkins\jvm.log"

-XX:+UseG1GC: enables the G1 Garbage Collector.

-XX:+UseStringDeduplication: looks for the strings with the same contents and deduplicates them. Can significantly reduce memory used for Jenkins and may improve performance

-XX:+ParallelRefProcEnabled: enables parallelize reference processing, reducing young and old GC times.

-XX:+DisableExplicitGC: disables the system.gc() method called often used by third party plugins to explicity invoke the garbage collector. Disabling has proven to improve performance.

-XX:+UnlockExperimentalVMOptions: allows the values of experimental flags to be changed by unlocking them.

-XX:+UnlockDiagnosticVMOptions: allows the values of diagnostic flags to be changed by unlocking them.

Supported Java 11 and 17 arguments

Java 11 brought many changes to garbage collection configuration. The following arguments are not supported for JDK 11 or JDK 17:

-Xloggc:${LOGDIR}/gc.log
-XX:NumberOfGCLogFiles=2
-XX:+UseGCLogFileRotation
-XX:GCLogFileSize=100m
-XX:+PrintGC
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintHeapAtGC
-XX:+PrintGCCause
-XX:+PrintTenuringDistribution
-XX:+PrintReferenceGC
-XX:+PrintAdaptiveSizePolicy

Further, -verbose:gc is now superseded by Xlog:gc.

Using unsupported Java arguments could cause startup failures. For Java 11 or Java 17 users, CloudBees recommends that you update your garbage collection configuration to use the following argument, as a best practice:

-Xlog:gc*=info,gc+heap=debug,gc+ref*=debug,gc+ergo*=trace,gc+age*=trace:file=${LOGDIR}/gc-%t.log:utctime,pid,level,tags:filecount=2,filesize=100M

Memory Specifications

Default Maximum heap size “out-of-the-box” with Java is 1/4 of your physical memory. It’s not recommended to exceed ½ of your physical memory, noting that 16GB heaps are the largest supported. For small instances, you always want at least 1GB of physical memory left empty for Metaspace, off-heap Java use, and operating system use.

Minimum physical memory

We generically suggest starting Jenkins with at least 2 GB of heap (-Xmx2g), and a minimum physical memory of 4GB. Smaller instances that only serve a small team can use less, however, we do not suggest running Jenkins with less than 1GB of heap and 2 GB of total system memory, because this requires careful tuning to avoid running out of memory. It is important to plan for Jenkins heap to be increased as usage increases, and scale horizontally when the recommended maximum heap size of 16GB is reached.

Scaling memory in production

For a production environment with considerable workload, a good starting point would be -Xmx4g, which dictates a minimum physical memory of 8GB.

In the event your instance demands more than -Xmx16g, try scaling horizontally by adding more controllers using operations center, which divides the workload more efficiently. Dividing the workload at this threshold also minimizes a single point of failure. At 24 GB, this becomes extremely important because managing a controller at this scale requires a specialized skill-set and incurs extra administrative problems. Besides reduced efficiency, it concentrates so much responsibility on one system that it is hard to schedule routine maintenance and harder to guarantee system stability.

How do I know if I need to increase heap?

Traditionally, JVM administrators measure CPU, RAM, and DISK I/O as macro metrics for the performance of the application. In addition to this, measuring micro metrics can assist in properly forecasting for growth. Micro metrics can be measured by analysing garbage collection logs, and thread dumps. * Thread count should be under 700. * The GC throughput (the percentage of time spent running the application rather than doing garbage collection) should be above 98%. * The amount of the heap used (not just allocated) for old-generation data should not exceed 70% except for brief spikes and should be 50% or less in most cases.

By analyzing GC logs in products such as gceasy.io, it is easy to evaluate these metrics.

Minimum and maximum heap sizes

Set your minimum heap size (-Xms) to the same value as your maximum heap size. Again, it is recommended that your maximum heap size not exceed 16GB (-Xmx16g). Setting these values ensures the memory needed for your application is allocated at startup rather than runtime.

Examples

-Xms8g: sets the initial (and minimum) heap size to 8GB -Xmx8g: sets the maximum heap size to 8g

Garbage Collection Specifications

Currently it is recommended to use the G1 Garbage Collection algorithm.

G1 is superior to ParallelGC or CMS for a Jenkins JVM. This is because G1 avoids long “stop the world” pauses which can take many seconds in a large collection. Jenkins is an application that relies on low latency, so the shorter the GC pauses, the better. It is also the default Garbage Collector in Java 9.

Specify G1 GC as follows:

-XX:+UseG1GC▼

Troubleshooting

Troubleshooting a performance issue requires the collection of data. Generally, this requires gathering several Java thread dumps, which are a way of finding out what every JVM thread is doing at a particular point in time. This is especially useful if the Java application seems to hang under load, as an analysis of the thread dumps will indicate where threads are stuck.

It is important that the data is collected while the performance issue is actively occurring on the JVM/Jenkins. If Jenkins is restarted before the data is collected, the active threads will be lost and it may be harder to pinpoint the issue.

There are two recommended methods of collecting data:

Generating a support bundle
collectPerformanceData Script

Generating a support bundle

Generating a support bundle explains how to generate a support bundle for CloudBees products.

Prerequisite

Recommended Java Arguments are set per Recommended specifications for Jenkins.

collectPerformanceData Script

The collectPerformanceData script has been developed to gather data for analysis during a performance issue that occurs when running on a Unix-like system. The script collects a thread dump via jstack and includes output from the top, vmstat, netstat, nfsiostat, nfsstat and iostat binary utilities.

Running the script

To run the script:

Confirm that the binary tools to run the script are included in your $PATH:
- iostat
- netstat
- nfsiostat
- nfsstat
- vmstat
- top
Download the collectPerformanceData.sh script.

Make the script executable with the command

shell% chmod +x collectPerformanceData.sh▼

Retrieve the values of the variables $JENKINS_USER and $JENKINS_PID by using the echo command
Make sure that the directory where the script is running has rw (read and write) permissions set on it

As the same user that runs Jenkins, execute the script with the command

shell% sudo -u $JENKINS_USER sh collectPerformanceData.sh $JENKINS_PID 300 5▼

Note that the “300” and “5” are configurable arguments for “Length to run the script in seconds” and “Intervals to execute commands in seconds.”

It is highly recommended to practice executing the script in your environment to make sure everything works as expected. When performance issues arise, you will be able to execute the script right away rather than spending extra time to troubleshoot the script execution.

Data included in the script output

The following section contains more details on the information obtained by the script:

iostat

(input/output statistics) is a computer system monitor tool used to collect and show operating system storage input and output statistics. It is often used to identify performance issues with storage devices, including local disks, or remote disks accessed over network file systems such as NFS.

Example Output:

jstack

The jstack command-line utility attaches to the specified process or core file and prints the stack traces of all threads that are attached to the virtual machine, including Java threads and VM internal threads, and optionally native stack frames. The utility also performs deadlock detection.

Example Output:

netstat

Netstat is a command line TCP/IP networking utility available in most versions of Windows, Linux, UNIX and other operating systems. Netstat provides information and statistics about protocols in use and current TCP/IP network connections.

Example Output:

nfsiostat

The sysstat family includes a utility called nfsiostat, that resembles iostat, but allows you to monitor the read and write usage on NFS mounted file systems.

Example Output:

nfsstat

nfsstat displays statistics kept about NFS client and server activity.

Example Output:

top

The top command allows users to monitor processes and system resource usage on Linux. By combining a series of JStack and top (or, more likely, top -h) dumps, you can help identify Java operations that are consuming a lot of CPU time. Convert thread IDs that are using a lot of CPU in “top” to hexadecimal and look for them in the jstack dumps taken at about the same time. If you see the same thread id appearing as a high CPU user across multiple thread dumps, that is often a culprit high CPU user. Be aware that these are all moment-in-time measurements though — you may just happen to capture something that is briefly using a lot of CPU but which is overall not a problem.

Example Output:

top -H

The dash H (-H) flag starts top with all individual threads displayed. Otherwise, top displays a summation of all threads in a process.

Example Output:

vmstat

vmstat (virtual memory statistics) is a computer system monitoring tool that collects and displays summary information about operating system memory, processes, interrupts, paging and block I/O.

Example Output:

Analyzing data collected by JenkinsHangsWithJstack

The output of collectPerformanceData.sh will be compressed via tar and gzip. Once extracted, the output of the commands are viewable in a text editor, and are grouped by folder:

Example:

Understanding Thread Dumps

A web server uses tens to hundreds of threads to process a large number of concurrent users. If two or more threads utilize the same resources, a contention between the threads is inevitable, and sometimes deadlock occurs.

Thread contention is a status in which one thread is waiting for a lock, held by another thread, to be lifted. Different threads frequently access shared resources on a web application. For example, to record a log, the thread trying to record the log must obtain a lock and access the shared resources.

Deadlock is a special type of thread contention, in which two or more threads are cannot progress at all because each holds a lock and is waiting on a lock the other thread already holds.

Different issues can arise from thread contention. To analyze such issues, you need to use a thread dump. A thread dump will give you the information on the exact status of each thread.

Below, you will find the aforementioned example using jstack:

"Computer.threadPoolForRemoting [#25] for OperationsCenter2 connection from example.com/10.62.98.33:43856" #7992 daemon prio=5 os_prio=0 tid=0x00007fa7741bd000 nid=0x2c94 runnable [0x00007fa6d46da000]
   java.lang.Thread.State: RUNNABLE
    at org.mindrot.jbcrypt.BCrypt.key(BCrypt.java:556)
    at org.mindrot.jbcrypt.BCrypt.crypt_raw(BCrypt.java:629)
    at org.mindrot.jbcrypt.BCrypt.hashpw(BCrypt.java:692)
    at org.mindrot.jbcrypt.BCrypt.checkpw(BCrypt.java:763)
    at hudson.security.HudsonPrivateSecurityRealm$3.isPasswordValid(HudsonPrivateSecurityRealm.java:686)
    at hudson.security.HudsonPrivateSecurityRealm$4.isPasswordValid(HudsonPrivateSecurityRealm.java:707)
    at hudson.security.HudsonPrivateSecurityRealm$Details.isPasswordCorrect(HudsonPrivateSecurityRealm.java:517)
    at hudson.security.HudsonPrivateSecurityRealm.authenticate(HudsonPrivateSecurityRealm.java:184)
    at sun.reflect.GeneratedMethodAccessor351.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)▼

In this example, the following can be gathered:

Thread Name: "Computer.threadPoolForRemoting [#25] for OperationsCenter2 connection from cbj.example.com/10.62.98.33:43856"

When using the java.lang.Thread class to generate a thread, it will be named ThreadName-(Number), whereas when using the java.util.concurrent.ThreadFactory class, it will be named pool-(number)-thread-(number).

Thread Type: daemon

Java threads can be divided into two:
1. daemon threads;
2. non-daemon threads.

Thread Priority: prio=5 os_prio=0

Java Thread ID: tid=0x00007fa7741bd000

This is the Java Thread Id obtained via java.lang.Thread.getId() and usually implemented as an auto-incrementing long 1..n

Native Thread ID: nid=0x2c94

This is hexadecimal. If you convert it to a base 10 number, you can often find the thread in a top capture taken at the same time.
Crucial information as this native Thread Id allows you to correlate for example which Threads from an OS perspective are using the most CPU within your JVM etc.

Thread State and Detail: runnable [0x00007fa6d46da000]

Allows quick identification of Thread state and its potential current blocking condition
Thread States:
- NEW: The thread is created but has not been processed yet.
- RUNNABLE: The thread is occupying the CPU and processing a task. (It may be in WAITING status due to the OS’s resource distribution.)
- BLOCKED: The thread is waiting for a different thread to release its lock in order to get the monitor lock.
- WAITING: The thread is waiting by using a wait, join, or park method.
- TIMED_WAITING: The thread is waiting by using a sleep, wait, join or park method.

Thread Stack Trace:

at org.mindrot.jbcrypt.BCrypt.key(BCrypt.java:556) at
org.mindrot.jbcrypt.BCrypt.crypt_raw(BCrypt.java:629) at
org.mindrot.jbcrypt.BCrypt.hashpw(BCrypt.java:692) at
org.mindrot.jbcrypt.BCrypt.checkpw(BCrypt.java:763) at
hudson.security.HudsonPrivateSecurityRealm$3.isPasswordValid(HudsonPrivateSecurityRealm.java:686....▼

Each line in the stack trace corresponds to code, and subsequently a line number. As an example, we could trace:

hudson.security.HudsonPrivateSecurityRealm$3.isPasswordValid(HudsonPrivateSecurityRealm.java:686▼

to this code in GitHub.

Identifying and fixing performance issues

Typically, performance issues can be traced back to one of several different arenas.

CPU/Memory Issues - these may be related:

Out of Memory error
High CPU
Poor GC Performance/Thrashing in GC

Note that the issues are ordered from highest to lowest severity and that the previous issue can cause the symptoms of the subsequent issues to occur. For example, an Out of Memory Error can cause high CPU, a hang, and/or poor GC performance. Also note that if you are experiencing an Out of Memory Error along with high CPU & poor GC performance, the Out of Memory Error takes precedence. Once the Out of Memory Error is resolved it also resolves the other symptoms in most cases.

Out of Memory error

It is critical to determine the type of Out of Memory error since each type requires different steps for debugging and resolution. For example, reviewing a heap dump should be done after confirming that a Java heap Out of Memory Error occurred via the log files or verbose GC data. This is because the JVM will generate a heap dump for any Out of Memory Error that occurs if the Java Argument -XX:+HeapDumpOnOutOfMemoryError is set. A Java heap dump also shows valid memory usage, so it is best to review the file generated when the Out of Memory Error is thrown if possible.

The following are common Out Of Memory Errors:

java.lang.OutOfMemoryError: Unable to create new native thread: The Java application has hit the limit of how many Threads it can launch. This can typically be resolved by increasing a ulimit setting. This error is often encountered because the user has created too many processes. The number and kind of running processes should be reviewed using top and ps.

java.lang.OutOfMemoryError: Java heap space : This is the Java Virtual Machine’s way to announce that there is no more room in the virtual machine heap area. You are trying to create a new object, but the amount of memory this newly created structure is about to consume is more than the JVM has in the heap. The fastest way to mitigate this error is to increase the heap via -Xmx parameter but keep in mind that increasing the heap space may only fix it temporarily and there may be underlying issues that need to be investigated further.

java.lang.OutOfMemoryError: GC overhead limit exceeded : By default the JVM is configured to throw this error if you are spending more than 98% of the total time in GC and less than 2% of the heap is recovered after the GC. Changes to GC settings will mitigate this error but keep in mind there may be underlying issues that need to be investigated further.

Useful files for Out of Memory Errors:

about.md from support bundle (shows Java Virtual Machine (JVM) configuration)
verbose GC data (useful for Java heap, metaspace, GC overhead limit exceeded Out of Memory Errors)
heap dump (useful for Java heap Out of Memory Errors)
heap histogram (useful for Java heap Out of Memory Errors)
limits.txt file from support bundle (useful for Java native Out of Memory Errors)
nodes.md for agent JVM configuration (if the issue occurs there)
hs_err_pidXXXX.log file (useful for native Out of Memory Errors)

It’s important to note that when dealing with an Out of Memory Error, increasing the heap space will most likely resolve the issue temporarily, but it may be masking the root cause, and therefore analysis of a thread dump is always a best practice to understand if there are larger issues, many times associated with plugins that will need to be addressed.

High CPU Usage

Low Java heap space will also cause high CPU usage, as CPU will be doing Garbage Collections constantly to free up memory. Before considering the issue as high CPU, please review Out of Memory error above and make sure the issue is not caused by low memory.

Most of the time, servers only use a small fraction of their CPU power. When Jenkins is performing tasks, CPU usage will rise and/or spike temporarily. Once the CPU intensive process completes, the CPU usage should once again drop down to a lower level.

However, if you are receiving high CPU alerts and/or are experiencing application performance degradation, this may be due to a Jenkins process being stuck in an infinite loop or that the service has encountered an unexpected error. If your server is using close to 100% of the CPU, it will constantly have to free up processing power for different processes, which will slow the server down and may render the application unreachable.

In order to reduce the CPU usage, you first need to determine what processes are taxing your CPU. The best way of diagnosing this is by executing the collectPerformanceData.sh script while the CPU usage is high, as it will deliver the outputs of top and top -H while the issue is occuring so you can see which threads are consuming the most CPU.

In the following example scenario, we had reports that the Jenkins UI has become unresponsive and subsequently ran the collectPerformanceData.sh script during this time to gather data. In the output of top we see that the JVM is consuming a high amount of CPU:

Additionally, we also see in the output of “top -h” that there are a couple of “Hot Threads” or threads that are consuming a large amount of CPU:

By obtaining the PID’s of these threads, we are able to convert the PID using an online Binary Hexadecimal Converter which will provide us the ThreadID that we can then search for in the output of Jstack.

Example Converting “15801” to Hexadecimal Value of “3DB9”:

Using the new Hexadecimal value, we can now search our JStack output for “3DB9” and find the following stacktrace:

"Computer.threadPoolForRemoting [#25] for OperationsCenter2 connection from test/10.62.98.33:43856" #7992 daemon prio=5 os_prio=0 tid=0x00007fa7741bd000 nid=0x3DB9 runnable [0x00007fa6d46da000]
   java.lang.Thread.State: RUNNABLE
    at org.mindrot.jbcrypt.BCrypt.key(BCrypt.java:556)
    at org.mindrot.jbcrypt.BCrypt.crypt_raw(BCrypt.java:629)
    at org.mindrot.jbcrypt.BCrypt.hashpw(BCrypt.java:692)
    at org.mindrot.jbcrypt.BCrypt.checkpw(BCrypt.java:763)
    at hudson.security.HudsonPrivateSecurityRealm$3.isPasswordValid(HudsonPrivateSecurityRealm.java:686)
    at hudson.security.HudsonPrivateSecurityRealm$4.isPasswordValid(HudsonPrivateSecurityRealm.java:707)
    at hudson.security.HudsonPrivateSecurityRealm$Details.isPasswordCorrect(HudsonPrivateSecurityRealm.java:517)
    at hudson.security.HudsonPrivateSecurityRealm.authenticate(HudsonPrivateSecurityRealm.java:184)
    at sun.reflect.GeneratedMethodAccessor351.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)▼

The above findings show that BCrypt is causing the CPU usage to spike which implies that it is a CPU intensive operation. In this example, a bug report was created showing the information gathered and a fix was then implemented in a future release of Jenkins.

Fastthread.io

Once you have the output of jstack, for faster diagnosis you can use fastthread.io to analyze the thread dump:

Once uploaded, fastthread.io will provide an analysis, and typically a prescriptive action can be determined. Example:

Pipeline Hang

Oftentimes the root cause of performance issues can be traced back to poorly constructed pipelines that are not following Pipeline Best Practices. If you are experiencing a hanging pipeline, it is recommended to troubleshoot the issue by providing the Required Data to CloudBees Support(https://support.cloudbees.com/hc/en-us/requests/new).

UI Hang/Unreachable UI

As discussed above, an unresponsive UI is typically a symptom of high CPU utilization and memory consumption. However, other factors can come into play, such as networking latency or bad plugins. The following best practices should be observed when faced with an unresponsive or lagging UI.

Confirm Browser Functionality

Using Firefox or Google Chrome, ensure that your web browser is up-to-date, clear your cache and cookies, and enable Developer Tools to ensure a proxy or firewall is not blocking your connection to Jenkins. The example below shows a stable network connection returning 200 status when attempting to reach the Jenkins instance:

The example below shows a 404 status stating the server could not be reached. This would dictate further investigation as to why the server is unreachable. One way of would be to provide a HAR file to CloudBees Support for investigation. It is advised to obtain the output of collectPerformanceData.sh during this time to determine if high CPU, GC, or Out of Memory errors are present. Additionally, if you see requests that are taking a long response time, you may be experiencing a network-related issue. It is also advised to look over the slow response data of the Support Bundle to troubleshoot further.

Poor GC Performance / GC Thrashing

Very often, the root cause of Out Of Memory errors, as well as UI unavailability, is due to incorrect GC settings, or GC “thrashing”. Frequent garbage collection, due to failure to allocate memory for an object, insufficient free memory, or insufficient contiguous free memory due to memory fragmentation, is referred to as “thrashing”.

All GC events are “Stop The World” events. You want to limit these events to less than a second to ensure that your application does not become unresponsive for your end-users. Very long GC pauses may also result in 404 errors and other instabilities. In extreme cases, very long GC pauses may make it appear that build agents have disconnected or become unresponsive. If the JVM’s garbage collection is not performing as desired, you will need to benchmark Jenkins and tune the JVM based on the recommendations from CloudBees Support.

The recommended tuning approach to these parameters is to evaluate the GC logs using GCEasy.io, an online GC log analyzer.

Analysis using gceasy

Ensure that GC logs are being captured as part of the aforementioned JVM Arguments, and are included in your support bundle created from Jenkins. This is a selectable checkbox in the CloudBees Support section of Jenkins:

GCEasy.io

Once you have collected the GC logs, you can visit gceasy.io to upload the logs to the analyzer:

Once the log is analyzed, you will be presented with a report:

Most GC issues can be resolved by examining the reports generated by GCEasy.io and making adjustments accordingly.

FAQ

Q: I saw this article that recommended GC setting “X” for better performance… should I use it?

A: Probably not, unless you see poor GC performance with the suggested settings. Jenkins has an unusual pattern of memory use, so a lot of common expert tuning options do not apply to it. The recommended settings above are used at many Fortune 500 companies with only minor customizations. However, if you do wish to tinker, please make sure to gather GC logs under real-world use before and after applying the settings so you can compare the deltas.

Recommended customization options people may explore, beyond max/min heap size: MetaspaceSize, region size, and on very large systems HugePages (requires OS-level support).

Q: Should I set minimum heap and maximum heap to the same value?

A: You certainly don’t have to, but this gives some modest performance benefits for Jenkins-resizing the heap can be somewhat performance: expensive and in some specific cases may trigger higher SYS CPU use. It ensures more consistent performance when the level of load varies widely: heuristics won’t be able to shrink heap sizes in response to quieter periods.

Q: Should I use Concurrent Mark Sweep (CMS) GC for heaps from 2-4 GB?

A: No. Use G1 GC — CMS requires significant expert tuning to perform well for Jenkins and has several failure modes (ex: heap fragmentation) that can cause serious stability problems for Jenkins. We have seen big issues with some customers trying to use CMS GC.

Need help?

The CloudBees Support team is here to assist you. If you are already have a support subscription, please submit a support request.

If you are experiencing performance issues, you can troubleshoot the issue using the CloudBees Performance Decision Tree for troubleshooting.

References

In August 2020, the Jenkins project voted to replace the term master with controller. We have taken a pragmatic approach to cleaning these up, ensuring the least amount of downstream impact as possible. CloudBees is committed to ensuring a culture and environment of inclusiveness and acceptance - this includes ensuring the changes are not just cosmetic ones, but pervasive. As this change happens, please note that the term master has been replaced through the latest versions of the CloudBees documentation with controller (as in managed controller, client controller, team controller) except when still used in the UI or in code.