Table of Contents

Troubleshooting CloudBees Jenkins Enterprise 1.x

How to use the CloudBees Jenkins Enterprise command line tools

There is a knowledge base article with common CLI operations at: How to use cje command line tools

Log Files

This section helps CloudBees Jenkins Enterprise administrators locate the logs of each relevant component in the architecture.

Tenant Logs

Command line access to the tenant logs is greatly simplified in the latest versions of CloudBees Jenkins Enterprise. Be sure that your cluster is up to date since the latest releases include more tools that simplify the supportability of the product.

Tenant Logs from the Mesos UI

Unfortunately, this approach only works for clusters not running through HTTPS. The Mesos version bundled with the product does not have this capability for HTTPS clusters. However, in a Proof of Concept or an initial installation without HTTPS, this method might still be useful.

The Mesos UI URL can be discovered by running cje run display-outputs, and the username and password by executing cat .dna/secrets at the CloudBees Jenkins Enterprise project level.

$ cje run display-outputs

Controllers: ec2-34-228-237-6.compute-1.amazonaws.com
Workers    : ec2-54-235-255-70.compute-1.amazonaws.com,ec2-37-153-102-202.compute-1.amazonaws.com,ec2-54-167-58-87.compute-1.amazonaws.com

CJOC    : http://cluster.example.com/cjoc/
Mesos   : http://mesos.cluster.example.com
Marathon: http://marathon.cluster.example.com

From the Mesos UI, use the Sandbox section to access the tenant logs of a specific task in Mesos. The Mesos tasks are listed under Active Tasks → ID or Completed Tasks → ID.

cje troubleshooting mm mesos task id

Once the required task is located in the list of Mesos tasks, click the Sandbox button to get direct access to the stderr.

cje troubleshooting mm mesos stderr

From a CloudBees Jenkins Enterprise admin point of view, this is by far the easiest way to access the tenant logs.

Access to the Tenant Logs for CloudBees Jenkins Enterprise Versions Newer than 1.10

The support-log tail-tenant-logs command is available only for CloudBees Jenkins Enterprise clusters running a version newer than 1.10.

# Get the last task for the MM we want
$ cje run support-marathon mt-running-tasks | jq . | grep taskId
# Tail the latest task logs
$ cje run support-log tail-tenant-logs <TASK_ID>
# Download the task logs locally
$ cje run support-log get-tenant-logs <TASK_ID>

The <TASK_ID> can also be obtained from the Mesos UI. In the image from the previous section, if the <TASK_ID> were masters_master-1.62ceb825-13f5-11e8-80c9-3a2e409cf189, we could get the tenant logs for this specific master.
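If jq is available, the task ids can also be extracted and filtered directly from the command line; a hedged sketch, assuming the taskId fields appear in the mt-running-tasks JSON output as shown above:

# Print only the task ids and filter for a specific Managed Master
$ cje run support-marathon mt-running-tasks | jq -r '.. | .taskId? // empty' | grep "<MM_ID>"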

Access to the Tenant Logs for CloudBees Jenkins Enterprise Versions Older than 1.10

# List all the applications to show all the MM ids
$ cje run list-applications
# Look for worker in which MM is located
$ cje run find-worker-for <MM_ID>
# Connect to the worker
$ dna connect <WORKER_ID>
# Move to the tenant directory
$ cd /var/lib/mesos/slaves/<LATEST_SLAVE>/frameworks/<*>/executors/
# Get the last task
$ ls -lart masters_<MM_ID>
# Print the task
$ tail -f <LAST_TASK>/runs/latest/stderr
# Compress the logs so they can be downloaded with dna copy
$ sudo tar -chJf /tmp/logfiles.tar.xz /var/lib/mesos/slaves/<LATEST_SLAVE>/frameworks/<*>/executors/<LAST_TASK>/runs/latest/stderr
# Exit from the worker
$ exit

Download the generated file

dna copy  -rs /tmp/logfiles.tar.xz . worker-1

Finally, remove the compressed file.

dna connect worker-1
sudo rm /tmp/logfiles.tar.xz
exit

Component Logs

Mesos Master

Component: Mesos Master
Path: /var/log/syslog (Debian distribution), /var/log/messages (Red Hat distribution)

# Get the controller leader. For versions before 1.10, use the Mesos UI instead.
# e.g. "leader": "10.16.7.195:8080". Then run cje run display-outputs
# to match the IP with the corresponding controller
$ cje run support-marathon mt-info | jq . | grep leader

# Connect to the controller leader
$ dna connect <CONTROLLER_ID>

# Print the logs, filtering by the master name (my-managed-master-test in this example)
$ sudo grep "mesos-master\[" /var/log/syslog | grep "my-managed-master-test"

Mesos Agent

Component: Mesos Agent
Path: /var/log/syslog (Debian distribution), /var/log/messages (Red Hat distribution)

# Find the worker where the Managed Master is running
$ cje run find-worker-for <MM_ID>

# Connect to that worker (the Mesos agent runs on the workers)
$ dna connect <WORKER_ID>

# Print the logs; the agent process is typically logged as mesos-slave
$ sudo grep "mesos-slave\[" /var/log/syslog | grep "my-managed-master-test"

Marathon

Component: Marathon
Path: /var/log/syslog (Debian distribution), /var/log/messages (Red Hat distribution)

# Get the controller leader. For versions before 1.10, use the Mesos UI instead.
$ cje run support-marathon mt-info | jq . | grep leader

# (optional) list the servers to find the <CONTROLLER_ID>
$ dna server

# Connect to the controller leader
$ dna connect <CONTROLLER_ID>

# Print the logs, filtering by the marathon process
$ sudo grep "marathon\[" /var/log/syslog

Router

Component: Router
Path: /var/log/nginx/

# Get the controller leader. For versions before 1.10, use the Mesos UI instead.
$ cje run support-marathon mt-info | jq . | grep leader

# Connect to the controller leader
$ dna connect <CONTROLLER_ID>

# Print the logs and the router configuration
$ tail /var/log/router/*.log
$ tail /var/log/router/error.log
$ cat /etc/nginx/conf.d/*

Castle

The logs for castle can be retrieved with the command:

cje run get-application-log castle

Or you can connect to the worker and inspect the log directly:

Component: Castle
Path: /var/lib/mesos/slaves/<SLAVE_ID>/frameworks/<FRAMEWORK_ID>/executors/

# 1. Find the worker (<WORKER_ID>) where the Managed Master is deployed
$ cje run find-worker-for <MASTER_ID>

# 2. Connect to the worker
$ dna connect <WORKER_ID>

# 3. Move to /var/lib/mesos/slaves/<SLAVE_ID>/frameworks/<FRAMEWORK_ID>/executors/
$ cd /var/lib/mesos/slaves/<SLAVE_ID>/frameworks/<FRAMEWORK_ID>/executors/

# 4. Get the last task
$ ls -lart | grep jce_castle

# 5. Print the castle logs
$ tail -F /var/lib/mesos/slaves/<SLAVE_ID>/frameworks/<FRAMEWORK_ID>/executors/jce_castle.<ID>/runs/latest/stderr

Palace

The logs for palace can be retrieved with the command:

cje run get-application-log palace

CJOC

The logs for CloudBees Jenkins Operations Center can be retrieved with the command:

cje run get-application-log cjoc

Elasticsearch

The logs for elasticsearch can be retrieved with the command:

cje run get-application-log elasticsearch

Managed Master Provisioning

The following section provides some guidance for troubleshooting Managed Master provisioning failures.

Architecture

The Managed Master provisioning process contains the following interactions between the different CJE components:

  • MM creation → CJOC → Master Provisioning Mesos plugin → Marathon → Mesos → Master worker

  • MM start-up sequence

    • Container start

    • MM image download (Mesos agent)

    • Volume provisioning (talking to Castle: volume acquisition or NFS mount)

    • Jenkins launches (MM docker container)

cje troubleshooting mm provisioning architecture

Diagnostic Sequence

The following table shows the most important steps that happen when a Managed Master is provisioned, specifying whether the step is usually a source of issues, the component where the event occurs, its location, and the log paths.

Steps marked below as a frequent source of issues correspond to the components marked with a star in the image below.

Event: MM creation
Component: CJOC → Master Provisioning Mesos plugin
Location: Worker: master
Logs path: Enable logging at plugin level:

  • com.cloudbees.jce.masterprovisioning.mesos

  • com.cloudbees.pse.masterprovisioning.mesos

Access to logs

Event: Requests redirection
Component: Router
Location: One of the controllers (random)
Logs path:

  • /var/log/nginx/

    • tail /var/log/router/*.log

    • tail /var/log/router/error.log

    • cat /etc/nginx/conf.d/*

Event: Task schedule
Component: Marathon
Location: Check the Marathon endpoint to determine which controller contains the Marathon leader
Logs path: /var/log/syslog (Debian distribution), /var/log/messages (Red Hat distribution)

Event: Run the task (frequent source of issues)
Component: Mesos Master
Location: Controller: primary
Logs path: /var/log/syslog (Debian distribution), /var/log/messages (Red Hat distribution)

Caution
There must be a worker of type 'master' with enough resources (CPU + Memory) to accept the Marathon task. Use cje run list-resources to check this.

If the MM creation task appears in both Mesos and Marathon, the troubleshooting can start from here.

Event: Sandbox creation (frequent source of issues)
Component: Mesos agent
Location: Worker: master
Logs path: /var/log/syslog (Debian distribution), /var/log/messages (Red Hat distribution)

Access to logs

Event: Volume provisioning (frequent source of issues)
Component: Castle
Location: Worker: master
Logs path: /var/log/syslog (Debian distribution), /var/log/messages (Red Hat distribution)

Event: Container starts (frequent source of issues)
Component: Jenkins
Location: Worker: master
Logs path: /var/lib/mesos/slaves/…/runs/latest/stderr

The image below represents the diagnostic troubleshooting sequence graphically. The components marked with a star are the most problematic ones in the Managed Master provisioning process.

cje troubleshooting mm diagnostic sequence

Simple Diagnosis Analysis

The points below describe the general procedure to diagnose Managed Master provisioning failures. Solutions are included for the cases that can be handled by a CloudBees Jenkins Enterprise administrator.

It is very important to understand whether you are facing a CloudBees Jenkins Enterprise issue or a Jenkins issue. Notice that all Managed Masters have an Advanced section in their configuration where you can modify the default values for the Health checks that Marathon uses to decide whether Jenkins is in a healthy state or needs to be re-provisioned.

  • Grace Period (seconds). Health check failures are ignored within this number of seconds of the task being started or until the task becomes healthy for the first time.

  • Interval (seconds). Number of seconds to wait between health checks.

Tip
The Grace Period value is directly influenced by the length of time it takes for Castle to provision the storage. The default settings for the various components are generous enough for typical use cases. However, in some situations, such as when the volume is extremely large, e.g., 1TB, this value needs to be adjusted to reflect reality. The Health Check endpoint will not be accessible until after the volume for CJOC/MM becomes available, so it is important to set this value correctly.

For troubleshooting purposes, it is very important to increase the Grace Period so that, in case of a Jenkins performance issue, we can check whether the Jenkins UI is accessible, even if it is not responsive.

Use these steps to troubleshoot Managed Master provisioning issues in a simple way:

  1. Stop the Managed Master provisioning from the Manage section in the Operations Center UI.

    cje troubleshooting mm stop
  2. In the Health check section, under the Advanced configuration section in the Operations Center UI, ensure that the Grace Period (seconds) is at least 1200 seconds to prevent the Managed Master from being restarted every 2 minutes when Jenkins does not respond to the Health checks from Marathon.

    cje troubleshooting mm configure
  3. Start the Managed Master provisioning from the Manage section in the Operations Center UI. At this point, if the UI is accessible, even if it is not responsive, then it means Jenkins is suffering a performance issue. However, if the UI is not even accessible, then it might be a CloudBees Jenkins Enterprise issue.

  4. Check the worker where the Managed Master got provisioned and connect to it

  5. Print the syslogs to understand what failed.

 // 1. Find the worker (<WORKER_ID>) where the Managed Master is deployed
 $ cje run find-worker-for <MASTER_ID>
// 2. Connect to the worker
$ dna connect <WORKER_ID>
// 3. Print the syslogs
$ tail -F /var/log/syslog

In the syslogs, you should now see the Managed Master provisioning, and, if this is a CloudBees Jenkins Enterprise issue, the problem should then be exposed.

Advanced Diagnosis Procedure

The advanced diagnosis procedure checks each element in the diagnostic sequence to ensure that everything worked as expected.

The first thing to check is whether the Mesos task for the Managed Master provisioning exists; if it does, we can skip most of the checks we would otherwise need to do at log level.
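A hedged command-line sketch of this first check, reusing the commands documented earlier in this section (the grep assumes the Mesos task id contains the <MASTER_ID>):

# Does a running task exist for the Managed Master?
$ cje run support-marathon mt-running-tasks | jq . | grep taskId | grep "<MASTER_ID>"
# If a <TASK_ID> is returned, jump straight to its tenant logs
$ cje run support-log tail-tenant-logs <TASK_ID>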

Support Bundle Analysis

The following section describes the full log information that is included in a support bundle when an MM is deployed. This support bundle was generated by following the steps below:

  1. Create an MM in the OC UI called "my-managed-master" (the <MASTER_ID>)

  2. Locate the worker in which it was provisioned (cje run find-worker-for <MASTER_ID>)

  3. Generate a support bundle (cje prepare pse-support)

  4. Decompress the support bundle and look for the <MASTER_ID>

OC Logs

In the OC, we can see the workflow: MM creation → CJOC → Master Provisioning Mesos plugin

cje troubleshooting mm support bundle oc logs

Cluster Logs

In the cluster logs we can see the workflow: Marathon → Mesos Master → Container start: MM image download, Volume provisioning, Jenkins launches

cje troubleshooting mm support bundle cluster logs

Chronological List of all the Deployment Tenant Logs

When an MM has been re-provisioned several times, we will see many tenant logs and it is very difficult to track which one is the latest. For this, we can use the find command below, which lists all the tenant logs of the MM in deployment order.

find . -path "*<MASTER_ID>*" -iname stderr -exec grep "Connecting to Castle (1/12)" {} \; | sort

Most Common Issues

In this section we will review the most common issues that happen during Managed Master provisioning.

Container Resources

This problem usually happens because the amount of memory assigned to the MM is not enough for Jenkins to work, i.e. the container was given too little memory to operate.

If this happens, we should see a stack trace like the one below in the syslogs, where the kernel kills the Docker container (oom_kill_process).

Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983832] CPU: 0 PID: 1007 Comm: git-remote-http Not tainted 3.13.0-132-generic #181-Ubuntu
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983834] Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983836]  0000000000000000 ffff88007e87fc50 ffffffff8172d909 ffff8803d83e3000
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983840]  ffff8803f7f3f800 ffff88007e87fcd8 ffffffff81727ea8 0000000000000000
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983843]  0000000000000003 0000000000000046 ffff88007e87fca8 ffffffff81156937
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983847] Call Trace:
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983855]  [<ffffffff8172d909>] dump_stack+0x64/0x82
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983859]  [<ffffffff81727ea8>] dump_header+0x7f/0x1f1
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983864]  [<ffffffff81156937>] ? find_lock_task_mm+0x47/0xa0
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983867]  [<ffffffff81156db1>] oom_kill_process+0x201/0x360
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983872]  [<ffffffff812de0e5>] ? security_capable_noaudit+0x15/0x20
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983876]  [<ffffffff811b921c>] mem_cgroup_oom_synchronize+0x51c/0x560
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983879]  [<ffffffff811b8750>] ? mem_cgroup_charge_common+0xa0/0xa0
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983882]  [<ffffffff81157594>] pagefault_out_of_memory+0x14/0x80
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983887]  [<ffffffff817264d5>] mm_fault_error+0x67/0x140
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983890]  [<ffffffff81739da2>] __do_page_fault+0x4a2/0x560
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983894]  [<ffffffff81183bed>] ? vma_merge+0x23d/0x340
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983897]  [<ffffffff81184d44>] ? do_brk+0x1d4/0x360
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983899]  [<ffffffff81739e7a>] do_page_fault+0x1a/0x70
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983903]  [<ffffffff817361a8>] page_fault+0x28/0x30
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983905] Task in /docker/e0aad9567ff04e68a8effaab4dd30e3a80a2dfa22af4c40ce03c267cb3cd315d killed as a result of limit of /docker/e0aad9567ff04e68a8effaab4dd30e3a80a2dfa22af4c40ce03c267cb3cd315d
...
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983984] Memory cgroup out of memory: Kill process 1006 (java) score 1007 or sacrifice child
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040535.002600] Killed process 1003 (git) total-vm:15600kB, anon-rss:188kB, file-rss:888kB
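To quickly confirm whether the kernel OOM killer was involved, a hedged one-liner on the worker that simply greps the syslog for the messages shown above:

# Look for OOM killer activity in the worker syslog
$ sudo grep -E "oom_kill_process|Memory cgroup out of memory" /var/log/syslog | tail -20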

To solve this memory issue, increase the overall container memory in Jenkins Master Memory in MB and/or increase the JVM Max heap ratio for Jenkins under the configuration section at Managed Master level.

The JVM Max heap ratio must be a decimal between 0 and 1. Values over 0.7 are not recommended and can cause master restarts when running out of memory.

Regarding the Jenkins Master Memory in MB, it is the amount of RAM that will be given to the container, expressed in megabytes. The heap given to the Master JVM is a ratio of this memory. The minimum recommended value for non test/demo instances is 4096 MB.

cje troubleshooting mm memory
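As a rough worked example (assuming the Master JVM heap is approximately the container memory multiplied by the JVM Max heap ratio):

# Hypothetical values: a 4096 MB container with a 0.7 heap ratio
$ echo "4096 * 0.7" | bc
2867.2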

Cluster Resources

Before generating more Managed Masters in a CloudBees Jenkins Enterprise cluster, you should ensure that the cluster has enough CPU and Memory resources to allow new Docker containers to be created.

For each Managed Master, the assigned memory and CPU are listed under the Advanced button in the Provisioning section, reached through Jenkins → Master Name → Operations center → Configure → Advanced.

To check the resources currently available in the cluster, use cje run list-resources. It tells you how much RAM is available on each master worker. If there is not at least one worker with enough free RAM to accommodate the MM’s required container RAM (referred to as "Jenkins Master Memory in MB" in the diagram above), then the Managed Master will not be provisioned.

In the example below, we can see that for worker-2, which is of type master, 3.4 units of CPU (calculated as 4.0-0.6) and 8623 MB of memory (calculated as 15023-6400) are still available.

$ cje run list-resources
name       type       cpus                      mem
worker-3   build      0.1/4.0 (2.50 %)          2048.0/15023.0 (13.63 %)
worker-2   master     0.6/4.0 (15.00 %)         6400.0/15023.0 (42.60 %)
worker-1   master     0.7/4.0 (17.50 %)         6912.0/15023.0 (46.01 %)
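A hedged helper to compute the free memory per worker from that output (it assumes the column layout shown above, where the sixth whitespace-separated field is used/total memory):

# Free memory (MB) per worker, derived from `cje run list-resources`
$ cje run list-resources | awk 'NR>1 {split($6, m, "/"); printf "%-10s %.0f MB free\n", $1, m[2]-m[1]}'
worker-3   12975 MB free
worker-2   8623 MB free
worker-1   8111 MB free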

At this point, you might find the article How to extend the size of the worker interesting.

Disk Space

A common issue for a Managed Master not being provisioned is that one of the infrastructure elements involved runs out of space. There are two main elements that might run out of space:

  • Worker machine

  • Masters volume mount

In AWS, the worker machine root device is /dev/xvda1, while the masters' volume mounts start with /mnt/.

In order to check if there is enough disk space in the worker machine and the master volume mount, connect to the worker where the Managed Master is provisioned and use the df command to check whether there is enough space on the disk.

Additionally, Operations Center has built-in Health Checks for workers, with a hard-coded 90% threshold for disk usage on the worker. Users can also rely on the CJOC raw metrics, available from /metrics/currentUser/metrics.

// 1. Find the worker where the Managed Master is deployed
$ cje run find-worker-for <MASTER_ID>
worker-2
// 2. Connect to the worker where the master is getting provisioned
$ dna connect worker-2
// 3. Print the disk space
$ df
Filesystem     1K-blocks     Used Available Use% Mounted on
udev             8205252       12   8205240   1% /dev
tmpfs            1643316      700   1642616   1% /run
/dev/xvda1      51466360 11975512  37294356  25% /
none                   4        0         4   0% /sys/fs/cgroup
none                5120        0      5120   0% /run/lock
none             8216568     2192   8214376   1% /run/shm
none              102400        0    102400   0% /run/user
/dev/xvdq        2086912   236980   1849932  12% /mnt/dse.teams/c04bbbfb3f6d1a306a6b018403d32074a902d27c4df1db4db67dbf49483c1941
/dev/xvdr        2086912   169484   1917428   9% /mnt/master-2/cd438f6572765b3bf62dc73117c14a1a0efa8d277de493396db8e12aed920013
/dev/xvdj       20961280   829424  20131856   4% /mnt/master-1/c905b7b62a0a8869cbb6aab2b1736f3cdf9eb29ebb4e7b1658560c9af1f72c0f
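A hedged one-liner (based on the df layout above) to flag any filesystem above the 90% threshold mentioned earlier:

# Show filesystems over 90% usage on the worker
$ df | awk 'NR>1 && $5+0 > 90 {print $1, $5, $6}'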

Volume Provisioning

The health of the Castle system can be checked from the Marathon UI, in the jce folder where the Castle application lives. We should see one Castle application for each worker of type master.

Once we know the worker on which the Managed Master is getting provisioned, we can access the Castle logs and check whether the volume is provisioned correctly. The Castle logs are located in the Mesos task logs under /var/lib/mesos/slaves/<SLAVE_ID>/frameworks/<FRAMEWORK_ID>/executors/jce_castle.<APP_ID>/runs/latest/stderr.

// 1. Find the worker where the Managed Master is deployed
$ cje run find-worker-for <MASTER_ID>
worker-2
// 2. Connect to the worker
$ dna connect worker-2
// 3. Get the Castle Container ID
$ sudo docker ps
CONTAINER ID        IMAGE                           COMMAND                  CREATED             STATUS              PORTS                                NAMES
...
577d3840c6ff        cloudbees/pse-castle:1.11.8.1   "/bin/tini -- /usr..."   8 days ago          Up 8 days           50000/tcp, 0.0.0.0:31080->8080/tcp   mesos-6fc76875-f6db-4788-86db-4a62b5fd5d6d-S6.7ac62e6d-e73c-495c-8206-b15cce89bfad
...
// 4. Get the Castle Slave Dir
$ sudo docker inspect 577d3840c6ff | jq . | grep "mesos/slaves"
        "Source": "var/lib/mesos/slaves/f55b8159-9165-4a01-9baa-71ba24b42936-S0/frameworks/f55b8159-9165-4a01-9baa-71ba24b42936-0000/executors/jce_castle.fdab4a32-bd64-11e7-9234-d63c791b927e/runs/7ac62e6d-e73c-495c-8206-b15cce89bfad",
        "var/lib/mesos/slaves/f55b8159-9165-4a01-9baa-71ba24b42936-S0/frameworks/f55b8159-9165-4a01-9baa-71ba24b42936-0000/executors/jce_castle.fdab4a32-bd64-11e7-9234-d63c791b927e/runs/7ac62e6d-e73c-495c-8206-b15cce89bfad:/mnt/mesos/sandbox",
// 5. Print the castle logs: use the executor path from above (var/lib.../7ac62e6d-e73c-495c-8206-b15cce89bfad), replacing the run id with 'latest' and appending '/stderr'
$ tail -F /var/lib/mesos/slaves/f55b8159-9165-4a01-9baa-71ba24b42936-S0/frameworks/f55b8159-9165-4a01-9baa-71ba24b42936-0000/executors/jce_castle.fdab4a32-bd64-11e7-9234-d63c791b927e/runs/latest/stderr

If everything is fine we should see something similar to the logs below. If not, the problem at Castle level should be exposed here.

INFO: EBS provisioning for master-1 [c905b7b62a0a8869cbb6aab2b1736f3cdf9eb29ebb4e7b1658560c9af1f72c0f]
Nov 13, 2017 2:47:24 PM com.cloudbees.jce.castle.util.LoggingRetryListener onRetry
INFO: Created tags for volume vol-04786bfed1b96dbfa (1 tries, 62 ms): ok
Nov 13, 2017 2:47:24 PM com.cloudbees.dac.castle.EbsBackend tagAndMountVolume
INFO: Tagging/Mounting volume vol-04786bfed1b96dbfa [c905b7b62a0a8869cbb6aab2b1736f3cdf9eb29ebb4e7b1658560c9af1f72c0f] from snapshot snap-0a48379ba498d3c85 on instance i-08019a00f2d13bf62 and size 20GiB
Nov 13, 2017 2:47:24 PM com.cloudbees.dac.castle.VolumeDeviceUtils attachVolume
INFO: Attaching volume vol-04786bfed1b96dbfa to instance i-08019a00f2d13bf62 on device /dev/sdj
Nov 13, 2017 2:47:24 PM com.cloudbees.jce.castle.util.LoggingRetryListener onRetry
INFO: Attached volume vol-04786bfed1b96dbfa to instance i-08019a00f2d13bf62 on device /dev/sdj (1 tries, 209 ms): ok
Nov 13, 2017 2:47:28 PM com.cloudbees.dac.castle.VolumeDeviceUtils lambda$attachVolume$10
INFO: Attached volume vol-04786bfed1b96dbfa to instance i-08019a00f2d13bf62 on device /dev/sdj
Nov 13, 2017 2:47:28 PM com.cloudbees.jce.castle.util.LoggingRetryListener onRetry
INFO: Attached volume vol-04786bfed1b96dbfa to instance i-08019a00f2d13bf62 on device /dev/sdj (5 tries, 3,885 ms): {VolumeId: vol-04786bfed1b96dbfa,InstanceId: i-08019a00f2d13bf62,Device: /dev/sdj,State: attached,AttachTime: Mon Nov 13 14:47:24 UTC 2017,DeleteOnTermination: false}
Nov 13, 2017 2:47:28 PM com.cloudbees.jce.castle.util.LoggingRetryListener onRetry
INFO: Attached volume vol-04786bfed1b96dbfa [c905b7b62a0a8869cbb6aab2b1736f3cdf9eb29ebb4e7b1658560c9af1f72c0f] for host i-08019a00f2d13bf62 (1 tries, 4,158 ms): {VolumeId: vol-04786bfed1b96dbfa,InstanceId: i-08019a00f2d13bf62,Device: /dev/sdj,State: attached,AttachTime: Mon Nov 13 14:47:24 UTC 2017,DeleteOnTermination: false}
Nov 13, 2017 2:47:28 PM com.cloudbees.dac.castle.EbsBackend tagAndMountVolume
INFO: Mounting device for vol-04786bfed1b96dbfa [c905b7b62a0a8869cbb6aab2b1736f3cdf9eb29ebb4e7b1658560c9af1f72c0f]: /dev/xvdj - /mnt/master-1/c905b7b62a0a8869cbb6aab2b1736f3cdf9eb29ebb4e7b1658560c9af1f72c0f
Nov 13, 2017 2:47:28 PM com.cloudbees.dac.castle.EbsBackend tagAndMountVolume
INFO: Growing XFS filesystem for c905b7b62a0a8869cbb6aab2b1736f3cdf9eb29ebb4e7b1658560c9af1f72c0f: /mnt/master-1/c905b7b62a0a8869cbb6aab2b1736f3cdf9eb29ebb4e7b1658560c9af1f72c0f
Nov 13, 2017 2:47:28 PM com.cloudbees.dac.castle.EbsBackend tagAndMountVolume
INFO: Mounting filesystem for c905b7b62a0a8869cbb6aab2b1736f3cdf9eb29ebb4e7b1658560c9af1f72c0f: /mnt/master-1/c905b7b62a0a8869cbb6aab2b1736f3cdf9eb29ebb4e7b1658560c9af1f72c0f into /var/jenkins_home
Nov 13, 2017 2:47:29 PM com.cloudbees.dac.castle.EbsBackend tagAndMountVolume
INFO: Setting filesystem permissions 1000:1000 for c905b7b62a0a8869cbb6aab2b1736f3cdf9eb29ebb4e7b1658560c9af1f72c0f: /mnt/master-1/c905b7b62a0a8869cbb6aab2b1736f3cdf9eb29ebb4e7b1658560c9af1f72c0f
Nov 13, 2017 2:47:29 PM com.cloudbees.dac.castle.EbsBackend doProvision

MM Docker Image Cannot be Downloaded

The tenant logs do not tell you what is wrong; you must look at the Mesos agent logs, since the image is downloaded by that process, or check with the docker images command whether the image is actually on disk.
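A hedged sketch for that check, using the image name from the example logs below (cloudbees/cje-mm) as a placeholder:

# On the worker: is the Managed Master image already on disk?
$ sudo docker images | grep "cloudbees/cje-mm"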

One of the causes of failure is invalid Docker registry credentials, in which case you will need to update the Docker Registry settings.

# In case the Docker Private Registry needs to be updated
$ cje prepare docker-registry-update
# To update EC2 Container Registry settings
$ cje prepare ecr-update
Note
You typically do not need both of the commands - use the appropriate one for the situation.

Below are the typical tenant logs for this case, stuck at Fetched 'file:///root/docker.tar.gz'.

I0130 18:00:54.508648 30457 fetcher.cpp:424] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/400c68cb-827b-4c9d-a0f0-6bdad1e94053-S6","items":[{"action":"BYPASS_CACHE","uri":{"cache":false,"executable":false,"extract":true,"value":"file:\/\/\/root\/docker.tar.gz"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/400c68cb-827b-4c9d-a0f0-6bdad1e94053-S6\/frameworks\/574e9852-d73c-4f8c-9958-892fd4838bb6-0000\/executors\/masters_mm-image-does-not-exist.83fc635f-05e7-11e8-80c9-3a2e409cf189\/runs\/209e728c-7bc7-49a6-8cdc-1b63168af322"}
I0130 18:00:54.510223 30457 fetcher.cpp:379] Fetching URI 'file:///root/docker.tar.gz'
I0130 18:00:54.510236 30457 fetcher.cpp:250] Fetching directly into the sandbox directory
I0130 18:00:54.510248 30457 fetcher.cpp:187] Fetching URI 'file:///root/docker.tar.gz'
I0130 18:00:54.510262 30457 fetcher.cpp:167] Copying resource with command:cp '/root/docker.tar.gz' '/var/lib/mesos/slaves/400c68cb-827b-4c9d-a0f0-6bdad1e94053-S6/frameworks/574e9852-d73c-4f8c-9958-892fd4838bb6-0000/executors/masters_mm-image-does-not-exist.83fc635f-05e7-11e8-80c9-3a2e409cf189/runs/209e728c-7bc7-49a6-8cdc-1b63168af322/docker.tar.gz'
I0130 18:00:54.512020 30457 fetcher.cpp:84] Extracting with command: tar -C '/var/lib/mesos/slaves/400c68cb-827b-4c9d-a0f0-6bdad1e94053-S6/frameworks/574e9852-d73c-4f8c-9958-892fd4838bb6-0000/executors/masters_mm-image-does-not-exist.83fc635f-05e7-11e8-80c9-3a2e409cf189/runs/209e728c-7bc7-49a6-8cdc-1b63168af322' -xf '/var/lib/mesos/slaves/400c68cb-827b-4c9d-a0f0-6bdad1e94053-S6/frameworks/574e9852-d73c-4f8c-9958-892fd4838bb6-0000/executors/masters_mm-image-does-not-exist.83fc635f-05e7-11e8-80c9-3a2e409cf189/runs/209e728c-7bc7-49a6-8cdc-1b63168af322/docker.tar.gz'
I0130 18:00:54.514690 30457 fetcher.cpp:92] Extracted '/var/lib/mesos/slaves/400c68cb-827b-4c9d-a0f0-6bdad1e94053-S6/frameworks/574e9852-d73c-4f8c-9958-892fd4838bb6-0000/executors/masters_mm-image-does-not-exist.83fc635f-05e7-11e8-80c9-3a2e409cf189/runs/209e728c-7bc7-49a6-8cdc-1b63168af322/docker.tar.gz' into '/var/lib/mesos/slaves/400c68cb-827b-4c9d-a0f0-6bdad1e94053-S6/frameworks/574e9852-d73c-4f8c-9958-892fd4838bb6-0000/executors/masters_mm-image-does-not-exist.83fc635f-05e7-11e8-80c9-3a2e409cf189/runs/209e728c-7bc7-49a6-8cdc-1b63168af322'
I0130 18:00:54.514710 30457 fetcher.cpp:456] Fetched 'file:///root/docker.tar.gz' to '/var/lib/mesos/slaves/400c68cb-827b-4c9d-a0f0-6bdad1e94053-S6/frameworks/574e9852-d73c-4f8c-9958-892fd4838bb6-0000/executors/masters_mm-image-does-not-exist.83fc635f-05e7-11e8-80c9-3a2e409cf189/runs/209e728c-7bc7-49a6-8cdc-1b63168af322/docker.tar.gz'

In the Mesos agent logs, we can see the issue exposed:

E0130 18:00:54.801223  5925 slave.cpp:3758] Container '209e728c-7bc7-49a6-8cdc-1b63168af322' for executor 'masters_mm-image-does-not-exist.83fc635f-05e7-11e8-80c9-3a2e409cf189' of framework 574e9852-d73c-4f8c-9958-892fd4838bb6-0000 failed to start: Failed to 'docker -H unix:///var/run/docker.sock pull cloudbees/cje-mm:2.32.1.1': exit status = exited with status 1 stderr = Error response from daemon: manifest for cloudbees/cje-mm:2.32.1.1 not found
W0130 18:00:54.801731  5929 docker.cpp:1302] Ignoring updating unknown container: 209e728c-7bc7-49a6-8cdc-1b63168af322
E0130 18:01:04.944007  5929 slave.cpp:3758] Container '2e0ebac2-99c4-446b-8500-99541a76e47a' for executor 'masters_mm-image-does-not-exist.89f837d1-05e7-11e8-80c9-3a2e409cf189' of framework 574e9852-d73c-4f8c-9958-892fd4838bb6-0000 failed to start: Failed to 'docker -H unix:///var/run/docker.sock pull cloudbees/cje-mm:2.32.1.1': exit status = exited with status 1 stderr = Error response from daemon: manifest for cloudbees/cje-mm:2.32.1.1 not found
W0130 18:01:04.944435  5923 docker.cpp:1302] Ignoring updating unknown container: 2e0ebac2-99c4-446b-8500-99541a76e47a
E0130 18:01:09.864475  5927 slave.cpp:3758] Container '9b09c95b-bf95-434a-a571-1cbaed1146db' for executor 'masters_mm-image-does-not-exist.8cf63592-05e7-11e8-80c9-3a2e409cf189' of framework 574e9852-d73c-4f8c-9958-892fd4838bb6-0000 failed to start: Failed to 'docker -H unix:///var/run/docker.sock pull cloudbees/cje-mm:2.32.1.1': exit status = exited with status 1 stderr = Error response from daemon: manifest for cloudbees/cje-mm:2.32.1.1 not found
W0130 18:01:09.864888  5925 docker.cpp:1302] Ignoring updating unknown container: 9b09c95b-bf95-434a-a571-1cbaed1146db
E0130 18:01:20.007645  5923 slave.cpp:3758] Container '88856388-52fe-4c3e-86e4-f5f5ebfc2618' for executor 'masters_mm-image-does-not-exist.92f25824-05e7-11e8-80c9-3a2e409cf189' of framework 574e9852-d73c-4f8c-9958-892fd4838bb6-0000 failed to start: Failed to 'docker -H unix:///var/run/docker.sock pull cloudbees/cje-mm:2.32.1.1': exit status = exited with status 1 stderr = Error response from daemon: manifest for cloudbees/cje-mm:2.32.1.1 not found

Performance Issues

Performance issues cause the MM to be re-provisioned over and over again. The difference between a performance issue and the rest of the MM provisioning issues is that the UI is reachable for a few seconds/minutes, but the instance does not respond to the Marathon Health checks, so Marathon keeps attempting to re-provision it.

The first step to cope with this issue is to increase the grace period, which prevents Marathon from attempting to re-provision the instance over and over again. For this, go to the MM configuration and, under the Health check section, ensure that the Grace Period is at least 1200 seconds and the Interval 60 seconds. This gives us enough time to take thread dumps. You can set the value even higher if you want to stop Marathon from re-provisioning this instance so frequently.

After increasing the Grace Period, run:

# Only available for CJE-CLI version >= 1.7.1
# For versions below 1.7.1 you need to do this manually, using jstack or the kill -3 command
$ cje run support-performance master-1 60 15

...
Downloading worker-2:/tmp/20170711092408.worker-2.master-1.performance.tgz to /Users/dvilladiego/workspaces/support/support-cluster-cje/.dna/support/performance/20170711092408.worker-2.master-1.performance.tgz
Warning: Permanently added 'ec2-54-242-218-225.compute-1.amazonaws.com,54.242.218.225' (ECDSA) to the list of known hosts.
20170711092408.worker-2.master-1.performance.tgz
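For CLI versions below 1.7.1, a hedged manual sketch for capturing a thread dump on the worker (the container id and Java PID are placeholders; the dump is written to the container's standard output in the Mesos sandbox):

# Find the Managed Master container and its Java process on the worker
$ dna connect <WORKER_ID>
$ sudo docker ps | grep cje-mm
$ sudo docker top <CONTAINER_ID> | grep java
# Send SIGQUIT to the Java PID (as seen on the host) to produce a thread dump
$ sudo kill -3 <JAVA_PID>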

Operations Center Provisioning

Operations Center provisioning follows almost the same process as Managed Master provisioning, from both an architectural and a troubleshooting point of view. This means that for an Operations Center provisioning issue, the procedure explained in the Managed Master provisioning section applies here as well.

The main difference is the way Operations Center is configured. Whereas Managed Masters are configured through the Operations Center UI, under the Advanced button of the Managed Master item, Operations Center is configured through the CloudBees Jenkins Enterprise CLI.

The CJE CLI command cje prepare cjoc-update is the one used for:

  • Configuring the Application memory in MB

  • Adding JVM options

  • Specifying the CPU reservation/allocation

  • Determining the Disk size in GB

  • Defining the Custom docker image tag

$ cje prepare cjoc-update
cjoc-update is staged - review cjoc-update.config and edit as needed - then run 'cje apply' to perform the operation.
[cjoc]

## Application memory in MB
# The java heap size needs to be adjusted separately via the jvm_options (see below)
# memory = 2048

## JVM options
# The default JVM GC settings are the recommended values for running CJOC. Please do not change them unless
# there are good reasons for doing so
# jvm_options = -Xmx2048m -XX:+PrintGCDetails -Duser.timezone=America/Chicago

## CPU reservation/allocation
# cpus = 0.2

## Disk size in GB
# disk = 250

## Custom docker image tag
# docker_image = cloudbees/cje-oc:2.121.3.1

Architecture

The Operations Center provisioning process contains the following interactions between the different CJE components:

  • CJE CLI → dna init cjoc → Marathon → Mesos → Master worker

  • Operations Center start-up sequence

    • Container start

    • OC image download (Mesos agent)

    • Volume provisioning (talking to Castle: volume acquisition or NFS mount)

    • Jenkins launches (Operations Center docker container)

cje troubleshooting oc provisioning architecture

Build Agent provisioning

The following section provides some guidance for troubleshooting Build Agent provisioning failures.

Architecture

The Build Agent provisioning process contains the following interactions between the different CJE components:

  1. MM requests an agent from Palace Cloud (build console log, master logs)

  2. Palace Cloud requests an agent from Palace (Jenkins logs, Jenkins UI: CJE Agent Provisioning)

  3. Palace creates a task in Mesos (Mesos UI: Palace task logs)

  4. Mesos creates the docker agent container in a worker of type build (Mesos UI: agent task logs, Worker logs)

  5. Docker agent container connects through JNLP with the corresponding Master (Mesos UI: agent task logs) (Remoting)

  6. When the build finishes, the docker container finishes (Mesos UI: agent task logs)

  7. Mesos kills the palace task (Mesos UI: agent task logs)

cje troubleshooting build agent architecture

Diagnostic Sequence

The following table lists the steps that provision a Build Agent, specifying if the step is usually a source of issues, the component where the event occurs, the location and the log paths.

Steps marked below as a frequent source of issues correspond to the components marked with a star in the image below.

Event: MM requests an agent from Palace Cloud
Component: Master → Palace Cloud plugin
Location: Worker: master
Logs path: Enable logging at plugin level:

  • com.cloudbees.tiger.plugins.palace

under Manage Jenkins → System Logs.

We could also check the Build Console logs:

Started by user admin
[Pipeline] echo
Build agent provisioning use case
[Pipeline] echo
Requesting node...
[Pipeline] node
Still waiting to schedule task
3ef2350e is offline
Agent 3ef2350e is provisioned from template Operations Center Shared Templates » maven-jdk-8
Agent specification[maven-jdk-8] : cpus=0.1, mem=2048, jvmMem=512, dockerImage=maven:3.5-jdk-8, containerProperties=[--volumes-from certs], constraints=
Running on 3ef2350e in /jenkins/workspace/support/cje-build-use-case

When the agent is finally provisioned, its ephemeral id is shown. In this case, agent 3ef2350e has been provisioned using the default docker template.

Event: Requests redirection
Component: Router
Location: Controller: primary
Logs path:

  • /var/log/nginx/

    • tail /var/log/router/*.log

    • tail /var/log/router/error.log

    • cat /etc/nginx/conf.d/*

Access to logs

Event: Task schedule
Component: Marathon
Location: Controller: primary
Logs path: /var/log/syslog (Debian distribution), /var/log/messages (Red Hat distribution)

Access to logs

Event: Run the task (frequent source of issues)
Component: Mesos Master
Location: Controller: primary
Logs path: /var/log/syslog (Debian distribution), /var/log/messages (Red Hat distribution)

Caution
There must be a worker of type 'build' with enough resources (CPU + Memory) to accept the task received from Marathon. Use cje run list-resources to check this.

Access to logs

If the Build creation task appears in both Mesos and Marathon, the troubleshooting can start from here.

Event: Sandbox creation (frequent source of issues)
Component: Mesos agent
Location: Worker: build
Logs path: /var/log/syslog (Debian distribution), /var/log/messages (Red Hat distribution)

Access to logs

Event: Build container starts (frequent source of issues)
Component: Mesos agent
Location: Worker: build
Logs path: /var/lib/mesos/slaves/…/runs/latest/stderr

Access to logs

The image below represents the diagnostic troubleshooting sequence. The components marked with a star represent the most problematic ones in the Build Agent provisioning process.

cje troubleshooting build agent diagnosis sequence

Simple Diagnosis Analysis

The first thing to do is to match the build which is not working with the corresponding Palace task. Search the Build Console Output for a line similar to Agent 3ef2350e is provisioned from template Operations Center Shared Templates » maven-jdk-8.

Started by user admin
[Pipeline] echo
Build agent provisioning use case
[Pipeline] echo
Requesting node...
[Pipeline] node
Still waiting to schedule task
3ef2350e is offline
Agent 3ef2350e is provisioned from template Operations Center Shared Templates » maven-jdk-8
Agent specification[maven-jdk-8] : cpus=0.1, mem=2048, jvmMem=512, dockerImage=maven:3.5-jdk-8, containerProperties=[--volumes-from certs], constraints=
Running on 3ef2350e in /jenkins/workspace/support/cje-build-use-case

The CJE Agent Provisioning section, at the bottom left of the main Managed Master dashboard, shows the corresponding Palace Task.

cje troubleshooting build agent cje agent provisioning panel

Click the link to see the Palace tasks and find the task that contains <MASTER_ID>.<AGENT>; in our case it is master-1.3ef2350e.

cje troubleshooting build agent cje agent provisioning

The Error output section provides hints about the failure.

cje troubleshooting build agent cje agent provisioning error
Tip
If you prefer to gather this information from the command line, find the build worker for the build agent and read the tenant logs for <MASTER_ID>.<AGENT>. The path is usually /var/lib/mesos/slaves/400c68cb-827b-4c9d-a0f0-6bdad1e94053-S0/frameworks/400c68cb-827b-4c9d-a0f0-6bdad1e94053-0000/executors/master-1.7c7d2b1e:b5092e83-9537-243-9679-33a0513db9d4/runs/c19722fd-60c6-49c2-8f1b-9a54608d2aa5. See how to get the tenant logs for more information.
Warning
If the last line of the Error output section is I0222 06:58:55.485720 8159 fetcher.cpp:456] Fetched 'file:///root/docker.tar.gz' to '/var/lib/mesos/slaves/400c68cb-827b-4c9d-a0f0-6bdad1e94053-S0/docker/links/903884d6-7fc3-45c1-b087-977ef64f6c2e/docker.tar.gz' then the issue is clearly that the image cannot be retrieved from the Internet, the Docker Private Registry, or whatever registry is used.
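A hedged sketch for locating those tenant logs from the command line once connected to the build worker (the path pattern follows the example in the Tip above):

# On the build worker: locate the executor sandbox for agent <MASTER_ID>.<AGENT>
$ sudo find /var/lib/mesos/slaves -type d -path "*executors/<MASTER_ID>.<AGENT>*" 2>/dev/null
# Then tail its stderr
$ sudo tail -F <EXECUTOR_DIR>/runs/latest/stderr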

Advanced Diagnosis Analysis

The advanced diagnosis procedure checks each element in the diagnostic sequence to ensure that everything worked as expected.

Confirm that the Mesos task for the Build Agent provisioning exists; if it does, we can skip most of the checks we would otherwise need to do at log level.

Most Common Issues

In this section we will review the most common issues that happen during Build Agent provisioning.

Not Enough Memory for the Build to Work

This problem usually happens because the memory assigned to the Agent is insufficient for the build. This might happen if you assign insufficient memory to the container or if the build requires an unexpected amount of memory.

If this happens, we should see a stack trace like the one below in the syslogs of the corresponding build worker, where the kernel kills the Docker container (oom_kill_process).

Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983832] CPU: 0 PID: 1007 Comm: git-remote-http Not tainted 3.13.0-132-generic #181-Ubuntu
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983834] Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983836]  0000000000000000 ffff88007e87fc50 ffffffff8172d909 ffff8803d83e3000
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983840]  ffff8803f7f3f800 ffff88007e87fcd8 ffffffff81727ea8 0000000000000000
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983843]  0000000000000003 0000000000000046 ffff88007e87fca8 ffffffff81156937
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983847] Call Trace:
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983855]  [<ffffffff8172d909>] dump_stack+0x64/0x82
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983859]  [<ffffffff81727ea8>] dump_header+0x7f/0x1f1
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983864]  [<ffffffff81156937>] ? find_lock_task_mm+0x47/0xa0
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983867]  [<ffffffff81156db1>] oom_kill_process+0x201/0x360
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983872]  [<ffffffff812de0e5>] ? security_capable_noaudit+0x15/0x20
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983876]  [<ffffffff811b921c>] mem_cgroup_oom_synchronize+0x51c/0x560
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983879]  [<ffffffff811b8750>] ? mem_cgroup_charge_common+0xa0/0xa0
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983882]  [<ffffffff81157594>] pagefault_out_of_memory+0x14/0x80
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983887]  [<ffffffff817264d5>] mm_fault_error+0x67/0x140
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983890]  [<ffffffff81739da2>] __do_page_fault+0x4a2/0x560
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983894]  [<ffffffff81183bed>] ? vma_merge+0x23d/0x340
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983897]  [<ffffffff81184d44>] ? do_brk+0x1d4/0x360
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983899]  [<ffffffff81739e7a>] do_page_fault+0x1a/0x70
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983903]  [<ffffffff817361a8>] page_fault+0x28/0x30
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983905] Task in /docker/e0aad9567ff04e68a8effaab4dd30e3a80a2dfa22af4c40ce03c267cb3cd315d killed as a result of limit of /docker/e0aad9567ff04e68a8effaab4dd30e3a80a2dfa22af4c40ce03c267cb3cd315d
...
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040534.983984] Memory cgroup out of memory: Kill process 1006 (java) score 1007 or sacrifice child
Nov 13 13:14:00 ip-10-XX-7-XX kernel: [3040535.002600] Killed process 1003 (git) total-vm:15600kB, anon-rss:188kB, file-rss:888kB

Increase the overall container memory at the Build Agent Template level, under Memory and JVM Memory. No build agent template should have less than 2048MB of Memory and 512MB of JVM Memory.

Build Agent Memory
Build Agent Docker Image Cannot be Downloaded

The tenant logs do not tell you what is wrong; you must look at the Mesos agent logs, since the image is downloaded by this process.

I0130 18:00:54.508648 30457 fetcher.cpp:424] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/400c68cb-827b-4c9d-a0f0-6bdad1e94053-S6","items":[{"action":"BYPASS_CACHE","uri":{"cache":false,"executable":false,"extract":true,"value":"file:\/\/\/root\/docker.tar.gz"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/400c68cb-827b-4c9d-a0f0-6bdad1e94053-S6\/frameworks\/574e9852-d73c-4f8c-9958-892fd4838bb6-0000\/executors\/masters_mm-image-does-not-exist.83fc635f-05e7-11e8-80c9-3a2e409cf189\/runs\/209e728c-7bc7-49a6-8cdc-1b63168af322"}
I0130 18:00:54.510223 30457 fetcher.cpp:379] Fetching URI 'file:///root/docker.tar.gz'
I0130 18:00:54.510236 30457 fetcher.cpp:250] Fetching directly into the sandbox directory
I0130 18:00:54.510248 30457 fetcher.cpp:187] Fetching URI 'file:///root/docker.tar.gz'
I0130 18:00:54.510262 30457 fetcher.cpp:167] Copying resource with command:cp '/root/docker.tar.gz' '/var/lib/mesos/slaves/400c68cb-827b-4c9d-a0f0-6bdad1e94053-S6/frameworks/574e9852-d73c-4f8c-9958-892fd4838bb6-0000/executors/masters_mm-image-does-not-exist.83fc635f-05e7-11e8-80c9-3a2e409cf189/runs/209e728c-7bc7-49a6-8cdc-1b63168af322/docker.tar.gz'
I0130 18:00:54.512020 30457 fetcher.cpp:84] Extracting with command: tar -C '/var/lib/mesos/slaves/400c68cb-827b-4c9d-a0f0-6bdad1e94053-S6/frameworks/574e9852-d73c-4f8c-9958-892fd4838bb6-0000/executors/masters_mm-image-does-not-exist.83fc635f-05e7-11e8-80c9-3a2e409cf189/runs/209e728c-7bc7-49a6-8cdc-1b63168af322' -xf '/var/lib/mesos/slaves/400c68cb-827b-4c9d-a0f0-6bdad1e94053-S6/frameworks/574e9852-d73c-4f8c-9958-892fd4838bb6-0000/executors/masters_mm-image-does-not-exist.83fc635f-05e7-11e8-80c9-3a2e409cf189/runs/209e728c-7bc7-49a6-8cdc-1b63168af322/docker.tar.gz'
I0130 18:00:54.514690 30457 fetcher.cpp:92] Extracted '/var/lib/mesos/slaves/400c68cb-827b-4c9d-a0f0-6bdad1e94053-S6/frameworks/574e9852-d73c-4f8c-9958-892fd4838bb6-0000/executors/masters_mm-image-does-not-exist.83fc635f-05e7-11e8-80c9-3a2e409cf189/runs/209e728c-7bc7-49a6-8cdc-1b63168af322/docker.tar.gz' into '/var/lib/mesos/slaves/400c68cb-827b-4c9d-a0f0-6bdad1e94053-S6/frameworks/574e9852-d73c-4f8c-9958-892fd4838bb6-0000/executors/masters_mm-image-does-not-exist.83fc635f-05e7-11e8-80c9-3a2e409cf189/runs/209e728c-7bc7-49a6-8cdc-1b63168af322'
I0130 18:00:54.514710 30457 fetcher.cpp:456] Fetched 'file:///root/docker.tar.gz' to '/var/lib/mesos/slaves/400c68cb-827b-4c9d-a0f0-6bdad1e94053-S6/frameworks/574e9852-d73c-4f8c-9958-892fd4838bb6-0000/executors/masters_mm-image-does-not-exist.83fc635f-05e7-11e8-80c9-3a2e409cf189/runs/209e728c-7bc7-49a6-8cdc-1b63168af322/docker.tar.gz'

In the Mesos agent logs, we can see the issue exposed:

E0130 18:00:54.801223  5925 slave.cpp:3758] Container '209e728c-7bc7-49a6-8cdc-1b63168af322' for executor 'masters_mm-image-does-not-exist.83fc635f-05e7-11e8-80c9-3a2e409cf189' of framework 574e9852-d73c-4f8c-9958-892fd4838bb6-0000 failed to start: Failed to 'docker -H unix:///var/run/docker.sock pull cloudbees/cje-mm:2.32.1.1': exit status = exited with status 1 stderr = Error response from daemon: manifest for cloudbees/cje-mm:2.32.1.1 not found
W0130 18:00:54.801731  5929 docker.cpp:1302] Ignoring updating unknown container: 209e728c-7bc7-49a6-8cdc-1b63168af322
E0130 18:01:04.944007  5929 slave.cpp:3758] Container '2e0ebac2-99c4-446b-8500-99541a76e47a' for executor 'masters_mm-image-does-not-exist.89f837d1-05e7-11e8-80c9-3a2e409cf189' of framework 574e9852-d73c-4f8c-9958-892fd4838bb6-0000 failed to start: Failed to 'docker -H unix:///var/run/docker.sock pull cloudbees/cje-mm:2.32.1.1': exit status = exited with status 1 stderr = Error response from daemon: manifest for cloudbees/cje-mm:2.32.1.1 not found
W0130 18:01:04.944435  5923 docker.cpp:1302] Ignoring updating unknown container: 2e0ebac2-99c4-446b-8500-99541a76e47a
E0130 18:01:09.864475  5927 slave.cpp:3758] Container '9b09c95b-bf95-434a-a571-1cbaed1146db' for executor 'masters_mm-image-does-not-exist.8cf63592-05e7-11e8-80c9-3a2e409cf189' of framework 574e9852-d73c-4f8c-9958-892fd4838bb6-0000 failed to start: Failed to 'docker -H unix:///var/run/docker.sock pull cloudbees/cje-mm:2.32.1.1': exit status = exited with status 1 stderr = Error response from daemon: manifest for cloudbees/cje-mm:2.32.1.1 not found
W0130 18:01:09.864888  5925 docker.cpp:1302] Ignoring updating unknown container: 9b09c95b-bf95-434a-a571-1cbaed1146db
E0130 18:01:20.007645  5923 slave.cpp:3758] Container '88856388-52fe-4c3e-86e4-f5f5ebfc2618' for executor 'masters_mm-image-does-not-exist.92f25824-05e7-11e8-80c9-3a2e409cf189' of framework 574e9852-d73c-4f8c-9958-892fd4838bb6-0000 failed to start: Failed to 'docker -H unix:///var/run/docker.sock pull cloudbees/cje-mm:2.32.1.1': exit status = exited with status 1 stderr = Error response from daemon: manifest for cloudbees/cje-mm:2.32.1.1 not found
Build Agent Docker Image

If the failed build is happening with a customized build agent that has never worked before, the issue is likely that the image was not correctly created. As a first step, always check that new customized images work with simple FreeStyle or Pipeline jobs. To troubleshoot this problem, follow the simple diagnosis section, which should expose the problem.

Palace Service does not Start

The palace logs show something similar to the trace shown below:

Oct 29 17:10:59 ip-10-237-50-191 075b79dc6adf[10090]: [ERROR] 2018-10-29 17:10:59,447 [main]  com.cloudbees.utils.MesosClient getMesosData - Failure to get mesos data [/metrics/snapshot]
Oct 29 17:10:59 ip-10-237-50-191 075b79dc6adf[10090]: javax.ws.rs.ProcessingException: javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake
...
...
Oct 29 17:10:59 ip-10-237-50-191 075b79dc6adf[10090]:   ... 16 more
Oct 29 17:10:59 ip-10-237-50-191 075b79dc6adf[10090]: Caused by: java.io.EOFException: SSL peer shut down incorrectly

These traces are commonly seen when the Palace service cannot reach the expected Mesos endpoint /metrics/snapshot. When the Mesos endpoint cannot be reached, the service will not start. This issue is commonly related to security policy changes in your load balancer. This check is performed when the Palace service is starting, so any changes to your security settings will affect Palace on its next restart.

To get additional details about the error specifics, enable the following startup parameter in the Palace service: javax.net.debug=all.
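A hedged connectivity check from the worker running Palace, hitting the same Mesos endpoint the service uses (the URL is a placeholder; adjust it to your Mesos endpoint):

# Reproduce the TLS handshake outside of Palace
$ curl -v https://<MESOS_ENDPOINT>/metrics/snapshot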

Upgrades

The following section provides some guidance for troubleshooting CloudBees Jenkins Enterprise upgrades. The most common issues are covered in the Most Common Issues section below.

The CloudBees Jenkins Enterprise upgrade consists of 4 main phases:

  1. Reinitialization of the CloudBees Jenkins Enterprise workspace with the new version of the CLI

  2. Execution of upgrade scripts

  3. Terraform Refresh

  4. Upgrade of the Marathon applications (Operations Center and commonly Castle)

Most issues happen in the last phase, when Marathon re-deploys the applications that need to be upgraded. Operations Center is almost always restarted since the CLI release is in line with the Operations Center release. Other applications that may be upgraded are Palace, Castle and Elasticsearch.

If the upgrade reaches the point where there is a Marathon deployment and a Mesos task for Operations Center and this Mesos task shows that the Jenkins process is started, then this is a Jenkins issue. Otherwise, this is a CloudBees Jenkins Enterprise issue.

Note
The CloudBees Jenkins Enterprise upgrade often fails due to problems that were already present in the cluster but that were either not detected or not dealt with, for example Operations Center credentials that were not updated, a faulty ZooKeeper node, or a Castle container that is no longer running on one of the master workers. While these problems have little or no direct impact on day-to-day operations in Jenkins, they are a source of issues for a few CLI operations such as an upgrade.
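A hedged pre-upgrade sanity check, reusing commands documented earlier in this guide:

# One castle.jce application per master worker?
$ cje run list-applications
# Enough free CPU / memory on the master workers?
$ cje run list-resources
# Are the cluster endpoints reachable?
$ cje run display-outputs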

Diagnostic Sequence

The following table lists the steps of a CloudBees Jenkins Enterprise Upgrade, specifying whether the step is usually a source of issues, the component where the event occurs, the location and the log paths.

Steps marked below as a frequent source of issues correspond to the components marked with a star in the image below.

Event: Upgrade Scripts
Component: Worker/Controller
Location: Bastion
Logs path: .dna/logs/*-upgrade/*

If the upgrade scripts were correctly executed, the troubleshooting can start from here.

Event: Waiting for Marathon
Component: Zookeeper
Location: Controller: primary
Logs path:

  • /var/log/syslog (Debian distribution), /var/log/messages (Red Hat distribution)

  • /var/log/zookeeper/zookeeper.log

Access to logs

Event: Task Schedule
Component: Marathon
Location: Controller: primary
Logs path: /var/log/syslog (Debian distribution), /var/log/messages (Red Hat distribution)

Event: Run the task (frequent source of issues)
Component: Mesos Master
Location: Controller: primary
Logs path: /var/log/syslog (Debian distribution), /var/log/messages (Red Hat distribution)

If the Marathon applications being upgraded appear in both Mesos and Marathon, the troubleshooting can start from here.

Event: Sandbox creation
Component: Mesos agent
Location: Worker: master
Logs path: /var/log/syslog (Debian distribution), /var/log/messages (Red Hat distribution)

Access to logs

Event: Volume Provisioning (frequent source of issues)
Component: Castle
Location: Worker: master
Logs path:

  • /var/log/syslog (Debian distribution), /var/log/messages (Red Hat distribution)

  • /var/lib/mesos/slaves/…/runs/latest/stderr

Event: Container starts (frequent source of issues)
Component: Tenant
Location: Worker: master
Logs path: /var/lib/mesos/slaves/…/runs/latest/stderr

The image below represents the diagnostic troubleshooting sequence. The components marked with a star represent the most problematic ones in the Upgrade process.

cje troubleshooting upgrades diagnosis sequence

Most Common Issues

Castle Not Running

The CJE cluster requires one Castle container to be running per master worker. If that condition is not met, the upgrade fails.

The health of the Castle system can be checked from the Marathon UI, in the jce folder where the Castle application lives. We should see one Castle application for each worker of type master.

Check which worker is missing castle with the command cje run list-applications. The following output shows that castle is missing on worker-2:

$ cje run list-applications
castle.jce : worker-1
elasticsearch.jce : worker-2
palace.jce : worker-1
master-1.masters : worker-1
master-2.masters : worker-2
cjoc.jce : worker-2

There could be different reasons why castle is not running on a specific worker (lack of resources, disk space, …​). Check the castle logs on the worker to see what caused castle to stop. The logs are located in the Mesos task logs under /var/lib/mesos/slaves/<SLAVE_ID>/frameworks/<FRAMEWORK_ID>/executors/jce_castle.<APP_ID>/runs/latest/stderr:

// 1. Connect to the worker
$ dna connect worker-2
// 2. Print the castle logs
$ tail -F /var/lib/mesos/slaves/f1604681-25c1-44b9-a809-cf3419e869d5-S2/frameworks/f1604681-25c1-44b9-a809-cf3419e869d5-0000/executors/jce_castle.0fd2f05c-1d0e-11e8-9213-26bc65ea473c/runs/latest/stderr

Then restart castle:

$ dna stop castle
$ dna start castle

Then check the health checks of the jce > castle applications in Marathon.

Wrong Credentials

When the Operations Center credentials are incorrect, the upgrade log is explicit:

[cjoc] CJOC Login failed (HTTP Status 401).
[cjoc] Looking up initial admin password
[cjoc] Unable to look up initial password. Maybe CJOC is already initialized.
[cjoc] 02:51:22 Failure (255) Retrying in 5 seconds..
[cjoc] Unable to look up initial password. Maybe CJOC is already initialized.
[cjoc] 02:51:29 Failure (255) Retrying in 5 seconds..
[cjoc] Unable to look up initial password. Maybe CJOC is already initialized.
[cjoc] 02:51:36 Unable to log in to CJOC after 3 attempts. (cjoc-lookup-initial-credentials)
[cjoc]
[cjoc] To update CJOC credentials, use the 'credentials-update' operation.
[cjoc]
[cjoc] For more information about the errors and possible solutions, please read the following articles:
[cjoc] 1. https://support.cloudbees.com/hc/en-us/articles/223406287-I-changed-the-security-realm-and-since-then-bees-pse-operations-are-failing
[cjoc] 2. https://support.cloudbees.com/hc/en-us/articles/229298168-How-to-disable-CJOC-authorization-in-PSE
An error occurred during cjoc initialization (255) - see /Users/jenkins/Projects/pse/cje-example-01/.dna/logs/last/cjoc

This issue often happens when upgrading for the first time. The cluster initializes with a default admin user whose credentials are stored in the CloudBees Jenkins Enterprise workspace ($CJE_PROJECT_HOME/.dna/secrets).

After setting a Security Realm in Operations Center, you must update the local configuration with the new credentials. If the Operations Center credentials stored in the CloudBees Jenkins Enterprise workspace are wrong, the upgrade fails because the CLI does not have administrative access to CJOC.

The solution is to update the Operations Center credentials in the file $CJE_PROJECT_HOME/.dna/secrets with the credentials of a Jenkins administrator. The API token can be used as the password.
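
Before launching the upgrade again, it can be worth verifying that the stored credentials actually grant administrative access. A minimal check, where <CJOC_URL> is the Operations Center URL reported by cje run display-outputs and the username/API token are the values placed in $CJE_PROJECT_HOME/.dna/secrets:

# An HTTP 200 response means the CLI will be able to authenticate;
# a 401 means the credentials still do not match a Jenkins administrator
$ curl -k -fsSL -u <cjoc_username>:<cjoc_api_token> "<CJOC_URL>/api/json" > /dev/null && echo "credentials OK"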

Cluster Resources

A lack of resources can prevent Marathon from deploying applications, commonly Operations Center and Castle. If a Marathon task is never deployed, this could be a resource problem. In such a scenario, Marathon tries to deploy the application over and over, but the task is never accepted or processed by Mesos.

Following is an example of upgrade logs when Operations Center cannot be deployed:

[cjoc] Reconfiguring Marathon app /jce/cjoc at http://marathon.cje-aburdajewicz-01-controlle-327711499.ap-southeast-2.elb.amazonaws.com.elb.cloudbees.net/v2/apps/jce/cjoc?force=true
[cjoc] Marathon ok: {"version":"2018-02-22T05:29:19.546Z","deploymentId":"0416d8e1-a285-407b-8350-800a372cc6e4"}
[cjoc] /jce/cjoc: {instances=1, staged=0, running=0, healthy=0, unhealthy=0}
[cjoc] 05:29:19 Failure (1) Retrying in 10 seconds...
[cjoc] /jce/cjoc: {instances=1, staged=0, running=0, healthy=0, unhealthy=0}
[cjoc] 05:29:29 Failure (1) Retrying in 10 seconds...
[cjoc] /jce/cjoc: {instances=1, staged=0, running=0, healthy=0, unhealthy=0}
[cjoc] 05:29:39 Failure (1) Retrying in 10 seconds...

Check that there is a Marathon deployment for Operations Center. If there is no associated Mesos task (you see "No Tasks Running" even after a while), check the available cluster resources.

cje troubleshooting upgrades marathon deploy no tasks

Check the resources with cje run list-resources. We can see in the following output that the master workers are both at full capacity, leaving no space for Operations Center:

$ cje run list-resources
name       type       cpus                      mem
worker-1   master     0.5/4.0 (12.50 %)         14860.0/14861.0 (99.99 %)
worker-3   build      0.0/4.0 (0.00 %)          0.0/14861.0 (0.00 %)
worker-2   master     0.5/4.0 (12.50 %)         14860.0/14861.0 (99.99 %)

Then check the running applications. We can see in the following output that only Operations Center (cjoc.jce) is missing:

$ cje run list-applications
castle.jce : worker-1
castle.jce : worker-2
elasticsearch.jce : worker-2
palace.jce : worker-1
master-1.masters : worker-1
master-2.masters : worker-2

Check the application configuration in $CJE_PROJECT_HOME/.dna/project.config. In the excerpt below, Operations Center requires 2048 Mb of memory.

[...]
castle_mem =
castle_cpus =
[...]
elasticsearch_max_instance_count = 1
elasticsearch_mem = 2048
elasticsearch_cpus = 0.1
[...]
cjoc_mem = 2048
cjoc_cpus = 0.2
[...]

Use this information to ensure that there are enough resources to deploy all the applications:

  • castle requires one 768 Mb / 0.1 CPU container PER master worker

  • palace requires one 1024 Mb / 0.2 CPU container in ONE master worker

  • cjoc requirements in terms of memory / cpu can be found in project.config

  • elasticsearch requirements in terms of memory / cpu can be found in project.config. We recommend running elasticsearch on dedicated elasticsearch workers.
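
As a rough sanity check, these figures can be added up per master worker; a sketch using the values above:

# Show the configured cjoc and elasticsearch sizes from project.config
grep -E '^(cjoc|elasticsearch)_(mem|cpus)' $CJE_PROJECT_HOME/.dna/project.config
# Every master worker needs room for castle (768 Mb); the master worker where they are
# scheduled also needs room for palace (1024 Mb) and cjoc (2048 Mb in this example):
#   768 + 1024 + 2048 = 3840 Mb of unallocated memory reported by `cje run list-resources`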

In such cases, the solution would be to:

  • Free resources by decreasing the memory allocated to some components (commonly Managed Masters)

  • Add resources with the command cje prepare worker-add

Zookeeper Failures

If a Zookeeper node is faulty, Marathon and Mesos may not operate correctly or even be unresponsive. This leads to upgrade failures.

For example, the upgrade logs might show that Marathon is unreachable:

Waiting for Marathon at http://marathon.cje-aburdajewicz-01-controlle-327711499.ap-southeast-2.elb.amazonaws.com.elb.cloudbees.net
Timeout waiting for http://marathon.cje-aburdajewicz-01-controlle-327711499.ap-southeast-2.elb.amazonaws.com.elb.cloudbees.net
There were one or more errors

Check the health check alerts in Operations Center. If zookeeper is down on one controller, you should see an alert:

cje troubleshooting upgrades cjoc alerts

From one worker, check that zookeeper (port 2181) is reachable for each controller:

// 1. Connect to the worker
$ dna connect worker-1
// 2. curl the controller (replace $CONTROLLER_IP with the controller internal IP)
$ curl -Iv $CONTROLLER_IP:2181

If zookeeper is reachable, the command would display an output similar to:

$ curl -Iv $CONTROLLER_IP:2181
* About to connect() to $CONTROLLER_IP port 2181 (#0)
*   Trying $CONTROLLER_IP...
* Connected to $CONTROLLER_IP ($CONTROLLER_IP) port 2181 (#0)
> HEAD / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: $CONTROLLER_IP:2181
> Accept: */*
>
* Empty reply from server
* Connection #0 to host $CONTROLLER_IP left intact
curl: (52) Empty reply from server

In a case where Zookeeper is down, the command would display an output similar to:

$ curl -Iv $CONTROLLER_IP:2181
* About to connect() to $CONTROLLER_IP port 2181 (#0)
*   Trying $CONTROLLER_IP...
* Connection refused
* Failed connect to $CONTROLLER_IP:2181; Connection refused
* Closing connection 0
curl: (7) Failed connect to $CONTROLLER_IP:2181; Connection refused

Check the Zookeeper logs in /var/log/zookeeper/zookeeper.log on the faulty controller to see what’s wrong.

// 1. Connect to the controller
$ dna connect controller-1
// 2. Print the zookeeper logs
$ tail -F /var/log/zookeeper/zookeeper.log

The log may show a corrupted database:

 2017-11-16 09:13:37,769 - ERROR [main:QuorumPeer@453] - Unable to load database on disk
 java.io.IOException: Failed to process transaction type: 1 error: KeeperErrorCode = NoNode for /marathon/leader

In that case, restart Zookeeper and force it to recreate the database:

$ dna connect controller-1
$ sudo service zookeeper stop
$ sudo mv /var/lib/zookeeper/version-2 /var/lib/zookeeper/version-2-BAD
$ sudo service zookeeper start
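
To confirm that the restarted node is serving again, you can query its status from the same controller (nc must be installed); a quick check:

$ echo "srvr" | nc localhost 2181
# Expect a "Mode: follower" or "Mode: leader" line once the node has rejoined the ensemble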

EBS Volume Provisioning

In a Multi Availability Zone cluster, EBS provisioning can fail due to a race condition. EBS snapshots and volumes are not visible across all Availability Zones, so when Operations Center is deployed to a different availability zone, EBS snapshots must be copied to provision the EBS volume, which can take time. If the provisioning does not finish in time, the deployment of Operations Center fails with a timeout.

In such cases, the Operations Center task in Mesos restarts repeatedly after running only for a short time.

To troubleshoot this, check the castle logs:

tail -100 /var/lib/mesos/slaves/98f1be93-b182-471f-b7e5-b1794ea9a8a1-S1/frameworks/98f1be93-b182-471f-b7e5-b1794ea9a8a1-0000/executors/jce_castle.58ed9141-1ce6-11e8-9add-fe8a8bdb5dd7/runs/latest/stderr

If provisioning works, the following output appears in the castle logs. Otherwise, the logs should expose the issue:

Mar 01, 2018 5:05:54 AM com.cloudbees.dac.castle.EbsBackend doProvision
INFO: EBS provisioning for cjoc [a9532a6c0405d00ffb8486c8b06c370dc112bc7a0b55361fd6f3f219ff952ed1]
[...]
Mar 01, 2018 5:05:59 AM com.cloudbees.dac.castle.EbsBackend doProvision
INFO: EBS provisioning finished for cjoc [a9532a6c0405d00ffb8486c8b06c370dc112bc7a0b55361fd6f3f219ff952ed1] in 4 seconds

Check the Operations Center logs:

$ tail -F /var/lib/mesos/slaves/98f1be93-b182-471f-b7e5-b1794ea9a8a1-S1/frameworks/98f1be93-b182-471f-b7e5-b1794ea9a8a1-0000/executors/jce_cjoc.d21dbabd-1cfc-11e8-9c22-6e78e1ea6fe9/runs/latest/stderr

If provisioning works, here is the output that should start to appear in Operations Center logs:

[...]
++ exec java -Djenkins.model.Jenkins.slaveAgentPortEnforce=true -Djenkins.model.Jenkins.slaveAgentPort=31900 -Dorg.jenkinsci.main.modules.sshd.SSHD.hostName=cje.troubleshooting.com -Dhudson.TcpSlaveAgentListener.hostName=cje.troubleshooting.com -Duser.home=/var/jenkins_home -Xmx1711m -Xms1024m -Xmx2048m -XshowSettings:vm -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -XX:+UseStringDeduplication -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=40m -Xloggc:/var/jenkins_home/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC -XX:+PrintGCCause -XX:+PrintTenuringDistribution -XX:+PrintReferenceGC -XX:+PrintAdaptiveSizePolicy '-Dhttp.nonProxyHosts=127.*|localhost|127.0.0.1' -Dcb.IMProp.warProfiles=mesos.json -Dcb.IMProp.warProfiles.cje=mesos.json -Dcom.cloudbees.jce.masterprovisioning.DockerImageDefinitionConfiguration.disableAutoConfiguration=true -Dcom.cloudbees.opscenter.analytics.reporter.JocAnalyticsReporter.PERIOD=120 -Dcom.cloudbees.opscenter.analytics.reporter.metrics.AperiodicMetricSubmitter.PERIOD=120 -Dcom.cloudbees.opscenter.analytics.FeederConfiguration.PERIOD=120 -jar '-Dcb.distributable.name=Docker Common CJE' -Dcb.distributable.commit_sha=50a2e662fbc2be1332bd34af3121fd83603f1358 /usr/share/jenkins/jenkins.war --httpPort=31899 --webroot=/tmp/jenkins/war --pluginroot=/tmp/jenkins/plugins --prefix=/cjoc/
VM settings:
    Min. Heap Size: 1.00G
    Max. Heap Size: 2.00G
    Ergonomics Machine Class: server
    Using VM: OpenJDK 64-Bit Server VM

Mar 01, 2018 5:05:59 AM Main deleteWinstoneTempContents
WARNING: Failed to delete the temporary Winstone file /tmp/winstone/jenkins.war
Mar 01, 2018 5:06:00 AM org.eclipse.jetty.util.log.Log initialized
INFO: Logging initialized @943ms to org.eclipse.jetty.util.log.JavaUtilLog
Mar 01, 2018 5:06:00 AM winstone.Logger logInternal
INFO: Beginning extraction from war file

When the EBS provisioning is stuck, the logs show that Operations Center can connect to castle but the Jenkins process (the exec java …​ line) never starts:

2018-03-01T05:05:43Z Connecting to Castle (1/36)
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 31080 (#0)
> POST /provision HTTP/1.1
> Host: localhost:31080
> User-Agent: curl/7.56.1
> Accept: */*
> Content-type: application/json
> Content-Length: 570
[...]

A workaround is to increase the startup timeout (also known as grace period) for Operations Center.

Locate the following line in $CJE_PROJECT_HOME/.dna/project.config:

[cjoc]
marathon_timeout = 900

Change the marathon_timeout to a bigger number - for example 3600 (1 hour). Then launch the upgrade again:

cje upgrade
Note
Since CloudBees Jenkins Enterprise 1.11.3, Marathon has full control over how a castle task is stopped. This makes EBS provisioning more robust in a Multi Availability Zone cluster.

Red Hat Enterprise Linux Subscription

Clusters that are configured with rhel images require that the worker and controller instances be registered using the Red Hat Subscription Manager. This is needed to be able to install packages from yum on a Red Hat Enterprise Linux instance.

If rhel instances are not registered, the upgrade could fail when executing upgrade scripts on workers and controllers.

Following is an example of an issue when upgrading from 1.6.2 to a more recent version:

> Install lsof package on workers
[...]
[worker-1] https://rhui2-cds02.ap-southeast-2.aws.ce.redhat.com/pulp/repos//rhui-client-config/rhel/server/7/x86_64/os/repodata/repomd.xml: [Errno 14] HTTPS Error 401 - Unauthorized
[worker-1] Trying other mirror.
[worker-1] https://rhui2-cds01.ap-southeast-2.aws.ce.redhat.com/pulp/repos//rhui-client-config/rhel/server/7/x86_64/os/repodata/repomd.xml: [Errno 14] HTTPS Error 401 - Unauthorized
[worker-1] Trying other mirror.
[worker-1] https://rhui2-cds02.ap-southeast-2.aws.ce.redhat.com/pulp/repos//content/dist/rhel/rhui/server/7/7Server/x86_64/os/repodata/repomd.xml: [Errno 14] HTTPS Error 401 - Unauthorized
[worker-1] Trying other mirror.
[worker-1] https://rhui2-cds01.ap-southeast-2.aws.ce.redhat.com/pulp/repos//content/dist/rhel/rhui/server/7/7Server/x86_64/os/repodata/repomd.xml: [Errno 14] HTTPS Error 401 - Unauthorized
[worker-1] Trying other mirror.
[worker-1] https://rhui2-cds02.ap-southeast-2.aws.ce.redhat.com/pulp/repos//content/dist/rhel/rhui/server/7/7Server/x86_64/rh-common/os/repodata/repomd.xml: [Errno 14] HTTPS Error 401 - Unauthorized
[worker-1] Trying other mirror.
[worker-1] https://rhui2-cds01.ap-southeast-2.aws.ce.redhat.com/pulp/repos//content/dist/rhel/rhui/server/7/7Server/x86_64/rh-common/os/repodata/repomd.xml: [Errno 14] HTTPS Error 401 - Unauthorized
[worker-1] Trying other mirror.
[...]

The solution is to register the worker/controller instances with the subscription manager.

Since version 1.7.0, the operation cje prepare registration-update can be used to provide register / unregister scripts to execute when an instance is started / terminated.

Otherwise, a workaround is to manually register the instances:

dna connect [controller|worker]
subscription-manager register --username <rhel_username> --password <rhel_password> --auto-attach

Once registered, the upgrade succeeds.

Cluster Recovery

The following section provides some guidance to troubleshoot Cluster Recovery failures.

Note: The CloudBees Jenkins Enterprise cluster-recover operation in repair mode is an attempt to recover a cluster that is in a bad state. It is recommended to first destroy the cluster and then recover it as a destroyed cluster, as explained in Restore a CloudBees Jenkins Enterprise Cluster.

Cluster Recovery Version Support

If a cluster has been created with a version of CloudBees Jenkins Enterprise lower than 1.6.3, the cluster-recover operation fails with the following:

Reinitializing environment terraform
No servers matched. Use 'dna servers' to list all servers.
cluster-recover failed

The workaround is to edit the CJE_INSTALL_DIR/share/setup-templates/aws/templates/tiger/start file to bypass this check:

tiger-start() {
    trigger-mfa-if-needed
    #---------------------------------
    #-----we bypass this check------
    verify-cloud-resources || echo "WARNING: was not possible to verify cloud resources before 1.6.3"
    #---------------------------------
    start-shared-infrastructure
    start-servers
    start-elb
    start-route53
    check-domain
    init-storage-bucket
    init-servers
    # dna run check-network-performance tiger
    tiger-waitfor-mesos
    tiger-waitfor-marathon
    tiger-init-apps
    mark-tiger-initialized
}

Remnants of Cluster Delete

If a cluster is not fully destroyed because resources created by the cluster still exist or data (like S3) has been preserved, a full recovery fails with a terraform message explaining that the resource already exists. Here is an example with an existing S3 bucket:

[storage-bucket] Error: Error applying plan:
[storage-bucket]
[storage-bucket] 1 error(s) occurred:
[storage-bucket]
[storage-bucket] * aws_s3_bucket.pse-storage-bucket: 1 error(s) occurred:
[storage-bucket]
[storage-bucket] * aws_s3_bucket.pse-storage-bucket: Error creating S3 bucket: BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
[storage-bucket] status code: 409, request id: 0498D058694C41BA, host id: XuV6wVr/5n36bOZxYmOjLLGqEKxJHAtCHYk24CC89iyT4qOZfH32tREN/jjamezSZNQC7esAxvM=

When the failure is reported, the full recovery may have created some components already, and the state of the DNA project being restored is no longer consistent.

To address this:

  • Cancel the cluster-recover operation

  • Destroy the cluster again with cluster-destroy (this removes the components that were created during the recovery)

  • Run the cluster-recover operation in repair mode

Subsystems Not Reinitialized (Anywhere)

Workers and controllers have several subsystems installed. CJE uses marker files (/etc/.*installed) to know whether a subsystem must be reconfigured or restarted:

$ ls -la /etc/.*installed
-rw-r--r--. 1 root root  9 Feb 14 18:27 /etc/.docker-installed
-rw-r--r--. 1 root root 28 Feb 14 18:07 /etc/.java-installed
-rw-r--r--. 1 root root  9 Feb 14 18:26 /etc/.marathon-installed
-rw-r--r--. 1 root root  9 Feb 14 18:26 /etc/.mesos-installed
-rw-r--r--. 1 root root  3 Feb 14 18:06 /etc/.ntp-installed
-rw-r--r--. 1 root root  3 Feb 14 18:07 /etc/.rsyslog-installed
-rw-r--r--. 1 root root  8 Feb 14 18:27 /etc/.topbeat-installed

If the images used for workers and controllers are custom vSphere images, the marker files may already be present. In such cases, the subsystems are not reconfigured or restarted by dna init.

To address this:

  • Delete the /etc/.*installed files before running cluster-recover
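
A minimal sketch of that cleanup on one worker (repeat it on every worker and controller built from the custom image):

$ dna connect worker-1
# Remove the subsystem marker files so that dna init reconfigures and restarts each subsystem
$ sudo rm -f /etc/.*installed
$ exit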

Elasticsearch

The following section provides some guidance to troubleshoot ES failures.

Test the Connection Between Elasticsearch and Operations Center

Go to Manage Jenkins → Configure Analytics and press Test Connection.

cje troubleshooting es rd cjp analytics test connection

Notice that, if you add the Elasticsearch hostname to No Proxy Host in Operations Center, a restart is needed to apply the change.

Elasticsearch Compatibility

Review the Elasticsearch logs for errors. If the logs contain parse errors, the Elasticsearch cluster could be broken or you could be using an incorrect version of Elasticsearch. Refer to the Analytics documentation for Elasticsearch version compatibility. Check the Elasticsearch health report to confirm Elasticsearch is functioning as expected.

This exception is shown in Elasticsearch when you try to use an Elasticsearch version later than 1.7.X:

java.lang.IllegalArgumentException: Limit of total fields [1000] in index [metrics-20170419] has been exceeded
	at org.elasticsearch.index.mapper.MapperService.checkTotalFieldsLimit(MapperService.java:593) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:418) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:334) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:266) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.applyRequest(MetaDataMappingService.java:311) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.execute(MetaDataMappingService.java:230) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.cluster.service.ClusterService.executeTasks(ClusterService.java:679) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.cluster.service.ClusterService.calculateTaskOutputs(ClusterService.java:658) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:617) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1117) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:544) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) ~[elasticsearch-5.3.0.jar:5.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_65]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_65]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_65]

In the Operations Center embedded Kibana, you could see this exception if you try to connect to an Elasticsearch version later than 1.7.X:

Error: Unknown error while connecting to Elasticsearch
Error: Authorization Exception
    at respond (http://192.168.1.20:8080/plugin/operations-center-analytics-viewer/index.js?_b=7562:85289:15)
    at checkRespForFailure (http://192.168.1.20:8080/plugin/operations-center-analytics-viewer/index.js?_b=7562:85257:7)
    at http://192.168.1.20:8080/plugin/operations-center-analytics-viewer/index.js?_b=7562:83895:7
    at wrappedErrback (http://192.168.1.20:8080/plugin/operations-center-analytics-viewer/index.js?_b=7562:20902:78)
    at wrappedErrback (http://192.168.1.20:8080/plugin/operations-center-analytics-viewer/index.js?_b=7562:20902:78)
    at wrappedErrback (http://192.168.1.20:8080/plugin/operations-center-analytics-viewer/index.js?_b=7562:20902:78)
    at http://192.168.1.20:8080/plugin/operations-center-analytics-viewer/index.js?_b=7562:21035:76
    at Scope.$eval (http://192.168.1.20:8080/plugin/operations-center-analytics-viewer/index.js?_b=7562:22022:28)
    at Scope.$digest (http://192.168.1.20:8080/plugin/operations-center-analytics-viewer/index.js?_b=7562:21834:31)
    at Scope.$apply (http://192.168.1.20:8080/plugin/operations-center-analytics-viewer/index.js?_b=7562:22126:24)

Operations Center Accessing the Internet Through a Proxy

This is by far the most common issue when you have a proxy set up in Operations Center under Manage Jenkins → Manage Plugins → Advanced. In such cases, you must add the Elasticsearch hostname to the No Proxy Host section, for example domain.example.com/elasticsearch. Notice that a restart is needed each time you modify the No Proxy Host section for Analytics to pick up the change.

Instead of using No Proxy Host, you could use the -Dhttp.nonProxyHosts Java argument, for example -Dhttp.nonProxyHosts=domain.example.com/elasticsearch. Just as with No Proxy Host, a restart is needed for Analytics to take effect after the Java argument is added to Operations Center.

To test the connectivity between Elasticsearch and the Operations Center you can use:

  • The Test Connection button under Manage Jenkins → Configure Analytics

  • Execute this script from Manage Jenkins → Script Console

import jenkins.plugins.asynchttpclient.AHC
import com.ning.http.client.AsyncHttpClient
import com.ning.http.client.ListenableFuture
import com.ning.http.client.Response

AsyncHttpClient ahc = AHC.instance()
ListenableFuture<Response> response = ahc.prepareGet("http://<ELASTICSEARCH_HOSTNAME>:9200/").execute()
println(response.get().statusCode + " " + response.get().statusText)
println("---")
println(response.get().getResponseBody())

If everything is fine, you should receive an HTTP 200 response like the example below:

{
  "status" : 200,
  "name" : "Eros",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "1.7.6",
    "build_hash" : "c730b59357f8ebc555286794dcd90b3411f517c9",
    "build_timestamp" : "2016-11-18T15:21:16Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.4"
  },
  "tagline" : "You Know, for Search"
}

Restart Operations Center After the Initial Configuration

After first configuring Analytics, the Operations Center must be restarted to create the index and dashboards.

Elasticsearch Snapshots/Backups

Analytics should be configured to take snapshots/backups of your Elasticsearch data once or twice per day, but not more often, since this is a heavy-load process that may take hours to complete. Furthermore, keep seven to fourteen snapshots/backups so that data from up to a week back can be restored. Backup Interval in Minutes should be set to 1440 for one snapshot per day or 720 for two snapshots per day. Number of Backup Snapshots to Keep for Elasticsearch should be set to 7 if you take one snapshot per day (a week of snapshots) or 14 if you take two snapshots per day.

cje troubleshooting es rd cjp analytics test connection

Kibana Dashboards are Not Created

If you see the following error, it means that the required Analytics dashboards have not been created.

cje-troubleshooting-es-could_not_locate_dashboard.png

cje-troubleshooting-es-kibana4cloudbees-index-error.png

To resolve this issue, you need to restart the Operations Center, which will recreate the default indices and dashboards.

Elasticsearch Default Index Does Not Exist

If the default index is not selected, you will see the following page.

cje-troubleshooting-es-default_index_does_not_exists.png

You only have to select the time-field, click on create, and then you can go to another Analytics tab to check that the data is displayed.

Elasticsearch is Not Accessible

When it is not possible to connect to the Elasticsearch service, you will see the following type of error.

cje-troubleshooting-es-es_is_not_accessible.png

To identify the source of this problem, check the following:

  • the connectivity between Operations Center and the Elasticsearch service,

  • the health of the Elasticsearch cluster, and

  • the Jenkins proxy settings.

Recommended Cluster Size

The Elasticsearch cluster should have at least three nodes, which provides fault tolerance for up to two nodes crashing. Each node should have 16-32 GB of RAM and 50-200 GB of disk space, depending on your environment size. If you have more than 10 Masters or more than 10,000 jobs, you will need a larger Elasticsearch environment to support your load.

Retrieve Username, Password and Elasticsearch URL on CloudBees Jenkins Enterprise

If you are on CloudBees Jenkins Enterprise, you can retrieve the username, password and Elasticsearch URL with these commands:

export CJE_PROTOCOL=$(awk '/protocol/ {print $3}' .dna/project.config)
export CJE_DOMAIN=$(awk '/domain_name/ {print $3}' .dna/project.config)
export ES_USR=$(cje run echo-secrets elasticsearch_username)
export ES_PASSWD=$(cje run echo-secrets elasticsearch_password)
export DOMAIN="${CJE_PROTOCOL}://${CJE_DOMAIN}/elasticsearch"
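
To confirm that the retrieved values are correct, you can query the Elasticsearch root endpoint with them; a minimal check (assuming the cluster exposes the standard banner endpoint through the /elasticsearch route):

# A successful call returns an HTTP 200 and the cluster banner shown earlier in this section
curl -u $ES_USR:$ES_PASSWD "$DOMAIN/?pretty"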

Elasticsearch Cluster Health

Did you restart your Operations Center after first configuring Analytics? This is necessary to create the index and dashboards. If you did that, you can move on to checking the state of your cluster. Assuming you retrieved the critical information as described above, you can execute the following commands to obtain base information about the health of the Elasticsearch cluster:

curl -u $ES_USR:$ES_PASSWD "$DOMAIN/_cluster/health?pretty" > health.json
curl -u $ES_USR:$ES_PASSWD "$DOMAIN/_cat/nodes?v&h=host,ip,name,load,uptime,master,heapCurrent,heapPercent,heapMax,ramCurrent,ramPercent,ramMax,disk,fielddataMemory,queryCacheMemory" > nodes.txt
curl -u $ES_USR:$ES_PASSWD "$DOMAIN/_cat/indices?v" > indices.txt
curl -u $ES_USR:$ES_PASSWD "$DOMAIN/_cat/shards?v" > shards.txt
curl -u $ES_USR:$ES_PASSWD "$DOMAIN/_nodes/stats/os?pretty" > stats_os.json
curl -u $ES_USR:$ES_PASSWD "$DOMAIN/_nodes/stats/os,process?pretty" > stats_os_process.json
curl -u $ES_USR:$ES_PASSWD "$DOMAIN/_nodes/stats/process?pretty" > stats_process.json

health.json gives you the status of the cluster, shard status, index status, pending tasks, … For more information, see Check Cluster Health

{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 2551,
  "active_shards" : 7053,
  "relocating_shards" : 0,
  "initializing_shards" : 3,
  "unassigned_shards" : 6,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 11736,
  "number_of_in_flight_fetch" : 0
}

nodes.txt gives you the ID of the nodes, IP, name, free memory, free disk space, … For more information, see Elasticsearch cat API

h            i          n              l    u m     hc hp     hm      rc rp     rm       d      fm qcm
6701a5b2dca8 172.17.0.3 Nico Minoru 2.93 4.4d m 10.3gb 74 13.9gb  29.4gb 57 29.9gb  52.6gb 444.5mb  0b
559932af40d5 172.17.0.3 Fearmaster  1.93   5d * 10.5gb 75 13.9gb  29.6gb 67 29.9gb  52.6gb 304.6mb  0b
4054511a6f8f 172.17.0.3 Ricadonna   0.18   5d m  5.8gb 41 13.8gb 113.6gb 23  120gb 262.8gb 334.2mb  0b

shards.txt gives you the shards that are unassigned

index                   shard prirep state          docs    store ip         node
builds-20170311         8     r      STARTED           0     144b 172.17.0.3 Ricadonna
builds-20170311         8     r      STARTED           0     144b 172.17.0.3 Fearmaster
builds-20170311         8     r      UNASSIGNED        0     144b 172.17.0.3 Fearmaster

stats_os.json, stats_os_process.json, stats_process.json give you general stats of the cluster and nodes. For more information, see Nodes Stats

Mismatch between Elasticsearch workers and applications

In some situations, such as a failure of an Elasticsearch worker, there can be a mismatch between the list-workers output and the list-applications output. If the two numbers returned by the commands below do not match, an Elasticsearch worker is likely broken:

$ cje run list-workers | grep elasticsearch | wc -l
3
$ cje run list-applications | grep elasticsearch | wc -l
3
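
If you want to script this check, here is a small sketch that wraps the two commands above and reports a mismatch:

# Compare the number of elasticsearch workers with the number of elasticsearch applications
WORKERS=$(cje run list-workers | grep elasticsearch | wc -l)
APPS=$(cje run list-applications | grep elasticsearch | wc -l)
if [ "$WORKERS" -ne "$APPS" ]; then
  echo "Mismatch: $WORKERS elasticsearch workers vs $APPS elasticsearch applications"
fi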

Unassigned Shards

If you see unassigned shards in your Cluster Health information and you do not have a node that is restarting, you must assign all shards in order to bring your cluster back to "green" status. If you have a node that is restarting, you should wait until that node is up and running and the pending tasks returned by the health check stabilize.

This script is designed to assign shards on an Elasticsearch cluster with 3 nodes; you must set the environment variables ES_USR (user to access ES), ES_PASSWD (password) and DOMAIN (URL to access ES).

#fix ES shards
export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="ES_URL"
export NODE_NAMES=$(curl -u $ES_USR:$ES_PASSWD "$DOMAIN/_cat/nodes?h=h" |awk '{printf $1" "}')
export NODES=(${NODE_NAMES//:/ })
export NUM_UNASSIGNED_SHARDS=$(curl -u $ES_USR:$ES_PASSWD "$DOMAIN/_cat/shards?v" | grep -c UNASSIGNED)
export NUM_PER_NODE=$(( $NUM_UNASSIGNED_SHARDS / 3 ))
export UNASSIGNED_SHARDS=$(curl -u $ES_USR:$ES_PASSWD "$DOMAIN/_cat/shards?v" |grep UNASSIGNED | awk '{print $1"#"$2 }')
export N=0

for i in $UNASSIGNED_SHARDS
do
    INDICE=$(echo $i| cut -d "#" -f 1)
    SHARD=$(echo $i| cut -d "#" -f 2)
    if [ $N -le $NUM_PER_NODE ]; then
        NODE="${NODES[0]}"
    fi
    if [ $N -gt $NUM_PER_NODE ] && [ $N -le $(( 2 * $NUM_PER_NODE )) ] ; then
        NODE="${NODES[1]}"
    fi
    if [ $N -gt $(( 2 * $NUM_PER_NODE )) ]; then
        NODE="${NODES[2]}"
    fi
    echo "fixing $INDICE $SHARD"
    curl -XPOST -u $ES_USR:$ES_PASSWD "$DOMAIN/_cluster/reroute" -d "{\"commands\" : [ {\"allocate\" : {\"index\" : \"$INDICE\",\"shard\" : $SHARD, \"node\" : \"$NODE\", \"allow_primary\" : true }}]}" > fix_shard_out_${N}.log
    sleep 2s
    N=$(( $N + 1 ))
done

Get Pending Tasks on the Elasticsearch Cluster

Sometimes when you execute the health commands and check the pending tasks, you may see that there are too many tasks, or that some indices are stuck in initializing status. Use the following commands to obtain the pending tasks of the cluster; you may then be able to determine the cause of the problems.

export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="URL_OF_ES"

curl -u $ES_USR:$ES_PASSWD -XGET "$DOMAIN/_cluster/pending_tasks?pretty" > pending_tasks.json

Delete Index

If you detect problems with an index that you cannot fix, you probably need to delete the index and try to restore it from a snapshot. To delete the index, use these commands:

export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="URL_OF_ES"
export ES_INDEX="INDEX_NAME"

curl -XDELETE -u $ES_USR:$ES_PASSWD "$DOMAIN/$ES_INDEX?pretty"

  • Before deleting an index, consider creating a backup (snapshot) of its current state.

Manage Elasticsearch Snapshots

An Elasticsearch Snapshot is a backup of the current status of indices in the Elasticsearch Cluster. A snapshot is stored in a snapshot repository that should exist on disk and should be configured in Elasticsearch.

Make a Snapshot of Indices

Sometimes before doing an operation over the cluster, you need to make a snapshot of the data in it. To do this, you can create a new snapshot repository in which to create the new snapshot.

This script lists the available snapshot repositories:

#get all backup repositories
export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="ES_URL"
curl -u $ES_USR:$ES_PASSWD -XGET "$DOMAIN/_snapshot/_all?pretty"

The following script creates a new snapshot repository named backup. This repository stores its data into the /usr/share/elasticsearch/snapshot/elasticsearch/backup folder.

export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="ES_URL"
export REPO="backup"

curl -XPUT -u $ES_USR:$ES_PASSWD  "$DOMAIN/_snapshot/$REPO?pretty" -d '{ "type": "fs", "settings": { "compress": "true", "location": "/usr/share/elasticsearch/snapshot/elasticsearch/backup"}}'

The following script creates a new snapshot named snapshot_1 in the repository backup:

export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="ES_URL"
export REPO="backup"
export SNAPSHOT="snapshot_1"

curl -u $ES_USR:$ES_PASSWD -XPUT "$DOMAIN/_snapshot/$REPO/$SNAPSHOT?pretty"

In some cases, this process could take more than an hour; you can check the status of your snapshot with these commands:

export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="ES_URL"
export REPO="backup"
export SNAPSHOT="snapshot_1"

#status of the current snapshot
# when the snapshot is finished, it returns this
# {
#    "snapshots": []
# }
#
curl -u $ES_USR:$ES_PASSWD -XGET "$DOMAIN/_snapshot/$REPO/_status?pretty"
#status of snapshot_1; check the status field until it is SUCCESS/FAILED
curl -u $ES_USR:$ES_PASSWD -XGET "$DOMAIN/_snapshot/$REPO/$SNAPSHOT?pretty"

Errors Trying to Create a Snapshot: snapshot is already running

If you execute your snapshot command and see the following error, another snapshot is still running. Either wait until that snapshot finishes, or cancel it.

{
    "error":"RemoteTransportException[[Smuggler][inet[/172.17.0.2:9300]][cluster/snapshot/create]]; nested: ConcurrentSnapshotExecutionException[[backup:snapshot_1] a snapshot is already running]; ",
    "status":503
}

To check which repository is making a snapshot, use these commands:

export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="ES_URL"
export REPO="backup"

curl -u $ES_USR:$ES_PASSWD -XGET "$DOMAIN/_snapshot/$REPO/_status?pretty"

Once you have the name of the relevant snapshot, you can cancel it with these commands:

export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="ES_URL"
export REPO="cloudbees-analytics"
export SNAPSHOT_NAME="SET_THE_NAME_OF_SNAPSHOT_HERE"

## Cancel the Snapshot

curl -XDELETE -u $ES_USR:$ES_PASSWD -m 30 "$DOMAIN/_snapshot/$REPO/$SNAPSHOT_NAME?pretty"
#check the status again
curl -u $ES_USR:$ES_PASSWD -XGET "$DOMAIN/_snapshot/$REPO/_status?pretty"

Restore a Snapshot of Indices

If a disaster happens, you can restore data from an existing snapshot. To do this, obtain the list of available snapshots with these commands:

#obtain ES list of snapshots on cloudbees-analytics repository
export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="ES_URL"
export REPO="cloudbees-analytics"

curl -u $ES_USR:$ES_PASSWD "$DOMAIN/_snapshot/$REPO/_all?pretty" > $REPO-snapshots.json

Next, examine the cloudbees-analytics-snapshots.json file and check which snapshots and indices you want to restore. Once this is done, edit the following script by adding a new line restore "SNAPSHOT_NAME" "INDEX_NAME" for each index you want to restore. The following script creates a file for each snapshot with the results of the restore operation:

#set the username
export ES_USR="YOUR_USERNAME"
#set the password
export ES_PASSWD="YOUR_PASSWORD"
export ES_URL="URL_OF_ES"

export ES_CREDS="$ES_USR:$ES_PASSWD"
#name of the snapshot repository
export ES_REPO="cloudbees-analytics"

restore() {
    local ES_SNAPSHOT=$1
    local index=$2
    local FILE=restore-${ES_REPO}-${ES_SNAPSHOT}.json
    echo "Restoring $ES_REPO - $ES_SNAPSHOT - $index"
    echo "Close $index" >> $FILE
    curl -XPOST -u $ES_CREDS "$ES_URL/$index/_close?pretty" >> $FILE
    echo "Restore $ES_REPO - $ES_SNAPSHOT - $index" >> $FILE
    curl -XPOST -u $ES_CREDS "$ES_URL/_snapshot/$ES_REPO/$ES_SNAPSHOT/_restore?wait_for_completion=true&pretty" -d"{ \"indices\" : \"$index\", \"ignore_unavailable\": \"true\", \"include_global_state\": \"true\" }" >> $FILE
    EXIT=1
    while [ $EXIT -eq 1 ]; do
        echo " wait_for_completion $ES_REPO/$ES_SNAPSHOT"
        sleep 5s
        EXIT=$(curl -XGET -u $ES_CREDS "$ES_URL/_snapshot/$ES_REPO/$ES_SNAPSHOT/_status" | grep -c "IN_PROGRESS")
    done
}

#set the SNAPSHOT_NAME and INDEX_NAME you want to restore
restore "SNAPSHOT_NAME" "INDEX_NAME"

Delete Snapshots

If you want to keep only an exact number of snapshots in a repository, you can use the following script (assuming that you have the JSON parser jq installed). It lists all snapshots, keeps only the last 30 and deletes the remainder.

#set the username
export ES_USR="YOUR_USERNAME"
#set the password
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="URL_OF_ES"
#number of snapshot to keep
export LIMIT=30
export REPO="cloudbees-analytics"

export ES_SNAPSHOTS=$(curl -u $ES_USR:$ES_PASSWD -s -XGET "$DOMAIN/_snapshot/$REPO/_all" | jq -r ".snapshots[:-${LIMIT}][].snapshot")

# Loop over the results and delete each snapshot
for SNAPSHOT in $ES_SNAPSHOTS
do
 echo "Deleting snapshot: $SNAPSHOT"
 curl -u $ES_USR:$ES_PASSWD -s -XDELETE "$DOMAIN/_snapshot/$REPO/$SNAPSHOT?pretty"
done

Mesos and Marathon Troubleshooting

Background

When focusing on the elements that control task scheduling and provisioning in the cluster, pay special attention to the following subsystems: Mesos, Marathon and Zookeeper. These three subsystems are critical to CloudBees Jenkins Enterprise and any problem inside or affecting them can cause the cluster to enter an unhealthy and erratic state.

Being critical to the life of the cluster, these subsystems run in high availability (HA).

Architecture

In terms of CloudBees Jenkins Enterprise architecture, these subsystems "live" inside the cluster’s controllers. For testing purposes, one might use a single-controller architecture, whereas for production environments it is critical to have more than one controller. This leads to the first important concept: the number of working controllers in the cluster must be odd. Since these subsystems work in HA, the election mechanism needs an odd number of members so that the leader (the active subsystem) can be elected.

To understand the symptoms, it is important to know that the Load Balancer sends requests to different controllers, depending on the load, using a round-robin algorithm.

Typically, these subsystems experience problems when they get out of sync. What does this mean? We say that one of these subsystems in HA is out of sync when there is no agreement on which node is the leader.

cje troubleshooting guide architecture overview

Symptoms of an Existing Problem

There are different symptoms of subsystem failures:

  • Intermittent behavior of the cluster.

  • Execution of several cje run commands returning no information or inconsistent information:

    • cje run list-applications

    • cje run list-resources

    • cje status

Diagnose

The following approaches can help determine if your cluster is affected by these kinds of problems:

CloudBees Jenkins Enterprise Bundle Analysis

  • The CloudBees Jenkins Enterprise Support Bundle includes this information for the cluster controllers. Review the contents of the folder pse/logs/controller-x/router/config.d for each controller; if the contents are not the same across controllers, one or more subsystems are out of sync.

Manual Controller Data Review

  • Access the controller terminal. From your Bastion Host, run the following command: dna connect controller-x.

    • Once there, list the Docker containers with sudo docker ps and look for the cloudbees/pse-router image.

    • Once you locate the image, access its container with sudo docker exec -it "container_id" /bin/sh

    • Then finally review the contents of the folder /etc/nginx/conf.d. This helps you determine whether or not Mesos and Marathon are out of sync. In particular, marathon.conf and mesos.conf indicate which controller is the leader for each of these services.

    • As an alternative method, you can get the same information by using curl to determine the leader for Marathon and Mesos. Once logged in to each controller, run curl localhost:5050/state | jq . | grep leader to get the Mesos leader information, and then curl -u [marathon_username]:[marathon_pwd] localhost:8080/v2/leader. All the controllers should agree on the elected leader.

  • Additionally, for ZooKeeper, you can also perform the following operation to check whether the number of leaders is correct or not:

    • Connect to every controller using dna connect controller-x

    • Get the information from the Zookeeper endpoint with echo "srvr" | nc localhost 2181 (if nc is not already installed on your controller, you might need to install it). Alternatively, you can get the leader information by running cje run support-mesos ms-state | jq . | grep leader. A consolidated sketch of these checks follows this list.

    • This information should show 1 leader and C-1 followers, where C is the number of Controllers in the cluster.
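
The checks above can be run quickly on each controller; a consolidated sketch of the commands to execute once connected with dna connect controller-x:

# Mesos leader as seen by this controller
curl -s localhost:5050/state | jq . | grep leader
# Marathon leader (replace the placeholders with the Marathon credentials)
curl -s -u [marathon_username]:[marathon_pwd] localhost:8080/v2/leader
# ZooKeeper role of this controller (requires nc)
echo "srvr" | nc localhost 2181 | grep Mode
# All controllers should report the same Mesos and Marathon leader, and exactly one
# controller should report "Mode: leader" for ZooKeeper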

Mesos UI Review

  • You can get the information provided by Mesos in terms of running tasks. The presence of any duplicate task in this JSON implies that there is more than one Mesos master service which considers itself the leader. To check, open the Mesos URL shown by the cje run display-outputs command. You can get the data by invoking different API endpoints, in this case http://controller-x-ip:5050/tasks.

  • In addition to this, if the number of active frameworks shown in $Mesos_URL/frameworks is different from 2 (more or fewer), something is wrong with one of these subsystems and action is required on our side. You can also run cje run support-mesos ms-state | jq '[.frameworks]|.[][]|[.name, .id]' to get the running frameworks. Example: in the case of a Marathon leader-election problem, there may be 2 or more Marathon leaders, all of which register themselves with Mesos as frameworks; in that case, the Mesos UI will show more than one framework named "marathon".

  • Another possible problem that you can detect by inspecting the Mesos UI is the existence of "orphan tasks", that is, tasks that exist only in the Mesos UI. You can verify that you have found an orphan task by running the cje run list-applications command on the Bastion Host, as sketched below. This command’s output shows an application running on a specific worker, but when you connect to that worker (dna connect worker-x) and list the containers running on it (sudo docker ps), you will not find any container corresponding to that application.
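
A sketch of that orphan-task check (<APP_ID> is a placeholder for the application reported by the Mesos UI):

# From the Bastion Host: where does the scheduler think the application is running?
$ cje run list-applications | grep <APP_ID>
# On that worker: is there a matching container?
$ dna connect worker-x
$ sudo docker ps
$ exit
# If the application is listed but no corresponding container shows up in docker ps,
# you have found an orphan task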

Support Commands

CJE includes several diagnostic commands that can help locate inconsistencies in the cluster:

Marathon Support Commands
  • cje run support-marathon mt-running-tasks : List all running tasks.

  • cje run support-marathon mt-search-duplicate-taks <tasks_json_file> : check for duplicate tasks in the output of mt-running-tasks; requires jq installed. This helps you determine whether or not there is a conflict in your Marathon subsystem (see the example after this list).

  • cje run support-marathon mt-ping : ping the Marathon service

  • cje run support-marathon mt-info : Get info about the Marathon Instance

  • cje run support-marathon mt-metrics : Get metrics data from this Marathon instance

  • cje run support-marathon mt-get-jce-info : Get the application with id jce. The response includes some status information besides the current configuration of the app. You can specify optional embed arguments, to get more embedded information.

  • cje run support-marathon mt-get-masters-info : Get the application with id masters. The response includes some status information besides the current configuration of the app. You can specify optional embed arguments, to get more embedded information.

  • cje run support-marathon mt-running-apps : Get the list of running applications. Several filters can be applied via query parameters.

  • cje run support-marathon mt-get-app-info <palace|castle|cjoc|elasticsearch> : Get the application with id jce/<palace|castle|cjoc|elasticsearch>. The response includes some status information besides the current configuration of the app. You can specify optional embed arguments, to get more embedded information.

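A typical way to use the duplicate-task check is to save the output of mt-running-tasks to a file and pass that file to the search command (command name reproduced as listed above; requires jq):

$ cje run support-marathon mt-running-tasks > tasks.json
$ cje run support-marathon mt-search-duplicate-taks tasks.json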

Mesos Support Commands
  • cje run support-mesos ms-state-summary : Summary of agents, tasks, and registered frameworks in the cluster. If you see more than two frameworks in the output, this points to a Mesos subsystem problem.

  • cje run support-mesos ms-tasks : Lists tasks from all active frameworks. Mesos subsystem problems are reflected in tasks being killed continuously.

  • cje run support-mesos ms-slaves : get info about Mesos slaves (workers)

  • cje run support-mesos ms-state : Information about state of master (running/killed)

  • cje run support-mesos ms-browse-mesos-stderr <slave_id> <framework_id> <mesos_id> : Returns a file listing for a directory.

  • cje run support-mesos ms-download-mesos-stderr <slave_id> <framework_id> <mesos_id> : Returns the raw file contents for a given path.

  • cje run support-mesos ms-read-mesos-stderr <slave_id> <framework_id> <mesos_id> : Reads data from a file.

Next Steps: Subsystem Fixing

Once you have identified the failing subsystem, perform a series of operations that will help bring the cluster back to a healthy status.

Service Restart

In most cases, once you have determined that one of the subsystems is misbehaving, the following action will likely fix the problems:

  • Stop the subsystem that was acting as a false leader: sudo service [marathon, mesos-master, zookeeper] stop.

  • Wait for 30-90s.

  • Start the subsystem again. sudo service [marathon, mesos-master, zookeeper] start.

Important

If you determine that the zookeeper subsystem is misbehaving, remove/rename the database /var/lib/zookeeper/version-2 (e.g. mv /var/lib/zookeeper/version-2 /var/lib/zookeeper/old) before restarting the service. After cleaning the database this way, the system might relaunch tasks that are already running, so run cje run list-applications to verify that no duplicate tasks are running.

Mesos Orphan Tasks

In most cases, this problem is due to the mesos-slave service not working properly in the corresponding worker. The recommended way to restore the service is by restarting the worker itself (dna stop worker-x and then dna start worker-x). This way, all the services running in that worker will be restarted and the worker will be brought back to a healthy status.

Important

Before stopping and restarting any worker, please be sure that your cluster has enough resources to provision the applications running in that worker (for when it restarts).

Consult the Knowledge Base

The Knowledge Base can be very helpful in troubleshooting problems with CloudBees Jenkins Enterprise and can be accessed on the CloudBees Support site.

Be aware that the name for CloudBees Jenkins Enterprise changed with version 1.6.0; it was previously known as "Private SaaS Edition" (PSE). When searching, search for both "CJE" and "PSE".

Examine the Logs

If a cje command installation fails or you encounter other problems, the first place to look is the logs. Start with the cje log, which is the output produced on your computer when you run cje.

The logs for each cje operation are stored in the operations subdirectory of your cje-project directory:

$ ls cje-project/operations/
20180306T180200Z-cluster-init  20180307T182206Z-worker-add  20180307T182726Z-worker-remove  20180307T192046Z-pse-support
$ head -3 cje-project/operations/20180307T182726Z-worker-remove/log/20180307T183028Z-worker-remove/apply
Worker count: 3
Removing worker: worker-3
Rendering /workspace/cje-project/.dna/servers/castle/config/jce-castle-config.xml => /workspace/cje-project/.dna/servers/castle/bundle/jce-castle-config.xml

Expected failures

Not all errors or failures you see in the logs are significant. Installing and starting services takes time, and the cje command spends some of its time waiting and retrying to see if services have started. For example, these types of messages are typical and not a sign of a problem:

11:23:46 [cjoc] curl: (22) The requested URL returned error: 503 Service Unavailable
11:23:46 [cjoc] 11:23:26 Failure (22) Retrying in 10..

Unexpected Failures

The cje tool will only wait and retry an operation a limited number of times. When the retry limit is reached, cje emits an error message and exits. You will see something like this at the end of the log:

11:23:46 [cjoc] 11:23:46 Failed in the last attempt (curl -k -fsSL http://cjoc..../health/check)
11:23:46 An error occurred during cjoc initialization (22) - see .../.dna/logs/20160229T104328Z-cluster-init/cjoc
[cjoc] Failed in the last attempt

By examining the log files you might be able to determine which part of CJE is failing. In the above example you can see that the problem is "cjoc", which is Operations Center. This typically means that there is some misconfiguration in the CloudBees Jenkins Enterprise config file.

Apache Mesos Logs

Another place to look for troubleshooting information is the Apache Mesos console. By looking at the Mesos and Marathon consoles, you can see which processes are running and which are not, and you can view the logs for each process. That might give you a clue about what caused some processes to fail to come up.

You can get the URLs of the Mesos console via the cje run display-outputs command. If you are running on AWS and your cluster name is "cluster1", you will see something like this:

$ cje run display-outputs

Controllers: ec2-N-N-N-N.compute-1.amazonaws.com
Workers    : ec2-N-N-N-N.compute-1.amazonaws.com,ec2-NN-NN-NN-NN.compute-1.amazonaws.com,ec2-N-N-N-N.compute-1.amazonaws.com

CJOC    : http://cluster1-controller-NNNNN.us-east-1.elb.amazonaws.com.elb.cloudbees.net/cjoc/
Mesos   : http://mesos.cluster1-controller-NNNNN.us-east-1.elb.amazonaws.com.elb.cloudbees.net
Marathon: http://marathon.cluster1-controller-NNNNN.us-east-1.elb.amazonaws.com.elb.cloudbees.net

To access Mesos, open your browser and go to the Mesos console at the URL above. Log in with the Mesos credentials. You can get these credentials via the echo-secrets command. The two commands below will give you the username and password:

$ cje run echo-secrets mesos_web_ui_username
$ cje run echo-secrets mesos_password

Once you log in to Mesos, you will see which processes are running and which ones have failed. In the console, look for the completed tasks section.

If Operations Center failed to start you should see something like this:

mesos cjoc failed

If you click on the sandbox link, you will be able to view the Operations Center stdout and stderr logs which should provide some insight into why Operations Center failed to start.

Operations Center depends on the CloudBees Jenkins Enterprise Castle and Palace components, so you should also examine the logs for those Mesos tasks.

If you see that Operations Center failed on a specific host (as in the previous screenshot), then look at the logs for castle.jce running on the same host. As you did with Operations Center, click on the Sandbox link and examine the stdout and stderr logs.

mesos castle task

Next Steps

Installation problems are typically resolved by changing property settings in your cluster-init.config and cluster-init.secrets files, and then rebuilding your cluster. Before you can do that, you need to destroy your failed cluster. See destroying the CloudBees Jenkins Enterprise cluster for information on how to destroy and then start the cluster again from scratch.

Shell Access to Servers

In some cases, it is necessary to access servers and their filesystems on the running Operations Center or Master servers.

First, run cje run list-applications to find out in which worker host the container is running. In the following example it would be worker-2 for Operations Center:

$ cje run list-applications
castle.jce : worker-2
elasticsearch.jce : worker-2
castle.jce : worker-3
cjoc.jce : worker-2
castle.jce : worker-1

Then ssh into the worker with

dna connect worker-2

At this point we can get a shell into the container for a specific TENANT_ID:

sudo docker exec -ti $(sudo docker ps -f label=com.cloudbees.pse.tenant=TENANT_ID -q) bash

Where TENANT_ID would be cjoc to access the CJOC container:

sudo docker exec -ti $(sudo docker ps -f label=com.cloudbees.pse.tenant=cjoc -q) bash

For any application that has only one instance in the cluster (for example: cjoc) there is a command that does all three of the above steps in a single command:

cje run ssh-into-tenant cjoc

Accessing Files

Sometimes it is also necessary to copy files to the local computer (or jumpbox) that holds the CloudBees Jenkins Enterprise project configuration.

Run cje run find-worker-for TENANT_ID to find out in which worker the container is running, and then get the ID of the container running as follows:

dna connect worker-2
sudo docker ps -f label=com.cloudbees.pse.tenant=TENANT_ID -q --no-trunc
exit

This will print a long ID, for example 7cc975f4da476f43602a18c60b3bcbb451b5914e61077a5e578fde26326ebf62

Next, let’s find the address of the worker with

$ cje run list-workers
worker-1 (master: m4.xlarge) ec2-N-N-N-N.compute-1.amazonaws.com > ACTIVE
worker-3 (build: m4.xlarge) ec2-N-N-N-N.compute-1.amazonaws.com > ACTIVE
worker-2 (master: m4.xlarge) ec2-N-N-N-N.compute-1.amazonaws.com > ACTIVE

Now we can download any file from the JENKINS_HOME (/mnt/TENANT_ID/CONTAINER_ID/ in the worker filesystem) using scp. For example:

scp -i .dna/identities/default ubuntu@ec2-N-N-N.compute-1.amazonaws.com:/mnt/cjoc/7cc975f4da476f43602a18c60b3bcbb451b5914e61077a5e578fde26326ebf62/support/support_*.zip .

Running your own scripts on workers

If you have some shell script you would like to run on various workers, you can create a new script under .dna/scripts, but it is critical that you use a unique script name that is not already there. You can then easily run that script on any worker using:

$ vim .dna/scripts/myScript.sh
$ chmod 755 .dna/scripts/myScript.sh
$ dna run myScript.sh worker-1
Running myScript.sh on 10.0.78.200:

Reviewing workers

Removing multiple workers

If you want to remove multiple workers, or want to script the removal, you can pass the worker to be removed as an argument to the operation, instead of having to add it to the staged worker-remove.config:

$ cje prepare worker-remove --server.name worker-25
worker-remove is staged - review worker-remove.config and edit as needed - then run 'cje apply' to perform the operation.
$ cje apply

Purging deleted workers

As you add and remove workers from your cluster, you will notice that removed workers still show up as DELETED. To purge those entries, run cje run purge-deleted-workers:

$ cje run list-workers
worker-1 (master: m4.xlarge) 10.0.78.200 > ENABLED
worker-2 (build: m4.xlarge) 10.0.69.145 > ENABLED
worker-3 (build: )  > DELETED
$ cje run purge-deleted-workers
 Purge: worker-3
$ cje run list-workers
worker-1 (master: m4.xlarge) 10.0.78.200 > ENABLED
worker-2 (build: m4.xlarge) 10.0.69.145 > ENABLED

Accessing application data from each worker

If you want to inspect the data of an application, for example the JENKINS_HOME data for CJOC, connect to the worker with dna connect worker-N. The data is mounted at /mnt/$application/$containerID. For example, to locate the CJOC data:

$ cje run list-applications
elasticsearch.jce : worker-1
master-1.masters : worker-1
master-2.masters : worker-1
cjoc.jce : worker-1
palace.jce : worker-1
castle.jce : worker-1
$ dna connect worker-1
$ df | grep /mnt
/dev/xvdo        2086912  242012   1844900  12% /mnt/cjoc/111789703de50a550210e277d42547233426933c58e83c54dc9852ae8443285b
/dev/xvdn        2086912  213100   1873812  11% /mnt/master-1/5e083a71812c887ab0cd00546186e45d6c8a369254e8f6405019d1501d03754b
/dev/xvdq        2086912  185560   1901352   9% /mnt/master-2/84bb1c9b5442f01816b7ff1bd0f2364520bfdb2180130304cac373f51ddb741c
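
Once the mount point is known, the JENKINS_HOME contents can be browsed directly on the worker (the container ID below is the illustrative one from the df output above):

$ sudo ls /mnt/cjoc/111789703de50a550210e277d42547233426933c58e83c54dc9852ae8443285b/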

Create a Support Request

You can call on CloudBees to help resolve your problems by submitting a support request at the CloudBees Zendesk site. In your request, state the problem, include any steps to reproduce, and attach a Support Bundle. A Support Bundle is an archive of logs for all CloudBees Jenkins Enterprise applications and access logs for your CloudBees Jenkins Enterprise cluster. You create a Support Bundle and send it to CloudBees support, who can use the information within to help you diagnose the problem you are encountering.

You can find instructions for creating a Support Bundle in the CloudBees Knowledge Base here: How to generate a CloudBees Jenkins Enterprise Support Bundle

Also, be aware that the cje command was known as bees-pse before CloudBees Jenkins Enterprise version 1.6.0. You can still use the bees-pse command, but it is simply a link to cje and may be removed in a future release.

Accessing a support bundle when a Managed Master’s web interface is not working

If a managed master is running, but the web interface is inaccessible, you can get the support bundle directly from the worker that is running that master with the commands:

$ cje run list-applications | grep master-1
master-1.masters : worker-1
$ dna connect worker-1
$ df -h | grep master-1
/dev/xvdn        50G  181M   50G   1% /mnt/master-1/c3ce80611976e94c405df086cd30366d622e66fc2e84094bbd122b093c63a830
$ ls /mnt/master-1/c3ce80611976e94c405df086cd30366d622e66fc2e84094bbd122b093c63a830/support/cloudbees-support*
# review the output and determine which file is the latest support bundle
$ logout
# back on the bastion host, copy the file from the worker
$ dna copy -rs /mnt/master-1/c3ce80611976e94c405df086cd30366d622e66fc2e84094bbd122b093c63a830/support/cloudbees-support_cje-mm-cje.zip . worker-1
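If several bundles are present in the support directory, you can identify the newest one while still connected to the worker before copying it down. This is a sketch that reuses the container ID from the df output above:

# On the worker: show the most recently modified support bundle for master-1
$ ls -t /mnt/master-1/c3ce80611976e94c405df086cd30366d622e66fc2e84094bbd122b093c63a830/support/cloudbees-support*.zip | head -n 1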

Support Bundle Anonymization

The Support Core Plugin collects diagnostic information about a running Jenkins instance. This data can contain sensitive information, which can be filtered automatically by enabling support bundle anonymization. Anonymization is applied to agent names, agent computer names, agent labels, view names, job names, usernames, and IP addresses. These strings are mapped to randomly generated anonymous counterparts that mask their real values. If you need to determine the real value behind an anonymized one, you can look it up on the support bundle anonymization web page.

Configuration

When anonymization is disabled, a warning message is shown on the Support web page.

WARNING: Support bundle anonymization is disabled. This can be enabled in the global configuration under Support Bundle Anonymization.

Click the link to Manage Jenkins → Configure System and enable support bundle anonymization.

Anonymize support bundle contents checkbox

Viewing Anonymized Mappings

When you submit an anonymized support bundle to your support organization, they may need to ask for further details about items with anonymized names. To translate those names, navigate to Manage Jenkins → Support Bundle Anonymization.

Support Bundle Anonymization management link

This page contains a table of mappings between original names and their corresponding anonymized versions. It also lists the stop words that are ignored when anonymized counterparts are generated. These are common Jenkins terms that by themselves convey no personal meaning; for example, an agent named "Jenkins" will not be anonymized because "jenkins" is a stop word.

Screenshot of anonymized mappings management page example

Limitations

Anonymization filters apply only to text files, and they cannot handle non-Jenkins URLs, the names of custom proprietary Jenkins plugins, or exceptions that quote invalid Groovy code from a Jenkins Pipeline. The active plugins, disabled plugins, failed plugins, and Dockerfile reports are not anonymized, because several Jenkins plugins and other Java libraries use version numbers that are indistinguishable from IP addresses. These reports are in the files plugins/active.txt, plugins/disabled.txt, plugins/failed.txt, and docker/Dockerfile. Review these files manually if you do not wish to disclose the names of custom proprietary plugins.
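Before sending a bundle, you may therefore want to review those unfiltered reports without unpacking the whole archive. The following is a minimal sketch, assuming a downloaded bundle named support_bundle.zip (a placeholder) and the report paths listed above:

# List the reports that are never anonymized
$ unzip -l support_bundle.zip | grep -E 'plugins/(active|disabled|failed)\.txt|docker/Dockerfile'
# Print one of them for manual review
$ unzip -p support_bundle.zip plugins/active.txt | less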