Troubleshooting CloudBees Jenkins Enterprise 2.x on Kubernetes

Caution

This guide is an old version of Troubleshooting CloudBees Jenkins Enterprise 2.x on Kubernetes, and is superseded by Troubleshooting CloudBees Core on Kubernetes.

Please refer to Troubleshooting CloudBees Core on Kubernetes for updated content.

There are a number of resources that you can use to troubleshoot a CloudBees Jenkins Enterprise failure.

In this section we will cover each of these approaches.

Consult the Knowledge Base

The Knowledge Base can be very helpful in troubleshooting problems with CloudBees Jenkins Enterprise and can be accessed on the CloudBees Support site.

Instance provisioning

Operations Center Provisioning

  1. Check the pod status

  2. Check that all associated objects have been created: pod, svc, statefulset (1 - 1), ingress, pvc and pv

  3. Check the events related to the pod and its associated objects (see the tables under Pods Events and PVCs events)

  4. Check the Jenkins logs

Managed Master Provisioning

  1. Check the Connectivity logs. These will probably point to the cause of the problem

  2. Check the pod status

  3. Check that all associated objects have been created: pod, svc, statefulset (1 - 1), ingress, pvc and pv

  4. Check the events related to the pod and its associated objects (see the tables under Pods Events and PVCs events)

  5. Check the Jenkins logs

Build Agent Provisioning

  1. Check the Jenkins logs. These will probably point to the cause of the problem

  2. Check the Kubernetes events

  3. Review the configuration of the Kubernetes shared cloud item in Operations Center
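The kubectl side of the checklists above can be gathered in one helper. This is a minimal sketch, not a CloudBees tool: the function name check_instance is hypothetical, and it assumes the pod is named <instance>-0 and the volume claim jenkins-home-<instance>-0, as in the examples later in this guide. The Connectivity logs and the shared cloud configuration still have to be reviewed in the product UI.

```shell
#!/usr/bin/env bash
# Hypothetical helper: runs the kubectl checks for one instance (for example cjoc or master1).
# Assumption: the pod is <instance>-0 and its claim is jenkins-home-<instance>-0.
check_instance() {
  local name="$1"
  local pod="${name}-0"

  echo "== Pod status =="
  kubectl get pod "$pod" -o wide

  echo "== Associated objects =="
  kubectl get "statefulset/${name}" "svc/${name}" "ingress/${name}" "pvc/jenkins-home-${pod}"

  echo "== Events (listed at the end of the describe output) =="
  kubectl describe pod "$pod"

  echo "== Jenkins logs (last 100 lines) =="
  kubectl logs --tail=100 "$pod"
}
```

For example, check_instance master1 prints the four sections for the master1 Managed Master.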

CloudBees Jenkins Enterprise basic operations

Viewing Cluster Resources

# Gives you quick readable detail
$ kubectl get -a pod,statefulset,svc,ingress,pvc,pv -o wide
# Gives you high level of detail
$ kubectl get -a pod,statefulset,svc,ingress,pvc,pv -o yaml
# Describe commands with verbose output
$ kubectl describe <TYPE> <NAME>

Pod Access

# Access to the bash
$ kubectl exec <POD_NAME> -i -t -- bash -li
master2-0:/$ ps -ef
PID   USER     TIME   COMMAND
    1 jenkins    0:00 /sbin/tini -- /usr/local/bin/launch.sh
    5 jenkins    1:53 java -Dhudson.slaves.NodeProvisioner.initialDelay=0 -Duser.home=/var/jenkins_home -Xmx1433m -Xms1433m -Djenkins.model.Jenkins.slaveAgentPortEnforce=true -Djenkins.model.Jenkins.slav
  481 jenkins    0:00 bash -li
  485 jenkins    0:00 ps -ef
# Bash execution command
$ kubectl exec <POD_NAME> -- ps -ef
PID   USER     TIME   COMMAND
    1 jenkins    0:00 /sbin/tini -- /usr/local/bin/launch.sh
    5 jenkins    2:05 java -Dhudson.slaves.NodeProvisioner.initialDelay=0 -Duser.home=/var/jenkins_home -Xmx1433m -Xms1433m -Djenkins.model.Jenkins.slaveAgentPortEnforce=true -Djenkins.model.Jenkins.slaveAgentPort=50000 -DMASTER_GRANT_ID=270bd80c-3e5c-498c-88fe-35ac9e11f3d3 -Dcb.IMProp.warProfiles.cje=kubernetes.json -DMASTER_INDEX=1 -Dcb.IMProp.warProfiles=kubernetes.json -DMASTER_OPERATIONSCENTER_ENDPOINT=http://cjoc/cjoc -DMASTER_NAME=master2 -DMASTER_ENDPOINT=http://cje.support-cje2.beescloud.k8s.local/master2/ -jar -Dcb.distributable.name=Docker Common CJE -Dcb.distributable.commit_sha=888f01a54c12cfae5c66ec27fd4f2a7346097997 /usr/share/jenkins/jenkins.war --webroot=/tmp/jenkins/war --pluginroot=/tmp/jenkins/plugins --prefix=/master2/
  645 jenkins    0:00 ps -ef

Access to the Pod Logs

$ kubectl logs -f <POD_NAME>

Pod Scale Down/Up

$ kubectl scale statefulset/master2 --replicas=0
statefulset "master2" scaled

$ kubectl get -a statefulset -o wide
NAME      DESIRED   CURRENT   AGE       CONTAINERS   IMAGES
cjoc      1         1         1d        jenkins      cloudbees/cje-oc:2.121.3.1
master1   1         1         2h        jenkins      cloudbees/cje-mm:2.121.3.1
master2   0         0         36m       jenkins      cloudbees/cje-mm:2.121.3.1
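Scaling down and then back up amounts to restarting the instance. The sketch below chains the two scale commands and waits in between until the pod is gone. The function name restart_instance is hypothetical, and the <instance>-0 pod naming and the two-minute limit are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: restarts an instance by scaling its StatefulSet to 0 and back to 1.
# Assumption: the pod of StatefulSet <instance> is named <instance>-0.
restart_instance() {
  local name="$1" tries=0
  kubectl scale "statefulset/${name}" --replicas=0 || return 1
  # Poll until the pod no longer exists (give up after about two minutes).
  while kubectl get pod "${name}-0" >/dev/null 2>&1; do
    tries=$((tries + 1))
    [ "$tries" -ge 60 ] && { echo "timed out waiting for ${name}-0 to stop" >&2; return 1; }
    sleep 2
  done
  kubectl scale "statefulset/${name}" --replicas=1
}
```

Usage: restart_instance master2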

CloudBees Jenkins Enterprise Cluster Resources

During the installation of CloudBees Jenkins Enterprise, the following service accounts, roles and role bindings are created.

$ kubectl get sa,role,rolebinding
NAME         SECRETS   AGE
sa/cjoc      1         21h
sa/default   1         21h
sa/jenkins   1         21h

NAME                      AGE
roles/master-management   21h
roles/pods-all            21h

NAME                   AGE
rolebindings/cjoc      21h
rolebindings/jenkins   21h

Once the installation is complete and the CloudBees Jenkins Enterprise cluster is up and running, you can check the status of the most important CloudBees Jenkins Enterprise resources: pod, statefulset, svc, ingress, pvc and pv.

$ kubectl get pod,statefulset,svc,ingress,pvc,pv

NAME           READY     STATUS    RESTARTS   AGE
po/cjoc-0      1/1       Running   0          21h
po/master1-0   1/1       Running   0          14h

NAME                   DESIRED   CURRENT   AGE
statefulsets/cjoc      1         1         21h
statefulsets/master1   1         1         14h

NAME          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)            AGE
svc/cjoc      ClusterIP   100.66.207.191   <none>        80/TCP,50000/TCP   21h
svc/master1   ClusterIP   100.67.1.49      <none>        80/TCP,50000/TCP   14h

NAME          HOSTS                                  ADDRESS            PORTS     AGE
ing/cjoc      cje.support-cje2.beescloud.k8s.local   af9463f6a2b68...   80        21h
ing/default   cje.support-cje2.beescloud.k8s.local   af9463f6a2b68...   80        21h
ing/master1   cje.support-cje2.beescloud.k8s.local   af9463f6a2b68...   80        14h

NAME                         STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
pvc/jenkins-home-cjoc-0      Bound     pvc-c5cad012-2b69-11e8-80fc-12582571ed5c   20Gi       RWO            gp2            21h
pvc/jenkins-home-master1-0   Bound     pvc-e4b5e473-2ba2-11e8-80fc-12582571ed5c   50Gi       RWO            gp2            14h

NAME                                          CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS    CLAIM                                        STORAGECLASS   REASON    AGE
pv/pvc-c5cad012-2b69-11e8-80fc-12582571ed5c   20Gi       RWO            Delete           Bound     cje-on-support-cje2/jenkins-home-cjoc-0      gp2                      21h
pv/pvc-e4b5e473-2ba2-11e8-80fc-12582571ed5c   50Gi       RWO            Delete           Bound     cje-on-support-cje2/jenkins-home-master1-0   gp2                      14h

In the following sections the expected results of different Kubernetes resources are defined. The definition of each Kubernetes resource was taken from Kubernetes official documentation.

Pods

A pod is the smallest and simplest Kubernetes object, which represents a set of running containers on your cluster. A Pod is typically set up to run a single primary container, although a pod can also run optional sidecar containers that add supplementary features like logging. Pods are commonly managed by a Deployment.

The kubectl get pod command lists the applications currently running in the cluster. Applications that are stopped or not deployed do not appear as pods of the cluster.

$ kubectl get pod

NAME           READY     STATUS    RESTARTS   AGE
po/cjoc-0      1/1       Running   0          21h
po/master1-0   1/1       Running   0          14h

Pods Events

Pod events provide you insights about why a specific pod is failing to start in the cluster. In other words, pod events will tell you the reason why a specific application cannot start or be deployed in the cluster.

The table below summarizes the most common pod events that might occur in CloudBees Jenkins Enterprise.

To get the list of events associated with a given pod you will need to run:

$ kubectl describe pod the_pod_name

For example:

$ kubectl describe pod cjoc-0

Status: ImagePullBackOff
Cause: The image you are using cannot be found in the Docker registry or, when using a private registry, no secret is configured.

Status: Pending
Events: Node issues
Cause: See the entries below. Get node information with kubectl describe nodes

Status: Pending
Events: Insufficient memory
Cause: Not enough memory. Either add nodes or increase the node size in the cluster, or reduce the memory requirement of Operations Center (yaml file) or of the Master (under configuration).

Status: Pending
Events: Insufficient cpu
Cause: Not enough CPUs. Either add nodes or increase the node size in the cluster, or reduce the CPU requirement of Operations Center (yaml file) or of the Master (under configuration).

Status: Pending
Events: NoVolumeZoneConflict
Cause: There are no nodes available in the zone where the persistent volume was created. Start more nodes in that zone.

Status: CrashLoopBackOff
Cause: Find out why the Docker container crashes. The first and easiest check is whether the output of the previous startup contains any errors.

Status: Running but restarting every so often
Events: describe pod shows Last State: Terminated Reason: OOMKilled Exit Code: 137
Cause: The Xmx or MaxRAM JVM parameters are too high for the container memory. Try increasing the memory limit.

Status: Unknown
Cause: This usually indicates a bad node, if several pods on that node are in the same state. Check with kubectl get pods --all-namespaces -o wide
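The CrashLoopBackOff and OOMKilled cases can also be checked directly from the command line. The following is a small sketch using the standard --previous option of kubectl logs and a jsonpath query on the pod status. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Print the logs of the previous (crashed) container instance of a pod.
previous_logs() {
  kubectl logs --previous "$1"
}

# Print the reason of the last container termination (for example OOMKilled), if any.
last_termination_reason() {
  kubectl get pod "$1" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
}
```

For example, last_termination_reason master1-0 prints nothing when the container has never been terminated.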

StatefulSet

A StatefulSet manages the deployment and scaling of a set of Pods and provides guarantees about the ordering and uniqueness of these Pods.

Like a Deployment, a StatefulSet manages Pods that are based on an identical container spec. Unlike a Deployment, a StatefulSet maintains a sticky identity for each of its Pods. These pods are created from the same spec, but are not interchangeable: each has a persistent identifier that it maintains across any rescheduling.

A StatefulSet operates under the same pattern as any other Controller. You define your desired state in a StatefulSet object and the StatefulSet controller makes any necessary updates to get there from the current state.

$ kubectl get statefulset

NAME                   DESIRED   CURRENT   AGE
statefulsets/cjoc      1         1         21h
statefulsets/master1   1         1         14h

In CloudBees Jenkins Enterprise, the expected DESIRED and CURRENT values of any application are 1. Neither Jenkins nor build agents support more than one instance running at the same time.
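The expectation above can be checked for all StatefulSets at once. This sketch reads the desired (.spec.replicas) and current (.status.replicas) counts through jsonpath instead of parsing the table columns, and reports every StatefulSet where either value is not 1. The function name unexpected_replicas is hypothetical. An instance that was deliberately scaled down to 0 is reported as well.

```shell
#!/usr/bin/env bash
# List StatefulSets whose desired or current replica count is not 1.
unexpected_replicas() {
  kubectl get statefulset \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.replicas}{"\t"}{.status.replicas}{"\n"}{end}' |
    awk -F'\t' '$2 != 1 || $3 != 1 { print $1 ": desired=" $2 " current=" $3 }'
}
```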

Service

A service is the API object that describes how to access applications (such as a set of Pods) and can describe ports and load-balancers.

The access point can be internal or external to the cluster.

$ kubectl get svc

NAME          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)            AGE
svc/cjoc      ClusterIP   100.66.207.191   <none>        80/TCP,50000/TCP   21h
svc/master1   ClusterIP   100.67.1.49      <none>        80/TCP,50000/TCP   14h

A service must exist for each application running in the cluster. Otherwise, the application will not be accessible.

Ingress

Ingresses represent the routes to access the applications, where an ingress could be thought of as a Load Balancer.

$ kubectl get ingress

NAME          HOSTS                                  ADDRESS            PORTS     AGE
ing/cjoc      cje.support-cje2.beescloud.k8s.local   af9463f6a2b68...   80        21h
ing/default   cje.support-cje2.beescloud.k8s.local   af9463f6a2b68...   80        21h
ing/master1   cje.support-cje2.beescloud.k8s.local   af9463f6a2b68...   80        14h

The required ingresses for CloudBees Jenkins Enterprise to work are:

  • An ing/default ingress as the default entry point to the cluster

  • An ing/cjoc ingress for access to the Operations Center

  • An ing/<MASTER_ID> ingress for access to each master

Important
The product expects these ingresses to be present and so they must not be modified - even to reduce the complexity of scope. Modifying ingresses at the Kubernetes level might produce issues in the product, such as Managed Masters becoming unable to communicate correctly with the Operations Center.
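A read-only way to confirm that the required ingresses are present is sketched below. The function name missing_ingresses is hypothetical. It only queries the ingresses and never modifies them.

```shell
#!/usr/bin/env bash
# Print the name of every required ingress that is missing.
# Arguments: the master IDs (for example master1 master2).
# The default and cjoc ingresses are always checked.
missing_ingresses() {
  local name
  for name in default cjoc "$@"; do
    kubectl get ingress "$name" >/dev/null 2>&1 || echo "$name"
  done
}
```

Usage: missing_ingresses master1 master2 prints nothing when all ingresses exist.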

Persistent Volume Claims (PVC)

Persistent volume claims (PVCs) represent the volumes associated with each application running in the cluster.

NAME                         STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
pvc/jenkins-home-cjoc-0      Bound     pvc-c5cad012-2b69-11e8-80fc-12582571ed5c   20Gi       RWO            gp2            21h
pvc/jenkins-home-master1-0   Bound     pvc-e4b5e473-2ba2-11e8-80fc-12582571ed5c   50Gi       RWO            gp2            14h

PVCs events

The table below summarizes the most common events associated with PVCs that might occur in CloudBees Jenkins Enterprise.

To obtain the list of events associated with a given PVC, run:

$ kubectl describe pvc the_pvc_name

For example:

$ kubectl describe pvc jenkins-home-cjoc-0

Status: Pending
Events: no persistent volumes available for this claim and no storage class is set
Cause: There is no default storageclass. A default storageclass must be set.
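Whether a default storageclass exists can be checked from the output of kubectl get storageclass, where kubectl appends (default) to the name of the default class. A minimal sketch, with a hypothetical function name:

```shell
#!/usr/bin/env bash
# Succeeds if some StorageClass is marked as the default one.
# kubectl appends "(default)" to the name of the default StorageClass.
has_default_storageclass() {
  kubectl get storageclass | grep -q '(default)'
}
```

Usage: has_default_storageclass || echo "no default storageclass is set"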

Persistent Volume (PV)

The persistent volume represents the volumes created in the cluster.

NAME                                          CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS    CLAIM                                        STORAGECLASS   REASON    AGE
pv/pvc-c5cad012-2b69-11e8-80fc-12582571ed5c   20Gi       RWO            Delete           Bound     cje-on-support-cje2/jenkins-home-cjoc-0      gp2                      21h
pv/pvc-e4b5e473-2ba2-11e8-80fc-12582571ed5c   50Gi       RWO            Delete           Bound     cje-on-support-cje2/jenkins-home-master1-0   gp2                      14h

Accessing $JENKINS_HOME

Accessing Jenkins Home Directory (Pod Running)

By running the following sequence of commands, you can ascertain the path of the $JENKINS_HOME inside a given pod and a specific CloudBees Jenkins Enterprise instance.

# Get the location of the $JENKINS_HOME
$ kubectl describe pod master2-0 | grep " jenkins-home " | awk '{print $1}'
/var/jenkins_home

# Access the bash of a given pod
$ kubectl exec master2-0 -i -t -- bash -i -l
master2-0:/$ cd /var/jenkins_home/
master2-0:~$ ps -ef
PID   USER     TIME   COMMAND
    1 jenkins    0:00 /sbin/tini -- /usr/local/bin/launch.sh
    5 jenkins    1:46 java -Dhudson.slaves.NodeProvisioner.initialDelay=0 -Duser.home=/var/jenkins_home -Xmx1433m -Xms1433m -Djenkins.model.Jenkins.slaveAgentPortEnforce=true -Djenkins.model.Jenkins.slav
  516 jenkins    0:00 bash -i -l
  524 jenkins    0:00 ps -ef
master2-0:~$ ps -ef | grep java
    5 jenkins    1:46 java -Dhudson.slaves.NodeProvisioner.initialDelay=0 -Duser.home=/var/jenkins_home -Xmx1433m -Xms1433m -Djenkins.model.Jenkins.slaveAgentPortEnforce=true -Djenkins.model.Jenkins.slaveAgentPort=50000 -DMASTER_GRANT_ID=270bd80c-3e5c-498c-88fe-35ac9e11f3d3 -Dcb.IMProp.warProfiles.cje=kubernetes.json -DMASTER_INDEX=1 -Dcb.IMProp.warProfiles=kubernetes.json -DMASTER_OPERATIONSCENTER_ENDPOINT=http://cjoc/cjoc -DMASTER_NAME=master2 -DMASTER_ENDPOINT=http://cje.support-cje2.beescloud.k8s.local/master2/ -jar -Dcb.distributable.name=Docker Common CJE -Dcb.distributable.commit_sha=888f01a54c12cfae5c66ec27fd4f2a7346097997 /usr/share/jenkins/jenkins.war --webroot=/tmp/jenkins/war --pluginroot=/tmp/jenkins/plugins --prefix=/master2/
  528 jenkins    0:00 grep java

# Operations to be done. This is an example
$ kubectl cp master2-0:/var/jenkins_home/jobs/ ./jobs/
tar: removing leading '/' from member names

Accessing Jenkins Home Directory (Pod Not Running)

# Stop a pod
$ kubectl scale statefulset/master2 --replicas=0
statefulset "master2" scaled

# Create a new rescue-pod that runs something with no effect
# on the $JENKINS_HOME
$ cat <<EOF | kubectl create -f -
kind: Pod
apiVersion: v1
metadata:
  name: rescue-pod
spec:
  volumes:
    - name: rescue-storage
      persistentVolumeClaim:
        claimName: jenkins-home-master2-0
  containers:
    - name: rescue-container
      image: nginx
      volumeMounts:
        - mountPath: "/tmp/jenkins-home"
          name: rescue-storage
EOF

# Access to the bash of the rescue-pod
$ kubectl exec rescue-pod -i -t -- bash -i -l
mesg: ttyname failed: Success
root@rescue-pod:/# cd /tmp/jenkins-home/
root@rescue-pod:/tmp/jenkins-home#

# Operations to be done. This is an example
$ kubectl cp rescue-pod:/tmp/jenkins-home/jobs/ ./jobs/
tar: removing leading '/' from member names

# Delete the rescue pod
$ kubectl delete pod rescue-pod
pod "rescue-pod" deleted

# Start the pod
$ kubectl scale statefulset/master2 --replicas=1
statefulset "master2" scaled

Operations Center Setup Customization

The Operations Center instance can be configured either by editing {YAML-CONFIG-FILE} or by using the Kubernetes command line.

# Set the memory to 2G
$ kubectl patch statefulset cjoc -p '{"spec":{"template":{"spec":{"containers":[{"name":"jenkins","resources":{"limits":{"memory": "2G"}}}]}}}}'
statefulset "cjoc" patched
# Set initialDelay to 320 seconds
$ kubectl patch statefulset cjoc -p '{"spec":{"template":{"spec":{"containers":[{"name":"jenkins","livenessProbe":{"initialDelaySeconds":320}}]}}}}'
statefulset "cjoc" patched
# Set timeout to 10 seconds
$ kubectl patch statefulset cjoc -p '{"spec":{"template":{"spec":{"containers":[{"name":"jenkins","livenessProbe":{"timeoutSeconds":10}}]}}}}'
statefulset "cjoc" patched
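After patching, the resulting values can be read back from the StatefulSet. This sketch uses a jsonpath filter on the container named jenkins, the name used in the patch commands above. The function name show_cjoc_settings is hypothetical.

```shell
#!/usr/bin/env bash
# Print the memory limit and liveness probe settings of the jenkins container in cjoc.
show_cjoc_settings() {
  kubectl get statefulset cjoc -o jsonpath='memory={.spec.template.spec.containers[?(@.name=="jenkins")].resources.limits.memory}{"\n"}initialDelaySeconds={.spec.template.spec.containers[?(@.name=="jenkins")].livenessProbe.initialDelaySeconds}{"\n"}timeoutSeconds={.spec.template.spec.containers[?(@.name=="jenkins")].livenessProbe.timeoutSeconds}{"\n"}'
}
```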

Performance Issues - High CPU / Blocked Threads

# export cje2 cluster information
$ kubectl get pod,svc,endpoints,statefulset,ingress,pvc,pv,sa,role,rolebinding -o yaml > to-el-cluster.yml

# jenkinshangWithJstack
$ kubectl cp ~/Downloads/jenkinshangWithJstack.sh master1-0:/tmp/

$ kubectl exec master1-0 -- jps
5 jenkins.war
8807 Jps

$ kubectl exec master1-0 -- chmod u+x /tmp/jenkinshangWithJstack.sh

# So far, the script only works when run from a shell inside the pod/container
$ kubectl exec master1-0 -it -- bash -il
master1-0:/$ /tmp/jenkinshangWithJstack.sh 5 60 5

$ kubectl cp master1-0:/tmp/jenkinshangWithJstack.5.output.tar ./

Create a Support Request

You can call on CloudBees to help resolve your problems by submitting a support request at the CloudBees Zendesk site. In your request, state the problem and include any steps to reproduce it, the Support Bundles, and a description of the cje2 cluster.

Required Data

Cluster and Operations Center Data

# Create required data folder
$ mkdir cje2-required-data
$ cd cje2-required-data

# Dump cluster info
$ kubectl cluster-info dump --output-directory=./cluster-state/

# Copy the Operations Center bundles
$ kubectl cp cjoc-0:/var/jenkins_home/support/ ./cjoc-support/

# Cluster information
$ kubectl cluster-info > 000-cluster-info.txt

# cje2 cluster description
$ kubectl get node,pod,statefulset,svc,endpoints,ingress,pvc,pv,sa,role,rolebinding -o wide > cje2-cluster-wide.txt
$ kubectl get node,pod,statefulset,svc,endpoints,ingress,pvc,pv,sa,role,rolebinding -o yaml > cje2-cluster-wide.yml

Master Data

# Also grab the cluster and Operations Center data

# Copy master1 bundles
$ kubectl cp master1-0:/var/jenkins_home/support/ ./master1-support/

Connectivity Checks

# Also grab the cluster, Operations Center and master data

$ kubectl exec -ti cjoc-0 -- curl localhost:50000 > 001-cjoc-curl-local-50000.txt
$ kubectl exec -ti master1-0 -- curl cjoc:50000 > 002-master1-curl-cjoc-50000.txt
$ kubectl exec -ti master1-0 -- curl 100.66.207.191:50000 > 003-master1-curl-cjoc-ip-50000.txt
$ kubectl exec -ti cjoc-0 -- curl -Iv http://master1.default.svc.cluster.local/master1/ > 004-cjoc-curl-master1.txt
$ kubectl exec -ti master1-0 -- curl -Iv http://cjoc/cjoc/ > 005-master1-curl-cjoc.txt
$ kubectl exec -ti cjoc-0 -- curl -Iv http://100.67.1.49/master1/ > 006-cjoc-curl-master1-ip.txt
$ kubectl exec -ti master1-0 -- curl -Iv http://100.66.207.191/cjoc/ > 007-master1-curl-cjoc-ip.txt

Pipeline Build Data

$ kubectl cp master1-0:/var/jenkins_home/jobs/<path-to-pipeline-build-folder> ./pipeline-build-data/
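Once the data has been collected, it is convenient to attach it to the support request as a single file. The sketch below packs the data folder into a compressed archive. The function name pack_required_data is hypothetical, and the default folder name is the one created in the Cluster and Operations Center Data step.

```shell
#!/usr/bin/env bash
# Pack the collected data folder into <folder>.tar.gz next to the folder.
pack_required_data() {
  local dir="${1:-cje2-required-data}"
  tar -czf "${dir}.tar.gz" -C "$(dirname "$dir")" "$(basename "$dir")"
}
```

Usage, from the parent directory of the data folder: pack_required_data cje2-required-data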

Support Bundle Anonymization

The Support Core Plugin collects diagnostic information about a running Jenkins instance. These data can contain sensitive information, but this can be automatically filtered by enabling support bundle anonymization. Anonymization is applied to agent names, agent computer names, agent labels, view names, job names, usernames, and IP addresses. These strings are mapped to randomly generated anonymous counterparts which mask their real values. If you need to determine the real value for an anonymized one, you can look that up in the support bundle anonymization web page.

Configuration

When anonymization is disabled, a warning message is shown on the Support web page.

WARNING: Support bundle anonymization is disabled. This can be enabled in the global configuration under Support Bundle Anonymization.

Click the link to Manage Jenkins > Configure System and select the Anonymize support bundle contents checkbox.

Viewing Anonymized Mappings

When you submit an anonymized support bundle to your support organization, they may need to ask for further details about items with anonymized names. To translate those names, navigate to Manage Jenkins > Support Bundle Anonymization.

This page contains a table of mappings between original names and their corresponding anonymized versions. This also contains a list of stop words that are ignored when anonymization generates anonymized counterparts. These are common terms in Jenkins that by themselves convey no personal meaning. For example, an agent named "Jenkins" will not be anonymized because "jenkins" is a stop word.

Limitations

Anonymization filters apply only to text files. They cannot handle non-Jenkins URLs, custom proprietary Jenkins plugin names, and exceptions quoting invalid Groovy code in a Jenkins pipeline. The active plugins, disabled plugins, failed plugins, and Dockerfile reports are not anonymized because several Jenkins plugins and other Java libraries use version numbers that are indistinguishable from IP addresses. These reports are in the files plugins/active.txt, plugins/disabled.txt, plugins/failed.txt, and docker/Dockerfile. These files should all be manually reviewed if you do not wish to disclose the names of custom proprietary plugins.
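To make the manual review easier, the sketch below lists which of the non-anonymized reports are present in a support bundle that has already been extracted to a directory. The function name unanonymized_reports is hypothetical. The file paths are the ones named above.

```shell
#!/usr/bin/env bash
# List the reports that are not anonymized and exist in an extracted support bundle.
# Argument: the directory the support bundle was extracted to.
unanonymized_reports() {
  local bundle_dir="$1" f
  for f in plugins/active.txt plugins/disabled.txt plugins/failed.txt docker/Dockerfile; do
    [ -f "${bundle_dir}/${f}" ] && echo "${bundle_dir}/${f}"
  done
  return 0
}
```

Usage: unanonymized_reports ./bundle prints one path per report that needs to be reviewed.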