Troubleshooting

Node bootstrap failures when using CABPK with cloud-init

Failures during Node bootstrapping can have a lot of different causes. For example, Cluster API resources might be misconfigured or there might be problems with the network. The following steps describe how bootstrap failures can be troubleshooted systematically.

  1. Access the Node via ssh.
  2. Take a look at cloud-init logs via less /var/log/cloud-init-output.log or journalctl -u cloud-init --since "1 day ago". (Note: cloud-init persists logs of the commands it executes (like kubeadm) only after they have returned.)
  3. It might also be helpful to take a look at journalctl --since "1 day ago".
  4. If you see that kubeadm times out waiting for the static Pods to come up, take a look at:
    1. containerd: crictl ps -a, crictl logs, journalctl -u containerd
    2. Kubelet: journalctl -u kubelet --since "1 day ago" (Note: it might be helpful to increase the Kubelet log level by e.g. setting --v=8 via systemctl edit --full kubelet && systemctl restart kubelet)
  5. If Node bootstrapping consistently fails and the kubeadm logs are not verbose enough, the kubeadm verbosity can be increased via KubeadmConfigSpec.Verbosity.

Labeling nodes with reserved labels such as node-role.kubernetes.io fails with kubeadm error during bootstrap

Self-assigning Node labels such as node-role.kubernetes.io using the kubelet --node-labels flag (see kubeletExtraArgs in the CABPK examples) is not possible due to a security measure imposed by the NodeRestriction admission controller that kubeadm enables by default.

Assigning such labels to Nodes must be done after the bootstrap process has completed:

kubectl label nodes <name> node-role.kubernetes.io/worker=""

For convenience, here is an example one-liner to do this post installation

kubectl get nodes --no-headers -l '!node-role.kubernetes.io/master' -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}' | xargs -I{} kubectl label node {} node-role.kubernetes.io/worker=''

Cluster API with Docker

When provisioning workload clusters using Cluster API with Docker infrastructure, provisioning might be stuck:

  1. if there are stopped containers on your machine from previous runs. Clean unused containers with docker rm -f .

  2. if the docker space on your disk is being exhausted