Q-16 — Fix a Broken kubeadm Cluster After Machine Migration

A kubeadm-provisioned cluster was migrated to a new machine. After migration, the cluster is not working and requires configuration fixes to run successfully.

Task

1. Fix a broken single-node Kubernetes cluster

The cluster became non-functional during the machine migration.

2. Identify broken control-plane components

Inspect control-plane components and determine which ones are failing and why.

3. External etcd was used earlier

The old cluster relied on an external etcd server. You must verify whether the new machine still points to the correct etcd location.

4. Fix configuration of all broken components

Update all misconfigured files (static pod manifests, kubelet configs, systemd services, etc.) so that the components can start successfully; the default kubeadm locations for these are summarized in the snippet after this task list.

5. Restart all required services/components

Ensure necessary services (e.g., kubelet, container runtime) and static pods reload the corrected configuration.

6. Validate cluster health

Finally, ensure:

  • The cluster is functional
  • The single node is Ready
  • All pods are Running and Ready
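
A quick reference for steps 2 and 4: the usual places to inspect on a kubeadm control-plane node are listed below (these are kubeadm defaults; confirm the actual paths on the given node).

  ls /etc/kubernetes/manifests/        # static pod manifests: kube-apiserver, etcd, kube-scheduler, kube-controller-manager
  cat /var/lib/kubelet/config.yaml     # kubelet configuration
  systemctl cat kubelet                # kubelet systemd unit and drop-ins (including /var/lib/kubelet/kubeadm-flags.env)
  crictl ps -a                         # container state as seen by the runtime
  journalctl -u kubelet -f             # kubelet logs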

controlplane ~   k get po
The connection to the server controlplane:6443 was refused - did you specify the right host or port?

controlplane ~  crictl ps -a
CONTAINER           IMAGE               CREATED              STATE               NAME                      ATTEMPT             
bcb555633c4e9       90550c43ad2bc       About a minute ago   Exited              kube-apiserver            5                   
0bb498c78daf0       6331715a2ae96       About a minute ago   Exited              calico-kube-controllers   5                   
d090d77f4308c       46169d968e920       6 minutes ago        Running             kube-scheduler            1                   
8352a9de86ba3       a0af72f2ec6d6       6 minutes ago        Running             kube-controller-manager   1                   
e93ffe2c3d000       ead0a4a53df89       2 hours ago          Running             coredns                   0                   
d9ef1c7891c67       ead0a4a53df89       2 hours ago          Running             coredns                   0                   
c22a119a37eb7       c9fe3bce8a6d8       2 hours ago          Running             kube-flannel              0                   
96163f5cfda6e       feb26d4585d68       2 hours ago          Running             calico-node               0                   
d796b122cc292       7dd6ea186aba0       2 hours ago          Exited              install-cni               0                   
ee7e7169ee83b       df0860106674d       2 hours ago          Running             kube-proxy                0                   
49bbfd560899c       a0af72f2ec6d6       2 hours ago          Exited              kube-controller-manager   0                   
b2b3cdeb18f51       5f1f5298c888d       2 hours ago          Running             etcd                      0                   
dc5b85c20084a       46169d968e920       2 hours ago          Exited              kube-scheduler            0                   

controlplane ~   crictl logs bcb555633c4e9
I1116 13:18:31.111693       1 options.go:263] external host was not specified, using 192.168.121.202
I1116 13:18:31.190571       1 server.go:150] Version: v1.34.0
W1116 13:18:51.944160       1 logging.go:55] [core] [Channel #7 SubChannel #9]grpc: addrConn.createTransport failed to connect to {Addr: "12.0.0.1:2379", ServerName: "12.0.0.1:2379", BalancerAttributes: {"<%!p(pickfirstleaf.managedByPickfirstKeyType={})>": "<%!p(bool=true)>" }}. Err: connection error: desc = "transport: Error while dialing: dial tcp 12.0.0.1:2379: i/o timeout"
F1116 13:18:51.944204       1 instance.go:232] Error creating leases: error creating storage factory: context deadline exceeded

controlplane ~   journalctl -u kubelet.service -f | grep kube-apiserver           # no specific clue
Nov 16 13:21:17 controlplane kubelet[3467]: E1116 13:21:17.604061    3467 mirror_client.go:139] "Failed deleting a mirror pod" err="Delete \"https://192.168.121.202:6443/api/v1/namespaces/kube-system/pods/kube-apiserver-controlplane\": dial tcp 192.168.121.202:6443: connect: connection refused" pod="kube-system/kube-apiserver-controlplane"
Nov 16 13:21:17 controlplane kubelet[3467]: E1116 13:21:17.604362    3467 pod_workers.go:1324] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-apiserver\" with CrashLoopBackOff: \"back-off 2m40s restarting failed container=kube-apiserver pod=kube-apiserver-controlplane_kube-system(585d248aec0e70f4e24b6aeb6267f642)\"" pod="kube-system/kube-apiserver-controlplane" podUID="585d248aec0e70f4e24b6aeb6267f642"
Nov 16 13:21:19 controlplane kubelet[3467]: E1116 13:21:19.602527    3467 status_manager.go:1018] "Failed to get status for pod" err="Get \"https://192.168.121.202:6443/api/v1/namespaces/kube-system/pods/kube-apiserver-controlplane\": dial tcp 192.168.121.202:6443: connect: connection refused" podUID="585d248aec0e70f4e24b6aeb6267f642" pod="kube-system/kube-apiserver-controlplane"
Nov 16 13:21:29 controlplane kubelet[3467]: E1116 13:21:29.602174    3467 status_manager.go:1018] "Failed to get status for pod" err="Get \"https://192.168.121.202:6443/api/v1/namespaces/kube-system/pods/kube-apiserver-controlplane\": dial tcp 192.168.121.202:6443: connect: connection refused" podUID="585d248aec0e70f4e24b6aeb6267f642" pod="kube-system/kube-apiserver-controlplane"
^C

controlplane ~  cat /etc/kubernetes/manifests/etcd.yaml | grep -i advertise
    kubeadm.kubernetes.io/etcd.advertise-client-urls: https://192.168.121.202:2379
    - --advertise-client-urls=https://192.168.121.202:2379
    - --initial-advertise-peer-urls=https://192.168.121.202:2380

controlplane ~   cat /etc/kubernetes/manifests/kube-apiserver.yaml | grep -i etcd-server          # address is wrong
    - --etcd-servers=https://12.0.0.1:2379
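
Before editing anything, it can help to confirm that etcd actually answers on the URL it advertises. A minimal sketch, assuming etcdctl is installed on the node and the default kubeadm certificate paths (for the local static pod etcd here, https://127.0.0.1:2379 normally answers as well, since kubeadm's default listen-client-urls includes the loopback address):

  ETCDCTL_API=3 etcdctl \
    --endpoints=https://192.168.121.202:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/apiserver-etcd-client.crt \
    --key=/etc/kubernetes/pki/apiserver-etcd-client.key \
    endpoint health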

controlplane ~   sudo cp /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/manifests/kube-apiserver.yaml.bak

controlplane ~   vi /etc/kubernetes/manifests/kube-apiserver.yaml 

controlplane ~   cat /etc/kubernetes/manifests/kube-apiserver.yaml | grep -i etcd-server
    - --etcd-servers=https://127.0.0.1:2379
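
The same edit can be made non-interactively; a one-liner sketch, assuming the address is the only thing that needs to change on that flag:

  sudo sed -i 's|--etcd-servers=https://12.0.0.1:2379|--etcd-servers=https://127.0.0.1:2379|' /etc/kubernetes/manifests/kube-apiserver.yaml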

controlplane ~   systemctl restart kubelet
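
Restarting the kubelet is a safe catch-all; strictly speaking, the kubelet watches /etc/kubernetes/manifests and recreates a static pod on its own whenever its manifest changes. Either way, give the apiserver a moment and watch for a fresh container to reach Running:

  watch crictl ps --name kube-apiserver      # a new kube-apiserver attempt should appear and stay Running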

controlplane ~   k get po
No resources found in default namespace.

controlplane ~   k get po -A
NAMESPACE     NAME                                       READY   STATUS    RESTARTS       AGE
kube-system   calico-kube-controllers-587f6db6c5-shzhf   1/1     Running   8 (7m9s ago)   128m
kube-system   canal-hspzw                                2/2     Running   0              128m
kube-system   coredns-6678bcd974-4wj8b                   1/1     Running   0              128m
kube-system   coredns-6678bcd974-b44nn                   1/1     Running   0              128m
kube-system   etcd-controlplane                          1/1     Running   0              128m
kube-system   kube-apiserver-controlplane                1/1     Running   0              128m
kube-system   kube-controller-manager-controlplane       1/1     Running   1 (18m ago)    128m
kube-system   kube-proxy-nsf5t                           1/1     Running   0              128m
kube-system   kube-scheduler-controlplane                1/1     Running   1 (18m ago)    128m

controlplane ~ 
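
To close out the validation the task asks for (node Ready, all pods Running and Ready), two more checks are worth running; the field selector is just a quick way to surface anything that is not Running:

  k get nodes                                           # the single node should report Ready
  k get po -A --field-selector=status.phase!=Running    # should return nothing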

Other common follow-up fixes, depending on the error

A) If the error is “connection refused / no route to host”

  • Confirm the endpoint IP and port.
  • Ensure etcd is listening on that IP/port (ss -tnlp | grep 2379 on the etcd host); see the snippet below.
  • If etcd runs on another host, check the firewall, security groups, and node network.
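
A quick sketch of both checks; the availability of nc and the <etcd-ip> placeholder are assumptions, substitute the endpoint from --etcd-servers:

  # on the etcd host: is etcd listening on the expected address?
  ss -tnlp | grep 2379
  # from the control-plane node: is the port reachable at all?
  nc -zv <etcd-ip> 2379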

B) If TLS / certificate error (x509)

  • kube-apiserver uses client certs to authenticate to etcd. Confirm the cert files referenced by kube-apiserver exist and are valid (the openssl check after this list covers both validity and SANs).
  • Typical kube-apiserver flags:
      • --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt
      • --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt
      • --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key
  • If etcd’s server certificate SANs don’t include the endpoint IP/hostname you used, TLS will fail. Either:
      • use an endpoint whose name/IP is in the etcd cert SANs, or
      • regenerate the etcd certs to include the desired IP/hostname (not trivial in the exam; prefer the simpler option).
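
A quick way to check both points, assuming the default kubeadm certificate paths (adjust them for an external etcd):

  # does the etcd server certificate cover the endpoint you are dialing?
  openssl x509 -in /etc/kubernetes/pki/etcd/server.crt -noout -text | grep -A1 'Subject Alternative Name'
  # is the apiserver's etcd client certificate still valid?
  openssl x509 -in /etc/kubernetes/pki/apiserver-etcd-client.crt -noout -dates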

C) If etcd is external and requires a different advertised URL

  • Copy the advertised client URL from the etcd host (its manifest or systemd unit) and set it as the --etcd-servers=... value in the kube-apiserver manifest, as in the sketch below.
  • Ensure kube-apiserver has the right CA/cert/key to authenticate to that etcd.
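
A sketch of that flow; the etcd address 10.0.0.5 is purely hypothetical, use whatever the etcd host actually advertises:

  # on the etcd host: read the advertised client URL (manifest shown; an external etcd may use a systemd unit or config file instead)
  grep -i advertise-client-urls /etc/kubernetes/manifests/etcd.yaml
  # on the control-plane node: point kube-apiserver at that URL (hypothetical address)
  sudo sed -i 's|--etcd-servers=.*|--etcd-servers=https://10.0.0.5:2379|' /etc/kubernetes/manifests/kube-apiserver.yaml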