Q-16 — Fix a Broken kubeadm Cluster After Machine Migration
A kubeadm-provisioned cluster was migrated to a new machine. After migration, the cluster is not working and requires configuration fixes to run successfully.
Task
1. Fix a broken single-node Kubernetes cluster
The cluster became non-functional during the machine migration.
2. Identify broken control-plane components
Inspect the control-plane components and determine which ones are failing and why (a command sketch follows this list).
3. External etcd was used earlier
The old cluster relied on an external etcd server. You must verify whether the new machine still points to the correct etcd location.
4. Fix configuration of all broken components
Update all misconfigured files (static pod manifests, kubelet configs, systemd services, etc.) so that the components can start successfully.
5. Restart all required services/components
Ensure necessary services (e.g., kubelet, container runtime) and static pods reload the corrected configuration.
6. Validate cluster health
Finally, ensure:
- The cluster is functional
- The single node is Ready
- All pods are Running and Ready
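One way to work through steps 2-6 on a kubeadm node (a sketch, assuming a containerd runtime and the default kubeadm paths):
# List control-plane containers and spot the ones stuck in Exited state
crictl ps -a | grep -E 'kube-apiserver|kube-controller-manager|kube-scheduler|etcd'
# Read the logs of a failing container (ID taken from the previous command)
crictl logs <container-id>
# Watch the kubelet for static-pod and mirror-pod errors
journalctl -u kubelet -f
# Check the static pod manifests and the kubelet/runtime services
ls /etc/kubernetes/manifests/
systemctl status kubelet containerd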
controlplane ~ ➜ k get po
The connection to the server controlplane:6443 was refused - did you specify the right host or port?
controlplane ~ ✖ crictl ps -a
CONTAINER       IMAGE           CREATED              STATE     NAME                      ATTEMPT
bcb555633c4e9   90550c43ad2bc   About a minute ago   Exited    kube-apiserver            5
0bb498c78daf0   6331715a2ae96   About a minute ago   Exited    calico-kube-controllers   5
d090d77f4308c   46169d968e920   6 minutes ago        Running   kube-scheduler            1
8352a9de86ba3   a0af72f2ec6d6   6 minutes ago        Running   kube-controller-manager   1
e93ffe2c3d000   ead0a4a53df89   2 hours ago          Running   coredns                   0
d9ef1c7891c67   ead0a4a53df89   2 hours ago          Running   coredns                   0
c22a119a37eb7   c9fe3bce8a6d8   2 hours ago          Running   kube-flannel              0
96163f5cfda6e   feb26d4585d68   2 hours ago          Running   calico-node               0
d796b122cc292   7dd6ea186aba0   2 hours ago          Exited    install-cni               0
ee7e7169ee83b   df0860106674d   2 hours ago          Running   kube-proxy                0
49bbfd560899c   a0af72f2ec6d6   2 hours ago          Exited    kube-controller-manager   0
b2b3cdeb18f51   5f1f5298c888d   2 hours ago          Running   etcd                      0
dc5b85c20084a   46169d968e920   2 hours ago          Exited    kube-scheduler            0
controlplane ~ ➜ crictl logs bcb555633c4e9
I1116 13:18:31.111693 1 options.go:263] external host was not specified, using 192.168.121.202
I1116 13:18:31.190571 1 server.go:150] Version: v1.34.0
W1116 13:18:51.944160 1 logging.go:55] [core] [Channel #7 SubChannel #9]grpc: addrConn.createTransport failed to connect to {Addr: "12.0.0.1:2379", ServerName: "12.0.0.1:2379", BalancerAttributes: {"<%!p(pickfirstleaf.managedByPickfirstKeyType={})>": "<%!p(bool=true)>" }}. Err: connection error: desc = "transport: Error while dialing: dial tcp 12.0.0.1:2379: i/o timeout"
F1116 13:18:51.944204 1 instance.go:232] Error creating leases: error creating storage factory: context deadline exceeded
controlplane ~ ➜ journalctl -u kubelet.service -f | grep kube-apiserver # no specific clue
Nov 16 13:21:17 controlplane kubelet[3467]: E1116 13:21:17.604061 3467 mirror_client.go:139] "Failed deleting a mirror pod" err="Delete \"https://192.168.121.202:6443/api/v1/namespaces/kube-system/pods/kube-apiserver-controlplane\": dial tcp 192.168.121.202:6443: connect: connection refused" pod="kube-system/kube-apiserver-controlplane"
Nov 16 13:21:17 controlplane kubelet[3467]: E1116 13:21:17.604362 3467 pod_workers.go:1324] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-apiserver\" with CrashLoopBackOff: \"back-off 2m40s restarting failed container=kube-apiserver pod=kube-apiserver-controlplane_kube-system(585d248aec0e70f4e24b6aeb6267f642)\"" pod="kube-system/kube-apiserver-controlplane" podUID="585d248aec0e70f4e24b6aeb6267f642"
Nov 16 13:21:19 controlplane kubelet[3467]: E1116 13:21:19.602527 3467 status_manager.go:1018] "Failed to get status for pod" err="Get \"https://192.168.121.202:6443/api/v1/namespaces/kube-system/pods/kube-apiserver-controlplane\": dial tcp 192.168.121.202:6443: connect: connection refused" podUID="585d248aec0e70f4e24b6aeb6267f642" pod="kube-system/kube-apiserver-controlplane"
Nov 16 13:21:29 controlplane kubelet[3467]: E1116 13:21:29.602174 3467 status_manager.go:1018] "Failed to get status for pod" err="Get \"https://192.168.121.202:6443/api/v1/namespaces/kube-system/pods/kube-apiserver-controlplane\": dial tcp 192.168.121.202:6443: connect: connection refused" podUID="585d248aec0e70f4e24b6aeb6267f642" pod="kube-system/kube-apiserver-controlplane"
^C
controlplane ~ ✖ cat /etc/kubernetes/manifests/etcd.yaml | grep -i advertise
kubeadm.kubernetes.io/etcd.advertise-client-urls: https://192.168.121.202:2379
- --advertise-client-urls=https://192.168.121.202:2379
- --initial-advertise-peer-urls=https://192.168.121.202:2380
controlplane ~ ➜ cat /etc/kubernetes/manifests/kube-apiserver.yaml | grep -i etcd-server # address is wrong
- --etcd-servers=https://12.0.0.1:2379
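The etcd manifest above advertises its client URL on the node's own IP (and a kubeadm local etcd also listens on 127.0.0.1:2379 by default), so etcd is local and 12.0.0.1 is simply wrong. Before editing the API server manifest, it can help to confirm etcd answers locally; a minimal check, assuming etcdctl is installed on the node and the default kubeadm certificate paths:
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/apiserver-etcd-client.crt \
  --key=/etc/kubernetes/pki/apiserver-etcd-client.key \
  endpoint health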
controlplane ~ ➜ sudo cp /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/manifests/kube-apiserver.yaml.bak
controlplane ~ ➜ vi /etc/kubernetes/manifests/kube-apiserver.yaml
controlplane ~ ➜ cat /etc/kubernetes/manifests/kube-apiserver.yaml | grep -i etcd-server
- --etcd-servers=https://127.0.0.1:2379
controlplane ~ ➜ systemctl restart kubelet
controlplane ~ ➜ k get po
No resources found in default namespace.
controlplane ~ ➜ k get po -A
NAMESPACE     NAME                                       READY   STATUS    RESTARTS       AGE
kube-system   calico-kube-controllers-587f6db6c5-shzhf   1/1     Running   8 (7m9s ago)   128m
kube-system   canal-hspzw                                2/2     Running   0              128m
kube-system   coredns-6678bcd974-4wj8b                   1/1     Running   0              128m
kube-system   coredns-6678bcd974-b44nn                   1/1     Running   0              128m
kube-system   etcd-controlplane                          1/1     Running   0              128m
kube-system   kube-apiserver-controlplane                1/1     Running   0              128m
kube-system   kube-controller-manager-controlplane       1/1     Running   1 (18m ago)    128m
kube-system   kube-proxy-nsf5t                           1/1     Running   0              128m
kube-system   kube-scheduler-controlplane                1/1     Running   1 (18m ago)    128m
controlplane ~ ➜
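To close out step 6, it is worth also confirming that the node reports Ready and that the API server passes its own health checks, for example:
kubectl get nodes
kubectl get --raw='/readyz?verbose'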
Other common follow-up fixes, depending on the error
A) If the error is "connection refused" / "no route to host"
- Confirm the etcd endpoint IP and port in the kube-apiserver manifest.
- Ensure etcd is listening on that IP/port (ss -tnlp | grep 2379 on the etcd host).
- If etcd runs on another host, check the firewall, security group, and node network (a quick reachability check follows).
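A minimal reachability sketch, assuming the etcd endpoint is <etcd-ip>:2379 (placeholder) and that ss and nc are available:
# On the etcd host: confirm something is listening on the client port
ss -tnlp | grep 2379
# From the control-plane node: confirm the port is reachable over the network
nc -zv <etcd-ip> 2379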
B) If the error is a TLS / certificate error (x509)
- kube-apiserver authenticates to etcd with client certificates. Confirm the cert files referenced by the kube-apiserver manifest exist and are valid.
- Typical kube-apiserver flags:
  --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt
  --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt
  --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key
- If the etcd server certificate's SANs don't include the endpoint IP/hostname you used, TLS will fail. Either:
  - Use an endpoint whose name/IP is in the etcd cert SANs, or
  - Regenerate the etcd certs to include the desired IP/hostname (not trivial in the exam; prefer the simpler option).
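To see which names and IPs the etcd server certificate actually covers, the SANs can be read with openssl (the path shown is the kubeadm default for a local etcd; adjust it for an external etcd host):
openssl x509 -in /etc/kubernetes/pki/etcd/server.crt -noout -text | grep -A1 'Subject Alternative Name'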
C) If etcd is external and advertises a different client URL
- Copy the etcd advertised/client URL (from the manifest or service unit on the etcd host) into the kube-apiserver manifest's --etcd-servers=... flag.
- Ensure kube-apiserver has the right CA/cert/key to authenticate to that etcd.
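For reference, a minimal sketch of the relevant flags in /etc/kubernetes/manifests/kube-apiserver.yaml, assuming a hypothetical external etcd at 10.0.0.5:2379 and the default kubeadm client-certificate paths:
    - --etcd-servers=https://10.0.0.5:2379            # hypothetical external etcd endpoint
    - --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt
    - --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt
    - --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key
The kubelet watches /etc/kubernetes/manifests, so saving the file is enough to recreate the static pod; restarting the kubelet (as done above) simply forces an immediate reload.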