Kubernetes: Handling Errors When a Deleted Node Rejoins the Cluster
LiuSw Lv6

After a node has been deleted, rejoining it fails with errors such as:

error execution phase check-etcd: etcd cluster is not healthy: failed to dial endpoint https://192.168.123.21:2379 with maintenance client: context deadline exceeded

error execution phase check-etcd: error syncing endpoints with etc: dial tcp 172.31.182.152:2379: connect: connection refused

Solution:

1. Remove the entry for the node that no longer exists from the kubeadm-config ConfigMap:

kubectl edit configmaps -n kube-system kubeadm-config

# Under apiEndpoints, delete the entry for the node that no longer exists (master1 in this example):
apiEndpoints:
  master1:                            # delete
    advertiseAddress: 172.16.11.10    # delete
    bindPort: 6443                    # delete
  master2:
    advertiseAddress: 172.16.11.14
    bindPort: 6443
  master3:
    advertiseAddress: 172.16.11.15
    bindPort: 6443
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
data:
  ClusterConfiguration: |
    apiServer:
      certSANs:
      - 127.0.0.1
      - apiserver.cluster.local
      - 172.16.11.10
      - 10.103.97.2
      extraArgs:
        authorization-mode: Node,RBAC
        feature-gates: TTLAfterFinished=true
      extraVolumes:
      - hostPath: /etc/localtime
        mountPath: /etc/localtime
        name: localtime
        pathType: File
        readOnly: true
      timeoutForControlPlane: 4m0s
    apiVersion: kubeadm.k8s.io/v1beta2
    certificatesDir: /etc/kubernetes/pki
    clusterName: kubernetes
    controlPlaneEndpoint: apiserver.cluster.local:6443
    controllerManager:
      extraArgs:
        experimental-cluster-signing-duration: 876000h
        feature-gates: TTLAfterFinished=true
      extraVolumes:
      - hostPath: /etc/localtime
        mountPath: /etc/localtime
        name: localtime
        pathType: File
        readOnly: true
    dns:
      type: CoreDNS
    etcd:
      local:
        dataDir: /var/lib/etcd
    imageRepository: k8s.gcr.io
    kind: ClusterConfiguration
    kubernetesVersion: v1.17.3
    networking:
      dnsDomain: cluster.local
      podSubnet: 100.64.0.0/10
      serviceSubnet: 10.96.0.0/12
    scheduler:
      extraArgs:
        feature-gates: TTLAfterFinished=true
      extraVolumes:
      - hostPath: /etc/localtime
        mountPath: /etc/localtime
        name: localtime
        pathType: File
        readOnly: true
  ClusterStatus: |
    apiEndpoints:
      master1:
        advertiseAddress: 172.16.11.10
        bindPort: 6443
      master2:
        advertiseAddress: 172.16.11.14
        bindPort: 6443
      master3:
        advertiseAddress: 172.16.11.15
        bindPort: 6443
    apiVersion: kubeadm.k8s.io/v1beta2
    kind: ClusterStatus
kind: ConfigMap
metadata:
  creationTimestamp: "2022-02-12T09:09:18Z"
  name: kubeadm-config
  namespace: kube-system
  resourceVersion: "1308389"
  selfLink: /api/v1/namespaces/kube-system/configmaps/kubeadm-config
  uid: 6a9e5249-af69-4e01-9231-945e1a236a42
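
The edit from step 1 can be previewed as a plain text transformation before touching the ConfigMap. This is only a rough sketch and not part of the original procedure: `drop_endpoint` is a hypothetical helper name, and it assumes the ClusterStatus text is indented with two spaces per level, which is how kubectl normally prints it.

```shell
# drop_endpoint NAME
# Reads ClusterStatus YAML on stdin and prints it without the apiEndpoints
# entry called NAME. Assumption: node keys sit at two spaces of indentation
# and their fields at four spaces.
drop_endpoint() {
  awk -v node="$1" '
    # The key line of the node to drop: start skipping.
    $0 == ("  " node ":") { skip = 1; next }
    # Lines indented deeper than the key belong to it: drop them too.
    skip && /^    / { next }
    # Anything else ends the skipped entry and is printed unchanged.
    { skip = 0; print }
  '
}
```

To preview the result on a live cluster, the current text can be piped in with `kubectl -n kube-system get configmap kubeadm-config -o jsonpath='{.data.ClusterStatus}' | drop_endpoint master1`. The sketch only prints the trimmed text; the change itself is still made with `kubectl edit` as described above.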

2. Remove the stale member from the etcd cluster

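Because this cluster was built with kubeadm, etcd runs as a pod on every master node, one instance per control-plane node. When the k8s-001 node was deleted, the etcd cluster did not automatically remove that node's etcd member, so it has to be removed by hand.
Note that you first need to open a shell inside an etcd pod; use the pod on a control-plane node that is still healthy.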

kubectl exec -it -n kube-system etcd-master1 -- sh

Run the following inside the container:

export ETCDCTL_API=3
alias etcdctl='etcdctl --endpoints=https://172.31.182.153:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key'
/ # etcdctl member list
ceb6b1f4369e9ecc, started, cn-hongkong.i-j6caps6av1mtyxyofmrx, https://172.31.182.154:2380, https://172.31.182.154:2379
d4322ce19cc3f8da, started, cn-hongkong.i-j6caps6av1mtyxyofmrw, https://172.31.182.152:2380, https://172.31.182.152:2379
d598f7eabefcc101, started, cn-hongkong.i-j6caps6av1mtyxyofmry, https://172.31.182.153:2380, https://172.31.182.153:2379
# Remove the member that belongs to the deleted node
/ # etcdctl member remove d4322ce19cc3f8da
Member d4322ce19cc3f8da removed from cluster ed812b9f85d5bcd7
/ # etcdctl member list
ceb6b1f4369e9ecc, started, cn-hongkong.i-j6caps6av1mtyxyofmrx, https://172.31.182.154:2380, https://172.31.182.154:2379
d598f7eabefcc101, started, cn-hongkong.i-j6caps6av1mtyxyofmry, https://172.31.182.153:2380, https://172.31.182.153:2379
# After the node has rejoined the cluster, it is listed again under a new member ID
/ # etcdctl member list
cd4e1e075b1904b2, started, cn-hongkong.i-j6caps6av1mtyxyofmrw, https://172.31.182.152:2380, https://172.31.182.152:2379
ceb6b1f4369e9ecc, started, cn-hongkong.i-j6caps6av1mtyxyofmrx, https://172.31.182.154:2380, https://172.31.182.154:2379
d598f7eabefcc101, started, cn-hongkong.i-j6caps6av1mtyxyofmry, https://172.31.182.153:2380, https://172.31.182.153:2379
/ # exit
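
Picking the right member ID out of the list by eye is error-prone when the IDs look alike. The lookup can be sketched as follows, assuming the default comma-separated output of `etcdctl member list` shown above, where the third column is the member name. `member_id_for` is a hypothetical helper name, not an etcdctl feature.

```shell
# member_id_for NAME
# Reads `etcdctl member list` output on stdin and prints the ID (first
# column) of the member whose name (third column) equals NAME.
member_id_for() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}
```

Inside the etcd container this could be used as `etcdctl member list | member_id_for cn-hongkong.i-j6caps6av1mtyxyofmrw`, and the printed ID is then passed to `etcdctl member remove`. An empty result means no member with that name exists, in which case nothing should be removed.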

Finally, after every failed kubeadm join, run kubeadm reset on the node before trying again; only then will the next kubeadm join succeed.

Error during join

error execution phase control-plane-prepare/download-certs

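The control-plane certificates uploaded to the cluster have expired. They are kept for two hours by default, so they have to be generated and uploaded again.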

On an existing control-plane node, run:

kubeadm init phase upload-certs --upload-certs

Use the certificate key it prints as the new value of --certificate-key.

Example:

kubeadm join 172.31.182.153:6443 --token vauo7d.d40khbya379q7bk4 --discovery-token-ca-cert-hash sha256:139ff25e1af59d940089f85614bd02066dfbe6bee937b087f0cc7896e24d8e54 --control-plane --certificate-key 2d0f05294f03306f7867c27b11c2d73c5ebef4413a8369e5cc03bf9abe53b836
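
Copying the new key by hand can be avoided by filtering the command output. The sketch below assumes that upload-certs prints the key on a line of its own as 64 lowercase hexadecimal characters, which matches the key in the example above; `extract_cert_key` is a hypothetical helper name.

```shell
# extract_cert_key
# Reads the output of `kubeadm init phase upload-certs --upload-certs` on
# stdin and prints the certificate key. Assumption: the key is the only
# line made up of exactly 64 lowercase hex characters.
extract_cert_key() {
  grep -E '^[0-9a-f]{64}$' | tail -n 1
}
```

On an existing control-plane node this could be used as `key=$(kubeadm init phase upload-certs --upload-certs | extract_cert_key)`, after which `$key` is passed to `--certificate-key` in the join command. If the output format differs on your kubeadm version, the variable will be empty and the key has to be copied manually.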

If some preflight checks need to be skipped, pass their names to --ignore-preflight-errors (all skips every check):

kubeadm join 192.168.11.52:6443 --token vauo7d.d40khbya379q7bk4 --discovery-token-ca-cert-hash sha256:139ff25e1af59d940089f85614bd02066dfbe6bee937b087f0cc7896e24d8e54 --control-plane --certificate-key 2d0f05294f03306f7867c27b11c2d73c5ebef4413a8369e5cc03bf9abe53b836 --ignore-preflight-errors all
