Kubernetes: Handling Errors When Re-joining a Deleted Node

Re-joining a node that was previously deleted fails with one of these errors:

error execution phase check-etcd: etcd cluster is not healthy: failed to dial endpoint https://192.168.123.21:2379 with maintenance client: context deadline exceeded

error execution phase check-etcd: error syncing endpoints with etc: dial tcp 172.31.182.152:2379: connect: connection refused

Solution:

1. Remove the etcd node that no longer exists from the kubeadm-config ConfigMap:

kubectl edit configmaps -n kube-system kubeadm-config

# Delete the entry under apiEndpoints for the node that no longer exists (master1 in this example):
    apiEndpoints:
      master1: # delete
        advertiseAddress: 172.16.11.10 # delete
        bindPort: 6443 # delete
      master2:
        advertiseAddress: 172.16.11.14
        bindPort: 6443
      master3:
        advertiseAddress: 172.16.11.15
        bindPort: 6443

# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
data:
  ClusterConfiguration: |
    apiServer:
      certSANs:
      - 127.0.0.1
      - apiserver.cluster.local
      - 172.16.11.10
      - 10.103.97.2
      extraArgs:
        authorization-mode: Node,RBAC
        feature-gates: TTLAfterFinished=true
      extraVolumes:
      - hostPath: /etc/localtime
        mountPath: /etc/localtime
        name: localtime
        pathType: File
        readOnly: true
      timeoutForControlPlane: 4m0s
    apiVersion: kubeadm.k8s.io/v1beta2
    certificatesDir: /etc/kubernetes/pki
    clusterName: kubernetes
    controlPlaneEndpoint: apiserver.cluster.local:6443
    controllerManager:
      extraArgs:
        experimental-cluster-signing-duration: 876000h
        feature-gates: TTLAfterFinished=true
      extraVolumes:
      - hostPath: /etc/localtime
        mountPath: /etc/localtime
        name: localtime
        pathType: File
        readOnly: true
    dns:
      type: CoreDNS
    etcd:
      local:
        dataDir: /var/lib/etcd
    imageRepository: k8s.gcr.io
    kind: ClusterConfiguration
    kubernetesVersion: v1.17.3
    networking:
      dnsDomain: cluster.local
      podSubnet: 100.64.0.0/10
      serviceSubnet: 10.96.0.0/12
    scheduler:
      extraArgs:
        feature-gates: TTLAfterFinished=true
      extraVolumes:
      - hostPath: /etc/localtime
        mountPath: /etc/localtime
        name: localtime
        pathType: File
        readOnly: true
  ClusterStatus: |
    apiEndpoints:
      master1:
        advertiseAddress: 172.16.11.10
        bindPort: 6443
      master2:
        advertiseAddress: 172.16.11.14
        bindPort: 6443
      master3:
        advertiseAddress: 172.16.11.15
        bindPort: 6443
    apiVersion: kubeadm.k8s.io/v1beta2
    kind: ClusterStatus
kind: ConfigMap
metadata:
  creationTimestamp: "2022-02-12T09:09:18Z"
  name: kubeadm-config
  namespace: kube-system
  resourceVersion: "1308389"
  selfLink: /api/v1/namespaces/kube-system/configmaps/kubeadm-config
  uid: 6a9e5249-af69-4e01-9231-945e1a236a42
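For a scripted alternative to the manual edit, the following is a minimal sketch of the same deletion. The helper name remove_api_endpoint is hypothetical, and it assumes the ClusterStatus text has the layout shown above: node names indented two spaces under apiEndpoints, their fields indented four. It only prints the edited text; review it before applying anything to a real cluster.

```shell
# Hypothetical helper: read a ClusterStatus document on stdin and print it
# without the apiEndpoints entry for the node named in $1.
remove_api_endpoint() {
  awk -v node="$1" '
    # A node header is indented exactly two spaces, e.g. "  master1:".
    /^  [^ ].*:$/ { skip = ($0 == "  " node ":") }
    # Any top-level key ends the entry that is being skipped.
    /^[^ ]/       { skip = 0 }
    !skip         { print }
  '
}

# Possible usage on a machine with cluster access (prints the edited text only):
#   kubectl -n kube-system get configmap kubeadm-config \
#     -o jsonpath='{.data.ClusterStatus}' | remove_api_endpoint master1
```

Printing rather than patching keeps the sketch harmless: the ConfigMap itself is changed only when the result is saved by hand with kubectl edit.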

2. Remove the stale member from the etcd cluster

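My cluster was built with kubeadm, so etcd runs as a pod on every master node, one instance per control-plane node. When node k8s-001 was deleted, the etcd cluster did not automatically remove the etcd member that had been running on it, so the member has to be removed by hand.
Note that you first need to enter an etcd pod (use the name of an etcd pod that is still running).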

kubectl exec -it etcd-master1 -n kube-system -- sh

Run the following inside the container:

export ETCDCTL_API=3
alias etcdctl='etcdctl --endpoints=https://172.31.182.153:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key'
/ # etcdctl member list
ceb6b1f4369e9ecc, started, cn-hongkong.i-j6caps6av1mtyxyofmrx, https://172.31.182.154:2380, https://172.31.182.154:2379
d4322ce19cc3f8da, started, cn-hongkong.i-j6caps6av1mtyxyofmrw, https://172.31.182.152:2380, https://172.31.182.152:2379
d598f7eabefcc101, started, cn-hongkong.i-j6caps6av1mtyxyofmry, https://172.31.182.153:2380, https://172.31.182.153:2379
# Remove the member that belongs to the deleted node
/ # etcdctl member remove d4322ce19cc3f8da
Member d4322ce19cc3f8da removed from cluster ed812b9f85d5bcd7
/ # etcdctl member list
ceb6b1f4369e9ecc, started, cn-hongkong.i-j6caps6av1mtyxyofmrx, https://172.31.182.154:2380, https://172.31.182.154:2379
d598f7eabefcc101, started, cn-hongkong.i-j6caps6av1mtyxyofmry, https://172.31.182.153:2380, https://172.31.182.153:2379
# After the node has re-joined successfully, it is listed again with a new member ID
/ # etcdctl member list
cd4e1e075b1904b2, started, cn-hongkong.i-j6caps6av1mtyxyofmrw, https://172.31.182.152:2380, https://172.31.182.152:2379
ceb6b1f4369e9ecc, started, cn-hongkong.i-j6caps6av1mtyxyofmrx, https://172.31.182.154:2380, https://172.31.182.154:2379
d598f7eabefcc101, started, cn-hongkong.i-j6caps6av1mtyxyofmry, https://172.31.182.153:2380, https://172.31.182.153:2379
/ # exit
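Picking the member ID out of the list by eye is error-prone, so here is a minimal sketch that looks it up by IP address. The helper name member_id_for_ip is hypothetical, and it assumes the comma-separated output format shown in the transcript above, where the fourth field is the peer URL on port 2380.

```shell
# Hypothetical helper: read "etcdctl member list" output on stdin and print
# the ID of the member whose peer URL is https://<ip>:2380.
member_id_for_ip() {
  awk -F', ' -v url="https://$1:2380" '$4 == url { print $1 }'
}

# Possible usage inside the etcd container, with the same etcdctl flags as above:
#   id=$(etcdctl member list | member_id_for_ip 172.31.182.152)
#   etcdctl member remove "$id"
```

Matching on the full peer URL rather than a substring avoids removing the wrong member when one address is a prefix of another.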

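Finally, after every failed kubeadm join you must run kubeadm reset on the node before running kubeadm join again; otherwise the next join will not succeed.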
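The reset-then-join order can be wrapped in a small function so the reset is never forgotten. This is only a sketch: reset_and_join and the KUBEADM variable are hypothetical names introduced here, and -f is the kubeadm reset flag that skips the confirmation prompt.

```shell
# Binary to call; overridable so the wrapper can be dry-run with e.g. KUBEADM=echo.
KUBEADM="${KUBEADM:-kubeadm}"

# Hypothetical wrapper: clean up the previous attempt, then join with the
# arguments passed in. Stops if the reset itself fails.
reset_and_join() {
  "$KUBEADM" reset -f || return 1
  "$KUBEADM" join "$@"
}

# Possible usage on the node that is being re-joined:
#   reset_and_join <endpoint> --token <token> \
#     --discovery-token-ca-cert-hash <hash> --control-plane --certificate-key <key>
```

Note that kubeadm reset wipes the local kubeadm state on the node, so run the wrapper only on the node that is re-joining, never on a healthy control-plane node.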

Error during join:

error execution phase control-plane-prepare/download-certs

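The control-plane certificates uploaded for joining have expired. They are kept for two hours by default, so they have to be regenerated and uploaded again.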

On an existing control-plane node, run:

kubeadm init phase upload-certs --upload-certs

Use the key it prints as the new value of --certificate-key.

Example:

kubeadm join 172.31.182.153:6443 --token vauo7d.d40khbya379q7bk4 --discovery-token-ca-cert-hash sha256:139ff25e1af59d940089f85614bd02066dfbe6bee937b087f0cc7896e24d8e54 --control-plane --certificate-key 2d0f05294f03306f7867c27b11c2d73c5ebef4413a8369e5cc03bf9abe53b836
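To avoid copying the key by hand, it can be filtered out of the upload-certs output. This is a minimal sketch: extract_cert_key is a hypothetical helper, and it assumes the key appears on a line of its own as 64 lowercase hexadecimal characters, as in the example above.

```shell
# Hypothetical helper: read command output on stdin and print the last line
# that consists solely of 64 lowercase hex characters (the certificate key).
extract_cert_key() {
  grep -E '^[0-9a-f]{64}$' | tail -n 1
}

# Possible usage on an existing control-plane node:
#   key=$(kubeadm init phase upload-certs --upload-certs | extract_cert_key)
#   echo "--certificate-key $key"
```

Filtering on the shape of the key, rather than on a fixed line position, keeps the helper working even if extra log lines are printed around it.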

To skip preflight checks, list their names in --ignore-preflight-errors (all skips every check):

kubeadm join 192.168.11.52:6443 --token vauo7d.d40khbya379q7bk4 --discovery-token-ca-cert-hash sha256:139ff25e1af59d940089f85614bd02066dfbe6bee937b087f0cc7896e24d8e54 --control-plane --certificate-key 2d0f05294f03306f7867c27b11c2d73c5ebef4413a8369e5cc03bf9abe53b836 --ignore-preflight-errors all
