The Cluster Blew Up, But We Fixed It
Came into the office on Monday and found the local cluster unreachable, supposedly because the servers were shut down by a weekend power outage. With the environment dead there was nothing to do but sit around idle... which was not an option, so time to fix it and get back to work.
- First check: ssh still worked, so at least the VMs were fine. Tried docker ps: it responded, but the containers all seemed stuck on pause, and kubectl get node hung with no response:
```
[root@vmw253 ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
949258e9135c 37c6aeb3663b "kube-controller-man…" 3 hours ago Up 3 hours k8s_kube-controller-manager_kube-controller-manager-vmw253_kube-system_b53e0e776bdd23d5be83ec232a105a76_1036
4a3a50cbf1fb registry.cn-beijing.aliyuncs.com/kubesphereio/pause:3.4.1 "/pause" 3 hours ago Up 3 hours k8s_POD_kube-controller-manager-vmw253_kube-system_b53e0e776bdd23d5be83ec232a105a76_11
89aa599bc953 registry.cn-beijing.aliyuncs.com/kubesphereio/pause:3.4.1 "/pause" 3 hours ago Up About an hour k8s_POD_kube-apiserver-vmw253_kube-system_6aa730f99776180fbac39af544cc5c43_11
76914665acbd 56c5af1d00b5 "kube-scheduler --au…" 3 hours ago Up 3 hours k8s_kube-scheduler_kube-scheduler-vmw253_kube-system_ceb9697816522e2427509f5707b24a58_1007
8248bd5e395d registry.cn-beijing.aliyuncs.com/kubesphereio/pause:3.4.1 "/pause" 3 hours ago Up 3 hours k8s_POD_kube-scheduler-vmw253_kube-system_ceb9697816522e2427509f5707b24a58_11
```
- Asked the ops guy for help. He took one look and said the api-server was gone, probably an etcd blow-up, and told me to check etcd. The cluster was deployed with a kubesphere script, so I had no idea how etcd was even started; after searching around, no trace of it.
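Before digging further, here are a few generic triage commands that could help at this point. This is a sketch, not from the original session, and it assumes a kubeadm/kubesphere-style node where kubelet runs under systemd:

```bash
# Is kubelet itself alive, and what is it complaining about?
systemctl status kubelet --no-pager
journalctl -u kubelet -n 50 --no-pager

# A crashed apiserver container only shows up with -a
docker ps -a | grep kube-apiserver

# Is anything listening on etcd's client port?
ss -tlnp | grep 2379
```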
- While running ps -ef I happened to catch a scheduled job pulling up an etcd backup task, which told me etcd is most likely managed by systemctl. crontab -l showed the following:
```
[root@vmw253 ~]# crontab -l
*/5 * * * * /usr/sbin/ntpdate ntp3.aliyun.com &>/dev/null
0 3 * * * ps -A -ostat,ppid | grep -e '^[Zz]' | awk '{print $2}' | xargs kill -HUP > /dev/null 2>&1
*/30 * * * * sh /usr/local/bin/kube-scripts/etcd-backup.sh
```
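Side note: once the daemon is actually running, the inference above can be confirmed directly, since systemctl status accepts a PID and names the unit that owns it. A hypothetical check (pgrep here assumes the binary is literally named etcd):

```bash
# Ask systemd which unit owns the etcd process
ETCD_PID=$(pgrep -x etcd)
systemctl status "$ETCD_PID" --no-pager
```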
- Inspected etcd-backup.sh and found where the backups live; the script pulls the data directory from an environment variable:
```bash
# declares the backup location
BACKUP_DIR="/var/backups/kube_etcd/etcd-$(date +%Y-%m-%d-%H-%M-%S)"
# references the data-dir variable
export ETCDCTL_API=2;$ETCDCTL_PATH backup --data-dir $ETCD_DATA_DIR --backup-dir $BACKUP_DIR
# calls etcdctl to take the backup (important-1)
{
export ETCDCTL_API=3;$ETCDCTL_PATH --endpoints="$ENDPOINTS" snapshot save $BACKUP_DIR/snapshot.db \
--cacert="$ETCDCTL_CA_FILE" \
--cert="$ETCDCTL_CERT" \
--key="$ETCDCTL_KEY"
} > /dev/null
```
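Side note: before relying on a backup, it seems prudent to check that the snapshot file is intact. A sketch using the layout above (the LATEST variable is mine, not the script's):

```bash
# Inspect the newest snapshot: hash, revision, total keys, size
LATEST=$(ls -td /var/backups/kube_etcd/etcd-* | head -1)
ETCDCTL_API=3 etcdctl snapshot status "$LATEST/snapshot.db" -w table
```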
- etcd-backup.sh takes those paths from environment variables, so next I pulled the etcd config out of systemctl status etcd, and that config in turn gave the env file path (/etc/etcd.env):
```
# Active shows running because etcd had already been fixed by the time I wrote this up
[root@vmw253 ~]# systemctl status etcd
● etcd.service - etcd
Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2023-07-10 15:06:18 CST; 22min ago
Main PID: 36781 (etcd)
Tasks: 25
Memory: 97.8M
CGroup: /system.slice/etcd.service
└─36781 /usr/local/bin/etcd
[root@vmw253 ~]# cat /etc/systemd/system/etcd.service
[Unit]
Description=etcd
After=network.target
[Service]
User=root
Type=notify
EnvironmentFile=/etc/etcd.env
ExecStart=/usr/local/bin/etcd
NotifyAccess=all
RestartSec=10s
LimitNOFILE=40000
Restart=always
[Install]
WantedBy=multi-user.target
```
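In hindsight there is a shorter route to the same information; systemctl can print a unit file without anyone hunting down its path:

```bash
# Dump the unit file (plus any drop-ins), then read the env file it references
systemctl cat etcd
grep ETCD_DATA_DIR /etc/etcd.env
```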
- Checked the env file and got the crucial data directory path:
```
ETCD_DATA_DIR=/var/lib/etcd
```
- Stopped etcd and moved the data directory out of the way:
```bash
systemctl stop etcd
mv /var/lib/etcd /var/lib/etcd.bak
```
- The cron job had already revealed etcd's command-line client, etcdctl. Running it with --help to see its usage turned up a snapshot restore subcommand:
```
[root@vmw253 ~]# etcdctl snapshot
NAME:
snapshot - Manages etcd node snapshots
USAGE:
etcdctl snapshot <subcommand> [flags]
API VERSION:
3.4
COMMANDS:
restore Restores an etcd member snapshot to an etcd directory
save Stores an etcd node backend snapshot to a given file
status Gets backend snapshot status of a given file
OPTIONS:
-h, --help[=false] help for snapshot
GLOBAL OPTIONS:
--cacert="" verify certificates of TLS-enabled secure servers using this CA bundle
--cert="" identify secure client using this TLS certificate file
--command-timeout=5s timeout for short running command (excluding dial timeout)
--debug[=false] enable client-side debug logging
--dial-timeout=2s dial timeout for client connections
-d, --discovery-srv="" domain name to query for SRV records describing cluster endpoints
--discovery-srv-name="" service name to query when using DNS discovery
--endpoints=[127.0.0.1:2379] gRPC endpoints
--hex[=false] print byte strings as hex encoded strings
--insecure-discovery[=true] accept insecure SRV records describing cluster endpoints
--insecure-skip-tls-verify[=false] skip server certificate verification (CAUTION: this option should be enabled only for testing purposes)
--insecure-transport[=true] disable transport security for client connections
--keepalive-time=2s keepalive time for client connections
--keepalive-timeout=6s keepalive timeout for client connections
--key="" identify secure client using this TLS key file
--password="" password for authentication (if this option is used, --user option shouldn't include password)
--user="" username[:password] for authentication (prompt if password is not supplied)
-w, --write-out="simple" set the output format (fields, json, protobuf, simple, table)
```
- Using the command from the backup script (important-1) and the hints from the previous step, I pieced together a restore command in reverse:
```bash
/usr/local/bin/etcdctl --endpoints=https://192.168.20.233:2379 snapshot restore /var/backups/kube_etcd/etcd-2023-07-08-00-00-01/snapshot.db --cacert=/etc/ssl/etcd/ssl/ca.pem --cert=/etc/ssl/etcd/ssl/admin-vmw253.pem --key=/etc/ssl/etcd/ssl/admin-vmw253-key.pem
```
- Sadly it errored out, complaining that the data directory already exists. The ops guy next to me suggested adding the data-dir parameter, so I spliced in the path obtained earlier:
```bash
/usr/local/bin/etcdctl --endpoints=https://192.168.20.233:2379 snapshot restore /var/backups/kube_etcd/etcd-2023-07-08-00-00-01/snapshot.db --cacert=/etc/ssl/etcd/ssl/ca.pem --cert=/etc/ssl/etcd/ssl/admin-vmw253.pem --key=/etc/ssl/etcd/ssl/admin-vmw253-key.pem --data-dir=/var/lib/etcd
```
- No regrets this time; it looked like a success. Starting etcd showed nothing obviously wrong, kubectl returned results immediately, and the cluster recovered soon after:
```
[root@vmw253 ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
949258e9135c 37c6aeb3663b "kube-controller-man…" 3 hours ago Up 3 hours k8s_kube-controller-manager_kube-controller-manager-vmw253_kube-system_b53e0e776bdd23d5be83ec232a105a76_1036
4a3a50cbf1fb registry.cn-beijing.aliyuncs.com/kubesphereio/pause:3.4.1 "/pause" 3 hours ago Up 3 hours k8s_POD_kube-controller-manager-vmw253_kube-system_b53e0e776bdd23d5be83ec232a105a76_11
89aa599bc953 registry.cn-beijing.aliyuncs.com/kubesphereio/pause:3.4.1 "/pause" 3 hours ago Up 54 minutes k8s_POD_kube-apiserver-vmw253_kube-system_6aa730f99776180fbac39af544cc5c43_11
76914665acbd 56c5af1d00b5 "kube-scheduler --au…" 3 hours ago Up 3 hours k8s_kube-scheduler_kube-scheduler-vmw253_kube-system_ceb9697816522e2427509f5707b24a58_1007
8248bd5e395d registry.cn-beijing.aliyuncs.com/kubesphereio/pause:3.4.1 "/pause" 3 hours ago Up 3 hours k8s_POD_kube-scheduler-vmw253_kube-system_ceb9697816522e2427509f5707b24a58_11
[root@vmw253 ~]# systemctl start etcd
[root@vmw253 ~]# kubectl get node
NAME STATUS ROLES AGE VERSION
vmw253 Ready control-plane,master,worker 525d v1.23.0
vmw254 Ready worker 525d v1.23.0
vmw255 Ready worker 525d v1.23.0
```
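For posterity, here is the whole fix condensed into one script. This is my own after-the-fact reconstruction using the paths and cert names from this machine; adjust the snapshot date and certificates for any other cluster:

```bash
#!/usr/bin/env bash
set -euo pipefail

SNAP=/var/backups/kube_etcd/etcd-2023-07-08-00-00-01/snapshot.db
DATA_DIR=/var/lib/etcd

systemctl stop etcd
mv "$DATA_DIR" "$DATA_DIR.bak"     # keep the broken data dir as a fallback

ETCDCTL_API=3 /usr/local/bin/etcdctl \
    --endpoints=https://192.168.20.233:2379 \
    snapshot restore "$SNAP" \
    --cacert=/etc/ssl/etcd/ssl/ca.pem \
    --cert=/etc/ssl/etcd/ssl/admin-vmw253.pem \
    --key=/etc/ssl/etcd/ssl/admin-vmw253-key.pem \
    --data-dir="$DATA_DIR"

systemctl start etcd
kubectl get node                   # nodes should report Ready again
```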
It actually got fixed. Truly a cause for celebration :-D