The Cluster Blew Up, But It Got Fixed

On Monday I came into the office and found the local cluster unreachable; word was that a weekend power outage had shut down the servers. With the environment blown up I couldn't get anything done, and idling all day wasn't an option, so I set out to fix it and get back to work.

  1. First, ssh still worked, so at least the VM itself was fine. Trying docker ps got a response, but the containers were all stuck at pause. Meanwhile, kubectl get node hung with no reply.
    [root@vmw253 ~]# docker ps
    CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
    949258e9135c 37c6aeb3663b "kube-controller-man…" 3 hours ago Up 3 hours k8s_kube-controller-manager_kube-controller-manager-vmw253_kube-system_b53e0e776bdd23d5be83ec232a105a76_1036
    4a3a50cbf1fb registry.cn-beijing.aliyuncs.com/kubesphereio/pause:3.4.1 "/pause" 3 hours ago Up 3 hours k8s_POD_kube-controller-manager-vmw253_kube-system_b53e0e776bdd23d5be83ec232a105a76_11
    89aa599bc953 registry.cn-beijing.aliyuncs.com/kubesphereio/pause:3.4.1 "/pause" 3 hours ago Up About an hour k8s_POD_kube-apiserver-vmw253_kube-system_6aa730f99776180fbac39af544cc5c43_11
    76914665acbd 56c5af1d00b5 "kube-scheduler --au…" 3 hours ago Up 3 hours k8s_kube-scheduler_kube-scheduler-vmw253_kube-system_ceb9697816522e2427509f5707b24a58_1007
    8248bd5e395d registry.cn-beijing.aliyuncs.com/kubesphereio/pause:3.4.1 "/pause" 3 hours ago Up 3 hours k8s_POD_kube-scheduler-vmw253_kube-system_ceb9697816522e2427509f5707b24a58_11
  2. I asked our ops colleague for help. He looked things over and said the api-server was gone, probably because etcd had blown up, and told me to check etcd. Since the cluster had been deployed with a kubesphere script, I had no idea how etcd was started, and after poking around I found no trace of it.
  3. While running ps -ef I happened to catch the scheduled etcd backup task being launched, which suggested etcd was managed by systemctl (a quick way to confirm that follows the listing below). crontab -l gave:
    [root@vmw253 ~]# crontab -l
    */5 * * * * /usr/sbin/ntpdate ntp3.aliyun.com &>/dev/null
    0 3 * * * ps -A -ostat,ppid | grep -e '^[Zz]' | awk '{print $2}' | xargs kill -HUP > /dev/null 2>&1
    */30 * * * * sh /usr/local/bin/kube-scripts/etcd-backup.sh
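    To double-check the systemd hunch, standard systemctl queries are enough; nothing below is specific to this cluster, and the etcd unit name is confirmed by the later steps:

    # Is there an etcd unit file, and does systemd know the service?
    systemctl list-unit-files | grep -i etcd
    systemctl status etcd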
  4. Reading etcd-backup.sh revealed where the backups live; the data directory is pulled from an environment variable:
    # This line declares the backup location
    BACKUP_DIR="/var/backups/kube_etcd/etcd-$(date +%Y-%m-%d-%H-%M-%S)"
    # This line references the data-directory variable
    export ETCDCTL_API=2;$ETCDCTL_PATH backup --data-dir $ETCD_DATA_DIR --backup-dir $BACKUP_DIR
    # This line calls etcdctl to take the backup (important-1)
    {
    export ETCDCTL_API=3;$ETCDCTL_PATH --endpoints="$ENDPOINTS" snapshot save $BACKUP_DIR/snapshot.db \
    --cacert="$ETCDCTL_CA_FILE" \
    --cert="$ETCDCTL_CERT" \
    --key="$ETCDCTL_KEY"
    } > /dev/null
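    As an aside, a v3 snapshot produced by this script can be sanity-checked before betting a recovery on it; snapshot status is listed in the etcdctl help shown further down, and the snapshot path below just follows the BACKUP_DIR naming scheme above:

    # Print the snapshot's hash, revision, key count and size
    export ETCDCTL_API=3
    /usr/local/bin/etcdctl snapshot status \
        /var/backups/kube_etcd/etcd-2023-07-08-00-00-01/snapshot.db -w table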
  5. etcd-backup.sh also showed the backup location being pulled from environment variables; systemctl status etcd pointed to the etcd unit file, and the unit file in turn gave the env file's path (/etc/etcd.env):
    # Active shows running because etcd had already been fixed by the time this was written
    [root@vmw253 ~]# systemctl status etcd
    ● etcd.service - etcd
    Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
    Active: active (running) since 一 2023-07-10 15:06:18 CST; 22min ago
    Main PID: 36781 (etcd)
    Tasks: 25
    Memory: 97.8M
    CGroup: /system.slice/etcd.service
    └─36781 /usr/local/bin/etcd
    [root@vmw253 ~]# cat /etc/systemd/system/etcd.service
    [Unit]
    Description=etcd
    After=network.target

    [Service]
    User=root
    Type=notify
    EnvironmentFile=/etc/etcd.env
    ExecStart=/usr/local/bin/etcd
    NotifyAccess=all
    RestartSec=10s
    LimitNOFILE=40000
    Restart=always

    [Install]
    WantedBy=multi-user.target
  6. Checking the env file yielded the crucial data-directory path (a compact version of this whole trace follows the listing):
    ETCD_DATA_DIR=/var/lib/etcd
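    Putting steps 5 and 6 together, the whole trail from unit file to data directory fits in two commands (systemctl cat should exist on any reasonably recent systemd, though that is an assumption about this host):

    # Show the unit file, including its EnvironmentFile= reference
    systemctl cat etcd | grep EnvironmentFile
    # Then pull the data directory out of the referenced env file
    grep '^ETCD_DATA_DIR' /etc/etcd.env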
  7. Stop etcd and move the data directory out of the way:
    systemctl stop etcd
    mv /var/lib/etcd /var/lib/etcd.bak
  8. The cron-triggered command had already revealed etcd's command-line client, etcdctl. Its --help output turned up a snapshot restore subcommand:
    [root@vmw253 ~]# etcdctl snapshot
    NAME:
    snapshot - Manages etcd node snapshots

    USAGE:
    etcdctl snapshot <subcommand> [flags]

    API VERSION:
    3.4


    COMMANDS:
    restore Restores an etcd member snapshot to an etcd directory
    save Stores an etcd node backend snapshot to a given file
    status Gets backend snapshot status of a given file

    OPTIONS:
    -h, --help[=false] help for snapshot

    GLOBAL OPTIONS:
    --cacert="" verify certificates of TLS-enabled secure servers using this CA bundle
    --cert="" identify secure client using this TLS certificate file
    --command-timeout=5s timeout for short running command (excluding dial timeout)
    --debug[=false] enable client-side debug logging
    --dial-timeout=2s dial timeout for client connections
    -d, --discovery-srv="" domain name to query for SRV records describing cluster endpoints
    --discovery-srv-name="" service name to query when using DNS discovery
    --endpoints=[127.0.0.1:2379] gRPC endpoints
    --hex[=false] print byte strings as hex encoded strings
    --insecure-discovery[=true] accept insecure SRV records describing cluster endpoints
    --insecure-skip-tls-verify[=false] skip server certificate verification (CAUTION: this option should be enabled only for testing purposes)
    --insecure-transport[=true] disable transport security for client connections
    --keepalive-time=2s keepalive time for client connections
    --keepalive-timeout=6s keepalive timeout for client connections
    --key="" identify secure client using this TLS key file
    --password="" password for authentication (if this option is used, --user option shouldn't include password)
    --user="" username[:password] for authentication (prompt if password is not supplied)
    -w, --write-out="simple" set the output format (fields, json, protobuf, simple, table)
  9. From the command in the backup script (important-1) and the hints in the previous step, I pieced the restore command back together:
    /usr/local/bin/etcdctl --endpoints=https://192.168.20.233:2379 snapshot restore /var/backups/kube_etcd/etcd-2023-07-08-00-00-01/snapshot.db --cacert=/etc/ssl/etcd/ssl/ca.pem --cert=/etc/ssl/etcd/ssl/admin-vmw253.pem --key=/etc/ssl/etcd/ssl/admin-vmw253-key.pem
  10. Unfortunately it threw an error saying the data directory already existed. The ops colleague next to me suggested adding the data-dir flag, so I spliced in the path found earlier:
    /usr/local/bin/etcdctl --endpoints=https://192.168.20.233:2379 snapshot restore /var/backups/kube_etcd/etcd-2023-07-08-00-00-01/snapshot.db --cacert=/etc/ssl/etcd/ssl/ca.pem --cert=/etc/ssl/etcd/ssl/admin-vmw253.pem --key=/etc/ssl/etcd/ssl/admin-vmw253-key.pem --data-dir=/var/lib/etcd
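    Taken together with step 7 and the restart in the next step, the recovery condenses into the sequence below; this is a minimal sketch assuming a single-node etcd managed by systemd, with all paths taken from the steps above. Note that snapshot restore only reads the local snapshot file, so the TLS flags in the commands above are accepted but not actually required:

    #!/bin/bash
    set -euo pipefail

    SNAPSHOT=/var/backups/kube_etcd/etcd-2023-07-08-00-00-01/snapshot.db
    DATA_DIR=/var/lib/etcd

    # 1. Stop etcd and move the broken data directory aside
    systemctl stop etcd
    mv "$DATA_DIR" "$DATA_DIR.bak"

    # 2. Rebuild a fresh data directory from the snapshot (offline operation)
    export ETCDCTL_API=3
    /usr/local/bin/etcdctl snapshot restore "$SNAPSHOT" --data-dir "$DATA_DIR"

    # 3. Bring etcd back up
    systemctl start etcd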
  11. Not unfortunate this time; it looked like it worked. I then started etcd and saw nothing obviously wrong. kubectl returned results right away, and the cluster recovered shortly after (a direct etcd health probe is sketched after the output):
    [root@vmw253 ~]# docker ps
    CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
    949258e9135c 37c6aeb3663b "kube-controller-man…" 3 hours ago Up 3 hours k8s_kube-controller-manager_kube-controller-manager-vmw253_kube-system_b53e0e776bdd23d5be83ec232a105a76_1036
    4a3a50cbf1fb registry.cn-beijing.aliyuncs.com/kubesphereio/pause:3.4.1 "/pause" 3 hours ago Up 3 hours k8s_POD_kube-controller-manager-vmw253_kube-system_b53e0e776bdd23d5be83ec232a105a76_11
    89aa599bc953 registry.cn-beijing.aliyuncs.com/kubesphereio/pause:3.4.1 "/pause" 3 hours ago Up 54 minutes k8s_POD_kube-apiserver-vmw253_kube-system_6aa730f99776180fbac39af544cc5c43_11
    76914665acbd 56c5af1d00b5 "kube-scheduler --au…" 3 hours ago Up 3 hours k8s_kube-scheduler_kube-scheduler-vmw253_kube-system_ceb9697816522e2427509f5707b24a58_1007
    8248bd5e395d registry.cn-beijing.aliyuncs.com/kubesphereio/pause:3.4.1 "/pause" 3 hours ago Up 3 hours k8s_POD_kube-scheduler-vmw253_kube-system_ceb9697816522e2427509f5707b24a58_11
    [root@vmw253 ~]# systemctl start etcd
    [root@vmw253 ~]# kubectl get node
    NAME STATUS ROLES AGE VERSION
    vmw253 Ready control-plane,master,worker 525d v1.23.0
    vmw254 Ready worker 525d v1.23.0
    vmw255 Ready worker 525d v1.23.0
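    Besides kubectl, etcd itself can be probed directly; endpoint health is a stock etcdctl subcommand, reusing the endpoint and certificate paths from the restore command:

    # Expect "... is healthy: successfully committed proposal" on success
    export ETCDCTL_API=3
    /usr/local/bin/etcdctl --endpoints=https://192.168.20.233:2379 \
        --cacert=/etc/ssl/etcd/ssl/ca.pem \
        --cert=/etc/ssl/etcd/ssl/admin-vmw253.pem \
        --key=/etc/ssl/etcd/ssl/admin-vmw253-key.pem \
        endpoint health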

It actually got fixed. Truly cause for celebration :-D

Author: Daizc
Permalink: https://note.bequick.run/%E9%9B%86%E7%BE%A4%E7%88%86%E7%82%B8%EF%BC%8C%E4%BD%86%E6%98%AF%E4%BF%AE%E5%A5%BD%E4%BA%86/
Copyright notice: this article is licensed under CC BY-NC-SA 3.0 CN.