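Troubleshooting K8S Node Failures
1. Introduction
The kubectl command line can be used for a first-pass diagnosis of K8S node failures, and the approach applies to almost any K8S cluster.
2. Troubleshooting Steps
Check the Grafana kubelet dashboard to find when the node became NotReady, then list the nodes to confirm which one is affected: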

# kubectl get nodes
NAME          STATUS     ROLES    AGE   VERSION
172.16.80.4   Ready      <none>   18h   v1.20.8
172.16.80.6   NotReady   <none>   18h   v1.20.8
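On a cluster with many nodes, scanning this list by eye is error-prone. The following is a minimal sketch of a filter for it. The function name not_ready_nodes is made up for illustration, and the sketch assumes the default column layout of kubectl get nodes --no-headers (node name in column 1, status in column 2).

```shell
# Print the names of nodes whose STATUS does not start with "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# A status such as "Ready,SchedulingDisabled" still counts as Ready here.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage (on a machine with cluster access):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Each printed name can then be passed to the describe and get commands in the following steps.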
Run kubectl describe node 172.16.80.6 and look for abnormal events and conditions:
Conditions:
  Type                 Status   LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----                 ------   -----------------                 ------------------                ------              -------
  NetworkUnavailable   False    Mon, 13 Jun 2022 19:41:10 +0800   Mon, 13 Jun 2022 19:41:10 +0800   RouteCreated        CCE RouteController created a route
  MemoryPressure       Unknown  Tue, 14 Jun 2022 14:08:00 +0800   Tue, 14 Jun 2022 14:09:36 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure         Unknown  Tue, 14 Jun 2022 14:08:00 +0800   Tue, 14 Jun 2022 14:09:36 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure          Unknown  Tue, 14 Jun 2022 14:08:00 +0800   Tue, 14 Jun 2022 14:09:36 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready                Unknown  Tue, 14 Jun 2022 14:08:00 +0800   Tue, 14 Jun 2022 14:09:36 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
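The describe output also carries an Events section, but it is easy to lose in the rest of the text. As a hedged sketch, the events for one node can be queried directly. The helper name node_events is hypothetical; the kubectl flags it wraps are standard. Note that the API server keeps events only for a limited time (one hour by default), so the list may be empty for an older incident.

```shell
# List events whose involved object is the given node, oldest first.
# node_events is a hypothetical helper name used for illustration only.
node_events() {
  kubectl get events --all-namespaces \
    --field-selector "involvedObject.kind=Node,involvedObject.name=$1" \
    --sort-by=.lastTimestamp
}

# Usage:
#   node_events 172.16.80.6
```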
Run kubectl get node 172.16.80.6 -o yaml to view the node conditions:
  conditions:
  - lastHeartbeatTime: "2022-06-13T11:41:10Z"
    lastTransitionTime: "2022-06-13T11:41:10Z"
    message: CCE RouteController created a route
    reason: RouteCreated
    status: "False"
    type: NetworkUnavailable
  - lastHeartbeatTime: "2022-06-14T06:08:00Z"
    lastTransitionTime: "2022-06-14T06:09:36Z"
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: MemoryPressure
  - lastHeartbeatTime: "2022-06-14T06:08:00Z"
    lastTransitionTime: "2022-06-14T06:09:36Z"
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: DiskPressure
  - lastHeartbeatTime: "2022-06-14T06:08:00Z"
    lastTransitionTime: "2022-06-14T06:09:36Z"
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: PIDPressure
  - lastHeartbeatTime: "2022-06-14T06:08:00Z"
    lastTransitionTime: "2022-06-14T06:09:36Z"
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: Ready
Log in to the node and check the kubelet log, starting shortly before the last heartbeat:
journalctl -u kubelet --since="2022-06-14 14:00:00" | less
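The --since value above was typed by hand from the node conditions. As a rough sketch, it can instead be derived from the Ready condition's last heartbeat. The function names node_conditions and ready_heartbeat_utc are invented for this example. The sketch keeps the timestamp in UTC and assumes a systemd version whose journalctl accepts a trailing "UTC" in time specifications.

```shell
# Print "type<TAB>status<TAB>lastHeartbeatTime" for each condition of a node.
# Needs cluster access; node_conditions is a hypothetical helper name.
node_conditions() {
  kubectl get node "$1" -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.lastHeartbeatTime}{"\n"}{end}'
}

# Read that table on stdin and print the Ready condition's last heartbeat,
# rewritten from "2022-06-14T06:08:00Z" to "2022-06-14 06:08:00 UTC".
ready_heartbeat_utc() {
  awk -F '\t' '$1 == "Ready" { print $3 }' | sed -e 's/T/ /' -e 's/Z$/ UTC/'
}

# Usage:
#   node_conditions 172.16.80.6 | ready_heartbeat_utc     # where kubectl works
#   journalctl -u kubelet --since "<printed value>" | less  # on the node
```

The heartbeat marks the last moment kubelet was known to be working, so choosing a start time a few minutes earlier, as the manual command does, also captures the run-up to the failure.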
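3. Common Problems
Kubelet stopped posting node status
kubelet has stopped reporting heartbeats, which usually means the node itself has gone down. Ask the user to try logging in to the node; if login fails, a reboot generally restores it. The root cause is usually related to node load, so check monitoring for the node's load just before the failure.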
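If the node can still be logged in to, a quick on-node check complements the monitoring graphs. The sketch below is only a rough heuristic: the helper name load_exceeds_cpus is hypothetical, and treating a 1-minute load average above the CPU count as overload is an assumption made here, not a rule from kubelet.

```shell
# Succeed (exit status 0) when the load average is greater than the CPU count.
# Arguments: $1 = load average, $2 = number of CPUs.
load_exceeds_cpus() {
  awk -v load="$1" -v cpus="$2" 'BEGIN { exit !(load + 0 > cpus + 0) }'
}

# Usage on a Linux node:
#   systemctl is-active kubelet
#   load_exceeds_cpus "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)" \
#     && echo "load is above the CPU count"
```

This only shows the current state; the load before the failure still has to come from monitoring history.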
PLEG is not healthy
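PLEG stands for Pod Lifecycle Event Generator. kubelet periodically syncs pod status through it, and when a sync exceeds the timeout (3 minutes), kubelet marks the node NotReady.
- Check whether inspecting some container hangs: docker ps -a -q | xargs docker inspect
If this command hangs, find out which container is responsible by running docker inspect {CONTAINER ID} against individual containers. Once the container is identified, and with the customer's permission, remove it with docker rm -f {CONTAINER ID}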
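Trying containers one by one is tedious when a node runs many of them. The following sketch automates the search by giving each docker inspect call a time limit. The function name find_stuck_containers and the INSPECT_TIMEOUT variable are invented for this example, and the sketch assumes GNU timeout from coreutils is installed on the node.

```shell
# Read container IDs on stdin and print those whose `docker inspect` does not
# finish within INSPECT_TIMEOUT seconds (default 10).
find_stuck_containers() {
  while read -r id; do
    timeout "${INSPECT_TIMEOUT:-10}" docker inspect "$id" >/dev/null 2>&1 </dev/null
    rc=$?
    # GNU timeout exits with status 124 when it had to stop the command.
    if [ "$rc" -eq 124 ]; then
      echo "$id"
    fi
  done
}

# Usage on the node:
#   docker ps -a -q | find_stuck_containers
```

The function only reports candidates. Removing a container with docker rm -f remains a manual step that needs the customer's approval.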
Node Evicted
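When pods are evicted from a node because it is short of resources (CPU, memory, or disk), handle it according to the cause:
- Insufficient CPU or memory:
  - Upgrade the VM to a larger specification.
  - Set sensible resource requests so that pods are scheduled evenly across different nodes.
- Insufficient disk space:
  - Expand the disk that holds the container data directory.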
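Before choosing a remedy it helps to confirm which pods were evicted and whether the disk is the culprit. The two filters below are a minimal sketch. The function names evicted_pods and disk_usage_over are hypothetical, /var/lib/docker is assumed to be the container data directory (Docker's default, which may differ on your nodes), and the 85% default mirrors kubelet's default hard eviction threshold of imagefs.available<15%.

```shell
# Read `kubectl get pods -A --no-headers` output on stdin and print
# "namespace/name" for every pod whose STATUS column is "Evicted".
evicted_pods() {
  awk '$4 == "Evicted" { print $1 "/" $2 }'
}

# Read `df -P <dir>` output on stdin and print the mount points whose usage
# is at or above the given percentage (default 85).
disk_usage_over() {
  awk -v limit="${1:-85}" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= limit + 0) print $6
  }'
}

# Usage:
#   kubectl get pods -A --no-headers | evicted_pods
#   df -P /var/lib/docker | disk_usage_over 85    # on the node
```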
Summary
The above is drawn from personal experience and is offered as a reference.