Help troubleshooting Prometheus deployment and Node Exporter failures after upgrading Kubernetes from 1.21 to 1.22 on AWS EKS
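Hi everyone, I recently upgraded a Kubernetes cluster on AWS EKS from 1.21 to 1.22. The cluster upgrade itself succeeded, but the associated Prometheus deployment broke. I have been stuck on it for a while and would appreciate any ideas or pointers on where to look.
First, the Prometheus Operator error. I ran the following command to view the logs: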
kubectl -n monitoring logs prometheus-operator-***
The log output was:
W0109 20:31:28.602872 1 client_config.go:608] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
{"level":"info","msg":"patching webhook configurations 'prometheus-operator-kube-p-admission' mutating=true, validating=true, failurePolicy=Fail","source":"k8s/k8s.go:39","time":"2023-01-09T20:31:28Z"}
{"err":"the server could not find the requested resource","level":"fatal","msg":"failed getting validating webhook","source":"k8s/k8s.go:48","time":"2023-01-09T20:31:28Z"}
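Judging by the log, the fatal error is that the requested validating webhook resource cannot be found. My guess is that Kubernetes 1.22 dropped the deprecated v1beta1 webhook APIs (admissionregistration.k8s.io/v1beta1) and the Prometheus Operator version we run still depends on the old API. Could that be the cause?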
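A rough sketch for checking that guess: list the API versions the upgraded cluster actually serves and see whether admissionregistration.k8s.io/v1beta1 is still among them. `kubectl api-versions` is a standard command; the two helper function names below are made up for illustration and simply filter its output.

```shell
#!/usr/bin/env bash
# Hypothetical helper (illustrative name, not part of kubectl):
# reads `kubectl api-versions` output on stdin and prints only the
# admissionregistration.k8s.io group versions the API server serves.
webhook_api_versions() {
  grep '^admissionregistration\.k8s\.io/' || true
}

# Hypothetical helper: exits 0 only if the v1beta1 webhook API
# appears as a whole line in the `kubectl api-versions` output on stdin.
serves_v1beta1() {
  grep -qx 'admissionregistration\.k8s\.io/v1beta1'
}

# Usage against a live cluster (needs kubectl and cluster access):
#   kubectl api-versions | webhook_api_versions
#   kubectl api-versions | serves_v1beta1 && echo "v1beta1 served" || echo "v1beta1 not served"
```

If only admissionregistration.k8s.io/v1 is listed, a component that still requests v1beta1 webhook configurations would be expected to fail with a "could not find the requested resource" style error, which would be consistent with the log above.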
Separately, Node Exporter is also having problems. The events show:
Liveness probe failed: Get "http://*******:9100/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
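For this liveness probe timeout, my tentative guesses are a network connectivity problem, or the Node Exporter pod being short on resources and not starting properly, but I have not pinned down the actual cause yet.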
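One way to narrow those two guesses down, as a sketch only: repeat the probe's HTTP GET by hand with curl from a machine that can reach the target, then interpret curl's exit status. The function name and the TARGET variable are hypothetical placeholders; the exit codes 0, 6, 7 and 28 are curl's documented ones.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a curl exit status to a rough diagnosis.
# 0, 6, 7 and 28 are documented curl exit codes.
classify_curl_exit() {
  case "$1" in
    0)  echo "ok: endpoint answered in time" ;;
    6)  echo "dns: could not resolve host" ;;
    7)  echo "connect: failed to connect (nothing listening, or connection blocked)" ;;
    28) echo "timeout: no response before --max-time (same symptom as the probe)" ;;
    *)  echo "other: curl exit code $1" ;;
  esac
}

# Usage (TARGET is a placeholder for the address shown in the probe event):
#   curl -s -o /dev/null --max-time 5 "http://$TARGET:9100/"
#   classify_curl_exit "$?"
```

A "timeout" result from another node but "ok" from the node hosting the pod would point toward network policy or security group rules, while a timeout everywhere would make a stuck or resource-starved Node Exporter process more likely.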
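We manage these deployments with Helm 3. Has anyone run into something similar? Any suggestions would be appreciated, thanks!
Note: content sourced from Stack Exchange; question asked by vijaya lakshmi.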




