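1. Introduction
By default, Kubernetes relies on cAdvisor to collect container metrics. That covers most needs, but a few useful signals are missing:
- OOM kills
- The number of container restarts
- The container's exit code
The missing-container-metrics project fills these gaps by exposing the metrics listed above, which cluster administrators can use to pinpoint certain failures quickly. For example, suppose a container runs several child processes and one of them is OOM-killed while the container itself keeps running. Without monitoring OOM kills, that failure is very hard to track down.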
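2. Exposed metrics
Metric | Description |
---|---|
container_restarts | Number of container restarts |
container_ooms | Number of OOM kills for the container, counting the OOM kill of any process in the container's cgroup |
container_last_exit_code | Last exit code of the container |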
3. Installation
helm repo add missing-container-metrics https://draganm.github.io/missing-container-metrics
helm install missing-container-metrics missing-container-metrics/missing-container-metrics -n monitoring
Check that the pods are running:
kubectl get po -n monitoring | grep miss
missing-container-metrics-72hj5 1/1 Running 0 1h
missing-container-metrics-778rl 1/1 Running 0 1h
missing-container-metrics-7dbd8 1/1 Running 0 1h
View the exposed metrics. The exporter listens on port 3001 by default:
curl 100.93.246.120:3001/metrics
# HELP container_last_exit_code Last exit code of the container
# TYPE container_last_exit_code gauge
container_last_exit_code{container_id="containerd://0497bb0d3fe33e15688ad81e6c5167bf2f9a69a9bb4932c7b84564e0e41fd3d8",container_short_id="0497bb0d3fe3",docker_container_id="not-a-docker-container",image_id="k8s.gcr.io/pause:3.2",name="",namespace="monitoring",pod="node-exporter-zzp7r"} 0
container_last_exit_code{container_id="containerd://07fabad9a3ac0fc9236d572c3f437bdda9e857467598d0b43d7d12cd4306a91d",container_short_id="07fabad9a3ac",docker_container_id="not-a-docker-container",image_id="k8s.gcr.io/pause:3.2",name="",namespace="kube-system",pod="kube-proxy-lw9ss"} 0
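Because every container on the node is listed, the raw output gets long quickly. As a minimal sketch, the following filter keeps only the `container_last_exit_code` samples whose value is non-zero. The function name `nonzero_exit_codes` is hypothetical, and the filter assumes the sample value is the last field on the line (no trailing timestamp), as in the output above.

```shell
# Hypothetical helper: read Prometheus text exposition format on stdin and
# print only container_last_exit_code samples with a non-zero value.
# Assumes the value is the last whitespace-separated field (no timestamp).
nonzero_exit_codes() {
  awk '/^container_last_exit_code[{]/ && $NF != 0 { print }'
}

# Usage (the address is the example used above; substitute your own pod IP):
# curl -s 100.93.246.120:3001/metrics | nonzero_exit_codes
```

Comment lines (`# HELP`, `# TYPE`) and other metrics are dropped because they do not match the metric name at the start of the line.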
4. Adding monitoring and alerting rules
To scrape the metrics with Prometheus, add a PodMonitor:
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  labels:
    app.kubernetes.io/name: missing-container-metrics
  name: missing-container-metrics
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames:
    - monitoring
  podMetricsEndpoints:
  - port: http
  selector:
    matchLabels:
      app.kubernetes.io/name: missing-container-metrics
Check that the metrics show up in Prometheus.
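Besides the web UI, the metrics can be checked from the command line through the Prometheus HTTP API. The sketch below is illustrative only: the `PROM_URL` default (for example, a local port-forward) and the helper names `ooms_by_pod_query` and `prom_query` are assumptions, and the `exported_namespace` / `exported_pod` labels are the ones used by the alerting rules in this article.

```shell
# Assumption: Prometheus is reachable at this URL, e.g. through a port-forward.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Hypothetical helper: build a PromQL expression for OOM kills per pod
# over a time window (defaults to 1h).
ooms_by_pod_query() {
  window="${1:-1h}"
  printf 'sum by (exported_namespace, exported_pod) (increase(container_ooms[%s]))' "$window"
}

# Hypothetical helper: run an instant query against /api/v1/query.
prom_query() {
  curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Usage:
# prom_query "$(ooms_by_pod_query 30m)"
```

An empty `result` array in the JSON response usually means the PodMonitor has not been picked up yet or the label names differ in your setup.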
Add the alerting rules:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: k8s-pod-rules
  namespace: monitoring
spec:
  groups:
  - name: kubernetes-apps
    rules:
    - alert: ContainerOOMKilled
      annotations:
        description: '{{ $labels.exported_namespace }}/{{ $labels.exported_pod }} triggered an OOM kill within the last 5 minutes. Please investigate promptly!'
      # container_ooms is a counter, so any positive increase means at least one OOM kill.
      expr: sum(increase(container_ooms[5m])) by (exported_namespace, exported_pod) > 0
      for: 5m
      labels:
        severity: critical
    - alert: ContainerAbnormalExit
      annotations:
        description: '{{ $labels.exported_namespace }}/{{ $labels.exported_pod }} has a container that exited abnormally (exit code 137). Please investigate promptly!'
      # container_last_exit_code is a gauge, so compare its value directly instead of using increase().
      expr: count(container_last_exit_code == 137) by (exported_namespace, exported_pod) > 0
      for: 5m
      labels:
        severity: critical
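The value 137 in the second rule is not arbitrary: by convention, an exit code above 128 means the process was terminated by a signal, and 137 = 128 + 9, that is SIGKILL, which is the signal the kernel OOM killer sends. As a rough sketch of that convention (the helper name `describe_exit_code` is hypothetical, and some programs do return codes above 128 on their own):

```shell
# Hypothetical helper: explain a container exit code.
# Convention: codes above 128 mean "terminated by signal (code - 128)".
describe_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    # kill -l <number> prints the signal name without the SIG prefix.
    echo "exit code $code: killed by signal $sig (SIG$(kill -l "$sig"))"
  elif [ "$code" -eq 0 ]; then
    echo "exit code 0: clean exit"
  else
    echo "exit code $code: application error"
  fi
}

# Usage:
# describe_exit_code 137
```

The same arithmetic applies to other codes seen in `container_last_exit_code`, for example 143 = 128 + 15 (SIGTERM).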
Check that the alerting rules are loaded in Prometheus.