From 9206268051e92ca23a76eaf3882422d844e99bb9 Mon Sep 17 00:00:00 2001
From: George Angel
Date: Mon, 25 Nov 2024 17:17:44 +1000
Subject: [PATCH] Use % of requests error rather than absolute (#138)

We can get false-positive alerts when we have a few errors that are
actually less than 1% of total requests. Change the alert to only fire
on >10% failed requests.
---
 common/all.yaml.tmpl | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/common/all.yaml.tmpl b/common/all.yaml.tmpl
index 2195ff2..0a1e3cd 100644
--- a/common/all.yaml.tmpl
+++ b/common/all.yaml.tmpl
@@ -511,14 +511,12 @@ groups:
             If requeued items are not being processed promptly then this indicates a persistent issue.
             The mirror services are likely to be in an incorrect state.
       - alert: SemaphoreServiceMirrorKubeClientErrors
-        expr: increase(semaphore_service_mirror_kube_http_request_total{code!="200"}[5m]) > 0
-        for: 10m
+        expr: sum(rate(semaphore_service_mirror_kube_http_request_total{code!="200"}[10m])) / sum(rate(semaphore_service_mirror_kube_http_request_total[10m])) > 0.1
+        for: 5m
         labels:
           team: infra
         annotations:
-          summary: "{{ $labels.app }} kubernetes client reports errors speaking to apiserver at {{ $labels.host }} for more than 10 minutes"
-          description: |
-            Kubernetes client requests returning code different than 200 for longer than 10 minutes. Check the pods logs for further information.
+          summary: "{{ $labels.app }} more then 10% of APIServer requests are failing"
           logs:
       - alert: SemaphoreXDSRequeued
         expr: semaphore_xds_queue_requeued_items > 0
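
Reviewer note: the difference between the old absolute condition and the new ratio condition can be illustrated with a small Python sketch. It is not part of the patch and does not reproduce Prometheus evaluation. The function names and sample numbers are hypothetical, the `for:` hold period is ignored, and a window with no traffic is treated as "not firing" (in PromQL the division would yield NaN, which also fails the `> 0.1` comparison).

```python
# Illustrative sketch only: compares the old "any non-200 response" condition
# with the new "more than 10% of requests failing" condition.
# All names and numbers below are hypothetical, not taken from the repository.

def error_ratio(failed_per_sec: float, total_per_sec: float) -> float:
    """Fraction of requests that failed; 0.0 when there is no traffic."""
    if total_per_sec == 0:
        return 0.0  # simplification: PromQL would produce NaN here
    return failed_per_sec / total_per_sec


def old_rule_fires(failed_increase: float) -> bool:
    """Old condition: any increase in non-200 responses over the window."""
    return failed_increase > 0


def new_rule_fires(failed_per_sec: float, total_per_sec: float,
                   threshold: float = 0.1) -> bool:
    """New condition: failed requests exceed `threshold` of all requests."""
    return error_ratio(failed_per_sec, total_per_sec) > threshold


if __name__ == "__main__":
    window_seconds = 600  # matches the 10m range used in the new expression

    # 3 failures out of 1000 requests: 0.3% error ratio.
    print(old_rule_fires(3))                                            # old condition is met
    print(new_rule_fires(3 / window_seconds, 1000 / window_seconds))    # new condition is not

    # 150 failures out of 1000 requests: 15% error ratio.
    print(new_rule_fires(150 / window_seconds, 1000 / window_seconds))  # new condition is met
```

With a handful of failures among many successful requests, the old condition is satisfied while the ratio stays far below the 10% threshold, which is the false-positive case the commit message describes.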