Use % of failed requests rather than absolute count (#138)
We can get false-positive alerts when a handful of errors occur that
actually amount to less than 1% of total requests. Change the alert to
fire only when more than 10% of requests fail.
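To illustrate the difference between the two conditions, here is a minimal Python sketch of the arithmetic. It is not part of the commit: the function names `old_rule_fires` and `new_rule_fires` are hypothetical, and the numbers are made up for the example. It only mirrors the comparison each PromQL expression performs, not Prometheus' evaluation of the `for:` clause.

```python
def old_rule_fires(error_increase: float) -> bool:
    """Old condition: any non-200 response seen in the window."""
    return error_increase > 0


def new_rule_fires(error_rate: float, total_rate: float,
                   threshold: float = 0.1) -> bool:
    """New condition: non-200 share of all requests exceeds the threshold."""
    if total_rate == 0:
        # In PromQL 0/0 is NaN, and NaN > 0.1 is false, so nothing fires.
        return False
    return error_rate / total_rate > threshold


# Illustrative numbers: 3 failures out of 1000 requests in a 10-minute window.
window_seconds = 600
print(old_rule_fires(3))                                             # True
print(new_rule_fires(3 / window_seconds, 1000 / window_seconds))     # False

# 150 failures out of 1000 requests is 15%, above the 10% threshold.
print(new_rule_fires(150 / window_seconds, 1000 / window_seconds))   # True
```

With the old condition a single failed request is enough to trip the comparison, whereas the ratio only crosses the threshold when failures are a meaningful share of traffic.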
george-angel authored Nov 25, 2024
1 parent 5a73089 commit 9206268
Showing 1 changed file with 3 additions and 5 deletions.
8 changes: 3 additions & 5 deletions common/all.yaml.tmpl
@@ -511,14 +511,12 @@ groups:

 If requeued items are not being processed promptly then this indicates a persistent issue. The mirror services are likely to be in an incorrect state.
 - alert: SemaphoreServiceMirrorKubeClientErrors
-expr: increase(semaphore_service_mirror_kube_http_request_total{code!="200"}[5m]) > 0
-for: 10m
+expr: sum(rate(semaphore_service_mirror_kube_http_request_total{code!="200"}[10m])) / sum(rate(semaphore_service_mirror_kube_http_request_total[10m])) > 0.1
+for: 5m
 labels:
 team: infra
 annotations:
-summary: "{{ $labels.app }} kubernetes client reports errors speaking to apiserver at {{ $labels.host }} for more than 10 minutes"
-description: |
-Kubernetes client requests returning code different than 200 for longer than 10 minutes. Check the pods logs for further information.
+summary: "{{ $labels.app }} more then 10% of APIServer requests are failing"
 logs: <https://grafana.$ENVIRONMENT.aws.uw.systems/explore?left=["now-1h","now","Loki",{"expr":"{kubernetes_cluster=\"{{$labels.kubernetes_cluster}}\",kubernetes_namespace=\"{{$labels.namespace}}\",container=\"{{$labels.container}}\"}"}]|link>
 - alert: SemaphoreXDSRequeued
 expr: semaphore_xds_queue_requeued_items > 0