[enterprise] Helm: Context deadline exceeded error with Istio enabled #10442

Open
dudududi opened this issue Jan 15, 2025 · 0 comments
Labels
bug Something isn't working

Comments

@dudududi

What is the bug?

I am trying to deploy Grafana Enterprise Metrics (GEM) on my EKS cluster with Istio enabled. I have deployed the Grafana UI in one namespace (grafana-enterprise) using the grafana Helm chart, and GEM in another namespace (grafana-enterprise-metrics) using the mimir-distributed Helm chart. When I try to set up the enterprise metrics plugin in the Grafana UI, I get a timeout error.

How to reproduce it?

  1. Deploy Grafana UI and Grafana Enterprise Metrics on a cluster with Istio injection enabled
  2. Exec into Grafana UI pod
  3. Execute curl -u :<admin-token> http://gem-gateway.grafana-enterprise-metrics.svc/admin/api/v3/features
  4. Observe timeout error
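
Steps 2 to 4 can be scripted roughly as follows. This is only a sketch: it assumes bash and kubectl access to the cluster, and the workload name `deploy/grafana` and the function name are hypothetical placeholders, not taken from the actual deployment.

```shell
#!/usr/bin/env bash
# Sketch of reproduction steps 2-4.
# GRAFANA_WORKLOAD is an assumed name; set it to the real Grafana UI workload.
GRAFANA_NS="grafana-enterprise"
GRAFANA_WORKLOAD="${GRAFANA_WORKLOAD:-deploy/grafana}"
GEM_GATEWAY="http://gem-gateway.grafana-enterprise-metrics.svc"

# Run the failing request from inside the Grafana UI pod.
# $1 is the GEM admin token. --max-time 40 lets the gateway's own 30s
# timeout surface as a 504 instead of curl waiting indefinitely.
repro_features_timeout() {
  local admin_token="$1"
  kubectl exec -n "$GRAFANA_NS" "$GRAFANA_WORKLOAD" -- \
    curl -sS --max-time 40 -u ":${admin_token}" \
    "${GEM_GATEWAY}/admin/api/v3/features"
}
```

Calling `repro_features_timeout <admin-token>` should then print the same "context deadline exceeded" JSON after about 30 seconds.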

What did you think would happen?

I debugged the connectivity for a while, and below are my findings so far:

  • when I exec into the Grafana UI pod in the grafana-enterprise namespace, the cURL below successfully returns a 200 response, which shows that the Istio service mesh allows communication between the Grafana UI and gem-gateway pods:
curl -u :<admin-token> http://gem-gateway.grafana-enterprise-metrics.svc/services
  • when I exec into the same Grafana UI pod in the grafana-enterprise namespace, the cURL below fails with a 504 error after 30s:
curl -u :<admin-token> http://gem-gateway.grafana-enterprise-metrics.svc/admin/api/v3/features
{"status":"error","errorType":"timeout","error":"context deadline exceeded"}

and in the gem-gateway logs I can see:

ts=2025-01-15T15:00:57.161674816Z caller=logging.go:128 level=warn trace_id_unsampled=1c9ce85c7792ca0b msg="GET /admin/api/v3/features (504) 30.00071491s Response: \"{\\\"status\\\":\\\"error\\\",\\\"errorType\\\":\\\"timeout\\\",\\\"error
  • when I exec into the gem-gateway pod in the grafana-enterprise-metrics namespace, the cURL below to the admin-api pods returns a 200 response, which shows that the Istio service mesh allows communication between the gem-gateway and gem-admin-api pods:
curl -u :<admin-token> http://gem-gateway.grafana-enterprise-metrics.svc:8080/admin/api/v3/features
{"name":"GEM","version":"v2.14.0","features":{"debug_export":"v1","editable_access_policies":"v1","editable_tenants":"v1","lbac":"v1","self_monitoring":"v1","federated_rules":"v1","federated_queries":"v1","block_upload":"v1"}}
  • however, when I exec into the same gem-gateway pod in the grafana-enterprise-metrics namespace, the cURL below to localhost fails with a 502:
curl -u :<admin-token> http://localhost:8080/admin/api/v3/features

and in the gem-gateway logs I can see:

ts=2025-01-15T14:58:06.913144188Z caller=logging.go:118 level=info trace_id_unsampled=5e9a38dcdc5b13b8 msg="GET /admin/api/v3/features (502) 411.334433ms"
  • everything works fine when I disable istio-injection for those two namespaces.
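
To compare the checks above side by side, each request can be reduced to its status code and total duration. A minimal sketch, assuming bash and curl are available in the pod; the helper name `probe` is hypothetical.

```shell
# Print "<http_code> <total_time>s <url>" for one request, discarding the body.
# $1 is the GEM admin token, $2 is the URL to check.
probe() {
  local token="$1" url="$2"
  curl -s -o /dev/null --max-time 40 -u ":${token}" \
    -w '%{http_code} %{time_total}s %{url_effective}\n' "$url"
}
```

From the Grafana UI pod this would be run against `http://gem-gateway.grafana-enterprise-metrics.svc/services` and `http://gem-gateway.grafana-enterprise-metrics.svc/admin/api/v3/features`, and from the gem-gateway pod against `http://gem-gateway.grafana-enterprise-metrics.svc:8080/admin/api/v3/features` and `http://localhost:8080/admin/api/v3/features`.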

What was your environment?

I am testing this configuration with:

  • grafana Helm Chart v8.6.4
  • mimir-distributed Helm Chart v5.5.1
  • Istio version 1.19.3 (also tried with the newer 1.24.2, with the same result)
  • Kubernetes 1.29 hosted on AWS EKS

Some additional configuration we have applied from Istio perspective:

  • both the grafana-enterprise and grafana-enterprise-metrics namespaces are labelled with istio-injection: enabled
  • AuthorizationPolicy and Sidecar Istio CRs are configured correctly to allow traffic between the namespaces. Below is their configuration:
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: default-auth-policy
  namespace: grafana-enterprise
spec:
  action: ALLOW
  rules:
  - from:
    - source:
        namespaces:
        - grafana-enterprise
  - from:
    - source:
        principals:
        - '*/gateway-proxy'
  selector: {}
---
apiVersion: networking.istio.io/v1
kind: Sidecar
metadata:
  name: default
  namespace: grafana-enterprise
spec:
  egress:
  - hosts:
    - ./*
    - istio-system/*
    - tracing/*
    - authz-engine/*
    - grafana-enterprise-metrics/gem-gateway.grafana-enterprise-metrics.svc.cluster.local
---
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: default-auth-policy
  namespace: grafana-enterprise
spec:
  action: ALLOW
  rules:
  - from:
    - source:
        namespaces:
        - grafana-enterprise-metrics
  - from:
    - source:
        principals:
        - '*/gateway-proxy'
  - from:
    - source:
        principals:
        - '*/ge'
    to:
    - operation:
        ports:
        - "80"
        - "8080"
  selector: {}
---
apiVersion: networking.istio.io/v1
kind: Sidecar
metadata:
  name: default
  namespace: grafana-enterprise-metrics
spec:
  egress:
  - hosts:
    - ./*
    - istio-system/*
    - tracing/*
    - authz-engine/*
    - grafana-enterprise/ge.grafana-enterprise.svc.cluster.local
    - default/kubernetes.default.svc.cluster.local
  • We use Istio only as a service mesh; we don't use it as an ingress gateway. Below is our global mesh config:
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: installed-state
  namespace: istio-system
spec:
  components:
    base:
      enabled: true
    cni:
      enabled: false
    egressGateways:
    - enabled: false
      name: istio-egressgateway
    ingressGateways:
    - enabled: false
      name: istio-ingressgateway
    istiodRemote:
      enabled: false
    pilot:
      enabled: true
      k8s:
        env:
        - name: PILOT_ENABLE_GATEWAY_API
          value: "false"
        - name: PILOT_ENABLE_GATEWAY_API_DEPLOYMENT_CONTROLLER
          value: "false"
        - name: PILOT_ENABLE_GATEWAY_API_STATUS
          value: "false"
        hpaSpec:
          maxReplicas: 4
          minReplicas: 2
        replicaCount: 2
  hub: 123456789.dkr.ecr.us-east-1.amazonaws.com
  meshConfig:
    defaultConfig:
      terminationDrainDuration: 15s
      holdApplicationUntilProxyStarts: true
    outboundTrafficPolicy:
      mode: REGISTRY_ONLY
  profile: default
  tag: 1.19.3
  values:
    global:
      proxy:
        includeIPRanges: 172.20.0.0/16,240.240.0.0/16,10.0.8.0/22,10.0.20.0/22,10.0.32.0/22,10.0.0.0/16
        excludeOutboundPorts: "53"        
        resources:
          limits:
            memory: 2048Mi
    telemetry:
      enabled: true
      v2:
        enabled: true        
        prometheus:          
          enabled: true
        stackdriver:
          enabled: false
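
Because the mesh runs with outboundTrafficPolicy REGISTRY_ONLY and each namespace has a Sidecar with an explicit egress host list, one possible check is to see which outbound clusters a given sidecar actually knows about. The sketch below assumes istioctl is installed and pointed at the cluster; the helper names are hypothetical, and the pod name and host are placeholders supplied by the caller.

```shell
# List the outbound clusters that the sidecar of pod $1 (namespace $2)
# has for host $3. An empty result would suggest that the Sidecar egress
# hosts do not cover that host.
sidecar_clusters_for() {
  local pod="$1" ns="$2" fqdn="$3"
  istioctl proxy-config cluster "${pod}.${ns}" --fqdn "$fqdn"
}

# Run Istio's static configuration analysis against one namespace.
analyze_namespace() {
  istioctl analyze -n "$1"
}
```

For example, `sidecar_clusters_for <grafana-pod> grafana-enterprise gem-gateway.grafana-enterprise-metrics.svc.cluster.local` would show whether the Grafana UI sidecar has a cluster for the gateway, and `analyze_namespace grafana-enterprise-metrics` would report any configuration warnings Istio detects there.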

Any additional context to share?

No response

@dudududi dudududi added the bug Something isn't working label Jan 15, 2025