Ensure OLM finalizer runs to prevent px-operator namespace from being stuck terminating (#2059)

Summary: Ensure OLM finalizer runs to prevent px-operator namespace from
being stuck terminating

A `helm install` followed by a `helm uninstall` does not fully
clean up all Pixie resources in the v0.1.7 operator release. The OLM
project
[added](operator-framework/operator-lifecycle-manager@f94a5ed)
a csv-cleanup finalizer in
[v0.27.0](https://github.com/operator-framework/operator-lifecycle-manager/releases/tag/v0.27.0)
that causes the `px-operator` namespace to get stuck in a terminating
state if the `olm` and `px-operator` namespaces are deleted at the same
time.
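
For anyone trying to confirm the stuck state on their own cluster, a minimal sketch along these lines may help. It assumes `kubectl` access to the cluster; `describe_phase` and `check_namespace` are hypothetical helper names and are not part of the chart.

```shell
# describe_phase PHASE
# Hypothetical helper: classifies a namespace from its .status.phase value.
# An empty value is treated as "the namespace no longer exists".
describe_phase() {
  case "$1" in
    Terminating) echo "terminating" ;;
    "") echo "gone" ;;
    *) echo "present" ;;
  esac
}

# check_namespace NAME
# Hypothetical helper: needs kubectl access to the cluster.
check_namespace() {
  ns=$1
  # Prints nothing when the namespace has already been removed.
  phase=$(kubectl get namespace "$ns" --ignore-not-found -o jsonpath='{.status.phase}')
  echo "$ns: $(describe_phase "$phase")"
  # List any remaining ClusterServiceVersions along with their finalizers.
  kubectl -n "$ns" get clusterserviceversions \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.finalizers}{"\n"}{end}'
}
```

Running `check_namespace px-operator` after a `helm uninstall` should show whether the namespace is still terminating and which ClusterServiceVersions, if any, still carry finalizers.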

To address this, a new Job is introduced within the `olm`
namespace that triggers the deletion of the OLM operator namespace
(`px-operator`) from a `pre-delete` hook. This bug is not present when OLM
is installed outside of Helm since the finalizer has time to run.
Therefore this Job only needs to run if `deployOLM` is set (Helm is
managing OLM).

The other alternative I considered was writing another one-off utility
similar to the `vizier_deleter` Job. This would have the benefit of
having a small surface area and wouldn't rely on third-party images. Let
me know if you have opinions/thoughts on that option or any other
alternatives.

Relevant Issues: #1917

Type of change: /kind bug

Test Plan: Verified that the operator dev helm chart from this branch
uninstalls properly
```
$ helm install pixie  pixie-dev-operator/pixie-operator-chart  --version 0.1.7-pre-ddelnano-fix-helm-uninstall-olm-finalizer.0 --set cloudAddr=<cloud_addr> --set deployKey=<deploy_key> --set clusterName='helm-uninstall-test' --namespace pl --create-namespace
NAME: pixie
LAST DEPLOYED: Wed Dec 11 03:13:42 2024
NAMESPACE: pl
STATUS: deployed
REVISION: 1
TEST SUITE: None
$ helm -n pl uninstall pixie
release "pixie" uninstalled

$ kubectl get namespaces | grep 'px-operator\|olm\|pl'
pl                   Active   6m31s
$ kubectl -n pl get all
No resources found in pl namespace.
```
- [x] Verified `deployOLM` controls whether the Job is present with `helm template`
```
$ helm template --set deployOLM=true k8s/operator/helm/ | grep -A 5 'Job'
kind: Job
metadata:
  name: csv-deleter
  namespace: olm
  annotations:
    "helm.sh/hook": pre-delete
--
kind: Job
metadata:
  name: vizier-deleter
  annotations:
    "helm.sh/hook": pre-delete
    "helm.sh/hook-delete-policy": hook-succeeded
$ helm template --set deployOLM=false k8s/operator/helm/ | grep -A 5 'Job'
kind: Job
metadata:
  name: vizier-deleter
  annotations:
    "helm.sh/hook": pre-delete
    "helm.sh/hook-delete-policy": hook-succeeded
```

Changelog Message: Fix bug with the v0.1.7 operator helm chart that
would cause a stuck `px-operator` namespace on uninstall

---------

Signed-off-by: Dom Del Nano <[email protected]>
ddelnano authored Dec 18, 2024
1 parent e2a6737 commit 9effb34
Showing 1 changed file with 51 additions and 0 deletions.
51 changes: 51 additions & 0 deletions k8s/operator/helm/templates/00_olm.yaml
@@ -228,4 +228,55 @@ metadata:
spec:
  targetNamespaces:
  - {{ .Values.olmNamespace }}
---
apiVersion: batch/v1
kind: Job
metadata:
  name: csv-deleter
  namespace: {{ .Values.olmNamespace }}
  annotations:
    "helm.sh/hook": pre-delete
    "helm.sh/hook-delete-policy": hook-succeeded,hook-failed
spec:
  template:
    spec:
      restartPolicy: Never
      serviceAccountName: olm-operator-serviceaccount
      containers:
      - name: trigger-csv-finalizer
        image: ghcr.io/pixie-io/pixie-oss-pixie-dev-public-curl:multiarch-7.87.0@sha256:f7f265d5c64eb4463a43a99b6bf773f9e61a50aaa7cefaf564f43e42549a01dd
        command:
        - /bin/sh
        - -c
        - |
          NAMESPACE="{{ .Values.olmOperatorNamespace }}"
          API_SERVER="https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT"
          CA_CERT=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
          DELETE_STATUS=$(curl --cacert $CA_CERT \
            -H "Authorization: Bearer $TOKEN" \
            -X DELETE -s \
            -o /dev/null -w "%{http_code}" \
            $API_SERVER/api/v1/namespaces/$NAMESPACE)
          if [ "$DELETE_STATUS" -ne 200 ] && [ "$DELETE_STATUS" -ne 202 ]; then
            echo "Failed to initiate deletion for namespace $NAMESPACE. HTTP status code: $DELETE_STATUS"
            exit 1
          fi
          echo "Waiting for finalizer in $NAMESPACE to complete..."
          while true; do
            STATUS=$(curl --cacert $CA_CERT \
              -H "Authorization: Bearer $TOKEN" \
              -o /dev/null -w "%{http_code}" -s \
              $API_SERVER/api/v1/namespaces/$NAMESPACE)
            if [ "$STATUS" = "404" ]; then
              echo "Namespace $NAMESPACE finalizer completed."
              break
            else
              echo "Finalizer still running in $NAMESPACE. Retrying in 5 seconds..."
              sleep 5
            fi
          done
{{- end}}
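
The Job's script polls until the namespace returns 404, with no upper bound on the number of attempts. As an illustration only, a bounded variant of the same delete-then-poll flow could be sketched as below. The function names and the attempt limit are assumptions and are not part of this commit; `fetch_status` expects the same `API_SERVER`, `CA_CERT`, `TOKEN` and `NAMESPACE` variables that the Job sets.

```shell
# Sketch only: function names and the attempt limit are assumptions,
# not part of the chart.

# delete_accepted STATUS
# Succeeds when the HTTP status means the DELETE request was accepted,
# matching the 200/202 check in the Job.
delete_accepted() {
  case "$1" in
    200|202) return 0 ;;
    *) return 1 ;;
  esac
}

# fetch_status
# Prints the HTTP status code for a GET on the namespace.
fetch_status() {
  curl --cacert "$CA_CERT" \
    -H "Authorization: Bearer $TOKEN" \
    -o /dev/null -w "%{http_code}" -s \
    "$API_SERVER/api/v1/namespaces/$NAMESPACE"
}

# wait_for_gone MAX_ATTEMPTS INTERVAL_SECONDS
# Polls fetch_status until it prints 404 or the attempts run out.
# Returns 0 once the namespace is gone, 1 if it is still present at the end.
wait_for_gone() {
  max_attempts=$1
  interval=$2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if [ "$(fetch_status)" = "404" ]; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  return 1
}
```

With these helpers, something like `wait_for_gone 60 5 || exit 1` would give up after roughly five minutes instead of looping forever. Whether a bound is desirable here depends on how the hook's failure should interact with `helm uninstall`, which this commit does not discuss.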
