Initialize Containers - HttpError: HTTP request failed - EKS - containerMode kubernetes #128
Well, finding some more time to dig, I went into the source code here and started tracing out the execution path since the stack trace doesn't give many clues as to where this HTTP request failed. Noting that the first thing that probably does a request is:
I shell into my pod and install node and then attempt this direct basic implementation:
const k8s = require('@kubernetes/client-node');
const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const k8sApi = kc.makeApiClient(k8s.CoreV1Api);
let main = async () => {
  try {
    const podsRes = await k8sApi.listNamespacedPod('actions-runners');
    console.log(podsRes.body);
  } catch (err) {
    console.error(err);
  }
};
main();
It fails like so:
{
  // ...
  body: {
    kind: 'Status',
    apiVersion: 'v1',
    metadata: {},
    status: 'Failure',
    message: 'Unauthorized',
    reason: 'Unauthorized',
    code: 401
  },
  statusCode: 401
}
So I can presume that the problem is not the fault of the hooks library, but something is wrong with either the service account or the cluster configuration in EKS. There's not a lot of easily-findable documentation on how to perform in-cluster authentication via service account because most users want to authenticate to their cluster from outside, using eksctl or similar. The Role is as configured by the helm chart:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: reverbdotcom-general-purpose-gha-rs-kube-mode
  namespace: actions-runners
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
  - list
  - create
  - delete
- apiGroups:
  - ""
  resources:
  - pods/exec
  verbs:
  - get
  - create
- apiGroups:
  - ""
  resources:
  - pods/log
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - jobs
  verbs:
  - get
  - list
  - create
  - delete
- apiGroups:
  - ""
  resources:
  - secrets
  verbs:
  - get
  - list
  - create
  - delete
Apparently, if there were an RBAC issue I should receive a 403. A 401 indicates that the token was rejected completely. I also checked to see if the token in the client configuration matched the one mounted in the pod, and it does. I'm out of ideas for now... until I can learn more about debugging a 401 with an in-cluster service account token. |
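The 401-versus-403 reasoning above can be captured in a small helper. This is only a sketch: it assumes the error object has the shape printed earlier in this comment (a top-level `statusCode` and a `body.code`), and the function name `classifyApiError` is made up for illustration.

```javascript
// Map an API error to the layer that most likely rejected the request.
// Assumes the error shape shown above: { statusCode, body: { code, reason } }.
function classifyApiError(err) {
  const code = err?.statusCode ?? err?.body?.code;
  if (code === 401) return 'authentication'; // token itself was rejected
  if (code === 403) return 'authorization';  // token accepted, RBAC denied
  return 'other';
}

// Example with the failure from this thread.
const failure = { body: { reason: 'Unauthorized', code: 401 }, statusCode: 401 };
console.log(classifyApiError(failure));
```

An 'authentication' result points at the token or the cluster's token validation, not at the Role or RoleBinding.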
I created a test pod on the cluster in the
This is expected because I didn't specify a service account, so I got the default service account which has no role bound to it. Next I attempted the same, but specified the
With this service account, I again get a The default service account has a "mountable secret" but the "reverbdotcom-general-purpose-gha-rs-kube-mode" does not.
We're on Kubernetes 1.24, which is now undocumented, so I can't be sure, but other documentation indicates that it shouldn't be necessary to manually create tokens, and that the kubelet should use the TokenRequest API to obtain a token for the projected volume when a pod is scheduled... It definitely obtains a token, but the token is unauthorized. Off to spend some time digging around in EKS docs and attempting to figure out if there's some configuration setting I need to flip. |
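One way to see what the rejected token actually claims is to decode its payload inside the pod. The sketch below assumes the token is a standard three-part JWT mounted at the default in-cluster path; the helper names `decodeJwtPayload` and `describeToken` are hypothetical, and the signature is not verified.

```javascript
const fs = require('node:fs');

// Default in-cluster mount path for the projected service account token.
const TOKEN_PATH = '/var/run/secrets/kubernetes.io/serviceaccount/token';

// Decode the payload segment of a JWT into an object (no signature check).
function decodeJwtPayload(token) {
  const parts = token.trim().split('.');
  if (parts.length !== 3) {
    throw new Error('not a three-part JWT');
  }
  return JSON.parse(Buffer.from(parts[1], 'base64url').toString('utf8'));
}

// Summarize the claims that matter when the API server answers 401.
function describeToken(token, nowSeconds = Math.floor(Date.now() / 1000)) {
  const claims = decodeJwtPayload(token);
  return {
    issuer: claims.iss,
    subject: claims.sub,
    audience: claims.aud,
    expired: typeof claims.exp === 'number' && claims.exp <= nowSeconds,
  };
}

// Only reads the file when running inside a pod that has it mounted.
if (fs.existsSync(TOKEN_PATH)) {
  console.log(describeToken(fs.readFileSync(TOKEN_PATH, 'utf8')));
}
```

An expired token, or a subject naming a service account that has since been deleted and recreated, would both be consistent with a 401.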
After more experimentation I accidentally deleted the service account, and then had to recreate it by forcing a new helm install-upgrade. Following that, new pods which used the kube-mode service account were able to communicate with the apiserver, but old pods were not. I destroyed the old runner pod and waited for the controller to create a new one, whereupon it was able to make apiserver requests again. It's unknown why replacing the service account made it start working; I'm monitoring to see if it breaks again after some interval of time. If so, the theory is that the projected token is not being refreshed. |
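To test the "token is not being refreshed" theory, one could check how far through its lifetime the mounted token is. The sketch below takes already-decoded claims (`iat` and `exp`, in seconds). The 80%-of-TTL and 24-hour thresholds come from the Kubernetes documentation on kubelet token rotation; the function name `tokenAge` and the idea of using them as a diagnostic are assumptions.

```javascript
// Given decoded token claims, report how much of the token's lifetime is used.
function tokenAge(claims, nowSeconds) {
  const ttl = claims.exp - claims.iat;
  const age = nowSeconds - claims.iat;
  return {
    ttlSeconds: ttl,
    ageSeconds: age,
    fractionUsed: age / ttl,
    // Heuristic: the kubelet is documented to rotate tokens older than
    // 80% of their TTL or older than 24 hours.
    rotationOverdue: age > 0.8 * ttl || age > 24 * 3600,
    expired: nowSeconds >= claims.exp,
  };
}

// Example: a one-hour token inspected 50 minutes after issuance.
console.log(tokenAge({ iat: 0, exp: 3600 }, 3000));
```

A token that stays `rotationOverdue` across repeated checks, or becomes `expired`, would support the theory. A fresh token that still returns 401 would point elsewhere.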
Hey @carl-reverb, Sorry for the late response. Is there any news regarding this issue? Does it work now? |
Yes, it's been working now, thank you. |
Ok, I reproduced this as I'm rolling out a new set of runners. My first runner, an arm64 kubernetes-mode runner, again presented this issue. The Chart version is
I also happened to deploy an amd64 kubernetes-mode runner at the same time. My job contained a javascript action and ran in an alpine container so I had to switch it to the amd64. Once again, HTTP denied. Repeat my workaround ... and it works again. |
Could you please write the exact steps that you are doing to land on this spot? I can't seem to reproduce the issue. It seems like there is some kind of permission issue where service account is not mounted to the runner container. Could you please write an example values.yaml file with the stuff you would like to hide redacted? Write exact commands that you are using to deploy this scale set. I just can't reproduce this issue |
This is AWS EKS version 1.25. I'm sorry I don't have the bandwidth to work on reproduction. I simply install the runner scale set helm chart. My hook extension is:
Now, all this is being installed with Flux, so the order of application is up to those controllers. Because of the nature of the resources, I presume that the kustomize controller runs first and creates my service account, config maps, and the helm release CRD, upon which the helm controller runs to evaluate the Helm Release and execute helm install. Sorry I can't remember much more detail than this. |
Oh, please do not apologize, I'm the one being late on this issue. I think the problem is that we are creating the service account on demand and mounting it on the runner pod. The hook extension does not need a service account. The extension is only scoped to the workflow pod. The service account needs to be mounted on the runner. It is likely that something in the tooling is not mounting the service account properly. |
Oh, to explain the hook extension on the service account, you're right that's a red herring, but I do need that because I'm using docker buildx with the native kubernetes driver and that driver requires a service account with some role bindings in order to create buildx pods. Yes you're absolutely right, the service account which is a problem is the one for the runner pod which consumes the workflow and then fails to spawn job pods due to not having a good service account token. |
Right, but the role you pasted is not the one the runner needs. This is the actual role that is created for the runner: https://github.com/actions/actions-runner-controller/blob/master/charts/gha-runner-scale-set/templates/kube_mode_role.yaml Is it possible that an incorrect role binding was made, so the runner did not have enough permissions? |
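To reason about whether a given set of rules would permit a request, a rough local check can help. The sketch below transcribes the rules from the Role quoted earlier in this thread and evaluates them with a simplified matcher. It is not how the API server evaluates RBAC (it ignores `resourceNames`, `nonResourceURLs`, and bindings entirely), and `roleAllows` is a hypothetical helper name.

```javascript
// Rules transcribed from the Role quoted earlier in this thread.
const rules = [
  { apiGroups: [''], resources: ['pods'], verbs: ['get', 'list', 'create', 'delete'] },
  { apiGroups: [''], resources: ['pods/exec'], verbs: ['get', 'create'] },
  { apiGroups: [''], resources: ['pods/log'], verbs: ['get', 'list', 'watch'] },
  { apiGroups: ['batch'], resources: ['jobs'], verbs: ['get', 'list', 'create', 'delete'] },
  { apiGroups: [''], resources: ['secrets'], verbs: ['get', 'list', 'create', 'delete'] },
];

// True when the list contains the value or the RBAC wildcard "*".
const matches = (list, value) => list.includes('*') || list.includes(value);

// Simplified check: does any rule cover this group/resource/verb triple?
function roleAllows(roleRules, apiGroup, resource, verb) {
  return roleRules.some(
    (rule) =>
      matches(rule.apiGroups, apiGroup) &&
      matches(rule.resources, resource) &&
      matches(rule.verbs, verb)
  );
}

console.log(roleAllows(rules, '', 'pods', 'list'));  // true
console.log(roleAllows(rules, '', 'pods', 'watch')); // false
```

Note that a missing permission like this would normally surface as a 403, not the 401 reported here. For an authoritative answer against a live cluster, `kubectl auth can-i` asks the API server directly.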
Had the same issue. Seems to happen due to namespace rename OR a helm chart upgrade - our IT did both simultaneously. For me the following had caused the issue:
Somehow you run into the 401 error; for me it was the runner-hook trying to do a get on the K8s API asking "am I allowed to access a secret". The fix was to delete the helm release of the ARC runner and re-deploy it. I guess the serviceaccount did not have the required permission, either due to the rename or the version upgrade. I was debugging this for 2 days and only found the origin after implementing #158 / #159. With the trace being available, finding the root cause was an easy job. Maybe @nikola-jokic could have a look at this PR? |
Reproduced the issue again, this time on 0.9.1
|
hi @carl-reverb, before i was unable to run a job in a container in
|
@carl-reverb you are the MVP! Uninstall runner set helm chart, then find all remaining resources in namespace:
Edit and remove finalizer:
Then reinstall the runner set. This alone seemed to solve the problem for me. |
When I attempt to run a workflow against a self-hosted runner deployed using the gha-runner-scale-set-controller and gha-runner-scale-set charts, my job fails on the 'Initialize Containers' step.
Runner Scale Set values.yaml:
In the GitHub UI after the job is picked up the following error message appears in the log:
Full error context:
My workflow:
I have tried a lot of different things to try to understand what is not working here, but the chain of dependencies and effects is not easy to comprehend. There are a lot of red herrings and other noise in the logs, which led me on several chases around the web, and I spent a while trying security contexts, various container images, etc. At this point I think I have run out of time to figure this out and will have to fall back to the previous actions runner controller and advise my team that the next generation of actions runners is a risk and we should evaluate alternative CI pipelines.