Better observability into driver auth/networking errors #2263
Comments
Hi @LMmarsupial, let me try to reproduce your issue on a 1.24 cluster. In the meantime, would you attach:
I'm perplexed by the Thank you. |
Hi @LMmarsupial, I just performed the following test and was not able to reproduce your issue:
Everything worked fine for me. Are you able to properly follow the dynamic-provisioning example? |
Apologies, I didn't save the logs for the ebs-plugin. I will send you newer ones, but their timestamps won't line up.
|
@AndrewSirenko I can follow the dynamic provisioning steps on Monday. Unfortunately, I have no time left for the rest of this week. I appreciate your support. |
@AndrewSirenko I followed your instructions and confirmed that the issue persists when using the dynamic provisioning steps. Are there any further debug steps I can take?
|
Hi @LMmarsupial, thanks for following the example. I'm sorry that you're running into this. An error like Re-deploying the driver via helm with |
@AndrewSirenko No problem, I'll try with a new deploy. My theory is that the rate limit issue only appears after several errors while it tries to provision. I think it's a symptom of the issue, not the central issue. Hopefully these debug logs will help. I'll update the thread with the logs once I get it deployed. |
@AndrewSirenko the log is quite long, and I can't seem to nail down the behavior. I don't have the portion up to the rate limit token calls because it's so long, but I can paste more if you need. I censored my key value with FYI.
|
@LMmarsupial No need for more logs. These logs show that the driver is making AWS API calls that are never getting a response. Thank you. This means it's even more likely there is some networking/auth issue that is blocking these calls from being made. This issue would be hard to pin down because it could be due to allow/denylists, DNS, subnet/VPC configuration, a service-account issue, CNI plugin issues, etc. Are you relying on EKS? If so, can you try creating a fresh EKS cluster in us-west-2, and perhaps try following the EKS add-on installation guide instead? Comparing with a working cluster may help you pinpoint the issue. |
That makes sense. I feel like auth could definitely be causing this. It seems like it hits an unauthorized error (or whatever the issue is) many times and then gets locked out by the rate limit token. I am not using EKS; I'm currently using Rancher with RKE. Is there any other method I can use to get better error info? The debug flag you added was very helpful in allowing me to see what's going on. Thanks! |
Ah, IIRC using IAM roles for ServiceAccounts is EKS only. And I see in your deployment that you give a ServiceAccount name. That might be the issue:
You may need to set up driver permissions in another way (e.g. ensure/patch the nodes where the EBS CSI Controller runs to have the IAM role, using something like an IAM instance profile). Rancher may have their own installation guide for the EBS CSI Driver.
You can try to manually make an EC2 DescribeVolumes call with the IAM role to rule out the IAM policy being wrong. You can use something like an ephemeral debug container to try to prove you can connect to the EC2 endpoint from the EBS CSI Controller pod. |
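The connectivity half of that suggestion could be sketched roughly as follows. This is only an illustration and not part of the driver: it assumes a Python interpreter is available in the debug container, the `classify` and `probe` helpers are hypothetical names, and the endpoint URL in the usage note is an assumption based on the us-west-2 region mentioned earlier. The request is unsigned, so it says nothing about IAM permissions; it only separates DNS, timeout, connection, and TLS certificate failures from one another.

```python
import socket
import ssl
import urllib.error
import urllib.request
from typing import Optional


def classify(exc: BaseException) -> str:
    """Map an exception raised by urlopen to a coarse failure category."""
    # Any HTTP status, even 4xx/5xx, means an HTTP server answered.
    if isinstance(exc, urllib.error.HTTPError):
        return "reachable"
    # URLError usually wraps the underlying socket/ssl exception in .reason.
    if isinstance(exc, urllib.error.URLError) and isinstance(exc.reason, BaseException):
        exc = exc.reason
    # Check the most specific exception types first.
    if isinstance(exc, ssl.SSLCertVerificationError):
        return "tls-certificate"
    if isinstance(exc, ssl.SSLError):
        return "tls-other"
    if isinstance(exc, socket.gaierror):
        return "dns"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"
    if isinstance(exc, ConnectionError):
        return "connection"
    return "unknown"


def probe(url: str, cafile: Optional[str] = None, timeout: float = 5.0) -> str:
    """Send one unsigned GET to `url` and report how far it got.

    urllib picks up HTTPS_PROXY/NO_PROXY from the environment, so the
    request takes the proxy path if those variables are set in the pod.
    `cafile` is an optional PEM bundle, e.g. a corporate proxy CA.
    """
    context = ssl.create_default_context(cafile=cafile)
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context):
            return "reachable"
    except Exception as exc:  # deliberately broad: this is a diagnostic tool
        return classify(exc)
```

Calling `probe("https://ec2.us-west-2.amazonaws.com/")` once without `cafile` and once with the path to a proxy CA bundle would show whether certificate trust is what differs between the two runs.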
@LMmarsupial any luck with either using IAM Instance profile for the EBS CSI Controller nodes, or (less recommended) using a secret? |
Hey @AndrewSirenko, I tried adding an IAM role for the EBS CSI controller, but it's already using an admin access key, so I don't think it made a difference. I was using the ephemeral debug container, and I realized that it may be proxy settings. I now have it properly configured with proxy settings, but I'm still seeing similar log lines. I also had issues with CA certificates when I was running AWS CLI commands. I am thinking that I may need to mount our CA certificate to the csi controller deploy in order to have it properly configured. |
I don't think it's CA related, but I do believe it is still network related. |
Got it to work. I was right when I said "I am thinking that I may need to mount our CA certificate to the csi controller deploy in order to have it properly configured." In our environment, we use a proxy to communicate with external services. The CSI controller was not surfacing the fact that there were unhandled certificate exceptions. After adding volume mounts with our proxy CA certificate to the controller, I was able to communicate and provision volumes. |
Seems like the CSI driver could use some more logs when handling the authentication process. Unhandled exceptions made the process very difficult to debug. |
Thanks @LMmarsupial for letting us know that you were able to solve your original issue! You are right that we can improve the user experience here by making it easier to troubleshoot auth/networking errors, either through clearer (handled) logs or even a readiness probe that attempts a simple EC2 DescribeVolumes call to prove everything works. Let me re-title the issue and turn it into a feature request for this.
/retitle Better observability into driver auth/networking errors
/kind feature
I will try to work on this sometime this month.
/assign |
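The thread does not show the actual manifest change, so the following is only a sketch of what such a mount could look like, written as a small Python helper that builds a strategic-merge patch. The ConfigMap name `proxy-ca`, the key `ca.crt`, and the mount path `/etc/proxy-ca` are hypothetical; `ebs-plugin` is assumed to be the driver container in the controller Deployment. Setting `SSL_CERT_FILE` relies on Go's `crypto/x509` honoring that variable on Linux; because it replaces the default bundle location, the file would need to contain every CA the driver must trust.

```python
import json
from typing import Any, Dict


def ca_bundle_patch(
    configmap: str = "proxy-ca",       # hypothetical ConfigMap holding the CA bundle
    key: str = "ca.crt",               # hypothetical key inside that ConfigMap
    container: str = "ebs-plugin",     # assumed name of the driver container
    mount_dir: str = "/etc/proxy-ca",  # arbitrary mount path chosen for this sketch
) -> Dict[str, Any]:
    """Build a strategic-merge patch that mounts a CA bundle into one container."""
    volume_name = "proxy-ca"
    return {
        "spec": {"template": {"spec": {
            # Pod-level volume backed by the ConfigMap.
            "volumes": [{"name": volume_name, "configMap": {"name": configmap}}],
            "containers": [{
                # Strategic merge matches containers by name, so only this
                # container is modified and the sidecars are left untouched.
                "name": container,
                "volumeMounts": [
                    {"name": volume_name, "mountPath": mount_dir, "readOnly": True}
                ],
                # Point the Go TLS stack at the mounted bundle.
                "env": [{"name": "SSL_CERT_FILE", "value": f"{mount_dir}/{key}"}],
            }],
        }}}
    }


if __name__ == "__main__":
    # Print the patch so it can be saved to a file and reviewed before use.
    print(json.dumps(ca_bundle_patch(), indent=2))
```

The printed JSON could be saved and applied with `kubectl patch deployment ... --patch-file`, with the Deployment name and namespace taken from the actual installation. A Helm-based install would more likely express the same volume and mount through chart values instead.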
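As a rough illustration of the readiness-probe idea (not the driver's implementation, which is written in Go and is not shown in this thread), the sketch below wraps an arbitrary check in a cache so that repeated probe hits do not themselves generate API traffic, and turns any exception into a readable message instead of leaving it unhandled. `CachedReadiness` and its parameters are hypothetical names; the `check` callable stands in for something like a DescribeVolumes call.

```python
import time
from typing import Callable, Optional, Tuple


class CachedReadiness:
    """Run a check at most once per TTL and report a handled, readable result."""

    def __init__(
        self,
        check: Callable[[], None],  # should raise on failure, return on success
        ttl_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._check = check
        self._ttl = ttl_seconds
        self._clock = clock
        # (timestamp, ready, message) of the most recent real check, if any.
        self._last: Optional[Tuple[float, bool, str]] = None

    def status(self) -> Tuple[bool, str]:
        now = self._clock()
        # Serve the cached result while it is still fresh.
        if self._last is not None and now - self._last[0] < self._ttl:
            return self._last[1], self._last[2]
        try:
            self._check()
            ready, message = True, "ok"
        except Exception as exc:
            # Surface the error type and text rather than swallowing it.
            ready, message = False, f"{type(exc).__name__}: {exc}"
        self._last = (now, ready, message)
        return ready, message
```

Caching failures as well as successes is a deliberate choice here: in the scenario described above, repeated failing calls ended in rate-limit errors, so a probe that retried on every hit could add to the same problem.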
Thank you! I appreciate your assistance throughout the process. |
/kind bug
What happened?
Followed the instructions on the GitHub page to deploy ebs-csi-driver and provision a volume using a custom storage class. The EBS driver is fully deployed, but the PVCs are stuck in Pending and the provisioner is throwing unhandled exceptions. Eventually it ends with rate limit exceptions.
What you expected to happen?
A PVC successfully bound with the specified storage class.
How to reproduce it (as minimally and precisely as possible)?
Deploy the EBS driver, then attempt to provision a volume using the storage class.
Anything else we need to know?:
Attaching the csi-provisioner log and the describe output of a volume with its events. Rate limit errors continue to appear after it errors out with the unhandled exceptions. I think those are just a product of the unhandled exceptions.
Environment
kubectl version: 1.24.17