Enable taking node temporarily offline due to specific machine issue in Adoptium #5730
Comments
test-azure-ubuntu2404-x64-1 was hit twice:
https://ci.adoptium.net/job/Test_openjdk21_hs_sanity.openjdk_x86-64_linux_testList_1/19/console
https://ci.adoptium.net/job/Test_openjdk21_hs_special.system_x86-64_linux/28/console
Currently test-azure-ubuntu2404-x64-1 is marked offline. I believe it was marked offline by the Jenkins auto-offline for machines that are low on space? @sxa, is it marked offline by infra's scheduled task? How would infra process this case?
Heya @sophia-guo It looks like the auto-offline logic isn't working at the moment. In short, jobs like this one fail due to lack of space, and our attempt to take the machine offline fails with this error:
Which I presume is being caused by this code.
Re #5730 (comment): this is not a code issue. As the error states, an Adoptium Jenkins admin needs to permit the use of staticMethod hudson.model.User current on Adoptium Jenkins.
The problem is that the code is attempting to use a method it is not authorised to use. You are correct that one solution is to get the Jenkins admins to authorise use of that static method. Your solution also looks like the best one when I compared it with the alternatives (such as using SimpleOfflineCause instead of UserCause, which is less optimal because I'm not seeing a trivial way to create instances of the Localizable class). P.S. I also discovered that the setTemporarilyOffline method is deprecated in favour of setTemporaryOfflineCause. I'm noting that here in case setTemporarilyOffline is removed in a future update.
I have permitted it, but have also done so for a different method previously. Not sure what other ones will pop up. May be worth bringing that machine back online and sending a job to it to see if we get past any other approvals needed.
That machine is back online now, though it now has much more free space (I raised an issue for it), and is unlikely to see this issue again any time soon.
Will keep an eye open for automatic machine disabling in future triage (whether it works or not).
https://ci.adoptium.net/view/Test_grinder/job/Grinder/12049/ The same agent, test-azure-ubuntu2404-x64-1, has no space left again. I just added the SLACK_CHANNEL parameter, so the permission issue happens again: https://ci.adoptium.net/view/Test_grinder/job/Grinder/12050/
@smlambert maybe you can check whether the permission message pops up? After this fix, if we don't want this feature available for Grinder, I can remove the parameter. @adamfarley if the machine came back with a fix, it's weird that it ran out of space in such a short time.
Agreed. Here's the issue: adoptium/infrastructure#3843. Note that this was the second time in a month that this machine has run out of space and been resurrected as "fixed", so perhaps a lack of overall storage space isn't the problem.
Useful link: API for "setTemporarilyOffline": https://javadoc.jenkins.io/hudson/model/Computer.html#setTemporarilyOffline(boolean,hudson.slaves.OfflineCause) If we do go for a code fix instead of enabling us to use the currently-banned static method, I suggest this:
Adding the SLACK_CHANNEL parameter to the configuration of https://ci.adoptium.net/view/Test_grinder/job/Test_Job_Auto_Gen/ enables taking a node offline when a specific machine issue is detected.
This issue has been opened to monitor any problems while this is enabled.
Exception: hudson.AbortException: Failed to run ssh-agent: mkdtemp: private socket dir: No space left on device
https://ci.adoptium.net/job/Test_openjdk21_hs_special.system_x86-64_linux/28/console
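
For anyone triaging these failures, here is a rough sketch of the kind of check being discussed: scan a job's console output for a known machine-issue signature and decide whether the node is a candidate for being taken offline. This is not the actual pipeline code. The function names and the pattern table are hypothetical, the only signature taken from this issue is "No space left on device", and gating the behaviour on a non-empty SLACK_CHANNEL value is an assumption based on the description above.

```python
import re
from typing import Optional

# Hypothetical table of machine-issue signatures.
# Only the "No space left on device" text comes from this issue.
MACHINE_ISSUE_PATTERNS = {
    "no_space": re.compile(r"No space left on device"),
}


def find_machine_issue(console_text: str) -> Optional[str]:
    """Return the name of the first matching machine issue, or None."""
    for name, pattern in MACHINE_ISSUE_PATTERNS.items():
        if pattern.search(console_text):
            return name
    return None


def should_take_offline(console_text: str, slack_channel: str) -> bool:
    """Decide whether the node is a candidate for going offline.

    Assumption: the feature is only active when SLACK_CHANNEL is set.
    """
    if not slack_channel.strip():
        return False
    return find_machine_issue(console_text) is not None
```

Fed the exception text from this issue, `find_machine_issue` reports `no_space`. With an empty SLACK_CHANNEL, `should_take_offline` returns False regardless of the log content.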