There are a few situations where a slave node can get "stuck" -- it will get into a state where it will never finish what it's doing, but it also won't be marked offline, so any external capacity monitoring or alerting based on offline nodes will not report accurately.
The situations in which we've seen this occur are:
Could not post idle notification to master because master is unresponsive.
[2017-10-06 15:22:54.089] 5274 INFO Bld20637-Teardw cluster_slave Build teardown complete for build 20637.
[2017-10-06 15:24:54.175] 5274 NOTICE Bld20637-Teardw cluster_slave Could not post idle notification to master because master is unresponsive.
This could happen for several reasons: transient network issues, the cluster master's API server being tied up by a large download or upload, etc.
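One slave-side mitigation would be to retry the idle notification with backoff instead of giving up after a single unresponsive attempt. A minimal sketch of the idea is below; the `/v1/slave/idle` route, the payload fields, and the `post_idle_notification` helper are illustrative assumptions, not ClusterRunner's actual API.

```python
import time
import requests

def post_idle_notification(master_url, slave_id, attempts=5, base_delay=10):
    """Best-effort POST of an 'idle' notification, retrying with backoff.

    Endpoint path and payload shape are hypothetical placeholders.
    """
    payload = {'slave_id': slave_id, 'state': 'idle'}
    for attempt in range(attempts):
        try:
            resp = requests.post('{}/v1/slave/idle'.format(master_url),
                                 json=payload, timeout=30)
            resp.raise_for_status()
            return True
        except requests.RequestException:
            # Master unresponsive or returned an error; wait and retry.
            time.sleep(base_delay * (2 ** attempt))
    # Caller should escalate (e.g. mark itself offline) rather than hang forever.
    return False
```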
Posting subjob result to master failed because master is unresponsive.
[2017-10-06 15:23:10.198] 5452 DEBUG Bld20637-Sub221 git Command completed with exit code 0.
[2017-10-06 15:24:59.538] 5452 ERROR Bld20637-Sub222 unhandled_excep Unhandled exception handler caught exception.
Traceback (most recent call last):
File "/home/jenkins/.virtualenvs/clusterrunner/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 319, in _make_request
TypeError: getresponse() got an unexpected keyword argument 'buffering'
During handling of the above exception, another exception occurred:
...
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/jenkins/ClusterRunnerBuild/app/util/safe_thread.py", line 18, in run
File "/usr/local/lib/python3.4/threading.py", line 868, in run
File "/home/jenkins/ClusterRunnerBuild/app/slave/cluster_slave.py", line 313, in _execute_subjob
File "/home/jenkins/ClusterRunnerBuild/app/util/decorators.py", line 38, in function_with_retries
File "/home/jenkins/ClusterRunnerBuild/app/util/network.py", line 62, in post
File "/home/jenkins/ClusterRunnerBuild/app/util/network.py", line 150, in _request
File "/home/jenkins/.virtualenvs/clusterrunner/lib/python3.4/site-packages/requests/sessions.py", line 456, in request
File "/home/jenkins/.virtualenvs/clusterrunner/lib/python3.4/site-packages/requests/sessions.py", line 559, in send
File "/home/jenkins/.virtualenvs/clusterrunner/lib/python3.4/site-packages/requests/adapters.py", line 384, in send
requests.exceptions.Timeout: HTTPConnectionPool(host='pod4101-automation1024.pod.box.net', port=43000): Read timed out. (read timeout=120)
In this case the slave service actually shuts down because of the Timeout exception, but since the master was still unresponsive, the slave's call to notify the master of the shutdown also failed. The master still thinks this node is online.
Setup failed due to full disk space.
[2017-10-06 09:16:27.119] 5429 INFO SlaveTornadoThr cluster_slave Executing setup for build 20401 (type: git).
[2017-10-06 09:16:27.200] 5429 ERROR SlaveTornadoThr cluster_base_ha Exception occurred during request to /v1/build/20401/setup.
Traceback (most recent call last):
File "/home/jenkins/.virtualenvs/clusterrunner/lib/python3.4/site-packages/tornado/web.py", line 1346, in _when_complete
File "/home/jenkins/.virtualenvs/clusterrunner/lib/python3.4/site-packages/tornado/web.py", line 1367, in _execute_method
File "/home/jenkins/ClusterRunnerBuild/app/util/decorators.py", line 63, in function_with_auth
File "/home/jenkins/ClusterRunnerBuild/app/web_framework/cluster_slave_application.py", line 121, in post
File "/home/jenkins/ClusterRunnerBuild/app/slave/cluster_slave.py", line 99, in setup_build
File "/home/jenkins/ClusterRunnerBuild/app/util/util.py", line 50, in create_project_type
File "/home/jenkins/ClusterRunnerBuild/app/project_type/git.py", line 126, in __init__
OSError: [Errno 28] No space left on device: '/home/jenkins/.clusterrunner/repos/slave/pod4101automation1024.pod.box.net/home/jenkins/.clusterrunner/repos/master/[email protected]/scm/' -> '/tmp/clusterrunner_build_symlinks/85e1ab48-3688-4286-af46-367914e5486f'
[2017-10-06 09:16:27.224] 5429 ERROR SlaveTornadoThr tornado.access 500 POST /v1/build/20401/setup (10.5.7.57) 108.18ms
I suspect this error can also occur without the exception being recorded in the logs: if the disk fills up completely, the OSError cannot even be written to the log file.
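A cheap guard would be to check free space before build setup starts writing, and to report the condition somewhere other than the (possibly unwritable) log file. A rough sketch, where the path and threshold are just examples:

```python
import shutil

MIN_FREE_BYTES = 5 * 1024 ** 3  # e.g. require 5 GiB free before setup

def check_disk_before_setup(path='/home/jenkins/.clusterrunner/repos'):
    """Fail fast if the repos volume is nearly full."""
    usage = shutil.disk_usage(path)
    if usage.free < MIN_FREE_BYTES:
        # With the disk already full, even logging may fail, so surface this
        # over the network or stderr rather than only to a log file.
        raise RuntimeError('Only {} bytes free on {}; refusing to start setup.'
                           .format(usage.free, path))
```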
In all of these cases, the master still thinks the slave node is doing real work, even though the slave service will never recover. When this happens, the cluster is operating at reduced capacity, but that is almost impossible to detect externally via the API.
Currently the only way we can tell that this has happened is that our dashboard shows these nodes as busy, but the builds they're supposedly working on have completed. From the API, this looks like the /slave endpoint listing a node as working on build X, but the /queue endpoint does not show build X as in-progress (usually because other nodes finished all its subjobs). I believe that some of the above errors could also lead to builds appearing to get stuck forever.
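For now we detect this externally. A script along these lines would flag the symptom; note that the JSON field names (`slaves`, `current_build_id`, `queue`, `id`) are guesses at the response shapes, not the documented API:

```python
import requests

def find_stuck_slaves(master_url):
    """Return slaves that claim to be working on a build that is no longer queued."""
    slaves = requests.get('{}/v1/slave'.format(master_url), timeout=30).json()
    queue = requests.get('{}/v1/queue'.format(master_url), timeout=30).json()
    active_build_ids = {build['id'] for build in queue.get('queue', [])}

    stuck = []
    for slave in slaves.get('slaves', []):
        build_id = slave.get('current_build_id')
        # A slave busy on a build that is no longer queued or in progress is a
        # strong signal that it is wedged.
        if build_id is not None and build_id not in active_build_ids:
            stuck.append(slave)
    return stuck
```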
Related: #287 - we should probably make the master do heartbeat calls to the slave nodes that it hasn't heard from in a while.
Given the first situation I mentioned above (the slave could not post its "idle" notification to the master), the master should probably do more than just check that the slave service is still alive; it should also verify that the slave's reported status (idle, busy, etc.) matches what the master expects.
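Roughly what I have in mind, as a sketch only -- the endpoint path, the response fields, and `mark_slave_offline` are placeholders, not real ClusterRunner code:

```python
import requests

def heartbeat_check(slave, mark_slave_offline):
    """slave: object with .url and .expected_state ('idle'/'busy'), both assumed."""
    try:
        resp = requests.get('{}/v1'.format(slave.url), timeout=15)
        resp.raise_for_status()
        reported_state = resp.json().get('state')
    except requests.RequestException:
        # Slave process is unreachable -> clearly offline.
        mark_slave_offline(slave)
        return

    if reported_state != slave.expected_state:
        # Alive but out of sync (e.g. idle while the master thinks it is busy):
        # treat it as lost and reschedule its work.
        mark_slave_offline(slave)
```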