A build can take a long time, but ClusterRunner is designed to break a large task into small chunks, so the expectation is that a subjob finishes within a reasonable amount of time. If it does not, nodes are not deallocated in a timely manner, if at all, which can severely impact the cluster. For example, there have been occurrences where a single subjob was stuck indefinitely.
I think that subjob durations should be restricted to a finite time limit.
Fixing build cancellation would solve part of this problem (the client can cancel a job when it has taken longer than it wants to wait). Currently cancellation will not interrupt in-progress atoms.
I agree with TJ that if we add this, it should be an atom_timeout rather than a subjob_timeout. Subjobs are an intermediate internal batching that users don't have control over. Users do have control over their atoms.
If we were to add a default atom timeout, it should be very large and configurable in the clusterrunner.conf.
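To make the atom timeout idea concrete, here is a minimal sketch of running a single atom command under a time limit. It is not taken from the ClusterRunner codebase: `run_atom`, `DEFAULT_ATOM_TIMEOUT_SECONDS`, and the six-hour default are hypothetical names and values chosen for illustration, and reading the value from clusterrunner.conf is left out.

```python
import subprocess
import sys

# Hypothetical default. The thread only says the default should be
# "very large" and configurable; six hours is an arbitrary placeholder.
DEFAULT_ATOM_TIMEOUT_SECONDS = 6 * 60 * 60


def run_atom(command, timeout_seconds=DEFAULT_ATOM_TIMEOUT_SECONDS):
    """Run one atom command and return (exit_code, timed_out).

    `command` is an argument list such as ["pytest", "tests/test_a.py"].
    If the command is still running after `timeout_seconds`, it is killed
    and `timed_out` is True.
    """
    process = subprocess.Popen(command)
    try:
        exit_code = process.wait(timeout=timeout_seconds)
        return exit_code, False
    except subprocess.TimeoutExpired:
        # The atom exceeded its limit: kill it and reap the process so the
        # node is not left waiting on it.
        process.kill()
        process.wait()
        return process.returncode, True


if __name__ == "__main__":
    # A stuck atom is simulated with a long sleep and a short timeout.
    code, timed_out = run_atom(
        [sys.executable, "-c", "import time; time.sleep(30)"],
        timeout_seconds=0.5,
    )
    print("timed out:", timed_out)
```

Note that this only kills the direct child process. If atoms are launched through a shell and spawn their own children, those would probably need to be cleaned up as well (for example by killing the whole process group), which this sketch does not attempt.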
I filed this issue because it looked like a bunch of subjobs were stuck on the dashboard. After talking with Joey, it sounds like some may be false positives caused by #287.