Updates to 'fre pp run' to account for delay in starting Cylc scheduler #318
Conversation
…cheduler:
- check to see if the scheduler is already up; if so, exit right away with success
- otherwise, sleep 30 seconds, then confirm the scheduler is actually up; if not, error

30 seconds was chosen based on hand-testing and might have to be adjusted later.
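The steps above could be sketched roughly like this (a minimal illustration only, not the actual fre-cli code; the `is_scheduler_up` and `wait_for_scheduler` names and the bare `cylc scan` invocation are stand-ins):

```python
import subprocess
import sys
import time

def is_scheduler_up(name):
    """Return True if `cylc scan` lists a scheduler for this exact workflow name."""
    result = subprocess.run(
        ['cylc', 'scan', '--name', f"^{name}$"],
        capture_output=True,
    )
    return len(result.stdout.decode('utf-8')) > 0

def wait_for_scheduler(name, delay=30):
    # Already up? Exit right away with success.
    if is_scheduler_up(name):
        print("Workflow already running!")
        return
    # Otherwise give the scheduler time to come up, then confirm it did.
    time.sleep(delay)
    if not is_scheduler_up(name):
        sys.exit(f"Error: scheduler for '{name}' did not start within {delay} seconds")
```

The 30-second delay is a single up-front sleep rather than a polling loop, matching the description above.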
It's downright shameful that I'm checking only 3 out of 7 of the Ready checklist here... We don't actually run anything Cylc-related in the CI, though, so how would one test it?
This is less a "request changes" and more a "request clarification". The status script also has a time-out. Question: instead of trying to ping for the scheduler itself, why not just do a status check until the desired job appears, given that the status script already has its own time-out? Why do we need two different time-out times?
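For comparison, the alternative being asked about might look something like this (a hypothetical sketch; `get_status` stands in for whatever the status script reports and is not real fre-cli code):

```python
import time

def wait_for_job(name, get_status, timeout=300, interval=10):
    """Poll the status output until the desired job appears, reusing one
    overall timeout rather than a separate scheduler-ping timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if name in get_status():
            return True
        time.sleep(interval)
    return False
```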
result = subprocess.run(['cylc', 'scan', '--name', f"^{name}$"], capture_output=True).stdout.decode('utf-8')
if len(result):
    print("Workflow already running!")
    return
Do you think we should check for and report early task failures in the running workflow?
For example, I ran a pp.starter sbatch script on gaea. Checking through cylc scan on ppan, it shows my experiment is running. However, I had already gotten an email about the pp.starter failing. So, if the pp-starter or stage-history task has already failed, the workflow spends quite a bit of time still "running" when really I know the rest of the workflow won't run. Instead of waiting 20-30 minutes or however long for a complete failure, maybe we can check those tasks and either cylc stop or print something?
Hmmm, I guess maybe we don't want a complete stop if only some years fail with the pp-starter task, though. Eh, I'm just thinking out loud here, no need to listen to this one.
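One way to sketch that early-failure check (heavily hedged: the task names and the per-line "name:state" format here are assumptions for illustration, and the actual command used to dump task states would need verifying against the Cylc version in use):

```python
# Tasks whose early failure likely dooms the rest of the run (assumed names).
EARLY_TASKS = ("pp-starter", "stage-history")

def early_failures(state_lines):
    """Given per-task state lines, return the ones reporting a failed early task."""
    return [line for line in state_lines
            if "failed" in line and any(t in line for t in EARLY_TASKS)]
```

The caller could then print a warning, or go as far as issuing a cylc stop, when the returned list is non-empty.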
ALSO, every once in a while, we might get something like this when doing cylc scan:
INFO - Removed contact file for c96L65_am5f7b12r1_amip__gfdl.ncrc5-intel22-classic__prod-openmp (workflow no
longer running).
WARNING - Workflow not running: c96L65_am5f7b12r1_amip__gfdl.ncrc5-intel22-classic__prod-openmp
This was obviously an experiment I wasn't running, but here the stdout in result wouldn't be empty. (This doesn't happen often, I don't think - and honestly it may have only been because I accidentally did cylc scan instead of cylc scan --name [experiment].) According to cylc scan --help:
"These files may be left behind if the scheduler did not shut down cleanly"
This might've just been a me issue, but I wanted to flag it...
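If those INFO/WARNING lines ever do land in the captured output, one defensive tweak would be to filter them out before treating the result as "workflow found" (a sketch, assuming the log chatter is distinguishable by its INFO -/WARNING - prefixes; note also that capture_output=True keeps stderr separate from stdout, which may already avoid this):

```python
def running_workflows(scan_output):
    """Filter `cylc scan` output down to lines that look like workflow names,
    dropping INFO/WARNING log chatter from e.g. stale contact files."""
    names = []
    for line in scan_output.splitlines():
        line = line.strip()
        if not line or line.startswith(("INFO", "WARNING")):
            continue
        names.append(line.split()[0])
    return names
```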
Hi Dana, thanks for the testing and feedback. All of the issues you're seeing we have to bring under more control, improve, and document.
A cylc workflow can be running (scheduler up and making progress on the task graph), stalled (scheduler up and not making progress on the task graph likely due to task failures that need intervention), or not running.
"if the pp-starter or stage-history task has already failed, the workflow spends quite a bit of time still "running" when really I know the rest of the workflow won't run if those failed already. Instead of waiting 20-30 minutes or however long for a complete failure, maybe we can check those tasks, and either cylc stop or print something?"
I don't understand this part... let's try to disentangle it. As an example, let's use our 40-year amip case. When the Cylc scheduler starts, it will try to run each of the 40 pp.starter tasks. All will fail except the one whose history files just arrived; that one will run. Then, when the second year's history transfers, Cylc will rerun the second year's pp.starter task, which should succeed this time as the history files are there. Where does the 20-30 minutes come in?
Part of what you're seeing is the workaround-way we're communicating to Cylc when the history file is ready. The pp.starter task merely checks whether the history tarfile is present, so failed pp.starter tasks are expected in our current setup. The better way to communicate to Cylc that a history file is ready is to use "external triggers", not a task. We can make that change, and that would improve the look and feel of this.
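As an illustration of the external-trigger idea: a custom Cylc xtrigger can be a plain Python function that Cylc polls at intervals, returning a (satisfied, results) pair. A minimal sketch of a history-file check (the function name, argument, and file layout are hypothetical; the exact xtrigger interface should be checked against the Cylc 8 docs before relying on this):

```python
import os

def history_ready(tarfile_path):
    """Succeed once the history tarfile exists; Cylc would poll this and
    release the dependent tasks when it returns True."""
    if os.path.exists(tarfile_path):
        return True, {"history_path": tarfile_path}
    return False, {}
```

Wired up from the workflow configuration, this would replace the fail-until-the-file-arrives pp.starter pattern with a trigger that simply isn't satisfied until the file arrives.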
I see those cylc scan notes too now and then. They allude to the mechanism Cylc uses to determine whether a workflow is running or not. Sometimes when a workflow was running and is no longer, I see that note as well.
Yeah, the 20-30 minutes of waiting I'm pretty sure is due to when the workflow is stalled. When that happens, I think cylc scan --state=running --name [name] still outputs the experiment name. That's why I added quotes around "running". I could be misunderstanding cylc scan, though. In any case, I agree with your first statement - "All of the issues you're seeing we have to bring under more control, improve, and document."
This comment was mostly my brain rambling about some things I saw or was thinking about.
My concerns are addressed. I don't want to grant approval if you and @singhd789 are still discussing things, though.
My clarification request was satisfied by #318 (comment).
I think we could have a trivial "dummy" Cylc workflow living here in fre-cli that does nothing but still exercises the basics. Sort of like the null model the fre make team is using to test the compile scripts, maybe.
Thanks Ian. Dana, I wish we were already using the external triggers for the history file check, which would present a more natural view of the pp.starter handoff. The way we have it now is just bad... it doesn't bother me that much because I can imagine how it should be (plus, the proto Shield pp flow.cylc already does this the right way), but it's bad nonetheless and it would drive me crazy too.
Thanks for letting this one through, you two. Much of what bugs you, Dana, about this interface is exactly what we need to improve. Silent errors and partial errors (stalled workflows, etc.) are naturally confusing and unpalatable. We need a unified way for the user to recover when things go wrong, and a way to give users the bad news when necessary. As we get better with Cylc this will look better, but I can't disagree that it's not very clear right now. We have a few bad-news scenarios now: when the pp.starter (the Bronx-submitted job, not the Cylc task) fails, when the Cylc workflow stalls, and when it shuts down for good (removing the workflow share directory).
Describe your changes
Issue ticket number and link (if applicable)
#317
Checklist before requesting a review
- [ ] I wrote a new test, if applicable (calls to cylc are not presently testable)