Updates to 'fre pp run' to account for delay in starting Cylc scheduler #318

ceblanton · 2025-01-08T21:51:07Z

Describe your changes

check to see if the scheduler is already up; if so exit right away with success
otherwise, sleep 30 seconds, then confirm the scheduler is actually up
if not, error 30 seconds was chosen based on hand-testing and might have to be adjusted later

Issue ticket number and link (if applicable)

Checklist before requesting a review

I ran my code
I tried to make my code readable
I tried to comment my code
~~- [ ] I wrote a new test, if applicable~~ (calls to cylc are not yet testable presently)
I wrote new instructions/documentation, if applicable
I ran pytest and inspected it's output
I ran pylint and attempted to implement some of it's feedback

…cheduler - check to see if the scheduler is already up; if so exit right away with success - otherwise, sleep 30 seconds, then confirm the scheduler is actually up - if not, error 30 seconds was chosen based on hand-testing and might have to be adjusted later

ceblanton · 2025-01-08T21:54:44Z

It's downright shameful that I'm checking only 3 out of 7 of the Ready checklist here...

We don't actually run anything Cylc-related in the CI, though, so how would one test it?

ilaflott

this is less a "request changes" and more a, "request clarification".

The status script also has a time-out time.

Question- instead of trying to ping for the scheduler itself, why not just issue a status check until the desired job appears? Why do we need two different time-out times?

ilaflott · 2025-01-08T23:43:44Z

It's downright shameful that I'm checking only 3 out of 7 of the Ready checklist here...

do a .md cross out, surrounding with ~~ to signify "not-applicable".

We don't actually run anything Cylc-related in the CI, though, so how would one test it?

given that wrapper calls run a workflow, to test it in fre-cli sensibly, you're looking at either developing a suitable mock set-up (non-trivial, but containers can/should be used). IMO, that's not the biggest bang-for-buck at this time. We don't have any tests for fre-workflows yet, and that repo is about our workflows. We could use fre-cli to notch a basic integration test in fre-workflows. That'll take some data

singhd789 · 2025-01-09T16:20:18Z

fre/pp/run_script.py

+    result = subprocess.run(['cylc', 'scan', '--name', f"^{name}$"], capture_output=True).stdout.decode('utf-8')
+    if len(result):
+        print("Workflow already running!")
+        return


Do you think we should we check/output for beginning task failures that happened in the running workflow?

For example, I ran an pp.starter sbatch script on gaea. Checking through cylc scan on ppan, it shows my experiment is running. However, I had already gotten an email about the pp.starter failing. Soo, if the pp-starter or stage-history task has already failed, the workflow spends quite a bit of time still "running" when really I know the rest of the workflow won't run if those failed already. Instead of waiting 20-30 minutes or however long for a complete failure, maybe we can check those tasks, and either cylc stop or print something?

Hmmm, I guess maybe we don't want a complete stop if only some years fail with the pp-starter task though. Eh, I'm just thinking out loud here, no need to listen to this one.

ALSO, every once in a while, we might get something like this in doing cylc scan:

INFO - Removed contact file for c96L65_am5f7b12r1_amip__gfdl.ncrc5-intel22-classic__prod-openmp (workflow no longer running). WARNING - Workflow not running: c96L65_am5f7b12r1_amip__gfdl.ncrc5-intel22-classic__prod-openmp

This was obviously an experiment I wasn't running, but here the stdout in result wouldn't be empty (now this doesn't happen often I don't think - and honestly it may have only been because I accidentally did cylc scan instead of cylc scan --name [experiment]). According the the cylc scan --help:

"These files may left behind if the scheduler did not shut down cleanly"

This might've just been a me issue but just wanted to be aware of this...

Hi Dana, thanks for the testing and feedback. All of the issues you're seeing we have to bring under more control, improve, and document.

A cylc workflow can be running (scheduler up and making progress on the task graph), stalled (scheduler up and not making progress on the task graph likely due to task failures that need intervention), or not running.

"if the pp-starter or stage-history task has already failed, the workflow spends quite a bit of time still "running" when really I know the rest of the workflow won't run if those failed already. Instead of waiting 20-30 minutes or however long for a complete failure, maybe we can check those tasks, and either cylc stop or print something?"

I don't understand this part.. let's try to disentangle it. For an example, let's use our 40 year amip case. When the Cylc scheduler starts, it will try to run each of the 40 pp.starter tasks. All will fail except for the one whose history files just arrived. That one will run. Then the second year history transfers, then the pp.starter for the second year will run, and it will restart the second year pp.starter task, which should work this time as the history files are there. Where does the 20-30 minutes come in?

Part of what you're seeing is the workaround-way we're communicating to Cylc when the history file is ready. The pp.starter task merely checks to see if the history tarfile is present, so failed pp.starter tasks are expected in our current setup. The better way to communicate to Cylc that a history file is ready is to use "external triggers" not a task. We can make that change, and that would improve the look and feel of this.

I see those cylc scan notes too now and then. The mechanism that Cylc uses to determine if a cylc workflow is running or not, is what it's alluding to. Sometimes when a workflow was running and is no longer, I see that note as well.

Yeah, the 20-30 minutes of waiting I'm pretty sure is due to when the workflow is stalled. When that happens, I think doing cylc scan --state=running --name [name], still outputs with the experiment name. That's why I added quotes around running. I could be misunderstanding cylc scan though. In any case, I agree with your first statement - "All of the issues you're seeing we have to bring under more control, improve, and document."

My brain was mostly rambling with this comment with some things I saw or was thinking about.

ceblanton · 2025-01-10T19:28:35Z

this is less a "request changes" and more a, "request clarification".

The status script also has a time-out time.

Question- instead of trying to ping for the scheduler itself, why not just issue a status check until the desired job appears? Why do we need two different time-out times?

The fre pp status tool might use a time-out, but it doesn't have to, and would be simpler without it. As we have it now, fre pp status is just a call-out to cylc workflow-state.

cylc workflow-state lists the state of all the tasks, but doesn't report on whether the workflow engine/scheduler itself is running. cylc scan reports on whether the workflow is running or not, but doesn't give any details on the tasks.

ilaflott

my concerns are addressed. i don't want to grant approval if you and @singhd789 are still discussing things though

my clarification-request satisfied by #318 (comment)

ceblanton · 2025-01-10T19:47:41Z

It's downright shameful that I'm checking only 3 out of 7 of the Ready checklist here...

do a .md cross out, surrounding with ~~ to signify "not-applicable".

We don't actually run anything Cylc-related in the CI, though, so how would one test it?

given that wrapper calls run a workflow, to test it in fre-cli sensibly, you're looking at either developing a suitable mock set-up (non-trivial, but containers can/should be used). IMO, that's not the biggest bang-for-buck at this time. We don't have any tests for fre-workflows yet, and that repo is about our workflows. We could use fre-cli to notch a basic integration test in fre-workflows. That'll take some data

I think we could have a "dummy" trivial Cylc workflow that could live here in fre-cli, that could do nothing and still test the basic cylc install, cylc run, and cylc scan calls that the fre pp wrapper set is exercising.

Sort of like the null model the fre make team is using to test the compile scripts maybe.

ceblanton · 2025-01-10T19:52:08Z

Thanks Ian. Dana, I wish we were already using the external triggers for the history file check, which would present a more natural view of the pp.starting handoff. The way we have it now is just bad.. it doesn't bother me that much because I can imagine how it should be (plus, the proto Shield pp flow.cylc already does this the right way), but it's bad nonetheless and it would drive me crazy too.

ceblanton · 2025-01-10T20:15:03Z

Thanks for letting this one through, you two. Much of what bugs you Dana about this interface is exactly what we need to improve. Silent errors and partial errors (stalled, etc) are so naturally confusing and unpalatable.

We need a unified way for the user to recover when things go wrong, and a way to give users the bad news when necessary. As we get better with Cylc this will look better, but I can't disagree it's not very clear right now.

We have a few bad-news scenarios now. When the pp.starter (the Bronx-submitted job, not the Cylc task) task fails, when when the Cylc workflow stalls, and when it shuts down for good (with removing the workflow share directory).

ceblanton requested review from ilaflott and singhd789 January 8, 2025 21:52

ilaflott previously requested changes Jan 8, 2025

View reviewed changes

singhd789 reviewed Jan 9, 2025

View reviewed changes

ilaflott reviewed Jan 10, 2025

View reviewed changes

singhd789 approved these changes Jan 10, 2025

View reviewed changes

ceblanton merged commit f5d4c37 into main Jan 10, 2025
2 checks passed

ceblanton deleted the 317.fre-pp-run-double-check branch January 10, 2025 20:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Updates to 'fre pp run' to account for delay in starting Cylc scheduler #318

Updates to 'fre pp run' to account for delay in starting Cylc scheduler #318

ceblanton commented Jan 8, 2025 •

edited

Loading

ceblanton commented Jan 8, 2025

ilaflott left a comment •

edited

Loading

ilaflott commented Jan 8, 2025

singhd789 Jan 9, 2025 •

edited

Loading

ceblanton Jan 10, 2025

singhd789 Jan 10, 2025 •

edited

Loading

ceblanton commented Jan 10, 2025

ilaflott left a comment

ceblanton commented Jan 10, 2025

ceblanton commented Jan 10, 2025

ceblanton commented Jan 10, 2025

Updates to 'fre pp run' to account for delay in starting Cylc scheduler #318

Updates to 'fre pp run' to account for delay in starting Cylc scheduler #318

Conversation

ceblanton commented Jan 8, 2025 • edited Loading

Describe your changes

Issue ticket number and link (if applicable)

Checklist before requesting a review

ceblanton commented Jan 8, 2025

ilaflott left a comment • edited Loading

Choose a reason for hiding this comment

ilaflott commented Jan 8, 2025

singhd789 Jan 9, 2025 • edited Loading

Choose a reason for hiding this comment

ceblanton Jan 10, 2025

Choose a reason for hiding this comment

singhd789 Jan 10, 2025 • edited Loading

Choose a reason for hiding this comment

ceblanton commented Jan 10, 2025

ilaflott left a comment

Choose a reason for hiding this comment

ceblanton commented Jan 10, 2025

ceblanton commented Jan 10, 2025

ceblanton commented Jan 10, 2025

ceblanton commented Jan 8, 2025 •

edited

Loading

ilaflott left a comment •

edited

Loading

singhd789 Jan 9, 2025 •

edited

Loading

singhd789 Jan 10, 2025 •

edited

Loading