Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updates to 'fre pp run' to account for delay in starting Cylc scheduler #318

Merged
merged 1 commit into from
Jan 10, 2025

Conversation

ceblanton
Copy link
Collaborator

@ceblanton ceblanton commented Jan 8, 2025

Describe your changes

  • check to see if the scheduler is already up; if so exit right away with success
  • otherwise, sleep 30 seconds, then confirm the scheduler is actually up
  • if not, error 30 seconds was chosen based on hand-testing and might have to be adjusted later

Issue ticket number and link (if applicable)

#317

Checklist before requesting a review

  • I ran my code
  • I tried to make my code readable
  • I tried to comment my code
    - [ ] I wrote a new test, if applicable (calls to cylc are not yet testable presently)
  • I wrote new instructions/documentation, if applicable
  • I ran pytest and inspected it's output
  • I ran pylint and attempted to implement some of it's feedback

…cheduler

- check to see if the scheduler is already up; if so exit right away with success
- otherwise, sleep 30 seconds, then confirm the scheduler is actually up
- if not, error
30 seconds was chosen based on hand-testing and might have to be adjusted later
@ceblanton
Copy link
Collaborator Author

It's downright shameful that I'm checking only 3 out of 7 of the Ready checklist here...

We don't actually run anything Cylc-related in the CI, though, so how would one test it?

Copy link
Member

@ilaflott ilaflott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is less a "request changes" and more a, "request clarification".

The status script also has a time-out time.

Question- instead of trying to ping for the scheduler itself, why not just issue a status check until the desired job appears? Why do we need two different time-out times?

@ilaflott
Copy link
Member

ilaflott commented Jan 8, 2025

It's downright shameful that I'm checking only 3 out of 7 of the Ready checklist here...

do a .md cross out, surrounding with ~~ to signify "not-applicable".

We don't actually run anything Cylc-related in the CI, though, so how would one test it?

given that wrapper calls run a workflow, to test it in fre-cli sensibly, you're looking at either developing a suitable mock set-up (non-trivial, but containers can/should be used). IMO, that's not the biggest bang-for-buck at this time. We don't have any tests for fre-workflows yet, and that repo is about our workflows. We could use fre-cli to notch a basic integration test in fre-workflows. That'll take some data

Comment on lines +15 to +18
result = subprocess.run(['cylc', 'scan', '--name', f"^{name}$"], capture_output=True).stdout.decode('utf-8')
if len(result):
print("Workflow already running!")
return
Copy link
Collaborator

@singhd789 singhd789 Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should we check/output for beginning task failures that happened in the running workflow?

For example, I ran an pp.starter sbatch script on gaea. Checking through cylc scan on ppan, it shows my experiment is running. However, I had already gotten an email about the pp.starter failing. Soo, if the pp-starter or stage-history task has already failed, the workflow spends quite a bit of time still "running" when really I know the rest of the workflow won't run if those failed already. Instead of waiting 20-30 minutes or however long for a complete failure, maybe we can check those tasks, and either cylc stop or print something?

Hmmm, I guess maybe we don't want a complete stop if only some years fail with the pp-starter task though. Eh, I'm just thinking out loud here, no need to listen to this one.

ALSO, every once in a while, we might get something like this in doing cylc scan:

INFO - Removed contact file for c96L65_am5f7b12r1_amip__gfdl.ncrc5-intel22-classic__prod-openmp (workflow no
    longer running).
WARNING - Workflow not running: c96L65_am5f7b12r1_amip__gfdl.ncrc5-intel22-classic__prod-openmp

This was obviously an experiment I wasn't running, but here the stdout in result wouldn't be empty (now this doesn't happen often I don't think - and honestly it may have only been because I accidentally did cylc scan instead of cylc scan --name [experiment]). According the the cylc scan --help:

"These files may left behind if the scheduler did not shut down cleanly"

This might've just been a me issue but just wanted to be aware of this...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Dana, thanks for the testing and feedback. All of the issues you're seeing we have to bring under more control, improve, and document.

A cylc workflow can be running (scheduler up and making progress on the task graph), stalled (scheduler up and not making progress on the task graph likely due to task failures that need intervention), or not running.

"if the pp-starter or stage-history task has already failed, the workflow spends quite a bit of time still "running" when really I know the rest of the workflow won't run if those failed already. Instead of waiting 20-30 minutes or however long for a complete failure, maybe we can check those tasks, and either cylc stop or print something?"

I don't understand this part.. let's try to disentangle it. For an example, let's use our 40 year amip case. When the Cylc scheduler starts, it will try to run each of the 40 pp.starter tasks. All will fail except for the one whose history files just arrived. That one will run. Then the second year history transfers, then the pp.starter for the second year will run, and it will restart the second year pp.starter task, which should work this time as the history files are there. Where does the 20-30 minutes come in?

Part of what you're seeing is the workaround-way we're communicating to Cylc when the history file is ready. The pp.starter task merely checks to see if the history tarfile is present, so failed pp.starter tasks are expected in our current setup. The better way to communicate to Cylc that a history file is ready is to use "external triggers" not a task. We can make that change, and that would improve the look and feel of this.

I see those cylc scan notes too now and then. The mechanism that Cylc uses to determine if a cylc workflow is running or not, is what it's alluding to. Sometimes when a workflow was running and is no longer, I see that note as well.

Copy link
Collaborator

@singhd789 singhd789 Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the 20-30 minutes of waiting I'm pretty sure is due to when the workflow is stalled. When that happens, I think doing cylc scan --state=running --name [name], still outputs with the experiment name. That's why I added quotes around running. I could be misunderstanding cylc scan though. In any case, I agree with your first statement - "All of the issues you're seeing we have to bring under more control, improve, and document."

My brain was mostly rambling with this comment with some things I saw or was thinking about.

@ceblanton
Copy link
Collaborator Author

this is less a "request changes" and more a, "request clarification".

The status script also has a time-out time.

Question- instead of trying to ping for the scheduler itself, why not just issue a status check until the desired job appears? Why do we need two different time-out times?

The fre pp status tool might use a time-out, but it doesn't have to, and would be simpler without it. As we have it now, fre pp status is just a call-out to cylc workflow-state.

cylc workflow-state lists the state of all the tasks, but doesn't report on whether the workflow engine/scheduler itself is running. cylc scan reports on whether the workflow is running or not, but doesn't give any details on the tasks.

Copy link
Member

@ilaflott ilaflott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my concerns are addressed. i don't want to grant approval if you and @singhd789 are still discussing things though

@ilaflott ilaflott dismissed their stale review January 10, 2025 19:45

my clarification-request satisfied by #318 (comment)

@ceblanton
Copy link
Collaborator Author

It's downright shameful that I'm checking only 3 out of 7 of the Ready checklist here...

do a .md cross out, surrounding with ~~ to signify "not-applicable".

We don't actually run anything Cylc-related in the CI, though, so how would one test it?

given that wrapper calls run a workflow, to test it in fre-cli sensibly, you're looking at either developing a suitable mock set-up (non-trivial, but containers can/should be used). IMO, that's not the biggest bang-for-buck at this time. We don't have any tests for fre-workflows yet, and that repo is about our workflows. We could use fre-cli to notch a basic integration test in fre-workflows. That'll take some data

I think we could have a "dummy" trivial Cylc workflow that could live here in fre-cli, that could do nothing and still test the basic cylc install, cylc run, and cylc scan calls that the fre pp wrapper set is exercising.

Sort of like the null model the fre make team is using to test the compile scripts maybe.

@ceblanton
Copy link
Collaborator Author

Thanks Ian. Dana, I wish we were already using the external triggers for the history file check, which would present a more natural view of the pp.starting handoff. The way we have it now is just bad.. it doesn't bother me that much because I can imagine how it should be (plus, the proto Shield pp flow.cylc already does this the right way), but it's bad nonetheless and it would drive me crazy too.

@ceblanton
Copy link
Collaborator Author

Thanks for letting this one through, you two. Much of what bugs you Dana about this interface is exactly what we need to improve. Silent errors and partial errors (stalled, etc) are so naturally confusing and unpalatable.

We need a unified way for the user to recover when things go wrong, and a way to give users the bad news when necessary. As we get better with Cylc this will look better, but I can't disagree it's not very clear right now.

We have a few bad-news scenarios now. When the pp.starter (the Bronx-submitted job, not the Cylc task) task fails, when when the Cylc workflow stalls, and when it shuts down for good (with removing the workflow share directory).

@ceblanton ceblanton merged commit f5d4c37 into main Jan 10, 2025
2 checks passed
@ceblanton ceblanton deleted the 317.fre-pp-run-double-check branch January 10, 2025 20:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants