job.yaml doesn't include any pbs specific info #545

Open
jo-basevi opened this issue Jan 3, 2025 · 0 comments · May be fixed by #550

Comments

@jo-basevi
Collaborator

A job.yaml file is written out during the payu run job, once the model has run. It records information about the payu job, e.g.

PAYU_CONTROL_DIR: /home/$USERID/test-payu/mom6-double-grye-base
PAYU_CURRENT_RUN: 0
PAYU_FINISH_TIME: '2024-12-20T08:25:39.134174'
PAYU_JOB_STATUS: 0
PAYU_N_RUNS: 1
PAYU_PATH: /g/data/vk83/prerelease/apps/base_conda/envs/payu-1.1.6/bin
PAYU_RUN_ID: b35e7409aed05466e0d8215727df3b107f05be85
PAYU_START_TIME: '2024-12-20T08:25:34.287522'
PAYU_WALLTIME: 4.846652 s

It is currently generated within the Experiment.run() method:

payu/payu/experiment.py

Lines 675 to 701 in 20a8e76

info = get_job_info()
if info is None:
    # Not being run under PBS, reverse engineer environment
    info = {
        'PAYU_PATH': os.path.dirname(self.payu_path)
    }
# Add extra information to save to jobinfo
info.update(
    {
        'PAYU_CONTROL_DIR': self.control_path,
        'PAYU_RUN_ID': self.run_id,
        'PAYU_CURRENT_RUN': self.counter,
        'PAYU_N_RUNS': self.n_runs,
        'PAYU_JOB_STATUS': rc,
        'PAYU_START_TIME': self.start_time.isoformat(),
        'PAYU_FINISH_TIME': self.finish_time.isoformat(),
        'PAYU_WALLTIME': "{0} s".format(
            (self.finish_time - self.start_time).total_seconds()
        ),
    }
)
# Dump job info
with open(self.job_fname, 'w') as file:
    file.write(yaml.dump(info, default_flow_style=False))
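As a sanity check on the PAYU_WALLTIME field, the snippet below reproduces the value from the example job.yaml using the same expression as the excerpt. The function name format_walltime is hypothetical and only for illustration; it is not part of payu.

```python
from datetime import datetime


def format_walltime(start_time, finish_time):
    """Format elapsed time the same way as the PAYU_WALLTIME entry above.

    Hypothetical helper for illustration; payu computes this inline.
    """
    return "{0} s".format((finish_time - start_time).total_seconds())


# Timestamps copied from the example job.yaml in this issue
start = datetime.fromisoformat('2024-12-20T08:25:34.287522')
finish = datetime.fromisoformat('2024-12-20T08:25:39.134174')

print(format_walltime(start, finish))  # 4.846652 s
```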

get_job_info() uses qstat to obtain the job information. The call is wrapped in a tenacity retry block that retries the qstat command for 10 s. However, it has been failing because the 'PBS_EXEC' environment variable is not set during jobs, so no PBS job info is being logged.

payu/payu/schedulers/pbs.py

Lines 206 to 207 in 20a8e76

qstat = os.path.join(os.environ['PBS_EXEC'], 'bin', 'qstat')
cmd = '{} {}'.format(qstat, qflag)

Changing the above line to cmd = f"qstat {qflag}" seems to work OK.
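One possible middle ground, sketched here under assumptions rather than taken from payu or the linked pull request, is to keep using PBS_EXEC when it is set and fall back to whatever qstat is on PATH otherwise. The helper names resolve_qstat and build_qstat_cmd, and the optional environ argument, are hypothetical.

```python
import os
import shutil


def resolve_qstat(environ=None):
    """Return the qstat executable to call (hypothetical helper, not payu's API)."""
    environ = os.environ if environ is None else environ

    # Prefer the PBS installation directory when the variable is defined
    pbs_exec = environ.get('PBS_EXEC')
    if pbs_exec:
        candidate = os.path.join(pbs_exec, 'bin', 'qstat')
        if os.path.isfile(candidate):
            return candidate

    # Otherwise look qstat up on PATH; fall back to the bare name so the
    # failure surfaces at call time rather than as a KeyError here
    return shutil.which('qstat', path=environ.get('PATH')) or 'qstat'


def build_qstat_cmd(qflag, environ=None):
    """Build the command string in the same shape as the excerpt above."""
    return '{} {}'.format(resolve_qstat(environ), qflag)
```

Using environ.get() instead of os.environ['PBS_EXEC'] avoids the KeyError that currently prevents any PBS job info from being written.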

There's also an issue with PBS scheduler-specific functions being imported in the main experiment class, which may not work when running with other schedulers:

from payu.schedulers.pbs import get_job_info, pbs_env_init, get_job_id
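A rough sketch of how the experiment code could avoid importing PBS functions directly is to ask a scheduler object for job info and let non-PBS schedulers return None. Everything below (the Scheduler and PBSScheduler classes, the SCHEDULERS registry, collect_job_info, and the query argument) is hypothetical and is not payu's actual scheduler interface.

```python
class Scheduler:
    """Hypothetical base class: by default no scheduler job info is available."""

    def get_job_info(self):
        return None


class PBSScheduler(Scheduler):
    """Hypothetical PBS scheduler; `query` stands in for the qstat lookup."""

    def __init__(self, query=None):
        self._query = query if query is not None else (lambda: None)

    def get_job_info(self):
        return self._query()


# Hypothetical registry keyed by scheduler name
SCHEDULERS = {'pbs': PBSScheduler, 'slurm': Scheduler}


def collect_job_info(scheduler, payu_path):
    """Mirror the fallback in Experiment.run() without PBS-specific imports."""
    info = scheduler.get_job_info()
    if info is None:
        # No scheduler info available, so record only the payu path
        info = {'PAYU_PATH': payu_path}
    return info
```

With this shape, the experiment code would only depend on the get_job_info() method, and each scheduler would decide how (or whether) to provide the information.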

@jo-basevi jo-basevi linked a pull request Jan 20, 2025 that will close this issue