Adding tracking to payu #546

Open · jo-basevi (Collaborator) opened this issue Jan 7, 2025 · 0 comments

Telemetry is currently being added to ACCESS-NRI tools such as the intake catalogue using an IPython extension (https://github.com/ACCESS-NRI/access-ipy-telemetry). The goal is to add something similar to payu to automatically track the usage of models and resources. In the longer term, it could also be useful for users to track metadata from their experiments.

As a first iteration, it would be useful to decide what should be tracked. Some information from a payu run is already logged to a `job.yaml` file after the model has run. There's currently a bug in this code, as it no longer includes any info obtained from a `qstat` command (#545).

Fixing the above bug gives the following information in `job.yaml` (stored in `archive/output000`):
```yaml
Checkpoint: u
Error_Path: gadi.nci.org.au:/home/189/jb4202/test-payu/release-1deg_jra55_ryf/1deg_jra55_ryf.e131803315
Hold_Types: n
Job_ID: '131803315'
Job_Name: 1deg_jra55_ryf
Job_Owner: [email protected]
Join_Path: n
Keep_Files: n
Mail_Points: a
Output_Path: gadi.nci.org.au:/home/189/jb4202/test-payu/release-1deg_jra55_ryf/1deg_jra55_ryf.o131803315
PAYU_CONTROL_DIR: /home/189/jb4202/test-payu/release-1deg_jra55_ryf
PAYU_CURRENT_RUN: 0
PAYU_FINISH_TIME: '2025-01-03T12:45:52.119094'
PAYU_JOB_STATUS: 0
PAYU_N_RUNS: 1
PAYU_RUN_ID: ec0d0c8702781a5ed9f61d14dd1ff921fedb23c1
PAYU_START_TIME: '2025-01-03T12:44:32.071138'
PAYU_WALLTIME: 80.047956 s
Priority: '0'
Rerunable: 'False'
Resource_List.jobfs: 629145600b
Resource_List.mem: 1073741823996b
Resource_List.mpiprocs: '288'
Resource_List.ncpus: '288'
Resource_List.nodect: '6'
Resource_List.place: free
Resource_List.select: 6:ncpus=48:mpiprocs=48:mem=178956970666:job_tags=normal:jobfs=104857600
Resource_List.storage: gdata/vk83+scratch/tm70
Resource_List.walltime: 03:00:00
Resource_List.wd: '1'
Submit_Host: gadi-login-06.gadi.nci.org.au
Submit_arguments: -q normal -P tm70 -l walltime=10800 -l ncpus=288 -l mem=1000GB -N
 1deg_jra55_ryf -l wd -j n -v LD_LIBRARY_PATH=/apps/python3/3.10.0/lib,PAYU_PATH=/scratch/tm70/jb4202/payu-venv/bin,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/etc/scl/modulefiles:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles
 -W umask=027 -l storage=gdata/vk83+scratch/tm70 -- /scratch/tm70/jb4202/payu-venv/bin/python3
 /scratch/tm70/jb4202/payu-venv/bin/payu-run
Variable_List: PBS_O_HOME=/home/189/jb4202,PBS_O_LANG=en_AU.UTF-8,PBS_O_LOGNAME=jb4202,PBS_O_PATH=/apps/python3/3.10.0/bin:/scratch/tm70/jb4202/payu-venv/bin:/home/189/jb4202/.local/bin:/home/189/jb4202/bin:/opt/pbs/default/bin:/opt/nci/bin:/opt/bin:/opt/Modules/v4.3.0/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/pbs/default/bin,PBS_O_MAIL=/var/spool/mail/jb4202,PBS_O_SHELL=/bin/bash,PBS_O_TZ=:/etc/localtime,PBS_O_HOST=gadi-login-06.gadi.nci.org.au,PBS_O_WORKDIR=/home/189/jb4202/test-payu/release-1deg_jra55_ryf,PBS_O_SYSTEM=Linux,LD_LIBRARY_PATH=/apps/python3/3.10.0/lib,PAYU_PATH=/scratch/tm70/jb4202/payu-venv/bin,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/etc/scl/modulefiles:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles,PBS_NCI_HT=0,PBS_NCI_STORAGE=gdata/vk83+scratch/tm70,PBS_NCI_IMAGE=,PBS_NCPUS=288,PBS_NGPUS=0,PBS_NNODES=6,PBS_NCI_NCPUS_PER_NODE=48,PBS_NCI_NUMA_PER_NODE=4,PBS_NCI_NCPUS_PER_NUMA=12,PROJECT=tm70,PBS_VMEM=1073741824000,PBS_NCI_WD=1,PBS_NCI_JOBFS=629145600b,PBS_NCI_LAUNCH_COMPATIBILITY=0,PBS_NCI_FS_GDATA1=0,PBS_NCI_FS_GDATA1A=0,PBS_NCI_FS_GDATA1B=0,PBS_NCI_FS_GDATA2=0,PBS_NCI_FS_GDATA3=0,PBS_NCI_FS_GDATA4=0,PBS_O_QUEUE=normal,PBS_JOBFS=/jobfs/131803315.gadi-pbs
argument_list: <jsdl-hpcpa:Argument>/scratch/tm70/jb4202/payu-venv/bin/payu-run</jsdl-hpcpa:Argument>
comment: Job run at Fri Jan 03 at 12:44 on (gadi-cpu-clx-0328:ncpus=48:mem=174762667kb:jobfs=102400kb)+(gadi-cpu-clx-0329:ncpus=48:mem=174762667kb:jobfs=102400kb)+(gadi-cpu-clx-0331:ncpus=48:mem=174762667kb:jobfs=102400kb)+(gadi-cpu-clx-0332:ncpus=48:mem=174762...
ctime: Fri Jan  3 12:44:23 2025
etime: Fri Jan  3 12:44:23 2025
exec_host: gadi-cpu-clx-0328/0*48+gadi-cpu-clx-0329/0*48+gadi-cpu-clx-0331/0*48+gadi-cpu-clx-0332/0*48+gadi-cpu-clx-0333/0*48+gadi-cpu-clx-0334/0*48
exec_vnode: (gadi-cpu-clx-0328:ncpus=48:mem=174762667kb:jobfs=102400kb)+(gadi-cpu-clx-0329:ncpus=48:mem=174762667kb:jobfs=102400kb)+(gadi-cpu-clx-0331:ncpus=48:mem=174762667kb:jobfs=102400kb)+(gadi-cpu-clx-0332:ncpus=48:mem=174762667kb:jobfs=102400kb)+(gadi-cpu-clx-0333:ncpus=48:mem=174762667kb:jobfs=102400kb)+(gadi-cpu-clx-0334:ncpus=48:mem=174762667kb:jobfs=102400kb)
executable: <jsdl-hpcpa:Executable>/scratch/tm70/jb4202/payu-venv/bin/python3</jsdl-hpcpa:Executable>
group_list: tm70
job_state: R
jobdir: /home/189/jb4202
mtime: Fri Jan  3 12:45:50 2025
project: tm70
qtime: Fri Jan  3 12:44:23 2025
queue: normal-exec
resources_used.cpupercent: '9388'
resources_used.cput: 00:58:28
resources_used.jobfs: 16717kb
resources_used.mem: 43684816kb
resources_used.ncpus: '288'
resources_used.vmem: 43684816kb
resources_used.walltime: 00:01:18
run_count: '1'
server: gadi-pbs-01.gadi.nci.org.au
session_id: '272838'
stime: Fri Jan  3 12:44:27 2025
substate: '42'
umask: '27'
```

My user ID is present in a number of fields in `job.yaml`, so assuming that not tracking user IDs is an initial requirement while the telemetry is in testing, some useful information to track would be:

  • PAYU_RUN_ID (Commit hash of current payu run)
  • PAYU_CURRENT_RUN (Run counter, e.g. 0 for the first run)
  • PAYU_JOB_STATUS (Exit code of mpirun call)
  • PAYU_N_RUNS (The N_RUNS value in the command `payu run -n {N_RUNS}`)
  • PAYU_START_TIME (Start time, recorded when the experiment class is initialised)
  • PAYU_FINISH_TIME (Time recorded after the model run command has finished)
  • PAYU_WALLTIME (Difference between the above start and finish times)

PBS-specific

  • Job_ID, project

Additional fields to add:

  • payu.__version__ (Current running version of payu)
  • Dictionary of metadata fields read from file (at least `experiment_uuid`, `parent_experiment`, `name`, `model` - from `Experiment.metadata.read_file()`)
  • Fields for the archive directory path and remote sync archive path might also be useful to add (a sketch combining these fields follows below).
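
As a rough sketch of the above, a record-building helper could look something like this. `build_telemetry_record` and the field lists are illustrative, not existing payu API, and I'm assuming the metadata is available as a plain dict:

```python
import payu

# Fields proposed above, chosen to avoid user-identifying information
TRACKED_FIELDS = [
    "PAYU_RUN_ID", "PAYU_CURRENT_RUN", "PAYU_JOB_STATUS",
    "PAYU_N_RUNS", "PAYU_START_TIME", "PAYU_FINISH_TIME",
    "PAYU_WALLTIME",
    # PBS-specific
    "Job_ID", "project",
]

METADATA_FIELDS = ["experiment_uuid", "parent_experiment", "name", "model"]


def build_telemetry_record(info: dict, metadata: dict) -> dict:
    """Select non-identifying fields from the job.yaml info dict and
    add payu-level fields (hypothetical helper, for illustration only)."""
    record = {key: info.get(key) for key in TRACKED_FIELDS}
    record["payu_version"] = payu.__version__
    record["metadata"] = {key: metadata.get(key) for key in METADATA_FIELDS}
    return record
```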

There are `resources_used.*` fields from the `qstat` command, but these won't be final values while the PBS job is still running, and they are only updated periodically. For the above job, which was relatively short, the memory use recorded in `job.yaml` was ~43GB vs Memory Used: 78.81GB in the resource usage summary in the job logs. The walltime was close (00:01:18 vs 00:01:24), but the CPU time used was quite different: resources_used.cput: 00:58:28 vs CPU Time Used: 04:52:23. For longer-running jobs, these fields might still be useful as a lower-bound approximation of the resources used.

There's Andrew Kiss's Run Summary tool, which summarises ACCESS-OM2 experiments (https://github.com/aekiss/run_summary). This can be run in a postscript job after the payu run PBS job completes, which means it can parse the exact values of resources used from the PBS job output files. @aidanheerdegen mentioned that tracking GB of output per model year would be useful to help users plan the resources needed to run experiments. I think that would require adding model driver code that can parse the model run time from outputs.
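
For the GB-per-model-year idea, a minimal sketch could be something like the following, assuming `run_years` is obtained from such model-specific driver code (nothing here is existing payu or run_summary API):

```python
from pathlib import Path


def output_gb_per_model_year(output_dir: str, run_years: float) -> float:
    """Total size of an output directory in GB, divided by the model years
    simulated (run_years must come from model-specific output parsing)."""
    total_bytes = sum(
        path.stat().st_size for path in Path(output_dir).rglob("*") if path.is_file()
    )
    return total_bytes / 1024**3 / run_years
```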

If tracking were added to the payu run method (i.e. inside the running job, after the model has run), a likely approach would be to add a function that extends the info dictionary dumped to `job.yaml` with metadata fields, additional payu fields such as the version, and model-specific fields for the model runtime. The info dict initially gets built here:

```python
info = get_job_info()
```
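
A sketch of what such a function might look like (`extend_job_info` is hypothetical; only `get_job_info()` is existing payu code here):

```python
import payu


def extend_job_info(info: dict, experiment) -> dict:
    """Extend the info dict built by get_job_info() before it is
    dumped to job.yaml (illustrative only)."""
    info["payu_version"] = payu.__version__
    # Assuming Experiment.metadata.read_file() returns the metadata as a dict
    info["metadata"] = experiment.metadata.read_file()
    # A model-specific driver hook for model runtime could slot in here, e.g.:
    # info["model_run_years"] = experiment.model.get_run_years()
    return info
```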

Another thing to think about is when the model is run multiple times in one PBS job (e.g. when `payu run -n N` and config.yaml allows `runspersub: N`). If there's tracking inside `Experiment.run()`, then resource usage would accumulate across each model run. But if the job ID is tracked, it would at least be possible to see that multiple model runs ran in one job. I thought about adding tracking to the final model run in the submission, right at the end of the payu run command:

```python
def runscript():
```

That way it would run the `qstat` command after archival and any user-defined run scripts, more accurately capturing resource usage. But I think we would still want to track the commit hashes of each model run.
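
For illustration, that final snapshot could be as simple as the following hypothetical helper, reusing the plain-text `qstat -f` call mentioned above:

```python
import os
import subprocess


def final_resource_snapshot():
    """Return raw `qstat -f` output for the current PBS job, or None.
    Intended to run once, after archival and any userscripts, so the
    resources_used.* values are as close to final as possible."""
    job_id = os.environ.get("PBS_JOBID")
    if job_id is None:
        return None
    result = subprocess.run(
        ["qstat", "-f", job_id], capture_output=True, text=True, check=False
    )
    return result.stdout if result.returncode == 0 else None
```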
