Add caching support for CWL #5187
base: master
Thank you for this, @stxue1!
I tested this PR using the CWL conformance tests; without --cachedir they still pass.
Alas, if I add --cachedir then 134 of the conformance tests fail (though 242 tests DO pass!).
Thanks, I tested it on my machine and I think I tracked down the bug. It appears to come from how we move files from cwltool to Toil: we relocate outputs after calling cwltool internally, and that is messing some things up. It seems that setting a cache directory changes the working directory that the files are written to. On the Toil job's side, I think we set the destination directory to the jobstore, so the outputs are eventually relocated without being copied or symlinked. Since caching depends on cwltool behavior, it is probably best to either copy the files into the jobstore or figure out a way for the jobstore to be cache aware. I have yet to find an entrypoint to control the
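The failure mode described in this comment can be illustrated with a minimal sketch. It is not Toil or cwltool code: the directory names and the `export_output` helper are hypothetical, and the only assumption is that a cached output lives inside the cache directory and is later relocated to a destination directory. Moving the file leaves the cache entry empty, while copying keeps it reusable.

```python
import shutil
import tempfile
from pathlib import Path


def export_output(cached_file: Path, dest_dir: Path, copy: bool) -> Path:
    """Hypothetical helper: relocate a cached output into dest_dir."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / cached_file.name
    if copy:
        # Copying leaves the original in the cache for later runs.
        shutil.copy2(cached_file, target)
    else:
        # Moving removes the original, so the cache entry becomes stale.
        shutil.move(str(cached_file), str(target))
    return target


def cache_survives(copy: bool) -> bool:
    """Return True if the cached file still exists after the export."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache_entry = root / "cache" / "step-key"  # illustrative layout
        cache_entry.mkdir(parents=True)
        cached_file = cache_entry / "out"
        cached_file.write_text("result\n")
        export_output(cached_file, root / "jobstore", copy=copy)
        return cached_file.exists()


if __name__ == "__main__":
    print("move keeps cache entry:", cache_survives(copy=False))
    print("copy keeps cache entry:", cache_survives(copy=True))
```

Copying costs extra storage, which is in line with the trade-off noted in the PR description; a symlink would be a cheaper alternative where the destination supports it.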
We can add one on the cwltool side, if needed.
I believe I was able to find an entrypoint
As of 68f88e0, 128 tests fail.
@mr-c Was this run with a clean cache directory? If the cache directory was populated by the previous, broken version of toil-cwl-runner, the runner will look up files from previously cached runs, and they won't exist. I ran some of the tests on my machine and they seem okay so far. Also, how many tests were run in parallel? It could be a synchronization issue.
Clean directory,
FYI: Here is my invocation, run from the root of a checkout of https://github.com/common-workflow-language/cwl-v1.2
Running not in parallel, I still get 127 failures. with
Oddly enough, running the same command on my machine (but without
I'll run it again with podman and see if anything changes.
832c1fe to ba0f0ea
I think I accidentally left a local change to cwltool in place without realizing it, and forgot to open the corresponding cwltool PR until later. I've now opened that PR and updated cwltool to the latest version, which includes the change. Ideally the conformance tests will pass now, but IIRC there may be 2 tests that still fail.
This looks like it should work! Much simpler than the WDL version!
ae69085 to f6a6e46
Much improved; and thanks again for the fix sent to cwltool!
Two CWL v1.2 conformance tests fail locally for me when using a shared --cachedir (that was initially empty):
Test 232 failed: /home/michael/src/toil2/env3.12/bin/toil-cwl-runner --clean=always --relax-path-checks --podman --cachedir=/home/michael/cwl-v1.2/toil-cache --outdir=/tmp/tmpzyr_9q07 --quiet tests/env-tool3.cwl tests/env-job4.yaml
Test conflicting requirements in input document via EnvVarRequirement
Compare failure expected: {
"out": {
"checksum": "sha1$715e62184492851512a020c36ab7118eca114a59",
"class": "File",
"location": "out",
"size": 23
}
}
got: {
"out": {
"basename": "out",
"checksum": "sha1$b3ec4ed1749c207e52b3a6d08c59f31d83bff519",
"class": "File",
"location": "file:///tmp/tmpzyr_9q07/out",
"nameext": "",
"nameroot": "out",
"path": "/tmp/tmpzyr_9q07/out",
"size": 15
}
}
caused by: failed comparison for key 'out': expected: {
"checksum": "sha1$715e62184492851512a020c36ab7118eca114a59",
"class": "File",
"location": "out",
"size": 23
}
got: {
"basename": "out",
"checksum": "sha1$b3ec4ed1749c207e52b3a6d08c59f31d83bff519",
"class": "File",
"location": "file:///tmp/tmpzyr_9q07/out",
"nameext": "",
"nameroot": "out",
"path": "/tmp/tmpzyr_9q07/out",
"size": 15
}
caused by: Output file checksums do not match: actual 'sha1$b3ec4ed1749c207e52b3a6d08c59f31d83bff519' is not equal to expected 'sha1$715e62184492851512a020c36ab7118eca114a59'
I confirmed this manually: the correct output is produced without --cachedir; the incorrect output is "hello test env" instead of "conflict_user_override".
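As a quick sanity check on the two outputs named above, the sketch below computes the CWL-style `size` and `sha1$` checksum for each candidate string. It assumes each output ends with a single trailing newline, which is not stated in the test log; under that assumption the sizes come out to the 23 and 15 bytes reported. The `describe` helper is illustrative only.

```python
import hashlib


def describe(text: str) -> dict:
    """Return the byte size and CWL-style checksum of text encoded as UTF-8."""
    data = text.encode("utf-8")
    return {
        "size": len(data),
        "checksum": "sha1$" + hashlib.sha1(data).hexdigest(),
    }


if __name__ == "__main__":
    # Assumption: both outputs end with exactly one newline.
    candidates = [
        ("expected", "conflict_user_override\n"),
        ("got", "hello test env\n"),
    ]
    for label, text in candidates:
        print(label, describe(text))
```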
Test 337 failed: /home/michael/src/toil2/env3.12/bin/toil-cwl-runner --clean=always --relax-path-checks --podman --cachedir=/home/michael/cwl-v1.2/toil-cache --outdir=/tmp/tmptwczsgzv --quiet tests/iwd/iwd-container-entryname3.cwl tests/loadContents/input.yml
Test input mount locations when container is a hint (should fail)
Returned zero but it should be non-zero
I also confirmed this manually: this test correctly fails without --cachedir.
For completeness, I re-ran the conformance tests, reusing the shared cache directory from the first run. The same tests above still fail, but 8 other tests also fail:
Test 229 failed: /home/michael/src/toil2/env3.12/bin/toil-cwl-runner --clean=always --relax-path-checks --podman --cachedir=/home/michael/cwl-v1.2/toil-cache --outdir=/tmp/tmppphahmca --quiet tests/stage-array-dirs.cwl tests/stage-array-dirs-job.yml
Test array of directories InitialWorkDirRequirement
Compare failure expected: {
"output": [
{
"checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
"class": "File",
"location": "a",
"size": 0
},
{
"checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
"class": "File",
"location": "B",
"size": 0
}
]
}
got: {
"output": []
}
caused by: failed comparison for key 'output': expected: [
{
"checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
"class": "File",
"location": "a",
"size": 0
},
{
"checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
"class": "File",
"location": "B",
"size": 0
}
]
got: []
caused by: lengths don't match
Test 253 failed: /home/michael/src/toil2/env3.12/bin/toil-cwl-runner --clean=always --relax-path-checks --podman --cachedir=/home/michael/cwl-v1.2/toil-cache --outdir=/tmp/tmpa9fmsg7e --quiet tests/exitcode.cwl tests/empty.json
Can access exit code in outputEval
Compare failure expected: {
"code": 7
}
got: {
"code": true
}
caused by: failed comparison for key 'code': expected: 7
got: true
Test 333 failed: /home/michael/src/toil2/env3.12/bin/toil-cwl-runner --clean=always --relax-path-checks --podman --cachedir=/home/michael/cwl-v1.2/toil-cache --outdir=/tmp/tmp7jdbyeai --quiet tests/iwd/iwd-fileobjs1.cwl
Test File and Directory object in listing
Returned non-zero
Test 334 failed: /home/michael/src/toil2/env3.12/bin/toil-cwl-runner --clean=always --relax-path-checks --podman --cachedir=/home/michael/cwl-v1.2/toil-cache --outdir=/tmp/tmpvii3c2sz --quiet tests/iwd/iwd-fileobjs2.cwl
Test File and Directory object in listing
Returned non-zero
Test 340 failed: /home/michael/src/toil2/env3.12/bin/toil-cwl-runner --clean=always --relax-path-checks --podman --cachedir=/home/michael/cwl-v1.2/toil-cache --outdir=/tmp/tmp8r91ctqm --quiet tests/iwd/iwd-subdir-wf.cwl tests/iwd/iwd-subdir-job.yml
Test emitting a subdirectory from initial workdir
Returned non-zero
Test 371 failed: /home/michael/src/toil2/env3.12/bin/toil-cwl-runner --clean=always --relax-path-checks --podman --cachedir=/home/michael/cwl-v1.2/toil-cache --outdir=/tmp/tmpjrsjkq5f --quiet tests/capture-files-and-dirs.cwl tests/dir-job.yml
Test that both files and directories are captured by glob evaluation when type is [Directory, File]
Compare failure expected: {
"result": [
{
"basename": "a",
"checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
"class": "File",
"size": 0
},
{
"basename": "b",
"checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
"class": "File",
"size": 0
},
{
"basename": "c",
"class": "Directory",
"listing": [
{
"basename": "d",
"checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
"class": "File",
"size": 0
}
]
}
]
}
got: {
"result": []
}
caused by: failed comparison for key 'result': expected: [
{
"basename": "a",
"checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
"class": "File",
"size": 0
},
{
"basename": "b",
"checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
"class": "File",
"size": 0
},
{
"basename": "c",
"class": "Directory",
"listing": [
{
"basename": "d",
"checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
"class": "File",
"size": 0
}
]
}
]
got: []
caused by: lengths don't match
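One detail worth noting in the expected outputs for tests 229 and 371: the checksum `sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709` is the SHA-1 digest of zero bytes, which agrees with the `"size": 0` entries. The tests expect empty files to be collected, and the cached runs return no files at all. A small check of the digest:

```python
import hashlib

# SHA-1 of empty content, formatted the way CWL reports File checksums.
EMPTY_SHA1 = "sha1$" + hashlib.sha1(b"").hexdigest()

if __name__ == "__main__":
    print(EMPTY_SHA1)
```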
Resolves #4298
Adds support for the cwltool equivalent --cachedir. This should make toil-cwl-runner cache aware: it reuses previously completed steps when possible and properly runs any steps that are new. This is different from the --restart flag. Jobs previously run with --cachedir can be rerun with --cachedir, but not with --restart. --restart should be used to rerun failed jobs that should succeed. If the CWL needs editing, then caching should be used, although this will take significantly more storage space than the default behavior. Ideally, this should only be used for development purposes.
Changelog Entry
To be copied to the draft changelog by merger:
toil-cwl-runner. Use --cachedir [dir] to enable and rerun previously cached jobs.
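The idea behind step caching in the description (reuse a step when nothing about it changed, rerun it when the CWL was edited) can be sketched as follows. This is a generic illustration under stated assumptions, not how cwltool or Toil compute cache keys: `cache_key`, `run_cached`, and `echo_step` are hypothetical names, and an in-memory dict stands in for the cache directory.

```python
import hashlib
import json
from typing import Callable, Dict, Tuple


def cache_key(step: dict, inputs: dict) -> str:
    """Hypothetical key: hash of the step definition plus its inputs."""
    payload = json.dumps({"step": step, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_cached(
    step: dict,
    inputs: dict,
    run: Callable[[dict, dict], str],
    cache: Dict[str, str],
) -> Tuple[str, bool]:
    """Return (result, was_cache_hit); execute the step only on a miss."""
    key = cache_key(step, inputs)
    if key in cache:
        return cache[key], True
    result = run(step, inputs)
    cache[key] = result
    return result, False


def echo_step(step: dict, inputs: dict) -> str:
    """Stand-in for executing a command-line tool."""
    return f"{step['baseCommand']} {inputs['msg']}"


if __name__ == "__main__":
    cache: Dict[str, str] = {}
    inputs = {"msg": "hi"}
    original = {"baseCommand": "echo"}
    edited = {"baseCommand": "printf"}  # simulates editing the CWL
    for label, step in [("first", original), ("repeat", original), ("edited", edited)]:
        _, hit = run_cached(step, inputs, echo_step, cache)
        print(label, "cache hit" if hit else "executed")
```

Because the key covers the step definition, an edited step misses the cache and runs again while unchanged steps are reused, which is the development workflow the description recommends --cachedir for.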