Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add caching support for CWL #5187

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open

Conversation

stxue1
Copy link
Contributor

@stxue1 stxue1 commented Dec 17, 2024

Resolves #4298

Adds support for the cwltool equivalent --cachedir. This should make toil-cwl-runner be cache aware and use previous steps when possible and properly restart when there are new. This is different than the --restart flag. Jobs previously ran with --cachedir can rerun with --cachedir and not with --restart. --restart should be used to run failed jobs that should succeed. If the CWL needs editing, then caching should be used, although this will take significantly more storage space compared to the default behavior. Ideally, this should only be used for development purposes.

Changelog Entry

To be copied to the draft changelog by merger:

  • Add caching to toil-cwl-runner. Use --cachedir [dir] to enable and rerun previously cached jobs.

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passed tests, including the Gitlab tests, for the most recent commit in its branch.
  • Make sure the PR has been reviewed. If not, review it. If it has been reviewed and any requested changes seem to have been addressed, proceed.
  • Merge with the Github "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

@mr-c mr-c mentioned this pull request Dec 17, 2024
19 tasks
Copy link
Contributor

@mr-c mr-c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this @stxue1 !

I tested this PR using the CWL conformance tests; without --cachedir they still pass.

Alas, if I add --cachedir then 134 of the conformance tests fail (though 242 test DO pass!)

@stxue1
Copy link
Contributor Author

stxue1 commented Dec 18, 2024

Thank you for this @stxue1 !

I tested this PR using the CWL conformance tests; without --cachedir they still pass.

Alas, if I add --cachedir then 134 of the conformance tests fail (though 242 test DO pass!)

Thanks, I tested it on my machine and I think I tracked down the bug. I think it has to do with how we move around files from cwltool to Toil. We do some relocation of outputs from when we call cwltool internally which is messing some things up. It seems like setting a cache directory will set the current working directory of the file to be written to.

When we execute the Toil job's side, I think we set the destination directory as the jobstore, resulting in the outputs being relocated eventually without copying/symlinking.

Since caching depends on cwltool behavior, it is probably best to either copy the files into the jobstore (or figure out a way for the jobstore to be cache aware?). I have yet to find an entrypoint to control the relocateOutputs behavior on the Toil side though.

@mr-c
Copy link
Contributor

mr-c commented Dec 18, 2024

I have yet to find an entrypoint to control the relocateOutputs behavior on the Toil side though.

We can add one on the cwltool side, if needed.

@stxue1
Copy link
Contributor Author

stxue1 commented Dec 18, 2024

I have yet to find an entrypoint to control the relocateOutputs behavior on the Toil side though.

We can add one on the cwltool side, if needed.

I believe I was able to find an entrypoint

@mr-c
Copy link
Contributor

mr-c commented Dec 19, 2024

Alas, if I add --cachedir then 134 of the conformance tests fail (though 242 test DO pass!)

As of 68f88e0, 128 tests fail.

@stxue1
Copy link
Contributor Author

stxue1 commented Dec 19, 2024

Alas, if I add --cachedir then 134 of the conformance tests fail (though 242 test DO pass!)

As of 68f88e0, 128 tests fail.

@mr-c Was this ran with a clean cache directory? If the cache directory was populated with the previous broken version of toil-cwl-runner then the runner will look up files from previous cached runs, and they won't exist. I ran some of the tests on my machine and they seem okay so far. Also, how many tests in parallel were run? It might be possible it could be a synchronization issue.

@mr-c
Copy link
Contributor

mr-c commented Dec 19, 2024

Alas, if I add --cachedir then 134 of the conformance tests fail (though 242 test DO pass!)

As of 68f88e0, 128 tests fail.

@mr-c Was this ran with a clean cache directory? If the cache directory was populated with the previous broken version of toil-cwl-runner then the runner will look up files from previous cached runs, and they won't exist. I ran some of the tests on my machine and they seem okay so far. Also, how many tests in parallel were run? It might be possible it could be a synchronization issue.

Clean directory, -j8. I can try without parallel, and l will confirm that cwltool can run with caching in parallel

@mr-c
Copy link
Contributor

mr-c commented Dec 19, 2024

I ran some of the tests on my machine and they seem okay so far.

FYI: Here is my invocation run from the root of a checkout of https://github.com/common-workflow-language/cwl-v1.2

./run_test.sh RUNNER=toil-cwl-runner EXTRA="--clean=always --relax-path-checks --podman --cachedir=/home/michael/cwl-v1.2/toil-cache"

Running not in parallel, I still get 127 failures.

with RUNNER=cwltool I am able to run in parallel (with 6 failures, though I think they are specific to my development environment); And re-running using the previous cache directory repeats the outcome.

@stxue1
Copy link
Contributor Author

stxue1 commented Dec 20, 2024

I ran some of the tests on my machine and they seem okay so far.

FYI: Here is my invocation run from the root of a checkout of https://github.com/common-workflow-language/cwl-v1.2

./run_test.sh RUNNER=toil-cwl-runner EXTRA="--clean=always --relax-path-checks --podman --cachedir=/home/michael/cwl-v1.2/toil-cache"

Running not in parallel, I still get 127 failures.

with RUNNER=cwltool I am able to run in parallel (with 6 failures, though I think they are specific to my development environment); And re-running using the previous cache directory repeats the outcome.

Oddly enough, running the same command on my machine (but without --podman) only gives me 2 failing tests:

$ ./run_test.sh RUNNER=toil-cwl-runner EXTRA="--clean=always --relax-path-checks --cachedir=cache"
...

Test [232/378] cwl_requirements_override_static: Test conflicting requirements in input document via EnvVarRequirement
Test 232 failed: /home/heaucques/Documents/toil/venv3.12/bin/toil-cwl-runner --clean=always --relax-path-checks --cachedir=cache --outdir=/tmp/tmpzie43tg5 --quiet tests/env-tool3.cwl tests/env-job4.yaml
Test conflicting requirements in input document via EnvVarRequirement
Compare failure expected: {
    "out": {
        "checksum": "sha1$715e62184492851512a020c36ab7118eca114a59",
        "class": "File",
        "location": "out",
        "size": 23
    }
}
got: {
    "out": {
        "basename": "out",
        "checksum": "sha1$b3ec4ed1749c207e52b3a6d08c59f31d83bff519",
        "class": "File",
        "location": "file:///tmp/tmpzie43tg5/out",
        "nameext": "",
        "nameroot": "out",
        "path": "/tmp/tmpzie43tg5/out",
        "size": 15
    }
}
caused by: failed comparison for key 'out': expected: {
    "checksum": "sha1$715e62184492851512a020c36ab7118eca114a59",
    "class": "File",
    "location": "out",
    "size": 23
}
got: {
    "basename": "out",
    "checksum": "sha1$b3ec4ed1749c207e52b3a6d08c59f31d83bff519",
    "class": "File",
    "location": "file:///tmp/tmpzie43tg5/out",
    "nameext": "",
    "nameroot": "out",
    "path": "/tmp/tmpzie43tg5/out",
    "size": 15
}
caused by: Output file checksums do not match: actual 'sha1$b3ec4ed1749c207e52b3a6d08c59f31d83bff519' is not equal to expected 'sha1$715e62184492851512a020c36ab7118eca114a59'

...

Test [337/378] iwd-container-entryname3: Test input mount locations when container is a hint (should fail)
Test 337 failed: /home/heaucques/Documents/toil/venv3.12/bin/toil-cwl-runner --clean=always --relax-path-checks --cachedir=cache --outdir=/tmp/tmpfeo35a5x --quiet tests/iwd/iwd-container-entryname3.cwl tests/loadContents/input.yml
Test input mount locations when container is a hint (should fail)
Returned zero but it should be non-zero

I'll run it again with podman and see if anything changes.

@stxue1
Copy link
Contributor Author

stxue1 commented Jan 15, 2025

I think I accidentally left a local change to cwltool without realizing it and forgot to open the respective cwltool PR until later. I've opened the PR now and updated cwltool to the latest version with the update, ideally the conformance tests will pass now, but iirc there may be 2 tests that still fail.

Copy link
Member

@adamnovak adamnovak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it should work! Much simpler than the WDL version!

@mr-c mr-c force-pushed the issues/4298-cwl-job-restart-edit branch from ae69085 to f6a6e46 Compare January 17, 2025 09:48
Copy link
Contributor

@mr-c mr-c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much improved; and thanks again for the fix sent to cwltool!

Two CWL v1.2 conformance tests fail locally for me when using a shared --cachedir (that was initially empty):

Test 232 failed: /home/michael/src/toil2/env3.12/bin/toil-cwl-runner --clean=always --relax-path-checks --podman --cachedir=/home/michael/cwl-v1.2/toil-cache --outdir=/tmp/tmpzyr_9q07 --quiet tests/env-tool3.cwl tests/env-job4.yaml
Test conflicting requirements in input document via EnvVarRequirement
Compare failure expected: {
    "out": {
        "checksum": "sha1$715e62184492851512a020c36ab7118eca114a59",
        "class": "File",
        "location": "out",
        "size": 23
    }
}
got: {
    "out": {
        "basename": "out",
        "checksum": "sha1$b3ec4ed1749c207e52b3a6d08c59f31d83bff519",
        "class": "File",
        "location": "file:///tmp/tmpzyr_9q07/out",
        "nameext": "",
        "nameroot": "out",
        "path": "/tmp/tmpzyr_9q07/out",
        "size": 15
    }
}
caused by: failed comparison for key 'out': expected: {
    "checksum": "sha1$715e62184492851512a020c36ab7118eca114a59",
    "class": "File",
    "location": "out",
    "size": 23
}
got: {
    "basename": "out",
    "checksum": "sha1$b3ec4ed1749c207e52b3a6d08c59f31d83bff519",
    "class": "File",
    "location": "file:///tmp/tmpzyr_9q07/out",
    "nameext": "",
    "nameroot": "out",
    "path": "/tmp/tmpzyr_9q07/out",
    "size": 15
}
caused by: Output file checksums do not match: actual 'sha1$b3ec4ed1749c207e52b3a6d08c59f31d83bff519' is not equal to expected 'sha1$715e62184492851512a020c36ab7118eca114a59'

I confirmed this manually, correct output is produced without --cachedir; incorrect output is "hello test env" instead of "conflict_user_override"

Test 337 failed: /home/michael/src/toil2/env3.12/bin/toil-cwl-runner --clean=always --relax-path-checks --podman --cachedir=/home/michael/cwl-v1.2/toil-cache --outdir=/tmp/tmptwczsgzv --quiet tests/iwd/iwd-container-entryname3.cwl tests/loadContents/input.yml
Test input mount locations when container is a hint (should fail)
Returned zero but it should be non-zero

I also confirmed this manually, this test correctly fails without --cachedir.

For completeness, I re-ran the conformance tests again; re-using the shared cache directory from the first run. The same tests above still fail; but 8 other tests also fail:

Test 229 failed: /home/michael/src/toil2/env3.12/bin/toil-cwl-runner --clean=always --relax-path-checks --podman --cachedir=/home/michael/cwl-v1.2/toil-cache --outdir=/tmp/tmppphahmca --quiet tests/stage-array-dirs.cwl tests/stage-array-dirs-job.yml
Test array of directories InitialWorkDirRequirement
Compare failure expected: {
    "output": [
        {
            "checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
            "class": "File",
            "location": "a",
            "size": 0
        },
        {
            "checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
            "class": "File",
            "location": "B",
            "size": 0
        }
    ]
}
got: {
    "output": []
}
caused by: failed comparison for key 'output': expected: [
    {
        "checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
        "class": "File",
        "location": "a",
        "size": 0
    },
    {
        "checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
        "class": "File",
        "location": "B",
        "size": 0
    }
]
got: []
caused by: lengths don't match
Test 253 failed: /home/michael/src/toil2/env3.12/bin/toil-cwl-runner --clean=always --relax-path-checks --podman --cachedir=/home/michael/cwl-v1.2/toil-cache --outdir=/tmp/tmpa9fmsg7e --quiet tests/exitcode.cwl tests/empty.json
Can access exit code in outputEval
Compare failure expected: {
    "code": 7
}
got: {
    "code": true
}
caused by: failed comparison for key 'code': expected: 7
got: true
Test 333 failed: /home/michael/src/toil2/env3.12/bin/toil-cwl-runner --clean=always --relax-path-checks --podman --cachedir=/home/michael/cwl-v1.2/toil-cache --outdir=/tmp/tmp7jdbyeai --quiet tests/iwd/iwd-fileobjs1.cwl
Test File and Directory object in listing
Returned non-zero
Test 334 failed: /home/michael/src/toil2/env3.12/bin/toil-cwl-runner --clean=always --relax-path-checks --podman --cachedir=/home/michael/cwl-v1.2/toil-cache --outdir=/tmp/tmpvii3c2sz --quiet tests/iwd/iwd-fileobjs2.cwl
Test File and Directory object in listing
Returned non-zero
Test 340 failed: /home/michael/src/toil2/env3.12/bin/toil-cwl-runner --clean=always --relax-path-checks --podman --cachedir=/home/michael/cwl-v1.2/toil-cache --outdir=/tmp/tmp8r91ctqm --quiet tests/iwd/iwd-subdir-wf.cwl tests/iwd/iwd-subdir-job.yml
Test emitting a subdirectory from initial workdir
Returned non-zero
Test 371 failed: /home/michael/src/toil2/env3.12/bin/toil-cwl-runner --clean=always --relax-path-checks --podman --cachedir=/home/michael/cwl-v1.2/toil-cache --outdir=/tmp/tmpjrsjkq5f --quiet tests/capture-files-and-dirs.cwl tests/dir-job.yml
Test that both files and directories are captured by glob evaluation when type is [Directory, File]
Compare failure expected: {
    "result": [
        {
            "basename": "a",
            "checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
            "class": "File",
            "size": 0
        },
        {
            "basename": "b",
            "checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
            "class": "File",
            "size": 0
        },
        {
            "basename": "c",
            "class": "Directory",
            "listing": [
                {
                    "basename": "d",
                    "checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
                    "class": "File",
                    "size": 0
                }
            ]
        }
    ]
}
got: {
    "result": []
}
caused by: failed comparison for key 'result': expected: [
    {
        "basename": "a",
        "checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
        "class": "File",
        "size": 0
    },
    {
        "basename": "b",
        "checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
        "class": "File",
        "size": 0
    },
    {
        "basename": "c",
        "class": "Directory",
        "listing": [
            {
                "basename": "d",
                "checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
                "class": "File",
                "size": 0
            }
        ]
    }
]
got: []
caused by: lengths don't match

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Job restart does not take in any new edits made to cwl
3 participants