This example demonstrates running a multi-stage workflow on Google Cloud Platform.
- The workflow is launched with the Google Genomics Pipelines API.
- The workflow is defined using the Broad Institute's Workflow Definition Language (WDL).
- The workflow stages are orchestrated by the Broad Institute's Cromwell.
When submitted using the Pipelines API, the workflow runs on multiple Google Compute Engine virtual machines. First, a master node is created for Cromwell; Cromwell then submits each stage of the workflow as one or more separate pipelines.
Execution of a running Pipeline proceeds as:

1. Create a Compute Engine virtual machine.
2. On the VM, in a Docker container, execute wdl_runner.py:
   a. Run Cromwell (server)
   b. Submit workflow, inputs, and options to the Cromwell server
      (see the sketch after this list)
   c. Poll for completion as Cromwell executes:
      1) Call pipelines.run() to execute call 1
      2) Poll for completion of call 1
      3) Call pipelines.run() to execute call 2
      4) Poll for completion of call 2
      <etc. until all WDL "calls" complete>
   d. Copy workflow metadata to the output path
   e. Copy workflow outputs to the output path
3. Destroy the Compute Engine virtual machine.
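For a concrete sense of step 2b: a Cromwell server accepts workflow submissions over its REST API. The commands below are an illustrative sketch, not what wdl_runner.py literally runs (it submits from Python); the endpoint and the wdlSource field name match the Cromwell 26 era.

# Illustrative only: submit a workflow to a locally running Cromwell server.
# Cromwell 26 used "wdlSource" as the field name for the workflow definition.
curl -X POST http://localhost:8000/api/workflows/v1 \
  -F wdlSource=@workflow.wdl \
  -F workflowInputs=@inputs.json \
  -F workflowOptions=@options.json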
Code packaging for the Pipelines API is done through Docker images. The instructions provided here explain how to create your own Docker image, although a copy of this Docker image has already been built and made available by the Broad Institute.
The code in the wdl_runner Docker image includes:
- OpenJDK 8 runtime environment (JRE)
- Python 2.7 interpreter
- Cromwell release 26
- Python and shell scripts from this repository
Take a look at the Dockerfile for full details.
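If you only want to inspect the published image rather than build your own, you can pull it and look around. The commands below are a minimal sketch; substitute the actual <datestamp> tag published by the Broad Institute.

# Pull the prebuilt image and open a shell inside it to inspect the contents.
docker pull gcr.io/broad-dsde-outreach/wdl_runner:<datestamp>
docker run --rm -it --entrypoint /bin/bash gcr.io/broad-dsde-outreach/wdl_runner:<datestamp>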
To run this example, first complete the following setup:

1. Clone or fork this repository.
2. Enable the Genomics, Cloud Storage, and Compute Engine APIs on a new or
   existing Google Cloud Project using the Cloud Console.
3. Follow the Google Genomics getting started instructions to install and
   authorize the Google Cloud SDK.
4. Follow the Cloud Storage instructions for Creating Storage Buckets to
   create a bucket for workflow output and logging.
5. If you plan to create your own Docker images, install Docker. If you are
   going to use the published version of the Docker image, skip this step.
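Once the steps above are done, a quick sanity check can confirm the SDK is authorized and pointed at the intended project. This is optional, and the commands are standard gcloud commands, not specific to this example:

# Confirm credentials and the active project.
gcloud auth list
gcloud config list project

# Only needed if you will build your own image.
docker --version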
Every Google Cloud project provides a private repository, called the Google Container Registry, for saving and serving Docker images.
The following instructions show how to stage a Docker image in your project's Container Registry containing all the code needed to orchestrate your workflow.
git clone https://github.com/googlegenomics/pipelines-api-examples.git
cd pipelines-api-examples/wdl_runner/
docker build -t ${USER}/wdl_runner ./cromwell_launcher
In this example, we tag the image and push it to the Google Container Registry with the following commands:
docker tag ${USER}/wdl_runner gcr.io/YOUR-PROJECT-ID/wdl_runner
gcloud docker -- push gcr.io/YOUR-PROJECT-ID/wdl_runner
- Replace YOUR-PROJECT-ID with your project ID.
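To confirm the push succeeded, you can list the image's tags in your registry. This check assumes a Cloud SDK recent enough to include the container images commands:

# List tags for the image you just pushed.
gcloud container images list-tags gcr.io/YOUR-PROJECT-ID/wdl_runner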
The file ./workflows/wdl_pipeline.yaml defines a pipeline for running WDL workflows. By default, it uses the Docker image built by the Broad Institute from this repository:
docker:
  imageName: gcr.io/broad-dsde-outreach/wdl_runner:<datestamp>
If you have built your own Docker image, then change the imageName:
docker:
  imageName: gcr.io/YOUR-PROJECT-ID/wdl_runner
- Replace YOUR-PROJECT-ID with your project ID.
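For orientation, a v1alpha2 pipeline definition generally has the shape sketched below. This is an abridged, illustrative outline, not the exact contents of wdl_pipeline.yaml; consult the file itself for the real parameter list and command line.

name: WDL Runner
description: Run a WDL-defined workflow using Cromwell

inputParameters:     # bound at run time via --inputs / --inputs-from-file
- name: WDL
- name: WORKFLOW_INPUTS
- name: WORKFLOW_OPTIONS
- name: WORKSPACE
- name: OUTPUTS

docker:
  imageName: gcr.io/YOUR-PROJECT-ID/wdl_runner
  cmd: ...           # launches wdl_runner.py inside the container

resources:
  minimumRamGb: ...

With the pipeline file in place, launch the workflow: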
gcloud alpha genomics pipelines run \
--pipeline-file workflows/wdl_pipeline.yaml \
--zones us-central1-f \
--logging gs://YOUR-BUCKET/pipelines-api-examples/wdl_runner/logging \
--inputs-from-file WDL=workflows/vcf_chr_count/vcf_chr_count.wdl \
--inputs-from-file WORKFLOW_INPUTS=workflows/vcf_chr_count/vcf_chr_count.sample.inputs.json \
--inputs-from-file WORKFLOW_OPTIONS=workflows/common/basic.jes.us.options.json \
--inputs WORKSPACE=gs://YOUR-BUCKET/pipelines-api-examples/wdl_runner/workspace \
--inputs OUTPUTS=gs://YOUR-BUCKET/pipelines-api-examples/wdl_runner/output
- Replace YOUR-BUCKET with a bucket in your project.
The output will be an operation ID for the Pipeline.
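You can check on the operation directly with gcloud; done: true in the response indicates completion. The command below is from the alpha genomics surface of that era and may differ in newer SDKs:

# Describe the operation returned by the pipelines run command.
gcloud alpha genomics operations describe YOUR-NEW-OPERATION-ID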
This GitHub repo includes a shell script, ./tools/monitor_wdl_pipeline.sh, for monitoring the status of a pipeline launched using wdl_pipeline.yaml.
$ ./tools/monitor_wdl_pipeline.sh YOUR-NEW-OPERATION-ID
Logging: gs://YOUR-BUCKET/pipelines-api-examples/wdl_runner/logging
Workspace: gs://YOUR-BUCKET/pipelines-api-examples/wdl_runner/workspace
Outputs: gs://YOUR-BUCKET/pipelines-api-examples/wdl_runner/output
2016-09-01 09:37:44: operation not complete
No operations logs found.
There are 0 output files
Sleeping 60 seconds
...
2016-09-01 09:40:53: operation not complete
Calls started but not complete:
call-vcf_split
Sleeping 60 seconds
...
2016-09-01 09:44:02: operation not complete
Operation logs found:
YOUR-NEW-OPERATION-ID.log
YOUR-NEW-OPERATION-ID-stderr.log
YOUR-NEW-OPERATION-ID-stdout.log
Calls (including shards) completed: 1
Calls started but not complete:
call-vcf_record_count/shard-0
call-vcf_record_count/shard-1
call-vcf_record_count/shard-2
Sleeping 60 seconds
...
2016-09-01 09:54:31: operation not complete
Calls (including shards) completed: 4
No calls currently in progress.
(Transitioning to next stage or copying final output).
Sleeping 60 seconds
2016-09-01 09:55:34: operation not complete
Calls (including shards) completed: 4
Calls started but not complete:
call-gather
Sleeping 60 seconds
2016-09-01 09:56:37: operation not complete
Calls (including shards) completed: 5
No calls currently in progress.
(Transitioning to next stage or copying final output).
There are 1 output files
Sleeping 60 seconds
2016-09-01 09:57:40: operation complete
Completed operation status information
done: true
metadata:
events:
- description: start
startTime: '2016-09-01T16:38:18.215458712Z'
- description: pulling-image
startTime: '2016-09-01T16:38:18.215809129Z'
- description: localizing-files
startTime: '2016-09-01T16:38:42.613937060Z'
- description: running-docker
startTime: '2016-09-01T16:38:42.613978300Z'
- description: delocalizing-files
startTime: '2016-09-01T16:56:42.144127783Z'
- description: ok
startTime: '2016-09-01T16:56:43.725128719Z'
name: operations/YOUR-NEW-OPERATION-ID
gs://YOUR-BUCKET/pipelines-api-examples/wdl_runner/output/output.txt
gs://YOUR-BUCKET/pipelines-api-examples/wdl_runner/output/wdl_run_metadata.json
Preemptions:
None
Check the operation output for a top-level errors field. If none, then the operation should have finished successfully.
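One way to surface just the completion status and any error is gcloud's output projection (--format is standard gcloud; the error field appears only on failure):

# Show only the done flag and any top-level error.
gcloud alpha genomics operations describe YOUR-NEW-OPERATION-ID \
  --format='yaml(done, error)'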
$ gsutil ls -l gs://YOUR-BUCKET/pipelines-api-examples/wdl_runner/output
46 2016-09-01T16:56:40Z gs://YOUR-BUCKET/pipelines-api-examples/wdl_runner/output/output.txt
15069 2016-09-01T16:56:37Z gs://YOUR-BUCKET/pipelines-api-examples/wdl_runner/output/wdl_run_metadata.json
TOTAL: 2 objects, 15115 bytes (14.76 KiB)
- Replace YOUR-BUCKET with a bucket in your project.
$ gsutil cat gs://YOUR-BUCKET/pipelines-api-examples/wdl_runner/output/output.txt
chrM.vcf 197
chrX.vcf 4598814
chrY.vcf 653100
- Replace YOUR-BUCKET with a bucket in your project.
When Cromwell runs, per-stage output and other intermediate files are
written to the WORKSPACE path you specified in the gcloud
command above.
To remove these files, run:
gsutil -m rm gs://YOUR-BUCKET/pipelines-api-examples/wdl_runner/workspace/**
- Replace YOUR-BUCKET with a bucket in your project.
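If you also want to clear the operation logs written under the --logging path (optional; the path mirrors the gcloud command above):

# Remove the operation logs as well.
gsutil -m rm gs://YOUR-BUCKET/pipelines-api-examples/wdl_runner/logging/**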