Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Completely update the lesson #14

Merged
merged 8 commits into from
May 2, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
56 changes: 56 additions & 0 deletions CITATION.cff
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: HPC Workflow Management with Snakemake
message: >-
If you use this software, please cite it using the
metadata from this file.
type: software
authors:
- given-names: Alan
family-names: O'Cais
email: [email protected]
affiliation: University of Barcelona
orcid: 'https://orcid.org/0000-0002-8254-8752'
repository-code: 'https://github.com/carpentries-incubator/hpc-workflows'
url: 'https://carpentries-incubator.github.io/hpc-workflows/'
abstract: >-
When using HPC resources, it's very common to need to
carry out the same set of tasks over a set of data
(commonly called a workflow or pipeline). In this lesson
we will make an experiment that takes an application which
runs in parallel and investigate it’s scalability. To do
that we will need to gather data, in this case that means
running the application multiple times with different
numbers of CPU cores and recording the execution time.
Once we’ve done that we need to create a visualisation of
the data to see how it compares against the ideal case.


We could do all of this manually, but there are useful
tools to help us manage data analysis pipelines like we
have in our experiment. In the context of this lesson,
we’ll learn about one of those: Snakemake.
keywords:
- HPC
- Carpentries
- Lesson
- Workflow
- Pipeline
license: CC-BY-4.0
references:
- authors:
- family-names: Collins
given-names: Daniel
title: "Getting Started with Snakemake"
type: software
repository-code: 'https://github.com/carpentries-incubator/workflows-snakemake/'
url: 'https://carpentries-incubator.github.io/workflows-snakemake/'
- authors:
- family-names: Booth
given-names: Tim
title: "Snakemake for Bioinformatics"
type: software
repository-code: 'https://github.com/carpentries-incubator/snakemake-novice-bioinformatics/'
url: 'https://carpentries-incubator.github.io/snakemake-novice-bioinformatics'
20 changes: 10 additions & 10 deletions config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -58,22 +58,22 @@ contact: '[email protected]'
# - another-learner.md

# Order of episodes in your lesson
episodes:
- amdahl_foundation.md
- snakemake_single.md
- snakemake_multiple.md
- snakemake_cluster.md
- snakemake_profiles.md
- amdahl_snakemake.md
episodes:
- 01-introduction.md
- 02-snakemake_on_the_cluster.md
- 03-placeholders.md
- 04-snakemake_and_mpi.md
- 05-chaining_rules.md
- 06-expansion.md

# Information for Learners
learners:
learners:

# Information for Instructors
instructors:
instructors:

# Learner Profiles
profiles:
profiles:

# Customisation ---------------------------------------------
#
Expand Down
176 changes: 176 additions & 0 deletions episodes/01-introduction.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,176 @@
---
title: "Running commands with Snakemake"
teaching: 30
exercises: 30
---

::: questions
- "How do I run a simple command with Snakemake?"
:::

:::objectives
- "Create a Snakemake recipe (a Snakefile)"
:::


## What is the workflow I'm interested in?

In this lesson we will make an experiment that takes an application which runs
in parallel and investigate it's scalability. To do that we will need to gather
data, in this case that means running the application multiple times with
different numbers of CPU cores and recording the execution time. Once we've
done that we need to create a visualisation of the data to see how it compares
against the ideal case.

From the visualisation we can then decide at what scale it
makes most sense to run the application at in production to maximise the use of
our CPU allocation on the system.

We could do all of this manually, but there are useful tools to help us manage
data analysis pipelines like we have in our experiment. Today we'll learn about
one of those: Snakemake.

In order to get started with Snakemake, let's begin by taking a simple command
and see how we can run that via Snakemake. Let's choose the command `hostname`
which prints out the name of the host where the command is executed:

```bash
[ocaisa@node1 ~]$ hostname
```
```output
node1.int.jetstream2.hpc-carpentry.org
```

That prints out the result but Snakemake relies on files to know the status of
your workflow, so let's redirect the output to a file:

```bash
[ocaisa@node1 ~]$ hostname > hostname_login.txt
```

## Making a Snakefile

Edit a new text file named `Snakefile`.

Contents of `Snakefile`:

```python
rule hostname_login:
output: "hostname_login.txt"
input:
shell:
"hostname > hostname_login.txt"
```

::: callout

## Key points about this file

1. The file is named `Snakefile` - with a capital `S` and no file extension.
1. Some lines are indented. Indents must be with space characters, not tabs. See
the setup section for how to make your text editor do this.
1. The rule definition starts with the keyword `rule` followed by the rule name,
then a colon.
1. We named the rule `hostname_login`. You may use letters, numbers or
underscores, but the rule name must begin with a letter and may not be a
keyword.
1. The keywords `input`, `output`, `shell` are all followed by a colon.
1. The file names and the shell command are all in `"quotes"`.
1. The output filename is given before the input filename. In fact, Snakemake
doesn't care what order they appear in but we give the output first
throughout this course. We'll see why soon.
1. In this use case there is no input file for the command so we leave this
blank.

:::

Back in the shell we'll run our new rule. At this point, if there were any
missing quotes, bad indents, etc. we may see an error.

```bash
$ snakemake -j1 -p hostname_login
```

::: callout

## `bash: snakemake: command not found...`

If your shell tells you that it cannot find the command `snakemake` then we need
to make the software available somehow. In our case, this means searching for
the module that we need to load:
```bash
module spider snakemake
```

```output
[ocaisa@node1 ~]$ module spider snakemake

--------------------------------------------------------------------------------------------------------
snakemake:
--------------------------------------------------------------------------------------------------------
Versions:
snakemake/8.2.1-foss-2023a
snakemake/8.2.1 (E)

Names marked by a trailing (E) are extensions provided by another module.


--------------------------------------------------------------------------------------------------------
For detailed information about a specific "snakemake" package (including how to load the modules) use the module's full name.
Note that names that have a trailing (E) are extensions provided by other modules.
For example:

$ module spider snakemake/8.2.1
--------------------------------------------------------------------------------------------------------

```

Now we want the module, so let's load that to make the package available

```bash
[ocaisa@node1 ~]$ module load snakemake
```

and then make sure we have the `snakemake` command available

```bash
[ocaisa@node1 ~]$ which snakemake
```
```output
/cvmfs/software.eessi.io/host_injections/2023.06/software/linux/x86_64/amd/zen3/software/snakemake/8.2.1-foss-2023a/bin/snakemake
```
:::

::: challenge
## Running Snakemake

Run `snakemake --help | less` to see the help for all available options.
What does the `-p` option in the `snakemake` command above do?

1. Protects existing output files
1. Prints the shell commands that are being run to the terminal
1. Tells Snakemake to only run one process at a time
1. Prompts the user for the correct input file

*Hint: you can search in the text by pressing `/`, and quit back to the shell
with `q`*

:::::: solution
(2) Prints the shell commands that are being run to the terminal

This is such a useful thing we don't know why it isn't the default! The `-j1`
option is what tells Snakemake to only run one process at a time, and we'll
stick with this for now as it makes things simpler. Answer 4 is a total
red-herring, as Snakemake never prompts interactively for user input.
::::::
:::

::: keypoints

- "Before running Snakemake you need to write a Snakefile"
- "A Snakefile is a text file which defines a list of rules"
- "Rules have inputs, outputs, and shell commands to be run"
- "You tell Snakemake what file to make and it will run the shell command
defined in the appropriate rule"

:::
Loading
Loading