* Added documentation on Snakemake. Still WIP; first draft
  was published Jan 2023 on ODU HPC wiki.

  Sigs:

    # -rw-r--r-- 1 wirawan wirawan 3569 2023-01-24 12:13 20230123.snakemake-software.md
    # -rw-r--r-- 1 wirawan wirawan 7292 2023-01-24 16:06 20230124.snakemake-intro.md
Author: Wirawan Purwanto
parent c0ead53c0c
commit b601a7e249

  workflow/20230123.snakemake-software.md | 159 lines added
  workflow/20230124.snakemake-intro.md    | 248 lines added

@@ -0,0 +1,159 @@ workflow/20230123.snakemake-software.md
# Snakemake: A General Introduction
Main website:
* https://snakemake.github.io/

Snakemake documentation:
* https://snakemake.readthedocs.io/en/stable/

Tutorials:
* Official tutorial (complete steps):
  https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html#tutorial
* Basic tutorial:
  https://snakemake.readthedocs.io/en/stable/tutorial/basics.html#tutorial-basics
## Standalone

This is the easiest way to run Snakemake: you will be allocated a single node,
and all of its resources are allocated to you. There is no cross-node
parallelization in this mode. You can use the job script below:
```
#!/bin/bash
#SBATCH --exclusive
enable_lmod
module load container_env snakemake
crun snakemake
```
## Break rules into individual Slurm jobs

This is my recommended way of running this package: it utilizes resources most
effectively. It is a little more complicated, however, so I have written a
wrapper script to do most of the legwork. You can launch it like this:
```
#!/bin/bash
#SBATCH -c 2
enable_lmod
module load container_env snakemake
snakemake.helper -j 10
```
Please note that this job should only take 1 or 2 cores to run: it is a master
job that does not do the real work; additional jobs will be launched by it.

Instead of running `crun snakemake` directly, you run `snakemake.helper`; it is
the helper script mentioned above, and it will launch the jobs on the cluster
for you.

For cluster resources, I only enforce `threads` in the Snakemake rules. Any
other resource (memory, disk, ...) might or might not be enforced by Snakemake
(I am not sure); at any rate, I will not enforce it at the scheduler level.
For example:
```
rule map_reads:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "results/mapped/{sample}.bam"
    threads: 2
    shell:
        "bwa mem {input} | samtools view -b - > {output}"
```
When `threads` is set in a rule, I will launch the corresponding Slurm job with
a matching `--cpus-per-task` setting. When it is not explicitly set, the rule
will always run single-threaded.
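For instance, for the `map_reads` rule above (`threads: 2`), each spawned job
should be submitted with a matching CPU request, roughly along these lines
(the exact command used by the helper may differ):

```
sbatch --cpus-per-task=2 ...   # one Slurm job per instance of rule map_reads
```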
My Snakemake module also supports MPI mode, but from what I observe, your tasks
usually involve running some code over a lot of inputs rather than running a
single multi-node code on a single input, so mode 2 should be the most useful
to you. If you do need to run MPI, please let me know; it will require some
additional setup, especially when combined with `--use-conda`.
Conda with Snakemake is supported and tested; you should be able to install any
conda package you want. Unless it requires MPI, it usually works. When using
conda, please make sure an environment file is given in the rule and launch
Snakemake with `--use-conda`:
```
# Snakefile
rule map_reads:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "results/mapped/{sample}.bam"
    threads: 2
    conda:
        "envs/mapping.yaml"
    shell:
        "bwa mem {input} | samtools view -b - > {output}"
```
```
# job_script.sh
crun snakemake --use-conda            # for standalone
snakemake.helper -j 10 --use-conda    # for running in scheduler mode
```
Running software from other container modules is also possible; please just let
me know what you need. If you can install what you need with conda, please use
conda first, since you can do that yourself.

You can find my sample job script in /home/jsun/snakemake ; please let me know
if you have any questions or issues.

@@ -0,0 +1,248 @@ workflow/20230124.snakemake-intro.md
# Snakemake: A General Introduction
Snakemake is a workflow tool for managing and executing
interrelated sets of computation/analysis steps.
For more information, please visit the
[Snakemake documentation](https://snakemake.readthedocs.io/en/stable/).
## Snakemake in a Nutshell
Imagine that you have a complex computation that consists of a series
of tasks that must be executed in order
(each taking the output from the previous step as its input):
    input preprocessing => simulation => output postprocessing
In the example above, a complete computation consists of:
* step 1: input preprocessing of some sort, such as generating
a set of input files from a few parameters;
* step 2: compute-heavy simulation, potentially taking hours or days
on many CPU cores and/or GPUs;
* step 3: output postprocessing, such as calculating some statistics
from the simulation, determining molecular properties from
the simulation, creating a report with graph panels.
If there is only one computation to do, then it's a no-brainer to do the
three steps "by hand", i.e. creating up to three Slurm job scripts and
running them in sequence.
What if there are 1000 such computations to perform?
What if the set of steps is complex (e.g. one step requires a few
inputs from different prior computations), or long?
Is there a better way than sitting in front of the terminal submitting
1000+ jobs *in a particular order*?
The answer is, **yes!**
This is where we need *workflow tools* such as Snakemake.
In Snakemake, a complete workflow is stored in a specially formatted text
file named `Snakefile`.
Each processing step is expressed as a *rule* inside the snakefile.
Each rule may depend on one or more other rules; the input and output
file declarations determine these dependencies.
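As an illustration, a snakefile for the three-step pipeline above might look
roughly like the sketch below. All rule names, file names, and commands here
are hypothetical placeholders, not part of any real workflow:

```
# Snakefile (hypothetical sketch of the three-step pipeline)

# The first rule is the default target; asking for all the reports makes
# Snakemake work out and run the three steps below in the right order,
# once per "case".
rule all:
    input:
        expand("reports/{case}.pdf", case=["case001", "case002"])

rule preprocess:
    input:  "params/{case}.txt"
    output: "inputs/{case}.in"
    shell:  "python prepare_input.py {input} > {output}"

rule simulate:
    input:  "inputs/{case}.in"
    output: "results/{case}.out"
    threads: 8
    shell:  "my_simulation --threads {threads} {input} {output}"

rule postprocess:
    input:  "results/{case}.out"
    output: "reports/{case}.pdf"
    shell:  "python make_report.py {input} {output}"
```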
> A more thorough introduction to Snakemake is beyond the scope of this article.
Readers are referred to some articles and tutorials linked at the end
of this article to learn more.
We assume that you have a basic idea of workflow tools in order to use Snakemake.
{.is-info}
## Snakemake on ODU HPC
Snakemake is installed on Wahab and Turing; there are two modes to run this
software:
- Cluster (job-based) execution
- Single-node (standalone) execution
We will describe cluster execution first, which is more scalable and powerful.
### Mode 1: Cluster Execution (recommended)
In [cluster (job-based) execution](https://snakemake.readthedocs.io/en/stable/executing/cluster.html),
Snakemake turns each rule in the snakefile into an individual Slurm job and
executes it via the job scheduler.
This allows the HPC resources to be used most efficiently;
therefore we recommend this way of running your workflow whenever possible.
We created a helper script called `snakemake.helper` to make it easy to execute
a snakefile workflow through the job scheduler.
Here is an example invocation (written as a Slurm job script):
```
#!/bin/bash
#SBATCH --cpus-per-task 2
#SBATCH --job-name Example_snakemake_cluster
enable_lmod
module load container_env snakemake
snakemake.helper -j 10
```
This script should only take 1 or 2 cores to run, since it is *only* a master script.
It does not do the real work (i.e. it does not itself run the rules in the snakefile).
When processing the input snakefile, Snakemake will spawn additional Slurm jobs
(where one snakefile rule == one Slurm job)
and monitor their completion in order to push the workflow forward to its completion.
The `-j 10` option limits how many of these jobs may run in the scheduler at
the same time.
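Under the hood, `snakemake.helper` presumably wraps Snakemake's generic cluster
support (see the cluster execution link above); the site's actual wrapper may
differ, but conceptually it amounts to something like:

```
# Conceptual sketch only -- not the literal contents of snakemake.helper
snakemake \
    --cluster "sbatch --cpus-per-task={threads}" \
    --jobs 10
```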
Pros:
- Very scalable: this method can use as many CPU cores and compute nodes as
  specified in the snakefile (or as allowed by the cluster policy);
- Complete control over resource specification per rule;
Cons:
- May not be efficient for small/simple rules.
> Important: In the cluster execution model, do not run `crun snakemake` directly!
Run the `snakemake.helper` script instead.
{.is-info}
> For cluster resources, we only enforce `threads` in the snakefile rules.
Any other resource (memory, disk, ...) might or might not be enforced by
Snakemake; in any case, it is not enforced at the scheduler level.
{.is-info}
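For example, in a rule like the sketch below (a hypothetical rule, shown only
to illustrate the two kinds of declarations), the `threads` value is reflected
in the Slurm job's CPU request, while the `resources` entry is only a hint to
Snakemake and may not translate into an actual limit on the job:

```
rule big_analysis:
    input:  "data/big_input.dat"
    output: "results/big_output.dat"
    threads: 4                 # enforced: maps to the Slurm CPU request
    resources:
        mem_mb=8000            # advisory: may not be enforced by the scheduler
    shell:  "analyze --threads {threads} {input} > {output}"
```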
### Mode 2: Single-node (stand-alone) execution
In the single-node execution mode, you will acquire all the CPU
resources in a single node and can use all the cores to perform as
many tasks as can be parallelized to shorten the execution time.
(Snakemake will figure out what rules can be executed in parallel and
try to execute them concurrently, subject to the CPU core
constraints).
Use the following template to start your own job running Snakemake in
the single-node mode:
```
#!/bin/bash
#SBATCH --exclusive
#SBATCH --job-name Example_snakemake_1node
enable_lmod
module load container_env snakemake
crun snakemake
```
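Note: depending on the Snakemake version, running `snakemake` with no arguments
may stop with an error asking for the number of cores to use. If that happens
(and assuming the site module does not supply the option for you), pass
`--cores` explicitly, e.g.:

```
crun snakemake --cores $SLURM_CPUS_ON_NODE
```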
Pros:
- The easiest way to run Snakemake (no need to think about threads, etc.);
- You will have complete access to all compute resources on a single node
(CPU, memory, ...).
Cons:
- No parallelization across multiple nodes,
thus limiting the parallel scalability;
- Rules will be invoked within the same container as the Snakemake program;
if your program requires software in other containers, this will not work.
(Currently a containerized program [those with `crun` in its invocation]
cannot execute another program located inside a different container).
### Specifying threads

Each rule can declare how many CPU threads it needs. For example:
```
rule map_reads:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "results/mapped/{sample}.bam"
    threads: 2
    shell:
        "bwa mem {input} | samtools view -b - > {output}"
```
When `threads` is set in a rule, the helper launches the corresponding Slurm job
with a matching `--cpus-per-task` setting. When it is not explicitly set, the
rule will always run single-threaded.

The Snakemake module also supports MPI mode, but the workflows we typically see
involve running some code over a lot of inputs rather than running a single
multi-node code on a single input, so Mode 1 (cluster execution) should be the
most useful. If you do need to run MPI, please let us know; it will require
some additional setup, especially when combined with `--use-conda`.
### Using conda environments

Conda with Snakemake is supported and tested; you should be able to install any
conda package you want. Unless it requires MPI, it usually works. When using
conda, please make sure an environment file is given in the rule and launch
Snakemake with `--use-conda`:
```
# Snakefile
rule map_reads:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "results/mapped/{sample}.bam"
    threads: 2
    conda:
        "envs/mapping.yaml"
    shell:
        "bwa mem {input} | samtools view -b - > {output}"
```
```
# job_script.sh
crun snakemake --use-conda            # for standalone
snakemake.helper -j 10 --use-conda    # for running in scheduler mode
```
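For reference, the `envs/mapping.yaml` file referenced above is an ordinary
conda environment file. A minimal sketch (the channels and packages shown are
illustrative; list and pin whatever your workflow actually needs):

```
# envs/mapping.yaml
channels:
  - conda-forge
  - bioconda
dependencies:
  - bwa
  - samtools
```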
Running software from other container modules is also possible; please let us
know what you need. If you can install what you need with conda, please use
conda first, since you can do that yourself.

You can find sample job scripts in /home/jsun/snakemake ; please let us know if
you have any questions or issues.