Part 3: Multi-sample analysis
In this part, we are going to start from the pipeline structure we built in the previous part and extend it for multi-sample analysis.
Note
We will be working now in the multi directory. Thus, please make sure you are in the correct directory and move the data:
cd /workspaces/taxoflow_tutorial/TaxoFlow/multi
mv ../single/data .
This will give us the opportunity to practice using the following Nextflow features:
- Using a Nextflow operator to control the flow of data
- Controlling the execution of modules according to an input condition
- Including a process that runs a customized script
- Generate a performance report
1. Multi-sample input
With our shiny brand-new pipeline, we are at this moment able to analyze each sample individually by running the workflow multiple times. Nonetheless, one of the most powerful capabilities by Nextflow is its native parallel execution according to the available resources the executor finds. You can think of this as a sort of “integrated for loop” that will process all the samples in parallel in a single run without the need of re-running the pipeline.
Nextflow´s parallelization
Formally speaking, Nextflow’s parallelization is implicitly defined by the processes input and output declarations, or in other words, Nextflow uses asynchronous first-in, first-out (FIFO) queues or channels. If you want to know more about this, we recomend this section of the official documentation.
To achieve this purpose, there are two possibilities:
- The use of wildcards in the input (this can be tricky and requires taking into account particular folder structures).
- Create a file that points out to the sample files regardless of their location in the file system.
In this course, we will target the second input option, but you are welcome to explore how you can use the first option by checking out the Nextflow documentation.
To move forward, let’s create the file samplesheet.csv inside the folder data/:
| data/samplesheet.csv | |
|---|---|
1 2 3 4 | |
Here, we have provided the sample id and the absolute paths to both forward and reverse reads per sample.
Please notice that the files are not required to be stored in the directory; however, this is recommended in order to maintain a consistent fodirectorylder structure.
However, we cannot use this file as input in the current state of the pipeline, given that it expects only a path to create a paired-end channel.
So let’s include an additional parameter in the multi/nextflow.config file (inside the parameter block, keeping the same structure):
| multi/nextflow.config | |
|---|---|
10 | |
We initialize this parameter as null since it can be used or not.
Now, we need to modify the main.nf file to state how the input should be handled depending on the type of input:
| main.nf | |
|---|---|
22 23 24 25 26 27 28 | |
This modified declaration states that if we use the parameter --reads when we invoke the nextflow run main.nf, the reads channel will be created using only the path to paired-end files.
Otherwise, we must include the parameter --sheet_csv with the corresponding file containing the sample information.
1.1 Conditional execution with if statements
TaxoFlow showcases two layers of conditional logic:
- At the top‑level workflow (
main.nf) to decide how to build the read channel. - Inside the
TaxoFlowworkflow (taxoflow.nf) to decide whether to run downstream reporting steps (shown in section 3. Conditional reporting insidetaxoflow.nf).
The if statement decides how inputs are parsed:
- Branch 1 –
params.readsis set:- Use
channel.fromFilePairs(params.reads, checkIfExists:true)to build a channel of paired‑end read files directly from a glob pattern. - This is convenient when your reads are already organised on disk and you do not need a sample sheet.
- Use
- Branch 2 –
params.readsis not set:- Use
channel.fromPath(params.sheet_csv)followed by.splitCsv(header:true)to read a CSV samplesheet. - Map each row into a tuple:
tuple(row.sample_id, [file(row.fastq_1), file(row.fastq_2)]). - This is useful when metadata such as
sample_idis stored in a table.
- Use
In both cases the result is a single channel reads_ch that emits:
- A
sample_idvalue. - A list with the two FASTQ files.
The rest of the pipeline (TaxoFlow(...)) is independent of how reads_ch was created, illustrating a common pattern:
- Use
ifblocks early in the workflow to normalize different input formats into a canonical channel shape.
If structure
Abstracting the if block from main.nf:
| if statement | |
|---|---|
1 2 3 4 5 | |
else statement is not always required.
Being so, it is necessary to use one of the two forms of input; if we use both at the same time, the --reads will predominate or if none of them is indicated, the pipeline will fail.
Do not worry now for the way in which channel is created using the .csv file, this declaration is quite standard and you can just copy and paste for other pipelines in which you would like to use it; however, you can learn more about this here.
Now, we would be ready to re-run the pipeline to process all the samples in a single call. Notwithstanding, the inclusion of additional samples has the advantage that we can expand the analysis to estimate β-diversity and compare them to extract important insights.
2. Additional processes
2.1. Kraken-biom
Let’s create a new module that is going to handle the Bracken output to produce a Biological Observation Matrix (BIOM) file that concatenates the species abundance in each sample.
The kraken_biom.nf file will be located in the multi/modules/ directory:
| multi/modules/kraken_biom.nf | |
|---|---|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | |
This process will collect each output from the Bracken files to build a single *.biom file that contains the abundance species data of all the samples.
In the script statement we find three tasks to execute, the first two lines are for variable manipulation required to handle the type of input this process receives (more about this when modifying multi/taxoflow.nf below), and the second line executes the kraken-biom command that is available thanks to specified container.
2.1.1. Operator collect() and conditional execution
Nextflow provides a high number of operators that smooth data handling and orchestrates the workflow to do exactly what we want.
In this case, the process KRAKEN_BIOM requires all the files produced by Bracken belonging to each sample, which means that KRAKEN_BIOM can not be triggered until all Bracken processes are finished.
For this task, the operator collect() comes really handy, and therefore let’s include it in our multi/taxoflow.nf… but wait!
Let’s recall that KRAKEN_BIOM and the following KNIT_PHYLOSEQ are only triggered if the execution is aiming at processing more than one sample.
Being so, we will include these processes and modify the workflow execution to add the conditional statement in multi/taxoflow.nf:
| multi/taxoflow.nf | |
|---|---|
11 | |
| multi/taxoflow.nf | |
|---|---|
35 36 37 | |
Here, you can see that we have added the operator collect() to capture all the output files from BRACKEN, and this is happening only if we are using --sheet_csv as input.
This operator is going to return a list of the elements specified in the output of the process (BRACKEN), and, for instance, we are interested in each “second” (indices 1,4,7…) element of the list to run the kraken-biom command; this is the reason why within the script statement in multi/modules/kraken_biom.nf we have included two codelines to obtain the paths to these files.
If this is not entirely clear, please check the Nextflow documentation.
2.2. Phyloseq
2.2.1. Including a customized script
We are at the last step of the pipeline execution, and now we need to process the *.biom file by transforming it into a Phyloseq object, which is easier to use, more intuitive to understand, and is equipped with multiple tools and methods to plot.
Another amazing feature by Nextflow is the possibility to run the so-called Scripts à la carte, which means that a process does not necessarily require an external tool to execute, and hence you can develop your own analysis with customized scripts, i.e., R or Python.
Here, we will run an R script inside the module multi/modules/knit_phyloseq.nf to create and process the Phyloseq object taking as input the output from multi/modules/kraken_biom.nf:
| multi/modules/knit_phyloseq.nf | |
|---|---|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | |
Global vs Nextflow variables
Within the modules/knit_phyloseq.nf you can notice that some variables like biom_pat and outreport are preceded by a backslash (\). In Nextflow, it is really important to distinguish Nextflow variables from Bash or environment variables. This is achieved through the use of double quotes in the script section plus adding the escape character (backslash) before Bash variables. More about this here.
As you can see, we are declaring some variables both in Nextflow and bash to be able to call the script.
This is a special case since this type of scripts can be stored in the bin directory for Nextflow to find them directly.
Nevertheless, as we are not “running the script” directly but we are calling Rscript to render a final *.html report, Nextflow is not able to automatically find the customized script nor detect when the report is rendered.
As a result the output from this process is just a standard/command-line output, and we have to include an additional parameter in the multi/nextflow.config file:
| multi/nextflow.config | |
|---|---|
11 | |
This R Markdown file uses a Phyloseq object created from Kraken2/Bracken output (BIOM file), then applies standard functions from for diversity and network analysis. Thus, results should therefore be interpreted as visual summaries of community structure, not as statistically validated differences.
Diversity and network analysis methods
Input data
Taxonomic abundance tables generated with Bracken were imported into R as a Phyloseq object. Taxa were agglomerated at the genus level (tax_glom), and low-abundance genera were filtered by retaining taxa with a mean relative abundance ≥ 3% across samples.
α-diversity (within-sample diversity)
- Calculated using
plot_richness()from Phyloseq. - Diversity indices:
- Chao1: richness estimator accounting for unseen taxa.
- Shannon: richness and evenness estimator.
- Normalization: raw abundance counts were used directly (no rarefaction or scaling).
β-diversity (among-sample diversity)
- Community dissimilarity calculated using Bray–Curtis distance.
- Ordination performed with Principal Coordinates Analysis (PCoA).
- Visualizations:
- Heatmap of taxonomic abundance with sample ordering based on ordination.
- PCoA scatter plot for comparing sample composition.
- Normalization: Bray–Curtis calculated directly from the filtered abundance table.
Network construction
- Generated using
plot_net()from Phyloseq. - Parameters:
- Distance metric: Bray–Curtis.
- Network type: taxa co-occurrence (
type = "taxa"). - Maximum distance threshold:
maxdist = 0.9.
- Nodes represent genera; edges connect taxa based on pairwise similarity under the defined threshold.
Statistical testing
This workflow provides descriptive and exploratory analysis only. Results should therefore be interpreted as visual summaries of community structure, not as statistically validated differences. Please visit the microbiome R package tutorial for expanding the statistical analysis of microbiome data.
In addition, please notice the container used for the KNIT_PHYLOSEQ, which is combination of multiple packages required to render the *.html report.
This is possible thanks to Seqera Containers, which is able to build almost any container (for docker or singularity!) by just “merging” different PyPI or Conda packages.
Also, we have to include this new process within multi/taxoflow.nf:
| multi/taxoflow.nf | |
|---|---|
12 | |
We need to call it as well inside the conditional execution if multi-sample is being handled:
| multi/taxoflow.nf | |
|---|---|
37 | |
2.3. MultiQC
As noted in the method overview, the last step of the multi-sample workflow is to compile quality-control results from several tools into a single interactive report using MultiQC.
While FastQC, Trim Galore, and Kraken2 each produce their own reports per sample, MultiQC scans those outputs and builds one HTML summary that makes it easier to compare samples side by side.
Let’s create the multiqc.nf file inside the multi/modules/ directory:
| multi/modules/multiqc.nf | |
|---|---|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | |
This process runs once for the whole pipeline run, after all per-sample QC tasks have finished.
Unlike the per-sample modules, its tag is fixed to "All samples" because it does not process reads from a single sample in isolation.
2.3.1. Process inputs
The input block expects two arguments:
path '*'— a collection of report files from FastQC, Trim Galore, and Kraken2. Nextflow stages all of these files in the process work directory before the command runs.val output_name— the base name for the final report (in our case, this comes fromparams.report_id).
2.3.2. Process outputs
MultiQC produces two outputs:
${output_name}.html(emitted asreport) — the interactive summary report you open in a browser.${output_name}_data(emitted asdata) — a folder with the parsed data MultiQC used to build the report.
2.3.3. Command script
The script block runs a single command: multiqc . -n ${output_name}.html.
MultiQC scans the current directory (.) for recognized report formats and writes the combined HTML file.
The -n flag sets the output file name.
2.3.4. Modify taxoflow.nf
First, import the new module at the top of multi/taxoflow.nf:
| multi/taxoflow.nf | |
|---|---|
13 | |
Next, we need to gather the QC files produced across all samples before calling MULTIQC.
Add the following lines inside the main: block of the TaxoFlow workflow, after the per-sample processes:
| multi/taxoflow.nf | |
|---|---|
39 40 41 42 43 44 45 46 47 48 49 | |
Here is what each part does:
channel.empty().mix(...)starts from an empty channel and merges several channels into one. Each channel contributes files from a different step: FastQC (before trimming), Trim Galore (trimming reports and post-trim FastQC), and Kraken2 (.k2reportfiles).
Operator mix()
The mix operator combines the items emitted by two (or more) channels. More here.
- As explained before,
.collect()waits until all samples have finished the upstream processes, then bundles every file into a single list. This is what allowsMULTIQCto run only once, with inputs from every sample. MULTIQC(multiqc_files_list, params.report_id)passes that list and the report name to the process.
Unlike KRAKEN_BIOM and KNIT_PHYLOSEQ, MultiQC runs every time the pipeline runs, whether you use --reads or --sheet_csv.
Finally, expose the MultiQC outputs in the emit block:
| multi/taxoflow.nf | |
|---|---|
62 63 | |
2.3.5. Modify main.nf
We need to publish the MultiQC results in multi/main.nf, alongside the other outputs from TaxoFlow.
Add these lines to the publish block:
| multi/main.nf | |
|---|---|
45 46 | |
And assign a destination folder in the output block:
| multi/main.nf | |
|---|---|
85 86 87 88 89 90 | |
When the pipeline finishes, you will find all_samples.html (or whatever name you set in params.report_id) and the companion all_samples_data folder under output/multiqc/.
2.3.6. Parameter in nextflow.config
Add the following parameter to the params block in multi/nextflow.config:
| multi/nextflow.config | |
|---|---|
13 | |
This value is passed to MULTIQC as the output name.
You can change it at run time, for example: --report_id 'cuatro_cienegas_run'. That would produce cuatro_cienegas_run.html instead of all_samples.html.
2.3.7. Adaptation of Bowtie2
In order to include Bowtie2’s output into the final MultiQC report, we need to modify the module itself, along with additions to main.nf and taxoflow.nf:
- Changes in
bowtie2.nf:
bowtie2.nf
| multi/modules/bowtie2.nf | |
|---|---|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | |
Getting the stderr of Bowtie2
Bowtie2 places its output regarding the number of aligned reads to a common CLI output channel known as stderr. Thus, to capture this information we had to modify the command by using braces and create the file where we want the stderr to be stored. This file will be used by MultiQC to build the report.
- Changes in
taxoflow.nf:
taxoflow.nf
| multi/taxoflow.nf | |
|---|---|
19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 | |
- Additions in
main.nf:
main.nf
| multi/main.nf | |
|---|---|
34 | |
| multi/main.nf | |
|---|---|
53 54 55 | |
3. Conditional reporting inside taxoflow.nf
The inner if (params.sheet_csv) in taxoflow.nf controls whether to:
- Merge Bracken outputs across samples with
KRAKEN_BIOM(BRACKEN.out.collect()). - Render a Phyloseq HTML report with
KNIT_PHYLOSEQ(KRAKEN_BIOM.out).
Key ideas:
- When running from a samplesheet, we know which samples belong together, so it makes sense to aggregate them into a single biom file and downstream report.
- When running from raw file pairs only (
params.reads),params.sheet_csvisnullinnextflow.config, so the extra report is skipped.
This is a clean way to:
- Keep core processing always enabled.
- Toggle extra reporting or QC steps based on parameters.
4. Enabling the built‑in report
In nextflow.config:
| multi/nextflow.config | |
|---|---|
22 23 24 25 | |
This block tells Nextflow to:
- Generate a single HTML report named
performanceReport.htmlat the end of each run. - Place it in the results directory from where you launched Nextflow. Please notice that this is a custom path. It does not take from
params.outdir, and thus ifoutdiris changed, the report will not be stored on the new output directory. - Keep all outputs (taxonomy results, Krona plots, RMarkdown report, MultiQC output,Nextflow report) under a single
output/tree.
You do not need to change main.nf or taxoflow.nf to use this feature; it is entirely controlled by configuration.
Enabling the report as a parameter
It is possible to generate the report just by adding the parameter -with-report <file_name> when executing the pipeline. More about this here.
5. Execution
Now, we are completely set to run the analysis for as many samples as we would like, and we will obtain a final report depicting different metrics regarding taxonomic abundance, network analysis, and α and β-diversity. Let’s execute (please remember that we are within the multi directory):
nextflow run main.nf --sheet_csv 'data/samplesheet.csv'
On the output of the command line, you will see:
Multi-sample execution
N E X T F L O W ~ version 25.10.4
Launching `main.nf` [stoic_miescher] DSL2 - revision: 8f65b983e6
___________________________________________________________________________________________________
___________________________________________________________________________________________________
>===>>=====> >=======> >=>
>=> >=> >=>
>=> >=> >=> >=> >=> >=> >=> >=> >=> >=> >=>
>=> >=> >=> >> >=> >=> >=> >=====> >=> >=> >=> >=> > >=>
>=> >=> >=> >> >=> >=> >=> >=> >=> >=> >=> >> >=>
>=> >=> >=> >> >=> >=> >=> >=> >=> >=> >=> >=>> >=>=>
>=> >==>>>==> >=> >=> >=> >=> >==> >=> >==> >==>
___________________________________________________________________________________________________
___________________________________________________________________________________________________
executor > local (24)
[1f/176196] TaxoFlow:FASTQC (ERR2143768) [100%] 3 of 3 ✔
[c5/42b877] TaxoFlow:TRIM_GALORE (ERR2143774) [100%] 3 of 3 ✔
[da/6ba52f] TaxoFlow:BOWTIE2 (ERR2143774) [100%] 3 of 3 ✔
[45/054c41] TaxoFlow:KRAKEN2 (ERR2143774) [100%] 3 of 3 ✔
[30/94f39e] TaxoFlow:BRACKEN (ERR2143774) [100%] 3 of 3 ✔
[67/1dabc3] TaxoFlow:K_REPORT_TO_KRONA (ERR2143774) [100%] 3 of 3 ✔
[ed/8bf05a] TaxoFlow:KT_IMPORT_TEXT (ERR2143774) [100%] 3 of 3 ✔
[b2/cfd842] TaxoFlow:KRAKEN_BIOM (merge_samples) [100%] 1 of 1 ✔
[cb/815f36] TaxoFlow:KNIT_PHYLOSEQ (knit_phyloseq) [100%] 1 of 1 ✔
[7f/b77259] TaxoFlow:MULTIQC (All samples) [100%] 1 of 1 ✔
Completed at: 27-Nov-2025 13:03:40
Duration : 1m 36s
CPU hours : (a few seconds)
Succeeded : 10
Use your own data
If you want to use your own data, you just need to change sample IDs and the paths to sequencing reads on the file data/samplesheet.csv, or create your own one to use it with the parameter --sheet_csv.
Keep in mind that since the execution is in parallel, the order in which the samples are processed is random and the order in which sample ids appear will differ among executions.
Also, while the pipeline is running you will see that KRAKEN_BIOM, and hence KNIT_PHYLOSEQ, will not be triggered until all the samples are processed by the previous processes.
Finally, inside the output directory, you will see multiple folders with the exact sample ids, and within these all the output files, including the files to visualize the Krona plots.
Likewise, in the output folder you will see the file TaxoReport.html which is ready to be opened and explored. It’s your time to analyze it!
Below you can see the plots included in the report, where it is possible to observe different metrics and general trends of annotated reads at genus level considering the custom database that we created for this tutorial with only 54 bacterial species (we applied a filter to keep only genera with relative above 3%, and keep in mind that this is the Bracken output!):
5.1 Performance report
Open the file performanceReport.html in a browser. You will find:
- A timeline of all tasks across processes like
FASTQC,BOWTIE2,KRAKEN2,BRACKEN, etc. - A resources table with CPU, memory and time usage per process.
- A tasks section showing how many samples were processed and how long each step took.
This native report complements the domain‑specific Phyloseq HTML:
- The Nextflow report focuses on pipeline performance and resource usage.
- The Phyloseq report focuses on biological interpretation of the metagenomic profiles.
- The MultiQC report focuses on general statistics and metrics aggregated from different tools.
6. Biological meaning
Once again, please remember that this execution of the pipeline, and therefore it doesn’t have a clear biological interpretation. To extract truly insights from the samples hereby provided or your own data, you need to use the corresponding genome index and Kraken2/Bracken databases for you analysis purposes. Please check how to achieve this here.
Takeaway
You just learnt how to control workflow execution by including conditionals and operators, processing multiple samples simultaneously and running a customized script to perform a metagenomics data analysis at read level.
What’s next?
Great! You are well equipped now to start developing your own pipelines.
