Skip to content

Part 2: Single-sample analysis

Now that we have a clear overview of what we want to achieve, we can start writing code! We are going to do this in four steps:

  1. Create the modules that are going to perform each task
  2. Create a taxoflow.nf containing the core logic of the pipeline
  3. Create a main.nf entrypoint workflow
  4. Create a nextflow.config configuration file

Then we will test that we can run the pipeline and that it produces the expected outputs.


1. Create the modules

You can think of modules as building blocks that we can string together to form an assembly line. Each module corresponds to a step in the pipeline.

1.1. Initial quality control with FastQC

As noted in the method overview, the objective of this module is to assess the quality of the raw paired-end reads before any trimming or filtering is applied.

Let’s create the fastqc.nf file inside the single/modules/ folder and write the following code:

single/modules/fastqc.nf
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
process FASTQC {
    tag "${sample_id}"
    container "community.wave.seqera.io/library/trim-galore:0.6.10--1bf8ca4e1967cd18"

    input:
    tuple val(sample_id), path(reads)

    output:
    path "*_fastqc.zip", emit: zip
    path "*_fastqc.html", emit: html

    script:
    """
    fastqc ${reads}
    """
    }

Let’s take a moment to break down what we are seeing here.

First, let’s look at the ‘housekeeping’ parts of the process, which you should be familiar with if you worked through the beginner Nextflow training.

  • The process name is FASTQC, this is important when creating the workflow file.
  • The tag directive is used to indicate which sample is being processed at a determined moment. This will be useful when running the pipeline.
  • container indicates the docker container with which the process will be run. We have retrieved all the containers from Seqera Containers.

Container images and profiles

Please keep in mind that we are using Docker images. If you want to use Apptainer/Singularity images, please refer to the HPC installation section. It is also possible to control the container/manager technology used in every execution. This is achieved by creating specific profiles according to the wished behavior of the pipeline, more here.

Now let’s have a look at the interesting bits!

1.1.1. Process inputs

The input for this process is a tuple containing the sample id and the paired-end reads (forward and reverse FASTQ files).

We’ll go over how this channel is built when we create the main.nf file.

1.1.2. Process outputs

This process emits two separate channels: one for the *_fastqc.zip archives and one for the *_fastqc.html reports. The HTML files are meant for interactive inspection in a browser; the ZIP archives contain the underlying data and plots used to build those reports.

emit:

You see that a the end of each output there is an emit: statement. This is useful to name file or set of files to later refer to them when calling the ouput of the process. This way offers a practical way to track the resulting outputs of a process. More about this when defining taxoflow.nf file.

1.1.3. Command script

The script block runs a single fastqc command on both read files at once. FastQC will generate one HTML report and one ZIP file per input FASTQ. To learn more about how to interpret the modules and plots in these reports, see the FastQC help pages.

1.2. Read trimming and quality re-estimation using Trim Galore

As noted in the method overview, the objective of this module is to remove adapter sequences and low-quality bases from the reads, and to re-run quality control on the trimmed data.

Let’s create the trimgalore.nf file inside the single/modules/ folder and write the following code:

single/modules/trimgalore.nf
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
process TRIM_GALORE {
    tag "${sample_id}"
    container "community.wave.seqera.io/library/trim-galore:0.6.10--1bf8ca4e1967cd18"

    input:
    tuple val(sample_id), path(reads)

    output:
    tuple val(sample_id), path("*_val_1.fq.gz"), path("*_val_2.fq.gz"), emit: trimmed_reads
    path "*_trimming_report.txt", emit: trimming_reports
    path "*_val_1_fastqc.{zip,html}", emit: fastqc_reports_1
    path "*_val_2_fastqc.{zip,html}", emit: fastqc_reports_2

    script:
    """
    trim_galore --fastqc --paired ${reads[0]} ${reads[1]}
    """
    }

Once again, this module follows the same structure as the previous process. The directives tag and container play the same role. Note that Trim Galore and FastQC share the same container image in this pipeline, because the Trim Galore image bundles both tools.

1.2.1. Process inputs

The input is the same tuple as for FastQC: the sample id and the paired-end reads. Trim Galore receives the raw reads directly from the input channel, in parallel with the initial FastQC step.

1.2.2. Process outputs

This process produces several outputs:

  • a tuple with the sample id and the paths to the trimmed forward and reverse reads (*_val_1.fq.gz and *_val_2.fq.gz), which are passed on to Bowtie2;
  • a trimming report (*_trimming_report.txt) summarizing how many bases were removed;
  • FastQC HTML and ZIP reports for each trimmed read file, allowing you to compare quality before and after trimming.

1.2.3. Command script

The script block runs Trim Galore with three important flags:

  • --paired tells Trim Galore to treat the reads as paired-end and to trim them accordingly;
  • --fastqc runs FastQC automatically on the trimmed reads;
  • the two input FASTQ paths are passed as ${reads[0]} and ${reads[1]} for the forward and reverse files, respectively.

You are encouraged to check the Trim Galore documentation for additional options, such as adjusting quality thresholds or adapter sequences.

1.3. Remove host sequence with Bowtie2

As noted in the method overview, the objective of this module is to clean the reads by aligning them against a reference genome and remove any reads that map to that reference.

Let’s create the bowtie2.nf file inside the single/modules/ folder and write the following code:

single/modules/bowtie2.nf
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
process BOWTIE2 {
    tag "${sample_id}"
    container "community.wave.seqera.io/library/bowtie2:2.5.4--d51920539234bea7"

    input:
    tuple val(sample_id), path(read1), path(read2)
    path bowtie2_index

    output:
    tuple val("${sample_id}"), path("${sample_id}.1"), path("${sample_id}.2"), path("${sample_id}.sam")

    script:
    """
    export BOWTIE2_INDEXES="${bowtie2_index}"
    export index="${bowtie2_index}/\$(basename ${bowtie2_index})"
    (bowtie2 -x "\${index}" -1 ${read1} -2 ${read2} -p 2 -S ${sample_id}.sam --un-conc-gz ${sample_id}) 2> ${sample_id}_aln_sum.log
    """
    }
Resources

The parameter -p is used to indicate how many cores or threads we want to use for this tool. The name of the parameter can change (i.e. in Kraken2, it is called --threads), and we can control the number of cpus assigned to each process with this kind of parameters. In Nextflow, there is a really convenient way to centralize these resources, assigning them to each process according to the requirements; this is an adavanced control feature that you can explore here.

At first glance, you can see that it follows the same structure as the later processes. The directives tag and container play the same role.

1.3.1. Process inputs

The input for this process will be the sample id, the paired-end reads, as well as the path to the bowtie index.

We’ll go over these in more detail when we create the main.nf file.

1.3.2. Process outputs

This process will produce a tuple containing the sample id, the path to the cleaned reads and the path to the *.sam file.

The SAM file contains the information about the alignment of the sequences against the reference indexed genome. To learn more about this format, please see the SAM format specification.

1.3.3. Command script

The script block contains two commands:

  1. An environment variable is exported pointing to the directory where the indexed genome is stored. This is required by Bowtie2. If you plan to use another reference genome, you will need to adapt this path pointing to the indexed genome.

  2. The Bowtie2 command that includes again the path to the indexed genome. The -1 and -2 arguments capture the path to forward and reverse reads, respectively.The -S argument sets the desired format output. The --un-conc-gz argument is used to write out any paired-end reads that fail to align concordantly to the reference genome. It is important to keep that these are the reads we are actually looking for, and therefore the ones we are passing to next process (Kraken2) to proceed with the taxonomic classification. To learn more about this parameter, see the Bowtie2 manual.

1.4. Apply taxonomic classification with Kraken2

Now, let’s create the module for our taxonomic classifier.

Let’s create kraken2.nf file inside the single/modules/ folder and write this code:

single/modules/kraken2.nf
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
process KRAKEN2 {
	tag "${sample_id}"
	container "community.wave.seqera.io/library/kraken2:2.14--83aa57048e304f01"

	input:
	tuple val(sample_id), path(reads_1), path(reads_2), path(sam)
	path kraken2_db

	output:
	tuple val("${sample_id}"), path("${sample_id}.k2report"), path("${sample_id}.kraken2"), emit: files
        path "${sample_id}.k2report", emit: report

	script:
	"""
	kraken2 --db $kraken2_db --threads 2 \
	--report ${sample_id}.k2report \
	--report-minimizer-data \
	--minimum-hit-groups 2 \
	--gzip-compressed \
	--paired \
	${reads_1} ${reads_2} > ${sample_id}.kraken2
	"""
    }

At first glance, you can see that it follows the same structure as the previous process. The directives tag and container play the same role.

1.4.1. Process inputs

The input in this case is a tuple containing the sample id, the cleaned reads and the *.sam file. This one is declared just to maintain the correspondence between the output from Bowtie2 and the Kraken2 input.

Also, the path to the Kraken database is declared. We’ll go over this in more detail when create the main.nf file.

1.4.2. Process outputs

The output from this process will be a tuple containing the sample id, the path to the .k2report file, as well as the path to the .kraken2 file.

1.4.3. Command script

This is a standard Kraken2 command that specifies the following arguments:

  • the path to the database
  • the number of threads to use
  • the path to the reports Kraken2 generates
  • the minimum number of ‘hit groups’ needed to make a classification call
  • the flag to report the minimizers and distinct minimizer count
  • the parameters that indicate that the received reads are paired-end and compressed in a .gz format.

You are strongly encouraged to check out both the protocol that documents this methodology, as well as the source publication to understand how Kraken2 works and the type of files it generates. This article about minimizers may also be useful.

1.5. Apply Bayesian re-estimation of species abundance with Bracken

Let’s create the file bracken.nf inside single/modules/ and write the following code:

single/modules/bracken.nf
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
process BRACKEN {
	tag "${sample_id}"
	container "community.wave.seqera.io/library/bracken:3.1--22a4e66ce04c5e01"

	input:
	tuple val(sample_id), path(k2report), path(kraken2)
	path kraken2_db
        val length

	output:
	tuple val("${sample_id}"), path("${sample_id}.breport"), path("${sample_id}.bracken")

	script:
	"""
	bracken -d $kraken2_db \
	-i ${k2report} -r ${length} -l S -t 2 \
	-o ${sample_id}.bracken \
	-w ${sample_id}.breport
	"""
    }

Once again, this module follows the same structure as the previous process. The directives tag and container play the same role.

1.5.1. Process inputs

The declared input for this Bracken process matches the output generated by Kraken2. It expects to receive a tuple containing the sample id and the paths to .k2report and kraken2 reports. It requires also the read length (length) to compute the probabilities, as well as the path to the database (kraken2_db)

1.5.2. Process outputs

Similarly, the output establishes that this process will generate a tuple with the sample id and the two types of reports Bracken produces. You can find more information about these reports in the here.

1.5.3. Command script

The Bracken command specifies the following arguments:

  • the path to the database
  • the input represented by the Kraken2 report
  • the length of the reads
  • the taxonomic level at which we want to re-estimate the abundance (species in this case)
  • the required number of reads before abundance estimation to perform re-estimation
  • the format for the output reports

1.6. Generate plots with Krona

Krona plots are interactive pie charts that are widely used in the metagenomics field.

To generate them, we will need two distinct processes; one to extract species abundance metrics from the Bracken report, and another to render the Krona plot itself based on that file.

1.6.1. Extract the species abundance metrics

Let’s create the file kReport2Krona.nf inside single/modules/ and write the following code:

single/modules/kReport2Krona.nf
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
process K_REPORT_TO_KRONA {
	tag "${sample_id}"
	container "community.wave.seqera.io/library/krakentools:1.2--db94e0b19cfa397b"

	input:
	tuple val(sample_id), path(b_report), path(bracken)

	output:
	tuple val("${sample_id}"), path("${sample_id}.b.krona.txt")

	script:
	"""
	kreport2krona.py -r ${b_report} \
	-o ${sample_id}.b.krona.txt \
	--no-intermediate-ranks
	"""
    }

As before, this process takes as input the exact output from Bracken in a tuple form, and it is going to generate another tuple containing sample id and the *.txt file with the abundance metrics.

The command will run a script called kreport2krona.py that is provided as part of KrakenTools and uses the Bracken report as input to produce a plain text file containing the species abundance information.

Note that the command includes an additional flag to leave out non-traditional ranks. For more information about that option, see the KrakenTools documentation referenced above.

1.6.2. Extract the species abundance metrics

Let’s create the file ktImportText.nf inside single/modules/ and write the following code:

single/modules/ktImportText.nf
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
process KT_IMPORT_TEXT {
	tag "${sample_id}"
	container "community.wave.seqera.io/library/krona:2.8.1--2f750080982f027e"

	input:
	tuple val(sample_id), path(krona_txt)

	output:
	path "${sample_id}.krona.html"

	script:
	"""
	ktImportText ${krona_txt} \
	-o ${sample_id}.krona.html
	"""
    }

This process takes as input the plain *.txt in a tuple with the sample ID, and uses Krona to render the Krona plot in a self-contained *.html file, which is the final output of our pipeline at this stage.

The Krona plot can be opened in a standard modern browser, just like the one you are using to follow this tutorial.

This completes the implementation of the modules we need for the single-sample use case, so we can now move on to creating the files that will describe the workflow and control its execution.


2. Create taxoflow.nf

There are several strategies for building Nextflow workflows. Here we are going to use a composable workflow structure as described in the Workflows of Workflows Side Quest, which uses an entrypoint workflow in a single/main.nf file, an embedded single/taxoflow.nf file containing the core logic of the workflow, and the take syntax to declare inputs.

Let’s start by creating the single/taxoflow.nf file that will the core logic of the workflow. Note that we create this file in our single/ directory, NOT in single/modules/.

We start by adding the following code to import modules:

single/taxoflow.nf
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
/*
 * required tasks
 */
include { FASTQC                      }  from './modules/fastqc.nf'
include { TRIM_GALORE                 }  from './modules/trimgalore.nf'
include { BOWTIE2                     }  from './modules/bowtie2.nf'
include { KRAKEN2                     }  from './modules/kraken2.nf'
include { BRACKEN                     }  from './modules/bracken.nf'
include { K_REPORT_TO_KRONA           }  from './modules/kReport2Krona.nf'
include { KT_IMPORT_TEXT              }  from './modules/ktImportText.nf'

Here we list all the modules that we wish to import into the workflow using the standard include syntax. This consists of two parts: invoking the name of the process inside the curly brackets, and pointing to the relative path where the files are located. Note that the process names should match exactly how they are written in the module files.

Now, let’s write the workflow itself. We need to declare the primary input for the workflow and invoke the processes on the appropriate inputs.

single/taxoflow.nf
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
/*
 * workflow
 */

workflow TaxoFlow {
    // required inputs
    take:
        bowtie2_index
        kraken2_db
        reads_ch
    // workflow implementation
    main:
         // Initial quality control
        FASTQC(reads_ch)
        TRIM_GALORE(reads_ch)
        BOWTIE2(TRIM_GALORE.out.trimmed_reads, bowtie2_index)
        KRAKEN2(BOWTIE2.out, kraken2_db)
        BRACKEN(KRAKEN2.out.files, kraken2_db, params.read_length)
        K_REPORT_TO_KRONA(BRACKEN.out)
        KT_IMPORT_TEXT(K_REPORT_TO_KRONA.out)

    emit:
        bowtie_unali             =    BOWTIE2.out
        kraken_class             =    KRAKEN2.out.files
        bracken_class            =    BRACKEN.out
        krona                    =    KT_IMPORT_TEXT.out
        fastqc_zip               =    FASTQC.out.zip
        fastqc_html              =    FASTQC.out.html
        trimmed_reads            =    TRIM_GALORE.out.trimmed_reads
        trimming_reports         =    TRIM_GALORE.out.trimming_reports
        trimming_fastqc_1        =    TRIM_GALORE.out.fastqc_reports_1
        trimming_fastqc_2        =    TRIM_GALORE.out.fastqc_reports_2
    }

In this declaration you see that we need three primary inputs for the pipeline: the indexed reference genome for Bowtie2, the indexed database for Kraken2 and Bracken, and the paths to the reads; when creating the nextflow.config, you will see how to specify these paths.

In the block main, you see the names of the processes with one or more parameters inside the parenthesis depicting the exact data flow:

  1. FASTQC(reads_ch) and TRIM_GALORE(reads_ch) will be the first executed processes in parallel using the raw reads; the other processes must wait until TRIM_GALORE is finished.
  2. BOWTIE2(TRIM_GALORE.out.trimmed_reads, bowtie2_index) will received the trimmed reads from TRIM_GALORE to execute its task using the clean sequences and the indexed genome.
  3. Once BOWTIE2 has completed the task, KRAKEN2 will take the output, along with the database path to perform the specified task; the remaining processes are on hold until Kraken2 is finished.
  4. BRACKEN, in turn, will take KRAKEN2 output to run the abundance re-estimation using the same database; K_REPORT_TO_KRONA and KT_IMPORT_TEXT can not be run until BRACKEN is finished with its task.
  5. Finally, K_REPORT_TO_KRONA, and subsequently KT_IMPORT_TEXT, will be run to generate our final goal file which is the Krona plot.

2.1. The emit block

At the end of the TaxoFlow workflow you will notice an emit block. This is how a composable workflow declares its outputs: the channels that another workflow (our main.nf entrypoint) can use after TaxoFlow has run.

Each line gives a short name to a process output, using the form name = PROCESS.out.<channel>. For example:

  • fastqc_zip = FASTQC.out.zip forwards the ZIP reports from FastQC. The .zip part refers to the emit: zip label we set in the fastqc.nf module.
  • trimmed_reads = TRIM_GALORE.out.trimmed_reads exposes the trimmed reads produced by Trim Galore.

You only need to list the results you want to keep or inspect later. The names you choose here (krona, bracken_class, and so on) will show up again in main.nf, where we connect them to the publish block.


3. Create the main.nf entrypoint workflow

We are getting closer to running the pipeline! Let’s create the single/main.nf file.

Then we add an include statement to import the TaxoFlow workflow from the single/taxoflow.nf file, as well as a workflow block that sets up an input channel and invokes it:

main.nf
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
include {TaxoFlow} from './taxoflow.nf'

workflow {

    reads_ch = Channel.fromFilePairs(params.reads, checkIfExists:true)
    TaxoFlow(params.bowtie2_index, params.kraken2_db, reads_ch)

	// publish files
    publish:

    bowtie_unali            =    TaxoFlow.out.bowtie_unali
    kraken_class            =    TaxoFlow.out.kraken_class
    bracken_class           =    TaxoFlow.out.bracken_class
    krona                   =    TaxoFlow.out.krona
    fastqc_zip              =    TaxoFlow.out.fastqc_zip
    fastqc_html             =    TaxoFlow.out.fastqc_html 
    trimmed_reads           =    TaxoFlow.out.trimmed_reads
    trimming_reports        =    TaxoFlow.out.trimming_reports
    trimming_fastqc_1       =    TaxoFlow.out.trimming_fastqc_1
    trimming_fastqc_2       =    TaxoFlow.out.trimming_fastqc_2
}

output {

    bowtie_unali {
        path 'bowtie2'
    }
    kraken_class {
        path 'kraken2'
    }
    bracken_class {
        path 'bracken'
    }
    krona {
        path 'krona'
    }
    fastqc_zip {
        path 'fastqc'
    }
    fastqc_html {
        path 'fastqc'
    }
    trimmed_reads {
        path 'trimming'
    }
    trimming_reports {
        path 'trimming'
    }
    trimming_fastqc_1 {
        path 'trimming'
    }
    trimming_fastqc_2 {
        path 'trimming'
    }
}

Within this file, we create a banner with the pipeline name to be shown when the execution starts using log.info.

log.info is a method call on a logging object, and can be used to write informational messages to the Nextflow log during a pipeline’s execution. You can learn more about this here.

single/main.nf
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
workflow {

    main:

    log.info """\
	___________________________________________________________________________________________________
	___________________________________________________________________________________________________
    >===>>=====>                                 >=======>  >=>                         
     	>=>                                      >=>        >=>                         
     	>=>        >=> >=>  >=>   >=>    >=>     >=>        >=>    >=>     >=>      >=> 
     	>=>      >=>   >=>    >> >=>   >=>  >=>  >=====>    >=>  >=>  >=>   >=>  >  >=> 
     	>=>     >=>    >=>     >>     >=>    >=> >=>        >=> >=>    >=>  >=> >>  >=> 
     	>=>      >=>   >=>   >>  >=>   >=>  >=>  >=>        >=>  >=>  >=>   >=>>  >=>=> 
     	>=>       >==>>>==> >=>   >=>    >=>     >=>       >==>    >=>     >==>    >==>   
	___________________________________________________________________________________________________
	___________________________________________________________________________________________________
    """
    .stripIndent()

    reads_ch = Channel.fromFilePairs(params.reads, checkIfExists:true)
    TaxoFlow(params.bowtie2_index, params.kraken2_db, reads_ch)

Let’s break down the rest of this file.

3.1. Setting up the input channel

params.reads takes the path to the paired-end reads, using the channel factory fromFilePairs to build a channel of paired FASTQ files.

The result is reads_ch: a channel of tuples with the sample ID and the two read files.

Input reads

This way of providing the path to the reads requires the use regular expressions. Fortunately, in this case you just need to know the specific expected pattern by this paramater: name_of_the_files{1,2}.fastq.gz. More about this here.

3.2. Calling TaxoFlow

The line TaxoFlow(params.bowtie2_index, params.kraken2_db, reads_ch) runs our core workflow. This statement is really explicit in terms of the requirements, which are the path to the Kraken2 database, the indexed genome for Bowtie2 and the reads. The three arguments match the three inputs declared in the take: block of taxoflow.nf.

3.3. The publish block

After calling TaxoFlow, the publish block selects which emitted channels should be saved to your results folder. Each line connects a label to the matching output from the subworkflow, for example:

fastqc_zip = TaxoFlow.out.fastqc_zip

The name on the left-hand side must match an entry in the output block below. Channels that are not listed here will not be copied to the output directory (although they may still exist in Nextflow’s work directory while the run is in progress).

3.4. The output block

The output block tells Nextflow where to place each published file on disk. For each name from publish, we specify a subdirectory such as path 'kraken2' or path 'krona'. These folders are created inside outputDir, which we configure in nextflow.config in the next section.

When the pipeline finishes successfully, you will find your result files organized in these folders under the output directory.


4. Create the nextflow.config configuration file

Finally, we create the file single/nextflow.config, where we’ll set up the configuration of our pipeline.

This is where we provide default input parameters, define the output folder for the pipeline and enable the use of Docker containers.

single/nextflow.config
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
/*
 * pipeline input parameters
 */

params {
    reads                                 = null
    outdir                                = "${projectDir}/output"
    bowtie2_index                         = "${projectDir}/data/genome/TAIR10"
    kraken2_db                            = "${projectDir}/data/krakendb"
    read_length                           = 250
    }

// Enable using docker as the container engine to run the pipeline
docker.enabled = true

// Hard copies of files
workflow.output.mode = 'copy'
outputDir = params.outdir

Importantly,

  • workflow.output.mode defines if we want hard copies ('copy') of the files or just symlinks (default mode) to the files that will be located within the working directory (the work directory created during each Nextflow’s execution).

  • outputDir is a Nextflow variable used to set the output directory. Here we use the param outdir that can be controlled via CLI to use any other name for the output folder.

Paths to files and directories

We are providing the paths using the Nextflow ${projectDir} variable to point to the project directory directly; this ensures system independence. Usually, relative paths are preferred for system-independence of the pipelines, although some tools or scripts will fail if parameters are not provided as absolute paths. Likewise, please notice the paths to the database and indexed genomes, these must be changed in case you want to use a different genome and an alternative database. More about this here.

Override parameters

Parameters can be overridden at run time when included as command flags. You can learn more about this on Nextflow parameters. For example:

nextflow run main.nf \
	--reads "path/to/reads" \
	--outdir "path/to/outdir" \
	--bowtie2_index "path/to/bowtie2/index" \
	--kraken2_db "path/to/kraken2/database" \
    -- read_length N

5. Run the pipeline

That’s it, we are all set to run the pipeline!

Let’s just pick one of the samples provided (you can choose any of them) and run the following command:

mv data single/
cd single
nextflow run main.nf --reads 'data/samples/ERR2143768/ERR2143768_{1,2}.fastq.gz'

On the output of the command line, you will see:

Single-sample execution
Output
N E X T F L O W   ~  version 25.10.4

Launching `main.nf` [determined_lorenz] DSL2 - revision: 8f65b983e6

	___________________________________________________________________________________________________
	___________________________________________________________________________________________________
            >===>>=====>                                 >=======>  >=>                         
		>=>                                      >=>        >=>                         
		>=>        >=> >=>  >=>   >=>    >=>     >=>        >=>    >=>     >=>      >=> 
		>=>      >=>   >=>    >> >=>   >=>  >=>  >=====>    >=>  >=>  >=>   >=>  >  >=> 
		>=>     >=>    >=>     >>     >=>    >=> >=>        >=> >=>    >=>  >=> >>  >=> 
		>=>      >=>   >=>   >>  >=>   >=>  >=>  >=>        >=>  >=>  >=>   >=>>  >=>=> 
		>=>       >==>>>==> >=>   >=>    >=>     >=>       >==>    >=>     >==>    >==>                                                                                                                                                                      
	___________________________________________________________________________________________________
	___________________________________________________________________________________________________

    executor >  local (7)
    [2f/3e9520] TaxoFlow:FASTQC (ERR2143768)            [100%] 1 of 1 ✔
    [a9/4bc888] TaxoFlow:TRIM_GALORE (ERR2143768)       [100%] 1 of 1 ✔
    [e4/84c903] TaxoFlow:BOWTIE2 (ERR2143768)           [100%] 1 of 1 ✔
    [7b/13c639] TaxoFlow:KRAKEN2 (ERR2143768)           [100%] 1 of 1 ✔
    [45/b44c39] TaxoFlow:BRACKEN (ERR2143768)           [100%] 1 of 1 ✔
    [1f/3d7684] TaxoFlow:K_REPORT_TO_KRONA (ERR2143768) [100%] 1 of 1 ✔
    [59/5b2453] TaxoFlow:KT_IMPORT_TEXT (ERR2143768)    [100%] 1 of 1 ✔
    Completed at: 19-May-2026 17:43:27
    Duration    : 4m 1s
    CPU hours   : 0.1
    Succeeded   : 7
Use your own data

If you want to use your own data, you just need to change the path indicated with the --reads parameter.

If that worked for you, it’s finally time to analyze the results!

Indexed genome and database

Please keep in mind that we are using an indexed genome and database that do not confer explicit biological interpretation, and thus the results are NOT meaningful. Please make sure that you checked the corresponding section.

You can find all the output files in the output/ directory by running the tree command:

tree output

You should see the following:

Output directory content
Output contents
output/
├── bowtie2
│   ├── ERR2143768.1
│   ├── ERR2143768.2
│   └── ERR2143768.sam
├── bracken
│   ├── ERR2143768.bracken
│   └── ERR2143768.breport
├── fastqc
│   ├── ERR2143768_1_fastqc.html
│   ├── ERR2143768_1_fastqc.zip
│   ├── ERR2143768_2_fastqc.html
│   └── ERR2143768_2_fastqc.zip
├── kraken2
│   ├── ERR2143768.k2report
│   └── ERR2143768.kraken2
├── krona
│   └── ERR2143768.krona.html
└── trimming
    ├── ERR2143768_1.fastq.gz_trimming_report.txt
    ├── ERR2143768_1_val_1_fastqc.html
    ├── ERR2143768_1_val_1_fastqc.zip
    ├── ERR2143768_1_val_1.fq.gz
    ├── ERR2143768_2.fastq.gz_trimming_report.txt
    ├── ERR2143768_2_val_2_fastqc.html
    ├── ERR2143768_2_val_2_fastqc.zip
    └── ERR2143768_2_val_2.fq.gz

Please feel free to explore each of the files to understand each process and how data were handled.

The file we are most interested in is the *.html file within krona/ containing the Krona plot. You can either download it and open it in your browser, or install the preview extension for Visual Studio. Below you can see a snapshot of how it should look like (keep in mind that this is the Bracken output!):

Metagenomics


Takeaway

Congratulations! You just have performed a taxonomic annotation and abundance estimation of your metagenomics sample.

What’s next?

Learn how to modify the workflow to process multiple samples in parallel, create a report with different diversity measurements, execute customized scripts, and use Nextflow operators and conditionals to control the workflow.