Output

The directories and output files listed below can be found inside the zip file. All paths are relative to the top-level results directory.

FastQC

FastQC (ver. 0.11.9) gives general quality metrics about your reads. It provides information about the quality score distribution across your reads and per base sequence content (%T/A/G/C). You get information about adapter contamination and other overrepresented sequences.

For further reading and documentation see the FastQC help.

The FastQC plots displayed in the MultiQC report show untrimmed reads. They may contain adapter sequences and regions of low quality.

Output directory: <RESULTS>/<SAMPLE>/1-fastqc/

<SAMPLE>_fastqc.html
FastQC report containing quality metrics for your untrimmed raw FASTQ files.

zips/<SAMPLE>_fastqc.zip
Zip file containing the FastQC report, tab-delimited data file and plot images.
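As a quick way to check the per-module results without opening the HTML report, the tab-delimited data file inside the zip can be read directly. The sketch below assumes the usual FastQC layout, in which the archive holds a `fastqc_data.txt` member and each module starts with a line of the form `>>Module name<TAB>status` and ends with `>>END_MODULE`. The function names are illustrative and are not part of the pipeline.

```python
import zipfile


def read_fastqc_data(zip_path):
    """Return the text of fastqc_data.txt from a FastQC zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: exactly one member is named fastqc_data.txt.
        member = next(n for n in archive.namelist() if n.endswith("fastqc_data.txt"))
        return archive.read(member).decode("utf-8")


def module_statuses(fastqc_data_text):
    """Map each FastQC module name to its status (pass, warn or fail)."""
    statuses = {}
    for line in fastqc_data_text.splitlines():
        # Module headers start with ">>"; ">>END_MODULE" closes a module.
        if line.startswith(">>") and line != ">>END_MODULE":
            name, _, status = line[2:].partition("\t")
            statuses[name] = status
    return statuses
```

For example, `module_statuses(read_fastqc_data("zips/<SAMPLE>_fastqc.zip"))` would return a dictionary keyed by module name, with `<SAMPLE>` replaced by the actual sample name.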

Fastp

The Flomics/SARSCoV2 pipeline uses Fastp to remove adapter contamination and trim low-quality regions. FastQC is run again on the trimmed reads after Fastp finishes.

In the General Statistics table, MultiQC reports the number of reads that pass the Fastp filter, along with the number of reads that fail it and the reason for the failure.

Output directory: <RESULTS>/<SAMPLE>/2-fastp/

<SAMPLE>.fastp.html
Fastp report, containing all the information about read trimming.

fastqc/zips/<SAMPLE>_fastqc.zip
Zip file containing the FastQC report after trimming.

fastqc/<SAMPLE>_trim_fastqc.html
FastQC report, containing quality metrics for your trimmed reads.

logs/<SAMPLE>.fastqc.log
Log files for fastp.

Kraken2

Kraken2 is a sequence classifier that assigns taxonomic labels to DNA sequences. Kraken examines the k-mers within a query sequence and uses the information within those k-mers to query a database. That database maps k-mers to the lowest common ancestor (LCA) of all genomes known to contain a given k-mer.
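The k-mer to LCA idea described above can be illustrated with a toy example. This is a deliberately simplified sketch, not Kraken2's actual implementation (which uses minimizers and a compact hash table): the taxonomy, the function names and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy taxonomy: child -> parent (the root has parent None).
PARENT = {
    "root": None,
    "Viruses": "root",
    "Betacoronavirus": "Viruses",
    "SARS-CoV-2": "Betacoronavirus",
    "MERS-CoV": "Betacoronavirus",
}


def lineage(taxon):
    """Return the path from a taxon up to the root, taxon first."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lca(a, b):
    """Lowest common ancestor of two taxa in the toy taxonomy."""
    ancestors = set(lineage(a))
    for taxon in lineage(b):
        if taxon in ancestors:
            return taxon
    raise ValueError("taxa share no ancestor")


def build_db(genomes, k):
    """Map every k-mer to the LCA of all genomes that contain it."""
    db = {}
    for taxon, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            db[kmer] = lca(db[kmer], taxon) if kmer in db else taxon
    return db


def classify(read, db, k):
    """Pick the taxon whose root-to-taxon path collects the most k-mer hits."""
    kmers = (read[i:i + k] for i in range(len(read) - k + 1))
    hits = Counter(db[kmer] for kmer in kmers if kmer in db)
    if not hits:
        return "unclassified"

    def path_score(taxon):
        return sum(hits[t] for t in lineage(taxon))

    # Assumed tie-break: prefer the deeper (more specific) taxon.
    return max(hits, key=lambda t: (path_score(t), len(lineage(t))))
```

A k-mer shared by two genomes is stored under their common ancestor, so a read made only of shared k-mers is assigned at that higher rank, while a read with genome-specific k-mers is assigned to the species.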

This workflow uses a Kraken2 database to classify reads by species, which makes it possible to identify SARS-CoV-2 reads and determine whether a sample is positive or negative. The pipeline filters the Kraken2 reports to obtain information about unclassified reads and reads assigned to a species. It then creates pie charts and tables summarizing those results at species level. Pie charts include the 10 most abundant species, human and unclassified reads. Any remaining species are grouped as Other. If SARS-CoV-2 is present in the sample but not in the top 10, it is also included in the pie chart.
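The grouping rule for the pie chart can be sketched as follows. This is an illustration of the rule as described here, not the pipeline's own code: the function name, the label strings (`Homo sapiens`, `unclassified`, the full SARS-CoV-2 species name) and the choice to rank only non-human species are assumptions.

```python
def piechart_slices(read_counts, top_n=10,
                    always_shown=("Homo sapiens", "unclassified"),
                    target="Severe acute respiratory syndrome coronavirus 2"):
    """Group per-species read counts into pie chart slices."""
    # Human and unclassified reads always get their own slice.
    special = {name: read_counts[name] for name in always_shown if name in read_counts}
    species = {name: n for name, n in read_counts.items() if name not in special}

    # Keep the top_n most abundant remaining species.
    ranked = sorted(species.items(), key=lambda item: item[1], reverse=True)
    kept = dict(ranked[:top_n])

    # The target species is shown even when it falls outside the top_n.
    if target in species and target not in kept:
        kept[target] = species[target]

    # Everything else is pooled into a single "Other" slice.
    other = sum(n for name, n in species.items() if name not in kept)
    slices = {**kept, **special}
    if other > 0:
        slices["Other"] = other
    return slices
```

Because every read count ends up in exactly one slice, the slice totals always add up to the total number of reads passed in.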

Output directory: <RESULTS>/<SAMPLE>/3-kraken2/

figures/<SAMPLE>.piechart.png
Per-sample pie chart showing the species-level composition of viral and human reads (for the 10 most abundant viruses).

tables/<SAMPLE>.table.png
Per-sample table of the species-level composition of viral and human reads (all viruses present).

reports/<SAMPLE>.kraken2.report.png
Kraken2 report of classified and unclassified reads.

Bowtie 2

Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. Bowtie 2 supports gapped, local, and paired-end alignment modes.

Output directory: <RESULTS>/<SAMPLE>/4-bam/

logs/<SAMPLE>.bowtie2.log/
Bowtie 2 mapping log file.

SAMtools

Bowtie 2 BAM files are further processed with SAMtools to coordinate-sort and index the alignments, and to generate read mapping statistics.

Output directory: <RESULTS>/<SAMPLE>/4-bam/

bam/<SAMPLE>.sorted.bam
Coordinate-sorted BAM file of the Bowtie 2 alignments.

bai/<SAMPLE>.sorted.bam.bai
Index file for the sorted BAM file.

samtools_stats/<SAMPLE>.sorted.bam.flagstat, samtools_stats/<SAMPLE>.sorted.bam.idxstats and samtools_stats/<SAMPLE>.sorted.bam.stats
Read mapping statistics generated by SAMtools (flagstat, idxstats and stats) from the sorted BAM file.
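The idxstats file is convenient for a quick mapping-rate check. The sketch below assumes the standard `samtools idxstats` layout of four tab-separated columns (reference name, reference length, mapped read-segments, unmapped read-segments), with a final `*` row for reads that have no reference. The function names are illustrative only.

```python
def parse_idxstats(text):
    """Parse samtools idxstats output into a list of per-reference records."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, length, mapped, unmapped = line.split("\t")
        rows.append({
            "ref": name,
            "length": int(length),
            "mapped": int(mapped),
            "unmapped": int(unmapped),
        })
    return rows


def mapped_fraction(rows):
    """Fraction of read-segments that mapped to any reference."""
    mapped = sum(r["mapped"] for r in rows)
    total = mapped + sum(r["unmapped"] for r in rows)
    return mapped / total if total else 0.0
```

The text to parse would come from `samtools_stats/<SAMPLE>.sorted.bam.idxstats`, read with an ordinary `open(...).read()`.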

Qualimap

Qualimap is a standalone package written in Java. It calculates read alignment assignment, transcript coverage, read genomic origin, junction analysis and 3’-5’ bias.

Output directory: <RESULTS>/<SAMPLE>/5-qualimap/<SAMPLE>.sorted_stats

iVar trim

iVar is used to trim amplicon primer sequences from the aligned reads.

Output directory: <RESULTS>/<SAMPLE>/4-bam/

bam/<SAMPLE>.sorted.trim.bam
Sorted BAM file after primer trimming with iVar.

bai/<SAMPLE>.trim.sorted.bam.bai
Index file for the primer-trimmed sorted BAM file.

iVar variants

iVar is used again to call variants.

Output directory: <RESULTS>/<SAMPLE>/5-variants/

<SAMPLE>.tsv
Variants in TSV format that PASS all filters.

<SAMPLE>.modified.tsv
TSV file containing additional information regarding the variants.

<SAMPLE>.pass.vcf.gz
Variants that PASS all filters, in a compressed file compliant with the VCF file specifications.

<SAMPLE>.pass.vcf.gz.tbi
Tabix index for the VCF file of variants that PASS all filters.

<SAMPLE>.all.vcf.gz
Variants that PASS or FAIL the filters, in a compressed file compliant with the VCF file specifications. Includes the variants whose allele frequency is below the threshold.

logs/<SAMPLE>.variant.counts.log
Variant counts for variants that PASSED all filters.

bcftools_stats/<SAMPLE>.pass.bcftools_stats.txt
Statistics and counts obtained from the low-frequency variants VCF file.
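The relationship between the PASS and FAIL sets listed above can be sketched as a simple split on an iVar-style TSV. The column names `PASS`, `ALT_FREQ` and `POS` follow the header that iVar writes for its variants table; whether the pipeline's TSV keeps exactly these columns is an assumption, and `min_freq` is a hypothetical parameter standing in for the allele-frequency threshold, whose actual value is not stated here.

```python
import csv
import io


def split_variants(tsv_text, min_freq):
    """Split variant rows into (passed, failed) lists.

    A row passes when its PASS column is TRUE and its alternate allele
    frequency is at least min_freq (hypothetical threshold).
    """
    passed, failed = [], []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        ok = row["PASS"] == "TRUE" and float(row["ALT_FREQ"]) >= min_freq
        (passed if ok else failed).append(row)
    return passed, failed
```

Under this reading, the rows in `passed` correspond to the PASS files, and `passed` plus `failed` together correspond to the file holding all variants.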

SnpEff and SnpSift

SnpEff is a genetic variant annotation and functional effect prediction toolbox. It annotates and predicts the effects of genetic variants on genes and proteins (such as amino acid changes).

SnpSift annotates genomic variants using databases, and filters and manipulates annotated genomic variants. After annotation with SnpEff, you can use SnpSift to filter large genomic datasets and find the most significant variants.

Output directory: <RESULTS>/<SAMPLE>/8-snpeff/

<SAMPLE>.snpEff.csv
Variant annotation CSV file.

<SAMPLE>.snpEff.genes.txt
Gene table for annotated variants.

<SAMPLE>.snpEff.summary.html
Summary HTML file for variants.

<SAMPLE>.snpEff.vcf.gz
VCF file with variant annotations.

<SAMPLE>.snpEff.vcf.gz.tbi
Index for VCF file with variant annotations.

<SAMPLE>.snpSift.table.modified.txt
SnpSift summary table, with additional information.

<SAMPLE>.snpSift.table.txt
SnpSift summary table.

<SAMPLE>_mutations.tsv
TSV table containing the sample mutations that appear in our curated mutation database as mutations of interest or concern.

Freyja

Freyja is a tool to recover relative lineage abundances from mixed SARS-CoV-2 samples from a sequencing dataset (BAM aligned to the Hu-1 reference). The method uses lineage-determining mutational “barcodes” derived from the UShER global phylogenetic tree as a basis set to solve the constrained (unit sum, non-negative) de-mixing problem.

Output directory: <RESULTS>/<SAMPLE>/5-variants/

<SAMPLE>_demix
File containing the relative abundances of all the SARS-CoV-2 lineages found in the sample.
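The constrained de-mixing problem mentioned above can be written compactly. The notation is assumed for illustration and is not taken from the Freyja documentation: A is the barcode matrix (one row per mutation, one column per lineage), b holds the mutation frequencies observed in the sample, and x holds the unknown relative lineage abundances. The choice of norm and any weighting are deliberately left unspecified.

```latex
\hat{x} = \arg\min_{x} \; \lVert A x - b \rVert
\quad \text{subject to} \quad
x_i \ge 0 \ \ \forall i, \qquad \sum_i x_i = 1
```

The two constraints are what make the solution readable as abundances: no lineage can have a negative share, and the shares sum to one.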

Markdown report

We use markdown to generate a summary report for each sample.

Output directory: <RESULTS>/<SAMPLE>/9-report/

<SAMPLE>_report.html
Report with information about all steps for each sample. It covers the number of sequences, sequences trimmed, sequences mapped, coverage, variants detected and variant interpretation.

MultiQC

MultiQC (ver. 1.10.1) is a visualization tool that generates a single HTML report summarizing all QC information for all the samples in your project.

The pipeline includes dedicated steps that report the software versions used in the MultiQC output, for future traceability.

Output directory: <RESULTS>/multiqc/

general.report.html
MultiQC report - a standalone HTML file that can be viewed in a web browser.

Pipeline information

The pipeline also provides a table listing software used and their respective versions.

Output directory: <RESULTS>/pipeline_info/

software_versions.tsv