Test Project

Read Quality

Trimming statistics with fastp

For quality and adapter trimming, the tool fastp was used. The summary statistics are shown in the tables below.

Overall statistics of the read trimming step.
Statistics	Value
passed_filter_reads	5’575’612
low_quality_reads	6’876
too_many_N_reads	430
too_short_reads	436’440
too_long_reads	0
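
The categories in the table above should add up to the number of reads that entered the trimming step. A minimal Python sketch of that check follows, using only the values reported above. The helper `parse_count` is a hypothetical name introduced here to strip the apostrophe thousands separators used in this report; it is not part of fastp.

```python
def parse_count(text):
    """Convert a count such as "5’575’612" to an int by dropping separators."""
    return int(text.replace("’", "").replace("'", ""))


# Values copied from the table above.
filtering = {
    "passed_filter_reads": "5’575’612",
    "low_quality_reads": "6’876",
    "too_many_N_reads": "430",
    "too_short_reads": "436’440",
    "too_long_reads": "0",
}

counts = {name: parse_count(value) for name, value in filtering.items()}

# Every input read falls into exactly one category, so the sum is the input size.
total_reads = sum(counts.values())
passed_fraction = counts["passed_filter_reads"] / total_reads

print(total_reads)
print(round(passed_fraction, 4))
```

The sum equals the total_reads value reported for before_filtering, and roughly 92.6 % of the reads pass the filter.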

Read statistics before and after trimming.
status	total_reads	total_bases	q20_bases	q30_bases	q20_rate	q30_rate	read1_mean_length	read2_mean_length	gc_content
before_filtering	6’019’358	904’382’415	855’006’881	792’066’379	0.95	0.88	150	150	0.35
after_filtering	5’575’612	740’611’159	730’295’883	700’855’690	0.99	0.95	143	121	0.35
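
The q20_rate and q30_rate columns are the fraction of bases with a Phred quality of at least 20 or 30, that is, q20_bases or q30_bases divided by total_bases. The sketch below recomputes the rates from the base counts in the table above as a sanity check. The dictionary layout and the function name `quality_rate` are illustrative assumptions, not fastp output.

```python
# Base counts copied from the table above (thousands separators removed).
rows = {
    "before_filtering": {
        "total_bases": 904_382_415,
        "q20_bases": 855_006_881,
        "q30_bases": 792_066_379,
    },
    "after_filtering": {
        "total_bases": 740_611_159,
        "q20_bases": 730_295_883,
        "q30_bases": 700_855_690,
    },
}


def quality_rate(row, key):
    """Fraction of all bases that reach the quality threshold named by key."""
    return row[key] / row["total_bases"]


rates = {
    status: {
        "q20_rate": round(quality_rate(row, "q20_bases"), 2),
        "q30_rate": round(quality_rate(row, "q30_bases"), 2),
    }
    for status, row in rows.items()
}

for status, values in rates.items():
    print(status, values["q20_rate"], values["q30_rate"])
```

Rounded to two decimals, the recomputed rates match the q20_rate and q30_rate columns of the table.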

FastQC

FastQC was used to check the quality of the reads after filtering. The tool checks, for example, whether platform-specific adapter sequences were removed.

Results from FastQC. Deduplicated is the percentage of reads that would remain if duplicate reads were removed. Duplicates occur when the exact same fragment has been sequenced more than once.
Sample	Reads	Deduplicated	Adapter	direction
FAM24235-i1-2_1	2’787’806	75.04	pass	forward
FAM24235-i1-2_2	2’787’806	81.91	pass	reverse

Assembly Statistics with Quast

The assembly is evaluated using Quast. The tool computes several statistics that help to judge whether the assembly was successful.

The following metrics are particularly important:

Largest Contig: Length of the largest contig in the assembly.
N50: To compute this metric, the contigs are sorted by size, largest first, and their lengths are summed until the running sum reaches at least half of the total assembly size. The N50 is the length of the last (smallest) contig that was added. Example: for an assembly with 5 contigs of lengths 9, 6, 5, 4 and 3, the total assembly size is 27 and the N50 cutoff is 13.5. The N50 of this assembly is 6, because 9 + 6 = 15 > 13.5.
L50: The number of contigs that are equal to or larger than the N50 contig, i.e. the number of contigs that had to be summed to reach the cutoff. In the example above, the L50 is 2.
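
The N50 and L50 definitions above can be sketched in a few lines of Python. This is an illustrative implementation, not the code Quast runs; the function name `n50_l50` is hypothetical. It stops as soon as the running sum reaches at least half of the total length, which differs from a strict "larger than" only when the sum lands exactly on the halfway point.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)  # largest contig first
    half_total = sum(lengths) / 2
    running_sum = 0
    for contig_count, length in enumerate(lengths, start=1):
        running_sum += length
        if running_sum >= half_total:
            # length is the N50; contig_count contigs were needed, which is the L50
            return length, contig_count
    raise ValueError("contig_lengths must not be empty")


# Worked example from the text: contigs of length 9, 6, 5, 4 and 3.
n50, l50 = n50_l50([9, 6, 5, 4, 3])
print(n50, l50)  # 6 2
```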

See the Quast documentation for further explanations of Quast metrics and plot descriptions.

Some basic metrics evaluated by Quast.
Assembly	FAM24235.i1.2_assembly
# contigs (>= 0 bp)	132
# contigs (>= 1000 bp)	91
# contigs (>= 5000 bp)	69
# contigs (>= 10000 bp)	57
# contigs (>= 25000 bp)	40
# contigs (>= 50000 bp)	21
Total length (>= 0 bp)	2’827’863
Total length (>= 1000 bp)	2’810’516
Total length (>= 5000 bp)	2’752’415
Total length (>= 10000 bp)	2’666’577
Total length (>= 25000 bp)	2’389’702
Total length (>= 50000 bp)	1’700’956
# contigs	132
Largest contig	163’612
Total length	2’827’863
GC (%)	35.01
N50	60’351
N75	36’552
L50	16
L75	31
# N’s per 100 kbp	3.54

BUSCO

From the BUSCO website:

“BUSCO attempts to provide a quantitative assessment of the completeness in terms of expected gene content of a genome assembly, transcriptome, or annotated gene set. The results are simplified into categories of Complete and single-copy, Complete and duplicated, Fragmented, or Missing BUSCOs.

BUSCO completeness results make sense only in the context of the biology of your organism. You have to understand whether missing or duplicated genes are of biological or technical origin. For instance, a high level of duplication may be explained by a recent whole duplication event (biological) or a chimeric assembly of haplotypes (technical). Transcriptomes and protein sets that are not filtered for isoforms will lead to a high proportion of duplicates. Therefore you should filter them before a BUSCO analysis. Finally, focusing on specific tissues or specific life stages and conditions in a transcriptomic experiment is unlikely to produce a BUSCO-complete transcriptome. In this case, consistency across your samples is what you will be aiming for.”

Genome assembly assessment using BUSCO4. The figure shows how many expected genes are complete and single-copy, complete and duplicated, fragmented, or missing.

confindr

confindr analyses genes that are known to be single-copy and conserved across all bacteria and checks whether more than one allele is present in any of them. The presence of multiple alleles may indicate intra-species contamination. The tool works on the raw reads, so the results are given for forward and reverse reads independently.

confindr results.
Sample	Genus	NumContamSNVs	ContamStatus	PercentContam	PercentContamStandardDeviation	BasesExamined	DatabaseDownloadDate
FAM24235-i1-2_2	ND	0	False	0	0	20’757	2020-6-5

Annotation using Prokka

Prokka is used to annotate the assembly with useful features, such as CDS, genes, or rRNAs.

Summary statistics of the annotation with Prokka.
Feature	Count
contigs	132
bases	2’827’863
CDS	2’775
rRNA	5
tRNA	59
tmRNA	1

Taxonomy

To obtain a taxonomic classification of the assembly, the tool GTDB-Tk compares the assembly with a reference database. The tool infers the taxonomy by computing the Average Nucleotide Identity (ANI) against the reference genomes and by placing the genome into a phylogenetic tree.

Taxonomic placement of the assembly. The classification method is either ANI/placement, placement or ANI. ANI/placement means that both methods agree on the taxonomic output. Otherwise the taxonomy of the better match is returned.
Sample	Domain	Phylum	Class	Order	Family	Genus	Species	classification_method
FAM24235-i1-2	Bacteria	Firmicutes	Bacilli	Lactobacillales	Carnobacteriaceae	Marinilactibacillus	Marinilactibacillus psychrotolerans	ANI/Placement

Software Versions

Software	Version
fastp	0.20.1
FastQC	0.11.7
Quast	4.6.0
ConFindr	0.7.2

Config

{
  "Project_Title": "Test Project",
  "author": "Simone Oberhaensli",
  "email": "simone.oberhaensli@bioinformatics.unibe.ch",
  "institute": "Interfaculty Bioinformatics Unit, University of Bern",
  "DataFolder" : "/data/projects/p446_Dialact_Phoenix/2_analyses/A_illumina/202204_reassemblies_emmanuelle/test/genome-assembly-pipeline/src/data/",
  "DataFolder_testing": "/data/datasets/E_coli_testdata/genome_assembly_test/",
  "extension" : ".fastq.gz",
  "mates" : {
    "mate1" : "_1",
    "mate2" : "_2"
    },
  "SampleSheet" : "sample_sheet.tsv",
  "Mode" : "Illumina",
  "fastp" : {
    "fastp_version" : "0.20.1",
    "fastp_threads" : 10,
    "fastp_hours": 4,
    "fastp_mem_mb": 5000
  },
  "fastqc" : {
    "fastqc_version" : "0.11.7",
    "fastqc_threads": 5,
    "fastqc_hours": 4,
    "fastqc_mem_mb": 4000
  },
  "spades" : {
    "spades_hours": 72,
    "spades_mem_mb": 30000,
    "spades_mem_gb":28,
    "spades_threads": 10,
    "spades_min_scaffold_length": 200
  },
  "quast" : {
    "quast_version" : "4.6.0",
    "quast_threads": 10,
    "quast_hours": 4,
    "quast_mem_mb": 5000
  },
  "busco" : {
    "busco_threads": 5,
    "busco_hours": 3,
    "busco_mem_mb": 20000
  },
  "confindr" : {
    "confindr_threads": 5,
    "confindr_hours": 4,
    "confindr_mem_mb": 30000
  },
  "prokka" : {
    "prokka_threads": 10,
    "prokka_hours": 8,
    "prokka_mem_mb": 5000
  },
  "gtdb" : {
    "gtdb_threads": 10,
    "gtdb_hours": 4,
    "gtdb_mem_mb": 300000
  },
  "short_sh_commands_threads": 16,
  "short_sh_commands_hours": 1,
  "short_commands_mb": 2000,
  "testing" : {
    "testing_threads": 8,
    "testing_hours": 1,
    "testing_mem_mb": 16000
  },
  "eof": "true"
}
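
Most tool sections in the config follow the same naming pattern: `<tool>_threads`, `<tool>_hours` and `<tool>_mem_mb`. The sketch below shows one way such a section could be read in Python. How the pipeline itself consumes the config is not shown in this report, so the function `tool_resources` is a hypothetical helper, and the embedded text is a shortened excerpt of the config above. Python's `json` module rejects a comma directly before a closing brace, so the excerpt is written without one.

```python
import json

# Shortened excerpt of the config above, containing two tool sections.
config_text = """
{
  "fastp": {"fastp_version": "0.20.1", "fastp_threads": 10,
            "fastp_hours": 4, "fastp_mem_mb": 5000},
  "quast": {"quast_version": "4.6.0", "quast_threads": 10,
            "quast_hours": 4, "quast_mem_mb": 5000}
}
"""


def tool_resources(config, tool):
    """Collect the resource settings of one tool section under generic keys."""
    section = config[tool]
    return {
        "threads": section[f"{tool}_threads"],
        "hours": section[f"{tool}_hours"],
        "mem_mb": section[f"{tool}_mem_mb"],
    }


config = json.loads(config_text)
fastp_resources = tool_resources(config, "fastp")
print(fastp_resources)
```

A section that is missing one of the three keys raises a `KeyError`, which surfaces a misnamed setting early.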