Assembly of Sample FAM24235-i1-2

Contact: simone.oberhaensli@bioinformatics.unibe.ch

Institute: Interfaculty Bioinformatics Unit, University of Bern

Read Quality

Trimming statistics with fastp

For quality and adapter trimming, the tool fastp was used. The summary statistics are shown in the tables below.


Overall statistics of the read trimming step.
Statistics Value
passed_filter_reads 5’575’612
low_quality_reads 6’876
too_many_N_reads 430
too_short_reads 436’440
too_long_reads 0


Read statistics before and after trimming.
status total_reads total_bases q20_bases q30_bases q20_rate q30_rate read1_mean_length read2_mean_length gc_content
before_filtering 6’019’358 904’382’415 855’006’881 792’066’379 0.95 0.88 150 150 0.35
after_filtering 5’575’612 740’611’159 730’295’883 700’855’690 0.99 0.95 143 121 0.35
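
The two tables above can be cross-checked with simple arithmetic. The filter categories of the first table should add up to the number of reads before filtering, and each rate in the second table should equal the corresponding base count divided by the total bases. The following is a minimal Python sketch with the values copied from the tables; the variable and function names are illustrative and not part of fastp.

```python
# Values copied from the fastp tables above.
filter_counts = {
    "passed_filter_reads": 5_575_612,
    "low_quality_reads": 6_876,
    "too_many_N_reads": 430,
    "too_short_reads": 436_440,
    "too_long_reads": 0,
}
before = {"total_reads": 6_019_358, "total_bases": 904_382_415,
          "q20_bases": 855_006_881, "q30_bases": 792_066_379}
after = {"total_reads": 5_575_612, "total_bases": 740_611_159,
         "q20_bases": 730_295_883, "q30_bases": 700_855_690}


def rate(stats, key):
    """Share of bases at or above a quality threshold, rounded to two decimals."""
    return round(stats[key] / stats["total_bases"], 2)


# Every input read should fall into exactly one filter category.
reads_accounted = sum(filter_counts.values())
print(reads_accounted == before["total_reads"])

# Recompute q20_rate and q30_rate from the base counts.
print(rate(before, "q20_bases"), rate(before, "q30_bases"))
print(rate(after, "q20_bases"), rate(after, "q30_bases"))
```

With the numbers reported here, the categories sum to the 6’019’358 reads seen before filtering, and the recomputed rates match the q20_rate and q30_rate columns.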

FastQC

FastQC was used to check the quality of the reads after filtering. For example, the tool checks whether platform-specific adapter sequences were removed.

Results from FastQC. Deduplicated is the percentage of reads remaining if duplicate reads are removed. Duplicate reads occur when the exact same fragment has been sequenced more than once.
Sample Reads Deduplicated (%) Adapter direction
FAM24235-i1-2_1 2’787’806 75.04 pass forward
FAM24235-i1-2_2 2’787’806 81.91 pass reverse

Assembly Statistics with Quast

The assembly is evaluated using Quast. The tool computes several statistics that help to judge whether the assembly was successful.

Particularly the following metrics are important:

  • Largest Contig: Length of the longest contig in the assembly.

  • N50: To compute this metric, the contigs are first sorted by size in descending order. Their sizes are then summed, starting with the largest, until the running sum reaches at least half of the total assembly size. The N50 is the length of the last, and therefore smallest, contig added to reach that sum. Example: Given an assembly with 5 contigs of length 9, 6, 5, 4 and 3, the total assembly size is 27 and the N50 cutoff is 13.5. The N50 of this assembly is therefore 6, because 9 + 6 = 15 ≥ 13.5, whereas 9 alone is not enough.

  • L50: The number of contigs that are equal to or larger than the N50 contig, i.e. the smallest number of contigs needed to cover half of the assembly. In the example above, the L50 value is 2.

See here for further explanations of quast metrics and plot descriptions.
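
The N50 and L50 definitions above can be expressed in a few lines of Python. This is a sketch that follows the description in this report, not Quast's actual implementation. The function name `n50_l50` is made up for illustration, and the threshold is taken as "at least half of the total length", which gives the same result for the example.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    # Sort from largest to smallest contig.
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2

    running_sum = 0
    for count, length in enumerate(lengths, start=1):
        running_sum += length
        # The first contig that pushes the sum to half the assembly size
        # defines N50 (its length) and L50 (how many contigs were needed).
        if running_sum >= half_total:
            return length, count

    raise ValueError("contig_lengths must not be empty")


# Example from the text: contigs of length 9, 6, 5, 4 and 3.
print(n50_l50([9, 6, 5, 4, 3]))  # (6, 2)
```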


Some basic metrics evaluated by Quast.
Assembly FAM24235.i1.2_assembly
# contigs (>= 0 bp) 132
# contigs (>= 1000 bp) 91
# contigs (>= 5000 bp) 69
# contigs (>= 10000 bp) 57
# contigs (>= 25000 bp) 40
# contigs (>= 50000 bp) 21
Total length (>= 0 bp) 2’827’863
Total length (>= 1000 bp) 2’810’516
Total length (>= 5000 bp) 2’752’415
Total length (>= 10000 bp) 2’666’577
Total length (>= 25000 bp) 2’389’702
Total length (>= 50000 bp) 1’700’956
# contigs 132
Largest contig 163’612
Total length 2’827’863
GC (%) 35.01
N50 60’351
N75 36’552
L50 16
L75 31
# N’s per 100 kbp 3.54


BUSCO

From the BUSCO website:

“BUSCO attempts to provide a quantitative assessment of the completeness in terms of expected gene content of a genome assembly, transcriptome, or annotated gene set. The results are simplified into categories of Complete and single-copy, Complete and duplicated, Fragmented, or Missing BUSCOs.

BUSCO completeness results make sense only in the context of the biology of your organism. You have to understand whether missing or duplicated genes are of biological or technical origin. For instance, a high level of duplication may be explained by a recent whole duplication event (biological) or a chimeric assembly of haplotypes (technical). Transcriptomes and protein sets that are not filtered for isoforms will lead to a high proportion of duplicates. Therefore you should filter them before a BUSCO analysis. Finally, focusing on specific tissues or specific life stages and conditions in a transcriptomic experiment is unlikely to produce a BUSCO-complete transcriptome. In this case, consistency across your samples is what you will be aiming for.”


Genome assembly assessment using BUSCO4. The figure shows how many expected genes are complete and single-copy, complete and duplicated, fragmented, or missing.


confindr

ConFindr analyses genes that are known to be single-copy and conserved across all bacteria and evaluates whether more than one allele is present in any of the tested genes. Multiple alleles may indicate intra-species contamination. Because the tool works on the raw reads, the results are given for forward and reverse reads independently.


confindr results.
Sample Genus NumContamSNVs ContamStatus PercentContam PercentContamStandardDeviation BasesExamined DatabaseDownloadDate
FAM24235-i1-2_2 ND 0 False 0 0 20’757 2020-6-5

Annotation using Prokka

Prokka is used to annotate the assembly with useful features, such as CDS, genes, or rRNAs.


Summary statistics of the annotation with Prokka.
Feature Count
contigs 132
bases 2’827’863
CDS 2’775
rRNA 5
tRNA 59
tmRNA 1

Taxonomy

To obtain a taxonomic classification of the assembly, the tool GTDB-Tk compares the assembly with a reference database. The tool infers the taxonomy by computing the Average Nucleotide Identity (ANI) against the reference database and by placing the genome into a phylogenetic tree.


Taxonomic placement of the assembly. The classification method is either ANI/placement, placement or ANI. ANI/placement means that both methods agree on the taxonomic output. Otherwise the taxonomy of the better match is returned.
Sample Domain Phylum Class Order Family Genus Species classification_method
FAM24235-i1-2 d__Bacteria Firmicutes Bacilli Lactobacillales Carnobacteriaceae Marinilactibacillus Marinilactibacillus psychrotolerans ANI/Placement

Software Versions

fastp 0.20.1
FastQC 0.11.7
Quast 4.6.0
ConFindr 0.7.2

Config

{
  "Project_Title": "Test Project",
  "author": "Simone Oberhaensli",
  "email": "simone.oberhaensli@bioinformatics.unibe.ch",
  "institute": "Interfaculty Bioinformatics Unit, University of Bern",
  "DataFolder" : "/data/projects/p446_Dialact_Phoenix/2_analyses/A_illumina/202204_reassemblies_emmanuelle/test/genome-assembly-pipeline/src/data/",
  "DataFolder_testing": "/data/datasets/E_coli_testdata/genome_assembly_test/",
  "extension" : ".fastq.gz",
  "mates" : {
    "mate1" : "_1",
    "mate2" : "_2"
  },
  "SampleSheet" : "sample_sheet.tsv",
  "Mode" : "Illumina",
  "fastp" : {
    "fastp_version" : "0.20.1",
    "fastp_threads" : 10,
    "fastp_hours": 4,
    "fastp_mem_mb": 5000
  },
  "fastqc" : {
    "fastqc_version" : "0.11.7",
    "fastqc_threads": 5,
    "fastqc_hours": 4,
    "fastqc_mem_mb": 4000
  },
  "spades" : {
    "spades_hours": 72,
    "spades_mem_mb": 30000,
    "spades_mem_gb":28,
    "spades_threads": 10,
    "spades_min_scaffold_length": 200
  },
  "quast" : {
    "quast_version" : "4.6.0",
    "quast_threads": 10,
    "quast_hours": 4,
    "quast_mem_mb": 5000
  },
  "busco" : {
    "busco_threads": 5,
    "busco_hours": 3,
    "busco_mem_mb": 20000
  },
  "confindr" : {
    "confindr_threads": 5,
    "confindr_hours": 4,
    "confindr_mem_mb": 30000
  },
  "prokka" : {
    "prokka_threads": 10,
    "prokka_hours": 8,
    "prokka_mem_mb": 5000
  },
  "gtdb" : {
    "gtdb_threads": 10,
    "gtdb_hours": 4,
    "gtdb_mem_mb": 300000
  },
  "short_sh_commands_threads": 16,
  "short_sh_commands_hours": 1,
  "short_commands_mb": 2000,
  "testing" : {
    "testing_threads": 8,
    "testing_hours": 1,
    "testing_mem_mb": 16000
  },
  "eof": "true"
}
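
The per-tool blocks of this config follow a common naming pattern (`<tool>_threads`, `<tool>_hours`, `<tool>_mem_mb`). As a hedged illustration of how such a config could be read, the Python sketch below parses a short excerpt with the same shape and collects the resource settings per tool. The helper `resource_summary` is hypothetical and not part of the pipeline, and only two of the tool blocks are reproduced.

```python
import json

# Excerpt in the same shape as the config above (two tool blocks only).
config_text = """
{
  "Mode": "Illumina",
  "fastp": {"fastp_version": "0.20.1", "fastp_threads": 10,
            "fastp_hours": 4, "fastp_mem_mb": 5000},
  "spades": {"spades_hours": 72, "spades_mem_mb": 30000,
             "spades_mem_gb": 28, "spades_threads": 10}
}
"""


def resource_summary(config):
    """Collect threads, hours and memory (MB) for every nested block."""
    summary = {}
    for tool, settings in config.items():
        if not isinstance(settings, dict):
            continue  # skip top-level scalars such as "Mode"
        # Keys that a block does not define are reported as None.
        summary[tool] = {
            "threads": settings.get(f"{tool}_threads"),
            "hours": settings.get(f"{tool}_hours"),
            "mem_mb": settings.get(f"{tool}_mem_mb"),
        }
    return summary


config = json.loads(config_text)
summary = resource_summary(config)
for tool, resources in summary.items():
    print(tool, resources)
```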