Contact: simone.oberhaensli@bioinformatics.unibe.ch
Institute: Interfaculty Bioinformatics Unit, University of Bern
For quality and adapter trimming, the tool fastp was used. The summary statistics are shown in the tables below.
Statistics | Value |
---|---|
passed_filter_reads | 5’575’612 |
low_quality_reads | 6’876 |
too_many_N_reads | 430 |
too_short_reads | 436’440 |
too_long_reads | 0 |
status | total_reads | total_bases | q20_bases | q30_bases | q20_rate | q30_rate | read1_mean_length | read2_mean_length | gc_content |
---|---|---|---|---|---|---|---|---|---|
before_filtering | 6’019’358 | 904’382’415 | 855’006’881 | 792’066’379 | 0.95 | 0.88 | 150 | 150 | 0.35 |
after_filtering | 5’575’612 | 740’611’159 | 730’295’883 | 700’855’690 | 0.99 | 0.95 | 143 | 121 | 0.35 |
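The two fastp tables can be cross-checked against each other. The following is a minimal sketch, with the numbers copied by hand from the tables above; the variable names are illustrative and not part of the fastp output format. It checks that the filter categories add up to the number of input reads and reports the share of reads and bases that survived trimming.

```python
# Values copied from the fastp summary tables above.
filter_result = {
    "passed_filter_reads": 5_575_612,
    "low_quality_reads": 6_876,
    "too_many_N_reads": 430,
    "too_short_reads": 436_440,
    "too_long_reads": 0,
}
total_reads_before = 6_019_358
total_bases_before = 904_382_415
total_bases_after = 740_611_159

# Every input read should fall into exactly one filter category.
assert sum(filter_result.values()) == total_reads_before

read_retention = filter_result["passed_filter_reads"] / total_reads_before
base_retention = total_bases_after / total_bases_before

print(f"reads retained: {read_retention:.1%}")  # reads retained: 92.6%
print(f"bases retained: {base_retention:.1%}")  # bases retained: 81.9%
```

Bases are lost at a higher rate than reads because trimming also shortens the reads that pass the filter, which is visible in the lower mean read lengths after filtering.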
FastQC was used to check the quality of the reads after filtering. The tool checks, for example, whether platform-specific adapter sequences were removed.
Sample | Reads | Deduplicated | Adapter | direction |
---|---|---|---|---|
FAM24235-i1-2_1 | 2’787’806 | 75.04 | pass | forward |
FAM24235-i1-2_2 | 2’787’806 | 81.91 | pass | reverse |
The assembly is evaluated using QUAST. The tool computes several statistics that help to judge whether the assembly was successful.
The following metrics are particularly important:
Largest Contig: Length of the largest of the assembled contigs.
N50: To compute this metric, the contigs are first sorted by size in descending order. Their lengths are then summed, starting with the largest, until the running sum reaches at least half of the total assembly size. The N50 is the length of the last (smallest) contig that had to be added to reach that sum. Example: given an assembly with 5 contigs of length 9, 6, 5, 4 and 3, the total assembly size is 27 and the N50 cutoff is 13.5. The N50 of this assembly is 6, because 9 alone stays below 13.5 but 9 + 6 = 15 exceeds it.
L50: The smallest number of contigs whose combined length covers at least half of the assembly, i.e. the number of contigs of length equal to or larger than the N50. In the example above, the L50 is 2.
See the QUAST manual for further explanations of QUAST metrics and plot descriptions.
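The N50 and L50 definitions can be sketched in a few lines of Python. This is an illustration of the definitions only, not QUAST's own implementation; the function name `n50_l50` is hypothetical, and the sketch uses the "at least half of the total length" convention.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    ordered = sorted(contig_lengths, reverse=True)  # largest contig first
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            # `length` is the smallest contig needed, `count` how many were used.
            return length, count
    raise ValueError("contig_lengths must not be empty")


print(n50_l50([9, 6, 5, 4, 3]))  # (6, 2)
```

Applied to the worked example, the running sum is 9 after the first contig and 15 after the second, so the function stops at the contig of length 6 and reports an L50 of 2.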
Assembly | FAM24235.i1.2_assembly |
---|---|
# contigs (>= 0 bp) | 132 |
# contigs (>= 1000 bp) | 91 |
# contigs (>= 5000 bp) | 69 |
# contigs (>= 10000 bp) | 57 |
# contigs (>= 25000 bp) | 40 |
# contigs (>= 50000 bp) | 21 |
Total length (>= 0 bp) | 2’827’863 |
Total length (>= 1000 bp) | 2’810’516 |
Total length (>= 5000 bp) | 2’752’415 |
Total length (>= 10000 bp) | 2’666’577 |
Total length (>= 25000 bp) | 2’389’702 |
Total length (>= 50000 bp) | 1’700’956 |
# contigs | 132 |
Largest contig | 163’612 |
Total length | 2’827’863 |
GC (%) | 35.01 |
N50 | 60’351 |
N75 | 36’552 |
L50 | 16 |
L75 | 31 |
# N’s per 100 kbp | 3.54 |
From the BUSCO website:
“BUSCO attempts to provide a quantitative assessment of the completeness in terms of expected gene content of a genome assembly, transcriptome, or annotated gene set. The results are simplified into categories of Complete and single-copy, Complete and duplicated, Fragmented, or Missing BUSCOs.
BUSCO completeness results make sense only in the context of the biology of your organism. You have to understand whether missing or duplicated genes are of biological or technical origin. For instance, a high level of duplication may be explained by a recent whole duplication event (biological) or a chimeric assembly of haplotypes (technical). Transcriptomes and protein sets that are not filtered for isoforms will lead to a high proportion of duplicates. Therefore you should filter them before a BUSCO analysis. Finally, focusing on specific tissues or specific life stages and conditions in a transcriptomic experiment is unlikely to produce a BUSCO-complete transcriptome. In this case, consistency across your samples is what you will be aiming for.”
Genome assembly assessment using BUSCO 4. The figure shows how many of the expected genes are complete and single-copy, complete and duplicated, fragmented, or missing.
ConFindr analyses genes that are known to be single-copy and conserved across all bacteria and evaluates whether more than one allele is present in the tested genes. The presence of multiple alleles may indicate intra-species contamination. Because the tool evaluates the raw reads, the results are given for forward and reverse reads independently.
Sample | Genus | NumContamSNVs | ContamStatus | PercentContam | PercentContamStandardDeviation | BasesExamined | DatabaseDownloadDate |
---|---|---|---|---|---|---|---|
FAM24235-i1-2_2 | ND | 0 | False | 0 | 0 | 20’757 | 2020-6-5 |
Prokka is used to annotate the assembly with useful features, such as CDS, genes, or rRNAs.
Feature | Count |
---|---|
contigs | 132 |
bases | 2’827’863 |
CDS | 2’775 |
rRNA | 5 |
tRNA | 59 |
tmRNA | 1 |
To obtain a taxonomic classification of the assembly, the tool GTDB-Tk compares the assembly with a reference database. The tool infers the taxonomy by computing the Average Nucleotide Identity (ANI) against the reference database and by placing the genome into a phylogenetic tree.
Sample | Domain | Phylum | Class | Order | Family | Genus | Species | classification_method |
---|---|---|---|---|---|---|---|---|
FAM24235-i1-2 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Carnobacteriaceae | Marinilactibacillus | Marinilactibacillus psychrotolerans | ANI/Placement |
Tool | Version |
---|---|
fastp | 0.20.1 |
FastQC | 0.11.7 |
QUAST | 4.6.0 |
ConFindr | 0.7.2 |
{
"Project_Title": "Test Project",
"author": "Simone Oberhaensli",
"email": "simone.oberhaensli@bioinformatics.unibe.ch",
"institute": "Interfaculty Bioinformatics Unit, University of Bern",
"DataFolder" : "/data/projects/p446_Dialact_Phoenix/2_analyses/A_illumina/202204_reassemblies_emmanuelle/test/genome-assembly-pipeline/src/data/",
"DataFolder_testing": "/data/datasets/E_coli_testdata/genome_assembly_test/",
"extension" : ".fastq.gz",
"mates" : {
"mate1" : "_1",
"mate2" : "_2"
},
"SampleSheet" : "sample_sheet.tsv",
"Mode" : "Illumina",
"fastp" : {
"fastp_version" : "0.20.1",
"fastp_threads" : 10,
"fastp_hours": 4,
"fastp_mem_mb": 5000
},
"fastqc" : {
"fastqc_version" : "0.11.7",
"fastqc_threads": 5,
"fastqc_hours": 4,
"fastqc_mem_mb": 4000
},
"spades" : {
"spades_hours": 72,
"spades_mem_mb": 30000,
"spades_mem_gb":28,
"spades_threads": 10,
"spades_min_scaffold_length": 200
},
"quast" : {
"quast_version" : "4.6.0",
"quast_threads": 10,
"quast_hours": 4,
"quast_mem_mb": 5000
},
"busco" : {
"busco_threads": 5,
"busco_hours": 3,
"busco_mem_mb": 20000
},
"confindr" : {
"confindr_threads": 5,
"confindr_hours": 4,
"confindr_mem_mb": 30000
},
"prokka" : {
"prokka_threads": 10,
"prokka_hours": 8,
"prokka_mem_mb": 5000
},
"gtdb" : {
"gtdb_threads": 10,
"gtdb_hours": 4,
"gtdb_mem_mb": 300000
},
"short_sh_commands_threads": 16,
"short_sh_commands_hours": 1,
"short_commands_mb": 2000,
"testing" : {
"testing_threads": 8,
"testing_hours": 1,
"testing_mem_mb": 16000
},
"eof": "true"
}
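The configuration follows a regular naming scheme in which each tool section holds `<tool>_threads`, `<tool>_hours` and `<tool>_mem_mb`. The sketch below shows how such settings could be read in Python; it is not taken from the pipeline itself. The helpers `tool_resources` and `read_file_names` are hypothetical, the embedded text is a shortened excerpt of the configuration, and building file names as sample name + mate suffix + extension is an assumption based on the sample names in the report. Note that Python's `json` module accepts strict JSON only, so a trailing comma before a closing brace would raise an error.

```python
import json

# Shortened excerpt of the configuration above, written as strict JSON.
config_text = """
{
  "extension": ".fastq.gz",
  "mates": {"mate1": "_1", "mate2": "_2"},
  "fastp": {"fastp_version": "0.20.1", "fastp_threads": 10,
            "fastp_hours": 4, "fastp_mem_mb": 5000},
  "quast": {"quast_version": "4.6.0", "quast_threads": 10,
            "quast_hours": 4, "quast_mem_mb": 5000}
}
"""
config = json.loads(config_text)


def tool_resources(config, tool):
    """Hypothetical helper: collect the <tool>_threads/_hours/_mem_mb settings."""
    section = config[tool]
    return {key: section[f"{tool}_{key}"] for key in ("threads", "hours", "mem_mb")}


def read_file_names(config, sample):
    """Hypothetical helper: paired read file names for one sample (assumed scheme)."""
    return [f"{sample}{suffix}{config['extension']}"
            for suffix in config["mates"].values()]


print(tool_resources(config, "fastp"))
print(read_file_names(config, "FAM24235-i1-2"))
```

Keys that do not follow the scheme, such as `short_commands_mb` or `spades_mem_gb`, would need to be looked up by their exact names.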