Benchmarking topological accuracy of bacterial phylogenomic workflows using in silico evolution

Phylogenetic analyses are widely used in microbiological research, for example to trace the progression of bacterial outbreaks based on whole-genome sequencing data. In practice, multiple analysis steps such as de novo assembly, alignment and phylogenetic inference are combined to form phylogenetic workflows. Comprehensive benchmarking of the accuracy of complete phylogenetic workflows is lacking. To benchmark different phylogenetic workflows, we simulated bacterial evolution under a wide range of evolutionary models, varying the relative rates of substitution, insertion, deletion, gene duplication, gene loss and lateral gene transfer events. The generated datasets corresponded to the genetic diversity usually observed within bacterial species (≥95% average nucleotide identity). We replicated each simulation three times to assess replicability. In total, we benchmarked 17 distinct phylogenetic workflows using 8 different simulated datasets. We found that recently developed k-mer alignment methods such as kSNP and SKA achieve accuracy similar to that of reference mapping. The high accuracy of k-mer alignment methods can be explained by the large fractions of genomes these methods can align, relative to other approaches. We also found that the choice of de novo assembly algorithm influences the accuracy of phylogenetic reconstruction, with workflows employing SPAdes or SKESA outperforming those employing Velvet. Finally, we found that the results of phylogenetic benchmarking are highly variable between replicates. We conclude that for phylogenomic reconstruction k-mer alignment methods are relevant alternatives to reference mapping at the species level, especially in the absence of suitable reference genomes. We show de novo genome assembly accuracy to be an underappreciated parameter required for accurate phylogenomic reconstruction.
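The abstract does not state which metric was used to score topological accuracy. One common choice for comparing an inferred tree against the true simulated tree is the Robinson-Foulds distance, which counts the bipartitions found in one tree but not the other. The following is a minimal sketch under that assumption; the nested-tuple tree encoding and the function names are hypothetical and are not taken from the benchmarked workflows.

```python
def clades(tree):
    """Return (leaf set, list of leaf sets below each edge) for a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
        found.append(child_leaves)
    return leaves, found


def splits(tree):
    """Return the non-trivial bipartitions of a tree, treated as unrooted."""
    all_leaves, found = clades(tree)
    reference = min(all_leaves)  # fixed leaf used to pick a canonical side
    result = set()
    for clade in found:
        other = all_leaves - clade
        # Skip trivial splits that separate fewer than two leaves from the rest.
        if len(clade) > 1 and len(other) > 1:
            result.add(other if reference in clade else clade)
    return result


def robinson_foulds(tree_a, tree_b):
    """Count the bipartitions present in exactly one of the two trees."""
    if clades(tree_a)[0] != clades(tree_b)[0]:
        raise ValueError("trees must share the same leaf set")
    return len(splits(tree_a) ^ splits(tree_b))


# Hypothetical example: a "true" simulated tree and an inferred tree.
true_tree = ((("A", "B"), "C"), ("D", "E"))
inferred_tree = ((("A", "C"), "B"), ("D", "E"))
print(robinson_foulds(true_tree, inferred_tree))
```

A distance of zero means the two topologies are identical. In practice an established phylogenetics library would be used in place of this hand-rolled encoding.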

CaMuS: simultaneous fitting and de novo imputation of cancer mutational signature

Scientific Reports ◽

10.1038/s41598-020-75753-8 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Maria Cartolano ◽

Nima Abedpour ◽

Viktor Achter ◽

Tsun-Po Yang ◽

Sandra Ackermann ◽

...

Keyword(s):

De Novo ◽

Probability Distributions ◽

Simulated Data ◽

Whole Genome Sequencing Data ◽

Sequencing Data ◽

Mutational Signatures ◽

Computational Performance ◽

Reliable Parameter ◽

Similar Accuracy ◽

Mutational Processes

Abstract The identification of the mutational processes operating in tumour cells has implications for cancer diagnosis and therapy. These processes leave mutational patterns on the cancer genomes, which are referred to as mutational signatures. Recently, 81 mutational signatures have been inferred using computational algorithms on sequencing data of 23,879 samples. However, these published signatures may not always offer a comprehensive view on the biological processes underlying tumour types that are not included or underrepresented in the reference studies. To circumvent this problem, we designed CaMuS (Cancer Mutational Signatures) to construct de novo signatures while simultaneously fitting publicly available mutational signatures. Furthermore, we propose to estimate signature similarity by comparing probability distributions using the Hellinger distance. We applied CaMuS to infer signatures of mutational processes in poorly studied cancer types. We used whole genome sequencing data of 56 neuroblastoma, thus providing evidence for the versatility of CaMuS. Using simulated data, we compared the performance of CaMuS to sigfit, a recently developed algorithm with comparable inference functionalities. CaMuS and sigfit reconstructed the simulated datasets with similar accuracy; however two main features may argue for CaMuS over sigfit: (i) superior computational performance and (ii) a reliable parameter selection method to avoid spurious signatures.
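The Hellinger distance mentioned above compares two discrete probability distributions and ranges from 0 (identical) to 1 (no shared support). The sketch below shows the standard formula applied to two toy signatures. The four-channel vectors and the function names are illustrative assumptions and do not reproduce the CaMuS implementation.

```python
import math


def normalize(counts):
    """Scale non-negative counts so that they sum to 1."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must have a positive sum")
    return [c / total for c in counts]


def hellinger_distance(p, q):
    """Hellinger distance between two discrete probability distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


# Hypothetical toy signatures over four mutation channels.
signature_a = normalize([10, 30, 40, 20])
signature_b = normalize([12, 28, 35, 25])
print(hellinger_distance(signature_a, signature_b))
```

Because the distance is bounded, a fixed threshold can be used to decide whether a de novo signature matches a published one. The abstract gives no such threshold, so any cut-off would have to be chosen by the user.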

Divergence and introgression among the virilis group of Drosophila

10.1101/2022.01.11.475832 ◽

2022 ◽

Author(s):

Leeban Yusuf ◽

Venera Tyukmaeva ◽

Anneli Hoikkala ◽

Michael G Ritchie

Keyword(s):

Gene Flow ◽

Related Species ◽

De Novo ◽

Phylogenetic Reconstruction ◽

Sequence Divergence ◽

Sexual Isolation ◽

Whole Genome Sequencing Data ◽

Sequencing Data ◽

Closely Related Species ◽

Virilis Group

Speciation with gene flow is now widely regarded as common. However, the frequency of introgression between recently diverged species and the evolutionary consequences of gene flow are still poorly understood. The virilis group of Drosophila contains around a dozen species that are geographically widespread and show varying levels of pre-zygotic and post-zygotic isolation. Here, we utilize de novo genome assemblies and whole-genome sequencing data to resolve phylogenetic relationships and describe patterns of introgression and divergence across the group. We suggest that the virilis group consists of three, rather than the traditional two, subgroups. We found evidence of pervasive phylogenetic discordance caused by ancient introgression events between distant lineages within the group, and much more recent gene flow between closely-related species. When assessing patterns of genome-wide divergence in species pairs across the group, we found no consistent genomic evidence of a disproportionate role for the X chromosome. Some genes undergoing rapid sequence divergence across the group were involved in chemical communication and may be related to the evolution of sexual isolation. We suggest that gene flow between closely-related species has potentially had an impact on lineage-specific adaptation and the evolution of reproductive barriers. Our results show how ancient and recent introgression confuse phylogenetic reconstruction, and suggest that shared variation can facilitate adaptation and speciation.

Genomes of Three Closely Related Caribbean Amazons Provide Insight for Species History and Conservation

Genes ◽

10.3390/genes10010054 ◽

2019 ◽

Vol 10 (1) ◽

pp. 54 ◽

Cited By ~ 1

Author(s):

Sofiia Kolchanova ◽

Sergei Kliver ◽

Aleksei Komissarov ◽

Pavel Dobrinin ◽

Gaik Tamazian ◽

...

Keyword(s):

Puerto Rican ◽

De Novo ◽

Demographic History ◽

Model Systems ◽

Whole Genome Sequencing Data ◽

Sequencing Data ◽

De Novo Genome Assembly ◽

The Caribbean ◽

Amazona Vittata ◽

Puerto Rican Parrot

Islands have been used as model systems for studies of speciation and extinction since Darwin published his observations about finches found on the Galapagos. Amazon parrots inhabiting the Greater Antillean Islands represent a fascinating model of species diversification. Unfortunately, many of these birds are threatened as a result of human activity and some, like the Puerto Rican parrot, are now critically endangered. In this study we used a combination of de novo and reference-assisted assembly methods, integrating them with information obtained from related genomes to perform genome reconstruction of three amazon species. First, we used whole genome sequencing data to generate a new de novo genome assembly for the Puerto Rican parrot (Amazona vittata). We then improved the obtained assembly using transcriptome data from Amazona ventralis and used the resulting sequences as a reference to assemble the genomes of the Hispaniolan (A. ventralis) and Cuban (Amazona leucocephala) parrots. Finally, we annotated genes and repetitive elements, estimated genome sizes and current levels of heterozygosity, built models of demographic history and provided interpretation of our findings in the context of parrot evolution in the Caribbean.

Discovery of novel community-relevant small proteins in a simplified human intestinal microbiome

Microbiome ◽

10.1186/s40168-020-00981-z ◽

2021 ◽

Vol 9 (1) ◽

Author(s):

Hannes Petruschke ◽

Christian Schori ◽

Sebastian Canzler ◽

Sarah Riesbeck ◽

Anja Poehlein ◽

...

Keyword(s):

Microbial Communities ◽

Intestinal Microbiota ◽

De Novo ◽

Bacterial Species ◽

Intestinal Microbiome ◽

Single Strain ◽

Small Proteins ◽

Human Intestinal Microbiota ◽

Wide Range

Abstract Background The intestinal microbiota plays a crucial role in protecting the host from pathogenic microbes, modulating immunity and regulating metabolic processes. We studied the simplified human intestinal microbiota (SIHUMIx) consisting of eight bacterial species with a particular focus on the discovery of novel small proteins with less than 100 amino acids (= sProteins), some of which may contribute to shape the simplified human intestinal microbiota. Although sProteins carry out a wide range of important functions, they are still often missed in genome annotations, and little is known about their structure and function in individual microbes and especially in microbial communities. Results We created a multi-species integrated proteogenomics search database (iPtgxDB) to enable a comprehensive identification of novel sProteins. Six of the eight SIHUMIx species, for which no complete genomes were available, were sequenced and de novo assembled. Several proteomics approaches including two earlier optimized sProtein enrichment strategies were applied to specifically increase the chances for novel sProtein discovery. The search of tandem mass spectrometry (MS/MS) data against the multi-species iPtgxDB enabled the identification of 31 novel sProteins, of which the expression of 30 was supported by metatranscriptomics data. Using synthetic peptides, we were able to validate the expression of 25 novel sProteins. The comparison of sProtein expression in each single strain versus a multi-species community cultivation showed that six of these sProteins were only identified in the SIHUMIx community indicating a potentially important role of sProteins in the organization of microbial communities. Two of these novel sProteins have a potential antimicrobial function. Metabolic modelling revealed that a third sProtein is located in a genomic region encoding several enzymes relevant for the community metabolism within SIHUMIx. 
Conclusions We outline an integrated experimental and bioinformatics workflow for the discovery of novel sProteins in a simplified intestinal model system that can be generically applied to other microbial communities. The further analysis of novel sProteins uniquely expressed in the SIHUMIx multi-species community is expected to enable new insights into the role of sProteins on the functionality of bacterial communities such as those of the human intestinal tract.

SCAPP: an algorithm for improved plasmid assembly in metagenomes

Microbiome ◽

10.1186/s40168-021-01068-z ◽

2021 ◽

Vol 9 (1) ◽

Author(s):

David Pellow ◽

Alvah Zorea ◽

Maraike Probst ◽

Ori Furman ◽

Arik Segal ◽

...

Keyword(s):

Bacterial Species ◽

Bacterial Genome ◽

Biological Knowledge ◽

Assessment Procedure ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Human Gut ◽

Double Stranded Dna ◽

Wide Range ◽

Python Package

Abstract Background Metagenomic sequencing has led to the identification and assembly of many new bacterial genome sequences. These bacteria often contain plasmids: usually small, circular double-stranded DNA molecules that may transfer across bacterial species and confer antibiotic resistance. These plasmids are generally less studied and understood than their bacterial hosts. Part of the reason for this is insufficient computational tools enabling the analysis of plasmids in metagenomic samples. Results We developed SCAPP (Sequence Contents-Aware Plasmid Peeler)—an algorithm and tool to assemble plasmid sequences from metagenomic sequencing. SCAPP builds on some key ideas from the Recycler algorithm while improving plasmid assemblies by integrating biological knowledge about plasmids. We compared the performance of SCAPP to Recycler and metaplasmidSPAdes on simulated metagenomes, real human gut microbiome samples, and a human gut plasmidome dataset that we generated. We also created plasmidome and metagenome data from the same cow rumen sample and used the parallel sequencing data to create a novel assessment procedure. Overall, SCAPP outperformed Recycler and metaplasmidSPAdes across this wide range of datasets. Conclusions SCAPP is an easy to use Python package that enables the assembly of full plasmid sequences from metagenomic samples. It outperformed existing metagenomic plasmid assemblers in most cases and assembled novel and clinically relevant plasmids in samples we generated such as a human gut plasmidome. SCAPP is open-source software available from: https://github.com/Shamir-Lab/SCAPP.

Norgal: extraction and de novo assembly of mitochondrial DNA from whole-genome sequencing data

BMC Bioinformatics ◽

10.1186/s12859-017-1927-y ◽

2017 ◽

Vol 18 (1) ◽

Cited By ~ 21

Author(s):

Kosai Al-Nakeeb ◽

Thomas Nordahl Petersen ◽

Thomas Sicheritz-Pontén

Keyword(s):

Mitochondrial Dna ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

De Novo Assembly ◽

De Novo ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data

An emergent clade of SARS-CoV-2 linked to returned travellers from Iran

10.1101/2020.03.15.992818 ◽

2020 ◽

Cited By ~ 20

Author(s):

John-Sebastian Eden ◽

Rebecca Rockett ◽

Ian Carter ◽

Hossinur Rahman ◽

Joep de Ligt ◽

...

Keyword(s):

New Zealand ◽

Infectious Diseases ◽

Genome Sequencing ◽

Phylogenetic Analyses ◽

Emerging Infectious Diseases ◽

Whole Genome Sequencing Data ◽

Viral Diversity ◽

Whole Genome ◽

Sequencing Data ◽

Public Data

Abstract The SARS-CoV-2 epidemic has rapidly spread outside China with major outbreaks occurring in Italy, South Korea and Iran. Phylogenetic analyses of whole genome sequencing data identified a distinct SARS-CoV-2 clade linked to travellers returning from Iran to Australia and New Zealand. This study highlights potential viral diversity driving the epidemic in Iran, and underscores the power of rapid genome sequencing and public data sharing to improve the detection and management of emerging infectious diseases.

Metagenome of SARS-Cov2 patients in Shenzhen with travel to Wuhan shows a wide range of species - Lautropia, Cutibacterium, Haemophilus being most abundant - and Campylobacter explaining diarrhea

10.31219/osf.io/jegwq ◽

2020 ◽

Cited By ~ 7

Author(s):

Sandeep Chakraborty

Keyword(s):

Secondary Infection ◽

Bacterial Species ◽

Bacterial Load ◽

San Diego County ◽

Opportunistic Pathogens ◽

Sequencing Data ◽

Familial Cluster ◽

Chinese Study ◽

Wide Range ◽

Initial Results

The metagenome of patients infected with SARS-Cov2 [1] has shown Prevotella to be a key player in immune response [2] in one Chinese study [3], just starting in another [4] and a host of other opportunistic pathogens in a study from San Diego county [5]. The metagenome can also be queried to find host response genes [5], as was done in monkey cells infected with SARS-Cov2 [6].

Nanopore sequencing data from a familial cluster in Shenzhen

The patients were tested for 4 bacterial species - Bordetella pertussis, Bordetella parapertussis, Chlamydophila pneumoniae, and Mycoplasma pneumoniae. The sequencing data (Accid:SRR10948474, Nanopore) from five patients in a family cluster from Shenzhen who presented with unexplained pneumonia after returning from Wuhan (Table 1) shows a wide range of bacterial species - Lautropia, Cutibacterium, Haemophilus being most abundant. The presence of Campylobacter explains diarrhea seen in the patient [7,8]. Also, their tests should have detected Mycoplasma, since it is there in the data.

Significant bacterial load with some bacterial species predominating

The bacterial reads are about 20% (95K out of 500K reads). The viral load is also significant here (70K reads) [2]. They are in SI.familial/allsequences.fa. The number of bacterial species (with at least two reads) is 876 (SI.familial/list.allbacteria.txt). Thus, it is important to consider secondary infection, a possible reason why azithromycin (in addition to hydroxychloroquine) has given good initial results in a clinical trial [9].
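The read fractions quoted above can be checked with simple arithmetic. The sketch below uses the rounded counts from the text (95K bacterial, 70K viral, 500K total); the exact read counts are not given, so the constants are approximations and the helper name is hypothetical.

```python
def percent_of_total(count, total):
    """Return count as a percentage of total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100 * count / total


# Rounded figures taken from the text; the exact counts are not reported.
TOTAL_READS = 500_000
BACTERIAL_READS = 95_000
VIRAL_READS = 70_000

bacterial_pct = percent_of_total(BACTERIAL_READS, TOTAL_READS)
viral_pct = percent_of_total(VIRAL_READS, TOTAL_READS)
print(f"bacterial: {bacterial_pct:.1f}%  viral: {viral_pct:.1f}%")
```

With these rounded inputs the bacterial share is 19%, consistent with the "about 20%" stated in the text, and the viral share is 14%.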

Implications of Genetic Distance to Reference and De Novo Genome Assembly for Clinical Genomics in Africans

10.1101/2020.09.25.20201780 ◽

2020 ◽

Author(s):

Daniel Shriner ◽

Adebowale Adeyemo ◽

Charles Rotimi

Keyword(s):

Genetic Distance ◽

De Novo ◽

Reference Sequence ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

De Novo Genome Assembly ◽

Single Nucleotide ◽

Clinical Genomics ◽

Advantages And Disadvantages ◽

False Discovery

In clinical genomics, variant calling from short-read sequencing data typically relies on a pan-genomic, universal human reference sequence. A major limitation of this approach is that the number of reads that incorrectly map or fail to map increases as the reads diverge from the reference sequence. In the context of genome sequencing of genetically diverse Africans, we investigate the advantages and disadvantages of using a de novo assembly of the read data as the reference sequence in single sample calling. Conditional on sufficient read depth, the alignment-based and assembly-based approaches yielded comparable sensitivity and false discovery rates for single nucleotide variants when benchmarked against a gold standard call set. The alignment-based approach yielded coverage of an additional 270.8 Mb over which sensitivity was lower and the false discovery rate was higher. Although both approaches detected and missed clinically relevant variants, the assembly-based approach identified more such variants than the alignment-based approach. Of particular relevance to individuals of African descent, the assembly-based approach identified four heterozygous genotypes containing the sickle allele whereas the alignment-based approach identified no occurrences of the sickle allele. Variant annotation using dbSNP and gnomAD identified systematic biases in these databases due to underrepresentation of Africans. Using the counts of homozygous alternate genotypes from the alignment-based approach as a measure of genetic distance to the reference sequence GRCh38.p12, we found that the numbers of misassemblies, total variant sites, potentially novel single nucleotide variants (SNVs), and certain variant classes (e.g., splice acceptor variants, stop loss variants, missense variants, synonymous variants, and variants absent from gnomAD) were significantly correlated with genetic distance. 
In contrast, genomic coverage and other variant classes (e.g., ClinVar pathogenic or likely pathogenic variants, start loss variants, stop gain variants, splice donor variants, incomplete terminal codons, variants with CADD score ≥20) were not correlated with genetic distance. With improvement in coverage, the assembly-based approach can offer a viable alternative to the alignment-based approach, with the advantage that it can obviate the need to generate diverse human reference sequences or collections of alternate scaffolds.
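Sensitivity and false discovery rate against a gold standard call set follow directly from the counts of true positives, false negatives and false positives. The sketch below illustrates the two definitions on toy data. The variant tuples and function names are hypothetical, and the matching is by exact equality only, which is a simplification of real variant comparison.

```python
def benchmark_calls(called, gold):
    """Return (sensitivity, false discovery rate) for a call set against a gold standard."""
    called, gold = set(called), set(gold)
    true_pos = len(called & gold)
    false_neg = len(gold - called)
    false_pos = len(called - gold)
    if true_pos + false_neg == 0 or true_pos + false_pos == 0:
        raise ValueError("both the gold standard and the call set must be non-empty")
    sensitivity = true_pos / (true_pos + false_neg)
    false_discovery_rate = false_pos / (true_pos + false_pos)
    return sensitivity, false_discovery_rate


# Hypothetical variants encoded as (chromosome, position, ref, alt).
gold_standard = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "A"),
    ("chr2", 900, "T", "C"),
}
calls = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "A"),
    ("chr3", 40, "A", "C"),  # not in the gold standard, so a false positive
}
sensitivity, fdr = benchmark_calls(calls, gold_standard)
print(f"sensitivity={sensitivity:.2f}  FDR={fdr:.2f}")
```

Restricting both sets to a given genomic region before calling the function gives region-specific estimates, for example for the additional territory covered by only one approach.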

CAMISIM: Simulating metagenomes and microbial communities

10.1101/300970 ◽

2018 ◽

Cited By ~ 4

Author(s):

Adrian Fritz ◽

Peter Hofmann ◽

Stephan Majda ◽

Eik Dahms ◽

Johannes Dröge ◽

...

Keyword(s):

Microbial Communities ◽

De Novo ◽

Real Data ◽

Small Data ◽

Data Sets ◽

Sequencing Data ◽

Taxonomic Profiling ◽

Benchmark Data ◽

Sequencing Technologies ◽

Wide Range

Shotgun metagenome data sets of microbial communities are highly diverse, not only due to the natural variation of the underlying biological systems, but also due to differences in laboratory protocols, replicate numbers, and sequencing technologies. Accordingly, to effectively assess the performance of metagenomic analysis software, a wide range of benchmark data sets are required. Here, we describe the CAMISIM microbial community and metagenome simulator. The software can model different microbial abundance profiles, multi-sample time series and differential abundance studies, includes real and simulated strain-level diversity, and generates second- and third-generation sequencing data from taxonomic profiles or de novo. Gold standards are created for sequence assembly, genome binning, taxonomic binning, and taxonomic profiling. CAMISIM generated the benchmark data sets of the first CAMI challenge. For two simulated multi-sample data sets of the human and mouse gut microbiomes we observed high functional congruence to the real data. As further applications, we investigated the effect of varying evolutionary genome divergence, sequencing depth, and read error profiles on two popular metagenome assemblers, MEGAHIT and metaSPAdes, on several thousand small data sets generated with CAMISIM. CAMISIM can simulate a wide variety of microbial communities and metagenome data sets together with truth standards for method evaluation. All data sets and the software are freely available at: https://github.com/CAMI-challenge/CAMISIM
