Consensify: A Method for Generating Pseudohaploid Genome Sequences from Palaeogenomic Datasets with Reduced Error Rates

A standard practice in palaeogenome analysis is the conversion of mapped short read data into pseudohaploid sequences, frequently by selecting a single high-quality nucleotide at random from the stack of mapped reads. This controls for biases due to differential sequencing coverage, but it does not control for differential rates and types of sequencing error, which are frequently large and variable in datasets obtained from ancient samples. These errors have the potential to distort phylogenetic and population clustering analyses, and to mislead tests of admixture using D statistics. We introduce Consensify, a method for generating pseudohaploid sequences, which controls for biases resulting from differential sequencing coverage while greatly reducing error rates. The error correction is derived directly from the data itself, without the requirement for additional genomic resources or simplifying assumptions such as contemporaneous sampling. For phylogenetic and population clustering analysis, we find that Consensify is less affected by artefacts than methods based on single read sampling. For D statistics, Consensify is more resistant to false positives and appears to be less affected by biases resulting from different laboratory protocols than other frequently used methods. Although Consensify is developed with palaeogenomic data in mind, it is applicable to any low to medium coverage short read dataset. We predict that Consensify will be a useful tool for future studies of palaeogenomes.
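
As a rough illustration of why sequencing error matters for the D statistics mentioned in this abstract, the sketch below computes the standard ABBA-BABA form of D, (nABBA - nBABA) / (nABBA + nBABA), from four aligned pseudohaploid sequences. The function name `d_statistic`, the use of "N" for missing calls, and the toy sequences are assumptions made for this example; none of them comes from Consensify itself.

```python
def d_statistic(p1, p2, p3, outgroup):
    """Return D = (ABBA - BABA) / (ABBA + BABA), or None if no site is informative.

    Each argument is an aligned pseudohaploid sequence (one base per site).
    The outgroup base is treated as the ancestral allele "A"; any other
    base shared with P3 is treated as the derived allele "B".
    """
    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        if "N" in (a, b, c, o):
            continue  # skip sites with a missing call in any sample
        if a == o and b == c and b != o:
            abba += 1  # P2 shares the derived allele with P3
        elif b == o and a == c and a != o:
            baba += 1  # P1 shares the derived allele with P3
    total = abba + baba
    return (abba - baba) / total if total else None


# Toy data: site 0 is ABBA, site 2 is BABA, the others are uninformative.
print(d_statistic("AAGA", "GAAA", "GAGA", "AAAA"))
```

Under this definition, an error in P1 that happens to match the P3 base adds a spurious BABA site, which is one way that differing error rates between samples could shift D away from zero.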

Consensify: a method for generating pseudohaploid genome sequences from palaeogenomic datasets with reduced error rates

10.1101/498915 ◽

2018 ◽

Cited By ~ 2

Author(s):

Axel Barlow ◽

Stefanie Hartmann ◽

Javier Gonzalez ◽

Michael Hofreiter ◽

Johanna L.A. Paijmans

Keyword(s):

Clustering Analysis ◽

Branch Length ◽

Genetic Distances ◽

Error Rates ◽

Sequencing Error ◽

Short Read ◽

Future Studies ◽

Sequencing Coverage ◽

Simplifying Assumptions ◽

Population Clustering

A standard practice in palaeogenome analysis is the conversion of mapped short read data into pseudohaploid sequences, typically by selecting a single high-quality nucleotide at random from the stack of mapped reads. This controls for biases due to differential sequencing coverage, but it does not control for differential rates and types of sequencing error, which are frequently large and variable in datasets obtained from ancient samples. These errors have the potential to distort phylogenetic and population clustering analyses, and to mislead tests of admixture using D statistics. We introduce Consensify, a method for generating pseudohaploid sequences which controls for biases resulting from differential sequencing coverage while greatly reducing error rates. The error correction is derived directly from the data itself, without the requirement for additional genomic resources or simplifying assumptions such as contemporaneous sampling. For phylogenetic analysis, we find that Consensify is less affected by branch length artefacts than methods based on standard pseudohaploidisation, and it performs similarly for population clustering analysis based on genetic distances. For D statistics, Consensify is more resistant to false positives and appears to be less affected by biases resulting from different laboratory protocols than other available methods. Although Consensify is developed with palaeogenomic data in mind, it is applicable to any low to medium coverage short read dataset. We predict that Consensify will be a useful tool for future studies of palaeogenomes.
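
The standard pseudohaploidisation step described above can be sketched in a few lines. The first function picks one high-quality base at random from a pileup column, as the abstract describes. The second is a hypothetical consensus-style variant in which several bases are sampled and a call is made only if enough of them agree. The abstract does not specify how Consensify builds its consensus, so the second function, its parameters (`n_sample`, `min_agree`, `min_qual`), and the pileup representation as `(base, quality)` pairs are all assumptions made for illustration.

```python
import random
from collections import Counter


def sample_single_base(pileup, min_qual=30, rng=random):
    """Standard approach: one random base among those passing the quality filter."""
    good = [base for base, qual in pileup if qual >= min_qual]
    return rng.choice(good) if good else "N"


def sample_majority_base(pileup, n_sample=3, min_agree=2, min_qual=30, rng=random):
    """Hypothetical variant: sample n_sample bases and require min_agree to match.

    The number of bases used is fixed regardless of depth, so coverage
    differences between samples do not change how a call is made.
    """
    good = [base for base, qual in pileup if qual >= min_qual]
    if len(good) < n_sample:
        return "N"  # not enough data at this position
    drawn = rng.sample(good, n_sample)
    base, count = Counter(drawn).most_common(1)[0]
    return base if count >= min_agree else "N"


# One pileup column: five confident "A" reads and one low-quality "G".
column = [("A", 40)] * 5 + [("G", 10)]
print(sample_single_base(column), sample_majority_base(column))
```

With single-read sampling, any erroneous base that passes the quality filter can be chosen directly; requiring agreement among sampled reads means a lone error is outvoted, at the price of more "N" calls at low depth.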

Complete and Circularized Genome Sequences of 17 Xanthomonas Strains Responsible for Common Bacterial Blight of Bean

Microbiology Resource Announcements ◽

10.1128/mra.00371-21 ◽

2021 ◽

Vol 10 (31) ◽

Author(s):

Martial Briand ◽

Mylène Ruh ◽

Armelle Darrasse ◽

Marie-Agnès Jacques ◽

Nicolas W. G. Chen

Keyword(s):

Bacterial Blight ◽

Plant Pathogens ◽

Common Bacterial Blight ◽

Xanthomonas Citri ◽

Genome Sequences ◽

High Quality ◽

Short Read ◽

Content Type ◽

Short Read Sequencing ◽

Quality Material

We report the complete and circularized genome sequences of 17 strains of Xanthomonas citri pv. fuscans and Xanthomonas phaseoli pv. phaseoli, which cause common bacterial blight of bean. These new assemblies combining PacBio and short-read sequencing methods provide high-quality material for studying the evolution of these plant pathogens.

Acceleration of Nucleotide Semi-Global Alignment with Adaptive Banded Dynamic Programming

10.1101/130633 ◽

2017 ◽

Cited By ~ 9

Author(s):

Hajime Suzuki ◽

Masahiro Kasahara

Keyword(s):

Dynamic Programming ◽

Single Molecule ◽

Computation Time ◽

Error Rates ◽

Nucleotide Sequences ◽

Sequencing Error ◽

Local Alignment ◽

Global Alignment ◽

Alignment Algorithm ◽

Short Read

Motivation: Pairwise alignment of nucleotide sequences has previously been carried out using the seed-and-extend strategy, where we enumerate seeds (shared patterns) between sequences and then extend the seeds by Smith-Waterman-like semi-global dynamic programming to obtain full pairwise alignments. With the advent of massively parallel short read sequencers, algorithms and data structures for efficiently finding seeds have been extensively explored. However, recent advances in single-molecule sequencing technologies have enabled us to obtain millions of reads, each of which is orders of magnitude longer than those output by the short-read sequencers, demanding a faster algorithm for the extension step that accounts for most of the computation time required for pairwise local alignment. Our goal is to design a faster extension algorithm suitable for single-molecule sequencers with high sequencing error rates (e.g., 10-15%) and with more frequent insertions and deletions than substitutions. Results: We propose an adaptive banded dynamic programming algorithm for calculating pairwise semi-global alignment of nucleotide sequences that allows a relatively high insertion or deletion rate while keeping band width relatively low (e.g., 32 or 64 cells) regardless of sequence lengths. Our new algorithm eliminated mutual dependences between elements in a vector, allowing an efficient Single-Instruction-Multiple-Data parallelization. We experimentally demonstrate that our algorithm runs approximately 5× faster than the extension alignment algorithm in NCBI BLAST+ while retaining similar sensitivity (recall). We also show that our extension algorithm is more sensitive than the extension alignment routine in DALIGNER, while the computation time is comparable. Availability: The implementation of the algorithm and the benchmarking scripts are available at https://github.com/ocxtal/[email protected]
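
To make the idea of banded dynamic programming concrete, here is a minimal sketch of the simpler fixed-band case: a global alignment score computed only for cells within `band` of the main diagonal. It is not the adaptive, SIMD-parallel algorithm proposed in the abstract, whose band follows the alignment path; the function name, the linear gap penalty, and the scoring values are assumptions chosen for illustration.

```python
def banded_alignment_score(a, b, band=4, match=2, mismatch=-2, gap=-3):
    """Global alignment score restricted to cells with |i - j| <= band.

    Work is O(len(a) * band) instead of O(len(a) * len(b)), but the result
    is only optimal if the best alignment path stays inside the band.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        raise ValueError("band too narrow to reach the final cell")
    neg = float("-inf")  # marks cells outside the band
    prev = [j * gap if j <= band else neg for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [neg] * (m + 1)
        lo, hi = max(0, i - band), min(m, i + band)
        for j in range(lo, hi + 1):
            if j == 0:
                cur[j] = i * gap  # leading gaps in b
                continue
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = cur[j - 1] + gap   # gap in a
            cur[j] = max(diag, up, left)
        prev = cur
    return prev[m]


print(banded_alignment_score("ACGT", "AGT", band=2))  # three matches, one gap
```

A fixed band must be wide enough to absorb every insertion and deletion along the whole read, which is the limitation an adaptive band is meant to avoid for long, indel-rich reads.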

Parameter exploration improves the accuracy of long-read genome assembly

10.1101/2021.05.28.446135 ◽

2021 ◽

Author(s):

Anurag Priyam ◽

Alicja Witwicka ◽

Anindita Brahma ◽

Eckart Stolle ◽

Yannick Wurm

Keyword(s):

Genome Assembly ◽

Reference Genome ◽

Error Rates ◽

Fine Tuning ◽

Sequencing Error ◽

High Quality ◽

Long Read ◽

Genome Assemblies ◽

Error Profiles

Long-molecule sequencing is now routinely applied to generate high-quality reference genome assemblies. However, datasets differ in repeat composition, heterozygosity, read lengths and error profiles. The assembly parameters that provide the best results could thus differ across datasets. By integrating four complementary and biologically meaningful metrics, we show that simple fine-tuning of assembly parameters can substantially improve the quality of long-read genome assemblies. In particular, modifying estimates of sequencing error rates improves some metrics more than two-fold. We provide a flexible software, CompareGenomeQualities, that automates comparisons of assembly qualities for researchers wanting a straightforward mechanism for choosing among multiple assemblies.

Long-read sequencing settings for efficient structural variation detection based on comprehensive evaluation

BMC Bioinformatics ◽

10.1186/s12859-021-04422-y ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Tao Jiang ◽

Shiqi Liu ◽

Shuqi Cao ◽

Yadong Liu ◽

Zhe Cui ◽

...

Keyword(s):

Structural Variation ◽

Comprehensive Evaluation ◽

Full Range ◽

Error Rates ◽

Influential Factor ◽

Read Length ◽

Sequencing Error ◽

Clinical Practices ◽

Sequencing Coverage ◽

Long Read

Background: With the rapid development of long-read sequencing technologies, it is possible to reveal the full spectrum of genetic structural variation (SV). However, the high cost, finite read length and high sequencing error of long-read data greatly limit the widespread adoption of SV calling. Therefore, it is urgent to establish guidance concerning sequencing coverage, read length, and error rate to maintain high SV yields and to achieve the lowest cost simultaneously. Results: In this study, we generated a full range of simulated error-prone long-read datasets containing various sequencing settings and comprehensively evaluated the performance of SV calling with state-of-the-art long-read SV detection methods. The benchmark results demonstrate that almost all SV callers perform better when the long-read data reach 20× coverage, 20 kbp average read length, and approximately 10–7.5% or below 1% error rates. Furthermore, high sequencing coverage is the most influential factor in promoting SV calling, while it also directly determines the cost. Conclusions: Based on the comprehensive evaluation results, we provide important guidelines for selecting long-read sequencing settings for efficient SV calling. We believe these recommended settings of long-read sequencing will have extraordinary guiding significance in cutting-edge genomic studies and clinical practices.
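
The coverage and read-length figures quoted above can be related through the usual mean-coverage arithmetic, coverage = number of reads × mean read length / genome size. The sketch below applies it to the 20× and 20 kbp settings from the abstract; the 3.1 Gbp genome size is an assumed round figure for a human-sized genome and does not come from the abstract, and the function names are hypothetical.

```python
import math


def mean_coverage(n_reads, mean_read_length, genome_size):
    """Average depth: total sequenced bases divided by genome length."""
    return n_reads * mean_read_length / genome_size


def reads_needed(target_coverage, mean_read_length, genome_size):
    """Smallest number of reads that reaches the target mean coverage."""
    return math.ceil(target_coverage * genome_size / mean_read_length)


GENOME_SIZE = 3_100_000_000  # assumed genome size in bp, for illustration only
n = reads_needed(20, 20_000, GENOME_SIZE)
print(n, mean_coverage(n, 20_000, GENOME_SIZE))
```

Because the required number of reads scales linearly with target coverage, this is consistent with the abstract's remark that coverage directly determines cost.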

Genome analysis of the fatal tapeworm Sparganum proliferum unravels the cryptic lifecycle and mechanisms underlying the aberrant larval proliferation

10.1101/2020.05.19.105387 ◽

2020 ◽

Author(s):

Taisei Kikuchi ◽

Mehmet Dayi ◽

Vicky L. Hunt ◽

Atsushi Toyoda ◽

Yasunobu Maeda ◽

...

Keyword(s):

Extracellular Matrix ◽

Asexual Reproduction ◽

Sexual Maturity ◽

Reference Genome ◽

Cestode Species ◽

Genome Sequences ◽

High Quality ◽

Genetic Sequence ◽

Future Studies ◽

Underlying Mechanisms

Background: The cryptic parasite Sparganum proliferum proliferates in humans and invades tissues and organs. Only scattered cases have been reported, but S. proliferum infection is always fatal. However, the S. proliferum phylogeny and lifecycle are still an enigma. Results: To investigate the phylogenetic relationships between S. proliferum and other cestode species, and to examine the underlying mechanisms of pathogenicity, we sequenced the entire S. proliferum genome. Additionally, S. proliferum plerocercoid larvae transcriptome analyses were performed to identify genes involved in asexual reproduction in the host. The genome sequences confirmed that the S. proliferum genetic sequence is distinct from that of the closely related Spirometra erinaceieuropaei. Moreover, nonordinal extracellular matrix coordination allows for asexual reproduction in the host and loss of sexual maturity in S. proliferum is related to its fatal pathogenicity in humans. Conclusions: The high-quality reference genome sequences generated should prove valuable for future studies of pseudophyllidean tapeworm biology and parasitism.

Genome of the fatal tapeworm Sparganum proliferum uncovers mechanisms for cryptic life cycle and aberrant larval proliferation

Communications Biology ◽

10.1038/s42003-021-02160-8 ◽

2021 ◽

Vol 4 (1) ◽

Author(s):

Taisei Kikuchi ◽

Mehmet Dayi ◽

Vicky L. Hunt ◽

Kenji Ishiwata ◽

Atsushi Toyoda ◽

...

Keyword(s):

Life Cycle ◽

Asexual Reproduction ◽

Sexual Maturity ◽

Evolutionary History ◽

Reference Genome ◽

Cestode Species ◽

Genome Sequences ◽

High Quality ◽

Future Studies ◽

Life Threatening

The cryptic parasite Sparganum proliferum proliferates in humans and invades tissues and organs. Only scattered cases have been reported, but S. proliferum infection is always fatal. However, S. proliferum’s phylogeny and life cycle remain enigmatic. To investigate the phylogenetic relationships between S. proliferum and other cestode species, and to examine the mechanisms underlying pathogenicity, we sequenced the entire genomes of S. proliferum and a closely related non–life-threatening tapeworm Spirometra erinaceieuropaei. Additionally, we performed transcriptome analyses of S. proliferum plerocercoid larvae to identify genes involved in asexual reproduction in the host. The genome sequences confirmed that S. proliferum has experienced a clearly distinct evolutionary history from S. erinaceieuropaei. Moreover, we found that nonordinal extracellular matrix coordination allows asexual reproduction in the host, and that loss of sexual maturity in S. proliferum is responsible for its fatal pathogenicity to humans. Our high-quality reference genome sequences should be valuable for future studies of pseudophyllidean tapeworm biology and parasitism.

Nanopore sequencing enables high-resolution analysis of resistance determinants and mobile elements in the human gut microbiome

10.1101/456905 ◽

2018 ◽

Cited By ~ 6

Author(s):

Denis Bertrand ◽

Jim Shaw ◽

Manesh Kalathiappan ◽

Amanda Hui Qi Ng ◽

Senthil Muthiah ◽

...

Keyword(s):

Complex Dynamics ◽

Human Microbiome ◽

Abundant Species ◽

Error Rates ◽

Nanopore Sequencing ◽

High Quality ◽

Short Read ◽

Long Reads ◽

Long Read

The analysis of information-rich whole-metagenome datasets acquired from complex microbial communities is often restricted by the fragmented nature of assembly from short-read sequencing. The availability of long reads from third-generation sequencing technologies (e.g. PacBio or Oxford Nanopore) can help improve assembly quality in principle, but high error rates and low throughput have limited their application in metagenomics. In this work, we describe the first hybrid metagenomic assembler which combines the advantages of short and long-read technologies, providing an order of magnitude improvement in contiguity compared to short read assemblies, and high base-pair level accuracy. The proposed approach (OPERA-MS) integrates a novel assembly-based metagenome clustering technique with an exact scaffolding algorithm that can efficiently assemble repeat-rich sequences. Based on evaluations with defined in vitro communities and virtual gut microbiomes, we show that it is possible to assemble near complete genomes from metagenomes with as little as 9× long read coverage, thus enabling high quality assembly of lowly abundant species (<1%). Furthermore, OPERA-MS’s fine-grained clustering is able to deconvolute and assemble multiple genomes of the same species in a single sample, allowing us to study the complex dynamics of the human microbiome at the sub-species level. Applying nanopore sequencing to gut metagenomes of patients undergoing antibiotic treatment, we show that long reads can be obtained from stool samples in clinical studies to produce more meaningful metagenomic assemblies (up to 200× improvement over short-read assemblies), including the closed assembly of >80 putative plasmid/phage sequences and a 263 kbp jumbo phage. Our results highlight that high-quality hybrid assemblies provide an unprecedented view of the gut resistome in these patients, including strain dynamics and identification of novel plasmid sequences.

Genome Analysis of the Fatal Tapeworm Sparganum Proliferum Uncovers the Cryptic Life Cycle and Mechanisms Underlying Aberrant Larval Proliferation

10.21203/rs.3.rs-78313/v1 ◽

2020 ◽

Author(s):

Taisei Kikuchi ◽

Mehmet Dayi ◽

Vicky Hunt ◽

Atsushi Toyoda ◽

Yasunobu Maeda ◽

...

Keyword(s):

Life Cycle ◽

Asexual Reproduction ◽

Sexual Maturity ◽

Evolutionary History ◽

Reference Genome ◽

Cestode Species ◽

Genome Sequences ◽

High Quality ◽

Future Studies ◽

Life Threatening

The cryptic parasite Sparganum proliferum proliferates in humans and invades tissues and organs. Only scattered cases have been reported, but S. proliferum infection is always fatal. However, S. proliferum’s phylogeny and life cycle remain enigmatic. To investigate the phylogenetic relationships between S. proliferum and other cestode species, and to examine the mechanisms underlying pathogenicity, we sequenced the entire genomes of S. proliferum and a closely related non–life-threatening tapeworm Spirometra erinaceieuropaei. Additionally, we performed transcriptome analyses of S. proliferum plerocercoid larvae to identify genes involved in asexual reproduction in the host. The genome sequences confirmed that S. proliferum has experienced a clearly distinct evolutionary history from S. erinaceieuropaei. Moreover, we found that nonordinal extracellular matrix coordination allows asexual reproduction in the host, and that loss of sexual maturity in S. proliferum is responsible for its fatal pathogenicity to humans. Our high-quality reference genome sequences should be valuable for future studies of pseudophyllidean tapeworm biology and parasitism.

Evaluating the accuracy of Listeria monocytogenes assemblies from quasimetagenomic samples using long and short reads

BMC Genomics ◽

10.1186/s12864-021-07702-2 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Seth Commichaux ◽

Kiran Javkar ◽

Padmini Ramachandran ◽

Niranjan Nagarajan ◽

Denis Bertrand ◽

...

Keyword(s):

Public Health ◽

Public Health Response ◽

High Quality ◽

Short Read ◽

Short Reads ◽

The Core ◽

Long Reads ◽

Health Response ◽

Long Read ◽

Core Genes

Background: Whole genome sequencing of cultured pathogens is the state of the art public health response for the bioinformatic source tracking of illness outbreaks. Quasimetagenomics can substantially reduce the amount of culturing needed before a high quality genome can be recovered. Highly accurate short read data is analyzed for single nucleotide polymorphisms and multi-locus sequence types to differentiate strains but cannot span many genomic repeats, resulting in highly fragmented assemblies. Long reads can span repeats, resulting in much more contiguous assemblies, but have lower accuracy than short reads. Results: We evaluated the accuracy of Listeria monocytogenes assemblies from enrichments (quasimetagenomes) of naturally-contaminated ice cream using long read (Oxford Nanopore) and short read (Illumina) sequencing data. Accuracy of ten assembly approaches, over a range of sequencing depths, was evaluated by comparing sequence similarity of genes in assemblies to a complete reference genome. Long read assemblies reconstructed a circularized genome as well as a 71 kbp plasmid after 24 h of enrichment; however, high error rates prevented high fidelity gene assembly, even at 150X depth of coverage. Short read assemblies accurately reconstructed the core genes after 28 h of enrichment but produced highly fragmented genomes. Hybrid approaches demonstrated promising results but had biases based upon the initial assembly strategy. Short read assemblies scaffolded with long reads accurately assembled the core genes after just 24 h of enrichment, but were highly fragmented. Long read assemblies polished with short reads reconstructed a circularized genome and plasmid and assembled all the genes after 24 h enrichment but with less fidelity for the core genes than the short read assemblies. Conclusion: The integration of long and short read sequencing of quasimetagenomes expedited the reconstruction of a high quality pathogen genome compared to either platform alone. A new and more complete level of information about genome structure, gene order and mobile elements can be added to the public health response by incorporating long read analyses with the standard short read WGS outbreak response.
