MAtCHap: an ultra fast algorithm for solving the single individual haplotype assembly problem

2019
Author(s): Alberto Magi

Abstract
Background: Human genomes are diploid: they carry two homologous copies of each chromosome, and the assignment of heterozygous variants to each chromosome copy, the haplotype assembly problem, is of fundamental importance for medical and population genetics. Short reads from second-generation sequencing platforms drastically limit haplotype reconstruction, as the great majority of reads cannot link many variants together, whereas long reads from third-generation sequencing can span several variants along the genome, allowing much longer haplotype blocks to be inferred. However, most haplotype assembly algorithms, originally devised for short sequences, fail when applied to noisy long-read data, and although new algorithms have been developed specifically for this generation of sequences, these methods can handle only datasets with limited coverage.
Results: To overcome the limits of currently available algorithms, I propose a novel formulation of the single individual haplotype assembly problem based on maximum allele co-occurrence (MAC), and I develop an ultra-fast algorithm that can reconstruct the haplotype structure of a diploid genome from low- and high-coverage long-read datasets with high accuracy. I test my algorithm (MAtCHap) on synthetic and real PacBio and Nanopore human datasets and compare its results with those of eight other state-of-the-art algorithms. These analyses show that MAtCHap outperforms the other methods in terms of accuracy, contiguity, completeness and computational speed.
Availability: MAtCHap is publicly available at https://sourceforge.net/projects/matchap/.
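The abstract does not detail the MAC formulation, but the core idea of phasing by allele co-occurrence can be sketched. Below is a minimal, hypothetical Python illustration (not the published MAtCHap algorithm): for each pair of adjacent heterozygous sites, reads vote on whether the two sites carry the same or opposite alleles, and the phase is chained greedily.

```python
# Hypothetical sketch of phasing by allele co-occurrence.
# Not the published MAtCHap algorithm; an illustration of the idea that
# co-occurrence counts between adjacent heterozygous sites can drive phasing.

from collections import defaultdict

def phase_by_cooccurrence(fragments, n_sites):
    """fragments: list of dicts {site_index: allele (0 or 1)}, one per read."""
    # Count same-allele vs opposite-allele support between adjacent sites.
    same = defaultdict(int)
    opposite = defaultdict(int)
    for frag in fragments:
        sites = sorted(frag)
        for a, b in zip(sites, sites[1:]):
            if frag[a] == frag[b]:
                same[(a, b)] += 1
            else:
                opposite[(a, b)] += 1
    # Greedily chain sites: the phase of site i+1 relative to site i follows
    # whichever relationship (same/opposite) has more read support.
    phase = [0] * n_sites
    for i in range(n_sites - 1):
        link = (i, i + 1)
        flip = opposite[link] > same[link]
        phase[i + 1] = (phase[i] ^ 1) if flip else phase[i]
    return phase

# Toy example: three reads covering four heterozygous sites.
reads = [{0: 0, 1: 0, 2: 1}, {1: 0, 2: 1, 3: 1}, {0: 1, 1: 1, 2: 0}]
print(phase_by_cooccurrence(reads, 4))  # [0, 0, 1, 1]
```

Real data would also require handling coverage gaps and conflicting links between non-adjacent sites; the paper should be consulted for the actual MAC objective.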

2017
Author(s): Olivia Choudhury, Ankush Chakrabarty, Scott J. Emrich

Abstract
Second-generation sequencing techniques generate short reads that can result in fragmented genome assemblies. Third-generation sequencing platforms mitigate this limitation by producing longer reads that span complex and repetitive regions. Currently, however, the usefulness of such long reads is limited by high sequencing error rates. To exploit the full potential of these longer reads, it is imperative to correct the underlying errors. We propose HECIL (Hybrid Error Correction with Iterative Learning), a hybrid error correction framework that determines a correction policy for erroneous long reads based on optimal combinations of decision weights obtained from short-read alignments. We demonstrate that HECIL outperforms state-of-the-art error correction algorithms for an overwhelming majority of evaluation metrics on diverse real datasets including E. coli, S. cerevisiae, and the malaria vector mosquito A. funestus. We further improve the performance of HECIL by introducing an iterative learning paradigm that refines the correction policy at each iteration by incorporating knowledge gathered from previous iterations via confidence metrics assigned to prior corrections.
Availability and Implementation: https://github.com/NDBL/
Contact: [email protected]
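As a rough illustration of a correction policy driven by weighted short-read evidence, here is a hedged Python sketch; the weights, the confidence measure, and the iterative relaxation schedule below are assumptions for illustration, not HECIL's published objective.

```python
# Illustrative sketch of a hybrid correction policy: each candidate base
# from short-read alignments gets a score combining alignment identity and
# base quality; high-confidence corrections are applied first, and the
# cutoff is relaxed over iterations. All weightings here are assumptions.

def correct_position(candidates, w_identity=0.5, w_quality=0.5):
    """candidates: list of (base, alignment_identity, base_quality_prob)."""
    scores = {}
    for base, identity, quality in candidates:
        scores[base] = scores.get(base, 0.0) + w_identity * identity + w_quality * quality
    best = max(scores, key=scores.get)
    total = sum(scores.values())
    confidence = scores[best] / total if total else 0.0
    return best, confidence

def iterative_correction(positions, n_iter=3, threshold=0.9):
    """Apply only high-confidence corrections first; relax over iterations."""
    corrected = {}
    for it in range(n_iter):
        cutoff = threshold * (1 - 0.1 * it)  # relax the cutoff each pass
        for pos, cands in positions.items():
            if pos in corrected:
                continue
            base, conf = correct_position(cands)
            if conf >= cutoff:
                corrected[pos] = base
    return corrected

# Toy input: one erroneous long-read position with three aligned short reads.
toy = {101: [("A", 0.98, 0.99), ("A", 0.95, 0.9), ("G", 0.7, 0.6)]}
print(iterative_correction(toy))  # {101: 'A'}
```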


2018
Author(s): Jana Ebler, Marina Haukness, Trevor Pesout, Tobias Marschall, Benedict Paten

Motivation: Current genotyping approaches for single nucleotide variants (SNVs) rely on short, relatively accurate reads from second-generation sequencing devices. Third-generation sequencing platforms, which can generate much longer reads, are becoming more widespread, but they come with the significant drawback of higher sequencing error rates, making them ill-suited to current genotyping algorithms. However, the longer reads make more of the genome unambiguously mappable and typically provide linkage information between neighboring variants.
Results: In this paper we introduce a novel approach for haplotype-aware genotyping from noisy long reads. We do this by considering bipartitions of the sequencing reads, corresponding to the two haplotypes. We formalize the computational problem in terms of a Hidden Markov Model and compute posterior genotype probabilities using the forward-backward algorithm. Genotype predictions can then be made by picking the most likely genotype at each site. Our experiments indicate that longer reads allow significantly more of the genome to potentially be genotyped accurately. Further, we are able to use both Oxford Nanopore and Pacific Biosciences sequencing data to independently validate millions of variants previously identified by short-read technologies in the reference NA12878 sample, including hundreds of thousands of variants that were not previously included in the high-confidence reference set.
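The forward-backward computation the abstract refers to can be made concrete. The sketch below is a deliberate simplification: the hidden states are the three diploid genotypes and emissions follow a binomial model of per-site alt-allele counts, whereas the full method reasons over read bipartitions rather than independent counts.

```python
# Minimal forward-backward sketch for genotyping along a chromosome.
# Hidden states: genotypes 0 (hom-ref), 1 (het), 2 (hom-alt).
# Emissions: alt-allele read counts under a binomial error model.
# A simplification for illustration, not the paper's implementation.

from math import comb

def emission(genotype, alt, depth, err=0.05):
    p_alt = {0: err, 1: 0.5, 2: 1 - err}[genotype]
    return comb(depth, alt) * p_alt**alt * (1 - p_alt)**(depth - alt)

def forward_backward(observations, stay=0.98):
    """observations: list of (alt_count, depth); returns posteriors per site."""
    switch = (1 - stay) / 2
    trans = [[stay if g == h else switch for h in range(3)] for g in range(3)]
    n = len(observations)
    fwd = [[0.0] * 3 for _ in range(n)]
    bwd = [[1.0] * 3 for _ in range(n)]
    for g in range(3):  # uniform prior at the first site
        fwd[0][g] = emission(g, *observations[0]) / 3
    for i in range(1, n):  # forward pass
        for g in range(3):
            fwd[i][g] = emission(g, *observations[i]) * sum(
                fwd[i - 1][h] * trans[h][g] for h in range(3))
    for i in range(n - 2, -1, -1):  # backward pass
        for g in range(3):
            bwd[i][g] = sum(trans[g][h] * emission(h, *observations[i + 1])
                            * bwd[i + 1][h] for h in range(3))
    posteriors = []
    for i in range(n):  # normalize fwd * bwd at each site
        joint = [fwd[i][g] * bwd[i][g] for g in range(3)]
        z = sum(joint)
        posteriors.append([p / z for p in joint])
    return posteriors

sites = [(9, 10), (5, 11), (0, 12)]  # (alt reads, total reads) per site
for post in forward_backward(sites):
    print([round(p, 3) for p in post])
```

Picking the argmax of each posterior row yields the genotype prediction described in the abstract.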


2017
Author(s): Stefano Beretta, Murray D. Patterson, Simone Zaccaria, Gianluca Della Vedova, Paola Bonizzoni

Abstract
Background: Haplotype assembly is the process of assigning the different alleles of the variants covered by mapped sequencing reads to the two haplotypes of the genome of a human individual. Long reads, which are nowadays cheaper to produce and more widely available than ever before, have been used to reduce the fragmentation of the assembled haplotypes, thanks to their ability to span several variants along the genome. These long reads are also characterized by a high error rate, an issue which may be mitigated with larger sets of reads when this error rate is uniform across genome positions. Unfortunately, current state-of-the-art dynamic programming approaches designed for long reads can deal only with limited coverage.
Results: Here, we propose a new method for assembling haplotypes which combines and extends the features of previous approaches to deal with long reads and higher coverage. In particular, our algorithm is able to dynamically adapt the estimated number of errors at each variant site, while minimizing the total number of error corrections necessary for finding a feasible solution. This significantly reduces the required computational resources, making it possible to consider datasets with higher coverage. The algorithm has been implemented in a freely available tool, HapCHAT: Haplotype Assembly Coverage Handling by Adapting Thresholds. An experimental analysis on sequencing reads with up to 60× coverage reveals improvements in accuracy and recall achieved by considering a higher coverage, with lower runtimes.
Conclusions: Our method leverages the long-range information of sequencing reads to obtain assembled haplotypes fragmented into fewer unphased haplotype blocks. At the same time, it can handle higher coverage to better correct the errors in the original reads and thereby obtain more accurate haplotypes.
Availability: HapCHAT is available at http://hapchat.algolab.eu under the GPL license.
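One simple way to picture "adapting the estimated number of errors at each variant site" is a per-column error budget derived from coverage and an assumed per-base error rate. The binomial tail bound below is an illustrative assumption, not HapCHAT's published procedure.

```python
# Illustrative per-site error budget: the smallest k such that more than k
# of the reads covering a site are wrong with probability below `tail`,
# under a binomial error model. An assumption for illustration only.

from math import comb

def error_budget(coverage, err_rate=0.15, tail=1e-3):
    prob_le_k = 0.0
    for k in range(coverage + 1):
        prob_le_k += comb(coverage, k) * err_rate**k * (1 - err_rate)**(coverage - k)
        if 1 - prob_le_k < tail:
            return k
    return coverage

# A low-coverage column tolerates fewer corrections than a deep one.
for cov in (10, 30, 60):
    print(cov, error_budget(cov))
```

Bounding corrections per site this way keeps the dynamic programming table small even as coverage grows, which is the intuition behind handling higher coverage with lower runtimes.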


Author(s): Mengyang Xu, Lidong Guo, Xiao Du, Lei Li, Brock A. Peters, et al.

Abstract
Motivation: Achieving a near-complete understanding of how an individual's genome affects that individual's phenotypes requires deciphering the order of variations along homologous chromosomes in species with diploid genomes. However, true diploid assembly of long-range haplotypes remains challenging.
Results: To address this, we have developed Haplotype-resolved Assembly for Synthetic long reads using a Trio-binning strategy (HAST), which uses parental information to classify reads as maternal or paternal. Once sorted, these reads are used to independently de novo assemble the parent-specific haplotypes. We applied HAST to co-barcoded second-generation sequencing data from an Asian individual, resulting in a haplotype assembly covering 94.7% of the reference genome with a scaffold N50 longer than 11 Mb. The high haplotyping precision (∼99.7%) and recall (∼95.9%) represent a substantial improvement over the commonly used tool for assembling co-barcoded reads (Supernova) and are comparable to a trio-binning-based third-generation long-read assembly method (TrioCanu), but with a significantly higher single-base accuracy (up to 99.99997%, or Q65). This makes HAST a superior tool for accurate haplotyping and future haplotype-based studies.
Availability: The code of the analysis is available at https://github.com/BGI-Qingdao/HAST.
Supplementary information: Supplementary data are available at Bioinformatics online.
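Trio binning itself is straightforward to sketch: a read is assigned to whichever parent contributes more of its parent-unique k-mers. The toy Python below assumes exact matching with a tiny k and no canonicalization; production tools use large k on quality-filtered parental k-mer sets.

```python
# Hedged sketch of trio-binning read classification: reads go to the parent
# whose unique k-mers they contain more of. Toy k and no reverse-complement
# canonicalization; a simplification, not HAST's implementation.

def kmers(seq, k=5):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify(read, maternal_only, paternal_only, k=5):
    ks = kmers(read, k)
    m, p = len(ks & maternal_only), len(ks & paternal_only)
    if m > p:
        return "maternal"
    if p > m:
        return "paternal"
    return "unassigned"  # ambiguous reads can be given to both assemblies

# Toy parental k-mer sets: k-mers seen in one parent but not the other.
mom = kmers("ACGTACGTGGAT")
dad = kmers("ACGTACGTCCAT")
mom_only, dad_only = mom - dad, dad - mom
print(classify("TACGTGGATACG", mom_only, dad_only))  # maternal
```

Once binned, each parent-specific read set is assembled independently, which is what yields the fully haplotype-resolved assembly described above.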


2018
Vol 19 (1)
Author(s): Stefano Beretta, Murray D. Patterson, Simone Zaccaria, Gianluca Della Vedova, Paola Bonizzoni

2010
Vol 31 (7)
pp. 875-885
Author(s): Espen Melum, Sandra May, Markus B. Schilhabel, Ingo Thomsen, Tom H. Karlsen, et al.

2021
Vol 12
Author(s): Jose M. Haro-Moreno, Mario López-Pérez, Francisco Rodriguez-Valera

Third-generation sequencing has made little headway in metagenomics due to its high error rate and the field's reliance on assembly bioinformatics designed for short reads. However, second-generation sequencing metagenomics (mostly Illumina) suffers from limitations, particularly in the assembly of microbes with high microdiversity and in the retrieval of the flexible (adaptive) fraction of prokaryotic genomes. Here, we have used a third-generation technique to study the metagenome of a well-known marine sample from the mixed epipelagic water column of the winter Mediterranean. We compared PacBio Sequel II with the classical approach of Illumina NextSeq short reads followed by assembly. Long reads allow for efficient direct retrieval of complete genes, avoiding the bias of the assembly step. Moreover, applying long reads to metagenomic assembly allows for the reconstruction of much more complete metagenome-assembled genomes (MAGs), particularly from microbes with high microdiversity such as Pelagibacterales. The flexible genome of the reconstructed MAGs was much more complete, containing many adaptive genes (some with biotechnological potential). PacBio Sequel II CCS appears particularly suitable for cellular metagenomics due to its low error rate. For most applications of metagenomics, from community structure analysis to ecosystem functioning, long reads should be applied whenever possible. Specifically, for in silico screening of biotechnologically useful genes, or for population genomics, long-read metagenomics presently appears to be a very fruitful approach, since raw reads can be analyzed before a computationally demanding (and potentially artifactual) assembly step.


2012
Vol 5 (5)
pp. 1020-1028
Author(s): Himabindu Kudapa, Arvind K. Bharti, Steven B. Cannon, Andrew D. Farmer, Benjamin Mulaosmanovic, et al.

2016
Author(s): Anna Kuosmanen, Veli Mäkinen

Abstract
Motivation: Transcript prediction can be modelled as a graph problem where exons are modelled as nodes and reads spanning two or more exons are modelled as exon chains. PacBio third-generation sequencing technology produces significantly longer reads than earlier second-generation technologies, which gives valuable information about longer exon chains in a graph. However, with the high error rates of third-generation sequencing, aligning long reads correctly around the splice sites is a challenging task. Incorrect alignments lead to spurious nodes and arcs in the graph, which in turn lead to incorrect transcript predictions.
Results: We survey several approaches for finding the exon chains corresponding to long reads in a splicing graph, and experimentally study the performance of these methods using simulated data to allow for sensitivity/precision analysis. Our experiments show that short reads from second-generation sequencing can be used to significantly improve exon chain correctness, either by error-correcting the long reads before splicing graph creation, or by using them to create a splicing graph onto which the long-read alignments are then projected. We also study the memory and time consumption of the various modules, and show that accurate exon chains lead to significantly increased transcript prediction accuracy.
Availability: The simulated data and in-house scripts used for this article are available at http://cs.helsinki.fi/u/aekuosma/exon_chain_evaluation_publish.tar.gz.
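A concrete way to picture the projection strategy is to snap a long read's aligned blocks to annotated exon boundaries, recovering an exon chain despite noisy splice-site alignments. The tolerance-based matching in this Python sketch is an assumption for illustration, not the exact method evaluated in the paper.

```python
# Illustrative sketch: project a long-read alignment onto a splicing graph
# by snapping aligned blocks to the nearest annotated exon boundaries,
# yielding an exon chain. Tolerance-based snapping is one simple way to
# absorb alignment noise around splice sites.

def to_exon_chain(aligned_blocks, exons, tol=10):
    """aligned_blocks/exons: lists of (start, end); returns exon indices."""
    chain = []
    for b_start, b_end in aligned_blocks:
        for idx, (e_start, e_end) in enumerate(exons):
            if abs(b_start - e_start) <= tol and abs(b_end - e_end) <= tol:
                chain.append(idx)
                break
        else:
            return None  # block matches no annotated exon: discard the read
    return chain

exons = [(100, 200), (300, 420), (500, 650)]
read_blocks = [(104, 198), (301, 424), (495, 650)]  # noisy splice boundaries
print(to_exon_chain(read_blocks, exons))  # [0, 1, 2]
```

Each recovered chain contributes clean nodes and arcs to the splicing graph instead of the spurious ones a raw noisy alignment would create.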


2020
Vol 15
Author(s): Hongdong Li, Wenjing Zhang, Yuwen Luo, Jianxin Wang

Aims: To accurately detect isoforms from third-generation sequencing data.
Background: Transcriptome annotation is the basis for the analysis of gene expression and regulation. The transcriptome annotation of many organisms, including humans, is far from complete, due partly to the challenge of identifying isoforms that are produced from the same gene through alternative splicing. Third-generation sequencing (TGS) reads provide an unprecedented opportunity for detecting isoforms because their length exceeds that of most isoforms. One limitation of current TGS-based isoform detection methods is that they rely exclusively on sequence reads, without incorporating the sequence information of known isoforms.
Objective: To develop an efficient method for isoform detection.
Method: Based on annotated isoforms, we propose a splice isoform detection method called IsoDetect. First, the sequence at each exon-exon junction is extracted from annotated isoforms as a "short feature sequence", which is used to distinguish different splice isoforms. Second, we align these feature sequences to long reads and divide the long reads into groups that contain the same set of feature sequences, thereby avoiding pair-wise comparison among the large number of long reads. Third, clustering and consensus generation are carried out based on sequence similarity. For long reads that do not contain any short feature sequence, clustering analysis based on sequence similarity is performed to identify isoforms.
Result: Tested on two datasets from Calypte anna and zebra finch, IsoDetect showed higher speed and compelling accuracy compared with four existing methods.
Conclusion: IsoDetect is a promising method for isoform detection.
Other: This paper was accepted by the CBC2019 conference.
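The feature-sequence idea lends itself to a compact sketch: extract the sequence spanning each annotated exon-exon junction, then group long reads by the set of junction signatures they contain. The flank length and exact string matching below are simplifying assumptions, not IsoDetect's implementation.

```python
# Minimal sketch of the "short feature sequence" idea: the sequence spanning
# each exon-exon junction of an annotated isoform serves as a signature, and
# long reads are grouped by the set of signatures they contain.

from collections import defaultdict

def junction_features(exon_seqs, flank=4):
    """Feature = last `flank` bases of one exon + first `flank` of the next."""
    return {exon_seqs[i][-flank:] + exon_seqs[i + 1][:flank]
            for i in range(len(exon_seqs) - 1)}

def group_reads(reads, isoform_features):
    groups = defaultdict(list)
    for read in reads:
        found = frozenset(f for feats in isoform_features.values()
                          for f in feats if f in read)
        groups[found].append(read)  # same feature set -> same candidate isoform
    return groups

isoforms = {"iso1": junction_features(["AAAACCCC", "GGGGTTTT"]),
            "iso2": junction_features(["AAAACCCC", "CCAATTGG"])}
reads = ["TTAAAACCCCGGGGTTTTAA", "AAAACCCCCCAATTGGTT"]
for feats, members in group_reads(reads, isoforms).items():
    print(sorted(feats), members)
```

Grouping by signature set keeps the expensive clustering and consensus steps within groups, avoiding all-vs-all comparison of the long reads.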

