Haplotype-aware genotyping from noisy long reads

2018 ◽  
Author(s):  
Jana Ebler ◽  
Marina Haukness ◽  
Trevor Pesout ◽  
Tobias Marschall ◽  
Benedict Paten

Motivation: Current genotyping approaches for single nucleotide variations (SNVs) rely on short, relatively accurate reads from second-generation sequencing devices. Presently, third-generation sequencing platforms able to generate much longer reads are becoming more widespread. These platforms come with the significant drawback of higher sequencing error rates, which makes them ill-suited to current genotyping algorithms. However, the longer reads make more of the genome unambiguously mappable and typically provide linkage information between neighboring variants. Results: In this paper we introduce a novel approach for haplotype-aware genotyping from noisy long reads. We do this by considering bipartitions of the sequencing reads, corresponding to the two haplotypes. We formalize the computational problem in terms of a Hidden Markov Model and compute posterior genotype probabilities using the forward-backward algorithm. Genotype predictions can then be made by picking the most likely genotype at each site. Our experiments indicate that longer reads allow significantly more of the genome to potentially be accurately genotyped. Further, we are able to use both Oxford Nanopore and Pacific Biosciences sequencing data to independently validate millions of variants previously identified by short-read technologies in the reference NA12878 sample, including hundreds of thousands of variants that were not previously included in the high-confidence reference set.
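
The core computation named in the abstract, per-site genotype posteriors from the forward-backward algorithm, can be illustrated with a generic toy sketch. This is not the paper's exact HMM (which also bipartitions reads into haplotypes); the genotype states, transition matrix and emission likelihoods below are assumptions for illustration only.

```python
# Minimal, generic forward-backward sketch for per-site genotype posteriors.
# NOT the paper's model; states, transitions and emissions are illustrative.
import numpy as np

def genotype_posteriors(emis, trans, prior):
    """emis: (n_sites, n_states) likelihood of the read data at each site given each genotype,
    trans: (n_states, n_states) P(state at site i+1 | state at site i),
    prior: (n_states,) distribution over genotypes at the first site."""
    n, k = emis.shape
    fwd = np.zeros((n, k))
    bwd = np.ones((n, k))
    fwd[0] = prior * emis[0]
    fwd[0] /= fwd[0].sum()                      # scaling avoids numerical underflow
    for i in range(1, n):
        fwd[i] = emis[i] * (fwd[i - 1] @ trans)
        fwd[i] /= fwd[i].sum()
    for i in range(n - 2, -1, -1):
        bwd[i] = trans @ (emis[i + 1] * bwd[i + 1])
        bwd[i] /= bwd[i].sum()
    post = fwd * bwd
    return post / post.sum(axis=1, keepdims=True)

# Toy example: three sites, genotype states (0/0, 0/1, 1/1).
emis = np.array([[0.01, 0.60, 0.39],            # alt-heavy pileup
                 [0.70, 0.25, 0.05],            # ref-heavy pileup
                 [0.02, 0.49, 0.49]])           # ambiguous pileup
trans = np.full((3, 3), 0.05) + np.eye(3) * 0.85
prior = np.array([0.25, 0.5, 0.25])
print(genotype_posteriors(emis, trans, prior).argmax(axis=1))  # most likely genotype per site
```

Picking the argmax of each row of the posterior matrix corresponds to the abstract's "most likely genotype at each site".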

2017 ◽  
Author(s):  
Olivia Choudhury ◽  
Ankush Chakrabarty ◽  
Scott J. Emrich

Abstract: Second-generation sequencing techniques generate short reads that can result in fragmented genome assemblies. Third-generation sequencing platforms mitigate this limitation by producing longer reads that span across complex and repetitive regions. Currently, the usefulness of such long reads is limited, however, because of high sequencing error rates. To exploit the full potential of these longer reads, it is imperative to correct the underlying errors. We propose HECIL (Hybrid Error Correction with Iterative Learning), a hybrid error correction framework that determines a correction policy for erroneous long reads, based on optimal combinations of decision weights obtained from short read alignments. We demonstrate that HECIL outperforms state-of-the-art error correction algorithms for an overwhelming majority of evaluation metrics on diverse real data sets including E. coli, S. cerevisiae, and the malaria vector mosquito A. funestus. We further improve the performance of HECIL by introducing an iterative learning paradigm that improves the correction policy at each iteration by incorporating knowledge gathered from previous iterations via confidence metrics assigned to prior corrections. Availability and Implementation: https://github.com/NDBL/ Contact: [email protected]
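
As a rough illustration of how a correction decision might combine weights derived from short-read alignments, consider the sketch below. It is not HECIL's actual scoring policy or its iterative confidence update; the weighting scheme and fields are assumptions made only to show the kind of per-position decision involved.

```python
# Illustrative weighted correction decision for one long-read position.
# HECIL's real policy (optimal decision-weight combination, iterative learning
# over confidence metrics) is more involved; weights here are assumptions.
from collections import defaultdict

def correct_base(candidates, w_identity=0.5, w_quality=0.5):
    """candidates: list of (base, alignment_identity in [0,1], base_quality_prob in [0,1])
    from short reads covering the position; returns (chosen_base, confidence)."""
    score = defaultdict(float)
    for base, identity, quality in candidates:
        score[base] += w_identity * identity + w_quality * quality
    best = max(score, key=score.get)
    confidence = score[best] / sum(score.values())   # could feed a later iteration
    return best, confidence

print(correct_base([("A", 0.98, 0.99), ("A", 0.95, 0.90), ("C", 0.80, 0.60)]))
```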


2019 ◽  
Author(s):  
Alberto Magi

Abstract Background: Human genomes are diploid, meaning they carry two homologous copies of each chromosome, and assigning heterozygous variants to each chromosome copy, the haplotype assembly problem, is of fundamental importance for medical and population genetics. Short reads from second-generation sequencing platforms drastically limit haplotype reconstruction, as the great majority of reads cannot link many variants together, whereas longer reads from third-generation sequencing can span several variants along the genome, allowing much longer haplotype blocks to be inferred. However, most haplotype assembly algorithms, originally devised for short sequences, fail when applied to noisy long-read data, and although new algorithms have been developed specifically for the properties of this generation of sequences, these methods can only manage datasets with limited coverage. Results: To overcome the limits of currently available algorithms, I propose a novel formulation of the single-individual haplotype assembly problem based on maximum allele co-occurrence (MAC), and I develop an ultra-fast algorithm that can reconstruct the haplotype structure of a diploid genome from low- and high-coverage long-read datasets with high accuracy. I test my algorithm (MAtCHap) on synthetic and real PacBio and Nanopore human datasets and compare its results with eight other state-of-the-art algorithms. All of these analyses show that MAtCHap outperforms the other methods in terms of accuracy, contiguity, completeness and computational speed. Availability: MAtCHap is publicly available at https://sourceforge.net/projects/matchap/.
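
The abstract does not spell out the MAC formulation, but the underlying signal, how often pairs of alleles co-occur on the same read, can be conveyed with a toy greedy sketch that chains consecutive heterozygous sites by their most frequent co-occurrence pattern. The chaining rule and data layout below are assumptions for illustration, not the MAtCHap algorithm.

```python
# Toy allele co-occurrence phasing between consecutive heterozygous sites.
# This greedy chaining is an illustration only, not the MAC formulation.

def phase_by_cooccurrence(reads, sites):
    """reads: list of dicts {site_index: allele in {0, 1}} observed on each long read.
    sites: ordered heterozygous site indices. Returns one haplotype as a list of
    alleles (the other haplotype is its complement)."""
    hap = [0]                                    # fix the phase of the first site
    for prev, curr in zip(sites, sites[1:]):
        cis = trans = 0
        for read in reads:                       # count how the two sites co-occur on reads
            if prev in read and curr in read:
                if read[prev] == read[curr]:
                    cis += 1
                else:
                    trans += 1
        same_phase = cis >= trans                # keep the orientation seen most often
        hap.append(hap[-1] if same_phase else 1 - hap[-1])
    return hap

reads = [{0: 0, 1: 0, 2: 1}, {0: 0, 1: 1}, {1: 0, 2: 1}, {0: 1, 1: 1, 2: 0}]
print(phase_by_cooccurrence(reads, [0, 1, 2]))   # -> [0, 0, 1]
```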


Author(s):  
Mengyang Xu ◽  
Lidong Guo ◽  
Xiao Du ◽  
Lei Li ◽  
Brock A Peters ◽  
...  

Abstract Motivation: Achieving a near complete understanding of how the genome of an individual affects the phenotypes of that individual requires deciphering the order of variations along homologous chromosomes in species with diploid genomes. However, true diploid assembly of long-range haplotypes remains challenging. Results: To address this, we have developed Haplotype-resolved Assembly for Synthetic long reads using a Trio-binning strategy, or HAST, which uses parental information to classify reads as maternal or paternal. Once sorted, these reads are used to independently de novo assemble the parent-specific haplotypes. We applied HAST to co-barcoded second-generation sequencing data from an Asian individual, resulting in a haplotype assembly covering 94.7% of the reference genome with a scaffold N50 longer than 11 Mb. The high haplotyping precision (∼99.7%) and recall (∼95.9%) represent a substantial improvement over the commonly used tool for assembling co-barcoded reads (Supernova) and are comparable to a trio-binning-based assembly method for third-generation long reads (TrioCanu), but with a significantly higher single-base accuracy (up to 99.99997%, Q65). This makes HAST a superior tool for accurate haplotyping and future haplotype-based studies. Availability: The code of the analysis is available at https://github.com/BGI-Qingdao/HAST. Supplementary information: Supplementary data are available at Bioinformatics online.
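
A common way to realize the trio-binning idea, classifying each read by counts of parent-specific k-mers, is sketched below. This is a schematic of the general strategy only; the k-mer length, threshold and handling of ambiguous reads are assumptions, not HAST's implementation.

```python
# Schematic trio-binning classifier: assign a read to the parent whose
# parent-specific k-mers it contains more of. Parameters are assumptions;
# HAST's actual pipeline is more elaborate.

def kmers(seq, k=21):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify_read(read, maternal_only, paternal_only, min_hits=2):
    """maternal_only / paternal_only: k-mers present in one parent but not the other."""
    km = kmers(read)
    m, p = len(km & maternal_only), len(km & paternal_only)
    if max(m, p) < min_hits or m == p:
        return "unassigned"        # ambiguous reads are not forced into either bin
    return "maternal" if m > p else "paternal"
```

Reads binned this way are then assembled independently per parent, yielding the two haplotype assemblies described in the abstract.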


2019 ◽  
Author(s):  
Camille Marchet ◽  
Pierre Morisse ◽  
Lolita Lecompte ◽  
Arnaud Lefebvre ◽  
Thierry Lecroq ◽  
...  

Abstract Motivation: In the last few years, the error rates of third-generation sequencing data have remained above 5%, with many insertions and deletions, and an increasing number of long-read correction methods have therefore been proposed to reduce the noise in these sequences. Multiple approaches exist, both hybrid and self-correction. As the quality of the error correction has a large impact on downstream processes, methods that evaluate error correction tools with precise and reliable statistics are crucially needed. Since error correction is often a resource bottleneck in long-read pipelines, a key requirement for assessment methods is efficiency, so that different tools can be compared quickly. Results: We propose ELECTOR, a reliable and efficient tool to evaluate long-read correction that handles both hybrid and self-correction methods. Our tool provides a complete and relevant set of metrics to assess read quality improvement after correction and scales to large datasets. ELECTOR is directly compatible with a wide range of state-of-the-art error correction tools, using either simulated or real long reads. We show that ELECTOR provides a wider range of metrics than the state-of-the-art tool, LRCstats, while substantially decreasing the runtime needed for assessment on all studied datasets. Availability: ELECTOR is available at https://github.com/kamimrcht/ Contact: [email protected] or [email protected]
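
One basic ingredient of any correction assessment is the residual error rate of a corrected read against its true sequence. The sketch below computes such a per-read rate with a plain dynamic-programming edit distance; it only illustrates the kind of quantity involved and is not ELECTOR's actual metric set or pipeline.

```python
# Per-read residual error rate via edit distance against the true sequence.
# Illustrative only; ELECTOR computes a much richer set of metrics.

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution / match
        prev = curr
    return prev[-1]

def error_rate(read, truth):
    return edit_distance(read, truth) / max(len(truth), 1)

raw, corrected, truth = "ACGTTAGCA", "ACGTTTGCAA", "ACGTTTGCAA"
print(error_rate(raw, truth), error_rate(corrected, truth))  # before vs after correction
```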


GigaScience ◽  
2020 ◽  
Vol 9 (10) ◽  
Author(s):  
Davide Bolognini ◽  
Alberto Magi ◽  
Vladimir Benes ◽  
Jan O Korbel ◽  
Tobias Rausch

Abstract Background: Tandem repeat sequences are widespread in the human genome, and their expansions cause multiple repeat-mediated disorders. Genome-wide discovery approaches are needed to fully elucidate their roles in health and disease, but resolving tandem repeat variation accurately remains a challenging task. While traditional mapping-based approaches using short-read data have severe limitations in the size and type of tandem repeats they can resolve, recent third-generation sequencing technologies exhibit substantially higher sequencing error rates, which complicates repeat resolution. Results: We developed TRiCoLOR, a freely available tool for tandem repeat profiling using error-prone long reads from third-generation sequencing technologies. The method can identify repetitive regions in sequencing data without prior knowledge of their motifs or locations and resolve repeat multiplicity and period size in a haplotype-specific manner. The tool includes methods to interactively visualize the identified repeats and to trace their Mendelian consistency in pedigrees. Conclusions: TRiCoLOR demonstrates excellent performance and improved sensitivity and specificity compared with alternative tools on synthetic data. For real human whole-genome sequencing data, TRiCoLOR achieves high validation rates, suggesting its suitability for identifying tandem repeat variation in personal genomes.
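
The quantities the tool reports, repeat period size and multiplicity, can be made concrete on a single error-free sequence with the classic prefix-function (smallest period) trick. This toy sketch is independent of TRiCoLOR's haplotype-resolved method, which has to handle noisy reads and unknown motif locations.

```python
# Toy detection of the period and copy number of a perfect tandem repeat
# using the prefix function; illustrates the reported quantities only.

def smallest_period(s):
    pi = [0] * len(s)
    for i in range(1, len(s)):
        j = pi[i - 1]
        while j and s[i] != s[j]:
            j = pi[j - 1]
        pi[i] = j + 1 if s[i] == s[j] else 0
    p = len(s) - pi[-1]
    return p if len(s) % p == 0 else len(s)      # fall back to the whole string

seq = "CAGCAGCAGCAG"
period = smallest_period(seq)
print(seq[:period], len(seq) // period)          # motif 'CAG', multiplicity 4
```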


2020 ◽  
Vol 21 (23) ◽  
pp. 9177
Author(s):  
Simone Maestri ◽  
Maria Giovanna Maturo ◽  
Emanuela Cosentino ◽  
Luca Marcolungo ◽  
Barbara Iadarola ◽  
...  

The reconstruction of individual haplotypes can facilitate the interpretation of disease risks; however, high costs and technical challenges still hinder their assessment in clinical settings. Second-generation sequencing is the gold standard for variant discovery but, because it produces short reads covering small genomic regions, it allows only indirect haplotyping based on statistical methods. In contrast, third-generation methods such as the nanopore sequencing platform developed by Oxford Nanopore Technologies (ONT) generate long reads that can be used for direct haplotyping, with fewer drawbacks. However, robust standards for variant phasing in ONT-based target resequencing efforts are not yet available. In this study, we present a streamlined proof-of-concept workflow for variant calling and phasing based on ONT data in a clinically relevant 12-kb region of the APOE locus, a hotspot for variants and haplotypes associated with aging-related diseases and longevity. Starting with sequencing data from simple amplicons of the target locus, we demonstrate that ONT data allow reliable single-nucleotide variant (SNV) calling and phasing from as few as 60 reads, although the recognition of indels is less efficient. Even so, we identified the best combination of ONT read set size (600 reads) and software (BWA/Minimap2 and HapCUT2) that enables full haplotype reconstruction when both SNVs and indels have been identified previously using a highly accurate sequencing platform. In conclusion, we established a rapid and inexpensive workflow for variant phasing based on ONT long reads. This allows multiple samples to be analyzed in parallel and can easily be implemented in routine clinical practice, including diagnostic testing.
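
As a small illustration of the end product of such a workflow, once the heterozygous SNVs in the target region have been phased, the two haplotype sequences can be reconstructed by applying each phased allele to the reference segment. The coordinates and variants below are made up; the sketch does not reproduce the paper's BWA/Minimap2 and HapCUT2 steps, which produce the phased calls consumed here.

```python
# Build the two haplotype sequences of a target region from phased SNVs.
# Reference and variants are hypothetical stand-ins for the 12-kb APOE segment.

def apply_phased_snvs(reference, phased_snvs):
    """phased_snvs: list of (position, allele_hap1, allele_hap2), 0-based positions."""
    hap1, hap2 = list(reference), list(reference)
    for pos, a1, a2 in phased_snvs:
        hap1[pos], hap2[pos] = a1, a2
    return "".join(hap1), "".join(hap2)

ref = "TTGCAGGCGG"                               # stand-in reference segment
calls = [(2, "G", "T"), (7, "C", "G")]           # hypothetical phased heterozygous SNVs
print(apply_phased_snvs(ref, calls))             # two haplotype sequences of the region
```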


2019 ◽  
Author(s):  
Adriel Latorre-Pérez ◽  
Pascual Villalba-Bermell ◽  
Javier Pascual ◽  
Manuel Porcar ◽  
Cristina Vilanova

ABSTRACT Background: Metagenomic sequencing has led to the recovery of previously unexplored microbial genomes. However, short-read sequencing platforms often result in highly fragmented metagenomes, complicating downstream analyses. Third-generation sequencing technologies, such as MinION, could lead to more contiguous assemblies thanks to their ability to generate long reads. Nevertheless, there is a lack of studies evaluating the suitability of the available assembly tools for this new type of data. Findings: We benchmarked the ability of different short-read and long-read tools to assemble two different commercially available mock communities, and observed remarkable differences in the resulting assemblies depending on the software of choice. Short-read metagenomic assemblers proved unsuitable for MinION data. Among the long-read assemblers tested, Flye and Canu were the only ones that performed well across all datasets. These tools were able to retrieve complete individual genomes directly from the metagenome, and assembled a bacterial genome in only two contigs in the best case. Despite the intrinsically high error of long-read technologies, Canu and Flye led to highly accurate assemblies (~99.4-99.8% accuracy). However, errors still had an impact on the prediction of biosynthetic gene clusters. Conclusions: MinION metagenomic sequencing data proved sufficient for assembling low-complexity microbial communities, leading to the recovery of highly complete and contiguous individual genomes. This work is the first systematic evaluation of the performance of different assembly tools on MinION data, and may help other researchers wishing to use this technology to choose the most appropriate software for their goals. Future work is still needed to assess the performance of Oxford Nanopore MinION data on more complex microbiomes.


Author(s):  
Kang Hu ◽  
Neng Huang ◽  
You Zou ◽  
Xingyu Liao ◽  
Jianxin Wang

Abstract Motivation: Compared with second-generation sequencing technologies, third-generation sequencing technologies allow us to obtain longer reads (average ∼10 kbp, maximum 900 kbp), but at a higher error rate (∼15%). Nanopolish is a variant and methylation detection tool based on a hidden Markov model, which uses Oxford Nanopore sequencing data for signal-level analysis. Nanopolish can greatly improve the accuracy of an assembly, but it is limited by long running times, since most of the Nanopolish workflow is a serial and computationally expensive process. Results: In this paper, we present an effective polishing tool, Multithreading Nanopolish (MultiNanopolish), which decomposes the whole iterative calculation process of Nanopolish into small independent calculation tasks, making it possible to run this process in parallel. Experimental results show that MultiNanopolish reduces running time by 50% with a read-uncorrected assembler (Miniasm) and by 20% with read-corrected assemblers (Canu and Flye) when using 40 threads, compared to the original Nanopolish. Availability and implementation: MultiNanopolish is available at GitHub: https://github.com/BioinformaticsCSU/MultiNanopolish Supplementary information: Supplementary data are available at Bioinformatics online.
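
The speed-up described above comes from splitting a serial, expensive computation into independent tasks and executing them concurrently. The sketch below shows this decomposition pattern generically with Python's multiprocessing; the task function is a hypothetical placeholder, not the Nanopolish or MultiNanopolish API.

```python
# Generic sketch of the parallelization idea: split the genome into windows
# whose polishing computations are independent, then process them in a pool.
# polish_window is a hypothetical placeholder, not a Nanopolish call.
from multiprocessing import Pool

def polish_window(window):
    start, end = window
    # ... run the (expensive, independent) consensus computation for this window ...
    return (start, end, f"consensus_{start}_{end}")

def polish_parallel(genome_length, window_size=50_000, threads=40):
    windows = [(s, min(s + window_size, genome_length))
               for s in range(0, genome_length, window_size)]
    with Pool(threads) as pool:
        return pool.map(polish_window, windows)      # independent tasks run concurrently

if __name__ == "__main__":
    print(len(polish_parallel(1_000_000, threads=4)))  # 20 windows polished in parallel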


Author(s):  
Jinming Wang ◽  
Kai Chen ◽  
Qiaoyun Ren ◽  
Ying Zhang ◽  
Junlong Liu ◽  
...  

Background: Emerging long-read sequencing technology has greatly changed the landscape of whole-genome sequencing, enabling scientists to contribute to decoding the genetic information of non-model species. The sequences generated by PacBio or Oxford Nanopore Technology (ONT) must be assembled de novo before further analyses. Several de novo genome assemblers have been developed to assemble long reads generated by ONT, but their performance has not been fully investigated and genome assembly remains a challenging task. Methods and Results: We systematically evaluated the performance of nine de novo assemblers for ONT data on datasets of different coverage depths. Several metrics were measured to determine the performance of these tools, including N50 length, sequence coverage, runtime, ease of use, genome accuracy and genomic completeness at varying depths of coverage. Based on the results of our assessments, the performance of these tools can be summarized as follows: 1) coverage depth has a significant effect on genome quality; 2) the contiguity of the assembled genome varies dramatically among different de novo tools; 3) the correctness of an assembled genome is closely related to its completeness. Nanopore data at more than 30× coverage can be assembled into a relatively complete genome, the quality of which depends heavily on polishing with next-generation sequencing data. Conclusion: Based on the results of our investigation, the advantages and disadvantages of each tool are summarized and guidelines for selecting assembly tools under specific conditions are provided.
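
Two of the metrics used in the evaluation, N50 length and completeness, are easy to state precisely. The sketch below computes N50 from a list of contig lengths and uses assembled bases over an assumed genome size as a simple completeness proxy; the contig lengths and genome size are invented for illustration.

```python
# N50 and a simple completeness proxy (assembled bases / genome size) from
# contig lengths; values below are assumed, for illustration only.

def n50(contig_lengths):
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:                 # shortest contig covering >= 50% of bases
            return length
    return 0

contigs = [4_000_000, 2_500_000, 900_000, 300_000, 120_000]
genome_size = 8_000_000
print(n50(contigs), sum(contigs) / genome_size)  # N50 and fraction of the genome assembled
```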


2016 ◽  
Author(s):  
Anna Kuosmanen ◽  
Veli Mäkinen

Abstract Motivation: Transcript prediction can be modelled as a graph problem where exons are modelled as nodes and reads spanning two or more exons are modelled as exon chains. PacBio third-generation sequencing technology produces significantly longer reads than earlier second-generation sequencing technologies, which gives valuable information about longer exon chains in a graph. However, with the high error rates of third-generation sequencing, aligning long reads correctly around the splice sites is a challenging task. Incorrect alignments lead to spurious nodes and arcs in the graph, which in turn lead to incorrect transcript predictions. Results: We survey several approaches to find the exon chains corresponding to long reads in a splicing graph, and experimentally study the performance of these methods using simulated data to allow for sensitivity/precision analysis. Our experiments show that short reads from second-generation sequencing can be used to significantly improve exon chain correctness, either by error-correcting the long reads before splicing graph creation, or by using them to create a splicing graph onto which the long read alignments are then projected. We also study the memory and time consumption of various modules, and show that accurate exon chains lead to significantly increased transcript prediction accuracy. Availability: The simulated data and in-house scripts used for this article are available at http://cs.helsinki.fi/u/aekuosma/exon_chain_evaluation_publish.tar.gz.
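
The graph formulation in the abstract, exons as nodes and reads as exon chains, can be made concrete with a tiny sketch that checks whether a long read's exon chain is a valid path in a splicing graph built from observed junctions (for example, short-read-derived ones). The graph and chains below are invented for illustration and do not reproduce the paper's projection procedure.

```python
# Tiny splicing-graph sketch: nodes are exons, edges are observed junctions,
# and a read is represented by the chain of exons it spans. Data are invented.

def build_graph(junctions):
    graph = {}
    for a, b in junctions:                       # directed edge exon a -> exon b
        graph.setdefault(a, set()).add(b)
    return graph

def chain_is_supported(chain, graph):
    """True if every consecutive exon pair in the read's chain is an edge of the graph."""
    return all(b in graph.get(a, set()) for a, b in zip(chain, chain[1:]))

graph = build_graph([("e1", "e2"), ("e2", "e3"), ("e1", "e3"), ("e3", "e4")])
print(chain_is_supported(("e1", "e3", "e4"), graph))   # True: exon-skipping path exists
print(chain_is_supported(("e2", "e4"), graph))         # False: junction not in the graph
```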

