Long read nanopore sequencing for detection of HLA and CYP2D6 variants and haplotypes

Haplotypes are often critical for the interpretation of genetic laboratory observations into medically actionable findings. Current massively parallel DNA sequencing technologies produce short sequence reads that are often unable to resolve haplotype information. Phasing short read data typically requires supplemental statistical phasing based on known haplotype structure in the population or parental genotypic data. Here we demonstrate that the MinION nanopore sequencer is capable of producing very long reads to resolve both variants and haplotypes of HLA-A, HLA-B and CYP2D6 genes important in determining patient drug response in sample NA12878 of CEPH/UTAH pedigree 1463, without the need for statistical phasing. Long read data from a single 24-hour nanopore sequencing run was used to reconstruct haplotypes, which were confirmed by HapMap data and statistically phased Complete Genomics and Sequenom genotypes. Our results demonstrate that nanopore sequencing is an emerging standalone technology with potential utility in a clinical environment to aid in medical decision-making.

Download Full-text

Long read nanopore sequencing for detection of HLA and CYP2D6 variants and haplotypes

F1000Research ◽

10.12688/f1000research.6037.1 ◽

2015 ◽

Vol 4 ◽

pp. 17 ◽

Cited By ~ 8

Author(s):

Ron Ammar ◽

Tara A. Paton ◽

Dax Torti ◽

Adam Shlien ◽

Gary D. Bader

Keyword(s):

Medical Decision ◽

Nanopore Sequencing ◽

Clinical Environment ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

Complete Genomics ◽

Nanopore Sequencer ◽

Actionable Findings ◽

Haplotype Information

Download Full-text

Evaluation of Germline Structural Variant Calling Methods for Nanopore Sequencing Data

Frontiers in Genetics ◽

10.3389/fgene.2021.761791 ◽

2021 ◽

Vol 12 ◽

Author(s):

Davide Bolognini ◽

Alberto Magi

Keyword(s):

Variant Calling ◽

Research Report ◽

Nanopore Sequencing ◽

Sequencing Data ◽

Factors Affecting ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Sequencing Studies ◽

Long Read

Structural variants (SVs) are genomic rearrangements that involve at least 50 nucleotides and are known to have a serious impact on human health. While prior short-read sequencing technologies have often proved inadequate for a comprehensive assessment of structural variation, more recent long reads from Oxford Nanopore Technologies have already been proven invaluable for the discovery of large SVs and hold the potential to facilitate the resolution of the full SV spectrum. With many long-read sequencing studies to follow, it is crucial to assess factors affecting current SV calling pipelines for nanopore sequencing data. In this brief research report, we evaluate and compare the performances of five long-read SV callers across four long-read aligners using both real and synthetic nanopore datasets. In particular, we focus on the effects of read alignment, sequencing coverage, and variant allele depth on the detection and genotyping of SVs of different types and size ranges and provide insights into precision and recall of SV callsets generated by integrating the various long-read aligners and SV callers. The computational pipeline we propose is publicly available at https://github.com/davidebolo1993/EViNCe and can be adjusted to further evaluate future nanopore sequencing datasets.

Download Full-text

Hapo-G, haplotype-aware polishing of genome assemblies with accurate reads

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab034 ◽

2021 ◽

Vol 3 (2) ◽

Author(s):

Jean-Marc Aury ◽

Benjamin Istace

Keyword(s):

Single Molecule ◽

Direct Consequence ◽

High Quality ◽

Sequencing Errors ◽

Coding Regions ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Genome Assemblies

Abstract Single-molecule sequencing technologies have recently been commercialized by Pacific Biosciences and Oxford Nanopore with the promise of sequencing long DNA fragments (kilobases to megabases order) and then, using efficient algorithms, provide high quality assemblies in terms of contiguity and completeness of repetitive regions. However, the error rate of long-read technologies is higher than that of short-read technologies. This has a direct consequence on the base quality of genome assemblies, particularly in coding regions where sequencing errors can disrupt the coding frame of genes. In the case of diploid genomes, the consensus of a given gene can be a mixture between the two haplotypes and can lead to premature stop codons. Several methods have been developed to polish genome assemblies using short reads and generally, they inspect the nucleotide one by one, and provide a correction for each nucleotide of the input assembly. As a result, these algorithms are not able to properly process diploid genomes and they typically switch from one haplotype to another. Herein we proposed Hapo-G (Haplotype-Aware Polishing Of Genomes), a new algorithm capable of incorporating phasing information from high-quality reads (short or long-reads) to polish genome assemblies and in particular assemblies of diploid and heterozygous genomes.

Download Full-text

Fast and sensitive mapping of error-prone nanopore sequencing reads with GraphMap

10.1101/020719 ◽

2015 ◽

Cited By ~ 1

Author(s):

Ivan Sovic ◽

Mile Sikic ◽

Andreas Wilm ◽

Shannon Nicole Fenlon ◽

Swaine Chen ◽

...

Keyword(s):

Human Genome ◽

Variant Calling ◽

Error Rates ◽

Nanopore Sequencing ◽

Structural Variants ◽

Specific Identification ◽

Long Reads ◽

Long Read ◽

Specific Error ◽

Very High

Exploiting the power of nanopore sequencing requires the development of new bioinformatics approaches to deal with its specific error characteristics. We present the first nanopore read mapper (GraphMap) that uses a read-funneling paradigm to robustly handle variable error rates and fast graph traversal to align long reads with speed and very high precision (>95%). Evaluation on MinION sequencing datasets against short and long-read mappers indicates that GraphMap increases mapping sensitivity by at least 15-80%. GraphMap alignments are the first to demonstrate consensus calling with <1 error in 100,000 bases, variant calling on the human genome with 76% improvement in sensitivity over the next best mapper (BWA-MEM), precise detection of structural variants from 100bp to 4kbp in length and species and strain-specific identification of pathogens using MinION reads. GraphMap is available open source under the MIT license at https://github.com/isovic/graphmap.

Download Full-text

Expanding an expanded genome: long-read sequencing ofTrypanosoma cruzi

10.1101/279174 ◽

2018 ◽

Cited By ~ 3

Author(s):

Luisa Berná ◽

Matías Rodríguez ◽

María Laura Chiribao ◽

Adriana Parodi-Talice ◽

Sebastián Pita ◽

...

Keyword(s):

Repetitive Sequences ◽

Gc Content ◽

Gene Copy ◽

Accurate Estimation ◽

Sequencing Technologies ◽

Copy Numbers ◽

Long Reads ◽

Long Read ◽

Large Clusters ◽

Gene Copy Numbers

Although the genome ofTrypanosoma cruzi, the causative agent of Chagas disease, was first made available in 2005, with additional strains reported later, the intrinsic genome complexity of this parasite (abundance of repetitive sequences and genes organized in tandem) has traditionally hindered high-quality genome assembly and annotation. This also limits diverse types of analyses that require high degree of precision. Long reads generated by third-generation sequencing technologies are particularly suitable to address the challenges associated withT. cruzi´sgenome since they permit directly determining the full sequence of large clusters of repetitive sequences without collapsing them. This, in turn, allows not only accurate estimation of gene copy numbers but also circumvents assembly fragmentation. Here, we present the analysis of the genome sequences of twoT. cruziclones: the hybrid TCC (DTU TcVI) and the non-hybrid Dm28c (DTU TcI), determined by PacBio SMRT technology. The improved assemblies herein obtained permitted us to accurately estimate gene copy numbers, abundance and distribution of repetitive sequences (including satellites and retroelements). We found that the genome ofT. cruziis composed of a "core compartment" and a "disruptive compartment" which exhibit opposite gene and GC content composition. New tandem and disperse repetitive sequences were identified, including some located inside coding sequences. Additionally, homologous chromosomes were separately assembled, allowing us to retrieve haplotypes as separate contigs instead of a unique mosaic sequence. Finally, manual annotation of surface multigene families MUC and trans-sialidases allows now a better overview of these complex groups of genes.

Download Full-text

LRSDAY: Long-read Sequencing Data Analysis for Yeasts

10.1101/184572 ◽

2017 ◽

Author(s):

Jia-Xing Yue ◽

Gianni Liti

Keyword(s):

Genome Assembly ◽

Model Organism ◽

Sequencing Data ◽

Protein Coding ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

Downstream Analysis ◽

Eukaryotic Organisms ◽

Genomic Regions

AbstractLong-read sequencing technologies have become increasingly popular in genome projects due to their strengths in resolving complex genomic regions. As a leading model organism with small genome size and great biotechnological importance, the budding yeast, Saccharomyces cerevisiae, has many isolates currently being sequenced with long reads. However, analyzing long-read sequencing data to produce high-quality genome assembly and annotation remains challenging. Here we present LRSDAY, the first one-stop solution to streamline this process. LRSDAY can produce chromosome-level end-to-end genome assembly and comprehensive annotations for various genomic features (including centromeres, protein-coding genes, tRNAs, transposable elements and telomere-associated elements) that are ready for downstream analysis. Although tailored for S. cerevisiae, we designed LRSDAY to be highly modular and customizable, making it adaptable for virtually any eukaryotic organisms. Applying LRSDAY to a S. cerevisiae strain takes ∼43 hrs to generate a complete and well-annotated genome from ∼100X Pacific Biosciences (PacBio) reads using four threads.

Download Full-text

QAlign: Aligning nanopore reads accurately using current-level modeling

10.1101/862813 ◽

2019 ◽

Author(s):

Dhaivat Joshi ◽

Shunfu Mao ◽

Sreeram Kannan ◽

Suhas Diggavi

Keyword(s):

Reference Genome ◽

Genomic Analysis ◽

Vital Role ◽

High Error Rate ◽

Sequencing Technology ◽

Long Reads ◽

A Genome ◽

Long Read ◽

Nanopore Sequencer ◽

Sequencing Process

AbstractMotivationEfficient and accurate alignment of DNA / RNA sequence reads to each other or to a reference genome / transcriptome is an important problem in genomic analysis. Nanopore sequencing has emerged as a major sequencing technology and many long-read aligners have been designed for aligning nanopore reads. However, the high error rate makes accurate and efficient alignment difficult. Utilizing the noise and error characteristics inherent in the sequencing process properly can play a vital role in constructing a robust aligner. In this paper, we design QAlign, a pre-processor that can be used with any long-read aligner for aligning long reads to a genome / transcriptome or to other long reads. The key idea in QAlign is to convert the nucleotide reads into discretized current levels that capture the error modes of the nanopore sequencer before running it through a sequence aligner.ResultsWe show that QAlign is able to improve alignment rates from around 80% up to 90% with nanopore reads when aligning to the genome. We also show that QAlign improves the average overlap quality by 9.2%, 2.5% and 10.8% in three real datasets for read-to-read alignment. Read-to-transcriptome alignment rates are improved from 51.6% to 75.4% and 82.6% to 90% in two real datasets.Availabilityhttps://github.com/joshidhaivat/QAlign.git

Download Full-text

NanoCaller for accurate detection of SNPs and indels in difficult-to-map regions from long-read sequencing by haplotype-aware deep neural networks

10.1101/2019.12.29.890418 ◽

2019 ◽

Cited By ~ 1

Author(s):

Umair Ahsan ◽

Qian Liu ◽

Li Fang ◽

Kai Wang

Keyword(s):

Deep Neural Network ◽

Deep Neural Networks ◽

Variant Calling ◽

Sequencing Data ◽

Long Reads ◽

Novel Variants ◽

Long Read ◽

Variant Detection ◽

Genomic Regions ◽

Haplotype Information

AbstractVariant (SNPs/indels) detection from high-throughput sequencing data remains an important yet unresolved problem. Long-read sequencing enables variant detection in difficult-to-map genomic regions that short-read sequencing cannot reliably examine (for example, only ~80% of genomic regions are marked as “high-confidence region” to have SNP/indel calls in the Genome In A Bottle project); however, the high per-base error rate poses unique challenges in variant detection. Existing methods on long-read data typically rely on analyzing pileup information from neighboring bases surrounding a candidate variant, similar to short-read variant callers, yet the benefits of much longer read length are not fully exploited. Here we present a deep neural network called NanoCaller, which detects SNPs by examining pileup information solely from other nonadjacent candidate SNPs that share the same long reads using long-range haplotype information. With called SNPs by NanoCaller, NanoCaller phases long reads and performs local realignment on two sets of phased reads to call indels by another deep neural network. Extensive evaluation on 5 human genomes (sequenced by Nanopore and PacBio long-read techniques) demonstrated that NanoCaller greatly improved performance in difficult-to-map regions, compared to other long-read variant callers. We experimentally validated 41 novel variants in difficult-to-map regions in a widely-used benchmarking genome, which cannot be reliably detected previously. We extensively evaluated the run-time characteristics and the sensitivity of parameter settings of NanoCaller to different characteristics of sequencing data. Finally, we achieved the best performance in Nanopore-based variant calling from MHC regions in the PrecisionFDA Variant Calling Challenge on Difficult-to-Map Regions by ensemble calling. In summary, by incorporating haplotype information in deep neural networks, NanoCaller facilitates the discovery of novel variants in complex genomic regions from long-read sequencing data.

Download Full-text

Impact of Chimera-less Long Reads on Metagenomics of Human Gut Viromes Treated With Multiple Displacement Amplification

10.21203/rs.3.rs-58640/v1 ◽

2020 ◽

Author(s):

Yuya Kiguchi ◽

Suguru Nishijima ◽

Naveen Kumar ◽

Masahira Hattori ◽

Wataru Suda

Keyword(s):

De Novo ◽

Multiple Displacement Amplification ◽

Read Length ◽

Human Gut ◽

Average Read Length ◽

Sequencing Technologies ◽

Genomic Complexity ◽

Long Reads ◽

Long Read ◽

The Impact

Abstract Background: The ecological and biological features of the indigenous phage community (virome) in the human gut microbiome are poorly understood, possibly due to many fragmented contigs and fewer complete genomes based on conventional short-read metagenomics. Long-read sequencing technologies have attracted attention as an alternative approach to reconstruct long and accurate contigs from microbial communities. However, the impact of long-read metagenomics on human gut virome analysis has not been well evaluated. Results: Here we present chimera-less PacBio long-read metagenomics of multiple displacement amplification (MDA)-treated human gut virome DNA. The method included the development of a novel bioinformatics tool, SACRA (Split Amplified Chimeric Read Algorithm), which efficiently detects and splits numerous chimeric reads in PacBio reads from the MDA-treated virome samples. SACRA treatment of PacBio reads from five samples markedly reduced the average chimera ratio from 72 to 1.5%, generating chimera-less PacBio reads with an average read-length of 1.8 kb. De novo assembly of the chimera-less long reads generated contigs with an average N50 length of 11.1 kb, whereas those of MiSeq short reads from the same samples were 0.7 kb, dramatically improving contig extension. Alignment of both contig sets generated 378 high-quality merged contigs (MCs) composed of the minimum scaffolds of 434 MiSeq and 637 PacBio contigs, respectively, and also identified numerous MiSeq short fragmented contigs ≤500 bp additionally aligned to MCs, which possibly originated from a small fraction of MiSeq chimeric reads. The alignment also revealed that fragmentations of the scaffolded MiSeq contigs were caused primarily by genomic complexity of the community, including local repeats, hypervariable regions, and highly conserved sequences in and between the phage genomes. We identified 142 complete and near-complete phage genomes including 108 novel genomes, varying from 5 to 185 kb in length, the majority of which were predicted to be Microviridae phages including several variants with homologous but distinct genomes, which were fragmented in MiSeq contigs. Conclusions: Long-read metagenomics coupled with SACRA provides an improved method to reconstruct accurate and extended phage genomes from MDA-treated virome samples of the human gut, and potentially from other environmental virome samples.

Download Full-text

Investigation of chimeric reads using the MinION

F1000Research ◽

10.12688/f1000research.11547.2 ◽

2017 ◽

Vol 6 ◽

pp. 631 ◽

Cited By ~ 8

Author(s):

Ruby White ◽

Christophe Pellefigues ◽

Franca Ronchese ◽

Olivier Lamiable ◽

David Eccles

Keyword(s):

Quality Control ◽

Sample Preparation ◽

Amplicon Sequencing ◽

Nanopore Sequencing ◽

Blast Search ◽

Sequencing Technologies ◽

Long Read ◽

Pcr Products

Following a nanopore sequencing run of PCR products of three amplicons less than 1kb, an abundance of reads failed quality control due to template/complement mismatch. A BLAST search demonstrated that some of the failed reads mapped to two different genes -- an unexpected observation, given that PCR was carried out separately for each amplicon. A further investigation was carried out specifically to search for chimeric reads, using separate barcodes for each amplicon and trying two different ligation methods prior to sample loading. Despite the separation of ligation products, chimeric reads formed from different amplicons were still observed in the base-called sequence. The long-read nature of nanopore sequencing presents an effective tool for the discovery and filtering of chimeric reads. We have found that at least 1.7% of reads prepared using the Nanopore LSK002 2D Ligation Kit include post-amplification chimeric elements. This finding has potential implications for other amplicon sequencing technologies, as the process is unlikely to be specific to the sample preparation used for nanopore sequencing.

Download Full-text