Pan-genomic Matching Statistics for Targeted Nanopore Sequencing

2021 ◽  
Author(s):  
Omar Ahmed ◽  
Massimiliano Rossi ◽  
Sam Kovaka ◽  
Michael Schatz ◽  
Travis Gagie ◽  
...  

Nanopore sequencing is an increasingly powerful tool for genomics. Recently, computational advances have allowed nanopores to sequence in a targeted fashion: as the sequencer emits data, software can analyze the data in real time and signal the sequencer to eject "non-target" DNA molecules. We present a novel method called SPUMONI, which enables rapid and accurate targeted sequencing with the help of efficient pangenome indexes. SPUMONI uses a compressed index to rapidly generate exact or approximate matching statistics (half-maximal exact matches) in a streaming fashion. When used to target a specific strain in a mock community, SPUMONI has accuracy similar to that of minimap2 when both are run against an index containing many strains per species. However, SPUMONI is 12 times faster than minimap2. SPUMONI's index and peak memory footprint are also 15 and 4 times smaller than those of minimap2, respectively. These improvements become even more pronounced with larger reference databases; SPUMONI's index size scales sublinearly with the number of reference genomes included. This could enable accurate targeted sequencing even when the targeted strains have not been sequenced or assembled previously. SPUMONI is open source software available from https://github.com/oma219/spumoni.
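
To make the term "matching statistics" concrete: for each position i in a read, the matching statistic is the length of the longest substring starting at i that also occurs somewhere in the reference. The following is a minimal sketch of that definition only, using a naive substring scan in Python. It is not SPUMONI's compressed-index algorithm, and the function name `matching_statistics` is a hypothetical chosen for illustration.

```python
def matching_statistics(pattern: str, reference: str) -> list[int]:
    """Return MS, where MS[i] is the length of the longest prefix of
    pattern[i:] that occurs anywhere in reference.

    Naive illustration: uses Python substring search, not a compressed index.
    """
    ms = []
    length = 0
    for i in range(len(pattern)):
        # If pattern[i-1 : i-1+L] occurs in the reference, then so does
        # pattern[i : i-1+L], hence MS[i] >= MS[i-1] - 1.
        length = max(length - 1, 0)
        # Extend the match one character at a time while it still occurs.
        while (i + length < len(pattern)
               and pattern[i:i + length + 1] in reference):
            length += 1
        ms.append(length)
    return ms


if __name__ == "__main__":
    print(matching_statistics("ACGTT", "ACGTACGA"))
```

Long matching statistics along a read suggest that it resembles something in the reference, while a run of short values suggests it does not. How such values are turned into an eject decision is not described in the abstract and is left out here.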

2017 ◽  
Author(s):  
Zhemin Zhou ◽  
Nina Luhmann ◽  
Nabil-Fareed Alikhan ◽  
Christopher Quince ◽  
Mark Achtman

Abstract: Exploring the genetic diversity of microbes within the environment through metagenomic sequencing first requires classifying the sequencing reads into taxonomic groups. Current methods compare these sequencing data with existing biased and limited reference databases. Several recent evaluation studies demonstrate that current methods either lack sufficient sensitivity for species-level assignments or suffer from false positives, overestimating the number of species in the metagenome. Both are especially problematic for the identification of low-abundance microbial species, e.g., detecting pathogens in ancient metagenomic samples. We present a new method, SPARSE, which improves taxonomic assignments of metagenomic reads. SPARSE balances existing biased reference databases by grouping reference genomes into similarity-based hierarchical clusters, implemented as an efficient incremental data structure. SPARSE assigns reads to these clusters using a probabilistic model, which specifically penalizes non-specific mappings of reads from unknown sources and hence reduces false-positive assignments. Our evaluation on simulated datasets from two recent evaluation studies demonstrated the improved precision of SPARSE in comparison to other methods for species-level classification. In a third simulation, our method successfully differentiated multiple co-existing Escherichia coli strains from the same sample. In real archaeological datasets, SPARSE identified ancient pathogens with ≤ 0.02% abundance, consistent with published findings that required additional sequencing data. In these datasets, other methods either missed targeted pathogens or reported non-existent ones. SPARSE and all evaluation scripts are available at https://github.com/zheminzhou/SPARSE.


2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Lili Quan ◽  
Ruyi Dong ◽  
Wenjuan Yang ◽  
Lanyou Chen ◽  
Jidong Lang ◽  
...  

Abstract: Human papillomavirus (HPV) is a major pathogen that causes cervical cancer and many other related diseases. The HPV infection-related cervical microbiome could be an inducing factor of cervical cancer. However, it is uncommon to find a single test on the market that can simultaneously provide information on both HPV and the microbiome. Herein, a novel method was developed to simultaneously detect HPV infection and microbiota composition promptly and accurately. It provides a new and simple way to assess vaginal pathogen status and also provides valuable information for clinical diagnosis. This approach combined multiplex PCR, which targeted both HPV16 E6E7 and full-length 16S rRNA, with Nanopore sequencing to generate enough information to understand the vaginal condition of patients. One HPV-positive liquid-based cytology (LBC) sample was sequenced and analyzed. After comparison with Illumina sequencing, the results from Nanopore showed a similar microbiome composition. An instant sequencing evaluation showed that 15 min of sequencing is enough to identify the top 10 most abundant bacteria. Moreover, two HPV integration sites were identified and verified by Sanger sequencing. This approach has many potential applications in pathogen detection and can potentially aid in providing a more rapid clinical diagnosis.


2017 ◽  
Author(s):  
Philippe Faucon ◽  
Robert Trevino ◽  
Parithi Balachandran ◽  
Kylie Standage-Beier ◽  
Xiao Wang

Abstract: Nanopore sequencing has introduced the ability to sequence long stretches of DNA, enabling the resolution of repeating segments or paired SNPs across long stretches of DNA. Unfortunately, significant error rates (>15%), introduced through systematic and random noise, inhibit downstream analysis. We propose a novel method, using unsupervised learning, to correct biologically amplified reads before downstream analysis proceeds. We also demonstrate that our method has performance comparable to existing techniques without limiting the detection of repeats or the length of the input sequence.


2019 ◽  
Author(s):  
Brandon D. Wilson ◽  
Michael Eisenstein ◽  
H. Tom Soh

Abstract: Nanopore sequencing offers a portable and affordable alternative to sequencing-by-synthesis methods but suffers from lower accuracy and cannot sequence ultra-short DNA. This puts applications such as molecular diagnostics based on the analysis of cell-free DNA or single-nucleotide variants (SNV) out of reach. To overcome these limitations, we report a nanopore-based sequencing strategy in which short target sequences are first circularized and then amplified via rolling-circle amplification to produce long stretches of concatemeric repeats. These can be sequenced on the Oxford Nanopore Technologies (ONT) MinION platform, and the resulting repeat sequences aligned to produce a highly accurate consensus that reduces the high error rate present in the individual repeats. Using this approach, we demonstrate for the first time the ability to obtain unbiased and accurate nanopore data for target DNA sequences of < 100 bp. Critically, this approach is sensitive enough to achieve SNV discrimination in mixtures of sequences and even enables quantitative detection of specific variants present at ratios of < 10%. Our method is simple, cost-effective, and only requires well-established processes. It therefore expands the utility of nanopore sequencing for molecular diagnostics and other applications, especially in resource-limited settings.
One Sentence Summary: We introduce a simple method of accurately sequencing ultra-short (< 100 bp) target DNA on a nanopore sequencing platform.
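
The consensus step described above can be illustrated with a per-column majority vote. This is a minimal sketch, assuming the repeat copies have already been aligned to equal length with "-" marking gaps. The abstract does not say how the authors align repeats or call the consensus, so the function `column_consensus` is a hypothetical illustration and not their pipeline.

```python
from collections import Counter


def column_consensus(aligned_repeats: list[str]) -> str:
    """Majority-vote consensus over pre-aligned, equal-length repeat copies.

    Assumes '-' marks an alignment gap; columns whose most common symbol is
    a gap are dropped from the consensus.
    """
    if not aligned_repeats:
        return ""
    width = len(aligned_repeats[0])
    if any(len(repeat) != width for repeat in aligned_repeats):
        raise ValueError("all aligned repeats must have the same length")

    consensus = []
    for column in zip(*aligned_repeats):
        # most_common(1) returns the single most frequent symbol in the column.
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != "-":
            consensus.append(symbol)
    return "".join(consensus)


if __name__ == "__main__":
    # Each copy carries one independent error; the vote recovers ACGT.
    print(column_consensus(["ACGT", "ACGA", "TCGT"]))
```

Because errors in different copies tend to fall in different columns, the vote suppresses them as the number of repeats grows. Ties are broken arbitrarily in this sketch.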


2019 ◽  
Author(s):  
Robert Shaw ◽  
Grant Hill

A novel method for the accurate and efficient calculation of interaction energies in weakly-bound complexes comprised of a large number of molecules is presented. The new ALMO+RPAd method circumvents the prohibitive scaling of coupled cluster singles and doubles while still providing similar accuracy across a diverse range of intermolecular interactions. Tests on various dimers and the S66 benchmark set demonstrate results within 0.5 kcal/mol of coupled cluster singles and doubles results. On a large cluster of water molecules, we achieve calculations involving over 3500 orbital and 12000 auxiliary basis functions in under ten minutes on a single CPU core.


2018 ◽  
Author(s):  
Jae Young Choi ◽  
Zoe N. Lye ◽  
Simon C. Groen ◽  
Xiaoguang Dai ◽  
Priyesh Rughani ◽  
...  

Abstract
Background: The circum-basmati group of cultivated Asian rice (Oryza sativa) contains many iconic varieties and is widespread in the Indian subcontinent. Despite its economic and cultural importance, a high-quality reference genome is currently lacking, and the group's evolutionary history is not fully resolved. To address these gaps, we used long-read nanopore sequencing and assembled the genomes of two circum-basmati rice varieties, Basmati 334 and Dom Sufid.
Results: We generated two high-quality, chromosome-level reference genomes that represented the 12 chromosomes of Oryza. The assemblies showed a contig N50 of 6.32 Mb and 10.53 Mb for Basmati 334 and Dom Sufid, respectively. Using our highly contiguous assemblies, we characterized structural variations segregating across circum-basmati genomes. We discovered repeat expansions not observed in japonica (the rice group most closely related to circum-basmati), as well as presence/absence variants of over 20 Mb, one of which was a circum-basmati-specific deletion of a gene regulating awn length. We further detected strong evidence of admixture between the circum-basmati and circum-aus groups. This gene flow had its greatest effect on chromosome 10, causing both structural variation and single nucleotide polymorphism to deviate from genome-wide history. Lastly, population genomic analysis of 78 circum-basmati varieties showed three major geographically structured genetic groups: (1) Bhutan/Nepal group, (2) India/Bangladesh/Myanmar group, and (3) Iran/Pakistan group.
Conclusion: Availability of high-quality reference genomes from nanopore sequencing allowed functional and evolutionary genomic analyses, providing genome-wide evidence for gene flow between circum-aus and circum-basmati, the nature of circum-basmati structural variation, and the presence/absence of genes in this important and iconic rice variety group.
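
For readers unfamiliar with the contig N50 figures quoted above: N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. A minimal sketch of that standard definition follows. The function name `n50` and the example lengths are illustrative assumptions, not data from the study.

```python
def n50(contig_lengths: list[int]) -> int:
    """Return the N50 of an assembly given its contig lengths.

    Contigs are visited from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half the assembly size.
    """
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    # Made-up lengths for illustration only.
    print(n50([80, 70, 50, 40, 30, 20]))
```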


2020 ◽  
Author(s):  
Chen Sun ◽  
Kai-Chun Chang ◽  
Adam R. Abate

Abstract: Targeted sequencing enables sensitive and cost-effective analysis by focusing resources on molecules of interest. Existing methods, however, are limited in enrichment power and target capture length. Here, we present a novel method that uses compound nucleic acid cytometry to achieve million-fold enrichments of molecules >10 kbp in length using minimal prior target information. We demonstrate the approach by sequencing HIV proviruses in infected individuals. Our method is useful for rare target sequencing in research and clinical applications, including for identifying cancer-associated mutations or sequencing viruses infecting cells.


2021 ◽  
Author(s):  
Oguzhan Begik ◽  
Huanle Liu ◽  
Anna Delgado-Tejedor ◽  
Cassandra Kontur ◽  
Antonio J Giraldez ◽  
...  

RNA polyadenylation plays a central role in RNA maturation, fate and stability. In response to developmental cues, polyA tail lengths can vary, affecting the translatability and stability of mRNAs. Here we develop Nano3P-seq, a novel method that relies on nanopore sequencing to simultaneously quantify RNA abundance and tail length dynamics at per-read resolution. By employing a template switching-based sequencing protocol, Nano3P-seq can sequence any given RNA molecule from its 3' end, regardless of its polyadenylation status, without the need for PCR amplification or ligation of RNA adapters. We demonstrate that Nano3P-seq captures a wide diversity of RNA biotypes, providing quantitative estimates of RNA abundance and tail lengths in mRNAs, lncRNAs, sn/snoRNAs, scaRNAs and rRNAs. We find that, in addition to mRNAs and lncRNAs, polyA tails can be identified in 16S mitochondrial rRNA in both mouse and zebrafish. Moreover, we show that mRNA tail lengths are dynamically regulated during vertebrate embryogenesis at the isoform-specific level, correlating with mRNA decay. Overall, Nano3P-seq is a simple and robust method to accurately estimate transcript levels and tail lengths in full-length individual reads, with minimal library preparation biases, both in the coding and non-coding transcriptome.


2017 ◽  
Author(s):  
Angel Mojarro ◽  
Julie Hachey ◽  
Gary Ruvkun ◽  
Maria T. Zuber ◽  
Christopher E. Carr

Abstract
Motivation: Long-read nanopore sequencing technology is of particular significance for taxonomic identification at or below the species level. For many environmental samples, the total extractable DNA is far below the current input requirements of nanopore sequencing, preventing “sample to sequence” metagenomics from low-biomass or recalcitrant samples.
Results: Here we address this problem by employing carrier sequencing, a method to sequence low-input DNA by preparing the target DNA with a genomic carrier to achieve ideal library preparation and sequencing stoichiometry without amplification. We then use CarrierSeq, a sequence analysis workflow to identify the low-input target reads from the genomic carrier. We tested CarrierSeq experimentally by sequencing from a combination of 0.2 ng Bacillus subtilis ATCC 6633 DNA in a background of 1 μg Enterobacteria phage λ DNA. After filtering of carrier, low quality, and low complexity reads, we detected target reads (B. subtilis), contamination reads, and “high quality noise reads” (HQNRs) not mapping to the carrier, target or known lab contaminants. These reads appear to be artifacts of the nanopore sequencing process as they are associated with specific channels (pores). By treating reads as a Poisson arrival process, we implement a statistical test to reject data from channels dominated by HQNRs while retaining target reads.
Availability: CarrierSeq is an open-source bash script with supporting Python scripts which leverage a variety of bioinformatics software packages on macOS and Ubuntu. Supplemental documentation is available on GitHub at https://github.com/amojarro/carrierseq. In addition, we have compiled all required dependencies in a Docker image available from https://hub.docker.com/r/mojarro/carrierseq.
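
The idea of testing channels against a Poisson arrival model can be sketched as follows. This is an illustration under stated assumptions, not CarrierSeq's actual test: it assumes the expected number of reads per channel is the mean over all channels, and it flags a channel when the Poisson upper-tail probability of its observed count falls below a threshold. The names `poisson_upper_tail` and `flag_overactive_channels`, and the default `alpha`, are hypothetical.

```python
import math


def poisson_upper_tail(k: int, lam: float) -> float:
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    # Accumulate P(X = 0) + ... + P(X = k - 1) iteratively.
    term = math.exp(-lam)
    cdf = term
    for j in range(1, k):
        term *= lam / j
        cdf += term
    # Guard against tiny negative values from floating-point rounding.
    return max(0.0, 1.0 - cdf)


def flag_overactive_channels(reads_per_channel: dict[int, int],
                             alpha: float = 0.01) -> list[int]:
    """Return channels whose read count is improbably high under a Poisson
    model with rate equal to the mean count per channel (an assumption)."""
    if not reads_per_channel:
        return []
    lam = sum(reads_per_channel.values()) / len(reads_per_channel)
    return sorted(
        channel
        for channel, count in reads_per_channel.items()
        if poisson_upper_tail(count, lam) < alpha
    )


if __name__ == "__main__":
    # Made-up counts: nine quiet channels and one producing 30 reads.
    counts = {channel: 1 for channel in range(1, 10)}
    counts[10] = 30
    print(flag_overactive_channels(counts))
```

Using the overall mean as the rate is a simplification, since noisy channels inflate it. A real workflow would need a more robust estimate of the expected rate.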


2021 ◽  
Author(s):  
Jeremie S. Kim ◽  
Can Firtina ◽  
Meryem Banu Cavlak ◽  
Damla Senol Cali ◽  
Nastaran Hajinazar ◽  
...  

Abstract: As genome sequencing tools and techniques improve, researchers are able to incrementally assemble more accurate reference genomes, which enable greater sensitivity in read mapping and downstream analysis such as variant calling. A more sensitive downstream analysis is critical for a better understanding of the genome donor (e.g., health characteristics). Therefore, read sets from sequenced samples should ideally be mapped to the latest available reference genome that represents the most relevant population. Unfortunately, the increasingly large amount of available genomic data makes it prohibitively expensive to fully re-map each read set to its respective reference genome every time the reference is updated. There are several tools that attempt to accelerate the process of updating a read data set from one reference to another (i.e., remapping) by 1) identifying regions that appear similarly between two references and 2) updating the mapping location of reads that map to any of the identified regions in the old reference to the corresponding similar region in the new reference. The main drawback of existing approaches is that if a read maps to a region in the old reference that does not appear with a reasonable degree of similarity in the new reference, the read cannot be remapped. We find that, as a result of this drawback, a significant portion of annotations (i.e., coding regions in a genome) are lost when using state-of-the-art remapping tools. To address this major limitation in existing tools, we propose AirLift, a fast and comprehensive technique for remapping alignments from one genome to another. Compared to the state-of-the-art method for remapping reads (i.e., full mapping), AirLift reduces 1) the number of reads (out of the entire read set) that need to be fully mapped to the new reference by up to 99.99% and 2) the overall execution time to remap read sets between two reference genome versions by 6.7×, 6.6×, and 2.8× for large (human), medium (C. elegans), and small (yeast) reference genomes, respectively. We validate our remapping results with GATK and find that AirLift provides similar accuracy in identifying ground truth SNP and INDEL variants as the baseline of fully mapping a read set.
Code Availability: AirLift source code and readme describing how to reproduce our results are available at https://github.com/CMU-SAFARI/AirLift.

