Microhaplotypes provide increased power from short-read DNA sequences for relationship inference

ABSTRACT Next-generation sequencing technology is of great importance for many biological disciplines; however, due to technical and biological limitations, the short DNA sequences produced by modern sequencers require numerous quality control (QC) measures to reduce errors, remove technical contaminants, or merge paired-end reads together into longer or higher-quality contigs. Many tools for each step exist, but choosing the appropriate methods and usage parameters can be challenging because the parameterization of each step depends on the particularities of the sequencing technology used, the type of samples being analyzed, and the stochasticity of the instrumentation and sample preparation. Furthermore, end users may not know all of the relevant information about how their data were generated, such as the expected overlap for paired-end sequences or the type of adaptors used, to make informed choices. This increasing complexity and nuance demand a pipeline that combines existing steps together in a user-friendly way and, when possible, learns reasonable quality parameters from the data automatically. We propose a user-friendly quality control pipeline called SHI7 (canonically pronounced “shizen”), which aims to simplify quality control of short-read data for the end user by predicting presence and/or type of common sequencing adaptors, what quality scores to trim, whether the data set is shotgun or amplicon sequencing, whether reads are paired end or single end, and whether pairs are stitchable, including the expected amount of pair overlap. We hope that SHI7 will make it easier for all researchers, expert and novice alike, to follow reasonable practices for short-read data quality control.

IMPORTANCE Quality control of high-throughput DNA sequencing data is an important but sometimes laborious task requiring background knowledge of the sequencing protocol used (such as adaptor type, sequencing technology, insert size/stitchability, paired-endedness, etc.). Quality control protocols typically require applying this background knowledge to selecting and executing numerous quality control steps with the appropriate parameters, which is especially difficult when working with public data or data from collaborators who use different protocols. We have created a streamlined quality control pipeline intended to substantially simplify the process of DNA quality control from raw machine output files to actionable sequence data. In contrast to other methods, our proposed pipeline is easy to install and use and attempts to learn the necessary parameters from the data automatically with a single command.


An assembly-free method of phylogeny reconstruction using short-read sequences from pooled samples without barcodes

10.1101/2021.04.09.439138 ◽ 2021

Author(s): Thomas K. F. Wong ◽ Teng Li ◽ Louis Ranjard ◽ Steven Wu ◽ Jeet Sukumaran ◽ ...

Keyword(s): DNA Sequences ◽ DNA Barcode ◽ Real Data ◽ Reference Sequence ◽ Nucleotide Polymorphisms ◽ Data Set ◽ Single Nucleotide ◽ Short Read ◽ Pooled Samples ◽ Haplotype Information

Abstract: A current strategy for obtaining haplotype information from several individuals involves short-read sequencing of pooled amplicons, where fragments from each individual are identified by a unique DNA barcode. In this paper, we report a new method to recover the phylogeny of haplotypes from short-read sequences obtained using pooled amplicons from a mixture of individuals, without barcoding. The method, AFPhyloMix, accepts an alignment of the mixture of reads against a reference sequence, obtains the single-nucleotide polymorphism (SNP) patterns along the alignment, and constructs the phylogenetic tree according to the SNP patterns. AFPhyloMix adopts a Bayesian model of inference to estimate the phylogeny of the haplotypes and their relative frequencies, given that the number of haplotypes is known. In our simulations, AFPhyloMix achieved at least 80% accuracy at recovering the phylogenies and frequencies of the constituent haplotypes, for mixtures with up to 15 haplotypes. AFPhyloMix also worked well on a real data set of kangaroo mitochondrial DNA sequences.


Computation of Rank and Select Functions on Hierarchical Binary String and Its Application to Genome Mapping Problems for Short-Read DNA Sequences

Journal of Computational Biology ◽ 10.1089/cmb.2008.0146 ◽ 2009 ◽ Vol 16 (11) ◽ pp. 1601-1613 ◽ Cited By ~ 6

Author(s): Kouichi Kimura ◽ Yutaka Suzuki ◽ Sumio Sugano ◽ Asako Koike

Keyword(s): DNA Sequences ◽ Genome Mapping ◽ Binary String ◽ Short Read ◽ Rank And Select


Contig annotation tool CAT robustly classifies assembled metagenomic contigs and long sequences

10.1101/072868 ◽ 2016 ◽ Cited By ~ 13

Author(s): Diego D. Cambuy ◽ Felipe H. Coutinho ◽ Bas E. Dutilh

Keyword(s): Single Molecule ◽ DNA Sequences ◽ Taxonomic Classification ◽ Annotation Tool ◽ Single Molecule Sequencing ◽ Short Read ◽ Long Read ◽ Micro Organisms ◽ Taxonomic Annotation

Abstract: In modern-day metagenomics, there is an increasing need for robust taxonomic annotation of long DNA sequences from unknown micro-organisms. Long metagenomic sequences may be derived from assembly of short-read metagenomes, or from long-read single molecule sequencing. Here we introduce CAT, a pipeline for robust taxonomic classification of long DNA sequences. We show that CAT correctly classifies contigs at different taxonomic levels, even in simulated metagenomic datasets that are very distantly related to the sequences in the database. CAT is implemented in Python and the required scripts can be freely downloaded from GitHub.


GPU-accelerated alignment of bisulfite-treated short-read sequences

10.1101/175729 ◽ 2017

Author(s): Richard Wilton ◽ Xin Li ◽ Andrew P. Feinberg ◽ Alexander S. Szalay

Keyword(s): DNA Sequences ◽ Graphics Processing Unit ◽ General Purpose ◽ Processing Unit ◽ Short Read ◽ Wide Range ◽ Programming Logic ◽ Short Read Aligner ◽ Graphics Processing ◽ Better Than

Abstract: The alignment of bisulfite-treated DNA sequences (BS-seq reads) to a large genome involves a significant computational burden beyond that required to align non-bisulfite-treated reads. In the analysis of BS-seq data, this can present an important performance bottleneck that can potentially be addressed by appropriate software-engineering and algorithmic improvements. One strategy is to integrate this additional programming logic into the read-alignment implementation in a way that the software becomes amenable to optimizations that lead to both higher speed and greater sensitivity than can be achieved without this integration. We have evaluated this approach using Arioc, a short-read aligner that uses GPU (general-purpose graphics processing unit) hardware to accelerate computationally expensive programming logic. We integrated the BS-seq computational logic into both GPU and CPU code throughout the Arioc implementation. We then carried out a read-by-read comparison of Arioc's reported alignments with the alignments reported by the most widely used BS-seq read aligners. With simulated reads, Arioc's accuracy is equal to or better than that of the other read aligners we evaluated. With human sequencing reads, Arioc's throughput is at least 10 times higher than that of existing BS-seq aligners across a wide range of sensitivity settings. The Arioc software is available at https://github.com/RWilton/Arioc. It is released under a BSD open-source license.


BURST enables mathematically optimal short-read alignment for big data

10.1101/2020.09.08.287128 ◽ 2020

Author(s): Gabriel Al-Ghalith ◽ Dan Knights

Keyword(s): Next Generation Sequencing ◽ DNA Sequences ◽ Heuristic Algorithms ◽ Database Search ◽ Next Generation Sequencing Data ◽ Next Generation ◽ Genome Database ◽ Short Read ◽ Lowest Common Ancestor ◽ Generation Sequencing

Abstract: One of the fundamental tasks in analyzing next-generation sequencing data is genome database search, in which DNA sequences are compared to known reference genomes for identification or annotation. Although algorithms exist for optimal database search with perfect sensitivity and specificity, these have largely been abandoned for next-generation sequencing (NGS) data in favor of faster heuristic algorithms that sacrifice alignment quality. Virtually all DNA alignment tools that are commonly used in genomic and metagenomic database search use approximate methods that sometimes report the wrong match, and sometimes fail to find a valid match when present. Here we introduce BURST, a high-throughput DNA short-read aligner that uses several new synergistic optimizations to enable provably optimal alignment in NGS datasets. BURST finds all equally good matches in the database above a specified identity threshold and can either report all of them, pick the most likely among tied matches, or provide lowest-common-ancestor taxonomic annotation among tied matches. BURST can align, disambiguate, and assign taxonomy at a rate of 1,000,000 query sequences per minute against the RefSeq v82 representative prokaryotic genome database (5,500 microbial genomes, 19GB) at 98% identity on a 32-core computer, representing a speedup of up to 20,000-fold over current optimal gapped alignment techniques. This may have broader implications for clinical applications, strain tracking, and other situations where fast, exact, extremely sensitive alignment is desired.
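The lowest-common-ancestor annotation among tied matches mentioned in this abstract can be illustrated with a minimal sketch. This is not BURST's code: the function name `lowest_common_ancestor` and the representation of each match as a root-to-leaf list of taxon names are assumptions made here for illustration only.

```python
from typing import List, Sequence


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Return the longest prefix shared by all root-to-leaf lineages.

    Hypothetical helper: each lineage is assumed to be an ordered list of
    taxon names, from the broadest rank down to the most specific one.
    """
    shared: List[str] = []
    # zip(*lineages) walks all lineages rank by rank and stops at the shortest.
    for ranks in zip(*lineages):
        if all(name == ranks[0] for name in ranks):
            shared.append(ranks[0])
        else:
            break  # first disagreement: everything below is ambiguous
    return shared


# Two equally good matches that disagree at the genus level.
tied_matches = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
     "Enterobacterales", "Enterobacteriaceae", "Escherichia", "Escherichia coli"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
     "Enterobacterales", "Enterobacteriaceae", "Shigella", "Shigella flexneri"],
]
annotation = lowest_common_ancestor(tied_matches)
print(annotation[-1])
```

With these two tied matches the read would be annotated at the family level, the deepest rank on which both lineages agree.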


read_haps: using read haplotypes to detect same species contamination in DNA sequences

10.1101/2020.02.11.941773 ◽ 2020

Author(s): Hannes P. Eggertsson ◽ Bjarni V. Halldorsson

Keyword(s): Data Analysis ◽ Genome Sequencing ◽ DNA Sequences ◽ Diploid Species ◽ Reliable Data ◽ Whole Genome Sequencing Data ◽ Whole Genome ◽ Sequencing Data ◽ Short Read ◽ Polymorphic SNPs

Abstract

Motivation: Data analysis requires reliable data. In genetics this includes verifying that the sample is not contaminated with another, a problem ubiquitous in biology.

Results: In humans and other diploid species, DNA contamination from the same species can be found by the presence of three haplotypes between polymorphic SNPs. read_haps is a tool that detects sample contamination from short-read whole-genome sequencing data.

Availability: github.com/DecodeGenetics/[email protected]
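The three-haplotype signal described in this abstract can be sketched in a few lines. This is an illustration of the idea only, not the read_haps implementation: the read representation (a dict from SNP position to observed allele), the function names, and the `min_support` threshold used to tolerate sequencing errors are all assumptions introduced here.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Tuple

Read = Dict[int, str]  # hypothetical: SNP position -> allele seen on the read


def haplotype_counts(reads: Iterable[Read], pos_a: int, pos_b: int) -> Counter:
    """Count allele pairs observed on reads that cover both SNP positions."""
    counts: Counter = Counter()
    for read in reads:
        if pos_a in read and pos_b in read:
            counts[(read[pos_a], read[pos_b])] += 1
    return counts


def looks_contaminated(reads: List[Read], snp_positions: Iterable[int],
                       min_support: int = 2) -> bool:
    """Flag a sample if any SNP pair shows three or more supported haplotypes.

    A single diploid individual carries at most two haplotypes across a pair
    of SNPs, so a well-supported third one suggests a second individual.
    min_support is an assumed guard against isolated sequencing errors.
    """
    for pos_a, pos_b in combinations(sorted(snp_positions), 2):
        counts = haplotype_counts(reads, pos_a, pos_b)
        supported: List[Tuple[str, str]] = [
            hap for hap, n in counts.items() if n >= min_support
        ]
        if len(supported) >= 3:
            return True
    return False


clean_sample = [{100: "A", 150: "C"}] * 3 + [{100: "G", 150: "T"}] * 3
mixed_sample = clean_sample + [{100: "A", 150: "T"}] * 2

print(looks_contaminated(clean_sample, [100, 150]))
print(looks_contaminated(mixed_sample, [100, 150]))
```

The clean sample shows only the two haplotypes A-C and G-T, while the mixed sample adds a supported third haplotype A-T, which is what triggers the flag.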


Detection and assembly of novel sequence insertions using Linked-Read technology

10.1101/551028 ◽ 2019 ◽ Cited By ~ 3

Author(s): Dmitry Meleshko ◽ Patrick Marks ◽ Stephen Williams ◽ Iman Hajirasouliha

Keyword(s): DNA Sequences ◽ De Novo Assembly ◽ De Novo ◽ Supplementary Information ◽ Computational Techniques ◽ Whole Genome ◽ Structural Variations ◽ Short Read ◽ Link Type ◽ Long Read

Abstract

Motivation: Emerging Linked-Read (aka read-cloud) technologies such as the 10x Genomics Chromium system have great potential for accurate detection and phasing of large-scale human genome structural variations (SVs). By leveraging the long-range information encoded in Linked-Read sequencing, computational techniques are able to detect and characterize complex structural variations that were previously undetectable by short-read methods. However, there is no available Linked-Read method for detection and assembly of novel sequence insertions, DNA sequences present in a given sequenced sample but missing in the reference genome, without requiring whole genome de novo assembly. In this paper, we propose a novel integrated alignment-based and local-assembly-based algorithm, Novel-X, that effectively uses the barcode information encoded in Linked-Read sequencing datasets to improve detection of such events without the need for whole genome de novo assembly. We evaluated our method on two haploid human genomes, CHM1 and CHM13, sequenced on the 10x Genomics Chromium system. These genomes have also been characterized recently with high-coverage PacBio long reads. We also tested our method on NA12878, the well-known HapMap CEPH diploid genome, and the child genome in a Yoruba trio (NA19240), which was recently studied on multiple sequencing platforms. Detecting insertion events is very challenging using short reads, and the only viable solution available is long-read sequencing (e.g. PacBio or ONT). Our experiments, however, show that Novel-X finds many insertions that cannot be found by state-of-the-art tools using short-read sequencing data but that are present in PacBio data. Since Linked-Read sequencing is significantly cheaper than long-read sequencing, our method using Linked-Reads enables routine large-scale screenings of sequenced genomes for novel sequence insertions.

Availability: Software is freely available at https://github.com/1dayac/[email protected]

Supplementary information: Supplementary data are available at https://github.com/1dayac/novel_insertions_supplementary


Short-read DNA Sequencing Yields Microsatellite Markers for Rheum

Journal of the American Society for Horticultural Science ◽ 10.21273/jashs.139.1.22 ◽ 2014 ◽ Vol 139 (1) ◽ pp. 22-29 ◽ Cited By ~ 3

Author(s): Barbara S. Gilmore ◽ Nahla V. Bassil ◽ Danny L. Barney ◽ Brian J. Knaus ◽ Kim E. Hummer

Keyword(s): SSR Markers ◽ DNA Sequences ◽ Morphological Characteristics ◽ Bootstrap Support ◽ Sequencing Data ◽ Short Read ◽ Shared Allele ◽ Short Read Sequencing ◽ Horticultural Crops ◽ Amplification Success Rate

Identifying and evaluating genetic diversity of culinary rhubarb (Rheum ×rhabarbarum) cultivars using morphological characteristics is challenging given the existence of synonyms and nomenclatural inconsistencies. Some cultivars with similar names are morphologically different, and seedlings may grow and become associated with the parental name. Morphological traits of one cultivar may vary when measured under different environmental conditions. Molecular markers are consistent for unique genotypes across environments and provide genetic fingerprints to assist in resolving identity issues. Microsatellite repeats, also called simple sequence repeats (SSRs), are commonly used for fingerprinting fruit and nut crops, but only 10 SSRs have previously been reported in rhubarb. The objectives of this study were to use short-read DNA sequences to develop new di-nucleotide-containing SSR markers for rhubarb and to determine if the markers were useful for cultivar identification. A total of 97 new SSR primer pairs were designed from the short-read DNA sequences. The amplification success rate of these SSRs was 77%, whereas polymorphism of those reached 76% in a test panel of four or eight rhubarb individuals. From the 57 potentially polymorphic primer pairs obtained, 25 SSRs were evaluated in 58 Rheum accessions preserved in the U.S. Department of Agriculture, National Plant Germplasm System. The primer pairs generated 314 fragments with an average of 12.6 fragments per pair. The clustering of many accessions in well-supported groups supported previous findings based on amplified fragment length polymorphisms (AFLPs). Cluster analysis, using the proportion of shared allele distance among the 25 SSRs, distinguished each of the 58 accessions including individuals that had similar names or the same name. Accessions that grouped in well-supported clusters previously belonged to similar clusters with high bootstrap support based on AFLP.
In summary, our technique of mining short-read sequencing data was successful in identifying 97 di-nucleotide-containing SSR sequences. Of those tested, the 25 most polymorphic and easy-to-score primer pairs proved useful in fingerprinting rhubarb cultivars. We recommend the use of short-read sequencing for the development of SSR markers in the identification of horticultural crops.
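The proportion of shared allele distance used in the cluster analysis above can be illustrated with a simplified sketch. It assumes one common formulation (one minus the fraction of alleles shared across loci) and genotypes stored as a dict from locus name to a tuple of allele sizes; the locus names, allele sizes, and function name are hypothetical, and the study's exact scoring of fragments may differ.

```python
from collections import Counter
from typing import Dict, Tuple

Genotype = Dict[str, Tuple[int, ...]]  # hypothetical: locus -> allele sizes (bp)


def shared_allele_distance(sample_a: Genotype, sample_b: Genotype) -> float:
    """Return 1 minus the proportion of alleles shared over common loci."""
    loci = sample_a.keys() & sample_b.keys()
    if not loci:
        raise ValueError("samples have no loci in common")
    shared = 0
    total = 0
    for locus in loci:
        alleles_a = Counter(sample_a[locus])
        alleles_b = Counter(sample_b[locus])
        # Multiset intersection counts each matching allele copy once.
        shared += sum((alleles_a & alleles_b).values())
        total += max(sum(alleles_a.values()), sum(alleles_b.values()))
    return 1 - shared / total


accession_1 = {"SSR1": (150, 154), "SSR2": (200, 200)}
accession_2 = {"SSR1": (150, 158), "SSR2": (200, 204)}

print(shared_allele_distance(accession_1, accession_2))
```

In this toy case one allele is shared at each of the two loci, so two of four allele copies match. A matrix of such pairwise distances is the kind of input a clustering method would take.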


An assembly-free method of phylogeny reconstruction using short-read sequences from pooled samples without barcodes

PLoS Computational Biology ◽ 10.1371/journal.pcbi.1008949 ◽ 2021 ◽ Vol 17 (9) ◽ pp. e1008949

Author(s): Thomas K. F. Wong ◽ Teng Li ◽ Louis Ranjard ◽ Steven H. Wu ◽ Jeet Sukumaran ◽ ...

Keyword(s): DNA Sequences ◽ DNA Barcode ◽ Real Data ◽ Reference Sequence ◽ Nucleotide Polymorphisms ◽ Data Set ◽ Single Nucleotide ◽ Short Read ◽ Relative Abundances ◽ Pooled Samples

A current strategy for obtaining haplotype information from several individuals involves short-read sequencing of pooled amplicons, where fragments from each individual are identified by a unique DNA barcode. In this paper, we report a new method to recover the phylogeny of haplotypes from short-read sequences obtained using pooled amplicons from a mixture of individuals, without barcoding. The method, AFPhyloMix, accepts an alignment of the mixture of reads against a reference sequence, obtains the single-nucleotide polymorphism (SNP) patterns along the alignment, and constructs the phylogenetic tree according to the SNP patterns. AFPhyloMix adopts a Bayesian inference model to estimate the phylogeny of the haplotypes and their relative abundances, given that the number of haplotypes is known. In our simulations, AFPhyloMix achieved at least 80% accuracy at recovering the phylogenies and relative abundances of the constituent haplotypes, for mixtures with up to 15 haplotypes. AFPhyloMix also worked well on a real data set of kangaroo mitochondrial DNA sequences.
