Computation of Rank and Select Functions on Hierarchical Binary String and Its Application to Genome Mapping Problems for Short-Read DNA Sequences

ABSTRACT Next-generation sequencing technology is of great importance for many biological disciplines; however, due to technical and biological limitations, the short DNA sequences produced by modern sequencers require numerous quality control (QC) measures to reduce errors, remove technical contaminants, or merge paired-end reads together into longer or higher-quality contigs. Many tools for each step exist, but choosing the appropriate methods and usage parameters can be challenging because the parameterization of each step depends on the particularities of the sequencing technology used, the type of samples being analyzed, and the stochasticity of the instrumentation and sample preparation. Furthermore, end users may not know all of the relevant information about how their data were generated, such as the expected overlap for paired-end sequences or the type of adaptors used, to make informed choices. This increasing complexity and nuance demand a pipeline that combines existing steps together in a user-friendly way and, when possible, learns reasonable quality parameters from the data automatically. We propose a user-friendly quality control pipeline called SHI7 (canonically pronounced “shizen”), which aims to simplify quality control of short-read data for the end user by predicting presence and/or type of common sequencing adaptors, what quality scores to trim, whether the data set is shotgun or amplicon sequencing, whether reads are paired end or single end, and whether pairs are stitchable, including the expected amount of pair overlap. We hope that SHI7 will make it easier for all researchers, expert and novice alike, to follow reasonable practices for short-read data quality control.

IMPORTANCE Quality control of high-throughput DNA sequencing data is an important but sometimes laborious task requiring background knowledge of the sequencing protocol used (such as adaptor type, sequencing technology, insert size/stitchability, paired-endedness, etc.). Quality control protocols typically require applying this background knowledge to selecting and executing numerous quality control steps with the appropriate parameters, which is especially difficult when working with public data or data from collaborators who use different protocols. We have created a streamlined quality control pipeline intended to substantially simplify the process of DNA quality control from raw machine output files to actionable sequence data. In contrast to other methods, our proposed pipeline is easy to install and use and attempts to learn the necessary parameters from the data automatically with a single command.

An assembly-free method of phylogeny reconstruction using short-read sequences from pooled samples without barcodes

10.1101/2021.04.09.439138 ◽ 2021

Author(s): Thomas K. F. Wong ◽ Teng Li ◽ Louis Ranjard ◽ Steven Wu ◽ Jeet Sukumaran ◽ ...

Keyword(s): DNA Sequences ◽ DNA Barcode ◽ Real Data ◽ Reference Sequence ◽ Nucleotide Polymorphisms ◽ Data Set ◽ Single Nucleotide ◽ Short Read ◽ Pooled Samples ◽ Haplotype Information

Abstract: A current strategy for obtaining haplotype information from several individuals involves short-read sequencing of pooled amplicons, where fragments from each individual are identified by a unique DNA barcode. In this paper, we report a new method to recover the phylogeny of haplotypes from short-read sequences obtained using pooled amplicons from a mixture of individuals, without barcoding. The method, AFPhyloMix, accepts an alignment of the mixture of reads against a reference sequence, obtains the single-nucleotide polymorphism (SNP) patterns along the alignment, and constructs the phylogenetic tree according to the SNP patterns. AFPhyloMix adopts a Bayesian model of inference to estimate the phylogeny of the haplotypes and their relative frequencies, given that the number of haplotypes is known. In our simulations, AFPhyloMix achieved at least 80% accuracy at recovering the phylogenies and frequencies of the constituent haplotypes, for mixtures with up to 15 haplotypes. AFPhyloMix also worked well on a real data set of kangaroo mitochondrial DNA sequences.
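
The step of obtaining SNP patterns along an alignment can be illustrated with a minimal sketch. It is not AFPhyloMix's implementation: the function name `snp_columns`, the `-` encoding for uncovered positions, and the `min_minor_freq` threshold are all assumptions made for illustration. It only tallies bases per alignment column and keeps the columns where a second base is frequent enough to look polymorphic.

```python
from collections import Counter

def snp_columns(aligned_reads, min_minor_freq=0.05):
    """Return {column: Counter of bases} for columns that look polymorphic.

    aligned_reads: equal-length strings already aligned to the reference,
    with '-' marking positions a read does not cover (assumed encoding).
    min_minor_freq: assumed cutoff on the second most common base.
    """
    snps = {}
    for col in range(len(aligned_reads[0])):
        counts = Counter(r[col] for r in aligned_reads if r[col] != "-")
        depth = sum(counts.values())
        if depth == 0 or len(counts) < 2:
            continue  # uncovered column, or only one base observed
        # Frequency of the second most common base at this column.
        minor = counts.most_common(2)[1][1] / depth
        if minor >= min_minor_freq:
            snps[col] = counts
    return snps

reads = ["ACGT", "ACGT", "ATGT", "-TGA"]
print(sorted(snp_columns(reads)))  # [1, 3]
```

A real pipeline would read the columns from an alignment file and model sequencing error explicitly; the fixed threshold here is only a stand-in for that.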

Contig annotation tool CAT robustly classifies assembled metagenomic contigs and long sequences

10.1101/072868 ◽ 2016 ◽ Cited By ~ 13

Author(s): Diego D. Cambuy ◽ Felipe H. Coutinho ◽ Bas E. Dutilh

Keyword(s): Single Molecule ◽ DNA Sequences ◽ Taxonomic Classification ◽ Annotation Tool ◽ Single Molecule Sequencing ◽ Short Read ◽ Long Read ◽ Micro Organisms ◽ Taxonomic Annotation

Abstract: In modern-day metagenomics, there is an increasing need for robust taxonomic annotation of long DNA sequences from unknown micro-organisms. Long metagenomic sequences may be derived from assembly of short-read metagenomes, or from long-read single molecule sequencing. Here we introduce CAT, a pipeline for robust taxonomic classification of long DNA sequences. We show that CAT correctly classifies contigs at different taxonomic levels, even in simulated metagenomic datasets that are very distantly related to the sequences in the database. CAT is implemented in Python and the required scripts can be freely downloaded from GitHub.

GPU-accelerated alignment of bisulfite-treated short-read sequences

10.1101/175729 ◽ 2017

Author(s): Richard Wilton ◽ Xin Li ◽ Andrew P. Feinberg ◽ Alexander S. Szalay

Keyword(s): DNA Sequences ◽ Graphics Processing Unit ◽ General Purpose ◽ Processing Unit ◽ Short Read ◽ Wide Range ◽ Programming Logic ◽ Short Read Aligner ◽ Graphics Processing ◽ Better Than

Abstract: The alignment of bisulfite-treated DNA sequences (BS-seq reads) to a large genome involves a significant computational burden beyond that required to align non-bisulfite-treated reads. In the analysis of BS-seq data, this can present an important performance bottleneck that can potentially be addressed by appropriate software-engineering and algorithmic improvements. One strategy is to integrate this additional programming logic into the read-alignment implementation in a way that the software becomes amenable to optimizations that lead to both higher speed and greater sensitivity than can be achieved without this integration. We have evaluated this approach using Arioc, a short-read aligner that uses GPU (general-purpose graphics processing unit) hardware to accelerate computationally-expensive programming logic. We integrated the BS-seq computational logic into both GPU and CPU code throughout the Arioc implementation. We then carried out a read-by-read comparison of Arioc's reported alignments with the alignments reported by the most widely used BS-seq read aligners. With simulated reads, Arioc's accuracy is equal to or better than the other read aligners we evaluated. With human sequencing reads, Arioc's throughput is at least 10 times faster than existing BS-seq aligners across a wide range of sensitivity settings. The Arioc software is available at https://github.com/RWilton/Arioc. It is released under a BSD open-source license.

Microhaplotypes provide increased power from short-read DNA sequences for relationship inference

Molecular Ecology Resources ◽ 10.1111/1755-0998.12737 ◽ 2017 ◽ Vol 18 (2) ◽ pp. 296-305 ◽ Cited By ~ 41

Author(s): Diana S. Baetscher ◽ Anthony J. Clemento ◽ Thomas C. Ng ◽ Eric C. Anderson ◽ John C. Garza

Keyword(s): DNA Sequences ◽ Short Read

Nonisotopic in situ hybridization and plant genome mapping: the first 10 years

Genome ◽ 10.1139/g94-102 ◽ 1994 ◽ Vol 37 (5) ◽ pp. 717-725 ◽ Cited By ~ 179

Author(s): Jiming Jiang ◽ Bikram S. Gill

Keyword(s): In Situ Hybridization ◽ Physical Mapping ◽ DNA Sequences ◽ Genome Mapping ◽ Molecular Cytogenetics ◽ Gene Families ◽ Single Copy ◽ Plant Genome ◽ Metaphase Chromosomes

Nonisotopic in situ hybridization (ISH) was introduced in plants in 1985. Since then the technique has been widely used in various areas of plant genome mapping. ISH has become a routine method for physical mapping of repetitive DNA sequences and multicopy gene families. ISH patterns on somatic metaphase chromosomes using tandemly repeated sequences provide excellent physical markers for chromosome identification. Detection of low or single copy sequences was also reported. Genomic in situ hybridization (GISH) was successfully used to analyze the chromosome structure and evolution of allopolyploid species. GISH also provides a powerful technique for monitoring chromatin introgression during interspecific hybridization. A sequential chromosome banding and ISH technique was developed. The sequential technique is very useful for more precise and efficient mapping as well as cytogenetic determination of genomic affinities of individual chromosomes in allopolyploid species. A critical review is made of the present resolution of the ISH technique and the future outlook of ISH research is discussed.

Key words: in situ hybridization, physical mapping, genome mapping, molecular cytogenetics.

BURST enables mathematically optimal short-read alignment for big data

10.1101/2020.09.08.287128 ◽ 2020

Author(s): Gabriel Al-Ghalith ◽ Dan Knights

Keyword(s): Next Generation Sequencing ◽ DNA Sequences ◽ Heuristic Algorithms ◽ Database Search ◽ Next Generation Sequencing Data ◽ Next Generation ◽ Genome Database ◽ Short Read ◽ Lowest Common Ancestor ◽ Generation Sequencing

Abstract: One of the fundamental tasks in analyzing next-generation sequencing data is genome database search, in which DNA sequences are compared to known reference genomes for identification or annotation. Although algorithms exist for optimal database search with perfect sensitivity and specificity, these have largely been abandoned for next-generation sequencing (NGS) data in favor of faster heuristic algorithms that sacrifice alignment quality. Virtually all DNA alignment tools that are commonly used in genomic and metagenomic database search use approximate methods that sometimes report the wrong match, and sometimes fail to find a valid match when present. Here we introduce BURST, a high-throughput DNA short-read aligner that uses several new synergistic optimizations to enable provably optimal alignment in NGS datasets. BURST finds all equally good matches in the database above a specified identity threshold and can either report all of them, pick the most likely among tied matches, or provide lowest-common-ancestor taxonomic annotation among tied matches. BURST can align, disambiguate, and assign taxonomy at a rate of 1,000,000 query sequences per minute against the RefSeq v82 representative prokaryotic genome database (5,500 microbial genomes, 19 GB) at 98% identity on a 32-core computer, representing a speedup of up to 20,000-fold over current optimal gapped alignment techniques. This may have broader implications for clinical applications, strain tracking, and other situations where fast, exact, extremely sensitive alignment is desired.
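
Lowest-common-ancestor annotation among tied matches can be sketched as a shared-prefix computation over taxonomy strings. The sketch below is an illustration only and is not BURST's code: the function name `lowest_common_ancestor`, the semicolon-delimited lineage format, and the simplified example lineages are assumptions.

```python
def lowest_common_ancestor(lineages):
    """Longest shared prefix of semicolon-delimited taxonomy strings.

    The lineage format is an assumption for illustration; real taxonomy
    files differ in the ranks they carry and how they delimit them.
    """
    split = [lin.split(";") for lin in lineages]
    shared = []
    # Walk the ranks in parallel and stop at the first disagreement.
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return ";".join(shared)

# Simplified, hypothetical lineages for three equally good matches.
tied = [
    "Bacteria;Proteobacteria;Escherichia;Escherichia coli",
    "Bacteria;Proteobacteria;Escherichia;Escherichia fergusonii",
    "Bacteria;Proteobacteria;Salmonella;Salmonella enterica",
]
print(lowest_common_ancestor(tied))  # Bacteria;Proteobacteria
```

When all tied matches agree down to the last rank, the full lineage is returned; when they disagree at the first rank, the result is an empty string, which a caller would treat as unclassified.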

read_haps: using read haplotypes to detect same species contamination in DNA sequences

10.1101/2020.02.11.941773 ◽ 2020

Author(s): Hannes P. Eggertsson ◽ Bjarni V. Halldorsson

Keyword(s): Data Analysis ◽ Genome Sequencing ◽ DNA Sequences ◽ Diploid Species ◽ Reliable Data ◽ Whole Genome Sequencing Data ◽ Whole Genome ◽ Sequencing Data ◽ Short Read ◽ Polymorphic SNPs

Abstract: Motivation: Data analysis requires reliable data. In genetics, this includes verifying that the sample is not contaminated with another, a problem ubiquitous in biology. Results: In humans and other diploid species, DNA contamination from the same species can be detected by the presence of three haplotypes between polymorphic SNPs. read_haps is a tool that detects sample contamination from short-read whole-genome sequencing data. Availability: github.com/DecodeGenetics/[email protected]
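
The three-haplotype criterion can be made concrete with a small sketch. This is not the read_haps implementation: the function names, the tuple input format, and the `min_support` read-count filter are assumptions for illustration. The idea is that a diploid sample carries at most two haplotypes across any pair of SNPs, so a third well-supported allele combination suggests DNA from another individual.

```python
from collections import Counter

def haplotypes_between(read_alleles, min_support=2):
    """Allele pairs seen at two SNP sites with enough read support.

    read_alleles: (allele_at_snp1, allele_at_snp2) tuples, one per read
    or read pair spanning both sites (assumed input format).
    min_support: assumed filter against isolated sequencing errors.
    """
    counts = Counter(read_alleles)
    return {hap for hap, n in counts.items() if n >= min_support}

def looks_contaminated(read_alleles, ploidy=2, min_support=2):
    # A diploid sample carries at most two haplotypes across a SNP pair.
    return len(haplotypes_between(read_alleles, min_support)) > ploidy

clean = [("A", "G")] * 5 + [("C", "T")] * 4 + [("A", "T")]      # one likely error
mixed = [("A", "G")] * 5 + [("C", "T")] * 4 + [("A", "T")] * 3  # third haplotype
print(looks_contaminated(clean), looks_contaminated(mixed))  # False True
```

A practical tool would aggregate this signal over many SNP pairs across the genome instead of deciding from a single pair.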

AUSPP: A universal short-read pre-processing package

Journal of Bioinformatics and Computational Biology ◽ 10.1142/s0219720019500379 ◽ 2019 ◽ Vol 17 (06) ◽ pp. 1950037

Author(s): Lei Gao ◽ Cong Wu ◽ Lin Liu

Keyword(s): Genome Mapping ◽ Meta Analysis ◽ Read Mapping ◽ Raw Data ◽ Short Read ◽ Short Reads ◽ Source Codes ◽ Automatic Mapping ◽ Next Generation Sequencing NGS ◽ NGS Data

There are many short-read aligners that can map short reads to a reference genome/sequence, and most of them can directly accept a FASTQ file as the input query file. However, the raw data usually need to be pre-processed. Few software programs specialize in pre-processing raw data generated by a variety of next-generation sequencing (NGS) technologies. Here, we present AUSPP, a Perl script-based pipeline for pre-processing and automatic mapping of NGS short reads. This pipeline encompasses quality control, adaptor trimming, collapsing of reads, structural RNA removal, length selection, read mapping, and normalized wiggle file creation. It facilitates the processing from raw data to genome mapping and is therefore a powerful tool for the steps before meta-analysis. Most importantly, since AUSPP has default processing pipeline settings for many types of NGS data, most of the time, users will simply need to provide the raw data and genome. AUSPP is portable and easy to install, and the source code is freely available at https://github.com/highlei/AUSPP.
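
Two of the listed pre-processing steps, collapsing of reads and length selection, can be sketched in a few lines. AUSPP itself is a Perl pipeline, so this Python sketch is not its code: the function name `collapse_and_select` and the default length window of 18 to 30 bases are assumptions chosen for illustration.

```python
from collections import Counter

def collapse_and_select(reads, min_len=18, max_len=30):
    """Collapse identical reads to (sequence, count) within a length window.

    min_len and max_len are assumed defaults, not values taken from AUSPP.
    """
    counts = Counter(r for r in reads if min_len <= len(r) <= max_len)
    # Most abundant first; ties broken alphabetically for stable output.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

reads = ["ACGT", "ACGT", "ACGTACGT", "TTGCA", "AC"]
print(collapse_and_select(reads, min_len=4, max_len=6))
# [('ACGT', 2), ('TTGCA', 1)]
```

Collapsing before mapping means each distinct sequence is aligned once, with its count carried along for later normalization.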

Sequential chromosome banding and in situ hybridization analysis

Genome ◽ 10.1139/g93-104 ◽ 1993 ◽ Vol 36 (4) ◽ pp. 792-795 ◽ Cited By ~ 47

Author(s): Jiming Jiang ◽ Bikram S. Gill

Keyword(s): In Situ Hybridization ◽ DNA Sequences ◽ Acid Treatment ◽ Chromosome Banding ◽ Molecular Mapping ◽ Genome Mapping ◽ Genomic In Situ Hybridization ◽ Metaphase Chromosomes ◽ C Banding

Different combinations of chromosome N- or C-banding with in situ hybridization (ISH) or genomic in situ hybridization (GISH) were sequentially performed on metaphase chromosomes of wheat. A modified N-banding–ISH/GISH sequential procedure gave the best results. Similarly, a modified C-banding–ISH/GISH procedure also gave satisfactory results. The variation of the hot acid treatment in the standard chromosome N- or C-banding procedures was the major factor affecting the resolution of the subsequent ISH and GISH. By the sequential chromosome banding–ISH/GISH analysis, multicopy DNA sequences and the breakpoints of wheat–alien translocations were directly allocated to specific chromosomes of wheat. The sequential chromosome banding–ISH/GISH technique should be widely applicable in genome mapping, especially in cytogenetic and molecular mapping of heterochromatic and euchromatic regions of plant and animal chromosomes.

Key words: N-banding, C-banding, in situ hybridization, genomic in situ hybridization.