Qudaich: A smart sequence aligner

2016 ◽  
Author(s):  
Sajia Akhter ◽  
Robert A Edwards

Abstract
Next generation sequencing (NGS) technology produces massive amounts of data quickly and at low cost. Analyzing and annotating these data requires sequence alignments to compare them with genes, proteins and genomes in different databases. Sequence alignment is the first step in metagenomics analysis, and pairwise comparisons of sequence reads provide a measure of similarity between environments. Most current aligners focus on aligning NGS datasets against long reference sequences rather than comparing datasets against each other. As the number of metagenomes and other genomic datasets increases each year, there is a demand for more sophisticated, faster sequence alignment algorithms. Here, we introduce a novel sequence aligner, Qudaich, which can efficiently process large volumes of data and is suited to de novo comparisons of next generation read datasets. Qudaich can handle both DNA and protein sequences and attempts to provide the best possible alignment for each query sequence. Qudaich produces more useful alignments, and produces them faster, than other contemporary alignment algorithms.

Author Summary
Recent developments in sequencing technology provide high-throughput sequencing data and have resulted in large volumes of genomic and metagenomic data in public databases. Sequence alignment is an important step in annotating these data. Many sequence aligners have been developed in the last few years for efficient analysis of these data; however, most of them can only align DNA sequences and mainly focus on aligning NGS data against long reference genomes. Therefore, in this study we have designed a new sequence aligner, Qudaich, which can generate pairwise local sequence alignments (at both the DNA and protein level) between two NGS datasets and can efficiently handle large NGS datasets. In Qudaich, we introduce a unique sequence alignment algorithm that outperforms traditional approaches. Qudaich not only takes less time to execute, but also finds more useful alignments than contemporary aligners.

Author(s):  
Ming Huang ◽  
Nilay D. Shah ◽  
Lixia Yao

Abstract
Background: Sequence alignment is a way of arranging sequences (e.g., DNA, RNA, protein, natural language, financial data, or medical events) to identify the relatedness between two or more sequences and their regions of similarity. For Electronic Health Record (EHR) data, sequence alignment helps identify patients with similar disease trajectories for more relevant and precise prognosis, diagnosis and treatment.
Methods: We tested two cutting-edge global sequence alignment methods, namely dynamic time warping (DTW) and the Needleman-Wunsch algorithm (NWA), together with their local modifications, DTW for local alignment (DTWL) and the Smith-Waterman algorithm (SWA), for aligning patient medical records. We also used 4 sets of synthetic patient medical records generated from a large real-world EHR database as gold-standard data to objectively evaluate these sequence alignment algorithms.
Results: For global sequence alignment, 47 out of 80 DTW alignments and 11 out of 80 NWA alignments had higher similarity scores than the reference alignments, while the remaining 33 DTW alignments and 69 NWA alignments had the same similarity scores as the reference alignments. Forty-six out of 80 DTW alignments had better similarity scores than the corresponding NWA alignments, with the remaining 34 cases having equal similarity scores from both algorithms. For local sequence alignment, 70 out of 80 DTWL alignments and 68 out of 80 SWA alignments had larger coverage and higher similarity scores than the reference alignments, while the remaining DTWL and SWA alignments had the same coverage and similarity scores as the reference alignments. Six out of 80 DTWL alignments showed larger coverage and higher similarity scores than the corresponding SWA alignments, 30 DTWL alignments had equal coverage but better similarity scores than SWA, and DTWL and SWA had equal coverage and similarity scores in the remaining 44 cases.
Conclusions: DTW, NWA, DTWL and SWA outperformed the reference alignments. DTW (or DTWL) appears to align better than NWA (or SWA) by inserting new daily events and identifying more similarities between patient medical records. These evaluation results provide valuable information on the strengths and weaknesses of these sequence alignment methods for future development of alignment methods and patient-similarity studies.
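For orientation, the sketch below shows classic dynamic time warping in its simplest form. The event representation, distance function, and parameter values are illustrative assumptions only; the DTW/DTWL variants evaluated in the paper (local alignment over daily medical events with event-level similarity scores) are considerably more elaborate.

```python
import numpy as np

def dtw_distance(seq_a, seq_b, dist=lambda x, y: abs(x - y)):
    """Classic global dynamic time warping between two sequences.

    A generic, illustrative DTW; not the paper's DTW/DTWL implementation.
    """
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            # Each cell extends the cheapest of: match, insertion in a, insertion in b.
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    return cost[n, m]

# Toy usage on hypothetical numeric "daily event" codes:
print(dtw_distance([1, 2, 2, 3], [1, 2, 3, 3]))
```

Needleman-Wunsch solves the analogous dynamic program with explicit gap penalties instead of repeated-element warping, which is the main behavioral difference the comparison above explores.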


Author(s):  
Mu Gao ◽  
Jeffrey Skolnick

Abstract
Motivation: From evolutionary inference and function annotation to structural prediction, protein sequence comparison has provided crucial biological insights. While many sequence alignment algorithms have been developed, existing approaches often cannot detect hidden structural relationships in the ‘twilight zone’ of low sequence identity. To address this critical problem, we introduce a computational algorithm that performs protein Sequence Alignments from deep-Learning of Structural Alignments (SAdLSA, silent ‘d’). The key idea is to implicitly learn the protein folding code from many thousands of structural alignments of experimentally determined protein structures.
Results: To demonstrate that the folding code was learned, we first show that SAdLSA trained on pure α-helical proteins successfully recognizes pairs of structurally related pure β-sheet protein domains. Subsequent training and benchmarking on larger, highly challenging datasets show significant improvement over established approaches. For challenging cases, SAdLSA is ∼150% better than HHsearch at generating pairwise alignments and ∼50% better at identifying the proteins with the best alignments in a sequence library. The time complexity of SAdLSA is O(N) thanks to GPU acceleration.
Availability and implementation: Datasets and source code for SAdLSA are available free of charge for academic users at http://sites.gatech.edu/cssb/sadlsa/.
Contact: [email protected] or [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.


Author(s):  
Muhammad Yahya ◽  
Laiq Hasan ◽  
Syed Asad Ali ◽  
...  

Rapid evolution in sequencing technologies generates data on an enormous scale. A central challenge in analyzing data at such a large scale is the alignment of DNA/protein sequences, whereby reads are compared to reference sequences. To find similar sequences, alignment algorithms are used to align a query sequence against a database. Alignment algorithms can be used to classify the source of a sequence, to discover similarities among organisms, or to infer an ancestral relationship. A wide range of alignment algorithms has been developed in recent years. In this paper, a method of accelerating such algorithms using GPUs is investigated. The Swiss-Prot database was processed using a GPU implementation of the Smith-Waterman sequence alignment algorithm. The first step of the process generates the alignment scores but not the actual alignments. Available alignment tools such as ssearch2 are then used to align the output file generated in the first step. The performance of the GPU-accelerated implementation is then compared with other techniques in terms of throughput. The Swiss-Prot database was aligned using various alignment tools, and an NVIDIA Tesla K40 GPU was used to generate the results for this research. This implementation achieves 44.3 giga cell updates per second (GCUPS), which is 22.9 times better than the corresponding implementation on a GTX 275. Performance is improved because the workload of equal-length sequences is distributed evenly among all the threads on the GPU's multiprocessors.
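For reference, below is a minimal, score-only Smith-Waterman recurrence in plain Python together with the GCUPS calculation used as the throughput metric above. The scoring parameters and test strings are illustrative assumptions; this is not the GPU kernel described in the paper, only the dynamic program it accelerates.

```python
import time

def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Score-only Smith-Waterman local alignment with linear gap penalty.
    Plain Python for clarity, not speed; parameters are illustrative."""
    cols = len(target) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if query[i - 1] == target[j - 1] else mismatch)
            # Local alignment: cell scores never drop below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

# GCUPS = (cells updated) / (elapsed seconds) / 1e9.
q, t = "HEAGAWGHEE" * 50, "PAWHEAE" * 50  # toy protein-like strings
start = time.perf_counter()
score = smith_waterman_score(q, t)
elapsed = time.perf_counter() - start
print(score, (len(q) * len(t)) / elapsed / 1e9, "GCUPS")
```

The GPU implementation reaches tens of GCUPS by computing many such score matrices in parallel across threads, which is why balancing equal-length sequences across multiprocessors matters.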


2021 ◽  
Vol 11 ◽  
Author(s):  
Haipeng Shi ◽  
Haihe Shi ◽  
Shenghua Xu

As a key algorithm in bioinformatics, sequence alignment is widely used in sequence similarity analysis and genome sequence database search. Existing research focuses mainly on specific steps of an algorithm or on specific problems, and lacks a high-level, abstract, domain-wide algorithm framework. Multiple sequence alignment algorithms are complex, redundant, and difficult to understand; it is not easy for users to select the appropriate algorithm, and computing errors may occur. Building on our pairwise sequence alignment algorithm component library and the PAR software platform, we develop several additional domain components for the multiple sequence alignment application domain. With these, a specific multiple sequence alignment algorithm can be designed and its corresponding program (C++, Java, or Python) generated efficiently, improving both the development efficiency of complex algorithms and the accuracy of sequence alignment calculations. A star alignment algorithm is designed and generated to demonstrate the development process.
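To make the example concrete, here is a minimal sketch of the first step of a star (center-star) multiple alignment: choosing the center sequence as the one with the highest summed pairwise alignment score, after which the remaining sequences are each aligned to the center and their gaps merged into one MSA. The scoring scheme and sequences are illustrative assumptions; the PAR-generated components themselves are not shown.

```python
from itertools import combinations

def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score only, linear gap penalty (toy parameters)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]

def choose_center(seqs):
    """Star alignment, step 1: pick the sequence whose summed pairwise
    score against all other sequences is highest."""
    totals = [0] * len(seqs)
    for i, j in combinations(range(len(seqs)), 2):
        s = global_score(seqs[i], seqs[j])
        totals[i] += s
        totals[j] += s
    return max(range(len(seqs)), key=totals.__getitem__)

seqs = ["ACGTGA", "ACGTA", "ACGGGA", "ATGTGA"]  # toy DNA reads
print("center sequence index:", choose_center(seqs))
```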


2020 ◽  
Author(s):  
Kentaro Tomii ◽  
Shravan Kumar ◽  
Degui Zhi ◽  
Steven E. Brenner

Abstract
Background: Insertion and deletion sequencing errors are relatively common in next-generation sequencing data and produce long stretches of mistranslated sequence. These frameshifting errors can seriously harm downstream analysis of the reads. However, more precise alignment of DNA sequences can be obtained by taking into account both the coding frame and sequencing errors estimated from quality scores.
Results: Here we design and propose a novel hidden Markov model (HMM)-based pairwise alignment algorithm, Meta-Align, that aligns DNA sequences in protein space, incorporating quality scores from the DNA sequences and allowing frameshifts caused by insertions and deletions. Our model is based on both an HMM transducer of a pair HMM and profile HMMs for all possible amino acid pairs. A Viterbi algorithm over this model produces the optimal alignment of a pair of metagenomic reads, taking into account all possible translation frames and gap penalties in both protein space and DNA space. To reduce the sheer number of states in this model, we also derived and implemented a computationally feasible model that leverages the degeneracy of the genetic code. In a benchmark on a diverse set of simulated reads based on BAliBASE, we show that Meta-Align outperforms TBLASTX, which compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database using the BLAST algorithm. We also demonstrate the effect of incorporating quality scores on Meta-Align.
Conclusions: Meta-Align will be particularly effective when applied to error-prone DNA sequences. Our software package can be downloaded at https://github.com/shravan-repos/Metaalign.
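To illustrate the pair-HMM idea in its simplest form, below is a toy Viterbi decoder over a three-state DNA pair HMM (one match state and two insert states). All transition and emission parameters are arbitrary illustrative values; Meta-Align's actual model, which works in protein space, incorporates quality scores, and models frameshifts, is far richer than this sketch.

```python
import math

def pair_hmm_viterbi(x, y, p_match=0.6, p_mismatch=0.05, delta=0.2, eps=0.1):
    """Viterbi over a toy 3-state pair HMM: M (aligned pair), X (base in x only),
    Y (base in y only). Returns the log-probability of the most likely path."""
    NEG = float("-inf")
    n, m = len(x), len(y)
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0                     # begin state folded into M
    log_q = math.log(0.25)            # insert-state emission (uniform over ACGT)
    log_m = math.log(1 - 2 * delta)   # stay in match
    log_d = math.log(delta)           # open a gap
    log_e = math.log(eps)             # extend a gap
    log_c = math.log(1 - eps)         # close a gap back to match
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                emit = math.log(p_match if x[i - 1] == y[j - 1] else p_mismatch)
                M[i][j] = emit + max(M[i - 1][j - 1] + log_m,
                                     X[i - 1][j - 1] + log_c,
                                     Y[i - 1][j - 1] + log_c)
            if i > 0:
                X[i][j] = log_q + max(M[i - 1][j] + log_d, X[i - 1][j] + log_e)
            if j > 0:
                Y[i][j] = log_q + max(M[i][j - 1] + log_d, Y[i][j - 1] + log_e)
    return max(M[n][m], X[n][m], Y[n][m])

print(pair_hmm_viterbi("ACGTACGT", "ACGACGT"))
```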


2009 ◽  
Vol 22 (12) ◽  
pp. 1624-1634 ◽  
Author(s):  
Heather L. Tyler ◽  
Luiz F. W. Roesch ◽  
Siddarame Gowda ◽  
William O. Dawson ◽  
Eric W. Triplett

The citrus disease Huanglongbing (HLB) is highly destructive in many citrus-growing regions of the world. The putative causal agent of this disease, ‘Candidatus Liberibacter asiaticus’, is difficult to culture, and Koch's postulates have not yet been fulfilled. As a result, efforts have focused on obtaining the genome sequence of ‘Ca. L. asiaticus’ in order to give insight into the physiology of this organism. In this work, three next-generation high-throughput sequencing platforms, 454, Solexa, and SOLiD, were used to obtain metagenomic DNA sequences from phloem tissue of Florida citrus trees infected with HLB. A culture-independent, polymerase chain reaction (PCR)-independent analysis of 16S ribosomal RNA sequences showed that the only bacterium present within the phloem metagenome was ‘Ca. L. asiaticus’. No viral or viroid sequences were identified within the metagenome. By reference assembly, the phloem metagenome contained sequences that provided 26-fold coverage of the ‘Ca. L. asiaticus’ contigs in GenBank. By the same approach, the phloem metagenomic data yielded less than 0.2-fold coverage of five other alphaproteobacterial genomes. Thus, phloem metagenomic DNA provided a PCR-independent means of verifying the presence of ‘Ca. L. asiaticus’ in infected tissue and strongly suggests that no other disease agent was present in the phloem. Analysis of these metagenomic data suggests that this approach has a detection limit of one ‘Ca. Liberibacter’ cell for every 52 phloem cells. The phloem sample sequenced here is estimated to have contained 1.7 ‘Ca. Liberibacter’ cells per phloem cell.


2021 ◽  
Vol 22 (S9) ◽  
Author(s):  
Wei Quan ◽  
Bo Liu ◽  
Yadong Wang

Abstract
Background: DNA sequence alignment is a common first step in most applications of high-throughput sequencing technologies. The accuracy of sequence alignments directly affects the accuracy of downstream analyses, such as variant calling and quantitative analysis of the transcriptome; therefore, rapidly and accurately mapping reads to a reference genome is a significant topic in bioinformatics. Conventional DNA read aligners map reads to a linear reference genome (such as the GRCh38 primary assembly). However, such a linear reference genome represents the genome of only one or a few individuals and thus lacks information on variation in the population. This limitation can introduce bias and impact the sensitivity and accuracy of mapping. Recently, a number of aligners have begun to map reads to populations of genomes, which can be represented by a reference genome plus a large number of genetic variants. However, compared with linear-reference aligners, an aligner that stores and indexes all genetic variants has a high memory (RAM) cost and an extremely long run time; aligning reads to a graph-model-based index that includes all types of variants is, in theory, an NP-hard problem. By contrast, considering only single nucleotide polymorphism (SNP) information reduces the complexity of the index and improves the speed of sequence alignment.
Results: The SNP-aware alignment tool (SALT) is a fast, memory-efficient, SNP-aware short-read alignment tool. SALT uses 5.8 GB of RAM to index a human reference genome (GRCh38) incorporating 12.8M UCSC common SNPs. Compared with a state-of-the-art aligner, SALT has similar speed but higher accuracy.
Conclusions: Herein, we present an SNP-aware alignment tool (SALT) that aligns reads to a reference genome incorporating an SNP database. We benchmarked SALT using simulated and real datasets. The results demonstrate that SALT can efficiently map reads to the reference genome with significantly improved accuracy. Incorporating SNP information can improve the accuracy of read alignment and reveal novel variants. The source code is freely available at https://github.com/weiquan/SALT.


2015 ◽  
Author(s):  
Davide Verzotto ◽  
Axel M Hillmer ◽  
Audrey S M Teo ◽  
Niranjan Nagarajan

Resolution of complex repeat structures and rearrangements in the assembly and analysis of large eukaryotic genomes is often aided by a combination of high-throughput sequencing and mapping technologies (e.g., optical restriction mapping). In particular, mapping technologies can generate sparse maps of large DNA fragments (150 kbp–2 Mbp) and thus provide a unique source of information for disambiguating complex rearrangements in cancer genomes. Despite their utility, combining high-throughput sequencing and mapping technologies has been challenging due to the lack of efficient and freely available software for robustly aligning maps to sequences. Here we introduce two new map-to-sequence alignment algorithms that efficiently and accurately align high-throughput mapping datasets to large, eukaryotic genomes while accounting for high error rates. In order to do so, these methods (OPTIMA for glocal and OPTIMA-Overlap for overlap alignment) exploit the ability to create efficient data structures that index continuous-valued mapping data while accounting for errors. We also introduce an approach for evaluating the significance of alignments that avoids expensive permutation-based tests while being agnostic to technology-dependent error rates. Our benchmarking results suggest that OPTIMA and OPTIMA-Overlap outperform state-of-the-art approaches in sensitivity (1.6–2X improvement) while simultaneously being more efficient (170–200%) and more precise in their alignments (99% precision). These advantages are independent of the quality of the data, suggesting that our indexing approach and statistical evaluation are robust and provide improved sensitivity while guaranteeing high precision.

