Qudaich: A smart sequence aligner

2016 ◽  
Author(s):  
Sajia Akhter ◽  
Robert A Edwards

Abstract
Next generation sequencing (NGS) technology produces massive amounts of data quickly and at low cost. Analyzing and annotating these data requires sequence alignments to compare them with genes, proteins and genomes in different databases. Sequence alignment is the first step in metagenomics analysis, and pairwise comparisons of sequence reads provide a measure of similarity between environments. Most current aligners focus on aligning NGS datasets against long reference sequences rather than comparing datasets against each other. As the number of metagenomes and other genomic datasets increases each year, there is a demand for more sophisticated, faster sequence alignment algorithms. Here, we introduce a novel sequence aligner, Qudaich, which can efficiently process large volumes of data and is suited to de novo comparisons of next generation read datasets. Qudaich can handle both DNA and protein sequences and attempts to provide the best possible alignment for each query sequence. Qudaich produces more useful alignments, and produces them faster, than other contemporary alignment algorithms.

Author Summary
Recent developments in sequencing technology provide high-throughput sequencing data and have resulted in large volumes of genomic and metagenomic data in public databases. Sequence alignment is an important step in annotating these data. Many sequence aligners have been developed in the last few years for efficient analysis of these data; however, most of them can only align DNA sequences and mainly focus on aligning NGS data against long reference genomes. Therefore, in this study we have designed a new sequence aligner, Qudaich, which can generate pairwise local sequence alignments (at both the DNA and protein level) between two NGS datasets and can efficiently handle large NGS datasets. In Qudaich, we introduce a unique sequence alignment algorithm that outperforms traditional approaches. Qudaich not only takes less time to execute, but also finds more useful alignments than contemporary aligners.

Author(s):  
Ming Huang ◽  
Nilay D. Shah ◽  
Lixia Yao

Abstract
Background: Sequence alignment is a way of arranging sequences (e.g., DNA, RNA, protein, natural language, financial data, or medical events) to identify the relatedness between two or more sequences and their regions of similarity. For Electronic Health Record (EHR) data, sequence alignment helps identify patients with similar disease trajectories for more relevant and precise prognosis, diagnosis and treatment.
Methods: We tested two cutting-edge global sequence alignment methods, namely dynamic time warping (DTW) and the Needleman-Wunsch algorithm (NWA), together with their local modifications, DTW for local alignment (DTWL) and the Smith-Waterman algorithm (SWA), for aligning patient medical records. We also used 4 sets of synthetic patient medical records generated from a large real-world EHR database as gold-standard data to objectively evaluate these sequence alignment algorithms.
Results: For global sequence alignment, 47 out of 80 DTW alignments and 11 out of 80 NWA alignments had higher similarity scores than the reference alignments, while the remaining 33 DTW alignments and 69 NWA alignments had the same similarity scores as the reference alignments. Forty-six out of 80 DTW alignments had better similarity scores than the corresponding NWA alignments, with the remaining 34 cases having equal similarity scores from both algorithms. For local sequence alignment, 70 out of 80 DTWL alignments and 68 out of 80 SWA alignments had larger coverage and higher similarity scores than the reference alignments, while the remaining DTWL and SWA alignments had the same coverage and similarity scores as the reference alignments. Six out of 80 DTWL alignments showed larger coverage and higher similarity scores than the corresponding SWA alignments, 30 DTWL alignments had equal coverage but better similarity scores than SWA, and DTWL and SWA had equal coverage and similarity scores in the remaining 44 cases.
Conclusions: DTW, NWA, DTWL and SWA outperformed the reference alignments. DTW (or DTWL) appears to align better than NWA (or SWA) by inserting new daily events and identifying more similarities between patient medical records. These evaluation results provide valuable information on the strengths and weaknesses of these sequence alignment methods for future development of alignment methods and patient-similarity studies.
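For orientation, the sketch below shows classic dynamic time warping in its simplest form. The event representation, distance function, and parameter values are illustrative assumptions only; the DTW/DTWL variants evaluated in the paper (local alignment over daily medical events with event-level similarity scores) are considerably more elaborate.

```python
import numpy as np

def dtw_distance(seq_a, seq_b, dist=lambda x, y: abs(x - y)):
    """Classic global dynamic time warping between two sequences.

    A generic, illustrative DTW; not the paper's DTW/DTWL implementation.
    """
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            # Each cell extends the cheapest of: match, insertion in a, insertion in b.
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    return cost[n, m]

# Toy usage on hypothetical numeric "daily event" codes:
print(dtw_distance([1, 2, 2, 3], [1, 2, 3, 3]))
```

Needleman-Wunsch solves the analogous dynamic program with explicit gap penalties instead of repeated-element warping, which is the main behavioral difference the comparison above explores.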


Author(s):  
Mu Gao ◽  
Jeffrey Skolnick

Abstract
Motivation: From evolutionary inference and function annotation to structural prediction, protein sequence comparison has provided crucial biological insights. While many sequence alignment algorithms have been developed, existing approaches often cannot detect hidden structural relationships in the ‘twilight zone’ of low sequence identity. To address this critical problem, we introduce a computational algorithm that performs protein Sequence Alignments from deep-Learning of Structural Alignments (SAdLSA, silent ‘d’). The key idea is to implicitly learn the protein folding code from many thousands of structural alignments of experimentally determined protein structures.
Results: To demonstrate that the folding code was learned, we first show that SAdLSA trained on pure α-helical proteins successfully recognizes pairs of structurally related pure β-sheet protein domains. Subsequent training and benchmarking on larger, highly challenging datasets show significant improvement over established approaches. For challenging cases, SAdLSA is ∼150% better than HHsearch at generating pairwise alignments and ∼50% better at identifying the proteins with the best alignments in a sequence library. The time complexity of SAdLSA is O(N) thanks to GPU acceleration.
Availability and implementation: Datasets and source code for SAdLSA are available free of charge for academic users at http://sites.gatech.edu/cssb/sadlsa/.
Contact: [email protected] or [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.


Author(s):  
Muhammad Yahya ◽  
Laiq Hasan ◽  
Syed Asad Ali ◽  
...  

Rapid evolution in sequencing technologies generates data on an enormous scale. A central challenge in analyzing data at such a large scale is the alignment of DNA/protein sequences, whereby reads are compared to reference sequences. To find similar sequences, alignment algorithms are used to align a query sequence against a database. Alignment algorithms can be used to classify the source of a sequence, to discover similarities among organisms, or to infer an ancestral relationship. A wide range of alignment algorithms has been developed in recent years. In this paper, a method of accelerating such algorithms using GPUs is investigated. The Swiss-Prot database was processed using a GPU implementation of the Smith-Waterman sequence alignment algorithm. The first step of the process generates the alignment scores but not the actual alignments. Available alignment tools such as ssearch2 are then used to align the output file generated in the first step. The performance of the GPU-accelerated implementation is then compared with other techniques in terms of throughput. The Swiss-Prot database was aligned using various alignment tools, and an NVIDIA Tesla K40 GPU was used to generate the results for this research. This implementation achieves 44.3 giga cell updates per second (GCUPS), which is 22.9 times better than the corresponding implementation on a GTX 275. Performance is improved because the workload of equal-length sequences is distributed evenly among all the threads on the GPU's multiprocessors.
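For reference, below is a minimal, score-only Smith-Waterman recurrence in plain Python together with the GCUPS calculation used as the throughput metric above. The scoring parameters and test strings are illustrative assumptions; this is not the GPU kernel described in the paper, only the dynamic program it accelerates.

```python
import time

def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Score-only Smith-Waterman local alignment with linear gap penalty.
    Plain Python for clarity, not speed; parameters are illustrative."""
    cols = len(target) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if query[i - 1] == target[j - 1] else mismatch)
            # Local alignment: cell scores never drop below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

# GCUPS = (cells updated) / (elapsed seconds) / 1e9.
q, t = "HEAGAWGHEE" * 50, "PAWHEAE" * 50  # toy protein-like strings
start = time.perf_counter()
score = smith_waterman_score(q, t)
elapsed = time.perf_counter() - start
print(score, (len(q) * len(t)) / elapsed / 1e9, "GCUPS")
```

The GPU implementation reaches tens of GCUPS by computing many such score matrices in parallel across threads, which is why balancing equal-length sequences across multiprocessors matters.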


2021 ◽  
Vol 11 ◽  
Author(s):  
Haipeng Shi ◽  
Haihe Shi ◽  
Shenghua Xu

As a key algorithm in bioinformatics, sequence alignment is widely used in sequence similarity analysis and genome sequence database search. Existing research focuses mainly on specific steps of an algorithm or on specific problems, and lacks a high-level, abstract, domain-wide algorithm framework. Multiple sequence alignment algorithms are complex, redundant, and difficult to understand; it is not easy for users to select the appropriate algorithm, and computing errors may occur. Building on our pairwise sequence alignment algorithm component library and the PAR software platform, we develop several additional domain components for the multiple sequence alignment application domain. With these, a specific multiple sequence alignment algorithm can be designed and its corresponding program (C++, Java, or Python) generated efficiently, improving both the development efficiency of complex algorithms and the accuracy of sequence alignment calculations. A star alignment algorithm is designed and generated to demonstrate the development process.
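To make the example concrete, here is a minimal sketch of the first step of a star (center-star) multiple alignment: choosing the center sequence as the one with the highest summed pairwise alignment score, after which the remaining sequences are each aligned to the center and their gaps merged into one MSA. The scoring scheme and sequences are illustrative assumptions; the PAR-generated components themselves are not shown.

```python
from itertools import combinations

def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score only, linear gap penalty (toy parameters)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]

def choose_center(seqs):
    """Star alignment, step 1: pick the sequence whose summed pairwise
    score against all other sequences is highest."""
    totals = [0] * len(seqs)
    for i, j in combinations(range(len(seqs)), 2):
        s = global_score(seqs[i], seqs[j])
        totals[i] += s
        totals[j] += s
    return max(range(len(seqs)), key=totals.__getitem__)

seqs = ["ACGTGA", "ACGTA", "ACGGGA", "ATGTGA"]  # toy DNA reads
print("center sequence index:", choose_center(seqs))
```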


2020 ◽  
Author(s):  
Kentaro Tomii ◽  
Shravan Kumar ◽  
Degui Zhi ◽  
Steven E. Brenner

Abstract
Background: Insertion and deletion sequencing errors are relatively common in next-generation sequencing data and produce long stretches of mistranslated sequence. These frameshifting errors can seriously harm downstream analysis of the reads. However, more precise alignment of DNA sequences can be obtained by taking into account both the coding frame and sequencing errors estimated from quality scores.
Results: Here we design and propose a novel hidden Markov model (HMM)-based pairwise alignment algorithm, Meta-Align, that aligns DNA sequences in protein space, incorporating quality scores from the DNA sequences and allowing frameshifts caused by insertions and deletions. Our model is based on both an HMM transducer of a pair HMM and profile HMMs for all possible amino acid pairs. A Viterbi algorithm over this model produces the optimal alignment of a pair of metagenomic reads, taking into account all possible translation frames and gap penalties in both protein space and DNA space. To reduce the sheer number of states in this model, we also derived and implemented a computationally feasible model that leverages the degeneracy of the genetic code. In a benchmark on a diverse set of simulated reads based on BAliBASE, we show that Meta-Align outperforms TBLASTX, which compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database using the BLAST algorithm. We also demonstrate the effect of incorporating quality scores on Meta-Align.
Conclusions: Meta-Align will be particularly effective when applied to error-prone DNA sequences. Our software package can be downloaded at https://github.com/shravan-repos/Metaalign.
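To illustrate the pair-HMM idea in its simplest form, below is a toy Viterbi decoder over a three-state DNA pair HMM (one match state and two insert states). All transition and emission parameters are arbitrary illustrative values; Meta-Align's actual model, which works in protein space, incorporates quality scores, and models frameshifts, is far richer than this sketch.

```python
import math

def pair_hmm_viterbi(x, y, p_match=0.6, p_mismatch=0.05, delta=0.2, eps=0.1):
    """Viterbi over a toy 3-state pair HMM: M (aligned pair), X (base in x only),
    Y (base in y only). Returns the log-probability of the most likely path."""
    NEG = float("-inf")
    n, m = len(x), len(y)
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0                     # begin state folded into M
    log_q = math.log(0.25)            # insert-state emission (uniform over ACGT)
    log_m = math.log(1 - 2 * delta)   # stay in match
    log_d = math.log(delta)           # open a gap
    log_e = math.log(eps)             # extend a gap
    log_c = math.log(1 - eps)         # close a gap back to match
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                emit = math.log(p_match if x[i - 1] == y[j - 1] else p_mismatch)
                M[i][j] = emit + max(M[i - 1][j - 1] + log_m,
                                     X[i - 1][j - 1] + log_c,
                                     Y[i - 1][j - 1] + log_c)
            if i > 0:
                X[i][j] = log_q + max(M[i - 1][j] + log_d, X[i - 1][j] + log_e)
            if j > 0:
                Y[i][j] = log_q + max(M[i][j - 1] + log_d, Y[i][j - 1] + log_e)
    return max(M[n][m], X[n][m], Y[n][m])

print(pair_hmm_viterbi("ACGTACGT", "ACGACGT"))
```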


2009 ◽  
Vol 22 (12) ◽  
pp. 1624-1634 ◽  
Author(s):  
Heather L. Tyler ◽  
Luiz F. W. Roesch ◽  
Siddarame Gowda ◽  
William O. Dawson ◽  
Eric W. Triplett

The citrus disease Huanglongbing (HLB) is highly destructive in many citrus-growing regions of the world. The putative causal agent of this disease, ‘Candidatus Liberibacter asiaticus’, is difficult to culture, and Koch's postulates have not yet been fulfilled. As a result, efforts have focused on obtaining the genome sequence of ‘Ca. L. asiaticus’ in order to give insight into the physiology of this organism. In this work, three next-generation high-throughput sequencing platforms, 454, Solexa, and SOLiD, were used to obtain metagenomic DNA sequences from phloem tissue of Florida citrus trees infected with HLB. A culture-independent, polymerase chain reaction (PCR)-independent analysis of 16S ribosomal RNA sequences showed that the only bacterium present within the phloem metagenome was ‘Ca. L. asiaticus’. No viral or viroid sequences were identified within the metagenome. By reference assembly, the phloem metagenome contained sequences that provided 26-fold coverage of the ‘Ca. L. asiaticus’ contigs in GenBank. By the same approach, the phloem metagenomic data yielded less than 0.2-fold coverage of five other alphaproteobacterial genomes. Thus, phloem metagenomic DNA provided a PCR-independent means of verifying the presence of ‘Ca. L. asiaticus’ in infected tissue and strongly suggests that no other disease agent was present in the phloem. Analysis of these metagenomic data suggests that this approach has a detection limit of one ‘Ca. Liberibacter’ cell for every 52 phloem cells. The phloem sample sequenced here is estimated to have contained 1.7 ‘Ca. Liberibacter’ cells per phloem cell.


2021 ◽  
Vol 22 (S9) ◽  
Author(s):  
Wei Quan ◽  
Bo Liu ◽  
Yadong Wang

Abstract
Background: DNA sequence alignment is a common first step in most applications of high-throughput sequencing technologies. The accuracy of sequence alignments directly affects the accuracy of downstream analyses, such as variant calling and quantitative analysis of the transcriptome; therefore, rapidly and accurately mapping reads to a reference genome is a significant topic in bioinformatics. Conventional DNA read aligners map reads to a linear reference genome (such as the GRCh38 primary assembly). However, such a linear reference genome represents the genome of only one or a few individuals and thus lacks information on variation in the population. This limitation can introduce bias and impact the sensitivity and accuracy of mapping. Recently, a number of aligners have begun to map reads to populations of genomes, which can be represented by a reference genome plus a large number of genetic variants. However, compared with linear-reference aligners, an aligner that stores and indexes all genetic variants has a high memory (RAM) cost and an extremely long run time; aligning reads to a graph-model-based index that includes all types of variants is, in theory, an NP-hard problem. By contrast, considering only single nucleotide polymorphism (SNP) information reduces the complexity of the index and improves the speed of sequence alignment.
Results: The SNP-aware alignment tool (SALT) is a fast, memory-efficient, SNP-aware short-read alignment tool. SALT uses 5.8 GB of RAM to index a human reference genome (GRCh38) incorporating 12.8M UCSC common SNPs. Compared with a state-of-the-art aligner, SALT has similar speed but higher accuracy.
Conclusions: Herein, we present an SNP-aware alignment tool (SALT) that aligns reads to a reference genome incorporating an SNP database. We benchmarked SALT using simulated and real datasets. The results demonstrate that SALT can efficiently map reads to the reference genome with significantly improved accuracy. Incorporating SNP information can improve the accuracy of read alignment and reveal novel variants. The source code is freely available at https://github.com/weiquan/SALT.


2015 ◽  
Author(s):  
Davide Verzotto ◽  
Axel M Hillmer ◽  
Audrey S M Teo ◽  
Niranjan Nagarajan

Resolution of complex repeat structures and rearrangements in the assembly and analysis of large eukaryotic genomes is often aided by a combination of high-throughput sequencing and mapping technologies (e.g., optical restriction mapping). In particular, mapping technologies can generate sparse maps of large DNA fragments (150 kbp–2 Mbp) and thus provide a unique source of information for disambiguating complex rearrangements in cancer genomes. Despite their utility, combining high-throughput sequencing and mapping technologies has been challenging due to the lack of efficient and freely available software for robustly aligning maps to sequences. Here we introduce two new map-to-sequence alignment algorithms that efficiently and accurately align high-throughput mapping datasets to large, eukaryotic genomes while accounting for high error rates. In order to do so, these methods (OPTIMA for glocal and OPTIMA-Overlap for overlap alignment) exploit the ability to create efficient data structures that index continuous-valued mapping data while accounting for errors. We also introduce an approach for evaluating the significance of alignments that avoids expensive permutation-based tests while being agnostic to technology-dependent error rates. Our benchmarking results suggest that OPTIMA and OPTIMA-Overlap outperform state-of-the-art approaches in sensitivity (1.6–2X improvement) while simultaneously being more efficient (170–200%) and more precise in their alignments (99% precision). These advantages are independent of the quality of the data, suggesting that our indexing approach and statistical evaluation are robust and provide improved sensitivity while guaranteeing high precision.

