Evaluating global and local sequence alignment methods for comparing patient medical records

Abstract
Background: Sequence alignment arranges sequences (e.g., DNA, RNA, protein, natural language, financial data, or medical events) to identify regions of similarity and the relatedness between two or more sequences. For Electronic Health Record (EHR) data, sequence alignment helps to identify patients with similar disease trajectories for more relevant and precise prognosis, diagnosis and treatment.
Methods: We tested two cutting-edge global sequence alignment methods, dynamic time warping (DTW) and the Needleman-Wunsch algorithm (NWA), together with their local modifications, DTW for Local alignment (DTWL) and the Smith-Waterman algorithm (SWA), for aligning patient medical records. We used 4 sets of synthetic patient medical records generated from a large real-world EHR database as gold standard data to objectively evaluate these sequence alignment algorithms.
Results: For global sequence alignments, 47 out of 80 DTW alignments and 11 out of 80 NWA alignments had higher similarity scores than the reference alignments, while the remaining 33 DTW alignments and 69 NWA alignments had the same similarity scores as the reference alignments. Forty-six out of 80 DTW alignments had better similarity scores than the NWA alignments, with the remaining 34 cases having equal similarity scores from both algorithms. For local sequence alignments, 70 out of 80 DTWL alignments and 68 out of 80 SWA alignments had larger coverage and higher similarity scores than the reference alignments, while the remaining DTWL and SWA alignments had the same coverage and similarity scores as the reference alignments. Six out of 80 DTWL alignments showed larger coverage and higher similarity scores than the SWA alignments. Thirty DTWL alignments had equal coverage but better similarity scores than SWA. DTWL and SWA had equal coverage and similarity scores in the remaining 44 cases.
Conclusions: DTW, NWA, DTWL and SWA outperformed the reference alignments. DTW (or DTWL) seems to align better than NWA (or SWA) by inserting new daily events and identifying more similarities between patient medical records. The evaluation results could provide valuable information on the strengths and weaknesses of these methods for the future development of sequence alignment methods and patient similarity-based studies.
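
The algorithm families named in this abstract can be sketched on toy sequences of medical events. The following is a minimal illustration, not the authors' implementation: the scoring values (match=1, mismatch=-1, gap=-1), the 0/1 DTW cost and the event names are all assumptions chosen for readability. Note that NWA and SWA maximize a similarity score, while DTW minimizes a distance.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score for aligning all of a with all of b."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def sw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over all pairs of sub-sequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The floor at 0 lets a local alignment restart anywhere.
            cell = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def dtw_distance(a, b):
    """Dynamic time warping with an assumed 0/1 cost per matched pair."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for i in range(1, len(a) + 1):
        curr = [inf]
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            # One event may be matched to several consecutive events.
            curr.append(cost + min(prev[j - 1], prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


# Hypothetical daily event sequences for two patients.
p1 = ["flu", "cough", "xray", "pneumonia"]
p2 = ["flu", "cough", "cough", "xray", "pneumonia"]
print(nw_score(p1, p2), sw_score(p1, p2), dtw_distance(p1, p2))
```

With these assumed parameters the repeated "cough" costs NWA one gap (score 3 instead of 4), whereas DTW absorbs it by matching one event to two (distance 0.0). This is only loosely analogous to the behaviour the abstract attributes to DTW.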


Qudaich: A smart sequence aligner

10.1101/060509 ◽ 2016
Author(s): Sajia Akhter ◽ Robert A Edwards
Keyword(s): Sequence Alignment ◽ DNA Sequences ◽ High Throughput Sequencing ◽ Query Sequence ◽ Metagenomic Data ◽ Alignment Algorithm ◽ Next Generation ◽ Sequence Alignments ◽ Alignment Algorithms ◽ Local Sequence

Abstract: Next-generation sequencing (NGS) technology produces massive amounts of data in a reasonable time and at low cost. Analyzing and annotating these data requires sequence alignments to compare them with genes, proteins and genomes in different databases. Sequence alignment is the first step in metagenomics analysis, and pairwise comparisons of sequence reads provide a measure of similarity between environments. Most current aligners focus on aligning NGS datasets against long reference sequences rather than comparing datasets with each other. As the number of metagenomes and other genomic datasets increases each year, there is a demand for more sophisticated, faster sequence alignment algorithms. Here, we introduce a novel sequence aligner, Qudaich, which can efficiently process large volumes of data and is suited to de novo comparisons of next-generation read datasets. Qudaich can handle both DNA and protein sequences and attempts to provide the best possible alignment for each query sequence. Qudaich can produce more useful alignments more quickly than other contemporary alignment algorithms.
Author Summary: Recent developments in sequencing technology provide high-throughput sequencing data and have resulted in large volumes of genomic and metagenomic data becoming available in public databases. Sequence alignment is an important step in annotating these data. Many sequence aligners have been developed in the last few years for efficient analysis of these data; however, most of them can only align DNA sequences and mainly focus on aligning NGS data against long reference genomes. Therefore, in this study we designed a new sequence aligner, qudaich, which can generate pairwise local sequence alignments (at both the DNA and protein level) between two NGS datasets and can efficiently handle large volumes of NGS data. In qudaich, we introduce a unique sequence alignment algorithm, which outperforms the traditional approaches. Qudaich not only takes less time to execute, but also finds more useful alignments than contemporary aligners.


An Improved Hybridized Evolutionary Algorithm Based on Rules for Local Sequence Alignment

Exploring Critical Approaches of Evolutionary Computation - Advances in Computer and Electrical Engineering ◽ 10.4018/978-1-5225-5832-3.ch011 ◽ 2019 ◽ pp. 215-237
Author(s): Jayapriya J. ◽ Michael Arock
Keyword(s): Evolutionary Algorithm ◽ Sequence Alignment ◽ Biological Database ◽ Computational Time ◽ Matched Pair ◽ Rank Test ◽ Local Alignment ◽ Local Sequence Alignment ◽ Local Sequence ◽ Signed Rank Test

In bioinformatics, sequence alignment is at the heart of sequence analysis. Sequences can be aligned locally or globally, depending on what the biologist needs from the analysis. Because local sequence alignment is considered important, there is demand for an efficient algorithm. Because biological databases contain an enormous number of sequences, there is a trade-off between computational time and accuracy. In general, biological problems are considered computationally intensive, and evolutionary algorithms are often used to solve them. This chapter focuses on local alignment of molecular sequences and proposes an improved hybrid evolutionary algorithm using particle swarm optimization and cellular automata (IPSOCA). The efficiency of the proposed algorithm is demonstrated experimentally on the benchmark dataset BaliBase and compared with other state-of-the-art techniques. The significance of the results is assessed using the Wilcoxon matched pair signed rank test.


LOCAL SEQUENCE-STRUCTURE MOTIFS IN RNA

Journal of Bioinformatics and Computational Biology ◽ 10.1142/s0219720004000818 ◽ 2004 ◽ Vol 02 (04) ◽ pp. 681-698 ◽ Cited By ~ 28
Author(s): ROLF BACKOFEN ◽ SEBASTIAN WILL
Keyword(s): Information Structure ◽ Structure Alignment ◽ General Definition ◽ Local Alignment ◽ Sequence Information ◽ Sequence Structure ◽ Worst Case ◽ RNA Molecules ◽ Alignment Algorithms ◽ Local Sequence

Ribonucleic acid (RNA) enjoys increasing interest in molecular biology; despite this interest, fundamental algorithms are lacking, e.g. for identifying local motifs. Like proteins, RNA molecules have a distinctive structure. Therefore, in addition to sequence information, structure plays an important part in assessing the similarity of RNAs. Furthermore, common sequence-structure features in two or several RNA molecules are often only spatially local, while possibly large parts of the molecules are dissimilar. Consequently, we address the problem of comparing RNA molecules by computing an optimal local alignment with respect to sequence and structure information. While local alignment is superior to global alignment for identifying local similarities, no general local sequence-structure alignment algorithms are currently known. We suggest a new general definition of locality for sequence-structure alignments that is biologically motivated and efficiently tractable. To show the former, we discuss locality of RNA and prove that the defined locality means connectivity by atomic and non-atomic bonds. To show the latter, we present an efficient algorithm for the newly defined pairwise local sequence-structure alignment (lssa) problem for RNA. For molecules of lengths n and m, the algorithm has a worst-case time complexity of O(n^2·m^2·max(n,m)) and a space complexity of only O(n·m). An implementation of our algorithm is available at . Its runtime is competitive with global sequence-structure alignment.


A novel sequence alignment algorithm based on deep learning of the protein folding code

Bioinformatics ◽ 10.1093/bioinformatics/btaa810 ◽ 2020 ◽ Cited By ~ 1
Author(s): Mu Gao ◽ Jeffrey Skolnick
Keyword(s): Protein Folding ◽ Deep Learning ◽ Sequence Alignment ◽ Protein Sequence ◽ Protein Structures ◽ Supplementary Information ◽ Alignment Algorithm ◽ Sequence Alignments ◽ Alignment Algorithms ◽ Structural Alignments

Abstract
Motivation: From evolutionary inference and function annotation to structural prediction, protein sequence comparison has provided crucial biological insights. While many sequence alignment algorithms have been developed, existing approaches often cannot detect hidden structural relationships in the ‘twilight zone’ of low sequence identity. To address this critical problem, we introduce a computational algorithm that performs protein Sequence Alignments from deep-Learning of Structural Alignments (SAdLSA, silent ‘d’). The key idea is to implicitly learn the protein folding code from many thousands of structural alignments using experimentally determined protein structures.
Results: To demonstrate that the folding code was learned, we first show that SAdLSA trained on pure α-helical proteins successfully recognizes pairs of structurally related pure β-sheet protein domains. Subsequent training and benchmarking on larger, highly challenging datasets show significant improvement over established approaches. For challenging cases, SAdLSA is ∼150% better than HHsearch for generating pairwise alignments and ∼50% better for identifying the proteins with the best alignments in a sequence library. The time complexity of SAdLSA is O(N) thanks to GPU acceleration.
Availability and implementation: Datasets and source codes of SAdLSA are available free of charge for academic users at http://sites.gatech.edu/cssb/sadlsa/.
Supplementary information: Supplementary data are available at Bioinformatics online.


Sequence Alignment Algorithms for Intrusion Detection in the Internet of Things

Nonlinear Phenomena in Complex Systems ◽ 10.33581/1561-4085-2020-23-4-397-404 ◽ 2020 ◽ Vol 23 (4) ◽ pp. 397-404
Author(s): M. Kalinin ◽ V. Krundyshev
Keyword(s): Intrusion Detection ◽ Sequence Alignment ◽ Percent Level ◽ Computational Procedure ◽ Individual Characteristics ◽ Local Alignment ◽ New Approach ◽ Alignment Algorithms ◽ Detection Approach ◽ The Internet Of Things

The paper reviews an intrusion detection approach based on bioinformatics algorithms for aligning and comparing nucleotide sequences. Sequence alignment is a nature-inspired computational procedure that matches coded strings by searching for regions of individual characteristics that occur in the same order. A calculated rank of similarity, rather than equality checking, is used to estimate the distance between a sequence of monitored operational acts and a generalized intrusion pattern. A multiple alignment scheme is more effective and accurate than Smith–Waterman local alignment because it can find several blocks of similarity. Compared with a traditional signature-based IDS, the nature-inspired approach is found to provide better performance characteristics. The experimental study shows that the new approach achieves a high level of accuracy (99 percent).


KELSA: A Knowledge-Enriched Local Sequence Alignment Algorithm for Comparing Patient Medical Records

Explainable AI in Healthcare and Medicine - Studies in Computational Intelligence ◽ 10.1007/978-3-030-53352-6_21 ◽ 2020 ◽ pp. 227-240
Author(s): Ming Huang ◽ Nilay D. Shah ◽ Lixia Yao
Keyword(s): Sequence Alignment ◽ Medical Records ◽ Alignment Algorithm ◽ Local Sequence Alignment ◽ Sequence Alignment Algorithm ◽ Local Sequence


BLANT—fast graphlet sampling tool

Bioinformatics ◽ 10.1093/bioinformatics/btz603 ◽ 2019 ◽ Vol 35 (24) ◽ pp. 5363-5364
Author(s): Sridevi Maharaj ◽ Brennan Tracy ◽ Wayne B Hayes
Keyword(s): Functional Similarity ◽ Supplementary Information ◽ Local Alignment ◽ Supplementary Data ◽ Input Graph ◽ Sequence Alignments ◽ PPI Networks ◽ Local Sequence ◽ Taxonomic Trees

Abstract
Summary: BLAST creates local sequence alignments by first building a database of small k-letter sub-sequences called k-mers. Identical k-mers from different regions provide ‘seeds’ for longer local alignments. This seed-and-extend heuristic makes BLAST extremely fast and has led to its almost exclusive use despite the existence of more accurate, but slower, algorithms. In this paper, we introduce the Basic Local Alignment for Networks Tool (BLANT). BLANT is the analog of BLAST, but for networks: given an input graph, it samples small, induced, k-node sub-graphs called k-graphlets. Graphlets have been used to classify networks, quantify structure, align networks both locally and globally, identify topology-function relationships and build taxonomic trees without the use of sequences. Given an input network, BLANT produces millions of graphlet samples in seconds, orders of magnitude faster than existing methods. BLANT offers sampled graphlets in various forms: distributions of graphlets or their orbits; graphlet degree or graphlet orbit degree vectors, the latter being compatible with ORCA; or an index to be used as the basis for seed-and-extend local alignments. We demonstrate BLANT’s usefulness by using its indexing mode to find functional similarity between yeast and human PPI networks.
Availability and implementation: BLANT is written in C and is available at https://github.com/waynebhayes/BLANT/releases.
Supplementary information: Supplementary data are available at Bioinformatics online.
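
The seed-and-extend idea described for BLAST in this abstract can be sketched for plain strings as follows. This is a simplified illustration under stated assumptions, not BLAST's or BLANT's actual code: the function names are hypothetical, and the extension step only grows a seed while characters match exactly, whereas real tools score mismatches and stop on a drop-off threshold.

```python
from collections import defaultdict


def build_kmer_index(seq, k):
    """Map every k-letter sub-sequence of seq to its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def find_seeds(query, index, k):
    """Return (query_pos, target_pos) for every k-mer shared with the index."""
    seeds = []
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], []):
            seeds.append((q, t))
    return seeds


def extend_seed(query, target, q, t, k):
    """Grow a seed left and right while the two strings agree exactly."""
    left = 0
    while (q - left - 1 >= 0 and t - left - 1 >= 0
           and query[q - left - 1] == target[t - left - 1]):
        left += 1
    right = k
    while (q + right < len(query) and t + right < len(target)
           and query[q + right] == target[t + right]):
        right += 1
    # Start in query, start in target, length of the exact match.
    return q - left, t - left, left + right


target = "TTGACCGTAA"
query = "ACCGTC"
k = 3
index = build_kmer_index(target, k)
seeds = find_seeds(query, index, k)
q_start, t_start, length = extend_seed(query, target, *seeds[0], k)
print(seeds, query[q_start:q_start + length])
```

Because the index is built once and each query k-mer is a dictionary lookup, only the regions around shared k-mers are examined, which is where the speed of the heuristic comes from.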


GASAL2: a GPU accelerated sequence alignment library for high-throughput NGS data

BMC Bioinformatics ◽ 10.1186/s12859-019-3086-9 ◽ 2019 ◽ Vol 20 (1) ◽ Cited By ~ 5
Author(s): Nauman Ahmed ◽ Jonathan Lévy ◽ Shanshan Ren ◽ Hamid Mushtaq ◽ Koen Bertels ◽ ...
Keyword(s): Sequence Alignment ◽ High Throughput ◽ High Performance ◽ Local Alignment ◽ Global Alignment ◽ Pairwise Sequence Alignment ◽ RNA Sequences ◽ DNA And RNA ◽ Alignment Algorithms ◽ NGS Data

Abstract
Background: Due to the computational complexity of sequence alignment algorithms, various accelerated solutions have been proposed to speed up this analysis. NVBIO is the only available GPU library that accelerates sequence alignment of high-throughput NGS data, but it has limited performance. In this article we present GASAL2, a GPU library for aligning DNA and RNA sequences that outperforms existing CPU and GPU libraries.
Results: The GASAL2 library provides specialized, accelerated kernels for local, global and all types of semi-global alignment. Pairwise sequence alignment can be performed with and without traceback. GASAL2 outperforms the fastest CPU-optimized SIMD implementations such as SeqAn and Parasail, as well as NVIDIA’s own GPU-based library known as NVBIO. GASAL2 is unique in performing sequence packing on GPU, which is up to 750x faster than NVBIO. Overall, on a GeForce GTX 1080 Ti GPU, GASAL2 is up to 21x faster than Parasail on a dual-socket hyper-threaded Intel Xeon system with 28 cores and up to 13x faster than NVBIO, with a query length of up to 300 bases and 100 bases, respectively. GASAL2 alignment functions are asynchronous/non-blocking and allow full overlap of CPU and GPU execution. The paper shows how to use GASAL2 to accelerate BWA-MEM, speeding up the local alignment by 20x, which gives an overall application speedup of 1.3x vs. CPU with up to 12 threads.
Conclusions: The library provides high-performance APIs for local, global and semi-global alignment that can be easily integrated into various bioinformatics tools.
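
To make the term "sequence packing" concrete, the sketch below stores several bases in one machine word so that fewer memory reads are needed per comparison. It is an assumption-laden illustration, not GASAL2's format or API: the abstract does not describe the packed layout, so the 2-bit code, the 32-bit word size and the function names are hypothetical, and the code runs on the CPU in Python rather than on a GPU.

```python
# Assumed 2-bit code; ambiguous bases such as N are not handled here.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = "ACGT"
BASES_PER_WORD = 16  # 16 bases x 2 bits fill one 32-bit word


def pack(seq):
    """Pack a DNA string into a list of 32-bit integers."""
    words = []
    for start in range(0, len(seq), BASES_PER_WORD):
        word = 0
        for offset, base in enumerate(seq[start:start + BASES_PER_WORD]):
            # The first base of each chunk goes in the lowest two bits.
            word |= ENCODE[base] << (2 * offset)
        words.append(word)
    return words


def unpack(words, length):
    """Recover the first `length` bases from packed words."""
    bases = []
    for i in range(length):
        word = words[i // BASES_PER_WORD]
        code = (word >> (2 * (i % BASES_PER_WORD))) & 0b11
        bases.append(DECODE[code])
    return "".join(bases)


read = "ACGTTGCAACGTACGTAC"  # 18 bases, so two words are needed
packed = pack(read)
print(packed, unpack(packed, len(read)))
```

The original length has to be kept alongside the packed words, because the unused high bits of the last word are indistinguishable from a run of "A" bases.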


Hubsm: A Novel Amino Acid Substitution Matrix for Comparing Hub Proteins

International Journal of Advanced Research in Computer Science and Software Engineering ◽ 10.23956/ijarcsse.v7i8.53 ◽ 2017 ◽ Vol 7 (8) ◽ pp. 212
Author(s): Renganayaki G. ◽ Achuthsankar S. Nair
Keyword(s): Amino Acid ◽ Amino Acid Substitution ◽ Low Complexity ◽ Database Search ◽ Substitution Matrix ◽ Compositional Bias ◽ Sequence Alignments ◽ Amino Acid Substitution Matrix ◽ Alignment Algorithms ◽ Hub Proteins

Sequence alignment algorithms and database search methods use BLOSUM and PAM substitution matrices constructed from general proteins. These de facto matrices are not optimal for accurately aligning proteins with markedly different amino acid compositional bias. In this work, a new amino acid substitution matrix, Hubsm, is calculated for the disorder- and low-complexity-rich regions of Hub proteins, based on residue characteristics. The amino acid background frequencies and substitution scores obtained from Hubsm reveal residue substitution patterns that differ from those of commonly used scoring matrices. When comparing Hub protein sequences to detect homologs, the Hubsm matrix yields better results than the PAM and BLOSUM matrices. The Hubsm matrix can be optimal in database searches and for constructing more accurate sequence alignments of Hub proteins.
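
For readers unfamiliar with how background frequencies and substitution scores relate, the sketch below shows the generic log-odds construction that underlies matrices of the BLOSUM kind. It is not the Hubsm procedure, which the abstract describes only as "based on residue characteristics"; the function name, the half-bit scaling and the toy residue pairs are assumptions for illustration.

```python
import math
from collections import Counter


def log_odds_matrix(aligned_pairs, scale=2.0):
    """Score each observed residue pair as scale * log2(observed / expected)."""
    pair_counts = Counter()
    for a, b in aligned_pairs:
        # Count both orders so the resulting matrix is symmetric.
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1
    total = sum(pair_counts.values())

    # Background frequency of a residue = its share of all aligned positions.
    background = Counter()
    for (a, _), n in pair_counts.items():
        background[a] += n / total

    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / total
        expected = background[a] * background[b]  # chance pairing
        scores[(a, b)] = round(scale * math.log2(observed / expected))
    return scores


# Hypothetical aligned residue pairs taken from trusted alignments.
pairs = [("L", "L")] * 3 + [("K", "K")] * 3 + [("L", "K")] * 2
scores = log_odds_matrix(pairs)
print(scores)
```

A positive score means the pair is seen more often than the background composition predicts, a negative score less often. Changing the background frequencies therefore changes every score, which is why a matrix built from general proteins can fit compositionally biased proteins poorly.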


A Comprehensive Analysis of Sequence Alignment Algorithms for Long-Read Sequencing

Current Bioinformatics ◽ 10.2174/1574893611666160115213144 ◽ 2016 ◽ Vol 11 (3) ◽ pp. 375-381
Author(s): Yu Zhang ◽ Jian Tai He ◽ Yangde Zhang ◽ Ke Zuo
Keyword(s): Sequence Alignment ◽ Comprehensive Analysis ◽ Alignment Algorithms
