High Performance Pattern Matching on Heterogeneous Platform

Summary: Pattern discovery is one of the fundamental tasks in bioinformatics, and pattern recognition is a powerful technique for searching for sequence patterns in biological sequence databases. Fast, high-performance algorithms are in high demand in many applications in bioinformatics and computational molecular biology, because the significant increase in the number of DNA and protein sequences raises the need for faster pattern matching algorithms. Heterogeneous architectures can be a good choice for this purpose due to their potential for high performance and energy efficiency. In this paper we present an efficient implementation, on a heterogeneous CPU/GPU architecture, of two algorithms: Aho-Corasick (AC), a well-known exact pattern matching algorithm with linear complexity, and Parallel Failureless Aho-Corasick (PFAC), a massively parallelized version of AC without failure transitions. We progressively redesigned the algorithms and data structures to fit the GPU architecture. Our results on different protein sequence data sets show that the new implementation runs 15 times faster than the original implementation of the PFAC algorithm.
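
The difference between AC and PFAC described in the summary can be illustrated with a minimal Python sketch. This is not the paper's CPU/GPU implementation: the function names (`build_trie`, `ac_search`, `pfac_search`), the sample sequence, and the patterns are all hypothetical, and the PFAC loop runs sequentially here, with each loop iteration standing in for what would be one GPU thread.

```python
from collections import deque


def build_trie(patterns):
    """Build the goto table (one dict per state) and per-state outputs."""
    goto = [{}]
    out = [[]]
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    return goto, out


def ac_search(text, patterns):
    """Classic Aho-Corasick: a single pass over the text using failure links."""
    goto, out = build_trie(patterns)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:  # breadth-first construction of failure links
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]  # inherit matches ending here
    matches = []
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))
    return sorted(matches)


def pfac_search(text, patterns):
    """Failureless variant: one independent trie walk per start position.

    Each iteration of the outer loop plays the role of one GPU thread
    (an assumption of this sketch); a walk stops at the first missing
    transition, so no failure links are needed.
    """
    goto, out = build_trie(patterns)
    matches = []
    for start in range(len(text)):
        state = 0
        for i in range(start, len(text)):
            state = goto[state].get(text[i])
            if state is None:
                break
            for pat in out[state]:
                matches.append((start, pat))
    return sorted(matches)


# Hypothetical protein fragment and patterns, for illustration only.
sequence = "MKVLAAGIVKLAA"
motifs = ["LAA", "AAG", "KLA", "VK"]
print(ac_search(sequence, motifs))
print(pfac_search(sequence, motifs))
```

Both functions report the same (start position, pattern) pairs. The failureless version does more total work, but every start position is independent of the others, which is the property that makes the approach suitable for massively parallel hardware.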

Faculty Opinions recommendation of Universal trees based on large combined protein sequence data sets.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1002337.23309 ◽

2001 ◽

Author(s):

Christos Ouzounis

Keyword(s):

Protein Sequence ◽

Sequence Data ◽

Data Sets ◽

Protein Sequence Data

A fast hierarchical clustering algorithm for large-scale protein sequence data sets

Computers in Biology and Medicine ◽

10.1016/j.compbiomed.2014.02.016 ◽

2014 ◽

Vol 48 ◽

pp. 94-101 ◽

Cited By ~ 10

Author(s):

Sándor M. Szilágyi ◽

László Szilágyi

Keyword(s):

Hierarchical Clustering ◽

Protein Sequence ◽

Large Scale ◽

Clustering Algorithm ◽

Sequence Data ◽

Data Sets ◽

Protein Sequence Data ◽

Hierarchical Clustering Algorithm

MOLECULAR PHYLOGENY AND STRUCTURE PREDICTION OF RICE RFT1 PROTEIN

Jurnal Teknologi ◽

10.11113/jt.v78.4752 ◽

2016 ◽

Vol 78 (2) ◽

Author(s):

Shahkila Mohd Arif ◽

Abdulrahman Mahmoud Dogara ◽

Nyuk Ling Ma ◽

Mohd Shahir Shamsir Omar ◽

Sepideh Parvizpour ◽

...

Keyword(s):

Structure Prediction ◽

Sequence Data ◽

World Population ◽

Data Sets ◽

Important Species ◽

The Family ◽

Close Relationship ◽

Protein Sequence Data ◽

Flowering Time Gene ◽

Mp Method

Rice is one of the most important species in the family Poaceae. As one of the major crops consumed by the world population, it is cultivated commercially in many parts of the world. Hence, the study of its phylogeny is a crucial step toward improving its breeding programs. Phylogenetic relationships among 12 rice cultivars originating from two common sub-species, Indica and Japonica, were inferred by comparing protein sequence data sets derived from the flowering time gene RFT1, analyzed using the maximum parsimony (MP) method. The predicted structure of the RFT1 protein was generated by the I-TASSER server and analyzed using YASARA software. The results showed that the cultivars were classified into two major groups, where the first group (Japonica) evolved first, followed by the second group (Indica). The findings suggested that some cultivars had a close relationship with each other even though they originate from different varieties. The relationships among these cultivars provide useful information for better understanding the molecular evolution process and for designing good breeding programs to generate new cultivars.

Quantifying and cataloguing unknown sequences within human microbiomes

10.1101/2021.01.22.427751 ◽

2021 ◽

Author(s):

Sejal Modha ◽

David L. Robertson ◽

Joseph Hughes ◽

Richard Orton

Keyword(s):

Dark Matter ◽

Sequence Data ◽

Human Microbiome ◽

Circulatory System ◽

Daily Basis ◽

Data Sets ◽

Biological Sequence ◽

Computational Framework ◽

Viral Genomes ◽

Sequencing Technologies

Abstract: Advances in genome sequencing technologies and lower costs have enabled the exploration of a multitude of known and novel environments and microbiomes. This has led to exponential growth in the raw sequence data deposited in online repositories. Metagenomic and metatranscriptomic data sets are typically analysed with regard to a specific biological question. However, it is widely acknowledged that these data sets contain a proportion of sequences that bear no similarity to any currently known biological sequence, and this so-called ‘dark matter’ is often excluded from downstream analyses. In this study, a systematic framework was developed to assemble, identify, and measure the proportion of unknown sequences present in distinct human microbiomes. This framework was applied to forty distinct studies, comprising 963 samples and covering ten different human microbiomes, including fecal, oral, lung, skin and circulatory system microbiomes. The framework was used to determine the proportion of taxonomically unknown sequences present within samples, and to compare such sequences both within and across assembled metagenomes. We found that, although the human microbiome is one of the most extensively studied, on average 2% of assembled sequences have not yet been taxonomically defined. However, this proportion varied extensively among microbiomes and was as high as 25% for skin and oral microbiomes, which have more interactions with the environment. The publicly available data sets used have not previously been systematically mined to quantify and compare such dark matter. Typically, these unknown sequences are found in several microbiomes and potentially belong to unidentified novel microbes that we interact with on a daily basis. A cross-study comparison led to the identification of similar unknown sequences in different samples and/or microbiomes. From the taxonomically unknown sequences discovered in this study, we calculated a taxonomic characterisation rate of 1.64% of unknown sequences per month. Additionally, the approach led to the discovery of several potentially novel viral genomes that bear no similarity to sequences in the public databases. Both our computational framework and the novel unknown sequences produced are publicly available for future cross-referencing.

Choosing Non-redundant Representative Subsets Of Protein Sequence Data Sets Using Submodular Optimization

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics - BCB '18 ◽

10.1145/3233547.3233717 ◽

2018 ◽

Author(s):

Maxwell W. Libbrecht ◽

Jeffrey A. Bilmes ◽

William Stafford Noble

Keyword(s):

Protein Sequence ◽

Sequence Data ◽

Data Sets ◽

Submodular Optimization ◽

Protein Sequence Data

Universal trees based on large combined protein sequence data sets

Nature Genetics ◽

10.1038/90129 ◽

2001 ◽

Vol 28 (3) ◽

pp. 281-285 ◽

Cited By ~ 244

Author(s):

James R. Brown ◽

Christophe J. Douady ◽

Michael J. Italia ◽

William E. Marshall ◽

Michael J. Stanhope

Keyword(s):

Protein Sequence ◽

Sequence Data ◽

Data Sets ◽

Protein Sequence Data

SeqRepo: A system for managing local collections of biological sequences

PLoS ONE ◽

10.1371/journal.pone.0239883 ◽

2020 ◽

Vol 15 (12) ◽

pp. e0239883

Author(s):

Reece K. Hart ◽

Andreas Prlić

Keyword(s):

Programming Languages ◽

High Performance ◽

Sequence Data ◽

Random Access ◽

Biological Sequences ◽

Biological Sequence ◽

Public And Private ◽

Human Sequence ◽

Local Sequence ◽

Sequence Identifier

Motivation: Access to biological sequence data, such as genome, transcript, or protein sequence, is at the core of many bioinformatics analysis workflows. The National Center for Biotechnology Information (NCBI), Ensembl, and other sequence database maintainers provide methods to access sequences through network connections. For many users, the convenience and currency of remotely managed data are compelling, and the network latency is inconsequential. However, for high-throughput and clinical applications, local sequence collections are essential for performance, stability, privacy, and reproducibility.

Results: Here we describe SeqRepo, a novel system for building a local, high-performance, non-redundant collection of biological sequences. SeqRepo enables clients to use primary database identifiers and several digests to identify sequences and sequence aliases. SeqRepo provides a native Python interface and a REST interface, which can run locally and enables access from other programming languages. SeqRepo also provides an alternative REST interface based on the GA4GH refget protocol. SeqRepo provides fast random access to sequence slices. We provide results that demonstrate that a local SeqRepo sequence collection yields significant performance benefits of up to 1300-fold over remote sequence collections. In our use case for a variant validation and normalization pipeline, SeqRepo improved throughput 50-fold relative to use with remote sequences. SeqRepo may be used with any species or sequence type. Regular snapshots of human sequence collections are available. It is often convenient or necessary to use a computed digest as a sequence identifier. For example, a digest-based identifier may be used to refer to proprietary reference genomes or segments of a graph genome, for which conventional identifiers will not be available. Here we also introduce a convention for the application of the SHA-512 hashing algorithm with Base64 encoding to generate URL-safe identifiers. This convention, sha512t24u, combines a fast digest mechanism with a space-efficient representation that can be used for any object. Our report includes an analysis of timing and collision probabilities for sha512t24u. SeqRepo enables clients to use sha512t24u digests as identifiers, thereby seamlessly integrating public and private sequence sets.

Availability: SeqRepo is released under the Apache License 2.0 and is available on GitHub and PyPI. Docker images and database snapshots are also available. See https://github.com/biocommons/biocommons.seqrepo.
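
As a rough illustration of the digest convention, the sketch below assumes that sha512t24u means: compute the SHA-512 digest, truncate it to the first 24 bytes, and encode the result with URL-safe Base64. The truncation length is read from the name rather than stated in the abstract, and the helper shown is a standalone function written for this example, not SeqRepo's own API.

```python
import base64
import hashlib


def sha512t24u(data: bytes) -> str:
    """Return a URL-safe identifier for `data`.

    Assumption of this sketch: "t24" means the SHA-512 digest is
    truncated to its first 24 bytes, and "u" means URL-safe Base64.
    """
    digest = hashlib.sha512(data).digest()
    truncated = digest[:24]  # 24 bytes = 192 bits
    return base64.urlsafe_b64encode(truncated).decode("ascii")


# Hypothetical usage: derive an identifier from a short sequence.
identifier = sha512t24u(b"ACGT")
print(identifier)
```

Because 24 bytes divide evenly into Base64 groups of three, the identifier is always 32 characters long and carries no `=` padding, which keeps it compact and safe to embed in URLs.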

Partitioning clustering algorithms for protein sequence data sets

BioData Mining ◽

10.1186/1756-0381-2-3 ◽

2009 ◽

Vol 2 (1) ◽

Cited By ~ 9

Author(s):

Sondes Fayech ◽

Nadia Essoussi ◽

Mohamed Limam

Keyword(s):

Protein Sequence ◽

Sequence Data ◽

Clustering Algorithms ◽

Data Sets ◽

Protein Sequence Data

Approximate symbolic pattern matching for protein sequence data

International Journal of Approximate Reasoning ◽

10.1016/s0888-613x(02)00082-8 ◽

2003 ◽

Vol 32 (2-3) ◽

pp. 171-186 ◽

Cited By ~ 5

Author(s):

Bill C.H. Chang ◽

Saman K. Halgamuge

Keyword(s):

Pattern Matching ◽

Protein Sequence ◽

Sequence Data ◽

Protein Sequence Data

Text Filtering through Multi-Pattern Matching: A Case Study of Wu–Manber–Uy on the Language of Uyghur

Information ◽

10.3390/info10080246 ◽

2019 ◽

Vol 10 (8) ◽

pp. 246 ◽

Cited By ~ 1

Author(s):

Turdi Tohti ◽

Jimmy Huang ◽

Askar Hamdulla ◽

Xing Tan

Keyword(s):

Pattern Matching ◽

Language Processing ◽

High Performance ◽

Data Sets ◽

Lexical Meaning ◽

Expansion Strategy ◽

Two Factors ◽

Text Filtering ◽

Stem Deformation ◽

And Control

Given its generality and its high time-efficiency on large data sets, text filtering through pattern matching has in recent years attracted increasing attention from the information retrieval and Natural Language Processing (NLP) research communities. It has yet to be seen, however, how this technique and its algorithms (e.g., Wu–Manber, which is also considered in this paper) can be applied properly and effectively to Uyghur, a low-resource language mostly spoken by the ethnic Uyghur group, with a population of more than eleven million in Xinjiang, China. We observe that, technically, the challenge is mainly caused by two factors: (1) vowel weakening and (2) semantic mismatch between affixes and stems. Accordingly, in this paper we propose Wu–Manber–Uy, a variant of an improvement to Wu–Manber designed specifically for the Uyghur language. Wu–Manber–Uy implements a stem deformation-based pattern expansion strategy, specifically to reduce the pattern mismatches caused by vowel weakening and spelling errors. Wu–Manber–Uy also uses a two-way strategy that applies invigilation and control to changes in the lexical meaning of stems during word-building. Extra consideration with respect to Word2vec and the dictionary is incorporated into the system for processing Uyghur. The experimental results we have obtained consistently demonstrate the high performance of Wu–Manber–Uy.
