High Performance Pattern Matching on Heterogeneous Platform

2014 ◽  
Vol 11 (3) ◽  
pp. 88-98 ◽  
Author(s):  
Shima Soroushnia ◽  
Masoud Daneshtalab ◽  
Juha Plosila ◽  
Tapio Pahikkala ◽  
Pasi Liljeberg

Summary Pattern discovery is one of the fundamental tasks in bioinformatics, and pattern recognition is a powerful technique for searching for sequence patterns in biological sequence databases. Fast, high-performance algorithms are in high demand in many applications in bioinformatics and computational molecular biology, since the significant increase in the number of DNA and protein sequences increases the need to raise the performance of pattern matching algorithms. For this purpose, heterogeneous architectures can be a good choice due to their potential for high performance and energy efficiency. In this paper we present an efficient implementation of Aho-Corasick (AC), a well-known exact pattern matching algorithm with linear complexity, and of the Parallel Failureless Aho-Corasick (PFAC) algorithm, a massively parallelized version of the AC algorithm without failure transitions, on a heterogeneous CPU/GPU architecture. We progressively redesigned the algorithms and data structures to fit the GPU architecture. Our results on different protein sequence data sets show that the new implementation runs 15 times faster than the original implementation of the PFAC algorithm.
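The difference between AC and PFAC can be illustrated with a minimal, CPU-only Python sketch. It is not the authors' CPU/GPU implementation: the function names (`build_trie`, `add_failure_links`, `ac_search`, `pfac_search`) are hypothetical, and the loop over start positions in `pfac_search` only imitates, sequentially, the work that a failureless scheme would hand to one GPU thread per input position.

```python
from collections import deque


def build_trie(patterns):
    """Build the keyword trie shared by both variants (hypothetical helper)."""
    goto, out = [{}], [[]]  # goto[s]: char -> next state; out[s]: patterns ending at s
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    return goto, out


def add_failure_links(goto, out):
    """Compute AC failure links breadth-first; return them with merged outputs."""
    fail = [0] * len(goto)
    merged = [list(o) for o in out]  # copy so the plain trie outputs stay intact
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            merged[t] = merged[t] + merged[fail[t]]  # inherit suffix matches
    return fail, merged


def ac_search(text, patterns):
    """Classic AC: a single pass over the text, following failure links on mismatch."""
    goto, out = build_trie(patterns)
    fail, merged = add_failure_links(goto, out)
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        hits.extend((i - len(p) + 1, p) for p in merged[s])
    return sorted(hits)


def pfac_search(text, patterns):
    """Failureless variant: one independent trie walk per start position."""
    goto, out = build_trie(patterns)
    hits = []
    for start in range(len(text)):  # on a GPU, each start would be its own thread
        s = 0
        for ch in text[start:]:
            if ch not in goto[s]:
                break  # no failure transition: this walk simply terminates
            s = goto[s][ch]
            hits.extend((start, p) for p in out[s])
    return sorted(hits)


if __name__ == "__main__":
    keywords = ["he", "she", "his", "hers"]
    print(ac_search("ushers", keywords))
    print(pfac_search("ushers", keywords))
```

Both functions should report the same (start position, pattern) pairs. The failureless version does more total work but has no dependency between start positions, which is what makes it a natural fit for massively parallel hardware.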

2016 ◽  
Vol 78 (2) ◽  
Author(s):  
Shahkila Mohd Arif ◽  
Abdulrahman Mahmoud Dogara ◽  
Nyuk Ling Ma ◽  
Mohd Shahir Shamsir Omar ◽  
Sepideh Parvizpour ◽  
...  

Rice is one of the most important species in the family Poaceae. As one of the major crops consumed by the world population, it is cultivated commercially in many parts of the world. Hence, the phylogenetic study of this crop is a crucial step toward improving its breeding programs. The phylogenetic relationships among 12 rice cultivars originating from two common sub-species, Indica and Japonica, were inferred by comparing protein sequence data sets derived from the flowering time gene RFT1 and analyzed using the maximum parsimony (MP) method. The predicted structure of the RFT1 protein was generated by the I-TASSER server and analyzed using YASARA software. The results showed that the cultivars were classified into two major groups, where the first group (Japonica) evolved first, followed by the second group (Indica). The findings suggested that some cultivars are closely related to each other even though they originate from different varieties. The relationships among these cultivars provide useful information for better understanding the molecular evolution process and for designing good breeding programs to generate new cultivars.


2021 ◽  
Author(s):  
Sejal Modha ◽  
David L. Robertson ◽  
Joseph Hughes ◽  
Richard Orton

Abstract Advances in genome sequencing technologies and lower costs have enabled the exploration of a multitude of known and novel environments and microbiomes. This has led to an exponential growth in the raw sequence data that is deposited in online repositories. Metagenomic and metatranscriptomic data sets are typically analysed with regard to a specific biological question. However, it is widely acknowledged that these data sets include a proportion of sequences that bear no similarity to any currently known biological sequence, and this so-called ‘dark matter’ is often excluded from downstream analyses. In this study, a systematic framework was developed to assemble, identify, and measure the proportion of unknown sequences present in distinct human microbiomes. This framework was applied to forty distinct studies, comprising 963 samples, and covering ten different human microbiomes including fecal, oral, lung, skin and circulatory system microbiomes. The framework was used to determine the proportion of taxonomically unknown sequences present within samples, and to compare such sequences both within and across assembled metagenomes. We found that whilst the human microbiome is one of the most extensively studied, on average 2% of assembled sequences have not yet been taxonomically defined. However, this proportion varied extensively among different microbiomes and was as high as 25% for skin and oral microbiomes that have more interactions with the environment. The publicly available data sets used have not previously been systematically mined to quantify and compare such dark matter. Typically, these unknown sequences are found in several microbiomes and potentially belong to unidentified novel microbes that we interact with on a daily basis. A cross-study comparison led to the identification of similar unknown sequences in different samples and/or microbiomes. From the taxonomically unknown sequences discovered in this study, a taxonomic characterisation rate of 1.64% of unknown sequences per month was calculated. Additionally, the approach led to the discovery of several potentially novel viral genomes that bear no similarity to sequences in the public databases. Both our computational framework and the novel unknown sequences produced are publicly available for future cross-referencing.


10.1038/90129 ◽  
2001 ◽  
Vol 28 (3) ◽  
pp. 281-285 ◽  
Author(s):  
James R. Brown ◽  
Christophe J. Douady ◽  
Michael J. Italia ◽  
William E. Marshall ◽  
Michael J. Stanhope

PLoS ONE ◽  
2020 ◽  
Vol 15 (12) ◽  
pp. e0239883
Author(s):  
Reece K. Hart ◽  
Andreas Prlić

Motivation Access to biological sequence data, such as genome, transcript, or protein sequence, is at the core of many bioinformatics analysis workflows. The National Center for Biotechnology Information (NCBI), Ensembl, and other sequence database maintainers provide methods to access sequences through network connections. For many users, the convenience and currency of remotely managed data are compelling, and the network latency is inconsequential. However, for high-throughput and clinical applications, local sequence collections are essential for performance, stability, privacy, and reproducibility. Results Here we describe SeqRepo, a novel system for building a local, high-performance, non-redundant collection of biological sequences. SeqRepo enables clients to use primary database identifiers and several digests to identify sequences and sequence aliases. SeqRepo provides a native Python interface and a REST interface, which can run locally and enable access from other programming languages. SeqRepo also provides an alternative REST interface based on the GA4GH refget protocol. SeqRepo provides fast random access to sequence slices. We provide results that demonstrate that a local SeqRepo sequence collection yields significant performance benefits of up to 1300-fold over remote sequence collections. In our use case for a variant validation and normalization pipeline, SeqRepo improved throughput 50-fold relative to use with remote sequences. SeqRepo may be used with any species or sequence type. Regular snapshots of human sequence collections are available. It is often convenient or necessary to use a computed digest as a sequence identifier. For example, a digest-based identifier may be used to refer to proprietary reference genomes or segments of a graph genome, for which conventional identifiers will not be available. Here we also introduce a convention for the application of the SHA-512 hashing algorithm with Base64 encoding to generate URL-safe identifiers. This convention, sha512t24u, combines a fast digest mechanism with a space-efficient representation that can be used for any object. Our report includes an analysis of timing and collision probabilities for sha512t24u. SeqRepo enables clients to use sha512t24u as identifiers, thereby seamlessly integrating public and private sequence sets. Availability SeqRepo is released under the Apache License 2.0 and is available on GitHub and PyPI. Docker images and database snapshots are also available. See https://github.com/biocommons/biocommons.seqrepo.
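The sha512t24u convention can be sketched in a few lines of Python using only the standard library. The abstract does not spell out the steps, so the sketch assumes the usual reading of the name: a SHA-512 digest, truncated ("t") to its first 24 bytes, then encoded with the URL-safe ("u") Base64 alphabet. The function below is a standalone illustration and does not use the SeqRepo package.

```python
import base64
import hashlib


def sha512t24u(blob: bytes) -> str:
    """Return a 32-character, URL-safe digest string for the given bytes."""
    digest = hashlib.sha512(blob).digest()  # full 64-byte SHA-512 digest
    truncated = digest[:24]                 # assumption: keep the first 24 bytes
    # 24 bytes encode to exactly 32 Base64 characters, so no '=' padding appears.
    return base64.urlsafe_b64encode(truncated).decode("ascii")


if __name__ == "__main__":
    # Digest of a short, made-up nucleotide sequence used purely as an example.
    print(sha512t24u(b"ACGT"))
```

Because 24 is a multiple of 3, the encoded string never needs padding, and the URL-safe alphabet replaces `+` and `/` with `-` and `_`, so the identifier can be placed in a URL path without escaping.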


2009 ◽  
Vol 2 (1) ◽  
Author(s):  
Sondes Fayech ◽  
Nadia Essoussi ◽  
Mohamed Limam

Information ◽  
2019 ◽  
Vol 10 (8) ◽  
pp. 246 ◽  
Author(s):  
Turdi Tohti ◽  
Jimmy Huang ◽  
Askar Hamdulla ◽  
Xing Tan

Given its generality in applications and its high time-efficiency on large data sets, the technique of text filtering through pattern matching has in recent years attracted increasing attention from the information retrieval and Natural Language Processing (NLP) research communities at large. However, it has yet to be seen how this technique and its algorithms (e.g., Wu–Manber, which is also considered in this paper) can be applied and adapted properly and effectively to Uyghur, a low-resource language that is mostly spoken by the ethnic Uyghur group, with a population of more than eleven million in Xinjiang, China. We observe that, technically, the challenge is mainly caused by two factors: (1) vowel weakening and (2) semantic mismatch between affixes and stems. Accordingly, in this paper, we propose Wu–Manber–Uy, a variant of an improvement to Wu–Manber designed specifically for the Uyghur language. Wu–Manber–Uy implements a stem deformation-based pattern expansion strategy, specifically for reducing the mismatching of patterns caused by vowel weakening and spelling errors. Wu–Manber–Uy also uses a two-way strategy that monitors and controls changes in the lexical meaning of stems during word-building. Additional considerations with respect to Word2vec and the dictionary are incorporated into the system for processing Uyghur. The experimental results we have obtained consistently demonstrate the high performance of Wu–Manber–Uy.

