Frequent Patterns Algorithm of Biological Sequences based on Pattern Prefix-tree

2019 ◽  
Vol 14 (4) ◽  
pp. 574-589
Author(s):  
Linyan Xue ◽  
Xiaoke Zhang ◽  
Fei Xie ◽  
Shuang Liu ◽  
Peng Lin

In bioinformatics applications, existing algorithms cannot directly and efficiently perform sequence pattern mining. This paper proposes two fast and efficient pattern mining algorithms, one for a single biological sequence and one for multiple sequences. The concept of the basic pattern is introduced: frequent basic patterns are mined first, and frequent patterns are then mined by constructing prefix trees over the frequent basic patterns. The proposed algorithms thus mine frequent patterns of biological sequences rapidly on the basis of pattern prefix trees. In the experiments, family sequence data from the Pfam protein database are used to verify the performance of the proposed algorithms. The results confirm that the proposed algorithms not only obtain mining results with biological significance, but also improve the running time of biological sequence pattern mining.
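
The abstract does not specify how basic patterns or the prefix trees are defined, so the following is only a minimal sketch of the general idea and not the authors' algorithm. It assumes, purely for illustration, that a "basic pattern" is a contiguous substring of fixed length k and that support is the number of sequences containing it; the names `frequent_kmers`, `build_prefix_tree`, and `patterns_with_prefix` are hypothetical.

```python
from collections import defaultdict


def frequent_kmers(sequences, k, min_support):
    """Return {k-mer: support}; support = number of sequences containing the k-mer.

    Assumption: a "basic pattern" is modelled here as a contiguous k-mer.
    """
    support = defaultdict(int)
    for seq in sequences:
        # Use a set so each sequence contributes at most once per k-mer.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        for kmer in seen:
            support[kmer] += 1
    return {p: s for p, s in support.items() if s >= min_support}


class TrieNode:
    def __init__(self):
        self.children = {}
        self.support = None  # set only on nodes that end a stored pattern


def build_prefix_tree(patterns):
    """Insert every frequent pattern into a prefix tree (trie)."""
    root = TrieNode()
    for pattern, sup in patterns.items():
        node = root
        for symbol in pattern:
            node = node.children.setdefault(symbol, TrieNode())
        node.support = sup
    return root


def patterns_with_prefix(root, prefix):
    """Return sorted (pattern, support) pairs stored under the given prefix."""
    node = root
    for symbol in prefix:
        node = node.children.get(symbol)
        if node is None:
            return []
    found, stack = [], [(node, prefix)]
    while stack:
        current, text = stack.pop()
        if current.support is not None:
            found.append((text, current.support))
        for symbol, child in current.children.items():
            stack.append((child, text + symbol))
    return sorted(found)
```

Sharing prefixes in the tree means patterns that begin with the same symbols are stored once along a common path, which is the usual reason for choosing a prefix tree when many frequent patterns overlap.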

2021 ◽  
Vol 169 ◽  
pp. 114530
Author(s):  
Areej Ahmad Abdelaal ◽  
Sa'ed Abed ◽  
Mohammad Al-Shayeji ◽  
Mohammad Allaho

Electronics ◽  
2021 ◽  
Vol 10 (12) ◽  
pp. 1478
Author(s):  
Penugonda Ravikumar ◽  
Palla Likhitha ◽  
Bathala Venus Vikranth Raj ◽  
Rage Uday Kiran ◽  
Yutaka Watanobe ◽  
...  

Discovering periodic-frequent patterns in temporal databases is a challenging problem of great importance in many real-world applications. Although several algorithms have been described in the literature for periodic-frequent pattern mining, most of them use the traditional horizontal (or row) database layout and, as a consequence, either need to scan the database several times or do not allow asynchronous computation of periodic-frequent patterns. This kind of database layout therefore makes the algorithms for discovering periodic-frequent patterns inefficient in both time and memory. Mining data stored in a vertical (or columnar) database layout is also important, because real-world big data is widely stored in columnar layouts. With this motivation, this paper proposes an efficient algorithm, Periodic Frequent-Equivalence CLass Transformation (PF-ECLAT), to find periodic-frequent patterns in a columnar temporal database. Experimental results on sparse and dense real-world and synthetic databases demonstrate that PF-ECLAT is memory and runtime efficient and highly scalable. Finally, we demonstrate the usefulness of PF-ECLAT with two case studies. In the first, we employ the algorithm to identify the geographical areas in Japan in which people were periodically exposed to harmful levels of air pollution. In the second, we use the algorithm to discover the set of road segments in a transportation network in which congestion was regularly observed.
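
To make the vertical-layout idea concrete, here is a minimal sketch and not the PF-ECLAT implementation. It assumes that each row of the temporal database is a `(timestamp, items)` pair, that a pattern's periodicity is the largest gap between consecutive occurrences (counting the gap from an assumed start time of 0 and the gap up to the last timestamp in the database), and that only 1- and 2-itemsets are mined. All function names are hypothetical.

```python
from itertools import combinations


def to_vertical(temporal_db):
    """Turn [(timestamp, items)] rows into {item: sorted timestamp list}."""
    vertical = {}
    for ts, items in temporal_db:
        for item in set(items):
            vertical.setdefault(item, []).append(ts)
    return {item: sorted(ts_list) for item, ts_list in vertical.items()}


def max_periodicity(ts_list, ts_start, ts_end):
    """Largest gap between consecutive occurrences, including both boundaries."""
    points = [ts_start] + ts_list + [ts_end]
    return max(b - a for a, b in zip(points, points[1:]))


def mine_singles_and_pairs(temporal_db, min_sup, max_per):
    """Return {itemset tuple: (support, periodicity)} for 1- and 2-itemsets."""
    vertical = to_vertical(temporal_db)
    ts_end = max(ts for ts, _ in temporal_db)
    result, kept = {}, {}
    for item in sorted(vertical):
        ts_list = vertical[item]
        per = max_periodicity(ts_list, 0, ts_end)  # assumption: time starts at 0
        if len(ts_list) >= min_sup and per <= max_per:
            result[(item,)] = (len(ts_list), per)
            kept[item] = ts_list
    for x, y in combinations(sorted(kept), 2):
        # In a vertical layout, the pair's occurrences are a list intersection.
        shared = sorted(set(kept[x]) & set(kept[y]))
        if len(shared) >= min_sup:
            per = max_periodicity(shared, 0, ts_end)
            if per <= max_per:
                result[(x, y)] = (len(shared), per)
    return result
```

Because a larger itemset can only occur at a subset of the timestamps of its parts, its support cannot grow and its periodicity cannot shrink, so items that fail either threshold can be dropped before pairs are formed. An ECLAT-style miner would continue the same intersection step recursively for longer itemsets.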


Author(s):  
Mengling Feng ◽  
Jinyan Li ◽  
Guozhu Dong ◽  
Limsoon Wong

This chapter surveys the maintenance of frequent patterns in transaction datasets. It is written to be accessible to researchers familiar with the field of frequent pattern mining. The frequent pattern maintenance problem is summarized with a study on how the space of frequent patterns evolves in response to data updates. This chapter focuses on incremental and decremental maintenance. Four major types of maintenance algorithms are studied: Apriori-based, partition-based, prefix-tree-based, and concise-representation-based algorithms. The authors study the advantages and limitations of these algorithms from both the theoretical and experimental perspectives. Possible solutions to certain limitations are also proposed. In addition, some potential research opportunities and emerging trends in frequent pattern maintenance are also discussed.


2012 ◽  
Vol 433-440 ◽  
pp. 4457-4462 ◽  
Author(s):  
Jun Shan Tan ◽  
Zhu Fang Kuang ◽  
Guo Gui Yang

The design of the synopsis structure is an important issue in mining frequent patterns over data streams. This paper proposes FPD-Graph, a data stream synopsis structure based on a directed graph. The FPD-Graph contains a list head node, FPDG-Head, and a list node, FPDG-Node, and supports insert and delete operations. The paper also proposes DGFPM, a frequent pattern mining algorithm based on a sliding window over the data stream. Customer shopping data produced by the IBM synthetic data generator are used as experimental data. The DGFPM algorithm not only mines frequent patterns with high precision, but also has a low processing time.
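
The abstract gives no details of FPD-Graph or DGFPM, so the sketch below does not reproduce either. It only illustrates the sliding-window mechanics the abstract mentions, namely an insert for each arriving transaction and a delete for each expired one, using a plain item counter in place of the graph synopsis. The class `SlidingWindowCounter` and its methods are hypothetical, and it tracks single items rather than full patterns.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Keep per-item counts for the most recent `window_size` transactions."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()
        self.counts = Counter()

    def insert(self, transaction):
        """Add a new transaction; evict the oldest one if the window is full."""
        items = frozenset(transaction)
        self.window.append(items)
        self.counts.update(items)
        if len(self.window) > self.window_size:
            self._delete_oldest()

    def _delete_oldest(self):
        """Remove the expired transaction's contribution from the counts."""
        expired = self.window.popleft()
        for item in expired:
            self.counts[item] -= 1
            if self.counts[item] == 0:
                del self.counts[item]

    def frequent_items(self, min_support):
        """Items occurring in at least `min_support` transactions of the window."""
        return sorted(i for i, c in self.counts.items() if c >= min_support)
```

Updating the summary incrementally on insert and delete avoids rescanning the whole window for every new transaction, which is the general motivation for keeping a synopsis structure over a stream.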


2017 ◽  
Vol 10 (13) ◽  
pp. 191
Author(s):  
Nikhil Jamdar ◽  
A Vijayalakshmi

Data mining offers many algorithms for finding interesting patterns in transactional databases of precise data. Frequent pattern mining is a technique for finding frequently occurring items. Most techniques find all interesting patterns in a collection of precise data, where the items occurring in each transaction are known to the system with certainty. In many real-time applications, however, users are interested in only a tiny portion of a large set of frequent patterns. The proposed user-constrained mining approach helps find the frequent patterns in which the user is interested. It efficiently finds these patterns by applying user constraints to collections of uncertain data. Users specify their interests in the form of constraints, and the approach uses the MapReduce model to find the uncertain frequent patterns that satisfy the user-specified constraints.


2014 ◽  
Vol 37 ◽  
pp. 109-116 ◽  
Author(s):  
Shamila Nasreen ◽  
Muhammad Awais Azam ◽  
Khurram Shehzad ◽  
Usman Naeem ◽  
Mustansar Ali Ghazanfar

2020 ◽  
Author(s):  
Eli N. Weinstein ◽  
Debora S. Marks

Large-scale sequencing has revealed extraordinary diversity among biological sequences, produced over the course of evolution and within the lifetime of individual organisms. Existing methods for building statistical models of sequences often pre-process the data using multiple sequence alignment, an unreliable approach for many genetic elements (antibodies, disordered proteins, etc.) that is subject to fundamental statistical pathologies. Here we introduce a structured emission distribution (the MuE distribution) that accounts for mutational variability (substitutions and indels) and use it to construct generative and predictive hierarchical Bayesian models (H-MuE models). Our framework enables the application of arbitrary continuous-space vector models (e.g. linear regression, factor models, image neural-networks) to unaligned sequence data. Theoretically, we show that the MuE generalizes classic probabilistic alignment models. Empirically, we show that H-MuE models can infer latent representations and features for immune repertoires, predict functional unobserved members of disordered protein families, and forecast the future evolution of pathogens.


2018 ◽  
Author(s):  
Weina Li ◽  
Jiadong Ren

Extracting meaningful patterns from biological sequence data is an important approach to discovering the regulatory rules of genes and proteins and their inheritance relationships. Existing sequence pattern discovery algorithms, such as MSPM and FBSB, suffer from low efficiency and accuracy. To address this issue, this paper presents MpBsmi, a new algorithm for biological sequence pattern mining based on a data index structure. MpBsmi employs a sequence position table (ST) and a sequence database index structure (DB-Index) for data storage, mining, and pattern expansion. The ST and DB-Index of single items are first obtained by scanning the sequence database once. A new fast support counting algorithm then mines the ST to identify the frequent single items. Using a recursive connection strategy, the frequent patterns are expanded and the expanded ST is updated by scanning the DB-Index; the fast support counting algorithm is then used to obtain the frequent expanded patterns. Finally, a new pruning technique for extended patterns avoids generating an unnecessarily large number of candidate patterns. Experimental results on multiple classical protein sequences from the Pfam database validate the accuracy, stability, and scalability of the proposed algorithm. The results show that the proposed algorithm achieves better space efficiency, stability, and scalability than MSPM and FBSB, the two main algorithms for biological sequence mining.
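
The abstract does not define the ST or DB-Index layouts, so the following is a minimal sketch of index-based pattern growth in general and not MpBsmi itself. It assumes contiguous patterns, records each occurrence as a `(sequence id, end position)` pair found in a single scan, and counts support as the number of distinct sequences containing the pattern. The function names are hypothetical.

```python
from collections import defaultdict


def build_position_index(sequences):
    """One scan of the database: {symbol: [(seq_id, position), ...]}."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for pos, symbol in enumerate(seq):
            index[symbol].append((seq_id, pos))
    return index


def support(occurrences):
    """Number of distinct sequences in which the pattern occurs."""
    return len({seq_id for seq_id, _ in occurrences})


def extend(occurrences, symbol, sequences):
    """Keep the occurrences that are immediately followed by `symbol`."""
    extended = []
    for seq_id, end in occurrences:
        nxt = end + 1
        if nxt < len(sequences[seq_id]) and sequences[seq_id][nxt] == symbol:
            extended.append((seq_id, nxt))
    return extended


def mine_contiguous_patterns(sequences, min_support):
    """Grow frequent contiguous patterns one symbol at a time."""
    index = build_position_index(sequences)
    frontier = [(s, occ) for s, occ in sorted(index.items())
                if support(occ) >= min_support]
    alphabet = [s for s, _ in frontier]  # infrequent symbols are pruned here
    frequent = {}
    while frontier:
        pattern, occ = frontier.pop()
        frequent[pattern] = support(occ)
        for symbol in alphabet:
            new_occ = extend(occ, symbol, sequences)
            if support(new_occ) >= min_support:
                frontier.append((pattern + symbol, new_occ))
    return frequent
```

Carrying the occurrence positions along with each pattern means an extension only inspects the symbol after each known occurrence, so support can be counted without rescanning the sequences from the start.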


Author(s):  
Ashesh Nandy

The exponential growth in the depositories of biological sequence data has generated an urgent need to store, retrieve, and analyse the data efficiently and effectively, for which the standard practice of using alignment procedures is not adequate because of its high demand on computing resources and time. Graphical representation of sequences has become one of the most popular alignment-free strategies for analysing biological sequences: each basic unit of a sequence (the bases adenine, cytosine, guanine and thymine for DNA/RNA, and the 20 amino acids for proteins) is plotted on a multi-dimensional grid. The resulting curve in 2D and 3D space, and the implied graph in higher dimensions, convey the underlying information of the sequences through visual inspection; numerical analyses of the plots, in geometrical or matrix terms, provide a measure of comparison between sequences and thus enable the study of sequence hierarchies. The approach has also enabled comparisons of DNA sequences over many thousands of bases and provided new insights into the structure of the base compositions of DNA sequences. In this article we briefly review the origins and applications of graphical representations and highlight future perspectives in this field.
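
As a small illustration of plotting bases on a grid, the sketch below walks a DNA sequence through the 2D plane, one unit step per base. The axis assignment (A and G along the x axis, C and T along the y axis) is an assumption chosen for the example; published representations differ in which base maps to which direction. The centroid-distance descriptor is likewise just one simple numerical summary, and the names `STEPS`, `dna_walk`, and `centroid_distance` are hypothetical.

```python
import math

# Assumed axis assignment, for illustration only.
STEPS = {"A": (-1, 0), "G": (1, 0), "C": (0, 1), "T": (0, -1)}


def dna_walk(sequence):
    """Return the list of grid points visited, starting at the origin."""
    x, y = 0, 0
    points = [(x, y)]
    for base in sequence.upper():
        dx, dy = STEPS[base]  # raises KeyError for symbols outside A, C, G, T
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def centroid_distance(points):
    """Distance from the origin to the centroid of the visited points.

    The starting point at the origin is excluded from the average.
    """
    visited = points[1:]
    mean_x = sum(p[0] for p in visited) / len(visited)
    mean_y = sum(p[1] for p in visited) / len(visited)
    return math.hypot(mean_x, mean_y)
```

Two sequences can then be compared through such descriptors instead of through an alignment, which is what makes the approach alignment-free.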

