sequence clustering Latest Research Papers

In the era of big data, next-generation sequencing produces large amounts of genomic data, and these genetic sequence data will further advance research in biology. However, the growth in data scale often raises privacy issues. Even if the data are not public, an attacker can still steal private information through a membership inference attack. In this paper, we propose a private profile hidden Markov model (PHMM) with differential identifiability for gene sequence clustering. Adding random noise to the model limits the probability of identifying individuals in the database. The gene sequences can then be clustered in an unsupervised manner, without labels, according to the output scores of the private PHMM. The variation of the divergence distance in the experimental results shows that adding noise distorts the profile hidden Markov model to a certain extent, and that the maximum divergence distance can reach 15.47 when the amount of data is small. In addition, a cosine similarity comparison of the clustering model before and after adding noise shows that, as the privacy parameter changes, the clustering model is distorted to a lower or higher degree, which allows it to defend against membership inference attacks.

EdClust: A heuristic sequence clustering method with higher sensitivity

Journal of Bioinformatics and Computational Biology ◽ 10.1142/s0219720021500360 ◽ 2021

Author(s): Ming Cao ◽ Qinke Peng ◽ Ze-Gang Wei ◽ Fei Liu ◽ Yi-Fan Hou

Keyword(s): Large Scale ◽ Sequence Data ◽ Clustering Algorithms ◽ Clustering Methods ◽ Sequencing Data ◽ Clustering Method ◽ Cluster Number ◽ Sequence Clustering ◽ Downstream Analysis ◽ Heuristic Clustering

The development of high-throughput technologies has produced increasing amounts of sequence data and an increasing need for efficient clustering algorithms that can process massive volumes of sequencing data for downstream analysis. Heuristic clustering methods are widely applied to sequence clustering because of their low computational complexity. Although numerous heuristic clustering methods have been developed, they suffer from two limitations: overestimation of the number of inferred clusters and low clustering sensitivity. To address these issues, we present a new sequence clustering method (edClust) that groups similar sequences using Edlib, a C/C++ library for fast, exact semi-global sequence alignment. edClust was tested on three large-scale sequence databases and compared with several classic heuristic clustering methods, such as UCLUST, CD-HIT, and VSEARCH. Evaluations based on the metrics of cluster number and seed sensitivity (SS) demonstrate that edClust produces fewer clusters than the other methods and that its SS is higher. The source code of edClust is available from https://github.com/zhang134/EdClust.git under the GNU GPL license.
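To make the idea of heuristic, seed-based clustering concrete, the following is a minimal sketch of a generic greedy scheme, not edClust's actual implementation. It assumes a plain global edit distance written in pure Python as a stand-in for Edlib's semi-global alignment, and the names `edit_distance`, `greedy_cluster`, and the `identity` threshold are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def greedy_cluster(seqs, identity=0.9):
    """Assign each sequence to the first seed it is similar enough to.

    Assumption: identity is approximated as 1 - distance / longer length.
    Sequences are processed longest first, so seeds tend to be long.
    """
    clusters = []  # list of (seed, members) pairs
    for s in sorted(seqs, key=len, reverse=True):
        for seed, members in clusters:
            longer = max(len(seed), len(s))
            ident = 1.0 if longer == 0 else 1.0 - edit_distance(seed, s) / longer
            if ident >= identity:
                members.append(s)
                break
        else:
            # No existing seed was close enough: s starts a new cluster.
            clusters.append((s, [s]))
    return clusters


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT"]
    for seed, members in greedy_cluster(reads, identity=0.8):
        print(seed, members)
```

Because each sequence joins the first matching seed rather than the best one, the result depends on input order; that trade of accuracy for speed is what makes such methods heuristic.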

Scalable Sequence Clustering for Large-Scale Immune Repertoire Analysis

10.1109/bigdata52589.2021.9671320 ◽ 2021

Author(s): Prem Bhusal ◽ A K M Mubashwir Alam ◽ Keke Chen ◽ Ning Jiang ◽ Jun Xiao

Keyword(s): Large Scale ◽ Immune Repertoire ◽ Sequence Clustering ◽ Repertoire Analysis ◽ Immune Repertoire Analysis

Dynamic order Markov model for categorical sequence clustering

Journal of Big Data ◽ 10.1186/s40537-021-00547-2 ◽ 2021 ◽ Vol 8 (1)

Author(s): Rongbo Chen ◽ Haojun Sun ◽ Lifei Chen ◽ Jianfei Zhang ◽ Shengrui Wang

Keyword(s): Markov Model ◽ Clustering Algorithm ◽ Markov Models ◽ The State ◽ Sequential Data ◽ Implicit Assumption ◽ Sequence Clustering ◽ Clustering And Classification ◽ Real World Datasets ◽ Sparse Pattern

Markov models are extensively used for categorical sequence clustering and classification because of their inherent ability to capture complex chronological dependencies hidden in sequential data. Existing Markov models rest on the implicit assumption that the probability of the next state depends on a preceding context (pattern) that consists of consecutive states. This restriction hampers the models: some patterns, disrupted by noise, may not be frequent enough in consecutive form but are frequent in sparse form, so the models cannot make use of the information hidden in the sequential data. A sparse pattern is a pattern in which one or more of the states between the first and the last are replaced by wildcards that can be matched by a subset of values in the state set. In this paper, we propose a new model that generalizes the conventional Markov approach, making it capable of dealing with sparse patterns and of handling their length adaptively, i.e., allowing variable-length patterns with a variable number of wildcards. The model, named Dynamic order Markov model (DOMM), yields a new similarity measure between a sequence and a set of sequences (a cluster). DOMM builds a sparse pattern from sub-frequent patterns that contain significant statistical information veiled by noise. To implement DOMM, we propose a sparse pattern detector (SPD), based on the probability suffix tree (PST), that is capable of discovering both sparse and consecutive patterns, and we then develop a divisive clustering algorithm named DMSC (Dynamic order Markov model for categorical sequence clustering). Experimental results on real-world datasets demonstrate the promising performance of the proposed model.
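As an illustration of what a sparse pattern with wildcards means in practice, the sketch below estimates the distribution of the next state given a context that may contain wildcards. It is a simplified toy, not the DOMM, SPD, or PST construction from the paper; the names `WILDCARD`, `matches`, and `next_state_distribution` are hypothetical, and a wildcard is modelled either as "any state" or as a `frozenset` of allowed states.

```python
from collections import Counter

WILDCARD = None  # hypothetical marker: matches any single state


def matches(pattern, window):
    """True if every pattern element accepts the state at the same position.

    A pattern element is WILDCARD (any state), a frozenset of allowed
    states (restricted wildcard), or a single concrete state.
    """
    if len(pattern) != len(window):
        return False
    for p, w in zip(pattern, window):
        if p is WILDCARD:
            continue
        if isinstance(p, frozenset):
            if w not in p:
                return False
        elif p != w:
            return False
    return True


def next_state_distribution(sequence, pattern):
    """Relative frequency of the state that follows each match of pattern."""
    k = len(pattern)
    counts = Counter()
    # Stop one position early so that a following state always exists.
    for i in range(len(sequence) - k):
        if matches(pattern, sequence[i:i + k]):
            counts[sequence[i + k]] += 1
    total = sum(counts.values())
    return {state: c / total for state, c in counts.items()} if total else {}


if __name__ == "__main__":
    seq = "abcdaxcdabce"
    print(next_state_distribution(seq, ("a", "b", "c")))       # consecutive context
    print(next_state_distribution(seq, ("a", WILDCARD, "c")))  # sparse context
```

In the example sequence, the consecutive context `a b c` occurs twice, while the sparse context `a * c` also picks up the noisy occurrence `a x c`, so it has more evidence for estimating the next state.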

MapReduce paradigm: DNA sequence clustering based on repeats as features

Expert Systems ◽ 10.1111/exsy.12827 ◽ 2021

Author(s): Chandra Mohan Dasari ◽ Raju Bhukya

Keyword(s): DNA Sequence ◽ Sequence Clustering ◽ MapReduce Paradigm

Sequence Clustering in Process Mining for Business Process Analysis Using K-Means

MIND Journal ◽ 10.26760/mindjournal.v6i1.16-30 ◽ 2021 ◽ Vol 6 (1) ◽ pp. 16-30

Author(s): NUR FITRIANTI FAHRUDIN

Keyword(s): Business Process ◽ Process Mining ◽ Process Analysis ◽ Sequence Clustering ◽ Business Process Analysis

Exploring Learning Strategies by Sequence Clustering and Analysing their Correlation with Student's Engagement and Learning Outcome

2021 International Conference on Advanced Learning Technologies (ICALT) ◽ 10.1109/icalt52272.2021.00115 ◽ 2021

Author(s): Jim B.J. Huang ◽ Anna Y.Q. Huang ◽ Owen H.T. Lu ◽ Stephen J.H. Yang

Keyword(s): Learning Strategies ◽ Learning Outcome ◽ Sequence Clustering

Sequence Clustering Techniques in Educational Data Mining

Handbook of Research on Modern Educational Technologies, Applications, and Management ◽ 10.4018/978-1-7998-3476-2.ch005 ◽ 2021 ◽ pp. 68-84

Author(s): Qi Guo ◽ Ying Cui ◽ Jacqueline P. Leighton ◽ Man-Wai Chu

Keyword(s): Digital Technology ◽ Educational Data Mining ◽ Learning Program ◽ Cross Sectional ◽ Feature Clustering ◽ Clustering Techniques ◽ Advantages And Disadvantages ◽ Sequence Clustering ◽ Log Files ◽ Log File

Digital technology has a profound impact on modern education. It not only greatly improves access to quality education, but can also automatically save all the interactions between students and computers in log files. Clustering these log files can help researchers better understand students and improve the learning program. One challenge associated with log file clustering is that log files are sequential in nature, whereas traditional cluster analysis techniques are designed for cross-sectional data. To overcome this problem, several sequence clustering techniques have recently been proposed. They fall into three major categories: Markov chain clustering, sequence distance clustering, and sequence feature clustering. The purpose of this chapter is to introduce these sequence clustering techniques and discuss their potential advantages and disadvantages.
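One way to picture the gap between sequential log files and cross-sectional techniques is to turn each log into a fixed-length feature vector, which is roughly the idea behind sequence feature clustering. The sketch below is an assumption-laden illustration rather than a method taken from the chapter: it uses normalized counts of consecutive action pairs as features, and the function name `transition_features` and the action labels are hypothetical.

```python
from collections import Counter
from itertools import product


def transition_features(actions, alphabet):
    """Map a sequence of logged actions to a fixed-length vector.

    Each entry is the share of consecutive (previous, next) action pairs
    of one type, listed in the order given by product(alphabet, repeat=2).
    """
    pairs = Counter(zip(actions, actions[1:]))
    total = sum(pairs.values())
    return [
        pairs[(a, b)] / total if total else 0.0
        for a, b in product(alphabet, repeat=2)
    ]


if __name__ == "__main__":
    alphabet = ["read", "quiz"]
    log = ["read", "read", "quiz", "read"]
    # Vector order: read->read, read->quiz, quiz->read, quiz->quiz
    print(transition_features(log, alphabet))
```

Once every student's log is a vector of the same length, any standard cross-sectional clustering method can be applied to the vectors, at the cost of discarding ordering information beyond adjacent pairs.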

An Efficient Greedy Incremental Sequence Clustering Algorithm

10.1007/978-3-030-91415-8_50 ◽ 2021 ◽ pp. 596-607

Author(s): Zhen Ju ◽ Huiling Zhang ◽ Jingtao Meng ◽ Jingjing Zhang ◽ Xuelei Li ◽ ...

Keyword(s): Clustering Algorithm ◽ Sequence Clustering

sequence clustering
Recently Published Documents

Characterizing visitor engagement behavior at large-scale events: Activity sequence clustering and ranking using GPS tracking data

Gene Sequence Clustering Based on the Profile Hidden Markov Model with Differential Identifiability

EdClust: A heuristic sequence clustering method with higher sensitivity

Scalable Sequence Clustering for Large-Scale Immune Repertoire Analysis

Dynamic order Markov model for categorical sequence clustering

MapReduce paradigm: DNA sequence clustering based on repeats as features

Sequence Clustering in Process Mining for Business Process Analysis Using K-Means

Exploring Learning Strategies by Sequence Clustering and Analysing their Correlation with Student's Engagement and Learning Outcome

Sequence Clustering Techniques in Educational Data Mining

An Efficient Greedy Incremental Sequence Clustering Algorithm
