sequence clustering
Recently Published Documents


TOTAL DOCUMENTS

135
(FIVE YEARS 24)

H-INDEX

17
(FIVE YEARS 2)

2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Xujie Ren ◽  
Tao Shang ◽  
Yatong Jiang ◽  
Jianwei Liu

In the era of big data, next-generation sequencing produces a large amount of genomic data. These genetic sequence data will further advance research in the biological sciences. However, the growth in data scale often leads to privacy issues. Even if the data are not open, an attacker can still steal private information through a membership inference attack. In this paper, we propose a private profile hidden Markov model (PHMM) with differential identifiability for gene sequence clustering. Adding random noise to the model limits the probability of identifying individuals in the database. The gene sequences can then be clustered without labels, in an unsupervised manner, according to the output scores of the private PHMM. The variation of the divergence distance in the experimental results shows that the added noise distorts the profile hidden Markov model to a certain extent, and the maximum divergence distance can reach 15.47 when the amount of data is small. In addition, a cosine similarity comparison of the clustering model before and after adding noise shows that, as the privacy parameter changes, the clustering model is distorted to a lesser or greater degree, which helps it defend against membership inference attacks.
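The idea of perturbing a model and then measuring the distortion with cosine similarity can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not state which noise distribution or scale the authors use, so Laplace noise, the `scale` parameter, and the function names are hypothetical, and a single probability vector stands in for the full PHMM.

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb_distribution(probs, scale, rng):
    """Add noise to each probability, clip to a small positive value, renormalize."""
    noisy = [max(p + laplace_noise(scale, rng), 1e-9) for p in probs]
    total = sum(noisy)
    return [x / total for x in noisy]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    rng = random.Random(0)
    emission = [0.7, 0.1, 0.1, 0.1]  # hypothetical emission probabilities for one state
    for scale in (0.01, 0.1, 0.5):   # hypothetical noise scales
        noisy = perturb_distribution(emission, scale, rng)
        print(scale, round(cosine_similarity(emission, noisy), 3))
```

In this sketch a larger noise scale tends to push the cosine similarity further from 1, which mirrors the qualitative trade-off the abstract describes between privacy and model distortion.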


Author(s):  
Ming Cao ◽  
Qinke Peng ◽  
Ze-Gang Wei ◽  
Fei Liu ◽  
Yi-Fan Hou

The development of high-throughput technologies has produced increasing amounts of sequence data and an increasing need for efficient clustering algorithms that can process massive volumes of sequencing data for downstream analysis. Heuristic clustering methods are widely applied to sequence clustering because of their low computational complexity. Although numerous heuristic clustering methods have been developed, they suffer from two limitations: overestimation of the number of inferred clusters and low clustering sensitivity. To address these issues, we present a new sequence clustering method (edClust) that groups similar sequences using Edlib, a C/C++ library for fast, exact semi-global sequence alignment. edClust was tested on three large-scale sequence databases and compared with several classic heuristic clustering methods, such as UCLUST, CD-HIT, and VSEARCH. Evaluations based on the metrics of cluster number and seed sensitivity (SS) demonstrate that edClust produces fewer clusters than the other methods and achieves a higher SS. The source code of edClust is available from https://github.com/zhang134/EdClust.git under the GNU GPL license.
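Heuristic methods of this family typically work greedily: each sequence either joins the first existing seed it is close enough to or becomes a new seed. The sketch below shows that general pattern only and is not edClust's actual algorithm. It uses a plain global Levenshtein distance written in pure Python rather than Edlib's semi-global alignment, and the `max_dist` threshold and function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def greedy_cluster(seqs, max_dist):
    """Assign each sequence to the first seed within max_dist, else make it a seed."""
    clusters = {}  # seed sequence -> list of member sequences
    # Longest sequences are processed first, so they become the seeds.
    for s in sorted(seqs, key=len, reverse=True):
        for seed in clusters:
            if edit_distance(s, seed) <= max_dist:
                clusters[seed].append(s)
                break
        else:
            clusters[s] = [s]
    return clusters


if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC", "ACGTACG"]
    for seed, members in greedy_cluster(reads, max_dist=1).items():
        print(seed, members)
```

Because a sequence joins the first matching seed rather than the best one, the result depends on input order; that order sensitivity is one reason seed-based heuristics can overestimate the number of clusters.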


Author(s):  
Prem Bhusal ◽  
A K M Mubashwir Alam ◽  
Keke Chen ◽  
Ning Jiang ◽  
Jun Xiao

2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Rongbo Chen ◽  
Haojun Sun ◽  
Lifei Chen ◽  
Jianfei Zhang ◽  
Shengrui Wang

Markov models are extensively used for categorical sequence clustering and classification due to their inherent ability to capture complex chronological dependencies hidden in sequential data. Existing Markov models rest on the implicit assumption that the probability of the next state depends on a preceding context/pattern that consists of consecutive states. This restriction hampers the models: some patterns, disrupted by noise, may not be frequent enough in consecutive form yet be frequent in sparse form, so a model restricted to consecutive patterns cannot make use of the information they carry. A sparse pattern is a pattern in which one or more of the states between the first and last one are replaced by wildcards that can be matched by a subset of values in the state set. In this paper, we propose a new model that generalizes the conventional Markov approach, making it capable of dealing with sparse patterns and handling their length adaptively, i.e. allowing variable-length patterns with a variable number of wildcards. The model, named Dynamic order Markov model (DOMM), allows a new similarity measure to be derived between a sequence and a set of sequences/cluster. DOMM builds a sparse pattern from sub-frequent patterns that contain significant statistical information veiled by the noise. To implement DOMM, we propose a sparse pattern detector (SPD) based on the probability suffix tree (PST) that is capable of discovering both sparse and consecutive patterns, and we then develop a divisive clustering algorithm, named DMSC, for Dynamic order Markov model for categorical sequence clustering. Experimental results on real-world datasets demonstrate the promising performance of the proposed model.
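A small sketch may make the notion of a sparse pattern concrete. It only counts how often a pattern occurs when some positions are wildcards; it does not implement the SPD, the PST, or DOMM itself. The representation (each pattern position is a string of allowed symbols) and the function names are assumptions for illustration.

```python
def matches_at(seq, pattern, start):
    """True if pattern matches seq at start; each pattern slot lists its allowed symbols."""
    if start + len(pattern) > len(seq):
        return False
    return all(seq[start + k] in allowed for k, allowed in enumerate(pattern))


def count_pattern(seqs, pattern):
    """Count every position in every sequence where the pattern matches."""
    return sum(
        matches_at(s, pattern, i)
        for s in seqs
        for i in range(len(s) - len(pattern) + 1)
    )


if __name__ == "__main__":
    data = ["abd", "acd", "axd", "abdacd"]
    consecutive = ["a", "b", "d"]   # every position fixed
    sparse = ["a", "bc", "d"]       # middle position is a wildcard over {b, c}
    print(count_pattern(data, consecutive))
    print(count_pattern(data, sparse))
```

On this toy data the sparse pattern is matched more often than the fully specified one, which is the effect the abstract relies on: a pattern that is too rare in consecutive form can become frequent once a noisy position is relaxed to a wildcard.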


Author(s):  
Qi Guo ◽  
Ying Cui ◽  
Jacqueline P. Leighton ◽  
Man-Wai Chu

Digital technology has a profound impact on modern education. It not only greatly improves access to quality education but can also automatically save all interactions between students and computers in log files. Clustering log files can help researchers better understand students and improve the learning program. One challenge associated with log file clustering is that log files are sequential in nature, whereas traditional cluster analysis techniques are designed for cross-sectional data. To overcome this problem, several sequence clustering techniques have recently been proposed. They fall into three major categories: Markov chain clustering, sequence distance clustering, and sequence feature clustering. The purpose of this chapter is to introduce these sequence clustering techniques and discuss their potential advantages and disadvantages.
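As an illustration of the first category, Markov chain clustering, the following sketch fits a first-order transition matrix to each group of sequences and scores a new sequence by its log-likelihood under each chain. It is a simplified assumption of how such a method can work, not the procedure described in the chapter; the additive smoothing constant `alpha`, the toy action alphabet, and the function names are hypothetical.

```python
import math


def fit_chain(seqs, states, alpha=1.0):
    """Estimate first-order transition probabilities with additive smoothing."""
    counts = {a: {b: alpha for b in states} for a in states}
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    return {
        a: {b: c / sum(row.values()) for b, c in row.items()}
        for a, row in counts.items()
    }


def log_likelihood(seq, chain):
    """Sum of log transition probabilities along the sequence."""
    return sum(math.log(chain[a][b]) for a, b in zip(seq, seq[1:]))


def assign(seq, chains):
    """Return the index of the chain under which seq is most likely."""
    scores = [log_likelihood(seq, c) for c in chains]
    return scores.index(max(scores))


if __name__ == "__main__":
    states = "AB"  # hypothetical log actions, e.g. A = read, B = answer
    repeaters = fit_chain(["AAAA", "AAAB"], states)
    alternators = fit_chain(["ABAB", "BABA"], states)
    for log in ("AAAA", "ABAB"):
        print(log, assign(log, [repeaters, alternators]))
```

A full clustering procedure would alternate between assigning sequences to chains and refitting the chains, in the style of k-means or EM. The other two categories instead start from pairwise distances (for example edit distance) or from fixed-length feature vectors (for example n-gram counts) and then apply a conventional clustering algorithm.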


2021 ◽  
pp. 596-607
Author(s):  
Zhen Ju ◽  
Huiling Zhang ◽  
Jingtao Meng ◽  
Jingjing Zhang ◽  
Xuelei Li ◽  
...  
