A SCALABLE CLUSTERING METHOD FOR CATEGORICAL SEQUENCE DATA

2005 ◽  
Vol 02 (02) ◽  
pp. 167-180
Author(s):  
SEUNG-JOON OH ◽  
JAE-YEARN KIM

Clustering of sequences is relatively less explored but it is becoming increasingly important in data mining applications such as web usage mining and bioinformatics. The web user segmentation problem uses web access log files to partition a set of users into clusters such that users within one cluster are more similar to one another than to the users in other clusters. Similarly, grouping protein sequences that share a similar structure can help to identify sequences with similar functions. However, few clustering algorithms consider sequentiality. In this paper, we study how to cluster sequence datasets. Due to the high computational complexity of hierarchical clustering algorithms for clustering large datasets, a new clustering method is required. Therefore, we propose a new scalable clustering method using sampling and a k-nearest-neighbor method. Using a splice dataset and a synthetic dataset, we show that the quality of clusters generated by our proposed approach is better than that of clusters produced by traditional algorithms.
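The abstract does not spell out how sampling and the k-nearest-neighbor step fit together. One plausible reading, sketched here purely as an illustration, is to cluster a small random sample first and then assign every remaining sequence to the cluster most common among its k most similar sampled sequences. The function names, the `difflib`-based similarity, and the greedy grouping of the sample are all assumptions, not the authors' method.

```python
import random
from collections import Counter
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in similarity in [0, 1]; NOT the measure used in the paper.
    return SequenceMatcher(None, a, b).ratio()


def cluster_sample(sample, threshold):
    # Hypothetical helper: greedy grouping of the sample. A sequence joins the
    # first cluster that holds a member at least `threshold` similar to it.
    labels, clusters = [], []
    for seq in sample:
        for cid, members in enumerate(clusters):
            if any(similarity(seq, m) >= threshold for m in members):
                members.append(seq)
                labels.append(cid)
                break
        else:
            clusters.append([seq])
            labels.append(len(clusters) - 1)
    return labels


def knn_assign(seq, labeled, k):
    # labeled: list of (sequence, cluster_id); majority vote among the
    # k most similar labeled sequences.
    nearest = sorted(labeled, key=lambda p: similarity(seq, p[0]), reverse=True)[:k]
    return Counter(cid for _, cid in nearest).most_common(1)[0][0]


def scalable_cluster(sequences, sample_size, k, threshold, seed=0):
    rng = random.Random(seed)
    sample_idx = sorted(rng.sample(range(len(sequences)), sample_size))
    sample = [sequences[i] for i in sample_idx]
    sample_labels = cluster_sample(sample, threshold)
    labeled = list(zip(sample, sample_labels))

    result = [None] * len(sequences)
    for i, cid in zip(sample_idx, sample_labels):
        result[i] = cid
    # Only the non-sampled sequences go through the cheaper k-NN step.
    for i, seq in enumerate(sequences):
        if result[i] is None:
            result[i] = knn_assign(seq, labeled, k)
    return result
```

The expensive pairwise grouping runs only on the sample, so its cost depends on `sample_size` rather than on the full dataset; each remaining sequence is compared against the sample only.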

Author(s):  
SEUNG-JOON OH ◽  
JAE-YEARN KIM

Recently, there has been enormous growth in the amount of commercial and scientific data, such as protein sequences, retail transactions, and web-logs. Such datasets consist of sequence data that have an inherent sequential nature. However, few existing clustering algorithms consider sequentiality. In this paper, we study how to cluster these sequence datasets. We propose a new similarity measure to compute the similarity between two sequences. In the proposed measure, subsets of a sequence are considered, and the more identical subsets there are, the more similar the two sequences. In addition, we propose a hierarchical clustering algorithm and an efficient method for measuring similarity. Using a splice dataset and synthetic datasets, we show that the quality of clusters generated by our proposed approach is better than that of clusters produced by traditional clustering algorithms.
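The idea that two sequences are more similar the more identical sub-parts they share can be illustrated with a small sketch. The abstract does not define the measure, so the version below is an assumption: it collects all contiguous subsequences up to a hypothetical length limit `max_len` and compares the two sets with a Jaccard ratio.

```python
def subsequences(seq, max_len):
    # All contiguous subsequences of length 1..max_len, as a set of tuples.
    return {
        tuple(seq[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(seq) - n + 1)
    }


def subset_similarity(a, b, max_len=3):
    # Share of subsequences common to both sequences (Jaccard ratio).
    # Illustrative only; not the measure proposed in the paper.
    sa, sb = subsequences(a, max_len), subsequences(b, max_len)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

Because subsequences longer than one item are order-sensitive, "ABC" and "CBA" share only their single items and score well below 1, which is the kind of sequentiality a set-of-items measure would miss.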


2015 ◽  
pp. 125-138 ◽  
Author(s):  
I. V. Goncharenko

In this article we propose a new method of non-hierarchical cluster analysis using a k-nearest-neighbor graph and discuss it with respect to vegetation classification. The method of k-nearest neighbor (k-NN) classification was originally developed in 1951 (Fix, Hodges, 1951). Later, the term “k-NN graph” and a few algorithms for k-NN clustering appeared (Cover, Hart, 1967; Brito et al., 1997). In biology, k-NN is used in the analysis of protein structures and genome sequences. Most k-NN clustering algorithms first build an “excessive” graph, a so-called hypergraph, and then truncate it into subgraphs by partitioning and coarsening the hypergraph. We developed a different strategy, “upward” clustering, which forms (assembles sequentially) one cluster after another. Until now, graph-based cluster analysis has not been considered for the classification of vegetation datasets.


2021 ◽  
Vol 25 (6) ◽  
pp. 1453-1471
Author(s):  
Chunhua Tang ◽  
Han Wang ◽  
Zhiwen Wang ◽  
Xiangkun Zeng ◽  
Huaran Yan ◽  
...  

Most density-based clustering algorithms suffer from difficult parameter setting, high time complexity, poor noise recognition, and weak clustering of datasets with uneven density. To solve these problems, this paper proposes the FOP-OPTICS algorithm (Finding of the Ordering Peaks Based on OPTICS), a substantial improvement of OPTICS (Ordering Points To Identify the Clustering Structure). The proposed algorithm finds the demarcation point (DP) in the Augmented Cluster-Ordering generated by OPTICS and uses the reachability-distance of the DP as the neighborhood radius eps of its corresponding cluster. It thereby overcomes the weakness of most algorithms in clustering datasets with uneven densities. By computing the distance to the k-nearest neighbor of each point, it reduces the time complexity of OPTICS; by calculating density-mutation points within the clusters, it can efficiently recognize noise. The experimental results show that FOP-OPTICS has the lowest time complexity and outperforms other algorithms in parameter setting and noise recognition.
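The quantity the abstract relies on, the distance from each point to its k-th nearest neighbor, can be sketched as follows. This is a brute-force illustration under the assumption of Euclidean distance; it is not the FOP-OPTICS implementation and does not reproduce its complexity improvements. The function name `k_distance` is hypothetical.

```python
import math


def k_distance(points, k):
    # For each point, the Euclidean distance to its k-th nearest other point.
    # Brute force, O(n^2 log n); meant only to show what the quantity is.
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result
```

A larger k-distance means a sparser neighborhood, so a point far from the rest of the data stands out with a much larger value than points inside a dense region.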


2018 ◽  
Vol 74 ◽  
pp. 1-14 ◽  
Author(s):  
Yikun Qin ◽  
Zhu Liang Yu ◽  
Chang-Dong Wang ◽  
Zhenghui Gu ◽  
Yuanqing Li

2012 ◽  
Vol 9 (4) ◽  
pp. 1645-1661 ◽  
Author(s):  
Ray-I Chang ◽  
Shu-Yu Lin ◽  
Jan-Ming Ho ◽  
Chi-Wen Fann ◽  
Yu-Chun Wang

Image retrieval has been popular for several years, and there are different system designs for content-based image retrieval (CBIR). This paper proposes a novel system architecture for CBIR that combines content-based image and color analysis with data mining techniques. To the best of our knowledge, this is the first work to build a CBIR system from a segmentation and grid module, a feature extraction module, K-means and k-nearest neighbor clustering algorithms, and a neighborhood module. The concept of a neighborhood color analysis module, which also recognizes the sides of every grid of the image, is first contributed in this paper. The results show that the CBIR system performs well in training, and they also indicate that many interesting issues remain to be optimized in the query stage of image retrieval.


2020 ◽  
Vol 2 (1) ◽  
pp. 1-14
Author(s):  
Torkis Nasution

Selection is a college's attempt to obtain qualified prospective students. Entrance test data for new students can describe their academic quality and relate it to graduating on time. Recognizing the academic quality of students is required for lectures to achieve optimal results. At present, timely graduation has not been achieved optimally and needs to be improved to a reasonable level. The existing data need to be classified by academic quality in order to predict timely graduation. Therefore, we propose to address the problem by applying the K-Nearest Neighbor algorithm to re-cluster the test result data for new students. The procedure is to determine the number of clusters, determine the center point of each cluster, calculate the distance of each object to the centroids, and classify the objects. If the newly calculated grouping is the same as the previous grouping, the calculation is finished. The data used for clustering are the three-year-old entrance exam results of new students who were admitted to STMIK Amik Riau. This study aims to predict whether students graduate on time. Testing different values of k shows that the maximum accuracy, 99.25%, is obtained when k = 5. Accuracy declines as k increases: the greater the value of k, the less accurate the results.


Author(s):  
Ming Cao ◽  
Qinke Peng ◽  
Ze-Gang Wei ◽  
Fei Liu ◽  
Yi-Fan Hou

The development of high-throughput technologies has produced increasing amounts of sequence data and an increasing need for efficient clustering algorithms that can process massive volumes of sequencing data for downstream analysis. Heuristic clustering methods are widely applied for sequence clustering because of their low computational complexity. Although numerous heuristic clustering methods have been developed, they suffer from two limitations: overestimation of inferred clusters and low clustering sensitivity. To address these issues, we present a new sequence clustering method (edClust) based on Edlib, a C/C++ library for fast, exact semi-global sequence alignment to group similar sequences. The new method edClust was tested on three large-scale sequence databases, and we compared edClust to several classic heuristic clustering methods, such as UCLUST, CD-HIT, and VSEARCH. Evaluations based on the metrics of cluster number and seed sensitivity (SS) demonstrate that edClust can produce fewer clusters than other methods and that its SS is higher than that of other methods. The source codes of edClust are available from https://github.com/zhang134/EdClust.git under the GNU GPL license.
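Heuristic sequence clustering of the kind discussed here is commonly organized around seeds: each sequence joins the first seed it is close enough to, or becomes a new seed. The sketch below illustrates that general pattern only. It is not edClust, it uses a plain global Levenshtein distance written in Python instead of Edlib's semi-global alignment, and the names `greedy_cluster` and `max_dist` are hypothetical.

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance (global alignment),
    # used here as a stand-in for an alignment library.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def greedy_cluster(seqs, max_dist):
    # Process longer sequences first; each sequence joins the first seed
    # within max_dist edits, otherwise it starts a new cluster as its seed.
    seeds, clusters = [], []
    for s in sorted(seqs, key=len, reverse=True):
        for idx, seed in enumerate(seeds):
            if edit_distance(s, seed) <= max_dist:
                clusters[idx].append(s)
                break
        else:
            seeds.append(s)
            clusters.append([s])
    return clusters
```

Each sequence is compared only against seeds, not against every other sequence, which is what keeps this family of methods cheap; the trade-off is that the result depends on processing order and on the chosen threshold.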


2016 ◽  
Vol 16 (1) ◽  
pp. 67
Author(s):  
Komang Kompyang Agus Subrata ◽  
I Made Oka Widyantara ◽  
Linawati Linawati

ABSTRACT: Internet network traffic is data communication traffic in a network, characterized by a set of statistical flows that follow a structured pattern. The structured pattern in question is the information in the packet header. Proper classification of Internet traffic is very important, especially for network architecture design, network management and network security. Analyzing computer network traffic is one way to determine which network communication protocols are in use, so it can serve as the basis for setting Quality of Service (QoS) priorities. QoS priorities are assigned on the basis of an analysis of the network traffic data. In this study, captured network traffic data are classified using the K-Nearest Neighbor (K-NN) algorithm. The tool used to capture network traffic is the Wireshark application. Observation of the network traffic dataset and calculation with the K-NN algorithm show that K-NN classification achieves a very high level of accuracy, reaching 99.14% with k = 3.
DOI: 10.24843/MITE.1601.10


Author(s):  
Omar Freddy Chamorro-Atalaya ◽  
Guillermo Morales Romero ◽  
Adrián Quispe Andía ◽  
Beatriz Caycho Salas ◽  
Elizabeth Katerin Auqui Ramos ◽  
...  

The objective of this study is to analyze and discuss the metrics of a predictive model based on the K-nearest neighbor (K-NN) learning algorithm, applied to data on engineering students' perception of the quality of the virtual administrative service. As part of the methodology, the indicators of accuracy, precision, sensitivity and specificity were analyzed, obtained from the confusion matrix and the receiver operating characteristic (ROC) curve. The collected data were validated with Cronbach's alpha, yielding consistency values higher than 0.9, which allowed the analysis to continue. Using the predictive model built in Matlab R2021a, it was concluded that the average metrics for all classes are optimal, with a precision of 92.77%, a sensitivity of 86.62%, and a specificity of 94.7%, and a total accuracy of 85.5%. In turn, the highest area under the curve (AUC) is 0.98, which is why it is considered an optimal predictive model. This study can contribute significantly to the decision-making of the higher education institution regarding the improvement of the quality of the virtual administrative service.
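The four indicators named above all follow from the confusion matrix. The sketch below shows the standard one-vs-rest definitions for a multi-class matrix; the function name and the example matrix in the usage note are hypothetical and unrelated to the study's data.

```python
def class_metrics(cm):
    # cm[i][j]: number of samples whose true class is i and predicted class is j.
    n = len(cm)
    total = sum(sum(row) for row in cm)
    per_class = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # true c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp   # predicted c, true otherwise
        tn = total - tp - fn - fp
        per_class.append({
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
            "specificity": tn / (tn + fp) if tn + fp else 0.0,
        })
    accuracy = sum(cm[i][i] for i in range(n)) / total
    return accuracy, per_class
```

For a toy matrix `[[8, 2], [1, 9]]`, the overall accuracy is 17/20 = 0.85, and class 0 has precision 8/9, sensitivity 0.8 and specificity 0.9. Averaging the per-class values gives "average metrics for all classes" in the sense used in the abstract, which is why averaged precision or specificity can exceed the overall accuracy.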


2017 ◽  
Vol 6 (2) ◽  
pp. 113
Author(s):  
Taftyani Yusuf Prahudaya ◽  
Agus Harjoko

Guava (Psidium guajava L.) is a fruit that has many health benefits. Guava also has commercial value in Indonesia and a large market share, which indicates that it is widely consumed. Currently, sorting is still done manually, which has many shortcomings: manual classification gives inaccurate and inconsistent results because of human carelessness. Grading is essential in the marketing sector. Improper grading can be detrimental to farmers because fruit of all qualities is priced the same. Therefore, a consistent classification system is needed. The system uses image processing to extract the color and texture features of guava, and the K-Nearest Neighbor (KNN) method is used to classify quality. The system classifies guava into four quality classes: super class, class A, class B, and external quality. The KNN classifier takes 7 extracted features as input, namely the average RGB (red, green, and blue) values, the total defect area, and the GLCM values (entropy, homogeneity, and contrast), and produces one of 4 quality outputs. The test results show that the method is able to classify the quality of guava. The highest accuracy, 91.25%, is obtained with K = 3.
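A K-NN classifier over numeric feature vectors, such as the 7 color and texture features described above, can be sketched in a few lines. This is a generic illustration assuming Euclidean distance and majority voting; the function name `knn_classify` is hypothetical, and nothing here reflects the authors' feature extraction or data.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    # train: list of (feature_vector, label) pairs.
    # Rank training samples by Euclidean distance to the query and
    # return the most common label among the k nearest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

When features have very different ranges, as RGB averages and a defect area typically do, they are usually rescaled to a common range first so that no single feature dominates the distance.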

