A SCALABLE CLUSTERING METHOD FOR CATEGORICAL SEQUENCE DATA

Sequence clustering has been relatively little explored, but it is becoming increasingly important in data mining applications such as web usage mining and bioinformatics. The web user segmentation problem uses web access log files to partition a set of users into clusters such that users within one cluster are more similar to one another than to users in other clusters. Similarly, grouping protein sequences that share a similar structure can help to identify sequences with similar functions. However, few clustering algorithms consider sequentiality. In this paper, we study how to cluster sequence datasets. Because hierarchical clustering algorithms have high computational complexity on large datasets, a new clustering method is required. Therefore, we propose a new scalable clustering method using sampling and a k-nearest-neighbor method. Using a splice dataset and a synthetic dataset, we show that the quality of clusters generated by our proposed approach is better than that of clusters produced by traditional algorithms.
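The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea of combining sampling with a k-nearest-neighbor step: cluster a small random sample with a costly hierarchical method, then label every sequence by a vote among its nearest sampled sequences. The padded Hamming distance, the single-linkage merge rule, and all function names and parameters are assumptions made for illustration, not the authors' method.

```python
import random
from collections import Counter

def hamming_distance(a, b):
    """Count mismatching positions; the shorter string is padded (assumed metric)."""
    length = max(len(a), len(b))
    a, b = a.ljust(length), b.ljust(length)
    return sum(x != y for x, y in zip(a, b))

def cluster_sample(sample, threshold):
    """Single-linkage agglomerative clustering, run on the small sample only."""
    clusters = [[s] for s in sample]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(hamming_distance(a, b) <= threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return clusters

def knn_assign(seq, labeled, k):
    """Label seq by majority vote among its k nearest labeled sample sequences."""
    nearest = sorted(labeled, key=lambda item: hamming_distance(seq, item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def scalable_cluster(sequences, sample_size, threshold, k, seed=0):
    """Cluster a random sample, then assign every sequence by k-NN vote."""
    rng = random.Random(seed)
    sample = rng.sample(sequences, sample_size)
    clusters = cluster_sample(sample, threshold)
    labeled = [(s, idx) for idx, members in enumerate(clusters) for s in members]
    return {seq: knn_assign(seq, labeled, k) for seq in sequences}
```

Only the sample pays the quadratic (or worse) cost of hierarchical clustering; each remaining sequence costs one pass over the sample. The result depends on the sample covering every cluster, which this sketch does not check.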

A SEQUENCE-ELEMENT-BASED HIERARCHICAL CLUSTERING ALGORITHM FOR CATEGORICAL SEQUENCE DATA

International Journal of Information Technology & Decision Making ◽

10.1142/s0219622005001398 ◽

2005 ◽

Vol 04 (01) ◽

pp. 81-96 ◽

Cited By ~ 5

Author(s):

SEUNG-JOON OH ◽

JAE-YEARN KIM

Keyword(s):

Hierarchical Clustering ◽

Clustering Algorithm ◽

Sequence Data ◽

Clustering Algorithms ◽

Scientific Data ◽

Sequence Element ◽

Hierarchical Clustering Algorithm ◽

Synthetic Datasets ◽

Better Than

Recently, there has been enormous growth in the amount of commercial and scientific data, such as protein sequences, retail transactions, and web-logs. Such datasets consist of sequence data that have an inherent sequential nature. However, few existing clustering algorithms consider sequentiality. In this paper, we study how to cluster these sequence datasets. We propose a new similarity measure to compute the similarity between two sequences. In the proposed measure, subsets of a sequence are considered, and the more identical subsets there are, the more similar the two sequences. In addition, we propose a hierarchical clustering algorithm and an efficient method for measuring similarity. Using a splice dataset and synthetic datasets, we show that the quality of clusters generated by our proposed approach is better than that of clusters produced by traditional clustering algorithms.
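The abstract does not define which "subsets" of a sequence are compared, so the sketch below assumes contiguous runs of up to `max_len` elements and scores two sequences by the Jaccard ratio of their shared runs. Both choices, and the names `subsequences` and `subset_similarity`, are illustrative assumptions rather than the measure proposed in the paper.

```python
def subsequences(seq, max_len=3):
    """All contiguous runs of 1..max_len elements, as a set of tuples."""
    return {tuple(seq[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(seq) - n + 1)}

def subset_similarity(a, b, max_len=3):
    """Jaccard ratio of shared runs: more identical runs means more similar."""
    sa, sb = subsequences(a, max_len), subsequences(b, max_len)
    if not sa and not sb:
        return 1.0  # two empty sequences are treated as identical
    return len(sa & sb) / len(sa | sb)
```

Because runs longer than one element are order-sensitive, "AB" and "BA" share their single elements but not the pair, so they score lower than two identical sequences. This is the kind of sequentiality that a bag-of-items measure would miss.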

DRSA: a non-hierarchical clustering algorithm using k-NN graph and its application in vegetation classification

Vegetation of Russia ◽

10.31111/vegrus/2015.27.125 ◽

2015 ◽

pp. 125-138 ◽

Cited By ~ 2

Author(s):

I. V. Goncharenko

Keyword(s):

Cluster Analysis ◽

Clustering Algorithm ◽

Nearest Neighbor ◽

Clustering Algorithms ◽

Protein Structures ◽

Hierarchical Cluster ◽

Vegetation Classification ◽

K Nearest Neighbor ◽

Neighbor Graph ◽

Nearest Neighbor Graph

In this article we propose a new method of non-hierarchical cluster analysis using a k-nearest-neighbor graph and discuss it with respect to vegetation classification. The k-nearest-neighbor (k-NN) classification method was originally developed in 1951 (Fix, Hodges, 1951). Later, the term "k-NN graph" and a few k-NN clustering algorithms appeared (Cover, Hart, 1967; Brito et al., 1997). In biology, k-NN is used in the analysis of protein structures and genome sequences. Most k-NN clustering algorithms first build an "excessive" graph, a so-called hypergraph, and then truncate it into subgraphs by partitioning and coarsening the hypergraph. We developed a different strategy, "upward" clustering, which forms (assembles sequentially) one cluster after another. Until now, graph-based cluster analysis has not been considered for the classification of vegetation datasets.
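For readers unfamiliar with the term, a k-NN graph links each object to its k closest objects. The sketch below builds such a graph for points under Euclidean distance; it illustrates the data structure only and is not the DRSA algorithm, whose cluster-assembly steps are not given in the abstract. The function name and the choice of metric are assumptions.

```python
import math

def knn_graph(points, k):
    """Map each point index to the indices of its k nearest other points."""
    graph = {}
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        graph[i] = others[:k]
    return graph
```

The edges are directed: point a may list b as a neighbor without b listing a. Graph-based clustering methods differ mainly in how they turn these edges into groups.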

An improved OPTICS clustering algorithm for discovering clusters with uneven densities

Intelligent Data Analysis ◽

10.3233/ida-205497 ◽

2021 ◽

Vol 25 (6) ◽

pp. 1453-1471

Author(s):

Chunhua Tang ◽

Han Wang ◽

Zhiwen Wang ◽

Xiangkun Zeng ◽

Huaran Yan ◽

...

Keyword(s):

Time Complexity ◽

Clustering Algorithm ◽

Nearest Neighbor ◽

Clustering Algorithms ◽

Substantial Improvement ◽

Experimental Results ◽

High Time ◽

Parameter Setting ◽

K Nearest Neighbor ◽

Density Based Clustering

Most density-based clustering algorithms suffer from difficult parameter setting, high time complexity, poor noise recognition, and weak clustering of datasets with uneven density. To solve these problems, this paper proposes the FOP-OPTICS algorithm (Finding of the Ordering Peaks Based on OPTICS), which is a substantial improvement of OPTICS (Ordering Points To Identify the Clustering Structure). The proposed algorithm finds the demarcation point (DP) from the Augmented Cluster-Ordering generated by OPTICS and uses the reachability-distance of the DP as the neighborhood radius eps of its corresponding cluster. It overcomes the weakness of most algorithms in clustering datasets with uneven densities. By computing the distance of the k-nearest neighbor of each point, it reduces the time complexity of OPTICS; by calculating density-mutation points within the clusters, it can efficiently recognize noise. The experimental results show that FOP-OPTICS has the lowest time complexity, and outperforms other algorithms in parameter setting and noise recognition.
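The "distance of the k-nearest neighbor of each point" mentioned above is a simple local density estimate: a small value means a dense neighborhood, a large value suggests sparse regions or noise. The following sketch computes it by brute force under Euclidean distance; it is not FOP-OPTICS itself, and the function name is an assumption.

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest other point."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result
```

This brute-force version is quadratic in the number of points; a spatial index would be needed for the speed-ups the abstract refers to.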

A Novel clustering method based on hybrid K-nearest-neighbor graph

Pattern Recognition ◽

10.1016/j.patcog.2017.09.008 ◽

2018 ◽

Vol 74 ◽

pp. 1-14 ◽

Cited By ~ 19

Author(s):

Yikun Qin ◽

Zhu Liang Yu ◽

Chang-Dong Wang ◽

Zhenghui Gu ◽

Yuanqing Li

Keyword(s):

Nearest Neighbor ◽

K Nearest Neighbor ◽

Clustering Method ◽

Neighbor Graph ◽

Nearest Neighbor Graph

A novel content based image retrieval system using K-means/KNN with feature extraction

Computer Science and Information Systems ◽

10.2298/csis120122047c ◽

2012 ◽

Vol 9 (4) ◽

pp. 1645-1661 ◽

Cited By ~ 10

Author(s):

Ray-I Chang ◽

Shu-Yu Lin ◽

Jan-Ming Ho ◽

Chi-Wen Fann ◽

Yu-Chun Wang

Keyword(s):

Feature Extraction ◽

Image Retrieval ◽

Nearest Neighbor ◽

Clustering Algorithms ◽

Content Based Image Retrieval ◽

K Nearest Neighbor ◽

Color Analysis ◽

Image Retrieval System ◽

First Time ◽

System Designs

Image retrieval has been popular for several years, and there are different system designs for content-based image retrieval (CBIR). This paper proposes a novel system architecture for CBIR that combines content-based image and color analysis with data mining techniques. To the best of our knowledge, this is the first work to propose a segmentation and grid module, a feature extraction module, and K-means and k-nearest-neighbor clustering algorithms, and to bring in a neighborhood module, to build a CBIR system. The concept of a neighborhood color analysis module, which also recognizes the sides of every grid cell of an image, is first contributed in this paper. The results show that the CBIR system performs well in training and indicate that many issues of interest remain to be optimized in the query stage of image retrieval.

IMPLEMENTASI ALGORITMA K-NEAREST NEIGHBOR UNTUK PENENTUAN KELULUSAN MAHASISWA TEPAT WAKTU [Implementation of the K-Nearest Neighbor Algorithm for Determining On-Time Student Graduation]

JURNAL PERANGKAT LUNAK ◽

10.32520/jupel.v2i1.944 ◽

2020 ◽

Vol 2 (1) ◽

pp. 1-14

Author(s):

Torkis Nasution

Keyword(s):

Nearest Neighbor ◽

K Nearest Neighbor ◽

Academic Quality ◽

K Value ◽

Entrance Exam ◽

Maximum Accuracy ◽

Data Clusters ◽

New Students ◽

Data Group

Admission selection is a college's attempt to obtain qualified prospective students. Test data for new students can describe their academic quality and relate it to on-time graduation. Recognizing students' academic quality is necessary for lectures to yield optimal results. At present, on-time graduation has not been achieved optimally and needs to be improved to a reasonable level. The existing data need to be classified by academic quality in order to predict on-time graduation. Therefore, we propose applying the K-Nearest Neighbor algorithm to re-cluster the test result data of new students. The procedure is to determine the number of data clusters, determine the center point of each cluster, calculate the distance of each object to the centroids, and classify the objects. If the newly calculated data groups are the same as those from the preceding calculation, the calculation ends. The data used in clustering are the three-year-old entrance exam results of new students of STMIK Amik Riau. This study aims to predict whether or not students graduate on time. Testing different values of k shows that maximum accuracy, 99.25%, is obtained at k = 5. Accuracy declines as k grows larger.
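The abstract reports accuracy for several values of k but not how the evaluation was run. One common way to compare candidate values, shown here purely as an assumed procedure, is leave-one-out evaluation of a majority-vote k-NN classifier. The feature tuples and labels used with it would be hypothetical; none of the study's data appears in the source.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority label among the k training items closest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy: predict each item from all the others."""
    hits = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, features, k) == label
    return hits / len(data)

def best_k(data, candidates):
    """Return the best-scoring k (first one on ties) and all scores."""
    scores = {k: loo_accuracy(data, k) for k in candidates}
    return max(scores, key=scores.get), scores
```

On small datasets a large k pulls in neighbors from other classes, which is one reason accuracy can fall as k increases.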

EdClust: A heuristic sequence clustering method with higher sensitivity

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720021500360 ◽

2021 ◽

Author(s):

Ming Cao ◽

Qinke Peng ◽

Ze-Gang Wei ◽

Fei Liu ◽

Yi-Fan Hou

Keyword(s):

Large Scale ◽

Sequence Data ◽

Clustering Algorithms ◽

Clustering Methods ◽

Sequencing Data ◽

Clustering Method ◽

Cluster Number ◽

Sequence Clustering ◽

Downstream Analysis ◽

Heuristic Clustering

The development of high-throughput technologies has produced increasing amounts of sequence data and an increasing need for efficient clustering algorithms that can process massive volumes of sequencing data for downstream analysis. Heuristic clustering methods are widely applied for sequence clustering because of their low computational complexity. Although numerous heuristic clustering methods have been developed, they suffer from two limitations: overestimation of inferred clusters and low clustering sensitivity. To address these issues, we present a new sequence clustering method (edClust) based on Edlib, a C/C++ library for fast, exact semi-global sequence alignment to group similar sequences. The new method edClust was tested on three large-scale sequence databases, and we compared edClust to several classic heuristic clustering methods, such as UCLUST, CD-HIT, and VSEARCH. Evaluations based on the metrics of cluster number and seed sensitivity (SS) demonstrate that edClust can produce fewer clusters than other methods and that its SS is higher than that of other methods. The source codes of edClust are available from https://github.com/zhang134/EdClust.git under the GNU GPL license.
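To make the idea of heuristic, seed-based sequence clustering concrete, here is a minimal greedy sketch: each sequence joins the first seed within a distance threshold, otherwise it becomes a new seed. It uses a plain-Python global edit distance as a stand-in, not Edlib's semi-global alignment, and it does not reproduce edClust; the function names and the `max_dist` parameter are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]

def greedy_cluster(sequences, max_dist):
    """Process longest sequences first; join the first seed that is close enough."""
    seeds, clusters = [], []
    for seq in sorted(sequences, key=len, reverse=True):
        for idx, seed in enumerate(seeds):
            if edit_distance(seq, seed) <= max_dist:
                clusters[idx].append(seq)
                break
        else:
            seeds.append(seq)
            clusters.append([seq])
    return clusters
```

Because a sequence is compared only against seeds, not against every other sequence, the cost stays low. The order of processing and the first-match rule, however, can inflate the number of clusters, which is the kind of limitation the abstract describes.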

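KLASIFIKASI PENGGUNAAN PROTOKOL KOMUNIKASI PADA TRAFIK JARINGAN MENGGUNAKAN ALGORITMA K-NEAREST NEIGHBOR [Classification of Communication Protocol Usage in Network Traffic Using the K-Nearest Neighbor Algorithm]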

Majalah Ilmiah Teknologi Elektro ◽

10.24843/mite.1601.10 ◽

2016 ◽

Vol 16 (1) ◽

pp. 67

Author(s):

Komang Kompyang Agus Subrata ◽

I Made Oka Widyantara ◽

Linawati Linawati

Keyword(s):

Quality Of Service ◽

Network Traffic ◽

Network Architecture ◽

Nearest Neighbor ◽

Computer Network ◽

Data Communication ◽

Data Capture ◽

K Nearest Neighbor ◽

High Level

Internet network traffic is data communication in a network characterized by a set of statistical flows with a structured pattern. The structured pattern in question is the information from the packet header. Proper classification of Internet traffic is very important, especially for network architecture design, network management, and network security. Analyzing computer network traffic is one way to determine which communication protocols are in use, so that it can serve as the basis for setting Quality of Service (QoS) priorities. QoS priorities are assigned by analyzing network traffic data. In this study, captured network traffic data are classified using the K-Nearest Neighbor (K-NN) algorithm. The Wireshark application is used to capture the network traffic. Observation of the network traffic dataset and calculation with the K-NN algorithm show that K-NN classification has a very high level of accuracy, reaching 99.14% with k = 3.

K-NN supervised learning algorithm in the predictive analysis of the quality of the university administrative service in the virtual environment

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v25.i1.pp521-528 ◽

2022 ◽

Vol 25 (1) ◽

pp. 521

Author(s):

Omar Freddy Chamorro-Atalaya ◽

Guillermo Morales Romero ◽

Adrián Quispe Andía ◽

Beatriz Caycho Salas ◽

Elizabeth Katerin Auqui Ramos ◽

...

Keyword(s):

Predictive Model ◽

Engineering Students ◽

Nearest Neighbor ◽

Learning Algorithm ◽

Confusion Matrix ◽

Area Under The Curve ◽

Operational Characteristic ◽

K Nearest Neighbor ◽

Administrative Service

The objective of this study is to analyze and discuss the metrics of a predictive model using the K-nearest neighbor (K-NN) learning algorithm, applied to data on engineering students' perception of the quality of the virtual administrative service. As part of the methodology, the indicators of accuracy, precision, sensitivity, and specificity were analyzed, obtained from the confusion matrix and the receiver operating characteristic (ROC) curve. The collected data were validated through Cronbach's Alpha, finding consistency values higher than 0.9, which allows the analysis to continue. Using the predictive model built in the Matlab R2021a software, it was concluded that the average metrics for all classes are optimal, with a precision of 92.77%, a sensitivity of 86.62%, and a specificity of 94.7%, and a total accuracy of 85.5%. In turn, the highest area under the curve (AUC) is 0.98, which is why the model is considered an optimal predictive model. This study can contribute significantly to the decision-making of the higher education institution regarding improvement of the quality of the virtual administrative service.
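The metrics named above all follow from the confusion matrix. As a minimal sketch, assuming rows are actual classes and columns are predicted classes, the per-class values and the overall accuracy can be derived as follows; the function name and the matrix layout are assumptions, and no data from the study is used.

```python
def per_class_metrics(matrix):
    """Per-class precision, sensitivity, specificity, plus overall accuracy.

    matrix[r][c] counts items of actual class r predicted as class c.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    metrics = []
    for c in range(n):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp                         # actual c, predicted other
        fp = sum(matrix[r][c] for r in range(n)) - tp    # predicted c, actual other
        tn = total - tp - fn - fp
        metrics.append({
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
            "specificity": tn / (tn + fp) if tn + fp else 0.0,
        })
    accuracy = sum(matrix[c][c] for c in range(n)) / total
    return metrics, accuracy
```

Averaging the per-class dictionaries gives the "average metrics for all classes" style of reporting; accuracy is computed once over the whole matrix, so it need not match those averages.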

METODE KLASIFIKASI MUTU JAMBU BIJI MENGGUNAKAN KNN BERDASARKAN FITUR WARNA DAN TEKSTUR [Method for Guava Quality Classification Using KNN Based on Color and Texture Features]

Jurnal Teknosains ◽

10.22146/teknosains.26972 ◽

2017 ◽

Vol 6 (2) ◽

pp. 113

Author(s):

Taftyani Yusuf Prahudaya ◽

Agus Harjoko

Keyword(s):

Nearest Neighbor ◽

Texture Features ◽

Psidium Guajava ◽

K Nearest Neighbor ◽

Average Value ◽

Quality Classification ◽

Defect Area ◽

Class B ◽

Sorting Process

Guava (Psidium guajava L.) is a fruit that has many health benefits. Guava also has commercial value in Indonesia and a large market share, which indicates that it is widely consumed. At present the sorting process is still done manually, which has many shortcomings: manual classification gives less accurate and inconsistent results because of human carelessness. Grading is essential in the marketing sector. Improper grading is potentially detrimental to farmers because fruit of every quality is priced the same. Therefore, a consistent classification system is needed. The system uses image processing to extract the color and texture features of guava, and the KNN (K-Nearest Neighbor) method is used as the quality classifier. The system classifies guava into four quality classes: super class, class A, class B, and external quality. The KNN takes as input 7 extracted features, namely the average R, G, and B (red, green, and blue) values, the total defect area, and the GLCM values (entropy, homogeneity, and contrast), and outputs one of the 4 quality classes. The test results show that the method is able to classify guava quality. The highest accuracy, 91.25%, is obtained with K = 3.
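The three texture features come from a gray-level co-occurrence matrix (GLCM). The sketch below shows one way to compute them for a small quantized grayscale image. The abstract does not state the pixel offset or the exact formulas, so the horizontal one-pixel offset, the non-symmetric matrix, and the homogeneity form with a squared difference in the denominator are assumptions.

```python
import math

def glcm(image, levels):
    """Normalized co-occurrence matrix for horizontally adjacent pixel pairs.

    image is a list of rows of integer gray levels in range(levels);
    it must contain at least one row with two or more pixels.
    """
    counts = [[0] * levels for _ in range(levels)]
    for row in image:
        for left, right in zip(row, row[1:]):
            counts[left][right] += 1
    total = sum(map(sum, counts))
    return [[c / total for c in r] for r in counts]

def texture_features(p):
    """Contrast, homogeneity, and entropy of a normalized GLCM p."""
    cells = [(i, j, v) for i, r in enumerate(p) for j, v in enumerate(r)]
    contrast = sum((i - j) ** 2 * v for i, j, v in cells)
    homogeneity = sum(v / (1 + (i - j) ** 2) for i, j, v in cells)
    entropy = -sum(v * math.log2(v) for _, _, v in cells if v > 0)
    return {"contrast": contrast, "homogeneity": homogeneity, "entropy": entropy}
```

A uniform patch gives zero contrast and entropy with homogeneity 1, while a checkerboard raises contrast and entropy. Combined with the mean R, G, and B values and the defect area, these would form the 7-element input vector described above.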
