Data Mining Clustering Algorithm Research and Application

This paper first introduces the fundamentals of data mining and then focuses on clustering analysis algorithms. It covers the categories of clustering algorithms and the typical algorithms in each category, giving a formal description of each algorithm and a detailed account of its advantages and disadvantages. It then introduces data mining algorithms built on cluster analysis. Finally, a cohesion-based clustering algorithm and the DBSCAN algorithm are applied to consumer-spending data in two-dimensional space, with 2,000 data points for each area. The clustering results are reasonable, and the hierarchical clustering results yield valuable information, combining clustering analysis theory with the practical application of the algorithms.


A Data Distribution View of Clustering Algorithms

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch059 ◽

2011 ◽

pp. 374-381 ◽

Cited By ~ 1

Author(s):

Junjie Wu ◽

Jian Chen ◽

Hui Xiong

Keyword(s):

Data Mining ◽

Cluster Analysis ◽

Clustering Analysis ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Data Distribution ◽

Point Of View ◽

Group Method ◽

Data Sets ◽

Distribution Point

Cluster analysis (Jain & Dubes, 1988) provides insight into data by dividing objects into groups (clusters) such that objects in a cluster are more similar to each other than to objects in other clusters. Cluster analysis has long played an important role in a wide variety of fields, such as psychology, bioinformatics, pattern recognition, information retrieval, machine learning, and data mining. Many clustering algorithms, such as K-means and the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), are well established. A recent research focus in clustering analysis is to understand the strengths and weaknesses of various clustering algorithms with respect to data factors. Indeed, people have identified some data characteristics that may strongly affect clustering analysis, including high dimensionality and sparseness, large size, noise, types of attributes and data sets, and scales of attributes (Tan, Steinbach, & Kumar, 2005). However, further investigation is needed to reveal whether and how data distributions affect the performance of clustering algorithms. Along this line, we study clustering algorithms by answering three questions:
1. What are the systematic differences between the distributions of the clusters produced by different clustering algorithms?
2. How does the distribution of the “true” cluster sizes affect the performance of clustering algorithms?
3. How should an appropriate clustering algorithm be chosen in practice?
The answers to these questions can guide a better understanding and use of clustering methods. This is noteworthy, since 1) in theory, people have seldom realized that there are strong relationships between clustering algorithms and cluster size distributions, and 2) in practice, choosing an appropriate clustering algorithm is still a challenging task, especially after the boom of algorithms in the data mining area. This chapter is an initial attempt to fill this void.
To this end, we carefully select two widely used categories of clustering algorithms, K-means and Agglomerative Hierarchical Clustering (AHC), as the representative algorithms for illustration. In the chapter, we first show that K-means tends to generate clusters with a relatively uniform distribution of cluster sizes. Then we demonstrate that UPGMA, one of the robust AHC methods, acts in the opposite way to K-means; that is, UPGMA tends to generate clusters with high variation in cluster sizes. Indeed, the experimental results indicate that the variations of the resultant cluster sizes for K-means and UPGMA, measured by the Coefficient of Variation (CV), fall in specific intervals, say [0.3, 1.0] and [1.0, 2.5] respectively. Finally, we put K-means and UPGMA together for a further comparison and propose some rules for choosing between clustering schemes from the data distribution point of view.
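As a rough illustration of the Coefficient of Variation (CV) measure mentioned in this abstract, the sketch below computes the CV of cluster sizes from a list of cluster labels. The function name `cluster_size_cv`, the toy label lists, and the use of the sample standard deviation (dividing by n - 1) are assumptions made for this example; the chapter itself may define the estimator differently.

```python
from collections import Counter
from statistics import mean, stdev


def cluster_size_cv(labels):
    """Coefficient of variation (standard deviation / mean) of cluster sizes.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    sizes = list(Counter(labels).values())
    if len(sizes) < 2:
        return 0.0  # a single cluster has no size variation
    return stdev(sizes) / mean(sizes)


# Hypothetical label assignments for 150 objects in three clusters.
balanced = [0] * 50 + [1] * 50 + [2] * 50   # uniform cluster sizes
skewed = [0] * 130 + [1] * 15 + [2] * 5     # highly varied cluster sizes

print(cluster_size_cv(balanced))
print(cluster_size_cv(skewed))
```

A CV near zero corresponds to near-uniform cluster sizes, while values above 1.0 indicate that the spread of sizes exceeds the mean size.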


A Clustering Analysis Method for Massive Music Data

Modern Electronic Technology ◽

10.26549/met.v5i1.6763 ◽

2021 ◽

Vol 5 (1) ◽

pp. 24

Author(s):

Yanping Xu ◽

Sen Xu

Keyword(s):

Data Mining ◽

Cluster Analysis ◽

Pattern Recognition ◽

Fourier Transform ◽

Clustering Analysis ◽

Clustering Algorithm ◽

Analysis Method ◽

Audio File ◽

Beat Period ◽

Different Types

Clustering analysis plays a very important role in the fields of data mining, image segmentation, and pattern recognition. Cluster analysis is introduced here to analyze NetEYun music data, and different types of music data are clustered to find what music of the same kind has in common. A music data-oriented clustering analysis method is proposed: first, the audio beat period is calculated by reading the audio file data, and the emotional features of the audio are extracted; second, the audio beat period is calculated by Fourier transform; finally, a clustering algorithm is designed to obtain the clustering results of the music data.


Data Analysis of College Students’ Mental Health Based on Clustering Analysis Algorithm

Complexity ◽

10.1155/2021/9996146 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Yichen Chu ◽

Xiaojian Yin

Keyword(s):

Mental Health ◽

College Students ◽

Data Mining ◽

Clustering Analysis ◽

Management System ◽

Clustering Algorithm ◽

Analysis Algorithm ◽

Advantages And Disadvantages ◽

Clustering Model ◽

Psychological Management

Mental health is an important basic condition for college students’ growth into adulthood, and educators increasingly attach importance to strengthening mental health education for college students. This paper analyzes college students’ mental health in detail, reviews the development and application of clustering analysis algorithms, applies the distance formulas and clustering criterion functions commonly used in clustering analysis, and describes several classic clustering algorithms. After discussing the advantages and disadvantages of the fast clustering algorithm and the hierarchical clustering algorithm, the paper introduces the two-step clustering algorithm, discusses the algorithm flow of the clustering model in detail, and gives the algorithm flow chart. The main work of this paper is to apply clustering to a student mental health database built from the results of a mental health assessment tool: establish a data mining model, mine the database, analyze the state characteristics of different college students’ mental health, and provide corresponding solutions. To meet the needs of a psychological management system based on clustering analysis, the clustering analysis algorithm is used to cluster the data. Starting from the original database, the paper establishes methods for selecting, cleaning, and transforming the data in students’ psychological archives. Finally, it describes the application of data mining in the students’ psychological management system, summarizes the implementation of the system, and discusses its prospects.


Fuzzy clustering and fuzzy c-means partition cluster analysis and validation studies on a subset of citescore dataset

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v9i4.pp2760-2770 ◽

2019 ◽

Vol 9 (4) ◽

pp. 2760

Author(s):

K. Varada Rajkumar ◽

Adimulam Yesubabu ◽

K. Subrahmanyam

Keyword(s):

Cluster Analysis ◽

Fuzzy Clustering ◽

Time Complexity ◽

Clustering Algorithm ◽

Fuzzy Cluster ◽

Fuzzy Cluster Analysis ◽

Fuzzy C Means ◽

A Value ◽

Data Points ◽

Partition Clustering

A hard partition clustering algorithm assigns a point that is equally distant from several clusters to only one of them, even though such a datum could plausibly be assigned to more than one cluster at once. Fuzzy cluster analysis instead assigns membership coefficients to data points that are equidistant between two clusters, so that a data point can belong to more than one cluster at the same time. For a subset of the CiteScore dataset, the fuzzy clustering (fanny) and fuzzy c-means (fcm) algorithms were implemented to study the data points that lie equally distant from each other. Before analysis, the clusterability of the dataset was evaluated with the Hopkins statistic, which resulted in 0.4371, a value < 0.5, indicating that the data is highly clusterable. The optimal number of clusters was determined using the NbClust package, where nine indices proposed a three-cluster solution as the best. Further, an appropriate value of the fuzziness parameter m was evaluated to determine the distribution of membership values as m varies from 1 to 2. The coefficient of variation (CV), also known as relative variability, was evaluated to study the spread of the data. The time complexity of the fuzzy clustering (fanny) and fuzzy c-means algorithms was evaluated by keeping the number of data points constant and varying the number of clusters.
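To make the idea of membership coefficients concrete, here is a minimal sketch of the standard fuzzy c-means membership formula, u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)), for a single point and fixed cluster centers. It is not the fanny or fcm implementation used in the paper; the function name `fcm_memberships` and the toy centers are assumptions for illustration, and the formula requires m > 1.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Fuzzy c-means membership of one point in each cluster center.

    Assumptions: Euclidean distance and fuzziness parameter m > 1.
    """
    if m <= 1.0:
        raise ValueError("the fuzziness parameter m must be greater than 1")
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on one or more centers belongs fully to them.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
        for d_i in dists
    ]


# Hypothetical two-cluster layout with centers at (-1, 0) and (1, 0).
centers = [(-1.0, 0.0), (1.0, 0.0)]
equidistant = fcm_memberships((0.0, 0.0), centers)   # equal memberships
off_center = fcm_memberships((0.5, 0.0), centers)    # closer to the second center
```

An equidistant point receives equal membership in both clusters instead of being forced into one, which is the contrast with hard partitioning that the abstract draws.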


A dynamic K-means clustering for data mining

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v13.i2.pp521-526 ◽

2019 ◽

Vol 13 (2) ◽

pp. 521

Author(s):

Md. Zakir Hossain ◽

Md.Nasim Akhtar ◽

R.B. Ahmad ◽

Mostafijur Rahman

Keyword(s):

Data Mining ◽

Clustering Algorithm ◽

Large Data ◽

Threshold Value ◽

Specific Pattern ◽

Large Data Sets ◽

Data Sets ◽

Data Set ◽

Number Of Clusters ◽

Data Points

Data mining is the process of finding structure in large data sets. With this process, decision makers can make decisions that further the solution of real-world problems. Several data clustering techniques are used in data mining to find specific patterns in data. The K-means method is one of the familiar clustering techniques for clustering large data sets. The K-means clustering method partitions the data set based on the assumption that the number of clusters is fixed. The main problem of this method is that if the number of clusters is chosen too small, there is a higher probability of adding dissimilar items to the same group. On the other hand, if the number of clusters is chosen too high, there is a higher chance of adding similar items to different groups. In this paper, we address this issue by proposing a new K-Means clustering algorithm. The proposed method performs data clustering dynamically. It initially calculates a threshold value as a centroid of K-Means, and the number of clusters is formed based on this value. At each iteration of K-Means, if the Euclidean distance between two points is less than or equal to the threshold value, these two data points are placed in the same group. Otherwise, the proposed method creates a new cluster with the dissimilar data point. The results show that the proposed method outperforms the original K-Means method.
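The threshold rule described in this abstract can be sketched roughly as follows. This is a simplified, leader-style illustration and not the authors' algorithm: the abstract does not say how the threshold is derived, so here it is simply passed in as a parameter, and the function name `threshold_clustering`, the running-mean centroid update, and the toy points are all assumptions.

```python
import math


def threshold_clustering(points, threshold):
    """Group points without fixing the number of clusters in advance.

    A point joins the nearest existing cluster if its Euclidean distance to
    that cluster's centroid is <= threshold; otherwise it starts a new cluster.
    Assumption: the threshold is supplied by the caller.
    """
    centroids = []  # current centroid of each cluster
    members = []    # points assigned to each cluster
    for p in points:
        best, best_d = None, None
        for i, c in enumerate(centroids):
            d = math.dist(p, c)
            if best_d is None or d < best_d:
                best, best_d = i, d
        if best is not None and best_d <= threshold:
            members[best].append(p)
            n = len(members[best])
            # Recompute the centroid as the mean of the cluster's points.
            centroids[best] = tuple(
                sum(q[k] for q in members[best]) / n for k in range(len(p))
            )
        else:
            centroids.append(tuple(p))
            members.append([p])
    return members


# Hypothetical 2-D points forming two well-separated groups.
points = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.5, 10.0), (0.0, 0.5)]
clusters = threshold_clustering(points, threshold=2.0)
```

In this sketch the number of clusters is an output of the procedure, which mirrors the motivation given in the abstract; the result does depend on the order in which points are visited.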


Malicious Intrusion Data Mining Algorithm of Wireless Personal Communication Network Supported by Legal Big Data

Wireless Communications and Mobile Computing ◽

10.1155/2021/8321636 ◽

2021 ◽

Vol 2021 ◽

pp. 1-7

Author(s):

Kai Ji

Keyword(s):

Data Mining ◽

Communication Network ◽

Big Data ◽

Clustering Algorithm ◽

Personal Information ◽

Personal Communication ◽

Communication Process ◽

Data Mining Algorithm ◽

Mining Algorithm ◽

Wireless Personal Communication

Wireless personal communication networks are easily affected by intrusion data during the communication process, so the security of personal information in wireless communication cannot be ensured. Therefore, this paper proposes a malicious intrusion data mining algorithm based on legitimate big data in wireless personal communication networks. The clustering algorithm is used to iteratively obtain the central point of the malicious intrusion data and determine its expected membership. Noise in the malicious intrusion data is removed using an objective function, and the membership degree of the communication data is calculated. The change factor of the neighborhood center of gravity of the malicious intrusion data in the wireless personal communication network is determined, the similarity between the characteristics of the malicious intrusion data is determined using the Markov distance, and the malicious intrusion data mining of the wireless personal communication network supported by legal big data is completed. The experimental results show that the accuracy of mining malicious data is high and the mining time is short.


Distributed Correlation-Based Clustering Mechanism for Large-Scale Datasets

10.20944/preprints202011.0010.v1 ◽

2020 ◽

Author(s):

Kuei-Sheng Lee ◽

Meng-Feng Tsai ◽

Chi-Sheng Huang

Keyword(s):

Machine Learning ◽

Cluster Analysis ◽

Execution Time ◽

Large Scale ◽

Comprehensive Analysis ◽

Ease Of Use ◽

Practical Application ◽

Processing Data ◽

Data Points ◽

Best Parameters

In the field of machine learning, cluster analysis has always been a very important technology for determining useful or implicit characteristics in data. However, the current mainstream cluster analysis algorithms require comprehensive analysis of the overall data to obtain the best parameters for the algorithm. As a result, handling large-scale datasets is difficult. This research proposes a distributed correlation-based clustering mechanism for unsupervised learning, which assumes that if adjacent data are similar, a group can be formed by relating to more data points. Therefore, when processing data, large-scale datasets can be distributed to multiple computers, and the correlation of any two datasets in each computer can be calculated simultaneously. The results are then aggregated and filtered before being assembled into groups. This method would greatly reduce the pre-processing and execution time of the dataset; in practical application, it only needs to focus on how the relevance of the data is designed. In addition, the experimental results show the accuracy, applicability, and ease of use of this method.


Analysis of University Students’ Behavior Based on a Fusion K-Means Clustering Algorithm

Applied Sciences ◽

10.3390/app10186566 ◽

2020 ◽

Vol 10 (18) ◽

pp. 6566

Author(s):

Wenbing Chang ◽

Xinpeng Ji ◽

Yinglai Liu ◽

Yiyong Xiao ◽

Bang Chen ◽

...

Keyword(s):

Data Mining ◽

Large Scale ◽

Clustering Algorithm ◽

Learning Performance ◽

Practical Significance ◽

Digital Campus ◽

Data Mining Algorithms ◽

Density Peaks ◽

Data Points ◽

The University

With the development of big data technology, creating the ‘Digital Campus’ is a hot issue. Traditional data mining algorithms are not suitable for an increasing amount of data. The clustering algorithm is becoming more and more important in the field of data mining, but the traditional clustering algorithm does not take clustering efficiency and clustering effect into consideration. In this paper, an algorithm based on K-Means and clustering by fast search and find of density peaks (K-CFSFDP) is proposed, which improves on the distance and density of data points. This method is used to cluster students from four universities. The experiment shows that the K-CFSFDP algorithm has better clustering results and running efficiency than the traditional K-Means clustering algorithm, and that it performs well on large-scale campus data. Additionally, the results of the cluster analysis show that students of different categories in the four universities differed in living habits and learning performance, so the university can learn about the behavior of different categories of students and provide corresponding personalized services, which has practical significance.
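For readers unfamiliar with clustering by fast search and find of density peaks (CFSFDP), the sketch below computes its two basic per-point quantities: the local density rho (number of neighbours within a cutoff distance) and delta (distance to the nearest point of higher density). It does not reproduce the K-CFSFDP fusion proposed in the paper; the function name `density_peak_scores`, the cutoff value, and the toy points are assumptions for illustration.

```python
import math


def density_peak_scores(points, d_c):
    """Return (rho, delta) for each point, as used by density-peak clustering.

    rho[i]   = number of other points closer than the cutoff d_c
    delta[i] = distance to the nearest point with strictly higher rho,
               or the largest distance from i if no such point exists
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [
        sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
        for i in range(n)
    ]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta


# Hypothetical data: two square groups, each with a point at its middle.
group_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
group_b = [(x + 10.0, y + 10.0) for x, y in group_a]
points = group_a + group_b

rho, delta = density_peak_scores(points, d_c=0.8)
# Points with both high rho and high delta are candidate cluster centers.
gamma = [r * d for r, d in zip(rho, delta)]
top_two = sorted(range(len(points)), key=lambda i: gamma[i], reverse=True)[:2]
```

Here the two middle points stand out because they are dense and far from any denser point, which is the criterion density-peak methods use to pick centers.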


A Study on Improved Eclat Data Mining Algorithm

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.328-330.1896 ◽

2011 ◽

Vol 328-330 ◽

pp. 1896-1899 ◽

Cited By ~ 1

Author(s):

Zhi Fang Li ◽

Xiu Fang Liu ◽

Xu Cao

Keyword(s):

Data Mining ◽

Data Mining Algorithm ◽

Advantages And Disadvantages ◽

Mining Algorithm

An introduction to the Apriori and FP-growth algorithms is given, and their advantages and disadvantages are pointed out. Transaction databases for rule mining have two common formats, horizontal layout and vertical layout. Algorithms using the vertical database layout are often superior to those using the horizontal layout. A new Eclat algorithm is proposed, which improves on Eclat and shows good performance on many datasets.
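The difference between the horizontal and vertical layouts can be illustrated with a small sketch of the basic Eclat idea: each item is mapped to the set of transaction IDs containing it, and the support of an itemset is the size of the intersection of those sets. This shows plain Eclat only, not the improved variant the paper proposes; the function names and the toy transactions are assumptions for illustration.

```python
def to_vertical(transactions):
    """Convert a horizontal layout (list of item sets) to item -> set of IDs."""
    vertical = {}
    for tid, items in enumerate(transactions):
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical


def eclat(prefix, items, min_support, out):
    """Depth-first search for frequent itemsets via tidset intersection.

    items is a list of (item, tidset) pairs already restricted to `prefix`.
    """
    for i, (item, tids) in enumerate(items):
        if len(tids) < min_support:
            continue
        itemset = prefix + (item,)
        out[itemset] = len(tids)  # support = number of matching transactions
        # Extend the itemset with each later item by intersecting tidsets.
        suffix = [(other, tids & other_tids) for other, other_tids in items[i + 1:]]
        eclat(itemset, suffix, min_support, out)
    return out


# Hypothetical horizontal database of four transactions.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
vertical = to_vertical(transactions)
frequent = eclat((), sorted(vertical.items()), min_support=2, out={})
```

Because support counting reduces to set intersection, no repeated scan of the full transaction list is needed, which is one common explanation for the advantage of the vertical layout noted in the abstract.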


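Penerapan Algoritma K-Means Clustering untuk Melihat Hubungan Kegiatan Tahfiz dengan Hasil Belajar (Application of the K-Means Clustering Algorithm to Examine the Relationship Between Tahfiz Activities and Learning Outcomes)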

Jurnal Sistim Informasi dan Teknologi ◽

10.37034/jsisfotek.v2i2.20 ◽

2020 ◽

pp. 41-47

Author(s):

Asri Hidayad ◽

Sarjon Defit ◽

S Sumijan

Keyword(s):

Data Mining ◽

Learning Outcomes ◽

Clustering Algorithm ◽

Student Learning Outcomes ◽

Training Data ◽

Data Mining Algorithm ◽

Data Mining Technique ◽

Grouping Method ◽

Student Grades ◽

And Training

The purpose of this study is to evaluate whether Tahfiz activities and learning outcomes are effective or not. The data processed in this study were data on tahfiz activities and student learning outcomes for class XI (eleven), totaling 42 records sourced from tahfiz memorization, tahfiz grades, and student grades at Madrasah Aliyah Negeri 1 Bukittinggi. The data are grouped using one of the data mining algorithms, K-Means Clustering. The K-Means Clustering algorithm works by grouping; in this data mining technique, the data consist of testing data and training data, with the amount of tahfiz memorization, the tahfiz grade, and learning outcomes as inputs. With the results of this study, the school can determine how strongly tahfiz activities influence student grades.
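The grouping step that K-Means performs can be sketched as below. This is a generic, minimal version of the standard assign-and-update loop, not the study's actual procedure or data: the function name `kmeans`, the choice of the first k points as initial centroids, and the toy one-dimensional values are assumptions made for illustration.

```python
import math


def kmeans(points, k, max_iter=100):
    """Minimal K-Means: alternate assignment and centroid update steps.

    Assumption: the first k points seed the centroids (deterministic,
    chosen here only to keep the example reproducible).
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Toy one-dimensional values, not taken from the study.
values = [(1.0,), (2.0,), (10.0,), (11.0,)]
labels, centroids = kmeans(values, k=2)
```

In practice the result depends on the initial centroids, so several random restarts are commonly used instead of the fixed seeding chosen here.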
