Dimensional Reduction of Data for Anomaly Detection and Speed Performance using PCA and DBSCAN

Anomaly detection is a major problem facing many industries, including network intrusion detection and the medical sciences. Fields such as astronomy and scientific research also struggle to find effective anomaly detection methods, and several techniques have been applied to the problem. Clustering is one technique that many researchers have employed, and the most commonly used clustering algorithm is DBSCAN (density-based spatial clustering of applications with noise), which is well known in data mining and machine learning. Because DBSCAN is computationally expensive, the dimensionality of the data points must be reduced first. Principal component analysis (PCA) is used to reduce the dimensionality and produce a new data set, which is then clustered with DBSCAN. The test results were accurate, so the methodology can be adopted. The combination of PCA and DBSCAN was validated, and the resulting analysis shows a 25% speedup with 80% quality when the dimensionality of the data set is reduced by half.
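
The PCA-then-DBSCAN idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy data, the `eps` and `min_pts` values, and the helper names `project_onto_first_pc` and `dbscan` are all assumptions made for illustration. Two-dimensional points are projected onto their first principal axis (halving the dimensionality), and points that DBSCAN labels as noise are treated as anomalies.

```python
import math

def project_onto_first_pc(rows, iters=200):
    """Project rows onto their leading principal axis (found by power iteration)."""
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(dims)]
    centred = [[r[j] - means[j] for j in range(dims)] for r in rows]
    # Sample covariance matrix, dims x dims.
    cov = [[sum(c[a] * c[b] for c in centred) / (len(rows) - 1)
            for b in range(dims)] for a in range(dims)]
    vec = [1.0] * dims  # arbitrary start; assumed not orthogonal to the axis
    for _ in range(iters):
        nxt = [sum(cov[a][b] * vec[b] for b in range(dims)) for a in range(dims)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    # Each point becomes a 1-tuple: its coordinate along the principal axis.
    return [(sum(c[j] * vec[j] for j in range(dims)),) for c in centred]

def dbscan(points, eps, min_pts):
    """Brute-force DBSCAN; returns one label per point, -1 meaning noise."""
    n = len(points)
    neighbours = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                  for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # only core points expand the cluster
    return labels

# Hypothetical toy data: two dense groups and one isolated point.
rows = [(0.0, 0.1), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2),
        (5.0, 5.1), (5.2, 5.0), (5.1, 5.3), (5.3, 5.2),
        (10.0, 10.0)]
reduced = project_onto_first_pc(rows)          # 2 dimensions -> 1 dimension
labels = dbscan(reduced, eps=0.5, min_pts=3)
anomalies = [rows[i] for i, lab in enumerate(labels) if lab == -1]
print(labels, anomalies)
```

The brute-force neighbour search is quadratic in the number of points, so any speed gain from reducing dimensionality shows up in the cost of each distance computation rather than in the number of comparisons.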

Massively scalable density based clustering (DBSCAN) on the HPCC systems big data platform

IAES International Journal of Artificial Intelligence (IJ-AI) ◽

10.11591/ijai.v10.i1.pp207-214 ◽

2021 ◽

Vol 10 (1) ◽

pp. 207

Author(s):

Yatish H. R. ◽

Shubham Milind Phal ◽

Tanmay Sanjay Hukkeri ◽

Lili Xu ◽

Shobha G ◽

...

Keyword(s):

Clustering Algorithm ◽

Spatial Clustering ◽

Computation Time ◽

Large Data ◽

Single Node ◽

Data Set ◽

Traffic Pattern ◽

Density Based Clustering ◽

Data Points ◽

Hpcc Systems

Dealing with large samples of unlabeled data is a key challenge in today’s world, especially in applications such as traffic pattern analysis and disaster management. DBSCAN, or density-based spatial clustering of applications with noise, is a well-known density-based clustering algorithm. Its key strengths lie in its ability to detect outliers and handle arbitrarily shaped clusters. However, the algorithm is fundamentally sequential and therefore proves expensive and time-consuming when run on very large data sets. This paper thus presents a novel implementation of a parallel and distributed DBSCAN algorithm on the HPCC Systems platform. The implementation seeks to be fully parallel by making use of the distributed architecture of HPCC Systems and performing a tree-based union to merge local clusters. The proposed approach was tested on both synthetic and standard datasets (MFCCs Data Set) and found to be completely accurate. Additionally, when compared against a single-node setup, a significant decrease in computation time was observed with no impact on accuracy. The parallelized algorithm performed eight times better for larger numbers of data points and took exponentially less time as the number of data points increased.
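
The step of merging local clusters by a tree-based union can be sketched with a union-find (disjoint-set) structure. This is only an illustration under stated assumptions, not the HPCC Systems implementation: the `(node, local_id)` naming of local clusters and the `links` list (pairs of local clusters assumed to share a boundary point) are hypothetical.

```python
def find(parent, x):
    """Return the root of x, shortening the path as it goes."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(parent, a, b):
    """Attach the tree containing b to the tree containing a."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra

def merge_local_clusters(local_ids, links):
    """Map each local cluster to a global cluster id."""
    parent = {c: c for c in local_ids}
    for a, b in links:
        union(parent, a, b)
    roots = {}  # root -> global id, numbered in order of first appearance
    return {c: roots.setdefault(find(parent, c), len(roots)) for c in local_ids}

# Hypothetical output of three nodes that each ran DBSCAN on their own partition.
local_ids = [("n0", 0), ("n0", 1), ("n1", 0), ("n2", 0)]
# Assumed links: these pairs of local clusters share at least one boundary point.
links = [(("n0", 1), ("n1", 0)), (("n1", 0), ("n2", 0))]
global_ids = merge_local_clusters(local_ids, links)
print(global_ids)
```

Because union-find only records which local clusters belong together, the merge is independent of the order in which nodes report their links.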

The Application of Fuzzy Clustering Number Algorithm in Network Intrusion Detection

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.760-762.2220 ◽

2013 ◽

Vol 760-762 ◽

pp. 2220-2223

Author(s):

Lang Guo

Keyword(s):

Intrusion Detection ◽

Fuzzy Clustering ◽

Clustering Algorithm ◽

Local Optimum ◽

Cluster Number ◽

Data Set ◽

Network Intrusion ◽

Correlation Degree ◽

Indicator Data ◽

Detection Effect

In view of the defects of the K-means algorithm in intrusion detection, namely the need to preassign the number of clusters, sensitivity to the initial centers, and a tendency to fall into local optima, this paper puts forward a fuzzy clustering algorithm. Fuzzy rules are used to express the intrusion features, and a standardized matrix is then processed further to reflect the degree of approximation or correlation between the intrusion indicator data and to establish a similarity matrix. Simulation results on the KDD CUP 1999 data set show that the algorithm has a better intrusion detection effect and can effectively detect network intrusion data.
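
How a standardized matrix leads to a similarity matrix can be shown with a short sketch. The abstract does not state which standardization or similarity measure the paper uses, so the min-max scaling and the absolute-difference similarity below are assumptions, as are the function names and the sample records.

```python
def standardise(rows):
    """Min-max scale every indicator (column) to the range [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in rows]

def similarity_matrix(rows):
    """r[i][j] = 1 - mean absolute difference of the standardised indicators."""
    z = standardise(rows)
    m = len(z[0])
    return [[1.0 - sum(abs(a - b) for a, b in zip(zi, zj)) / m for zj in z]
            for zi in z]

# Hypothetical indicator data: two similar records and one very different one.
records = [(10.0, 200.0), (12.0, 210.0), (90.0, 900.0)]
sim = similarity_matrix(records)
for row in sim:
    print([round(v, 3) for v in row])
```

Every entry lies in [0, 1], the diagonal is 1, and the matrix is symmetric, which are the properties a fuzzy similarity relation needs.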

DBSCANI: Noise-Resistant Method for Missing Value Imputation

Journal of Intelligent Systems ◽

10.1515/jisys-2014-0172 ◽

2016 ◽

Vol 25 (3) ◽

pp. 431-440 ◽

Cited By ~ 1

Author(s):

Archana Purwar ◽

Sandeep Kumar Singh

Keyword(s):

Spatial Data ◽

Missing Values ◽

Clustering Algorithm ◽

Spatial Clustering ◽

Data Sets ◽

Quality Of Data ◽

Data Set ◽

Dbscan Clustering ◽

Density Based Clustering

Abstract: Data quality is an important concern in data mining. The validity of mining algorithms is reduced if the data is not of good quality. The quality of data can be assessed in terms of missing values (MV) as well as noise present in the data set. Various imputation techniques have been studied in MV research, but little attention has been given to noise in earlier work. Moreover, to the best of our knowledge, no one has used density-based spatial clustering of applications with noise (DBSCAN) for MV imputation. This paper proposes a novel technique, density-based imputation (DBSCANI), built on density-based clustering to deal with incomplete values in the presence of noise. The density-based clustering algorithm proposed by Kriegel groups objects according to their density in spatial databases. The high-density regions are known as clusters, and the low-density regions correspond to the noise objects in the data set. Extensive experiments have been performed on the Iris data set from the life science domain and Jain’s (2D) data set from the shape data sets. The performance of the proposed method is evaluated using root mean square error (RMSE) and compared with existing K-means imputation (KMI). Results show that the method is more noise resistant than KMI on the data sets studied.
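
The general idea of cluster-based imputation evaluated with RMSE can be sketched as follows. This is not the DBSCANI algorithm itself, whose details the abstract does not give: the cluster labels are assumed to be already available (with -1 marking noise), and the helper names and sample data are hypothetical. A missing value is filled with the mean of that attribute over the non-noise records in the same cluster.

```python
import math

def impute_by_cluster(rows, labels):
    """Fill None entries with the attribute mean of same-cluster, non-noise rows."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = [r[j] for r, lab in zip(rows, labels)
                      if lab == labels[i] and lab != -1 and r[j] is not None]
            if donors:  # noise rows have no donors and stay unfilled
                filled[i][j] = sum(donors) / len(donors)
    return filled

def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(predicted))

# Hypothetical records; None marks a missing value, label -1 marks noise.
rows = [(1.0, 2.0), (1.2, None), (1.1, 2.2), (8.0, 9.0), (50.0, None)]
labels = [0, 0, 0, 1, -1]
filled = impute_by_cluster(rows, labels)
error = rmse([filled[1][1]], [2.3])  # 2.3 is an assumed ground-truth value
print(filled, error)
```

Excluding noise records from the donor pool is what keeps outliers from distorting the imputed values in this sketch.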

SPSM: A NEW HYBRID DATA CLUSTERING ALGORITHM FOR NONLINEAR DATA ANALYSIS

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001409007685 ◽

2009 ◽

Vol 23 (08) ◽

pp. 1701-1737 ◽

Cited By ~ 3

Author(s):

UREERAT WATTANACHON ◽

CHIDCHANOK LURSINSAP

Keyword(s):

Clustering Algorithm ◽

Color Image ◽

Clustering Algorithms ◽

Noisy Data ◽

Second Phase ◽

Data Sets ◽

Data Set ◽

Cluster Distance ◽

Data Points ◽

Hybrid Data

Existing clustering algorithms, such as single-link clustering, k-means, CURE, and CSM, are designed to find clusters based on predefined parameters specified by users. These algorithms may be unsuccessful if the choice of parameters is inappropriate for the data set being clustered. Most of these algorithms work very well for compact and hyper-spherical clusters. In this paper, a new hybrid clustering algorithm called Self-Partition and Self-Merging (SPSM) is proposed. The SPSM algorithm partitions the input data set into several subclusters in the first phase and then removes the noisy data in the second phase. In the third phase, the normal subclusters are repeatedly merged to form larger clusters based on inter-cluster and intra-cluster distance criteria. The experimental results show that the SPSM algorithm handles noisy data sets very efficiently and clusters data sets of arbitrary shape and differing density. Several color image examples show the versatility of the proposed method and are compared with results described in the literature for the same images. The computational complexity of the SPSM algorithm is O(N^2), where N is the number of data points.

STUDY ON ADAPTIVE PARAMETER DETERMINATION OF CLUSTER ANALYSIS IN URBAN MANAGEMENT CASES

ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences ◽

10.5194/isprs-archives-xlii-2-w7-1143-2017 ◽

2017 ◽

Vol XLII-2/W7 ◽

pp. 1143-1150 ◽

Cited By ~ 2

Author(s):

J. Y. Fu ◽

C. F. Jing ◽

M. Y. Du ◽

Y. L. Fu ◽

P. P. Dai

Keyword(s):

Clustering Algorithm ◽

Spatial Clustering ◽

Parameter Determination ◽

Urban Management ◽

Full Account ◽

Data Set ◽

K Value ◽

Global Parameter ◽

Parameter Adaptive

Fine-grained city management is an important way to realize the smart city. Data mining that applies spatial clustering analysis to urban management cases can be used to evaluate the deployment of urban public facilities, support policy decisions, and provide technical support for fine-grained city management. Because the density-based DBSCAN algorithm cannot determine its parameters adaptively, this paper proposes an optimized method for adaptive parameter determination based on spatial analysis. First, Ripley's K function is analyzed for the data set to adaptively determine the global parameter Eps, which means setting the maximum aggregation scale as the range of data clustering. Then, using a K-D tree, the most frequent K value of every point object within the range Eps is calculated and set as the clustering density, which adaptively determines the global parameter MinPts. The R language was used to optimize this process and accomplish precise clustering of typical urban management cases. Experimental results based on typical urban management cases in the XiCheng district of Beijing show that the new DBSCAN clustering algorithm presented in this paper takes full account of the spatial and statistical characteristics of data with obvious clustering features, and has better applicability and high quality. The results of the study are helpful not only for the formulation of urban management policies and the allocation of urban management supervisors in the XiCheng district of Beijing, but also for other cities and related fields.
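
The use of Ripley's K function to find an aggregation scale can be sketched as below. The estimator is the naive one without edge correction, K(r) = A / (n(n-1)) × (number of ordered pairs within distance r), and L(r) = sqrt(K(r) / pi). Choosing the radius that maximizes L(r) - r as the candidate range is an assumption about how "maximum aggregation scale" is defined, not a detail taken from the paper; the points, study area and candidate radii are hypothetical.

```python
import math

def ripley_k(points, r, area):
    """Naive Ripley's K estimate (no edge correction)."""
    n = len(points)
    pairs = sum(1 for i in range(n) for j in range(n)
                if i != j and math.dist(points[i], points[j]) <= r)
    return area * pairs / (n * (n - 1))

def aggregation_scale(points, radii, area):
    """Radius with the largest excess of L(r) over r, i.e. the strongest clustering."""
    def excess(r):
        return math.sqrt(ripley_k(points, r, area) / math.pi) - r
    return max(radii, key=excess)

# Hypothetical cases: two tight clumps inside a 10 x 10 study area.
points = [(1.0, 1.0), (1.2, 1.1), (0.9, 1.3), (1.1, 0.8),
          (8.0, 8.0), (8.2, 8.1), (7.9, 8.3), (8.1, 7.8)]
radii = [0.25, 0.5, 1.0, 2.0, 4.0]
eps_candidate = aggregation_scale(points, radii, area=100.0)
print(eps_candidate)
```

Under complete spatial randomness K(r) is close to pi × r^2, so L(r) - r stays near zero; a clearly positive value indicates clustering at scale r.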

A dynamic K-means clustering for data mining

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v13.i2.pp521-526 ◽

2019 ◽

Vol 13 (2) ◽

pp. 521

Author(s):

Md. Zakir Hossain ◽

Md.Nasim Akhtar ◽

R.B. Ahmad ◽

Mostafijur Rahman

Keyword(s):

Data Mining ◽

Clustering Algorithm ◽

Large Data ◽

Threshold Value ◽

Specific Pattern ◽

Large Data Sets ◽

Data Sets ◽

Data Set ◽

Number Of Clusters ◽

Data Points

Data mining is the process of finding structure in large data sets. With this process, decision makers can make decisions that further the solution of real-world problems. Several data clustering techniques are used in data mining to find specific patterns in data. The K-means method is one of the familiar techniques for clustering large data sets. The K-means clustering method partitions the data set on the assumption that the number of clusters is fixed. The main problem with this method is that if the number of clusters is chosen too small, there is a higher probability of placing dissimilar items in the same group. On the other hand, if the number of clusters is chosen too high, there is a higher chance of placing similar items in different groups. In this paper, we address this issue by proposing a new K-means clustering algorithm that performs data clustering dynamically. The proposed method initially calculates a threshold value as a centroid of K-means, and the clusters are formed based on this value. At each iteration of K-means, if the Euclidean distance between two points is less than or equal to the threshold value, the two data points are placed in the same group. Otherwise, the proposed method creates a new cluster with the dissimilar data point. The results show that the proposed method outperforms the original K-means method.
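
The threshold rule described in this abstract can be sketched in a few lines. This is a simplified, single-pass illustration rather than the authors' algorithm: how the threshold is calculated is not given in the abstract, so it is passed in as a parameter, and comparing each point with the nearest cluster centroid is an assumption.

```python
import math

def threshold_clusters(points, threshold):
    """Assign each point to the nearest centroid within `threshold`, else open a new cluster."""
    centroids, members = [], []
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda c: math.dist(p, centroids[c]),
                      default=None)  # None while no cluster exists yet
        if nearest is not None and math.dist(p, centroids[nearest]) <= threshold:
            members[nearest].append(p)
            # Recompute the centroid as the mean of the cluster's members.
            centroids[nearest] = tuple(sum(axis) / len(members[nearest])
                                       for axis in zip(*members[nearest]))
        else:
            centroids.append(tuple(p))
            members.append([p])
    return members

# Hypothetical data and threshold.
points = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.5, 10.0), (0.0, 0.5)]
clusters = threshold_clusters(points, threshold=2.0)
print(len(clusters), clusters)
```

The number of clusters is not supplied in advance; it emerges from the threshold, which is the property the abstract calls dynamic clustering.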

Recognition and labeling of faults in wind turbines with a density-based clustering algorithm

Data Technologies and Applications ◽

10.1108/dta-09-2020-0223 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Shuai Luo ◽

Hongwei Liu ◽

Ershi Qi

Keyword(s):

Wind Turbines ◽

Clustering Algorithm ◽

Support Vector ◽

Scanning Strategy ◽

Data Set ◽

Content Type ◽

Vibration Data ◽

Density Based Clustering ◽

Extreme Gradient Boosting ◽

Data Points

Purpose: The purpose of this paper is to recognize and label the faults in wind turbines with a new density-based clustering algorithm, named the contour density scanning clustering (CDSC) algorithm. Design/methodology/approach: The algorithm includes four components: (1) computation of neighborhood density, (2) selection of core and noise data, (3) scanning of core data and (4) updating of clusters. The proposed algorithm considers the relationship between neighborhood data points according to a contour density scanning strategy. Findings: The first experiment is conducted with artificial data to validate that the proposed CDSC algorithm is suitable for handling data points with arbitrary shapes. The second experiment, with industrial gearbox vibration data, is carried out to demonstrate the time complexity and accuracy of the proposed CDSC algorithm in comparison with other conventional clustering algorithms, including k-means, density-based spatial clustering of applications with noise, density peaking clustering, neighborhood grid clustering, support vector clustering, random forest, core fusion-based density peak clustering, AdaBoost and extreme gradient boosting. The third experiment is conducted with an industrial bearing vibration data set to highlight that the CDSC algorithm can automatically track the emerging fault patterns of bearings in wind turbines over time. Originality/value: Data points with different densities are clustered using three strategies: direct density reachability, density reachability and density connectivity. A contour density scanning strategy is proposed to determine whether data points with the same density belong to one cluster. The proposed CDSC algorithm achieves automatic clustering, which means that the trends of the fault pattern can be tracked.
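
The first two components, computing neighborhood density and selecting core and noise data, can be sketched using the standard DBSCAN definitions. The contour density scanning strategy itself is not described in enough detail to reproduce, so this is only the conventional core/border/noise classification; `eps`, `min_pts` and the sample points are assumptions.

```python
import math

def classify(points, eps, min_pts):
    """Label each point 'core', 'border' or 'noise' by its eps-neighbourhood density."""
    n = len(points)
    # A point's neighbourhood includes the point itself.
    neighbours = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                  for i in range(n)]
    core = {i for i in range(n) if len(neighbours[i]) >= min_pts}
    kinds = []
    for i in range(n):
        if i in core:
            kinds.append("core")
        elif any(j in core for j in neighbours[i]):
            kinds.append("border")  # directly density-reachable from a core point
        else:
            kinds.append("noise")
    return kinds

# Hypothetical data: a unit square, one point at its edge, one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.2, 1.0), (9.0, 9.0)]
kinds = classify(points, eps=1.5, min_pts=4)
print(kinds)
```

A border point is directly density-reachable from a core point but cannot extend a cluster itself, which is why only core points are scanned further in density-based methods.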

Correlative Density-Based Clustering

Journal of Computational and Theoretical Nanoscience ◽

10.1166/jctn.2016.5650 ◽

2016 ◽

Vol 13 (10) ◽

pp. 6935-6943 ◽

Cited By ~ 1

Author(s):

Jia-Lin Hua ◽

Jian Yu ◽

Miin-Shen Yang

Keyword(s):

Correlation Analysis ◽

Clustering Algorithm ◽

Clustering Methods ◽

Data Set ◽

Density Based Clustering ◽

Inherent Structure ◽

Data Points ◽

Artificial Datasets

Mountains, heaped up from the densities of a data set, intuitively reflect the structure of the data points, and mountain clustering methods are useful for grouping data points. However, previous mountain-based clustering suffers from the choice of the parameters used to compute the density. In this paper, we adopt correlation analysis to determine the density and propose a new clustering algorithm called Correlative Density-based Clustering (CDC). The new algorithm computes the density in a modified way and determines the parameters based on the inherent structure of the data points. Experiments on artificial and real datasets demonstrate the simplicity and effectiveness of the proposed approach.

Preprocessing Method for Encrypted Traffic Based on Semisupervised Clustering

Security and Communication Networks ◽

10.1155/2020/8824659 ◽

2020 ◽

Vol 2020 ◽

pp. 1-13

Author(s):

Rongfeng Zheng ◽

Jiayong Liu ◽

Weina Niu ◽

Liang Liu ◽

Kai Li ◽

...

Keyword(s):

Network Traffic ◽

Clustering Algorithm ◽

Network Flows ◽

Spatial Clustering ◽

Clustering Algorithms ◽

Communication Channels ◽

Transport Layer ◽

Clustering Model ◽

Network Intrusion ◽

Semisupervised Clustering

The explosive growth in network traffic in recent times has resulted in increased processing pressure on network intrusion detection systems. In addition, there is a lack of reliable methods for preprocessing network traffic generated by benign applications that do not steal users’ data from their devices. To alleviate these problems, this study analyzed the differences between benign and malicious traffic produced by benign applications and malware, respectively. To fully express these differences, this study proposed a new set of statistical features for training a clustering model. Furthermore, to mine the communication channels generated by benign applications in batches, a semisupervised clustering method was adopted. Using a small number of labeled samples, our method aggregated historical network traffic into two types of clusters. The cluster that did not contain labeled malicious samples was regarded as a benign traffic cluster. The experimental results were compared using four types of clustering algorithms. The density-based spatial clustering of applications with noise (DBSCAN) clustering algorithm was selected to mine benign communication channels. We also compared our method with two other methods, and the results demonstrated that the benign channels mined through our method were more reliable. Finally, using our method, 1,811 benign transport layer security (TLS) channels were mined from 18,357 TLS communication channels. The number of flows carried by these benign channels comprised 65.37% of the entire network flows, and no malicious flow was included in our results, which proves the effectiveness of our method.
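
The rule that a cluster containing no labeled malicious samples is regarded as benign can be sketched as follows. The channel names, cluster assignments and the decision to exclude noise (label -1) from the benign set are assumptions for illustration, not details from the paper.

```python
def benign_clusters(cluster_of, malicious_ids):
    """Return the cluster ids that contain no labeled malicious sample."""
    tainted = {cluster_of[s] for s in malicious_ids}
    # Noise (-1) is never treated as a benign cluster in this sketch.
    return {c for c in cluster_of.values() if c != -1 and c not in tainted}

# Hypothetical clustering result: channel id -> cluster label.
cluster_of = {"ch1": 0, "ch2": 0, "ch3": 1, "ch4": 1, "ch5": 2, "ch6": -1}
malicious_ids = {"ch3"}  # the small set of labeled malicious samples

benign = benign_clusters(cluster_of, malicious_ids)
benign_channels = sorted(ch for ch, c in cluster_of.items() if c in benign)
print(benign, benign_channels)
```

Only a few labeled samples are needed, because one malicious label is enough to exclude its whole cluster from the benign set.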

Study on Fuzzy Clustering Algorithm of Spatial Data Mining

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.416-417.1244 ◽

2013 ◽

Vol 416-417 ◽

pp. 1244-1250

Author(s):

Ting Ting Zhao

Keyword(s):

Data Mining ◽

Fuzzy Clustering ◽

Spatial Data ◽

Clustering Algorithm ◽

Spatial Clustering ◽

Rapid Development ◽

Spatial Database ◽

Spatial Data Mining ◽

Data Set ◽

Fuzzy Similarity

With the rapid development of spatial information acquisition technology, the variety of spatial databases and the volume of data they hold keep increasing. How to extract valuable information from complicated spatial data has become an urgent issue, and spatial data mining provides a new way of solving the problem. The paper introduces fuzzy clustering into the field of spatial data clustering, studies how fuzzy set theory can be applied to spatial data mining, and proposes a spatial clustering algorithm based on a fuzzy similarity matrix, the fuzzy similarity clustering algorithm. The algorithm not only overcomes the inability of fuzzy clustering to process large data sets, but also provides a similarity measurement between objects.
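
Clustering from a fuzzy similarity matrix is classically done by taking the max-min transitive closure of the matrix and then cutting it at a level lambda. The sketch below shows that textbook procedure; whether the paper's algorithm follows it is an assumption, and the matrix values and the cut level are hypothetical.

```python
def maxmin_compose(a, b):
    """Max-min composition of two square fuzzy relations."""
    n = len(a)
    return [[max(min(a[i][k], b[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]

def transitive_closure(r):
    """Square the relation repeatedly until it stops changing."""
    while True:
        nxt = maxmin_compose(r, r)
        if nxt == r:
            return r
        r = nxt

def lambda_cut_clusters(r, lam):
    """Group objects whose closure similarity is at least `lam`."""
    closure = transitive_closure(r)
    clusters, seen = [], set()
    for i in range(len(closure)):
        if i in seen:
            continue
        group = [j for j in range(len(closure)) if closure[i][j] >= lam]
        seen.update(group)
        clusters.append(group)
    return clusters

# Hypothetical fuzzy similarity matrix (reflexive and symmetric).
similarity = [[1.0, 0.8, 0.2, 0.1],
              [0.8, 1.0, 0.3, 0.1],
              [0.2, 0.3, 1.0, 0.9],
              [0.1, 0.1, 0.9, 1.0]]
clusters = lambda_cut_clusters(similarity, lam=0.5)
print(clusters)
```

Lowering lambda merges clusters and raising it splits them, so a single closure matrix yields a whole hierarchy of partitions.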
