Density-based classification with the DENCLUE algorithm

Author(s):  
Mouhcine El Hassani ◽  
Noureddine Falih ◽  
Belaid Bouikhalene

Classification of information is a vague and difficult-to-explore area of research, hence the emergence of grouping techniques, often referred to as clustering. It is necessary to differentiate between unsupervised and supervised classification. Clustering methods are numerous. Data partitioning and hierarchization lead to their use in parametric or non-parametric form. Their use is also influenced by algorithms of a probabilistic nature during the partitioning of data. The choice of a method depends on the desired clustering result. This work focuses on classification using the density-based spatial clustering of applications with noise (DBSCAN) and DENsity-based CLUstEring (DENCLUE) algorithms through an application written in C#. Using three data sets, namely the Iris data set, the Breast Cancer Wisconsin (Diagnostic) data set and the Bank Marketing data set, we show experimentally that the choice of the initial data parameters is important for accelerating processing and can minimize the number of iterations, reducing the execution time of the application.
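To make the role of the initial parameters concrete, here is a minimal Python sketch of a DENCLUE-style hill climb on a Gaussian kernel density estimate. It is an illustration only, not the C# application described in the abstract: the fixed-point update (a kernel-weighted mean, as used in DENCLUE 2.0), the toy points, and the values of `sigma` and `tol` are all assumptions. The smoothing parameter `sigma` and the tolerance `tol` determine how many iterations the climb needs.

```python
import math


def gaussian_density(x, points, sigma):
    """Unnormalised Gaussian kernel density estimate at position x."""
    return sum(
        math.exp(-sum((a - b) ** 2 for a, b in zip(x, p)) / (2 * sigma ** 2))
        for p in points
    )


def hill_climb(x, points, sigma, tol=1e-6, max_iter=1000):
    """Move x uphill to a density attractor; return (attractor, iterations)."""
    x = tuple(x)
    for iteration in range(1, max_iter + 1):
        # Kernel weight of every data point as seen from the current position.
        weights = [
            math.exp(-sum((a - b) ** 2 for a, b in zip(x, p)) / (2 * sigma ** 2))
            for p in points
        ]
        total = sum(weights)
        # The next position is the kernel-weighted mean of the data points.
        new_x = tuple(
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(len(x))
        )
        if math.dist(x, new_x) < tol:
            return new_x, iteration
        x = new_x
    return x, max_iter


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
start = (0.2, 0.1)
attractor, steps = hill_climb(start, points, sigma=0.5)
```

Points whose climbs end at (nearly) the same attractor would be assigned to the same cluster; a larger `sigma` smooths the density and tends to merge nearby attractors.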

2016 ◽  
Vol 25 (3) ◽  
pp. 431-440 ◽  
Author(s):  
Archana Purwar ◽  
Sandeep Kumar Singh

Ensuring the quality of data is an important task in data mining. The validity of mining algorithms is reduced if the data is not of good quality. The quality of data can be assessed in terms of missing values (MV) as well as noise present in the data set. Various imputation techniques have been studied in MV research, but little attention has been given to noise in earlier work. Moreover, to the best of our knowledge, no one has used density-based spatial clustering of applications with noise (DBSCAN) for MV imputation. This paper proposes a novel technique, density-based imputation (DBSCANI), built on density-based clustering to deal with incomplete values in the presence of noise. The density-based clustering algorithm proposed by Kriegel groups objects according to their density in spatial databases. The high-density regions are known as clusters, and the low-density regions correspond to the noise objects in the data set. Extensive experiments have been performed on the Iris data set from the life science domain and Jain’s (2D) data set from the shape data sets. The performance of the proposed method is evaluated using root mean square error (RMSE) and compared with the existing K-means imputation (KMI). Results show that our method is more noise resistant than KMI on the data sets studied.
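The evaluation idea can be sketched in a few lines of Python. This is not the DBSCANI procedure itself, whose details the abstract does not give: the sketch assumes cluster labels are already available (with `-1` marking noise, a common DBSCAN convention), fills each missing value with the mean of that attribute within the same cluster, and scores the result with RMSE. The function names and the toy data are hypothetical.

```python
import math


def impute_by_cluster_mean(rows, labels):
    """Fill None entries with the column mean of the row's own cluster."""
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Observed values of column j among rows sharing this row's label.
            peers = [
                other[j]
                for other, label in zip(rows, labels)
                if label == labels[i] and other[j] is not None
            ]
            if not peers:
                # Fallback: mean of all observed values in the column.
                peers = [other[j] for other in rows if other[j] is not None]
            filled[i][j] = sum(peers) / len(peers)
    return filled


def rmse(true_values, imputed_values):
    """Root mean square error between true and imputed values."""
    squared = [(t - p) ** 2 for t, p in zip(true_values, imputed_values)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical data: two clusters plus one noise object (label -1).
rows = [(1.0, 2.0), (1.2, None), (0.8, 1.8), (9.0, 9.5), (9.2, None), (50.0, 50.0)]
labels = [0, 0, 0, 1, 1, -1]
filled = impute_by_cluster_mean(rows, labels)

# Assumed ground truth for the two hidden entries, used only for scoring.
error = rmse([2.1, 9.3], [filled[1][1], filled[4][1]])
```

Because the noise object carries its own label, its extreme values never enter a cluster mean, which is the intuition behind noise-resistant imputation.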


2019 ◽  
Vol 8 (3) ◽  
pp. 4373-4378

Data belonging to different domains is being stored rapidly in various repositories across the globe. Extracting useful information from these huge volumes of data is always difficult due to the dynamic nature of the data being stored. Data mining is a knowledge discovery process used to extract hidden information, in the form of patterns, from the data stored in various repositories, termed warehouses. One of the popular tasks of data mining is classification, which deals with assigning every instance of a data set to one of the predefined class labels. The banking system is one of the real-world domains that collects a huge amount of client data on a daily basis. In this work, we have collected two variants of the bank marketing data set pertaining to a Portuguese financial institution, consisting of 41188 and 45211 instances, and performed classification on them using two data reduction techniques. Attribute subset selection has been performed on the first data set, and the training data with the selected features are used in classification. Principal Component Analysis has been performed on the second data set, and the training data with the extracted features are used in classification. A deep neural network classification algorithm based on backpropagation has been developed to perform classification on both data sets. Finally, the performance of each deep neural network classifier is compared with four standard classifiers, namely decision trees, Naïve Bayes, support vector machines, and k-nearest neighbors. It has been found that the deep neural network classifier outperforms the existing classifiers in terms of accuracy.


1981 ◽  
Vol 18 (1) ◽  
pp. 63-72 ◽  
Author(s):  
William R. Dillon ◽  
Matthew Goldstein ◽  
Lucy Lement

The marketing manager faces several dilemmas when analyzing multivariate frequency data. If the choice is to analyze a series of two-dimensional condensed tables, the interrelationships between those factors not in the table will be lost and biased inferences can result. If the decision is to analyze the complete multiway table, many of the cells may be sparse. The authors address the issue of how best to handle sparse-cell values in the context of a marketing data set relating store choice behavior to a number of shopper-specific variables. A simple new approach to this problem, which utilizes loglinear modeling techniques, is developed and contrasted with alternative remedies. The results of the comparative analysis show the proposed approach performs well, especially in the correct classification of seemingly unclassifiable shoppers.


2021 ◽  
Vol 5 (1) ◽  
pp. 187-192
Author(s):  
Yoga Religia ◽  
Agung Nugroho ◽  
Wahyu Hadikristanto

The world of banking requires a marketer to be able to reduce the risk of lending by keeping customers from incurring non-performing loans. One way to reduce this risk is by using data mining techniques. Data mining provides a powerful technique for finding meaningful and useful information from large amounts of data by way of classification. The Random Forest (RF) algorithm is a classification algorithm that can be used to handle imbalance problems. However, several references state that an optimization algorithm is needed to improve the classification results of the RF algorithm. Optimization of the RF algorithm can be done using Bagging and the Genetic Algorithm (GA). This study aims to classify Bank Marketing data on loan application acceptance, taken from the www.data.world site. Classification is carried out using the RF algorithm to obtain a predictive model for loan application acceptance with optimal accuracy. This study also compares the use of Bagging and the Genetic Algorithm as optimizations of the RF algorithm. Based on the tests performed, the best performance for the classification of Bank Marketing data is obtained using the RF algorithm, with an accuracy of 88.30%, AUC (+) of 0.500 and AUC (-) of 0.000. Optimization with Bagging and the Genetic Algorithm has not been able to improve the performance of the RF algorithm for the classification of Bank Marketing data.


2016 ◽  
Vol 13 (10) ◽  
pp. 6935-6943 ◽  
Author(s):  
Jia-Lin Hua ◽  
Jian Yu ◽  
Miin-Shen Yang

Mountains, which are heaped up from the densities of a data set, intuitively reflect the structure of the data points. Mountain clustering methods are therefore useful for grouping data points. However, previous mountain-based clustering methods suffer from the choice of the parameters used to compute the density. In this paper, we adopt correlation analysis to determine the density, and propose a new clustering algorithm, called Correlative Density-based Clustering (CDC). The new algorithm computes the density in a modified way and determines the parameters based on the inherent structure of the data points. Experiments on artificial and real datasets demonstrate the simplicity and effectiveness of the proposed approach.


2005 ◽  
Vol 15 (05) ◽  
pp. 391-401 ◽  
Author(s):  
DIMITRIOS S. FROSSYNIOTIS ◽  
CHRISTOS PATERITSAS ◽  
ANDREAS STAFYLOPATIS

A multi-clustering fusion method is presented based on combining several runs of a clustering algorithm resulting in a common partition. More specifically, the results of several independent runs of the same clustering algorithm are appropriately combined to obtain a distinct partition of the data which is not affected by initialization and overcomes the instabilities of clustering methods. Subsequently, a fusion procedure is applied to the clusters generated during the previous phase to determine the optimal number of clusters in the data set according to some predefined criteria.


Author(s):  
Yatish H. R. ◽  
Shubham Milind Phal ◽  
Tanmay Sanjay Hukkeri ◽  
Lili Xu ◽  
Shobha G ◽  
...  

Dealing with large samples of unlabeled data is a key challenge in today’s world, especially in applications such as traffic pattern analysis and disaster management. DBSCAN, or density-based spatial clustering of applications with noise, is a well-known density-based clustering algorithm. Its key strengths lie in its capability to detect outliers and handle arbitrarily shaped clusters. However, the algorithm, being fundamentally sequential in nature, proves expensive and time consuming when run on very large data sets. This paper thus presents a novel implementation of a parallel and distributed DBSCAN algorithm on the HPCC Systems platform. The approach seeks to fully parallelize the implementation by making use of the HPCC Systems distributed architecture and performing a tree-based union to merge local clusters. The proposed approach was tested on both synthetic and standard datasets (MFCCs Data Set) and found to be completely accurate. Additionally, when compared against a single-node setup, a significant decrease in computation time was observed with no impact on accuracy. The parallelized algorithm performed eight times better for larger numbers of data points and took exponentially less time as the number of data points increased.
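One common way to realise a tree-based union of local clusters is a disjoint-set (union-find) forest. The Python sketch below illustrates that general idea and is not the HPCC Systems implementation: the input format, in which each node reports a mapping from point identifier to local cluster identifier, and the rule that any point seen on two nodes links their local clusters are simplifying assumptions. A faithful DBSCAN merge would typically link clusters only through shared core points.

```python
def find(parent, key):
    """Return the root of key's tree, halving the path along the way."""
    while parent[key] != key:
        parent[key] = parent[parent[key]]
        key = parent[key]
    return key


def union(parent, a, b):
    """Merge the trees containing a and b; the smaller root becomes the parent."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[max(root_a, root_b)] = min(root_a, root_b)


def merge_local_clusters(partitions):
    """Map each point to a global cluster given per-node local labels."""
    parent = {}      # (node, local_id) -> parent key in the forest
    first_seen = {}  # point -> first (node, local_id) it was assigned to
    for node, labels in enumerate(partitions):
        for point, local_id in labels.items():
            key = (node, local_id)
            parent.setdefault(key, key)
            if point in first_seen:
                # The point is shared between nodes: link both local clusters.
                union(parent, first_seen[point], key)
            else:
                first_seen[point] = key
    return {point: find(parent, key) for point, key in first_seen.items()}


# Hypothetical output of two nodes; point "b" lies in the overlap region.
partitions = [
    {"a": 0, "b": 0, "c": 1},
    {"b": 5, "d": 5, "e": 6},
]
merged = merge_local_clusters(partitions)
```

Here local cluster 0 on the first node and local cluster 5 on the second share point `b`, so `a`, `b` and `d` end up in one global cluster while `c` and `e` stay separate.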


2020 ◽  
Author(s):  
Nur Nasuha Mohd Rashid ◽  
Mohd. Asyraf Mansor ◽  
Mohd Shareduwan Mohd Kasihmuddin ◽  
Saratha Sathasivam

2021 ◽  
Author(s):  
Lili Czirok ◽  
Lukács Kuslits ◽  
Katalin Gribovszki

The SE-Carpathians produce significant geodynamic activity due to the current subduction process. The strong seismicity in the Vrancea zone is its most important indicator. The focal area of these seismic events is relatively small, around 80 × 100 km, and the distribution of their locations is quite dense.

The authors have carried out cluster analyses of the focal mechanism solutions estimated from local and teleseismic measurements, as well as stress inversions, to support recent and previously published studies in this region. They have applied different pre-existing clustering methods, e.g. HDBSCAN (hierarchical density-based clustering for applications with noise) and agglomerative hierarchical analysis, taking into account the geographical coordinates, focal depths and parameters of the focal mechanism solutions of the seismic events used. Moreover, they have attempted to improve a fully automated algorithm for the classification of the earthquakes for the estimations. This algorithm does not call for the setting of hyper-parameters, so the influence of subjectivity can be reduced significantly and the running time can also be decreased. In all cases, the resulting stress tensors are in close agreement with the earlier presented results.


Author(s):  
Alicia Taylor Lamere

This chapter discusses several popular clustering functions and open source software packages in R and the feasibility of using them on larger datasets. These include the kmeans() function, the pvclust package, and the DBSCAN (density-based spatial clustering of applications with noise) package, which implement K-means, hierarchical, and density-based clustering, respectively. Dimension reduction methods such as PCA (principal component analysis) and SVD (singular value decomposition), as well as the choice of distance measure, are explored as methods to improve the performance of hierarchical and model-based clustering methods on larger datasets. These methods are illustrated through an application to a dataset of RNA-sequencing expression data for cancer patients obtained from the Cancer Genome Atlas Kidney Clear Cell Carcinoma (TCGA-KIRC) data collection from The Cancer Imaging Archive (TCIA).

