Clustering Based on a Novel Density Estimation Method

2013 ◽  
Vol 748 ◽  
pp. 590-594
Author(s):  
Li Liao ◽  
Yong Gang Lu ◽  
Xu Rong Chen

We propose a novel density estimation method that uses both the k-nearest neighbor (KNN) graph and the potential field of the data points to capture local and global data distribution information, respectively. Clustering is performed based on the computed density values: a forest of trees is built with each data point as a node, and the clusters are formed from the trees in the forest. The new clustering method is evaluated by comparison with three popular clustering methods, K-means++, Mean Shift and DBSCAN. Experiments on two synthetic data sets and one real data set show that our approach can effectively improve the clustering results.
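As a rough illustration of this style of clustering, the following Python sketch estimates a density from k-NN distances combined with a Gaussian potential-field term and then links each point to its nearest higher-density neighbor among its k nearest neighbors, so that the resulting forest of trees defines the clusters. The density formula, the linking rule, and the `make_blobs` toy data are illustrative assumptions, not the authors' exact algorithm.

```python
# Hedged sketch: k-NN + potential-field density estimate, then forest-of-trees
# cluster assignment. Formulas and the toy data are illustrative only.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import NearestNeighbors

def estimate_density(X, k=10, sigma=2.0):
    # Local term: inverse mean distance to the k nearest neighbors (KNN graph).
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    local = 1.0 / (dist[:, 1:].mean(axis=1) + 1e-12)
    # Global term: Gaussian "potential field" accumulated over all points.
    D = pairwise_distances(X)
    potential = np.exp(-(D / sigma) ** 2).sum(axis=1)
    return local * potential

def forest_clusters(X, density, k=10):
    # Link each point to its nearest higher-density point among its k nearest
    # neighbors; points with no such neighbor become tree roots, and each tree
    # in the resulting forest is one cluster.
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    parent = np.arange(len(X))
    for i in range(len(X)):
        for j in idx[i, 1:]:                      # neighbors ordered by distance
            if density[j] > density[i]:
                parent[i] = j
                break
    labels = np.full(len(X), -1, dtype=int)
    roots = np.where(parent == np.arange(len(X)))[0]
    labels[roots] = np.arange(len(roots))
    for i in np.argsort(-density):                # parents are labeled before children
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = forest_clusters(X, estimate_density(X))
print(len(np.unique(labels)), "clusters found")
```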

Polymers ◽  
2021 ◽  
Vol 13 (21) ◽  
pp. 3811
Author(s):  
Iosif Sorin Fazakas-Anca ◽  
Arina Modrea ◽  
Sorin Vlase

This paper proposes a new method for calculating the monomer reactivity ratios for binary copolymerization based on the terminal model. The original optimization method combines a numerical integration algorithm with an optimization algorithm based on k-nearest neighbour non-parametric regression. The calculation method has been tested on simulated and experimental data sets at low (<10%), medium (10–35%) and high (>40%) conversions, yielding reactivity ratios in good agreement with the usual methods such as intersection, Fineman–Ross, reverse Fineman–Ross, Kelen–Tüdös, extended Kelen–Tüdös and the error-in-variables method. The experimental data sets used in this comparative analysis are the copolymerization of 2-(N-phthalimido) ethyl acrylate with 1-vinyl-2-pyrrolidone for low conversion, of isoprene with glycidyl methacrylate for medium conversion, and of N-isopropylacrylamide with N,N-dimethylacrylamide for high conversion. The possibility of estimating experimental errors from a single experimental data set formed by n experimental points is also shown.
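A minimal Python sketch of the k-nearest-neighbour non-parametric regression ingredient is given below: a hypothetical error surface over candidate (r1, r2) pairs is smoothed by k-NN regression and its minimum taken as the estimate. The candidate grid, the placeholder residuals, and the assumed "true" ratios are inventions for illustration; the paper's numerical integration of the copolymerization model is not reproduced.

```python
# Hedged sketch of k-NN non-parametric regression over candidate reactivity
# ratios; the residuals are synthetic placeholders, not the paper's objective.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
candidates = rng.uniform(0.01, 2.0, size=(400, 2))       # candidate (r1, r2) pairs
true_r = np.array([0.45, 1.20])                           # hypothetical true ratios
# Placeholder error: in the real method this would come from numerically
# integrating the terminal-model copolymerization equations for each pair.
residuals = np.linalg.norm(candidates - true_r, axis=1) + rng.normal(0, 0.05, 400)

# Smooth the error surface with k-NN regression and take its minimum.
knn = KNeighborsRegressor(n_neighbors=7).fit(candidates, residuals)
smoothed = knn.predict(candidates)
r1_hat, r2_hat = candidates[np.argmin(smoothed)]
print(f"estimated reactivity ratios: r1 = {r1_hat:.2f}, r2 = {r2_hat:.2f}")
```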


2021 ◽  
Vol 87 (6) ◽  
pp. 445-455
Author(s):  
Yi Ma ◽  
Zezhong Zheng ◽  
Yutang Ma ◽  
Mingcang Zhu ◽  
Ran Huang ◽  
...  

Many manifold learning algorithms conduct an eigenvector analysis on a data-similarity matrix of size N×N, where N is the number of data points, so the memory complexity of the analysis is no less than O(N²). We present in this article an incremental manifold learning approach to handle large hyperspectral data sets for land use identification. In our method, the number of dimensions for the high-dimensional hyperspectral-image data set is obtained from the training data set. A local curvature variation algorithm is used to sample a subset of data points as landmarks, and a manifold skeleton is then identified based on the landmarks. Our method is validated on three AVIRIS hyperspectral data sets, outperforming the comparison algorithms with a k-nearest-neighbor classifier and achieving the second-best performance with a support vector machine.
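The overall workflow (embed via a landmark subset to avoid the O(N²) similarity matrix, then classify in the reduced space with k-NN) can be sketched as below; random landmark sampling, Isomap, and the digits data are stand-ins for the authors' local-curvature-variation sampling, manifold skeleton, and AVIRIS hyperspectral cubes.

```python
# Hedged sketch: landmark-based embedding followed by k-NN classification.
# Random landmarks, Isomap and the digits data are stand-ins for illustration.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)               # stand-in for hyperspectral pixels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit the manifold embedding on a landmark subset so memory stays far below O(N^2).
rng = np.random.default_rng(0)
landmarks = rng.choice(len(X_tr), size=400, replace=False)
iso = Isomap(n_neighbors=20, n_components=10).fit(X_tr[landmarks])

# Map all points through the landmark embedding, then classify with k-NN.
Z_tr, Z_te = iso.transform(X_tr), iso.transform(X_te)
acc = KNeighborsClassifier(n_neighbors=5).fit(Z_tr, y_tr).score(Z_te, y_te)
print(f"k-NN accuracy in the landmark embedding: {acc:.3f}")
```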


Author(s):  
Amit Saxena ◽  
John Wang

This paper presents a two-phase scheme to select a reduced number of features from a dataset using a Genetic Algorithm (GA) and to test the classification accuracy (CA) of the dataset with the reduced feature set. In the first phase, an unsupervised approach to selecting a subset of features is applied: the GA stochastically selects reduced sets of features with the Sammon error as the fitness function, yielding different feature subsets. In the second phase, each reduced feature set is used to test the CA of the dataset, which is validated using the supervised k-nearest-neighbor (k-NN) algorithm. The novelty of the proposed scheme is that each reduced feature set obtained in the first phase is investigated for CA using k-NN classification with different Minkowski metrics, i.e., non-Euclidean norms, instead of the conventional Euclidean norm (L2). Final results are presented with extensive simulations on seven real and one synthetic data set. The investigation reveals that using different norms can produce better CA and hence offers scope for better feature subset selection.
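A small Python sketch of the second phase is shown below: a reduced feature subset (hypothetically produced by the GA with the Sammon error as fitness) is evaluated with k-NN under several Minkowski norms. The wine data set and the chosen feature indices are assumptions used only for illustration.

```python
# Hedged sketch of phase two: k-NN classification accuracy of a reduced feature
# set under different Minkowski norms. The feature subset is hypothetical.
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
subset = [0, 4, 6, 9, 12]                       # hypothetical GA-selected features

for p in (1, 1.5, 2, 4):                        # L1, an intermediate norm, L2, L4
    knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=p)
    ca = cross_val_score(knn, X[:, subset], y, cv=5).mean()
    print(f"Minkowski p={p}: CA = {ca:.3f}")
```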


2019 ◽  
Vol 12 (2) ◽  
pp. 140
Author(s):  
Retsi Firda Maulina ◽  
Anik Djuraidah ◽  
Anang Kurnia

Poverty is a complex and multidimensional problem, which makes it a development priority. Applications of poverty modeling to discrete data are still few, as are applications of the Bayesian paradigm. The Bayes method is a parameter estimation approach that uses prior information together with sample information, so it can provide predictions with higher accuracy than classical methods. Bayesian inference using the INLA approach provides faster computation than MCMC and makes it possible to use large data sets. This study models poverty in Java using the Bayesian spatial probit with the INLA approach and three weighting matrices, namely K-Nearest Neighbor (KNN), inverse distance, and exponential distance. The results show that, for poverty analysis in Java, the best model, the Bayesian SAR probit with INLA and the KNN weighting matrix, produced the highest level of classification accuracy, with a specificity of 85.45%, a sensitivity of 93.75%, and an accuracy of 89.92%.
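The KNN spatial weighting matrix referred to here is the usual row-standardized matrix built from each region's k nearest neighbors; a minimal Python sketch with hypothetical district centroids is given below (the SAR probit fitting itself is typically done with R-INLA and is not shown).

```python
# Hedged sketch: row-standardized k-nearest-neighbour spatial weight matrix
# from hypothetical district centroids, as used in SAR-type spatial models.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_weight_matrix(coords, k=5):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(coords)
    _, idx = nn.kneighbors(coords)              # first neighbor is the point itself
    n = len(coords)
    W = np.zeros((n, n))
    for i, neigh in enumerate(idx[:, 1:]):
        W[i, neigh] = 1.0
    return W / W.sum(axis=1, keepdims=True)     # row standardization

coords = np.random.default_rng(0).uniform(size=(20, 2))   # hypothetical centroids
W = knn_weight_matrix(coords, k=5)
print(W.shape, W.sum(axis=1)[:3])               # each row sums to 1
```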


2019 ◽  
Vol 8 (4) ◽  
pp. 9155-9158

Classification is a machine learning task that consists of predicting the class membership of unlabeled examples from the properties of training examples whose labels are known. Classification tasks span a huge assortment of domains and real-world purposes, including medical diagnosis, bioinformatics, financial engineering and image recognition, where domain experts can use the learned model to support their decisions. All the classification approaches considered in this paper were evaluated in an appropriate experimental framework in the R programming language, with the major emphasis on the k-nearest-neighbor method, support vector machines and decision trees, applied over a large number of data sets with varied dimensionality and compared against other state-of-the-art methods. The experimental results obtained have been verified by statistical tests that support the better performance of the methods. We survey various data-mining classification techniques and compare them using diverse datasets from the University of California, Irvine (UCI) Machine Learning Repository, with detailed accuracy results on the Iris data set.
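As a flavor of the kind of comparison reported (the paper's experiments were run in R), a short Python sketch cross-validating k-NN, a support vector machine and a decision tree on the UCI Iris data is shown below.

```python
# Hedged sketch of a classifier comparison on Iris; the paper's actual
# experiments were conducted in R over many UCI data sets.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
for name, clf in [("k-NN", KNeighborsClassifier(n_neighbors=5)),
                  ("SVM", SVC()),
                  ("decision tree", DecisionTreeClassifier(random_state=0))]:
    print(name, cross_val_score(clf, X, y, cv=10).mean().round(3))
```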


Author(s):  
Mahziyar Darvishi ◽  
Omid Ziaee ◽  
Arash Rahmati ◽  
Mohammad Silani

Numerous geometries are available for cellular structures, and selecting the structure that best reflects the intended characteristics is cumbersome. While testing many specimens to determine the mechanical properties of these materials can be time-consuming and expensive, finite element analysis (FEA) is considered an efficient alternative. In this study, we present a method to find the suitable geometry for the intended mechanical characteristics by applying machine learning (ML) algorithms to FEA results of cellular structures. Different cellular structures of a given material are analyzed by FEA, and the results are validated against their corresponding analytical equations. The validated results are used to create a data set for the ML algorithms. Finally, by comparing the results with the correct answers, the most accurate algorithm is identified for the intended application. In our case study, the cellular structures are three geometries widely used as bone implants: cube, Kelvin, and rhombic dodecahedron, made of Ti–6Al–4V. The ML algorithms are simple Bayesian classification, K-nearest neighbor, XGBoost, random forest, and an artificial neural network. By comparing the results of these algorithms, the best-performing algorithm is identified.
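The classifier-comparison step can be sketched as follows; a synthetic tabular data set stands in for the FEA-derived data, and XGBoost is omitted to keep the sketch dependency-free, so this illustrates the comparison procedure rather than the study's actual pipeline.

```python
# Hedged sketch of comparing classifiers on a synthetic stand-in for the
# FEA-derived data set (GaussianNB stands in for simple Bayesian classification).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, n_features=8, n_classes=3,
                           n_informative=5, random_state=0)
models = {"Bayesian": GaussianNB(),
          "k-NN": KNeighborsClassifier(n_neighbors=5),
          "random forest": RandomForestClassifier(random_state=0),
          "neural network": MLPClassifier(max_iter=2000, random_state=0)}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```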


2020 ◽  
Vol 633 ◽  
pp. A46
Author(s):  
L. Siltala ◽  
M. Granvik

Context. The bulk density of an asteroid informs us about its interior structure and composition. To constrain the bulk density, one needs an estimate of the asteroid's mass. The mass is estimated by analyzing an asteroid's gravitational interaction with another object, such as another asteroid during a close encounter. An estimate for the mass has typically been obtained with linearized least-squares methods, despite the fact that this family of methods is not able to properly describe non-Gaussian parameter distributions. In addition, the uncertainties reported for asteroid masses in the literature are sometimes inconsistent with each other and are suspected to be unrealistically low. Aims. We aim to present a Markov-chain Monte Carlo (MCMC) algorithm for the asteroid mass estimation problem based on asteroid-asteroid close encounters. We verify that our algorithm works correctly by applying it to synthetic data sets. We use astrometry available through the Minor Planet Center to estimate masses for a select few example cases and compare our results with results reported in the literature. Methods. Our mass-estimation method is based on the robust adaptive Metropolis algorithm, which has been implemented into the OpenOrb asteroid orbit computation software. Our method has the built-in capability to analyze multiple perturbing asteroids and test asteroids simultaneously. Results. We find that our mass estimates for the synthetic data sets are fully consistent with the ground truth. The nominal masses for the real example cases typically agree with the literature but tend to have greater uncertainties than what is reported in recent literature. Possible reasons for this include different astrometric data sets and weights, different test asteroids, different force models or different algorithms. For (16) Psyche, the target of NASA's Psyche mission, our maximum-likelihood mass is approximately 55% of what is reported in the literature. Such a low mass would imply that the bulk density is significantly lower than previously expected and would thus disagree with the theory of (16) Psyche being the metallic core of a protoplanet. We do, however, note that masses reported in recent literature remain within our 3-sigma limits. Conclusions. The new MCMC mass-estimation algorithm performs as expected, but a rigorous comparison with results from a least-squares algorithm using the exact same data set remains to be done. The matters of uncertainties in comparison with other algorithms and correlations of observations also warrant further investigation.
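A toy Python sketch of the Metropolis idea applied to a single mass parameter is given below; the synthetic Gaussian log-likelihood, the assumed mass value and the step size are placeholders, whereas the paper's method uses the robust adaptive Metropolis algorithm inside OpenOrb with real astrometry and full orbital dynamics.

```python
# Toy Metropolis sampler for a single mass parameter with a synthetic
# log-likelihood; it only illustrates why MCMC can capture non-Gaussian
# posteriors that linearized least-squares fits cannot.
import numpy as np

rng = np.random.default_rng(1)
true_mass = 2.3e-11                 # hypothetical mass in solar masses

def log_likelihood(m):
    # Placeholder: the real method propagates orbits and compares predicted
    # with observed astrometry of the test asteroid(s).
    if m <= 0:
        return -np.inf
    return -0.5 * ((m - true_mass) / 4e-12) ** 2

samples, m, step = [], 1e-11, 3e-12
for _ in range(20000):
    proposal = m + rng.normal(0, step)
    if np.log(rng.uniform()) < log_likelihood(proposal) - log_likelihood(m):
        m = proposal                 # accept the proposed mass
    samples.append(m)
post = np.array(samples[5000:])      # discard burn-in
print(f"posterior mass: {post.mean():.2e} +/- {post.std():.2e}")
```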


2005 ◽  
Vol 01 (01) ◽  
pp. 173-193
Author(s):  
HIROSHI MAMITSUKA

We consider the problem of mining from noisy unsupervised data sets. The data points we call noise are outliers in the context of data mining, generally defined as points located in low-probability regions of the input space. The purpose of our approach is to detect such outliers and to perform efficient mining from noisy unsupervised data. We propose a new iterative sampling approach that uses both model-based clustering and the likelihood given to each example by a trained probabilistic model to find data points in low-probability regions of the input space. Our method uses an arbitrary probabilistic model as a component model and alternately repeats two steps: sampling non-outliers with high likelihoods (computed by previously obtained models) and training the model with the selected examples. In our experiments, we focused on two-mode and co-occurrence data and empirically evaluated the effectiveness of our proposed method, comparing it with two other methods on both synthetic and real data sets. The experiments on the synthetic data sets showed that the significance level of the performance advantage of our method over the two other methods was more pronounced for higher noise ratios, for both medium- and large-sized data sets. The experiments on a real noisy data set of protein–protein interactions, a typical co-occurrence data set, further confirmed the performance of our method for detecting outliers in a given data set. Extended abstracts of parts of the work presented in this paper have appeared in Refs. 1 and 2.
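The alternating "train a probabilistic model, keep only high-likelihood examples" loop can be sketched as below; the Gaussian mixture component model, the synthetic two-dimensional data and the fixed 10% drop fraction are simplifying assumptions, since the paper works with two-mode/co-occurrence data and model-based clustering.

```python
# Hedged sketch of iterative sampling for outlier detection: repeatedly train a
# probabilistic model on the retained points and drop the lowest-likelihood ones.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
inliers = rng.normal(0, 1, size=(500, 2))
noise = rng.uniform(-6, 6, size=(50, 2))            # low-probability outliers
X = np.vstack([inliers, noise])

keep = np.arange(len(X))
for _ in range(5):
    gmm = GaussianMixture(n_components=3, random_state=0).fit(X[keep])
    ll = gmm.score_samples(X)                       # per-point log-likelihood
    keep = np.argsort(ll)[int(0.1 * len(X)):]       # retain the top 90%
outliers = np.setdiff1d(np.arange(len(X)), keep)
print(f"flagged {len(outliers)} points as outliers")
```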


Author(s):  
Sucitra Sahara ◽  
Rizqi Agung Permana ◽  
Hariyanto Hariyanto

Viruses on computers are dangerous both for individual users and for companies that have implemented computerized systems. Virus programs designed for malicious purposes can damage certain parts of a computer and, most damagingly, can destroy important company data. Anti-virus software has been created in response, but anti-virus development always lags behind the viruses themselves, so the researchers select anti-virus software based on opinions, i.e., comments from people who have used a particular product's anti-virus software and posted them to online media such as the comment section of a product sales site. Thousands of comments are processed and grouped into positive and negative text categories, and the data are classified using the k-Nearest Neighbor (k-NN) algorithm, one of the algorithms appropriate for this study. The researchers found that the k-NN algorithm can process data sets that have been grouped into positive and negative texts, particularly for text selection, and combining the Particle Swarm Optimization (PSO) method with k-NN is expected to increase the accuracy so that the results are stronger and more valid. Using the k-NN method alone, the accuracy obtained is 70.50%, whereas after combining the k-NN method with PSO optimization the accuracy rises to 83.50%. It can be concluded that the combination of PSO optimization and the k-NN method is well suited to text mining and to the selection of text data sets.
Keywords: review analysis, Particle Swarm Optimization, k-Nearest Neighbor method
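The k-NN text-classification step can be pictured with the tiny Python sketch below (TF-IDF features, toy reviews); the PSO stage that the paper uses to lift accuracy from 70.50% to 83.50% is not reproduced.

```python
# Hedged sketch: k-NN sentiment classification of toy anti-virus reviews with
# TF-IDF features; the PSO optimization stage is not shown.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

reviews = ["great antivirus, very fast", "detects everything, love it",
           "slowed my computer down", "missed an obvious virus, useless"]
labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3))
model.fit(reviews, labels)
print(model.predict(["fast and reliable protection"]))
```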


2020 ◽  
Vol 5 (2) ◽  
pp. 85-92
Author(s):  
Sucitra Sahara ◽  
Rizqi Agung Permana

Many companies have not implemented accounting software in their financial management, even though technology keeps being updated and software development companies keep releasing better products, especially accounting software. There are quite a few software products whose quality is still below standard or whose features and facilities are incomplete. The researchers therefore focus on companies and individual businesses that still process their finances manually, helping them choose a software product more easily. The researchers first carry out an accounting software product selection stage based on the opinions of people who have bought and used the software and posted those opinions to online media, such as comments on a product sales site. Thousands of comments are processed and grouped into data sets, and the data are classified using the k-Nearest Neighbor (K-NN) algorithm. The K-NN method is expected to produce the desired accuracy value so that the processing of the data set is stronger and more valid. After applying the method, an accuracy of 80.50% is obtained, so it can be concluded that the K-NN method is well suited to this text mining task and to selecting text data sets.

