An Improved Flexible Partial Histogram Bayes Learning Algorithm

Author(s):  
Haider O. Lawend ◽  
Anuar Muad ◽  
Aini Hussain

<em>This paper presents a proposed supervised classification technique, the flexible partial histogram Bayes (fPHBayes) learning algorithm. In our previous work, the partial histogram Bayes (PHBayes) learning algorithm showed advantages in speed and accuracy on classification tasks. However, its accuracy declines when dealing with a small number of instances or when the class features are distributed over a wide area. In this work, the proposed fPHBayes addresses these limitations to increase classification accuracy. fPHBayes was analyzed and compared with PHBayes and other standard learning algorithms such as first nearest neighbor, nearest subclass mean, nearest class mean, naive Bayes, and the Gaussian mixture model classifier. The experiments were performed on both real and synthetic data, considering different numbers of instances and different Gaussian variances. The results showed that fPHBayes is more accurate than PHBayes and more flexible in dealing with different numbers of instances and different Gaussian variances.</em>
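The abstract does not publish the fPHBayes mechanics, but the general idea of a histogram-based Bayes classifier can be sketched: estimate per-class, per-feature densities with histograms and classify by the largest posterior. This is a generic illustration only, not the authors' fPHBayes or PHBayes algorithm; the class interface and bin count are assumptions.

```python
import numpy as np

class HistogramNaiveBayes:
    """Naive Bayes with per-class, per-feature histogram density estimates.
    Generic sketch only; NOT the paper's fPHBayes/PHBayes algorithm."""

    def __init__(self, n_bins=10):
        self.n_bins = n_bins

    def fit(self, X, y):
        X, y = np.asarray(X, float), np.asarray(y)
        self.classes_ = np.unique(y)
        # shared bin edges per feature, spanning the observed range
        self.edges_ = [np.linspace(X[:, j].min(), X[:, j].max(), self.n_bins + 1)
                       for j in range(X.shape[1])]
        self.hists_, self.priors_ = {}, {}
        for c in self.classes_:
            Xc = X[y == c]
            self.priors_[c] = len(Xc) / len(X)
            # Laplace-smoothed bin probabilities per feature
            self.hists_[c] = [
                (np.histogram(Xc[:, j], bins=self.edges_[j])[0] + 1.0)
                / (len(Xc) + self.n_bins)
                for j in range(X.shape[1])]
        return self

    def predict(self, X):
        X = np.asarray(X, float)
        out = []
        for x in X:
            scores = {}
            for c in self.classes_:
                logp = np.log(self.priors_[c])
                for j, v in enumerate(x):
                    b = int(np.clip(np.searchsorted(self.edges_[j], v) - 1,
                                    0, self.n_bins - 1))
                    logp += np.log(self.hists_[c][j][b])
                scores[c] = logp
            out.append(max(scores, key=scores.get))
        return np.array(out)
```

A fixed bin count is exactly the kind of rigidity the paper targets: too few instances per bin makes the histogram noisy, which motivates a more flexible partial-histogram scheme.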

2020 ◽  
Vol 223 (3) ◽  
pp. 1565-1583
Author(s):  
Hoël Seillé ◽  
Gerhard Visser

SUMMARY Bayesian inversion of magnetotelluric (MT) data is a powerful but computationally expensive approach to estimate the subsurface electrical conductivity distribution and associated uncertainty. Approximating the Earth subsurface with 1-D physics considerably speeds up calculation of the forward problem, making the Bayesian approach tractable, but can lead to biased results when the assumption is violated. We propose a methodology to quantitatively compensate for the bias caused by the 1-D Earth assumption within a 1-D trans-dimensional Markov chain Monte Carlo sampler. Our approach determines site-specific likelihood functions, which are calculated using a dimensionality discrepancy error model derived by a machine learning algorithm trained on a set of synthetic 3-D conductivity training images. This is achieved by exploiting known geometrical dimensional properties of the MT phase tensor. A complex synthetic model that mimics a sedimentary basin environment is used to illustrate the ability of our workflow to reliably estimate uncertainty in the inversion results, even in the presence of strong 2-D and 3-D effects. Using this dimensionality discrepancy error model, we demonstrate that on this synthetic data set our workflow performs better in 80 per cent of cases than the existing practice of using constant errors. Finally, our workflow is benchmarked against real data acquired in Queensland, Australia, and shown to detect the depth to basement accurately.


2021 ◽  
Author(s):  
Gothai E ◽  
Usha Moorthy ◽  
Sathishkumar V E ◽  
Abeer Ali Alnuaim ◽  
Wesam Atef Hatamleh ◽  
...  

Abstract With the evolution of Internet standards and advancements in various Internet and mobile technologies, especially since web 4.0, more and more web and mobile applications emerge, such as e-commerce, social networks, online gaming applications, and Internet of Things based applications. Due to the deployment and concurrent access of these applications on the Internet and mobile devices, the amount and variety of data generated increase exponentially, and the new era of Big Data has come into existence. Presently available data structures and data analysis algorithms are not capable of handling such Big Data. Hence, there is a need for scalable, flexible, parallel, and intelligent data analysis algorithms to handle and analyze complex massive data. In this article, we propose a novel distributed supervised machine learning algorithm based on the MapReduce programming model and the Distance Weighted k-Nearest Neighbor algorithm, called MR-DWkNN, to process and analyze Big Data in a Hadoop cluster environment. The proposed distributed algorithm performs both regression and classification tasks on large volumes of Big Data. Three performance metrics, Root Mean Squared Error (RMSE) and the coefficient of determination (R2) for regression tasks, and Accuracy for classification tasks, are used to measure the performance of the proposed MR-DWkNN algorithm. Extensive experimental results show an average improvement of 3–4.5% in prediction and classification performance compared to the standard distributed k-NN algorithm, as well as a considerable decrease in RMSE, with good parallelism characteristics of scalability and speedup, which proves its effectiveness for Big Data prediction and classification applications.
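The distance-weighted voting at the core of DWkNN can be sketched on a single node: each of the k nearest neighbors votes with weight inversely proportional to its distance. This is only the per-record computation; the paper's contribution is distributing it over MapReduce on Hadoop, which is not shown here. Function name and the 1/(d+eps) weighting are assumptions.

```python
import numpy as np

def dw_knn_predict(X_train, y_train, x, k=3, eps=1e-9):
    """Distance-weighted k-NN classification: each of the k nearest
    neighbors votes with weight 1/(distance + eps), so closer neighbors
    count more than distant ones. Single-node sketch of the voting rule."""
    d = np.linalg.norm(np.asarray(X_train, float) - np.asarray(x, float), axis=1)
    nn = np.argsort(d)[:k]                      # indices of k nearest neighbors
    votes = {}
    for i in nn:
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + 1.0 / (d[i] + eps)
    return max(votes, key=votes.get)            # class with largest total weight
```

In the MapReduce setting, mappers would compute partial neighbor lists over their data splits and a reducer would merge them before the weighted vote.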


2018 ◽  
Vol 8 (9) ◽  
pp. 1676 ◽  
Author(s):  
Vincent Gripon ◽  
Matthias Löwe ◽  
Franck Vermet

Nearest neighbor search is a very active field in machine learning. It appears in many application cases, including classification and object retrieval. In its naive implementation, the complexity of the search is linear in the product of the dimension and the cardinality of the collection of vectors in which the search is performed. Recently, many works have focused on reducing the dimension of vectors using quantization techniques or hashing, while providing an approximate result. In this paper, we focus instead on tackling the cardinality of the collection of vectors. Namely, we introduce a technique that partitions the collection of vectors and stores each part in its own associative memory. When a query vector is given to the system, the associative memories are polled to identify which one contains the closest match. Then, an exhaustive search is conducted only on the part of the vectors stored in the selected associative memory. We study the effectiveness of the system when the messages to store are generated from i.i.d. uniform ±1 random variables or 0–1 sparse i.i.d. random variables. We also conduct experiments on both synthetic data and real data and show that it is possible to achieve interesting trade-offs between complexity and accuracy.
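The poll-then-exhaustive-search structure can be sketched as follows. The paper polls associative memories; here a nearest-centroid heuristic over coherent parts stands in for that polling step, which is a deliberate simplification (the partitioning rule, the centroid poll, and the function names are all assumptions, not the paper's construction).

```python
import numpy as np

def build_partitions(X, n_parts):
    """Split the collection into coherent parts (sorted by first coordinate,
    for the sake of the toy centroid poll) and store each part with its
    centroid. Stand-in for the paper's associative memories."""
    order = np.argsort(X[:, 0])
    parts = np.array_split(order, n_parts)
    return [(X[p], X[p].mean(axis=0)) for p in parts]

def approx_nn(parts, q):
    """Poll parts (nearest centroid wins), then run an exhaustive search
    only inside the selected part."""
    vecs, _ = min(parts, key=lambda pc: np.linalg.norm(pc[1] - q))
    d = np.linalg.norm(vecs - q, axis=1)
    return vecs[np.argmin(d)]
```

The complexity gain is the point: the exhaustive distance computation touches only one part instead of the whole collection, at the risk of polling the wrong part (the approximation the paper quantifies).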


2013 ◽  
Vol 748 ◽  
pp. 590-594
Author(s):  
Li Liao ◽  
Yong Gang Lu ◽  
Xu Rong Chen

We propose a novel density estimation method that uses both the k-nearest neighbor (KNN) graph and the potential field of the data points to capture local and global data distribution information, respectively. Clustering is performed based on the computed density values: a forest of trees is built with each data point as a tree node, and the clusters are formed from the trees in the forest. The new clustering method is evaluated by comparison with three popular clustering methods: K-means++, Mean Shift, and DBSCAN. Experiments on two synthetic data sets and one real data set show that our approach effectively improves the clustering results.
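The local (kNN) half of such a density estimate can be sketched directly: a point surrounded by close neighbors gets a high density score. This shows only the kNN-distance component; the paper additionally combines it with a global potential field, which is omitted here.

```python
import numpy as np

def knn_density(X, k=3):
    """Per-point density estimate: inverse of the mean distance to the k
    nearest neighbors. Points in dense regions have small kNN distances and
    hence large density values. Local component only; the paper also uses a
    global potential field."""
    X = np.asarray(X, float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # sort each row and drop column 0, which is the zero self-distance
    knn_d = np.sort(D, axis=1)[:, 1:k + 1]
    return 1.0 / (knn_d.mean(axis=1) + 1e-12)
```

Tree building then amounts to pointing each node at a nearby higher-density node; points with no higher-density neighbor within reach become tree roots, i.e. cluster seeds.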


Author(s):  
Amparo Baillo ◽  
Antonio Cuevas ◽  
Ricardo Fraiman

This article reviews the literature concerning supervised and unsupervised classification of functional data. It first explains the meaning of unsupervised classification vs. supervised classification before discussing the supervised classification problem in the infinite-dimensional case, showing that its formal statement generally coincides with that of discriminant analysis in the classical multivariate case. It then considers the optimal classifier and plug-in rules, empirical risk and empirical minimization rules, linear discrimination rules, the k nearest neighbor (k-NN) method, and kernel rules. It also describes classification based on partial least squares, classification based on reproducing kernels, and depth-based classification. Finally, it examines unsupervised classification methods, focusing on K-means for functional data, K-means for data in a Hilbert space, and impartial trimmed K-means for functional data. Some practical issues, in particular real-data examples and simulations, are reviewed and some selected proofs are given.


2021 ◽  
Vol 1 (1) ◽  
pp. 10-18
Author(s):  
Anggi Priliani Yulianto ◽  
Sutawanir Darwis

Abstract. Monitoring the condition of the engine is a top priority to avoid damage. To know the condition of a bearing, it is important to know the remaining useful life of the machine. The IEEE PHM 2012 Prognostic Challenge platform provides real data on accelerated bearing degradation carried out under constant operating conditions, with online monitoring of the controlled variables temperature and vibration (via horizontal and vertical accelerometers). From this platform, the data used are the bearing2_3 data in the horizontal direction, lasting about 2 hours, with RMS computed every 1/10 second (2560 values). In this study, machine-learning-based modeling is performed using the k-nearest neighbor (kNN) method to predict the bearing RMS. The kNN method classifies an object according to the training samples that are closest to it. kNN is a nonparametric machine learning algorithm, i.e. a model that assumes no particular distribution; its advantage is that the class decision boundary it produces can be very flexible and highly nonlinear. The smallest MSE was obtained at k = 16 (MSE = 0.157579). After obtaining the optimal k, the RMS is predicted for 97 lags and the bearing performance is identified in several phases.
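kNN regression on a time series can be sketched as follows: find the k historical windows most similar to the most recent one and average their successors. This is an illustrative one-step forecaster, not the study's exact pipeline; the window length and k here are assumptions (the study tuned k by MSE, finding k = 16 optimal on its data).

```python
import numpy as np

def knn_forecast(series, k=3, window=4):
    """One-step-ahead kNN regression on a univariate series (e.g. bearing
    RMS): locate the k past windows closest to the latest window and return
    the mean of the values that followed them. Illustrative sketch."""
    s = np.asarray(series, float)
    target = s[-window:]                                   # latest window
    cands = np.array([s[i:i + window] for i in range(len(s) - window)])
    succ = s[window:]                                      # value after each window
    d = np.linalg.norm(cands - target, axis=1)
    nn = np.argsort(d)[:k]
    return succ[nn].mean()
```

Iterating this forecaster, feeding each prediction back into the series, yields the multi-lag (e.g. 97-lag) predictions described in the abstract; note that kNN cannot extrapolate beyond values seen in training.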


2006 ◽  
Vol 15 (03) ◽  
pp. 353-370 ◽  
Author(s):  
TIE-FEI LIU ◽  
WING-KIN SUNG ◽  
ANKUSH MITTAL

Exact determination of a gene network is required to discover the higher-order structures of an organism and to interpret its behavior. Most research work in learning gene networks either assumes that there is no time delay in gene expression or that there is a constant time delay. This paper shows how Bayesian Networks can be applied to represent multi-time delay relationships as well as directed loops. The intractability of the network learning algorithm is handled by using an improved mutual information criterion. Also, a new structure learning algorithm, "Learning By Modification", is proposed to learn the sparse structure of a gene network. The experimental results on synthetic data and real data show that our method is more accurate in determining the gene structure as compared to the traditional methods. Even transcriptional loops spanning over the whole cell can be detected by our algorithm.


Author(s):  
Wentian Zhao ◽  
Shaojie Wang ◽  
Zhihuai Xie ◽  
Jing Shi ◽  
Chenliang Xu

The expectation maximization (EM) algorithm finds maximum likelihood solutions for models with latent variables. A typical example is the Gaussian Mixture Model (GMM), which requires a Gaussian assumption; however, natural images are highly non-Gaussian, so GMM cannot be applied to image clustering in pixel space. To overcome this limitation, we propose a GAN-based EM learning framework that can maximize the likelihood of images and estimate the latent variables. We call this model GAN-EM; it is a framework for image clustering, semi-supervised classification, and dimensionality reduction. In the M-step, we design a novel loss function for the discriminator of the GAN to perform maximum likelihood estimation (MLE) on data with soft class label assignments. Specifically, a conditional generator captures the data distribution for K classes, and a discriminator tells whether a sample is real or fake for each class. Since our model is unsupervised, the class label of real data is regarded as a latent variable, which is estimated by an additional network (E-net) in the E-step. The proposed GAN-EM achieves state-of-the-art clustering and semi-supervised classification results on MNIST, SVHN, and CelebA, as well as quality of generated images comparable to other recently developed generative models.
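The classical EM-for-GMM baseline that GAN-EM generalizes can be shown in a few lines for the 1-D case: the E-step computes soft responsibilities, the M-step performs responsibility-weighted maximum likelihood updates. A minimal sketch (quantile initialization and the variance floor are implementation choices of this sketch, not from the paper):

```python
import numpy as np

def gmm_em_1d(x, k=2, iters=50):
    """Classical EM for a 1-D Gaussian mixture. E-step: responsibilities
    r[i, j] = P(component j | x_i); M-step: weighted MLE for means,
    variances, and mixing weights. GAN-EM replaces these parametric steps
    with a generator/discriminator (M) and an E-net (E)."""
    x = np.asarray(x, float)
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # spread initial means
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: log responsibilities, normalized stably per sample
        logp = (-0.5 * (x[:, None] - mu) ** 2 / var
                - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted maximum likelihood updates
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        pi = nk / len(x)
    return mu, var, pi
```

The paper's point is that the Gaussian component densities above are a poor fit for pixel data, motivating the swap to GAN-parameterized components.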


Author(s):  
P.L. Nikolaev

This article deals with a method for the binary classification of images containing small text. The classification is based on the fact that the text can have two orientations: it can be positioned horizontally and read from left to right, or it can be rotated 180 degrees, so that the image must be turned in order to read the text. This type of text can be found on the covers of a variety of books, so when recognizing covers it is necessary to first determine the orientation of the text before recognizing it directly. The article proposes the development of a deep neural network for determining the text orientation in the context of book cover recognition. The results of training and testing a convolutional neural network on synthetic data, as well as examples of the network operating on real data, are presented.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
João Lobo ◽  
Rui Henriques ◽  
Sara C. Madeira

Abstract Background Three-way data started to gain popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions along time, urban dynamics, or complex geophysical phenomena. Triclustering, subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations × features × contexts). With an increasing number of algorithms being proposed, effectively comparing them with the state of the art is paramount. These comparisons are usually performed using real data, without a known ground truth, thus limiting the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator is prepared to create datasets resembling real three-way data from biomedical and social data domains, with the additional advantage of also providing the ground truth (the triclustering solution) as output. Results G-Tric can replicate real-world datasets and create new ones that match researchers' needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlap). Data quality can also be controlled by defining the amount of missing values, noise, or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters. Conclusions Triclustering evaluation using G-Tric makes it possible to combine both intrinsic and extrinsic metrics to compare solutions, producing more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties, was generated and made available, highlighting G-Tric's potential to advance the triclustering state of the art by easing the process of evaluating the quality of new triclustering approaches.
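The plant-a-tricluster idea can be sketched in a few lines: draw a background tensor, then overwrite a random observations × features × contexts subspace with a pattern and return the planted index sets as ground truth. This is only the simplest case (one constant tricluster, Gaussian background); G-Tric itself supports many pattern types, symbolic data, overlap, and noise controls. All names and defaults below are assumptions of this sketch.

```python
import numpy as np

def plant_tricluster(shape=(20, 15, 5), size=(4, 3, 2), value=5.0, seed=0):
    """Generate a numeric three-way dataset with one planted constant
    tricluster, in the spirit of G-Tric's generator. Returns the tensor and
    the planted index sets (the triclustering ground truth)."""
    rng = np.random.default_rng(seed)
    data = rng.normal(0.0, 1.0, shape)          # background distribution
    obs = rng.choice(shape[0], size[0], replace=False)
    feats = rng.choice(shape[1], size[1], replace=False)
    ctxs = rng.choice(shape[2], size[2], replace=False)
    data[np.ix_(obs, feats, ctxs)] = value      # plant a constant pattern
    return data, (obs, feats, ctxs)
```

Because the planted index sets are returned alongside the data, extrinsic metrics (e.g. recovery scores against the ground truth) become possible, which is exactly the evaluation gap the paper addresses.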

