An Improved Flexible Partial Histogram Bayes Learning Algorithm

Author(s):  
Haider O. Lawend ◽  
Anuar Muad ◽  
Aini Hussain

<em>This paper presents a proposed supervised classification technique, the flexible partial histogram Bayes (fPHBayes) learning algorithm. In our previous work, the partial histogram Bayes (PHBayes) learning algorithm showed advantages in speed and accuracy on classification tasks. However, its accuracy declines when dealing with a small number of instances or when the class features are distributed over a wide area. In this work, the proposed fPHBayes addresses these limitations to increase classification accuracy. fPHBayes was analyzed and compared with PHBayes and other standard learning algorithms such as first nearest neighbor, nearest subclass mean, nearest class mean, naive Bayes, and the Gaussian mixture model classifier. The experiments were performed on both real and synthetic data, considering different numbers of instances and different Gaussian variances. The results showed that fPHBayes is more accurate than PHBayes and more flexible in dealing with different numbers of instances and different Gaussian variances.</em>
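The abstract does not publish the fPHBayes mechanics, but the general idea of a histogram-based Bayes classifier can be sketched: estimate per-class, per-feature densities with histograms and classify by the largest posterior. This is a generic illustration only, not the authors' fPHBayes or PHBayes algorithm; the class interface and bin count are assumptions.

```python
import numpy as np

class HistogramNaiveBayes:
    """Naive Bayes with per-class, per-feature histogram density estimates.
    Generic sketch only; NOT the paper's fPHBayes/PHBayes algorithm."""

    def __init__(self, n_bins=10):
        self.n_bins = n_bins

    def fit(self, X, y):
        X, y = np.asarray(X, float), np.asarray(y)
        self.classes_ = np.unique(y)
        # shared bin edges per feature, spanning the observed range
        self.edges_ = [np.linspace(X[:, j].min(), X[:, j].max(), self.n_bins + 1)
                       for j in range(X.shape[1])]
        self.hists_, self.priors_ = {}, {}
        for c in self.classes_:
            Xc = X[y == c]
            self.priors_[c] = len(Xc) / len(X)
            # Laplace-smoothed bin probabilities per feature
            self.hists_[c] = [
                (np.histogram(Xc[:, j], bins=self.edges_[j])[0] + 1.0)
                / (len(Xc) + self.n_bins)
                for j in range(X.shape[1])]
        return self

    def predict(self, X):
        X = np.asarray(X, float)
        out = []
        for x in X:
            scores = {}
            for c in self.classes_:
                logp = np.log(self.priors_[c])
                for j, v in enumerate(x):
                    b = int(np.clip(np.searchsorted(self.edges_[j], v) - 1,
                                    0, self.n_bins - 1))
                    logp += np.log(self.hists_[c][j][b])
                scores[c] = logp
            out.append(max(scores, key=scores.get))
        return np.array(out)
```

A fixed bin count is exactly the kind of rigidity the paper targets: too few instances per bin makes the histogram noisy, which motivates a more flexible partial-histogram scheme.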

2020 ◽  
Vol 223 (3) ◽  
pp. 1565-1583
Author(s):  
Hoël Seillé ◽  
Gerhard Visser

SUMMARY Bayesian inversion of magnetotelluric (MT) data is a powerful but computationally expensive approach to estimate the subsurface electrical conductivity distribution and associated uncertainty. Approximating the Earth subsurface with 1-D physics considerably speeds up calculation of the forward problem, making the Bayesian approach tractable, but can lead to biased results when the assumption is violated. We propose a methodology to quantitatively compensate for the bias caused by the 1-D Earth assumption within a 1-D trans-dimensional Markov chain Monte Carlo sampler. Our approach determines site-specific likelihood functions, which are calculated using a dimensionality discrepancy error model derived by a machine learning algorithm trained on a set of synthetic 3-D conductivity training images. This is achieved by exploiting known geometrical dimensional properties of the MT phase tensor. A complex synthetic model that mimics a sedimentary basin environment is used to illustrate the ability of our workflow to reliably estimate uncertainty in the inversion results, even in the presence of strong 2-D and 3-D effects. Using this dimensionality discrepancy error model, we demonstrate that on this synthetic data set our workflow performs better in 80 per cent of cases than the existing practice of using constant errors. Finally, our workflow is benchmarked against real data acquired in Queensland, Australia, and shown to detect the depth to basement accurately.


2021 ◽  
Author(s):  
Gothai E ◽  
Usha Moorthy ◽  
Sathishkumar V E ◽  
Abeer Ali Alnuaim ◽  
Wesam Atef Hatamleh ◽  
...  

Abstract With the evolution of Internet standards and advancements in various Internet and mobile technologies, especially since web 4.0, more and more web and mobile applications emerge, such as e-commerce, social networks, online gaming applications, and Internet of Things based applications. Due to the deployment and concurrent access of these applications on the Internet and mobile devices, the amount and variety of data generated increase exponentially, and the new era of Big Data has come into existence. Presently available data structures and data analysis algorithms are not capable of handling such Big Data. Hence, there is a need for scalable, flexible, parallel, and intelligent data analysis algorithms to handle and analyze complex massive data. In this article, we propose a novel distributed supervised machine learning algorithm based on the MapReduce programming model and the Distance Weighted k-Nearest Neighbor algorithm, called MR-DWkNN, to process and analyze Big Data in a Hadoop cluster environment. The proposed distributed algorithm performs both regression and classification tasks on large volumes of Big Data. Three performance metrics, Root Mean Squared Error (RMSE) and the coefficient of determination (R2) for regression tasks, and Accuracy for classification tasks, are used to measure the performance of the proposed MR-DWkNN algorithm. Extensive experimental results show an average improvement of 3–4.5% in prediction and classification performance compared to the standard distributed k-NN algorithm, as well as a considerable decrease in RMSE, with good parallelism characteristics of scalability and speedup, which proves its effectiveness for Big Data prediction and classification applications.
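The distance-weighted voting at the core of DWkNN can be sketched on a single node: each of the k nearest neighbors votes with weight inversely proportional to its distance. This is only the per-record computation; the paper's contribution is distributing it over MapReduce on Hadoop, which is not shown here. Function name and the 1/(d+eps) weighting are assumptions.

```python
import numpy as np

def dw_knn_predict(X_train, y_train, x, k=3, eps=1e-9):
    """Distance-weighted k-NN classification: each of the k nearest
    neighbors votes with weight 1/(distance + eps), so closer neighbors
    count more than distant ones. Single-node sketch of the voting rule."""
    d = np.linalg.norm(np.asarray(X_train, float) - np.asarray(x, float), axis=1)
    nn = np.argsort(d)[:k]                      # indices of k nearest neighbors
    votes = {}
    for i in nn:
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + 1.0 / (d[i] + eps)
    return max(votes, key=votes.get)            # class with largest total weight
```

In the MapReduce setting, mappers would compute partial neighbor lists over their data splits and a reducer would merge them before the weighted vote.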


2018 ◽  
Vol 8 (9) ◽  
pp. 1676 ◽  
Author(s):  
Vincent Gripon ◽  
Matthias Löwe ◽  
Franck Vermet

Nearest neighbor search is a very active field in machine learning. It appears in many application cases, including classification and object retrieval. In its naive implementation, the complexity of the search is linear in the product of the dimension and the cardinality of the collection of vectors in which the search is performed. Recently, many works have focused on reducing the dimension of vectors using quantization techniques or hashing, while providing an approximate result. In this paper, we focus instead on tackling the cardinality of the collection of vectors. Namely, we introduce a technique that partitions the collection of vectors and stores each part in its own associative memory. When a query vector is given to the system, the associative memories are polled to identify which one contains the closest match. Then, an exhaustive search is conducted only on the part of the vectors stored in the selected associative memory. We study the effectiveness of the system when the messages to store are generated from i.i.d. uniform ±1 random variables or 0–1 sparse i.i.d. random variables. We also conduct experiments on both synthetic data and real data and show that it is possible to achieve interesting trade-offs between complexity and accuracy.
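The poll-then-exhaustive-search structure can be sketched as follows. The paper polls associative memories; here a nearest-centroid heuristic over coherent parts stands in for that polling step, which is a deliberate simplification (the partitioning rule, the centroid poll, and the function names are all assumptions, not the paper's construction).

```python
import numpy as np

def build_partitions(X, n_parts):
    """Split the collection into coherent parts (sorted by first coordinate,
    for the sake of the toy centroid poll) and store each part with its
    centroid. Stand-in for the paper's associative memories."""
    order = np.argsort(X[:, 0])
    parts = np.array_split(order, n_parts)
    return [(X[p], X[p].mean(axis=0)) for p in parts]

def approx_nn(parts, q):
    """Poll parts (nearest centroid wins), then run an exhaustive search
    only inside the selected part."""
    vecs, _ = min(parts, key=lambda pc: np.linalg.norm(pc[1] - q))
    d = np.linalg.norm(vecs - q, axis=1)
    return vecs[np.argmin(d)]
```

The complexity gain is the point: the exhaustive distance computation touches only one part instead of the whole collection, at the risk of polling the wrong part (the approximation the paper quantifies).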


2013 ◽  
Vol 748 ◽  
pp. 590-594
Author(s):  
Li Liao ◽  
Yong Gang Lu ◽  
Xu Rong Chen

We propose a novel density estimation method that uses both the k-nearest neighbor (KNN) graph and the potential field of the data points to capture local and global data distribution information, respectively. Clustering is performed based on the computed density values: a forest of trees is built with each data point as a tree node, and the clusters are formed from the trees in the forest. The new clustering method is evaluated by comparison with three popular clustering methods: K-means++, Mean Shift, and DBSCAN. Experiments on two synthetic data sets and one real data set show that our approach effectively improves the clustering results.
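The local (kNN) half of such a density estimate can be sketched directly: a point surrounded by close neighbors gets a high density score. This shows only the kNN-distance component; the paper additionally combines it with a global potential field, which is omitted here.

```python
import numpy as np

def knn_density(X, k=3):
    """Per-point density estimate: inverse of the mean distance to the k
    nearest neighbors. Points in dense regions have small kNN distances and
    hence large density values. Local component only; the paper also uses a
    global potential field."""
    X = np.asarray(X, float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # sort each row and drop column 0, which is the zero self-distance
    knn_d = np.sort(D, axis=1)[:, 1:k + 1]
    return 1.0 / (knn_d.mean(axis=1) + 1e-12)
```

Tree building then amounts to pointing each node at a nearby higher-density node; points with no higher-density neighbor within reach become tree roots, i.e. cluster seeds.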


Author(s):  
Amparo Baillo ◽  
Antonio Cuevas ◽  
Ricardo Fraiman

This article reviews the literature concerning supervised and unsupervised classification of functional data. It first explains the meaning of unsupervised classification vs. supervised classification before discussing the supervised classification problem in the infinite-dimensional case, showing that its formal statement generally coincides with that of discriminant analysis in the classical multivariate case. It then considers the optimal classifier and plug-in rules, empirical risk and empirical minimization rules, linear discrimination rules, the k nearest neighbor (k-NN) method, and kernel rules. It also describes classification based on partial least squares, classification based on reproducing kernels, and depth-based classification. Finally, it examines unsupervised classification methods, focusing on K-means for functional data, K-means for data in a Hilbert space, and impartial trimmed K-means for functional data. Some practical issues, in particular real-data examples and simulations, are reviewed and some selected proofs are given.


2021 ◽  
Vol 1 (1) ◽  
pp. 10-18
Author(s):  
Anggi Priliani Yulianto ◽  
Sutawanir Darwis

Abstract. Monitoring the condition of the engine is a top priority to avoid damage. To know the condition of a bearing, it is important to know the remaining useful life of the machine. The IEEE PHM 2012 Prognostic Challenge platform provides real data on accelerated bearing degradation carried out under constant operating conditions, with online monitoring of the controlled variables temperature and vibration (via horizontal and vertical accelerometers). From this platform, the data used are the bearing2_3 data in the horizontal direction, lasting about 2 hours, with RMS computed every 1/10 second (2560 values). In this study, machine-learning-based modeling is performed using the k-nearest neighbor (kNN) method to predict the bearing RMS. The kNN method classifies an object according to the training samples that are closest to it. kNN is a nonparametric machine learning algorithm, i.e. a model that assumes no particular distribution; its advantage is that the class decision boundary it produces can be very flexible and highly nonlinear. The smallest MSE was obtained at k = 16 (MSE = 0.157579). After obtaining the optimal k, the RMS is predicted for 97 lags and the bearing performance is identified in several phases.
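kNN regression on a time series can be sketched as follows: find the k historical windows most similar to the most recent one and average their successors. This is an illustrative one-step forecaster, not the study's exact pipeline; the window length and k here are assumptions (the study tuned k by MSE, finding k = 16 optimal on its data).

```python
import numpy as np

def knn_forecast(series, k=3, window=4):
    """One-step-ahead kNN regression on a univariate series (e.g. bearing
    RMS): locate the k past windows closest to the latest window and return
    the mean of the values that followed them. Illustrative sketch."""
    s = np.asarray(series, float)
    target = s[-window:]                                   # latest window
    cands = np.array([s[i:i + window] for i in range(len(s) - window)])
    succ = s[window:]                                      # value after each window
    d = np.linalg.norm(cands - target, axis=1)
    nn = np.argsort(d)[:k]
    return succ[nn].mean()
```

Iterating this forecaster, feeding each prediction back into the series, yields the multi-lag (e.g. 97-lag) predictions described in the abstract; note that kNN cannot extrapolate beyond values seen in training.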


2006 ◽  
Vol 15 (03) ◽  
pp. 353-370 ◽  
Author(s):  
TIE-FEI LIU ◽  
WING-KIN SUNG ◽  
ANKUSH MITTAL

Exact determination of a gene network is required to discover the higher-order structures of an organism and to interpret its behavior. Most research work in learning gene networks either assumes that there is no time delay in gene expression or that there is a constant time delay. This paper shows how Bayesian Networks can be applied to represent multi-time delay relationships as well as directed loops. The intractability of the network learning algorithm is handled by using an improved mutual information criterion. Also, a new structure learning algorithm, "Learning By Modification", is proposed to learn the sparse structure of a gene network. The experimental results on synthetic data and real data show that our method is more accurate in determining the gene structure as compared to the traditional methods. Even transcriptional loops spanning over the whole cell can be detected by our algorithm.


Author(s):  
Wentian Zhao ◽  
Shaojie Wang ◽  
Zhihuai Xie ◽  
Jing Shi ◽  
Chenliang Xu

The expectation maximization (EM) algorithm finds maximum likelihood solutions for models with latent variables. A typical example is the Gaussian Mixture Model (GMM), which requires a Gaussian assumption; however, natural images are highly non-Gaussian, so GMM cannot be applied to image clustering in pixel space. To overcome this limitation, we propose a GAN-based EM learning framework that can maximize the likelihood of images and estimate the latent variables. We call this model GAN-EM; it is a framework for image clustering, semi-supervised classification, and dimensionality reduction. In the M-step, we design a novel loss function for the discriminator of the GAN to perform maximum likelihood estimation (MLE) on data with soft class label assignments. Specifically, a conditional generator captures the data distribution for K classes, and a discriminator tells whether a sample is real or fake for each class. Since our model is unsupervised, the class label of real data is regarded as a latent variable, which is estimated by an additional network (E-net) in the E-step. The proposed GAN-EM achieves state-of-the-art clustering and semi-supervised classification results on MNIST, SVHN, and CelebA, as well as quality of generated images comparable to other recently developed generative models.
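The classical EM-for-GMM baseline that GAN-EM generalizes can be shown in a few lines for the 1-D case: the E-step computes soft responsibilities, the M-step performs responsibility-weighted maximum likelihood updates. A minimal sketch (quantile initialization and the variance floor are implementation choices of this sketch, not from the paper):

```python
import numpy as np

def gmm_em_1d(x, k=2, iters=50):
    """Classical EM for a 1-D Gaussian mixture. E-step: responsibilities
    r[i, j] = P(component j | x_i); M-step: weighted MLE for means,
    variances, and mixing weights. GAN-EM replaces these parametric steps
    with a generator/discriminator (M) and an E-net (E)."""
    x = np.asarray(x, float)
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # spread initial means
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: log responsibilities, normalized stably per sample
        logp = (-0.5 * (x[:, None] - mu) ** 2 / var
                - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted maximum likelihood updates
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        pi = nk / len(x)
    return mu, var, pi
```

The paper's point is that the Gaussian component densities above are a poor fit for pixel data, motivating the swap to GAN-parameterized components.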


Author(s):  
P.L. Nikolaev

This article deals with a method for the binary classification of images containing small text. The classification is based on the fact that the text can have two orientations: it can be positioned horizontally and read from left to right, or it can be rotated 180 degrees, so that the image must be turned in order to read the text. This type of text can be found on the covers of a variety of books, so when recognizing covers it is necessary to first determine the orientation of the text before recognizing it directly. The article proposes the development of a deep neural network for determining the text orientation in the context of book cover recognition. The results of training and testing a convolutional neural network on synthetic data, as well as examples of the network operating on real data, are presented.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
João Lobo ◽  
Rui Henriques ◽  
Sara C. Madeira

Abstract Background Three-way data started to gain popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions along time, urban dynamics, or complex geophysical phenomena. Triclustering, subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations × features × contexts). With an increasing number of algorithms being proposed, effectively comparing them with the state of the art is paramount. These comparisons are usually performed using real data, without a known ground truth, thus limiting the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator is prepared to create datasets resembling real three-way data from biomedical and social data domains, with the additional advantage of also providing the ground truth (the triclustering solution) as output. Results G-Tric can replicate real-world datasets and create new ones that match researchers' needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlap). Data quality can also be controlled by defining the amount of missing values, noise, or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters. Conclusions Triclustering evaluation using G-Tric makes it possible to combine both intrinsic and extrinsic metrics to compare solutions, producing more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties, was generated and made available, highlighting G-Tric's potential to advance the triclustering state of the art by easing the process of evaluating the quality of new triclustering approaches.
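The plant-a-tricluster idea can be sketched in a few lines: draw a background tensor, then overwrite a random observations × features × contexts subspace with a pattern and return the planted index sets as ground truth. This is only the simplest case (one constant tricluster, Gaussian background); G-Tric itself supports many pattern types, symbolic data, overlap, and noise controls. All names and defaults below are assumptions of this sketch.

```python
import numpy as np

def plant_tricluster(shape=(20, 15, 5), size=(4, 3, 2), value=5.0, seed=0):
    """Generate a numeric three-way dataset with one planted constant
    tricluster, in the spirit of G-Tric's generator. Returns the tensor and
    the planted index sets (the triclustering ground truth)."""
    rng = np.random.default_rng(seed)
    data = rng.normal(0.0, 1.0, shape)          # background distribution
    obs = rng.choice(shape[0], size[0], replace=False)
    feats = rng.choice(shape[1], size[1], replace=False)
    ctxs = rng.choice(shape[2], size[2], replace=False)
    data[np.ix_(obs, feats, ctxs)] = value      # plant a constant pattern
    return data, (obs, feats, ctxs)
```

Because the planted index sets are returned alongside the data, extrinsic metrics (e.g. recovery scores against the ground truth) become possible, which is exactly the evaluation gap the paper addresses.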

