LEARNING DECISION TREES WITH LOG CONDITIONAL LIKELIHOOD

Author(s):  
HAN LIANG ◽  
YUHONG YAN ◽  
HARRY ZHANG

In machine learning and data mining, traditional learning models aim for high classification accuracy. However, in many practical applications, such as medical diagnosis, accurate class probability prediction is more desirable than classification accuracy alone. Although it is known that decision trees can be adapted into class probability estimators in a variety of ways, and the resulting models are uniformly called Probability Estimation Trees (PETs), the performance of these PETs in class probability estimation has not yet been investigated. We begin our research by empirically studying PETs in terms of class probability estimation, measured by Log Conditional Likelihood (LCL). We also compare a PET called C4.4 with other representative models, including Naïve Bayes, Naïve Bayes Tree, Bayesian Network, KNN and SVM, in terms of LCL. From our experiments, we draw several valuable conclusions. First, among the tree-based models, C4.4 yields the most precise class probability predictions as measured by LCL. We provide an explanation for this and reveal the nature of LCL. Second, C4.4 also performs best compared with the non-tree-based models. Finally, LCL does not dominate another well-established, relevant metric, AUC, which suggests that different decision-tree learning models should be used for different objectives. Our experiments are conducted on 36 UCI sample sets, and all models are run within the Weka machine learning platform. We also explore an approach to improving the class probability estimation of Naïve Bayes Tree. We propose a greedy and recursive learning algorithm in which, at each step, LCL is used as the scoring function to expand the decision tree. The algorithm uses Naïve Bayes models created at the leaves to estimate the class probabilities of test samples, so the whole tree encodes the posterior class probabilities in its structure. One benefit of improving class probability estimation is that both classification accuracy and AUC can potentially be improved as well. We call the new model LCL Tree (LCLT). Our experiments on 33 UCI sample sets show that LCLT significantly outperforms state-of-the-art learning models, such as Naïve Bayes Tree, in accurate class probability prediction measured by LCL, as well as in classification accuracy and AUC.
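
For reference, LCL over a test set is the sum of the log-probabilities that a model assigns to each instance's true class: LCL = sum_i log P(c_i | e_i). A minimal sketch of the metric follows; the function name and array layout are illustrative, not from the paper:

    import numpy as np

    def log_conditional_likelihood(class_probs, true_labels, eps=1e-12):
        """LCL = sum_i log P(c_i | e_i) over test instances e_i with true classes c_i.
        class_probs: (n_samples, n_classes) predicted class probabilities.
        true_labels: (n_samples,) integer class indices."""
        p = np.clip(class_probs[np.arange(len(true_labels)), true_labels], eps, 1.0)
        return np.log(p).sum()  # higher (closer to 0) means better probability estimates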

2012 ◽  
Vol 26 ◽  
pp. 239-245 ◽  
Author(s):  
Liangxiao Jiang ◽  
Zhihua Cai ◽  
Dianhong Wang ◽  
Harry Zhang


Author(s):  
Arturo Rodriguez ◽  
Carlos R. Cuellar ◽  
Luis F. Rodriguez ◽  
Armando Garcia ◽  
V. S. Rao Gudimetla ◽  
...  

Abstract Large Eddy Simulation (LES) modeling of turbulence effects is computationally expensive even when not all scales are resolved, especially in the presence of deep turbulence effects in the atmosphere. Machine learning techniques provide a novel way to propagate effects from the inner to the outer scale of the atmospheric turbulence spectrum and to accelerate its characterization for long-distance laser propagation. Using the LES method, we simulated the turbulent flow of atmospheric air in an idealized box with a temperature difference of about 27 degrees Celsius between the lower and upper surfaces. The volume was voxelized, and several quantities, such as velocity, temperature, and pressure, were obtained at regularly spaced grid points. These values were binned and converted into symbols that were concatenated along the length of the box to create a 'text', which was used to train a long short-term memory (LSTM) neural network and to propose a way to use a naive Bayes model. LSTMs are used in speech and handwriting recognition tasks, and naïve Bayes is used extensively in text categorization. The trained LSTM and naïve Bayes models were used to generate instances of turbulent-like flows. Errors are quantified and portrayed as differences, which lets our studies track the error introduced by stochastic generative machine learning models, given that our LES runs provide state-of-the-art, high-fidelity approximate solutions of the Navier-Stokes equations. In the present work, LES solutions are imitated by, and compared against, generative machine learning models.
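
The symbolization step described above can be pictured with a short sketch; the bin count, symbol alphabet, and synthetic data below are assumptions for illustration, not the authors' settings:

    import numpy as np

    def to_symbols(values, n_bins=16):
        """Bin a 1-D field (e.g., temperature at grid points along the box) into
        equal-width bins and map each bin to a character, yielding a 'text'
        suitable for sequence models such as an LSTM or a naive Bayes text model."""
        edges = np.linspace(values.min(), values.max(), n_bins + 1)
        idx = np.clip(np.digitize(values, edges) - 1, 0, n_bins - 1)
        alphabet = "abcdefghijklmnop"[:n_bins]
        return "".join(alphabet[i] for i in idx)

    text = to_symbols(np.random.normal(300.0, 5.0, size=1000))  # stand-in for an LES field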


Author(s):  
Sikha Bagui ◽  
Keerthi Devulapalli ◽  
Sharon John

This study presents an efficient way to deal with discrete as well as continuous values in Big Data in a parallel Naïve Bayes implementation on Hadoop's MapReduce environment. Two approaches were taken: (i) discretizing continuous values using a binning method; and (ii) using a multinomial distribution for probability estimation of discrete values and a Gaussian distribution for probability estimation of continuous values. The models were analyzed and compared for performance with respect to run time and classification accuracy for varying data sizes, data block sizes, and map memory sizes.
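
The two approaches can be illustrated in miniature outside MapReduce; the scikit-learn stand-ins below are an assumption for illustration, since the paper's implementation runs on Hadoop:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB, MultinomialNB
    from sklearn.preprocessing import KBinsDiscretizer

    X = np.random.rand(200, 3)        # toy continuous features
    y = np.random.randint(0, 2, 200)  # toy class labels

    # Approach (i): discretize continuous values into bins, then treat them as counts.
    binned = KBinsDiscretizer(n_bins=5, encode="onehot-dense",
                              strategy="uniform").fit_transform(X)
    nb_binned = MultinomialNB().fit(binned, y)

    # Approach (ii): model continuous values directly with one Gaussian per class.
    nb_gauss = GaussianNB().fit(X, y)

    print(nb_binned.predict_proba(binned[:3]), nb_gauss.predict_proba(X[:3]))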


2021 ◽  
Vol 11 ◽  
Author(s):  
Jianyong Wu ◽  
Conghe Song ◽  
Eric A. Dubinsky ◽  
Jill R. Stewart

Current microbial source tracking techniques that rely on grab samples analyzed by individual endpoint assays are inadequate to explain microbial sources across space and time. Modeling and predicting the host sources of microbial contamination could add a useful tool for watershed management. In this study, we tested and evaluated machine learning models for predicting the major sources of microbial contamination in a watershed. We examined the relationship between microbial sources, land cover, weather, and hydrologic variables in a watershed in Northern California, United States. Six models (K-nearest neighbors (KNN), Naïve Bayes, support vector machine (SVM), a simple neural network (NN), Random Forest, and XGBoost) were built to predict the major microbial sources using land cover, weather, and hydrologic variables. The results showed that these models successfully predicted microbial sources classified into two categories (human and non-human), with average accuracy ranging from 69% (Naïve Bayes) to 88% (XGBoost). The area under the receiver operating characteristic (ROC) curve (AUC) showed that XGBoost had the best performance (average AUC = 0.88), followed by Random Forest (average AUC = 0.84) and KNN (average AUC = 0.74). The importance index obtained from Random Forest indicated that precipitation and temperature were the two most important factors for predicting the dominant microbial source. These results suggest that machine learning models, particularly XGBoost, can predict the dominant sources of microbial contamination from the relationship of microbial contaminants with daily weather and land cover, providing a powerful tool for understanding microbial sources in water.
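
The comparison workflow can be sketched as follows; the synthetic features stand in for the study's land cover, weather, and hydrologic predictors, and only scikit-learn models are shown (XGBoost would be a separate dependency):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score

    # Binary target: human vs. non-human microbial source.
    X, y = make_classification(n_samples=500, n_features=8, random_state=0)
    models = {"KNN": KNeighborsClassifier(),
              "Naive Bayes": GaussianNB(),
              "Random Forest": RandomForestClassifier(random_state=0)}
    for name, model in models.items():
        auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
        print(f"{name}: mean AUC = {auc:.2f}")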


2019 ◽  
Vol 31 (05) ◽  
pp. 1950040 ◽  
Author(s):  
Marwa Mostafa Abd El Hamid ◽  
Mai S. Mabrouk ◽  
Yasser M. K. Omar

Alzheimer’s disease (AD) is an irreversible, progressive disorder that attacks the nerve cells of the brain. It is the most widely recognized kind of dementia among older adults. Apolipoprotein E (APOE) is one of the most common genetic risk factors for AD, and its significant association with AD has been observed in various genome-wide association studies (GWAS). Single nucleotide polymorphisms (SNPs) are the most common type of genetic variation among individuals and are related to many common diseases, including AD. SNPs are recognized as significant biomarkers for this disease; they help in understanding and detecting the disease in its early stages, and detecting SNP biomarkers associated with the disease with high classification accuracy leads to early prediction and diagnosis. Machine learning techniques are utilized to discover new biomarkers of the disease. The sequential minimal optimization (SMO) algorithm with different kernels, Naive Bayes (NB), tree-augmented Naive Bayes (TAN), and the K2 learning algorithm have been applied to all genetic data of the Alzheimer’s disease neuroimaging initiative phase 1 (ADNI-1) and whole genome sequencing (WGS) datasets. The highest classification accuracy was achieved using the 500 SNPs selected by p-value threshold. In the whole-genome ADNI-1 approach, results revealed that the NB and K2 learning algorithms scored overall accuracies of 98% and 98.40%, respectively. In the whole-genome WGS approach, the NB and K2 learning algorithms scored overall accuracies of 99.63% and 99.75%, respectively.
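
The select-then-classify pipeline reads as a standard feature-selection step followed by a classifier; in the hedged sketch below, a chi-squared score stands in for the GWAS association p-values, and the genotype matrix is synthetic:

    import numpy as np
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.model_selection import cross_val_score

    X = np.random.randint(0, 3, size=(300, 5000))  # SNPs coded as minor-allele counts 0/1/2
    y = np.random.randint(0, 2, 300)               # case/control label

    # Keep the 500 SNPs most associated with the label (proxy for a p-value threshold).
    X_top = SelectKBest(chi2, k=500).fit_transform(X, y)
    print(cross_val_score(MultinomialNB(), X_top, y, cv=5).mean())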


Author(s):  
LIANGXIAO JIANG ◽  
CHAOQUN LI ◽  
ZHIHUA CAI

Traditionally, the performance of a classifier is measured by its classification accuracy or error rate. In fact, probability-based classifiers also produce a class probability estimate (the probability that a test instance belongs to the predicted class). This information is often ignored in classification, as long as the class with the highest class probability estimate is identical to the actual class. In many data mining applications, however, classification accuracy and error rate are not enough. For example, in direct marketing, we often need to deploy different promotion strategies to customers with different likelihoods (class probabilities) of buying some product. Thus, accurate class probability estimates are often required to make optimal decisions. In this paper, we first review some state-of-the-art probability-based classifiers and empirically investigate their class probability estimation performance. From our experimental results, we can draw a conclusion: C4.4 is an attractive algorithm for class probability estimation. We then present a locally weighted version of C4.4 that improves its class probability estimation performance by combining locally weighted learning with C4.4. We call the improved algorithm locally weighted C4.4, or LWC4.4 for short. We experimentally test LWC4.4 on the 36 UCI data sets selected by Weka. The experimental results show that LWC4.4 significantly outperforms C4.4 in terms of class probability estimation.
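
The locally weighted idea can be sketched as follows: for each test instance, a tree is fit only to that instance's nearest neighbors, weighted by distance. scikit-learn's DecisionTreeClassifier stands in for C4.4 here, an approximation since C4.4 specifically turns off pruning and applies Laplace correction at the leaves:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from sklearn.tree import DecisionTreeClassifier

    def lw_tree_proba(X_train, y_train, x_test, k=50):
        """Fit a tree on the k nearest neighbors of x_test, weighting closer
        neighbors more heavily, and return its class probability estimates."""
        nn = NearestNeighbors(n_neighbors=k).fit(X_train)
        dist, idx = nn.kneighbors(x_test.reshape(1, -1))
        weights = 1.0 / (1.0 + dist[0])  # simple inverse-distance weighting
        tree = DecisionTreeClassifier()  # stand-in for an unpruned, Laplace-corrected C4.4
        tree.fit(X_train[idx[0]], y_train[idx[0]], sample_weight=weights)
        return tree.predict_proba(x_test.reshape(1, -1))[0]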


Molecules ◽  
2019 ◽  
Vol 24 (11) ◽  
pp. 2115 ◽  
Author(s):  
Thomas M. Kaiser ◽  
Pieter B. Burger

Machine learning continues to make significant strides in the prediction of properties desired in drug development. Problematically, the efficacy of machine learning in these arenas relies on highly accurate and abundant data. These two requirements, high accuracy and abundance, are often considered together; however, insight into how dataset accuracy limits contemporary machine learning algorithms may indicate whether non-bench experimental sources of data can be used to generate useful machine learning models where experimental data are scarce. We took highly accurate data across six kinase types, one GPCR, one polymerase, a human protease, and HIV protease, and intentionally introduced error into the datasets for each target at varying population proportions. With the generated error in the data, we explored how the retrospective accuracy of a Naïve Bayes Network, a Random Forest model, and a Probabilistic Neural Network model decayed as a function of error. Additionally, we explored the ability of a training dataset with an error profile resembling that produced by the Free Energy Perturbation method (FEP+) to generate machine learning models with useful retrospective capabilities. The categorical error tolerance was quite high for the Naïve Bayes Network algorithm: on average, 39% error in the training set was required before predictivity on the test set was lost. A Random Forest likewise tolerated a significant degree of categorical error in the training set, with an average of 29% error required to lose predictivity. However, the Probabilistic Neural Network algorithm did not tolerate as much categorical error, requiring an average of only 20% error to lose predictivity. Finally, we found that both a Naïve Bayes Network and a Random Forest could use datasets with an error profile resembling that of FEP+. This work demonstrates that computational methods with a known error distribution, such as FEP+, may be useful for generating machine learning models without extensive and expensive in vitro-generated datasets.
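
The noise-injection protocol can be reproduced in outline; the flip function, error rates, and synthetic data below are illustrative assumptions, not the authors' code:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    def flip_labels(y, error_rate, rng):
        """Introduce categorical error by flipping a fixed proportion of binary labels."""
        y_noisy = y.copy()
        idx = rng.choice(len(y), size=int(error_rate * len(y)), replace=False)
        y_noisy[idx] = 1 - y_noisy[idx]
        return y_noisy

    X, y = make_classification(n_samples=1000, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    rng = np.random.default_rng(0)
    for rate in (0.0, 0.1, 0.2, 0.3, 0.4):  # error goes into the training set only
        model = RandomForestClassifier(random_state=0).fit(X_tr, flip_labels(y_tr, rate, rng))
        print(f"train error {rate:.0%}: test accuracy {model.score(X_te, y_te):.2f}")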


2020 ◽  
Vol 197 ◽  
pp. 11014 ◽
Author(s):  
Antonio Capodieci ◽  
Antonio Caricato ◽  
Antonio Paolo Carlucci ◽  
Antonio Ficarella ◽  
Luca Mainetti ◽  
...  

Aircraft uptime is becoming increasingly important as transport solutions become more complex and the transport industry seeks new ways of being competitive. To reach this objective, traditional fleet management systems are gradually being extended with new features that improve reliability and thus enable better maintenance planning. The main goal of this work is the development of iterative algorithms based on artificial intelligence to define the engine removal plan and its maintenance work, optimizing engine availability at the customer and maintenance costs, as well as obtaining a procurement plan for integrated parts, with planning of interventions and implementation of a maintenance strategy. To reach this goal, machine learning has been applied to a workshop dataset with the aim of optimizing warehouse spare-part counts, costs, and lead time. This dataset consists of the repair history of a specific engine type, spanning several years and several fleets, and contains information such as the repair claim, engine working time, forensic evidence, and general information about processed spare parts. Using these data as input, several machine learning models have been built to predict the repair state of each spare part for better warehouse handling. A multi-label classification approach has been used to build and train, for each spare part, a machine learning model that predicts the part's repair state as a multiclass classifier does. Each classifier is asked to predict the repair state (classified as "Efficient", "Repaired" or "Replaced") of the corresponding part from two variables: the repair claim and the engine working time. Global results have then been evaluated using the confusion matrix, from which the accuracy, precision, recall, and F1-score metrics are retrieved, in order to analyse the cost of incorrect predictions. These metrics are calculated on test sets for each spare-part model, and a final single performance value is obtained by averaging the results. In this way, three machine learning models (Naïve Bayes, Logistic Regression, and Random Forest classifiers) are applied and their results compared. Naïve Bayes and Logistic Regression, which are fully probabilistic methods, have the best overall performance, with an accuracy of almost 80%, meaning the models are correct most of the time.
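
The per-part scheme amounts to one multiclass classifier per spare part, each trained on the same two inputs; a hedged sketch follows, where part names, claim codes, and data are placeholders:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    STATES = ["Efficient", "Repaired", "Replaced"]

    # Toy history for one part: (repair-claim code, engine working time) -> state index.
    X = np.array([[1, 1200.0], [2, 4300.0], [1, 8800.0], [3, 150.0]])
    y = np.array([0, 1, 2, 0])

    models = {}
    for part_id in ("fuel_pump", "turbine_blade"):  # one classifier per spare part;
        models[part_id] = GaussianNB().fit(X, y)    # in practice each part has its own history

    print(STATES[models["fuel_pump"].predict([[2, 5000.0]])[0]])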

