BoostTree and BoostForest for Ensemble Learning

Author(s):  
Changming Zhao ◽  
Dongrui Wu ◽  
Jian Huang ◽  
Ye Yuan ◽  
Hai-Tao Zhang ◽  
...  

Abstract Bootstrap aggregating (Bagging) and boosting are two popular ensemble learning approaches, which combine multiple base learners to generate a composite model for more accurate and more reliable performance. They have been widely used in biology, engineering, healthcare, etc. This article proposes BoostForest, an ensemble learning approach that uses BoostTrees as base learners and can be used for both classification and regression. BoostTree constructs a tree model by gradient boosting. It achieves high randomness (diversity) by sampling its parameters randomly from a parameter pool and selecting a random subset of features at node splitting. BoostForest further increases the randomness by bootstrapping the training data in constructing different BoostTrees. BoostForest outperformed four classical ensemble learning approaches (Random Forest, Extra-Trees, XGBoost and LightGBM) on 34 classification and regression datasets. Remarkably, BoostForest has only one hyper-parameter (the number of BoostTrees), which can be easily specified. Our code is publicly available, and the proposed ensemble learning framework can also be used to combine many other base learners.
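The two sources of randomness the abstract describes (bootstrap resampling of the training data, plus randomized base learners whose outputs are averaged) can be sketched in a few lines. This is an illustrative toy, with one-split regression stumps standing in for full BoostTrees, and all names are hypothetical:

```python
import random

def bootstrap(data, rng):
    # Sample len(data) points with replacement: the Bagging step that
    # gives each ensemble member a different view of the training set.
    return [rng.choice(data) for _ in data]

def fit_stump(data):
    # One-split regression stump: choose the threshold on x that
    # minimizes the squared error of the two leaf means.
    best = None
    for t in sorted({x for x, _ in data})[1:]:
        left = [y for x, y in data if x < t]
        right = [y for x, y in data if x >= t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x < t else mr

def fit_forest(data, n_trees=10, seed=0):
    # Each member is trained on its own bootstrap replicate.
    rng = random.Random(seed)
    return [fit_stump(bootstrap(data, rng)) for _ in range(n_trees)]

def predict(forest, x):
    # The forest prediction is the average of its members' outputs.
    return sum(tree(x) for tree in forest) / len(forest)

# Toy regression target: a step function at x = 5.
data = [(x, 1.0 if x >= 5 else 0.0) for x in range(10)]
forest = fit_forest(data)
```

The real method adds a further randomness source the sketch omits: each BoostTree also draws its hyper-parameters from a random parameter pool.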

Synlett ◽  
2020 ◽  
Author(s):  
Akira Yada ◽  
Kazuhiko Sato ◽  
Tarojiro Matsumura ◽  
Yasunobu Ando ◽  
Kenji Nagata ◽  
...  

Abstract The prediction of the initial reaction rate in the tungsten-catalyzed epoxidation of alkenes by using a machine learning approach is demonstrated. The ensemble learning framework used in this study consists of random sampling with replacement from the training dataset, the construction of several predictive models (weak learners), and the combination of their outputs. This approach enables us to obtain a reasonable prediction model that avoids the problem of overfitting, even when analyzing a small dataset.
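The framework described here (bootstrap resampling, weak learners, combined outputs) can be illustrated independently of the chemistry. A minimal sketch with deliberately weak mean-predicting learners standing in for the study's actual models:

```python
import random
import statistics

def weak_learner(train):
    # Deliberately weak regressor: always predict the training mean.
    mean = statistics.fmean(train)
    return lambda: mean

def ensemble_predict(data, n_learners, rng):
    # Random sampling with replacement, one weak learner per
    # replicate, then combine the learners' outputs by averaging.
    preds = []
    for _ in range(n_learners):
        replicate = [rng.choice(data) for _ in data]
        preds.append(weak_learner(replicate)())
    return statistics.fmean(preds)

rng = random.Random(42)
data = [rng.gauss(10.0, 2.0) for _ in range(30)]  # invented measurements
combined = ensemble_predict(data, n_learners=50, rng=rng)
```

Averaging across resampled replicates is what damps the variance of any single weak learner, which is how the approach resists overfitting on small datasets.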


Energies ◽  
2020 ◽  
Vol 13 (17) ◽  
pp. 4300
Author(s):  
Kosuke Sasakura ◽  
Takeshi Aoki ◽  
Masayoshi Komatsu ◽  
Takeshi Watanabe

Data centers (DCs) have become increasingly important in recent years, and highly efficient and reliable operation and management of DCs is now required. The heat density generated by racks and information and communication technology (ICT) equipment is predicted to increase in the future, so it is crucial to maintain an appropriate temperature environment in server rooms where high heat is generated in order to ensure continuous service. It is especially important to predict changes in rack intake temperature in the server room when the computer room air conditioner (CRAC) is shut down, which can cause a rapid rise in temperature. However, it is quite difficult to predict the rack temperature accurately, which in turn makes it difficult to determine the impact on service in advance. In this research, we propose a model that predicts the rack intake temperature after the CRAC is shut down. Specifically, we use machine learning to construct a gradient boosting decision tree model with data from the CRAC, ICT equipment, and rack intake temperature. Experimental results demonstrate that the proposed method has very high prediction accuracy: the coefficient of determination was 0.90 and the root mean square error (RMSE) was 0.54. Our model makes it possible to evaluate the impact on service and determine whether action to maintain the temperature environment is required. We also clarify the effect of the explanatory variables and training data on model accuracy.
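The two reported metrics can be reproduced from their standard definitions; the temperature values below are purely illustrative, not the study's data:

```python
import math

def rmse(actual, predicted):
    # Root mean square error, in the same unit as the measurements
    # (here, degrees of rack intake temperature).
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def r_squared(actual, predicted):
    # Coefficient of determination: 1 - residual SS / total SS.
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

actual = [24.0, 25.5, 27.0, 29.0, 31.0]     # invented temperatures
predicted = [24.3, 25.2, 27.4, 28.6, 31.2]
```

An RMSE of 0.54 means the model's temperature predictions were off by about half a degree on average, which is what makes the service-impact assessment actionable.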


2021 ◽  
Vol 87 (11) ◽  
pp. 841-852
Author(s):  
S. Boukir ◽  
L. Guo ◽  
N. Chehata

In this article, margin theory is exploited to design better ensemble classifiers for remote sensing data. A semi-supervised version of the ensemble margin is at the core of this work. Some major challenges in ensemble learning are investigated using this paradigm in the difficult context of land cover classification: selecting the most informative instances to form an appropriate training set, and selecting the best ensemble members. The main contribution of this work lies in the explicit use of the ensemble margin as a decision method to select training data and base classifiers in an ensemble learning framework. The selection of training data is achieved through an innovative iterative guided bagging algorithm exploiting low-margin instances. The overall classification accuracy is improved by up to 3%, with more dramatic improvement in per-class accuracy (up to 12%). The selection of ensemble base classifiers is achieved by an ordering-based ensemble-selection algorithm relying on an original margin-based criterion that also targets low-margin instances. This method reduces the complexity (ensemble size under 30) but maintains performance.
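A minimal sketch of the unsupervised ensemble margin (votes for the top class minus votes for the second, over total votes) and its use to pick low-margin instances; the class labels and vote counts are invented for illustration:

```python
from collections import Counter

def ensemble_margin(votes):
    # Unsupervised ensemble margin: (top votes - second votes) / total.
    # Values near 0 flag instances the ensemble disagrees on.
    common = Counter(votes).most_common(2)
    top = common[0][1]
    second = common[1][1] if len(common) > 1 else 0
    return (top - second) / len(votes)

def lowest_margin(instances, votes_per_instance, k):
    # Keep the k most ambiguous (lowest-margin) instances, e.g. to
    # feed an iterative guided bagging round.
    return sorted(instances, key=lambda i: ensemble_margin(votes_per_instance[i]))[:k]

votes = {  # 10 base classifiers voting on three land cover pixels
    "a": ["forest"] * 9 + ["water"],
    "b": ["urban"] * 5 + ["bare_soil"] * 5,
    "c": ["forest"] * 6 + ["urban"] * 4,
}
```

The same low-margin criterion serves both uses in the article: selecting training instances and ordering base classifiers for ensemble selection.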


Author(s):  
Gebreab K. Zewdie ◽  
David J. Lary ◽  
Estelle Levetin ◽  
Gemechu F. Garuma

Allergies to airborne pollen are a significant issue affecting millions of Americans. Consequently, accurately predicting the daily concentration of airborne pollen is of significant public benefit in providing timely alerts. This study presents a method for the robust estimation of the concentration of airborne Ambrosia pollen using a suite of machine learning approaches, including deep learning and ensemble learners. Each of these approaches utilizes data from the European Centre for Medium-Range Weather Forecasts (ECMWF) atmospheric weather and land surface reanalysis. The approaches used to develop the suite of empirical models are deep neural networks, extreme gradient boosting, random forests and Bayesian ridge regression. The training data, comprising twenty-four years of daily pollen concentration measurements together with ECMWF weather and land surface reanalysis data from 1987 to 2011, were used to develop the machine learning predictive models. The last six years of the dataset, from 2012 to 2017, were used to independently test the performance of the machine learning models. The correlation coefficients between the estimated and actual pollen abundance on the independent validation datasets for the deep neural networks, random forest, extreme gradient boosting and Bayesian ridge models were 0.82, 0.81, 0.81 and 0.75, respectively, showing that machine learning can be used to effectively forecast the concentrations of airborne pollen.
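The validation metric quoted for each model is the Pearson correlation coefficient, which follows directly from its definition; the observed/predicted values below are illustrative only:

```python
import math

def pearson_r(xs, ys):
    # Pearson correlation coefficient between observed and predicted
    # daily pollen concentrations.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

observed = [12.0, 40.0, 7.0, 55.0, 23.0]   # invented pollen counts
predicted = [15.0, 35.0, 10.0, 60.0, 20.0]
```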


Geophysics ◽  
2020 ◽  
Vol 85 (4) ◽  
pp. WA147-WA158
Author(s):  
Kaibo Zhou ◽  
Jianyu Zhang ◽  
Yusong Ren ◽  
Zhen Huang ◽  
Luanxiao Zhao

Lithology identification based on conventional well-logging data is of great importance for characterizing geologic features and evaluating reservoir quality in the exploration and production development of petroleum reservoirs. However, the traditional lithology identification process has some limitations: (1) building a model is very time consuming, so real-time lithology identification during well drilling cannot be realized; (2) modeling must be done by experienced geologists, which consumes considerable manpower and material resources; and (3) the imbalance of labeled data in well-log data may reduce the classification performance of the model. We have developed a gradient boosting decision tree (GBDT) algorithm combined with the synthetic minority oversampling technique (SMOTE) to realize fast and automatic lithology identification. First, the raw well-log data are normalized by a maximum-minimum normalization algorithm. Then, SMOTE is adopted to balance the number of samples in each class during training. Next, a lithology identification model is built by GBDT to fit the preprocessed training data set. Finally, the built model is verified with the testing data set. The experimental results indicate that the proposed approach improves lithology identification performance compared with other machine-learning approaches.
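The two preprocessing steps can be sketched as follows. Note this is a simplified SMOTE that interpolates between two randomly chosen minority samples, whereas real SMOTE interpolates from a sample toward one of its k nearest minority neighbours; all values are invented:

```python
import random

def min_max(column):
    # Maximum-minimum normalization of one log curve to [0, 1].
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def smote_like(minority, n_new, rng):
    # SMOTE-style oversampling, simplified: create a synthetic sample
    # on the line segment between two randomly chosen minority samples.
    new = []
    for _ in range(n_new):
        a, b = rng.choice(minority), rng.choice(minority)
        lam = rng.random()
        new.append([ai + lam * (bi - ai) for ai, bi in zip(a, b)])
    return new

rng = random.Random(1)
gamma_ray = [35.0, 80.0, 120.0, 60.0]              # invented log values
minority = [[0.2, 0.4], [0.3, 0.5], [0.25, 0.45]]  # rare-lithology samples
synthetic = smote_like(minority, 5, rng)
```

Because the synthetic points are convex combinations of real minority samples, they stay inside the minority class's region of feature space while raising its sample count.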


2020 ◽  
Vol 12 (20) ◽  
pp. 3292
Author(s):  
Sara Akodad ◽  
Lionel Bombrun ◽  
Junshi Xia ◽  
Yannick Berthoumieu ◽  
Christian Germain

Remote sensing image scene classification, which consists of labeling remote sensing images with a set of categories based on their content, has received remarkable attention for many applications such as land use mapping. Standard approaches are based on the multi-layer representation of first-order convolutional neural network (CNN) features. However, second-order CNNs have recently been shown to outperform traditional first-order CNNs for many computer vision tasks. Hence, the aim of this paper is to show the use of second-order statistics of CNN features for remote sensing scene classification. This takes the form of covariance matrices computed locally or globally on the output of a CNN. However, these data points do not lie in a Euclidean space but on a Riemannian manifold, so Euclidean tools are not suited to manipulating them; other metrics should be considered, such as the log-Euclidean one. This consists of projecting the set of covariance matrices onto a tangent space defined at a reference point. In this tangent plane, which is a vector space, conventional machine learning algorithms can be applied, such as Fisher vector encoding or an SVM classifier. Based on this log-Euclidean framework, we propose a novel transfer learning approach composed of two hybrid architectures based on covariance pooling of CNN features, the first local and the second global. They rely on the extraction of features from models pre-trained on the ImageNet dataset, processed with some machine learning algorithms. The first hybrid architecture consists of an ensemble learning approach with the log-Euclidean Fisher vector encoding of region covariance matrices computed locally on the first layers of a CNN. The second concerns an ensemble learning approach based on the covariance pooling of CNN features extracted globally from the deepest layers. These two ensemble learning approaches are then combined based on the strategy of the most diverse ensembles.
For validation and comparison purposes, the proposed approach is tested on various challenging remote sensing datasets. Experimental results exhibit a significant gain of approximately 2% in overall accuracy for the proposed approach compared to a similar state-of-the-art method based on covariance pooling of CNN features (on the UC Merced dataset).
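The core log-Euclidean step, mapping a symmetric positive-definite covariance matrix to the tangent space via the matrix logarithm, can be written out explicitly for the 2x2 case. This is a hedged sketch: real CNN covariance matrices are much larger and handled with a general eigendecomposition routine.

```python
import math

def logm_2x2(a, b, c):
    # Matrix logarithm of a 2x2 SPD covariance [[a, b], [b, c]] via its
    # eigendecomposition: log(M) = U diag(log eigvals) U^T. This maps
    # the matrix onto the log-Euclidean tangent space (at the identity),
    # where ordinary Euclidean machine learning tools apply.
    if b == 0.0:
        return [[math.log(a), 0.0], [0.0, math.log(c)]]
    mean = (a + c) / 2.0
    gap = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    l1, l2 = mean + gap, mean - gap
    # Unit eigenvector for l1; the second is its 90-degree rotation.
    n = math.hypot(l1 - c, b)
    u1 = ((l1 - c) / n, b / n)
    u2 = (-u1[1], u1[0])
    g1, g2 = math.log(l1), math.log(l2)
    off = g1 * u1[0] * u1[1] + g2 * u2[0] * u2[1]
    return [[g1 * u1[0] ** 2 + g2 * u2[0] ** 2, off],
            [off, g1 * u1[1] ** 2 + g2 * u2[1] ** 2]]
```

The resulting tangent-space matrices live in a vector space, so they can be flattened and fed to the Fisher vector encoding or an SVM classifier exactly as the abstract describes.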


2013 ◽  
Vol 333-335 ◽  
pp. 769-774
Author(s):  
Yu Guo Xia ◽  
Ming Liang Gu

In this paper we propose an ensemble learning based approach to identifying Chinese dialects. This new method first uses Gaussian Mixture Models and N-gram language models to produce a set of base learners. Then two typical ensemble learning approaches, Bagging and AdaBoost, are used to combine the base learners and determine the dialect category. An ANN is selected as the weak learner. The experimental results show that the ensemble approach not only greatly enhances the performance of the system, but also reduces the tension between the amount of training data and the number of parameters in the models.


2020 ◽  
Vol 3 (1) ◽  
Author(s):  
Ralph K. Akyea ◽  
Nadeem Qureshi ◽  
Joe Kai ◽  
Stephen F. Weng

Abstract Familial hypercholesterolaemia (FH) is a common inherited disorder, causing lifelong elevated low-density lipoprotein cholesterol (LDL-C). Most individuals with FH remain undiagnosed, precluding opportunities to prevent premature heart disease and death. Some machine-learning approaches improve detection of FH in electronic health records, though their clinical impact is under-explored. We assessed the performance of an array of machine-learning approaches for enhancing detection of FH, and their clinical utility, within a large primary care population. A retrospective cohort study was done using routine primary care clinical records of 4,027,775 individuals from the United Kingdom with total cholesterol measured from 1 January 1999 to 25 June 2019. The predictive accuracy of five common machine-learning algorithms (logistic regression, random forest, gradient boosting machines, neural networks and ensemble learning) was assessed for detecting FH. Predictive accuracy was assessed by area under the receiver operating characteristic curve (AUC) and expected vs observed calibration slope, with clinical utility assessed by expected case-review workload and likelihood ratios. There were 7928 incident diagnoses of FH. In addition to known clinical features of FH (raised total cholesterol or LDL-C and family history of premature coronary heart disease), machine-learning (ML) algorithms identified features such as raised triglycerides which reduced the likelihood of FH. Apart from logistic regression (AUC, 0.81), all four other ML approaches had similarly high predictive accuracy (AUC > 0.89). Calibration slope ranged from 0.997 for gradient boosting machines to 1.857 for logistic regression. Among those screened, high-probability cases requiring clinical review varied from 0.73% using ensemble learning to 10.16% using deep learning, but with positive predictive values of 15.5% and 2.8%, respectively.
Ensemble learning exhibited a dominant positive likelihood ratio (45.5) compared to all other ML models (7.0–14.4). Machine-learning models show similarly high accuracy in detecting FH, offering opportunities to increase diagnosis. However, the clinical case-finding workload required to yield cases will differ substantially between models.
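The clinical-utility metrics quoted above follow standard definitions; a sketch with invented confusion-matrix counts (not the study's data):

```python
def positive_likelihood_ratio(tp, fn, fp, tn):
    # LR+ = sensitivity / (1 - specificity): how strongly a positive
    # model flag raises the odds that the patient truly has FH.
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity / (1.0 - specificity)

def positive_predictive_value(tp, fp):
    # Share of flagged cases that are true cases: this drives the
    # case-review workload per confirmed diagnosis.
    return tp / (tp + fp)

# Invented confusion-matrix counts for illustration only.
tp, fn, fp, tn = 90, 10, 20, 980
```

A high LR+ with a small flagged fraction is exactly the trade-off the study highlights for ensemble learning: fewer charts to review per true FH case found.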


2020 ◽  
Vol 21 (10) ◽  
pp. 3585 ◽  
Author(s):  
Neann Mathai ◽  
Johannes Kirchmair

Computational methods for predicting the macromolecular targets of drugs and drug-like compounds have evolved as a key technology in drug discovery. However, the established validation protocols leave several key questions regarding the performance and scope of methods unaddressed. For example, prediction success rates are commonly reported as averages over all compounds of a test set and do not consider the structural relationship between the individual test compounds and the training instances. In order to obtain a better understanding of the value of ligand-based methods for target prediction, we benchmarked a similarity-based method and a random forest based machine learning approach (both employing 2D molecular fingerprints) under three testing scenarios: a standard testing scenario with external data, a standard time-split scenario, and a scenario that is designed to most closely resemble real-world conditions. In addition, we deconvoluted the results based on the distances of the individual test molecules from the training data. We found that, surprisingly, the similarity-based approach generally outperformed the machine learning approach in all testing scenarios, even in cases where queries were structurally clearly distinct from the instances in the training (or reference) data, and despite a much higher coverage of the known target space.
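A similarity-based target prediction of the kind benchmarked here can be sketched with Tanimoto similarity over binary 2D fingerprints; the bit sets and target names below are invented:

```python
def tanimoto(fp1, fp2):
    # Tanimoto (Jaccard) similarity between two binary fingerprints,
    # represented as sets of "on" bit positions.
    union = len(fp1 | fp2)
    return len(fp1 & fp2) / union if union else 0.0

def nearest_training_target(query_fp, training):
    # Similarity-based prediction: assign the target of the most
    # similar reference compound, and report that similarity.
    best_fp, best_target = max(training, key=lambda item: tanimoto(query_fp, item[0]))
    return best_target, tanimoto(query_fp, best_fp)

# Invented (fingerprint, target) reference pairs.
training = [
    (frozenset({1, 4, 7, 9}), "kinase_A"),
    (frozenset({2, 3, 8}), "gpcr_B"),
]
target, similarity = nearest_training_target(frozenset({1, 4, 7}), training)
```

The reported similarity doubles as a confidence score, which is what allows the authors' deconvolution of results by the query's distance from the training data.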


2014 ◽  
Vol 2014 ◽  
pp. 1-6 ◽  
Author(s):  
Tianxu He ◽  
Shukui Zhang ◽  
Jie Xin ◽  
Pengpeng Zhao ◽  
Jian Wu ◽  
...  

Big data from the Internet of Things can create big challenges for data classification. Most active learning approaches select either uncertain or representative unlabeled instances to query their labels. Although several active learning algorithms have been proposed to combine the two criteria for query selection, they are usually ad hoc in finding unlabeled instances that are both informative and representative, and fail to take the diversity of instances into account. We address this challenge by presenting a new active learning framework which considers uncertainty, representativeness, and diversity. The proposed approach provides a systematic way of measuring and combining the uncertainty, representativeness, and diversity of an instance. First, instances' uncertainty and representativeness are used to constitute the most informative set. Then, the kernel k-means clustering algorithm is used to filter out redundant samples, and the resulting samples are queried for labels. Extensive experimental results show that the proposed approach outperforms several state-of-the-art active learning approaches.
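The two-step selection could look like the following sketch, where greedy farthest-point selection stands in for the paper's kernel k-means redundancy filtering; all scores and coordinates are invented:

```python
import math

def select_queries(points, m, k):
    # Step 1: rank by informativeness = uncertainty * representativeness
    # and keep the top-m candidates.
    pool = sorted(points, key=lambda p: -(p["unc"] * p["rep"]))[:m]
    # Step 2: enforce diversity with greedy farthest-point selection
    # (a simple stand-in for kernel k-means filtering of redundancy).
    chosen = [pool[0]]
    while len(chosen) < k:
        candidates = [p for p in pool if p not in chosen]
        chosen.append(max(candidates,
                          key=lambda p: min(math.dist(p["x"], c["x"]) for c in chosen)))
    return chosen

points = [  # invented unlabeled instances
    {"x": (0.0, 0.0), "unc": 0.90, "rep": 0.8},
    {"x": (0.1, 0.0), "unc": 0.85, "rep": 0.8},  # near-duplicate of the first
    {"x": (5.0, 5.0), "unc": 0.80, "rep": 0.7},
    {"x": (9.0, 0.0), "unc": 0.70, "rep": 0.9},
    {"x": (0.0, 9.0), "unc": 0.20, "rep": 0.2},  # uninformative outlier
]
queried = select_queries(points, m=4, k=3)
```

Step 1 discards the uninformative outlier; step 2 then drops the near-duplicate, so the labeling budget is spent on points that are informative, representative, and mutually diverse.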

