A Combination of Feature Selection and Random Forest Techniques to Solve a Problem Related to Blast-Induced Ground Vibration

In mining and civil engineering applications, reliable analysis of ground vibration due to quarry blasting is an extremely important task. While advances in machine learning have led to numerous powerful regression models, the usefulness of these models for modeling the peak particle velocity (PPV) remains largely unexplored. Using an extensive database comprising quarry site datasets enriched with vibration variables, this article compares the predictive performance of five selected machine learning models for PPV analysis: classification and regression trees (CART), chi-squared automatic interaction detection (CHAID), random forest (RF), artificial neural network (ANN), and support vector machine (SVM). Before these models were developed, feature selection was applied to select the most important input parameters for PPV. The results of this study show that RF performed substantially better than any of the other investigated regression models, including the frequently used SVM and ANN models. The results and process analysis of this study can be utilized by other researchers and designers in similar fields.

Machine Learning-Based Prediction of Air Quality

Applied Sciences ◽

10.3390/app10249151 ◽

2020 ◽

Vol 10 (24) ◽

pp. 9151

Author(s):

Yun-Chia Liang ◽

Yona Maimury ◽

Angela Hsiang-Ling Chen ◽

Josue Rodolfo Cuevas Juarez

Keyword(s):

Machine Learning ◽

Air Quality ◽

Random Forest ◽

Prediction Models ◽

Superior Performance ◽

Support Vector ◽

Economic Activities ◽

Adaptive Boosting ◽

Series Of Experiments ◽

Artificial Neural Network (ANN)

Air, an essential natural resource, has been compromised in terms of quality by economic activities. Considerable research has been devoted to predicting instances of poor air quality, but most studies are limited by insufficient longitudinal data, making it difficult to account for seasonal and other factors. Several prediction models have been developed using an 11-year dataset collected by Taiwan’s Environmental Protection Administration (EPA). Machine learning methods, including adaptive boosting (AdaBoost), artificial neural network (ANN), random forest, stacking ensemble, and support vector machine (SVM), produce promising results for air quality index (AQI) level predictions. A series of experiments, using datasets for three different regions to obtain the best prediction performance from the stacking ensemble, AdaBoost, and random forest, found that the stacking ensemble delivers consistently superior performance for R² and RMSE, while AdaBoost provides the best results for MAE.
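The three scores named in this abstract (R², RMSE, and MAE) can be computed from a list of observed and predicted values. The sketch below is a minimal illustration using the standard definitions; the function name `regression_metrics` and its dictionary layout are assumptions made for this example and are not taken from the paper.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for paired observed/predicted values.

    Hypothetical helper for illustration; not the paper's code.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    rmse = math.sqrt(ss_res / n)                   # root mean squared error

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    # R^2 is undefined when the observations are constant.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"mae": mae, "rmse": rmse, "r2": r2}
```

Because RMSE squares each error before averaging, it penalizes large misses more heavily than MAE does, which is one plausible reason two models can rank differently on the two scores.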

Techniques for Detecting Malware Traffic: A Comprehensive Approach to Feature Selection and Classification

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2021.39088 ◽

2021 ◽

Vol 9 (12) ◽

pp. 1-10

Author(s):

Harsha A K

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Random Forest ◽

Learning Algorithms ◽

Malware Detection ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Steady Increase ◽

Extreme Gradient Boosting

Abstract: Since the advent of encryption, there has been a steady increase in malware being transmitted over encrypted networks. Traditional approaches to detecting malware, such as packet content analysis, are inefficient in dealing with encrypted data. In the absence of actual packet contents, we can make use of other features such as packet size, arrival time, source and destination addresses, and other such metadata to detect malware. Such information can be used to train machine learning classifiers to distinguish malicious packets from benign ones. In this paper, we offer an efficient malware detection approach using machine learning classification algorithms such as support vector machine, random forest, and extreme gradient boosting. We employ an extensive feature selection process to reduce the dimensionality of the chosen dataset. The dataset is then split into training and testing sets. Machine learning algorithms are trained using the training set. These models are then evaluated against the testing set to assess their respective performances. We further attempt to tune the hyperparameters of the algorithms to achieve better results. Random forest and extreme gradient boosting algorithms performed exceptionally well in our experiments, resulting in area under the curve values of 0.9928 and 0.9998, respectively. Our work demonstrates that malware traffic can be effectively classified using conventional machine learning algorithms and also shows the importance of dimensionality reduction in such classification problems. Keywords: Malware Detection, Extreme Gradient Boosting, Random Forest, Feature Selection.
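The area under the ROC curve reported above can be read as the probability that a randomly chosen malicious sample receives a higher score than a randomly chosen benign one. The sketch below computes it directly from that pairwise definition. It is an illustration only: the function name `roc_auc` is hypothetical, the label convention (1 for malicious, 0 for benign) is an assumption, and the paper does not say how its values were computed.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for the positive class (assumed: malicious), 0 for the negative class.
    scores: classifier scores, higher meaning "more likely positive".
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

The double loop is quadratic in the number of samples, so for large test sets a rank-based formulation or a library routine would be the more practical choice.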

A Framework for Effective Application of Machine Learning to Microbiome-Based Classification Problems

mBio ◽

10.1128/mbio.00434-20 ◽

2020 ◽

Vol 11 (3) ◽

Cited By ~ 9

Author(s):

Begüm D. Topçuoğlu ◽

Nicholas A. Lesniak ◽

Mack T. Ruffin ◽

Jenna Wiens ◽

Patrick D. Schloss

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Random Forest ◽

Sequence Data ◽

Characteristic Curve ◽

Predictive Performance ◽

Model Complexity ◽

Support Vector ◽

Classification Problems ◽

Microbial Biomarkers

ABSTRACT Machine learning (ML) modeling of the human microbiome has the potential to identify microbial biomarkers and aid in the diagnosis of many diseases such as inflammatory bowel disease, diabetes, and colorectal cancer. Progress has been made toward developing ML models that predict health outcomes using bacterial abundances, but inconsistent adoption of training and evaluation methods calls the validity of these models into question. Furthermore, there appears to be a preference by many researchers to favor increased model complexity over interpretability. To overcome these challenges, we trained seven models that used fecal 16S rRNA sequence data to predict the presence of colonic screen relevant neoplasias (SRNs) (n = 490 patients, 261 controls and 229 cases). We developed a reusable open-source pipeline to train, validate, and interpret ML models. To show the effect of model selection, we assessed the predictive performance, interpretability, and training time of L2-regularized logistic regression, L1- and L2-regularized support vector machines (SVM) with linear and radial basis function kernels, a decision tree, random forest, and gradient boosted trees (XGBoost). The random forest model performed best at detecting SRNs with an area under the receiver operating characteristic curve (AUROC) of 0.695 (interquartile range [IQR], 0.651 to 0.739) but was slow to train (83.2 h) and not inherently interpretable. Despite its simplicity, L2-regularized logistic regression followed random forest in predictive performance with an AUROC of 0.680 (IQR, 0.625 to 0.735), trained faster (12 min), and was inherently interpretable. Our analysis highlights the importance of choosing an ML approach based on the goal of the study, as the choice will inform expectations of performance and interpretability.

IMPORTANCE Diagnosing diseases using machine learning (ML) is rapidly being adopted in microbiome studies. However, the estimated performance associated with these models is likely overoptimistic. Moreover, there is a trend toward using black box models without a discussion of the difficulty of interpreting such models when trying to identify microbial biomarkers of disease. This work represents a step toward developing more-reproducible ML practices in applying ML to microbiome research. We implement a rigorous pipeline and emphasize the importance of selecting ML models that reflect the goal of the study. These concepts are not particular to the study of human health but can also be applied to environmental microbiology studies.

Glass Classification based on Machine Learning Algorithms

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.h6819.0991120 ◽

2020 ◽

Vol 9 (11) ◽

pp. 139-142

Keyword(s):

Machine Learning ◽

Random Forest ◽

Amorphous Solid ◽

Machine Learning Algorithms ◽

Support Vector ◽

K Nearest Neighbors ◽

X Ray ◽

SVM Algorithm ◽

Artificial Neural Network (ANN) ◽

Logistic Regression Algorithm

The glass industry is considered one of the most important industries in the world. Glass is used everywhere, from water bottles to X-ray and gamma-ray protection. It is a non-crystalline, amorphous solid that is most often transparent. Glass has many uses, and during a crime scene investigation, investigators need to know what type of glass is present at the scene. To determine the type of glass, we use an online dataset and machine learning. We apply ML algorithms including artificial neural network (ANN), K-nearest neighbors (KNN), support vector machine (SVM), random forest, and logistic regression. Comparing all the algorithms, random forest performed best in glass classification.

A framework for effective application of machine learning to microbiome-based classification problems

10.1101/816090 ◽

2019 ◽

Cited By ~ 3

Author(s):

Begüm D. Topçuoğlu ◽

Nicholas A. Lesniak ◽

Mack Ruffin ◽

Jenna Wiens ◽

Patrick D. Schloss

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Random Forest ◽

Sequence Data ◽

Predictive Performance ◽

Model Complexity ◽

Support Vector ◽

Classification Problems ◽

16S rRNA Sequence ◽

Microbial Biomarkers

Abstract Machine learning (ML) modeling of the human microbiome has the potential to identify microbial biomarkers and aid in the diagnosis of many diseases such as inflammatory bowel disease, diabetes, and colorectal cancer. Progress has been made towards developing ML models that predict health outcomes using bacterial abundances, but inconsistent adoption of training and evaluation methods calls the validity of these models into question. Furthermore, there appears to be a preference by many researchers to favor increased model complexity over interpretability. To overcome these challenges, we trained seven models that used fecal 16S rRNA sequence data to predict the presence of colonic screen relevant neoplasias (SRNs; n=490 patients, 261 controls and 229 cases). We developed a reusable open-source pipeline to train, validate, and interpret ML models. To show the effect of model selection, we assessed the predictive performance, interpretability, and training time of L2-regularized logistic regression, L1- and L2-regularized support vector machines (SVM) with linear and radial basis function kernels, decision trees, random forest, and gradient boosted trees (XGBoost). The random forest model performed best at detecting SRNs with an AUROC of 0.695 [IQR 0.651-0.739] but was slow to train (83.2 h) and not inherently interpretable. Despite its simplicity, L2-regularized logistic regression followed random forest in predictive performance with an AUROC of 0.680 [IQR 0.625-0.735], trained faster (12 min), and was inherently interpretable. Our analysis highlights the importance of choosing an ML approach based on the goal of the study, as the choice will inform expectations of performance and interpretability.

Importance Diagnosing diseases using machine learning (ML) is rapidly being adopted in microbiome studies. However, the estimated performance associated with these models is likely over-optimistic. Moreover, there is a trend towards using black box models without a discussion of the difficulty of interpreting such models when trying to identify microbial biomarkers of disease. This work represents a step towards developing more reproducible ML practices in applying ML to microbiome research. We implement a rigorous pipeline and emphasize the importance of selecting ML models that reflect the goal of the study. These concepts are not particular to the study of human health but can also be applied to environmental microbiology studies.

Machine Learning Models Based on Random Forest Feature Selection and Bayesian Optimization for Predicting Daily Global Solar Radiation

International Journal of Renewable Energy Development ◽

10.14710/ijred.2022.41451 ◽

2021 ◽

Vol 11 (1) ◽

pp. 309-323

Author(s):

Mohamed Chaibi ◽

El Mahjoub Benghoulam ◽

Lhoussaine Tarik ◽

Mohamed Berrada ◽

Abdellah El Hmaidi

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Random Forest ◽

Solar Radiation ◽

Predictive Accuracy ◽

Sunshine Duration ◽

Computational Cost ◽

Global Solar Radiation ◽

Bayesian Optimization ◽

Support Vector

Prediction of daily global solar radiation with simple and highly accurate models would be beneficial for solar energy conversion systems. In this paper, we proposed a hybrid machine learning methodology integrating two feature selection methods and a Bayesian optimization algorithm to predict daily global solar radiation (H) in the city of Fez, Morocco. First, we identified the most significant predictors using two Random Forest methods of feature importance: Mean Decrease in Impurity (MDI) and Mean Decrease in Accuracy (MDA). Then, based on the feature selection results, ten models were developed and compared: (1) five standalone machine learning (ML) models including Classification and Regression Trees (CART), Random Forests (RF), Bagged Trees Regression (BTR), Support Vector Regression (SVR), and Multi-Layer Perceptron (MLP); and (2) the same models tuned by the Bayesian optimization (BO) algorithm: CART-BO, RF-BO, BTR-BO, SVR-BO, and MLP-BO. Both MDI and MDA techniques revealed that extraterrestrial solar radiation and sunshine duration fraction were the most influential features. The BO approach improved the predictive accuracy of MLP, CART, SVR, and BTR models and prevented the CART model from overfitting. The best improvements were obtained using the MLP model, where RMSE and MAE were reduced by 17.6% and 17.2%, respectively. Among the studied models, the SVR-BO algorithm provided the best trade-off between prediction accuracy (RMSE = 0.4473 kWh/m²/day, MAE = 0.3381 kWh/m²/day, and R² = 0.9465), stability (with a 0.0033 kWh/m²/day increase in RMSE), and computational cost.

Physicochemical Habitat Traits Preferred by Small Indigenous Fish (Chanda Nama) in Indian River Discerning through Machine Learning

10.21203/rs.3.rs-591781/v1 ◽

2021 ◽

Author(s):

Rohan Kumar Raman ◽

Archan Kanti Das ◽

Ranjan Kumar Manna ◽

Sanjeev Kumar Sahu ◽

Basanta Kumar Das

Keyword(s):

Machine Learning ◽

Random Forest ◽

River System ◽

Fish Distribution ◽

Support Vector ◽

Operating Characteristics ◽

K Nearest Neighbors ◽

Peninsular India ◽

Preferred Habitat ◽

Artificial Neural Network (ANN)

Abstract Physicochemical traits of a river influence the habitat of fish species in aquatic ecosystems. Fish show a complex relationship with aquatic factors in a river. Machine learning (ML) modeling is a useful tool for establishing relationships within complex systems. This study identified the preferred habitat indicators of Chanda nama (a small indigenous fish) in the Krishna River of peninsular India using machine learning modeling. Data were collected on Chanda nama distribution (presence/absence) and ten associated physical and chemical water parameters at 22 sampling sites on the river during 2001-02. Machine learning models, namely random forest (RF), artificial neural network (ANN), support vector machine (SVM), and k-nearest neighbors (KNN), were used to classify Chanda nama distribution in the river. Model efficiency was evaluated using classification accuracy (CCI), Cohen’s kappa coefficient (k), sensitivity, specificity, and receiver operating characteristics (ROC). Results showed that random forest was the best model, with 82% accuracy, CCI (0.82), k (0.55), sensitivity (0.57), specificity (0.76), and ROC (0.72) for Chanda nama distribution (presence/absence) in the Krishna River. The random forest model identified three preferred physicochemical habitat traits (altitude, temperature, and depth) for Chanda nama distribution in the Krishna River, India. This study will help researchers and policy makers understand the important physicochemical habitat traits for sustainable management of small indigenous fish (Chanda nama) in the river system.
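Accuracy, Cohen's kappa, sensitivity, and specificity, as listed in this abstract, all derive from the same 2×2 confusion matrix of presence/absence predictions. The sketch below shows the standard formulas; the function name `presence_absence_report` and the encoding (1 for presence, 0 for absence) are assumptions for illustration and are not taken from the study.

```python
def presence_absence_report(y_true, y_pred):
    """Accuracy, sensitivity, specificity and Cohen's kappa for binary labels.

    Assumed encoding: 1 = species present, 0 = species absent.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and equal in length")

    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")  # true negative rate

    # Agreement expected by chance, from the marginal totals of each class.
    p_presence = ((tp + fp) / n) * ((tp + fn) / n)
    p_absence = ((tn + fn) / n) * ((tn + fp) / n)
    p_chance = p_presence + p_absence
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else float("nan")

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
    }
```

Kappa discounts the agreement that would occur by chance given the class balance, which is why it can sit well below the raw accuracy when one class dominates the sampling sites.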

Monitoring Forest Health Using Hyperspectral Imagery: Does Feature Selection Improve the Performance of Machine-Learning Techniques?

Remote Sensing ◽

10.3390/rs13234832 ◽

2021 ◽

Vol 13 (23) ◽

pp. 4832

Author(s):

Patrick Schratz ◽

Jannes Muenchow ◽

Eugenia Iturritxa ◽

José Cortés ◽

Bernd Bischl ◽

...

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Predictive Performance ◽

Environmental Modeling ◽

Gradient Boosting ◽

Support Vector ◽

Substantial Impact ◽

Feature Sets ◽

Filter Methods ◽

Extreme Gradient Boosting

This study analyzed highly correlated, feature-rich datasets from hyperspectral remote sensing data using multiple statistical and machine-learning methods. The effect of filter-based feature selection methods on predictive performance was compared. In addition, the effect of multiple expert-based and data-driven feature sets, derived from the reflectance data, was investigated. Defoliation of trees (%), derived from in situ measurements from fall 2016, was modeled as a function of reflectance. Variable importance was assessed using permutation-based feature importance. Overall, the support vector machine (SVM) outperformed other algorithms, such as random forest (RF), extreme gradient boosting (XGBoost), and lasso (L1) and ridge (L2) regressions by at least three percentage points. The combination of certain feature sets showed small increases in predictive performance, while no substantial differences between individual feature sets were observed. For some combinations of learners and feature sets, filter methods achieved better predictive performances than using no feature selection. Ensemble filters did not have a substantial impact on performance. The most important features were located around the red edge. Additional features in the near-infrared region (800–1000 nm) were also essential to achieve the overall best performances. Filter methods have the potential to be helpful in high-dimensional situations and are able to improve the interpretation of feature effects in fitted models, which is an essential constraint in environmental modeling studies. Nevertheless, more training data and replication in similar benchmarking studies are needed to be able to generalize the results.
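Permutation-based feature importance, mentioned above, measures how much a fitted model's error grows when the values of one feature are shuffled across samples. The sketch below is a generic, dependency-free illustration and not the study's implementation: the names `permutation_importance` and `mean_squared_error`, the use of mean squared error as the score, and the representation of the model as a plain `predict(row)` callable are all assumptions.

```python
import random


def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE after shuffling each feature column.

    predict: callable mapping one feature row (a list) to a prediction.
    X: list of feature rows (lists of equal length); y: list of targets.
    Returns one importance value per feature; larger means more important.
    """
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = mean_squared_error(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances
```

With highly correlated inputs, such as neighboring hyperspectral bands, shuffling one feature may change the error only slightly because a correlated neighbor still carries the signal, so individual importances should be read with that caveat in mind.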

Predicting heart failure using a wrapper-based feature selection

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v21.i3.pp1530-1539 ◽

2021 ◽

Vol 21 (3) ◽

pp. 1530

Author(s):

Minh Tuan Le ◽

Minh Thanh Vo ◽

Nhat Tan Pham ◽

Son V.T Dao

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Nearest Neighbor ◽

Machine Learning Algorithms ◽

Support Vector ◽

K Nearest Neighbor ◽

Medical Practitioners ◽

Machine Learning Model ◽

Heart Contraction ◽

Artificial Neural Network (ANN)

In the current health system, it is very difficult for medical practitioners to assess the effectiveness of heart contraction. In this research, we proposed a machine learning model to predict heart contraction using an artificial neural network (ANN). We also proposed a novel wrapper-based feature selection method utilizing grey wolf optimization (GWO) to reduce the number of required input attributes. We compared the results achieved using our method with those of several conventional machine learning algorithms, such as support vector machine, decision tree, K-nearest neighbor, naïve Bayes, random forest, and logistic regression. Computational results show not only that far fewer features are needed, but also that a higher prediction accuracy of around 87% can be achieved. This work has the potential to be applicable to clinical practice and to become a supporting tool for physicians.

A Goal Programming-Based Methodology for Machine Learning Model Selection Decisions: A Predictive Maintenance Application

Mathematics ◽

10.3390/math9192405 ◽

2021 ◽

Vol 9 (19) ◽

pp. 2405

Author(s):

Ioannis Mallidis ◽

Volha Yakavenka ◽

Anastasios Konstantinidis ◽

Nikolaos Sariannidis

Keyword(s):

Neural Network ◽

Machine Learning ◽

Random Forest ◽

Decision Tree ◽

Goal Programming ◽

Regression Models ◽

Support Vector ◽

Threshold Values ◽

Time Efficiency ◽

The Neural Network

The paper develops a goal programming-based multi-criteria methodology for assessing different machine learning (ML) regression models under accuracy and time efficiency criteria. The developed methodology gives users high flexibility in assessing the models, as it allows for a fast and computationally efficient sensitivity analysis of accuracy and time significance weights as well as accuracy and time significance threshold values. Four regression models were assessed, namely decision tree, random forest, support vector, and neural network regression. The developed methodology was employed to forecast the time to failure of NASA turbofans. The results reveal that decision tree regression (DTR) seems to be preferred for low values of accuracy weights (up to 30%) and low accuracy and time efficiency threshold values. As the accuracy weights increase, and for higher accuracy and time efficiency threshold values, random forest regression (RFR) seems to be the best choice. The preference for the RFR model, however, seems to shift toward the neural network for accuracy weights of 90% and higher.
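The idea of weighting accuracy against time and sweeping the weights can be illustrated with a simplified, goal-programming-style rule over a fixed set of candidates: each model is penalized by its weighted shortfall from an accuracy goal plus its weighted overrun of a time goal, and the model with the smallest penalty is selected. This is a sketch under stated assumptions, not the paper's formulation. The function name `select_model`, the scaling of both criteria to the range 0 to 1, the goal values, and every number in the usage example are hypothetical.

```python
def select_model(candidates, accuracy_weight, accuracy_goal, time_goal):
    """Pick the candidate with the smallest weighted deviation from two goals.

    candidates: dict mapping model name -> (accuracy, time_cost), both assumed
        to be scaled to [0, 1]; higher accuracy is better, lower time is better.
    accuracy_weight: importance of accuracy in [0, 1]; time gets the remainder.
    accuracy_goal, time_goal: target levels; only undesirable deviations
        (accuracy below goal, time above goal) are penalized.
    """
    if not 0.0 <= accuracy_weight <= 1.0:
        raise ValueError("accuracy_weight must lie in [0, 1]")
    time_weight = 1.0 - accuracy_weight

    def penalty(item):
        accuracy, time_cost = item[1]
        accuracy_shortfall = max(0.0, accuracy_goal - accuracy)
        time_overrun = max(0.0, time_cost - time_goal)
        return accuracy_weight * accuracy_shortfall + time_weight * time_overrun

    return min(candidates.items(), key=penalty)[0]


# Hypothetical scores invented for illustration; not results from the paper.
EXAMPLE_CANDIDATES = {
    "decision_tree": (0.70, 0.05),
    "random_forest": (0.85, 0.40),
    "neural_network": (0.90, 0.90),
}
```

Calling `select_model(EXAMPLE_CANDIDATES, w, accuracy_goal=0.95, time_goal=0.10)` for a range of weights `w` gives a quick sensitivity sweep showing at which weights the preferred model changes.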
