Feature Selection and Optimization of Random Forest Modeling

2014 ◽  
Vol 687-691 ◽  
pp. 1416-1419
Author(s):  
Min Zhu ◽  
Jing Xia ◽  
Mo Lei Yan ◽  
Sheng Yu Zhang ◽  
Guo Long Cai ◽  
...  

The traditional random forest algorithm struggles to classify small-sample data sets well: because each round of repeated random selection draws so few samples, the resulting trees differ very little from one another. This lack of diversity drowns out correct decisions, increases the model's generalization error, and lowers prediction accuracy. For a small data set of sepsis cases, this paper partitions the features used in random forest modeling into intervals, dividing them into a high-correlation interval and an uncertain-correlation interval, and selects data from each interval separately for modeling. This reduces the model's generalization error and improves prediction accuracy.
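The interval-division idea could be sketched roughly as follows. The paper's exact correlation measure and cut-points are not given here, so this hypothetical sketch simply ranks features by absolute correlation with the label, splits them into a "high correlation" and an "uncertain correlation" set, and averages the votes of a forest built on each; the data set, the 8-feature cut, and the vote-averaging step are all illustrative assumptions, not the authors' procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for a small-sample data set; sizes are illustrative.
X, y = make_classification(n_samples=60, n_features=20, n_informative=5,
                           random_state=0)

# Rank features by absolute correlation with the label, then split them
# into a "high correlation" and an "uncertain correlation" interval.
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
order = np.argsort(corr)
high, uncertain = order[-8:], order[:-8]

# Build one forest per interval and average their class-probability votes.
rf_high = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[:, high], y)
rf_unc = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[:, uncertain], y)
proba = (rf_high.predict_proba(X[:, high]) + rf_unc.predict_proba(X[:, uncertain])) / 2
pred = proba.argmax(axis=1)
print(pred.shape)  # (60,)
```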

2004 ◽  
Vol 43 (05) ◽  
pp. 439-444 ◽  
Author(s):  
Michael Schimek

Summary Objectives: A typical bioinformatics task in microarray analysis is the classification of biological samples into two alternative categories. A procedure is needed which, based on the measured expression levels, allows us to compute the probability that a new sample belongs to a certain class. Methods: For the purpose of classification, the statistical approach of binary regression is considered. High dimensionality combined with small sample sizes makes this a challenging task. Standard logit or probit regression fails because of conditioning problems and poor predictive performance. The concepts of frequentist and Bayesian penalization for binary regression are introduced, and a Bayesian interpretation of the penalized log-likelihood is given. Finally, the role of cross-validation for regularization and feature selection is discussed. Results: Penalization makes classical binary regression a suitable tool for microarray analysis. We illustrate penalized logit and Bayesian probit regression on a well-known data set and compare the obtained results, also with respect to published results from decision trees. Conclusions: The frequentist and the Bayesian penalization concepts work equally well on the example data; however, some method-specific differences can be made out. Moreover, the Bayesian approach yields a quantification (posterior probabilities) of the bias due to the constraining assumptions.
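As a rough illustration of penalized binary regression on high-dimensional, small-n data, the sketch below fits an L2-penalized (ridge) logistic model with the penalty strength chosen by cross-validation; the data set, fold count, and grid size are illustrative assumptions, not the paper's setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# High-dimensional, small-n toy data (p = 500 features, n = 40 samples)
# mimicking a two-class microarray problem; all sizes are illustrative.
X, y = make_classification(n_samples=40, n_features=500, n_informative=10,
                           random_state=0)

# The L2 (ridge) penalty stabilises the otherwise ill-conditioned logit fit;
# the penalty strength is chosen by cross-validation, as discussed above.
clf = LogisticRegressionCV(Cs=5, cv=4, penalty="l2", max_iter=2000).fit(X, y)

# Probability that a (here, resubstituted) sample belongs to class 1.
p_new = float(clf.predict_proba(X[:1])[0, 1])
print(round(p_new, 3))
```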


Author(s):  
Sanjiban Sekhar Roy ◽  
Pulkit Kulshrestha ◽  
Pijush Samui

Drought is a condition of land in which the ground water faces a severe shortage. It threatens the survival of plants and animals and can severely impact ecosystems and agricultural productivity; the economy suffers as a result. This paper proposes the Deep Belief Network (DBN), one of the state-of-the-art machine learning algorithms, for the classification of drought and non-drought images. k-nearest neighbour (kNN) and random forest methods are also applied to the same drought images, and the performance of the DBN is compared with both. The data set was split into train and test sets at ratios of 80:20, 70:30, and 60:40. Finally, the effectiveness of the three proposed models is measured with various performance metrics.
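A minimal sketch of this evaluation protocol, with scikit-learn's digits images standing in for the drought imagery (which is not available here) and only the kNN and random forest baselines shown, since a DBN is not part of scikit-learn:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# scikit-learn's digits images stand in for the (unavailable) drought imagery.
X, y = load_digits(return_X_y=True)

results = {}
for test_size in (0.2, 0.3, 0.4):  # the 80:20, 70:30 and 60:40 splits
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=test_size,
                                          random_state=0)
    for name, clf in [("kNN", KNeighborsClassifier()),
                      ("random forest", RandomForestClassifier(random_state=0))]:
        results[(name, test_size)] = accuracy_score(
            yte, clf.fit(Xtr, ytr).predict(Xte))
        print(f"{name} at {1 - test_size:.0%}:{test_size:.0%} -> "
              f"{results[(name, test_size)]:.3f}")
```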


2020 ◽  
Vol 48 (4) ◽  
pp. 2316-2327
Author(s):  
Caner KOC ◽  
Dilara GERDAN ◽  
Maksut B. EMİNOĞLU ◽  
Uğur YEGÜL ◽  
Bulent KOC ◽  
...  

Classification of hazelnuts is one of the value-adding processes that increase the marketability and profitability of production. While traditional classification methods are commonly used, machine learning and deep learning can be implemented to enhance the hazelnut classification process. This paper presents the results of a comparative study of machine learning frameworks to classify hazelnut (Corylus avellana L.) cultivars (‘Sivri’, ‘Kara’, ‘Tombul’) using DL4J and ensemble learning algorithms. For each cultivar, 50 samples were used for evaluation. Maximum length, width, compression strength, and weight of the hazelnuts were measured using a caliper and a force transducer. Gradient boosting machine (boosting), random forest (bagging), and DL4J feedforward (deep learning) algorithms were applied. The data set was evaluated with 10-fold cross-validation. The classifier performance criteria of accuracy (%), error percentage (%), F-measure, Cohen’s kappa, recall, precision, and true positive (TP), false positive (FP), true negative (TN), and false negative (FN) values are provided in the results section. The results showed classification accuracies of 94% for gradient boosting, 100% for random forest, and 94% for DL4J feedforward.
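The 10-fold cross-validation protocol for the two ensemble methods can be sketched as follows; the synthetic four-feature data merely stands in for the real caliper and force-transducer measurements, which are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for 50 samples per cultivar with four measurements
# (length, width, compression strength, weight); values are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4)) + np.repeat(np.arange(3), 50)[:, None]
y = np.repeat([0, 1, 2], 50)  # 'Sivri', 'Kara', 'Tombul'

means = {}
for name, clf in [("Gradient Boosting", GradientBoostingClassifier(random_state=0)),
                  ("Random Forest", RandomForestClassifier(random_state=0))]:
    scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross-validation
    means[name] = scores.mean()
    print(f"{name}: {means[name]:.2%} accuracy")
```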


2019 ◽  
Vol 92 (1099) ◽  
pp. 20190159 ◽  
Author(s):  
Usman Bashir ◽  
Bhavin Kawa ◽  
Muhammad Siddique ◽  
Sze Mun Mak ◽  
Arjun Nair ◽  
...  

Objective: Non-invasive distinction between squamous cell carcinoma and adenocarcinoma subtypes of non-small-cell lung cancer (NSCLC) may benefit patients who are unfit for invasive diagnostic procedures or whose tissue is insufficient for diagnosis. The purpose of our study was to compare the performance of random forest algorithms utilizing CT radiomics and/or semantic features in classifying NSCLC. Methods: Two thoracic radiologists scored 11 semantic features on CT scans of 106 patients with NSCLC. A set of 115 radiomics features was extracted from the CT scans. Random forest models were developed from semantic features (RM-sem), radiomics features (RM-rad), and all features combined (RM-all). External validation of the models was performed using an independent test data set (n = 100) of CT scans. Model performance was measured with out-of-bag error and area under the curve (AUC), and compared using receiver-operating characteristic curve analysis on the test data set. Results: The median (interquartile-range) error rates of the models were: RM-sem 24.5% (22.6–37.5%), RM-rad 35.8% (34.9–38.7%), and RM-all 37.7% (37.7–37.7%). On training data, both RM-rad and RM-all gave perfect discrimination (AUC = 1), significantly higher than that achieved by RM-sem (AUC = 0.78; p < 0.0001). On test data, however, the RM-sem model (AUC = 0.82) out-performed RM-rad and RM-all (AUC = 0.5 and AUC = 0.56; p < 0.0001), neither of which was significantly different from random guessing (p = 0.9 and 0.6, respectively). Conclusion: Non-invasive classification of NSCLC can be done accurately using random forest classification models based on well-known CT-derived descriptive features. However, radiomics-based classification models performed poorly when tested on independent data and should be used with caution, due to their possible lack of generalizability to new data.
Advances in knowledge: Our study describes novel CT-derived random forest models based on radiologist interpretation of CT scans (semantic features) that can assist NSCLC classification when histopathology is equivocal or when histopathological sampling is not possible. It also shows that random forest models based on semantic features may be more useful than those built from computational radiomic features.
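A minimal sketch of the two performance measures used above, out-of-bag error and held-out AUC, on synthetic stand-in data; the sample sizes echo the study, but everything else is an illustrative assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy two-class data; 106 training and 100 test cases echo the study's
# sample sizes, but the features themselves are illustrative.
X, y = make_classification(n_samples=206, n_features=11, n_informative=6,
                           random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, train_size=106, random_state=0)

# oob_score=True gives the out-of-bag accuracy, an internal error estimate;
# AUC on the held-out set plays the role of external validation.
rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                            random_state=0).fit(Xtr, ytr)
oob_error = 1 - rf.oob_score_
auc = roc_auc_score(yte, rf.predict_proba(Xte)[:, 1])
print(f"OOB error: {oob_error:.3f}, test AUC: {auc:.3f}")
```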


IEEE Access ◽  
2020 ◽  
Vol 8 ◽  
pp. 117096-117108
Author(s):  
Bing Liu ◽  
Wenyue Guo ◽  
Xin Chen ◽  
Kuiliang Gao ◽  
Xibing Zuo ◽  
...  

2020 ◽  
Vol 15 (1) ◽  
Author(s):  
Syed Saiden Abbas ◽  
Tjeerd M. H. Dijkstra

Abstract Background The conventional method for the diagnosis of malaria parasites is the microscopic examination of stained blood films, which is time consuming and requires expertise. We introduce computer-based image segmentation and life stage classification with a random forest classifier. Segmentation and stage classification are performed on a large dataset of malaria parasites with ground-truth labels provided by experts. Methods We made use of Giemsa-stained images obtained from the blood of 16 patients infected with Plasmodium falciparum. Experts labeled the parasite types in each of the images. We applied a two-step approach: image segmentation followed by life stage classification. In segmentation, we classified each pixel as a parasite or non-parasite pixel using a random forest classifier. Performance was evaluated with classification accuracy, the Dice coefficient, and free-response receiver operating characteristic (FROC) analysis. In life stage classification, we classified each segmented object into one of eight classes: six parasite life stages (early ring, late ring or early trophozoite, mid trophozoite, early schizont, and late schizont or segmented) and two other classes (white blood cell and debris). Results Our segmentation method gives an average cross-validated Dice coefficient of 0.82, a 13% improvement over the Otsu method. The Otsu method achieved a true positive fraction (TPF) of 0.925 at the expense of a false positive rate (FPR) of 2.45. At the same TPF of 0.925, our method achieved an FPR of 0.92, an improvement of more than a factor of two. We find that including the average intensity of the whole image as a feature for the random forest considerably improves segmentation performance. We obtain an overall accuracy of 58.8% when classifying all life stages; stages are mostly confused with their neighboring stages. When we reduce the life stages to ring, trophozoite, and schizont only, we obtain an accuracy of 82.7%.
Conclusion Pixel classification gives better segmentation performance than the conventional Otsu method. Effects of staining and background variations can be reduced with the inclusion of average intensity features. The proposed method and data set can be used in the development of automatic tools for the detection and stage classification of malaria parasites. The data set is publicly available as a benchmark for future studies.
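The Dice coefficient used above to score segmentation overlap can be computed as below; the toy masks are illustrative, standing in for the random-forest pixel labels and the expert ground truth.

```python
import numpy as np

def dice(pred, truth):
    """Dice coefficient between two binary masks: 2|A∩B| / (|A| + |B|)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    return 2 * np.logical_and(pred, truth).sum() / (pred.sum() + truth.sum())

# Toy 10x10 masks; a real pipeline would compare the forest's pixel labels
# against the expert-drawn parasite mask.
truth = np.zeros((10, 10), dtype=bool)
truth[2:6, 2:6] = True  # 4x4 ground-truth region
pred = np.zeros((10, 10), dtype=bool)
pred[3:7, 3:7] = True   # shifted 4x4 prediction
print(dice(pred, truth))  # 2*9 / (16+16) = 0.5625
```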


2021 ◽  
Vol 7 ◽  
pp. e361
Author(s):  
Sana Aurangzeb ◽  
Rao Naveed Bin Rais ◽  
Muhammad Aleem ◽  
Muhammad Arshad Islam ◽  
Muhammad Azhar Iqbal

Due to the rapid growth of online services usage, reported incidents of ransomware proliferation are on the rise. Ransomware is a more hazardous threat than other malware because the victim cannot regain access to the hijacked device until some form of ransom is paid. Several dynamic analysis techniques have been employed in the literature for the detection of malware, including ransomware; however, to the best of our knowledge, the hardware execution profile has not yet been investigated for ransomware analysis. In this study, we show that the true execution picture obtained via a hardware execution profile is beneficial for identifying even obfuscated ransomware. We evaluate features obtained from hardware performance counters to classify malicious applications into ransomware and non-ransomware categories using several machine learning algorithms, such as random forest, decision tree, gradient boosting, and extreme gradient boosting. The data set comprises 80 ransomware and 80 non-ransomware applications, collected using the VirusShare platform. The results reveal that the extracted hardware features play a substantial part in the identification and detection of ransomware, with an F-measure of 0.97 achieved by both random forest and extreme gradient boosting.
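A hedged sketch of the classification step: synthetic vectors stand in for the hardware-performance-counter features, and only the random forest model and the F-measure computation are shown; sizes and the train/test split are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for hardware-performance-counter feature vectors of
# 80 ransomware and 80 non-ransomware samples; real traces are not shown.
X, y = make_classification(n_samples=160, n_features=16, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, stratify=y,
                                      random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xtr, ytr)
# The F-measure (harmonic mean of precision and recall) is the score the
# study reports for its best models.
f1 = f1_score(yte, rf.predict(Xte))
print(f"F-measure: {f1:.2f}")
```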


Author(s):  
Supun Nakandala ◽  
Marta M. Jankowska ◽  
Fatima Tuz-Zahra ◽  
John Bellettiere ◽  
Jordan A. Carlson ◽  
...  

Background: Machine learning has been used for classification of physical behavior bouts from hip-worn accelerometers; however, this research has been limited due to the challenges of directly observing and coding human behavior “in the wild.” Deep learning algorithms, such as convolutional neural networks (CNNs), may offer better representation of data than other machine learning algorithms without the need for engineered features and may be better suited to dealing with free-living data. The purpose of this study was to develop a modeling pipeline for evaluation of a CNN model on a free-living data set and compare CNN inputs and results with the commonly used machine learning random forest and logistic regression algorithms. Method: Twenty-eight free-living women wore an ActiGraph GT3X+ accelerometer on their right hip for 7 days. A concurrently worn thigh-mounted activPAL device captured ground truth activity labels. The authors evaluated logistic regression, random forest, and CNN models for classifying sitting, standing, and stepping bouts. The authors also assessed the benefit of performing feature engineering for this task. Results: The CNN classifier performed best (average balanced accuracy for bout classification of sitting, standing, and stepping was 84%) compared with the other methods (56% for logistic regression and 76% for random forest), even without performing any feature engineering. Conclusion: Using the recent advancements in deep neural networks, the authors showed that a CNN model can outperform other methods even without feature engineering. This has important implications for both the model’s ability to deal with the complexity of free-living data and its potential transferability to new populations.
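The balanced accuracy reported above averages per-class recall, so a dominant class (here, sitting) cannot mask poor performance on the rarer standing and stepping bouts. A small illustration with made-up labels:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

# Illustrative ground-truth and predicted bout labels: the model gets every
# "sit" bout right but misses both "stand" bouts.
truth = np.array(["sit"] * 8 + ["stand"] * 2 + ["step"] * 2)
pred = np.array(["sit"] * 10 + ["step"] * 2)

plain = float((truth == pred).mean())              # 10/12, inflated by "sit"
balanced = balanced_accuracy_score(truth, pred)    # mean recall: (1 + 0 + 1) / 3
print(round(plain, 3), round(balanced, 3))         # 0.833 0.667
```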


Author(s):  
Matthew Klawonn ◽  
Eric Heim ◽  
James Hendler

In many domains, collecting sufficient labeled training data for supervised machine learning requires easily accessible but noisy sources, such as crowdsourcing services or tagged Web data. Noisy labels occur frequently in data sets harvested via these means, sometimes resulting in entire classes of data on which learned classifiers generalize poorly. For real world applications, we argue that it can be beneficial to avoid training on such classes entirely. In this work, we aim to explore the classes in a given data set and guide supervised training to spend time on each class in proportion to its learnability. By focusing the training process, we aim to improve model generalization on classes with a strong signal. To that end, we develop an online algorithm that works in conjunction with a classifier and its training algorithm, iteratively selecting training data for the classifier based on how well it appears to generalize on each class. Testing our approach on a variety of data sets, we show our algorithm learns to focus on classes for which the model has low generalization error relative to strong baselines, yielding a classifier with good performance on learnable classes.
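The selection idea can be sketched in a few lines; everything here (the class names, the accuracy values, and the proportional-sampling rule) is an illustrative assumption, not the authors' actual algorithm.

```python
import random

# Hypothetical sketch: sample the next training batch's classes in proportion
# to how well the model currently generalizes on each class. The accuracy
# values below would come from held-out evaluation; they are made up.
random.seed(0)
val_accuracy = {"cat": 0.90, "dog": 0.85, "noisy_tag": 0.40}

total = sum(val_accuracy.values())
probs = {c: a / total for c, a in val_accuracy.items()}  # learnability weights

# Classes the model generalizes well on are drawn more often; the noisy,
# poorly-learnable class receives correspondingly less training time.
classes = list(probs)
batch = random.choices(classes, weights=[probs[c] for c in classes], k=100)
print({c: batch.count(c) for c in classes})
```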

