Uncovering feature interdependencies in high-noise environments with stepwise lookahead decision forests

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Delilah Donick ◽  
Sandro Claudio Lera

Conventionally, random forests are built from “greedy” decision trees, each of which considers only one split at a time during its construction. The sub-optimality of greedy implementations is well known, yet more sophisticated tree-building algorithms have not seen mainstream adoption. We examine under what circumstances an implementation of less greedy decision trees actually yields outperformance. To this end, a “stepwise lookahead” variation of the random forest algorithm is presented and evaluated for its ability to better uncover binary feature interdependencies. In contrast to the greedy approach, each decision tree in this random forest algorithm simultaneously considers three split nodes in tiers of depth two. It is demonstrated on synthetic data and financial price time series that the lookahead version significantly outperforms the greedy one when (a) certain non-linear relationships between feature pairs are present and (b) the signal-to-noise ratio is particularly low. A long-short trading strategy for copper futures is then backtested by training both greedy and stepwise lookahead random forests to predict the signs of daily price returns. The resulting superior performance of the lookahead algorithm is at least partially explained by the presence of “XOR-like” relationships between long-term and short-term technical indicators. More generally, across all examined datasets, when no such relationships between features are present, performance across random forests is similar. Given its enhanced ability to capture the feature interdependencies present in complex systems, this lookahead variation is a useful extension to the toolkit of data scientists, in particular for financial machine learning, where conditions (a) and (b) are typically met.
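The difference between a greedy split and a depth-two lookahead can be illustrated on a noise-free XOR table. The following is a minimal sketch, not the authors' implementation: the function names, the binary toy features, and the exhaustive search over three split nodes are all assumptions made for illustration.

```python
from itertools import product


def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 for an empty list)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def weighted_gini(groups):
    """Size-weighted Gini impurity over several groups of labels."""
    n = sum(len(g) for g in groups)
    return sum(len(g) / n * gini(g) for g in groups)


def split(rows, feature):
    """Partition rows of the form (x0, x1, ..., label) on a binary feature."""
    left = [r for r in rows if r[feature] == 0]
    right = [r for r in rows if r[feature] == 1]
    return left, right


def greedy_best(rows, n_features):
    """Lowest impurity reachable with one split, as a greedy tree sees it."""
    return min(
        weighted_gini([[r[-1] for r in part] for part in split(rows, f)])
        for f in range(n_features)
    )


def lookahead_best(rows, n_features):
    """Lowest impurity when the root and both child splits are chosen jointly."""
    best = float("inf")
    for root, f_left, f_right in product(range(n_features), repeat=3):
        left, right = split(rows, root)
        leaves = list(split(left, f_left)) + list(split(right, f_right))
        best = min(best, weighted_gini([[r[-1] for r in leaf] for leaf in leaves]))
    return best


# Hypothetical toy data: label = x0 XOR x1, and x2 is an irrelevant feature.
rows = [(x0, x1, x2, x0 ^ x1) for x0, x1, x2 in product((0, 1), repeat=3)]

print("best single split:   ", greedy_best(rows, 3))
print("best depth-two block:", lookahead_best(rows, 3))
```

On this table every single split leaves the impurity at 0.5, so a greedy tree has no reason to prefer the informative features over the irrelevant one, whereas the joint search over three nodes reaches an impurity of 0.0.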

2021 ◽  
Vol 257 ◽  
pp. 02080
Author(s):  
Ruishan Sun ◽  
Chongfeng Li

Landing safety is a key issue in civil aviation safety management. To fully mine flight data for the factors that influence hard landings and to effectively predict the risk of a hard landing, the random forest algorithm was introduced. First, this paper qualitatively analyzed the factors that influence hard landings and selected the model features from the flight data. Second, it presented a quantitative method for analyzing feature importance based on the Gini index. Finally, because the hard-landing dataset was class-imbalanced, the model was trained using the SMOTE method; the random forest classifier was then built and verified with real flight data. The results showed that the recall of the model was 85.50%. The model can not only help prevent hard landings effectively but also provide a methodological reference for airlines applying data mining to improve their management of flight events.
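The SMOTE step mentioned above creates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. Below is a minimal sketch of that idea only; the function name `smote`, the two-feature toy points, and the parameter values are assumptions and do not come from the paper.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical minority-class samples with two made-up numeric features.
hard_landings = [(1.0, 2.0), (1.5, 2.5), (2.0, 2.0), (1.2, 3.0)]
new_points = smote(hard_landings, n_new=6)
print(len(new_points), "synthetic samples")
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region already occupied by the minority class.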


2009 ◽  
Vol 5 (S267) ◽  
pp. 147-147
Author(s):  
Yanxia Zhang ◽  
Yongheng Zhao ◽  
Hongwen Zheng

We investigate selection and weighting of features by applying a random forest algorithm to multiwavelength data. Then we employ a k-nearest neighbor method to distinguish quasars from stars. We then compare the performance of this approach based on all features, weighted features, and selected features. We find that the k-nearest neighbor approach combined with random forests effectively separates quasars from stars.
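One way to read the weighted-feature variant is as a k-nearest neighbor vote under a distance in which each feature is scaled by an importance weight. The sketch below assumes this reading; the weights, the two toy features, and the sample values are hypothetical stand-ins for what a random forest and real multiwavelength data would provide.

```python
import math
from collections import Counter


def weighted_knn_predict(train, query, weights, k=3):
    """Majority vote among the k training points closest to `query`.

    Distances are Euclidean with each squared difference scaled by the
    corresponding feature weight.
    """
    def dist(a, b):
        return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical samples: the first feature separates the classes,
# the second feature is uninformative.
train = [
    ((0.2, 1.0), "star"),
    ((0.3, 1.1), "star"),
    ((0.1, 4.0), "star"),
    ((0.9, 4.0), "quasar"),
    ((1.0, 5.0), "quasar"),
    ((0.85, 3.0), "quasar"),
]
query = (0.8, 1.0)

print("all features equal:", weighted_knn_predict(train, query, (1.0, 1.0)))
print("importance-weighted:", weighted_knn_predict(train, query, (1.0, 0.0)))
```

In this toy case the uninformative feature pulls the unweighted vote toward "star", while giving it zero weight lets the informative feature decide and the vote becomes "quasar".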


Author(s):  
Oyelakin A. M ◽  
Alimi O. M ◽  
Mustapha I. O ◽  
Ajiboye I. K

Phishing attacks have been used in different ways to harvest the confidential information of unsuspecting internet users. To stem the tide of phishing-based attacks, several machine learning techniques have been proposed in the past. However, few studies have compared single and ensemble machine learning-based models for the classification of phishing attacks. This study carried out a performance analysis of selected single and ensemble machine learning (ML) classifiers in phishing classification. The focus is to investigate how these algorithms behave in the classification of phishing attacks in the chosen dataset. Logistic Regression and Decision Trees were chosen as single learning classifiers, while a simple voting technique and Random Forest were used as the ensemble machine learning algorithms. Accuracy, precision, recall and F1-score were used as performance metrics. The Logistic Regression algorithm recorded an accuracy of 0.86, a precision of 0.89, a recall of 0.87 and an F1-score of 0.81. Similarly, the Decision Trees classifier achieved an accuracy of 0.87, a precision of 0.83, a recall of 0.88 and an F1-score of 0.81. The voting ensemble achieved an accuracy of 0.92, a precision of 0.90, a recall of 0.92 and an F1-score of 0.92. The Random Forest algorithm recorded 0.98, 0.97, 0.98 and 0.97 for accuracy, precision, recall and F1-score, respectively. In the experimental analyses, the Random Forest algorithm outperformed the simple voting ensemble and the two single algorithms used for phishing URL detection. The study established that the ensemble techniques used in the experiments are more effective for phishing URL identification than the single classifiers.
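The simple voting ensemble and the four metrics named above can be sketched as follows. This is an illustrative example on made-up predictions, not the study's data or code; the helper names `majority_vote` and `binary_metrics` are assumptions. F1 is computed as the harmonic mean of precision and recall.

```python
from collections import Counter


def majority_vote(predictions):
    """Hard voting: per-sample majority label across classifiers.

    `predictions` is a list with one label list per classifier. An odd
    number of classifiers avoids ties for binary labels.
    """
    return [Counter(sample).most_common(1)[0][0] for sample in zip(*predictions)]


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels (1 = phishing) and outputs of three imperfect classifiers.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
clf_a = [1, 1, 0, 1, 0, 0, 1, 0]
clf_b = [1, 0, 1, 1, 0, 1, 0, 0]
clf_c = [1, 1, 1, 0, 0, 0, 0, 1]

ensemble = majority_vote([clf_a, clf_b, clf_c])
print("classifier A:", binary_metrics(y_true, clf_a))
print("voting      :", binary_metrics(y_true, ensemble))
```

Each toy classifier makes two mistakes, but on different samples, so the majority vote corrects all of them. This is the mechanism by which voting can outperform its individual members when their errors are not strongly correlated.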


2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Shuhui Yi ◽  
Hongxia Zhu ◽  
Junjie Liu ◽  
Junnan Li

Nonintrusive industrial load identification can accurately acquire the operating data of each load in a plant, which benefits intelligent power management. Identifying industrial loads is complicated and hard to realize because transient data for modeling are difficult to collect and high-precision measuring equipment is required. To address this, the article proposes a nonintrusive industrial load identification method using a random forest algorithm and steady-state waveforms. First, the power state of the industrial load is monitored, and when the load changes and then stabilizes, the steady-state waveform is extracted. Because industrial loads have different electrical characteristics, their current waveforms differ to some extent, so characteristic data can be constructed for each industrial load from its own steady-state current waveform. Then, using the high-dimensional steady-state waveform data as the sample data, the bootstrap sampling method and the CART algorithm in the random forest algorithm are used to generate multiple decision trees. Finally, the industrial load types are identified by voting among the decision trees. The actual operating load data of a factory are used as the sample data in the simulation, and the effectiveness and speed of the proposed identification algorithm are verified by a simulation comparison using the combined load method. The simulation results show that the accuracy of the proposed identification algorithm is more than 99%, which is much higher than that of other methods, and the identification time is 3.36 s, with an operation time shorter than that of other methods. Therefore, the proposed identification algorithm can effectively realize nonintrusive industrial load identification.
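The pipeline of bootstrap sampling, tree fitting, and voting described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: depth-one trees (stumps) chosen by Gini impurity stand in for full CART trees, and the two-feature samples and load names are hypothetical rather than taken from the article.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def fit_stump(rows):
    """Fit a depth-one tree: the (feature, threshold) split with lowest Gini."""
    majority = Counter(lbl for _, lbl in rows).most_common(1)[0][0]
    # Fallback when no split exists: send everything left, predict the majority.
    best_score, best = float("inf"), (0, float("inf"), majority, majority)
    for f in range(len(rows[0][0])):
        values = sorted({x[f] for x, _ in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0  # midpoint between neighbouring values
            left = [lbl for x, lbl in rows if x[f] <= thr]
            right = [lbl for x, lbl in rows if x[f] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best_score = score
                best = (f, thr,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    return best


def predict_stump(stump, x):
    f, thr, left_label, right_label = stump
    return left_label if x[f] <= thr else right_label


def fit_forest(rows, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(n_trees)]


def predict_forest(forest, x):
    """Majority vote over the individual tree predictions."""
    return Counter(predict_stump(s, x) for s in forest).most_common(1)[0][0]


# Hypothetical steady-state features for two made-up load types.
rows = [
    ((1.0, 0.20), "motor"), ((1.1, 0.25), "motor"),
    ((0.9, 0.15), "motor"), ((1.2, 0.30), "motor"),
    ((3.0, 0.90), "furnace"), ((3.1, 0.95), "furnace"),
    ((2.9, 0.85), "furnace"), ((3.2, 1.00), "furnace"),
]
forest = fit_forest(rows)
print(predict_forest(forest, (1.05, 0.22)))
print(predict_forest(forest, (3.05, 0.92)))
```

A single tree trained on an unlucky bootstrap sample can be wrong, for example when the sample happens to contain only one load type. Voting across many trees makes the final decision robust to such individual failures.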


Author(s):  
A.E. Semenov

A method of pedestrian navigation in cities was investigated, using Saint Petersburg as an example. The factors that influence people when they choose a walking route were determined. Data corresponding to these factors were collected and used to develop a model, based on the Random Forest algorithm, that estimates the attractiveness of a city street. The results show that the routes provided by the method are 14% more attractive and only 6% longer than the shortest ones.


Author(s):  
Jasmine Ye Nakayama ◽  
Joyce Ho ◽  
Emily Cartwright ◽  
Roy Simpson ◽  
Vicki Stover Hertzberg
