Random KNN feature selection - a fast and stable alternative to Random Forests

In machine learning, a considerable amount of research addresses the interpretability of models and their decisions. Interpretability tends to conflict with model quality. Random Forests are among the best-performing machine learning methods, but they operate as a “black box”. Among the quantifiable approaches to model interpretation are measures of association between predictors and the response. For Random Forests, this approach usually consists of calculating the model’s feature importances. Known methods, including the built-in one, are less suitable in settings with strong multicollinearity among features. We therefore propose an experimental approach to the feature selection task: a greedy forward feature selection method with a least-trees-used criterion. It yields a set of the most informative features that can be used in a machine learning (ML) training process with prediction quality similar to that of the original feature set. We verify the results of the proposed method on two well-known datasets, one with low feature multicollinearity and another with high feature multicollinearity. The proposed method also allows a domain expert to help select among equally important features, which is known as the human-in-the-loop approach.
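
The greedy forward selection loop described in this abstract can be sketched in a few lines of Python. The abstract does not define the least-trees-used criterion, so this sketch does not implement it: it takes a hypothetical `score_fn` callable (higher is better) that a reader could back with any subset-quality measure. The names `greedy_forward_selection`, `score_fn`, `min_gain`, `GROUPS`, and `toy_score` are assumptions made for illustration, not part of the paper.

```python
from typing import Callable, List, Optional, Sequence


def greedy_forward_selection(
    n_features: int,
    score_fn: Callable[[Sequence[int]], float],
    min_gain: float = 1e-9,
) -> List[int]:
    """Add one feature at a time, keeping the one that raises score_fn most.

    Stops when the best candidate improves the score by less than min_gain.
    score_fn is a hypothetical stand-in for the paper's selection criterion.
    """
    selected: List[int] = []
    best_score = float("-inf")
    remaining = set(range(n_features))
    while remaining:
        # Score every one-feature extension of the current subset.
        candidates = [(score_fn(selected + [j]), j) for j in sorted(remaining)]
        # Highest score wins; ties go to the lowest feature index.
        top_score, top_j = max(candidates, key=lambda c: (c[0], -c[1]))
        if selected and top_score - best_score < min_gain:
            break
        selected.append(top_j)
        remaining.remove(top_j)
        best_score = top_score
    return selected


# Toy setup: features 0 and 1 carry the same information (perfectly
# collinear), feature 2 carries different information, feature 3 is noise.
GROUPS: dict[int, Optional[str]] = {0: "a", 1: "a", 2: "b", 3: None}


def toy_score(subset: Sequence[int]) -> float:
    """Count the distinct information groups covered by the subset."""
    return float(len({GROUPS[j] for j in subset if GROUPS[j] is not None}))


print(greedy_forward_selection(4, toy_score))  # [0, 2]
```

In this toy run, features 0 and 1 tie on the first step and the tie is broken by index. That tie is the point at which the human-in-the-loop step mentioned in the abstract could let a domain expert choose between equally informative features.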

Hybrid biogeography based simultaneous feature selection and MHC class I peptide binding prediction using support vector machines and random forests

Journal of Immunological Methods ◽ 10.1016/j.jim.2012.09.013 ◽ 2013 ◽ Vol 387 (1-2) ◽ pp. 284-292 ◽ Cited By ~ 8

Author(s): Atulji Srivastava ◽ Shameek Ghosh ◽ N. Anantharaman ◽ V.K. Jayaraman

Keyword(s): Feature Selection ◽ Support Vector Machines ◽ MHC Class I ◽ Random Forests ◽ Peptide Binding ◽ Class I ◽ Support Vector ◽ Binding Prediction ◽ Vector Machines ◽ Peptide Binding Prediction

Random forests for feature selection in QSPR Models - an application for predicting standard enthalpy of formation of hydrocarbons

Journal of Cheminformatics ◽ 10.1186/1758-2946-5-9 ◽ 2013 ◽ Vol 5 (1) ◽ Cited By ~ 34

Author(s): Ana L Teixeira ◽ João P Leal ◽ Andre O Falcao

Keyword(s): Feature Selection ◽ Enthalpy Of Formation ◽ Random Forests ◽ Standard Enthalpy ◽ Standard Enthalpy Of Formation

CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests

BMC Bioinformatics ◽ 10.1186/s12859-017-1578-z ◽ 2017 ◽ Vol 18 (1) ◽ Cited By ~ 43

Author(s): Li Ma ◽ Suohai Fan

Keyword(s): Feature Selection ◽ Parameter Optimization ◽ Random Forests ◽ Hybrid Algorithm

Short-Term Traffic Flow Forecasting for Urban Roads Using Data-Driven Feature Selection Strategy and Bias-Corrected Random Forests

Transportation Research Record: Journal of the Transportation Research Board ◽ 10.3141/2645-17 ◽ 2017 ◽ Vol 2645 (1) ◽ pp. 157-167 ◽ Cited By ~ 9

Author(s): Jishun Ou ◽ Jingxin Xia ◽ Yao-Jan Wu ◽ Wenming Rao

Keyword(s): Feature Selection ◽ Traffic Flow ◽ Random Forests ◽ Urban Traffic ◽ Data Driven ◽ Series Data ◽ Selection Strategy ◽ Percentage Error ◽ Short Term ◽ Traffic Flow Forecasting

Urban traffic flow forecasting is essential to proactive traffic control and management. Most existing forecasting methods depend on proper and reliable input features, for example, weather conditions and spatiotemporal lagged variables of traffic flow. However, the feature selection process is often done manually without comprehensive evaluation, which leads to inaccurate results. To address this challenge, this paper presents an approach combining the bias-corrected random forests algorithm with a data-driven feature selection strategy for short-term urban traffic flow forecasting. First, several input features were extracted from traffic flow time series data. Then the importance of these features was quantified with the permutation importance measure. Next, a data-driven feature selection strategy was introduced to identify the most important features. Finally, the forecasting model was built on the bias-corrected random forests algorithm and the selected features. The proposed approach was validated with data collected from three types of urban roads (expressway, major arterial, and minor arterial) in Kunshan City, China. The proposed approach was also compared with 10 existing approaches to verify its effectiveness. The results of the validation and comparison show that, even without further model tuning, the proposed approach achieves the lowest average mean absolute error and root mean square error on six stations, and the second-best average performance in mean absolute percentage error. In addition, training efficiency is improved compared with the original random forests method owing to the use of the feature selection strategy.
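
The permutation importance step mentioned in this abstract can be illustrated with a minimal, library-free Python sketch. It shuffles one feature column at a time and records how much the mean absolute error grows. The `predict` function is a hypothetical stand-in for a trained forecasting model, the data are synthetic, and the final "keep features with positive importance" rule is an assumption: the abstract does not spell out its data-driven selection strategy or its bias correction, and neither is reproduced here.

```python
import random
from typing import Callable, List

Matrix = List[List[float]]


def mean_absolute_error(y_true: List[float], y_pred: List[float]) -> float:
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(
    predict: Callable[[Matrix], List[float]],
    X: Matrix,
    y: List[float],
    n_repeats: int = 5,
    seed: int = 0,
) -> List[float]:
    """Mean increase in MAE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mean_absolute_error(y, predict(X))
    importances: List[float] = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving every other column intact.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mean_absolute_error(y, predict(X_perm)) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Synthetic data: the target depends on feature 0 only.
X = [[float(i), float((i * 7) % 5)] for i in range(20)]
y = [2.0 * row[0] for row in X]


def predict(rows: Matrix) -> List[float]:
    """Hypothetical stand-in for a fitted model; it ignores feature 1."""
    return [2.0 * row[0] for row in rows]


importances = permutation_importance(predict, X, y)
# Assumed selection rule: keep features whose shuffling hurts the error.
selected = [j for j, imp in enumerate(importances) if imp > 0.0]
print(selected)
```

Because the stand-in model never reads feature 1, shuffling it leaves the error unchanged and its importance is exactly zero, so only feature 0 is kept. With a real model, the importance would normally be computed on held-out data so that it reflects predictive value and not fit to the training set.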
