The impact of feature selection methods on machine learning-based docking prediction of Indonesian medicinal plant compounds and HIV-1 protease

Author(s):  
Rahman Pujianto ◽  
Yohanes Gultom ◽  
Ari Wibisono ◽  
Arry Yanuar ◽  
Heru Suhartanto
AI ◽  
2021 ◽  
Vol 2 (1) ◽  
pp. 71-88
Author(s):  
Christopher D. Whitmire ◽  
Jonathan M. Vance ◽  
Hend K. Rasheed ◽  
Ali Missaoui ◽  
Khaled M. Rasheed ◽  
...  

Predicting alfalfa biomass and crop yield for livestock feed is important to the daily lives of virtually everyone, and many features of data from this domain combined with corresponding weather data can be used to train machine learning models for yield prediction. In this work, we used yield data of different alfalfa varieties from multiple years in Kentucky and Georgia, and we compared the impact of different feature selection methods on machine learning (ML) models trained to predict alfalfa yield. Linear regression, regression trees, support vector machines, neural networks, Bayesian regression, and nearest neighbors were all developed with cross validation. The features used included weather data, historical yield data, and the sown date. The feature selection methods that were compared included a correlation-based method, the ReliefF method, and a wrapper method. We found that the best method was the correlation-based method, and the feature set it found consisted of the Julian day of the harvest, the number of days between the sown and harvest dates, cumulative solar radiation since the previous harvest, and cumulative rainfall since the previous harvest. Using these features, the k-nearest neighbor and random forest methods achieved an average R value over 0.95, and average mean absolute error less than 200 lbs./acre. Our top R2 of 0.90 beats a previous work’s best R2 of 0.87. Our primary contribution is the demonstration that ML, with feature selection, shows promise in predicting crop yields even on simple datasets with a handful of features, and that reporting accuracies in R and R2 offers an intuitive way to compare results among various crops.
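The pipeline described above, correlation-based feature ranking followed by cross-validated kNN and random-forest regressors scored in R², can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's Kentucky/Georgia yield data; the feature count and model settings are assumptions.

```python
# Sketch: rank features by absolute Pearson correlation with the target,
# keep the top k, then score kNN and random-forest regressors with
# cross-validated R^2, mirroring the evaluation style described above.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=300, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)

# Correlation-based selection: keep the 4 features most correlated with y
# (the paper's selected set also had 4 features: Julian day, days since
# sowing, cumulative solar radiation, cumulative rainfall).
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
top4 = np.argsort(corr)[-4:]
X_sel = X[:, top4]

for name, model in [("kNN", KNeighborsRegressor(n_neighbors=5)),
                    ("RF", RandomForestRegressor(n_estimators=100, random_state=0))]:
    r2 = cross_val_score(model, X_sel, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean CV R^2 = {r2:.3f}")
```

Because the selector and the models are cheap, this setup makes it easy to compare feature subsets the way the paper compares selection methods.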


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Sarv Priya ◽  
Tanya Aggarwal ◽  
Caitlin Ward ◽  
Girish Bathla ◽  
Mathews Jacob ◽  
...  

Abstract: Side experiments are performed on radiomics models to improve their reproducibility. We measure the impact of myocardial masks, radiomic side experiments, and the data augmentation for information transfer (DAFIT) approach in differentiating patients with and without pulmonary hypertension (PH) using cardiac MRI (CMRI)-derived radiomics. Feature extraction was performed from the left ventricle (LV) and right ventricle (RV) myocardial masks using CMRI in 82 patients (42 PH and 40 controls). Several side experiments were evaluated: original data without and with intraclass correlation (ICC) feature filtering, and the DAFIT approach (without and with ICC feature filtering). Multiple machine learning and feature selection strategies were evaluated. The primary analysis included all PH patients, with a subgroup analysis of PH patients with preserved LVEF (≥ 50%). For both the primary and the subgroup analysis, the DAFIT approach without feature filtering was the highest performer (AUC 0.957–0.958). ICC approaches performed poorly compared to the DAFIT approach. The performance of combined LV and RV masks was superior to that of either mask alone. The top-performing model varied across approaches (AUC 0.862–0.958). In summary, the DAFIT approach with features from combined LV and RV masks provides superior performance, while feature-filtering approaches perform poorly. Model performance varies with the combination of feature selection method and model.
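The ICC feature filtering evaluated above can be sketched as follows: each radiomic feature is measured twice per subject (for example, from two segmentations) and kept only if its two-way mixed, consistency ICC(3,1) exceeds a threshold. The 0.75 cutoff and the toy features are illustrative assumptions, not values from the study.

```python
# Minimal sketch of ICC-based feature filtering on synthetic features.
import numpy as np

def icc_3_1(ratings):
    """ICC(3,1) for an (n_subjects, k_raters) matrix (two-way mixed, consistency)."""
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)
    col_means = ratings.mean(axis=0)
    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()   # between subjects
    ss_cols = n * ((col_means - grand) ** 2).sum()   # between raters
    ms_rows = ss_rows / (n - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

rng = np.random.default_rng(0)
base = rng.normal(size=50)
# "stable": second extraction nearly repeats the first; "unstable": it does not.
stable = np.column_stack([base, base + rng.normal(scale=0.05, size=50)])
unstable = np.column_stack([base, rng.normal(size=50)])

kept = [name for name, m in [("stable", stable), ("unstable", unstable)]
        if icc_3_1(m) > 0.75]
print(kept)  # only the reproducible feature survives the filter
```

The study's finding that this filter hurt performance suggests it can discard features that are noisy across segmentations yet still informative.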


2011 ◽  
Vol 268-270 ◽  
pp. 697-700
Author(s):  
Rui Xue Duan ◽  
Xiao Jie Wang ◽  
Wen Feng Li

As the volume of short text documents on the Internet grows tremendously, organizing these short texts well becomes increasingly urgent. However, traditional feature selection methods are not suitable for short texts. In this paper, we propose a method that incorporates syntactic information for short texts, emphasizing features that have more dependency relations with other words. The SVM classifier and the Weka machine learning environment are used in our experiments. The experimental results show that by incorporating syntactic information into short texts, we obtain more powerful features than traditional feature selection methods such as DF and CHI. The precision of short text classification improved from 86.2% to 90.8%.


2021 ◽  
Author(s):  
Tammo P.A. Beishuizen ◽  
Joaquin Vanschoren ◽  
Peter A.J. Hilbers ◽  
Dragan Bošnački

Abstract: Background: Automated machine learning aims to automate the building of accurate predictive models, including the creation of complex data preprocessing pipelines. Although successful in many fields, these systems struggle to produce good results on biomedical datasets, especially given the high dimensionality of the data. Results: In this paper, we explore the automation of feature selection in these scenarios. We analyze which feature selection techniques are ideally included in an automated system, determine how to efficiently find the ones that best fit a given dataset, integrate this into an existing AutoML tool (TPOT), and evaluate it on four very different yet representative types of biomedical data: microarray, mass spectrometry, clinical, and survey datasets. We focus on feature selection rather than latent feature generation, since we often want to explain the model predictions in terms of the intrinsic features of the data. Conclusion: Our experiments show that for none of these datasets do we need more than 200 features to accurately explain the output; additional features did not significantly increase quality. We also find that the automated machine learning results are significantly improved after adding additional feature selection methods and prior knowledge on how to select and tune them.
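The 200-feature finding above suggests a simple guard for high-dimensional biomedical data: cap any selector in the pipeline at 200 features. The sketch below is a plain scikit-learn pipeline, not the paper's TPOT integration; the dataset shape and classifier are assumptions chosen to mimic a microarray-like p >> n setting.

```python
# Sketch: a fixed feature-selection cap inside a cross-validated pipeline.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Microarray-like data: far more features than samples.
X, y = make_classification(n_samples=200, n_features=2000, n_informative=20,
                           random_state=0)

k = min(200, X.shape[1])  # the paper's observed sufficiency bound
pipe = Pipeline([("select", SelectKBest(f_classif, k=k)),
                 ("clf", LogisticRegression(max_iter=1000))])
# Selection happens inside each CV fold, avoiding selection bias.
acc = cross_val_score(pipe, X, y, cv=5).mean()
print(f"CV accuracy with {k} features: {acc:.3f}")
```

Putting the selector inside the pipeline is what makes the cross-validation honest: the features are re-selected on each training fold.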


Text mining uses machine learning (ML) and natural language processing (NLP) to recognize implicit knowledge in text; such knowledge serves many domains, including translation, media search, and business decision making. Opinion mining (OM) is one of the most promising text mining fields, used to discover polarity in text, with clear benefits for business. ML techniques are divided into two approaches, supervised and unsupervised learning, and here we test OM feature selection (FS) using four ML techniques. In this paper, we implemented a number of experiments with four machine learning techniques on the same three Arabic-language corpora. This paper aims to increase the accuracy of opinion highlighting in Arabic by using enhanced feature selection approaches. The proposed FS model is adopted to enhance opinion highlighting. The experimental results show that the proposed approaches outperform the baselines at different levels of supervision, i.e., across techniques and data domains. Multiple levels of comparison are carried out and discussed for a deeper understanding of the impact of the proposed model on several ML techniques.
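The experimental design described above, the same feature selection step evaluated with and without across several classifiers on shared corpora, can be sketched generically. The four classifiers, the mutual-information selector, and the synthetic data below are illustrative assumptions, not the paper's exact techniques or Arabic corpora.

```python
# Sketch: compare several classifiers with and without a shared FS step.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=100, n_informative=10,
                           random_state=1)

results = {}
for name, clf in [("NB", GaussianNB()), ("SVM", LinearSVC()),
                  ("DT", DecisionTreeClassifier(random_state=1)),
                  ("kNN", KNeighborsClassifier())]:
    base = cross_val_score(clf, X, y, cv=5).mean()
    with_fs = cross_val_score(
        make_pipeline(SelectKBest(mutual_info_classif, k=10), clf),
        X, y, cv=5).mean()
    results[name] = (base, with_fs)

for name, (b, f) in results.items():
    print(f"{name}: {b:.3f} -> {f:.3f} with FS")
```

Reporting the paired scores per classifier, as above, is what enables the multi-level comparison the paper discusses.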


2017 ◽  
Vol 24 (1) ◽  
pp. 3-37 ◽  
Author(s):  
SANDRA KÜBLER ◽  
CAN LIU ◽  
ZEESHAN ALI SAYYED

Abstract: We investigate feature selection methods for machine learning approaches to sentiment analysis. More specifically, we use data from the cooking platform Epicurious and attempt to predict ratings for recipes based on user reviews. In machine learning approaches to such tasks, it is common to use word or part-of-speech n-grams. This results in a large set of features, of which only a small subset may be good indicators of sentiment. One of the questions we investigate concerns the extension of feature selection methods from a binary classification setting to a multi-class problem. We show that an inherently multi-class approach, multi-class information gain, outperforms ensembles of binary methods. We also investigate how to mitigate the effects of extreme skewing in our data set by making our features more robust and by using review and recipe sampling. We show that over-sampling is the best method for boosting performance on the minority classes, but it also results in a severe drop in overall accuracy of at least 6 percentage points.
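The two ideas above can be sketched together: information gain (estimated here via mutual information) gives one score per feature directly over all classes, with no binary ensembling, and random over-sampling rebalances skewed rating classes. The synthetic four-class data and class weights are illustrative, not the Epicurious set.

```python
# Sketch: multi-class feature ranking plus over-sampling of minority classes.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=400, n_features=30, n_informative=6,
                           n_classes=4, n_clusters_per_class=1,
                           weights=[0.7, 0.1, 0.1, 0.1], random_state=0)

# Inherently multi-class information gain: one score per feature.
ig = mutual_info_classif(X, y, random_state=0)
top = np.argsort(ig)[-6:]

# Over-sample every minority class (with replacement) up to the majority count.
rng = np.random.default_rng(0)
counts = np.bincount(y)
idx = np.concatenate([rng.choice(np.flatnonzero(y == c), counts.max(), replace=True)
                      for c in range(len(counts))])
Xb, yb = X[idx], y[idx]
print(np.bincount(yb))  # all classes now equal in size
```

Note the trade-off the paper reports: training on the balanced `(Xb, yb)` helps minority classes but duplicated minority examples can pull overall accuracy down.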


Author(s):  
F.E. Usman-Hamza ◽  
A.F. Atte ◽  
A.O. Balogun ◽  
H.A. Mojeed ◽  
A.O. Bajeh ◽  
...  

Software testing using software defect prediction aims to detect as many defects as possible before the software release, which plays an important role in ensuring quality and reliability. Software defect prediction can be modeled as a classification problem that classifies software modules into two classes, defective and non-defective, and classification algorithms are used for this process. This study investigated the impact of feature selection methods on classification via clustering techniques for software defect prediction. Three clustering techniques were selected: Farthest First clusterer, K-Means, and Make-Density clusterer; and three feature selection methods, Chi-Square, Clustering Variation, and Information Gain, were applied to software defect datasets from the NASA repository. The best software defect prediction model was Farthest First with the Information Gain feature selection method, with an accuracy of 78.69%, a precision of 0.804, and a recall of 0.788. The experimental results showed that using clustering techniques as classifiers gave good predictive performance, and feature selection methods further enhanced it. This indicates that classification via clustering can give competitive results against standard classification methods, with the advantage of not having to train a model on a labeled dataset, since it can also be applied to unlabeled data.
Keywords: Classification, Clustering, Feature Selection, Software Defect Prediction
Vol. 26, No. 1, June 2019
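"Classification via clustering" as described above can be sketched with k-means: cluster the training data, label each cluster by the majority class of the training points it captures, then classify test points by their nearest cluster. K-means and the mutual-information selector stand in for the study's Weka clusterers and Information Gain; the synthetic data is illustrative, not the NASA datasets.

```python
# Sketch: feature selection followed by clustering used as a classifier.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           n_clusters_per_class=1, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)

# Information-gain-style selection precedes clustering.
sel = SelectKBest(mutual_info_classif, k=5).fit(Xtr, ytr)
Xtr_s, Xte_s = sel.transform(Xtr), sel.transform(Xte)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(Xtr_s)
# Map each cluster id to the most common training label among its members.
cluster_to_label = {c: np.bincount(ytr[km.labels_ == c]).argmax()
                    for c in range(km.n_clusters)}
pred = np.array([cluster_to_label[c] for c in km.predict(Xte_s)])
acc = (pred == yte).mean()
print(f"accuracy: {acc:.3f}")
```

Only the cluster-to-label mapping uses labels; the clustering itself is unsupervised, which is the advantage the study highlights for unlabeled data.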

