The impact of feature selection methods on machine learning-based docking prediction of Indonesian medicinal plant compounds and HIV-1 protease

Author(s):  
Rahman Pujianto ◽  
Yohanes Gultom ◽  
Ari Wibisono ◽  
Arry Yanuar ◽  
Heru Suhartanto
AI ◽  
2021 ◽  
Vol 2 (1) ◽  
pp. 71-88
Author(s):  
Christopher D. Whitmire ◽  
Jonathan M. Vance ◽  
Hend K. Rasheed ◽  
Ali Missaoui ◽  
Khaled M. Rasheed ◽  
...  

Predicting alfalfa biomass and crop yield for livestock feed is important to the daily lives of virtually everyone, and many features of data from this domain combined with corresponding weather data can be used to train machine learning models for yield prediction. In this work, we used yield data of different alfalfa varieties from multiple years in Kentucky and Georgia, and we compared the impact of different feature selection methods on machine learning (ML) models trained to predict alfalfa yield. Linear regression, regression trees, support vector machines, neural networks, Bayesian regression, and nearest neighbors were all developed with cross validation. The features used included weather data, historical yield data, and the sown date. The feature selection methods that were compared included a correlation-based method, the ReliefF method, and a wrapper method. We found that the best method was the correlation-based method, and the feature set it found consisted of the Julian day of the harvest, the number of days between the sown and harvest dates, cumulative solar radiation since the previous harvest, and cumulative rainfall since the previous harvest. Using these features, the k-nearest neighbor and random forest methods achieved an average R value over 0.95, and average mean absolute error less than 200 lbs./acre. Our top R2 of 0.90 beats a previous work’s best R2 of 0.87. Our primary contribution is the demonstration that ML, with feature selection, shows promise in predicting crop yields even on simple datasets with a handful of features, and that reporting accuracies in R and R2 offers an intuitive way to compare results among various crops.
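The pipeline described above, correlation-based feature ranking followed by cross-validated kNN and random-forest regressors scored in R², can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's Kentucky/Georgia yield data; the feature count and model settings are assumptions.

```python
# Sketch: rank features by absolute Pearson correlation with the target,
# keep the top k, then score kNN and random-forest regressors with
# cross-validated R^2, mirroring the evaluation style described above.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=300, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)

# Correlation-based selection: keep the 4 features most correlated with y
# (the paper's selected set also had 4 features: Julian day, days since
# sowing, cumulative solar radiation, cumulative rainfall).
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
top4 = np.argsort(corr)[-4:]
X_sel = X[:, top4]

for name, model in [("kNN", KNeighborsRegressor(n_neighbors=5)),
                    ("RF", RandomForestRegressor(n_estimators=100, random_state=0))]:
    r2 = cross_val_score(model, X_sel, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean CV R^2 = {r2:.3f}")
```

Because the selector and the models are cheap, this setup makes it easy to compare feature subsets the way the paper compares selection methods.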


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Sarv Priya ◽  
Tanya Aggarwal ◽  
Caitlin Ward ◽  
Girish Bathla ◽  
Mathews Jacob ◽  
...  

Abstract: Side experiments are performed on radiomics models to improve their reproducibility. We measure the impact of myocardial masks, radiomic side experiments, and the data augmentation for information transfer (DAFIT) approach in differentiating patients with and without pulmonary hypertension (PH) using cardiac MRI (CMRI)-derived radiomics. Feature extraction was performed from the left ventricle (LV) and right ventricle (RV) myocardial masks using CMRI in 82 patients (42 PH and 40 controls). Several side experiments were evaluated: original data without and with intraclass correlation (ICC) feature filtering, and the DAFIT approach (without and with ICC feature filtering). Multiple machine learning and feature selection strategies were evaluated. The primary analysis included all PH patients, with a subgroup analysis of PH patients with preserved LVEF (≥ 50%). For both the primary and the subgroup analysis, the DAFIT approach without feature filtering was the highest performer (AUC 0.957–0.958). ICC approaches performed poorly compared to the DAFIT approach. The performance of combined LV and RV masks was superior to that of either mask alone. The top-performing model varied across approaches (AUC 0.862–0.958). In summary, the DAFIT approach with features from combined LV and RV masks provides superior performance, while feature-filtering approaches perform poorly. Model performance varies with the combination of feature selection method and model.
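The ICC feature filtering evaluated above can be sketched as follows: each radiomic feature is measured twice per subject (for example, from two segmentations) and kept only if its two-way mixed, consistency ICC(3,1) exceeds a threshold. The 0.75 cutoff and the toy features are illustrative assumptions, not values from the study.

```python
# Minimal sketch of ICC-based feature filtering on synthetic features.
import numpy as np

def icc_3_1(ratings):
    """ICC(3,1) for an (n_subjects, k_raters) matrix (two-way mixed, consistency)."""
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)
    col_means = ratings.mean(axis=0)
    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()   # between subjects
    ss_cols = n * ((col_means - grand) ** 2).sum()   # between raters
    ms_rows = ss_rows / (n - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

rng = np.random.default_rng(0)
base = rng.normal(size=50)
# "stable": second extraction nearly repeats the first; "unstable": it does not.
stable = np.column_stack([base, base + rng.normal(scale=0.05, size=50)])
unstable = np.column_stack([base, rng.normal(size=50)])

kept = [name for name, m in [("stable", stable), ("unstable", unstable)]
        if icc_3_1(m) > 0.75]
print(kept)  # only the reproducible feature survives the filter
```

The study's finding that this filter hurt performance suggests it can discard features that are noisy across segmentations yet still informative.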


2011 ◽  
Vol 268-270 ◽  
pp. 697-700
Author(s):  
Rui Xue Duan ◽  
Xiao Jie Wang ◽  
Wen Feng Li

As the volume of short text documents on the Internet grows tremendously, organizing these short texts well becomes increasingly urgent. However, traditional feature selection methods are not suitable for short texts. In this paper, we propose a method that incorporates syntactic information for short texts, emphasizing features that have more dependency relations with other words. The SVM classifier and the Weka machine learning environment are used in our experiments. The experimental results show that by incorporating syntactic information into short texts, we obtain more powerful features than traditional feature selection methods such as DF and CHI. The precision of short text classification improved from 86.2% to 90.8%.


2021 ◽  
Author(s):  
Tammo P.A. Beishuizen ◽  
Joaquin Vanschoren ◽  
Peter A.J. Hilbers ◽  
Dragan Bošnački

Abstract: Background: Automated machine learning aims to automate the building of accurate predictive models, including the creation of complex data preprocessing pipelines. Although successful in many fields, these systems struggle to produce good results on biomedical datasets, especially given the high dimensionality of the data. Results: In this paper, we explore the automation of feature selection in these scenarios. We analyze which feature selection techniques are ideally included in an automated system, determine how to efficiently find the ones that best fit a given dataset, integrate this into an existing AutoML tool (TPOT), and evaluate it on four very different yet representative types of biomedical data: microarray, mass spectrometry, clinical, and survey datasets. We focus on feature selection rather than latent feature generation, since we often want to explain the model predictions in terms of the intrinsic features of the data. Conclusion: Our experiments show that for none of these datasets do we need more than 200 features to accurately explain the output; additional features did not significantly increase quality. We also find that the automated machine learning results are significantly improved after adding additional feature selection methods and prior knowledge on how to select and tune them.
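The 200-feature finding above suggests a simple guard for high-dimensional biomedical data: cap any selector in the pipeline at 200 features. The sketch below is a plain scikit-learn pipeline, not the paper's TPOT integration; the dataset shape and classifier are assumptions chosen to mimic a microarray-like p >> n setting.

```python
# Sketch: a fixed feature-selection cap inside a cross-validated pipeline.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Microarray-like data: far more features than samples.
X, y = make_classification(n_samples=200, n_features=2000, n_informative=20,
                           random_state=0)

k = min(200, X.shape[1])  # the paper's observed sufficiency bound
pipe = Pipeline([("select", SelectKBest(f_classif, k=k)),
                 ("clf", LogisticRegression(max_iter=1000))])
# Selection happens inside each CV fold, avoiding selection bias.
acc = cross_val_score(pipe, X, y, cv=5).mean()
print(f"CV accuracy with {k} features: {acc:.3f}")
```

Putting the selector inside the pipeline is what makes the cross-validation honest: the features are re-selected on each training fold.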


Text mining uses machine learning (ML) and natural language processing (NLP) to recognize implicit knowledge in text; such knowledge serves many domains, including translation, media search, and business decision making. Opinion mining (OM) is one of the most promising text mining fields, used to discover polarity in text, with clear benefits for business. ML techniques are divided into two approaches, supervised and unsupervised learning, and here we test OM feature selection (FS) using four ML techniques. In this paper, we implemented a number of experiments with four machine learning techniques on the same three Arabic-language corpora. This paper aims to increase the accuracy of opinion highlighting in Arabic by using enhanced feature selection approaches. The proposed FS model is adopted to enhance opinion highlighting. The experimental results show that the proposed approaches outperform the baselines at different levels of supervision, i.e., across techniques and data domains. Multiple levels of comparison are carried out and discussed for a deeper understanding of the impact of the proposed model on several ML techniques.
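The experimental design described above, the same feature selection step evaluated with and without across several classifiers on shared corpora, can be sketched generically. The four classifiers, the mutual-information selector, and the synthetic data below are illustrative assumptions, not the paper's exact techniques or Arabic corpora.

```python
# Sketch: compare several classifiers with and without a shared FS step.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=100, n_informative=10,
                           random_state=1)

results = {}
for name, clf in [("NB", GaussianNB()), ("SVM", LinearSVC()),
                  ("DT", DecisionTreeClassifier(random_state=1)),
                  ("kNN", KNeighborsClassifier())]:
    base = cross_val_score(clf, X, y, cv=5).mean()
    with_fs = cross_val_score(
        make_pipeline(SelectKBest(mutual_info_classif, k=10), clf),
        X, y, cv=5).mean()
    results[name] = (base, with_fs)

for name, (b, f) in results.items():
    print(f"{name}: {b:.3f} -> {f:.3f} with FS")
```

Reporting the paired scores per classifier, as above, is what enables the multi-level comparison the paper discusses.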


2017 ◽  
Vol 24 (1) ◽  
pp. 3-37 ◽  
Author(s):  
SANDRA KÜBLER ◽  
CAN LIU ◽  
ZEESHAN ALI SAYYED

Abstract: We investigate feature selection methods for machine learning approaches to sentiment analysis. More specifically, we use data from the cooking platform Epicurious and attempt to predict ratings for recipes based on user reviews. In machine learning approaches to such tasks, it is common to use word or part-of-speech n-grams. This results in a large set of features, of which only a small subset may be good indicators of sentiment. One of the questions we investigate concerns the extension of feature selection methods from a binary classification setting to a multi-class problem. We show that an inherently multi-class approach, multi-class information gain, outperforms ensembles of binary methods. We also investigate how to mitigate the effects of extreme skewing in our data set by making our features more robust and by using review and recipe sampling. We show that over-sampling is the best method for boosting performance on the minority classes, but it also results in a severe drop in overall accuracy of at least 6 percentage points.
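The two ideas above can be sketched together: information gain (estimated here via mutual information) gives one score per feature directly over all classes, with no binary ensembling, and random over-sampling rebalances skewed rating classes. The synthetic four-class data and class weights are illustrative, not the Epicurious set.

```python
# Sketch: multi-class feature ranking plus over-sampling of minority classes.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=400, n_features=30, n_informative=6,
                           n_classes=4, n_clusters_per_class=1,
                           weights=[0.7, 0.1, 0.1, 0.1], random_state=0)

# Inherently multi-class information gain: one score per feature.
ig = mutual_info_classif(X, y, random_state=0)
top = np.argsort(ig)[-6:]

# Over-sample every minority class (with replacement) up to the majority count.
rng = np.random.default_rng(0)
counts = np.bincount(y)
idx = np.concatenate([rng.choice(np.flatnonzero(y == c), counts.max(), replace=True)
                      for c in range(len(counts))])
Xb, yb = X[idx], y[idx]
print(np.bincount(yb))  # all classes now equal in size
```

Note the trade-off the paper reports: training on the balanced `(Xb, yb)` helps minority classes but duplicated minority examples can pull overall accuracy down.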


Author(s):  
F.E. Usman-Hamza ◽  
A.F. Atte ◽  
A.O. Balogun ◽  
H.A. Mojeed ◽  
A.O. Bajeh ◽  
...  

Software testing using software defect prediction aims to detect as many defects as possible before the software release, which plays an important role in ensuring quality and reliability. Software defect prediction can be modeled as a classification problem that classifies software modules into two classes, defective and non-defective, and classification algorithms are used for this process. This study investigated the impact of feature selection methods on classification via clustering techniques for software defect prediction. Three clustering techniques were selected: Farthest First clusterer, K-Means, and Make-Density clusterer; and three feature selection methods, Chi-Square, Clustering Variation, and Information Gain, were applied to software defect datasets from the NASA repository. The best software defect prediction model was Farthest First with the Information Gain feature selection method, with an accuracy of 78.69%, a precision of 0.804, and a recall of 0.788. The experimental results showed that using clustering techniques as classifiers gave good predictive performance, and feature selection methods further enhanced it. This indicates that classification via clustering can give competitive results against standard classification methods, with the advantage of not having to train a model on a labeled dataset, since it can also be applied to unlabeled data.
Keywords: Classification, Clustering, Feature Selection, Software Defect Prediction
Vol. 26, No. 1, June 2019
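"Classification via clustering" as described above can be sketched with k-means: cluster the training data, label each cluster by the majority class of the training points it captures, then classify test points by their nearest cluster. K-means and the mutual-information selector stand in for the study's Weka clusterers and Information Gain; the synthetic data is illustrative, not the NASA datasets.

```python
# Sketch: feature selection followed by clustering used as a classifier.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           n_clusters_per_class=1, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)

# Information-gain-style selection precedes clustering.
sel = SelectKBest(mutual_info_classif, k=5).fit(Xtr, ytr)
Xtr_s, Xte_s = sel.transform(Xtr), sel.transform(Xte)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(Xtr_s)
# Map each cluster id to the most common training label among its members.
cluster_to_label = {c: np.bincount(ytr[km.labels_ == c]).argmax()
                    for c in range(km.n_clusters)}
pred = np.array([cluster_to_label[c] for c in km.predict(Xte_s)])
acc = (pred == yte).mean()
print(f"accuracy: {acc:.3f}")
```

Only the cluster-to-label mapping uses labels; the clustering itself is unsupervised, which is the advantage the study highlights for unlabeled data.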

