Using Machine Learning and Feature Selection for Alfalfa Yield Prediction

AI ◽  
2021 ◽  
Vol 2 (1) ◽  
pp. 71-88
Author(s):  
Christopher D. Whitmire ◽  
Jonathan M. Vance ◽  
Hend K. Rasheed ◽  
Ali Missaoui ◽  
Khaled M. Rasheed ◽  
...  

Predicting alfalfa biomass and crop yield for livestock feed is important to the daily lives of virtually everyone, and many features of data from this domain combined with corresponding weather data can be used to train machine learning models for yield prediction. In this work, we used yield data of different alfalfa varieties from multiple years in Kentucky and Georgia, and we compared the impact of different feature selection methods on machine learning (ML) models trained to predict alfalfa yield. Linear regression, regression trees, support vector machines, neural networks, Bayesian regression, and nearest neighbors were all developed with cross-validation. The features used included weather data, historical yield data, and the sown date. The feature selection methods that were compared included a correlation-based method, the ReliefF method, and a wrapper method. We found that the best method was the correlation-based method, and the feature set it found consisted of the Julian day of the harvest, the number of days between the sown and harvest dates, cumulative solar radiation since the previous harvest, and cumulative rainfall since the previous harvest. Using these features, the k-nearest neighbor and random forest methods achieved an average R value over 0.95 and an average mean absolute error of less than 200 lbs./acre. Our top R² of 0.90 beats a previous work’s best R² of 0.87. Our primary contribution is the demonstration that ML, with feature selection, shows promise in predicting crop yields even on simple datasets with a handful of features, and that reporting accuracies in R and R² offers an intuitive way to compare results among various crops.
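As a rough illustration of what a correlation-based filter does, the sketch below ranks candidate features by the absolute Pearson correlation with yield. It is a simplified univariate ranking and not necessarily the method used in the paper; the column names and all numbers are invented for the example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)

def rank_by_correlation(columns, target):
    """Return feature names sorted by |r| with the target, strongest first."""
    scores = {name: abs(pearson_r(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy data (not from the paper).
yield_lbs = [1000, 1300, 1600, 1900]
columns = {
    "days_sown_to_harvest": [30, 40, 50, 60],
    "cumulative_rainfall": [2, 1, 4, 3],
    "noise": [1, -1, -1, 1],
}
ranking = rank_by_correlation(columns, yield_lbs)
```

When R² is taken as the squared correlation, an R of 0.95 corresponds to an R² of about 0.90 (0.95² = 0.9025), which is how the two figures in the abstract relate.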

2021 ◽  
Vol 11 ◽  
Author(s):  
Qi Wan ◽  
Jiaxuan Zhou ◽  
Xiaoying Xia ◽  
Jianfeng Hu ◽  
Peng Wang ◽  
...  

Objective: To evaluate the performance of 2D and 3D radiomics features with different machine learning approaches to classify SPLs based on magnetic resonance (MR) T2-weighted imaging (T2WI). Material and Methods: A total of 132 patients with pathologically confirmed SPLs were examined and randomly divided into training (n = 92) and test (n = 40) datasets. A total of 1692 3D and 1231 2D radiomics features per patient were extracted. Both radiomics features and clinical data were evaluated. A total of 1260 classification models, comprising 3 normalization methods, 2 dimension reduction algorithms, 3 feature selection methods, and 10 classifiers with 7 different feature numbers (confined to 3–9), were compared. Ten-fold cross-validation on the training dataset was applied to choose the candidate final model. The area under the receiver operating characteristic curve (AUC), precision-recall plot, and Matthews Correlation Coefficient (MCC) were used to evaluate the performance of the machine learning approaches. Results: The 3D features were significantly superior to 2D features, showing many more machine learning combinations with AUC greater than 0.7 in both validation and test groups (129 vs. 11). The feature selection methods Analysis of Variance (ANOVA) and Recursive Feature Elimination (RFE) and the classifiers Logistic Regression (LR), Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), and Gaussian Process (GP) had relatively better performance. The best performance of 3D radiomics features in the test dataset (AUC = 0.824, AUC-PR = 0.927, MCC = 0.514) was higher than that of 2D features (AUC = 0.740, AUC-PR = 0.846, MCC = 0.404). The joint 3D and 2D features (AUC = 0.813, AUC-PR = 0.926, MCC = 0.563) showed results similar to those of the 3D features. Incorporating clinical features with 3D and 2D radiomics features slightly improved the AUC to 0.836 (AUC-PR = 0.918, MCC = 0.620) and 0.780 (AUC-PR = 0.900, MCC = 0.574), respectively. Conclusions: After algorithm optimization, 2D feature-based radiomics models yield favorable results in differentiating malignant and benign SPLs, but 3D features are still preferred because of the availability of more machine learning algorithmic combinations with better performance. The feature selection methods ANOVA and RFE and the classifiers LR, LDA, SVM, and GP are more likely to demonstrate better diagnostic performance for 3D features in the current study.
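The figure of 1260 models follows from multiplying the option counts given in the abstract (3 × 2 × 3 × 10 × 7). A minimal sketch of enumerating such a grid is shown below; the option labels are placeholders, since the abstract does not name every method.

```python
from itertools import product

# Placeholder labels; only the counts come from the abstract.
normalizations = [f"normalization_{i}" for i in range(1, 4)]      # 3 methods
reducers = [f"dimension_reduction_{i}" for i in range(1, 3)]      # 2 algorithms
selectors = [f"feature_selection_{i}" for i in range(1, 4)]       # 3 methods
classifiers = [f"classifier_{i}" for i in range(1, 11)]           # 10 classifiers
feature_counts = list(range(3, 10))                               # 3..9, i.e. 7 values

# Every combination of the five choices is one candidate pipeline.
grid = list(product(normalizations, reducers, selectors, classifiers, feature_counts))
n_models = len(grid)
```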


2019 ◽  
Vol 8 (4) ◽  
pp. 1333-1338

Text classification is a vital process due to the large volume of electronic articles. One of the drawbacks of text classification is the high dimensionality of the feature space. Scholars have developed several algorithms to choose relevant features from article text, such as Chi-square (χ²), Information Gain (IG), and Correlation (CFS). These algorithms have been investigated widely for English text, while studies for Arabic text are still limited. In this paper, we investigated four well-known algorithms, Support Vector Machines (SVMs), Naïve Bayes (NB), K-Nearest Neighbors (KNN), and Decision Tree, against benchmark Arabic textual datasets called Saudi Press Agency (SPA) to evaluate the impact of feature selection methods. Using the WEKA tool, we experimented with applying the four classification algorithms with and without feature selection algorithms. The results provided clear evidence that the three feature selection methods often improve classification accuracy by eliminating irrelevant features.
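To make the chi-square criterion concrete, the sketch below scores each term against one class using the standard 2×2 contingency formula, χ² = N(AD − BC)² / ((A+B)(C+D)(A+C)(B+D)). It is a plain-Python illustration rather than the WEKA implementation used in the paper, and the tiny corpus is invented.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class without the term
    d: documents outside the class without the term
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # term or class is absent everywhere; no evidence either way
    return n * (a * d - b * c) ** 2 / denominator

def score_terms(docs, labels, target):
    """Return {term: chi-square score} for the target class."""
    vocabulary = set().union(*docs)
    scores = {}
    for term in vocabulary:
        a = sum(1 for doc, y in zip(docs, labels) if term in doc and y == target)
        b = sum(1 for doc, y in zip(docs, labels) if term in doc and y != target)
        c = sum(1 for doc, y in zip(docs, labels) if term not in doc and y == target)
        d = sum(1 for doc, y in zip(docs, labels) if term not in doc and y != target)
        scores[term] = chi_square(a, b, c, d)
    return scores

# Hypothetical four-document corpus, each document reduced to a set of tokens.
docs = [{"goal", "match"}, {"goal", "team"}, {"market", "stocks"}, {"market", "bank"}]
labels = ["sports", "sports", "economy", "economy"]
scores = score_terms(docs, labels, "sports")
```

Terms that occur in every document of one class and in none of the other (here "goal" and "market") receive the highest score, while terms seen only once score lower and would be the first to be dropped.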


2021 ◽  
Vol 11 (7) ◽  
pp. 3273
Author(s):  
Joana Morgado ◽  
Tania Pereira ◽  
Francisco Silva ◽  
Cláudia Freitas ◽  
Eduardo Negrão ◽  
...  

The evolution of personalized medicine has changed the therapeutic strategy from classical chemotherapy and radiotherapy to a genetic modification targeted therapy, and although biopsy is the traditional method to genetically characterize a lung cancer tumor, it is an invasive and painful procedure for the patient. Nodule image features extracted from computed tomography (CT) scans have been used to create machine learning models that predict gene mutation status in a noninvasive, fast, and easy-to-use manner. However, recent studies have shown that radiomic features extracted from an extended region of interest (ROI) beyond the tumor might be more relevant for predicting the mutation status in lung cancer, and consequently may be used to significantly decrease the mortality rate of patients battling this condition. In this work, we investigated the relation between image phenotypes and the mutation status of Epidermal Growth Factor Receptor (EGFR), the most frequently mutated gene in lung cancer with several approved targeted therapies, using radiomic features extracted from the lung containing the nodule. A variety of linear, nonlinear, and ensemble predictive classification models, along with several feature selection methods, were used to classify the binary outcome of wild-type or mutant EGFR mutation status. The results show that a comprehensive approach using an ROI that included the lung with the nodule can capture relevant information and successfully predict the EGFR mutation status with increased performance compared to local nodule analyses. Linear Support Vector Machine, Elastic Net, and Logistic Regression, combined with the Principal Component Analysis feature selection method implemented to retain 70% of the variance in the feature set, were the best-performing classifiers, reaching Area Under the Curve (AUC) values ranging from 0.725 to 0.737. This holistic approach indicates that information from more extensive regions of the lung containing the nodule allows a more complete lung cancer characterization and should be considered in future radiogenomic studies.
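One common way to apply a variance threshold such as the 70% mentioned above is to keep the smallest number of principal components whose cumulative explained variance reaches the target. The sketch below assumes the component variances (eigenvalues) are already available; the values are made up, and the paper's exact procedure may differ.

```python
def components_for_variance(eigenvalues, target=0.70):
    """Smallest number of leading components whose variance share reaches `target`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return k
    return len(ordered)

# Hypothetical component variances: shares are 50%, 30%, 10%, 10%.
example_eigenvalues = [5.0, 3.0, 1.0, 1.0]
k_70 = components_for_variance(example_eigenvalues, target=0.70)
k_95 = components_for_variance(example_eigenvalues, target=0.95)
```

With these invented values, two components already cover 80% of the variance, so the 70% threshold keeps two, whereas a 95% threshold would need all four.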


2013 ◽  
Vol 2013 ◽  
pp. 1-9 ◽  
Author(s):  
Md Rahat Hossain ◽  
Amanullah Maung Than Oo ◽  
A. B. M. Shawkat Ali

This paper empirically shows that applying selected feature subsets to machine learning techniques significantly improves the accuracy of solar power prediction. Experiments are performed using five well-known wrapper feature selection methods to obtain the solar power prediction accuracy of machine learning techniques with selected feature subsets. For all the experiments, the machine learning techniques least median square (LMS), multilayer perceptron (MLP), and support vector machine (SVM) are used. Afterwards, these results are compared with the solar power prediction accuracy of the same machine learning techniques (i.e., LMS, MLP, and SVM) without applying feature selection methods (WAFS). Experiments are carried out using reliable, real-life historical meteorological data. The comparison clearly shows that LMS, MLP, and SVM provide better prediction accuracy (i.e., reduced MAE and MASE) with selected feature subsets than without them. The experimental results support the conclusion that devoting more attention and effort to feature subset selection (e.g., the effect of selected feature subsets on prediction accuracy, which is investigated in this paper) can significantly improve the accuracy of solar power prediction.
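For readers unfamiliar with the two error measures, the sketch below computes MAE and MASE in plain Python. MASE is taken here in its usual form, the forecast MAE divided by the in-sample MAE of a naive one-step forecast; the abstract does not state which scaling the authors used, and the numbers are invented.

```python
def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mase(actual, predicted, training):
    """Mean absolute scaled error.

    The scale is the MAE of a naive forecast on the training series, where each
    value is predicted by the value immediately before it (an assumption here).
    """
    naive_errors = [abs(training[i] - training[i - 1]) for i in range(1, len(training))]
    scale = sum(naive_errors) / len(naive_errors)
    return mae(actual, predicted) / scale

# Hypothetical solar power readings (arbitrary units).
training_series = [0.0, 2.0, 4.0, 6.0]
observed = [5.0, 6.0]
forecast = [4.0, 8.0]
example_mae = mae(observed, forecast)
example_mase = mase(observed, forecast, training_series)
```

A MASE below 1 means the model's errors are smaller than those of the naive forecast on the training data, which makes the measure comparable across series with different scales.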


Author(s):  
Anusara Hirunyawanakul ◽  
Nuntawut Kaoungku ◽  
Nittaya Kerdprasop ◽  
Kittisak Kerdprasop

Hard Disk Drive (HDD) manufacturing is one real-world application area in which machine learning has been extensively adopted for problem solving. However, most problem-solving activities in the HDD industry focus on failure root-cause analysis. Machine learning is rarely applied to yield prediction. This research presents the application of machine learning and statistical techniques to select appropriate features for yield prediction in the HDD manufacturing process. Seven well-known algorithms are used in the feature selection step: decision tree (C5 and CART), Support Vector Machine (SVM), stepwise regression, Genetic Algorithm (GA), chi-square, and information gain. Two prominent learning algorithms, Multiple Linear Regression (MLR) and Artificial Neural Networks (ANN), are used in the yield prediction modeling step. Yield prediction performance has been assessed using two evaluation metrics: Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). Yield prediction with MLR shows higher accuracy than yield estimation traditionally performed by human engineers. We conclude that the proposed novel learning steps can help HDD process engineers predict yield with better performance. In particular, when GA is applied as the feature selection tool, the MAE is reduced from 0.014 (yield estimated by a human engineer) to 0.0059 (yield predicted by MLR), an error reduction of about 60%.
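The reported error reduction can be checked directly from the two MAE values in the abstract. The helper name below is introduced only for this illustration.

```python
def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline

# MAE values quoted in the abstract.
mae_human_engineer = 0.014
mae_mlr_with_ga = 0.0059

reduction = relative_reduction(mae_human_engineer, mae_mlr_with_ga)
reduction_percent = round(reduction * 100, 1)
```

The result is roughly 58%, which the abstract rounds to about 60%.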


Sensors ◽  
2021 ◽  
Vol 21 (19) ◽  
pp. 6407
Author(s):  
Nina Pilyugina ◽  
Akihiko Tsukahara ◽  
Keita Tanaka

The aim of this study was to find an efficient method to determine features that characterize octave illusion data. Specifically, this study compared the efficiency of several automatic feature selection methods for extracting features from auditory steady-state response (ASSR) data in brain activity, in order to distinguish auditory octave illusion and non-illusion groups by the difference in ASSR amplitudes using machine learning. We compared univariate selection, recursive feature elimination, principal component analysis, and feature importance by testing the results of each feature selection method with several machine learning algorithms: linear regression, random forest, and support vector machine. Univariate selection with the SVM as the classification method showed the highest accuracy, 75%, compared to 66.6% without feature selection. The results obtained will be used in future work on explaining the mechanism behind the octave illusion phenomenon and on creating an algorithm for automatic octave illusion classification.


2020 ◽  
Vol 12 (3) ◽  
pp. 1035 ◽  
Author(s):  
Huseyin Polat ◽  
Onur Polat ◽  
Aydin Cetin

Software Defined Networking (SDN) offers several advantages such as manageability, scaling, and improved performance. However, SDN involves specific security problems, especially if its controller is defenseless against Distributed Denial of Service (DDoS) attacks. The processing and communication capacity of the controller is overloaded when DDoS attacks occur against the SDN controller. As a result of the unnecessary flows produced by the controller for the attack packets, the switch flow table fills up, and network performance declines to a critical threshold. In this study, DDoS attacks in SDN were detected using machine learning-based models. First, specific features were obtained from SDN for the dataset in normal conditions and under DDoS attack traffic. Then, a new dataset was created using feature selection methods on the existing dataset. Feature selection methods were preferred to simplify the models, facilitate their interpretation, and provide a shorter training time. Both datasets, created with and without feature selection methods, were trained and tested with Support Vector Machine (SVM), Naive Bayes (NB), Artificial Neural Network (ANN), and K-Nearest Neighbors (KNN) classification models. The test results showed that the use of the wrapper feature selection with a KNN classifier achieved the highest accuracy rate (98.3%) in DDoS attack detection. The results suggest that machine learning and feature selection algorithms can achieve better results in the detection of DDoS attacks in SDN with promising reductions in processing loads and times.
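A wrapper method scores candidate feature subsets by actually training and evaluating the classifier on them. The sketch below uses greedy forward selection with a leave-one-out KNN accuracy as the score. The abstract does not say which search strategy or validation scheme was used, so both are assumptions, and the six-row dataset is invented.

```python
def loo_knn_accuracy(rows, labels, features, k=1):
    """Leave-one-out accuracy of a KNN classifier restricted to `features`."""
    correct = 0
    for i, row in enumerate(rows):
        distances = []
        for j, other in enumerate(rows):
            if i == j:
                continue  # hold out the row being classified
            squared = sum((row[f] - other[f]) ** 2 for f in features)
            distances.append((squared, labels[j]))
        distances.sort(key=lambda pair: pair[0])
        votes = [label for _, label in distances[:k]]
        # Majority vote; sorting the candidates keeps tie-breaking deterministic.
        prediction = max(sorted(set(votes)), key=votes.count)
        if prediction == labels[i]:
            correct += 1
    return correct / len(rows)

def forward_selection(rows, labels, n_features, k=1):
    """Greedily add the feature that most improves accuracy; stop when none does."""
    selected, best_score = [], 0.0
    remaining = list(range(n_features))
    while remaining:
        scored = [(loo_knn_accuracy(rows, labels, selected + [f], k), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: (pair[0], -pair[1]))
        if score <= best_score:
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score

# Hypothetical flow records: column 0 separates the classes, column 1 is noise.
rows = [(0.0, 5.0), (0.1, 1.0), (0.2, 9.0), (1.0, 5.1), (1.1, 0.9), (1.2, 9.1)]
labels = ["normal", "normal", "normal", "attack", "attack", "attack"]
selected, best_score = forward_selection(rows, labels, n_features=2)
```

On this toy data the search keeps only the informative column, because adding the noisy one pulls each point toward neighbors of the wrong class and lowers the accuracy.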


2021 ◽  
Vol 9 ◽  
Author(s):  
Naresh Mali ◽  
Varun Dutt ◽  
K. V. Uday

Landslide disaster risk reduction necessitates the investigation of different geotechnical causal factors for slope failures. Machine learning (ML) techniques have been proposed to study causal factors across many application areas. However, the development of ensemble ML techniques for identifying the geotechnical causal factors for slope failures and their subsequent prediction has been lacking in the literature. The primary goal of this research is to develop and evaluate novel feature selection methods for identifying causal factors for slope failures and assess the potential of ensemble and individual ML techniques for slope failure prediction. Twenty-one geotechnical causal factors were obtained from 60 sites (both landslide and non-landslide) spread across a landslide-prone area in Mandi, India. Relevant causal factors were evaluated by developing a novel ensemble feature selection method that involved an average of different individual feature selection methods such as correlation, information-gain, gain-ratio, OneR, and F-ratio. Furthermore, different ensemble ML techniques (Random Forest (RF), AdaBoost (AB), Bagging, Stacking, and Voting) and individual ML techniques (Bayesian network (BN), decision tree (DT), multilayer perceptron (MLP), and support vector machine (SVM)) were calibrated on 70% of the sites and tested on the remaining 30%. The ensemble feature selection method yielded six major contributing parameters to slope failures: relative compaction, porosity, saturated permeability, slope angle, angle of internal friction, and in-situ moisture content. Furthermore, the ensemble RF and AB techniques performed the best compared to other ensemble and individual ML techniques on test data. The present study discusses the implications of different causal factors for slope failure prediction.
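One plausible reading of "an average of different individual feature selection methods" is to rescale each method's scores to a common range and average them per feature. The sketch below does this with min-max scaling. The scaling choice, the scores, and the third feature name are assumptions for illustration, not details taken from the paper.

```python
def min_max(scores):
    """Rescale a {feature: score} table to the range [0, 1]."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {name: 0.0 for name in scores}  # method does not discriminate
    return {name: (value - low) / (high - low) for name, value in scores.items()}

def ensemble_rank(score_tables):
    """Average the rescaled scores across methods and rank features, best first."""
    normalized = [min_max(table) for table in score_tables]
    features = list(normalized[0])
    averaged = {f: sum(table[f] for table in normalized) / len(normalized) for f in features}
    return sorted(averaged, key=averaged.get, reverse=True), averaged

# Hypothetical scores from two selection methods on different scales.
correlation_scores = {"porosity": 0.9, "slope_angle": 0.5, "hypothetical_factor": 0.1}
information_gain_scores = {"porosity": 2.5, "slope_angle": 3.0, "hypothetical_factor": 1.0}
ranking, averaged = ensemble_rank([correlation_scores, information_gain_scores])
```

Rescaling first keeps a method with numerically large scores from dominating the average; averaging ranks instead of scores would be another reasonable design.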


2019 ◽  
Author(s):  
Weina Zhang ◽  
Han Liu ◽  
Vincent Michael Bernard Silenzio ◽  
Peiyuan Qiu ◽  
Wenjie Gong

BACKGROUND Postpartum depression (PPD) is a serious public health problem. Building a predictive model for PPD using data during pregnancy can facilitate earlier identification and intervention. OBJECTIVE The aims of this study are to compare the performance of four different machine learning models using data during pregnancy to predict PPD and explore which factors in the model are the most important for PPD prediction. METHODS Information on the pregnancy period from a cohort of 508 women, including demographics, social environmental factors, and mental health, was used as predictors in the models. The Edinburgh Postnatal Depression Scale score within 42 days after delivery was used as the outcome indicator. Using two feature selection methods (expert consultation and random forest-based filter feature selection [FFS-RF]) and two algorithms (support vector machine [SVM] and random forest [RF]), we developed four different machine learning PPD prediction models and compared their predictive performance. RESULTS There was no significant difference between the two feature selection methods in terms of model prediction performance, but 10 fewer factors were selected with FFS-RF than with the expert consultation method. The model based on SVM and FFS-RF had the best predictive performance (sensitivity=0.69, area under the curve=0.78). In the feature importance ranking output by the RF algorithm, psychological resilience, depression during the third trimester, and income level were the most important predictors. CONCLUSIONS In contrast to the expert consultation method, FFS-RF was important in dimension reduction. When the sample size is small, the SVM algorithm is suitable for predicting PPD. In the prevention of PPD, more attention should be paid to the psychological resilience of mothers.
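The two metrics reported above can be computed as follows. Sensitivity is the share of true cases the model flags, and the AUC is computed here as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The labels, scores, and the 0.5 threshold are invented for the example.

```python
def sensitivity(y_true, y_pred):
    """True positive rate: TP / (TP + FN), with 1 marking the positive class."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return true_pos / (true_pos + false_neg)

def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison; ties count as half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Hypothetical outcomes (1 = PPD) and model scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed decision threshold

example_sensitivity = sensitivity(y_true, y_pred)
example_auc = auc(y_true, scores)
```

Sensitivity depends on the chosen threshold while the AUC does not, which is why studies like this one often report both.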

