A stacking ensemble learning framework for genomic prediction

2020 ◽  
Author(s):  
Mang Liang ◽  
Tianpeng Chang ◽  
Bingxing An ◽  
Xinghai Duan ◽  
Lili Du ◽  
...  

Abstract Background: Machine learning (ML) is perhaps the most useful tool for the interpretation of large genomic datasets. However, in existing research the performance of a single machine learning method in genomic selection (GS) has been unsatisfactory. To improve genomic predictions, we constructed a stacking ensemble learning framework (SELF) that integrates three machine learning methods to predict genomic estimated breeding values (GEBVs). Results: We evaluated the prediction ability of SELF on three real datasets and compared the prediction accuracy of SELF, its base learners, GBLUP and BayesB. For each trait, SELF performed better than the base learners, which were support vector regression (SVR), kernel ridge regression (KRR) and elastic net (ENET). The prediction accuracy of SELF was on average 7.70% higher than that of GBLUP across the three datasets. Except for the milk fat percentage (MFP) trait of the German Holstein dairy cattle dataset, SELF was more robust than BayesB for the remaining traits. Conclusions: In this study, we applied a stacking ensemble learning framework (SELF) to genomic prediction, and it performed much better than GBLUP and BayesB in three real datasets with different genetic architectures. Therefore, we believe SELF has the potential to be used to estimate GEBVs in other animals and plants.

2021 ◽  
Vol 12 ◽  
Author(s):  
Mang Liang ◽  
Tianpeng Chang ◽  
Bingxing An ◽  
Xinghai Duan ◽  
Lili Du ◽  
...  

Machine learning (ML) is perhaps the most useful tool for the interpretation of large genomic datasets. However, the performance of a single machine learning method in genomic selection (GS) is currently unsatisfactory. To improve genomic predictions, we constructed a stacking ensemble learning framework (SELF), integrating three machine learning methods, to predict genomic estimated breeding values (GEBVs). The present study evaluated the prediction ability of SELF by analyzing three real datasets with different genetic architectures and comparing the prediction accuracy of SELF, its base learners, genomic best linear unbiased prediction (GBLUP) and BayesB. For each trait, SELF performed better than the base learners, which were support vector regression (SVR), kernel ridge regression (KRR) and elastic net (ENET). The prediction accuracy of SELF was, on average, 7.70% higher than that of GBLUP across the three datasets. Except for the milk fat percentage (MFP) trait of the German Holstein dairy cattle dataset, SELF was more robust than BayesB for all remaining traits. Therefore, we believe that SELF has the potential to be used to estimate GEBVs in other animals and plants.
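
The stacking idea described in this abstract can be illustrated with a minimal sketch. It assumes scikit-learn and uses its StackingRegressor with SVR, KRR and ENET as base learners; the linear meta-learner, all hyperparameters, the helper name build_stack and the simulated genotypes are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.svm import SVR


def build_stack(cv=5):
    """Stack SVR, KRR and ENET under a linear meta-learner (assumed choice)."""
    base_learners = [
        ("svr", SVR(kernel="rbf", C=1.0)),
        ("krr", KernelRidge(kernel="rbf", alpha=1.0)),
        ("enet", ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000)),
    ]
    # The meta-learner is trained on out-of-fold predictions of the base
    # learners, produced internally by cv-fold cross-validation.
    return StackingRegressor(
        estimators=base_learners,
        final_estimator=LinearRegression(),
        cv=cv,
    )


# Simulated data (hypothetical): genotypes coded 0/1/2 and an additive trait.
rng = np.random.default_rng(0)
n_animals, n_snps = 120, 200
X = rng.integers(0, 3, size=(n_animals, n_snps)).astype(float)
snp_effects = rng.normal(0.0, 0.1, size=n_snps)
y = X @ snp_effects + rng.normal(0.0, 1.0, size=n_animals)

# Fit on the first 100 animals and predict the remaining 20.
model = build_stack().fit(X[:100], y[:100])
pred = model.predict(X[100:])

# Correlation between predicted and observed values on the held-out animals.
accuracy = np.corrcoef(pred, y[100:])[0, 1]
print(f"held-out correlation: {accuracy:.3f}")
```

Training the meta-learner on out-of-fold predictions, rather than on predictions for the data the base learners were fitted to, keeps it from simply rewarding whichever base learner overfits most.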


Author(s):  
Hrushikesh Bhosale ◽  
Vigneshwar Ramakrishnan ◽  
Valadi K. Jayaraman

Bacterial virulence can be attributed to a wide variety of factors including toxins that harm the host. Pore-forming toxins are one class of toxins that confer virulence to the bacteria and are one of the promising targets for therapeutic intervention. In this work, we develop a sequence-based machine learning framework for the prediction of pore-forming toxins. For this, we have used distributed representation of the protein sequence encoded by reduced alphabet schemes based on conformational similarity and hydropathy index as input features to Support Vector Machines (SVMs). The choice of conformational similarity and hydropathy indices is based on the functional mechanism of pore-forming toxins. Our methodology achieves about 81% accuracy indicating that conformational similarity, an indicator of the flexibility of amino acids, along with hydrophobic index can capture the intrinsic features of pore-forming toxins that distinguish it from other types of transporter proteins. Increased understanding of the mechanisms of pore-forming toxins can further contribute to the use of such “mechanism-informed” features that may increase the prediction accuracy further.
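
The reduced-alphabet step mentioned above can be sketched as follows. The abstract does not give the actual alphabets, so the three-group split below is a hypothetical grouping based on Kyte-Doolittle hydropathy values, and the thresholds and function names are illustrative only. The sketch stops at k-mer tokens, which a distributed-representation model or an SVM feature pipeline could then consume.

```python
from collections import Counter

# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KD = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9,
    "R": -4.5,
}


def reduce_sequence(seq):
    """Recode a protein sequence into a 3-letter hydropathy alphabet.

    H = hydrophobic, N = intermediate, P = strongly hydrophilic,
    X = unknown residue. The cut-offs are assumptions, not the paper's.
    """
    reduced = []
    for residue in seq.upper():
        hydropathy = KD.get(residue)
        if hydropathy is None:
            reduced.append("X")
        elif hydropathy >= 1.8:
            reduced.append("H")
        elif hydropathy <= -3.2:
            reduced.append("P")
        else:
            reduced.append("N")
    return "".join(reduced)


def kmer_counts(reduced, k=3):
    """Count overlapping k-mers of the reduced sequence."""
    return Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))


reduced = reduce_sequence("AIKGR")
print(reduced)               # HHPNP
print(kmer_counts(reduced))  # each of HHP, HPN, PNP occurs once
```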


2021 ◽  
Author(s):  
Xue Wang ◽  
Shaolei Shi ◽  
Guijiang Wang ◽  
Wenxue Luo ◽  
Xia Wei ◽  
...  

Abstract Background: Machine learning (ML) has recently become attractive for genomic prediction, but its superiority in genomic prediction and the choice of optimal ML methods still need investigation. Results: In this study, 2566 Chinese Yorkshire pigs with reproduction trait records were used; they were genotyped with the GenoBaits Porcine SNP 50K and PorcineSNP50 panels. Four ML methods, namely support vector regression (SVR), kernel ridge regression (KRR), random forest (RF) and Adaboost.R2, were implemented. Their genomic prediction abilities were explored through 20 replicates of five-fold cross-validation. Our results indicated that the ML methods significantly outperformed genomic BLUP (GBLUP), single-step GBLUP (ssGBLUP) and the Bayesian method BayesHE. The prediction accuracy of ML methods was improved by 19.3%, 15.0% and 20.8% on average over GBLUP, ssGBLUP and BayesHE, ranging from 8.9–24.0%, 7.6–17.5% and 11.1–24.6%, respectively. In addition, ML methods yielded smaller mean squared error (MSE) and mean absolute error (MAE) in all scenarios. ssGBLUP yielded an improvement of 3.7% on average compared to GBLUP, and the performance of BayesHE was close to that of GBLUP. Among the four ML methods, SVR and KRR had the most robust prediction abilities, yielding higher accuracies, lower bias, lower MSE and MAE, and computing efficiency comparable to GBLUP. RF demonstrated the lowest prediction ability and computational efficiency among the ML methods. Conclusion: Our findings demonstrated that ML methods are more efficient than traditional genomic selection methods and could be new options for genomic prediction.
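
The evaluation scheme of 20 replicates of five-fold cross-validation can be sketched in plain Python. The function names are hypothetical, and the use of the Pearson correlation as the accuracy measure is an assumption, since the abstract does not define how accuracy was computed.

```python
import random


def repeated_kfold(n_samples, n_folds=5, n_replicates=20, seed=0):
    """Yield (replicate, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for rep in range(n_replicates):
        # Reshuffle the individuals for every replicate.
        order = list(range(n_samples))
        rng.shuffle(order)
        # Deal the shuffled indices into n_folds nearly equal folds.
        folds = [order[k::n_folds] for k in range(n_folds)]
        for k, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield rep, k, train_idx, test_idx


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# 2566 animals, as in the abstract: 20 x 5 = 100 train/test splits.
splits = list(repeated_kfold(2566))
print(len(splits))
```

In each split a model would be fitted on train_idx and scored on test_idx, with the scores averaged over all 100 splits.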


Author(s):  
Saheb Foroutaifar

Abstract The main objectives of this study were to compare the prediction accuracy of different Bayesian methods for traits with a wide range of genetic architectures using simulated and real data, and to assess the sensitivity of these methods to violations of their assumptions. For the simulation study, different scenarios were implemented based on two traits with low or high heritability, different numbers of QTL and different distributions of QTL effects. For real data analysis, a German Holstein dataset for milk fat percentage, milk yield, and somatic cell score was used. The simulation results showed that, with the exception of Bayes R, the methods were sensitive to changes in the number of QTL and the distribution of QTL effects. Having a distribution of QTL effects similar to what a given Bayesian method assumes for estimating marker effects did not improve its prediction accuracy. The Bayes B method gave accuracy higher than or equal to that of the other methods. The real data analysis showed that, as in the simulation scenarios with a large number of QTL, there was no difference between the accuracies of the different methods for any of the traits.


Author(s):  
Anik Das ◽  
Mohamed M. Ahmed

Accurate real-time lane-change prediction is essential for safely operating Autonomous Vehicles (AVs) on roadways, especially at the early stage of AV deployment, when AVs will interact with human-driven vehicles. This study proposed reliable lane-change prediction models considering features from vehicle kinematics, machine vision, driver, and roadway geometric characteristics using the trajectory-level SHRP2 Naturalistic Driving Study and Roadway Information Database. Several machine learning algorithms were trained, validated, tested, and comparatively analyzed based on six different sets of features: Classification And Regression Trees (CART), Random Forest (RF), eXtreme Gradient Boosting (XGBoost), Adaptive Boosting (AdaBoost), Support Vector Machine (SVM), K Nearest Neighbor (KNN), and Naïve Bayes (NB). In each feature set, relevant features were extracted through a wrapper-based algorithm named Boruta. The results showed that, when all features were considered, the XGBoost model outperformed all other models, with an overall prediction accuracy of 97% and an F1-score of 95.5%. However, the highest overall prediction accuracy (97.3%) and F1-score (95.9%) were observed in the XGBoost model based on vehicle kinematics features. Moreover, XGBoost was the only model that achieved a reliable and balanced prediction performance across all six feature sets. Furthermore, a simplified XGBoost model was developed for each feature set with practical implementation in mind. The proposed prediction model could help in trajectory planning for AVs and could be used to develop more reliable advanced driver assistance systems (ADAS) in a cooperative connected and automated vehicle environment.


Sensors ◽  
2021 ◽  
Vol 21 (7) ◽  
pp. 2503
Author(s):  
Taro Suzuki ◽  
Yoshiharu Amano

This paper proposes a method for detecting non-line-of-sight (NLOS) multipath, which causes large positioning errors in a global navigation satellite system (GNSS). We use the GNSS signal correlation output, which is the most primitive GNSS signal processing output, to detect NLOS multipath with machine learning. NLOS multipath distorts the shape of the multi-correlator outputs, so features of this shape are used to discriminate NLOS multipath. We implement two supervised learning methods, a support vector machine (SVM) and a neural network (NN), and compare their performance. In addition, we propose an automated method of collecting LOS and NLOS training data for machine learning. The evaluation of the proposed NLOS detection method in an urban environment confirmed that the NN performed better than the SVM and that 97.7% of NLOS signals were correctly discriminated.


Author(s):  
Ke Wang ◽  
Qingwen Xue ◽  
Jian John Lu

Identifying high-risk drivers before an accident happens is necessary for traffic accident control and prevention. Because driving data are class-imbalanced, high-risk samples, as the minority class, are usually poorly handled by standard classification algorithms. Instead of applying preset sampling or cost-sensitive learning, this paper proposes a novel automated machine learning framework that simultaneously and automatically searches for the optimal sampling method, cost-sensitive loss function, and probability calibration to handle the class-imbalance problem in the recognition of risky drivers. The hyperparameters that control sampling ratio and class weight, along with other hyperparameters, are optimized by Bayesian optimization. To demonstrate the performance of the proposed automated learning framework, we establish a risky driver recognition model as a case study, using video-extracted vehicle trajectory data of 2427 private cars on a German highway. Based on rear-end collision risk evaluation, only 4.29% of all drivers are labeled as risky drivers. The inputs of the recognition model are the discrete Fourier transform coefficients of the target vehicle’s longitudinal speed, lateral speed, and the gap between the target vehicle and its preceding vehicle. Among 12 sampling methods, 2 cost-sensitive loss functions, and 2 probability calibration methods, the result of automated machine learning is consistent with manual searching but is much more computationally efficient. We find that the combination of Support Vector Machine-based Synthetic Minority Oversampling TEchnique (SVMSMOTE) sampling, a cost-sensitive cross-entropy loss function, and isotonic regression can significantly improve the recognition ability and reduce the error of the predicted probability.
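
The model inputs described above are discrete Fourier transform coefficients of speed and gap series. A minimal sketch of that feature step follows; using normalized magnitudes of the first few coefficients is an assumption, since the abstract does not say how the coefficients were encoded, and the function name and sample series are hypothetical.

```python
import cmath
import math


def dft_magnitudes(signal, n_coeffs):
    """Return |X_k| / N for k = 0 .. n_coeffs - 1 of a real-valued series."""
    n = len(signal)
    features = []
    for k in range(n_coeffs):
        # Direct evaluation of the DFT sum for frequency index k.
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        features.append(abs(coeff) / n)
    return features


# Hypothetical longitudinal speed series (m/s): a constant 30 m/s plus
# one sinusoidal oscillation of amplitude 2 m/s over the window.
speed = [30.0 + 2.0 * math.sin(2 * math.pi * t / 16) for t in range(16)]
print(dft_magnitudes(speed, 3))
```

For this series the k = 0 feature recovers the mean speed, and the k = 1 feature equals half the oscillation amplitude.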


2011 ◽  
Vol 130-134 ◽  
pp. 2047-2050 ◽  
Author(s):  
Hong Chun Qu ◽  
Xie Bin Ding

Support Vector Machine (SVM) is a new artificial intelligence methodology based on the structural risk minimization principle. It generalizes better than traditional machine learning methods and shows a powerful ability to learn from limited samples. To address the lack of engine fault samples, FLS-SVM, an improved SVM, is applied. Ten common engine faults are trained and recognized in the paper. The simulated data are generated from the PW4000-94 engine influence coefficient matrix at cruise, and the results show that the diagnostic accuracy of FLS-SVM is better than that of LS-SVM.


2021 ◽  
Vol 23 (Supplement_6) ◽  
pp. vi139-vi139
Author(s):  
Jan Lost ◽  
Tej Verma ◽  
Niklas Tillmanns ◽  
W R Brim ◽  
Harry Subramanian ◽  
...  

Abstract PURPOSE: Identifying molecular subtypes in gliomas has prognostic and therapeutic value, and is traditionally done after invasive neurosurgical tumor resection or biopsy. Recent advances using artificial intelligence (AI) show promise in using pre-therapy imaging for predicting molecular subtype. We performed a systematic review of recent literature on AI methods used to predict molecular subtypes of gliomas. METHODS: A literature review conforming to PRISMA guidelines was performed for publications prior to February 2021 using 4 databases: Ovid Embase, Ovid MEDLINE, Cochrane trials (CENTRAL), and Web of Science core-collection. Keywords included: artificial intelligence, machine learning, deep learning, radiomics, magnetic resonance imaging, glioma, and glioblastoma. Non-machine learning and non-human studies were excluded. Screening was performed using Covidence software. Bias analysis was done using TRIPOD guidelines. RESULTS: 11,727 abstracts were retrieved. After applying initial screening exclusion criteria, 1,135 full text reviews were performed, with 82 papers remaining for data extraction. 57% used retrospective single center hospital data, 31.6% used TCIA and BRATS, and 11.4% analyzed multicenter hospital data. An average of 146 patients (range 34-462 patients) were included. Algorithms predicting IDH status comprised 51.8% of studies, MGMT 18.1%, and 1p19q 6.0%. Machine learning methods were used in 71.4%, deep learning in 27.4%, and 1.2% directly compared both methods. The most common algorithm for machine learning was the support vector machine (43.3%), and for deep learning the convolutional neural network (68.4%). Mean prediction accuracy was 76.6%. CONCLUSION: Machine learning is the predominant method for image-based prediction of glioma molecular subtypes. Major limitations include limited datasets (60.2% with under 150 patients) and thus limited generalizability of findings. We recommend using larger annotated datasets for AI network training and testing in order to create more robust AI algorithms, which will provide better prediction accuracy on real world clinical datasets and provide tools that can be translated to clinical practice.


2017 ◽  
Author(s):  
Manato Akiyama ◽  
Kengo Sato ◽  
Yasubumi Sakakibara

Abstract Motivation: A popular approach for predicting RNA secondary structure is the thermodynamic nearest neighbor model, which finds the thermodynamically most stable secondary structure, that is, the one with the minimum free energy (MFE). For further improvement, an alternative approach based on machine learning techniques has been developed. The machine learning based approach can employ a fine-grained model that includes much richer feature representations with the ability to fit the training data. Although a machine learning based fine-grained model achieved extremely high prediction accuracy, a risk of overfitting for such models has been reported. Results: In this paper, we propose a novel algorithm for RNA secondary structure prediction that integrates the thermodynamic approach and the machine learning based weighted approach. Our fine-grained model combines the experimentally determined thermodynamic parameters with a large number of scoring parameters for detailed contexts of features, which are trained by the structured support vector machine (SSVM) with ℓ1 regularization to avoid overfitting. Our benchmark shows that our algorithm achieves the best prediction accuracy compared with existing methods, and no heavy overfitting is observed. Availability: The implementation of our algorithm is available at https://github.com/keio-bioinformatics/mxfold. Contact: [email protected]

