Predictive ability of Random Forests, Boosting, Support Vector Machines and Genomic Best Linear Unbiased Prediction in different scenarios of genomic evaluation

2017 ◽  
Vol 57 (2) ◽  
pp. 229 ◽  
Author(s):  
Farhad Ghafouri-Kesbi ◽  
Ghodratollah Rahimi-Mianji ◽  
Mahmood Honarvar ◽  
Ardeshir Nejati-Javaremi

Three machine learning algorithms, Random Forests (RF), Boosting and Support Vector Machines (SVM), as well as Genomic Best Linear Unbiased Prediction (GBLUP), were used to predict genomic breeding values (GBV), and their predictive performance was compared across different combinations of heritability (0.1, 0.3 and 0.5), number of quantitative trait loci (QTL) (100, 1000) and distribution of QTL effects (normal, uniform and gamma). To this end, a genome comprising five chromosomes of one Morgan each was simulated, on which 10 000 bi-allelic single nucleotide polymorphisms were distributed. Pearson's correlation between the true and predicted GBV and the Mean Squared Error (MSE) of GBV prediction were used as measures of predictive accuracy and overall fit, respectively. For all methods, prediction accuracy increased with higher heritability and a smaller number of QTL. GBLUP had better predictive accuracy than the machine learning methods, in particular in scenarios with a large number of QTL and normal or uniform distributions of QTL effects, though in most cases the differences were non-significant. In scenarios with a small number of QTL and a gamma distribution of QTL effects, Boosting outperformed the other methods. Regarding the MSE of GBV prediction, Boosting outperformed the other methods in most cases, although its estimates were close to those of GBLUP. Among the methods studied, SVM was the most memory-efficient at 0.6 GB, followed by RF, GBLUP and Boosting with memory requirements of 1.2 GB, 1.3 GB and 2.3 GB, respectively. Regarding computational time, GBLUP, SVM, RF and Boosting ranked first, second, third and last with 10 min, 15 min, 75 min and 600 min, respectively. It was concluded that although stochastic gradient Boosting can predict GBV with high accuracy, its significantly longer computational time and larger memory requirement can be serious limitations for this algorithm. Therefore, other variants of Boosting, such as Random Boosting, were recommended for genomic evaluation.
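As a rough illustration of this kind of comparison (not the authors' simulation code), the sketch below simulates a small SNP panel with gamma-distributed QTL effects and compares a ridge regression (a GBLUP-like linear predictor) against a random forest, using Pearson's correlation as the accuracy measure and MSE as the fit measure; all sizes and parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Simulate a small SNP panel: 400 animals x 1000 bi-allelic markers coded 0/1/2,
# with 100 QTL whose effects follow a gamma distribution with random sign.
n, m, n_qtl = 400, 1000, 100
X = rng.integers(0, 3, size=(n, m)).astype(float)
qtl = rng.choice(m, n_qtl, replace=False)
effects = rng.gamma(0.4, 1.0, n_qtl) * rng.choice([-1.0, 1.0], n_qtl)
tbv = X[:, qtl] @ effects                                 # true breeding values
y = tbv + rng.normal(0.0, tbv.std() * np.sqrt(7 / 3), n)  # phenotypes, h2 ~ 0.3

train, test = np.arange(300), np.arange(300, n)
for name, model in [("ridge (GBLUP-like)", Ridge(alpha=float(m))),
                    ("random forest", RandomForestRegressor(n_estimators=100,
                                                            random_state=0))]:
    pred = model.fit(X[train], y[train]).predict(X[test])
    r = np.corrcoef(tbv[test], pred)[0, 1]        # predictive accuracy
    mse = mean_squared_error(tbv[test], pred)     # overall fit
    print(f"{name}: r = {r:.2f}, MSE = {mse:.2f}")
```

With truly polygenic (many-QTL) simulated data like this, the linear predictor typically matches or beats the tree ensemble, mirroring the paper's finding for the high-QTL scenarios.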

Entropy ◽  
2021 ◽  
Vol 23 (4) ◽  
pp. 429
Author(s):  
Jose Emmanuel Chacón ◽  
Oldemar Rodríguez

This paper presents new approaches to fitting regression models for symbolic interval-valued variables, which are shown to improve and extend the center method suggested by Billard and Diday and the center and range method proposed by Lima-Neto, E.A. and De Carvalho, F.A.T. Like the previously mentioned methods, the proposed regression models use the midpoints and half-lengths of the intervals as additional variables. We considered various methods to fit the regression models, including tree-based models, K-nearest neighbors, support vector machines, and neural networks. The approaches proposed in this paper were applied to a real dataset and to synthetic datasets generated with linear and nonlinear relations. The root mean squared error and the correlation coefficient were used to evaluate the methods. The methods presented herein are available in the RSDA package written in the R language, which can be installed from CRAN.
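A minimal sketch of the center-and-range idea on synthetic interval data (not the RSDA implementation): fit one model to the interval midpoints and another to the half-lengths, then reassemble the predicted interval bounds. All names and data-generating choices below are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic interval-valued data: each observation is an interval described by
# its midpoint and half-length.
n = 200
x_mid = rng.uniform(0, 10, n)
x_rng = rng.uniform(0.5, 2.0, n)
y_mid = 3.0 * x_mid + rng.normal(0, 1, n)
y_rng = 0.8 * x_rng + rng.gamma(2.0, 0.1, n)

# Center-and-range idea: one model for midpoints, another for half-lengths,
# then reassemble the predicted interval bounds.
center_model = LinearRegression().fit(x_mid.reshape(-1, 1), y_mid)
range_model = LinearRegression().fit(x_rng.reshape(-1, 1), y_rng)

pred_mid = center_model.predict(x_mid.reshape(-1, 1))
pred_rng = np.maximum(range_model.predict(x_rng.reshape(-1, 1)), 0)  # keep non-negative
pred_lo, pred_hi = pred_mid - pred_rng, pred_mid + pred_rng

rmse_lo = np.sqrt(np.mean((pred_lo - (y_mid - y_rng)) ** 2))
rmse_hi = np.sqrt(np.mean((pred_hi - (y_mid + y_rng)) ** 2))
print(f"RMSE of lower bounds = {rmse_lo:.2f}, upper bounds = {rmse_hi:.2f}")
```

Clamping the predicted half-lengths at zero guarantees that every reassembled interval has its lower bound at or below its upper bound, a constraint the linear model alone does not enforce.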


Author(s):  
Ahmed Hassan Mohammed Hassan ◽  
Arfan Ali Mohammed Qasem ◽  
Walaa Faisal Mohammed Abdalla ◽  
Omer H. Elhassan

The cumulative incidence of COVID-19 is increasing rapidly day by day. After the spread of the Corona epidemic and the death of more than a million people around the world, scientists and researchers have turned to research and modern technologies such as machine learning to help rid the world of the Coronavirus (COVID-19) epidemic. Machine Learning (ML) can be deployed very effectively to track and predict the disease, and ML techniques have been applied in areas that require identifying dangerous negative factors and defining their priorities. The purpose of the proposed system is to predict the number of people infected with COVID-19 using ML. Four standard models were used for COVID-19 prediction: Neural Network (NN), Support Vector Machines (SVM), Bayesian Network (BN) and Polynomial Regression (PR). The data used to test these models comprise the numbers of deaths, newly infected cases, and recoveries over the next 20 days. Five measures were used to evaluate the performance of each model, namely root mean squared error (RMSE), mean squared error (MSE), mean absolute error (MAE), explained variance score and R-squared score (R2). The proposed system offers a promising mechanism for applying these models to the current scenario of the COVID-19 epidemic. The results showed that NN outperformed the other models, while SVM performed poorly across all predictions on the available dataset. Our results indicate that infections will increase slightly in the coming days, and the low death rate gives rise to hope. For the future, case documentation and data consolidation must be kept up consistently.
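As a hedged illustration of the PR model and the evaluation metrics named above (the study's dataset is not reproduced here), the sketch below fits a polynomial regression to hypothetical cumulative case counts and reports RMSE, MSE, MAE and R2 on a 20-day hold-out.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

rng = np.random.default_rng(2)

# Hypothetical cumulative case counts over 60 days (illustrative only).
days = np.arange(60)
cases = 50 * np.exp(0.05 * days) + rng.normal(0, 20, 60)

# Polynomial Regression (PR): fit a degree-3 polynomial on the first 40 days
# and predict the next 20 days.
coeffs = np.polyfit(days[:40], cases[:40], deg=3)
pred = np.polyval(coeffs, days[40:])

mse = mean_squared_error(cases[40:], pred)
rmse = float(np.sqrt(mse))
mae = mean_absolute_error(cases[40:], pred)
r2 = r2_score(cases[40:], pred)
print(f"RMSE={rmse:.1f}  MSE={mse:.1f}  MAE={mae:.1f}  R2={r2:.2f}")
```

Reporting all of RMSE, MSE and MAE together, as the paper does, is useful because MSE penalises large forecast misses more heavily than MAE.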


CONVERTER ◽  
2021 ◽  
pp. 108-121
Author(s):  
Huijin Han, et al.

Temperature prediction is significant for precise control of the greenhouse environment. Traditional machine learning methods usually rely on large amounts of data, so it is difficult to make stable and accurate predictions from a small dataset. This paper proposes a temperature prediction method for greenhouses. With the prediction target transformed to the logarithmic difference between the temperatures inside and outside the greenhouse, the method first uses the XGBoost algorithm to make a preliminary prediction. Second, a linear model is used to predict the residuals of the prediction target. The predicted temperature is obtained by combining the preliminary prediction with the residuals. Based on 20 days of greenhouse data, the results show that the target transformation applied in our method is better than the others presented in the paper. The MSE (Mean Squared Error) of our method is 0.0844, which is respectively 20.7%, 76.0%, 10.2%, and 95.3% of the MSE of LR (Logistic Regression), SGD (Stochastic Gradient Descent), SVM (Support Vector Machines), and the XGBoost algorithm. The results indicate that our method significantly improves prediction accuracy on small-scale data.
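The two-stage scheme can be sketched as follows on synthetic data, with scikit-learn's GradientBoostingRegressor standing in for XGBoost; the feature set, data generator, and 20-day sample size are illustrative assumptions, not the paper's data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)

# Hypothetical 20 days of hourly readings: outside temperature and two
# illustrative control features (e.g. vent opening, irrigation).
n = 480
t_out = 15 + 8 * np.sin(np.linspace(0, 40 * np.pi, n)) + rng.normal(0, 1, n)
X = np.column_stack([t_out, rng.uniform(0, 1, n), rng.uniform(0, 1, n)])
t_in = t_out + 5 + 2 * X[:, 1] + rng.normal(0, 0.5, n)  # inside is warmer

# Target transformation: log of the inside/outside temperature difference.
y = np.log(t_in - t_out)

train, test = slice(0, 400), slice(400, n)

# Stage 1: boosted trees make the preliminary prediction
# (GradientBoostingRegressor stands in for XGBoost here).
gbr = GradientBoostingRegressor(random_state=0).fit(X[train], y[train])
resid = y[train] - gbr.predict(X[train])

# Stage 2: a linear model predicts the stage-1 residuals.
lin = LinearRegression().fit(X[train], resid)

# Combine the two stages and invert the target transformation.
pred_y = gbr.predict(X[test]) + lin.predict(X[test])
pred_t_in = t_out[test] + np.exp(pred_y)
mse = mean_squared_error(t_in[test], pred_t_in)
print(f"test MSE = {mse:.4f}")
```

Predicting the log-difference rather than the raw inside temperature lets the model focus on the greenhouse effect itself instead of tracking the outside temperature's diurnal swing.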


2012 ◽  
Vol 52 (3) ◽  
pp. 115 ◽  
Author(s):  
D. Boichard ◽  
F. Guillaume ◽  
A. Baur ◽  
P. Croiseau ◽  
M. N. Rossignol ◽  
...  

Genomic selection is implemented in the French Holstein, Montbéliarde, and Normande breeds (70%, 16% and 12% of French dairy cows). A characteristic of the model for genomic evaluation is the use of haplotypes instead of single-nucleotide polymorphisms (SNPs), so as to maximise linkage disequilibrium between markers and quantitative trait loci (QTLs). For each trait, a QTL-BLUP model (i.e. a best linear unbiased prediction model including QTL random effects) includes 300–700 trait-dependent chromosomal regions selected either by linkage disequilibrium and linkage analysis or by elastic net. This model requires considerable effort to phase genotypes, detect QTLs and select SNPs, but was found to be the most efficient among all those tested. QTLs are defined within breed, and many of them were found to be breed specific. Reference populations include 1800 and 1400 bulls in the Montbéliarde and Normande breeds, respectively. In Holstein, the very large reference population of 18 300 bulls originates from the EuroGenomics consortium. Since 2008, ~65 000 animals have been genotyped for selection by Labogena with the 50k chip. Bulls' genomic estimated breeding values (GEBVs) were made official in June 2009. In 2010, the market share of young bulls reached 30% and is expected to increase rapidly. Advertising campaigns have been undertaken to recommend a time-restricted use of young bulls with a limited number of doses. In January 2011, genomic selection was opened to all farmers for females. Current developments focus on extending the method to a multi-breed context, so as to use all reference populations simultaneously in genomic evaluation.


2020 ◽  
pp. 009385482096975
Author(s):  
Mehdi Ghasemi ◽  
Daniel Anvari ◽  
Mahshid Atapour ◽  
J. Stephen Wormith ◽  
Keira C. Stockdale ◽  
...  

The Level of Service/Case Management Inventory (LS/CMI) is one of the most frequently used tools to assess criminogenic risk–need in justice-involved individuals. Meta-analytic research demonstrates strong predictive accuracy for various recidivism outcomes. In this exploratory study, we applied machine learning (ML) algorithms (decision trees, random forests, and support vector machines) to a data set with nearly 100,000 LS/CMI administrations to provincial corrections clientele in Ontario, Canada, and approximately 3 years follow-up. The overall accuracies and areas under the receiver operating characteristic curve (AUCs) were comparable, although ML outperformed LS/CMI in terms of predictive accuracy for the middle scores where it is hardest to predict the recidivism outcome. Moreover, ML improved the AUCs for individual scores to near 0.60, from 0.50 for the LS/CMI, indicating that ML also improves the ability to rank individuals according to their probability of recidivating. Potential considerations, applications, and future directions are discussed.
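A minimal sketch of the kind of AUC comparison described, on synthetic risk-assessment data (the LS/CMI records are not public): rank individuals by a simple total score versus a random forest's predicted probability, and compare areas under the ROC curve. All names and parameters are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)

# Hypothetical assessment data: 10 item scores (0-3) and a binary outcome
# whose odds depend on the total score plus two item-level effects.
n, k = 2000, 10
items = rng.integers(0, 4, size=(n, k)).astype(float)
total = items.sum(axis=1)
logit = 0.3 * (total - total.mean()) + 0.5 * (items[:, 0] - items[:, 1])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

train, test = np.arange(1500), np.arange(1500, n)

# Tool-style ranking: the raw total score alone.
auc_total = roc_auc_score(y[test], total[test])
# ML ranking: a random forest trained on the individual item scores.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(items[train], y[train])
auc_rf = roc_auc_score(y[test], rf.predict_proba(items[test])[:, 1])
print(f"AUC, total score: {auc_total:.2f}; AUC, random forest: {auc_rf:.2f}")
```

When item-level interactions matter (as the 0.5-weighted item contrast here simulates), the model trained on individual items can rank cases that share the same total score, which the total score alone cannot do.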


2020 ◽  
Vol 2020 ◽  
pp. 1-12 ◽  
Author(s):  
Hye-Jin Kim ◽  
Sung Min Park ◽  
Byung Jin Choi ◽  
Seung-Hyun Moon ◽  
Yong-Hyuk Kim

We propose three quality control (QC) techniques using machine learning that depend on the type of input data used for training: QC based on the time series of a single weather element, QC based on that time series in conjunction with other weather elements, and QC using spatiotemporal characteristics. We performed machine-learning-based QC on each weather element of atmospheric data, such as temperature, acquired from seven types of IoT sensors, applying algorithms such as support vector regression to data containing errors to obtain meaningful estimates from them. We evaluated the performance of the proposed techniques using the root mean squared error (RMSE). As a result, QC performed in conjunction with other weather elements had, on average, 0.14% lower RMSE than QC conducted with only a single weather element. For QC considering spatiotemporal characteristics, training with AWS data yielded 17% lower RMSE than QC done with only raw data.
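A hedged sketch of the first two QC settings on synthetic sensor series (not the authors' data or pipeline): support vector regression estimates a temperature reading first from its own recent history alone, then with a second weather element added to the inputs, and the two RMSEs are compared.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(5)

# Synthetic sensor series: temperature plus a correlated second element (humidity).
n = 600
t = np.linspace(0, 60, n)
temp = 20 + 5 * np.sin(0.5 * t) + rng.normal(0, 0.3, n)
humid = 60 - 2 * temp + rng.normal(0, 1, n)

# Single-element QC: estimate each reading from its own three previous values.
X_single = np.column_stack([temp[:-3], temp[1:-2], temp[2:-1]])
# Multi-element QC: add the latest humidity reading to the inputs.
X_multi = np.column_stack([X_single, humid[2:-1]])
y = temp[3:]

train, test = slice(0, 450), slice(450, None)
for name, X in [("single element", X_single), ("with second element", X_multi)]:
    pred = SVR().fit(X[train], y[train]).predict(X[test])
    rmse = float(np.sqrt(mean_squared_error(y[test], pred)))
    print(f"{name}: RMSE = {rmse:.3f}")
```

The estimate produced this way can then be compared against the raw sensor reading; a large discrepancy flags the reading as a candidate error, which is the QC step itself.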


2014 ◽  
Vol 23 (04) ◽  
pp. 1460016
Author(s):  
Ioannis Rexakis ◽  
Michail G. Lagoudakis

Several recent learning approaches in decision making under uncertainty suggest the use of classifiers for representing policies compactly. The space of possible policies, even under such structured representations, is huge and must be searched carefully to avoid computationally expensive policy simulations (rollouts). In our recent work, we proposed a method for directed exploration of policy space using support vector classifiers, whereby rollouts are directed to states around the boundaries between different action choices indicated by the separating hyperplanes in the represented policies. While effective, this method suffers from the growing number of support vectors in the underlying classifiers as the number of training examples increases. In this paper, we propose an alternative method for directed policy search based on relevance vector machines. Relevance vector machines are used both for classification (to represent a policy) and regression (to approximate the corresponding relative action advantage function). Classification is enhanced by anomaly detection for accurate policy representation. Exploiting the internal structure of the regressor, we guide the probing of the state space only to critical areas corresponding to changes of action dominance in the underlying policy. This directed focus on critical parts of the state space iteratively leads to refinement and improvement of the underlying policy and delivers excellent control policies in only a few iterations, while the small number of relevance vectors yields significant computational time savings. We demonstrate the proposed approach and compare it with our previous method on standard reinforcement learning domains (inverted pendulum and mountain car).
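The directed-exploration idea can be sketched as follows (relevance vector machines are not in scikit-learn, so a support vector classifier stands in, as in the authors' earlier method): train a classifier on action labels, then probe only candidate states lying closest to the separating surface. The toy policy and state space below are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)

# Toy 2-D state space with a binary action policy split along a curved boundary.
states = rng.uniform(-1, 1, size=(300, 2))
actions = (states[:, 0] ** 2 + states[:, 1] > 0).astype(int)

# Represent the policy compactly with a kernel classifier.
clf = SVC(kernel="rbf").fit(states, actions)

# Directed exploration: score candidate states by distance to the separating
# surface and probe (roll out) only the most ambiguous ones.
candidates = rng.uniform(-1, 1, size=(1000, 2))
margin = np.abs(clf.decision_function(candidates))
probe = candidates[np.argsort(margin)[:50]]  # 50 states nearest the boundary
print(f"probing {len(probe)} of {len(candidates)} candidate states")
```

Concentrating expensive rollouts near the decision boundary, where the represented policy is least certain, is what cuts the simulation cost relative to uniform probing of the state space.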


2018 ◽  
Vol 2018 ◽  
pp. 1-12 ◽  
Author(s):  
Yang Sun ◽  
Xianda Feng ◽  
Lingqiang Yang

Tunnel squeezing is one of the major geological disasters that often occur during the construction of tunnels in weak rock masses subjected to high in situ stresses. It could cause shield jamming, budget overruns, and construction delays and could even lead to tunnel instability and casualties. Therefore, accurate prediction or identification of tunnel squeezing is extremely important in the design and construction of tunnels. This study presents a modified application of a multiclass support vector machine (SVM) to predict tunnel squeezing based on four parameters, that is, diameter (D), buried depth (H), support stiffness (K), and rock tunneling quality index (Q). We compiled a database from the literature, including 117 case histories obtained from different countries such as India, Nepal, and Bhutan, to train the multiclass SVM model. The proposed model was validated using 8-fold cross validation, and the average error percentage was approximately 11.87%. Compared with existing approaches, the proposed multiclass SVM model yields a better performance in predictive accuracy. More importantly, one could estimate the severity of potential squeezing problems based on the predicted squeezing categories/classes.
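As a rough sketch of the setup (the 117 case histories are not reproduced here), the code below trains a multiclass SVM on hypothetical (D, H, K, Q) records with an invented labelling rule and scores it with 8-fold cross-validation; every numeric choice is an assumption.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)

# Hypothetical case histories: diameter D (m), depth H (m), support stiffness K,
# and rock tunneling quality index Q, with an invented three-class labelling rule.
n = 240
D = rng.uniform(2, 12, n)
H = rng.uniform(50, 2000, n)
K = rng.uniform(10, 5000, n)
Q = rng.lognormal(0.0, 1.0, n)

# Squeezing worsens with depth and poor rock quality (illustrative rule only).
score = H / 1000 - np.log(Q) - K / 5000
y = np.digitize(score, [0.0, 1.5])  # 0 = none, 1 = minor, 2 = severe

X = np.column_stack([D, H, K, Q])
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10))
acc = cross_val_score(model, X, y, cv=8).mean()
print(f"8-fold CV accuracy: {acc:.2f} (error {100 * (1 - acc):.1f}%)")
```

Standardising the features before the SVM matters here because H and K span far larger numeric ranges than D and Q and would otherwise dominate the RBF kernel distances.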


2016 ◽  
Vol 24 (3) ◽  
pp. 385-409 ◽  
Author(s):  
Fernando E. B. Otero ◽  
Alex A. Freitas

Most ant colony optimization (ACO) algorithms for inducing classification rules use an ACO-based procedure to create a rule in a one-at-a-time fashion. An improved search strategy has been proposed in the cAnt-Miner[Formula: see text] algorithm, where an ACO-based procedure is used to create a complete list of rules (ordered rules); i.e., the ACO search is guided by the quality of a list of rules instead of an individual rule. In this paper we propose an extension of the cAnt-Miner[Formula: see text] algorithm to discover a set of rules (unordered rules). The main motivations for this work are to improve the interpretability of individual rules by discovering a set of rules and to evaluate the impact on the predictive accuracy of the algorithm. We also propose a new measure of the interpretability of the discovered rules, to mitigate the fact that the commonly used model size measure ignores how the rules are used to make a class prediction. Comparisons with state-of-the-art rule induction algorithms, support vector machines, and the cAnt-Miner[Formula: see text] algorithm producing ordered rules are also presented.
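The proposed interpretability measure is not specified in detail in this abstract, but the underlying point, that model size ignores how many rules are actually consulted per prediction, can be sketched with a toy ordered rule list; the rules and the "rules inspected" count below are purely illustrative.

```python
# Each rule is (condition, predicted class); an ordered list tries rules in turn.
rules = [
    (lambda x: x["petal_len"] < 2.0, "setosa"),
    (lambda x: x["petal_len"] < 5.0, "versicolor"),
    (lambda x: True, "virginica"),  # default rule
]

def predict_and_count(x):
    """Return (predicted class, number of rules inspected before one fired)."""
    for inspected, (cond, label) in enumerate(rules, start=1):
        if cond(x):
            return label, inspected

examples = [{"petal_len": 1.4}, {"petal_len": 4.5}, {"petal_len": 6.0}]
preds = [predict_and_count(x) for x in examples]
avg_inspected = sum(count for _, count in preds) / len(preds)
# Model size is 3 rules, but the average prediction consults only 2 of them.
print(preds, f"avg rules inspected = {avg_inspected:.2f}")
```

In an unordered rule set, by contrast, every rule covering an example may contribute to the prediction, so each rule can be read in isolation, which is the interpretability gain the paper targets.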

