An evaluation of machine-learning for predicting phenotype: studies in yeast, rice, and wheat

2019 ◽  
Vol 109 (2) ◽  
pp. 251-277 ◽  
Author(s):  
Nastasiya F. Grinberg ◽  
Oghenejokpeme I. Orhobor ◽  
Ross D. King

Abstract In phenotype prediction the physical characteristics of an organism are predicted from knowledge of its genotype and environment. Such studies, often called genome-wide association studies, are of the highest societal importance, as they are of central importance to medicine, crop-breeding, etc. We investigated three phenotype prediction problems: one simple and clean (yeast), and the other two complex and real-world (rice and wheat). We compared standard machine learning methods (elastic net, ridge regression, lasso regression, random forest, gradient boosting machines (GBM), and support vector machines (SVM)) with two state-of-the-art classical statistical genetics methods: genomic BLUP and a two-step sequential method based on linear regression. Additionally, using the clean yeast data, we investigated how performance varied with the complexity of the biological mechanism, the amount of observational noise, the number of examples, the amount of missing data, and the use of different data representations. We found that for almost all the phenotypes considered, standard machine learning methods outperformed the methods from classical statistical genetics. On the yeast problem, the most successful method was GBM, followed by lasso regression, and then the two statistical genetics methods; with greater mechanistic complexity GBM was best, while in simpler cases lasso was superior. In the wheat and rice studies the best two methods were SVM and BLUP. The most robust method in the presence of noise, missing data, etc. was random forests. The classical statistical genetics method of genomic BLUP was found to perform well on problems where there was population structure. This suggests that standard machine learning methods need to be refined to include population structure information when this is present. We conclude that the application of machine learning methods to phenotype prediction problems holds great promise, but that determining which method is likely to perform well on any given problem is elusive and non-trivial.
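A comparison in the spirit of the one described above can be assembled from off-the-shelf tools. The sketch below is not the authors' pipeline: it assumes a placeholder genotype matrix X (samples × markers, coded as SNP dosages) and a continuous phenotype y, and simply cross-validates the standard machine learning regressors named in the abstract; the genomic BLUP and two-step statistical genetics baselines are omitted.

```python
# Minimal sketch (not the authors' pipeline): cross-validating the standard
# machine learning regressors named in the abstract on a synthetic genotype
# matrix. X (samples x markers, coded 0/1/2) and y are placeholders; real
# data, hyperparameter tuning, and the BLUP baselines are omitted.
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_samples, n_markers = 200, 1000
X = rng.integers(0, 3, size=(n_samples, n_markers)).astype(float)  # SNP dosages
y = X[:, :10] @ rng.normal(size=10) + rng.normal(scale=0.5, size=n_samples)

models = {
    "lasso": Lasso(alpha=0.1),
    "ridge": Ridge(alpha=1.0),
    "elastic_net": ElasticNet(alpha=0.1, l1_ratio=0.5),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "gbm": GradientBoostingRegressor(random_state=0),
    "svm": SVR(kernel="rbf", C=1.0),
}

for name, model in models.items():
    # R^2 under 5-fold cross-validation; the paper's own evaluation may differ.
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name:>14}: mean R^2 = {scores.mean():.3f}")
```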


2020 ◽  
Author(s):  
Abdur Rahman M. A. Basher ◽  
Steven J. Hallam

Abstract Machine learning methods show great promise in predicting metabolic pathways at different levels of biological organization. However, several complications remain that can degrade prediction performance, including inadequately labeled training data, missing feature information, and inherent imbalances in the distribution of enzymes and pathways within a dataset. This class imbalance problem is commonly encountered by the machine learning community when the proportions of instances across class labels within a dataset are uneven, resulting in poor predictive performance for underrepresented classes. Here, we present leADS, multi-label learning based on active dataset subsampling, which leverages the idea of subsampling points from a pool of data to reduce the negative impact of training loss due to class imbalance. Specifically, leADS performs an iterative process to: (i) construct an acquisition model in an ensemble framework; (ii) select informative points using an appropriate acquisition function; and (iii) train on the selected samples. Multiple base learners are implemented in parallel, where each is assigned a portion of the labeled training data to learn pathways. We benchmark leADS using a corpus of 10 experimental datasets manifesting diverse multi-label properties used in previous pathway prediction studies, including manually curated organismal genomes, synthetic microbial communities, and low-complexity microbial communities. Resulting performance metrics equaled or exceeded previously reported machine learning methods for both organismal and multi-organismal genomes while establishing an extensible framework for navigating class imbalances across diverse real-world datasets. Availability and implementation: The software package and installation instructions are published on github.com/[email protected]
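As an illustration only (not the leADS implementation), the sketch below mimics the three-step loop from the abstract on synthetic multi-label data: fit an acquisition model, score the remaining pool with a simple prediction-entropy acquisition function, and grow the training subsample with the most informative points. leADS itself uses an ensemble of base learners and richer acquisition functions.

```python
# Illustrative sketch only (not the leADS implementation): an iterative
# "train -> score informativeness -> subsample" loop for imbalanced
# multi-label data, using plain prediction entropy as the acquisition function.
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, Y = make_multilabel_classification(n_samples=1000, n_classes=10,
                                       n_labels=2, random_state=0)
pool = np.arange(len(X))
selected = np.random.default_rng(0).choice(pool, size=100, replace=False)

for round_ in range(3):
    # (i) fit an acquisition model on the currently selected subsample
    model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    model.fit(X[selected], Y[selected])
    # (ii) score the remaining pool by summed per-label prediction entropy
    rest = np.setdiff1d(pool, selected)
    P = np.clip(model.predict_proba(X[rest]), 1e-9, 1 - 1e-9)
    entropy = -(P * np.log(P) + (1 - P) * np.log(1 - P)).sum(axis=1)
    # (iii) add the most informative points and retrain in the next round
    selected = np.concatenate([selected, rest[np.argsort(entropy)[-100:]]])
    print(f"round {round_}: training set size = {len(selected)}")
```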


Crystals ◽  
2021 ◽  
Vol 11 (2) ◽  
pp. 210
Author(s):  
Kangkang Duan ◽  
Shuangyin Cao ◽  
Jinbao Li ◽  
Chongfa Xu

Machine learning techniques have become a popular solution to prediction problems. These approaches show excellent performance without being explicitly programmed. In this paper, 448 sets of data were collected to predict the neutralization depth of concrete bridges in China. Random forest was used for parameter selection. In addition, four machine learning methods, including support vector machine (SVM), k-nearest neighbor (KNN), and XGBoost, were adopted to develop models. The results show that machine learning models obtain a high accuracy (>80%) and an acceptable macro recall rate (>80%) even with only four parameters. For SVM models, the radial basis function performs better than other kernel functions. The radial basis kernel SVM method has the highest verification accuracy (91%) and the highest macro recall rate (86%). Finally, the relative suitability of the different methods is also discussed in this study.
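The following hedged sketch shows the kind of model described above, an RBF-kernel SVM trained on four input parameters and scored with accuracy and macro recall. It is not the paper's code: the 448 samples, feature meanings, and class labels here are synthetic placeholders.

```python
# Hedged sketch (not the paper's code or data): an RBF-kernel SVM classifier
# on four placeholder input parameters, evaluated with accuracy and macro recall.
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
n = 448  # same sample count as the study; the values themselves are synthetic
X = rng.normal(size=(n, 4))                     # four hypothetical parameters
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # placeholder class labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("accuracy    :", accuracy_score(y_te, pred))
print("macro recall:", recall_score(y_te, pred, average="macro"))
```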


2021 ◽  
Author(s):  
Li-Da Wu ◽  
Feng Li ◽  
Jia-Yi Chen ◽  
Jie Zhang ◽  
Ling-Ling Qian ◽  
...  

Abstract Objective: We aimed to screen out biomarkers for atrial fibrillation (AF) based on machine learning methods and to evaluate the degree of immune infiltration in AF patients in detail. Methods: Two datasets (GSE41177 and GSE79768) related to AF in the GEO database were included. Differentially expressed genes (DEGs) were screened out using the “limma” package. Candidate biomarkers for AF were identified using the machine learning methods of LASSO regression and the SVM-RFE algorithm. Receiver operating characteristic (ROC) curves were employed to assess the diagnostic effectiveness of the biomarkers, which was further validated in the GSE14795 dataset. Moreover, we used CIBERSORT to study the proportion of infiltrating immune cells in each sample, and the Spearman method was used to explore the correlation between biomarkers and immune cells. Results: 129 DEGs were identified, and CYBB, CXCR2, and S100A4 were identified as key biomarkers of AF using the LASSO regression and SVM-RFE algorithms; their diagnostic value was further validated in GSE14795. Immune infiltration analysis indicated that, compared with sinus rhythm (SR), the atrial samples of patients with AF contained higher proportions of gamma delta T cells, neutrophils, and resting mast cells, whereas follicular helper T cells were relatively lower. Correlation analysis demonstrated that CYBB, CXCR2, and S100A4 were significantly correlated with the infiltrating immune cells. Conclusions: This study suggests that CYBB, CXCR2, and S100A4 are key biomarkers correlated with infiltrating immune cells in AF, and that infiltrating immune cells play pivotal roles in AF.
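For illustration only (not the study's analysis), the sketch below combines an L1-penalised logistic regression (LASSO-style selection) with SVM-RFE to nominate candidate genes shared by both methods, then checks a cross-validated ROC AUC; the expression matrix and AF/SR labels are synthetic placeholders rather than GSE41177/GSE79768 data.

```python
# Illustrative sketch only (not the study's analysis): LASSO-style selection
# plus SVM-RFE on a synthetic expression matrix, keeping the intersection of
# the two gene sets and scoring it with a cross-validated ROC AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))       # 60 samples x 500 genes (synthetic)
y = rng.integers(0, 2, size=60)      # AF vs sinus rhythm labels (synthetic)

lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
lasso_genes = set(np.flatnonzero(lasso.coef_[0]))

rfe = RFE(SVC(kernel="linear"), n_features_to_select=10, step=0.2).fit(X, y)
rfe_genes = set(np.flatnonzero(rfe.support_))

candidates = sorted(lasso_genes & rfe_genes)  # genes picked by both methods
print("shared candidate gene indices:", candidates)

if candidates:
    auc = cross_val_score(SVC(kernel="linear"), X[:, candidates], y,
                          cv=5, scoring="roc_auc")
    print("cross-validated ROC AUC:", auc.mean())
```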


2010 ◽  
Vol 50 (2) ◽  
pp. 105-115 ◽  
Author(s):  
José M. Jerez ◽  
Ignacio Molina ◽  
Pedro J. García-Laencina ◽  
Emilio Alba ◽  
Nuria Ribelles ◽  
...  

2020 ◽  
Vol 12 (2) ◽  
pp. 134-143
Author(s):  
Gilberto De Melo Junior ◽  
Symone G. Soares Alcalá ◽  
Geovanne Pereira Furriel ◽  
Sílvio L. Vieira

Machine learning (ML) has become an emerging technology capable of solving problems in many areas, including education, medicine, robotics, and aerospace. ML is a specific field of artificial intelligence that designs computational models capable of learning from data. However, to develop an ML model, it is necessary to ensure data quality, since real-world data are incomplete, noisy, and inconsistent. This article evaluates advanced methods for handling missing data using ML algorithms to classify the performance of high-school students at the Instituto Federal de Goiânia, in Brazil. The objective is to provide an efficient computational tool to support educational performance, allowing educators to identify a student's tendency to fail. The results indicate that the ignore-and-discard method outperforms the other missing-data handling methods. In addition, the tests reveal that Sequential Minimal Optimization, Neural Networks, and Bagging outperform the other ML algorithms, such as Naive Bayes and Decision Tree, in terms of classification accuracy.
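A minimal sketch of the comparison described above, not the article's experiment: it contrasts the ignore-and-discard strategy (dropping incomplete rows) with mean imputation before training one of the named classifiers (a decision tree) on synthetic data with roughly 10% of entries knocked out.

```python
# Minimal sketch (not the article's experiment): ignore-and-discard versus
# mean imputation as missing-data strategies, compared on synthetic data.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan      # knock out ~10% of the entries

# Strategy 1: discard rows containing any missing value
keep = ~np.isnan(X).any(axis=1)
acc_discard = cross_val_score(DecisionTreeClassifier(random_state=0),
                              X[keep], y[keep], cv=5).mean()

# Strategy 2: mean-impute the missing entries
X_imp = SimpleImputer(strategy="mean").fit_transform(X)
acc_impute = cross_val_score(DecisionTreeClassifier(random_state=0),
                             X_imp, y, cv=5).mean()

print("discard rows:", acc_discard)
print("mean impute :", acc_impute)
```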

