A novel genomic prediction method combining randomized Haseman-Elston regression with a modified algorithm for Proven and Young for large genomic data

2021 ◽  
Author(s):  
Hailan Liu ◽  
Guo-Bo Chen
Genetics ◽  
2021 ◽  
Author(s):  
Marco Lopez-Cruz ◽  
Gustavo de los Campos

Abstract Genomic prediction uses DNA sequences and phenotypes to predict genetic values. In homogeneous populations, theory indicates that the accuracy of genomic prediction increases with sample size. However, differences in allele frequencies and in linkage disequilibrium patterns can lead to heterogeneity in SNP effects. In this context, calibrating genomic predictions using a large, potentially heterogeneous, training data set may not lead to optimal prediction accuracy. Some studies tried to address this sample size/homogeneity trade-off using training set optimization algorithms; however, this approach assumes that a single training data set is optimum for all individuals in the prediction set. Here, we propose an approach that identifies, for each individual in the prediction set, a subset from the training data (i.e., a set of support points) from which predictions are derived. The methodology that we propose is a Sparse Selection Index (SSI) that integrates Selection Index methodology with sparsity-inducing techniques commonly used for high-dimensional regression. The sparsity of the resulting index is controlled by a regularization parameter (λ); the G-BLUP (the prediction method most commonly used in plant and animal breeding) appears as a special case which happens when λ = 0. In this study, we present the methodology and demonstrate (using two wheat data sets with phenotypes collected in ten different environments) that the SSI can achieve significant (anywhere between 5-10%) gains in prediction accuracy relative to the G-BLUP.
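The sparse index described above can be illustrated with a minimal sketch. It assumes, and the abstract does not state this, that the weights for one prediction individual minimize 0.5·b'Pb − g'b + λ·Σ|b|, where P is the training genomic relationship matrix plus a variance ratio on the diagonal and g holds the genomic relationships between the training individuals and the prediction individual. The function names, the simulated genotypes, and the variance ratio of 1.0 are all hypothetical; this is not the authors' implementation.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink z toward zero by lam (the L1 proximal step)."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def sparse_index_weights(P, g, lam, n_iter=1000, tol=1e-12):
    """Coordinate descent for: min_b 0.5*b'Pb - g'b + lam*sum(|b|).

    P must be symmetric positive definite. With lam = 0 the solution is
    solve(P, g), i.e. the dense (G-BLUP-like) weights under this sketch's
    assumptions.
    """
    b = np.zeros(len(g))
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(len(b)):
            # Gradient part for coordinate j with b[j]'s own term removed.
            partial = g[j] - P[j] @ b + P[j, j] * b[j]
            new_bj = soft_threshold(partial, lam) / P[j, j]
            max_change = max(max_change, abs(new_bj - b[j]))
            b[j] = new_bj
        if max_change < tol:
            break
    return b

# Simulated genotypes (assumption: standard-normal marker codes, toy sizes).
rng = np.random.default_rng(0)
n_trn, n_tst, n_snp = 25, 5, 200
X = rng.standard_normal((n_trn + n_tst, n_snp))
G = X @ X.T / n_snp                      # genomic relationship matrix
ratio = 1.0                              # assumed sigma_e^2 / sigma_u^2
P = G[:n_trn, :n_trn] + ratio * np.eye(n_trn)
g = G[:n_trn, n_trn]                     # training vs. first prediction individual

b_dense = sparse_index_weights(P, g, lam=0.0)    # no sparsity
b_sparse = sparse_index_weights(P, g, lam=0.05)  # sparse support set
n_support = int(np.count_nonzero(b_sparse))      # number of support points
```

Under these assumptions, the training individuals with non-zero entries in `b_sparse` play the role of the support points for that prediction individual, and raising `lam` shrinks that set.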


PLoS Genetics ◽  
2021 ◽  
Vol 17 (12) ◽  
pp. e1009944
Author(s):  
Torsten Pook ◽  
Adnane Nemri ◽  
Eric Gerardo Gonzalez Segovia ◽  
Daniel Valle Torres ◽  
Henner Simianer ◽  
...  

High-throughput genotyping of large numbers of lines remains a key challenge in plant genetics. When resources are limited, geneticists and breeders must balance data quality against the number of genotyped lines across the variety of existing genotyping technologies. In this work, we propose a new imputation pipeline (“HBimpute”) that can be used to generate high-quality genomic data from low read-depth whole-genome-sequence data. The key idea of the pipeline is to use haplotype blocks from the software HaploBlocker to identify locally similar lines and then to use the reads of all locally similar lines in the variant calling for a specific line. The effectiveness of the pipeline is showcased on a dataset of 321 doubled haploid lines of a European maize landrace, which were sequenced at 0.5X read-depth. The overall imputation error rates are cut in half compared to state-of-the-art software like BEAGLE and STITCH, while the average read-depth is increased to 83X, thus enabling the calling of copy number variation. The usefulness of the resulting imputed data panel is further evaluated by comparing the performance of sequence data in common breeding applications to that of genomic data generated with a genotyping array. For both genome-wide association studies and genomic prediction, results are on par with or even slightly better than results obtained with high-density array data (600k). For genomic prediction in particular, we observe slightly higher data quality for the sequence data compared to the 600k array, in the form of higher prediction accuracies. This occurred specifically when reducing the data panel to the set of markers shared by the sequence and array data, indicating that sequencing data can benefit from the same marker ascertainment as used in the array process to increase the quality and usability of genomic data.
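The read-pooling idea can be illustrated with a toy sketch. It assumes that all sites shown fall inside one genomic region, that each line has already been assigned to a haplotype block for that region, and that lines are fully homozygous, so a call is either reference or alternative. The function name, the count matrices, and the majority rule are hypothetical and do not reproduce the HBimpute or HaploBlocker implementations.

```python
import numpy as np

def pooled_calls(ref_counts, alt_counts, block_of_line):
    """Call each site per line from reads pooled over lines in the same block.

    ref_counts, alt_counts: (n_lines, n_sites) read counts per allele.
    block_of_line: (n_lines,) block label of each line in this region.
    Returns (n_lines, n_sites) calls: 0 = reference, 1 = alternative,
    -1 = no call (no reads, or a tie in the pooled counts).
    """
    ref_counts = np.asarray(ref_counts)
    alt_counts = np.asarray(alt_counts)
    block_of_line = np.asarray(block_of_line)
    calls = np.full(ref_counts.shape, -1)
    for block in np.unique(block_of_line):
        members = block_of_line == block
        # Pool the sparse reads of all locally similar lines.
        ref_pool = ref_counts[members].sum(axis=0)
        alt_pool = alt_counts[members].sum(axis=0)
        call = np.where(alt_pool > ref_pool, 1,
                        np.where(ref_pool > alt_pool, 0, -1))
        calls[members] = call  # every line in the block gets the pooled call
    return calls

# Four lines, three sites, very low depth (most cells have zero reads).
ref = [[1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 0]]
alt = [[0, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 2]]
blocks = [0, 0, 1, 1]
calls = pooled_calls(ref, alt, blocks)
```

In this toy example no single line has reads at every site, but each pair of lines sharing a block covers all three sites between them, so every line receives a call at every site.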


2021 ◽  
Author(s):  
Giovanni Galli ◽  
Felipe Sabadin ◽  
Rafael Massahiro Yassue ◽  
Cassia Galves de Souza ◽  
Humberto Fanelli Carvalho ◽  
...  

Abstract Machine learning methods such as multilayer perceptrons (MLP) and convolutional neural networks (CNN) have emerged as promising methods for genomic prediction (GP). Here, we assess the performance of MLP and CNN on regression and classification tasks in a case study with maize hybrids. The genomic information was provided to the MLP as a relationship matrix and to the CNN as “genomic images”. In the regression task, the machine learning models were compared with GBLUP. In the classification task, MLP and CNN were compared with each other. For classification, the traits (plant height and grain yield) were discretized to create balanced (moderate selection intensity) and unbalanced (extreme selection intensity) datasets for further evaluation. An automatic hyperparameter search for MLP and CNN was performed, and the best models were reported. For both task types, several metrics were calculated under a validation scheme to assess the effect of the prediction method and other variables. Overall, MLP and CNN were competitive with GBLUP but offered little improvement when using only the additive genomic layer, which is expected because the average effect of allele substitution is mostly linear. Nevertheless, the methodology’s potential for GP is unprecedented because we can create “multispectral genome images” that include other effects and layers of data, such as dominance, epistasis, g × e, and transcriptome, capturing linear and non-linear effects and boosting prediction accuracies. Hence, we bring new insights on automated machine learning for genomic prediction and its implications for plant breeding.


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Seongmun Jeong ◽  
Jae-Yoon Kim ◽  
Namshin Kim

Abstract The increased accessibility of genomic data in recent years has laid the foundation for studies that predict various phenotypes of organisms based on the genome. Genomic prediction collectively refers to these studies, and it estimates an individual’s phenotypes mainly using single nucleotide polymorphism markers. Typically, the accuracy of genomic prediction depends heavily on the markers used; in practice, however, choosing the markers that give high accuracy for the phenotype of interest is a challenging task. Therefore, we present a new tool called GMStool for selecting optimal marker sets and predicting quantitative phenotypes. The GMStool is based on a genome-wide association study (GWAS) and heuristically searches for optimal markers using statistical and machine-learning methods. The GMStool performs the genomic prediction using statistical and machine/deep-learning models and presents the best prediction model with the optimal marker set. For the evaluation, the GMStool was tested on real datasets with four phenotypes. The prediction results showed higher performance than using all markers or the GWAS-top markers, both of which have been used frequently in prediction studies. Although the GMStool has several limitations, it is expected to contribute to various studies for predicting quantitative phenotypes. The GMStool, written in R, is available at www.github.com/JaeYoonKim72/GMStool.


2019 ◽  
Author(s):  
Owen Powell ◽  
Raphael Mrode ◽  
R. Chris Gaynor ◽  
Martin Johnsson ◽  
Gregor Gorjanc ◽  
...  

Abstract
Background: Genetic evaluation is a central component of a breeding program. In advanced economies, most genetic evaluations depend on large quantities of data that are recorded on commercial farms. Large herd sizes and widespread use of artificial insemination create strong genetic connectedness that enables the genetic and environmental effects of an individual animal’s phenotype to be accurately separated. In contrast to this, herds are neither large nor have strong genetic connectedness in smallholder dairy production systems of many low to middle-income countries (LMIC). This limits genetic evaluation, and furthermore, the pedigree information needed for traditional genetic evaluation is typically unavailable. Genomic information keeps track of shared haplotypes rather than shared relatives. This information could capture and strengthen genetic connectedness between herds and through this may enable genetic evaluations for LMIC smallholder dairy farms. The objective of this study was to use simulation to quantify the power of genomic information to enable genetic evaluation under such conditions.
Results: The results from this study show: (i) the genetic evaluation of phenotyped cows using genomic information had higher accuracy compared to pedigree information across all breeding designs; (ii) the genetic evaluation of phenotyped cows with genomic information and modelling herd as a random effect had higher or equal accuracy compared to modelling herd as a fixed effect; (iii) the genetic evaluation of phenotyped cows from breeding designs with strong genetic connectedness had higher accuracy compared to breeding designs with weaker genetic connectedness; (iv) genomic prediction of young bulls was possible using marker estimates from the genetic evaluations of their phenotyped dams. For example, the accuracy of genomic prediction of young bulls from an average herd size of 1 (μ=1.58) was 0.40 under a breeding design with 1,000 sires mated per generation and a training set of 8,000 phenotyped and genotyped cows.
Conclusions: This study demonstrates the potential of genomic information to be an enabling technology in LMIC smallholder dairy production systems by facilitating genetic evaluations with in-situ records collected from farms with herd sizes of four cows or less. Across a range of breeding designs, genomic data enabled accurate genetic evaluation of phenotyped cows and genomic prediction of young bulls using data sets that contained small herds with weak genetic connections. The use of smallholder dairy data in genetic evaluations would enable the establishment of breeding programs to improve in-situ germplasm and, if required, would enable the importation of the most suitable external germplasm. This could be individually tailored for each target environment. Together this would increase the productivity, profitability and sustainability of LMIC smallholder dairy production systems. However, data collection, including genomic data, is expensive and business models will need to be carefully constructed so that the costs are sustainably offset.


2018 ◽  
pp. 214-223
Author(s):  
AM Faria ◽  
MM Pimenta ◽  
JY Saab Jr. ◽  
S Rodriguez

Wind energy expansion faces various limitations worldwide, e.g. land availability, the NIMBY (not in my backyard) attitude, interference with bird migration routes, and so on. Over the years, this undeniable expansion has pushed wind farms closer to populated areas, where noise regulation is more stringent. This demands solutions from the wind turbine (WT) industry to produce quieter WT units. Airfoil noise prediction can support the assessment and design of quieter wind turbine blades. Considering airfoil noise as a composition of many sound sources, and given that the main noise production mechanisms are airfoil self-noise and turbulent inflow (TI) noise, this work concentrates on the latter. TI noise is classified as an interaction noise, produced by turbulent inflow incident on the airfoil leading edge (LE). Theoretical and semi-empirical methods for TI noise prediction are already available, based on Amiet’s broadband noise theory. The literature review in this work analyzes many TI noise prediction methods, as well as turbulence energy spectrum modeling. This is followed by a qualitative and quantitative comparison of the most reliable TI noise methodologies, with error estimation relative to the Ffowcs Williams-Hawkings solution for computational aeroacoustics. The final goal of this work is to provide a basis for integrating airfoil inflow noise prediction into a wind turbine noise prediction code.

