Environment-Specific Genomic Prediction Ability in Maize Using Environmental Covariates Depends on Environmental Similarity To Training Data

Author(s):  
Anna R Rogers ◽  
James B Holland

Abstract Technology advances have made possible the collection of a wealth of genomic, environmental, and phenotypic data for use in plant breeding. Incorporation of environmental data into environment-specific genomic prediction (GP) is hindered in part because of inherently high data dimensionality. Computationally efficient approaches to combining genomic and environmental information may facilitate extension of GP models to new environments and germplasm, and better understanding of genotype-by-environment (G × E) interactions. Using genomic, yield trial, and environmental data on 1,918 unique hybrids evaluated in 59 environments from the maize Genomes to Fields project, we determined that a set of 10,153 SNP dominance coefficients and a 5-day temporal window size for summarizing environmental variables were optimal for GP using only genetic and environmental main effects. Adding marker-by-environment variable interactions required dimension reduction, and we found that reducing dimensionality of the genetic data while keeping the full set of environmental covariates was best for environment-specific GP of grain yield, leading to an increase in prediction ability of 2.7% to achieve a prediction ability of 80% across environments when data were masked at random. We then measured how prediction ability within environments was affected under stratified training-testing sets to approximate scenarios commonly encountered by plant breeders, finding that incorporation of marker-by-environment effects improved prediction ability in cases where training and test sets shared environments, but did not improve prediction in new untested environments. The environmental similarity between training and testing sets had a greater impact on the efficacy of prediction than genetic similarity between training and test sets.
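The 5-day temporal windows mentioned above can be illustrated with a small sketch of how daily weather records might be condensed into window-level environmental covariates. The function name and data here are illustrative, not part of the study's pipeline:

```python
import numpy as np

def window_means(daily, window=5):
    """Summarize a daily environmental series into non-overlapping
    window means (e.g., 5-day windows), one row per environment."""
    n_env, n_days = daily.shape
    n_win = n_days // window
    trimmed = daily[:, :n_win * window]        # drop trailing partial window
    return trimmed.reshape(n_env, n_win, window).mean(axis=2)

# Two environments, 10 days of daily mean temperature (deg C)
temp = np.array([[20, 22, 21, 19, 23, 25, 24, 26, 27, 23],
                 [15, 16, 14, 15, 15, 17, 18, 16, 17, 17]], dtype=float)
cov = window_means(temp, window=5)
print(cov)  # each environment reduced to two 5-day means
```

The same reduction would be applied to each weather variable, and the resulting window means concatenated into the environmental covariate matrix.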

Author(s):  
Sebastian Michel ◽  
Franziska Löschenberger ◽  
Christian Ametz ◽  
Hermann Bürstmayr

Abstract Key message Genomic relationship matrices based on mid-parent and family bulk genotypes represent cost-efficient alternatives to full genomic prediction approaches with individually genotyped early generation selection candidates. Abstract The routine use of genomic selection for improving line varieties has gained increasing popularity in recent years. Harnessing the benefits of this approach can, however, be too costly for many small-scale breeding programs, as in most genomic breeding strategies several hundred or even thousands of lines have to be genotyped each year. The aim of this study was thus to compare a full genomic prediction strategy using individually genotyped selection candidates with genomic predictions based on genotypes obtained from pooled DNA of progeny families, as well as genotypes inferred from the crossing parents. For this purpose, a population of 722 wheat lines representing 63 families, tested in more than 100 multi-environment trials during 2010–2019, was used in an empirical study, which was supplemented by a simulation with genotypic data from a further 3,855 lines. A similar or higher prediction ability was achieved for grain yield, protein yield, and protein content when using mid-parent or family bulk genotypes in comparison with pedigree selection in the empirical across-family prediction scenario. The difference between these methods and a full genomic prediction strategy furthermore became marginal when pre-existing phenotypic data of the selection candidates were already available. Similar observations were made in the simulation, where individually genotyped lines or family bulks were generally preferable at smaller family sizes. The proposed methods can thus be regarded as alternatives to full genomic or pedigree selection strategies, especially when pedigree information is limited, as in the exchange of germplasm between breeding programs.
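As a rough illustration of the mid-parent idea, the expected marker dosage of a progeny family can be taken as the average of its two parents' dosages, so only the parents need to be genotyped. This is a minimal sketch of the concept, not the authors' implementation (the study additionally considers family bulk genotypes from pooled DNA):

```python
import numpy as np

def mid_parent_genotypes(parent_a, parent_b):
    """Expected progeny-family genotype: the mean of the two parents'
    marker dosages coded as 0/1/2 minor-allele counts."""
    return (parent_a + parent_b) / 2.0

p1 = np.array([0, 2, 2, 1], dtype=float)
p2 = np.array([2, 2, 0, 1], dtype=float)
mp = mid_parent_genotypes(p1, p2)
print(mp)  # [1. 2. 1. 1.]
```

A genomic relationship matrix can then be built from these mid-parent rows in place of individually genotyped selection candidates.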


2021 ◽  
Vol 245 ◽  
pp. 104421
Author(s):  
Rosiane P. Silva ◽  
Rafael Espigolan ◽  
Mariana P. Berton ◽  
Raysildo B. Lôbo ◽  
Cláudio U. Magnabosco ◽  
...  

Genetics ◽  
2021 ◽  
Author(s):  
Marco Lopez-Cruz ◽  
Gustavo de los Campos

Abstract Genomic prediction uses DNA sequences and phenotypes to predict genetic values. In homogeneous populations, theory indicates that the accuracy of genomic prediction increases with sample size. However, differences in allele frequencies and in linkage disequilibrium patterns can lead to heterogeneity in SNP effects. In this context, calibrating genomic predictions using a large, potentially heterogeneous, training data set may not lead to optimal prediction accuracy. Some studies have tried to address this sample size/homogeneity trade-off using training set optimization algorithms; however, this approach assumes that a single training data set is optimal for all individuals in the prediction set. Here, we propose an approach that identifies, for each individual in the prediction set, a subset of the training data (i.e., a set of support points) from which predictions are derived. The methodology that we propose is a Sparse Selection Index (SSI) that integrates selection index methodology with sparsity-inducing techniques commonly used for high-dimensional regression. The sparsity of the resulting index is controlled by a regularization parameter (λ); G-BLUP (the prediction method most commonly used in plant and animal breeding) appears as the special case that arises when λ = 0. In this study, we present the methodology and demonstrate (using two wheat data sets with phenotypes collected in ten different environments) that the SSI can achieve significant gains in prediction accuracy, between 5 and 10%, relative to G-BLUP.
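A minimal sketch of the index idea, under an assumed toy genomic relationship matrix: each test individual's weights solve an L1-penalized quadratic in the training relationships, and with λ = 0 the coordinate-descent solution reduces to the dense b = G⁻¹g weights, while λ > 0 drops weakly related training records. This is an illustration of the concept, not the authors' implementation:

```python
import numpy as np

def ssi_weights(G_trn, g_i, lam, n_sweeps=200):
    """Coordinate descent for one test individual's index weights:
    minimize 0.5 * b'Gb - b'g + lam * ||b||_1 via soft-thresholding.
    lam = 0 recovers the dense solution b = G^{-1} g (G-BLUP-like)."""
    b = np.zeros(len(g_i))
    for _ in range(n_sweeps):
        for j in range(len(b)):
            # partial residual for coordinate j, then soft-threshold
            r = g_i[j] - G_trn[j] @ b + G_trn[j, j] * b[j]
            b[j] = np.sign(r) * max(abs(r) - lam, 0.0) / G_trn[j, j]
    return b

# Toy relationships among 4 training individuals and one test individual
G = np.array([[1.0, 0.5, 0.1, 0.0],
              [0.5, 1.0, 0.1, 0.0],
              [0.1, 0.1, 1.0, 0.2],
              [0.0, 0.0, 0.2, 1.0]])
g = np.array([0.6, 0.5, 0.05, 0.0])
dense  = ssi_weights(G, g, lam=0.0)   # every training record contributes
sparse = ssi_weights(G, g, lam=0.1)   # weakly related records are dropped
print(np.count_nonzero(dense), np.count_nonzero(sparse))
```

Predictions for the test individual would then be the weighted sum of training phenotypes, b'y, with λ tuned by cross-validation.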


Entropy ◽  
2021 ◽  
Vol 23 (1) ◽  
pp. 126
Author(s):  
Sharu Theresa Jose ◽  
Osvaldo Simeone

Meta-learning, or “learning to learn”, refers to techniques that infer an inductive bias from data corresponding to multiple related tasks with the goal of improving the sample efficiency for new, previously unobserved, tasks. A key performance measure for meta-learning is the meta-generalization gap, that is, the difference between the average loss measured on the meta-training data and on a new, randomly selected task. This paper presents novel information-theoretic upper bounds on the meta-generalization gap. Two broad classes of meta-learning algorithms are considered that use either separate within-task training and test sets, like model agnostic meta-learning (MAML), or joint within-task training and test sets, like Reptile. Extending the existing work for conventional learning, an upper bound on the meta-generalization gap is derived for the former class that depends on the mutual information (MI) between the output of the meta-learning algorithm and its input meta-training data. For the latter, the derived bound includes an additional MI between the output of the per-task learning procedure and the corresponding data set, to capture within-task uncertainty. Tighter bounds are then developed for the two classes via novel individual task MI (ITMI) bounds. Applications of the derived bounds are finally discussed, including a broad class of noisy iterative algorithms for meta-learning.
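For context, the conventional-learning result that these bounds extend is the mutual information bound of Xu and Raginsky: if the loss is σ-subgaussian, the expected generalization gap of an algorithm with output W trained on an n-sample data set S satisfies

\[
\bigl|\mathbb{E}[\operatorname{gen}(W,S)]\bigr| \;\le\; \sqrt{\frac{2\sigma^{2}}{n}\, I(W;S)}.
\]

The meta-level bounds summarized above play the same role one level up, with W replaced by the output of the meta-learning algorithm (the inferred inductive bias) and S by the meta-training data.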


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Osval Antonio Montesinos-López ◽  
Abelardo Montesinos-López ◽  
Paulino Pérez-Rodríguez ◽  
José Alberto Barrón-López ◽  
Johannes W. R. Martini ◽  
...  

Abstract Background Several conventional genomic Bayesian (or non-Bayesian) prediction methods have been proposed, including the standard additive genetic effect model, for which the variance components are estimated with mixed model equations. In recent years, deep learning (DL) methods have been considered in the context of genomic prediction. DL methods are nonparametric models that offer the flexibility to adapt to complicated associations between data and output and the ability to capture very complex patterns. Main body We review the applications of DL methods in genomic selection (GS) to obtain a meta-picture of GS performance and highlight how these tools can help solve challenging plant breeding problems. We also provide general guidance for the effective use of DL methods, including the fundamentals of DL and the requirements for its appropriate use. We discuss the pros and cons of this technique compared to traditional genomic prediction approaches, as well as current trends in DL applications. Conclusions The main requirement for using DL is a sufficiently large, high-quality training data set. Based on the current literature on GS in plant and animal breeding, we did not find clear superiority of DL in terms of prediction power compared to conventional genome-based prediction models. Nevertheless, there is clear evidence that DL algorithms capture nonlinear patterns more efficiently than conventional genome-based models. DL algorithms are also able to integrate data from different sources, as is usually needed in GS-assisted breeding, and show the ability to improve prediction accuracy for large plant breeding data sets. It is important to apply DL to large training-testing data sets.
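As a deliberately minimal example of the kind of DL model reviewed here, the sketch below trains a one-hidden-layer network on simulated marker data with plain NumPy. The architecture, sizes, and data are illustrative only; practical GS applications use dedicated DL frameworks and real genotype matrices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: 100 lines x 50 SNPs (0/1/2 dosages), additive signal + noise
X = rng.integers(0, 3, size=(100, 50)).astype(float)
beta = rng.normal(0, 0.3, size=50)
y = X @ beta + rng.normal(0, 0.5, size=100)
X = (X - X.mean(0)) / (X.std(0) + 1e-8)       # standardize markers

# One-hidden-layer MLP, full-batch gradient descent on mean squared error
W1 = rng.normal(0, 0.1, size=(50, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.1, size=16);       b2 = 0.0

def mse():
    h = np.maximum(X @ W1 + b1, 0.0)
    return np.mean((h @ W2 + b2 - y) ** 2)

init_mse = mse()
lr = 0.01
for _ in range(500):
    h = np.maximum(X @ W1 + b1, 0.0)          # ReLU hidden layer
    err = (h @ W2 + b2) - y
    gW2 = h.T @ err / len(y); gb2 = err.mean()
    dh = np.outer(err, W2) * (h > 0)          # backprop through ReLU
    gW1 = X.T @ dh / len(y); gb1 = dh.mean(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

final_mse = mse()
print(final_mse < init_mse)  # training reduces the fit error
```

The nonlinearity in the hidden layer is what lets such models pick up the non-additive patterns discussed above; whether that translates into better prediction than linear genome-based models is exactly the question the review examines.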


2021 ◽  
Vol 11 (22) ◽  
pp. 10771
Author(s):  
Giacomo Segala ◽  
Roberto Doriguzzi-Corin ◽  
Claudio Peroni ◽  
Tommaso Gazzini ◽  
Domenico Siracusa

COVID-19 has underlined the importance of monitoring indoor air quality (IAQ) to guarantee safe conditions in enclosed environments. Due to its strict correlation with human presence, carbon dioxide (CO2) represents one of the pollutants that most affects environmental health. Therefore, forecasting future indoor CO2 plays a central role in taking preventive measures to keep CO2 levels as low as possible. Unlike other research that aims to maximize prediction accuracy, typically using data collected over many days, in this work we propose a practical approach for predicting indoor CO2 using a limited window of recent environmental data (i.e., temperature, humidity, and CO2 of, e.g., a room, office, or shop) for training neural network models, without the need for any kind of model pre-training. After just a week of data collection, the error of predictions was around 15 parts per million (ppm), which should enable the system to regulate heating, ventilation and air conditioning (HVAC) systems accurately. After a month of data, the error was reduced to about 10 ppm, thereby achieving a high prediction accuracy in a short time from the beginning of the data collection. Once the desired mobile window size is reached, the model can be continuously updated by sliding the window over time, in order to guarantee long-term performance.
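The sliding-window setup described above can be sketched as follows. For brevity, this illustration fits a closed-form ridge regression on the lagged (temperature, humidity, CO2) features instead of the neural network models used in the paper, and all series are simulated:

```python
import numpy as np

def make_windows(series, n_lags):
    """Turn a multivariate series (T x F) into supervised pairs:
    X = the last n_lags observations flattened, y = the next CO2 value."""
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(series[t - n_lags:t].ravel())
        y.append(series[t, 2])               # column 2 holds CO2
    return np.array(X), np.array(y)

# Simulated room readings: temperature (deg C), humidity (%), CO2 (ppm)
rng = np.random.default_rng(1)
T = 200
temp = 21 + 0.01 * rng.normal(0, 1, T).cumsum()
hum  = 40 + 0.02 * rng.normal(0, 1, T).cumsum()
co2  = 450 + np.cumsum(rng.normal(0.5, 2.0, T))
data = np.column_stack([temp, hum, co2])

X, y = make_windows(data, n_lags=6)
# Closed-form ridge regression as a stand-in for the paper's neural model
Xb = np.hstack([X, np.ones((len(X), 1))])
w = np.linalg.solve(Xb.T @ Xb + np.eye(Xb.shape[1]), Xb.T @ y)
rmse = np.sqrt(np.mean((Xb @ w - y) ** 2))
print(rmse)   # in-sample RMSE in ppm
```

Sliding the window forward and refitting on the most recent data, as the paper proposes, keeps the model adapted to current room conditions without any pre-training.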


Author(s):  
Aye Nyein Mon ◽  
Win Pa Pa ◽  
Ye Kyaw Thu

This paper introduces a speech corpus developed for Myanmar Automatic Speech Recognition (ASR) research. ASR research has been conducted by researchers around the world to improve their language technologies. Speech corpora are important in developing ASR systems, and the creation of such corpora is necessary especially for low-resourced languages. The Myanmar language can be regarded as a low-resourced language because of the lack of pre-existing resources for speech processing research. In this work, a speech corpus named UCSY-SC1 (University of Computer Studies Yangon - Speech Corpus1) was created for Myanmar ASR research. The corpus covers two domains: news and daily conversations. Its total size is over 42 hours, comprising 25 hours of web news and 17 hours of recorded conversational data. The news data were collected from 177 female and 84 male speakers, and the conversational data from 42 female and 4 male speakers. This corpus was used as training data for developing Myanmar ASR systems. Three types of acoustic models were built and their results compared: Gaussian Mixture Model–Hidden Markov Model (GMM-HMM), Deep Neural Network (DNN), and Convolutional Neural Network (CNN) models. Experiments were conducted on different data sizes, and evaluation was done with two test sets: TestSet1 (web news) and TestSet2 (recorded conversational data). The Myanmar ASR systems built on this corpus gave satisfactory results on both test sets, with word error rates of 15.61% on TestSet1 and 24.43% on TestSet2.
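Word error rate, the evaluation metric reported above, is the word-level Levenshtein distance (substitutions + insertions + deletions) normalized by the reference length. A self-contained sketch:

```python
def wer(reference, hypothesis):
    """Word error rate via Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```

A corpus-level WER such as the 15.61% reported for TestSet1 is computed the same way, with errors and reference lengths summed over all utterances.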


Author(s):  
Y. A. Lumban-Gaol ◽  
K. A. Ohori ◽  
R. Y. Peters

Abstract. Satellite-Derived Bathymetry (SDB) has been used in many applications related to coastal management. SDB can efficiently fill data gaps left by traditional echo-sounding measurements. However, it still requires a large amount of training data, which is not available in many areas. Furthermore, accuracy problems still arise because a linear model cannot capture the non-linear relationship between reflectance and depth caused by bottom variations and noise. Convolutional Neural Networks (CNN) offer the ability to capture both the connection between neighbouring pixels and the non-linear relationship, characteristics that make them compelling for shallow-water depth extraction. We investigate the accuracy of different architectures using different window sizes and band combinations. We use Sentinel-2 Level 2A images to provide reflectance values, and Lidar and Multi Beam Echo Sounder (MBES) datasets are used as depth references to train and test the model. A set of Sentinel-2 and in-situ depth subimage pairs is extracted to perform CNN training. The model is compared to the linear transform and applied to two other study areas. Resulting accuracy ranges from 1.3 m to 1.94 m, and the coefficient of determination reaches 0.94. The SDB model generated using a window size of 9x9 indicates compatibility with the reference depths, especially in areas deeper than 15 m. The addition of both short wave infrared bands to the four visible bands in training improves the overall accuracy of SDB. The implementation of the pre-trained model in other study areas provides similar results, depending on the water conditions.
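Extracting the window-sized subimage pairs used for CNN training can be sketched as below. Array layout and point format are illustrative; in the study, each patch of Sentinel-2 reflectance is paired with a Lidar/MBES depth at its centre pixel:

```python
import numpy as np

def extract_patches(image, points, window=9):
    """Cut window x window subimages centred on in-situ depth samples.
    image: (H, W, bands); points: (row, col) centre pixels.
    Points too close to the border are skipped."""
    half = window // 2
    patches, kept = [], []
    for r, c in points:
        if half <= r < image.shape[0] - half and half <= c < image.shape[1] - half:
            patches.append(image[r - half:r + half + 1, c - half:c + half + 1])
            kept.append((r, c))
    return np.array(patches), kept

# Toy 6-band raster (4 visible + 2 SWIR bands) and three sample points
img = np.arange(100 * 100 * 6, dtype=float).reshape(100, 100, 6)
patches, kept = extract_patches(img, [(50, 50), (2, 2), (90, 10)], window=9)
print(patches.shape)  # (2, 9, 9, 6): the border point (2, 2) was dropped
```

Varying `window` and slicing the band axis is how the different window sizes and band combinations compared in the study would be generated.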


2020 ◽  
Vol 10 (8) ◽  
pp. 2629-2639
Author(s):  
Edna K. Mageto ◽  
Jose Crossa ◽  
Paulino Pérez-Rodríguez ◽  
Thanda Dhliwayo ◽  
Natalia Palacios-Rojas ◽  
...  

Zinc (Zn) deficiency is a major risk factor for human health, affecting about 30% of the world’s population. To study the potential of genomic selection (GS) for maize with increased Zn concentration, an association panel and two doubled haploid (DH) populations were evaluated in three environments. Three genomic prediction models (M1: Environment + Line; M2: Environment + Line + Genomic; M3: Environment + Line + Genomic + Genomic × Environment), incorporating main effects (line and genomic) and the genomic × environment (G × E) interaction, were assessed to estimate the prediction ability (rMP) of each model. Two distinct cross-validation (CV) schemes simulating two genomic prediction breeding scenarios were used. CV1 predicts the performance of newly developed lines, whereas CV2 predicts the performance of lines tested in sparse multi-location trials. Predictions for Zn in CV1 ranged from -0.01 to 0.56 for DH1, 0.04 to 0.50 for DH2, and -0.001 to 0.47 for the association panel. For CV2, rMP values ranged from 0.67 to 0.71 for DH1, 0.40 to 0.56 for DH2, and 0.64 to 0.72 for the association panel. The genomic prediction model which included G × E had the highest average rMP for both CV1 (0.39 and 0.44) and CV2 (0.71 and 0.51) for the association panel and DH2 population, respectively. These results suggest that GS has potential to accelerate breeding for enhanced kernel Zn concentration by facilitating selection of superior genotypes.

