Preprocessing Tools for Data Preparation

Author(s):  
Osval Antonio Montesinos López ◽  
Abelardo Montesinos López ◽  
Jose Crossa

Abstract This data preparation chapter is of paramount importance for implementing statistical machine learning methods for genomic selection. We present the basic linear mixed model and explain how to decide when to treat effects as fixed, which gives rise to best linear unbiased estimates (BLUEs), or as random, which gives rise to best linear unbiased predictors (BLUPs). R code for fitting linear mixed models to the data is given in small examples. We emphasize tools for computing BLUEs and BLUPs for many linear combinations of interest in genomic-enabled prediction and plant breeding. We present tools for cleaning and imputing marker data, computing minor and major allele frequencies, recoding markers, computing the frequency of heterozygous genotypes and of missing values (NAs), and three methods for computing the genomic relationship matrix. In addition, scaling and data compression of inputs are important in statistical machine learning. For a more extensive description of linear mixed models, see the chapter with DOI 10.1007/978-3-030-89010-0_5.
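The marker summaries named in the abstract (allele frequencies, heterozygote frequency, NA frequency, imputation) can be illustrated with a minimal Python sketch. It assumes markers are coded 0/1/2 as copies of one counted allele, with `None` for a missing call. This coding, the function names `marker_summary` and `impute_mean`, and the toy data are assumptions for illustration only; the chapter itself provides R code, which is not reproduced here.

```python
from typing import Optional, Sequence


def marker_summary(calls: Sequence[Optional[int]]) -> dict:
    """Summarise one marker coded 0/1/2 (copies of the counted allele); None = missing."""
    observed = [g for g in calls if g is not None]
    n_obs = len(observed)
    na_freq = 1 - n_obs / len(calls)
    if n_obs == 0:
        return {"na_freq": na_freq, "p": None, "maf": None, "het_freq": None}
    p = sum(observed) / (2 * n_obs)  # frequency of the counted allele
    return {
        "na_freq": na_freq,                 # share of missing calls
        "p": p,
        "maf": min(p, 1 - p),               # minor allele frequency
        "het_freq": sum(1 for g in observed if g == 1) / n_obs,
    }


def impute_mean(calls: Sequence[Optional[int]]) -> list:
    """Replace missing calls with the marker mean (one simple imputation choice).

    Assumes the marker has at least one observed call.
    """
    observed = [g for g in calls if g is not None]
    mean = sum(observed) / len(observed)
    return [mean if g is None else g for g in calls]


# Hypothetical toy marker scored on eight individuals.
marker = [0, 1, 2, None, 1, 0, 0, 2]
summary = marker_summary(marker)
imputed = impute_mean(marker)
```

In practice, summaries like these are typically used to filter markers (for example, dropping those with a high NA frequency or a very low minor allele frequency) before a relationship matrix is built.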

Author(s):  
Boby Mathew ◽  
Jens Léon ◽  
Said Dadshani ◽  
Klaus Pillen ◽  
Mikko J Sillanpää ◽  
...  

Abstract Advanced Backcross (AB) populations have been widely used to identify and utilize beneficial alleles in various crops such as rice, tomato, wheat and barley. For the development of an AB population, a controlled crossing scheme is used, and this controlled crossing, along with the selection (both natural and artificial) of agronomically adapted alleles during the development of the AB population, may lead to unbalanced allele frequencies in the population. However, it is commonly believed that interval mapping of traits in experimental crosses such as AB populations is immune to deviations from the frequencies expected under Mendelian segregation. Using two AB populations and simulated data sets as examples, we describe the severity of the problem caused by unbalanced allele frequencies in quantitative trait loci (QTL) mapping and demonstrate how it can be corrected using a linear mixed model with a polygenic effect whose covariance structure (genomic relationship matrix) is calculated from molecular markers.


2020 ◽  
Author(s):  
Zhaotong Lin ◽  
Souvik Seal ◽  
Saonli Basu

Abstract SNP heritability of a trait is measured by the proportion of total variance explained by the additive effects of genome-wide single nucleotide polymorphisms (SNPs). Linear mixed models are routinely used to estimate SNP heritability for many complex traits. The basic concept behind this approach is to model the genetic contribution as a random effect, where the variance of this genetic contribution determines the heritability of the trait. This linear mixed model approach requires estimation of ‘relatedness’ among individuals in the sample, which is usually captured by estimating a genetic relationship matrix (GRM). Heritability is estimated by the restricted maximum likelihood (REML) or method of moments (MOM) approaches, and this estimation relies heavily on the GRM computed from the genetic data on individuals. The presence of population substructure in the data could significantly impact the GRM estimation and may introduce bias in heritability estimation. The common practice of accounting for such population substructure is to adjust for the top few principal components of the GRM as covariates in the linear mixed model. Here we propose an alternative way of estimating heritability in multi-ethnic studies. Our proposed approach is a MOM estimator derived from the Haseman-Elston regression and gives an asymptotically unbiased estimate of heritability in the presence of population stratification. It introduces adjustments for the population stratification in a second-order estimating equation and allows the total phenotypic variance to vary by ethnicity. We study the performance of different MOM and REML approaches in the presence of population stratification through extensive simulation studies. We estimate the heritability of height, weight and other anthropometric traits in the UK Biobank cohort to investigate the impact of subtle population substructure on SNP heritability estimation.
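For orientation, the plain Haseman-Elston cross-product regression that the proposed estimator starts from can be sketched in a few lines of Python. This is the textbook unadjusted form, not the authors' stratification-adjusted estimator: products of standardised phenotypes for each pair of individuals are regressed, without intercept, on the corresponding off-diagonal GRM entries. The function name `he_regression` and the toy inputs are hypothetical.

```python
def he_regression(y, grm):
    """Haseman-Elston cross-product estimate of SNP heritability.

    y   : list of phenotypes, one per individual
    grm : n x n genetic relationship matrix as a list of lists
    """
    n = len(y)
    mean = sum(y) / n
    sd = (sum((v - mean) ** 2 for v in y) / (n - 1)) ** 0.5
    z = [(v - mean) / sd for v in y]  # standardised phenotype

    num = 0.0
    den = 0.0
    for i in range(n):
        for j in range(i + 1, n):  # each unordered pair once, diagonal excluded
            num += grm[i][j] * z[i] * z[j]
            den += grm[i][j] ** 2
    if den == 0:
        raise ValueError("all off-diagonal GRM entries are zero")
    return num / den  # no-intercept slope of z_i * z_j on A_ij


# Hypothetical three-individual toy input; real analyses use thousands of
# individuals, and a toy estimate need not fall inside [0, 1].
toy_y = [1.0, 0.0, -1.0]
toy_grm = [
    [1.0, 0.1, -0.2],
    [0.1, 1.0, 0.1],
    [-0.2, 0.1, 1.0],
]
h2_hat = he_regression(toy_y, toy_grm)
```

Because the estimate is a ratio of sums over GRM entries, any distortion of the GRM caused by population substructure feeds directly into it, which is the sensitivity the abstract describes.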


2021 ◽  
Vol 8 ◽  
Author(s):  
Gonzalo Bravo ◽  
Nicolas Moity ◽  
Edgardo Londoño-Cruz ◽  
Frank Muller-Karger ◽  
Gregorio Bigatti ◽  
...  

Standardized methods for effectively and rapidly monitoring changes in the biodiversity of marine ecosystems are critical to assess status and trends in ways that are comparable between locations and over time. In intertidal and subtidal habitats, estimates of fractional cover and abundance of organisms are typically obtained with traditional quadrat-based methods, and collection of photoquadrat imagery is a standard practice. However, visual analysis of quadrats, either in the field or from photographs, can be very time-consuming. Cutting-edge machine learning tools are now being used to annotate species records from photoquadrat imagery automatically, significantly reducing the processing time of image collections. However, it is not always clear whether information is lost, and if so to what degree, when automated approaches are used. In this study, we compared results from visual quadrats versus automated photoquadrat assessments of macroalgae and sessile organisms on rocky shores across the American continent, from Patagonia (Argentina), Galapagos Islands (Ecuador), Gorgona Island (Colombian Pacific), and the northeast coast of the United States (Gulf of Maine), using the automated software CoralNet. Photoquadrat imagery was collected at the same time as visual surveys following a protocol implemented across the Americas by the Marine Biodiversity Observation Network (MBON) Pole to Pole of the Americas program. Our results show that photoquadrat machine learning annotations can estimate percent cover levels of intertidal benthic cover categories and functional groups (algae, bare substrate, and invertebrate cover) nearly identical to those from visual quadrat analysis. We found no statistically significant differences in cover estimates of dominant groups between photoquadrat images annotated by humans and those processed in CoralNet (binomial generalized linear mixed model or GLMM). 
Differences between these analyses were not significant, resulting in a Bray-Curtis average distance of 0.13 (sd 0.11) for the full label set, and 0.12 (sd 0.14) for functional groups. This is the first time that CoralNet automated annotation software has been used to monitor “Invertebrate Abundance and Distribution” and “Macroalgal Canopy Cover and Composition” Essential Ocean Variables (EOVs) in intertidal habitats. We recommend its use for rapid, continuous surveys over expanded geographical scales and monitoring of intertidal areas globally.
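The Bray-Curtis distance reported above compares two cover profiles category by category. A minimal Python sketch of the standard formula (sum of absolute differences divided by the sum of all values) follows; the function name and the percent-cover figures are invented for illustration and are not data from the study.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two non-negative abundance/cover vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0  # two empty profiles are treated as identical
    return sum(abs(x - y) for x, y in zip(a, b)) / total


# Hypothetical percent cover for (algae, bare substrate, invertebrates)
# in one quadrat, from a visual survey and from automated annotation.
visual = [40.0, 35.0, 25.0]
automated = [45.0, 30.0, 25.0]
distance = bray_curtis(visual, automated)
```

The distance is 0 for identical profiles and 1 when the two profiles share no category, so averages near 0.1 indicate closely matching cover estimates.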


2021 ◽  
Author(s):  
Mitchell J. Feldmann ◽  
Hans-Peter Piepho ◽  
Steven J. Knapp

Many important traits in plants, animals, and microbes are polygenic and are therefore difficult to improve through traditional marker-assisted selection. Genomic prediction addresses this by enabling the inclusion of all genetic data in a mixed model framework. The main method for predicting breeding values is genomic best linear unbiased prediction (GBLUP), which uses the realized genomic relationship or kinship matrix (K) to connect genotype to phenotype. The use of relationship matrices allows information to be shared for estimating the genetic values for observed entries and predicting genetic values for unobserved entries. One of the key parameters of such models is genomic heritability (h2g), or the variance of a trait associated with a genome-wide sample of DNA polymorphisms. Here we discuss the relationship between several common methods for calculating the genomic relationship matrix and propose a new matrix based on the average semivariance that yields accurate estimates of genomic variance in the observed population regardless of the focal population quality as well as accurate breeding value predictions in unobserved samples. Notably, our proposed method is highly similar to the approach presented by Legarra (2016) despite different mathematical derivations and statistical perspectives and only deviates from the classic approach presented in VanRaden (2008) by a scaling factor. With current approaches, we found that the genomic heritability tends to be either over- or underestimated depending on the scaling and centering applied to the marker matrix (Z), the value of the average diagonal element of K, and the assortment of alleles and heterozygosity (H) in the observed population and that, unlike its predecessors, our newly proposed kinship matrix KASV yields accurate estimates of h2g in the observed population, generalizes to larger populations, and produces BLUPs equivalent to common methods in plants and animals.
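To make the role of centering, scaling, and the average diagonal of K concrete, here is a hedged Python sketch of the classic VanRaden (2008) relationship matrix in its commonly stated first form: markers coded 0/1/2 are centered by twice the allele frequency and the cross-product is divided by 2 Σ p(1 − p). It does not reproduce the average-semivariance matrix proposed in the abstract, whose exact formula is not given here. The function name and toy genotypes are assumptions.

```python
def vanraden_grm(genotypes):
    """VanRaden-style genomic relationship matrix.

    genotypes : n x m list of lists, coded 0/1/2, with no missing values
    """
    n = len(genotypes)
    m = len(genotypes[0])
    # Allele frequency of the counted allele at each marker.
    p = [sum(row[j] for row in genotypes) / (2 * n) for j in range(m)]
    scale = 2 * sum(pj * (1 - pj) for pj in p)
    if scale == 0:
        raise ValueError("all markers are monomorphic")
    # Centered marker matrix Z = M - 2p.
    z = [[row[j] - 2 * p[j] for j in range(m)] for row in genotypes]
    return [
        [sum(z[i][k] * z[l][k] for k in range(m)) / scale for l in range(n)]
        for i in range(n)
    ]


# Hypothetical genotypes: four individuals, three markers.
toy_genotypes = [
    [0, 1, 2],
    [1, 1, 0],
    [2, 0, 1],
    [1, 2, 1],
]
K = vanraden_grm(toy_genotypes)
mean_diagonal = sum(K[i][i] for i in range(len(K))) / len(K)
```

Inspecting `mean_diagonal` is a quick way to see how a given centering and scaling choice shifts the average diagonal element that the abstract links to over- or underestimation of genomic heritability.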


2021 ◽  
Author(s):  
Antonello Maruotti ◽  
Giovanna Jona-Lasinio ◽  
Fabio Divino ◽  
Gianfranco Lovison ◽  
Massimo Ciccozzi ◽  
...  

We compare the expected all-cause mortality with the observed one for different age classes during the pandemic in Lombardy, which was the epicenter of the epidemic in Italy and still is the region most affected by the pandemic. A generalized linear mixed model is introduced to model weekly mortality from 2011 to 2019, taking into account seasonal patterns and year-specific trends. Based on the 2019 year-specific conditional best linear unbiased predictions, a significant excess of mortality is estimated in 2020, leading to approximately 35000 more deaths than expected, mainly arising during the first wave. In 2021, instead, the excess mortality is not significantly different from zero, for the 85+ and 15-64 age classes, and significant reductions with respect to the 2020 estimated excess mortality are estimated for other age classes.


2014 ◽  
Vol 685 ◽  
pp. 618-622
Author(s):  
Yan Yu Liu ◽  
Ming Zhong Jin ◽  
De You Xie ◽  
Min Qing Gong

In small area estimation (SAE), the spatial empirical best linear unbiased predictor (SEBLUP) is based on a linear mixed model with spatially correlated area effects, whereas the empirical best linear unbiased predictor (EBLUP) usually assumes mutually independent area effects. In this paper, we discuss maximum likelihood estimation (MLE) and compare the efficiency of the two predictors. Simulation shows that, for spatially correlated small area data, SEBLUP is more efficient than EBLUP.


2020 ◽  
Vol 18 (1) ◽  
pp. 2-22
Author(s):  
Kusman Sadik ◽  
Rahma Anisa ◽  
Euis Aqmaliyah

The most commonly used method of small area estimation (SAE) is the empirical best linear unbiased prediction method based on a linear mixed model. However, it is not appropriate when the target variable is zero-inflated, that is, a mixture of zeros and continuously distributed positive values. Therefore, various model-based SAE methods for zero-inflated data have been developed, following both frequentist and Bayesian approaches. Both approaches are compared with the survey regression (SR) method, which ignores the presence of zero-inflation in the data. The results show that the two SAE approaches for zero-inflated data are capable of yielding more accurate area mean estimates than the SR method.
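As background for the EBLUP baseline mentioned above, the following Python sketch shows the shrinkage form of the predictor under an area-level (Fay-Herriot type) model: each area's direct survey estimate is pulled toward a regression-based synthetic estimate, with a weight that depends on the sampling variance. The area-level setting is an assumption made here for illustration; the abstract does not say which linear mixed model the authors use, and the zero-inflated extensions are not shown. All names and numbers are hypothetical.

```python
def shrinkage_blup(direct, synthetic, sampling_var, area_var):
    """Area-level BLUP: weighted average of direct and synthetic estimates.

    direct       : direct survey estimate per area
    synthetic    : regression-based (synthetic) estimate per area
    sampling_var : known sampling variance of each direct estimate
    area_var     : variance of the random area effect (estimated in EBLUP)
    """
    estimates = []
    for d, s, psi in zip(direct, synthetic, sampling_var):
        gamma = area_var / (area_var + psi)  # weight on the direct estimate
        estimates.append(gamma * d + (1 - gamma) * s)
    return estimates


# Hypothetical two-area example: the second area has a noisier direct estimate,
# so it is shrunk more strongly toward its synthetic value.
area_means = shrinkage_blup(
    direct=[10.0, 20.0],
    synthetic=[12.0, 18.0],
    sampling_var=[1.0, 3.0],
    area_var=1.0,
)
```

The "empirical" in EBLUP refers to plugging an estimate of `area_var` into this formula. The linear weighting assumes roughly continuous, symmetric data, which is why a target variable with many exact zeros calls for the alternative models the abstract discusses.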


2021 ◽  
Author(s):  
Runqing Yang ◽  
Yuxin Song ◽  
Zhiyu Hao ◽  
Zhonghua Liu

Abstract In genome-wide association analysis for complex diseases, we partitioned the genomic generalized linear mixed model (GLMM) into two hierarchies: the GLMM for genomic breeding values (GBVs) and a generalized linear regression of the GBVs on the tested marker effects. In the first hierarchy, the GBVs were predicted by solving for the genomic best linear unbiased prediction for the GLMM, and in the second hierarchy, association tests were performed using the generalized least squares (GLS) method. The resulting Hi-GLMM method exhibited advantages over existing methods in terms of both genomic control for complex population structure and statistical power to detect quantitative trait nucleotides (QTNs), especially when the GBVs were estimated precisely and when joint association analysis was applied at once to the QTN candidates obtained from a test.


2019 ◽  
Vol 97 (12) ◽  
pp. 4761-4769 ◽  
Author(s):  
Pâmela A Alexandre ◽  
Laercio R Porto-Neto ◽  
Emre Karaman ◽  
Sigrid A Lehnert ◽  
Antonio Reverter

Abstract Growing concern for the environment is making it important for livestock producers to focus on selection for efficiency-related traits, which is a challenge for commercial cattle herds due to the lack of pedigree information. To explore a cost-effective opportunity for genomic evaluations of commercial herds, this study compared the accuracy of bulls’ genomic estimated breeding values (GEBV) using different pooled genotype strategies. We used ten replicates of previously simulated genomic and phenotypic data for one low (t1) and one moderate (t2) heritability trait of 200 sires and 2,200 progeny. Sires’ GEBV were calculated using a univariate mixed model, with a hybrid genomic relationship matrix (h-GRM) relating sires to: 1) 1,100 pools of 2 animals; 2) 440 pools of 5 animals; 3) 220 pools of 10 animals; 4) 110 pools of 20 animals; 5) 88 pools of 25 animals; 6) 44 pools of 50 animals; and 7) 22 pools of 100 animals. Pooling criteria were: at random, grouped sorting by t1, grouped sorting by t2, and grouped sorting by a combination of t1 and t2. The same criteria were used to select 110, 220, 440, and 1,100 individual genotypes for GEBV calculation to compare GEBV accuracy using the same number of individual genotypes and pools. Although the best accuracy was achieved for a given trait when pools were grouped based on that same trait (t1: 0.50–0.56, t2: 0.66–0.77), pooling by one trait impacted negatively on the accuracy of GEBV for the other trait (t1: 0.25–0.46, t2: 0.29–0.71). Therefore, the combined measure may be a feasible alternative that allows the same pools to be used to calculate GEBVs for both traits (t1: 0.45–0.57, t2: 0.62–0.76). Pools of 10 individuals were identified as representing a good compromise between loss of accuracy (~10%–15%) and cost savings (~90%) from genotype assays. 
In addition, we demonstrated that in more than 90% of the simulations, pools present higher sires’ GEBV accuracy than individual genotypes when the number of genotype assays is limited (i.e., 110 or 220) and animals are assigned to pools based on phenotype. Pools assigned at random presented the poorest results (t1: 0.07–0.45, t2: 0.14–0.70). In conclusion, pooling by phenotype is the best approach to implementing genomic evaluation using commercial herd data, particularly when pools of 10 individuals are evaluated. While combining phenotypes seems a promising strategy to allow more flexibility to the estimates made using pools, more studies are necessary in this regard.
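The "grouped sorting" pooling criterion can be pictured with a short Python sketch: animals are ranked by a phenotype, cut into consecutive groups of a fixed size, and each pool is summarised by averaging its members' allele dosages and phenotypes. Averaging dosages is an assumption made here to mimic what a pooled DNA assay would measure; the abstract does not describe how pool genotypes or the h-GRM were actually constructed, and the function name and toy data are hypothetical.

```python
def make_pools(ids, phenotypes, genotypes, pool_size):
    """Group animals into pools of consecutive phenotype rank.

    ids        : animal identifiers
    phenotypes : one phenotype value per animal (the sorting criterion)
    genotypes  : per-animal allele dosages (0/1/2), one list per animal
    pool_size  : number of animals per pool (the last pool may be smaller)
    """
    order = sorted(range(len(ids)), key=lambda i: phenotypes[i])
    n_markers = len(genotypes[0])
    pools = []
    for start in range(0, len(order), pool_size):
        members = order[start:start + pool_size]
        size = len(members)
        pools.append({
            "members": [ids[i] for i in members],
            # Mean dosage per marker stands in for the pooled genotype.
            "genotype": [
                sum(genotypes[i][j] for i in members) / size
                for j in range(n_markers)
            ],
            "phenotype": sum(phenotypes[i] for i in members) / size,
        })
    return pools


# Hypothetical toy herd: four animals, two markers, pools of two.
pools = make_pools(
    ids=["a", "b", "c", "d"],
    phenotypes=[3.0, 1.0, 4.0, 2.0],
    genotypes=[[2, 0], [0, 1], [2, 2], [1, 1]],
    pool_size=2,
)
```

Random pooling would correspond to shuffling `order` instead of sorting it, which removes the phenotypic contrast between pools and is consistent with the poorer accuracies reported for that criterion.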


2021 ◽  
Vol 13 (13) ◽  
pp. 2435
Author(s):  
Fiona H. Evans ◽  
Jianxiu Shen

Satellite remote sensing offers a cost-effective means of generating long-term hindcasts of yield that can be used to understand how yield varies in time and space. This study investigated the use of remotely sensed phenology, climate data and machine learning for estimating yield at a resolution suitable for optimising crop management in fields. We used spatially weighted growth curve estimation to identify the timing of phenological events from sequences of Landsat NDVI and derive phenological and seasonal climate metrics. Using data from a 17,000 ha study area, we investigated the relationships between the metrics and yield over 17 years from 2003 to 2019. We compared six statistical and machine learning models for estimating yield: multiple linear regression, mixed effects models, generalised additive models, random forests, support vector regression using radial basis functions and deep learning neural networks. We used a 50-50 train-test split on paddock-years, where 50% of paddock-year combinations were randomly selected and used to train each model and the remaining 50% of paddock-years were used to assess the model accuracy. Using only phenological metrics, accuracy was highest using a linear mixed model with a random effect that allowed the relationship between integrated NDVI and yield to vary by year (R2 = 0.67, MAE = 0.25 t ha−1, RMSE = 0.33 t ha−1, NRMSE = 0.25). We quantified the improvements in accuracy when seasonal climate metrics were also used as predictors. We identified two optimal models using the combined phenological and seasonal climate metrics: support vector regression and deep learning models (R2 = 0.68, MAE = 0.25 t ha−1, RMSE = 0.32 t ha−1, NRMSE = 0.25). While the linear mixed model using only phenological metrics performed similarly to the nonlinear models that also used seasonal climate metrics, the nonlinear models can be more easily generalised to estimate yield in years for which training data are unavailable. 
We conclude that long-term hindcasts of wheat yield in fields, at 30 m spatial resolution, can be produced using remotely sensed phenology from Landsat NDVI, climate data and machine learning.
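The evaluation protocol described above (a random 50-50 split of paddock-years, then R2, MAE, RMSE and NRMSE on the held-out half) can be sketched in plain Python. Two details are assumptions, since the abstract does not define them: R2 is computed as one minus the ratio of residual to total sum of squares, and NRMSE is the RMSE divided by the mean observed yield. Function names and the toy yields are hypothetical.

```python
import math
import random


def split_half(items, seed=0):
    """Randomly assign half of the items to training and the rest to testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def regression_metrics(observed, predicted):
    """Accuracy metrics for predicted versus observed yield (t/ha)."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mean_obs = sum(observed) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    rmse = math.sqrt(ss_res / n)
    return {
        "r2": 1 - ss_res / ss_tot,            # assumed definition of R2
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": rmse,
        "nrmse": rmse / mean_obs,             # assumed normalisation by the mean
    }


# Hypothetical paddock-year labels and held-out yields.
train_ids, test_ids = split_half([f"paddock{i}-2019" for i in range(10)])
metrics = regression_metrics(
    observed=[1.0, 2.0, 3.0, 4.0],
    predicted=[1.5, 2.0, 2.5, 4.0],
)
```

Splitting on paddock-years rather than on individual pixels keeps all observations from one paddock and season on the same side of the split, which avoids scoring a model on pixels that neighbour its own training data.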

