Preprocessing Tools for Data Preparation

Multivariate Statistical Machine Learning Methods for Genomic Prediction ◽

10.1007/978-3-030-89010-0_2 ◽

2022 ◽

pp. 35-70

Author(s):

Osval Antonio Montesinos López ◽

Abelardo Montesinos López ◽

Jose Crossa

Keyword(s):

Machine Learning ◽

Mixed Model ◽

Linear Mixed Model ◽

Genomic Relationship Matrix ◽

Relationship Matrix ◽

Data Preparation ◽

Statistical Machine Learning ◽

Major Allele Frequency ◽

Major Allele ◽

Best Linear Unbiased

Abstract This data preparation chapter is of paramount importance for implementing statistical machine learning methods for genomic selection. We present the basic linear mixed model and explain how to decide between fixed and random effects, which give rise to best linear unbiased estimates (BLUE or BLUEs) and best linear unbiased predictors (BLUP or BLUPs), respectively. R code for fitting linear mixed models is given in small examples. We emphasize tools for computing BLUEs and BLUPs for many linear combinations of interest in genomic-enabled prediction and plant breeding. We present tools for cleaning and imputing marker data, computing minor and major allele frequencies, recoding markers, computing the frequency of heterozygous genotypes and of NAs, and three methods for computing the genomic relationship matrix. In addition, we discuss scaling and data compression of inputs, which are important in statistical machine learning. For a more extensive description of linear mixed models, see Chap. 10.1007/978-3-030-89010-0_5.
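The marker summaries listed in this abstract can be illustrated with a minimal Python sketch. It is not the chapter's own R code. It assumes a marker matrix with individuals in rows, markers in columns, genotypes coded as 0/1/2 copies of a reference allele, and missing calls stored as NaN. The function names `marker_summary` and `impute_mean` are hypothetical, and mean imputation is only one of several possible imputation choices.

```python
import numpy as np

def marker_summary(M):
    """Per-marker summaries for a 0/1/2 coded matrix (NaN = missing call).

    Assumes every marker has at least one non-missing call.
    """
    M = np.asarray(M, dtype=float)
    n_called = np.sum(~np.isnan(M), axis=0)        # non-missing calls per marker
    na_freq = 1.0 - n_called / M.shape[0]          # frequency of NAs
    p = np.nansum(M, axis=0) / (2.0 * n_called)    # reference allele frequency
    maf = np.minimum(p, 1.0 - p)                   # minor allele frequency
    het_freq = np.sum(M == 1, axis=0) / n_called   # frequency of heterozygous calls
    return {"na_freq": na_freq, "allele_freq": p, "maf": maf, "het_freq": het_freq}

def impute_mean(M):
    """Replace each missing call with the mean dosage of its marker (returns a copy)."""
    M = np.asarray(M, dtype=float).copy()
    col_means = np.nanmean(M, axis=0)
    rows, cols = np.where(np.isnan(M))
    M[rows, cols] = col_means[cols]
    return M
```

Markers could then be filtered on `maf` or `na_freq` thresholds before any relationship matrix is computed; suitable thresholds depend on the data set and are not given in the abstract.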

Importance of correcting genomic relationships in single-locus QTL mapping model with an Advanced Backcross population

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab105 ◽

2021 ◽

Author(s):

Boby Mathew ◽

Jens Léon ◽

Said Dadshani ◽

Klaus Pillen ◽

Mikko J Sillanpää ◽

...

Keyword(s):

Qtl Mapping ◽

Mixed Model ◽

Linear Mixed Model ◽

Allele Frequencies ◽

Genomic Relationship Matrix ◽

Polygenic Effect ◽

Relationship Matrix ◽

Mendelian Segregation ◽

Backcross Population ◽

Advanced Backcross

Abstract Advanced Backcross (AB) populations have been widely used to identify and utilize beneficial alleles in various crops such as rice, tomato, wheat and barley. For the development of an AB population, a controlled crossing scheme is used, and this controlled crossing, along with the selection (both natural and artificial) of agronomically adapted alleles during the development of the AB population, may lead to unbalanced allele frequencies in the population. However, it is commonly believed that interval mapping of traits in experimental crosses such as AB populations is immune to deviations from the expected frequencies under Mendelian segregation. Using two AB populations and simulated data sets as examples, we describe the severity of the problem caused by unbalanced allele frequencies in quantitative trait loci (QTL) mapping and demonstrate how it can be corrected using a linear mixed model having a polygenic effect with a covariance structure (genomic relationship matrix) calculated from molecular markers.

Estimating SNP heritability in presence of population substructure in biobank-scale datasets

10.1101/2020.08.05.236901 ◽

2020 ◽

Author(s):

Zhaotong Lin ◽

Souvik Seal ◽

Saonli Basu

Keyword(s):

Complex Traits ◽

Population Stratification ◽

Mixed Model ◽

Linear Mixed Model ◽

Population Substructure ◽

Relationship Matrix ◽

Phenotypic Variance ◽

Genetic Contribution ◽

Heritability Estimation ◽

The Impact

Abstract SNP heritability of a trait is measured by the proportion of total variance explained by the additive effects of genome-wide single nucleotide polymorphisms (SNPs). Linear mixed models are routinely used to estimate SNP heritability for many complex traits. The basic concept behind this approach is to model the genetic contribution as a random effect, where the variance of this genetic contribution determines the heritability of the trait. This linear mixed model approach requires estimation of ‘relatedness’ among individuals in the sample, which is usually captured by estimating a genetic relationship matrix (GRM). Heritability is estimated by the restricted maximum likelihood (REML) or method of moments (MOM) approaches, and this estimation relies heavily on the GRM computed from the genetic data on individuals. The presence of population substructure in the data could significantly impact the GRM estimation and may introduce bias in heritability estimation. The common practice for accounting for such population substructure is to adjust for the top few principal components of the GRM as covariates in the linear mixed model. Here we propose an alternative way of estimating heritability in multi-ethnic studies. Our proposed approach is a MOM estimator derived from the Haseman-Elston regression and gives an asymptotically unbiased estimate of heritability in the presence of population stratification. It introduces adjustments for the population stratification in a second-order estimating equation and allows the total phenotypic variance to vary by ethnicity. We study the performance of different MOM and REML approaches in the presence of population stratification through extensive simulation studies. We estimate the heritability of height, weight and other anthropometric traits in the UK Biobank cohort to investigate the impact of subtle population substructure on SNP heritability estimation.
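For orientation, the plain Haseman-Elston regression that the proposed estimator starts from can be sketched as follows. This is not the stratification-adjusted estimator of the abstract. It assumes a standardized phenotype, an already computed GRM `K`, and a regression through the origin of the pairwise phenotype products on the off-diagonal GRM entries. The function name `he_regression` is hypothetical.

```python
import numpy as np

def he_regression(y, K):
    """Basic Haseman-Elston (method of moments) estimate of SNP heritability.

    Regresses y_i * y_j on K_ij over all pairs i < j, through the origin,
    after standardizing y to mean 0 and variance 1. No covariate or
    population-structure adjustment is made in this sketch.
    """
    y = np.asarray(y, dtype=float)
    K = np.asarray(K, dtype=float)
    ys = (y - y.mean()) / y.std()
    iu = np.triu_indices(len(ys), k=1)   # indices of pairs with i < j
    prod = np.outer(ys, ys)[iu]          # pairwise phenotype products
    k = K[iu]                            # matching off-diagonal GRM entries
    return float(np.sum(k * prod) / np.sum(k * k))
```

Because the slope is a ratio of sums over pairs, it avoids the iterative likelihood maximization that REML requires, which is one reason moment estimators are attractive at biobank scale.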

Robots Versus Humans: Automated Annotation Accurately Quantifies Essential Ocean Variables of Rocky Intertidal Functional Groups and Habitat State

Frontiers in Marine Science ◽

10.3389/fmars.2021.691313 ◽

2021 ◽

Vol 8 ◽

Author(s):

Gonzalo Bravo ◽

Nicolas Moity ◽

Edgardo Londoño-Cruz ◽

Frank Muller-Karger ◽

Gregorio Bigatti ◽

...

Keyword(s):

Machine Learning ◽

Functional Groups ◽

Mixed Model ◽

Linear Mixed Model ◽

Visual Analysis ◽

Gulf Of Maine ◽

The United States ◽

Marine Biodiversity ◽

Percent Cover ◽

Automated Annotation

Standardized methods for effectively and rapidly monitoring changes in the biodiversity of marine ecosystems are critical to assess status and trends in ways that are comparable between locations and over time. In intertidal and subtidal habitats, estimates of fractional cover and abundance of organisms are typically obtained with traditional quadrat-based methods, and collection of photoquadrat imagery is a standard practice. However, visual analysis of quadrats, either in the field or from photographs, can be very time-consuming. Cutting-edge machine learning tools are now being used to annotate species records from photoquadrat imagery automatically, significantly reducing processing time of image collections. However, it is not always clear whether information is lost, and if so to what degree, using automated approaches. In this study, we compared results from visual quadrats versus automated photoquadrat assessments of macroalgae and sessile organisms on rocky shores across the American continent, from Patagonia (Argentina), Galapagos Islands (Ecuador), Gorgona Island (Colombian Pacific), and the northeast coast of the United States (Gulf of Maine) using the automated software CoralNet. Photoquadrat imagery was collected at the same time as visual surveys following a protocol implemented across the Americas by the Marine Biodiversity Observation Network (MBON) Pole to Pole of the Americas program. Our results show that photoquadrat machine learning annotations can estimate percent cover levels of intertidal benthic cover categories and functional groups (algae, bare substrate, and invertebrate cover) nearly identical to those from visual quadrat analysis. We found no statistical differences of cover estimations of dominant groups in photoquadrat images annotated by humans and those processed in CoralNet (binomial generalized linear mixed model or GLMM). 
Differences between these analyses were not significant, resulting in a Bray-Curtis average distance of 0.13 (sd 0.11) for the full label set, and 0.12 (sd 0.14) for functional groups. This is the first time that CoralNet automated annotation software has been used to monitor “Invertebrate Abundance and Distribution” and “Macroalgal Canopy Cover and Composition” Essential Ocean Variables (EOVs) in intertidal habitats. We recommend its use for rapid, continuous surveys over expanded geographical scales and monitoring of intertidal areas globally.
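The Bray-Curtis distance reported here compares two cover profiles of the same quadrat. A minimal sketch of the standard dissimilarity formula follows, assuming each profile is a vector of non-negative percent-cover values in the same category order. The function name `bray_curtis` and the example values are illustrative and are not taken from the study.

```python
import numpy as np

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum(|a - b|) / sum(a + b), between 0 and 1."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    total = np.sum(a + b)
    if total == 0:
        return 0.0   # two empty profiles are treated as identical
    return float(np.sum(np.abs(a - b)) / total)

# Hypothetical percent cover (algae, bare substrate, invertebrates) for one
# quadrat, scored visually and by automated annotation.
visual = [60, 30, 10]
automated = [50, 35, 15]
distance = bray_curtis(visual, automated)
```

A value of 0 means identical profiles and 1 means no shared cover, so averages near 0.1 indicate profiles that differ only slightly.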

Genomic Heritability: A Ragged Diagonal Between Bias and Variance

10.1101/2021.09.19.460999 ◽

2021 ◽

Author(s):

Mitchell J. Feldmann ◽

Hans-Peter Piepho ◽

Steven J. Knapp

Keyword(s):

Mixed Model ◽

Dna Polymorphisms ◽

Breeding Value ◽

Genomic Relationship Matrix ◽

Relationship Matrix ◽

Genomic Relationship ◽

Model Framework ◽

Kinship Matrix ◽

Genomic Heritability ◽

A Genome

Many important traits in plants, animals, and microbes are polygenic and are therefore difficult to improve through traditional marker-assisted selection. Genomic prediction addresses this by enabling the inclusion of all genetic data in a mixed model framework. The main method for predicting breeding values is genomic best linear unbiased prediction (GBLUP), which uses the realized genomic relationship or kinship matrix (K) to connect genotype to phenotype. The use of relationship matrices allows information to be shared for estimating the genetic values for observed entries and predicting genetic values for unobserved entries. One of the key parameters of such models is genomic heritability (h2g), or the variance of a trait associated with a genome-wide sample of DNA polymorphisms. Here we discuss the relationship between several common methods for calculating the genomic relationship matrix and propose a new matrix based on the average semivariance that yields accurate estimates of genomic variance in the observed population regardless of the focal population quality as well as accurate breeding value predictions in unobserved samples. Notably, our proposed method is highly similar to the approach presented by Legarra (2016) despite different mathematical derivations and statistical perspectives and only deviates from the classic approach presented in VanRaden (2008) by a scaling factor. With current approaches, we found that the genomic heritability tends to be either over- or underestimated depending on the scaling and centering applied to the marker matrix (Z), the value of the average diagonal element of K, and the assortment of alleles and heterozygosity (H) in the observed population and that, unlike its predecessors, our newly proposed kinship matrix KASV yields accurate estimates of h2g in the observed population, generalizes to larger populations, and produces BLUPs equivalent to common methods in plants and animals.

Estimating COVID-19-induced Excess Mortality in Lombardy

10.1101/2021.11.17.21266455 ◽

2021 ◽

Author(s):

Antonello Maruotti ◽

Giovanna Jona-Lasinio ◽

Fabio Divino ◽

Gianfranco Lovison ◽

Massimo Ciccozzi ◽

...

Keyword(s):

Excess Mortality ◽

Mixed Model ◽

Linear Mixed Model ◽

Generalized Linear Mixed Model ◽

Seasonal Patterns ◽

Age Classes ◽

Significant Excess ◽

All Cause Mortality ◽

Best Linear Unbiased

We compare the expected all-cause mortality with the observed one for different age classes during the pandemic in Lombardy, which was the epicenter of the epidemic in Italy and still is the region most affected by the pandemic. A generalized linear mixed model is introduced to model weekly mortality from 2011 to 2019, taking into account seasonal patterns and year-specific trends. Based on the 2019 year-specific conditional best linear unbiased predictions, a significant excess of mortality is estimated in 2020, leading to approximately 35000 more deaths than expected, mainly arising during the first wave. In 2021, instead, the excess mortality is not significantly different from zero, for the 85+ and 15-64 age classes, and significant reductions with respect to the 2020 estimated excess mortality are estimated for other age classes.

MC Simulation of SEBLUP with Spatial Linear Mixed Model for SAE

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.685.618 ◽

2014 ◽

Vol 685 ◽

pp. 618-622

Author(s):

Yan Yu Liu ◽

Ming Zhong Jin ◽

De You Xie ◽

Min Qing Gong

Keyword(s):

Spatial Correlation ◽

Small Area ◽

Mixed Model ◽

Linear Mixed Model ◽

Likelihood Estimation ◽

Best Linear Unbiased Prediction ◽

Linear Unbiased Prediction ◽

Mutual Independence ◽

Best Linear Unbiased ◽

Unbiased Prediction

In small area estimation (SAE), spatial empirical best linear unbiased prediction (SEBLUP) is based on a linear mixed model with spatially correlated area effects, whereas empirical best linear unbiased prediction (EBLUP) usually assumes mutually independent effects. In this paper, we discuss maximum likelihood estimation (MLE) and compare the efficiency of the two predictors. Simulation shows that SEBLUP is more efficient than EBLUP when the small-area data are spatially correlated.

Small Area Estimation on Zero-Inflated Data Using Frequentist and Bayesian Approach

Journal of Modern Applied Statistical Methods ◽

10.22237/jmasm/1582727606 ◽

2020 ◽

Vol 18 (1) ◽

pp. 2-22

Author(s):

Kusman Sadik ◽

Rahma Anisa ◽

Euis Aqmaliyah

Keyword(s):

Bayesian Approach ◽

Small Area ◽

Mixed Model ◽

Linear Mixed Model ◽

Small Area Estimation ◽

Prediction Method ◽

Linear Unbiased Prediction ◽

Area Estimation ◽

Best Linear Unbiased ◽

Sr Method

The most commonly used method of small area estimation (SAE) is the empirical best linear unbiased prediction method based on a linear mixed model. However, it is not appropriate for a zero-inflated target variable, that is, a mixture of zeros and continuously distributed positive values. Therefore, various model-based SAE methods for zero-inflated data have been developed, such as the Frequentist approach and the Bayesian approach. Both approaches are compared with the survey regression (SR) method, which ignores the presence of zero-inflation in the data. The results show that the two SAE approaches for zero-inflated data are capable of yielding more accurate area mean estimates than the SR method.

Generalized Hierarchical Mixed Model Association Analysis

10.1101/2021.03.10.434742 ◽

2021 ◽

Author(s):

Runqing Yang ◽

Yuxin Song ◽

Zhiyu Hao ◽

Zhonghua Liu

Keyword(s):

Association Analysis ◽

Statistical Power ◽

Mixed Model ◽

Linear Mixed Model ◽

Least Square ◽

Linear Unbiased Prediction ◽

Genomic Breeding ◽

Genome Wide ◽

Complex Population ◽

Best Linear Unbiased

Abstract In genome-wide association analysis for complex diseases, we partitioned the genomic generalized linear mixed model (GLMM) into two hierarchies: the GLMM regarding genomic breeding values (GBVs) and a generalized linear regression of the GBVs to the tested marker effects. In the first hierarchy, the GBVs were predicted by solving for the genomic best linear unbiased prediction for GLMM, and in the second hierarchy, association tests were performed using the generalized least square (GLS) method. The so-called Hi-GLMM method exhibited advantages over existing methods in terms of both genomic control for complex population structure and statistical power to detect quantitative trait nucleotides (QTNs), especially when the GBVs were estimated precisely, and using joint association analysis for QTN candidates obtained from a test at once.

Pooled genotyping strategies for the rapid construction of genomic reference populations

Journal of Animal Science ◽

10.1093/jas/skz344 ◽

2019 ◽

Vol 97 (12) ◽

pp. 4761-4769

Author(s):

Pâmela A Alexandre ◽

Laercio R Porto-Neto ◽

Emre Karaman ◽

Sigrid A Lehnert ◽

Antonio Reverter

Keyword(s):

Mixed Model ◽

Cost Savings ◽

Cost Effective ◽

Pedigree Information ◽

Genomic Relationship Matrix ◽

Relationship Matrix ◽

Phenotypic Data ◽

Feasible Alternative ◽

Cattle Herds ◽

Estimated Breeding Values

Abstract The growing concern with the environment is making it important for livestock producers to focus on selection for efficiency-related traits, which is a challenge for commercial cattle herds due to the lack of pedigree information. To explore a cost-effective opportunity for genomic evaluations of commercial herds, this study compared the accuracy of bulls’ genomic estimated breeding values (GEBV) using different pooled genotype strategies. We used ten replicates of previously simulated genomic and phenotypic data for one low (t1) and one moderate (t2) heritability trait of 200 sires and 2,200 progeny. Sires’ GEBV were calculated using a univariate mixed model, with a hybrid genomic relationship matrix (h-GRM) relating sires to: 1) 1,100 pools of 2 animals; 2) 440 pools of 5 animals; 3) 220 pools of 10 animals; 4) 110 pools of 20 animals; 5) 88 pools of 25 animals; 6) 44 pools of 50 animals; and 7) 22 pools of 100 animals. Pooling criteria were: at random, grouped sorting by t1, grouped sorting by t2, and grouped sorting by a combination of t1 and t2. The same criteria were used to select 110, 220, 440, and 1,100 individual genotypes for GEBV calculation to compare GEBV accuracy using the same number of individual genotypes and pools. Although the best accuracy was achieved for a given trait when pools were grouped based on that same trait (t1: 0.50–0.56, t2: 0.66–0.77), pooling by one trait impacted negatively on the accuracy of GEBV for the other trait (t1: 0.25–0.46, t2: 0.29–0.71). Therefore, the combined measure may be a feasible alternative to use the same pools to calculate GEBVs for both traits (t1: 0.45–0.57, t2: 0.62–0.76). Pools of 10 individuals were identified as representing a good compromise between loss of accuracy (~10%–15%) and cost savings (~90%) from genotype assays.
In addition, we demonstrated that in more than 90% of the simulations, pools present higher sires’ GEBV accuracy than individual genotypes when the number of genotype assays is limited (i.e., 110 or 220) and animals are assigned to pools based on phenotype. Pools assigned at random presented the poorest results (t1: 0.07–0.45, t2: 0.14–0.70). In conclusion, pooling by phenotype is the best approach to implementing genomic evaluation using commercial herd data, particularly when pools of 10 individuals are evaluated. While combining phenotypes seems a promising strategy to allow more flexibility to the estimates made using pools, more studies are necessary in this regard.
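The pooling designs described above (random pools versus pools of animals with adjacent phenotypes) can be sketched in a few lines. This is an illustrative sketch, not the study's pipeline. It assumes a 0/1/2 dosage matrix with animals in rows, represents each pool by its mean dosage (a stand-in for the pooled allele frequency an assay would return), and does not build the h-GRM. The function name `make_pools` and its parameters are hypothetical.

```python
import numpy as np

def make_pools(G, pool_size, sort_by=None, seed=0):
    """Group animals into equal-sized pools and average their genotypes.

    G        : animals x markers dosage matrix (0/1/2).
    sort_by  : phenotype vector; if given, animals with adjacent phenotypes
               share a pool. If None, animals are pooled at random.
    Returns (pool membership as row indices, pooled mean dosage per marker).
    """
    G = np.asarray(G, dtype=float)
    n = G.shape[0]
    if n % pool_size != 0:
        raise ValueError("number of animals must be a multiple of pool_size")
    if sort_by is None:
        order = np.random.default_rng(seed).permutation(n)
    else:
        order = np.argsort(sort_by, kind="stable")
    pools = order.reshape(n // pool_size, pool_size)
    pooled = G[pools].mean(axis=1)   # shape: pools x markers
    return pools, pooled
```

With 2,200 progeny, pool sizes of 2, 5, 10, 20, 25, 50 and 100 give 1,100, 440, 220, 110, 88, 44 and 22 pools, matching the seven designs listed in the abstract.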

Improving the Efficiency of Genomic Selection in Chinese Simmental beef cattle

10.1101/022673 ◽

2015 ◽

Author(s):

Jiangwei Xia ◽

Yang Wu ◽

Huizhong Fang ◽

Wengang Zhang ◽

Yuxin Song ◽

...

Keyword(s):

Beef Cattle ◽

Genomic Relationship Matrix ◽

Relationship Matrix ◽

Nucleotide Polymorphisms ◽

Linear Unbiased Prediction ◽

Single Nucleotide ◽

Kinship Matrix ◽

Genome Wide ◽

Best Linear Unbiased ◽

Selection Of

Genomic selection is an accurate and efficient method of estimating genetic merit using high-density genome-wide single nucleotide polymorphisms (SNPs). In this study, we investigate an approach to increase the efficiency of genomic prediction using genome-wide markers. The approach is a feature selection based on genomic best linear unbiased prediction (GBLUP), a statistical method that uses SNPs to predict breeding values for selection in animal and plant breeding. The objective of this study is the choice of kinship matrix for GBLUP. The G-matrix uses the information in genome-wide dense markers. We compare three kinds of kinship matrices based on different combinations of centring and scaling of marker genotypes, and identify a suitable kinship approach for the resource population of Chinese Simmental beef cattle. SNPs can be used to estimate the kinship matrix and individual inbreeding coefficients more accurately. In our research, a genomic relationship matrix was developed for 1059 Chinese Simmental beef cattle using 640000 SNPs, and breeding values were estimated using phenotypes for carcass weight and sirloin weight. The number of SNPs needed to accurately estimate a genomic relationship matrix was evaluated in this population. Another aim of this study was to optimize the selection of markers and determine the required number of SNPs for estimation of kinship in Chinese Simmental beef cattle. We find that feature selection with GBLUP using Xu’s and Astle and Balding’s kinship models performed similarly well, and these were the best-performing methods in our study. Inbreeding and the kinship matrix can be estimated with high accuracy using ≥12,000 SNPs in Chinese Simmental beef cattle.
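How centring and scaling choices change a marker-based kinship matrix can be sketched as below. The variants shown are common textbook forms and are not necessarily the three compared in the study. The sketch assumes a complete 0/1/2 dosage matrix with no missing calls and no monomorphic markers. The function name `kinship` and the option labels are hypothetical.

```python
import numpy as np

def kinship(M, center=True, scale="global"):
    """Marker-based kinship under different centring and scaling choices.

    M      : individuals x markers dosage matrix (0/1/2), no missing values.
    center : subtract twice the allele frequency from each marker column.
    scale  : "global"     divides Z Z' by sum(2p(1-p)) (VanRaden-style),
             "per_marker" standardizes each marker by sqrt(2p(1-p)) and
                          averages over markers (Astle-and-Balding-style),
             "none"       divides Z Z' by the number of markers only.
    """
    M = np.asarray(M, dtype=float)
    p = M.mean(axis=0) / 2.0              # allele frequency per marker
    Z = M - 2.0 * p if center else M.copy()
    het = 2.0 * p * (1.0 - p)             # expected heterozygosity per marker
    if scale == "global":
        return Z @ Z.T / het.sum()
    if scale == "per_marker":
        W = Z / np.sqrt(het)
        return W @ W.T / M.shape[1]
    if scale == "none":
        return Z @ Z.T / M.shape[1]
    raise ValueError("scale must be 'global', 'per_marker' or 'none'")
```

Per-marker scaling gives rare alleles more weight than global scaling does, so the two can rank relationships differently when allele frequencies vary widely; they coincide when all markers share the same frequency.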