lme4GS: An R-Package for Genomic Selection

Genomic selection (GS) is a technology used for genetic improvement, and it has many advantages over phenotype-based selection. There are several statistical models that adequately approach the statistical challenges in GS, such as in linear mixed models (LMMs). An active area of research is the development of software for fitting LMMs mainly used to make genome-based predictions. The lme4 is the standard package for fitting linear and generalized LMMs in the R-package, but its use for genetic analysis is limited because it does not allow the correlation between individuals or groups of individuals to be defined. This article describes the new lme4GS package for R, which is focused on fitting LMMs with covariance structures defined by the user, bandwidth selection, and genomic prediction. The new package is focused on genomic prediction of the models used in GS and can fit LMMs using different variance–covariance matrices. Several examples of GS models are presented using this package as well as the analysis using real data.

Download Full-text

Fast zero-inflated negative binomial mixed modeling approach for analyzing longitudinal metagenomics data

Bioinformatics ◽

10.1093/bioinformatics/btz973 ◽

2020 ◽

Vol 36 (8) ◽

pp. 2345-2351 ◽

Cited By ~ 2

Author(s):

Xinyan Zhang ◽

Nengjun Yi

Keyword(s):

Count Data ◽

Mixed Models ◽

Negative Binomial ◽

Linear Mixed Models ◽

Human Microbiome ◽

Real Data ◽

R Package ◽

Supplementary Information ◽

Sequencing Data ◽

Metagenomics Data

Abstract Motivation Longitudinal metagenomics data, including both 16S rRNA and whole-metagenome shotgun sequencing data, enhanced our abilities to understand the dynamic associations between the human microbiome and various diseases. However, analytic tools have not been fully developed to simultaneously address the main challenges of longitudinal metagenomics data, i.e. high-dimensionality, dependence among samples and zero-inflation of observed counts. Results We propose a fast zero-inflated negative binomial mixed modeling (FZINBMM) approach to analyze high-dimensional longitudinal metagenomic count data. The FZINBMM approach is based on zero-inflated negative binomial mixed models (ZINBMMs) for modeling longitudinal metagenomic count data and a fast EM-IWLS algorithm for fitting ZINBMMs. FZINBMM takes advantage of a commonly used procedure for fitting linear mixed models, which allows us to include various types of fixed and random effects and within-subject correlation structures and quickly analyze many taxa. We found that FZINBMM remarkably outperformed in computational efficiency and was statistically comparable with two R packages, GLMMadaptive and glmmTMB, that use numerical integration to fit ZINBMMs. Extensive simulations and real data applications showed that FZINBMM outperformed other previous methods, including linear mixed models, negative binomial mixed models and zero-inflated Gaussian mixed models. Availability and implementation FZINBMM has been implemented in the R package NBZIMM, available in the public GitHub repository http://github.com//nyiuab//NBZIMM. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

lme4qtl: linear mixed models with flexible covariance structure for genetic studies of related individuals

10.1101/139816 ◽

2017 ◽

Cited By ~ 2

Author(s):

Andrey Ziyatdinov ◽

Miquel Vázquez-Santiago ◽

Helena Brunel ◽

Angel Martinez-Perez ◽

Hugues Aschard ◽

...

Keyword(s):

Qtl Mapping ◽

Random Effects ◽

Mixed Models ◽

Linear Mixed Models ◽

R Package ◽

Covariance Matrices ◽

Linkage Studies ◽

Genetic Studies ◽

Related Individuals ◽

Mapping Models

AbstractBackgroundQuantitative trait locus (QTL) mapping in genetic data often involves analysis of correlated observations, which need to be accounted for to avoid false association signals. This is commonly performed by modeling such correlations as random effects in linear mixed models (LMMs). The R package lme4 is a well-established tool that implements major LMM features using sparse matrix methods; however, it is not fully adapted for QTL mapping association and linkage studies. In particular, two LMM features are lacking in the base version of lme4: the definition of random effects by custom covariance matrices; and parameter constraints, which are essential in advanced QTL models. Apart from applications in linkage studies of related individuals, such functionalities are of high interest for association studies in situations where multiple covariance matrices need to be modeled, a scenario not covered by many genome-wide association study (GWAS) software.ResultsTo address the aforementioned limitations, we developed a new R package lme4qtl as an extension of lme4. First, lme4qtl contributes new models for genetic studies within a single tool integrated with lme4 and its companion packages. Second, lme4qtl offers a flexible framework for scenarios with multiple levels of relatedness and becomes efficient when covariance matrices are sparse. We showed the value of our package using real family-based data in the Genetic Analysis of Idiopathic Thrombophilia 2 (GAIT2) project.ConclusionsOur software lme4qtl enables QTL mapping models with a versatile structure of random effects and efficient computation for sparse covariances. lme4qtl is available at https://github.com/variani/lme4qtl.

Download Full-text

The look ahead trace back optimizer for genomic selection under transparent and opaque simulators

Scientific Reports ◽

10.1038/s41598-021-83567-5 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Fatemeh Amini ◽

Felipe Restrepo Franco ◽

Guiping Hu ◽

Lizhi Wang

Keyword(s):

Genomic Selection ◽

Genetic Gain ◽

Genomic Prediction ◽

In Planta ◽

Data Set ◽

Look Ahead ◽

Trace Back ◽

New Variant ◽

The Look ◽

Partially Observable

AbstractRecent advances in genomic selection (GS) have demonstrated the importance of not only the accuracy of genomic prediction but also the intelligence of selection strategies. The look ahead selection algorithm, for example, has been found to significantly outperform the widely used truncation selection approach in terms of genetic gain, thanks to its strategy of selecting breeding parents that may not necessarily be elite themselves but have the best chance of producing elite progeny in the future. This paper presents the look ahead trace back algorithm as a new variant of the look ahead approach, which introduces several improvements to further accelerate genetic gain especially under imperfect genomic prediction. Perhaps an even more significant contribution of this paper is the design of opaque simulators for evaluating the performance of GS algorithms. These simulators are partially observable, explicitly capture both additive and non-additive genetic effects, and simulate uncertain recombination events more realistically. In contrast, most existing GS simulation settings are transparent, either explicitly or implicitly allowing the GS algorithm to exploit certain critical information that may not be possible in actual breeding programs. Comprehensive computational experiments were carried out using a maize data set to compare a variety of GS algorithms under four simulators with different levels of opacity. These results reveal how differently a same GS algorithm would interact with different simulators, suggesting the need for continued research in the design of more realistic simulators. As long as GS algorithms continue to be trained in silico rather than in planta, the best way to avoid disappointing discrepancy between their simulated and actual performances may be to make the simulator as akin to the complex and opaque nature as possible.

Download Full-text

Detection of differentially methylated CpG sites between tumor samples with uneven tumor purities

Bioinformatics ◽

10.1093/bioinformatics/btz885 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2017-2024

Author(s):

Weiwei Zhang ◽

Ziyi Li ◽

Nana Wei ◽

Hua-Jun Wu ◽

Xiaoqi Zheng

Keyword(s):

Real Data ◽

R Package ◽

Differential Methylation ◽

Least Square ◽

Epigenetic Mechanism ◽

Supplementary Information ◽

Cpg Sites ◽

Tumor Purity ◽

Different Sources ◽

Normal Controls

Abstract Motivation Inference of differentially methylated (DM) CpG sites between two groups of tumor samples with different geno- or pheno-types is a critical step to uncover the epigenetic mechanism of tumorigenesis, and identify biomarkers for cancer subtyping. However, as a major source of confounding factor, uneven distributions of tumor purity between two groups of tumor samples will lead to biased discovery of DM sites if not properly accounted for. Results We here propose InfiniumDM, a generalized least square model to adjust tumor purity effect for differential methylation analysis. Our method is applicable to a variety of experimental designs including with or without normal controls, different sources of normal tissue contaminations. We compared our method with conventional methods including minfi, limma and limma corrected by tumor purity using simulated datasets. Our method shows significantly better performance at different levels of differential methylation thresholds, sample sizes, mean purity deviations and so on. We also applied the proposed method to breast cancer samples from TCGA database to further evaluate its performance. Overall, both simulation and real data analyses demonstrate favorable performance over existing methods serving similar purpose. Availability and implementation InfiniumDM is a part of R package InfiniumPurify, which is freely available from GitHub (https://github.com/Xiaoqizheng/InfiniumPurify). Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

mixIndependR: a R package for statistical independence testing of loci in database of multi-locus genotypes

BMC Bioinformatics ◽

10.1186/s12859-020-03945-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Bing Song ◽

August E. Woerner ◽

John Planz

Keyword(s):

Population Genetics ◽

Linkage Disequilibrium ◽

Genetic Markers ◽

Software Package ◽

Tandem Repeats ◽

Population Data ◽

Real Data ◽

R Package ◽

Nucleotide Polymorphisms ◽

Mutual Independence

Abstract Background Multi-locus genotype data are widely used in population genetics and disease studies. In evaluating the utility of multi-locus data, the independence of markers is commonly considered in many genomic assessments. Generally, pairwise non-random associations are tested by linkage disequilibrium; however, the dependence of one panel might be triplet, quartet, or other. Therefore, a compatible and user-friendly software is necessary for testing and assessing the global linkage disequilibrium among mixed genetic data. Results This study describes a software package for testing the mutual independence of mixed genetic datasets. Mutual independence is defined as no non-random associations among all subsets of the tested panel. The new R package “mixIndependR” calculates basic genetic parameters like allele frequency, genotype frequency, heterozygosity, Hardy–Weinberg equilibrium, and linkage disequilibrium (LD) by mutual independence from population data, regardless of the type of markers, such as simple nucleotide polymorphisms, short tandem repeats, insertions and deletions, and any other genetic markers. A novel method of assessing the dependence of mixed genetic panels is developed in this study and functionally analyzed in the software package. By comparing the observed distribution of two common summary statistics (the number of heterozygous loci [K] and the number of share alleles [X]) with their expected distributions under the assumption of mutual independence, the overall independence is tested. Conclusion The package “mixIndependR” is compatible to all categories of genetic markers and detects the overall non-random associations. Compared to pairwise disequilibrium, the approach described herein tends to have higher power, especially when number of markers is large. With this package, more multi-functional or stronger genetic panels can be developed, like mixed panels with different kinds of markers. In population genetics, the package “mixIndependR” makes it possible to discover more about admixture of populations, natural selection, genetic drift, and population demographics, as a more powerful method of detecting LD. Moreover, this new approach can optimize variants selection in disease studies and contribute to panel combination for treatments in multimorbidity. Application of this approach in real data is expected in the future, and this might bring a leap in the field of genetic technology. Availability The R package mixIndependR, is available on the Comprehensive R Archive Network (CRAN) at: https://cran.r-project.org/web/packages/mixIndependR/index.html.

Download Full-text

A review of deep learning applications for genomic selection

BMC Genomics ◽

10.1186/s12864-020-07319-x ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Osval Antonio Montesinos-López ◽

Abelardo Montesinos-López ◽

Paulino Pérez-Rodríguez ◽

José Alberto Barrón-López ◽

Johannes W. R. Martini ◽

...

Keyword(s):

Deep Learning ◽

Plant Breeding ◽

Genomic Selection ◽

Genomic Prediction ◽

Mixed Model ◽

Prediction Models ◽

Genetic Effect ◽

Training Data ◽

Additive Genetic Effect ◽

Main Body

Abstract Background Several conventional genomic Bayesian (or no Bayesian) prediction methods have been proposed including the standard additive genetic effect model for which the variance components are estimated with mixed model equations. In recent years, deep learning (DL) methods have been considered in the context of genomic prediction. The DL methods are nonparametric models providing flexibility to adapt to complicated associations between data and output with the ability to adapt to very complex patterns. Main body We review the applications of deep learning (DL) methods in genomic selection (GS) to obtain a meta-picture of GS performance and highlight how these tools can help solve challenging plant breeding problems. We also provide general guidance for the effective use of DL methods including the fundamentals of DL and the requirements for its appropriate use. We discuss the pros and cons of this technique compared to traditional genomic prediction approaches as well as the current trends in DL applications. Conclusions The main requirement for using DL is the quality and sufficiently large training data. Although, based on current literature GS in plant and animal breeding we did not find clear superiority of DL in terms of prediction power compared to conventional genome based prediction models. Nevertheless, there are clear evidences that DL algorithms capture nonlinear patterns more efficiently than conventional genome based. Deep learning algorithms are able to integrate data from different sources as is usually needed in GS assisted breeding and it shows the ability for improving prediction accuracy for large plant breeding data. It is important to apply DL to large training-testing data sets.

Download Full-text

Impact of residual covariance structures on genomic prediction ability in multi-environment trials

PLoS ONE ◽

10.1371/journal.pone.0201181 ◽

2018 ◽

Vol 13 (7) ◽

pp. e0201181

Author(s):

Boby Mathew ◽

Jens Léon ◽

Mikko J. Sillanpää

Keyword(s):

Genomic Prediction ◽

Covariance Structures ◽

Prediction Ability ◽

Residual Covariance

Download Full-text

An alternative REML estimation of covariance matrices in linear mixed models

Statistics & Probability Letters ◽

10.1016/j.spl.2012.12.028 ◽

2013 ◽

Vol 83 (4) ◽

pp. 1071-1077 ◽

Cited By ~ 1

Author(s):

Erning Li ◽

Mohsen Pourahmadi

Keyword(s):

Mixed Models ◽

Linear Mixed Models ◽

Covariance Matrices ◽

Reml Estimation

Download Full-text

An evaluation of the interpretability and predictive performance of the BayesR model for genomic prediction

10.1101/2020.10.23.351700 ◽

2020 ◽

Author(s):

Fanny Mollandin ◽

Andrea Rau ◽

Pascal Croiseau

Keyword(s):

Linkage Disequilibrium ◽

Qtl Mapping ◽

Effect Size ◽

Genomic Prediction ◽

Simulated Data ◽

Predictive Performance ◽

Real Data ◽

Sliding Windows ◽

Size Classes ◽

The Impact

ABSTRACTTechnological advances and decreasing costs have led to the rise of increasingly dense genotyping data, making feasible the identification of potential causal markers. Custom genotyping chips, which combine medium-density genotypes with a custom genotype panel, can capitalize on these candidates to potentially yield improved accuracy and interpretability in genomic prediction. A particularly promising model to this end is BayesR, which divides markers into four effect size classes. BayesR has been shown to yield accurate predictions and promise for quantitative trait loci (QTL) mapping in real data applications, but an extensive benchmarking in simulated data is currently lacking. Based on a set of real genotypes, we generated simulated data under a variety of genetic architectures, phenotype heritabilities, and we evaluated the impact of excluding or including causal markers among the genotypes. We define several statistical criteria for QTL mapping, including several based on sliding windows to account for linkage disequilibrium. We compare and contrast these statistics and their ability to accurately prioritize known causal markers. Overall, we confirm the strong predictive performance for BayesR in moderately to highly heritable traits, particularly for 50k custom data. In cases of low heritability or weak linkage disequilibrium with the causal marker in 50k genotypes, QTL mapping is a challenge, regardless of the criterion used. BayesR is a promising approach to simultaneously obtain accurate predictions and interpretable classifications of SNPs into effect size classes. We illustrated the performance of BayesR in a variety of simulation scenarios, and compared the advantages and limitations of each.

Download Full-text

Towards an improvement of optically stimulated luminescence (OSL) age uncertainties: modelling OSL ages with systematic errors, stratigraphic constraints and radiocarbon ages using the R package BayLum

Geochronology ◽

10.5194/gchron-3-229-2021 ◽

2021 ◽

Vol 3 (1) ◽

pp. 229-245

Author(s):

Guillaume Guérin ◽

Christelle Lahaye ◽

Maryam Heydari ◽

Martin Autzen ◽

Jan-Pieter Buylaert ◽

...

Keyword(s):

R Package ◽

Systematic Errors ◽

Covariance Matrices ◽

Optically Stimulated Luminescence ◽

Osl Dating ◽

Dose Rates ◽

Determination Procedure ◽

Single Grain ◽

Measurement Uncertainties ◽

Osl Age

Abstract. Statistical analysis has become increasingly important in optically stimulated luminescence (OSL) dating since it has become possible to measure signals at the single-grain scale. The accuracy of large chronological datasets can benefit from the inclusion, in chronological modelling, of stratigraphic constraints and shared systematic errors. Recently, a number of Bayesian models have been developed for OSL age calculation; the R package “BayLum” presented herein allows different models of this type to be implemented, particularly for samples in stratigraphic order which share systematic errors. We first show how to introduce stratigraphic constraints in BayLum; then, we focus on the construction, based on measurement uncertainties, of dose covariance matrices to account for systematic errors specific to OSL dating. The nature (systematic versus random) of errors affecting OSL ages is discussed, based – as an example – on the dose rate determination procedure at the IRAMAT-CRP2A laboratory (Bordeaux). The effects of the stratigraphic constraints and dose covariance matrices are illustrated on example datasets. In particular, the benefit of combining the modelling of systematic errors with independent ages, unaffected by these errors, is demonstrated. Finally, we discuss other common ways of estimating dose rates and how they may be taken into account in the covariance matrix by other potential users and laboratories. Test datasets are provided as a Supplement to the reader, together with an R markdown tutorial allowing the reproduction of all calculations and figures presented in this study.

Download Full-text