Longitudinal Phenotypes Improve Genotype Association for Hyperketonemia in Dairy Cattle

The objective of our study was to identify genomic regions associated with varying concentrations of non-esterified fatty acid (NEFA), β-hydroxybutyrate (BHB), and the development of hyperketonemia (HYK) in longitudinally sampled Holstein dairy cows. Our study population consisted of 147 multiparous cows intensively characterized by serial NEFA and BHB concentrations. To identify individuals with contrasting combinations of longitudinal BHB and NEFA concentrations, phenotypes were established using incremental area under the curve (AUC) and categorized as follows: group (1) high NEFA and high BHB, group (2) low NEFA and high BHB, group (3) low NEFA and low BHB, and group (4) high NEFA and low BHB. Cows were genotyped on the Illumina Bovine High-density (777 K) beadchip. Genome-wide association studies using mixed linear models with the least-related animals were performed in Golden Helix software to establish genetic associations with HYK, BHB-AUC, NEFA-AUC, and the comparisons of the 4 AUC phenotypic groups. Nine single-nucleotide polymorphisms were associated with high longitudinal concentrations of BHB and further investigated. Five candidate genes related to energy metabolism and homeostasis were identified. These results provide biological insight and help identify susceptible animals, thus improving genetic selection criteria and thereby decreasing the incidence of HYK.
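
The phenotype construction described above can be sketched as follows. This is a minimal illustration, not the study's procedure: it assumes the incremental AUC is the signed trapezoidal area relative to the first sample, and that "high" and "low" are decided by caller-supplied cutoffs (the abstract does not state how the thresholds were chosen). The function names `incremental_auc` and `assign_group` are hypothetical.

```python
def incremental_auc(times, values):
    """Signed trapezoidal area above the first (baseline) measurement.

    Assumption: baseline = first sample; other definitions of
    incremental AUC (e.g. ignoring area below baseline) exist.
    """
    baseline = values[0]
    area = 0.0
    for i in range(1, len(times)):
        h0 = values[i - 1] - baseline
        h1 = values[i] - baseline
        area += (times[i] - times[i - 1]) * (h0 + h1) / 2.0
    return area


def assign_group(nefa_auc, bhb_auc, nefa_cut, bhb_cut):
    """Map a cow to one of the four NEFA/BHB groups named in the abstract.

    The cutoffs are hypothetical inputs, not values from the study.
    """
    high_nefa = nefa_auc > nefa_cut
    high_bhb = bhb_auc > bhb_cut
    if high_nefa and high_bhb:
        return 1  # high NEFA, high BHB
    if not high_nefa and high_bhb:
        return 2  # low NEFA, high BHB
    if not high_nefa and not high_bhb:
        return 3  # low NEFA, low BHB
    return 4      # high NEFA, low BHB
```

For example, `assign_group(incremental_auc(days, nefa), incremental_auc(days, bhb), nefa_cut, bhb_cut)` would label one cow, given her sampling days and serial concentrations.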

TAGOOS: genome-wide supervised learning of non-coding loci associated to complex phenotypes

Nucleic Acids Research ◽

10.1093/nar/gkz320 ◽

2019 ◽

Vol 47 (14) ◽

pp. e79-e79

Author(s):

Aitor González ◽

Marie Artufel ◽

Pascal Rihet

Keyword(s):

Cleft Lip ◽

Association Studies ◽

Area Under The Curve ◽

Genome Wide Association Studies ◽

Nucleotide Polymorphisms ◽

Functional Snps ◽

Functional Regions ◽

Complex Phenotypes ◽

Genome Wide ◽

Intergenic Regions

Genome-wide association studies (GWAS) associate single nucleotide polymorphisms (SNPs) with complex phenotypes. Most human SNPs fall in non-coding regions and are likely regulatory SNPs, but linkage disequilibrium (LD) blocks make it difficult to distinguish functional SNPs. Therefore, putative functional SNPs are usually annotated with molecular markers of gene regulatory regions and prioritized with dedicated prediction tools. We integrated associated SNPs, LD blocks and regulatory features into a supervised model called TAGOOS (TAG SNP bOOSting) and computed scores genome-wide. The TAGOOS scores enriched and prioritized unseen associated SNPs with odds ratios of 4.3 and 3.5 and areas under the curve (AUC) of 0.65 and 0.6 for intronic and intergenic regions, respectively. The TAGOOS score was correlated with the maximal significance of associated SNPs and expression quantitative trait loci (eQTLs) and with the number of biological samples annotated for key regulatory features. Analysis of loci and regions associated with cleft lip and human adult height phenotypes recovered known functional loci and predicted new functional loci enriched in transcription factors related to the phenotypes. In conclusion, we trained a supervised model based on associated SNPs to prioritize putative functional regions. The TAGOOS scores, annotations and UCSC genome tracks are available here: https://tagoos.readthedocs.io.

MixMir: microRNA motif discovery from gene expression data using mixed linear models

10.1101/004010 ◽

2014 ◽

Author(s):

Liyang Diao ◽

Antoine Marcais ◽

Scott Norton ◽

Kevin C. Chen

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Motif Discovery ◽

Linear Models ◽

Developmental Stages ◽

Sequence Similarity ◽

Association Studies ◽

Genome Wide Association Studies ◽

Expression Data ◽

Mixed Linear Models

MicroRNAs (miRNAs) are a class of ~22 nt non-coding RNAs that potentially regulate over 60% of human protein-coding genes. MiRNA activity is highly specific, differing between cell types, developmental stages and environmental conditions, so the identification of active miRNAs in a given sample is of great interest. Here we present a novel computational approach for analyzing both mRNA sequence and gene expression data, called MixMir. Our method corrects for 3' UTR background sequence similarity between transcripts, which is known to correlate with mRNA transcript abundance. We demonstrate that after accounting for k-mer sequence similarities in 3' UTRs, a statistical linear model based on motif presence/absence can effectively discover active miRNAs in a sample. MixMir utilizes fast software implementations for solving mixed linear models, which are widely used in genome-wide association studies (GWAS). Essentially, we use 3' UTR sequence similarity in place of cryptic population relatedness in the GWAS problem. Compared to similar methods such as miREDUCE, Sylamer and cWords, we found that MixMir performed better at discovering true miRNA motifs in Dicer knockout CD4+ T-cells, as well as in protein and mRNA expression data obtained from miRNA transfection experiments in human cell lines. MixMir can be freely downloaded from https://github.com/ldiao/MixMir.

Detection of Genomic Regions with Pleiotropic Effects for Growth and Carcass Quality Traits in the Rubia Gallega Cattle Breed

Animals ◽

10.3390/ani11061682 ◽

2021 ◽

Vol 11 (6) ◽

pp. 1682

Author(s):

Maria Martinez-Castillero ◽

Carlos Then ◽

Juan Altarriba ◽

Houssemeddine Srihi ◽

David López-Carbonell ◽

...

Keyword(s):

Association Studies ◽

Cattle Breed ◽

Snp Markers ◽

Single Step ◽

Pleiotropic Effects ◽

Carcass Quality ◽

Genome Wide Association Studies ◽

Quality Traits ◽

Nucleotide Polymorphisms ◽

Genomic Regions

The breeding scheme in the Rubia Gallega cattle population is based upon traits measured in farms and slaughterhouses. In recent years, genomic evaluation has been implemented by using ssGBLUP (single-step Genomic Best Linear Unbiased Prediction). This procedure can be reparameterized to perform ssGWAS (single-step Genome-Wide Association Studies) by backsolving the SNP (single nucleotide polymorphism) effects. Therefore, the objective of this study was to identify genomic regions associated with the genetic variability in growth and carcass quality traits. We implemented ssGBLUP by using a database that included records for Birth Weight (BW; 327,350 records), Weaning Weight (WW; 83,818), Cold Carcass Weight (CCW; 91,621), Fatness (FAT; 91,475) and Conformation (CON; 91,609). The pedigree included 464,373 individuals, 2449 of which were genotyped. After filtering, we used 43,211 SNP markers. We used the GBLUP and SNPBLUP model equivalences to obtain the effects of the SNPs and then calculated the percentage of variance explained by 1 Mb regions of the genome. We identified 7 regions of the genome for CCW, 8 regions for BW, WW and FAT, and 9 regions for CON that explained more than 0.5% of the variance. Furthermore, a number of the genome regions had pleiotropic effects, located at: BTA1 (131–132 Mb), BTA2 (1–11 Mb), BTA3 (32–33 Mb), BTA6 (36–38 Mb), BTA16 (24–26 Mb), and BTA21 (56–57 Mb). These regions contain, amongst others, the following candidate genes: NCK1, MSTN, KCNA3, LCORL, NCAPG, and RIN3.
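
The window-based summary described above can be illustrated with a small sketch. It is not the authors' computation: it assumes each SNP contributes variance 2p(1-p)a² (allele frequency p, backsolved effect a), ignores linkage disequilibrium between SNPs, and uses non-overlapping 1 Mb windows. The function name `window_variance_pct` and the input tuple layout are hypothetical.

```python
from collections import defaultdict


def window_variance_pct(snps, window_bp=1_000_000):
    """Percentage of (approximate) genetic variance per genomic window.

    snps: iterable of (chrom, pos_bp, allele_freq, effect).
    Assumption: per-SNP variance is 2p(1-p)a^2 and SNPs are treated
    as independent, which real ssGWAS pipelines may not do.
    """
    var_by_window = defaultdict(float)
    for chrom, pos, p, a in snps:
        var_by_window[(chrom, pos // window_bp)] += 2.0 * p * (1.0 - p) * a * a
    total = sum(var_by_window.values())
    if total == 0.0:
        return {w: 0.0 for w in var_by_window}
    return {w: 100.0 * v / total for w, v in var_by_window.items()}
```

Windows whose percentage exceeds a chosen threshold (0.5% in the abstract) would then be reported as candidate regions.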

Mixed Logistic Regression in Genome-Wide Association Studies

10.1101/2020.01.17.910109 ◽

2020 ◽

Author(s):

Jacqueline Milet ◽

Hervé Perdry

Keyword(s):

Logistic Regression ◽

Linear Models ◽

Association Studies ◽

Score Test ◽

R Package ◽

Genome Wide Association ◽

Supplementary Information ◽

Genome Wide Association Studies ◽

Mixed Linear Models ◽

Genome Wide

Motivation: Mixed linear models (MLM) have been widely used to account for population structure in case-control genome-wide association studies, the status being analyzed as a quantitative phenotype. Chen et al. proved that this method is inappropriate and proposed a score test for the mixed logistic regression (MLR). However, this test does not allow an estimation of the variants’ effects. Results: We propose two computationally efficient methods to estimate the variants’ effects. Their properties are evaluated on two simulation sets and compared with other methods (MLM, logistic regression). MLR performs the best in all circumstances. The variants’ effects are well evaluated by our methods, with a moderate bias when the effect sizes are large. Additionally, we propose a stratified QQ-plot, enhancing the diagnosis of p-value inflation or deflation when population strata are not clearly identified in the sample. Availability: All methods are implemented in the R package milorGWAS available at https://github.com/genostats/[email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
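
As a rough illustration of the QQ-plot diagnostic mentioned above, the sketch below computes the points of a QQ-plot separately for each stratum. It is not the milorGWAS implementation: the strata labels are supplied by the caller (the abstract does not say how the paper defines them), expected quantiles use the (i + 0.5)/n convention, and all p-values are assumed to lie in (0, 1]. The names `qq_points` and `stratified_qq` are hypothetical.

```python
import math
from collections import defaultdict


def qq_points(pvalues):
    """Return (expected, observed) pairs on the -log10 scale.

    Under the null hypothesis the points should follow the diagonal;
    systematic departure suggests inflation or deflation.
    """
    obs = sorted(pvalues)
    n = len(obs)
    return [(-math.log10((i + 0.5) / n), -math.log10(p))
            for i, p in enumerate(obs)]


def stratified_qq(pvalues, strata):
    """Group p-values by a caller-supplied stratum label, then
    compute QQ points within each group."""
    groups = defaultdict(list)
    for p, s in zip(pvalues, strata):
        groups[s].append(p)
    return {s: qq_points(ps) for s, ps in groups.items()}
```

Plotting each stratum's points against the diagonal then shows whether inflation is confined to particular groups of variants.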

Data Mining in Genome Wide Association Studies

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch073 ◽

2011 ◽

pp. 465-471

Author(s):

Tom Burr

Keyword(s):

Data Mining ◽

Genetic Basis ◽

Association Studies ◽

Causal Variant ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Nucleotide Polymorphisms ◽

Common Diseases ◽

Genome Wide ◽

Genomic Regions

The genetic basis for some human diseases, in which one or a few genome regions increase the probability of acquiring the disease, is fairly well understood. For example, the risk for cystic fibrosis is linked to particular genomic regions. Identifying the genetic basis of more common diseases such as diabetes has proven to be more difficult, because many genome regions apparently are involved, and genetic effects are thought to depend in unknown ways on other factors, called covariates, such as diet and other environmental factors (Goldstein and Cavalleri, 2005). Genome-wide association studies (GWAS) aim to discover the genetic basis for a given disease. The main goal in a GWAS is to identify genetic variants, single nucleotide polymorphisms (SNPs) in particular, that show association with the phenotype, such as “disease present” or “disease absent”, either because they are causal or, more likely, because they are statistically correlated with an unobserved causal variant (Goldstein and Cavalleri, 2005). A GWAS can analyze “by DNA site” or “by multiple DNA sites.” In either case, data mining tools (Tachmazidou, Verzilli, and De Lorio, 2007) are proving to be quite useful for understanding the genetic causes of common diseases.

Methodological implementation of mixed linear models in multi-locus genome-wide association studies

Briefings in Bioinformatics ◽

10.1093/bib/bbx028 ◽

2017 ◽

Vol 18 (5) ◽

pp. 906-906 ◽

Cited By ~ 12

Author(s):

Yang-Jun Wen ◽

Hanwen Zhang ◽

Yuan-Li Ni ◽

Bo Huang ◽

Jin Zhang ◽

...

Keyword(s):

Linear Models ◽

Association Studies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Mixed Linear Models ◽

Genome Wide

Finding genetic variants in plants without complete genomes

10.1101/818096 ◽

2019 ◽

Cited By ~ 2

Author(s):

Yoav Voichek ◽

Detlef Weigel

Keyword(s):

Genetic Variants ◽

Association Studies ◽

Genome Wide Association Studies ◽

Nucleotide Polymorphisms ◽

Structural Variants ◽

Sequencing Data ◽

New Associations ◽

Maize Populations ◽

Genome Wide ◽

Genomic Regions

Structural variants and presence/absence polymorphisms are common in plant genomes, yet they are routinely overlooked in genome-wide association studies (GWAS). Here, we expand the genetic variants detected in GWAS to include major deletions, insertions, and rearrangements. We first use raw sequencing data directly to derive short sequences, k-mers, that mark a broad range of polymorphisms independently of a reference genome. We then link k-mers associated with phenotypes to specific genomic regions. Using this approach, we re-analyzed 2,000 traits measured in Arabidopsis thaliana, tomato, and maize populations. Associations identified with k-mers recapitulate those found with single-nucleotide polymorphisms (SNPs), but with stronger statistical support. Moreover, we identified new associations with structural variants and with regions missing from reference genomes. Our results demonstrate the power of performing GWAS before linking sequence reads to specific genomic regions, which allows detection of a wider range of genetic variants responsible for phenotypic variation.
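
The reference-free first step described above, deriving k-mers from raw reads and recording which samples carry them, can be sketched as follows. This is a toy illustration rather than the authors' pipeline: it ignores reverse complements, sequencing errors and count thresholds, and the function names and the `samples` input layout are hypothetical.

```python
def kmers(seq, k):
    """Set of all length-k substrings of one read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def presence_absence(samples, k):
    """Build a k-mer presence/absence table.

    samples: dict mapping sample name -> list of read strings.
    Returns a dict mapping k-mer -> {sample name: 0 or 1}.
    Assumption: a k-mer is 'present' if seen at least once in any read.
    """
    per_sample = {
        name: set().union(*(kmers(read, k) for read in reads))
        for name, reads in samples.items()
    }
    all_kmers = set().union(*per_sample.values())
    return {
        km: {name: int(km in found) for name, found in per_sample.items()}
        for km in sorted(all_kmers)
    }
```

Each row of the resulting table could then be tested for association with a phenotype in the same way a SNP genotype would be, before any k-mer is mapped to a genomic position.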

Methodological implementation of mixed linear models in multi-locus genome-wide association studies

Briefings in Bioinformatics ◽

10.1093/bib/bbw145 ◽

2017 ◽

Vol 19 (4) ◽

pp. 700-712 ◽

Cited By ~ 71

Author(s):

Yang-Jun Wen ◽

Hanwen Zhang ◽

Yuan-Li Ni ◽

Bo Huang ◽

Jin Zhang ◽

...

Keyword(s):

Linear Models ◽

Association Studies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Mixed Linear Models ◽

Genome Wide

Predicting novel genomic regions linked to genetic disorders using GWAS and chromosome conformation data – a case study of schizophrenia

Scientific Reports ◽

10.1038/s41598-019-54514-2 ◽

2019 ◽

Vol 9 (1) ◽

Author(s):

Daniel S. Buxton ◽

Declan J. Batten ◽

Jonathan J. Crofts ◽

Nadia Chuzhanova

Keyword(s):

Human Genome ◽

Association Studies ◽

Genetic Disorders ◽

Enrichment Analysis ◽

Genome Wide Association Studies ◽

Nucleotide Polymorphisms ◽

Novel Genes ◽

Chromosome Conformation ◽

Genomic Regions

Genome-wide association studies identified numerous loci harbouring single nucleotide polymorphisms (SNPs) associated with various human diseases, although the causal role of many of them remains unknown. In this paper, we postulate that co-location and shared biological function of novel genes with genes known to associate with a specific phenotype make them potential candidates linked to the same phenotype (“guilt-by-proxy”). We propose a novel network-based approach for predicting candidate genes/genomic regions utilising the knowledge of the 3D architecture of the human genome and GWAS data. As a case study we used a well-studied polygenic disorder ‒ schizophrenia ‒ for which we compiled a comprehensive dataset of SNPs. Our approach revealed 634 novel regions covering ~398 Mb of the human genome and harbouring ~9000 genes. Using various network measures and enrichment analysis, we identified subsets of genes and investigated the plausibility of these genes/regions having an association with schizophrenia using literature search and bioinformatics resources. We identified several genes/regions with previously reported associations with schizophrenia, thus providing proof-of-concept, as well as novel candidates with no prior known associations. This approach has the potential to identify novel genes/genomic regions linked to other polygenic disorders and provide means of aggregating genes/SNPs for further investigation.

Human genotype-to-phenotype predictions: boosting accuracy with nonlinear models

10.1101/2021.06.30.21259753 ◽

2021 ◽

Author(s):

Aleksandr Medvedev ◽

Satyarth Mishra Sharma ◽

Evgenii Tsatsorin ◽

Elena Nabieva ◽

Dmitry Yarotsky

Keyword(s):

Decision Trees ◽

Predictive Models ◽

Linear Models ◽

Human Genetics ◽

State Of The Art ◽

Nonlinear Models ◽

Association Studies ◽

Genome Wide Association Studies ◽

Nucleotide Polymorphisms ◽

Phenotype Prediction

Genotype-to-phenotype prediction is a central problem of human genetics. In recent years, it has become possible to construct complex predictive models for phenotypes, thanks to the availability of large genome data sets as well as efficient and scalable machine learning tools. In this paper, we make a three-fold contribution to this problem. First, we ask if state-of-the-art nonlinear predictive models, such as boosted decision trees, can be more efficient for phenotype prediction than conventional linear models. We find that this is indeed the case if model features include a sufficiently rich set of covariates, but probably not otherwise. Second, we ask if the conventional selection of single nucleotide polymorphisms (SNPs) by genome wide association studies (GWAS) can be replaced by a more efficient procedure, taking into account information in previously selected SNPs. We propose such a procedure, based on a sequential feature importance estimation with decision trees, and show that this approach indeed produced informative SNP sets that are much more compact than when selected with GWAS. Finally, we show that the highest prediction accuracy can ultimately be achieved by ensembling individual linear and nonlinear models. To the best of our knowledge, for some of the phenotypes that we consider (asthma, hypothyroidism), our results are a new state-of-the-art.
