Disentangling genetic feature selection and aggregation in transcriptome-wide association studies

Abstract The success of transcriptome-wide association studies (TWAS) has led to substantial research towards improving the predictive accuracy of its core component of Genetically Regulated eXpression (GReX). GReX links expression information with genotype and phenotype by playing two roles simultaneously: it acts as both the outcome of the genotype-based predictive models (for predicting expressions) and the linear combination of genotypes (as the predicted expressions) for association tests. From the perspective of machine learning (considering SNPs as features), these are actually two separable steps—feature selection and feature aggregation—which can be independently conducted. In this work, we show that the single approach of GReX limits the adaptability of TWAS methodology and practice. By conducting simulations and real data analysis, we demonstrate that disentangled protocols adapting straightforward approaches for feature selection (e.g., simple marker test) and aggregation (e.g., kernel machines) outperform the standard TWAS protocols that rely on GReX. Our development provides more powerful novel tools for conducting TWAS. More importantly, our characterization of the exact nature of TWAS suggests that, instead of questionably binding two distinct steps into the same statistical form (GReX), methodological research focusing on optimal combinations of feature selection and aggregation approaches will bring higher power to TWAS protocols.

Disentangling genetic feature selection and aggregation in transcriptome-wide association studies

10.1101/2020.11.19.390617 ◽

2020 ◽

Author(s):

Chen Cao ◽

Devin Kwok ◽

Qing Li ◽

Jingni He ◽

Xingyi Guo ◽

...

Keyword(s):

Feature Selection ◽

Linear Models ◽

Association Studies ◽

Core Component ◽

Association Testing ◽

Genetic Feature ◽

Regulated Expression ◽

Monolithic Approach ◽

Genetic Feature Selection ◽

Low Expression

Abstract The success of transcriptome-wide association studies (TWAS) has led to substantial research towards improving its core component of genetically regulated expression (GReX). GReX links expression information with phenotype by serving as both the outcome of genotype-based expression models and the predictor for downstream association testing. In this work, we demonstrate that current linear models of GReX inadvertently combine two separable steps of machine learning (feature selection and aggregation), which can be independently replaced to improve overall power. We show that the monolithic approach of GReX limits the adaptability of TWAS methodology and practice, especially given low expression heritability.
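The separation of feature selection from aggregation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that ranking SNPs by absolute Pearson correlation with expression is an acceptable stand-in for a simple marker test, and that a linear-kernel score statistic with a permutation p-value is an acceptable stand-in for a kernel machine test. The function names and the simulated data are hypothetical.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def select_snps(genotypes, expression, k):
    """Step 1, feature selection: keep the k SNP columns most correlated with expression."""
    p = len(genotypes[0])
    scores = [abs(pearson([row[j] for row in genotypes], expression)) for j in range(p)]
    return sorted(sorted(range(p), key=lambda j: -scores[j])[:k])


def kernel_stat(genotypes, phenotype, snps):
    """Linear-kernel score statistic Q = r' G G' r, with r the mean-centered phenotype."""
    ybar = mean(phenotype)
    resid = [y - ybar for y in phenotype]
    return sum(sum(row[j] * r for row, r in zip(genotypes, resid)) ** 2 for j in snps)


def kernel_test(genotypes, phenotype, snps, n_perm=999, seed=0):
    """Step 2, aggregation: permutation p-value for the kernel statistic on the selected SNPs."""
    rng = random.Random(seed)
    observed = kernel_stat(genotypes, phenotype, snps)
    shuffled = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if kernel_stat(genotypes, shuffled, snps) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    rng = random.Random(1)

    def simulate_genotypes(n, p):
        # Hypothetical genotypes coded as 0/1/2 minor-allele counts.
        return [[rng.choice((0, 1, 2)) for _ in range(p)] for _ in range(n)]

    reference = simulate_genotypes(100, 20)   # panel with expression data
    expr = [row[0] + row[1] + rng.gauss(0, 1) for row in reference]
    cohort = simulate_genotypes(150, 20)      # separate cohort with phenotype only
    pheno = [0.5 * (row[0] + row[1]) + rng.gauss(0, 1) for row in cohort]

    chosen = select_snps(reference, expr, k=3)
    print("selected SNP columns:", chosen)
    print("permutation p-value:", kernel_test(cohort, pheno, chosen, n_perm=199))
```

Because the two steps share only the list of selected column indices, either one can be swapped for a different method without touching the other.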

Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It

Political Analysis ◽

10.1017/pan.2017.44 ◽

2018 ◽

Vol 26 (2) ◽

pp. 168-189 ◽

Cited By ~ 72

Author(s):

Matthew J. Denny ◽

Arthur Spirling

Keyword(s):

Feature Selection ◽

Unsupervised Learning ◽

Political Science ◽

Real Data ◽

Statistical Procedure ◽

Science Text ◽

Substantive Theory ◽

Text Preprocessing

Despite the popularity of unsupervised techniques for political science text-as-data research, the importance and implications of preprocessing decisions in this domain have received scant systematic attention. Yet, as we show, such decisions have profound effects on the results of real models for real data. We argue that substantive theory is typically too vague to be of use for feature selection, and that the supervised literature is not necessarily a helpful source of advice. To aid researchers working in unsupervised settings, we introduce a statistical procedure and software that examines the sensitivity of findings under alternate preprocessing regimes. This approach complements a researcher’s substantive understanding of a problem by providing a characterization of the variability changes in preprocessing choices may induce when analyzing a particular dataset. In making scholars aware of the degree to which their results are likely to be sensitive to their preprocessing decisions, it aids replication efforts.
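The idea of checking how findings move under alternate preprocessing regimes can be sketched as follows. This is an illustration only, not the procedure or software introduced in the paper. It assumes three binary preprocessing steps, a toy stopword list, and a nearest-neighbour document under Jaccard distance as the "finding" being compared. All names are hypothetical.

```python
import itertools
import string

STOPWORDS = {"the", "a", "of", "and", "to", "in"}  # tiny illustrative list


def preprocess(text, lowercase, strip_punct, drop_stop):
    """Apply the chosen preprocessing steps and return the set of tokens."""
    if lowercase:
        text = text.lower()
    if strip_punct:
        text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if drop_stop:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return set(tokens)


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|, defined as 0.0 when both sets are empty."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def nearest_neighbours(docs, **steps):
    """For each document, the index of its closest other document under one regime."""
    bags = [preprocess(d, **steps) for d in docs]
    result = []
    for i, bag in enumerate(bags):
        others = [j for j in range(len(bags)) if j != i]
        # Ties are broken by the lower index so the output is deterministic.
        result.append(min(others, key=lambda j: (jaccard_distance(bag, bags[j]), j)))
    return result


def sensitivity(docs):
    """Nearest-neighbour lists for all 2**3 combinations of preprocessing steps."""
    names = ("lowercase", "strip_punct", "drop_stop")
    out = {}
    for flags in itertools.product((False, True), repeat=len(names)):
        out[flags] = nearest_neighbours(docs, **dict(zip(names, flags)))
    return out
```

For the three documents `["Tax reform", "tax reform now", "Tax cuts"]`, the nearest neighbour of the first document changes from the third to the second once lowercasing is switched on, which is the kind of instability the abstract warns about.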

A comprehensive evaluation of methods for Mendelian randomization using realistic simulations and an analysis of 38 biomarkers for risk of type 2 diabetes

International Journal of Epidemiology ◽

10.1093/ije/dyaa262 ◽

2021 ◽

Author(s):

Guanghao Qi ◽

Nilanjan Chatterjee

Keyword(s):

Type 2 Diabetes ◽

Mendelian Randomization ◽

Association Studies ◽

Real Data ◽

Causal Effects ◽

Type I ◽

Genome Wide Association Studies ◽

Simulation Studies ◽

Sample Sizes

Abstract

Background: Previous studies have often evaluated methods for Mendelian randomization (MR) analysis based on simulations that do not adequately reflect the data-generating mechanisms in genome-wide association studies (GWAS), and there are often discrepancies in the performance of MR methods in simulations and real data sets.

Methods: We use a simulation framework that generates data on full GWAS for two traits under a realistic model for effect-size distribution coherent with the heritability, co-heritability and polygenicity typically observed for complex traits. We further use recent data generated from GWAS of 38 biomarkers in the UK Biobank and perform down-sampling to investigate trends in estimates of causal effects of these biomarkers on the risk of type 2 diabetes (T2D).

Results: Simulation studies show that weighted mode and MRMix are the only two methods that maintain the correct type I error rate in a diverse set of scenarios. Between the two methods, MRMix tends to be more powerful for larger GWAS whereas the opposite is true for smaller sample sizes. Among the other methods, random-effect IVW (inverse-variance weighted method), MR-Robust and MR-RAPS (robust adjusted profile score) tend to perform best in maintaining a low mean-squared error when the InSIDE assumption is satisfied, but can produce large bias when InSIDE is violated. In real-data analysis, some biomarkers showed major heterogeneity in estimates of their causal effects on the risk of T2D across the different methods, and estimates from many methods trended in one direction with increasing sample size, with patterns similar to those observed in simulation studies.

Conclusion: The relative performance of different MR methods depends heavily on the sample sizes of the underlying GWAS, the proportion of valid instruments and the validity of the InSIDE assumption. Down-sampling analysis can be used in large GWAS for the possible detection of bias in the MR methods.
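For orientation, the inverse-variance weighted (IVW) estimator named in the abstract can be written in a few lines from GWAS summary statistics. The sketch below uses the common first-order weights and, optionally, a multiplicative random-effects correction of the standard error. It is a textbook form under those assumptions, not the code evaluated in the paper. The function name and the example numbers in the test are hypothetical.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome, random_effects=False):
    """IVW causal estimate from per-variant SNP-exposure and SNP-outcome effects.

    Returns (estimate, standard_error). Each variant contributes the Wald ratio
    beta_outcome / beta_exposure, weighted by beta_exposure**2 / se_outcome**2.
    """
    weights = [bx * bx / (se * se) for bx, se in zip(beta_exposure, se_outcome)]
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    total = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total
    std_error = math.sqrt(1.0 / total)
    if random_effects and len(ratios) > 1:
        # Cochran's Q measures heterogeneity among the ratio estimates; the
        # multiplicative random-effects model inflates the standard error by
        # sqrt(Q / (k - 1)) when that factor exceeds 1.
        q = sum(w * (r - estimate) ** 2 for w, r in zip(weights, ratios))
        std_error *= math.sqrt(max(1.0, q / (len(ratios) - 1)))
    return estimate, std_error
```

The point estimate is the same in both modes. Only the standard error widens when the per-variant ratios disagree more than sampling error would predict.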

scSorter: assigning cells to known cell types according to marker genes

Genome Biology ◽

10.1186/s13059-021-02281-7 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Hongyu Guo ◽

Jun Li

Keyword(s):

Real Data ◽

Cell Types ◽

Exact Expression ◽

Marker Genes ◽

Specific Marker ◽

Sequencing Data ◽

Reference Dataset ◽

Over Expression ◽

Higher Power ◽

Cell Type Specific

Abstract On single-cell RNA-sequencing data, we consider the problem of assigning cells to known cell types, assuming that the identities of cell-type-specific marker genes are given but their exact expression levels are unavailable, that is, without using a reference dataset. Based on an observation that the expected over-expression of marker genes is often absent in a nonnegligible proportion of cells, we develop a method called scSorter. scSorter allows marker genes to express at a low level and borrows information from the expression of non-marker genes. On both simulated and real data, scSorter shows much higher power compared to existing methods.
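To make the problem setting concrete, here is a deliberately naive baseline that assigns each cell to the type whose marker genes have the highest mean expression. It is not scSorter, and it shows the weakness the abstract points to: a cell whose markers are not detected cannot be labeled from markers alone. The function name, the data layout, and the toy expression values are assumptions made for illustration.

```python
def assign_cells(expression, markers, min_score=0.0):
    """Label each cell by the cell type with the highest mean marker expression.

    expression: {cell_id: {gene: value}}; genes missing from a cell count as 0.
    markers:    {cell_type: [marker genes]}.
    Cells whose best score does not exceed min_score are labeled "unassigned".
    """
    labels = {}
    for cell, profile in expression.items():
        scores = {
            cell_type: sum(profile.get(g, 0.0) for g in genes) / len(genes)
            for cell_type, genes in markers.items()
        }
        # Iterate over sorted type names so ties resolve deterministically.
        best = max(sorted(scores), key=scores.get)
        labels[cell] = best if scores[best] > min_score else "unassigned"
    return labels


if __name__ == "__main__":
    toy_markers = {"T": ["CD3E"], "B": ["CD19"]}      # illustrative marker lists
    toy_expression = {
        "cell_1": {"CD3E": 5.0, "CD19": 0.0},
        "cell_2": {"CD3E": 0.0, "CD19": 3.0},
        "cell_3": {"CD3E": 0.0, "CD19": 0.0},          # markers not detected
    }
    print(assign_cells(toy_expression, toy_markers))
```

In this toy input the third cell stays unassigned because neither marker is detected. That is the case where, according to the abstract, borrowing information from non-marker genes helps.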

Penalized partial least squares for pleiotropy

BMC Bioinformatics ◽

10.1186/s12859-021-03968-1 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Camilo Broc ◽

Therese Truong ◽

Benoit Liquet

Keyword(s):

Least Squares ◽

Partial Least Squares ◽

Association Studies ◽

A Priori ◽

Simulated Data ◽

Real Data ◽

Genome Wide Association Studies ◽

Genetic Associations ◽

Multiple Traits ◽

Application Fields

Abstract

Background: The increasing number of genome-wide association studies (GWAS) has revealed several loci that are associated with multiple distinct phenotypes, suggesting the existence of pleiotropic effects. Highlighting these cross-phenotype genetic associations could help to identify and understand common biological mechanisms underlying some diseases. Common approaches test the association between genetic variants and multiple traits at the SNP level. In this paper, we propose a novel gene- and pathway-level approach for the case where several independent GWAS on independent traits are available. The method is based on a generalization of the sparse group Partial Least Squares (sgPLS) to take into account groups of variables, and a Lasso penalization that links all independent data sets. This method, called joint-sgPLS, is able to convincingly detect signal at the variable level and at the group level.

Results: Our method has the advantage of producing a global, readable model while coping with the architecture of the data. It can outperform traditional methods and provides a wider insight in terms of a priori information. We compared the performance of the proposed method to other benchmark methods on simulated data and gave an example of application on real data with the aim of highlighting common susceptibility variants to breast and thyroid cancers.

Conclusion: The joint-sgPLS shows interesting properties for detecting a signal. As an extension of the PLS, the method is suited for data with a large number of variables. The choice of Lasso penalization copes with architectures of groups of variables and observation sets. Furthermore, although the method has been applied to a genetic study, its formulation is adapted to any data with a large number of variables and an explicit a priori architecture in other application fields.

Characterization of Deleterious Mutations in Outcrossing Populations

Genetics ◽

10.1093/genetics/150.2.945 ◽

1998 ◽

Vol 150 (2) ◽

pp. 945-956 ◽

Cited By ~ 4

Author(s):

Hong-Wen Deng

Keyword(s):

Genetic Variation ◽

Genetic Variance ◽

Estimation Bias ◽

Deleterious Mutations ◽

Environmental Variance ◽

Higher Power ◽

Sib Mating ◽

The Mean ◽

Simulation Results

Abstract Deng and Lynch recently proposed estimating the rate and effects of deleterious genomic mutations from changes in the mean and genetic variance of fitness upon selfing/outcrossing in outcrossing/highly selfing populations. The utility of our original estimation approach is limited in outcrossing populations, since selfing may not always be feasible. Here we extend the approach to any form of inbreeding in outcrossing populations. By simulations, the statistical properties of the estimation under a common form of inbreeding (sib mating) are investigated under a range of biologically plausible situations. The efficiencies of different degrees of inbreeding and two different experimental designs of estimation are also investigated. We found that estimation using the total genetic variation in the inbred generation is generally more efficient than employing the genetic variation among the means of inbred families, and that a higher degree of inbreeding employed in experiments yields higher power for estimation. The simulation results of the magnitude and direction of estimation bias under variable or epistatic mutation effects may provide a basis for accurate inferences of deleterious mutations. Simulations accounting for environmental variance of fitness suggest that, under full-sib mating, our extension can achieve reasonably good estimation with sample sizes of only ∼2000-3000.

Characterization of blue cheese volatiles using fingerprinting, self-organizing maps, and entropy-based feature selection

Food Chemistry ◽

10.1016/j.foodchem.2020.128955 ◽

2020 ◽

pp. 128955

Author(s):

Ryan High ◽

Graham T. Eyres ◽

Phil Bremer ◽

Biniam Kebede

Keyword(s):

Feature Selection ◽

Self Organizing Maps ◽

Blue Cheese ◽

Cheese Volatiles ◽

Self Organizing

Application of a Rough Set-Based Inductive Learning System

Fundamenta Informaticae ◽

10.3233/fi-1993-182-409 ◽

1993 ◽

Vol 18 (2-4) ◽

pp. 209-220

Author(s):

Michael Hadjimichael ◽

Anita Wasilewska

Keyword(s):

Machine Learning ◽

Rough Set ◽

Presidential Election ◽

Predictive Accuracy ◽

Learning Algorithm ◽

Inductive Learning ◽

Real Data ◽

Semantic Content ◽

Learning System ◽

Voter Preferences

We present here an application of Rough Set formalism to Machine Learning. The resulting Inductive Learning algorithm is described, and its application to a set of real data is examined. The data consists of a survey of voter preferences taken during the 1988 presidential election in the U.S.A. Results include an analysis of the predictive accuracy of the generated rules, and an analysis of the semantic content of the rules.
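The core construction behind rough-set rule induction is the pair of lower and upper approximations of a target set. It can be sketched as below. This is the standard definition rather than the specific learning system described in the paper. The function name and the survey-style records in the test are hypothetical.

```python
from collections import defaultdict


def approximations(records, attributes, target):
    """Rough-set lower and upper approximations of `target`.

    records:    {record_id: {attribute: value}}
    attributes: attribute names used to decide which records are indiscernible
    target:     set of record ids (for example, respondents preferring one candidate)
    """
    # Records with identical values on the chosen attributes form one class.
    classes = defaultdict(set)
    for rid, row in records.items():
        classes[tuple(row[a] for a in attributes)].add(rid)

    lower, upper = set(), set()
    for members in classes.values():
        if members <= target:      # class lies entirely inside the target
            lower |= members
        if members & target:       # class overlaps the target
            upper |= members
    return lower, upper
```

Classes in the lower approximation yield certain rules, since every record with those attribute values is in the target. Classes that appear only in the upper approximation yield possible rules. The ratio `len(lower) / len(upper)` is the usual accuracy of the approximation.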

Bias in two-sample Mendelian randomization when using heritable covariable-adjusted summary associations

International Journal of Epidemiology ◽

10.1093/ije/dyaa266 ◽

2021 ◽

Author(s):

Fernando Pires Hartwig ◽

Kate Tilling ◽

George Davey Smith ◽

Deborah A Lawlor ◽

Maria Carolina Borges

Keyword(s):

Waist Circumference ◽

Genetic Variants ◽

Mendelian Randomization ◽

Causal Effect ◽

Association Studies ◽

Real Data ◽

Sensitivity Analyses ◽

Effect Estimate ◽

Genome Wide Association Studies ◽

Residual Confounding

Abstract

Background: Two-sample Mendelian randomization (MR) allows the use of freely accessible summary association results from genome-wide association studies (GWAS) to estimate causal effects of modifiable exposures on outcomes. Some GWAS adjust for heritable covariables in an attempt to estimate direct effects of genetic variants on the trait of interest. One, both or neither of the exposure GWAS and outcome GWAS may have been adjusted for covariables.

Methods: We performed a simulation study comprising different scenarios that could motivate covariable adjustment in a GWAS and analysed real data to assess the influence of using covariable-adjusted summary association results in two-sample MR.

Results: In the absence of residual confounding between exposure and covariable, between exposure and outcome, and between covariable and outcome, using covariable-adjusted summary associations for two-sample MR eliminated bias due to horizontal pleiotropy. However, covariable adjustment led to bias in the presence of residual confounding (especially between the covariable and the outcome), even in the absence of horizontal pleiotropy (when the genetic variants would be valid instruments without covariable adjustment). In an analysis using real data from the Genetic Investigation of ANthropometric Traits (GIANT) consortium and UK Biobank, the causal effect estimate of waist circumference on blood pressure changed direction upon adjustment of waist circumference for body mass index.

Conclusions: Our findings indicate that using covariable-adjusted summary associations in MR should generally be avoided. When that is not possible, careful consideration of the causal relationships underlying the data (including potentially unmeasured confounders) is required to direct sensitivity analyses and interpret results with appropriate caution.
