Bayesian sparse heritability analysis with high-dimensional neuroimaging phenotypes

Biostatistics ◽  
2020 ◽  
Author(s):  
Yize Zhao ◽  
Tengfei Li ◽  
Hongtu Zhu

Heritability analysis plays a central role in quantitative genetics, describing the genetic contribution to complex human traits and prioritizing downstream analyses of large-scale phenotypes. Existing work largely focuses on modeling a single phenotype, and currently available multivariate phenotypic methods often suffer from poor scalability and limited interpretability. In this article, motivated by understanding how genetic underpinnings shape human brain variation, we develop an integrative Bayesian heritability analysis to jointly estimate heritabilities for high-dimensional neuroimaging traits. To induce sparsity and incorporate brain anatomical configuration, we impose hierarchical selection over both regional and local measurements based on the brain structural network and voxel dependence. We also use a nonparametric Dirichlet process mixture model to group single nucleotide polymorphism-associated phenotypic variations, providing biological plausibility. Through extensive simulations, we show that the proposed method outperforms existing ones in heritability estimation and heritable-trait selection under various scenarios. We finally apply the method to two large-scale imaging genetics datasets, the Alzheimer’s Disease Neuroimaging Initiative and the UK Biobank, and obtain biologically meaningful results.
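
As a rough illustration of the grouping idea mentioned above, the sketch below clusters simulated per-trait genetic variance contributions with a truncated Dirichlet process mixture via scikit-learn's `BayesianGaussianMixture`. The simulated data, truncation level, and concentration parameter are assumptions for demonstration only, not the authors' hierarchical spatial model.

```python
# Hypothetical illustration: grouping per-trait genetic variance contributions
# with a truncated Dirichlet process mixture (not the authors' exact model).
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Simulated per-trait "SNP-associated variance" values drawn from a few latent groups.
sigma_g = np.concatenate([rng.normal(0.05, 0.01, 300),
                          rng.normal(0.20, 0.02, 150),
                          rng.normal(0.45, 0.03, 50)]).reshape(-1, 1)

dpm = BayesianGaussianMixture(
    n_components=10,                                 # truncation level
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1.0,                  # DP concentration parameter
    random_state=0,
).fit(sigma_g)

labels = dpm.predict(sigma_g)
print("effective number of groups:", len(np.unique(labels)))
```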

2016 ◽  
Author(s):  
Tian Ge ◽  
Chia-Yen Chen ◽  
Benjamin M. Neale ◽  
Mert R. Sabuncu ◽  
Jordan W. Smoller

Heritability estimation provides important information about the relative contribution of genetic and environmental factors to phenotypic variation, and an upper bound for the utility of genetic risk prediction models. Recent technological and statistical advances have enabled the estimation of additive heritability attributable to common genetic variants (SNP heritability) across a broad phenotypic spectrum. However, assessing the comparative heritability of multiple traits estimated in different cohorts may be misleading due to the population-specific nature of heritability. Here we report the SNP heritability for 551 complex traits derived from the large-scale, population-based UK Biobank, comprising both quantitative phenotypes and disease codes, and examine the moderating effect of three major demographic variables (age, sex and socioeconomic status) on the heritability estimates. Our study represents the first comprehensive phenome-wide heritability analysis in the UK Biobank, and underscores the importance of considering population characteristics in comparing and interpreting heritability.
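
The paper estimates SNP heritability with mixed-model methods; the sketch below instead uses a simple Haseman–Elston-style moment estimator on simulated data, only to illustrate the quantity being estimated. The sample size, genotype frequency, and true heritability are made-up assumptions.

```python
# Minimal Haseman–Elston-style sketch of SNP-heritability estimation from
# unrelated individuals; a simplification, not the estimator used in the paper.
import numpy as np

rng = np.random.default_rng(1)
n, m, h2_true = 500, 2000, 0.4

# Simulated genotypes, additive genetic effects, and a standardized phenotype.
G = rng.binomial(2, 0.3, size=(n, m)).astype(float)
Z = (G - G.mean(0)) / G.std(0)
beta = rng.normal(0, np.sqrt(h2_true / m), m)
y = Z @ beta + rng.normal(0, np.sqrt(1 - h2_true), n)
y = (y - y.mean()) / y.std()

A = Z @ Z.T / m                        # genetic relationship matrix (GRM)
iu = np.triu_indices(n, k=1)           # off-diagonal pairs only
# Regress phenotypic cross-products on relatedness: the slope estimates h^2.
h2_hat = np.dot(A[iu], y[iu[0]] * y[iu[1]]) / np.dot(A[iu], A[iu])
print(f"Haseman–Elston estimate of SNP heritability: {h2_hat:.2f}")
```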


2017 ◽  
Author(s):  
JinHyung Lee ◽  
David Carlson ◽  
Hooshmand Shokri ◽  
Weichi Yao ◽  
Georges Goetz ◽  
...  

Spike sorting is a critical first step in extracting neural signals from large-scale electrophysiological data. This manuscript describes an efficient, reliable pipeline for spike sorting on dense multi-electrode arrays (MEAs), where neural signals appear across many electrodes and spike sorting currently represents a major computational bottleneck. We present several new techniques that make dense MEA spike sorting more robust and scalable. Our pipeline is based on an efficient multi-stage “triage-then-cluster-then-pursuit” approach that initially extracts only clean, high-quality waveforms from the electrophysiological time series by temporarily skipping noisy or “collided” events (representing two neurons firing synchronously). This is accomplished by developing a neural network detection method followed by efficient outlier triaging. The clean waveforms are then used to infer the set of neural spike waveform templates through nonparametric Bayesian clustering. Our clustering approach adapts a “coreset” approach for data reduction and uses efficient inference methods in a Dirichlet process mixture model framework to dramatically improve the scalability and reliability of the entire pipeline. The “triaged” waveforms are then finally recovered with matching-pursuit deconvolution techniques. The proposed methods improve on the state-of-the-art in terms of accuracy and stability on both real and biophysically-realistic simulated MEA data. Furthermore, the proposed pipeline is efficient, learning templates and clustering much faster than real-time for a ≃ 500-electrode dataset, using primarily a single CPU core.
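
The snippet below is a toy, single-channel sketch of the final matching-pursuit step: greedily finding the best-matching template location in the residual and subtracting it until no score exceeds a threshold. The function name, threshold, and simulated templates are illustrative assumptions, not the authors' multi-electrode implementation.

```python
# Toy sketch of matching-pursuit deconvolution: greedily detect and subtract
# template instances from a 1-D recording (single channel, no upsampling).
import numpy as np

def matching_pursuit(signal, templates, threshold, max_iter=100):
    """Greedily explain `signal` as a sum of shifted templates."""
    residual = signal.copy()
    events = []
    T = templates.shape[1]
    for _ in range(max_iter):
        # Cross-correlate every template with the current residual.
        scores = np.array([np.correlate(residual, t, mode="valid") for t in templates])
        k, t0 = np.unravel_index(np.argmax(scores), scores.shape)
        if scores[k, t0] < threshold:
            break
        residual[t0:t0 + T] -= templates[k]      # peel off the detected spike
        events.append((t0, k))
    return events, residual

rng = np.random.default_rng(2)
templates = np.stack([np.hanning(30), -np.hanning(30)])
signal = rng.normal(0, 0.05, 1000)
signal[100:130] += templates[0]
signal[400:430] += templates[1]
print(matching_pursuit(signal, templates, threshold=2.0)[0])
```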


2021 ◽  
Author(s):  
Tung Dang ◽  
Kie Kumaishi ◽  
Erika Usui ◽  
Shungo Kobori ◽  
Takumi Sato ◽  
...  

Background: The rapid and accurate identification of a minimal-size core set of representative microbial species plays an important role in clustering microbial community data and interpreting the clustering results. However, the huge dimensionality of microbial metagenomics data sets is a major challenge for existing methods such as Dirichlet multinomial mixture (DMM) models, for which the computational burden of identifying a small number of representative species from a huge number of observed species remains substantial. Results: We propose a novel framework that improves the performance of the widely used DMM approach by combining three ideas: (i) we extend the finite DMM model to the infinite case by considering Dirichlet process mixtures and estimating the number of clusters as a random variable; (ii) we introduce an indicator variable to identify representative operational taxonomic units that substantially contribute to the differentiation among clusters; (iii) to address the computational burden of high-dimensional microbiome data, we use stochastic variational inference, which approximates the posterior distribution with a controllable variational distribution, together with stochastic optimization for fast computation. With the proposed method, named stochastic variational variable selection (SVVS), we analyzed root microbiome data collected in our soybean field experiment and human gut microbiome data from three published large-scale case-control studies. Conclusions: SVVS demonstrated better performance and significantly faster computation than existing methods on all test data sets. In particular, SVVS is the only method that can analyze massive high-dimensional microbial data with more than 50,000 microbial species and 1,000 samples. Furthermore, the microbial species selected as a core set were suggested to play important roles in recent microbiome studies.
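
The short sketch below illustrates the truncated stick-breaking construction that underlies extending a finite mixture to the infinite (Dirichlet process) case, where the number of effective clusters becomes random. It omits the variational inference and OTU-level indicator variables that SVVS adds, and the concentration and truncation values are arbitrary assumptions.

```python
# Minimal sketch of truncated stick-breaking weights for a Dirichlet process
# mixture prior (illustrative only; SVVS builds variational inference on top).
import numpy as np

def stick_breaking_weights(alpha, truncation, rng):
    """Draw mixture weights pi_1..pi_T from a truncated DP(alpha) prior."""
    v = rng.beta(1.0, alpha, size=truncation)            # stick-breaking fractions
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
    return v * remaining

rng = np.random.default_rng(3)
pi = stick_breaking_weights(alpha=2.0, truncation=20, rng=rng)
print("weights sum to ~1:", pi.sum().round(3))
print("clusters carrying >1% mass:", (pi > 0.01).sum())
```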


Author(s):  
Naohiro Tawara ◽  
Tetsuji Ogawa ◽  
Shinji Watanabe ◽  
Atsushi Nakamura ◽  
Tetsunori Kobayashi

An infinite mixture model is applied to model-based speaker clustering with sampling-based optimization, making it possible to estimate the number of speakers. For this purpose, a non-parametric Bayesian modeling framework is implemented with Markov chain Monte Carlo sampling and incorporated into an utterance-oriented speaker model. The proposed model is called the utterance-oriented Dirichlet process mixture model (UO-DPMM). The present paper demonstrates that UO-DPMM can be successfully applied to large-scale data and outperforms conventional hierarchical agglomerative clustering, especially for large numbers of utterances.
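
A much-simplified, CRP-style Gibbs sweep over utterance-level feature vectors is sketched below to convey how sampling-based inference can grow or shrink the number of speaker clusters. The fixed-variance Gaussian likelihood, the crude new-cluster prior, and all constants are assumptions for illustration, not the UO-DPMM model itself.

```python
# Toy CRP-style Gibbs sampling for clustering utterance-level feature vectors,
# with a fixed-variance Gaussian likelihood; a stand-in, not UO-DPMM.
import numpy as np

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),          # "speaker A" utterances
               rng.normal(3, 0.3, (20, 2))])         # "speaker B" utterances
alpha, var = 1.0, 0.5
z = np.zeros(len(X), dtype=int)                       # start with one cluster

for sweep in range(20):
    for i in range(len(X)):
        z[i] = -1                                     # remove utterance i
        labels, counts = np.unique(z[z >= 0], return_counts=True)
        means = np.array([X[z == k].mean(0) for k in labels])
        # Existing clusters: size * Gaussian likelihood; new cluster: alpha * broad prior.
        lik = counts * np.exp(-((X[i] - means) ** 2).sum(1) / (2 * var))
        probs = np.append(lik, alpha * np.exp(-(X[i] ** 2).sum() / (2 * 10.0)))
        probs /= probs.sum()
        choice = rng.choice(len(probs), p=probs)
        z[i] = labels[choice] if choice < len(labels) else z.max() + 1

print("estimated number of speakers:", len(np.unique(z)))
```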


2015 ◽  
Vol 112 (8) ◽  
pp. 2479-2484 ◽  
Author(s):  
Tian Ge ◽  
Thomas E. Nichols ◽  
Phil H. Lee ◽  
Avram J. Holmes ◽  
Joshua L. Roffman ◽  
...  

The discovery and prioritization of heritable phenotypes is a computational challenge in a variety of settings, including neuroimaging genetics and analyses of the vast phenotypic repositories in electronic health record systems and population-based biobanks. Classical estimates of heritability require twin or pedigree data, which can be costly and difficult to acquire. Genome-wide complex trait analysis is an alternative tool to compute heritability estimates from unrelated individuals, using genome-wide data that are increasingly ubiquitous, but it is computationally demanding and becomes difficult to apply in evaluating very large numbers of phenotypes. Here we present a fast and accurate statistical method for high-dimensional heritability analysis using genome-wide SNP data from unrelated individuals, termed massively expedited genome-wide heritability analysis (MEGHA), together with nonparametric sampling techniques that enable flexible inference for arbitrary statistics of interest. MEGHA produces estimates and significance measures of heritability with several orders of magnitude less computational time than existing methods, making heritability-based prioritization of millions of phenotypes based on data from unrelated individuals tractable for the first time to our knowledge. As a demonstration of application, we conducted heritability analyses on global and local morphometric measurements derived from brain structural MRI scans, using genome-wide SNP data from 1,320 unrelated young healthy adults of non-Hispanic European ancestry. We also computed surface maps of heritability for cortical thickness measures and empirically localized cortical regions where thickness measures were significantly heritable. Our analyses demonstrate the unique capability of MEGHA for large-scale heritability-based screening and high-dimensional heritability profile construction.
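
The sketch below illustrates the general strategy of amortizing one expensive GRM computation across many phenotypes and screening them with a vectorized statistic. The moment-based estimator used here is a simple stand-in, not MEGHA's score test or its permutation-based inference, and all simulation settings are assumptions.

```python
# Illustrative sketch: compute the GRM once, then screen many phenotypes with a
# vectorized Haseman–Elston-style statistic (a proxy, not the MEGHA estimator).
import numpy as np

rng = np.random.default_rng(5)
n, m, p = 200, 1000, 500                   # subjects, SNPs, phenotypes

Z = rng.normal(size=(n, m))                # pre-standardized genotype proxy
A = Z @ Z.T / m                            # GRM, computed once and reused
iu = np.triu_indices(n, k=1)
a = A[iu]                                  # off-diagonal relatedness
denom = a @ a

# Simulate a mix of heritable (h2 = 0.5) and non-heritable phenotypes.
h2 = np.where(np.arange(p) < 100, 0.5, 0.0)
Y = Z @ rng.normal(0, np.sqrt(h2 / m), (m, p)) + rng.normal(0, np.sqrt(1 - h2), (n, p))
Y = (Y - Y.mean(0)) / Y.std(0)

# Vectorized Haseman–Elston-style slope for every phenotype at once.
cross = Y[iu[0]] * Y[iu[1]]                # pairwise phenotype products, (pairs, p)
h2_hat = a @ cross / denom
print("mean estimate, heritable vs. null traits:",
      h2_hat[:100].mean().round(2), h2_hat[100:].mean().round(2))
```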


2009 ◽  
Vol 35 (7) ◽  
pp. 859-866
Author(s):  
Ming LIU ◽  
Xiao-Long WANG ◽  
Yuan-Chao LIU

2021 ◽  
Vol 11 (2) ◽  
pp. 472
Author(s):  
Hyeongmin Cho ◽  
Sangkyun Lee

Machine learning has been proven to be effective in various application areas, such as object and speech recognition on mobile systems. Since a key to machine learning success is the availability of large training data, many datasets are being disclosed and published online. From a data consumer's or manager's point of view, measuring data quality is an important first step in the learning process: we need to determine which datasets to use, update, and maintain. However, not many practical ways to measure data quality are available today, especially for large-scale high-dimensional data such as images and videos. This paper proposes two data quality measures that can compute class separability and in-class variability, two important aspects of data quality, for a given dataset. Classical data quality measures tend to focus only on class separability; however, we suggest that in-class variability is another important data quality factor. We provide efficient algorithms to compute our quality measures based on random projections and bootstrapping, with statistical benefits on large-scale high-dimensional data. In experiments, we show that our measures are consistent with classical measures on small-scale data and can be computed much more efficiently on large-scale high-dimensional datasets.
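
As an illustration of the general recipe, the snippet below computes a Fisher-style class-separability ratio on randomly projected data and bootstraps it over samples. The paper's actual separability and in-class variability measures differ; the dataset, projection dimension, and score used here are assumptions.

```python
# Illustrative sketch: class-separability score on random projections with
# bootstrap resampling (a stand-in for the paper's exact measures).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.random_projection import GaussianRandomProjection

def fisher_separability(X, y):
    """Between-class variance over within-class variance, averaged over features."""
    overall = X.mean(0)
    classes = np.unique(y)
    between = np.mean([(X[y == c].mean(0) - overall) ** 2 for c in classes], 0)
    within = np.mean([X[y == c].var(0) for c in classes], 0)
    return float((between / (within + 1e-12)).mean())

X, y = load_digits(return_X_y=True)
Xp = GaussianRandomProjection(n_components=16, random_state=0).fit_transform(X)

rng = np.random.default_rng(6)
scores = []
for _ in range(200):                       # bootstrap over samples
    idx = rng.integers(0, len(Xp), len(Xp))
    scores.append(fisher_separability(Xp[idx], y[idx]))
print(f"separability: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```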

