A statistical framework for QTL hotspot detection

2021 ◽  
Vol 11 (4) ◽  
Author(s):  
Po-Ya Wu ◽  
Man-Hsia Yang ◽  
Chen-Hung Kao

Quantitative trait loci (QTL) hotspots (genomic locations enriched in QTL) are a common and notable feature when collecting many QTL for various traits in many areas of biological study. QTL hotspots are important and attractive since they are highly informative and may harbor genes for the quantitative traits. So far, the current statistical methods for QTL hotspot detection use either individual-level data from genetical genomics experiments or summarized data from public QTL databases to carry out the detection analysis. These methods may suffer from ignoring the correlation structure among traits, neglecting the magnitude of LOD scores for the QTL, or paying a very high computational cost, which often lead, respectively, to the detection of excessive spurious hotspots, failure to discover biologically interesting hotspots composed of a small-to-moderate number of QTL with strong LOD scores, and computational intractability during the detection process. In this article, we describe a statistical framework that can handle both types of data and address all of these problems at once for QTL hotspot detection. Our statistical framework operates directly on the QTL matrix, and hence has a very low computational cost, and is deployed to take advantage of the QTL mapping results to assist the detection analysis. Two special devices, trait grouping and the top γn,α profile, are introduced into the framework. Trait grouping attempts to group the traits controlled by closely linked or pleiotropic QTL into the same trait groups and randomly allocates these QTL together across the genomic positions, separately by trait group, to account for the correlation structure among traits, making it possible to obtain much stricter thresholds and dismiss spurious hotspots. The top γn,α profile is designed to outline the LOD-score pattern of QTL in a hotspot across different hotspot architectures, so that it can serve to identify and characterize the types of QTL hotspots with varying sizes and LOD-score distributions. Real examples, numerical analysis, and a simulation study are performed to validate our statistical framework, investigate its detection properties, and compare it with the current methods for QTL hotspot detection. The results demonstrate that the proposed statistical framework can effectively accommodate the correlation structure among traits, identify the types of hotspots, and still retain the notable features of easy implementation and fast computation for practical QTL hotspot detection.
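For readers who want a concrete picture of the permutation scheme, the following Python sketch (not the authors' implementation; names such as qtl_bins, trait_group, and n_bins are hypothetical) illustrates how QTL can be shuffled across genomic bins separately by trait group to build a null distribution of hotspot counts and derive a detection threshold.

```python
# Minimal sketch of a trait-grouped permutation test for QTL hotspot
# thresholds. This is an illustrative approximation, not the authors' code:
# each trait group's QTL are relocated together by a common random offset,
# preserving within-group clustering, and the maximum per-bin count in each
# permutation forms the null distribution.
import numpy as np

def hotspot_threshold(qtl_bins, trait_group, n_bins, n_perm=1000, alpha=0.05,
                      rng=None):
    """qtl_bins: genomic bin index of each QTL; trait_group: group label of
    the trait each QTL belongs to; returns the (1 - alpha) quantile of the
    permutation null for the maximum bin count."""
    rng = np.random.default_rng(rng)
    qtl_bins = np.asarray(qtl_bins)
    trait_group = np.asarray(trait_group)
    max_counts = np.empty(n_perm)
    for p in range(n_perm):
        permuted = np.empty_like(qtl_bins)
        for g in np.unique(trait_group):
            idx = np.where(trait_group == g)[0]
            # shift the whole group by one random offset so that closely
            # linked / pleiotropic QTL stay together
            shift = rng.integers(n_bins)
            permuted[idx] = (qtl_bins[idx] + shift) % n_bins
        max_counts[p] = np.bincount(permuted, minlength=n_bins).max()
    return np.quantile(max_counts, 1 - alpha)

# Example: 3 trait groups, 300 QTL over 500 genomic bins
rng = np.random.default_rng(1)
bins = rng.integers(0, 500, size=300)
groups = rng.integers(0, 3, size=300)
print("hotspot count threshold:", hotspot_threshold(bins, groups, 500, rng=2))
```

A genomic bin whose observed QTL count exceeds this threshold would be flagged as a candidate hotspot; grouping makes the null clumpier and hence the threshold stricter.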

2020 ◽  
Author(s):  
Po-Ya Wu ◽  
Man-Hsia Yang ◽  
Chen-Hung Kao

Quantitative trait loci (QTL) hotspots (genomic locations enriched in QTL) are a common and notable feature when collecting many QTL for various traits in many areas of biological study. QTL hotspots are important and attractive since they are highly informative and may harbor genes for the quantitative traits. So far, the current statistical methods for QTL hotspot detection use either individual-level data from genetical genomics experiments or summarized data from public QTL databases to carry out the detection analysis. These detection methods attempt to address some of the concerns that arise during QTL hotspot detection, including the correlation structure among traits, the magnitude of LOD scores within a hotspot, and computational cost. In this article, we describe a statistical framework that can handle both types of data and address all of these concerns at once. Our statistical framework operates directly on the QTL matrix, and hence has a very low computational cost, and is deployed to take advantage of the QTL mapping results to assist the detection analysis. Two special devices, trait grouping and the top γn,α profile, are introduced into the framework. Trait grouping attempts to group closely linked or pleiotropic traits together to respect the true linkages and to cope with the underestimation of hotspot thresholds caused by non-genetic correlations (which arises from ignoring the correlation structure among traits), making it possible to obtain much stricter thresholds and dismiss spurious hotspots. The top γn,α profile is designed to outline the LOD-score pattern of a hotspot across different hotspot architectures, so that it can serve to identify and characterize the types of QTL hotspots with varying sizes and LOD-score distributions. Real examples, numerical analysis, and a simulation study are performed to validate our statistical framework, investigate its detection properties, and compare it with the current methods for QTL hotspot detection. The results demonstrate that the proposed statistical framework can effectively accommodate the correlation structure among traits, identify the types of hotspots, and still retain the notable features of easy implementation and fast computation for practical QTL hotspot detection.


2018 ◽  
Author(s):  
Man-Hsia Yang ◽  
Dong-Hong Wu ◽  
Chen-Hung Kao

Genome-wide detection of quantitative trait loci (QTL) hotspots underlying variation in many molecular and phenotypic traits has been a key step in various biological studies, since QTL hotspots are highly informative and can be linked to the genes for the quantitative traits. Several statistical methods have been proposed to detect QTL hotspots. These hotspot detection methods rely heavily on permutation tests performed on summarized QTL data or on individual-level data (with genotypes and phenotypes) from genetical genomics experiments. In this article, we propose a statistical procedure for QTL hotspot detection that uses the summarized QTL (interval) data collected in public, web-accessible databases. First, a simple statistical method based on the uniform distribution is derived to convert the QTL interval data into an expected QTL frequency (EQF) matrix. Then, to account for the correlation structure among traits, the QTL for correlated traits are grouped together into the same categories to form a reduced EQF matrix. Furthermore, a permutation algorithm on the EQF elements or on the QTL intervals is developed to compute a sliding scale of EQF thresholds, ranging from strict to liberal, for assessing the significance of QTL hotspots. With grouping, much stricter thresholds can be obtained to avoid the detection of spurious hotspots. A real example analysis and a simulation study are carried out to illustrate our procedure, evaluate its performance, and compare it with other methods. The results show that our procedure controls the genome-wide error rates at the target levels, provides appropriate thresholds for correlated data, and is comparable to the methods using individual-level data in hotspot detection. Depending on the thresholds used, more than 100 hotspots are detected in the GRAMENE rice database. We also perform a genome-wide comparative analysis of the detected hotspots and the known genes collected in the Rice Q-TARO database. The comparative analysis reveals that the hotspots and genes are conformable in the sense that they co-localize closely and are functionally related to relevant traits. Our statistical procedure can provide a framework for exploring the networks among QTL hotspots, genes, and quantitative traits in biological studies. The R code that produces both numerical and graphical outputs of QTL hotspot detection in the genome is available at http://www.stat.sinica.edu.tw/~chkao/.
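The conversion of interval data into EQF values can be pictured with a small Python sketch (a simplified illustration under the stated uniform-distribution assumption, not the authors' R implementation; bin_width_cM and chrom_length_cM are hypothetical parameters): each QTL contributes to every genomic bin the fraction of its support interval falling in that bin.

```python
# Minimal sketch: build an expected QTL frequency (EQF) vector for one
# chromosome from summarized QTL intervals, spreading each QTL uniformly
# over its reported interval. Equal-width bins in centimorgans are assumed.
import numpy as np

def eqf_vector(intervals, chrom_length_cM, bin_width_cM=0.5):
    """intervals: list of (start_cM, end_cM) support intervals of the QTL."""
    n_bins = int(np.ceil(chrom_length_cM / bin_width_cM))
    eqf = np.zeros(n_bins)
    edges = np.arange(n_bins + 1) * bin_width_cM
    for start, end in intervals:
        length = max(end - start, 1e-9)            # guard against zero width
        overlap = np.clip(np.minimum(edges[1:], end)
                          - np.maximum(edges[:-1], start), 0.0, None)
        eqf += overlap / length                     # fractions sum to 1 per QTL
    return eqf

# Example: three QTL intervals on a 100 cM chromosome; total EQF is ~3
print(eqf_vector([(10.0, 20.0), (12.0, 18.0), (60.0, 61.0)], 100.0).sum())
```

Stacking such vectors by trait (or by trait category after grouping) gives the EQF matrix and reduced EQF matrix described above, on which the permutation thresholds are computed.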


Author(s):  
Benjamin D. Youngman ◽  
David B. Stephenson

We develop a statistical framework for simulating natural hazard events that combines extreme value theory and geostatistics. Robust generalized additive model forms represent generalized Pareto marginal distribution parameters while a Student's t-process captures spatial dependence and gives a continuous-space framework for natural hazard event simulations. Efficiency of the simulation method allows many years of data (typically over 10 000) to be obtained at relatively little computational cost. This makes the model viable for forming the hazard module of a catastrophe model. We illustrate the framework by simulating maximum wind gusts for European windstorms, which are found to have realistic marginal and spatial properties, and validate well against wind gust measurements.
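The combination of heavy-tailed margins with spatial dependence can be sketched in Python as follows. This is an illustrative stand-in, not the authors' model: the generalized Pareto parameters are fixed constants rather than GAM outputs, and an exponential correlation function replaces the fitted spatial structure; simulate_gusts, range_km, and the site coordinates are all hypothetical.

```python
# Illustrative sketch: simulate spatially dependent exceedances by (i) drawing
# from a multivariate Student-t over sites, (ii) mapping to uniforms with the
# t CDF, and (iii) applying generalized Pareto quantile functions per site.
import numpy as np
from scipy import stats

def simulate_gusts(coords, n_events, df=4.0, range_km=300.0,
                   shape=0.1, scale=8.0, threshold=20.0, seed=0):
    rng = np.random.default_rng(seed)
    # exponential decay of correlation with distance (stand-in spatial model)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    corr = np.exp(-d / range_km)
    # multivariate t draw: scale a multivariate normal by a chi-square factor
    z = rng.multivariate_normal(np.zeros(len(coords)), corr, size=n_events)
    w = rng.chisquare(df, size=n_events) / df
    t_draws = z / np.sqrt(w)[:, None]
    u = stats.t.cdf(t_draws, df)                    # to uniform margins
    return threshold + stats.genpareto.ppf(u, c=shape, scale=scale)

sites = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 200.0]])   # site coords in km
print(simulate_gusts(sites, n_events=5).round(1))            # gusts per site
```

Because each simulated event is cheap, very long synthetic records (the "many years of data" above) can be generated for the hazard module of a catastrophe model.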


Author(s):  
Yang Zhang ◽  
Lee D. Han ◽  
Hyun Kim

Incident hotspots are used as a direct indicator of the need for road maintenance and infrastructure upgrades, and as an important reference for investment location decisions. Previous incident hotspot identification methods are all region-based, ignoring the underlying road network constraints. We first demonstrate how region-based hotspot detection may be inaccurate. We then present Dijkstra's-DBSCAN, a new network-based density clustering algorithm specifically for traffic incidents, which combines a modified Dijkstra's shortest path algorithm with DBSCAN (density-based spatial clustering of applications with noise). The modified Dijkstra's algorithm, instead of returning the shortest path from a source to a target as the original algorithm does, returns the set of nodes (incidents) that are within a requested distance when traveling from the source. By retrieving the directly reachable neighbors using this modified Dijkstra's algorithm, DBSCAN gains awareness of network connections and measures distance more realistically, avoiding clustering incidents that are close but not connected. The new approach extracts hazardous lanes instead of regions, and so is a much more precise approach for incident management purposes; it reduces the [Formula: see text] computational cost to [Formula: see text] and can process the entire U.S. network in seconds; it has routing flexibility and can extract clusters of any shape and connectivity; and it is parallelizable and can utilize distributed computing resources. Our experiments verified the new methodology's capability of supporting safety management on a complicated surface-street configuration. It also works for customized lane configurations, such as freeways, freeway junctions, interchanges, roundabouts, and other complex combinations.
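The core idea can be sketched compactly in Python (an illustration of the technique, not the paper's implementation; the graph layout, neighbors_within, and network_dbscan are hypothetical): a truncated Dijkstra search supplies the neighborhood query, so DBSCAN only clusters incidents that are connected within the requested network distance.

```python
# Sketch: truncated Dijkstra returns every node reachable within eps network
# distance from a source; DBSCAN uses that set as its neighborhood query so
# that nearby-but-unconnected incidents are never clustered together.
import heapq

def neighbors_within(graph, source, eps):
    """graph: {node: [(neighbor, edge_length), ...]}; returns nodes whose
    shortest-path distance from source is <= eps."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")) or d > eps:
            continue
        for v, w in graph.get(u, []):
            nd = d + w
            if nd <= eps and nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return {n for n, d in dist.items() if d <= eps}

def network_dbscan(graph, eps, min_pts):
    labels, cluster = {}, 0
    for node in graph:
        if node in labels:
            continue
        nbrs = neighbors_within(graph, node, eps)
        if len(nbrs) < min_pts:
            labels[node] = -1                  # noise (may be relabelled later)
            continue
        cluster += 1
        stack = list(nbrs)
        while stack:                           # expand the cluster
            q = stack.pop()
            if labels.get(q, -1) != -1:
                continue
            labels[q] = cluster
            q_nbrs = neighbors_within(graph, q, eps)
            if len(q_nbrs) >= min_pts:         # q is a core point
                stack.extend(n for n in q_nbrs if labels.get(n, -1) == -1)
        labels[node] = cluster
    return labels

# Toy road network: incidents 0-2 are chained along a road, 3 is isolated
g = {0: [(1, 50)], 1: [(0, 50), (2, 60)], 2: [(1, 60)], 3: []}
print(network_dbscan(g, eps=120, min_pts=2))   # 0,1,2 cluster; 3 is noise
```

In a real deployment the neighborhood queries would be computed on an indexed road network rather than a plain dictionary, which is where the reported speedups come from.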


PLoS ONE ◽  
2021 ◽  
Vol 16 (2) ◽  
pp. e0247338
Author(s):  
Davood Roshan ◽  
John Ferguson ◽  
Charles R. Pedlar ◽  
Andrew Simpkin ◽  
William Wyns ◽  
...  

In a clinical setting, biomarkers are typically measured and evaluated as biological indicators of a physiological state. Population-based reference ranges, known as ‘static’ or ‘normal’ reference ranges, are often used as a tool to classify a biomarker value for an individual as typical or atypical. However, these ranges may not be informative for a particular individual when considering changes in a biomarker over time, since each observation is assessed in isolation and against the same reference limits. To allow early detection of unusual physiological changes, static reference ranges must be adapted to incorporate the within-individual variability of biomarkers arising from longitudinal monitoring, in addition to between-individual variability. To overcome this issue, methods for generating individualised reference ranges are proposed within a Bayesian framework which adapts successively whenever a new measurement is recorded for the individual. This Bayesian approach also allows the within-individual variability to differ for each individual, in contrast to other, less flexible approaches. However, the Bayesian approach usually comes with a high computational cost, especially for individuals with a large number of observations, which diminishes its applicability. This difficulty suggests that a computational approximation may be required. Thus, methods for generating individualised adaptive ranges using a time-efficient approximate Expectation-Maximisation (EM) algorithm are presented which rely on only a few sufficient statistics at the individual level.
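A simplified conjugate sketch in Python conveys why only a few sufficient statistics are needed per individual (this is an illustrative toy model, not the authors' hierarchical formulation; AdaptiveRange and its parameters are hypothetical, and the within-individual variance is held fixed rather than individual-specific):

```python
# Toy adaptive reference range: observations ~ Normal(theta_i, sigma2) with a
# population prior theta_i ~ Normal(mu0, tau2). Only the running statistics
# (n, sum, sum of squares) are stored; the range is updated after each new
# measurement. In the full approach, the sum of squares would also feed a
# per-individual variance update.
import math

class AdaptiveRange:
    def __init__(self, mu0, tau2, sigma2):
        self.mu0, self.tau2, self.sigma2 = mu0, tau2, sigma2
        self.n, self.s, self.ss = 0, 0.0, 0.0      # sufficient statistics

    def update(self, y):
        self.n += 1
        self.s += y
        self.ss += y * y

    def reference_range(self, z=1.96):
        """Approximate 95% predictive interval for the next observation."""
        prec = 1.0 / self.tau2 + self.n / self.sigma2
        post_mean = (self.mu0 / self.tau2 + self.s / self.sigma2) / prec
        pred_sd = math.sqrt(1.0 / prec + self.sigma2)
        return post_mean - z * pred_sd, post_mean + z * pred_sd

# Example: population prior centred at 14.0; three longitudinal measurements
r = AdaptiveRange(mu0=14.0, tau2=1.0, sigma2=0.25)
for y in (15.1, 15.3, 15.0):
    r.update(y)
    print([round(x, 2) for x in r.reference_range()])
```

As measurements accumulate, the interval shrinks and recentres on the individual's own level, which is the behaviour the adaptive ranges above are designed to deliver without refitting a full Bayesian model at each visit.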


2021 ◽  
Vol 17 (3) ◽  
pp. e1008633
Author(s):  
Simone Sturniolo ◽  
William Waites ◽  
Tim Colbourn ◽  
David Manheim ◽  
Jasmina Panovska-Griffiths

Existing compartmental mathematical modelling methods for epidemics, such as SEIR models, cannot accurately represent the effects of contact tracing. This makes them inappropriate for evaluating testing and contact tracing strategies to contain an outbreak. An alternative used in practice is the application of agent- or individual-based models (ABMs). However, ABMs are complex, less well understood, and much more computationally expensive. This paper presents a new method for accurately including the effects of Testing, contact-Tracing and Isolation (TTI) strategies in standard compartmental models. We derive our method using a careful probabilistic argument to show how contact tracing at the individual level is reflected in aggregate at the population level. We show that the resultant SEIR-TTI model accurately approximates the behaviour of a mechanistic agent-based model at far less computational cost. The computational efficiency is such that it can be easily and cheaply used for exploratory modelling to quantify the required levels of testing and tracing, alone and with other interventions, to assist adaptive planning for managing disease outbreaks.
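For context, the compartmental baseline that the TTI extension builds on can be written in a few lines of Python. This is the plain SEIR model with illustrative parameters only; the additional tracing compartments and rates derived in the paper are not reproduced here.

```python
# Baseline SEIR sketch (illustrative parameters; the SEIR-TTI extension adds
# traced/isolated compartments and tracing terms not shown here).
from scipy.integrate import solve_ivp

def seir(t, y, beta, sigma, gamma):
    s, e, i, r = y
    n = s + e + i + r
    return [-beta * s * i / n,                 # new exposures
            beta * s * i / n - sigma * e,      # progression to infectious
            sigma * e - gamma * i,             # recovery outflow
            gamma * i]

y0 = [0.999, 0.0, 0.001, 0.0]                  # fractions of the population
sol = solve_ivp(seir, (0, 180), y0, args=(0.3, 1/5, 1/7))
print("peak infectious fraction:", sol.y[2].max().round(3))
```

The appeal of the SEIR-TTI approach is that extending this handful of ODEs remains orders of magnitude cheaper to run than an agent-based model while still capturing the aggregate effect of individual-level tracing.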


2019 ◽  
Vol 116 (41) ◽  
pp. 20438-20445
Author(s):  
Ji Yu ◽  
Ahmed Elmokadem

We present a statistical framework to model the spatial distribution of molecules based on a single-molecule localization microscopy (SMLM) dataset. The latter consists of a collection of spatial coordinates and their associated uncertainties. We describe iterative parameter-estimation algorithms based on this framework, as well as a sampling algorithm to numerically evaluate the complete posterior distribution. We demonstrate that the inverse computation can be viewed as a type of image restoration process similar to the classical image deconvolution methods, except that it is performed on SMLM images. We further discuss an application of our statistical framework in the task of particle fusion using SMLM data. We show that the fusion algorithm based on our model outperforms the current state-of-the-art in terms of both accuracy and computational cost.
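As a point of reference for the deconvolution analogy, the sketch below renders an SMLM dataset as a sum of Gaussians whose widths equal the reported localization uncertainties. This is a standard visualization step, not the authors' inverse algorithm; render_smlm, the pixel size, and the simulated cluster are all hypothetical.

```python
# Render localizations (coordinates + uncertainties) into a blurred density
# image; an inverse/restoration method would operate on data like this.
import numpy as np

def render_smlm(coords, sigmas, extent, px=10.0):
    """coords: (N, 2) localizations in nm; sigmas: (N,) uncertainties in nm;
    extent: (width_nm, height_nm); returns a pixelated density image."""
    w, h = int(extent[0] / px), int(extent[1] / px)
    xs = (np.arange(w) + 0.5) * px
    ys = (np.arange(h) + 0.5) * px
    img = np.zeros((h, w))
    for (x, y), s in zip(coords, sigmas):
        gx = np.exp(-0.5 * ((xs - x) / s) ** 2)
        gy = np.exp(-0.5 * ((ys - y) / s) ** 2)
        img += np.outer(gy, gx) / (2 * np.pi * s * s)   # normalized kernel
    return img

rng = np.random.default_rng(0)
pts = rng.normal(500.0, 30.0, size=(200, 2))            # one ~30 nm cluster
unc = rng.uniform(10.0, 25.0, size=200)
print(render_smlm(pts, unc, extent=(1000.0, 1000.0)).shape)
```

In the framework described above, the estimation runs in the opposite direction: from the uncertain localizations back to the underlying molecular distribution, analogous to image deconvolution.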


Author(s):  
Haitham Eissa ◽  
Rami Fathy Salem ◽  
Ahmed Arafa ◽  
Sherif Hany ◽  
Abdelrahman El-Mously ◽  
...  

2015 ◽  
Author(s):  
Brendan Bulik-Sullivan

Mixed models are an effective statistical method for increasing power and avoiding confounding in genetic association studies. Existing mixed model methods have been designed for "pooled" studies where all individual-level genotype and phenotype data are simultaneously visible to a single analyst. Many studies follow a "meta-analysis" design, wherein a large number of independent cohorts share only summary statistics with a central meta-analysis group, and no one person can view individual-level data for more than a small fraction of the total sample. When using linear regression for GWAS, there is no difference in power between pooled studies and meta-analyses [lin2010meta]; however, we show that when using mixed models, standard meta-analysis is much less powerful than mixed model association on a pooled study of equal size. We describe a method that allows meta-analyses to capture almost all of the power available to mixed model association on a pooled study without sharing individual-level genotype data. The added computational cost and analytical complexity of this method are minimal, but the increase in power can be large: based on the predictive performance of polygenic scoring reported in [wood2014defining] and [locke2015genetic], we estimate that the next height and BMI studies could see increases in effective sample size of approximately 15% and 8%, respectively. Last, we describe how a related technique can be used to increase power in sequencing, targeted sequencing, and exome array studies. Note that these techniques are presently only applicable to randomly ascertained studies and will sometimes result in a loss of power in ascertained case/control studies. We are developing similar methods for case/control studies, but this is more complicated.
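For readers unfamiliar with the summary-statistics workflow being compared against, the standard fixed-effect baseline looks like the following Python sketch (this is the conventional inverse-variance-weighted meta-analysis, not the mixed-model method proposed in the abstract; ivw_meta and the example numbers are hypothetical):

```python
# Fixed-effect inverse-variance-weighted meta-analysis of per-cohort
# effect estimates for a single variant.
import math

def ivw_meta(betas, ses):
    """betas, ses: per-cohort effect sizes and standard errors for one SNP."""
    weights = [1.0 / se ** 2 for se in ses]
    beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return beta, se, beta / se          # combined effect, SE, z-statistic

# Example: three cohorts reporting effects in the same direction
print(ivw_meta([0.021, 0.015, 0.030], [0.010, 0.012, 0.015]))
```

The abstract's point is that when each cohort's estimates come from a mixed model rather than linear regression, simply combining them this way leaves power on the table relative to a pooled mixed-model analysis, and the proposed method recovers most of that gap without sharing genotypes.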


Author(s):  
Simone Sturniolo ◽  
William Waites ◽  
Tim Colbourn ◽  
David Manheim ◽  
Jasmina Panovska-Griffiths

Existing compartmental mathematical modelling methods for epidemics, such as SEIR models, cannot accurately represent the effects of contact tracing. This makes them inappropriate for evaluating testing and contact tracing strategies to contain an outbreak. An alternative used in practice is the application of agent- or individual-based models (ABMs). However, ABMs are complex, less well understood, and much more computationally expensive. This paper presents a new method for accurately including the effects of Testing, contact-Tracing and Isolation (TTI) strategies in standard compartmental models. We derive our method using a careful probabilistic argument to show how contact tracing at the individual level is reflected in aggregate at the population level. We show that the resultant SEIR-TTI model accurately approximates the behaviour of a mechanistic agent-based model at far less computational cost. The computational efficiency is such that it can be easily and cheaply used for exploratory modelling to quantify the required levels of testing and tracing, alone and with other interventions, to assist adaptive planning for managing disease outbreaks.

Author Summary: The importance of modelling to inform and support decision making is widely acknowledged. Understanding how to enhance contact tracing as part of the Testing-Tracing-Isolation (TTI) strategy for mitigation of COVID-19 is a key public policy question. Our work develops the SEIR-TTI model as an extension of the classic Susceptible, Exposed, Infected and Recovered (SEIR) model to include tracing of contacts of people exposed to and infectious with COVID-19. We use a probabilistic argument to derive contact tracing rates within a compartmental model as aggregates of contact tracing at an individual level. Our adaptation is applicable across compartmental models for infectious disease spread. We show that our novel SEIR-TTI model can accurately approximate the behaviour of mechanistic agent-based models at far less computational cost. The SEIR-TTI model represents an important addition to the theoretical methodology of modelling infectious disease spread, and we anticipate that it will be immediately applicable to the management of the COVID-19 pandemic.

