A strategy to incorporate prior knowledge into correlation network cutoff selection

2019 ◽  
Author(s):  
Elisa Benedetti ◽  
Maja Pučić-Baković ◽  
Toma Keser ◽  
Nathalie Gerstner ◽  
Mustafa Büyüközkan ◽  
...  

Abstract Correlation networks are commonly used to statistically extract biological interactions between omics markers. Network edge selection is typically based on the significance of the underlying correlation coefficients. A statistical cutoff, however, is not guaranteed to capture biological reality, and heavily depends on dataset properties such as sample size. We here propose an alternative approach to the problem of network reconstruction. Specifically, we developed a cutoff selection algorithm that maximizes the agreement with a given ground truth. We first evaluate the approach on IgG glycomics data, for which the biochemical pathway is known and well-characterized. The optimal network outperforms networks obtained with statistical cutoffs and is robust with respect to sample size. Importantly, we can show that even in the case of incomplete or incorrect prior knowledge, the optimal network is close to the true optimum. We then demonstrate the generalizability of the approach on an untargeted metabolomics and a transcriptomics dataset from The Cancer Genome Atlas (TCGA). For the transcriptomics case, we demonstrate that the optimized network is superior to statistical networks in systematically retrieving interactions that were not included in the biological reference used for the optimization. Overall, this paper shows that using prior information for correlation network inference is superior to using regular statistical cutoffs, even if the prior information is incomplete or partially inaccurate.

2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Elisa Benedetti ◽  
Maja Pučić-Baković ◽  
Toma Keser ◽  
Nathalie Gerstner ◽  
Mustafa Büyüközkan ◽  
...  

Abstract Correlation networks are frequently used to statistically extract biological interactions between omics markers. Network edge selection is typically based on the statistical significance of the correlation coefficients. This procedure, however, is not guaranteed to capture biological mechanisms. We here propose an alternative approach for network reconstruction: a cutoff selection algorithm that maximizes the overlap of the inferred network with available prior knowledge. We first evaluate the approach on IgG glycomics data, for which the biochemical pathway is known and well-characterized. Importantly, even in the case of incomplete or incorrect prior knowledge, the optimal network is close to the true optimum. We then demonstrate the generalizability of the approach with applications to untargeted metabolomics and transcriptomics data. For the transcriptomics case, we demonstrate that the optimized network is superior to statistical networks in systematically retrieving interactions that were not included in the biological reference used for optimization.
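The cutoff optimization described in the two abstracts above can be sketched as a scan over candidate cutoffs, each scored against a prior-knowledge adjacency matrix. The `optimal_cutoff` helper and the F1 objective below are illustrative assumptions, not the authors' published implementation:

```python
import numpy as np

def optimal_cutoff(corr, prior_adj):
    """Scan all observed |correlation| values as candidate cutoffs and
    return the one whose induced network best matches a prior-knowledge
    adjacency matrix, scored by F1. A minimal sketch of the idea."""
    iu = np.triu_indices(corr.shape[0], k=1)   # unique variable pairs
    strength = np.abs(corr[iu])
    truth = prior_adj[iu].astype(bool)
    best_cut, best_f1 = 1.0, -1.0
    for c in np.unique(strength):
        pred = strength >= c                   # edges kept at this cutoff
        tp = np.sum(pred & truth)
        fp = np.sum(pred & ~truth)
        fn = np.sum(~pred & truth)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_cut, best_f1 = c, f1
    return best_cut, best_f1

# toy data: 4 variables where the prior says only 0-1 and 2-3 interact
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 4))
x[:, 1] += x[:, 0]                             # induce correlation 0-1
x[:, 3] += x[:, 2]                             # induce correlation 2-3
corr = np.corrcoef(x, rowvar=False)
prior = np.zeros((4, 4), int)
prior[[0, 2], [1, 3]] = prior[[1, 3], [0, 2]] = 1
cutoff, f1 = optimal_cutoff(corr, prior)
```

With clean toy data the scan recovers a cutoff separating the two induced correlations from the noise pairs.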


1986 ◽  
Vol 16 (5) ◽  
pp. 1116-1118 ◽  
Author(s):  
Edwin J. Green ◽  
William E. Strawderman

A method is presented for determining the appropriate sample size needed to produce an estimate with a stated allowable percent error when the sample data are to be combined with prior information. Applying the method to the case where the objective is to estimate volume per acre, with prior knowledge represented by a yield equation, demonstrates that it can require less sample information than would be needed if the yield equation were ignored.


2021 ◽  
Vol 12 ◽  
Author(s):  
Yoonjee Kang ◽  
Denis Thieffry ◽  
Laura Cantini

Networks are powerful tools to represent and investigate biological systems. The development of algorithms inferring regulatory interactions from functional genomics data has been an active area of research. With the advent of single-cell RNA-seq data (scRNA-seq), numerous methods specifically designed to take advantage of single-cell datasets have been proposed. However, published benchmarks on single-cell network inference are mostly based on simulated data. Once applied to real data, these benchmarks take into account only a small set of genes and only compare the inferred networks with an imposed ground-truth. Here, we benchmark six single-cell network inference methods based on their reproducibility, i.e., their ability to infer similar networks when applied to two independent datasets for the same biological condition. We tested each of these methods on real data from three biological conditions: human retina, T-cells in colorectal cancer, and human hematopoiesis. When networks with up to 100,000 links are taken into account, GENIE3 proves to be the most reproducible algorithm and, together with GRNBoost2, shows the highest intersection with ground-truth biological interactions. These results are independent of the single-cell sequencing platform, the cell type annotation system and the number of cells constituting the dataset. Finally, GRNBoost2 and CLR show more reproducible performance once a more stringent threshold is applied to the networks (1,000–100 links). To ensure reproducibility and ease extension of this benchmark study, we implemented all the analyses in scNET, a Jupyter notebook available at https://github.com/ComputationalSystemsBiology/scNET.
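The reproducibility criterion used in this benchmark can be illustrated as the overlap of top-k edge sets inferred from two independent datasets. The Jaccard-style `edge_jaccard` score below is an assumed stand-in for the paper's exact metric:

```python
import numpy as np

def topk_edges(weights, k):
    """The k strongest edges of a weighted adjacency matrix,
    as a set of (i, j) pairs with i < j."""
    iu = np.triu_indices(weights.shape[0], k=1)
    order = np.argsort(weights[iu])[::-1][:k]
    return {(int(iu[0][m]), int(iu[1][m])) for m in order}

def edge_jaccard(w1, w2, k):
    """Reproducibility of two inferred networks: Jaccard overlap
    of their top-k edge sets."""
    e1, e2 = topk_edges(w1, k), topk_edges(w2, k)
    return len(e1 & e2) / len(e1 | e2)

# two "inferred" networks: one the noisy counterpart of the other
rng = np.random.default_rng(1)
w = rng.random((20, 20))
w = (w + w.T) / 2                           # symmetric edge weights
noisy = w + 0.05 * rng.random((20, 20))
score = edge_jaccard(w, (noisy + noisy.T) / 2, k=50)
```

The score is 1 for identical networks and falls toward 0 as the top-k edge lists diverge, matching the thresholding idea (100,000 vs 1,000–100 links) discussed in the abstract.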


2021 ◽  
pp. 174077452110208
Author(s):  
Elizabeth Korevaar ◽  
Jessica Kasza ◽  
Monica Taljaard ◽  
Karla Hemming ◽  
Terry Haines ◽  
...  

Background: Sample size calculations for longitudinal cluster randomised trials, such as crossover and stepped-wedge trials, require estimates of the assumed correlation structure. This includes both within-period intra-cluster correlations, which importantly differ from conventional intra-cluster correlations by their dependence on period, and also cluster autocorrelation coefficients to model correlation decay. There are limited resources to inform these estimates. In this article, we provide a repository of correlation estimates from a bank of real-world clustered datasets. These are provided under several assumed correlation structures, namely exchangeable, block-exchangeable and discrete-time decay correlation structures. Methods: Longitudinal studies with clustered outcomes were collected to form the CLustered OUtcome Dataset bank. Forty-four available continuous outcomes from 29 datasets were obtained and analysed using each correlation structure. Patterns of within-period intra-cluster correlation coefficient and cluster autocorrelation coefficients were explored by study characteristics. Results: The median within-period intra-cluster correlation coefficient for the discrete-time decay model was 0.05 (interquartile range: 0.02–0.09) with a median cluster autocorrelation of 0.73 (interquartile range: 0.19–0.91). The within-period intra-cluster correlation coefficients were similar for the exchangeable, block-exchangeable and discrete-time decay correlation structures. Within-period intra-cluster correlation coefficients and cluster autocorrelations were found to vary with the number of participants per cluster-period, the period-length, type of cluster (primary care, secondary care, community or school) and country income status (high-income country or low- and middle-income country). 
The within-period intra-cluster correlation coefficients tended to decrease with increasing period-length and to decrease slightly with increasing cluster-period sizes, while the cluster autocorrelations tended to move closer to 1 with increasing cluster-period size. Using the CLustered OUtcome Dataset bank, an R Shiny app has been developed for determining plausible values of correlation coefficients for use in sample size calculations. Discussion: This study provides a repository of intra-cluster correlations and cluster autocorrelations for longitudinal cluster trials, which can help inform sample size calculations for future longitudinal cluster randomised trials.
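The discrete-time decay structure described above can be written down directly. This sketch builds the within-cluster correlation matrix for a cross-sectional design, using the study's median estimates (within-period ICC 0.05, cluster autocorrelation 0.73) as inputs; the function name and layout are illustrative:

```python
import numpy as np

def discrete_time_decay_corr(rho, cac, periods, m):
    """Within-cluster correlation matrix under the discrete-time decay
    structure: two distinct subjects measured in periods t and s are
    correlated rho * cac**|t - s| (so rho within the same period), and
    each observation has unit correlation with itself.
    `m` is the number of subjects per cluster-period."""
    period = np.repeat(np.arange(periods), m)       # period index per row
    R = rho * cac ** np.abs(period[:, None] - period[None, :])
    np.fill_diagonal(R, 1.0)
    return R

# the study's median estimates: ICC 0.05, cluster autocorrelation 0.73
R = discrete_time_decay_corr(rho=0.05, cac=0.73, periods=3, m=2)
```

Setting `cac=1` recovers the exchangeable structure (all off-diagonal entries equal to `rho`), which is why the abstract reports similar within-period ICCs across the three structures.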


2015 ◽  
Vol 116 (9/10) ◽  
pp. 564-577 ◽  
Author(s):  
RISHABH SHRIVASTAVA ◽  
Preeti Mahajan

Purpose – The purpose of this paper is twofold. First, the study aims to investigate the relationship between the altmetric indicators from ResearchGate (RG) and the bibliometric indicators from the Scopus database. Second, the study seeks to examine the relationship amongst the RG altmetric indicators themselves. RG is a rich source of altmetric indicators such as Citations, RGScore, Impact Points, Profile Views, Publication Views, etc. Design/methodology/approach – To establish whether RG metrics showed the same results as established sources of metrics, Pearson’s correlation coefficients were calculated between the metrics provided by RG and the metrics obtained from Scopus. Pearson’s correlation coefficients were also calculated amongst the metrics provided by RG. The data were collected by visiting the profile pages of all the members who had an account in RG under the Department of Physics, Panjab University, Chandigarh (India). Findings – The study showed that most of the RG metrics had a strong positive correlation with the Scopus metrics, except for RGScore (RG) and Citations (Scopus), which showed a moderate positive correlation. The RG metrics also showed moderate to strong positive correlations amongst each other. Research limitations/implications – A limitation of this study is that more scientists and researchers may join RG in the future, so the data may change. The study focuses on the members who had an account in RG under the Department of Physics, Panjab University, Chandigarh (India). Further studies could be conducted by increasing the sample size or by drawing samples with different characteristics. Originality/value – Because altmetrics is an emerging field, little research has been conducted in the area. Very few studies have examined the reach of academic social networks like RG and their validity as sources of altmetric indicators such as RGScore, Impact Points, etc. 
The findings offer insight into the question of whether RG can be used as an alternative to traditional sources of bibliometric indicators, especially with reference to a rapidly developing country such as India.
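The core computation in this study is a Pearson correlation between pairs of metric vectors. A minimal sketch with hypothetical per-researcher counts (not the paper's data):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's correlation coefficient between two metric vectors,
    e.g. RG Citations vs. Scopus Citations per researcher."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xd, yd = x - x.mean(), y - y.mean()
    return (xd @ yd) / np.sqrt((xd @ xd) * (yd @ yd))

# hypothetical per-researcher citation counts, NOT the paper's data
rg_citations = [120, 45, 300, 10, 80]       # from ResearchGate profiles
scopus_citations = [110, 50, 280, 15, 90]   # from Scopus
r = pearson_r(rg_citations, scopus_citations)
```

Values of r near 1 correspond to the "strong positive correlation" the study reports between most RG and Scopus metrics.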


2018 ◽  
Vol 28 (6) ◽  
pp. 1664-1675 ◽  
Author(s):  
TB Brakenhoff ◽  
KCB Roes ◽  
S Nikolakopoulos

The sample size of a randomized controlled trial is typically chosen in order for frequentist operational characteristics to be retained. For normally distributed outcomes, an assumption for the variance needs to be made which is usually based on limited prior information. Especially in the case of small populations, the prior information might consist of only one small pilot study. A Bayesian approach formalizes the aggregation of prior information on the variance with newly collected data. The uncertainty surrounding prior estimates can be appropriately modelled by means of prior distributions. Furthermore, within the Bayesian paradigm, quantities such as the probability of a conclusive trial are directly calculated. However, if the postulated prior is not in accordance with the true variance, such calculations are not trustworthy. In this work we adapt previously suggested methodology to facilitate sample size re-estimation. In addition, we suggest the employment of power priors in order for operational characteristics to be controlled.
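For a normally distributed outcome, the power-prior update of the variance has a conjugate inverse-gamma form: raising the pilot likelihood to a power a0 simply downweights the pilot's contribution. A minimal sketch of this general mechanism (the function and default hyperparameters are assumptions, not the authors' exact model):

```python
def power_prior_variance(ss_pilot, n_pilot, a0, a=0.001, b=0.001):
    """Conjugate update for a normal variance: an inverse-gamma(a, b)
    baseline prior combined with a pilot study's likelihood raised to
    the power a0 in [0, 1] (a0 = 0 discards the pilot, a0 = 1 pools it
    fully). `ss_pilot` is the pilot sum of squared deviations.
    Returns the updated (a, b) and the posterior mean variance."""
    a_post = a + a0 * n_pilot / 2
    b_post = b + a0 * ss_pilot / 2
    # inverse-gamma mean b/(a-1), defined only for a > 1
    mean = b_post / (a_post - 1) if a_post > 1 else float("inf")
    return a_post, b_post, mean

# pilot: 20 observations with sum of squares 80 (variance estimate 4.0)
a_post, b_post, var_mean = power_prior_variance(80, 20, a0=0.5)
```

The posterior mean variance would then feed the sample size re-estimation, with smaller a0 expressing more doubt about the pilot.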


2021 ◽  
Author(s):  
Kaixian Yu ◽  
Zihan Cui ◽  
Xin Sui ◽  
Xing Qiu ◽  
Jinfeng Zhang

Abstract Bayesian networks (BNs) provide a probabilistic, graphical framework for modeling high-dimensional joint distributions with complex correlation structures. BNs have wide applications in many disciplines, including biology, social science, finance and biomedical science. Despite extensive study in the past, learning network structure from data remains a challenging open question in BN research. In this study, we present a sequential Monte Carlo (SMC)-based three-stage approach, GRowth-based Approach with Staged Pruning (GRASP). A double filtering strategy was first used for discovering the overall skeleton of the target BN. To search for optimal network structures, we designed an adaptive SMC (adSMC) algorithm to increase the quality and diversity of sampled networks, which were further improved by a third stage that reclaims edges missed in the skeleton discovery step. GRASP gave very satisfactory results when tested on benchmark networks. Finally, BN structure learning using multiple types of genomics data illustrates GRASP’s potential for discovering novel biological relationships in integrative genomic studies.
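As a loose illustration of a two-stage skeleton filter (one possible reading of "double filtering"; the paper's actual criteria may differ), candidate edges can be screened first by marginal correlation and then by first-order partial correlation:

```python
import math
import numpy as np

def skeleton_filter(data, t_marg=0.2, t_part=0.1):
    """Keep a candidate edge only if its marginal correlation is strong
    AND no single conditioning variable explains it away (first-order
    partial correlation). Illustrative sketch only."""
    R = np.corrcoef(data, rowvar=False)
    p = R.shape[0]
    edges = set()
    for i in range(p):
        for j in range(i + 1, p):
            if abs(R[i, j]) < t_marg:          # filter 1: marginal
                continue
            explained = False
            for z in range(p):                 # filter 2: conditional
                if z in (i, j):
                    continue
                denom = math.sqrt((1 - R[i, z] ** 2) * (1 - R[j, z] ** 2))
                pc = (R[i, j] - R[i, z] * R[j, z]) / denom
                if abs(pc) < t_part:
                    explained = True
                    break
            if not explained:
                edges.add((i, j))
    return edges

# chain x -> y -> z: the spurious direct x-z edge should be filtered out
rng = np.random.default_rng(2)
x = rng.normal(size=2000)
y = x + rng.normal(size=2000)
z = y + rng.normal(size=2000)
edges = skeleton_filter(np.column_stack([x, y, z]))
```

On the chain example the marginal x–z correlation is substantial, but conditioning on y reduces it to near zero, so only the two true edges survive.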


2020 ◽  
Author(s):  
Yoonjee Kang ◽  
Denis Thieffry ◽  
Laura Cantini

Abstract Networks are powerful tools to represent and investigate biological systems. The development of algorithms inferring regulatory interactions from functional genomics data has been an active area of research. With the advent of single-cell RNA-seq data (scRNA-seq), numerous methods specifically designed to take advantage of single-cell datasets have been proposed. However, published benchmarks on single-cell network inference are mostly based on simulated data. Once applied to real data, these benchmarks take into account only a small set of genes and only compare the inferred networks with an imposed ground-truth. Here, we benchmark four single-cell network inference methods based on their reproducibility, i.e., their ability to infer similar networks when applied to two independent datasets for the same biological condition. We tested each of these methods on real data from three biological conditions: human retina, T-cells in colorectal cancer, and human hematopoiesis. GENIE3 proves to be the most reproducible algorithm, independently of the single-cell sequencing platform, the cell type annotation system, the number of cells constituting the dataset, or the thresholding applied to the links of the inferred networks. To ensure reproducibility and ease extension of this benchmark study, we implemented all the analyses in scNET, a Jupyter notebook available at https://github.com/ComputationalSystemsBiology/scNET.


Author(s):  
Meiping Yun ◽  
Wenwen Qin

Despite the wide application of floating car data (FCD) in urban link travel time estimation, limited effort has been made to determine the minimum sample size of floating cars needed for travel time distribution (TTD) estimation. This study develops a framework for finding the minimum number of travel time observations generated from FCD that is required for urban link TTD estimation. The basic idea is to test how, as the number of observations decreases, the similarity between the distribution estimated from the observations and the ground-truth distribution varies. Similarity is measured using the Hellinger Distance (HD) and the Kolmogorov-Smirnov (KS) test. The minimum sample size is then determined by the HD value, ensuring that the corresponding distribution passes the KS test. The proposed method is validated with FCD and Radio Frequency Identification (RFID) data collected on an urban arterial in Nanjing, China. The results indicate that: (1) the average travel times derived from FCD give good estimation accuracy for real-time applications; (2) the minimum required sample size changes with the extent of time-varying fluctuations in traffic flows; (3) the minimum sample size determination is sensitive to whether observations are aggregated near each peak in the multistate distribution; (4) the sparse and incomplete observations from FCD in most time periods cannot achieve the minimum sample size and would produce significant deviations from the ground-truth distributions. Finally, FCD is strongly recommended for better TTD estimation when incorporating both historical trends and real-time observations.
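The two similarity measures named above are standard. A sketch assuming a histogram-based Hellinger distance and `scipy.stats.ks_2samp`, with an illustrative (not the paper's) subsample-shrinking search for the minimum sample size:

```python
import numpy as np
from scipy import stats

def hellinger(a, b, bins=30):
    """Hellinger distance between two samples' empirical distributions,
    computed on a shared histogram support (0 = identical, 1 = disjoint)."""
    edges = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), bins + 1)
    p = np.histogram(a, bins=edges)[0].astype(float)
    q = np.histogram(b, bins=edges)[0].astype(float)
    p /= p.sum()
    q /= q.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def min_sample_size(ground_truth, observations, alpha=0.05, step=10):
    """Shrink a random subsample of the observations until its
    distribution no longer passes a two-sample KS test against the
    ground truth; return the last passing size. An illustrative
    search scheme, not the paper's exact procedure."""
    rng = np.random.default_rng(0)
    n_min = len(observations)
    for n in range(len(observations), step - 1, -step):
        sub = rng.choice(observations, size=n, replace=False)
        if stats.ks_2samp(ground_truth, sub).pvalue > alpha:
            n_min = n
        else:
            break
    return n_min

# synthetic lognormal link travel times (seconds); shapes are assumptions
rng = np.random.default_rng(3)
truth = rng.lognormal(mean=4.0, sigma=0.3, size=2000)
fcd = rng.lognormal(mean=4.0, sigma=0.3, size=500)
n_req = min_sample_size(truth, fcd)
```

The HD value quantifies how far the subsampled distribution has drifted, while the KS test supplies the pass/fail criterion used to stop the search.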


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Hannah M. L. Young ◽  
Mark W. Orme ◽  
Yan Song ◽  
Maurice Dungey ◽  
James O. Burton ◽  
...  

Abstract Background Physical activity (PA) is exceptionally low amongst the haemodialysis (HD) population, and physical inactivity is a powerful predictor of mortality, making it a prime focus for intervention. Objective measurement of PA using accelerometers is increasing, but standard reporting guidelines, essential to effectively evaluate, compare and synthesise the effects of PA interventions, are lacking. This study aims (i) to determine the measurement and processing guidance required to ensure representative PA data amongst a diverse HD population, and (ii) to assess adherence to PA monitor wear amongst HD patients. Methods Clinically stable HD patients from the UK and China wore a SenseWear Armband accelerometer for 7 days. Step counts between days (HD, weekday, weekend) were compared using repeated measures ANCOVA. Intraclass correlation coefficients (ICCs) determined reliability (≥ 0.80 acceptable). The Spearman-Brown prophecy formula, in conjunction with an a priori ≥ 80% sample size retention criterion, identified the minimum number of days required for representative PA data. Results Seventy-seven patients (64% men, mean ± SD age 56 ± 14 years, median (interquartile range) time on HD 40 (19–72) months, 40% Chinese, 60% British) participated. Participants took fewer steps on HD days than on non-HD weekdays and weekend days (3402 [95% CI 2665–4140], 4914 [95% CI 3940–5887] and 4633 [95% CI 3558–5707] steps/day, respectively, p < 0.001). PA on HD days was less variable than on non-HD days (ICC 0.723–0.839 versus 0.559–0.611), with ≥ 1 HD day and ≥ 3 non-HD days required to provide representative data. Using these criteria, the most stringent wear-time retaining ≥ 80% of the sample was ≥ 7 h. Conclusions At group level, a wear-time of ≥ 7 h on ≥ 1 HD day and ≥ 3 non-HD days is required to provide reliable PA data whilst retaining an acceptable sample size. 
PA is low across both HD and non-HD days, and future research should focus on interventions designed to increase physical activity in both the intradialytic and interdialytic periods.
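The Spearman-Brown prophecy formula used here has a closed form, and inverting it gives the minimum number of monitoring days for a target reliability. Plugging in the upper ends of the reported ICC ranges roughly reproduces the ≥ 1 HD day and ≥ 3 non-HD day figures (a consistency check, not the authors' exact computation):

```python
import math

def spearman_brown(r_single, k):
    """Reliability of the average over k monitoring days, given the
    single-day reliability (ICC)."""
    return k * r_single / (1 + (k - 1) * r_single)

def days_needed(r_single, r_target=0.80):
    """Minimum whole number of days needed to reach the target
    reliability, by inverting the prophecy formula."""
    k = r_target * (1 - r_single) / (r_single * (1 - r_target))
    return math.ceil(k)

hd_days = days_needed(0.839)      # HD-day ICC upper bound -> 1 day
non_hd_days = days_needed(0.611)  # non-HD ICC upper bound -> 3 days
```

Lower single-day ICCs (the more variable non-HD days) demand more days of wear, which is exactly the pattern the study reports.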

