Evaluating the reproducibility of single-cell gene regulatory network inference algorithms

Abstract: Networks are powerful tools to represent and investigate biological systems. The development of algorithms inferring regulatory interactions from functional genomics data has been an active area of research. With the advent of single-cell RNA-seq data (scRNA-seq), numerous methods specifically designed to take advantage of single-cell datasets have been proposed. However, published benchmarks on single-cell network inference are mostly based on simulated data. When applied to real data, these benchmarks take into account only a small set of genes and only compare the inferred networks with an imposed ground truth. Here, we benchmark four single-cell network inference methods based on their reproducibility, i.e., their ability to infer similar networks when applied to two independent datasets for the same biological condition. We tested each of these methods on real data from three biological conditions: human retina, T-cells in colorectal cancer, and human hematopoiesis. GENIE3 proves to be the most reproducible algorithm, independently of the single-cell sequencing platform, the cell type annotation system, the number of cells constituting the dataset, or the thresholding applied to the links of the inferred networks. To ensure reproducibility and ease extension of this benchmark study, we implemented all the analyses in scNET, a Jupyter notebook available at https://github.com/ComputationalSystemsBiology/scNET.


Evaluating the Reproducibility of Single-Cell Gene Regulatory Network Inference Algorithms

Frontiers in Genetics ◽

10.3389/fgene.2021.617282 ◽

2021 ◽

Vol 12 ◽

Author(s):

Yoonjee Kang ◽

Denis Thieffry ◽

Laura Cantini

Keyword(s):

Single Cell ◽

Network Inference ◽

Simulated Data ◽

Ground Truth ◽

Real Data ◽

Biological Interactions ◽

Gene Regulatory Network Inference ◽

Sequencing Platform ◽

Cell Network ◽

Inference Algorithms

Networks are powerful tools to represent and investigate biological systems. The development of algorithms inferring regulatory interactions from functional genomics data has been an active area of research. With the advent of single-cell RNA-seq data (scRNA-seq), numerous methods specifically designed to take advantage of single-cell datasets have been proposed. However, published benchmarks on single-cell network inference are mostly based on simulated data. When applied to real data, these benchmarks take into account only a small set of genes and only compare the inferred networks with an imposed ground truth. Here, we benchmark six single-cell network inference methods based on their reproducibility, i.e., their ability to infer similar networks when applied to two independent datasets for the same biological condition. We tested each of these methods on real data from three biological conditions: human retina, T-cells in colorectal cancer, and human hematopoiesis. When networks with up to 100,000 links are considered, GENIE3 proves to be the most reproducible algorithm and, together with GRNBoost2, shows a higher intersection with ground-truth biological interactions. These results are independent of the single-cell sequencing platform, the cell type annotation system, and the number of cells constituting the dataset. Finally, GRNBoost2 and CLR show more reproducible performance when a more stringent thresholding is applied to the networks (1,000–100 links). To ensure reproducibility and ease extension of this benchmark study, we implemented all the analyses in scNET, a Jupyter notebook available at https://github.com/ComputationalSystemsBiology/scNET.
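The notion of reproducibility used in this abstract (similar networks inferred from two independent datasets, compared at a given link threshold) can be illustrated with a minimal Python sketch. The function names, the toy scores, and the choice of the Jaccard index as a second overlap measure are assumptions made for illustration; the abstract does not specify the exact metrics implemented in scNET.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]  # (regulator, target); hypothetical edge representation


def top_k_edges(scores: Dict[Edge, float], k: int) -> Set[Edge]:
    """Keep the k highest-scoring links of an inferred network."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def reproducibility(net_a: Dict[Edge, float], net_b: Dict[Edge, float], k: int) -> Dict[str, float]:
    """Compare two networks inferred from independent datasets after thresholding to k links."""
    edges_a = top_k_edges(net_a, k)
    edges_b = top_k_edges(net_b, k)
    shared = edges_a & edges_b
    union = edges_a | edges_b
    smaller = min(len(edges_a), len(edges_b))
    return {
        # Fraction of the (smaller) thresholded network that is recovered in both datasets.
        "intersection_fraction": len(shared) / smaller if smaller else 0.0,
        # Jaccard index: shared links over all links seen in either network (assumed metric).
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Toy edge scores standing in for the output of one method on two datasets.
net_a = {("TF1", "G1"): 0.9, ("TF1", "G2"): 0.8, ("TF2", "G3"): 0.4, ("TF2", "G1"): 0.1}
net_b = {("TF1", "G1"): 0.7, ("TF2", "G3"): 0.6, ("TF1", "G2"): 0.2, ("TF3", "G4"): 0.5}
result = reproducibility(net_a, net_b, k=2)
```

Repeating the comparison over a range of k values (for example from 100,000 down to 100 links) would show how a ranking of methods can change with the stringency of the threshold, as the abstract reports.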


Gaining confidence in inferred networks

10.1101/2020.09.19.304980 ◽

2020 ◽

Author(s):

Léo P.M. Diaz ◽

Michael P.H. Stumpf

Keyword(s):

Biological Networks ◽

Regulatory Networks ◽

Network Inference ◽

False Negative ◽

Simulated Data ◽

Real Data ◽

Point Interactions ◽

Inference Algorithms ◽

Starting Point ◽

Inference Methods

Abstract: Network inference is a notoriously challenging problem. Inferred networks are associated with high uncertainty and likely riddled with false positive and false negative interactions. Especially for biological networks, we do not have good ways of judging the performance of inference methods against real networks, and instead we often rely solely on the performance against simulated data. Gaining confidence in networks inferred from real data therefore requires establishing reliable validation methods. Here, we argue that the expectation of mixing patterns in biological networks such as gene regulatory networks offers a reasonable starting point: interactions are more likely to occur between nodes with similar biological functions. We can quantify this behaviour using the assortativity coefficient, and here we show that the resulting heuristic, functional assortativity, offers a reliable and informative route for comparing different inference algorithms.
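As a rough illustration of the assortativity idea, the sketch below computes Newman's assortativity coefficient for a categorical node attribute on an undirected network. Assigning a single functional label to each gene is a simplifying assumption (real annotations are typically multi-label), and the function name and toy data are hypothetical; this is not the authors' implementation of functional assortativity.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def attribute_assortativity(edges: Iterable[Tuple[str, str]], label: Dict[str, str]) -> float:
    """Newman's assortativity coefficient r for a categorical node attribute.

    r = (sum_i e_ii - sum_i a_i^2) / (1 - sum_i a_i^2) for an undirected network,
    where e_ij is the fraction of edge ends joining category i to category j.
    """
    mixing = defaultdict(float)
    for u, v in edges:
        # Count each undirected edge in both directions so the mixing matrix is symmetric.
        mixing[(label[u], label[v])] += 1.0
        mixing[(label[v], label[u])] += 1.0
    total = sum(mixing.values())
    if total == 0:
        raise ValueError("network has no edges")
    categories = {c for pair in mixing for c in pair}
    same_category = sum(mixing.get((c, c), 0.0) for c in categories) / total
    expected = sum(
        (sum(mixing.get((c, d), 0.0) for d in categories) / total) ** 2
        for c in categories
    )
    if expected == 1.0:
        raise ValueError("coefficient is undefined when all nodes share one category")
    return (same_category - expected) / (1.0 - expected)


# Toy network: two within-function links and one cross-function link.
labels = {"A": "metabolism", "B": "metabolism", "C": "signaling", "D": "signaling"}
r = attribute_assortativity([("A", "B"), ("C", "D"), ("A", "C")], labels)
```

A value close to 1 indicates that links mostly join genes with the same label, a value near 0 indicates mixing no better than chance, and negative values indicate links that preferentially join different labels.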


SimiC: A Single Cell Gene Regulatory Network Inference method with Similarity Constraints

10.1101/2020.04.03.023002 ◽

2020 ◽

Author(s):

Jianhao Peng ◽

Ullas V. Chembazhi ◽

Sushant Bangru ◽

Ian M. Traniello ◽

Auinash Kalsotra ◽

...

Keyword(s):

Single Cell ◽

Network Inference ◽

Regional Analysis ◽

Supplementary Information ◽

Inference Method ◽

Gene Regulatory Network Inference ◽

Inference Problem ◽

Cell State ◽

Gene Regulatory ◽

Inference Methods

Abstract: Motivation: With the use of single-cell RNA sequencing (scRNA-Seq) technologies, it is now possible to acquire gene expression data for each individual cell in samples containing up to millions of cells. These cells can be further grouped into different states along an inferred cell differentiation path, which are potentially characterized by similar, but distinct enough, gene regulatory networks (GRNs). Hence, it would be desirable for scRNA-Seq GRN inference methods to capture the GRN dynamics across cell states. However, current GRN inference methods produce a unique GRN per input dataset (or independent GRNs per cell state), failing to capture these regulatory dynamics. Results: We propose a novel single-cell GRN inference method, named SimiC, that jointly infers the GRNs corresponding to each state. SimiC models the GRN inference problem as a LASSO optimization problem with an added similarity constraint on the GRNs associated with contiguous cell states, which captures the inter-cell-state homogeneity. We show on mouse hepatocyte single-cell data generated after partial hepatectomy that, contrary to previous GRN methods for scRNA-Seq data, SimiC is able to capture the transcription factor (TF) dynamics across liver regeneration, as well as the cell-level behavior for the regulatory program of each TF across cell states. In addition, on a honey bee scRNA-Seq experiment, SimiC is able to capture the increased heterogeneity of cells in whole-brain tissue with respect to a regional analysis tissue, and the TFs associated specifically with each sequenced tissue. Availability: SimiC is written in Python and includes an R API. It can be downloaded from https://github.com/jianhao2016/[email protected], [email protected] Supplementary information: Supplementary data are available at the code repository.
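One way to write down a LASSO problem with a similarity constraint across contiguous cell states is sketched below. The notation is an assumption made for illustration: $X_k$ denotes TF expression and $Y_k$ target-gene expression for the cells in state $k$, $W_k$ the corresponding GRN weight matrix, and $\lambda_1, \lambda_2$ the sparsity and similarity weights. The abstract does not give the exact normalization or the norm used for the similarity term, so this should be read as a plausible form rather than the published objective.

```latex
\min_{W_1,\dots,W_K}\;
\sum_{k=1}^{K} \lVert Y_k - X_k W_k \rVert_F^2
\;+\; \lambda_1 \sum_{k=1}^{K} \lVert W_k \rVert_1
\;+\; \lambda_2 \sum_{k=1}^{K-1} \lVert W_k - W_{k+1} \rVert_F^2
```

Setting $\lambda_2 = 0$ decouples the states and recovers one independent LASSO per cell state, which is the setting the abstract describes as failing to capture regulatory dynamics.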


ei.Datasets: Real Data Sets for Assessing Ecological Inference Algorithms

Social Science Computer Review ◽

10.1177/08944393211040808 ◽

2021 ◽

pp. 089443932110408

Author(s):

Jose M. Pavía

Keyword(s):

Simulated Data ◽

Ground Truth ◽

Real Data ◽

R Package ◽

Data Sets ◽

Ecological Inference ◽

Inference Models ◽

Individual Level ◽

Inference Algorithms ◽

Cross Classification

Ecological inference models aim to infer individual-level relationships using aggregate data. They are routinely used to estimate voter transitions between elections, disclose split-ticket voting behaviors, or infer racial voting patterns in U.S. elections. A large number of procedures have been proposed in the literature to solve these problems; therefore, an assessment and comparison of them are overdue. The secret ballot however makes this a difficult endeavor since real individual data are usually not accessible. The most recent work on ecological inference has assessed methods using a very small number of data sets with ground truth, combined with artificial, simulated data. This article dramatically increases the number of real instances by presenting a unique database (available in the R package ei.Datasets) composed of data from more than 550 elections where the true inner-cell values of the global cross-classification tables are known. The article describes how the data sets are organized, details the data curation and data wrangling processes performed, and analyses the main features characterizing the different data sets.


Identifying strengths and weaknesses of methods for computational network inference from single cell RNA-seq data

10.1101/2021.06.01.446671 ◽

2021 ◽

Author(s):

Matthew Stone ◽

Sunnie Grace McCalla ◽

Alireza Fotuhi Siahpirani ◽

Viswesh Periyasamy ◽

Junha Shin ◽

...

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Gene Regulatory Network ◽

Regulatory Network ◽

Network Inference ◽

Cost Effective ◽

Gene Regulatory Network Inference ◽

Single Cell Rna Sequencing ◽

Gene Regulatory ◽

Inference Methods

Single-cell RNA-sequencing (scRNA-seq) offers unparalleled insight into the transcriptional programs of different cellular states by measuring the transcriptome of thousands of individual cells. An emerging problem in the analysis of scRNA-seq is the inference of transcriptional gene regulatory networks, and a number of methods with different learning frameworks have been developed. Here we present an expanded benchmarking study of eleven recent network inference methods on six published single-cell RNA-sequencing datasets in human, mouse, and yeast, considering different types of gold standard networks and evaluation metrics. We evaluate methods based on their computing requirements as well as on their ability to recover the network structure. We find that while no method is a universal winner and most methods have a modest recovery of experimentally derived interactions based on global metrics such as AUPR, methods are able to capture targets of regulators that are relevant to the system under study. Based on overall performance, we grouped the methods into three main categories and found a combination of information-theoretic and regression-based methods to have a generally high performance. We also evaluate the utility of imputation for gene regulatory network inference and find that a small number of methods benefit from imputation, which further depends upon the dataset. Finally, comparisons to inferred networks for comparable bulk conditions showed that networks inferred from scRNA-seq datasets are often better than or on par with those from bulk, suggesting that scRNA-seq datasets can be a cost-effective way for gene regulatory network inference. Our analysis should be beneficial in selecting algorithms for performing network inference but also argues for improved methods and better gold standards for accurate assessment of regulatory network inference methods for mammalian systems.
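For readers unfamiliar with AUPR as a global metric, the sketch below computes average precision, a common step-wise estimate of the area under the precision-recall curve, for a ranked edge list against a gold standard. The function name and toy data are hypothetical, and the benchmark itself may compute AUPR differently (for example by interpolation).

```python
from typing import Sequence, Set, Tuple

Edge = Tuple[str, str]  # (regulator, target); hypothetical edge representation


def average_precision(ranked_edges: Sequence[Edge], gold: Set[Edge]) -> float:
    """Average precision of a ranked edge list against a gold-standard edge set.

    Precision is evaluated at the rank of every recovered gold edge and averaged
    over all gold edges, so gold edges that are never predicted contribute zero.
    """
    if not gold:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, edge in enumerate(ranked_edges, start=1):
        if edge in gold:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(gold)


# Toy ranking, most confident prediction first.
ranked = [("TF1", "G1"), ("TF1", "G2"), ("TF2", "G3"), ("TF2", "G1")]
gold_standard = {("TF1", "G1"), ("TF2", "G3")}
ap = average_precision(ranked, gold_standard)
```

Because gold standards for regulatory networks are sparse and incomplete, such scores tend to be low in absolute terms, which is consistent with the modest recovery the abstract describes.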


ModularBoost: an efficient network inference algorithm based on module decomposition

BMC Bioinformatics ◽

10.1186/s12859-021-04074-y ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Xinyu Li ◽

Wei Zhang ◽

Jianming Zhang ◽

Guang Li

Keyword(s):

Network Inference ◽

Detection Methods ◽

Inference Problem ◽

Topological Constraints ◽

Inference Algorithms ◽

Module Detection ◽

Series Expression ◽

Gene Modules ◽

Inference Methods ◽

Complicated Task

Abstract: Background: Given expression data, gene regulatory network (GRN) inference approaches try to determine regulatory relations. However, current inference methods ignore the inherent topological characteristics of GRNs to some extent, leading to structures that lack a clear biological explanation. To increase the biophysical meaning of inferred networks, this study performed data-driven module detection before network inference. Gene modules were identified by decomposition-based methods. Results: ICA-decomposition-based module detection methods have been used to detect functional modules directly from transcriptomic data. Experiments on time-series expression, curated, and scRNA-seq datasets suggested advantages of the proposed ModularBoost method over established methods, especially in efficiency and accuracy. For scRNA-seq datasets, the ModularBoost method outperformed other candidate inference algorithms. Conclusions: As a complicated task, GRN inference can be decomposed into several tasks of reduced complexity. Using identified gene modules as topological constraints, the initial inference problem can be accomplished by inferring intra-modular and inter-modular interactions respectively. Experimental outcomes suggest that the proposed ModularBoost method can improve the accuracy and efficiency of inference algorithms by introducing topological constraints.


BRANE Cut: Biologically-Related A priori Network Enhancement with Graph cuts for Gene Regulatory Network Inference

10.1101/032383 ◽

2015 ◽

Author(s):

Aurélie Pirayre ◽

Camille Couprie ◽

Frédérique Bidard ◽

Laurent Duval ◽

Jean-Christophe Pesquet

Keyword(s):

Gene Regulatory Network ◽

Regulatory Network ◽

Gene Networks ◽

Network Inference ◽

State Of The Art ◽

A Priori ◽

Graph Cuts ◽

Gene Regulatory Network Inference ◽

Gene Regulatory ◽

Inference Methods

Background: Inferring gene networks from high-throughput data constitutes an important step in the discovery of relevant regulatory relationships in organism cells. Despite the large number of available Gene Regulatory Network inference methods, the problem remains challenging: the underdetermination in the space of possible solutions requires additional constraints that incorporate a priori information on gene interactions. Methods: Weighting all possible pairwise gene relationships by a probability of edge presence, we formulate the regulatory network inference as a discrete variational problem on graphs. We enforce biologically plausible coupling between groups and types of genes by minimizing an edge labeling functional coding for a priori structures. The optimization is carried out with Graph cuts, an approach popular in image processing and computer vision. We compare the inferred regulatory networks to results achieved by the mutual-information-based Context Likelihood of Relatedness (CLR) method and by the state-of-the-art GENIE3, winner of the DREAM4 multifactorial challenge. Results: Our BRANE Cut approach infers more accurately the five DREAM4 in silico networks (with improvements from 6% to 11%). On a real Escherichia coli compendium, an improvement of 11.8% compared to CLR and 3% compared to GENIE3 is obtained in terms of Area Under Precision-Recall curve. Up to 48 additional verified interactions are obtained over GENIE3 for a given precision. On this dataset involving 4345 genes, our method achieves a performance similar to that of GENIE3, while being more than seven times faster. The BRANE Cut code is available at: http://www-syscom.univ-mlv.fr/~pirayre/Codes-GRN-BRANE-cut.html Conclusions: BRANE Cut is a weighted graph thresholding method. Using biologically sound penalties and data-driven parameters, it improves three state-of-the-art GRN inference methods. It is applicable as a generic network inference post-processing step, due to its computational efficiency.


Machine learning for single cell genomics data analysis

10.1101/2021.02.04.429763 ◽

2021 ◽

Author(s):

Félix Raimundo ◽

Laetitia Papaxanthos ◽

Céline Vallot ◽

Jean-Philippe Vert

Keyword(s):

Machine Learning ◽

Single Cell ◽

Network Inference ◽

Method Development ◽

Biological Knowledge ◽

Omics Data ◽

Gene Regulatory Network Inference ◽

Multimodal Data ◽

Low Dimensional ◽

Type Classification

Abstract: Single-cell omics technologies produce large quantities of data describing the genomic, transcriptomic or epigenomic profiles of many individual cells in parallel. In order to infer biological knowledge and develop predictive models from these data, machine learning (ML)-based models are increasingly used due to their flexibility, scalability, and impressive success in other fields. In recent years, we have seen a surge of new ML-based method development for low-dimensional representations of single-cell omics data, batch normalization, cell type classification, trajectory inference, gene regulatory network inference or multimodal data integration. To help readers navigate this fast-moving literature, we survey in this review recent advances in ML approaches developed to analyze single-cell omics data, focusing mainly on peer-reviewed publications published in the last two years (2019-2020).


A Framework for the Objective Assessment of Registration Accuracy

International Journal of Biomedical Imaging ◽

10.1155/2014/128324 ◽

2014 ◽

Vol 2014 ◽

pp. 1-11 ◽

Cited By ~ 1

Author(s):

Francesca Pizzorni Ferrarese ◽

Flavio Simonetti ◽

Roberto Israel Foroni ◽

Gloria Menegaz

Keyword(s):

Accuracy Assessment ◽

Objective Assessment ◽

Synthetic Data ◽

Simulated Data ◽

Ground Truth ◽

Real Data ◽

Magnetic Resonance Images ◽

Good Prediction ◽

Registration Accuracy ◽

Affine Registration

Validation and accuracy assessment are the main bottlenecks preventing the adoption of image processing algorithms in clinical practice. In the classical approach, a posteriori analysis is performed through objective metrics. In this work, a different approach based on Petri nets is proposed. The basic idea consists of predicting the accuracy of a given pipeline based on the identification and characterization of the sources of inaccuracy. The concept is demonstrated on a case study: intrasubject rigid and affine registration of magnetic resonance images. Both synthetic and real data are considered. While synthetic data allow the benchmarking of the performance with respect to the ground truth, real data make it possible to assess the robustness of the methodology in real contexts as well as to determine the suitability of the use of synthetic data in the training phase. Results revealed a higher correlation and a lower dispersion among the metrics for simulated data, while the opposite trend was observed for pathologic ones. Results show that the proposed model not only provides a good prediction performance but also leads to the optimization of the end-to-end chain in terms of accuracy and robustness, setting the ground for its generalization to different and more complex scenarios.


A guide to trajectory inference and RNA velocity

10.1101/2021.12.22.473434 ◽

2021 ◽

Author(s):

Philipp Weiler ◽

Koen Van den Berge ◽

Kelly Street ◽

Simone Tiberi

Keyword(s):

Gene Expression ◽

Single Cell ◽

Time Derivative ◽

Real Data ◽

Use Case ◽

Technological Developments ◽

Inference Methods ◽

Cell Data ◽

Significant Attention ◽

Gene Expression Levels

Technological developments have led to an explosion of high-throughput single-cell data, which are revealing unprecedented perspectives on cell identity. Recently, significant attention has focused on investigating, from single-cell RNA-sequencing (scRNA-seq) data, cellular dynamic processes, such as cell differentiation, cell cycle and cell (de)activation. Trajectory inference methods estimate a trajectory, a collection of differentiation paths of a dynamic system, by ordering cells along the paths of such a dynamic process. While trajectory inference tools typically work with gene expression levels, common scRNA-seq protocols allow the identification and quantification of unspliced pre-mRNAs and mature spliced mRNAs for each gene. By exploiting the abundance of unspliced and spliced mRNA, one can infer the RNA velocity of individual cells, i.e., the time derivative of the gene expression state of cells. Whereas traditional trajectory inference methods reconstruct cellular dynamics given a population of cells of varying maturity, RNA velocity relies on a dynamical model describing splicing dynamics. Here, we initially discuss conceptual and theoretical aspects of both approaches, then illustrate how they can be combined, and finally present an example use-case on real data.
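The splicing model commonly used in the RNA velocity literature can be written, for a single gene, as the pair of rate equations below. The symbols are assumptions introduced here and do not appear in the abstract: $u(t)$ and $s(t)$ are the unspliced and spliced mRNA abundances, $\alpha$ is the transcription rate, $\beta$ the splicing rate, and $\gamma$ the degradation rate. The guide itself may use a different parameterization.

```latex
\frac{du}{dt} = \alpha(t) - \beta\, u(t),
\qquad
\frac{ds}{dt} = \beta\, u(t) - \gamma\, s(t)
```

Under this model, the RNA velocity of a gene is the second equation, $v = ds/dt$: it is positive when unspliced transcripts exceed what degradation of spliced transcripts balances, and negative otherwise.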
