i2d: an R package for simulating data from images and the implications in biomedical research

Bioinformatics ◽

10.1093/bioinformatics/btaa991 ◽

2020 ◽

Author(s):

Xiaoyu Liang ◽

Ying Hu ◽

Chunhua Yan ◽

Ke Xu

Keyword(s):

Gene Networks ◽

Simulated Data ◽

R Package ◽

Quantitative Information ◽

Human Vision ◽

Supplementary Information ◽

Biological Research ◽

Limited Capacity ◽

Complex Information ◽

Quality Imaging

Abstract Motivation High-quality imaging analyses have been proposed to drive innovation in biomedical and biological research. However, the application of images remains underexploited because of the limited capacity of human vision and the challenges in extracting quantitative information from images. Computationally extracting quantitative information from images is critical to overcoming this limitation. Here, we present a novel R package, i2d, to simulate data from an image based on digital convolution. Results The R package i2d allows users to transform an image into a simulated dataset that can be used to extract and analyze complex information in biomedical and biological research. The package also includes three novel and efficient methods for graph clustering based on simulated data, which can be used to dissect complex gene networks into sub-clusters that have similar biological functions. Availability and implementation The code, the documentation, a tutorial and example data are available as open source at github.com/XiaoyuLiang/i2d. Supplementary information Supplementary data are available at Bioinformatics online.
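As a rough illustration of what digital convolution over an image involves, the sketch below slides a small kernel across a grayscale pixel grid and flattens the result into a list of numeric values. This is not the i2d implementation, whose internals the abstract does not describe; the function name `convolve2d`, the 3x3 mean kernel and the toy 4x4 image are all assumptions made for the example.

```python
def convolve2d(image, kernel):
    """Valid-mode 2D convolution of a grayscale image given as a list of rows."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    # Flip the kernel on both axes; this is what separates convolution
    # from plain cross-correlation.
    flipped = [row[::-1] for row in kernel[::-1]]
    out = []
    for r in range(ih - kh + 1):
        out_row = []
        for c in range(iw - kw + 1):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * flipped[i][j]
            out_row.append(acc)
        out.append(out_row)
    return out


# Hypothetical 4x4 "image": a bright 2x2 square on a dark background.
image = [
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
]
# Hypothetical 3x3 mean (box blur) kernel.
mean_kernel = [[1 / 9] * 3 for _ in range(3)]

smoothed = convolve2d(image, mean_kernel)
# Flatten the convolved grid into a simple numeric dataset.
values = [v for row in smoothed for v in row]
```

Each 3x3 window here covers the whole bright square, so every output value is the window mean of four pixels of intensity 9 and five of intensity 0.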

RMTL: an R library for multi-task learning

Bioinformatics ◽

10.1093/bioinformatics/bty831 ◽

2018 ◽

Vol 35 (10) ◽

pp. 1797-1798 ◽

Cited By ~ 2

Author(s):

Han Cao ◽

Jiayu Zhou ◽

Emanuel Schwarz

Keyword(s):

Biological Networks ◽

Simulated Data ◽

R Package ◽

Low Rank ◽

Supplementary Information ◽

Supplementary Data ◽

Software Environment ◽

Machine Learning Technique ◽

Task Learning ◽

Learning Technique

Abstract Motivation Multi-task learning (MTL) is a machine learning technique for simultaneous learning of multiple related classification or regression tasks. Despite its increasing popularity, MTL algorithms are currently not available in the widely used software environment R, creating a bottleneck for their application in biomedical research. Results We developed an efficient, easy-to-use R library for MTL (www.r-project.org) comprising 10 algorithms applicable for regression, classification, joint predictor selection, task clustering, low-rank learning and incorporation of biological networks. We demonstrate the utility of the algorithms using simulated data. Availability and implementation The RMTL package is an open source R package and is freely available at https://github.com/transbioZI/RMTL. RMTL will also be available on cran.r-project.org. Supplementary information Supplementary data are available at Bioinformatics online.

Inferring cellular heterogeneity of associations from single cell genomics

Bioinformatics ◽

10.1093/bioinformatics/btaa151 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3466-3473

Author(s):

Maya Levy ◽

Amit Frishberg ◽

Irit Gat-Viks

Keyword(s):

Simulated Data ◽

R Package ◽

Biological Data ◽

Cellular Heterogeneity ◽

Supplementary Information ◽

Dynamic Changes ◽

Entire Cell ◽

Complete Set ◽

Cellular Phenotypes ◽

Cell Variation

Abstract Motivation Cell-to-cell variation has uncovered associations between cellular phenotypes. However, it remains challenging to address the cellular diversity of such associations. Results Here, we do not rely on the conventional assumption that the same association holds throughout the entire cell population. Instead, we assume that associations may exist in a certain subset of the cells. We developed CEllular Niche Association (CENA) to reliably predict pairwise associations together with the cell subsets in which the associations are detected. CENA does not rely on predefined subsets but only requires that the cells of each predicted subset share a certain characteristic state. CENA may therefore reveal dynamic modulation of dependencies along cellular trajectories of temporally evolving states. Using simulated data, we show the advantage of CENA over existing methods and its scalability to a large number of cells. Application of CENA to real biological data demonstrates dynamic changes in associations that would be otherwise masked. Availability and implementation CENA is available as an R package on GitHub: https://github.com/mayalevy/CENA and is accompanied by a complete set of documentation and instructions. Supplementary information Supplementary data are available at Bioinformatics online.

SpliceNet: recovering splicing isoform-specific differential gene networks from RNA-Seq data of normal and diseased samples

Nucleic Acids Research ◽

10.1093/nar/gku577 ◽

2014 ◽

Vol 42 (15) ◽

pp. e121-e121 ◽

Cited By ~ 18

Author(s):

Hari Krishna Yalamanchili ◽

Zhaoyuan Li ◽

Panwen Wang ◽

Maria P. Wong ◽

Jianfeng Yao ◽

...

Keyword(s):

Sample Size ◽

Gene Networks ◽

Simulated Data ◽

Exon Array ◽

R Package ◽

Mapk Signaling ◽

Rna Seq ◽

Gene Expressions ◽

Exon Level ◽

Splicing Isoforms

Abstract Conventionally, overall gene expressions from microarrays are used to infer gene networks, but it is challenging to account for splicing isoforms. High-throughput RNA Sequencing has made splice variant profiling practical. However, its true merit in quantifying splicing isoforms and isoform-specific exon expressions is not well explored in inferring gene networks. This study demonstrates SpliceNet, a method to infer isoform-specific co-expression networks from exon-level RNA-Seq data, using large dimensional trace. It goes beyond differentially expressed genes and infers splicing isoform network changes between normal and diseased samples. It eases the sample size bottleneck; evaluations on simulated data and lung cancer-specific ERBB2 and MAPK signaling pathways, with varying numbers of samples, evince the merit in handling high exon to sample size ratio datasets. Inferred network rewiring of well established Bcl-x and EGFR centered networks from lung adenocarcinoma expression data is in good agreement with literature. Gene level evaluations demonstrate a substantial performance of SpliceNet over canonical correlation analysis, a method that is currently applied to exon level RNA-Seq data. SpliceNet can also be applied to exon array data. SpliceNet is distributed as an R package available at http://www.jjwanglab.org/SpliceNet.

Identification of disease-associated loci using machine learning for genotype and network data integration

Bioinformatics ◽

10.1093/bioinformatics/btz310 ◽

2019 ◽

Vol 35 (24) ◽

pp. 5182-5190 ◽

Cited By ~ 4

Author(s):

Luis G Leal ◽

Alessia David ◽

Marjo-Riita Jarvelin ◽

Sylvain Sebert ◽

Minna Männikkö ◽

...

Keyword(s):

Machine Learning ◽

Gene Networks ◽

Association Studies ◽

R Package ◽

Biological Data ◽

Machine Learning Algorithms ◽

Supplementary Information ◽

Genome Wide Association Studies ◽

Omics Data ◽

Missing Heritability

Abstract Motivation Integration of different omics data could markedly help to identify biological signatures, understand the missing heritability of complex diseases and ultimately achieve personalized medicine. Standard regression models used in Genome-Wide Association Studies (GWAS) identify loci with a strong effect size, whereas GWAS meta-analyses are often needed to capture weak loci contributing to the missing heritability. Development of novel machine learning algorithms for merging genotype data with other omics data is highly needed as it could enhance the prioritization of weak loci. Results We developed cNMTF (corrected non-negative matrix tri-factorization), an integrative algorithm based on clustering techniques of biological data. This method assesses the inter-relatedness between genotypes, phenotypes, the damaging effect of the variants and gene networks in order to identify loci-trait associations. cNMTF was used to prioritize genes associated with lipid traits in two population cohorts. We replicated 129 genes reported in GWAS world-wide and provided evidence that supports 85% of our findings (226 out of 265 genes), including recent associations in literature (NLGN1), regulators of lipid metabolism (DAB1) and pleiotropic genes for lipid traits (CARM1). Moreover, cNMTF performed efficiently against strong population structures by accounting for the individuals’ ancestry. As the method is flexible in the incorporation of diverse omics data sources, it can be easily adapted to the user’s research needs. Availability and implementation An R package (cnmtf) is available at https://lgl15.github.io/cnmtf_web/index.html. Supplementary information Supplementary data are available at Bioinformatics online.

simGWAS: a fast method for simulation of large scale case–control GWAS summary statistics

Bioinformatics ◽

10.1093/bioinformatics/bty898 ◽

2018 ◽

Vol 35 (11) ◽

pp. 1901-1906 ◽

Cited By ~ 4

Author(s):

Mary D Fortune ◽

Chris Wallace

Keyword(s):

Large Scale ◽

Simulated Data ◽

Enrichment Analysis ◽

R Package ◽

Gene Set Enrichment Analysis ◽

Supplementary Information ◽

Intermediate Step ◽

Fast Method ◽

Summary Statistics ◽

Causal Variants

Abstract Motivation Methods for analysis of GWAS summary statistics have encouraged data sharing and democratized the analysis of different diseases. Ideal validation for such methods is application to simulated data, where some ‘truth’ is known. As GWAS increase in size, so does the computational complexity of such evaluations; standard practice repeatedly simulates and analyses genotype data for all individuals in an example study. Results We have developed a novel method based on an alternative approach, directly simulating GWAS summary data, without individual data as an intermediate step. We mathematically derive the expected statistics for any set of causal variants and their effect sizes, conditional upon control haplotype frequencies (available from public reference datasets). Simulation of GWAS summary output can be conducted independently of sample size by simulating random variates about these expected values. Across a range of scenarios, our method produces very similar output to that from simulating individual genotypes with a substantial gain in speed even for modest sample sizes. Fast simulation of GWAS summary statistics will enable more complete and rapid evaluation of summary statistic methods as well as opening new potential avenues of research in fine mapping and gene set enrichment analysis. Availability and implementation Our method is available under a GPL license as an R package from http://github.com/chr1swallace/simGWAS. Supplementary information Supplementary data are available at Bioinformatics online.
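The idea of "simulating random variates about these expected values" can be sketched as follows. This is not the simGWAS code or API: it assumes, for illustration only, that the summary Z-scores are drawn from a multivariate normal distribution whose mean is the vector of expected Z-scores and whose covariance is the SNP correlation (LD) matrix. The helper names `cholesky` and `simulate_z`, the expected Z values and the LD matrix are all hypothetical.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def simulate_z(expected_z, ld, rng):
    """Draw one vector of Z-scores ~ N(expected_z, ld) (assumed model)."""
    lower = cholesky(ld)
    noise = [rng.gauss(0.0, 1.0) for _ in expected_z]
    # Correlate the independent noise with the Cholesky factor, then shift
    # by the expected value for each SNP.
    return [
        mu + sum(lower[i][k] * noise[k] for k in range(i + 1))
        for i, mu in enumerate(expected_z)
    ]


rng = random.Random(1)
expected_z = [0.0, 4.0, 3.2]   # hypothetical expected Z-scores for three SNPs
ld = [                         # hypothetical LD (correlation) matrix
    [1.0, 0.2, 0.1],
    [0.2, 1.0, 0.8],
    [0.1, 0.8, 1.0],
]

draws = [simulate_z(expected_z, ld, rng) for _ in range(5000)]
means = [sum(d[i] for d in draws) / len(draws) for i in range(3)]
```

Note that no individual genotypes are generated at any point, which is why the cost of a draw in this sketch depends on the number of SNPs rather than on the study's sample size.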

SPARSim single cell: a count data simulator for scRNA-seq data

Bioinformatics ◽

10.1093/bioinformatics/btz752 ◽

2019 ◽

Cited By ~ 2

Author(s):

Giacomo Baruzzo ◽

Ilaria Patuzzi ◽

Barbara Di Camillo

Keyword(s):

Single Cell ◽

Count Data ◽

Simulated Data ◽

Real Data ◽

R Package ◽

Supplementary Information ◽

Rna Seq ◽

Distribution Of Zeros ◽

New Methods ◽

Research Fields

Abstract Motivation Single cell RNA-seq (scRNA-seq) count data show many differences compared with bulk RNA-seq count data, making the application of many RNA-seq pre-processing/analysis methods not straightforward or even inappropriate. For this reason, the development of new methods for handling scRNA-seq count data is currently one of the most active research fields in bioinformatics. To help the development of such new methods, the availability of simulated data could play a pivotal role. However, only a few scRNA-seq count data simulators are available, often showing poor or undemonstrated similarity with real data. Results In this article we present SPARSim, a scRNA-seq count data simulator based on a Gamma-Multivariate Hypergeometric model. We demonstrate that SPARSim can generate count data that resemble real data in terms of count intensity, variability and sparsity, performing comparably to or better than one of the most used scRNA-seq simulators, Splat. In particular, SPARSim simulated count matrices closely resemble the distribution of zeros across different expression intensities observed in real count data. Availability and implementation SPARSim R package is freely available at http://sysbiobig.dei.unipd.it/?q=SPARSim and at https://gitlab.com/sysbiobig/sparsim. Supplementary information Supplementary data are available at Bioinformatics online.
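One way to read "Gamma-Multivariate Hypergeometric model" is as a two-step process: a Gamma draw for each gene's expression level, followed by sampling reads without replacement from the resulting pool of molecules. The sketch below follows that reading only; it is not the SPARSim API, and the function name `simulate_cell`, its parameters, the `scale` factor and all numeric values are assumptions chosen for illustration.

```python
import random
from collections import Counter


def simulate_cell(intensity, variability, library_size, rng, scale=1000):
    """Simulate one cell's count vector under the assumed two-step model."""
    pool = []
    for gene, (mean, var) in enumerate(zip(intensity, variability)):
        # Step 1 (Gamma): biological variability around the gene's mean
        # intensity. shape * scale_param == mean, so the draw is centred
        # on the intensity.
        shape = 1.0 / var
        level = rng.gammavariate(shape, mean * var)
        # Turn the continuous level into a whole number of molecules.
        pool.extend([gene] * round(level * scale))
    # Step 2 (multivariate hypergeometric): sequencing as sampling
    # molecules without replacement from the pool.
    reads = rng.sample(pool, min(library_size, len(pool)))
    counts = Counter(reads)
    return [counts.get(gene, 0) for gene in range(len(intensity))]


rng = random.Random(7)
intensity = [0.5, 0.3, 0.15, 0.05, 0.001]   # hypothetical relative abundances
variability = [0.1, 0.1, 0.2, 0.5, 1.0]     # hypothetical per-gene variability
cell_counts = simulate_cell(intensity, variability, library_size=200, rng=rng)
```

Because low-intensity genes contribute few or no molecules to the pool, they tend to receive zero reads, which is the kind of intensity-dependent sparsity the abstract refers to.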

flexiMAP: a regression-based method for discovering differential alternative polyadenylation events in standard RNA-seq data

Bioinformatics ◽

10.1093/bioinformatics/btaa854 ◽

2020 ◽

Author(s):

Krzysztof J Szkop ◽

David S Moss ◽

Irene Nobeli

Keyword(s):

Simulated Data ◽

Alternative Polyadenylation ◽

Real Data ◽

R Package ◽

Supplementary Information ◽

Beta Regression ◽

Rna Seq ◽

Good Balance ◽

Flexible Modeling ◽

Specificity And Sensitivity

Abstract Motivation We present flexible Modeling of Alternative PolyAdenylation (flexiMAP), a new beta-regression-based method implemented in R, for discovering differential alternative polyadenylation events in standard RNA-seq data. Results We show, using both simulated and real data, that flexiMAP exhibits a good balance between specificity and sensitivity and compares favourably to existing methods, especially at low fold changes. In addition, the tests on simulated data reveal some hitherto unrecognized caveats of existing methods. Importantly, flexiMAP allows modeling of multiple known covariates that often confound the results of RNA-seq data analysis. Availability and implementation The flexiMAP R package is available at: https://github.com/kszkop/flexiMAP. Scripts and data to reproduce the analysis in this paper are available at: https://doi.org/10.5281/zenodo.3689788. Supplementary information Supplementary data are available at Bioinformatics online.

simGWAS: a fast method for simulation of large scale case-control GWAS summary statistics

10.1101/313023 ◽

2018 ◽

Cited By ~ 1

Author(s):

Mary D. Fortune ◽

Chris Wallace

Keyword(s):

Large Scale ◽

Simulated Data ◽

Enrichment Analysis ◽

R Package ◽

Gene Set Enrichment Analysis ◽

Supplementary Information ◽

Intermediate Step ◽

Fast Method ◽

Summary Statistics ◽

Causal Variants

Abstract Motivation Methods for analysis of GWAS summary statistics have encouraged data sharing and democratised the analysis of different diseases. Ideal validation for such methods is application to simulated data, where some “truth” is known. As GWAS increase in size, so does the computational complexity of such evaluations; standard practice repeatedly simulates and analyses genotype data for all individuals in an example study. Results We have developed a novel method based on an alternative approach, directly simulating GWAS summary data, without individual data as an intermediate step. We mathematically derive the expected statistics for any set of causal variants and their effect sizes, conditional upon control haplotype frequencies (available from public reference datasets). Simulation of GWAS summary output can be conducted independently of sample size by simulating random variates about these expected values. Across a range of scenarios, our method produces very similar output to that from simulating individual genotypes with a substantial gain in speed even for modest sample sizes. Fast simulation of GWAS summary statistics will enable more complete and rapid evaluation of summary statistic methods as well as opening new potential avenues of research in fine mapping and gene set enrichment analysis. Availability and Implementation Our method is available under a GPL license as an R package from http://github.com/chr1swallace/. Supplementary Information Supplementary Information is appended.

MOSim: Multi-Omics Simulation in R

10.1101/421834 ◽

2018 ◽

Cited By ~ 5

Author(s):

Carlos Martínez-Mira ◽

Ana Conesa ◽

Sonia Tarazona

Keyword(s):

Time Series Data ◽

Simulated Data ◽

R Package ◽

Experimental Designs ◽

Supplementary Information ◽

Series Data ◽

Data Sets ◽

Expression Data ◽

Supplementary Material ◽

Omic Data

Abstract Motivation As new integrative methodologies are being developed to analyse multi-omic experiments, validation strategies are required for benchmarking. In silico approaches such as simulated data are popular as they are fast and cheap. However, few tools are available for creating synthetic multi-omic data sets. Results MOSim is a new R package for easily simulating multi-omic experiments consisting of gene expression data, other regulatory omics and the regulatory relationships between them. MOSim supports different experimental designs including time series data. Availability The package is freely available under the GPL-3 license from the Bitbucket repository (https://bitbucket.org/ConesaLab/mosim/). Supplementary information Supplementary material is available at bioRxiv online.

Computational functional genomics-based reduction of disease-related gene sets to their key components

Bioinformatics ◽

10.1093/bioinformatics/bty986 ◽

2018 ◽

Vol 35 (14) ◽

pp. 2362-2370 ◽

Cited By ~ 2

Author(s):

Catharina Lippmann ◽

Alfred Ultsch ◽

Jörn Lötsch

Keyword(s):

Functional Genomics ◽

Directed Acyclic Graph ◽

R Package ◽

Quantitative Information ◽

Knowledge Bases ◽

Supplementary Information ◽

Biological Processes ◽

Gene Set ◽

Gene Sets ◽

Disease Related Gene

Abstract Motivation The genetic architecture of diseases is becoming increasingly well known. This raises difficulties in picking suitable targets for further research among an increasing number of candidates. Although expression based methods of gene set reduction are applied to laboratory-derived genetic data, the analysis of topical sets of genes gathered from knowledge bases requires a modified approach as no quantitative information about gene expression is available. Results We propose a computational functional genomics-based approach to reducing sets of genes to the most relevant items based on the importance of the gene within the polyhierarchy of biological processes characterizing the disease. Knowledge bases about the biological roles of genes can provide a valid description of traits or diseases represented as a directed acyclic graph (DAG) picturing the polyhierarchy of disease relevant biological processes. The proposed method uses a gene importance score derived from the location of the gene-related biological processes in the DAG. It attempts to recreate the DAG, and thereby the roles of the original gene set, with the least number of genes in descending order of importance. This obtained precision and recall of over 70% in recreating the components of the DAG characterizing the biological functions of n=540 genes relevant to pain with a subset of only the k=29 best-scoring genes. Conclusions A new method for reduction of gene sets is shown that is able to reproduce over 70% of the biological processes in which the full gene set is involved while using only ∼5% of the original genes. Availability and implementation The necessary numerical parameters for the calculation of gene importance are implemented in the R package dbtORA at https://github.com/IME-TMP-FFM/dbtORA. Supplementary information Supplementary data are available at Bioinformatics online.
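The evaluation reported above, precision and recall of the recreated DAG plus the share of genes retained, can be made concrete with a small worked example. The sketch does not use dbtORA; the function name `precision_recall` and the term identifiers are hypothetical, and only the figures n=540 and k=29 come from the abstract.

```python
def precision_recall(reference_terms, recovered_terms):
    """Precision and recall of recovered DAG terms against the reference DAG."""
    reference, recovered = set(reference_terms), set(recovered_terms)
    true_positives = len(reference & recovered)
    precision = true_positives / len(recovered) if recovered else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical biological-process terms in the full and the reduced DAG.
reference = {"T1", "T2", "T3", "T4", "T5"}
recovered = {"T1", "T2", "T3", "T4", "T9"}
precision, recall = precision_recall(reference, recovered)

# Share of the original gene set that is kept (figures from the abstract).
fraction_kept = 29 / 540
```

In this toy case four of the five recovered terms are correct and four of the five reference terms are found, so both precision and recall are 0.8. The retained fraction, 29 of 540 genes, is about 5.4%, consistent with the "∼5%" quoted in the conclusions.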
