miRSM: an R package to infer and analyse miRNA sponge modules in heterogeneous data

Integrating datasets from different disciplines is hard because the data are often qualitatively different in meaning, scale and reliability. When two datasets describe the same entities, many scientific questions can be phrased around whether the (dis)similarities between entities are conserved across such different data. Our method, CLARITY, quantifies consistency across datasets, identifies where inconsistencies arise and aids in their interpretation. We illustrate this using three diverse comparisons: gene methylation versus expression, evolution of language sounds versus word use, and country-level economic metrics versus cultural beliefs. The non-parametric approach is robust to noise and differences in scaling, and makes only weak assumptions about how the data were generated. It operates by decomposing similarities into two components: a ‘structural’ component analogous to a clustering, and an underlying ‘relationship’ between those structures. This allows a ‘structural comparison’ between two similarity matrices using their predictability from ‘structure’. Significance is assessed with the help of re-sampling appropriate for each dataset. The software, CLARITY, is available as an R package from github.com/danjlawson/CLARITY .
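
The 'structural comparison' described in the abstract above can be illustrated with a small sketch. This is not the CLARITY implementation: it assumes that the 'structure' has already been learned from the first dataset and is given as hard cluster labels, fits the 'relationship' as the mean similarity for each pair of clusters (the least-squares fit for hard labels), and scores each entity by how poorly its similarities are predicted. All function names and the toy data are hypothetical.

```python
from collections import defaultdict


def block_means(sim, labels):
    """Fit the 'relationship' between clusters: mean similarity per cluster pair."""
    sums, counts = defaultdict(float), defaultdict(int)
    n = len(sim)
    for i in range(n):
        for j in range(n):
            key = (labels[i], labels[j])
            sums[key] += sim[i][j]
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def structural_residuals(sim, labels):
    """Per-entity squared error when sim is predicted from the cluster structure."""
    rel = block_means(sim, labels)
    n = len(sim)
    return [
        sum((sim[i][j] - rel[(labels[i], labels[j])]) ** 2 for j in range(n))
        for i in range(n)
    ]


def make_similarity(groups):
    """Toy similarity: 1.0 on the diagonal, 0.9 within a group, 0.1 between groups."""
    n = len(groups)
    return [
        [1.0 if i == j else (0.9 if groups[i] == groups[j] else 0.1) for j in range(n)]
        for i in range(n)
    ]


# Assumed structure learned from dataset A: two clusters of three entities.
labels = [0, 0, 0, 1, 1, 1]
sim_a = make_similarity(labels)
# In dataset B, entity 5 behaves like a member of the first cluster instead.
sim_b = make_similarity([0, 0, 0, 1, 1, 0])

res_a = structural_residuals(sim_a, labels)
res_b = structural_residuals(sim_b, labels)
print("dataset A residuals:", [round(r, 3) for r in res_a])
print("dataset B residuals:", [round(r, 3) for r in res_b])
print("least consistent entity in B:", res_b.index(max(res_b)))
```

Large residuals point to entities whose similarities in the second dataset are not explained by the structure of the first. Judging whether such a residual is significant would still need the dataset-specific re-sampling mentioned in the abstract, which this sketch omits.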

Vertical and horizontal integration of multi-omics data with miodin

BMC Bioinformatics ◽ 10.1186/s12859-019-3224-4 ◽ 2019 ◽ Vol 20 (1) ◽ Cited By ~ 13

Author(s): Benjamin Ulfenborg

Keyword(s): Data Analysis ◽ R Package ◽ Heterogeneous Data ◽ Omics Data ◽ Technical Expertise ◽ Horizontal Integration ◽ Level Data ◽ Genomics And Proteomics ◽ Health And Disease ◽ Molecular Layers

Abstract Background Studies on multiple modalities of omics data such as transcriptomics, genomics and proteomics are growing in popularity, since they allow us to investigate complex mechanisms across molecular layers. It is widely recognized that integrative omics analysis holds the promise to unlock novel and actionable biological insights into health and disease. Integration of multi-omics data remains challenging, however, and requires combination of several software tools and extensive technical expertise to account for the properties of heterogeneous data. Results This paper presents the miodin R package, which provides a streamlined workflow-based syntax for multi-omics data analysis. The package allows users to perform analysis of omics data either across experiments on the same samples (vertical integration), or across studies on the same variables (horizontal integration). Workflows have been designed to promote transparent data analysis and reduce the technical expertise required to perform low-level data import and processing. Conclusions The miodin package is implemented in R and is freely available for use and extension under the GPL-3 license. Package source, reference documentation and user manual are available at https://gitlab.com/algoromics/miodin.
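
The difference between vertical and horizontal integration can be sketched in a few lines. This does not use the miodin API. The two helper functions and the toy tables are hypothetical, and the sketch assumes that each dataset is a mapping from sample ID to a dictionary of variables, with distinct variable names across experiments.

```python
def integrate_vertical(*experiments):
    """Same samples, different experiments: keep shared samples, join their variables."""
    shared = set(experiments[0])
    for exp in experiments[1:]:
        shared &= set(exp)
    merged = {}
    for sample in sorted(shared):
        row = {}
        for exp in experiments:
            row.update(exp[sample])  # assumes variable names do not collide
        merged[sample] = row
    return merged


def integrate_horizontal(*studies):
    """Same variables, different studies: keep shared variables, stack the samples."""
    shared_vars = None
    for study in studies:
        for row in study.values():
            shared_vars = set(row) if shared_vars is None else shared_vars & set(row)
    merged = {}
    for idx, study in enumerate(studies, start=1):
        for sample, row in study.items():
            # Prefix with the study index so sample IDs stay unique after stacking.
            merged[f"study{idx}:{sample}"] = {v: row[v] for v in sorted(shared_vars)}
    return merged


# Vertical: transcriptomics and proteomics measured on overlapping samples.
rna = {
    "s1": {"geneA_expr": 5.1, "geneB_expr": 2.0},
    "s2": {"geneA_expr": 4.7, "geneB_expr": 2.4},
    "s3": {"geneA_expr": 6.0, "geneB_expr": 1.8},
}
protein = {"s1": {"protA_abund": 0.8}, "s2": {"protA_abund": 1.1}}
vertical = integrate_vertical(rna, protein)

# Horizontal: two studies that measured (mostly) the same variables.
study1 = {"p1": {"geneA": 1.0, "geneB": 2.0}}
study2 = {"p1": {"geneA": 3.0, "geneB": 4.0, "geneC": 5.0}}
horizontal = integrate_horizontal(study1, study2)

print(vertical)
print(horizontal)
```

Vertical integration widens the table (more variables per sample), whereas horizontal integration lengthens it (more samples per variable). A real workflow would also need the normalization and batch handling that the sketch leaves out.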

Biclique: An R package for Maximal Biclique Enumeration in Bipartite Graphs

10.21203/rs.2.16755/v2 ◽ 2020

Author(s): Yuping Lu ◽ Charles A. Phillips ◽ Michael A. Langston

Keyword(s): State Of The Art ◽ Basic Research ◽ R Package ◽ Bipartite Graphs ◽ Heterogeneous Data ◽ General Purpose ◽ Public Repository ◽ Data Types ◽ Statistical Programming ◽ Reference Manual

Abstract Objective Bipartite graphs are widely used to model relationships between pairs of heterogeneous data types. Maximal bicliques are foundational structures in such graphs, and their enumeration is an important task in systems biology, epidemiology and many other problem domains. Thus, there is a need for an efficient, general-purpose, publicly available tool to enumerate maximal bicliques in bipartite graphs. The statistical programming language R is a logical choice for such a tool, but until now no R package has existed for this purpose. Our objective is to provide such a package, so that the research community can more easily perform this computationally demanding task. Results Biclique is an R package that takes as input a bipartite graph and produces a listing of all maximal bicliques in this graph. Input and output formats are straightforward, with examples provided both in this paper and in the package documentation. Biclique employs a state-of-the-art algorithm previously developed for basic research in functional genomics. This package, along with its source code and reference manual, is freely available from the CRAN public repository at https://cran.r-project.org/web/packages/biclique/index.html.
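
To make the task concrete, the following sketch enumerates maximal bicliques by brute force. It is not the algorithm used by the Biclique package and is exponential in the number of left-hand vertices, so it only suits toy graphs. The function name, the adjacency-dictionary input format, and the example graph are assumptions made for illustration.

```python
from itertools import combinations


def maximal_bicliques(adjacency):
    """Enumerate maximal bicliques of a bipartite graph by brute force.

    adjacency maps each left vertex to the set of right vertices it touches.
    Returns a set of (left_vertices, right_vertices) pairs of frozensets.
    """
    left = sorted(adjacency)
    found = set()
    for size in range(1, len(left) + 1):
        for subset in combinations(left, size):
            # Right vertices adjacent to every left vertex in the subset.
            common = set.intersection(*(set(adjacency[u]) for u in subset))
            if not common:
                continue
            # Close the left side: every left vertex adjacent to all of `common`.
            closure = frozenset(u for u in left if common <= set(adjacency[u]))
            found.add((closure, frozenset(common)))
    return found


# Toy graph: genes on the left, diseases on the right.
graph = {
    "g1": {"d1", "d2"},
    "g2": {"d1", "d2"},
    "g3": {"d2", "d3"},
}
bicliques = maximal_bicliques(graph)
for left_side, right_side in sorted(bicliques, key=lambda b: sorted(b[0])):
    print(sorted(left_side), sorted(right_side))
```

Each reported pair is closed on both sides, so no vertex can be added without breaking the biclique. Whether bicliques with a single vertex on one side should be reported is a matter of convention, and this sketch keeps them.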

multiomics: A user-friendly multi-omics data harmonisation R pipeline

F1000Research ◽ 10.12688/f1000research.53453.1 ◽ 2021 ◽ Vol 10 ◽ pp. 538

Author(s): Tyrone Chen ◽ Al J Abadi ◽ Kim-Anh Lê Cao ◽ Sonika Tyagi

Keyword(s): Data Integration ◽ Case Studies ◽ R Package ◽ Heterogeneous Data ◽ General Purpose ◽ Omics Data ◽ Experimental Conditions ◽ Seamless Integration ◽ Data Object ◽ Omics Data Integration

Data from multiple omics layers of a biological system is growing in quantity, heterogeneity and dimensionality. Simultaneous multi-omics data integration is a growing field of research, as it has strong potential to unlock information on previously hidden biological relationships, leading to early diagnosis, prognosis and expedited treatments. Many tools for multi-omics data integration are being developed. However, these tools are often restricted to highly specific experimental designs and types of omics data. While some general methods do exist, they require specific data formats and experimental conditions. A major limitation in the field is the lack of a single-omics or multi-omics pipeline that can accept data in an unrefined, information-rich form before integration and subsequently generate output for further investigation. There is an increasing demand for a generic multi-omics pipeline to facilitate general-purpose data exploration and analysis of heterogeneous data. Therefore, we present our R multiomics pipeline as an easy-to-use and flexible pipeline that takes unrefined multi-omics data, sample information and user-specified parameters as input and generates a list of output plots and data tables for quality control and downstream analysis. We have demonstrated application of the pipeline on two separate COVID-19 case studies. We enabled limited checkpointing, in which intermediate output is staged to allow continuation after errors or interruptions in the pipeline, and the pipeline generates a script for reproducing the analysis to improve reproducibility. Seamless integration with the mixOmics R package is achieved, as the R data object can be loaded and manipulated with mixOmics functions. Our pipeline can be installed as an R package or from the git repository, and is accompanied by detailed documentation with walkthroughs on two case studies. The pipeline is also available as Docker and Singularity containers.

Sstack: an R package for stacking with applications to scenarios involving sequential addition of samples and features

Bioinformatics ◽ 10.1093/bioinformatics/btz010 ◽ 2019 ◽ Vol 35 (17) ◽ pp. 3143-3145

Author(s): Kevin Matlock ◽ Raziur Rahman ◽ Souparno Ghosh ◽ Ranadip Pal

Keyword(s): Information Integration ◽ Cancer Cell Line ◽ R Package ◽ Heterogeneous Data ◽ Supplementary Information ◽ Genomic Feature ◽ Modeling Framework ◽ Building Models ◽ Drug Sensitivity Prediction ◽ Sequential Addition

Abstract Summary Biological processes are characterized by a variety of different genomic feature sets. However, when building models, portions of these features are often missing for a subset of the dataset. We provide a modeling framework to effectively integrate this type of heterogeneous data to improve prediction accuracy. To test our methodology, we have stacked data from the Cancer Cell Line Encyclopedia to increase the accuracy of drug sensitivity prediction. The package addresses the dynamic regime of information integration involving sequential addition of features and samples. Availability and implementation The framework has been implemented as an R package, Sstack, which can be downloaded from https://cran.r-project.org/web/packages/Sstack/index.html, where further explanation of the package is available. Supplementary information Supplementary data are available at Bioinformatics online.

Towards an Ecological Trait-data Standard Vocabulary

Biodiversity Information Science and Standards ◽ 10.3897/biss.3.37612 ◽ 2019 ◽ Vol 3

Author(s): Florian Schneider ◽ David Fichtmüller ◽ Martin Gossner ◽ Anton Güntsch ◽ Malte Jochum ◽ ...

Keyword(s): R Package ◽ Heterogeneous Data ◽ Biological Data ◽ Published Data ◽ Multiple Use ◽ Multiple Sources ◽ Individual Level ◽ Data Standard ◽ Biodiversity Exploratories ◽ Long Term Observation

Trait-based research spans from evolutionary studies of individual-level properties to global patterns of biodiversity and ecosystem functioning. An increasing amount of trait data is available for many different organism groups, published as open access data on a variety of file hosting services. However, standardization between datasets is generally lacking due to heterogeneous data formats and types. The compilation of these published data into centralised databases remains a difficult and time-consuming task. We reviewed existing trait databases and online services, as well as initiatives for trait data standardization. Together with data providers and users participating in a large long-term observation project on multiple taxa and research questions (the Biodiversity Exploratories, www.biodiversity-exploratories.de), we identified a need for a minimal trait-data terminology that is flexible enough to include traits from all types of organisms but simple enough to be adopted by different research communities. In order to facilitate reproducibility of analyses, the reuse of data and the combination of datasets from multiple sources, we propose a standardized vocabulary for trait data, the Ecological Trait-data Standard Vocabulary (ETS, hosted on GFBio Terminology Service, https://terminologies.gfbio.org/terms/ets/pages), which builds upon and is compatible with existing ontologies. By relying on unambiguous identifiers, the proposed minimal vocabulary for trait data captures the different degrees of resolution and measurement detail for multiple use cases of trait-based research. It further encourages the use of global Uniform Resource Identifiers (URI) for taxa and trait definitions, methods and units, thereby readying the data publication for the semantic web. An accompanying R package (traitdataform) facilitates the upload of data to hosting services and also simplifies access to published trait data.
Although they originate from a current need in ecological research, the described products are now being developed to fit seamlessly with broader initiatives on biodiversity data standardisation, to foster a better linkage of ecological trait data and global e-infrastructures for biological data. The ETS is maintained and discussions on terms are managed via GitHub (https://github.com/EcologicalTraitData/ETS).

netprioR: a probabilistic model for integrative hit prioritisation of genetic screens

Statistical Applications in Genetics and Molecular Biology ◽ 10.1515/sagmb-2018-0033 ◽ 2019 ◽ Vol 18 (3) ◽ Cited By ~ 1

Author(s): Fabian Schmich ◽ Jack Kuipers ◽ Gunter Merdes ◽ Niko Beerenwinkel

Keyword(s): RNA Interference ◽ Prior Knowledge ◽ Simulated Data ◽ R Package ◽ Heterogeneous Data ◽ Biological Data ◽ Network Data ◽ Data Sets ◽ Screening Experiments

Abstract In the post-genomic era of big data in biology, computational approaches to integrate multiple heterogeneous data sets are becoming increasingly important. Despite the availability of large amounts of omics data, the prioritisation of genes relevant for a specific functional pathway based on genetic screening experiments remains a challenging task. Here, we introduce netprioR, a probabilistic generative model for semi-supervised integrative prioritisation of hit genes. The model integrates multiple network data sets representing gene–gene similarities and prior knowledge about gene functions from the literature with gene-based covariates, such as phenotypes measured in genetic perturbation screens, for example, by RNA interference or CRISPR/Cas9. We evaluate netprioR on simulated data and show that the model outperforms current state-of-the-art methods in many scenarios and is on par otherwise. In an application to real biological data, we integrate 22 network data sets, 1784 prior knowledge class labels and 3840 RNA interference phenotypes in order to prioritise novel regulators of Notch signalling in Drosophila melanogaster. The biological relevance of our predictions is evaluated using in silico and in vivo experiments. An efficient implementation of netprioR is available as an R package at http://bioconductor.org/packages/netprioR.

LMSM: a modular approach for identifying lncRNA related miRNA sponge modules in breast cancer

10.1101/841502 ◽ 2019

Author(s): Junpeng Zhang ◽ Taosheng Xu ◽ Lin Liu ◽ Wu Zhang ◽ Chunwen Zhao ◽ ...

Keyword(s): Breast Cancer ◽ Heterogeneous Data ◽ The Cancer Genome Atlas ◽ Messenger RNAs ◽ miRNA Sponge ◽ miRNA Targets ◽ Full Picture ◽ Competing Endogenous RNAs ◽ Cancer Genome Atlas ◽ miRNA Sponges

Abstract Until now, existing methods for identifying lncRNA related miRNA sponge modules have mainly relied on lncRNA related miRNA sponge interaction networks, which may not provide a full picture of miRNA sponging activities in biological conditions. Hence there is a strong need for new computational methods to identify lncRNA related miRNA sponge modules. In this work, we propose a framework, LMSM, to identify LncRNA related MiRNA Sponge Modules from heterogeneous data. To understand the miRNA sponging activities in biological conditions, LMSM uses gene expression data to evaluate the influence of the shared miRNAs on the clustered sponge lncRNAs and mRNAs. We have applied LMSM to the human breast cancer (BRCA) dataset from The Cancer Genome Atlas (TCGA). As a result, we have found that the majority of LMSM modules are significantly implicated in BRCA and most of them are BRCA subtype-specific. Most of the mediating miRNAs act as crosslinks across different LMSM modules, and all LMSM modules are statistically significant. Multi-label classification analysis shows that the performance of LMSM modules is significantly higher than the baseline's performance, indicating that LMSM modules are biologically meaningful in classifying BRCA subtypes. The consistent results suggest that LMSM is robust in identifying lncRNA related miRNA sponge modules. Moreover, LMSM can be used to predict miRNA targets. Finally, LMSM outperforms a graph clustering-based strategy in identifying BRCA-related modules. Altogether, our study shows that LMSM is a promising method to investigate the modular regulatory mechanism of sponge lncRNAs from heterogeneous data.

Author summary Previous studies have revealed that long non-coding RNAs (lncRNAs), as microRNA (miRNA) sponges or competing endogenous RNAs (ceRNAs), can regulate the expression levels of messenger RNAs (mRNAs) by decreasing the amount of miRNAs interacting with mRNAs. In this work, we hypothesize that the “tug-of-war” between RNA transcripts for attracting miRNAs occurs across groups or modules. Based on this hypothesis, we propose a framework called LMSM to identify LncRNA related MiRNA Sponge Modules. Based on the two miRNA sponge modular competition principles, significant sharing of miRNAs and high canonical correlation between the sponge lncRNAs and mRNAs, LMSM is also capable of predicting miRNA targets. LMSM not only extends the ceRNA hypothesis, but also provides a novel way to investigate the biological functions and modular mechanism of lncRNAs in breast cancer.
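
The first of the two competition principles, significant sharing of miRNAs, can be illustrated with a one-sided hypergeometric test. The abstract does not state which test LMSM uses, so the choice of test, the function name, and the toy miRNA sets below are all assumptions. The second principle, canonical correlation, is not covered here.

```python
from math import comb


def sharing_p_value(total_mirnas, mirnas_a, mirnas_b):
    """P(overlap >= observed) if mirnas_b were drawn at random from total_mirnas.

    Assumes both sets are subsets of a background of `total_mirnas` miRNAs.
    """
    set_a, set_b = set(mirnas_a), set(mirnas_b)
    n_a, n_b = len(set_a), len(set_b)
    shared = len(set_a & set_b)
    # Sum the hypergeometric probabilities of every overlap at least as large.
    tail = sum(
        comb(n_a, k) * comb(total_mirnas - n_a, n_b - k)
        for k in range(shared, min(n_a, n_b) + 1)
    )
    return tail / comb(total_mirnas, n_b)


# Hypothetical miRNAs regulating the lncRNA group and the mRNA group of a module.
lncrna_mirnas = {f"miR-{i}" for i in range(1, 11)}   # miR-1 .. miR-10
mrna_mirnas = {f"miR-{i}" for i in range(6, 16)}     # miR-6 .. miR-15, 5 shared
disjoint_mirnas = {f"miR-{i}" for i in range(50, 60)}

p_shared = sharing_p_value(100, lncrna_mirnas, mrna_mirnas)
p_disjoint = sharing_p_value(100, lncrna_mirnas, disjoint_mirnas)
print("p-value with 5 shared miRNAs:", p_shared)
print("p-value with no shared miRNAs:", p_disjoint)
```

A small p-value indicates that the two groups share more miRNAs than expected by chance. Under the stated principles, a candidate module would additionally need a high canonical correlation between the expression of its lncRNAs and mRNAs.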

Biclique: An R package for Maximal Biclique Enumeration in Bipartite Graphs

10.21203/rs.2.16755/v1 ◽ 2019

Author(s): Yuping Lu ◽ Charles A. Phillips ◽ Michael A. Langston

Keyword(s): State Of The Art ◽ Basic Research ◽ R Package ◽ Bipartite Graphs ◽ Heterogeneous Data ◽ General Purpose ◽ Public Repository ◽ Data Types ◽ Statistical Programming ◽ Reference Manual

Abstract Objective Bipartite graphs are widely used to model relationships between pairs of heterogeneous data types. Maximal bicliques are foundational structures in such graphs, and their enumeration is an important task in systems biology, epidemiology and many other problem domains. Thus, there is a need for an efficient, general-purpose, publicly available tool to enumerate maximal bicliques in bipartite graphs. The statistical programming language R is a logical choice for such a tool, but until now no R package has existed for this purpose. Our objective is to provide such a package, so that the research community can more easily perform this computationally demanding task. Results Biclique is an R package that takes as input a bipartite graph and produces a listing of all maximal bicliques in this graph. Input and output formats are straightforward, with examples provided both in this paper and in the package documentation. Biclique employs a state-of-the-art algorithm previously developed for basic research in functional genomics. This package, along with its source code and reference manual, is freely available from the CRAN public repository at https://cran.r-project.org/web/packages/biclique/index.html.

DECO: decompose heterogeneous population cohorts for patient stratification and discovery of sample biomarkers using omic data profiling

Bioinformatics ◽ 10.1093/bioinformatics/btz148 ◽ 2019 ◽ Vol 35 (19) ◽ pp. 3651-3662 ◽ Cited By ~ 1

Author(s): F J Campos-Laborie ◽ A Risueño ◽ M Ortiz-Estévez ◽ B Rosón-Burgo ◽ C Droste ◽ ...

Keyword(s): Correspondence Analysis ◽ Large Scale ◽ Simulated Data ◽ R Package ◽ Heterogeneous Data ◽ Supplementary Information ◽ Patient Stratification ◽ Differential Analysis ◽ Data Profiling ◽ Omic Data

Abstract Motivation Patient and sample diversity is one of the main challenges when dealing with clinical cohorts in biomedical genomics studies. During the last decade, several methods have been developed to identify biomarkers assigned to specific individuals or subtypes of samples. However, current methods still fail to discover markers in complex scenarios where heterogeneity or hidden phenotypical factors are present. Here, we propose a method to analyze and understand heterogeneous data while avoiding classical normalization approaches that reduce or remove variation. Results DEcomposing heterogeneous Cohorts using Omic data profiling (DECO) is a method to find significant associations between biological features (biomarkers) and samples (individuals) by analyzing large-scale omic data. The method identifies and categorizes biomarkers of specific phenotypic conditions based on a recurrent differential analysis integrated with a non-symmetrical correspondence analysis. DECO integrates both omic data dispersion and the predictor–response relationship from non-symmetrical correspondence analysis in a single statistic (called the h-statistic), allowing the identification of closely related sample categories within complex cohorts. The performance is demonstrated using simulated data and five experimental transcriptomic datasets, and is compared with that of seven other methods. We show that DECO greatly enhances the discovery and subtle identification of biomarkers, making it especially suited for deep and accurate patient stratification. Availability and implementation DECO is freely available as an R package (including a practical vignette) at the Bioconductor repository (http://bioconductor.org/packages/deco/). Supplementary information Supplementary data are available at Bioinformatics online.
