pathfindR: An R Package for Comprehensive Identification of Enriched Pathways in Omics Data Through Active Subnetworks

Given the ever-increasing amount of high-dimensional and complex omics data becoming available, it is increasingly important to discover simple but effective methods of analysis. Divergence analysis transforms each entry of a high-dimensional omics profile into a digitized (binary or ternary) code based on the deviation of the entry from a given baseline population. This is a novel framework that is significantly different from existing omics data analysis methods: it allows digitization of continuous omics data at the univariate or multivariate level, facilitates sample level analysis, and is applicable on many different omics platforms. The divergence package, available on the R platform through the Bioconductor repository collection, provides easy-to-use functions for carrying out this transformation. Here we demonstrate how to use the package with sample high throughput sequencing data from the Cancer Genome Atlas.

An R package for divergence analysis of omics data

PLoS ONE ◽

10.1371/journal.pone.0249002 ◽

2021 ◽

Vol 16 (4) ◽

pp. e0249002

Author(s):

Wikum Dinalankara ◽

Qian Ke ◽

Donald Geman ◽

Luigi Marchionni

Keyword(s):

R Package ◽

The Cancer Genome Atlas ◽

High Dimensional ◽

Omics Data ◽

Ternary Code ◽

Cancer Genome Atlas ◽

Level Analysis ◽

Data Analysis Methods ◽

Genome Atlas ◽

Omics Data Analysis

Given the ever-increasing amount of high-dimensional and complex omics data becoming available, it is increasingly important to discover simple but effective methods of analysis. Divergence analysis transforms each entry of a high-dimensional omics profile into a digitized (binary or ternary) code based on the deviation of the entry from a given baseline population. This is a novel framework that is significantly different from existing omics data analysis methods: it allows digitization of continuous omics data at the univariate or multivariate level, facilitates sample level analysis, and is applicable on many different omics platforms. The divergence package, available on the R platform through the Bioconductor repository collection, provides easy-to-use functions for carrying out this transformation. Here we demonstrate how to use the package with data from the Cancer Genome Atlas.
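The digitization step described above can be illustrated with a minimal Python sketch. It is not the divergence package's implementation or API: the function names are hypothetical, and the baseline range is assumed here to be a simple percentile interval per feature, which stands in for whatever baseline estimate the package actually uses. Each entry of a sample profile is coded -1, 0, or +1 depending on whether it falls below, inside, or above the baseline interval of its feature.

```python
from statistics import quantiles


def baseline_interval(values, lower_pct=5, upper_pct=95):
    """Return (low, high) percentile bounds of the baseline values.

    Assumption: a percentile interval is used as the baseline range.
    """
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
    return cuts[lower_pct - 1], cuts[upper_pct - 1]


def ternary_code(value, interval):
    """Code a single entry as -1 (below), 0 (inside), or +1 (above)."""
    low, high = interval
    if value < low:
        return -1
    if value > high:
        return 1
    return 0


def digitize_profile(profile, baseline):
    """Digitize one sample profile.

    profile:  dict mapping feature name -> value for one sample
    baseline: dict mapping feature name -> list of baseline values
    """
    return {
        feature: ternary_code(value, baseline_interval(baseline[feature]))
        for feature, value in profile.items()
    }
```

A binary code, as also mentioned in the abstract, could be obtained from this sketch by taking the absolute value of each ternary entry, so that 0 means "within baseline" and 1 means "divergent".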

Vertical and horizontal integration of multi-omics data with miodin

BMC Bioinformatics ◽

10.1186/s12859-019-3224-4 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 13

Author(s):

Benjamin Ulfenborg

Keyword(s):

Data Analysis ◽

R Package ◽

Heterogeneous Data ◽

Omics Data ◽

Technical Expertise ◽

Horizontal Integration ◽

Level Data ◽

Genomics And Proteomics ◽

Health And Disease ◽

Molecular Layers

Background: Studies on multiple modalities of omics data such as transcriptomics, genomics and proteomics are growing in popularity, since they allow us to investigate complex mechanisms across molecular layers. It is widely recognized that integrative omics analysis holds the promise to unlock novel and actionable biological insights into health and disease. Integration of multi-omics data remains challenging, however, and requires combination of several software tools and extensive technical expertise to account for the properties of heterogeneous data. Results: This paper presents the miodin R package, which provides a streamlined workflow-based syntax for multi-omics data analysis. The package allows users to perform analysis of omics data either across experiments on the same samples (vertical integration), or across studies on the same variables (horizontal integration). Workflows have been designed to promote transparent data analysis and reduce the technical expertise required to perform low-level data import and processing. Conclusions: The miodin package is implemented in R and is freely available for use and extension under the GPL-3 license. Package source, reference documentation and user manual are available at https://gitlab.com/algoromics/miodin.
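The distinction between vertical and horizontal integration can be made concrete with a small Python sketch. This is not the miodin API: the function names and the nested-dictionary data layout (sample ID mapping to variable/value pairs) are assumptions chosen for illustration, and variable names are assumed to be unique across omics layers.

```python
def vertical_integration(*omics_tables):
    """Combine several omics layers measured on the same samples.

    Each table maps sample ID -> {variable: value}. Only samples present
    in every layer are kept, and their variables are merged side by side.
    """
    shared_samples = set(omics_tables[0])
    for table in omics_tables[1:]:
        shared_samples &= set(table)
    merged = {}
    for sample in sorted(shared_samples):
        row = {}
        for table in omics_tables:
            row.update(table[sample])
        merged[sample] = row
    return merged


def horizontal_integration(*study_tables):
    """Combine several studies that measured the same variables.

    Each table maps sample ID -> {variable: value}. Only variables present
    in every sample of every study are kept, and samples are stacked with
    a study prefix so that IDs stay unique.
    """
    shared_vars = None
    for table in study_tables:
        for row in table.values():
            shared_vars = set(row) if shared_vars is None else shared_vars & set(row)
    stacked = {}
    for index, table in enumerate(study_tables, start=1):
        for sample, row in table.items():
            stacked[f"study{index}:{sample}"] = {v: row[v] for v in sorted(shared_vars)}
    return stacked
```

In this sketch, vertical integration widens the data (more variables per sample), while horizontal integration lengthens it (more samples per variable).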

UCSCXenaShiny: An R Package for Exploring and Analyzing UCSC Xena Public Datasets in Web Browser

10.20944/preprints202007.0179.v1 ◽

2020 ◽

Author(s):

Shixiang Wang ◽

Yi Xiong ◽

Kai Gu ◽

Longfei Zhao ◽

Yin Li ◽

...

Keyword(s):

R Package ◽

Data Availability ◽

Analysis Tool ◽

Omics Data ◽

Analysis Framework ◽

Web Browser ◽

Research Opportunities ◽

Public Projects ◽

R Shiny ◽

Public Datasets

Motivation: The UCSC Xena platform provides huge amounts of processed cancer omics data from big public projects like TCGA or individual research groups, enabling unprecedented research opportunities. In 2019, we developed UCSCXenaTools, an R package for retrieval of UCSC Xena data. However, an easier dataset exploration and analysis tool is still lacking, especially for researchers without programming experience. Results: We develop UCSCXenaShiny, an R Shiny package to quickly explore and download all datasets from UCSC Xena data hubs. In addition, a module-based analysis framework is constructed to analyze and visualize data. Availability: https://github.com/openbiox/UCSCXenaShiny or https://cran.r-project.org/package=UCSCXenaShiny.

multiomics: A user-friendly multi-omics data harmonisation R pipeline

F1000Research ◽

10.12688/f1000research.53453.1 ◽

2021 ◽

Vol 10 ◽

pp. 538

Author(s):

Tyrone Chen ◽

Al J Abadi ◽

Kim-Anh Lê Cao ◽

Sonika Tyagi

Keyword(s):

Data Integration ◽

Case Studies ◽

R Package ◽

Heterogeneous Data ◽

General Purpose ◽

Omics Data ◽

Experimental Conditions ◽

Seamless Integration ◽

Data Object ◽

Omics Data Integration

Data from multiple omics layers of a biological system is growing in quantity, heterogeneity and dimensionality. Simultaneous multi-omics data integration is a growing field of research as it has strong potential to unlock information on previously hidden biological relationships leading to early diagnosis, prognosis and expedited treatments. Many tools for multi-omics data integration are being developed. However, these tools are often restricted to highly specific experimental designs, and types of omics data. While some general methods do exist, they require specific data formats and experimental conditions. A major limitation in the field is a lack of a single or multi-omics pipeline which can accept data in an unrefined, information-rich form pre-integration and subsequently generate output for further investigation. There is an increasing demand for a generic multi-omics pipeline to facilitate general-purpose data exploration and analysis of heterogeneous data. Therefore, we present our R multiomics pipeline as an easy to use and flexible pipeline that takes unrefined multi-omics data as input, sample information and user-specified parameters to generate a list of output plots and data tables for quality control and downstream analysis. We have demonstrated application of the pipeline on two separate COVID-19 case studies. We enabled limited checkpointing where intermediate output is staged to allow continuation after errors or interruptions in the pipeline and generate a script for reproducing the analysis to improve reproducibility. A seamless integration with the mixOmics R package is achieved, as the R data object can be loaded and manipulated with mixOmics functions. Our pipeline can be installed as an R package or from the git repository, and is accompanied by detailed documentation with walkthroughs on two case studies. The pipeline is also available as Docker and Singularity containers.

A latent unknown clustering integrating multi-omics data (LUCID) with phenotypic traits

Bioinformatics ◽

10.1093/bioinformatics/btz667 ◽

2019 ◽

Vol 36 (3) ◽

pp. 842-850 ◽

Cited By ~ 4

Author(s):

Cheng Peng ◽

Jun Wang ◽

Isaac Asante ◽

Stan Louie ◽

Ran Jin ◽

...

Keyword(s):

Real Data ◽

R Package ◽

Integrative Model ◽

Supplementary Information ◽

Phenotypic Traits ◽

Omics Data ◽

Data Types ◽

Specific Effects ◽

Metabolomic Data ◽

Future Prediction

Motivation: Epidemiologic, clinical and translational studies are increasingly generating multiplatform omics data. Methods that can integrate across multiple high-dimensional data types while accounting for differential patterns are critical for uncovering novel associations and underlying relevant subgroups. Results: We propose an integrative model to estimate latent unknown clusters (LUCID) aiming to both distinguish unique genomic, exposure and informative biomarkers/omic effects while jointly estimating subgroups relevant to the outcome of interest. Simulation studies indicate that we can obtain consistent estimates reflective of the true simulated values, accurately estimate subgroups and recapitulate subgroup-specific effects. We also demonstrate the use of the integrated model for future prediction of risk subgroups and phenotypes. We apply this approach to two real data applications to highlight the integration of genomic, exposure and metabolomic data. Availability and implementation: The LUCID method is implemented through the LUCIDus R package available on CRAN (https://CRAN.R-project.org/package=LUCIDus). Supplementary information: Supplementary materials are available at Bioinformatics online.

Multi-kernel linear mixed model with adaptive lasso for prediction analysis on high-dimensional multi-omics data

Bioinformatics ◽

10.1093/bioinformatics/btz822 ◽

2019 ◽

Vol 36 (6) ◽

pp. 1785-1794

Author(s):

Jun Li ◽

Qing Lu ◽

Yalu Wen

Keyword(s):

Risk Prediction ◽

Mixed Model ◽

Linear Mixed Model ◽

R Package ◽

Kernel Functions ◽

Adaptive Lasso ◽

Supplementary Information ◽

High Dimensional ◽

Omics Data ◽

Modeling Framework

Motivation: The use of human genome discoveries and other established factors to build an accurate risk prediction model is an essential step toward precision medicine. While multi-layer high-dimensional omics data provide unprecedented data resources for prediction studies, their corresponding analytical methods are much less developed. Results: We present a multi-kernel penalized linear mixed model with adaptive lasso (MKpLMM), a predictive modeling framework that extends the standard linear mixed models widely used in genomic risk prediction, for multi-omics data analysis. MKpLMM can capture not only the predictive effects from each layer of omics data but also their interactions by using multiple kernel functions. It adopts a data-driven approach to select predictive regions as well as predictive layers of omics data, and achieves robust selection performance. Through extensive simulation studies, the analyses of PET-imaging outcomes from the Alzheimer’s Disease Neuroimaging Initiative study, and the analyses of 64 drug responses, we demonstrate that MKpLMM consistently outperforms competing methods in phenotype prediction. Availability and implementation: The R package is available at https://github.com/YaluWen/OmicPred. Supplementary information: Supplementary data are available at Bioinformatics online.

Identification of disease-associated loci using machine learning for genotype and network data integration

Bioinformatics ◽

10.1093/bioinformatics/btz310 ◽

2019 ◽

Vol 35 (24) ◽

pp. 5182-5190 ◽

Cited By ~ 4

Author(s):

Luis G Leal ◽

Alessia David ◽

Marjo-Riita Jarvelin ◽

Sylvain Sebert ◽

Minna Männikkö ◽

...

Keyword(s):

Machine Learning ◽

Gene Networks ◽

Association Studies ◽

R Package ◽

Biological Data ◽

Machine Learning Algorithms ◽

Supplementary Information ◽

Genome Wide Association Studies ◽

Omics Data ◽

Missing Heritability

Motivation: Integration of different omics data could markedly help to identify biological signatures, understand the missing heritability of complex diseases and ultimately achieve personalized medicine. Standard regression models used in Genome-Wide Association Studies (GWAS) identify loci with a strong effect size, whereas GWAS meta-analyses are often needed to capture weak loci contributing to the missing heritability. Development of novel machine learning algorithms for merging genotype data with other omics data is highly needed as it could enhance the prioritization of weak loci. Results: We developed cNMTF (corrected non-negative matrix tri-factorization), an integrative algorithm based on clustering techniques of biological data. This method assesses the inter-relatedness between genotypes, phenotypes, the damaging effect of the variants and gene networks in order to identify loci-trait associations. cNMTF was used to prioritize genes associated with lipid traits in two population cohorts. We replicated 129 genes reported in GWAS world-wide and provided evidence that supports 85% of our findings (226 out of 265 genes), including recent associations in literature (NLGN1), regulators of lipid metabolism (DAB1) and pleiotropic genes for lipid traits (CARM1). Moreover, cNMTF performed efficiently against strong population structures by accounting for the individuals’ ancestry. As the method is flexible in the incorporation of diverse omics data sources, it can be easily adapted to the user’s research needs. Availability and implementation: An R package (cnmtf) is available at https://lgl15.github.io/cnmtf_web/index.html. Supplementary information: Supplementary data are available at Bioinformatics online.

i2dash: Creation of Flexible, Interactive and Web-based Dashboards for Visualization of Omics-pipeline Results

10.1101/2020.07.06.189563 ◽

2020 ◽

Author(s):

Arsenij Ustjanzew ◽

Jens Preussner ◽

Mette Bentsen ◽

Carsten Kuenne ◽

Mario Looso

Keyword(s):

Single Cell ◽

Data Visualization ◽

Large Scale ◽

R Package ◽

Cloud Services ◽

Sequencing Analysis ◽

Omics Data ◽

Web Based ◽

Automated Data Processing ◽

Generic Design

Data visualization and interactive data exploration are important aspects of illustrating complex concepts and results from analyses of omics data. A suitable visualization has to be intuitive and accessible. Web-based dashboards have become popular tools for the arrangement, consolidation and display of such visualizations. However, the combination of automated data processing pipelines handling omics data and dynamically generated, interactive dashboards is poorly solved. Here, we present i2dash, an R package intended to encapsulate functionality for programmatic creation of customized dashboards. It supports interactive and responsive (linked) visualizations across a set of predefined graphical layouts. i2dash addresses the needs of data analysts for a tool that is compatible and attachable to any R-based analysis pipeline, thereby fostering the separation of data visualization on one hand and data analysis tasks on the other hand. In addition, the generic design of i2dash enables data analysts to generate modular extensions for specific needs. As a proof of principle, we provide an extension of i2dash optimized for single-cell RNA-sequencing analysis, supporting the creation of dashboards for the visualization needs of single-cell sequencing experiments. Equipped with these features, i2dash is suitable for extensive use in large scale sequencing/bioinformatics facilities. Along this line, we provide i2dash as a containerized solution, enabling a straightforward large-scale deployment and sharing of dashboards using cloud services. i2dash is freely available via the R package archive CRAN.

Robust Identification of Temporal Biomarkers in Longitudinal Omics Studies

10.1101/2021.11.19.469350 ◽

2021 ◽

Author(s):

Ahmed A. Metwally ◽

Tom Zhang ◽

Si Wu ◽

Ryan Kellogg ◽

Wenyu Zhou ◽

...

Keyword(s):

Empirical Distribution ◽

Temporal Patterns ◽

R Package ◽

Smoothing Splines ◽

Omics Data ◽

Differential Analysis ◽

Time Intervals ◽

Robust Identification ◽

Data Points ◽

Subject Dropout

Longitudinal studies increasingly collect rich 'omics' data sampled frequently over time and across large cohorts to capture dynamic health fluctuations and disease transitions. However, the generation of longitudinal omics data has preceded the development of analysis tools that can efficiently extract insights from such data. In particular, there is a need for statistical frameworks that can identify not only which omics features are differentially regulated between groups but also over what time intervals. Additionally, longitudinal omics data may have inconsistencies, including nonuniform sampling intervals, missing data points, subject dropout, and differing numbers of samples per subject. In this work, we developed a statistical method that provides robust identification of time intervals of temporal omics biomarkers. The proposed method is based on a semi-parametric approach, in which we use smoothing splines to model longitudinal data and infer significant time intervals of omics features based on an empirical distribution constructed through a permutation procedure. We benchmarked the proposed method on five simulated datasets with diverse temporal patterns, and the method showed specificity greater than 0.99 and sensitivity greater than 0.72. Applying the proposed method to the Integrative Personal Omics Profiling (iPOP) cohort revealed temporal patterns of amino acids, lipids, and hormone metabolites that are differentially regulated in male versus female subjects following a respiratory infection. In addition, we applied the method to a longitudinal multi-omics dataset of pregnant women with and without preeclampsia, and it identified potential lipid markers whose temporal profiles differ significantly between the two groups. We provide an open-source R package, OmicsLonDA (Omics Longitudinal Differential Analysis), at https://bioconductor.org/packages/OmicsLonDA to enable widespread use.
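The permutation idea behind the interval test can be sketched in Python. This is a simplified illustration, not the OmicsLonDA implementation: the smoothing splines of the proposed method are replaced here by plain per-group mean curves, all subjects are assumed to be measured at the same time points, and the function names are hypothetical. For each interval between consecutive time points, the test statistic is the trapezoidal area between the two group mean curves, and its empirical null distribution is built by shuffling subjects between groups.

```python
import random


def mean_curve(subjects):
    """Average the subjects' trajectories at each shared time point."""
    n_time = len(subjects[0])
    return [sum(s[t] for s in subjects) / len(subjects) for t in range(n_time)]


def interval_areas(group_a, group_b, times):
    """Signed trapezoidal area between group mean curves, per interval."""
    diff = [a - b for a, b in zip(mean_curve(group_a), mean_curve(group_b))]
    return [
        0.5 * (diff[i] + diff[i + 1]) * (times[i + 1] - times[i])
        for i in range(len(times) - 1)
    ]


def permutation_pvalues(group_a, group_b, times, n_perm=500, seed=0):
    """Empirical p-value per time interval from label permutations."""
    rng = random.Random(seed)
    observed = interval_areas(group_a, group_b, times)
    pooled = group_a + group_b
    n_a = len(group_a)
    exceed = [0] * len(observed)
    for _ in range(n_perm):
        shuffled = pooled[:]
        rng.shuffle(shuffled)  # reassign subjects to groups at random
        permuted = interval_areas(shuffled[:n_a], shuffled[n_a:], times)
        for i, (perm_stat, obs_stat) in enumerate(zip(permuted, observed)):
            if abs(perm_stat) >= abs(obs_stat):
                exceed[i] += 1
    # Add-one correction keeps the p-value strictly positive.
    return [(count + 1) / (n_perm + 1) for count in exceed]
```

Intervals with a small p-value are those where the two groups' trajectories differ more than expected under random group assignment. Handling nonuniform sampling, missing points, and dropout, which the abstract lists as goals of the method, would require fitting a curve per subject or group rather than averaging on a shared grid.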
