Partition: a surjective mapping approach for dimensionality reduction

Joshua Millstein; Francesca Battaglin; Malcolm Barrett; Shu Cao; Wu Zhang; Sebastian Stintzing; Volker Heinemann; Heinz-Josef Lenz

doi:10.1093/bioinformatics/btz661

Bioinformatics, 2019, Vol 36 (3), pp. 676-681. doi:10.1093/bioinformatics/btz661. Cited by ~1.

Author(s): Joshua Millstein; Francesca Battaglin; Malcolm Barrett; Shu Cao; Wu Zhang; ...

Keyword(s): Dimensionality Reduction; Multiple Testing; Progression Free Survival; Real Data; R Package; Information Loss; Response To Treatment; Supplementary Information; Surjective Mapping; New Feature

Abstract. Motivation: Large amounts of information generated by genomic technologies are accompanied by statistical and computational challenges due to redundancy, badly behaved data and noise. Dimensionality reduction (DR) methods have been developed to mitigate these challenges. However, many approaches are not scalable to large dimensions or result in excessive information loss. Results: The proposed approach partitions data into subsets of related features and summarizes each into one and only one new feature, thus defining a surjective mapping. A constraint on information loss determines the size of the reduced dataset. Simulation studies demonstrate that when multiple related features are associated with a response, this approach can substantially increase the number of true associations detected as compared to principal components analysis, non-negative matrix factorization or no DR. This increase in true discoveries is explained both by a reduced multiple-testing challenge and a reduction in extraneous noise. In an application to real data collected from metastatic colorectal cancer tumors, more associations between gene expression features and progression free survival and response to treatment were detected in the reduced than in the full untransformed dataset. Availability and implementation: Freely available R package from CRAN, https://cran.r-project.org/package=partition. Supplementary information: Supplementary data are available at Bioinformatics online.
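The partition-and-summarize idea in this abstract can be illustrated with a small sketch. This is not the algorithm or API of the partition R package; it is a simplified greedy variant written in plain Python under stated assumptions. The function names (`partition`, `information`, `summarize`), the use of the per-sample mean as the summary, the "smallest squared correlation with the summary" information measure, and the 0.7 threshold are all hypothetical choices made for illustration. The simulated data are made up.

```python
import math
import random
from itertools import combinations


def standardize(col):
    """Center a feature and scale it to unit sample variance."""
    n = len(col)
    mean = sum(col) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (n - 1))
    return [(x - mean) / sd for x in col]


def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def summarize(features, members):
    """Summarize a subset of features by their per-sample mean (an assumption)."""
    n = len(features[members[0]])
    return [sum(features[j][i] for j in members) / len(members) for i in range(n)]


def information(features, members):
    """Hypothetical information measure: the smallest squared correlation
    between any member feature and the subset summary."""
    if len(members) == 1:
        return 1.0
    s = summarize(features, members)
    return min(corr(features[j], s) ** 2 for j in members)


def partition(columns, threshold=0.7):
    """Greedily merge feature subsets while the merged subset keeps
    information >= threshold. Each original feature ends up in exactly one
    subset, and each subset yields exactly one new feature."""
    features = [standardize(c) for c in columns]
    subsets = [[j] for j in range(len(features))]
    while True:
        best = None
        for a, b in combinations(range(len(subsets)), 2):
            info = information(features, subsets[a] + subsets[b])
            if info >= threshold and (best is None or info > best[0]):
                best = (info, a, b)
        if best is None:
            break  # no remaining merge satisfies the information constraint
        _, a, b = best  # a < b, so deleting b leaves index a valid
        subsets[a] = subsets[a] + subsets[b]
        del subsets[b]
    reduced = [summarize(features, m) for m in subsets]
    return subsets, reduced


# Made-up data: two blocks of related features plus one unrelated feature.
random.seed(0)
n = 200
latent1 = [random.gauss(0, 1) for _ in range(n)]
latent2 = [random.gauss(0, 1) for _ in range(n)]
columns = (
    [[z + random.gauss(0, 0.3) for z in latent1] for _ in range(3)]
    + [[z + random.gauss(0, 0.3) for z in latent2] for _ in range(2)]
    + [[random.gauss(0, 1) for _ in range(n)]]
)
subsets, reduced = partition(columns, threshold=0.7)
print(subsets)       # which original features map to each new feature
print(len(reduced))  # size of the reduced dataset
```

In this sketch the threshold plays the role of the constraint on information loss: raising it allows fewer merges and so yields a larger reduced dataset, while lowering it yields a smaller one.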

Detection of differentially methylated CpG sites between tumor samples with uneven tumor purities

Bioinformatics, 2019, Vol 36 (7), pp. 2017-2024. doi:10.1093/bioinformatics/btz885.

Author(s): Weiwei Zhang; Ziyi Li; Nana Wei; Hua-Jun Wu; Xiaoqi Zheng

Keyword(s): Real Data; R Package; Differential Methylation; Least Square; Epigenetic Mechanism; Supplementary Information; CpG Sites; Tumor Purity; Different Sources; Normal Controls

Abstract. Motivation: Inference of differentially methylated (DM) CpG sites between two groups of tumor samples with different geno- or pheno-types is a critical step to uncover the epigenetic mechanism of tumorigenesis and to identify biomarkers for cancer subtyping. However, tumor purity is a major confounding factor: uneven distributions of tumor purity between two groups of tumor samples will lead to biased discovery of DM sites if not properly accounted for. Results: We here propose InfiniumDM, a generalized least square model to adjust for the tumor purity effect in differential methylation analysis. Our method is applicable to a variety of experimental designs, including designs with or without normal controls and with different sources of normal tissue contamination. We compared our method with conventional methods including minfi, limma and limma corrected by tumor purity using simulated datasets. Our method shows significantly better performance at different levels of differential methylation thresholds, sample sizes, mean purity deviations and so on. We also applied the proposed method to breast cancer samples from the TCGA database to further evaluate its performance. Overall, both simulation and real data analyses demonstrate favorable performance over existing methods serving a similar purpose. Availability and implementation: InfiniumDM is a part of the R package InfiniumPurify, which is freely available from GitHub (https://github.com/Xiaoqizheng/InfiniumPurify). Supplementary information: Supplementary data are available at Bioinformatics online.

Chicdiff: a computational pipeline for detecting differential chromosomal interactions in Capture Hi-C data

Bioinformatics, 2019, Vol 35 (22), pp. 4764-4766. doi:10.1093/bioinformatics/btz450. Cited by ~3.

Author(s): Jonathan Cairns; William R Orchard; Valeriya Malysheva; Mikhail Spivakov

Keyword(s): Multiple Testing; Positive Association; R Package; Supplementary Information; Gene Promoters; Robust Detection; Testing Procedures; Multiple Testing Procedures; Powerful Approach; Cell Type Specific

Abstract. Summary: Capture Hi-C is a powerful approach for detecting chromosomal interactions involving, at least on one end, DNA regions of interest, such as gene promoters. We present Chicdiff, an R package for robust detection of differential interactions in Capture Hi-C data. Chicdiff enhances a state-of-the-art differential testing approach for count data with bespoke normalization and multiple testing procedures that account for specific statistical properties of Capture Hi-C. We validate Chicdiff on published Promoter Capture Hi-C data in human Monocytes and CD4+ T cells, identifying multitudes of cell type-specific interactions, and confirming the overall positive association between promoter interactions and gene expression. Availability and implementation: Chicdiff is implemented as an R package that is publicly available at https://github.com/RegulatoryGenomicsGroup/chicdiff. Supplementary information: Supplementary data are available at Bioinformatics online.

powmic: an R package for power assessment in microbiome case–control studies

Bioinformatics, 2020, Vol 36 (11), pp. 3563-3565. doi:10.1093/bioinformatics/btaa197.

Author(s): Li Chen

Keyword(s): Power Analysis; Real Data; Analytical Form; R Package; Case Control; Supplementary Information; Metagenomic Sequencing; Case Control Studies; Simulation Based; Over Dispersion

Abstract. Summary: Power analysis is essential to decide the sample size of metagenomic sequencing experiments in a case–control study for identifying differentially abundant (DA) microbes. However, the complexity of microbial data characteristics, such as excessive zeros, over-dispersion, compositionality, intrinsic microbial correlations and variable sequencing depths, makes the power analysis particularly challenging because the analytical form is usually unavailable. Here, we develop a simulation-based power assessment strategy and R package powmic, which considers the complexity of microbial data characteristics. A real data example demonstrates the usage of powmic. Availability and implementation: The powmic R package and online tutorial are available at https://github.com/lichen-lab/powmic. Supplementary information: Supplementary data are available at Bioinformatics online.
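Simulation-based power assessment, as described in this abstract, can be sketched in a few lines: simulate case and control counts many times under an assumed effect, test each simulated dataset, and report the fraction of significant results. The sketch below is not powmic and does not use its API. It assumes a zero-inflated negative binomial model for a single taxon, a Welch-type statistic on log-transformed counts with a normal critical value of 1.96, and made-up parameter values; it ignores compositionality, between-taxon correlation and sequencing depth. All function names are hypothetical.

```python
import math
import random


def rpois(lam, rng):
    """Poisson draw using Knuth's multiplication method (adequate for moderate lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def rzinb(mean, dispersion, zero_prob, rng):
    """Zero-inflated negative binomial draw via a gamma-Poisson mixture.
    Non-zero-inflated part has variance mean + dispersion * mean**2."""
    if rng.random() < zero_prob:
        return 0  # structural (excess) zero
    shape = 1.0 / dispersion
    lam = rng.gammavariate(shape, mean / shape)  # E[lam] = mean
    return rpois(lam, rng)


def welch_statistic(x, y):
    """Difference in means divided by its unpooled standard error."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    se = math.sqrt(vx / nx + vy / ny)
    return (mx - my) / se if se > 0 else 0.0


def estimate_power(n_per_group, fold_change, n_sim=200, base_mean=20.0,
                   dispersion=0.5, zero_prob=0.2, z_crit=1.96, seed=1):
    """Fraction of simulated studies in which the taxon is declared
    differentially abundant (all parameter defaults are illustrative)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        ctrl = [math.log1p(rzinb(base_mean, dispersion, zero_prob, rng))
                for _ in range(n_per_group)]
        case = [math.log1p(rzinb(base_mean * fold_change, dispersion, zero_prob, rng))
                for _ in range(n_per_group)]
        if abs(welch_statistic(case, ctrl)) > z_crit:
            hits += 1
    return hits / n_sim


power_small = estimate_power(n_per_group=10, fold_change=3.0)
power_large = estimate_power(n_per_group=50, fold_change=3.0)
print(power_small, power_large)
```

Running `estimate_power` over a grid of sample sizes gives an empirical power curve from which a sample size can be read off, which is the practical use of this kind of simulation when no analytical power formula is available.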

A latent unknown clustering integrating multi-omics data (LUCID) with phenotypic traits

Bioinformatics, 2019, Vol 36 (3), pp. 842-850. doi:10.1093/bioinformatics/btz667. Cited by ~4.

Author(s): Cheng Peng; Jun Wang; Isaac Asante; Stan Louie; Ran Jin; ...

Keyword(s): Real Data; R Package; Integrative Model; Supplementary Information; Phenotypic Traits; Omics Data; Data Types; Specific Effects; Metabolomic Data; Future Prediction

Abstract. Motivation: Epidemiologic, clinical and translational studies are increasingly generating multiplatform omics data. Methods that can integrate across multiple high-dimensional data types while accounting for differential patterns are critical for uncovering novel associations and underlying relevant subgroups. Results: We propose an integrative model to estimate latent unknown clusters (LUCID) aiming to both distinguish unique genomic, exposure and informative biomarkers/omic effects while jointly estimating subgroups relevant to the outcome of interest. Simulation studies indicate that we can obtain consistent estimates reflective of the true simulated values, accurately estimate subgroups and recapitulate subgroup-specific effects. We also demonstrate the use of the integrated model for future prediction of risk subgroups and phenotypes. We apply this approach to two real data applications to highlight the integration of genomic, exposure and metabolomic data. Availability and implementation: The LUCID method is implemented through the LUCIDus R package available on CRAN (https://CRAN.R-project.org/package=LUCIDus). Supplementary information: Supplementary materials are available at Bioinformatics online.

scBatch: batch-effect correction of RNA-seq data through sample distance matrix adjustment

Bioinformatics, 2020, Vol 36 (10), pp. 3115-3123. doi:10.1093/bioinformatics/btaa097. Cited by ~3.

Author(s): Teng Fei; Tianwei Yu

Keyword(s): Single Cell; Differential Expression Analysis; Distance Matrix; Real Data; R Package; Batch Effect; Supplementary Information; RNA Seq; Sequencing Data; Gene Differential Expression

Abstract. Motivation: Batch effect is a frequent challenge in deep sequencing data analysis that can lead to misleading conclusions. Existing methods do not correct batch effects satisfactorily, especially with single-cell RNA sequencing (RNA-seq) data. Results: We present scBatch, a numerical algorithm for batch-effect correction on bulk and single-cell RNA-seq data with emphasis on improving both clustering and gene differential expression analysis. scBatch is not restricted by assumptions on the mechanism of batch-effect generation. As shown in simulations and real data analyses, scBatch outperforms benchmark batch-effect correction methods. Availability and implementation: The R package is available at github.com/tengfei-emory/scBatch. The code to generate results and figures in this article is available at github.com/tengfei-emory/scBatch-paper-scripts. Supplementary information: Supplementary data are available at Bioinformatics online.

SPARSim single cell: a count data simulator for scRNA-seq data

Bioinformatics, 2019. doi:10.1093/bioinformatics/btz752. Cited by ~2.

Author(s): Giacomo Baruzzo; Ilaria Patuzzi; Barbara Di Camillo

Keyword(s): Single Cell; Count Data; Simulated Data; Real Data; R Package; Supplementary Information; RNA Seq; Distribution Of Zeros; New Methods; Research Fields

Abstract. Motivation: Single cell RNA-seq (scRNA-seq) count data show many differences compared with bulk RNA-seq count data, making the application of many RNA-seq pre-processing/analysis methods not straightforward or even inappropriate. For this reason, the development of new methods for handling scRNA-seq count data is currently one of the most active research fields in bioinformatics. To help the development of such new methods, the availability of simulated data could play a pivotal role. However, only a few scRNA-seq count data simulators are available, and they often show poor or undemonstrated similarity with real data. Results: In this article we present SPARSim, a scRNA-seq count data simulator based on a Gamma-Multivariate Hypergeometric model. We demonstrate that SPARSim can generate count data that resemble real data in terms of count intensity, variability and sparsity, performing comparably to or better than one of the most widely used scRNA-seq simulators, Splat. In particular, SPARSim simulated count matrices closely resemble the distribution of zeros across different expression intensities observed in real count data. Availability and implementation: The SPARSim R package is freely available at http://sysbiobig.dei.unipd.it/?q=SPARSim and at https://gitlab.com/sysbiobig/sparsim. Supplementary information: Supplementary data are available at Bioinformatics online.

flexiMAP: a regression-based method for discovering differential alternative polyadenylation events in standard RNA-seq data

Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa854.

Author(s): Krzysztof J Szkop; David S Moss; Irene Nobeli

Keyword(s): Simulated Data; Alternative Polyadenylation; Real Data; R Package; Supplementary Information; Beta Regression; RNA Seq; Good Balance; Flexible Modeling; Specificity And Sensitivity

Abstract. Motivation: We present flexible Modeling of Alternative PolyAdenylation (flexiMAP), a new beta-regression-based method implemented in R, for discovering differential alternative polyadenylation events in standard RNA-seq data. Results: We show, using both simulated and real data, that flexiMAP exhibits a good balance between specificity and sensitivity and compares favourably to existing methods, especially at low fold changes. In addition, the tests on simulated data reveal some hitherto unrecognized caveats of existing methods. Importantly, flexiMAP allows modeling of multiple known covariates that often confound the results of RNA-seq data analysis. Availability and implementation: The flexiMAP R package is available at: https://github.com/kszkop/flexiMAP. Scripts and data to reproduce the analysis in this paper are available at: https://doi.org/10.5281/zenodo.3689788. Supplementary information: Supplementary data are available at Bioinformatics online.

NewWave: a scalable R/Bioconductor package for the dimensionality reduction and batch effect removal of single-cell RNA-seq data

2021. doi:10.1101/2021.08.02.453487.

Author(s): Federico Agostinis; Chiara Romualdi; Gabriele Sales; Davide Risso

Keyword(s): Dimensionality Reduction; Single Cell; R Package; Batch Effect; Supplementary Information; Bioconductor Package; RNA Seq; Sequencing Data; Bioconductor Project; Single Cell RNA Sequencing

Summary: We present NewWave, a scalable R/Bioconductor package for the dimensionality reduction and batch effect removal of single-cell RNA sequencing data. To achieve scalability, NewWave uses mini-batch optimization and can work with out-of-memory data, enabling users to analyze datasets with millions of cells. Availability and implementation: NewWave is implemented as an open-source R package available through the Bioconductor project at https://bioconductor.org/packages/NewWave/ Supplementary information: Supplementary data are available at Bioinformatics online.

Blood-based biomarkers in metastatic colorectal cancer patients treated with FOLFIRI plus regorafenib or placebo: Results from LCCC1029.

Journal of Clinical Oncology, 2019, Vol 37 (4_suppl), pp. 587-587. doi:10.1200/jco.2019.37.4_suppl.587.

Author(s): Yingmiao Liu; Kirsten Burdett; Mark D. Starr; J. Chris Brady; Ace Joseph Hatch; ...

Keyword(s): Colorectal Cancer; Metastatic Colorectal Cancer; Multiple Testing; Kinase Inhibitor; Predictive Biomarkers; Progression Free Survival; Primary Objective; Fold Change; Response To Treatment; Colorectal Cancer Patients

Abstract 587. Background: The LCCC1029 trial demonstrated that addition of the multitargeted kinase inhibitor regorafenib (Rego) to FOLFIRI in metastatic colorectal cancer (mCRC) patients (pts) modestly prolonged progression-free survival (PFS). In this preplanned analysis, circulating angiogenic and inflammatory proteins were explored as potential prognostic and predictive biomarkers of Rego benefit. Methods: Plasma samples from 149 mCRC pts (107 in Rego + FOLFIRI and 42 in placebo + FOLFIRI) were evaluated for 20 markers at baseline (n = 149) and cycle 1 day 21 (C1D21, n = 81). Predictive and prognostic values of each marker at baseline were analyzed for both PFS and overall survival (OS) using Cox proportional hazard models. On-treatment changes were quantified as fold change [log2(C1D21/baseline)] and differences between arms were evaluated using the Mann-Whitney test. Results: The primary objective of this study was to determine whether any marker was predictive of benefit with Rego for PFS. Although no treatment by marker interactions were significant after adjusting for multiple testing, the top three markers of interest were OPN (unadjusted p-value of 0.02), VCAM-1 (0.02), and PDGF-AA (0.04). VCAM-1 was also predictive for OS benefit in pts treated with Rego (unadjusted p = 0.01). Baseline levels of multiple markers (including HGF, IL-6, PlGF, VEGF-R1, OPN) were prognostic for both PFS and OS. Higher levels of these markers were associated with worse survival. Biomarker changes in response to treatment were explored and compared between arms. Fold changes of three markers (PlGF, VEGF-A, VCAM-1) were significantly different between arms (p < 0.0001), all being markedly up-regulated in the Rego arm compared to the placebo arm after treatment. Conclusions: In this hypothesis-generating report, VCAM-1, OPN, and PDGF-AA were the top biomarkers when analyzing the potential predictive association with PFS, where a lower hazard was observed for pts receiving Rego. Candidate prognostic markers were identified, including PlGF and VEGF-R1, key factors in VEGF biology. Biomarker changes observed here may offer insights into potential combinatorial strategies with Rego for future studies.
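The on-treatment change defined in the Methods, log2(C1D21/baseline), and the between-arm Mann-Whitney comparison can be worked through on a toy example. The marker values below are invented purely for illustration and are not trial data. The function names are hypothetical, and the p-value uses a plain normal approximation without tie or continuity correction, which is an assumption of this sketch and not a statement about how the study computed its p-values.

```python
import math


def log2_fold_change(baseline, on_treatment):
    """Per-patient fold change, log2(on_treatment / baseline)."""
    return [math.log2(t / b) for b, t in zip(baseline, on_treatment)]


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney(x, y):
    """Return (U for x, two-sided p-value from a normal approximation)."""
    nx, ny = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u = sum(ranks[:nx]) - nx * (nx + 1) / 2
    mean_u = nx * ny / 2
    sd_u = math.sqrt(nx * ny * (nx + ny + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, p


# Invented marker concentrations (arbitrary units), not data from the trial.
rego_baseline, rego_c1d21 = [10.0, 20.0, 15.0, 8.0], [40.0, 80.0, 30.0, 32.0]
placebo_baseline, placebo_c1d21 = [10.0, 16.0, 12.0], [10.0, 8.0, 12.0]

fc_rego = log2_fold_change(rego_baseline, rego_c1d21)
fc_placebo = log2_fold_change(placebo_baseline, placebo_c1d21)
u, p = mann_whitney(fc_rego, fc_placebo)
print(fc_rego, fc_placebo, u, p)
```

A log2 fold change of 1 means the marker doubled on treatment, 0 means no change, and -1 means it halved, so comparing these values between arms tests whether one arm shifts the marker more than the other.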

Eagle: multi-locus association mapping on a genome-wide scale made routine

Bioinformatics, 2019, Vol 36 (5), pp. 1509-1516. doi:10.1093/bioinformatics/btz759.

Author(s): Andrew W George; Arunas Verbyla; Joshua Bowden

Keyword(s): Association Mapping; Multiple Testing; R Package; Single Locus; Supplementary Information; Locus Method; Genome Wide; A Genome; Wide Scale; Mouse Study

Abstract. Motivation: We present Eagle, a new method for multi-locus association mapping. The motivation for developing Eagle was to make multi-locus association mapping ‘easy’ and the method-of-choice. Eagle’s strengths are that it (i) is considerably more powerful than single-locus association mapping, (ii) does not suffer from multiple testing issues, (iii) gives results that are immediately interpretable and (iv) has a computational footprint comparable to single-locus association mapping. Results: By conducting a large simulation study, we show that Eagle finds true and avoids false single-nucleotide polymorphism trait associations better than competing single- and multi-locus methods. We also analyze data from a published mouse study. Eagle found over 50% more validated findings than the state-of-the-art single-locus method. Availability and implementation: Eagle has been implemented as an R package, with a browser-based Graphical User Interface for users less familiar with R. It is freely available via the CRAN website at https://cran.r-project.org. Videos, Quick Start guides, FAQs and Demos are available via the Eagle website http://eagle.r-forge.r-project.org. Supplementary information: Supplementary data are available at Bioinformatics online.
