debCAM: a bioconductor R package for fully unsupervised deconvolution of complex tissues

Lulu Chen; Chiung-Ting Wu; Niya Wang; David M Herrington; Robert Clarke; Yue Wang

doi:10.1093/bioinformatics/btaa205

Bioinformatics ◽

10.1093/bioinformatics/btaa205 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3927-3929 ◽

Cited By ~ 3

Author(s):

Lulu Chen ◽

Chiung-Ting Wu ◽

Niya Wang ◽

David M Herrington ◽

Robert Clarke ◽

...

Keyword(s):

Expression Profiles ◽

Software Tool ◽

R Package ◽

Supplementary Information ◽

Tissue Cell ◽

Deconvolution Method ◽

Imaging Data ◽

Specific Expression ◽

Knowledge Incorporation

Abstract Summary We develop a fully unsupervised deconvolution method to dissect complex tissues into molecularly distinctive tissue or cell subtypes based on bulk expression profiles. We implement an R package, deconvolution by Convex Analysis of Mixtures (debCAM), that can automatically detect tissue/cell-specific markers, determine the number of constituent subtypes, calculate subtype proportions in individual samples and estimate tissue/cell-specific expression profiles. We demonstrate the performance and biomedical utility of debCAM on gene expression, methylation, proteomics and imaging data. With enhanced data preprocessing and prior knowledge incorporation, the debCAM software tool will allow biologists to perform a more comprehensive and unbiased characterization of tissue remodeling in many biomedical contexts. Availability and implementation http://bioconductor.org/packages/debCAM. Supplementary information Supplementary data are available at Bioinformatics online.
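The mixing model behind this kind of deconvolution can be illustrated with a minimal Python sketch. It is not debCAM's algorithm or API: it assumes noise-free data, assumes the marker genes are already known (debCAM detects them automatically), and uses hypothetical names and toy values throughout. Bulk profiles are modeled as X = A S, where A holds subtype proportions and S holds subtype-specific expression; the unknown scale of each marker is fixed by requiring every sample's proportions to sum to one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_subtypes = 6, 3

# Hypothetical ground truth: each row is one sample's subtype proportions (sums to 1).
A = rng.dirichlet(np.ones(n_subtypes), size=n_samples)

# Hypothetical subtype profiles: genes 0-2 are markers (expressed in exactly one
# subtype), genes 3-4 are expressed in every subtype.
S = np.array([
    [10.0, 0.0, 0.0, 5.0, 2.0],
    [0.0, 8.0, 0.0, 5.0, 3.0],
    [0.0, 0.0, 12.0, 5.0, 4.0],
])

# Noise-free linear mixture: bulk expression of each sample.
X = A @ S


def proportions_from_markers(X, marker_idx):
    """Estimate proportions from one known marker gene per subtype.

    A marker's bulk level is (proportion * unknown pure level), so the unknown
    per-marker scales are solved by least squares under the constraint that
    the scaled marker levels sum to one in every sample.
    """
    M = X[:, marker_idx]  # samples x subtypes
    scales, *_ = np.linalg.lstsq(M, np.ones(M.shape[0]), rcond=None)
    return M * scales


A_hat = proportions_from_markers(X, [0, 1, 2])
```

On this toy input the estimated proportions match the simulated ones up to floating-point error; with real, noisy data the marker detection and the choice of subtype number are the hard parts, and they are outside the scope of this sketch.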

Jasmine: a Java pipeline for isomiR characterization in miRNA-Seq data

Bioinformatics ◽

10.1093/bioinformatics/btz806 ◽

2019 ◽

Cited By ~ 2

Author(s):

Xiangfu Zhong ◽

Albert Pla ◽

Simon Rayner

Keyword(s):

Population Structure ◽

Software Tool ◽

Supplementary Information ◽

Supplementary Data ◽

Analysis Pipeline ◽

Detailed Characterization ◽

Fasta Format ◽

Java Application

Abstract Motivation The existence of complex subpopulations of miRNA isoforms, or isomiRs, is well established. While many tools exist for investigating isomiR populations, they differ in how they characterize an isomiR, making it difficult to compare results across different tools. Thus, there is a need for a more comprehensive and systematic standard for defining isomiRs. Such a standard would allow investigation of isomiR population structure in progressively more refined sub-populations, permitting the identification of more subtle changes between conditions and leading to an improved understanding of the processes that generate these differences. Results We developed Jasmine, a software tool that incorporates a hierarchical framework for characterizing isomiR populations. Jasmine is a Java application that can process raw read data in fastq/fasta format, or mapped reads in SAM format, to produce a detailed characterization of isomiR populations. Thus, Jasmine can reveal structure not apparent in a standard miRNA-Seq analysis pipeline. Availability and implementation Jasmine is implemented in Java and R and is freely available on Bitbucket at https://bitbucket.org/bipous/jasmine/src/master/. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

Targeted realignment of LC-MS profiles by neighbor-wise compound-specific graphical time warping with misalignment detection

Bioinformatics ◽

10.1093/bioinformatics/btaa037 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2862-2871

Author(s):

Chiung-Ting Wu ◽

Yizhi Wang ◽

Yinxue Wang ◽

Timothy Ebbels ◽

Ibrahim Karaman ◽

...

Keyword(s):

Large Scale ◽

Synthetic Data ◽

Software Tool ◽

R Package ◽

Supplementary Information ◽

Internal Quality ◽

Systematic Change ◽

Warping Function ◽

Time Warping ◽

Alignment Algorithms

Abstract Motivation Liquid chromatography–mass spectrometry (LC-MS) is a standard method for proteomics and metabolomics analysis of biological samples. Unfortunately, it suffers from various changes in the retention times (RT) of the same compound in different samples, and these must be subsequently corrected (aligned) during data processing. Classic alignment methods such as in the popular XCMS package often assume a single time-warping function for each sample. Thus, the potentially varying RT drift for compounds with different masses in a sample is neglected in these methods. Moreover, the systematic change in RT drift across run order is often not considered by alignment algorithms. Therefore, these methods cannot effectively correct all misalignments. For a large-scale experiment involving many samples, the existence of misalignment becomes inevitable and concerning. Results Here, we describe an integrated reference-free profile alignment method, neighbor-wise compound-specific Graphical Time Warping (ncGTW), that can detect misaligned features and align profiles by leveraging expected RT drift structures and compound-specific warping functions. Specifically, ncGTW uses individualized warping functions for different compounds and assigns constraint edges on warping functions of neighboring samples. Validated with both realistic synthetic data and internal quality control samples, ncGTW applied to two large-scale metabolomics LC-MS datasets identifies many misaligned features and successfully realigns them. These features would otherwise be discarded or uncorrected using existing methods. The ncGTW software tool is developed currently as a plug-in to detect and realign misaligned features present in standard XCMS output. Availability and implementation An R package of ncGTW is freely available at Bioconductor and https://github.com/ChiungTingWu/ncGTW. A detailed user’s manual and a vignette are provided within the package. 
Supplementary information Supplementary data are available at Bioinformatics online.
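For readers unfamiliar with time warping, the following Python sketch shows classic pairwise dynamic time warping (DTW) between two intensity profiles. It is a simplified stand-in, not ncGTW: it aligns only two profiles, uses no compound-specific or neighbor-wise constraints, and the function name and toy profiles are hypothetical.

```python
def dtw_path(a, b):
    """Classic dynamic time warping between two 1-D profiles.

    Returns the minimal cumulative absolute difference and the warping path
    as a list of (index_in_a, index_in_b) pairs.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j]: best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor cell.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Toy profiles: the same peak, shifted by one time point (a retention-time drift).
reference = [0, 0, 1, 3, 1, 0, 0]
shifted = [0, 1, 3, 1, 0, 0, 0]
total, path = dtw_path(reference, shifted)
```

Because the two toy profiles differ only by a shift, the warping path absorbs the drift completely and the cumulative cost is zero. A single warping function per sample pair, as here, is exactly the simplification the abstract argues is insufficient when drift varies by compound.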

Variant Set Enrichment: An R package to Identify Disease-Associated Functional Genomic Regions

10.1101/077990 ◽

2016 ◽

Author(s):

Musaddeque Ahmed ◽

Richard C. Sallari ◽

Haiyang Guo ◽

Jason H. Moore ◽

Housheng Hansen He ◽

...

Keyword(s):

Human Genome ◽

R Package ◽

Supplementary Information ◽

Fast Method ◽

Functional Genomic ◽

Disease Development ◽

Noncoding Regions ◽

The Poor ◽

Genomic Regions

Abstract Summary Genetic predispositions to diseases populate the noncoding regions of the human genome. Delineating their functional basis can inform on the mechanisms contributing to disease development. However, this remains a challenge due to the poor characterization of the noncoding genome. Variant Set Enrichment (VSE) is a fast method to calculate the enrichment of a set of disease-associated variants across functionally annotated genomic regions, consequently highlighting the mechanisms important in the etiology of the disease studied. Availability and Implementation VSE is implemented as an R package and can easily be implemented in any system with R. See supplementary information for [email protected]; [email protected]

Sequencing and characterization of lncRNAs in the breast muscle of Gushi and Arbor Acres chickens

Genome ◽

10.1139/gen-2017-0114 ◽

2018 ◽

Vol 61 (5) ◽

pp. 337-347 ◽

Cited By ~ 10

Author(s):

Tuanhui Ren ◽

Zhuanjian Li ◽

Yu Zhou ◽

Xuelian Liu ◽

Ruili Han ◽

...

Keyword(s):

Molecular Mechanisms ◽

Muscle Development ◽

Target Genes ◽

Expression Profiles ◽

Economic Value ◽

Muscle Quality ◽

Specific Expression ◽

Chicken Muscle ◽

Molecular Regulatory Mechanisms

Chicken muscle quality is one of the most important factors determining the economic value of poultry, and muscle development and growth are affected by genetics, environment, and nutrition. However, little is known about the molecular regulatory mechanisms of long non-coding RNAs (lncRNAs) in chicken skeletal muscle development. Our study aimed to better understand muscle development in chickens and thereby improve meat quality. In this study, Ribo-Zero RNA-Seq was used to investigate differences in the expression profiles of muscle-development-related genes and associated pathways between Gushi (GS) and Arbor Acres (AA) chickens. We identified two lncRNAs with muscle-tissue-specific expression. In addition, the target genes of these lncRNAs were significantly enriched in certain biological processes and molecular functions, as demonstrated by Gene Ontology (GO) analysis, and these target genes participate in five signaling pathways, as revealed by an analysis of the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. Taken together, these data suggest that different lncRNAs might be involved in regulating chicken muscle development and growth and provide new insight into the molecular mechanisms of lncRNAs.

ClusterMine: a Knowledge-integrated Clustering Approach based on Expression Profiles of Gene Sets

10.1101/255711 ◽

2018 ◽

Author(s):

Hong-Dong Li ◽

Yunpei Xu ◽

Xiaoshu Zhu ◽

Quan Liu ◽

Gilbert S. Omenn ◽

...

Keyword(s):

Expression Profiles ◽

R Package ◽

Biological Data ◽

Supplementary Information ◽

Consensus Clustering ◽

Cluster Membership ◽

Link Type ◽

Novel Approach ◽

Gene Sets ◽

Biological Interpretation

Abstract Motivation Clustering analysis is essential for understanding complex biological data. In widely used methods such as hierarchical clustering (HC) and consensus clustering (CC), expression profiles of all genes are often used to assess similarity between samples for clustering. These methods output sample clusters, but are not able to provide information about which gene sets (functions) contribute most to the clustering. So interpretability of their results is limited. We hypothesized that integrating prior knowledge of annotated biological processes would not only achieve satisfying clustering performance but also, more importantly, enable potential biological interpretation of clusters. Results Here we report ClusterMine, a novel approach that identifies clusters by assessing functional similarity between samples through integrating known annotated gene sets, e.g., in Gene Ontology. In addition to outputting cluster membership of each sample as conventional approaches do, it outputs gene sets that are most likely to contribute to the clustering, a feature facilitating biological interpretation. Using three cancer datasets, two single cell RNA-sequencing based cell differentiation datasets, one cell cycle dataset and two datasets of cells of different tissue origins, we found that ClusterMine achieved similar or better clustering performance and that top-scored gene sets prioritized by ClusterMine are biologically relevant. Implementation and availability ClusterMine is implemented as an R package and is freely available at: www.genemine.org/[email protected] Supplementary Information Supplementary data are available at Bioinformatics online.

sismonr: simulation of in silico multi-omic networks with adjustable ploidy and post-transcriptional regulation in R

Bioinformatics ◽

10.1093/bioinformatics/btaa002 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2938-2940

Author(s):

Olivia Angelin-Bonnet ◽

Patrick J Biggs ◽

Samantha Baldwin ◽

Susan Thomson ◽

Matthieu Vignes

Keyword(s):

In Silico ◽

Regulatory Networks ◽

Anthocyanin Biosynthesis ◽

Expression Profiles ◽

R Package ◽

Supplementary Information ◽

Protein Coding ◽

Stochastic Simulation Algorithms ◽

Post Transcriptional Regulation ◽

Biosynthesis Regulation

Abstract Summary We present sismonr, an R package for an integral generation and simulation of in silico biological systems. The package generates gene regulatory networks, which include protein-coding and non-coding genes along with different transcriptional and post-transcriptional regulations. The effect of genetic mutations on the system behaviour is accounted for via the simulation of genetically different in silico individuals. The ploidy of the system is not restricted to the usual haploid or diploid situations but can be defined by the user to higher ploidies. A choice of stochastic simulation algorithms allows us to simulate the expression profiles of the genes in the in silico system. We illustrate the use of sismonr by simulating the anthocyanin biosynthesis regulation pathway for three genetically distinct in silico plants. Availability and implementation The sismonr package is implemented in R and Julia and is publicly available on the CRAN repository (https://CRAN.R-project.org/package=sismonr). A detailed tutorial is available from GitHub at https://oliviaab.github.io/sismonr/. Supplementary information Supplementary data are available at Bioinformatics online.

deTS: tissue-specific enrichment analysis to decode tissue specificity

Bioinformatics ◽

10.1093/bioinformatics/btz138 ◽

2019 ◽

Vol 35 (19) ◽

pp. 3842-3845 ◽

Cited By ~ 8

Author(s):

Guangsheng Pei ◽

Yulin Dai ◽

Zhongming Zhao ◽

Peilin Jia

Keyword(s):

Expression Profiles ◽

Association Studies ◽

Gene Expression Profiles ◽

Enrichment Analysis ◽

R Package ◽

Supplementary Information ◽

Genome Wide Association Studies ◽

Tissue Specific ◽

Genome Wide ◽

Specific Regulation

Abstract Motivation Diseases and traits are under dynamic tissue-specific regulation. However, heterogeneous tissues are often collected in biomedical studies, which reduce the power in the identification of disease-associated variants and gene expression profiles. Results We present deTS, an R package, to conduct tissue-specific enrichment analysis with two built-in reference panels. Statistical methods are developed and implemented for detecting tissue-specific genes and for enrichment test of different forms of query data. Our applications using multi-trait genome-wide association studies data and cancer expression data showed that deTS could effectively identify the most relevant tissues for each query trait or sample, providing insights for future studies. Availability and implementation https://github.com/bsml320/deTS and CRAN https://cran.r-project.org/web/packages/deTS/ Supplementary information Supplementary data are available at Bioinformatics online.
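As a rough illustration of what an enrichment test for a gene-list query can look like, the Python sketch below computes a one-sided hypergeometric p-value for the overlap between a query gene set and a set of tissue-specific genes. This is a generic textbook test, not necessarily the statistic deTS implements, and the function name, gene identifiers and set sizes are all hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(overlap >= k) when n genes are drawn without replacement from a
    background of N genes, K of which are tissue-specific."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / total


# Hypothetical toy data: 20 background genes, 5 of them specific to one tissue.
background = {f"g{i}" for i in range(20)}
tissue_specific = {"g0", "g1", "g2", "g3", "g4"}
query = {"g0", "g1", "g2", "g10"}  # e.g. genes implicated by a trait

overlap = len(query & tissue_specific)
p_value = hypergeom_upper_tail(
    overlap, len(background), len(tissue_specific), len(query)
)
```

Here three of the four query genes are tissue-specific, and the chance of an overlap at least that large under random sampling is 155/4845, roughly 0.032. In practice the test would be repeated for every tissue in a reference panel and corrected for multiple testing.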

gep2pep: a bioconductor package for the creation and analysis of pathway-based expression profiles

Bioinformatics ◽

10.1093/bioinformatics/btz803 ◽

2019 ◽

Cited By ~ 2

Author(s):

Francesco Napolitano ◽

Diego Carrella ◽

Xin Gao ◽

Diego di Bernardo

Keyword(s):

Expression Profiles ◽

Enrichment Analysis ◽

Software Tool ◽

Supplementary Information ◽

Bioconductor Package ◽

Supplementary Data ◽

Connectivity Map ◽

Systematic Comparison ◽

Transcriptomic Data ◽

High Level

Abstract Summary Pathway-based expression profiles allow for high-level interpretation of transcriptomic data and systematic comparison of dysregulated cellular programs. We have previously demonstrated the efficacy of pathway-based approaches with two different applications: the drug set enrichment analysis and the Gene2drug analysis. Here, we present a software tool that allows users to easily convert gene-based profiles to pathway-based profiles and analyze them within the popular R framework. We also provide pre-computed profiles derived from the original Connectivity Map and its next generation release, i.e. the LINCS database. Availability and implementation The tool is implemented as the R/Bioconductor package gep2pep and can be freely downloaded from https://bioconductor.org/packages/gep2pep. Contact [email protected] or [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

rMSIproc: an R package for mass spectrometry imaging data processing

Bioinformatics ◽

10.1093/bioinformatics/btaa142 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3618-3619 ◽

Cited By ~ 2

Author(s):

Pere Ràfols ◽

Bram Heijs ◽

Esteban del Castillo ◽

Oscar Yanes ◽

Liam A McDonnell ◽

...

Keyword(s):

Mass Spectrometry ◽

Data Processing ◽

Mass Spectrometry Imaging ◽

R Package ◽

Supplementary Information ◽

Imaging Data ◽

Full Data ◽

Modern Computer ◽

Multiple Datasets ◽

Novel Strategy

Abstract Summary Mass spectrometry imaging (MSI) can reveal biochemical information directly from a tissue section. MSI generates a large quantity of complex spectral data, which is still challenging to translate into relevant biochemical information. Here, we present rMSIproc, an open-source R package that implements a full data processing workflow for MSI experiments performed using TOF or FT-based mass spectrometers. The package provides a novel strategy for spectral alignment and recalibration, which allows multiple datasets to be processed simultaneously. This enables confident statistical analysis of multiple datasets from one or several experiments. rMSIproc is designed to work with files larger than the computer memory capacity, and the algorithms are implemented using a multi-threading strategy. rMSIproc is a powerful tool that takes full advantage of modern computer systems to realize the full potential of MSI. Availability and implementation rMSIproc is freely available at https://github.com/prafols/rMSIproc. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

Priors for Genotyping Polyploids

Bioinformatics ◽

10.1093/bioinformatics/btz852 ◽

2019 ◽

Author(s):

David Gerard ◽

Luís Felipe Ventorim Ferrão

Keyword(s):

Empirical Bayes ◽

A Priori ◽

Real Data ◽

R Package ◽

Complete Characterization ◽

Supplementary Information ◽

Genotype Distribution ◽

Systematic Biases ◽

Technical Artifacts

Abstract Motivation Empirical Bayes techniques to genotype polyploid organisms usually either (i) assume technical artifacts are known a priori or (ii) estimate technical artifacts simultaneously with the prior genotype distribution. Case (i) is unappealing as it places the onus on the researcher to estimate these artifacts, or to ensure that there are no systematic biases in the data. However, as we demonstrate with a few empirical examples, case (ii) makes choosing the class of prior genotype distributions extremely important. Choosing a class that is either too flexible or too restrictive results in poor genotyping performance. Results We propose two classes of prior genotype distributions that are of intermediate levels of flexibility: the class of proportional normal distributions and the class of unimodal distributions. We provide a complete characterization of and optimization details for the class of unimodal distributions. We demonstrate, using both simulated and real data, that using these classes results in superior genotyping performance. Availability and implementation Genotyping methods that use these priors are implemented in the updog R package available on the Comprehensive R Archive Network: https://cran.r-project.org/package=updog. All code needed to reproduce the results of this paper is available on GitHub: https://github.com/dcgerard/reproduce_prior_sims. Supplementary information Supplementary data are available at Bioinformatics online.
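The role the prior plays in genotyping can be made concrete with a small Python sketch. It is a deliberately simplified model, not the updog implementation: it assumes a plain binomial read-count likelihood with no sequencing error, allelic bias or overdispersion (the technical artifacts the abstract refers to), and the function name and example numbers are hypothetical.

```python
from math import comb


def genotype_posterior(alt_reads, total_reads, ploidy, prior):
    """Posterior over genotypes 0..ploidy (copies of the alternative allele).

    Assumes reads are binomial with success probability dosage / ploidy,
    i.e. no technical artifacts. `prior` must have ploidy + 1 entries.
    """
    if len(prior) != ploidy + 1:
        raise ValueError("prior needs one entry per genotype 0..ploidy")
    unnormalized = []
    for dosage in range(ploidy + 1):
        p = dosage / ploidy
        likelihood = (
            comb(total_reads, alt_reads)
            * p ** alt_reads
            * (1 - p) ** (total_reads - alt_reads)
        )
        unnormalized.append(prior[dosage] * likelihood)
    z = sum(unnormalized)
    if z == 0:
        raise ValueError("data have zero probability under this prior")
    return [u / z for u in unnormalized]


# Hypothetical tetraploid individual: 5 alternative reads out of 20, flat prior.
posterior = genotype_posterior(5, 20, ploidy=4, prior=[0.2] * 5)
best_genotype = max(range(5), key=lambda k: posterior[k])
```

With a flat prior the most probable genotype here is one copy of the alternative allele, since 5/20 matches an expected allele fraction of 1/4. Swapping in a more or less flexible prior vector changes the posterior, which is the sensitivity the abstract describes.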
