BEM: Mining Coregulation Patterns in Transcriptomics via Boolean Matrix Factorization

Lifan Liang; Kunju Zhu; Songjian Lu

doi:10.1093/bioinformatics/btz977

Bioinformatics, 2020, Vol 36 (13), pp. 4030-4037. doi:10.1093/bioinformatics/btz977

Author(s): Lifan Liang; Kunju Zhu; Songjian Lu

Keyword(s): Matrix Factorization; Cell Types; Reconstruction Error; Boolean Matrix; Supplementary Information; Rna Seq; Transcriptomic Data; Real World Application; The Matrix; Data Points

Abstract

Motivation: Matrix factorization is an important way to analyze coregulation patterns in transcriptomic data, which can reveal the tumor signal perturbation status and subtype classification. However, current matrix factorization methods do not provide a clear bicluster structure. Furthermore, these algorithms assume that patterns combine linearly, which may not be sufficient to capture coregulation patterns.

Results: We presented a new algorithm for Boolean matrix factorization (BMF) via expectation maximization (BEM). BEM is more aligned with the molecular mechanism of transcriptomic coregulation and can scale to matrices with over 100 million data points. Synthetic experiments showed that BEM outperformed other BMF methods in terms of reconstruction error. Real-world applications demonstrated that BEM is applicable to all kinds of transcriptomic data, including bulk RNA-seq, single-cell RNA-seq and spatial transcriptomic datasets. Given appropriate binarization, BEM was able to extract coregulation patterns consistent with disease subtypes, cell types or spatial anatomy.

Availability and implementation: Python source code of BEM is available at https://github.com/LifanLiang/EM_BMF.

Supplementary information: Supplementary data are available at Bioinformatics online.
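
To make the Boolean model concrete, here is a minimal sketch of the Boolean matrix product and of reconstruction error counted as mismatched entries. It is not the BEM algorithm itself. It only illustrates how a Boolean factorization differs from a linear one: OR replaces the sum. The function names and the toy matrices are illustrative assumptions.

```python
def boolean_product(A, B):
    """Boolean matrix product: X[i][j] = OR over k of (A[i][k] AND B[k][j])."""
    n, r, m = len(A), len(B), len(B[0])
    return [[int(any(A[i][k] and B[k][j] for k in range(r))) for j in range(m)]
            for i in range(n)]


def reconstruction_error(X, X_hat):
    """Number of entries where the reconstruction disagrees with the data."""
    return sum(x != y for rx, ry in zip(X, X_hat) for x, y in zip(rx, ry))


# Toy factors (hypothetical data): 4 genes x 2 patterns, 2 patterns x 3 samples.
A = [[1, 0],
     [1, 1],
     [0, 1],
     [0, 0]]
B = [[1, 1, 0],
     [0, 1, 1]]

X_hat = boolean_product(A, B)

# Row 1 (0-based) belongs to both patterns and both are active in column 1,
# yet the entry stays 1 (OR); a linear model would give 1 + 1 = 2 there.
X_obs = [[1, 1, 0],
         [1, 1, 1],
         [0, 1, 0],   # one entry flipped relative to X_hat
         [0, 0, 0]]

print(X_hat)                               # [[1, 1, 0], [1, 1, 1], [0, 1, 1], [0, 0, 0]]
print(reconstruction_error(X_obs, X_hat))  # 1
```

Each column of A paired with the matching row of B marks a block of ones, which is why a Boolean factorization reads directly as a set of (possibly overlapping) biclusters.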

ExperimentSubset: an R package to manage subsets of Bioconductor Experiment objects

Bioinformatics, 2021. doi:10.1093/bioinformatics/btab179

Author(s): Irzam Sarfraz; Muhammad Asif; Joshua D Campbell

Keyword(s): Single Cell; R Package; Poor Quality; Data Matrix; Supplementary Information; Data Provenance; Rna Seq; Efficient Management; The Matrix; The Relationship

Abstract

Motivation: R Experiment objects such as the SummarizedExperiment or SingleCellExperiment are data containers for storing one or more matrix-like assays along with associated row and column data. These objects have been used to facilitate the storage and analysis of high-throughput genomic data generated from technologies such as single-cell RNA sequencing. One common computational task in many genomics analysis workflows is to subset the data matrix before applying downstream analytical methods. For example, one may need to subset the columns of the assay matrix to exclude poor-quality samples or subset the rows of the matrix to select the most variable features. Traditionally, a second object is created that contains the desired subset of the assay from the original object. However, this approach is inefficient, as it requires the creation of an additional object containing a copy of the original assay, and it leads to challenges with data provenance.

Results: To overcome these challenges, we developed an R package called ExperimentSubset, which is a data container that implements classes for efficient storage and streamlined retrieval of assays that have been subsetted by rows and/or columns. These classes are able to inherently provide data provenance by maintaining the relationship between the subsetted and parent assays. We demonstrate the utility of this package on a single-cell RNA-seq dataset by storing and retrieving subsets at different stages of the analysis while maintaining a lower memory footprint. Overall, ExperimentSubset is a flexible container for the efficient management of subsets.

Availability and implementation: The ExperimentSubset package is available at Bioconductor: https://bioconductor.org/packages/ExperimentSubset/ and GitHub: https://github.com/campbio/ExperimentSubset.

Supplementary information: Supplementary data are available at Bioinformatics online.
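
The idea of storing a subset as a reference to its parent assay, rather than as a copy, can be sketched in a few lines. The sketch below is in Python and is not the ExperimentSubset R API. The class name, its attributes and the toy matrix are all hypothetical, chosen only to illustrate the storage and provenance idea.

```python
class SubsetView:
    """Hypothetical container: a subset stored as row/column indices into a parent."""

    def __init__(self, parent, rows, cols, parent_name="counts"):
        self.parent = parent            # reference to the parent assay, not a copy
        self.rows = list(rows)
        self.cols = list(cols)
        self.parent_name = parent_name  # provenance: which assay the subset came from

    def materialize(self):
        """Build the subsetted matrix only when it is actually requested."""
        return [[self.parent[i][j] for j in self.cols] for i in self.rows]


# Toy assay (hypothetical data): 3 features x 4 cells.
counts = [[5, 0, 2, 9],
          [1, 1, 0, 3],
          [0, 7, 4, 8]]

# Drop a poor-quality cell (column 1) and keep two selected features.
subset = SubsetView(counts, rows=[0, 2], cols=[0, 2, 3])
print(subset.materialize())     # [[5, 2, 9], [0, 4, 8]]
print(subset.parent is counts)  # True: the parent assay was not duplicated
```

Only the index lists are stored per subset, so memory grows with the number of selected rows and columns rather than with the size of the assay.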

Discovering a sparse set of pairwise discriminating features in high-dimensional data

Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa690

Author(s): Samuel Melton; Sharad Ramanathan

Keyword(s): Single Cell; Dimensional Space; Cell Types; Dimensional Subspace; Supplementary Information; High Dimensional; Technological Advances; Data Points; Low Dimensional; Sparse Set

Abstract

Motivation: Recent technological advances produce a wealth of high-dimensional descriptions of biological processes, yet extracting meaningful insight and mechanistic understanding from these data remains challenging. For example, in developmental biology, the dynamics of differentiation can now be mapped quantitatively using single-cell RNA sequencing, yet it is difficult to infer molecular regulators of developmental transitions. Here, we show that discovering informative features in the data is crucial for statistical analysis as well as making experimental predictions.

Results: We identify features based on their ability to discriminate between clusters of the data points. We define a class of problems in which linear separability of clusters is hidden in a low-dimensional space. We propose an unsupervised method to identify the subset of features that define a low-dimensional subspace in which clustering can be conducted. This is achieved by averaging over discriminators trained on an ensemble of proposed cluster configurations. We then apply our method to single-cell RNA-seq data from mouse gastrulation, and identify 27 key transcription factors (out of 409 total), 18 of which are known to define cell states through their expression levels. In this inferred subspace, we find clear signatures of known cell types that eluded classification prior to discovery of the correct low-dimensional subspace.

Availability and implementation: https://github.com/smelton/SMD.

Supplementary information: Supplementary data are available at Bioinformatics online.

Differentiating isoform functions with collaborative matrix factorization

Bioinformatics, 2019. doi:10.1093/bioinformatics/btz847

Author(s): Keyao Wang; Jun Wang; Carlotta Domeniconi; Xiangliang Zhang; Guoxian Yu

Keyword(s): Matrix Factorization; Characteristic Curve; Function Prediction; Low Rank; Data Matrix; Supplementary Information; Genomic Databases; Gene Level; The Matrix; Level Function

Abstract

Motivation: Isoforms are alternatively spliced mRNAs of genes. They can be translated into different functional proteoforms, and thus greatly increase the functional diversity of protein variants (or proteoforms). Differentiating the functions of isoforms (or proteoforms) helps in understanding the underlying pathology of various complex diseases at a deeper granularity. Since existing functional genomic databases uniformly record annotations at the gene level and rarely record them at the isoform level, differentiating isoform functions is more challenging than traditional gene-level function prediction.

Results: Several approaches have been proposed to differentiate the functions of isoforms. They generally follow the multi-instance learning paradigm by viewing each gene as a bag and the spliced isoforms as its instances, and push functions of bags onto instances. These approaches implicitly assume the collected annotations of genes are complete and only integrate multiple RNA-seq datasets. As such, they have compromised performance. We propose a data integrative solution (called DisoFun) to Differentiate isoform Functions with collaborative matrix factorization. DisoFun assumes the functional annotations of genes are aggregated from those of key isoforms. It collaboratively factorizes the isoform data matrix and gene-term data matrix (storing Gene Ontology (GO) annotations of genes) into low-rank matrices to simultaneously explore the latent key isoforms, and achieve function prediction by aggregating predictions to their originating genes. In addition, it leverages the PPI network and GO structure to further coordinate the matrix factorization. Extensive experimental results show that DisoFun improves the AUROC (area under the receiver-operating characteristic curve) and AUPRC (area under the precision-recall curve) of existing solutions by at least 7.7% and 28.9%, respectively. We further investigate DisoFun on four exemplar genes (LMNA, ADAM15, BCL2L1, and CFLAR) with known functions at the isoform level, and observe that DisoFun can differentiate functions of their isoforms with 90.5% accuracy.

Availability: The code of DisoFun is available at mlda.swu.edu.cn/codes.php?name=DisoFun.

Supplementary information: Supplementary data are available at Bioinformatics online.
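
The multi-instance view, with each gene as a bag and its isoforms as instances, can be illustrated with a small aggregation step. This is a minimal sketch, not DisoFun's method. Taking the maximum isoform score per gene is a common multi-instance convention assumed here for illustration, and all identifiers and scores are made up.

```python
def gene_scores(isoform_scores, isoform_to_gene):
    """Aggregate isoform-level scores to genes by taking the maximum per gene."""
    scores = {}
    for isoform, score in isoform_scores.items():
        gene = isoform_to_gene[isoform]
        scores[gene] = max(score, scores.get(gene, float("-inf")))
    return scores


# Hypothetical prediction scores for a single GO term.
isoform_scores = {"geneA-1": 0.91, "geneA-2": 0.12,
                  "geneB-1": 0.30, "geneB-2": 0.45}
isoform_to_gene = {"geneA-1": "geneA", "geneA-2": "geneA",
                   "geneB-1": "geneB", "geneB-2": "geneB"}

print(gene_scores(isoform_scores, isoform_to_gene))
# {'geneA': 0.91, 'geneB': 0.45}
```

Under this convention a gene carries a function if at least one of its isoforms does, which matches the idea that gene-level annotations are aggregated from a few key isoforms.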

NITUMID: Nonnegative matrix factorization-based Immune-TUmor MIcroenvironment Deconvolution

Bioinformatics, 2019. doi:10.1093/bioinformatics/btz748

Author(s): Daiwei Tang; Seyoung Park; Hongyu Zhao

Keyword(s): Tumor Microenvironment; Matrix Factorization; Nonnegative Matrix Factorization; Expression Profiles; Mrna Level; Nonnegative Matrix; Gene Expression Profiles; Cell Types; Supplementary Information; Mrna Levels

Abstract

Motivation: A number of computational methods have been proposed recently to profile the tumor microenvironment (TME) from bulk RNA data, and they have proved useful for understanding microenvironment differences among therapeutic response groups. However, these methods cannot account for tumor proportion or for variable mRNA levels across cell types.

Results: In this article, we propose a Nonnegative Matrix Factorization-based Immune-TUmor MIcroenvironment Deconvolution (NITUMID) framework for TME profiling that addresses these limitations. It is designed to provide robust estimates of tumor and immune cell proportions simultaneously, while accommodating mRNA level differences across cell types. Through comprehensive simulations and real data analyses, we demonstrate that NITUMID not only accurately estimates tumor fractions and cell types' mRNA levels, which are currently unavailable in other methods, but also outperforms most existing deconvolution methods in regular cell type profiling accuracy. Moreover, we show that NITUMID can more effectively detect clinical and prognostic signals from gene expression profiles in tumor than other methods.

Availability and implementation: The algorithm is implemented in R. The source code can be downloaded at https://github.com/tdw1221/NITUMID.

Supplementary information: Supplementary data are available at Bioinformatics online.
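
Deconvolution by nonnegative matrix factorization treats a bulk expression matrix X (genes x samples) as approximately W H, where W holds cell-type signatures and H holds per-sample weights. The sketch below is a generic NMF using the standard multiplicative updates of Lee and Seung. It is not the NITUMID framework, which adds its own modeling on top. The function names, the toy signatures and the proportions are assumptions for illustration.

```python
import random


def matmul(A, B):
    """Plain matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(X, W, H):
    """Squared Frobenius norm of X - W H."""
    WH = matmul(W, H)
    return sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry))


def nmf(X, rank, iters=200, seed=0, eps=1e-9):
    """Generic NMF via multiplicative updates; factors stay non-negative."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    start = frobenius_error(X, W, H)
    for _ in range(iters):
        # H <- H * (W^T X) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H, start, frobenius_error(X, W, H)


# Toy bulk matrix (hypothetical): 4 genes x 3 samples mixed from 2 cell types.
signatures = [[10.0, 0.0], [8.0, 1.0], [0.0, 6.0], [1.0, 9.0]]
proportions = [[0.8, 0.5, 0.1], [0.2, 0.5, 0.9]]
X = matmul(signatures, proportions)

W, H, err_start, err_end = nmf(X, rank=2)
print(err_start, err_end)  # the final error should be below the starting error
```

Plain NMF recovers factors only up to scaling and permutation of the components, so the columns of H are not proportions until they are normalized and matched to known cell types.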

Identifying gene expression programs of cell-type identity and cellular activity with single-cell RNA-Seq

eLife, 2019, Vol 8. doi:10.7554/elife.43803. Cited By ~ 37

Author(s): Dylan Kotliar; Adrian Veres; M Aurel Nagy; Shervin Tabrizi; Eran Hodis; ...

Keyword(s): Gene Expression; Single Cell; Matrix Factorization; Cell Types; Environmental Cues; Rna Seq; Cell Type; Type Identity; Brain Organoid; Non Negative Matrix Factorization

Identifying gene expression programs underlying both cell-type identity and cellular activities (e.g. life-cycle processes, responses to environmental cues) is crucial for understanding the organization of cells and tissues. Although single-cell RNA-Seq (scRNA-Seq) can quantify transcripts in individual cells, each cell’s expression profile may be a mixture of both types of programs, making them difficult to disentangle. Here, we benchmark and enhance the use of matrix factorization to solve this problem. We show with simulations that a method we call consensus non-negative matrix factorization (cNMF) accurately infers identity and activity programs, including their relative contributions in each cell. To illustrate the insights this approach enables, we apply it to published brain organoid and visual cortex scRNA-Seq datasets; cNMF refines cell types and identifies both expected (e.g. cell cycle and hypoxia) and novel activity programs, including programs that may underlie a neurosecretory phenotype and synaptogenesis.

Identifying Gene Expression Programs of Cell-type Identity and Cellular Activity with Single-Cell RNA-Seq

2018. doi:10.1101/310599. Cited By ~ 7

Author(s): Dylan Kotliar; Adrian Veres; M. Aurel Nagy; Shervin Tabrizi; Eran Hodis; ...

Keyword(s): Gene Expression; Single Cell; Matrix Factorization; Cell Types; Rna Seq; Cell Type; Relative Contribution; Neuronal Synapses; Type Identity; Brain Organoid

Abstract

Identifying gene expression programs underlying both cell-type identity and cellular activities (e.g. life-cycle processes, responses to environmental cues) is crucial for understanding the organization of cells and tissues. Although single-cell RNA-Seq (scRNA-Seq) can quantify transcripts in individual cells, each cell’s expression profile may be a mixture of both types of programs, making them difficult to disentangle. Here we illustrate and enhance the use of matrix factorization as a solution to this problem. We show with simulations that a method that we call consensus non-negative matrix factorization (cNMF) accurately infers identity and activity programs, including the relative contribution of programs in each cell. Applied to published brain organoid and visual cortex scRNA-Seq datasets, cNMF refines the hierarchy of cell types and identifies both expected (e.g. cell cycle and hypoxia) and intriguing novel activity programs. We propose that one of the novel programs may reflect a neurosecretory phenotype and a second may underlie the formation of neuronal synapses. We make cNMF available to the community and illustrate how this approach can provide key insights into gene expression variation within and between cell types.

Continuous State HMMs for Modeling Time Series Single Cell RNA-Seq Data

2018. doi:10.1101/380568

Author(s): Chieh Lin; Ziv Bar-Joseph

Keyword(s): Time Series; Single Cell; Developmental Process; Developmental Trajectories; Cell Types; Supplementary Information; Rna Seq; Inference Algorithms; Continuous State; Efficient Learning

Abstract

Motivation: Methods for reconstructing developmental trajectories from time series single cell RNA-Seq (scRNA-Seq) data can be largely divided into two categories. The first, often referred to as pseudotime ordering methods, are deterministic and rely on dimensionality reduction followed by an ordering step. The second learns a probabilistic branching model to represent the developmental process. While both types have been successful, each suffers from shortcomings that can impact their accuracy.

Results: We developed a new method based on continuous state HMMs (CSHMMs) for representing and modeling time series scRNA-Seq data. We define the CSHMM model and provide efficient learning and inference algorithms which allow the method to determine both the structure of the branching process and the assignment of cells to these branches. Analyzing several developmental single cell datasets, we show that the CSHMM method accurately infers branching topology and correctly and continuously assigns cells to paths, improving upon prior methods proposed for this task. Analysis of genes based on the continuous cell assignment identifies known and novel markers for different cell types.

Availability: Software and supporting website: www.andrew.cmu.edu/user/chiehll/CSHMM/ [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

Reference Transcriptomes of Porcine Peripheral Immune Cells Created Through Bulk and Single-Cell RNA Sequencing

Frontiers in Genetics, 2021, Vol 12. doi:10.3389/fgene.2021.689406

Author(s): Juber Herrera-Uribe; Jayne E. Wiarda; Sathesh K. Sivasankaran; Lance Daharsh; Haibo Liu; ...

Keyword(s): T Cells; Single Cell; Immune Cells; Peripheral Blood; Cell Types; Cell Populations; Rna Seq; Cell Type; Transcriptomic Data; Single Cell Rna Sequencing

Pigs are a valuable human biomedical model and an important protein source supporting global food security. The transcriptomes of peripheral blood immune cells in pigs were defined at the bulk cell-type and single cell levels. First, eight cell types were isolated in bulk from peripheral blood mononuclear cells (PBMCs) by cell sorting, representing Myeloid, NK cells and specific populations of T and B-cells. Transcriptomes for each bulk population of cells were generated by RNA-seq with 10,974 expressed genes detected. Pairwise comparisons between cell types revealed specific expression, while enrichment analysis identified 1,885 to 3,591 significantly enriched genes across all 8 cell types. Gene Ontology analysis for the top 25% of significantly enriched genes (SEG) showed high enrichment of biological processes related to the nature of each cell type. Comparison of gene expression indicated highly significant correlations between pig cells and corresponding human PBMC bulk RNA-seq data available in Haemopedia. Second, higher resolution of distinct cell populations was obtained by single-cell RNA-sequencing (scRNA-seq) of PBMC. Seven PBMC samples were partitioned and sequenced that produced 28,810 single cell transcriptomes distributed across 36 clusters and classified into 13 general cell types including plasmacytoid dendritic cells (DC), conventional DCs, monocytes, B-cell, conventional CD4 and CD8 αβ T-cells, NK cells, and γδ T-cells. Signature gene sets from the human Haemopedia data were assessed for relative enrichment in genes expressed in pig cells and integration of pig scRNA-seq with a public human scRNA-seq dataset provided further validation for similarity between human and pig data. The sorted porcine bulk RNAseq dataset informed classification of scRNA-seq PBMC populations; specifically, an integration of the datasets showed that the pig bulk RNAseq data helped define the CD4CD8 double-positive T-cell populations in the scRNA-seq data. 
Overall, the data provide deep and well-validated transcriptomic profiles of sorted PBMC populations and the first single-cell transcriptomic data for porcine PBMCs. This resource will be invaluable for annotation of pig genes controlling immunogenetic traits as part of the porcine Functional Annotation of Animal Genomes (FAANG) project, as well as further study of, and development of new reagents for, porcine immunology.

Cell type-specific aging clocks to quantify aging and rejuvenation in regenerative regions of the brain

2022. doi:10.1101/2022.01.10.475747

Author(s): Matthew T Buckley; Eric Sun; Benson M. George; Ling Liu; Nicholas Schaum; ...

Keyword(s): Single Cell; Cell Types; Rna Seq; Cell Type; Cell Level; Transcriptomic Data; Precise Quantification; Cell Type Specific; Tissue Aging; The Brain

Aging manifests as progressive dysfunction culminating in death. The diversity of cell types is a challenge to the precise quantification of aging and its reversal. Here we develop a suite of 'aging clocks' based on single cell transcriptomic data to characterize cell type-specific aging and rejuvenation strategies. The subventricular zone (SVZ) neurogenic region contains many cell types and provides an excellent system to study cell-level tissue aging and regeneration. We generated 21,458 single-cell transcriptomes from the neurogenic regions of 28 mice, tiling ages from young to old. With these data, we trained a suite of single cell-based regression models (aging clocks) to predict both chronological age (passage of time) and biological age (fitness, in this case the proliferative capacity of the neurogenic region). Both types of clocks perform well on independent cohorts of mice. Genes underlying the single cell-based aging clocks are mostly cell-type specific, but also include a few shared genes in the interferon and lipid metabolism pathways. We used these single cell-based aging clocks to measure transcriptomic rejuvenation, by generating single cell RNA-seq datasets of SVZ neurogenic regions for two interventions - heterochronic parabiosis (young blood) and exercise. Interestingly, the use of aging clocks reveals that both heterochronic parabiosis and exercise reverse transcriptomic aging in the niche, but in different ways across cell types and genes. This study represents the first development of high-resolution aging clocks from single cell transcriptomic data and demonstrates their application to quantify transcriptomic rejuvenation.

Single-cell transcriptomics for the 99.9% of species without reference genomes

2021. doi:10.1101/2021.07.09.450799

Author(s): Olga Borisovna Botvinnik; Pranathi Vemuri; N. Tessa Pierce Ward; Phoenix Aja Logan; Saba Nafees; ...

Keyword(s): Single Cell; Gene Annotation; Cell Types; Model Organisms; Mouse Lung; Rna Seq; Transcriptomic Data; A Genome; Genome Annotations; Reference Genomes

Single-cell RNA-seq (scRNA-seq) is a powerful tool for cell type identification but is not readily applicable to organisms without well-annotated reference genomes. Of the approximately 10 million animal species predicted to exist on Earth, >99.9% do not have any submitted genome assembly. To enable scRNA-seq for the vast majority of animals on the planet, here we introduce the concept of "k-mer homology," combining biochemical synonyms in degenerate protein alphabets with uniform data subsampling via MinHash into a pipeline called Kmermaid, to directly detect similar cell types across species from transcriptomic data without the need for a reference genome. Underpinning Kmermaid is the tool Orpheum, a memory-efficient method for extracting high-confidence protein-coding sequences from RNA-seq data. After validating Kmermaid using datasets from human and mouse lung, we applied it to the Chinese horseshoe bat (Rhinolophus sinicus), where we propagated cellular compartment labels at high fidelity. Our pipeline provides a high-throughput tool that enables analyses of transcriptomic data across divergent species' transcriptomes in a genome- and gene annotation-agnostic manner. Thus, the combination of Kmermaid and Orpheum identifies cell type-specific sequences that may be missing from genome annotations and empowers molecular cellular phenotyping for novel model organisms and species.
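
The two ingredients named in the abstract, a degenerate protein alphabet and MinHash subsampling of k-mers, can be sketched with the standard library alone. This is not Kmermaid's implementation. The two-letter hydrophobic/polar alphabet, the bottom-k MinHash variant, the function names and the example sequences are all assumptions made for illustration.

```python
import hashlib

# Hypothetical two-letter reduced alphabet: hydrophobic (h) vs. everything else (p).
# This mapping is illustrative only; it is not the alphabet used by Kmermaid.
HYDROPHOBIC = set("AVLIMFWC")


def reduce_alphabet(protein):
    """Re-encode a protein so biochemically similar residues share one letter."""
    return "".join("h" if aa in HYDROPHOBIC else "p" for aa in protein)


def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_sketch(items, num_hashes=64):
    """Bottom-k MinHash: keep the num_hashes smallest hash values."""
    hashes = sorted(int(hashlib.sha1(x.encode()).hexdigest(), 16) for x in items)
    return set(hashes[:num_hashes])


def jaccard_estimate(sketch_a, sketch_b, num_hashes=64):
    """Estimate Jaccard similarity from the smallest hashes of the union."""
    union_smallest = sorted(sketch_a | sketch_b)[:num_hashes]
    if not union_smallest:
        return 0.0
    shared = sum(1 for h in union_smallest if h in sketch_a and h in sketch_b)
    return shared / len(union_smallest)


# Made-up sequences that differ only by L <-> I swaps (both hydrophobic).
seq_a = "MKVLAAGIVGLS"
seq_b = "MKVIAAGLVGIS"
k = 5

raw = jaccard_estimate(minhash_sketch(kmers(seq_a, k)),
                       minhash_sketch(kmers(seq_b, k)))
reduced = jaccard_estimate(minhash_sketch(kmers(reduce_alphabet(seq_a), k)),
                           minhash_sketch(kmers(reduce_alphabet(seq_b), k)))
print(raw, reduced)  # 0.0 1.0
```

In this toy case the raw amino-acid k-mers share nothing, while the reduced-alphabet k-mers are identical. This is the sense in which a degenerate alphabet lets sequence comparison tolerate conservative substitutions between species.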
