Factorized embeddings learns rich and biologically meaningful embedding spaces using factorized tensor decomposition

Assya Trofimov; Joseph Paul Cohen; Yoshua Bengio; Claude Perreault; Sébastien Lemieux

doi:10.1093/bioinformatics/btaa488

Bioinformatics ◽

10.1093/bioinformatics/btaa488 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i417-i426

Author(s):

Assya Trofimov ◽

Joseph Paul Cohen ◽

Yoshua Bengio ◽

Claude Perreault ◽

Sébastien Lemieux

Keyword(s):

Large Scale ◽

Learning Algorithm ◽

Single Gene ◽

Expression Patterns ◽

Tensor Decomposition ◽

Supplementary Information ◽

FE Model ◽

Tensor Factorization ◽

RNA-Seq ◽

Sample Representation

Abstract

Motivation: The recent development of sequencing technologies revolutionized our understanding of the inner workings of the cell as well as the way disease is treated. A single RNA sequencing (RNA-Seq) experiment, however, measures tens of thousands of parameters simultaneously. While the results are information rich, data analysis provides a challenge. Dimensionality reduction methods help with this task by extracting patterns from the data by compressing it into compact vector representations.

Results: We present the factorized embeddings (FE) model, a self-supervised deep learning algorithm that learns simultaneously, by tensor factorization, gene and sample representation spaces. We ran the model on RNA-Seq data from two large-scale cohorts and observed that the sample representation captures information on single gene and global gene expression patterns. Moreover, we found that the gene representation space was organized such that tissue-specific genes, highly correlated genes as well as genes participating in the same GO terms were grouped. Finally, we compared the vector representation of samples learned by the FE model to other similar models on 49 regression tasks. We report that the representations trained with FE rank first or second in all of the tasks, surpassing, sometimes by a considerable margin, other representations.

Availability and implementation: A toy example in the form of a Jupyter Notebook as well as the code and trained embeddings for this project can be found at: https://github.com/TrofimovAssya/FactorizedEmbeddings.

Supplementary information: Supplementary data are available at Bioinformatics online.
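
The idea of learning gene and sample representations jointly can be illustrated with a minimal sketch. The abstract does not say how the two embeddings are combined, so this example assumes a plain dot product trained by gradient descent, which is a simplified stand-in and not the authors' implementation. The function name `fit_embeddings`, its parameters and the toy data are all hypothetical.

```python
import numpy as np

def fit_embeddings(expr, dim=2, lr=0.05, epochs=1000, seed=0):
    """Learn sample and gene embeddings whose dot product approximates expr.

    expr: (n_samples, n_genes) array of expression values.
    Returns (sample_emb, gene_emb, mse_history).
    Assumption: a bilinear model; the FE paper's architecture may differ.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_genes = expr.shape
    sample_emb = rng.normal(scale=0.1, size=(n_samples, dim))
    gene_emb = rng.normal(scale=0.1, size=(n_genes, dim))
    mse_history = []
    for _ in range(epochs):
        err = sample_emb @ gene_emb.T - expr       # reconstruction residuals
        mse_history.append(float(np.mean(err ** 2)))
        # Squared-error gradients, averaged over the other axis so the
        # step size does not depend strongly on the matrix dimensions.
        grad_samples = err @ gene_emb / n_genes
        grad_genes = err.T @ sample_emb / n_samples
        sample_emb -= lr * grad_samples
        gene_emb -= lr * grad_genes
    return sample_emb, gene_emb, mse_history

# Toy data (hypothetical): 20 samples x 30 genes driven by 2 latent factors.
rng = np.random.default_rng(1)
toy_expr = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 30))
sample_emb, gene_emb, mse_history = fit_embeddings(toy_expr)
print(f"MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.6f}")
```

Each row of `sample_emb` is then a compact vector for one sample and each row of `gene_emb` a vector for one gene, which is the kind of object the downstream regression tasks in the abstract would consume.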

SMILE: Mutual Information Learning for Integration of Single-cell Omics Data

Bioinformatics ◽

10.1093/bioinformatics/btab706 ◽

2021 ◽

Author(s):

Yang Xu ◽

Priyojit Das ◽

Rachel Patton McCord

Keyword(s):

Deep Learning ◽

Mutual Information ◽

Single Cell ◽

Learning Algorithm ◽

Cellular Systems ◽

Supplementary Information ◽

Omics Data ◽

Learning Approaches ◽

RNA-Seq ◽

Integrate Data

Abstract

Motivation: Deep learning approaches have empowered single-cell omics data analysis in many ways and generated new insights from complex cellular systems. As there is an increasing need for single-cell omics data to be integrated across sources, types, and features of data, the challenges of integrating single-cell omics data are rising. Here, we present an unsupervised deep learning algorithm that learns discriminative representations for single-cell data via maximizing mutual information, SMILE (Single-cell Mutual Information Learning).

Results: Using a unique cell-pairing design, SMILE successfully integrates multi-source single-cell transcriptome data, removing batch effects and projecting similar cell types, even from different tissues, into the shared space. SMILE can also integrate data from two or more modalities, such as joint profiling technologies using single-cell ATAC-seq, RNA-seq, DNA methylation, Hi-C, and ChIP data. When paired cells are known, SMILE can integrate data with unmatched features, such as genes for RNA-seq and genome-wide peaks for ATAC-seq. Integrated representations learned from joint profiling technologies can then be used as a framework for comparing independent single-source data.

Supplementary information: Supplementary data are available at Bioinformatics online. The source code of SMILE including analyses of key results in the study can be found at: https://github.com/rpmccordlab/SMILE.
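
Learning representations by maximizing mutual information over paired cells can be sketched with a contrastive objective. The abstract does not state which estimator SMILE uses, so this example assumes an InfoNCE-style loss, a common choice for which log(N) minus the loss lower-bounds the mutual information between the two views. The function name `info_nce_loss`, the temperature value and the toy embeddings are hypothetical.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE loss for paired embeddings.

    Row i of z_a and row i of z_b are treated as a positive pair
    (for example, the same cell seen in two views); all other rows
    in the batch act as negatives. Assumed objective, not SMILE's code.
    """
    # Cosine similarity: normalise each embedding to unit length.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Average negative log-probability assigned to the true partner.
    return float(-np.mean(np.diag(log_prob)))

# Toy embeddings (hypothetical): 8 cells, 4 dimensions.
rng = np.random.default_rng(0)
cells = rng.normal(size=(8, 4))
aligned = info_nce_loss(cells, cells + 0.01 * rng.normal(size=(8, 4)))
mismatched = info_nce_loss(cells, np.roll(cells, 1, axis=0))
print(f"aligned pairs: {aligned:.3f}, mismatched pairs: {mismatched:.3f}")
```

The loss is low when each cell's partner is the most similar item in the batch and high when partners are mismatched, which is the signal an encoder would be trained on.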

Network-based characterization of disease–disease relationships in terms of drugs and therapeutic targets

Bioinformatics ◽

10.1093/bioinformatics/btaa439 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i516-i524

Author(s):

Midori Iida ◽

Michio Iwata ◽

Yoshihiro Yamanishi

Keyword(s):

Large Scale ◽

Expression Patterns ◽

Therapeutic Targets ◽

Molecular Networks ◽

Supplementary Information ◽

New Associations ◽

Disease States ◽

Molecular Features ◽

Novel Approach

Abstract

Motivation: Disease states are distinguished from each other in terms of differing clinical phenotypes, but characteristic molecular features are often common to various diseases. Similarities between diseases can be explained by characteristic gene expression patterns. However, most disease–disease relationships remain uncharacterized.

Results: In this study, we proposed a novel approach for network-based characterization of disease–disease relationships in terms of drugs and therapeutic targets. We performed large-scale analyses of omics data and molecular interaction networks for 79 diseases, including adrenoleukodystrophy, leukaemia, Alzheimer's disease, asthma, atopic dermatitis, breast cancer, cystic fibrosis and inflammatory bowel disease. We quantified disease–disease similarities based on proximities of abnormally expressed genes in various molecular networks, and showed that similarities between diseases could be explained by characteristic molecular network topologies. Furthermore, we developed a kernel matrix regression algorithm to predict the commonalities of drugs and therapeutic targets among diseases. Our comprehensive prediction strategy indicated many new associations among phenotypically diverse diseases.

Supplementary information: Supplementary data are available at Bioinformatics online.

Platform-integrated mRNA isoform quantification

Bioinformatics ◽

10.1093/bioinformatics/btz932 ◽

2019 ◽

Vol 36 (8) ◽

pp. 2466-2473 ◽

Cited By ~ 2

Author(s):

Jiao Sun ◽

Jae-Woong Chang ◽

Teng Zhang ◽

Jeongsik Yong ◽

Rui Kuang ◽

...

Keyword(s):

Cell Lines ◽

Cancer Cell ◽

Large Scale ◽

Cancer Cell Lines ◽

Supplementary Information ◽

RNA-Seq ◽

mRNA Quantification ◽

Transcript Quantification ◽

Isoform Expression ◽

Isoform Quantification

Abstract

Motivation: Accurate estimation of transcript isoform abundance is critical for downstream transcriptome analyses and can lead to precise molecular mechanisms for understanding complex human diseases, like cancer. Simplex mRNA Sequencing (RNA-Seq) based isoform quantification approaches are facing the challenges of inherent sampling bias and unidentifiable read origins. A large-scale experiment shows that the consistency between RNA-Seq and other mRNA quantification platforms is relatively low at the isoform level compared to the gene level. In this project, we developed a platform-integrated model for transcript quantification (IntMTQ) to improve the performance of RNA-Seq on isoform expression estimation. IntMTQ, which benefits from the mRNA expressions reported by the other platforms, provides more precise RNA-Seq-based isoform quantification and leads to more accurate molecular signatures for disease phenotype prediction.

Results: In the experiments to assess the quality of isoform expression estimated by IntMTQ, we designed three tasks for clustering and classification of 46 cancer cell lines with four different mRNA quantification platforms, including newly developed NanoString’s nCounter technology. The results demonstrate that the isoform expressions learned by IntMTQ consistently provide more and better molecular features for downstream analyses compared with five baseline algorithms which consider RNA-Seq data only. An independent RT-qPCR experiment on seven genes in twelve cancer cell lines showed that the IntMTQ improved overall transcript quantification. The platform-integrated algorithms could be applied to large-scale cancer studies, such as The Cancer Genome Atlas (TCGA), with both RNA-Seq and array-based platforms available.

Availability and implementation: Source code is available at: https://github.com/CompbioLabUcf/IntMTQ.

Supplementary information: Supplementary data are available at Bioinformatics online.

Detection of Alu Exonization Events in Human Frontal Cortex From RNA-Seq Data

Frontiers in Molecular Biosciences ◽

10.3389/fmolb.2021.727537 ◽

2021 ◽

Vol 8 ◽

Author(s):

Liliana Florea ◽

Lindsay Payer ◽

Corina Antonescu ◽

Guangyu Yang ◽

Kathleen Burns

Keyword(s):

Frontal Cortex ◽

Large Scale ◽

Reference Genome ◽

Expression Patterns ◽

Data Sets ◽

RNA-Seq ◽

mRNA Isoforms ◽

Specific Expression ◽

Human Frontal Cortex ◽

Alternatively Spliced

Alu exonization events functionally diversify the transcriptome, creating alternative mRNA isoforms and accounting for an estimated 5% of the alternatively spliced (skipped) exons in the human genome. We developed computational methods, implemented in a software tool called Alubaster, for detecting incorporation of Alu sequences in mRNA transcripts from large-scale RNA-seq data sets. The approach detects Alu sequences derived from both fixed and polymorphic Alu elements, including Alu insertions missing from the reference genome. We applied our methods to 117 GTEx human frontal cortex samples to build and characterize a collection of Alu-containing mRNAs. In particular, we detected and characterized Alu exonizations occurring at 870 fixed Alu loci, of which 237 were novel, as well as hundreds of putative events involving Alu elements that are polymorphic variants or rare alleles not present in the reference genome. These methods and annotations represent a unique and valuable resource that can be used to understand the characteristics of Alu-containing mRNAs and their tissue-specific expression patterns.

SIMLR: a tool for large-scale single-cell analysis by multi-kernel learning

10.1101/118901 ◽

2017 ◽

Cited By ~ 9

Author(s):

Bo Wang ◽

Daniele Ramazzotti ◽

Luca De Sano ◽

Junjie Zhu ◽

Emma Pierson ◽

...

Keyword(s):

Single Cell ◽

Large Scale ◽

Single Cell Analysis ◽

R Package ◽

Supplementary Information ◽

Cell Analysis ◽

RNA-Seq ◽

A Cell ◽

Supplementary Material ◽

Public Datasets

Abstract

Motivation: We here present SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), an open-source tool that implements a novel framework to learn a cell-to-cell similarity measure from single-cell RNA-seq data. SIMLR can be effectively used to perform tasks such as dimension reduction, clustering, and visualization of heterogeneous populations of cells. SIMLR was benchmarked against state-of-the-art methods for these three tasks on several public datasets, showing it to be scalable and capable of greatly improving clustering performance, as well as providing valuable insights by making the data more interpretable via a better visualization.

Availability and Implementation: SIMLR is available on GitHub in both R and MATLAB implementations. Furthermore, it is also available as an R package on [email protected] or [email protected]

Supplementary Information: Supplementary data are available at Bioinformatics online.

MetaMap: An atlas of metatranscriptomic reads in human disease-related RNA-seq data

10.1101/269092 ◽

2018 ◽

Cited By ~ 1

Author(s):

LM Simon ◽

S Karg ◽

AJ Westermann ◽

M Engel ◽

AHA Elbehery ◽

...

Keyword(s):

High Performance Computing ◽

Human Disease ◽

High Performance ◽

Large Scale ◽

Expression Patterns ◽

RNA-Seq ◽

Wide Range ◽

Eukaryotic Gene ◽

Public Repositories ◽

Performance Computing

Abstract

Background: With the advent of the age of big data in bioinformatics, large volumes of data and high performance computing power enable researchers to perform re-analyses of publicly available datasets at an unprecedented scale. Ever more studies implicate the microbiome in both normal human physiology and a wide range of diseases. RNA sequencing technology (RNA-seq) is commonly used to infer global eukaryotic gene expression patterns under defined conditions, including human disease-related contexts, but its generic nature also enables the detection of microbial and viral transcripts.

Findings: We developed a bioinformatic pipeline to screen existing human RNA-seq datasets for the presence of microbial and viral reads by re-inspecting the non-human-mapping read fraction. We validated this approach by recapitulating outcomes from 6 independent controlled infection experiments of cell line models and comparison with an alternative metatranscriptomic mapping strategy. We then applied the pipeline to close to 150 terabytes of publicly available raw RNA-seq data from >17,000 samples from >400 studies relevant to human disease using state-of-the-art high performance computing systems. The resulting data of this large-scale re-analysis are made available in the presented MetaMap resource.

Conclusions: Our results demonstrate that common human RNA-seq data, including those archived in public repositories, might contain valuable information to correlate microbial and viral detection patterns with diverse diseases. The presented MetaMap database thus provides a rich resource for hypothesis generation towards the role of the microbiome in human disease.

Adjustment of spurious correlations in co-expression measurements from RNA-Sequencing data

10.1101/2021.03.25.436972 ◽

2021 ◽

Author(s):

Ping-Han Hsieh ◽

Camila Miranda Lopes-Ramos ◽

Geir Kjetil Sandve ◽

Kimberly Glass ◽

Marieke Lydia Kuijjer

Keyword(s):

False Positive ◽

Large Scale ◽

Expression Patterns ◽

Gaussian Mixture ◽

Heterogeneous Data ◽

RNA-Seq ◽

Sequencing Data ◽

Technical Variability ◽

Gene Filtering ◽

Coordinated Expression

Gene co-expression measurements are widely used in computational biology to identify coordinated expression patterns across a group of samples, which may indicate that these genes are controlled by the same transcriptional regulatory program, or involved in common biological processes. Gene co-expression is typically estimated from RNA-Seq data, which are generally normalized to remove technical variability. Here, we find and demonstrate that certain normalization methods, in particular quantile-based methods, can introduce false-positive associations between genes, and that this can consequently hamper downstream co-expression network analysis. Quantile-based normalization can, however, be extremely powerful. In particular, when preprocessing large-scale heterogeneous data, quantile-based normalization can be applied to remove technical variability while maintaining global differences in expression for samples with different biological attributes. We therefore developed CAIMAN, a method to correct for false-positive associations that may arise from normalization of RNA-Seq data. CAIMAN utilizes a Gaussian mixture model to fit the distribution of gene expression and to adaptively select the threshold to define lowly expressed genes, which are prone to form false-positive associations. Thereafter, CAIMAN corrects the normalized expression for these genes by removing the variability across samples that might lead to false-positive associations. Moreover, CAIMAN avoids arbitrary gene filtering and retains associations to genes that are only expressed in small subgroups of samples, highlighting its potential future impact on network modeling and other association-based approaches in large-scale heterogeneous data.
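
For readers unfamiliar with the quantile-based normalization discussed above, the following is a minimal sketch of the standard procedure: every sample is forced onto the same distribution, namely the mean of the per-sample sorted values. It illustrates only the normalization step, not CAIMAN's correction, and the function name `quantile_normalize`, the tie-breaking convention and the toy matrix are assumptions made for illustration.

```python
import numpy as np

def quantile_normalize(expr):
    """Quantile-normalize the columns (samples) of a genes x samples matrix.

    Ties are broken by order of appearance, one of several conventions
    in use; other implementations may average tied ranks instead.
    """
    expr = np.asarray(expr, dtype=float)
    # Row order that sorts each sample's values from low to high.
    order = np.argsort(expr, axis=0, kind="stable")
    sorted_vals = np.take_along_axis(expr, order, axis=0)
    # Reference distribution: mean across samples at each rank.
    reference = sorted_vals.mean(axis=1)
    # Write the reference value for each rank back to the gene holding it.
    normalized = np.empty_like(expr)
    np.put_along_axis(normalized, order, reference[:, None], axis=0)
    return normalized

# Toy matrix (hypothetical): 4 genes x 3 samples.
toy = np.array([[5, 4, 3],
                [2, 1, 4],
                [3, 4, 6],
                [4, 2, 8]])
normalized = quantile_normalize(toy)
print(normalized)
```

After normalization every column contains exactly the same set of values, so only the within-sample rank of a gene is retained.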

A comprehensive rat transcriptome built from large scale RNA-seq-based annotation

Nucleic Acids Research ◽

10.1093/nar/gkaa638 ◽

2020 ◽

Vol 48 (15) ◽

pp. 8320-8331

Author(s):

Xiangjun Ji ◽

Peng Li ◽

James C Fuscoe ◽

Geng Chen ◽

Wenzhong Xiao ◽

...

Keyword(s):

Gene Expression ◽

Significant Influence ◽

Large Scale ◽

Expression Patterns ◽

Model Organism ◽

RNA-Seq ◽

Disease Mechanisms ◽

Expression Studies ◽

Gene Expression Studies ◽

Different Tissues

Abstract

The rat is an important model organism in biomedical research for studying human disease mechanisms and treatments, but its annotated transcriptome is far from complete. We constructed a Rat Transcriptome Re-annotation named RTR using RNA-seq data from 320 samples in 11 different organs generated by the SEQC consortium. In total, there are 52 807 genes and 114 152 transcripts in RTR. Transcribed regions and exons in RTR account for ∼42% and ∼6.5% of the genome, respectively. Of all 73 074 newly annotated transcripts in RTR, 34 213 were annotated as high-confidence coding transcripts and 24 728 as high-confidence long noncoding transcripts. Different tissues rather than different stages have a significant influence on the expression patterns of transcripts. We also found that 11 715 genes and 15 852 transcripts were expressed in all 11 tissues and that 849 house-keeping genes expressed different isoforms among tissues. This comprehensive transcriptome is freely available at http://www.unimd.org/rtr/. Our new rat transcriptome provides an essential reference for genetics and gene expression studies in rat disease and toxicity models.

QUBIC2: a novel and robust biclustering algorithm for analyses and interpretation of large-scale RNA-Seq data

Bioinformatics ◽

10.1093/bioinformatics/btz692 ◽

2019 ◽

Vol 36 (4) ◽

pp. 1143-1149 ◽

Cited By ~ 9

Author(s):

Juan Xie ◽

Anjun Ma ◽

Yu Zhang ◽

Bingqiang Liu ◽

Sha Cao ◽

...

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Large Scale ◽

Gaussian Model ◽

Functional Gene ◽

Superior Performance ◽

Supplementary Information ◽

Expression Data ◽

RNA-Seq ◽

Gene Modules

Abstract

Motivation: The biclustering of large-scale gene expression data holds promising potential for detecting condition-specific functional gene modules (i.e. biclusters). However, existing methods do not adequately address a comprehensive detection of all significant bicluster structures and have limited power when applied to expression data generated by RNA-Sequencing (RNA-Seq), especially single-cell RNA-Seq (scRNA-Seq) data, where massive zero and low expression values are observed.

Results: We present a new biclustering algorithm, QUalitative BIClustering algorithm Version 2 (QUBIC2), which is empowered by: (i) a novel left-truncated mixture of Gaussian model for an accurate assessment of multimodality in zero-enriched expression data, (ii) a fast and efficient dropouts-saving expansion strategy for functional gene modules optimization using information divergency and (iii) a rigorous statistical test for the significance of all the identified biclusters in any organism, including those without substantial functional annotations. QUBIC2 demonstrated considerably improved performance in detecting biclusters compared to five other widely used algorithms on various benchmark datasets from E. coli, human and simulated data. QUBIC2 also showcased robust and superior performance on gene expression data generated by microarray, bulk RNA-Seq and scRNA-Seq.

Availability and implementation: The source code of QUBIC2 is freely available at https://github.com/OSU-BMBL/QUBIC2.

Supplementary information: Supplementary data are available at Bioinformatics online.

scTIM: seeking cell-type-indicative marker from single cell RNA-seq data by consensus optimization

Bioinformatics ◽

10.1093/bioinformatics/btz936 ◽

2019 ◽

Vol 36 (8) ◽

pp. 2474-2485 ◽

Cited By ~ 2

Author(s):

Zhanying Feng ◽

Xianwen Ren ◽

Yuan Fang ◽

Yining Yin ◽

Chutian Huang ◽

...

Keyword(s):

Single Cell ◽

Large Scale ◽

Cell Types ◽

Mouse Cell ◽

Supplementary Information ◽

RNA-Seq ◽

Cell Type ◽

Robust Solution ◽

Development Trajectory ◽

Consensus Optimization

Abstract

Motivation: Single cell RNA-seq data offers a new resource and resolution for studying cell type identity and its conversion. However, data analyses are challenging in dealing with noise, sparsity and poor annotation at single cell resolution. Detecting cell-type-indicative markers is a promising way to help with denoising, clustering and cell type annotation.

Results: We developed a new method, scTIM, to reveal cell-type-indicative markers. scTIM is based on a multi-objective optimization framework to simultaneously maximize gene specificity by considering gene–cell relationships, maximize genes’ ability to reconstruct cell–cell relationships and minimize gene redundancy by considering gene–gene relationships. Furthermore, consensus optimization is introduced for a robust solution. Experimental results on three diverse single cell RNA-seq datasets show scTIM’s advantages in identifying cell types (clustering), annotating cell types and reconstructing cell development trajectory. Applying scTIM to the large-scale mouse cell atlas data identifies critical markers for 15 tissues as ‘mouse cell marker atlas’, which allows us to investigate identities of different tissues and subtle cell types within a tissue. scTIM will serve as a useful method for single cell RNA-seq data mining.

Availability and implementation: scTIM is freely available at https://github.com/Frank-Orwell/scTIM.

Supplementary information: Supplementary data are available at Bioinformatics online.
