Metric Learning on Expression Data for Gene Function Prediction

Abstract Motivation Co-expression of two genes across different conditions is indicative of their involvement in the same biological process. However, using RNA-Seq datasets with many experimental conditions from diverse sources introduces batch effects and other artefacts that might obscure the real co-expression signal. Moreover, only a subset of experimental conditions is expected to be relevant for finding genes related to a particular Gene Ontology (GO) term. Therefore, we hypothesize that when the purpose is to find similarly functioning genes, the co-expression of genes should not be determined on all samples but only on those samples informative for the GO term of interest. Results To address both types of effects, we developed MLC (Metric Learning for Co-expression), a fast algorithm that assigns a GO-term-specific weight to each expression sample. The goal is to obtain a weighted co-expression measure that is more suitable than the unweighted Pearson correlation for applying Guilt-By-Association-based function predictions. More specifically, if two genes are annotated with a given GO term, MLC tries to maximize their weighted co-expression and, in addition, if one of them is not annotated with that term, the weighted co-expression is minimized. Our experiments on publicly available Arabidopsis thaliana RNA-Seq data demonstrate that MLC outperforms standard Pearson correlation in term-centric performance. Availability MLC is available as a Python package at www.github.com/stamakro/. Supplementary information Supplementary data are available online.


Metric learning on expression data for gene function prediction

Bioinformatics ◽

10.1093/bioinformatics/btz731 ◽

2019 ◽

Author(s):

Stavros Makrodimitris ◽

Marcel J T Reinders ◽

Roeland C H J van Ham

Keyword(s):

Pearson Correlation ◽

Metric Learning ◽

Specific Weight ◽

Supplementary Information ◽

Expression Data ◽

Rna Seq ◽

Experimental Conditions ◽

Expression Of Genes ◽

Guilt By Association ◽

Python Package

Abstract Motivation Co-expression of two genes across different conditions is indicative of their involvement in the same biological process. However, when using RNA-Seq datasets with many experimental conditions from diverse sources, only a subset of the experimental conditions is expected to be relevant for finding genes related to a particular Gene Ontology (GO) term. Therefore, we hypothesize that when the purpose is to find similarly functioning genes, the co-expression of genes should not be determined on all samples but only on those samples informative for the GO term of interest. Results To address this, we developed Metric Learning for Co-expression (MLC), a fast algorithm that assigns a GO-term-specific weight to each expression sample. The goal is to obtain a weighted co-expression measure that is more suitable than the unweighted Pearson correlation for applying Guilt-By-Association-based function predictions. More specifically, if two genes are annotated with a given GO term, MLC tries to maximize their weighted co-expression and, in addition, if one of them is not annotated with that term, the weighted co-expression is minimized. Our experiments on publicly available Arabidopsis thaliana RNA-Seq data demonstrate that MLC outperforms standard Pearson correlation in term-centric performance. Moreover, our method is particularly good at more specific terms, which are the most interesting. Finally, by observing the sample weights for a particular GO term, one can identify which experiments are important for learning that term and potentially identify novel conditions that are relevant, as demonstrated by experiments in both A. thaliana and Pseudomonas aeruginosa. Availability and implementation MLC is available as a Python package at www.github.com/stamakro/MLC. Supplementary information Supplementary data are available at Bioinformatics online.
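The idea of a sample-weighted co-expression measure can be illustrated with a small sketch. The code below is not taken from the MLC package and does not learn any weights; it only shows, under the assumption of fixed non-negative weights, how a weighted Pearson correlation behaves. The function name `weighted_pearson` and the toy expression values are hypothetical.

```python
import numpy as np

def weighted_pearson(x, y, w):
    """Pearson correlation of x and y with non-negative per-sample weights w."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    w = w / w.sum()                          # normalise weights to sum to 1
    mx, my = np.sum(w * x), np.sum(w * y)    # weighted means
    cov = np.sum(w * (x - mx) * (y - my))    # weighted covariance
    var_x = np.sum(w * (x - mx) ** 2)        # weighted variances
    var_y = np.sum(w * (y - my) ** 2)
    return cov / np.sqrt(var_x * var_y)

# Toy expression profiles of two genes over six samples (made-up values).
gene_a = np.array([1.0, 2.0, 3.0, 4.0, 0.0, 5.0])
gene_b = np.array([2.0, 4.0, 6.0, 8.0, 9.0, 1.0])

uniform = np.ones(6)                                   # ordinary Pearson
focused = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0])     # ignore last two samples

r_all = weighted_pearson(gene_a, gene_b, uniform)
r_focused = weighted_pearson(gene_a, gene_b, focused)
print(r_all, r_focused)
```

With uniform weights the result coincides with the standard Pearson correlation; setting the weights of the last two samples to zero leaves only the samples in which the two toy genes vary together.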


QUBIC2: a novel and robust biclustering algorithm for analyses and interpretation of large-scale RNA-Seq data

Bioinformatics ◽

10.1093/bioinformatics/btz692 ◽

2019 ◽

Vol 36 (4) ◽

pp. 1143-1149 ◽

Cited By ~ 9

Author(s):

Juan Xie ◽

Anjun Ma ◽

Yu Zhang ◽

Bingqiang Liu ◽

Sha Cao ◽

...

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Large Scale ◽

Gaussian Model ◽

Functional Gene ◽

Superior Performance ◽

Supplementary Information ◽

Expression Data ◽

Rna Seq ◽

Gene Modules

Abstract Motivation The biclustering of large-scale gene expression data holds promising potential for detecting condition-specific functional gene modules (i.e. biclusters). However, existing methods do not adequately address a comprehensive detection of all significant bicluster structures and have limited power when applied to expression data generated by RNA-Sequencing (RNA-Seq), especially single-cell RNA-Seq (scRNA-Seq) data, where massive zero and low expression values are observed. Results We present a new biclustering algorithm, QUalitative BIClustering algorithm Version 2 (QUBIC2), which is empowered by: (i) a novel left-truncated mixture of Gaussian model for an accurate assessment of multimodality in zero-enriched expression data, (ii) a fast and efficient dropouts-saving expansion strategy for functional gene modules optimization using information divergency and (iii) a rigorous statistical test for the significance of all the identified biclusters in any organism, including those without substantial functional annotations. QUBIC2 demonstrated considerably improved performance in detecting biclusters compared to five other widely used algorithms on various benchmark datasets from E. coli, human and simulated data. QUBIC2 also showcased robust and superior performance on gene expression data generated by microarray, bulk RNA-Seq and scRNA-Seq. Availability and implementation The source code of QUBIC2 is freely available at https://github.com/OSU-BMBL/QUBIC2. Supplementary information Supplementary data are available at Bioinformatics online.


BYASE: a Python library for estimating gene and isoform level allele-specific expression

Bioinformatics ◽

10.1093/bioinformatics/btaa636 ◽

2020 ◽

Vol 36 (19) ◽

pp. 4955-4956

Author(s):

Lili Dong ◽

Jianan Wang ◽

Guohua Wang

Keyword(s):

Graphical User Interface ◽

Supplementary Information ◽

Rna Seq ◽

Biological Mechanisms ◽

Specific Expression ◽

Allele Specific Expression ◽

Source Codes ◽

Gene Level ◽

Allele Specific ◽

Python Package

Abstract Summary Allele-specific expression (ASE) is involved in many important biological mechanisms. We present a Python package, BYASE, and its graphical user interface (GUI) tool, BYASE-GUI, for the identification of ASE from single-end and paired-end RNA-seq data based on Bayesian inference, which can simultaneously report differences in gene-level and isoform-level expression. BYASE uses both phased SNPs and non-phased SNPs, and supports polyploid organisms. Availability and implementation The source code of BYASE and BYASE-GUI is freely available at https://github.com/ncjllld/byase and https://github.com/ncjllld/byase_gui. Supplementary information Supplementary data are available at Bioinformatics online.


FINDER: An automated software package to annotate eukaryotic genes from RNA-Seq data and associated protein sequences

10.1101/2021.02.04.429837 ◽

2021 ◽

Author(s):

Sagnik Banerjee ◽

Priyanka Bhandary ◽

Margaret Woodhouse ◽

Taner Z. Sen ◽

Roger P. Wise ◽

...

Keyword(s):

Gene Annotation ◽

Gene Prediction ◽

Active Regions ◽

Expression Data ◽

Rna Seq ◽

Experimental Conditions ◽

Eukaryotic Genes ◽

Associated Proteins ◽

Gene Structures ◽

Automated Software

Abstract Background Gene annotation in eukaryotes is a non-trivial task that requires meticulous analysis of accumulated transcript data. Challenges include transcriptionally active regions of the genome that contain overlapping genes, genes that produce numerous transcripts, transposable elements and numerous diverse sequence repeats. Currently available gene annotation software applications depend on pre-constructed full-length gene sequence assemblies which are not guaranteed to be error-free. The origins of these sequences are often uncertain, making it difficult to identify and rectify errors in them. This hinders the creation of an accurate and holistic representation of the transcriptomic landscape across multiple tissue types and experimental conditions. Therefore, to gauge the extent of diversity in gene structures, a comprehensive analysis of genome-wide expression data is imperative. Results We present FINDER, a fully automated computational tool that optimizes the entire process of annotating genes and transcript structures. Unlike current state-of-the-art pipelines, FINDER automates the RNA-Seq pre-processing step by working directly with raw sequence reads and optimizes gene prediction from BRAKER2 by supplementing these reads with associated proteins. The FINDER pipeline (1) reports transcripts and recognizes genes that are expressed under specific conditions, (2) generates all possible alternatively spliced transcripts from expressed RNA-Seq data, (3) analyzes read coverage patterns to modify existing transcript models and create new ones, and (4) scores genes as high- or low-confidence based on the available evidence across multiple datasets. We demonstrate the ability of FINDER to automatically annotate a diverse pool of genomes from eight species. Conclusions FINDER takes a completely automated approach to annotate genes directly from raw expression data. It is capable of processing eukaryotic genomes of all sizes and requires no manual supervision, which makes it ideal for bench researchers with limited experience in handling computational tools.


pyBedGraph: a python package for fast operations on 1D genomic signal tracks

Bioinformatics ◽

10.1093/bioinformatics/btaa061 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3234-3235

Author(s):

Henry B Zhang ◽

Minji Kim ◽

Jeffrey H Chuang ◽

Yijun Ruan

Keyword(s):

Chromatin Accessibility ◽

Genomic Research ◽

Supplementary Information ◽

Summary Statistics ◽

Rna Seq ◽

Binary Format ◽

Modern Genomic ◽

Binding Intensity ◽

Python Package ◽

Generation Sequencing

Abstract Motivation Modern genomic research is driven by next-generation sequencing experiments such as ChIP-seq and ChIA-PET that generate coverage files for transcription factor binding, as well as DHS and ATAC-seq that yield coverage files for chromatin accessibility. Such files are in a bedGraph text format or a bigWig binary format. Obtaining summary statistics in a given region is a fundamental task in analyzing protein binding intensity or chromatin accessibility. However, the existing Python package for operating on coverage files is not optimized for speed. Results We developed pyBedGraph, a Python package to quickly obtain summary statistics for a given interval in a bedGraph or a bigWig file. When tested on 12 ChIP-seq, ATAC-seq, RNA-seq and ChIA-PET datasets, pyBedGraph is on average 260 times faster than the existing program pyBigWig. On average, pyBedGraph can look up the exact mean signal of 1 million regions in ∼0.26 s and can compute their approximate means in <0.12 s on a conventional laptop. Availability and implementation pyBedGraph is publicly available at https://github.com/TheJacksonLaboratory/pyBedGraph under the MIT license. Supplementary information Supplementary data are available at Bioinformatics online.
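The basic task described here, computing the mean signal over many genomic intervals, can be sketched with prefix sums. This is not pyBedGraph's implementation or API; it is a minimal illustration that assumes the coverage track has already been loaded as a dense per-base NumPy array, and the function name `interval_means` and the toy coverage values are hypothetical.

```python
import numpy as np

def interval_means(signal, starts, ends):
    """Mean of signal[start:end] for each half-open interval, via prefix sums."""
    signal = np.asarray(signal, dtype=float)
    starts = np.asarray(starts)
    ends = np.asarray(ends)
    # prefix[i] holds the sum of signal[0:i], so any interval sum is one subtraction.
    prefix = np.concatenate(([0.0], np.cumsum(signal)))
    return (prefix[ends] - prefix[starts]) / (ends - starts)

# Toy per-base coverage over a 10-base region (made-up values).
coverage = np.array([0, 0, 2, 2, 4, 4, 4, 0, 1, 1], dtype=float)
means = interval_means(coverage, starts=[0, 2, 4], ends=[4, 7, 10])
print(means)
```

After the one-off cumulative sum, each query costs constant time regardless of interval length, which is why this layout suits workloads with very many lookups.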


ClusterMap: compare multiple single cell RNA-Seq datasets across different experimental conditions

Bioinformatics ◽

10.1093/bioinformatics/btz024 ◽

2019 ◽

Vol 35 (17) ◽

pp. 3038-3045 ◽

Cited By ~ 7

Author(s):

Xin Gao ◽

Deqing Hu ◽

Madelaine Gogol ◽

Hua Li

Keyword(s):

Single Cell ◽

Molecular Mechanisms ◽

Population Level ◽

Supplementary Information ◽

Marker Genes ◽

Rna Seq ◽

Matching Problem ◽

Experimental Conditions ◽

Cut Method ◽

Underlying Mechanisms

Abstract Motivation Single cell RNA-Seq (scRNA-Seq) facilitates the characterization of cell type heterogeneity and developmental processes. Further study of single cell profiles across different conditions enables the understanding of biological processes and underlying mechanisms at the sub-population level. However, developing proper methodology to compare multiple scRNA-Seq datasets remains challenging. Results We have developed ClusterMap, a systematic method and workflow to facilitate the comparison of scRNA-seq profiles across distinct biological contexts. Using hierarchical clustering of the marker genes of each sub-group, ClusterMap matches the sub-types of cells across different samples and provides ‘similarity’ as a metric to quantify the quality of the match. We introduce a purity tree cut method designed specifically for this matching problem. We use a Circos plot and a regrouping method to visualize the results concisely. Furthermore, we propose a new metric, ‘separability’, to summarize sub-population changes among all sample pairs. In the case studies, we demonstrate that ClusterMap can provide further insight into the different molecular mechanisms of cellular sub-populations across different conditions. Availability and implementation ClusterMap is implemented in R and available at https://github.com/xgaoo/ClusterMap. Supplementary information Supplementary data are available at Bioinformatics online.


LSTrAP-Kingdom: an automated pipeline to generate annotated gene expression atlases for kingdoms of life

Bioinformatics ◽

10.1093/bioinformatics/btab168 ◽

2021 ◽

Author(s):

William Goh ◽

Marek Mutwil

Keyword(s):

Gene Expression ◽

Large Scale ◽

Supplementary Information ◽

Expression Data ◽

Supplementary Data ◽

Rna Seq ◽

Analysis Pipeline ◽

Study Gene Expression ◽

Automated Pipeline ◽

Bacteria And Fungi

Abstract Motivation There are now more than two million RNA sequencing experiments for plants, animals, bacteria and fungi publicly available, allowing us to study gene expression within and across species and kingdoms. However, the tools allowing the download, quality control and annotation of this data for more than one species at a time are currently missing. Results To remedy this, we present the Large-Scale Transcriptomic Analysis Pipeline in Kingdom of Life (LSTrAP-Kingdom) pipeline, which we used to process 134,521 RNA-seq samples, achieving ∼12,000 processed samples per day. Our pipeline generated quality-controlled, annotated gene expression matrices that rival the manually curated gene expression data in identifying functionally-related genes. Availability LSTrAP-Kingdom is available from: https://github.com/wirriamm/plants-pipeline and is fully implemented in Python and Bash. Supplementary information Supplementary data are available at Bioinformatics online.


Deep learning of gene relationships from single cell time-course expression data

10.1101/2020.09.21.306332 ◽

2020 ◽

Author(s):

Ye Yuan ◽

Ziv Bar-Joseph

Keyword(s):

Time Series ◽

Deep Learning ◽

Single Cell ◽

Time Course ◽

Expression Profiles ◽

Regulatory Gene ◽

Supplementary Information ◽

Expression Data ◽

Rna Seq ◽

Time Course Data

Abstract Motivation Time-course gene expression data has been widely used to infer regulatory and signaling relationships between genes. Most of the widely used methods for such analysis were developed for bulk expression data. Single cell RNA-Seq (scRNA-Seq) data offers several advantages including the large number of expression profiles available and the ability to focus on individual cells rather than averages. However, this data also raises new computational challenges. Results Using a novel encoding for scRNA-Seq expression data we develop deep learning methods for interaction prediction from time-course data. Our methods use a supervised framework which represents the data as a 3D tensor and trains convolutional and recurrent neural networks (CNN and RNN) for predicting interactions. We tested our Time-course Deep Learning (TDL) models on five different time series scRNA-Seq datasets. As we show, TDL can accurately identify causal and regulatory gene-gene interactions and can also be used to assign new function to genes. TDL improves on prior methods for the above tasks and can be generally applied to new time series scRNA-Seq data. Availability and Implementation Freely available at https://github.com/xiaoyeye/. Supplementary information Supplementary data are available at XXX online.


VERSE: a versatile and efficient RNA-Seq read counting tool

10.1101/053306 ◽

2016 ◽

Cited By ~ 10

Author(s):

Qin Zhu ◽

Stephen A Fisher ◽

Jamie Shallcross ◽

Junhyong Kim

Keyword(s):

Reference Genome ◽

Digital Gene Expression ◽

Supplementary Information ◽

Expression Data ◽

Rna Seq ◽

Gene Level ◽

Different Types ◽

Intergenic Regions ◽

Supplementary Material ◽

Assignment Scheme

Abstract Motivation RNA-Seq is a powerful technology that delivers digital gene expression data. To measure expression strength at the gene level, one popular approach is direct read counting after aligning the reads to a reference genome/transcriptome. HTSeq is one of the most popular ways of counting reads, yet its slow running speed poses a bottleneck to many RNA-Seq pipelines. Gene level counting programs also lack a robust scheme for quantifying reads that map to non-exonic genomic features, such as intronic and intergenic regions, even though these reads are prevalent in most RNA-Seq data. Results In this paper we present VERSE, an RNA-Seq read counting tool which builds upon the speed of featureCounts and implements the counting modes of HTSeq. VERSE is more than 30x faster than HTSeq when computing the same gene counts. VERSE also supports a hierarchical assignment scheme, which allows reads to be assigned uniquely and sequentially to different types of features according to user-defined priorities. Availability VERSE is implemented in C. It is built on top of featureCounts. VERSE is open source and can be downloaded freely from Github (https://github.com/qinzhu/VERSE). Supplementary information Tables and figures illustrating the counting modes implemented in VERSE and the differences between hierarchical and independent assignment.
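A priority-based assignment of reads to feature types can be sketched as follows. This is only an illustration of the general idea and not VERSE's actual rules, which the abstract does not spell out: it assumes a read is assigned to the first feature type, in the user-given priority order, that it overlaps by at least one base. The function names, feature coordinates and reads are hypothetical.

```python
def overlaps(read, feature):
    """True if two half-open (start, end) intervals share at least one base."""
    return read[0] < feature[1] and feature[0] < read[1]

def assign_hierarchically(read, features_by_type, priority):
    """Return the first feature type, in priority order, that the read overlaps."""
    for ftype in priority:
        if any(overlaps(read, f) for f in features_by_type.get(ftype, [])):
            return ftype
    return None  # the read overlaps none of the listed feature types

# Toy annotation on a single chromosome (made-up coordinates).
features = {
    "exon": [(100, 200), (300, 400)],
    "intron": [(200, 300)],
    "intergenic": [(0, 100), (400, 1000)],
}
priority = ["exon", "intron", "intergenic"]

reads = [(150, 180), (190, 230), (220, 260), (450, 480), (2000, 2050)]
assigned = [assign_hierarchically(r, features, priority) for r in reads]
print(assigned)
```

Because the loop stops at the first matching type, a read spanning an exon-intron boundary is counted once, as exonic, rather than once per feature type as it would be under independent assignment.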


Improved dropClust R package with integrative analysis support for scRNA-seq data

Bioinformatics ◽

10.1093/bioinformatics/btz823 ◽

2019 ◽

Author(s):

Debajyoti Sinha ◽

Pradyumn Sinha ◽

Ritwik Saha ◽

Sanghamitra Bandyopadhyay ◽

Debarka Sengupta

Keyword(s):

Single Cell ◽

Large Scale ◽

R Package ◽

Integrative Analysis ◽

Locality Sensitive Hashing ◽

Supplementary Information ◽

Expression Data ◽

Rna Seq ◽

Speed Up ◽

Cell Expression

Abstract Summary DropClust leverages Locality Sensitive Hashing (LSH) to speed up clustering of large scale single cell expression data. Here we present the improved dropClust, a complete R package that is fast, interoperable and minimally resource intensive. The new dropClust features a novel batch effect removal algorithm that allows integrative analysis of single cell RNA-seq (scRNA-seq) datasets. Availability and implementation dropClust is freely available at https://github.com/debsin/dropClust as an R package. A lightweight online version of dropClust is available at https://debsinha.shinyapps.io/dropClust/. Supplementary information Supplementary data are available at Bioinformatics online.
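To give a feel for how Locality Sensitive Hashing can group similar expression profiles without comparing every pair, here is a minimal random-hyperplane sketch. It is an assumption-laden illustration and not dropClust's algorithm: the abstract does not say which LSH family the package uses, and the function name `lsh_signatures`, the number of bits and the toy cells are hypothetical.

```python
from collections import defaultdict

import numpy as np

def lsh_signatures(matrix, n_bits=8, seed=0):
    """Hash each row to an n_bits signature using random hyperplanes."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((matrix.shape[1], n_bits))
    # One bit per hyperplane: which side of the plane the row falls on.
    bits = (matrix @ planes >= 0).astype(int)
    return [tuple(row.tolist()) for row in bits]

# Toy cells-by-genes matrix; cell 1 is cell 0 at twice the sequencing depth.
cells = np.array([
    [1.0, 0.0, 2.0, 0.0],
    [2.0, 0.0, 4.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
])
signatures = lsh_signatures(cells)

# Cells sharing a signature land in the same bucket and become
# candidate neighbours for a later, more expensive clustering step.
buckets = defaultdict(list)
for cell_index, signature in enumerate(signatures):
    buckets[signature].append(cell_index)
print(dict(buckets))
```

Rows that point in the same direction always receive the same signature, so cells 0 and 1 share a bucket; whether cell 2 falls elsewhere depends on the random hyperplanes drawn.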
