MUREN: a robust and multi-reference approach of RNA-seq transcript normalization

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Yance Feng ◽  
Lei M. Li

Abstract Background Normalization of RNA-seq data aims at identifying biological expression differentiation between samples by removing the effects of unwanted confounding factors. Explicitly or implicitly, the justification of normalization requires a set of housekeeping genes. However, the existence of housekeeping genes common to a very large collection of samples, especially under a wide range of conditions, is questionable. Results We propose to carry out pairwise normalization with respect to multiple references, selected from representative samples. The pairwise intermediates are then integrated based on a linear model that adjusts the reference effects. Motivated by the notion of housekeeping genes and their statistical counterparts, we adopt the robust least trimmed squares regression in pairwise normalization. The proposed method (MUREN) is compared with other existing tools on some standard data sets. The goodness of normalization emphasizes preserving possible asymmetric differentiation, whose biological significance is exemplified by single-cell data of the cell cycle. MUREN is implemented as an R package. The code, under license GPL-3, is available on GitHub (github.com/hippo-yf/MUREN) and on the conda platform (anaconda.org/hippo-yf/r-muren). Conclusions MUREN performs RNA-seq normalization using a two-step statistical regression induced from a general principle. We propose that the densities of pairwise differentiations be used to evaluate the goodness of normalization. MUREN adjusts the mode of differentiation toward zero while preserving the skewness due to biological asymmetric differentiation. Moreover, by robustly integrating pre-normalized counts with respect to multiple references, MUREN is immune to individual outlier samples.
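The two-step idea described above can be sketched in a few lines. The snippet below is a minimal illustration, not MUREN's actual implementation: the robust least trimmed squares step is approximated by a trimmed mean of per-gene log-ratios (strongly differentiated genes are trimmed away, so the scaling factor is driven by the stable, housekeeping-like majority), and the linear-model adjustment of reference effects is reduced to a plain average across references.

```python
import math

def trimmed_mean(values, trim=0.25):
    """Mean of the central (1 - 2*trim) fraction of the sorted values."""
    v = sorted(values)
    k = int(len(v) * trim)
    core = v[k:len(v) - k] or v
    return sum(core) / len(core)

def normalize_multi_reference(samples, ref_ids, trim=0.25):
    """Scale each sample so its trimmed log-ratio to each reference is ~0,
    then integrate the per-reference shifts by averaging (a crude stand-in
    for MUREN's linear-model adjustment of reference effects)."""
    factors = []
    for sample in samples:
        per_ref = []
        for r in ref_ids:
            ref = samples[r]
            log_ratios = [math.log(s / t) for s, t in zip(sample, ref)
                          if s > 0 and t > 0]
            per_ref.append(trimmed_mean(log_ratios, trim))
        shift = sum(per_ref) / len(per_ref)   # integrate across references
        factors.append(math.exp(-shift))
    return [[c * f for c in sample] for sample, f in zip(samples, factors)]

# Two samples differing by a global 2x depth factor plus one outlier gene:
a = [10, 20, 30, 40, 500]
b = [20, 40, 60, 80, 100]   # same library scaled 2x; last gene differential
norm = normalize_multi_reference([a, b], ref_ids=[0, 1])
```

With both samples used as references, the shared genes are brought onto a common scale while the outlier gene keeps its differentiation, mirroring the robustness property claimed above.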

2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Matthias Munz ◽  
Inken Wohlers ◽  
Eric Simon ◽  
Tobias Reinberger ◽  
Hauke Busch ◽  
...  

Abstract Exploration of genetic variant-to-gene relationships by quantitative trait loci such as expression QTLs is a frequently used tool in genome-wide association studies. However, the wide range of public QTL databases and the lack of batch annotation features complicate a comprehensive annotation of GWAS results. In this work, we introduce the tool “Qtlizer” for annotating lists of human variants with associated changes in gene expression and protein abundance, using an integrated database of published QTLs. Features include incorporation of variants in linkage disequilibrium and reverse search by gene names. Analyzing the database for base pair distances between best significant eQTLs and their affected genes suggests that the commonly used cis-distance limit of 1,000,000 base pairs might be too restrictive, implying a substantial number of misassigned and as-yet undetected eQTLs. We also ranked genes with respect to the maximum number of tissue-specific eQTL studies in which a most significant eQTL signal was consistent. For the top 100 genes, we observed the strongest enrichment with housekeeping genes (P = 2 × 10⁻⁶) and with the 10% highest expressed genes (P = 0.005) after grouping eQTLs by r² > 0.95, underlining the relevance of LD information in eQTL analyses. Qtlizer can be accessed via https://genehopper.de/qtlizer or by using the respective Bioconductor R package (https://doi.org/10.18129/B9.bioc.Qtlizer).


PeerJ ◽  
2019 ◽  
Vol 7 ◽  
pp. e7046 ◽  
Author(s):  
Jacob M. Wozniak ◽  
David J. Gonzalez

Background Mass-spectrometry-based proteomics is a prominent field of study that allows for the unbiased quantification of thousands of proteins from a particular sample. A key advantage of these techniques is the ability to detect protein post-translational modifications (PTMs) and localize them to specific amino acid residues. These approaches have led to many significant findings in a wide range of biological disciplines, from developmental biology to cancer and infectious diseases. However, there is a current lack of tools available to connect raw PTM site information to biologically meaningful results in a high-throughput manner. Furthermore, many of the available tools require significant programming knowledge to implement. Results The R package PTMphinder was designed to enable researchers, particularly those with minimal programming background, to thoroughly analyze PTMs in proteomic data sets. The package contains three functions: parseDB, phindPTMs and extractBackground. Together, these functions allow users to reformat proteome databases for easier analysis, localize PTMs within full proteins, extract motifs surrounding the identified sites and create proteome-specific motif backgrounds for statistical purposes. Beta-testing of this R package has demonstrated its simplicity and ease of integration with existing tools. Conclusion PTMphinder empowers researchers to fully analyze and interpret PTMs derived from proteomic data. This package is simple enough for researchers with limited programming experience to understand and implement. The data produced from this package can inform subsequent research by itself and also be used in conjunction with other tools, such as motif-x, for further analysis.
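The core task performed by phindPTMs, locating an identified peptide within its full-length protein, translating the PTM's peptide-relative position to a protein-level residue number, and extracting the flanking motif, can be illustrated in a few lines. PTMphinder itself is an R package; the function and sequences below are hypothetical stand-ins, not its API.

```python
def locate_ptm(protein, peptide, ptm_offset, flank=3, pad="_"):
    """Return (protein_position, motif) for a PTM at ptm_offset (0-based)
    within `peptide`; the motif is the site +/- `flank` residues, padded
    with `pad` at the protein termini. Position is reported 1-based."""
    start = protein.find(peptide)
    if start == -1:
        raise ValueError("peptide not found in protein")
    pos = start + ptm_offset                      # 0-based protein index
    padded = pad * flank + protein + pad * flank  # tolerate terminal sites
    motif = padded[pos:pos + 2 * flank + 1]
    return pos + 1, motif

# A phosphoserine at offset 2 of the peptide "AKSVR" in a toy protein:
protein = "MTAKSVRLLQ"
pos, motif = locate_ptm(protein, "AKSVR", ptm_offset=2)
```

The extracted motif (here centered on the serine) is exactly the kind of input that downstream tools such as motif-x expect, together with a proteome-specific background as produced by extractBackground.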


2019 ◽  
Author(s):  
Bidossessi Wilfried Hounkpe ◽  
Francine Chenou ◽  
Franciele Lima ◽  
Erich Vinicius de Paula

Abstract Housekeeping (HK) genes are constitutively expressed genes that are required for the maintenance of basic cellular functions. Despite their importance in the calibration of gene expression, as well as in the understanding of many genomic and evolutionary features, important discrepancies have been observed in studies that previously identified these genes. Here, we present the Housekeeping and Reference Transcript Atlas (HRT Atlas v1.0, www.housekeeping.unicamp.br), a web-based database which addresses some of the previously observed limitations in the identification of these genes and offers a more accurate database of human and mouse HK genes and transcripts. The database was generated by mining massive human and mouse RNA-seq data sets, including 12,482 and 507 high-quality RNA-seq samples from 82 human non-disease tissues/cells and 15 healthy tissues/cells of C57BL/6 wild-type mouse, respectively. Users can visualize the expression and download lists of 2,158 human HK transcripts from 2,176 HK genes and 3,024 mouse HK transcripts from 3,277 mouse HK genes. HRT Atlas also offers the most stable and suitable tissue-selective candidate reference transcripts for normalization of qPCR experiments. Specific primers and predicted modifiers of gene expression for some of these HK transcripts are also proposed. HRT Atlas has also been integrated with regulatory elements from the Epiregio server. All of these resources can be accessed and downloaded from any desktop or mobile web browser.


Author(s):  
Tom Dalton ◽  
Graham Kirby ◽  
Alan Dearle ◽  
Özgür Akgün ◽  
Monique Mackenzie

Background ’Gold-standard’ data to evaluate linkage algorithms are rare. Synthetic data have the advantage that all the true links are known. In the domain of population reconstruction, the ability to synthesize populations on demand, with varying characteristics, allows a linkage approach to be evaluated across a wide range of data. We have implemented ValiPop, a microsimulation model, for this purpose. Approach ValiPop can create many varied populations based upon sets of desired population statistics, thus allowing linkage algorithms to be evaluated across many populations, rather than across a limited number of real-world ’gold-standard’ data sets. Given the potential interactions between different desired population statistics, the creation of a population does not necessarily imply that all desired population statistics have been met. To address this we have developed a statistical approach to validate the adherence of created populations to the desired statistics, using a generalized linear model. This talk will discuss the benefits of synthetic data for data linkage evaluation, the approach to validating created populations, and present the results of some initial linkage experiments using our synthetic data.
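The validation step, checking that a created population actually adheres to the desired statistics, can be illustrated with a Pearson chi-square goodness-of-fit test standing in for ValiPop's generalized linear model (all names and numbers below are illustrative, not ValiPop's interface):

```python
def chi_square_stat(observed, expected):
    """Pearson chi-square statistic comparing category counts realised in the
    synthetic population to the counts implied by the target statistics."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def adheres(observed, expected, critical):
    """True if the simulated distribution is consistent with the target at
    the level implied by `critical` (e.g. 7.815 for df=3, alpha=0.05)."""
    return chi_square_stat(observed, expected) < critical

# Target age-band counts vs counts realised in a 10,000-person synthetic
# population (4 bands, so df = 3 and the 5% critical value is 7.815):
expected = [2500, 3000, 3000, 1500]
good_run = [2480, 3050, 2970, 1500]   # small sampling noise only
bad_run = [2000, 3500, 3000, 1500]    # interaction pushed one band off target
```

A run like `bad_run` is exactly the situation described above: the simulator produced a population, but an interaction between desired statistics kept one of them from being met.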


2021 ◽  
Vol 26 (3) ◽  
Author(s):  
Natalia Assunção Brasil Silva ◽  
Taciano Oliveira da Silva ◽  
Heraldo Nunes Pitanga ◽  
Geraldo Luciano de Oliveira Marques

ABSTRACT In the new Brazilian mechanistic-empirical design method of asphalt pavements, MeDiNa, the characterization of permanent deformation (PD) for the selection of soils and gravel is based on tests performed with at least 150,000 loading cycles for each of the nine specimens indicated in the DNIT standard. Despite providing information about the material behavior under a wide range of testing conditions, the experimental program related to these PD characterizations is time consuming, and it is believed that it can be optimized. This paper evaluates the influence of the number of loading cycle applications on the characterization of the materials. For this purpose, seven materials were analyzed at their optimum moisture content (OMC), and one of them was also compacted in a condition above the OMC, for a total of eight data sets. Statistical regression analyses were performed to identify the parameters of the predictive model for different numbers of cycles, and the PD predictions for the different materials were compared. From these results, simulations were performed in the MeDiNa software to predict the performance of the materials. Four different N values were evaluated, considering 150,000 cycles as the reference: discarding the first 500 cycles but considering the PD accumulated in that interval; discarding the first 500 cycles along with the PD accumulated in that interval; a final N of 80,000; and a final N of 100,000. For the analyzed materials, no significant differences were observed in the PD prediction, even for tests with 50,000 or 70,000 fewer cycles than the 150,000 required in the standard. This indicates that, although characterization following standardized procedures is recommended, the experimental program of the current PD standard can possibly be significantly optimized by reducing the number of cycles applied to materials in laboratory tests. This possibility must be analyzed for each material.
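The comparison of fits over different numbers of cycles can be illustrated with a simple power law, PD = a·N^b, fitted by log-linear least squares. This is an illustrative stand-in for the predictive model actually used in MeDiNa, and the readings below are synthetic:

```python
import math

def fit_power_law(cycles, pd):
    """Least-squares fit of log(pd) = log(a) + b*log(N); returns (a, b)."""
    xs = [math.log(n) for n in cycles]
    ys = [math.log(p) for p in pd]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a, b

# Synthetic PD readings following 0.5 * N^0.1 at typical logging intervals:
cycles = [1000, 5000, 10000, 50000, 80000, 100000, 150000]
pd = [0.5 * n ** 0.1 for n in cycles]

a_full, b_full = fit_power_law(cycles, pd)               # all 150,000 cycles
a_80k, b_80k = fit_power_law(cycles[:5], pd[:5])         # truncated at 80,000
```

When the truncated fit extrapolated to N = 150,000 agrees with the full fit, as the paper reports for the tested materials, the extra loading cycles add laboratory time without changing the PD prediction.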


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Federico Marini ◽  
Annekathrin Ludt ◽  
Jan Linke ◽  
Konstantin Strauch

Abstract Background The interpretation of results from transcriptome profiling experiments via RNA sequencing (RNA-seq) can be a complex task, where the essential information is distributed among different tabular and list formats: normalized expression values, results from differential expression analysis, and results from functional enrichment analyses. A number of tools and databases are widely used for the purpose of identifying relevant functional patterns, yet often their contextualization within the data and results at hand is not straightforward, especially if these analytic components are not combined together efficiently. Results We developed the GeneTonic software package, which serves as a comprehensive toolkit for streamlining the interpretation of functional enrichment analyses by fully leveraging the information of expression values in a differential expression context. GeneTonic is implemented in R and Shiny, leveraging packages that enable HTML-based interactive visualizations for executing drilldown tasks seamlessly, viewing the data at a level of increased detail. GeneTonic is integrated with the core classes of existing Bioconductor workflows and can accept the output of many widely used tools for pathway analysis, making this approach applicable to a wide range of use cases. Users can effectively navigate interlinked components (otherwise available as flat text or spreadsheet tables), bookmark features of interest during exploration sessions, and obtain at the end a tailored HTML report, thus combining the benefits of both interactivity and reproducibility. Conclusion GeneTonic is distributed as an R package in the Bioconductor project (https://bioconductor.org/packages/GeneTonic/) under the MIT license. Offering both bird’s-eye views of the components of transcriptome data analysis and the detailed inspection of single genes, individual signatures, and their relationships, GeneTonic aims at simplifying the process of interpretation of complex and compelling RNA-seq datasets for many researchers with different expertise profiles.


Author(s):  
Bidossessi Wilfried Hounkpe ◽  
Francine Chenou ◽  
Franciele de Lima ◽  
Erich Vinicius De Paula

Abstract Housekeeping (HK) genes are constitutively expressed genes that are required for the maintenance of basic cellular functions. Despite their importance in the calibration of gene expression, as well as in the understanding of many genomic and evolutionary features, important discrepancies have been observed in studies that previously identified these genes. Here, we present the Housekeeping and Reference Transcript Atlas (HRT Atlas v1.0, www.housekeeping.unicamp.br), a web-based database which addresses some of the previously observed limitations in the identification of these genes and offers a more accurate database of human and mouse HK genes and transcripts. The database was generated by mining massive human and mouse RNA-seq data sets, including 11,281 and 507 high-quality RNA-seq samples from 52 human non-disease tissues/cells and 14 healthy tissues/cells of C57BL/6 wild-type mouse, respectively. Users can visualize the expression and download lists of 2,158 human HK transcripts from 2,176 HK genes and 3,024 mouse HK transcripts from 3,277 mouse HK genes. HRT Atlas also offers the most stable and suitable tissue-selective candidate reference transcripts for normalization of qPCR experiments. Specific primers and predicted modifiers of gene expression for some of these HK transcripts are also proposed. HRT Atlas has also been integrated with regulatory elements from the Epiregio server.


2015 ◽  
Vol 2015 ◽  
pp. 1-10 ◽  
Author(s):  
J. Zyprych-Walczak ◽  
A. Szabelska ◽  
L. Handschuh ◽  
K. Górczak ◽  
K. Klamecka ◽  
...  

High-throughput sequencing technologies, such as the Illumina HiSeq, are powerful new tools for investigating a wide range of biological and medical problems. The massive and complex data sets produced by these sequencers create a need for the development of statistical and computational methods that can tackle the analysis and management of the data. Data normalization is one of the most crucial steps of data processing, and it must be carefully considered as it has a profound effect on the results of the analysis. In this work, we focus on a comprehensive comparison of five normalization methods related to sequencing depth, widely used for transcriptome sequencing (RNA-seq) data, and their impact on the results of gene expression analysis. Based on this study, we suggest a universal workflow that can be applied for the selection of the optimal normalization procedure for any particular data set. The described workflow includes calculation of the bias and variance values for the control genes, the sensitivity and specificity of the methods, and classification errors, as well as generation of diagnostic plots. Combining the above information facilitates the selection of the most appropriate normalization method for the studied data sets and determines which methods can be used interchangeably.
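The first step of the workflow, bias and variance of control genes, rests on the fact that a control (spike-in or housekeeping) gene should sit at a known, constant level across samples after a good normalization. A minimal sketch of those two summaries (illustrative, not the authors' exact formulas):

```python
def control_gene_bias_variance(norm_expr, true_level):
    """norm_expr: normalized expression of one control gene across samples.
    Bias = mean deviation from the known level; variance = spread across
    samples. Both should be near zero for a well-performing method."""
    n = len(norm_expr)
    mean = sum(norm_expr) / n
    bias = mean - true_level
    variance = sum((x - mean) ** 2 for x in norm_expr) / n
    return bias, variance

# Hypothetical normalized values of one control gene (true level 100) under
# two normalization methods:
method_a = [100.1, 99.8, 100.3, 99.9]   # close to the true level, stable
method_b = [120.0, 80.0, 130.0, 70.0]   # unbiased on average but unstable
bias_a, var_a = control_gene_bias_variance(method_a, 100.0)
bias_b, var_b = control_gene_bias_variance(method_b, 100.0)
```

Method B illustrates why both summaries are needed: its bias is zero, so only the variance reveals that it distorts individual samples.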


BMC Genomics ◽  
2020 ◽  
Vol 21 (S11) ◽  
Author(s):  
Yingying Cao ◽  
Simo Kitanovski ◽  
Daniel Hoffmann

Abstract Background RNA-Seq, the high-throughput sequencing (HT-Seq) of mRNAs, has become an essential tool for characterizing gene expression differences between different cell types and conditions. Gene expression is regulated by several mechanisms, including epigenetically by post-translational histone modifications, which can be assessed by ChIP-Seq (Chromatin Immuno-Precipitation Sequencing). As more and more biological samples are analyzed by the combination of ChIP-Seq and RNA-Seq, the integrated analysis of the corresponding data sets becomes, theoretically, a unique option to study gene regulation. However, technically such analyses are still in their infancy. Results Here we introduce intePareto, a computational tool for the integrative analysis of RNA-Seq and ChIP-Seq data. With intePareto we match RNA-Seq and ChIP-Seq data at the level of genes, perform differential expression analysis between biological conditions, and prioritize genes with consistent changes in RNA-Seq and ChIP-Seq data using Pareto optimization. Conclusion intePareto facilitates comprehensive understanding of high-dimensional transcriptomic and epigenomic data. Its superiority to a naive differential gene expression analysis with RNA-Seq alone and to an available integrative approach is demonstrated by analyzing a public dataset.
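The prioritization step can be illustrated by computing the Pareto front over two objectives maximized jointly, here an RNA-Seq log-fold-change and a ChIP-Seq log-fold-change for an activating histone mark. This sketches the principle, not intePareto's actual implementation, and the gene values are invented:

```python
def pareto_front(genes):
    """genes: {name: (rna_lfc, chip_lfc)}. A gene is on the front if no
    other gene is >= in both objectives and > in at least one."""
    front = []
    for g, (r, c) in genes.items():
        dominated = any(r2 >= r and c2 >= c and (r2 > r or c2 > c)
                        for g2, (r2, c2) in genes.items() if g2 != g)
        if not dominated:
            front.append(g)
    return sorted(front)

genes = {
    "geneA": (2.0, 1.5),   # up in both data types: consistent change
    "geneB": (1.8, -0.5),  # discordant: expression up, mark down
    "geneC": (0.1, 2.0),   # strong chromatin change, weak expression change
    "geneD": (1.0, 1.0),   # consistent but dominated by geneA
}
front = pareto_front(genes)
```

Genes on the front (and the fronts behind it, in successive non-dominated layers) are ranked ahead of genes like geneD, whose changes are consistent but weaker in both data types.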


2020 ◽  
Author(s):  
Ben. G. Weinstein ◽  
Sarah J. Graves ◽  
Sergio Marconi ◽  
Aditya Singh ◽  
Alina Zare ◽  
...  

Abstract Broad-scale remote sensing promises to build forest inventories at unprecedented scales. A crucial step in this process is designing individual tree segmentation algorithms to associate pixels into delineated tree crowns. While dozens of tree delineation algorithms have been proposed, their performance is typically not compared based on standard data or evaluation metrics, making it difficult to understand which algorithms perform best under what circumstances. There is a need for an open evaluation benchmark to minimize differences in reported results due to data quality, forest type and evaluation metrics, and to support evaluation of algorithms across a broad range of forest types. Combining RGB, LiDAR and hyperspectral sensor data from the National Ecological Observatory Network’s Airborne Observation Platform with multiple types of evaluation data, we created a novel benchmark dataset to assess individual tree delineation methods. This benchmark dataset includes an R package to standardize evaluation metrics and simplify comparisons between methods. The benchmark dataset contains over 6,000 image-annotated crowns, 424 field-annotated crowns, and 3,777 overstory stem points from a wide range of forest types. In addition, we include over 10,000 training crowns for optional use. We discuss the different evaluation sources and assess the accuracy of the image-annotated crowns by comparing annotations among multiple annotators as well as to overlapping field-annotated crowns. We provide an example submission and score for an open-source baseline for future methods.
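Benchmark scoring for crown delineation typically rests on the overlap, intersection over union (IoU), between a predicted and an annotated crown, with a prediction counted as a match above some threshold. A minimal sketch using axis-aligned bounding boxes (the benchmark's R package operates on spatial crown polygons; boxes keep the example self-contained):

```python
def iou(box_a, box_b):
    """Boxes as (xmin, ymin, xmax, ymax); returns intersection over union."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A predicted crown overlapping the corner of an annotated crown:
predicted = (0.0, 0.0, 10.0, 10.0)
annotated = (5.0, 5.0, 15.0, 15.0)
score = iou(predicted, annotated)   # 25 / (100 + 100 - 25)
```

Standardizing this computation in a shared package is what lets scores from different delineation methods be compared directly, which is the stated purpose of the benchmark.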

