cdev: a ground-truth based measure to evaluate RNA-seq normalization performance

PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e12233
Author(s):  
Diem-Trang Tran ◽  
Matthew Might

Normalization of RNA-seq data has been an active area of research since the problem was first recognized a decade ago. Despite the active development of new normalizers, their performance measures have been given little attention. To evaluate normalizers, researchers have been relying on ad hoc measures, most of which are either qualitative, potentially biased, or easily confounded by parametric choices of downstream analysis. We propose a metric called condition-number based deviation, or cdev, to quantify normalization success. cdev measures how much an expression matrix differs from another. If a ground truth normalization is given, cdev can then be used to evaluate the performance of normalizers. To establish experimental ground truth, we compiled an extensive set of public RNA-seq assays with external spike-ins. This data collection, together with cdev, provides a valuable toolset for benchmarking new and existing normalization methods.

2019 ◽  
Vol 36 (7) ◽  
pp. 2288-2290 ◽  
Author(s):  
Shian Su ◽  
Luyi Tian ◽  
Xueyi Dong ◽  
Peter F Hickey ◽  
Saskia Freytag ◽  
...  

Abstract Motivation: Bioinformatic analysis of single-cell gene expression data is a rapidly evolving field. Hundreds of bespoke methods have been developed in the past few years to deal with various aspects of single-cell analysis and consensus on the most appropriate methods to use under different settings is still emerging. Benchmarking the many methods is therefore of critical importance and since analysis of single-cell data usually involves multi-step pipelines, effective evaluation of pipelines involving different combinations of methods is required. Current benchmarks of single-cell methods are mostly implemented with ad-hoc code that is often difficult to reproduce or extend, and exhaustive manual coding of many combinations is infeasible in most instances. Therefore, new software is needed to manage pipeline benchmarking. Results: The CellBench R software facilitates method comparisons in either a task-centric or combinatorial way to allow pipelines of methods to be evaluated in an effective manner. CellBench automatically runs combinations of methods, provides facilities for measuring running time and delivers output in tabular form which is highly compatible with tidyverse R packages for summary and visualization. Our software has enabled comprehensive benchmarking of single-cell RNA-seq normalization, imputation, clustering, trajectory analysis and data integration methods using various performance metrics obtained from data with available ground truth. CellBench is also amenable to benchmarking other bioinformatics analysis tasks. Availability and implementation: Available from https://bioconductor.org/packages/CellBench.
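The combinatorial benchmarking idea described in this abstract (run every combination of methods, time each run, return a table) can be illustrated with a minimal Python sketch. This is not the CellBench API, which is an R package; the function `run_combinations`, the toy dataset, and the toy "normalizers" and "summaries" are all hypothetical names introduced only for illustration.

```python
import itertools
import time


def run_combinations(datasets, steps):
    """Apply every combination of one method per step to every dataset.

    datasets: dict mapping a dataset name to its data
    steps: list of dicts, each mapping a method name to a callable;
           the steps are applied in list order
    Returns a list of row dicts, i.e. a simple table of results.
    """
    rows = []
    for data_name, data in datasets.items():
        # Cartesian product: one (name, callable) pair chosen from each step.
        for combo in itertools.product(*(step.items() for step in steps)):
            start = time.perf_counter()
            result = data
            for _, method in combo:
                result = method(result)
            elapsed = time.perf_counter() - start
            rows.append({
                "data": data_name,
                "methods": tuple(name for name, _ in combo),
                "result": result,
                "seconds": elapsed,
            })
    return rows


# Toy inputs (hypothetical stand-ins for real normalization/analysis methods).
datasets = {"toy": [1.0, 2.0, 4.0]}
normalizers = {
    "identity": lambda xs: list(xs),
    "scale_to_sum": lambda xs: [x / sum(xs) for x in xs],
}
summaries = {"max": max, "min": min}

table = run_combinations(datasets, [normalizers, summaries])
for row in table:
    print(row["data"], row["methods"], row["result"])
```

Each row records the dataset, the chosen method names, the final result, and the elapsed time, so the output can be loaded into a data frame for summary; with two methods in each of two steps and one dataset, the table has four rows.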


2018 ◽  
Vol 21 (2) ◽  
pp. 395-407 ◽  
Author(s):  
Tony C Y Kuo ◽  
Masaomi Hatakeyama ◽  
Toshiaki Tameshige ◽  
Kentaro K Shimizu ◽  
Jun Sese

Abstract Genome duplication with hybridization, or allopolyploidization, occurs in animals, fungi and plants, and is especially common in crop plants. There is an increasing interest in the study of allopolyploids because of advances in polyploid genome assembly; however, the high level of sequence similarity in duplicated gene copies (homeologs) poses many challenges. Here we compared standard RNA-seq expression quantification approaches used currently for diploid species against subgenome-classification approaches, which map reads to each subgenome separately. We examined mapping error using our previous and new RNA-seq data in which a subgenome is experimentally added (synthetic allotetraploid Arabidopsis kamchatica) or reduced (allohexaploid wheat Triticum aestivum versus extracted allotetraploid) as ground truth. The error rates in the two species were very similar. The standard approaches showed higher error rates (>10% using pseudo-alignment with Kallisto) while subgenome-classification approaches showed much lower error rates (<1% using EAGLE-RC, <2% using HomeoRoq). Although downstream analysis may partly mitigate mapping errors, the difference in methods was substantial in hexaploid wheat, where Kallisto appeared to have systematic differences relative to other methods. Only approximately half of the differentially expressed homeologs detected using Kallisto overlapped with those by any other method in wheat. In general, disagreement in low-expression genes was responsible for most of the discordance between methods, which is consistent with known biases in Kallisto. We also observed that there exist uncertainties in genome sequences and annotation which can affect each method differently. Overall, subgenome-classification approaches tend to perform better than standard approaches, with EAGLE-RC having the highest precision.


2018 ◽  
Author(s):  
Tony Kuo ◽  
Masaomi Hatakeyama ◽  
Toshiaki Tameshige ◽  
Kentaro K. Shimizu ◽  
Jun Sese

Abstract Genome duplication with hybridization, or allopolyploidization, occurs in animals, fungi, and plants, and is especially common in crop plants. There is increasing interest in the study of allopolyploids due to advances in polyploid genome assembly; however, the high level of sequence similarity in duplicated gene copies (homeologs) poses many challenges. Here we compared standard RNA-seq expression quantification approaches used currently for diploid species against subgenome-classification approaches, which map reads to each subgenome separately. We examined mapping error using our previous and new RNA-seq data in which a subgenome is experimentally added (synthetic allotetraploid Arabidopsis kamchatica) or reduced (allohexaploid wheat Triticum aestivum versus extracted allotetraploid) as ground truth. The error rates in the two species were very similar. The standard approaches showed higher error rates (>10% using pseudo-alignment with Kallisto) while subgenome-classification approaches showed much lower error rates (<1% using EAGLE-RC, <2% using HomeoRoq). Although downstream analysis may partly mitigate mapping errors, the difference in methods was substantial in hexaploid wheat, where Kallisto appeared to have systematic differences relative to other methods. Only approximately half of the differentially expressed homeologs detected using Kallisto overlapped with those by any other method. In general, disagreement in low-expression genes was responsible for most of the discordance between methods, which is consistent with known biases in Kallisto. We also observed that there exist uncertainties in genome sequences and annotation which can affect each method differently. Overall, subgenome-classification approaches tend to perform better than standard approaches, with EAGLE-RC having the highest precision.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Yue You ◽  
Luyi Tian ◽  
Shian Su ◽  
Xueyi Dong ◽  
Jafar S. Jabbari ◽  
...  

Abstract Background: Single-cell RNA-sequencing (scRNA-seq) technologies and associated analysis methods have rapidly developed in recent years. This includes preprocessing methods, which assign sequencing reads to genes to create count matrices for downstream analysis. While several packaged preprocessing workflows have been developed to provide users with convenient tools for handling this process, how they compare to one another and how they influence downstream analysis have not been well studied. Results: Here, we systematically benchmark the performance of 10 end-to-end preprocessing workflows (Cell Ranger, Optimus, salmon alevin, alevin-fry, kallisto bustools, dropSeqPipe, scPipe, zUMIs, celseq2, and scruff) using datasets yielding different biological complexity levels generated by CEL-Seq2 and 10x Chromium platforms. We compare these workflows in terms of their quantification properties directly and their impact on normalization and clustering by evaluating the performance of different method combinations. While the scRNA-seq preprocessing workflows compared vary in their detection and quantification of genes across datasets, after downstream analysis with performant normalization and clustering methods, almost all combinations produce clustering results that agree well with the known cell type labels that provided the ground truth in our analysis. Conclusions: In summary, the choice of preprocessing method was found to be less important than other steps in the scRNA-seq analysis process. Our study comprehensively compares common scRNA-seq preprocessing workflows and summarizes their characteristics to guide workflow users.


2021 ◽  
Author(s):  
Madalina Ciortan ◽  
Matthieu Defrance

Single-cell RNA sequencing (scRNA-seq) produces transcriptomic profiling for individual cells. Due to the lack of cell-class annotations, scRNA-seq is routinely analyzed with unsupervised clustering methods. Because these methods are typically limited to producing clustering predictions (that is, assignment of cells to clusters of similar cells), numerous model-agnostic differential expression (DE) libraries have been proposed to identify the genes expressed differently in the detected clusters, as needed in the downstream analysis. In parallel, the advancements in neural networks (NN) brought several model-specific explainability methods to identify salient features based on gradients, eliminating the need for external models. We propose a comprehensive study to compare the performance of dedicated DE methods with that of explainability methods typically used in machine learning, both model-agnostic (such as SHAP, permutation importance) and model-specific (such as NN gradient-based methods). The DE analysis is performed on the results of 3 state-of-the-art clustering methods based on NNs. Our results on 36 simulated datasets indicate that all analyzed DE methods have limited agreement between them and with ground-truth genes. The gradients method outperforms the traditional DE methods, which encourages the development of NN-based clustering methods to provide an out-of-the-box DE capability. Employing DE methods on the input data preprocessed by the clustering method outperforms the traditional approach of using the original count data, albeit still performing worse than gradient-based methods.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Matthew Chung ◽  
Vincent M. Bruno ◽  
David A. Rasko ◽  
Christina A. Cuomo ◽  
José F. Muñoz ◽  
...  

Abstract Advances in transcriptome sequencing allow for simultaneous interrogation of differentially expressed genes from multiple species originating from a single RNA sample, termed dual or multi-species transcriptomics. Compared to single-species differential expression analysis, the design of multi-species differential expression experiments must account for the relative abundances of each organism of interest within the sample, often requiring enrichment methods and yielding differences in total read counts across samples. The analysis of multi-species transcriptomics datasets requires modifications to the alignment, quantification, and downstream analysis steps compared to the single-species analysis pipelines. We describe best practices for multi-species transcriptomics and differential gene expression.


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Li Tong ◽  
Po-Yen Wu ◽  
John H. Phan ◽  
Hamid R. Hassazadeh ◽  
...  

Abstract To use next-generation sequencing technology such as RNA-seq for medical and health applications, choosing proper analysis methods for biomarker identification remains a critical challenge for most users. The US Food and Drug Administration (FDA) has led the Sequencing Quality Control (SEQC) project to conduct a comprehensive investigation of 278 representative RNA-seq data analysis pipelines consisting of 13 sequence mapping, three quantification, and seven normalization methods. In this article, we focused on the impact of the joint effects of RNA-seq pipelines on gene expression estimation as well as the downstream prediction of disease outcomes. First, we developed and applied three metrics (i.e., accuracy, precision, and reliability) to quantitatively evaluate each pipeline’s performance on gene expression estimation. We then investigated the correlation between the proposed metrics and the downstream prediction performance using two real-world cancer datasets (i.e., SEQC neuroblastoma dataset and the NIH/NCI TCGA lung adenocarcinoma dataset). We found that RNA-seq pipeline components jointly and significantly impacted the accuracy of gene expression estimation, and its impact was extended to the downstream prediction of these cancer outcomes. Specifically, RNA-seq pipelines that produced more accurate, precise, and reliable gene expression estimation tended to perform better in the prediction of disease outcome. In the end, we provided scenarios as guidelines for users to use these three metrics to select sensible RNA-seq pipelines for the improved accuracy, precision, and reliability of gene expression estimation, which lead to the improved downstream gene expression-based prediction of disease outcome.


Sensors ◽  
2018 ◽  
Vol 18 (10) ◽  
pp. 3588 ◽  
Author(s):  
Lien-Wu Chen ◽  
Yu-Hao Peng ◽  
Yu-Chee Tseng ◽  
Ming-Fong Tsai

Mobile ad hoc networks (MANETs) have attracted considerable interest in research communities because of their infrastructure-less, self-organizing nature. A MANET with fleet cyclists using smartphones forms a two-tier mobile long-thin network (MLTN) along a common cycling route, where the high-tier network is composed of 3G/LTE interfaces and the low-tier network is composed of IEEE 802.11 interfaces. The low-tier network may consist of several path-like networks. This work investigates cooperative sensing data collection and distribution with packet collision avoidance in a two-tier MLTN. As many cyclists upload their sensing data and download global fleet information frequently, serious bandwidth and latency problems may result if all members rely on their high-tier interfaces. We designed and analyzed a cooperative framework consisting of a distributed grouping mechanism, a group merging and splitting method, and a sensing data aggregation scheme. Through cooperation between the two tiers, the proposed framework outperforms existing works by significantly reducing the 3G/LTE data transmission and the number of 3G/LTE connections.


2020 ◽  
Author(s):  
Snehalika Lall ◽  
Abhik Ghosh ◽  
Sumanta Ray ◽  
Sanghamitra Bandyopadhyay

Abstract Many single-cell typing methods require pure clustering of cells, which is susceptible to technical noise and heavily dependent on high-quality informative genes selected in the preliminary steps of downstream analysis. Techniques for gene selection in single-cell RNA sequencing (scRNA-seq) data are seemingly simple, which causes problems for the resolution of (sub-)type detection and marker selection, and ultimately affects cell annotation. We introduce sc-REnF, a novel and robust entropy-based feature (gene) selection method, which leverages the landmark advantage of ‘Renyi’ and ‘Tsallis’ entropy achieved in their original application, in single-cell clustering. Gene selection is thereby robust and less sensitive to the technical noise present in the data, producing a pure clustering of cells, in addition to classifying independent and unknown samples with utmost accuracy. The corresponding software is available at: https://github.com/Snehalikalall/sc-REnF


1994 ◽  
Vol 37 (5) ◽  
Author(s):  
A. M. Dziewonski

The origins of the Federation of Digital Seismograph Networks (FDSN) can be traced to the summer of 1984. At that time, GEOSCOPE - the French global network of broadband instruments - was already well under way, and in the United States, the Incorporated Research Institutions for Seismology (IRIS) had just published its Science Plan for Global Seismographic Network (GSN). There was clearly an opportunity and the need to involve scientists from other countries in planning for the future of global seismology. An ad hoc meeting of some ten West European seismologists had been arranged in August during the annual meeting of the European Geophysical Society in Louvain. This may be considered to signify the beginning of widescale international cooperation, even though this particular group eventually became the nucleus of ORFEUS (Observatories and Research Facilities for EUropean Seismology). Rather than taking an active role in deployment of new stations, it chose to focus on the issue of providing the service for data collection and exchange, with an important mission of developing the requisite software.

