The Impact of Normalization Methods on RNA-Seq Data Analysis

2015 ◽  
Vol 2015 ◽  
pp. 1-10 ◽  
Author(s):  
J. Zyprych-Walczak ◽  
A. Szabelska ◽  
L. Handschuh ◽  
K. Górczak ◽  
K. Klamecka ◽  
...  

High-throughput sequencing technologies, such as the Illumina HiSeq, are powerful new tools for investigating a wide range of biological and medical problems. The massive and complex data sets produced by the sequencers create a need for statistical and computational methods that can tackle their analysis and management. Data normalization is one of the most crucial steps of data processing, and it must be carefully considered as it has a profound effect on the results of the analysis. In this work, we focus on a comprehensive comparison of five normalization methods related to sequencing depth, widely used for transcriptome sequencing (RNA-seq) data, and their impact on the results of gene expression analysis. Based on this study, we suggest a universal workflow that can be applied for the selection of the optimal normalization procedure for any particular data set. The described workflow includes calculation of bias and variance values for the control genes, sensitivity and specificity of the methods, and classification errors, as well as generation of diagnostic plots. Combining the above information facilitates the selection of the most appropriate normalization method for the studied data sets and determines which methods can be used interchangeably.
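Two of the depth-related normalizations in this family can be illustrated with a minimal sketch: counts-per-million scaling and a DESeq-style median-of-ratios size factor. The counts below are invented for illustration; this is not the authors' code.

```python
import math

# Hedged sketch of two depth-related normalizations; counts are invented.
counts = {
    "sampleA": [100, 200, 300, 400],
    "sampleB": [200, 400, 600, 800],  # same composition at double the depth
}

def cpm(sample_counts):
    """Counts per million: scale each gene by the sample's library size."""
    lib_size = sum(sample_counts)
    return [c / lib_size * 1e6 for c in sample_counts]

def median_of_ratios_factors(samples):
    """DESeq-style size factors: the median ratio of each sample's counts
    to the per-gene geometric mean across samples."""
    n_genes = len(next(iter(samples.values())))
    geo_means = [
        math.exp(sum(math.log(s[g]) for s in samples.values()) / len(samples))
        for g in range(n_genes)
    ]
    factors = {}
    for name, s in samples.items():
        ratios = sorted(s[g] / geo_means[g] for g in range(n_genes))
        mid = len(ratios) // 2
        factors[name] = (ratios[mid] if len(ratios) % 2
                         else (ratios[mid - 1] + ratios[mid]) / 2)
    return factors
```

Since sampleB is sampleA at double the depth, both methods should agree that the two samples differ only by a factor of two.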

Author(s):  
Marcel Bengs ◽  
Finn Behrendt ◽  
Julia Krüger ◽  
Roland Opfer ◽  
Alexander Schlaefer

Abstract Purpose: Brain magnetic resonance images (MRIs) are essential for the diagnosis of neurological diseases. Recently, deep learning methods for unsupervised anomaly detection (UAD) have been proposed for the analysis of brain MRI. These methods rely on healthy brain MRIs and, compared to supervised deep learning, eliminate the requirement for pixel-wise annotated data. While a wide range of methods for UAD have been proposed, they are mostly 2D and learn only from MRI slices, disregarding that brain lesions are inherently 3D, so the spatial context of MRI volumes remains unexploited. Methods: We investigate whether increased spatial context, obtained by using MRI volumes combined with spatial erasing, leads to improved unsupervised anomaly segmentation performance compared to learning from slices. We evaluate and compare 2D variational autoencoders (VAEs) to their 3D counterparts, propose 3D input erasing, and systematically study the impact of data set size on performance. Results: Using two publicly available segmentation data sets for evaluation, 3D VAEs outperform their 2D counterparts, highlighting the advantage of volumetric context. Our 3D erasing methods allow for further performance improvements. Our best-performing 3D VAE with input erasing achieves an average DICE score of 31.40% compared to 25.76% for the 2D VAE. Conclusions: We propose 3D deep learning methods for UAD in brain MRI combined with 3D erasing and demonstrate that 3D methods clearly outperform their 2D counterparts for anomaly segmentation. Our spatial erasing method allows for further performance improvements and reduces the requirement for large data sets.
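The 3D input-erasing idea can be sketched as a simple augmentation that blanks a random cuboid of the input volume. This is a hedged sketch: the function name, the cuboid-fraction parameterization, and the fill value are illustrative assumptions, not the paper's implementation.

```python
import random

def erase_3d(volume, frac=0.25, fill=0.0, rng=None):
    """Zero out a random cuboid covering `frac` of each axis of a
    [D][H][W] nested-list volume. Illustrative sketch only; parameter
    names and defaults are assumptions, not the paper's."""
    rng = rng or random.Random(0)
    d, h, w = len(volume), len(volume[0]), len(volume[0][0])
    ed = max(1, int(d * frac))
    eh = max(1, int(h * frac))
    ew = max(1, int(w * frac))
    z0 = rng.randrange(d - ed + 1)
    y0 = rng.randrange(h - eh + 1)
    x0 = rng.randrange(w - ew + 1)
    out = [[row[:] for row in sl] for sl in volume]  # deep copy, input untouched
    for z in range(z0, z0 + ed):
        for y in range(y0, y0 + eh):
            for x in range(x0, x0 + ew):
                out[z][y][x] = fill
    return out
```

During training, the erased volume is fed to the VAE while the reconstruction loss targets the original, forcing the model to use surrounding 3D context.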


PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e5199
Author(s):  
Wanli Zhang ◽  
Yanming Di

The accumulation of RNA sequencing (RNA-Seq) gene expression data in recent years has resulted in large and complex data sets of high dimension. Exploratory analysis, including data mining and visualization, reveals hidden patterns and potential outliers in such data, but is often challenged by the high-dimensional nature of the data. The scatterplot matrix is a commonly used tool for visualizing multivariate data and allows us to view multiple bivariate relationships simultaneously. However, the scatterplot matrix becomes less effective for high-dimensional data because the number of bivariate displays grows quadratically with data dimensionality. In this study, we introduce a selection criterion for each bivariate scatterplot and design and implement an algorithm that automatically scans and ranks all possible scatterplots, with the goal of identifying the plots in which separation between two pre-defined groups is maximized. By applying our method to a multi-experiment Arabidopsis RNA-Seq data set, we were able to successfully pinpoint the visualization angles where genes from two biological pathways are the most separated, as well as identify potential outliers.
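The scan-and-rank idea can be sketched as follows, using a toy separation criterion (per-axis distance between group means over the pooled spread); the paper's actual selection criterion is not reproduced here.

```python
from itertools import combinations
from statistics import mean, pstdev

def separation_score(points_a, points_b, i, j):
    """Toy criterion for the (i, j) scatterplot: per-axis distance between
    group means divided by the pooled spread, summed over the two axes.
    A stand-in for the paper's criterion, which is not reproduced here."""
    score = 0.0
    for k in (i, j):
        a = [p[k] for p in points_a]
        b = [p[k] for p in points_b]
        spread = pstdev(a + b) or 1.0  # guard against zero spread
        score += abs(mean(a) - mean(b)) / spread
    return score

def rank_scatterplots(points_a, points_b):
    """Scan every dimension pair and rank pairs by decreasing separation."""
    dims = range(len(points_a[0]))
    return sorted(combinations(dims, 2),
                  key=lambda p: separation_score(points_a, points_b, *p),
                  reverse=True)
```

With d dimensions the scan covers all d(d-1)/2 bivariate displays, which is exactly the quadratic growth that makes a full scatterplot matrix hard to inspect by eye.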


2016 ◽  
Author(s):  
Joseph N. Paulson ◽  
Cho-Yi Chen ◽  
Camila M. Lopes-Ramos ◽  
Marieke L Kuijjer ◽  
John Platig ◽  
...  

Abstract Although ultrahigh-throughput RNA-sequencing has become the dominant technology for genome-wide transcriptional profiling, the vast majority of RNA-Seq studies typically profile only tens of samples, and most analytical pipelines are optimized for these smaller studies. However, projects are generating ever-larger data sets comprising RNA-Seq data from hundreds or thousands of samples, often collected at multiple centers and from diverse tissues. These complex data sets present significant analytical challenges due to batch and tissue effects, but provide the opportunity to revisit the assumptions and methods that we use to preprocess, normalize, and filter RNA-Seq data, critical first steps for any subsequent analysis. We find that analysis of large RNA-Seq data sets requires both careful quality control and accounting for sparsity due to the heterogeneity intrinsic to multi-group studies. An R package instantiating our method for large-scale RNA-Seq normalization and preprocessing, YARN, is available at bioconductor.org/packages/yarn.
Highlights
- Overview of assumptions used in preprocessing and normalization
- Pipeline for preprocessing, quality control, and normalization of large heterogeneous data
- A Bioconductor package for the YARN pipeline and easy manipulation of count data
- Preprocessed GTEx data set, produced with the YARN pipeline, available as a resource
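As one example of the normalization steps such pipelines revisit, quantile normalization forces all samples to share a common empirical distribution. The sketch below is a minimal generic version, not YARN's actual implementation, which applies further group-aware refinements.

```python
def quantile_normalize(samples):
    """Generic quantile normalization sketch: rank each sample's values,
    then replace each rank with the mean of the values holding that rank
    across samples. `samples` maps names to equal-length vectors."""
    names = list(samples)
    n = len(samples[names[0]])
    # Per-sample orderings (indices sorted by value).
    order = {s: sorted(range(n), key=samples[s].__getitem__) for s in names}
    # Mean value at each rank across samples.
    rank_means = [
        sum(samples[s][order[s][r]] for s in names) / len(names)
        for r in range(n)
    ]
    out = {}
    for s in names:
        vec = [0.0] * n
        for r, idx in enumerate(order[s]):
            vec[idx] = rank_means[r]
        out[s] = vec
    return out
```

After normalization, every sample has the same sorted distribution; only the gene-to-rank assignment differs between samples.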


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Yahya Albalawi ◽  
Jim Buckley ◽  
Nikola S. Nikolov

Abstract This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processing techniques applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB, and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another for testing the generality of our models only. Our results point to the conclusion that only four of the 26 pre-processing techniques improve classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to BLSTM led to the most accurate model, with an F1 score of 75.2% and accuracy of 90.7%; Mazajak CBOW with the same architecture achieved a higher F1 score of 90.8% but a lower accuracy of 70.89%. Our results also show that the performance of the best traditional classifier we trained is comparable to that of the deep learning methods on the first data set, but significantly worse on the second data set.
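A few of the kinds of pre-processing steps commonly evaluated for Arabic text can be sketched as follows. These are illustrative examples only; the paper's 26 techniques are not reproduced here.

```python
import re

# Illustrative Arabic pre-processing steps (not the paper's full list).
DIACRITICS = re.compile(r"[\u064B-\u0652]")   # tashkeel marks
URLS = re.compile(r"https?://\S+")

def preprocess(tweet):
    t = URLS.sub(" ", tweet)          # strip URLs
    t = DIACRITICS.sub("", t)         # strip diacritics
    t = re.sub("[إأآ]", "ا", t)       # normalize alef variants
    t = re.sub("ى", "ي", t)           # alef maqsura -> ya
    t = re.sub("ة", "ه", t)           # ta marbuta -> ha
    return " ".join(t.split())        # collapse whitespace
```

Each such step collapses surface variants of the same word into one token, which shrinks the vocabulary the classifier (or embedding lookup) must handle.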


2021 ◽  
Vol 11 (2) ◽  
pp. 466
Author(s):  
Włodzimierz Kęska ◽  
Jacek Marcinkiewicz ◽  
Łukasz Gierz ◽  
Żaneta Staszak ◽  
Jarosław Selech ◽  
...  

The continuous development of computer technology has made it applicable in many scientific fields, including research into a wide range of processes in agricultural machines. It allows the simulation of very complex physical phenomena, including grain motion. The relatively recently developed discrete element method (DEM) is used for this purpose. It involves direct integration of the equations of motion of a grain system under the action of various forces, the most important of which are contact forces. The method’s accuracy depends mainly on precisely developed mathematical models of contacts. The creation of such models requires empirical validation: an experiment that records the course of contact forces at the moment of grain impact. To achieve this, specialised test stations equipped with force and speed sensors were developed. The correct selection of testing equipment and interpretation of results play a decisive role in this type of research. This paper focuses on evaluating the influence of a force sensor’s dynamic properties on the measurement accuracy of the course of plant grain impact forces against a stiff surface. The issue was examined using computer simulation. Proprietary computer software, comprising a main calculation module and data-input procedures and presenting results in graphic form, was used for the calculations. From the simulation, graphs of the contact force and of the force signal from the sensor were obtained. These clearly demonstrate the importance of correctly selecting the parameters of the sensors used in such tests, which should be characterised by a high resonance frequency.
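The role of the sensor's resonance frequency can be sketched with a toy simulation: a half-sine impact force fed through a second-order (mass-spring-damper) sensor model. All parameters below are illustrative assumptions, not the paper's.

```python
import math

def sensor_response(force, dt, f_res, zeta=0.02):
    """Second-order sensor model x'' = wn^2 (f - x) - 2*zeta*wn*x',
    integrated with semi-implicit Euler. `f_res` is the sensor's
    resonance frequency in Hz; values are illustrative."""
    wn = 2 * math.pi * f_res
    x = v = 0.0
    out = []
    for f in force:
        a = wn * wn * (f - x) - 2 * zeta * wn * v
        v += a * dt
        x += v * dt
        out.append(x)  # sensor indication, normalized to input units
    return out

# A half-sine impact of 1 ms duration, sampled at 1 MHz for 3 ms.
dt = 1e-6
impact = [math.sin(math.pi * i * dt / 1e-3) if i * dt <= 1e-3 else 0.0
          for i in range(3000)]
good = sensor_response(impact, dt, f_res=100e3)  # resonance far above impact bandwidth
poor = sensor_response(impact, dt, f_res=300.0)  # resonance too low: signal distorted
```

With a resonance frequency well above the impact's spectral content the sensor tracks the contact force closely; with a low resonance the indicated force lags and rings, badly distorting the recorded course of the impact.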


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Yance Feng ◽  
Lei M. Li

Abstract Background: Normalization of RNA-seq data aims at identifying biological expression differentiation between samples by removing the effects of unwanted confounding factors. Explicitly or implicitly, the justification of normalization requires a set of housekeeping genes. However, the existence of housekeeping genes common to a very large collection of samples, especially under a wide range of conditions, is questionable. Results: We propose to carry out pairwise normalization with respect to multiple references, selected from representative samples. The pairwise intermediates are then integrated based on a linear model that adjusts for the reference effects. Motivated by the notion of housekeeping genes and their statistical counterparts, we adopt robust least trimmed squares regression in the pairwise normalization. The proposed method (MUREN) is compared with other existing tools on some standard data sets. Our assessment of the goodness of normalization emphasizes preserving possible asymmetric differentiation, whose biological significance is exemplified by single-cell data of the cell cycle. MUREN is implemented as an R package. The code, under license GPL-3, is available on GitHub (github.com/hippo-yf/MUREN) and on Conda (anaconda.org/hippo-yf/r-muren). Conclusions: MUREN performs RNA-seq normalization using a two-step statistical regression induced from a general principle. We propose using the densities of pairwise differentiations to evaluate the goodness of normalization. MUREN adjusts the mode of differentiation toward zero while preserving the skewness due to biological asymmetric differentiation. Moreover, by robustly integrating pre-normalized counts with respect to multiple references, MUREN is immune to individual outlier samples.
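The pairwise-then-integrate idea can be sketched as follows. This is a simplified stand-in: a trimmed median of log-ratios replaces MUREN's least trimmed squares regression, and a plain mean replaces its linear model for reference effects.

```python
import math
from statistics import median

def pairwise_scale(sample, reference, trim=0.25):
    """Scale `sample` toward `reference` using the median of the central
    (trimmed) log-ratios, so differentially expressed outlier genes do
    not drag the scale. A robust-regression stand-in, not MUREN's code."""
    ratios = sorted(math.log(s / r)
                    for s, r in zip(sample, reference) if s > 0 and r > 0)
    k = int(len(ratios) * trim)
    core = ratios[k: len(ratios) - k] or ratios  # trim both tails
    shift = median(core)
    return [s / math.exp(shift) for s in sample]

def normalize_multi_reference(sample, references):
    """Integrate the pairwise intermediates across several references
    (MUREN fits a linear model adjusting reference effects; a plain
    mean is used here purely for illustration)."""
    inter = [pairwise_scale(sample, ref) for ref in references]
    return [sum(v[i] for v in inter) / len(inter) for i in range(len(sample))]
```

In the test below, one gene is strongly up-regulated; the trimmed median ignores it, so the other genes are mapped exactly onto the reference while the asymmetric differentiation of the outlier gene is preserved.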


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Li Tong ◽  
Po-Yen Wu ◽  
John H. Phan ◽  
Hamid R. Hassazadeh ◽  
...  

Abstract To use next-generation sequencing technology such as RNA-seq for medical and health applications, choosing proper analysis methods for biomarker identification remains a critical challenge for most users. The US Food and Drug Administration (FDA) has led the Sequencing Quality Control (SEQC) project to conduct a comprehensive investigation of 278 representative RNA-seq data analysis pipelines consisting of 13 sequence mapping, three quantification, and seven normalization methods. In this article, we focus on the joint effects of RNA-seq pipeline components on gene expression estimation as well as on the downstream prediction of disease outcomes. First, we developed and applied three metrics (i.e., accuracy, precision, and reliability) to quantitatively evaluate each pipeline’s performance on gene expression estimation. We then investigated the correlation between the proposed metrics and the downstream prediction performance using two real-world cancer data sets (i.e., the SEQC neuroblastoma data set and the NIH/NCI TCGA lung adenocarcinoma data set). We found that RNA-seq pipeline components jointly and significantly impacted the accuracy of gene expression estimation, and this impact extended to the downstream prediction of cancer outcomes. Specifically, RNA-seq pipelines that produced more accurate, precise, and reliable gene expression estimates tended to perform better in the prediction of disease outcome. Finally, we provide scenarios as guidelines for using these three metrics to select sensible RNA-seq pipelines, improving the accuracy, precision, and reliability of gene expression estimation and, in turn, the downstream expression-based prediction of disease outcome.
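Two of the three metrics can be given plausible formalizations for illustration (the paper's exact definitions may differ): accuracy as the deviation of replicate-averaged estimates from reference values, and precision as variability across technical replicates.

```python
from statistics import mean, pstdev

def accuracy(estimates, truth):
    """Mean absolute deviation of replicate-averaged expression estimates
    from reference ('truth') values, averaged over genes. `estimates` is
    a list of per-gene replicate lists. Illustrative definition only."""
    return mean(abs(mean(reps) - t) for reps, t in zip(estimates, truth))

def precision(estimates):
    """Average within-gene standard deviation across technical replicates
    (lower is more precise). Illustrative definition only."""
    return mean(pstdev(reps) for reps in estimates)
```

A pipeline can be precise but inaccurate (replicates agree on a biased value), which is why both metrics are needed alongside a reliability measure.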


Sensors ◽  
2021 ◽  
Vol 21 (10) ◽  
pp. 3406
Author(s):  
Jie Jiang ◽  
Yin Zou ◽  
Lidong Chen ◽  
Yujie Fang

Precise localization and pose estimation in indoor environments are commonly required in a wide range of applications, including robotics, augmented reality, and navigation and positioning services. Such applications can be addressed via visual-based localization using a pre-built 3D model. The increase in search space associated with large scenes can be overcome by retrieving images in advance and subsequently estimating the pose. The majority of current deep learning-based image retrieval methods require labeled data, which increases data annotation costs and complicates data acquisition. In this paper, we propose an unsupervised hierarchical indoor localization framework that integrates an unsupervised network, a variational autoencoder (VAE), with a visual-based Structure-from-Motion (SfM) approach in order to extract global and local features. During the localization process, global features are applied for image retrieval at the level of the scene map in order to obtain candidate images, and local features are subsequently used to estimate the pose from 2D-3D matches between query and candidate images. Only RGB images are used as the input of the proposed localization system, which is both convenient and challenging. Experimental results reveal that the proposed method can localize images within 0.16 m and 4° on the 7-Scenes data sets, and 32.8% of images within 5 m and 20° on the Baidu data set. Furthermore, our proposed method achieves higher precision compared to advanced methods.
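The retrieval stage of such a hierarchical pipeline can be sketched as ranking database images by similarity of their global features. The feature values and image names below are invented for illustration; they are not from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def retrieve(query_feat, db_feats, k=2):
    """Rank database images by cosine similarity of global features
    (e.g., VAE latents) and return the top-k candidates, which would
    then go to local-feature 2D-3D pose estimation."""
    ranked = sorted(db_feats,
                    key=lambda name: cosine(query_feat, db_feats[name]),
                    reverse=True)
    return ranked[:k]
```

Restricting the expensive 2D-3D matching to the top-k candidates is what keeps the search space manageable in large scenes.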


2015 ◽  
Vol 8 (1) ◽  
pp. 421-434 ◽  
Author(s):  
M. P. Jensen ◽  
T. Toto ◽  
D. Troyan ◽  
P. E. Ciesielski ◽  
D. Holdridge ◽  
...  

Abstract. The Midlatitude Continental Convective Clouds Experiment (MC3E) took place during the spring of 2011, centered in north-central Oklahoma, USA. The main goal of this field campaign was to capture the dynamical and microphysical characteristics of precipitating convective systems in the US Central Plains. A major component of the campaign was a six-site radiosonde array designed to capture the large-scale variability of the atmospheric state, with the intent of deriving model forcing data sets. Over the course of the 46-day MC3E campaign, a total of 1362 radiosondes were launched from the enhanced sonde network. This manuscript provides details on the instrumentation used as part of the sounding array, the data processing activities, including quality checks and humidity bias corrections, and an analysis of the impacts of bias correction and algorithm assumptions on the determination of convective levels and indices. It is found that corrections for known radiosonde humidity biases and assumptions regarding the characteristics of the surface convective parcel result in significant differences in the derived values of convective levels and indices in many soundings. In addition, the impact of including the humidity corrections and quality controls on the thermodynamic profiles used in the derivation of a large-scale model forcing data set is investigated. The results show a significant impact on the derived large-scale vertical velocity field, illustrating the importance of addressing these humidity biases.
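How a humidity bias propagates into a derived convective level can be illustrated with Espy's classic approximation for the lifting condensation level. This is a deliberate simplification; the campaign's processing uses full thermodynamic profiles, not this rule of thumb.

```python
def lcl_height_m(temp_c, dewpoint_c):
    """Espy's approximation for the lifting condensation level (LCL):
    roughly 125 m of height per degree Celsius of dewpoint depression.
    A crude illustration of bias propagation only."""
    return 125.0 * (temp_c - dewpoint_c)

# A 1 degC dry bias in the radiosonde dewpoint raises the estimated LCL:
unbiased = lcl_height_m(30.0, 20.0)        # 10 degC depression
biased = lcl_height_m(30.0, 20.0 - 1.0)    # dry-biased dewpoint
```

Even this toy calculation shows why uncorrected humidity biases shift derived convective levels, and hence the indices and forcing fields built on them.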


Radiocarbon ◽  
2012 ◽  
Vol 54 (3-4) ◽  
pp. 449-474 ◽  
Author(s):  
Sturt W Manning ◽  
Bernd Kromer

The debate over the dating of the Santorini (Thera) volcanic eruption has seen sustained efforts to criticize or challenge the radiocarbon dating of this time horizon. We consider some of the relevant areas of possible movement in the 14C dating, and, in particular, any plausible mechanisms to support as late (most recent) a date as possible. First, we report and analyze data investigating the scale of apparent possible 14C offsets (growing-season related) in the Aegean-Anatolia-east Mediterranean region (excluding the southern Levant and especially pre-modern, pre-dam Egypt, which is a distinct case), and find no evidence for more than very small possible offsets from several cases. This topic is thus not an explanation for current differences in dating in the Aegean and at best provides only a few years of latitude. Second, we consider some aspects of the accuracy and precision of 14C dating with respect to the Santorini case. While the existing data appear robust, we nonetheless speculate that examination of the frequency distribution of the 14C data on short-lived samples from the volcanic destruction level at Akrotiri on Santorini (Thera) may indicate that the average value of the overall data sets is not necessarily the most appropriate 14C age to use for dating this time horizon. We note the recent paper of Soter (2011), which suggests that in such a volcanic context some (small) age increment may be possible from diffuse CO2 emissions (the effect is hypothetical at this stage and has not been observed in the field), and that "if short-lived samples from the same stratigraphic horizon yield a wide range of 14C ages, the lower values may be the least altered by old CO2."
In this context, it might be argued that a substantive “low” grouping of 14C ages observable within the overall 14C data sets on short-lived samples from the Thera volcanic destruction level, centered about 3326–3328 BP, is perhaps more representative of the contemporary atmospheric 14C age (without any volcanic CO2 contamination). This is a subjective argument (since, in statistical terms, the existing studies using the weighted average remain valid) that looks to support as late a date as reasonable from the 14C data. The impact of employing this revised 14C age is discussed. In general, a late 17th century BC date range is found (to remain) most likely even if such a late-dating strategy is followed; a late 17th century BC date range is thus a robust finding from the 14C evidence even allowing for various possible variation factors. However, the possibility of a mid-16th century BC date (within ∼1593–1530 cal BC) is increased when compared against previous analyses if the Santorini data are considered in isolation.
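The weighted average referred to here is the standard inverse-variance pooling of 14C determinations. A minimal sketch follows; the ages and errors below are invented for illustration, not the Akrotiri data.

```python
def weighted_mean_14c(ages_bp, sigmas):
    """Inverse-variance weighted mean of radiocarbon determinations,
    the standard way to pool 14C ages on short-lived samples from a
    single horizon. Returns the pooled age (BP) and its 1-sigma error."""
    weights = [1.0 / s ** 2 for s in sigmas]
    wsum = sum(weights)
    mean_bp = sum(a * w for a, w in zip(ages_bp, weights)) / wsum
    sigma = wsum ** -0.5
    return mean_bp, sigma

# Invented example: three determinations, the tighter ones dominate.
m, s = weighted_mean_14c([3330.0, 3340.0, 3326.0], [10.0, 20.0, 10.0])
```

Because each determination is weighted by 1/sigma^2, a low-lying subgroup of ages (as discussed above) pulls the pooled value down only in proportion to its weight, which is why choosing the subgroup rather than the overall weighted mean is a substantive interpretive decision.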

