Sensitive, reliable, and robust circRNA detection from RNA-seq with CirComPara2

Abstract. Current methods for identifying circular RNAs (circRNAs) suffer from low discovery rates and inconsistent performance across diverse data sets. Therefore, the applied detection algorithm can bias high-throughput study findings by missing relevant circRNAs. Here, we show that our bioinformatics tool CirComPara2 (https://github.com/egaffo/CirComPara2), by combining multiple circRNA detection methods, consistently achieves high recall rates without loss of precision in simulated and diverse real data sets.
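
The idea of combining several detection methods can be illustrated with a minimal sketch. The method names, the back-splice coordinates, and the rule of keeping a circRNA reported by at least `min_methods` tools are illustrative assumptions; the abstract does not describe how CirComPara2 merges calls, so this is not its actual implementation.

```python
from collections import Counter

def combine_calls(calls_by_method, min_methods=2):
    """Return junctions reported by at least `min_methods` methods (hypothetical rule)."""
    support = Counter()
    for junctions in calls_by_method.values():
        support.update(set(junctions))  # each method votes once per junction
    return {j for j, n in support.items() if n >= min_methods}

# Hypothetical back-splice junctions: (chromosome, start, end, strand)
calls = {
    "method_a": [("chr1", 100, 500, "+"), ("chr2", 40, 90, "-")],
    "method_b": [("chr1", 100, 500, "+"), ("chr3", 10, 70, "+")],
    "method_c": [("chr1", 100, 500, "+"), ("chr2", 40, 90, "-")],
}
shared = combine_calls(calls)                 # supported by two or more methods
union = combine_calls(calls, min_methods=1)   # reported by any method
```

Raising `min_methods` trades recall for precision; taking the union maximizes recall but also keeps calls made by a single tool.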

A two-step integrated approach to detect differentially expressed genes in RNA-Seq data

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720016500347 ◽

2016 ◽

Vol 14 (06) ◽

pp. 1650034 ◽

Cited By ~ 1

Author(s):

Naim Al Mahi ◽

Munni Begum

Keyword(s):

Simulation Study ◽

Negative Binomial ◽

Real Data ◽

Integrated Approach ◽

Differentially Expressed ◽

Data Sets ◽

Rna Seq ◽

Software Packages ◽

Treatment Conditions ◽

Integrated Approaches

One of the primary objectives of a ribonucleic acid (RNA) sequencing, or RNA-Seq, experiment is to identify differentially expressed (DE) genes in two or more treatment conditions. It is common practice to assume that all read counts from RNA-Seq data follow an overdispersed (OD) Poisson or negative binomial (NB) distribution, which is sometimes misleading because, within each condition, some genes may have unvarying transcription levels with no overdispersion. In such a case, it is more appropriate to consider two sets of genes: OD and non-overdispersed (NOD). We propose a new two-step integrated approach to distinguish DE genes in RNA-Seq data using standard Poisson and NB models for NOD and OD genes, respectively. The approach is integrated because it can be merged with any other NB-based method for detecting DE genes. We design a simulation study and analyze two real RNA-Seq data sets to evaluate the proposed strategy. We compare the performance of this new method combined with three R software packages, namely edgeR, DESeq2, and DSS, with their default settings. For both the simulated and real data sets, the integrated approaches perform better than or at least as well as the regular methods embedded in these R packages.
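
The first step, separating OD from NOD genes, might be sketched as follows. The abstract does not say which test the authors use, so the variance-to-mean ratio and the cutoff `threshold=1.5` are assumptions chosen purely for illustration, as are the gene names and counts.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Sample variance divided by the mean; close to 1 for Poisson-like counts."""
    m = mean(counts)
    return variance(counts) / m if m > 0 else 0.0

def split_genes(count_table, threshold=1.5):
    """Partition genes into overdispersed (OD) and non-overdispersed (NOD) sets.

    `threshold` is a hypothetical cutoff, not a value from the paper.
    """
    od, nod = [], []
    for gene, counts in count_table.items():
        (od if dispersion_index(counts) > threshold else nod).append(gene)
    return od, nod

# Hypothetical read counts for two genes within one condition
counts = {
    "gene_stable": [10, 11, 9, 10, 10, 11],
    "gene_noisy": [2, 40, 7, 65, 1, 30],
}
od_genes, nod_genes = split_genes(counts)
```

In the second step, NOD genes would be passed to a Poisson model and OD genes to an NB-based method such as those named in the abstract.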

Parallel Network Analysis and Communities Detection (PANC) Pipeline for the Analysis and Visualization of COVID-19 Data

Parallel Processing Letters ◽

10.1142/s0129626421420020 ◽

2021 ◽

pp. 2142002

Author(s):

Giuseppe Agapito ◽

Marianna Milano ◽

Mario Cannataro

Keyword(s):

Network Analysis ◽

Real Data ◽

Detection Algorithm ◽

Data Sets ◽

Data Set ◽

Italian Regions ◽

Initial Dataset ◽

Parallel Network ◽

Community Detection Algorithm ◽

Similarity Matrices

A new coronavirus, causing severe acute respiratory syndrome (COVID-19), emerged in Wuhan, China, in December 2019. The epidemic has rapidly spread across the world, becoming a pandemic that, as of today, has affected more than 70 million people, causing over 2 million deaths. To better understand the evolution of the spread of the COVID-19 pandemic, we developed PANC (Parallel Network Analysis and Communities Detection), a new parallel preprocessing methodology for network-based analysis and communities detection on Italian COVID-19 data. The goal of the methodology is to analyze a set of homogeneous datasets (i.e., COVID-19 data in several regions) using a statistical test to find similar/dissimilar behaviours, map such similarity information onto a graph, and then use a community detection algorithm to visualize and analyze the initial dataset. The methodology includes the following steps: (i) a parallel methodology to build similarity matrices that represent similar or dissimilar regions with respect to data; (ii) an effective workload balancing function to improve performance; (iii) the mapping of similarity matrices into networks where nodes represent Italian regions and edges represent similarity relationships; (iv) the discovery and visualization of communities of regions that show similar behaviour. The methodology is general and can be applied to world-wide data about COVID-19, as well as to all types of data sets in tabular and matrix format. To estimate the scalability with increasing workloads, we analyzed three synthetic COVID-19 datasets with sizes of 90.0 MB, 180.0 MB, and 360.0 MB. Experiments showed that the amount of data that can be analyzed in a given amount of time increases almost linearly with the number of computing resources available. To perform communities detection, we instead employed the real data set.
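
Steps (i), (iii), and (iv) can be sketched sequentially as follows. Everything here is a simplified stand-in: the `similar` function replaces the unspecified statistical test with a mean-absolute-difference check, connected components replace the community detection algorithm, the region names and values are invented, and the parallel construction and workload balancing of steps (i) and (ii) are omitted.

```python
def similar(a, b, tol=0.1):
    """Stand-in for the statistical test: mean absolute difference below `tol`."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a) < tol

def similarity_graph(series):
    """Build an undirected graph with an edge between each pair of similar regions."""
    names = sorted(series)
    graph = {n: set() for n in names}
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            if similar(series[u], series[v]):
                graph[u].add(v)
                graph[v].add(u)
    return graph

def communities(graph):
    """Connected components, used as the simplest possible community structure."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(graph[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

# Hypothetical normalized time series for three regions
data = {
    "region_a": [0.10, 0.20, 0.30],
    "region_b": [0.12, 0.22, 0.28],
    "region_c": [0.90, 0.80, 0.95],
}
groups = communities(similarity_graph(data))
```

Because each pairwise comparison is independent, the double loop in `similarity_graph` is the natural place to distribute work across computing resources.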

FDJD: RNA-Seq Based Fusion Transcript Detection Using Jaccard Distance

10.1101/2021.11.17.469019 ◽

2021 ◽

Author(s):

Hamid Reza Mohebbi ◽

Nurit Haspel

Keyword(s):

False Positive Rate ◽

Cancer Cell Line ◽

Fusion Transcript ◽

High Volume ◽

Detection Methods ◽

Data Sets ◽

Gene Fusions ◽

Rna Seq ◽

Jaccard Distance ◽

Fusion Detection

Gene fusion events, which are the result of two genes fused together to create a hybrid gene, were first described in cancer cells in the early 1980s. These events are relatively common in many cancers, including prostate, lymphoid, soft tissue, and breast. Recent advances in next-generation sequencing (NGS) provide a high volume of genomic data, including cancer genomes. The detection of possible gene fusions requires fast and accurate methods. However, current methods suffer from inefficiency, insufficient accuracy, and a high false-positive rate. We present an RNA-Seq fusion detection method that uses dimensionality reduction and parallel computing to speed up the computation. We convert the RNA categorical space into compact binary arrays called binary fingerprints, which enables us to reduce memory usage and increase efficiency. The search for and detection of fusion candidates are done using the Jaccard distance. The detection of candidates is followed by refinement. We benchmarked our fusion prediction accuracy using both simulated and genuine RNA-Seq datasets. Genuine paired-end Illumina RNA-Seq data were obtained from 60 publicly available cancer cell line data sets. The results are compared against state-of-the-art methods such as STAR-Fusion, InFusion, and TopHat-Fusion. Our results show that FDJD exhibits superior accuracy compared to popular alternative fusion detection methods. We achieved 90% accuracy on simulated fusion transcript inputs, which is the highest among the compared methods, while maintaining comparable run time.
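
The Jaccard distance between two binary fingerprints can be computed with a few bit operations. The sketch below assumes a fingerprint is built by hashing the k-mers of a sequence into a fixed-width bit array; the k-mer length, the 256-bit width, and the CRC32 hash are illustrative choices, not details taken from FDJD.

```python
import zlib

def kmers(seq, k=4):
    """Set of all length-k substrings of `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def fingerprint(seq, k=4, n_bits=256):
    """Hash each k-mer to one bit of an integer used as a bit array (assumed scheme)."""
    bits = 0
    for kmer in kmers(seq, k):
        bits |= 1 << (zlib.crc32(kmer.encode()) % n_bits)
    return bits

def jaccard_distance(a, b):
    """1 - |A and B| / |A or B|, counted over the set bits of two fingerprints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return 1.0 - bin(a & b).count("1") / union

read = fingerprint("ACGTACGTTAGC")
same = fingerprint("ACGTACGTTAGC")
other = fingerprint("TTTTGGGGCCCC")
d_same = jaccard_distance(read, same)    # identical sequences
d_other = jaccard_distance(read, other)  # unrelated sequences
```

Storing fingerprints as integers keeps memory use small, and the AND, OR, and bit-count operations are cheap, which is what makes a fingerprint search attractive for large read sets.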

Excess False Positive Rates in Methods for Differential Gene Expression Analysis using RNA-Seq Data

10.1101/020784 ◽

2015 ◽

Cited By ~ 7

Author(s):

David M Rocke ◽

Luyao Ruan ◽

Yilun Zhang ◽

J. Jared Gossett ◽

Blythe Durbin-Johnson ◽

...

Keyword(s):

Linear Model ◽

False Positive ◽

Negative Binomial ◽

False Positive Rate ◽

Real Data ◽

False Positives ◽

P Value ◽

Data Sets ◽

Rna Seq ◽

Positive Rate

Motivation: An important property of a valid method for testing for differential expression is that the false positive rate should at least roughly correspond to the p-value cutoff, so that if 10,000 genes are tested at a p-value cutoff of 10⁻⁴, and if all the null hypotheses are true, then there should be only about 1 gene declared to be significantly differentially expressed. We tested this by resampling from existing RNA-Seq data sets and also by matched negative binomial simulations. Results: Methods we examined, which rely strongly on a negative binomial model, such as edgeR, DESeq, and DESeq2, show large numbers of false positives in both the resampled real-data case and in the simulated negative binomial case. This also occurs with a negative binomial generalized linear model function in R. Methods that use only the variance function, such as limma-voom, do not show excessive false positives, as is also the case with a variance stabilizing transformation followed by linear model analysis with limma. The excess false positives are likely caused by apparently small biases in estimation of negative binomial dispersion and, perhaps surprisingly, occur mostly when the mean and/or the dispersion is high, rather than for low-count genes.
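
The arithmetic behind the motivating example is simply the number of tests multiplied by the cutoff. The short sketch below also simulates an idealized, perfectly calibrated test by drawing uniform p-values; the function names and the seed are illustrative, and the simulation does not reproduce the resampling or negative binomial experiments described in the abstract.

```python
import random

def expected_false_positives(n_tests, alpha):
    """Expected number of rejections when every null hypothesis is true."""
    return n_tests * alpha

def simulate_null(n_tests, alpha, seed=0):
    """Count rejections for uniform p-values, i.e. a perfectly calibrated test."""
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(n_tests))

expected = expected_false_positives(10_000, 1e-4)  # 10,000 genes at cutoff 1e-4
observed = simulate_null(10_000, 1e-4)
```

A method whose observed count of null rejections is far above `expected` on resampled or simulated null data has an inflated false positive rate in the sense described above.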

Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences

F1000Research ◽

10.12688/f1000research.7563.2 ◽

2016 ◽

Vol 4 ◽

pp. 1521 ◽

Cited By ~ 268

Author(s):

Charlotte Soneson ◽

Michael I. Love ◽

Mark D. Robinson

Keyword(s):

Statistical Inference ◽

High Throughput Sequencing ◽

Real Data ◽

Transcript Level ◽

R Package ◽

Data Sets ◽

Rna Seq ◽

Abundance Estimates ◽

Gene Level ◽

Genomic Regions

High-throughput sequencing of cDNA (RNA-seq) is used extensively to characterize the transcriptome of cells. Many transcriptomic studies aim at comparing either abundance levels or the transcriptome composition between given conditions, and as a first step, the sequencing reads must be used as the basis for abundance quantification of transcriptomic features of interest, such as genes or transcripts. Various quantification approaches have been proposed, ranging from simple counting of reads that overlap given genomic regions to more complex estimation of underlying transcript abundances. In this paper, we show that gene-level abundance estimates and statistical inference offer advantages over transcript-level analyses, in terms of performance and interpretability. We also illustrate that the presence of differential isoform usage can lead to inflated false discovery rates in differential gene expression analyses on simple count matrices but that this can be addressed by incorporating offsets derived from transcript-level abundance estimates. We also show that the problem is relatively minor in several real data sets. Finally, we provide an R package (tximport) to help users integrate transcript-level abundance estimates from common quantification pipelines into count-based statistical inference engines.

ML-LOO: Detecting Adversarial Examples with Feature Attribution

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.6140 ◽

2020 ◽

Vol 34 (04) ◽

pp. 6639-6647 ◽

Cited By ~ 1

Author(s):

Puyudi Yang ◽

Jianbo Chen ◽

Cho-Jui Hsieh ◽

Jane-Ling Wang ◽

Michael Jordan

Keyword(s):

State Of The Art ◽

Image Data ◽

Real Data ◽

Detection Methods ◽

Data Sets ◽

Confidence Levels ◽

Significant Difference ◽

Adversarial Examples ◽

Scale Estimate ◽

Complete Access

Deep neural networks obtain state-of-the-art performance on a series of tasks. However, they are easily fooled by adding a small adversarial perturbation to the input. The perturbation is often imperceptible to humans on image data. We observe a significant difference in feature attributions between adversarially crafted examples and original examples. Based on this observation, we introduce a new framework to detect adversarial examples by thresholding a scale estimate of feature attribution scores. Furthermore, we extend our method to include multi-layer feature attributions in order to tackle attacks that have mixed confidence levels. As demonstrated in extensive experiments, our method achieves superior performance in distinguishing adversarial examples from popular attack methods on a variety of real data sets compared to state-of-the-art detection methods. In particular, our method is able to detect adversarial examples of mixed confidence levels and to transfer between different attacking methods. We also show that our method achieves competitive performance even when the attacker has complete access to the detector.
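
The detection rule, thresholding a scale estimate of attribution scores, can be sketched on a toy model. The leave-one-out attribution follows the method's name, but the interquartile range as the scale estimate, the zero baseline, the linear stand-in model, the inputs, and the threshold are all assumptions made for illustration.

```python
from statistics import quantiles

def loo_attributions(model, x, baseline=0.0):
    """Leave-one-out attribution: output drop when feature i is replaced by `baseline`."""
    full = model(x)
    return [full - model(x[:i] + [baseline] + x[i + 1:]) for i in range(len(x))]

def iqr(values):
    """Interquartile range, used here as one possible scale estimate."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1

def is_adversarial(model, x, threshold):
    """Flag an input whose attribution scores are unusually dispersed."""
    return iqr(loo_attributions(model, x)) > threshold

# Toy stand-in for a network output: a linear score with unit weights
def model(x):
    return sum(x)

clean = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]        # evenly spread attributions
perturbed = [0.5, 3.0, -2.0, 0.5, 2.5, -1.5]  # widely dispersed attributions
flag_clean = is_adversarial(model, clean, threshold=1.0)
flag_perturbed = is_adversarial(model, perturbed, threshold=1.0)
```

In practice the threshold would be chosen on held-out data, for example to meet a target false positive rate on unperturbed inputs.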

Microarray is an efficient tool for circRNA profiling

Briefings in Bioinformatics ◽

10.1093/bib/bby006 ◽

2018 ◽

Vol 20 (4) ◽

pp. 1420-1433 ◽

Cited By ~ 38

Author(s):

Shasha Li ◽

Shuaishuai Teng ◽

Junquan Xu ◽

Guannan Su ◽

Yu Zhang ◽

...

Keyword(s):

Detection Efficiency ◽

Circular Rnas ◽

Data Sets ◽

Plasma Samples ◽

Rna Seq ◽

Widespread Application ◽

Normal Tissues ◽

Large Numbers ◽

Cervical Tumors

Abstract. Circular RNAs (circRNAs) have emerged in recent years as a new class of endogenous and regulatory noncoding RNAs. With the widespread application of RNA sequencing (RNA-seq) technology and bioinformatics prediction, large numbers of circRNAs have been identified. However, at present, we lack a comprehensive characterization of all these circRNAs in samples of interest. In this study, we integrated 87 935 circRNA sequences, which cover most of the circRNAs identified to date and represented in circBase, to design microarray probes targeting the back-splice site of each circRNA in order to profile the expression of those circRNAs. By comparing the circRNA detection efficiency of RNA-seq with that of this circRNA microarray, we revealed that microarray is more efficient than RNA-seq for circRNA profiling. Then, we found that ∼80 000 circRNAs were expressed in cervical tumors and matched normal tissues, and ∼25 000 of them were differentially expressed. Notably, many of the circRNAs detected by this microarray can be validated by quantitative reverse transcription polymerase chain reaction (RT-qPCR) or RNA-seq. Strikingly, as many as ∼18 000 circRNAs could be robustly detected in cell-free plasma samples, and the expression of ∼2700 of them differed after surgery for tumor removal. Our findings provide a comprehensive and genome-wide characterization of circRNAs in paired normal tissues and tumors and in plasma samples from multiple individuals. In addition, we provide a rich resource with 41 microarray data sets and 10 RNA-seq data sets, as well as strong evidence for circRNA expression in cervical cancer. In conclusion, circRNAs can be efficiently profiled in samples of interest by a circRNA microarray that targets their reported back-splice sites.

The Spectral-Spatial Joint Learning for Change Detection in Multispectral Imagery

Remote Sensing ◽

10.3390/rs11030240 ◽

2019 ◽

Vol 11 (3) ◽

pp. 240 ◽

Cited By ~ 14

Author(s):

Wuxia Zhang ◽

Xiaoqiang Lu

Keyword(s):

Change Detection ◽

Discrimination Learning ◽

Spatial Information ◽

Feature Fusion ◽

Real Data ◽

Detection Methods ◽

Data Sets ◽

Joint Learning ◽

The Difference ◽

Joint Representation

Change detection is one of the most important applications in the remote sensing domain. More and more attention is focused on change detection methods based on deep neural networks. However, many of these methods do not take both the spectral and spatial information into account. Moreover, the underlying information of fused features is not fully explored. To address these problems, a Spectral-Spatial Joint Learning Network (SSJLN) is proposed. SSJLN contains three parts: spectral-spatial joint representation, feature fusion, and discrimination learning. First, the spectral-spatial joint representation is extracted from a network similar to the Siamese CNN (S-CNN). Second, the extracted features are fused to represent the difference information, which proves to be effective for the change detection task. Third, discrimination learning is used to explore the underlying information of the fused features to better represent the discrimination. Moreover, we present a new loss function that considers both the loss of the spectral-spatial joint representation procedure and the loss of the discrimination learning procedure. The effectiveness of the proposed SSJLN is verified on four real data sets. Extensive experimental results show that SSJLN can outperform other state-of-the-art change detection methods.

A Data Stream Outlier Detection Algorithm Based on Reverse K Nearest Neighbors

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.225-226.1032 ◽

2011 ◽

Vol 225-226 ◽

pp. 1032-1035 ◽

Cited By ~ 1

Author(s):

Zhong Ping Zhang ◽

Yong Xin Liang

Keyword(s):

Outlier Detection ◽

Data Stream ◽

Concept Drift ◽

Real Data ◽

Nearest Neighbors ◽

Detection Algorithm ◽

Data Sets ◽

K Nearest Neighbors ◽

Query Manager ◽

Current Window

This paper proposes a new data stream outlier detection algorithm, SODRNN, based on reverse nearest neighbors. We deal with the sliding window model, where outlier queries are performed in order to detect anomalies in the current window. Each insertion or deletion update needs only one scan of the current window, which improves efficiency. Queries at arbitrary times over the whole current window are handled by the Query Manager Procedure, which can capture concept drift in the data stream in a timely manner. Results of experiments conducted on both synthetic and real data sets show that the SODRNN algorithm is both effective and efficient.
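
The underlying idea, that a point appearing in few other points' k-nearest-neighbor lists is likely an outlier, can be sketched for a small sliding window. This is a brute-force illustration on one-dimensional values with invented parameters (`k`, `min_rknn`, window size); it recomputes all neighbor lists per query and does not implement SODRNN's one-scan incremental update or its Query Manager Procedure.

```python
from collections import deque

def knn(points, i, k):
    """Indices of the k nearest neighbors of points[i] (1-D values for simplicity)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: abs(points[j] - points[i]))
    return others[:k]

def rknn_counts(points, k):
    """For each point, count how many other points list it among their k nearest."""
    counts = [0] * len(points)
    for i in range(len(points)):
        for j in knn(points, i, k):
            counts[j] += 1
    return counts

def window_outliers(window, k=2, min_rknn=1):
    """Points in the current window with fewer than `min_rknn` reverse neighbors."""
    pts = list(window)
    return [p for p, c in zip(pts, rknn_counts(pts, k)) if c < min_rknn]

window = deque(maxlen=6)  # sliding window: the oldest value drops out automatically
for value in [1.0, 1.1, 0.9, 1.2, 9.0, 1.05, 0.95]:
    window.append(value)
outliers = window_outliers(window)
```

With these values the window holds the last six readings, and only 9.0 has no reverse neighbors, so it is the single reported outlier.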

A robust method for inverse transport modeling of atmospheric emissions using blind outlier detection

Geoscientific Model Development ◽

10.5194/gmd-7-2303-2014 ◽

2014 ◽

Vol 7 (5) ◽

pp. 2303-2311 ◽

Cited By ~ 9

Author(s):

M. Martinez-Camara ◽

B. Béjar Haro ◽

A. Stohl ◽

M. Vetterli

Keyword(s):

Outlier Detection ◽

Environmental Concern ◽

Measurement Data ◽

Real Data ◽

Detection Algorithm ◽

Data Sets ◽

Data Set ◽

Heavy Tailed ◽

Improved Performance ◽

Inverse Transport

Abstract. Emissions of harmful substances into the atmosphere are a serious environmental concern. In order to understand and predict their effects, it is necessary to estimate the exact quantity and timing of the emissions from sensor measurements taken at different locations. There are a number of methods for solving this problem. However, these existing methods assume Gaussian additive errors, making them extremely sensitive to outlier measurements. We first show that the errors in real-world measurement data sets come from a heavy-tailed distribution, i.e., include outliers. Hence, we propose robustifying the existing inverse methods by adding a blind outlier-detection algorithm. The improved performance of our method is demonstrated on a real data set and compared to previously proposed methods. For the blind outlier detection, we first use an existing algorithm, RANSAC, and then propose a modification called TRANSAC, which provides a further performance improvement.
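
For readers unfamiliar with RANSAC, the sketch below applies it to a toy line-fitting problem rather than to the inverse transport problem of the paper. The data, tolerance, iteration count, and seed are invented for illustration, and the TRANSAC modification is not shown because the abstract does not describe it.

```python
import random

def fit_line(p, q):
    """Slope and intercept of the line through two points with distinct x values."""
    (x1, y1), (x2, y2) = p, q
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1

def ransac_line(points, n_iter=200, tol=0.5, seed=0):
    """Fit to random minimal samples and keep the model with the most inliers."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(n_iter):
        p, q = rng.sample(points, 2)
        if p[0] == q[0]:
            continue  # skip vertical pairs, which have no finite slope
        slope, intercept = fit_line(p, q)
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) <= tol]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (slope, intercept), inliers
    return best_model, best_inliers

# Ten points on y = 2x + 1 plus two gross outliers
points = [(float(x), 2.0 * x + 1.0) for x in range(10)]
points += [(3.0, 40.0), (7.0, -25.0)]
model, inliers = ransac_line(points)
```

Because the model is scored by its inlier count rather than by a squared-error sum over all points, the two gross outliers have no influence on the selected fit, which is the robustness property the abstract relies on.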
