scholarly journals Sensitive, reliable, and robust circRNA detection from RNA-seq with CirComPara2

2021 ◽  
Author(s):  
Enrico Gaffo ◽  
Alessia Buratin ◽  
Anna Dal Molin ◽  
Stefania Bortoluzzi

AbstractCurrent methods for identifying circular RNAs (circRNAs) suffer from low discovery rates and inconsistent performance in diverse data sets. Therefore, the applied detection algorithm can bias high-throughput study findings by missing relevant circRNAs. Here, we show that our bioinformatics tool CirComPara2 (https://github.com/egaffo/CirComPara2), by combining multiple circRNA detection methods, consistently achieves high recall rates without loss of precision in simulated and different real-data sets.

2016 ◽  
Vol 14 (06) ◽  
pp. 1650034 ◽  
Author(s):  
Naim Al Mahi ◽  
Munni Begum

One of the primary objectives of ribonucleic acid (RNA) sequencing or RNA-Seq experiment is to identify differentially expressed (DE) genes in two or more treatment conditions. It is a common practice to assume that all read counts from RNA-Seq data follow overdispersed (OD) Poisson or negative binomial (NB) distribution, which is sometimes misleading because within each condition, some genes may have unvarying transcription levels with no overdispersion. In such a case, it is more appropriate and logical to consider two sets of genes: OD and non-overdispersed (NOD). We propose a new two-step integrated approach to distinguish DE genes in RNA-Seq data using standard Poisson and NB models for NOD and OD genes, respectively. This is an integrated approach because this method can be merged with any other NB-based methods for detecting DE genes. We design a simulation study and analyze two real RNA-Seq data to evaluate the proposed strategy. We compare the performance of this new method combined with the three [Formula: see text]-software packages namely edgeR, DESeq2, and DSS with their default settings. For both the simulated and real data sets, integrated approaches perform better or at least equally well compared to the regular methods embedded in these [Formula: see text]-packages.


2021 ◽  
pp. 2142002
Author(s):  
Giuseppe Agapito ◽  
Marianna Milano ◽  
Mario Cannataro

A new coronavirus, causing a severe acute respiratory syndrome (COVID-19), was started at Wuhan, China, in December 2019. The epidemic has rapidly spread across the world becoming a pandemic that, as of today, has affected more than 70 million people causing over 2 million deaths. To better understand the evolution of spread of the COVID-19 pandemic, we developed PANC (Parallel Network Analysis and Communities Detection), a new parallel preprocessing methodology for network-based analysis and communities detection on Italian COVID-19 data. The goal of the methodology is to analyze set of homogeneous datasets (i.e. COVID-19 data in several regions) using a statistical test to find similar/dissimilar behaviours, mapping such similarity information on a graph and then using community detection algorithm to visualize and analyze the initial dataset. The methodology includes the following steps: (i) a parallel methodology to build similarity matrices that represent similar or dissimilar regions with respect to data; (ii) an effective workload balancing function to improve performance; (iii) the mapping of similarity matrices into networks where nodes represent Italian regions, and edges represent similarity relationships; (iv) the discovering and visualization of communities of regions that show similar behaviour. The methodology is general and can be applied to world-wide data about COVID-19, as well as to all types of data sets in tabular and matrix format. To estimate the scalability with increasing workloads, we analyzed three synthetic COVID-19 datasets with the size of 90.0[Formula: see text]MB, 180.0[Formula: see text]MB, and 360.0[Formula: see text]MB. Experiments was performed on showing the amount of data that can be analyzed in a given amount of time increases almost linearly with the number of computing resources available. Instead, to perform communities detection, we employed the real data set.


2021 ◽  
Author(s):  
Hamid Reza Mohebbi ◽  
Nurit Haspel

Gene fusions events, which are the result of two genes fused together to create a hybrid gene, were first described in cancer cells in the early 1980s. These events are relatively common in many cancers including prostate, lymphoid, soft tissue, and breast. Recent advances in next-generation sequencing (NGS) provide a high volume of genomic data, including cancer genomes. The detection of possible gene fusions requires fast and accurate methods. However, current methods suffer from inefficiency, lack of sufficient accuracy, and a high false-positive rate. We present an RNA-Seq fusion detection method that uses dimensionality reduction and parallel computing to speed up the computation. We convert the RNA categorical space into a compact binary array called binary fingerprints, which enables us to reduce the memory usage and increase efficiency. The search and detection of fusion candidates are done using the Jaccard distance. The detection of candidates is followed by refinement. We benchmarked our fusion prediction accuracy using both simulated and genuine RNA-Seq datasets. Paired-end Illumina RNA-Seq genuine data were obtained from 60 publicly available cancer cell line data sets. The results are compared against the state-of-the-art-methods such as STAR-Fusion, InFusion, and TopHat-Fusion. Our results show that FDJD exhibits superior accuracy compared to popular alternative fusion detection methods. We achieved 90% accuracy on simulated fusion transcript inputs, which is the highest among the compared methods while maintaining comparable run time.


2015 ◽  
Author(s):  
David M Rocke ◽  
Luyao Ruan ◽  
Yilun Zhang ◽  
J. Jared Gossett ◽  
Blythe Durbin-Johnson ◽  
...  

Motivation: An important property of a valid method for testing for differential expression is that the false positive rate should at least roughly correspond to the p-value cutoff, so that if 10,000 genes are tested at a p-value cutoff of 10−4, and if all the null hypotheses are true, then there should be only about 1 gene declared to be significantly differentially expressed. We tested this by resampling from existing RNA-Seq data sets and also by matched negative binomial simulations. Results: Methods we examined, which rely strongly on a negative binomial model, such as edgeR, DESeq, and DESeq2, show large numbers of false positives in both the resampled real-data case and in the simulated negative binomial case. This also occurs with a negative binomial generalized linear model function in R. Methods that use only the variance function, such as limma-voom, do not show excessive false positives, as is also the case with a variance stabilizing transformation followed by linear model analysis with limma. The excess false positives are likely caused by apparently small biases in estimation of negative binomial dispersion and, perhaps surprisingly, occur mostly when the mean and/or the dis-persion is high, rather than for low-count genes.


F1000Research ◽  
2016 ◽  
Vol 4 ◽  
pp. 1521 ◽  
Author(s):  
Charlotte Soneson ◽  
Michael I. Love ◽  
Mark D. Robinson

High-throughput sequencing of cDNA (RNA-seq) is used extensively to characterize the transcriptome of cells. Many transcriptomic studies aim at comparing either abundance levels or the transcriptome composition between given conditions, and as a first step, the sequencing reads must be used as the basis for abundance quantification of transcriptomic features of interest, such as genes or transcripts. Various quantification approaches have been proposed, ranging from simple counting of reads that overlap given genomic regions to more complex estimation of underlying transcript abundances. In this paper, we show that gene-level abundance estimates and statistical inference offer advantages over transcript-level analyses, in terms of performance and interpretability. We also illustrate that the presence of differential isoform usage can lead to inflated false discovery rates in differential gene expression analyses on simple count matrices but that this can be addressed by incorporating offsets derived from transcript-level abundance estimates. We also show that the problem is relatively minor in several real data sets. Finally, we provide an R package (tximport) to help users integrate transcript-level abundance estimates from common quantification pipelines into count-based statistical inference engines.


2020 ◽  
Vol 34 (04) ◽  
pp. 6639-6647 ◽  
Author(s):  
Puyudi Yang ◽  
Jianbo Chen ◽  
Cho-Jui Hsieh ◽  
Jane-Ling Wang ◽  
Michael Jordan

Deep neural networks obtain state-of-the-art performance on a series of tasks. However, they are easily fooled by adding a small adversarial perturbation to the input. The perturbation is often imperceptible to humans on image data. We observe a significant difference in feature attributions between adversarially crafted examples and original examples. Based on this observation, we introduce a new framework to detect adversarial examples through thresholding a scale estimate of feature attribution scores. Furthermore, we extend our method to include multi-layer feature attributions in order to tackle attacks that have mixed confidence levels. As demonstrated in extensive experiments, our method achieves superior performances in distinguishing adversarial examples from popular attack methods on a variety of real data sets compared to state-of-the-art detection methods. In particular, our method is able to detect adversarial examples of mixed confidence levels, and transfer between different attacking methods. We also show that our method achieves competitive performance even when the attacker has complete access to the detector.


2018 ◽  
Vol 20 (4) ◽  
pp. 1420-1433 ◽  
Author(s):  
Shasha Li ◽  
Shuaishuai Teng ◽  
Junquan Xu ◽  
Guannan Su ◽  
Yu Zhang ◽  
...  

Abstract Circular RNAs (circRNAs) are emerging as a new class of endogenous and regulatory noncoding RNAs in latest years. With the widespread application of RNA sequencing (RNA-seq) technology and bioinformatics prediction, large numbers of circRNAs have been identified. However, at present, we lack a comprehensive characterization of all these circRNAs in interested samples. In this study, we integrated 87 935 circRNAs sequences that cover most of circRNAs identified till now represented in circBase to design microarray probes targeting back-splice site of each circRNA to profile expression of those circRNAs. By comparing the circRNA detection efficiency of RNA-seq with this circRNA microarray, we revealed that microarray is more efficient than RNA-seq for circRNA profiling. Then, we found ∼80 000 circRNAs were expressed in cervical tumors and matched normal tissues, and ∼25 000 of them were differently expressed. Notably, many of these circRNAs detected by this microarray can be validated by quantitative reverse transcription polymerase chain reaction (RT-qPCR) or RNA-seq. Strikingly, as many as ∼18 000 circRNAs could be robustly detected in cell-free plasma samples, and the expression of ∼2700 of them differed after surgery for tumor removal. Our findings provided a comprehensive and genome-wide characterization of circRNAs in paired normal tissues and tumors and plasma samples from multiple individuals. In addition, we also provide a rich resource with 41 microarray data sets and 10 RNA-seq data sets and strong evidences for circRNA expression in cervical cancer. In conclusion, circRNAs could be efficiently profiled by circRNA microarray to target their reported back-splice sites in interested samples.


2019 ◽  
Vol 11 (3) ◽  
pp. 240 ◽  
Author(s):  
Wuxia Zhang ◽  
Xiaoqiang Lu

Change detection is one of the most important applications in the remote sensing domain. More and more attention is focused on deep neural network based change detection methods. However, many deep neural networks based methods did not take both the spectral and spatial information into account. Moreover, the underlying information of fused features is not fully explored. To address the above-mentioned problems, a Spectral-Spatial Joint Learning Network (SSJLN) is proposed. SSJLN contains three parts: spectral-spatial joint representation, feature fusion, and discrimination learning. First, the spectral-spatial joint representation is extracted from the network similar to the Siamese CNN (S-CNN). Second, the above-extracted features are fused to represent the difference information that proves to be effective for the change detection task. Third, the discrimination learning is presented to explore the underlying information of obtained fused features to better represent the discrimination. Moreover, we present a new loss function that considers both the losses of the spectral-spatial joint representation procedure and the discrimination learning procedure. The effectiveness of our proposed SSJLN is verified on four real data sets. Extensive experimental results show that our proposed SSJLN can outperform the other state-of-the-art change detection methods.


2011 ◽  
Vol 225-226 ◽  
pp. 1032-1035 ◽  
Author(s):  
Zhong Ping Zhang ◽  
Yong Xin Liang

This paper proposes a new data stream outlier detection algorithm SODRNN based on reverse nearest neighbors. We deal with the sliding window model, where outlier queries are performed in order to detect anomalies in the current window. The update of insertion or deletion only needs one scan of the current window, which improves efficiency. The capability of queries at arbitrary time on the whole current window is achieved by Query Manager Procedure, which can capture the phenomenon of concept drift of data stream in time. Results of experiments conducted on both synthetic and real data sets show that SODRNN algorithm is both effective and efficient.


2014 ◽  
Vol 7 (5) ◽  
pp. 2303-2311 ◽  
Author(s):  
M. Martinez-Camara ◽  
B. Béjar Haro ◽  
A. Stohl ◽  
M. Vetterli

Abstract. Emissions of harmful substances into the atmosphere are a serious environmental concern. In order to understand and predict their effects, it is necessary to estimate the exact quantity and timing of the emissions from sensor measurements taken at different locations. There are a number of methods for solving this problem. However, these existing methods assume Gaussian additive errors, making them extremely sensitive to outlier measurements. We first show that the errors in real-world measurement data sets come from a heavy-tailed distribution, i.e., include outliers. Hence, we propose robustifying the existing inverse methods by adding a blind outlier-detection algorithm. The improved performance of our method is demonstrated on a real data set and compared to previously proposed methods. For the blind outlier detection, we first use an existing algorithm, RANSAC, and then propose a modification called TRANSAC, which provides a further performance improvement.


Sign in / Sign up

Export Citation Format

Share Document