Mixture models reveal multiple positional bias types in RNA-Seq data and lead to accurate transcript concentration estimates

Mapping Intimacies ◽

10.1101/011767 ◽

2014 ◽

Author(s):

Andreas Tuerk ◽

Gregor Wiktorin ◽

Serhat Güler

Keyword(s):

Probability Distributions ◽

False Positive Rate ◽

Synthetic Data ◽

True Positive Rate ◽

RNA-Seq ◽

Microarray Quality Control ◽

Data Set ◽

RNA Transcripts ◽

Positive Rate ◽

Fragment Distribution

Quantification of RNA transcripts with RNA-Seq is inaccurate due to positional fragment bias, which is not represented appropriately by current statistical models of RNA-Seq data. This article introduces the Mix2 (rd. "mixquare") model, which uses a mixture of probability distributions to model the transcript-specific positional fragment bias. The parameters of the Mix2 model can be efficiently trained with the Expectation Maximization (EM) algorithm, resulting in simultaneous estimates of the transcript abundances and transcript-specific positional biases. Experiments are conducted on synthetic data and on the Universal Human Reference (UHR) and Brain (HBR) samples from the Microarray Quality Control (MAQC) data set. Comparing the correlation between qPCR and FPKM values with that of the state-of-the-art methods Cufflinks and PennSeq, we obtain an increase in R² value from 0.44 to 0.6 and from 0.34 to 0.54. In the detection of differential expression between UHR and HBR, the true positive rate increases from 0.44 to 0.71 at a false positive rate of 0.1. Finally, the Mix2 model is used to investigate biases present in the MAQC data. This reveals 5 dominant biases which deviate from the common assumption of a uniform fragment distribution. The Mix2 software is available at http://www.lexogen.com/fileadmin/uploads/bioinfo/mix2model.tgz.
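As a rough illustration of how EM fits a mixture of positional distributions, the sketch below estimates the weights of two fixed components from simulated fragment positions. It is not the Mix2 implementation: the two component shapes (a uniform density and a density rising towards the 3' end), the function names and the simulated data are all assumptions made for this example, and only the mixture weights are trained.

```python
import random


def em_mixture_weights(densities, n_iter=100):
    """Estimate mixture weights by EM for fixed component densities.

    densities[i][k] is the density of observation i under component k.
    """
    n_comp = len(densities[0])
    weights = [1.0 / n_comp] * n_comp  # start from equal weights
    for _ in range(n_iter):
        counts = [0.0] * n_comp
        for row in densities:
            # E-step: posterior responsibility of each component
            joint = [w * d for w, d in zip(weights, row)]
            total = sum(joint)
            for k in range(n_comp):
                counts[k] += joint[k] / total
        # M-step: new weights are the average responsibilities
        weights = [c / len(densities) for c in counts]
    return weights


def simulate_positions(n, uniform_weight, seed=0):
    """Hypothetical data: relative fragment positions in [0, 1]."""
    rng = random.Random(seed)
    positions = []
    for _ in range(n):
        if rng.random() < uniform_weight:
            positions.append(rng.random())             # uniform component
        else:
            positions.append(rng.betavariate(2, 1))    # density 2x on [0, 1]
    return positions


positions = simulate_positions(5000, uniform_weight=0.3)
# Column 0: uniform density (1.0); column 1: Beta(2, 1) density (2x).
densities = [[1.0, 2.0 * x] for x in positions]
weights = em_mixture_weights(densities)
print(weights)
```

With enough simulated fragments the first weight should land near the simulated value of 0.3. A fuller model would also re-estimate the component shapes in the M-step rather than keeping them fixed.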

contamDE-lm: linear model-based differential gene expression analysis using next-generation RNA-seq data from contaminated tumor samples

Bioinformatics ◽

10.1093/bioinformatics/btaa006 ◽

2020 ◽

Vol 36 (8) ◽

pp. 2492-2499

Author(s):

Yifan Ji ◽

Chang Yu ◽

Hong Zhang

Keyword(s):

Gene Expression ◽

Linear Model ◽

False Positive Rate ◽

Simulated Data ◽

True Positive Rate ◽

Supplementary Information ◽

Large Sample Size ◽

RNA-Seq ◽

Next Generation ◽

Positive Rate

Motivation: Tumor and adjacent normal RNA samples are commonly used to screen differentially expressed genes between normal and tumor samples or among tumor subtypes. Such a paired-sample design can avoid numerous confounders in differential expression (DE) analysis, but the cellular contamination of tumor samples can be an important source of noise and confounding, which can both inflate the false-positive rate and deflate the true-positive rate. The existing DE tools that use next-generation RNA-seq data either do not account for cellular contamination or become computationally expensive as the sample size grows. Results: A novel linear model is proposed to avoid the problem that can arise from tumor–normal correlation in paired samples. Based on this model, a statistically robust and computationally very fast DE analysis procedure, contamDE-lm, was developed to account for cellular contamination; it boosts DE analysis power by reducing individual residual variances using gene-wise information. The advantages of contamDE-lm over some state-of-the-art methods (limma and DESeq2) were evaluated through applications to simulated data, the TCGA database and the Gene Expression Omnibus (GEO) database. Availability and implementation: The proposed method contamDE-lm was implemented in an updated R package contamDE (version 2.0), which is freely available at https://github.com/zhanghfd/contamDE. Supplementary information: Supplementary data are available at Bioinformatics online.

An e-healthcare system for disease prediction using hybrid data mining technique

Journal of Modelling in Management ◽

10.1108/jm2-05-2018-0069 ◽

2019 ◽

Vol 14 (3) ◽

pp. 628-661 ◽

Cited By ~ 1

Author(s):

Bikash Kanti Sarkar ◽

Shib Sankar Sana

Keyword(s):

Data Mining ◽

False Positive Rate ◽

True Positive Rate ◽

Data Partition ◽

Data Mining Technique ◽

Data Set ◽

Content Type ◽

Positive Rate ◽

Effective Diagnosis ◽

Disease Specific

Purpose: The purpose of this study is to alleviate, to a great extent, the issues specified below. To promote patients’ health via early prediction of diseases, knowledge extraction using data mining approaches is an integral part of an e-health system. However, medical databases are highly imbalanced, voluminous, conflicting and complex in nature, and this can lead to erroneous diagnosis of diseases (i.e. detecting class-values of diseases). In the literature, numerous standard disease decision support systems (DDSSs) have been proposed, but most of them are disease specific. They also usually suffer from several drawbacks, such as lack of understandability, inability to handle rare cases, and inefficiency in making quick and correct decisions. Design/methodology/approach: Addressing the limitations of the existing systems, the present research introduces a two-step framework for designing a DDSS. The first step (data-level optimization) identifies an optimal data partition (Popt) for each disease data set and then, in a parallel manner, the best training set for Popt. The second step explores a generic predictive model (integrating C4.5 and PRISM learners) over the discovered information for effective diagnosis of disease. The designed model is a generic one (i.e. not disease specific). Findings: The empirical results (in terms of three measures, namely accuracy, true positive rate and false positive rate) obtained over 14 benchmark medical data sets (collected from https://archive.ics.uci.edu/ml) demonstrate that the hybrid model outperforms the base learners in almost all cases for initial diagnosis of the diseases. The proposed DDSS may thus work as an e-doctor to detect diseases. Originality/value: The model designed in this study is original, and the necessary parallelized methods are implemented in C on an HPC cluster (FUJITSU) with 256 cores in total (under one master node).

Simultaneous estimation of transcript abundances and transcript specific fragment distributions of RNA-Seq data with the Mix2 model

10.1101/005918 ◽

2014 ◽

Author(s):

Andreas Tuerk ◽

Gregor Wiktorin

Keyword(s):

Probability Distributions ◽

Synthetic Data ◽

Superior Performance ◽

Simultaneous Estimation ◽

RNA-Seq ◽

Scale Parameters ◽

RNA Transcripts ◽

Abundance Estimates ◽

The EM Algorithm ◽

Specific Fragment

Quantification of RNA transcripts with RNA-Seq is inaccurate due to positional fragmentation bias, which is not represented appropriately by current statistical models of RNA-Seq data. Another, less investigated, source of error is the inaccuracy of transcript start and end annotations. This article introduces the Mix2 (rd. “mixquare”) model, which uses a mixture of probability distributions to model the transcript-specific positional fragment bias. The parameters of the Mix2 model can be efficiently trained with the EM algorithm and are tied between similar transcripts. Transcript-specific shift and scale parameters allow the Mix2 model to automatically correct inaccurate transcript start and end annotations. Experiments are conducted on synthetic data covering 7 genes of different complexity, 4 types of fragment bias, and correct as well as incorrect transcript start and end annotations. Abundance estimates obtained by Cufflinks 2.2.0, PennSeq and the Mix2 model show superior performance of the Mix2 model in the vast majority of test conditions. The Mix2 software is available at http://www.lexogen.com/fileadmin/uploads/bioinfo/mix2model.tgz, subject to the enclosed license. Additional experimental data are available in the supplement.

The Junction Usage Model (JUM): A method for comprehensive annotation-free analysis of alternative pre-mRNA splicing patterns

10.1101/116863 ◽

2017 ◽

Cited By ~ 4

Author(s):

Qingqing Wang ◽

Donald C. Rio

Keyword(s):

Intron Retention ◽

False Positive Rate ◽

True Positive Rate ◽

mRNA Splicing ◽

RNA-Seq ◽

Detection Analysis ◽

Positive Rate ◽

IR Detection ◽

Splicing Patterns ◽

Relevant Pattern

Alternative pre-mRNA splicing (AS) greatly diversifies metazoan transcriptomes and proteomes and is crucial for gene regulation. Current computational methods for analyzing AS from Illumina RNA-seq data rely on pre-annotated libraries of known spliced transcripts, which hinders AS analysis in poorly annotated genomes and can further mask unknown AS patterns. To address this critical bioinformatics problem, we developed a method called the Junction Usage Model (JUM) that uses a bottom-up approach to identify, analyze and quantitate global AS profiles without any prior transcriptome annotations. JUM accurately reports global AS changes in terms of the five conventional AS patterns and an additional “Composite” category composed of inseparable combinations of conventional patterns. JUM stringently classifies the difficult and disease-relevant pattern of intron retention (IR), reducing the false positive rate of IR detection commonly seen in other annotation-based methods to near-negligible levels. When analyzing AS in RNA samples derived from Drosophila heads, human tumors and human cell lines bearing cancer-associated splicing factor mutations, JUM consistently identified approximately twice the number of novel AS events missed by other methods. Computational simulations showed that JUM exhibits a 1.2–4.8 times higher true positive rate at a fixed cut-off of 5% false discovery rate. In summary, JUM provides a new framework and improved method that removes the need for transcriptome annotations and enables the detection, analysis and quantification of AS patterns in complex metazoan transcriptomes with superior accuracy.

PRATD: A Phased Remote Access Trojan Detection Method with Double-Sided Features

Electronics ◽

10.3390/electronics9111894 ◽

2020 ◽

Vol 9 (11) ◽

pp. 1894

Author(s):

Chun Guo ◽

Zihua Song ◽

Yuan Ping ◽

Guowei Shen ◽

Yuhei Cui ◽

...

Keyword(s):

False Positive ◽

Detection Method ◽

False Positive Rate ◽

True Positive Rate ◽

Remote Access ◽

Detection Methods ◽

Security Threats ◽

True Positive ◽

Trojan Detection ◽

Positive Rate

The Remote Access Trojan (RAT) is one of the most serious security threats that organizations face today. At present, the two major RAT detection approaches are host-based and network-based detection. To combine their complementary strengths, this article proposes a phased RAT detection method that uses double-sided features (PRATD). In PRATD, both host-side and network-side features are combined to build detection models, which helps distinguish RATs from benign programs because RATs not only generate traffic on the network but also leave traces on the host at run time. In addition, PRATD trains two different detection models for the two runtime states of RATs to improve the True Positive Rate (TPR). Experiments on network and host records collected from five kinds of benign programs and 20 well-known RATs show that PRATD can effectively detect RATs: it achieves a TPR as high as 93.609% with a False Positive Rate (FPR) as low as 0.407% for known RATs, and a TPR of 81.928% with an FPR of 0.185% for unknown RATs, which suggests it is a competitive candidate for RAT detection.

Ascertaining an efficient eligibility cut-off for extended Medicare items for eating disorders

Australasian Psychiatry ◽

10.1177/10398562211028632 ◽

2021 ◽

pp. 103985622110286

Author(s):

Tracey Wade ◽

Jamie-Lee Pennesi ◽

Yuan Zhou

Keyword(s):

Eating Disorders ◽

Eating Disorder ◽

False Positive Rate ◽

Area Under The Curve ◽

Rate Sensitivity ◽

True Positive Rate ◽

Eating Disorder Examination Questionnaire ◽

Eating Disorder Examination ◽

Positive Rate ◽

The Relationship

Objective: Currently, eligibility for expanded Medicare items for eating disorders (excluding anorexia nervosa) requires a score ⩾ 3 on the 22-item Eating Disorder Examination-Questionnaire (EDE-Q). We compared these EDE-Q “cases” with continuous scores on a validated 7-item version of the EDE-Q (EDE-Q7) to identify an EDE-Q7 cut-off commensurate with 3 on the EDE-Q. Methods: We utilised EDE-Q scores of female university students (N = 337) at risk of developing an eating disorder. We used a receiver operating characteristic (ROC) curve to assess the relationship between the true-positive rate (sensitivity) and the false-positive rate (1 − specificity) of cases ⩾ 3. Results: The area under the curve showed outstanding discrimination of 0.94 (95% CI: 0.92–0.97). We examined two specific cut-off points on the EDE-Q7, which included 100% and 87% of true cases, respectively. Conclusion: Given that the EDE-Q cut-off for Medicare is used in conjunction with other criteria, we suggest using the more permissive EDE-Q7 cut-off (⩾ 2.5) to replace the EDE-Q cut-off (⩾ 3) in eligibility assessments.
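To make the ROC terminology concrete, the sketch below computes the (false-positive rate, true-positive rate) pair for every cut-off of the form "score ⩾ t" and integrates the curve with the trapezoidal rule. The scores and case labels are hypothetical values invented for this example, not data from the study, and the function names are assumptions.

```python
def roc_points(scores, labels):
    """Return (fpr, tpr) pairs for every cut-off 'score >= t'.

    labels are 1 for true cases and 0 for non-cases.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # cut-off above every score: nothing flagged
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (fpr, tpr) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical short-form scores and reference case labels.
scores = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

points = roc_points(scores, labels)
auc = area_under_curve(points)
print(points)
print(auc)  # 0.9375 for this toy data
```

Lowering the cut-off moves along the curve towards higher sensitivity at the cost of a higher false-positive rate, which is the trade-off a more permissive threshold accepts.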

Improving Ecological Inference by Predicting Individual Ethnicity from Voter Registration Records

Political Analysis ◽

10.1093/pan/mpw001 ◽

2016 ◽

Vol 24 (2) ◽

pp. 263-272 ◽

Cited By ~ 29

Author(s):

Kosuke Imai ◽

Kabir Khanna

Keyword(s):

Mean Squared Error ◽

False Positive Rate ◽

True Positive Rate ◽

Voter Registration ◽

Racial Groups ◽

Ecological Inference ◽

Inference Problem ◽

Individual Level ◽

Positive Rate ◽

Election Results

In both political behavior research and voting rights litigation, turnout and vote choice for different racial groups are often inferred using aggregate election results and racial composition. Over the past several decades, many statistical methods have been proposed to address this ecological inference problem. We propose an alternative method to reduce aggregation bias by predicting individual-level ethnicity from voter registration records. Building on the existing methodological literature, we use Bayes's rule to combine the Census Bureau's Surname List with various information from geocoded voter registration records. We evaluate the performance of the proposed methodology using approximately nine million voter registration records from Florida, where self-reported ethnicity is available. We find that it is possible to reduce the false positive rate among Black and Latino voters to 6% and 3%, respectively, while maintaining the true positive rate above 80%. Moreover, we use our predictions to estimate turnout by race and find that our estimates yield substantially less bias and root mean squared error than standard ecological inference estimates. We provide open-source software to implement the proposed methodology.
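The Bayes's rule step can be sketched as follows, under the simplifying assumption (made for this example) that surname and location are conditionally independent given ethnicity, so that the posterior is proportional to P(ethnicity | surname) × P(location | ethnicity). All probabilities below are hypothetical, and the function name is illustrative rather than part of the authors' software.

```python
def posterior_ethnicity(p_eth_given_surname, p_geo_given_eth):
    """Combine a surname-based prior with a geographic likelihood.

    Assumes surname and location are conditionally independent given
    ethnicity (an assumption of this sketch).
    """
    unnormalised = {
        eth: p_eth_given_surname[eth] * p_geo_given_eth[eth]
        for eth in p_eth_given_surname
    }
    total = sum(unnormalised.values())
    return {eth: value / total for eth, value in unnormalised.items()}


# Hypothetical inputs for one voter.
surname_prior = {"white": 0.6, "black": 0.3, "latino": 0.1}
# Hypothetical share of each group's population living in the voter's area.
geo_likelihood = {"white": 0.001, "black": 0.004, "latino": 0.001}

posterior = posterior_ethnicity(surname_prior, geo_likelihood)
print(posterior)
```

In this toy case the location evidence outweighs the surname prior, so the most probable group changes from "white" to "black" once both sources are combined.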

Replicate sequencing libraries are important for quantification of allelic imbalance

Nature Communications ◽

10.1038/s41467-021-23544-8 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Asia Mendelevich ◽

Svetlana Vinogradova ◽

Saumya Gupta ◽

Andrey A. Mironov ◽

Shamil R. Sunyaev ◽

...

Keyword(s):

Allelic Imbalance ◽

False Positive Rate ◽

Error Rates ◽

Differential Analysis ◽

RNA-Seq ◽

Specific Expression ◽

Technical Noise ◽

Specific Analysis ◽

Positive Rate ◽

Allele Specific

A sensitive approach to quantitative analysis of transcriptional regulation in diploid organisms is analysis of allelic imbalance (AI) in RNA sequencing (RNA-seq) data. A near-universal practice in such studies is to prepare and sequence only one library per RNA sample. We present theoretical and experimental evidence that data from a single RNA-seq library is insufficient for reliable quantification of the contribution of technical noise to the observed AI signal; consequently, reliance on one-replicate experimental design can lead to unaccounted-for variation in error rates in allele-specific analysis. We develop a computational approach, Qllelic, that accurately accounts for technical noise by making use of replicate RNA-seq libraries. Testing on new and existing datasets shows that application of Qllelic greatly decreases the false positive rate in allele-specific analysis while conserving appropriate signal, and thus greatly improves reproducibility of AI estimates. We explore sources of technical overdispersion in observed AI signal and conclude by discussing design of RNA-seq studies addressing two biologically important questions: quantification of transcriptome-wide AI in one sample, and differential analysis of allele-specific expression between samples.

Watch For Failing Objects: What Inappropriate Compliance Reveals About Shared Mental Models In Autonomous Cars

Proceedings of the Human Factors and Ergonomics Society Annual Meeting ◽

10.1177/1071181321651081 ◽

2021 ◽

Vol 65 (1) ◽

pp. 643-647

Author(s):

Yosef S. Razin ◽

Jack Gale ◽

Jiaojiao Fan ◽

Jaznae’ Smith ◽

Karen M. Feigh

Keyword(s):

Mental Models ◽

Mental Model ◽

False Positive Rate ◽

Ground Truth ◽

True Positive Rate ◽

Shared Mental Models ◽

Shared Mental Model ◽

Autonomous Cars ◽

Positive Rate ◽

Dispositional Factors

This paper evaluates Banks et al.’s Human-AI Shared Mental Model theory by examining how a self-driving vehicle’s hazard assessment facilitates shared mental models. Participants were asked to affirm the vehicle’s assessment of road objects as either hazards or mistakes in real time while behavioral and subjective measures were collected. The baseline performance of the AI was purposefully low (<50%) to examine how the human’s shared mental model might lead to inappropriate compliance. Results indicated that while the participant true positive rate was high, overall performance was reduced by the large false positive rate, indicating that participants were indeed being influenced by the AI’s faulty assessments, despite full transparency as to the ground truth. Both performance and compliance were directly affected by frustration, mental demands, and even physical demands. Dispositional factors such as faith in other people’s cooperativeness and in technology companies were also significant. Thus, our findings strongly supported the theory that shared mental models play a measurable role in performance and compliance, in a complex interplay with trust.

Investigating the clinical usefulness of definitions of progression with 10-2 visual field

British Journal of Ophthalmology ◽

10.1136/bjophthalmol-2020-318188 ◽

2021 ◽

pp. bjophthalmol-2020-318188

Author(s):

Shotaro Asano ◽

Hiroshi Murata ◽

Yuri Fujino ◽

Takehiro Yamashita ◽

Atsuya Miki ◽

...

Keyword(s):

Visual Field ◽

False Positive ◽

False Positive Rate ◽

Clinical Validity ◽

Data Set ◽

Humphrey Field Analyzer ◽

Positive Rate ◽

Using Data ◽

Sensitivity Specificity ◽

Higher Sensitivity

Background/Aim: To investigate the clinical validity of the Guided Progression Analysis definition (GPAD) and cluster-based definition (CBD) with the Humphrey Field Analyzer 10-2 test in diagnosing glaucomatous visual field (VF) progression, and to introduce a novel definition with optimised specificity by combining the ‘any-location’ and ‘cluster-based’ approaches (hybrid definition). Methods: 64 400 stable glaucomatous VFs were simulated from 664 pairs of 10-2 tests (10 sets × 10 VF series × 664 eyes; data set 1). Using these simulated VFs, the specificity to detect progression and the effects of changing the parameters (number of test locations or consecutive VF tests, and percentile cut-off values) were investigated. The hybrid definition was designed as the combination where the specificity was closest to 95.0%. Subsequently, another 5000 actual glaucomatous 10-2 tests from 500 eyes (10 VFs each) were collected (data set 2), and their accuracy (sensitivity, specificity and false positive rate) and the time needed to detect VF progression were evaluated. Results: The specificity values calculated using data set 1 with GPAD and CBD were 99.6% and 99.8%. Using data set 2, the hybrid definition had a higher sensitivity than GPAD and CBD, without detriment to the specificity or false positive rate. The hybrid definition also detected progression significantly earlier than GPAD and CBD (at 3.1 years vs 4.2 years and 4.1 years, respectively). Conclusions: GPAD and CBD had specificities of 99.6% and 99.8%, respectively. A novel hybrid definition (with a specificity of 95.5%) had higher sensitivity and enabled earlier detection of progression.
