What if we ignore the random effects when analyzing RNA-seq data in a multifactor experiment

Author(s):  
Shiqi Cui ◽  
Tieming Ji ◽  
Jilong Li ◽  
Jianlin Cheng ◽  
Jing Qiu

Abstract Identifying differentially expressed (DE) genes between different conditions is one of the main goals of RNA-seq data analysis. Although a large amount of RNA-seq data was produced for two-group comparisons with small sample sizes at an early stage, more and more RNA-seq data are being produced in the setting of complex experimental designs such as split-plot designs and repeated-measures designs. Data arising from such experiments are traditionally analyzed by mixed-effects models. Therefore, an appropriate statistical approach for analyzing RNA-seq data from such designs is the generalized linear mixed model (GLMM) or a similar approach that allows for random effects. However, common practices for analyzing such data in the literature either treat random effects as fixed or completely ignore the experimental design and focus on two-group comparisons using partial data. In this paper, we examine the effect of ignoring the random effects when analyzing RNA-seq data. We accomplish this goal by comparing the standard GLMM to methods that ignore the random effects through simulation studies and real-data analysis. Our studies show that ignoring random effects in a multifactor experiment can increase the number of false positives among the top selected genes, or reduce power when the nominal FDR level is controlled.
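
To make the split-plot concern concrete, here is a minimal simulation sketch (an illustration of the general point, not the paper's simulation design or its GLMM implementation; all parameter values are hypothetical): when a treatment is applied at the whole-plot (block) level and the block random effect is ignored, a plain Poisson GLM treats the subsamples as independent, and its p-values become anti-conservative for null genes.

```python
# Sketch (not the paper's design): null genes with a block-level random effect,
# treatment assigned at the block level, analyzed by a Poisson GLM that ignores blocks.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_genes, n_blocks, reps = 1000, 6, 3            # hypothetical split-plot-like design
block = np.repeat(np.arange(n_blocks), reps)     # whole plots / subjects
block_trt = np.repeat([0.0, 1.0], n_blocks // 2)
treatment = block_trt[block]                     # treatment applied at the block level
X = sm.add_constant(treatment)

pvals = np.empty(n_genes)
for g in range(n_genes):
    block_effect = rng.normal(0.0, 0.3, size=n_blocks)  # random block effect, log scale
    mu = np.exp(np.log(100) + block_effect[block])      # null gene: no treatment effect
    y = rng.poisson(mu)
    fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
    pvals[g] = fit.pvalues[1]

# With the random effect ignored, far more than 5% of null genes fall below 0.05.
print("empirical false-positive rate at the 0.05 level:", np.mean(pvals < 0.05))
```

A mixed model with a random block intercept accounts for the between-block variation that this naive fit ignores, which is the scenario the paper's comparisons address.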

2016 ◽  
Vol 34 (4_suppl) ◽  
pp. 286-286
Author(s):  
Robin Kate Kelley ◽  
John Dozier Gordan ◽  
Kimberley Evason ◽  
Paige M. Bracci ◽  
Nancy M. Joseph ◽  
...  

286 Background: Mutations in TP53 and CTNNB1 are common in early-stage HCC resection samples. The frequency and prognostic impact of these mutations in advanced HCC are not known. We conducted this retrospective analysis using a large NGS panel to explore associations between tumor genetics, clinicopathologic features, and prognosis in an advanced HCC cohort. Methods: Eligible cases had a diagnosis of unresectable HCC or mixed HCC-cholangiocarcinoma and were enrolled on the NCT01008917 or NCT01687673 clinical trials of sorafenib plus temsirolimus, with informed consent for specimen banking for future research including genetic testing. Paired tumor and germline (blood) DNA samples were sequenced using a capture-based NGS cancer panel to allow for determination of somatic variants. Analysis was based on the human reference sequence UCSC build hg19. Variants were called using GATK Unified Genotyper software. Somatic, non-synonymous, and exonic calls were curated using COSMIC, cBioPortal, and PubMed. Results: Cases with HCC (n = 21) and mixed HCC-cholangiocarcinoma (n = 2) comprised the cohort (N = 23). Male/female: 83%/17%. Race: White 56%, Asian 39%. BCLC stage: B 35%, C 65%. Etiology: HBsAg+ 26%, HCV+ 39%. Immune infiltrates (score ≥ 1 on a 0-3 scale) were present in 7/12 (58%) evaluable tumor samples. TP53 mutations were present in 14/23 (61%, 95% CI: 38.5, 80.0). CTNNB1 mutations were present in 7/23 (30%, 95% CI: 13.2, 52.9). There was no significant difference in mutation frequency between HBsAg+ and HCV+ cases. Both TP53 and CTNNB1 mutations were present in 4/23 (17%). CTNNB1 mutation was present in 2/7 (29%) cases with an immune infiltrate score ≥ 1 and in 1/5 (20%) with a score < 1 (not significant). Other mutations and variants will be reported. Conclusions: NGS in this advanced HCC cohort suggests a higher incidence of TP53 and of coexisting TP53 plus CTNNB1 mutations than has been reported in early-stage HCC, a finding that requires confirmation in a larger cohort. There was no clear relationship between these mutations, HCC etiology, and tumor immune infiltrates, though interpretation is limited by small sample sizes. Analyses are ongoing to explore associations between TP53 and CTNNB1 mutations and prognosis in this advanced HCC cohort.
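
The confidence intervals quoted for the mutation frequencies are consistent with exact (Clopper-Pearson) binomial intervals, and the immune-infiltrate comparison is the kind of small 2x2 table usually handled with Fisher's exact test; the abstract does not state which procedures were used, so the sketch below only assumes these conventional choices.

```python
# Sketch assuming conventional methods (exact binomial CIs and Fisher's exact test);
# the abstract does not state which procedures were actually used.
from statsmodels.stats.proportion import proportion_confint
from scipy.stats import fisher_exact

# TP53: 14/23 mutated; CTNNB1: 7/23 mutated (Clopper-Pearson 95% CIs)
for gene, k in [("TP53", 14), ("CTNNB1", 7)]:
    lo, hi = proportion_confint(k, 23, alpha=0.05, method="beta")
    print(f"{gene}: {k}/23 = {k/23:.1%}, 95% CI ({lo:.1%}, {hi:.1%})")

# CTNNB1 mutation vs immune infiltrate: 2/7 with score >= 1, 1/5 with score < 1
table = [[2, 5], [1, 4]]     # rows: infiltrate >=1 / <1; cols: CTNNB1 mutated / wild-type
odds_ratio, p = fisher_exact(table)
print("Fisher exact p =", round(p, 3))
```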


2019 ◽  
Vol 29 (7) ◽  
pp. 1799-1817
Author(s):  
Saswati Saha ◽  
Werner Brannath ◽  
Björn Bornkamp

Drug combination trials are often motivated by the fact that the individual drugs target the same disease but via different routes. A combination of such drugs may then have an overall better effect than the individual treatments, which has to be verified in clinical trials. Several statistical methods have been explored for comparing a fixed-dose combination therapy to each of its components, but extending these approaches to multiple dose combinations can be difficult and has not yet been fully investigated. In this paper, we propose two approaches that provide confirmatory assurance, with familywise error rate control, that the combination of two drugs at differing doses is more effective than either component dose alone. These approaches involve multiple comparisons in multilevel factorial designs, where the type I error is controlled, first, by bootstrapping tests and, second, by considering the least favorable null configurations for a family of union-intersection tests. The main advantage of the new approaches is that their implementation is simple. The implementation of these new approaches is illustrated with a real-data example from a blood pressure reduction trial. Extensive simulations are also conducted to evaluate the new approaches and benchmark them against existing ones. We also present an illustration of the relationship between the different approaches. We observed that the bootstrap provided some power advantages over the other approaches, with the disadvantage that the error rate may be somewhat inflated for small sample sizes.
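
A minimal sketch of the union-intersection (min-test) idea that underlies such comparisons, not the authors' bootstrap or least-favorable-configuration procedures: a given dose combination is declared superior only if it beats both of its component doses, so the evidence is summarized by the larger of the two component-wise p-values. All data and effect sizes below are hypothetical.

```python
# Sketch of the min-test (union-intersection) idea for one dose combination versus
# its two component monotherapies; illustrative only, not the authors' procedures.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
combo = rng.normal(12.0, 5.0, size=30)   # hypothetical blood-pressure reductions
drug_a = rng.normal(9.0, 5.0, size=30)
drug_b = rng.normal(8.0, 5.0, size=30)

# One-sided tests: combination better than each component.
p_vs_a = ttest_ind(combo, drug_a, alternative="greater").pvalue
p_vs_b = ttest_ind(combo, drug_b, alternative="greater").pvalue

# Union-intersection: reject only if BOTH component nulls are rejected,
# i.e. the max of the two p-values falls below alpha.
p_min_test = max(p_vs_a, p_vs_b)
print("min-test p-value:", round(p_min_test, 4))
```

With several dose combinations, the familywise error over the whole family of min-tests still has to be controlled, which is where the bootstrap and least-favorable-configuration arguments of the paper come in.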


2014 ◽  
Vol 11 (Suppl 1) ◽  
pp. S2 ◽  
Author(s):  
Joanna Zyla ◽  
Paul Finnon ◽  
Robert Bulman ◽  
Simon Bouffler ◽  
Christophe Badie ◽  
...  

2001 ◽  
Vol 09 (02) ◽  
pp. 105-121 ◽  
Author(s):  
Aniko Szabo ◽  
Andrei Yakovlev

In this paper we discuss some natural limitations in quantitative inference about the frequency, correlation, and ordering of genetic events occurring in the course of tumor development. We consider a simple, yet frequently used, experimental design under which independent tumors are examined once for the presence or absence of specific mutations of interest. The most typical factors that affect inference on the chronological order of genetic events are a possible dependence among mutation rates, the sampling bias that arises from the observation process, and small sample sizes. Our results clearly indicate that these three factors alone may dramatically distort the outcome of data analysis, thereby leading to estimates of limited utility as an underpinning for mechanistic models of carcinogenesis.
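
One of these factors, sampling bias from the observation process, can be illustrated with a small simulation (an illustrative sketch with made-up rates, not the authors' model): if two mutations arise independently but mutated tumors are much more likely to enter the series than unmutated ones, the observed presence/absence data suggest a spurious negative correlation between the events.

```python
# Illustrative sketch (not the authors' model): two truly independent mutations look
# negatively correlated when mutated tumors are over-represented in the sampled series.
import numpy as np

rng = np.random.default_rng(3)
p_a, p_b, n_tumors, n_reps = 0.3, 0.3, 50, 2000
odds_ratios = []
for _ in range(n_reps):
    a = rng.random(10 * n_tumors) < p_a        # mutation A, independent of B (true OR = 1)
    b = rng.random(10 * n_tumors) < p_b        # mutation B
    # Observation process: tumors carrying at least one of the two mutations are far
    # more likely to be sampled than tumors carrying neither.
    sampled = rng.random(a.size) < np.where(a | b, 1.0, 0.3)
    a_obs, b_obs = a[sampled][:n_tumors], b[sampled][:n_tumors]
    # 2x2 table with a 0.5 continuity correction for small counts
    n11 = np.sum(a_obs & b_obs) + 0.5
    n10 = np.sum(a_obs & ~b_obs) + 0.5
    n01 = np.sum(~a_obs & b_obs) + 0.5
    n00 = np.sum(~a_obs & ~b_obs) + 0.5
    odds_ratios.append((n11 * n00) / (n10 * n01))

print("true odds ratio: 1.0 (independent mutations)")
print("median estimate under biased sampling:", round(float(np.median(odds_ratios)), 2))
```

With only 50 tumors per replicate, the individual estimates also scatter widely, illustrating the additional distortion that small sample sizes contribute.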


2019 ◽  
Author(s):  
Ran Bi ◽  
Peng Liu

Abstract RNA sequencing (RNA-seq) technologies have been widely applied to study gene expression in recent years. Identifying differentially expressed (DE) genes across treatments is one of the major steps in RNA-seq data analysis. Most differential expression analysis methods rely on parametric assumptions, and it is not guaranteed that these assumptions are appropriate for real data. In this paper, we develop a semi-parametric Bayesian approach for differential expression analysis. More specifically, we model the RNA-seq count data with a Poisson-Gamma mixture model and propose a Bayesian mixture modeling procedure with a Dirichlet process as the prior model for the distribution of fold changes between the two treatment means. We develop Markov chain Monte Carlo (MCMC) posterior simulation using the Metropolis-Hastings algorithm to generate posterior samples for differential expression analysis while controlling the false discovery rate. Simulation results demonstrate that our proposed method outperforms other popular methods used for detecting DE genes.
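
As a concrete illustration of the kind of posterior simulation involved, here is a minimal random-walk Metropolis-Hastings sketch for a single gene's log fold change under Poisson counts with simple normal priors; this is a stand-in for, not an implementation of, the paper's Dirichlet-process mixture prior, and the counts and prior scales are hypothetical.

```python
# Minimal sketch: random-walk Metropolis-Hastings for one gene's log fold change
# under Poisson counts with normal priors; a simplified stand-in for the paper's
# Dirichlet-process mixture model.
import numpy as np
from scipy.stats import poisson, norm

rng = np.random.default_rng(11)
y1 = np.array([52, 61, 48])           # hypothetical counts, treatment 1
y2 = np.array([95, 110, 88])          # hypothetical counts, treatment 2

def log_posterior(log_mu, log_fc):
    # likelihood: y1 ~ Poisson(mu), y2 ~ Poisson(mu * exp(log_fc))
    ll = poisson.logpmf(y1, np.exp(log_mu)).sum()
    ll += poisson.logpmf(y2, np.exp(log_mu + log_fc)).sum()
    # vague normal priors on the log mean and the log fold change
    return ll + norm.logpdf(log_mu, 0, 10) + norm.logpdf(log_fc, 0, 2)

n_iter, step = 5000, 0.1
state = np.array([np.log(y1.mean()), 0.0])
current_lp = log_posterior(*state)
samples = np.empty((n_iter, 2))
for i in range(n_iter):
    proposal = state + rng.normal(0, step, size=2)
    proposal_lp = log_posterior(*proposal)
    if np.log(rng.random()) < proposal_lp - current_lp:   # MH acceptance step
        state, current_lp = proposal, proposal_lp
    samples[i] = state

log_fc_draws = samples[1000:, 1]                          # discard burn-in
print("posterior mean fold change:", round(float(np.exp(log_fc_draws.mean())), 2))
print("P(fold change > 1 | data):", round(float(np.mean(log_fc_draws > 0)), 3))
```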


2016 ◽  
Vol 38 (2) ◽  
pp. 21-25
Author(s):  
James B. Brown ◽  
Susan E. Celniker

In this article, we discuss emerging frontiers in RNA biology from a historical perspective. The field is currently undergoing yet another transformative expansion. RNA-seq has revealed that splicing, and, more generally, RNA processing is far more complex than expected, and the mechanisms of regulation are correspondingly sophisticated. Our understanding of the molecular machines involved in RNA metabolism is incomplete and derives from small sample sizes. Even if we manage to complete a catalogue of molecular species, RNA isoforms, and the ribonucleoprotein complexes that drive their genesis, the horizons of molecular dynamics and cell-type-specific processing mechanisms await. This is an exciting time to enter the study of RNA biology; analytical tools, wet and dry, are advancing rapidly, and each new measurement modality brings into view another new function or activity of versatile RNA. We have come a long way since the dawn of sequence-based RNA biology.


2016 ◽  
Vol 32 (4) ◽  
pp. 963-986 ◽  
Author(s):  
Sabine Krieg ◽  
Harm Jan Boonstra ◽  
Marc Smeets

Abstract Many target variables in official statistics follow a semicontinuous distribution with a mixture of zeros and continuously distributed positive values. Such variables are called zero-inflated. When reliable estimates for subpopulations with small sample sizes are required, model-based small-area estimators can be used, which improve the accuracy of the estimates by borrowing information from other subpopulations. In this article, three small-area estimators are investigated. The first estimator is the EBLUP, which can be considered the most common small-area estimator and is based on a linear mixed model that assumes normal distributions; the EBLUP is therefore misspecified in the case of zero-inflated variables. The other two small-area estimators are based on a model that takes zero inflation explicitly into account, with both a Bayesian and a frequentist approach considered. These small-area estimators are compared with each other and with design-based estimation in a simulation study with zero-inflated target variables. Both a simulation with artificial data and a simulation with real data from the Dutch Household Budget Survey are carried out. It is found that the small-area estimators improve accuracy compared to the design-based estimator. The amount of improvement strongly depends on the properties of the population and of the subpopulations of interest.
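
To make the comparison concrete, here is a minimal sketch in a simplified setting (not the article's survey design; areas, sample sizes, and distributions are invented): zero-inflated values are simulated for a handful of small areas, and the direct (design-based) area means are contrasted with predictions from a normal linear mixed model, which plays the role of the misspecified EBLUP.

```python
# Sketch (simplified setting, not the article's survey design): zero-inflated area data,
# direct area means versus predictions from a normal linear mixed model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n_areas, n_per_area = 10, 8                            # small sample in every area
area = np.repeat(np.arange(n_areas), n_per_area)
true_area_mean = rng.normal(100, 15, size=n_areas)

# Zero inflation: a share of units report zero, the rest a positive amount.
positive = rng.random(area.size) > 0.4
y = np.where(positive, rng.gamma(4.0, true_area_mean[area] / 4.0), 0.0)
df = pd.DataFrame({"y": y, "area": area})

direct = df.groupby("area")["y"].mean()                # design-based (direct) estimates

# Normal linear mixed model with a random area intercept; it ignores the zero
# inflation, mirroring the misspecified EBLUP discussed in the article.
fit = smf.mixedlm("y ~ 1", data=df, groups=df["area"]).fit()
overall = fit.fe_params["Intercept"]

for a, ranef in list(fit.random_effects.items())[:3]:  # show a few areas
    print(f"area {a}: direct {direct[a]:6.1f}   mixed-model {overall + ranef.iloc[0]:6.1f}")
```

The mixed-model predictions are shrunk toward the overall mean, which is where the gain in accuracy for small areas comes from, while the zero-inflation-aware models in the article aim to combine that shrinkage with a correctly specified distribution.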


2019 ◽  
Vol 80 (3) ◽  
pp. 499-521
Author(s):  
Ben Babcock ◽  
Kari J. Hodge

Equating and scaling in the context of small-sample exams, such as credentialing exams for highly specialized professions, have received increased attention in recent research. Investigators have proposed a variety of both classical and Rasch-based approaches to the problem. This study attempts to extend past research by (1) directly comparing classical and Rasch techniques for equating exam scores when sample sizes are small (N ≤ 100 per exam form) and (2) attempting to pool multiple forms' worth of data to improve estimation in the Rasch framework. We simulated multiple years of a small-sample exam program by resampling from a larger certification exam program's real data. Results showed that combining multiple administrations' worth of data via the Rasch model can lead to more accurate equating than classical methods designed to work well in small samples. WINSTEPS-based Rasch methods that used multiple exam forms' data worked better than Bayesian Markov chain Monte Carlo methods, as the prior distribution used to estimate the item difficulty parameters biased predicted scores when there were difficulty differences between exam forms.
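
For context on the classical side of that comparison, here is a minimal sketch of textbook mean and linear equating under a random-groups design; the small-sample classical methods studied in the article are refinements of these, and the score distributions below are hypothetical.

```python
# Sketch of textbook mean and linear equating under a random-groups design; the
# small-sample classical methods compared in the article build on these ideas.
import numpy as np

rng = np.random.default_rng(2)
form_x = rng.normal(70, 10, size=80)      # hypothetical raw scores on the new form
form_y = rng.normal(73, 9, size=80)       # hypothetical raw scores on the reference form

def mean_equate(x, x_scores, y_scores):
    # Shift new-form scores so the two form means match.
    return x + (y_scores.mean() - x_scores.mean())

def linear_equate(x, x_scores, y_scores):
    # Match both the mean and the standard deviation of the reference form.
    slope = y_scores.std(ddof=1) / x_scores.std(ddof=1)
    return y_scores.mean() + slope * (x - x_scores.mean())

raw = 75.0
print("mean-equated score:", round(mean_equate(raw, form_x, form_y), 2))
print("linear-equated score:", round(linear_equate(raw, form_x, form_y), 2))
```

With N ≤ 100 per form, the sample means and standard deviations that drive these conversions are themselves noisy, which is the motivation for both the small-sample classical variants and the pooled Rasch calibration examined in the study.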


2019 ◽  
Author(s):  
Dustin Fife

Data analysis is a risky endeavor, particularly among those unaware of its dangers. In the words of Cook and Campbell (1976; see also Cook, Campbell, and Shadish 2002), “Statistical Conclusion Validity” threatens all experiments that subject themselves to the dark arts of statistical magic. Although traditional statistics classes may advise against certain practices (e.g., multiple comparisons, small sample sizes, violating normality), they may fail to cover others (e.g., outlier detection and violating linearity). More common, perhaps, is that researchers may fail to remember them. In this paper, rather than rehashing old warnings and diatribes against this practice or that, I instead advocate a general statistical analysis strategy. This graphically based, eight-step strategy promises to resolve the majority of statistical traps researchers may fall into, without requiring them to remember long lists of problematic statistical practices. These steps will assist in preventing both Type I and Type II errors and will yield critical insights about the data that would otherwise have been missed. I conclude with an applied example that shows how the eight steps highlight data problems that would not be detected with standard statistical practices.
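
In that spirit, here is an illustrative sketch of a graphics-first workflow (not the paper's specific eight steps; the data are simulated): plotting the raw data and the residuals of a naive model before reading off any p-values is often enough to surface the outliers and nonlinearity mentioned above.

```python
# Illustrative sketch of a graphics-first workflow (not the paper's eight steps):
# plot the raw data, fit a simple model, then inspect residuals for outliers and
# nonlinearity before trusting any p-values.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=60)
y = 2 + 1.5 * x + 0.3 * x**2 + rng.normal(0, 3, size=60)   # curved relationship
y[5] += 40                                                  # one gross outlier

slope, intercept = np.polyfit(x, y, 1)                      # naive straight-line fit
residuals = y - (intercept + slope * x)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(x, y)
ax1.plot(np.sort(x), intercept + slope * np.sort(x), color="red")
ax1.set_title("raw data with linear fit")
ax2.scatter(x, residuals)
ax2.axhline(0, color="red")
ax2.set_title("residuals: curvature and an outlier stand out")
plt.tight_layout()
plt.show()
```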

