Significance tests for analyzing gene expression data with small sample sizes

2019 ◽  
Vol 35 (20) ◽  
pp. 3996-4003
Author(s):  
Insha Ullah ◽  
Sudhir Paul ◽  
Zhenjie Hong ◽  
You-Gan Wang

Abstract. Motivation: Under two biologically different conditions, we are often interested in identifying differentially expressed genes. When a large number of genes must be filtered or ranked, the assumption of equal variances in the two groups is usually violated for many of them. In these cases, exact tests are unavailable and Welch’s approximate test is the most reliable one. Welch’s test involves two layers of approximation: the distribution of the statistic is approximated by a t-distribution, which in turn depends on approximate degrees of freedom. This study attempts to improve upon Welch’s approximate test by avoiding one layer of approximation. Results: We introduce a new distribution that generalizes the t-distribution and propose a Monte Carlo based test that uses only one layer of approximation for statistical inference. Experimental results based on extensive simulation studies show that the Monte Carlo based tests enhance statistical power and perform better than Welch’s t-approximation, especially when the equal variance assumption is not met and the sample with the larger variance has the smaller sample size. We analyzed two gene-expression datasets, namely the childhood acute lymphoblastic leukemia gene-expression dataset with 22 283 genes and the Golden Spike dataset produced by a controlled experiment with 13 966 genes. The new test identified additional genes of interest in both datasets. Some of these genes have been shown in the medical literature to play important roles. Availability and implementation: The R package mcBFtest is available on CRAN, and R scripts to reproduce all reported results are available at the GitHub repository, https://github.com/iullah1980/MCTcodes. Supplementary information: Supplementary data are available at Bioinformatics online.
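
The two layers of approximation in Welch’s test mentioned above can be made concrete with a small sketch. It computes the standard Welch statistic and the Welch-Satterthwaite degrees of freedom using only the Python standard library. The function name `welch_t` is hypothetical, and this is the textbook Welch procedure, not the Monte Carlo test proposed in the paper or the mcBFtest API.

```python
from statistics import mean, variance


def welch_t(x, y):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom.

    Hypothetical helper for illustration; each sample needs at least 2 values.
    """
    n1, n2 = len(x), len(y)
    # Squared standard error of each group mean (unbiased sample variances).
    se1 = variance(x) / n1
    se2 = variance(y) / n2
    # Layer 1: this statistic is treated as approximately t-distributed.
    t = (mean(x) - mean(y)) / (se1 + se2) ** 0.5
    # Layer 2: the degrees of freedom of that t-distribution are themselves
    # approximated from the data (Welch-Satterthwaite formula).
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


if __name__ == "__main__":
    t, df = welch_t([1, 2, 3, 4], [2, 4, 6, 8, 10])
    print(f"t = {t:.4f}, approximate df = {df:.4f}")
```

The approximate degrees of freedom always fall between min(n1, n2) - 1 and n1 + n2 - 2, which is one way to sanity-check an implementation.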

2016 ◽  
Author(s):  
Brian Keith Lohman ◽  
Jesse N Weber ◽  
Daniel I Bolnick

RNAseq is a relatively new tool for ecological genetics that offers researchers insight into changes in gene expression in response to a myriad of natural or experimental conditions. However, standard RNAseq methods (e.g., Illumina TruSeq® or NEBNext®) can be cost prohibitive, especially when study designs require large sample sizes. Consequently, RNAseq is often underused as a method, or is applied to small sample sizes that confer poor statistical power. Low-cost RNAseq methods could therefore enable far greater and more powerful applications of transcriptomics in ecological genetics and beyond. Standard mRNAseq is costly partly because one sequences portions of the full length of all transcripts. Such whole-mRNA data are redundant for estimates of relative gene expression. TagSeq is an alternative method that focuses sequencing effort on the 3′ end of mRNAs, thereby reducing the necessary sequencing depth per sample, and thus cost. Here we present a revised TagSeq protocol, and compare its performance against NEBNext®, the gold-standard whole mRNAseq method. We built both TagSeq and NEBNext® libraries from the same biological samples, each spiked with control RNAs. We found that TagSeq measured the control RNA distribution more accurately than NEBNext®, for a fraction of the cost per sample (~10%). The higher accuracy of TagSeq was particularly apparent for transcripts of moderate to low abundance. Technical replicates of TagSeq libraries were highly correlated with one another and were also correlated with NEBNext® results. Overall, we show that our modified TagSeq protocol is an efficient alternative to traditional whole mRNAseq, offering researchers comparable data at greatly reduced cost.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Florent Le Borgne ◽  
Arthur Chatton ◽  
Maxime Léger ◽  
Rémi Lenain ◽  
Yohann Foucher

Abstract. In clinical research, there is a growing interest in the use of propensity score-based methods to estimate causal effects. G-computation (GC) is an alternative because of its high statistical power. Machine learning is also increasingly used because of its possible robustness to model misspecification. In this paper, we aimed to propose an approach that combines machine learning and G-computation when both the outcome and the exposure status are binary and that is able to deal with small samples. We evaluated the performance of several methods, including penalized logistic regressions, a neural network, a support vector machine, boosted classification and regression trees, and a super learner, through simulations. We proposed six different scenarios characterised by various sample sizes, numbers of covariates and relationships between covariates, exposure statuses, and outcomes. We have also illustrated the application of these methods, in which they were used to estimate the efficacy of barbiturates prescribed during the first 24 h of an episode of intracranial hypertension. In the context of GC, for estimating the individual outcome probabilities in two counterfactual worlds, we reported that the super learner tended to outperform the other approaches in terms of both bias and variance, especially for small sample sizes. The support vector machine performed well, but its mean bias was slightly higher than that of the super learner. In the investigated scenarios, G-computation associated with the super learner was a well-performing method for drawing causal inferences, even from small sample sizes.
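
As a rough illustration of the G-computation step described above (predicting each individual's outcome probability in the two counterfactual worlds, then averaging), the sketch below uses a deliberately simple outcome model: the observed outcome proportion within each covariate-by-exposure stratum. That stand-in model, the single discrete covariate, and the name `g_computation` are assumptions made for illustration; the paper's approach fits machine-learning models such as a super learner instead.

```python
from collections import defaultdict


def g_computation(rows):
    """Estimate P(Y=1 | do(A=1)), P(Y=1 | do(A=0)) and their difference.

    rows: iterable of (covariate, exposure, outcome) tuples with binary
    exposure and outcome. Hypothetical helper; the outcome model is a plain
    stratum proportion standing in for a fitted learner.
    """
    rows = list(rows)
    counts = defaultdict(lambda: [0, 0])  # (covariate, exposure) -> [n, events]
    for cov, exposure, outcome in rows:
        counts[(cov, exposure)][0] += 1
        counts[(cov, exposure)][1] += outcome

    def predict(cov, exposure):
        n, events = counts[(cov, exposure)]
        if n == 0:
            raise ValueError(f"no data for covariate={cov!r}, exposure={exposure}")
        return events / n

    # Predict both counterfactual outcomes for every individual, then average
    # over the observed covariate distribution.
    p1 = sum(predict(cov, 1) for cov, _, _ in rows) / len(rows)
    p0 = sum(predict(cov, 0) for cov, _, _ in rows) / len(rows)
    return p1, p0, p1 - p0


if __name__ == "__main__":
    data = (
        [(0, 1, 1)] * 2 + [(0, 1, 0)] * 2 + [(0, 0, 1)] + [(0, 0, 0)] * 3
        + [(1, 1, 1)] * 2 + [(1, 0, 1)] + [(1, 0, 0)]
    )
    print(g_computation(data))
```

The sketch raises an error when a stratum has no exposed or no unexposed individuals, which is exactly the sparse-data situation where a smoother outcome model becomes necessary in small samples.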


2019 ◽  
Vol 36 (8) ◽  
pp. 2608-2610
Author(s):  
Aritro Nath ◽  
Jeremy Chang ◽  
R Stephanie Huang

Abstract. Summary: MicroRNAs (miRNAs) are critical post-transcriptional regulators of gene expression. Due to challenges in accurate profiling of small RNAs, the vast majority of public transcriptome datasets lack reliable miRNA profiles. However, the biological consequence of miRNA activity in the form of altered protein-coding gene (PCG) expression can be captured using machine-learning algorithms. Here, we present iMIRAGE (imputed miRNA activity from gene expression), a convenient tool to predict miRNA expression using PCG expression of the test datasets. The iMIRAGE package provides an integrated workflow for normalization and transformation of miRNA and PCG expression data, along with the option to utilize predicted miRNA targets to impute miRNA activity from independent test PCG datasets. Availability and implementation: The iMIRAGE package for R, along with package documentation and vignette, is available at https://aritronath.github.io/iMIRAGE/index.html. Supplementary information: Supplementary data are available at Bioinformatics online.


1991 ◽  
Vol 9 (1) ◽  
pp. 139-144 ◽  
Author(s):  
J Ochs ◽  
J Rodman ◽  
M Abromowitch ◽  
R Kavanagh ◽  
M Harris ◽  
...  

Teniposide (VM-26) can increase intracellular methotrexate (MTX) and its polyglutamate derivatives in vitro and thus has the potential to improve the therapeutic index of regimens containing MTX. In this phase II study, children and adolescents with acute lymphoblastic leukemia (ALL) in first or second marrow relapse were randomly assigned to receive either simultaneous (n = 11) or sequential (n = 12) continuous infusions of MTX and VM-26 prior to reinduction. Infusions of VM-26 were begun 12 hours after completion of MTX infusion in the sequential group. Dosages were individually adjusted to maintain plasma concentration levels of 10 µM for MTX and 15 µM for VM-26; total infusion times were 24 and 72 hours, respectively. Significant toxicity in the first six patients who received the scheduled 72-hour VM-26 infusion (including one drug-related death) prompted a 50% reduction in infusion duration. The reduced dose was associated with similar but more manageable toxicity. Examination of bone marrow aspirates 10 days after therapy was begun showed one complete and two partial marrow remissions; a fourth patient who had an aplastic marrow on day 10 received no further chemotherapy and had a complete remission (CR) documented on day 31. There was no obvious clinical advantage associated with either infusion schedule, although small sample sizes preclude definitive conclusions. The 17% response rate to the MTX/VM-26 therapeutic window in patients with refractory disease suggests the need for further investigation to evaluate alternative schedules and concomitant therapy for this drug combination.


2016 ◽  
Vol 2 (1) ◽  
pp. 41-54
Author(s):  
Ashleigh Saunders ◽  
Karen E. Waldie

Purpose – Autism spectrum disorder (ASD) is a lifelong neurodevelopmental condition for which there is no known cure. The rate of psychiatric comorbidity in autism is extremely high, which raises questions about the nature of the co-occurring symptoms. It is unclear whether these additional conditions are true comorbid conditions, or can simply be accounted for through the ASD diagnosis. The paper aims to discuss this issue. Design/methodology/approach – A number of questionnaires and a computer-based task were used in the current study. The authors asked the participants about symptoms of ASD, attention deficit hyperactivity disorder (ADHD) and anxiety, as well as overall adaptive functioning. Findings – The results demonstrate that each condition, in its pure form, can be clearly differentiated from one another (and from neurotypical controls). Further analyses revealed that when ASD occurs together with anxiety, anxiety appears to be a separate condition. In contrast, there is no clear behavioural profile for when ASD and ADHD co-occur. Research limitations/implications – First, due to small sample sizes, some analyses performed were targeted to specific groups (i.e. comparing ADHD, ASD to comorbid ADHD+ASD). Larger sample sizes would have given the statistical power to perform a full-scale comparative analysis of all experimental groups when split by their comorbid conditions. Second, males were over-represented in the ASD group and females were over-represented in the anxiety group, due to the uneven gender balance in the prevalence of these conditions. Lastly, the main profiling techniques used were questionnaires. Clinical interviews would have been preferable, as they give a more objective account of behavioural difficulties. Practical implications – The rate of psychiatric comorbidity in autism is extremely high, which raises questions about the nature of the co-occurring symptoms. It is unclear whether these additional conditions are true comorbid conditions, or can simply be accounted for through the ASD diagnosis. Social implications – This information will be important, not only to healthcare practitioners when administering a diagnosis, but also to therapists who need to apply evidence-based treatment to comorbid and stand-alone conditions. Originality/value – This study is the first to investigate the nature of co-existing conditions in ASD in a New Zealand population.


Author(s):  
Weiguang Mao ◽  
Javad Rahimikollu ◽  
Ryan Hausler ◽  
Maria Chikina

Abstract. Motivation: RNA-seq technology provides unprecedented power in the assessment of transcription abundance and can be used to perform a variety of downstream tasks such as inference of gene-correlation networks and eQTL discovery. However, raw gene expression values have to be normalized for nuisance biological variation and technical covariates, and different normalization strategies can lead to dramatically different results in the downstream study. Results: We describe a generalization of singular value decomposition-based reconstruction for which the common techniques of whitening, rank-k approximation and removing the top k principal components are special cases. Our simple three-parameter transformation, DataRemix, can be tuned to reweigh the contribution of hidden factors and reveal otherwise hidden biological signals. In particular, we demonstrate that the method can effectively prioritize biological signals over noise without leveraging external dataset-specific knowledge, and can outperform normalization methods that make explicit use of known technical factors. We also show that DataRemix can be efficiently optimized via a Thompson sampling approach, which makes it feasible for computationally expensive objectives such as eQTL analysis. Finally, we apply our method to the Religious Orders Study and Memory and Aging Project dataset, and we report what to our knowledge is the first replicable trans-eQTL effect in human brain. Availability and implementation: DataRemix is an R package which is freely available at GitHub (https://github.com/wgmao/DataRemix). Supplementary information: Supplementary data are available at Bioinformatics online.


2021 ◽  
Author(s):  
David Gerard

Abstract. Many bioinformatics pipelines include tests for equilibrium. Tests for diploids are well studied and widely available, but extending these approaches to autopolyploids is hampered by the presence of double reduction, the co-migration of sister chromatid segments into the same gamete during meiosis. Though a hindrance for equilibrium tests, double reduction rates are quantities of interest in their own right, as they provide insights about the meiotic behavior of autopolyploid organisms. Here, we develop procedures to (i) test for equilibrium while accounting for double reduction, and (ii) estimate double reduction given equilibrium. To do so, we take two approaches: a likelihood approach, and a novel U-statistic minimization approach that we show generalizes the classical equilibrium χ² test in diploids. For small sample sizes and uncertain genotypes, we further develop a bootstrap procedure based on our U-statistic to test for equilibrium. Finally, we highlight the difficulty in distinguishing between random mating and equilibrium in tetraploids at biallelic loci. Our methods are implemented in the hwep R package on GitHub https://github.com/dcgerard/hwep.
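
For orientation, the classical diploid χ² test that the abstract says is generalized can be sketched as follows, assuming that "equilibrium" here means Hardy-Weinberg equilibrium at a biallelic locus. The function name `hwe_chisq` is hypothetical; this is the textbook diploid test only, not the U-statistic, likelihood, or bootstrap procedures of the paper, and not the hwep API.

```python
import math


def hwe_chisq(n_aa, n_ab, n_bb):
    """Classical chi-square equilibrium test for a biallelic diploid locus.

    Takes genotype counts (AA, AB, BB) and returns the statistic and its
    p-value on 1 degree of freedom. Hypothetical helper for illustration.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # estimated frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        raise ValueError("monomorphic locus: the test is undefined")
    # Expected genotype counts under equilibrium: n*p^2, 2*n*p*q, n*q^2.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Chi-square survival function with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


if __name__ == "__main__":
    print(hwe_chisq(25, 50, 25))  # counts exactly at equilibrium proportions
    print(hwe_chisq(50, 0, 50))   # complete heterozygote deficit
```

The χ² approximation is unreliable when expected counts are small, which is consistent with the abstract's move to a bootstrap procedure for small sample sizes.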


2020 ◽  
Vol 36 (15) ◽  
pp. 4301-4308
Author(s):  
Stephan Seifert ◽  
Sven Gundlach ◽  
Olaf Junge ◽  
Silke Szymczak

Abstract. Motivation: High-throughput technologies allow comprehensive characterization of individuals on many molecular levels. However, training computational models to predict disease status based on omics data is challenging. A promising solution is the integration of external knowledge about structural and functional relationships into the modeling process. We compared four published random forest-based approaches using two simulation studies and nine experimental datasets. Results: The self-sufficient prediction error approach should be applied when large numbers of relevant pathways are expected. The competing methods hunting and learner of functional enrichment should be used when low numbers of relevant pathways are expected or the most strongly associated pathways are of interest. The hybrid approach synthetic features is not recommended because of its high false discovery rate. Availability and implementation: An R package providing functions for data analysis and simulation is available at GitHub (https://github.com/szymczak-lab/PathwayGuidedRF). An accompanying R data package (https://github.com/szymczak-lab/DataPathwayGuidedRF) stores the processed and quality-controlled experimental datasets downloaded from Gene Expression Omnibus (GEO). Supplementary information: Supplementary data are available at Bioinformatics online.


2018 ◽  
Vol 23 (4) ◽  
pp. 289-299 ◽  
Author(s):  
Wim Meeus

Abstract. The developmental continuum of identity status has been a topic of theoretical debate since the early 1980s. A recent meta-analysis and recent studies with dual cycle models lead to two conclusions: (1) during adolescence there is systematic identity maturation; (2) there are two continuums of identity status progression. Both continuums show that in general adolescents move from transient identity statuses to identity statuses that mark the relative endpoints of development: from diffusion to closure, and from searching moratorium and moratorium to closure and achievement. This pattern can be framed as development from identity formation to identity maintenance. In Identity Status Interview research using Marcia’s model, not the slightest indication of a continuum of identity development was found. This may be due to the small sample sizes of the various studies, which lead to low statistical power to detect differences in identity status transitions, as well as to developmental inconsistencies in Marcia’s model. Findings from this review are interpreted in terms of life-span developmental psychology.

