A field-wide assessment of differential high throughput sequencing reveals widespread bias

Abstract Here we assess reproducibility and inferential quality in the field of differential HT-seq, based on analysis of datasets submitted 2008-2019 to the NCBI GEO data repository. Analysis of GEO submission file structures places an overall 59% upper limit on reproducibility. We further show that only 23% of experiments resulted in theoretically expected p value histogram shapes, although both reproducibility and p value distributions show marked improvement over time. Uniform p value histogram shapes, indicative of <100 true effects, were extremely few. Our calculations of π0, the fraction of true nulls, showed that 36% of experiments have π0 <0.5, meaning that in over a third of experiments most RNAs were estimated to change their expression level upon experimental treatment. Both the fraction of different p value histogram types and π0 values are strongly associated with the software used for calculating these p values by the original authors, indicating widespread bias.
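To make the π0 statement above concrete, here is a minimal sketch of how a fraction of true nulls can be estimated from a list of p values. It uses Storey's estimator with a single threshold `lam`, which is an assumption made for illustration: the abstract does not say which estimator the authors used, and the function name `estimate_pi0` is hypothetical.

```python
def estimate_pi0(p_values, lam=0.5):
    """Estimate pi0, the fraction of true null hypotheses.

    Assumption: Storey's single-threshold estimator,
        pi0(lam) = #{p > lam} / (m * (1 - lam)),
    chosen here for illustration; it is not necessarily the
    method used in the study summarized above.
    """
    if not p_values:
        raise ValueError("p_values must not be empty")
    if not 0 <= lam < 1:
        raise ValueError("lam must lie in [0, 1)")
    m = len(p_values)
    # True nulls give uniformly distributed p values, so the region
    # above lam is assumed to be populated almost only by nulls.
    n_above = sum(1 for p in p_values if p > lam)
    # The raw ratio can exceed 1 by chance, so cap it at 1.
    return min(1.0, n_above / (m * (1 - lam)))


# Evenly spread p values mimic an experiment with no true effects.
all_null = [(i + 0.5) / 1000 for i in range(1000)]
# Half evenly spread, half very small: half of the tests are true effects.
half_effects = [(i + 0.5) / 500 for i in range(500)] + [0.001] * 500

print(estimate_pi0(all_null))
print(estimate_pi0(half_effects))
```

Under this estimator, π0 < 0.5 corresponds to fewer than a quarter of all p values lying above 0.5, which is the situation the abstract reports for 36% of experiments.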


High-Throughput miRNA Sequencing Reveals a Field Effect in Gastric Cancer and Suggests an Epigenetic Network Mechanism

Bioinformatics and Biology Insights ◽

10.4137/bbi.s24066 ◽

2015 ◽

Vol 9 ◽

pp. BBI.S24066 ◽

Cited By ~ 19

Author(s):

Monica B. Assumpção ◽

Fabiano C. Moreira ◽

Igor G. Hamoy ◽

Leandro Magalhães ◽

Amanda Vidal ◽

...

Keyword(s):

Gastric Cancer ◽

High Throughput ◽

Field Effect ◽

High Throughput Sequencing ◽

Gastric Carcinogenesis ◽

Field Cancerization ◽

P Value ◽

Primary Tumors ◽

Field Effects ◽

Network Mechanism

Field effect in cancer, also called “field cancerization”, attempts to explain the development of multiple primary tumors and locally recurrent cancer. The concept of field effect in cancer has been reinforced, since molecular alterations were found in tumor-adjacent tissues with normal histopathological appearances. With the aim of investigating field effects in gastric cancer (GC), we conducted high-throughput sequencing of the miRnome of four GC samples and their respective tumor-adjacent tissues and compared them with the miRnome of a gastric antrum sample from patients without GC, assuming that tumor-adjacent tissues could not be considered as normal tissues. The global number of miRNAs and read counts was highest in tumor samples, followed by tumor-adjacent and normal samples. Analyzing the miRNA expression profile of tumor-adjacent miRNA, hsa-miR-3131, hsa-miR-664, hsa-miR-483, and hsa-miR-150 were significantly downregulated compared with the antrum without tumor tissue (P-value < 0.01; fold-change < 5). Additionally, hsa-miR-3131, hsa-miR-664, and hsa-miR-150 were downregulated (P-value < 0.001) in all paired samples of tumor and tumor-adjacent tissues, compared with antrum without tumor mucosa. The field effect was clearly demonstrated in gastric carcinogenesis by an epigenetics-based approach, and potential biomarkers of the GC field effect were identified. The elevated expression of miRNAs in adjacent tissues and tumor tissues may indicate that a cascade of events takes place during gastric carcinogenesis, reinforcing the notion of field effects. This phenomenon seems to be linked to DNA methylation patterns in cancer and suggests the involvement of an epigenetic network mechanism.


The underestimation of global microbial diversity

10.7287/peerj.preprints.2357 ◽

2016 ◽

Author(s):

Jay T Lennon ◽

Kenneth J Locey

Keyword(s):

Microbial Diversity ◽

High Throughput ◽

High Throughput Sequencing ◽

Sequencing Technology ◽

Recent Commentary ◽

Quantum Leap ◽

Over Time

In a recent commentary, Amann and Rosselló-Móra summarize how the census of Bacteria and Archaea has changed over time (1). For decades, the number of recognized microbial taxa was underestimated owing to limitations associated with culture-based methods and the rules of nomenclature. The authors describe a "quantum leap" in the estimates of global microbial diversity following advances in high-throughput sequencing technology. Despite this, Amann and Rosselló-Móra project that a complete census of microbial diversity will be reached within a few years culminating in the lower millions of taxa (1). While perhaps attractively optimistic to some, this presumption is misleading for the following reasons.


The underestimation of global microbial diversity

10.7287/peerj.preprints.2357v1 ◽

2016 ◽

Author(s):

Jay T Lennon ◽

Kenneth J Locey

Keyword(s):

Microbial Diversity ◽

High Throughput ◽

High Throughput Sequencing ◽

Sequencing Technology ◽

Recent Commentary ◽

Quantum Leap ◽

Over Time


EZH2 and CD79B mutational status over time in B‐cell non‐Hodgkin lymphomas detected by high‐throughput sequencing using minimal samples

Cancer Cytopathology ◽

10.1002/cncy.21262 ◽

2013 ◽

Vol 121 (7) ◽

pp. 377-386 ◽

Cited By ~ 20

Author(s):

Mauro Ajaj Saieg ◽

William R. Geddie ◽

Scott L. Boerner ◽

Denis Bailey ◽

Michael Crump ◽

...

Keyword(s):

B Cell ◽

High Throughput ◽

High Throughput Sequencing ◽

Mutational Status ◽

Over Time


Almost significant: trends and P values in the use of phrases describing marginally significant results in 567,758 randomized controlled trials published between 1990 and 2020

10.1101/2021.03.01.21252701 ◽

2021 ◽

Author(s):

Willem M Otte ◽

Christiaan H Vinkers ◽

Philippe Habets ◽

David G P van IJzendoorn ◽

Joeri K Tijdink

Keyword(s):

Confidence Intervals ◽

Bayes Factor ◽

Statistical Significance ◽

Clinical Results ◽

Effect Sizes ◽

P Value ◽

Controlled Trials ◽

Positive Trend ◽

P Values ◽

Over Time

Abstract Objective: To quantitatively map how non-significant outcomes have been reported in randomised controlled trials (RCTs) over the last thirty years. Design: Quantitative analysis of English full texts of 567,758 RCTs recorded in PubMed (81.5% of all published RCTs). Methods: We determined the exact presence of 505 pre-defined phrases denoting results that do not reach formal statistical significance (P<0.05) in 567,758 RCT full texts published between 1990 and 2020. Phrase data were modeled with Bayesian linear regression. Evidence for temporal change was obtained through Bayes-factor analysis. In a randomly sampled subset, the associated P values were manually extracted. Results: We identified 61,741 phrases indicating close to significant results in 49,134 (8.65%; 95% confidence interval (CI): 8.58–8.73) RCTs. The overall prevalence of these phrases remained stable over time, with the most prevalent phrases being ‘marginally significant’ (in 7,735 RCTs), ‘all but significant’ (7,015), ‘a nonsignificant trend’ (3,442), ‘failed to reach statistical significance’ (2,578) and ‘a strong trend’ (1,700). The strongest evidence for a temporal prevalence increase was found for ‘a numerical trend’, ‘a positive trend’, ‘an increasing trend’ and ‘nominally significant’. The phrases ‘all but significant’, ‘approaches statistical significance’, ‘did not quite reach statistical significance’, ‘difference was apparent’, ‘failed to reach statistical significance’ and ‘not quite significant’ decreased over time. In the randomly sampled subset, 68.1% (CI: 67.3–69.0) of the 11,926 identified P values ranged between 0.05 and 0.15 (median 0.06). Conclusions: Our results demonstrate that phrases describing marginally significant results are regularly used in RCTs to report P values close to but above the dominant 0.05 cut-off. The phrase prevalence remained stable over time, despite all efforts to change the focus from P < 0.05 to reporting effect sizes and corresponding confidence intervals. To improve transparency and enhance responsible interpretation of RCT results, researchers, clinicians, reviewers, and editors need to abandon the focus on formal statistical significance thresholds and stimulate reporting of exact P values with corresponding effect sizes and confidence intervals.

Significance statement: The power of language to modify the reader’s perception of how to interpret biomedical results should not be underestimated. Misreporting and misinterpretation are urgent problems in RCT output. This may be at least partially related to the statistical paradigm of the 0.05 significance threshold. Sometimes, creativity and inventive strategies of clinical researchers may be used – describing their clinical results to be ‘almost significant’ – to get their data published. This phrasing may convince readers about the value of their work. Since 2005 there has been increasing concern that most current published research findings are false, and it has been generally advised to switch from null hypothesis significance testing to using effect sizes, estimation, and cumulation of evidence. Whether this ‘new statistics’ approach has worked should be reflected in the phrases describing non-significant results of RCTs, in particular in changing patterns of phrases describing P values just above 0.05. More than five hundred phrases potentially suited to report or discuss non-significant results were searched in over half a million published RCTs. A stable overall prevalence of these phrases (10.87%, CI: 10.79–10.96; N: 61,741), with associated P values close to 0.05, was found in the last three decades, with strong increases or decreases in individual phrases describing these near-significant results. The pressure to pass the scientific peer-review barrier may function as an incentive to use effective phrases to mask non-significant results in RCTs. However, this keeps researchers preoccupied with hypothesis testing rather than presenting outcome estimations with uncertainty. The effect of language on getting RCT results published should ideally be minimal to steer evidence-based medicine away from overselling of research results, unsubstantiated claims about the efficacy of certain RCTs and to prevent an over-reliance on P value cutoffs. Our exhaustive search suggests that presenting RCT findings remains a struggle when P values approach the carved-in-stone threshold of 0.05.
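The reported prevalence of 8.65% with a 95% CI of 8.58–8.73 can be checked from the two counts given in the abstract (49,134 of 567,758 RCTs). The sketch below assumes a normal-approximation (Wald) interval; the abstract does not state which interval method was used, and the function name `wald_interval` is hypothetical.

```python
from math import sqrt


def wald_interval(successes, n, z=1.96):
    """Return (proportion, lower, upper) for a binomial proportion.

    Assumption: normal-approximation (Wald) interval with z = 1.96
    for a nominal 95% level. The source does not name its method.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    p = successes / n
    half_width = z * sqrt(p * (1 - p) / n)
    return p, p - half_width, p + half_width


# Counts taken from the abstract: RCTs containing at least one phrase.
p, lower, upper = wald_interval(49_134, 567_758)
print(f"{100 * p:.2f}% (95% CI: {100 * lower:.2f}-{100 * upper:.2f})")
```

With so many trials the interval is narrow, which is why a prevalence difference of a tenth of a percentage point is already outside it.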


The rise and fall of SARS-CoV-2 variants and the mutational profile of Omicron

10.1101/2021.12.16.473096 ◽

2021 ◽

Author(s):

Tanner Roy Wiegand ◽

Aidan McVey ◽

Anna Nemudraia ◽

Artem Nemudryi ◽

Blake Wiedenheft

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Viral Proteins ◽

Rapid Identification ◽

Etiological Agent ◽

Spike Protein ◽

Mutation Rates ◽

Selective Pressures ◽

Sequencing Technologies ◽

Over Time

In late December of 2019, high throughput sequencing technologies enabled rapid identification of SARS-CoV-2 as the etiological agent of COVID-19, and global sequencing efforts are now a critical tool for monitoring the ongoing spread and evolution of this virus. Here, we analyze a subset (n=87,032) of all publicly available SARS-CoV-2 genomes (n=~5.6 million) that were randomly selected, but equally distributed over the course of the pandemic. We plot the appearance of new variants of concern (VOCs) over time and show that the mutation rates in Omicron viruses are significantly greater than those in previously identified SARS-CoV-2 variants. Mutations in Omicron are primarily restricted to the spike protein, while 25 other viral proteins—including those involved in SARS-CoV-2 replication—are highly conserved. Collectively, this suggests that the genetic distinction of Omicron primarily arose from selective pressures on the spike, and that the fidelity of replication of this variant has not been altered.


The P Value Line Dance: When Does the Music Stop?

Journal of Medical Internet Research ◽

10.2196/21345 ◽

2020 ◽

Vol 22 (8) ◽

pp. e21345 ◽

Cited By ~ 3

Author(s):

Marcus Bendtsen

Keyword(s):

Decision Making ◽

Bayesian Methods ◽

Null Hypothesis ◽

Driving Force ◽

P Value ◽

Type I ◽

P Values ◽

Interim Analyses ◽

Back Seat ◽

Over Time

When should a trial stop? Such a seemingly innocent question evokes concerns of type I and II errors among those who believe that certainty can be the product of uncertainty and among researchers who have been told that they need to carefully calculate sample sizes, consider multiplicity, and not spend P values on interim analyses. However, the endeavor to dichotomize evidence into significant and nonsignificant has led to the basic driving force of science, namely uncertainty, to take a back seat. In this viewpoint we discuss that if testing the null hypothesis is the ultimate goal of science, then we need not worry about writing protocols, consider ethics, apply for funding, or run any experiments at all—all null hypotheses will be rejected at some point—everything has an effect. The job of science should be to unearth the uncertainties of the effects of treatments, not to test their difference from zero. We also show the fickleness of P values, how they may one day point to statistically significant results; and after a few more participants have been recruited, the once statistically significant effect suddenly disappears. We show plots which we hope would intuitively highlight that all assessments of evidence will fluctuate over time. Finally, we discuss the remedy in the form of Bayesian methods, where uncertainty leads; and which allows for continuous decision making to stop or continue recruitment, as new data from a trial is accumulated.
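The fluctuation of P values during recruitment described above can be illustrated by recomputing a P value after every new participant. The sketch below is an assumption-laden toy: it uses a two-sided one-sample z-test against zero with a known standard deviation, simulated outcomes, and hypothetical function names; it does not reproduce the plots or data of the viewpoint.

```python
import random
from math import erfc, sqrt


def two_sided_p_from_z(z):
    """Two-sided P value for a standard normal test statistic."""
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    return erfc(abs(z) / sqrt(2))


def running_p_values(outcomes, known_sd=1.0):
    """Recompute the P value each time one more outcome arrives.

    Assumption: one-sample z-test of mean = 0 with known SD,
    chosen only to keep the illustration self-contained.
    """
    p_values = []
    total = 0.0
    for n, x in enumerate(outcomes, start=1):
        total += x
        z = (total / n) / (known_sd / sqrt(n))
        p_values.append(two_sided_p_from_z(z))
    return p_values


# Simulated trial with a small true effect (0.1 SD); seed fixed for repeatability.
random.seed(1)
outcomes = [random.gauss(0.1, 1.0) for _ in range(200)]
trajectory = running_p_values(outcomes)

# Count how often the running P value crosses the 0.05 line in either direction.
crossings = sum(
    1 for a, b in zip(trajectory, trajectory[1:]) if (a < 0.05) != (b < 0.05)
)
print(crossings)
```

Each crossing is a point at which stopping the trial would have flipped the verdict between significant and nonsignificant.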


onlineFDR: an R package to control the false discovery rate for growing data repositories

Bioinformatics ◽

10.1093/bioinformatics/btz191 ◽

2019 ◽

Vol 35 (20) ◽

pp. 4196-4199 ◽

Cited By ~ 3

Author(s):

David S Robertson ◽

Jan Wildenhain ◽

Adel Javanmard ◽

Natasha A Karp

Keyword(s):

Hypothesis Testing ◽

R Package ◽

Supplementary Information ◽

Biological Research ◽

Data Repository ◽

Data Repositories ◽

P Values ◽

False Discovery ◽

The Family ◽

Over Time

Abstract Summary: In many areas of biological research, hypotheses are tested in a sequential manner, without having access to future P-values or even the number of hypotheses to be tested. A key setting where this online hypothesis testing occurs is in the context of publicly available data repositories, where the family of hypotheses to be tested is continually growing as new data is accumulated over time. Recently, Javanmard and Montanari proposed the first procedures that control the FDR for online hypothesis testing. We present an R package, onlineFDR, which implements these procedures and provides wrapper functions to apply them to a historic dataset or a growing data repository. Availability and implementation: The R package is freely available through Bioconductor (http://www.bioconductor.org/packages/onlineFDR). Supplementary information: Supplementary data are available at Bioinformatics online.
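As a rough illustration of online FDR control, the sketch below implements the LOND rule ("levels based on number of discoveries"), one of the procedures proposed by Javanmard and Montanari. It is written in Python rather than R and is not the onlineFDR package's code or API; the function name `lond` and the particular choice of the sequence β_i = α · 6 / (π² i²), which sums to α, are assumptions made for this example.

```python
from math import pi


def lond(p_values, alpha=0.05):
    """Apply the LOND rule to a stream of P values, in arrival order.

    Test i is run at level alpha_i = beta_i * (D + 1), where D is the
    number of discoveries made before test i and beta_i is a fixed
    non-negative sequence summing to alpha.

    Assumption: beta_i = alpha * 6 / (pi^2 * i^2), an illustrative
    choice that is not taken from the onlineFDR package.
    """
    decisions = []
    discoveries = 0
    for i, p in enumerate(p_values, start=1):
        beta_i = alpha * 6 / (pi ** 2 * i ** 2)
        alpha_i = beta_i * (discoveries + 1)
        reject = p <= alpha_i
        decisions.append(reject)
        if reject:
            discoveries += 1
    return decisions


# The same second P value is rejected only when an earlier discovery
# has already raised its test level.
print(lond([0.001, 0.01]))
print(lond([0.5, 0.01]))
```

Each decision depends only on earlier P values, so the rule can be applied as a repository grows without knowing how many hypotheses will eventually be tested.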


High-Throughput Sequencing of Circulating MicroRNAs in Plasma and Serum during Pregnancy Progression

Life ◽

10.3390/life11101055 ◽

2021 ◽

Vol 11 (10) ◽

pp. 1055

Author(s):

Elena S. Vashukova ◽

Polina Y. Kozyulina ◽

Roman A. Illarionov ◽

Natalya O. Yurkina ◽

Olga V. Pachuliia ◽

...

Keyword(s):

High Throughput ◽

Molecular Mechanisms ◽

High Throughput Sequencing ◽

Maternal Blood ◽

P Value ◽

Sequencing Technology ◽

Serum Samples ◽

Circulating Micrornas ◽

Non Invasive ◽

Qrt Pcr

Although circulating microRNAs (miRNAs) in maternal blood may play an important role in regulation of pregnancy progression and serve as non-invasive biomarkers for different gestation complications, little is known about their profile in blood during normally developing pregnancy. In this study we evaluated the miRNA profiles in paired plasma and serum samples from pregnant women without health or gestational abnormalities at three time points using high-throughput sequencing technology. Sequencing revealed that the percentage of miRNA reads in plasma and serum decreased by a third compared to first and second trimesters. We found two miRNAs in plasma (hsa-miR-7853-5p and hsa-miR-200c-3p) and 10 miRNAs in serum (hsa-miR-203a-5p, hsa-miR-495-3p, hsa-miR-4435, hsa-miR-340-5p, hsa-miR-4417, hsa-miR-1266-5p, hsa-miR-4494, hsa-miR-134-3p, hsa-miR-5008-5p, and hsa-miR-6756-5p) that exhibit level changes during pregnancy (adjusted p-value < 0.05). In addition, we observed differences for 36 miRNAs between plasma and serum (adjusted p-value < 0.05), which should be taken into consideration when comparing the results between studies performed using different biosample types. The results were verified by analysis of three miRNAs using qRT-PCR (p < 0.05). The present study confirms that the circulating miRNA profile in blood changes during gestation. Our results set the basis for further investigation of the molecular mechanisms involved in the regulation of pregnancy and for the search for biomarkers of gestation abnormalities.


Predicting Plasmid Promiscuity Based on Genomic Signature

Journal of Bacteriology ◽

10.1128/jb.00277-10 ◽

2010 ◽

Vol 192 (22) ◽

pp. 6045-6055 ◽

Cited By ~ 96

Author(s):

Haruo Suzuki ◽

Hirokazu Yano ◽

Celeste J. Brown ◽

Eva M. Top

Keyword(s):

Genetic Distance ◽

Host Range ◽

High Throughput ◽

High Throughput Sequencing ◽

Nucleotide Composition ◽

Genomic Signature ◽

Bacterial Evolution ◽

Incp Plasmids ◽

Bacterial Chromosomes ◽

Over Time

ABSTRACT Despite the important contribution of self-transmissible plasmids to bacterial evolution, little is understood about the range of hosts in which these plasmids have evolved. Our goal was to infer this so-called evolutionary host range. The nucleotide composition, or genomic signature, of plasmids is often similar to that of the chromosome of their current host, suggesting that plasmids acquire their hosts’ signature over time. Therefore, we examined whether the evolutionary host range of plasmids could be inferred by comparing their trinucleotide composition to that of all completely sequenced bacterial chromosomes. The diversity of candidate hosts was determined using taxonomic classification and genetic distance. The method was first tested using plasmids from six incompatibility (Inc) groups whose host ranges are generally thought to be narrow (IncF, IncH, and IncI) or broad (IncN, IncP, and IncW) and then applied to other plasmid groups. The evolutionary host range was found to be broad for IncP plasmids, narrow for IncF and IncI plasmids, and intermediate for IncH and IncN plasmids, which corresponds with their known host range. The IncW plasmids as well as several plasmids from the IncA/C, IncP, IncQ, IncU, and PromA groups have signatures that were not similar to any of the chromosomal signatures, raising the hypothesis that these plasmids have not been ameliorated in any host due to their promiscuous nature. The inferred evolutionary host range of IncA/C, IncP-9, and IncL/M plasmids requires further investigation. In this era of high-throughput sequencing, this genomic signature method is a useful tool for predicting the host range of novel mobile elements.
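The core comparison described above, matching a plasmid's trinucleotide composition against chromosomal signatures, can be sketched as follows. This is a simplified illustration: the Euclidean distance, the single-strand counting, and the names `trinucleotide_signature`, `signature_distance` and `rank_candidate_hosts` are assumptions and may differ from the distance measure and workflow used in the paper. The sequences in the usage example are made-up toy strings, not real genomes.

```python
from itertools import product
from math import sqrt

# All 64 trinucleotides in a fixed order, so signatures are comparable vectors.
TRINUCLEOTIDES = ["".join(t) for t in product("ACGT", repeat=3)]


def trinucleotide_signature(seq):
    """Return relative frequencies of the 64 trinucleotides in seq."""
    seq = seq.upper()
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    total = 0
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        if tri in counts:  # skips windows containing N or other symbols
            counts[tri] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no valid trinucleotide")
    return [counts[t] / total for t in TRINUCLEOTIDES]


def signature_distance(sig_a, sig_b):
    """Euclidean distance between two signatures (an assumed metric)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(sig_a, sig_b)))


def rank_candidate_hosts(plasmid_seq, chromosomes):
    """Sort chromosomes from most to least similar to the plasmid."""
    plasmid_sig = trinucleotide_signature(plasmid_seq)
    scored = [
        (name, signature_distance(plasmid_sig, trinucleotide_signature(seq)))
        for name, seq in chromosomes.items()
    ]
    return sorted(scored, key=lambda item: item[1])


# Toy sequences (hypothetical), only to show the call pattern.
ranking = rank_candidate_hosts(
    "ACGTACGTACGT",
    {"host_x": "ACGTACGTACGTACGT", "host_y": "GGGGGGGGCCCC"},
)
print(ranking[0][0])
```

A plasmid whose closest chromosome is still far away would correspond to the case the abstract describes for IncW and several other plasmids, whose signatures resemble no sequenced chromosome.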
