noisyR: Enhancing biological signal in sequencing datasets by characterising random technical noise

Abstract: High-throughput sequencing enables an unprecedented resolution in transcript quantification, at the cost of magnifying the impact of technical noise. The consistent reduction of random background noise to capture functionally meaningful biological signals is still challenging. Intrinsic sequencing variability introducing low-level expression variations can obscure patterns in downstream analyses. We introduce noisyR, a comprehensive noise filter to assess the variation in signal distribution and achieve an optimal information-consistency across replicates and samples; this selection also facilitates meaningful pattern recognition outside the background-noise range. noisyR is applicable to count matrices and sequencing data; it outputs sample-specific signal/noise thresholds and filtered expression matrices. We exemplify the effects of minimising technical noise on several datasets, across various sequencing assays: coding, non-coding RNAs and interactions, at bulk and single-cell level. An immediate consequence of filtering out noise is the convergence of predictions (differential-expression calls, enrichment analyses and inference of gene regulatory networks) across different approaches. Teaser: Noise removal from sequencing quantification improves the convergence of downstream tools and robustness of conclusions.
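The idea of a sample-specific signal/noise threshold can be illustrated with a small sketch. This is not noisyR's implementation: the sliding-window correlation rule, the function names `noise_threshold` and `keep_genes`, the window size and the 0.8 cutoff are all assumptions chosen for illustration. Genes are ranked by abundance in one sample, and the threshold is taken as the lowest abundance in the first window whose expression agrees with a second sample.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def noise_threshold(sample, reference, window=4, min_corr=0.8):
    """Hypothetical rule: slide a window over abundance-ranked genes and return
    the lowest abundance of the first window that correlates with `reference`."""
    order = sorted(range(len(sample)), key=lambda i: sample[i])
    for start in range(len(order) - window + 1):
        idx = order[start:start + window]
        r = pearson([sample[i] for i in idx], [reference[i] for i in idx])
        if r >= min_corr:
            return sample[idx[0]]
    return None  # no window reached the required consistency

def keep_genes(sample, threshold):
    """Indices of genes whose abundance is at or above the threshold."""
    return [i for i, value in enumerate(sample) if value >= threshold]

# Toy data (invented): six low, inconsistent genes and four high, consistent ones.
sample = [0, 1, 2, 3, 4, 5, 50, 60, 70, 80]
reference = [2, 5, 0, 4, 1, 3, 48, 63, 69, 83]
threshold = noise_threshold(sample, reference)
kept = keep_genes(sample, threshold)
```

On this toy input the first consistent window is the one that first includes a highly expressed gene, so the threshold still sits near the top of the low-abundance range; a real filter would need a more careful rule for placing the cut.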


A meta-analysis reveals the environmental and host factors shaping the structure and function of the shrimp microbiota

PeerJ ◽

10.7717/peerj.5382 ◽

2018 ◽

Vol 6 ◽

pp. e5382 ◽

Cited By ~ 33

Author(s):

Fernanda Cornejo-Granados ◽

Luigui Gallardo-Becerra ◽

Miriam Leonardo-Reza ◽

Juan Pablo Ochoa-Romo ◽

Adrian Ochoa-Leyva

Keyword(s):

Developmental Stage ◽

High Throughput Sequencing ◽

Meta Analysis ◽

Biological Factor ◽

Rrna Gene ◽

Wild Type ◽

Biological Factors ◽

Sequencing Data ◽

Freshwater Environment ◽

The Impact

The shrimp or prawn is the most valuable traded marine product in the world market today, and its microbiota plays an essential role in its development, physiology, and health. The technological advances and dropping costs of high-throughput sequencing have increased the number of studies characterizing the shrimp microbiota. However, the application of different experimental and bioinformatics protocols makes it difficult to compare studies and reach general conclusions about the shrimp microbiota. To meet this need, we report the first meta-analysis of the microbiota from freshwater and marine shrimps using all publicly available sequences of the 16S ribosomal RNA gene (16S rRNA gene). We obtained data for 199 samples, of which 63.3% were from marine (Alvinocaris longirostris, Litopenaeus vannamei and Penaeus monodon) and 36.7% were from freshwater (Macrobrachium asperulum, Macrobrachium nipponense, Macrobrachium rosenbergii, Neocaridina denticulata) shrimps. Technical variations among studies, such as selected primers, hypervariable region, and sequencing platform, showed a significant impact on the microbiota structure. Additionally, the ANOSIM and PERMANOVA analyses revealed that the most important biological factor in structuring the shrimp microbiota was the marine versus freshwater environment (ANOSIM R = 0.54, P = 0.001; PERMANOVA pseudo-F = 21.8, P = 0.001), where freshwater shrimps showed higher bacterial diversity than marine shrimps. For marine shrimps, the most relevant biological factors impacting the microbiota composition were lifestyle (ANOSIM R = 0.341, P = 0.001; PERMANOVA pseudo-F = 8.50, P = 0.0001), organ (ANOSIM R = 0.279, P = 0.001; PERMANOVA pseudo-F = 6.68, P = 0.001) and developmental stage (ANOSIM R = 0.240, P = 0.001; PERMANOVA pseudo-F = 5.05, P = 0.001). For lifestyle, organ, developmental stage, diet, and health status, the highest diversity was found in wild-type, intestine, adult, wild-type diet, and healthy samples, respectively. Additionally, we used PICRUSt to predict the potential functions of the microbiota, and we found that organ had the most differentially enriched functions (93), followed by developmental stage (12) and lifestyle (9). Our analysis demonstrated that, despite the impact of technical and bioinformatics factors, the biological factors were also statistically significant in shaping the microbiota. These results show that cross-study comparisons are a valuable resource for the improvement of the shrimp microbiota and microbiome fields. Thus, it is important that future studies make their sequencing data public, allowing other researchers to reach more powerful conclusions about the microbiota in this non-model organism. To our knowledge, this is the first meta-analysis that aims to define the shrimp microbiota.
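For readers unfamiliar with the ANOSIM R values quoted above, the statistic compares the ranks of between-group and within-group distances: R = (mean between-group rank − mean within-group rank) / (M/2), where M = n(n−1)/2 is the number of sample pairs. The sketch below computes R only; the permutation test that yields the P value is omitted, and the distance matrix and group labels are invented for illustration rather than taken from the study.

```python
from itertools import combinations

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def anosim_r(dist, groups):
    """ANOSIM R from a symmetric distance matrix and one group label per sample."""
    n = len(groups)
    pairs = list(combinations(range(n), 2))
    ranks = average_ranks([dist[i][j] for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    m = n * (n - 1) / 2
    return (sum(between) / len(between) - sum(within) / len(within)) / (m / 2)

# Invented example: two marine and two freshwater samples.
groups = ["marine", "marine", "freshwater", "freshwater"]
dist = [
    [0.0, 0.1, 0.7, 0.8],
    [0.1, 0.0, 0.9, 0.6],
    [0.7, 0.9, 0.0, 0.2],
    [0.8, 0.6, 0.2, 0.0],
]
r_stat = anosim_r(dist, groups)
```

Here every between-group distance is larger than every within-group distance, so R reaches its maximum of 1; values near 0 indicate no separation between groups.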


Methods of MicroRNA Promoter Prediction and Transcription Factor Mediated Regulatory Network

BioMed Research International ◽

10.1155/2017/7049406 ◽

2017 ◽

Vol 2017 ◽

pp. 1-8 ◽

Cited By ~ 19

Author(s):

Yuming Zhao ◽

Fang Wang ◽

Su Chen ◽

Jun Wan ◽

Guohua Wang

Keyword(s):

Transcription Factor ◽

Regulatory Network ◽

Regulatory Networks ◽

High Throughput Sequencing ◽

Prediction Models ◽

Promoter Prediction ◽

Sequencing Data ◽

Protein Coding ◽

Mirna Genes ◽

Intergenic Regions

MicroRNAs (miRNAs) are short (~22 nucleotides) noncoding RNAs disseminated throughout the genome, either in the intergenic regions or in the intronic sequences of protein-coding genes. MiRNAs have been shown to play important roles in regulating gene expression. Hence, understanding the transcriptional mechanism of miRNA genes is a critical step towards uncovering the whole regulatory network. A number of miRNA promoter prediction models have been proposed in the past decade. This review summarizes several of the most popular miRNA promoter prediction models, which use genome sequence features or other features obtained from high-throughput sequencing data, for example histone markers, RNA Pol II binding sites, and nucleosome-free regions. Some databases are described as resources for miRNA promoter information. We then discuss in depth the prediction and identification of transcription factor mediated microRNA regulatory networks.


A Comprehensive Coexpression Network Analysis in Vibrio cholerae

mSystems ◽

10.1128/msystems.00550-20 ◽

2020 ◽

Vol 5 (4) ◽

Author(s):

Cory D. DuPai ◽

Claus O. Wilke ◽

Bryan W. Davies

Keyword(s):

Network Analysis ◽

Vibrio Cholerae ◽

Rna Sequencing ◽

High Throughput Sequencing ◽

Model Organism ◽

Coexpression Network ◽

Sequencing Data ◽

Content Type ◽

The Impact ◽

Coexpression Network Analysis

ABSTRACT Research into the evolution and pathogenesis of Vibrio cholerae has benefited greatly from the generation of high-throughput sequencing data to drive molecular analyses. The steady accumulation of these data sets now provides a unique opportunity for in silico hypothesis generation via coexpression analysis. Here, we leverage all published V. cholerae RNA sequencing data, in combination with select data from other platforms, to generate a gene coexpression network that validates known gene interactions and identifies novel genetic partners across the entire V. cholerae genome. This network provides direct insights into genes influencing pathogenicity, metabolism, and transcriptional regulation, further clarifies results from previous sequencing experiments in V. cholerae (e.g., transposon insertion sequencing [Tn-seq] and chromatin immunoprecipitation sequencing [ChIP-seq]), and expands upon microarray-based findings in related Gram-negative bacteria. IMPORTANCE Cholera is a devastating illness that kills tens of thousands of people annually. Vibrio cholerae, the causative agent of cholera, is an important model organism to investigate both bacterial pathogenesis and the impact of horizontal gene transfer on the emergence and dissemination of new virulent strains. Despite the importance of this pathogen, roughly one-third of V. cholerae genes are functionally unannotated, leaving large gaps in our understanding of this microbe. Through coexpression network analysis of existing RNA sequencing data, this work develops an approach to uncover novel gene-gene relationships and contextualize genes with no known function, which will advance our understanding of V. cholerae virulence and evolution.
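As a rough illustration of what a coexpression network is, the sketch below links two genes when their expression profiles across samples are strongly correlated. It is a minimal sketch under stated assumptions, not the authors' pipeline: the Pearson measure, the 0.9 cutoff, the function name `coexpression_edges` and the gene names are all hypothetical.

```python
from itertools import combinations
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def coexpression_edges(expr, min_abs_r=0.9):
    """Return (gene_a, gene_b, r) for every gene pair with |r| >= min_abs_r.

    `expr` maps a gene name to its expression values across the same samples.
    """
    edges = []
    for a, b in combinations(sorted(expr), 2):
        r = pearson(expr[a], expr[b])
        if abs(r) >= min_abs_r:
            edges.append((a, b, r))
    return edges

# Invented expression values for three placeholder genes across four samples.
expr = {
    "geneA": [1, 2, 3, 4],
    "geneB": [2, 4, 6, 8],
    "geneC": [4, 1, 3, 2],
}
edges = coexpression_edges(expr)
```

An uncharacterised gene that ends up connected to genes of known function in such a network is a candidate for sharing that function, which is the kind of contextualisation the abstract describes.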


Identification of Potential Prognostic Competing Triplets in High-Grade Serous Ovarian Cancer

Frontiers in Genetics ◽

10.3389/fgene.2020.607722 ◽

2021 ◽

Vol 11 ◽

Author(s):

Jian Zhao ◽

Xiaofeng Song ◽

Tianyi Xu ◽

Qichang Yang ◽

Jingjing Liu ◽

...

Keyword(s):

Ovarian Cancer ◽

High Throughput Sequencing ◽

Enrichment Analysis ◽

Pathway Enrichment Analysis ◽

Sequencing Data ◽

P53 Signaling ◽

Oocyte Meiosis ◽

The Difference ◽

Division Cell ◽

The Impact

An increasing number of lncRNA-associated competing triplets have been found to play important roles in cancers. With the accumulation of high-throughput sequencing data in public databases, the number of available tumor samples keeps growing, which introduces new challenges for identifying competing triplets. Here, we developed a novel method, called LncMiM, to detect lncRNA–miRNA–mRNA competing triplets in ovarian cancer with tumor samples from the TCGA database. In LncMiM, non-linear correlation analysis is used to address the problem of weak correlations between miRNA–target pairs, which is mainly due to the difference in the magnitude of the expression level. In addition, besides the miRNA, the impact of lncRNA and mRNA on the interactions in triplets is also considered to improve the identification sensitivity of LncMiM without reducing its accuracy. By using LncMiM, a total of 847 lncRNA-associated competing triplets were found. All the competing triplets form a miRNA–lncRNA pair centered regulatory network, in which ZFAS1, SNHG29, GAS5, AC112491.1, and AC099850.4 are the top five lncRNAs with the most connections. The results of biological process and KEGG pathway enrichment analysis indicate that the competing triplets are mainly associated with cell division, cell proliferation, cell cycle, oocyte meiosis, oxidative phosphorylation, ribosome, and the p53 signaling pathway. Through survival analysis, 107 potential prognostic biomarkers were found in the competing triplets, including FGD5-AS1, HCP5, HMGN4, TACC3, and so on. LncMiM is available at https://github.com/xiaofengsong/LncMiM.


A Short Sequence Splicing Method for Genome Assembly Using a Three-Dimensional Mixing-Pool of BAC Clones and High-throughput Technology

The Open Biotechnology Journal ◽

10.2174/1874070701509010210 ◽

2015 ◽

Vol 9 (1) ◽

pp. 210-215

Author(s):

Xiaojun Kang ◽

Cheng Yang ◽

Xuguang Zhao ◽

Weiwei Chen ◽

Sifa Zhang ◽

...

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Three Dimensional ◽

Whole Genome Sequence ◽

Sequencing Data ◽

Assembly Algorithm ◽

Bac Clones ◽

High Throughput Technology ◽

Sequencing Strategy ◽

The Cost

Current genome sequencing techniques are expensive, and it is still a major challenge to obtain an individual whole-genome sequence. To reduce the cost of sequencing, this paper introduces a high-throughput sequencing strategy using a three-dimensional mixing pool based on a cube. Following the strategy, BAC clones were injected into each vertex of the cube, and sequencing of each plane provided information about multiple clones, thereby significantly reducing the cost of sequencing. In addition, Velvet was used to assemble the sequencing data. The scaffold generated by Velvet contained a number of contigs, which were unordered. To address this problem, a scaffold assembly algorithm based on multi-way trees was used. The algorithm used a multi-way tree to build the framework of chromosomes, and subsequently the frame was filled to complete the scaffold assembly. This algorithm alone outperformed Velvet in assembling a scaffold.
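The cost saving of three-dimensional pooling can be sketched as follows. This is an illustration under assumptions, not the paper's protocol: clones are placed on an n × n × n grid, one pool is formed per plane along each axis, and a clone is identified by the intersection of the three planes that contain it. The function names `build_pools` and `decode` are hypothetical.

```python
from itertools import product

def build_pools(n):
    """Map each plane, keyed as (axis, coordinate), to the set of clones on it.

    A clone is identified by its (x, y, z) grid position.
    """
    pools = {}
    for clone in product(range(n), repeat=3):
        for axis, coord in zip("xyz", clone):
            pools.setdefault((axis, coord), set()).add(clone)
    return pools

def decode(pools, positive_pools):
    """Clones consistent with a signal seen in every listed plane pool."""
    candidates = None
    for key in positive_pools:
        candidates = set(pools[key]) if candidates is None else candidates & pools[key]
    return candidates if candidates is not None else set()

pools = build_pools(3)
# A sequence observed in planes x=1, y=2 and z=0 pins down a single clone.
hit = decode(pools, [("x", 1), ("y", 2), ("z", 0)])
```

With n = 3, 27 clones are covered by 9 pooled sequencing runs instead of 27 individual ones, and the gap widens as n grows (3n pools for n³ clones).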


Next level proctored exam for proficiency testing in Primary Care Education: an observatory study on efficiency and accuracy and on exam outcome. (Preprint)

10.2196/preprints.23834 ◽

2020 ◽

Author(s):

Birgitte Schoenmakers ◽

Johan Wens

Keyword(s):

Background Noise ◽

Negative Impact ◽

Cost Benefit ◽

Critical Event ◽

Future Research ◽

Test Accuracy ◽

Primary Care Education ◽

Technical Issues ◽

The Cost ◽

The Impact

BACKGROUND The COVID-19 pandemic affected education and assessment and led to complex planning. Therefore, we organised the proficiency test for admission to Family Medicine as a proctored exam. To prevent fraud, we developed a virtual supervisor app that tracks and traces candidates’ behaviour. OBJECTIVE To assess the efficiency and accuracy of the proctored exam procedure and to test its impact on the exam scores. METHODS The app operates on three levels to register events: recording of actions, analysis of behaviour and live supervision. Each suspicious event is given a score. To assess efficiency, we inventoried the technical issues and the interventions. To test accuracy, we counted the number of suspicious students and behaviours. To test the impact of the supervising app on students’ exam outcome, we compared the scores between the proctored and the on-campus group. Candidates were free to register for off- or on-campus participation. RESULTS 593 candidates subscribed to the exam: 472 (79%) candidates used the supervisor app and 121 (20%) were on campus. Test results of both groups were comparable. We registered 15 technical issues in the off-campus context. Two candidates experienced a negative impact on the exam due to a technical issue. The app detected 22 candidates with a suspicion level >1, raised mainly by background noise. None of the events involved fraudulent intent. CONCLUSIONS This pilot study demonstrated that a supervisor app that records and registers behaviour is able to detect suspicious events without an impact on the exam. Background noise was the most critical event. No fraud was detected. A supervisor app that registers and records behaviour to prevent fraud during exams is efficient and does not affect the exam outcome. In future research, a controlled design should compare the cost-benefit balance between the complex intervention of the supervisor app and the combination of the candidates’ awareness of being monitored with a safe exam browsing plug-in. CLINICALTRIAL Not applicable


EPCO-20. PEDIATRIC HIGH-GRADE GLIOMA EXHIBITS DISTINCT 3D GENOME STRUCTURE THAT IMPACTS TRANSCRIPTION REGULATION

Neuro-Oncology ◽

10.1093/neuonc/noab196.019 ◽

2021 ◽

Vol 23 (Supplement_6) ◽

pp. vi6-vi6

Author(s):

Tina Huang ◽

Juan Wang ◽

Ye Hu ◽

Andrea Piunti ◽

Elizabeth Bartom ◽

...

Keyword(s):

Transcription Regulation ◽

Cell Lines ◽

Regulatory Networks ◽

Critical Role ◽

Genome Structure ◽

High Grade Glioma ◽

High Grade ◽

Sequencing Data ◽

Structural Variations ◽

The Impact

Abstract INTRODUCTION Pediatric high-grade gliomas (pHGGs), including glioblastoma multiforme (GBM) and diffuse intrinsic pontine glioma (DIPG), are highly morbid brain tumors. Up to 80% of DIPGs harbor a somatic missense mutation in genes encoding Histone H3. To investigate whether the H3K27M mutant protein is associated with distinct chromatin structure affecting transcription regulation, we generated the first high-resolution Hi-C and ATAC-Seq maps of pHGG cell lines, and integrated these with tissue and cell genomic data. METHODS We generated sequencing data from patient-derived cell lines (DIPG n=6, GBM n=3, normal n=2) and frozen tissue specimens (DIPG n=1, normal brainstem n=1). Analyses included cell line RNA-Seq, ChIP-Seq (H3K27ac, H3K27me3, H3K27M) and genome-wide chromatin conformation capture (Hi-C), as well as tissue ATAC-Seq. Publicly available pediatric glioma tissue ChIP-Seq data were integrated with cell data. CRISPR knock-down of target enhancer regions was performed. RESULTS We identified tumor-specific enhancers and regulatory networks for known oncogenes in DIPG and GBM. In DIPG, FOX, SOX, STAT and SMAD families were among the top H3K27Ac enriched motifs. Significant differences in Topologically Associating Domains (TADs) and DNA looping were observed at OLIG2 and MYCN in H3K27M mutant DIPG, relative to wild-type GBM and normal cells. Pharmacologic treatment targeting H3K27Ac (BET and Bromodomain inhibition) altered these 3D structures. Functional analysis of differentially enriched enhancers in DIPG implicated SOX2, SUZ12, and TRIM24 as top activated upstream regulators. Distinct genomic structural variations leading to enhancer hijacking and gene co-amplification were identified at A2M, JAG2, and FLRT1. CONCLUSION We show genome structural variations and enhancer-promoter interactions that impact gene expression in pHGG in the presence and absence of the H3K27M mutation. Our results imply that tridimensional genome alterations may play a critical role in the pHGG epigenetic landscape and thereby contribute to pediatric gliomagenesis. Further studies examining the impact of these alterations are therefore underway.


The Impact of Bioinformatics Pipelines on Microbiota Studies: Does the Analytical “Microscope” Affect the Biological Interpretation?

Microorganisms ◽

10.3390/microorganisms7100393 ◽

2019 ◽

Vol 7 (10) ◽

pp. 393 ◽

Cited By ~ 2

Author(s):

Léa Siegwald ◽

Ségolène Caboche ◽

Gaël Even ◽

Eric Viscogliosi ◽

Christophe Audebert ◽

...

Keyword(s):

High Throughput Sequencing ◽

Case Control Study ◽

Case Control ◽

Sequencing Data ◽

Human Gut ◽

Accurate Identification ◽

Intestinal Protozoa ◽

16S Sequencing ◽

The Impact ◽

Control Study

Targeted metagenomics is the solution of choice to reveal differential microbial profiles (defined by richness, diversity and composition) as part of case-control studies. It is well documented that each data processing step has the potential to introduce bias in the results. However, selecting a bioinformatics pipeline to analyze high-throughput sequencing data from A to Z remains one of the critical considerations in a case-control microbiota study design. Consequently, the aim of this study was to assess whether the same biological conclusions regarding human gut microbiota composition and diversity could be reached using different bioinformatics pipelines. In this work, we considered four pipelines (mothur, QIIME, kraken and CLARK) with different versions and databases, and examined their impact on the outcome of metagenetic analysis of Ion Torrent 16S sequencing data. We re-analyzed a case-control study evaluating the impact of colonization by the intestinal protozoan Blastocystis sp. on the human gut microbial profile. Although most pipelines reported the same trends in this case-control study, we demonstrated how the use of different pipelines affects the biological conclusions that can be drawn. Targeted metagenomics must therefore be considered a profiling tool for obtaining a broad sense of the variations of the microbiota, rather than an accurate identification tool.


Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data

10.1101/119784 ◽

2017 ◽

Author(s):

Aaron T. L. Lun ◽

Fernando J. Calero-Nieto ◽

Liora Haim-Vilmovsky ◽

Berthold Göttgens ◽

John C. Marioni

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Capture Efficiency ◽

Cellular Heterogeneity ◽

Sequencing Data ◽

Constant Amount ◽

Technical Noise ◽

Data Analyses ◽

Single Cell Rna Sequencing ◽

The Cost

Abstract: By profiling the transcriptomes of individual cells, single-cell RNA sequencing provides unparalleled resolution to study cellular heterogeneity. However, this comes at the cost of high technical noise, including cell-specific biases in capture efficiency and library generation. One strategy for removing these biases is to add a constant amount of spike-in RNA to each cell, and to scale the observed expression values so that the coverage of spike-in RNA is constant across cells. This approach has previously been criticized as its accuracy depends on the precise addition of spike-in RNA to each sample, and on similarities in behaviour (e.g., capture efficiency) between the spike-in and endogenous transcripts. Here, we perform mixture experiments using two different sets of spike-in RNA to quantify the variance in the amount of spike-in RNA added to each well in a plate-based protocol. We also obtain an upper bound on the variance due to differences in behaviour between the two spike-in sets. We demonstrate that both factors are small contributors to the total technical variance and have only minor effects on downstream analyses such as detection of highly variable genes and clustering. Our results suggest that spike-in normalization is reliable enough for routine use in single-cell RNA sequencing data analyses.
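The scaling step described in the abstract (make spike-in coverage constant across cells) can be written in a few lines. This is a minimal sketch with invented counts; the function names `spike_in_size_factors` and `normalize` are hypothetical, and centring the factors on the mean spike-in total is one common convention rather than something the abstract specifies.

```python
def spike_in_size_factors(spike_totals):
    """One scaling factor per cell: its spike-in total relative to the mean total."""
    mean_total = sum(spike_totals) / len(spike_totals)
    return [total / mean_total for total in spike_totals]

def normalize(counts, size_factors):
    """Divide every cell's counts by that cell's factor.

    `counts` is a list of rows (one per gene), each holding one value per cell.
    """
    return [[c / sf for c, sf in zip(row, size_factors)] for row in counts]

# Invented data: three cells that received the same amount of spike-in RNA
# but were captured with different efficiency.
spike_totals = [100, 200, 300]
endogenous = [
    [10, 20, 30],  # gene 1
    [5, 10, 15],   # gene 2
]
factors = spike_in_size_factors(spike_totals)
normalized = normalize(endogenous, factors)
scaled_spikes = [t / f for t, f in zip(spike_totals, factors)]
```

After scaling, the spike-in total is identical in every cell, and in this toy case the endogenous genes also become flat across cells, meaning their apparent differences were purely technical.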


TahcoRoll: An Efficient Approach for Signature Profiling in Genomic Data through Variable-Length k-mers

10.1101/229708 ◽

2017 ◽

Author(s):

Chelsea J.-T. Ju ◽

Jyun-Yu Jiang ◽

Ruirui Li ◽

Zeyu Li ◽

Wei Wang

Keyword(s):

High Throughput Sequencing ◽

Transcriptome Assembly ◽

Variable Length ◽

Sequencing Data ◽

Desktop Computer ◽

De Bruijn Graphs ◽

Transcript Quantification ◽

Sequencing Technologies ◽

Genomic Markers ◽

Long Read

Abstract: k-mer profiling has been one of the trending approaches to analyze read data generated by high-throughput sequencing technologies. The tasks of k-mer profiling include, but are not limited to, counting the frequencies and determining the occurrences of short sequences in a dataset. The notion of k-mer has been extensively used to build de Bruijn graphs in genome or transcriptome assembly, which requires examining all possible k-mers present in the dataset. Recently, an alternative way of profiling has been proposed, which constructs a set of representative k-mers as genomic markers and profiles their occurrences in the sequencing data. This technique has been applied in both transcript quantification through RNA-Seq and taxonomic classification of metagenomic reads. Most of these applications use a set of fixed-size k-mers, since the majority of existing k-mer counters are inadequate to process genomic sequences with variable-length k-mers. However, choosing the appropriate k is challenging, as it varies for different applications. As pioneering work on profiling a set of variable-length k-mers, we propose TahcoRoll, which enhances the Aho-Corasick algorithm. More specifically, we use one bit to represent each nucleotide, and integrate the rolling hash technique to construct an efficient in-memory data structure for this task. Using both synthetic and real datasets, results show that TahcoRoll outperforms existing approaches in time efficiency, memory efficiency, or both, without using any disk space. In addition, compared to the most efficient state-of-the-art k-mer counters, such as KMC and MSBWT, TahcoRoll is the only approach that can process long read data from both PacBio and Oxford Nanopore on a commodity desktop computer. TahcoRoll is implemented in C++14, and its source code is available at https://github.com/chelseaju/TahcoRoll.git.
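The rolling hash mentioned in the abstract can be illustrated on its own. The sketch below is not TahcoRoll: it does not use the Aho-Corasick automaton or the one-bit nucleotide encoding, and it simply runs one Rabin-Karp style pass per distinct k-mer length. The 2-bit encoding, the modulus and the function names are assumptions, and reads are assumed to contain only A, C, G and T.

```python
from collections import Counter, defaultdict

ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit nucleotide encoding
BASE = 4
MOD = (1 << 61) - 1  # large prime modulus, an arbitrary choice

def kmer_hash(seq):
    """Polynomial hash of a nucleotide string."""
    h = 0
    for ch in seq:
        h = (h * BASE + ENCODE[ch]) % MOD
    return h

def count_kmers(read, kmers):
    """Count occurrences in `read` of each query k-mer; lengths may differ."""
    by_len = defaultdict(dict)  # length -> {hash: set of query k-mers}
    for km in kmers:
        by_len[len(km)].setdefault(kmer_hash(km), set()).add(km)
    counts = Counter()
    for k, table in by_len.items():
        if k > len(read):
            continue
        top = pow(BASE, k - 1, MOD)  # weight of the window's leading base
        h = kmer_hash(read[:k])
        for start in range(len(read) - k + 1):
            if start > 0:
                # Roll: drop the base leaving the window, append the new one.
                h = ((h - ENCODE[read[start - 1]] * top) * BASE
                     + ENCODE[read[start + k - 1]]) % MOD
            if h in table:
                window = read[start:start + k]
                if window in table[h]:  # string check guards against collisions
                    counts[window] += 1
    return counts

counts = count_kmers("ACGTACGT", ["ACG", "CGTA", "TTT"])
```

Each window hash is derived from the previous one in constant time, so one pass costs time proportional to the read length for each distinct k. Handling many different lengths in a single pass is the harder problem that the abstract addresses.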
