deepBase v3.0: expression atlas and interactive analysis of ncRNAs from thousands of deep-sequencing data

Abstract Eukaryotic genomes encode thousands of small and large non-coding RNAs (ncRNAs). However, the expression, functions and evolution of these ncRNAs are still largely unknown. In this study, we have updated deepBase to version 3.0 (deepBase v3.0, http://rna.sysu.edu.cn/deepbase3/index.html), an increasingly popular and openly licensed resource that facilitates integrative and interactive display and analysis of the expression, evolution, and functions of various ncRNAs by deeply mining thousands of high-throughput sequencing datasets from tissue, tumor and exosome samples. deepBase v3.0 provides the most comprehensive expression atlas of small RNAs and lncRNAs by integrating ∼67 620 datasets from 80 normal tissues and ∼50 cancer tissues. The extracellular patterns of various ncRNAs were profiled to explore their applications for the discovery of noninvasive biomarkers. Moreover, we constructed survival maps of tRNA-derived RNA fragments (tRFs), miRNAs, snoRNAs and lncRNAs by analyzing data from >45 000 cancer samples and the corresponding clinical information. We also developed interactive web tools to analyze the differential expression and biological functions of various ncRNAs in ∼50 types of cancer. This update is expected to provide a variety of new modules and graphic visualizations to facilitate analyses and explorations of the functions and mechanisms of various types of ncRNAs.


SomVarIUS: somatic variant identification from unpaired tissue samples

Bioinformatics ◽

10.1093/bioinformatics/btv685 ◽

2015 ◽

Vol 32 (6) ◽

pp. 808-813 ◽

Cited By ~ 18

Author(s):

Kyle S. Smith ◽

Vinod K. Yadav ◽

Shanshan Pei ◽

Daniel A. Pollyea ◽

Craig T. Jordan ◽

...

Keyword(s):

High Throughput Sequencing ◽

Variant Calling ◽

Computational Method ◽

Supplementary Information ◽

Sequencing Data ◽

Somatic Variant ◽

Tissue Samples ◽

Normal Tissues ◽

High Throughput Sequencing Data ◽

Oncogenic Mutations

Abstract Motivation: Somatic variant calling typically requires paired tumor-normal tissue samples. Yet, paired normal tissues are not always available in clinical settings or for archival samples. Results: We present SomVarIUS, a computational method for detecting somatic variants using high-throughput sequencing data from unpaired tissue samples. We evaluate the performance of the method using genomic data from synthetic and real tumor samples. SomVarIUS identifies somatic variants in exome-seq data of ∼150× coverage with at least 67.7% precision and 64.6% recall when compared with paired-tissue somatic variant calls in real tumor samples. We demonstrate the utility of SomVarIUS by identifying somatic mutations in formalin-fixed samples, and by tracking the clonal dynamics of oncogenic mutations in targeted deep sequencing data from pre- and post-treatment leukemia samples. Availability and implementation: SomVarIUS is written in Python 2.7 and available at http://www.sjdlab.org/resources/ Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
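The precision and recall figures quoted above are measured against paired-tissue calls treated as the reference set. A minimal sketch of that comparison follows; it is not taken from SomVarIUS, and the function name, the variant key format `(chromosome, position, ref, alt)` and the toy call sets are all assumptions made for illustration.

```python
def precision_recall(called, truth):
    """Return (precision, recall) of a call set against a reference set."""
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical variants keyed as (chromosome, position, ref, alt).
paired_calls = {
    ("chr1", 100, "A", "T"),
    ("chr1", 250, "G", "C"),
    ("chr2", 75, "C", "T"),
    ("chr3", 10, "T", "G"),
}
unpaired_calls = {
    ("chr1", 100, "A", "T"),
    ("chr2", 75, "C", "T"),
    ("chr2", 900, "G", "A"),  # not supported by the paired analysis
}

precision, recall = precision_recall(unpaired_calls, paired_calls)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In this toy case two of the three unpaired calls are confirmed and two of the four reference variants are recovered, so precision is 2/3 and recall is 1/2.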


Faculty Opinions recommendation of Coalescent Inference Using Serially Sampled, High-Throughput Sequencing Data from Intrahost HIV Infection.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.726132071.793531014 ◽

2017 ◽

Author(s):

Sarah Rowland-Jones ◽

Sophie Andrews

Keyword(s):

Hiv Infection ◽

High Throughput ◽

High Throughput Sequencing ◽

Sequencing Data ◽

High Throughput Sequencing Data


BlindCall: ultra-fast base-calling of high-throughput sequencing data by blind deconvolution

Bioinformatics ◽

10.1093/bioinformatics/btu010 ◽

2014 ◽

Vol 30 (9) ◽

pp. 1214-1219 ◽

Cited By ~ 6

Author(s):

C. Ye ◽

C. Hsiao ◽

H. Corrada Bravo

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Blind Deconvolution ◽

Sequencing Data ◽

Base Calling ◽

High Throughput Sequencing Data


Improvement, identification, and target prediction for miRNAs in the porcine genome by using massive, public high-throughput sequencing data

Journal of Animal Science ◽

10.1093/jas/skab018 ◽

2021 ◽

Vol 99 (2) ◽

Author(s):

Yuhua Fu ◽

Pengyu Fan ◽

Lu Wang ◽

Ziqiang Shu ◽

Shilin Zhu ◽

...

Keyword(s):

High Throughput Sequencing ◽

Target Genes ◽

Target Prediction ◽

Large Data ◽

Sequencing Data ◽

Regulate Gene Expression ◽

High Throughput Sequencing Data ◽

Annotation Information ◽

Public Data ◽

Broad Variety

Abstract Despite the broad variety of available microRNA (miRNA) research tools and methods, their application to the identification, annotation, and target prediction of miRNAs in nonmodel organisms is still limited. In this study, we collected nearly all public sRNA-seq data to improve the annotation for known miRNAs and identify novel miRNAs that have not been annotated in pigs (Sus scrofa). We newly annotated 210 mature sequences in known miRNAs and found that 43 of the known miRNA precursors were problematic due to redundant/missing annotations or incorrect sequences. We also predicted 811 novel miRNAs with high confidence, which was twice the current number of known miRNAs for pigs in miRBase. In addition, we proposed a correlation-based strategy to predict target genes for miRNAs by using a large amount of sRNA-seq and RNA-seq data. We found that the correlation-based strategy provided additional evidence of expression compared with traditional target prediction methods. The correlation-based strategy also identified the regulatory pairs that were controlled by nonbinding sites with a particular pattern, which provided abundant complementarity for studying the mechanism of miRNAs that regulate gene expression. In summary, our study improved the annotation of known miRNAs, identified a large number of novel miRNAs, and predicted target genes for all pig miRNAs by using massive public data. This large data-based strategy is also applicable for other nonmodel organisms with incomplete annotation information.
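The correlation-based strategy described above can be illustrated with a small sketch that scores each candidate gene by the Pearson correlation between its expression and that of a miRNA across matched samples. This is not the authors' pipeline: the function names, the toy expression values and the choice to rank the most negatively correlated genes first (on the assumption that miRNAs mostly repress their targets) are illustrative assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_candidate_targets(mirna_expr, gene_expr):
    """Sort genes by correlation with the miRNA, most negative first (assumed ordering)."""
    scores = {gene: pearson(mirna_expr, values) for gene, values in gene_expr.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical expression values across five matched samples.
mirna = [1.0, 2.0, 3.0, 4.0, 5.0]
genes = {
    "GENE_A": [10.0, 8.0, 6.0, 4.0, 2.0],  # falls as the miRNA rises
    "GENE_B": [1.0, 2.0, 3.0, 4.0, 5.0],   # rises with the miRNA
    "GENE_C": [3.0, 3.0, 3.0, 3.0, 3.0],   # constant, so uninformative
}

ranked = rank_candidate_targets(mirna, genes)
for gene, r in ranked:
    print(gene, round(r, 3))
```

A real analysis would also need multiple-testing control and far more samples; the sketch only shows how expression correlation can serve as evidence alongside sequence-based prediction.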


Improving gene function predictions using independent transcriptional components

Nature Communications ◽

10.1038/s41467-021-21671-w ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Carlos G. Urzúa-Traslaviña ◽

Vincent C. Leeuwenburgh ◽

Arkajyoti Bhattacharya ◽

Stefan Loipfinger ◽

Marcel A. T. M. van Vugt ◽

...

Keyword(s):

Independent Component Analysis ◽

High Throughput Sequencing ◽

Principal Component ◽

Component Analysis ◽

Independent Component ◽

Sequencing Data ◽

New Members ◽

High Throughput Sequencing Data ◽

Gene Sets ◽

Functional Understanding

Abstract The interpretation of high-throughput sequencing data is limited by our incomplete functional understanding of coding and non-coding transcripts. Reliably predicting the function of such transcripts can overcome this limitation. Here we report the use of a consensus independent component analysis and guilt-by-association approach to predict over 23,000 functional groups comprising over 55,000 coding and non-coding transcripts using publicly available transcriptomic profiles. We show that, compared to using Principal Component Analysis, Independent Component Analysis-derived transcriptional components enable more confident functionality predictions, improve predictions when new members are added to the gene sets, and are less affected by gene multi-functionality. Predictions generated using human or mouse transcriptomic data are made available for exploration in a publicly available web portal.


Great differences in performance and outcome of high-throughput sequencing data analysis platforms for fungal metabarcoding

MycoKeys ◽

10.3897/mycokeys.39.28109 ◽

2018 ◽

Vol 39 ◽

pp. 29-40 ◽

Cited By ~ 21

Author(s):

Sten Anslan ◽

R. Henrik Nilsson ◽

Christian Wurzbacher ◽

Petr Baldrian ◽

Leho Tedersoo ◽

...

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Computation Time ◽

Potential Effect ◽

Data Sets ◽

Sequencing Data ◽

Operational Taxonomic Units ◽

High Throughput Sequencing Data ◽

Recent Developments

Along with recent developments in high-throughput sequencing (HTS) technologies and the resulting fast accumulation of HTS data, there has been a growing need for, and interest in, developing tools for HTS data processing and communication. In particular, a number of bioinformatics tools have been designed for analysing metabarcoding data, each with specific features, assumptions and outputs. To evaluate the potential effect of applying different bioinformatics workflows on the results, we compared the performance of different analysis platforms on two contrasting high-throughput sequencing data sets. Our analysis revealed that the computation time, the quality of error filtering and hence the output of a specific bioinformatics process largely depend on the platform used. Our results show that none of the bioinformatics workflows appears to perfectly filter out the accumulated errors and generate Operational Taxonomic Units (OTUs), although PipeCraft, LotuS and PIPITS perform better than QIIME2 and Galaxy for the tested fungal amplicon dataset. We conclude that the output of each platform requires manual validation of the OTUs by examining the taxonomy assignment values.


circtools—a one-stop software solution for circular RNA research

Bioinformatics ◽

10.1093/bioinformatics/bty948 ◽

2018 ◽

Vol 35 (13) ◽

pp. 2326-2328 ◽

Cited By ~ 13

Author(s):

Tobias Jakobi ◽

Alexey Uvarovskii ◽

Christoph Dieterich

Keyword(s):

High Throughput Sequencing ◽

Circular Rna ◽

Statistical Testing ◽

Supplementary Information ◽

Circular Rnas ◽

Sequencing Data ◽

High Throughput Sequencing Data ◽

Multi Stage ◽

Sequence Reconstruction ◽

One Stop

Abstract Motivation: Circular RNAs (circRNAs) originate through back-splicing events from linear primary transcripts, are resistant to exonucleases, are not polyadenylated and have been shown to be highly specific for cell type and developmental stage. CircRNA detection starts from high-throughput sequencing data and is a multi-stage bioinformatics process yielding sets of potential circRNA candidates that require further analyses. While a number of tools for the prediction process already exist, publicly available analysis tools for further characterization are rare. Our work provides researchers with a harmonized workflow that covers different stages of in silico circRNA analyses, from prediction to first functional insights. Results: Here, we present circtools, a modular, Python-based framework for computational circRNA analyses. The software includes modules for circRNA detection, internal sequence reconstruction, quality checking, statistical testing, screening for enrichment of RBP binding sites, differential exon RNase R resistance and circRNA-specific primer design. circtools supports researchers with visualization options and data export into commonly used formats. Availability and implementation: circtools is available via https://github.com/dieterich-lab/circtools and http://circ.tools under GPLv3.0. Supplementary information: Supplementary data are available at Bioinformatics online.


Improvements and impacts of GRCh38 human reference on high throughput sequencing data analysis

Genomics ◽

10.1016/j.ygeno.2017.01.005 ◽

2017 ◽

Vol 109 (2) ◽

pp. 83-90 ◽

Cited By ~ 44

Author(s):

Yan Guo ◽

Yulin Dai ◽

Hui Yu ◽

Shilin Zhao ◽

David C. Samuels ◽

...

Keyword(s):

Data Analysis ◽

High Throughput ◽

High Throughput Sequencing ◽

Sequencing Data ◽

High Throughput Sequencing Data ◽

Sequencing Data Analysis


An R Package for Divergence Analysis of Omics Data

10.1101/720391 ◽

2019 ◽

Author(s):

Wikum Dinalankara ◽

Qian Ke ◽

Donald Geman ◽

Luigi Marchionni

Keyword(s):

High Throughput Sequencing ◽

R Package ◽

The Cancer Genome Atlas ◽

High Dimensional ◽

Omics Data ◽

Sequencing Data ◽

High Throughput Sequencing Data ◽

Ternary Code ◽

Cancer Genome Atlas ◽

Level Analysis

Abstract Given the ever-increasing amount of high-dimensional and complex omics data becoming available, it is increasingly important to discover simple but effective methods of analysis. Divergence analysis transforms each entry of a high-dimensional omics profile into a digitized (binary or ternary) code based on the deviation of the entry from a given baseline population. This is a novel framework that is significantly different from existing omics data analysis methods: it allows digitization of continuous omics data at the univariate or multivariate level, facilitates sample-level analysis, and is applicable to many different omics platforms. The divergence package, available on the R platform through the Bioconductor repository collection, provides easy-to-use functions for carrying out this transformation. Here we demonstrate how to use the package with sample high-throughput sequencing data from The Cancer Genome Atlas.
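The univariate ternary coding described above can be sketched as follows. This is not the divergence package's API or its estimation procedure: the function names are hypothetical, and taking the baseline interval as the simple minimum-to-maximum range of the baseline samples is a simplifying assumption made only to show the idea of coding each entry by its deviation from a baseline.

```python
def baseline_interval(baseline_values):
    """Assumed baseline range: the min and max observed in the baseline population."""
    return min(baseline_values), max(baseline_values)


def ternary_code(value, interval):
    """Return -1 below the baseline range, +1 above it, and 0 inside it."""
    low, high = interval
    if value < low:
        return -1
    if value > high:
        return 1
    return 0


def digitize_profile(profile, baseline):
    """Convert one sample's continuous profile into a per-feature ternary code."""
    return {
        feature: ternary_code(value, baseline_interval(baseline[feature]))
        for feature, value in profile.items()
    }


# Hypothetical baseline population (three normal samples per feature).
baseline = {
    "GENE_A": [2.0, 2.5, 3.0],
    "GENE_B": [5.0, 5.5, 6.0],
    "GENE_C": [0.1, 0.2, 0.3],
}
# Hypothetical tumor profile to be digitized.
tumor_profile = {"GENE_A": 4.2, "GENE_B": 5.4, "GENE_C": 0.05}

codes = digitize_profile(tumor_profile, baseline)
print(codes)
```

A binary code would follow by collapsing -1 and +1 into a single "divergent" state, for example with `abs(code)`.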


A machine learning-based framework for modeling transcription elongation

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.2007450118 ◽

2021 ◽

Vol 118 (6) ◽

pp. e2007450118

Author(s):

Peiyuan Feng ◽

An Xiao ◽

Meng Fang ◽

Fangping Wan ◽

Shuya Li ◽

...

Keyword(s):

Rna Polymerase Ii ◽

High Throughput Sequencing ◽

Gene Expression Regulation ◽

Transcription Elongation ◽

Transcriptional Elongation ◽

Sequencing Data ◽

Pol Ii ◽

High Throughput Sequencing Data ◽

Pol Ii Pausing ◽

Elongation Process

RNA polymerase II (Pol II) generally pauses at certain positions along gene bodies, thereby interrupting the transcription elongation process, which is often coupled with various important biological functions, such as precursor mRNA splicing and gene expression regulation. Characterizing the transcriptional elongation dynamics can thus help us understand many essential biological processes in eukaryotic cells. However, experimentally measuring Pol II elongation rates is generally time- and resource-consuming. We developed PEPMAN (polymerase II elongation pausing modeling through attention-based deep neural network), a deep learning-based model that accurately predicts Pol II pausing sites based on native elongating transcript sequencing (NET-seq) data. By taking full advantage of the attention mechanism, PEPMAN is able to decipher important sequence features underlying Pol II pausing. More importantly, we demonstrated that analyses of the PEPMAN-predicted results around various types of alternative splicing sites can provide useful clues for understanding cotranscriptional splicing events. In addition, associating the PEPMAN prediction results with different epigenetic features can help reveal important factors related to the transcription elongation process. All these results demonstrated that PEPMAN can provide a useful and effective tool for modeling transcription elongation and understanding the related biological factors from available high-throughput sequencing data.
