Quark enables semi-reference-based compression of RNA-seq data

Mapping Intimacies ◽

10.1101/085878 ◽

2016 ◽

Author(s):

Hirak Sarkar ◽

Rob Patro

Keyword(s):

State Of The Art ◽

Reference Sequence ◽

Rna Seq ◽

Sequencing Data ◽

The Past ◽

Link Type ◽

Exponential Increase

Motivation: The past decade has seen an exponential increase in biological sequencing capacity, and there has been a simultaneous effort to help organize and archive some of the vast quantities of sequencing data that are being generated. While these developments are tremendous from the perspective of maximizing the scientific utility of available data, they come with heavy costs. The storage and transmission of such vast amounts of sequencing data are expensive. Results: We present Quark, a semi-reference-based compression tool designed for RNA-seq data. Quark makes use of a reference sequence when encoding reads, but produces a representation that can be decoded independently, without the need for a reference. This allows Quark to achieve markedly better compression rates than existing reference-free schemes, while still relieving the burden of assuming a specific, shared reference sequence between the encoder and decoder. We demonstrate that Quark achieves state-of-the-art compression rates, and that, typically, only a small fraction of the reference sequence must be encoded along with the reads to allow reference-free decompression. Availability: Quark is implemented in C++11, and is available under a GPLv3 license at www.github.com/COMBINE-lab/[email protected]
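
The semi-reference-based idea described in this abstract can be illustrated with a toy sketch. This is not Quark's actual algorithm or file format: it assumes reads match the reference exactly, and the function names `encode` and `decode` and the record layout are hypothetical. It only shows how an encoder might use a reference while emitting just the covered reference fragments, so that a decoder needs no reference.

```python
from bisect import bisect_right


def encode(reference, reads):
    """Toy encoder: uses the reference, but outputs only the covered fragments."""
    # Exact-match lookup only (an assumption of this sketch); -1 means "no match".
    positions = [reference.find(read) for read in reads]

    # Merge the reference intervals covered by matched reads into "islands".
    intervals = sorted((p, p + len(r)) for p, r in zip(positions, reads) if p >= 0)
    islands = []
    for start, end in intervals:
        if islands and start <= islands[-1][1]:
            islands[-1][1] = max(islands[-1][1], end)
        else:
            islands.append([start, end])
    starts = [start for start, _ in islands]

    # Each read becomes either an (island, offset, length) triple or a raw string.
    records = []
    for p, read in zip(positions, reads):
        if p < 0:
            records.append(("raw", read))
        else:
            i = bisect_right(starts, p) - 1  # island containing position p
            records.append(("ref", i, p - starts[i], len(read)))

    island_seqs = [reference[start:end] for start, end in islands]
    return island_seqs, records


def decode(island_seqs, records):
    """Toy decoder: needs only the stored islands, not the full reference."""
    reads = []
    for record in records:
        if record[0] == "raw":
            reads.append(record[1])
        else:
            _, i, offset, length = record
            reads.append(island_seqs[i][offset:offset + length])
    return reads


reference = "ACGTACGTTTGACCAGGATTACAGGCAT"
reads = ["CGTAC", "TACGT", "GGATTA", "NNNN"]
island_seqs, records = encode(reference, reads)
restored = decode(island_seqs, records)
```

In this toy example the stored islands are shorter than the full reference, which mirrors the abstract's point that only a fraction of the reference needs to travel with the reads. A real tool would also have to handle mismatches, quality scores, and entropy coding, none of which is attempted here.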

The state of the art in soybean transcriptomics resources and gene coexpression networks

in silico Plants ◽

10.1093/insilicoplants/diab005 ◽

2021 ◽

Author(s):

Fabricio Almeida-Silva ◽

Kanhu C Moharana ◽

Thiago M Venancio

Keyword(s):

State Of The Art ◽

The State ◽

Gene Coexpression Network ◽

Rna Seq ◽

Transcriptomic Data ◽

The Past ◽

Gene Coexpression ◽

Genomics Research ◽

Public Repositories ◽

Coexpression Networks

Abstract: In the past decade, over 3000 samples of soybean transcriptomic data have accumulated in public repositories. Here, we review the state of the art in soybean transcriptomics, highlighting the major microarray and RNA-seq studies that investigated soybean transcriptional programs in different tissues and conditions. Further, we propose approaches for integrating such big data using gene coexpression networks and outline important web resources that may facilitate soybean data acquisition and analysis, contributing to the acceleration of soybean breeding and functional genomics research.

K-mer counting with low memory consumption enables fast clustering of single-cell sequencing data without read alignment

10.1101/723833 ◽

2019 ◽

Author(s):

Christina Huan Shi ◽

Kevin Y. Yip

Keyword(s):

Single Cell ◽

State Of The Art ◽

Rna Seq ◽

Sequencing Data ◽

Memory Consumption ◽

Analysis Pipeline ◽

Cell Clusters ◽

Single Cell Sequencing ◽

Sequencing Errors ◽

Full Analysis

Abstract: K-mer counting has many applications in sequencing data processing and analysis. However, sequencing errors can produce many false k-mers that substantially increase the memory requirement during counting. We propose a fast k-mer counting method, CQF-deNoise, which has a novel component for dynamically identifying and removing false k-mers while preserving counting accuracy. Compared with four state-of-the-art k-mer counting methods, CQF-deNoise consumed 49-76% less memory than the second-best method, but still ran competitively fast. The k-mer counts from CQF-deNoise produced cell clusters from single-cell RNA-seq data highly consistent with CellRanger but required only 5% of the running time at the same memory consumption, suggesting that CQF-deNoise can be used for a preview of cell clusters for an early detection of potential data problems, before running a much more time-consuming full analysis pipeline.
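
The general idea of counting k-mers while discarding likely error k-mers during the pass can be sketched as follows. This is a simplified illustration, not CQF-deNoise itself: it uses a plain Python `Counter` rather than a counting quotient filter, and the function name `count_kmers` and the parameters `prune_every` and `min_count` are assumptions made for this example.

```python
from collections import Counter


def count_kmers(reads, k, prune_every=1000, min_count=2):
    """Count k-mers, periodically dropping those seen fewer than min_count times."""
    counts = Counter()
    for n, read in enumerate(reads, start=1):
        # Slide a window of width k across the read.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
        # Periodic pruning keeps the table small: rare k-mers are treated as
        # probable sequencing errors and removed (an assumption of this sketch).
        if n % prune_every == 0:
            for kmer in [km for km, c in counts.items() if c < min_count]:
                del counts[kmer]
    return counts


reads = ["ACGTACGT"] * 3 + ["ACGTACCT"]  # the last read carries one substitution
counts = count_kmers(reads, k=4, prune_every=4, min_count=2)
```

A pruned k-mer that reappears later restarts from zero in this sketch, so counts of genuinely rare k-mers can be underestimated. That is the trade-off between memory and accuracy that a production method has to manage more carefully.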

Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples

10.1101/097881 ◽

2017 ◽

Cited By ~ 2

Author(s):

Christopher Wilks ◽

Phani Gaddipati ◽

Abhinav Nellore ◽

Ben Langmead

Keyword(s):

Tissue Specificity ◽

Rna Seq ◽

Sequencing Data ◽

Transcription Start ◽

Link Type ◽

Alternative Transcription ◽

Web App ◽

Inverted Indexing ◽

Splice Junctions ◽

Splicing Patterns

Abstract: As more and larger genomics studies appear, there is a growing need for comprehensive and queryable cross-study summaries. Snaptron is a search engine for summarized RNA sequencing data with a query planner that leverages R-tree, B-tree and inverted indexing strategies to rapidly execute queries over 146 million exon-exon splice junctions from over 70,000 human RNA-seq samples. Queries can be tailored by constraining which junctions and samples to consider. Snaptron can also rank and score junctions according to tissue specificity or other criteria. Further, Snaptron can rank and score samples according to the relative frequency of different splicing patterns. We outline biological questions that can be explored with Snaptron queries, including a study of novel exons in annotated genes, of exonization of repetitive element loci, and of a recently discovered alternative transcription start site for the ALK gene. Web app and documentation are at http://snaptron.cs.jhu.edu. Source code is at https://github.com/ChristopherWilks/snaptron under the MIT license.

Graphmap2 - splice-aware RNA-seq mapper for long reads

10.1101/720458 ◽

2019 ◽

Cited By ~ 1

Author(s):

Josip Marić ◽

Ivan Sović ◽

Krešimir Križanović ◽

Niranjan Nagarajan ◽

Mile Šikić

Keyword(s):

State Of The Art ◽

The State ◽

Rna Seq ◽

Link Type ◽

Pacific Biosciences ◽

Long Reads ◽

Oxford Nanopore

Abstract: In this paper we present Graphmap2, a splice-aware mapper built on our previously developed DNA mapper Graphmap. Graphmap2 is tailored for long reads produced by Pacific Biosciences and Oxford Nanopore devices. It uses several newly developed algorithms which enable higher precision and recall of correctly detected transcripts and exon boundaries. We compared its performance with the state-of-the-art tools Minimap2 and Gmap. On both simulated and real datasets Graphmap2 achieves higher mappability and more correctly recognized exons and their ends. In addition, we present an analysis of the potential of splice-aware mappers and long reads for the identification of previously unknown isoforms and even genes. The Graphmap2 tool is publicly available at https://github.com/lbcb-sci/graphmap2.

RNA sequencing data: hitchhiker's guide to expression analysis

10.7287/peerj.preprints.27283 ◽

2018 ◽

Author(s):

Koen Van Den Berge ◽

Katharina Hembach ◽

Charlotte Soneson ◽

Simone Tiberi ◽

Lieven Clement ◽

...

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Large Scale ◽

Science Studies ◽

Rna Seq ◽

Sequencing Data ◽

Data Types ◽

The Past ◽

Long Read ◽

Statistical Approaches

Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies, but also agricultural or clinical situations. In the past 10 years or so, much has been learned about the characteristics of the RNA-seq datasets as well as the performance of the myriad of methods developed. In this review, we give an overall view of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.

GEO2RNAseq: An easy-to-use R pipeline for complete pre-processing of RNA-seq data

10.1101/771063 ◽

2019 ◽

Cited By ~ 2

Author(s):

Bastian Seelbinder ◽

Thomas Wolf ◽

Steffen Priebe ◽

Sylvie McNamara ◽

Silvia Gerber ◽

...

Keyword(s):

Gene Expression ◽

Single Species ◽

Gene Expression Omnibus ◽

Rna Seq ◽

Sequencing Data ◽

Interacting Species ◽

Link Type ◽

Fastq Format ◽

Standard Tool ◽

Processing Steps

Abstract: In transcriptomics, the study of the total set of RNAs transcribed by the cell, RNA sequencing (RNA-seq) has become the standard tool for analysing gene expression. The primary goal is the detection of genes whose expression changes significantly between two or more conditions, either for a single species or for two or more interacting species at the same time (dual RNA-seq, triple RNA-seq and so forth). The analysis of RNA-seq can be simplified as many steps of the data pre-processing can be standardised in a pipeline. In this publication we present the “GEO2RNAseq” pipeline for complete, quick and concurrent pre-processing of single, dual, and triple RNA-seq data. It covers all pre-processing steps starting from raw sequencing data to the analysis of differentially expressed genes, including various tables and figures to report intermediate and final results. Raw data may be provided in FASTQ format or can be downloaded automatically from the Gene Expression Omnibus repository. GEO2RNAseq strongly incorporates experimental as well as computational metadata. GEO2RNAseq is implemented in R, lightweight, easy to install via Conda and easy to use, but still very flexible through using modular programming and offering many extensions and alternative workflows. GEO2RNAseq is publicly available at https://anaconda.org/xentrics/r-geo2rnaseq and https://bitbucket.org/thomas_wolf/geo2rnaseq/overview, including source code, installation instructions, and comprehensive package documentation.

sangeranalyseR: simple and interactive analysis of Sanger sequencing data in R

10.1101/2020.05.18.102459 ◽

2020 ◽

Author(s):

Kuan-Hao Chao ◽

Kirston Barton ◽

Sarah Palmer ◽

Robert Lanfear

Keyword(s):

Sanger Sequencing ◽

Reference Sequence ◽

Supplementary Information ◽

File Format ◽

Bioconductor Package ◽

Sequencing Data ◽

Interactive Analysis ◽

Link Type ◽

Online Documentation ◽

Wide Range

Summary: sangeranalyseR is an interactive R/Bioconductor package and two associated Shiny applications designed for analysing Sanger sequencing data from the ABIF file format in R. It allows users to go from loading reads to saving aligned contigs in a few lines of R code. sangeranalyseR provides a wide range of options for a number of commonly-performed actions including read trimming, detecting secondary peaks, viewing chromatograms, and detecting indels using a reference sequence. All parameters can be adjusted interactively either in R or in the associated Shiny applications. sangeranalyseR comes with extensive online documentation, and outputs detailed interactive HTML reports. Availability and implementation: sangeranalyseR is implemented in R and released under an MIT license. It is available for all platforms on Bioconductor (https://bioconductor.org/packages/sangeranalyseR) and on GitHub (https://github.com/roblanf/sangeranalyseR). Supplementary information: Documentation at https://sangeranalyser.readthedocs.io/.

NASQAR: A web-based platform for high-throughput sequencing data analysis and visualization

10.1101/709980 ◽

2019 ◽

Cited By ~ 1

Author(s):

Ayman Yousif ◽

Nizar Drou ◽

Jillian Rowe ◽

Mohammed Khalfan ◽

Kristin C Gunsalus

Keyword(s):

New York ◽

Data Analysis ◽

Open Source ◽

High Throughput ◽

High Throughput Sequencing ◽

Web Applications ◽

Rna Seq ◽

Sequencing Data ◽

Web Based ◽

Link Type

Background: As high-throughput sequencing applications continue to evolve, the rapid growth in quantity and variety of sequence-based data calls for the development of new software libraries and tools for data analysis and visualization. Often, effective use of these tools requires computational skills beyond those of many researchers. To ease this computational barrier, we have created a dynamic web-based platform, NASQAR (Nucleic Acid SeQuence Analysis Resource). Results: NASQAR offers a collection of custom and publicly available open-source web applications that make extensive use of a variety of R packages to provide interactive data analysis and visualization. The platform is publicly accessible at http://nasqar.abudhabi.nyu.edu/. Open-source code is on GitHub at https://github.com/nasqar/NASQAR, and the system is also available as a Docker image at https://hub.docker.com/r/aymanm/nasqarall. NASQAR is a collaboration between the core bioinformatics teams of the NYU Abu Dhabi and NYU New York Centers for Genomics and Systems Biology. Conclusions: NASQAR empowers non-programming experts with a versatile and intuitive toolbox to easily and efficiently explore, analyze, and visualize their transcriptomics data interactively. Popular tools for a variety of applications are currently available, including Transcriptome Data Preprocessing, RNA-seq Analysis (including Single-cell RNA-seq), Metagenomics, and Gene Enrichment.

Strategies for cellular deconvolution in human brain RNA sequencing data

F1000Research ◽

10.12688/f1000research.50858.1 ◽

2021 ◽

Vol 10 ◽

pp. 750

Author(s):

Olukayode A. Sosina ◽

Matthew N. Tran ◽

Kristen R. Maynard ◽

Ran Tao ◽

Margaret A. Taub ◽

...

Keyword(s):

Single Cell ◽

Expression Data ◽

Moderate Correlation ◽

Rna Seq ◽

Cell Type ◽

Sequencing Data ◽

Reference Dataset ◽

The Past ◽

Cell Type Composition ◽

Type Composition

Background: Statistical deconvolution strategies have emerged over the past decade to estimate the proportion of various cell populations in homogenate tissue sources like brain using gene expression data. However, no study has been undertaken to assess the extent to which expression-based and DNAm-based cell type composition estimates agree. Results: Using estimated neuronal fractions from DNAm data, from the same brain region (i.e., matched) as our bulk RNA-Seq dataset, as proxies for the true unobserved cell-type fractions (i.e., as the gold standard), we assessed the accuracy (RMSE) and concordance (R²) of four reference-based deconvolution algorithms: Houseman, CIBERSORT, non-negative least squares (NNLS)/MIND, and MuSiC. We did this for two cell-type populations - neurons and non-neurons/glia - using matched single nuclei RNA-Seq and mismatched single cell RNA-Seq reference datasets. With the mismatched single cell RNA-Seq reference dataset, Houseman, MuSiC, and NNLS produced concordant (high correlation; Houseman R² = 0.51, 95% CI [0.39, 0.65]; MuSiC R² = 0.56, 95% CI [0.43, 0.69]; NNLS R² = 0.54, 95% CI [0.32, 0.68]) but biased (high RMSE, >0.35) neuronal fraction estimates. CIBERSORT produced more discordant (moderate correlation; R² = 0.25, 95% CI [0.15, 0.38]) neuronal fraction estimates, but with less bias (low RMSE, 0.09). Using the matched single nuclei RNA-Seq reference dataset did not eliminate bias (MuSiC RMSE = 0.17). Conclusions: Our results together suggest that many existing RNA deconvolution algorithms estimate the RNA composition of homogenate tissue, e.g. the amount of RNA attributable to each cell type, and not the cellular composition, which relates to the underlying fraction of cells.
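
The distinction between concordance and bias drawn in this abstract can be made concrete with a small sketch. It assumes that R² denotes the squared Pearson correlation between estimated and reference fractions, which the abstract does not state explicitly. The function names and the example numbers are illustrative and are not taken from the study.

```python
import math


def rmse(estimates, reference):
    """Root-mean-square error between estimated and reference fractions."""
    n = len(reference)
    return math.sqrt(sum((e - r) ** 2 for e, r in zip(estimates, reference)) / n)


def r_squared(estimates, reference):
    """Squared Pearson correlation (assumed here to be the concordance measure)."""
    n = len(reference)
    mean_e = sum(estimates) / n
    mean_r = sum(reference) / n
    cov = sum((e - mean_e) * (r - mean_r) for e, r in zip(estimates, reference))
    var_e = sum((e - mean_e) ** 2 for e in estimates)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    return cov ** 2 / (var_e * var_r)


# Made-up neuronal fractions: the estimates track the reference perfectly
# but are shifted upward by a constant offset.
reference = [0.2, 0.3, 0.4, 0.5]
estimates = [0.5, 0.6, 0.7, 0.8]

error = rmse(estimates, reference)
concordance = r_squared(estimates, reference)
```

With these made-up values the estimates are perfectly correlated with the reference yet carry a large RMSE, the same "concordant but biased" pattern the abstract reports for several algorithms.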

Spatiotemporal Changes in Transcriptome of Odontogenic and Non-odontogenic Regions in the Dental Arch of Mus musculus

Frontiers in Cell and Developmental Biology ◽

10.3389/fcell.2021.723326 ◽

2021 ◽

Vol 9 ◽

Author(s):

Dong-Joon Lee ◽

Hyun-Yi Kim ◽

Seung-Jun Lee ◽

Han-Sung Jung

Keyword(s):

Tooth Development ◽

Molecular Therapy ◽

Dental Arch ◽

Tooth Regeneration ◽

Rna Seq ◽

Sequencing Data ◽

Supernumerary Tooth ◽

Gene Modification ◽

The Past ◽

New Perspective

Over the past 40 years, studies on tooth regeneration have been conducted. These studies comprised two main flows: some focused on epithelial–mesenchymal interaction in the odontogenic region, whereas others focused on creating a supernumerary tooth in the non-odontogenic region. Recently, the scope of the research has moved from conventional gene modification and molecular therapy to genome and transcriptome sequencing analyses. However, these sequencing data have been produced only in the odontogenic region. We provide RNA-Seq data of not only the odontogenic region but also the non-odontogenic region, which loses tooth-forming capacity during development and remains a rudiment. Sequencing data were collected from mouse embryos at three different stages of tooth development. These data will expand our understanding of tooth development and will help in designing developmental and regenerative studies from a new perspective.
