Curse: building expression atlases and co-expression networks from public RNA-Seq data

2018 ◽  
Vol 35 (16) ◽  
pp. 2880-2881 ◽  
Author(s):  
Dries Vaneechoutte ◽  
Klaas Vandepoele

Abstract Summary Public RNA-Sequencing (RNA-Seq) datasets are a valuable resource for transcriptome analyses, but their accessibility is hindered by the imperfect quality and presentation of their metadata and by the complexity of processing raw sequencing data. The Curse suite was created to alleviate these problems. It consists of an online curation tool named Curse to efficiently build compendia of experiments hosted on the Sequence Read Archive, and a lightweight pipeline named Prose to download and process the RNA-Seq data into expression atlases and co-expression networks. Curse networks showed improved linking of functionally related genes compared to the state-of-the-art. Availability and implementation Curse, Prose and their manuals are available at http://bioinformatics.psb.ugent.be/webtools/Curse/. Prose was implemented in Java. Supplementary information Supplementary data are available at Bioinformatics online.

2019 ◽  
Vol 36 (7) ◽  
pp. 2291-2292 ◽  
Author(s):  
Saskia Freytag ◽  
Ryan Lister

Abstract Summary Due to the scale and sparsity of single-cell RNA-sequencing data, traditional plots can obscure vital information. Our R package schex overcomes this by implementing hexagonal binning, which has the additional advantages of improving speed and reducing storage for resulting plots. Availability and implementation schex is freely available from Bioconductor via http://bioconductor.org/packages/release/bioc/html/schex.html and its development version can be accessed on GitHub via https://github.com/SaskiaFreytag/schex. Supplementary information Supplementary data are available at Bioinformatics online.
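The hexagonal binning idea mentioned above can be sketched in a few lines. This is a minimal Python illustration, not the schex API (schex is an R package); the function name `hex_bin` and the `width` parameter are hypothetical. It assumes 2D embedding coordinates per cell and assigns each point to the nearest hexagon centre by checking two interleaved rectangular lattices.

```python
import math
from collections import Counter


def hex_bin(points, width=1.0):
    """Count points per hexagon.

    `width` is the flat-to-flat hexagon width (an assumed parameter).
    Hexagon centres form two interleaved rectangular lattices; a point
    belongs to the hexagon whose centre is nearest.
    """
    dy = math.sqrt(3.0) * width  # vertical spacing within one lattice
    counts = Counter()
    for x, y in points:
        # Lattice 0: centres at (i * width, j * dy)
        i0 = math.floor(x / width + 0.5)
        j0 = math.floor(y / dy + 0.5)
        d0 = (x - i0 * width) ** 2 + (y - j0 * dy) ** 2
        # Lattice 1: centres shifted by half a cell in both directions
        i1 = math.floor(x / width)
        j1 = math.floor(y / dy)
        d1 = (x - (i1 + 0.5) * width) ** 2 + (y - (j1 + 0.5) * dy) ** 2
        key = (0, i0, j0) if d0 <= d1 else (1, i1, j1)
        counts[key] += 1
    return counts


# Illustrative coordinates only, not real single-cell data.
cells = [(0.0, 0.0), (0.1, 0.1), (-0.1, 0.05), (0.5, 0.87), (0.52, 0.85)]
bins = hex_bin(cells, width=1.0)
```

Plotting one hexagon per key, coloured by count or by a summary of gene expression, then needs storage proportional to the number of occupied bins rather than the number of cells.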


2019 ◽  
Author(s):  
Alemu Takele Assefa ◽  
Jo Vandesompele ◽  
Olivier Thas

Summary SPsimSeq is a semi-parametric simulation method for bulk and single cell RNA sequencing data. It simulates data from a good estimate of the actual distribution of a given real RNA-seq dataset. In contrast to existing approaches that assume a particular data distribution, our method constructs an empirical distribution of gene expression data from a given source RNA-seq experiment to faithfully capture the data characteristics of real data. Importantly, our method can be used to simulate a wide range of scenarios, such as single or multiple biological groups, systematic variations (e.g. confounding batch effects), and different sample sizes. It can also be used to simulate different gene expression units resulting from different library preparation protocols, such as read counts or UMI counts. Availability and implementation The R package and associated documentation are available from https://github.com/CenterForStatistics-UGent/SPsimSeq. Supplementary information Supplementary data are available at bioRχiv online.
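To illustrate what simulating from an empirical distribution can look like, here is a deliberately simplified Python sketch. It is not the SPsimSeq estimator, whose details are not given in the abstract; it swaps in plain inverse-transform sampling from a linearly interpolated empirical distribution of one gene's observed values. The function name and inputs are hypothetical.

```python
import random


def simulate_from_empirical(values, n, seed=0):
    """Draw n values from a linearly interpolated empirical distribution.

    `values` are observed expression values for one gene (assumed input).
    This is a stand-in for a semi-parametric density estimate.
    """
    xs = sorted(values)
    if len(xs) < 2:
        raise ValueError("need at least two observed values")
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        u = rng.random() * (len(xs) - 1)  # position along the sorted values
        lo = min(int(u), len(xs) - 2)     # index of the lower neighbour
        frac = u - lo
        out.append(xs[lo] + frac * (xs[lo + 1] - xs[lo]))
    return out


# Illustrative values only.
observed = [2.0, 3.5, 3.7, 5.1, 8.0]
simulated = simulate_from_empirical(observed, n=100, seed=1)
```

Simulated values stay within the observed range, which is one reason a real method would need a smoother density estimate than this sketch provides.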


2020 ◽  
Vol 36 (18) ◽  
pp. 4810-4812
Author(s):  
Qingxi Meng ◽  
Idoia Ochoa ◽  
Mikel Hernaez

Abstract Motivation Sequencing data are often summarized at different annotation levels for further analysis, generally using the general feature format (GFF) or its descendants, gene transfer format (GTF) and GFF3. Existing utilities for accessing these files, like gffutils and gffread, do not focus on reducing the storage space, significantly increasing it in some cases. We propose GPress, a framework for querying GFF files in a compressed form. GPress can also incorporate and compress expression files from both bulk and single-cell RNA-Seq experiments, supporting simultaneous queries on both the GFF and expression files. In brief, GPress applies transformations to the data which are then compressed with the general lossless compressor BSC. To support queries, GPress compresses the data in blocks and creates several index tables for fast retrieval. Results We tested GPress on several GFF files of different organisms, and showed that it achieves on average a 61% reduction in size with respect to gzip (the current de facto compressor for GFF files) while being able to retrieve all annotations for a given identifier or a range of coordinates in a few seconds (when run on a common laptop). In contrast, gffutils provides faster retrieval but doubles the size of the GFF files. When additionally linking an expression file, we show that GPress can reduce its size by more than 68% when compared to gzip (for both bulk and single-cell RNA-Seq experiments), while still retrieving the information within seconds. Finally, applying BSC to the data streams generated by GPress instead of to the original file shows a size reduction of more than 44% on average. Availability and implementation GPress is freely available at https://github.com/qm2/gpress. Supplementary information Supplementary data are available at Bioinformatics online.
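The block-plus-index idea described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not GPress code: zlib stands in for BSC, the records use a hypothetical simplified tab-separated layout with the identifier in the first column (real GFF keeps identifiers in the attributes column), and the function names are invented for the example.

```python
import zlib


def build_archive(lines, block_size=2):
    """Compress records in fixed-size blocks and index identifiers by block."""
    blocks = []
    index = {}  # identifier -> set of block numbers containing it
    for start in range(0, len(lines), block_size):
        chunk = lines[start:start + block_size]
        block_id = len(blocks)
        blocks.append(zlib.compress("\n".join(chunk).encode("utf-8")))
        for line in chunk:
            identifier = line.split("\t")[0]  # assumed: id in first column
            index.setdefault(identifier, set()).add(block_id)
    return blocks, index


def query(blocks, index, identifier):
    """Decompress only the blocks that hold the identifier."""
    hits = []
    for block_id in sorted(index.get(identifier, ())):
        text = zlib.decompress(blocks[block_id]).decode("utf-8")
        hits.extend(
            line for line in text.split("\n")
            if line.split("\t")[0] == identifier
        )
    return hits


# Illustrative records only.
records = [
    "g1\tgene\t1\t100",
    "g1\texon\t1\t50",
    "g2\tgene\t200\t300",
    "g3\tgene\t400\t500",
    "g2\texon\t210\t250",
]
blocks, index = build_archive(records, block_size=2)
g2_records = query(blocks, index, "g2")
```

Smaller blocks make each query decompress less data but usually compress worse, which is the trade-off any block-based scheme of this kind has to balance.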


2019 ◽  
Author(s):  
Sebastian Deorowicz

Abstract Motivation The amount of genomic data that needs to be stored is huge. Therefore it is not surprising that a lot of work has been done in the field of specialized data compression of FASTQ files. The existing algorithms are, however, still imperfect and the best tools produce quite large archives. Results We present FQSqueezer, a novel compression algorithm for sequencing data able to process single- and paired-end reads of variable lengths. It is based on the ideas from the famous prediction by partial matching and dynamic Markov coder algorithms known from the general-purpose-compressors world. The compression ratios are often tens of percent better than offered by the state-of-the-art tools. Availability and Implementation https://github.com/refresh-bio/[email protected] Supplementary information Supplementary data are available at publisher’s Web site.


2019 ◽  
Author(s):  
Justin Sein ◽  
Liam F. Spurr ◽  
Pavlos Bousounis ◽  
N M Prashant ◽  
Hongyu Liu ◽  
...  

Summary RsQTL is a tool for identification of splicing quantitative trait loci (sQTLs) from RNA-sequencing (RNA-seq) data by correlating the variant allele fraction at expressed SNV loci in the transcriptome (VAFRNA) with the proportion of molecules spanning local exon-exon junctions at loci with differential intron excision (percent spliced in, PSI). We exemplify the method on sets of RNA-seq data from human tissues obtained through the Genotype-Tissue Expression Project (GTEx). RsQTL does not require matched DNA and can identify a subset of expressed sQTL loci. Due to the dynamic nature of VAFRNA, RsQTL is applicable for the assessment of conditional and dynamic variation-splicing relationships. Availability and implementation https://github.com/HorvathLab/[email protected] or [email protected] Supplementary Information RsQTL_Supplementary_Data.zip
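The correlation at the core of this description can be sketched in Python. This is an illustration under assumptions, not the RsQTL implementation: the abstract does not state which correlation or regression model is used, so a plain Pearson coefficient is swapped in, and the per-sample read counts and function names are hypothetical.

```python
import math


def variant_allele_fraction(ref_reads, alt_reads):
    """VAF_RNA: fraction of RNA-seq reads at an SNV carrying the variant allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("locus has no covering reads")
    return alt_reads / total


def percent_spliced_in(junction_reads, cluster_reads):
    """PSI: share of reads in an intron cluster that span the junction of interest."""
    if cluster_reads == 0:
        raise ValueError("intron cluster has no reads")
    return junction_reads / cluster_reads


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# One tuple per sample: (ref reads, alt reads, junction reads, cluster reads).
# Illustrative counts only.
samples = [(10, 0, 2, 10), (5, 5, 5, 10), (0, 10, 8, 10)]
vafs = [variant_allele_fraction(r, a) for r, a, _, _ in samples]
psis = [percent_spliced_in(j, c) for _, _, j, c in samples]
r = pearson(vafs, psis)
```

Because both quantities are computed from RNA-seq reads alone, a test of this shape needs no matched DNA genotypes, which matches the claim in the summary.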


2019 ◽  
Vol 36 (8) ◽  
pp. 2589-2591
Author(s):  
Lorraine A K Ayad ◽  
Panagiotis Charalampopoulos ◽  
Solon P Pissis

Abstract Summary State-of-the-art repeat analysis tools rely on extending maximal repeated pairs to enumerate maximal k-mismatch repeats. These pairs can be quadratic in n, the length of the input sequence, and thus greedy heuristics are applied to speed up the extension. Here, we introduce supermaximal k-mismatch repeats, which are linear in n and capture all maximal k-mismatch repeats: every maximal k-mismatch repeat is a substring of some supermaximal k-mismatch repeat. We present SMART, a tool based on recent algorithmic advances implemented in C++ to compute supermaximal k-mismatch repeats directly, and show that these elements are statistically much more significant than the output of the state-of-the-art. Availability and implementation http://github.com/lorrainea/smart (GNU GPL v3.0). Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Andrew E Teschendorff ◽  
Alok K Maity ◽  
Xue Hu ◽  
Chen Weiyan ◽  
Matthias Lechner

Abstract Motivation An important task in the analysis of single-cell RNA-Seq data is the estimation of differentiation potency, as this can help identify stem-or-multipotent cells in non-temporal studies or in tissues where differentiation hierarchies are not well established. A key challenge in the estimation of single-cell potency is the need for a fast and accurate algorithm, scalable to large scRNA-Seq studies profiling millions of cells. Results Here, we present a single-cell potency measure, called Correlation of Connectome and Transcriptome (CCAT), which can return accurate single-cell potency estimates of a million cells in minutes, a 100-fold improvement over current state-of-the-art methods. We benchmark CCAT against 8 other single-cell potency models and across 28 scRNA-Seq studies, encompassing over 2 million cells, demonstrating accuracy comparable to the current state-of-the-art, at a significantly reduced computational cost, and with increased robustness to dropouts. Availability and implementation CCAT is part of the SCENT R-package, freely available from https://github.com/aet21/SCENT. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 36 (7) ◽  
pp. 2260-2261 ◽  
Author(s):  
Georgios Fotakis ◽  
Dietmar Rieder ◽  
Marlene Haider ◽  
Zlatko Trajanoski ◽  
Francesca Finotello

Abstract Summary Gene fusions can generate immunogenic neoantigens that mediate anticancer immune responses. However, their computational prediction from RNA sequencing (RNA-seq) data requires deep bioinformatics expertise to assemble a computational workflow covering the prediction of: fusion transcripts, their translated proteins and peptides, Human Leukocyte Antigen (HLA) types, and peptide-HLA binding affinity. Here, we present NeoFuse, a computational pipeline for the prediction of fusion neoantigens from tumor RNA-seq data. NeoFuse can be applied to cancer patients’ RNA-seq data to identify fusion neoantigens that might expand the repertoire of suitable targets for immunotherapy. Availability and implementation NeoFuse source code and documentation are available under GPLv3 license at https://icbi.i-med.ac.at/NeoFuse/. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Matteo Chiara ◽  
Federico Zambelli ◽  
Marco Antonio Tangaro ◽  
Pietro Mandreoli ◽  
David S Horner ◽  
...  

Abstract Summary While over 200 000 genomic sequences are currently available through dedicated repositories, ad hoc methods for the functional annotation of SARS-CoV-2 genomes do not harness all currently available resources for the annotation of functionally relevant genomic sites. Here, we present CorGAT, a novel tool for the functional annotation of SARS-CoV-2 genomic variants. By comparisons with other state of the art methods we demonstrate that, by providing a more comprehensive and rich annotation, our method can facilitate the identification of evolutionary patterns in the genome of SARS-CoV-2. Availability and implementation Galaxy: http://corgat.cloud.ba.infn.it/galaxy; software: https://github.com/matteo14c/CorGAT/tree/Revision_V1; docker: https://hub.docker.com/r/laniakeacloud/galaxy_corgat. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Fabricio Almeida-Silva ◽  
Kanhu C Moharana ◽  
Thiago M Venancio

Abstract In the past decade, over 3000 samples of soybean transcriptomic data have accumulated in public repositories. Here, we review the state of the art in soybean transcriptomics, highlighting the major microarray and RNA-seq studies that investigated soybean transcriptional programs in different tissues and conditions. Further, we propose approaches for integrating such big data using gene coexpression networks and outline important web resources that may facilitate soybean data acquisition and analysis, contributing to the acceleration of soybean breeding and functional genomics research.

