Kmerator Suite: design of specific k-mer signatures and automatic metadata discovery in large RNA-Seq datasets.

2021 ◽  
Author(s):  
Sebastien Riquier ◽  
Chloe Bessiere ◽  
Benoit Guibert ◽  
Anne-Laure Bouge ◽  
Anthony Boureux ◽  
...  

The huge body of publicly available RNA-seq libraries is a treasure of functional information that allows the expression of known or novel transcripts in tissues to be quantified. However, transcript quantification commonly relies on alignment methods that require substantial computational resources and processing time and do not scale easily to large datasets. K-mer decomposition constitutes a new way to process RNA-seq data for the identification of transcriptional signatures, as k-mers can be used to quantify gene expression accurately in a less resource-consuming way. We present the Kmerator Suite, a set of three tools designed to extract specific k-mer signatures, quantify these k-mers in RNA-seq datasets and quickly visualize the characteristics of large datasets. The core tool, Kmerator, produces specific k-mers for 97% of human genes, enabling gene expression to be measured with high accuracy in simulated datasets. KmerExploR, a direct application of Kmerator, uses a set of predictor gene-specific k-mers to infer metadata including library protocol, sample features or contaminations from RNA-seq datasets. KmerExploR results are visualised through a user-friendly interface. Moreover, we demonstrate that the Kmerator Suite can be used for advanced queries targeting known or new biomarkers such as mutations, gene fusions or long non-coding RNAs for human health applications.

2021 ◽  
Vol 3 (3) ◽  
Author(s):  
Sébastien Riquier ◽  
Chloé Bessiere ◽  
Benoit Guibert ◽  
Anne-Laure Bouge ◽  
Anthony Boureux ◽  
...  

Abstract The huge body of publicly available RNA-sequencing (RNA-seq) libraries is a treasure of functional information that allows the expression of known or novel transcripts in tissues to be quantified. However, transcript quantification commonly relies on alignment methods that require substantial computational resources and processing time and do not scale easily to large datasets. K-mer decomposition constitutes a new way to process RNA-seq data for the identification of transcriptional signatures, as k-mers can be used to quantify gene expression accurately in a less resource-consuming way. We present the Kmerator Suite, a set of three tools designed to extract specific k-mer signatures, quantify these k-mers in RNA-seq datasets and quickly visualize large dataset characteristics. The core tool, Kmerator, produces specific k-mers for 97% of human genes, enabling gene expression to be measured with high accuracy in simulated datasets. KmerExploR, a direct application of Kmerator, uses a set of predictor gene-specific k-mers to infer metadata including library protocol, sample features or contaminations from RNA-seq datasets. KmerExploR results are visualized through a user-friendly interface. Moreover, we demonstrate that the Kmerator Suite can be used for advanced queries targeting known or new biomarkers such as mutations, gene fusions or long non-coding RNAs for human health applications.
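
The idea of a specific k-mer signature described in this abstract can be illustrated with a minimal Python sketch. It is not Kmerator's implementation: the function names, the toy sequences and the choice of k = 4 are assumptions made only for illustration, and the sketch ignores reverse complements and sequencing errors. It decomposes a target sequence into overlapping k-mers, keeps those absent from a set of background sequences, and counts the surviving k-mers in reads.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping substring of length k from seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def specific_kmers(target, background, k):
    """Return the k-mers of target that occur in no background sequence.

    Order of first appearance in target is preserved; duplicates are dropped.
    """
    background_kmers = set()
    for seq in background:
        background_kmers.update(kmers(seq, k))

    seen = set()
    signature = []
    for km in kmers(target, k):
        if km not in background_kmers and km not in seen:
            seen.add(km)
            signature.append(km)
    return signature


def count_signature(reads, signature, k):
    """Count how often each signature k-mer occurs in the reads (forward strand only)."""
    wanted = set(signature)
    counts = Counter()
    for read in reads:
        for km in kmers(read, k):
            if km in wanted:
                counts[km] += 1
    return counts


if __name__ == "__main__":
    # Toy sequences, hypothetical and far shorter than real transcripts.
    signature = specific_kmers("ACGTACGGA", ["ACGTACC"], k=4)
    print(signature)
    print(count_signature(["TTACGGAT", "ACGTAC"], signature, k=4))
```

In this toy setting the second read contributes nothing because all of its k-mers are shared with the background sequence, which is the property that makes a signature specific.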


2012 ◽  
Vol 111 (suppl_1) ◽  
Author(s):  
Emma L Robinson ◽  
Syed Haider ◽  
Hillary Hei ◽  
Richard T Lee ◽  
Roger S Foo

Heart failure comprises clinically distinct inciting causes, but a consistent pattern of change in myocardial gene expression supports the hypothesis that unifying biochemical mechanisms underlie disease progression. The recent RNA-seq revolution has enabled whole transcriptome profiling, using deep-sequencing technologies. Up to 70% of the genome is now known to be transcribed into RNA, a significant proportion of which is long non-coding RNAs (lncRNAs), defined as polyribonucleotides of ≥200 nucleotides. This project aims to discover whether the myocardial expression of lncRNAs changes in the failing heart. Paired-end RNA-seq from a 300-400bp library of ‘stretched’ mouse myocyte total RNA was carried out to generate 76-mer sequence reads. Mechanically stretching myocytes with equibiaxial stretch apparatus mimics pathological hypertrophy in the heart. Transcripts were assembled and aligned to the mm9 reference genome (UCSC), abundance was determined, and the expression of novel transcripts and alternative splice variants was compared with that in control (non-stretched) mouse myocytes. Five novel transcripts have been identified in our RNA-seq that are differentially expressed in stretched myocytes compared with non-stretched. These are regions of the genome that are currently unannotated and potentially are transcribed into non-coding RNAs. Roles of known lncRNAs include control of gene expression, either by direct interaction with complementary regions of the genome or association with chromatin remodelling complexes which act on the epigenome. Changes in expression of genes which contribute to the deterioration of the failing heart could be due to the actions of these novel lncRNAs, immediately suggesting a target for new pharmaceuticals.
Changes in the expression of these novel transcripts will be validated in a larger sample size of stretched myocytes vs non-stretched myocytes as well as in the hearts of transverse aortic constriction (TAC) mice vs Sham (surgical procedure without the aortic banding). In vivo investigations will then be carried out, using siLNA antisense technology to silence novel lncRNAs in mice.


BMC Genomics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Paulo Rapazote-Flores ◽  
Micha Bayer ◽  
Linda Milne ◽  
Claus-Dieter Mayer ◽  
John Fuller ◽  
...  

Abstract Background The time required to analyse RNA-seq data varies considerably, due to discrete steps for computational assembly, quantification of gene expression and splicing analysis. Recent fast non-alignment tools such as Kallisto and Salmon overcome these problems, but these tools require a high-quality, comprehensive reference transcript dataset (RTD), which is rarely available in plants. Results A high-quality, non-redundant barley gene RTD and database (Barley Reference Transcripts – BaRTv1.0) has been generated. BaRTv1.0 was constructed from a range of tissues, cultivars and abiotic treatments, and its transcripts were assembled and aligned to the barley cv. Morex reference genome (Mascher et al. Nature; 544: 427–433, 2017). Full-length cDNAs from the barley variety Haruna nijo (Matsumoto et al. Plant Physiol; 156: 20–28, 2011) determined transcript coverage, and high-resolution RT-PCR validated alternatively spliced (AS) transcripts of 86 genes in five different organs and tissues. These methods were used as benchmarks to select an optimal barley RTD. BaRTv1.0-Quantification of Alternatively Spliced Isoforms (QUASI) was also made to overcome inaccurate quantification due to variation in 5′ and 3′ UTR ends of transcripts. BaRTv1.0-QUASI was used for accurate transcript quantification of RNA-seq data of five barley organs/tissues. This analysis identified 20,972 significant differentially expressed genes, 2791 differentially alternatively spliced genes and 2768 transcripts with differential transcript usage. Conclusion A high-confidence barley reference transcript dataset consisting of 60,444 genes with 177,240 transcripts has been generated. Compared to current barley transcripts, BaRTv1.0 transcripts are generally longer, have less fragmentation and improved gene models that are well supported by splice junction reads. Precise transcript quantification using BaRTv1.0 allows routine analysis of gene expression and AS.


2021 ◽  
Vol 8 (10) ◽  
pp. 257-262
Author(s):  
Aigli Korfiati ◽  
Giorgos Livanos ◽  
Christos Konstandinou

Computer-aided diagnosis, prognosis and therapy systems have been of great interest for a number of years. The availability of large volumes of data and of powerful computational resources has allowed artificial intelligence approaches to emerge in melanoma-related studies. However, for such approaches to achieve good predictive performance, data availability is of crucial importance. Melanoma-related imaging, biological and clinical data are only partially available and are scattered across various repositories. Thus, in this work, we assemble in a web-accessible database, named ebioMelDB, the widest collection of clinical and dermoscopy images accompanied by patient clinical data and the widest collection of RNA-Seq gene expression data accompanied by patient clinical data. The database organization allows users to select the data that are appropriate for their application of interest (diagnosis, prognosis and therapy). Keywords: melanoma database, integrated data, dermoscopy, imaging, RNA-Seq, clinical data.


2021 ◽  
Author(s):  
Fei Wu ◽  
Yaozhong Liu ◽  
Binhua Ling

RNA-seq data contains not only host transcriptomes but also non-host information that comprises transcripts from active microbiota in the host cells. Therefore, metatranscriptomics can reveal gene expression of the entire microbial community in a given sample. However, there is no single tool that can simultaneously analyze host-microbiota interactions and quantify the microbiome at the single-cell level, particularly for users with limited expertise in bioinformatics. Here, we developed a novel software program that can comprehensively and synergistically analyze gene expression of the host and microbiome as well as their association using bulk and single-cell RNA-seq data. Our pipeline, named Meta-Transcriptome Detector (MTD), can identify and quantify the microbiome extensively, including viruses, bacteria, protozoa, fungi, plasmids, and vectors. MTD is easy to install and is user-friendly. This novel software program empowers researchers to study the interactions between microbiota and the host by analyzing gene expression and pathways, which provides further insights into host responses to microorganisms.


Author(s):  
Joshua Orvis ◽  
Brian Gottfried ◽  
Jayaram Kancherla ◽  
Ricky S. Adkins ◽  
Yang Song ◽  
...  

Abstract The gEAR portal (gene Expression Analysis Resource, umgear.org) is an open access community-driven tool for multi-omic and multi-species data visualization, analysis and sharing. The gEAR supports visualization of multiple RNA-seq data types (bulk, sorted, single cell/nucleus) and epigenomics data, from multiple species, time points and tissues in a single-page, user-friendly browsable format. An integrated scRNA-seq workbench provides access to raw data of scRNA-seq datasets for de novo analysis, as well as marker-gene and cluster comparisons of pre-assigned clusters. Users can upload, view, analyze and privately share their own data in the context of previously published datasets. Short, permanent URLs can be generated for dissemination of individual or collections of datasets in published manuscripts. While the gEAR is currently curated for auditory research with over 90 high-value datasets organized in thematic profiles, the gEAR also supports the BRAIN initiative (via nemoanalytics.org) and is easily adaptable for other research domains.


2016 ◽  
Author(s):  
Sara Ballouz ◽  
Jesse Gillis

In addition to detecting novel transcripts and higher dynamic range, a principal claim for RNA-sequencing has been greater replicability, typically measured in sample-sample correlations of gene expression levels. Through a re-analysis of ENCODE data, we show that replicability of transcript abundances will provide misleading estimates of the replicability of conditional variation in transcript abundances (i.e., most expression experiments). Heuristics which implicitly address this problem have emerged in quality control measures to obtain 'good' differential expression results. However, these methods involve strict filters such as discarding low expressing genes or using technical replicates to remove discordant transcripts, and are costly or simply ad hoc. As an alternative, we model gene-level replicability of differential activity using co-expressing genes. We find that sets of housekeeping interactions provide a sensitive means of estimating the replicability of expression changes, where the co-expressing pair can be regarded as pseudo-replicates of one another. We model the effects of noise that perturbs a gene's expression within its usual distribution of values and show that perturbing expression by only 5% within that range is readily detectable (AUROC~0.73). We have made our method available as a set of easily implemented R scripts.
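
The AUROC figure quoted in this abstract can be read as the probability that a randomly chosen positive case (for example, a perturbed gene) receives a higher score than a randomly chosen negative one. The following Python sketch computes that quantity directly from pairwise comparisons. It is only an illustration of the metric under that standard definition: it is not the authors' R scripts, and the function name and scores are hypothetical.

```python
def auroc(positive_scores, negative_scores):
    """Area under the ROC curve computed from pairwise comparisons.

    Equals the fraction of (positive, negative) pairs in which the positive
    scores higher, with ties counted as half a win.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")

    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


if __name__ == "__main__":
    # Hypothetical detection scores for perturbed and unperturbed genes.
    perturbed = [0.9, 0.6, 0.4]
    unperturbed = [0.5, 0.3]
    print(auroc(perturbed, unperturbed))
```

A value of 0.5 corresponds to chance-level detection and 1.0 to perfect separation, which gives a scale for interpreting the reported value.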


2017 ◽  
Author(s):  
Alexander Lachmann ◽  
Denis Torre ◽  
Alexandra B. Keenan ◽  
Kathleen M. Jagodnik ◽  
Hyojin J. Lee ◽  
...  

RNA-sequencing (RNA-seq) is currently the leading technology for genome-wide transcript quantification. While the volume of RNA-seq data is rapidly increasing, the currently publicly available RNA-seq data is provided mostly in raw form, with small portions processed non-uniformly. This is mainly because the computational demand, particularly for the alignment step, is a significant barrier for global and integrative retrospective analyses. To address this challenge, we developed all RNA-seq and ChIP-seq sample and signature search (ARCHS4), a web resource that makes the majority of previously published RNA-seq data from human and mouse freely available at the gene count level. Such uniformly processed data enables easy integration for downstream analyses. For developing the ARCHS4 resource, all available FASTQ files from RNA-seq experiments were retrieved from the Gene Expression Omnibus (GEO) and aligned using a cloud-based infrastructure. In total 137,792 samples are accessible through ARCHS4 with 72,363 mouse and 65,429 human samples. Through efficient use of cloud resources and dockerized deployment of the sequencing pipeline, the alignment cost per sample is reduced to less than one cent. ARCHS4 is updated automatically by adding newly published samples to the database as they become available. Additionally, the ARCHS4 web interface provides intuitive exploration of the processed data through querying tools, interactive visualization, and gene landing pages that provide average expression across cell lines and tissues, top co-expressed genes, and predicted biological functions and protein-protein interactions for each gene based on prior knowledge combined with co-expression. Benchmarking the quality of these predictions, co-expression correlation data created from ARCHS4 outperforms co-expression data created from other major gene expression data repositories such as GTEx and CCLE. ARCHS4 is freely accessible at: http://amp.pharm.mssm.edu/archs4


BMC Genomics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Alberto Luiz P. Reyes ◽  
Tiago C. Silva ◽  
Simon G. Coetzee ◽  
Jasmine T. Plummer ◽  
Brian D. Davis ◽  
...  

Abstract Background The development of next generation sequencing (NGS) methods led to a rapid rise in the generation of large genomic datasets, but user-friendly tools to analyze and visualize these datasets have not developed at the same pace. This presents a two-fold challenge to biologists: the expertise needed to select an appropriate data analysis pipeline, and the bioinformatics or programming skills needed to apply it. The development of graphical user interface (GUI) applications hosted on web-based servers such as Shiny can make complex workflows accessible across operating systems and internet browsers to those without programming knowledge. Results We have developed GENAVi (Gene Expression Normalization Analysis and Visualization) to provide a user-friendly interface for normalization and differential expression analysis (DEA) of human or mouse feature count level RNA-Seq data. GENAVi is a GUI-based tool that combines Bioconductor packages in a format for scientists without bioinformatics expertise. We provide a panel of 20 cell lines commonly used for the study of breast and ovarian cancer within GENAVi as a foundation for users to bring their own data to the application. Users can visualize expression across samples, cluster samples based on gene expression or correlation, calculate and plot the results of principal components analysis, perform DEA and gene set enrichment and produce plots for each of these analyses. To allow scalability for large datasets we have provided local install via three methods. We improve on available tools by offering a range of normalization methods and a simple-to-use interface that provides clear and complete session reporting for reproducible analysis. Conclusion The development of tools using a GUI makes them practical and accessible to scientists without bioinformatics expertise, or access to a data analyst with relevant skills.
While several GUI-based tools are currently available for RNA-Seq analysis, we improve on these existing tools. This user-friendly application provides a convenient platform for the normalization, analysis and visualization of gene expression data for scientists without bioinformatics expertise.
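
The abstract above does not name the normalization methods that GENAVi offers. As a hedged illustration of what normalizing feature-count-level RNA-Seq data can involve, the Python sketch below applies counts per million (CPM) and a log2 transform to one sample. CPM is chosen here only as a widely used example; the function names, the gene labels and the pseudocount of 1 are assumptions, not part of GENAVi.

```python
import math


def counts_per_million(counts):
    """Scale raw per-gene read counts so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


def log2_cpm(counts, pseudocount=1.0):
    """Return log2(CPM + pseudocount); the pseudocount keeps zero counts finite."""
    cpm = counts_per_million(counts)
    return {gene: math.log2(value + pseudocount) for gene, value in cpm.items()}


if __name__ == "__main__":
    # Hypothetical raw counts for a single sample.
    sample = {"GENE_A": 250, "GENE_B": 750, "GENE_C": 0}
    print(counts_per_million(sample))
    print(log2_cpm(sample))
```

Scaling by library size in this way makes samples sequenced to different depths comparable before clustering or principal components analysis.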


2018 ◽  
Author(s):  
Alexander Lachmann ◽  
Zhuorui Xie ◽  
Avi Ma’ayan

Motivation RNA-sequencing (RNA-seq) is currently the leading technology for genome-wide transcript quantification. Mapping the raw reads to transcript and gene level counts can be achieved by a variety of aligners and pipelines. The diversity of processing options reduces interoperability. In addition, the alignment step requires significant computational resources and basic programming knowledge. Elysium enables users of all skill levels to perform a uniform and free RNA-seq alignment in the cloud. Results The Elysium infrastructure is comprised of four components: A file upload API that enables storage of FASTQ files on Amazon S3 without Amazon credentials; an API to handle the cloud alignment job scheduling for uploaded files; and a graphical user interface (GUI) to provide intuitive access to users that do not have command-line access skills. Availability The Elysium source code is available under the Apache Licence 2.0 on GitHub at: https://github.com/maayanlab/elysium The service of cloud-based RNA-seq alignment is freely accessible through the Elysium GUI at: http://elysium.cloud

