CAFU: a Galaxy framework for exploring unmapped RNA-Seq data

Abstract A widely used approach in transcriptome analysis is the alignment of short reads to a reference genome. However, owing to the deficiencies of specially designed analytical systems, short reads unmapped to the genome sequence are usually ignored, resulting in the loss of significant biological information and insights. To fill this gap, we present Comprehensive Assembly and Functional annotation of Unmapped RNA-Seq data (CAFU), a Galaxy-based framework that can facilitate the large-scale analysis of unmapped RNA sequencing (RNA-Seq) reads from single- and mixed-species samples. By taking advantage of machine learning techniques, CAFU addresses the issue of accurately identifying the species origin of transcripts assembled using unmapped reads from mixed-species samples. CAFU also represents an innovation in that it provides a comprehensive collection of functions required for transcript confidence evaluation, coding potential calculation, sequence and expression characterization and function annotation. These functions and their dependencies have been integrated into a Galaxy framework that provides access to CAFU via a user-friendly interface, dramatically simplifying complex exploration tasks involving unmapped RNA-Seq reads. CAFU has been validated with RNA-Seq data sets from wheat and Zea mays (maize) samples. CAFU is freely available via GitHub: https://github.com/cma2015/CAFU.

Gene Expression Imputation with Generative Adversarial Imputation Nets

10.1101/2020.06.09.141689 ◽

2020 ◽

Author(s):

Ramon Viñas ◽

Tiago Azevedo ◽

Eric R. Gamazon ◽

Pietro Liò

Keyword(s):

Gene Expression ◽

Large Scale ◽

Biological Significance ◽

Predictive Performance ◽

Cost Effective ◽

Rna Seq ◽

Comprehensive Collection ◽

Genomic Studies ◽

Biological Discovery ◽

Cancer Types

Abstract A question of fundamental biological significance is to what extent the expression of a subset of genes can be used to recover the full transcriptome, with important implications for biological discovery and clinical application. To address this challenge, we present GAIN-GTEx, a method for gene expression imputation based on Generative Adversarial Imputation Networks. In order to increase the applicability of our approach, we leverage data from GTEx v8, a reference resource that has generated a comprehensive collection of transcriptomes from a diverse set of human tissues. We compare our model to several standard and state-of-the-art imputation methods and show that GAIN-GTEx is significantly superior in terms of predictive performance and runtime. Furthermore, our results indicate strong generalisation on RNA-Seq data from 3 cancer types across varying levels of missingness. Our work can facilitate a cost-effective integration of large-scale RNA biorepositories into genomic studies of disease, with high applicability across diverse tissue types.
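Evaluation "across varying levels of missingness" can be illustrated with a minimal sketch. This is not GAIN-GTEx: it hides a random fraction of values in a toy expression matrix and fills them with per-gene means, a simple baseline chosen here only for illustration. The function names, the toy data, and the use of RMSE as the error metric are all assumptions, not details taken from the paper.

```python
import math
import random


def mask_values(matrix, missing_rate, rng):
    """Return a copy of matrix with a random fraction of entries replaced by None."""
    return [[None if rng.random() < missing_rate else v for v in row] for row in matrix]


def mean_impute(masked):
    """Fill None entries with the mean of the observed values in the same column (gene)."""
    n_cols = len(masked[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in masked if row[j] is not None]
        # Fall back to 0.0 when a gene has no observed values at all (assumption).
        col_means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[col_means[j] if v is None else v for j, v in enumerate(row)] for row in masked]


def rmse_on_masked(truth, masked, imputed):
    """Root-mean-square error computed only over the entries that were hidden."""
    errors = [
        (t - p) ** 2
        for t_row, m_row, p_row in zip(truth, masked, imputed)
        for t, m, p in zip(t_row, m_row, p_row)
        if m is None
    ]
    return math.sqrt(sum(errors) / len(errors)) if errors else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy matrix: 50 samples (rows) by 8 genes (columns).
    truth = [[rng.gauss(5.0, 1.0) for _ in range(8)] for _ in range(50)]
    for rate in (0.1, 0.5, 0.9):
        masked = mask_values(truth, rate, rng)
        imputed = mean_impute(masked)
        print(f"missing rate {rate:.1f}: RMSE {rmse_on_masked(truth, masked, imputed):.3f}")
```

Scoring only the hidden entries keeps the metric from being diluted by values the imputer was given directly; any stronger method can be swapped in for `mean_impute` without changing the rest of the loop.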

RNA-seq based analysis of population structure within the maize inbred B73

10.1101/043513 ◽

2016 ◽

Author(s):

Zhikai Liang ◽

James C Schnable

Keyword(s):

Large Scale ◽

Reference Genome ◽

Data Sets ◽

Rna Seq ◽

Research Groups ◽

Independent Research ◽

Sequencing Project ◽

Maize Genetics ◽

Outcrossing Species ◽

The Relationship

B73 is a variety of maize (Zea mays ssp. mays) widely used in genetic, genomic, and phenotypic research around the world. B73 also served as the reference genotype for the original maize genome sequencing project. The advent of large-scale RNA-sequencing as a method of measuring gene expression presents a unique opportunity to assess the level of relatedness among individuals identified as variety B73. The level of haplotype conservation and divergence across the genome was assessed using 27 RNA-seq data sets from 20 independent research groups in three countries. Several clearly distinct clades were identified among putatively B73 samples. A number of these clades were defined by the presence of clearly defined genomic blocks containing a haplotype which did not match the published B73 reference genome. In a number of cases the relationship among B73 samples generated by different research groups recapitulated mentor/mentee relationships within the maize genetics community. A number of regions with distinct, dissimilar haplotypes were identified in our study. However, when considering the age of the B73 accession (greater than 40 years) and the challenges of maintaining isogenic lines of a naturally outcrossing species, a strikingly high overall level of conservation was exhibited among B73 samples from around the globe.

Improving the Scalability of XCS-Based Learning Classifier Systems

10.26686/wgtn.17006398 ◽

2021 ◽

Author(s):

Muhammad Iqbal

Keyword(s):

Large Scale ◽

Building Blocks ◽

Machine Learning Techniques ◽

Classifier Systems ◽

Problem Domain ◽

Code Fragment ◽

Cyclic Representation ◽

Large Scale Problems ◽

And Function ◽

First Time

Using evolutionary intelligence and machine learning techniques, a broad range of intelligent machines have been designed to perform different tasks. An intelligent machine learns by perceiving its environmental status and taking an action that maximizes its chances of success. Human beings have the ability to apply knowledge learned from a smaller problem to more complex, large-scale problems of the same or a related domain, but currently the vast majority of evolutionary machine learning techniques lack this ability. This inability to apply already learned knowledge of a domain consumes more resources and time than necessary when solving complex, large-scale problems of the domain. As the problem increases in size, it becomes difficult and sometimes even impractical (if not impossible) to solve because of the resources and time needed. Therefore, in order to scale in a problem domain, a system is needed that has the ability to reuse the learned knowledge of the domain and/or encapsulate the underlying patterns in the domain. To extract and reuse building blocks of knowledge or to encapsulate the underlying patterns in a problem domain, a rich encoding is needed, but the search space could then expand undesirably and cause bloat, e.g. as in some forms of genetic programming (GP). Learning classifier systems (LCSs) are a well-structured, evolutionary-computation-based learning technique with pressures that implicitly avoid bloat, such as fitness sharing through niche-based reproduction. The proposed thesis is that an LCS can scale to complex problems in a domain by reusing the learnt knowledge from simpler problems of the domain and/or encapsulating the underlying patterns in the domain. Wilson’s XCS, a well-tested, online-learning, accuracy-based LCS model, is used to implement and test the proposed systems.
To extract the reusable building blocks of knowledge, GP-tree-like code-fragments are introduced, which are more than simply another representation (e.g. ternary or real-valued alphabets). This thesis is extended to capture the underlying patterns in a problem using a cyclic representation. Experiments on hard problems test the newly developed scalable systems and compare them with benchmark techniques. Specifically, this work develops four systems to improve the scalability of XCS-based classifier systems. (1) Building blocks of knowledge are extracted from smaller problems of a Boolean domain and reused in learning more complex, large-scale problems in the domain, for the first time. By utilizing the learnt knowledge from small-scale problems, the developed XCSCFC (i.e. XCS with Code-Fragment Conditions) system readily solves problems of a scale that existing LCS and GP approaches cannot, e.g. the 135-bit MUX problem. (2) The introduction of code-fragments in classifier actions in XCSCFA (i.e. XCS with Code-Fragment Actions) enables the rich representation of GP, which, when coupled with the divide-and-conquer approach of LCS, successfully solves various complex, overlapping and niche-imbalance Boolean problems that are difficult to solve using numeric-action-based XCS. (3) The underlying patterns in a problem domain are encapsulated in classifier rules encoded by a cyclic representation. The developed XCSSMA system produces general solutions of any scale n for a number of important Boolean problems, for the first time in the field of LCS, e.g. parity problems. (4) Optimal solutions for various real-valued problems are evolved by extending the existing real-valued XCSR system with code-fragment actions to XCSRCFA. Exploiting the combined power of GP and LCS techniques, XCSRCFA successfully learns various continuous action and function approximation problems that are difficult to learn using the base techniques.
This research work has shown that LCSs can scale to complex, large-scale problems through reusing learnt knowledge. The messy nature, disassociation of message to condition order, masking, feature construction, and reuse of extracted knowledge add additional abilities to the XCS family of LCSs. The ability to use rich encoding in antecedent GP-like code-fragments or consequent cyclic representation leads to the evolution of accurate, maximally general and compact solutions in learning various complex Boolean as well as real-valued problems. Effectively exploiting the combined power of GP and LCS techniques, various continuous action and function approximation problems are solved in a simple and straightforward manner. The analysis of the evolved rules reveals, for the first time in XCS, that no matter how specific or general the initial classifiers are, all the optimal classifiers are converged through the mechanism ‘be specific then generalize’ near the final stages of evolution. It also reveals that standard XCS does not use all available information or all available genetic operators to evolve optimal rules, whereas the developed code-fragment-action-based systems effectively use figure and ground information during the training process. This work has created a platform to explore the reuse of learnt functionality, not just terminal knowledge as at present, which is needed to replicate human capabilities.
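For readers unfamiliar with the multiplexer (MUX) benchmark mentioned above, a minimal sketch of the standard Boolean multiplexer follows. With k address bits there are 2^k data bits, so k = 7 gives the 7 + 128 = 135-bit problem. The sketch also shows the conventional ternary condition matching, in which `#` is a "don't care" symbol. The function names and the convention that address bits come first are assumptions made for illustration; none of this code comes from the thesis.

```python
def multiplexer(bits, address_width):
    """Return the data bit selected by the leading address bits.

    bits: sequence of 0/1 values of length address_width + 2**address_width.
    """
    expected_len = address_width + 2 ** address_width
    if len(bits) != expected_len:
        raise ValueError(f"expected {expected_len} bits, got {len(bits)}")
    # Interpret the address bits as an unsigned binary number (most significant bit first).
    address = int("".join(str(b) for b in bits[:address_width]), 2)
    return bits[address_width + address]


def matches(condition, bits):
    """Ternary matching: '#' matches any bit, '0' and '1' must match exactly."""
    if len(condition) != len(bits):
        return False
    return all(c == "#" or int(c) == b for c, b in zip(condition, bits))


if __name__ == "__main__":
    state = [0, 1, 1, 0, 1, 1]        # 6-bit MUX: address 01 selects data bit 1
    print(multiplexer(state, 2))      # value of the selected data bit
    print(matches("01#0##", state))   # a general rule covering this state
    print(7 + 2 ** 7)                 # input length for 7 address bits
```

A rule such as `01#0##` covers every input that shares the address `01` and the selected data bit `0`, which is why maximally general rules for this problem stay compact while the raw input space grows exponentially with the address width.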

Detection of Alu Exonization Events in Human Frontal Cortex From RNA-Seq Data

Frontiers in Molecular Biosciences ◽

10.3389/fmolb.2021.727537 ◽

2021 ◽

Vol 8 ◽

Author(s):

Liliana Florea ◽

Lindsay Payer ◽

Corina Antonescu ◽

Guangyu Yang ◽

Kathleen Burns

Keyword(s):

Frontal Cortex ◽

Large Scale ◽

Reference Genome ◽

Expression Patterns ◽

Data Sets ◽

Rna Seq ◽

Mrna Isoforms ◽

Specific Expression ◽

Human Frontal Cortex ◽

Alternatively Spliced

Alu exonization events functionally diversify the transcriptome, creating alternative mRNA isoforms and accounting for an estimated 5% of the alternatively spliced (skipped) exons in the human genome. We developed computational methods, implemented in a software tool called Alubaster, for detecting incorporation of Alu sequences in mRNA transcripts from large-scale RNA-seq data sets. The approach detects Alu sequences derived from both fixed and polymorphic Alu elements, including Alu insertions missing from the reference genome. We applied our methods to 117 GTEx human frontal cortex samples to build and characterize a collection of Alu-containing mRNAs. In particular, we detected and characterized Alu exonizations occurring at 870 fixed Alu loci, of which 237 were novel, as well as hundreds of putative events involving Alu elements that are polymorphic variants or rare alleles not present in the reference genome. These methods and annotations represent a unique and valuable resource that can be used to understand the characteristics of Alu-containing mRNAs and their tissue-specific expression patterns.

Detecting Sources of Transcriptional Heterogeneity in Large-Scale RNA-Seq Data Sets

Genetics ◽

10.1534/genetics.116.193714 ◽

2016 ◽

Vol 204 (4) ◽

pp. 1391-1396 ◽

Cited By ~ 8

Author(s):

Brian C. Searle ◽

Rachel M. Gittelman ◽

Ohad Manor ◽

Joshua M. Akey

Keyword(s):

Large Scale ◽

Data Sets ◽

Rna Seq

Highly parallel direct RNA sequencing on an array of nanopores

10.1101/068809 ◽

2016 ◽

Cited By ~ 26

Author(s):

Daniel R. Garalde ◽

Elizabeth A. Snell ◽

Daniel Jachimowicz ◽

Andrew J. Heron ◽

Mark Bruce ◽

...

Keyword(s):

Rna Sequencing ◽

Reverse Transcription ◽

Biological Information ◽

Rna Seq ◽

Sequencing Technology ◽

Rna Sequences ◽

Oxford Nanopore ◽

The Status ◽

And Function ◽

Oxford Nanopore Technologies

Abstract Ribonucleic acid sequencing can allow us to monitor the RNAs present in a sample. This enables us to detect the presence and nucleotide sequence of viruses, or to build a picture of how active transcriptional processes are changing – information that is useful for understanding the status and function of a sample. Oxford Nanopore Technologies’ sequencing technology is capable of electronically analysing a sample’s DNA directly, and in real-time. In this manuscript we demonstrate the ability of an array of nanopores to sequence RNA directly, and we apply it to a range of biological situations. Nanopore technology is the only available sequencing technology that can sequence RNA directly, rather than depending on reverse transcription and PCR. There are several potential advantages of this approach over other RNA-seq strategies, including the absence of amplification and reverse transcription biases, the ability to detect nucleotide analogues and the ability to generate full-length, strand-specific RNA sequences. Direct RNA sequencing is a completely new way of analysing the sequence of RNA samples and it will improve the ease and speed of RNA analysis, while yielding richer biological information.

Improvements to the Rice Genome Annotation Through Large-Scale Analysis of RNA-Seq and Proteomics Data Sets

Molecular & Cellular Proteomics ◽

10.1074/mcp.ra118.000832 ◽

2018 ◽

Vol 18 (1) ◽

pp. 86-98 ◽

Cited By ~ 5

Author(s):

Zhe Ren ◽

Da Qi ◽

Nina Pugh ◽

Kai Li ◽

Bo Wen ◽

...

Keyword(s):

Genome Annotation ◽

Large Scale ◽

Rice Genome ◽

Data Sets ◽

Scale Analysis ◽

Rna Seq ◽

Proteomics Data ◽

Large Scale Analysis

BIG: a large-scale data integration tool for renal physiology

AJP Renal Physiology ◽

10.1152/ajprenal.00249.2016 ◽

2016 ◽

Vol 311 (4) ◽

pp. F787-F792 ◽

Cited By ~ 8

Author(s):

Yue Zhao ◽

Chin-Rang Yang ◽

Viswanathan Raghuram ◽

Jaya Parulekar ◽

Mark A. Knepper

Keyword(s):

Big Data ◽

Large Scale ◽

Data Science ◽

Relevant Information ◽

Biological Information ◽

Renal Physiology ◽

Data Sets ◽

Data Set ◽

Large Scale Data ◽

Quantify Gene Expression

Due to recent advances in high-throughput techniques, we and others have generated multiple proteomic and transcriptomic databases to describe and quantify gene expression, protein abundance, or cellular signaling on the scale of the whole genome/proteome in kidney cells. The existence of so much data from diverse sources raises the following question: “How can researchers find information efficiently for a given gene product over all of these data sets without searching each data set individually?” This is the type of problem that has motivated the “Big-Data” revolution in Data Science, which has driven progress in fields such as marketing. Here we present an online Big-Data tool called BIG (Biological Information Gatherer) that allows users to submit a single online query to obtain all relevant information from all indexed databases. BIG is accessible at http://big.nhlbi.nih.gov/ .

4058 Enhanced efficiency of large-scale clinical proteomic studies: when less is more

Journal of Clinical and Translational Science ◽

10.1017/cts.2020.329 ◽

2020 ◽

Vol 4 (s1) ◽

pp. 107-107

Author(s):

Stefani Thomas ◽

Betty Friedrich ◽

Michael Schnaubelt ◽

Daniel W. Chan ◽

Hui Zhang ◽

...

Keyword(s):

Cancer Biology ◽

Large Scale ◽

Molecular Taxonomy ◽

The Cancer Genome Atlas ◽

Biological Information ◽

Data Sets ◽

Less Is More ◽

Gene Correlation ◽

The Stability ◽

Cancer Tissues

OBJECTIVES/GOALS: Large-scale clinical proteomic studies of cancer tissues often entail complex workflows and are resource-intensive. In this study we analyzed ovarian tumors using an emerging, high-throughput proteomic technology termed SWATH. We compared SWATH with the more widely used iTRAQ workflow based on robustness, complexity, ability to detect differential protein expression, and the elucidated biological information. METHODS/STUDY POPULATION: Proteomic measurements of 103 clinically-annotated high-grade serous ovarian cancer (HGSOC) tumors previously genomically characterized by The Cancer Genome Atlas were conducted using two orthogonal mass spectrometry-based proteomic methods: iTRAQ and SWATH. The analytical differences between the two methods were compared with respect to relative protein abundances. To assess the ability to classify the tumors into subtypes based on proteomic signatures, an unbiased molecular taxonomy of HGSOC was established using protein abundance data. The 1,599 proteins quantified in both datasets were classified based on z-score-transformed protein abundances, and the emergent protein modules were characterized using weighted gene-correlation network analysis and Reactome pathway enrichment. RESULTS/ANTICIPATED RESULTS: Despite the greater than two-fold difference in the analytical depth of each proteomic method, common differentially expressed proteins in enriched pathways associated with the HGSOC Mesenchymal subtype were identified by both methods. The stability of tumor subtype classification was sensitive to the number of analyzed samples, and the statistically stable subgroups were identified by the data from both methods. Additionally, the homologous recombination deficiency-associated enriched DNA repair and chromosome organization pathways were conserved in both data sets. DISCUSSION/SIGNIFICANCE OF IMPACT: SWATH is a robust proteomic method that can be used to elucidate cancer biology. 
The lower number of proteins detected by SWATH compared to iTRAQ is mitigated by its streamlined workflow, increased sample throughput, and reduced sample requirement. SWATH therefore presents novel opportunities to enhance the efficiency of clinical proteomic studies.
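The z-score transformation of protein abundances mentioned in the methods can be sketched as follows. The abstract does not say whether scores were computed per protein or per sample, nor which standard deviation was used; this sketch assumes per-protein scaling across samples with the sample standard deviation. The function name and the toy abundance values are hypothetical.

```python
import statistics


def z_scores(values):
    """Scale one protein's abundances across samples to mean 0 and unit sample SD."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation; needs at least 2 values
    if sd == 0:
        # A protein with constant abundance carries no relative signal (assumed handling).
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


if __name__ == "__main__":
    # Hypothetical abundances of two proteins measured in four tumors.
    abundances = {
        "protein_A": [10.0, 12.0, 14.0, 16.0],
        "protein_B": [0.5, 0.5, 0.7, 0.9],
    }
    for name, values in abundances.items():
        print(name, [round(z, 3) for z in z_scores(values)])
```

Scaling each protein separately puts proteins with very different absolute abundances on a common footing before any clustering or module detection.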

RNA Sequencing Data: Hitchhiker's Guide to Expression Analysis

Annual Review of Biomedical Data Science ◽

10.1146/annurev-biodatasci-072018-021255 ◽

2019 ◽

Vol 2 (1) ◽

pp. 139-173 ◽

Cited By ~ 23

Author(s):

Koen Van den Berge ◽

Katharina M. Hembach ◽

Charlotte Soneson ◽

Simone Tiberi ◽

Lieven Clement ◽

...

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Large Scale ◽

Science Studies ◽

Data Sets ◽

Rna Seq ◽

Sequencing Data ◽

Data Types ◽

The Past ◽

Long Read

Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies, but also agricultural or clinical situations. In the past 10 years or so, much has been learned about the characteristics of the RNA-seq data sets, as well as the performance of the myriad of methods developed. In this review, we give an overview of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on the quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.
