CAFU: a Galaxy framework for exploring unmapped RNA-Seq data

2019 ◽  
Vol 21 (2) ◽  
pp. 676-686 ◽  
Author(s):  
Siyuan Chen ◽  
Chengzhi Ren ◽  
Jingjing Zhai ◽  
Jiantao Yu ◽  
Xuyang Zhao ◽  
...  

Abstract A widely used approach in transcriptome analysis is the alignment of short reads to a reference genome. However, owing to the deficiencies of specially designed analytical systems, short reads unmapped to the genome sequence are usually ignored, resulting in the loss of significant biological information and insights. To fill this gap, we present Comprehensive Assembly and Functional annotation of Unmapped RNA-Seq data (CAFU), a Galaxy-based framework that can facilitate the large-scale analysis of unmapped RNA sequencing (RNA-Seq) reads from single- and mixed-species samples. By taking advantage of machine learning techniques, CAFU addresses the issue of accurately identifying the species origin of transcripts assembled using unmapped reads from mixed-species samples. CAFU also represents an innovation in that it provides a comprehensive collection of functions required for transcript confidence evaluation, coding potential calculation, sequence and expression characterization and function annotation. These functions and their dependencies have been integrated into a Galaxy framework that provides access to CAFU via a user-friendly interface, dramatically simplifying complex exploration tasks involving unmapped RNA-Seq reads. CAFU has been validated with RNA-Seq data sets from wheat and Zea mays (maize) samples. CAFU is freely available via GitHub: https://github.com/cma2015/CAFU.
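As a rough illustration of where this kind of analysis starts, the sketch below picks out unmapped reads from SAM-formatted alignment records by testing bit 0x4 of the FLAG field, which the SAM specification defines as "segment unmapped". This is not CAFU's own code: the function name `unmapped_read_names` and the sample records are hypothetical and exist only for this example.

```python
from typing import Iterable, List

UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def unmapped_read_names(sam_lines: Iterable[str]) -> List[str]:
    """Return the names of reads whose FLAG has the unmapped bit set."""
    names = []
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with "@").
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # column 2 of a SAM record is FLAG
        if flag & UNMAPPED_FLAG:
            names.append(fields[0])  # column 1 is the read name (QNAME)
    return names


# Hypothetical records for illustration (mandatory 11 columns only).
records = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGCA\tIIIII",
    "read3\t16\tchr2\t250\t60\t5M\t*\t0\t0\tGGCAT\tIIIII",
]
print(unmapped_read_names(records))
```

In practice the reads selected this way would be passed on to assembly and annotation steps; the sketch covers only the selection step.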

2020 ◽  
Author(s):  
Ramon Viñas ◽  
Tiago Azevedo ◽  
Eric R. Gamazon ◽  
Pietro Liò

Abstract A question of fundamental biological significance is to what extent the expression of a subset of genes can be used to recover the full transcriptome, with important implications for biological discovery and clinical application. To address this challenge, we present GAIN-GTEx, a method for gene expression imputation based on Generative Adversarial Imputation Networks. In order to increase the applicability of our approach, we leverage data from GTEx v8, a reference resource that has generated a comprehensive collection of transcriptomes from a diverse set of human tissues. We compare our model to several standard and state-of-the-art imputation methods and show that GAIN-GTEx is significantly superior in terms of predictive performance and runtime. Furthermore, our results indicate strong generalisation on RNA-Seq data from 3 cancer types across varying levels of missingness. Our work can facilitate a cost-effective integration of large-scale RNA biorepositories into genomic studies of disease, with high applicability across diverse tissue types.


2016 ◽  
Author(s):  
Zhikai Liang ◽  
James C Schnable

B73 is a variety of maize (Zea mays ssp. mays) widely used in genetic, genomic, and phenotypic research around the world. B73 also served as the reference genotype for the original maize genome sequencing project. The advent of large-scale RNA-sequencing as a method of measuring gene expression presents a unique opportunity to assess the level of relatedness among individuals identified as variety B73. The level of haplotype conservation and divergence across the genome was assessed using 27 RNA-seq data sets from 20 independent research groups in three countries. Several clearly distinct clades were identified among putatively B73 samples. A number of these clades were defined by the presence of clearly defined genomic blocks containing a haplotype which did not match the published B73 reference genome. In a number of cases the relationship among B73 samples generated by different research groups recapitulated mentor/mentee relationships within the maize genetics community. A number of regions with distinct, dissimilar haplotypes were identified in our study. However, when considering the age of the B73 accession (greater than 40 years) and the challenges of maintaining isogenic lines of a naturally outcrossing species, a strikingly high overall level of conservation was exhibited among B73 samples from around the globe.


2021 ◽  
Author(s):  
Muhammad Iqbal

Using evolutionary intelligence and machine learning techniques, a broad range of intelligent machines have been designed to perform different tasks. An intelligent machine learns by perceiving its environmental status and taking an action that maximizes its chances of success. Human beings have the ability to apply knowledge learned from a smaller problem to more complex, large-scale problems of the same or a related domain, but currently the vast majority of evolutionary machine learning techniques lack this ability. This inability to apply the already learned knowledge of a domain results in consuming more than the necessary resources and time to solve complex, large-scale problems of the domain. As the problem increases in size, it becomes difficult and sometimes even impractical (if not impossible) to solve due to the needed resources and time. Therefore, in order to scale in a problem domain, a system is needed that has the ability to reuse the learned knowledge of the domain and/or encapsulate the underlying patterns in the domain. To extract and reuse building blocks of knowledge or to encapsulate the underlying patterns in a problem domain, a rich encoding is needed, but the search space could then expand undesirably and cause bloat, e.g. as in some forms of genetic programming (GP). Learning classifier systems (LCSs) are a well-structured evolutionary computation based learning technique that have pressures to implicitly avoid bloat, such as fitness sharing through niche based reproduction. The proposed thesis is that an LCS can scale to complex problems in a domain by reusing the learnt knowledge from simpler problems of the domain and/or encapsulating the underlying patterns in the domain. Wilson’s XCS, a well-tested, online learning and accuracy based LCS model, is used to implement and test the proposed systems.
To extract the reusable building blocks of knowledge, GP-tree-like code-fragments are introduced, which are more than simply another representation (e.g. ternary or real-valued alphabets). This thesis is extended to capture the underlying patterns in a problem using a cyclic representation. Hard problems are used to test the newly developed scalable systems and compare them with benchmark techniques. Specifically, this work develops four systems to improve the scalability of XCS-based classifier systems. (1) Building blocks of knowledge are extracted from smaller problems of a Boolean domain and reused in learning more complex, large-scale problems in the domain, for the first time. By utilizing the learnt knowledge from small-scale problems, the developed XCSCFC (i.e. XCS with Code-Fragment Conditions) system readily solves problems of a scale that existing LCS and GP approaches cannot, e.g. the 135-bit MUX problem. (2) The introduction of the code fragments in classifier actions in XCSCFA (i.e. XCS with Code-Fragment Actions) enables the rich representation of GP, which, when coupled with the divide and conquer approach of LCS, successfully solves various complex, overlapping and niche imbalance Boolean problems that are difficult to solve using numeric action based XCS. (3) The underlying patterns in a problem domain are encapsulated in classifier rules encoded by a cyclic representation. The developed XCSSMA system produces general solutions of any scale n for a number of important Boolean problems, for the first time in the field of LCS, e.g. parity problems. (4) Optimal solutions for various real-valued problems are evolved by extending the existing real-valued XCSR system with code-fragment actions to XCSRCFA. Exploiting the combined power of GP and LCS techniques, XCSRCFA successfully learns various continuous action and function approximation problems that are difficult to learn using the base techniques.
This research work has shown that LCSs can scale to complex, large-scale problems through reusing learnt knowledge. The messy nature, disassociation of message to condition order, masking, feature construction, and reuse of extracted knowledge add additional abilities to the XCS family of LCSs. The ability to use rich encoding in antecedent GP-like code-fragments or consequent cyclic representation leads to the evolution of accurate, maximally general and compact solutions in learning various complex Boolean as well as real-valued problems. Effectively exploiting the combined power of GP and LCS techniques, various continuous action and function approximation problems are solved in a simple and straightforward manner. The analysis of the evolved rules reveals, for the first time in XCS, that no matter how specific or general the initial classifiers are, all the optimal classifiers converge through the mechanism ‘be specific then generalize’ near the final stages of evolution. The analysis also reveals that standard XCS does not use all available information or all available genetic operators to evolve optimal rules, whereas the developed code-fragment action based systems effectively use figure and ground information during the training process. This work has created a platform to explore the reuse of learnt functionality, not just terminal knowledge as at present, which is needed to replicate human capabilities.
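For readers unfamiliar with the Boolean benchmarks named in this abstract, the following sketch defines the multiplexer (MUX) and parity target functions as they are usually stated; it does not reproduce any of the thesis's systems. A multiplexer with k address bits has k + 2^k input bits, so k = 7 gives the 135-bit problem. The function names, and the even-parity convention (output 1 when the count of ones is even), are assumptions made for this example.

```python
from typing import Sequence


def mux_input_length(address_bits: int) -> int:
    """Total input length: k address bits plus 2**k data bits."""
    return address_bits + 2 ** address_bits


def multiplexer(bits: Sequence[int], address_bits: int) -> int:
    """Return the data bit selected by the leading address bits."""
    if len(bits) != mux_input_length(address_bits):
        raise ValueError("input length must be k + 2**k")
    address = 0
    for b in bits[:address_bits]:
        address = (address << 1) | b  # read the address as a binary number
    return bits[address_bits + address]


def even_parity(bits: Sequence[int]) -> int:
    """Assumed convention: 1 if the number of ones is even, else 0."""
    return int(sum(bits) % 2 == 0)


# 6-bit multiplexer: address bits "10" select data bit 2 of [0, 0, 1, 0].
print(multiplexer([1, 0, 0, 0, 1, 0], address_bits=2))
print(mux_input_length(7))
```

The input space doubles with every added bit, which is why a learner that must treat the 135-bit problem from scratch faces a far harder task than one that can reuse structure found on the 6-bit or 11-bit versions.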


2021 ◽  
Vol 8 ◽  
Author(s):  
Liliana Florea ◽  
Lindsay Payer ◽  
Corina Antonescu ◽  
Guangyu Yang ◽  
Kathleen Burns

Alu exonization events functionally diversify the transcriptome, creating alternative mRNA isoforms and accounting for an estimated 5% of the alternatively spliced (skipped) exons in the human genome. We developed computational methods, implemented in software called Alubaster, for detecting incorporation of Alu sequences in mRNA transcripts from large-scale RNA-seq data sets. The approach detects Alu sequences derived from both fixed and polymorphic Alu elements, including Alu insertions missing from the reference genome. We applied our methods to 117 GTEx human frontal cortex samples to build and characterize a collection of Alu-containing mRNAs. In particular, we detected and characterized Alu exonizations occurring at 870 fixed Alu loci, of which 237 were novel, as well as hundreds of putative events involving Alu elements that are polymorphic variants or rare alleles not present in the reference genome. These methods and annotations represent a unique and valuable resource that can be used to understand the characteristics of Alu-containing mRNAs and their tissue-specific expression patterns.


Genetics ◽  
2016 ◽  
Vol 204 (4) ◽  
pp. 1391-1396 ◽  
Author(s):  
Brian C. Searle ◽  
Rachel M. Gittelman ◽  
Ohad Manor ◽  
Joshua M. Akey

2016 ◽  
Author(s):  
Daniel R. Garalde ◽  
Elizabeth A. Snell ◽  
Daniel Jachimowicz ◽  
Andrew J. Heron ◽  
Mark Bruce ◽  
...  

Abstract Ribonucleic acid sequencing can allow us to monitor the RNAs present in a sample. This enables us to detect the presence and nucleotide sequence of viruses, or to build a picture of how active transcriptional processes are changing, information that is useful for understanding the status and function of a sample. Oxford Nanopore Technologies’ sequencing technology is capable of electronically analysing a sample’s DNA directly, and in real-time. In this manuscript we demonstrate the ability of an array of nanopores to sequence RNA directly, and we apply it to a range of biological situations. Nanopore technology is the only available sequencing technology that can sequence RNA directly, rather than depending on reverse transcription and PCR. There are several potential advantages of this approach over other RNA-seq strategies, including the absence of amplification and reverse transcription biases, the ability to detect nucleotide analogues and the ability to generate full-length, strand-specific RNA sequences. Direct RNA sequencing is a completely new way of analysing the sequence of RNA samples and it will improve the ease and speed of RNA analysis, while yielding richer biological information.


2016 ◽  
Vol 311 (4) ◽  
pp. F787-F792 ◽  
Author(s):  
Yue Zhao ◽  
Chin-Rang Yang ◽  
Viswanathan Raghuram ◽  
Jaya Parulekar ◽  
Mark A. Knepper

Due to recent advances in high-throughput techniques, we and others have generated multiple proteomic and transcriptomic databases to describe and quantify gene expression, protein abundance, or cellular signaling on the scale of the whole genome/proteome in kidney cells. The existence of so much data from diverse sources raises the following question: “How can researchers find information efficiently for a given gene product over all of these data sets without searching each data set individually?” This is the type of problem that has motivated the “Big-Data” revolution in Data Science, which has driven progress in fields such as marketing. Here we present an online Big-Data tool called BIG (Biological Information Gatherer) that allows users to submit a single online query to obtain all relevant information from all indexed databases. BIG is accessible at http://big.nhlbi.nih.gov/ .


2020 ◽  
Vol 4 (s1) ◽  
pp. 107-107
Author(s):  
Stefani Thomas ◽  
Betty Friedrich ◽  
Michael Schnaubelt ◽  
Daniel W. Chan ◽  
Hui Zhang ◽  
...  

OBJECTIVES/GOALS: Large-scale clinical proteomic studies of cancer tissues often entail complex workflows and are resource-intensive. In this study we analyzed ovarian tumors using an emerging, high-throughput proteomic technology termed SWATH. We compared SWATH with the more widely used iTRAQ workflow based on robustness, complexity, ability to detect differential protein expression, and the elucidated biological information. METHODS/STUDY POPULATION: Proteomic measurements of 103 clinically-annotated high-grade serous ovarian cancer (HGSOC) tumors previously genomically characterized by The Cancer Genome Atlas were conducted using two orthogonal mass spectrometry-based proteomic methods: iTRAQ and SWATH. The analytical differences between the two methods were compared with respect to relative protein abundances. To assess the ability to classify the tumors into subtypes based on proteomic signatures, an unbiased molecular taxonomy of HGSOC was established using protein abundance data. The 1,599 proteins quantified in both datasets were classified based on z-score-transformed protein abundances, and the emergent protein modules were characterized using weighted gene-correlation network analysis and Reactome pathway enrichment. RESULTS/ANTICIPATED RESULTS: Despite the greater than two-fold difference in the analytical depth of each proteomic method, common differentially expressed proteins in enriched pathways associated with the HGSOC Mesenchymal subtype were identified by both methods. The stability of tumor subtype classification was sensitive to the number of analyzed samples, and the statistically stable subgroups were identified by the data from both methods. Additionally, the homologous recombination deficiency-associated enriched DNA repair and chromosome organization pathways were conserved in both data sets. DISCUSSION/SIGNIFICANCE OF IMPACT: SWATH is a robust proteomic method that can be used to elucidate cancer biology. 
The lower number of proteins detected by SWATH compared to iTRAQ is mitigated by its streamlined workflow, increased sample throughput, and reduced sample requirement. SWATH therefore presents novel opportunities to enhance the efficiency of clinical proteomic studies.
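The z-score transformation mentioned in the methods can be sketched as follows, assuming it is applied to each protein across samples; the abstract does not say which standard deviation was used, so the population form here is an assumption, as are the function name `z_scores` and the toy values.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def z_scores(values: Sequence[float]) -> List[float]:
    """Centre values on their mean and scale by the population SD."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # A constant row carries no relative-abundance signal.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Toy abundances of one hypothetical protein across three samples.
abundances = [2.0, 4.0, 6.0]
print(z_scores(abundances))
```

After this transformation every protein has mean 0 and unit spread across samples, so proteins measured on different abundance scales can be compared or clustered together.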


2019 ◽  
Vol 2 (1) ◽  
pp. 139-173 ◽  
Author(s):  
Koen Van den Berge ◽  
Katharina M. Hembach ◽  
Charlotte Soneson ◽  
Simone Tiberi ◽  
Lieven Clement ◽  
...  

Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies, but also agricultural or clinical situations. In the past 10 years or so, much has been learned about the characteristics of the RNA-seq data sets, as well as the performance of the myriad of methods developed. In this review, we give an overview of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on the quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.

