GEOMetaCuration: A web-based application for accurate manual curation of Gene Expression Omnibus metadata

2018 ◽  
Author(s):  
Zhao Li ◽  
Jin Li ◽  
Peng Yu

Abstract: Metadata curation has become increasingly important for biological discovery and biomedical research because a large amount of heterogeneous biological data is currently freely available. To facilitate efficient metadata curation, we developed an easy-to-use web-based curation application, GEOMetaCuration, for curating the metadata of Gene Expression Omnibus datasets. It can eliminate mechanical operations that consume precious curation time and can help coordinate curation efforts among multiple curators. It improves the curation process by introducing various features that are critical to metadata curation, such as a back-end curation management system and a curator-friendly front-end. The application is built on Django, a commonly used Python web development framework, and is open-sourced under the GNU General Public License V3. GEOMetaCuration is expected to benefit the biocuration community and to contribute to computational generation of biological insights using large-scale biological data. An example use case can be found at the demo website: http://geometacuration.yubiolab.org. Source code URL: https://bitbucket.com/yubiolab/GEOMetaCuration

Mathematics ◽  
2021 ◽  
Vol 9 (7) ◽  
pp. 772
Author(s):  
Seonghun Kim ◽  
Seockhun Bae ◽  
Yinhua Piao ◽  
Kyuri Jo

Genomic profiles of cancer patients, such as gene expression, have become a major source for predicting responses to drugs in the era of personalized medicine. As large-scale drug screening data with cancer cell lines are available, a number of computational methods have been developed for drug response prediction. However, few methods incorporate both gene expression data and the biological network, which can harbor essential information about the underlying process of the drug response. We proposed an analysis framework called DrugGCN for prediction of Drug response using a Graph Convolutional Network (GCN). DrugGCN first generates a gene graph by combining a Protein-Protein Interaction (PPI) network and gene expression data with feature selection of drug-related genes, and the GCN model detects the local features such as subnetworks of genes that contribute to the drug response by localized filtering. We demonstrated the effectiveness of DrugGCN using biological data, showing its high prediction accuracy compared with competing methods.
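As a rough illustration of what "localized filtering" on a gene graph means, the sketch below applies one graph-convolution step to a toy three-gene network. It is not the DrugGCN implementation: the abstract does not specify the filter, so the widely used first-order propagation rule (symmetrically normalized adjacency with self-loops, followed by a ReLU) is swapped in as a stand-in. The toy graph, the expression values, the weight matrix and all function names are hypothetical.

```python
import math


def normalized_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 for a symmetric adjacency matrix."""
    n = len(adj)
    # Add self-loops so every gene keeps part of its own signal.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(x, y):
    """Plain nested-list matrix product."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]


def gcn_layer(adj, features, weights):
    """One graph-convolution step: ReLU(A_norm @ features @ weights)."""
    propagated = matmul(normalized_adjacency(adj), features)
    return [[max(0.0, v) for v in row] for row in matmul(propagated, weights)]


# Hypothetical path graph gene0 - gene1 - gene2 (e.g. taken from a PPI network).
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
expression = [[1.0], [0.0], [0.0]]   # one expression feature per gene
weights = [[1.0]]                    # identity weight, for readability

out = gcn_layer(adj, expression, weights)
print(out)
```

With these toy inputs, the signal on gene0 reaches its direct neighbour gene1 but not gene2, which is two hops away. This is the sense in which a single layer acts as a local filter over gene subnetworks.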


2020 ◽  
Author(s):  
Ramon Viñas ◽  
Tiago Azevedo ◽  
Eric R. Gamazon ◽  
Pietro Liò

Abstract: A question of fundamental biological significance is to what extent the expression of a subset of genes can be used to recover the full transcriptome, with important implications for biological discovery and clinical application. To address this challenge, we present GAIN-GTEx, a method for gene expression imputation based on Generative Adversarial Imputation Networks. In order to increase the applicability of our approach, we leverage data from GTEx v8, a reference resource that has generated a comprehensive collection of transcriptomes from a diverse set of human tissues. We compare our model to several standard and state-of-the-art imputation methods and show that GAIN-GTEx is significantly superior in terms of predictive performance and runtime. Furthermore, our results indicate strong generalisation on RNA-Seq data from 3 cancer types across varying levels of missingness. Our work can facilitate a cost-effective integration of large-scale RNA biorepositories into genomic studies of disease, with high applicability across diverse tissue types.


2017 ◽  
Author(s):  
Venkata Manem ◽  
George Adam ◽  
Tina Gruosso ◽  
Mathieu Gigoux ◽  
Nicholas Bertos ◽  
...  

Abstract. Background: Over the last several years, we have witnessed the metamorphosis of network biology from being a mere representation of molecular interactions to models enabling inference of complex biological processes. Networks provide promising tools to elucidate intercellular interactions that contribute to the functioning of key biological pathways in a cell. However, the exploration of these large-scale networks remains a challenge due to their high-dimensionality. Results: CrosstalkNet is a user friendly, web-based network visualization tool to retrieve and mine interactions in large-scale bipartite co-expression networks. In this study, we discuss the use of gene co-expression networks to explore the rewiring of interactions between tumor epithelial and stromal cells. We show how CrosstalkNet can be used to efficiently visualize, mine, and interpret large co-expression networks representing the crosstalk occurring between the tumour and its microenvironment. Conclusion: CrosstalkNet serves as a tool to assist biologists and clinicians in exploring complex, large interaction graphs to obtain insights into the biological processes that govern the tumor epithelial-stromal crosstalk. A comprehensive tutorial along with case studies are provided with the application. Availability: The web-based application is available at the following location: http://epistroma.pmgenomics.ca/app/. The code is open-source and freely available from http://github.com/bhklab/EpiStroma-webapp. Contact: [email protected]


2016 ◽  
Vol 2 ◽  
pp. e90 ◽  
Author(s):  
Ranko Gacesa ◽  
David J. Barlow ◽  
Paul F. Long

Ascribing function to sequence in the absence of biological data is an ongoing challenge in bioinformatics. Differentiating the toxins of venomous animals from homologues having other physiological functions is particularly problematic as there are no universally accepted methods by which to attribute toxin function using sequence data alone. Bioinformatics tools that do exist are difficult to implement for researchers with little bioinformatics training. Here we announce a machine learning tool called ‘ToxClassifier’ that enables simple and consistent discrimination of toxins from non-toxin sequences with >99% accuracy and compare it to commonly used toxin annotation methods. ‘ToxClassifier’ also reports the best-hit annotation allowing placement of a toxin into the most appropriate toxin protein family, or relates it to a non-toxic protein having the closest homology, giving enhanced curation of existing biological databases and new venomics projects. ‘ToxClassifier’ is available for free, either to download (https://github.com/rgacesa/ToxClassifier) or to use on a web-based server (http://bioserv7.bioinfo.pbf.hr/ToxClassifier/).


F1000Research ◽  
2016 ◽  
Vol 5 ◽  
pp. 305 ◽  
Author(s):  
Alexandra K. Marr ◽  
Sabri Boughorbel ◽  
Scott Presnell ◽  
Charlie Quinn ◽  
Damien Chaussabel ◽  
...  

Compendia of large-scale datasets made available in public repositories provide a precious opportunity to discover new biomedical phenomena and to fill gaps in our current knowledge. In order to foster novel insights it is necessary to ensure that these data are made readily accessible to research investigators in an interpretable format. Here we make a curated, public collection of transcriptome datasets relevant to human placenta biology available for further analysis and interpretation via an interactive data browsing interface. We identified and retrieved a total of 24 datasets encompassing 759 transcriptome profiles associated with the development of the human placenta and associated pathologies from the NCBI Gene Expression Omnibus (GEO) and present them in a custom web-based application designed for interactive query and visualization of integrated large-scale datasets (http://placentalendocrinology.gxbsidra.org/dm3/landing.gsp). We also performed quality control checks using relevant biological markers. Multiple sample groupings and rank lists were subsequently created to facilitate data query and interpretation. Via this interface, users can create web-links to customized graphical views which may be inserted into manuscripts for further dissemination, or e-mailed to collaborators for discussion. The tool also enables users to browse a single gene across different projects, providing a mechanism for developing new perspectives on the role of a molecule of interest across multiple biological states. The dataset collection we created here is available at: http://placentalendocrinology.gxbsidra.org/dm3.


2019 ◽  
Author(s):  
Bastian Seelbinder ◽  
Thomas Wolf ◽  
Steffen Priebe ◽  
Sylvie McNamara ◽  
Silvia Gerber ◽  
...  

Abstract: In transcriptomics, the study of the total set of RNAs transcribed by the cell, RNA sequencing (RNA-seq) has become the standard tool for analysing gene expression. The primary goal is the detection of genes whose expression changes significantly between two or more conditions, either for a single species or for two or more interacting species at the same time (dual RNA-seq, triple RNA-seq and so forth). The analysis of RNA-seq can be simplified as many steps of the data pre-processing can be standardised in a pipeline. In this publication we present the “GEO2RNAseq” pipeline for complete, quick and concurrent pre-processing of single, dual, and triple RNA-seq data. It covers all pre-processing steps starting from raw sequencing data to the analysis of differentially expressed genes, including various tables and figures to report intermediate and final results. Raw data may be provided in FASTQ format or can be downloaded automatically from the Gene Expression Omnibus repository. GEO2RNAseq strongly incorporates experimental as well as computational metadata. GEO2RNAseq is implemented in R, lightweight, easy to install via Conda and easy to use, but still very flexible through using modular programming and offering many extensions and alternative workflows. GEO2RNAseq is publicly available at https://anaconda.org/xentrics/r-geo2rnaseq and https://bitbucket.org/thomas_wolf/geo2rnaseq/overview, including source code, installation instructions, and comprehensive package documentation.


2020 ◽  
Author(s):  
Silu Huang ◽  
Charles Blatti ◽  
Saurabh Sinha ◽  
Aditya Parameswaran

Abstract. Motivation: A common but critical task in genomic data analysis is finding features that separate and thereby help explain differences between two classes of biological objects, e.g., genes that explain the differences between healthy and diseased patients. As lower-cost, high-throughput experimental methods greatly increase the number of samples that are assayed as objects for analysis, computational methods are needed to quickly provide insights into high-dimensional datasets with tens of thousands of objects and features. Results: We develop an interactive exploration tool called Genvisage that rapidly discovers the most discriminative feature pairs that best separate two classes in a dataset, and displays the corresponding visualizations. Since quickly finding top feature pairs is computationally challenging, especially when the numbers of objects and features are large, we propose a suite of optimizations to make Genvisage more responsive and demonstrate that our optimizations lead to a 400X speedup over competitive baselines for multiple biological data sets. With this speedup, Genvisage enables the exploration of more large-scale datasets and alternate hypotheses in an interactive and interpretable fashion. We apply Genvisage to uncover pairs of genes whose transcriptomic responses significantly discriminate treatments of several chemotherapy drugs. Availability: Free webserver at http://genvisage.knoweng.org:443/ with source code at https://github.com/KnowEnG/Genvisage


2021 ◽  
Author(s):  
Kimiya Gohari ◽  
Anoshirvan Kazemnejad ◽  
Shayan Mostafaei ◽  
Ali Sheidaei ◽  
Maryam S Daneshpour ◽  
...  

Abstract. Background: We compared LASSO, smoothly clipped absolute deviation (SCAD) and minimax concave penalty (MCP) logistic classifiers to identify genes related to COPD and to assess the genes' effects on the progression of the disease, based on one of the main classes of cells involved in the disease, sputum cells. We used genome-wide expression profiling to define gene networks relevant to the disease. The data were retrieved from the Gene Expression Omnibus (GEO) under accession number "GSE22148". From 143 samples from GOLD stage 2-4 COPD ex-smokers, 54,675 probes were initially assessed. After normalization, LASSO, SCAD and MCP logistic regressions were applied. A K-fold cross-validation scheme was used to evaluate the performance of the methods. All computational processes were done using the "ncvreg", "Affy", "Limma" and "SVA" R packages. Results: The results of LASSO (AUC=0.95, sensitivity=0.91, specificity=0.86) and SCAD (AUC=0.97, sensitivity=0.95, specificity=0.85) logistic regression were similar. There were 23 and 22 significantly associated genes for LASSO and SCAD, respectively. The only difference between these models is related to "stromal interaction molecule 2". With the MCP approach, the most conservative method, we detected only 7 significant genes (AUC=0.94, sensitivity=0.94, specificity=0.82). Conclusions: In the present study, the relative expression of thousands of genes was assessed, and genes associated with the progression of COPD were identified. Differential analysis of gene expression data can reduce the number of genes, but only to a limited extent. To find an efficient and small subset of genes, we should use alternative approaches like logistic regression. Regularization solves the high-dimensionality problem in this kind of regression.
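The three classifiers differ only in the penalty added to the logistic log-likelihood. As a minimal sketch, the functions below evaluate the standard textbook forms of the LASSO, SCAD and MCP penalties for a single coefficient. The function names are hypothetical, and the default tuning values (a = 3.7 for SCAD, gamma = 3 for MCP) are assumed for illustration; the abstract does not report the values used in the study.

```python
def lasso_penalty(beta, lam):
    """LASSO (L1) penalty: grows linearly with |beta| without bound."""
    return lam * abs(beta)


def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty; requires a > 2. Matches LASSO for |beta| <= lam,
    then bends over quadratically and is constant beyond a * lam."""
    b = abs(beta)
    if b <= lam:
        return lam * b
    if b <= a * lam:
        return (2 * a * lam * b - b * b - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


def mcp_penalty(beta, lam, gamma=3.0):
    """MCP penalty; requires gamma > 1. Relaxes the LASSO rate from
    the origin onward and is constant beyond gamma * lam."""
    b = abs(beta)
    if b <= gamma * lam:
        return lam * b - b * b / (2 * gamma)
    return gamma * lam * lam / 2


# Compare the three penalties on small, medium and large coefficients.
for beta in (0.5, 2.0, 10.0):
    print(beta,
          lasso_penalty(beta, 1.0),
          scad_penalty(beta, 1.0),
          mcp_penalty(beta, 1.0))
```

For large coefficients the SCAD and MCP penalties level off while the LASSO penalty keeps growing, so the two non-convex penalties shrink strong effects less than LASSO does.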


2018 ◽  
Author(s):  
Chi Tung Choy ◽  
Chi Hang Wong ◽  
Stephen Lam Chan

Abstract: Artificial neural networks (ANNs) have been utilized for classification and prediction tasks with remarkable accuracy. However, their implications for unsupervised data mining using molecular data are under-explored. We adopted a method of unsupervised ANN, namely word embedding, to extract biologically relevant information from a TCGA gene expression dataset. Ground truth relationships, such as cancer types of the input samples and semantic meaning of genes, were shown to be retained in the resulting entity matrices. We also demonstrated the interpretability and usage of these matrices in shortlisting candidates from a long gene list. This method is feasible for mining large volumes of biological data, and would be a valuable tool to discover novel knowledge from omics data. The resulting embedding matrices mined from TCGA gene expression data are interactively explorable online (http://bit.ly/tcga-embedding-cancer) and could serve as an informative reference.

