Using Semantic Web Technologies to Enable Cancer Genomics Discovery at Petabyte Scale

2018 · Vol 17 · pp. 117693511877478
Author(s): Jovan Cejovic, Jelena Radenkovic, Vladimir Mladenovic, Adam Stanojevic, Milica Miletic, ...

Increased efforts in cancer genomics research and bioinformatics are producing tremendous amounts of data. These data are diverse in origin, format, and content. As the amount of available sequencing data increases, technologies that make them discoverable and usable are critically needed. In response, we have developed a Semantic Web–based Data Browser, a tool allowing users to visually build and execute ontology-driven queries. This approach simplifies access to available data and improves the process of using them in analyses on the Seven Bridges Cancer Genomics Cloud (CGC; www.cancergenomicscloud.org). The Data Browser makes large data sets easily explorable and simplifies the retrieval of specific data of interest. Although initially implemented on top of The Cancer Genome Atlas (TCGA) data set, the Data Browser's architecture allows for seamless integration of other data sets. By deploying it on the CGC, we have enabled remote researchers to access data and perform collaborative investigations.
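For readers unfamiliar with ontology-driven querying, the sketch below shows the kind of SPARQL query such a browser could build and submit programmatically via the SPARQLWrapper library. It is purely illustrative: the endpoint URL and the tcga: vocabulary are hypothetical placeholders, not the CGC's actual interface.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical endpoint and vocabulary, used only to illustrate the query pattern.
sparql = SPARQLWrapper("https://example.org/tcga/sparql")
sparql.setQuery("""
PREFIX tcga: <https://example.org/tcga/ontology#>
SELECT ?case ?file WHERE {
  ?case a tcga:Case ;
        tcga:hasDiseaseType "Lung Adenocarcinoma" ;
        tcga:hasFile ?file .
  ?file tcga:hasExperimentalStrategy "WXS" .
}
LIMIT 100
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["case"]["value"], row["file"]["value"])
```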

2018
Author(s): Janko Tackmann, João Frederico Matias Rodrigues, Christian von Mering

Abstract The recent explosion of metagenomic sequencing data opens the door towards the modeling of microbial ecosystems in unprecedented detail. In particular, co-occurrence based prediction of ecological interactions could strongly benefit from this development. However, current methods fall short on several fronts: univariate tools do not distinguish between direct and indirect interactions, resulting in excessive false positives, while approaches with better resolution have so far been severely limited computationally. Furthermore, confounding variables typical for cross-study data sets are rarely addressed. We present FlashWeave, a new approach based on a flexible probabilistic graphical models framework to infer highly resolved direct microbial interactions from massive heterogeneous microbial abundance data sets with seamless integration of metadata. On a variety of benchmarks, FlashWeave outperforms state-of-the-art methods by several orders of magnitude in terms of speed while generally providing increased accuracy. We apply FlashWeave to a cross-study data set of 69,818 publicly available human gut samples, resulting in one of the largest and most diverse models of microbial interactions in the human gut to date.
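The core idea of separating direct from indirect associations can be illustrated with a toy conditional-independence check; the sketch below is a generic partial-correlation example in Python, not FlashWeave's actual (Julia-based) algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Simulated abundances: A drives B, B drives C, so A and C are only indirectly linked.
a = rng.normal(size=n)
b = 0.8 * a + rng.normal(scale=0.5, size=n)
c = 0.8 * b + rng.normal(scale=0.5, size=n)

def partial_corr(x, y, z):
    """Correlation of x and y after regressing out a single conditioning variable z."""
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(rx, ry)[0, 1]

print("corr(A, C)     =", round(np.corrcoef(a, c)[0, 1], 3))  # large: indirect association
print("corr(A, C | B) =", round(partial_corr(a, c, b), 3))    # near zero: no direct edge
```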


2019 · pp. 016555151986549
Author(s): Enayat Rajabi, Salvador Sanchez-Alonso

The Semantic Web allows knowledge discovery on graph-based data sets and facilitates answering complex queries that are extremely difficult to achieve using traditional database approaches. Notably, the Semantic Web query language (SPARQL) has a 'property path' feature that enables knowledge discovery in a knowledge base using its reasoning engine. In this article, we utilise the property path feature of SPARQL and other Semantic Web technologies to answer sophisticated queries posed over a disease data set. To this end, we transform data from a disease web portal into a graph-based data set by designing an ontology, present a template to define the queries and provide a set of conjunctive queries on the data set. We illustrate how reasoning over SPARQL 'property path' expressions can retrieve the results from the designed knowledge base. The results of this study were verified by two domain experts as well as by the authors' manual exploration of the disease web portal.
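A minimal, self-contained illustration of SPARQL property paths using rdflib; the ex: disease vocabulary is hypothetical. The transitive '+' operator retrieves ancestors that are never asserted directly, which is the kind of inference the article relies on.

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/disease#")  # hypothetical vocabulary
g = Graph()
# Tiny hierarchy: cardiomyopathy -> heart disease -> cardiovascular disease
g.add((EX.Cardiomyopathy, EX.isSubtypeOf, EX.HeartDisease))
g.add((EX.HeartDisease, EX.isSubtypeOf, EX.CardiovascularDisease))

# The '+' property path follows isSubtypeOf one or more times,
# so it also returns ancestors that are not asserted directly.
query = """
PREFIX ex: <http://example.org/disease#>
SELECT ?ancestor WHERE { ex:Cardiomyopathy ex:isSubtypeOf+ ?ancestor . }
"""
for row in g.query(query):
    print(row.ancestor)  # HeartDisease and CardiovascularDisease
```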


2020
Author(s): Michael J. Casey, Rubén J. Sánchez-García, Ben D. MacArthur

Abstract Single-cell sequencing (sc-Seq) experiments are producing increasingly large data sets. However, large data sets do not necessarily contain large amounts of information. Here, we introduce a formal framework for assessing the amount of information obtained from a sc-Seq experiment, which can be used throughout the sc-Seq analysis pipeline, including for quality control, feature selection and cluster evaluation. We illustrate this framework with some simple examples, including using it to quantify the amount of information in a single-cell sequencing data set that is explained by a proposed clustering, and thereby to determine cluster quality. Our information-theoretic framework provides a formal way to assess the quality of data obtained from sc-Seq experiments and the effectiveness of analyses performed, with wide implications for our understanding of variability in gene expression patterns within heterogeneous cell populations.
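As a rough illustration of the idea (not the authors' exact statistic), the snippet below measures how much of one gene's expression entropy is removed by conditioning on a hypothetical clustering; the data are simulated.

```python
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(1)
n_cells = 300
clusters = rng.integers(0, 3, size=n_cells)                  # hypothetical cluster labels
counts = rng.poisson(lam=5 * (clusters + 1), size=n_cells)   # one gene, cluster-dependent expression

bins = np.histogram_bin_edges(counts, bins=20)               # shared bins for comparability

def expression_entropy(x):
    p, _ = np.histogram(x, bins=bins)
    return entropy(p / p.sum(), base=2)

total = expression_entropy(counts)
# Entropy left unexplained within clusters, weighted by cluster size
within = sum(np.mean(clusters == k) * expression_entropy(counts[clusters == k])
             for k in np.unique(clusters))
print(f"total: {total:.2f} bits; explained by the clustering: {total - within:.2f} bits")
```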


2021 · Vol 12 (1)
Author(s): Judith Abécassis, Fabien Reyal, Jean-Philippe Vert

Abstract Systematic DNA sequencing of cancer samples has highlighted the importance of two aspects of cancer genomics: intra-tumor heterogeneity (ITH) and mutational processes. These two aspects may not always be independent, as different mutational processes could be involved in different stages or regions of the tumor, but existing computational approaches to study them largely ignore this potential dependency. Here, we present CloneSig, a computational method to jointly infer ITH and mutational processes in a tumor from bulk-sequencing data. Extensive simulations show that CloneSig outperforms current methods for ITH inference and detection of mutational processes when the distribution of mutational signatures changes between clones. Applied to a large cohort of 8,951 tumors with whole-exome sequencing data from The Cancer Genome Atlas and to a pan-cancer data set of 2,632 whole-genome sequencing tumor samples from the Pan-Cancer Analysis of Whole Genomes initiative, CloneSig obtains results that are overall coherent with previous studies.
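CloneSig itself is a joint probabilistic model over clones and signatures; as background, the standard clone-agnostic signature-decomposition idea it builds on can be sketched with an NMF factorization of a synthetic mutation count matrix (all quantities below are simulated).

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(2)
# Hypothetical mutation count matrix: 20 samples x 96 trinucleotide contexts
true_sigs = rng.dirichlet(np.ones(96) * 0.3, size=3)   # 3 synthetic signatures
exposures = rng.gamma(2.0, 50.0, size=(20, 3))         # per-sample signature activities
counts = rng.poisson(exposures @ true_sigs)

# counts ~ W @ H, with W recovering exposures and H recovering signatures
model = NMF(n_components=3, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(counts)
H = model.components_
print("reconstruction error:", round(model.reconstruction_err_, 1))
```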


2020 · Vol 37 (10) · pp. 3061-3075
Author(s): Veronika Boskova, Tanja Stadler

Abstract Next-generation sequencing of pathogen quasispecies within a host yields data sets of tens to hundreds of unique sequences. However, the full data set often contains thousands of sequences, because many of those unique sequences have multiple identical copies. Data sets of this size represent a computational challenge for currently available Bayesian phylogenetic and phylodynamic methods. Through simulations, we explore how large data sets with duplicate sequences affect the speed and accuracy of phylogenetic and phylodynamic analysis within BEAST 2. We show that using only the unique sequences leads to biases, and using a random subset of sequences yields imprecise parameter estimates. To overcome these shortcomings, we introduce PIQMEE, a BEAST 2 add-on that produces reliable parameter estimates from full data sets with increased computational efficiency as compared with the currently available methods within BEAST 2. The principle behind PIQMEE is to resolve the tree structure of the unique sequences only, while simultaneously estimating the branching times of the duplicate sequences. Distinguishing between unique and duplicate sequences allows our method to perform well even for very large data sets. Whereas the classic method converges poorly for data sets of 6,000 sequences even when allowed to run for 7 days, our method converges in slightly more than 1 day. In fact, PIQMEE can handle data sets of around 21,000 sequences with 20 unique sequences in 14 days. Finally, we apply the method to a real, within-host HIV sequencing data set with several thousand sequences per patient.
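The preprocessing step of separating unique from duplicate sequences can be sketched in a few lines of Python; this only illustrates collapsing reads to haplotypes with multiplicities, not PIQMEE's BEAST 2 implementation.

```python
from collections import Counter

# Hypothetical within-host reads; real input would be parsed from a FASTA/FASTQ file.
reads = [
    "ACGTACGTAA",
    "ACGTACGTAA",
    "ACGTACGTAT",
    "ACGTACGTAA",
    "ACGTTCGTAT",
]

# Only the unique haplotypes need a fully resolved tree topology,
# while the copy counts constrain the branching times of the duplicates.
for haplotype, n in Counter(reads).most_common():
    print(f"{haplotype}\tcopies={n}")
```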


2019
Author(s): Judith Abécassis, Fabien Reyal, Jean-Philippe Vert

The possibility to sequence DNA in cancer samples has triggered much recent effort to identify the forces at the genomic level that shape tumorigenesis and cancer progression. It has resulted in novel understanding or clarification of two important aspects of cancer genomics: (i) intra-tumor heterogeneity (ITH), as captured by the variability in observed prevalences of somatic mutations within a tumor, and (ii) mutational processes, as revealed by the distribution of the types of somatic mutation and their immediate nucleotide context. These two aspects are not independent of each other, as different mutational processes can be involved in different subclones, but current computational approaches to study them largely ignore this dependency. In particular, sequential methods that first estimate subclones and then analyze the mutational processes active in each clone can easily miss changes in mutational processes if the clonal decomposition step fails, and, conversely, information regarding mutational signatures is overlooked during subclonal reconstruction. To address current limitations, we present CloneSig, a new computational method to jointly infer ITH and mutational processes in a tumor from bulk-sequencing data, including whole-exome sequencing (WES) data, by leveraging their dependency. We show through an extensive benchmark on simulated samples that CloneSig is always as good as or better than state-of-the-art methods for ITH inference and detection of mutational processes. We then apply CloneSig to a large cohort of 8,954 tumors with WES data from The Cancer Genome Atlas (TCGA), where we obtain results coherent with previous studies on whole-genome sequencing (WGS) data, as well as new promising findings. This validates the applicability of CloneSig to WES data, paving the way to its use in clinical settings where WES is increasingly deployed.
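Signature analyses of this kind start from per-sample counts over the 96 trinucleotide mutation classes; a simplified sketch of that bookkeeping, following the usual pyrimidine-centred convention, is shown below with hypothetical calls.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def context_key(ref_trinuc, alt):
    """Map a mutation to its pyrimidine-centred class, e.g. 'ACA>T'."""
    if ref_trinuc[1] in "AG":  # re-express purine mutations on the opposite strand
        ref_trinuc = ref_trinuc.translate(COMPLEMENT)[::-1]
        alt = alt.translate(COMPLEMENT)
    return f"{ref_trinuc}>{alt}"

# Hypothetical calls: (trinucleotide around the mutated site, alternate allele)
mutations = [("ACA", "T"), ("TGT", "A"), ("GCG", "T"), ("ACG", "T")]
print(Counter(context_key(tri, alt) for tri, alt in mutations))
# ("ACA","T") and ("TGT","A") collapse into the same class, ACA>T
```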


2019 · Vol 20 (5) · pp. 1192
Author(s): Marina Miller, Eric Devor, Erin Salinas, Andreea Newtson, Michael Goodheart, ...

In the era of large genetic and genomic datasets, it has become crucially important to validate results of individual studies using data from publicly available sources, such as The Cancer Genome Atlas (TCGA). However, how generalizable are results from either an independent or a large public dataset to the remainder of the population? The study presented here aims to answer that question. Utilizing next-generation sequencing data from endometrial and ovarian cancer patients from both the University of Iowa and TCGA, the genomic admixture of each population was analyzed using the STRUCTURE and ADMIXTURE software packages. In our independent data set, one subpopulation was identified, whereas in TCGA 4–6 subpopulations were identified. Data presented here demonstrate how different the genetic substructures of the TCGA and University of Iowa populations are. Validation of genomic studies between two different population samples must recognize, account for, and correct for background genetic substructure.
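STRUCTURE and ADMIXTURE are the tools actually used in the study; as a lighter-weight illustration of how background substructure reveals itself, the sketch below applies PCA to a simulated genotype matrix drawn from two populations with shifted allele frequencies.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Hypothetical genotype matrix (individuals x SNPs, 0/1/2 allele counts)
freq_a = rng.uniform(0.1, 0.9, 200)
freq_b = np.clip(freq_a + rng.normal(0, 0.15, 200), 0.05, 0.95)
geno = np.vstack([
    rng.binomial(2, freq_a, size=(60, 200)),   # population A
    rng.binomial(2, freq_b, size=(60, 200)),   # population B
])

pcs = PCA(n_components=2).fit_transform(geno - geno.mean(axis=0))
print("PC1 gap between the two simulated subpopulations:",
      round(abs(pcs[:60, 0].mean() - pcs[60:, 0].mean()), 2))
```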


2019 · Vol 20 (22) · pp. 5697
Author(s): Michelle E. Pewarchuk, Mateus C. Barros-Filho, Brenda C. Minatel, David E. Cohn, Florian Guisier, ...

Recent studies have uncovered microRNAs (miRNAs) that were overlooked in early genomic explorations and that show remarkable tissue- and context-specific expression. Here, we aim to identify and characterize previously unannotated miRNAs expressed in gastric adenocarcinoma (GA). Raw small RNA-sequencing data were analyzed using the miRMaster platform to predict and quantify previously unannotated miRNAs. A discovery cohort of 475 gastric samples (434 GA and 41 adjacent nonmalignant samples), collected by The Cancer Genome Atlas (TCGA), was evaluated. Candidate miRNAs were similarly assessed in an independent cohort of 25 gastric samples. We discovered 170 previously unannotated miRNA candidates expressed in gastric tissues. The expression of these novel miRNAs was highly specific to the gastric samples, and 143 of them were significantly deregulated between tumor and nonmalignant contexts (adjusted p < 0.05; fold change > 1.5). Multivariate survival analyses showed that the combined expression of one previously annotated miRNA and two novel miRNA candidates was significantly predictive of patient outcome. Further, the expression of these three miRNAs was able to stratify patients into three distinct prognostic groups (p = 0.00003). These novel miRNAs were also present in the independent cohort (43 sequences detected in both cohorts). Our findings uncover novel miRNA transcripts in gastric tissues that may have implications for the biology and management of gastric adenocarcinoma.
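The survival stratification step can be illustrated with a short sketch using the lifelines package on simulated data; the group labels and hazards below are hypothetical stand-ins for the expression-defined prognostic groups.

```python
import numpy as np
import pandas as pd
from lifelines.statistics import multivariate_logrank_test

rng = np.random.default_rng(4)
n = 120
# Hypothetical cohort: three miRNA-expression-defined groups with different hazards
group = rng.integers(0, 3, size=n)
time = rng.exponential(scale=np.array([60.0, 40.0, 20.0])[group])
event = (rng.random(n) < 0.8).astype(int)   # 1 = event observed, 0 = censored

df = pd.DataFrame({"time": time, "event": event, "group": group})
result = multivariate_logrank_test(df["time"], df["group"], df["event"])
print("log-rank p-value across the three groups:", result.p_value)
```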


Author(s): Lior Shamir

Abstract Several recent observations using large data sets of galaxies showed a non-random distribution of the spin directions of spiral galaxies, even when the galaxies are too far from each other to have gravitational interaction. Here, a data set of $\sim8.7\cdot10^3$ spiral galaxies imaged by the Hubble Space Telescope (HST) is used to test and profile a possible asymmetry between galaxy spin directions. The asymmetry between galaxies with opposite spin directions is compared to the asymmetry of galaxies from the Sloan Digital Sky Survey. The two data sets contain different galaxies at different redshift ranges, and each data set was annotated using a different annotation method. The results show that both data sets exhibit a similar asymmetry in the COSMOS field, which is covered by both telescopes. Fitting the asymmetry of the galaxies to a cosine dependence shows a dipole axis with significance of $\sim2.8\sigma$ and $\sim7.38\sigma$ in HST and SDSS, respectively. The most likely dipole axis identified in the HST galaxies is at $(\alpha=78^{\circ},\delta=47^{\circ})$ and is well within the $1\sigma$ error range of the location of the most likely dipole axis in the SDSS galaxies with $z>0.15$, identified at $(\alpha=71^{\circ},\delta=61^{\circ})$.
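The cosine (dipole) fit can be sketched as follows: for a candidate axis, bin the mean spin sign by angular distance and fit $A\cos\phi + B$. The data below are simulated with a small injected asymmetry, purely for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(5)
n = 5000
# Angular distance (radians) of each simulated galaxy from a candidate dipole axis
phi = np.arccos(rng.uniform(-1, 1, n))
# Spin sign with a slight cos(phi)-dependent preference (expected amplitude ~0.05)
spin = np.where(rng.random(n) < 0.5 + 0.025 * np.cos(phi), 1, -1)

# Bin the mean spin (the local asymmetry) and fit A*cos(phi) + B
bins = np.linspace(0, np.pi, 13)
centers = 0.5 * (bins[1:] + bins[:-1])
idx = np.digitize(phi, bins) - 1
asym = np.array([spin[idx == i].mean() for i in range(len(centers))])

popt, _ = curve_fit(lambda x, a, b: a * np.cos(x) + b, centers, asym)
print("fitted dipole amplitude:", round(popt[0], 3))
```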


2020 · Vol 6
Author(s): Jaime de Miguel Rodríguez, Maria Eugenia Villafañe, Luka Piškorec, Fernando Sancho Caparrini

Abstract This work presents a methodology for the generation of novel 3D objects resembling wireframes of building types. These result from the reconstruction of interpolated locations within the learnt distribution of variational autoencoders (VAEs), a deep generative machine learning model based on neural networks. The data set used features a scheme for geometry representation based on a ‘connectivity map’ that is especially suited to express the wireframe objects that compose it. Additionally, the input samples are generated through ‘parametric augmentation’, a strategy proposed in this study that creates coherent variations among data by enabling a set of parameters to alter representative features on a given building type. In the experiments that are described in this paper, more than 150,000 input samples belonging to two building types have been processed during the training of a VAE model. The main contribution of this paper has been to explore parametric augmentation for the generation of large data sets of 3D geometries, showcasing its problems and limitations in the context of neural networks and VAEs. Results show that the generation of interpolated hybrid geometries is a challenging task. Despite the difficulty of the endeavour, promising advances are presented.
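A minimal sketch of the interpolation mechanism, assuming a toy PyTorch VAE over flattened connectivity-map vectors; the architecture, dimensions, and inputs are hypothetical, and the model is untrained (training would additionally require the reparameterization trick and a KL term in the loss).

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Minimal VAE over flattened 'connectivity map' vectors (dimensions are hypothetical)."""
    def __init__(self, d_in=512, d_latent=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU())
        self.mu = nn.Linear(128, d_latent)
        self.logvar = nn.Linear(128, d_latent)
        self.dec = nn.Sequential(nn.Linear(d_latent, 128), nn.ReLU(),
                                 nn.Linear(128, d_in), nn.Sigmoid())

    def encode(self, x):
        h = self.enc(x)
        return self.mu(h), self.logvar(h)

    def decode(self, z):
        return self.dec(z)

vae = TinyVAE()
x_a, x_b = torch.rand(1, 512), torch.rand(1, 512)   # two hypothetical building samples
z_a, _ = vae.encode(x_a)
z_b, _ = vae.encode(x_b)

# Walk the latent space between the two samples and decode each step;
# with a trained model these decodings would be the interpolated hybrid geometries.
for t in torch.linspace(0, 1, 5):
    hybrid = vae.decode((1 - t) * z_a + t * z_b)
    print(f"t={t.item():.2f}  decoded vector shape: {tuple(hybrid.shape)}")
```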

