Exploring Integrative Analysis Using the BioMedical Evidence Graph

2020, pp. 147-159
Author(s): Adam Struck, Brian Walsh, Alexander Buchanan, Jordan A. Lee, Ryan Spangler, ...

PURPOSE The analysis of cancer biology data involves extremely heterogeneous data sets, including information from RNA sequencing, genome-wide copy number, DNA methylation data reporting on epigenetic regulation, somatic mutations from whole-exome or whole-genome analyses, pathology estimates from imaging sections or subtyping, drug response or other treatment outcomes, and various other clinical and phenotypic measurements. Bringing these different resources into a common framework, with a data model that allows for complex relationships as well as dense vectors of features, will unlock integrated data set analysis. METHODS We introduce the BioMedical Evidence Graph (BMEG), a graph database and query engine for discovery and analysis of cancer biology. The BMEG differs from other biologic data graphs in that sample-level molecular and clinical information is connected to reference knowledge bases. It combines gene expression and mutation data with drug-response experiments, pathway information databases, and literature-derived associations. RESULTS The construction of the BMEG has resulted in a graph containing > 41 million vertices and 57 million edges. The BMEG system provides a graph query-based application programming interface to enable analysis, with client code available for Python, JavaScript, and R, and a server online at bmeg.io. Using this system, we have demonstrated several forms of cross-data set analysis to show the utility of the system. CONCLUSION The BMEG is an evolving resource dedicated to enabling integrative analysis. We have demonstrated queries on the system that illustrate mutation significance analysis, drug-response machine learning, patient-level knowledge-base queries, and pathway-level analysis. We have compared the resulting graph to other available integrated graph systems and demonstrated that the BMEG is unique in the scale of the graph and the type of data it makes available.
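The idea of connecting sample-level data to reference knowledge through graph traversal can be illustrated with a toy example. The sketch below is not the BMEG client or its query language; the `ToyGraph` class, the vertex identifiers, and the edge labels (`has_mutation_in`, `associated_with_drug`) are all hypothetical names chosen for illustration.

```python
from collections import defaultdict


class ToyGraph:
    """Minimal in-memory property graph (illustrative only, not the BMEG API)."""

    def __init__(self):
        self.vertices = {}              # vertex id -> {"label": ..., properties}
        self.edges = defaultdict(list)  # (source id, edge label) -> [target ids]

    def add_vertex(self, vid, label, **props):
        self.vertices[vid] = {"label": label, **props}

    def add_edge(self, src, label, dst):
        self.edges[(src, label)].append(dst)

    def out(self, vids, label):
        """Follow outgoing edges with the given label from each vertex in vids."""
        reached = []
        for vid in vids:
            for dst in self.edges.get((vid, label), []):
                if dst not in reached:
                    reached.append(dst)
        return reached


def drugs_for_sample(graph, sample_id):
    """Two-hop traversal: sample -> mutated genes -> associated drugs."""
    genes = graph.out([sample_id], "has_mutation_in")
    return graph.out(genes, "associated_with_drug")


# Hypothetical toy data: one sample, two genes, one drug.
g = ToyGraph()
g.add_vertex("sample:S1", "Sample")
g.add_vertex("gene:G1", "Gene")
g.add_vertex("gene:G2", "Gene")
g.add_vertex("drug:D1", "Drug")
g.add_edge("sample:S1", "has_mutation_in", "gene:G1")
g.add_edge("gene:G1", "associated_with_drug", "drug:D1")
g.add_edge("gene:G2", "associated_with_drug", "drug:D1")

print(drugs_for_sample(g, "sample:S1"))
```

The traversal starts from sample-level data and ends in a reference-knowledge vertex, which is the kind of cross-data set query the abstract describes, reduced here to two hops.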

2019
Author(s): Adam Struck, Brian Walsh, Alexander Buchanan, Jordan A. Lee, Ryan Spangler, ...

Abstract. The analysis of cancer biology data involves extremely heterogeneous datasets, including information from RNA sequencing, genome-wide copy number, DNA methylation data reporting on epigenomic regulation, somatic mutations from whole-exome or whole-genome analyses, pathology estimates from imaging sections or subtyping, drug response or other treatment outcomes, and various other clinical and phenotypic measurements. Bringing these different resources into a common framework, with a data model that allows for complex relationships as well as dense vectors of features, will unlock integrative analysis. We introduce a graph database and query engine for discovery and analysis of cancer biology, called the BioMedical Evidence Graph (BMEG). The BMEG differs from other biological data graphs in that sample-level molecular information is connected to reference knowledge bases. It combines gene expression and mutation data with drug-response experiments, pathway information databases, and literature-derived associations. The construction of the BMEG has resulted in a graph containing over 36M vertices and 29M edges. The BMEG system provides a graph query-based API to enable analysis, with client code available for Python, JavaScript, and R, and a server online at bmeg.io. Using this system, we have developed several forms of integrated analysis to demonstrate the utility of the system. The BMEG is an evolving resource dedicated to enabling integrative analysis. We have demonstrated queries on the system that illustrate mutation significance analysis, drug-response machine learning, patient-level knowledge-base queries, and pathway-level analysis. We have compared the resulting graph to other available integrated graph systems and demonstrated that it is unique in the scale of the graph and the type of data it makes available.

Highlights: Data resource connecting an extremely diverse set of cancer data sets; graph query engine that can be easily deployed and used on new datasets; easily installed Python client; server online at bmeg.io.

Summary: The analysis of cancer biology data involves extremely heterogeneous datasets including information. Bringing these different resources into a common framework, with a data model that allows for complex relationships as well as dense vectors of features, will unlock integrative analysis. We introduce a graph database and query engine for discovery and analysis of cancer biology, called the BioMedical Evidence Graph (BMEG). The construction of the BMEG has resulted in a graph containing over 36M vertices and 29M edges. The BMEG system provides a graph query-based API to enable analysis, with client code available for Python, JavaScript, and R, and a server online at bmeg.io. Using this system, we have developed several forms of integrated analysis to demonstrate the utility of the system.


1996, Vol 35 (01), pp. 41-51
Author(s): F. Molino, D. Furia, F. Bar, S. Battista, N. Cappello, ...

Abstract. The study reported in this paper is aimed at evaluating the effectiveness of a knowledge-based expert system (ICTERUS) in diagnosing jaundiced patients, compared with a statistical system based on probabilistic concepts (TRIAL). The performances of both systems have been evaluated using the same set of data in the same number of patients. Both systems are spin-off products of the European project Euricterus, an EC-COMACBME Project designed to document the occurrence and diagnostic value of clinical findings in the clinical presentation of jaundice in Europe, and have been developed as decision-making tools for the identification of the cause of jaundice based only on clinical information and routine investigations. Two groups of jaundiced patients were studied, including 500 (retrospective sample) and 100 (prospective sample) subjects, respectively. All patients were independently submitted to both decision-support tools. The input of both systems was the data set agreed within the Euricterus Project. The performances of both systems were evaluated with respect to the reference diagnoses provided by experts on the basis of the full clinical documentation. Results indicate that both systems are clinically reliable, although the diagnostic prediction provided by the knowledge-based approach is slightly better.


Genes, 2020, Vol 12 (1), pp. 25
Author(s): He-Gang Chen, Xiong-Hui Zhou

Drug repurposing/repositioning, which aims to find novel indications for existing drugs, helps reduce the time and cost of drug development. Over the past decade, gene expression profiles of drug-stimulated samples have been successfully used in drug repurposing. However, most existing methods neglect gene modules and the interactions among them, although cross-talk among pathways is common in drug response. It is essential to develop a method that uses this cross-talk information to predict reliable candidate associations. In this study, we developed MNBDR (Module Network Based Drug Repositioning), a novel method that screens drugs based on a module network. It integrates protein–protein interactions and human gene expression profiles to predict drug candidates for diseases. Specifically, MNBDR mines dense modules in the protein–protein interaction (PPI) network and constructs a module network to reveal cross-talk among modules. Then, using the module network together with existing gene expression data sets of drug-stimulated samples and disease samples, we used random walk algorithms to capture the modules essential to disease development and proposed a new indicator to screen potential drugs for a given disease. Results showed that MNBDR could provide better performance than popular methods. Moreover, functional analysis of the essential modules in the network indicated that our method could reveal biological mechanisms in drug response.
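The abstract does not say which random walk variant MNBDR uses, so the following is only a generic sketch of one common choice, random walk with restart, on a small network. The function name, the restart probability of 0.3, and the four-node example network are assumptions made for illustration and are not taken from the paper.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-12, max_iter=10000):
    """Score nodes by their steady-state visit probability.

    adjacency: square list of lists with non-negative edge weights.
    seeds: list of non-negative seed weights (at least one must be positive).
    """
    n = len(adjacency)
    # Column-normalise so each column is a probability distribution over neighbours.
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    W = [[adjacency[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
         for i in range(n)]
    total = sum(seeds)
    p0 = [s / total for s in seeds]
    p = list(p0)
    for _ in range(max_iter):
        # One step: walk along an edge with probability (1 - restart),
        # or jump back to the seed distribution with probability restart.
        p_next = [(1 - restart) * sum(W[i][j] * p[j] for j in range(n)) + restart * p0[i]
                  for i in range(n)]
        if sum(abs(a - b) for a, b in zip(p_next, p)) < tol:
            return p_next
        p = p_next
    return p


# Hypothetical module network: a path 0 - 1 - 2 - 3, with module 0 as the seed.
path_network = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = random_walk_with_restart(path_network, seeds=[1, 0, 0, 0])
print(scores)
```

In this toy network the scores fall off with distance from the seed, which is the property that lets a walk of this kind rank modules by their proximity to disease-related seeds.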


AI Magazine, 2015, Vol 36 (1), pp. 75-86
Author(s): Jennifer Sleeman, Tim Finin, Anupam Joshi

We describe an approach for identifying fine-grained entity types in heterogeneous data graphs that is effective for unstructured data or when the underlying ontologies or semantic schemas are unknown. Identifying fine-grained entity types, rather than a few high-level types, supports coreference resolution in heterogeneous graphs by reducing the number of possible coreference relations that must be considered. Big data problems that involve integrating data from multiple sources can benefit from our approach when the data's ontologies are unknown, inaccessible, or semantically trivial. For such cases, we use supervised machine learning to map entity attributes and relations to a known set of attributes and relations from appropriate background knowledge bases to predict instance entity types. We evaluated this approach in experiments on data from DBpedia, Freebase, and Arnetminer using DBpedia as the background knowledge base.


2018, Vol 14 (4)
Author(s): Omkar Singh, Ramesh Kumar Sunkaria

Abstract Background This article proposes an extension of the empirical wavelet transform (EWT) algorithm to multivariate signals, applied specifically to cardiovascular physiological signals. Materials and methods EWT is a newly proposed algorithm for extracting the modes in a signal and is based on the design of an adaptive wavelet filter bank. The proposed algorithm finds an optimum signal in the multivariate data set based on a mode estimation strategy; its corresponding spectrum is then segmented and used to extract the modes across all the channels of the data set. Results The proposed algorithm is able to find the common oscillatory modes within the multivariate data and can be applied to multichannel heterogeneous data with unequal numbers of samples in different channels. The proposed algorithm was validated on different synthetic multivariate data and on a real physiological trivariate data series of electrocardiogram, respiration, and blood pressure. Conclusions In this article, the EWT is extended to multivariate signals, and it is demonstrated that the component-wise processing of multivariate data leads to the alignment of common oscillating modes across the components.


2020
Author(s): Alexander E. Zarebski, Louis du Plessis, Kris V. Parag, Oliver G. Pybus

Inferring the dynamics of pathogen transmission during an outbreak is an important problem in both infectious disease epidemiology and phylodynamics. In mathematical epidemiology, estimates are often informed by time series of infected cases, while in phylodynamics, genetic sequences sampled through time are the primary data source. Each data type provides different, and potentially complementary, insights into transmission. However, inference methods are typically highly specialised and field-specific. Recent studies have recognised the benefits of combining data sources, which include improved estimates of the transmission rate and number of infected individuals. However, the methods they employ are either computationally prohibitive or require intensive simulation, limiting their real-time utility. We present a novel birth-death phylogenetic model, called TimTam, which can be informed by both phylogenetic and epidemiological data. Moreover, we derive a tractable analytic approximation of the TimTam likelihood, the computational complexity of which is linear in the size of the data set. Using TimTam, we show how key parameters of transmission dynamics and the number of unreported infections can be estimated accurately using these heterogeneous data sources. The approximate likelihood facilitates inference on large data sets, an important consideration as such data become increasingly common due to improving sequencing capability.


Author(s): Mariana Damova, Atanas Kiryakov, Maurice Grinberg, Michael K. Bergman, Frédérick Giasson, ...

The chapter introduces the process of designing two upper-level ontologies (PROTON and UMBEL) into reference ontologies and their integration in the so-called Reference Knowledge Stack (RKS). It is argued that RKS is an important step in the efforts of the Linked Open Data (LOD) project to transform the Web into a global data space with diverse real data, available for review and analysis. RKS is intended to make the interoperability between published datasets much more efficient than it is now. The approach discussed in the chapter consists of developing reference layers of upper-level ontologies by mapping them to certain LOD schemata and assigning instance data to them so they cover a reasonable portion of the LOD datasets. The chapter presents the methods (manual and semi-automatic) used in the creation of the RKS and gives examples that illustrate its advantages for managing highly heterogeneous data and its usefulness in real-life knowledge-intensive applications.


2015, Vol 19 (12), pp. 4747-4764
Author(s): F. Alshawaf, B. Fersch, S. Hinz, H. Kunstmann, M. Mayer, ...

Abstract. Data fusion aims at integrating multiple data sources that can be redundant or complementary to produce complete, accurate information on the parameter of interest. In this work, data fusion of precipitable water vapor (PWV) estimated from remote sensing observations and data from the Weather Research and Forecasting (WRF) modeling system is applied to provide complete, high-quality grids of PWV. Our goal is to correctly infer PWV at spatially continuous, highly resolved grids from heterogeneous data sets. This is done by a geostatistical data fusion approach based on the method of fixed-rank kriging. The first data set contains absolute maps of atmospheric PWV produced by combining observations from the Global Navigation Satellite Systems (GNSS) and Interferometric Synthetic Aperture Radar (InSAR). These PWV maps have a high spatial density and millimeter accuracy; however, the data are missing in regions of low coherence (e.g., forests and vegetated areas). The PWV maps simulated by the WRF model represent the second data set. The model maps are available for wide areas, but they have a coarse spatial resolution and a still limited accuracy. The PWV maps inferred by the data fusion at any spatial resolution are of better quality than those inferred from single data sets. In addition, by using the fixed-rank kriging method, the computational burden is significantly lower than that for ordinary kriging.


2019, pp. 1-8
Author(s): Steffen Pallarz, Manuela Benary, Mario Lamping, Damian Rieke, Johannes Starlinger, ...

PURPOSE Precision oncology depends on the availability of up-to-date, comprehensive, and accurate information about associations between genetic variants and therapeutic options. Recently, a number of knowledge bases (KBs) have been developed that gather such information on the basis of expert curation of the scientific literature. We performed a quantitative and qualitative comparison of Clinical Interpretations of Variants in Cancer, OncoKB, Cancer Gene Census, Database of Curated Mutations, CGI Biomarkers (the cancer genome interpreter biomarker database), Tumor Alterations Relevant for Genomics-Driven Therapy, and the Precision Medicine Knowledge Base. METHODS We downloaded each KB and restructured its content to describe variants, genes, drugs, and gene-drug associations in a common format. We normalized gene names to Entrez Gene IDs and drug names to ChEMBL and DrugBank IDs. For the analysis of clinically relevant gene-drug associations, we obtained lists of genes affected by genetic alterations and putative drug therapies for 113 patients with cancer whose cases were presented at the Molecular Tumor Board (MTB) of the Charité Comprehensive Cancer Center. RESULTS Our analysis revealed that the KBs are largely overlapping but also that each source harbors a notable amount of unique information. Although some KBs cover more genes, others contain more data about gene-drug associations. Retrospective comparisons with findings of the Charité MTB at the gene level showed that use of multiple KBs may considerably improve retrieval results. The relative importance of a KB in terms of cancer genes was assessed in more detail by logistic regression, which revealed that all but one source had a notable impact on result quality. We confirmed these findings using a second data set obtained from an independent MTB. CONCLUSION To date, none of the existing publicly available KBs on gene-drug associations in precision oncology fully subsumes the others, but all of them exhibit specific strengths and weaknesses. Consideration of multiple KBs, therefore, is essential to obtain comprehensive results.
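Once each KB has been restructured into a common format, the overlap comparison described above reduces to set operations on normalized gene-drug pairs. The sketch below shows one way this could look. The function names, the KB names (`kb_a`, `kb_b`, `kb_c`), and the placeholder identifiers are hypothetical and do not come from the study, where genes were normalized to Entrez Gene IDs and drugs to ChEMBL and DrugBank IDs.

```python
def unique_and_shared(kbs):
    """Count, per KB, the gene-drug pairs found only there versus also elsewhere.

    kbs: dict mapping a KB name to a set of (gene_id, drug_id) tuples.
    """
    report = {}
    for name, pairs in kbs.items():
        # Union of the pairs from every other KB.
        others = set().union(*(p for n, p in kbs.items() if n != name))
        report[name] = {
            "total": len(pairs),
            "unique": len(pairs - others),
            "shared": len(pairs & others),
        }
    return report


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Placeholder data, not real associations.
kbs = {
    "kb_a": {("GENE1", "DRUG1"), ("GENE2", "DRUG2")},
    "kb_b": {("GENE1", "DRUG1"), ("GENE3", "DRUG3")},
    "kb_c": {("GENE1", "DRUG1"), ("GENE2", "DRUG2"), ("GENE4", "DRUG4")},
}

report = unique_and_shared(kbs)
print(report)
print(jaccard(kbs["kb_a"], kbs["kb_b"]))
print(kbs["kb_a"] <= kbs["kb_c"])  # subset test: is kb_a fully contained in kb_c?
```

The subset test on the last line corresponds to the question of whether one KB subsumes another, while the "unique" count shows what would be lost by dropping a source.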

