A mixture copula Bayesian network model for multimodal genomic data

2017 ◽ Vol 16 ◽ pp. 117693511770238
Author(s): Qingyang Zhang ◽ Xuan Shi

Gaussian Bayesian networks have become a widely used framework for estimating directed associations between jointly Gaussian variables, where the network structure encodes the decomposition of the multivariate normal density into local terms. However, the resulting estimates can be inaccurate when the normality assumption is moderately or severely violated, which makes these networks unsuitable for recent genomic data such as the Cancer Genome Atlas data. In the present paper, we propose a mixture copula Bayesian network model, which provides great flexibility in modeling non-Gaussian and multimodal data for causal inference. The parameters of the mixture copula functions can be efficiently estimated by a routine expectation–maximization algorithm. A heuristic search algorithm based on the Bayesian information criterion is developed to estimate the network structure, and prediction can be further improved by taking the best-scoring network out of multiple predictions from random initial values. Our method outperforms Gaussian Bayesian networks and regular copula Bayesian networks in terms of modeling flexibility and prediction accuracy, as demonstrated using a cell signaling data set. We apply the proposed methods to the Cancer Genome Atlas data to study the genetic and epigenetic pathways that underlie serous ovarian cancer.
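As a rough illustration of what a "routine" expectation-maximization fit looks like on multimodal data, the sketch below fits a two-component univariate Gaussian mixture. It is not the authors' mixture copula estimator: the Gaussian components, the function names (`em_two_gaussians`, `make_bimodal_sample`), the initialization, and the synthetic data are all assumptions chosen to keep the example self-contained.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def em_two_gaussians(data, n_iter=200):
    """Fit a two-component Gaussian mixture by EM.

    Returns (weight of component 0, [mean0, mean1], [sd0, sd1]).
    """
    # Crude deterministic start (an assumption): components at the data extremes.
    means = [min(data), max(data)]
    spread = (max(data) - min(data)) / 4.0 or 1.0
    sds = [spread, spread]
    weight = 0.5

    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to component 0.
        resp = []
        for x in data:
            p0 = weight * normal_pdf(x, means[0], sds[0])
            p1 = (1.0 - weight) * normal_pdf(x, means[1], sds[1])
            total = p0 + p1
            resp.append(p0 / total if total > 0.0 else 0.5)

        # M-step: responsibility-weighted maximum-likelihood updates.
        n0 = sum(resp)
        n1 = len(data) - n0
        weight = n0 / len(data)
        means[0] = sum(r * x for r, x in zip(resp, data)) / n0
        means[1] = sum((1.0 - r) * x for r, x in zip(resp, data)) / n1
        var0 = sum(r * (x - means[0]) ** 2 for r, x in zip(resp, data)) / n0
        var1 = sum((1.0 - r) * (x - means[1]) ** 2 for r, x in zip(resp, data)) / n1
        # Floor the standard deviations so a component cannot collapse to a point.
        sds = [max(math.sqrt(var0), 1e-6), max(math.sqrt(var1), 1e-6)]

    return weight, means, sds


def make_bimodal_sample(seed=0):
    """Synthetic bimodal data (made up for illustration, not genomic data)."""
    rng = random.Random(seed)
    low = [rng.gauss(-2.0, 0.5) for _ in range(300)]
    high = [rng.gauss(3.0, 1.0) for _ in range(700)]
    return low + high


weight, means, sds = em_two_gaussians(make_bimodal_sample())
print("mixing weight:", round(weight, 3))
print("means:", [round(m, 3) for m in means])
print("sds:", [round(s, 3) for s in sds])
```

A single Gaussian fitted to the same sample would place its mean between the two modes, which is the kind of misfit the abstract attributes to Gaussian Bayesian networks on multimodal data.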

2017
Author(s): Qingyang Zhang ◽ Xuan Shi

Abstract Gaussian Bayesian networks have become a widely used framework for estimating directed associations between jointly Gaussian variables, where the network structure encodes the decomposition of the multivariate normal density into local terms. However, the resulting estimates can be inaccurate when the normality assumption is moderately or severely violated, which makes these networks unsuitable for recent genomic data such as the Cancer Genome Atlas data. In the present paper, we propose a mixture copula Bayesian network model, which provides great flexibility in modeling non-Gaussian and multimodal data for causal inference. The parameters of the mixture copula functions can be efficiently estimated by a routine Expectation-Maximization algorithm. A heuristic search algorithm based on the Bayesian information criterion is developed to estimate the network structure, and prediction can be further improved by taking the best-scoring network out of multiple predictions from random initial values. Our method outperforms Gaussian Bayesian networks and regular copula Bayesian networks in terms of modeling flexibility and prediction accuracy, as demonstrated using a cell signaling dataset. We apply the proposed methods to the Cancer Genome Atlas data to study the genetic and epigenetic pathways that underlie serous ovarian cancer.


F1000Research ◽ 2017 ◽ Vol 6 ◽ pp. 319
Author(s): Erin K. Wagner ◽ Satyajeet Raje ◽ Liz Amos ◽ Jessica Kurata ◽ Abhijit S. Badve ◽ ...

Data sharing is critical to advancing genomic research: reusing and combining existing data reduces the demand to collect new data, and sharing promotes reproducible research. The Cancer Genome Atlas (TCGA) is a popular resource for individual-level genotype-phenotype cancer-related data. The Database of Genotypes and Phenotypes (dbGaP) contains many datasets similar to those in TCGA. We have created a software pipeline that allows researchers to discover relevant genomic data in dbGaP based on matching TCGA metadata. The result is an easy-to-use tool that connects these two data sources.


Epigenomics ◽ 2020 ◽ Vol 12 (16) ◽ pp. 1443-1456
Author(s): Yan Huang ◽ Dianshuang Zhou ◽ Yihan Wang ◽ Xingda Zhang ◽ Mu Su ◽ ...

Aim: We aim to predict transcription factor (TF) binding events from knowledge of gene expression and epigenetic modifications. Materials & methods: TF-binding events based on the ENCODE project and The Cancer Genome Atlas data were analyzed by the random forest method. Results: We showed the high performance of TF-binding predictive models in GM12878, HeLa, HepG2 and K562 cell lines and applied them to other cell lines and tissues. The genes bound by the top TFs (MAX and MAZ) were significantly associated with cancer-related processes such as cell proliferation and DNA repair. Conclusion: We successfully constructed TF-binding predictive models in cell lines and applied them in tissues.


2014 ◽ Vol 989-994 ◽ pp. 2106-2110
Author(s): Shun Ge ◽ Xue Zhi Xia

The SBN (Series Bayesian Network) model and the BN construction algorithm are introduced, and the advantages and disadvantages of constructing an SBN with MEBN (multi-entity Bayesian networks) are pointed out. To meet the demands of constructing an SBN, the expression of the Markov MFrag (MEBN Fragment) within the MEBN system is clarified, and probability-mapping pseudo-MFrags are added. An SBN construction algorithm based on the Markov-extended MEBN system is then studied, and the construction procedure is illustrated by an example.


2018
Author(s): Seo Jeong Shin ◽ Seng Chan You ◽ Yu Rang Park ◽ Jin Roh ◽ Jang-Hee Kim ◽ ...

BACKGROUND Clinical sequencing data should be shared in order to achieve the sufficient scale and diversity required to provide strong evidence for improving patient care. A distributed research network allows researchers to share this evidence rather than the patient-level data across centers, thereby avoiding privacy issues. The Observational Medical Outcomes Partnership (OMOP) common data model (CDM) used in distributed research networks has low coverage of sequencing data and does not reflect the latest trends of precision medicine. OBJECTIVE The aim of this study was to develop and evaluate the feasibility of a genomic CDM (G-CDM), as an extension of the OMOP-CDM, for application of genomic data in clinical practice. METHODS Existing genomic data models and sequencing reports were reviewed to extend the OMOP-CDM to cover genomic data. The Human Genome Organisation Gene Nomenclature Committee and Human Genome Variation Society nomenclature were adopted to standardize the terminology in the model. Sequencing data of 114 and 1060 patients with lung cancer were obtained from the Ajou University School of Medicine database of Ajou University Hospital and The Cancer Genome Atlas, respectively, which were transformed to a format appropriate for the G-CDM. The data were compared with respect to gene name, variant type, and actionable mutations. RESULTS The G-CDM was extended into four tables linked to tables of the OMOP-CDM. Upon comparison with The Cancer Genome Atlas data, a clinically actionable mutation, p.Leu858Arg, in the EGFR gene was 6.64 times more frequent in the Ajou University School of Medicine database, while the p.Gly12Xaa mutation in the KRAS gene was 2.02 times more frequent in The Cancer Genome Atlas dataset. The data-exploring tool GeneProfiler was further developed to conduct descriptive analyses automatically using the G-CDM, which provides the proportions of genes, variant types, and actionable mutations. GeneProfiler also allows for querying the specific gene name and Human Genome Variation Society nomenclature to calculate the proportion of patients with a given mutation. CONCLUSIONS We developed the G-CDM for effective integration of genomic data with standardized clinical data, allowing for data sharing across institutes. The feasibility of the G-CDM was validated by assessing the differences in data characteristics between two different genomic databases through the proposed data-exploring tool GeneProfiler. The G-CDM may facilitate analyses of interoperating clinical and genomic datasets across multiple institutions, minimizing privacy issues and enabling researchers to better understand the characteristics of patients and promote personalized medicine in clinical practice.
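The "proportion of patients with a given mutation" query and the fold differences reported above can be sketched as follows. This is not GeneProfiler and does not use the G-CDM schema: the row layout (`patient_id`, `gene`, `hgvs_p`), both function names, and the toy records are hypothetical and exist only to show the arithmetic.

```python
def mutation_proportion(variant_rows, n_patients, gene, hgvs_p):
    """Fraction of patients carrying a given gene / protein-level change.

    variant_rows: one dict per variant call (hypothetical layout).
    n_patients: cohort size, passed separately because patients without
    any reported variant never appear in variant_rows.
    """
    if n_patients <= 0:
        raise ValueError("n_patients must be positive")
    # Use a set so a patient with duplicate rows is counted once.
    carriers = {
        row["patient_id"]
        for row in variant_rows
        if row["gene"] == gene and row["hgvs_p"] == hgvs_p
    }
    return len(carriers) / n_patients


def fold_difference(proportion_a, proportion_b):
    """How many times more frequent a mutation is in cohort A than in cohort B."""
    return proportion_a / proportion_b


# Toy records, invented for illustration only.
site_a_rows = [
    {"patient_id": 1, "gene": "EGFR", "hgvs_p": "p.Leu858Arg"},
    {"patient_id": 2, "gene": "EGFR", "hgvs_p": "p.Leu858Arg"},
    {"patient_id": 3, "gene": "KRAS", "hgvs_p": "p.Gly12Cys"},
]
site_b_rows = [
    {"patient_id": 1, "gene": "EGFR", "hgvs_p": "p.Leu858Arg"},
    {"patient_id": 1, "gene": "EGFR", "hgvs_p": "p.Leu858Arg"},  # duplicate call
]

prop_a = mutation_proportion(site_a_rows, 4, "EGFR", "p.Leu858Arg")
prop_b = mutation_proportion(site_b_rows, 8, "EGFR", "p.Leu858Arg")
print(prop_a, prop_b, fold_difference(prop_a, prop_b))
```

Passing the cohort size explicitly is a design choice in this sketch: counting distinct patient identifiers in the variant rows alone would drop patients with no reported variant and inflate every proportion.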


2021
Author(s): Volkan Sevinç

Abstract Energy is one of the main concerns of humanity because energy resources are limited and costly. To reduce costs and to use energy for space heating effectively, new building materials, techniques and insulation facilities are being developed. Therefore, it is important to know which factors affect space heating costs. This study aims to introduce the novel Rank Correlation Bayesian Network model and its application in analyzing the effects of dwelling characteristics on space heating costs. The results show that the constructed Rank Correlation Bayesian Network model performed better than the Bayesian network models estimated by the Bayesian search, PC and Greedy Thick Thinning algorithms, which are structure learning algorithms that use different estimation mechanisms to build Bayesian networks. The constructed Rank Correlation Bayesian Network model shows that the space heating costs of the dwellings are mostly affected by the heating systems used. Coal stoves, air conditioners and electric stoves appear to be the costliest heating systems. The second most important factor appears to be the existence of external wall insulation. The lack of external wall insulation almost doubles the space heating costs. The third most important factor is the building age. Dwellings on the ground floors and the first floors appear to pay the highest space heating costs. Therefore, dwellings on these floors need to be more effectively insulated. As the size of the dwelling increases, the heating cost increases too. Another result is that facing directions and floor levels of the dwellings have the least effects on their space heating.
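The abstract does not spell out how the Rank Correlation Bayesian Network is built. Assuming it rests on a Spearman-type rank correlation between variables, the quantity itself can be sketched as below; the function names and the toy dwelling values are made up for illustration and say nothing about the study's data or its structure learning step.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical toy values: dwelling size (m^2) and yearly heating cost.
dwelling_size = [55, 70, 85, 100, 140]
heating_cost = [300, 340, 420, 600, 1500]
print(spearman(dwelling_size, heating_cost))
```

Because it works on ranks, the measure captures any monotone relationship, not only a linear one, which is why the strongly nonlinear toy costs above still count as perfectly associated with size.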


2021
Author(s): Aditya Lahiri ◽ Lin Zhou ◽ Ping He ◽ Aniruddha Datta

Abstract Drought is a natural hazard that affects crops by inducing water stress. Water stress, induced by drought, accounts for more loss in crop yield than all the other causes combined. With the increasing frequency and intensity of droughts worldwide, it is essential to develop drought-resistant crops to ensure food security. In this paper, we model multiple drought signaling pathways in Arabidopsis using Bayesian networks to identify potential regulators of drought-responsive reporter genes. Genetically intervening at these regulators can help develop drought-resistant crops. We create the Bayesian network model from the biological literature and determine its parameters from publicly available data. We conduct inference on this model using a stochastic simulation technique known as likelihood weighting to determine the best regulators of drought-responsive reporter genes. Our analysis reveals that activating MYC2 or inhibiting ATAF1 are the best single-node intervention strategies to regulate the drought-responsive reporter genes. Additionally, we observe that simultaneously activating MYC2 and inhibiting ATAF1 is a better strategy. The Bayesian network model indicates that MYC2 and ATAF1 are possible regulators of the drought response. Validation experiments showed that ATAF1 negatively regulates the drought response. Thus, intervening at ATAF1 has the potential to create drought-resistant crops.
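Likelihood weighting, the inference technique named above, can be sketched on a toy network. The two-node structure (regulator to reporter), the probabilities, and the function names below are hypothetical; they are not the Arabidopsis model or its parameters, and the example conditions on an observation where the paper studies interventions.

```python
import random

# Hypothetical two-node network: regulator -> reporter. Probabilities are made up.
P_REGULATOR_ON = 0.3
P_REPORTER_ON_GIVEN = {True: 0.9, False: 0.2}


def likelihood_weighting(n_samples, reporter_observed=True, seed=0):
    """Estimate P(regulator on | reporter = reporter_observed)."""
    rng = random.Random(seed)
    weight_on = 0.0
    weight_total = 0.0
    for _ in range(n_samples):
        # Non-evidence node: sample it from its prior.
        regulator_on = rng.random() < P_REGULATOR_ON
        # Evidence node: keep it fixed at the observed value and weight the
        # sample by the probability of that value given its sampled parent.
        p_reporter_on = P_REPORTER_ON_GIVEN[regulator_on]
        weight = p_reporter_on if reporter_observed else 1.0 - p_reporter_on
        weight_total += weight
        if regulator_on:
            weight_on += weight
    return weight_on / weight_total


def exact_posterior(reporter_observed=True):
    """Closed-form answer by Bayes' rule, for checking the estimate."""
    like_on = P_REPORTER_ON_GIVEN[True]
    like_off = P_REPORTER_ON_GIVEN[False]
    if not reporter_observed:
        like_on, like_off = 1.0 - like_on, 1.0 - like_off
    joint_on = P_REGULATOR_ON * like_on
    joint_off = (1.0 - P_REGULATOR_ON) * like_off
    return joint_on / (joint_on + joint_off)


print("estimate:", likelihood_weighting(20000))
print("exact:   ", exact_posterior())
```

Unlike rejection sampling, no sample is discarded: every draw is consistent with the evidence by construction, and the weights correct for the fact that the evidence node was fixed instead of sampled.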

