1455 Leveraging multi-omic negative controls for effect estimation in molecular epidemiologic studies: A simulation study

Abstract Background Exploratory null-hypothesis significance testing (e.g. GWAS, EWAS) forms the backbone of molecular epidemiology; however, methods to identify true causal signals are underdeveloped. Via plasmode simulation, I evaluate two approaches to quantitatively control for shared unmeasured confounding and recover unbiased effects using complementary epigenomes and biologically informed structural assumptions. Methods I adapt two proposed negative control-based estimators, the control outcome calibration approach (COCA) and proximal g-computation (PG), to case studies in perinatal molecular epidemiology. COCA may be employed when the maternal epigenome has no direct effect on phenotype and proxies shared unmeasured confounders; PG additionally requires suitable genetic instruments (e.g. mQTLs). Baseline covariates were extracted from 777 mother-child pairs in a birth cohort with maternal blood and fetal cord DNA methylation array data. Treatment and outcome values were simulated in 2000 bootstraps. Bootstrapped ordinary (COCA) and two-stage (PG) least squares models were fitted to estimate treatment effects and standard errors under various common settings of missing confounders (e.g. paternal data). Doubly robust machine learning estimators were also explored. Results COCA and PG performed well in simplistic data generating processes. However, in real-world cohort simulations, COCA performed acceptably only in settings with strong proxy confounders and poorly otherwise (median bias 610%; coverage 29%). PG performed slightly better. By contrast, simple covariate adjustment for maternal methylation (median bias 22%; coverage 71%) outperformed COCA, PG, and the machine learning estimators. Discussion Molecular epidemiology provides a key opportunity to leverage biological knowledge against unmeasured confounding. Negative control calibration or adjustment may help in the limited scenarios where its assumptions are fulfilled, but should be tested with suitable simulations.
Key messages Quantitative approaches for unmeasured confounding in molecular epidemiology are a critical gap. Negative control calibration or adjustment may help under limiting scenarios. Proposed estimators should be tested in simulation settings that closely mimic target data.
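The two-stage least squares idea behind the PG estimator can be illustrated with a toy simulation. The sketch below is not the author's implementation: it assumes a purely linear data-generating process with one unmeasured confounder `u`, a negative control exposure `z` and a negative control outcome `w` (both hypothetical stand-ins for the proxies named in the abstract), and made-up coefficients. Stage 1 regresses `w` on the treatment `a` and `z`; stage 2 regresses the outcome `y` on `a` and the fitted `w`.

```python
import random

def center(xs):
    m = sum(xs) / len(xs)
    return [x - m for x in xs]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ols2(y, x1, x2):
    """Slopes of y on (x1, x2) with an intercept, via centred normal equations."""
    y, x1, x2 = center(y), center(x1), center(x2)
    s11, s22, s12 = dot(x1, x1), dot(x2, x2), dot(x1, x2)
    s1y, s2y = dot(x1, y), dot(x2, y)
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

def naive_slope(y, a):
    """Unadjusted regression of y on a; biased when u is ignored."""
    y, a = center(y), center(a)
    return dot(a, y) / dot(a, a)

def proximal_2sls(y, a, z, w):
    b_a, b_z = ols2(w, a, z)                      # stage 1: W ~ A + Z
    w_hat = [b_a * ai + b_z * zi for ai, zi in zip(a, z)]
    effect_a, _ = ols2(y, a, w_hat)               # stage 2: Y ~ A + W_hat
    return effect_a

# Assumed toy data-generating process (all coefficients are illustrative).
rng = random.Random(0)
true_effect = 2.0
z, a, w, y = [], [], [], []
for _ in range(20000):
    u = rng.gauss(0, 1)                           # unmeasured confounder
    z.append(u + rng.gauss(0, 1))                 # proxy with no direct effect on y
    a_i = u + rng.gauss(0, 1)                     # treatment, confounded by u
    a.append(a_i)
    w.append(u + rng.gauss(0, 1))                 # proxy not affected by a or z
    y.append(true_effect * a_i + 3.0 * u + rng.gauss(0, 1))

naive = naive_slope(y, a)
proximal = proximal_2sls(y, a, z, w)
print(f"naive: {naive:.2f}  proximal 2SLS: {proximal:.2f}  truth: {true_effect}")
```

In this idealised setting the proxies satisfy the structural assumptions by construction, which is the "simplistic data generating process" case; the abstract reports that performance degraded in simulations built on real cohort covariates.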

Robust Machine Learning for Treatment Effects in Multilevel Observational Studies Under Cluster-level Unmeasured Confounding

10.31234/osf.io/t7vbz ◽

2020 ◽

Author(s):

Youmi Suk ◽

Hyunseung Kang

Keyword(s):

Machine Learning ◽

Observational Studies ◽

Treatment Effects ◽

Math Achievement ◽

R Package ◽

Simulation Studies ◽

Unmeasured Confounding ◽

Doubly Robust ◽

Unmeasured Confounders ◽

Cluster Level

Recently, machine learning (ML) methods have been used in causal inference to estimate treatment effects in order to reduce concerns for model mis-specification. However, many, if not all, ML methods require that all confounders are measured to consistently estimate treatment effects. In this paper, we propose a family of ML methods that estimate treatment effects in the presence of cluster-level unmeasured confounders, a type of unmeasured confounders that are shared within each cluster and are common in multilevel observational studies. We show through simulation studies that our proposed methods are consistent and doubly robust when unmeasured cluster-level confounders are present. We also examine the effect of taking an algebra course on math achievement scores from the Early Childhood Longitudinal Study, a multilevel observational educational study, using our methods. The proposed methods are available in the CURobustML R package.
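To see why a confounder shared within each cluster can be removed without measuring it, consider a minimal sketch using plain within-cluster demeaning (a fixed-effects estimator). This is a swapped-in classical technique under an assumed linear outcome model, not the ML methods proposed in the paper and not the CURobustML API; all names and coefficients are hypothetical.

```python
import random

def slope(y, x):
    """Simple least-squares slope of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def demean_by_cluster(values, clusters):
    """Subtract each cluster's mean, which removes anything constant within a cluster."""
    totals, counts = {}, {}
    for v, c in zip(values, clusters):
        totals[c] = totals.get(c, 0.0) + v
        counts[c] = counts.get(c, 0) + 1
    return [v - totals[c] / counts[c] for v, c in zip(values, clusters)]

# Assumed toy data: 300 clusters of 20 units, one unmeasured cluster-level confounder.
rng = random.Random(1)
true_effect = 2.0
clusters, treat, outcome = [], [], []
for j in range(300):
    u = rng.gauss(0, 1)                            # shared by all units in cluster j
    for _ in range(20):
        a = 1.0 if u + rng.gauss(0, 1) > 0 else 0.0  # treatment depends on u
        y = true_effect * a + 3.0 * u + rng.gauss(0, 1)
        clusters.append(j)
        treat.append(a)
        outcome.append(y)

naive = slope(outcome, treat)
within = slope(demean_by_cluster(outcome, clusters),
               demean_by_cluster(treat, clusters))
print(f"naive: {naive:.2f}  within-cluster: {within:.2f}  truth: {true_effect}")
```

Demeaning only works here because the confounder enters the outcome additively and linearly; relaxing that kind of parametric assumption is the motivation the abstract gives for ML-based estimators.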

EWAS Data Hub: a resource of DNA methylation array data and metadata

Nucleic Acids Research ◽

10.1093/nar/gkz840 ◽

2019 ◽

Vol 48 (D1) ◽

pp. D890-D895 ◽

Cited By ~ 6

Author(s):

Zhuang Xiong ◽

Mengwei Li ◽

Fei Yang ◽

Yingke Ma ◽

Jian Sang ◽

...

Keyword(s):

Dna Methylation ◽

Complex Traits ◽

Cell Types ◽

Great Promise ◽

Methylation Array ◽

Array Data ◽

The Past ◽

Comprehensive Collection ◽

Dna Methylation Array ◽

Brain Parts

Abstract Epigenome-Wide Association Study (EWAS) has become an effective strategy to explore the epigenetic basis of complex traits. Over the past decade, a large amount of epigenetic data, especially data sourced from DNA methylation arrays, has been accumulated as the result of numerous EWAS projects. We present EWAS Data Hub (https://bigd.big.ac.cn/ewas/datahub), a resource for collecting and normalizing DNA methylation array data as well as archiving associated metadata. The current release of EWAS Data Hub integrates a comprehensive collection of DNA methylation array data from 75 344 samples and employs an effective normalization method to remove batch effects among different datasets. Accordingly, taking advantage of both massive high-quality DNA methylation data and standardized metadata, EWAS Data Hub provides reference DNA methylation profiles under different contexts, involving 81 tissues/cell types (including 25 brain parts and 25 blood cell types), six ancestry categories, and 67 diseases (including 39 cancers). In summary, EWAS Data Hub bears great promise to aid the retrieval and discovery of methylation-based biomarkers for phenotype characterization, clinical treatment and health care.

Machine learning for single cell genomics data analysis

10.1101/2021.02.04.429763 ◽

2021 ◽

Author(s):

Félix Raimundo ◽

Laetitia Papaxanthos ◽

Céline Vallot ◽

Jean-Philippe Vert

Keyword(s):

Machine Learning ◽

Single Cell ◽

Network Inference ◽

Method Development ◽

Biological Knowledge ◽

Omics Data ◽

Gene Regulatory Network Inference ◽

Multimodal Data ◽

Low Dimensional ◽

Type Classification

Abstract Single-cell omics technologies produce large quantities of data describing the genomic, transcriptomic or epigenomic profiles of many individual cells in parallel. In order to infer biological knowledge and develop predictive models from these data, machine learning (ML)-based models are increasingly used due to their flexibility, scalability, and impressive success in other fields. In recent years, we have seen a surge of new ML-based method development for low-dimensional representations of single-cell omics data, batch normalization, cell type classification, trajectory inference, gene regulatory network inference or multimodal data integration. To help readers navigate this fast-moving literature, we survey in this review recent advances in ML approaches developed to analyze single-cell omics data, focusing mainly on peer-reviewed publications published in the last two years (2019-2020).

MBRS-14. INTEGRATING CLINICAL AND GENOMIC CHARACTERISTICS IN PEDIATRIC MEDULLOBLASTOMA SUBTYPES IN A SINGLE COHORT IN TAIWAN

Neuro-Oncology ◽

10.1093/neuonc/noaa222.531 ◽

2020 ◽

Vol 22 (Supplement_3) ◽

pp. iii400-iii401

Author(s):

Kuo-Sheng Wu ◽

Tai-Tong Wong

Keyword(s):

Dna Methylation ◽

Cluster Analysis ◽

Treatment Strategies ◽

Clinical Results ◽

Tumor Location ◽

Molecular Subgroups ◽

Methylation Array ◽

Metastatic Rate ◽

Pediatric Medulloblastoma ◽

Dna Methylation Array

Abstract BACKGROUND Medulloblastoma (MB) has been classified into 4 molecular subgroups, WNT, SHH, group 3 (G3), and group 4 (G4), with demographic and clinical differences. In 2017, heterogeneity within MB was proposed, with 12 subtypes with distinct molecular and clinical characteristics. PATIENTS AND METHODS We retrieved 52 MBs in children to perform RNA-Seq and DNA methylation array analysis. Subtype cluster analysis was performed by the similarity network fusion (SNF) method. With clinical results and molecular profiles, the characteristics including age, gender, histological variants, tumor location, metastasis status, survival, and cytogenetic and genetic aberrations among MB subtypes were identified. RESULTS In this cohort series, 52 childhood MBs were classified into 11 subtypes by SNF cluster analysis. WNT tumors showed no metastasis and a 100% survival rate. All WNT tumors were located on the midline in the 4th ventricle. Monosomy 6 was present in the WNT α subtype but not in the β subtype. SHH α and β occurred in children, while SHH γ occurred in infants. Among SHH tumors, the α subtype showed the worst outcome. G3 γ showed the highest metastatic rate and worst survival, associated with MYC amplification. G4 α had the highest metastatic rate; however, G4 γ showed the worst survival. CONCLUSION We identified molecular subgroups and subtypes of MBs based on gene expression and DNA methylation profiles in children in our cohort series. The results may contribute to the establishment of nation-wide correlated optimal diagnosis and treatment strategies for MBs in infants and children.

Biomarker discovery studies for patient stratification using machine learning analysis of omics data: a scoping review

BMJ Open ◽

10.1136/bmjopen-2021-053674 ◽

2021 ◽

Vol 11 (12) ◽

pp. e053674

Author(s):

Enrico Glaab ◽

Armin Rauschenberger ◽

Rita Banzi ◽

Chiara Gerardi ◽

Paula Garcia ◽

...

Keyword(s):

Machine Learning ◽

Scoping Review ◽

Biomarker Discovery ◽

Biomedical Literature ◽

Molecular Signature ◽

Biological Knowledge ◽

Measurement Technology ◽

Omics Data ◽

Patient Stratification ◽

Complex Disorders

Objective To review biomarker discovery studies using omics data for patient stratification which led to clinically validated FDA-cleared tests or laboratory developed tests, in order to identify common characteristics and derive recommendations for future biomarker projects. Design Scoping review. Methods We searched PubMed, EMBASE and Web of Science to obtain a comprehensive list of articles from the biomedical literature published between January 2000 and July 2021, describing clinically validated biomarker signatures for patient stratification, derived using statistical learning approaches. All documents were screened to retain only peer-reviewed research articles, review articles or opinion articles, covering supervised and unsupervised machine learning applications for omics-based patient stratification. Two reviewers independently confirmed the eligibility. Disagreements were solved by consensus. We focused the final analysis on omics-based biomarkers which achieved the highest level of validation, that is, clinical approval of the developed molecular signature as a laboratory developed test or FDA approved tests. Results Overall, 352 articles fulfilled the eligibility criteria. The analysis of validated biomarker signatures identified multiple common methodological and practical features that may explain the successful test development and guide future biomarker projects. These include study design choices to ensure sufficient statistical power for model building and external testing, suitable combinations of non-targeted and targeted measurement technologies, the integration of prior biological knowledge, strict filtering and inclusion/exclusion criteria, and the adequacy of statistical and machine learning methods for discovery and validation. Conclusions While most clinically validated biomarker models derived from omics data have been developed for personalised oncology, first applications for non-cancer diseases show the potential of multivariate omics biomarker design for other complex disorders. Distinctive characteristics of prior success stories, such as early filtering and robust discovery approaches, continuous improvements in assay design and experimental measurement technology, and rigorous multicohort validation approaches, enable the derivation of specific recommendations for future studies.

Model-Based Clustering of DNA Methylation Array Data

Translational Bioinformatics - Computational and Statistical Epigenomics ◽

10.1007/978-94-017-9927-0_5 ◽

2015 ◽

pp. 91-123

Author(s):

Devin C. Koestler ◽

E. Andrés Houseman

Keyword(s):

Dna Methylation ◽

Methylation Array ◽

Array Data ◽

Model Based Clustering ◽

Model Based ◽

Dna Methylation Array

PyHIST: A Histological Image Segmentation Tool

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008349 ◽

2020 ◽

Vol 16 (10) ◽

pp. e1008349

Author(s):

Manuel Muñoz-Aguirre ◽

Vasilis F. Ntasis ◽

Santiago Rojas ◽

Roderic Guigó

Keyword(s):

Machine Learning ◽

Input Image ◽

Biological Knowledge ◽

Imaging Data ◽

Tissue Segmentation ◽

Histological Image ◽

Command Line Tool ◽

Machine Learning Applications ◽

Histopathological Images ◽

High Resolution Images

The development of increasingly sophisticated methods to acquire high-resolution images has led to the generation of large collections of biomedical imaging data, including images of tissues and organs. Many of the current machine learning methods that aim to extract biological knowledge from histopathological images require several data preprocessing stages, creating an overhead before the proper analysis. Here we present PyHIST (https://github.com/manuel-munoz-aguirre/PyHIST), an easy-to-use, open source whole slide histological image tissue segmentation and preprocessing command-line tool aimed at tile generation for machine learning applications. From a given input image, the PyHIST pipeline i) optionally rescales the image to a different resolution, ii) produces a mask for the input image which separates the background from the tissue, and iii) generates individual image tiles with tissue content.
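The three pipeline stages listed in the abstract (rescale, mask, tile) can be sketched on a synthetic grayscale image. This is only an illustration of the data flow: the function names are hypothetical, and the fixed intensity threshold is an assumed stand-in, since the abstract does not say how PyHIST separates tissue from background.

```python
def rescale(image, factor):
    """i) Downsample by keeping every `factor`-th pixel in each direction."""
    return [row[::factor] for row in image[::factor]]

def tissue_mask(image, threshold=200):
    """ii) Mark pixels darker than `threshold` as tissue (assumed rule)."""
    return [[px < threshold for px in row] for row in image]

def tissue_tiles(image, mask, tile, min_fraction=0.5):
    """iii) Return (top, left, pixels) for tiles with enough tissue content."""
    tiles = []
    for top in range(0, len(image) - tile + 1, tile):
        for left in range(0, len(image[0]) - tile + 1, tile):
            covered = sum(mask[r][c]
                          for r in range(top, top + tile)
                          for c in range(left, left + tile))
            if covered / (tile * tile) >= min_fraction:
                pixels = [row[left:left + tile] for row in image[top:top + tile]]
                tiles.append((top, left, pixels))
    return tiles

# Synthetic 128x128 slide: white background with a dark 64x64 block of "tissue".
slide = [[60 if r < 64 and c < 64 else 255 for c in range(128)]
         for r in range(128)]

small = rescale(slide, 2)          # 64x64 image, tissue block now 32x32
mask = tissue_mask(small)
tiles = tissue_tiles(small, mask, tile=16)
positions = [(top, left) for top, left, _ in tiles]
print(len(tiles), positions)
```

Only the tiles overlapping the dark block are kept, so downstream machine learning code receives tissue-containing tiles rather than empty background.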

ReVac: a reverse vaccinology computational pipeline for prioritization of prokaryotic protein vaccine candidates

BMC Genomics ◽

10.1186/s12864-019-6195-y ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 1

Author(s):

Adonis D’Mello ◽

Christian P. Ahearn ◽

Timothy F. Murphy ◽

Hervé Tettelin

Keyword(s):

Machine Learning ◽

Experimental Testing ◽

Negative Control ◽

Reverse Vaccinology ◽

Learning Approaches ◽

Protein Vaccine ◽

Vaccine Candidates ◽

Variable Expression ◽

Prediction Tools ◽

Potential Vaccine

Abstract Background Reverse vaccinology accelerates the discovery of potential vaccine candidates (PVCs) prior to experimental validation. Current programs typically use one bacterial proteome to identify PVCs through a filtering architecture using feature prediction programs or a machine learning approach. Filtering approaches may eliminate potential antigens based on limitations in the accuracy of the prediction tools used. Machine learning approaches are heavily dependent on the selection of training datasets with experimentally validated antigens (positive control) and non-protective antigens (negative control). The use of one or a few bacterial proteomes does not assess PVC conservation among strains, an important feature of vaccine antigens. Results We present ReVac, which implements both a panoply of feature prediction programs without filtering out proteins, and scoring of candidates based on predictions made on curated positive and negative control PVC datasets. ReVac surveys several genomes, assessing protein conservation as well as DNA and protein repeats, which may result in variable expression of PVCs. ReVac’s orthologous clustering of conserved genes identifies core and dispensable genome components. This is useful for determining the degree of conservation of PVCs among the population of isolates for a given pathogen. Potential vaccine candidates are then prioritized based on conservation and overall feature-based scoring. We present the application of ReVac to 69 Moraxella catarrhalis and 270 non-typeable Haemophilus influenzae genomes, prioritizing 64 and 29 proteins as PVCs, respectively. Conclusion ReVac’s use of a scoring scheme ranks PVCs for subsequent experimental testing. It employs a redundancy-based approach in its predictions of features using several prediction tools. The protein’s features are collated, and each protein is ranked based on the scoring scheme. Multi-genome analyses performed in ReVac allow for a comprehensive overview of PVCs from a pan-genome perspective, as an essential pre-requisite for any bacterial subunit vaccine design. ReVac prioritized PVCs of two human respiratory pathogens, identifying both novel and previously validated PVCs.
