A novel machine learning framework for automated biomedical relation extraction from large-scale literature repositories

Abstract: Access to large-scale genomics datasets has increased the utility of hypothesis-free genome-wide analyses that result in candidate lists of genes. Often these analyses highlight several gene signals that might contribute to pathogenesis but are insufficiently powered to reach experiment-wide significance. This often triggers a process of laborious evaluation of highly-ranked genes through manual inspection of various public knowledge resources to triage those considered sufficiently interesting for deeper investigation. Here, we introduce a novel multi-dimensional, multi-step machine learning framework to objectively and more holistically assess biological relevance of genes to disease studies, by relying on a plethora of gene-associated annotations. We developed mantis-ml to serve as an automated machine learning (AutoML) framework, following a stochastic semi-supervised learning approach to rank known and novel disease-associated genes through iterative training and prediction sessions of random balanced datasets across the protein-coding exome (n=18,626 genes). We applied this framework on a range of disease-specific areas and as a generic disease likelihood estimator, achieving an average Area Under Curve (AUC) prediction performance of 0.85. Critically, to demonstrate applied utility on exome-wide association studies, we overlapped mantis-ml disease-specific predictions with data from published cohort-level association studies. We retrieved statistically significant enrichment of high mantis-ml predictions among the top-ranked genes from hypothesis-free cohort-level statistics (p<0.05), suggesting the capture of true prioritisation signals. We believe that mantis-ml is a novel easy-to-use tool to support objectively triaging gene discovery and overall enhancing our understanding of complex genotype-phenotype associations.
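The "iterative training and prediction sessions of random balanced datasets" described in this abstract can be illustrated with a toy sketch. This is not the mantis-ml implementation: the function name `rank_genes`, the single numeric feature per gene, and the nearest-centre scoring rule are all assumptions made for illustration, and unlike mantis-ml the sketch ranks only the unlabeled genes.

```python
import random
from collections import defaultdict
from statistics import mean


def rank_genes(features, known, n_iterations=50, seed=0):
    """Rank unlabeled genes by their mean score over random balanced draws.

    features: dict mapping gene name -> one numeric feature (a toy stand-in
              for the many annotations a real pipeline would use).
    known:    set of genes treated as positive (disease-associated) labels.
    """
    rng = random.Random(seed)
    positives = sorted(known)
    unlabeled = sorted(g for g in features if g not in known)
    scores = defaultdict(list)

    for _ in range(n_iterations):
        # Balanced draw: as many unlabeled genes as there are positives.
        negatives = rng.sample(unlabeled, k=len(positives))
        pos_centre = mean(features[g] for g in positives)
        neg_centre = mean(features[g] for g in negatives)
        training = set(positives) | set(negatives)
        for gene, value in features.items():
            if gene in training:
                continue  # score only genes left out of this draw
            d_pos = abs(value - pos_centre)
            d_neg = abs(value - neg_centre)
            total = d_pos + d_neg
            # Closer to the positive centre gives a score nearer to 1.
            scores[gene].append(0.5 if total == 0 else d_neg / total)

    # Highest mean score first.
    return sorted(((mean(v), g) for g, v in scores.items()), reverse=True)


# Toy data: gene "g<i>" has feature value i; the three highest are "known".
toy_features = {f"g{i}": i for i in range(20)}
ranking = rank_genes(toy_features, known={"g17", "g18", "g19"})
```

Averaging over many random draws is what makes the ranking stable: any single balanced draw uses only a small, arbitrary subset of the unlabeled genes as negatives.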


Unifying machine learning and quantum chemistry with a deep neural network for molecular wavefunctions

Nature Communications ◽

10.1038/s41467-019-12875-2 ◽

2019 ◽

Vol 10 (1) ◽

Cited By ~ 64

Author(s):

K. T. Schütt ◽

M. Gastegger ◽

A. Tkatchenko ◽

K.-R. Müller ◽

R. J. Maurer

Keyword(s):

Machine Learning ◽

Quantum Chemistry ◽

Degrees Of Freedom ◽

Large Scale ◽

Materials Science ◽

Chemical Space ◽

Chemical Properties ◽

Molecular Structures ◽

Learning Framework ◽

Molecular Wavefunctions

Abstract: Machine learning advances chemistry and materials science by enabling large-scale exploration of chemical space based on quantum chemical calculations. While these models supply fast and accurate predictions of atomistic chemical properties, they do not explicitly capture the electronic degrees of freedom of a molecule, which limits their applicability for reactive chemistry and chemical analysis. Here we present a deep learning framework for the prediction of the quantum mechanical wavefunction in a local basis of atomic orbitals from which all other ground-state properties can be derived. This approach retains full access to the electronic structure via the wavefunction at force-field-like efficiency and captures quantum mechanics in an analytically differentiable representation. On several examples, we demonstrate that this opens promising avenues to perform inverse design of molecular structures for targeting electronic property optimisation and a clear path towards increased synergy of machine learning and quantum chemistry.


StackHCV: a web-based integrative machine-learning framework for large-scale identification of hepatitis C virus NS5B inhibitors

Journal of Computer-Aided Molecular Design ◽

10.1007/s10822-021-00418-1 ◽

2021 ◽

Author(s):

Aijaz Ahmad Malik ◽

Warot Chotpatiwetchkul ◽

Chuleeporn Phanus-umporn ◽

Chanin Nantasenamat ◽

Phasit Charoenkwan ◽

...

Keyword(s):

Machine Learning ◽

Hepatitis C Virus ◽

Hepatitis C ◽

Large Scale ◽

Web Based ◽

Learning Framework


Abstract 1898: Accurate modeling of antigen processing and MHC peptide presentation using large-scale immunopeptidomes and a novel machine learning framework

10.1158/1538-7445.am2021-1898 ◽

2021 ◽

Author(s):

Rachel Marty Pyke ◽

Dattatreya Mellacheruvu ◽

Steven Dea ◽

Charles Abbott ◽

Nick Phillips ◽

...

Keyword(s):

Machine Learning ◽

Antigen Processing ◽

Large Scale ◽

Learning Framework ◽

Peptide Presentation


BioRel: towards large-scale biomedical relation extraction

BMC Bioinformatics ◽

10.1186/s12859-020-03889-5 ◽

2020 ◽

Vol 21 (S16) ◽

Author(s):

Rui Xing ◽

Jie Luo ◽

Tengwei Song

Keyword(s):

Deep Learning ◽

Large Scale ◽

Critical Role ◽

Relation Extraction ◽

Extraction Methods ◽

Statistical Machine Learning ◽

Language System ◽

Unified Medical Language System ◽

Medical Language ◽

Biomedical Relation Extraction

Abstract

Background: Although biomedical publications and literature are growing rapidly, structured knowledge that can be easily processed by computer programs is still lacking. Extracting such knowledge from plain text and transforming it into structured form makes relation extraction an important problem. Datasets play a critical role in the development of relation extraction methods. However, existing relation extraction datasets in the biomedical domain are mainly human-annotated, and their scale is usually limited because annotation is labor-intensive and time-consuming.

Results: We construct BioRel, a large-scale dataset for biomedical relation extraction, by using the Unified Medical Language System as the knowledge base and Medline as the corpus. We first identify mentions of entities in Medline sentences and link them to the Unified Medical Language System with MetaMap. Then, we assign each sentence a relation label by using distant supervision. Finally, we adapt state-of-the-art deep learning and statistical machine learning methods as baseline models and conduct comprehensive experiments on the BioRel dataset.

Conclusions: Based on the extensive experimental results, we have shown that BioRel is a suitable large-scale dataset for biomedical relation extraction, which provides both reasonable baseline performance and many remaining challenges for both deep learning and statistical methods.
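The distant-supervision step mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the BioRel pipeline: it assumes entity mentions have already been linked to identifiers by an upstream tool, and the function name `label_sentences`, the `"NA"` label, the entity identifiers and the relation name are all made up for the example.

```python
def label_sentences(sentences, knowledge_base, no_relation="NA"):
    """Assign each sentence the relation its entity pair has in the knowledge base.

    sentences:      iterable of (text, head_entity_id, tail_entity_id); the ids
                    are assumed to come from an upstream entity linker.
    knowledge_base: dict mapping (head_entity_id, tail_entity_id) -> relation name.
    """
    labelled = []
    for text, head, tail in sentences:
        # Distant supervision: if the pair is related in the knowledge base,
        # assume the sentence expresses that relation; otherwise mark it as
        # having no known relation.
        relation = knowledge_base.get((head, tail), no_relation)
        labelled.append((text, head, tail, relation))
    return labelled


# Toy knowledge base with hypothetical identifiers and one hypothetical relation.
kb = {("E_DRUG_A", "E_SYMPTOM_B"): "may_treat"}
examples = [
    ("Drug A reduced symptom B in most patients.", "E_DRUG_A", "E_SYMPTOM_B"),
    ("Drug A and symptom C were recorded separately.", "E_DRUG_A", "E_SYMPTOM_C"),
]
labelled = label_sentences(examples, kb)
```

Labels produced this way are noisy, because a sentence can mention two related entities without stating their relation; that trade-off is what lets the dataset scale beyond manual annotation.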


KDML: a machine-learning framework for inference of multi-scale gene functions from genetic perturbation screens

10.1101/761106 ◽

2019 ◽

Cited By ~ 1

Author(s):

Heba Z. Sailem ◽

Jens Rittscher ◽

Lucas Pelkmans

Keyword(s):

Colorectal Cancer ◽

Machine Learning ◽

Large Scale ◽

Ad Hoc ◽

Olfactory Receptors ◽

Functional Enrichment ◽

Learning Framework ◽

Gene Functions ◽

Health And Disease ◽

Colorectal Cancer Patients

Abstract: Characterising context-dependent gene functions is crucial for understanding the genetic bases of health and disease. To date, inference of gene functions from large-scale genetic perturbation screens is based on ad-hoc analysis pipelines involving unsupervised clustering and functional enrichment. We present Knowledge-Driven Machine Learning (KDML), a framework that systematically predicts multiple functions for a given gene based on the similarity of its perturbation phenotype to those with known function. As proof of concept, we test KDML on three datasets describing phenotypes at the molecular, cellular and population levels, and show that it outperforms traditional analysis pipelines. In particular, KDML identified an abnormal multicellular organisation phenotype associated with the depletion of olfactory receptors and TGFβ and WNT signalling genes in colorectal cancer cells. We validate these predictions in colorectal cancer patients and show that olfactory receptor expression is predictive of worse patient outcome. These results highlight KDML as a systematic framework for discovering novel scale-crossing and clinically relevant gene functions. KDML is highly generalizable and applicable to various large-scale genetic perturbation screens.


Machine Learning Based Teaching Quality Evaluation

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.271-273.1451 ◽

2011 ◽

Vol 271-273 ◽

pp. 1451-1454

Author(s):

Gang Zhang ◽

Jian Yin ◽

Liang Lun Cheng ◽

Chun Ru Wang

Keyword(s):

Machine Learning ◽

Large Scale ◽

Quality Evaluation ◽

College Teaching ◽

Real Data ◽

Teaching Quality ◽

Data Sets ◽

Stable Model ◽

Learning Framework ◽

Artificial Neural Network (ANN)

Teaching quality is a key metric in evaluating the effectiveness and ability of college teaching. In much of the previous literature, evaluation of this metric depends solely on the subjective judgment of a few experts based on their experience, which leads to false, biased or unstable results. Moreover, purely human evaluation is expensive and difficult to extend to large scale. With the application of information technology, much information about college teaching is recorded and stored electronically, which provides the basis for computer-aided analysis. In this paper, we perform teaching quality evaluation within a machine learning framework, focusing on learning from and modeling the electronic information associated with teaching quality, to obtain a stable model that describes the underlying principles of teaching quality. An Artificial Neural Network (ANN) is selected as the main model in this work. Experimental results on real data sets consisting of 4 subjects / 8 semesters show the effectiveness of the proposed method.


Targeting and Privacy in Mobile Advertising

Marketing Science ◽

10.1287/mksc.2020.1235 ◽

2020 ◽

Cited By ~ 1

Author(s):

Omid Rafieian ◽

Hema Yoganarasimhan

Keyword(s):

Machine Learning ◽

Large Scale ◽

Current System ◽

Contextual Information ◽

Asian Country ◽

Economic Incentives ◽

Modeling Framework ◽

Dominant Form ◽

Learning Framework ◽

Behavioral Targeting

Mobile in-app advertising is now the dominant form of digital advertising. Although these ads have excellent user-tracking properties, they have raised concerns among privacy advocates. This has resulted in an ongoing debate on the value of different types of targeting information, the incentives of ad networks to engage in behavioral targeting, and the role of regulation. To answer these questions, we propose a unified modeling framework that consists of two components—a machine learning framework for targeting and an analytical auction model for examining market outcomes under counterfactual targeting regimes. We apply our framework to large-scale data from the leading in-app ad network of an Asian country. We find that an efficient targeting policy based on our machine learning framework improves the average click-through rate by 66.80% over the current system. These gains mainly stem from behavioral information compared with contextual information. Theoretical and empirical counterfactuals show that although total surplus grows with more granular targeting, the ad network’s revenues are nonmonotonic; that is, the most efficient targeting does not maximize ad network revenues. Rather, it is maximized when the ad network does not allow advertisers to engage in behavioral targeting. Our results suggest that ad networks may have economic incentives to preserve users’ privacy without external regulation.


Estimating nitrogen and phosphorus concentrations in streams and rivers across the contiguous United States: a machine learning framework

10.7287/peerj.preprints.27585 ◽

2019 ◽

Author(s):

Longzhu Shen ◽

Giuseppe Amatulli ◽

Tushar Sethi ◽

Peter Raymond ◽

Sami Domisch

Keyword(s):

Machine Learning ◽

Large Scale ◽

Learning Algorithm ◽

External Validation ◽

Anthropogenic Activity ◽

Spatial And Temporal Variability ◽

Nitrogen And Phosphorus ◽

Learning Framework ◽

Environmental Models ◽

Improved Accuracy

Nitrogen (N) and Phosphorus (P) are essential nutrients for life processes in water bodies but in excessive quantities, they are a significant source of aquatic pollution. Eutrophication has now become widespread due to such an imbalance, and is largely attributed to anthropogenic activity. In view of this phenomenon, we present a new dataset and statistical method for estimating and mapping elemental and compound concentrations of N and P at a resolution of 30 arc-seconds (∼1 km) for the conterminous US. The model is based on a Random Forest (RF) machine learning algorithm that was fitted with environmental variables and seasonal N and P concentration observations from 230,000 stations spanning across US stream networks. Accounting for spatial and temporal variability offers improved accuracy in the analysis of N and P cycles. The algorithm has been validated with an internal and external validation procedure that is able to explain 70-83% of the variance in the model. The dataset is ready for use as input in a variety of environmental models and analyses, and the methodological framework can be applied to large-scale studies on N and P pollution, which include water quality, species distribution and water ecology research worldwide.
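A "variance explained" figure such as the 70-83% quoted in this abstract is commonly computed as the coefficient of determination, R². The abstract does not say which statistic was used, so the sketch below rests on that assumption; the function name `r_squared` and the sample values are illustrative only.

```python
def r_squared(observed, predicted):
    """Return the coefficient of determination, 1 - SS_res / SS_tot."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("observed and predicted must be non-empty and equal in length")
    mean_observed = sum(observed) / len(observed)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: spread of the observations around their mean.
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1 - ss_res / ss_tot


# Made-up concentrations at held-out stations, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(observed, [1.0, 2.0, 3.0, 4.0])   # every prediction exact
baseline = r_squared(observed, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
```

A model that always predicts the mean of the observations scores 0 and a perfect model scores 1, so a value of 0.70 to 0.83 on held-out stations would mean the predictions remove most, but not all, of the spread in the observed concentrations.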
