Abstract 1898: Accurate modeling of antigen processing and MHC peptide presentation using large-scale immunopeptidomes and a novel machine learning framework

BackgroundAccurately identified neoantigens can be effective therapeutic agents in both adjuvant and neoadjuvant settings. A key challenge for neoantigen discovery has been the availability of accurate prediction models for MHC peptide presentation. We have shown previously that our proprietary model based on (i) large-scale, in-house mono-allelic data, (ii) custom features that model antigen processing, and (iii) advanced machine learning algorithms has strong performance. We have extended upon our work by systematically integrating large quantities of high-quality, publicly available data, implementing new modelling algorithms, and rigorously testing our models. These extensions lead to substantial improvements in performance and generalizability. Our algorithm, named Systematic HLA Epitope Ranking Pan Algorithm (SHERPA™), is integrated into the ImmunoID NeXT Platform®, our immuno-genomics and transcriptomics platform specifically designed to enable the development of immunotherapies.MethodsIn-house immunopeptidomic data was generated using stably transfected HLA-null K562 cells lines that express a single HLA allele of interest, followed by immunoprecipitation using W6/32 antibody and LC-MS/MS. Public immunopeptidomics data was downloaded from repositories such as MassIVE and processed uniformly using in-house pipelines to generate peptide lists filtered at 1% false discovery rate. Other metrics (features) were either extracted from source data or generated internally by re-processing samples utilizing the ImmunoID NeXT Platform.ResultsWe have generated large-scale and high-quality immunopeptidomics data by using approximately 60 mono-allelic cell lines that unambiguously assign peptides to their presenting alleles to create our primary models. Briefly, our primary ‘binding’ algorithm models MHC-peptide binding using peptide and binding pockets while our primary ‘presentation’ model uses additional features to model antigen processing and presentation. Both primary models have significantly higher precision across all recall values in multiple test data sets, including mono-allelic cell lines and multi-allelic tissue samples. To further improve the performance of our model, we expanded the diversity of our training set using high-quality, publicly available mono-allelic immunopeptidomics data. Furthermore, multi-allelic data was integrated by resolving peptide-to-allele mappings using our primary models. We then trained a new model using the expanded training data and a new composite machine learning architecture. The resulting secondary model further improves performance and generalizability across several tissue samples.ConclusionsImproving technologies for neoantigen discovery is critical for many therapeutic applications, including personalized neoantigen vaccines, and neoantigen-based biomarkers for immunotherapies. Our new and improved algorithm (SHERPA) has significantly higher performance compared to a state-of-the-art public algorithm and furthers this objective.

Download Full-text

Stochastic semi-supervised learning to prioritise genes from high-throughput genomic screens

10.1101/655449 ◽

2019 ◽

Author(s):

Dimitrios Vitsios ◽

Slavé Petrovski

Keyword(s):

Machine Learning ◽

Supervised Learning ◽

Large Scale ◽

Association Studies ◽

Protein Coding ◽

Knowledge Resources ◽

Learning Framework ◽

Significant Enrichment ◽

Disease Associated Genes ◽

Disease Specific

AbstractAccess to large-scale genomics datasets has increased the utility of hypothesis-free genome-wide analyses that result in candidate lists of genes. Often these analyses highlight several gene signals that might contribute to pathogenesis but are insufficiently powered to reach experiment-wide significance. This often triggers a process of laborious evaluation of highly-ranked genes through manual inspection of various public knowledge resources to triage those considered sufficiently interesting for deeper investigation. Here, we introduce a novel multi-dimensional, multi-step machine learning framework to objectively and more holistically assess biological relevance of genes to disease studies, by relying on a plethora of gene-associated annotations. We developed mantis-ml to serve as an automated machine learning (AutoML) framework, following a stochastic semi-supervised learning approach to rank known and novel disease-associated genes through iterative training and prediction sessions of random balanced datasets across the protein-coding exome (n=18,626 genes). We applied this framework on a range of disease-specific areas and as a generic disease likelihood estimator, achieving an average Area Under Curve (AUC) prediction performance of 0.85. Critically, to demonstrate applied utility on exome-wide association studies, we overlapped mantis-ml disease-specific predictions with data from published cohort-level association studies. We retrieved statistically significant enrichment of high mantis-ml predictions among the top-ranked genes from hypothesis-free cohort-level statistics (p<0.05), suggesting the capture of true prioritisation signals. We believe that mantis-ml is a novel easy-to-use tool to support objectively triaging gene discovery and overall enhancing our understanding of complex genotype-phenotype associations.

Download Full-text

Unifying machine learning and quantum chemistry with a deep neural network for molecular wavefunctions

Nature Communications ◽

10.1038/s41467-019-12875-2 ◽

2019 ◽

Vol 10 (1) ◽

Cited By ~ 64

Author(s):

K. T. Schütt ◽

M. Gastegger ◽

A. Tkatchenko ◽

K.-R. Müller ◽

R. J. Maurer

Keyword(s):

Machine Learning ◽

Quantum Chemistry ◽

Degrees Of Freedom ◽

Large Scale ◽

Materials Science ◽

Chemical Space ◽

Chemical Properties ◽

Molecular Structures ◽

Learning Framework ◽

Molecular Wavefunctions

AbstractMachine learning advances chemistry and materials science by enabling large-scale exploration of chemical space based on quantum chemical calculations. While these models supply fast and accurate predictions of atomistic chemical properties, they do not explicitly capture the electronic degrees of freedom of a molecule, which limits their applicability for reactive chemistry and chemical analysis. Here we present a deep learning framework for the prediction of the quantum mechanical wavefunction in a local basis of atomic orbitals from which all other ground-state properties can be derived. This approach retains full access to the electronic structure via the wavefunction at force-field-like efficiency and captures quantum mechanics in an analytically differentiable representation. On several examples, we demonstrate that this opens promising avenues to perform inverse design of molecular structures for targeting electronic property optimisation and a clear path towards increased synergy of machine learning and quantum chemistry.

Download Full-text

StackHCV: a web-based integrative machine-learning framework for large-scale identification of hepatitis C virus NS5B inhibitors

Journal of Computer-Aided Molecular Design ◽

10.1007/s10822-021-00418-1 ◽

2021 ◽

Author(s):

Aijaz Ahmad Malik ◽

Warot Chotpatiwetchkul ◽

Chuleeporn Phanus-umporn ◽

Chanin Nantasenamat ◽

Phasit Charoenkwan ◽

...

Keyword(s):

Machine Learning ◽

Hepatitis C Virus ◽

Hepatitis C ◽

Large Scale ◽

Web Based ◽

Learning Framework

Download Full-text

KDML: a machine-learning framework for inference of multi-scale gene functions from genetic perturbation screens

10.1101/761106 ◽

2019 ◽

Cited By ~ 1

Author(s):

Heba Z. Sailem ◽

Jens Rittscher ◽

Lucas Pelkmans

Keyword(s):

Colorectal Cancer ◽

Machine Learning ◽

Large Scale ◽

Ad Hoc ◽

Olfactory Receptors ◽

Functional Enrichment ◽

Learning Framework ◽

Gene Functions ◽

Health And Disease ◽

Colorectal Cancer Patients

AbstractCharacterising context-dependent gene functions is crucial for understanding the genetic bases of health and disease. To date, inference of gene functions from large-scale genetic perturbation screens is based on ad-hoc analysis pipelines involving unsupervised clustering and functional enrichment. We present Knowledge-Driven Machine Learning (KDML), a framework that systematically predicts multiple functions for a given gene based on the similarity of its perturbation phenotype to those with known function. As proof of concept, we test KDML on three datasets describing phenotypes at the molecular, cellular and population levels, and show that it outperforms traditional analysis pipelines. In particular, KDML identified an abnormal multicellular organisation phenotype associated with the depletion of olfactory receptors and TGFβ and WNT signalling genes in colorectal cancer cells. We validate these predictions in colorectal cancer patients and show that olfactory receptors expression is predictive of worse patient outcome. These results highlight KDML as a systematic framework for discovering novel scale-crossing and clinically relevant gene functions. KDML is highly generalizable and applicable to various large-scale genetic perturbation screens.

Download Full-text

Machine Learning Based Teaching Quality Evaluation

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.271-273.1451 ◽

2011 ◽

Vol 271-273 ◽

pp. 1451-1454

Author(s):

Gang Zhang ◽

Jian Yin ◽

Liang Lun Cheng ◽

Chun Ru Wang

Keyword(s):

Machine Learning ◽

Large Scale ◽

Quality Evaluation ◽

College Teaching ◽

Real Data ◽

Teaching Quality ◽

Data Sets ◽

Stable Model ◽

Learning Framework ◽

Artificial Neural Network Ann

Teaching quality is a key metric in college teaching effect and ability evaluation. In many previous literatures, evaluation of such metric is merely depended on subjective judgment of few experts based on their experience, which leads to some false, bias or unstable results. Moreover, pure human based evaluation is expensive that is difficult to extend to large scale. With the application of information technology, much information in college teaching is recorded and stored electronically, which founds the basic of a computer-aid analysis. In this paper, we perform teaching quality evaluation within machine learning framework, focusing on learning and modeling electronic information associated with quality of teaching, to get a stable model described the substantial principles of teaching quality. Artificial Neural Network (ANN) is selected as the main model in this work. Experiment results on real data sets consisted of 4 subjects / 8 semesters show the effectiveness of the proposed method.

Download Full-text

Precision neoantigen discovery using large-scale immunopeptidomes and composite modeling of MHC peptide presentation

10.1101/2021.04.30.442203 ◽

2021 ◽

Author(s):

Rachel Marty Pyke ◽

Datta Mellacheruvu ◽

Steven Dea ◽

Charles Abbott ◽

Simo V. Zhang ◽

...

Keyword(s):

Antigen Processing ◽

Large Scale ◽

Binding Pocket ◽

Genetic Alterations ◽

Parental Cell ◽

Clinical Applications ◽

Training Data ◽

Gradient Boosting ◽

Parental Cell Line ◽

Peptide Presentation

Major histocompatibility complex (MHC)-bound peptides that originate from tumor-specific genetic alterations, known as neoantigens, are an important class of anti-cancer therapeutic targets. Accurately predicting peptide presentation by MHC complexes is a key aspect of discovering therapeutically relevant neoantigens. Technological improvements in mass-spectrometry-based immunopeptidomics and advanced modeling techniques have vastly improved MHC presentation prediction over the past two decades. However, improvement in the sensitivity and specificity of prediction algorithms is needed for clinical applications such as the development of personalized cancer vaccines, the discovery of biomarkers for response to checkpoint blockade and the quantification of autoimmune risk in gene therapies. Toward this end, we generated allele-specific immunopeptidomics data using 25 mono-allelic cell lines and created Systematic HLA Epitope Ranking Pan Algorithm (SHERPA TM), a pan-allelic MHC-peptide algorithm for predicting MHC-peptide binding and presentation. In contrast to previously published large-scale mono-allelic data, we used an HLA-null K562 parental cell line and a stable transfection of HLA alleles to better emulate native presentation. Our dataset includes five previously unprofiled alleles that expand MHC binding pocket diversity in the training data and extend allelic coverage in under-profiled populations. To improve generalizability, SHERPA systematically integrates 128 mono-allelic and 384 multi-allelic samples with publicly available immunoproteomics data and binding assay data. Using this dataset, we developed two features that empirically estimate the propensities of genes and specific regions within gene bodies to engender immunopeptides to represent antigen processing. Using a composite model constructed with gradient boosting decision trees, multi-allelic deconvolution and 2.15 million peptides encompassing 167 alleles, we achieved a 1.71 fold improvement of positive predictive value compared to existing tools when evaluated on independent mono-allelic datasets and a 1.24 fold improvement when evaluating on tumor samples. With a high degree of accuracy, SHERPA has the potential to enable precision neoantigen discovery for future clinical applications.

Download Full-text

Targeting and Privacy in Mobile Advertising

Marketing Science ◽

10.1287/mksc.2020.1235 ◽

2020 ◽

Cited By ~ 1

Author(s):

Omid Rafieian ◽

Hema Yoganarasimhan

Keyword(s):

Machine Learning ◽

Large Scale ◽

Current System ◽

Contextual Information ◽

Asian Country ◽

Economic Incentives ◽

Modeling Framework ◽

Dominant Form ◽

Learning Framework ◽

Behavioral Targeting

Mobile in-app advertising is now the dominant form of digital advertising. Although these ads have excellent user-tracking properties, they have raised concerns among privacy advocates. This has resulted in an ongoing debate on the value of different types of targeting information, the incentives of ad networks to engage in behavioral targeting, and the role of regulation. To answer these questions, we propose a unified modeling framework that consists of two components—a machine learning framework for targeting and an analytical auction model for examining market outcomes under counterfactual targeting regimes. We apply our framework to large-scale data from the leading in-app ad network of an Asian country. We find that an efficient targeting policy based on our machine learning framework improves the average click-through rate by 66.80% over the current system. These gains mainly stem from behavioral information compared with contextual information. Theoretical and empirical counterfactuals show that although total surplus grows with more granular targeting, the ad network’s revenues are nonmonotonic; that is, the most efficient targeting does not maximize ad network revenues. Rather, it is maximized when the ad network does not allow advertisers to engage in behavioral targeting. Our results suggest that ad networks may have economic incentives to preserve users’ privacy without external regulation.

Download Full-text

A novel machine learning framework for automated biomedical relation extraction from large-scale literature repositories

Nature Machine Intelligence ◽

10.1038/s42256-020-0189-y ◽

2020 ◽

Vol 2 (6) ◽

pp. 347-355 ◽

Cited By ~ 1

Author(s):

Lixiang Hong ◽

Jinjian Lin ◽

Shuya Li ◽

Fangping Wan ◽

Hui Yang ◽

...

Keyword(s):

Machine Learning ◽

Large Scale ◽

Relation Extraction ◽

Learning Framework ◽

Biomedical Relation Extraction

Download Full-text

Estimating nitrogen and phosphorus concentrations in streams and rivers across the contiguous United States: a machine learning framework

10.7287/peerj.preprints.27585 ◽

2019 ◽

Author(s):

Longzhu Shen ◽

Giuseppe Amatulli ◽

Tushar Sethi ◽

Peter Raymond ◽

Sami Domisch

Keyword(s):

Machine Learning ◽

Large Scale ◽

Learning Algorithm ◽

External Validation ◽

Anthropogenic Activity ◽

Spatial And Temporal Variability ◽

Nitrogen And Phosphorus ◽

Learning Framework ◽

Environmental Models ◽

Improved Accuracy

Nitrogen (N) and Phosphorus (P) are essential nutrients for life processes in water bodies but in excessive quantities, they are a significant source of aquatic pollution. Eutrophication has now become widespread due to such an imbalance, and is largely attributed to anthropogenic activity. In view of this phenomenon, we present a new dataset and statistical method for estimating and mapping elemental and compound con- centrations of N and P at a resolution of 30 arc-seconds (∼1 km) for the conterminous US. The model is based on a Random Forest (RF) machine learning algorithm that was fitted with environmental variables and seasonal N and P concentration observations from 230,000 stations spanning across US stream networks. Accounting for spatial and temporal variability offers improved accuracy in the analysis of N and P cycles. The algorithm has been validated with an internal and external validation procedure that is able to explain 70-83% of the variance in the model. The dataset is ready for use as input in a variety of environmental models and analyses, and the methodological framework can be applied to large-scale studies on N and P pollution, which include water quality, species distribution and water ecology research worldwide.

Download Full-text