Predicting peptide presentation by major histocompatibility complex class I using one million peptides
Abstract
Improved computational tools are needed to prioritize putative neoantigens within immunotherapy pipelines for cancer treatment. Herein, we assemble a database of over one million human peptides presented by major histocompatibility complex class I (MHC-I), the largest known database of its type. We use these data to train a random forest classifier (ForestMHC) to predict likelihood of MHC-I presentation. The information content of features mirrors the canonical importance of positions two and nine in determining likelihood of binding. Our random forest-based method outperforms NetMHC and NetMHCpan on test sets, and it outperforms both these methods and MixMHCpred on new mass spectrometry data from an ovarian carcinoma sample. Furthermore, the random forest scores correlate monotonically with peptide binding affinities, when known. Finally, we examine the effect size of gene expression on peptide presentation and find a moderately strong relationship. The ForestMHC method is a promising modality to prioritize neoantigens for experimental testing in immunotherapy.
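
To make the remark about positions two and nine concrete, the following minimal sketch computes a per-position information content (in bits) from amino acid frequencies in a set of equal-length peptides. It is an illustration under stated assumptions, not the ForestMHC implementation: the function name `positional_information`, the Shannon-entropy measure, and the four toy 9-mers are all hypothetical choices made here, and the measure actually used in the paper may differ.

```python
import math
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def positional_information(peptides):
    """Return per-position information content (bits) for equal-length peptides.

    Information content is log2(20) minus the Shannon entropy of the
    observed residue distribution at that position. This is an assumed,
    illustrative measure, not necessarily the one used by the authors.
    """
    if not peptides:
        raise ValueError("need at least one peptide")
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("all peptides must have the same length")

    max_entropy = math.log2(len(AMINO_ACIDS))
    info = []
    for pos in range(length):
        counts = Counter(p[pos] for p in peptides)
        total = sum(counts.values())
        entropy = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
        info.append(max_entropy - entropy)
    return info


# Hypothetical toy 9-mers, made up for illustration (not from the dataset).
# Positions two and nine are held fixed; all other positions vary.
toy_peptides = ["ALAKDEGHV", "GLCPQRSTV", "SLYNTWAEV", "KLMFHIDGV"]
info = positional_information(toy_peptides)

for position, bits in enumerate(info, start=1):
    print(f"P{position}: {bits:.2f} bits")
```

In this toy input, positions two and nine carry the maximum information because their residues never vary, while the remaining positions score lower. With real data, a small sample inflates these estimates, so a large peptide set is needed before reading much into the values.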