Abstract

Drug repurposing is a valuable tool for combating the slowing rate of novel therapeutic discovery. The Computational Analysis of Novel Drug Opportunities (CANDO) platform performs shotgun repurposing of 3,733 drugs/compounds that map to 2,030 indications/diseases by predicting their interactions with 46,784 protein structures and relating them via proteomic interaction signatures. The accuracy of the CANDO platform is evaluated using our benchmarking protocol, which assesses indication accuracies based on whether pairs of drugs associated with the same indication can be captured within a certain cutoff; this serves as a measure of the drug repurposing recovery rate. To identify subsets of proteins that exhibit the same therapeutic effectiveness as the full set, groups of 8 proteins were randomly selected and subsequently benchmarked 50 times. The resulting protein sets were ranked according to average indication accuracy, pairwise accuracy, and coverage (the count of indications with non-zero accuracy). The best 50 subsets of 8 according to each metric were progressively combined into supersets after each iteration and benchmarked. These supersets yield up to a 14% improvement in benchmarking accuracy and represent a 100- to 1,000-fold reduction in the number of proteins relative to the full set. Protein supersets optimized using independent compound libraries derived from the full library were cross-tested and shown to reproduce the performance obtained using all 46,784 proteins, indicating that these reduced-size supersets are broadly applicable for characterizing drug behavior. Further analysis revealed that sets composed of proteins with more equitably diverse ligand interactions are important for describing drug behavior.
Our work elucidates the particular protein subsets and corresponding ligand interactions that play a role in computational drug repurposing, and paves the way for the use of machine learning approaches to further improve the accuracy of the CANDO platform and its repurposing potential.

Author summary

Drug repurposing is a valuable approach for ameliorating the current problems plaguing drug discovery. We introduce a novel protein subset analysis pipeline that allows us to elucidate features important for drug repurposing accuracy using the Computational Analysis of Novel Drug Opportunities (CANDO) platform. Our platform relates drugs based on the similarity of their interactions with a diverse library of proteins. We subjected all proteins in the platform to a splitting and ranking protocol that ranked protein subsets based on their benchmarking performance. Further analysis of the best-performing protein subsets revealed that the most useful proteins for describing how small-molecule compounds behave in biological systems are those predicted to interact with a structurally diverse range of ligands. We hypothesize that this is a consequence of the multitarget nature of drugs and, conversely, the implied promiscuity of proteins in biological systems. These results may be used to make drug discovery more accurate and efficient by alleviating some of its bottlenecks, bringing us one step closer to understanding how drugs behave in the context of their environments.
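
To make the splitting and ranking protocol described above more concrete, the following is a minimal Python sketch. It is not the CANDO implementation. Every function name, the root-mean-square distance between signatures, and the rule of adding the single best subset to the superset after each iteration are illustrative assumptions (one simplified reading of "progressively combined"). Only average indication accuracy is scored; pairwise accuracy and coverage are omitted for brevity.

```python
import math
import random


def signature_distance(sig_a, sig_b, proteins):
    """Root-mean-square deviation between two drug signatures, restricted to
    the given protein indices (the distance measure is an assumption)."""
    total = sum((sig_a[p] - sig_b[p]) ** 2 for p in proteins)
    return math.sqrt(total / len(proteins))


def indication_accuracy(signatures, indications, proteins, cutoff):
    """Average, over indications with at least two drugs, of the fraction of
    drugs that have another drug for the same indication among their
    `cutoff` most similar compounds."""
    n = len(signatures)
    nearest = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(
            key=lambda j: signature_distance(signatures[i], signatures[j], proteins)
        )
        nearest.append(set(others[:cutoff]))
    per_indication = []
    for drugs in indications.values():
        if len(drugs) < 2:
            continue  # a lone drug has no partner to recover
        members = set(drugs)
        hits = sum(1 for d in drugs if nearest[d] & (members - {d}))
        per_indication.append(hits / len(drugs))
    return sum(per_indication) / len(per_indication) if per_indication else 0.0


def random_subsets(n_proteins, size, rng):
    """Shuffle all protein indices and cut them into disjoint groups of
    `size`; leftover proteins that do not fill a group are dropped."""
    order = list(range(n_proteins))
    rng.shuffle(order)
    return [order[k:k + size] for k in range(0, n_proteins - size + 1, size)]


def grow_superset(signatures, indications, n_proteins,
                  size=8, iterations=50, cutoff=10, seed=0):
    """Hypothetical driver: in each iteration, split the proteins, benchmark
    every subset, add the best one to the superset, and benchmark the
    superset. Returns the superset and (size, accuracy) per iteration."""
    rng = random.Random(seed)
    superset = set()
    history = []
    for _ in range(iterations):
        subsets = random_subsets(n_proteins, size, rng)
        best = max(
            subsets,
            key=lambda s: indication_accuracy(signatures, indications, s, cutoff),
        )
        superset.update(best)
        members = sorted(superset)
        accuracy = indication_accuracy(signatures, indications, members, cutoff)
        history.append((len(members), accuracy))
    return sorted(superset), history
```

Here `signatures` is assumed to be a list of per-drug interaction score lists (one score per protein) and `indications` a mapping from an indication name to the indices of its associated drugs. The brute-force nearest-neighbour search is adequate only for toy inputs; at the scale of the full platform a vectorized or precomputed distance matrix would be needed.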