Sparse input neural networks to differentiate 32 primary cancer types based on somatic point mutations
Abstract
This paper aims to differentiate cancer types from primary tumour samples based on somatic point mutations (SPM). Identifying the primary cancer site is necessary for site-specific and potentially targeted treatment. Current methods such as histopathology and laboratory tests cannot accurately determine a cancer's origin, which results in empirical patient treatment and poor survival rates. The availability of large deoxyribonucleic acid (DNA) sequencing datasets has allowed scientists to examine the ability of SPM to classify primary cancer sites. These datasets are highly sparse, since most genes are not mutated; have a low signal-to-noise ratio; and are imbalanced, since rare cancers have fewer samples. To overcome these limitations, a sparse-input neural network (SPINN) is proposed that projects the input data into a lower-dimensional space, where the more informative genes are used for learning. To train and evaluate SPINN, an extensive dataset was collected from The Cancer Genome Atlas (TCGA), containing 7624 samples spanning 32 cancer types. Several sampling strategies were applied to balance the dataset, but none improved classifier performance except the removal of Tomek links. This is probably due to the high degree of class overlap. SPINN consistently outperformed algorithms such as extreme gradient boosting, deep neural networks and support vector machines, achieving an accuracy of up to 73% on independent test data.
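To make the Tomek-link step concrete, the following is a minimal pure-Python sketch, not the authors' implementation. A Tomek link is a pair of samples from different classes that are each other's nearest neighbour; such pairs sit in regions of class overlap. The function names, the squared Euclidean distance, and the choice to drop the member belonging to the more frequent class are all assumptions made for illustration, since the abstract does not state which variant was used.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance (equals Hamming distance on 0/1 vectors)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbour(X, i):
    """Index of the sample closest to X[i]; ties go to the lowest index."""
    others = (j for j in range(len(X)) if j != i)
    return min(others, key=lambda j: sq_dist(X[i], X[j]))


def find_tomek_links(X, y):
    """Return pairs (i, j), i < j, that are mutual nearest neighbours
    and carry different class labels."""
    nn = [nearest_neighbour(X, i) for i in range(len(X))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and y[i] != y[j]
    ]


def remove_tomek_links(X, y):
    """Drop one member of every Tomek link.

    Assumption: the member from the more frequent class is removed,
    which cleans the class boundary without shrinking rare classes.
    """
    counts = Counter(y)
    drop = set()
    for i, j in find_tomek_links(X, y):
        drop.add(i if counts[y[i]] >= counts[y[j]] else j)
    keep = [k for k in range(len(X)) if k not in drop]
    return [X[k] for k in keep], [y[k] for k in keep]
```

In this sketch each row of `X` would be one tumour sample and `y` its cancer type. The brute-force neighbour search is quadratic in the number of samples, so it is intended only to illustrate the definition on small inputs.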