Abstract
Background: Detecting same-species contamination is an important quality control step in genetic data analysis. Because few methods exist to detect and correct for it, same-species contamination is more difficult to detect than cross-species contamination. We introduce a novel machine learning algorithm that detects same-species contamination in next-generation sequencing (NGS) data using a support vector machine (SVM) model. Our approach is unique in detecting contamination from the variant calling information stored in variant call format (VCF) files for DNA or RNA. Importantly, it can differentiate between same-species contamination and mixtures of tumor and normal cells. In the first stage, a change-point detection method identifies copy number variations (CNVs) and copy number aberrations (CNAs) for filtering. Next, single nucleotide polymorphism (SNP) data are used to test for same-species contamination with the SVM model. Under the assumption that alternative allele frequencies in NGS follow a beta-binomial distribution, the deviation parameter ρ is estimated by maximum likelihood. All features of the radial basis function (RBF) kernel SVM are generated from publicly available or private training data.
Results: We demonstrate our approach in simulation experiments. The simulated datasets combine, in silico, exome sequencing data of DNA from two lymphoblastoid cell lines (NA12878 and NA10855). We generate VCF files from the variants identified in these data and then evaluate the power and false-positive rate of our approach. Our approach can detect contamination levels as low as 5% with a reasonable false-positive rate. On real data, it achieves sensitivity above 99.99% and specificity of 90.24%, even in the presence of degraded samples whose features resemble those of contaminated samples.
We provide an R software implementation of our approach.
Conclusions: Our approach addresses the gap in methods to test for same-species contamination in NGS. Because of its high sensitivity for degraded samples and tumor-normal samples, it is an important tool that can be applied within the quality control process. Additionally, the user-friendly software has the unique ability to conduct quality control directly from the VCF format.
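The maximum likelihood step mentioned above can be illustrated with a minimal sketch. It is not the authors' R implementation: it is written in Python, and it assumes one particular parametrization in which the beta-binomial mean is fixed at μ = 0.5 (heterozygous SNPs) and ρ is the overdispersion, so that α = μ(1 − ρ)/ρ and β = (1 − μ)(1 − ρ)/ρ. The function names, the golden-section search, and the toy read counts are all hypothetical; in practice the alternative-allele counts and depths would be taken from the VCF genotype fields (for example AD and DP, where present).

```python
import math


def betabinom_logpmf(k, n, alpha, beta):
    """Log-probability of k alternative reads out of n under Beta-Binomial(n, alpha, beta)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_num = (math.lgamma(k + alpha) + math.lgamma(n - k + beta)
               - math.lgamma(n + alpha + beta))
    log_den = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return log_choose + log_num - log_den


def log_likelihood(rho, alt_counts, depths, mu=0.5):
    """Log-likelihood of rho with the mean allele frequency held at mu (an assumption)."""
    scale = (1.0 - rho) / rho
    alpha, beta = mu * scale, (1.0 - mu) * scale
    return sum(betabinom_logpmf(k, n, alpha, beta)
               for k, n in zip(alt_counts, depths))


def estimate_rho(alt_counts, depths, mu=0.5, lo=1e-4, hi=0.999, tol=1e-6):
    """Maximize the log-likelihood over rho in [lo, hi] by golden-section search.

    Assumes the log-likelihood is unimodal in rho on this interval.
    """
    def f(rho):
        return log_likelihood(rho, alt_counts, depths, mu)

    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:
            # The maximum lies in [a, d]; reuse c as the new upper probe.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # The maximum lies in [c, b]; reuse d as the new lower probe.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2.0


# Toy data (hypothetical): alternative-allele read counts at heterozygous SNPs,
# each covered by 100 reads.
depths = [100] * 8
clean_alt = [48, 52, 50, 47, 53, 51, 49, 50]    # tightly clustered near 50%
mixed_alt = [30, 70, 45, 62, 25, 55, 38, 72]    # widely scattered around 50%

rho_clean = estimate_rho(clean_alt, depths)
rho_mixed = estimate_rho(mixed_alt, depths)
print(f"rho (clean) = {rho_clean:.4f}, rho (mixed) = {rho_mixed:.4f}")
```

In this toy setting the sample whose allele frequencies scatter more widely around 50% yields the larger estimate of ρ, which is the kind of signal a downstream classifier could use as a feature. How ρ is combined with other features in the SVM is not specified in the abstract and is not modeled here.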