AbstractMotivationVariant discovery is crucial in medical and clinical research, especially in the setting of personalized medicine. As such, precision in variant identification is paramount. However, variants identified by current genomic analysis pipelines contain many false positives (i.e., incorrectly called variants). These can be potentially eliminated by applying state-of-the-art filtering tools, such as the Variant Quality Score Recalibration (VQSR) or the Hard Filtering (HF), both proposed by GATK. However, these methods are very user-dependent and fail to run in some cases. We propose VEF, a variant filtering tool based on ensemble methods that overcomes the main drawbacks of VQSR and the HF. Contrary to these methods, we treat filtering as a supervised learning problem. This is possible by using for training variant call data for which the set of “true” variants is known, i.e., a gold standard exists. Hence, we can classify each variant in the training VCF file as true or false using the gold standard, and further use the annotations of each variant as features for the classification problem. Once trained, VEF can be directly applied to filter the variants contained in a given VCF file. Analysis of several ensemble methods revealed random forest as offering the best performance, and hence VEF uses a random forest for the classification task.ResultsAfter training VEF on a Whole Genome Sequencing (WGS) Human dataset of sample NA12878, we tested its performance on a WGS Human dataset of sample NA24385. For these two samples, the set of high-confident variants has been produced and made available. Results show that the proposed filtering tool VEF consistently outperforms VQSR and HF. In addition, we show that VEF generalizes well even when some features have missing values, and when the training and testing datasets differ either in coverage or in the sequencing machine that was used to generate the data. Finally, since the training needs to be performed only once, there is a significant saving in running time when compared to VQSR (50 minutes versus 4 minutes approximately for filtering the SNPs of WGS Human sample NA24385). Code and scripts available at: github.com/ChuanyiZ/vef.