Generalizable characteristics of false-positive bacterial variant calls

2021 ◽  
Vol 7 (8) ◽  
Author(s):  
Stephen J. Bush

Minimizing false positives is a critical issue when variant calling as no method is without error. It is common practice to post-process a variant-call file (VCF) using hard filter criteria intended to discriminate true-positive (TP) from false-positive (FP) calls. These are applied on the simple principle that certain characteristics are disproportionately represented among the set of FP calls and that a user-chosen threshold can maximize the number detected. To provide guidance on this issue, this study empirically characterized all false SNP and indel calls made using real Illumina sequencing data from six disparate species and 166 variant-calling pipelines (the combination of 14 read aligners with up to 13 different variant callers, plus four ‘all-in-one’ pipelines). We did not seek to optimize filter thresholds but instead to draw attention to those filters of greatest efficacy and the pipelines to which they may most usefully be applied. In this respect, this study acts as a coda to our previous benchmarking evaluation of bacterial variant callers, and provides general recommendations for effective practice. The results suggest that, of the pipelines analysed in this study, the most straightforward way of minimizing false positives would simply be to use Snippy. We also find that a disproportionate number of false calls, irrespective of the variant-calling pipeline, are located in the vicinity of indels, and highlight this as an issue for future development.
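As a rough illustration of what hard filtering a VCF involves, the following Python sketch keeps only the records that meet two thresholds. The function names, the thresholds (`min_qual`, `min_depth`), and the toy records are assumptions made for illustration; they are not values or code from the study, which explicitly did not seek to optimize thresholds.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'DP=42;AF=0.5' into a dict."""
    info = {}
    for item in info_field.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            info[key] = value
    return info


def hard_filter(vcf_lines, min_qual=30.0, min_depth=10):
    """Yield header lines plus data lines meeting both thresholds.

    The default thresholds are hypothetical placeholders, not recommendations.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # header lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]                          # column 6 of a VCF is QUAL
        depth = parse_info(fields[7]).get("DP")   # column 8 is INFO
        if qual == "." or depth is None:
            continue  # the filter cannot be evaluated, so drop the call
        if float(qual) >= min_qual and int(depth) >= min_depth:
            yield line


# Toy records (invented): one passes, one fails on QUAL, one fails on depth.
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t55.0\t.\tDP=40",
    "chr1\t200\t.\tC\tT\t12.0\t.\tDP=35",
    "chr1\t300\t.\tG\tGA\t80.0\t.\tDP=4",
]
kept = list(hard_filter(toy_vcf))
```

In practice the choice of annotation and threshold is pipeline-specific, which is the point the abstract makes about matching filters to pipelines.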

2020 ◽  
Vol 13 (1) ◽  
Author(s):  
Ali Karimnezhad ◽  
Gareth A. Palidwor ◽  
Kednapa Thavorn ◽  
David J. Stewart ◽  
Pearl A. Campbell ◽  
...  

Background: Treating cancer depends in part on identifying the mutations driving each patient’s disease. Many clinical laboratories are adopting high-throughput sequencing for assaying patients’ tumours, applying targeted panels to formalin-fixed paraffin-embedded tumour tissues to detect clinically-relevant mutations. While there have been some benchmarking and best practices studies of this scenario, much variant calling work focuses on whole-genome or whole-exome studies, with fresh or fresh-frozen tissue. Thus, definitive guidance on best choices for sequencing platforms, sequencing strategies, and variant calling for clinical variant detection is still being developed. Methods: Because ground truth for clinical specimens is rarely known, we used the well-characterized Coriell cell lines GM12878 and GM12877 to generate data. We prepared samples to mimic as closely as possible clinical biopsies, including formalin fixation and paraffin embedding. We evaluated two well-known targeted sequencing panels, Illumina’s TruSight 170 hybrid-capture panel and the amplification-based Oncomine Focus panel. Sequencing was performed on an Illumina NextSeq500 and an Ion Torrent PGM respectively. We performed multiple replicates of each assay, to test reproducibility. Finally, we applied four different freely-available somatic single-nucleotide variant (SNV) callers to the data, along with the vendor-recommended callers for each sequencing platform. Results: We did not observe major differences in variant calling success within the regions that each panel covers, but there were substantial differences between callers. All had high sensitivity for true SNVs, but numerous and non-overlapping false positives. Overriding certain default parameters to make them consistent between callers substantially reduced discrepancies, but still resulted in high false positive rates. Intersecting results from multiple replicates or from different variant callers eliminated most false positives, while maintaining sensitivity. Conclusions: Reproducibility and accuracy of targeted clinical sequencing results depend less on sequencing platform and panel than on variability between replicates and downstream bioinformatics. Differences in variant callers’ default parameters are a greater influence on algorithm disagreement than other differences between the algorithms. Contrary to typical clinical practice, we recommend employing multiple variant calling pipelines and/or analyzing replicate samples, as this greatly decreases false positive calls.
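The intersection strategy recommended above can be sketched in a few lines of Python. This is a minimal illustration, assuming each call set has already been reduced to records with `chrom`, `pos`, `ref`, and `alt` fields and that variants are represented identically across call sets; the field names, the caller labels, and the toy calls are hypothetical.

```python
def variant_keys(calls):
    """Reduce calls to hashable (chrom, pos, ref, alt) keys."""
    return {(c["chrom"], c["pos"], c["ref"], c["alt"]) for c in calls}


def intersect_call_sets(*call_sets):
    """Return the variants present in every supplied call set."""
    key_sets = [variant_keys(calls) for calls in call_sets]
    if not key_sets:
        return set()
    return set.intersection(*key_sets)


# Toy call sets from two hypothetical callers (or two replicates).
caller_a = [
    {"chrom": "chr7", "pos": 1000, "ref": "A", "alt": "T"},
    {"chrom": "chr7", "pos": 2000, "ref": "G", "alt": "C"},  # only in A
]
caller_b = [
    {"chrom": "chr7", "pos": 1000, "ref": "A", "alt": "T"},
    {"chrom": "chr12", "pos": 500, "ref": "C", "alt": "T"},  # only in B
]
consensus = intersect_call_sets(caller_a, caller_b)
```

Exact key matching is a simplification: calls that describe the same variant with different representations would be missed unless they are normalized first.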


2015 ◽  
Author(s):  
John G. Cleary ◽  
Ross Braithwaite ◽  
Kurt Gaastra ◽  
Brian S Hilbush ◽  
Stuart Inglis ◽  
...  

To evaluate and compare the performance of variant calling methods and their confidence scores, comparisons between a test call set and a “gold standard” need to be carried out. Unfortunately, these comparisons are not straightforward with the current Variant Call Files (VCF), which are the standard output of most variant calling algorithms for high-throughput sequencing data. Comparisons of VCFs are often confounded by the different representations of indels, MNPs, and combinations thereof with SNVs in complex regions of the genome, resulting in misleading results. A variant caller is inherently a classification method designed to score putative variants with confidence scores that could permit controlling the rate of false positives (FP) or false negatives (FN) for a given application. Receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC) are efficient metrics to evaluate a test call set versus a gold standard. However, in the case of VCF data this also requires a special accounting to deal with discrepant representations. We developed a novel algorithm for comparing variant call sets that deals with complex call representation discrepancies and, through a dynamic programming method, minimizes false positives and negatives globally across the entire call sets for accurate performance evaluation of VCFs.
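To make the role of confidence scores concrete, here is a small Python sketch that builds ROC-style points from scored calls that have already been labelled true or false against a gold standard. It does not attempt the representation-aware matching that the abstract describes; the function name, the use of cumulative counts rather than rates, and the toy scores are assumptions for illustration.

```python
def roc_points(scored_calls):
    """Return cumulative (false_positives, true_positives) counts.

    `scored_calls` is a list of (score, is_true) pairs. Calls are sorted by
    descending score, and one point is emitted per distinct score so that
    tied calls enter the curve together.
    """
    ordered = sorted(scored_calls, key=lambda call: call[0], reverse=True)
    points = [(0, 0)]
    fp = tp = 0
    for i, (score, is_true) in enumerate(ordered):
        if is_true:
            tp += 1
        else:
            fp += 1
        is_last_of_tie = i + 1 == len(ordered) or ordered[i + 1][0] != score
        if is_last_of_tie:
            points.append((fp, tp))
    return points


# Toy calls (invented): score and whether the call matched the gold standard.
calls = [(0.9, True), (0.8, True), (0.8, False), (0.4, True), (0.2, False)]
curve = roc_points(calls)
```

Each point corresponds to one possible score cutoff, so reading along the curve shows how many false positives must be accepted to recover a given number of true positives.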


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Gavin W. Wilson ◽  
Mathieu Derouet ◽  
Gail E. Darling ◽  
Jonathan C. Yeung

Identifying single nucleotide variants has become common practice for droplet-based single-cell RNA-seq experiments; however, presently, a pipeline does not exist to maximize variant calling accuracy. Furthermore, molecular duplicates generated in these experiments have not been utilized to optimally detect variant co-expression. Herein, we introduce scSNV, designed from the ground up to “collapse” molecular duplicates and accurately identify variants and their co-expression. We demonstrate that scSNV is fast, with a reduced false-positive variant call rate, and enables the co-detection of genetic variants and A>G RNA edits across twenty-two samples.


2019 ◽  
Vol 20 (S22) ◽  
Author(s):  
Hang Zhang ◽  
Ke Wang ◽  
Juan Zhou ◽  
Jianhua Chen ◽  
Yizhou Xu ◽  
...  

Background: Variant calling and refinement from whole genome/exome sequencing data is a fundamental task for genomics studies. Due to the limited accuracy of NGS sequencing and variant callers, IGV-based manual review is required for further false positive variant filtering, which costs considerable labor and time, and results in high inter- and intra-lab variability. Results: To overcome the limitation of manual review, we developed a novel approach for Variant Filter by Automated Scoring based on Tagged-signature (VariFAST), and also provided a pipeline integrating GATK Best Practices with VariFAST, which can be easily used for high-quality variant detection from raw data. Using the bam and vcf files, VariFAST calculates, for each variant, a v-score as the sum of weighted metrics associated with false positive variants, and marks tags in a manner that keeps high consistency with manual review. We validated the performance of VariFAST for germline variant filtering using the benchmark sequencing data from GIAB, and also for somatic variant filtering using sequencing data of both malignant carcinoma and benign adenomas. VariFAST also includes a predictive model trained with the XGBoost algorithm for germline variant refinement, which shows better MCC and AUC than the state-of-the-art VQSR, especially in INDEL variant filtering. Conclusion: VariFAST can help researchers filter false positive variants, both germline and somatic, efficiently and conveniently in NGS data analysis. The VariFAST source code and the pipeline integrating with GATK Best Practices are available at https://github.com/bioxsjtu/VariFAST.
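The idea of a score built from weighted metrics can be sketched as follows. This is not VariFAST's implementation: the metric names, weights, cutoff, tag strings, and the convention that a higher score means a more suspicious call are all assumptions chosen for illustration.

```python
def v_score(metrics, weights):
    """Sum each metric value multiplied by its weight.

    Metrics that have no entry in `weights` contribute nothing.
    """
    return sum(weights.get(name, 0.0) * value for name, value in metrics.items())


def tag_variant(metrics, weights, cutoff):
    """Tag a variant by comparing its weighted score with a cutoff."""
    score = v_score(metrics, weights)
    return "likely_false_positive" if score >= cutoff else "pass"


# Hypothetical weights for three binary warning flags.
weights = {"low_depth": 2.0, "strand_bias": 1.0, "near_indel": 0.5}

clean_call = {"low_depth": 0, "strand_bias": 0, "near_indel": 1}
suspect_call = {"low_depth": 1, "strand_bias": 0, "near_indel": 1}

clean_tag = tag_variant(clean_call, weights, cutoff=2.0)
suspect_tag = tag_variant(suspect_call, weights, cutoff=2.0)
```

A weighted sum keeps each metric's contribution visible, which makes the resulting tags easier to compare with the outcome of a manual review.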


2017 ◽  
Author(s):  
Xin Zhou ◽  
Serafim Batzoglou ◽  
Arend Sidow ◽  
Lu Zhang

Background: De novo mutations (DNMs) are associated with neurodevelopmental and congenital diseases, and their detection can contribute to understanding disease pathogenicity. However, accurate detection is challenging because of their small number relative to the genome-wide false positives in next generation sequencing (NGS) data. Software such as DeNovoGear and TrioDeNovo have been developed to detect DNMs, but at good sensitivity they still produce many false positive calls. Results: To address this challenge, we develop HAPDeNovo, a program that leverages phasing information from linked read sequencing, to remove false positive DNMs from candidate lists generated by DNM-detection tools. Short reads from each phasing block are allocated to each of the two haplotypes followed by generating a haploid genotype for each putative DNM. HAPDeNovo removes variants that are called as heterozygous in one of the haplotypes because they are almost certainly false positives. Our experiments on 10X Chromium linked read sequencing trio data reveal that HAPDeNovo eliminates 80% to 99% of false positives regardless of how large the candidate DNM set is. Conclusions: HAPDeNovo leverages the haplotype information from linked read sequencing to remove spurious false positive DNMs effectively, and it increases accuracy of DNM detection dramatically without sacrificing sensitivity.
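The filtering rule described above, discarding a candidate when either haplotype looks heterozygous, can be sketched in Python. This is an illustrative simplification rather than HAPDeNovo's code: it stands in for haploid genotyping with a simple allele-fraction test, and the threshold `min_minor_fraction`, the field names, and the read counts are hypothetical.

```python
def looks_heterozygous(ref_reads, alt_reads, min_minor_fraction=0.2):
    """Return True when both alleles are well supported within one haplotype.

    A true haplotype carries a single allele, so substantial support for
    both alleles is treated as a sign of an artefact.
    """
    total = ref_reads + alt_reads
    if total == 0:
        return False  # no reads assigned to this haplotype, nothing to judge
    return min(ref_reads, alt_reads) / total >= min_minor_fraction


def filter_candidates(candidates, min_minor_fraction=0.2):
    """Keep candidates for which neither haplotype looks heterozygous."""
    kept = []
    for cand in candidates:
        het_in_any = any(
            looks_heterozygous(ref, alt, min_minor_fraction)
            for ref, alt in cand["haplotype_counts"]
        )
        if not het_in_any:
            kept.append(cand)
    return kept


# Toy candidates: (ref_reads, alt_reads) for haplotype 1 and haplotype 2.
candidates = [
    {"name": "dnm_1", "haplotype_counts": [(12, 0), (0, 11)]},  # clean split
    {"name": "dnm_2", "haplotype_counts": [(6, 5), (10, 0)]},   # mixed hap 1
]
kept = filter_candidates(candidates)
```

Here `dnm_1` survives because each haplotype supports a single allele, while `dnm_2` is removed because one haplotype carries both.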


2019 ◽  
Vol 2019 ◽  
pp. 1-8
Author(s):  
Junbo Duan ◽  
Han Liu ◽  
Lanling Zhao ◽  
Xiguo Yuan ◽  
Yu-Ping Wang ◽  
...  

Next generation sequencing is an emerging technology that has been widely used in the detection of genomic variants. However, since its depth of coverage, a main signature used for variant calling, is affected greatly by biases such as GC content and mappability, some calls are false positives. In this study, we utilized paired-end read mapping, another signature that is not affected by the aforementioned biases, to detect false-positive deletions in the database of genomic variants. We first identified 1923 suspicious variants that may be false positives and then conducted validation studies on each suspicious variant, which detected 583 false-positive deletions. Finally, we analysed the distribution of these false positives by chromosome, sample, and size. Hopefully, incorrect documentation and annotations in downstream studies can be avoided by correcting these false positives in public repositories.


Author(s):  
Michele Fúlvia Angelo ◽  
Homero Schiabel ◽  
Ana Claudia Patrocinio

This work aims to compare the effects of a CAD scheme applied to sets of digitized and direct digital mammograms. A routine designed to be applied to mammograms in the DICOM standard was developed, and a scheme based on the Watershed Transform for mass detection was applied to 252 ROIs from 130 digitized mammograms, resulting in 92% true positives and 10% false positives. For clustered microcalcification detection, another procedure was applied to 165 ROIs from 120 mammograms, resulting in 93% true positives and 16% false positives. When the same procedures were applied to 154 digital mammograms obtained from FFDM, the rates showed a slight decrease in the scheme's performance: 89% true positives and 16% false positives for mass detection; 90% true positives and 27% false positives for cluster detection. Although the tests with digital mammograms were carried out with a smaller number of images and different cases compared to the digitized ones, including several images of dense breasts, the results can be considered comparable, mainly for clustered microcalcification detection, with a difference of only 3% between the sensitivity rates for the two image sets. Another important feature affecting these results is the contrast difference between the two image sets. This implies the need for extensive investigations not only with a larger number of cases from FFDM but also of the parameters related to its image acquisition as well as its corresponding processing.


2020 ◽  
Vol 11 (04) ◽  
pp. 593-596
Author(s):  
Prakash B. Behere ◽  
Amit B. Nagdive ◽  
Aniruddh P. Behere ◽  
Richa Yadav ◽  
Rouchelle Fernandes

Objectives: Can undergraduate medical students (UGs) adopt a village model to identify mentally ill persons in an adopted village successfully? Materials and Methods: UGs during their first year adopt a village, and each student adopts seven families in the villages. During the visit, they look after immunization, tobacco and alcohol abuse, nutrition, hygiene, and sanitation. They help in identifying the health needs (including mental health) of the adopted family. The Indian Psychiatric Survey Schedule, containing 15 questions covering most psychiatric illnesses, was used by UGs to identify mental illness in the community. Persons identified as suffering from mental illness were referred to a consultant psychiatrist for confirmation of diagnosis and further management. Statistical Analysis: The percentage of expected mentally ill persons was calculated from the prevalence of mental illness in the rural community and compared with the actual number of patients with mental illness identified by the UGs. True-positive, false-positive, and true predictive values were derived. Results: In Umri village, UGs were able to identify 269 persons as true positives and 25 as false positives, whereas in Kurzadi village, UGs were able to identify 221 persons as true positives and 35 as false positives. This suggests that UGs were able to identify mental illnesses with a good positive predictive value. In Umri village, out of 294 persons identified as mentally ill, the true positive value was 91.49% and the false positive value 8.5%, whereas in Kurzadi village, out of the 256 persons identified as mentally ill, the true positive value was 86.3% and the false positive value 13.67%. Conclusion: The ratio of psychiatrists in India is approximately 0.30 per 100,000 population, so psychiatrists alone cannot cover the mental health problems of India. Therefore, we need a different model to cover mental illness in India, which is discussed in this article.
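The percentages reported for the two villages follow directly from the true-positive and false-positive counts. The short Python sketch below reproduces the arithmetic; the function name is an assumption, and the results agree with the abstract's figures up to rounding.

```python
def predictive_values(true_pos, false_pos):
    """Return (true-positive %, false-positive %) among identified persons."""
    total = true_pos + false_pos
    return 100 * true_pos / total, 100 * false_pos / total


# Counts taken from the abstract.
umri = predictive_values(269, 25)      # 294 persons identified in Umri
kurzadi = predictive_values(221, 35)   # 256 persons identified in Kurzadi

for village, (tp_pct, fp_pct) in (("Umri", umri), ("Kurzadi", kurzadi)):
    print(f"{village}: true positive {tp_pct:.2f}%, false positive {fp_pct:.2f}%")
```

The first value of each pair is the positive predictive value, that is, the share of persons flagged by the students whose diagnosis was confirmed.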


2019 ◽  
Author(s):  
Ali Karimnezhad ◽  
Gareth A. Palidwor ◽  
Kednapa Thavorn ◽  
David J. Stewart ◽  
Pearl A. Campbell ◽  
...  

Background: Treating cancer depends in part on identifying the mutations driving each patient’s disease. Many clinical laboratories are adopting high-throughput sequencing for assaying patients’ tumours, applying targeted panels to formalin-fixed paraffin-embedded tumour tissues to detect clinically-relevant mutations. While there have been some benchmarking and best practices studies of this scenario, much variant-calling work focuses on whole-genome or whole-exome studies, with fresh or fresh-frozen tissue. Thus, definitive guidance on best choices for sequencing platforms, sequencing strategies, and variant calling for clinical variant detection is still being developed. Results: Because ground truth for clinical specimens is rarely known, we used the well-characterized Coriell cell lines GM12878 and GM12877 to generate data. We prepared samples to mimic as closely as possible clinical biopsies, including formalin fixation and paraffin embedding. We evaluated two well-known targeted sequencing panels, Illumina’s TruSight 170 panel and the Oncomine Focus panel. Sequencing was performed on an Illumina NextSeq500 and an Ion Torrent PGM respectively. We performed multiple biological replicates of each assay, to test reproducibility. Finally, we applied five different public and freely-available somatic single-nucleotide variant (SNV) callers to the data: MuTect2, SAMtools, VarScan2, Pisces and VarDict. Although the TruSight 170 and Oncomine Focus panels cover different amounts of the genome, we did not observe major differences in variant calling success within the regions that each covers. We observed substantial discrepancies between the five variant callers. All had high sensitivity, detecting known SNVs, but highly varying and non-overlapping false positive detections. Harmonizing variant caller parameters or intersecting the results of multiple variant callers reduced disagreements. However, intersecting results from biological replicates was even better at eliminating false positives. Conclusions: Reproducibility and accuracy of targeted clinical sequencing results depend less on sequencing platform and panel than on downstream bioinformatics and biological variability. Differences in variant callers’ default parameters are a greater influence on algorithm disagreement than other differences between the algorithms. Contrary to typical clinical practice, we recommend analyzing replicate samples, as this greatly decreases false positive calls.


Chemotherapy ◽  
2018 ◽  
Vol 63 (6) ◽  
pp. 324-329 ◽  
Author(s):  
Michael S. Ewer ◽  
Jay Herson

Purpose: Cardiac ultrasound provides important structural and functional information that makes identification of cardiac abnormalities possible. Left ventricular ejection fraction (LVEF) provides the most commonly used parameter for recognition of treatment-related cardiac dysfunction. Random reading variance and physiologic factors influence LVEF and make the reported value imperfect. We attempt to quantitate the likelihood of false positive events by computer simulation. Methods: We simulated four visits on hypothetical trials. We assumed a baseline LVEF of 55% and normal distribution with regard to reading error and physiologic variation. 1,000 trials of sample size 1,500 were simulated. In a separate simulation, 1,000 patients entered with LVEFs of 45, 43, and 41% to estimate true positive incidence. Results: At each examination, less than 1.0% of false positives were noted. The cumulative false positive rate over four visits was 3.60%. True cardiotoxicity identification is satisfactory only when LVEF declines substantially. Conclusion: A 3.60% false positive rate in trials where the expected level of toxicity is low suggests that false positives are troubling and may exceed true positive results. Strategies to reduce the number of false positive results include making confirmatory studies mandatory. Evaluating increases along with decreases obtains some estimation of variance.
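The relationship between a per-visit false positive rate and the cumulative rate over several visits can be illustrated with a short calculation. This sketch assumes that visits are independent and share the same false positive rate; that is a simplification introduced here and not necessarily the model used in the simulation study, and the function names are hypothetical.

```python
def cumulative_rate(per_visit, visits):
    """Chance of at least one false positive across independent visits."""
    return 1.0 - (1.0 - per_visit) ** visits


def per_visit_rate(cumulative, visits):
    """Per-visit rate implied by a cumulative rate, under independence."""
    return 1.0 - (1.0 - cumulative) ** (1.0 / visits)


# Per-visit rate implied by the reported 3.60% cumulative rate over 4 visits.
implied = per_visit_rate(0.036, 4)
round_trip = cumulative_rate(implied, 4)
print(f"implied per-visit rate: {implied:.4%}")
```

Under this assumption the implied per-visit rate is a little over 0.9%, which is compatible with the abstract's statement that fewer than 1.0% of false positives were noted at each examination.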

