A deep learning approach for filtering structural variants in short read sequencing data

Author(s):  
Yongzhuang Liu ◽  
Yalin Huang ◽  
Guohua Wang ◽  
Yadong Wang

Abstract Short read whole genome sequencing has become widely used to detect structural variants in human genetic studies and clinical practice. However, accurate detection of structural variants is a challenging task. In particular, existing structural variant detection approaches produce a large proportion of incorrect calls, so effective structural variant filtering approaches are urgently needed. In this study, we propose a novel deep learning-based approach, DeepSVFilter, for filtering structural variants in short read whole genome sequencing data. DeepSVFilter encodes structural variant signals in the read alignments as images and uses transfer learning with pre-trained convolutional neural networks as the classification models, which are trained on well-characterized samples with known high confidence structural variants. We use two well-characterized samples to demonstrate DeepSVFilter’s performance and its filtering effect when coupled with commonly used structural variant detection approaches. DeepSVFilter is implemented in Python and is freely available at https://github.com/yongzhuang/DeepSVFilter.
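
As a rough illustration of what "encoding alignment signals as images" can mean, the sketch below turns read coverage over a candidate region into a small grayscale bar-chart image. This is not DeepSVFilter's actual encoding, which the abstract does not specify. The function name `depth_image`, the `(start, end)` read representation, and the 0/255 pixel values are all assumptions made for this example.

```python
from typing import List, Tuple

Read = Tuple[int, int]  # hypothetical (start, end) alignment span, half-open


def depth_image(reads: List[Read], start: int, end: int, height: int = 8) -> List[List[int]]:
    """Rasterize per-base read depth in [start, end) as a height x width image.

    Each column is one reference position; the column is filled from the
    bottom up in proportion to its depth relative to the maximum depth.
    """
    width = end - start
    depth = [0] * width
    for r_start, r_end in reads:
        # Count only the part of the read that falls inside the window.
        for pos in range(max(r_start, start), min(r_end, end)):
            depth[pos - start] += 1

    max_depth = max(depth, default=0) or 1  # avoid division by zero
    image = [[0] * width for _ in range(height)]
    for x, d in enumerate(depth):
        filled = round(d / max_depth * height)
        for y in range(height - filled, height):
            image[y][x] = 255
    return image


if __name__ == "__main__":
    # Two overlapping toy reads; a drop in column height would hint at a deletion.
    for row in depth_image([(0, 4), (2, 6)], start=0, end=6, height=2):
        print(row)
```

An image of this kind could in principle be passed to a pre-trained image classifier, but the channels, signals, and network used by the actual tool would need to be taken from its repository.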

2020 ◽  
Vol 35 (9) ◽  
pp. 1675-1679
Author(s):  
Haloom Rafehi ◽  
David J. Szmulewicz ◽  
Kate Pope ◽  
Mathew Wallis ◽  
John Christodoulou ◽  
...  

Author(s):  
Varuni Sarwal ◽  
Sebastian Niehus ◽  
Ram Ayyala ◽  
Sei Chang ◽  
Angela Lu ◽  
...  

Abstract Advances in whole genome sequencing promise to enable accurate and comprehensive structural variant (SV) discovery. Dissecting SVs from whole genome sequencing (WGS) data presents a substantial number of challenges, and a plethora of SV-detection methods have been developed. Currently, there is a paucity of evidence that investigators can use to select appropriate SV-detection tools. In this paper, we evaluated the performance of SV-detection tools using a comprehensive PCR-confirmed gold standard set of SVs. In contrast to previous benchmarking studies, our gold standard dataset included a complete set of SVs, allowing us to report both precision and sensitivity rates of SV-detection methods. Our study investigates the ability of the methods to detect deletions, thus providing an optimistic estimate of SV detection performance, as the SV-detection methods that fail to detect deletions are likely to miss more complex SVs. We found that SV-detection tools varied widely in their performance, with several methods providing a good balance between sensitivity and precision. Additionally, we have determined the SV callers best suited for low and ultra-low pass sequencing data.
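
To make the two reported metrics concrete, here is a minimal sketch of how precision and sensitivity might be computed for deletion calls against a gold standard set. The matching rule (at least 50% reciprocal overlap) is a common convention assumed for this example, not necessarily the criterion used in the study. The names `reciprocal_overlap` and `precision_sensitivity` are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Return the smaller of the two overlap fractions, or 0.0 if disjoint."""
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))


def precision_sensitivity(
    calls: List[Interval], truth: List[Interval], min_ro: float = 0.5
) -> Tuple[float, float]:
    """Precision = matched calls / all calls; sensitivity = matched truth / all truth."""
    matched_calls = sum(
        1 for c in calls if any(reciprocal_overlap(c, t) >= min_ro for t in truth)
    )
    matched_truth = sum(
        1 for t in truth if any(reciprocal_overlap(c, t) >= min_ro for c in calls)
    )
    precision = matched_calls / len(calls) if calls else 0.0
    sensitivity = matched_truth / len(truth) if truth else 0.0
    return precision, sensitivity


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    truth = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 300)]
    calls = [("chr1", 110, 205), ("chr1", 900, 1000)]
    print(precision_sensitivity(calls, truth))
```

Reporting both numbers requires a complete truth set: without one, unmatched calls cannot be counted reliably as false positives, which is why the abstract stresses the completeness of its gold standard.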


2017 ◽  
Vol 55 (5) ◽  
pp. 1446-1453 ◽  
Author(s):  
Alex Marchand-Austin ◽  
Raymond S. W. Tsang ◽  
Jennifer L. Guthrie ◽  
Jennifer H. Ma ◽  
Gillian H. Lim ◽  
...  

ABSTRACT Bordetella pertussis is a Gram-negative bacterium that causes respiratory infections in humans. Ongoing molecular surveillance of B. pertussis acellular vaccine (aP) antigens is critical for understanding the interaction between evolutionary pressures, disease pathogenesis, and vaccine effectiveness. Methods currently used to characterize aP components are relatively labor-intensive and low throughput. To address this challenge, we sought to derive aP antigen genotypes from minimally processed short-read whole-genome sequencing data generated from 40 clinical B. pertussis isolates and analyzed using the SRST2 bioinformatic package. SRST2 was able to identify aP antigen genotypes for all antigens with the exception of pertactin, possibly due to low read coverage in GC-rich low-complexity regions of variation. Two main genotypes were observed in addition to a singular third genotype that contained an 84-bp deletion that was identified by SRST2 despite the issues in allele calling. This method has the potential to generate large pools of B. pertussis molecular data that can be linked to clinical and epidemiological information to facilitate research of vaccine effectiveness and disease severity in the context of emerging vaccine antigen-deficient strains.


2016 ◽  
Vol 16 (1) ◽  
Author(s):  
Taryn B. T. Athey ◽  
Sarah Teatero ◽  
Sonia Lacouture ◽  
Daisuke Takamatsu ◽  
Marcelo Gottschalk ◽  
...  

2020 ◽  
Author(s):  
Xiao Chen ◽  
Fei Shen ◽  
Nina Gonzaludo ◽  
Alka Malhotra ◽  
Cande Rogert ◽  
...  

Abstract Responsible for the metabolism of 25% of clinically used drugs, CYP2D6 is a critical component of personalized medicine initiatives. Genotyping CYP2D6 is challenging due to sequence similarity with its pseudogene paralog CYP2D7 and a high number and variety of common structural variants (SVs). Here we describe a novel bioinformatics method, Cyrius, that accurately genotypes CYP2D6 using whole-genome sequencing (WGS) data. We show that Cyrius has superior performance (96.5% concordance with truth genotypes) compared to existing methods (84-86.8%). After implementing the improvements identified from the comparison against the truth data, Cyrius’s accuracy has since been improved to 99.3%. Using Cyrius, we built a haplotype frequency database from 2504 ethnically diverse samples and estimate that SV-containing star alleles are more frequent than previously reported. Cyrius will be an important tool to incorporate pharmacogenomics in WGS-based precision medicine initiatives.


2019 ◽  
Vol 56 (12) ◽  
pp. 809-817 ◽  
Author(s):  
Brett Trost ◽  
Susan Walker ◽  
Syed A Haider ◽  
Wilson W L Sung ◽  
Sergio Pereira ◽  
...  

Background: Whole blood is currently the most common DNA source for whole-genome sequencing (WGS), but for studies requiring non-invasive collection, self-collection, greater sample stability or additional tissue references, saliva or buccal samples may be preferred. However, the relative quality of sequencing data and accuracy of genetic variant detection from blood-derived, saliva-derived and buccal-derived DNA need to be thoroughly investigated. Methods: Matched blood, saliva and buccal samples from four unrelated individuals were used to compare sequencing metrics and variant-detection accuracy among these DNA sources. Results: We observed significant differences among DNA sources for sequencing quality metrics such as percentage of reads aligned and mean read depth (p<0.05). Differences were negligible in the accuracy of detecting short insertions and deletions; however, the false positive rate for single nucleotide variation detection was slightly higher in some saliva and buccal samples. The sensitivity of copy number variant (CNV) detection was up to 25% higher in blood samples, depending on CNV size and type, and appeared to be worse in saliva and buccal samples with high bacterial concentration. We also show that methylation-based enrichment for eukaryotic DNA in saliva and buccal samples increased alignment rates but also reduced read-depth uniformity, hampering CNV detection. Conclusion: For WGS, we recommend using DNA extracted from blood rather than saliva or buccal swabs; if saliva or buccal samples are used, we recommend against using methylation-based eukaryotic DNA enrichment. All data used in this study are available for further open-science investigation.

