S95 Identifying new hereditary haemorrhagic telangiectasia genes by applying a machine learning approach to screen whole genome sequencing data

Author(s):  
S Xiao ◽  
D Brown ◽  
IG Mollet ◽  
FS Govani ◽  
D Patel ◽  
...  
2020 ◽  
Author(s):  
Sihao Xiao ◽  
Zhentian Kai ◽  
David Brown ◽  
Claire L Shovlin ◽  

SUMMARY Whole genome sequencing (WGS) is championed by the UK National Health Service (NHS) to identify genetic variants that cause particular diseases. The full potential of WGS has yet to be realised because early data analytic steps prioritise protein-coding genes and effectively ignore the less well annotated non-coding genome, which is rich in transcribed and critical regulatory regions. To address this, we developed a filter, which we call GROFFFY, and validated it in WGS data from hereditary haemorrhagic telangiectasia patients within the 100,000 Genomes Project. Before filter application, the mean number of DNA variants compared with the human reference sequence GRCh38 was 4,867,167 (range 4,786,039-5,070,340), and one-third lay within intergenic areas. GROFFFY removed a mean of 2,812,015 variants per DNA. In combination with allele frequency and other filters, GROFFFY enabled a 99.56% reduction in variant number. The proportion of intergenic variants was maintained, and no pathogenic variants in disease genes were lost. We conclude that the filter applied to NHS diagnostic samples in the 100,000 Genomes pipeline offers an efficient method to prioritise intergenic, intronic and coding gDNA variants. Reducing the overwhelming number of variants while retaining functional genome variation of importance to patients enhances the near-term value of WGS in clinical diagnostics.
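To put the filter figures above on a common scale, the following Python sketch works through the arithmetic. It uses only the means quoted in the summary; because those are averages across samples, the derived values are approximate, and the function names are illustrative rather than part of GROFFFY.

```python
# Means quoted in the summary above (per genome, compared with GRCh38).
MEAN_VARIANTS_BEFORE = 4_867_167      # mean variants before filtering
MEAN_REMOVED_BY_GROFFFY = 2_812_015   # mean variants removed by GROFFFY alone
OVERALL_REDUCTION = 0.9956            # reduction with all filters combined


def remaining_after_removal(total: int, removed: int) -> int:
    """Variants left after a filter removes `removed` of `total`."""
    return total - removed


def fraction_removed(total: int, removed: int) -> float:
    """Share of variants removed, as a fraction between 0 and 1."""
    return removed / total


def remaining_after_reduction(total: int, reduction: float) -> float:
    """Variants left after an overall fractional reduction."""
    return total * (1.0 - reduction)


after_grofffy = remaining_after_removal(MEAN_VARIANTS_BEFORE, MEAN_REMOVED_BY_GROFFFY)
share = fraction_removed(MEAN_VARIANTS_BEFORE, MEAN_REMOVED_BY_GROFFFY)
after_all = remaining_after_reduction(MEAN_VARIANTS_BEFORE, OVERALL_REDUCTION)

print(f"Left after GROFFFY alone: {after_grofffy:,} ({share:.1%} removed)")
print(f"Left after all filters:   about {after_all:,.0f}")
```

On these means, GROFFFY alone removes a little under 58% of variants, and the combined filters leave on the order of 21,000 variants per genome.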


2018 ◽  
Author(s):  
Nathan Wan ◽  
David Weinberg ◽  
Tzu-Yu Liu ◽  
Katherine Niehaus ◽  
Daniel Delubac ◽  
...  

Abstract Background Blood-based methods using cell-free DNA (cfDNA) are under development as an alternative to existing screening tests. However, early-stage detection of cancer using tumor-derived cfDNA has proven challenging because of the small proportion of cfDNA derived from tumor tissue in early-stage disease. A machine learning approach to discover signatures in cfDNA, potentially reflective of both tumor and non-tumor contributions, may represent a promising direction for the early detection of cancer. Methods Whole-genome sequencing was performed on cfDNA extracted from plasma samples (N=546 colorectal cancer and 271 non-cancer controls). Reads aligning to protein-coding gene bodies were extracted, and read counts were normalized. cfDNA tumor fraction was estimated using IchorCNA. Machine learning models were trained using k-fold cross-validation and confounder-based cross-validation to assess generalization performance. Results In a colorectal cancer cohort heavily weighted towards early-stage cancer (80% stage I/II), we achieved a mean AUC of 0.92 (95% CI 0.91-0.93) with a mean sensitivity of 85% (95% CI 83-86%) at 85% specificity. Sensitivity generally increased with tumor stage and increasing tumor fraction. Stratification by age, sequencing batch, and institution demonstrated the impact of these confounders and provided a more accurate assessment of generalization performance. Conclusions A machine learning approach using cfDNA achieved high sensitivity and specificity in a large, predominantly early-stage, colorectal cancer cohort. The possibility of systematic technical and institution-specific biases warrants similar confounder analyses in other studies. Prospective validation of this machine learning method and evaluation of a multi-analyte approach are underway.
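Two ideas in the abstract above lend themselves to a short Python sketch: reporting sensitivity at a fixed specificity, and cross-validation that holds out whole confounder groups. The abstract does not describe the authors' implementation, so both functions are assumptions: `sensitivity_at_specificity` sets the cut-off from the control scores, and `grouped_folds` interprets "confounder-based cross-validation" as keeping every institution or batch entirely within one fold. All names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def sensitivity_at_specificity(case_scores, control_scores, specificity):
    """Sensitivity when the cut-off is set so that at least `specificity`
    of controls score at or below it. A sample is called positive only if
    its score is strictly above the cut-off."""
    ordered = sorted(control_scores)
    # Number of controls that must be called negative; the small epsilon
    # guards against floating-point error in the product.
    needed = math.ceil(specificity * len(ordered) - 1e-9)
    threshold = ordered[needed - 1] if needed > 0 else float("-inf")
    called_positive = sum(1 for score in case_scores if score > threshold)
    return called_positive / len(case_scores)


def grouped_folds(groups, k):
    """Yield (train_indices, test_indices) such that every confounder group
    (for example an institution or sequencing batch) falls in a single fold."""
    members = defaultdict(list)
    for index, group in enumerate(groups):
        members[group].append(index)
    # Assign whole groups to folds round-robin, sorted for determinism.
    fold_of = {g: i % k for i, g in enumerate(sorted(members))}
    for fold in range(k):
        test = [i for g, idx in members.items() if fold_of[g] == fold for i in idx]
        train = [i for g, idx in members.items() if fold_of[g] != fold for i in idx]
        yield sorted(train), sorted(test)


# Toy data, invented purely to exercise the functions.
toy_controls = [i / 10 for i in range(1, 11)]
toy_cases = [0.5, 0.85, 0.9, 0.95]
print(sensitivity_at_specificity(toy_cases, toy_controls, 0.8))

toy_institutions = ["A", "A", "B", "C", "C", "B"]
for train, test in grouped_folds(toy_institutions, 3):
    print("train", train, "test", test)
```

Holding out whole groups means a model is never evaluated on samples from an institution it was trained on, which is one way to expose the institution-specific biases the abstract warns about.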


mSystems ◽  
2020 ◽  
Vol 5 (3) ◽  
Author(s):  
Nenad Macesic ◽  
Oliver J. Bear Don’t Walk ◽  
Itsik Pe’er ◽  
Nicholas P. Tatonetti ◽  
Anton Y. Peleg ◽  
...  

ABSTRACT Polymyxins are used as treatments of last resort for Gram-negative bacterial infections. Their increased use has led to concerns about emerging polymyxin resistance (PR). Phenotypic polymyxin susceptibility testing is resource intensive and difficult to perform accurately. The complex polygenic nature of PR and our incomplete understanding of its genetic basis make it difficult to predict PR using detection of resistance determinants. We therefore applied machine learning (ML) to whole-genome sequencing data from >600 Klebsiella pneumoniae clonal group 258 (CG258) genomes to predict phenotypic PR. Using a reference-based representation of genomic data with ML outperformed a rule-based approach that detected variants in known PR genes (area under receiver-operator curve [AUROC], 0.894 versus 0.791, P = 0.006). We noted modest increases in performance by using a bacterial genome-wide association study to filter relevant genomic features and by integrating clinical data in the form of prior polymyxin exposure. Conversely, reference-free representation of genomic data as k-mers was associated with decreased performance (AUROC, 0.692 versus 0.894, P = 0.015). When ML models were interpreted to extract genomic features, six of seven known PR genes were correctly identified by models without prior programming and several genes involved in stress responses and maintenance of the cell membrane were identified as potential novel determinants of PR. These findings are a proof of concept that whole-genome sequencing data can accurately predict PR in K. pneumoniae CG258 and may be applicable to other forms of complex antimicrobial resistance. IMPORTANCE Polymyxins are last-resort antibiotics used to treat highly resistant Gram-negative bacteria. There are increasing reports of polymyxin resistance emerging, raising concerns of a postantibiotic era. 
Polymyxin resistance is therefore a significant public health threat, but current phenotypic methods for detection are difficult and time-consuming to perform. There have been increasing efforts to use whole-genome sequencing for detection of antibiotic resistance, but this has been difficult to apply to polymyxin resistance because of its complex polygenic nature. The significance of our research is that we successfully applied machine learning methods to predict polymyxin resistance in Klebsiella pneumoniae clonal group 258, a common health care-associated and multidrug-resistant pathogen. Our findings highlight that machine learning can be successfully applied even in complex forms of antibiotic resistance and represent a significant contribution to the literature that could be used to predict resistance in other bacteria and to other antibiotics.
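The abstract above compares models by AUROC and contrasts reference-based features with a reference-free k-mer representation. The Python sketch below illustrates both concepts in their simplest form. It is not the authors' pipeline: `auroc` uses the pairwise definition (the probability that a resistant isolate scores higher than a susceptible one, with ties counted as one half), and `kmer_counts` uses plain overlapping substrings. The function names and toy inputs are hypothetical.

```python
from collections import Counter


def auroc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison: the fraction of
    (positive, negative) pairs in which the positive scores higher,
    with ties contributing one half."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


def kmer_counts(sequence, k):
    """Count every overlapping substring of length k in `sequence`."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Toy inputs, invented for illustration only.
print(auroc([0.9, 0.8, 0.4], [0.5, 0.3]))
print(kmer_counts("ATGATG", 3))
```

The pairwise AUROC is quadratic in the number of isolates, which is acceptable for a few hundred genomes; a rank-based formulation would scale better for larger collections.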


2019 ◽  
Vol 6 (Supplement_2) ◽  
pp. S42-S42
Author(s):  
David E Greenberg ◽  
Jiwoong Kim ◽  
Xiaowei Zhan ◽  
Samuel A Shelburne ◽  
...  

Abstract Background Multi-drug-resistant (MDR) P. aeruginosa (PA) infections continue to cause significant morbidity and mortality in various patient groups, including those with malignancies. Predicting antimicrobial resistance (AMR) from whole-genome sequencing data, if done rapidly, could aid in providing optimal care to patients. Methods To better understand the connections between DNA variation and phenotypic AMR in PA, we developed a new algorithm, variant mapping and prediction of antibiotic resistance (VAMPr), to build association and machine learning prediction models of AMR based on publicly available whole-genome sequencing and antibiotic susceptibility testing (AST) data. A validation cohort of contemporary PA bloodstream isolates was sequenced and AST was performed. Accuracy of predicting AMR for various PA–drug combinations was calculated. Results VAMPr was built from 3,393 bacterial isolates (83 PA isolates included) from 9 species that contained AST data for 29 antibiotics. A total of 14,615 variant genotypes were identified within the dataset and 93 association and prediction models were built. A total of 120 PA bloodstream isolates from cancer patients were included for analysis in the validation cohort. Approximately 15% of isolates were carbapenem resistant and approximately 20% were quinolone resistant. For drug-isolate combinations where >100 isolates were available, machine-learning prediction accuracies ranged from 75.6% (PA and ceftazidime; 90/119 correctly predicted) to 98.1% (PA and amikacin; 105/107 correctly predicted). Machine learning accurately identified known variants that strongly predicted resistance to various antibiotic classes. Examples included the association between specific gyrA mutations (T83I; P < 0.00001) and quinolone resistance. Conclusion Machine learning predicted AMR in P. aeruginosa across a number of antibiotics with high accuracy.
Given the genomic heterogeneity of PA, increased genomic data for this pathogen will aid in further improving prediction accuracy across all antibiotic classes. Disclosures Samuel L. Aitken, PharmD, Melinta Therapeutics: Grant/Research Support, Research Grant; Merck Sharp & Dohme: Advisory Board; Shionogi: Advisory Board
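The prediction accuracies reported in the Results of the abstract above follow directly from the stated counts, as the Python sketch below shows. The Wilson score interval is an addition for illustration only: the abstract reports no confidence intervals, and the choice of interval method and z = 1.96 (about 95% coverage) are assumptions.

```python
import math


def accuracy(correct: int, total: int) -> float:
    """Fraction of isolates whose phenotype was predicted correctly."""
    return correct / total


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion.
    Not reported in the abstract; shown only to indicate the uncertainty
    that comes with roughly 100 isolates."""
    p = correct / total
    denominator = 1.0 + z * z / total
    centre = (p + z * z / (2 * total)) / denominator
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denominator
    return centre - half_width, centre + half_width


# Counts quoted in the abstract: (drug, correctly predicted, isolates tested).
reported = [("ceftazidime", 90, 119), ("amikacin", 105, 107)]

for drug, correct, total in reported:
    low, high = wilson_interval(correct, total)
    print(f"{drug}: {accuracy(correct, total):.1%} (Wilson interval {low:.1%} to {high:.1%})")
```

The two accuracies reproduce the 75.6% and 98.1% quoted in the abstract when rounded to one decimal place.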

