A Genome-based Model to Predict the Virulence of Pseudomonas aeruginosa Isolates

ABSTRACT Variation in the genome of Pseudomonas aeruginosa, an important pathogen, can have dramatic impacts on the bacterium’s ability to cause disease. We therefore asked whether it was possible to predict the virulence of P. aeruginosa isolates based upon their genomic content. We applied a machine learning approach to a genetically and phenotypically diverse collection of 115 clinical P. aeruginosa isolates using genomic information and corresponding virulence phenotypes in a mouse model of bacteremia. We defined the accessory genome of these isolates through the presence or absence of accessory genomic elements (AGEs), sequences present in some strains but not others. Machine learning models trained using AGEs were predictive of virulence, with a mean nested cross-validation accuracy of 75% using the random forest algorithm. However, individual AGEs did not have a large influence on the algorithm’s performance, suggesting instead that the virulence prediction derives from a diffuse genomic signature. These results were validated with an independent test set of 25 P. aeruginosa isolates whose virulence was predicted with 72% accuracy. Machine learning models trained using core genome single nucleotide variants and whole genome k-mers also predicted virulence. Our findings are a proof of concept for the use of bacterial genomes to predict pathogenicity in P. aeruginosa and highlight the potential of this approach for predicting patient outcomes. IMPORTANCE Pseudomonas aeruginosa is a clinically important gram-negative opportunistic pathogen. As a species, P. aeruginosa has a large degree of heterogeneity both through variation in sequences found throughout the species (core genome) and the presence or absence of sequences in different isolates (accessory genome). P. aeruginosa isolates also differ markedly in their ability to cause disease. In this study, we used machine learning to predict the virulence level of P. aeruginosa isolates in a mouse bacteremia model based on genomic content. We show that both the accessory and core genome are predictive of virulence. This study provides a machine learning framework to investigate relationships between bacterial genomes and complex phenotypes such as virulence.
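
The abstract names whole-genome k-mers as one of the feature sets. As a rough illustration of what such features look like, the sketch below counts canonical k-mers (each k-mer merged with its reverse complement) in a DNA string. The function names, the value of k, and the toy sequence are assumptions made for this example and are not taken from the authors' pipeline.

```python
from collections import Counter

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmer_counts(seq: str, k: int) -> Counter:
    """Count canonical k-mers in seq.

    Each k-mer is represented by the lexicographically smaller of itself
    and its reverse complement, so both strands map to the same feature.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) - set("ACGT"):
            continue  # skip windows containing ambiguous bases such as N
        counts[min(kmer, reverse_complement(kmer))] += 1
    return counts


if __name__ == "__main__":
    toy = "ACGTACGTNACG"  # hypothetical sequence, for illustration only
    for kmer, n in sorted(canonical_kmer_counts(toy, 3).items()):
        print(kmer, n)
```

In a setting like the one described, one such count (or presence/absence) vector per isolate would form a row of the feature matrix passed to a classifier.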


A Genome-Based Model to Predict the Virulence of Pseudomonas aeruginosa Isolates

mBio ◽

10.1128/mbio.01527-20 ◽

2020 ◽

Vol 11 (4) ◽

Author(s):

Nathan B. Pincus ◽

Egon A. Ozer ◽

Jonathan P. Allen ◽

Marcus Nguyen ◽

James J. Davis ◽

...

Keyword(s):

Machine Learning ◽

Pseudomonas Aeruginosa ◽

Core Genome ◽

Learning Models ◽

Single Nucleotide Variants ◽

Bacterial Genomes ◽

Content Type ◽

Accessory Genome ◽

A Genome ◽

Machine Learning Models

ABSTRACT Variation in the genome of Pseudomonas aeruginosa, an important pathogen, can have dramatic impacts on the bacterium’s ability to cause disease. We therefore asked whether it was possible to predict the virulence of P. aeruginosa isolates based on their genomic content. We applied a machine learning approach to a genetically and phenotypically diverse collection of 115 clinical P. aeruginosa isolates using genomic information and corresponding virulence phenotypes in a mouse model of bacteremia. We defined the accessory genome of these isolates through the presence or absence of accessory genomic elements (AGEs), sequences present in some strains but not others. Machine learning models trained using AGEs were predictive of virulence, with a mean nested cross-validation accuracy of 75% using the random forest algorithm. However, individual AGEs did not have a large influence on the algorithm’s performance, suggesting instead that virulence predictions are derived from a diffuse genomic signature. These results were validated with an independent test set of 25 P. aeruginosa isolates whose virulence was predicted with 72% accuracy. Machine learning models trained using core genome single-nucleotide variants and whole-genome k-mers also predicted virulence. Our findings are a proof of concept for the use of bacterial genomes to predict pathogenicity in P. aeruginosa and highlight the potential of this approach for predicting patient outcomes. IMPORTANCE Pseudomonas aeruginosa is a clinically important Gram-negative opportunistic pathogen. P. aeruginosa shows a large degree of genomic heterogeneity both through variation in sequences found throughout the species (core genome) and through the presence or absence of sequences in different isolates (accessory genome). P. aeruginosa isolates also differ markedly in their ability to cause disease. In this study, we used machine learning to predict the virulence level of P. aeruginosa isolates in a mouse bacteremia model based on genomic content. We show that both the accessory and core genomes are predictive of virulence. This study provides a machine learning framework to investigate relationships between bacterial genomes and complex phenotypes such as virulence.
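
The abstract reports a mean nested cross-validation accuracy. The sketch below shows the general shape of nested cross-validation on binary presence/absence features: an inner loop picks a hyperparameter using only the outer-training rows, and the outer loop scores the tuned model on held-out rows. To stay dependency-free it swaps in a k-nearest-neighbor classifier with Hamming distance in place of the random forest named in the abstract. The data are randomly generated, and every name and parameter here is an assumption made for illustration.

```python
import random
from statistics import mean


def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training rows closest to x."""
    order = sorted(range(len(train_X)), key=lambda i: hamming(train_X[i], x))
    nearest = order[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def accuracy(train_idx, test_idx, X, y, k):
    """Fit on train_idx rows, return the fraction of test_idx rows predicted correctly."""
    tX = [X[i] for i in train_idx]
    ty = [y[i] for i in train_idx]
    hits = sum(knn_predict(tX, ty, X[i], k) == y[i] for i in test_idx)
    return hits / len(test_idx)


def folds(indices, n_folds, rng):
    """Shuffle indices and return a list of (train, test) index pairs."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    parts = [shuffled[f::n_folds] for f in range(n_folds)]
    return [([i for p in parts[:f] + parts[f + 1:] for i in p], parts[f])
            for f in range(n_folds)]


def nested_cv(X, y, k_grid=(1, 3, 5), outer=5, inner=3, seed=0):
    """Return one held-out accuracy per outer fold."""
    rng = random.Random(seed)
    outer_scores = []
    for train_idx, test_idx in folds(list(range(len(X))), outer, rng):
        # Inner loop: choose k using only the outer-training rows.
        inner_folds = folds(train_idx, inner, rng)
        best_k = max(k_grid, key=lambda k: mean(
            accuracy(tr, va, X, y, k) for tr, va in inner_folds))
        # Outer loop: score the tuned model on rows never used for tuning.
        outer_scores.append(accuracy(train_idx, test_idx, X, y, best_k))
    return outer_scores


def toy_data(n=60, n_features=20, seed=1):
    """Hypothetical presence/absence matrix; the label depends on the first five features."""
    rng = random.Random(seed)
    X = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n)]
    y = [1 if sum(row[:5]) >= 3 else 0 for row in X]
    return X, y


if __name__ == "__main__":
    X, y = toy_data()
    scores = nested_cv(X, y)
    print("per-fold accuracy:", [round(s, 2) for s in scores])
    print("mean nested CV accuracy:", round(mean(scores), 2))
```

Keeping hyperparameter selection inside the inner loop is what prevents the outer accuracy estimate from being inflated by tuning on the test rows.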


Reducing Sanger confirmation testing through false positive prediction algorithms

Genetics in Medicine ◽

10.1038/s41436-021-01148-3 ◽

2021 ◽

Author(s):

James M. Holt ◽

Melissa Kelly ◽

Brett Sundlof ◽

Ghunwa Nakouzi ◽

David Bick ◽

...

Keyword(s):

Machine Learning ◽

False Positive ◽

Turnaround Time ◽

Learning Models ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

False Positive Prediction ◽

Confirmatory Testing ◽

Reference Human Genome ◽

Machine Learning Models

Abstract Purpose Clinical genome sequencing (cGS) followed by orthogonal confirmatory testing is standard practice. While orthogonal testing significantly improves specificity, it also results in increased turnaround time and cost of testing. The purpose of this study is to evaluate machine learning models trained to identify false positive variants in cGS data to reduce the need for orthogonal testing. Methods We sequenced five reference human genome samples characterized by the Genome in a Bottle Consortium (GIAB) and compared the results with an established set of variants for each genome referred to as a truth set. We then trained machine learning models to identify variants that were labeled as false positives. Results After training, the models identified 99.5% of the false positive heterozygous single-nucleotide variants (SNVs) and heterozygous insertions/deletions variants (indels) while reducing confirmatory testing of nonactionable, nonprimary SNVs by 85% and indels by 75%. Employing the algorithm in clinical practice reduced overall orthogonal testing using dideoxynucleotide (Sanger) sequencing by 71%. Conclusion Our results indicate that a low false positive call rate can be maintained while significantly reducing the need for confirmatory testing. The framework that generated our models and results is publicly available at https://github.com/HudsonAlpha/STEVE.


Outcomes associated with SARS-CoV-2 viral clades in COVID-19

10.1101/2020.09.24.20201228 ◽

2020 ◽

Author(s):

Kenji Nakamichi ◽

Jolie Zhu Shen ◽

Cecilia S Lee ◽

Aaron Y Lee ◽

Emma Adaline Roberts ◽

...

Keyword(s):

Machine Learning ◽

Outcome Data ◽

Sequence Information ◽

Sequence Variants ◽

Clustering Methods ◽

Learning Models ◽

Single Nucleotide Variants ◽

First Case ◽

Significant Difference ◽

Machine Learning Models

Background The COVID-19 epidemic of 2019-20 is due to the novel coronavirus SARS-CoV-2. Following first case description in December 2019, this virus has infected over 10 million individuals and resulted in at least 500,000 deaths worldwide. The virus is undergoing rapid mutation, with two major clades of sequence variants emerging. This study sought to determine whether SARS-CoV-2 sequence variants are associated with differing outcomes among COVID-19 patients in a single medical system. Methods Whole genome SARS-CoV-2 RNA sequence was obtained from isolates collected from patients registered in the University of Washington Medicine health system between March 1 and April 15, 2020. Demographic and baseline medical data along with outcomes of hospitalization and death were collected. Statistical and machine learning models were applied to determine if viral genetic variants were associated with specific outcomes of hospitalization or death. Findings Full length SARS-CoV-2 sequence was obtained from 190 subjects with clinical outcome data. 35 (18.4%) were hospitalized and 14 (7.4%) died from complications of infection. A total of 289 single nucleotide variants were identified. Clustering methods demonstrated two major viral clades, which could be readily distinguished by 12 polymorphisms in 5 genes. A trend toward higher rates of hospitalization of patients with Clade 2 was observed (p=0.06). Machine learning models utilizing patient demographics and co-morbidities achieved area-under-the-curve (AUC) values of 0.93 for predicting hospitalization. Addition of viral clade or sequence information did not significantly improve models for outcome prediction. Conclusion SARS-CoV-2 shows substantial sequence diversity in a community-based sample. Two dominant clades of virus are in circulation. Among patients sufficiently ill to warrant testing for virus, no significant difference in outcomes of hospitalization or death could be discerned between clades in this sample. Major risk factors for hospitalization and death for either major clade of virus include patient age and comorbid conditions.
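
The abstract summarizes model quality as an area-under-the-curve (AUC) value. One way to read that number: it is the probability that a randomly chosen positive case (here, a hospitalized patient) receives a higher model score than a randomly chosen negative case, with ties counted as half. The sketch below computes AUC directly from that definition. The function name and the example labels and scores are made up for illustration and are unrelated to the study's data.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 outcomes; scores: matching model scores.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical outcomes and scores, for illustration only.
    y_true = [0, 0, 1, 1]
    y_score = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(y_true, y_score))  # 3 of 4 positive/negative pairs are ordered correctly -> 0.75
```

The pairwise form is quadratic in the number of samples, which is acceptable for cohorts of a few hundred patients; a rank-based formulation would be preferable for larger data.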


Predictive engineering and optimization of tryptophan metabolism in yeast through a combination of mechanistic and machine learning models

10.1101/858464 ◽

2019 ◽

Cited By ~ 1

Author(s):

Jie Zhang ◽

Søren D. Petersen ◽

Tijana Radivojevic ◽

Andrés Ramirez ◽

Andrés Pérez ◽

...

Keyword(s):

Machine Learning ◽

Machine Learning Algorithms ◽

Mechanistic Modeling ◽

Scale Model ◽

Data Sets ◽

Data Generation ◽

Learning Models ◽

A Genome ◽

Aromatic Amino Acid Metabolism ◽

Machine Learning Models

SUMMARY In combination with advanced mechanistic modeling and the generation of high-quality multi-dimensional data sets, machine learning is becoming an integral part of understanding and engineering living systems. Here we show that mechanistic and machine learning models can complement each other and be used in a combined approach to enable accurate genotype-to-phenotype predictions. We use a genome-scale model to pinpoint engineering targets and produce a large combinatorial library of metabolic pathway designs with different promoters which, once phenotyped, provide the basis for machine learning algorithms to be trained and used for new design recommendations. The approach enables successful forward engineering of aromatic amino acid metabolism in yeast, with the new recommended designs improving tryptophan production by up to 17% compared to the best designs used for algorithm training, and ultimately producing a total increase of 106% in tryptophan accumulation compared to optimized reference designs. Based on a single high-throughput data-generation iteration, this study highlights the power of combining mechanistic and machine learning models to enhance their predictive power and effectively direct metabolic engineering efforts.


Hospitalization and mortality associated with SARS-CoV-2 viral clades in COVID-19

Scientific Reports ◽

10.1038/s41598-021-82850-9 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Kenji Nakamichi ◽

Jolie Z. Shen ◽

Cecilia S. Lee ◽

Aaron Lee ◽

Emma A. Roberts ◽

...

Keyword(s):

Machine Learning ◽

Outcome Data ◽

Sequence Information ◽

Sequence Variants ◽

Clustering Methods ◽

Learning Models ◽

Single Nucleotide Variants ◽

First Case ◽

Significant Difference ◽

Machine Learning Models

Abstract The COVID-19 epidemic of 2019–20 is due to the novel coronavirus SARS-CoV-2. Following first case description in December 2019, this virus has infected over 10 million individuals and resulted in at least 500,000 deaths worldwide. The virus is undergoing rapid mutation, with two major clades of sequence variants emerging. This study sought to determine whether SARS-CoV-2 sequence variants are associated with differing outcomes among COVID-19 patients in a single medical system. Whole genome SARS-CoV-2 RNA sequence was obtained from isolates collected from patients registered in the University of Washington Medicine health system between March 1 and April 15, 2020. Demographic and baseline clinical characteristics of patients and their outcome data including their hospitalization and death were collected. Statistical and machine learning models were applied to determine if viral genetic variants were associated with specific outcomes of hospitalization or death. Full length SARS-CoV-2 sequence was obtained from 190 subjects with clinical outcome data. 35 (18.4%) were hospitalized and 14 (7.4%) died from complications of infection. A total of 289 single nucleotide variants were identified. Clustering methods demonstrated two major viral clades, which could be readily distinguished by 12 polymorphisms in 5 genes. A trend toward higher rates of hospitalization of patients with Clade 2 infections was observed (p = 0.06, Fisher’s exact). Machine learning models utilizing patient demographics and co-morbidities achieved area-under-the-curve (AUC) values of 0.93 for predicting hospitalization. Addition of viral clade or sequence information did not significantly improve models for outcome prediction. In summary, SARS-CoV-2 shows substantial sequence diversity in a community-based sample. Two dominant clades of virus are in circulation. Among patients sufficiently ill to warrant testing for virus, no significant difference in outcomes of hospitalization or death could be discerned between clades in this sample. Major risk factors for hospitalization and death for either major clade of virus include patient age and comorbid conditions.
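
The abstract cites a Fisher's exact test for the clade-versus-hospitalization comparison. As a minimal sketch of how a two-sided Fisher's exact p-value is obtained for a 2x2 table, the function below sums the hypergeometric probabilities of all tables (with the same margins) that are no more likely than the observed one. The abstract does not give the underlying counts, so the table in the example is entirely made up, and the function name is an assumption for this illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)  # number of tables with these margins

    def weight(x):
        # Number of arrangements with x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at most as likely as the observed one.
    # Integer weights are compared, so there is no floating-point tie issue.
    extreme = sum(weight(x) for x in range(lo, hi + 1) if weight(x) <= observed)
    return extreme / total


if __name__ == "__main__":
    # Hypothetical counts, NOT from the study:
    # rows = two groups, columns = outcome present / absent.
    p = fisher_exact_two_sided(8, 2, 1, 5)
    print(f"two-sided p = {p:.4f}")
```

For the made-up table above, 400 of the 11,440 possible arrangements are at least as extreme as the observed one, which gives p of roughly 0.035.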


Combining mechanistic and machine learning models for predictive engineering and optimization of tryptophan metabolism

Nature Communications ◽

10.1038/s41467-020-17910-1 ◽

2020 ◽

Vol 11 (1) ◽

Cited By ~ 8

Author(s):

Jie Zhang ◽

Søren D. Petersen ◽

Tijana Radivojevic ◽

Andrés Ramirez ◽

Andrés Pérez-Manríquez ◽

...

Keyword(s):

Machine Learning ◽

Machine Learning Algorithms ◽

Mechanistic Modeling ◽

Scale Model ◽

Data Generation ◽

Learning Models ◽

A Genome ◽

Single Data ◽

Aromatic Amino Acid Metabolism ◽

Machine Learning Models

Abstract Through advanced mechanistic modeling and the generation of large high-quality datasets, machine learning is becoming an integral part of understanding and engineering living systems. Here we show that mechanistic and machine learning models can be combined to enable accurate genotype-to-phenotype predictions. We use a genome-scale model to pinpoint engineering targets, efficient library construction of metabolic pathway designs, and high-throughput biosensor-enabled screening for training diverse machine learning algorithms. From a single data-generation cycle, this enables successful forward engineering of complex aromatic amino acid metabolism in yeast, with the best machine learning-guided design recommendations improving tryptophan titer and productivity by up to 74 and 43%, respectively, compared to the best designs used for algorithm training. Thus, this study highlights the power of combining mechanistic and machine learning models to effectively direct metabolic engineering efforts.


Reducing Sanger Confirmation Testing through False Positive Prediction Algorithms

10.1101/2020.04.30.066159 ◽

2020 ◽

Author(s):

James M. Holt ◽

Melissa Wilk ◽

Brett Sundlof ◽

Ghunwa Nakouzi ◽

David Bick ◽

...

Keyword(s):

Machine Learning ◽

False Positive ◽

Learning Models ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

False Positive Prediction ◽

Confirmatory Testing ◽

Clinical Genome Sequencing ◽

Reference Human Genome ◽

Machine Learning Models

Abstract Purpose Clinical genome sequencing (cGS) followed by orthogonal confirmatory testing is standard practice. While orthogonal testing significantly improves specificity it also results in increased turn-around-time and cost of testing. The purpose of this study is to evaluate machine learning models trained to identify false positive variants in cGS data to reduce the need for orthogonal testing. Methods We sequenced five reference human genome samples characterized by the Genome in a Bottle Consortium (GIAB) and compared the results to an established set of variants for each genome referred to as a ‘truth-set’. We then trained machine learning models to identify variants that were labeled as false positives. Results After training, the models identified 99.5% of the false positive heterozygous single nucleotide variants (SNVs) and heterozygous insertions/deletions variants (indels) while reducing confirmatory testing of true positive SNVs to 1.67% and indels to 20.29%. Employing the algorithm in clinical practice reduced orthogonal testing using dideoxynucleotide (Sanger) sequencing by 78.22%. Conclusion Our results indicate that a low false positive call rate can be maintained while significantly reducing the need for confirmatory testing. The framework that generated our models and results is publicly available at https://github.com/HudsonAlpha/STEVE.


Improving XGBoost with Imagination Sampling

Communications of the Blyth Institute ◽

10.33014/issn.2640-5652.2.1.holloway.1 ◽

2020 ◽

Vol 2 (1) ◽

pp. 3-6

Author(s):

Eric Holloway

Keyword(s):

Machine Learning ◽

General System ◽

Learning Models ◽

Starting Point ◽

Machine Learning Models

Imagination Sampling is the usage of a person as an oracle for generating or improving machine learning models. Previous work demonstrated a general system for using Imagination Sampling for obtaining multibox models. Here, the possibility of importing such models as the starting point for further automatic enhancement is explored.
