Biological machine learning combined with bacterial population genomics reveals common and rare allelic variants of genes to cause disease

2019 ◽  
Author(s):  
DJ Darwin R. Bandoy ◽  
Bart C. Weimer

Abstract: Highly dimensional data generated from bacterial whole genome sequencing is providing an unprecedented scale of information that requires appropriate statistical frameworks of analysis to infer biological function from bacterial genomic populations. Application of genome-wide association study (GWAS) methods is an emerging approach in bacterial population genomics that yields a list of genes associated with a phenotype, but leaves the relative importance of the candidates in the list undefined. Here, we validate the combination of GWAS, machine learning, and pathogenic bacterial population genomics as a novel scheme to identify SNPs and rank allelic variants to determine associations for accurate estimation of disease phenotype. This approach parsed a dataset of 1.2 million SNPs that resulted in a ranked importance of associated alleles of Campylobacter jejuni porA using multiple spatial locations over a 30-year period. We validated this approach using previously proven laboratory experimental alleles from an in vivo guinea pig abortion model. This approach, termed BioML, defined intestinal and extraintestinal groups that have differential allelic variants that cause abortion. Divergent variants containing indels that defeated gene callers were rescued using biological context and knowledge that resulted in defining rare and divergent variants that were maintained in the population over two continents and 30 years. This study defines the capability of machine learning coupled to GWAS and population genomics to simultaneously identify and rank alleles to define their role in abortion, and more broadly infectious disease.

2020 ◽  
Vol 8 (4) ◽  
pp. 549 ◽  
Author(s):  
DJ Darwin R. Bandoy ◽  
Bart C. Weimer

Highly dimensional data generated from bacterial whole-genome sequencing is providing an unprecedented scale of information that requires an appropriate statistical analysis framework to infer biological function from populations of genomes. The application of genome-wide association study (GWAS) methods is an appropriate framework for bacterial population genome analysis that yields a list of candidate genes associated with a phenotype, but it provides an unranked measure of importance. Here, we validated a novel framework to define infection mechanism using the combination of GWAS, machine learning, and bacterial population genomics that ranked allelic variants that accurately identified disease. This approach parsed a dataset of 1.2 million single nucleotide polymorphisms (SNPs) and indels that resulted in an importance ranked list of associated alleles of porA in Campylobacter jejuni using spatiotemporal analysis over 30 years. We validated this approach using previously proven laboratory experimental alleles from an in vivo guinea pig abortion model. This framework, termed μPathML, defined intestinal and extraintestinal groups that have differential allelic porA variants that cause abortion. Divergent variants containing indels that defeated automated annotation were rescued using biological context and knowledge that resulted in defining rare, divergent variants that were maintained in the population over two continents and 30 years. This study defines the capability of machine learning coupled with GWAS and population genomics to simultaneously identify and rank alleles to define their role in infectious disease mechanisms.


2020 ◽  
Vol 8 (12) ◽  
pp. 2043
Author(s):  
Shawn M. Higdon ◽  
Bihua C. Huang ◽  
Alan B. Bennett ◽  
Bart C. Weimer

Sierra Mixe maize is a landrace variety from Oaxaca, Mexico, that utilizes nitrogen derived from the atmosphere via an undefined nitrogen fixation mechanism. The diazotrophic microbiota associated with the plant’s mucilaginous aerial root exudate composed of complex carbohydrates was previously identified and characterized by our group where we found 23 lactococci capable of biological nitrogen fixation (BNF) without containing any of the proposed essential genes for this trait (nifHDKENB). To determine the genes in Lactococcus associated with this phenotype, we selected 70 lactococci from the dairy industry that are not known to be diazotrophic to conduct a comparative population genomic analysis. This showed that the diazotrophic lactococcal genomes were distinctly different from the dairy isolates. Examining the pangenome followed by genome-wide association study and machine learning identified genes with the functions needed for BNF in the maize isolates that were absent from the dairy isolates. Many of the putative genes received an ‘unknown’ annotation, which led to the domain analysis of the 135 homologs. This revealed genes with molecular functions needed for BNF, including mucilage carbohydrate catabolism, glycan-mediated host adhesion, iron/siderophore utilization, and oxidation/reduction control. This is the first report of this pathway in this organism to underpin BNF. Consequently, we proposed a model needed for BNF in lactococci that plausibly accounts for BNF in the absence of the nif operon in this organism.


2019 ◽  
Author(s):  
Anton Levitan ◽  
Andrew N. Gale ◽  
Emma K. Dallon ◽  
Darby W. Kozan ◽  
Kyle W. Cunningham ◽  
...  

ABSTRACT: In vivo transposon mutagenesis, coupled with deep sequencing, enables large-scale genome-wide mutant screens for genes essential in different growth conditions. We analyzed six large-scale studies performed on haploid strains of three yeast species (Saccharomyces cerevisiae, Schizosaccharomyces pombe, and Candida albicans), each mutagenized with two of three different heterologous transposons (AcDs, Hermes, and PiggyBac). Using a machine-learning approach, we evaluated the ability of the data to predict gene essentiality. Important data features included sufficient numbers and distribution of independent insertion events. All transposons showed some bias in insertion site preference because of jackpot events, and preferences for specific insertion sequences and short-distance vs long-distance insertions. For PiggyBac, a stringent target sequence limited the ability to predict essentiality in genes with few or no target sequences. The machine learning approach also robustly predicted gene function in less well-studied species by leveraging cross-species orthologs. Finally, comparisons of isogenic diploid versus haploid S. cerevisiae isolates identified several genes that are haplo-insufficient, while most essential genes, as expected, were recessive. We provide recommendations for the choice of transposons and the inference of gene essentiality in genome-wide studies of eukaryotic haploid microbes such as yeasts, including species that have been less amenable to classical genetic studies.


2021 ◽  
Author(s):  
Michael Burns ◽  
Jonathan Renk ◽  
David Eickholt ◽  
Amanda Gilbert ◽  
Travis Hattery ◽  
...  

Lack of high throughput phenotyping systems for determining moisture content during the maize nixtamalization cooking process has led to difficulty in breeding for this trait. This study provides a high throughput, quantitative measure of kernel moisture content during nixtamalization based on NIR scanning of uncooked maize kernels. Machine learning was utilized to develop models based on the combination of NIR spectra and moisture content determined from a scaled-down benchtop cook method. A linear support vector machine (SVM) model with a Spearman's rank correlation coefficient of 0.852 between wet lab and predicted values was developed from 100 diverse temperate genotypes grown in replicate across two environments. This model was applied to NIR data from 501 diverse temperate genotypes grown in replicate in five environments. Analysis of variance revealed environment explained the highest percent of the variation (51.5%), followed by genotype (15.6%) and genotype-by-environment interaction (11.2%). A genome-wide association study identified 26 significant loci across five environments that explained between 5.04% and 16.01% (average = 10.41%). However, genome-wide markers explained 10.54% to 45.99% (average = 31.68%) of the variation, indicating the genetic architecture of this trait is likely complex and controlled by many loci of small effect. This study provides a high-throughput method to evaluate moisture content during nixtamalization that is feasible at the scale of a breeding program and provides important information about the factors contributing to variation of this trait for breeders and food companies to make future strategies to improve this important processing trait.
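The abstract above reports model quality as a Spearman's rank correlation coefficient between wet lab and predicted moisture values. As a minimal sketch of how such a coefficient can be computed (this is not the authors' code; the function names and the toy numbers are illustrative assumptions), the values are converted to ranks, with ties sharing their average rank, and a Pearson correlation is then taken on the ranks:

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Toy, made-up values standing in for measured vs. predicted moisture (%).
measured = [41.2, 43.5, 39.8, 45.1, 42.0]
predicted = [43.0, 41.0, 40.5, 44.6, 43.9]
rho = spearman(measured, predicted)
```

Because only the ordering of the values matters, the coefficient rewards a model that ranks genotypes correctly even when its absolute predictions are offset, which is arguably the relevant property when the goal is selecting lines in a breeding program.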


Metabolites ◽  
2022 ◽  
Vol 12 (1) ◽  
pp. 82
Author(s):  
Atsushi Kimura ◽  
Akiyoshi Hirayama ◽  
Tatsuaki Matsumoto ◽  
Yuiko Sato ◽  
Tami Kobayashi ◽  
...  

Ossification of the posterior longitudinal ligament (OPLL), a disease characterized by the ectopic ossification of a spinal ligament, promotes neurological disorders associated with spinal canal stenosis. While blocking ectopic ossification is mandatory to prevent OPLL development and progression, the mechanisms underlying the condition remain unknown. Here we show that expression of hydroxyacid oxidase 1 (Hao1), a gene identified in a previous genome-wide association study (GWAS) as an OPLL-associated candidate gene, specifically and significantly decreased in fibroblasts during osteoblast differentiation. We then newly established Hao1-deficient mice by generating Hao1-flox mice and crossing them with CAG-Cre mice to yield global Hao1-knockout (CAG-Cre/Hao1flox/flox; Hao1 KO) animals. Hao1 KO mice were born normally and exhibited no obvious phenotypes, including growth retardation. Moreover, Hao1 KO mice did not exhibit ectopic ossification or calcification. However, urinary levels of some metabolites of the tricarboxylic acid (TCA) cycle were significantly lower in Hao1 KO compared to control mice based on comprehensive metabolomic analysis. Our data indicate that Hao1 loss does not promote ectopic ossification, but rather that Hao1 functions to regulate the TCA cycle in vivo.


2021 ◽  
Author(s):  
Aitzkoa Lopez de Lapuente Portilla ◽  
Ludvig Ekdahl ◽  
Caterina Cafaro ◽  
Zain Ali ◽  
Natsumi Miharada ◽  
...  

Understanding how hematopoietic stem and progenitor cells (HSPCs) are regulated is of central importance for the development of new therapies for blood disorders and stem cell transplantation. To date, HSPC regulation has been extensively studied in vitro and in animal models, but less is known about the mechanisms in vivo in humans. Here, in a genome-wide association study on 13,167 individuals, we identify 9 significant and 2 suggestive DNA sequence variants that influence HSPC (CD34+) levels in human blood. The identified loci associate with blood disorders, harbor known and novel HSPC genes, and affect gene expression in HSPCs. Interestingly, our strongest association maps to the PPM1H gene, encoding an evolutionarily conserved serine/threonine phosphatase never previously implicated in stem cell biology. PPM1H is expressed in HSPCs, and the allele that confers higher blood CD34+ cell levels downregulates PPM1H. By functional fine-mapping, we find that this downregulation is caused by the variant rs772557-A, which abrogates a MYB transcription factor binding site in PPM1H intron 1 that is active in specific HSPC subpopulations, including hematopoietic stem cells, and interacts with the promoter by chromatin looping. Furthermore, rs772557-A selectively increases HSPC subpopulations in which the MYB site is active, and PPM1H shRNA-knockdown increases CD34+ and CD34+90+ cell proportions in umbilical cord blood assays. Our findings represent the first large-scale association study on a stem cell trait, illuminating HSPC regulation in vivo in humans, and identifying PPM1H as a novel inhibition target that can potentially be utilized clinically to facilitate stem cell harvesting for transplantation.

