Comparing the genetic and environmental architecture of blood count, blood biochemistry and urine biochemistry biological ages with machine learning

While a large number of biological age predictors have been built from blood samples, a blood count-based biological age predictor is lacking, and the genetic and environmental factors associated with blood-measured accelerated aging remain elusive. Here, we leveraged 31 blood count biomarkers measured from 489,079 blood samples, 28 blood biochemistry biomarkers measured from 245,147 blood samples, and four urine biochemistry biomarkers measured from 158,381 samples to build three distinct biological age predictors by training machine learning models to predict age. Blood biochemistry significantly outperformed blood count and urine biochemistry in terms of age prediction (RMSE: 5.92±0.02 vs. 7.60±0.02 years and 7.72±0.04 years). We performed genome-wide association studies (GWASs) and found accelerated blood biochemistry, blood count and urine biochemistry aging to be respectively 26.2±0.3%, 18.1±0.2% and 10.5±0.5% GWAS-heritable. We identified 1,081 single nucleotide polymorphisms (SNPs) associated with accelerated blood biochemistry aging, 2,636 SNPs associated with accelerated blood cell aging and 24 SNPs associated with accelerated urine biochemistry aging. Similarly, we identified biomarkers, clinical phenotypes, diseases, environmental and socioeconomic factors associated with accelerated blood biochemistry, blood cell and urine biochemistry aging.
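As a rough illustration of the approach summarized above, the sketch below trains a model to predict age from a biomarker, reports the RMSE, and takes the gap between predicted and chronological age as a proxy for accelerated aging. Everything in it is an assumption made for illustration: the data are synthetic, a single-biomarker least-squares line stands in for the study's machine learning models, and the abstract does not state how accelerated aging was actually computed.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares for y ~ a + b * x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(predicted, observed):
    """Root mean squared error between two equal-length sequences."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


# Synthetic cohort (hypothetical): one biomarker that drifts with age plus noise.
rng = random.Random(0)
ages = [rng.uniform(40, 70) for _ in range(500)]
biomarker = [0.1 * age + rng.gauss(0, 0.5) for age in ages]

# Train the age predictor and evaluate it on the same synthetic samples.
intercept, slope = fit_line(biomarker, ages)
predicted_age = [intercept + slope * value for value in biomarker]
error = rmse(predicted_age, ages)

# Assumed definition: positive values mean the sample "looks older" than it is.
acceleration = [p - a for p, a in zip(predicted_age, ages)]
```

In a real analysis the error would be estimated on held-out samples rather than on the training data, and the resulting acceleration values would serve as the phenotype for the GWAS.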

White Blood Cells and Severe COVID-19: A Mendelian Randomization Study

Journal of Personalized Medicine ◽

10.3390/jpm11030195 ◽

2021 ◽

Vol 11 (3) ◽

pp. 195

Author(s):

Yitang Sun ◽

Jingqi Zhou ◽

Kaixiong Ye

Keyword(s):

Blood Cell ◽

Cell Count ◽

White Blood Cell Count ◽

White Blood Cell ◽

Blood Cells ◽

Mendelian Randomization ◽

Association Studies ◽

White Blood Cells ◽

Genome Wide Association Studies ◽

Blood Cell Count

Increasing evidence shows that white blood cells are associated with the risk of coronavirus disease 2019 (COVID-19), but the direction and causality of this association are not clear. To evaluate the causal associations between various white blood cell traits and COVID-19 susceptibility and severity, we conducted two-sample bidirectional Mendelian randomization (MR) analyses with summary statistics from the largest and most recent genome-wide association studies. Our MR results indicated causal protective effects of higher basophil count, basophil percentage of white blood cells, and myeloid white blood cell count on severe COVID-19, with odds ratios (OR) per standard deviation increment of 0.75 (95% CI: 0.60–0.95), 0.70 (95% CI: 0.54–0.92), and 0.85 (95% CI: 0.73–0.98), respectively. Neither COVID-19 severity nor susceptibility was associated with white blood cell traits in our reverse MR results. Genetically predicted high basophil count, basophil percentage of white blood cells, and myeloid white blood cell count are associated with a lower risk of developing severe COVID-19. Individuals with a lower genetic capacity for basophils are likely at risk, while enhancing the production of basophils may be an effective therapeutic strategy.
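To make the reported odds ratios easier to interpret, here is a minimal sketch of how a two-sample MR estimate can be obtained from GWAS summary statistics using the inverse-variance weighted (IVW) method. The abstract does not say which MR estimator was used, so IVW is an assumption, and the per-SNP effect sizes below are made-up numbers, not values from the study.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate of the causal effect and its standard error.

    beta_exposure: per-SNP effects on the exposure (e.g. basophil count)
    beta_outcome:  per-SNP effects on the outcome (log odds of severe disease)
    se_outcome:    standard errors of the outcome effects
    """
    numerator = sum(
        bx * by / se ** 2
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    denominator = sum(
        bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)
    )
    return numerator / denominator, math.sqrt(1.0 / denominator)


def odds_ratio_with_ci(beta, se, z=1.96):
    """Convert a log-odds effect into an odds ratio with a 95% interval."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Hypothetical summary statistics for three instrument SNPs.
beta_exposure = [0.10, 0.20, 0.15]
beta_outcome = [-0.030, -0.060, -0.045]
se_outcome = [0.010, 0.020, 0.015]

beta, se = ivw_estimate(beta_exposure, beta_outcome, se_outcome)
odds_ratio, ci_low, ci_high = odds_ratio_with_ci(beta, se)
```

An odds ratio below 1 with an interval that excludes 1, as in the results quoted above, corresponds to a protective effect per standard deviation increase in the exposure. The reverse direction of a bidirectional analysis swaps the roles of exposure and outcome and selects a new set of instruments.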

Forest and Trees: Exploring Bacterial Virulence with Genome-wide Association Studies and Machine Learning

Trends in Microbiology ◽

10.1016/j.tim.2020.12.002 ◽

2021 ◽

Author(s):

Jonathan P. Allen ◽

Evan Snitkin ◽

Nathan B. Pincus ◽

Alan R. Hauser

Keyword(s):

Machine Learning ◽

Association Studies ◽

Bacterial Virulence ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genome Wide

A description of large-scale metabolomics studies: increasing value by combining metabolomics with genome-wide SNP genotyping and transcriptional profiling

Journal of Endocrinology ◽

10.1530/joe-12-0144 ◽

2012 ◽

Vol 215 (1) ◽

pp. 17-28 ◽

Cited By ~ 18

Author(s):

Georg Homuth ◽

Alexander Teumer ◽

Uwe Völker ◽

Matthias Nauck

Keyword(s):

Blood Cells ◽

Large Scale ◽

Genetic Factors ◽

Association Studies ◽

Transcriptional Profiling ◽

Genome Wide Association Studies ◽

Protein Levels ◽

Future Developments ◽

Genome Wide ◽

Metabolome Data

The metabolome, defined as the reflection of metabolic dynamics derived from parameters measured primarily in easily accessible body fluids such as serum, plasma, and urine, can be considered the omics data pool that is closest to the phenotype because it integrates genetic influences as well as nongenetic factors. Metabolic traits can be related to genetic polymorphisms in genome-wide association studies, enabling the identification of underlying genetic factors, as well as to specific phenotypes, resulting in the identification of metabolome signatures primarily caused by nongenetic factors. Similarly, correlating metabolome data with transcriptional and/or proteome profiles of blood cells also produces valuable data by revealing associations between metabolic changes and mRNA and protein levels. In recent years, the most impressive progress has been made in correlating genetic variation with metabolome profiles. This review therefore summarizes the most important of these studies and gives an outlook on future developments.

Combining Multiple Hypothesis Testing with Machine Learning Increases the Statistical Power of Genome-wide Association Studies

Scientific Reports ◽

10.1038/srep36671 ◽

2016 ◽

Vol 6 (1) ◽

Cited By ~ 20

Author(s):

Bettina Mieth ◽

Marius Kloft ◽

Juan Antonio Rodríguez ◽

Sören Sonnenburg ◽

Robin Vobruba ◽

...

Keyword(s):

Machine Learning ◽

Hypothesis Testing ◽

Statistical Power ◽

Association Studies ◽

Multiple Hypothesis Testing ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Multiple Hypothesis ◽

Genome Wide

Natural hemoplasma infection of cats in Cuiaba, Mato Grosso, Brazil

Semina Ciências Agrárias ◽

10.5433/1679-0359.2018v39n2p875 ◽

2018 ◽

Vol 39 (2) ◽

pp. 875 ◽

Cited By ~ 1

Author(s):

Herica Makino ◽

Daphine Ariadne Jesus de Paula ◽

Valéria Regia Franco Sousa ◽

Adriane Jorge Mendonça ◽

Valéria Dutra ◽

...

Keyword(s):

Blood Count ◽

Clinical Care ◽

Mato Grosso ◽

Blood Samples ◽

Factors Associated ◽

Mixed Breed ◽

The City ◽

Veterinary Hospital ◽

Candidatus Mycoplasma Turicensis

The aim of this research was to investigate natural hemoplasma infection in cats treated at the Veterinary Hospital of the Federal University of Mato Grosso, and the factors associated with infection. Blood samples from 151 cats of different sexes, breeds, and ages were analyzed by PCR and blood count. The overall occurrence of hemoplasma was 25.8%. Mycoplasma haemofelis (Mhf), ‘Candidatus Mycoplasma haemominutum’ (CMhm), and ‘Candidatus Mycoplasma turicensis’ (CMt) were observed in 15.2%, 14.6% and 2.6% of cats, respectively. Co-infection was observed in 6.6% of cases. Male sex and mixed breed were associated with infection by CMhm (P = 0.02 and 0.04, respectively). The data demonstrated an occurrence of 25.8% for hemoplasma infection in felines receiving clinical care in the city of Cuiabá, where males were at higher risk of acquiring infection by these agents, in addition to a higher risk for CMhm in felines with no specific breed.

Multiple similarly effective solutions exist for biomedical feature selection and classification problems

Scientific Reports ◽

10.1038/s41598-017-13184-8 ◽

2017 ◽

Vol 7 (1) ◽

Cited By ~ 9

Author(s):

Jiamei Liu ◽

Cheng Xu ◽

Weifeng Yang ◽

Yayun Shu ◽

Weiwei Zheng ◽

...

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Association Studies ◽

Binary Classification ◽

Learning Algorithms ◽

Optimal Solution ◽

Machine Learning Algorithms ◽

Disease Classification ◽

Genome Wide Association Studies ◽

Classification Problems

Binary classification is a widely employed formulation for decisions on various biomedical big data questions, such as clinical drug trials comparing treated participants with controls, and genome-wide association studies (GWASs) comparing participants with and without a phenotype. A machine learning model is trained for this purpose by optimizing its power to discriminate samples from the two groups. However, most classification algorithms tend to generate one locally optimal solution according to the input dataset and the mathematical presumptions about the dataset. Here we demonstrated, from the aspects of both disease classification and feature selection, that multiple different solutions may have similar classification performances. So the existing machine learning algorithms may have ignored a whole school of fish by catching only one good one. Since most existing machine learning algorithms generate a solution by optimizing a mathematical goal, understanding the biological mechanisms behind the investigated classification question may require considering both the generated solution and the ignored ones.

Machine Learning in Genome-Wide Association Studies

10.3389/978-2-88966-229-6 ◽

2020 ◽

Keyword(s):

Machine Learning ◽

Association Studies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genome Wide

Learning polygenic scores for human blood cell traits

10.1101/2020.02.17.952788 ◽

2020 ◽

Author(s):

Yu Xu ◽

Dragana Vuckovic ◽

Scott C Ritchie ◽

Parsa Akbari ◽

Tao Jiang ◽

...

Keyword(s):

Machine Learning ◽

Blood Cell ◽

Association Studies ◽

Relative Effectiveness ◽

Univariate Analysis ◽

Genetic Correlations ◽

Genome Wide Association Studies ◽

Learning Methods ◽

Polygenic Scores ◽

Using Data

Polygenic scores (PGSs) for blood cell traits can be constructed using summary statistics from genome-wide association studies. Because the selection of variants and the modelling of their interactions in PGSs may be limited by univariate analysis, such a conventional method may yield sub-optimal performance. This study evaluated the relative effectiveness of four machine learning and deep learning methods, as well as a univariate method, in the construction of PGSs for 26 blood cell traits, using data from UK Biobank (n=~400,000) and INTERVAL (n=~40,000). Our results showed that learning methods can improve PGS construction for nearly every blood cell trait considered, with this superiority explained by the ability of machine learning methods to capture interactions among variants. This study also demonstrated that populations can be well stratified by the PGSs of these blood cell traits, even for traits that exhibit large differences between ages and sexes, suggesting potential for disease prevention. As our study found genetic correlations between the PGSs for blood cell traits and PGSs for several common human diseases (recapitulating well-known associations between the blood cell traits themselves and certain diseases), it suggests that blood cell traits may be indicators and/or mediators for a variety of common disorders via shared genetic variants and functional pathways.
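For readers unfamiliar with the univariate baseline mentioned above, a conventional polygenic score is a weighted sum of allele dosages, with weights taken from GWAS effect sizes. The sketch below illustrates that idea only; the variant weights, sample names and genotypes are hypothetical, and it does not reproduce the machine learning or deep learning methods evaluated in the study.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of allele dosages (0, 1 or 2 copies of the effect allele)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must cover the same variants")
    return sum(d * w for d, w in zip(dosages, weights))


# Hypothetical per-variant effect sizes, as would come from GWAS summary statistics.
weights = [0.12, -0.05, 0.30]

# Hypothetical genotypes: one dosage per variant for each individual.
cohort = {
    "person_a": [0, 1, 2],
    "person_b": [2, 2, 0],
    "person_c": [1, 0, 1],
}

scores = {name: polygenic_score(g, weights) for name, g in cohort.items()}

# Rank individuals from highest to lowest score, e.g. for stratification.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Because each variant contributes independently to this sum, interactions among variants are not modelled, which is the limitation the abstract attributes to the univariate method.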

Identification of disease-associated loci using machine learning for genotype and network data integration

Bioinformatics ◽

10.1093/bioinformatics/btz310 ◽

2019 ◽

Vol 35 (24) ◽

pp. 5182-5190 ◽

Cited By ~ 4

Author(s):

Luis G Leal ◽

Alessia David ◽

Marjo-Riitta Jarvelin ◽

Sylvain Sebert ◽

Minna Männikkö ◽

...

Keyword(s):

Machine Learning ◽

Gene Networks ◽

Association Studies ◽

R Package ◽

Biological Data ◽

Machine Learning Algorithms ◽

Supplementary Information ◽

Genome Wide Association Studies ◽

Omics Data ◽

Missing Heritability

Motivation: Integration of different omics data could markedly help to identify biological signatures, understand the missing heritability of complex diseases and ultimately achieve personalized medicine. Standard regression models used in Genome-Wide Association Studies (GWAS) identify loci with a strong effect size, whereas GWAS meta-analyses are often needed to capture weak loci contributing to the missing heritability. Development of novel machine learning algorithms for merging genotype data with other omics data is highly needed as it could enhance the prioritization of weak loci. Results: We developed cNMTF (corrected non-negative matrix tri-factorization), an integrative algorithm based on clustering techniques of biological data. This method assesses the inter-relatedness between genotypes, phenotypes, the damaging effect of the variants and gene networks in order to identify loci-trait associations. cNMTF was used to prioritize genes associated with lipid traits in two population cohorts. We replicated 129 genes reported in GWAS world-wide and provided evidence that supports 85% of our findings (226 out of 265 genes), including recent associations in literature (NLGN1), regulators of lipid metabolism (DAB1) and pleiotropic genes for lipid traits (CARM1). Moreover, cNMTF performed efficiently against strong population structures by accounting for the individuals’ ancestry. As the method is flexible in the incorporation of diverse omics data sources, it can be easily adapted to the user’s research needs. Availability and implementation: An R package (cnmtf) is available at https://lgl15.github.io/cnmtf_web/index.html. Supplementary information: Supplementary data are available at Bioinformatics online.

Faculty Opinions recommendation of Combining Multiple Hypothesis Testing with Machine Learning Increases the Statistical Power of Genome-wide Association Studies.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.727038147.793530327 ◽

2017 ◽

Author(s):

Chloé-Agathe Azencott

Keyword(s):

Machine Learning ◽

Hypothesis Testing ◽

Statistical Power ◽

Association Studies ◽

Multiple Hypothesis Testing ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Multiple Hypothesis ◽

Genome Wide
