Abstract
Objectives
Understanding host-microbe interactions requires computationally intensive, multivariate machine learning approaches. Accordingly, we aimed to identify fecal bacterial biomarkers that predict dietary intake with high accuracy.
Methods
Data were aggregated from five randomized, controlled feeding studies in adults (n = 199) that provided avocados, almonds, broccoli, walnuts, or whole grains (whole grain oats and whole grain barley). In each study, fecal samples were collected during the treatment and control periods for DNA extraction. The V4 region of the 16S rRNA gene was then amplified and sequenced, and sequence data were analyzed using DADA2 and QIIME2. Marginal screening with the Kruskal-Wallis test was performed on all species-level taxa to examine differences between each of the six treatment groups and its respective control group. The top 20 species from each diet were selected and pooled for multiclass classification using random forest. The resulting set of bacterial species was then reduced in a stepwise fashion, iteratively refitting the model and using the variable importance generated by random forest, to determine a compact feature set with only a minor loss of accuracy in predicting the food consumed.
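The screening and classification steps described above can be sketched as follows. This is a minimal illustration on synthetic data, not the study's pipeline: the function names, the planted signal, the forest size, the use of out-of-bag accuracy, and the single-pass importance ranking (in place of the iterative stepwise reduction) are all assumptions made for the example.

```python
import numpy as np
from scipy.stats import kruskal
from sklearn.ensemble import RandomForestClassifier


def screen_top_taxa(treat, ctrl, k=20):
    """Return indices of the k taxa with the smallest Kruskal-Wallis p-values."""
    pvals = np.ones(treat.shape[1])
    for j in range(treat.shape[1]):
        a, b = treat[:, j], ctrl[:, j]
        # Skip taxa whose values are all identical; they keep p = 1.
        if np.ptp(np.concatenate([a, b])) > 0:
            pvals[j] = kruskal(a, b).pvalue
    return np.argsort(pvals)[:k]


def fit_forest(X, y, seed=0):
    """Fit a random forest; return it with its out-of-bag accuracy (assumed metric)."""
    rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=seed)
    rf.fit(X, y)
    return rf, rf.oob_score_


def reduce_features(X, y, features, sizes=(30, 20, 10)):
    """Rank features by importance once, then refit on the top m for each m."""
    rf, _ = fit_forest(X[:, features], y)
    ranked = np.asarray(features)[np.argsort(rf.feature_importances_)[::-1]]
    return {m: fit_forest(X[:, ranked[:m]], y)[1] for m in sizes if m <= len(ranked)}


# Synthetic demo (hypothetical values, not study data).
rng = np.random.default_rng(0)
diets = ["avocado", "almond", "broccoli", "walnut", "whole_grain"]
n_per_group, n_taxa = 30, 60
X_parts, y_parts, selected = [], [], set()
for d, diet in enumerate(diets):
    ctrl = rng.normal(size=(n_per_group, n_taxa))
    treat = rng.normal(size=(n_per_group, n_taxa))
    treat[:, 3 * d:3 * d + 3] += 10.0  # plant three diet-responsive taxa
    # Screen each diet against its own control, then pool the selected taxa.
    selected.update(screen_top_taxa(treat, ctrl, k=3).tolist())
    X_parts.append(treat)
    y_parts.append(np.full(n_per_group, diet))

X, y = np.vstack(X_parts), np.concatenate(y_parts)
features = sorted(selected)
_, acc_all = fit_forest(X[:, features], y)
acc_by_size = reduce_features(X, y, features, sizes=(10,))
```

Out-of-bag accuracy is used here only because it comes for free with a random forest; the abstract does not state how accuracy was estimated, and cross-validation would serve equally well in this sketch.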
Results
When all six foods were analyzed together using the top 20 species from each diet, oats and barley were frequently confused with each other, with classification errors of 44% and 47%, respectively, and the overall model accuracy was 66%. Collapsing oats and barley into a single category, whole grains, reduced the classification error for that category to 6% and improved the overall model accuracy to 73%. Refitting the random forest with the 30, 20, and 10 most important species correctly identified the five foods (avocados, almonds, broccoli, walnuts, and whole grains) 75%, 74%, and 70% of the time, respectively.
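Per-class classification error of the kind reported above is one minus the fraction of a class's samples that were predicted correctly, which can be read off a confusion matrix. The toy sketch below (hypothetical labels and predictions, not study data) shows why merging two mutually confused classes lowers the error. It relabels existing predictions for simplicity, whereas the analysis above refit the model on the collapsed categories.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def class_errors(y_true, y_pred, labels):
    """Per-class error: 1 - (correct predictions / true samples) for each label."""
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    return dict(zip(labels, 1 - np.diag(cm) / cm.sum(axis=1)))


def collapse(y, mapping):
    """Relabel classes according to mapping; unmapped labels are kept."""
    return np.array([mapping.get(v, v) for v in y])


# Toy data: oats and barley are confused with each other half the time.
y_true = np.array(["oats"] * 4 + ["barley"] * 4 + ["walnut"] * 4)
y_pred = np.array(["oats", "oats", "barley", "barley",
                   "barley", "barley", "oats", "oats",
                   "walnut", "walnut", "walnut", "walnut"])

errors = class_errors(y_true, y_pred, ["oats", "barley", "walnut"])

# Merge the two grains into one category and recompute.
grains = {"oats": "whole_grain", "barley": "whole_grain"}
errors_merged = class_errors(collapse(y_true, grains), collapse(y_pred, grains),
                             ["whole_grain", "walnut"])
```

In this toy case every oats/barley mix-up becomes a correct "whole_grain" call once the classes are merged, so the merged category's error drops to zero.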
Conclusions
These results show promise for accurately predicting foods consumed using bacterial species as biomarkers. Ongoing analyses include incorporating metagenomic and metabolomic data into the models to improve predictive accuracy and using the multi-omics dataset to predict health status. In the long term, these approaches may inform diet-microbiota-tailored recommendations.
Funding Sources
This research was funded by The Foundation for Food and Agriculture Research, USDA, Hass Avocado Board, and USDA National Institute of Food and Agriculture, Hatch project 1009249.