Machine Learning-assisted Identification of Bioindicators Predicts Medium-chain Carboxylate Production Performance of an Anaerobic Mixed Culture
Abstract

Background: The ability to quantitatively predict ecophysiological functions of microbial communities is an important step towards engineering microbiota for desired functions related to specific biochemical conversions. Here, we present the quantitative prediction of medium-chain carboxylate production in two continuous anaerobic bioreactors from 16S rRNA gene dynamics in enrichment cultures.

Results: By progressively shortening the hydraulic retention time from 8 days to 2 days, with different temporal schemes in the two bioreactors operated for 211 days, we achieved higher productivities and yields of the target products n-caproate and n-caprylate. The datasets generated from each bioreactor were applied independently for training and testing in machine learning. A predictive model was generated with the random forest algorithm using 16S rRNA amplicon sequencing data. The model predicted n-caproate and n-caprylate productivities with more than 90% accuracy. Four inferred bioindicators, belonging to the genera Olsenella, Lactobacillus, Syntrophococcus and Clostridium IV, appear relevant to the higher carboxylate productivity at shorter hydraulic retention times. The recovery of metagenome-assembled genomes of these bioindicators confirmed their genetic potential to perform key steps of medium-chain carboxylate production.

Conclusions: Shortening the hydraulic retention time of the continuous bioreactor systems makes it possible to shape communities with the desired chain elongation functions. Using machine learning, we demonstrated that 16S rRNA amplicon sequencing data can be used to predict bioreactor process performance quantitatively and accurately. Characterising and harnessing bioindicators holds promise for managing reactor microbiota towards selection of the target processes. Our mathematical framework is transferable to other ecosystem processes and microbial systems where community dynamics are linked to key functions.
The general methodology can be adapted to data types of other functional categories such as genes, transcripts, proteins or metabolites.
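
The train-on-one-reactor, test-on-the-other idea summarised above could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the data are synthetic, the helper `simulate_reactor`, the number of taxa, the link between the first four taxa and productivity, and the use of R² as the score are all assumptions made for the example. It uses scikit-learn's `RandomForestRegressor` and ranks taxa by impurity-based feature importance as one possible way to shortlist candidate bioindicators.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)


def simulate_reactor(n_samples, n_taxa=30):
    """Synthetic stand-in for one bioreactor (hypothetical, not real data).

    Returns a (samples x taxa) relative-abundance table and a productivity
    value per sample.
    """
    counts = rng.gamma(shape=2.0, scale=1.0, size=(n_samples, n_taxa))
    abundance = counts / counts.sum(axis=1, keepdims=True)  # each row sums to 1
    # Assumption for illustration: productivity depends on the first four taxa.
    productivity = 50 * abundance[:, :4].sum(axis=1) + rng.normal(0, 0.2, n_samples)
    return abundance, productivity


# One reactor supplies the training set, the other the independent test set.
X_a, y_a = simulate_reactor(60)
X_b, y_b = simulate_reactor(40)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_a, y_a)

preds = model.predict(X_b)
score = r2_score(y_b, preds)  # R^2 chosen here only as an example metric

# Rank taxa by impurity-based importance and keep the top four as candidates.
importances = model.feature_importances_
top_features = np.argsort(importances)[::-1][:4]

print(f"R^2 on held-out reactor: {score:.2f}")
print("Candidate bioindicator columns:", top_features.tolist())
```

With real data, the abundance table would come from the 16S rRNA amplicon counts of each reactor and the target from the measured n-caproate or n-caprylate productivity; swapping the roles of the two reactors gives a second, independent check of the same model design.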