INFLAMMATORY BOWEL DISEASE CLASSIFICATION USING THE GUT MICROBIOME: A BENCHMARK OF MICROBIAL DATA ANALYSIS METHODS
Abstract
The prevalence of inflammatory bowel disease (IBD) is increasing throughout the developed world. For newly diagnosed patients, the interval between the appearance of symptoms and diagnosis can span months and involve invasive procedures. There is an urgent need for a simple, low-cost, accurate and non-invasive diagnostic test. With the decreasing cost of next-generation sequencing, many studies have compared IBD gut microbiomes to those of healthy controls, successfully identifying bacterial biomarkers for IBD. Unfortunately, a majority of these studies apply machine learning and statistical methods to either a single dataset or datasets with small sample sizes. The resulting disease classification models are highly overfit and therefore have minimal clinical application to new patient cohorts. Several data preprocessing methods are available for data normalization and for the reduction of cohort-specific signals (batch reduction), which can address this lack of cross-dataset performance. Given the abundance of potential methods, there is a need to benchmark the performance and generalizability of various machine learning pipelines (each a combination of data preprocessing and model) for microbiome-based IBD diagnostic tools. We used a collection of 12 IBD-associated North American microbiome datasets (~4000 samples) to benchmark several machine learning pipelines. Raw sequencing data were processed, collapsed at the OTU or genus level, and merged using QIIME2. Datasets were then normalized using either sum-scaling or log-based methods, and batch reduction was performed using either zero-centering or Empirical Bayes approaches. Pipeline performance was evaluated using binary accuracy, AUC, F1 score and Matthews correlation coefficient (MCC). Generalizability was evaluated using leave-one-out cross-validation at the study level, in which the data from one study were withheld from the training set and used as the test set.
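The study-level evaluation described above can be illustrated with a minimal sketch. It assumes each sample carries a study label and binary class labels (1 for IBD, 0 for control); the function names `leave_one_study_out` and `mcc` are hypothetical and are not taken from the benchmark's code.

```python
from collections import defaultdict
from math import sqrt


def leave_one_study_out(study_ids):
    """Yield (held_out_study, train_indices, test_indices) once per study."""
    by_study = defaultdict(list)
    for idx, study in enumerate(study_ids):
        by_study[study].append(idx)
    for held_out in sorted(by_study):
        # Every sample from the held-out study goes to the test set.
        test_idx = by_study[held_out]
        train_idx = sorted(
            i for study, idxs in by_study.items() if study != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = IBD, 0 = control)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: return 0.0 when the denominator is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy example: five samples drawn from three hypothetical studies.
splits = list(leave_one_study_out(["A", "A", "B", "B", "C"]))
```

Because no sample from the held-out study is seen during training, a score computed this way reflects how a pipeline might transfer to an unseen cohort rather than how well it fits cohort-specific signal.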
The best-performing and most generalizable pipeline combined a Random Forest model with centered log-ratio normalization and Empirical Bayes batch reduction. This combination, along with others, performed as well as or better than more complex models involving deep neural networks (DNNs). In addition to benchmarking our pipelines, we explore their limitations, such as the reliance of zero-centered batch reduction on balanced input data and the tendency of Empirical Bayes methods to introduce artificial signals into data, which shows that certain methods are poorly suited to clinical use. To our knowledge, this is the first comprehensive benchmark of data preprocessing and machine learning methods for microbiome-based disease classification of IBD. These findings will help improve the generalizability of machine learning models as we move towards non-invasive diagnostic and disease management tools for patients with IBD.
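As a rough illustration of two of the preprocessing steps named in this abstract, the sketch below applies a centered log-ratio transform to one sample and zero-centers features within each batch. It is a simplified sketch under stated assumptions, not the benchmark's implementation: the pseudocount of 1 used to handle zero counts, and the function names `clr` and `zero_center_by_batch`, are assumptions made for illustration.

```python
from collections import defaultdict
from math import log


def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of one sample's taxon counts.

    The pseudocount is an assumed way of avoiding log(0) for absent taxa.
    """
    logs = [log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def zero_center_by_batch(samples, batch_ids):
    """Subtract each batch's per-feature mean so every batch is centered at zero."""
    n_features = len(samples[0])
    rows_by_batch = defaultdict(list)
    for row, batch in enumerate(batch_ids):
        rows_by_batch[batch].append(row)
    centered = [None] * len(samples)
    for rows in rows_by_batch.values():
        means = [
            sum(samples[r][j] for r in rows) / len(rows) for j in range(n_features)
        ]
        for r in rows:
            centered[r] = [samples[r][j] - means[j] for j in range(n_features)]
    return centered


# Toy example: one sample with three taxa, then three samples from two batches.
transformed = clr([10, 0, 5])
centered = zero_center_by_batch(
    [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]], ["a", "a", "b"]
)
```

The sketch also suggests why zero-centering may depend on balanced input: the per-batch mean is subtracted regardless of class composition, so if one batch consists mostly of IBD samples and another mostly of controls, part of the disease signal sits in the batch means and is removed along with the batch effect.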