A Machine Learning Approach to Identify Predictors of Frequent Vaping and Vulnerable Californian Youth Subgroups
Abstract

Introduction: Machine learning presents a unique opportunity to improve monitoring of electronic cigarette use (vaping) among youth. Here we built a random forest model to predict frequent vaping status among Californian youth and to identify contributing factors and vulnerable populations.

Methods: In this prospective cohort study, 1,281 ever-vaping twelfth-grade students from metropolitan Los Angeles were surveyed in the Fall and again at a 6-month follow-up in the Spring. Frequent vaping was measured at the 6-month follow-up and defined as nicotine-containing vaping on 20 or more of the past 30 days. Predictors (n=131) encompassed sociodemographic characteristics, substance use and perceptions, health status, and characteristics of the household, school, and neighborhood. A random forest was developed to identify the top ten predictors of frequent vaping and their interactions with sociodemographic variables.

Results: Forty participants (3.1%) reported frequent vaping at follow-up. The random forest outperformed a logistic regression model in prediction (C-index = 0.87 vs. 0.77). Higher past-month nicotine concentration in vaping products, more daily vaping sessions, and greater nicotine dependence were the top three of the ten most important predictors of frequent vaping. Interactions were found between age and perceived discrimination, and between age and race/ethnicity: participants who were younger than their classmates and who either reported experiencing discrimination frequently or identified as Asian or Native American/Pacific Islander were at increased risk of becoming frequent vapers.

Conclusions: Machine learning can produce models that accurately predict progression of vaping behaviours among youth. The potential association between frequent vaping and perceived discrimination warrants more in-depth analysis to determine whether discrimination is a cause of increased vaping.
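The C-index used to compare the two models can be illustrated with a short sketch. The abstract does not say how the C-index was computed, so this assumes the standard definition for a binary outcome, where it equals the area under the ROC curve: the share of (case, non-case) pairs in which the case receives the higher predicted risk, with ties counted as one half. The function name `c_index` and the toy labels and scores are hypothetical and are not study data.

```python
from itertools import product


def c_index(y_true, y_score):
    """Concordance index for a binary outcome (equal to ROC AUC).

    y_true  : iterable of 0/1 labels (1 = frequent vaper at follow-up)
    y_score : iterable of predicted risks, one per participant
    """
    cases = [s for y, s in zip(y_true, y_score) if y == 1]
    non_cases = [s for y, s in zip(y_true, y_score) if y == 0]
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")

    concordant = 0.0
    for case_score, non_case_score in product(cases, non_cases):
        if case_score > non_case_score:
            concordant += 1.0   # case ranked above non-case
        elif case_score == non_case_score:
            concordant += 0.5   # tie counts as half
    return concordant / (len(cases) * len(non_cases))


# Hypothetical toy data: two cases among six participants.
labels = [0, 0, 0, 0, 1, 1]
scores_a = [0.1, 0.2, 0.3, 0.6, 0.5, 0.9]  # model that ranks cases well
scores_b = [0.4, 0.2, 0.7, 0.6, 0.5, 0.3]  # model that ranks cases poorly

print(c_index(labels, scores_a))  # 0.875
print(c_index(labels, scores_b))  # 0.375
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking. Because the measure depends only on how cases rank against non-cases, it remains interpretable for a rare outcome such as the 3.1% prevalence reported here.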