Prediction of the Concentration of Dissolved Oxygen in Running Water by Employing a Random Forest Machine Learning Technique
Dissolved oxygen (DO) is a key indicator of the ecological health of rivers. Modeling DO is challenging because of the complex interactions among the processes that govern it. Given its vital importance in water bodies, accurate prediction of DO is critical for ecosystem management. Because current process-based water quality models are intricate, a data-driven model could be an effective alternative. In this study, a random forest machine learning technique is employed to predict DO levels and to identify their major drivers. Half-hourly time series of water quality data, spanning 2007 to 2019, for the South Branch Potomac River near Springfield, WV, are obtained from the United States Geological Survey database. Key drivers are identified, and models are formulated for different scenarios of input variables. The model is calibrated for each input scenario using 80% of the data and validated by predicting DO concentrations for the remaining 20%. Water temperature and pH are found to be the most influential predictors of DO. However, satisfactory model performance is achieved with water temperature, pH, and specific conductance as input variables. A comparison with the traditional multiple linear regression method shows that the random forest model performs significantly better. These insights are therefore expected to help estimate stream and river DO levels at various sites with a minimum number of predictors, and to support a robust framework for ecosystem health management across an environmental gradient.
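The workflow summarized above (an 80/20 calibration/validation split, a random forest regressor, a ranking of predictor importance, and a multiple linear regression baseline) can be illustrated with a minimal sketch. It is not the study's implementation: it assumes scikit-learn, a random (shuffled) split rather than a chronological one, and default-like hyperparameters, none of which are stated in the abstract. The data are synthetic values generated only to make the example runnable; they are not USGS observations, and the function names `make_synthetic_data` and `compare_models` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Predictor names taken from the abstract; the column order is an assumption.
FEATURES = ["water_temperature", "pH", "specific_conductance"]


def make_synthetic_data(n=2000, seed=0):
    """Generate SYNTHETIC stand-in data (not real river measurements)."""
    rng = np.random.default_rng(seed)
    temp = rng.uniform(0.0, 30.0, n)    # water temperature, deg C
    ph = rng.uniform(7.0, 9.0, n)       # pH
    sc = rng.uniform(100.0, 400.0, n)   # specific conductance, uS/cm
    # Invented nonlinear relation, chosen purely for illustration.
    do = (14.6 * np.exp(-0.03 * temp)
          + 0.2 * (ph - 8.0) * (temp - 15.0)
          + rng.normal(0.0, 0.3, n))
    return np.column_stack([temp, ph, sc]), do


def compare_models(X, y, seed=0):
    """Calibrate on 80% of the data, validate on the remaining 20%."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    rf = RandomForestRegressor(n_estimators=100, random_state=seed)
    rf.fit(X_train, y_train)
    mlr = LinearRegression().fit(X_train, y_train)
    return {
        "n_train": len(y_train),
        "n_test": len(y_test),
        "rf_r2": r2_score(y_test, rf.predict(X_test)),
        "mlr_r2": r2_score(y_test, mlr.predict(X_test)),
        # Impurity-based importances, one value per predictor.
        "importances": dict(zip(FEATURES, rf.feature_importances_)),
    }


X, y = make_synthetic_data()
results = compare_models(X, y)
print(results)
```

Because the synthetic relation was constructed to be nonlinear, the scores this sketch prints say nothing about the study's actual results. For real half-hourly time series, a chronological split may be a more conservative choice than the shuffled split assumed here, since neighboring samples are strongly correlated.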