Abstract. In this study, we propose a data-driven approach for automatically identifying
rainfall-runoff events in discharge time series. The
core of the concept is to construct and apply discrete multivariate
probability distributions to obtain, for each time step, a probabilistic
prediction of whether it is part of an event. The approach permits any data to serve as
predictors, and it is non-parametric in the sense that it can handle any
kind of relation between the predictor(s) and the target. Each choice of a
particular predictor data set is equivalent to formulating a model
hypothesis. Among competing models, the best is found by comparing their
predictive power in a training data set with user-classified events. For
evaluation, we use measures from information theory such as Shannon entropy
and conditional entropy to select the best predictors and models and,
additionally, measure the risk of overfitting via cross entropy and
Kullback–Leibler divergence. As all these measures are expressed in bits,
we can combine them to identify models with the best tradeoff between
predictive power and robustness given the available data.

We applied the method to data from the Dornbirner Ach catchment in Austria,
distinguishing three different model types: models relying on discharge data,
models using both discharge and precipitation data, and recursive models,
i.e., models using their own predictions of a previous time step as an
additional predictor. In the case study, the additional use of precipitation
reduced predictive uncertainty only by a small amount, likely because the
information provided by precipitation is already contained in the discharge
data. More generally, we found that the robustness of a model quickly dropped
with the increase in the number of predictors used (an effect well known as
the curse of dimensionality) such that, in the end, the best model was a
recursive one applying four predictors (three standard and one recursive):
discharge from two distinct time steps, the relative magnitude of discharge
compared with all discharge values in a surrounding 65 h time window and
event predictions from the previous time step. Applying the model reduced the
uncertainty in event classification by 77.8 %, decreasing conditional
entropy from 0.516 to 0.114 bits. To assess the quality of the proposed
method, its results were binarized and validated through a holdout method and
then compared to a physically based approach. The comparison showed similar
behavior of both models (both with accuracy near 90 %), and the cross-validation
reinforced the quality of the proposed model.

Given enough data to build data-driven models, their potential lies in the
way they learn and exploit relations between data unconstrained by
functional or parametric assumptions and choices. And, beyond that, the use
of these models to reproduce a hydrologist's way of identifying rainfall-runoff
events is just one of many potential applications.
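
The entropy-based evaluation described above can be illustrated with a minimal sketch. It assumes that predictors have already been binned into discrete classes and that each time step carries a user-assigned binary event label. The function names and the toy data are hypothetical and are not taken from the study; the sketch only shows how Shannon entropy and conditional entropy, both in bits, yield a relative reduction in uncertainty.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H(Y) in bits of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def conditional_entropy(predictors, labels):
    """Conditional entropy H(Y|X) in bits, estimated from paired samples.

    Each element of `predictors` may be a single bin index or a tuple of
    bin indices (one per predictor) for the multivariate case.
    """
    n = len(labels)
    joint = Counter(zip(predictors, labels))   # counts of (x, y) pairs
    marginal = Counter(predictors)             # counts of x alone
    return -sum((c / n) * log2(c / marginal[x]) for (x, _), c in joint.items())


# Hypothetical toy data: binned discharge (0 = low, 1 = medium, 2 = high)
# and a user classification of each time step (1 = event, 0 = no event).
discharge_bin = [0, 0, 0, 1, 1, 2, 2, 2]
is_event = [0, 0, 0, 0, 1, 1, 1, 1]

h_y = entropy(is_event)                                      # uncertainty without a model
h_y_given_x = conditional_entropy(discharge_bin, is_event)   # uncertainty with the predictor
relative_reduction = 1 - h_y_given_x / h_y

print(f"H(Y) = {h_y:.3f} bits, H(Y|X) = {h_y_given_x:.3f} bits, "
      f"reduction = {relative_reduction:.1%}")
```

In this toy case only the medium discharge bin is ambiguous, so the remaining uncertainty is concentrated there. Adding further predictors (as tuples) would lower the conditional entropy further but also thin out the samples per bin, which is the robustness tradeoff the abstract refers to.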