Abstract. Calibration is an essential step for improving
the accuracy of simulations generated using hydrologic models. A key modeling
decision is selecting the performance metric to be optimized. It has been
common to use squared-error performance metrics, or normalized variants such
as Nash–Sutcliffe efficiency (NSE), based on the idea that their
squared-error nature will emphasize the estimates of high flows. However, we
conclude that NSE-based model calibrations actually result in poor
reproduction of high-flow events, such as the annual peak flows that are used
for flood frequency estimation. Using three different types of performance
metrics, we calibrate two hydrologic models at a daily time step, the Variable
Infiltration Capacity (VIC) model and the mesoscale Hydrologic Model (mHM),
and evaluate their ability to simulate high-flow events for 492 basins
throughout the contiguous United States. The metrics investigated are
(1) NSE, (2) Kling–Gupta efficiency (KGE) and its variants, and (3) annual
peak flow bias (APFB), where the latter is an application-specific metric
that focuses on annual peak flows. As expected, the APFB metric produces the
best annual peak flow estimates; however, performance on other
high-flow-related metrics is poor. In contrast, the use of NSE results in
annual peak flow estimates that are more than 20 % worse, primarily due
to the tendency of NSE to underestimate observed flow variability. On the
other hand, the use of KGE results in annual peak flow estimates that are
better than those from NSE, owing to improved flow time series metrics (mean and
variance), with only a slight degradation in performance with respect to
other related metrics, particularly when a non-standard weighting of the
components of KGE is used. Stochastically generated ensemble simulations
based on model residuals can improve the high-flow metrics,
regardless of deterministic performance. However, we emphasize that
improving the fidelity of streamflow dynamics from deterministically
calibrated models is still important, as it may improve high-flow metrics
(for the right reasons). Overall, this work highlights the need for a deeper
understanding of performance metric behavior and design in relation to the
desired goals of model calibration.
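
The three metric types compared above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `weights` argument for the KGE components, and the toy data are hypothetical. NSE and KGE follow their commonly used definitions. The APFB form used here (absolute relative bias of the mean annual peak flow) is an assumption, since the abstract does not state the formula.

```python
import math
from statistics import fmean, pstdev


def pearson_r(sim, obs):
    """Linear correlation between simulated and observed flows."""
    ms, mo = fmean(sim), fmean(obs)
    cov = sum((s - ms) * (o - mo) for s, o in zip(sim, obs))
    var_s = sum((s - ms) ** 2 for s in sim)
    var_o = sum((o - mo) ** 2 for o in obs)
    return cov / math.sqrt(var_s * var_o)


def nse(sim, obs):
    """Nash-Sutcliffe efficiency: 1 - SSE / (sum of squared deviations of obs)."""
    mo = fmean(obs)
    sse = sum((s - o) ** 2 for s, o in zip(sim, obs))
    sst = sum((o - mo) ** 2 for o in obs)
    return 1.0 - sse / sst


def kge(sim, obs, weights=(1.0, 1.0, 1.0)):
    """Kling-Gupta efficiency with optional component weights (hypothetical argument).

    r     : correlation
    alpha : variability ratio (std of sim / std of obs)
    beta  : mean ratio (mean of sim / mean of obs)
    """
    w_r, w_a, w_b = weights
    r = pearson_r(sim, obs)
    alpha = pstdev(sim) / pstdev(obs)
    beta = fmean(sim) / fmean(obs)
    return 1.0 - math.sqrt(
        (w_r * (r - 1.0)) ** 2 + (w_a * (alpha - 1.0)) ** 2 + (w_b * (beta - 1.0)) ** 2
    )


def apfb(sim, obs, years):
    """Annual peak flow bias, ASSUMED here to be |mean sim peak / mean obs peak - 1|."""
    peaks_sim, peaks_obs = {}, {}
    for s, o, y in zip(sim, obs, years):
        peaks_sim[y] = max(s, peaks_sim.get(y, s))
        peaks_obs[y] = max(o, peaks_obs.get(y, o))
    ratio = fmean(peaks_sim.values()) / fmean(peaks_obs.values())
    return abs(ratio - 1.0)


# Toy data (made up for illustration): two "years" of four flow values each.
obs = [1.0, 2.0, 10.0, 3.0, 1.0, 2.0, 8.0, 2.0]
years = [2000] * 4 + [2001] * 4

# A simulation with perfect timing and mean, but only half the observed variability.
mean_obs = fmean(obs)
sim_damped = [mean_obs + 0.5 * (o - mean_obs) for o in obs]

print("NSE :", nse(sim_damped, obs))          # approximately 0.75
print("KGE :", kge(sim_damped, obs))          # approximately 0.50
print("APFB:", apfb(sim_damped, obs, years))  # approximately 0.30
```

In this toy case the damped simulation still scores an NSE of about 0.75, yet its mean annual peak is about 30 % too low. KGE penalizes the variability ratio directly and drops to about 0.5. This is only meant to show how a variability underestimate can hide behind a reasonable NSE, in line with the behavior described above.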