<p>Because the atmosphere is inherently chaotic, probabilistic weather forecasts are crucial for providing reliable information. In this work, we present an extension to WeatherBench, a benchmark dataset for medium-range, data-driven weather prediction that was originally designed for deterministic forecasts. We add a set of commonly used probabilistic verification metrics: the spread-skill ratio, the continuous ranked probability score (CRPS) and rank histograms. Further, we compute baseline scores from the operational IFS ensemble forecast.</p><p>We then compare three different methods of creating probabilistic neural network forecasts: first, using Monte-Carlo dropout during inference with a range of dropout rates; second, parametric forecasts, which directly optimize the CRPS; and third, categorical forecasts, in which the probability of occurrence for specific bins is predicted. We show that plain Monte-Carlo dropout does not provide enough spread. The parametric and categorical networks, on the other hand, provide reliable forecasts, with the categorical method being more versatile.</p>
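<p>To make two of the verification metrics named above concrete, here is a minimal NumPy sketch (not the benchmark's own code; the function names are illustrative) of the empirical ensemble CRPS, computed as E|X &#8722; y| &#8722; &#189;&#8201;E|X &#8722; X&#8242;| over the ensemble members, and of the spread-skill ratio, the mean ensemble spread divided by the RMSE of the ensemble mean. A ratio near one indicates a reliable ensemble.</p>

```python
import numpy as np

def crps_ensemble(members, obs):
    """Empirical CRPS: E|X - y| - 0.5 * E|X - X'|.

    members: array of shape (n_members, ...) holding the ensemble forecast;
    obs: array broadcastable to the trailing dimensions of `members`.
    Returns the CRPS per grid point (lower is better; 0 for a perfect
    deterministic forecast).
    """
    members = np.asarray(members, dtype=float)
    obs = np.asarray(obs, dtype=float)
    # Mean absolute error of each member against the observation.
    mae = np.mean(np.abs(members - obs), axis=0)
    # Mean pairwise absolute difference between members (the spread term).
    pairwise = np.mean(np.abs(members[:, None] - members[None, :]), axis=(0, 1))
    return mae - 0.5 * pairwise

def spread_skill_ratio(members, obs):
    """Ratio of mean ensemble spread (std) to the RMSE of the ensemble mean."""
    members = np.asarray(members, dtype=float)
    obs = np.asarray(obs, dtype=float)
    spread = np.sqrt(np.mean(np.var(members, axis=0, ddof=1)))
    rmse = np.sqrt(np.mean((members.mean(axis=0) - obs) ** 2))
    return spread / rmse
```

<p>An under-dispersive ensemble, such as the plain Monte-Carlo dropout forecasts discussed above, shows up here as a spread-skill ratio well below one.</p>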