Why do some probabilistic forecasts lack reliability?

2019 ◽  
Vol 9 ◽  
pp. A17
Author(s):  
Yûki Kubo

In this work, we investigate the reliability of probabilistic binary forecasts. We mathematically prove that a necessary, but not sufficient, condition for a reliable probabilistic forecast is that the Peirce Skill Score (PSS) is maximized at a threshold probability equal to the climatological base rate. The condition is confirmed using artificially synthesized forecast–outcome pair data and previously published probabilistic solar flare forecast models. The condition gives a partial answer as to why some probabilistic forecast systems lack reliability: a system that does not satisfy the proved condition can never be reliable. The proved condition is therefore very important for developers of probabilistic forecast systems. The result implies that those who want to develop a reliable probabilistic forecast system must adjust or train the system so that PSS is maximized near the threshold probability given by the climatological base rate.
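
This condition can be illustrated numerically. The following sketch (illustrative code, not the authors' implementation) generates a perfectly reliable toy forecast, scans the PSS over warning thresholds, and confirms that the maximizing threshold falls near the climatological base rate:

```python
import numpy as np

def peirce_skill_score(event, warn):
    """PSS = hit rate minus false-alarm rate for binary warnings vs. outcomes."""
    hits = np.sum(warn & event)
    misses = np.sum(~warn & event)
    false_alarms = np.sum(warn & ~event)
    corr_neg = np.sum(~warn & ~event)
    pod = hits / (hits + misses)                      # probability of detection
    pofd = false_alarms / (false_alarms + corr_neg)   # probability of false detection
    return pod - pofd

# Synthetic forecast-outcome archive: the forecast is reliable by construction,
# because each outcome is drawn with exactly the forecast probability.
rng = np.random.default_rng(0)
p = rng.uniform(0, 1, 10_000)
y = rng.uniform(0, 1, 10_000) < p

base_rate = y.mean()
thresholds = np.linspace(0.05, 0.95, 19)
pss = [peirce_skill_score(y, p >= t) for t in thresholds]
print("climatological base rate:", round(base_rate, 2))
print("PSS-maximizing threshold:", thresholds[int(np.argmax(pss))])
```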

2009 ◽  
Vol 3 (1) ◽  
pp. 39-43
Author(s):  
A. B. A. Slangen ◽  
M. J. Schmeits

Abstract. The development and verification of a probabilistic forecast system for winter thunderstorms around Amsterdam Airport Schiphol is described. We have used Model Output Statistics (MOS) to develop the probabilistic forecast equations. The MOS system consists of 32 logistic regression equations, one for each combination of two forecast periods (0–6 h and 6–12 h), four 90×80 km² regions around Amsterdam Airport Schiphol, and four 6-h time periods. For the predictand, quality-controlled Surveillance et Alerte Foudre par Interférométrie Radioélectrique (SAFIR) total lightning data were used. The potential predictors were calculated from postprocessed output of two numerical weather prediction (NWP) models, namely the High-Resolution Limited-Area Model (HIRLAM) and the European Centre for Medium-Range Weather Forecasts (ECMWF) model, and from an ensemble of advected lightning and radar data (0–6 h projections only). The predictors that are selected most often are the HIRLAM Boyden index, the square root of the ECMWF 3-h and 6-h convective precipitation sums, the HIRLAM convective available potential energy (CAPE) and two radar advection predictors. An objective verification was done, from which it can be concluded that the MOS system is skilful. The forecast system runs at the Royal Netherlands Meteorological Institute (KNMI) on an experimental basis, with the primary objective of warning aircraft pilots of potential aircraft-induced lightning (AIL) risk during winter.
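
A minimal sketch of how one such logistic regression equation could be fitted (illustrative predictor values and names, not the authors' data or code; scikit-learn stands in for the original MOS software):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative stand-ins for frequently selected predictors (Boyden index,
# square root of convective precipitation sum, CAPE); real values would come
# from postprocessed HIRLAM/ECMWF output and advected radar/lightning data.
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([
    rng.normal(94, 2, n),                 # Boyden index
    np.sqrt(rng.gamma(2.0, 0.5, n)),      # sqrt of convective precip sum (mm)
    rng.gamma(1.5, 100, n),               # CAPE (J/kg)
])
logit = -60 + 0.6 * X[:, 0] + 1.2 * X[:, 1] + 0.002 * X[:, 2]
y = rng.uniform(size=n) < 1 / (1 + np.exp(-logit))  # synthetic lightning yes/no

model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)
prob = model.predict_proba(X)[:, 1]       # thunderstorm probability per case
```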


2021 ◽  
Vol 73 (1) ◽  
Author(s):  
Naoto Nishizuka ◽  
Yûki Kubo ◽  
Komei Sugiura ◽  
Mitsue Den ◽  
Mamoru Ishii

Abstract. We developed an operational solar flare prediction model using deep neural networks, named Deep Flare Net (DeFN). DeFN can issue probabilistic forecasts of solar flares in two categories (≥ M-class versus < M-class events, or ≥ C-class versus < C-class events) occurring in the next 24 h after observations, as well as the maximum class of flares occurring in the next 24 h. DeFN is set to run every 6 h and has been in operation since January 2019. The input database of solar observation images taken by the Solar Dynamics Observatory (SDO) is downloaded from the data archive operated by the Joint Science Operations Center (JSOC) of Stanford University. Active regions are automatically detected from magnetograms, and 79 features are extracted from each region in near real time using multiwavelength observation data. Flare labels are attached to the feature database, which is then standardized and input into DeFN for prediction. DeFN was pretrained using the datasets obtained from 2010 to 2015. The model was evaluated with the skill score of the true skill statistic (TSS) and achieved predictions with TSS = 0.80 for ≥ M-class flares and TSS = 0.63 for ≥ C-class flares. For comparison, we evaluated the operational forecast results from January 2019 to June 2020. We found that operational DeFN forecasts achieved TSS = 0.70 (0.84) for ≥ C-class flares with a probability threshold of 50% (40%), although there were very few M-class flares during this period, and we should continue monitoring the results over a longer time. Here, we adopted a chronological split to divide the database into training and testing sets. The chronological split appears suitable for evaluating operational models. Furthermore, we proposed the use of time-series cross-validation. The procedure achieved TSS = 0.70 for ≥ M-class flares and 0.59 for ≥ C-class flares using the datasets obtained from 2010 to 2017. Finally, we discuss standard evaluation methods for operational forecasting models, such as the preparation of observation, training, and testing datasets, and the selection of verification metrics.
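
The TSS used for evaluation is the hit rate minus the false-alarm rate, and the chronological splits can be mimicked with standard tooling. A brief sketch under assumed toy data (not the DeFN pipeline):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

def true_skill_statistic(y_true, y_pred):
    """TSS = TP/(TP+FN) - FP/(FP+TN); identical in form to the Peirce skill score."""
    tp = np.sum(y_pred & y_true)
    fn = np.sum(~y_pred & y_true)
    fp = np.sum(y_pred & ~y_true)
    tn = np.sum(~y_pred & ~y_true)
    return tp / (tp + fn) - fp / (fp + tn)

# Hypothetical flare probabilities and labels, ordered in time.
probs = np.array([0.1, 0.7, 0.4, 0.9, 0.2, 0.6, 0.8, 0.3])
labels = np.array([0, 1, 0, 1, 0, 1, 1, 0], dtype=bool)
print(true_skill_statistic(labels, probs >= 0.5))   # 50% probability threshold

# Chronological (time-series) cross-validation: training data always precede testing data.
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(probs):
    print("train:", train_idx, "test:", test_idx)
```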


2012 ◽  
Vol 8 (1) ◽  
pp. 53-57
Author(s):  
S. Siegert ◽  
J. Bröcker ◽  
H. Kantz

Abstract. In numerical weather prediction, ensembles are used to retrieve probabilistic forecasts of future weather conditions. We consider events where the verification is smaller than the smallest, or larger than the largest, ensemble member of a scalar ensemble forecast. These events are called outliers. In a statistically consistent K-member ensemble, outliers should occur with a base rate of 2/(K+1). In operational ensembles this base rate tends to be higher. We study the predictability of outlier events in terms of the Brier Skill Score and find that forecast probabilities can be calculated that are more skillful than the unconditional base rate. This is shown analytically for statistically consistent ensembles. Using logistic regression, forecast probabilities for outlier events in an operational ensemble are calculated. These probabilities exhibit positive skill that is quantitatively similar to the analytical results. Possible causes of these results as well as their consequences for ensemble interpretation are discussed.
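
The 2/(K+1) base rate follows from exchangeability: in a statistically consistent ensemble the verification is equally likely to take any of the K+1 rank positions, two of which are outliers. A quick simulation under that assumption (illustrative code):

```python
import numpy as np

rng = np.random.default_rng(2)
K, n_cases = 10, 100_000

# Statistically consistent ensemble: members and verification are exchangeable
# draws from the same distribution, here standard normal.
ensemble = rng.normal(size=(n_cases, K))
verification = rng.normal(size=n_cases)

outlier = (verification < ensemble.min(axis=1)) | (verification > ensemble.max(axis=1))
print("observed outlier rate:", outlier.mean())   # close to 2/(K+1)
print("theoretical base rate:", 2 / (K + 1))      # 0.1818...
```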


2008 ◽  
Vol 136 (1) ◽  
pp. 352-363 ◽  
Author(s):  
Bodo Ahrens ◽  
André Walser

Abstract The information content, that is, the predictive capability, of a forecast system is often quantified with skill scores. This paper introduces two ranked mutual information skill (RMIS) scores, RMIS_O and RMIS_Y, for the evaluation of probabilistic forecasts. These scores are based on the concept of mutual information of random variables as developed in information theory. Like the ranked probability skill score (RPSS), another often-applied skill score, the new scores compare cumulative probabilities for multiple event thresholds. RMIS_O quantifies the fraction of information in the observational data that is explained by the forecasts. RMIS_Y quantifies the amount of useful information in the forecasts. Like the RPSS, the new scores are biased, but they can be debiased with a simple and robust method. This and additional promising characteristics of the scores are discussed with ensemble forecast assessment experiments.
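
The building block of both scores is the mutual information between forecast and observation. A generic sketch (the exact RMIS definitions are in the paper; this shows only the mutual-information ingredient, with hypothetical bin counts):

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in bits) of a discrete joint probability table."""
    px = joint.sum(axis=1, keepdims=True)   # marginal over forecast bins
    py = joint.sum(axis=0, keepdims=True)   # marginal over observations
    mask = joint > 0
    return np.sum(joint[mask] * np.log2(joint[mask] / (px @ py)[mask]))

# Hypothetical joint frequencies of (forecast bin, event occurrence) for one
# threshold; rows: low/medium/high forecast probability, cols: event no/yes.
counts = np.array([[40.0,  5.0],
                   [20.0, 15.0],
                   [ 5.0, 35.0]])
joint = counts / counts.sum()
print("I(forecast; observation) =", round(mutual_information(joint), 3), "bits")
```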


2020 ◽  
Vol 148 (6) ◽  
pp. 2233-2249
Author(s):  
Leonard A. Smith ◽  
Hailiang Du ◽  
Sarah Higgins

Abstract Probabilistic forecasting is common in a wide variety of fields including geoscience, social science, and finance. It is sometimes the case that one has multiple probability forecasts for the same target. How is the information in these multiple nonlinear forecast systems best “combined”? Assuming stationarity, in the limit of a very large forecast–outcome archive, each model-based probability density function can be weighted to form a “multimodel forecast” that will, in expectation, provide at least as much information as the most informative single model forecast system. If one of the forecast systems yields a probability distribution that reflects the distribution from which the outcome will be drawn, Bayesian model averaging will identify this forecast system as the preferred system in the limit as the number of forecast–outcome pairs goes to infinity. In many applications, like those of seasonal weather forecasting, data are precious; the archive is often limited to fewer than 26 entries. In addition, no perfect model is in hand. It is shown that in this case forming a single “multimodel probabilistic forecast” can be expected to prove misleading. These issues are investigated in the surrogate model (here a forecast system) regime, where using probabilistic forecasts of a simple mathematical system allows many limiting behaviors of forecast systems to be quantified and compared with those under more realistic conditions.
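
The Bayesian model averaging argument can be sketched as follows: each system's weight is updated by the likelihood of the observed outcomes under its predictive density, so with an unbounded archive the weight concentrates on the system matching the outcome distribution. A toy illustration (hypothetical Gaussian forecast densities, not the authors' experiment):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n_pairs = 200
outcomes = rng.normal(0.0, 1.0, n_pairs)    # outcomes drawn from N(0, 1)

# Two competing forecast densities; model A matches the outcome distribution.
models = {"A": norm(0.0, 1.0), "B": norm(0.5, 1.5)}

log_w = {name: 0.0 for name in models}      # uniform prior weights (log scale)
for name, dist in models.items():
    log_w[name] += dist.logpdf(outcomes).sum()

z = np.logaddexp(*log_w.values())           # normalization in log space
weights = {name: np.exp(lw - z) for name, lw in log_w.items()}
print(weights)   # weight concentrates on model A as the archive grows
```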


2020 ◽  
Vol 12 (1) ◽  
pp. 449-470
Author(s):  
Nico Keilman

The aim of this article is to review a number of issues related to uncertain population forecasts, with a focus on world population. Why are these forecasts uncertain? Population forecasters traditionally follow two approaches when dealing with this uncertainty: scenarios (forecast variants) and probabilistic forecasts. Early probabilistic population forecast models were based upon a frequentist approach, whereas current ones are of the Bayesian type. I evaluate the scenario approach versus the probabilistic approach and conclude that the latter is preferred. Finally, forecasts of resources need not only population input but also input on future numbers of households. While methods for computing probabilistic country-specific household forecasts have been known for some time, how to compute such forecasts for the whole world remains an unexplored issue.


2014 ◽  
Vol 29 (1) ◽  
pp. 177-181 ◽  
Author(s):  
Otto Hyvärinen

Abstract An alternative derivation of the Heidke skill score for 2 × 2 tables is presented, starting from the assumption that a categorical forecast is useful if the probability of an occurrence of an event, given the forecast, is greater than the base rate of the event. A tentative measure of skill would then be the difference of these probabilities, normalized by the maximum value permitted by the base rate. For binary events, the Heidke skill score is then the harmonic mean of these differences for both the occurrence and the nonoccurrence of the event. This derivation differs from the usual one in that the concept of chance agreement is not used. It is Bayesian in nature, with an implied updating of prior probabilities to posterior probabilities.
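
A numerical check of the harmonic-mean identity on a toy 2 × 2 table (the normalizations are inferred from the abstract; comparison against the standard Heidke formula confirms they agree here):

```python
# 2x2 contingency table: a hits, b false alarms, c misses, d correct negatives.
a, b, c, d = 30, 10, 20, 40
n = a + b + c + d
s = (a + c) / n                       # base rate of the event

# Posterior-minus-prior differences, normalized by their maximum values.
occ = (a / (a + b) - s) / (1 - s)     # skill given "event" forecasts
non = (d / (c + d) - (1 - s)) / s     # skill given "no event" forecasts
hss_harmonic = 2 / (1 / occ + 1 / non)

# Standard Heidke skill score for comparison.
hss_standard = 2 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d))
print(hss_harmonic, hss_standard)     # both 0.4 for this table
```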


Water ◽  
2020 ◽  
Vol 12 (9) ◽  
pp. 2631
Author(s):  
Xinchi Chen ◽  
Xiaohong Chen ◽  
Dong Huang ◽  
Huamei Liu

Precipitation is one of the most important factors affecting the accuracy and uncertainty of hydrological forecasting. Considerable progress has been made in numerical weather prediction after decades of development, but the forecast products still cannot be used directly for hydrological forecasting. This study used the ensemble pre-processor (EPP) to post-process the Global Ensemble Forecast System (GEFS) and Climate Forecast System version 2 (CFSv2) with four designed schemes, and then integrated them to investigate the forecast accuracy at longer time scales based on the best scheme. Several indices, such as the correlation coefficient, Nash efficiency coefficient, rank histogram, and continuous ranked probability skill score, were used to evaluate the results in different respects. The results show that EPP can improve the accuracy of the raw forecast significantly, and that the scheme considering cumulative forecast precipitation is better than the one that considers only single-day forecasts. Moreover, a scheme that incorporates some observed precipitation helps to improve accuracy and reduce uncertainty. In terms of medium- and long-term forecasts, the integrated forecast based on post-processed GEFS and CFSv2 is significantly better than CFSv2 alone. The results of this study demonstrate how to remove the bias of ensemble forecasts and improve the accuracy of hydrological forecasting at different time scales.
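
The continuous ranked probability score underlying one of the verification indices can be estimated directly from ensemble members. A minimal sketch (hypothetical member values; the skill-score form against a reference follows the usual CRPSS = 1 - CRPS/CRPS_ref convention):

```python
import numpy as np

def crps_ensemble(members, obs):
    """CRPS of one ensemble forecast: E|X - y| - 0.5 * E|X - X'|."""
    members = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(members - obs))
    term2 = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return term1 - term2

forecast = [1.2, 3.4, 2.8, 0.9, 2.1]     # hypothetical precipitation members (mm)
print(crps_ensemble(forecast, obs=2.5))

# Skill score against a reference forecast (e.g. raw ensemble or climatology):
# CRPSS = 1 - CRPS_system / CRPS_reference.
```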


2019 ◽  
Vol 34 (6) ◽  
pp. 1955-1964
Author(s):  
Adam J. Clark

Abstract This study compares ensemble precipitation forecasts from 10-member, 3-km grid-spacing, CONUS-domain single- and multicore ensembles that were a part of the 2016 Community Leveraged Unified Ensemble (CLUE) run for the 2016 NOAA Hazardous Weather Testbed Spring Forecasting Experiment. The main results are that a 10-member ARW ensemble was significantly more skillful than a 10-member NMMB ensemble, and that a 10-member MIX ensemble (5 ARW and 5 NMMB members) performed about the same as the 10-member ARW ensemble. Skill was measured by the area under the relative operating characteristic curve (AUC) and the fractions skill score (FSS). Rank histograms in the ARW ensemble were flatter than those in the NMMB ensemble, indicating that the envelope of ensemble members better encompassed observations (i.e., better reliability) in the ARW. Rank histograms in the MIX ensemble were similar to those in the ARW ensemble. In the context of NOAA's plans for a Unified Forecast System featuring a CAM ensemble with a single core, the results are positive and indicate that it should be possible to develop a single-core system that performs as well as or better than the current operational CAM ensemble, known as the High-Resolution Ensemble Forecast System (HREF). However, as new modeling applications are developed and incremental changes that move HREF toward a single-core system are made possible, more thorough testing and evaluation should be conducted.
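
The fractions skill score compares neighbourhood event fractions of forecast and observed exceedance fields. A compact sketch (synthetic fields and an assumed 9-gridpoint window; not the CLUE verification code):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def fractions_skill_score(fcst, obs, threshold, window):
    """FSS: compare neighbourhood event fractions of forecast and observation."""
    f = uniform_filter((fcst >= threshold).astype(float), size=window)
    o = uniform_filter((obs >= threshold).astype(float), size=window)
    mse = np.mean((f - o) ** 2)
    mse_ref = np.mean(f ** 2) + np.mean(o ** 2)
    return 1.0 - mse / mse_ref

rng = np.random.default_rng(4)
obs = rng.gamma(2.0, 2.0, (100, 100))    # synthetic precipitation field (mm)
fcst = np.roll(obs, 3, axis=0)           # spatially displaced "forecast"
print(fractions_skill_score(fcst, obs, threshold=5.0, window=9))
```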

