Abstract. Measurements of dust in the atmosphere have long been used to calibrate dust emission models. However, there is growing recognition that atmospheric dust confounds the magnitude and frequency of emission from dust sources and hides potential weaknesses in dust emission model formulation. In the satellite era, dichotomous (presence = 1 or absence = 0) observations of dust emission point sources (DPS) provide a valuable inventory of regional dust emission. We used these DPS data to develop an open and transparent framework to routinely evaluate the performance of dust emission models as they develop, using the coincidence of simulated and observed dust emission (or lack of emission). To illustrate the utility of this framework, we evaluated the recently developed albedo-based dust emission model (AEM), which included the traditional entrainment threshold (u*ts) at the grain scale, fixed over space and static over time, with sediment supply infinite everywhere. For comparison with the dichotomous DPS data, we reduced the AEM simulations to their frequency of occurrence, the probability that soil surface wind friction velocity (us*) exceeds u*ts, P(us* > u*ts). We used a global collation of nine DPS datasets from established studies to describe the spatio-temporal variation of dust emission frequency. A total of 37,352 unique DPS locations were aggregated into 1,945 1° grid boxes to harmonise data across the studies, which together identified 59,688 dust emissions. The DPS data alone revealed that dust emission does not usually recur at the same location and is rare (1.8 %) even in North Africa and the Middle East, indicative of extreme, large wind speed events. The AEM over-estimated the occurrence of dust emission by between 1 and 2 orders of magnitude. More diagnostically, the AEM simulations coincided with dichotomous observations ~71 % of the time but simulated dust emission ~27 % of the time when no dust emission was observed.
Our analysis indicates that u*ts was typically too small, needed to vary over space and time, and, being defined at the grain scale, is incompatible with the us* scale (MODIS; 500 m). During observed dust emission, us* was too small because wind speeds were too small and/or the wind speed scale (ERA5; 11 km) is incompatible with the us* scale. The absence of any limit to sediment supply caused the AEM to simulate dust emission whenever us* exceeded u*ts, producing many false positives when and where wind speeds were frequently large. Dust emission model scaling needs to be reconciled, and new parameterisations, varying over space and time, are required for u*ts and to restrict sediment supply. Whilst u*ts remains poorly constrained and unrealistic assumptions persist about sediment supply and availability, the DPS data provide a basis for the calibration of dust emission models for operational use. As dust emission models develop, these DPS data provide a consistent, reproducible, and valid framework for their routine evaluation and potential model optimisation. This work emphasises the growing recognition that dust emission models should not be evaluated against atmospheric dust.
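The coincidence evaluation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function and class names, the sample us* values, and the threshold of 0.2 m s-1 are hypothetical, chosen only to show how a simulated exceedance series might be reduced to P(us* > u*ts) and cross-tabulated against dichotomous DPS observations. Expressing the false-positive share as a fraction of all cases is likewise an assumption made here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Coincidence:
    """Counts of simulated vs. observed dust emission (dichotomous)."""
    hits: int               # emission simulated and observed
    false_positives: int    # emission simulated, none observed
    misses: int             # no emission simulated, emission observed
    correct_negatives: int  # neither simulated nor observed

    @property
    def total(self) -> int:
        return (self.hits + self.false_positives
                + self.misses + self.correct_negatives)

    @property
    def agreement(self) -> float:
        """Fraction of cases where simulation and observation coincide."""
        return (self.hits + self.correct_negatives) / self.total

    @property
    def false_positive_fraction(self) -> float:
        """Fraction of all cases with simulated but unobserved emission."""
        return self.false_positives / self.total


def exceedance_frequency(us_star: Sequence[float], u_star_ts: float) -> float:
    """P(us* > u*ts): share of time steps where us* exceeds the threshold."""
    if not us_star:
        raise ValueError("us_star must not be empty")
    return sum(u > u_star_ts for u in us_star) / len(us_star)


def coincidence(us_star: Sequence[float], observed: Sequence[int],
                u_star_ts: float) -> Coincidence:
    """Cross-tabulate simulated exceedance against 0/1 observations."""
    if not us_star or len(us_star) != len(observed):
        raise ValueError("inputs must be non-empty and of equal length")
    counts = {(1, 1): 0, (1, 0): 0, (0, 1): 0, (0, 0): 0}
    for u, obs in zip(us_star, observed):
        simulated = int(u > u_star_ts)   # emission only if threshold exceeded
        counts[(simulated, int(bool(obs)))] += 1
    return Coincidence(counts[(1, 1)], counts[(1, 0)],
                       counts[(0, 1)], counts[(0, 0)])


# Hypothetical values for one grid box (m s-1); not taken from the study.
us_star_series = [0.10, 0.25, 0.30, 0.15, 0.40, 0.05]
dps_observed = [0, 1, 0, 0, 1, 0]
result = coincidence(us_star_series, dps_observed, u_star_ts=0.2)
print(exceedance_frequency(us_star_series, 0.2), result.agreement)
```

In this toy series the model simulates emission in three of six time steps while only two emissions are observed, so the single false positive shows the same kind of over-prediction that a fixed, small threshold with unlimited sediment supply would produce.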