Do your evaluations have enough power?

sambrand · 17 February 2026 17:42

Late to this party… but I’m not convinced about model confidence sets based on proper scores.

Essentially, MCS will, in the limit, pick out the single model with the best performance over a period of time (unless two models are literally indistinguishable), which recreates the same problem as using BMA, e.g. here.

Or, if it doesn’t, assumption 1 of the MCS paper is violated and therefore it’s not a great approach either.

At a meta-stats level I’m not convinced about treating models as being in the same set because of a p-value constructed around a null hypothesis that itself has probability zero of being true (i.e. that two distinct models will have precisely the same marginal skill as data goes to infinity).
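
To make the point-null issue concrete, here is a minimal sketch. It is not the MCS procedure itself (which uses bootstrapped statistics over a whole set of models); it is just a paired test of equal mean score between two models, using a normal approximation. The gap of 0.02, the unit variance, and the function name `equal_skill_p_value` are all made up for illustration.

```python
import math
import numpy as np


def equal_skill_p_value(score_diff):
    """Two-sided p-value for H0: mean score difference is zero.

    Uses a normal approximation to the paired t-statistic, which is a
    simplification of what an MCS implementation would actually do.
    """
    d = np.asarray(score_diff, dtype=float)
    n = d.size
    t_stat = d.mean() / (d.std(ddof=1) / math.sqrt(n))
    return math.erfc(abs(t_stat) / math.sqrt(2.0))


rng = np.random.default_rng(0)
true_gap = 0.02  # hypothetical tiny true difference in mean score

p_values = {}
for n in (100, 10_000, 1_000_000):
    # Simulated per-forecast score differences between two models.
    diffs = rng.normal(loc=true_gap, scale=1.0, size=n)
    p_values[n] = equal_skill_p_value(diffs)
    print(f"n = {n:>9,d}  p = {p_values[n]:.3g}")
```

With any nonzero gap, however small, the p-value collapses once n is large enough, so whether the two models end up "in the same set" is mostly a statement about how much data you have rather than about the models.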
