Do your evaluations have enough power?

Late to this party… but I’m not convinced about model confidence sets based on proper scores.

Essentially, as data accumulates, MCS will converge to picking out the single model with the best performance over the period (unless two models are literally indistinguishable), which essentially recreates the same problem as using BMA, e.g. here.
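To make the collapse concrete, here is a minimal sketch (not the full Hansen et al. MCS procedure) of the pairwise test that drives it. I assume two hypothetical models whose per-period losses differ by a tiny constant (0.01) and run a Diebold-Mariano-style t-test on the loss differential with a normal approximation; as the sample grows, the p-value shrinks and the "worse" model gets eliminated, leaving a singleton set.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def dm_pvalue(d):
    """Two-sided p-value for H0: mean loss differential is zero
    (normal approximation, in the spirit of a Diebold-Mariano test)."""
    n = len(d)
    t = d.mean() / (d.std(ddof=1) / math.sqrt(n))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))

# Hypothetical setup: model B beats model A by 0.01 per period on average,
# buried in unit-variance noise. d = loss(A) - loss(B).
for n in (100, 10_000, 1_000_000):
    d = 0.01 + rng.normal(0.0, 1.0, size=n)
    print(f"n={n:>9,}  p={dm_pvalue(d):.4f}")
```

At small n the test can't separate the models and both survive; at large n the null of equal skill is rejected and the set collapses to one model, exactly because the null has probability zero of holding for two distinct models.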

Or, if it doesn’t, assumption 1 of the MCS paper is violated, and therefore it’s not a great approach either.

At a meta-stats level, I’m not convinced about treating models as being in the same set on the basis of a p-value constructed around a null hypothesis that itself has probability zero of being true (i.e. that two distinct models will have precisely the same marginal skill as the data go to infinity).