Late to this party… but I’m not convinced about model confidence sets based on proper scores.
Essentially, given enough data, MCS will converge on picking out the single model with the best performance over a period of time (unless two models are literally indistinguishable), which recreates the same problem as using BMA, e.g. here.
Or, if it doesn’t, Assumption 1 of the MCS paper is violated, and therefore it’s not a great approach either.
At a meta-stats level I’m not convinced about treating models as being in the same set because of a p-value constructed around a null hypothesis that itself has probability zero of being true (i.e. that two distinct models will have precisely the same marginal skill as data goes to infinity).
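To make that last point concrete, here is a rough simulation sketch. It is not the actual MCS procedure (no bootstrap, no elimination rule); it is just a plain t-test on i.i.d. loss differentials between two hypothetical models, with a small true skill gap `DELTA` that I made up for illustration. The function name `t_stat_equal_skill` and all the numbers are my own assumptions.

```python
import math
import random
import statistics


def t_stat_equal_skill(n, delta, seed=0):
    """t-statistic for H0: mean loss differential is zero.

    Assumption: loss differentials between the two models are i.i.d.
    Normal(delta, 1). Real score differentials are usually autocorrelated,
    so this is only a toy stand-in for the tests used inside MCS.
    """
    rng = random.Random(seed)
    d = [rng.gauss(delta, 1.0) for _ in range(n)]
    return statistics.fmean(d) / (statistics.stdev(d) / math.sqrt(n))


# Hypothetical small but nonzero difference in expected score.
DELTA = 0.05

results = {n: t_stat_equal_skill(n, DELTA) for n in (100, 10_000, 200_000)}
for n, t in results.items():
    # The statistic grows roughly like DELTA * sqrt(n), so with enough data
    # the "equal skill" null is rejected for any DELTA != 0.
    print(f"n={n:>7}  t={t:6.2f}  reject at 5%: {abs(t) > 1.96}")
```

With a short record the two models will often look indistinguishable, and with a long one the worse model gets thrown out, even though the true gap never changed. So under these assumptions, membership in the "same set" mostly reflects sample size rather than anything about the models.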