Yes, agree both of those are true and a benefit here of a model vs subsetting and scoring, but it sounds like there isn't going to be a propriety impact no matter what you do.
@sbfnk and I are having a chat about implementing MCS in scoringutils and what that could look like. After a bit of reading, it does look awfully like the routines are essentially constructing a model and then doing stepwise selection using a p-value threshold. The latter, in a regression context, is known to be unstable and isn't generally recommended.
From what I have read so far, it seems like this might reinforce my point that framing this all explicitly as a model would make these things clearer and allow a wider toolbox to be used (here, for example, switching things over to a Bayesian context and using a horseshoe prior (https://proceedings.mlr.press/v5/carvalho09a/carvalho09a.pdf) or similar would be a more rigorous selection approach for getting the joint set imo). You could use something like a GAM in mgcv and stay frequentist, but again have better selection options available to you (i.e. that package's inbuilt shrinkage). You would still have the propriety issue if you wanted any kind of link function, but I think writing their operations down directly as a model would avoid one.
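To make the shrinkage-vs-stepwise point concrete, here's a toy sketch (Python rather than R, and a lasso rather than a horseshoe — same shrinkage principle, frequentist flavour; all data and coefficients are invented for illustration): instead of adding/dropping predictors on p-value thresholds, a penalty shrinks the irrelevant effects towards zero jointly.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Simulated illustration: 10 candidate predictors, only 2 real effects.
rng = np.random.default_rng(2)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:2] = [1.0, -0.5]                       # the only true coefficients
y = X @ beta + rng.normal(scale=0.5, size=n)

# Cross-validated lasso: one joint fit with shrinkage, instead of
# stepwise inclusion/exclusion based on per-term p-value thresholds.
fit = LassoCV(cv=5).fit(X, y)
print(np.round(fit.coef_, 2))  # noise coefficients shrunk to ~0
```

A horseshoe prior in a Bayesian model plays an analogous role, but with heavier shrinkage of noise terms and less shrinkage of real effects.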
See this issue on GitHub.
I have just been skimming the literature on the train and was recently triggered by marking the basic stats course at LSHTM so may update my views.
These issues don’t strike me as insurmountable.
Non-negativity can be overcome by using a log-link, i.e. modelling log(E[WIS]) – if I understand correctly this doesn’t violate propriety.
Heteroskedasticity could then be addressed either by scoring on the log scale, or by having a distributional assumption that allows variance to scale with the mean.
For what it’s worth, this future landmark paper uses a log-link for modelling log-transformed scores.
So I thought a log link was a problem, i.e. not proper, which was one of the motivators for log scoring in the first place?
I agree it does all seem surmountable, i.e. using a different observation distribution etc.
"to get a good regression coefficient for your model you have to minimize log(WIS + 1)"
but this bit isn’t the case when using a log link, i.e. WIS remains the response variable?
If you think about it in terms of frequentist optimisation, with a log link you’re still optimising by minimising (average) WIS – the monotonic transformation of the expectation won’t affect this or any ranking based on WIS.
That’s different from logging every WIS score, where then taking the expectation might change the ranking (because of Jensen’s inequality if I’m not mistaken), and you can start gaming things.
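A toy numerical illustration of that Jensen's-inequality point (the WIS values are invented): ranking by mean WIS and ranking by mean log WIS can disagree, because the log rewards being usually sharp even at the cost of occasional large misses.

```python
import numpy as np

# Hypothetical WIS values for two models over 10 forecast dates.
model_a = np.array([0.1] * 9 + [20.0])  # usually very sharp, one big miss
model_b = np.full(10, 2.0)              # consistently mediocre

# Mean WIS: B wins (2.0 < 2.09).
print(model_a.mean(), model_b.mean())

# Mean log WIS: A wins (the one big miss is heavily discounted).
print(np.log(model_a).mean(), np.log(model_b).mean())
```

A log *link* avoids this: log(E[WIS]) is a monotonic transform of E[WIS], so the ranking is unchanged, whereas E[log(WIS)] is a different quantity altogether.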
This all makes sense, but I distinctly remember @johannes telling me no, and didn't we write this as a benefit in the log transform paper?
Late to this party… but I’m not convinced about model confidence sets based on proper scores.
Essentially, MCS will in the limit pick out a single best-performing model over a period of time (unless two models are literally exactly indistinguishable), which recreates the same problem as using BMA, e.g. here.
Or, if it doesn't, assumption 1 of the MCS paper is violated and therefore it's not a great approach either.
At a meta-stats level, I'm not convinced about treating models as being in the same set on the basis of a p-value constructed around a null hypothesis that itself has probability zero of being true (i.e. that two distinct models will have precisely the same marginal skill as data goes to infinity).
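A quick simulated illustration of that point (a plain two-sample t-test on invented scores, not the MCS procedure itself): for any nonzero true skill difference, however tiny, the point null of exact equality is rejected once the sample is large enough, so "being in the same set" is largely a statement about sample size.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
delta = 0.01  # tiny but nonzero true difference in mean score

pvals = []
for n in [100, 10_000, 1_000_000]:
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(delta, 1.0, n)
    pvals.append(stats.ttest_ind(a, b).pvalue)
    print(n, pvals[-1])  # p-value shrinks towards zero as n grows
```

At small n the models look "equal"; at large n the same pair is confidently separated, even though the practical difference is negligible throughout.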