Yes, agree both of those are true and a benefit here of a model vs subsetting and scoring, but it sounds like there isn't going to be a propriety impact no matter what you do.
@sbfnk and I are having a chat about implementing MCS in scoringutils and what that could look like. After a bit of reading, it does look awfully like the routines are essentially constructing a model and then doing stepwise selection using a p-value threshold. The latter, in a regression context, is known to be unstable and isn't generally recommended.
From what I have read so far, it seems like this might reinforce my point that framing this all explicitly as a model would make these things clearer and allow a wider toolbox to be used (here, for example, switching things over to a Bayesian context and using a horseshoe prior (https://proceedings.mlr.press/v5/carvalho09a/carvalho09a.pdf) or similar would be a more rigorous selection approach for getting the joint set imo). You could use something like a GAM in mgcv and stay frequentist, but again have better selection options available to you (i.e. that package's inbuilt shrinkage). You would still have the propriety issue if you wanted any kind of link function, but I think writing their operations down directly as a model would avoid one.
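To make the shrinkage-vs-stepwise point concrete, here's a toy sketch (Python rather than R, and a lasso rather than a horseshoe — same shrinkage principle, frequentist flavour; all data and coefficients are invented for illustration): instead of adding/dropping predictors on p-value thresholds, a penalty shrinks the irrelevant effects towards zero jointly.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Simulated illustration: 10 candidate predictors, only 2 real effects.
rng = np.random.default_rng(2)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:2] = [1.0, -0.5]                       # the only true coefficients
y = X @ beta + rng.normal(scale=0.5, size=n)

# Cross-validated lasso: one joint fit with shrinkage, instead of
# stepwise inclusion/exclusion based on per-term p-value thresholds.
fit = LassoCV(cv=5).fit(X, y)
print(np.round(fit.coef_, 2))  # noise coefficients shrunk to ~0
```

A horseshoe prior in a Bayesian model plays an analogous role, but with heavier shrinkage of noise terms and less shrinkage of real effects.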
See this issue on GitHub.
I have just been skimming the literature on the train and was recently triggered by marking the basic stats course at LSHTM so may update my views.
These issues don’t strike me as insurmountable.
Non-negativity can be overcome by using a log-link, i.e. modelling log(E[WIS]) – if I understand correctly this doesn’t violate propriety.
Heteroskedasticity could then be addressed either by scoring on the log scale, or by having a distributional assumption that allows variance to scale with the mean.
For what it’s worth, this future landmark paper uses a log-link for modelling log-transformed scores.
So I thought a log link was a problem, i.e. not proper, which was one of the motivators for log scoring in the first place?
I agree it does all seem surmountable, i.e. using a different observation distribution etc.
"to get a good regression coefficient for your model you have to minimize log(WIS + 1)"
but this bit isn’t the case when using a log link, i.e. WIS remains the response variable?
If you think about it in terms of frequentist optimisation, with a log link you’re still optimising by minimising (average) WIS – the monotonic transformation of the expectation won’t affect this or any ranking based on WIS.
That’s different from logging every WIS score, where then taking the expectation might change the ranking (because of Jensen’s inequality if I’m not mistaken), and you can start gaming things.
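A toy numerical illustration of that Jensen's-inequality point (the WIS values are invented): ranking by mean WIS and ranking by mean log WIS can disagree, because the log rewards being usually sharp even at the cost of occasional large misses.

```python
import numpy as np

# Hypothetical WIS values for two models over 10 forecast dates.
model_a = np.array([0.1] * 9 + [20.0])  # usually very sharp, one big miss
model_b = np.full(10, 2.0)              # consistently mediocre

# Mean WIS: B wins (2.0 < 2.09).
print(model_a.mean(), model_b.mean())

# Mean log WIS: A wins (the one big miss is heavily discounted).
print(np.log(model_a).mean(), np.log(model_b).mean())
```

A log *link* avoids this: log(E[WIS]) is a monotonic transform of E[WIS], so the ranking is unchanged, whereas E[log(WIS)] is a different quantity altogether.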
This all makes sense, but I distinctly remember @johannes telling me no, and didn't we write this as a benefit in the log transform paper?
Late to this party… but I’m not convinced about model confidence sets based on proper scores.
Essentially, MCS will in the limit pick out a single best-performing model over a period of time (unless two models are literally exactly indistinguishable), which recreates the same problem as using BMA, e.g. here.
Or, if it doesn't, assumption 1 of the MCS paper is violated and therefore it's not a great approach either.
At a meta-stats level, I'm not convinced about treating models as being in the same set on the basis of a p-value constructed around a null hypothesis that itself has probability zero of being true (i.e. that two distinct models will have precisely the same marginal skill as data goes to infinity).
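A quick simulated illustration of that point (a plain two-sample t-test on invented scores, not the MCS procedure itself): for any nonzero true skill difference, however tiny, the point null of exact equality is rejected once the sample is large enough, so "being in the same set" is largely a statement about sample size.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
delta = 0.01  # tiny but nonzero true difference in mean score

pvals = []
for n in [100, 10_000, 1_000_000]:
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(delta, 1.0, n)
    pvals.append(stats.ttest_ind(a, b).pvalue)
    print(n, pvals[-1])  # p-value shrinks towards zero as n grows
```

At small n the models look "equal"; at large n the same pair is confidently separated, even though the practical difference is negligible throughout.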