I strongly agree with this concern. It really is the wild west at the moment. Not only do people draw too much from artifacts, but the lack of common practice also means people accidentally or intentionally go down different reporting paths to find what they want to find.
My view is that the scores can in some sense be thought of as data, and from there you can naturally apply lots of different statistical tools to them. Historically these have mostly been classic frequentist tests, but I don't see a reason you couldn't also fit more generative Bayesian models etc. The tricky bit, of course, is making sure that whatever you do doesn't destroy the propriety of the proper score.
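To make the "scores as data" idea concrete, here is a minimal sketch of one classic frequentist option: a paired sign-flip permutation test on per-target score differences between two models. The scores here are simulated stand-ins (any per-target proper score, e.g. CRPS, would slot in), and the model names are hypothetical.

```python
import random
import statistics

random.seed(1)

# Hypothetical per-target proper scores for two models (lower is better).
# In practice these would come from your evaluation pipeline.
scores_a = [random.gauss(1.0, 0.3) for _ in range(50)]
scores_b = [s + random.gauss(0.1, 0.2) for s in scores_a]  # model B a bit worse

# Treat the paired score differences as data.
diffs = [b - a for a, b in zip(scores_a, scores_b)]
observed = statistics.mean(diffs)

# Sign-flip permutation test: under the null of no difference,
# the sign of each paired difference is exchangeable.
n_perm = 10_000
count = 0
for _ in range(n_perm):
    perm = statistics.mean(d if random.random() < 0.5 else -d for d in diffs)
    if abs(perm) >= abs(observed):
        count += 1
p_value = (count + 1) / (n_perm + 1)

print(f"mean score difference: {observed:.3f}, permutation p: {p_value:.4f}")
```

Because the test only permutes signs of already-computed score differences, it leaves the scores themselves (and hence their propriety) untouched; the same paired-differences framing also extends naturally to a hierarchical Bayesian model over targets.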
Personally, though I am obviously biased, I do think modelling the scores is the way to go most of the time if you are getting serious about this. In the first instance, making sure to report the distribution of scores and so on seems sensible.
I really like the idea of having some living "best practice" guidance, especially for high-dimensional settings. I have had a few chats with @johannes about this and would love to get it off the ground. Something I am unclear about is how much of this needs to live in our domain and how much we can farm out to the stats folks.
Yes, this exists, but speaking for myself at least, I am uncomfortable depending on that as an approach, for reasons I find hard to quantify.
In terms of good practice, I tend to look back at what @nikosbosse was doing and what @johannes has done recently.