Scoring best practice: Should we always have scoring simulations in our papers?

I was thinking about some of the slides from the KIT meeting this week that @johannes, @kejohnson9, and @sbfnk attended.
In theoretical work we often see forecasts simulated from known settings, scored, and used to understand how scores behave.
We also think there may be a mismatch, real or perceived, between what a user wants from a score and what they actually get.

Perhaps something we could practise more is simulating forecasts/outcomes/decisions and scoring them in whatever evaluation framework we pick.
This might serve as a sense check for what we can expect from any given forecast evaluation.
It might help people reason about which scores they want to use.
It is standard modelling practice already, and makes it clearer that scoring is part of the science.
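
To make this concrete, here is a minimal sketch in base R (all numbers invented): simulate outcomes from a known process, simulate two forecasters of known relative quality, and check whether the chosen score actually separates them at a realistic sample size.

```r
# Hypothetical set-up: outcomes from a known Poisson process, one
# well-specified forecaster and one biased forecaster, scored with a
# sample-based CRPS approximation.
set.seed(1)

n_dates   <- 52      # number of forecast targets
n_samples <- 1000    # predictive samples per forecast

true_mean <- 20
observed  <- rpois(n_dates, lambda = true_mean)

# forecaster A knows the process; forecaster B is biased upwards
samples_A <- matrix(rpois(n_dates * n_samples, lambda = true_mean), nrow = n_dates)
samples_B <- matrix(rpois(n_dates * n_samples, lambda = 1.5 * true_mean), nrow = n_dates)

# sample-based CRPS: E|X - y| - 0.5 * E|X - X'|
crps_from_samples <- function(y, x) {
  mean(abs(x - y)) - 0.5 * mean(abs(outer(x, x, "-")))
}

crps_A <- mapply(crps_from_samples, observed, asplit(samples_A, 1))
crps_B <- mapply(crps_from_samples, observed, asplit(samples_B, 1))

# does the evaluation recover the known ranking, and by how much?
c(mean_crps_A = mean(crps_A), mean_crps_B = mean(crps_B))
```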

I don't think this is common practice at the moment, but perhaps it is already widely done in other domains?

This might become even more important as scoring systems grow more complex.
We are increasingly using more strata, aggregating results, and combining transformed scores in custom ways.
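
As a small illustration (hedged, invented numbers), a quick simulation can show how the choice of scale or aggregation can flip a ranking when strata differ a lot in magnitude:

```r
# Two strata of very different size, two point forecasters scored with
# absolute error: averaging raw scores is dominated by the big stratum,
# while scoring log-transformed counts weights strata more evenly.
set.seed(2)

n <- 200
big   <- rpois(n, 1000)   # high-incidence stratum
small <- rpois(n, 10)     # low-incidence stratum

abs_err <- function(obs, pred) abs(obs - pred)

# model 1: accurate on the big stratum, poor on the small one
# model 2: the reverse
err_m1 <- c(abs_err(big, 1000), abs_err(small, 30))
err_m2 <- c(abs_err(big, 1100), abs_err(small, 10))

# natural scale: model 1 wins
c(m1 = mean(err_m1), m2 = mean(err_m2))

# log scale: model 2 wins
err_m1_log <- c(abs_err(log1p(big), log1p(1000)), abs_err(log1p(small), log1p(30)))
err_m2_log <- c(abs_err(log1p(big), log1p(1100)), abs_err(log1p(small), log1p(10)))
c(m1 = mean(err_m1_log), m2 = mean(err_m2_log))
```

Simulating this before the real evaluation makes the aggregation choice explicit rather than something discovered after the fact.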

The big issue with this idea is that it might be hard for people to do.
There are no great tools for setting up these simulations or thinking them through.
If it is just tooling, though, that might not be so hard to change.

Perhaps not everyone needs a simulation for their complete evaluation.
But if they do not have one, they should probably cite something that closely matches their setting.

The recent discussion about statistical power in Do your evaluations have enough power? - #28 by sambrand calls for a best-practice guide.
Simulating optimal outcomes is exactly the sort of good practice we might include in that guide.

What do people think?

P.S. I cleaned this up a bit with Gemini CLI (testing it - it's awful)


oh dear :upside_down_face:

This is really interesting!


simulating forecasts/outcomes/decisions

To check I’m understanding you: these three together would be like simulating a risk-management scenario, into which one’s proposed score would feed and be tested?


The big issue with this idea is that it might be hard for people to do.

As a general thing I guess this has been part of the downstream value of the Hubs so far. People use the Hubs as a benchmark set of forecasts/data/settings with implicit decision-utility. (Typically for developing forecasts rather than scores, but I can see that as an obvious development, as in Minsu Kim’s model importance work.) But as often discussed, the Hubs are quite a miscellany of forecasting targets and participation, without a consistent method for aggregating across them or filling in for missingness, so not really ideal as an ‘industry benchmark’. It would be great to have a standard simulated environment to test out different scores in.


If I’m understanding the intention, we could also look at adopting some ideas for implementing this from how ML/AI uses standardised benchmarks? You and others here will have a much better idea of this than me - I have only very superficially browsed around. But the infrastructure looks neat (e.g. Lighteval · Hugging Face).

That’s a later/bigger goal though - sounds like the smallest next step for this would be to design something [simple] that could work with scoringutils?
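
Something like the following might be the shape of that smallest step: simulated forecasts and outcomes pushed through scoringutils. This is only a sketch and assumes the sample-based interface of recent scoringutils versions (as_forecast_sample() followed by score() and summarise_scores()); column and function names may differ across releases.

```r
# Simulate a known data-generating process and two forecasters, then score
# them with scoringutils to sense-check what the evaluation can detect.
library(scoringutils)

set.seed(3)
n_dates   <- 20
n_samples <- 500
truth     <- rpois(n_dates, lambda = 50)

simulate_model <- function(name, lambda) {
  data.frame(
    model     = name,
    date      = rep(seq_len(n_dates), each = n_samples),
    sample_id = rep(seq_len(n_samples), times = n_dates),
    observed  = rep(truth, each = n_samples),
    predicted = rpois(n_dates * n_samples, lambda = lambda)
  )
}

sim <- rbind(
  simulate_model("well_specified", lambda = 50),
  simulate_model("biased",         lambda = 70)
)

sim |>
  as_forecast_sample() |>
  score() |>
  summarise_scores(by = "model")
```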

Hot off the press from Adiga et al:

Repo: