I was thinking about some of the slides from the KIT meeting this week that @johannes, @kejohnson9, and @sbfnk attended.
In theoretical work we often see forecasts from known settings simulated, scored, and used to understand how scores behave.
We also think there may be a mismatch, or at least a perceived one, between what a user wants from a score and what they actually get.
Perhaps something we could practise more is simulating forecasts/outcomes/decisions and scoring them in whatever evaluation framework we pick.
This might serve as a sense check for what we can expect from any given forecast evaluation.
It might help people reason about which scores they want to use.
Simulation is already standard modelling practice, and applying it to evaluation makes it clearer that scoring is part of the science.
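To make that concrete, here is a minimal sketch of the kind of thing I mean (in plain Python rather than any particular package; the Poisson data-generating process, the two toy forecasters, and the choice of log score are all invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean = 20
# Simulated "observed" outcomes from a known data-generating process.
outcomes = rng.poisson(true_mean, size=1000)

# Two toy forecasters: one matches the truth, one is biased upwards.
forecasters = {
    "well_calibrated": stats.poisson(true_mean),
    "biased_up": stats.poisson(true_mean * 1.2),
}

# Log score (negative log predictive mass; lower is better).
for name, dist in forecasters.items():
    log_scores = -dist.logpmf(outcomes)
    print(f"{name}: mean log score = {log_scores.mean():.3f}")
```

Even at this toy scale, the simulation tells you roughly what gap in mean score to expect between a good forecaster and a mildly biased one, which is exactly the sense check I have in mind.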
Perhaps this is already widely done in other domains? I think not, but I would be happy to be wrong.
This might become even more important as scoring systems grow more complex.
We are increasingly using more strata, aggregating results, and combining transformed scores in custom ways.
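For instance, here is a toy illustration (with made-up per-stratum score values, not real results) of how the aggregation scheme alone can flip a ranking:

```python
import numpy as np

# Mean score per (model, stratum); lower is better. Values are invented.
scores = {
    "model_a": {"large_stratum": 100.0, "small_stratum": 3.0},
    "model_b": {"large_stratum": 120.0, "small_stratum": 1.0},
}

for model, by_stratum in scores.items():
    vals = np.array(list(by_stratum.values()))
    # Plain mean across strata vs. mean of log-transformed scores.
    print(f"{model}: raw mean = {vals.mean():.1f}, "
          f"mean log score = {np.log(vals).mean():.2f}")

# The raw mean ranks model_a first (51.5 vs 60.5); the log-scale
# aggregate ranks model_b first (2.39 vs 2.85).
```

A simulation of the full evaluation pipeline would surface this kind of sensitivity before we commit to a particular aggregation.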
The big issue with this idea is that it might be hard for people to do.
There are no great tools for setting up these simulations or thinking them through.
If it is just tooling, though, that might not be so hard to change.
Perhaps not everyone needs a simulation for their complete evaluation.
But if they do not have one, they should probably cite something that closely matches their setting.
The recent discussion about statistical power, in Do your evaluations have enough power? - #28 by sambrand , also calls for a best-practice guide.
Simulating outcomes under known settings is exactly the sort of good practice we could include in that guide.
What do people think?
P.S. I cleaned this up a bit with Gemini CLI (testing it - it's awful)