I was thinking about some of the slides from the KIT meeting this week that @johannes, @kejohnson9, and @sbfnk attended.
In theoretical work we often see simulations of known settings scored and used to understand how scores work.
We also think there may be a mismatch, real or perceived, between what a user wants from a score and what they actually get.
Perhaps something we could practise more is simulating forecasts/outcomes/decisions and scoring them in whatever evaluation framework we pick.
This might serve as a sense check for what we can expect from any given forecast evaluation.
It might help people reason about which scores they want to use.
It is standard modelling practice already, and makes it clearer that scoring is part of the science.
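To make the idea concrete, here is a minimal sketch of what such a sense check could look like. Everything in it is an assumption for illustration: the outcomes are standard normal, the four forecasters are hypothetical, and CRPS (via its closed form for a normal forecast) stands in for whatever evaluation framework you would actually pick.

```python
import math
import random
import statistics


def crps_normal(mu, sigma, y):
    """Closed-form CRPS of a Normal(mu, sigma) forecast for outcome y (lower is better)."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))


def simulate_mean_scores(n_outcomes=5000, seed=1):
    """Score hypothetical forecasters against outcomes from a known process."""
    rng = random.Random(seed)
    # Known "truth" (an assumption of this sketch): outcomes are standard normal.
    outcomes = [rng.gauss(0.0, 1.0) for _ in range(n_outcomes)]
    # Hypothetical forecasters, each issuing the same Normal(mu, sigma) every time.
    forecasters = {
        "ideal": (0.0, 1.0),
        "biased": (0.5, 1.0),
        "overconfident": (0.0, 0.5),
        "underconfident": (0.0, 2.0),
    }
    return {
        name: statistics.fmean(crps_normal(mu, sigma, y) for y in outcomes)
        for name, (mu, sigma) in forecasters.items()
    }


mean_scores = simulate_mean_scores()
# Print forecasters from best (lowest mean CRPS) to worst.
for name, score in sorted(mean_scores.items(), key=lambda kv: kv[1]):
    print(f"{name:15s} {score:.3f}")
```

Because the truth is known, you can check that the evaluation ranks the ideal forecaster first and see how much each kind of misspecification costs. Swapping in a different score, transformation, or aggregation step then shows whether that ordering survives.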
I think not, but perhaps this is already widely done in other domains?
This might become even more important as scoring systems grow more complex.
We are increasingly using more strata, aggregating results, and combining transformed scores in custom ways.
The big issue with this idea is that it might be hard for people to do.
There are no great tools for setting up these simulations or thinking them through.
If it is just tooling, though, that might not be so hard to change.
Perhaps not everyone needs a simulation for their complete evaluation.
But if they do not have one, they should probably cite something that closely matches their setting.
The recent discussion about statistical power in "Do your evaluations have enough power?" (#28 by sambrand) calls for a best practice guide.
Simulating optimal outcomes is exactly the sort of good practice we might include in that guide.
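As a rough illustration of how simulation could feed into the power question, the sketch below estimates how often an evaluation of a given size ranks a truly better forecaster first. It is a crude stand-in for power, not a formal test. The setting is assumed: standard normal outcomes, independent forecasts, a hypothetical competitor with a small bias of 0.2, and the log score.

```python
import math
import random


def log_score_normal(mu, sigma, y):
    """Negative log density of Normal(mu, sigma) at y (lower is better)."""
    z = (y - mu) / sigma
    return 0.5 * z * z + math.log(sigma) + 0.5 * math.log(2.0 * math.pi)


def prob_correct_ranking(n_forecasts, bias=0.2, n_replicates=2000, seed=1):
    """Share of simulated evaluations in which the ideal forecaster beats a biased one."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_replicates):
        total_diff = 0.0
        for _ in range(n_forecasts):
            # Known "truth" (an assumption of this sketch): standard normal outcomes.
            y = rng.gauss(0.0, 1.0)
            # Positive difference means the biased forecaster scored worse.
            total_diff += log_score_normal(bias, 1.0, y) - log_score_normal(0.0, 1.0, y)
        if total_diff > 0:
            wins += 1
    return wins / n_replicates


results = {n: prob_correct_ranking(n) for n in (10, 50, 200)}
for n, share in results.items():
    print(f"n = {n:3d}: ideal forecaster ranked first in {share:.0%} of replicates")
```

Real evaluations have correlated forecasts, several models, and stratified aggregation, so a version of this for a guide would need a simulator closer to the actual setting.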
What do people think?
P.S. I cleaned this up a bit with Gemini CLI (testing it - it's awful)
To check I'm understanding you: these three together would be like simulating a risk management scenario - into which one's proposed score would feed, to be tested in?
The big issue with this idea is that it might be hard for people to do.
As a general thing I guess this has been part of the downstream value of the Hubs so far. People use Hubs as a benchmark set of forecasts/data/setting with implicit decision-utility. (Typically for developing forecasts rather than scores, but I can see that as an obvious development, as in Minsu Kim's model importance work.) But as often discussed, the Hubs are quite a miscellany of forecasting targets/participation, without a consistent method for aggregating across or filling in for missingness, so not really ideal for an "industry benchmark". It would be great to have a standard simulated environment to test out different scores in.
If I'm understanding the intention, we could also look at adopting some ideas for implementing this from how ML/AI use standardised benchmarks? You and others here will have a much better idea of this than me - I have only very superficially browsed around. But I enjoy the apparently neat infrastructure (e.g. Lighteval · Hugging Face)
That's a later/bigger goal though - sounds like the smallest next step for this would be to design something [simple] that could work with scoringutils?
Checking my understanding on both of your points @samabbott and @kath-sherratt, and admitting I haven't read Adiga et al yet, but would the idea be that we have some sort of database or simulator for epidemic forecasts of different forms (e.g. counts of cases, composite metrics like % ED visits due to flu, and even other target types like variant proportions) generated from different models and under different circumstances, and can then "test" scoring procedures using this database/simulator where we know the "true" model and can see how a proposed scoring algorithm plays out under different assumptions/ways of misspecifying the forecasts?
This reminds me of the scoring proportions work we heard about last Thursday, as well as how Johannes Resin demonstrated the flaw of using arithmetic/geometric means of individual relative scores.
I feel like this should be almost a requirement any time someone proposes a new scoring method or applies a scoring method to a new setting, though perhaps it is not actually done a ton in practice? I would be interested in thinking about this more with regards to the evaluation of local vs state forecasting and potentially incorporating adjusted scoring metrics, e.g. outcome/threshold-weighted scoring. I feel that in order to feel confident in using something new we would want to validate it on a simulated dataset with known properties.
This is an interesting question, something I'm in two minds about.
From my perspective:
We barely have time / can prioritise academic work, so there's a risk of adding further barriers to entry for public health agencies wanting to contribute to this field (but that's not the case for everyone, naturally!). As a result I'd be keen that simulation isn't a minimum requirement to get a forecasting paper out.
Simulation can be a gold standard for validating methods, so yes it'd be great if there was more of it!
Tying in to the good practice guidelines would be ideal for me, "in this situation you should think about this scoring rule, as shown by XYZ et al" - taking the scoring transformations paper a bit further to recommendations. But also I would flag that these sorts of things can be constraining sometimes - I'm often asked why we don't log transform our WIS…
Overall, very keen for more examples of good practice to be available to point to!