This is more a collection of thoughts than a coherent research idea, alas.
I’m reading lots of literature and doing lots of model scoring, but I can’t help worrying that I’m drawing conclusions from statistical artifacts rather than real effects.
It seems like almost all of the literature reports “the performance of model A is some % better or worse than model B”, regardless of scoring rule. That holds for single forecasts, but also for multi-year evaluations covering many, many forecasts. Yet there’s no clear indication of how valid that comparison is, given how many forecasts were made.
I suppose I’m taking the long way around to say: should we be doing some version of significance testing in forecast evaluations? Bootstrapping to understand variation in the scores? Anything relating to uncertainty in our scores?
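For what it’s worth, a minimal sketch of the bootstrap idea, using entirely simulated scores (in practice these would come from a proper scoring rule like CRPS or the log score, evaluated on the same targets for both models):

```python
# Bootstrap the mean paired score difference between two models.
# All data here are simulated purely for illustration.
import numpy as np

rng = np.random.default_rng(42)

n_forecasts = 200
# Simulated paired scores (lower = better); model A slightly better on average.
scores_a = rng.normal(loc=1.0, scale=0.5, size=n_forecasts)
scores_b = scores_a + rng.normal(loc=0.05, scale=0.3, size=n_forecasts)

# Resample forecast indices with replacement, keeping the A/B pairing intact.
diffs = scores_a - scores_b
n_boot = 5000
boot_means = np.array([
    rng.choice(diffs, size=n_forecasts, replace=True).mean()
    for _ in range(n_boot)
])

lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean difference: {diffs.mean():.3f}")
print(f"95% bootstrap CI: [{lo:.3f}, {hi:.3f}]")
```

If the interval comfortably excludes zero, the ranking is at least not obviously a sampling artifact; if it straddles zero, “A beat B by 5%” probably shouldn’t make it into the abstract.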
Is anyone already doing this, or are there good examples out there? I’d imagine Hubs are probably the natural place to explore this, but it could probably be tackled from the theory side as well.
There’s some regression modelling for forecast evaluation (which is really cool and should be done more), but are we missing the step before that? How do I know if I’m making enough comparisons to conclude whether model A is better than model B?
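One way to get a handle on “enough comparisons” is a simulation-based power calculation. The sketch below is a toy version: the assumed effect size and score variability are made-up numbers you’d replace with plausible values for your own scoring setup.

```python
# Rough simulation-based power calculation: under an assumed true mean
# score difference and score variability, how often does a paired t-test
# detect the difference at a given number of forecasts?
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

true_diff = 0.05   # assumed mean score advantage of model A (made up)
sd_diff = 0.3      # assumed SD of paired score differences (made up)
alpha = 0.05

def power(n_forecasts, n_sims=2000):
    hits = 0
    for _ in range(n_sims):
        d = rng.normal(true_diff, sd_diff, size=n_forecasts)
        t, p = stats.ttest_1samp(d, 0.0)
        hits += p < alpha
    return hits / n_sims

for n in (50, 200, 800):
    print(f"n = {n:4d}: estimated power = {power(n):.2f}")
```

With small effects and noisy scores, the required number of forecasts climbs quickly, which is exactly the “underpowered evaluation” worry.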
I’ve got ideas of how I’d do this if I needed to quickly, but perhaps what I’m also hinting at is that there should be guidance, a consensus, or an example out there for others to follow.
There’s a risk that researchers learn the wrong lessons from underpowered evaluations, and that scoring methods grow ever more complex without ever accounting for sample size.
It includes uncertainty in the comparison, which gets me most of the way there; then, by fitting a regression on top, you can make more advanced comparisons that account for structural factors.
Just wanted to agree with your thoughts here (and thanks for linking the work! Expanding on this is definitely a priority).
Not quite the main point of your post, but especially picking up on this:
This is something that @kejohnson9 and I have chatted about as well, and a bit with @samabbott / @sbfnk . It feels a bit like the wild west in how we select what to evaluate … which makes it basically impossible to compare across published evaluations (even if they are reporting the same metric, which as you mention, is not often). But ideally as you say we would all refer to a consensus best practice or reporting guideline with some minimum standards for reporting scores. Off the top of my head, I imagine that could include e.g. selection of scoring metric; level of stratification / aggregation across multiple forecasts; considering the scale/data transforms; potential confounding factors…
As an open question, I’d be keen to hear what others think of the variability in how evaluation is conducted/reported, and how methodological choices affect comparability. And/or examples of good practice!
I strongly agree with this concern. It really is the wild west at the moment: not only are we drawing too much from artifacts, but the lack of common practice means people accidentally (or intentionally) go down different reporting paths to find what they want to find.
So my view of this is that, in some sense, the scores can be thought of as data, and from there you can naturally apply lots of different statistical tools to them. Often these have been quite classic frequentist tests, but I don’t think there is a reason you can’t also fit more generative Bayesian models etc. The tricky bit, of course, is making sure that whatever you do doesn’t destroy the propriety of the proper score.
Personally, though obviously biased, I do think modelling the scores is the way to go most of the time if you’re getting serious about this. In the first instance, making sure to report the distribution of scores etc. seems sensible.
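To make the “scores as data” idea concrete, here is a minimal sketch: regress simulated scores on a model indicator plus a horizon covariate, so the model effect is estimated while adjusting for structure in the evaluation. Everything here (effect sizes, covariates) is invented for illustration.

```python
# "Scores as data": a plain linear regression of scores on a model
# indicator and forecast horizon, all simulated.
import numpy as np

rng = np.random.default_rng(1)

n = 400
horizon = rng.integers(1, 5, size=n)      # forecast horizon 1..4
is_model_b = rng.integers(0, 2, size=n)   # 0 = model A, 1 = model B
# Simulated scores: worsen with horizon; model B slightly worse overall.
score = 0.5 + 0.2 * horizon + 0.1 * is_model_b + rng.normal(0, 0.3, size=n)

# Design matrix: intercept, horizon, model indicator.
X = np.column_stack([np.ones(n), horizon, is_model_b])
coef, *_ = np.linalg.lstsq(X, score, rcond=None)

print(f"intercept:       {coef[0]:.3f}")
print(f"horizon effect:  {coef[1]:.3f}")   # ~0.2 by construction
print(f"model B effect:  {coef[2]:.3f}")   # ~0.1 by construction
```

In a real analysis you’d likely want random effects for targets/locations and uncertainty on the coefficients, but even this simple version separates “model B is worse” from “long horizons are worse”.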
I really like the idea of having some living “best practice” guidance, especially for high-dimensional settings. I have had a few chats with @johannes about this and would love to get it off the ground. Something I am unclear about is how much this needs to be in our domain and how much we can farm out to the stats folks.
Yes, this exists, but speaking for myself at least, I am uncomfortable depending on that as an approach, for reasons I find hard to quantify.
In terms of good practice, I tend to look back at what @nikosbosse was doing and what @johannes has done recently.
I don’t have a lot to add to the discussion, but I agree we don’t have strong standards on uncertainty quantification around model evaluations. This is something that has started coming up at CFA as we evaluate our county-level GAM models. Currently we don’t have uncertainty quantification, and we are indeed hitting issues of “is this a big change or not”. So I hope it’s something we tackle in the next few months.
Hi! I’m a bit late to the party, but totally agree this is important. To my knowledge the most widely used test to assess whether differences in performance are significant is the Diebold-Mariano test, contained e.g. in the forecast package in R. I also have a paper called “Who has the best probabilities? Luck versus skill in prediction tournaments” on my reading list, which highlights how noisy such assessments can be. I suspect there is quite a bit of literature out there already; I’ll ask around a little in our institute.
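The Diebold-Mariano statistic itself is simple to compute by hand. A simplified sketch on simulated errors, using squared-error loss and the plain (lag-0) variance rather than the HAC/Newey-West variance the full DM test uses for multi-step-ahead forecasts:

```python
# Simplified Diebold-Mariano-style test on simulated forecast errors.
# The full test (e.g. dm.test in R's forecast package) adjusts the
# variance for autocorrelation at longer horizons; this sketch does not.
import numpy as np
from math import erf

rng = np.random.default_rng(7)

n = 300
errors_a = rng.normal(0, 1.0, size=n)   # simulated forecast errors, model A
errors_b = rng.normal(0, 1.2, size=n)   # model B: noisier errors

d = errors_a**2 - errors_b**2           # squared-error loss differential
dm_stat = d.mean() / np.sqrt(d.var(ddof=1) / n)

# Two-sided p-value from the normal approximation.
p_value = 2 * (1 - 0.5 * (1 + erf(abs(dm_stat) / np.sqrt(2))))
print(f"DM statistic: {dm_stat:.2f}, p-value: {p_value:.3f}")
```

One caveat for this community: the classic DM setup assumes point forecasts and a chosen loss; applying it to proper scores of probabilistic forecasts takes a little care, which is presumably where the “luck versus skill” paper becomes relevant.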
This is a great question - my sense from all the hub etc. work is that it’s likely underpowered but it would be great to think about this a bit more, and even more so to develop some guidance on how to address this when reporting forecast scores.
One thing that I think we discussed in the past was the idea of Model Confidence Sets, i.e. sets of models that are indistinguishable in their forecast ability. There seems to be some active work on this, with applications to COVID forecasts in “Sequential model confidence sets” and to forecasts during particular phases in “Conditional model confidence sets”.
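To give a flavour of the model confidence set idea, here is a deliberately crude toy: starting from all models, repeatedly drop the worst one while any pairwise mean score difference looks significant. This is a simplification of the Hansen, Lunde & Nason procedure (which uses bootstrapped critical values rather than the fixed threshold below); all scores are simulated.

```python
# Toy model-confidence-set-style elimination on simulated scores.
# Not the real MCS procedure: a fixed 1.96 threshold stands in for
# the bootstrapped critical values of Hansen, Lunde & Nason (2011).
import numpy as np

rng = np.random.default_rng(3)

n = 250
# Simulated scores (lower = better); C and D clearly worse than A and B.
scores = {
    "A": rng.normal(1.00, 0.3, n),
    "B": rng.normal(1.02, 0.3, n),
    "C": rng.normal(1.30, 0.3, n),
    "D": rng.normal(1.50, 0.3, n),
}

def t_stat(d):
    return d.mean() / np.sqrt(d.var(ddof=1) / len(d))

survivors = set(scores)
while len(survivors) > 1:
    # Largest pairwise t-statistic among surviving models.
    pairs = [(m, k, t_stat(scores[m] - scores[k]))
             for m in survivors for k in survivors if m != k]
    worst, _, t_max = max(pairs, key=lambda p: p[2])
    if t_max < 1.96:         # crude fixed threshold, not the bootstrap one
        break
    survivors.remove(worst)  # drop the model with the worst relative scores

print("models in the (toy) confidence set:", sorted(survivors))
```

The appeal for hub-style evaluations is that the output is honest about power: with few forecasts, many models simply end up in the same set.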