Do your evaluations have enough power?

jonathon.mellor · 15 December 2025 17:03

This is more a collection of thoughts than a coherent research idea, alas.

I’m reading lots of literature and doing lots of model scoring, but I can’t help worry that I’m drawing conclusions from statistical artifacts rather than real effects.

It seems like almost all of the literature is “the performance of model A is % better or worse than model B”, regardless of scoring rule. That holds for a single forecast, but also for a multi-year many many forecasts evaluation. But there’s no clear indication of how valid that comparison is, based on how many forecasts were made.

I suppose I’m taking the long way around of saying, should we be doing some version of significance testing in forecast evaluations? Bootstrapping to understand variation in the scores? Anything relating to uncertainty in our scores?

Is anyone already doing this or are there good examples our there? I’d imagine Hubs are probably the natural place to explore this, but it could probably be tackled from the theory side as well.

There’s some regression modelling for forecast evaluation (which is really cool and should be done more), but are we missing the step before that as? How do I know if I’m making enough comparisons to conclude whether model A is better than model B?

I’ve got ideas of how I’d do this if I needed to quickly, but perhaps what I’m also hinting at is perhaps there should be guidance/a consensus/an example out there for others to follow.

There’s a risk researchers learn the wrong lessons from underpowered evaluations, and that the complexity of scoring increases without incorporating sample sizes.

jonathon.mellor · 22 December 2025 10:31

To answer my own question:

Obtain pairwise comparisons between models — get_pairwise_comparisons • scoringutils

Includes uncertainty in the comparison, which gets me most of the way there, and then by doing a regression and comparisons from there you can do some more advanced comparisons accounting for structural bits.

Fingers crossed these things become widely used!

kath-sherratt · 5 January 2026 16:06

Just wanted to agree with your thoughts here (and thanks for linking the work! Expanding on this definitely a priority).

Not quite the main point of your post, but especially picking up on this:

This is something that @kejohnson9 and I have chatted about as well, and a bit with @samabbott / @sbfnk . It feels a bit like the wild west in how we select what to evaluate … which makes it basically impossible to compare across published evaluations (even if they are reporting the same metric, which as you mention, is not often). But ideally as you say we would all refer to a consensus best practice or reporting guideline with some minimum standards for reporting scores. Off the top of my head, I imagine that could include e.g. selection of scoring metric; level of stratification / aggregation across multiple forecasts; considering the scale/data transforms; potential confounding factors…

As an open question, I’d be keen to hear what others think of the variability in how evaluation is conducted/reported, and how methodological choices affect comparability. And/or examples of good practice!

samabbott · 5 January 2026 16:18

I strongly agree with this concern. It really is the wild west at the moment. Not only drawing too much from artifacts but also the lack of common practice meaning people accidentally/intentionally go down different reporting paths to find what they want to find.

So my view of this is that in some sense the scores can be thought about as data and from there you can then naturally do lots of things to them with different statistical tools. Often these have been quite classic frequentists tests but I don’t think there is a reason you can’t also do more generative bayesian models etc. The tricky bit is of course making sure what you do doesn’t get rid of the proper part of the proper score which is tricky.

Personally, though obviously biased, I do think modelling the scores is the way to go most of the time if getting serious about this. In the first instance making sure to report the distribution of scores etc seems sensible.

I really like the idea of having some living guidance “best practice”, especially for high-dimensional settings. I have had a few chats with @johannes about this and would love to get it off the ground. Something I am unclear about is how much this needs to be in our domain and how much we can farm out to the stats folks.

Yes this exists but I think I would say for at least myself I am uncomfortable about depending on that as an approach for reasons I find hard to quantify.

In terms of good practice, I tend to look back at what @nikosbosse was doing and what @johannes has done recently.

samabbott · 5 January 2026 16:29

In terms of standardising things this is where the idea for doing someting like Baseball Stats, Model Cards, and Forecasting Performance was coming from and I still think that would be neat/acheivable

athowes · 6 January 2026 01:10

I don’t have a lot to add to the discussion, but I agree we don’t have strong standards on uncertainty quantification around model evaluations. This is something that has started coming up at CFA as we evaluate our county-level GAM models. I think currently we don’t have uncertainty quantification, and indeed are hitting issues on “is this a big change or not”. So I hope it’s something we tackle in the next few months.

johannes · 12 January 2026 11:41

Hi! I’m a bit late to the party, but totally agree this is important. To my knowledge the most widely used test to assess if differences in performance are significant is the Diebold Mariano test, contained e.g., in the forecast package in R. I also have a paper called Who has the best probabilities? Luck versus skill in prediction tournaments on my reading list, which highlights how noisy such assessments can be. I suspect there is quite a bit of literature out there already, I’ll ask around a little in our institute.

sbfnk · 13 January 2026 15:54

This is a great question - my sense from all the hub etc. work is that it’s likely underpowered but it would be great to think about this a bit more, and even more so to develop some guidance on how to address this when reporting forecast scores.

One thing that I think we discussed in the past was the idea of Model Confidence Sets, i.e. sets of models that are indistinguishable in their forecast ability - there seems to be some active work on this, with applications to COVID forecasts in Sequential model confidence sets and to forecasts during particular phases in Conditional model confidence sets.

samabbott · 16 January 2026 11:58

Yes I agree.

I’d forgotten about model confidence sets need to put that in the reading pile. What I am getting from this is there is a clear need for something on current best practices and challenges here?

kath-sherratt · 16 January 2026 13:11

I was thinking - Delphi consensus study on forecast evaluation best practice? (In fact Pollet et al did something similar to come up with the epiforge guidelines.)

sambrand · 16 January 2026 17:40

Something I’ve been thinking about, but don’t have well-formed thoughts… is how someone who had access to an ensemble of forecasts could diagnose that a group of forecasts were settling on confidence set vs there was a rapidly developing field of models.

For example, I could hypothesise that a “stable” (in some poorly defined sense) set of “good” models probably don’t have an ensemble forecast skill that can be long term beaten by a linear pool. My vibes reasoning would be similar to classic probabilistic modelling of financial returns (i.e. BSM) where the argument was that normally distributed log-returns were a good model, because if they weren’t everyone would pile into the arbitrage opportunities and it would auto-correct.

jonathon.mellor · 19 January 2026 09:03

Thanks all! Really great to see all your thoughts.

Seems like we all agree there’s a gap here and evaluations could be done more robustly/reported better.

It feels like any guidance specific to the epi forecasting community on the current state or recommended evaluation would be a step up from where we currently are. The worry with relying on the more generalist statistics advice is that it’s harder for this community to find it, which I think motivates some “translation”.

Lots of further reading to do based on this thread! I feel like the Diebold-Mariano test does to some extent the specific question I was starting with (if it can generalise to e.g. WIS), but the push for better guidance is really where the real value will be.

I can imagine a nice paper that uses either some open hub data, or a simulation to demonstrate when models aren’t distinguishably better from each other. Perhaps there’s a Bronze/Silver/Gold standard for evaluation - 1) report the distribution, 2) use a test statistic, 3) regress on the scores to infer differences.

In my head there’s almost a power analysis that could be done ahead of prospective forecasting studies. If I have a N forecast (e.g. weekly over a winter), and some significance criteria, what’s the minimum effect size (performance difference?) that we can make conclusions about? A big oversimplification - but the motivation for me is being worried we evaluation one winter at a time, which might not actually tell us anything useful.

Outside of writing a paper, it’d be great to bring our thoughts together in a slightly more structured way as this thread is a gold-mine but probably tricky to discover for others!

nickreich · 19 January 2026 22:47

I agree with this being a big challenge facing the field. I worry a lot about fitting regression models to very highly correlated observations and then trying to draw valid statistical inference there. This is, I think, very challenging. It still might be the “best” thing to do, but basically I wouldn’t trust inference coming out of most regression-based models unless it was very painstakingly done and residual plots examined with a fine-toothed comb. Just throwing random effects in and/or adjusting for everything I’m not sure will give you what you want, and then maybe you’re back to being underpowered again.

I’m excited to look more closely at the Model Confidence Sets idea.

samabbott · 20 January 2026 10:33

Thanks @nickreich,

I agree that model building of any kind is challenging but also generally think that most other actions we take are often ad-hoc model building of one kind or another. In my view, using a formal toolbox with a large literature is generally going to be safer.

Do you have some preferred literature on what you consider best practice for regression model building? It sounds like you are drawing from some slightly different silos than I typically read.

My concern with the various tests etc. is that they are just a regression model with a mask on often so you are not as safe as you may think. That being said I need to engage with them more.

@jonathon.mellor I agree. I think in the first instance just something that collects current best practice people can agree on and then highlights ongoing challenges seems like it would be useful. I really like the idea of a tiered set of reporting standards.

I don’t know a huge amount about power analysis but you would think there must be. I am a big fan of setting up these studies more like an experiment (I thought this was great: https://onlinelibrary.wiley.com/doi/10.1111/ele.70251). I am quite bored of secondary analysis that cut the problem N ways, report M of them (where M << N), and the get very excited about the ith cut.

I’m not totally sure how we should take this forward so any thoughts very welcome. I guess the next step might be to have a call with people who can commit some time to thinking this hrough?

nickreich · 22 January 2026 17:27

“preferred literature” might be a bit of a misnomer here (given my relative lack of expertise in this area), but after some digging here are some papers that I found that I think are talking about the kinds of issues that regression models face, in particular around std error estimation:

Cameron, Gelbach & Miller (2011) - “Robust Inference with Multiway Clustering” Journal of Business & Economic Statistics

Driscoll & Kraay (1998) - “Consistent Covariance Matrix Estimation with Spatially Dependent Panel Data” Review of Economics and Statistics

Mass & Hox (2004) - “The influence of violations of assumptions on multilevel parameter estimates and their standard errors” Computational Statistics & Data Analysis

I agree that other “actions we take are often ad hoc”.

To me there is a distinction between operational decision-making and inference about models. For example, when I am building a set of models and needing to decide which one to submit, I might find it useful to have a Model Confidence Set and then pick the simplest/smallest model in that set. I’m less concerned about the inference part and more just trying to make a decision about what model I should be running every week. But when I’m running a systematic analysis and trying to derive generalizable insights then two things are true: (1) it is important to have a formal process for inference, and (2) it is important to understand the limitations and weaknessess of that inferential procedure. I think we are in the multilevel/multiway-clustering-panel-data-spatial kind of realm here in terms of our data setup, but often with more dimensions than a classical panel data that has something like location/time variables. We have location/forecast-time/horizon/model/…

Overall, I’m saying YES we should build a model. But, man, it’s going to take a lot of convincing for me to trust that the inferences that come out of it are valid.

johannes · 26 January 2026 15:05

One issue with regression models I’ve been pondering for a while is how it interacts with propriety. Most suggestions for regression modelling of scores I have seen used log(WIS + 1) as the outcome because it helps with the non-negativity of scores and the fact that they are skewed.

The problem I then see is that to get a good regression coefficient for your model you have to minimize log(WIS + 1), which as pointed out in this absolute landmark paper is not a proper scoring rule.

You could of course also use a model without a transformed WIS (or use WIS(log incidence) as the target), but it’s completely unclear to me how we can handle both the heteroskedasticity of the outcome variable and the incentives correctly.

samabbott · 27 January 2026 14:24

Yes, I totally agree with this but I also don’t trust the vast majority of evaluation that is done at the moment that looks at lots of different strata in the data in an ad-hoc way so my threshold for thinking we should use a model is much lower.

I need to engage with the MCS literature more but on a first pass I assume it can also be represented in a model framework which would be handy as again it means you have just a single set of tools to learn/develop best practices for.

Yup this is the big problem right - depending on how you set them up and as you say one of the arguments @nikosbosse gave for why a transformed score is nice.

Again something, I wonder is how often we have similar problems when we are reasoning about a model’s performance by i.e. location and horizon using graphs etc. It would be interesting to try and unpick if the model setup just makes a more common problem obvious.

johannes · 27 January 2026 14:33

What do you mean by “similar problems” here? At least in terms of propriety I’m comfortable with showing performance by horizon etc.

samabbott · 27 January 2026 14:42

Yes I think that is fine. I am thinking about settings where people try and say something about a combination of factors i.e location, horizon etc.

The issue with propriaty in the models is when you need to use a transform right?

I was wondering if there are any settings in which viewing the data in different cuts, making a score and interpreting it i.e forecast performance was worse at longer horizons in Brazil than in Australia or if there is no way that interepration can become problematic from that perspective.

Writing that out I think it is likely fine?

johannes · 27 January 2026 14:46

The two issues I’d see here are scale dependence of scores (you’ll always have lower WIS or CRPS in Luxembourg than the US) and general noisiness.

Topic		Replies	Views
Community Seminar 2024-08-07 - Kaitlyn Johnson - Wastewater modeling to forecast hospital admissions in the US: Challenges and opportunities Meetings	19	173	14 August 2024
Baseball Stats, Model Cards, and Forecasting Performance Project Proposals	15	235	24 February 2026
How can collaborative infectious disease forecasting/nowcasting projects be improved?	6	515	5 June 2023
A basket of baselines Project Proposals	15	126	27 January 2026
Streamlining of epi modeling tools	12	99	14 August 2024

Do your evaluations have enough power?

Related topics