Measure of forecast difficulty?

Hi all,

As the winter comes to an end, we are starting to evaluate our suite of forecasts, and the question of how to compare across pathogens, seasons, locations and models has come up.

I think some of these are straightforward to think about (i.e. comparing different models), but others less so, like comparing forecasts across pathogens and seasons. The simple answer is “they are not comparable, don’t compare them”, but our forecasts do exist in the context of other pathogens and seasons, so comparison is inherent.

It’d be handy to have some sort of measure of forecast difficulty to either weight or compare performance against. Specifically, our COVID-19 forecast looks great because we have had a flat epidemic over the last few months, which is very easy to forecast! However, the flu wave has been complex, and therefore the models aren’t as good.

Are there existing measures of forecast difficulty / complexity?

My two thoughts on such measures might be:

  • autocorrelation of the epidemic growth rate at different horizons (a rough sketch follows this list)
  • similarity with past seasons’ waves (only useful when past seasons are available, obviously)
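
As a rough illustration of the first idea, here is a minimal Python sketch: the growth rate is proxied by differences of log incidence, and the lag set is arbitrary. Higher autocorrelation at the horizons you forecast would, loosely, suggest an “easier” series in this sense.

```python
import numpy as np

def growth_rate_autocorrelation(cases, lags=(1, 2, 4)):
    """Autocorrelation of the empirical growth rate at a few lags.

    cases : 1-D array of strictly positive incidence counts, one per time step.
    Returns {lag: autocorrelation}; higher values at your forecast horizons
    would (loosely) indicate a more predictable series.
    """
    cases = np.asarray(cases, dtype=float)
    r = np.diff(np.log(cases))        # empirical per-step growth rate
    r = r - r.mean()
    return {lag: float(np.sum(r[:-lag] * r[lag:]) / np.sum(r * r)) for lag in lags}
```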

There is probably generic forecasting literature on this topic (I have not looked extensively), but it would be useful to know whether there is anything specific to epidemic forecasting.

I think what I’m also trying to get at is that, across pathogens/forecasts, we run the risk of implying certain models/approaches are better than others, when actually the epidemic was just easier to forecast for one of them.

Anyone else have thoughts?

The fundamental measure of complexity is the entropy of the time series, which can be estimated empirically, e.g. https://uk.mathworks.com/help/predmaint/ref/approximateentropy.html (the fact that I’m linking something from MATLAB indicates how far back into my memory recesses I’m going here).
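
For concreteness, here is a minimal Python sketch of approximate entropy in the spirit of the MATLAB function linked above (not that exact implementation); the defaults m = 2 and r = 0.2 × sd are conventional rather than principled, and the pairwise distance matrix makes it O(n²), which is fine for a single season of weekly data.

```python
import numpy as np

def approximate_entropy(x, m=2, r=None):
    """Approximate entropy (Pincus-style) of a 1-D series.

    m : embedding dimension, r : tolerance (defaults to 0.2 * std of x).
    Higher values indicate a more irregular, less predictable series.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    if r is None:
        r = 0.2 * np.std(x)

    def phi(dim):
        # all overlapping templates of length `dim`
        templates = np.array([x[i:i + dim] for i in range(n - dim + 1)])
        # Chebyshev distance between every pair of templates
        dist = np.max(np.abs(templates[:, None, :] - templates[None, :, :]), axis=2)
        # fraction of templates within tolerance r of each template (self-matches included)
        c = np.mean(dist <= r, axis=1)
        return np.mean(np.log(c))

    return phi(m) - phi(m + 1)
```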

However, that’s a bit unsatisfying for epi modelling, because it really just sets a lower bound on the log score. For example, in a flat epidemic with iid fluctuations around the mean you can’t do better than forecasting the fluctuation distribution. If the fluctuations are bigger, then it’s “harder” to forecast (in the sense that the log score will be worse), but on the other hand the creative process of getting a model as close to theoretical optimality as possible is still very simple.
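
To make that concrete, here is a toy Python simulation (assuming Gaussian iid fluctuations around a flat mean of 100, which is purely illustrative): even the theoretically optimal forecast only achieves the noise entropy on average, so the bigger-sigma series looks “harder” by log score even though the modelling task is equally trivial.

```python
import numpy as np

rng = np.random.default_rng(1)

# "Flat epidemic": iid Gaussian fluctuations around a constant mean.
# The best possible forecast is the fluctuation distribution itself,
# and its average log score can only approach the noise entropy.
for sigma in (1.0, 5.0):
    y = 100 + rng.normal(0, sigma, size=100_000)                   # observations
    log_density = -0.5 * np.log(2 * np.pi * sigma**2) - (y - 100) ** 2 / (2 * sigma**2)
    log_score = -np.mean(log_density)                              # negatively oriented: lower is better
    entropy = 0.5 * np.log(2 * np.pi * np.e * sigma**2)            # theoretical lower bound
    print(f"sigma={sigma}: mean log score {log_score:.3f} vs entropy bound {entropy:.3f}")
```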

It’s a super boring answer… but I think the most practical/interpretable measure of “difficulty” is comparing the performance of a well-understood set of baseline forecast models on the two different time series. In a practical sense, something like “SARIMA does well out-of-sample on this time series” means it’s an “easy” time series compared to one where SARIMA does badly out-of-sample.
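
As an illustration of what that could look like in practice (a sketch only: the SARIMA specification, the weekly seasonality of 52, and the series names cases_disease_x / cases_disease_y are all placeholder assumptions, not recommendations):

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

def sarima_holdout_mae(series, horizon=4, order=(1, 1, 1), seasonal_order=(1, 0, 1, 52)):
    """Out-of-sample mean absolute error of a SARIMA baseline on one series.

    series : 1-D array of weekly counts; the last `horizon` points are held out.
    """
    series = np.asarray(series, dtype=float)
    train, test = series[:-horizon], series[-horizon:]
    fit = SARIMAX(train, order=order, seasonal_order=seasonal_order).fit(disp=False)
    return float(np.mean(np.abs(fit.forecast(steps=horizon) - test)))

# Crude "difficulty" comparison: scale each MAE by the series mean so the two
# pathogens are comparable; the series on which the same baseline does worse
# is, in this sense, the harder one.
# difficulty_x = sarima_holdout_mae(cases_disease_x) / np.mean(cases_disease_x)
# difficulty_y = sarima_holdout_mae(cases_disease_y) / np.mean(cases_disease_y)
```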

I don’t have any useful suggestions, although I agree with @sambrand that naive baselines are probably the most viable option and have a long “tradition” (e.g. the MASE score, Hyndman).
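
For reference, the MASE mentioned here just scales the forecast errors by the in-sample MAE of a (seasonal) naive forecast, so values below 1 beat the naive baseline. A quick sketch, with the seasonal period m something you would set per pathogen:

```python
import numpy as np

def mase(y_true, y_pred, y_train, m=1):
    """Mean absolute scaled error (Hyndman & Koehler style).

    y_train : the historical series used to fit the model; the scaling term is
    the in-sample MAE of a naive forecast that repeats the value m steps back.
    """
    y_true, y_pred, y_train = (np.asarray(a, dtype=float) for a in (y_true, y_pred, y_train))
    naive_mae = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return float(np.mean(np.abs(y_true - y_pred)) / naive_mae)
```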

But I’m curious how you were planning to use such a difficulty weighting:

  • To decide which models need further improvement - assuming you have different models for different pathogens/locations?
  • To decide for which pathogens/locations a specific model needs to be better adapted / to find its weak spots?
  • To decide how much you will “trust” a specific model for a specific pathogen/location in the next season?
  • To generally compare the performance of several models in aggregate across a larger number of pathogens/locations?

I feel like depending on what your objective is, weighting by difficulty may be more or less useful. Just as an example, if you have a pathogen that is generally “easier” to forecast than others, does it make sense to put more weight on forecast errors/be stricter for this pathogen? Maybe we are happy that it is easy to forecast so we don’t have to worry about improving our models further for it.

Finally, one general problem of “difficulty” measures I see especially in epi is that we don’t know how much of our uncertainty is actually aleatory/stochastic. Sure, you can measure the entropy of a time series, but this will only tell you something about predictability with respect to the time series itself. Maybe your model could use another data source that explains away lots of the “random” variation you see in your target time series, and suddenly forecasting might be much “easier”. This is not an argument against comparing models across different settings, but means that “difficulty” scores are likely data-dependent themselves.

> Finally, one general problem of “difficulty” measures I see especially in epi is that we don’t know how much of our uncertainty is actually aleatory/stochastic. Sure, you can measure the entropy of a time series, but this will only tell you something about predictability with respect to the time series itself. Maybe your model could use another data source that explains away lots of the “random” variation you see in your target time series, and suddenly forecasting might be much “easier”.

Hard agree here. This is, for me, a good motivation for baselining with something like SARIMA. My reasoning is that if you come up with a creative model that explains a lot more variation than a tested method like SARIMA, that’s quite informative even if there is substantial “left-over” variation.

As you imply, the entropy puts a theoretical bound on predictability, but we don’t actually know what that bound is (empirical entropy estimates can be far from the true entropy of the “true” generative process).


Thanks @adrianlison @sambrand, really interesting. We recently did a journal club on “Recalibrating probabilistic forecasts of epidemics” and entropy came up there, but I didn’t quite make the connection.

Performance of a baseline model does make sense as a comparator for complexity. However, it does sort of shift the problem to the recurring “what is a good baseline model?” question - but that is probably always the case.

So the sort of conclusion with naive baselines across pathogens would be “my SARIMA is better for disease X than for disease Y, therefore disease Y is more complex” - is that what we are getting at? I think that makes sense.

Regarding how we are going to use that sort of weighting, I don’t have a clear answer yet, but it is a gap in what we can produce at the moment, I think. The challenge I’m contemplating is:

  • we forecast 4 diseases (one flat, one weird, one identical to the year prior, one double the year prior)
  • we (the modellers) want to avoid drawing strong conclusions about which we are “better” at forecasting, as the waves are all different
  • our users / commissioners will inherently draw conclusions about what we are better at based on our evaluations (a secondary point is that they will want to direct where we focus efforts for next year)
  • how can we express how complex each pathogen was, in an empirical way, so that the comparisons are based on evidence rather than eyeballing?

@adrianlison, regarding your point on alternative data sources explaining the variation, I think this fits well with our RSV problem - the wave looked very similar to the last few years. In that context, the complexity our users would think of is probably just the dissimilarity to previous years. Back to baselines: choosing an appropriate baseline does come down to the pathogen to some degree, as what works for RSV may not work for COVID - is that because one is more complex, or just because the baseline model assumptions were better suited to one problem than the other?

Perhaps this is something we just need to communicate qualitatively to users. I could see us:

  • having a suite of baseline models
  • working out the entropy
  • comparing against past seasons

And then, using these 3 criteria, having a “low/medium/high” type classification for complexity that we share with our evaluation. This way it’s more evidence-based than eyeballing, backed by some metrics, but not strict, as none of the approaches covers everything.

I think this sounds like a good plan. If you are planning to go for a qualitative approach with a suite of baselines, maybe you can also think about which qualitative aspects are, in your opinion / for your models, relevant in terms of difficulty, and then use a different, very simplistic baseline to capture each aspect.

For the seasonality part, maybe just a model that predicts the average of the last year at that time (potentially made slightly more complex by fitting a scaling factor and shift). For the autocorrelation part, just a model that predicts the last observed value, and so on…

The challenge might be to have probabilistic versions of these baselines, but maybe you can just use a score that allows comparison of both point and probabilistic forecasts, such as the CRPS.
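
One way to handle that, assuming each baseline can be represented as a (possibly single-element) set of forecast samples, is the standard sample-based CRPS estimator; for a point forecast it collapses to the absolute error, which is what puts point and probabilistic baselines on the same scale:

```python
import numpy as np

def crps_from_samples(samples, observed):
    """Sample-based CRPS estimator: E|X - y| - 0.5 * E|X - X'|.

    `samples` is a 1-D array of forecast draws; a point forecast is just a
    single-element array, in which case the score reduces to |x - y|.
    """
    samples = np.atleast_1d(np.asarray(samples, dtype=float))
    term1 = np.mean(np.abs(samples - observed))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return float(term1 - term2)

# e.g. scoring a "last observed value" point baseline against a probabilistic model
# (last_observed, model_samples and observed are placeholders):
# crps_from_samples(np.array([last_observed]), observed)
# crps_from_samples(model_samples, observed)
```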


Having thought about it a little more - something cropped up.

Is there a lower bound for the WIS?

The minimum possible value is 0 in theory, but that is not necessarily achievable by any realisable forecast.

Perhaps, given some data and assumptions about a simple (perfect?) model, a lower bound on the WIS can be derived.

I’m probably just being confused, but if there’s a lower bound for the log score it’d be interesting to have one for WIS, as it seems WIS is more widely used in recent literature.

There is a lower bound on the expected WIS, i.e. on the WIS of your probabilistic forecast averaged over the actual generative distribution of the random variable you are trying to forecast.

The classic Gneiting and Raftery paper https://www.tandfonline.com/doi/abs/10.1198/016214506000001437 has a bit on this (Section 2.2).


Obviously, for any given forecast you can be spot on and get a score of zero; the lower bound is on the average.

For the log score, the minimum average is the entropy of the distribution of the thing you’re trying to forecast, and the divergence (i.e. how much worse you are on average compared to the theoretical best possible) is the Kullback-Leibler divergence (e.g. the usual target for minimisation in variational inference).

The link above shows that you can split other proper scores in the same way.
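
To sketch what that split looks like (following Gneiting & Raftery, Section 2.2, with signs flipped to the “lower is better” convention used for the log score and WIS):

```latex
% Expected-score decomposition for a (negatively oriented) proper scoring rule S,
% when the data actually come from the distribution G:
\[
  \mathbb{E}_{Y \sim G}\!\left[ S(F, Y) \right]
  \;=\;
  \underbrace{\mathbb{E}_{Y \sim G}\!\left[ S(G, Y) \right]}_{\text{generalized entropy } e(G),\ \text{the lower bound}}
  \;+\;
  \underbrace{d(F, G)}_{\text{divergence} \,\ge\, 0}
\]
% Log score: e(G) is the Shannon entropy of G and d(F, G) = \mathrm{KL}(G \,\|\, F).
% CRPS (which the WIS approximates): e(G) = \tfrac{1}{2}\,\mathbb{E}|Y - Y'| and
% d(F, G) = \int \bigl( F(x) - G(x) \bigr)^2 \, \mathrm{d}x.
```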
