A basket of baselines

Been thinking about the overlap between baseline models (Manuel Stapper is working on this), the method of analogs (stuff @nickreich and co are going on about), and model output evaluation (i.e. @kathsherratt's work). This builds on the surrogate modelling work @sbfnk and I did a few years ago, where we tried to replicate a forecast hub ensemble's performance using a simple model (got within 20% in real-time evaluation).

Idea: Define a set of baseline models with well-understood characteristics.

Manuel Stapper and I have discussed this “basket of baselines” concept.

Some examples:

  • What would a time series expert build given only the time series data (no domain context)?

  • What would an infectious disease specialist build given domain knowledge but no time series?

  • Maybe a few other dimensions

The result is a bunch of simple baseline models that we understand and that are easy to fit, robust, well calibrated, etc., with different people perhaps drawing the line in different places in terms of complexity.

Having this basket is I think useful in and of itself and certainly more powerful than trying to pick an “optimal” baseline for a given evaluation.

The thing I have been thinking about to take this further is then using them to understand other models by:

  1. Use time series decomposition (STL or similar) to break each baseline forecast into components: trend, periodicity, residuals, etc.

  2. Express other forecast models (from hubs etc.) as combinations of these baseline components. For example: “Model X has the trend of an SIR model but the residuals of a time series model” (this is overly simple; in reality it would probably need to be more like GP kernel composition, i.e. additive and multiplicative combinations etc.).

  3. This creates a dimensionally-reduced space (defined by the baseline models) where we can understand relationships between models. Could implement via regression or mixture models with interactions.
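
To make steps 1–3 a bit more concrete, here is a minimal Python sketch. Everything in it is illustrative rather than a proposed implementation: the two “baselines” are a synthetic random walk and a pure seasonal curve, statsmodels' STL does the component split, and plain ordinary least squares stands in for whatever regression or mixture model we would actually want.

```python
# Minimal sketch of steps 1-3 (all data and model names are synthetic/illustrative).
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL


def decompose(forecast: pd.Series, period: int = 52) -> pd.DataFrame:
    """Split one forecast series into trend / seasonal / residual components."""
    res = STL(forecast, period=period).fit()
    return pd.DataFrame({"trend": res.trend, "seasonal": res.seasonal, "resid": res.resid})


rng = np.random.default_rng(1)
dates = pd.date_range("2022-01-02", periods=156, freq="W")
baselines = {
    "random_walk": pd.Series(100 + np.cumsum(rng.normal(0, 2, 156)), index=dates),
    "seasonal": pd.Series(100 + 20 * np.sin(2 * np.pi * np.arange(156) / 52), index=dates),
}

# Step 1: decompose each baseline and stack the components into a design matrix.
X = pd.concat({name: decompose(series) for name, series in baselines.items()}, axis=1)

# Steps 2-3: express a (here fabricated) hub model as a combination of those
# components; the fitted coefficients are its coordinates in "baseline space".
hub_model = 0.7 * baselines["random_walk"] + 0.3 * baselines["seasonal"] + rng.normal(0, 1, 156)
coefs, *_ = np.linalg.lstsq(X.to_numpy(), hub_model.to_numpy(), rcond=None)
print({col: round(c, 2) for col, c in zip(X.columns, coefs)})
```

In practice the combination would need to be richer than a plain linear model (interactions, multiplicative terms, as per the GP kernel analogy), but even here the coefficients start to read like coordinates of a model in the space spanned by the baselines.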

You could also do this more simply by not decomposing the baselines and just trying to represent complex models directly with them.
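
For that simpler variant, something like the following might do, again purely as a sketch: the baselines and the “complex” model are synthetic, and non-negative least squares is just one convenient choice for keeping the weights interpretable as mixture-like contributions.

```python
# Self-contained sketch of the no-decomposition variant: regress a complex
# model's point forecasts directly on the baseline forecasts.
# All data are synthetic placeholders.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(2)
t = np.arange(104)
baselines = {
    "random_walk": 100 + np.cumsum(rng.normal(0, 2, t.size)),
    "seasonal": 100 + 20 * np.sin(2 * np.pi * t / 52),
}
# A fabricated "complex" model that happens to blend the two baselines.
complex_model = (
    0.6 * baselines["random_walk"] + 0.4 * baselines["seasonal"] + rng.normal(0, 1, t.size)
)

# Non-negative least squares keeps the weights mixture-like and interpretable.
B = np.column_stack(list(baselines.values()))
weights, _ = nnls(B, complex_model)
print(dict(zip(baselines, np.round(weights, 2))))
```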

There is also some connection here to the performance above replacement idea we talked about in Baseball Stats, Model Cards, and Forecasting Performance - #7 by samabbott, i.e. you would replace with different kinds of baseline.

Another alternative is to do non-outcome-focused clustering: apply PCA or similar to the decomposed model components to understand how models group together.
Less interesting than the baseline approach but maybe more feasible.
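
As a rough illustration of that route (again with made-up series, and STL plus two principal components as arbitrary choices):

```python
# Self-contained sketch of the clustering alternative: decompose each model's
# forecast series, stack the components into one feature vector per model,
# and run PCA to see how models group. All series here are synthetic.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from statsmodels.tsa.seasonal import STL

rng = np.random.default_rng(3)
n, period = 156, 52
t = np.arange(n)
models = {
    "random_walk": 100 + np.cumsum(rng.normal(0, 2, n)),
    "seasonal": 100 + 20 * np.sin(2 * np.pi * t / period),
    "blend": 100 + 10 * np.sin(2 * np.pi * t / period) + np.cumsum(rng.normal(0, 1, n)),
}

features = []
for series in models.values():
    res = STL(pd.Series(series), period=period).fit()
    # One long feature vector per model: its trend, seasonal and residual parts.
    features.append(np.concatenate([res.trend, res.seasonal, res.resid]))

scores = PCA(n_components=2).fit_transform(np.vstack(features))
for name, (pc1, pc2) in zip(models, scores):
    print(f"{name}: PC1={pc1:8.1f}  PC2={pc2:8.1f}")
```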

Still working through how this all connects to scoring - could either:

  • Add scoring as a final step post this decomposition

  • Use the composition framework to better understand score patterns

  • Or make scores themselves the final target, but that gets a bit more complicated

This could be a neat way of getting at the composition of forecast hubs that don’t have good metadata, i.e. making their secondary data more valuable for other uses.
The dimension reduction on forecast output, whether as scores or as raw forecasts/residuals, feels like something there must be literature on? Perhaps not for the baseline approach, but maybe even there.
It’s also not totally clear where the truth data comes into this, i.e. whether the breakdown is relative to the truth data or absolute.

Still rolling this around in my head and waiting on a few baseline papers people are writing to see what directions they go.


Also a basket of baselines is just a lovely title for a paper


Also I posted this in a few other places and roughly copied it across, so it’s a bit burbled.

I am super keen on this!

Having chatted about this before, adding some of the previous comments.


I think for me the highlight was this @samabbott:

The main assumption, I think, is that you can accurately express a forecast as a combination of other forecasts, and that you understand both the components and how you have composed them well enough for this to be useful

And this was my own summary of the whole thing, to make sure I understood correctly - but adding it here again in case it helps others to see it put in different words.

This seems like seeing each (realised) model as an “ensemble” of theoretical components that are remixed in various ways [with the simplest possible combinations creating the basket of baseline models]

  • Bottom up model composition: do some principled (theoretical) model building (the baseline models); use these as “predictors”/“explanatory variables” for (realised) hub models. So, probably being a bit simplistic, you could end up with coefficients that act like a continuous, quantitative measure alongside the binary model categorisations we tend to talk about (e.g. “statistical”, “semi-mech”, “mech”).
  • Top down model decomposition: take the (realised) hub models’ outputs, cluster with e.g. PCA, and hope to interpret the principal components as something like the underlying (theoretical) model components? Maybe this approach has some echoes of the model similarity work and the Cramer distance.

Then, following up with some thoughts for analysis plans:

Maybe (subsequent) evaluation with scoring could then be like comparing the performance between (a toy sketch of this comparison follows below):

  1. equal ensemble of the theoretical baseline models *
  2. each realised model (identifying this as a kind of “unequal ensemble” of the baseline models)
  3. equal ensemble of realised models

where comparative analysis is like:

  • 3 vs 2 - as seen in hub evaluation papers (equal hub ensemble vs individual model rWIS)
  • 2 vs 1 - shows the difference made by modeller choices, in combining basic model flavours vs the equally weighted baseline set
  • 3 vs 1 - shows difference from collating multiple complex models versus a minimum baseline set [perhaps useful for assessing value of collaborations]

Added to by @samabbott with:

* Or the ensemble could be built on some decomposition of the baselines, i.e. you ensemble each component (trend etc.) and recombine, or just directly use the baseline basket.
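
To make that three-way comparison concrete, here is a toy sketch. Everything is fabricated: two baseline models, two “realised” models, a synthetic truth series, and mean absolute error standing in for a proper scoring rule such as WIS.

```python
# Toy sketch of the three-way comparison: (1) equal ensemble of baselines,
# (2) each realised model, (3) equal ensemble of realised models.
# All series are synthetic and MAE stands in for a proper score such as WIS.
import numpy as np

rng = np.random.default_rng(4)
n = 104
t = np.arange(n)
truth = 100 + 20 * np.sin(2 * np.pi * t / 52) + np.cumsum(rng.normal(0, 1, n))

baselines = {
    "random_walk": np.concatenate([[truth[0]], truth[:-1]]),  # last observed value
    "seasonal": 100 + 20 * np.sin(2 * np.pi * t / 52),        # pure seasonal curve
}
realised = {
    "model_A": truth + rng.normal(0, 4, n),             # decent but noisy
    "model_B": 0.5 * truth + 50 + rng.normal(0, 6, n),  # biased
}

def mae(forecast):
    return np.mean(np.abs(forecast - truth))

scores = {
    "1. equal baseline ensemble": mae(np.mean(list(baselines.values()), axis=0)),
    "3. equal realised ensemble": mae(np.mean(list(realised.values()), axis=0)),
    **{f"2. {name}": mae(f) for name, f in realised.items()},
}
for label, s in sorted(scores.items()):
    print(f"{label}: MAE = {s:.1f}")

# Comparisons then read as ratios, e.g. 2 vs 1: a realised model relative to the
# equally weighted baseline ensemble (values < 1 mean modeller choices helped).
print("model_A vs baseline ensemble:",
      round(scores["2. model_A"] / scores["1. equal baseline ensemble"], 2))
```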


In any case - interested in thinking about / working on this :slight_smile:


I think there are some really interesting ideas here; I’m particularly interested in the decomposition of models into components.

This chimes with how we are thinking about our ensembling for winter 25/26: have one model each covering semi-mechanistic, time series, and seasonality approaches so that we have different “components” covered in the ensemble, rather than “just add more time series”, which worked for COVID to some extent but not for other pathogens.

Regarding having a basket of baselines, one practical challenge I’ve found when trying to apply baseline methods from other countries/competitions/papers is the different spatio-temporal granularities of data available. The weekly baseline methods of CDC competitions don’t work in the UK as we need to model daily data, as an example. Hopefully a set of baseline models would be generalised enough to avoid this issue and allow for more comparability.


I really like this idea. I think ensemble diversity is something that needs more exploration.

Yes, agreed, I think this is mainly about careful interpretation. I have been wondering if this is a good thing for the hubverse to work on, as a hubbaseline package seems to make a lot of sense to me (one that would try to be flexible across their different supported hub types).


If anyone else is interested, perhaps we can scope out something to do around this that everyone might have time for?
