A basket of baselines

Been thinking about the overlap between baseline models (Manuel Stapper is working on this), the method of analogs (stuff @nickreich and co are going on about), and model output evaluation (i.e. @kathsherratt's work). This builds on the surrogate modelling work @sbfnk and I did a few years ago, where we tried to replicate a forecast hub ensemble's performance using a simple model (got within 20% in real-time evaluation).

Idea: Define a set of baseline models with well-understood characteristics.

Manuel Stapper and I have discussed this “basket of baselines” concept.

Some examples:

  • What would a time series expert build given only the time series data (no domain context)?

  • What would an infectious disease specialist build given domain knowledge but no time series?

  • Maybe a few other dimensions

The result is a bunch of simple baseline models that we understand and that are easy to fit, robust, well calibrated, etc., with different people perhaps drawing the line in different places in terms of complexity.

Having this basket is I think useful in and of itself and certainly more powerful than trying to pick an “optimal” baseline for a given evaluation.

The thing I have been thinking about to take this further is then using them to understand other models by:

  1. Use time series decomposition (STL or similar) to break each baseline forecast into components: trend, periodicity, residuals, etc.

  2. Express other forecast models (from hubs etc.) as combinations of these baseline components. For example: “Model X has the trend of an SIR model but the residuals of a time series model” (this is overly simple; in reality it would probably need to be more like GP kernel composition, i.e. additive and multiplicative combinations etc.).

  3. This creates a dimensionally-reduced space (defined by the baseline models) where we can understand relationships between models. Could implement via regression or mixture models with interactions.
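
To make steps 1–3 a bit more concrete, here is a minimal Python sketch. Everything in it is illustrative rather than a proposed implementation: the two “baselines” are a synthetic random walk and a pure seasonal curve, statsmodels' STL does the component split, and plain ordinary least squares stands in for whatever regression or mixture model we would actually want.

```python
# Minimal sketch of steps 1-3 (all data and model names are synthetic/illustrative).
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL


def decompose(forecast: pd.Series, period: int = 52) -> pd.DataFrame:
    """Split one forecast series into trend / seasonal / residual components."""
    res = STL(forecast, period=period).fit()
    return pd.DataFrame({"trend": res.trend, "seasonal": res.seasonal, "resid": res.resid})


rng = np.random.default_rng(1)
dates = pd.date_range("2022-01-02", periods=156, freq="W")
baselines = {
    "random_walk": pd.Series(100 + np.cumsum(rng.normal(0, 2, 156)), index=dates),
    "seasonal": pd.Series(100 + 20 * np.sin(2 * np.pi * np.arange(156) / 52), index=dates),
}

# Step 1: decompose each baseline and stack the components into a design matrix.
X = pd.concat({name: decompose(series) for name, series in baselines.items()}, axis=1)

# Steps 2-3: express a (here fabricated) hub model as a combination of those
# components; the fitted coefficients are its coordinates in "baseline space".
hub_model = 0.7 * baselines["random_walk"] + 0.3 * baselines["seasonal"] + rng.normal(0, 1, 156)
coefs, *_ = np.linalg.lstsq(X.to_numpy(), hub_model.to_numpy(), rcond=None)
print({col: round(c, 2) for col, c in zip(X.columns, coefs)})
```

In practice the combination would need to be richer than a plain linear model (interactions, multiplicative terms, as per the GP kernel analogy), but even here the coefficients start to read like coordinates of a model in the space spanned by the baselines.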

You could also do this more simply by not decomposing the baselines and just trying to represent complex models directly with them.
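
For that simpler variant, something like the following might do, again purely as a sketch: the baselines and the “complex” model are synthetic, and non-negative least squares is just one convenient choice for keeping the weights interpretable as mixture-like contributions.

```python
# Self-contained sketch of the no-decomposition variant: regress a complex
# model's point forecasts directly on the baseline forecasts.
# All data are synthetic placeholders.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(2)
t = np.arange(104)
baselines = {
    "random_walk": 100 + np.cumsum(rng.normal(0, 2, t.size)),
    "seasonal": 100 + 20 * np.sin(2 * np.pi * t / 52),
}
# A fabricated "complex" model that happens to blend the two baselines.
complex_model = (
    0.6 * baselines["random_walk"] + 0.4 * baselines["seasonal"] + rng.normal(0, 1, t.size)
)

# Non-negative least squares keeps the weights mixture-like and interpretable.
B = np.column_stack(list(baselines.values()))
weights, _ = nnls(B, complex_model)
print(dict(zip(baselines, np.round(weights, 2))))
```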

There is also some connection here to the performance above replacement idea we talked about in Baseball Stats, Model Cards, and Forecasting Performance - #7 by samabbott, i.e. you would replace with different kinds of baseline.

Another alternative is to do non-outcome-focused clustering: apply PCA or similar to the decomposed model components to understand how models group together.
Less interesting than the baseline approach but maybe more feasible.
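
As a rough illustration of that route (again with made-up series, and STL plus two principal components as arbitrary choices):

```python
# Self-contained sketch of the clustering alternative: decompose each model's
# forecast series, stack the components into one feature vector per model,
# and run PCA to see how models group. All series here are synthetic.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from statsmodels.tsa.seasonal import STL

rng = np.random.default_rng(3)
n, period = 156, 52
t = np.arange(n)
models = {
    "random_walk": 100 + np.cumsum(rng.normal(0, 2, n)),
    "seasonal": 100 + 20 * np.sin(2 * np.pi * t / period),
    "blend": 100 + 10 * np.sin(2 * np.pi * t / period) + np.cumsum(rng.normal(0, 1, n)),
}

features = []
for series in models.values():
    res = STL(pd.Series(series), period=period).fit()
    # One long feature vector per model: its trend, seasonal and residual parts.
    features.append(np.concatenate([res.trend, res.seasonal, res.resid]))

scores = PCA(n_components=2).fit_transform(np.vstack(features))
for name, (pc1, pc2) in zip(models, scores):
    print(f"{name}: PC1={pc1:8.1f}  PC2={pc2:8.1f}")
```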

Still working through how this all connects to scoring - could either:

  • Add scoring as a final step post this decomposition

  • Use the composition framework to better understand score patterns

  • Or make scores themselves the final target, but that gets a bit more complicated

This could be a neat way of getting at the composition of forecast hubs that don’t have good metadata, i.e. making their secondary data more valuable for other uses.
The dimension reduction on forecast output, whether as scores or as raw forecasts/residuals, feels like something there must be literature on? Perhaps not for the baseline approach, but maybe even there.
It’s also not totally clear where the truth data comes into this, i.e. whether the breakdown is relative to the truth data or absolute.

Still rolling this around in my head and waiting on a few baseline papers people are writing to see what directions they go.


Also a basket of baselines is just a lovely title for a paper


Also I posted this in a few other places and roughly copied it across, so it’s a bit burbled.

I am super keen on this!

Having chatted about this before, adding some of the previous comments.


I think for me the highlight was this @samabbott:

The main assumption, I think, is that you can accurately express a forecast as a combination of other forecasts, and that you understand both the components and how you have composed them well enough for this to be useful

And this was my own summary of the whole thing, to make sure I understood correctly - but adding it here again in case it helps others to see it put in different words.

This seems like seeing each (realised) model as an “ensemble” of theoretical components that are remixed in various ways [with the simplest possible combinations creating the basket of baseline models]

  • Bottom up model composition: do some principled (theoretical) model building (the baseline models); use these as “predictors”/“explanatory variables” for (realised) hub models. So, probably being a bit simplistic, you could end up with coefficients that act like a continuous, quantitative measure alongside the binary model categorisations we tend to talk about (e.g. “statistical”, “semi-mech”, “mech”).
  • Top down model decomposition: take the (realised) hub models’ outputs, cluster with e.g. PCA, and hope to interpret the principal components as something like the underlying (theoretical) model components? Maybe this approach has some echoes of the model similarity work and the Cramer distance.

Then, following up with some thoughts for analysis plans:

Maybe (subsequent) evaluation with scoring could then be like comparing the performance between (a toy sketch of this comparison follows below):

  1. equal ensemble of the theoretical baseline models *
  2. each realised model (identifying this as a kind of “unequal ensemble” of the baseline models)
  3. equal ensemble of realised models

where comparative analysis is like:

  • 3 vs 2 - as seen in hub evaluation papers (equal hub ensemble vs individual model rWIS)
  • 2 vs 1 - shows the difference made by modeller choices, in combining basic model flavours vs the equally weighted baseline set
  • 3 vs 1 - shows difference from collating multiple complex models versus a minimum baseline set [perhaps useful for assessing value of collaborations]

Added to by @samabbott with:

* Or the ensemble could be built on some decomposition of the baselines, i.e. you ensemble each component (trend etc.) and recombine, or just directly use the baseline basket.
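
To make that three-way comparison concrete, here is a toy sketch. Everything is fabricated: two baseline models, two “realised” models, a synthetic truth series, and mean absolute error standing in for a proper scoring rule such as WIS.

```python
# Toy sketch of the three-way comparison: (1) equal ensemble of baselines,
# (2) each realised model, (3) equal ensemble of realised models.
# All series are synthetic and MAE stands in for a proper score such as WIS.
import numpy as np

rng = np.random.default_rng(4)
n = 104
t = np.arange(n)
truth = 100 + 20 * np.sin(2 * np.pi * t / 52) + np.cumsum(rng.normal(0, 1, n))

baselines = {
    "random_walk": np.concatenate([[truth[0]], truth[:-1]]),  # last observed value
    "seasonal": 100 + 20 * np.sin(2 * np.pi * t / 52),        # pure seasonal curve
}
realised = {
    "model_A": truth + rng.normal(0, 4, n),             # decent but noisy
    "model_B": 0.5 * truth + 50 + rng.normal(0, 6, n),  # biased
}

def mae(forecast):
    return np.mean(np.abs(forecast - truth))

scores = {
    "1. equal baseline ensemble": mae(np.mean(list(baselines.values()), axis=0)),
    "3. equal realised ensemble": mae(np.mean(list(realised.values()), axis=0)),
    **{f"2. {name}": mae(f) for name, f in realised.items()},
}
for label, s in sorted(scores.items()):
    print(f"{label}: MAE = {s:.1f}")

# Comparisons then read as ratios, e.g. 2 vs 1: a realised model relative to the
# equally weighted baseline ensemble (values < 1 mean modeller choices helped).
print("model_A vs baseline ensemble:",
      round(scores["2. model_A"] / scores["1. equal baseline ensemble"], 2))
```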


In any case - interested in thinking about / working on this :slight_smile:


I think there are some really interesting ideas here; I’m particularly interested in the decomposition of models into components.

This chimes with how we are thinking about our ensembling for winter 25/26: have one model each covering semi-mechanistic, time series, and seasonality approaches so that we have different “components” covered in the ensemble, rather than “just add more time series”, which worked for COVID to some extent but not for other pathogens.

Regarding having a basket of baselines, one practical challenge I’ve found when trying to apply baseline methods from other countries/competitions/papers is the different spatio-temporal granularities of data available. The weekly baseline methods of CDC competitions don’t work in the UK as we need to model daily data, as an example. Hopefully a set of baseline models would be generalised enough to avoid this issue and allow for more comparability.


I really like this idea. I think ensemble diversity is something that needs more exploration.

Yes, agreed, I think this is mainly about careful interpretation. I have been wondering if this is a good thing for the hubverse to work on, as a hubbaseline package seems to make a lot of sense to me (one that would try to be flexible across their different supported hub types).


If anyone else is interested, perhaps we can scope out something to do around this that everyone might have time for?
