Been thinking about the overlap between baseline models (Manuel Stapper is working on this), the method of analogs (the stuff @nickreich and co are going on about), and model output evaluation (i.e. @kathsherratt's work). This builds on the surrogate modelling work @sbfnk and I did a few years ago, where we tried to replicate a forecast hub ensemble's performance using a simple model (and got within 20% in real-time evaluation).
Idea: Define a set of baseline models with well-understood characteristics.
Manuel Stapper and I have discussed this “basket of baselines” concept.
Some examples:

- What would a time series expert build given only the time series data (no domain context)?
- What would an infectious disease specialist build given domain knowledge but no time series?
- Maybe a few other dimensions
The result is a set of simple baseline models that we understand and that are easy to fit, robust, well calibrated, etc., with different people perhaps drawing the line in different places in terms of complexity.
Having this basket is, I think, useful in and of itself, and certainly more powerful than trying to pick an “optimal” baseline for a given evaluation.
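A very rough sketch of what such a basket could look like in code (Python purely for illustration; the baselines and names here are placeholders, not a proposal):

```python
# Minimal sketch of a "basket of baselines" (illustrative only).
# Each baseline maps a univariate series of past counts to point forecasts h steps ahead.
import numpy as np


def last_value(y, h):
    """Persistence / random-walk baseline: carry the last observation forward."""
    return np.repeat(y[-1], h)


def seasonal_naive(y, h, period=7):
    """Repeat the value observed one period (e.g. one week) ago; needs len(y) >= period."""
    return np.array([y[-period + (i % period)] for i in range(h)])


def local_trend(y, h, window=4):
    """Extrapolate the recent average increment (a crude drift model)."""
    drift = np.mean(np.diff(y[-window:]))
    return y[-1] + drift * np.arange(1, h + 1)


BASKET = {
    "persistence": last_value,
    "seasonal_naive": seasonal_naive,
    "local_trend": local_trend,
}
```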
To take this further, the thing I have been thinking about is using these baselines to understand other models by:
1. Use time series decomposition (STL or similar) to break each baseline forecast into components: trend, periodicity, residuals, etc.
2. Express other forecast models (from hubs etc.) as combinations of these baseline components. For example: “Model X has the trend of an SIR model but the residuals of a time series model” (this is overly simple; in reality it would probably need to be more like GP kernel composition, i.e. additive and multiplicative combinations, etc.).
3. This creates a dimensionally reduced space (defined by the baseline models) in which we can understand relationships between models. Could implement via regression or mixture models with interactions (rough sketch below).
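As a first pass at steps 1 and 2 (sketch only: an additive-only decomposition via statsmodels STL, with all variable names made up for the example; a multiplicative or kernel-composition version would need something richer than a single least-squares fit):

```python
# Decompose each baseline forecast trajectory with STL, then regress a hub model's
# forecast onto the stacked components. Trajectories are assumed to be point forecasts
# of equal length spanning at least a couple of seasonal periods.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL


def decompose(forecast, period=7):
    """Split one forecast trajectory into trend, seasonal and residual components."""
    res = STL(pd.Series(forecast), period=period).fit()
    return np.column_stack([res.trend, res.seasonal, res.resid])


def express_as_baselines(hub_forecast, baseline_forecasts, period=7):
    """Least-squares weights of a hub forecast on all baseline components.

    Returns one weight per (baseline, component) pair, so e.g. "trend of the SIR
    baseline" gets its own coefficient.
    """
    X = np.hstack([decompose(f, period) for f in baseline_forecasts])
    weights, *_ = np.linalg.lstsq(X, hub_forecast, rcond=None)
    # rows: baselines; cols: trend / seasonal / residual
    return weights.reshape(len(baseline_forecasts), 3)
```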
You could also do this more simply if you didn’t decompose the baselines and just tried to represent more complex models directly with them.
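A minimal sketch of that simpler version, assuming you only have point-forecast trajectories for the hub model and the baselines (non-negative least squares used here so the weights read like a mixture; that is my choice for the example, not a requirement):

```python
# Represent a complex model directly as a weighted combination of raw baseline forecasts.
import numpy as np
from scipy.optimize import nnls


def mixture_weights(hub_forecast, baseline_forecasts):
    """Non-negative least-squares weights of a hub forecast on the raw baselines."""
    X = np.column_stack(baseline_forecasts)
    w, _ = nnls(X, hub_forecast)
    # Normalise so the weights read like mixture proportions (when any are non-zero).
    return w / w.sum() if w.sum() > 0 else w
```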
There is also some connection here to the performance-above-replacement idea we talked about in Baseball Stats, Model Cards, and Forecasting Performance - #7 by samabbott, i.e. you would replace with different kinds of baseline.
Another alternative is to do non-outcome-focused clustering - apply PCA or similar to the decomposed model components to understand how models group together.
Less interesting than the baseline approach but maybe more feasible.
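Something like (sketch only; assumes each model's decomposed components have already been stacked into one row of a matrix):

```python
# Project models into a low-dimensional space to see how they group together.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def embed_models(component_matrix, n_components=2):
    """component_matrix: (n_models, n_features) of stacked decomposed components."""
    scaled = StandardScaler().fit_transform(component_matrix)
    return PCA(n_components=n_components).fit_transform(scaled)
```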
Still working through how this all connects to scoring - could either:

- Add scoring as a final step after this decomposition
- Use the composition framework to better understand score patterns
- Or make the scores themselves the final target, but that gets a bit more complicated
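For the first two options, a toy version might look like this (absolute error stands in for a proper score such as WIS or CRPS; the function and argument names are just placeholders):

```python
# Score each hub model and relate the scores to its estimated baseline-composition weights.
import numpy as np


def score_and_relate(hub_forecasts, truth, weight_matrix):
    """hub_forecasts: (n_models, horizon) point forecasts
    truth:         (horizon,) observed values
    weight_matrix: (n_models, n_weights) weights from the composition step
    """
    # Placeholder score: mean absolute error per model (swap in WIS/CRPS in practice).
    scores = np.mean(np.abs(hub_forecasts - truth), axis=1)
    # Regress scores on the composition weights (needs more models than weights).
    X = np.column_stack([np.ones(len(scores)), weight_matrix])
    coefs, *_ = np.linalg.lstsq(X, scores, rcond=None)
    return scores, coefs  # coefs hint at which baseline components track performance
```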
This could be a neat way of getting at the composition of forecast hubs that don’t have good metadata, i.e. making their secondary data more valuable for other things.
Dimension reduction on forecast output, either as scores or as the raw forecasts/residuals, feels like something there must be literature on? Perhaps not for the baseline approach, but maybe even for that?
Not totally clear where the truth data comes into this either, i.e. is the breakdown relative to the truth data or absolute?
Still rolling this around in my head and waiting on a few baseline papers people are writing to see what directions they go.