Baseball Stats, Model Cards, and Forecasting Performance

TL;DR: A baseball game in Atlanta led to some ideas for visual scorecards, performance-above-replacement metrics, and effort-normalised evaluation that seem useful/interesting.

I have been in Atlanta for the last few days teaching at SISMID with Nick Reich and Thomas Robacker (the course materials have some really good (not biased) new forecasting content and other tweaks). It was a lot of fun, even if Atlanta is about 300% too hot for me.

Nick was kind enough to organise a trip to the Braves vs Yankees baseball game, which was super fun. Something that really struck me was the (famously) huge amount of stats everywhere, especially on the big screens, where each new batter was shown with their summary stats (and a similar display for the pitcher).

This got me thinking about how nice it would be to have something like this for infectious disease forecasting models: something people could use in their READMEs to summarise performance, or that forecast hubs could use to visually summarise a model.

We then had to escape the stadium, and naturally this meant camping in a car park for a bit whilst all the traffic cleared. This gave me some time to think about how these scorecards might interact with the kinds of model cards that gen AI has started using. These are basically standardised YAML blocks that go in your README and contain model and performance metadata.

I then naturally spent a lot of my free day in Atlanta thinking about this whilst zipping up and down the BeltLine. The conclusion is that I think there is a fairly natural way to express this as an extension of scoringutils that outputs both a model card (i.e. YAML) and a scorecard (an image). Claude and I have been iterating on a design document and I am very keen for feedback. In particular, I would like to know if there are examples of this kind of thing in the wild; it's a very general, non-domain-specific concept, so it feels like there very well might be.
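
As a rough sketch of what the model card half might look like (nothing here is existing scoringutils functionality; the field names are purely illustrative and the numbers are just lifted from the prototype below, with the YAML emitted via the yaml package):

```r
# Purely illustrative: hypothetical field names, not an agreed standard.
library(yaml)

model_card <- list(
  model = list(
    name = "example-model",
    team = "Example Team"
  ),
  evaluation = list(
    n_forecasts    = 127,
    first_forecast = "2023-01",
    horizons       = "1-4 weeks"
  ),
  performance = list(
    wis_natural           = 42,
    relative_skill        = 0.95,
    coverage_50           = 0.48,
    coverage_90           = 0.87,
    bias                  = -0.02,
    ensemble_contribution = 0.032
  )
)

# The YAML block that could be dropped into (or generated for) a README.
cat(as.yaml(model_card))
```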

The score card prototype looks like:

┌─────────────────────────────────────────────────────────────────┐
│ [Logo]       MODEL NAME              PAR: +2.3%/+1.9% (overall) │ <- Header (Rows 1-3)
│              Team/Organization       Nat: ▁▃█▂ (1-4w)           │
│                                      Log: ▂▄█▃ (1-4w)           │
├─────────────────────────────────────────────────────────────────┤
│ ┌──────────┬──────────┬──────────┬──────────┬────────────────┐  │ <- Metrics (Rows 4-8)
│ │Coverage  │   WIS    │   Rel    │   Bias   │   Ensemble     │  │
│ │ 50%: 48%↓│  Nat: 42↑│  Skill   │ -0.02 ↓  │   Contrib      │  │
│ │ 90%: 87%↓│  Log:0.38│Nat: 0.95↑│          │ Nat: +3.2%↑    │  │
│ │          │          │Log: 0.87↑│          │ Log: +2.8%↑    │  │
│ └──────────┴──────────┴──────────┴──────────┴────────────────┘  │
├─────────────────────────────────────────────────────────────────┤
│ [Performance Timeline Graph - Model vs Others]                  │ <- Timeline (Rows 9-11)
├─────────────────────────────────────────────────────────────────┤
│ Forecasts: 127 | Since: 2023-01 | Target Coverage: 95%          │ <- Footer (Row 12)
│ Best: 2-week ahead | Most consistent Q3 2024                    │
└─────────────────────────────────────────────────────────────────┘

A concern is that this is meant to be fun, and as I have iterated it has become progressively less fun and more information dense. Definitely something to watch for. I think maybe adding the plots and logos will edge this back in the fun direction.

We also had a good chat about the types of stats that are widely used in baseball, and the one I found most interesting was performance above replacement. As I understand it, this weighs everything a player does against what a replacement-level player would have done, as a way of assessing their value (I would like a version of this that is weighted by player cost - more on this in a second). There are things like this that we do (i.e. value to an ensemble, where you look at ensemble performance with and without your model) but I think nothing quite like it. I think we could get closer if we looked at permutations of models in ensembles (as both @sbfnk and co and Spencer Fox and co have been doing recently). This would look like taking the performance of the ensemble with our model and dividing it by the mean performance of all possible ensembles created by removing our model and instead duplicating another component model. A caveat is that I haven't really thought through how this relates to other measures we already have (i.e. the ensemble with and without, just taking relative skill, etc.). A nuance here is that you might want to check replacement within some sub-strata or categories (i.e. replacement across renewal models etc.), but except in very large forecasting exercises with lots of models I think this is likely to run into issues.
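
A minimal toy sketch of that replacement calculation, assuming point forecasts, a simple mean ensemble and MAE standing in for proper quantile forecasts and WIS (all names and data are made up; none of this is existing scoringutils or hub functionality):

```r
# Toy version of performance above replacement: score the ensemble containing
# the target model, then divide by the mean score of ensembles where the target
# is swapped for a duplicate of each other component model in turn.
set.seed(1)
obs <- rnorm(20, mean = 100, sd = 10)

# Hypothetical component forecasts: each model is truth plus its own error.
forecasts <- list(
  model_a = obs + rnorm(20, 0, 5),
  model_b = obs + rnorm(20, 0, 8),
  model_c = obs + rnorm(20, 0, 12)
)

ens_score <- function(members, obs) {
  ens <- Reduce(`+`, members) / length(members)  # simple mean ensemble
  mean(abs(ens - obs))                           # MAE as a stand-in score
}

performance_above_replacement <- function(target, forecasts, obs) {
  with_target <- ens_score(forecasts, obs)
  others <- setdiff(names(forecasts), target)
  replaced <- sapply(others, function(j) {
    members <- forecasts[others]
    members[[paste0(j, "_dup")]] <- forecasts[[j]]  # duplicate model j
    ens_score(members, obs)
  })
  with_target / mean(replaced)  # < 1: better than a typical replacement
}

sapply(names(forecasts), performance_above_replacement,
       forecasts = forecasts, obs = obs)
```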

If it does shake out as being useful, then, as with baseball, I like the idea of trying to normalise against effort. I have had a bugbear for a while about how we confound model performance with modelling effort. All the forecasting I have done points to a relationship between how you do in evaluations and having more time to spend reviewing, iterating and just looking at the forecasts. This should mean that well-resourced teams that focus heavily on forecasting do very well, whilst less well-resourced teams do less well. There is also some interaction with model complexity, as a more complex model takes more time and thus leaves less effort for the other forecasting tasks that might be what actually drives performance. This would bias us towards simpler models (it's convenient that I think this, isn't it). Another way to think about this is as a "house effect", where all models from a team share some common performance aspects and have some that vary.

Something to note is that we can also get at this with model-based evaluation (i.e. adjusting for team size etc.), but that requires some fairly complex additional machinery, so having a good summary measure might be useful.
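
To sketch what I mean by model-based evaluation here, something like a regression of scores on an effort measure with a team-level "house effect" could work (the data frame, its columns and the effort measure are all hypothetical, and lme4 is just one way to fit it):

```r
# Hypothetical data: per-forecast WIS plus an (assumed measurable) effort input.
library(lme4)

set.seed(2)
n_teams <- 6
scores <- data.frame(
  team    = rep(LETTERS[1:n_teams], each = 20),
  effort  = round(runif(120, 1, 16)),       # e.g. person-hours per forecast round
  horizon = sample(1:4, 120, replace = TRUE)
)
house_effect <- rep(rnorm(n_teams, 0, 0.5), each = 20)  # shared team-level component
scores$wis <- exp(1 - 0.05 * scores$effort + 0.2 * scores$horizon +
                    house_effect + rnorm(120, 0, 0.3))

# Effort-adjusted comparison: the team effects now reflect performance after
# accounting for how much time went into producing the forecasts.
fit <- lmer(log(wis) ~ effort + horizon + (1 | team), data = scores)
summary(fit)
```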

Summary

For me, the baseball statistics displayed at the game highlighted three gaps in how we evaluate forecasting models:

  • Visual scorecards could make model performance more accessible than current approaches.
  • Performance-above-replacement metrics might better capture model value within ensembles than existing measures.
  • Effort-normalised evaluation could address the confounding between model performance and development resources.

Very cool! I like the prototype score card, informative.

I think maybe adding the plots and logos will edge this back in the fun direction.

And emojis! :laughing: As Claude would likely try to do (based on its model submission reports during SISMID). :trophy:
Perhaps even a metadata entry for team colors.


Yeah, the spec includes the idea that a team might have a colour scheme. I think it might be fun to add a hub or other colour scheme (i.e. across models) as well as a model/team colour scheme.

I'm excited about this idea!

Noting that Minsu Kim is working on this idea of "model importance", which is basically an implementation of a version of the "score all the permutations of models" idea. The core ideas are in her recent preprint, which was just revised and resubmitted to the International Journal of Forecasting.

And an R package is in the works too…

I've been wondering for a while if we could create more interesting forecasting collaborations if the primary scoring metric was something like this. Instead of "make the best prediction you can" it's more like "make the best prediction you can that no one else is already making". Which I guess at an extreme could turn into trying to predict when and how other predictions will fail.

As another example of baseball stats "scorecards", check out these umpire ratings that are posted after every game.

Here's the one from the game on Friday night:

Nice! Looking forward to trying out the package.

I think this doesn't extend to the replacement component mentioned here, though? i.e. that approach is leave one (or a subset) out and normalise relative to the full ensemble (or take the difference), vs here leave one out and replace it with another model for all permutations, then take the mean (or not, if wanting a range) and use this to normalise the ensemble that includes the target model. The latter is, I think, closer to the baseball metric, but I haven't thought deeply about the trade-offs.
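
To make the contrast concrete (notation is mine, not from the preprint), write S(.) for the ensemble score (lower is better, e.g. WIS), M for the full model set and m for the target model. Then, roughly:

$$
\text{importance}_m \approx \frac{S(M \setminus \{m\})}{S(M)}
\qquad \text{vs} \qquad
\text{PAR}_m = \frac{S(M)}{\frac{1}{|M|-1}\sum_{j \neq m} S(M_{m \to j})}
$$

where M_{m -> j} is the ensemble with m swapped for a second copy of model j.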

Which I guess at an extreme could turn into trying to predict when and how other predictions will fail.

Yes, I agree with the general idea and also with trying to push towards this extreme at least a little.

Oh, very cool. This has more of the information dense vibe that the design prototype has been heading towards.


Yes, I agree that Minsu's work is different in the way you describe, although I might argue that her setup is more appropriate in our setting. In sports, if you don't play there WILL be a replacement (unless, I guess, you get a red card in a football match). In ensemble forecasting, the models are "unique" and would not necessarily be replaced. So I'd argue that the more "true-to-life" metric is one that doesn't replace a model with a generic one but just leaves it out.


Hmm, I guess the replacement is the alternative baseline model I might submit. The issue is that if you don't adjust for this, you end up conflating the value of a given model with the value of the ensemble simply having more models that represent some error distribution, i.e. "how many models do we need"-style analyses.

Also, I would lightly push back on the idea that models are "unique", as they are often quite derivative of each other, especially in larger hubs. I think this replacement idea also starts to get at that a bit, as it adjusts for model weight in some sense to give a uniqueness value (or lol maybe it doesn't, who knows).

@kath-sherratt and I met today to talk about A basket of baselines - #12 by samabbott. As an offshoot of that conversation, we thought it would be really great to write a nice, tightly scoped piece on the idea in here of evaluating models by value above replacement.

This would be a remix/extension of some of the ideas from @nickreich and co's model importance work (with the extension here being replacement), showing when it might be needed by looking at different ensemble sizes (under the assumption that, as performance asymptotes with the addition of more models, so should the difference between just looking at leave-one-out and performance above replacement).

We also thought there was an interesting discussion to be had about what to replace with, i.e. all other models as permutations, a baseline (hence the link to the basket of baselines), or the ensemble of all other models (we thought this might end up being the same as the all-permutations idea). There is a clear extension here to weighted replacements, i.e. more similar models, or the extreme categorical version of this, i.e. models from the same "family"; for now we thought we would leave that to the discussion.

The plan is that we will draw up a short analysis plan over the next week or so and then reach out to interested parties from here to set up a meeting. Keen to hear people's thoughts.


I think once you get into weighting, things get much more complicated. (How are the weights computed? If you have a model that is essentially given zero weight because it isn't valuable, then it doesn't matter what you replace it with?)

I sometimes have trouble following Sam's rambly run-ons… But the idea is that the replacement value could be higher in settings with fewer models?

Excited that you all are thinking about this and happy to be involved in further discussions.


The idea is that ensemble-without / ensemble-with as a measure of performance is confounded by the size of the original ensemble, as it mixes the unique contribution of that model with the contribution of adding any model at all. I expect that as the ensemble size grows, the difference between performance above replacement and performance versus not being present at all will shrink, given the assumed asymptotic relationship between ensemble size and performance.
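
A quick toy check of the first part of that intuition (mean ensemble, MAE in place of WIS, exchangeable made-up models; purely illustrative):

```r
# How much does leaving one model out change the ensemble score as the
# ensemble grows? Averaged over repetitions to smooth out simulation noise.
set.seed(3)
obs <- rnorm(50, 100, 10)

loo_ratio <- function(n_models, n_reps = 200) {
  mean(replicate(n_reps, {
    forecasts <- replicate(n_models, obs + rnorm(50, 0, 10), simplify = FALSE)
    score <- function(members) mean(abs(Reduce(`+`, members) / length(members) - obs))
    score(forecasts[-1]) / score(forecasts)  # leave-one-out vs full ensemble
  }))
}

sapply(c(3, 5, 10, 20, 50), loo_ratio)
# The ratio shrinks towards 1 as the ensemble grows: with many models no single
# model moves the ensemble much, which is exactly the confounding with size.
```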

I think once you get into weighting, things get much more complicated.

Yes, I agree. Here the weighting is slightly different (it's about weighting the permutations of the replacement, not the overall ensemble) but it has similar issues, hence doing something simpler, i.e. replacement by models of a similar type vs any model, would be an easier thing to look at.

There is also the wider point that for any weighted ensemble things get really complicated, and we haven't thought about that at all at this stage. That being said, as a first pass I think you might just give the replacement the same weight as the current model.