How can collaborative infectious disease forecasting/nowcasting projects be improved?

I’ve been involved in a few collaborative nowcasting/forecasting projects (i.e. CDC COVID-19, Germany/Poland COVID-19 forecasting/nowcasting, European COVID-19 forecasting, SPI-M-O short-term forecasts + reproduction number estimation) as a contributor, and have stood near people as they ran these projects.

I was wondering whether people had any thoughts about these projects, both in terms of their own experiences and in terms of the projects’ limitations and how they could be improved.

My take on the main aims of these projects is generally:

  1. Provide improved forecasts to stakeholders by ensembling a range of approaches and by evaluating, within a single framework, forecasts that may already exist
  2. Drive iterative improvement for the forecasting task at hand through feedback and model selection
  3. Improve the practice of forecasting more generally by highlighting what works well and what doesn’t

I think we are currently doing 1 pretty well, with 2 happening a little, though perhaps mostly via model selection (i.e. drop-outs) rather than model improvement (I would love to be wrong about this). I’m not sure we have made much progress with 3, and it’s not clear that just doing more of the same will get us there?

The main issue I have found personally as a contributor is prioritising these kinds of initiatives given how much work they take. We have tried to do that in the past by attaching research projects to submissions (for example https://www.medrxiv.org/content/10.1101/2022.10.12.22280917v1.full.pdf) but, except in one instance, it has been so much work that the research part has ended up being dropped simply because there was no time/resource left. The incentive is enough to take part (especially if it’s your first time) but maybe not enough to keep iterating/innovating?

That then leaves it to those doing secondary data analysis to pull findings out of the hub results themselves, but that is such an overwhelming job that it seems a real challenge to dig deep enough (through all the noise of realistic data and model variation) to get insights that drive iterative improvement. Maybe the hubs will start being used as a data source for others to do secondary analysis, and that will help with this resource issue? I haven’t seen loads of pure secondary data analysis coming out that wasn’t by hub organisers, but perhaps I am looking in the wrong places. People definitely love to evaluate against hub ensembles as a baseline, which does seem like a very strong win for the community and for the organisers (but again, not really for those contributing).

I haven’t seen a huge amount on how we can do better here, but perhaps I just haven’t been looking in the right places? Does anyone know if there is an initiative underway among the various stakeholders (i.e. hub organisers, contributors, funders, and those consuming the forecasts) to think about what the next generation of these projects should look like?

The biggest success for me has been @dwolffram and @johannes’s nowcasting hub. I don’t think there is a clear reason why exactly, apart from the statistical problem being a little clearer/better phrased and most people using fairly similar models (making it a bit easier to learn what works and what doesn’t). I wonder if that suggests a more limited scope could be helpful? It could also have been @dwolffram’s charm of course, but there were very charming people involved in the other projects I contributed to as well.

I’ll circle back to this and make it less of a ramble in a bit!


Thanks Sam! I think this is really important to discuss and there hasn’t really been a community-wide forum for it (yet).

I agree with how you lay out the aims of the projects, and I also see a bit of a fundamental tension between them (from the perspective of both setting up and contributing to collaborative projects). Prioritising collecting modelling results prospectively in real time (with presentation/evaluation aimed at specific outbreak support: your aims 1/2), as we did in the COVID hubs, means accepting messy data that is difficult to evaluate. But for aim (3) we would want to set the project up to prioritise collecting a standardised, systematic sample of results from different modelling approaches (for the purpose of evaluating for generalisable insights / improvements).

Even if these aims aren’t incompatible, they don’t necessarily or straightforwardly align at all, and it takes a lot of figuring out (or maybe just resources) to make a collaborative project that can do both well. Like you say, the German hospitalisation nowcasting hub seems to be doing great at both! I think it might be a mix of having a fairly specific, standardised, and reliable forecast target + data flow + modelling community. (Maybe that also applies to the past US flu forecasting projects?)

On this point:

I haven’t seen loads of pure secondary data analysis coming out that wasn’t by hub organisers

Yes, as far as I know secondary use of the Hub data has been to evaluate the ensemble or selected models against a team’s own model of interest, rather than using the Hub dataset for a standalone evaluation. I think that’s to be expected, but I agree it would be good to encourage the use of these as standard databases for experimenting on more generally; ease of access to the forecast & observed data is the first barrier to that at the moment.

In terms of the future of the US & European COVID (/flu) hubs, as far as I am aware:

  • In the US, dev is underway to make at least the software element of this more standardised and reproducible. See the Consortium of ID Modeling Hubs. So far, this is software focussed so doesn’t really cover what you are getting at here with system-level improvements to hub projects; but it might be a good venue for discussing/implementing any suggested changes.

  • In Europe I believe the ECDC modelling team are discussing internally and may have some kind of planning in the works for the future of the Hubs in Europe. Unfortunately as that is within the ECDC I have no insight into what that is.


That’s a really good question and some excellent points raised already.

The stated aims of the US/Euro hubs were to

  1. Provide decision-makers and the general public with reliable information about where the pandemic is headed in the next month.
  2. Gain insight into which modelling approaches do well. (Secondarily, hold models “accountable”.)
  3. Assess the reliability of forecasts for different measures of disease severity.
  4. Create a community of infectious disease modelers underpinned by an open-science ethos.

I’d say that aim (1) has been partially achieved (the information wasn’t necessarily reliable, but probably more reliable than any other information available), (2) not really (except the accountability bit), (3) partially (deaths easier than cases, at least for models if not necessarily for humans, as per Bosse et al.), and (4) has been partially achieved, as the hubs have brought some people together, though it remains to be seen how much of this can be turned into sustainable communities.

I very much agree on the issue of prioritisation versus how much work contributing takes. I remember from contributing to the Ebola forecasting challenge that it was hard to find the time to make submissions, with the additional issue that this was on synthetic data, so there was no clear incentive to get it right except the shame/pride of having under- or overperformed. And as you say Sam, the incentives are probably greater to contribute in the first place (i.e. be part of the project and thus be able to contribute to papers etc.) than to work towards improving models iteratively.

One additional issue with the hubs that I haven’t seen discussed much, but that we faced in the past, is the tension between them being a research project / a testing ground for new models and their aim to reliably inform public health. It has not been so much a problem with the forecasts themselves, as the median ensemble has appeared fairly robust to the choice of models (something we’re hoping to look at in more detail), but it has created issues in communication when some models have come under particularly intense scrutiny from the press.
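
(For anyone less familiar with how these ensembles are built, here is a minimal sketch of a quantile-wise median ensemble. It is illustrative only: the long-format columns model/location/target_end_date/quantile/value are assumptions, not the exact schema of any of the hubs.)

```python
import pandas as pd

def median_ensemble(forecasts):
    """Quantile-wise median ensemble of per-model quantile forecasts.

    `forecasts`: long format with model, location, target_end_date, quantile,
    value columns (illustrative names, not the exact hub schema).
    """
    ensemble = forecasts.groupby(
        ["location", "target_end_date", "quantile"], as_index=False
    )["value"].median()
    ensemble["model"] = "median-ensemble"
    return ensemble

# two toy models forecasting the same target
toy = pd.DataFrame({
    "model": ["A"] * 3 + ["B"] * 3,
    "location": ["DE"] * 6,
    "target_end_date": ["2023-01-07"] * 6,
    "quantile": [0.25, 0.5, 0.75] * 2,
    "value": [80, 100, 130, 120, 150, 200],
})
print(median_ensemble(toy))  # median of the two models at each quantile
```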

To me the main value that the hubs can and should provide, beyond the potential benefit to situational awareness, is that

  • they can provide a useful validation tool / data set, i.e. if someone develops a forecasting model it can be put under scrutiny and compared to other models
  • they can encourage open science practices, sharing of code and data
  • they can provide genuine prospective evaluation of predictive ability
  • they can act as a platform for community building and knowledge exchange

In my view the hubs should in future provide a number of diverse default models so that they don’t depend on contributions, while remaining open to contributions from the community, ideally combining the two approaches that Kath mentions and allowing both systematic comparison and broader engagement. The main obvious areas of improvement are:

  • making secondary use easier (though I’m not sure whether ease of access is the main barrier vs. other things)
  • better incentivising continued and sustained contribution, e.g. through more detailed feedback/discussion of model performance (perhaps one of many things that worked well in the German nowcasting hub?) and perhaps broader scientific discussion (see the sketch after this list for one minimal form such feedback could take).
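
To make the feedback point a bit more concrete, here is a rough sketch of the kind of per-model summary a hub could automate: the pinball (quantile) loss per predictive quantile, averaged per model. This is a standard proper scoring rule for quantile forecasts and the building block of the weighted interval score used in the hub evaluations; the column names below are illustrative, not the exact hub schema.

```python
import pandas as pd

def pinball_loss(observed, predicted, level):
    """Pinball (quantile) loss for a single predictive quantile."""
    if observed >= predicted:
        return level * (observed - predicted)
    return (1 - level) * (predicted - observed)

def score_models(forecasts, truth):
    """Mean pinball loss per model (lower is better).

    `forecasts`: long format with model, location, target_end_date, quantile,
    value columns; `truth`: location, target_end_date, observed.
    All column names are illustrative, not the exact hub schema.
    """
    merged = forecasts.merge(truth, on=["location", "target_end_date"])
    merged["loss"] = [
        pinball_loss(obs, pred, q)
        for obs, pred, q in zip(merged["observed"], merged["value"], merged["quantile"])
    ]
    return merged.groupby("model")["loss"].mean().sort_values()
```

A short, regular report built on something like this (per target, per horizon) would go a long way towards the more detailed feedback I mean.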

I would love to hear more suggestions for how the hubs can create more scientific and public health value as well as provide a good experience for those contributing.


Mmmm, I think there’s an undervalued element of lowering the barrier to entry / reducing friction when it comes to having data in a standardized format / accessible in a stable way / etc. There’s a bit of that in the “open science” perspective, I suppose, but I’m less concerned with the transparency / reproducibility aspect, and more with the enabling-small-teams outcome.

Seems like “hub”-type initiatives can potentially consolidate repetitious-but-necessary scut work, and thereby enable teams with clever ideas but little bandwidth for dealing with the mess that typifies public health data.


@kathsherratt

I agree that it might be something to do with the structure/nature of the collaboration. So far they have been very hub-and-spoke (i.e. organisers and submitting teams). The hubs I have most enjoyed (@dwolffram and @johannes’s, as you flag) have been the least like this, especially the nowcasting hub. At an organisational level, though, they were really organised similarly, so perhaps the culture was a little different.

I think this is likely the key. I wonder which of these is required though and what features the community needs to have/not have.

This looks like a good initiative and is new to me, so thanks. From what I can see, though, it doesn’t really target making the data (both forecasts and versioned truth) really easy to access, filter, and manipulate. The only package in there that looks like it could help with that is very nascent and also has lots of bells and whistles, which makes me worry that even when that functionality is there it will be hard for the kinds of people who want to do secondary analysis to access.

I know we have chatted about this ourselves and have both rolled our own approaches to accessing the data. I really think this should be more of a priority, as there is so much value left in these datasets that is otherwise siloed with the respective hub teams, who may not be best placed to extract it.
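
To illustrate what I mean by easy access, here is roughly the kind of helper I would want to exist (and that we have each half-written ourselves). The directory layout (one folder per model containing CSV submissions) and the column names are assumptions for illustration, not the actual structure of any hub repository.

```python
from pathlib import Path

import pandas as pd

def load_hub_forecasts(hub_dir, models=None):
    """Read per-model forecast CSVs from a local clone of a hub repository.

    Assumes (for illustration only) one folder per model containing CSV
    submissions, i.e. <hub_dir>/<model>/<something>.csv.
    """
    frames = []
    for csv_path in Path(hub_dir).glob("*/*.csv"):
        model = csv_path.parent.name
        if models is not None and model not in models:
            continue
        df = pd.read_csv(csv_path)
        df["model"] = model
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

# e.g. pull two models and filter to death targets for a secondary analysis
# forecasts = load_hub_forecasts("data-processed", models=["model-A", "model-B"])
# deaths = forecasts[forecasts["target"].str.contains("death")]
```

Something this simple, shipped and documented by the hubs themselves (ideally with versioned truth data alongside), would remove a lot of the friction for secondary analysts.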

@sbfnk

Is there a paper where these are written down and each one is evaluated to see how well it has been achieved? I kind of understood these aims but not this explicitly. If there isn’t, maybe there should be?

This is really interesting. Has there been some definition of what reliable means in this context? I am really thinking here about the role that trying to predict NPIs plays. I wonder if it’s clear to people that some models are trying to do this and others aren’t, which obviously really impacts the forecasts. It means they could be “reliable” when you ignore them but potentially unreliable as soon as you start setting policy based on them (because you could decide to act, or not act, based on a forecast that assumed the reverse of that action on the basis of prior data on your actions).

This seems to be the thing that is a real struggle and on which progress is a bit stuck. I think we agree that even defining “doing well” has been a struggle. For the reasons given above, I think opening up the data and pushing that would help here (let a thousand flowers bloom etc.).

Yes, I agree, though it is also maybe super obvious (given one is in theory conditional on the other), and even given that, we don’t have incredibly strong evidence directly from the hubs.

This also seems like a struggle, as you say. Lots of the models are really shrouded in mystery. I guess it’s a success if you think of the model output as what needs to be open access. Moving to a model where code is submitted seems like it would really help here. If you focus on the community side, I suppose this has been a much bigger hit. I wonder if it’s possible to show that community members have improved vs others (I guess new entrants vs long-term contributors, or serial contributors to multiple hubs vs new entrants, though this gets swamped by lots of biases I guess).

Yes, totally agree. Something I was wondering about is whether people should attach levels of confidence to their models, or whether models should be rated by some kind of panel etc. based on whether they are research-focussed vs trying to be the best possible forecast. You could do something similar with whether a model tries to include NPIs etc. and then show more than one ensemble. In general, making more ensembles with different flavours like this seems like it could be a good way of educating people about some of the black-box choices whilst still giving them the benefits of multiple models.
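
As a sketch of what more than one ensemble could look like, assuming each team declared a metadata flag (the includes_npis field below is hypothetical, as are the column names), you could build a separate quantile-wise median ensemble per flavour:

```python
import pandas as pd

def flavour_ensembles(forecasts, metadata):
    """One quantile-wise median ensemble per model 'flavour'.

    `forecasts`: long quantile format (model, location, target_end_date,
    quantile, value; names illustrative). `metadata`: maps each model to a
    boolean includes_npis flag (a hypothetical field).
    """
    tagged = forecasts.merge(metadata, on="model")
    ensembles = tagged.groupby(
        ["includes_npis", "location", "target_end_date", "quantile"],
        as_index=False,
    )["value"].median()
    ensembles["model"] = ensembles["includes_npis"].map(
        {True: "ensemble-with-npis", False: "ensemble-without-npis"}
    )
    return ensembles
```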

I think this needs to be more carefully caveated than it is, though, given the data problems. The kinds of models that are likely to do well are really those that handle data issues robustly; those are not necessarily the kinds of models that will do well when data sources are more manually created (for example, during an outbreak with some kind of line list with known properties).

I agree. I think ideally there should be an adoption process where you can onboard contributed models so they become hub default models. Over time that means you build up a stock of diverse models but you also don’t lose the value of a range of people’s insights and you don’t accidentally discourage submissions by getting a reputation for stealing ideas/credit/etc.

What are the other issues? The struggle to get secondary work published, or the lack of sufficient metadata/info on models to make doing this relatively easy? Has there been any community discussion about what would make this easier for people? I feel like I haven’t seen any, but then I haven’t been that engaged recently.

Is that really enough to keep people contributing? I guess the easiest way to discover this is to ask people? Again I might have missed work on that but I don’t think that has happened?

The key for me is really ramping up the metadata, I think, and focussing on automating the boring bits for contributors (or the bits that may be new to them). I also think that those running the hubs should always be running the models (or maybe after some time window of submission). This seems like it would unblock a lot of potential and more properly balance work with incentives. It would sometimes limit the kinds of models (i.e. if teams are manually tweaking to add NPIs etc.), but there could be a code submission process for updating models vs updating forecasts. This would also really help with versioning; at the moment models evolve over time and often we really have no idea how. For teams running ensembles we are really just measuring how good/well resourced the team is (in the US it seems like the most engaged/biggest/best-resourced teams are on average doing better over time, which we might attribute to the model, but it seems likely that some part of that is other factors linked to resourcing).

@pearsonca

Yes, I agree this adds real value. At the moment there still seems to be a lot of boring busywork associated with hub submission, but the move to GitHub Actions-style submission for the European hub is a massive step forward. More initiatives like that and more work streamlining the process would be really excellent. There could also be more support for routine parts of forecasting that may be less familiar to people (like docs/info/suggestions for outlier handling, forecast inspection, local model evaluation etc.). That seems like it would really help grow the community. I’m thinking of something like the Applied Epi handbook, written by the community of forecast contributors and managed by the hub (but importantly with very careful thought for credit sharing).
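
To illustrate the kind of boring bit that can be automated, here is a rough sketch of the sort of validation script a GitHub Actions-style workflow could run on each submission. The required columns and quantile-format assumptions below are illustrative, not the actual hub schema.

```python
import sys

import pandas as pd

REQUIRED_COLUMNS = {"location", "target_end_date", "quantile", "value"}  # illustrative

def validate_submission(path):
    """Return a list of problems found in a quantile-format submission CSV."""
    df = pd.read_csv(path)
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    problems = []
    if (df["value"] < 0).any():
        problems.append("negative forecast values")
    # within each target, values must be non-decreasing in the quantile level
    monotone = (
        df.sort_values("quantile")
        .groupby(["location", "target_end_date"])["value"]
        .apply(lambda v: v.is_monotonic_increasing)
    )
    if not monotone.all():
        problems.append("quantile crossing (values not monotone in quantile level)")
    return problems

if __name__ == "__main__":
    issues = validate_submission(sys.argv[1])
    if issues:
        print("\n".join(issues))
        sys.exit(1)  # fail the workflow so the contributor gets immediate feedback
    print("submission looks valid")
```

A check like this failing the workflow gives contributors immediate feedback without any manual back-and-forth, which is exactly the friction-reducing scut work consolidation you describe.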

Not explicitly, I think. There probably should be, but even more so we really need to start building mechanisms for evaluating success into modelling studies from the get-go.

I agree. I’m not sure people would really expect ensemble models to reliably predict the impact of a new intervention, but who knows. As shown in the US hub (and I’d expect the same to hold elsewhere), these are the times when forecasts are the least reliable: https://www.medrxiv.org/content/10.1101/2023.05.30.23290732v1

I completely agree that this is what we should be aiming for in the future.


See the ECDC evaluation for a few additional views on this:

(linked from Background)
