Create a collection of benchmark data sets

Title essentially says it all… I feel like new nowcasting methods are generally only tested on one specific data set the respective authors happened to have at hand. It’s not like in regular time series forecasting, where you can just go and download a bunch of time series.

Would there be a point in having a collection of example data in a standard format, along with the nowcasting results achieved by different methods? It could be a useful resource for future methods development.

Data sets I currently have in mind:

  • German hospitalization data
  • the Swiss data @adrianlison compiled at some point (up to when they changed the reporting rhythm)
  • I think the NobBS package contains one or two data sets

Not sure what format this should take precisely, though.
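Purely for the sake of discussion, one possible shape is a long format with one row per reference date and report date, plus a count column. Everything in this sketch (column names, numbers) is illustrative, not a proposed standard:

```python
import csv
import io

# Hypothetical long format: one row per (reference_date, report_date)
# pair, with the count of cases for that reference date that were
# reported on that report date. Column names are illustrative only.
EXAMPLE = """reference_date,report_date,confirm
2022-09-01,2022-09-01,4
2022-09-01,2022-09-02,7
2022-09-01,2022-09-03,2
2022-09-02,2022-09-02,5
"""

def totals_by_reference_date(text):
    """Aggregate the long format to eventual totals per reference date."""
    totals = {}
    for row in csv.DictReader(io.StringIO(text)):
        totals[row["reference_date"]] = (
            totals.get(row["reference_date"], 0) + int(row["confirm"])
        )
    return totals

print(totals_by_reference_date(EXAMPLE))
```

A format like this keeps the full reporting triangle recoverable, which most nowcasting methods need, while staying trivial to aggregate.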


Agree this would be helpful, and I can suggest the following indicators from the UK coronavirus dashboard

Before 2022-07-01, when they shifted to weekly reporting


I really love this idea. For many of the problems we have been thinking about, little (or no) public data appears to exist. That has left us having to build simulated case studies that may or may not match up to reality (e.g. GitHub - epiforecasts/nowcasting.example: Estimate reporting delays and use them for nowcasting).

I had initially thought that the epinowcast package should include these examples, but obviously that would make it quite large and also bring with it lots of modelling code that people may or may not want. In the spirit of keeping things more modular, doing this as a separate package seems like a good idea.

There seem to be a few approaches we could take to this:

  1. Having an R package (or a package in another language) in which the datasets live, with minimal or no additional functionality.

  2. Hosting example datasets somewhere like Zenodo and providing tools for looking up these datasets and downloading them. This would be similar to how socialmixr functions.

I think that 2. is perhaps the better bet, as these data can be quite large, and it also provides a straightforward model for rewarding data sharing (as contributors can be cited). In principle, implementing something like this should be fairly straightforward given the available examples to work from. It would get more complex if we wanted to implement a standard data format etc., but perhaps we don’t want to be that prescriptive (at least initially).
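The lookup tooling for option 2 could be quite thin: fetch a Zenodo record’s metadata from its REST API and list the downloadable files. A minimal sketch, assuming the general shape of Zenodo’s records API; the sample record below is fabricated for illustration, and a real tool would fetch the JSON over HTTP from `https://zenodo.org/api/records/<id>`:

```python
import json

# Fabricated stand-in for the JSON a Zenodo records-API call returns.
# Field layout (files -> key / links.self) follows Zenodo's public API,
# but the record id, title, and URLs here are made up.
SAMPLE_RECORD = json.dumps({
    "id": 123456,
    "metadata": {"title": "Example nowcasting benchmark dataset"},
    "files": [
        {"key": "hospitalisations.csv",
         "links": {"self": "https://zenodo.org/api/files/abc/hospitalisations.csv"}},
    ],
})

def list_files(record_json):
    """Return (filename, download_url) pairs from a Zenodo record."""
    record = json.loads(record_json)
    return [(f["key"], f["links"]["self"]) for f in record.get("files", [])]

for name, url in list_files(SAMPLE_RECORD):
    print(name, url)
```

From there, downloading is just an HTTP GET per file URL, which is roughly the model socialmixr uses.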

Aside from design considerations, it could be useful to start listing available and relevant datasets in this thread.


We had a short discussion of options for this at today’s meeting. Minutes here: Short-dated epinowcast meeting, 2022-09-30 - #6 by alison

We have some COVID-19 case data aggregated by date of test for both California and Massachusetts in the US that I think would make good (public) benchmark datasets.