Create a collection of benchmark data sets

johannes · 27 September 2022 13:10

Title essentially says it all… I feel like new nowcasting methods are generally only tested on one specific data set the respective authors just had on their hands. It’s not like in regular time series forecasting where you can just go and download a bunch of time series.

Would there be a point in having a collection of example data in a standard format, along with the nowcasting results achieved by different methods? Could be a useful resource for future methods development.

Data sets I currently have in mind:

German hospitalization data
the Swiss data @adrianlison compiled at some point (up to when they changed the reporting rhythm)
I think the NobBS package contains one or two data sets

Not sure what format this should take precisely, though.
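One possible starting point, purely as a sketch: a long "reporting triangle" layout with one row per reference date and report date. The field names (`reference_date`, `report_date`, `count`) and the two helpers are assumptions for illustration, not an agreed standard, and the counts are made-up toy values.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    """One cell of a reporting triangle (hypothetical schema)."""
    reference_date: date  # date the event occurred (e.g. hospitalisation)
    report_date: date     # date by which the count had been reported
    count: int            # count for reference_date as known on report_date


def validate(records):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    for r in records:
        if r.report_date < r.reference_date:
            problems.append(f"{r.reference_date}: reported before it occurred")
        if r.count < 0:
            problems.append(f"{r.reference_date}: negative count")
    return problems


def truncate(records, cutoff):
    """Keep only rows reported before `cutoff`, e.g. a change in reporting rhythm."""
    return [r for r in records if r.report_date < cutoff]


# Toy values, invented for illustration only.
toy = [
    Record(date(2022, 6, 29), date(2022, 6, 29), 10),
    Record(date(2022, 6, 29), date(2022, 6, 30), 14),
    Record(date(2022, 6, 29), date(2022, 7, 1), 15),
    Record(date(2022, 6, 30), date(2022, 6, 30), 8),
]

print(validate(toy))
print(len(truncate(toy, date(2022, 7, 1))))
```

A long layout like this keeps every reported version of a count, which is what a nowcasting benchmark needs in order to replay what was known on a given day.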

teojcryan · 29 September 2022 06:36

Agree this would be helpful, and I can suggest the following indicators from the UK coronavirus dashboard

Before 2022-07-01 when they shifted to weekly reporting

samabbott · 29 September 2022 10:49

I really love this idea. For many of the problems we have been thinking about, little (or no) public data appears to exist. That has left us having to make simulated case studies that may or may not match up to reality (i.e. epiforecasts/nowcasting.example on GitHub: "Estimate reporting delays and use them for nowcasting").

I had initially thought that the epinowcast package should include these examples, but obviously that would make it quite large and also bring with it lots of modelling code that people may or may not want. In the spirit of keeping things more modular, doing this as another package seems like a good idea.

There seem to be a few approaches we could take to this:

1. Having an R package (or other language package) in which the datasets live that has minimal or no functionality.
2. Hosting example datasets somewhere like Zenodo and providing tools for looking up these datasets and downloading them. This would be similar to how socialmixr functions.

I think that 2. is perhaps the better bet as these data can be quite large and it also provides a straightforward model for how to reward data sharing (as people can be cited). In principle, implementing something like this should be fairly straightforward given the available examples to work from. It would get more complex if we want to implement a standard data format etc., but perhaps we don't want to be that prescriptive (at least initially).
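To make option 2 a bit more concrete, here is a minimal sketch of the lookup side only: a small registry that maps data set names to where they are archived and how to cite them. Everything in it is an assumption. The entry names, fields, creators, and DOIs are hypothetical placeholders, and no download from Zenodo is attempted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    name: str
    description: str
    doi: str         # placeholder values below, not real DOIs
    creators: tuple  # people to credit when the data are used


# Hypothetical registry; both entries are illustrative placeholders only.
REGISTRY = {
    entry.name: entry
    for entry in [
        DatasetEntry(
            name="example-hospitalisations",
            description="Placeholder entry for a hospitalisation data set",
            doi="10.5281/zenodo.0000000",
            creators=("A. Contributor",),
        ),
        DatasetEntry(
            name="example-cases-by-test-date",
            description="Placeholder entry for case counts by date of test",
            doi="10.5281/zenodo.0000001",
            creators=("B. Contributor", "C. Contributor"),
        ),
    ]
}


def list_datasets(keyword=""):
    """Return sorted names whose name or description contains `keyword`."""
    key = keyword.lower()
    return sorted(
        name
        for name, entry in REGISTRY.items()
        if key in name.lower() or key in entry.description.lower()
    )


def get_citation(name):
    """Build a citation string so that data contributors get credit."""
    try:
        entry = REGISTRY[name]
    except KeyError:
        raise KeyError(f"Unknown data set {name!r}; try list_datasets()") from None
    return f"{', '.join(entry.creators)}. {entry.description}. https://doi.org/{entry.doi}"


print(list_datasets("hosp"))
print(get_citation("example-cases-by-test-date"))
```

Keeping the registry separate from the data means the package itself stays small, and each archived data set keeps its own citable record.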

Aside from design considerations it could be useful to start listing available and relevant datasets in this thread.

samabbott · 30 September 2022 12:30

We had a short discussion at today’s meeting of options for this. Minutes here: Short-dated epinowcast meeting, 2022-09-30 - #6 by alison

nickreich · 30 September 2022 13:58

We have some covid case data aggregated by date of test for both California and Massachusetts in the US that I think would be good potential (public) benchmark datasets.

e.g.

https://github.com/reichlab/covid-hosp-forecasts-with-cases/blob/ec9564613b4cfab5cf7e9ecbdd8f930eaacecb41/csv-data/MA-DPH-covid-alldata.csv

