Data management recommendations for nowcasting

This is a small research project that aims to highlight the importance of data quality in epidemiological reporting for nowcasting/real-time surveillance and to provide data management recommendations for improving on the current status quo.

It started with a discussion here.

@adrianlison wrote:

Summarizing some points inspired by a discussion with @seabbs a while ago and leaving them here for further development:

The storage, processing and sharing of epidemiological data may not be directly the business of nowcasting packages like epinowcast, but it has strong implications for their usefulness and applicability. It may therefore also be valuable to develop recommendations for this domain and spread them among public health practitioners / provide them to the organizations where such data are usually collected.

A general message to get across is of course that data on the process of reporting are almost as important as data on the epidemiological dynamics we are interested in monitoring. Without them, we barely have a chance to account for the various sources of uncertainty, measurement error, delays and other distortions during modeling, eventually resulting in suboptimal prediction performance. In the worst case, the absence of data on the reporting process motivates/forces modelers to just pretend that the data are perfect, risking overly confident, biased predictions.

Now, more concretely, what does that mean for nowcasting? It means that public health authorities should be able to analyze internally, and share publicly, epidemiological count data together with information on the delay with which the counted events were reported. What counts most here is the delay until the data are available for analysis. Typical violations of this principle are using the time when a case was sent to the authority as the date of report but not accounting for the delay until it is written to the database, or nowcasting the time between onset and hospitalization but not accounting for the delay until the hospitalization is available to the health authority (see e.g. this NDR Info news report, unfortunately only available in German).

It seems that the linelist databases of many health authorities currently do not have suitable data schemas in place to retrieve the necessary information. Potential recommendations are:

  • The best scenario would be for health authorities to shift to bi-temporal database table designs for important epidemiological data. This means that for each attribute of a patient in a linelist, the valid time and transaction time of the attribute value are stored and updated by adding rows (never overwriting existing rows) over time. Such a design would allow for a complete reconstruction of the reporting process, including negative delays and corrections. It is a well-known approach in data warehousing (see the sketch after this list).
  • If a bi-temporal DB design is not possible, one should at least think hard about which epidemiological events are potentially relevant for nowcasting and add attributes to the linelist schema that record the date when the information is readily available for analysis. For any attribute where you say “this is about an event in the past but could be updated in the future (i.e. after ‘now’)”, you need a column storing the transaction time.
  • Finally, the not-so-elegant but robust workaround is to store regular snapshots of your linelist database and compare consecutive snapshots to derive the transaction times, or to inspect database logs (in my experience, probably the least fun). This can also be useful for detecting problems with your current approach, e.g. finding fields you thought would be stable at a certain point in time but are not in reality.
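
To make the bi-temporal idea a bit more concrete, here is a minimal sketch in R of what such a table fragment and an “as known at” reconstruction could look like. All column and function names (case_id, valid_time, transaction_time, as_of) are illustrative assumptions, not an established schema.

```r
library(tibble)
library(dplyr)

# Hypothetical bi-temporal linelist fragment: every change adds a row, nothing is overwritten.
linelist <- tribble(
  ~case_id, ~attribute,        ~value,       ~valid_time,           ~transaction_time,
  1,        "symptom_onset",   "2022-06-01", as.Date("2022-06-01"), as.Date("2022-06-03"),
  1,        "hospitalisation", "2022-06-05", as.Date("2022-06-05"), as.Date("2022-06-08"),
  1,        "hospitalisation", "2022-06-04", as.Date("2022-06-04"), as.Date("2022-06-10")  # correction
)

# Reconstruct the linelist as it was known on a given analysis date:
as_of <- function(linelist, date) {
  linelist |>
    filter(transaction_time <= date) |>
    group_by(case_id, attribute) |>
    slice_max(transaction_time, n = 1, with_ties = FALSE) |>
    ungroup()
}

as_of(linelist, as.Date("2022-06-09"))  # still shows the later-corrected hospitalisation date
```

Because no rows are ever overwritten, both reporting delays (transaction_time minus valid_time) and later corrections can be recovered for any past analysis date.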

Another thing to discuss (but we haven’t progressed with it so far) is what options exist to account for biases resulting from people doing none/not being able to do any of the above. The only straightforward but probably suboptimal idea I have is: if you know that your data can take 2 days until they are written to the database and you do not know the transaction times, leave any data with a reporting date from the last two days out of your analysis (i.e. do a nowcast with “retrospective” data) and potentially forecast the last two days.
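
As a minimal sketch of that workaround, assuming a data frame `cases` with a `report_date` column and an assumed maximum write delay of two days:

```r
library(dplyr)

max_write_delay <- 2          # assumed days until a record is reliably in the database
analysis_date <- Sys.Date()

# Drop reporting dates that may still be incomplete and nowcast "retrospectively";
# the dropped days would then have to be forecast instead.
cases_truncated <- cases |>
  filter(report_date <= analysis_date - max_write_delay)
```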


Sebastian Funk replied:

I think this is really important and makes some excellent points.

A few comments:

  • As far as I’m aware, most data in public health agencies are stored in Excel. So while it might be useful to point out solutions that improve on this, it may be beyond their capacity to set up and sustainably maintain a more complex setup.
  • A middle ground between points 1 and 3 might be something like a csv under version control or saved as daily snapshots - this allows reconstruction of any updates/changes and doesn’t require the user intervention that (2) requires (especially as the person entering/maintaining the data may not be aware of analysis requirements, thus creating the potential for misunderstanding).

As regards options for accounting for biases, I completely agree but think it would be useful to also point out the consequences of ignoring the biases with some examples.

@samabbott replied:

Really good points and a nice starting place for a discussion!

bi-temporal database table designs

These were new to me. Are there any examples of implementations? Is it a feature of commonly used DBs that can be turned on/off by the user?

Agree with @sbfnk that many key datasets are likely in Excel, so we need some recommendations that improve on this. Those could be to switch to a database setup (which could be in reach of most orgs, especially with the right motivation as to why). That being said, at least a few public health agencies I am aware of are using DBs and still have the problems you have outlined. Given that, it is even more useful to discuss this now, as more orgs will likely commission new infrastructure for dealing with outbreaks in the near future.

Another thing to discuss (but we haven’t progressed with it so far) is what options exist to account for biases resulting from people doing none/not being able to do any of the above. The only straightforward but probably suboptimal idea I have is: if you know that your data can take 2 days until they are written to the database and you do not know the transaction times, leave any data with a reporting date from the last two days out of your analysis (i.e. do a nowcast with “retrospective” data) and potentially forecast the last two days.

Again, as @sbfnk points out, some simulations on what exactly the impact of this issue is could be interesting. I assume it really places an upper limit on how well a nowcast model can realistically perform when these kinds of issues are present, which might favour the general use of simpler models that are more robust to noise.

I have heard of people estimating this delay from data snapshots and then imputing their data, which may cause issues due to epidemic phase bias. This could also be approached as a second nowcasting step if data are available (i.e. a mixture of distributions). This lends itself to a non-parametric approach, multi-stage nowcasting (though then there are issues with propagating uncertainty), or quite a complex parametric model.

A general message to get across is of course that data on the process of reporting are almost as important as data on the epidemiological dynamics we are interested in monitoring. Without them, we barely have a chance to account for the various sources of uncertainty, measurement error, delays and other distortions during modeling, eventually resulting in suboptimal prediction performance.

It would be interesting to think about how to highlight this to those planning surveillance systems. I totally agree, but the lack of this information doesn’t stop estimates from being made, even if sub-optimal. It may not be clear to stakeholders how much information is being lost/how biased the estimates they depend on can be.

@adrianlison replied:

Thanks for the insightful comments @sbfnk, @seabbs! A few replies:

Spreadsheets

@sbfnk good point, the database scenario may be less of a standard than I implied above. Fully agree that offering recommendations also for spreadsheet-based solutions would be important! I quite like the idea of csv + version control. Do you think this should be supported by some simple automation tools, e.g.:

  • a script that runs daily, converting the xls file to csv and ensuring that regular commits are added
  • a script that you can run on a git repo with the linelist csv and that goes through the historical commits, constructing a “temporal” csv for you that is ready for nowcasting etc. (see the sketch below)?
    If yes, are we aware of any tools that offer something like this?
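
I am not aware of an off-the-shelf tool for this either, but the second script could be fairly small. A rough sketch, assuming a repo in which a `linelist.csv` with a `case_id` column has been committed regularly (file and column names are assumptions):

```r
# List commits touching the linelist (hash and committer date, comma-separated).
commits <- system2(
  "git", c("log", "--format=%H,%cs", "--", "linelist.csv"),
  stdout = TRUE
)

# Read the linelist as it looked at each commit.
snapshots <- lapply(commits, function(line) {
  parts <- strsplit(line, ",")[[1]]
  csv <- system2("git", c("show", paste0(parts[1], ":linelist.csv")), stdout = TRUE)
  cbind(read.csv(text = csv), commit_date = as.Date(parts[2]))
})
history <- do.call(rbind, snapshots)

# Proxy for the transaction time: first commit date at which each case appears.
transaction_times <- aggregate(commit_date ~ case_id, data = history, FUN = min)
```

The first script could then just be a scheduled job that exports the xls to csv (e.g. via readxl) and runs `git add`/`git commit`.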

Since @seabbs brought in the perspective of a mid-term transition to DBs, I guess it also makes sense to think about which type of recommendation/support for spreadsheets today makes it easy to transition to the “right” DB design tomorrow. Not yet sure if the snapshot-based approach is ideal here; on the other hand, adding database-like features to a spreadsheet may put more workload on users or complicate the workflow, making it less attractive…

Databases

Regarding bitemporal database tables, I admit this is still a bit of an advanced feature. Afaik SQL does support querying them and some DB systems offer it natively (DB2 from IBM, I think?), or you have to implement a corresponding schema yourself (but this comes with the challenge of placing suitable constraints and making sure the transactions are done correctly; in particular, you might not want users to alter the table “manually” using SQL queries but only via scripts that fill out the time columns correctly). Very likely, current schemas would have to be split up, e.g. it will not make much sense to have the “whole” linelist with all attributes in one bitemporal table (for efficiency reasons but also because it makes later schema changes a mess).

Thus, some tiered advice may also be good here. The strength of the bitemporal approach is that we could also account for corrections to the linelist (which is something that can happen to a significant degree, as we know). But if we ignore this and assume that values are valid right from the beginning and are not changed anymore, we can also go for a unitemporal database with the transaction time as the time attribute. This is easier and supported by a lot of popular database systems today.

And if I understand correctly, a general takeaway we have here is that preaching how to do it right is nice, but showing what happens if you don’t do it right (via simulation studies?) could be necessary to convince stakeholders (which is somewhat understandable, since changing processes/data schemas can be expensive and risky).

@samabbott replied:

On Adrian’s points.

Spreadsheets

I think there are some commercial tools out there that do versions of this, but I am not sure whether they are in use in the PH space. In terms of scripts to do this, what you have summarised is pretty much what I was thinking and what we (actually @sbfnk) did for nowcast Rt estimates (get_covid19_nowcasts.r on GitHub). I don’t think there is an explicit tool available to do this, but it seems like it could be useful if one existed.

There is a lot of value in simple data storage. From what I have heard from those with DBs for this kind of work, there is often a lot of red tape and only some people have access. This means that effectively a spreadsheet approach ends up being used. This was definitely the case when I worked with DBs and stakeholders, though that wasn’t in public health, so it is less relevant.

Databases

Sounds quite complicated - so glad I am not a data engineer. Definitely think the tiered advice makes sense.

we can also go for a unitemporal database with the transaction time as the time attribute.

This seems like a nice option to make clear, as I imagine it is easier for people to adapt what they currently have.

Just remembered I also made a version of this to recover sequence data over time. Nothing very fancy and it’s a pretty common problem.

It might be better to make a version of this that uses a local git repository in order to make it as generalisable as possible.

@samabbott wrote:

I think when we discussed this we thought this could make a nice short paper once some ideas have been rolled around. People interested? Adrian has already written up quite a bit here and we could flesh that out and identify things that need simulations etc to demonstrate?

A planning doc / draft has been created here. Please ask for editing permission if you would like to contribute to the project.


I will look over the planning document this week/early next. Perhaps we should have a meeting with those interested to flesh out aims and who can do what? The middle of next week would work well for me if we think it’s a good idea.

A few more thoughts I had based on a discussion with Sebastian Funk and Tanja Stadler some weeks ago:

The added value of having the right delay information available (or the cost of not having it because of your DB design etc.) really only plays out in the context of nowcasting. Otherwise, the only thing you can do is roughly estimate by when “most” of your cases (e.g. 95%) are reported, and not show the data to stakeholders / not do analyses before that delay has passed (or put up a big warning sign).
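
That rough estimate itself is simple; a minimal sketch, assuming a retrospectively complete linelist with `event_date` and `report_date` columns (column names are assumptions):

```r
library(dplyr)

# Delay after which ~95% of cases are available for analysis; counts for reference
# dates more recent than this should be flagged as incomplete rather than shown as-is.
delay_95 <- linelist |>
  mutate(delay = as.numeric(report_date - event_date)) |>
  summarise(q95 = quantile(delay, probs = 0.95, na.rm = TRUE)) |>
  pull(q95)
```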

Hence, if we want to address this topic in a scientific paper, it probably has to be a “nowcasting paper” - but with a different perspective than the usual one. Most nowcasting papers that I have seen are about the method, not the data (please reply if you have counter-examples!). This one would instead be a story about the data.

I’m thinking of applying a fairly standard nowcasting model to various levels of available delay information, going from completely biased to less and less biased. If we take hospitalization data, for example, we could have the following scenarios:

  • only reference date = symptom onset date (no nowcasting possible)
  • symptom onset date + date of hospitalization
  • symptom onset date + reporting date of hospitalization case
  • symptom onset date + date when case is actually available in the database

These scenarios could be compared in terms of performance. The most straightforward evaluation would be on the eventual case counts by reference date. But I think it would be very useful for the message of the paper if we could also have an evaluation metric like “time until a new outbreak/wave is detected”. This could either be achieved by also estimating growth rates/reproduction numbers and using some threshold there, or by applying a standard aberration detection method to the case counts.
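
A minimal sketch of such a “time until detection” metric, assuming we have a daily growth-rate estimate from each nowcast and a known (simulated) start of the wave; the default threshold of zero is an arbitrary assumption:

```r
# Days from the (simulated) start of a wave until the nowcast growth-rate estimate
# first exceeds a detection threshold; NA if the wave is never detected.
time_to_detection <- function(dates, growth_estimates, wave_start, threshold = 0) {
  detected <- dates[growth_estimates > threshold & dates >= wave_start]
  if (length(detected) == 0) return(NA_real_)
  as.numeric(min(detected) - wave_start)
}
```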

Some things to be discussed:

  • What would serve as a “fairly standard” nowcasting model? On the one hand, using a model from epinowcast would be nice for advertising the package of course. On the other hand, I think it is important to have the message “if you have the data available, there are already robust and easy-to-use tools out there to do nowcasting”. The question is whether we can get epinowcast there by the time the paper is written.
  • How could we do such a comparison? Using simulated data with parameters based on expert input could be an option. But having a real-world dataset with the attributes mentioned above on which we can run the different scenarios would be a lot cooler and more credible, I guess. So, would someone have access to such data, or have contacts who would be interested in collaborating? It doesn’t have to be monkeypox or COVID-19 data; showing the principle on older data for some endemic disease should do the job too, I guess. What is needed concretely is a linelist dataset which has as many different date attributes for each case as possible, so that we can try out the different scenarios. If the data owners can get the surveillance package or epinowcast running, it might even be possible to do this study without us needing access to the linelist data.

What do you think?


I forgot to add: the above relates to the “what is at stake” part of such a paper.

The “how to make sure you have the necessary data available as a public health authority” could either be a second part of this paper, or be in another document (maybe not scientific paper then) referencing the former.

Re bi-temporal data, I recently listened to a podcast about a tech solution, Apache Kafka, that described solving this kind of problem for the US CDC (but it was light on details, and focused just on the storage/pub-sub problem, not the how-to-use-the-data problem).

Not sure what y’all already have in the outline, but I think the version of addressing this problem that would most interest me looks roughly like:

  • assuming a stream of linelist-like update events (e.g. patient id, timestamp type [test date, receipt date, report date, onset date, etc.], timestamp)…
  • describing an online (aka realtime, running, etc.) algorithm for updating a (now|fore)cast as new event(s) arrive. Since the point here is the data management considerations, this could be a relatively blunt instrument (e.g. re-run from scratch each time, so O(f(all data n)) each step, monotonically growing n), though it would be nice if something particularly clever were possible (e.g. use old prior + new data, so O(f(new data n)) each step, varying n).
  • have a few synthetic scenarios of interest (emergence, decline, shift in underlying natural history, shift in observation process, …) => generate event series
  • showing how various data collection/processing choices affect the performance of that approach, e.g. what having none of a particular date type does vs only adding data when all the dates are available, etc. One key here is to describe the event-stream equivalent of various approaches to data release, including how analysts would be able to reverse engineer the actual distribution into the event stream (e.g. they get & keep snapshots of data). The other key would be to have reference event series (multiple realizations per synthetic scenario) that get mapped / filtered / collapsed / etc. via the various data management scenarios (see the sketch after this list)
  • use the quality ~ choices + situation conclusions to prioritize the consequences of various record / report / etc. choices in terms of their impact on the forecast
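
To illustrate, here is a minimal sketch in R of what such an event stream and one possible data-management scenario mapping could look like (all column names, dates and the “release only when complete” rule are assumptions):

```r
library(tibble)
library(dplyr)

# Hypothetical event stream: one row per (patient, timestamp type) update,
# with `received` as the time the update reached the analyst.
events <- tribble(
  ~patient_id, ~timestamp_type, ~timestamp,            ~received,
  "A",         "onset",         as.Date("2022-06-01"), as.Date("2022-06-03"),
  "A",         "test",          as.Date("2022-06-02"), as.Date("2022-06-03"),
  "A",         "report",        as.Date("2022-06-03"), as.Date("2022-06-05"),
  "B",         "onset",         as.Date("2022-06-02"), as.Date("2022-06-06")
)

# One data-management scenario: a case is only released once *all* of its dates are
# available, so its effective availability is the latest `received` time.
release_when_complete <- function(events, required = c("onset", "test", "report")) {
  events |>
    group_by(patient_id) |>
    filter(all(required %in% timestamp_type)) |>
    summarise(released_on = max(received), .groups = "drop")
}

release_when_complete(events)  # patient B is not released yet
```

Other scenarios (daily snapshots, dropping a date type, delayed database writes) would just be different map/filter/collapse functions applied to the same reference event series.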

Yes, I mostly agree, though I think there is value in understanding your data provenance more generally, even outside of a nowcasting context (but this is really more for a discussion-like section).

I have also not seen any nowcasting papers explicitly about the kinds of data that might be available and agree that this is a gap.

Again agree, and agree that detecting some kind of change point would make a more impactful piece of work than just using any old setting. We could think about a few different underlying dynamics to simulate (for example, nothing is changing but there is a change in reporting delay, an outbreak setting, etc.).

Yes, this is an issue, and it is a bit of a chicken-and-egg problem sadly. If we are only using a very simple nowcasting model (i.e. a static delay) then I would say epinowcast is in a place to be used. If we want that delay to be non-parametric then that is of course a blocker (though I would like to work on this next time I have some dev cycles and don’t want to debug).

It is very attractive to base this on a very simple method so that it doesn’t detract from the actual message of the paper. Given that, we could use a simple chain-ladder (i.e. regression) approach, use surveillance, or something else (@johannes’ baseline model perhaps?).
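
For reference, a chain-ladder nowcast can be very small; a minimal sketch operating on a cumulative reporting triangle (rows = reference dates, columns = delays 0..D, NA where not yet observed). This is purely illustrative and not how surveillance or epinowcast implement their models:

```r
# Chain-ladder nowcast: fill the unobserved corner of a cumulative reporting triangle
# by scaling with empirical development factors between consecutive delay columns.
chain_ladder_nowcast <- function(triangle) {
  n_delay <- ncol(triangle)
  for (d in 2:n_delay) {
    observed <- !is.na(triangle[, d]) & !is.na(triangle[, d - 1])
    # How much cumulative counts grow between these two delay columns.
    dev_factor <- sum(triangle[observed, d]) / sum(triangle[observed, d - 1])
    to_fill <- is.na(triangle[, d]) & !is.na(triangle[, d - 1])
    triangle[to_fill, d] <- triangle[to_fill, d - 1] * dev_factor
  }
  triangle[, n_delay]  # point nowcast of the eventual counts by reference date
}
```

This only gives point estimates; in the study it would be run on the triangle as known at each analysis date under the different data scenarios.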

As the first part of the work is really all about defining scenarios and writing up what will be done, we could also get working on that without a clear nowcasting method and then fill that in once we get there.

Word on the grapevine is that UKHSA may release their monkeypox data in a form useful for nowcasting, but I have heard nothing official on that recently (reaching out now).

I quite like the idea of using some simulated scenarios as well as real data anyway, so that we can very explicitly control the underlying generative process and know the true delays etc. Perhaps the work should have a few simulated examples (that aren’t trying to be exhaustive) + a case study using real-world data if and when that becomes available. We could also provide code (obviously we will) that makes it easy to simulate your own examples and test them if our simulation choices aren’t ideal.

Personally, I prefer the idea of doing all of this in one piece so there is one place to point people and so it, you know, actually happens, but perhaps we should start on this and then see where it goes.

In terms of getting this off the ground, how would interested collabs feel about scheduling a co-working day to thrash out an analysis plan and perhaps make a bit of a start on this?

I really like this, as well as the idea of having a single nowcasting model and evaluating the performance of different ways of providing the data. If the focus is on detecting a new outbreak/wave, I think we’d also want to add:

  • date when case is actually available in the database only (no nowcasting but also no truncation)

I think it’s equally interesting to detect a downturn, or more generally whether hospitalisations are going up or down at any point in time.
