Handling delayed entry of symptom onset dates in line lists

@FelixGuenther drew our attention to a potential idiosyncrasy of line list data that he came across during his nowcasting work: it may be the case that the reference date (e.g. symptom onset date) is entered “retrospectively”, i.e. a case may be reported with missing symptom onset date first, and the symptom onset date is only added later.

If this happens on a regular basis, the reporting triangle (including missing delay as a special cell) is not stable over time. Essentially, this means that the share of cases with missing reference date depends on the date of report, as cases closer to the present have a higher probability to be (still) without symptom onset date.

This has direct implications for the missingness model envisioned / almost finalized for epinowcast, where we model the share of cases with missing symptom onset only conditional on the date of reference. If there is delayed entry of reference dates, then even if the eventual share of cases with missing onset date was constant over time, we would observe an increase in the share of cases with missing onset date towards the present, both by date of report and by date of reference.
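A toy calculation may help illustrate the effect. This is a minimal sketch under assumptions that are not from our discussion: onset dates that will eventually be recorded are entered with a geometric delay after the case report (hypothetical parameter `daily_entry_prob`), and a constant share `eventual_missing` of cases never gets an onset date.

```python
def observed_missing_share(age_days, eventual_missing=0.2, daily_entry_prob=0.3):
    """Expected share of cases still lacking an onset date `age_days` after report.

    Assumption (illustrative only): the entry delay D in days after report is
    geometric on {0, 1, 2, ...}, so P(D > age_days) = (1 - q) ** (age_days + 1).
    """
    still_pending = (1.0 - daily_entry_prob) ** (age_days + 1)
    return eventual_missing + (1.0 - eventual_missing) * still_pending


# Share of cases without onset date, by days elapsed since the case was reported.
shares = [observed_missing_share(age) for age in range(15)]
for age, share in enumerate(shares):
    print(f"reported {age:2d} days ago: {share:.3f} missing")
```

Even though the eventual share is constant, the observed share is highest for the most recently reported cases and only settles towards the eventual value for older reports.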

For the current missingness model, this means two things:

  • A stationary time series prior on the share of cases with missing onset date by date of reference would lead to an underestimation of missingness towards the present. A time series prior with trend could reduce or avoid this bias; however, we have not yet discussed which type of trend (linear trend on the logit scale?) would be most suitable.
  • Compared to imputing missing symptom onset dates by estimating the backward delay distribution, the generative missingness model would be less precise because it depends on modeling the share of cases with missing onset date by date of reference over time and cannot condition on the date of report. Of course, the estimation of backward delays also has its own challenge (dependence on the epidemic curve), so it is unclear what would be more precise overall.
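To make the "prior with trend" option a bit more concrete, here is one possible formulation, purely as an illustration and not the current epinowcast specification. It contrasts a stationary AR(1) prior with a random walk with drift on the logit scale. Here \alpha_t denotes the share of cases with missing onset date by date of reference t, and \mu, \phi, \delta and \sigma are hypothetical parameters introduced only for this sketch.

```latex
% Stationary prior: mean-reverting AR(1) on the logit scale
\operatorname{logit}(\alpha_t) = \mu + \phi \left( \operatorname{logit}(\alpha_{t-1}) - \mu \right) + \epsilon_t, \quad |\phi| < 1

% Prior with linear trend: random walk with drift \delta on the logit scale
\operatorname{logit}(\alpha_t) = \operatorname{logit}(\alpha_{t-1}) + \delta + \epsilon_t

% Shared innovation term
\epsilon_t \sim \mathcal{N}(0, \sigma^2)
```

With \delta > 0 the second prior expects missingness to rise towards the present, whereas the first pulls it back to \mu. Whether a constant drift is the right shape is exactly the open question above.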

The above points apply to a situation in which we just have the reporting triangle as data. If we can get additional data about the delay with which reference dates are recorded, we could also consider extending the nowcasting framework to several “dates of report” (e.g. date of report 1 = reporting of case, date of report 2 = reporting of reference date). This modeling of “higher-dimensional” reporting triangles is also closely connected to other idiosyncrasies of line list data, such as retrospective deletion of cases / editing of dates. See also our discussion here.
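As a rough idea of what such a "higher-dimensional" reporting structure could look like as data, here is a minimal sketch. The line list layout and all names (`line_list`, `reporting_array`, the tuple fields) are hypothetical and only meant to illustrate indexing counts by reference date plus two delays.

```python
from collections import Counter
from datetime import date

# Hypothetical line list: (onset_date, case_report_date, onset_entry_date).
# onset_date and onset_entry_date are None while the onset is not (yet) recorded.
line_list = [
    (date(2022, 7, 1), date(2022, 7, 3), date(2022, 7, 3)),  # onset known at report
    (date(2022, 7, 1), date(2022, 7, 3), date(2022, 7, 6)),  # onset added 3 days later
    (date(2022, 7, 2), date(2022, 7, 3), date(2022, 7, 6)),
    (None, date(2022, 7, 4), None),                          # onset still missing
]


def reporting_array(records):
    """Count cases by (reference date, case report delay, onset entry delay)."""
    counts = Counter()
    for onset, reported, onset_entered in records:
        if onset is None:
            # Missing reference date: the case can only be indexed by date of report.
            counts[("missing", reported)] += 1
        else:
            case_delay = (reported - onset).days           # date of report 1
            entry_delay = (onset_entered - reported).days  # date of report 2
            counts[(onset, case_delay, entry_delay)] += 1
    return counts


cells = reporting_array(line_list)
for key, n in cells.items():
    print(key, n)
```

The usual reporting triangle is recovered by summing over the entry delay, which is one way to see why it is not stable when entry delays are non-zero.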

Some take-aways from this discussion:

  • Simulations / having real data with the described data generation process would be very helpful to study the issue in more detail
  • Adjustments / a non-stationary time series prior for the missingness model in epinowcast may be sensible
  • There could be a lot of value in modeling of higher-dimensional reporting “triangles” - if we have appropriate data (and computing resources).

@FelixGuenther I accidentally also wrote minutes for this part of our discussion today (although this was supposed to be your job), so I posted them here to avoid double work - please add any thoughts you have / things I forgot!

This was a really interesting discussion and definitely a feature of many datasets that is often hidden by their reporting structures.

@kcharniga/@Gunnar/@amygimma/@medewitt/@rachaelpung or anyone else with good data access: is the trend @adrianlison describes (of the proportion of cases with onsets etc. decreasing over time regardless of when the nowcast is made) something you see in your data/experience?

As you suggest we can fudge this in the short-term by modelling a trend but in reality, extending the dimension of the data to have an additional definition of reporting seems like it will be needed to capture this well in data-rich settings (or to highlight the impact of lack of data richness to motivate better reporting systems).

Does anyone know if this has been discussed before in the literature? As far as I am aware it hasn’t been. As @adrianlison flags (and as flagged by @johannes here: “Create a collection of benchmark data sets”), public access to some more real-world datasets (or synthetic versions that can be released) would greatly help to improve our ability to handle these kinds of issues.

Essentially, this means that the share of cases with missing reference date depends on the date of report, as cases closer to the present have a higher probability to be (still) without symptom onset date.

This “feels” correct based on what I’ve seen. At t_0, the date of result, there is likely to be a higher degree of missingness in the onset date, but as t \to \infty that missingness will approach some stable value. The shape of the delay function is likely itself a function of the size of the epidemic / number of active cases, due to the human constraints on reporting.

However, I’ve seen an intermediate phenomenon where, early on, t_{test} might be imputed as t_{onset} as a placeholder until the “true” value for onset is recorded, so onset might not appear as missing (which can be a nasty surprise when you ask about it…).

I’ll look through what I have…I feel like I have some incremental snapshots of something that might be worth looking at.

Thanks @medewitt that is good to hear (well not good but you know what I mean).

It not being stable over time and depending on the underlying burden is a wrinkle that does make things more complicated I think.

That is a good point about onset imputation from the test date and this being recategorised later on. Tricky stuff. I think modelling everything with an additional dimension as @adrianlison suggests would also help here, but I imagine that some modeller/data collector communication is also a good way to go (though when is it not).

If you do have some snapshots that would be amazing. We really need as many and as varied examples of real-world data (or as close to real-world as we can get) as possible.

I’ve also had some personal communication from people at orgs who can’t speak publicly at the moment who report having seen similar artefacts in their data so, as we guessed, this does seem quite widespread.

Missingness of onset at report can depend on different factors, a couple that come to mind are:

  1. Case definitions for the disease in question. If testing is conditional on onset, then date of onset is more likely to be available at time of report.
  2. How overwhelmed data collectors are. If reporters are swamped, then reports are more likely to be missing detailed information (such as date of onset) at time of report (as mentioned by @medewitt).

@samabbott I haven’t forgotten about this…I think I have a lead on some snapshots, but I have to write some PowerShell to get SharePoint snapshots :confused: and likely some wrangling after that.

:scream: powershell. You are a brave brave man.

Finally was able to aggregate these data. I have a little over 80 daily snapshots of the line list for mpox (ID, recorded onset date at time of report, test date, report date). I have to go back and review our protocols, but I doubt I can share even the anonymized line list and I don’t want to add noise because the temporal relationship is what’s important. That being said, what distributions make sense to fit? Trying to think through how to do that…maybe with a sliding window?
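One possible first step before fitting anything, sketched here under assumptions: from the daily snapshots, extract for each case the delay between report and the first snapshot in which an onset date appears, and flag cases whose recorded onset changed later (for example a placeholder being replaced). The snapshot layout and all names below are hypothetical and will differ from the actual line list.

```python
from datetime import date

# Hypothetical snapshots: snapshot date -> {case ID: (report date, onset date or None)}
snapshots = {
    date(2022, 8, 1): {"a": (date(2022, 8, 1), None)},
    date(2022, 8, 2): {"a": (date(2022, 8, 1), None),
                       "b": (date(2022, 8, 2), date(2022, 7, 30))},
    date(2022, 8, 3): {"a": (date(2022, 8, 1), date(2022, 7, 28)),
                       "b": (date(2022, 8, 2), date(2022, 7, 29))},
}


def onset_entry_history(snapshots):
    """Return (days from report to first recorded onset, IDs with revised onset)."""
    first_entry = {}  # case ID -> entry delay in days
    revised = set()   # case IDs whose recorded onset changed in a later snapshot
    last_onset = {}   # case ID -> most recently recorded onset date
    for snap_date in sorted(snapshots):
        for case_id, (reported, onset) in snapshots[snap_date].items():
            if onset is None:
                continue
            if case_id not in first_entry:
                first_entry[case_id] = (snap_date - reported).days
            elif onset != last_onset[case_id]:
                revised.add(case_id)
            last_onset[case_id] = onset
    return first_entry, revised


entry_delays, revised_cases = onset_entry_history(snapshots)
print(entry_delays)   # entry delay per case, in days
print(revised_cases)  # cases whose onset date was edited after first entry
```

The empirical distribution of `entry_delays` (possibly within a sliding window of report dates) could then be compared against candidate parametric forms. Two caveats apply: delays are only known up to the snapshot frequency, and cases that have no onset date in any snapshot so far are right-censored rather than known to be permanently missing.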

Nice!

Ah, data sharing is always so tricky. In the first instance, could you share some plots of reporting delays over time and updates to those reporting delays? (It’s not clear to me what that plot should look like; we clearly need some representative synthetic data for this higher-dimensional reporting structure.)