We have one case study (SARI hospitalizations in Germany) where it seems like there are periods when data get systematically corrected downwards rather than upwards as we usually see. So somehow some hospitalizations get removed from the data. Has anybody encountered this before or even thought about a solution? I sort of remember talking about this to @adrianlison .
I am not sure why this is happening (maybe initial mis-classifications of the reason for hospitalization?) and will need to ask around among people more familiar with the data source. All we have are snapshots of the time series from different times, from which we compute increments, so no line list data or similar. But I feel like, independently of why this happens, it’s an interesting additional problem from a stats perspective.
As it feels like a clearly delimited task I’ll very likely assign the topic to a BSc student writing his thesis with our group. I don’t think it makes sense to have him wrap his head around the epinowcast package and try to extend it, but I hope that conceptually we’ll come up with some ideas which can be helpful afterwards. I think it’s mainly about identifying suitable distributions to model negative increments. Currently I have the Skellam distribution on my list.
In my understanding this is not the same as the problem of negative delays.
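For intuition, here is a minimal stdlib-only sketch of the Skellam idea: the distribution of a difference of two independent Poisson counts, so its support covers all integers and it naturally puts mass on negative increments. The rates `mu_add` and `mu_remove` below are made-up illustration values, not estimates from any data.

```python
import math

def skellam_pmf(k, mu_add, mu_remove, n_terms=100):
    """P(X1 - X2 = k) for X1 ~ Poisson(mu_add), X2 ~ Poisson(mu_remove).

    Computed by truncating the convolution series, in log space for
    numerical stability.
    """
    total = 0.0
    for n in range(max(0, -k), max(0, -k) + n_terms):
        log_term = ((n + k) * math.log(mu_add) - math.lgamma(n + k + 1)
                    + n * math.log(mu_remove) - math.lgamma(n + 1)
                    - (mu_add + mu_remove))
        total += math.exp(log_term)
    return total

# Mean is mu_add - mu_remove and variance is mu_add + mu_remove; the
# imbalance between the two rates also controls the skew and how much
# probability mass sits below zero.
print(skellam_pmf(-1, 5.0, 2.0))  # positive: negative increments have mass
```

One design consequence worth noting: with only two parameters, mean and dispersion cannot be set independently of the amount of negative mass, which is one reason more flexible signed-count distributions might be needed.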
This seems like the inverse problem of delayed case reporting, and the larger problem of back-filling (in your case, back-draining, to coin a new term?).
That can be a pretty non-identifiable problem, especially without knowledge of the reasons for this. Backfilling was a challenge in the B.1.1.7 paper and led to clear biases in key epi statistics like Rt and r(t). You can try to set up a repository of datasets downloaded at different times & identify some regularity in the process. IMO there are many ways to hack it, and you can justify a hack with careful attention to the bias/variance of your estimates for the same day from datasets available at different days. Hence, it depends on what you’re trying to estimate.
I hope this helps, and good luck! Backfilling is a massive issue in quant finance - often, we completely throw out datasets if they’re back-filled because it makes postdiction impossible absent very clear & accurate descriptions of the backfilling procedure (in general, if it’s back-filled, it indicates that the data providers/vendors are playing catch-up and all sorts of operational changes can suddenly lead to changes in their catch-up process and completely change model performance)
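To make the snapshot-repository idea concrete, here is a toy sketch (all dates and counts are invented) that diffs consecutive snapshots per reference date and flags downward corrections:

```python
# Hypothetical repository of time-series snapshots downloaded on
# different days: download_date -> {reference_date: cumulative_count}.
snapshots = {
    "2023-01-10": {"2023-01-08": 40, "2023-01-09": 25},
    "2023-01-11": {"2023-01-08": 46, "2023-01-09": 31},
    "2023-01-12": {"2023-01-08": 44, "2023-01-09": 35},  # 01-08 corrected down
}

downloads = sorted(snapshots)
for prev, curr in zip(downloads, downloads[1:]):
    for ref, count in snapshots[curr].items():
        # Increment = change in the cumulative count for this reference
        # date between two consecutive downloads.
        inc = count - snapshots[prev].get(ref, 0)
        flag = "  <-- downward correction" if inc < 0 else ""
        print(f"{prev} -> {curr}, ref {ref}: increment {inc:+d}{flag}")
```

Tabulating these flagged increments over a long enough archive is one simple way to look for the "regularity in the process" mentioned above (e.g. whether corrections cluster at particular delays or calendar dates).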
Thanks for bringing this up again, @johannes - porting our conversation from Slack below:
A little peek at the project mentioned above. This plot shows different data versions of SARI hospitalizations in Germany (severe acute respiratory syndrome = all sorts of respiratory diseases lumped together, from sentinel surveillance in 71 clinics). What’s interesting: downward corrections seem to be an actual feature of the data and quite common during certain periods. The reason may be a revision of case definitions, or maybe they removed data from one sentinel clinic. Not sure how it should be handled in that case. But thought it’s worth sharing here
Interesting! Data corrections would likely require some additional modeling I think (that may only pay off if the corrections are substantial, which can apparently be the case).
Generally, I still wonder how such changes are best integrated into a nowcasting framework. If there are really a lot of corrections, and if you are able to distinguish between corrections (removal of cases for a reference date) and normal updates (addition of cases for a reference date) in your data, one could maybe try to do a separate nowcast for the corrections. This is at least the most natural way I would look at corrections in a nowcasting context: treating them as “erroneous cases” that occurred at the reference date and trying to estimate the number of errors that are not yet reported (i.e. corrections not yet applied). But that would require lots of corrections to inform the delays.
If everything is blended in one case count per reference date (as is probably the case with your RKI data), then the whole thing gets more tricky of course and one would have to model the “mixture” of updates and corrections… not a nice outlook
I agree that modelling them as “erroneous” would be closest to the actual process, but (1) that makes our reporting triangle three-dimensional (we’d need to keep track of all corrections made at different times) and (2) as you say we just have aggregate counts and don’t know where the corrections happened exactly. My approach would be to work with distributions which have a support including negative numbers and just model negative entries directly. Such distributions exist but I don’t know much about them and they are likely less convenient than actual count distributions.
I guess one would need a distribution with a mean, a dispersion and a skewness parameter which steers whether and how strongly the distribution reaches into the negative numbers.
That being said, there’s also a chance I messed up the data processing
Hey johannes, sorry for the late reply! Can you help me a bit? I don’t understand why you would have to keep track of the reporting date at which a correction was made (leading to the 3D triangle). I see that this is required to document the whole data generating process, but is it necessary for nowcasting? What are we missing if we just model how many cases for a given reference date get revoked at which delay?
And with the negative distributions approach you mean the count data distributions for the reported cases, not the delay distributions, right?
Re 3D triangle: I guess I was just thinking about how to store all the available information. But you are right, that’s probably not necessary. And it’s probably not feasible anyway using the data we have, as usually we just have differences between values reported on different days. So I guess we’d just have a 2D reporting triangle with some negative entries then? After all, usually we would not know if one case got removed but another added, right?
Re negative distributions: Yes, the assumed distribution for the cells of the triangle.
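As a toy illustration of that 2D triangle (all numbers invented), the snapshot differences can be arranged by reference date and delay; note the cell that goes negative, and that a simultaneous removal and addition on the same day would simply cancel in the net increment, exactly as described above:

```python
# Hypothetical cumulative snapshots: snapshots[t][ref] is the count for
# reference day `ref` as reported on day t.
snapshots = [
    [10, 0, 0],
    [14, 8, 0],
    [13, 11, 9],  # day 0 corrected downwards at delay 2
]

n = len(snapshots)
# triangle[ref][delay] = net increment for reference day `ref` at `delay`;
# None marks cells not yet observed (the "missing" corner of the triangle).
triangle = [[None] * n for _ in range(n)]
for ref in range(n):
    for t in range(ref, n):
        prev = snapshots[t - 1][ref] if t > ref else 0
        triangle[ref][t - ref] = snapshots[t][ref] - prev

for ref, row in enumerate(triangle):
    print(f"ref day {ref}: {row}")
```

The negative cell at (ref 0, delay 2) is all the data reveal: whether it was one removal, or e.g. three removals and two additions, is not recoverable from aggregate counts.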
Just to play this through - if I were to model this in a generative model like the one we have in epinowcast, then I would (ignoring all the identifiability issues and the reasonable advice from @bigalculus to stay the hell away from such data if possible) probably add a new process that generates “fake cases”:
- This process would either be independent or coupled with the actual expectation model (if we expect fake cases to be a function of real cases)
- Fake cases are reported with a certain delay (we would very likely be forced to assume the same delay distribution as for real cases), but then they are “unreported” again with another delay (that we probably cannot estimate ^^). If we assume that the reporting and unreporting delays are stochastically independent, this would be quite straightforward to model (using convolution operations as we already do for the real cases; we would only need to adjust the maxDelay).
- Finally, the difference between the reports and “unreports” is then the observed data, represented using a likelihood function that allows for negative counts as proposed by @johannes.
Even while writing this down, I can imagine how the parameters of such a model would be virtually impossible to identify, and how we would need very strong priors to get it running. Maybe @samabbott also has thoughts on this…
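To check that such a two-delay process really does produce negative triangle cells, here is a toy forward simulation of the sketch above (all case counts and delay pmfs are invented, and the stated independence assumption is baked in):

```python
import random

random.seed(1)
T, max_delay = 30, 10
report_pmf = [0.4, 0.3, 0.2, 0.1]    # delay from reference date to report
unreport_pmf = [0.0, 0.5, 0.3, 0.2]  # extra delay until a fake case is removed

def sample_delay(pmf):
    return random.choices(range(len(pmf)), weights=pmf)[0]

# increments[ref][delay] = net change observed at this reference date/delay.
increments = [[0] * (max_delay + 1) for _ in range(T)]
for ref in range(T):
    for _ in range(random.randint(20, 40)):   # real cases: reported once
        increments[ref][sample_delay(report_pmf)] += 1
    for _ in range(random.randint(0, 8)):     # fake cases: reported, then removed
        d_report = sample_delay(report_pmf)
        increments[ref][d_report] += 1
        d_remove = d_report + sample_delay(unreport_pmf)
        if d_remove <= max_delay:
            increments[ref][d_remove] -= 1

negative_cells = sum(1 for row in increments for x in row if x < 0)
print(f"{negative_cells} negative cells in the simulated reporting triangle")
```

Because the removal delay is the reporting delay plus an extra independent delay, removals pile up at longer delays where few new reports arrive, which is where the negative cells appear; the total per reference date nonetheless stays positive.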
I don’t have a solution but wanted to share that I encountered this too in the UK data on cases by specimen date. Cases identified by lateral flow devices often have false positives, resulting in occasional overall downward corrections once they are identified.
I think @adrianlison has captured the generative process version of this problem quite well and agree it seems like quite a hard nut to crack. @johannes’ suggestion of starting with a distribution with negative support seems like a good one as a first pass, and I agree this differs from negative delays (though solutions that help with either will likely help with the other to some extent).
@johannes in your case what is the typical delay for negative updates and how often do they happen? I think there is another version of this problem where there is no real distribution and you instead have irregular large-scale system changes that lead to updates across the time series. Not a nowcasting time-series but the behaviour of reported COVID-19 in the JHU data set comes to mind as an example of this. I haven’t seen much on dealing with this (and actually thinking of writing a short note) but on the face of it that problem may be almost impossible to model.
Just to expand on @teojcryan’s project (more detail: Nowcasting COVID-19 cases by specimen date in England).
The basic issue was nowcasting COVID-19 cases by specimen date in the UK. The problem (as @teojcryan highlights) was that case confirmation can come from either an LFT test or a PCR test. When an LFT test is used, a PCR test is commonly used to check. Sometimes the LFT is a false positive and so the PCR test leads to the case being removed (regardless of what happens, the first date available for a case is kept). This leads to a dynamic where cases are truncated by some distribution (a mixture of LFT and PCR test-to-report delays) and then updated based on another distribution (the time from LFT to follow-up PCR) which has no effect when the PCR is positive. As @adrianlison points out, a sensible way to model this would be as two separate processes. As he also points out, this could very well be quite hard to identify (though there is information there, so likely not impossible).
Love everything @adrianlison and @samabbott shared!
One final piece of staircase wit from the finance side: since you’ll be building a model sensitive to an unknown operational bottleneck, it may be worthwhile to not only construct a nowcasting model based on some filling procedure, but to also construct a rolling test of the filling procedure to be alerted to possible changes of that underlying operational bottleneck. We typically do this with stop-losses in finance (stop trading / close a strategy the instant it isn’t performing as we’d expect), and at a minimum having such a test set up for your model can enable you to visually inspect if the procedures/assumptions on which your model is based are changing in real-time.
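A minimal sketch of such a rolling check (the window size, threshold, and revision series below are all invented): compare the mean of the most recent revisions against the history before them with a crude z-score, and alert when they diverge.

```python
import statistics

def revision_alert(revisions, window=7, z_threshold=3.0):
    """Flag if the latest `window` revisions deviate from the history before them."""
    history, recent = revisions[:-window], revisions[-window:]
    mu, sd = statistics.mean(history), statistics.stdev(history)
    z = (statistics.mean(recent) - mu) / (sd / window ** 0.5)
    return abs(z) > z_threshold

# Stable revision process: recent behaviour matches history, no alert.
stable = [1.0, -0.5, 0.8, 0.2, -1.1, 0.6, -0.3, 0.9, -0.7, 0.1,
          0.4, -0.2, 0.7, -0.9, 0.3, 0.5, -0.4, 0.2, -0.6, 0.8, 0.0]
print(revision_alert(stable))   # -> False

# Operational change: corrections suddenly become large and negative.
shifted = stable[:-7] + [-5.0, -4.2, -6.1, -5.5, -4.8, -5.9, -5.2]
print(revision_alert(shifted))  # -> True
```

In practice one would run this on whatever revision statistic the nowcast assumes is stable (e.g. mean increment at a given delay), which is the spirit of the stop-loss analogy: the alert does not fix the model, it tells you when its assumptions stopped holding.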
Yes, this is a good suggestion and not something we see in epi. In production models I have worked on there has been a similar system to check performance drift vs the training set (also looking at training distributions vs streaming data distributions as indicators).