@adrianlison replied:
Sorry for the late reply, continuing some points from #112 as this is the open issue now.
I have also thought about this some more, and I think you have convinced me that the currently implemented strategy (excluding observations and normalizing) is the way to go.
> I don’t think I am a big fan of the 2nd max delay approach, as I still think that the maximum delay should be chosen large enough in the first place
Just to clarify, what I meant above is only that I did not like the idea of adding a second threshold delay at which we deal with observations differently, because it implies that the first `max_delay` is almost intended to be too short. I just want to advocate that we communicate clearly to users that `max_delay` should ideally be chosen large enough to cover all or almost all possible delays.
> If we exclude but don’t normalise we end up with an identifiability problem. This is because the expectation or reporting distribution can be changed to achieve the same overall count.
Good point. It may nevertheless be nice to discuss one more time what the most principled approach to normalizing would be, as dividing by the remaining probability is not the only option: lumping the excess mass on the maximum delay would achieve the same goal. Then again, I can already guess the outcome of that discussion, because there are strong points in favor of the current approach: it keeps the shape of the parametric distribution unaltered and, maybe even more importantly, can be interpreted as conditioning on the maximum delay.
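To make the comparison concrete, here is a minimal sketch (not the package implementation; the lognormal parameters and `max_delay` value are arbitrary examples) contrasting the two normalization options for a discretized parametric delay distribution:

```python
import numpy as np
from scipy import stats

max_delay = 10

# Discretize an (illustrative) lognormal delay distribution to daily
# probabilities via CDF differences on delays 0, 1, ..., 49.
days = np.arange(50)
cdf = stats.lognorm(s=0.5, scale=np.exp(1.0)).cdf(days + 1)
pmf = np.diff(np.concatenate(([0.0], cdf)))

kept = pmf[: max_delay + 1]

# Option 1 (current approach): divide by the retained probability mass.
# This is exactly conditioning on delay <= max_delay and leaves the shape
# of the parametric distribution within the window unchanged.
conditioned = kept / kept.sum()

# Option 2: lump all mass beyond max_delay onto the last bin. This also
# sums to one, but distorts the tail (a spike at max_delay).
lumped = kept.copy()
lumped[-1] += pmf[max_delay + 1 :].sum()

print(conditioned.sum(), lumped.sum())  # both (approximately) 1
```

It is the fact that option 1 leaves the within-window shape untouched that makes the "conditioning on the maximum delay" interpretation go through.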
> I think the way to frame this model and the use of a maximum delay in general is that we are shifting the question from nowcasting what will ever be reported to what will be reported up to D
Yeah, this is the crux I think. With such an interpretation, excluding observations in the data and not modeling them in the nowcasting model becomes the intuitive approach…
One more thought I wanted to mention: the alternative philosophy would be to explicitly model all delays beyond the maximum, out to infinity, together. Using a very simple assumption such as a constant hazard, and working out the likelihood for that part (taking into account how many days beyond the maximum could have been observed so far), the count of cases beyond the maximum delay could then be used as data in the model, and the expected cases would be an estimate of all cases. Of course, this comes at the disadvantage of a very simple and limited delay model beyond the maximum. Moreover, there may be good reasons why users would actually prefer to exclude cases beyond the maximum delay from their estimate, e.g. because these cases could have outlier properties.
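As a rough sketch of what I mean (purely illustrative; the hazard value and horizon are made-up numbers, and a real implementation would estimate the hazard and build this term into the likelihood):

```python
import numpy as np

def beyond_max_pmf(p_within_max, hazard, days_observable):
    """Probability of being reported k = 1, ..., days_observable days beyond
    max_delay, assuming a constant daily reporting hazard past the maximum:
    the leftover mass (1 - p_within_max) is spread geometrically."""
    k = np.arange(1, days_observable + 1)
    return (1.0 - p_within_max) * hazard * (1.0 - hazard) ** (k - 1)

# For a reference date that is 5 days older than max_delay, only the first
# 5 tail delays could have been observed so far; the partial sum below is
# what the likelihood for the lumped "beyond max_delay" count would use.
partial = beyond_max_pmf(p_within_max=0.95, hazard=0.3, days_observable=5)
print(partial.sum())
```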
And lastly, since it fits into the overall discussion: could it make sense to also offer a `min_delay` at some point? A typical use case could be a situation where I know that reports should take at least `x` days, so it would be nice to avoid modeling all the near-zero-chance delays `d < x`.

And, more remotely, this could even allow users to cut off their nowcast at a certain `max_delay` and do a second nowcast using `max_delay + 1` as the new `min_delay`. Of course, this second nowcast would not add a lot of insight close to the present, but certain users may be interested in conducting a separate analysis for long delays.
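Just to illustrate how the two pieces would fit together (a hypothetical `min_delay` option that does not exist yet, using the same arbitrary lognormal as above): conditioning on a delay window `min_delay <= d <= max_delay` is the natural generalization of the current normalization, and the second analysis would simply condition on the complementary window.

```python
import numpy as np
from scipy import stats

# Same illustrative discretized lognormal delay distribution as before.
days = np.arange(50)
cdf = stats.lognorm(s=0.5, scale=np.exp(1.0)).cdf(days + 1)
pmf = np.diff(np.concatenate(([0.0], cdf)))

def window_pmf(pmf, min_delay, max_delay):
    """Condition the delay distribution on min_delay <= d <= max_delay."""
    window = pmf[min_delay : max_delay + 1]
    return window / window.sum()

# First nowcast: delays 0..10; second, separate analysis: delays 11..30.
short_delays = window_pmf(pmf, min_delay=0, max_delay=10)
long_delays = window_pmf(pmf, min_delay=11, max_delay=30)
```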