Maximum delay treatment

Porting the discussion from here.

@samabbott wrote:

At the moment we exclude observations beyond the user-set maximum delay in both preprocessing and modelling. This has the advantage that observations are fixed, versus other approaches, and we can expect to observe everything within our imposed timespan. It has the disadvantage of not modelling long delays and perhaps not making this trade-off clear to the user.

The proposed change is to make clear that when we set the maximum delay we are reframing the nowcasting problem from nowcasting what will ever be observed to nowcasting what will have been observed by the maximum delay. We should also improve the diagnostics (currently surfaced as max_confirm and cum_prop_reported) and documentation.
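To make the diagnostics concrete, a cum_prop_reported-style summary might be sketched as follows (illustrative data and code only, not the package's actual implementation):

```python
import numpy as np

# Toy reporting triangle: rows are reference dates, columns are counts
# reported at delays 0..5 (purely hypothetical numbers).
counts = np.array([
    [40, 30, 20, 5, 3, 2],   # reference date 1
    [50, 25, 15, 6, 3, 1],   # reference date 2
])

# Cumulative proportion of the final count reported by each delay.
cum_prop_reported = counts.cumsum(axis=1) / counts.sum(axis=1, keepdims=True)
```

Reference dates whose cumulative proportion is still well below 1 at the chosen maximum delay suggest the trade-off above is biting and a larger max delay may be needed.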

There is a more detailed discussion of this in #116.


@samabbott wrote:

In #112 I’ve added additional documentation fleshing out the fact that the max delay is currently zero-indexed.

@adrianlison replied:

Sorry for the late reply, continuing some points from #112 as this is the open issue now.

I also thought about this for longer and I think you have convinced me that the current strategy implemented (excluding observations and normalizing) is the way to go.

I don’t think I am a big fan of the 2nd max delay approach, as I still think that the maximum delay should be chosen large enough in the first place

Just to clarify, what I meant by the above is only that I did not like the idea of adding a second threshold delay at which we deal with observations differently, because it implies that the first max_delay is almost intended to be too short. I just want to advocate that we communicate clearly to users that max_delay is ideally chosen large enough to cover all, or almost all, possible delays.

If we exclude but don’t normalise we end up with an identifiability problem. This is because the expectation or reporting distribution can be changed to achieve the same overall count.

Good point, it may nevertheless be nice to discuss one more time what the most principled approach to normalizing would be (as dividing by the remaining probability is not the only possible way, lumping it on the maximum delay would achieve the same goal). Then again, I can already guess the outcome of the discussion, because there are strong points in favor of the current approach, since it keeps the shape of the parametric distribution unaltered and, maybe even more importantly, can be interpreted as conditioning on the max delay.
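To illustrate the two normalisation options being compared (conditioning on the max delay versus lumping the tail onto it), here is a minimal sketch with purely hypothetical parameters, not the package's code:

```python
import numpy as np
from scipy import stats

# Hypothetical discretised reporting-delay PMF (illustrative lognormal
# parameters), truncated at 50 days and normalised.
max_delay = 10  # D, zero indexed as discussed above
grid = np.arange(0, 51)
pmf = np.diff(stats.lognorm(s=0.8, scale=np.exp(1.2)).cdf(grid))
pmf /= pmf.sum()

# Option 1: condition on delay <= D by dividing by the remaining
# probability -- keeps the shape of the parametric distribution.
conditioned = pmf[: max_delay + 1] / pmf[: max_delay + 1].sum()

# Option 2: lump all mass beyond D onto the maximum delay itself.
lumped = pmf[: max_delay + 1].copy()
lumped[max_delay] += pmf[max_delay + 1 :].sum()
```

Conditioning rescales every bin by the same factor, so the shape within 0..D is unchanged; lumping leaves bins 0..D-1 alone but distorts the final bin.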

I think the way to frame this model and the use of a maximum delay in general is that we are shifting the question from nowcasting what will ever be reported to what will be reported up to D

Yeah, this is the crux I think. With such an interpretation, excluding observations in the data and not modeling them in the nowcasting model becomes the intuitive approach…

One more thought I had and just wanted to mention is that the alternative philosophy would be to explicitly model all the delays beyond the maximum until infinity together: using a very simple assumption like a constant hazard and working out the likelihood for that part (taking into account how many days beyond the maximum could have been observed so far), the count of the cases beyond the max delay could then be used as data in the model and the expected cases would be an estimate of all cases really. Of course, this would come at the disadvantage of using a very simple and limited delay model beyond the maximum. Moreover, there may be good reasons why users could actually prefer to exclude cases beyond the maximum delay from their estimate, e.g. because these cases could have outlier properties.
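A minimal sketch of that constant-hazard tail, under assumed names (nothing here reflects epinowcast internals):

```python
# Beyond the max delay D, assume each not-yet-reported case has a
# constant daily hazard h of being reported.
def tail_report_prob(prob_within_D, h, days_observable):
    """P(a case is reported in the window (D, D + days_observable]).

    prob_within_D:   total reporting probability at delays 0..D
    h:               constant daily reporting hazard beyond D
    days_observable: days beyond D that could have been observed so far
    """
    surviving = 1.0 - prob_within_D  # mass not yet reported by D
    return surviving * (1.0 - (1.0 - h) ** days_observable)
```

The count of cases observed beyond the max delay could then enter the likelihood (e.g. binomially) with this probability, accounting for how many days beyond the maximum were observable, so that expected cases estimate everything that will ever be reported.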

And lastly, since it fits into the overall discussion, could it make sense to also offer a min_delay at some point? A typical use case could be a situation where I know that reports should take at least x days, so it would be nice to avoid modeling all these near-zero-chance delays d<x.
And, more remotely, this could even allow users to cut-off their nowcast at a certain max_delay and do a second nowcast using max_delay + 1 as the new min_delay. Of course this second nowcast would not add a lot of insight close to the present, but certain users may be interested in conducting a separate analysis for long delays.
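A sketch of what a min_delay could look like, using the same conditioning trick as for the max delay (window_pmf is a hypothetical helper, not an existing feature):

```python
import numpy as np

# Restrict the delay support to [min_delay, max_delay] and renormalise.
def window_pmf(pmf, min_delay, max_delay):
    window = np.asarray(pmf, dtype=float)[min_delay : max_delay + 1]
    return window / window.sum()

# Illustrative PMF over delays 0..6.
pmf = np.array([0.01, 0.04, 0.25, 0.35, 0.20, 0.10, 0.05])

main_nowcast = window_pmf(pmf, 2, 4)  # reports known to take >= 2 days
long_nowcast = window_pmf(pmf, 5, 6)  # separate analysis of long delays
```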


@samabbott replied:

Ah the sweet taste of victory (for now) :laughing:. Sorry about just pushing through - I was in a bit of a rush to get the new stuff out the door because people kept wanting MPX support and I really didn’t want to only have EpiNow2 solutions to offer them!

because it implies that the first max_delay is something almost intended to be too short. I just want to advocate that we communicate to users clearly that max_delay is ideally chosen big enough to cover all or almost all possible delays.

Totally agree. I think we do need to be pragmatic though, as there are people with very limited compute resources/patience whom we don’t want to put off.

can be interpreted as conditioning on the max delay.

Yes, exactly. In terms of dealing with the need for heavier tails etc., I thought it would make sense to emphasise the use of the (not yet existing) non-parametric reference date model.

One more thought I had and just wanted to mention is that the alternative philosophy would be to explicitly model all the delays beyond the maximum until infinity together: using a very simple assumption like a constant hazard and working out the likelihood for that part (taking into account how many days beyond the maximum could have been observed so far), the count of the cases beyond the max delay could then be used as data in the model and the expected cases would be an estimate of all cases really. Of course, this would come at the disadvantage of using a very simple and limited delay model beyond the maximum. Moreover, there may be good reasons why users could actually prefer to exclude cases beyond the maximum delay from their estimate, e.g. because these cases could have outlier properties.

I really like this idea but think it will be hard to generalise. It feels like it would need its own case study to explore the impact as well. In the UK, Ryan is seeing that 99% of reports happen in the first 10 days or so, and then nothing happens until case definitions are changed and there is a big spike at 100+ days. That scenario seems quite common and also probably impossible to model well.

@samabbott replied:

And lastly, since it fits into the overall discussion, could it make sense to also offer a min_delay at some point? A typical use case could be a situation where I know that reports should take at least x days, so it would be nice to avoid modeling all these near-zero-chance delays d<x.
And, more remotely, this could even allow users to cut-off their nowcast at a certain max_delay and do a second nowcast using max_delay + 1 as the new min_delay. Of course this second nowcast would not add a lot of insight close to the present, but certain users may be interested in conducting a separate analysis for long delays.

Yes, I like this idea a lot. I think if we thought about it right we could also use this to support negative delays (i.e. reference after report, as mentioned in #108), though I’m not sure that is a good idea.

Partly this could be covered by the structural reporting model. Perhaps we could make this a special case though that might get complicated.
