Modelling censored primary observations in a discrete time nowcasting model

This is just a note so I don’t forget - will circle back.

All based on work with @sangwoopark with any good bits being theirs and any fever dreams being mine.

The problem

In current discrete-time joint primary incidence and delay models (like epinowcast) there is a fundamental reliance on the primary event being accurately reported to the day. This is rarely the case and is a particular issue for things like the date of symptom onset.

In individual level models the problem is fairly simple as we can give priors for each primary observation over the censored window. This approach is hard to fit into the population level approach taken in most/all joint nowcast models.
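For example (a minimal sketch in Python, with an assumed gamma delay distribution and a one-day censoring window, neither of which is from this thread), the individual-level likelihood just averages the delay density over the possible primary event times:

```python
import numpy as np
from scipy.stats import gamma

# An individual's primary event (e.g. symptom onset) is only known to lie in
# the window [t_lower, t_lower + width). With a uniform prior over that window,
# the likelihood of the observed secondary event time marginalises over the
# unknown primary event time.
t_lower, width = 3.0, 1.0                # onset known only to the day
t_secondary = 6.2                        # observed secondary event time
delay_dist = gamma(a=2.0, scale=1.5)     # assumed continuous delay distribution

primary_grid = np.linspace(t_lower, t_lower + width, 50)
lik = np.mean(delay_dist.pdf(t_secondary - primary_grid))
print(lik)
```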

Solutions

Retooling to individual level likelihood

This would involve retooling to fit data points individually while continuing to use information from the discrete-time incidence model to inform the priors on the primary event dates.

This seems like a good option but would require a substantial refactor. It may actually be easier to start with current individual level approaches and add methods to inform the priors. Given that, at least for now I don’t see a pathway to this in epinowcast (though it is well suited to epidist).

A mixture of the current likelihood

Instead of taking the direct approach we could fit a mixture for each entry in the reporting triangle across multiple primary event times. This would be like aggregating our uncertainty about individuals’ primary event times into a population-level aggregate, with the mixture weights representing the combined censoring window. In theory these weights would need to be informed by the underlying expectation model (as they depend on the growth rate), but in the first instance they could be assumed to be uniform (this would be nice as static mixture weights make the problem more tractable).

You could in theory build more complex priors or let them vary by observation to account for differences such as weekend reporting.

For each observation this would look something like the following over a 3-day censoring period.

N_{td} \sim \text{Poisson}\left( w_1 \lambda_{t-1} p_{d+1} + w_2 \lambda_t p_d + w_3 \lambda_{t+1} p_{d-1} \right)

Where

\sum_{i = 1}^3 w_i = 1,

and a uniform censoring period would imply,

w_i = \frac{1}{3}

If you wanted incidence-based weights (i.e. a growth rate-based prior with some reporting model) I guess you would decompose w into an incidence weighting and a reporting weighting, with normalisation to ensure the sum-to-one constraint holds.

This is based on the model formulation and notation from the epinowcast documentation.
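As a rough sketch of what one entry of that mixture mean might look like numerically (in Python rather than Stan, with made-up incidence values, delay PMF, and reporting weights), the weights could be built from an incidence component and a reporting component and then normalised:

```python
import numpy as np

# Hypothetical inputs: expected primary incidence on days t-1, t, t+1
# (e.g. implied by the growth rate) and a reporting delay PMF p, where
# p[d] is the probability of a delay of d days.
lam = np.array([80.0, 100.0, 125.0])    # lambda_{t-1}, lambda_t, lambda_{t+1}
p = np.array([0.4, 0.3, 0.2, 0.1])      # delay PMF for d = 0, ..., 3

# Weights over the 3-day censoring window: an incidence-based component
# (here proportional to expected incidence) times a reporting component
# (here uniform; swap in e.g. weekend effects), normalised to sum to one.
incidence_w = lam / lam.sum()
reporting_w = np.array([1.0, 1.0, 1.0])
w = incidence_w * reporting_w
w /= w.sum()

# Mixture mean for the reporting-triangle entry N_{t,d} with d = 1:
d = 1
mu_td = w[0] * lam[0] * p[d + 1] + w[1] * lam[1] * p[d] + w[2] * lam[2] * p[d - 1]
print(w, mu_td)
```

With uniform weights (w_i = 1/3) this reduces to the static-mixture version above.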


It’s not entirely clear to me that the second proposal really makes sense. It would be nice to find some kind of reformulation that does though.

Checking my understanding of this: is it appropriate to summarize as “we don’t currently deal with primary outcome measurement error”? As in, while we deal with the problem of measurement-has-happened-but-is-not-yet-reported, we aren’t dealing with but-also-the-measurement-might-be-wrong?


We are dealing with both in a sense. We account for delayed reporting explicitly, and we model the expectation and then use an observation model (i.e. Poisson or negative binomial) to account for the measurement-might-be-wrong error.

This point is about building in some of the ways we might be wrong by allowing intervals over which primary measurements might have happened to be specified. In that sense it’s an extra layer on our current observation model.

Okay, so more like the current model blends these two kinds of error, but we might want to allow specifying them separately, e.g. if we have specific data about one or more of them?

edit: also to make explicit the implicit: but not because, e.g., we want to try to estimate them distinctly? That seems like a non-identifiability trap.

Yes, because in some settings we may know about the censoring process. The reason this is identifiable is that we specify known unknowns for the censoring rather than trying to get the model to learn it (hence the uniform prior above). Doing this means you end up with more appropriate uncertainty, especially across a time series.

Thinking about this some more, this would only make sense if all individuals shared a censoring window on any given primary event day. Not an ideal limitation. You could generalise by grouping by censoring window type and modelling across those groups, but this would need changes to the generative model to pool cases in the renewal process and would also maybe get messy when other forms of strata were wanted.

Great that you are giving this thought; it would be awesome to have a better approximation. But I think we will need to put more thinking into this, and maybe it won’t work at all…

N_{td} \sim \text{Poisson}\left( w_1 \lambda_{t-1} p_{d+1} + w_2 \lambda_t p_d + w_3 \lambda_{t+1} p_{d-1} \right)

I don’t think you can model a mixture of Poissons like that. Instead, we would need to use the log_mix function in Stan (see 5.4 Vectorizing mixtures | Stan User’s Guide), which is not good news as it involves a lot of log_sum_exp computations…
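To see the distinction numerically, here is a small sketch (Python, with made-up component rates and weights) of the two formulations: a single Poisson whose rate is the weighted sum of the component rates, versus a mixture of Poissons evaluated via log-sum-exp, which is what log_mix computes:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import poisson

# Hypothetical component rates lambda_{t+i} * p_{d-i} and weights for one cell
rates = np.array([32.0, 30.0, 25.0])
w = np.array([1 / 3, 1 / 3, 1 / 3])
n = 28                                   # observed count N_{t,d}

# Formulation A: a single Poisson with a weighted-sum rate (as written above)
lp_weighted_rate = poisson.logpmf(n, np.dot(w, rates))

# Formulation B: a mixture of Poissons, log(sum_i w_i * Poisson(n | rate_i)),
# evaluated with log-sum-exp (the log_mix construction)
lp_mixture = logsumexp(np.log(w) + poisson.logpmf(n, rates))

print(lp_weighted_rate, lp_mixture)      # these differ in general
```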

More fundamentally, I think an issue is that under primary event censoring, the multinomially distributed delays become correlated over different primary event dates. Without censoring, the cases with primary event on date t are just distributed to secondary event dates as \text{Multinom}(n=\lambda_t, p_0, p_1, \dots, p_D). But the additional noise resulting from the censoring will be correlated, and I currently don’t know how to account for this.
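A rough simulation of this point (Python, with assumed case totals, delay PMF, and a uniform 3-day censoring window): once each case’s recorded primary date can be shifted within the window, reporting-triangle cells that can receive the same underlying cases pick up correlated noise.

```python
import numpy as np

rng = np.random.default_rng(0)

n_days, n_cases = 10, 100                  # fixed cases per true primary date
p_delay = np.array([0.4, 0.3, 0.2, 0.1])   # delay PMF for d = 0, ..., 3
shifts = np.array([-1, 0, 1])              # recorded date = true date + shift

def one_replicate():
    counts = {}
    for t_true in range(n_days):
        d_true = rng.choice(len(p_delay), size=n_cases, p=p_delay)
        u = rng.choice(shifts, size=n_cases)
        # recorded primary date and recorded delay for each case
        for t, d in zip(t_true + u, d_true - u):
            key = (int(t), int(d))
            counts[key] = counts.get(key, 0) + 1
    return counts

# Two cells with different recorded primary dates that can both receive cases
# originating from true primary date 5 (and from true date 6).
a, b = (5, 1), (6, 0)
reps = [one_replicate() for _ in range(5000)]
samples = np.array([[c.get(a, 0), c.get(b, 0)] for c in reps])
print(np.corrcoef(samples.T)[0, 1])  # small negative correlation under censoring;
                                     # with shifts = np.array([0]) it is ~0
```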

Sorry, this is actually what I meant and is indeed not good news. Edit: I’ve had a change of heart and I now don’t agree with you. Will circle back.

This is very, very true. I think the above as a mixture would only ever be a first pass.