Some thoughts on parameterization of negative binomials

This may be a bit of a niche topic, but I’ve discussed this with @adrianlison and @samabbott before and said I’d write it up properly. So here we go. Sorry for somewhat messy notation. How should we use the negative binomial distribution in nowcasting models?

The usual formulation in nowcasting models such as epinowcast is

n_{td} | \lambda_t \sim NegBin(\lambda_t \times p_d, \phi)

independently for all d.

What always bugs me about this is that in my opinion it does not reflect the mechanism. My idea of the mechanism would be:

  • there’s a total number of hospitalizations N_t, for which we assume
N_t | \lambda_t \sim NegBin(\lambda_t, \phi).
  • once N_t is there (but not yet observed), there’s the reporting process where
(n_{t1}, \dots, n_{tD}) | N_t \sim Mult(N_t, p_{1}, ..., p_{D}).

It is true that under this mechanism, for each d individually you get

n_{td} \sim NegBin(\lambda_t \times p_d, \phi).

That’s a nice property of the NegBin. However, given \lambda_t, the different n_{td} are dependent.
It is therefore not the same to assume that for each d independently

n_{td} \sim NegBin(\lambda_t \times p_d, \phi).

If you define it that way and set

N_t = \sum_{d = 1}^D n_{td} \ \ \ (1)

you do not end up with

N_t \sim NegBin(\lambda_t, \phi) \ \ \ (2)

as at least I would like to. In fact, N_t as defined in (1) will have considerably lower variance than if you defined it as in (2) (if you use the same \phi).
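
To make the variance gap concrete (a quick derivation of my own, assuming the same \phi in both versions): under the independent formulation,

Var(N_t | \lambda_t) = \sum_{d = 1}^D \left( \lambda_t p_d + \frac{(\lambda_t p_d)^2}{\phi} \right) = \lambda_t + \frac{\lambda_t^2}{\phi} \sum_{d = 1}^D p_d^2,

whereas under (2) it is \lambda_t + \lambda_t^2 / \phi. Since \sum_d p_d^2 < 1 whenever reporting is spread over more than one delay, the overdispersion term shrinks by the factor \sum_d p_d^2 (e.g. for \lambda_t = 100, \phi = 10 and p_d = 0.2 for d = 1, \dots, 5, this gives variance 300 instead of 1100).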

But as a matter of fact, there is an alternative formulation where you can start from the n_{td}, but introduce some dependence that leads to the desired distribution of N_t. Specifically, we note that

n_{td} | \lambda_t \sim NegBin(\lambda_t \times p_d, \phi)

can also be written in a hierarchical manner as

c_{td} \sim Gamma(\phi, \phi) \ \ \ (3)
n_{td} | \lambda_t, c_{td} \sim Poisson(\lambda_t \times p_d \times c_{td}).

We can now use the c_{td} to introduce dependence. Specifically, if we use just one c_t rather than different c_{td}, i.e.,

c_t \sim Gamma(\phi, \phi)
n_{td} | \lambda_t, c_t \sim Poisson(\lambda_t \times p_d \times c_t),

then the sum

N_t = \sum_{d = 1}^D n_{td}

will follow a NegBin(\lambda_t, \phi). The reason is that given c_t, N_t follows a Poisson(c_t \times \lambda_t) distribution (since \sum_d p_d = 1), and marginalizing over c_t \sim Gamma(\phi, \phi) yields a NegBin, exactly as in (3).

To check the above I wrote a little R example. As it may also be easier to understand in code than in formulas, I’m pasting that below. I wrote this up a bit quickly and there may be some errors in the above, will correct them if they spark confusion :upside_down_face:

# Example in code:

n_sim <- 10^6

# Version 1: simulate N first:
N1 <- rnbinom(n_sim, mu = 100, size = 10)
mean(N1)
# [1] 100.0539
var(N1)
# [1] 1099.193; = 100 + 100^2/10

# then sample n_d from N. To keep it simple I use p = rep(0.2, 5)
# tedious as rmultinom is not vectorized properly
n1 <- matrix(NA, nrow = n_sim, ncol = 5)
for(i in 1:n_sim){
  n1[i, ] <- rmultinom(1, N1[i], prob = rep(0.2, 5))
}
var(n1[, 1])
# [1] 59.95003; = 20 + 20^2/10

# Version 2: simulate n_d first and independently:
n2 <- matrix(NA, nrow = n_sim, ncol = 5)
for(i in 1:n_sim){
  n2[i, ] <- rnbinom(5, mu = 0.2*100, size = 10)
}
# we still get the same variance for the n:
var(n2[, 1])
# [1] 60.03383; = 20 + 20^2/10
# this fits with the above

# then get N as sum over n:
N2 <- rowSums(n2)
var(N2)
# [1] 299.4957
# variance is much lower than if we started by sampling N
mean(N2)
# [1] 100.0086
# mean is still correct

# Illustration: NegBin as a Poisson gamma mixture
n3 <- rnbinom(n_sim, mu = 10, size = 20)
c <- rgamma(n_sim, shape = 20, rate = 20)
n4 <- rpois(n_sim, c*10)

mean(n3); mean(n4)
# [1] 10.00033
# [1] 10.00291

var(n3); var(n4)
# [1] 15.02911
# [1] 14.99928

# plot(table(n3))
# lines(table(n4), type = "l", col = "red")

# Version 3: simulate n_d first, but with dependence:
n5 <- matrix(NA, nrow = n_sim, ncol = 5)
for(i in 1:n_sim){
  c_temp <- rgamma(1, shape = 10, rate = 10)
  n5[i, ] <- rpois(5, c_temp*0.2*100)
}
# we still get the same variance for the n:
var(n5[, 1])
# [1] 60.12718; = 20 + 20^2/10
# same as always

# then get N as sum over n:
N5 <- rowSums(n5)
var(N5)
# [1] 1102.508
# variance as for N1, as intended

Really nice to have this clearly written up + helpful to have the sims. Pretty much the perfect level of niche for this forum.

I know we discussed this a few times but in my head there was some kind of blocker or issue. I feel like there isn’t and this is just a better choice as an observation model? I don’t think I’ve got a substantive point aside from let’s implement. It would be nice to see what impact this has on an actual nowcast.

I’ll look at doing that in the next few days, run some comparison models, and then circle back.

What if I told you Discourse supports LaTeX math…


The problem description and discussion makes a lot of sense to me, very nice!

Do I understand correctly that our current implementation corresponds in the “Poisson-Gamma” formulation to

n_{t,d} | \lambda_t, p_{t,d}, c_{t,d} \stackrel{ind}{\sim} Pois(\lambda_t \times p_{t,d} \times c_{t,d}), \;\; (1)

with c_{td} \sim Gamma(\phi^{old}, \phi^{old}), whereas the new and improved version would be

n_{t,d} | \lambda_t, p_{t,d}, c_{t} \stackrel{ind}{\sim} Pois(\lambda_t \times p_{t,d} \times c_{t}), \;\; (2)

with c_{t} \sim Gamma(\phi^{new}, \phi^{new})?


I edited my original post to use LaTeX :sweat_smile: And yes, it all boils down to what Felix wrote. I also edited my post a bit to make that more clear.

A while ago I wondered whether using the currently implemented version rather than the newly suggested one would lead to overly narrow nowcast distributions. My first intuition was yes, but the more I think about it the less of an intuition I have. Looking forward to your results, Sam!

(Edited in light of Sam’s comment below to clarify)

1 Like

Where old is actually equivalent to our current default (i.e. the negative binomial). I think I agree.

1 Like

Thanks @johannes for the nice and detailed exploration. On that note, is there a principled reason to have a negative binomial observation process here in the first place? Of course it’s commonly used in infectious disease dynamic models, but I don’t think in a particularly principled manner, mostly motivated by ease of inference.

1 Like

There’s a bit of discussion and motivation here. But personally I don’t think there are clear reasons for favouring the negative binomial over other overdispersed count distributions. I guess it’s mainly out of convenience.

1 Like

I think the key point is that there are reasons to favour an overdispersed distribution to a non-overdispersed distribution.

1 Like

We also ideally want something that aggregates robustly, which, as @johannes outlines, the NB on its own doesn’t here. That is only an issue because we can’t have discrete parameters (an HMC constraint) and so can’t use the decomposed mechanism we might actually prefer.

There’s a bit of discussion and motivation here.

Perhaps I missed it but doesn’t that reference only discuss the negative binomial distribution to model overdispersion in the transmission process (as the sum of independent geometric random variables) - but not in the context of observation models?

For theoretical reasons or for ease of inference?

Really interesting discussion here.

I don’t think there has been much discussion about the different observation models in the literature.

POMP people seem to prefer using normal distributions (where SD is a function of the mean) instead of the negative binomial:

Here’s a more thorough comparison of different transmission vs observation processes with overdispersion. Ideas around moment matching discussed here might also be relevant?

1 Like

I think for theoretical reasons if you include consideration of unmodelled processes as theoretical reasons (i.e. as with the Poisson), but not if you mean a link to the underlying transmission dynamics (though if we believe they are overdispersed and we don’t model that in the generative process, that will obviously show up in the obs process).

Pulled this out from the paper @sangwoopark shared:

Accounting for overdispersion had more impact on our fits than the presence or absence of ceilings. In particular, models with no overdispersion in either process lacked flexibility and tended to be over-confident (that is, they showed low coverage). However, models that account for overdispersion in only one process (either transmission or observation) tended to be reliable for estimating parameters such as R0, mean generation interval, and short-term forecasts, particularly when overdispersion was implemented through the negative binomial (a less constrained distribution than the beta binomial). However, parameters that are closely tied to the details of a particular model structure (such as the overdispersion parameters for the observation and transmission processes) must change when the overdispersion model changes, in order to compensate for missing sources of variability. Several authors17,18 suggest that accounting for process as well as observation error in estimates of R0 and in forecasts is necessary in order to avoid over-confident estimates. Our exploration does not include any cases where process error is completely absent – even our ‘‘dispersion-free’’ processes incorporate sampling error in the process. However, we find that neglecting overdispersion can still lead to over-confident and unreliable estimates.

This is also done a lot within models that use HMC for practical reasons (not being able to have discrete parameters).

Understandable for the transmission model, but doesn’t seem necessary for the observation model when only the response variables are discrete, no?

It can be in the context of nowcasting @johannes is talking about above, where you want to decompose observations by report day. The other context where you see it used a lot is for latent variables like infections in convolution-based approaches, but it is arguable whether that is an obs or process part of the model.

Yes, agree with all those cases. But the POMP case I linked above is specifically for the simple observation process.

Either way, I think Li et al. shows that moment matching might be good enough as long as overdispersion is taken into account?

For theoretical reasons or for ease of inference?

There’s also a fair amount of history behind dealing with overdispersion in ecology and exploring different kinds of overdispersion (i.e., mean-variance relationship):

1 Like

I had a brief chat with Seb and realized that when writing my first post I was not thinking about the role of the negative binomial in this model in quite the right way. I was thinking about it as the “next generation distribution” in a branching process-type model. But here it is just an observation model, and the actual epidemic process behind it is governed by (log-)normal distributions. So Seb is right in remarking

doesn’t that reference only discuss the negative binomial distribution to model overdispersion in the transmission process

Given this aspect I am a bit less sure my suggestion is preferable to what is currently implemented. I tend to think that the negative binomial is not only an observation process in the sense that we have stochasticity in the detection of cases, but also reflects the imperfection of our model: actual values deviate from the smoother curve of expected values our model reconstructs. Here I’d still consider the mechanism I described more appropriate, but things are a bit less clear to me.

When we’re talking purely about an observation process, I don’t see a good motivation for the negative binomial, apart from maybe the following. The observation process should be something like a (generalized) binomial distribution if we assume no false positive reports. When numbers are high, a binomial distribution is well approximated by a Poisson distribution. Consequently, a mixture of binomials is also well approximated by a mixture of Poissons, and the NegBin, as a Poisson-gamma mixture, should be an approximation of a gamma-binomial distribution (though I’m not familiar with the details of such distributions). But I am not aware of a particular reason why that sort of mixture should be preferred to other mixing distributions.
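
A quick simulation sketch of that approximation (my own check with made-up numbers, not part of the argument above): mix a binomial detection process, with success probability scaled by the same gamma variable, and compare its moments with the corresponding NegBin.

```r
# Sketch: gamma-binomial vs. NegBin (hypothetical numbers)
# N true cases, each detected with probability c * p, where c ~ Gamma(phi, phi)
set.seed(1)
n_sim <- 10^5
N <- 1000; p <- 0.02; phi <- 10
c_mix <- rgamma(n_sim, shape = phi, rate = phi)
n_gb <- rbinom(n_sim, size = N, prob = pmin(c_mix * p, 1))  # gamma-binomial
n_nb <- rnbinom(n_sim, mu = N * p, size = phi)              # NegBin(N * p, phi)
mean(n_gb); mean(n_nb)  # both close to 20
var(n_gb); var(n_nb)    # both close to 60 = 20 + 20^2/10
```

The agreement should degrade as p grows, since the binomial variance term N p (1 - p) then departs from the Poisson’s N p.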

On a similar note: the negative binomial is usually well approximated by a normal if numbers are sufficiently large (I’ve heard a mean of 30 as a rule of thumb, though you can of course always make the overdispersion so excessive that the NB becomes skewed). For small numbers I’m not a fan of normal distributions.
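
As a quick numerical illustration of that rule of thumb (my own sketch; the mean of 30 and \phi = 10 are made-up values): compare the NegBin CDF with a moment-matched normal, i.e. sd = \sqrt{\mu + \mu^2/\phi}, using a continuity correction.

```r
# Sketch: moment-matched normal approximation to a NegBin (hypothetical values)
mu <- 30; phi <- 10
sd_nb <- sqrt(mu + mu^2 / phi)  # matched standard deviation
x <- 0:200
# maximum discrepancy between the two CDFs (with continuity correction)
d <- max(abs(pnbinom(x, mu = mu, size = phi) - pnorm(x + 0.5, mean = mu, sd = sd_nb)))
d  # modest here, but grows as mu shrinks or overdispersion becomes extreme
```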

Sorry, I’m a bit late, but first wanted to say thanks a lot @johannes for writing this up (and for the perfect summary @FelixGuenther)

While I agree with @sbfnk’s point, I still think the model specification as proposed by you is preferable (assuming that we don’t have domain knowledge about the observation process contradicting it), for two reasons:

  • This makes it more consistent with non-nowcasting models which don’t fit to reporting triangles but directly to case counts and often assume these are negative binomial
  • It separates the overdispersion from the reporting delays. While I don’t know if that is realistic to assume, it should make the estimated overdispersion more comparable between different model versions (for example when assuming different max_delays or using different parametric delay distributions). This becomes even more important when also modeling missing cases (same arguments apply because the missingness dimension can just be flattened into the multinomial reporting process)
1 Like

Yes agree with this.

I think we should consider also adding support for this.

While I agree with @sbfnk’s point, I still think the model specification as proposed by you is preferable (assuming that we don’t have domain knowledge about the observation process contradicting it), for two reasons:

I also agree with the practical reasons @adrianlison puts forward.

I’ve added a feature issue for Poisson gamma mixture here: Add support for a gamma Poisson mixture observation model · Issue #283 · epinowcast/epinowcast · GitHub

and another for a normal approximation here: Add support for a normal observation model · Issue #284 · epinowcast/epinowcast · GitHub