Some thoughts on parameterization of negative binomials

This may be a bit of a niche topic, but I’ve discussed this with @adrianlison and @samabbott before and said I’d write it up properly. So here we go. Sorry for somewhat messy notation. How should we use the negative binomial distribution in nowcasting models?

The usual formulation in nowcasting models such as epinowcast is

n_{td} | \lambda_t \sim NegBin(\lambda_t \times p_d, \phi)

independently for all d.

What always bugs me about this is that in my opinion it does not reflect the mechanism. My idea of the mechanism would be:

  • there’s a total number of hospitalizations N_t, for which we assume
N_t | \lambda_t \sim NegBin(\lambda_t, \phi).
  • once N_t is there (but not yet observed), there’s the reporting process where
(n_{t1}, \dots, n_{tD}) | N_t \sim Mult(N_t, p_{1}, ..., p_{D}).

It is true that under this mechanism, for each d individually you get

n_{td} \sim NegBin(\lambda_t \times p_d, \phi).

That’s a nice property of the NegBin. However, given \lambda_t, the different n_{td} are dependent.
It is therefore not the same to assume that for each d independently

n_{td} \sim NegBin(\lambda_t \times p_d, \phi).

If you define it that way and set

N_t = \sum_{d = 1}^D n_{td} \ \ \ (1)

you do not end up with

N_t \sim NegBin(\lambda_t, \phi) \ \ \ (2)

as at least I would like to. In fact, N_t as defined in (1) will have considerably lower variance than if you defined it as in (2) (if you use the same \phi).
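
To make the variance gap concrete (a quick derivation of my own, assuming the same \phi in both versions): under the independent formulation,

Var(N_t | \lambda_t) = \sum_{d = 1}^D \left( \lambda_t p_d + \frac{(\lambda_t p_d)^2}{\phi} \right) = \lambda_t + \frac{\lambda_t^2}{\phi} \sum_{d = 1}^D p_d^2,

whereas under (2) it is \lambda_t + \lambda_t^2 / \phi. Since \sum_d p_d^2 < 1 whenever reporting is spread over more than one delay, the overdispersion term shrinks by the factor \sum_d p_d^2 (e.g. for \lambda_t = 100, \phi = 10 and p_d = 0.2 for d = 1, \dots, 5, this gives variance 300 instead of 1100).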

But as a matter of fact, there is an alternative formulation where you can start from the n_{td}, but introduce some dependence that leads to the desired distribution of N_t. Specifically, we note that

n_{td} | \lambda_t \sim NegBin(\lambda_t \times p_d, \phi)

can also be written in a hierarchical manner as

c_{td} \sim Gamma(\phi, \phi) \ \ \ (3)
n_{td} | \lambda_t, c_{td} \sim Poisson(\lambda_t \times p_d \times c_{td}).

We can now use the c_{td} to introduce dependence. Specifically, if we use just one c_t rather than different c_{td}, i.e.,

c_t \sim Gamma(\phi, \phi)
n_{td} | \lambda_t, c_t \sim Poisson(\lambda_t \times p_d \times c_t),

then the sum

N_t = \sum_{d = 1}^D n_{td}

will follow a NegBin(\lambda_t, \phi). The reason is that given c_t, N_t follows a Poisson(c_t \times \lambda_t) distribution (since \sum_d p_d = 1), and marginalizing over c_t \sim Gamma(\phi, \phi) yields a NegBin, exactly as in (3).

To check the above I wrote a little R example. As it may also be easier to understand in code than in formulas, I’m pasting that below. I wrote this up a bit quickly and there may be some errors in the above, will correct them if they spark confusion :upside_down_face:

# Example in code:

n_sim <- 10^6

# Version 1: simulate N first:
N1 <- rnbinom(n_sim, mu = 100, size = 10)
mean(N1)
# [1] 100.0539
var(N1)
# [1] 1099.193; = 100 + 100^2/10

# then sample n_d from N. To keep it simple I use p = rep(0.2, 5)
# tedious as rmultinom is not vectorized properly
n1 <- matrix(NA, nrow = n_sim, ncol = 5)
for(i in 1:n_sim){
  n1[i, ] <- rmultinom(1, N1[i], prob = rep(0.2, 5))
}
var(n1[, 1])
# [1] 59.95003; = 20 + 20^2/10

# Version 2: simulate n_d first and independently:
n2 <- matrix(NA, nrow = n_sim, ncol = 5)
for(i in 1:n_sim){
  n2[i, ] <- rnbinom(5, mu = 0.2*100, size = 10)
}
# we still get the same variance for the n:
var(n2[, 1])
# [1] 60.03383; = 20 + 20^2/10
# this fits with the above

# then get N as sum over n:
N2 <- rowSums(n2)
var(N2)
# [1] 299.4957
# variance is much lower than if we started by sampling N
mean(N2)
# [1] 100.0086
# mean is still correct

# Illustration: NegBin as a Poisson gamma mixture
n3 <- rnbinom(n_sim, mu = 10, size = 20)
c <- rgamma(n_sim, shape = 20, rate = 20)
n4 <- rpois(n_sim, c*10)

mean(n3); mean(n4)
# [1] 10.00033
# [1] 10.00291

var(n3); var(n4)
# [1] 15.02911
# [1] 14.99928

# plot(table(n3))
# lines(table(n4), type = "l", col = "red")

# Version 3: simulate n_d first, but with dependence:
n5 <- matrix(NA, nrow = n_sim, ncol = 5)
for(i in 1:n_sim){
  c_temp <- rgamma(1, shape = 10, rate = 10)
  n5[i, ] <- rpois(5, c_temp*0.2*100)
}
# we still get the same variance for the n:
var(n5[, 1])
# [1] 60.12718; = 20 + 20^2/10
# same as always

# then get N as sum over n:
N5 <- rowSums(n5)
var(N5)
# [1] 1102.508
# variance as for N1, as intended

Really nice to have this clearly written up + helpful to have the sims. Pretty much the perfect level of niche for this forum.

I know we discussed this a few times but in my head there was some kind of blocker or issue. I feel like there isn’t and this is just a better choice as an observation model? I don’t think I’ve got a substantive point aside from let’s implement. It would be nice to see what impact this has on an actual nowcast.

I’ll look at doing that in the next few days, run some comparison models, and then circle back.

What if I told you Discourse supports LaTeX math…


The problem description and discussion makes a lot of sense to me, very nice!

Do I understand correctly that our current implementation corresponds in the “Poisson-Gamma” formulation to

n_{t,d} | \lambda_t, p_{t,d}, c_{t,d} \stackrel{ind}{\sim} Pois(\lambda_t \times p_{t,d} \times c_{t,d}), \;\; (1)

with c_{td} \sim Gamma(\phi^{old}, \phi^{old}), whereas the new and improved version would be

n_{t,d} | \lambda_t, p_{t,d}, c_{t} \stackrel{ind}{\sim} Pois(\lambda_t \times p_{t,d} \times c_{t}), \;\; (2)

with c_{t} \sim Gamma(\phi^{new}, \phi^{new})?


I edited my original post to use LaTeX :sweat_smile: And yes, it all boils down to what Felix wrote. I also edited my post a bit to make that more clear.

A while ago I wondered whether using the currently implemented version rather than the newly suggested one would lead to overly narrow nowcast distributions. My first intuition was yes, but the more I think about it the less of an intuition I have. Looking forward to your results, Sam!

(Edited in light of Sam’s comment below to clarify)

1 Like

Where old is actually equivalent to our current default (i.e. the negative binomial). I think I agree.

1 Like

Thanks @johannes for the nice and detailed exploration. On that note, is there a principled reason to have a negative binomial observation process here in the first place? Of course it’s commonly used in infectious disease dynamic models, but I don’t think in a particularly principled manner, mostly motivated by ease of inference.

1 Like

There’s a bit of discussion and motivation here. But personally I don’t think there are clear reasons for favouring the negative binomial over other overdispersed count distributions. I guess it’s mainly out of convenience.

1 Like

I think the key point is that there are reasons to favour an overdispersed distribution to a non-overdispersed distribution.

1 Like

We also ideally want something that aggregates robustly, which, as @johannes outlines, the NB on its own doesn’t here. That is only an issue because we can’t have discrete parameters (an HMC constraint) and so can’t use the decomposed mechanism we might actually prefer.

There’s a bit of discussion and motivation here.

Perhaps I missed it but doesn’t that reference only discuss the negative binomial distribution to model overdispersion in the transmission process (as the sum of independent geometric random variables) - but not in the context of observation models?

For theoretical reasons or for ease of inference?

Really interesting discussion here.

I don’t think there has been much discussion about the different observation models in the literature.

POMP people seem to prefer using normal distributions (where SD is a function of the mean) instead of the negative binomial:

Here’s a more thorough comparison of different transmission vs observation processes with overdispersion. Ideas around moment matching discussed here might also be relevant?

1 Like

I think for theoretical reasons if you include consideration of unmodelled processes as theoretical reasons (i.e. as with the Poisson), but not if you mean a link to the underlying transmission dynamics (though if we believe they are overdispersed and we don’t model that in the generative process, that will obviously show up in the obs process).

Pulled this out from the paper @sangwoopark shared:

Accounting for overdispersion had more impact on our fits than the presence or absence of ceilings. In particular, models with no overdispersion in either process lacked flexibility and tended to be over-confident (that is, they showed low coverage). However, models that account for overdispersion in only one process (either transmission or observation) tended to be reliable for estimating parameters such as R0, mean generation interval, and short-term forecasts, particularly when overdispersion was implemented through the negative binomial (a less constrained distribution than the beta binomial). However, parameters that are closely tied to the details of a particular model structure (such as the overdispersion parameters for the observation and transmission processes) must change when the overdispersion model changes, in order to compensate for missing sources of variability. Several authors17,18 suggest that accounting for process as well as observation error in estimates of R0 and in forecasts is necessary in order to avoid over-confident estimates. Our exploration does not include any cases where process error is completely absent – even our ‘‘dispersion-free’’ processes incorporate sampling error in the process. However, we find that neglecting overdispersion can still lead to over-confident and unreliable estimates.

This is also done a lot within models that use HMC for practical reasons (not being able to have discrete parameters).

Understandable for the transmission model, but doesn’t seem necessary for the observation model when only the response variables are discrete, no?

It can be in the context of nowcasting @johannes is talking about above, where you want to decompose observations by report day. The other context where you see it used a lot is for latent variables like infections in convolution-based approaches, but it is arguable whether that is an obs or process part of the model.

Yes, agree with all those cases. But the POMP case I linked above is specifically for the simple observation process.

Either way, I think Li et al. shows that moment matching might be good enough as long as overdispersion is taken into account?

For theoretical reasons or for ease of inference?

There’s also a fair amount of history behind dealing with overdispersion in ecology and exploring different kinds of overdispersion (i.e., mean-variance relationship):

1 Like

I had a brief chat with Seb and realized that when writing my first post I was not thinking about the role of the negative binomial in this model in quite the right way. I was thinking about it as the “next generation distribution” in a branching process-type model. But here it is just an observation model, and the actual epidemic process behind it is governed by (log-)normal distributions. So Seb is right in remarking

doesn’t that reference only discuss the negative binomial distribution to model overdispersion in the transmission process

Given this aspect I am a bit less sure my suggestion is preferable to what is currently implemented. I tend to think that the negative binomial is not only an observation process in the sense that we have stochasticity in the detection of cases, but also reflects the imperfection of our model: actual values deviate from the smoother curve of expected values our model reconstructs. Here I’d still consider the mechanism I described more appropriate, but things are a bit less clear to me.

When we’re talking purely about an observation process, I don’t see a good motivation for the negative binomial, apart from maybe the following. The observation process should be something like a (generalized) binomial distribution if we assume no false positive reports. When numbers are high, a binomial distribution is well approximated by a Poisson distribution. Consequently, a mixture of binomials is also well approximated by a mixture of Poissons, and the NegBin, as a Poisson-gamma mixture, should be an approximation of a gamma-binomial distribution (though I’m not familiar with the details of such distributions). But I am not aware of a particular reason why that sort of mixture should be preferred to other mixing distributions.
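
A quick simulation sketch of that approximation (my own check with made-up numbers, not part of the argument above): mix a binomial detection process, with success probability scaled by the same gamma variable, and compare its moments with the corresponding NegBin.

```r
# Sketch: gamma-binomial vs. NegBin (hypothetical numbers)
# N true cases, each detected with probability c * p, where c ~ Gamma(phi, phi)
set.seed(1)
n_sim <- 10^5
N <- 1000; p <- 0.02; phi <- 10
c_mix <- rgamma(n_sim, shape = phi, rate = phi)
n_gb <- rbinom(n_sim, size = N, prob = pmin(c_mix * p, 1))  # gamma-binomial
n_nb <- rnbinom(n_sim, mu = N * p, size = phi)              # NegBin(N * p, phi)
mean(n_gb); mean(n_nb)  # both close to 20
var(n_gb); var(n_nb)    # both close to 60 = 20 + 20^2/10
```

The agreement should degrade as p grows, since the binomial variance term N p (1 - p) then departs from the Poisson’s N p.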

On a similar note: the negative binomial is usually well approximated by a normal if numbers are sufficiently large (I’ve heard a mean of 30 as a rule of thumb, though you can of course always make the overdispersion so excessive that the NB becomes skewed). For small numbers I’m not a fan of normal distributions.
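
As a quick numerical illustration of that rule of thumb (my own sketch; the mean of 30 and \phi = 10 are made-up values): compare the NegBin CDF with a moment-matched normal, i.e. sd = \sqrt{\mu + \mu^2/\phi}, using a continuity correction.

```r
# Sketch: moment-matched normal approximation to a NegBin (hypothetical values)
mu <- 30; phi <- 10
sd_nb <- sqrt(mu + mu^2 / phi)  # matched standard deviation
x <- 0:200
# maximum discrepancy between the two CDFs (with continuity correction)
d <- max(abs(pnbinom(x, mu = mu, size = phi) - pnorm(x + 0.5, mean = mu, sd = sd_nb)))
d  # modest here, but grows as mu shrinks or overdispersion becomes extreme
```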

Sorry, I’m a bit late, but first wanted to say thanks a lot @johannes for writing this up (and for the perfect summary @FelixGuenther)

While I agree with @sbfnk’s point, I still think the model specification as proposed by you is preferable (assuming that we don’t have domain knowledge about the observation process contradicting it), for two reasons:

  • This makes it more consistent with non-nowcasting models which don’t fit to reporting triangles but directly to case counts and often assume these are negative binomial
  • It separates the overdispersion from the reporting delays. While I don’t know if that is realistic to assume, it should make the estimated overdispersion more comparable between different model versions (for example when assuming different max_delays or using different parametric delay distributions). This becomes even more important when also modeling missing cases (same arguments apply because the missingness dimension can just be flattened into the multinomial reporting process)
1 Like

Yes agree with this.

I think we should consider also adding support for this.

While I agree with @sbfnk’s point, I still think the model specification as proposed by you is preferable (assuming that we don’t have domain knowledge about the observation process contradicting it), for two reasons:

I also agree with the practical reasons @adrianlison puts forward.

I’ve added a feature issue for Poisson gamma mixture here: Add support for a gamma Poisson mixture observation model · Issue #283 · epinowcast/epinowcast · GitHub

and another for a normal approximation here: Add support for a normal observation model · Issue #284 · epinowcast/epinowcast · GitHub