Forecasting is a natural extension of the current epinowcast model and is a requirement for our current plans to use the epinowcast package for daily surveillance of respiratory diseases.
There are many important questions about how to specify and implement forecasting, but I have tried to implement a very simple approach related to the ongoing work on specifying a flexible expectation model. A very basic implementation, where the forecast is produced by extrapolating the expectation model forward in time, is available on the forecasting branch of GitHub - Gulfa/epinowcast.
My main question at this point is whether an approach along these lines would be of interest, or whether people have other suggestions for how this should be done. If it is of interest, I can work on the implementation and prepare a PR for it.
In the current implementation, adding forecasting looks something like the sketch below (illustrative only: the exact enw_forecast() argument names in the branch may differ):
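```r
library(epinowcast)

# national-level data from the package's example dataset
obs <- enw_filter_report_dates(
  germany_covid19_hosp[location == "DE"][age_group == "00+"],
  latest_date = "2021-10-01"
)
pobs <- enw_preprocess_data(obs, max_delay = 20)

nowcast <- epinowcast(
  pobs,
  expectation = enw_expectation(r = ~ 1 + rw(week), data = pobs),
  # forecasting module from the branch: the formula must currently match
  # the expectation formula; the horizon and continue-dynamics argument
  # names here are placeholders, not a settled API
  forecast = enw_forecast(
    ~ 1 + rw(week), horizon = 14, continue_dynamics = TRUE, data = pobs
  )
)
```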
Currently one has to specify the same formula for the expectation and the forecast. The continue dynamics flag indicates whether we want to continue the random walk into the future or assume that the trend remains constant over the forecast horizon.
As mentioned above, there are several issues to sort out if we want to implement forecasting along these lines. These include, but are not limited to:
The code currently only works for one group.
Code cleanup and testing are needed.
We currently forecast the expected new data, but there should perhaps be an additional Poisson/negative binomial observation step in the forecast.
This is a setting we’ll also face in the near future, so I’m interested in hearing about progress on this. Including a random walk will make forecasts very dispersed fairly quickly, but they probably should be, given how difficult forecasts have proven…
This is awesome @Gunnar, great to see the package being extended like this to match a use case.
I really like some points of this implementation and have a few criticisms when it comes to it being a general solution.
So my understanding of your current implementation approach is the following:
1. In-sample, the model is estimated as normal, with a flexible expectation model and whatever reporting delay model the user has specified.
2. Out-of-sample (i.e. at the forecast horizon, in the generated quantities), the forecast model (as defined in enw_forecast) is used to generate expected cases. Nothing else is available using this approach (i.e. there is no measurement model and delays are not extended into the forecast). As implemented, this has to be an extension of enw_expectation, but it could in theory be anything (though in that case it would either rely on priors only or need to also be fit to the data in 1.).
The big upside of this approach is its relative simplicity: all that needs to be done (I say “all”, but this is actually very cleverly done here) is to extend the design matrix for the expectation into the future, capture any new effects and draw these from a prior distribution (using the correct random effect if present), and simulate new expected counts after model fitting has been done. In principle this could in fact all happen as a separate post-fitting step.
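Stripped of the package machinery, a toy sketch of those mechanics (made-up numbers, not the branch code) might look like:

```r
# Toy sketch of the generated-quantities idea: extend the expectation
# design matrix over the forecast horizon, draw any new effects from
# their (hyper)priors, and simulate expected counts after fitting.
set.seed(1)
fit_days <- 1:56
horizon <- 14
all_days <- c(fit_days, max(fit_days) + seq_len(horizon))
week <- factor((all_days - 1) %/% 7)  # weekly random-walk index

X <- model.matrix(~ 0 + week)  # design matrix over fit + forecast period

n_fit <- length(unique(week[seq_along(fit_days)]))  # weeks seen in-sample
n_new <- nlevels(week) - n_fit                      # weeks only in forecast

# stand-ins for posterior draws: fitted weekly effects and the fitted
# random-walk standard deviation
fitted_effects <- cumsum(rnorm(n_fit, 0, 0.1))
sigma_rw <- 0.1

# new weeks continue the random walk from the last fitted effect
new_effects <- tail(fitted_effects, 1) + cumsum(rnorm(n_new, 0, sigma_rw))

log_exp_cases <- log(100) + X %*% c(fitted_effects, new_effects)
exp_cases <- exp(log_exp_cases)
tail(exp_cases, horizon)  # forecast expected counts
```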
The big downside of this approach is that you lose the connection with observed data via the observation model (you aren’t simulating delays into the future, and so you aren’t combining these with the forecast expected counts to get observations with measurement noise).
The alternative design I had been thinking about is substantially less efficient, as it would put more parameters into the model rather than the generated quantities, and would run something like this (see the toy sketch after the list):
1. In data preprocessing, extend the reference date and report date data frames by the forecast horizon.
2. Make sure subsequent steps correctly flag that some data is not observed at all (and so doesn’t make its way into the likelihood).
3. Run model fitting as normal. This creates parameters for the forecast part of the model, but there is no data to inform them, so they just return the priors.
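A toy sketch of what steps 1–2 could look like (column names are illustrative, not epinowcast internals):

```r
# Pad the observations by the forecast horizon and flag the padded rows
# so they would be excluded from the likelihood.
library(data.table)

obs <- data.table(
  reference_date = seq(as.Date("2024-01-01"), by = "day", length.out = 30),
  confirm = rpois(30, 100)
)
horizon <- 14
future <- data.table(
  reference_date = max(obs$reference_date) + seq_len(horizon),
  confirm = NA_integer_
)
extended <- rbind(obs, future)
extended[, observed := !is.na(confirm)]  # FALSE rows get no likelihood term
```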
This approach would give you the full model posterior and not just the expected counts, but it is obviously not very efficient. A hybrid version of these approaches would add similar handling of the reporting delay design matrices, the missing data design matrices, etc., splitting them between the model and the generated quantities. This is obviously complicated and, beyond that, breaks independence between model modules unless you are very clever.
I really like the idea of being able to fix the current dynamics or have, say, a random walk going into the future, so the flag for this is a great idea. Will this approach generalise well to more complex model formulations? What if, for example, we have a linear trend with random walk noise? We might want the random walk dynamics to be fixed but still use the linear trend. Does this indicate that the flag should in fact live in the random walk operator itself rather than being a high-level option? This would effectively be a parameter that lets you specify the date range or time period over which the random walk applies.
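For concreteness, hypothetical syntax for this might be (fixed_after does not exist anywhere, it is just the idea):

```r
# Hypothetical: the weekly random walk stops evolving after the given
# date, while the linear trend in `day` continues to extrapolate.
~ 1 + day + rw(week, fixed_after = "2021-10-01")
```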
In what settings were you thinking of having a different expectation and forecasting model? I am not sure I see the use case clearly, and if there is one, how would you fit the forecasting model? Potentially being dense!
P.S. Sorry for the lack of movement on this on GitHub.
P.P.S. This may be a bit insane as I have had so much coffee my head may explode at any moment.
Thanks for the detailed look at this. As you say the functionality is mainly tailored for our use case, but I am hoping we can come up with a reasonably general solution that could be of use for more people.
Your understanding of the current implementation is correct and I agree that there are benefits and drawbacks to this approach compared with the alternative approach you outlined. A few comments:
I agree that in the current implementation one has to specify the formula for the expectation twice, and the two have to be identical. It would have been nice to be able to get hold of the formula from the expectation, but since both are currently specified in the epinowcast() function, I did not see a nice way of accessing the enw_expectation object. One should probably require an enw_expectation object as a parameter in the definition of enw_forecast.
I partly agree with your comments about missing the rest of the observation model. I don’t see why we would want to include the delays in the forecast, as the forecast would then also need to be for some specified observation date in the future. To me it makes more sense for the forecast to be of the final expected counts once we have passed the maximum delay for the forecasted data. That said, I think adding the measurement error, and probably some form of the missing data model, would be important so that the forecast could be compared to future data (once we are past the point where delays matter).
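For the measurement error part, I imagine something as simple as this in the generated quantities (toy sketch; exp_cases and phi stand in for posterior draws from the fitted model):

```r
# Adding negative binomial measurement noise to the forecast expected
# counts, reusing the fitted overdispersion.
exp_cases <- c(120, 135, 150)  # expected final counts over the horizon
phi <- 10                      # negative binomial dispersion (example)
set.seed(1)
obs_forecast <- rnbinom(length(exp_cases), mu = exp_cases, size = phi)
```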
The implementation in the generated quantities block makes conceptual sense to me. I see the point that it might require effects to be specified twice in the code, and that this would likely require some ongoing work to keep it up to date with the rest of the model. I still think it is quite possible to extend the current approach so that it can also take into account measurement error and missing data, and that we could then keep the forecasting completely separate from the inference, which I think is a nice feature.
I agree that it would be nice to have a more general way of handling how we either fix or continue the current dynamics. It should be possible to specify in a flexible way how the dynamics should be treated in the forecast.
I think the key question now is whether we should continue to work on the current approach with the generated quantities block or investigate/implement the other approach instead.
Forecasting sounds like a cool idea, but I’m worried that it’s going to depend heavily on priors. If we’re thinking more mechanistically, it seems most conservative (parsimonious?) to assume R_t stays constant for forecasting (or to do a random walk over log R_t). From the computation and implementation side this might not be the best option, and I’m not sure how different assumptions are going to affect predictions.
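For concreteness, the constant-R_t option I have in mind is something like this renewal-equation sketch (all numbers made up):

```r
# "Fixed R_t" forecast via the renewal equation:
# I_t = R * sum_s I_{t-s} * g(s), with R frozen at its last value.
gt_pmf <- c(0.2, 0.4, 0.3, 0.1)    # example generation-time pmf
infections <- c(80, 90, 100, 110)  # most recent nowcast infections
R_fixed <- 1.1                     # last in-sample R_t, held constant
horizon <- 7
for (t in seq_len(horizon)) {
  recent <- tail(infections, length(gt_pmf))
  infections <- c(infections, R_fixed * sum(rev(recent) * gt_pmf))
}
tail(infections, horizon)  # forecast expected infections
```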
I tried going over some of the code and documentation but had trouble figuring out what’s being assumed about priors for forecasting. It would be nice to get more details on what is meant by “fixing or continuing the current dynamics”. Sorry if it’s already documented and I missed it…
> Forecasting sounds like a cool idea but I’m worried that it’s going to depend heavily on priors
This is definitely the case, though where the generative model is the same as before the forecast window, those priors will at least have hyperpriors informed by the data (i.e. in the case of a random walk). There are other specifications we could imagine (and which would be supported via the formula interface) that make more sense. For example, if temperature has a role and forecasts for it are available, along the lines of the sketch below.
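```r
# Hypothetical forecast formula: an external temperature forecast as a
# covariate in the metadata, alongside weekly random-walk noise.
~ 1 + temperature + rw(week)
```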
I also agree we should have an option to allow fixing R_t, as that is so often what people want/expect to see, and it can perform quite well given the uncertainties of doing anything else. I think this is what @Gunnar means when he says fixing the current dynamics.
The current code base can definitely be hard to navigate, which is a shame. It might be worth opening a discussion to think about how we can improve this and, in general, make it easier to onboard those interested in contributing.