Community seminar tomorrow at 3pm UK time
Please use this thread asynchronously for question-asking!
The recording for this talk is available here: https://www.youtube.com/watch?v=dChAFTwCgJ0
Thanks for the great talk @kejohnson9! Just one follow-up question on continuous model evaluation, as we ran out of time in the meeting…
You said that regular model evaluation helped you during the model development process. As I understood, the evaluation was based on real-world forecasting performance. On the one hand, I can see how this is great because it explicitly represents the end goal of your work and it evaluates the full interplay of your model components on actual data, so you cannot trick yourself using simulation etc. On the other hand, I wonder how informative forecast scores can be for identifying your model's limitations in the early development stage. I could imagine that at this stage, forecasts might be pretty bad, but this can be due to various factors - and combinations of factors. How does this help you to get insights like "Oh, maybe we should weight site-level Rt variation by catchment size" etc.? I guess I am just missing concrete examples of how forecast performance can highlight individual problem areas of a model as complex as yours.
Hi @adrianlison thanks for the follow up question!
On the other hand, I wonder how informative forecast scores can be for identifying your model’s limitations in the early development stage
This is a good point. I think that early in the development stage, simulating from your generative model and fitting to it is the most useful tool for model development. This is the approach I would take as a first pass for any "larger" (hard to define) scientific improvements; evaluation of new developments/features should come in as a secondary step. Because we don't know how the data are truly generated, and the true process may actually be quite different from the component you added, the simulation step is necessary but insufficient. That's where the evaluation piece can come in and give us confidence that the new component is not worsening the performance of the model. Sort of akin to model or feature selection.
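As a toy illustration of the simulate-then-fit idea (a minimal renewal-process simulation with assumed parameters, not the actual wastewater model discussed in the talk), the simulation half of such a check might look like this; the fit-and-recover step would then use whatever inference machinery the package provides:

```r
# Toy simulate-then-fit check (assumed example): simulate admissions-like
# counts from a known renewal process so that a model fit to these data can
# be checked for recovery of the known Rt trajectory.
set.seed(123)

n_days  <- 100
true_rt <- 1.3 * exp(-0.01 * seq_len(n_days))   # assumed "true" Rt trajectory
gen_int <- dgamma(1:14, shape = 4, rate = 1)    # assumed generation interval
gen_int <- gen_int / sum(gen_int)

infections    <- numeric(n_days)
infections[1] <- 50
for (t in 2:n_days) {
  lags <- seq_len(min(t - 1, length(gen_int)))
  infections[t] <- rpois(1, true_rt[t] * sum(infections[t - lags] * gen_int[lags]))
}

# A model fit to `infections` (e.g. via stan) can then be checked against
# `true_rt` and the generation interval used above.
```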
How does this help you to get insights like "Oh, maybe we should weight site-level Rt variation by catchment size" etc.?
So in theory, before we implement this change, we should have both generated data from this model and fit both versions of our model (with and without the feature), and ensured that the version matching the known data-generating process "fit" better (either via more identifiable components or, in our case, by evaluating against simulated forecasts). We did do this to justify this change.
But additionally, we could also have rerun all or a subset of the forecast dates for all locations, evaluated the forecast performance with and without the component, and ensured that adding the component improved our overall performance. We did not have this infrastructure set up at the time for this change, but since then, when we have made changes (e.g. changing how we specified priors), we did benchmark them against performance from a single forecast date. I think we should do better and benchmark across a subset of forecast dates and locations…
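As a rough sketch of what that kind of benchmarking could look like (assuming forecasts from both model variants are stored as samples in the scoringutils 2.x format with `observed`, `predicted` and `sample_id` columns; the `forecasts` data frame here is hypothetical):

```r
library(scoringutils)

# `forecasts` is assumed to hold sample-based forecasts from both variants,
# with a `model` column plus identifiers such as forecast_date and location.
scores <- forecasts |>
  as_forecast_sample() |>
  score()

# Overall comparison of the two variants, and a per-forecast-date breakdown
# to check whether any improvement is consistent across dates.
summarise_scores(scores, by = "model")
summarise_scores(scores, by = c("model", "forecast_date"))
```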
However, I don't think that evaluation analysis highlights problem areas on its own, because it's sort of a black box. I think you need to really dig into the evaluation to identify problem areas, for example the overprediction you mentioned in the meeting. Here is our QQ plot for both models: we are consistently overpredicting in both, which makes me think the issue is in one of the shared components of the model. One hypothesis is that this is due to the model for the time evolution of R(t), but I would want to test that by changing it and rerunning this analysis to see whether a different time evolution improves the model calibration.
[QQ plot of forecast calibration for both models: ww + hosp model = green, hosp-only model = orange]
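One way to quantify what the QQ plot shows (sketch only, again assuming the hypothetical sample-format `forecasts` data frame from above) is to compute empirical coverage of central prediction intervals, where consistent over-prediction shows up as observations falling below the intervals more often than expected:

```r
library(dplyr)

coverage <- forecasts |>
  group_by(model, target_date, location) |>
  summarise(
    lower_50 = quantile(predicted, 0.25),
    upper_50 = quantile(predicted, 0.75),
    lower_90 = quantile(predicted, 0.05),
    upper_90 = quantile(predicted, 0.95),
    observed = first(observed),
    .groups  = "drop"
  ) |>
  group_by(model) |>
  summarise(
    coverage_50 = mean(observed >= lower_50 & observed <= upper_50),
    coverage_90 = mean(observed >= lower_90 & observed <= upper_90),
    below_90    = mean(observed < lower_90)  # high values indicate over-prediction
  )
```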
Thank you for the detailed explanation @kejohnson9
I think you have highlighted an important point, namely that for forecast score-based evaluation to be helpful in model development, you need different versions of your model with certain components turned on or off. From the differences in performance between these model versions, you might be able to pinpoint issues with specific components. And of course I agree that more detailed performance evaluations across different dates/populations/epidemic phases can make that process easier/more informative.
One additional thought on evaluation based on simulated data: in the past I have had situations where I had a generative model in mind but implemented a simplified version of that model for inference (for example, because the original model would have been impossible or really inefficient to sample from in stan). In these situations, I always found it reassuring to code up the original generative model in R, simulate from it, and then do inference using the approximate model.
I think you have highlighted an important point, namely that for forecast score-based evaluation to be helpful in model development, you need different versions of your model with certain components turned on or off.
@adrianlison I wonder how related this is to sensitivity analysis in compartmental modelling and whether what you are describing has a name in this area of modelling.
Thanks for your fantastic talk yesterday @kejohnson9. I was particularly interested in one of your leading questions, i.e., “How do we streamline deploying scientific ideas/models into production and tool development”?
I think the forecasting/nowcasting community is fast evolving and it’s important to act on this now rather than later. On the call, I asked if we could employ the tidymodels approach where the community has developed guidelines for developing models. Some interface/engine/output elements are shared in most of the packages in this ecosystem, so I think it might be more straightforward to streamline these. For example, most of the packages use stan directly or indirectly (epinowcast, EpiNow2, EpiSewer, epidist, etc). The workflow is often similar, you nowcast or forecast a timeseries of cases and evaluate with a package like scoringutils which expects the inputs to be structured in a certain way. Most of the packages expect you to define priors or distributions. If these standards or guidelines are not established now, we may end up with a fragmented ecosystem like the wider epidemiology ecosystem. What are folks’ thoughts on achieving this in this child ecosystem? Would a set of guidelines like that of tidymodels be welcomed and what would it take to get to this?
This has also been something I find necessary to reassure myself that the components being added (1) produce reasonable results and (2) that the inference works as expected, both in an isolated example of just the particular component and also embedded into the more complex model.
On the call, I asked if we could employ the tidymodels approach where the community has developed guidelines for developing models.
I think this is a really great point, and I completely agree with your assessment that taking the time now to streamline these would have a huge impact going forward. As someone relatively new to this, I would say it would also be really helpful because it would remove some of these decisions about interface/formatting inputs and outputs that we face as developers.
Just so I make sure I understand what you're suggesting: would the idea be that we, as a relatively small community building tools aimed at solving similar problems and producing similar types of outputs, coalesce on, say, standardized ways of doing things like specifying priors, specifying delay distributions, specifying input data, and formatting outputs (e.g. the main wrapper function would return the same set of elements, with similar-looking tooling downstream to extract posterior draws, join with data, run diagnostics, and score results)?
I think this would be fantastic. I am still in the process of making some of these decisions for the package I am building and had a discussion with @seabbs on Tuesday about the best way to streamline the outputs so that, for example, an entirely different model could still return the same elements and just have different post-processing to extract different variables.
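Just to make that concrete, the kind of shared return object in question could be as simple as something like this (names purely illustrative, not from any existing package):

```r
# Hypothetical sketch of a standardised fit object that different packages
# could agree on, so downstream tooling (posterior extraction, joining with
# data, diagnostics, scoring) looks the same regardless of the model behind it.
new_epi_fit <- function(fit, stan_args, input_data, forecast_date) {
  structure(
    list(
      fit           = fit,          # fitted stan object (cmdstanr / rstan)
      stan_args     = stan_args,    # data, inits and priors passed to stan
      input_data    = input_data,   # preprocessed data with dates/metadata attached
      forecast_date = forecast_date
    ),
    class = "epi_fit"               # assumed shared class name
  )
}
```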
Yes, exactly. For example, there is a similar conversation happening in EpiNow2 about restructuring the output class and returning the summarised output in a format ready for use with scoringutils. Here, scoringutils could dictate the structure of the outputs, but we must be wary of tightly coupling these tools instead of making them interoperable. See a similar restructuring exercise in serofoi.
Other standardisations could include function and input naming conventions, etc.
Sam had sent me this issue a few days ago, and this is exactly what we’re trying to decide now. See issue here.
Have you all landed on a set of outputs coming from a wrapper fitting function? I really liked the idea of passing back the stan_args, but I also think we need to pass back the input data with the correct metadata mapped to it (e.g. the dates).
I would also be interested in making the summarized output readily usable with scoringutils; we haven't quite gotten to any evaluation modules in the package yet but intend to.
Have you all landed on a set of outputs coming from a wrapper fitting function? I really liked the idea of passing back the stan_args, but I also think we need to pass back the input data with the correct metadata mapped to it (e.g. the dates).
I'm not sure we've agreed yet.
I would also be interested in making the summarized output readily usable with scoringutils; we haven't quite gotten to any evaluation modules in the package yet but intend to.
I think that would be quite convenient as users wouldn’t have to wrangle further. Your package could have a custom as.forecast method as suggested by Sam.
Here is the EpiNow2 issue (Return forecasts for easy processing with `scoringutils` · Issue #618 · epiforecasts/EpiNow2 · GitHub) and here is the epinowcast issue (Update scoringutils integration to offer a as_forecast_samples.epinowcast method · Issue #455 · epinowcast/epinowcast · GitHub).
In the new scoringutils 2.0.0 we would just need an as_forecast_sample method and then you can map to all the other data formats you might want (i.e. quantiles - I know, nice work @nikosbosse).
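For instance (sketch only, assuming the hypothetical `epi_fit` class sketched above and a package-specific draw-extraction helper), such a method could be as thin as:

```r
# Sketch: an S3 method converting a fitted-model object into the scoringutils
# sample format. `extract_forecast_draws()` is a hypothetical helper that
# returns a long data.frame with observed, predicted and sample_id columns
# plus the forecast unit (e.g. target_date, location).
as_forecast_sample.epi_fit <- function(data, ...) {
  draws <- extract_forecast_draws(data)
  scoringutils::as_forecast_sample(draws, ...)
}
```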
oops looks like those links were already above
I think some of the really obvious wins for sharing tools are:

- Prior and model specification: brms functionality could be adapted here (a small brms illustration follows this list).
- A formula interface: brms or similar could perhaps be used but I couldn't make it work. epinowcast currently has a custom interface and either spinning that out or making a new version that is easy to pass to stan would really help.
- Delay distribution handling (i.e. like in brms): again epinowcast has some custom stuff for discretising, but again that could be improved and generalised.
- Postprocessing: tidybayes would also, I think, help a lot for a lot of these tools (again @athowes is exploring this in epidist).
- Preprocessing: in epinowcast we found a lot of gotchas in processing real-time data that I think are often missed.
- Plotting: with the forecast_samples class in scoringutils, I think a really nice model plotting package could be made. Potentially an Rt-specific version could also be made.
- Small utility functions like coerceDT that aim to make some of the things we commonly do (here, verify data.table inputs) easier and more robust.

There are also lots of others that could be done without needing to learn new things or massively engineer.
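As one concrete illustration of the "brms functionality could be adapted here" point: brms can already turn a formula plus data into the stan code and data it implies, which is roughly the machinery a spun-out formula interface would need (toy data, not an epi model):

```r
library(brms)

# Toy data: daily counts with a day-of-week effect.
toy <- data.frame(
  cases       = rpois(60, 10),
  day_of_week = factor(rep(1:7, length.out = 60))
)

# Real brms functions: generate the stan code and the corresponding data list
# implied by a formula, without fitting anything.
code     <- make_stancode(cases ~ 1 + (1 | day_of_week), data = toy, family = poisson())
standata <- make_standata(cases ~ 1 + (1 | day_of_week), data = toy, family = poisson())
```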
I am currently working on (with @sambrand) a new Julia ecosystem (EpiAware.jl: Real-time infectious disease monitoring · EpiAware.jl - excuse the LLM text in some of the docs) where I think making these interactive modules should be a lot easier. That is still in the design stages though, and obviously Julia doesn't have the uptake or dependability that R and stan have.
@seabbs I am curious which of these seem the highest priority/lowest lift and could represent a proof of concept/test of the added value (or if you think that's silly and we have to do all of them at once).
For example, my instinct is that a shared interface for the inputs (so prior specification, delay distribution handling, and preprocessing) would be extremely helpful for evaluation/identifying which models work best in which contexts all else remaining equal.
@jamesazam will be following the issue to see what you all decide on!
Link to slides
Definitely don’t need to do them all at once (and I think by its nature the modular approach is the way to go).
I think the formula interface that epinowcast uses could be spun out, and I imagine that could then be used (after maybe a refactor to improve it) in, say, EpiSewer fairly easily (@adrianlison?). The only blocker to that is whether or not it makes sense to spin it out or if something like brms actually could be used (me not managing it doesn't mean it isn't very doable).
I also think the shared stan library is very doable with some effort and thought, as long as people are willing to take the submodule dependency approach.
The work on distribution specification that @sbfnk and @jamesazam have been doing also seems like it could fairly easily be spun out and usefully used elsewhere in the very near term.
I think all these things are mostly about resources and will, as none of them are on the shortest path to any outcome and it has proved extremely difficult to convince funders etc. that they are important.
What would be a real game changer for me is to have a domain-specific (but not inference-tool-specific) language for the kinds of semi-mechanistic generative epi models we use, plus a package to represent this in a convenient data structure in R. This is because I expect R to remain popular in epi research and public health for quite some time, while probabilistic programming is evolving fast and I don't know how long people will still like stan.
If there was a ppl-agnostic structure to represent the different models in epinowcast, EpiNow2, EpiSewer, epidist, etc. (and I think that @samabbott and @sambrand have already laid the conceptual groundwork for this in EpiAware, although in Julia and maybe still a bit tightly coupled with Turing), and a package with helper functions to produce the corresponding stan args (including inits and priors, as mentioned above) that follow some clear convention, then we would as a first step "only" have to:

- adapt our stan models to receive arguments for the "data" block from this representation, and
- adapt our R packages to define the inputs for stan via this ppl-agnostic representation (instead of directly specifying stan arguments as we now do most of the time).

This would not necessarily require a shared stan library, as it leaves the ppl implementation of the models completely open (it remains the responsibility of the developer to implement the model correctly and to reject representations that your stan model cannot implement).
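As a very rough, purely hypothetical sketch of what such a representation and its mapping to stan arguments could look like in R (nothing here corresponds to an existing package API):

```r
# A ppl-agnostic model specification as plain R lists.
rt_component <- list(
  type  = "random_walk",
  unit  = "week",
  prior = list(sd = list(dist = "normal", mean = 0, sd = 0.1))
)

delay_component <- list(
  type = "delay",
  dist = list(dist = "lognormal", meanlog = 1.5, sdlog = 0.5, max = 30)
)

model_spec <- list(rt = rt_component, reporting_delay = delay_component)

# Each package would supply its own mapping from the shared representation to
# the "data" block of its stan model, rejecting specifications it cannot support.
spec_to_stan_data <- function(spec) {
  stopifnot(spec$rt$type == "random_walk")  # example of rejecting an unsupported spec
  list(
    rt_sd_prior_mean = spec$rt$prior$sd$mean,
    rt_sd_prior_sd   = spec$rt$prior$sd$sd,
    delay_max        = spec$reporting_delay$dist$max
  )
}

stan_data <- spec_to_stan_data(model_spec)
```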
The immediate advantage of this would be cleaner interface functions in R with a lower entry barrier for new contributors - right now, I can at least say for epinowcast and EpiSewer that IMO you need quite an in-depth understanding of the respective stan model (including variable names!) to work on the interface functions… which is really bad.
The more long-term advantage of this would be that it could pave the way for transitioning to other ppls (or offering several backends) while still offering our tools in R. I guess that in Julia you could ideally take such a representation and directly construct the respective EpiAware model at runtime. And if someone wants to try to build that with stan/brms, we won't stop them.
@jamesazam Yes, this is definitely related but not quite the same thing, I would say. I am thinking of situations where you might have no particular assumption in mind that you want to test sensitivity to, but you get unsatisfying forecast performance and now want to find out what part of your model could be improved. My question was about the process of identifying the source of problems and to what extent forecast scores can help here - beyond telling us that “something is off”.
Agree this is a big issue, though I am not sure how your suggestion helps lower the barriers to entry, as new contributors would still not be able to add model features (they would just be adding UI for those features?).
Yes, exactly. I think this would be great if someone did it but, as I say below, it is so much effort.
I agree a brms-like model generator could be the way to go, but to be honest I am not sure a PPL-agnostic front-end is less work/learning than just making people learn a new language (i.e. our Julia project).
Do you have an example from other fields where the kind of domain-specific tool you are thinking about already exists?
In my head, what you are proposing is an R-specific PPL that then under the hood maps to other PPLs and contains epi-specific functionality. That sounds very, very hard and high effort to me? What happens if someone just does this mapping from, say, stan to whatever the new hotness is?
I think the version of this that remains focussed on stan (maybe only for now) is very doable though.
(it remains the responsibility of the developer to implement the model correctly and to reject representations that your stan model cannot implement).
I think this is a separate issue and highlights that most of the problem lies in developer time/skill, as people (myself included) don't implement the details correctly. If we had a really nice user interface across lots of still-wrong packages, that seems like a bad use of effort?
Given that most/all of the currently available tools have fundamental flaws in their infrastructure, it seems like we should fix those (ideally by pooling resources) before we worry hugely about an extremely consistent UI. Especially as currently many/most users are relatively specialised/motivated and so can navigate maybe-clunky UIs?
I moved the above discussion about streamlining epi modeling tools to a separate thread, please see and post further comments here.