Community Seminar 2024-08-07 - Kaitlyn Johnson - Wastewater modeling to forecast hospital admissions in the US: Challenges and opportunities

Community seminar tomorrow at 3pm UK time

See more: Kaitlyn Johnson - Wastewater modeling to forecast hospital admissions in the US: Challenges and opportunities – Epinowcast

Please use this thread to ask questions asynchronously!

The recording for this talk is available here: https://www.youtube.com/watch?v=dChAFTwCgJ0


Thanks for the great talk @kejohnson9! Just one follow-up question on continuous model evaluation, as we ran out of time in the meeting…

You said that regular model evaluation helped you during the model development process. As I understood it, the evaluation was based on real-world forecasting performance. On the one hand, I can see how this is great because it explicitly represents the end goal of your work and evaluates the full interplay of your model components on actual data, so you cannot trick yourself using simulation, etc. On the other hand, I wonder how informative forecast scores can be for identifying your model’s limitations in the early development stage. I could imagine that at this stage, forecasts might be pretty bad, but this could be due to various factors - and combinations of factors. How does this help you to get insights like “Oh, maybe we should weight site-level Rt variation by catchment size”, etc.? I guess I am just missing concrete examples of how forecast performance can highlight individual problem areas of a model as complex as yours.


Hi @adrianlison, thanks for the follow-up question!

On the other hand, I wonder how informative forecast scores can be for identifying your model’s limitations in the early development stage

This is a good point. Early in the development stage, I have found simulating from your generative model and fitting to the simulated data to be the most useful tool for model development. This is the approach I would take as a first pass for any “larger” (hard to define) scientific improvements, with the evaluation of new developments/features coming in as a secondary step. Because we don’t know how the data are truly generated, and the true process may actually be farther from the component you added, the simulation check is necessary but insufficient. That is where the evaluation piece can come in and give us confidence that the new component is not worsening the performance of the model, sort of akin to model or feature selection.
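(A toy illustration of the simulate-and-recover idea, using a simple negative binomial growth model in place of the full wastewater model; the data, model, and `MASS::glm.nb` fit below are stand-ins for illustration only.)

```r
# Simulate from a known generative model, refit it, and check that the
# true parameters are recovered (a stand-in for the full wastewater model).
library(MASS)

set.seed(123)
t <- 1:60
true_intercept <- 3
true_growth <- 0.05
mu <- exp(true_intercept + true_growth * t)
y <- rnbinom(length(t), mu = mu, size = 10)  # size = overdispersion parameter

fit <- glm.nb(y ~ t)

# Do the estimates and confidence intervals recover the known truth?
coef(fit)
confint(fit)
c(true_intercept = true_intercept, true_growth = true_growth)
```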

How does this help you to get insights like “Oh, maybe we should weight site-level Rt variation by catchment size”, etc.?

So in theory, before implementing this change, we should generate data from the model with the feature, fit both versions of our model (with and without the feature), and check that the model matching the known data-generating process “fits” better (either through more identifiable components or, in our case, by evaluating against simulated forecasts). We did do this to justify this change.

Additionally, we could have rerun all, or a subset of, the forecast dates for all locations and evaluated the forecast performance with and without the component, to ensure that adding it improved our overall performance. We did not have this infrastructure set up at the time of this change, but since then, when we have made changes (e.g. changing how we specified priors), we have benchmarked them against performance from a single forecast date. I think we should do better and benchmark across a subset of forecast dates and locations…
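(As a sketch of what benchmarking the two variants could look like with scoringutils, assuming forecasts are kept as posterior predictive samples in the scoringutils 2.0-style sample format; the `forecasts` data frame and its columns are hypothetical, and the exact column requirements should be checked against the scoringutils documentation.)

```r
# Hypothetical comparison of model variants (with/without the new component)
# across forecast dates and locations, assuming a data frame `forecasts` with
# columns: model, location, forecast_date, target_date, sample_id,
# predicted, observed.
library(scoringutils)

fc <- as_forecast_sample(forecasts)
scores <- score(fc)

# Overall comparison, plus a breakdown by forecast date to spot where one
# variant does better or worse
summarise_scores(scores, by = "model")
summarise_scores(scores, by = c("model", "forecast_date"))
```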

However, I don’t think the evaluation analysis on its own highlights problem areas, because it’s sort of a black box. I think you need to really dig into the evaluation to identify problem areas, for example what you mentioned in the meeting about the overprediction. Here is our QQ plot for both models: we are consistently overpredicting in both, which makes me think the issue is in one of the shared components of the model. One hypothesis is that it is the model for the time evolution of R(t), but I would want to test that by changing it and rerunning this analysis to see whether a different time evolution improves the model calibration.

[QQ plot of forecast calibration: ww + hosp model in green, hosp-only model in orange]
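(For anyone wanting to reproduce this kind of calibration check, a generic base-R sketch of randomised PIT values from posterior predictive samples; this is not the code behind the plot above.)

```r
# Randomised PIT for count forecasts: for each observation, where does it fall
# within its posterior predictive distribution? `samples` is an
# (n_obs x n_draws) matrix of predictive draws, `observed` a length-n_obs vector.
pit_values <- function(samples, observed) {
  p_lower <- rowMeans(samples < observed)   # P(X < y) under the predictive
  p_upper <- rowMeans(samples <= observed)  # P(X <= y) under the predictive
  runif(length(observed), min = p_lower, max = p_upper)
}

# Dummy data so the sketch runs end to end
samples <- matrix(rpois(200 * 1000, lambda = 20), nrow = 200)  # 200 obs x 1000 draws
observed <- rpois(200, lambda = 20)

# QQ plot against the uniform distribution. If the model systematically
# overpredicts, observations sit in the lower tail of the predictive
# distribution, PIT values pile up near 0, and the curve falls below the diagonal.
pit <- pit_values(samples, observed)
plot(ppoints(length(pit)), sort(pit),
     xlab = "theoretical quantile", ylab = "empirical PIT quantile")
abline(0, 1, lty = 2)
```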

Thank you for the detailed explanation @kejohnson9 :slight_smile:

I think you have highlighted an important point, namely that for forecast score-based evaluation to be helpful in model development, you need different versions of your model with certain components turned on or off. From the differences in performance between these model versions, you might be able to pinpoint issues with particular components. And of course I agree that more detailed performance evaluations across different dates/populations/epidemic phases can make that process easier/more informative.

One additional thought on evaluation based on simulated data: in the past I have had situations where I had a generative model in mind but implemented a simplified version of that model for inference (for example because the original model would have been impossible or really inefficient to sample from in stan). In these situations, I always found it reassuring to code up the original generative model in R, simulate from it and then do inference using the approximate model.
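(A toy version of that pattern, purely for illustration: simulate from a “full” model that includes a weekday reporting effect, then fit a deliberately simplified model that ignores it and check whether the growth rate is still recovered. This is generic R, not EpiSewer or epinowcast code.)

```r
# "Full" generative model: exponential growth, a multiplicative weekday
# reporting effect, and negative binomial noise.
set.seed(42)
t <- 1:70
weekday_effect <- rep(c(1.1, 1.0, 1.0, 1.0, 0.9, 0.7, 0.6), length.out = length(t))
true_growth <- 0.04
mu_full <- 50 * exp(true_growth * t) * weekday_effect
y <- rnbinom(length(t), mu = mu_full, size = 15)

# Simplified inference model: ignores the weekday effect entirely.
fit_simple <- MASS::glm.nb(y ~ t)

# Is the growth rate still recovered despite the known misspecification?
confint(fit_simple)["t", ]  # compare against true_growth = 0.04
```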


I think you have highlighted an important point, namely that for forecast score-based evaluation to be helpful in model development, you need different versions of your model with certain components turned on or off. […]

@adrianlison I wonder how related this is to sensitivity analysis in compartmental modelling and whether what you are describing has a name in this area of modelling.

Thanks for your fantastic talk yesterday @kejohnson9. I was particularly interested in one of your leading questions, i.e., “How do we streamline deploying scientific ideas/models into production and tool development?”

I think the forecasting/nowcasting community is evolving fast, and it’s important to act on this now rather than later. On the call, I asked if we could employ the tidymodels approach, where the community has developed guidelines for developing models. Some interface/engine/output elements are shared by most of the packages in this ecosystem, so I think it might be more straightforward to streamline these. For example, most of the packages use stan directly or indirectly (epinowcast, EpiNow2, EpiSewer, epidist, etc.). The workflow is often similar: you nowcast or forecast a time series of cases and evaluate with a package like scoringutils, which expects the inputs to be structured in a certain way. Most of the packages expect you to define priors or distributions. If these standards or guidelines are not established now, we may end up with a fragmented ecosystem like the wider epidemiology ecosystem. What are folks’ thoughts on achieving this in this child ecosystem? Would a set of guidelines like that of tidymodels be welcomed, and what would it take to get there?

This has also been something I find necessary to reassure myself that the components being added (1) produce reasonable results and (2) have inference that works as expected, both in an isolated example of just the particular component and when embedded in the more complex model.

On the call, I asked if we could employ the tidymodels approach where the community has developed guidelines for developing models.

I think this is a really great point, and I completely agree with your assessment that taking the time now to streamline these things would have a huge impact going forward. As someone relatively new to this, I would say it would also be really helpful because it would remove some of the decisions about interfaces and the formatting of inputs and outputs that we face as developers.

Just so I make sure I understand what you’re suggesting: would the idea be that we, as a relatively small community building tools aimed at solving similar problems and producing similar types of outputs, coalesce on standardized ways of doing things like specifying priors, specifying delay distributions, specifying input data, and formatting outputs (e.g. the main wrapper function would return the same set of elements, with similar-looking tooling downstream to extract posterior draws, join with data, run diagnostics, and score results)?

I think this would be fantastic. I am still in the process of making some of these decisions for the package I am building and had a discussion with @seabbs on Tuesday about the best way to streamline the outputs, so that, for example, an entirely different model could still return the same elements and just have different post-processing to extract different variables.
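(To make the “same set of elements” idea concrete, here is a hypothetical constructor for a standardized fit object; the element and class names are made up, not an agreed convention.)

```r
# Hypothetical standardized return value for a wrapper fitting function.
# If every package returned the same top-level elements, downstream tooling
# (diagnostics, scoring, plotting) would only need to know this one structure.
new_epi_fit <- function(fit, stan_args, data, dates, locations) {
  structure(
    list(
      fit       = fit,        # the fitted CmdStanR/rstan object
      stan_args = stan_args,  # data, inits, and sampler settings passed to stan
      data      = data,       # the (pre-processed) input data
      metadata  = list(       # mapping from stan indices back to real-world units
        dates     = dates,
        locations = locations
      )
    ),
    class = "epi_fit"         # a shared class so generic methods can dispatch
  )
}
```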

Yes, exactly. For example, there is a similar conversation happening in EpiNow2 about restructuring the output class and returning the summarised output in a format that is ready for use with scoringutils. Here, scoringutils could dictate the structure of outputs, but we must be wary of tightly coupling these tools instead of making them interoperable. See a similar restructuring exercise in serofoi.

Other standardisations could include function and input naming conventions, etc.


Sam had sent me this issue a few days ago, and this is exactly what we’re trying to decide now. See issue here.

Have you all landed on a set of outputs coming from a wrapper fitting function? I really liked the idea of passing back the stan_args, but I also think we need to pass back the input data with the correct metadata mapped to it (e.g. the dates).

I would also be interested in making the summarized output in a format readily usable for scoringutils, we haven’t quite gotten to any evaluation modules in the package yet but intend to.


Have you all landed on a set of outputs coming from a wrapper fitting function? I really liked the idea of passing back the stan_args, but I also think we need to pass back the input data with the correct metadata mapped to it (e.g. the dates).

I’m not sure we’ve agreed yet :sweat_smile:.

I would also be interested in making the summarized output in a format readily usable for scoringutils, we haven’t quite gotten to any evaluation modules in the package yet but intend to.

I think that would be quite convenient as users wouldn’t have to wrangle further. Your package could have a custom as.forecast method as suggested by Sam.

Here is the EpiNow2 issue (Return forecasts for easy processing with `scoringutils` · Issue #618 · epiforecasts/EpiNow2 · GitHub) and here (Update scoringutils integration to offer a as_forecast_samples.epinowcast method · Issue #455 · epinowcast/epinowcast · GitHub) is the epinowcast issue.

In the new scoringutils 2.0.0 we would just need an as_forecast_sample method, and then you can map to all the other data formats you might want (i.e. quantiles - I know, nice work @nikosbosse).
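(A hedged sketch of what such a converter could look like for a hypothetical fit object; only the scoringutils::as_forecast_sample() call is real, the fit object structure is invented, and the required columns should be double-checked against the scoringutils 2.0.0 documentation.)

```r
# Hypothetical converter from a package fit object to the scoringutils sample
# format. Assumes `fit$draws` holds posterior predictive draws in long format
# (location, target_date, sample_id, predicted) and `observed` holds the truth
# data (location, target_date, observed).
to_forecast_sample <- function(fit, observed) {
  df <- merge(fit$draws, observed, by = c("location", "target_date"))
  df$model <- fit$metadata$model_name
  scoringutils::as_forecast_sample(df)
}
```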

oops looks like those links were already above

I think some of the really obvious wins for sharing tools are:

  • Specifying priors and making it easy to pass them to stan in some generalised way. I think brms functionality could be adapted here (see the sketch after this list).
  • Sharing a formula interface. My hope here had again been to leverage brms or similar, but I couldn’t make it work. epinowcast currently has a custom interface, and either spinning that out or making a new version that is easy to pass to stan would really help.
  • Delay distribution handling, from estimation through to use in models (for example, correct discretisation). @athowes has done a lot of work on the flexible estimation side (extending brms), and epinowcast has some custom stuff for discretising, but that could be improved and generalised.
  • Better integration with tidybayes would also, I think, help a lot for many of these tools (again @athowes is exploring this in epidist).
  • I think it would be quite easy to create a shared library of stan functions where people depend on them using git submodules. It’s not ideal, but I think it would work. A more complex version would transpile this out to C++, which could then be more easily distributed and integrated into stan.
  • Pre- and post-processing data. Again, in epinowcast we found a lot of gotchas in processing real-time data that I think are often missed.
  • Visualisation. With the new forecast_samples class in scoringutils, I think a really nice model plotting package could be made. Potentially an Rt-specific version could also be made.
  • @pearsonca has a few projects like coerceDT that aim to make some of the things we commonly do (here, verify data.table inputs) easier and more robust.
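(On the prior specification point above: a minimal, hypothetical helper that maps a user-facing prior definition to the flat list of arguments a stan data block might receive. Nothing here is brms or epinowcast code.)

```r
# Hypothetical helper: turn user-facing prior specifications into the flat,
# named list of arguments a stan model's data block would receive.
priors_to_stan_data <- function(priors) {
  out <- list()
  for (name in names(priors)) {
    p <- priors[[name]]
    stopifnot(p$dist %in% c("normal", "lognormal"))  # location-scale families only
    out[[paste0(name, "_prior_mean")]] <- p$mean
    out[[paste0(name, "_prior_sd")]]   <- p$sd
  }
  out
}

# Example of the kind of shared convention every package could consume
priors <- list(
  rt_init  = list(dist = "normal",    mean = 1,   sd = 0.3),
  gen_time = list(dist = "lognormal", mean = 1.2, sd = 0.2)
)
priors_to_stan_data(priors)
```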

There are also lots of others that could be done without needing to learn new things or do massive engineering.

I am currently working (with @sambrand) on a new Julia ecosystem (EpiAware.jl: Real-time infectious disease monitoring · EpiAware.jl - excuse the LLM text in some of the docs) where I think making these interactive modules should be a lot easier. That is still in the design stages though, and obviously Julia doesn’t have the uptake or dependability that R and stan have.


@seabbs I am curious which of these seem the highest priority/lowest lift and could represent a proof of concept/test of the added value (or if you think that’s silly and we have to do all of them at once).

For example, my instinct is that a shared interface for the inputs (so prior specification, delay distribution handling, and preprocessing) would be extremely helpful for evaluation/identifying which models work best in which contexts all else remaining equal.

@jamesazam will be following the issue to see what you all decide on!


Link to slides


Definitely don’t need to do them all at once (and I think by its nature the modular approach is the way to go).

I think the formula interface that epinowcast uses could be spun out, and I imagine it could then be used (maybe after a refactor to improve it) in, say, EpiSewer fairly easily (@adrianlison ?). The only blocker is whether it makes sense to spin it out or whether something like brms could actually be used (me not managing it doesn’t mean it isn’t very doable).
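(As a rough illustration of what “easy to pass to stan” could mean for a formula interface, a generic base-R sketch that turns a formula plus data into a design matrix and the corresponding stan data entries; this is not the epinowcast interface, and it handles fixed effects only.)

```r
# Generic sketch: map a formula + data to the objects a stan data block could
# consume. Fixed effects only, for brevity.
formula_to_stan_data <- function(formula, data) {
  X <- model.matrix(formula, data)
  list(
    X          = X,            # design matrix
    n_obs      = nrow(X),
    n_coef     = ncol(X),
    coef_names = colnames(X)   # kept so posterior draws can be relabelled later
  )
}

# Example: day-of-week effect on top of a site-level intercept
dat <- data.frame(
  site    = factor(rep(c("A", "B"), each = 7)),
  weekday = factor(rep(1:7, times = 2))
)
str(formula_to_stan_data(~ site + weekday, dat))
```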

I also think the shared stan library is very doable with some effort and thought, as long as people are willing to take the submodule dependency approach.

The work on distribution specification that @sbfnk and @jamesazam have been doing also seems like it could fairly easily be spun out and usefully used elsewhere in the very near term.

I think all of these things are mostly about resources and will, as none of them are on the shortest path to any particular outcome, and it has proved extremely difficult to convince funders etc. that they are important.


What would be a real game changer for me is to have a domain-specific (but not inference-tool-specific) language for the kinds of semi-mechanistic generative epi models we use, plus a package to represent this in a convenient data structure in R. This is because I expect R to remain popular in epi research and public health for quite some time, while probabilistic programming is evolving fast and I don’t know how long people will still like stan.

If there were a ppl-agnostic structure to represent the different models in epinowcast, EpiNow2, EpiSewer, epidist, etc. (and I think that @samabbott and @sambrand have already laid the conceptual groundwork for this in EpiAware, although in Julia and maybe still a bit tightly coupled with Turing), and a package with helper functions to produce the corresponding stan args (including inits and priors, as mentioned above) that follow some clear convention, then as a first step we would “only” have to

  • adjust our stan models to receive arguments for the “data” block from this representation
  • adjust the interface functions of our R packages to define the inputs for stan via this ppl-agnostic representation (instead of directly specifying stan arguments as we now do most of the time)

This would not necessarily require a shared stan library as it leaves the ppl implementation of the models completely open (it remains the responsibility of the developer to implement the model correctly and to reject representations that your stan model cannot implement).
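(A hypothetical sketch of such a ppl-agnostic representation and a per-package mapping to stan arguments; every name here is invented, and this is not EpiAware, epinowcast, or EpiSewer code.)

```r
# A ppl-agnostic model representation: a plain R list describing the model
# components, with no reference to any particular inference tool.
model_spec <- list(
  latent      = list(type = "random_walk_rt", prior_sd = 0.1),
  delay       = list(type = "lognormal", meanlog = 1.6, sdlog = 0.6, max = 30),
  observation = list(type = "negative_binomial")
)

# Each package supplies its own mapping from the shared representation to the
# arguments its stan model's data block expects, and rejects what it cannot do.
spec_to_stan_args <- function(spec) {
  if (spec$latent$type != "random_walk_rt") {
    stop("this package's stan model only supports a random-walk R(t)")
  }
  list(
    rt_sd_prior = spec$latent$prior_sd,
    # crude discretisation of the delay distribution, for illustration only
    delay_pmf   = diff(plnorm(0:spec$delay$max,
                              spec$delay$meanlog, spec$delay$sdlog)),
    obs_family  = 1L  # 1 = negative binomial in this hypothetical stan model
  )
}

str(spec_to_stan_args(model_spec))
```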

The immediate advantage of this would be cleaner interface functions in R with a lower entry barrier for new contributors - right now, I can at least say for epinowcast and EpiSewer that IMO you need quite an in-depth understanding of the respective stan model (including variable names!) to work on the interface functions… which is really bad.

The more long-term advantage of this would be that it could pave the way for transitioning to other ppls (or offering several backends) while still offering our tools in R. I guess that in Julia you could ideally take such a representation and directly construct the respective EpiAware model at runtime. And if someone wants to try to build that with stan/brms, we won’t stop them :smiley:


@jamesazam Yes, this is definitely related but not quite the same thing, I would say. I am thinking of situations where you might have no particular assumption in mind that you want to test sensitivity to, but you get unsatisfying forecast performance and now want to find out what part of your model could be improved. My question was about the process of identifying the source of problems and to what extent forecast scores can help here - beyond telling us that “something is off”.


Agree this is a big issue, though I am not sure how your suggestion helps lower the barrier to entry, as new contributors would still not be able to add model features (they would just be adding UI for those features?).

Yes, exactly. I think this would be great if someone did it, but as I say below, it is so much effort.

I agree a brms-like model generator could be the way to go, but to be honest I am not sure a PPL-agnostic front-end is less work/learning than just making people learn a new language (i.e. our Julia project :wink: ).

Do you have an example from other fields of the kind of domain specific tool you are thinking about existing?

In my head, what you are proposing is an R-specific PPL that then under the hood maps to other PPLs and contains epi-specific functionality. That sounds very, very hard and high effort to me? What happens if someone just does this mapping from, say, stan to whatever the new hotness is?

I think the version of this that remains focussed on stan (maybe only for now) is very doable though.

(it remains the responsibility of the developer to implement the model correctly and to reject representations that your stan model cannot implement).

I think this is a separate issue and highlights that, in my view, most of the problem lies in developer time/skill, as people (myself included) don’t implement the details correctly. If we had a really nice user interface across lots of still-wrong packages, that seems like a bad use of effort?

Given that most/all of the currently available tools have fundamental flaws in their infra, it seems like we should fix those (ideally by pooling resources) before we worry hugely about an extremely consistent UI. Especially as currently many/most users are relatively specialised/motivated and so can navigate maybe-clunky UIs?

I moved the above discussion about streamlining epi modeling tools to a separate thread; please see and post further comments here.
