Streamlining of epi modeling tools

The following posts were spun out of the discussion on Community Seminar 2024-08-07 - Kaitlyn Johnson - Wastewater modeling to forecast hospital admissions in the US: Challenges and opportunities into a separate discussion thread (see below).

@jamesazam wrote:

Have you all landed on a set of outputs coming from a wrapper fitting function? I really liked the idea of passing back the stan_args, but I also think we need to pass back the input data with the correct metadata mapped to it (e.g. the dates).

@kejohnson9 replied:

I’m not sure we’ve agreed yet :sweat_smile:.

I would also be interested in making the summarized output readily usable by scoringutils; we haven’t quite gotten to any evaluation modules in the package yet but intend to.

@jamesazam replied:

I think that would be quite convenient, as users wouldn’t have to wrangle further. Your package could have a custom as.forecast method, as suggested by Sam.
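For illustration, a minimal sketch of what that hand-off could look like, assuming the as_forecast_sample() constructor from recent scoringutils versions (the data and column names here are made up for the example):

```r
library(scoringutils)
library(data.table)

# Toy long-format posterior predictions, shaped the way a wrapper
# fitting function might return them (values entirely illustrative)
draws <- data.table(
  date = rep(as.Date("2024-08-01") + 0:1, each = 2),
  observed = rep(c(105, 98), each = 2),
  predicted = c(101, 110, 95, 102),
  sample_id = rep(1:2, times = 2),
  model = "ww_hosp_model"
)

# Convert to scoringutils' sample-based forecast class and score it;
# a package-level as.forecast method would wrap exactly this step
forecast <- as_forecast_sample(draws, forecast_unit = c("date", "model"))
score(forecast)
```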

@samabbott replied:

I think some of the really obvious wins for sharing tools are:

  • Specifying priors and making them easy to pass to stan in some generalised way. I think brms functionality could be adapted here (see the first sketch after this list for one possible shape).
  • Sharing a formula interface. My hope here had again been to leverage brms or similar, but I couldn’t make it work. epinowcast currently has a custom interface, and either spinning that out or making a new version that is easy to pass to stan would really help.
  • Delay distribution handling, from estimation through to downstream use (for example, correct discretisation; see the second sketch after this list). @athowes has done a lot of work on the flexible estimation side (extending brms), and epinowcast has some custom code for discretising, but again that could be improved and generalised.
  • Better integration with tidybayes would also, I think, help with many of these tools (again, @athowes is exploring this in epidist).
  • I think it would be quite easy to create a shared library of stan functions where people depend on them via git submodules. It’s not ideal, but I think it would work. A more complex version would transpile this out to C++, which could then be more easily distributed and integrated into stan.
  • Pre- and post-processing data. Again, in epinowcast we found a lot of gotchas in processing real-time data that I think are often missed.
  • Visualisation. With the new forecast_samples class in scoringutils I think a really nice model plotting package could be made. Potentially an Rt-specific version could also be made.
  • @pearsonca has a few projects, like coerceDT, that aim to make some of the things we commonly do (here, verifying data.table inputs) easier and more robust.
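On the priors point, a minimal sketch (all function and argument names hypothetical) of what a generalised specification that flattens into stan data arguments could look like:

```r
# Represent a prior as a small S3 object that any package could
# translate into entries of its stan data list (names hypothetical)
new_prior <- function(parameter, family, ...) {
  structure(
    list(parameter = parameter, family = family, args = c(...)),
    class = "epi_prior"
  )
}

# Flatten a list of priors into named stan data arguments, e.g. a
# normal(0, 0.2) prior on r becomes r_prior_mean and r_prior_sd
priors_to_stan <- function(priors) {
  out <- list()
  for (p in priors) {
    names(p$args) <- paste(p$parameter, "prior", names(p$args), sep = "_")
    out <- c(out, as.list(p$args))
  }
  out
}

priors_to_stan(list(new_prior("r", "normal", mean = 0, sd = 0.2)))
#> $r_prior_mean
#> [1] 0
#> $r_prior_sd
#> [1] 0.2
```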
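And on delay discretisation, a deliberately naive sketch (function name hypothetical) that differences the CDF to get a daily PMF. Note it ignores double censoring of the primary event time, which is exactly the kind of detail that is easy to get wrong and worth generalising properly:

```r
# Discretise a continuous lognormal delay to a daily PMF by
# differencing the CDF (ignores double censoring; illustrative only)
discretise_delay <- function(meanlog, sdlog, max_delay) {
  cdf <- plnorm(0:(max_delay + 1), meanlog = meanlog, sdlog = sdlog)
  pmf <- diff(cdf) # P(delay in [i, i + 1)) for i = 0..max_delay
  pmf / sum(pmf)   # renormalise so the truncated PMF sums to one
}

round(discretise_delay(meanlog = 1.5, sdlog = 0.5, max_delay = 10), 3)
```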

There are also lots of other things that could be done without needing to learn new things or massively re-engineer.

I am currently working (with @sambrand) on a new Julia ecosystem (EpiAware.jl: Real-time infectious disease monitoring · EpiAware.jl - excuse the LLM text in some of the docs) where I think making these kinds of modules should be a lot easier. That is still in the design stages though, and obviously Julia doesn’t have the uptake or dependability that R and stan have.

@kejohnson9 replied:

@samabbott replied:

Definitely don’t need to do them all at once (and I think by its nature the modular approach is the way to go).

I think the formula interface that epinowcast uses could be spun out, and I imagine it could then be used (maybe after a refactor to improve it) in, say, EpiSewer fairly easily (@adrianlison ?). The only blocker is whether it makes sense to spin it out or whether something like brms could actually be used (me not managing it doesn’t mean it isn’t very doable).
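To make the spin-out idea concrete, a small sketch using epinowcast’s existing formula interface (assuming enw_formula() with its current signature; check the package docs before relying on this):

```r
library(epinowcast)

# Toy metadata with a day-of-week grouping variable
meta <- data.frame(
  date = as.Date("2024-08-01") + 0:13,
  day_of_week = factor(weekdays(as.Date("2024-08-01") + 0:13))
)

# epinowcast's formula interface parses fixed and random effects into
# design matrices that can be passed to stan; spun out, other packages
# could reuse exactly this step
f <- enw_formula(~ 1 + (1 | day_of_week), data = meta)
str(f, max.level = 1)
```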

I also think the shared stan library is very doable with some effort and thought, as long as people are willing to take the submodule dependency approach.
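As a sketch of how that could work in practice (paths and file names are illustrative; the shell and stan steps are shown as comments):

```r
# A shared repo of stan functions is vendored into each package as a
# git submodule, e.g. (shell, run once):
#   git submodule add <shared-stan-repo-url> inst/stan/functions
#
# Inside a package's stan model, the shared code is pulled in with
# stan's #include mechanism:
#   functions {
#     #include delay_discretisation.stan  // file name illustrative
#   }
#
# The model is then compiled in R by pointing cmdstanr at the
# submodule directory via include_paths
library(cmdstanr)
mod <- cmdstan_model(
  "inst/stan/model.stan",
  include_paths = "inst/stan/functions"
)
```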

The work on distribution specification that @sbfnk and @jamesazam have been doing also seems like it could fairly easily be spun out and usefully used elsewhere in the very near term.

I think all of these things are mostly about resources and will, as none of them are on the shortest path to any particular outcome, and it has proved extremely difficult to convince funders etc. that they are important.

@adrianlison replied:

What would be a real game changer for me is a domain-specific (but not inference-tool-specific) language for the kinds of semi-mechanistic generative epi models we use, plus a package to represent this in a convenient data structure in R. This is because I expect R to remain popular in epi research and public health for quite some time, while probabilistic programming is evolving fast and I don’t know how long people will still like stan.

If there were a ppl-agnostic structure to represent the different models in epinowcast, EpiNow2, EpiSewer, epidist, etc. (I think @samabbott and @sambrand have already laid the conceptual groundwork for this in EpiAware, although in Julia and maybe still a bit tightly coupled to Turing), and a package with helper functions to produce the corresponding stan_args (including inits and priors, as mentioned above) following some clear convention, then as a first step we would “only” have to

  • adjust our stan models to receive arguments for the “data” block from this representation
  • adjust the interface functions of our R packages to define the inputs for stan via this ppl-agnostic representation (instead of directly specifying stan arguments, as we mostly do now)

This would not necessarily require a shared stan library as it leaves the ppl implementation of the models completely open (it remains the responsibility of the developer to implement the model correctly and to reject representations that your stan model cannot implement).

The immediate advantage would be cleaner interface functions in R with a lower entry barrier for new contributors. Right now, I can at least say for epinowcast and EpiSewer that IMO you need quite an in-depth understanding of the respective stan model (including variable names!) to work on the interface functions… which is really bad.

The more long-term advantage is that this could pave the way for transitioning to other ppls (or offering several backends) while still offering our tools in R. I guess that in Julia you could ideally take such a representation and directly construct the respective EpiAware model at runtime. And if someone wants to try to build that with stan/brms, we won’t stop them :smiley:

@samabbott replied:

Agree this is a big issue, though I am not sure how your suggestion lowers the barriers to entry, as new contributors would still not be able to add model features (they would just be adding UI for those features?).

Yes, exactly. I think this would be great if someone did it, but (see below) it would be so much effort.

I agree a brms-like model generator could be the way to go, but to be honest I am not sure a PPL-agnostic front-end is less work/learning than just making people learn a new language (i.e. our Julia project :wink:).

Do you have an example from another field where the kind of domain-specific tool you are thinking of already exists?

In my head, what you are proposing is an R-specific PPL that under the hood maps to other PPLs and contains epi-specific functionality. That sounds very, very hard and high-effort to me? What happens if someone just does this mapping from, say, stan to whatever the new hotness is?

I think the version of this that remains focussed on stan (maybe only for now) is very doable though.

> (it remains the responsibility of the developer to implement the model correctly and to reject representations that your stan model cannot implement)

I think this is a separate issue; it highlights that most of the problem lies in developer time/skill, as people (myself included) don’t implement the details correctly. If we had a really nice user interface across lots of still-wrong packages, that seems like a bad use of effort?

Given that most/all of the currently available tools have fundamental flaws in their infra, it seems like we should fix those (ideally by pooling resources) before we worry hugely about an extremely consistent UI. Especially as many/most current users are relatively specialised/motivated and so can navigate maybe-clunky UIs?

@adrianlison replied:

Thanks @samabbott, I agree that building a new PPL in R would be both very difficult and probably a waste of resources.

My original thought was not about having a complete PPL (although my remark about Julia seems to suggest this), but simply about having a structure in R to represent the inputs for the inference tool (in stan, this would be data and inits) that reflects the components of our models in a streamlined way. So I’m basically thinking of

tool-specific UI → intermediate data structure → flat list of stan arguments → stan

We partly have this “intermediate data structure” implicitly in our tools already (e.g. the modules in epinowcast), but I was thinking that if there were an explicit and standardized structure, it could be

  1. easier to handle (e.g. there could be options to print the inputs in a conceptually meaningful way, and one would better know what parts of the data must be modified when updating a UI function)
  2. easier to map these inputs out to inference tools other than stan, i.e. requiring less refactoring of the tool-specific interface

For an example, think of a standardized structure in R representing a renewal process, with certain agreed-upon attributes that map to variable names in your generative model, and nested attributes for non-parametric smoothing of the growth rate, for a seeding process, and for the things suggested above (prior specification, formulas, etc.). However, this would not have to contain all the details of the modeled likelihood! A rough sketch of what this could look like is below.
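As a minimal sketch (all names and components hypothetical, loosely following the pipeline above):

```r
# Hypothetical ppl-agnostic representation of a renewal model; the
# structure captures the model components without fixing the details
# of the likelihood
renewal_spec <- list(
  process = list(
    type = "renewal",
    generation_interval = c(0.25, 0.5, 0.25),
    growth_smoothing = list(type = "random_walk", sd_prior = c(0, 0.1))
  ),
  seeding = list(type = "exponential_growth", length = 7),
  observations = list(family = "negative_binomial")
)

# Each tool supplies its own translator from the shared representation
# to the flat argument list its stan model's data block expects, and
# rejects components its model cannot implement
spec_to_stan_args <- function(spec) {
  stopifnot(spec$process$type == "renewal")
  list(
    gt_max = length(spec$process$generation_interval),
    gt_pmf = spec$process$generation_interval,
    seed_length = spec$seeding$length
    # ... further tool-specific mappings
  )
}

stan_args <- spec_to_stan_args(renewal_spec)
```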

I guess my main point is that establishing something like the above would be roughly comparable, in terms of effort and complexity, to building a shared stan library, but potentially more useful in the long term…