Which paper would you use? Code Availability and Different Standards for Public Health Research

Just been reading a nice paper from Thomas Ward, @OvertonC and co on IFR estimation which seems to do a really nice job: The real-time infection hospitalisation and fatality risk across the COVID-19 pandemic in England | Nature Communications

Driven by this thread: @michaelplanknz.bsky.social on Bluesky

The Code Problem

From the methods etc, it looks like a good job, and obviously, they usually do good work. However, when I started to properly engage, I got to: “The model code can be made available on request to DataAccess@ukhsa.gov.uk.”

(I have emailed but usually these processes give you an endless run around just due to the nature of what they are and the more serious data they are usually trying to keep confidential).

Turns out reviewers weren’t given code either. From the review response: “There was no code to review. The data are held in safe havens for another researcher would have to apply to run the analyses in these safe havens also. Making the Stan code available for other researchers would probably be a good idea but there are sufficient…”

So no one has seen the code, but this reviewer thinks that the methods are “sufficient”?

Most of the time if I hit this in a paper, I stop reading it and make it a point of principle to avoid citing it or recommending it to others.

Had an exchange with Thomas House (@tah-sci.com) who pointed out the UKHSA group is under “correctly very strict data security.” Fair for data but Stan code is not data. UKHSA usually shares code on GitHub so not sure why not here?

Looking for Alternatives

In this instance, I went looking for other sources for that discussion and found: Dynamics of SARS-CoV-2 infection hospitalisation and infection fatality ratios over 23 months in England

Which is similar but limited to REACT. Not clear why the Ward paper doesn’t cite this, given the overlap? The reviewer reports said they didn’t find any papers with overlap, which seems odd. Maybe it’s not as close as I think from skimming.

Digging into the Methods

From skimming both, I think I marginally prefer the Ward paper as I don’t love the REACT spline approach more generally, and the Ward paper looks to do a nice data integration job. UKHSA usually have all the best data as well so I would also be biased in that direction for that reason.

But trying to understand what’s actually going on is hard. Reviewer 2 asked for a methods diagram - the response says they added one but I can’t find it in the manuscript. Went to check the SI (https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-024-47199-3/MediaObjects/41467_2024_47199_MOESM1_ESM.pdf) and the methods overview diagram is Figure 48! It’s flagged at the head of the methods (“For each study (Supplementary Fig. 48) we describe the modelling to calculate…”), but how many people are going on the journey needed to find it?

This got me looking at the SI in more detail. No more methods here just lots of figures (I think a much less text-heavy bit of maths would help me). The credible intervals on PCR positive in SI Fig 1 are very, very tight - wonder what’s going on there. Looking back at the methods, I am none the wiser, and I don’t think I have the tools from the text to find out.

I like the main figure and SI results, where they show the combined and ONS vs REACT estimates. These could be useful for understanding data integration issues/conflict but I don’t find this in the paper being discussed (missing it?).

(It’s a Monday, so maybe I’m just grumpy)

No standard Bayesian workflow stuff about prior or posterior predictive checks that I can see. Nothing about data integration issues or approximation issues from passing data between models. So from that, and with no code, it’s hard to know if the model works imo.
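To be concrete about the kind of check I mean, even something this minimal would help (an illustrative R sketch with made-up placeholder data and names, nothing from the paper):

```r
# Minimal posterior predictive check sketch (illustrative only, not the
# authors' code). deaths_obs and deaths_rep are placeholder names: observed
# weekly deaths and a draws x weeks matrix of posterior predictive simulations
# from whatever fitted model you have.
set.seed(1)
deaths_obs <- rpois(52, lambda = 100)                     # placeholder "data"
deaths_rep <- t(replicate(500, rpois(52, lambda = 100)))  # placeholder draws

# Compare a simple summary statistic between the data and the replicates
obs_stat <- max(deaths_obs)
rep_stat <- apply(deaths_rep, 1, max)
mean(rep_stat >= obs_stat)  # posterior predictive p-value for the maximum

# With a real fit, bayesplot::ppc_dens_overlay(deaths_obs, deaths_rep) gives
# the usual visual overlay check.
```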

I think as a field we do a really, really bad job at this (i.e. this paper is doing quite a good job relatively). We don’t really have standards or guidelines and reviewers/journals usually don’t really care. For this reason I am currently writing a bit of a checklist paper for data integration that aims to help standardise these things.

Thoughts from diving in

My main concern would be the time of testing positive being approximated by symptom onset (so the delay from infection to testing positive is effectively sidestepped), and nothing being done about this or it being explored anywhere. None of the reviews flag this that I can see - the REACT paper also does this I think, which isn’t great for these kinds of papers.

This statement seems wrong no matter how I read it: “assumption that time of symptom onset approximates the time at which the case becomes positive. While it would be possible to not make this assumption and use an interval censoring model… due to the size of the intervals relative to the delay, the uncertainty on any estimates produced using this approach would be too large and would consequently degrade results.”

If this uncertainty exists, it means your time-varying parameters are way too precise, i.e. they can’t actually be resolved at the precision reported. “Degrade results” equals make realistic? I don’t see any reviewer comments about this, so perhaps I am misunderstanding something.
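To be concrete about the alternative they dismiss: an interval-censored contribution just swaps a point density for a probability over the interval, which widens the posteriors rather than “degrading” them. A minimal sketch (assuming a lognormal delay, which is my choice for illustration, not the paper's):

```r
# Interval-censored likelihood contribution (my reconstruction, assuming a
# lognormal delay; not the paper's model). If a delay T is only known to lie
# in (lower, upper], its contribution is F(upper) - F(lower) rather than the
# density at a single assumed point.
interval_censored_loglik <- function(lower, upper, meanlog, sdlog) {
  log(plnorm(upper, meanlog, sdlog) - plnorm(lower, meanlog, sdlog))
}

# Example: a delay known only to fall somewhere between 1 and 4 days
interval_censored_loglik(lower = 1, upper = 4, meanlog = log(2), sdlog = 0.5)
```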

We do start from infection time here and it seems fine: Combined analyses of within-host SARS-CoV-2 viral kinetics and information on past exposures to the virus in a human cohort identifies intrinsic differences of Omicron and Delta variants

This all means the estimates are effectively time-varying IFRs by onset time, I think (“we shift the testing dates to symptom onset date rather than infection date, since we have more reliable data on the delay distributions post symptom onset date”), which I don’t think is clear in the limitations and which introduces some biases that are hard to fully understand.

Note: I have problems here as well as we use a rubbish prior for our infection times. I think all these models need to have some kind of epidemic growth rate model under the hood informing these priors.
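Roughly what I mean (an illustrative sketch with a made-up growth rate, not any particular paper's model): instead of a uniform prior over the exposure window, weight candidate infection days by the epidemic growth rate.

```r
# Illustrative sketch of a growth-informed infection-time prior (made-up
# growth rate, not any particular paper's model). Within an exposure window,
# weight candidate infection days by exp(r * t) rather than treating them as
# equally likely.
window_days <- 0:6  # candidate infection days within the window
r <- 0.1            # assumed daily epidemic growth rate (illustrative)

prior_uniform <- rep(1 / length(window_days), length(window_days))
prior_growth  <- exp(r * window_days) / sum(exp(r * window_days))

round(rbind(uniform = prior_uniform, growth_informed = prior_growth), 3)
```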

Equation 16, where incidence is backed out from prevalence by dividing by the expected duration positive, the length of the round and the pop size, worries me. It looks like back-sampling of infection times which, as we found to our cost, is not the right way to do things: Practical considerations for measuring the effective reproductive number, Rt
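For reference, my reading of the general logic (this is my reconstruction with illustrative numbers, not the paper’s exact Eq. (16)):

```r
# My reconstruction of the general prevalence-to-incidence logic (illustrative
# numbers; not the paper's exact Eq. 16). Under roughly constant incidence,
# prevalence ~ incidence rate x mean duration of positivity, so:
prevalence    <- 0.02   # proportion test-positive during the round
mean_duration <- 10     # mean days spent test-positive
round_days    <- 14     # length of the survey round in days
pop           <- 56e6   # population size

incidence_rate <- prevalence / mean_duration         # new infections per person per day
new_infections <- incidence_rate * round_days * pop  # implied infections over the round
new_infections
```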

I think this also means each round is treated as independent of the other rounds, which again seems like a problematic assumption that leads to some knock-on impacts in the results (I think they try to get around this in a later part of their pipeline by smoothing over the IFR estimates?).

I think you want a forward model of incidence here, which, who knows, might actually be what’s in the code (i.e. not in the equation above but in a different part of the submodel, but same difference).
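For what I mean by a forward model (a generic renewal-equation sketch with made-up numbers, not a claim about what is in their code):

```r
# Generic renewal-equation sketch of a forward incidence model (made-up
# numbers; not a claim about what is in their code). Infections today depend
# on recent infections weighted by the generation interval and scaled by Rt;
# prevalence would then follow by convolving incidence with the duration of
# positivity distribution.
n_days  <- 60
gen_pmf <- dgamma(1:14, shape = 4, rate = 0.8)
gen_pmf <- gen_pmf / sum(gen_pmf)           # discrete generation interval
Rt <- rep(1.2, n_days)                      # illustrative flat Rt
I  <- numeric(n_days)
I[1:length(gen_pmf)] <- 100                 # seed the first two weeks
for (t in (length(gen_pmf) + 1):n_days) {
  I[t] <- Rt[t] * sum(I[t - seq_along(gen_pmf)] * gen_pmf)
}
round(I[seq(15, 60, by = 5)])
```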

The discussion says the issue is that the growth rate within a round is ignored - how much this matters depends on the round length. I would worry a lot about this if a round was more than, say, a week and I was going to use the time-varying parameters at a resolution finer than, say, a month. As a modeller, I would find this hard to be chill about myself regardless of the intervals.
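Some rough arithmetic on why the round length matters (my numbers, purely illustrative):

```r
# Quick arithmetic on why round length matters (illustrative only): with
# within-round exponential growth at rate r per day, prevalence at the end of
# a round differs from the start by a factor of exp(r * round_days).
r <- 0.07                      # illustrative daily growth rate
round_days <- c(7, 14, 28)
round(exp(r * round_days), 2)  # fold change in prevalence across the round
```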

I will probably wait for code before I look at this more, as from the methods narrative I am struggling to fully get to grips with what they did, the assumptions that were made, and when those matter. This probably isn’t a great sign: despite just generally vibing my way through work, I have written papers about some quite similar models, so I would imagine that if I am not following, probably only a fairly narrow group would do so without engaging more than I can on a Monday full of deadlines.

Different Standards?

So I would probably use the REACT results for my work. If this wasn’t from a public health agency, I wouldn’t even consider using the estimates due to the lack of code and would engage with it in only a very limited way.

What would others do?

This got me thinking more generally about the standards I hold public health teams to vs academics and it’s definitely different. I am more willing to allow less code, less likely to challenge method issues etc.

I would also usually consider Nature Communications as basically predatory and so be less likely to engage again for that reason (not a fan that we are spending public money on these publishing charges, folks - I hope you get them waived).

I wonder why I think this though, as often the public health teams will have more senior people writing papers (seems to be the case at the moment, which imo is a good thing but also means standards should be higher?) and more access to the data, so less need to make approximations. It’s also harder to critically assess, as often they are making decisions based on context others don’t have, i.e. that is internal. Again I usually give this a free pass.

Part of this is lack of leverage I guess but I also do it when reviewing when I obviously do have leverage (though leverage at peer review is stupid just like the rest of the process).

On the flip side, I would far rather methods and estimates being used were in the public domain so I can have a whine (like this) rather than all just being kept internal and never really checked (like some other public health agencies that come to mind), so maybe having different standards is okay? Or perhaps we need a different way of sharing things more generally, that is a bit less of a pointless formal dance that wastes everyone’s time? I generally think this for all academic-style work - we should just have preprints and then people post reviews and there can be external editor filtering if people wish.

Something I would like to see is agencies engaging in more review with a feedback loop into their work - this could be a really neat role for the HPRUs. The problem with this is independence: usually, unless you are a relentless maverick (:small_airplane: ) bent on not having any career success (:blush:), biting the hand that feeds is not a good idea.

Why? Well we need impact right! This means all the different parts of agencies end up with academic friends (not immune from this) and get siloed feedback. This is also obviously true in academic networks so perhaps a wider problem.

Moving Forward

To be clear, this is a really good piece of work and the above are niggles. Without code though, it’s hard to properly evaluate or build on. I’m pretty sure the lack of code is outside the authors’ control.

Wonder if someone from this team would like to come and give a retrospective on this work at an epinowcast seminar? I would love to understand the modelling choices as well as the background around reporting better and I think others would too.

What do others think? Very interested to hear from public health colleagues, especially. Is this all just Monday Sam?

3 Likes

I’ve forwarded this on to Tom, who I’m sure would be happy to discuss.

I won’t get into a response because it’ll be much more interesting hearing what others think: some bits really resonate, some we can explain, and some I think come down to the very difficult question of “what should in-house modelling teams be doing?”

Very briefly, I think the incidence + prevalence paper from WCIS will have open code, as will the IHR/IFR one.

As an aside to Jean’s point on Bluesky (I haven’t moved over, for my screen-time / mental health): “UKHSA deaths” are… from the ONS, not an inherently different surveillance system. UKHSA receives death registrations from ONS each day they’re reported, and NHS real-time data to help us do public health action in real time. “What is a COVID death” is not a solvable problem, only one that can be approximated with the data available - mentions on death certificates have biases, as do time-from-onset thresholds.

2 Likes

Thanks for forwarding, and I hope the tone is okay/productive!

Yes, I think this is precisely the question and it is hard to know what the right answer is. I used to think the answer was a lot of in-house modelling, from my experience of academia, but after getting more involved in a PH in-house team I now think more of a mixture is pretty key.

For that to work, though, there need to be good public and non-public lines of communication, I think. I like the HPRUs for this, but I am not sure if historically they have delivered (or if they have enough funds/flexibility to really do so).

Shocking no one, I think the answer is more unified tooling that can be shared across these kinds of analyses (not just within an agency) so everything becomes a bit less complex to communicate/verify/expand on.

Something I am not sure really came across in my points above is that I don’t think research groups are doing a better job on a lot of these issues - usually they are doing far, far worse than this, which is a very nice paper. It’s just that I find it easier to discount their work, and I never really stop and think about that.

Probably a good idea. It’s mostly American academics having a very justified doom spiral. These points didn’t really concern me aside perhaps from a communication angle (i.e. I assumed you had access to lots of good data sources because why wouldn’t you).

2 Likes

Hey Sam, thanks for the comments! Here are my personal thoughts, though I think Tom will also have some comments.

On the code, we do try to publish where possible, but for this project it wasn’t deemed suitable due to working across multiple secure data systems. The methods developed were a pragmatic solution to the challenges of handling data across multiple systems. In the recent Winter COVID Infection Study, where we held all the data within UKHSA infrastructure, we have built upon these methods to develop the models we would have liked to have used in this study. The methods papers from the Winter CIS will hopefully all be published soon (here is one just on duration of positivity and sensitivity as an example SARS-CoV-2 test sensitivity and duration of positivity in the UK during the 2023/2024 Winter: A prospective cohort study based on self-reported data - ScienceDirect), and the code is currently going through internal review prior to publication, so the code should all be available alongside some synthetic data, as Jon has already mentioned.

I agree about the challenges with onset date instead of infection time – this limitation is also present in the Winter CIS work unfortunately. We tried to model from infection time, but for this to be robust when estimating IHR/IFR, we’d need to use outcome-specific incubation periods; otherwise, although it looks like “infection time”, it is effectively just onset time shifted in time. This is important, as it was shown in Adjusting for time of infection or positive test when estimating the risk of a post-infection outcome in an epidemic - PMC that when time delays vary by patient outcome we can get a large bias in the estimated IHR/IFR. Onset time, and all the time delay distributions downstream of onset time, we had good, timely data on, so this allowed the model to be maximally informed by the available data.

There is a lot of room for methodological improvements in this space, but some large technical challenges still to be overcome in the severity bias, before we can truly generate a time-varying IHR/IFR estimate. We are interested in doing more research in this space to develop improved methods.

We’ll let you know when the WCIS papers are out!

2 Likes

Hi Sam,

Tom doesn’t have an account, but this is his response:

Code & reproducibility

We agree that open code is important and where possible we always publish code. In this project, the end-to-end pipeline spanned multiple secure environments (ONS Secure Research Service, and several UKHSA AWS/SRS “safe haven” setups) with row-level, linkable health data. That’s why we didn’t provide a single repo: large chunks are environment-specific (access controls, linkage and disclosure checks). That said, the model code (Stan and wrappers) is shareable and was always available on request via DataAccess@ukhsa.gov.uk, as stated in the paper. The journal operated transparent review; reviewers had our full methods and could have requested the model code. If they had, we’d have provided it under the usual secure route. The absence of a public repo isn’t a judgment on transparency; it reflects legal/data-governance constraints we have to work within.

High level

We had to make pragmatic approximations due to working across different data environments. Much of the methodological challenge with this paper comes from working around the capabilities of the different data environments. The environments where the data reside were often not capable of heavy computational loads, and egressing/ingressing parameters/data/posteriors between environments is not always possible. As such, the methodology must make pragmatic modelling choices for the analysis to be feasible.

We estimate prevalence where the survey data reside, convert that to incidence using the posterior for positivity duration in a different environment, then integrate with linked outcomes—and we apply temporal smoothing at the integration stage.

Could we have modelled continuous prevalence directly in a single unified model? Technically yes, but the ONS SRS could not handle the computation required for that approach (memory/throughput limits and queueing constraints), so an end-to-end fit in one environment wasn’t feasible. The modular pipeline plus smoothing was therefore a practical necessity driven by compute constraints, not a methodological preference.
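For readers, a generic illustration of what passing a parametric posterior summary between sub-models can look like (a sketch of the idea only, not the project code):

```r
# Generic sketch of propagating uncertainty by passing a parametric posterior
# summary between sub-models (an illustration of the idea, not the project
# code). Sub-model A produces posterior draws for mean duration of positivity;
# two numbers are then carried across environments and used as a lognormal
# prior in sub-model B.
set.seed(1)
duration_draws <- rlnorm(4000, meanlog = log(10), sdlog = 0.1)  # stand-in draws

summary_to_pass <- c(
  meanlog = mean(log(duration_draws)),
  sdlog   = sd(log(duration_draws))
)
summary_to_pass  # e.g. becomes duration ~ lognormal(meanlog, sdlog) downstream
```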

Tight uncertainty on PCR-positivity bits.

Those bands are tight in places because (a) many strata/variant periods have very large sample sizes, and (b) we use smooth but flexible parametric forms (with leave-one-out CV guiding distribution choice) for delays and test sensitivity. Uncertainty is wider where data thin out (e.g., tails or narrow strata), and that propagates downstream.

Onset vs infection time.

Infection time is the conceptual target. We anchored to symptom onset because (i) that’s observed/recorded far more reliably alongside outcomes, and (ii) we explicitly model onset-to-test and onset-to-outcome delays with interval censoring/right-truncation. We explicitly noted the simplifying assumption that onset approximates the time a case becomes test-positive, and we showed what would be needed to limit epidemic-phase bias (linking prevalence to outcomes with outcome-specific incubation distributions). We agree this is a trade-off and flagged it as a limitation.

Prevalence to incidence and round structure.

Eq. (16) is an aggregate prevalence to incidence identity under a stated constant-risk assumption (onset-anchored, uncertainty-propagated and RW2-smoothed across sources); it is not the individual back-sampling that Gostic et al. discuss for Rt estimation. This was a pragmatic choice: because we had round-based sampling (REACT) alongside a continuous series (ONS CIS), converting round-level prevalence to incidence using the expected positivity duration was thought to be a transparent, survey-compatible bridge that avoids inventing within-round dynamics the data cannot identify. Rounds are not treated as independent in the severity estimates: prevalence is estimated in a hierarchical framework with RW2 smoothing, and the time-varying IHR/IFR is also given a second-order random-walk (RW2) smooth, which provides temporal continuity across round boundaries. The paper is clear that within-round growth can induce epidemic-phase severity bias, and it explains what extra linkage (outcome-specific incubation distributions, separate prevalence by eventual outcome) would be required to limit it - data that weren’t available. Finally, with datasets confined to different safe havens, we could not fit one monolithic forward model. We therefore used a more modular workflow, in which uncertainty was propagated by passing parametric summaries of posteriors between sub-models rather than fitting everything jointly.
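For readers less familiar with the RW2 idea, a minimal illustration (a generic sketch, not the paper’s exact parameterisation):

```r
# Second-order random walk (RW2) illustration (a generic sketch, not the
# paper's parameterisation). Each value extrapolates the previous two plus
# noise, which penalises changes in slope and gives temporal continuity
# across round boundaries.
set.seed(2)
n     <- 50
sigma <- 0.05
x     <- numeric(n)  # x[1] and x[2] start at zero
for (t in 3:n) {
  x[t] <- 2 * x[t - 1] - x[t - 2] + rnorm(1, 0, sigma)
}
plot(x, type = "l", xlab = "time", ylab = "smoothed log severity (illustrative)")
```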

Finally, the journal used was certainly reputable, and if you had any thoughts regarding code or wanted clarity, you could always have emailed the authors rather than blogging here.

We had access to row-level data for WCIS that could be analysed within our environments, and this allowed for the use of a single unified Bayesian modelling approach. The code for all these models was provided, as it was possible to run it internally without the same constraints.

2 Likes

Lots to respond to here, so let me know if I missed anything or it’s garbled!

pragmatic solution to the challenges of handling data across multiple systems

As are they ever! I’m looking forward to seeing the models when the data release people get to it.

suitable due to working across multiple secure data systems.

I’m not sure I fully understand what this means? As in you couldn’t put out fully working code? Yeah, fair enough, but having a repo of, say, the Stan models would be better than nothing, right? Or are there additional internal compliance issues? If it’s just a matter of strong-arming internal policy, my vote (which I don’t have and has no weight) is massively for any code vs none, and for code as fast as possible, again even if partial (i.e. at preprint). This relates to one of the points I was trying to make about weighting PH vs academic research, i.e. I am aware PH has random internal policies and restrictions, so as an external it’s hard to know when these are having a role or in precisely what way.

Yes I agree - sidebar I really love this paper!

outcome specific incubation periods

Yes agree.

allowed the model to be maximally informed by the available data.

I am not so sure I agree? It’s just putting a hard assumption in, right, so the problems don’t show up? Kind of related to that statement in the original paper I was confused by, I guess. It’s like putting any estimate in as a point vs a prior and saying it’s better? Or do you think it’s more nuanced than that?

There is a lot of room for methodological improvements in this space

Yes, I agree. There is a lot of interesting overlap with the generation time stuff I have been thinking about (grant app posted on here at some point) as well.

Your paper led me to https://www.medrxiv.org/content/10.1101/2024.10.23.24315984v1.full-text, which looks great (a much, much nicer version of our hack job here: https://www.medrxiv.org/content/10.1101/2022.03.29.22273101v1, which we did the joint model of but never did anything with (woops): inc2prev/stan/inc2prev.stan at 1d395006c1595b780c0295aa96a94928d2e697a0 · epiforecasts/inc2prev · GitHub) - any idea when the code is coming for this one, as I think it’s missing as well?

the code is currently going through internal review prior to publication.

Ah so you do this after preprint but before journal publication? Hence, the preprints not having code? I think this might be why I miss these papers as I don’t read without code usually and only really read at preprint unless something calls me back. I imagine that that isn’t hugely common though.

Did you look at the additional role of phase bias on your interval windows on top of the reporting bias you saw or do you think that will have swamped it?

As an aside, have you folks tried marginalising the primary event intervals yet? I imagine it could help with these long computation times you are seeing/approximations needed if you got rid of the latent variables and used a weighted likelihood ( Why it works • primarycensored ). I should have that paper ready for comment soon so will circulate. Especially the 4 likelihood here with an analytical solution vs needing this approx: https://ars.els-cdn.com/content/image/1-s2.0-S0163445325000799-mmc1.pdf. Might be reasons that wouldn’t shake out as helpful.
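By marginalising, I mean something like this (a rough base-R sketch of the general idea, not the primarycensored implementation):

```r
# Rough base-R sketch of the marginalisation I mean (the general idea, not the
# primarycensored implementation). If the primary event time is censored to a
# window and assumed uniform within it, the density of an observed delay d is
# the delay density averaged over the window, so the latent primary event time
# drops out of the model.
marginal_delay_density <- function(d, meanlog, sdlog, window = 1) {
  integrate(
    function(u) dlnorm(d - u, meanlog, sdlog) / window,
    lower = 0, upper = window
  )$value
}

marginal_delay_density(d = 3, meanlog = log(2), sdlog = 0.5)
```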

I like the pop level approach to approximating the individual variation in the absence of marginalising though - neat. Could be interesting to think if something like that fits as an option in primarycensored if you don’t have an analytical solution and don’t want to numerically solve for whatever reason.

Something of a tangent so perhaps we should talk about that somewhere else?

Likely, this was symptomatic individuals wanting to use their LFD test and waiting until the testing window was open

Another random aside: was it not part of the study design that they could test if symptomatic outside the schedule, as long as they reported that they did so? If not, maybe that would help reduce this issue? Or would that induce more issues?

It sounds like Tom might prefer an email response so will do that!

but briefly

the model code (Stan and wrappers) is shareable

My point would generally be that something is better than nothing all the time!

operated transparent review

I’m a big fan of public peer review as far as I am a fan of peer review at all via the traditional journal approach.

pragmatic approximations

I think this is great and reflects the real world. The concern I had diving in was more about the reporting of checks, which was really only a major concern because the lack of code prevents going in and checking myself, i.e. with the @OvertonC paper above I can just go run the prior predictive check myself.

Now you do have the data in house, have you compared the old modular approach to the joint approach?

Those bands are tight in places because (a) many strata/variant periods have very large sample sizes

Looking at the methods paper Chris sent, I also realise this is the strata-level curve and not the realised individual curve, i.e. it contains no individual-level variation, and so it being tight seems reasonable.

We agree this is a trade-off and flagged it as a limitation.

I have no problem (though of course we would all prefer it wasn’t there) with the approximation or the need for it; my confusion was more about the wording/treatment, i.e. “would consequently degrade results” - it’s not really degrading, right, it’s more reflecting known uncertainty? Perhaps this is just semantics.

converting round-level prevalence to incidence using the expected positivity duration was thought a transparent, survey-compatible bridge that avoids inventing within-round dynamics the data cannot identify.

This all makes sense. Again, very keen to know how this compares to the updated method you have now you have all the data in house.

Generally, I prefer where possible to have discussions about papers in the public domain so more people can see the conversation (like this)! If I had real concerns, i.e. “I think this is wrong and should be retracted” kind of concerns, I would email first and do things more formally, but I don’t in this case.

I think this is more about framing. In the post/here I am trying to have a discussion about how people generally treat/filter papers and whether it differs between work from academics and work from public health teams. For me, it definitely does, for example.

I don’t think many of the for-profit spin-out journals are positives (especially ones that copy the results-then-methods structure of, e.g., Nature), so I would down-weight on this (same with lack of code - I down-weight on that too). The point I was trying to make was not article-specific; it was more me realising that I use different heuristics when evaluating a paper from a public health team than from a research team, but I haven’t really evaluated why that is.

1 Like

I will throw in my experiences working within a public health institution. (I won’t comment on the methods within the paper as that has been well covered).

I publish all of my code, and have been supported by my team lead and institution to do so. The code must (obviously) not include any privacy sensitive information.

Our team has made a concerted effort following the COVID-19 pandemic to make all, or most, of our code public. During COVID, the government was criticised for making policy decisions based on modelling work that was not accessible to the public. Our response to COVID was subsequently audited by an international committee of experts and one outcome was that we need to be more transparent (with code, methods, data, etc.).

In response to the audit, we, as a team, are trying to make all the code and data public once a project is ready for publication. If the data cannot be made public (due to GDPR), we are strongly encouraged to create dummy data that approximates the real data, so the code will run and there is some semblance of reproducibility.

Ultimately, our work is not (only) an academic exercise, but is publicly funded and must be of policy relevance (and is often requested by policy makers); therefore, we have a higher burden of transparency.

3 Likes

Thanks for this @kylieainslie. Do you define this as at journal publication or when made public, i.e. at preprint? And why?

I think the trend towards using synthetic data is really great - @jonathon.mellor has a nice paper that does this, which motivated a recent collab after we used it for some other validation, which was really helpful.

Strongly agree on the public funding - generally the same for academics, of course. Do you get any journal steers, i.e. towards certain publishers?

@samabbott it’s at the preprint stage. We preprint when we’re ready to submit for publication in a journal. We have to get approval before we make any manuscript public, preprint or otherwise, so at this stage what’s going to be made public (manuscript or code) should be ready to go. However, depending on the project, the code might be public before a manuscript (think {mitey}). It depends on the researcher and the project.

2 Likes