Probabilistic program inference in network-based epidemiological simulations

Authors: Niklas Smedemark-Margulies (Khoury College of Computer Science, Northeastern University, Boston, Massachusetts, United States of America); Robin Walters; Heiko Zimmermann (Informatics Institute, University of Amsterdam); et al. Date: 2022-11

Accurate epidemiological models require parameter estimates that account for mobility patterns and social network structure. We demonstrate the effectiveness of probabilistic programming for parameter inference in these models. We consider an agent-based simulation that represents mobility networks as degree-corrected stochastic block models, whose parameters we estimate from cell phone co-location data. We then use probabilistic program inference methods to approximate the distribution over disease transmission parameters conditioned on reported cases and deaths. Our experiments demonstrate that the resulting models improve the quality of fit in multiple geographies relative to baselines that do not model network topology.

The ability to create computer simulations of epidemics is important for predicting where and when people will become infected, identifying factors which either contribute to or slow disease spread, and testing various interventions without risking real lives. However, the conclusions of experiments performed using these simulations are only meaningful in the real world if we can be sure the simulation accurately models what is happening in the real world. We study methods for fitting parameters, such as infectiousness, to real-world data so that the disease simulator correctly represents the actual disease. We achieve this using probabilistic programming methods which automatically adjust the parameters of the simulator until its outputs look realistic. Our method can work on very detailed simulators which model individual people interacting at specific locations in different locales, whereas other methods can only fit very simple simulators.

Funding: RW is supported by a Postdoctoral Fellowship from the Roux Institute. JWvdM is also supported by the Intel Corporation, the 3M Corporation, NSF award 1835309, startup funds from Northeastern University, the Air Force Research Laboratory (AFRL), and DARPA. RC is funded by MIT Lincoln Laboratory and the Under Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability: Code and data to reproduce all experiments are available at https://github.com/nik-sm/ProbProgEpiNet.jl . This repository includes network topologies generated based on cellphone geolocation data and county-level census data obtained from SafeGraph; raw data from SafeGraph can be freely obtained for academic use by requesting an account at https://www.safegraph.com/academics . This repository also includes instructions to download county-level COVID-19 infection and death statistics from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University ( https://github.com/CSSEGISandData/COVID-19 ).

Copyright: © 2022 Smedemark-Margulies et al.
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Contributions. The present work evaluates the utility of probabilistic programming methods for parameter estimation in simulations that model disease transmission at the level of individuals. For this purpose, we perform a relatively comprehensive set of numerical experiments. Our results demonstrate that even BBVI, a relatively simple simulation-based inference method, can be used for parameter inference in these models. We hope this will inspire further applications of probabilistic programming to this domain. In particular, we observe that variational inference methods make it possible to estimate region-specific parameters in a manner that results in a higher quality of fit when compared to simpler compartmental models and other inference methods commonly used in related work.

Limitations. We make several simplifying assumptions which are common in the literature. These are both in terms of the level of realism of our model design and the quality of our inference approximation. First, our simulator models the network topology of a region as a subsampled population of nodes. This subsampling necessitates rescaling to compare model outputs to reported case counts, and means that parameter estimates will depend on the degree of subsampling. Our simulator is nonetheless relatively granular compared to other disease models and balances well between speed and resolution. Second, we perform variational inference for the parameters in our model using a so-called fully-factorized approximation. This approximation makes the optimization problem simpler, since fewer variational parameters have to be learned, but does have well-documented drawbacks. Its main limitation is that a factorized variational distribution cannot capture correlations between parameters in the posterior, which in turn can result in an under-approximation of the posterior variance. A further limitation is that our gradient estimates have a high variance. The reason for this is that we treat the stochastic simulator as a black box. This means that the distribution over parameters is optimized to maximize agreement with observed data, but the simulator trajectories themselves are sampled from a broad prior distribution rather than from a proposal informed by the observations (see Fig 1 ). As a result, the accuracy of the posterior approximation may be limited by the variance of the gradient signal.

[Figure caption] After fitting the parameters of our model to data, we show the cumulative infections produced by 100 runs of the simulator using selected parameters. On the left, we use the posterior mean parameters; the observed variation comes from the untraced randomness of the network simulator. On the right, we use samples from the posterior distribution; variation comes from both the untraced randomness and the variance of our posterior distribution.

Lastly, our method is not designed to model the evolving temporal dynamics of the disease. We are able to fit time-varying infectiousness parameters over a past time range based on data, but our model cannot extrapolate into the future except by assuming these parameters stay constant.

In order to apply probabilistic programming methods to parameter estimation for fine-grained disease simulations, we implement an agent-based simulation in the Julia language.
We model disease transmission on subsampled social proximity networks constructed from cellphone co-location data (Sec. 4.2), in which nodes represent individuals who transition stochastically between disease states (Alg. 4). We infer transmission parameters of this model using Blackbox Variational Inference (BBVI, [ 16 , 17 ]), a well-established stochastic variational method for probabilistic programs. We use an implementation of BBVI in the Gen probabilistic programming system [ 18 ]. We find that this method scales well with the relatively computationally intensive simulations in this model, and outperforms simpler methods based on likelihood-weighted importance sampling and Metropolis-Hastings in terms of sample quality and diversity. To separately capture the effects of changes in mobility from changes in transmission during interaction (e.g. due to non-pharmaceutical interventions like mask-wearing), we include experiments where the contact network varies over time. To explore the sensitivity of our model to the network size, we also include experiments with up- and down-sampled graphs. We find that our model gives high-quality samples in both of these cases. Lastly, we consider an experiment in which we simultaneously fit both infection and death data, and find that our model is able to incorporate both pieces of evidence successfully. However, there are numerous other methods from the probabilistic programming literature that impose few requirements on the underlying model, and can be used to perform inference in simulation-based models in a variety of languages with relatively few changes to the underlying code base. Examples include single-site Metropolis-Hastings [ 12 ], importance-sampling methods based on Sequential Monte Carlo [ 13 ], and stochastic variational inference methods [ 13 ], which can be combined with deep learning to train models that invert probabilistic programs for fast inference at test time [ 14 , 15 ]. Parameter estimation in complex disease simulations typically relies on comparatively simple methods such as importance sampling using likelihood weighting or approximate Bayesian Computation [ 3 , 4 ]. Approximate Bayesian Computation (ABC) relies on a heuristic likelihood function to compare observed data and predictions from the simulation [ 10 , 11 ]. In this paper, we present a case study in the use of probabilistic programming methods to infer parameters of agent-based disease simulations. Probabilistic programming research has developed a wide variety of methods for inference in programmatically specified models, including methods based on Markov Chain Monte Carlo (MCMC) [ 7 ], importance sampling and variational inference [ 8 ]. An example of the use of probabilistic programming in disease modeling is the work by Flaxman et al. [ 9 ], which applies methods based on Hamiltonian Monte Carlo (HMC) to comparatively low-granularity compartmental models. HMC is widely used and can be computationally efficient, but requires a differentiable model that defines a density with continuous support. These requirements are not always easy to satisfy. Complex simulations, such as the ones we consider here, may incorporate non-differentiable aspects such as discrete variables or have discontinuities arising from if-then-else statements. Moreover, it is not always convenient to (re-)implement simulations in a special-purpose modeling language that supports automatic differentiation. 
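To make this concrete, the following is a minimal, hypothetical Julia sketch (not taken from our code base) of the kind of simulator we have in mind: it draws discrete random variables and branches on intermediate state, so it can be run forward to produce samples, but it exposes no differentiable log-density to the inference algorithm.

# Hypothetical toy simulator with discrete randomness and control flow.
# It can only be run forward; gradient-based methods such as HMC cannot
# differentiate through the Bernoulli draws or the if/else branch.
using Random

function toy_outbreak(β::Float64, T::Int; n::Int=1_000, i0::Int=5, rng=Random.default_rng())
    s, i = n - i0, i0
    daily_cases = Int[]
    for t in 1:T
        p = 1 - exp(-β * i / n)                   # per-susceptible infection probability
        new_inf = count(_ -> rand(rng) < p, 1:s)  # discrete random variable
        if i > 50                                 # discontinuity, e.g. a policy trigger
            new_inf = div(new_inf, 2)
        end
        s -= new_inf
        i += new_inf
        push!(daily_cases, new_inf)
    end
    return daily_cases
end

Black-box inference methods only need to call such a function repeatedly and score its outputs against data.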
Fine-grained models are desirable for epidemiologists who seek to encode important domain knowledge, such as the effects of pre-existing patient states (e.g. age, co-morbidities, or even genetic predispositions) and regional differences in mobility and policy. However, such models will typically have a large number of parameters, and estimating these parameters poses challenges. Some parameters, such as those describing person-to-person interaction frequency, can be estimated prior to simulation using available data on mobility and demographics. Other parameters, such as disease transmission rates, may have substantial uncertainty, particularly during the early stages of an ongoing epidemic, or may be sensitive to implementation details of a simulator so that they cannot be estimated externally. Moreover, parameter estimates based on past data can be invalidated by public health intervention policies affecting mobility and social interaction. In such settings, we would like to deploy models that can incorporate high-resolution data from as many sources as possible. Then, we can apply techniques for maximum likelihood estimation and approximate inference to reason about the most probable parameter values. Planning and control of disease spread often relies on having access to realistic simulation models that can capture fine-grained dynamics and differences between geographic regions. Models of infectious diseases track the number of individuals in different stages of disease progression, with varying granularity. The least granular models track global population totals in compartments such as susceptible, infected, and removed [ 1 ], and rely on an assumption of uniform mixing such that the simulated population can be fully described by a system of differential equations. This simplifies computation, but also imposes limitations on dynamics; for example, unlike real-world infection data with multiple waves of infection, this model can only generate a single wave [ 2 ]. More sophisticated simulators stratify the population according to age [ 3 ] or geography [ 4 ] to account for variations in the frequency of interactions. The most fine-grained simulation models are agent-based [ 5 ], and can account for the fact that highly-connected individuals are more likely to both contract and spread a disease [ 6 ]. While BBVI does not require the simulator to be differentiable, there are a number of more efficient inference methods that can be used when a model does support differentiation. Examples include variants of Hamiltonian Monte Carlo [ 21 ] and reparameterized methods for variational inference [ 22 , 23 ]. To apply these methods, the model must be implemented in a language that supports automatic differentiation, such as Stan [ 24 ], PyMC [ 25 ], or Pyro [ 15 ]. Moreover, since these methods rely on computation of the gradient of the log density with respect to the latent variables ∇ z log p(x, z), all variables in the model need to be continuous. For this reason, support for these methods must be factored into the design of a simulator from the outset; they cannot easily be applied to code bases that do not already support automatic differentiation. Furthermore, these design constraints limit the expressivity of disease simulators, since it is not possible to apply differentiation to models with discrete random variables.
By contrast, the BBVI methods that we consider in this paper have fewer implementation requirements, but typically require a much larger amount of computation to approximate the posterior. Algorithm 3 summarizes the resulting MH sampling procedure when using a proposal that updates a single randomly selected parameter, keeping all remaining variables constant. This sampler is once again very general. To generate a proposal and compute the acceptance ratio, we only need to be able to sample from the proposal kernel q(θ′ ∣ θ), run the simulator to generate s′ ∼ f(θ′), and compute the weight w′ = g(x, s′, θ′). This construction is once again generally applicable to probabilistic programs that make calls to stochastic functions f whose random choices are opaque to the inference algorithm. A justification for replacing α with this weight-based estimate can be derived by noting that the resulting sampler satisfies a technical definition known as proper weighting [ 19 , 20 ]. This amounts to replacing the marginal likelihood p(x ∣ θ′) with an unbiased estimate w′ and similarly replacing p(x ∣ θ) with an unbiased estimate w, since, (15) Unfortunately, we cannot compute this acceptance ratio directly, since doing so requires computing p(x, θ), which involves an integral with respect to all simulation trajectories s. However, we can use likelihood weighting to compute an acceptance ratio by running the simulator to generate trajectories s′ ∼ f(θ′) and s ∼ f(θ), and computing the acceptance ratio in terms of importance weights, (14) In the context of simulation-based inference, there is a subtlety to performing MH sampling. The MH acceptance ratio α for a simulation-based model has the form, (13) One of the baseline methods that we use in this paper is single-site Metropolis-Hastings (MH) [ 12 ], which is a Markov Chain Monte Carlo (MCMC) method that is implemented in many probabilistic programming systems. In MH, we randomly initialize a sample θ ∼ f 0 from the prior. We then apply a series of MCMC updates. For each update, we propose a change θ′ ∼ q(θ′ ∣ θ). The proposed change is then either accepted or rejected according to a Metropolis-Hastings ratio. In the resulting Markov chain, early samples will be representative of the prior distribution, but successive updates define a biased random walk that converges to the posterior. As with likelihood weighting, BBVI is very broadly applicable to simulation-based models. To compute the weights w k , we need to evaluate the likelihood g, the prior f 0 , and the variational density q ϕ . The only additional requirement is that we can compute the log gradient ∇ ϕ log q ϕ (θ). As a result, BBVI can be implemented by simply running the simulator f repeatedly to generate samples. The constant subtracted in this estimator, which is known as a baseline, serves to reduce the variance of the estimator. Here, we use a simple baseline in the form of the average log weight. BBVI optimizes the variational objective using stochastic gradient ascent, which requires an unbiased estimate of the gradient of the objective with respect to ϕ. To compute this estimate, BBVI uses samples s k , θ k ∼ q ϕ to compute a likelihood-ratio estimator (see S1 Appendix ), (12) The variational objective represents a trade-off between two objectives. The first is to maximize the expected log likelihood, which tends to concentrate the variational distribution q ϕ (θ) around the parameters θ* that yield the highest agreement between simulation and data.
The second is to minimize the KL divergence relative to the prior, which tends to keep the variational distribution broad rather than overly peaked. The variational objective can equivalently be decomposed into an expected log likelihood and a KL divergence between the variational distribution and the prior, (10) (11) Since log p(x) does not depend on the variational parameters ϕ, maximizing the lower bound with respect to ϕ is equivalent to minimizing the KL divergence, which ensures that q ϕ becomes as similar as possible to the posterior. To learn the variational parameters ϕ, we will in this paper use black-box variational inference (BBVI) [ 16 , 17 ], which is one of the simplest and most widely implemented methods for probabilistic programs. BBVI minimizes the Kullback-Leibler (KL) divergence KL(q ϕ (s, θ)||p(s, θ|x)) by maximizing a variational lower bound (see S1 Appendix for further discussion), (8) (9) Our goal is to learn parameters ϕ such that q ϕ (s, θ) is as similar as possible to the posterior p(s, θ ∣ x). Since q ϕ (s ∣ θ) = p(s ∣ θ) is just the distribution over trajectories in the simulator, for which there are no optimizable parameters, making q ϕ (s, θ) as similar as possible to p(s, θ ∣ x) equates to making q ϕ (θ) as similar as possible to the posterior marginal over parameters p(θ ∣ x). Variational inference methods directly approximate the posterior by optimizing the parameters of a variational distribution, thereby turning an inference problem into an optimization problem. In this paper, we assume a distribution q ϕ (θ) with variational parameters ϕ. If we combine the variational distribution with the simulator f, then this defines a variational family q ϕ (s, θ), (7) One of the limitations of likelihood weighting is that it is not a particularly efficient inference strategy. The prior f 0 typically defines a distribution over a broad range of parameters, out of which only a small subset are likely to give rise to simulation trajectories that are in good agreement with the data. That is, likelihood weighting is a form of guess-and-check sampling: it typically requires many proposals s k , θ k , out of which the overwhelming majority will be in poor agreement with the data. This means that the weighted average in Eq 6 will typically be dominated by a small fraction of high-weight samples. A convenient property of this style of inference is that it can be applied to any stochastic simulator. Likelihood weighting does not require evaluation of p(s ∣ θ) at any step in the computation. The only requirements for likelihood weighting are that we should be able to sample θ from the prior, run the simulator to generate a trajectory s, and evaluate the user-defined likelihood g(x, s, θ). After normalizing these weights by their sum, this method can be used to approximate expectations with respect to the posterior, (6) Probabilistic programming systems implement general-purpose inference methods that can be applied to simulation-based models. These methods approximate the posterior using samples that are generated by repeatedly running the simulation. One of the simplest of these methods is importance sampling using likelihood weighting, which generates samples by proposing θ from the prior, running the simulator to generate a trajectory s, and computing an importance weight w according to the likelihood, (5) We will assume that the prior f 0 is a density or probabilistic program that is completely tractable, which is to say that we can sample θ ∼ f 0 and evaluate the density f 0 (θ).
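To illustrate how few requirements this places on the simulator, the following is a minimal Julia sketch of likelihood weighting; sample_prior, simulate, and loglik are hypothetical placeholders standing in for f 0 , f, and g, not the code used in our experiments.

# Sketch of likelihood weighting for an opaque simulator (cf. Alg. 1).
# sample_prior, simulate, and loglik are hypothetical stand-ins for the
# prior f0, the black-box simulator f, and the user-defined likelihood g.
function likelihood_weighting(x, sample_prior, simulate, loglik; K::Int=1_000)
    θs   = [sample_prior() for _ in 1:K]                # θ_k ~ f0
    ss   = [simulate(θ) for θ in θs]                    # s_k ~ f(θ_k)
    logw = [loglik(x, s, θ) for (s, θ) in zip(ss, θs)]  # log w_k = log g(x, s_k, θ_k)
    w    = exp.(logw .- maximum(logw))                  # stabilize before normalizing
    w  ./= sum(w)
    return θs, ss, w
end

# Self-normalized estimate of a posterior expectation of any h(θ) (cf. Eq 6).
posterior_mean(h, θs, w) = sum(w[k] * h(θs[k]) for k in eachindex(θs))

The only model-specific ingredient here is the likelihood g, whose possible forms are discussed next.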
The likelihood g may take the form of a tractable parametric density, as is the case for the Gaussian likelihood that we define in Section 4.3. Alternatively, we can define a likelihood g(x, s, θ) ∝ exp[−ℓ(x, s)] in terms of a loss function ℓ(x, s) that measures the discrepancy between the simulation state s and the observations x. Such a loss function should be chosen to ensure that the normalizer Z = ∫ dx exp[−ℓ(x, s)] is a constant that is independent of s. We will discuss an example of such a heuristic likelihood in our discussion of ABC rejection algorithms below. Concretely, we will assume that we have access to a simulator f with inputs θ, and that execution of this simulator generates a random trajectory s ∼ f(θ). This implicitly defines a distribution over trajectories s ∼ p(⋅ ∣ θ). However, the density associated with this distribution is intractable, which is to say that we cannot compute p(s ∣ θ). To perform inference in a disease simulation, we will assume that the user implements additional code to define a likelihood p(x ∣ s, θ) and a prior p(θ), (4) In this paper, we will instead focus on inference methods that require no changes to an existing simulation code base. These methods can be applied to a stochastic simulator that is “opaque”, which is to say that randomness inside the simulation is not accessible to the inference algorithm. In the Gen probabilistic programming system [ 18 ], which we use as the basis for our experiments, these uncontrolled random choices are referred to as “untraced” randomness. It is possible to define all model components in a probabilistic programming language to facilitate inference later on. However, this may not be the most convenient choice when applying probabilistic program inference methods to an existing simulation code base, since all simulation code must now be translated or modified to ensure that random variables are generated in a manner that can be tracked by the probabilistic programming framework. This model has three components. The first is a likelihood p(x ∣ s, θ), the second is a distribution over disease simulations p(s ∣ θ), and the third is a prior p(θ). In this work, observed data takes the form of a time series x = (x 1 , …, x T ), where each time point x t is a count of reported cases. The latent variables z = {θ, s} comprise model parameters θ and a sequence s = (s 1 , …, s T ) in which s t denotes the state of a disease simulator at time t. We express the probability of latent variables and observations as, (3) This density is defined in terms of a likelihood p(x ∣ z), which reflects the agreement between observed data x and latent variables z, and a prior p(z) over latent variables z in the absence of observations. The computational challenge in approximating the posterior is that integrals with respect to z are generally intractable. This means that we cannot compute the marginal likelihood p(x) in closed form, and it is generally difficult to approximate the expected value of any quantity h(z) that depends on the latent variables, (2) In this paper, we apply methods for Bayesian inference to reason about unknown parameters in disease simulations. Our goal is to approximate a posterior density over latent variables z that is conditioned on observed data x, (1) Unlike the ABC method used by Chinazzi et al. and Chang et al., our method does not use a hard threshold for rejecting samples, which increases sample efficiency. Similarly, Chang et al.
[ 4 ] use ABC to fit model parameters to the number of confirmed cases provided by the New York Times. They estimate 3 parameters: (1) base transmission rate, (2) point-of-interest transmission rate, (3) initial proportion of exposed individuals. These parameters are fit by performing a grid search, where the utility of parameters is computed based on the average root mean squared error (RMSE) over 30 random simulations. To quantify uncertainty, the authors select parameters whose average RMSE is within 20% of the best-fitting parameter set, and report the mean and 2.5–97.5th percentile range of parameter values. Chinazzi et al. [ 3 ] study the effect of travel restrictions on the spread of the coronavirus. They use ABC rejection algorithms to estimate the posterior distribution of the reproductive number R 0 . Specifically, they choose parameter values from samples which have simulated case counts that match the observed number of cumulative imported cases before January 23, 2020 to within a margin of error of +40%. In the limit where ϵ 0 → 0, this heuristic likelihood conditions the simulation output to exactly match the observations. Increasing the threshold ϵ 0 results in a higher sample efficiency, at the expense of yielding larger approximation errors. ABC rejection algorithms . Another class of methods related to likelihood weighting are ABC rejection algorithms. These algorithms compare the simulation output s to the data x according to some error function ϵ(x, s) and reject all samples whose error exceeds a threshold ϵ 0 . This is a special case of likelihood weighting using a heuristic likelihood of the form, (16) Likelihood weighting . Inference in larger-scale simulation-based models has typically relied on much simpler techniques in order to reduce implementation requirements and balance the quality of approximation with computational cost. One of the more commonly used methods in this context is likelihood weighting (see Alg. 1). Work by Wood et al. [ 28 ] uses a probabilistic programming implementation of likelihood weighting to perform inference in FRED [ 5 ], an agent-based disease simulator. Wilder et al. [ 29 ] use likelihood weighting to fit 4 parameters: (1) the probability of infection after contact with an infected individual, (2) the start time of an infection, (3) a base mortality rate multiplier, (4) the reduction factor in expected number of contacts after a lockdown. They select a negative binomial distribution as a likelihood function, where the dispersion parameter is estimated by fitting an autoregressive binomial regression model. Samples are generated approximately from a uniform prior using Latin hypercube sampling. Markov chain Monte Carlo methods . For comparatively low granularity models, such as models with global compartments, it is often possible to apply MCMC methods, which are guaranteed to asymptotically converge to the posterior. Hamiltonian Monte Carlo (HMC, [ 21 , 26 , 27 ]) methods are amongst the most efficient and widely used MCMC methods in this context. An example of the usage of such methods is the work by Flaxman et al. [ 9 ] using models implemented in the Stan probabilistic programming language [ 24 ]. However, as discussed in Section 2.4, applying HMC requires that the model be differentiable with respect to the latent variables and, practically, the number of needed gradient evaluations be computationally feasible. 
This makes it difficult to apply HMC to larger-scale disease simulations (on the order of 10,000 agents), such as the ones that we consider in this paper. These simulations are not always differentiable, since they may contain discrete random variables, or may simply not be implemented in a language that supports differentiation. Note that we share the variance parameters σ ρ across communities, and share variance parameters σ E and σ I across time points. This modeling choice reduces the number of parameters in our model at a small cost to expressivity. Variational distribution. To approximate the model posterior over latent variables, we define a variational distribution q ϕ (θ) which mirrors the prior of the generative model, with parameters ϕ = {μ ρ , σ ρ , μ E , σ E , μ I , σ I } for the individual logistic-normal distributions. This results in a fully-factorized variational approximation of the form, (26) Likelihood. We define a likelihood that compares the model output s 1:T to the reported case counts x 1:T for a particular region. Here, we must account for the fact that our agent-based simulator uses a subsampled population that is orders of magnitude smaller than the actual regional population. For this purpose, we define a likelihood model with time-dependent Gaussian noise, which incorporates a scaling factor r to account for the ratio between individuals in the population and nodes in the network topology, and a hyperparameter ν describing the noise in our observations (see Sec. 5.1 and S3 Appendix ), (25) Note that there is some flexibility in deciding how to parametrize each component of the variational model. Our first constraint is that the sampled values of each of these variables (ρ c , β E , and β I ) should be non-negative; this suggests the use of log-normal or logistic normal distributions as relatively standard choices. Since ρ c represents a percentage of exposed individuals in a community, this parameter should also be restricted to [0, 1]. We similarly chose to restrict the values of β E and β I to the range [0, 1], such that a node’s maximal contribution to the infection of a neighbor is limited by the edge weight between them. The logistic normal distribution becomes a natural choice, as it enforces the desired constraints on these variables. The logistic normal distribution is defined so that a sample y is obtained by drawing a Gaussian-distributed value z and applying the sigmoid transform y = sigmoid(z), which constrains y to [0, 1]. The parameters defining the prior are hyperparameters of the model tuned with grid search (see Sec. 5.1 and S3 Appendix ). We fix the percent exposed on the first day E 0 to remove ambiguity between having high ρ c with a low overall initial exposure versus low ρ c with a high overall initial exposure, which has only a small impact on the likelihood. Prior. We define a prior distribution over input parameters to our simulator. We choose to factor this distribution into the product of independent logistic normal distributions over each disease parameter, (24) In the f nseir disease simulator, the set of parameters that we will infer is θ = {ρ, β E , β I }. We treat the other inputs to the model as fixed hyperparameters, including the latency parameter γ = 0.143 and the recovery parameter λ = 0.072 (corresponding to a mean latency of 7 days and a mean recovery time of 14 days), which we base on clinical case studies [ 38 ].
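As an illustrative (assumed, not verbatim) sketch of this parameterization, the following Julia snippet draws θ = {ρ, β E , β I } from independent logistic-normal distributions of the form used for both the prior and the variational distribution; the helper names and argument layout are hypothetical.

# Sketch: sampling θ = {ρ_c, β_E, β_I} from independent logistic-normal
# distributions. A logistic-normal sample is the sigmoid of a Gaussian draw,
# which keeps each parameter inside [0, 1] as required above.
using Random

sigmoid(z) = 1 / (1 + exp(-z))
sample_logistic_normal(μ, σ; rng=Random.default_rng()) = sigmoid(μ + σ * randn(rng))

function sample_theta(μρ::Vector, σρ::Real, μE::Vector, σE::Real, μI::Vector, σI::Real)
    ρ  = [sample_logistic_normal(m, σρ) for m in μρ]  # initial exposure fraction per community
    βE = [sample_logistic_normal(m, σE) for m in μE]  # exposed-transmission value at each change point
    βI = [sample_logistic_normal(m, σI) for m in μI]  # infected-transmission value at each change point
    return (ρ=ρ, βE=βE, βI=βI)
end

Note that, as in the text, a single variance σρ is shared across communities and single variances σE and σI are shared across time points.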
Given these inputs, the simulator returns a sampled trajectory s that contains the number of infected nodes at each time, (23) Given a regional network and the stochastic disease simulator f nseir from Algorithm 4, we seek to fit the remaining free parameters of our simulator to regional data. To do so, we use BBVI (Section 2.2), as implemented in the Gen probabilistic programming system, to approximate the posterior over model parameters and initial conditions using a variational distribution. Linearly approximated transitions. The exponential transition formula ( Eq 20 ) can be linearly approximated. When using this approximation, we call our disease model Network-SEIR-Linear (NSEIR-Linear). In this approximation, the values and meanings of the edge weights W uv and disease parameters differ relative to the regular NSEIR model. We use the NSEIR-Linear model in our experiments with synthetic data since the values are more easily interpretable; in Eq 20 , increasing the weighted exposure term beyond a certain magnitude has a very small impact on the probabilities due to exponential decay. The full disease simulation procedure, which we refer to as f nseir , is described in Algorithm 4. The inputs to this model are the simulated regional network G (described by its vertices and weighted edges W), initial rates of exposure ρ c in each community, state transition parameters γ and λ, and values for β E and β I at N time points τ n , from which we define parameters at all other times t using linear interpolation. Here the terms affecting node v’s transition probability are defined based on its network weights to exposed neighbors and to infected neighbors, scaled by the time-dependent transmission parameters β E and β I , (21) Given a representative network instance as described above, we construct a stochastic disease state model as follows. Nodes transition between four states as in the traditional SEIR model: susceptible, exposed, infected, and removed. Let S t , E t , I t , and R t refer to the subset of nodes in each state at time t. We model exposure probability using an exponential distribution, (20) Edge weights. Given the edge structure described above, we assign edge weights by considering visit overlap duration within a chosen 1-week time window. The weight of edge (u, v) represents the total expected minutes for which individuals u and v might overlap at various POIs. For each POI p, SafeGraph provides the median visit duration d p . We build a visit duration tensor D of size C × N × M, where C is the number of CBGs, N is the number of POIs, and M is the total number of minutes POIs are open (we assume M = 10 hours per day for 7 days per week = 4,200 minutes). Specifically, for each POI p, we examine our visit count matrix L, and include L rp total visits from CBG r, letting each visit start at a uniform random minute in the interval (0, M − d p ) and last for d p minutes. An entry D rpt indicates the number of visitors from CBG r present at POI p in minute t. The weight assigned to an edge between nodes u and v is then given by the total minutes of overlap, (19) This gives us a DCSBM where the edge probability between two CBGs reflects how often individuals from those communities tend to visit the same POIs. From this DCSBM, we sample a network instance G with vertices and edges. Each vertex represents a simulated person, and belongs to some CBG i (where i ∈ {1, …, C}).
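For intuition, the following is an illustrative Julia sketch of edge sampling in a standard degree-corrected stochastic block model; our actual construction extends this form (for example with per-community degree correction and the household edges described next), so the helper below is a simplification rather than our implementation.

# Illustrative sketch of DCSBM edge sampling (standard form; the model used in
# the paper is an extension of this construction).
using Random

function sample_dcsbm(block::Vector{Int}, P::Matrix{Float64}, θ::Vector{Float64};
                      rng=Random.default_rng())
    n = length(block)                  # number of simulated people (vertices)
    edges = Tuple{Int,Int}[]
    for u in 1:n, v in (u+1):n
        # Edge probability: block-level probability for the two communities,
        # rescaled by the per-node degree-correction factors θ[u] and θ[v].
        p = min(1.0, θ[u] * θ[v] * P[block[u], block[v]])
        if rand(rng) < p
            push!(edges, (u, v))
        end
    end
    return edges
end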
Vertices are also randomly assigned to households (cliques) within each CBG based on census survey data describing the size distribution of households with 1 to 7 people, resulting in additional “household” edges [ 37 ]. We perform a degree-correction procedure on this network to yield a heterogeneous degree distribution between nodes in different communities. The parameter for degree-correction is sampled from a power-law function with exponent γ = 3, selected so that the node with the largest degree in each community has a degree in the range of 50–100. This is done to be consistent with other realistic social contact networks of similar sizes [ 34 , 35 , 36 ]. The correction procedure is done on a per-community basis, such that the degree sequence sampling takes into account the size and average degree of each CBG. Edge probabilities. To construct the block matrix P describing community structure, we compute each entry P rs from the cross-correlation score of the POI visit vectors of CBG r and CBG s . Let L be the visit count matrix for a specific week of data, where N is the number of POIs. The cross-correlation score is given by, (18) We choose the number of communities in the modeled network by analyzing the convergence of core topological properties such as network density and the number of triangles. The sizes of each community and the density of contact patterns within and across communities are selected to match census and SafeGraph data from the corresponding Census Block Groups. We use an extension of a Degree-Corrected Stochastic Block Model (DCSBM) to generate a synthetic contact network that captures the community structure and heterogeneous degree distribution of real networks. The DCSBM has two parameters: 1) a partition of vertices into C communities, and 2) a symmetric matrix of edge probabilities, where element P rs gives the probability of an edge existing between any two vertices belonging to communities r and s, respectively. For our model, we apply a degree correction procedure within each community and overlay edges that represent household interactions. We leverage census data to extract the household size distribution for each community. The first stage of constructing our disease simulator involves fitting a network model to regional mobility data. Our network model belongs to the popular category of mobility networks called spatial meta-population models [ 3 , 32 , 33 ]. We use mobility data from SafeGraph [ 31 ], consisting of opt-in, anonymized foot-traffic to points-of-interest (POIs) such as stores and schools across the United States. Devices in this data are assigned to a home Census Block Group (CBG), the smallest geographical unit for which population data is reported in the U.S. Census. For each region of interest, we collect one week of data beginning on February 17, 2020, shortly before widespread quarantines were instituted in the United States and elsewhere. We then select CBGs that contribute to the top 10% of mobility data available for each county during this time period. In some settings, the compartmental SEIR model includes natural rates of birth and death for the population; we omit these factors from our network model for two reasons. First, newborn individuals will not contribute meaningfully to the spread of infection on their own; rather, they can be treated as a sub-compartment of their caregivers. Second, individuals who die of natural causes make the overall mobility network slightly more sparse, and omitting this effect causes only a small error in our predictions.
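The compartmental SEIR dynamics referenced here, and reviewed in the following paragraph, can be integrated with a simple forward-Euler scheme. The sketch below is our reading of the usual textbook form of these equations (cf. Eq 17), using the paper's notation γ for the latency rate and λ for the removal rate; it is not the exact code used in this work.

# Forward-Euler integration of the standard SEIR equations without births or
# deaths. β: transmission rate, γ: latency rate (E -> I), λ: removal rate (I -> R).
function seir(S0, E0, I0, R0, β, γ, λ, T; dt=1.0)
    S, E, I, R = float(S0), float(E0), float(I0), float(R0)
    N = S + E + I + R                      # total population, fixed over time
    hist = [(S, E, I, R)]
    for _ in 1:round(Int, T / dt)
        dS = -β * S * I / N
        dE =  β * S * I / N - γ * E
        dI =  γ * E - λ * I
        dR =  λ * I
        S += dt * dS; E += dt * dE; I += dt * dI; R += dt * dR
        push!(hist, (S, E, I, R))
    end
    return hist
end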
We begin by briefly reviewing traditional global compartmental SEIR models in order to motivate the agent dynamics in our Network-SEIR model. In a SEIR model, the population is separated into four compartments representing Susceptible (S), Exposed (E), Infected (I), and Removed (R) individuals. These compartments are approximated by continuous values S(t), E(t), I(t), and R(t), where the total population size is fixed at N = S(t) + E(t) + I(t) + R(t). The population dynamics are modeled by the differential equations, (17) To calibrate our Network-SEIR model to a particular region, we seek to learn input parameters to this simulator that produce infection statistics matching the region’s true outcomes. We incorporate this agent-based model into a probabilistic program by defining a prior over these input parameters and a likelihood for reported case counts. The resulting probabilistic program defines a Bayesian posterior over parameters that we approximate using variational inference. In this work, we consider a disease simulation model that we call Network-SEIR (NSEIR), which comprises two components. The first is a network topology model, in the form of a degree-corrected stochastic block model (DCSBM [ 30 ]). This model describes contact patterns in the population. We obtain point estimates of network topology parameters using cell-phone co-location data [ 31 ] in order to produce a representative graph for simulations. The second component is an agent-based compartmental model that describes how disease spreads across this network topology to produce simulated infection statistics. 5 Results Our experiments evaluate the extent to which standard probabilistic programming methods, which have been implemented in a wide range of probabilistic programming systems, can be used to estimate parameters in the Network-SEIR model, a representative of agent-based models that are on the computationally intensive end of the spectrum. In particular, we examine whether our approach results in an approximation to the posterior that more accurately mirrors real-world disease spread, especially when compared to other commonly-used methods that assume simplified disease models or fitting techniques. We first validate the self-consistency and accuracy of our inference model. In Section 5.2, we show that our fitting procedure is well-calibrated by checking if our method is able to infer disease parameters when applied to synthetic data generated by our own simulator using known ground truth values. In Sections 5.3 and 5.4, we compare our method to other methods which use either simplified disease models or simplified fitting procedures common in the literature. We show our method is able to more accurately fit the complex dynamics of real-world data. Next, we perform a series of experiments to understand how the output of our inference procedure varies as we vary the regional network topology and regional disease statistics. In Section 5.5, we compare the inferred disease parameters found by applying our method to different regional networks and infection data. In Section 5.6, we explore the pattern of initial disease spread inferred by our model. Lastly, in Section 5.7, we perform sensitivity analysis for our inference procedure to choices made when modeling the network topologies. 5.1 Experiment setup Before describing the results of our experiments, we first describe data preprocessing, evaluation metrics, and the hyper-parameter tuning procedure. Regional infection data processing. 
We fit the parameters of our disease model to a particular geographic region by conditioning the model on the corresponding county's cumulative infection counts from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) dashboard [39]. We use data from February 29, 2020 until August 9, 2020. We preprocess this data by applying a 7-day rolling average to mitigate the effects of delayed reporting and weekly variation. We then select a 163-day window of investigation, beginning one average latency period (1/γ days) before the community infection count reached the chosen initial percentage of exposed individuals E 0 . We then rescale the infection counts to the number of nodes in our network topology. Instead of rescaling proportionately by the ratio of the regional population to the number of network nodes, we scale by a constant r such that the total infection count reaches approximately 50% of the nodes by the end of our simulation time-window (see Eq 25). We do this for two reasons. First, when working with heavily down-sampled graphs, and especially in areas or time periods with limited test information, the number of infections may be so low that the signal is difficult to resolve in simulation. Second, studies have shown a systematic under-counting of cases, for example by comparing the rates of positive PCR tests (which can detect active or recently cleared infection) to the rates of positive serum antibody tests (which can detect historical infection in individuals who were asymptomatic and may not have received a PCR test) [40]. More sophisticated approaches are certainly possible, such as using counts of hospitalizations and deaths (since severe and fatal cases may be less prone to under-counting problems), and then estimating a static or a time-varying fraction of severe and fatal cases. Likewise, it would be possible to model a daily latent variable representing testing rates, which would require conditioning the model on data describing the number of tests administered in a certain region. In this case study, we are primarily interested in evaluating the feasibility of applying probabilistic program inference methods, and we will therefore leave further refinement of the regional data model to future work. Evaluation metric. To evaluate the quality of fit for a given disease model, either with fixed disease parameters or a distribution over disease parameters, we generate multiple trajectories from our stochastic disease model, sampling from the disease parameter distribution if necessary. We then compute the mean daily absolute error (MDAE), (27) where x represents the true cumulative case counts, x̂ represents our inferred cumulative case counts using sampled parameters z, N represents the number of trajectories computed, T represents the time duration of each trajectory, and S represents the size of the regional population being simulated. MDAE measures the time- and population-normalized distance between the generated cumulative infection counts and true data to which the model was fit. We evaluate model quality in terms of infection counts rather than inferred parameters because our inference model is conditioned only on infection counts, and there exist multiple solutions for our overparametrized model whose output infection counts would be of similar quality. In the case of experiments on data generated from our own disease model, it is meaningful to compare the disease parameters used to generate the data with those we infer.
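For concreteness, a small Julia sketch of the evaluation metric as we read Eq 27 follows; it is a reconstruction from the definitions above, not the published code.

# Sketch of the mean daily absolute error (MDAE): the absolute gap between true
# and simulated cumulative case counts, averaged over N sampled trajectories and
# T days, and normalized by the simulated population size S.
function mdae(x, xhat, S)
    N = length(xhat)       # number of sampled trajectories
    T = length(x)          # number of days
    total = 0.0
    for traj in xhat, t in 1:T
        total += abs(x[t] - traj[t])
    end
    return total / (N * T * S)
end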
However, for experiments using real-world data, there are no a priori ground truth values for these disease parameters to compare our learned values to, since these disease parameters are intrinsic to our specific disease model. Though values for βE and βI are reported elsewhere in the literature, these are, in fact, average transmission rates under the homogeneous network assumption, which is valid only for a global compartmental model. In our model, we relax this assumption to take into account the heterogeneity of the network, and thus we cannot directly compare. Hyperparameter optimization. We perform a grid search over hyperparameters for our probabilistic model. This includes hyperparameters such as the learning rate and the number of samples per gradient step for the BBVI algorithm, the assumed observational noise, and the values for the prior over disease parameters. See Table A in S3 Appendix for the explored parameter ranges. Likelihood noise model. All experiments presented in this paper use the time-dependent noise model σx described in Sec. 4.3. We also considered likelihood models in which the noise scales with the magnitude of the observed case counts, (28) as well as constant or piecewise constant noise functions, (29) (30) These alternative approaches did not work as well in early experiments, and we did not pursue them further. 5.3 Network versus compartmental disease model Compartmental disease models are popular in the literature as they are lightweight, deterministic, and differentiable. There are thus many effective strategies for fitting compartmental models to data. Here we use a five-compartment SEIRD model, where R represents recovered and D represents deceased individuals, and fit the model to data using Certainty-Equivalent Expectation Maximization (CE-EM) [41]. To compare with our method, we combine the contents of the Recovered and Deceased compartments to approximate our Removed compartment. In Table 2, we see that our method achieves a better fit across multiple regions. Compartmental models lack expressivity and flexibility and are not able to capture the wide range of disease dynamics we see in different regions. In particular, for fitting data containing phenomena such as multi-wave infections and variable spread rates resulting from network structure, our agent-based or network-based model achieves much lower MDAE than the compartmental baseline. Table 2. Mean daily absolute error (MDAE, Eq 27 ) for various disease models and fitting procedures across several counties. BBVI fits the NSEIR model to real infection statistics better than Metropolis-Hastings or likelihood-weighted importance sampling. NSEIR also outperforms the compartmental SEIR model fit with CE-EM. https://doi.org/10.1371/journal.pcbi.1010591.t002 5.6 Inferring starting communities Since our variational distribution includes means for the proportion of initial exposure in each community c, we can interpret the learned values of these parameters as indicating which communities were likely to have had higher initial exposure given the observed disease data. Note that this is not the same as inferring the actual precise location of the initial exposure within a region, since our cumulative global infection data is too coarse to deduce this. In other words, there are many possible initial exposure scenarios which may result in similar aggregate infection data.
Rather, we can only conclude that the location of certain communities in the network topology is more consistent with observed disease dynamics. However, even having the ability to filter out potential source communities is a very useful feature in practical settings, especially when designing localized intervention strategies. In Fig 11, we see that for large observational noise ν, the inferred parameters are closer to the uniform prior, whereas for small observation noise, we find that a higher initial exposure in certain communities is more consistent with the observed data. Varying the size of our observation noise controls the sparsity of inferred starting conditions. As our noise distribution tightens, inference moves further from our uniform prior over initial community exposure rates. For example, in the top left, in Los Angeles county, we infer higher initial exposures in communities 11 and 17. Fig 11. Varying observation noise level controls sparsity of inferred starting conditions. As the noise distribution tightens, inference moves further from our uniform prior on initial community exposure rates. Low noise corresponds to ν = 0.00025, and high noise to ν = 0.0005. The network topology of each county is modeled using 20 communities which correspond to actual geographic areas. We plot ρ c for 1 ≤ c ≤ 20. In the left plots, we use ν = 0.00025, a tighter observational noise than the right plots where ν = 0.0005. https://doi.org/10.1371/journal.pcbi.1010591.g011 In Fig 12, we show the posterior distribution of ρ for each CBG. Recall that the total initial exposure level is fixed, so that the sampled value of ρ c only controls the fraction of initial exposure allocated to a given CBG. The posteriors shown here use the fit with the lowest MDAE from each region. Fig 12. Posterior distribution of ρ c for communities and of β E , β I at each change point during simulation. Note that CBGs have no correspondence across counties. https://doi.org/10.1371/journal.pcbi.1010591.g012 5.8 Sensitivity to time-varying network model We find that our model still fits well if we use time-varying edge weights to reflect changing mobility patterns in the network. In this case, the βE and βI parameters only reflect changing infectivity patterns. Transmission dynamics change as a result of at least two sources of variation. First, changes in mobility and interaction frequency may occur due to quarantines and business closures. Second, transmission probability during each contact may change, due to the use of personal protective equipment, hygiene practices, seasonal effects, and disease mutation. These non-pharmaceutical interventions may have a strong effect on the trajectory of the pandemic. In our model, as described in Sec. 4.3, we account for both of these sources of variation by allowing our disease transmission parameters βE and βI to vary over time. Alternatively, we can account for variation in mobility in the network itself, by varying the network across time. This leaves the parameters βE and βI to model the transmission probability during each contact. In this experiment, we evaluate whether using time-varying network connectivity can further improve the quality of fit. To do so, we construct time-varying networks as follows. Similar to Sec. 4.2.1, each estimate of network structure is formed from a week's worth of geo-location data.
At the beginning of each week of our disease simulator, we allow the network structure to vary. We begin with a network estimated from the first week of data as before. For subsequent weeks, we separately estimate a network instance for each week of data, containing the same CBGs and the same number of nodes in each CBG. All nodes are assigned by ID to the same households, to maintain an identity for each node, and other edges (both connectivity and weights) are allowed to vary freely. Note that this procedure allows for the possibility that nodes will change rank order (i.e., that the highest-degree node for one week may not be highest in another week), since constraining nodes to keep the same rank order may be unrealistic. Note also that we continue to allow disease parameters to vary over time; in this way, we expect that the network modeling can capture mobility-related effects, while the time-varying disease parameters can capture the behavioral and other effects mentioned above. In Fig 15, we find that the model successfully learns time-varying parameters that are relatively consistent with the observed data, achieving an MDAE of 0.00260. We hypothesize that the relatively noisy procedure used for node re-assignment contributes additional error to our model. Fig 15. Inferred cumulative infection statistics and SEIR curves for a network modeled with time-varying edge weights. Our model still finds parameters that approximately match the data, even when the network topology changes over time. Note that the model also compensates for the poor performance of the prior parameters. https://doi.org/10.1371/journal.pcbi.1010591.g015 Published article: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010591