This app illustrates how to fit a mechanistic dynamical model to data and how to use simulated data to evaluate if it is possible to fit a specific model.
For this app, weekly mortality data from the 1918 influenza pandemic in New York City is used. The data comes from (Mills, Robins, and Lipsitch 2004). You can read a bit more about the data by looking at its help file with help('flu1918data').
Alternatively, the model itself can be used to generate artificial data. We can then fit the model to this model-generated artificial data. This is useful for diagnostic purposes, as you will learn by going through the tasks for this app.
The underlying model that is being fit is a version of the basic SIR model. Since the available data is mortality, we need to keep track of dead individuals in the model, too. This can be achieved by including an additional compartment and letting a fraction of infected individuals move into the dead compartment instead of the recovered compartment.
Flow diagram for an SIR model with deaths.
The equations for the model are given by
\[ \begin{aligned} \dot S & = -bSI \\ \dot I & = bSI - gI \\ \dot R & = (1-f)gI \\ \dot D & = fgI \end{aligned} \]
Since the individuals in the R compartment are not tracked in the data and do not further influence the model dynamics, we implement the model without the R compartment, i.e., the simulation only runs the following equations:
\[ \begin{aligned} \dot S & = -bSI \\ \dot I & = bSI - gI \\ \dot D & = fgI \end{aligned} \]
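As an illustration of how such a model can be implemented, here is a minimal sketch in R using the deSolve package. This is not the app's actual code; the parameter values and initial conditions are made-up assumptions for illustration only.

```r
# Minimal sketch of the S/I/D model from the equations above, using deSolve.
# Parameter names (b, g, f) and variables (S, I, D) follow the equations;
# all numeric values are illustrative assumptions.
library(deSolve)

sid_ode <- function(t, y, parms) {
  with(as.list(c(y, parms)), {
    dS <- -b * S * I
    dI <-  b * S * I - g * I
    dD <-  f * g * I
    list(c(dS, dI, dD))
  })
}

y0    <- c(S = 5e6, I = 1, D = 0)      # assumed initial conditions
parms <- c(b = 1e-6, g = 1, f = 0.01)  # assumed illustrative parameter values
times <- seq(0, 20, by = 1)            # time in weeks (see below)
out   <- ode(y = y0, times = times, func = sid_ode, parms = parms)
```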
The data is reported in new deaths per week per 100,000 individuals. Our model tracks cumulative, not new deaths. The easiest way to match the two is to sequentially add the weekly deaths in the data to compute cumulative deaths for each week. We can then fit that quantity directly to the model variable D. Adjustment for population size is also needed, which is done by dividing the reported death rate by 100,000 and multiplying by the population size. This is further discussed in the tasks.
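As a sketch of this conversion (the vector name below is hypothetical, and the population size is an assumed value, not necessarily the one used by the app):

```r
# weekly_deaths_per100k: hypothetical vector of reported new deaths per week per 100,000
popsize       <- 5.5e6                                  # assumed population size
weekly_deaths <- weekly_deaths_per100k / 1e5 * popsize  # convert rate to absolute deaths
cum_deaths    <- cumsum(weekly_deaths)                  # cumulative deaths, comparable to D
```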
The app fits the model by minimizing the sum of squared residuals (SSR) between the model predictions for cumulative deaths and the cumulative number of reported deaths across all data points, i.e. \[ SSR= \sum_t (D_t - D^{data}_t)^2 \] where the sum runs over the times at which data was reported.
It is also possible to set the app to fit the difference between the logarithm of data and model, i.e. \[ SSR= \sum_t (\log(D_t) - \log(D^{data}_t))^2 \]
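For concreteness, if the model predictions and the data at the reporting times are stored in two vectors (the names below are hypothetical), both versions of the SSR are one-liners:

```r
# D_model: model-predicted cumulative deaths at the data time points (hypothetical name)
# D_data:  cumulative deaths computed from the reported data (hypothetical name)
ssr_lin <- sum((D_model - D_data)^2)            # SSR on the linear scale
ssr_log <- sum((log(D_model) - log(D_data))^2)  # SSR on the log scale
```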
The choice to fit the data or the log of the data depends on the biological setting. Sometimes one approach is more suitable than the other. In this case, both approaches might be considered reasonable.
The app reports the final SSR for the fit.
While minimizing the sum of square difference between data and model prediction is a very common approach, it is not the only one. A more flexible formulation of the problem is to define a likelihood function, which is a mathematical object that compares the difference between model and data and has its maximum for the model settings that most closely describe the data. Under certain assumptions (e.g., normally distributed measurement errors with constant variance), maximizing the likelihood and minimizing the sum of squares are the same problem. Further details on this are beyond the basic introduction we want to provide here. Interested readers are recommended to look further into this topic, e.g., by reading about (maximum) likelihood on Wikipedia.
A computer routine does the minimization of the sum of squares. Many such routines, generally referred to as optimizers, exist. For simple problems, e.g., fitting a linear regression model to data, any of the standard routines work fine. For the kind of minimization problem we face here, which involves a differential equation, it often makes a difference which numerical optimizer routine one uses. R has several packages for that purpose. In this app, we make use of the optimizer algorithms called COBYLA, Nelder-Mead, and Subplex from the nloptr package. This package provides access to a large number of optimizers and is a good choice for many optimization/fitting tasks. For more information, see the help files for the nloptr package and especially the nlopt website.
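As a rough sketch of what such a call can look like (this is not the app's actual code; ssr_fct is assumed to be a user-written function that runs the ODE model for a given parameter vector and returns the SSR, and all starting values and bounds are illustrative):

```r
library(nloptr)

# ssr_fct(x) is assumed to simulate the ODE model with parameters x = c(b, g, f)
# and return the SSR between model prediction and data (definition not shown).
fit <- nloptr(
  x0     = c(b = 1e-6, g = 1, f = 0.01),       # illustrative starting values
  eval_f = ssr_fct,
  lb     = c(1e-9, 0.1, 1e-4),                 # illustrative lower bounds
  ub     = c(1e-3, 10, 1),                     # illustrative upper bounds
  opts   = list(algorithm = "NLOPT_LN_SBPLX",  # or NLOPT_LN_COBYLA / NLOPT_LN_NELDERMEAD
                maxeval   = 1000)              # iteration limit ('hard stop', see below)
)
fit$solution   # best-fit parameter values
fit$objective  # SSR at the best fit
```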
For any problem that involves fitting ODE models to data, it is often important to try different numerical routines and different starting points to ensure results are consistent. This will be discussed a bit in the tasks.
Since the data is in weeks, we also run the model in units of weeks.
Generally, with increasing iterations, the fits get better. A fitting step or iteration is essentially an ‘attempt’ by the underlying code to find the best possible model, so increasing the number of tries usually improves the fit. In practice, one should not specify a fixed number of iterations; we do it here so things run reasonably fast. Instead, one should let the solver run until it can’t find a way to improve the fit any further (i.e., can’t further reduce the SSR). The technical expression for this is that the solver has converged to the solution. This can be done with the solver used here (the nloptr R package), but it would take too long, so we implement a “hard stop” after the specified number of iterations.
Ideally, with enough iterations, all solvers should reach the best fit with the lowest possible SSR. In practice, that does not always happen. Often it depends on the starting conditions. Let’s explore whether starting values matter.
In general, picking good starting values is essential. One can get them by trying an initial visual fit, or by doing several short fits and using the best-fit values from those as starting values for a new fit. Especially if you want to fit multiple parameters, optimizers can ‘get stuck’. If they get stuck, even running them for a long time might not find the best fit. One way an optimizer can get stuck is when it finds a local optimum. The local optimum is a good fit, and as the solver varies the parameters, each new fit is worse, so the solver “thinks” it found the best fit, even though better fits exist further away in parameter space. Many solvers, even so-called ‘global’ solvers, can get stuck. Unfortunately, we never know whether the reported solution is the true best fit or whether the solver is stuck in a local optimum. One way to probe this is to try different solvers and different starting conditions, and let each one run for a long time. If all return the same answer, no matter what type of solver you use and where you start, it’s quite likely (though not guaranteed) that we found the overall best fit (lowest SSR).
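A minimal sketch of such a multi-start check, reusing the hypothetical ssr_fct and the illustrative bounds from the optimizer sketch above:

```r
set.seed(123)  # for reproducible random starting values
starts <- replicate(5, c(b = runif(1, 1e-7, 1e-5),
                         g = runif(1, 0.5, 2),
                         f = runif(1, 0.001, 0.1)),
                    simplify = FALSE)
fits <- lapply(starts, function(x0)
  nloptr::nloptr(x0 = x0, eval_f = ssr_fct,
                 lb = c(1e-9, 0.1, 1e-4), ub = c(1e-3, 10, 1),
                 opts = list(algorithm = "NLOPT_LN_NELDERMEAD", maxeval = 1000)))
sapply(fits, function(f) f$objective)  # if these SSRs differ a lot, some runs likely got stuck
```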
The problem of ‘getting stuck’ frequently happens when trying to fit ODE models, in contrast to fitting more standard models (e.g., a linear regression model), where it is not a problem. The technical reason for this is that a simple regression optimization problem is convex, while fitting an ODE model usually is not. That’s why you don’t have to worry about whether you found the right solution when you use the lm or glm functions for fitting in R. When fitting more complicated models such as ODEs or similar models, you do have to carefully check that the “best fit” is not the result of the optimizer getting stuck in a local optimum.
In the previous task, you learned that for the SSR computation, differences involving large data and model values, which themselves tend to be larger in magnitude, often dominate the SSR expression. As an extreme example, suppose you fit 2 data points: one is 1E10 and the model predicts 1.1E10, a difference of 1E9. The second data point is 1E7 and the model predicts 1E6. This is, in a relative sense, a more significant discrepancy, but the difference is only 9E6, much smaller than the 1E9 for the first data point.
One way to give data points of different magnitudes more comparable weights is to fit the log of the data. Note that fitting the data and fitting the log of the data are different approaches, and the choice should be made based on scientific/biological rationale.
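To see this concretely with the two example data points from above, compare their squared contributions on the linear and on the log scale:

```r
# Squared differences on the linear scale: the first (large) point dominates.
(1.1e10 - 1e10)^2            # first data point:  1e18
(1e7    - 1e6)^2             # second data point: 8.1e13
# On the log scale, the relative (10-fold) discrepancy of the second point dominates.
(log(1.1e10) - log(1e10))^2  # first data point:  ~0.009
(log(1e7)    - log(1e6))^2   # second data point: ~5.3
```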
Again, fitting on either a linear or a log scale can be a reasonable approach, and the choice should be made based on the underlying biology/science. A good rule of thumb is that if the data spans several orders of magnitude, fitting on the log scale is probably the better option. You probably already realized that you cannot, of course, compare the SSR between the two fitting approaches. SSR values only make sense to compare once you have settled on the scale for fitting.
One consideration when fitting these kinds of mechanistic models to data is the balance between data availability and model complexity. The more data you have, and the “richer” it is, the more parameters you can estimate and, therefore, the more detailed a model can be. If you ‘ask too much’ from the data, you run into the problem of overfitting. Overfitting can be thought of as trying to estimate more parameters than can be robustly estimated for a given dataset. One way to safeguard against overfitting is to probe whether the model can recover parameter values that are known. To do so, we run our model with specific parameter values and simulate data. We can then fit the model to this simulated data. If everything works, we expect that, ideally independent of the starting values for our solver, we end up with estimated best-fit parameter values that agree with the ones we used to simulate the artificial data. We’ll try this now with the app.
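Outside the app, this parameter-recovery check can be sketched as follows, reusing sid_ode, y0, and times from the earlier deSolve sketch and the hypothetical ssr_fct/nloptr call from above:

```r
# 1. Simulate artificial data with known ('true') parameter values.
true_parms <- c(b = 1e-6, g = 1, f = 0.01)  # the chosen 'truth' (illustrative)
sim        <- deSolve::ode(y = y0, times = times, func = sid_ode, parms = true_parms)
sim_data   <- sim[, "D"]                    # noise-free artificial cumulative deaths

# 2. Fit the model to sim_data (same kind of nloptr call as before, with ssr_fct
#    now computing the SSR against sim_data). If the parameters are identifiable
#    from this kind of data, fit$solution should be close to true_parms and
#    fit$objective should be close to 0.
```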
Let’s see if the fitting routine can recover parameters from a simulation if we start with different initial/starting values.
Theory suggests that if we run enough iterations, we should obtain a best fit with an SSR close to 0 and best fit values that agree with those used to generate the artificial data. You might find that this does not happen for all 3 solvers within a reasonable number of iterations. For instance, using solver 2 and 1000 iterations should get you pretty close to what you started with.
That indicates that you can potentially estimate these parameters with that kind of data, at least if there is no noise. This is the most basic test. If you can’t get the best fit values to be the same as the ones you used to make the data, it means you are trying to fit more parameters than your data can support, i.e., you are overfitting. At that point, you will have to either get more data or reduce your fitted parameters. Reducing fitted parameters can be done by either fixing some parameters based on biological a priori knowledge or by reducing the number of parameters through model simplification.
Note that since you now alter the data (by adding noise) after you simulated it, you don’t expect the parameter values used for the simulation and those you obtain from your best fit to be the same. However, if the noise is not too large, you expect them to be similar.
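A simple way to add such noise to the simulated data, as a sketch (the noise model and its magnitude here are arbitrary assumptions, not those used by the app):

```r
set.seed(42)
noise_level <- 0.1   # assumed relative noise level
noisy_data  <- sim_data * exp(rnorm(length(sim_data), mean = 0, sd = noise_level))
# Refitting to noisy_data should give estimates near, but not exactly at, true_parms.
```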
You will likely find that for certain combinations of simulated data, noise added, and specific starting conditions, you might not get estimates that are close to those you used to create the data. This suggests that even for this simple model with 3 parameters, estimating those 3 parameters based on the available data is not straightforward.
The underlying function that runs the fitting for this app is called simulate_fit_flu. You can call it directly, without going through the shiny app. Use the help() command for more information on how to use the function directly. If you go that route, you need to take the results returned from the function and produce useful output (such as a plot) yourself.
You can read more about the package and how to use its functions directly by typing vignette('DSAIDE') into the R console.
A good source for fitting models in R is (Bolker 2008). Note, though, that the focus there is on ecological data, and ODE-type models are not/barely discussed.
Functionality for fitting stochastic models is provided by the pomp package in R. (If you don’t know what stochastic models are, check out the stochastic apps in DSAIDE.)
If you want to learn more about the data, flu1918data, you can read more about it by looking at its help file entry, help(flu1918data). The publication from which the data comes is (Mills, Robins, and Lipsitch 2004).
Bolker, Benjamin M. 2008. Ecological Models and Data in R. Princeton University Press.
Hilborn, Ray, and Marc Mangel. 1997. The Ecological Detective: Confronting Models with Data. Monographs in Population Biology 28. Princeton, N.J.: Princeton University Press.
Miao, Hongyu, Xiaohua Xia, Alan S. Perelson, and Hulin Wu. 2011. “On Identifiability of Nonlinear ODE Models and Applications in Viral Dynamics.” SIAM Review 53 (1): 3. https://doi.org/10.1137/090757009.
Mills, Christina E., James M. Robins, and Marc Lipsitch. 2004. “Transmissibility of 1918 Pandemic Influenza.” Nature 432 (7019): 904–6. https://doi.org/10.1038/nature03063.