Calibrate a model¶
MISCore features a built-in calibration module: miscore.tools.calibration.
It enables the user to construct a fitness function that can be optimized using an optimization algorithm of their choice.
This tutorial will illustrate how to use this module for a simple calibration of the EC process.
For this tutorial, all of the previous tutorials for Modellers are useful as their topics can be used to define efficient calibration targets.
Required imports¶
First, let’s import the required classes, utils and data from MISCore.
NumPy is imported to be able to define arrays.
We’ll also import the scipy.optimize.minimize() function, which will be used for the actual calibration.
1import numpy as np
2from scipy.optimize import minimize
3
4from miscore import Model, Universe
5from miscore.processes import Birth, EC, OC
6from miscore.processes.ec.data import us as ec_data
7from miscore.processes.oc.data import us_2017 as oc_data
8from miscore.tools.calibration import durations_from_result, \
9 events_from_result, Fit, PoissonDeviance, Target
Defining the model and its variables¶
The calibration will need to know what the model looks like for a given input x.
Therefore, we define the model_from_x() function, which returns a Model instance for any input x.
In this example, the hazard rate of EC is calibrated, as well as the age factors determining the ages of onset.
Note that the relative hazard between lesions with and without atypia is kept constant at 6.14.
12def model_from_x(x):
13 # Initialize Birth process
14 birth = Birth(
15 name="birth",
16 year=1980
17 )
18
19 # Initialize OC process
20 oc = OC(
21 name="oc",
22 life_table=oc_data.life_table_female,
23 end_year=2050
24 )
25
26 # EC hazard based on x
27 hazard = x[0]
28
29 age_factors = {
30 "without_atypia": [
31 (0, 0.0),
32 (30, x[1] * 6.142857143),
33 (40, x[2] * 6.142857143),
34 (50, x[3] * 6.142857143),
35 (60, x[4] * 6.142857143),
36 (70, x[5] * 6.142857143),
37 (80, x[6] * 6.142857143),
38 (100, x[7] * 6.142857143),
39 ],
40 "with_atypia": [
41 (0, 0.0),
42 (30, x[1]),
43 (40, x[2]),
44 (50, x[3]),
45 (60, x[4]),
46 (70, x[5]),
47 (80, x[6]),
48 (100, x[7]),
49 ]
50 }
51
52 # Initialize EC process
53 ec = EC(
54 name="ec",
55 hazard=hazard,
56 age_factors=age_factors,
57 dwell_hyperplasia=ec_data.dwell_hyperplasia,
58 dwell_preclinical=ec_data.dwell_preclinical,
59 cure=ec_data.cure,
60 time_to_death=ec_data.time_to_death,
61 onset_method="before_oc"
62 )
63
64 # Define universe
65 universe = Universe(
66 name="universe",
67 processes=[birth, oc, ec]
68 )
69
70 # Create model
71 model = Model(
72 universes=[universe]
73 )
74
75 # Return model
76 return model
Note
Note that, for many calibration tasks your targets will be based on observed evidence (incidence for example)
limited to a certain calendar year or range of years.
In this case, you’re not really interested in model outcomes after this year and to speed up calibration you can
stop the simulation early by adding an end_year argument to the OC initialization.
Similarly, if your target data only considers cases or events up to a certain age, you could use end_age.
This will limit the total life-years simulated and speed up a single model run.
Since calibration generally involves many iterations of model evaluations, this could save a lot of calibration
time in the end.
See also the tutorial on Life tables.
Defining the calibration targets¶
The Target class should be used to define calibration targets.
You might want to calibrate the model on observed incidence.
In this example, we assume an absolute incidence of 0, 0, 5 and 10 for the age groups 40-49, 50-59, 60-69 and 70-79, respectively.
The total number of individuals in these age groups are 10000, 9500, 8000 and 6000.
The Target will need to know how to extract the simulated outcomes from each obtained
Result.
You should pass functions to calculate these outcomes as result_to_sim and result_to_m arguments.
The events_from_result() and durations_from_result()
functions can easily be used to construct such functions.
Recall that events and durations can be logged in multiple ways, see the basic tutorial section
Logging.
In this calibration example, we assume that the simulation events were logged in age groups of ten years (similar to
the note in Logging): all events/durations in e.g. the age group 40-49 are stored at the age 40.
Then the simulation outcomes of interest are the total number of ec_clinical events at ages 40, 50, 60 and 70.
The corresponding number of people alive is calculated as the total number of life-years spent in these age groups
(logged by the Birth process as life).
Note that universe=0 ensures the events and durations are calculated based on the outcomes of the first universe.
We assume this incidence follows a Poisson distribution. Therefore, we’ll use Poisson deviance to measure the goodness of fit for this target.
You can also specify the weight which is a factor with which the goodness of fit is multiplied. If the weight parameter is omitted, it is set to the default value of 1.
79# Define an incidence target
80target_incidence = Target(
81 name="Incidence",
82 x=[35, 45, 55, 65],
83 obs=[0, 0, 5, 10],
84 n=[10000, 9500, 8000, 6000],
85 result_to_sim=events_from_result(
86 tags=[["ec_clinical_with_atypia", "ec_clinical_without_atypia"]],
87 ages=[[40.], [50.], [60.], [70.]]
88 ),
89 result_to_m=durations_from_result(
90 tags=[["life"]],
91 ages=[[40.], [50.], [60.], [70.]]
92 ),
93 goodness_of_fit=PoissonDeviance(),
94 weight=1.0
95)
Note
Note that, if events and durations are summed per age, every age in the age groups should be explicitly mentioned
in the argument ages in events_from_result() and
durations_from_result().
The code below demonstrates this.
events_from_result(
tags=[["ec_clinical"]],
ages=[
list(range(40, 50, 1)),
list(range(50, 60, 1)),
list(range(60, 70, 1)),
list(range(70, 80, 1))
]
)
Extra options for the calibration targets¶
Observations by year:
You can also calibrate to yearly observations instead of observations by age.
In that case, use the parameter years in the events_from_result() and
durations_from_result() methods.
Also, don’t forget to specify event_years and duration_years in the Fit
class which is introduced in the upcoming subsection.
target_incidence = Target(
name="Incidence",
x=[2025, 2035, 2045, 2055],
obs=[0, 0, 5, 10],
n=[10000, 9500, 8000, 6000],
result_to_sim=events_from_result(
tags=[["ec_clinical"]],
years=[[2020.], [2030.], [2040.], [2050.]]
),
result_to_m=durations_from_result(
tags=[["life"]],
years=[[2020.], [2030.], [2040.], [2050.]]
),
goodness_of_fit=PoissonDeviance()
)
Conditions: It is also possible to add conditions to the targets. For example, if you want to calibrate to the incidence in the subpopulation in stratum 1, in the years 1985 to 1989, use the following:
target_incidence = Target(
name="Incidence",
x=[45, 55, 65, 75],
obs=[0, 0, 5, 10],
n=[10000, 9500, 8000, 6000],
result_to_sim=events_from_result(
tags=[["ec_clinical"]],
ages=[[40.], [50.], [60.], [70.]],
conditions=[("stratum", "==", 1), ("year", ">=", 1985), ("year", "<", 1990)]
),
result_to_m=durations_from_result(
tags=[["life"]],
ages=[[40.], [50.], [60.], [70.]],
conditions=[("stratum", "==", 1), ("year", ">=", 1985), ("year", "<", 1990)]
),
goodness_of_fit=PoissonDeviance()
)
Note that only the events/durations that adhere to all conditions are selected. A condition always consists of three items:
A string that is any column in the durations/events/snapshots DataFrame;
A string that contains an operator (<=, >=, <, >, ==, !=);
A value that corresponds with the column in the first item, e.g. the name of a universe (string), an age or a year (numbers).
Weight by category: You can also specify the weight for each age category separately. In this calibration example, the goodness of fit for the age group 40-49 is multiplied by 1.0, whereas the goodness of fit for the age group 50-59 is multiplied by 2.0.
target_incidence = Target(
name="Incidence",
x=[45, 55, 65, 75],
obs=[0, 0, 5, 10],
n=[10000, 9500, 8000, 6000],
result_to_sim=events_from_result(
tags=[["ec_clinical"]],
ages=[[40.], [50.], [60.], [70.]]
),
result_to_m=durations_from_result(
tags=[["life"]],
ages=[[40.], [50.], [60.], [70.]]
),
goodness_of_fit=PoissonDeviance(),
weight=[1.0, 2.0, 3.0, 4.0]
)
Calibrate to snapshots:
You can also calibrate to snapshots instead of events/durations.
For example, this is useful when calibrating to prevalences.
Use the snapshots_ages_from_result() and
snapshots_years_from_result() functions.
target_incidence = Target(
name="Snapshot",
x=[45, 55, 65, 75],
obs=[0, 0, 5, 10],
n=[10000, 9500, 8000, 6000],
result_to_sim=snapshots_ages_from_result(
tags=[["snapshot"]],
ages=[[40.], [50.], [60.], [70.]]
),
result_to_m=durations_from_result(
tags=[["life"]],
ages=[[40.], [50.], [60.], [70.]]
),
goodness_of_fit=PoissonDeviance(),
)
Multiple categories:
PoissonDeviance supports multiple categories (i.e. multinomial Poisson deviance).
For example, if the categories would be EC clinical in stages 1 and 2, and stages 3 and 4, we can adjust the tags
and corresponding observations obs as follows.
target_incidence = Target(
name="Incidence by stage",
x=[45, 55, 65, 75],
obs=[[0, 0], [0, 0], [2, 3], [3, 7]],
n=[10000, 9500, 8000, 6000],
result_to_sim=snapshots_ages_from_result(
tags=[
["ec_clinical_stage1", "ec_clinical_stage2"],
["ec_clinical_stage3", "ec_clinical_stage4"]
],
ages=[[40.], [50.], [60.], [70.]]
),
result_to_m=durations_from_result(
tags=[["life"]],
ages=[[40.], [50.], [60.], [70.]]
),
goodness_of_fit=PoissonDeviance(),
)
Note
When calibrating non-integer outcomes (e.g. durations), the MultinomialDeviance
function needs a diff argument to avoid raising an error due to rounding of non-integer simulated outcomes.
The user can specify the maximum allowed difference between the sums of sim and m in each period.
goodness_of_fit=MultinomialDeviance(diff=0.01)
Constructing the fitness function¶
With the model and targets defined, we can now create a Fit instance to use as fitness function.
Note that while we only use a single target here, it is possible to add any number of targets.
Any other keyword arguments used to create the Fit instance are passed to the
run() method of the Model instances created during the calibration.
Thus, we need to pass at least the argument n.
We fix all process seeds across all runs by passing seeds_properties and seeds_properties_tmp.
Furthermore, to be able to calculate the simulated incidence, events and durations need to be logged for the age groups
corresponding to the observed data.
This is taken care of by the event_ages and duration_ages arguments.
97# Initialize fitness function
98fit = Fit(
99 model_from_x=model_from_x,
100 targets=[target_incidence],
101 n=1e3,
102 seeds_properties={
103 "oc": 3532,
104 "ec": 9721
105 },
106 seeds_properties_tmp={
107 "ec": 4631
108 },
109 event_ages=range(40, 81, 10),
110 duration_ages=range(40, 81, 10)
111)
Note
Make sure to specify your snapshot_functions, snapshot_tags and snapshot_years/snapshot_ages here if you calibrate to snapshots.
Optimizing the fitness function¶
In the example, the Nelder-Mead algorithm (with adaptive parameters), as implemented in SciPy (scipy.optimize.minimize()),
is used to perform the actual calibration.
This algorithm requires an initial guess x0.
113# Initial guess
114x0 = np.array([1.5, 0.1, 0.1, 0.2, 0.5, 1.0, 1.0, 0.8, 0.5])
115
116# Calibrate the model
117minimize(
118 fun=fit,
119 x0=x0,
120 method="Nelder-Mead",
121 options={
122 "adaptive": True,
123 "maxiter": 10,
124 "disp": True
125 }
126)
Note
The maximum number of iterations is set to 10 here. Obviously, you will want to change or omit this setting for an actual calibration.
Note
Note that algorithms other than Nelder-Mead can be used to optimize the fitness function. For example, consider the DEAP package for evolutionary algorithms.
Evaluating the result¶
The Fit instance stores the best inputs and corresponding results in its best attribute.
Each element of this list is a tuple with three elements: the fitness value, input array and Result.
Thus, this example shows how to obtain the best found Result.
Note
If you are using multiprocessing in an outer calibration loop which divides the Fit calls
across multiple workers, the Fit.found and Fit.best attributes will give unreliable results.
Each Target can be plotted with confidence intervals using its
plot() method.
Let’s plot our incidence target with Poisson confidence intervals.
We also pass the best found Result, so we can compare it with the observation data in the plot.
128# Obtain best result
129best_result = fit.best[0][2]
130
131# Plot target observation data with confidence interval against best result
132target_incidence.plot(
133 result=best_result,
134 confidence_interval=target_incidence.confidence_interval(
135 alpha=.05,
136 distribution="poisson",
137 method="normal"
138 )
139).savefig("incidence.png")
Note
The argument marker_simulated_values in the plot() function can be
used to specify a marker for the simulated values.
For single simulated values (e.g. a single year or age was calibrated), a marker is used automatically.
A list of possible markers can be found here https://matplotlib.org/stable/api/markers_api.html .
target_incidence.plot(
result=best_result,
confidence_interval=target_incidence.confidence_interval(
alpha=.05,
distribution="poisson",
method="normal"
),
marker_simulated_values="p"
).savefig("incidence.png")
The final code¶
The final code is given below.
1import numpy as np
2from scipy.optimize import minimize
3
4from miscore import Model, Universe
5from miscore.processes import Birth, EC, OC
6from miscore.processes.ec.data import us as ec_data
7from miscore.processes.oc.data import us_2017 as oc_data
8from miscore.tools.calibration import durations_from_result, \
9 events_from_result, Fit, PoissonDeviance, Target
10
11
12def model_from_x(x):
13 # Initialize Birth process
14 birth = Birth(
15 name="birth",
16 year=1980
17 )
18
19 # Initialize OC process
20 oc = OC(
21 name="oc",
22 life_table=oc_data.life_table_female,
23 end_year=2050
24 )
25
26 # EC hazard based on x
27 hazard = x[0]
28
29 age_factors = {
30 "without_atypia": [
31 (0, 0.0),
32 (30, x[1] * 6.142857143),
33 (40, x[2] * 6.142857143),
34 (50, x[3] * 6.142857143),
35 (60, x[4] * 6.142857143),
36 (70, x[5] * 6.142857143),
37 (80, x[6] * 6.142857143),
38 (100, x[7] * 6.142857143),
39 ],
40 "with_atypia": [
41 (0, 0.0),
42 (30, x[1]),
43 (40, x[2]),
44 (50, x[3]),
45 (60, x[4]),
46 (70, x[5]),
47 (80, x[6]),
48 (100, x[7]),
49 ]
50 }
51
52 # Initialize EC process
53 ec = EC(
54 name="ec",
55 hazard=hazard,
56 age_factors=age_factors,
57 dwell_hyperplasia=ec_data.dwell_hyperplasia,
58 dwell_preclinical=ec_data.dwell_preclinical,
59 cure=ec_data.cure,
60 time_to_death=ec_data.time_to_death,
61 onset_method="before_oc"
62 )
63
64 # Define universe
65 universe = Universe(
66 name="universe",
67 processes=[birth, oc, ec]
68 )
69
70 # Create model
71 model = Model(
72 universes=[universe]
73 )
74
75 # Return model
76 return model
77
78
79# Define an incidence target
80target_incidence = Target(
81 name="Incidence",
82 x=[35, 45, 55, 65],
83 obs=[0, 0, 5, 10],
84 n=[10000, 9500, 8000, 6000],
85 result_to_sim=events_from_result(
86 tags=[["ec_clinical_with_atypia", "ec_clinical_without_atypia"]],
87 ages=[[40.], [50.], [60.], [70.]]
88 ),
89 result_to_m=durations_from_result(
90 tags=[["life"]],
91 ages=[[40.], [50.], [60.], [70.]]
92 ),
93 goodness_of_fit=PoissonDeviance(),
94 weight=1.0
95)
96
97# Initialize fitness function
98fit = Fit(
99 model_from_x=model_from_x,
100 targets=[target_incidence],
101 n=1e3,
102 seeds_properties={
103 "oc": 3532,
104 "ec": 9721
105 },
106 seeds_properties_tmp={
107 "ec": 4631
108 },
109 event_ages=range(40, 81, 10),
110 duration_ages=range(40, 81, 10)
111)
112
113# Initial guess
114x0 = np.array([1.5, 0.1, 0.1, 0.2, 0.5, 1.0, 1.0, 0.8, 0.5])
115
116# Calibrate the model
117minimize(
118 fun=fit,
119 x0=x0,
120 method="Nelder-Mead",
121 options={
122 "adaptive": True,
123 "maxiter": 10,
124 "disp": True
125 }
126)
127
128# Obtain best result
129best_result = fit.best[0][2]
130
131# Plot target observation data with confidence interval against best result
132target_incidence.plot(
133 result=best_result,
134 confidence_interval=target_incidence.confidence_interval(
135 alpha=.05,
136 distribution="poisson",
137 method="normal"
138 )
139).savefig("incidence.png")