Cost-effectiveness analysis¶

This tutorial explains how to use the cost-effectiveness analysis tool presented in cea. The tool calculates the costs, QALYs gained or any other label of one or multiple Universe instances. Moreover, it is possible to use this tool for probabilistic sensitivity analysis (PSA) and univariate sensitivity analysis (USA) to incorporate uncertainty in cost and QALY estimates. It can also find the (strongly or weakly) dominating Universe instances and plot them. This allows simple post-processing of MISCore simulations.

We discuss:

How to write a data file to specify the costs/effects of each logged event or duration;
How to implement a regular CEA and find dominating strategies;
How to do a probabilistic sensitivity analysis;
How to do a univariate sensitivity analysis;
How to use implement different QALY values for age and sex groups.

Note

If you use built-in data for the costs and effects, you can skip the first part on the CEA data.

The starting point of any cost-effectiveness analysis is to have events and durations results from which you can calculate costs and benefits of a given scenario. As an example, you may use the following model run in which we compare a scenario of screening for endometrial cancer with a non-screening scenario. You may run it yourself to obtain the dataframes used in the examples.

Show/hide full code

import pandas as pd

from miscore import Model, processes, Universe
from miscore.tools import cea
from miscore.tools.cea.data import population_norms

# Birth process
birth = processes.Birth(
    name="birth",
    year=1975
)

# OC process
oc = processes.OC(
    name="oc",
    life_table=processes.oc.data.us_2009.life_table,
)

# EC process
ec = processes.EC.from_data(processes.ec.data.us)

# EC screening process
ec_screening = processes.EC_screening.from_data(processes.ec_screening.data.example)

# Universes
no_screening = Universe(
    name="no_screening",
    processes=[
        birth,
        oc,
        ec
    ]
)

screening = Universe(
    name="screening",
    processes=[
        birth,
        oc,
        ec,
        ec_screening
    ]
)

# Model
model = Model(
    universes=[
        no_screening,
        screening
    ]
)

# Run model
result = model.run(
    n=1e5,
    seed=123,
    event_ages=range(0, 101, 1),
    duration_ages=range(0, 101, 1),
    event_years=range(1975, 2075, 1),
    duration_years=range(1975, 2075, 1),
    verbose=True
)

The complete code used in this tutorial is shown at the bottom of this page, you may also download the .py file cea_example_run.py.

Data file¶

In order to calculate the costs/effects of strategies, they must be specified in a data file. A data file consists of up to three elements (one of them is optional).

Labels¶

The post-processing code allows to calculate multiple quantities. For example the costs, the (Quality-Adjusted) Life-Years (QALY/LY) or the number of hysterectomies in each Universe. The name of each labels to be measured has to be specified in the parameter labels in a list.

labels = ("costs", "LY", "QALY", "n_hysterectomies")

Discount rates (optional)¶

The data can be discounted (as will be discussed later in this tutorial). It is possible to either specify one discount rate for all quantities or specify separate rates for each label. In the latter case, the rates have to be specified in the data file with a list:

discount_rates = (0.03, 0.03, 0.03, 0.00)

Note that the order of the rates must correspond with the order in labels. If the discount rate is the same for all quantities, discount_rates can simply be a floating point number.

Costs and effects data¶

In the data file, costs and effects should be specified for tags that are logged in the events, durations and snapshots tables. The costs/effects for a tag are specified in one or multiple data sets. The use of multiple datasets in the costs and effects specification allows varying costs and utilities by subpopulation. Each data set is a dictionary within the dictionary ce which consists of up to five elements: conditions, durations, events, snapshot_years and snapshot_ages.

We will discuss the following example:

# We declare a cost-effectiveness dataset, associating tags of interest with benefits and harms.
# The dataset consists of two parts: lifelong costs/utilities and utilities for women aged 45-.
cea_data = {
    "Lifelong": {
        "conditions": [],
        "durations": {
            # Total number of Life Years
            "life": (0, 1, 1, 0),
        },

        "events": {
            # Costs of treatment
            "ec_clinical": (35763, 0, 0, 1),
            "ec_screening_hysterectomy": (15276, 0, 0, 1),

            # Count number of deaths in population
            ".*death": "deaths"
        },
    },
    "Menopause": {
        "conditions": [("age", "<", 45)],
        "durations": {
            # Disutilities due to surgically induced menopause
            ("ec_clinical_month1", "ec_screening_hysterectomy_month1"): (0, 0, -0.44, 0),
            ("ec_clinical_months2-3", "ec_screening_hysterectomy_months2-3"): (0, 0, -0.26, 0),
            ("ec_clinical_cont", "ec_screening_hysterectomy_cont"): (0, 0, -0.12, 0),
        },
        "events": {}
    },
}

Conditions¶

These cost-effectiveness data consist of two parts: Lifelong and Menopause. The Lifelong part does not have any conditions. The Menopause part, however, only applies for events and durations that occurred before age 45. This part contains disutilities (i.e. negative QALYs) caused by surgically induced menopause. Since women aged under 45 likely had not had their menopause yet, these disutilities only apply to events and durations before age 45.

Note

It is allowed to specify multiple conditions in conditions, for example

"conditions": [("universe", "==", "screening"), ("age", "<", 65),  ("year", ">=", 1993)]

would only apply to people in the Universe screening that were younger than 65 in the years after 1993. Conditions can be applied to any column in the durations and events tables (including stratum, cohort, year, age and Universe). The feasible operators are <, <=, ==, !=, >=, >.

Durations and events¶

Next, both parts specify costs and utility data for various events and duration tags. For example, the Lifelong part specifies that an occurrence of the life tag in the durations yields 0 costs, 1 QALY, 1 LY and 0 hysterectomies. This represents the total number of lifeyears of individuals. Lifelong also specifies the costs of treatment and hysterectomies. The event tags ec_clinical and ec_screening_hysterectomy are associated with costs and with 1 hysterectomy. No (reduction in) lifeyears and QALYs are associated with these tags, however.

Finally, the Lifelong part counts the total number of deaths. By specifying a string instead of a tuple of values, the specified tags are aggregated and counted, without multiplying by costs or utility values. As a result, an extra label deaths will be added to the outputs. Note that we used .*death, which means “all tags that end with ‘death’”. This is an example of regex. Their documentation will show more examples of regex rules.

The Menopause part only specifies disutilities for six duration tags. The first month of surgically induced menopause leads to a 0.44 reduction in yearly QALYs (so effectively a 0.04 QALY reduction). The second and third month lead to a 0.26 reduction in yearly QALYs. After that, there is a 0.12 reduction in yearly QALYs, until the age of 45. Note that these disutilities are the same for hysterectomies due to clinical diagnosis and due to screening. By setting two or more tags within a tuple (in brackets), the same cost/effect values can be assigned to these tags.

Note

In general, the value for QALYs are negative for all tags except “life”.

Note

In this example, costs/effects are specified for durations and events only. In a similar way, costs/effects can also be specified for snapshot_years and snapshot_ages.

Note

It is also possible to specify the disutilities of surgically induced menopause using regex. The following would yield the same results. Note, however, that the use_regex parameter in the cost_effectiveness_analysis() function should be set to True.

# Disutilities due to surgically induced menopause
"ec_.*_month1": (0, 0, -0.44, 0),
"ec_.*_months2-3": (0, 0, -0.26, 0),
"ec_.*_cont": (0, 0, -0.12, 0),

Functions in the cost-effectiveness analysis module¶

The cost-effectiveness analysis module contains four functions which are discussed in this section. They require the events and durations output from the simulation run as inputs. If the simulation was run in the same script, they can be retrieved from the Result object as follows:

# Retrieve the events and durations tables obtained by simulation
durations = result.durations
events = result.events

Alternatively, in case the durations and events output were generated by another script, the .csv or .pkl output files can be imported as follows.

import pandas as pd

# Import the (concatenated) events and durations tables obtained by simulation
durations = pd.read_csv("durations.csv")
events = pd.read_csv("events.csv")

import pickle

# Import the (concatenated) events and durations tables obtained by simulation
durations = pickle.load(open('durations.pkl','rb'))
events = pickle.load(open('events.pkl','rb'))

Note

The cost-effectiveness module takes only one events and one durations table as an input. These should contain the durations and events of each Universe you want to evaluate.

Therefore, if your simulation was done with more than one Model, make sure to concatenate all events and durations to one table respectively. You can do so by adding all events/durations to two separate lists and using the concat() function to concatenate all events/durations dataframes:

# If the simulation was done with three models, there are three Result instances.
# First put all events and durations dataframes in separate lists
events_list = [result1.events, result2.events, result3.events]
durations_list = [result1.durations, result2.durations, result3.durations]

# Then concatenate the lists of events and durations into one dataframe
events = pd.concat(events_list)
durations = pd.concat(durations_list)

This note can be ignored if your simulation used multiple Universe instances instead of multiple Model runs.

Calculate costs and effects¶

The cost_effectiveness_analysis() function calculates the costs and effects for the different simulated Universe instances. It can be used as follows:

ce_result = cea.cost_effectiveness_analysis(durations=durations,
                                            events=events,
                                            cea_data=cea_data,
                                            cea_labels=labels,
                                            division_factor=100000,
                                            discount_by="age",
                                            discount_start=40,
                                            discount_rate=discount_rates,
                                            group_by=["universe"],
                                            use_regex=True,
                                            verbose=True
                                            )

The ce_result object is a CEAResult instance as the output. The CEAResult page shows the elements included in this class. The most relevant result is ce.cea_output, which contains outcomes by type of outcome and Universe. Printing these outcomes will show the table of cost-effectiveness outputs.

print(ce_result.cea_output)

id	universe	label	number
0	no_screening	costs	8289.900573
1	no_screening	LY	60.960182
2	no_screening	QALY	60.898495
3	no_screening	nr_hysterectomy	0.359300
4	no_screening	deaths	1.000000
5	screening	costs	15376.644948
6	screening	LY	61.111235
7	screening	QALY	60.211758
8	screening	nr_hysterectomy	0.853680
9	screening	deaths	1.000000

The cost_effectiveness_analysis() function requires the following parameters:

cea_data and cea_labels should be used to specify the cea input data (cea_data and labels) from the data file.
Simulation output, such as the durations and events tables that were generated by the simulation, and/or the snapshots_ages and snapshots_years dataframes should be specified. Alternatively, the entire result object can be supplied, from which the output dataframes will be harvested.

Additionally, the function has a large variety of optional parameters.

Outcomes can be discounted by age or year by setting discount_by to either “year” or “age”. This also requires setting a first year/age to start discounting as discount_start, as well as a discount_rate, which may either be a single floating point rate or an iterable with a rate for each of the outcomes in cea_labels.
division_factor scales the CEA results by dividing all outcomes by the specified factor. For instance, to get results per individual alive at the start of simulation, specify the population size n of your simulation. Alternatively, to scale to a given number of individuals alive at the start of simulation, use n divided by the scale. For example if you want to report result per 1,000 individuals.

Next, it is possible to set a reference value for each of the quantities. Then the costs/effects of all other universes are calculated with respect to the reference Universe, i.e. the costs/effects of the reference are subtracted from each other Universe. It can be specified either by a name of a Universe in the durations/events tables or by a tuple with explicit values. For example the following two specifications would give the same result:

# Now we include a reference strategy. There are two methods:
reference_strategy = (8289.90057, 60.96018, 60.89849, 0.35930, 1.0)
reference_strategy = "no_screening"

ce_result = cea.cost_effectiveness_analysis(durations=durations,
                                            events=events,
                                            cea_data=cea_data,
                                            cea_labels=labels,
                                            division_factor=1000000,
                                            discount_by="age",
                                            discount_start=40,
                                            discount_rate=discount_rates,
                                            group_by=["universe"],
                                            use_regex=True,
                                            reference=reference_strategy
                                            )

print(ce_result.cea_output)

id	universe	label	number
5	screening	costs	708.674438
6	screening	LY	0.015105
7	screening	QALY	-0.068674
8	screening	n_hysterectomies	0.049438

Note

If the reference values are specified by a tuple, the values must match the scale per division_factor.

Costs and utilities can be grouped by variables other than Universe by specifying the group_by parameter. The following example will show the costs and effects per Universe and per year. The outcomes can be grouped by universe, cohort, stratum, year and tag (the labels of events and durations), and any of their combinations.

# Now we group the results by universe and year only.
group_by = ["universe", "year"]
ce_result = cea.cost_effectiveness_analysis(durations=durations,
                                            events=events,
                                            cea_data=cea_data,
                                            cea_labels=labels,
                                            division_factor=1000000,
                                            discount_by="age",
                                            discount_start=40,
                                            discount_rate=discount_rates,
                                            group_by=group_by,
                                            use_regex=True)
print(ce_result.cea_output)

id	universe	year	label	number
1	no_screening	0.0	LY	0.099671
2	no_screening	0.0	QALY	0.099671
4	no_screening	0.0	deaths	0.000649
5	no_screening	1.0	costs	0.178815
6	no_screening	1.0	LY	0.099331
…	…	…	…	…
1001	screening	99.0	LY	0.000386
1002	screening	99.0	QALY	0.000386
1003	screening	99.0	n_hysterectomies	0.000005
1004	screening	99.0	deaths	0.000819
1009	screening	100.0	deaths	0.001797

The boolean verbose, set to True by default, will warn you of any tags that were in the simulation results but were not included in the cea_data dictionary. It will also communicate the progress of the cost effectiveness analysis.

For the events, durations and snapshots tables separately, two types of tags are printed. First, unused tags are those that are specified in the data file, but are not used in the durations/events tables. Second, unspecified tags are those that are not specified in the data file, but are used in the durations/events tables. This option is useful when generating a new data file, however it also takes extra running time. Even though the option defaults to True, it is advised to turn it to False in final runs.

Note

When you use regex to associate tags with outcomes and effects, make sure you set the option use_regex to True!

Finding dominating universes¶

The dataframe that results from cost_effectiveness_analysis() contains the costs and effects of each Universe. If each Universe is generated with different screening strategies, one may be interested in the dominating Universe set. The functions find_pareto() and find_efficient() find the strongly and weakly dominating universes respectively.

The functions expect that you have somewhat transformed your ce.cea_output data, with the labels of your outcomes as columns. We can easily get such a table by using the pandas.pivot_table function. This has the added benefit that we can specify fill_value=0 to fill in any outcomes that do not occurr for each Universe:

# Pivot results by universe to search for efficient universes
universe_table = pd.pivot_table(ce_result.cea_output, aggfunc="sum",
                                columns="label", values="number", index="universe")
print(universe_table)

Note

pd.pivot_table by default calculates the mean across the index values (in this case the different Universe instances, supply aggfunc=”sum” to get the sum of all values in the number column.

This gives us the right type of table:

universe	costs	LY	QALY	nr_hysterectomy	deaths
no_screening	828.990057	6.096018	6.089849	0.035930	0.1
screening	1537.664495	6.111124	6.021176	0.085368	0.1

We can then supply this table to our functions that determine the efficiency frontiers, containing each efficient Universe.

# Here we search for the dominating strategies
dominating_strategies = cea.find_pareto(universe_table)
efficient_strategies = cea.find_efficient(universe_table)
print(efficient_strategies)

universe	costs	LY	QALY	nr_hysterectomy	deaths	icer
no_screening	828.990057	6.096018	6.089849	0.03593	0.1	-inf

Note

By default, the functions search for the Universe instances that minimise the costs and maximise the QALYs. However, users may alter these choices by specifying costs_label and effects_label respectively. For example, Universe instances that minimise the number of hysterectomies and maximise the LY are found by

dominating_strategies = find_pareto(universe_table, costs_label="nr_hysterectomy", effects_label="LY")

Note

The find_efficient() function also has a near_efficiency_margin parameter. This parameter takes a number between 0 and 1. Strategies of which the effects_label is within near_efficiency_margin * 100% of the frontier are considered near-efficient and included in the frontier.

efficient_strategies = find_pareto(universe_table, near_efficiency_margin=0.02)

Note

If you performed a PSA as part of your cost effectiveness analysis, you may be interested in the efficient frontier and the ICERs for each of the random draws. This allows you to estimate the probability per your stochastic sensitivity analysis of your strategy being on the frontier, as well as the probability that your ICER is below a given threshold. The find_efficient() function will automatically recognize that you have output from a PSA, if you have grouped your cea_output by both Universe and psa_set, with psa_set as the first index level:

psa_table = pd.pivot_table(ce.cea_output, columns="label", values="number", index=["psa_set", "universe"])
dominating_strategies_psa = find_pareto(universe_table)

Scatter plot of universes¶

Finally it is possible to make a scatter plot of the costs and effects of the calculated Universe with:

# The scatter_ce function allows us to plot the universes on the cost-QALY plane
cea.scatter_ce(universe_table, x_label="costs", y_label="QALY", tags="index")
cea.show()

This code generates a scatter plot showing the costs of each Universe on the horizontal axis and QALYs on the vertical axis. Note that it is possible to vary the quantities on each of the axes with the parameters x_label and y_label.

Additionally, it is possible to show a string with each scatter dot in the graph by specifying the parameter tags. The value index will add the value of the index of the specified DataFrame to the dot (usually this is the name of the Universe). The value number will assign a number to each of the dots, this may improve the readability.

Finally, the function takes kwargs that are passed to matplotlib, allowing the plots to be styled.

The figure below shows the result. Note that in your analysis the plane may be populated by more Universe outcomes than only the two generated by our example run.

Probabilistic Sensitivity Analysis (PSA)¶

In the above examples, costs and effects were specified as a single value. However, sometimes it can be useful to perform sensitivity analyses on costs and effects of one or multiple tags. For that, you can specify a probability distribution in place of the single value in the input dictionary:

# Next, we do an example of a probabilistic sensitivity analysis
labels = ["costs", "LY", "QALY", "n_hysterectomy"]
discount_rates = (0.03, 0.01, 0.01, 0.00)

# PSA-like input
# We declare a cost-effectiveness dataset, associating tags of interest with costs and effects
psa_data = {
    "Lifelong": {
        "conditions": [],
        "durations": {
            # Total number of Life Years
            "life": (0, 1, (1, 0.8) * 1000, 0),
        },
        "events": {
            # Costs of treatment
            "ec_clinical": (("norm", {"loc": 35763, "scale": 5960}), 0, 0, 1),
            "ec_screening_hysterectomy": (("norm", {"loc": 15276, "scale": 2546}), 0, 0, 1),

            # Count number of deaths in population
            ".*death": "deaths"
        },
    },
    "Menopause": {
        "conditions": [("age", "<", 45)],
        "durations": {
            # Disutilities due to surgically induced menopause
            ("ec_clinical_month1", "ec_screening_hysterectomy_month1"): (0, 0, -0.44, 0),
            ("ec_clinical_months2-3", "ec_screening_hysterectomy_months2-3"): (0, 0, -0.26, 0),
            ("ec_clinical_cont", "ec_screening_hysterectomy_cont"):
                (0, 0, ("uniform", {"loc": -0.12, "scale": 0.10}), 0),
        },
        "events": {}
    },
}

In this example, the disutilities for the durations ec_clinical_month1 and ec_screening_hysterectomy_month1 follow a normal distribution with mean -0.44 and standard deviation 0.05. Disutilities of ec_clinical_month23 and ec_screening_hysterectomy_month23 follow a uniform distribution with lowerbound -0.39 and upperbound 0.13. Any distribution available in scipy.stats can be used. All distributions and their parameters are listed here.

Another option is to specify a tuple of possible values for costs and effects as a user. An example of this is shown for event tag ec_clinical_cont and ec_screening_hysterectomy_cont. Disutilities for these durations alternate between -0.16 and 0.08.

Note

For larger analyses, like a PSA, efficiency is essential. To speed up your run, consider removing any unnecessary tags from the events dataframe before using it in your PSA.

We then need to tailor our inputs to the cost_effectiveness_analysis() function to perform a Probabilistic Sensitivity Analysis (PSA). To do so, set the parameter psa_n_samples to the desired number of draws from the distributions defined for each of the inputs. When there are user-supplied values for each of the draws (i.e. an array of a given length n was supplied to perform the PSA), make sure psa_n_samples matches the length of the user-supplied draws. To make sure your PSA draws are reproducible, set a value for psa_seed.

# Calculate costs and effects for each PSA draw
ce_result = cea.cost_effectiveness_analysis(result=result,
                                            cea_data=psa_data, cea_labels=labels,
                                            division_factor=1e5, discount_by="age",
                                            discount_start=40,
                                            discount_rate=discount_rates,
                                            group_by=["universe"],
                                            psa_n_samples=2000,
                                            psa_seed=42195)
print(ce_result.cea_output)

We are now returned a result DataFrame with separate outputs by label and Universe for each of the possible parameter draws out of the 2000 we simulated. The distribution of your outputs across these draws should inform you of the uncertainty surrounding your initial estimate.

id	universe	psa_set	label	number
0	no_screening	0	costs	8171.210975
1	no_screening	0	LY	69.894512
2	no_screening	0	QALY	69.846467
3	no_screening	0	n_hysterectomy	0.359300
5	no_screening	1	costs	7501.383817
…	…	…	…	…
19993	screening	1998	n_hysterectomy	0.853680
19995	screening	1999	costs	13180.019709
19996	screening	1999	LY	70.150378
19997	screening	1999	QALY	55.839194
19998	screening	1999	n_hysterectomy	0.853680

Note

The option of performing a PSA also allows for different outputs to be included in the CEAResult object.

Setting return_psa_seeds to True, the seeds for each of the different inputs are returned. The user-supplied seed is used to generate these seeds, to ensure that each input always starts the draw from the same point in the random number stream.

Setting return_psa_values to True, will return a DataFrame object with the exact values for each PSA draw and input tag. This can be valuable in plotting the relationship between your aggregated outcomes for each draw, and the values of the inputs in each of these draws. Depending on the number of inputs used and the value of psa_n_samples, this may be a memory-intensive DataFrame object.

Setting return_descriptives to True, will return a DataFrame object with descriptive sample statistics of each of the inputs, across all of the PSA draws. This may be valuable in evaluating whether the specified distribution for an input is resulting in the intended sample distribution, and whether psa_n_samples is sufficient to reduce the discrepancy between the sample distribution and the intended distribution.

Univariate Sensitivity Analysis (USA)¶

Alternatively, you may be interested in the effect of individual inputs, changed one at a time. To this end, the same dataset used for performing the PSA may be used for conducting a Univariate Sensitivity Analysis (USA). To do so, set perform_usa to True. Rather than drawing all the values simultaneously, every draw will only have one value changed at a time. In the DataFrame CEAResult.usa_values, you can find which value was changed to which input for each of the usa_set values in CEAResult.cea_output.

By default, the values are each set at their 2.5%, 5%, 25%, 50%, 75%, 95% and 97.5% percentile values, per the distribution used for the PSA. For PSA inputs for which a list of n values was supplied, the sample quantile values are used. The quantiles used can be specified with the argument usa_quantiles if needed.

An example of performing a Univariate Sensitivity Analysis is given below.

# Alternatively, we may perform a Univariate Sensitivity Analysis, in which values
# are changed one by one.
ce_result = cea.cost_effectiveness_analysis(result=result,
                                            cea_data=psa_data, cea_labels=labels,
                                            division_factor=1e5, discount_by="age",
                                            discount_start=40,
                                            discount_rate=discount_rates,
                                            reference="no_screening",
                                            group_by=["universe"],
                                            perform_usa=True)
# The output is now grouped by `usa_set` rather than `psa_set`
print(ce_result.cea_output)

# The `usa_values` DataFrame shows which input corresponds to which `usa_set`
print(ce_result.usa_values)

Show full output

The output is now given by:

idx	universe	usa_set	label	number
140	screening	0	costs	6624.022297
141	screening	0	LY	0.255866
142	screening	0	QALY	-0.283104
143	screening	0	n_hysterectomy	0.494380
145	screening	1	costs	6698.415765
…	…	…	…	…
273	screening	26	n_hysterectomy	0.494380
275	screening	27	costs	7086.744375
276	screening	27	LY	0.255866
277	screening	27	QALY	-0.602420
278	screening	27	n_hysterectomy	0.494380

Specified by the usa_values DataFrame, where it is specified which input is set to which value (corresponding to a particular quantile of its distribution) to generate the input related to the usa_set index:

id	usa_ceaset	usa_table	usa_tag	usa_label	usa_value	usa_quantile
0	Lifelong	events	ec_clinical	costs	47444.385348	0.025
1	Lifelong	events	ec_clinical	costs	45566.327617	0.050
2	Lifelong	events	ec_clinical	costs	39782.958911	0.250
3	Lifelong	events	ec_clinical	costs	35763.000000	0.500
4	Lifelong	events	ec_clinical	costs	31743.041089	0.750
5	Lifelong	events	ec_clinical	costs	25959.672383	0.950
6	Lifelong	events	ec_clinical	costs	24081.614652	0.975
7	Lifelong	events	ec_screening_hysterectomy	costs	20266.068305	0.025
8	Lifelong	events	ec_screening_hysterectomy	costs	19463.797334	0.050
9	Lifelong	events	ec_screening_hysterectomy	costs	16993.250904	0.250
10	Lifelong	events	ec_screening_hysterectomy	costs	15276.000000	0.500
11	Lifelong	events	ec_screening_hysterectomy	costs	13558.749096	0.750
12	Lifelong	events	ec_screening_hysterectomy	costs	11088.202666	0.950
13	Lifelong	events	ec_screening_hysterectomy	costs	10285.931695	0.975
14	Lifelong	durations	life	QALY	0.800000	0.025
15	Lifelong	durations	life	QALY	0.800000	0.050
16	Lifelong	durations	life	QALY	0.800000	0.250
17	Lifelong	durations	life	QALY	0.900000	0.500
18	Lifelong	durations	life	QALY	1.000000	0.750
19	Lifelong	durations	life	QALY	1.000000	0.950
20	Lifelong	durations	life	QALY	1.000000	0.975
21	Menopause	durations	(ec_clinical_cont, ec_screening_hysterectomy_cont)	QALY	-0.022500	0.025
22	Menopause	durations	(ec_clinical_cont, ec_screening_hysterectomy_cont)	QALY	-0.025000	0.050
23	Menopause	durations	(ec_clinical_cont, ec_screening_hysterectomy_cont)	QALY	-0.045000	0.250
24	Menopause	durations	(ec_clinical_cont, ec_screening_hysterectomy_cont)	QALY	-0.070000	0.500
25	Menopause	durations	(ec_clinical_cont, ec_screening_hysterectomy_cont)	QALY	-0.095000	0.750
26	Menopause	durations	(ec_clinical_cont, ec_screening_hysterectomy_cont)	QALY	-0.115000	0.950
27	Menopause	durations	(ec_clinical_cont, ec_screening_hysterectomy_cont)	QALY	-0.117500	0.975

Using population reference QALY values for different age- and sex groups¶

In performing cost-effectiveness analyses, it can often be easy to set decrements for health utility for the particular health states we are simulating, and to set the QALY value for our "life" tag to 1.0. This approach, however, assumes that anyone not experiencing the specific health state, is in perfect health. Especially in older populations, other morbidities are very likely to influence the general level of health utility, causing the average QALY value to never be as high as 1.0. Even younger populations often have an average health utility level around 0.90.

It is therefore crucial to adjust the population reference values for any cost-effectiveness analysis. This could be done by adjusting the "life" tag QALY value to some value below 1.0, but we also include a tool for doing this separately for each age category. For this, the data module includes the population_norms module with the add_population_norms() function, which adjusts your cost-effectiveness input dictionary for the average population health utility, by age and sex. Population norm utilities can be supplied by the user as a dataframe, or can be taken from the internally stored values from the literature for the US, the UK, NL or CH.

In this example, the add_population_norms() function uses the internally stored population norm values for the Dutch population. We can view this data by importing the dataset only.

from miscore.tools.cea.data.population_norms import data
data[(data.Country=="nl") & (data.Sex=="total") & (data.Method=="EQ5D")].head()

Country	Agemin	Agemax	Yearmax	Sex	Utility	SE	Method
nl	0	25	2200	total	0.938	0.021	EQ5D
nl	25	35	2200	total	0.910	0.012	EQ5D
nl	35	45	2200	total	0.922	0.013	EQ5D
nl	45	55	2200	total	0.874	0.019	EQ5D
nl	55	65	2200	total	0.869	0.027	EQ5D

Note

The method by which the utilities are derived in the literature may also differ, this is what the Method column reports. For most countries, Visual Analogue Scale (VAS) values are included, as well as values derived by means of population valuations of the EQ-5D survey. Although the EQ-5D is the most commonly used method, it may be valuable to your research to perform a sensitivity analysis in which results with both value sets are compared.

Note

Yearmin and Yearmax columns are included, which may report any conditions on the calendar year for which to apply the population health utilities. It could be that survey values were used which are updated over the years, such that separate calendar years are associated with new population health utility values. For the current internally included values, these limits are set to 0 and 2200, since only one survey year is used.

An example call of the add_population_norms() function is given below. We see that we supplied a CEA dataset called ce to this function. For each age category, new entries in ce are created in which the life QALY value, and any other disutility values is adjusted. These values which were previously relative to a health utility value of 1, are now set to be relative to the population norm utility. The adjustment of disutility values depends on the disutility_method parameter. The available methods are:

"proportional":
Disutilities are adjusted proportionally to the difference between the original life utility (set to 1.0) and the population norm utility. Hence, for individuals ages 0 to 25, the disutility for ec_clinical changes from -0.1 to -0.0938.
"nominal":
Disutilities are adjusted by the absolute difference between the life utility and the population norm utility. Now for individuals ages 0 to 25, the disutility for ec_clinical changes from -0.1 to -0.038. Because the adjusted disutility depends on the duration of the disutility, it is not possible to change event disutilities using this method.
"nominal_censored":
Similar to the "nominal" method, but the disutility values are capped at the population norm utility level. This ensures that the simulated disease state does not result in a better health utility than the general population for individuals of the same age. Before using this method, it should be carefully considered whether your diseased population could be healthier than the average individual of the same age. Also similar to the "nominal" method, it is not possible to change event disutilities using this method.

Here is how you might use the add_population_norms() function with these adjustments: Note that the inputs in the new ce dictionary are stored with tuple indexes (a, b), where a is the name of the input dataset, in this case 'x', and b is the row index in the dataframe of population norm utilities.

# Optionally correct our cost-effectiveness input data for population norm values
cea_data['Lifelong']['durations']['ec_clinical'] = (0, 0, -0.1, 0)
ce_input = population_norms.add_population_norms(
    data_quantities=cea_data,
    country="nl",
    sex="total",
    data_labels=processes.ec_screening.data.cea.labels,
    disutility_tags=['ec_clinical'],
    disutility_method="nominal"
)
print(ce_input[('Lifelong', 63)]['conditions'])
print(ce_input[('Lifelong', 63)]['durations'])

[('age', '>=', 0), ('age', '<', 25)]
{'life': (0, 1, 0.938, 0), 'ec_clinical': (0, 0, -0.03799999999999995, 0)}

If you would like the health disutilities to be taken relative to whatever the reference health utility is, simply leave the argument disutility_tags blank to prevent any disutilities from being changed.

It is also possible to supply your own population norm utility values. To do so, use the argument norm_data. Make sure the input is a pandas DataFrame which includes all the columns shown in the example above.

Note

We include also the SE column, which contains standard errors of the health utility values. These may be used in order to generate Univariate or Probabilistic Sensitivity Analysis cea sets.

To generate an adjusted cea dataset that can be used in a Probabilistic Sensitivity Analysis, simply set the argument n_psa_draws to the same n that you will use for your PSA. Normally distributed norm utility values will be drawn, per the standard error in the SE column. The resulting dataset may simply be passed on to cost_effectiveness_analysis, and for each of the n draws the corresponding norm utility will be used. The disutility values for all disutility_tags will be adjusted so as to preserve the same overall utility in the disease-specific state (i.e. if the random draw results in a decrease of the norm utility by 0.1, the disutility will be adjusted by 0.1 to yield the same nominal disease-specific utility). You may simultaneously generate random draws for the disutility tags. To do so, simply specify a distribution for the disutility value as you would for any other PSA input. If a disutility tag is present but no distribution is given, a constant nominal health-specific utility will be assumed (i.e. the sum of the norm utility and the disutility will be held constant), and a warning will be given to report that your disutility is not part of your PSA.

To generate an adjusted cea dataset that can be used in a Univariate Sensitivity Analysis, set return_usa_draws to True. Similar to generating the psa draws, the dataset will now return an array of utility values rather than single floating point values. One at a time, the utility values will be set to varying quantile values per their normal distribution. The quantile values may be supplied through the argument usa_quantiles, and will default to [0.025, 0.05, 0.25, 0.5, 0.75, 0.95, 0.975]. If any disutility tags are associated with a distribution, these will also be adjusted to their quantile values, after accounting for any shift in the norm utility by age. A DataFrame summarizing which USA value corresponds to which draw in the cea dataset is returned, so that the CEA output may be associated with the corresponding univariate changes in the inputs.

Note that for both USA and PSA draws, any outlier in the norm utilities that yields a value outside the domain 0 to 1 will be clipped to the edge values.