Model output¶

The output of a Model run is a Result object. It contains all the information that can be obtained from a simulation. So far we have only considered the events and durations, but other elements are available as well. This tutorial shows:

What information is contained in the Result object.
How to efficiently store and restore such objects.
How to perform calculations with the outputs.
Some tricks to keep in mind when exporting the model results.

Keep in mind that, although it can be useful to extract all available output data, doing so may slow down your simulation considerably.

Note

In the following examples the result variable refers to a Result object.

The code below, a simple EC simulation without screening, is used as an example run in this tutorial:

import matplotlib.pyplot as plt
import numpy as np

from miscore import load_result
from miscore import Model, processes, Universe

birth = processes.Birth(
    year=1975
)

oc = processes.OC(
    life_table=processes.oc.data.us_2017.life_table_female,
)

ec = processes.EC.from_data(
    processes.ec.data.us
)

no_screening = Universe(
    name="no_screening",
    processes=[birth, oc, ec]
)

model = Model(
    universes=[no_screening]
)

result = model.run(
    n=10000,
    seed=123,
    event_ages=range(0, 100, 1),
    duration_ages=range(0, 100, 1),
    return_properties=["oc_death", "ec_onset_age"]
)

Note that we import matplotlib, numpy and load_result, and add the return_properties argument to model.run(). These functions and packages are introduced in this tutorial.

Result object¶

The Result object holds six pandas DataFrame objects: the events and durations, which we’ve seen before, the properties, the snapshots (by age and by year) and the individual event logs. You can open and export all tables as you did with the events and durations in the previous tutorials.

# Export all tables separately - Naive way
result.events.to_csv("events.csv")
result.durations.to_csv("durations.csv")
result.properties.to_csv("properties.csv")
result.snapshots_ages.to_csv("snapshots_ages.csv")
result.snapshots_years.to_csv("snapshots_years.csv")
result.events_individual.to_csv("events_individual.csv")

Note that exporting all tables separately is naive; a smarter way is introduced in the next part of this tutorial. Also note that most of these tables will only be useful after you have completed the tutorials for Modellers.

events¶

The events DataFrame stores all the events logged during the simulation. It has the following columns:

universe (category): The name of the Universe to which this row applies.
stratum (category): The stratum to which these individuals belong.
tag (category): A text label describing the event of interest. It can be controlled by passing event_tags to run().
year (float64): The lower bound of the year bin. It can be controlled by passing event_years to run().
age (float64): The lower bound of the age bin. It can be controlled by passing event_ages to run().
number (uint32): The number of events with this tag that were logged in this universe, stratum, age bin and year bin.
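
To make the column layout concrete, the sketch below builds a small stand-in for the events table with pandas and totals the counts per tag. It does not use MISCore itself: the rows, the stratum label "all", the year values and the oc_death tag are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for result.events; all values are made up.
events = pd.DataFrame({
    "universe": pd.Categorical(["no_screening"] * 4),
    "stratum": pd.Categorical(["all"] * 4),  # hypothetical stratum label
    "tag": pd.Categorical(["ec_death", "ec_death", "oc_death", "oc_death"]),
    "year": [2035.0, 2036.0, 2035.0, 2036.0],
    "age": [60.0, 61.0, 60.0, 61.0],
    "number": np.array([3, 5, 40, 42], dtype="uint32"),
})

# Check that the column types match the description above
print(events.dtypes)

# Sum the counts over all age and year bins to get one total per tag
totals = events.groupby("tag", observed=True)["number"].sum()
print(totals)
```

The same groupby pattern should work on a real events table, since it relies only on the columns listed above.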

durations¶

The durations DataFrame stores all the durations logged during the simulation. It has the following columns:

universe (category): The name of the Universe to which this row applies.
stratum (category): The stratum to which these individuals belong.
tag (category): The duration tag. It can be controlled by passing duration_tags to run().
year (float64): The lower bound of the year bin. It can be controlled by passing duration_years to run().
age (float64): The lower bound of the age bin. It can be controlled by passing duration_ages to run().
number (uint32): The length of the durations (in years) with this tag that were logged in this universe, stratum, age bin and year bin.
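
As a rough illustration of how this table can be used, the sketch below sums the logged durations with the life tag per universe and divides by the number of simulated individuals to get the mean number of life years per individual. The table is a made-up stand-in (including the stratum label and the value of n), not actual MISCore output.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for result.durations; all values are made up.
durations = pd.DataFrame({
    "universe": ["no_screening"] * 3,
    "stratum": ["all"] * 3,  # hypothetical stratum label
    "tag": ["life"] * 3,
    "year": [1975.0, 1976.0, 1977.0],
    "age": [0.0, 1.0, 2.0],
    "number": np.array([4, 3, 3], dtype="uint32"),
})
n = 4  # stand-in for result.n

# Total life years per universe, summed over all age and year bins
life_years_total = (
    durations[durations["tag"] == "life"].groupby("universe")["number"].sum()
)

# Mean life years per simulated individual
mean_life_years = life_years_total / n
print(mean_life_years)
```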

properties¶

Most MISCore processes contain a properties() method which generates the properties for all individuals. These properties define the individuals of the simulation. For example, the Birth process generates an individual’s birth year; the OC process generates a year of other-cause death; and the EC process generates an individual’s onset age for endometrial cancer (EC).

The properties DataFrame stores all properties generated by any Process included in the Model. It has no default columns and is empty by default. Set return_properties to True in run() to fill this DataFrame with all generated properties. Alternatively, pass a list of property names to include only those (see line 33 in the example code). In the above example, the result will contain the oc_death and ec_onset_age properties for all individuals.

Note

For more details about the properties() method please refer to the MISCore structure and Create a process tutorials.

snapshots_ages¶

Note

The snapshot function is introduced in Take snapshots.

The snapshots_ages DataFrame stores all the snapshots taken at a certain age during the simulation. It has the following columns:

universe (category): The name of the Universe to which this row applies.
stratum (category): The stratum to which these individuals belong.
tag (category): The snapshot tag. It can be controlled by passing snapshot_tags to run().
age (float64): The snapshot age. It can be controlled by passing snapshots_ages to run().
number (uint32): The number of individuals in this universe and stratum for which the snapshots_functions passed to run() returned this tag at this age.
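
Snapshot counts are often easier to read as shares of the population at each age. The sketch below shows one way to do that with pandas on a made-up stand-in table; the tags healthy and ec, the stratum label and all counts are hypothetical and only serve to illustrate the reshaping.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for result.snapshots_ages; all values are made up.
snapshots_ages = pd.DataFrame({
    "universe": pd.Categorical(["no_screening"] * 4),
    "stratum": pd.Categorical(["all"] * 4),  # hypothetical stratum label
    "tag": pd.Categorical(["healthy", "ec", "healthy", "ec"]),  # hypothetical tags
    "age": [50.0, 50.0, 70.0, 70.0],
    "number": np.array([9900, 100, 7600, 400], dtype="uint32"),
})

# One row per snapshot age, one column per tag
counts = snapshots_ages.pivot_table(
    index="age", columns="tag", values="number",
    aggfunc="sum", fill_value=0, observed=True,
)

# Divide each row by its total to get the share of individuals per tag
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares)
```

The same approach should apply to snapshots_years by using year instead of age as the index.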

snapshots_years¶

Note

The snapshot function is introduced in Take snapshots.

The snapshots_years DataFrame stores all the snapshots taken at a certain year during the simulation. It has the following columns:

universe (category): The name of the Universe to which this row applies.
stratum (category): The stratum to which these individuals belong.
tag (category): The snapshot tag. It can be controlled by passing snapshot_tags to run().
year (float64): The snapshot year (i.e. date). It can be controlled by passing snapshots_years to run().
number (uint32): The number of individuals in this universe and stratum for which the snapshots_functions passed to run() returned this tag at this date.

events_individual¶

Note

Logging events at the individual level is introduced in Advanced logging.

The events_individual DataFrame stores all the individual events that were logged during the simulation. It has the following columns:

universe (category): The name of the Universe to which this row applies.
individual (uint32): The number/index of the individual.
element (uint16): The element may for example be used to indicate to which lesion this row applies. However, the exact interpretation depends on the implementation of the Process that logged this event. The element is 0 by default.
age (float64): The age at which this event happened.
tag (category): The tag of the event that happened at this age in this universe for this individual and element. It can be controlled by passing event_individual_tags to run().
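
Because this table has one row per individual event, it can be used to reconstruct the history of a single individual. The sketch below does this on a made-up stand-in table; the individual numbers, ages and the tags ec_onset, ec_clinical and oc_death are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for result.events_individual; all values are made up.
events_individual = pd.DataFrame({
    "universe": pd.Categorical(["no_screening"] * 4),
    "individual": np.array([7, 7, 7, 12], dtype="uint32"),
    "element": np.array([0, 0, 0, 0], dtype="uint16"),
    "age": [71.3, 64.8, 68.1, 80.2],
    "tag": pd.Categorical(["ec_death", "ec_onset", "ec_clinical", "oc_death"]),
})

# Event history of individual 7, ordered by age
history = events_individual[events_individual["individual"] == 7].sort_values("age")
print(history[["age", "tag"]])

# Age at the first logged event for every individual
first_event_age = events_individual.groupby("individual")["age"].min()
print(first_event_age)
```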

Other output¶

Besides the six DataFrame objects described above, the Result object also includes the following information about the simulation:

n (int): The number of individuals simulated.
block_size (int): The number of individuals in each block. The default value is 2000.
version (string): The version of MISCore used to obtain the result.
released (boolean): Boolean indicating whether or not this version of MISCore was released.
seeds_properties (Dict[str, int]): The seeds used for the properties() method of each Process.
seeds_properties_tmp (Dict[str, int]): The seeds used for the properties_tmp() method of each Process.
seeds_random (Dict[str, int]): The seeds used for the random number generators of each Process accessible during the simulation.

You can access this information in a similar way. For example, the following prints the number of simulated individuals:

# Print number of individuals that were simulated
print(result.n)

Save and load results¶

In the tutorials up to now, we exported the durations and events as CSV files. However, exporting separate CSV files is naive. If we run a computationally expensive MISCore simulation, we want to save as much information as possible, so that we do not have to rerun the simulation when some information turns out to be missing. It is therefore good practice to save the entire Result object. This is also more efficient than storing the data in separate CSV files as shown before. You can use the save() method of the Result object to save a result and the load_result() function to load it again. The object is saved as a pickle file.

For example, the following code saves result to the file my_simulation_result.result and then loads the file:

# Save the result object - Smart way
result.save("my_simulation_result.result")

# Load the result object so we can do an analysis without redoing the simulation
result = load_result("my_simulation_result.result")

In line 51 we used the load_result function that we imported in line 4.

Note

The .result file extension is just an example. You can use any extension you want, or no extension at all.

Note

You can also save a single DataFrame, such as the events, as a pickle. This is faster than saving it as a CSV file when the table is very large. Use the pandas.DataFrame.to_pickle() method to save the table and the pandas.read_pickle() function to load it. Note that you need to import the pandas package to load the table.

# Save the events DataFrame as a pickle
result.events.to_pickle("events.pickle")

# Load the pickled events DataFrame
import pandas as pd
events = pd.read_pickle("events.pickle")
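
Besides speed, a pickle also keeps the column types intact, whereas a CSV round trip does not. The sketch below illustrates this with a small made-up table that uses the same dtypes as the events table; it writes to a temporary directory so that no files are left behind.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical stand-in for a few columns of result.events; values are made up.
events = pd.DataFrame({
    "tag": pd.Categorical(["ec_death", "oc_death"]),
    "age": [60.0, 61.0],
    "number": np.array([3, 40], dtype="uint32"),
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "events.csv"
    pickle_path = Path(tmp) / "events.pickle"

    # Write the same table in both formats and read it back
    events.to_csv(csv_path, index=False)
    events.to_pickle(pickle_path)
    from_csv = pd.read_csv(csv_path)
    from_pickle = pd.read_pickle(pickle_path)

# The CSV round trip loses the category and uint32 dtypes; the pickle keeps them
print(from_csv.dtypes)
print(from_pickle.dtypes)
```

As with any pickle, only load files from sources you trust, because unpickling can execute arbitrary code.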

Perform calculations¶

You will probably want to perform some calculations on the output, for example to obtain the mortality rate (per 100,000 life years). Since each output is stored as a pandas DataFrame, all DataFrame methods can be applied to it, for example to manipulate or export the data. This tutorial shows some examples.

A common first step is to extract the events and durations with a certain tag. Events and durations can be filtered as shown below. This code first creates a Series with the number of endometrial cancer deaths (ec_death) by age from the events, with the age as index. It then does the same for the life years, using the durations table.

# Extract the ec mortality and lifeyears
mortality = result.events[result.events["tag"] == "ec_death"].groupby("age")['number'].sum()
life_years = result.durations[result.durations["tag"] == "life"].groupby("age")['number'].sum()

Provided that you used the same event_ages and duration_ages in run(), the mortality rate can be calculated by simply dividing mortality by life_years. Here, life_years is first divided by 100,000 to obtain the rate per 100,000 life years.

# Calculate the mortality rate
mortality_rate = mortality.div(life_years / 1e5)
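
The division aligns both Series on their age index, which is why matching age bins matter. The sketch below uses made-up numbers to show what happens when an age is present in life_years but not in mortality, and one possible way to handle it. Whether MISCore omits rows for ages without events is not stated here, so treat the missing row as an assumption.

```python
import pandas as pd

# Hypothetical per-age totals; all values are made up.
mortality = pd.Series([2, 5], index=pd.Index([60.0, 62.0], name="age"))
life_years = pd.Series(
    [9000.0, 8900.0, 8800.0], index=pd.Index([60.0, 61.0, 62.0], name="age")
)

# Plain division aligns on age; age 61 is missing from mortality, so the rate is NaN
naive_rate = mortality.div(life_years / 1e5)

# Reindex on the life-year ages so that missing ages count as zero deaths
mortality_filled = mortality.reindex(life_years.index, fill_value=0)
mortality_rate = mortality_filled.div(life_years / 1e5)
print(mortality_rate)
```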

The resulting Series can be plotted by calling its plot() method. For this, we imported matplotlib.pyplot in line 1 of the example code.

# Plot the mortality rate
mortality_rate.plot()
plt.show()

Tricks before exporting¶

Pivot tables are often easy to interpret and work with. The pandas.DataFrame.pivot_table() method can be used to create such tables. For example, the following code creates a pivot table with a row for each combination of universe and age, and a column for each tag (i.e. event name). Each combination of universe, age and tag that does not exist gets the value 0 (fill_value). We also have to indicate that the logged data should be aggregated by summing (aggfunc=np.sum). For that, we imported numpy in line 2 of the example code. You can export the pivot table to a CSV file in the same way as the events and durations.

# Make and export a pivot table
pivot_table = result.events.pivot_table(index=["universe", "age"], columns="tag", values="number",
                                        aggfunc=np.sum, fill_value=0)
pivot_table.to_csv("pivot_table.csv")

Note

You can also make and export the table in a single line of code.

result.events.pivot_table(index=["universe", "age"], columns="tag", values="number", aggfunc=np.sum, fill_value=0).to_csv("pivot_table.csv")
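
To see what such a pivot table looks like, the sketch below applies the same call to a small made-up events table. It passes the string "sum" instead of np.sum; both aggregate by summing, and some recent pandas versions emit a FutureWarning when np.sum is passed. The rows and the oc_death tag are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for result.events; all values are made up.
events = pd.DataFrame({
    "universe": ["no_screening"] * 3,
    "tag": ["ec_death", "oc_death", "oc_death"],
    "age": [60.0, 60.0, 61.0],
    "number": np.array([3, 40, 42], dtype="uint32"),
})

# One row per (universe, age), one column per tag; missing combinations become 0
pivot_table = events.pivot_table(
    index=["universe", "age"], columns="tag", values="number",
    aggfunc="sum", fill_value=0,
)
print(pivot_table)
```

Here there is no ec_death row at age 61, so that cell is filled with 0 rather than left empty.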

Example code¶

The code below gives a complete overview of the code written in this tutorial.

import matplotlib.pyplot as plt
import numpy as np

from miscore import load_result
from miscore import Model, processes, Universe

birth = processes.Birth(
    year=1975
)

oc = processes.OC(
    life_table=processes.oc.data.us_2017.life_table_female,
)

ec = processes.EC.from_data(
    processes.ec.data.us
)

no_screening = Universe(
    name="no_screening",
    processes=[birth, oc, ec]
)

model = Model(
    universes=[no_screening]
)

result = model.run(
    n=10000,
    seed=123,
    event_ages=range(0, 100, 1),
    duration_ages=range(0, 100, 1),
    return_properties=["oc_death", "ec_onset_age"]
)

# Export all tables separately - Naive way
result.events.to_csv("events.csv")
result.durations.to_csv("durations.csv")
result.properties.to_csv("properties.csv")
result.snapshots_ages.to_csv("snapshots_ages.csv")
result.snapshots_years.to_csv("snapshots_years.csv")
result.events_individual.to_csv("events_individual.csv")

# Print number of individuals that were simulated
print(result.n)

# Save the result object - Smart way
result.save("my_simulation_result.result")

# Load the result object so we can do an analysis without redoing the simulation
result = load_result("my_simulation_result.result")

# Extract the ec mortality and lifeyears
mortality = result.events[result.events["tag"] == "ec_death"].groupby("age")['number'].sum()
life_years = result.durations[result.durations["tag"] == "life"].groupby("age")['number'].sum()

# Calculate the mortality rate
mortality_rate = mortality.div(life_years / 1e5)

# Plot the mortality rate
mortality_rate.plot()
plt.show()

# Make and export a pivot table
pivot_table = result.events.pivot_table(index=["universe", "age"], columns="tag", values="number",
                                        aggfunc=np.sum, fill_value=0)
pivot_table.to_csv("pivot_table.csv")