Generate chain of events for individuals #1468

marghe-molaro · 2024-10-02T11:35:11Z

This PR creates an option to use the simulation to log & print events and their effect on individuals across any number of modules. By collating this information in post-processing, we can create chains of events and outcomes for each simulated individual across different diseases, demographic changes, etc.
This will be useful to generate training data for module emulators/time series analysis, but also for general debugging. In order to assist in the former, the PR additionally gives the user the option to “ignore” the epidemiology of diseases in order to prioritise infecting as many individuals as possible during the runs. This (for now) is controlled by the “generate_event_chains” parameter in the Simulation class. Post-processing is done via postprocess_event_chains.py.

NOTE: This is working as intended, but not really feasible (RAM+CPU-time wise) if intending to use it for a large number of individuals. Next step will be to think about how to make it so.

How the PR works:

Events are collected when fired. This allows us to collect info centrally without modifying any of the existing modules and/or their event functions. This occurs at the Simulation-class level and inside the HealthSystemScheduler function. (In addition, birth “events” are recorded inside the do_birth function as a one-off).
In the case of Population-wide events, the event is only logged for those individuals for which the event resulted in a meaningful (i.e. parameter) change. E.g. TbActiveEvent, which is a population-wide event, is only stored for individuals who become actively infected by Tb as a result of the event having been fired. This is not optimal from a run time perspective (see below).
The user can specify for which modules events should be logged via “generate_event_chains_modules_of_interest” (e.g. could choose to log events belonging to Tb+Hiv modules only), and also specify which events to ignore completely via “generate_event_chains_ignore_events”, if the user has established a priori which events would be particularly cumbersome+uninteresting (e.g. TbActiveCasePollGenerateData, which only schedules the date on which infection will actually occur).
Properties of individuals are currently printed in full, and both before and after the event has occurred (in order to identify, in post-processing, which events resulted in meaningful changes in the individual’s parameters). This is far from ideal, given that there could be hundreds of events per individual, and x2 for before+after, and hence not sustainable from a memory standpoint (see discussion below).
In post processing, events are grouped by individual (while chronological order is already observed by construction), and all those “fired” after death are eliminated from the chain. For each individual, the chain is scanned to identify which events resulted in meaningful changes for the individual, therefore reconstructing the individual trajectory to outcome.
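The post-processing step described above can be sketched roughly as follows. This is a minimal illustration, not the contents of postprocess_event_chains.py: the record layout (person_id, date, event, changes), the example event names, and the use of an is_alive property to mark death are all assumptions made for the example.

```python
from collections import defaultdict


def build_chains(records):
    """Group logged events by individual and trim each chain.

    `records` is assumed to be an iterable of dicts, already in chronological
    order, with hypothetical keys: person_id, date, event, changes.
    `changes` maps property name -> new value (empty if nothing changed).
    """
    by_person = defaultdict(list)
    for rec in records:
        by_person[rec["person_id"]].append(rec)

    chains = {}
    for person_id, events in by_person.items():
        chain = []
        for rec in events:
            if rec["changes"]:  # keep only events that changed a property
                chain.append((rec["date"], rec["event"], rec["changes"]))
            # Assumption: death is visible as is_alive switching to False.
            if rec["changes"].get("is_alive") is False:
                break  # anything "fired" after death is dropped
        chains[person_id] = chain
    return chains


records = [
    {"person_id": 1, "date": "2010-01-05", "event": "ExampleOnsetEvent",
     "changes": {"example_infected": True}},
    {"person_id": 2, "date": "2010-01-05", "event": "ExampleOnsetEvent",
     "changes": {}},
    {"person_id": 1, "date": "2010-03-01", "event": "ExampleDeathEvent",
     "changes": {"is_alive": False}},
    {"person_id": 1, "date": "2010-04-01", "event": "ExampleTreatmentEvent",
     "changes": {"example_on_treatment": True}},
]
chains = build_chains(records)
```

Here person 1 keeps the onset and death events, the treatment event scheduled after death is discarded, and person 2 ends up with an empty chain because the event changed nothing for them.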

OPTIMISATION

To reduce CPU requirements:

At the moment the dataframe is extended row by row every time an event is logged; this is not ideal, and it should instead be extended by N rows at a time, as is currently done for the population dataframe.
Detecting changes in the entire population dataframe is extremely CPU costly. On the other hand, printing this info for the entire population without any change-based pre-selection would be extremely RAM intensive…

To reduce RAM requirements:

Print to HDF5 during run time?
Reduce the number of individual properties to be logged. Not ideal at this stage if we want to print and collect data in a way that is agnostic as to what parameters are relevant. However, there are certainly some that we know we won’t need (district_num_of_residence, district_of_residence, region_of_residence, age_exact_years, age_years, age_range, age_days). However, there could be a potential draw-back in CPU time from having to eliminate columns from the dataframe.
To avoid printing properties before/after (Note: maximum benefit to this is x ½ reduction, so not huge):
-- Assume that the individual has not been modified since the previous event, such that the last “After” state is the “Before” state of the subsequent event. This is a bit risky given that the user can decide not to print all events the individual has experienced. In addition to this, one could also log only the changes. Again, there is a potential drawback in CPU time from having to compare rows.
-- Decide to ignore all intermediate events, and clearly define a “start” (i.e. onset) and end (“death”/“resolution”) of the episode. This is what we would need for the emulator; however, there is a lot of very interesting information in between that we would ultimately want to capture, to explore the possibility of training time series or to capture the “infectiousness” of the individual (e.g. viral load status for HIV).

src/tlo/events.py

tbhallett · 2024-10-03T08:45:53Z

src/tlo/events.py

+                print_chains = True
+                if self.target != self.sim.population:
+                    row = self.sim.population.props.iloc[[self.target]]
+                    row['person_ID'] = self.target


don't think needed?

tbhallett · 2024-10-03T08:47:43Z

src/tlo/events.py

+                row['event'] = self
+                row['event_date'] = self.sim.date
+                row['when'] = 'After'
+                self.sim.event_chains = pd.concat([self.sim.event_chains, row], ignore_index=True)


I believe that it's faster to not do these kinds of pandas operations very often, but instead to collect the data in python native structures (sets, dicts, lists, tuples) and then assemble them into a data-frame at the end.
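A rough sketch of that suggestion: accumulate one plain dict per logged row in a Python list, and build the data frame once at the end. The class, event and property names here (EventChainCollector, ExampleEvent, example_property) are illustrative and not part of the PR; pandas is imported only at the point of assembly.

```python
class EventChainCollector:
    """Collects event rows as plain dicts; hypothetical helper, not TLO code."""

    def __init__(self):
        self.rows = []

    def record(self, person_id, event_name, date, when, properties):
        # Appending to a list is amortised O(1); no data frame is copied.
        row = {"person_ID": person_id, "event": event_name,
               "event_date": date, "when": when}
        row.update(properties)
        self.rows.append(row)

    def to_frame(self):
        # Assemble once, after the run, instead of calling pd.concat per event.
        import pandas as pd
        return pd.DataFrame.from_records(self.rows)


collector = EventChainCollector()
collector.record(43, "ExampleEvent", "2010-02-01", "Before", {"example_property": False})
collector.record(43, "ExampleEvent", "2010-02-01", "After", {"example_property": True})
```

Calling collector.to_frame() at the end of the run would then give one row per record, with the same person_ID, event, event_date and when columns as in the diff above.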

tbhallett · 2024-10-03T08:48:56Z

src/tlo/methods/demography.py

+        if sim.generate_event_chains is False:
+            # Create (and store pointer to) the OtherDeathPoll and schedule first occurrence immediately
+            self.other_death_poll = OtherDeathPoll(self)
+            sim.schedule_event(self.other_death_poll, sim.date)


unsure why this change is included; perhaps done for something in debugging?

marghe-molaro · 2024-10-07

This is required if we want to train disease-specific emulators: we don't want other causes of death, including those grouped as "Others" (which we do not explicitly include as a disease module when running a sim, because they are always included by default), to "interfere" and cut a life short when we are trying to capture the effect of a single disease.

If we want to use this for purposes other than emulator training, then it could become important not to exclude this.

src/tlo/methods/hiv.py

tbhallett · 2024-10-03T08:51:21Z

src/tlo/methods/hsi_event.py

+        print_chains = False
+        df_before = []
+
+        if self.sim.generate_event_chains:


obviously would be nice to factorise-out this logic, which repeats in events.py

(Shame we don't have HSI_Event inheriting from Event, and we'd get it for free. We used to.. but it was changed at some point for a reason I can no longer remember.)

tbhallett · 2024-10-03T08:52:20Z

src/tlo/methods/hsi_event.py

+                    new_rows_after['event_date'] = self.sim.date
+                    new_rows_after['when'] = 'After'
+
+                    self.event_chains = pd.concat([self.sim.event_chains,new_rows_before], ignore_index=True)


rather than building up the enormous thing in memory in the format of a data frame, I have the feeling that it's more efficient to put out to a logger bit by bit.
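One way to picture this, as a sketch only: emit each record through a logger as it is produced, so that nothing accumulates in memory. The example uses Python's standard logging and json modules rather than the TLO logger, and the logger name and record fields are made up for illustration.

```python
import io
import json
import logging

# Stand-in destination; in practice this would be a file handler.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(message)s"))

chain_logger = logging.getLogger("example.event_chains")  # hypothetical name
chain_logger.setLevel(logging.INFO)
chain_logger.addHandler(handler)
chain_logger.propagate = False


def log_event(person_id, event_name, date, changes):
    # One JSON line per event; the record is not kept once it is written.
    chain_logger.info(json.dumps(
        {"person_ID": person_id, "event": event_name,
         "date": date, "changes": changes}
    ))


log_event(43, "ExampleEvent", "2010-02-01", {"example_property": True})
log_event(12, "ExampleEvent", "2010-02-01", {})

# Reading the output back, line by line, at analysis time.
lines = stream.getvalue().splitlines()
parsed = [json.loads(line) for line in lines]
```

The trade-off is that the analysis step has to parse the log afterwards, but peak memory during the run no longer grows with the number of events.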

tbhallett · 2024-10-03T08:53:19Z

src/tlo/simulation.py

        self.end_date = None
        self.output_file = None
        self.population: Optional[Population] = None
+        self.event_chains: Optinoal[Population] = None


I think event_chains is a data frame rather than an instance of Population?

tbhallett · 2024-10-03T08:57:34Z

src/tlo/events.py

+                    new_rows_after['person_ID'] = new_rows_after.index
+                    new_rows_after['event'] = self
+                    new_rows_after['event_date'] = self.sim.date
+                    new_rows_after['when'] = 'After'


why not store only the changes?

marghe-molaro · 2024-10-11T16:01:19Z

Updates:

The event chain logger now only logs changes in an individual's properties rather than the whole row before+after. In doing so, we assume that none of the events which take place but which we do not log lead to meaningful changes for the individual.
The comparison to find changed properties still seems to be more efficient than the former approach.
Tidied up by refactoring into functions.

To add:

Could pop dataframe-wide comparison for Population events be done more efficiently?
Still can't get the name of the Event object using __name__
Is the logging done ok/should it be improved? Currently person_ID is the key to the dict of event details + property changes.
We need to store individual properties at the start of the sim, since now only printing changes;
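Because only changes are logged after the initial snapshot, the full state of an individual at any later point has to be rebuilt by replaying those changes. A minimal sketch of that replay, with invented property and event names and a hypothetical entry layout:

```python
def replay(initial_properties, change_log):
    """Yield (date, event_name, state) after each logged event.

    `initial_properties` is the snapshot taken at simulation start or birth;
    `change_log` is a chronological list of (date, event_name, changes) tuples.
    Both layouts are assumptions made for this example.
    """
    state = dict(initial_properties)  # copy, so the snapshot is left untouched
    for date, event_name, changes in change_log:
        state.update(changes)
        yield date, event_name, dict(state)


snapshot = {"is_alive": True, "example_infected": False}
log = [
    ("2010-02-01", "ExampleOnsetEvent", {"example_infected": True}),
    ("2011-06-15", "ExampleDeathEvent", {"is_alive": False}),
]
states = list(replay(snapshot, log))
```

This only reproduces the true state if events that are not logged leave the individual unchanged, which is the assumption stated in the update above.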

marghe-molaro · 2024-10-13T10:33:01Z

Hi @tbhallett,

The bulk of the changes we discussed are done. I just have a couple of quick questions (below) before being ready to submit this for formal review. To summarise:

The entire property list of each individual is now only logged at the start of the simulation or when the individual is born.
Whenever an individual experiences an event of interest (i.e. which belongs to the list of modules of interest and does not belong to the list of events which we explicitly want to ignore) only the following info is logged: A) Event details (name + date) and B) Any property changes the individual incurred as a result. The logging takes the form of a dictionary with person_ID as the key, and an inner dictionary containing A and B.
This can be logged one individual at the time in the case of individual events (e.g. {43 : dict_A+B}), or for multiple individuals at once for pop-wide events (e.g. {12: dict_A+B_for_12, 23: dict_A+B_for_23, ...}).
The logging is done on the Events module logger (which seemed most appropriate) at level "INFO".

Quick questions:

1. Whenever the logging takes place, this may contain:

i) A variable number of person_ID keys (depending on whether the event was at individual or population level)
ii) An inner dict of different lengths (because each event may change a variable number of individual properties).
iii) A person ID which was not initially declared as a key in the first of such logging instances (if the individual was born during the runtime of the sim).
The logger seems to be unhappy about these variations in length. Do you have any advice as to how I could make this more logger friendly? Including all possible keys with empty values assigned to them seems unnecessarily expensive.

2. To tidy up the use of this modality, I would like to assign default values for its parameters (generate_event_chains, generate_event_chains_overwrite_epi, generate_event_chains_modules_of_interest, generate_event_chains_ignore_events) which could then be modified by a scenario file (like for any other module). However there is currently no resource file for the overall sim that could be a natural place for such default values to be stored. Any preference as to how I could approach this? I was thinking the following default values could maybe be directly specified at the sim level, without any need to add a resource file:
generate_event_chains = False, generate_event_chains_overwrite_epi = False, generate_event_chains_modules_of_interest = [], generate_event_chains_ignore_events = []
This is no longer relevant if this project will fork off master

3. I am still struggling to get the event name using the "__name__" method. Any advice would be appreciated.

marghe-molaro · 2024-10-15T16:33:25Z

Hi @tbhallett @tamuri @mnjowe @matt-graham,

Apologies for casting the net wide but I'm a bit stuck on this. I was hoping to get someone's input on the following two issues (summary of what this PR does below):

Logging events and changes: The logger seems to be unhappy with how I am currently logging info (see summary below). Whenever the logging takes place, this may contain:
i) A variable number of person_ID keys (depending on whether the event was at individual or population level)
ii) An inner dict of different lengths (because each event may change a variable number of individual properties).
iii) A person ID which was not initially declared as a key in the first of such logging instances (if the individual was born during the runtime of the sim).
Do you have any advice as to how I could make this more logger friendly? Including all possible keys with empty values assigned to them seems unnecessarily expensive.
Optimising the before/after properties comparison: Finding properties that have changed for individuals following individual events (e.g. line 162 in src/tlo/events.py) but especially population-level events (e.g. line 182 in src/tlo/events.py) is very cumbersome. Any advice as to how this could be done more efficiently? (Note that there isn't really getting around the pop-frame comparison before/after the event, because many modules seem to rely on pop-level events for key transitions such as onset, resolution, etc. Short of modifying the modules themselves, there is no other way of knowing which individuals were affected otherwise.).

Many thanks in advance! If anything is unclear please let me know.

Summary of PR

The entire property list of each simulated individual is logged at the start of the simulation or when the individual is born.
Whenever an individual experiences an event of interest (i.e. which belongs to the list of modules of interest and does not belong to the list of events which we explicitly want to ignore) only the following info is logged: A) Event details (name + date) and B) Any property changes the individual incurred as a result.
The logging takes the form of a dictionary with person_ID as the key, and an inner dictionary containing A and B.
This can be logged one individual at the time in the case of individual events (e.g. {43 : dict_A+B}), or for multiple individuals at once for pop-wide events (e.g. {12: dict_A+B_for_12, 23: dict_A+B_for_23, ...}).
The logging is done on the Events module logger (which seemed most appropriate) at level "INFO".

tbhallett · 2024-10-21T09:30:45Z

The logger seems to be unhappy with how I am currently logging info (see summary below). Whenever the logging takes place, this may contain:
i) A variable number of person_ID keys (depending on whether the event was at individual or population level)

Make this always be a list of Person_IDs, so that whether it's one or more doesn't change the type.

ii) An inner dict of different lengths (because each event may change a variable number of individual properties).

If in doubt, I coerce to strings for the purpose of logging (and then unpack it using eval) when analysing.

iii) A person ID which was not initially declared as a key in the first of such logging instances (if the individual was born during the runtime of the sim).

Sorted by first point?
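Taken together, the first two suggestions amount to giving every log entry the same keys and types regardless of the event. A sketch of one such shape, with hypothetical field and event names; it parses the string back with ast.literal_eval, a stricter stand-in for the eval mentioned above that only accepts Python literals:

```python
import ast


def make_entry(event_name, date, changes_by_person):
    """Build a fixed-shape log entry from {person_ID: {property: new_value}}."""
    return {
        "event": event_name,
        "date": date,
        # Always a list, whether one person or many were affected.
        "person_IDs": list(changes_by_person),
        # Variable-length content is flattened to a single string field.
        "changes": str(changes_by_person),
    }


def read_changes(entry):
    # literal_eval only evaluates literals (dicts, lists, numbers, strings, ...).
    return ast.literal_eval(entry["changes"])


individual = make_entry("ExampleEvent", "2010-02-01",
                        {43: {"example_property": True}})
population = make_entry("ExamplePollEvent", "2010-03-01",
                        {12: {"a": 1}, 23: {"a": 2, "b": "x"}})
```

Both entries have identical keys, so individuals born during the run simply appear in person_IDs when they are first affected. Values that are not plain literals (for example timestamps or NaN) would need converting before str() for this round trip to work.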

Optimising the before/after properties comparison: Finding properties that have changed for individuals following individual events (e.g. line 162 in src/tlo/events.py) but especially population-level events (e.g. line 182 in src/tlo/events.py) is very cumbersome. Any advice as to how this could be done more efficiently? (Note that there isn't really getting around the pop-frame comparison before/after the event, because many modules seem to rely on pop-level events for key transitions such as onset, resolution, etc. Short of modifying the modules themselves, there is no other way of knowing which individuals were affected otherwise.).

I think probably using compare is going to be the best solution that captures individual level and population level events.

Asif/Matt will know more, but I was thinking that for an individual-level event, storing only the row as a dict and then using some optimised tool for dict comparison (e.g. https://miguendes.me/the-best-way-to-compare-two-dictionaries-in-python) might be better.

In any case, storing the results of that as a dict seems like the efficient choice to me.
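For an individual-level event, the dict-based comparison suggested here could look something like the following. It is a plain-Python sketch rather than any particular library's diff routine, and it assumes the row has been captured as a dict before and after the event; the property names are invented. Missing values need care because NaN is not equal to itself.

```python
import math


def _same(a, b):
    # Treat two NaNs as equal so an unchanged missing value is not reported.
    if isinstance(a, float) and isinstance(b, float) and math.isnan(a) and math.isnan(b):
        return True
    return a == b


def changed_properties(before, after):
    """Return {property: new_value} for every property whose value differs."""
    return {key: after[key] for key in after
            if key not in before or not _same(before[key], after[key])}


before = {"is_alive": True, "example_stage": "none", "example_score": float("nan")}
after = {"is_alive": True, "example_stage": "early", "example_score": float("nan")}
changes = changed_properties(before, after)
```

The result contains only example_stage, and it is already in the dict form that would then be logged.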

thewati · 2024-10-29T10:01:09Z

The logger seems to be unhappy with how I am currently logging info (see summary below). Whenever the logging takes place, this may contain:
i) A variable number of person_ID keys (depending on whether the event was at individual or population level)

Make this always be a list of Person_IDs, so that whether it's one or more doesn't change the type.

ii) An inner dict of different lengths (because each event may change a variable number of individual properties).

If in doubt, I coerce to strings for the purpose of logging (and then unpack it using eval) when analysing.

iii) A person ID which was not initially declared as a key in the first of such logging instances (if the individual was born during the runtime of the sim).

Sorted by first point?

Optimising the before/after properties comparison: Finding properties that have changed for individuals following individual events (e.g. line 162 in src/tlo/events.py) but especially population-level events (e.g. line 182 in src/tlo/events.py) is very cumbersome. Any advice as to how this could be done more efficiently? (Note that there isn't really getting around the pop-frame comparison before/after the event, because many modules seem to rely on pop-level events for key transitions such as onset, resolution, etc. Short of modifying the modules themselves, there is no other way of knowing which individuals were affected otherwise.).

I think probably using compare is going to be the best solution that captures individual level and population level events.

Asif/Matt will know more, but I was thinking that for an individual-level event, storing only the row as a dict and then using some optimised tool for dict comparison (e.g. https://miguendes.me/the-best-way-to-compare-two-dictionaries-in-python) might be better.

In any case, storing the results of that as a dict seems like the efficient choice to me.

I would agree with Tim to use pandas compare() but only for population-level-events. However, I believe the individual level events are rather small dictionaries, so it may not be necessary to use compare() for them. For the individual-level-events, using the same method you're using should be okay as it seems to be straightforward. Just considering the trouble of converting to a different data structure before doing a comparison.

marghe-molaro added 5 commits April 3, 2024 15:00

Investigate analysis of events at sim level

dbff470

Merge branch 'master' into molaro/harvest-training-data

bf64628

Merged master in branch

Final data-printing set-up

05098f7

Print event chains

16c071c

Add chains in mode 2 too and clean up in simulation

ba81487

marghe-molaro requested a review from tbhallett October 2, 2024 11:35

Merged with master, and moved all logging into event module to keep things tidier

0474624

tbhallett reviewed Oct 3, 2024

marghe-molaro added 6 commits October 7, 2024 09:36

Fix issue with tests by ensuring standard Polling and infection is maintained if generate_event_chains is None

b1c907c

Switch iloc for loc

cfb4264

Change syntax of if statement

e0327de

Change syntax of if statement and print string of event

fceee02

Focus on rti and print footprint

eaeae62

Only store change in individual properties, not entire property row. Log changes to logger.

c7bd9d0

marghe-molaro added 2 commits October 11, 2024 17:03

Style fixes

769aaec

Include printing of individual properties at the beginning and at birth, label what is only used for debugging and will be later removed

757cee3

marghe-molaro added 6 commits October 16, 2024 14:00

Log everything to simulation, as events logger doesn't seem to be visible to all modules. For now add person_ID to the dict of info printed as the outer dictionary key logging seems to have a problem.

22a5e44

Consider all modules included as of interest

7faa817

Remove pop-wide HSI warning and make epi default even when printing chains

7232f97

Merge branch 'master' into molaro/harvest-training-data

98a8832

Merge master

Style fix

a6def2d

Remove data generation test, which wasn't really a test

ecea532

Change dict of properties to string in logging, and add analysis files

ae7a44c
