Generate chain of events for individuals #1468

Draft
wants to merge 21 commits into
base: master

marghe-molaro
Collaborator

This PR creates an option to use the simulation to log & print events and their effect on individuals across any number of modules. By collating this information in post-processing, we can create chains of events and outcomes for each simulated individual across different diseases, demographic changes, etc.
This will be useful to generate training data for module emulators/time series analysis, but also for general debugging. In order to assist in the former, the PR additionally gives the user the option to “ignore” the epidemiology of diseases in order to prioritise infecting as many individuals as possible during the runs. This (for now) is controlled by the “generate_event_chains” parameter in the Simulation class. Post-processing is done via postprocess_event_chains.py.


NOTE: This works as intended, but is not yet feasible in terms of RAM and CPU time when used for a large number of individuals. The next step is to work out how to make it so.


How the PR works:

  • Events are collected when fired. This allows us to collect info centrally without modifying any of the existing modules and/or their event functions. This occurs at the Simulation-class level and inside the HealthSystemScheduler function. (In addition, birth “events” are recorded inside the do_birth function as a one-off).
  • In the case of Population-wide events, the event is only logged for those individuals for which the event resulted in a meaningful (i.e. parameter) change. E.g. TbActiveEvent, which is a population-wide event, is only stored for individuals who become actively infected by Tb as a result of the event having been fired. This is not optimal from a run time perspective (see below).
  • The user can specify for which modules events should be logged via “generate_event_chains_modules_of_interest” (e.g. could choose to log events belonging to the Tb+Hiv modules only), and can also specify which events to ignore completely via “generate_event_chains_ignore_events”, if the user has established a priori which events would be particularly cumbersome and uninteresting (e.g. TbActiveCasePollGenerateData, which only schedules the date on which infection will actually occur).
  • Properties of individuals are currently printed in full, and both before and after the event has occurred (in order to identify, in post-processing, which events resulted in meaningful changes in the individual’s parameters). This is far from ideal, given that there could be hundreds of events per individual, and x2 for before+after, and hence not sustainable from a memory standpoint (see discussion below).
  • In post processing, events are grouped by individual (while chronological order is already observed by construction), and all those “fired” after death are eliminated from the chain. For each individual, the chain is scanned to identify which events resulted in meaningful changes for the individual, therefore reconstructing the individual trajectory to outcome.
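The post-processing step described in the last bullet can be sketched in plain Python. This is a minimal illustration, not the logic of postprocess_event_chains.py: the record layout, the `build_chains` helper, the `is_alive` and `tb_inf` property names, and every event name other than TbActiveEvent are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date


def build_chains(records):
    """Group event records by person and keep only the informative ones.

    `records` is assumed to already be in chronological order,
    as stated for the logged events in the PR description.
    """
    chains = defaultdict(list)
    dead = set()
    for rec in records:
        pid = rec["person_ID"]
        if pid in dead:
            continue  # drop events "fired" after death
        if not rec["changes"]:
            continue  # event left this individual's properties unchanged
        chains[pid].append((rec["event_date"], rec["event"], rec["changes"]))
        # Assumed convention: death shows up as is_alive changing to False
        if rec["changes"].get("is_alive") is False:
            dead.add(pid)
    return dict(chains)


# Illustrative records (hypothetical property and event names)
records = [
    {"person_ID": 1, "event": "TbActiveEvent", "event_date": date(2010, 3, 1),
     "changes": {"tb_inf": "active"}},
    {"person_ID": 2, "event": "HivInfectionEvent", "event_date": date(2010, 4, 1),
     "changes": {}},
    {"person_ID": 1, "event": "TbDeathEvent", "event_date": date(2011, 1, 1),
     "changes": {"is_alive": False}},
    {"person_ID": 1, "event": "HSI_Tb_FollowUp", "event_date": date(2011, 2, 1),
     "changes": {"tb_on_treatment": True}},
]
chains = build_chains(records)
```

Here person 1 ends up with a two-event chain ending in death, the post-death event is discarded, and person 2 has no chain because the only event recorded for them changed nothing.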

OPTIMISATION

To reduce CPU requirements:

  • At the moment the dataframe is extended one row at a time, which is slow; it should instead be extended by N rows at a time, as is currently done for the population dataframe.
  • Detecting changes across the entire population dataframe is extremely CPU-costly. On the other hand, printing this info for the entire population without any change-based pre-selection would be extremely RAM-intensive.

To reduce RAM requirements:

  • Print to HDF5 during run time?
  • Reduce the number of individual properties to be logged. Not ideal at this stage if we want to print and collect data in a way that is agnostic as to which parameters are relevant. However, there are certainly some that we know we won’t need (district_num_of_residence, district_of_residence, region_of_residence, age_exact_years, age_years, age_range, age_days). There could, however, be a drawback in CPU time from having to eliminate columns from the dataframe.
  • To avoid printing properties before/after (Note: the maximum benefit of this is a factor-of-two reduction, so not huge):
    -- Assume that the individual has not been modified since the previous event, such that the last “After” state is the “Before” state of the subsequent event. This is a bit risky given that the user can decide not to print all events the individual has experienced. In addition to this, one could also log only the changes. Again, there is a potential drawback in CPU time from having to compare rows.
    -- Ignore all intermediate events, and clearly define a “start” (i.e. onset) and “end” (death/resolution) of the episode. This is what we would need for the emulator; however, there is a lot of very interesting information in between that we would ultimately want to capture, to explore the possibility of training time series or to capture the “infectiousness” of the individual (e.g. viral load status for HIV).

src/tlo/events.py (outdated)
print_chains = True
if self.target != self.sim.population:
    row = self.sim.population.props.iloc[[self.target]]
    row['person_ID'] = self.target
Collaborator

don't think needed?

row['event'] = self
row['event_date'] = self.sim.date
row['when'] = 'After'
self.sim.event_chains = pd.concat([self.sim.event_chains, row], ignore_index=True)
Collaborator

I believe that it's faster to not do these kinds of pandas operations very often, but instead to collect the data in python native structures (sets, dicts, lists, tuples) and then assemble them into a data-frame at the end.
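A minimal sketch of the suggestion above, assuming a hypothetical `EventChainBuffer` helper that is not part of the PR: each logged event becomes a plain dict appended to a list, and a data frame is only built once at the end. The property name `tb_inf` is illustrative.

```python
from datetime import date


class EventChainBuffer:
    """Accumulates one plain dict per logged event (hypothetical helper)."""

    def __init__(self):
        self._rows = []

    def record(self, person_id, event_name, event_date, when, props):
        row = dict(props)  # copy so later changes to `props` do not leak in
        row.update(person_ID=person_id, event=event_name,
                   event_date=event_date, when=when)
        self._rows.append(row)  # list append, instead of a pd.concat per event

    def rows(self):
        return list(self._rows)


buffer = EventChainBuffer()
buffer.record(43, "TbActiveEvent", date(2010, 3, 1), "Before", {"tb_inf": "latent"})
buffer.record(43, "TbActiveEvent", date(2010, 3, 1), "After", {"tb_inf": "active"})
rows = buffer.rows()
# At the end of the run the list can be converted in a single step,
# e.g. with pd.DataFrame(rows) where pandas is available.
```

Appending to a list does not copy the accumulated data, whereas each `pd.concat` call builds a new frame containing everything logged so far.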

Comment on lines +318 to +321
if sim.generate_event_chains is False:
    # Create (and store pointer to) the OtherDeathPoll and schedule first occurrence immediately
    self.other_death_poll = OtherDeathPoll(self)
    sim.schedule_event(self.other_death_poll, sim.date)
Collaborator

unsure why this change is included; perhaps done for something in debugging?

Collaborator Author

This is required if we want to train disease-specific emulators: we don't want other causes of death, including those grouped as "Others" (which we do not explicitly include as a disease module when running a sim, because they are always included by default), to "interfere" with or cut a life short when we are trying to capture the effect of a single disease.

If we want to use this for purposes other than emulator training, then it could become important not to exclude this.

src/tlo/methods/hiv.py (outdated)
print_chains = False
df_before = []

if self.sim.generate_event_chains:
Collaborator

obviously would be nice to factorise-out this logic, which repeats in events.py

(Shame we don't have HSI_Event inheriting from Event, and we'd get it for free. We used to.. but it was changed at some point for a reason I can no longer remember.)

new_rows_after['event_date'] = self.sim.date
new_rows_after['when'] = 'After'

self.event_chains = pd.concat([self.sim.event_chains,new_rows_before], ignore_index=True)
Collaborator

rather than building up the enormous thing in memory in the format of a data frame, I have the feeling that it's more efficient to put out to a logger bit by bit.
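One way to picture "putting out to a logger bit by bit" is the sketch below. It uses Python's standard `logging` module with one JSON line per event as a stand-in; it does not use the project's own logging API, and the logger name, the `log_event` helper, and the event and property names other than TbActiveEvent are assumptions.

```python
import io
import json
import logging

stream = io.StringIO()  # stands in for a log file on disk
logger = logging.getLogger("event_chains_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(handler)


def log_event(person_id, event_name, event_date, changes):
    """Write one event as a single JSON line; nothing is kept in memory."""
    logger.info(json.dumps(
        {"person_ID": person_id, "event": event_name,
         "date": event_date, "changes": changes},
        default=str,  # fall back to str() for values JSON cannot encode
    ))


log_event(43, "TbActiveEvent", "2010-03-01", {"tb_inf": "active"})
log_event(12, "HivInfectionEvent", "2010-04-01", {"hv_inf": True})

# Post-processing reads the lines back one at a time
parsed = [json.loads(line) for line in stream.getvalue().splitlines()]
```

Memory use during the run then depends on the size of a single record rather than on the number of events logged so far.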

self.end_date = None
self.output_file = None
self.population: Optional[Population] = None
self.event_chains: Optional[Population] = None
Collaborator

I think event_chains is a data frame rather than an instance of Population?

new_rows_after['person_ID'] = new_rows_after.index
new_rows_after['event'] = self
new_rows_after['event_date'] = self.sim.date
new_rows_after['when'] = 'After'
Collaborator

why not store only the changes?

@marghe-molaro
Collaborator Author

marghe-molaro commented Oct 11, 2024

Updates:

  • The event chain logger now logs only the changes in an individual's properties, rather than the whole row before and after. In doing so, we assume that none of the events which take place but are not logged lead to meaningful changes for the individual.
    The comparison to find changed properties still seems to be more efficient than the former approach.
  • Tidied up by refactoring into functions.

To add:

  • Could pop dataframe-wide comparison for Population events be done more efficiently?
  • Still can't get the name of the Event object using __name__
  • Is the logging done ok/should it be improved? Currently person_ID is the key to the dict of event details + property changes.
  • We need to store individual properties at the start of the sim, since we are now only printing changes.
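Logging only the changed properties for an individual-level event can be sketched as a dictionary comparison. This is an illustration under assumptions, not the PR's implementation: `changed_properties` is a hypothetical helper and the property names are made up. One detail worth handling is that missing values stored as NaN compare unequal to themselves, so a naive `!=` would report them as changed on every event.

```python
import math


def _same(a, b):
    """Equality that treats two NaNs as equal (NaN != NaN in Python)."""
    if isinstance(a, float) and isinstance(b, float) and math.isnan(a) and math.isnan(b):
        return True
    return a == b


def changed_properties(before, after):
    """Return {property: new_value} for every property that differs."""
    return {
        key: after[key]
        for key in after
        if key not in before or not _same(before[key], after[key])
    }


# Hypothetical snapshot of one individual's row, taken before and after an event
before = {"tb_inf": "latent", "tb_date_active": float("nan"), "hv_inf": False}
after = {"tb_inf": "active", "tb_date_active": float("nan"), "hv_inf": False}
changes = changed_properties(before, after)
```

In this example only `tb_inf` is reported, even though `tb_date_active` is NaN in both snapshots.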

…th, label what is only used for debugging and will be later removed
@marghe-molaro
Collaborator Author

marghe-molaro commented Oct 13, 2024

Hi @tbhallett,

The bulk of the changes we discussed are done. I just have a couple of quick questions (below) before being ready to submit this for formal review. To summarise:

  • The entire property list of each individual is now only logged at the start of the simulation or when the individual is born.
  • Whenever an individual experiences an event of interest (i.e. which belongs to the list of modules of interest and does not belong to the list of events which we explicitly want to ignore) only the following info is logged: A) Event details (name + date) and B) Any property changes the individual incurred as a result. The logging takes the form of a dictionary with person_ID as the key, and an inner dictionary containing A and B.
  • This can be logged one individual at a time in the case of individual events (e.g. {43 : dict_A+B}), or for multiple individuals at once for pop-wide events (e.g. {12: dict_A+B_for_12, 23: dict_A+B_for_23, ...}).
  • The logging is done on the Events module logger (which seemed most appropriate) at level "INFO".
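Because the full property list is logged only once and later entries hold only changes, an individual's state at any point has to be rebuilt in post-processing by replaying the changes. A minimal sketch, assuming a hypothetical `replay` helper, illustrative property names, and an inner-dict layout with `event`, `date` and `changes` keys:

```python
def replay(initial_props, logged_events):
    """Rebuild the full state after each logged event.

    Only valid under the assumption stated in the PR: events that are not
    logged do not change the individual's properties.
    """
    state = dict(initial_props)
    trajectory = []
    for entry in logged_events:
        state.update(entry["changes"])
        trajectory.append((entry["event"], entry["date"], dict(state)))
    return trajectory


# Snapshot logged at the start of the sim (or at birth), then change-only entries
initial = {"tb_inf": "none", "hv_inf": False}
events_for_person_43 = [
    {"event": "TbActiveEvent", "date": "2010-03-01", "changes": {"tb_inf": "active"}},
    {"event": "HivInfectionEvent", "date": "2010-04-01", "changes": {"hv_inf": True}},
]
trajectory = replay(initial, events_for_person_43)
```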

Quick questions:

  1. Whenever the logging takes place, this may contain:
    • A variable number of person_ID keys (depending on whether the event was at individual or population level).
    • An inner dict of different lengths (because each event may change a variable number of individual properties).
    • A person ID which was not initially declared as a key in the first of such logging instances (if the individual was born during the runtime of the sim).
    The logger seems to be unhappy about these variations in length. Do you have any advice as to how I could make this more logger friendly? Including all possible keys with empty values assigned to them seems unnecessarily expensive.

  2. To tidy up the use of this mode, I would like to assign default values to its parameters (generate_event_chains, generate_event_chains_overwrite_epi, generate_event_chains_modules_of_interest, generate_event_chains_ignore_events), which could then be modified by a scenario file (as for any other module). However, there is currently no resource file for the overall sim that would be a natural place to store such default values. Any preference as to how I could approach this? I was thinking the following default values could be specified directly at the sim level, without any need to add a resource file:
    generate_event_chains = False, generate_event_chains_overwrite_epi = False, generate_event_chains_modules_of_interest = [], generate_event_chains_ignore_events = []

This is no longer relevant if this project will fork off master

  3. I am still struggling to get the event name using the "__name__" method. Any advice would be appreciated.
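The defaults proposed in question 2 could, for instance, be held in a small options object at the sim level. The sketch below is only one possible arrangement: the `EventChainOptions` class and its `should_log` method are hypothetical, while the four parameter names and their default values are the ones listed in the question. The filtering rule follows the description of an "event of interest" given earlier in this comment.

```python
from dataclasses import dataclass, field


@dataclass
class EventChainOptions:
    """Hypothetical container for the event-chain settings and their defaults."""
    generate_event_chains: bool = False
    generate_event_chains_overwrite_epi: bool = False
    generate_event_chains_modules_of_interest: list = field(default_factory=list)
    generate_event_chains_ignore_events: list = field(default_factory=list)

    def should_log(self, module_name, event_name):
        """True if the event belongs to a module of interest and is not ignored."""
        if not self.generate_event_chains:
            return False
        if event_name in self.generate_event_chains_ignore_events:
            return False
        return module_name in self.generate_event_chains_modules_of_interest


defaults = EventChainOptions()  # feature switched off, nothing is logged
opts = EventChainOptions(
    generate_event_chains=True,
    generate_event_chains_modules_of_interest=["Tb", "Hiv"],
    generate_event_chains_ignore_events=["TbActiveCasePollGenerateData"],
)
```

Using `field(default_factory=list)` gives each instance its own empty list, which avoids the shared-mutable-default problem that plain `[]` defaults would cause.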

@marghe-molaro
Collaborator Author

marghe-molaro commented Oct 15, 2024

Hi @tbhallett @tamuri @mnjowe @matt-graham,

Apologies for casting the net wide but I'm a bit stuck on this. I was hoping to get someone's input on the following two issues (summary of what this PR does below):

  1. Logging events and changes: The logger seems to be unhappy with how I am currently logging info (see summary below). Whenever the logging takes place, this may contain:
    i) A variable number of person_ID keys (depending on whether the event was at individual or population level)
    ii) An inner dict of different lengths (because each event may change a variable number of individual properties).
    iii) A person ID which was not initially declared as a key in the first of such logging instances (if the individual was born during the runtime of the sim).
    Do you have any advice as to how I could make this more logger friendly? Including all possible keys with empty values assigned to them seems unnecessarily expensive.

  2. Optimising the before/after properties comparison: Finding properties that have changed for individuals following individual events (e.g. line 162 in src/tlo/events.py), and especially following population-level events (e.g. line 182 in src/tlo/events.py), is very cumbersome. Any advice as to how this could be done more efficiently? (Note that there is no real way around the pop-frame comparison before/after the event, because many modules seem to rely on pop-level events for key transitions such as onset, resolution, etc. Short of modifying the modules themselves, there is no other way of knowing which individuals were affected.)

Many thanks in advance! If anything is unclear please let me know.

Summary of PR

  1. The entire property list of each simulated individual is logged at the start of the simulation or when the individual is born.
  2. Whenever an individual experiences an event of interest (i.e. which belongs to the list of modules of interest and does not belong to the list of events which we explicitly want to ignore) only the following info is logged: A) Event details (name + date) and B) Any property changes the individual incurred as a result.
  3. The logging takes the form of a dictionary with person_ID as the key, and an inner dictionary containing A and B.
  4. This can be logged one individual at a time in the case of individual events (e.g. {43 : dict_A+B}), or for multiple individuals at once for pop-wide events (e.g. {12: dict_A+B_for_12, 23: dict_A+B_for_23, ...}).
  5. The logging is done on the Events module logger (which seemed most appropriate) at level "INFO".

@tbhallett
Collaborator

> The logger seems to be unhappy with how I am currently logging info (see summary below). Whenever the logging takes place, this may contain:
> i) A variable number of person_ID keys (depending on whether the event was at individual or population level)

Make this always be a list of Person_IDs, so that whether it's one or more doesn't change the type.

> ii) An inner dict of different lengths (because each event may change a variable number of individual properties).

If in doubt, I coerce to strings for the purpose of logging (and then unpack it using eval) when analysing.

> iii) A person ID which was not initially declared as a key in the first of such logging instances (if the individual was born during the runtime of the sim).

Sorted by first point?

> 2. Optimising the before/after properties comparison: Finding properties that have changed for individuals following individual events (e.g. line 162 in src/tlo/events.py) but especially population-level events (e.g. line 182 in src/tlo/events.py) is very cumbersome. Any advice as to how this could be done more efficiently? (Note that there isn't really getting around the pop-frame comparison before/after the event, because many modules seem to rely on pop-level events for key transitions such as onset, resolution, etc. Short of modifying the modules themselves, there is no other way of knowing which individuals were affected otherwise.).

I think probably using compare is going to be the best solution that captures individual level and population level events.

Asif/Matt will know more, but I was thinking that for an individual-level event, storing only the row as a dict and then using some optimised tool for dict comparison (e.g. https://miguendes.me/the-best-way-to-compare-two-dictionaries-in-python) might be better.

In any case, storing the results of that as a dict seems like the efficient choice to me.
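The first two suggestions above (always a list of person IDs, and coercing the variable-length part to a string) can be combined so that every log record has the same keys and value types. The sketch below is an assumption-laden illustration: `make_log_record` and `read_log_record` are hypothetical helpers, and `ast.literal_eval` is swapped in for the `eval` mentioned above because it only accepts Python literals. Values that are not literals (for example timestamps) would need converting before logging.

```python
import ast


def make_log_record(event_name, event_date, changes_by_person):
    """Flatten {person_id: {prop: value}} into a record with fixed keys and types."""
    person_ids = sorted(changes_by_person)
    return {
        "event": event_name,
        "date": event_date,
        "person_ID": person_ids,  # always a list, even for a single individual
        "changes": str([changes_by_person[p] for p in person_ids]),  # always a string
    }


def read_log_record(record):
    """Recover {person_id: {prop: value}} when analysing the log."""
    changes = ast.literal_eval(record["changes"])
    return dict(zip(record["person_ID"], changes))


# Individual-level and population-level events produce records of the same shape
single = make_log_record("HSI_Tb_Screening", "2010-03-01", {43: {"tb_inf": "active"}})
multi_input = {23: {"tb_inf": "active", "tb_on_treatment": False}, 12: {"tb_inf": "active"}}
multi = make_log_record("TbActiveEvent", "2010-03-01", multi_input)
restored = read_log_record(multi)
```

Individuals born during the run need no special handling, since person IDs are values in a list rather than keys that have to be declared up front.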

@thewati
Collaborator

thewati commented Oct 29, 2024

I would agree with Tim to use pandas compare(), but only for population-level events. I believe the individual-level events involve rather small dictionaries, so it may not be necessary to use compare() for them. For the individual-level events, the method you are already using should be okay, as it seems straightforward. I am just considering the trouble of converting to a different data structure before doing a comparison.


3 participants