Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New ingestion_schemas + MANY minor fixes and improvements #438

Open
wants to merge 48 commits into
base: datajoint_pipeline
Choose a base branch
from

Conversation

ttngu207
Copy link
Contributor

@ttngu207 ttngu207 commented Oct 18, 2024

This PR includes a few major new logic/features and many minor fixes (mostly for BlockAnalysis)

  1. Use ingestion_schemas - i.e. a separate set of schemas for DJ ingestion (specialized Encoder reader and Video reader)
    • Encoder reader with default downsampling to 50Hz
    • Video reader with only the hw_timestamp column
  2. Update streams.py DJ schema accordingly (mainly to drop extra columns when reading Video data)
  3. Add special handling to ingest fullpose data for social0.2
  4. Add a new tracking table: BlobPosition to read and store blob position tracking when SLEAP data is not available
  5. Fixes & improvements for BlockAnalysis
    • Bugfix incorrect extraction of subject_in_patch times
    • fetch_stream function rounds times to microseconds (mysql precision is to microseconds only)
    • BlockAnalysis use BlobPosition when SLEAPTracking is not available
    • improve logic to search for chunks in a given block
    • BlockDetection - when double 0s are found, use the first 0s (instead of the 2nd one)
    • Add BlockForaging computed table

Fix #427
Replace #437

ttngu207 and others added 30 commits August 8, 2024 08:52
…Centre/gl-issue-418

Allow reading pose model metadata from local folder
why? corrupted data in harp files? not sure
@ttngu207 ttngu207 changed the title Datajoint pipeline New ingestion_schemas + MANY minor fixes and improvements Oct 22, 2024
@ttngu207 ttngu207 marked this pull request as ready for review October 22, 2024 18:56
Args:
query (datajoint.Query): A query object containing data from a Stream table
drop_pk (bool, optional): Drop primary key columns. Defaults to True.
round_microseconds (bool, optional): Round timestamps to microseconds. Defaults to False.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defaults to True?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me this is ok?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mismatch is in the description of round_microseconds which says "Defaults to False"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

# Patch data - TriggerPellet, DepletionState, Encoder (distancetravelled)
# For wheel data, downsample to 10Hz
final_encoder_fs = 10
final_encoder_hz = 10
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we actually want this to be 50 hz, not 10 hz

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Encoder data is ingested at 50Hz, this is at the BlockAnalysis step, do we also want 50Hz here (or further downsample to 10?)

if encoder_df.empty:
encoder_df["distance_travelled"] = 0
else:
encoder_df["distance_travelled"] = -1 * analysis_utils.distancetravelled(encoder_df.angle)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a comment saying something like -1 is for placement of magnetic encoder, where wheel movement actually decreases encoder value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

patch_rate = depletion_state_df.rate.iloc[0]
patch_offset = depletion_state_df.offset.iloc[0]
# handles patch rate value being INF
patch_rate = 999999999 if np.isinf(patch_rate) else patch_rate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it actually an issue if patch rate is inf? Does it cause some downstream issue? We do this as default when no env is loaded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's due to MySQL float doesn't handle INF well - we can convert to NaN but that would lose the intended meaning of INF here

@@ -288,27 +299,50 @@ def make(self, key):
& f'chunk_start <= "{chunk_keys[-1]["chunk_start"]}"'
)[:block_start]
subject_visits_df = subject_visits_df[subject_visits_df.region == "Environment"]
subject_visits_df = subject_visits_df[~subject_visits_df.id.str.contains("Test", case=False)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sometimes we use other, non "Test" subjects as test subjects. Maybe the check should be, if the subject does not begin with 'baa' (can str.lower to check for regardless of case) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Is startswith("baa") reliable, future-proof? Perhaps better logic is to also cross check with the Subjects to be manually specified by users for a particular experiment

)
pos_df = fetch_stream(pos_query)[block_start:block_end]
pos_df["likelihood"] = np.nan
# keep only rows with area between 0 and 1000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this because areas of > 1000 is likely an experimenter, or some other artifact? Maybe specify that in the comment?

data = pd.DataFrame(payload, index=seconds, columns=self.columns)

# remove rows where the index is zero (why? corrupted data in harp files?)
data = data[data.index != 0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This I guess will be fixed in the next PR targeting a new Encoder reader in ingestion_schemas.py ?

@@ -48,6 +48,12 @@ def __init__(self, path):
super().__init__(_reader.Pose(f"{path}_test-node1*"))


class Pose03(Stream):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made add a comment that this is necessary due to changing registers for the pose streams for social02 in particular? And that 03 corresponds to the fact that this is because this pattern is what we're going with for social03 and moving forward? Or call this class something else?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or just import Pose from social_03.py as Pose03? I think I like something like this actually

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is primarily due to full-pose data for social0.2 has a different pattern than that of the non-fullpose SLEAP data.
I don't think you can do from social_03 import Pose as Pose03, the way the the Devices/Streams being instantiated is that it's using the class name, so aliasing wouldn't work.

Copy link
Member

@jkbhagatio jkbhagatio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a bunch of small comments here.

A general point - maybe add top-level module docstrings for all these modules? Unless this will be done by @MilagrosMarin in #443 ?

@ttngu207
Copy link
Contributor Author

I've left a bunch of small comments here.

A general point - maybe add top-level module docstrings for all these modules? Unless this will be done by @MilagrosMarin in #443 ?

Yes, this will be address when we apply Ruff checks and fixes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants