load() tries to return indices it may not have #327
Comments
Potential fix: 8071a40 |
Additional info from #327: contains duplicate keys, e.g. see and
Interestingly, the error only occurs on data saved in aeon4, not in aeon3. Also interestingly, this works:

```python
aeon.load(
    "/ceph/aeon/aeon/data/raw/AEON4/social0.2",
    social02.Environment.SubjectVisits,
    pd.Timestamp("2024-01-31 10:00:00"),
    pd.Timestamp("2024-02-05 13:00:00"),
)
```

but moving the end timestamp up by 1 hour:

```python
aeon.load(
    "/ceph/aeon/aeon/data/raw/AEON4/social0.2",
    social02.Environment.SubjectVisits,
    pd.Timestamp("2024-01-31 10:00:00"),
    pd.Timestamp("2024-02-05 14:00:00"),
)
```

throws this error:
|
Additional info noticed by @ttngu207: "We've also encountered this type of KeyError with other readers throughout the different parts of ingestion. Probably the same root cause", e.g. with rfid events. |
This means the non-monotonic indices are between 13:00 and 14:00.

```python
if start is not None or end is not None:
    try:
        return data.loc[start:end]
    except KeyError:
        if not data.index.is_monotonic_increasing:
            warnings.warn(
                f"data index for {reader.pattern} contains out-of-order timestamps!"
            )
            data = data.sort_index()
        return data.loc[start:end]
```
|
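The failure mode can be reproduced with a minimal sketch (the timestamps and values below are hypothetical, not taken from the actual data): slicing a non-monotonic DatetimeIndex with bounds that are not present as labels raises `KeyError`, while the same slice succeeds after `sort_index()`:

```python
import pandas as pd

# Minimal repro (hypothetical data): an out-of-order timestamp index,
# sliced with bounds that are not present as labels.
idx = pd.to_datetime(
    ["2024-02-05 13:10", "2024-02-05 13:30", "2024-02-05 13:20"]
)
df = pd.DataFrame({"value": [1, 2, 3]}, index=idx)

start = pd.Timestamp("2024-02-05 13:00:00")
end = pd.Timestamp("2024-02-05 14:00:00")

try:
    df.loc[start:end]  # non-monotonic index: raises KeyError
except KeyError:
    print("KeyError on non-monotonic index")

# After sorting, label slicing falls back to a binary search and succeeds.
print(len(df.sort_index().loc[start:end]))  # 3
```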
@lochhh and I have noticed that the current fix (commit 8071a40) can cause api.load to drop the final row. For example, the table retrieved by |
A few more instances of this error
You can load with
And you should see the |
As an update to this, @JaerongA has provided a CSV of chunks where this occurs on Aeon3 (additional cases have occurred on Aeon4). Unfortunately, this is not always limited to the first chunk in an epoch, though that is where this error most often occurs. |
An update on this issue:
|
A fundamental issue here seems to be that we often have what is really a multi-index data frame. All rows with duplicate timestamps actually have a secondary (or tertiary) index which discriminates the rows, e.g. animal ID, body part ID. A possible solution might be to make this explicit by returning a MultiIndex dataframe and determining how to properly index it, e.g. see https://stackoverflow.com/questions/66588217/how-to-slice-into-a-multiindex-pandas-dataframe

Related to #294 |
Example of how to create and manipulate a multi-index frame:

**Creating the multi-index**

Simple example data-frame with duplicate "timestamps":

```python
df = pd.DataFrame(
    [[0, 32, 24], [1, 33, 45], [0, 32, 25], [1, 42, 60]],
    index=[23, 23, 24, 24],
    columns=['id', 'x', 'y'],
)
```

This will return the following dataframe:

```
    id   x   y
23   0  32  24
23   1  33  45
24   0  32  25
24   1  42  60
```

The idea here is that the key is some timestamp in seconds, and all duplicate timestamps include a column acting as a secondary key, in this case `id`:

```python
mi = pd.MultiIndex.from_tuples(zip(df.index, df.id))
```

Assigning this multi-index to the dataframe (and dropping the now-redundant `id` column):
**Indexing the multi-index**

Given the above dataframe, the below should all be valid queries over the multi-index frame:

Return all data at a specific timestamp:

```python
df.loc[23]
```

Return all data between a range of timestamps:

```python
df.loc[23:24]
```

Reindex data with the multi-index:

```python
df.reindex([(23, 0), (23, 1)], method='pad')
```

In this case we need to be explicit and, for each timestamp, create a tuple that reindexes that time for all secondary keys of the multi-index. This could potentially be automated with a similar strategy to the above.

**Reindex data using tolerance**

This is unfortunately where vanilla pandas first falls short:

```python
df.reindex([(23, 0), (23, 1)], method='pad', tolerance=1)
```

outputs:
Sadly the latest version of pandas still doesn't support this out of the box, so even though it looks quite doable to export everything to multi-index, it wouldn't solve the ultimate purpose of flexibly extracting data from streams close to events from another stream. The tolerance limit is important so we don't pick up random far-away events simply because there is no data. As a glimmer of hope, though, the below works, and would probably work for all periodic streams:

```python
df.reindex([(23.1, 0), (23.1, 1)], method='pad')
```

For other streams we would need to be careful and keep this limitation in mind. |
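One possible workaround, sketched below against the small example frame above (hypothetical data, not tested against the aeon readers), is to apply the tolerance on the plain timestamp level, where pandas does support it, and then look up the matched timestamps in the multi-index frame:

```python
import pandas as pd

# Hypothetical example frame with a (time, id) multi-index.
df = pd.DataFrame(
    [[32, 24], [33, 45], [32, 25], [42, 60]],
    index=pd.MultiIndex.from_tuples(
        [(23, 0), (23, 1), (24, 0), (24, 1)], names=["time", "id"]
    ),
    columns=["x", "y"],
)

# Apply the tolerance on the plain timestamp level, where pandas
# supports it, instead of on the multi-index itself.
times = df.index.get_level_values("time").unique()
targets = pd.Index([23.4, 30.0])
pos = times.get_indexer(targets, method="pad", tolerance=1.0)

# -1 means no timestamp within tolerance; keep only valid matches,
# then select every secondary key at the matched timestamps.
matched = times[pos[pos >= 0]]
result = df.loc[matched]
print(len(result))  # 2 rows: both ids at timestamp 23
```

Here 23.4 pads back to timestamp 23 (within tolerance), while 30.0 is more than 1 away from any timestamp and is dropped rather than matched to a far-away event.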
This is a good point. If we are building the multi-index with zip anyway, we can also easily add an extra optional "sequence number" index level for frames with duplicate entries to make it more efficient. |
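As a sketch of that sequence-number idea (hypothetical data, not the aeon implementation), `groupby(level=0).cumcount()` can generate the per-timestamp sequence numbers:

```python
import pandas as pd

# Hypothetical frame with duplicate "timestamps" in a plain index.
df = pd.DataFrame(
    {"id": [0, 1, 0, 1], "x": [32, 33, 32, 42], "y": [24, 45, 25, 60]},
    index=[23, 23, 24, 24],
)

# cumcount numbers duplicates 0, 1, 2, ... within each timestamp,
# so (time, seq) pairs are unique even when timestamps repeat.
seq = df.groupby(level=0).cumcount()
df.index = pd.MultiIndex.from_arrays([df.index, seq], names=["time", "seq"])
print(df.index.is_unique)  # True once the sequence level is added
```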
@glopesdev @lochhh do you remember the status of this? |
https://github.com/SainsburyWellcomeCentre/aeon_mecha/blob/main/aeon/io/api.py#L140
Here, there may not be data corresponding to the `start` or `end` index, due to these not aligning with a given chunk.
e.g. imagine you are calling
but the acquisition epoch started after 14:00:00 but before 15:00:00; in this case there would be no index in the data corresponding to `start`.