SubjectVisits stream cannot be loaded due to KeyErrors #330

Closed
jkbhagatio opened this issue Feb 6, 2024 · 5 comments
@jkbhagatio
Member

jkbhagatio commented Feb 6, 2024

The SubjectVisits stream contains duplicate keys (e.g. see Z:\aeon\data\raw\AEON4\social0.2\2024-01-31T10-14-14\Environment\Environment_SubjectVisits_2024-01-31T10-00-00.csv), and aeon.load throws the KeyError shown below.

Interestingly, the error only occurs on data saved in aeon3, not in aeon4.

Also interestingly, this works:

aeon.load(
    "/ceph/aeon/aeon/data/raw/AEON4/social0.2",
    social02.Environment.SubjectVisits,
    pd.Timestamp("2024-01-31 10:00:00"),
    pd.Timestamp("2024-02-05 13:00:00"),
)

but moving the end timestamp one hour later:

aeon.load(
    "/ceph/aeon/aeon/data/raw/AEON4/social0.2",
    social02.Environment.SubjectVisits,
    pd.Timestamp("2024-01-31 10:00:00"),
    pd.Timestamp("2024-02-05 14:00:00"),
)

throws this error:

KeyError Traceback (most recent call last)
Cell In[14], line 3
1 """Environment info."""
----> 3 aeon.load(block.root, social02.Environment.SubjectVisits, pd.Timestamp("2024-01-31 10:00:00"), exp_end)

File ~/ProjectAeon/aeon_mecha/aeon/io/api.py:151, in load(root, reader, start, end, time, tolerance, epoch)
149 warnings.warn(f"data index for {reader.pattern} contains duplicate keys!")
150 data = data[~data.index.duplicated(keep="first")]
--> 151 return data.loc[start:end]
152 return data

File ~/mambaforge/envs/aeon/lib/python3.11/site-packages/pandas/core/indexing.py:1103, in _LocationIndexer.__getitem__(self, key)
1100 axis = self.axis or 0
1102 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1103 return self._getitem_axis(maybe_callable, axis=axis)

File ~/mambaforge/envs/aeon/lib/python3.11/site-packages/pandas/core/indexing.py:1323, in _LocIndexer._getitem_axis(self, key, axis)
1321 if isinstance(key, slice):
1322 self._validate_key(key, axis)
-> 1323 return self._get_slice_axis(key, axis=axis)
1324 elif com.is_bool_indexer(key):
1325 return self._getbool_axis(key, axis=axis)

File ~/mambaforge/envs/aeon/lib/python3.11/site-packages/pandas/core/indexing.py:1355, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
1352 return obj.copy(deep=False)
1354 labels = obj._get_axis(axis)
-> 1355 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
1357 if isinstance(indexer, slice):
1358 return self.obj._slice(indexer, axis=axis)

File ~/mambaforge/envs/aeon/lib/python3.11/site-packages/pandas/core/indexes/datetimes.py:636, in DatetimeIndex.slice_indexer(self, start, end, step)
628 # GH#33146 if start and end are combinations of str and None and Index is not
629 # monotonic, we can not use Index.slice_indexer because it does not honor the
630 # actual elements, is only searching for start and end
631 if (
632 check_str_or_none(start)
633 or check_str_or_none(end)
634 or self.is_monotonic_increasing
635 ):
--> 636 return Index.slice_indexer(self, start, end, step)
638 mask = np.array(True)
639 raise_mask = np.array(True)

File ~/mambaforge/envs/aeon/lib/python3.11/site-packages/pandas/core/indexes/base.py:6344, in Index.slice_indexer(self, start, end, step)
6300 def slice_indexer(
6301 self,
6302 start: Hashable | None = None,
6303 end: Hashable | None = None,
6304 step: int | None = None,
6305 ) -> slice:
6306 """
6307 Compute the slice indexer for input labels and step.
6308
(...)
6342 slice(1, 3, None)
6343 """
-> 6344 start_slice, end_slice = self.slice_locs(start, end, step=step)
6346 # return a slice
6347 if not is_scalar(start_slice):

File ~/mambaforge/envs/aeon/lib/python3.11/site-packages/pandas/core/indexes/base.py:6537, in Index.slice_locs(self, start, end, step)
6535 start_slice = None
6536 if start is not None:
-> 6537 start_slice = self.get_slice_bound(start, "left")
6538 if start_slice is None:
6539 start_slice = 0

File ~/mambaforge/envs/aeon/lib/python3.11/site-packages/pandas/core/indexes/base.py:6462, in Index.get_slice_bound(self, label, side)
6459 return self._searchsorted_monotonic(label, side)
6460 except ValueError:
6461 # raise the original KeyError
-> 6462 raise err
6464 if isinstance(slc, np.ndarray):
6465 # get_loc may return a boolean array, which
6466 # is OK as long as they are representable by a slice.
6467 assert is_bool_dtype(slc.dtype)

File ~/mambaforge/envs/aeon/lib/python3.11/site-packages/pandas/core/indexes/base.py:6456, in Index.get_slice_bound(self, label, side)
6454 # we need to look up the label
6455 try:
-> 6456 slc = self.get_loc(label)
6457 except KeyError as err:
6458 try:

File ~/mambaforge/envs/aeon/lib/python3.11/site-packages/pandas/core/indexes/datetimes.py:586, in DatetimeIndex.get_loc(self, key)
584 return Index.get_loc(self, key)
585 except KeyError as err:
--> 586 raise KeyError(orig_key) from err

KeyError: Timestamp('2024-01-31 10:00:00')
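
For context, the last frames of the traceback show get_slice_bound falling back to _searchsorted_monotonic and then re-raising the original KeyError. pandas does this when the slice bound is not an exact label in the index and the index is not monotonic. The sketch below reproduces that behaviour under stated assumptions: the timestamps and the id column are made up for illustration and are not taken from the SubjectVisits files. Sorting the index is shown as one possible workaround, not necessarily the fix discussed in #327.

```python
import pandas as pd

# Hypothetical data: timestamps are unique but stored out of order.
index = pd.DatetimeIndex(
    [
        "2024-02-05 14:00:05",
        "2024-02-05 14:00:01",
        "2024-02-05 14:00:09",
    ]
)
data = pd.DataFrame({"id": ["a", "b", "c"]}, index=index)

# The slice bound is not an exact label in the index.
start = pd.Timestamp("2024-02-05 14:00:00")

try:
    # With an open-ended slice, pandas looks up the bound by label and,
    # failing that, needs a monotonic index to search for its position.
    data.loc[start:]
    raised = False
except KeyError:
    raised = True

# Possible workaround (an assumption, not the confirmed fix):
# sort the index before slicing by label.
result = data.sort_index().loc[start:]
print(raised)
print(result)
```

Deduplicating with index.duplicated, as api.py does at line 150, removes repeated timestamps but does not reorder the remaining rows, so a unique index can still be non-monotonic.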

jkbhagatio added the bug and critical labels Feb 6, 2024
@glopesdev
Contributor

> Contains duplicate keys, e.g. see Z:\aeon\data\raw\AEON4\social0.2\2024-02-05T14-36-00\Environment\Environment_SubjectVisits_2024-01-31T10-00-00.csv

@jkbhagatio the file you mentioned doesn't exist on CEPH, and would be weird if it did since it corresponds to a chunk timestamped before the epoch starts. Maybe a typo?

@jkbhagatio
Member Author

> > Contains duplicate keys, e.g. see Z:\aeon\data\raw\AEON4\social0.2\2024-02-05T14-36-00\Environment\Environment_SubjectVisits_2024-01-31T10-00-00.csv
>
> @jkbhagatio the file you mentioned doesn't exist on CEPH, and would be weird if it did since it corresponds to a chunk timestamped before the epoch starts. Maybe a typo?

Yes typo, fixed that, sorry.

Looks like the issue may be particularly with the "2024-02-05 14:00:00" chunk, as this works:

aeon.load(
    "/ceph/aeon/aeon/data/raw/AEON4/social0.2",
    social02.Environment.SubjectVisits,
    pd.Timestamp("2024-01-31 10:00:00"),
    pd.Timestamp("2024-02-05 13:00:00"),
)

and this works:

aeon.load(
    "/ceph/aeon/aeon/data/raw/AEON4/social0.2",
    social02.Environment.SubjectVisits,
    pd.Timestamp("2024-02-05 15:00:00"),
    pd.Timestamp("2024-02-05 16:00:00"),
)

but on visual inspection, the file Z:\aeon\data\raw\AEON4\social0.2\2024-02-05T14-36-00\Environment\Environment_SubjectVisits_2024-02-05T14-00-00.csv looks OK (i.e. not corrupted or otherwise immediately unusual)
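
Duplicate or out-of-order timestamps are easy to miss when reading a CSV by eye, so a programmatic check of the loaded index may help here. The following is a minimal sketch, assuming the chunk has already been read into a DataFrame with a DatetimeIndex; describe_index is a hypothetical helper and the sample rows are invented, not copied from the file above.

```python
import pandas as pd


def describe_index(data: pd.DataFrame) -> dict:
    """Summarize index properties that affect label-based slicing."""
    index = data.index
    return {
        "n_rows": len(index),
        # Rows whose timestamp repeats an earlier one.
        "n_duplicated": int(index.duplicated(keep="first").sum()),
        "is_monotonic_increasing": bool(index.is_monotonic_increasing),
        # Positions where a timestamp is earlier than its predecessor.
        "out_of_order_positions": [
            i for i in range(1, len(index)) if index[i] < index[i - 1]
        ],
    }


# Hypothetical chunk: one repeated timestamp and one row out of order.
chunk = pd.DataFrame(
    {"id": ["a", "a", "b", "c"]},
    index=pd.DatetimeIndex(
        [
            "2024-02-05 14:00:01",
            "2024-02-05 14:00:01",
            "2024-02-05 14:00:07",
            "2024-02-05 14:00:03",
        ]
    ),
)
report = describe_index(chunk)
print(report)
```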

@jkbhagatio
Member Author

Potential fix mentioned here: #327

glopesdev transferred this issue from SainsburyWellcomeCentre/aeon_experiments on Feb 7, 2024
@ttngu207
Contributor

ttngu207 commented Feb 7, 2024

We've also encountered this type of KeyError with other readers throughout different parts of ingestion. Probably the same root cause.

@jkbhagatio
Member Author

Closing this as a duplicate of #327.
