Aeon IO API glossary and proposal #334

glopesdev · 2024-02-14T11:25:32Z

glopesdev
Feb 14, 2024
Maintainer

Following earlier discussions in DA meetings and the current PR at #310 proposing to change the low-level interface API, I wanted to introduce here a more formal glossary of terms as they are currently understood in the low-level data interface. The aim is to let us assess, more systematically and thoroughly, the interface design guidelines which were used to create the current version of the API.

I feel it would be wise to carefully consider these before proceeding with any further changes. From both my own experience and documented best practices in the field of software engineering, APIs are best thought of as contracts, and should not be changed lightly once they have been put to regular use. The current API design was created over the course of more than one year of carefully considering a large number of possible scenarios we might encounter when interfacing with Project Aeon data, not just for the ongoing foraging or social experiments, but for experiments going even beyond the scope of the foraging group.

With that in mind, I will break down this discussion into two parts, the Glossary and the Proposal. The latter is intended as a working document for ongoing discussion at DA meetings, where we can iterate on ideas for clarifying the existing API design before deciding on action items for the API refactoring proper.

Glossary

Below I provide working definitions for terms which are not defined in the Aeon Glossary but are nevertheless critical to understanding the current design decisions. The interpretation of the terms below is intentionally biased towards standard practice in software engineering, which may differ from regular language use. On occasion we repeat definitions from the Aeon Glossary, where we considered the existing definition to require further clarification for the discussion at hand.

File (from Wikipedia)

In computing, a computer file is a resource for recording data on a computer storage device, primarily identified by its filename. Just as words can be written on paper, so can data be written to a computer file.

Stream (from Wikipedia)

In computer science, a stream is a sequence of data elements made available over time. A stream can be thought of as items on a conveyor belt being processed one at a time rather than in large batches.

Chunk File

A file storing an Acquisition Chunk, i.e. a file storing all data from a specific stream over a specific one-hour acquisition period.

Note

From the above definitions, it follows that a "stream" is not a "file", and specifically a "stream" is not a "chunk file". The collection of all "chunk files" associated with a named stream is a serializable representation of "data elements" in a stream, but is itself not a stream.
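As a rough illustration of the one-hour acquisition period, the sketch below maps a timestamp to the start of the chunk that would contain it. It assumes chunks are aligned to whole clock hours, which the definition above does not state explicitly; `chunk_start` and `chunk_starts` are hypothetical helper names and not part of the Aeon IO API.

```python
from datetime import datetime, timedelta

# Assumption: one chunk per clock hour, aligned to the start of the hour.
CHUNK_DURATION = timedelta(hours=1)


def chunk_start(timestamp: datetime) -> datetime:
    """Return the start of the one-hour chunk containing the timestamp."""
    return timestamp.replace(minute=0, second=0, microsecond=0)


def chunk_starts(start: datetime, end: datetime) -> list[datetime]:
    """List the start times of all chunks overlapping [start, end]."""
    starts = []
    current = chunk_start(start)
    while current <= end:
        starts.append(current)
        current += CHUNK_DURATION
    return starts
```

Under this assumption, loading a stream over an arbitrary time range amounts to finding the chunk files for each of these start times and concatenating their data elements.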

Reader

An object providing access to the data stored inside specific chunk files.

Device

From the Wikipedia definition of Peripheral device:

A peripheral device, or simply peripheral, is an auxiliary hardware device used to transfer information into and out of a computer. The term peripheral device refers to all hardware components that are attached to a computer and are controlled by the computer system, but they are not the core components of the computer.

A uniquely identified component in the experimental environment, usually a hardware data collection device. The term was originally intended in the sense of "peripheral device" above, but we have since extended it to also cover "software devices", i.e. purely virtual devices, logical modules, or any other logically independent component in an experiment.

Device Stream

A uniquely identified sequence of data elements made available over time by a specific device. A device stream is uniquely identified by a combination of the name of the device and the name of the stream, where the latter must be unique within the containing device.

Note

From the above definition, a "device stream" represents both online streams being acquired live during an experiment, and offline streams made available by the Aeon IO API. This duality is intentional and allows setting in place specific expectations for symmetry and parity between data contract, acquisition system and low-level data interface.

Important

The name of a device stream is required. A sequence of data which is not uniquely identified by a device name and stream name is not a "device stream" under the above definition.

Schema (from Wikipedia)

A formal description of the structure of a database: the names of the tables, the names of the columns of each table, and the data type and other attributes of each column.

Note

In the Aeon IO API the only representations for "device" and "device stream" are schema objects. Therefore, within the Aeon IO API, the terms "device" and "device stream" should be understood as interchangeable with "device schema" and "device stream schema", respectively. Nevertheless, we define these terms explicitly below in the context of the Aeon IO API, since this is understood to be the root of the confusion.

Device Schema

A dictionary describing the set of streams made available by a device. Each device must have a unique name in a given experiment. Device schemas are currently represented by the Device class.

Important

In the current implementation, we allow the creation of "anonymous" device schemas, or "composite streams", which are essentially temporary dictionary objects containing collections of device stream schemas. They are used primarily as a composition tool, to allow aggregating together multiple device stream schemas hierarchically before passing them on to the main device schema object.

Device Stream Schema

An object comprising:

the name of the device stream;
a pattern to find the chunk files storing the data elements of the stream; and
a Reader object.

In combination with the aeon.io.api.load function, a device stream schema can be used to make the stream data elements available over arbitrary time ranges (see Device Stream and Stream above).

Important

Currently, device stream schemas are not represented by an explicit class object, but rather by dictionary objects where a Reader is paired with a unique key. The pattern used to find the chunk files in the stream is currently stored inside the Reader object, which we believe may be a source of confusion. With this in mind, it should hopefully become clear that "binder functions" are really just functions that create device stream schemas, or simply device streams.
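To make the dictionary representation described above more tangible, here is a minimal sketch that does not depend on the aeon package. `StubReader`, `subject_weight` and `device` are hypothetical stand-ins chosen for illustration; they only mimic the shape described in this glossary (a Reader paired with a unique key, with the pattern stored inside the Reader) and are not the actual implementation.

```python
class StubReader:
    """Stand-in for a Reader object; not the real aeon.io.reader class."""

    def __init__(self, pattern, columns):
        self.pattern = pattern  # the chunk file pattern lives inside the reader
        self.columns = columns


def subject_weight(pattern):
    """Hypothetical binder function: returns a one-entry device stream schema."""
    columns = ["weight", "confidence", "subject_id", "int_id"]
    return {"SubjectWeight": StubReader(f"{pattern}_SubjectWeight*", columns)}


def device(name, *binders):
    """Hypothetical helper: merge binder outputs into a device schema dictionary."""
    schema = {}
    for binder in binders:
        schema.update(binder(name))
    return schema


nest = device("Nest", subject_weight)
```

In this sketch the name of the stream exists only as a dictionary key, which is the looseness the Proposal section aims to address.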

Data Contract (a.k.a. Experiment Schema)

The collection of all device schemas and device stream schemas for a specific experiment, e.g. foraging or social experiments. Currently represented as a DotMap dictionary of device schema objects.

Proposal

Below we outline a refactoring proposal to clarify the terminology and materialize the above glossary directly in the API.

Represent device stream schemas using a DeviceStream class, with a name attribute, instead of loose dictionaries
Move the Device and DeviceStream classes into the schema module to clarify their intended usage as data schema objects
Consider using DeviceStream objects as input to the load function in addition to, or instead of, Reader objects
Consider using explicit classes to represent the anonymous device schemas used to aggregate multiple streams
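One possible shape for the first two proposal items is sketched below, as a starting point for discussion rather than a finished design. The attribute names `pattern` and `reader`, the duplicate-name check, and the `__getitem__` accessor are all assumptions made for illustration; only the `DeviceStream` and `Device` class names and the `name` attribute come from the proposal itself.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class DeviceStream:
    """Hypothetical device stream schema: name, chunk file pattern, reader."""

    name: str
    pattern: str
    reader: Any  # stand-in for a Reader object


class Device:
    """Hypothetical device schema: a named collection of device streams."""

    def __init__(self, name: str, *streams: DeviceStream):
        self.name = name
        self.streams: dict[str, DeviceStream] = {}
        for stream in streams:
            # Stream names must be unique within the containing device.
            if stream.name in self.streams:
                raise ValueError(f"duplicate stream {stream.name!r} in device {name!r}")
            self.streams[stream.name] = stream

    def __getitem__(self, stream_name: str) -> DeviceStream:
        return self.streams[stream_name]


nest = Device("Nest", DeviceStream("SubjectWeight", "Nest_SubjectWeight*", reader=None))
```

With an explicit class, the stream name and the chunk file pattern become first-class attributes instead of being implied by a dictionary key and a Reader field.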

jkbhagatio · 2024-02-21T16:02:18Z

jkbhagatio
Feb 21, 2024
Maintainer

Proposal: a Stream class whose constructor takes the device name (which could be obtained from a Device object).

The Stream sets its own name.

e.g.

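from abc import ABC, abstractmethod

from aeon.io import reader  # assumes the aeon package is installed


class Stream(ABC):
    """Base class for a named stream belonging to a device."""

    @abstractmethod
    def __init__(self, device_name):
        pass


class SubjectWeight(Stream):

    def __init__(self, device_name):
        self.device_name = device_name
        # The stream sets its own name from the class name.
        self.name = type(self).__name__
        cols = ["weight", "confidence", "subject_id", "int_id"]
        self.reader = reader.Csv(f"{device_name}_{self.name}*", cols)


device_name = "Nest"
# Instantiate the concrete subclass: the abstract Stream cannot be instantiated.
weight_stream = SubjectWeight(device_name)


class Device():

    def __init__(self, name, *args):
        ...


device = Device(device_name, SubjectWeight, ...)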


jkbhagatio · 2024-02-21T16:06:03Z

jkbhagatio
Feb 21, 2024
Maintainer

Questions remain on whether to break backward compatibility of load, e.g. by changing the name of the reader argument in load.


glopesdev · 2024-02-28T11:10:39Z

glopesdev
Feb 28, 2024
Maintainer Author

To follow up on this, I have started prototyping a refactor of the device and streams API and have refreshed my thoughts on why the architecture was designed the way it is currently.

There is a lot to be said, but essentially it boils down to the decision to use DotMap objects to represent our schemas. Because of specific constraints on how these objects are initialized (specifically, objects passed into the DotMap all have to be dictionaries or iterables of tuples), there is little room to create intermediate abstractions, since DotMap will then simply not be able to work with them.

It seems, then, that to reorganize the architecture we have two options:

1) Keep using DotMap and implement a set of StreamProvider classes which essentially work like dictionaries, in that they are iterables of key-value pairs. In this design, a Device object is itself a StreamProvider and can contain both collections of streams and other stream providers. A single stream is also a StreamProvider which returns an iterable with a single key-value pair (the name of the stream, and its reader).
2) Forego DotMap and redesign the class hierarchy entirely with polymorphism, where the Device class will have both a list of streams and a list of child devices. In this case, auto-completion would be implemented using native Python functionality instead of relying on the dotmap module.
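A minimal sketch of the first option follows, to make the "iterable of key-value pairs" idea concrete. Everything beyond the `StreamProvider` and `Device` names is an assumption for illustration: the readers are plain strings, nested providers are simply flattened, and a built-in `dict` stands in for DotMap, since both can be built from an iterable of tuples.

```python
from typing import Any, Iterator


class StreamProvider:
    """Hypothetical base: anything that yields (stream name, reader) pairs."""

    def __iter__(self) -> Iterator[tuple[str, Any]]:
        raise NotImplementedError


class Stream(StreamProvider):
    """A single stream is a provider yielding exactly one pair."""

    def __init__(self, name: str, reader: Any):
        self.name = name
        self.reader = reader

    def __iter__(self):
        yield self.name, self.reader


class Device(StreamProvider):
    """A device is a provider that flattens its streams and nested providers."""

    def __init__(self, name: str, *providers: StreamProvider):
        self.name = name
        self.providers = providers

    def __iter__(self):
        for provider in self.providers:
            yield from provider


nest = Device(
    "Nest",
    Stream("SubjectWeight", "weight-reader"),
    Device("Group", Stream("State", "state-reader")),  # nested provider
)
schema = dict(nest)  # dict stands in for DotMap here
```

Note that in this sketch the name of a nested provider is dropped when flattening; whether that is desirable is one of the open design questions.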

Regarding pros and cons, option 1) has the advantage that it requires no major changes to the existing code organization. The concepts will remain slightly entangled, since a Stream is also its own StreamProvider, but to be honest I remain undecided whether this is an issue or an advantage.

For popular examples where this composite design is leveraged successfully, we need to look no further than JSON or XML. In either of these data and schema standards, values can be either primitive values or objects, and nesting is achieved naturally via this composition polymorphism. Forcing a separation might seem "cleaner" but in reality is less flexible since we are setting in stone very hard and fast rules about what kinds of things can be composed and how.
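For contrast with the composite design, the second option (separate lists of streams and child devices, with native auto-completion) might look roughly like the sketch below. The use of `__getattr__` and `__dir__` for attribute access and completion is an assumption about what "native python functionality" could mean here, and the constructor signature is hypothetical.

```python
class Stream:
    """Hypothetical stream schema holding a name and a reader stand-in."""

    def __init__(self, name, reader):
        self.name = name
        self.reader = reader


class Device:
    """Hypothetical device with separate streams and child devices."""

    def __init__(self, name, streams=(), devices=()):
        self.name = name
        self.streams = {s.name: s for s in streams}
        self.devices = {d.name: d for d in devices}

    def __getattr__(self, attr):
        # Only called when normal lookup fails; read __dict__ directly
        # to avoid recursion if the instance is not fully initialized.
        streams = self.__dict__.get("streams", {})
        devices = self.__dict__.get("devices", {})
        if attr in streams:
            return streams[attr]
        if attr in devices:
            return devices[attr]
        raise AttributeError(attr)

    def __dir__(self):
        # Expose stream and child device names to interactive completion.
        return sorted(set(super().__dir__()) | set(self.streams) | set(self.devices))


rig = Device("Rig", devices=[Device("Nest", streams=[Stream("SubjectWeight", "weight-reader")])])
```

Here streams and devices are distinct kinds of children, which is exactly the hard separation discussed above: it is explicit, but a new kind of composition requires changing the Device class.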

P.S.: For reference, I am including here the Wikipedia article on the Composite pattern, which is the design pattern being applied in the current implementation (hence the function name compositeStream).

