
Lineage #94

Open
shreyashankar opened this issue Oct 11, 2024 · 8 comments
Labels: enhancement, request

Comments

@shreyashankar (Collaborator)

Reduce Operation Lineage

From discord:

One use case I'm really interested in is [pre]computing a set of "reports" / outputs from a large set of documents, and then being able to reuse that computation when I filter documents to the applicable reports that have only those documents as "Sources"

i.e.,

if the full corpus is a, b, c, d, e, f -> it generates reports 1 (a, b, c) + 2 (b, c, d) + 3 (c, d, e) + 4 (d, e, f),
and then I want to see the "reports" contributed to by docs d, e, f = 2, 3, 4

My proposal is to support a lineage param in the output, e.g.,

name: opname
type: reduce
reduce_key: ...
prompt: ...
output:
  schema: ...
  lineage:
    - keyname1
    - keyname2

Then, for every document in the output, there should be a key opname_lineage containing a list of key-value pairs (for each key listed under lineage) drawn from every document in the group that the output document was derived from.
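
For illustration, a single output document might then look something like the sketch below. This is just a reading of the proposal, not an existing format; every field name other than opname_lineage is a placeholder standing in for whatever the output schema defines.

# Hypothetical output of a reduce op named "opname" with lineage keys
# keyname1 and keyname2 -- a sketch of the proposal, not an existing format.
output_doc = {
    "category": "group_a",   # the reduce_key value for this group (placeholder name)
    "summary": "...",        # fields produced per the output schema
    "opname_lineage": [
        {"keyname1": "doc 1 value", "keyname2": "doc 1 value"},
        {"keyname1": "doc 2 value", "keyname2": "doc 2 value"},
    ],
}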

Querying Pipeline Lineage

It would be nice to log all the pipeline lineage to sqlite & have users be able to query it (e.g., find all the reports contributed by certain upstream/input docs). We'd have to think of a good data model & query patterns.

@garuna-m6 (Contributor)

@shreyashankar it took some more time than I thought to get the OpenAI keys :( . Trying to understand the issue here: we need tracing in the logs for reduce-operation lineage (without the SQL setup anywhere in the pipeline)? With the existing verbose functionality, would logging like reduce : lineage keys [if used] : reduce operation output work 👀 ? Would need some guidance.

@shreyashankar (Collaborator, Author)

No worries!

I think the logging can be set at a pipeline level; in the top level of the config someone can specify the path to store a sqlite db of the logs; then, we can add ids to each document in the input and pass them through each operation in the pipeline.

For each operation, we could create a table of the outputs, with an additional "id" column. We could also create a dependency table for each operation to link the operation's outputs with the id(s) of its inputs:

CREATE TABLE {operation}_dependencies (
    dependent_id INTEGER REFERENCES dependent_table(id),
    main_id INTEGER REFERENCES main_table(id),
    PRIMARY KEY (dependent_id, main_id)
);

So, each operation has its own output table, as well as a dependencies table. This enables both forward and backward tracing.
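
As a rough sketch of what those traces could look like against that layout: the example below is hypothetical, using opname as the operation name, assuming dependent_id points at the operation's output rows and main_id at the input rows they were derived from (the DDL above leaves that mapping open), and dropping the foreign-key clauses so the snippet runs standalone.

import sqlite3  # stdlib; no separate server process needed

conn = sqlite3.connect("pipeline_lineage.db")  # path would come from the pipeline config

# Dependency table from the sketch above (foreign keys omitted to stay standalone).
conn.execute(
    """CREATE TABLE IF NOT EXISTS opname_dependencies (
           dependent_id INTEGER,
           main_id INTEGER,
           PRIMARY KEY (dependent_id, main_id)
       )"""
)

# Backward trace: which input rows was output row 42 derived from?
backward = conn.execute(
    "SELECT main_id FROM opname_dependencies WHERE dependent_id = ?", (42,)
).fetchall()

# Forward trace: which output rows did input row 7 contribute to?
forward = conn.execute(
    "SELECT dependent_id FROM opname_dependencies WHERE main_id = ?", (7,)
).fetchall()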

@garuna-m6 (Contributor)

Sorry for asking for explanations like a 5-year-old, but the docetl pipeline would run on demand. Is the expectation here to start a local sqlite server if it's set in the config, put all the logs in the db, and then shut the server down when the pipeline closes :/ ? Or to dump the logs for a SQL server to read? Or are we expecting the server connection files to already be present?

@shreyashankar (Collaborator, Author)

No worries, sorry for the confusion! Sqlite doesn't require a separate server process: https://docs.python.org/3/library/sqlite3.html

So if the user specifies a path for the sqlite db in the config, we can create a db and populate relevant tables as the outputs are created.
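
A minimal sketch of that flow, assuming a hypothetical lineage_db config key and a placeholder output table (none of these names exist in DocETL today):

import json
import sqlite3

# Placeholders standing in for the pipeline config and one operation's outputs.
config = {"lineage_db": "pipeline_lineage.db"}
outputs = [(1, {"summary": "..."}), (2, {"summary": "..."})]

conn = sqlite3.connect(config["lineage_db"])  # creates the file on disk; no server to manage
conn.execute(
    "CREATE TABLE IF NOT EXISTS opname_outputs (id INTEGER PRIMARY KEY, doc TEXT)"
)
conn.executemany(
    "INSERT OR REPLACE INTO opname_outputs (id, doc) VALUES (?, ?)",
    [(doc_id, json.dumps(doc)) for doc_id, doc in outputs],
)
conn.commit()
conn.close()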

@redhog (Collaborator) commented Oct 21, 2024

Is there a big reason to keep the lineage data out-of-band?

I'd rather save the lineage info inside the items, so that an outside system that gets the final output dump has access to it directly (without a join). What's the drawback of doing that?

I think sqlite output is interesting in the context of #104 btw :)

Also, potentially for storing the intermediate data more efficiently?

@shreyashankar (Collaborator, Author)

I think saving it to a database makes it significantly more queryable... otherwise constructing forward traces will involve a bunch of for loops to go through the outputs and see which ids contain the source id. Similarly, constructing a backward trace will require lots of wrangling.

@redhog (Collaborator) commented Oct 21, 2024

Well, that depends on what happens with the output. If it's just JSON, yes. But if you insert it into something like Elasticsearch, then having the metadata / lineage inline is super useful. So maybe both?

If we had output plugins, and could write multiple outputs with different plugins, then this could be handled at the output stage:

pipeline:
  steps: ...
  output:
    - json:
        path: my-pipeline-output.json
    - sqlite:
        path: metadata.sqlite
        keys:
          - source-file
          - page
    - elasticsearch:
        url: http://localhost:9200/

@shreyashankar (Collaborator, Author)

Whoops, sorry I missed this. I like your operator spec, but I think supporting an Elasticsearch integration as a plugin can be done later down the line. Most people use DocETL locally, and I think the sqlite interface is a great start for them.
