Lineage : DB boilerplating #191

Open · wants to merge 2 commits into main
Conversation

@garuna-m6 (Contributor) commented Nov 17, 2024

@shreyashankar I've somewhat lost track of how we should structure this.

What is in this PR:

  • Refactoring of the huge util file into smaller util files
  • A util interface to connect to any DB
  • A SQLite DB interface

Where I am stuck:

  • The log schema I am considering:

    ```
    {
        "id": "INTEGER PRIMARY KEY AUTOINCREMENT",
        "process_id": "TEXT",
        "operation": "TEXT",
        "log_message": "TEXT",
        "timestamp": "DATETIME DEFAULT CURRENT_TIMESTAMP"
    }
    ```
  • I want lineage only for operations; every YAML provided can include a "process_id", or one will be randomly generated as the key for the logs

  • User-defined schemas complicate things, because the code would need to be altered to handle each specific schema

  • The database connection happens at the start of the application

  • I want the console logging function itself to have an extra feature where it uses the connection and pushes log messages to the logging database

  • Obviously, I haven't tested any of this yet :(
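A minimal sketch of what creating the proposed log table could look like with Python's stdlib sqlite3; the `create_log_table` helper and the `logs` table name are hypothetical, not code from this PR:

```python
import sqlite3

# The proposed log schema from above, as a dict of column name -> SQLite declaration.
LOG_SCHEMA = {
    "id": "INTEGER PRIMARY KEY AUTOINCREMENT",
    "process_id": "TEXT",
    "operation": "TEXT",
    "log_message": "TEXT",
    "timestamp": "DATETIME DEFAULT CURRENT_TIMESTAMP",
}

def create_log_table(conn: sqlite3.Connection, table: str = "logs") -> None:
    """Create the log table from the schema dict if it does not already exist."""
    cols = ", ".join(f"{name} {decl}" for name, decl in LOG_SCHEMA.items())
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({cols})")
    conn.commit()

conn = sqlite3.connect(":memory:")
create_log_table(conn)
conn.execute(
    "INSERT INTO logs (process_id, operation, log_message) VALUES (?, ?, ?)",
    ("run-1", "map", "processed document 0"),
)
row = conn.execute("SELECT id, process_id, operation FROM logs").fetchone()
print(row)  # (1, 'run-1', 'map')
```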

@shreyashankar (Collaborator)

Thanks for opening the PR! I think that log schema is almost good; we also need an "input_ids" field, so we can trace back a row in the table (i.e., a log) to its inputs. So every row will represent an output document of an operation. I don't think we need a "log_message" field; this table is more about enabling us to reconstruct the lineage of any DocETL pipeline output, which just requires us to attach an id to every output and have it point to its input_ids in the table.

We will also want another table with just the id of an input or output, and its value. So two tables overall.
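A rough sketch of this two-table design; the table names (`lineage`, `data`) and the JSON encoding of id lists are assumptions for illustration, not decided in this thread:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Table 1: lineage -- one row per output document of an operation,
# pointing back at the ids of the inputs that produced it.
conn.execute(
    "CREATE TABLE lineage (id TEXT PRIMARY KEY, operation TEXT, input_ids TEXT)"
)
# Table 2: data -- the id of any input or output, and its value.
conn.execute("CREATE TABLE data (id TEXT PRIMARY KEY, value TEXT)")

# Example: output "out-1" of a map operation derived from inputs "in-1" and "in-2".
conn.execute("INSERT INTO data VALUES (?, ?)", ("in-1", json.dumps({"text": "a"})))
conn.execute("INSERT INTO data VALUES (?, ?)", ("in-2", json.dumps({"text": "b"})))
conn.execute("INSERT INTO data VALUES (?, ?)", ("out-1", json.dumps({"summary": "ab"})))
conn.execute(
    "INSERT INTO lineage VALUES (?, ?, ?)",
    ("out-1", "map", json.dumps(["in-1", "in-2"])),
)

# Reconstructing lineage: trace an output back to its inputs.
input_ids = json.loads(
    conn.execute("SELECT input_ids FROM lineage WHERE id = ?", ("out-1",)).fetchone()[0]
)
print(input_ids)  # ['in-1', 'in-2']
```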

I don't think we need to do anything regarding console.logging! I'm imagining the PR logic can be something like:

  1. Check whether there are sqlite params/config specified in the output section of the DocETL pipeline config
  2. Create the DB specified in the config, in this constructor maybe. Also create a boolean here (maybe self.tracing_enabled) to represent whether lineage tracing is enabled, so the operations can reference it as self.runner.tracing_enabled
  3. In the dataset load method, add ids under a special _tracing_id key before returning the dataset, if tracing is enabled. Also write this data to the data table in the sqlite DB.
  4. In each operation's execute method, check whether tracing is enabled (self.runner.tracing_enabled == True). If so, whenever a new output is created, add it to the tracing table in the sqlite DB with the relevant _tracing_id keys from the input(s) and a new _tracing_id output key. Also add the output to the data table.
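The four steps above could be sketched roughly as follows. To be clear, the Runner class, config keys, and table names here are hypothetical placeholders for illustration, not actual DocETL code:

```python
import json
import sqlite3
import uuid

class Runner:
    """Hypothetical sketch of steps 1-4; names do not match real DocETL classes."""

    def __init__(self, config: dict):
        # Steps 1-2: if a sqlite config is present in the output section,
        # open the DB and set the tracing flag; otherwise tracing is off.
        sqlite_cfg = config.get("output", {}).get("sqlite")
        self.tracing_enabled = sqlite_cfg is not None
        self.conn = sqlite3.connect(sqlite_cfg["path"]) if self.tracing_enabled else None
        if self.tracing_enabled:
            self.conn.execute("CREATE TABLE IF NOT EXISTS data (id TEXT, value TEXT)")
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS lineage (id TEXT, operation TEXT, input_ids TEXT)"
            )

    def load_dataset(self, docs: list) -> list:
        # Step 3: attach a _tracing_id to each input and persist it to the data table.
        if self.tracing_enabled:
            for doc in docs:
                doc["_tracing_id"] = str(uuid.uuid4())
                self.conn.execute(
                    "INSERT INTO data VALUES (?, ?)", (doc["_tracing_id"], json.dumps(doc))
                )
        return docs

    def run_map(self, docs: list, fn) -> list:
        # Step 4: for each new output, record its lineage (which input produced it)
        # and its value, under a fresh _tracing_id.
        outputs = []
        for doc in docs:
            out = fn(doc)
            if self.tracing_enabled:
                out["_tracing_id"] = str(uuid.uuid4())
                self.conn.execute(
                    "INSERT INTO lineage VALUES (?, ?, ?)",
                    (out["_tracing_id"], "map", json.dumps([doc["_tracing_id"]])),
                )
                self.conn.execute(
                    "INSERT INTO data VALUES (?, ?)", (out["_tracing_id"], json.dumps(out))
                )
            outputs.append(out)
        return outputs

runner = Runner({"output": {"sqlite": {"path": ":memory:"}}})
docs = runner.load_dataset([{"text": "hello"}])
outs = runner.run_map(docs, lambda d: {"upper": d["text"].upper()})
traced = runner.conn.execute("SELECT operation FROM lineage").fetchone()
print(traced)  # ('map',)
```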

Maybe we can make a PR with this code for just a map operation, then extend step 4 to the other operators? LMK what makes sense or doesn't make sense!

@garuna-m6 (Contributor, Author)

Hi @shreyashankar, it will again take me some time to make more changes. Yes, this makes sense; the log message is basically the operation that ran and its data. I am thinking every log message would be logged, or do we want the intermediate data to be logged as well? I haven't run it in debug mode, so I don't know much about the log traces.

The major problem I see is the schema for the table. If we confine it to one default schema, that's fine, but if users can supply their own schema, it becomes problematic, especially for doing any mappings against it.
Also, it may be asking a bit much, but would it be possible to give one example of an input-to-log mapping?

@shreyashankar (Collaborator)

Yes, let me get back to you about an example in the next few days. I'm trying to wrap up some stuff before the end of the week for a paper deadline.
