Lineage : DB boilerplating #191

Open · wants to merge 2 commits into main
Conversation

@garuna-m6 (Contributor) commented Nov 17, 2024

@shreyashankar I've somewhat lost track of how we should structure this.

What is in this PR:

  • Refactoring of the huge util file into smaller util files
  • A util interface to connect to any DB
  • A SQLite DB interface

Where I am stuck:

  • The log schema I am considering:

    ```
    {
        "id": "INTEGER PRIMARY KEY AUTOINCREMENT",
        "process_id": "TEXT",
        "operation": "TEXT",
        "log_message": "TEXT",
        "timestamp": "DATETIME DEFAULT CURRENT_TIMESTAMP"
    }
    ```
  • I want lineage only for operations; every YAML provided can include a "process_id", or one will be randomly generated as the key for the logs

  • User-defined schemas complicate things, because the code would need to be altered to handle each specific schema

  • The database connection happens at the start of the application

  • I want the console logging function itself to have an extra feature where it uses the connection and pushes log messages to the logging database

  • Obviously, I haven't tested any of this yet :(
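A minimal sketch of what creating the proposed log table could look like with Python's stdlib sqlite3; the `create_log_table` helper and the `logs` table name are hypothetical, not code from this PR:

```python
import sqlite3

# The proposed log schema from above, as a dict of column name -> SQLite declaration.
LOG_SCHEMA = {
    "id": "INTEGER PRIMARY KEY AUTOINCREMENT",
    "process_id": "TEXT",
    "operation": "TEXT",
    "log_message": "TEXT",
    "timestamp": "DATETIME DEFAULT CURRENT_TIMESTAMP",
}

def create_log_table(conn: sqlite3.Connection, table: str = "logs") -> None:
    """Create the log table from the schema dict if it does not already exist."""
    cols = ", ".join(f"{name} {decl}" for name, decl in LOG_SCHEMA.items())
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({cols})")
    conn.commit()

conn = sqlite3.connect(":memory:")
create_log_table(conn)
conn.execute(
    "INSERT INTO logs (process_id, operation, log_message) VALUES (?, ?, ?)",
    ("run-1", "map", "processed document 0"),
)
row = conn.execute("SELECT id, process_id, operation FROM logs").fetchone()
print(row)  # (1, 'run-1', 'map')
```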

@shreyashankar (Collaborator)

Thanks for opening the PR! I think that log schema is almost good; we also need an "input_ids" field, so we can trace back a row in the table (i.e., a log) to its inputs. So every row will represent an output document of an operation. I don't think we need a "log_message" field; this table is more about enabling us to reconstruct the lineage of any DocETL pipeline output, which just requires us to attach an id to every output and have it point to its input_ids in the table.

We will also want another table with just the id of an input or output, and its value. So two tables overall.
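A rough sketch of this two-table design; the table names (`lineage`, `data`) and the JSON encoding of id lists are assumptions for illustration, not decided in this thread:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Table 1: lineage -- one row per output document of an operation,
# pointing back at the ids of the inputs that produced it.
conn.execute(
    "CREATE TABLE lineage (id TEXT PRIMARY KEY, operation TEXT, input_ids TEXT)"
)
# Table 2: data -- the id of any input or output, and its value.
conn.execute("CREATE TABLE data (id TEXT PRIMARY KEY, value TEXT)")

# Example: output "out-1" of a map operation derived from inputs "in-1" and "in-2".
conn.execute("INSERT INTO data VALUES (?, ?)", ("in-1", json.dumps({"text": "a"})))
conn.execute("INSERT INTO data VALUES (?, ?)", ("in-2", json.dumps({"text": "b"})))
conn.execute("INSERT INTO data VALUES (?, ?)", ("out-1", json.dumps({"summary": "ab"})))
conn.execute(
    "INSERT INTO lineage VALUES (?, ?, ?)",
    ("out-1", "map", json.dumps(["in-1", "in-2"])),
)

# Reconstructing lineage: trace an output back to its inputs.
input_ids = json.loads(
    conn.execute("SELECT input_ids FROM lineage WHERE id = ?", ("out-1",)).fetchone()[0]
)
print(input_ids)  # ['in-1', 'in-2']
```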

I don't think we need to do anything regarding console.logging! I'm imagining the PR logic can be something like:

  1. Check whether there are sqlite params/config specified in the output section of the DocETL pipeline config
  2. Create the DB specified in the config, in this constructor maybe. Also create a boolean here (maybe self.tracing_enabled) to represent whether lineage tracing is enabled, so the operations can reference it as self.runner.tracing_enabled
  3. In the dataset load method, add ids under a special _tracing_id key before returning the dataset, if tracing is enabled. Also write this data to the data table in the sqlite DB.
  4. In each operation's execute method, check whether tracing is enabled (self.runner.tracing_enabled == True). If so, whenever a new output is created, add it to the tracing table in the sqlite DB with the relevant _tracing_id keys from the input(s) and a new _tracing_id output key. Also add the output to the data table.
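The four steps above could be sketched roughly as follows. To be clear, the Runner class, config keys, and table names here are hypothetical placeholders for illustration, not actual DocETL code:

```python
import json
import sqlite3
import uuid

class Runner:
    """Hypothetical sketch of steps 1-4; names do not match real DocETL classes."""

    def __init__(self, config: dict):
        # Steps 1-2: if a sqlite config is present in the output section,
        # open the DB and set the tracing flag; otherwise tracing is off.
        sqlite_cfg = config.get("output", {}).get("sqlite")
        self.tracing_enabled = sqlite_cfg is not None
        self.conn = sqlite3.connect(sqlite_cfg["path"]) if self.tracing_enabled else None
        if self.tracing_enabled:
            self.conn.execute("CREATE TABLE IF NOT EXISTS data (id TEXT, value TEXT)")
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS lineage (id TEXT, operation TEXT, input_ids TEXT)"
            )

    def load_dataset(self, docs: list) -> list:
        # Step 3: attach a _tracing_id to each input and persist it to the data table.
        if self.tracing_enabled:
            for doc in docs:
                doc["_tracing_id"] = str(uuid.uuid4())
                self.conn.execute(
                    "INSERT INTO data VALUES (?, ?)", (doc["_tracing_id"], json.dumps(doc))
                )
        return docs

    def run_map(self, docs: list, fn) -> list:
        # Step 4: for each new output, record its lineage (which input produced it)
        # and its value, under a fresh _tracing_id.
        outputs = []
        for doc in docs:
            out = fn(doc)
            if self.tracing_enabled:
                out["_tracing_id"] = str(uuid.uuid4())
                self.conn.execute(
                    "INSERT INTO lineage VALUES (?, ?, ?)",
                    (out["_tracing_id"], "map", json.dumps([doc["_tracing_id"]])),
                )
                self.conn.execute(
                    "INSERT INTO data VALUES (?, ?)", (out["_tracing_id"], json.dumps(out))
                )
            outputs.append(out)
        return outputs

runner = Runner({"output": {"sqlite": {"path": ":memory:"}}})
docs = runner.load_dataset([{"text": "hello"}])
outs = runner.run_map(docs, lambda d: {"upper": d["text"].upper()})
traced = runner.conn.execute("SELECT operation FROM lineage").fetchone()
print(traced)  # ('map',)
```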

Maybe we can make a PR with this code for just a map operation, then extend step 4 to the other operators? LMK what makes sense or doesn't make sense!

@garuna-m6 (Contributor, Author)

Hi @shreyashankar, it will again take me some time to make more changes. Yes, this makes sense; the log message is basically the operation that ran and its data. I am thinking every log message would be logged, or do we want the intermediate data to be logged as well? I haven't run it in debug mode, so I don't know much about the log traces.

The major problem I see is the schema for the table. If we confine it to one default schema, that's fine, but if users can supply their own schema, it becomes problematic, especially for doing any mappings against it.
Also, it may be asking a bit much, but would it be possible to give one example of an input-to-log mapping?

@shreyashankar (Collaborator)

Yes, let me get back to you about an example in the next few days. I'm trying to wrap up some stuff before the end of the week for a paper deadline.
