Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add CLL to OpenLineage in BigQueryInsertJobOperator #44872

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

kacpermuda
Copy link
Contributor

@kacpermuda kacpermuda commented Dec 12, 2024

BigQueryInsertJobOperator already support OpenLineage for QUERY type jobs, but it lacks Column Level Lineage (CLL).

This PR introduces CLL (Column-Level Lineage) to this operator based on SQL parsing, which can be useful in straightforward scenarios. However, since SQL parsing alone might not always provide all the details (e.g. in SQL query we can reference table only by table name, or dataset.table without the project_id), checks have been implemented to ensure accurate lineage. As a result CLL may not be included when there is uncertainty about its correctness.

There is another change not related to CLL: right now output table is duplicated into input tables. We are creating a list of input tables based on referencedTables property provided by Google and as it turns out, this also includes the destination table. So f.e. this query:
INSERT INTO a.b.c VALUES (1, "a", 23)
would return a.b.c as input table and output table.

This PR fixes it by removing output table from input tables. I am not sure if it's a correct approach as sometimes users may write a query that performs a process that moves data from one table to the same table but i think this is rare and also this kind of lineage information (from A to A) does not provide much value. Please let me know if you think I'm wrong.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added area:providers provider:google Google (including GCP) related issues labels Dec 12, 2024
@kacpermuda kacpermuda force-pushed the feat-ol-bq-insert-job-cll branch 4 times, most recently from 6c73a9f to 79c6ad4 Compare December 12, 2024 16:38
@kacpermuda kacpermuda force-pushed the feat-ol-bq-insert-job-cll branch from 79c6ad4 to b383b93 Compare December 13, 2024 15:19
@kacpermuda kacpermuda marked this pull request as ready for review December 13, 2024 15:19
@kacpermuda kacpermuda marked this pull request as draft December 13, 2024 15:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers provider:google Google (including GCP) related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant