BigQuery final type suggestion should always allow null #153

Open
istreeter opened this issue Mar 10, 2023 · 1 comment


istreeter commented Mar 10, 2023

When schema-ddl cannot find a BigQuery type for a field, it falls back to a string type, which means we stringify whatever value we receive (number, object, array, and so on). Currently, the nullability of that string field is set by whether the field is listed as a required field. However, some schemas list a field as required while also allowing it to be null.

Example 1:

{
  "type": "object",
  "required": ["xyz"],
  "properties": {
    "xyz": {
      "oneOf": [
        {"type": "string"},
        {"type": "number"},
        {"type": "null"}
      ]
    }
  }
}

Example 2:

{
  "type": "object",
  "required": ["xyz"],
  "properties": {
    "xyz": {
      "type": ["object", "null"]
    }
  }
}

Example 3:

{
  "type": "object",
  "required": ["xyz"],
  "properties": {
    "xyz": {
      "type": ["array", "null"]
    }
  }
}

Schema DDL should suggest a nullable string for these examples. Otherwise, Snowplow events with these weird types can fail to get loaded.
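
To make the difference concrete, here is a minimal Python sketch of the rule described above next to the proposed one. It is not schema-ddl's code (schema-ddl is written in Scala); the function names `allows_null`, `current_mode` and `proposed_mode` are made up for illustration, and the sketch only models the string fallback path.

```python
# Illustrative sketch only: these names are hypothetical and do not
# come from the schema-ddl code base.

def allows_null(prop):
    """True if a JSON Schema fragment explicitly permits a null value."""
    declared = prop.get("type")
    if declared == "null":
        return True
    if isinstance(declared, list) and "null" in declared:
        return True
    # For oneOf/anyOf, null is permitted if any alternative permits it.
    return any(
        allows_null(alternative)
        for keyword in ("oneOf", "anyOf")
        for alternative in prop.get(keyword, [])
    )


def current_mode(name, schema):
    """Rule described in this issue: nullability follows the `required` list."""
    return "REQUIRED" if name in schema.get("required", []) else "NULLABLE"


def proposed_mode(name, schema):
    """Proposed rule: the fallback string column always allows null."""
    return "NULLABLE"


# Example 1 from this issue.
example_1 = {
    "type": "object",
    "required": ["xyz"],
    "properties": {
        "xyz": {
            "oneOf": [
                {"type": "string"},
                {"type": "number"},
                {"type": "null"},
            ]
        }
    },
}

xyz = example_1["properties"]["xyz"]
print(allows_null(xyz), current_mode("xyz", example_1), proposed_mode("xyz", example_1))
```

For Example 1 the schema accepts null, yet the current rule yields a `REQUIRED` column, which is the mismatch this issue is about.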

@istreeter

I tested the fix in #154, but we have decided not to merge it because of the potential impact on existing pipelines.

The loader after the fix is incompatible with the loader before the fix. I will use this schema to explain:

{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "self": {
    "vendor": "com.example",
    "name": "test_20230615",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "xyz": {
      "oneOf": [
        {"type": "string"},
        {"type": "number"},
        {"type": "null"}
      ]
    }
  }
}

And imagine two Snowplow events using these two valid objects as the unstruct event:

{
  "xyz": "this is a string"
}

{
  "xyz": null
}

If the BigQuery column was created with the old loader, then column xyz in BigQuery is a non-nullable string.

With the old version of the loader, both events would get loaded, i.e. we get two rows in the BigQuery table. But for the second event, the unstruct event is silently dropped because of the bug.

After upgrading the loader to the version with the fix (and assuming we do not alter the existing table), the first event would get loaded, but the second event would fail to get loaded, and would go to the failed inserts bucket.
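
The two outcomes can be reproduced with a toy model in Python. Everything here is an assumption made for illustration: `table_accepts` is a stand-in for BigQuery rejecting a null in a `REQUIRED` field, `old_loader_row` models the bug as "drop the whole unstruct record when `xyz` is null", and none of these names exist in the loader.

```python
# Toy model only: hypothetical names, not the loader's real code.

def table_accepts(row, xyz_mode):
    """Stand-in for the table check: a REQUIRED field inside a record
    that is present must not be null."""
    record = row.get("unstruct")
    if record is None:
        return True  # the whole record is absent, nothing to validate
    return not (xyz_mode == "REQUIRED" and record.get("xyz") is None)


def old_loader_row(event):
    """Assumed model of the bug: the unstruct record is silently dropped
    when xyz is null."""
    if event.get("xyz") is None:
        return {"unstruct": None}
    return {"unstruct": {"xyz": event["xyz"]}}


def new_loader_row(event):
    """Fixed loader: always sends the record, including a null xyz."""
    return {"unstruct": {"xyz": event.get("xyz")}}


def load(events, make_row, xyz_mode):
    """Split rows into those the table accepts and failed inserts."""
    loaded, failed = [], []
    for event in events:
        row = make_row(event)
        (loaded if table_accepts(row, xyz_mode) else failed).append(row)
    return loaded, failed


events = [{"xyz": "this is a string"}, {"xyz": None}]

# Existing table, created by the old loader: xyz is REQUIRED.
old_loaded, old_failed = load(events, old_loader_row, "REQUIRED")
new_loaded, new_failed = load(events, new_loader_row, "REQUIRED")
```

In this model the old loader produces two rows (the second with no unstruct data), while the fixed loader produces one row and one failed insert until the existing column is made nullable.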

We will leave this issue unfixed until we have a better way to assess whether existing pipelines are affected by this change.
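
One possible shape for such a check, sketched in Python under assumptions: both schemas are given in BigQuery's JSON schema format (a list of fields with `name`, `type`, optional `mode`, and nested `fields` for records), and `relaxed_columns` is a hypothetical helper, not something that exists in the loader or in schema-ddl. It lists the columns that are `REQUIRED` in the existing table but would be suggested as `NULLABLE` after the fix.

```python
# Hypothetical helper for illustration; not part of any Snowplow tool.

def relaxed_columns(existing, suggested, prefix=""):
    """Yield dotted paths that are REQUIRED in `existing` but NULLABLE
    in `suggested`. A missing `mode` is treated as NULLABLE."""
    suggested_by_name = {field["name"]: field for field in suggested}
    for field in existing:
        other = suggested_by_name.get(field["name"])
        if other is None:
            continue  # column not present in the suggested schema
        path = prefix + field["name"]
        was_required = field.get("mode", "NULLABLE") == "REQUIRED"
        now_nullable = other.get("mode", "NULLABLE") == "NULLABLE"
        if was_required and now_nullable:
            yield path
        # Recurse into nested record fields, if any.
        yield from relaxed_columns(
            field.get("fields", []), other.get("fields", []), path + "."
        )


existing_table = [
    {"name": "unstruct", "type": "RECORD", "mode": "NULLABLE", "fields": [
        {"name": "xyz", "type": "STRING", "mode": "REQUIRED"},
    ]},
]
suggested_after_fix = [
    {"name": "unstruct", "type": "RECORD", "mode": "NULLABLE", "fields": [
        {"name": "xyz", "type": "STRING", "mode": "NULLABLE"},
    ]},
]

affected = list(relaxed_columns(existing_table, suggested_after_fix))
print(affected)
```

A pipeline with a non-empty result would be one where the upgraded loader could start producing failed inserts for null values.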
