Preserve original field order when merging parquet Fields #213

istreeter · 2024-11-15T15:39:21Z

We use the `Migrations.mergeSchemas` function in the loaders to combine a series of schemas (e.g. `1-0-0`, `1-0-1`, `1-0-2`) into a reconciled column.

Before this PR, mergeSchemas contained some logic to preserve field order... but it was the wrong logic. It preserved field order of the newer schema, ignoring whether a field was present in the older schema.

After this PR, we preserve field order of the older schema. New fields added in a later schema are appended to the end of the field list.

This change is needed for loaders that only allow field additions when they are appended to the end of a struct, e.g. the Lake Loader for Hudi/Glue.
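The ordering rule described above can be sketched roughly as follows. This is a minimal illustration only: `SimpleField` and `mergePreservingOlderOrder` are hypothetical names and not the schema-ddl `Field` type or the actual `mergeSchemas` implementation, and the sketch ignores any reconciliation of types for fields present in both versions.

```scala
object MergeOrderSketch {
  // Hypothetical, simplified field: a name and a type label only.
  final case class SimpleField(name: String, fieldType: String)

  // Keep every field of the older schema in its original position,
  // then append fields that only exist in the newer schema.
  def mergePreservingOlderOrder(
    older: List[SimpleField],
    newer: List[SimpleField]
  ): List[SimpleField] = {
    val olderNames = older.map(_.name).toSet
    val added      = newer.filterNot(f => olderNames.contains(f.name))
    older ++ added
  }

  // Example: the newer schema lists `c` first, but it still ends up last.
  val v100 = List(SimpleField("a", "string"), SimpleField("b", "string"))
  val v101 = List(SimpleField("c", "string"), SimpleField("a", "string"), SimpleField("b", "string"))
  val mergedNames = mergePreservingOlderOrder(v100, v101).map(_.name) // List(a, b, c)
}
```

Fields added in the newer schema keep their relative order from that schema; only their position relative to the older fields changes.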

coveralls · 2024-11-15T15:43:08Z

Pull Request Test Coverage Report for Build 11973201274

Details

15 of 15 (100.0%) changed or added relevant lines in 2 files are covered.
14 unchanged lines in 10 files lost coverage.
Overall coverage decreased (-0.4%) to 80.314%

Files with Coverage Reduction	New Missed Lines	%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/jsonschema/mutate/Mutate.scala	1	77.05%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/bigquery/Row.scala	1	69.23%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/jsonschema/JsonPath.scala	1	85.71%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/jsonschema/circe/CommonCodecs.scala	1	75.0%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/redshift/internal/ColumnTypeSuggestions.scala	1	78.89%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/jsonschema/mutate/Widened.scala	1	87.27%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/redshift/ShredModelEntry.scala	2	86.54%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/parquet/Caster.scala	2	83.33%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/parquet/Suggestion.scala	2	90.0%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/jsonschema/circe/StringCodecs.scala	2	71.43%

Totals
Change from base Build 11124453738:	-0.4%
Covered Lines:	1277
Relevant Lines:	1590

💛 - Coveralls

See snowplow/schema-ddl#213. When a schema is evolved (e.g. from `1-0-0` to `1-0-1`) we create a merged struct column combining fields from the new and old schemas. For some loaders it is important that newly added nested fields come after the original fields, e.g. the Lake Loader with Hudi and Glue sync enabled.

This overcomes a limitation with how Hudi syncs schemas to the Glue catalog. Previously, if version `1-0-0` of a schema had fields `a` and `b`, and then version `1-0-1` added a field `c`, the new field might be added _before_ the original fields in the Hudi schema. The new field would get synced to Glue, but only for new partitions; it was not back-filled to existing partitions.

After this change, the new field `c` is added _after_ the original fields `a` and `b` in the Hudi schema, so there is no need to sync the new field to existing partitions in Glue.

The problem manifested in AWS Athena with a message like:

> HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas.

This fix was implemented in snowplow/schema-ddl#213 and snowplow-incubator/common-streams#98, and imported via a new version of common-streams. This change does not impact Delta or Iceberg, where nothing was broken.
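The constraint in the `a`, `b`, `c` example amounts to a prefix check on field names: the original fields must keep their positions, and anything new must follow them. A minimal sketch, assuming a hypothetical helper `isAppendOnly` that is not part of Hudi, Glue, or schema-ddl:

```scala
object AppendOnlySketch {
  // True when the evolved struct starts with exactly the original fields,
  // in the original order, so that new fields can only appear at the end.
  def isAppendOnly(originalNames: List[String], evolvedNames: List[String]): Boolean =
    evolvedNames.startsWith(originalNames)

  val afterThisChange  = isAppendOnly(List("a", "b"), List("a", "b", "c")) // true
  val beforeThisChange = isAppendOnly(List("a", "b"), List("c", "a", "b")) // false
}
```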

benjben

Looks great !

benjben · 2024-11-21T08:28:21Z

modules/core/src/test/scala/com/snowplowanalytics/iglu/schemaddl/parquet/MigrationSpec.scala

+        Field("xxx", Type.String, Required),
+        Field("new1", Type.String, Required),
+        Field("yyy", Type.String, Required),
+        Field("new2", Type.String, Required),


Could be nice to switch new1 and new2

istreeter · 2024-11-22

If I switch new1 and new2 in the input, then I'll also need to switch new1 and new2 in the expected output. It's because the alphabetical ordering only happens when we call Field.normalize, and that is not part of this test.

You have made me realize though that this test is not realistic! Because realistically we always call normalize first, and mergeSchemas afterwards. So to make it realistic, struct1 should start with fields vvv, xxx, zzz, and the newly added fields should be www and yyy.
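The realistic case described here can be sketched with plain field names. This is an illustration only: `normalize` below stands in for just the alphabetical ordering mentioned in the comment (the real `Field.normalize` may do more), and `merge` is a hypothetical simplification of `mergeSchemas`.

```scala
object NormalizeThenMergeSketch {
  // Assumption: normalization is reduced to alphabetical ordering of field names.
  def normalize(names: List[String]): List[String] = names.sorted

  // Keep the older order, append names that are new in the later schema.
  def merge(older: List[String], newer: List[String]): List[String] =
    older ++ newer.filterNot(n => older.contains(n))

  val struct1 = normalize(List("zzz", "vvv", "xxx"))               // List(vvv, xxx, zzz)
  val struct2 = normalize(List("yyy", "vvv", "www", "xxx", "zzz")) // List(vvv, www, xxx, yyy, zzz)

  // The merged struct is no longer alphabetical: www and yyy are appended at the end.
  val merged = merge(struct1, struct2) // List(vvv, xxx, zzz, www, yyy)
}
```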

benjben · 2024-11-22

> If I switch new1 and new2 in the input, then I'll also need to switch new1 and new2 in the expected output

Yes this is what I meant, to make sure that the "wrong" alphabetical order is preserved.

> and that is not part of this test

Alright!

> Because realistically we always call normalize first, and mergeSchemas afterwards

I'm very glad that you say that! I was wondering in which cases we were calling normalize and in which cases we were not. Shouldn't mergeSchemas be in charge of calling normalize then?

istreeter · 2024-11-22

> Shouldn't mergeSchemas be in charge of calling normalize then?

Yeah that's a really good point.

istreeter · 2024-11-22

@benjben I pushed a commit in which mergeSchemas is in charge of calling normalize.

...but unfortunately I think I need to revert it. It's because of this comment:

> Must normalize first and add the schema key field after. To avoid unlikely weird issues with similar existing keys.

To be strictly correct under all circumstances, the order of operations must be:

1. Normalize the field.
2. Add the _schema_version key.
3. Merge the fields.
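The three steps above can be sketched as a small pipeline over field names. All function names here are hypothetical, normalization is reduced to alphabetical sorting, and appending `_schema_version` at the end of the struct is an assumption made for illustration; the real logic lives partly in common-streams and partly in schema-ddl.

```scala
object OrderOfOperationsSketch {
  // Step 1 (assumption: normalization reduced to alphabetical ordering).
  def normalize(names: List[String]): List[String] = names.sorted

  // Step 2 (assumption: the key is appended after the normalized fields).
  def addSchemaVersion(names: List[String]): List[String] = names :+ "_schema_version"

  // Step 3: keep the older order, append names that are new in the later schema.
  def merge(older: List[String], newer: List[String]): List[String] =
    older ++ newer.filterNot(n => older.contains(n))

  // Normalize and add the key for each version first, then merge oldest to newest.
  def evolve(versions: List[List[String]]): List[String] =
    versions
      .map(v => addSchemaVersion(normalize(v)))
      .reduceLeftOption(merge)
      .getOrElse(Nil)

  val evolved = evolve(List(List("b", "a"), List("c", "a", "b")))
  // List(a, b, _schema_version, c)
}
```

Because merging happens last, every version has already been normalized and given its key by the time fields are compared.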

While it is probably ok to mix up that order for most schemas, it is not guaranteed to be ok for all possible schemas. And I cannot think of a clever way to keep that order, while also making mergeSchemas responsible for calling normalize.

So on balance I think it is better to say we always normalize first and merge afterwards.

Maybe we can improve on this in future -- but this is stretching beyond what I'm trying to address in this current PR.

benjben · 2024-11-22

Couldn't addSchemaVersion just be a boolean parameter of mergeSchemas?

Otherwise that's fine to leave it as is for now, at least we have considered the options.

istreeter · 2024-11-25

It could, yeah... this is the logic that we would have to migrate from common-streams over to schema-ddl. But it's a bit awkward because there is some stuff there which is "snowplow-aware", meaning how we add the _schema_version key to objects within a contexts array. Whereas schema-ddl currently doesn't know about snowplow details like how contexts are arrays.

Plus if we make any change to mergeSchemas then we need to carry it over to rdb-loader, which is a bit fragile at the moment.

I do like the idea of tidying up the inter-dependencies between the two libraries (schema-ddl and common-streams), but I'm going to skip it for this PR.

modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/parquet/Migrations.scala

istreeter mentioned this pull request Nov 15, 2024

Preserve original field order when merging schemas snowplow-incubator/common-streams#98

Merged

istreeter mentioned this pull request Nov 18, 2024

Preserve original field order during schema evolution snowplow-incubator/snowplow-lake-loader#96

Open

benjben approved these changes Nov 21, 2024

istreeter added 3 commits November 22, 2024 08:56

Review feedback

31a6e19

Amendment: Fields should be normalized as part of merging schemas

db04e30

Revert "Amendment: Fields should be normalized as part of merging sch…

89d7603

…emas" This reverts commit db04e30.

istreeter merged commit bd5e4ba into master Nov 25, 2024
2 checks passed

istreeter deleted the migration-field-order branch November 25, 2024 10:12

benjben Nov 21, 2024

istreeter Nov 22, 2024

benjben Nov 22, 2024

istreeter Nov 22, 2024

istreeter Nov 22, 2024

benjben Nov 22, 2024

istreeter Nov 25, 2024
