Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support of References for AVRO Schemas #926

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

libretto
Copy link
Contributor

@libretto libretto commented Jul 27, 2024

This change adds support for references for AVRO schemas

We aim to add support for references for AVRO schemas to Karapace using a straightforward approach by leveraging the same API used in Protobuf references. Although Avro does not natively support references, we can implement this functionality by utilizing the property of Avro schemas that allows combining types. We utilize Names from avro.name and make_avsc_object to create an Avro schema that includes referenced schemas within as explained in #987.

@libretto libretto requested review from a team as code owners July 27, 2024 13:50
Copy link
Contributor

@jjaakola-aiven jjaakola-aiven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleanup commits to single/coherent commits. Add tests for better coverage.

@libretto
Copy link
Contributor Author

libretto commented Jul 29, 2024

Cleanup commits to single/coherent commits. Add tests for better coverage.

@jjaakola-aiven Could you suggest the tests that should be added?

@libretto libretto force-pushed the avro_references2 branch 2 times, most recently from f0f9450 to 0167a75 Compare July 29, 2024 09:46
@libretto
Copy link
Contributor Author

@jjaakola-aiven I'm still waiting for your suggestions regarding additional tests for the new functionality. Please keep in mind that we already have integration tests covering various aspects, including:

Basic Avro references
Avro references compatibility
Avro union references
Avro incompatible names in references

Additionally, the new code has successfully passed all existing tests related to Avro functionality. I also added a few unit tests for the AvroMerge class today.

karapace/schema_models.py Outdated Show resolved Hide resolved
karapace/schema_models.py Outdated Show resolved Hide resolved
karapace/schema_models.py Outdated Show resolved Hide resolved
karapace/schema_models.py Outdated Show resolved Hide resolved
karapace/schema_models.py Outdated Show resolved Hide resolved
karapace/schema_models.py Outdated Show resolved Hide resolved
@nosahama
Copy link
Contributor

@libretto shall we rebase this work so we get upstream changes and then let's discuss around the comments if you can tell me when you'd like to, we can spare sometime and discuss in one thread here.

@nosahama
Copy link
Contributor

@libretto The unit test for AvroMerge is gone and it seems that you have reverted some of your changes with your recent push.

@libretto
Copy link
Contributor Author

@nosahama

@libretto The unit test for AvroMerge is gone and it seems that you have reverted some of your changes with your recent push.

Yes it was an issue with rebase I just restored banch, synced it with main and try rebase again...

@libretto libretto force-pushed the avro_references2 branch 2 times, most recently from 9cac0ba to f4346ea Compare September 26, 2024 16:57
@libretto
Copy link
Contributor Author

@libretto shall we rebase this work so we get upstream changes and then let's discuss around the comments if you can tell me when you'd like to, we can spare sometime and discuss in one thread here.

rebased

@davide-armand
Copy link
Contributor

@libretto Hi, sorry for the delay, this took some time to properly review.

I think that the implementation using Avro unions is risky and might not work in all cases (hard to tell).
A more "standard" solution using the Avro Python library instead of unions seems a better choice.

I have opened a draft PR implementing the merging using Avro lib here:
#987

Please have a look and consider that approach instead.

@libretto
Copy link
Contributor Author

libretto commented Nov 6, 2024

@libretto Hi, sorry for the delay, this took some time to properly review.

I think that the implementation using Avro unions is risky and might not work in all cases (hard to tell). A more "standard" solution using the Avro Python library instead of unions seems a better choice.

I have opened a draft PR implementing the merging using Avro lib here: #987

Please have a look and consider that approach instead.

@davide-armand Hi,

As I understand, the union approach concept has been confirmed by our teams and has passed all existing Karapace tests, including the references tests I provided. This demonstrates that the concept works well. I will try using Your approach with the Avro library to eliminate the need for the union method. I expect the results to be consistent.

Regarding tests, I noticed that You added a few tests for recursion. However, the direct recursion test You included is on the Avro-level parser and should be tested within the Avro tests, outside of the Karapace project. Indirect recursion (where Schema A references Schema B, which in turn references Schema A) is not possible in Karapace's design, as all schemas are added simultaneously. A schema cannot be added if any of its references are unresolved, which effectively prevents indirect recursion between schemas.

I will include the Your code for the test_avro_reference() test in the my branch tests.

@libretto
Copy link
Contributor Author

libretto commented Nov 9, 2024

@davide-armand I utilized the Avro library functionality as you suggested. References now work in the native Avro way.

@libretto libretto requested a review from nosahama November 9, 2024 20:36
@libretto
Copy link
Contributor Author

libretto commented Nov 9, 2024

#987 implemented there

Copy link
Contributor

@davide-armand davide-armand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@libretto Hi, I left a few comments, please have a look

def __init__(self, schema_str: str, dependencies: Mapping[str, Dependency] | None = None):
self.schema_str = json_encode(json_decode(schema_str), compact=True, sort_keys=True)
self.dependencies = dependencies
self.unique_id = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is unique_id only used in tests?
Can you do without it?

self.schema_str = json_encode(json_decode(schema_str), compact=True, sort_keys=True)
self.dependencies = dependencies
self.unique_id = 0
self.regex = re.compile(r"^\s*\[")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regex does not seem to be used


return merge

def resolve(self) -> list:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolve() and build() can be merged in a single method, since builder() is only called by resolve().
Maybe the whole AvroResolver class can be just a static method?

merged_schema = make_avsc_object(json.loads(schema), names)
merged_schema_str = str(merged_schema)
else:
merged_schema_str = schema_str
parsed_schema = parse_avro_schema_definition(
Copy link
Contributor

@davide-armand davide-armand Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have realized now that it is possible to pass a list of schema strings to avro.schema.parse():
https://github.com/aiven/avro/blob/5a82d57f2a650fd87c819a30e433f1abb2c76ca2/lang/py/avro/schema.py#L1192

We currently use parse() to parse a single schema (see parse_avro_schema_definition()).

When called with a list of schemas it returns a UnionSchema (through make_avsc_object()), which itself contains a list of parsed schemas (UnionSchema.schemas).
The constructor of UnionSchema seems to be doing what we want (loop on the schemas, call make_avsc_object() and remember names).

I think passing to parse() the reference schemas + the main schema (the main schema should be passed as last in the list) and then getting the last schema from UnionSchema.schemas should be equivalent to what's the PR now.

If it works then we can delegate that logic to the Avro library

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants