Add slash at the end of the load path #1366

Merged
merged 1 commit into from
Nov 14, 2024

Conversation

spenes
Contributor

@spenes spenes commented Nov 13, 2024

Jira ref: PDP-1569

Shredder groups entities that share the same schema version (model-revision-addition) within a batch into the same folder. For example, if a batch contains events with versions `1-0-0`, `1-0-1` and `1-0-2` of `com.acme.test`, the resulting run folder will have the following subfolders:
```
output=good/vendor=com.acme/name=test/format=tsv/model=1/revision=0/addition=0
output=good/vendor=com.acme/name=test/format=tsv/model=1/revision=0/addition=1
output=good/vendor=com.acme/name=test/format=tsv/model=1/revision=0/addition=2
```
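As a rough illustration of how a schema version maps to the subfolders above, here is a minimal Python sketch. The helper name `shredded_subfolder` and its parameters are hypothetical and are not taken from the Shredder codebase; it only mirrors the path layout shown in this description.

```python
def shredded_subfolder(vendor: str, name: str, fmt: str, version: str) -> str:
    """Build the run-folder subfolder for one schema version (hypothetical helper)."""
    # A version such as "1-0-2" is split into model, revision and addition.
    model, revision, addition = version.split("-")
    return (
        f"output=good/vendor={vendor}/name={name}/format={fmt}"
        f"/model={model}/revision={revision}/addition={addition}"
    )


# One subfolder per distinct version present in the batch.
subfolders = [
    shredded_subfolder("com.acme", "test", "tsv", version)
    for version in ("1-0-0", "1-0-1", "1-0-2")
]
```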
Before the fix, Loader used S3 paths without a trailing slash (`/`) in the copy statements it created. This works fine in most cases. However, a problem arises when the same batch contains events with `1-0-1` and `1-0-11`. In that case, the run folder will have the following subfolders:
```
output=good/vendor=com.acme/name=test/format=tsv/model=1/revision=0/addition=1
output=good/vendor=com.acme/name=test/format=tsv/model=1/revision=0/addition=11
```
When Loader tries to copy the entities in `/model=1/revision=0/addition=1` to the respective table with a copy statement, Redshift also tries to copy the entities under `/model=1/revision=0/addition=11`, since both paths share the same prefix. The copy then fails because the data under `/model=1/revision=0/addition=11` doesn't have the same structure as `1-0-1`. Putting a slash at the end of the path solves the problem: with that change, only entities under `model=1/revision=0/addition=1` are copied, as expected.
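The prefix collision can be reproduced with a small Python sketch that treats the load path as a plain key prefix, which is how the behaviour is described above. The object keys, the file name `part-0000.tsv` and both helper functions are assumptions made for illustration; this is not the Loader's actual code.

```python
def keys_matching(prefix: str, keys: list) -> list:
    """Return the keys that start with the given prefix (simple prefix match)."""
    return [key for key in keys if key.startswith(prefix)]


def with_trailing_slash(path: str) -> str:
    """Append a slash unless the path already ends with one."""
    return path if path.endswith("/") else path + "/"


BASE = "output=good/vendor=com.acme/name=test/format=tsv/model=1/revision=0"

# Hypothetical object keys for a batch containing 1-0-1 and 1-0-11.
keys = [
    f"{BASE}/addition=1/part-0000.tsv",
    f"{BASE}/addition=11/part-0000.tsv",
]

load_path = f"{BASE}/addition=1"

# Without the slash, "addition=1" is also a prefix of "addition=11".
without_slash = keys_matching(load_path, keys)

# With the slash, only keys under "addition=1/" match.
with_slash = keys_matching(with_trailing_slash(load_path), keys)
```

In this sketch `without_slash` contains both keys, while `with_slash` contains only the key under `addition=1/`, which matches the behaviour described for the fix.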
@spenes spenes force-pushed the s3-load-path-suffix-slash-fix branch from 12bab2c to e6ac798 Compare November 13, 2024 15:07
@spenes spenes changed the base branch from master to develop November 13, 2024 15:15
@spenes spenes merged commit e6ac798 into develop Nov 14, 2024
2 checks passed

3 participants