[Feature] bulk load optimization (Part 1 - block manager) #53921
+295
−5
Why I'm doing:
When performing large data ingestions, memory limitations during the ingestion process can result in the final Rowset containing a large number of segment files. This negatively impacts query performance and increases resource consumption (CPU and memory) during subsequent compaction operations.
To address this, we propose a new large data ingestion strategy that spills data during the ingestion process and assembles it into final segment files in the final stage of the ingestion. This approach reduces the number of segment files in the Rowset.
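As a rough illustration of why spilling and then assembling at the end yields fewer files, here is a minimal sketch. It is not the PR's implementation: `spill_runs`, `merge_into_segments`, the `Run` alias, and the use of a k-way merge over sorted runs are all assumptions made for the example, and rows are reduced to plain integer keys.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical illustration only; none of these names come from the PR.
using Run = std::vector<int>;  // one sorted, spilled run of keys

// Ingestion stage: buffer up to mem_limit_rows rows, then sort and "spill" them.
// Assumes mem_limit_rows > 0.
std::vector<Run> spill_runs(const std::vector<int>& rows, std::size_t mem_limit_rows) {
    std::vector<Run> runs;
    for (std::size_t i = 0; i < rows.size(); i += mem_limit_rows) {
        std::size_t end = std::min(rows.size(), i + mem_limit_rows);
        Run run(rows.begin() + i, rows.begin() + end);
        std::sort(run.begin(), run.end());
        runs.push_back(std::move(run));
    }
    return runs;
}

// Final stage: k-way merge of all runs, starting a new segment every
// segment_rows rows. Assumes segment_rows > 0.
std::vector<Run> merge_into_segments(const std::vector<Run>& runs, std::size_t segment_rows) {
    typedef std::pair<int, std::size_t> Item;  // (key, index of the run it came from)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item> > heap;
    std::vector<std::size_t> pos(runs.size(), 0);
    for (std::size_t r = 0; r < runs.size(); ++r) {
        if (!runs[r].empty()) heap.push(Item(runs[r][0], r));
    }
    std::vector<Run> segments;
    while (!heap.empty()) {
        Item top = heap.top();
        heap.pop();
        if (segments.empty() || segments.back().size() == segment_rows) {
            segments.push_back(Run());
        }
        segments.back().push_back(top.first);
        std::size_t r = top.second;
        if (++pos[r] < runs[r].size()) heap.push(Item(runs[r][pos[r]], r));
    }
    return segments;
}
```

With a memory limit of 2 rows, ten input rows produce five small runs; merging them with a segment size of 5 rows leaves two segments. Without the final merge, each run would have become its own segment file.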
This PR represents the first phase, implementing the management of SpillBlockManager within each Delta writer.
What I'm doing:
Fix #53954
Three classes are introduced here:
LoadSpillBlockContainer: Manages the blocks that are continuously generated during the import process.
LoadSpillBlockManager: Manages the BlockManager and the LoadSpillBlockContainer. The BlockManager handles block allocation and release. Each Delta writer has its own LoadSpillBlockManager, which is released along with the Delta writer at the end of the import process, reclaiming all intermediate files.
SpillMemTableSink: Implements the spill logic of the memtable (TODO).
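The ownership described above (a per-writer manager that owns a block allocator and a block container, and reclaims everything when it is destroyed) could be sketched roughly as follows. Every name here is a hypothetical stand-in with a `Sketch` prefix, not the PR's actual API. The `live_counter` hook exists only to make the reclamation observable, and the mutex assumes blocks may be appended from more than one thread.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-ins for illustration; not the PR's actual classes.
struct SketchBlock {
    int id;
    std::size_t bytes;
};

// Plays the role of the BlockManager: allocates and releases blocks.
// live_counter is a demo hook so callers can observe outstanding blocks.
class SketchBlockManager {
public:
    explicit SketchBlockManager(int* live_counter) : _live(live_counter) {}
    SketchBlock acquire(std::size_t bytes) {
        ++*_live;
        return SketchBlock{_next_id++, bytes};
    }
    void release(const SketchBlock&) { --*_live; }

private:
    int* _live;
    int _next_id = 0;
};

// Plays the role of the container: collects blocks as they are produced.
class SketchBlockContainer {
public:
    void append(const SketchBlock& b) {
        std::lock_guard<std::mutex> guard(_mu);
        _blocks.push_back(b);
    }
    std::size_t size() const {
        std::lock_guard<std::mutex> guard(_mu);
        return _blocks.size();
    }
    // Hands all collected blocks to the caller and leaves the container empty.
    std::vector<SketchBlock> take_all() {
        std::lock_guard<std::mutex> guard(_mu);
        std::vector<SketchBlock> out;
        out.swap(_blocks);
        return out;
    }

private:
    mutable std::mutex _mu;
    std::vector<SketchBlock> _blocks;
};

// One instance per writer; its destructor releases every block it produced.
class SketchLoadSpillBlockManager {
public:
    explicit SketchLoadSpillBlockManager(int* live_counter) : _block_mgr(live_counter) {}
    ~SketchLoadSpillBlockManager() {
        for (const SketchBlock& b : _container.take_all()) _block_mgr.release(b);
    }
    void spill(std::size_t bytes) { _container.append(_block_mgr.acquire(bytes)); }
    std::size_t block_count() const { return _container.size(); }

private:
    SketchBlockManager _block_mgr;
    SketchBlockContainer _container;
};
```

Tying release to the destructor means a writer that ends early, for example on an aborted load, still returns its blocks without a separate cleanup call.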