[Feature] bulk load optimization (Part 1 - block manager) #53921

Open
wants to merge 5 commits into
base: main

luohaha
Contributor

@luohaha luohaha commented Dec 13, 2024

Why I'm doing:

When ingesting large volumes of data, memory limits during the load can leave the final Rowset with a large number of segment files. This degrades query performance and increases CPU and memory consumption during subsequent compaction.

To address this, we propose a new ingestion strategy for large loads: data is spilled during ingestion and assembled into segment files only in the final stage of the load. This reduces the number of segment files in the Rowset.

This PR is the first phase: it implements management of the LoadSpillBlockManager within each Delta writer.
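To make the effect on segment count concrete, here is a back-of-the-envelope sketch. The sizes used (a 64 MiB flush threshold, a 1 GiB target segment size, a 100 GiB load) and the helper `files_needed` are illustrative assumptions, not values or functions taken from this PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sizes, chosen only for illustration.
constexpr std::uint64_t kMiB = 1ULL << 20;
constexpr std::uint64_t kGiB = 1ULL << 30;

// Number of files written when `total_bytes` is emitted in chunks of at most
// `chunk_bytes` each (ceiling division). Requires chunk_bytes > 0.
inline std::uint64_t files_needed(std::uint64_t total_bytes, std::uint64_t chunk_bytes) {
    assert(chunk_bytes > 0);
    return (total_bytes + chunk_bytes - 1) / chunk_bytes;
}
```

Under these assumptions, `files_needed(100 * kGiB, 64 * kMiB)` gives 1600 files if every memory-limited flush becomes its own segment, while `files_needed(100 * kGiB, 1 * kGiB)` gives 100 if spilled data is assembled into 1 GiB segments at the end.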

What I'm doing:

Fixes #53954

Three classes are introduced here:

  1. LoadSpillBlockContainer: Responsible for managing the blocks continuously generated during the import process.
  2. LoadSpillBlockManager: Responsible for managing both the BlockManager and the LoadSpillBlockContainer. The BlockManager handles block allocation and release. Each Delta writer has its own LoadSpillBlockManager, which will be released along with the Delta writer at the end of the import process, reclaiming all intermediate files.
  3. SpillMemTableSink: Responsible for implementing the spill logic of the memtable (TODO).
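The ownership relationship described in this list can be sketched roughly as follows. This is a minimal in-memory illustration, not the code in this PR: the class names are borrowed from the description, but every member, method, and the `Block` type are hypothetical, and blocks here are plain heap objects rather than intermediate files.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a spilled block (a real one would wrap a file).
struct Block {
    std::size_t id;
    std::string payload;
};

using LiveCounter = std::shared_ptr<std::atomic<std::size_t>>;

// Hypothetical allocator: hands out blocks and tracks how many are still live.
class BlockManager {
public:
    std::shared_ptr<Block> acquire(std::string payload) {
        LiveCounter live = _live;  // the deleter keeps the counter alive
        live->fetch_add(1);
        return std::shared_ptr<Block>(
                new Block{_next_id++, std::move(payload)},
                [live](Block* b) { live->fetch_sub(1); delete b; });
    }
    LiveCounter live_counter() const { return _live; }

private:
    std::size_t _next_id = 0;
    LiveCounter _live = std::make_shared<std::atomic<std::size_t>>(0);
};

// Collects the blocks produced while the load is running.
class LoadSpillBlockContainer {
public:
    void append(std::shared_ptr<Block> block) {
        std::lock_guard<std::mutex> guard(_mutex);
        _blocks.push_back(std::move(block));
    }
    std::size_t size() const {
        std::lock_guard<std::mutex> guard(_mutex);
        return _blocks.size();
    }

private:
    mutable std::mutex _mutex;
    std::vector<std::shared_ptr<Block>> _blocks;
};

// Owns both the allocator and the container; destroying it releases every block.
class LoadSpillBlockManager {
public:
    void spill(std::string payload) {
        _container.append(_block_manager.acquire(std::move(payload)));
    }
    std::size_t block_count() const { return _container.size(); }
    LiveCounter live_counter() const { return _block_manager.live_counter(); }

private:
    BlockManager _block_manager;
    LoadSpillBlockContainer _container;
};
```

In this sketch a Delta writer would hold one `LoadSpillBlockManager` as a member, so the blocks are reclaimed when the writer is destroyed, which mirrors the lifetime rule stated in item 2.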

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.4
    • 3.3
    • 3.2
    • 3.1
    • 3.0

decster
decster previously approved these changes Dec 13, 2024
@decster decster marked this pull request as ready for review December 13, 2024 14:39
@decster decster requested review from a team as code owners December 13, 2024 14:39
Signed-off-by: luohaha <[email protected]>
Signed-off-by: luohaha <[email protected]>

[FE Incremental Coverage Report]

pass : 0 / 0 (0%)


[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)


[BE Incremental Coverage Report]

fail : 5 / 7 (71.43%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
|------|--------------|----------|----------|-------------------------|
| 🔵 src/storage/lake/delta_writer.cpp | 5 | 7 | 71.43% | [367, 369] |

Signed-off-by: luohaha <[email protected]>
Signed-off-by: luohaha <[email protected]>
Development

Successfully merging this pull request may close these issues.

[Feature] bulk load optimization
2 participants