[Question]: Why is the entire parsed md file divided into one chunk? #3992

Negai-98 · 2024-12-11T09:17:39Z

Describe your problem

I am using the default settings. The chunk produced from the md file is clearly longer than the configured chunk size, which suggests that no delimiter was matched. The delimiter setting includes `\n`, but the line breaks in the md file do not seem to be recognized.

Negai-98 · 2024-12-11T09:20:23Z

Platform: Windows WSL
Version: 1.14.1-dev

Snify89 · 2024-12-11T09:21:30Z

Carriage Return / Line Feed delimiter issue?
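
One way to check this is to count the line-ending styles in the raw bytes of the file. The snippet below is a minimal sketch that is independent of RAGFlow; the helper name `count_line_endings` is made up for illustration.

```python
from collections import Counter


def count_line_endings(data: bytes) -> Counter:
    """Count CRLF, bare LF, and bare CR line endings in raw file bytes."""
    crlf = data.count(b"\r\n")
    counts = Counter()
    counts["CRLF"] = crlf
    # Every CRLF pair contains one "\n" and one "\r", so subtract the pairs.
    counts["LF"] = data.count(b"\n") - crlf
    counts["CR"] = data.count(b"\r") - crlf
    return counts


# Usage (hypothetical path): read the file in binary mode so Python does not
# translate line endings, e.g.
#   with open("README.md", "rb") as f:
#       print(count_line_endings(f.read()))
```

If the file reports only `LF` endings, a CRLF mismatch is unlikely to be the cause.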

Negai-98 · 2024-12-11T09:26:03Z

It is supposed to be like that, but I am not sure why it fails to recognize the line breaks in the md file. I use ragflow/readme.md as the document.

Snify89 · 2024-12-11T09:40:20Z

> It is supposed to be like that, but I am not sure why it fails to recognize the line breaks in the md file. I use ragflow/readme.md as the document.

Could you please check what kind of BOM (Byte Order Mark) the file has and report back here? This might be an encoding/decoding issue.
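
In case it helps, here is a small sketch for checking the BOM with Python's standard `codecs` constants. The function name `detect_bom` is hypothetical, and this is not how RAGFlow itself reads files.

```python
import codecs

# Order matters: the UTF-32 LE BOM begins with the same two bytes as the
# UTF-16 LE BOM, so the longer marks are tested first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


# Usage (hypothetical path): only the first few bytes are needed, e.g.
#   with open("README.md", "rb") as f:
#       print(detect_bom(f.read(4)))
```

A result of `None` means the file has no BOM, which is common for plain UTF-8 files.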

Negai-98 · 2024-12-11T10:00:01Z

utf-8 @Snify89

Negai-98 · 2024-12-11T10:20:04Z

Manually copying the content from the md file into a docx file works properly, but it would be best to be able to parse the md file directly.

Snify89 · 2024-12-11T11:52:05Z

Maybe it's an HTML parsing issue? I haven't reproduced it yet, but it's definitely worth a look. Thanks for reporting.

KevinHuSh · 2024-12-12T02:29:48Z

What about turning down the chunk token number?
Or, set the delimiter: `##`\n\r
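
To illustrate how the two settings are expected to interact, here is a rough sketch of delimiter-based splitting followed by merging up to a token budget. It is an assumption-laden illustration, not RAGFlow's actual chunking code: the function `naive_chunks` is hypothetical, and tokens are approximated by whitespace-separated words.

```python
def naive_chunks(text: str, delimiters: str = "\n", max_tokens: int = 128) -> list:
    """Split text at any delimiter character, then merge pieces up to max_tokens.

    Assumption: a "token" is approximated by a whitespace-separated word.
    """
    # Normalize CRLF and bare CR so that "\n" matches every line break.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Treat each character in `delimiters` as a split point.
    for d in delimiters:
        text = text.replace(d, "\n")
    pieces = [p for p in text.split("\n") if p.strip()]

    chunks, current, count = [], [], 0
    for piece in pieces:
        n = len(piece.split())
        # Start a new chunk when adding this piece would exceed the budget.
        if current and count + n > max_tokens:
            chunks.append("\n".join(current))
            current, count = [], 0
        current.append(piece)
        count += n
    if current:
        chunks.append("\n".join(current))
    return chunks


sample = "# Title\r\nfirst line here\r\n## Section\r\nsecond line here\r\n"
print(len(naive_chunks(sample, "\n", max_tokens=5)))
print(len(naive_chunks(sample, "\n", max_tokens=100)))
```

Under this model, lowering the token budget should produce more chunks whenever at least one delimiter matches. A document that stays in one chunk regardless of the budget points to the text never being split at all.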

Negai-98 · 2024-12-12T08:29:26Z

@KevinHuSh Changing the delimiter or turning down the chunk token number does not seem to have any effect on md files; the full md document is still combined into a single chunk.

Negai-98 · 2024-12-12T08:30:30Z

I tried reinstalling the latest nightly version, but I still get the same result.

Snify89 · 2024-12-12T08:59:59Z

Do you get more than 1 chunk, if you use other chunk methods? Maybe the chunk method decided to use just one chunk?

Edit: It's weird tho, that the docx chunks better?!

Negai-98 · 2024-12-13T01:12:47Z

@Snify89 With the QA mode, chunking works normally, so it seems there is a bug in the general mode regarding md chunking.

Negai-98 · 2024-12-13T01:18:59Z

Additionally, I converted readme.md to docx and then chunked it using the general method. It works properly.

Negai-98 added the question (Further information is requested) label Dec 11, 2024

KevinHuSh assigned Feiue Dec 13, 2024
