[Question]: Why is the entire parsed md file divided into one chunk? #3992

Open
Negai-98 opened this issue Dec 11, 2024 · 13 comments

@Negai-98

Describe your problem

I am using the default settings. The chunk produced from the md file is clearly longer than the configured chunk size, which suggests that no delimiter was matched. The delimiter setting includes \n, but the line breaks in the md file do not seem to be recognized.

Negai-98 added the `question` (Further information is requested) label Dec 11, 2024
@Negai-98
Author

Platform: Windows WSL
Version: 1.14.1-dev

@Snify89

Snify89 commented Dec 11, 2024

Carriage Return / Line Feed delimiter issue?
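
One quick way to test this hypothesis is to count the line-ending styles in the raw bytes of the file. The snippet below is only a minimal sketch, not RAGFlow code; `count_line_endings` and `normalize_line_endings` are hypothetical helper names introduced for illustration.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, bare LF, and bare CR line endings in raw file bytes."""
    crlf = data.count(b"\r\n")
    # Every CRLF contains one LF and one CR, so subtract those out.
    lf = data.count(b"\n") - crlf
    cr = data.count(b"\r") - crlf
    return {"crlf": crlf, "lf": lf, "cr": cr}


def normalize_line_endings(data: bytes) -> bytes:
    """Convert CRLF and bare CR to LF so that a plain "\\n" delimiter matches."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


if __name__ == "__main__":
    import sys

    # Usage: python check_eol.py README.md
    if len(sys.argv) > 1:
        with open(sys.argv[1], "rb") as fh:
            print(count_line_endings(fh.read()))
```

If the file turns out to contain only LF endings, a CR/LF mismatch is unlikely to be the cause.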

@Negai-98
Author

It is supposed to be like that, but I am not sure why it fails to recognize the line breaks in the md file. I use ragflow/readme.md as the document.

@Snify89

Snify89 commented Dec 11, 2024

> It is supposed to be like that, but I am not sure why it fails to recognize the line breaks in the md file. I use ragflow/readme.md as the document.

Could you please check what kind of BOM (Byte Order Mark) the file has and report back here? This might be an encoding/decoding issue.
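
For reference, a BOM can be checked by looking at the first few bytes of the file. This is a minimal sketch using the BOM constants from Python's standard `codecs` module; `detect_bom` is a hypothetical helper name, and nothing here reflects how RAGFlow itself decodes files.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return a label for the BOM at the start of data, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


if __name__ == "__main__":
    import sys

    # Usage: python check_bom.py README.md
    if len(sys.argv) > 1:
        with open(sys.argv[1], "rb") as fh:
            print(detect_bom(fh.read(4)))
```

A result of `None` means the file has no BOM, which is the usual case for UTF-8 files.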

@Negai-98
Author

utf-8 @Snify89

@Negai-98
Author

Negai-98 commented Dec 11, 2024

Manually copying the content of the md file into a docx file works properly, but it would be best to be able to parse the md file directly.

@Snify89

Snify89 commented Dec 11, 2024

Maybe it's an HTML parsing issue? I haven't reproduced it yet, but it's definitely worth a look. Thanks for reporting.

@KevinHuSh
Collaborator

KevinHuSh commented Dec 12, 2024

What about turning down the chunk token number?
Or, set the delimiter: `##`\n\r
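
To illustrate how these two settings usually interact in a delimiter-based chunker, here is a generic sketch. It is an assumption-laden illustration, not RAGFlow's actual implementation: `chunk_text` is a hypothetical function, tokens are approximated by whitespace-separated words, and at least one delimiter is assumed.

```python
import re


def chunk_text(text, delimiters=("\n",), max_tokens=128):
    """Split text at any delimiter, then pack the pieces into chunks under a token budget."""
    # Try longer delimiters first so that "\r\n" wins over "\n" when both are given.
    pattern = "|".join(
        re.escape(d) for d in sorted(delimiters, key=len, reverse=True)
    )
    pieces = [p for p in re.split(pattern, text) if p.strip()]

    chunks, current, used = [], [], 0
    for piece in pieces:
        n = len(piece.split())  # crude token estimate (assumption)
        # Start a new chunk when adding this piece would exceed the budget.
        if current and used + n > max_tokens:
            chunks.append("\n".join(current))
            current, used = [], 0
        current.append(piece)
        used += n
    if current:
        chunks.append("\n".join(current))
    return chunks
```

In a design like this, a lower token budget only produces more chunks if the delimiter actually matches somewhere. When no delimiter matches, the whole document is a single piece and ends up as one oversized chunk, whatever the budget is.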

@Negai-98
Author

Negai-98 commented Dec 12, 2024

@KevinHuSh Changing the delimiter or lowering the chunk token number does not seem to have any effect on md files; the full md document is still combined into a single chunk.

@Negai-98
Author

I tried reinstalling the latest version of Nightly, but I am still getting the same result.

@Snify89

Snify89 commented Dec 12, 2024

Do you get more than one chunk if you use other chunk methods? Maybe the chunk method decided to use just one chunk?

Edit: It's weird, though, that the docx chunks better.

@Negai-98
Author

@Snify89 The QA mode chunks the file normally, so it seems there is a bug in the general mode regarding md chunking.

@Negai-98
Author

Additionally, I converted readme.md to docx and then chunked it using the general method. It works properly.
