
Implement LLM-based Document Splitting #78

Open
shreyashankar opened this issue Oct 7, 2024 · 3 comments

Comments

@shreyashankar
Collaborator

As requested by a member of the community, it would be cool to implement a new feature for splitting documents using an LLM instead of our current token- or delimiter-based methods. This would allow for more intelligent and context-aware splitting of documents.

Proposed Idea

  1. Implement an LLM "scan" operation that can process a document and determine contiguous splits based on specified criteria.
  2. Allow users to provide a split_criteria_prompt that describes how to split the document (e.g., by topic).
  3. Use a scratchpad technique (similar to our reduce operation) to manage internal state/memory while splitting.

Technical Approach

  1. Feed as much text as possible into the LLM.
  2. Ask the LLM to output:
    • As many split points as it's confident in (phrases of 5-10 tokens that we can search in the document to split)
    • Any memory/state it wants to keep track of for splitting the next part of the document
  3. Remove processed chunks from the document.
  4. Repeat the process until the entire document is processed.
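The iterative scan above could be sketched roughly like this. Everything here is hypothetical, not DocETL's actual API: `llm` stands in for any callable that takes a prompt and returns a dict with `"split_phrases"` (verbatim phrases found in the text) and `"memory"` (scratchpad state carried into the next call).

```python
from typing import Callable

def llm_scan_split(document: str, split_criteria: str,
                   llm: Callable[[str], dict], window: int = 4000) -> list[str]:
    """Iteratively scan a document, splitting at LLM-suggested phrases."""
    chunks: list[str] = []
    memory, remaining = "", document
    while remaining:
        prompt = (
            f"Criteria: {split_criteria}\nMemory: {memory}\n"
            f"Text:\n{remaining[:window]}\n"
            "Return split phrases you are confident in, plus any memory to keep."
        )
        result = llm(prompt)
        memory = result.get("memory", "")
        made_progress = False
        for phrase in result.get("split_phrases", []):
            idx = remaining.find(phrase)
            if idx <= 0:
                continue  # phrase not found (or at position 0): skip it
            chunks.append(remaining[:idx])  # remove the processed chunk
            remaining = remaining[idx:]
            made_progress = True
        if not made_progress:
            # No usable split points left: emit the rest as a final chunk.
            chunks.append(remaining)
            break
    return chunks
```

Because the LLM returns short verbatim phrases rather than character offsets, `str.find` can locate each split point exactly, and concatenating the chunks always reconstructs the original document.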

Considerations

  1. Splitting strategy:
    • All splits in one call
    • One split at a time
    • K splits at a time
    • As many splits as the LLM can confidently provide
  2. Balancing split quality with processing efficiency
  3. Handling very large documents that exceed LLM context limits
  4. Ensuring consistency in splitting criteria across multiple LLM calls

Proposed Interface Design

```yaml
operations:
  - name: llm_split
    type: split
    split_criteria: "split by theme discussed"
```
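As the feature matures, the stanza could grow knobs for the splitting-strategy considerations listed above. Everything here beyond `name`, `type`, and `split_criteria` is hypothetical:

```yaml
operations:
  - name: llm_split
    type: split
    split_criteria: "split by theme discussed"
    # hypothetical knobs, mirroring the considerations above:
    max_splits_per_call: 5   # K splits at a time (or "all" / 1)
    scratchpad: true         # carry memory/state across LLM calls
```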
@staru09
Contributor

staru09 commented Oct 7, 2024

How about going with this (https://python.langchain.com/docs/how_to/#text-splitters) and then using it with some LLM for splitting?

@shreyashankar
Collaborator Author

shreyashankar commented Oct 7, 2024

This seems the most similar: https://python.langchain.com/docs/how_to/semantic-chunker/

But it depends on embeddings and relies on hard-coded splitting criteria (e.g., a fixed window of 3 sentences).

I also wonder: maybe we don't need chunks to be contiguous, as long as each chunk preserves reading order. For example, imagine the following 4 units in a transcript of a conversation:
A, B, C, D

Suppose B is totally out of context (an aside from one of the members), but A, C, and D are part of the main topic. I think a valid chunk set is {{A + C + D}, {B}}. This set has 2 chunks, and each chunk's reading order is preserved. This chunk set will not get generated via the chunking strategy I proposed in the issue, but it could be useful for some downstream analysis...
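To make the definition concrete, here is a minimal sketch (hypothetical helper, not proposed API) of the validity check: a chunk set is valid if each chunk's unit indices are strictly increasing, and every unit is assigned to exactly one chunk.

```python
def preserves_reading_order(chunk: list[int]) -> bool:
    """A chunk is valid if its unit indices are strictly increasing."""
    return all(a < b for a, b in zip(chunk, chunk[1:]))

units = ["A", "B", "C", "D"]
chunk_set = [[0, 2, 3], [1]]  # {{A + C + D}, {B}} by unit index

# Each chunk preserves reading order, even though {A + C + D} is not contiguous.
assert all(preserves_reading_order(c) for c in chunk_set)
# Every unit appears exactly once across the chunk set.
assert sorted(i for c in chunk_set for i in c) == list(range(len(units)))
```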

@shreyashankar
Collaborator Author

A downside to chunking on embedding cosine similarity space is that the chunks aren't related to any particular task. We want our chunking strategy to be tied to the task, defined by some user prompt.
