Correctly patch pdfminer to avoid unnecessarily and unsuccessfully repairing PDFs with long content streams, causing needless and endless OCR #3822

dhdaines · 2024-12-12T02:43:03Z

Verified on my very large documents that it doesn't unnecessarily and unsuccessfully "repair" them.

You may or may not wish to keep the version check in patch_psparser. You are pinning the version of pdfminer.six, and there is no guarantee that the bug in question will be fixed in the next pdfminer.six release (though it is rather serious, so I should hope so), so perhaps you just want to patch it unconditionally.
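To make the trade-off concrete, here is a minimal sketch of a version-gated monkeypatch versus an unconditional one. It does not touch pdfminer.six at all: `fake_parser`, `read_line`, `patch_read_line`, and the version strings are hypothetical stand-ins chosen for illustration, and the "bug" is a toy one, not the actual EOF handling issue.

```python
from types import SimpleNamespace


def version_tuple(text):
    """Convert a dotted or date-style version string into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def patch_read_line(parser_module, installed, first_fixed=None):
    """Swap in a corrected read_line unless the installed release has the fix.

    Pass first_fixed=None to patch unconditionally, which may be the simpler
    choice when the dependency is pinned and no fixed release is known.
    Returns True if the patch was applied.
    """
    if first_fixed is not None and version_tuple(installed) >= version_tuple(first_fixed):
        return False

    def read_line(buffer):
        # Corrected behaviour for this toy example: return the first line
        # even when the buffer is not newline-terminated.
        head, _, _ = buffer.partition("\n")
        return head

    parser_module.read_line = read_line
    return True


# Hypothetical stand-in for the third-party module. Its toy "bug" is that
# read_line returns None when the buffer has no newline at all.
fake_parser = SimpleNamespace(
    read_line=lambda buffer: buffer.split("\n")[0] if "\n" in buffer else None
)
```

Calling `patch_read_line(fake_parser, installed)` with no `first_fixed` always applies the patch. Supplying `first_fixed` skips it once the installed version reaches that release, which only helps if a fixed release is actually known.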

Also corrected an import so that if you do feel like using a newer version of pdfminer.six, it won't break on you.
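One common way to keep an import working across releases that move a name between modules is to try several candidate locations in order. The sketch below shows that general pattern only; `import_first` is a hypothetical helper, not code from this PR, and it is demonstrated with the standard library so it runs without pdfminer.six installed.

```python
import importlib


def import_first(attribute, module_names):
    """Return `attribute` from the first module in module_names that provides it."""
    for module_name in module_names:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Module is absent in this environment; try the next candidate.
            continue
        if hasattr(module, attribute):
            return getattr(module, attribute)
    raise ImportError(f"{attribute} not found in any of {list(module_names)}")


# Demonstrated with the standard library: the first module name does not
# exist, so the lookup falls through to the second one.
OrderedDict = import_first("OrderedDict", ["no_such_module_for_demo", "collections"])
```

For PSSyntaxError, the candidate list would name whichever pdfminer modules define it in the releases you intend to support. Those module paths are not stated in this conversation, so check them against the installed package before relying on them.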


dhdaines and others added 6 commits December 11, 2024 20:15

fix: correctly patch EOF handling in pdfminer (fixes: Unstructured-IO…

e0f464a

…#3815)

chore: add missing newline

1637377

docs: clarify exactly what we are patching here

7d87840

fix: correct the import of PSSyntaxError

39b2472

docs: document what patch_psparser does

99b1c61

chore: ruff

6cba88a

dhdaines changed the title ~~Fix the fix to pdfminer~~ Correctly patch pdfminer to avoid unnecessarily and unsuccessfully repairing PDFs with long content streams, causing needless and endless OCR Dec 12, 2024

chore: changelog

1b0c7f7
