
fix: refine filetype detection #3828

Open
wants to merge 4 commits into
base: main

scanny
Collaborator

@scanny scanny commented Dec 14, 2024

Summary
Fixes a bug where a CSV file with asserted content-type application/vnd.ms-excel was incorrectly identified as an XLS file and failed partitioning.

Additional Context
The content_type argument to partitioning is often authored by the client system (e.g. Unstructured SDK) and is both unreliable and outside the control of the user. In this case the .csv -> XLS mapping is correct for certain purposes (Excel is often used to load and edit CSV files) but not for partitioning, and the user has no readily available way to override the mapping.

XLS files, as well as seven other common binary file types, can be detected efficiently and with near-perfect reliability (at least 99.999% of the time) using code we already have in the file detector.

  • Promote this direct-inspection strategy to be tried first.
  • When DOC, DOCX, EPUB, ODT, PPT, PPTX, XLS, or XLSX is detected, use that file-type.
  • When one of those types is NOT detected, clear the asserted content_type when it matches any of those types. This prevents the problem seen in the bug where the asserted content type was used to determine the file-type.
  • The remaining content_type, guessed MIME-type, and filename-extension mapping strategies are tried, in that order, only when direct inspection fails. This is largely the same as it was before.
  • Fix #3781 (bug/check for magic library availability doesn't appear to be correct) while we were in the neighborhood.
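
The ordering described in the list above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `detect_file_type`, the `inspect` and `guess_mime` callables, and the small lookup tables are hypothetical names introduced for the example. Only the MIME-type strings themselves are standard.

```python
from __future__ import annotations

import os
from typing import Callable, Optional

# MIME-types of the eight formats that direct inspection can identify.
DIRECTLY_DETECTABLE = {
    "application/msword": "DOC",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "DOCX",
    "application/epub+zip": "EPUB",
    "application/vnd.oasis.opendocument.text": "ODT",
    "application/vnd.ms-powerpoint": "PPT",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": "PPTX",
    "application/vnd.ms-excel": "XLS",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "XLSX",
}
# Tiny illustrative tables for the fallback strategies (hypothetical, incomplete).
OTHER_MIME_TYPES = {"text/csv": "CSV", "text/plain": "TXT"}
EXTENSIONS = {".csv": "CSV", ".txt": "TXT"}


def detect_file_type(
    data: bytes,
    content_type: Optional[str] = None,
    filename: Optional[str] = None,
    inspect: Callable[[bytes], Optional[str]] = lambda data: None,
    guess_mime: Callable[[bytes], Optional[str]] = lambda data: None,
) -> Optional[str]:
    # 1. Direct inspection is tried first and wins whenever it succeeds.
    detected = inspect(data)
    if detected is not None:
        return detected

    # 2. Inspection did not find one of the eight types, so an asserted
    #    content-type naming one of them cannot be right. Clear it.
    if content_type in DIRECTLY_DETECTABLE:
        content_type = None

    # 3. Any remaining asserted content-type.
    if content_type in OTHER_MIME_TYPES:
        return OTHER_MIME_TYPES[content_type]

    # 4. MIME-type guessed from the bytes (stand-in for a libmagic-style guess).
    guessed = guess_mime(data)
    if guessed in OTHER_MIME_TYPES:
        return OTHER_MIME_TYPES[guessed]

    # 5. Filename extension as the last resort.
    if filename:
        return EXTENSIONS.get(os.path.splitext(filename)[1].lower())
    return None
```

With this ordering, a CSV payload sent with `content_type="application/vnd.ms-excel"` and `filename="data.csv"` falls through to the extension mapping and comes back as CSV rather than XLS.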

The eight file-types based on CFB or ZIP compound files can be detected
with 100% accuracy (or at least 99.999%). Try this strategy first,
ignoring the unreliable content-type for these file-types.

This fixes a problem where a CSV file with an asserted XLS file-type was
mistakenly typed as XLS and failed partitioning.
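
For readers unfamiliar with why these eight types are so reliably detectable, here is a minimal sketch of signature-based inspection using only the standard library. It is not the detector code referenced in this PR; the function names are hypothetical. It relies on well-known format facts: CFB files start with a fixed 8-byte signature, ZIP files start with `PK\x03\x04`, OOXML packages keep their parts under `word/`, `ppt/`, or `xl/`, and EPUB and ODT archives carry a `mimetype` member. Telling DOC, PPT, and XLS apart requires reading the CFB directory, which is omitted here.

```python
import io
import zipfile
from typing import Optional

CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # Compound File Binary signature
ZIP_MAGIC = b"PK\x03\x04"  # local file header of the first ZIP member

# Contents of the "mimetype" member used by EPUB and ODT archives.
MIMETYPE_MEMBER = {
    "application/epub+zip": "EPUB",
    "application/vnd.oasis.opendocument.text": "ODT",
}
# Top-level folders that distinguish the three OOXML package types.
OOXML_PREFIXES = {"word/": "DOCX", "ppt/": "PPTX", "xl/": "XLSX"}


def sniff_container(data: bytes) -> Optional[str]:
    """Return "CFB" or "ZIP" when the leading bytes match, else None."""
    if data.startswith(CFB_MAGIC):
        return "CFB"
    if data.startswith(ZIP_MAGIC):
        return "ZIP"
    return None


def sniff_zip_type(data: bytes) -> Optional[str]:
    """Differentiate the ZIP-based types by looking at archive members."""
    if not data.startswith(ZIP_MAGIC):
        return None
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = zf.namelist()
            mimetype = ""
            if "mimetype" in names:
                mimetype = zf.read("mimetype").decode("ascii", "replace").strip()
    except zipfile.BadZipFile:
        return None  # looked like a ZIP but is truncated or corrupt
    if mimetype in MIMETYPE_MEMBER:
        return MIMETYPE_MEMBER[mimetype]
    for prefix, file_type in OOXML_PREFIXES.items():
        if any(name.startswith(prefix) for name in names):
            return file_type
    return None
```

A CSV payload matches neither signature, so inspection returns `None` for it regardless of the content-type the client asserted.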
Looks like we need some work to upgrade ruff past 0.4.
Development

Successfully merging this pull request may close these issues.

bug/check for magic library availability doesn't appear to be correct