BadZipFile error when run on AWS Lambda #3759
Comments
@pastram-i the DOCX format is in fact a Zip archive that contains the XML files (and images etc.) that define the document. So it's entirely plausible to get a Zip-related error when trying to partition one. I'd be inclined to suspect some kind of corruption has occurred in the S3 round-trip. Can you possibly do a SHA1 or MD5 hash check on the before and after to see if the file was changed in some way? I believe the central directory on a Zip archive is at the end of the file (to make appending efficient), so my first guess would be truncation of some sort, leaving the file-type identifier at the very top of the file in place.
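For illustration, a minimal sketch of this kind of check, assuming the DOCX has already been downloaded to a local path (the function name is just for the example): open the file with Python's zipfile module to see whether the archive and its central directory are still readable, and compute a SHA1 to compare against the source copy.

```python
# Rough sketch (names are illustrative): sanity-check a downloaded DOCX.
import hashlib
import zipfile

def check_docx(path):
    # A DOCX is a Zip archive; a truncated download usually fails here with
    # BadZipFile, because the central directory sits at the end of the file.
    try:
        with zipfile.ZipFile(path) as zf:
            bad = zf.testzip()  # name of first corrupt member, or None
            print("zip ok" if bad is None else f"corrupt member: {bad}")
    except zipfile.BadZipFile as exc:
        print(f"not a readable zip: {exc}")

    # SHA1 of the local copy, to compare with a hash of the original file.
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            sha1.update(chunk)
    print("sha1:", sha1.hexdigest())
```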
Thanks for the response @scanny -

Yeah - shortly after posting this, I did notice this comment that mentions that docx == zip.

It would be weird that the corruption wouldn't happen in the local Docker image run, but does in the lambda image run - unless the corruption isn't in the trip itself, but in the saving of the file in the lambda file system? To test for this, though, I tried to use bytes instead but still got the same result:

```python
import io

with io.BytesIO() as file_obj:
    s3.download_fileobj(self.bucket, self.key, file_obj)
    file_obj.seek(0)
    return partition(file=file_obj, **self.unstructured_kwargs)
```
I'll be honest here - I'm not sure how I'd be able to do a hash check remotely from s3 before retrieval, to compare to the after? The below is the closest I can think of, but let me know if I'm missing the goal here:

```python
import hashlib
....

with tempfile.TemporaryDirectory() as temp_dir:
    file_path = f"{temp_dir}/{self.key}"
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    before_hash = self._calculate_hash(file_path)
    s3.download_file(self.bucket, self.key, file_path)
    after_hash = self._calculate_hash(file_path)
    if before_hash != after_hash:
        print("File has been modified during download")
    else:
        print("File has not been modified during download")
    return partition(filename=file_path, **self.unstructured_kwargs)

def _calculate_hash(self, file_path):
    with open(file_path, 'rb') as file:
        file_hash = hashlib.sha1()
        while True:
            data = file.read(4096)
            if not data:
                break
            file_hash.update(data)
        return file_hash.hexdigest()
```

Which - is honestly just ending with a ...
Hmm. Dunno. I would observe though that if you had a partial Zip archive, like you took one of 100k bytes and just truncated it at 50k bytes, this behavior would be plausible. The identification as a Zip archive is based on the first few bytes of the file, and the central directory is at the end of the file. It looks like it's getting as far as identifying the file as a Zip, and failing in the disambiguation code that reads the archive contents to figure out more specifically what flavor of Zip it is (DOCX, PPTX, XLSX, etc.). But I don't have any good ideas about what's happening here. I think you need to find some means of observing what's happening, like writing to logs or something.

Btw, the locally computed hash should be identical to the remotely computed one after the download, assuming you use the same hash type (SHA1 would be my first choice, which it looks like you're using). That's the way Git works, so no need to try to do the "before" in the lambda code.
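As a rough sketch of the "write it to the logs" idea, assuming boto3 is available in the lambda and the bucket/key are known (the function and variable names are illustrative, not from the issue): download the object into memory, then print its size and SHA1 so the log output can be compared with a hash computed locally on the same object.

```python
# Sketch of logging size + SHA1 right after the S3 download inside the lambda;
# bucket/key handling and names are assumptions, not code from the issue.
import hashlib
import io

import boto3

def fetch_and_fingerprint(bucket, key):
    s3 = boto3.client("s3")
    buf = io.BytesIO()
    s3.download_fileobj(bucket, key, buf)
    data = buf.getvalue()
    # print() output ends up in CloudWatch logs, so these two lines give you
    # something to compare against a locally computed SHA1.
    print(f"downloaded {len(data)} bytes from s3://{bucket}/{key}")
    print("sha1 after download:", hashlib.sha1(data).hexdigest())
    buf.seek(0)
    return buf
```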
Good catch - I guess I didn't consider this. I think you're on to something. The "absolute local" and the "image local" (which gets the file from s3 but runs on my machine) hashes do match. However, the lambda hash does not. So I guess there is some sort of corruption happening here. Though I'm thinking I might sidebar this lambda deployment method either way. To get past this error and work on the entire process I passed a ... Currently exploring other deployment options that would fit our use better.
If the SHAs don't match, the next thing to look at is the length. If the lambda version is longer, maybe there's a wrapper in there somewhere. If it's shorter, well then something is definitely going wrong up there :) Maybe the call doesn't wait for the download to finish or something, dunno much about AWS Lambda.
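A hypothetical way to do that length comparison, assuming the object gets downloaded to a file inside the lambda (names here are made up for the example): ask S3 for the object's ContentLength via head_object and compare it with the size of the local copy.

```python
# Hypothetical length check: compare what S3 says the object size is with what
# actually landed on disk in the lambda. Names are made up for the example.
import os

import boto3

def compare_sizes(bucket, key, local_path):
    s3 = boto3.client("s3")
    expected = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    actual = os.path.getsize(local_path)
    print(f"s3 reports {expected} bytes, local file is {actual} bytes")
    if actual < expected:
        print("shorter in the lambda: looks like truncation")
    elif actual > expected:
        print("longer in the lambda: something wrapped or re-encoded the payload")
```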
We ended up going a different route (having the app's front end handle the file processing through unstructured) since we couldn't find the cause here, but we ran into this issue elsewhere and finally found what was happening. Logging this here just in case anyone runs into something similar.

This was actually caused by Mangum + FastAPI corrupting the files (they were used to get the file from the user to the lambda, but weren't in the original example code, as I thought they weren't relevant), as well as by API Gateway + Lambda through AWS. We found this by stumbling on this comment in the FastAPI GitHub.
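For anyone hitting the same thing, a hedged illustration (not the reporter's code) of how this kind of in-transit corruption could be caught earlier at the FastAPI/Mangum boundary, assuming FastAPI with python-multipart installed: read the uploaded bytes and verify they still begin with the Zip local-file-header signature before they ever reach partition().

```python
# Illustrative guard, not the reporter's code: fail loudly at the FastAPI/Mangum
# boundary if the uploaded DOCX no longer looks like a Zip archive.
# Assumes FastAPI with python-multipart installed for form/file parsing.
from fastapi import FastAPI, HTTPException, UploadFile

app = FastAPI()

@app.post("/upload")
async def upload(file: UploadFile):
    data = await file.read()
    # A healthy DOCX (i.e. a Zip) starts with the local-file-header signature.
    if not data.startswith(b"PK\x03\x04"):
        raise HTTPException(status_code=400, detail="file appears corrupted in transit")
    return {"size": len(data)}
```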
My custom image works as expected when run locally against a `test.docx` from an s3 path. But when I upload the image to lambda, I get the error `BadZipFile: Bad magic number for central directory` on the `partition` function (`from unstructured.partition.auto import partition`) - even though the file isn't a `zip`, and is still the same `test.docx` from s3. Example code below: