I am currently in the process of upgrading from blosc v1 to blosc v2 and am running into some issues. I use blosc to compress many binary files (numpy arrays) and I am really impressed by the compression ratio :-)
I don't know what the difference between `blosc2.compress` and `blosc2.compress2` is (apart from the different API), so I tried both. The compressed file size is the same even though the files are not identical. However, `compress` is much faster than `compress2`. Unfortunately, `compress` hangs when used in a multiprocessing setting. Here is a reproducible example:
```python
import multiprocessing
import pickle
import tempfile
from pathlib import Path
from timeit import default_timer

import blosc
import blosc2
import numpy as np


# Just for time measurements, not relevant for the reproducible
class MeasureTime:
    def __init__(self, name: str = "", silent: bool = False):
        """Easily measure the time of a Python code block.

        >>> import time
        >>> with MeasureTime() as m:
        ...     time.sleep(1)
        Elapsed time: 0 m and 1.00 s
        >>> round(m.elapsed_seconds)
        1

        Args:
            name: Name which is included in the time info message.
            silent: Whether to print the time info message.
        """
        self.name = name
        self.silent = silent
        self.elapsed_seconds = 0

    def __enter__(self):
        self.start = default_timer()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        end = default_timer()
        seconds = end - self.start
        if self.name:
            tag = "[" + self.name + "] "
        else:
            tag = ""
        self.elapsed_seconds = seconds
        if not self.silent:
            print("%sElapsed time: %d m and %.2f s" % (tag, seconds // 60, seconds % 60))


def compress_file_v1(path: Path, array: np.ndarray) -> None:
    """Compresses the numpy array using blosc (https://github.com/Blosc/c-blosc).

    Args:
        path: The path where the compressed file should be stored.
        array: The array data to store.
    """
    # Based on https://stackoverflow.com/a/56761075
    array = np.ascontiguousarray(array)  # Does nothing if already contiguous (https://stackoverflow.com/a/51457275)
    # A bit ugly, but very fast (http://python-blosc.blosc.org/tutorial.html#compressing-from-a-data-pointer)
    compressed_data = blosc.compress_ptr(
        array.__array_interface__["data"][0],
        array.size,
        array.dtype.itemsize,
        clevel=9,
        cname="zstd",
        shuffle=blosc.SHUFFLE,
    )
    with open(path, "wb") as f:
        pickle.dump((array.shape, array.dtype), f)
        f.write(compressed_data)


def compress_file(path: Path, array: np.ndarray) -> None:
    """Compresses the numpy array using blosc2 (https://github.com/Blosc/c-blosc2).

    Args:
        path: The path where the compressed file should be stored.
        array: The array data to store.
    """
    # Based on https://stackoverflow.com/a/56761075
    array = np.ascontiguousarray(array)  # Does nothing if already contiguous (https://stackoverflow.com/a/51457275)
    compressed_data = blosc2.compress(
        array,
        typesize=array.dtype.itemsize,
        clevel=9,
        cname="zstd",
    )
    with open(path, "wb") as f:
        pickle.dump((array.shape, array.dtype), f)
        f.write(compressed_data)


def compress_file2(path: Path, array: np.ndarray) -> None:
    """Compresses the numpy array using blosc2 (https://github.com/Blosc/c-blosc2).

    Args:
        path: The path where the compressed file should be stored.
        array: The array data to store.
    """
    # Based on https://stackoverflow.com/a/56761075
    array = np.ascontiguousarray(array)  # Does nothing if already contiguous (https://stackoverflow.com/a/51457275)
    compressed_data = blosc2.compress2(
        array,
        typesize=array.dtype.itemsize,
        clevel=9,
        compcode=blosc2.Codec.ZSTD,
    )
    with open(path, "wb") as f:
        pickle.dump((array.shape, array.dtype), f)
        f.write(compressed_data)


def compress_multi_v1(i):
    np.random.seed(0)
    N = int(1e6)
    arr = np.random.randint(0, 10_000, N)
    compress_file_v1(tmp_dir / "compress.blosc2", arr)


def compress_multi(i):
    np.random.seed(0)
    N = int(1e6)
    arr = np.random.randint(0, 10_000, N)
    compress_file(tmp_dir / "compress.blosc2", arr)


def compress_multi2(i):
    np.random.seed(0)
    N = int(1e6)
    arr = np.random.randint(0, 10_000, N)
    compress_file2(tmp_dir / "compress.blosc2", arr)


if __name__ == "__main__":
    np.random.seed(0)
    N = int(1e6)
    arr = np.random.randint(0, 10_000, N)

    tmp_dir_handle = tempfile.TemporaryDirectory()
    tmp_dir = Path(tmp_dir_handle.name)

    with MeasureTime("compress_v1"):
        compress_file_v1(tmp_dir / "compress.blosc", arr)
    with MeasureTime("compress"):
        compress_file(tmp_dir / "compress.blosc2", arr)
    with MeasureTime("compress2"):
        compress_file2(tmp_dir / "compress2.blosc2", arr)

    for f in sorted(tmp_dir.iterdir()):
        print(f"{f.name}: {f.stat().st_size} Bytes")

    pool = multiprocessing.Pool()
    pool.map(compress_multi_v1, [0, 1])
    pool.close()
    pool.join()
    print("Finished with compress_multi_v1 (using compress_file_v1)")

    pool = multiprocessing.Pool()
    pool.map(compress_multi2, [0, 1])
    pool.close()
    pool.join()
    print("Finished with compress_multi2 (using compress_file2)")

    # This code block hangs
    pool = multiprocessing.Pool()
    pool.map(compress_multi, [0, 1])
    pool.close()
    pool.join()
    print("Finished with compress_multi (using compress_file)")

    tmp_dir_handle.cleanup()
```
Which produces the following output on my machine (Ubuntu 20.04):
```
[compress_v1] Elapsed time: 0 m and 0.09 s
[compress] Elapsed time: 0 m and 0.05 s
[compress2] Elapsed time: 0 m and 0.11 s
compress.blosc: 1685799 Bytes
compress.blosc2: 1676904 Bytes
compress2.blosc2: 1676904 Bytes
Finished with compress_multi_v1 (using compress_file_v1)
Finished with compress_multi2 (using compress_file2)
```
The line `Finished with compress_multi (using compress_file)` does not show up. Compression with blosc v1 works fine, and with blosc v2 `compress2` also works, but `compress` just does nothing when used with the processing pool.
I'll stick to blosc v1 for now, but it would of course be cool to upgrade to the new library :-) Do you have any idea what the problem here is?
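(Not part of the original report, just a hedged idea in case it helps with debugging.) If the hang turns out to be a fork-related deadlock, e.g. the worker processes are forked while a Blosc thread pool or lock is already initialized in the parent, then the "spawn" start method might sidestep it, since each worker then starts from a fresh interpreter. Below is a minimal, self-contained sketch of that idea; the worker function and file names are made up for illustration, and the `blosc2.compress` call is the same one used in the reproducer above.

```python
import multiprocessing
import tempfile
from pathlib import Path

import blosc2
import numpy as np


def compress_to(path_str: str) -> None:
    # Same blosc2.compress call that hangs under the default fork-based pool.
    arr = np.random.randint(0, 10_000, int(1e6))
    data = blosc2.compress(arr, typesize=arr.dtype.itemsize, clevel=9, cname="zstd")
    Path(path_str).write_bytes(data)


if __name__ == "__main__":
    # "spawn" creates fresh worker processes instead of forking the parent,
    # so the children do not inherit any already-initialized Blosc state.
    ctx = multiprocessing.get_context("spawn")
    with tempfile.TemporaryDirectory() as tmp:
        paths = [str(Path(tmp) / f"compress_{i}.blosc2") for i in range(2)]
        with ctx.Pool() as pool:
            pool.map(compress_to, paths)
    print("Finished (spawn-based pool)")
```

Whether this actually avoids the hang would still need to be verified, and it would not explain why `compress` and `compress2` behave differently.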
This is a good point. I think we are not doing the correct thing with `blosc2.compress`, but `blosc2.compress2` should be fine. I still need to think a bit more about this, but up front I don't see a reason why we should not replace `blosc2.compress` with `blosc2.compress2` (and deprecate the `blosc2.compress2` name).
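For illustration only, here is a rough sketch of what that consolidation could look like from the Python side. This is not the actual python-blosc2 code, and the real change would live inside the library; the existing `blosc2.compress2` (with the keyword arguments used in the reproducer) simply stands in for the single remaining implementation, and the default argument values are made up.

```python
import warnings

import blosc2


def compress(src, typesize=8, clevel=9, codec=blosc2.Codec.ZSTD):
    # One code path: delegate to what is currently exposed as blosc2.compress2.
    return blosc2.compress2(src, typesize=typesize, clevel=clevel, compcode=codec)


def compress2(src, **kwargs):
    # Old name kept only as a deprecated alias during a transition period.
    warnings.warn(
        "compress2 is deprecated, use compress instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return compress(src, **kwargs)
```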