Extract PDF pages to images memory use #117
I do not think there is anything wrong with your console app, and I do not think you will be able to reduce memory usage with this method of creating images.
I'm not sure Poppler is the bottleneck here. There's a trick you can use when loading images: if you know you will only be doing simple top-to-bottom operations on the image, such as arithmetic, filtering, or resizing, you can tell NetVips that you only need sequential access to pixels:

```csharp
using var image = Image.NewFromFile("example_028.pdf", access: Enums.Access.Sequential);
```

Now libvips will stream your image: it runs the load and the save in parallel and never keeps more than a few scanlines of pixels in memory at any one time. This can give a really nice improvement in speed and memory use. Also, I'd avoid `ThumbnailImage` (see the note in the code below).

Here's a complete example:

```csharp
// Maximum value for a coordinate.
// See: https://github.com/kleisauke/net-vips/issues/71#issuecomment-609756394
const int VipsMaxCoord = 10000000;

// Test image: https://tcpdf.org/files/examples/example_028.pdf
const string file = "example_028.pdf";
var filename = Path.GetFileNameWithoutExtension(file);

using var im = Image.Pdfload(file, page: 0, access: Enums.Access.Sequential);
// Or: using var im = Image.PdfloadBuffer(buffer, page: 0, access: Enums.Access.Sequential);

var nPages = im.Contains("n-pages") ? (int) im.Get("n-pages") : 1;
Console.WriteLine($"{file} has {nPages} pages");

for (var i = 0; i < nPages; i++)
{
    Console.WriteLine($"\t{file} rendering page {i} ...");
    using var page = Image.Pdfload(file, page: i, dpi: 150, access: Enums.Access.Sequential);
    // Or: using var page = Image.PdfloadBuffer(buffer, page: i, dpi: 150, access: Enums.Access.Sequential);
    page.Jpegsave($"{filename}_page_{i}.jpg", strip: true);

    Console.WriteLine($"\t{file} rendering page thumb {i} ...");
    // Avoid calling ThumbnailImage, see: https://github.com/kleisauke/net-vips/issues/64
    using var thumb = Image.Thumbnail($"{file}[page={i},dpi=150]", 250, height: VipsMaxCoord);
    // Or: using var thumb = Image.ThumbnailBuffer(buffer, optionString: $"[page={i},dpi=150]", width: 250, height: VipsMaxCoord);
    thumb.Jpegsave($"{filename}_thumb_{i}.jpg", strip: true);
}
```

This example takes ~250 milliseconds to complete and uses ~35 MB on this PC. Perhaps Google's PDFium (Apache-2.0 licence) could speed it up even more, but it's a bit of a hassle to set up; see libvips/build-win64-mxe#20.
@kleisauke Thanks a lot for chipping in! Do you find this approach scales linearly as the pages in a PDF increase? I have tested with a PDF of 94 pages, most of which consist of dense text with graphics, and changing it to […]

I tried with this PDF: https://research.ark-invest.com/hubfs/1_Download_Files_ARK-Invest/White_Papers/Big-Ideas-2019-ARKInvest.pdf

Anyway, it works much better than any other PDF library I've tried and is easy to get running. The closest alternative, in terms of time to convert a full PDF to images, is Mozilla's PDF.js library.
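One way to check whether rendering scales linearly with page count is to time each page individually. This is only a sketch, not code from the thread: it reuses the sequential-access loop from the example above, times it with `System.Diagnostics.Stopwatch`, and uses a placeholder file name. It requires the NetVips NuGet package and a native libvips.

```csharp
// Sketch: per-page timing to check whether rendering scales linearly.
// "input.pdf" is a placeholder; requires the NetVips NuGet package.
using System;
using System.Diagnostics;
using NetVips;

const string file = "input.pdf";

// Read the page count from the PDF metadata.
using var probe = Image.Pdfload(file, page: 0, access: Enums.Access.Sequential);
var nPages = probe.Contains("n-pages") ? (int) probe.Get("n-pages") : 1;

var sw = new Stopwatch();
for (var i = 0; i < nPages; i++)
{
    sw.Restart();
    using var page = Image.Pdfload(file, page: i, dpi: 150, access: Enums.Access.Sequential);
    page.Jpegsave($"page_{i}.jpg", strip: true);
    sw.Stop();
    // If per-page cost is constant, these times should stay roughly flat
    // rather than grow with the page index.
    Console.WriteLine($"page {i}: {sw.ElapsedMilliseconds} ms");
}
```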
I have a console app that can read a PDF file, and create JPG files for each page in the PDF. Both a large image and a thumbnail.
The PDF file is 1.2 MB and 145 pages.
When debugging in VS, memory usage runs to ~250MB. GC is running and reducing it gradually, but not as much as I would expect.
Running libvips 8.10.5 from the `PATH` environment variable, not the NuGet package.

(Screenshot: memory while debugging.)
It takes ~25 seconds to complete.
Is there a more memory-efficient way to achieve this than the code here:
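(The code snippet referenced above did not survive the export. The following is a hypothetical reconstruction of the kind of per-page loop being described, not the author's actual code: it renders each page and a thumbnail with NetVips using the default random access, which fully decodes each page into memory. File names are placeholders; the NetVips NuGet package and a native libvips are required.)

```csharp
// Hypothetical sketch only -- not the original snippet from this issue.
using System;
using NetVips;

const string file = "input.pdf"; // placeholder file name

// Read the page count from the PDF metadata.
using var probe = Image.Pdfload(file, page: 0);
var nPages = probe.Contains("n-pages") ? (int) probe.Get("n-pages") : 1;

for (var i = 0; i < nPages; i++)
{
    // Default (random) access: each page is fully decoded into memory.
    using var page = Image.Pdfload(file, page: i, dpi: 150);
    page.Jpegsave($"page_{i}.jpg");

    // ThumbnailImage on an already-loaded image keeps the full-size
    // page pinned in memory while the thumbnail is computed.
    using var thumb = page.ThumbnailImage(250);
    thumb.Jpegsave($"thumb_{i}.jpg");
}
```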