
Extract PDF pages to images memory use #117

Closed
martinnormark opened this issue Feb 22, 2021 · 3 comments
Labels
question Further information is requested

Comments

martinnormark commented Feb 22, 2021

I have a console app that can read a PDF file, and create JPG files for each page in the PDF. Both a large image and a thumbnail.

The PDF file is 1.2 MB and 145 pages.

When debugging in VS, memory usage climbs to ~250 MB. The GC runs and reduces it gradually, but not by as much as I would expect.
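(One way to check how much of that usage is managed memory, as opposed to native allocations the GC cannot reclaim, is to compare the managed heap with the full process working set. A minimal diagnostic sketch using only standard .NET APIs; the class name `MemoryProbe` is just for illustration:)

```csharp
using System;
using System.Diagnostics;

class MemoryProbe
{
    static void Main()
    {
        // Managed heap size after forcing a full collection —
        // everything the GC can account for and reclaim.
        long managed = GC.GetTotalMemory(forceFullCollection: true);

        // The working set also includes native allocations
        // (e.g. libvips/poppler buffers), which the GC never sees.
        long workingSet = Process.GetCurrentProcess().WorkingSet64;

        Console.WriteLine($"Managed heap: {managed / (1024.0 * 1024.0):F1} MB");
        Console.WriteLine($"Working set:  {workingSet / (1024.0 * 1024.0):F1} MB");
    }
}
```

If the working set sits far above the managed heap, the extra memory is being held on the native side, and forcing GC collections won't bring it down.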

Running libvips 8.10.5 from PATH env variable, and not the nuget package.

Memory usage while debugging (screenshot)

It takes ~25 seconds to complete.

Is there a more memory-efficient way to achieve this than the code here:

using NetVips;
using System;
using System.Diagnostics;
using System.IO;

namespace PdfToImage
{
	class Program
	{
		static void Main(string[] args)
		{
			Console.WriteLine("Hello World!");

			if (ModuleInitializer.VipsInitialized)
			{
				Console.WriteLine($"Inited libvips {NetVips.NetVips.Version(0)}.{NetVips.NetVips.Version(1)}.{NetVips.NetVips.Version(2)}");
			}
			else
			{
				Console.WriteLine(ModuleInitializer.Exception.Message);
			}

			Stopwatch sw = new Stopwatch();
			sw.Start();

			int pages = 0;
			var imageBytes = File.ReadAllBytes(@"C:\Workspace\test.pdf");

			using (var image = Image.NewFromBuffer(imageBytes))
			{
				pages = (int)image.Get("n-pages");
				Console.WriteLine($"Pages: {pages}");
			}

			for (int i = 0; i < pages; i++)
			{
				Console.WriteLine($"Converting page {i}");

				using (var page = Image.PdfloadBuffer(imageBytes, page: i, n: 1, scale: 2, dpi: 150))
				{
					page.WriteToFile($"C:\\Workspace\\converto\\test_{i}.jpg");

					using (var thumb = page.ThumbnailImage(150))
					{
						thumb.WriteToFile($"C:\\Workspace\\converto\\thumb_{i}.jpg");
					}
				}
			}

			sw.Stop();

			Console.WriteLine($"Time taken: {sw.Elapsed.TotalSeconds}");


			Console.WriteLine("Goodbye World!!");
			Console.ReadLine();
		}
	}
}
CanadianHusky commented Feb 22, 2021

I do not think there is anything wrong with your console app, and I do not think you will be able to reduce memory usage with this method of creating images.
See the comments in #11.
libvips uses the Poppler library internally to interpret and render the PDF. The (relatively) high memory usage you see therefore comes from Poppler, and nothing you do in libvips will make an impact.
If you want lower memory use, you need to look for another PDF renderer, in my opinion.

@kleisauke kleisauke added the question Further information is requested label Feb 23, 2021
kleisauke (Owner) commented

I'm not sure if Poppler is the bottleneck here. There's a trick you can use when loading images: if you know you will only be doing simple top-to-bottom operations on the image, like arithmetic, filtering, or resizing, you can tell NetVips that you only need sequential access to pixels:

using var image = Image.NewFromFile("example_028.pdf", access: Enums.Access.Sequential);

Now libvips will stream your image. It'll run the load and the save in parallel and never keep more than a few scanlines of pixels in memory at any one time. This can give a really nice improvement in speed and memory use.

Also, I'd avoid ThumbnailImage, if possible. It can't do any of the shrink-on-load tricks, so it's much, much slower.

Here's a complete example:

// Maximum value for a coordinate.
// See: https://github.com/kleisauke/net-vips/issues/71#issuecomment-609756394
const int VipsMaxCoord = 10000000;

// Test image: https://tcpdf.org/files/examples/example_028.pdf
const string file = "example_028.pdf";
var filename = Path.GetFileNameWithoutExtension(file);

using var im = Image.Pdfload(file, page: 0, access: Enums.Access.Sequential);
// Or: using var im = Image.PdfloadBuffer(buffer, page: 0, access: Enums.Access.Sequential);

var nPages = im.Contains("n-pages") ? (int) im.Get("n-pages") : 1;

Console.WriteLine($"{file} has {nPages} pages");

for (var i = 0; i < nPages; i++)
{
    Console.WriteLine($"\t{file} rendering page {i} ...");
    using var page = Image.Pdfload(file, page: i, dpi: 150, access: Enums.Access.Sequential);
    // Or: using var page = Image.PdfloadBuffer(buffer, page: i, dpi: 150, access: Enums.Access.Sequential);
    page.Jpegsave($"{filename}_page_{i}.jpg", strip: true);

    Console.WriteLine($"\t{file} rendering page thumb {i} ...");

    // Avoid calling ThumbnailImage, see: https://github.com/kleisauke/net-vips/issues/64
    using var thumb = Image.Thumbnail($"{file}[page={i},dpi=150]", 250, height: VipsMaxCoord);
    // Or: using var thumb = Image.ThumbnailBuffer(buffer, optionString: $"[page={i},dpi=150]", width: 250, height: VipsMaxCoord);
    thumb.Jpegsave($"{filename}_thumb_{i}.jpg", strip: true);
}

This example takes ~250 milliseconds to complete and uses ~35 MB of memory on this PC. Perhaps Google's PDFium (Apache-2.0 licence) could speed it up even more (but it's a bit of a hassle to set up, see libvips/build-win64-mxe#20).

martinnormark (Author) commented

@kleisauke Thanks a lot for chipping in!

Do you find that this approach scales linearly as the number of pages in a PDF increases?

I have tested with a 94-page PDF, most of which consists of dense text with graphics, and changing to access: Enums.Access.Sequential makes performance slightly worse (it takes a little more time, not much).

I tried with this PDF: https://research.ark-invest.com/hubfs/1_Download_Files_ARK-Invest/White_Papers/Big-Ideas-2019-ARKInvest.pdf

Anyway, it works much better than any other PDF library I've tried and is easy to get running. The closest alternative, in terms of time to convert a full PDF to images, is Mozilla's PDF.js library.
