
Extract PDF pages to images memory use #117

Closed
martinnormark opened this issue Feb 22, 2021 · 3 comments
Labels
question Further information is requested

Comments

martinnormark commented Feb 22, 2021

I have a console app that can read a PDF file, and create JPG files for each page in the PDF. Both a large image and a thumbnail.

The PDF file is 1.2 MB and 145 pages.

When debugging in VS, memory usage climbs to ~250 MB. The GC runs and reduces it gradually, but not by as much as I would expect.
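(One way to check how much of that usage is managed memory, as opposed to native allocations the GC cannot reclaim, is to compare the managed heap with the full process working set. A minimal diagnostic sketch using only standard .NET APIs; the class name `MemoryProbe` is just for illustration:)

```csharp
using System;
using System.Diagnostics;

class MemoryProbe
{
    static void Main()
    {
        // Managed heap size after forcing a full collection —
        // everything the GC can account for and reclaim.
        long managed = GC.GetTotalMemory(forceFullCollection: true);

        // The working set also includes native allocations
        // (e.g. libvips/poppler buffers), which the GC never sees.
        long workingSet = Process.GetCurrentProcess().WorkingSet64;

        Console.WriteLine($"Managed heap: {managed / (1024.0 * 1024.0):F1} MB");
        Console.WriteLine($"Working set:  {workingSet / (1024.0 * 1024.0):F1} MB");
    }
}
```

If the working set sits far above the managed heap, the extra memory is being held on the native side, and forcing GC collections won't bring it down.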

Running libvips 8.10.5 from PATH env variable, and not the nuget package.

Memory usage while debugging (screenshot)

It takes ~25 seconds to complete.

Is there a more memory-efficient way to achieve this than the code here:

using NetVips;
using System;
using System.Diagnostics;
using System.IO;

namespace PdfToImage
{
	class Program
	{
		static void Main(string[] args)
		{
			Console.WriteLine("Hello World!");

			if (ModuleInitializer.VipsInitialized)
			{
				Console.WriteLine($"Inited libvips {NetVips.NetVips.Version(0)}.{NetVips.NetVips.Version(1)}.{NetVips.NetVips.Version(2)}");
			}
			else
			{
				Console.WriteLine(ModuleInitializer.Exception.Message);
			}

			Stopwatch sw = new Stopwatch();
			sw.Start();

			int pages = 0;
			var imageBytes = File.ReadAllBytes(@"C:\Workspace\test.pdf");

			using (var image = Image.NewFromBuffer(imageBytes))
			{
				pages = (int)image.Get("n-pages");
				Console.WriteLine($"Pages: {pages}");
			}

			for (int i = 0; i < pages; i++)
			{
				Console.WriteLine($"Converting page {i}");

				using (var page = Image.PdfloadBuffer(imageBytes, page: i, n: 1, scale: 2, dpi: 150))
				{
					page.WriteToFile($"C:\\Workspace\\converto\\test_{i}.jpg");

					using (var thumb = page.ThumbnailImage(150))
					{
						thumb.WriteToFile($"C:\\Workspace\\converto\\thumb_{i}.jpg");
					}
				}
			}

			sw.Stop();

			Console.WriteLine($"Time taken: {sw.Elapsed.TotalSeconds}");


			Console.WriteLine("Goodbye World!!");
			Console.ReadLine();
		}
	}
}
CanadianHusky commented Feb 22, 2021

I do not think there is anything wrong with your console app, and I do not think you will be able to reduce memory usage with this method of creating images.
See the comments in #11.
libvips uses the Poppler library internally to interpret and render the PDF. The (relatively) high memory usage you see therefore comes from Poppler, and nothing you do in libvips will make an impact.
If you want lower memory use, you need to look for another PDF renderer, in my opinion.

@kleisauke kleisauke added the question Further information is requested label Feb 23, 2021
kleisauke (Owner) commented

I'm not sure if Poppler is the bottleneck here. There's a trick you can use when loading images: if you know you will only be doing simple top-to-bottom operations on the image, like arithmetic, filtering, or resizing, you can tell NetVips that you only need sequential access to pixels:

using var image = Image.NewFromFile("example_028.pdf", access: Enums.Access.Sequential);

Now libvips will stream your image. It'll run the load and the save in parallel and never keep more than a few scanlines of pixels in memory at any one time. This can give a really nice improvement in speed and memory use.

Also, I'd avoid ThumbnailImage, if possible. It can't do any of the shrink-on-load tricks, so it's much, much slower.

Here's a complete example:

// Maximum value for a coordinate.
// See: https://github.com/kleisauke/net-vips/issues/71#issuecomment-609756394
const int VipsMaxCoord = 10000000;

// Test image: https://tcpdf.org/files/examples/example_028.pdf
const string file = "example_028.pdf";
var filename = Path.GetFileNameWithoutExtension(file);

using var im = Image.Pdfload(file, page: 0, access: Enums.Access.Sequential);
// Or: using var im = Image.PdfloadBuffer(buffer, page: 0, access: Enums.Access.Sequential);

var nPages = im.Contains("n-pages") ? (int) im.Get("n-pages") : 1;

Console.WriteLine($"{file} has {nPages} pages");

for (var i = 0; i < nPages; i++)
{
    Console.WriteLine($"\t{file} rendering page {i} ...");
    using var page = Image.Pdfload(file, page: i, dpi: 150, access: Enums.Access.Sequential);
    // Or: using var page = Image.PdfloadBuffer(buffer, page: i, dpi: 150, access: Enums.Access.Sequential);
    page.Jpegsave($"{filename}_page_{i}.jpg", strip: true);

    Console.WriteLine($"\t{file} rendering page thumb {i} ...");

    // Avoid calling ThumbnailImage, see: https://github.com/kleisauke/net-vips/issues/64
    using var thumb = Image.Thumbnail($"{file}[page={i},dpi=150]", 250, height: VipsMaxCoord);
    // Or: using var thumb = Image.ThumbnailBuffer(buffer, optionString: $"[page={i},dpi=150]", width: 250, height: VipsMaxCoord);
    thumb.Jpegsave($"{filename}_thumb_{i}.jpg", strip: true);
}

This example takes ~250 milliseconds to complete and uses ~35 MB of memory on this PC. Perhaps Google's PDFium (Apache-2.0 licence) could speed it up even more (but it's a bit of a hassle to set up, see libvips/build-win64-mxe#20).

martinnormark (Author) commented

@kleisauke Thanks a lot for chipping in!

Do you find that this approach scales linearly as the number of pages in a PDF increases?

I have tested with a 94-page PDF, most of which consists of dense text with graphics, and changing to access: Enums.Access.Sequential makes performance slightly worse (it takes a little more time, not much).

I tried with this PDF: https://research.ark-invest.com/hubfs/1_Download_Files_ARK-Invest/White_Papers/Big-Ideas-2019-ARKInvest.pdf

Anyway, it works much better than any other PDF library I've tried and is easy to get running. The closest alternative, in terms of time to convert a full PDF to images, is Mozilla's PDF.js library.
