Improve SmartPdfCopy compression and performance #132

Merged 3 commits into VahidN:master on Nov 27, 2023

Conversation

asidorowicz

  • Re-introduced detection of duplicate dictionaries
  • Don't visit previously processed references while attempting to detect duplicate streams/dictionaries

While correcting a bug that caused an infinite loop in SmartPdfCopy when handling certain documents (#124), we lost the ability to remove duplicate dictionaries.

This causes the document to become unnecessarily large when appending multiple documents that are based on the same template (e.g., thousands of documents prepared for a print job).

These changes restore that capability and avoid re-visiting nodes that have already been processed when detecting identical content, improving performance when handling many streams/dictionaries (especially recursive ones).
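
A minimal sketch of the skip-already-processed idea described above, in C#. The class and member names here (DuplicateTracker, MarkProcessed, FindOrAdd) are illustrative assumptions, not the names used in PdfSmartCopy.cs; the point is only that a visited set keyed by object number and generation lets the copier skip content matching for references it has already handled:

```csharp
using System.Collections.Generic;

// Illustrative only: the general shape of duplicate detection that skips
// references which have already been processed, so their content is never
// serialized, hashed, or compared a second time.
internal sealed class DuplicateTracker
{
    // Object number + generation of every reference already handled.
    private readonly HashSet<(int Num, int Gen)> _processed = new();

    // Content hash -> key of the first object that produced that content.
    private readonly Dictionary<string, (int Num, int Gen)> _byHash = new();

    // Returns false when the reference was seen before and can be skipped outright.
    public bool MarkProcessed(int num, int gen) => _processed.Add((num, gen));

    // Returns the key of an earlier object with identical content, if any,
    // and records this object's hash otherwise.
    public (int Num, int Gen)? FindOrAdd(string contentHash, int num, int gen)
    {
        if (_byHash.TryGetValue(contentHash, out var existing))
        {
            return existing;
        }

        _byHash[contentHash] = (num, gen);
        return null;
    }
}
```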

BuildTools added 3 commits November 25, 2023 16:10
- Don't attempt stream/dictionary content matching if reference already processed
- Match seen references by RefKey and not by PdfObject
- RefKey should compare Num before Gen for early exits
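
The last commit message notes that RefKey should compare Num before Gen for early exits: two references almost always differ in object number, so checking Num first lets most equality checks return after a single comparison. A minimal sketch of that ordering, with hypothetical names rather than the library's actual RefKey code:

```csharp
using System;

// Illustrative sketch: comparing the object number (Num) first means a
// mismatch exits immediately, since generation numbers (Gen) are almost
// always 0 and rarely distinguish two references.
internal readonly struct RefKeySketch : IEquatable<RefKeySketch>
{
    public readonly int Num;
    public readonly int Gen;

    public RefKeySketch(int num, int gen)
    {
        Num = num;
        Gen = gen;
    }

    public bool Equals(RefKeySketch other) => Num == other.Num && Gen == other.Gen;

    public override bool Equals(object obj) => obj is RefKeySketch other && Equals(other);

    public override int GetHashCode() => (Num * 397) ^ Gen;
}
```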

what-the-diff bot commented Nov 27, 2023

PR Summary

  • Enhancements to the PdfSmartCopyTests.cs test class
    The tests for this feature have been extended with new and renamed test methods. A new test, Verify_Remove_Duplicate_Dictionaries_Works, ensures duplicate dictionaries are removed from the output PDF. The existing Verify_Remove_Duplicate_Objects_Works has been renamed to Verify_Remove_Duplicate_Streams_Works, better aligning its name with what it actually verifies. The CompressMultiplePdfFilesRemoveDuplicateObjects() method compresses multiple PDF files while avoiding duplicated objects.

  • Improvements in PdfCopy.cs and PdfSmartCopy.cs
    In PdfCopy.cs, a minor change was made to the Equals() method to improve how objects are compared. In PdfSmartCopy.cs, several changes improve duplicate removal and tidy up the code, and some unnecessary stream-handling code in the CopyIndirect() method was removed.

  • Introduction of ByteStore class in PdfSmartCopy.cs
    A new ByteStore class helps with serializing indirect references. It keeps a private List<RefKey>, _references, to track references that have already been seen, which streamlines reference handling and contributes to the overall reduction of duplicates.

  • Modification of serObject() in ByteStore class
    The serObject() method in the ByteStore class was modified to handle serialization of indirect references and to compute the MD5 hash more cleanly: the MD5BouncyCastle object is now wrapped in a using statement for better resource management.
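
A rough sketch of the serialize-and-hash pattern the last two bullets describe. It uses .NET's built-in System.Security.Cryptography.MD5 in place of the MD5BouncyCastle wrapper mentioned above, and the type, method, and parameter names are placeholders, not the actual PdfSmartCopy.cs code:

```csharp
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Illustrative only: serialize an object's content to bytes, remember which
// indirect references were visited, and hash the result with an MD5 instance
// disposed via a using statement.
internal sealed class ByteStoreSketch
{
    private readonly List<(int Num, int Gen)> _references = new();

    public byte[] HashContent(string serializedContent, IEnumerable<(int Num, int Gen)> referencedKeys)
    {
        // Track the indirect references encountered during serialization so
        // they are not followed again later.
        foreach (var key in referencedKeys)
        {
            if (!_references.Contains(key))
            {
                _references.Add(key);
            }
        }

        // 'using' guarantees the hash object is disposed even if hashing throws.
        using (var md5 = MD5.Create())
        {
            return md5.ComputeHash(Encoding.UTF8.GetBytes(serializedContent));
        }
    }
}
```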

VahidN merged commit f7087e3 into VahidN:master on Nov 27, 2023
3 checks passed
@@ -103,10 +100,12 @@
internal class ByteStore
{
private readonly byte[] _b;
private List<RefKey> _references;

Check notice — Code scanning / CodeQL
Missed 'readonly' opportunity (Note): Field '_references' can be 'readonly'.