Improve SmartPdfCopy compression and performance #132

Merged 3 commits into VahidN:master on Nov 27, 2023

Conversation

asidorowicz

  • Re-introduced detection of duplicate dictionaries
  • Don't visit previously processed references while attempting to detect duplicate streams/dictionaries

While correcting a bug that caused an infinite loop in SmartPdfCopy when handling certain documents (#124), we lost the ability to remove duplicate dictionaries.

This causes the document to become unnecessarily large when appending multiple documents that are based on the same template (e.g., thousands of documents prepared for a print job).

These changes restore that capability and avoid re-visiting nodes that have already been processed when detecting identical content, improving performance when handling many streams/dictionaries (especially recursive ones).
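
A minimal sketch of the skip-already-processed idea described above, in C#. The class and member names here (DuplicateTracker, MarkProcessed, FindOrAdd) are illustrative assumptions, not the names used in PdfSmartCopy.cs; the point is only that a visited set keyed by object number and generation lets the copier skip content matching for references it has already handled:

```csharp
using System.Collections.Generic;

// Illustrative only: the general shape of duplicate detection that skips
// references which have already been processed, so their content is never
// serialized, hashed, or compared a second time.
internal sealed class DuplicateTracker
{
    // Object number + generation of every reference already handled.
    private readonly HashSet<(int Num, int Gen)> _processed = new();

    // Content hash -> key of the first object that produced that content.
    private readonly Dictionary<string, (int Num, int Gen)> _byHash = new();

    // Returns false when the reference was seen before and can be skipped outright.
    public bool MarkProcessed(int num, int gen) => _processed.Add((num, gen));

    // Returns the key of an earlier object with identical content, if any,
    // and records this object's hash otherwise.
    public (int Num, int Gen)? FindOrAdd(string contentHash, int num, int gen)
    {
        if (_byHash.TryGetValue(contentHash, out var existing))
        {
            return existing;
        }

        _byHash[contentHash] = (num, gen);
        return null;
    }
}
```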

BuildTools added 3 commits November 25, 2023 16:10
- Don't attempt stream/dictionary content matching if reference already processed
- Match seen references by RefKey and not by PdfObject
- RefKey should compare Num before Gen for early exits
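
The last commit message notes that RefKey should compare Num before Gen for early exits: two references almost always differ in object number, so checking Num first lets most equality checks return after a single comparison. A minimal sketch of that ordering, with hypothetical names rather than the library's actual RefKey code:

```csharp
using System;

// Illustrative sketch: comparing the object number (Num) first means a
// mismatch exits immediately, since generation numbers (Gen) are almost
// always 0 and rarely distinguish two references.
internal readonly struct RefKeySketch : IEquatable<RefKeySketch>
{
    public readonly int Num;
    public readonly int Gen;

    public RefKeySketch(int num, int gen)
    {
        Num = num;
        Gen = gen;
    }

    public bool Equals(RefKeySketch other) => Num == other.Num && Gen == other.Gen;

    public override bool Equals(object obj) => obj is RefKeySketch other && Equals(other);

    public override int GetHashCode() => (Num * 397) ^ Gen;
}
```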

what-the-diff bot commented Nov 27, 2023

PR Summary

  • Enhancements to the PdfSmartCopyTests.cs test class
    The tests for this feature have been extended with new and renamed test methods. A new test, Verify_Remove_Duplicate_Dictionaries_Works, ensures duplicate dictionaries are removed from the output PDF. The existing Verify_Remove_Duplicate_Objects_Works has been renamed to Verify_Remove_Duplicate_Streams_Works, better aligning its name with what it actually verifies. The CompressMultiplePdfFilesRemoveDuplicateObjects() method compresses multiple PDF files while avoiding duplicated objects.

  • Improvements in PdfCopy.cs and PdfSmartCopy.cs
    In PdfCopy.cs, a minor change was made to the Equals() method to improve how objects are compared. In PdfSmartCopy.cs, several changes improve duplicate removal and tidy up the code, and some unnecessary stream-handling code in the CopyIndirect() method was removed.

  • Introduction of ByteStore class in PdfSmartCopy.cs
    A new ByteStore class helps with serializing indirect references. It keeps a private List<RefKey>, _references, to track references that have already been seen, which streamlines reference handling and contributes to the overall reduction of duplicates.

  • Modification of serObject() in ByteStore class
    The serObject() method in the ByteStore class was modified to handle serialization of indirect references and to compute the MD5 hash more cleanly: the MD5BouncyCastle object is now wrapped in a using statement for better resource management.
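
A rough sketch of the serialize-and-hash pattern the last two bullets describe. It uses .NET's built-in System.Security.Cryptography.MD5 in place of the MD5BouncyCastle wrapper mentioned above, and the type, method, and parameter names are placeholders, not the actual PdfSmartCopy.cs code:

```csharp
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Illustrative only: serialize an object's content to bytes, remember which
// indirect references were visited, and hash the result with an MD5 instance
// disposed via a using statement.
internal sealed class ByteStoreSketch
{
    private readonly List<(int Num, int Gen)> _references = new();

    public byte[] HashContent(string serializedContent, IEnumerable<(int Num, int Gen)> referencedKeys)
    {
        // Track the indirect references encountered during serialization so
        // they are not followed again later.
        foreach (var key in referencedKeys)
        {
            if (!_references.Contains(key))
            {
                _references.Add(key);
            }
        }

        // 'using' guarantees the hash object is disposed even if hashing throws.
        using (var md5 = MD5.Create())
        {
            return md5.ComputeHash(Encoding.UTF8.GetBytes(serializedContent));
        }
    }
}
```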

VahidN merged commit f7087e3 into VahidN:master on Nov 27, 2023
3 checks passed
@@ -103,10 +100,12 @@
internal class ByteStore
{
private readonly byte[] _b;
private List<RefKey> _references;

Check notice — Code scanning / CodeQL
Missed 'readonly' opportunity (Note): Field '_references' can be 'readonly'.