Introducing PDFUtil – Compare two PDF files textually or Visually

In my project, I need to compare a large number of PDF files. I could not find a good free library that works out of the box for this. A plain text comparison was not enough: I wanted something that can compare PDFs pixel by pixel to find all the differences. The libraries that can do this are not free.

So I came up with a simple Java library (built on Apache PDFBox, which is licensed under the Apache License, Version 2.0) that can compare PDF documents in text or image mode and highlight the differences, extract images from PDF documents, save PDF pages as images, and so on.

 

Maven Dependency:

Include the below dependency in your POM file.

Download:

PDF compare utility with all the dependencies.


	taguru-pdf-utility-v1.1.zip

Github:

The source code for this project is here.

Usage:

  • To get page count
import com.testautomationguru.utility.PDFUtil;

PDFUtil pdfUtil = new PDFUtil();
pdfUtil.getPageCount("c:/sample.pdf"); //returns the page count

  • To get page content as plain text
//returns the pdf content - all pages
pdfUtil.getText("c:/sample.pdf"); 

// returns the pdf content from page number 2
pdfUtil.getText("c:/sample.pdf",2); 

// returns the pdf content from page number 5 to 8
pdfUtil.getText("c:/sample.pdf", 5, 8);

  • To extract attached images from PDF
//set the path where we need to store the images
pdfUtil.setImageDestinationPath("c:/imgpath");
pdfUtil.extractImages("c:/sample.pdf");

// extracts and saves the images from page number 3
pdfUtil.extractImages("c:/sample.pdf", 3);

// extracts and saves the images from page 2 only (start page 2, end page 2)
pdfUtil.extractImages("c:/sample.pdf", 2, 2);

  • To store PDF pages as images
//set the path where we need to store the images
 pdfUtil.setImageDestinationPath("c:/imgpath");
 pdfUtil.savePdfAsImage("c:/sample.pdf");

  • To compare PDF files in text mode (faster – But it does not compare the format, images etc in the PDF)
String file1="c:/files/doc1.pdf";
String file2="c:/files/doc2.pdf";

// compares the pdf documents and returns a boolean
// true if both files have same content. false otherwise.
pdfUtil.compare(file1, file2);

// compare the 3rd page alone
pdfUtil.compare(file1, file2, 3, 3);

// compare the pages from 1 to 5
pdfUtil.compare(file1, file2, 1, 5);

  • To exclude certain text while comparing PDF files in text mode
String file1="c:/files/doc1.pdf";
String file2="c:/files/doc2.pdf";

//pass all the possible texts to be removed before comparing
pdfUtil.excludeText("1998", "testautomation");

//pass regex patterns to be removed before comparing
// \\d+ removes all the numbers in the pdf before comparing
pdfUtil.excludeText("\\d+");

// compares the pdf documents and returns a boolean
// true if both files have same content. false otherwise.
pdfUtil.compare(file1, file2);

// compare the 3rd page alone
pdfUtil.compare(file1, file2, 3, 3);

// compare the pages from 1 to 5
pdfUtil.compare(file1, file2, 1, 5);

  • To compare PDF files in Visual mode (slower – compares PDF documents pixel by pixel – highlights pdf difference & store the result as image)
String file1="c:/files/doc1.pdf";
String file2="c:/files/doc2.pdf";

// switch to visual mode; the default is CompareMode.TEXT_MODE
pdfUtil.setCompareMode(CompareMode.VISUAL_MODE);

// compares the pdf documents and returns a boolean
// true if both files have same content. false otherwise.
pdfUtil.compare(file1, file2);

// compare the 3rd page alone
pdfUtil.compare(file1, file2, 3, 3);

// compare the pages from 1 to 5
pdfUtil.compare(file1, file2, 1, 5);

//if you need to store the result
pdfUtil.highlightPdfDifference(true);
pdfUtil.setImageDestinationPath("c:/imgpath");
pdfUtil.compare(file1, file2);

For example, I have 2 PDF documents which have exact same content except the below differences in the charts.

 

pdfu001                                      pdfu002

 

 

PDFUtil produces the result shown below. Differences are highlighted in magenta by default; the color can be changed.

pdfu003
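
To make the idea of a pixel-by-pixel comparison concrete, here is a minimal sketch. It is not PDFUtil's actual source: it assumes both pages have already been rendered to BufferedImage objects of the same size, and the class and method names (PixelDiffSketch, diff) are made up for illustration.

```java
import java.awt.Color;
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of PDFUtil.
class PixelDiffSketch {

    /**
     * Compares two same-sized images pixel by pixel. Matching pixels are
     * copied from the first image, differing pixels are painted in the
     * highlight color. Returns null when the images are identical.
     */
    static BufferedImage diff(BufferedImage a, BufferedImage b, Color highlight) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            throw new IllegalArgumentException("Images must have the same dimensions");
        }
        BufferedImage result =
                new BufferedImage(a.getWidth(), a.getHeight(), BufferedImage.TYPE_INT_RGB);
        boolean different = false;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                int rgbA = a.getRGB(x, y);
                if (rgbA == b.getRGB(x, y)) {
                    result.setRGB(x, y, rgbA);
                } else {
                    result.setRGB(x, y, highlight.getRGB());
                    different = true;
                }
            }
        }
        return different ? result : null;
    }

    public static void main(String[] args) {
        // Two tiny stand-in "pages" that differ in a single pixel.
        BufferedImage page1 = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        BufferedImage page2 = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        page2.setRGB(2, 1, Color.WHITE.getRGB());
        BufferedImage out = diff(page1, page2, Color.MAGENTA);
        System.out.println(out == null ? "identical" : "different");
    }
}
```

Because every single pixel counts, even a one-pixel shift in rendering shows up as a difference, which is why this mode is both slower and more sensitive than text mode.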

 

Features to be added soon:

  • While comparing PDFs in VISUAL_MODE, ignore certain area.
  • While comparing PDFs in VISUAL_MODE, return true / false based on certain threshold / sensitivity.
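
Until these features exist in the library, the two ideas can be sketched on plain images. The code below is only an illustration of one possible approach, not a preview of the PDFUtil API: the names TolerantCompareSketch, mismatchRatio and matches are hypothetical, the ignored area is assumed to be a single rectangle, and the threshold is assumed to be the allowed fraction of differing pixels.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of PDFUtil.
class TolerantCompareSketch {

    /** Fraction (0.0 to 1.0) of pixels that differ outside the ignored rectangle. */
    static double mismatchRatio(BufferedImage a, BufferedImage b, Rectangle ignore) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            return 1.0; // treat a size mismatch as completely different
        }
        long compared = 0;
        long mismatched = 0;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                if (ignore != null && ignore.contains(x, y)) {
                    continue; // skip pixels inside the ignored area
                }
                compared++;
                if (a.getRGB(x, y) != b.getRGB(x, y)) {
                    mismatched++;
                }
            }
        }
        return compared == 0 ? 0.0 : (double) mismatched / compared;
    }

    /** True when the share of differing pixels does not exceed the threshold. */
    static boolean matches(BufferedImage a, BufferedImage b, Rectangle ignore, double threshold) {
        return mismatchRatio(a, b, ignore) <= threshold;
    }

    public static void main(String[] args) {
        BufferedImage a = new BufferedImage(10, 10, BufferedImage.TYPE_INT_RGB);
        BufferedImage b = new BufferedImage(10, 10, BufferedImage.TYPE_INT_RGB);
        b.setRGB(0, 0, 0xFFFFFF); // one differing pixel out of 100
        System.out.println("strict match: " + matches(a, b, null, 0.0));
        System.out.println("tolerant match: " + matches(a, b, null, 0.02));
        System.out.println("masked match: " + matches(a, b, new Rectangle(0, 0, 1, 1), 0.0));
    }
}
```

A ratio-based threshold is only one choice; a per-pixel color tolerance would be another reasonable way to define sensitivity.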

 


Categories: Articles, Framework, Utility

134 comments

    • vIns

      It is not on github – I do not have any issues in sharing with others. Please give me some time. I will share with you ASAP.

      • David

        Hi, there’s been many requests for the source code to be shared. This is another +1

        Hope you can get to it sometime. While on the subject, I think it would be nice if you shared/released the source code of future tools/utilities that you offer the binary for download (if/where you have no reservations or restrictions for sharing the source). It’s a lot easier to do when you make that an intent from the beginning.

        And the lamest but still good approach would be to just tar/zip up the source code (with ideally OSS license) and offer that for download in addition to the binary, if you don’t want to deal with git/source control.

  • Sachin

    Thanks for such a wonderful explanation.

    Can you please share the library with me by email?

    Thanks

    Sachin A
    India

  • abhishek

    Can you push it on git. it has quite a lot of potential and i would like to contribute to your code. Thanks – Abhishek

  • Shanu

    Can you mail me the documentation of the code for easy understanding.
    I also have the similar project. And I find your work as brilliant. It will be very useful if u share how the compare works and how result is shown? Thanks in advance

  • Shanu

    I am using eclipse to run your code. The compare block throws error as

    “Nov 02, 2015 6:01:39 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
    INFO: unsupported/disabled operation: BDC
    Nov 02, 2015 6:01:41 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
    INFO: unsupported/disabled operation: EMC”

    Where and how should I run this code to get the highlighted differences?
    Also, Is it possible for me to have this pdf comparer as a web service?

  • sai rajeswari

    I am currently working on a PDF comparer project. We are working on highlighting the differences between two PDFs. The above code helped us compare PDFs, but the output is highlighted and overlapping. Can you please send us your source code? That would help us make changes to the above result.

      • vIns

        Yes, that is expected, as it compares pixel by pixel. Even a very small change gets highlighted, so the result can look overlapped. If you expect a text mismatch, please do a text compare.

  • Jan

    Very nice util !

    …one question…..

    sometimes it’s nice to have a method which enables you to exclude some part of the PDF file … by making use of page area’s which one can select or deselect….

    If you let me access the source I can make some extensions for all of us……

    anyway… nice job !

    • vIns

      You can get the content of the PDF as text. Then you can apply the logic yourself to find the mismatch. That should be very easy to implement.
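
A minimal sketch of that suggestion, assuming the two strings come from pdfUtil.getText(...) as shown in the post. The class and method names (TextMismatchSketch, mismatches) are hypothetical, and the comparison is a naive positional one: line N of the first text is compared with line N of the second, so an inserted line shifts everything after it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFUtil.
class TextMismatchSketch {

    /** Returns one message per line number at which the two texts differ. */
    static List<String> mismatches(String expected, String actual) {
        String[] e = expected.split("\\R"); // \R matches any line break
        String[] a = actual.split("\\R");
        List<String> out = new ArrayList<>();
        int max = Math.max(e.length, a.length);
        for (int i = 0; i < max; i++) {
            String lineE = i < e.length ? e[i] : "<missing>";
            String lineA = i < a.length ? a[i] : "<missing>";
            if (!lineE.equals(lineA)) {
                out.add("line " + (i + 1) + ": expected [" + lineE + "] but was [" + lineA + "]");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // In real use the two strings would be the text extracted from each PDF.
        String text1 = "Invoice\nTotal: 100\nThank you";
        String text2 = "Invoice\nTotal: 120\nThank you";
        for (String message : mismatches(text1, text2)) {
            System.out.println(message);
        }
    }
}
```

Lines you want to exclude can be filtered out of both texts before calling the method.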

  • srikanth yadugani

    Hi

    we are using this, first we should thank for such a great work you provided to us. Thank you very much!

    My two PDF documents have 16 differences, but comparePdfFilesBinaryMode(file1, file2); method is showing only 13 differences(screenshots). How should we overcome this problem?

    Any Suggestions? I am looking for Optical character recognising (ocr)jar files to overcome this.

  • sz

    Hi,
    I am trying to compare two files and get following error:
    Feb 05, 2016 12:59:10 PM org.apache.pdfbox.util.operator.pagedrawer.Invoke process
    WARNING: getRGBImage returned NULL
    Feb 05, 2016 12:59:10 PM org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap getRGBImage
    SEVERE: java.lang.NegativeArraySizeException
    java.lang.NegativeArraySizeException

    Looks like it is problem with PDFBOX.jar
    What I can see you are using ver 1.8.9
    but there is version 2.0 RC
    Can you provide your tool with updated PDFBOX to check if this will fix my problem?
    Thanks in advance

    • vIns

      Yes, That is right.
      pdfbox.jar is a separate jar in the PDFUtil. You can just replace the pdfbox.jar with the latest one. Thanks for pointing it out.

  • sz

    Hi,
    thanks for info,
    I updated pdfbox to ver 2.0.0-RC3
    I get following error:

    Exception in thread "main" java.lang.NoSuchMethodError: org.apache.pdfbox.pdmodel.PDDocument.load(Ljava/lang/String;)Lorg/apache/pdfbox/pdmodel/PDDocument;
    at com.taguru.utility.PDFUtil.getPageCount(PDFUtil.java:160)

    I am using IntelliJ idea 15 community edition.
    Do you know how to fix this?

    • vIns

      Somehow this comment went to spam. Not sure why.
      Anyway for your question – Can you please see if you can use pdfbox-app-1.8.11.jar.?

  • Peera

    Hi, great work. I am very much interested. As per our project needs, we need to skip some of the sections in the PDF when comparing. It would be helpful if you shared the open jar file with us. Thank you

  • esjjse

    Interested in your api. Looking forward for a PDF comparison requirement., Would be grateful if you can share this source/jar file to try

    SJS

  • Simone

    Hi Vls,

    Congratulations to have created such a nice tool.

    Will you share the code? And if you are not supposed to share the code can you at least tell us you intention?

    It looks like you are not answering to every request about sharing the code so it is unclear whether you will actually do it.

    Thanks
    Simone

  • dirk

    Wow, nice work ! This really saves me a lot of work, manually comparing hundreds of pdfs
    Currently it does not seem to run under Java 8 . :-/ Is there an upgrade planned ?

  • Santhosh

    I have used the functions and plugin, but we are not able to save the image as said in the last section. i.e, comparing two pdf files and highlighting the differences and writing it to in an image file. Could you please help in the regard. Piece of code is something like below.

    pdfutil.highlightPdfDifference(true);
    pdfutil.setImageDestinationPath(Path+"//results//");
    //pdfutil.savePdfAsImage(Path);
    System.out.println(pdfutil.comparePdfFilesTextMode(Doc_BaseLine, Doc_Actual));
    // pdfutil.comparePdfFilesBinaryMode(Doc_BaseLine, Doc_Actual);
    System.out.println(pdfutil.comparePdfFilesBinaryMode(Doc_BaseLine, Doc_Actual));
    //pdfutil.extractImages(Doc_Actual) ;
    pdfutil.savePdfAsImage(Doc_Actual);

  • Nikhil

    Hi,
    First of all, Thank you. It helped me a lot. But as per my project, i need to skip some of the sections in the PDF from comparing. it would be helpful if you share open jar file with us so that i can make changes as per need.
    Thank you

  • raj

    Hi, rather saving the compared image to specific path I want to download the compared PDF output image file , is it possible ?? if yes plz suggest me solution for it .. Thanks

  • Neil B

    Hi,

    I’ve posted multiple comments here but none are actually showing up. I would really like to use this could you please help me?

    1. After downloading the ZIP file (which contains 2 JARs), what are the exact steps to compare 2 PDFs?

    2. Could you send me a link to the source code as well?

    Thanks!

  • bhaskar

    Wonderful information and Amazing explanation !!! 🙂

    I have downloaded the Zip file and when I am trying to extract the Zip unfortunately it is showing an error as “Cannot Open file: it does not appear to be a valid archive”

    could you please resend that valid Zip file to my email id

    Thanks in Advance !!!

  • Larry David

    Thank you so much for sharing this tool! Would you be able to please share the source code? I need to modify it to ignore certain parts of the file and remove special unicode certain characters from the PDF file before comparing as it’s throwing off the comparison.

    Please let me know when you share it. Thanks.

  • Abhinandan

    Hi

    The library is promising. Can you share the source code with us? And can you point us to some documentation? Say I want to change the colour of comparison from Magenta to Green, how do I do that?

  • George

    Hi VLNS,

    I’ve messaged multiple times asking if the source code is available for this. Kindly let me know if it’s not so I can start my own implementation 🙂 Just don’t want to waste time implementing something from scratch if I can just build on yours so please let me know soon.

    Thanks.

  • JHerrmann

    Nice Tool!
    Is there a way to declare wildcards in binaryMode? I generate daily pdf reports with the current date on it. The textMode is not accurate enough for my pdf files.

    It would be nice if you can declare wildcards (a region on the file may).

  • Carlos

    Neat utility with great potential.

    I think it would be really good if ignore rules could be added based on some RegEx

  • Carlos

    Neat utility with great potential.

    I think it would be really good if there was a way to ignore certain text by adding some Regex rules

  • Automation

    It’s great work indeed. If you don’t mind, can you share the source code so that we can contribute and utilize it for all the possible requirements?

  • Raghu

    Hi, I tried to compare pixel by pixel for 2 PDFs using below code. But am not getting the image, which highlights the difference between the 2 PDFs

    //if you need to store the result
    pdfUtil.highlightPdfDifference(true);
    pdfUtil.setImageDestinationPath("c:/imgpath");
    pdfUtil.comparePdfFilesBinaryMode(file1, file2);

  • sundar

    I would like to call the JAR file in VB Script or from UFT 12.53.

    Can you please guide me. Sample VB script i have attached . But unable to pass the command ( getPageCount) & arguments (“c:/sample.pdf”)

    Set WshShell = CreateObject("WScript.Shell")
    dim a
    a = "C:\PDF Compare\taguru-pdf-utility-v1.0\taguru-pdf-utility-v1.0\pdfbox-app-1.8.9.jar"
    WshShell.Run "java -jar " & chr(34) & a & chr(34)

  • Jeevan

    Hi,

    Function convertToImageAndCompare(String file1, String file2, int startPage, int endPage) having issues, not returning anything and also unable to generate Results to a folder with the following :

    String file1 = "resources/July 16th.pdf";
    String file2 = "resources/July 17th.pdf";
    util.highlightPdfDifference(true);
    util.setImageDestinationPath("/Users/test/Errors");
    util.comparePdfFilesBinaryMode(file1, file2);

    Do we need to give any file name? tried to debug the code but it only returns only true or false, the code under the convertToImageAndCompare is commented and the function comparePdfFilesBinaryMode is calling convertToImageAndCompare which is not returning anything and getting the error.

    Thanks,
    Jeevan

  • Venkatesh Prasad

    Trying with pdfbox-app-2.0.2 and get the following error:

    Exception in thread "main" java.lang.NoSuchMethodError: org.apache.pdfbox.pdmodel.PDDocument.load(Ljava/lang/String;)Lorg/apache/pdfbox/pdmodel/PDDocument;
    at com.taguru.utility.PDFUtil.getPageCount(PDFUtil.java:160)
    at com.taguru.utility.PDFUtil.comparePdfByImage(PDFUtil.java:459)
    at com.taguru.utility.PDFUtil.comparePdfFilesBinaryMode(PDFUtil.java:402)
    at ERS.UnitTest.Reports.ComparePDFDocuments.Compare2Documents(ComparePDFDocuments.java:27)
    at ERS.UnitTest.Reports.ComparePDFDocuments.main(ComparePDFDocuments.java:19)

    Please advise.

  • Sam

    Can you please share the source code or post it in GitHub. I want to contribute in the project too. I think there have been many requests regarding this.

  • Oksana

    Hello,

    I also would like to ask you about your tool.
    Could you please share source code or give me a link?

    Thanks in advance!

  • Jonas

    Hi! How about the sources? I would really like to submit some enhancements and perhaps look into the regex-excludes mentioned above.

  • Deepa Kiran

    Its a Nice utility !! I just tried using it. I have used the method ‘comparePdfFilesBinaryMode’. it compared the pdfs but result image is generated only for first page of pdf though there are differences in second page too.
    Please suggest if there is any other way to generate images for each page ?

    • vIns

      Yes, Please check the API – you need to set the flag to compare all the pages. otherwise, it will just return false as soon as it finds a mismatch and exit.

  • Deepa Kiran

    while comparing the pdfs, I wanted to ignore few differences (like form IDs) in the PDFs and make them pass irrespective of few kinds of differences in them.
    Can you please share source code in order to make this change for my project.

  • Raj

    Hi,
    Is it possible to share a demo video on how to use this library file to compare PDF files in Visual mode. Or Steps to do this?
    Thanks.

  • Visa

    Hi,

    I am running automation to compare more than one pair of PDF. i would like to save all the compared images into a output folder. But, your code seems to clean the folder before writing to it.

    • vIns

      Yeah!! I thought I should clear the folder. But you are right. This library should not do that. It is up to the user to decide whether to clear it or not. One easy option is to create a separate folder for each PDF under the output folder. Alternatively, the source code is available on GitHub: you can comment out the code which clears the output folder and build it. I have provided the build instructions.
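
The "separate folder for each pdf" workaround could look roughly like the sketch below. The helper name (OutputFolderSketch.folderFor) and the folder layout are assumptions for illustration; the returned path would then be passed to pdfUtil.setImageDestinationPath(...) as a String before each compare.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFUtil.
class OutputFolderSketch {

    /** Creates (if needed) and returns root/<pdf file name without extension>. */
    static Path folderFor(Path root, String pdfPath) {
        String name = Paths.get(pdfPath).getFileName().toString();
        int dot = name.lastIndexOf('.');
        if (dot > 0) {
            name = name.substring(0, dot); // "doc1.pdf" becomes "doc1"
        }
        try {
            return Files.createDirectories(root.resolve(name));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = Paths.get(System.getProperty("java.io.tmpdir"), "pdf-compare-results");
        // One sub-folder per document, so results of different pairs never collide.
        System.out.println(folderFor(root, "files/doc1.pdf"));
        System.out.println(folderFor(root, "files/doc2.pdf"));
    }
}
```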

    • vIns

      No, not for the time being! But you can do this yourself. pdfUtil.getText("c:/sample.pdf").replaceAll("[0-9]{2}/[0-9]{2}/[0-9]{4}", "") will remove dates written like 02/11/2015 and give you the string to compare.
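
A small self-contained sketch of this date-stripping idea follows. The date layout is an assumption (two digits, a separator, two digits, a separator, four digits, with /, - or . as separator) and the names DateMaskSketch, stripDates and sameIgnoringDates are hypothetical; adjust the pattern to whatever format your PDFs actually use.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFUtil.
class DateMaskSketch {

    // Assumed layout: dd/mm/yyyy, dd-mm-yyyy or dd.mm.yyyy
    private static final Pattern DATE =
            Pattern.compile("\\b\\d{2}[/.-]\\d{2}[/.-]\\d{4}\\b");

    /** Removes every date-like token from the text. */
    static String stripDates(String text) {
        return DATE.matcher(text).replaceAll("");
    }

    /** Compares two texts after the date-like tokens have been removed. */
    static boolean sameIgnoringDates(String a, String b) {
        return stripDates(a).equals(stripDates(b));
    }

    public static void main(String[] args) {
        // In real use the two strings would come from pdfUtil.getText(...).
        String report1 = "Daily report generated on 02/11/2015";
        String report2 = "Daily report generated on 03/11/2015";
        System.out.println(sameIgnoringDates(report1, report2));
    }
}
```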

      • Visa

        This is working fine for text compare. But in case of image comparison it is failing, is there a way out to either remove this from PDF ?

        • vIns

          In case of images, it does a pixel-by-pixel compare. It is a very sensitive comparison, and masking certain areas is not simple or straightforward.

  • chk

    Can you please let us know how to compare a pdf when it has a watermark or watermark layer on it. Also can you please let us know how to delete that water mark .This utility helped us greatly.

  • Gnanasekaran

    Hi , I am trying to use this utility in vbscript , my requirement is compare two pdf files in commandline and generate the difference file in specific location .. using Jar i am unable to set the image destination path in commandline ..please guide
    1.set destination image path in commandline
    2.compare the images in commandline and save the difference image

  • ML

    We had the same requirement, I modified the main class and re-built the jar using the maven build file. We found that the comparison needed to be page by page or else you don’t get a diff image per page, also we needed a non-zero System exit value to get it to be useful in the test environment.

  • gagan

    Exception in thread "main" java.lang.UnsupportedClassVersionError: com/testautomationguru/utility/PDFUtil : Unsupported major.minor version 52.0

    Facing this issue while running a Java program. Could you please help me resolve it?

  • Ranjith KN

    Hi I am finding issue when both of the images are having difference , then the resulting image is not highlighting the difference . Also this would be nice if you can make a side by side comparison

    • vIns

      The number of changed pixels could be very small, maybe just 1, which is why you are unable to notice the difference. I will see if we can have some threshold.

  • chk

    Hi, If possible can you please help me with Watermark removal code ,it is really important for me and that will be of great help to me if you can do that .

  • YOGESH VASU

    hi, i want to compare pdf files pixel by pixel but this is not comparing can you show to me how to execute this code

  • karuna

    This is really very nice.
    And I would like to know whether the features below, listed as to be added soon, are available on GitHub.
    While comparing PDFs – ignore certain text using Regular Expression
    For example, 2 PDFs have same text & contains date on which it was generated which needs to be omitted while comparing.
    While comparing PDFs in VISUAL_MODE, ignore certain area.
    While comparing PDFs in VISUAL_MODE, return true / false based on certain threshold / sensitivity.

  • Juhi

    Thank you so much vlns… I am new to Selenium and do not understand Git, anyway I was able to download the jar file. Just wanted to know what import command do I need to write in my eclipse after adding this jar to my Reference Library

  • Srikanth Y

    Hi Vlns
    your work is marvelous, But Pixel by pixel comparison is much slower when compared to a Licensed tool(StreamDiff). Do you have any idea to increase the speed of comparison?

    Will Wait for your response!

  • Juhi

    Hi Vlns,

    This works wonders…Thank you so much.

    But for pixel by pixel comparison, my PDF have 3 pages, and there were some differences on all 3 pages but the Result Image that captures the Difference only shows the same for 1st page only. Can you please help on this – as to how to showcase the differences on all the pages of PDF and not just the 1st Page.

    • vIns

      That is the default behavior to exit as soon as a mismatch is found. if you want all pages to be compared, you could set – pdfUtil.compareAllPages(true) – before comparing.

      • Juhi

        Thank You Vlns. This worked. But in case the no. of pages in the PDF are not same, then this does not spot the difference in the image format. Is there any way I can capture the Pixel difference in all the pages even if page count does not match?

  • Srikanth Yadugani

    Hi Team,

    I admire your work very much. I want to bring to your notice that image generated by below statement
    pdfRenderer1.renderImageWithDPI(iPage, 72, ImageType.RGB)
    is https://drive.google.com/file/d/0B18WGCjoaDzJQXVnYVdDand3cWs/view
    which is odd, and the time it takes to compare a single page of two PDFs is 5 secs on average.
    Could you please suggest any resolution to correct image generation and increase speed of comparison?
    Waiting for your response( positive or diplomatic, ready to receive)

  • Sajid Khan

    HI

    I am comparing two PDFs and i have enabled the logs too.
    ArrayIndexOutOfBound is coming:
    WARNING: The end of the stream doesn’t point to the correct offset, using workaround to read the stream, stream start position: 5903, length: 0, expected end position: 5903
    Apr 02, 2017 9:20:44 AM com.testautomationguru.utility.PDFUtil convertToImageAndCompare
    INFO: Comparing Page No : 1
    Apr 02, 2017 9:20:44 AM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
    WARNING: Image stream is empty
    java.lang.ArrayIndexOutOfBoundsException: Coordinate out of bounds!
    at sun.awt.image.IntegerInterleavedRaster.getDataElements(IntegerInterleavedRaster.java:219)
    at java.awt.image.BufferedImage.getRGB(BufferedImage.java:986)
    at com.testautomationguru.utility.ImageUtil.compareAndHighlight(ImageUtil.java:19)
    at com.testautomationguru.utility.PDFUtil.convertToImageAndCompare(PDFUtil.java:458)

    Need your email id so that can share the PDFs

  • Shivangi

    This is an awesome utility. I have one doubt, we want to change the color of the difference and I don’t want to overlap the difference, instead want it to shift towards left side. Will that be possible?

  • Shivangi

    And what do you think on shifting the differences to the left of the source pdf file. Actually we have two pdf files printing prices of the resources, And we want to compare the differences. But due to overlapping we can’t read baseline and actual file.

  • Abhishek Pandey

    Hi Vlns,
    This utility is really helpful, but I am facing one issue. I use this utility as a jar in my class and call pdfUtil.setImageDestinationPath(DestPath), where DestPath is a String passed in along with the two PDF paths: public static String pdfMatchMethod(String pdf1Path, String pdf2Path, String DestPath). I converted my class to a web service, but when I call it from the client proxy class with these parameters, the image is not saved in the given path.

    Can you please help me in solving this issue, why it is not downloading from webservice to local.

  • unknown

    can you please tell me how do i compare the font and alignment of pdf’s with this library and with this library its not storing the differed image

    pdfUtil.setImageDestinationPath("c:/imgpath");
    pdfUtil.compare(file1, file2);

    no exception but no image in the folder as well
