Extract PDF Text and Verify Text Present in PDF Using WebDriver

Most applications have a 'Print PDF' feature. How do we cover it in automation? First decide whether it really needs to be automated; if the answer is yes, read on to see how it can be done. In an earlier tutorial we validated whether a file was downloaded after clicking a download button. In this tutorial we will validate the Print PDF functionality in the two ways described below.

The two approaches are:

1. A very simple check that needs no third-party libraries.
2. Extract the text from the PDF and validate that the text you are looking for is present in the document. Use this ONLY when the content itself must be verified.

Choose between them based on your requirement.

The first approach is shown below:

/**
 * To verify that the URL opened after the click points to a PDF
 */
@Test
public void testVerifyPDFInURL() {
	WebDriver driver = new FirefoxDriver();
	try {
		driver.get("http://www.princexml.com/samples/");
		driver.findElement(By.linkText("PDF flyer")).click();
		String getURL = driver.getCurrentUrl();
		Assert.assertTrue(getURL.contains(".pdf"));
	} finally {
		// Close the browser even when the assertion fails
		driver.quit();
	}
}
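
Checking the URL only tells us that the address contains '.pdf'; it does not confirm that a PDF was actually served. A slightly stronger check, still without third-party libraries, is to look at the first bytes of the response, since a well-formed PDF starts with the marker '%PDF-'. Below is a minimal sketch of that idea. The class name PdfHeaderCheck and its method names are made up for this illustration and are not part of WebDriver or PDFBox.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: inspects the leading bytes instead of the URL. */
final class PdfHeaderCheck {

	// A well-formed PDF is expected to begin with the ASCII marker "%PDF-"
	private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

	/** Returns true when the given bytes start with the "%PDF-" marker. */
	static boolean startsWithPdfMagic(byte[] head) {
		if (head == null || head.length < PDF_MAGIC.length) {
			return false;
		}
		for (int i = 0; i < PDF_MAGIC.length; i++) {
			if (head[i] != PDF_MAGIC[i]) {
				return false;
			}
		}
		return true;
	}

	/** Reads the first five bytes of the stream and checks them for the marker. */
	static boolean looksLikePdf(InputStream in) throws IOException {
		byte[] head = new byte[PDF_MAGIC.length];
		int total = 0;
		// read() may return fewer bytes than requested, so loop until full or end of stream
		while (total < head.length) {
			int n = in.read(head, total, head.length - total);
			if (n < 0) {
				break;
			}
			total += n;
		}
		return total == head.length && startsWithPdfMagic(head);
	}
}
```

In the test above it could be called as PdfHeaderCheck.looksLikePdf(new URL(driver.getCurrentUrl()).openStream()), assuming the PDF URL can be fetched without the browser's session cookies.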

The second way uses a third-party library. In this example we will see how to use the 'Apache PDFBox' library.

Text extraction is one of the main features of the Apache PDFBox library, and it works on a variety of PDF documents. The functionality is encapsulated in 'org.apache.pdfbox.util.PDFTextStripper' (this is the package in PDFBox 1.8.x; in 2.x the class moved to 'org.apache.pdfbox.text').

It also provides an option to limit the text that is extracted during the extraction process by specifying the range of pages that we want to extract. For example, if the PDF has 100 pages, we can give the range from first to second page to validate the text present.

The snippet below sets the range so that only the first and second pages of the PDF are read. If the text you want to verify is somewhere in the middle of the PDF, set the range to those pages instead.

PDFTextStripper pdfStripper = new PDFTextStripper();
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(2);

NOTE: The startPage and endPage properties of PDFTextStripper are 1 based and inclusive.

Below is an example program covering both of the approaches discussed above.

import java.io.BufferedInputStream;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;

import org.testng.Assert;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.testng.annotations.AfterClass;
import org.testng.annotations.BeforeClass;
import org.testng.annotations.Test;

public class ReadPDF {
	
	WebDriver driver;
	
	@BeforeClass
	public void setUp() {
		driver = new FirefoxDriver();
	}
	
	/**
	 * To verify PDF content in the pdf document
	 */
	@Test
	public void testVerifyPDFTextInBrowser() {
		
		driver.get("http://www.princexml.com/samples/");
		driver.findElement(By.linkText("PDF flyer")).click();
		Assert.assertTrue(verifyPDFContent(driver.getCurrentUrl(), "Prince Cascading"));
	}

	/**
	 * To verify pdf in the URL 
	 */
	@Test
	public void testVerifyPDFInURL() {
		driver.get("http://www.princexml.com/samples/");
		driver.findElement(By.linkText("PDF flyer")).click();
		String getURL = driver.getCurrentUrl();
		Assert.assertTrue(getURL.contains(".pdf"));
	}

	
	public boolean verifyPDFContent(String strURL, String reqTextInPDF) {
		
		boolean flag = false;
		
		PDFTextStripper pdfStripper = null;
		PDDocument pdDoc = null;
		COSDocument cosDoc = null;
		String parsedText = null;

		try {
			URL url = new URL(strURL);
			BufferedInputStream file = new BufferedInputStream(url.openStream());
			PDFParser parser = new PDFParser(file);
			
			parser.parse();
			cosDoc = parser.getDocument();
			pdfStripper = new PDFTextStripper();
			pdfStripper.setStartPage(1);
			pdfStripper.setEndPage(1);
			
			pdDoc = new PDDocument(cosDoc);
			parsedText = pdfStripper.getText(pdDoc);
		} catch (MalformedURLException e2) {
			System.err.println("URL string could not be parsed "+e2.getMessage());
		} catch (IOException e) {
			System.err.println("Unable to open PDF Parser. " + e.getMessage());
		} finally {
			// Release the document whether or not parsing succeeded
			try {
				if (cosDoc != null)
					cosDoc.close();
				if (pdDoc != null)
					pdDoc.close();
			} catch (Exception e1) {
				e1.printStackTrace();
			}
		}
		
		System.out.println("+++++++++++++++++");
		System.out.println(parsedText);
		System.out.println("+++++++++++++++++");

		// parsedText is still null when the PDF could not be read
		if (parsedText != null && parsedText.contains(reqTextInPDF)) {
			flag = true;
		}
		
		return flag;
	}
	
	@AfterClass
	public void tearDown() {
		driver.quit();
	}
}
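
One thing to watch for with the contains check in verifyPDFContent: extracted text keeps the line breaks of the page layout, so a phrase that wraps across two lines in the PDF may not match a single-line expected string. A possible workaround, sketched below, is to collapse whitespace on both sides before comparing. PdfTextMatcher and its methods are hypothetical names used for this illustration, not PDFBox API.

```java
/** Hypothetical helper for comparing extracted PDF text with an expected phrase. */
final class PdfTextMatcher {

	/** Collapses every run of whitespace (spaces, tabs, line breaks) into one space. */
	static String normalizeWhitespace(String text) {
		return text == null ? "" : text.trim().replaceAll("\\s+", " ");
	}

	/** Returns true when the expected phrase occurs in the text, ignoring layout whitespace. */
	static boolean containsNormalized(String extractedText, String expected) {
		String needle = normalizeWhitespace(expected);
		if (needle.isEmpty()) {
			// An empty expectation would match every document, so treat it as a failure
			return false;
		}
		return normalizeWhitespace(extractedText).contains(needle);
	}
}
```

With this helper, verifyPDFContent could end with return PdfTextMatcher.containsNormalized(parsedText, reqTextInPDF), which also returns false when parsedText is null.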

The ReadPDF example works when the PDF opens in the browser after clicking the Print button. In a few cases, clicking Print downloads the PDF file instead.

In those cases, the code below needs to change:

URL url = new URL(strURL);
BufferedInputStream file = new BufferedInputStream(url.openStream());
PDFParser parser = new PDFParser(file);

Change it to the following:

File file = new File("D:/Paynetsbicardbill.pdf");
PDFParser parser = new PDFParser(new FileInputStream(file));

Pass the path where the document was downloaded, and add imports for 'java.io.File' and 'java.io.FileInputStream'.
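
When the PDF is downloaded, the file may not be on disk yet at the moment the test tries to read it. The sketch below polls for the file before it is handed to PDFParser. DownloadWaiter, waitForFile and the 200 ms polling interval are assumptions made for this example; note also that a non-empty file is not a guarantee that the browser has finished writing it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: waits for a downloaded file to show up on disk. */
final class DownloadWaiter {

	/** Polls until the file exists and is non-empty, or the timeout elapses. */
	static boolean waitForFile(Path file, long timeoutMillis) {
		long deadline = System.currentTimeMillis() + timeoutMillis;
		while (true) {
			if (isNonEmptyFile(file)) {
				return true;
			}
			if (System.currentTimeMillis() >= deadline) {
				return false;
			}
			try {
				Thread.sleep(200); // assumed polling interval
			} catch (InterruptedException e) {
				// Restore the interrupt flag and give up waiting
				Thread.currentThread().interrupt();
				return false;
			}
		}
	}

	private static boolean isNonEmptyFile(Path file) {
		try {
			return Files.isRegularFile(file) && Files.size(file) > 0;
		} catch (IOException e) {
			// The file may disappear or be locked between the two checks
			return false;
		}
	}
}
```

A test could call Assert.assertTrue(DownloadWaiter.waitForFile(Paths.get("D:/Paynetsbicardbill.pdf"), 10000)) right after clicking Print and before creating the PDFParser.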

Comments

good to reading

This code is not working for me.
FAILED: testVerifyPDFTextInBrowser
java.net.UnknownHostException: www.princexml.com

Do we need to do any changes testVerifyPDFInURL,testVerifyPDFTex method when we open / read pdf from folder ?

As you suggested to do below changes in above cases:

URL url = new URL(strURL);
BufferedInputStream file = new BufferedInputStream(url.openStream());
PDFParser parser = new PDFParser(file);

convert as below
File file = new File("D:/Paynetsbicardbill.pdf");
PDFParser parser = new PDFParser(new FileInputStream(file));
Please let me know

Through this website I came to know how to write code in Eclipse and execute the commands. Very useful for beginners like me.

PDFParser parser = new PDFParser(new FileInputStream(file)); - Constructor undefined in pdf parser

The problem is due to using (Apache PDFBox 2.0.0 API) jar Files. Remove them from build path and use (Apache PDFBox 1.8.11 API) as PDFParser class in 2.0 doesn't have PDFParser(BufferedInputStream args) Constructor. But 1.8 has PDFParser(InputStream args) Constructor.

Hi I am getting this exception java.io.IOException: Error: End-of-File, expected line while calling parser.parse();

help provided by anyone highly appreciated.

Thanks in advance.

I am getting following error while executing the code. Below the error.

java.io.IOException: Error: Header doesn't contain versioninfo
at org.apache.pdfbox.pdfparser.PDFParser.parseHeader(PDFParser.java:379)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:186)

Note: I used 1.18.11 pdfbox version with java 7.

I have few check boxes and radio buttons on my pdf file and i may need to verify if they are selected. This pdf file is generated as a result of users selections in a html page.

Hi,
How can we download a "embedded PDF file" from webdriver through selenium using java.
