Read PDF content using Selenium

By Webner

To read a PDF document in a Selenium test, we can use a Java library called PDFBox. Apache PDFBox is an open-source library for working with PDF files. We can use it to verify the text or images present in a file. To use it with Selenium, we either add the Maven dependency to the pom.xml file or add the jar to the project's build path as an external jar.
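For the Maven route, the dependency entry would look roughly like the fragment below. The coordinates are the ones PDFBox is published under on Maven Central; the version is chosen to match the 1.8.16 jar used in this post, so adjust it if you use a different release.

```xml
<!-- Goes inside the <dependencies> element of pom.xml -->
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>1.8.16</version>
</dependency>
```

With this approach Maven also resolves the libraries that PDFBox itself depends on, such as FontBox and Commons Logging.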

Here we will add it as an external jar:

  • Download the jar file from the following page:
    https://pdfbox.apache.org/download.html
    I am using PDFBox version 1.8.16.
  • In the project, open “Configure Build Path” and add the downloaded file as an external jar.
  • After adding the jar, click “Apply” and then “Close”.

Code to extract the content of the PDF:

package Testing;

import java.io.BufferedInputStream;
import java.io.InputStream;
import java.net.URL;

import org.apache.pdfbox.pdmodel.PDDocument;
// PDFBox 1.8.x package; in PDFBox 2.x this class is org.apache.pdfbox.text.PDFTextStripper
import org.apache.pdfbox.util.PDFTextStripper;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

import io.github.bonigarcia.wdm.WebDriverManager;

public class pdfread {

    public static WebDriver driver;

    public void ReadPDF() throws Exception {
        // Set up the chromedriver binary and start Chrome
        WebDriverManager.chromedriver().setup();
        driver = new ChromeDriver();
        driver.manage().window().maximize();

        // Open the PDF in the browser and read its URL back from the driver
        driver.get("https://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf");
        String currentLink = driver.getCurrentUrl();

        // Fetch the same URL directly and parse the bytes with PDFBox
        URL pdfUrl = new URL(currentLink);
        try (InputStream inputFile = pdfUrl.openStream();
             BufferedInputStream file = new BufferedInputStream(inputFile)) {
            PDDocument document = PDDocument.load(file);
            try {
                // Extract the text of all pages and print it
                String pdfContent = new PDFTextStripper().getText(document);
                System.out.println(pdfContent);
            } finally {
                document.close(); // release the resources held by the document
            }
        }
    }

    public static void main(String[] args) throws Exception {
        pdfread read = new pdfread();
        try {
            read.ReadPDF();
        } finally {
            // Close the browser even if reading the PDF fails
            if (driver != null) {
                driver.quit();
            }
        }
    }
}
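The code above assumes that the URL really returns a PDF. If the server answers with an HTML page instead (a login form or an error page, for example), PDFBox fails while parsing. A minimal sketch of a guard that could run before PDDocument.load, assuming a strict check that the stream begins with the standard "%PDF-" header; the class PdfHeaderCheck and the method looksLikePdf are hypothetical names, not part of PDFBox or Selenium:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper, not part of PDFBox or Selenium. */
final class PdfHeaderCheck {

    // PDF files start with the ASCII bytes "%PDF-" followed by the version
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the stream starts with "%PDF-"; the stream is rewound afterwards. */
    static boolean looksLikePdf(BufferedInputStream in) {
        try {
            in.mark(PDF_MAGIC.length);          // remember the current position
            byte[] head = new byte[PDF_MAGIC.length];
            int read = 0;
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n < 0) {
                    break;                      // stream ended before 5 bytes were read
                }
                read += n;
            }
            in.reset();                         // rewind so a parser can read from the start
            return read == head.length && Arrays.equals(head, PDF_MAGIC);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In-memory stand-ins for the stream returned by URL.openStream()
        byte[] pdfBytes = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] htmlBytes = "<html><body>Login</body></html>".getBytes(StandardCharsets.US_ASCII);

        System.out.println(looksLikePdf(new BufferedInputStream(new ByteArrayInputStream(pdfBytes))));  // true
        System.out.println(looksLikePdf(new BufferedInputStream(new ByteArrayInputStream(htmlBytes)))); // false
    }
}
```

In ReadPDF, the check would be called on the BufferedInputStream just before it is handed to PDDocument.load. Because the stream is rewound, PDFBox still sees the file from its first byte. The check is deliberately strict: a file with extra bytes before the header would be rejected by this sketch.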

Result:
The extracted text of the PDF is printed to the console.
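In a test, the extracted string is usually checked rather than only printed. Text extracted from a PDF tends to contain line breaks and irregular spacing, so it helps to normalize whitespace before comparing. A minimal sketch, assuming a hypothetical helper class PdfTextCheck (not part of PDFBox or Selenium) and a made-up sample string standing in for pdfContent:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for checking text extracted from a PDF. */
final class PdfTextCheck {

    /** Collapses every run of whitespace (spaces, tabs, line breaks) into one space. */
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    /** Returns the expected phrases that do NOT occur in the extracted text. */
    static List<String> missingPhrases(String pdfContent, String... expected) {
        String haystack = normalize(pdfContent);
        List<String> missing = new ArrayList<>();
        for (String phrase : expected) {
            if (!haystack.contains(normalize(phrase))) {
                missing.add(phrase);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Made-up sample; in the real test this would be the pdfContent string
        String pdfContent = "Invoice  No. 42\nTotal:\t100 USD\n";

        // Prints [Paid]: the first two phrases are found, the third is not
        System.out.println(missingPhrases(pdfContent, "Invoice No. 42", "Total: 100 USD", "Paid"));
    }
}
```

A test can then fail when the returned list is not empty, and the list itself tells you which phrases were missing from the PDF.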
