How To Extract Data From A PDF Document In Java

Posted By : Sanjay Saini | 31-May-2018

Hi Guys,

In this blog, I'm going to show how to read/extract data from a PDF using a Java program. Many times we need to read a PDF and do some work with its data.

In Java, we have an API called "PDFBox" that makes this work easy. PDFBox is provided by Apache and is open source. It helps us create and manipulate PDF documents in an application.

Before writing a sample program, here is a brief overview of this API.

 

What Is PDFBox?

Apache PDFBox is a free, open-source Java library that supports the development and conversion of PDF documents. Using this library, you can write Java programs that create, convert and manipulate PDF documents. In addition, PDFBox includes a command-line utility for performing various operations on PDFs using the available JAR file.

 

Features of PDFBox

The following are the important features of PDFBox −

Extract Text − With the help of PDFBox, you can extract Unicode text from PDF documents.

Split & Merge − With the help of PDFBox, you can split a single PDF document into multiple documents, and merge them back into a single document.

Fill Forms − With the help of PDFBox, you can fill in the form data in a document.

Print − With the help of PDFBox, you can print a PDF file using the standard Java printing API.

Save as Image − With the help of PDFBox, you can save PDFs as image files, such as PNG or JPEG.

Create PDFs − With the help of PDFBox, you can create a new PDF file from a Java application, and you can also embed images and fonts.

Signing − With the help of PDFBox, you can add digital signatures to PDF documents.
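
To make two of these features concrete (creating a PDF and extracting its text), here is a minimal sketch that builds a one-page PDF in memory and reads the text back. It assumes PDFBox 2.0.x is on the classpath; the class name PdfRoundTrip, the method names, and the sample message are made up for illustration. In PDFBox 3.x the loading and font APIs differ, so treat this as a 2.0.x-only sketch.

```java
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Hypothetical helper class; assumes PDFBox 2.0.x.
class PdfRoundTrip {

    // Creates a one-page PDF in memory that contains a single line of text.
    static byte[] createPdf(String message) throws IOException {
        try (PDDocument document = new PDDocument();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            PDPage page = new PDPage();
            document.addPage(page);

            // The content stream must be closed before the document is saved.
            try (PDPageContentStream content = new PDPageContentStream(document, page)) {
                content.beginText();
                content.setFont(PDType1Font.HELVETICA, 12);
                content.newLineAtOffset(50, 700);
                content.showText(message);
                content.endText();
            }

            document.save(out);
            return out.toByteArray();
        }
    }

    // Loads a PDF from bytes and returns all of its text.
    static String extractText(byte[] pdfBytes) throws IOException {
        try (PDDocument document = PDDocument.load(pdfBytes)) {
            return new PDFTextStripper().getText(document);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] pdf = createPdf("Hello PDFBox");
        System.out.println(extractText(pdf).trim());
    }
}
```

Working in memory keeps the sketch independent of any file path; in a real program you would more likely save to and load from a File.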

 

Components of PDFBox


The following are the four main components of PDFBox −

PDFBox − This includes the classes and interfaces related to data extraction and manipulation.

FontBox − This includes the classes and interfaces related to fonts; using these classes we can change the font of the text in a PDF document.

XmpBox − This includes the classes and interfaces that manipulate XMP metadata.

Preflight − This component is used to validate PDF files against the PDF/A-1b standard.

 

Sample Program for Printing PDF file Data using Java

First, add the PDFBox dependency to your Maven pom.xml (this example uses version 2.0.7):

<dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>2.0.7</version>
</dependency>

package com.sanjay;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class PrintPdf {

    public static void main(String[] args) throws IOException {

        // try-with-resources closes the document automatically
        try (PDDocument pdfDocument = PDDocument.load(new File("F:/Test.pdf"))) {

            if (!pdfDocument.isEncrypted()) {

                PDFTextStripper pdfTextStripper = new PDFTextStripper();
                // Sort the text by its position on the page before extracting it
                pdfTextStripper.setSortByPosition(true);

                // Extract the text of the whole document as a single string
                String pdfFileInText = pdfTextStripper.getText(pdfDocument);

                // Split on Unix or Windows line endings and print each line
                String[] lines = pdfFileInText.split("\\r?\\n");
                for (String line : lines) {
                    System.out.println(line);
                }

            } else {
                System.err.println("The document is encrypted; no text was extracted.");
            }
        }
    }
}
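
Once the text is out of the PDF, "doing some work" with it is plain string handling. As a rough sketch of that step, the helper below splits extracted text into lines with the same "\\r?\\n" pattern used in the program, drops blank lines, and finds lines containing a keyword. The class name ExtractedTextLines, its method names, and the sample text are all made up for illustration; it does not need PDFBox.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for post-processing text already extracted from a PDF.
class ExtractedTextLines {

    // Splits text on Unix or Windows line endings, trims each line,
    // and drops lines that are empty after trimming.
    static List<String> nonBlankLines(String text) {
        List<String> result = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    // Returns the non-blank lines that contain the keyword, ignoring case.
    static List<String> linesContaining(String text, String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (String line : nonBlankLines(text)) {
            if (line.toLowerCase(Locale.ROOT).contains(needle)) {
                matches.add(line);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the string returned by getText()
        String sample = "Invoice No: 1001\r\n\r\nTotal: 250.00\nThank you";
        System.out.println(nonBlankLines(sample));
        System.out.println(linesContaining(sample, "total"));
    }
}
```

In the sample program you would pass pdfFileInText to these methods instead of the hard-coded sample string.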

Thanks

Sanjay Saini

About Author

Sanjay Saini

Sanjay has been working on web application development using technologies such as Java, Groovy and Grails. He loves listening to music, playing games and going out with friends in his free time.
