How To Extract Data From A PDF Document In Java
Posted By : Sanjay Saini | 31-May-2018
Hi Guys,
In this blog, I'm going to show how to read/extract data from a PDF using a Java program. We often need to read a PDF and do some work with its data.
In Java, we have an API, Apache PDFBox, that makes this work easy. PDFBox is provided by Apache and is open source. It helps us create and manipulate PDF documents in an application.
Before writing a sample program, here is a brief overview of this API.
What Is PDFBox?
Apache PDFBox is a free, open-source Java library that supports creating and converting PDF documents. Using this library, you can write Java programs that create, convert and manipulate PDF documents. In addition, PDFBox includes a command-line utility for performing various operations on PDF files using the available JAR file.
Features of PDFBox
The following are the important features of PDFBox −
Extract Text − With the help of PDFBox, you can extract Unicode text from PDF documents.
Split & Merge − With the help of PDFBox, you can split a single PDF document into multiple documents and merge them back into a single document.
Fill Forms − With the help of PDFBox, you can fill in the form data in a document.
Print − With the help of PDFBox, you can print a PDF file using the standard Java printing API.
Save as Image − With the help of PDFBox, you can save PDFs as image files, such as PNG or JPEG.
Create PDFs − With the help of PDFBox, you can create a new PDF file from a Java program, and you can also embed images and fonts.
Signing − With the help of PDFBox, you can add digital signatures to PDF files.
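As a rough illustration of the splitting feature from the list above (dividing one PDF into multiple documents), here is a minimal sketch. It assumes the PDFBox 2.0.x jar is on the classpath. The class name SplitSketch and its helper method are made up for this example, and it builds a blank document in memory instead of reading a real file.

```java
import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

// Hypothetical example class, not part of PDFBox
public class SplitSketch {

    // Builds a blank in-memory PDF with pageCount pages, splits it,
    // and returns how many documents the split produced.
    static int splitBlankDocument(int pageCount) {
        try (PDDocument source = new PDDocument()) {
            for (int i = 0; i < pageCount; i++) {
                source.addPage(new PDPage()); // empty page, default size
            }
            // By default, Splitter puts each page into its own document
            List<PDDocument> parts = new Splitter().split(source);
            try {
                return parts.size();
            } finally {
                // Every PDDocument returned by split() must be closed
                for (PDDocument part : parts) {
                    part.close();
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Parts: " + splitBlankDocument(3));
    }
}
```

In a real program you would load the source with PDDocument.load and save each part to its own file. Merging the parts back is handled by the PDFMergerUtility class from the same org.apache.pdfbox.multipdf package.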
Components of PDFBox
The following are the four main components of PDFBox −
PDFBox − This includes the classes and interfaces related to content extraction and manipulation.
FontBox − This includes the classes and interfaces related to fonts; using these classes we can modify the font of the text in a PDF document.
XmpBox − This includes the classes and interfaces that handle XMP metadata.
Preflight − This component is used to verify PDF files against the PDF/A-1b standard.
Sample Program for Printing PDF file Data using Java
First, add the PDFBox dependency to your Maven pom.xml:
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.7</version>
</dependency>
package com.sanjay;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class PrintPdf {

    public static void main(String[] args) throws IOException {
        // try-with-resources closes the document even if extraction fails
        try (PDDocument pdfDocument = PDDocument.load(new File("F:/Test.pdf"))) {
            // Only extract text when the document is not encrypted
            if (!pdfDocument.isEncrypted()) {
                PDFTextStripper pdfTextStripper = new PDFTextStripper();
                // Sort text by its position on the page before output
                pdfTextStripper.setSortByPosition(true);

                String pdfFileInText = pdfTextStripper.getText(pdfDocument);

                // Split on Unix or Windows line endings
                String[] lines = pdfFileInText.split("\\r?\\n");
                for (String line : lines) {
                    System.out.println(line);
                }
            }
        }
    }
}
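Extracted text often contains blank lines and stray spaces. The line-splitting step of the program can be sketched on its own in plain Java, with no PDFBox needed. The class and method names below are made up for illustration, and the sample string is only a stand-in for real extracted text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox
public class ExtractedTextLines {

    // Splits text on Unix or Windows line endings, trims each line,
    // and drops lines that are empty after trimming.
    static List<String> nonBlankLines(String text) {
        List<String> result = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the String returned by PDFTextStripper.getText(...)
        String sample = "Invoice 42\r\n\r\n  Total: 10.00  \nThank you";
        for (String line : nonBlankLines(sample)) {
            System.out.println(line);
        }
    }
}
```

Keeping this step in its own method makes it easy to check the clean-up logic without needing a PDF file at hand.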
Thanks
Sanjay Saini
About Author
Sanjay Saini
Sanjay has been working on web application development using technologies like Java, Groovy and Grails. He loves listening to music, playing games and going out with friends in his free time.