Reading Data From A PDF File In Java
Posted By : Shakil Pathan | 31-Jan-2018
Hi Guys,
In this blog, I am going to explain how to read text data from a PDF file in Java.
For this, you have to add a Maven dependency on Apache PDFBox. Apache PDFBox is an open source Java library for working with PDF documents. It can be used to create new PDF documents, to manipulate existing ones, and to extract the content of a document.
So let's start with reading the data from the PDF file. First, add the Maven dependency to your pom.xml:
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.6</version>
</dependency>
Then I use the code below to read the text from the PDF file:
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfTextReader {
    public static void main(String[] args) throws IOException {
        // try-with-resources closes the document once reading is done
        try (PDDocument pdDocument = PDDocument.load(new File("/file_path/fileToRead.pdf"))) {
            PDFTextStripper pdfTextStripper = new PDFTextStripper();
            // Return text in visual (page position) order, not storage order
            pdfTextStripper.setSortByPosition(true);
            String textInPDFFile = pdfTextStripper.getText(pdDocument);
            // Split on Unix (\n) or Windows (\r\n) line endings
            String[] textLines = textInPDFFile.split("\\r?\\n");
            for (String textLine : textLines) {
                System.out.println(textLine);
            }
        }
    }
}
In the code above, setSortByPosition(true) is called because the order in which text tokens are stored in a PDF file may not be the same as the order in which they appear visually on the page.
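To make the idea of sorting by position a bit more concrete, here is a small toy sketch in plain Java. It is only an illustration and not how PDFBox works internally: the SortByPositionDemo and Token classes, the top/left coordinates, and the sample values are all made up for this example. It shows tokens stored in one order, then re-ordered top-to-bottom and left-to-right so they read the way they would look on the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model only: hypothetical classes, not PDFBox types or PDFBox's real algorithm.
class SortByPositionDemo {

    // A piece of text plus the (assumed) position where it is drawn on the page.
    static final class Token {
        final String text;
        final int top;   // distance from the top of the page
        final int left;  // distance from the left edge of the page

        Token(String text, int top, int left) {
            this.text = text;
            this.top = top;
            this.left = left;
        }
    }

    // Joins the token texts with spaces, in exactly the order given.
    static String join(List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Token t : tokens) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(t.text);
        }
        return sb.toString();
    }

    // Returns a sorted copy: top-to-bottom first, then left-to-right.
    static List<Token> sortByPosition(List<Token> tokens) {
        List<Token> sorted = new ArrayList<>(tokens);
        sorted.sort(Comparator.comparingInt((Token t) -> t.top)
                .thenComparingInt(t -> t.left));
        return sorted;
    }

    // Sample tokens in "storage" order, which differs from the visual order.
    static List<Token> sampleTokens() {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("World", 10, 60));
        tokens.add(new Token("Footer", 700, 10));
        tokens.add(new Token("Hello", 10, 10));
        return tokens;
    }

    public static void main(String[] args) {
        List<Token> tokens = sampleTokens();
        System.out.println(join(tokens));                 // prints "World Footer Hello"
        System.out.println(join(sortByPosition(tokens))); // prints "Hello World Footer"
    }
}
```

Without sorting, the text comes out in storage order; with sorting, it follows the layout a reader would see. A real PDF has many more details (fonts, rotation, columns), so treat this purely as a mental model.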
Hope it helps!
About Author
Shakil Pathan
Shakil is an experienced Groovy and Grails developer. He has also worked extensively on developing STB applications using NetGem.