Wednesday 30 March 2016

Java code to extract text from PDF files


In this article I am going to show you how to read, or extract, text from PDF files using Java code.

Why would I even need to read PDF files?

Suppose you are building an e-commerce website in Java and need to analyze your customers' invoices. The invoices will most likely be saved in PDF format, and you need to get each customer's token number. In that case you can use one of these programs to extract the text and then search it for the token string.
I am going to demonstrate two methods:
1) Using the iText package
2) Using the Apache PDFBox package

package com.codingsec;

import java.io.IOException;

// iText imports
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class iTextReadDemo {

    public static void main(String[] args) {
        PdfReader reader = null;
        try {
            reader = new PdfReader("c:/temp/test.pdf");
            int pageCount = reader.getNumberOfPages();
            System.out.println("This PDF has " + pageCount + " pages.");
            // iText page numbers start at 1; loop over every page so one-page PDFs work too
            for (int pageNumber = 1; pageNumber <= pageCount; pageNumber++) {
                String page = PdfTextExtractor.getTextFromPage(reader, pageNumber);
                System.out.println("Page " + pageNumber + " Content:\n\n" + page + "\n\n");
            }
            System.out.println("Is this document tampered: " + reader.isTampered());
            System.out.println("Is this document encrypted: " + reader.isEncrypted());
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Release the file handle held by the reader
            if (reader != null) {
                reader.close();
            }
        }
    }

}
And the second program, using Apache PDFBox:
import java.io.*;
import org.apache.pdfbox.pdmodel.*;
// PDFBox 1.8.x location of PDFTextStripper; in PDFBox 2.x the class lives in org.apache.pdfbox.text
import org.apache.pdfbox.util.*;

public class PDFTest {

    public static void main(String[] args) {
        File input = new File("C:\\Invoice.pdf");      // The PDF file to extract text from
        File output = new File("C:\\SampleText.txt");  // The text file that receives the extracted text
        // try-with-resources closes the document and the writer even if extraction fails
        try (PDDocument pd = PDDocument.load(input);
             BufferedWriter wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output)))) {
            System.out.println(pd.getNumberOfPages());
            System.out.println(pd.isEncrypted());
            pd.save("CopyOfInvoice.pdf"); // Creates a copy called "CopyOfInvoice.pdf"
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.writeText(pd, wr);
            // Closing the writer at the end of the try block also flushes the stream
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
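
Once either program has given you the text, you still have to find the customer's token in it. Here is a minimal sketch of that last step using only the standard java.util.regex package. The class name InvoiceTokenFinder, the method findToken and, above all, the token layout (the label "Token", an optional ":" or "#", then 6 to 12 capital letters or digits) are assumptions made for illustration; adjust the pattern to whatever your invoices really contain.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InvoiceTokenFinder {

    // Assumed (hypothetical) layout: "Token", optional ':' or '#', then 6-12 capital letters/digits.
    // Only the label is matched case-insensitively; the token itself must be upper case.
    private static final Pattern TOKEN_PATTERN =
            Pattern.compile("(?i:Token)\\s*[:#]?\\s*([A-Z0-9]{6,12})\\b");

    // Returns the first token found in the extracted PDF text, or an empty Optional if there is none.
    static Optional<String> findToken(String extractedText) {
        Matcher matcher = TOKEN_PATTERN.matcher(extractedText);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-in for the text returned by PdfTextExtractor or written by PDFTextStripper
        String sample = "Invoice 1024\nCustomer: Jane Doe\nToken: AB12CD34\nTotal: 59.90";
        System.out.println(findToken(sample).orElse("no token found")); // prints AB12CD34
    }
}
```

Returning an Optional makes the "no token on this invoice" case explicit, so the caller has to decide what to do instead of running into a null.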
Take your time to comment on this article and share your views.
