In this article I am going to show you how to read, or extract, text from PDF files using Java code.

Why would you even need to read PDF files?

Suppose you are building an e-commerce website in Java and you need to analyze your customers' invoices. The invoices will most likely be saved in PDF format, and you need to get each customer's token number. In that case you can use one of these programs to extract the text and then search it for the token string.

I am going to demonstrate two methods:

1) Using the iText package

2) Using the Apache PDFBox package

package com.codingsec;
 
import java.io.IOException;
 
//iText imports
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
 
public class iTextReadDemo {
 
 
    public static void main(String[] args) {
        try {
             
            PdfReader reader = new PdfReader("c:/temp/test.pdf");
            System.out.println("This PDF has "+reader.getNumberOfPages()+" pages.");
            // Page numbers start at 1, so this reads the second page (the PDF needs at least two pages)
            String page = PdfTextExtractor.getTextFromPage(reader, 2);
            System.out.println("Page Content:\n\n"+page+"\n\n");
            System.out.println("Is this document tampered: "+reader.isTampered());
            System.out.println("Is this document encrypted: "+reader.isEncrypted());
            reader.close(); // Release the underlying file
 
        } catch (IOException e) {
            e.printStackTrace();
        }
 
    }
 
}

And the second example, using Apache PDFBox:

import java.io.*;
import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.util.*; // PDFBox 1.8.x; in PDFBox 2.x PDFTextStripper lives in org.apache.pdfbox.text

public class PDFTest {

 public static void main(String[] args){
 PDDocument pd = null;
 BufferedWriter wr = null;
 try {
         File input = new File("C:\\Invoice.pdf");  // The PDF file to extract text from
         File output = new File("C:\\SampleText.txt"); // The text file that receives the extracted text
         pd = PDDocument.load(input);
         System.out.println(pd.getNumberOfPages());
         System.out.println(pd.isEncrypted());
         pd.save("CopyOfInvoice.pdf"); // Creates a copy called "CopyOfInvoice.pdf"
         PDFTextStripper stripper = new PDFTextStripper();
         wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output), "UTF-8"));
         stripper.writeText(pd, wr);
 } catch (Exception e){
         e.printStackTrace();
 } finally {
         // Close both resources even if extraction failed; closing the writer also flushes it.
         try {
             if (wr != null) wr.close();
             if (pd != null) pd.close();
         } catch (IOException e) {
             e.printStackTrace();
         }
 }
     }
}
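
Once you have the text, finding the customer's token is an ordinary string search. Here is a minimal sketch of that last step. It assumes the invoice prints the token as "Token:" followed by letters and digits; that format, the class name TokenFinder and the sample invoice text are all made up for illustration, so adjust the pattern to match your own invoices.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenFinder {

    // Hypothetical token format: the label "Token:" followed by letters and digits.
    // Change this pattern to match how your invoices actually print the token.
    private static final Pattern TOKEN_PATTERN =
            Pattern.compile("Token:\\s*([A-Za-z0-9]+)");

    // Returns the first token found in the extracted text, or an empty Optional.
    static Optional<String> findToken(String extractedText) {
        Matcher matcher = TOKEN_PATTERN.matcher(extractedText);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-in for the text returned by either extraction method.
        String text = "Invoice #42\nCustomer: Jane Doe\nToken: AB12CD34\nTotal: 19.99";
        System.out.println(findToken(text).orElse("no token found"));
    }
}
```

You can pass in the page string from the iText example, or read SampleText.txt back in after the PDFBox example has written it.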

Take a moment to comment on this article and share your views.

computer and technology

Wednesday, 30 March 2016

java code to extract text from pdf files
