Monday, July 1, 2013

Extract images from Word

WordImageExtractor extracts all images from a Microsoft Word 2007+ (.docx) file. It uses the Apache POI Java library, which provides a class called XWPFDocument. The getAllPictures method of XWPFDocument returns every image in the document as a list.



WordImageExtractor source code:

import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.util.Iterator;
import java.util.List;
import javax.imageio.ImageIO;
import javax.swing.JFileChooser;
import javax.swing.filechooser.FileNameExtensionFilter;
import org.apache.poi.xwpf.usermodel.XWPFPictureData;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class WordImageExtractor {

    public static void main(String[] args) {
        selectWord();
    }

    // let the user pick a Word (.docx) file to extract images from
    public static void selectWord() {
        JFileChooser chooser = new JFileChooser();
        FileNameExtensionFilter filter = new FileNameExtensionFilter("DOCX", "docx");
        chooser.setFileFilter(filter);
        chooser.setMultiSelectionEnabled(false);
        int returnVal = chooser.showOpenDialog(null);
        if (returnVal == JFileChooser.APPROVE_OPTION) {
            File file = chooser.getSelectedFile();
            System.out.println("Please wait...");
            extractImages(file.toString());
            System.out.println("Extraction complete");
        }
    }

    public static void extractImages(String src) {
        // try-with-resources closes the input stream even if extraction fails
        try (FileInputStream fs = new FileInputStream(src)) {
            // create an Office Word 2007+ document object to wrap the Word file
            XWPFDocument docx = new XWPFDocument(fs);
            // get all images from the document and store them in the list piclist
            List<XWPFPictureData> piclist = docx.getAllPictures();
            // traverse the list and write each image to a file
            Iterator<XWPFPictureData> iterator = piclist.iterator();
            int i = 0;
            while (iterator.hasNext()) {
                XWPFPictureData pic = iterator.next();
                byte[] bytepic = pic.getData();
                BufferedImage imag = ImageIO.read(new ByteArrayInputStream(bytepic));
                // ImageIO.read returns null for formats it cannot decode (e.g. EMF, WMF)
                if (imag != null) {
                    ImageIO.write(imag, "jpg", new File("D:/imagefromword" + i + ".jpg"));
                }
                i++;
            }
        } catch (Exception e) {
            // report the failure instead of exiting silently
            e.printStackTrace();
            System.exit(-1);
        }
    }
}
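
The decode-and-write step inside extractImages can be tried on its own with only the JDK. The sketch below is not part of WordImageExtractor: the class name ImageBytesWriter and the method writeAsPng are hypothetical names chosen for illustration, and PNG is used as the output format here as an assumption (it keeps transparency, which JPEG does not). It shows one thing worth knowing about ImageIO.read: it returns null, rather than throwing, when no registered reader understands the bytes.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

final class ImageBytesWriter {

    // Hypothetical helper: decodes raw image bytes and writes them as a PNG file.
    // Returns false when ImageIO cannot decode the data or has no PNG writer.
    static boolean writeAsPng(byte[] data, File target) throws IOException {
        BufferedImage image = ImageIO.read(new ByteArrayInputStream(data));
        if (image == null) {
            // no registered ImageReader recognised the bytes
            return false;
        }
        return ImageIO.write(image, "png", target);
    }
}
```

In a loop like the one above, the byte array would come from XWPFPictureData.getData(), and a false return would mean that picture was skipped.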

In the WordImageExtractor code, JFileChooser displays a file dialog so that the user can easily browse for a Microsoft Word file. Once the path of the file is obtained, the extraction can start. The FileInputStream class reads the byte data of the Word file. To get at the images, first construct a document object by passing the FileInputStream object to the XWPFDocument constructor. The document object represents the contents of the original Word file in memory, so you can now query it. To get the images from the document, call the getAllPictures method, which returns a list of XWPFPictureData objects. Each XWPFPictureData object refers to one image. You can read all of its bytes with the getData method. Once you have the byte array of the image, you can construct a BufferedImage object from it with ImageIO.read, and then use the write method of the ImageIO class to write the image out to a file.
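
As a point of comparison, here is a sketch of a different technique that does not use Apache POI at all. A .docx file is a ZIP archive, and Word conventionally stores embedded images under the word/media/ folder inside it (a convention, not a guarantee for every producer). The class name DocxMediaUnzipper and the method extractMedia are hypothetical names for illustration. Because the bytes are copied as-is, each image keeps its original format instead of being re-encoded.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class DocxMediaUnzipper {

    // Copies every entry under word/media/ in the given .docx into outDir
    // and returns the paths of the files that were written.
    static List<Path> extractMedia(Path docx, Path outDir) throws IOException {
        List<Path> written = new ArrayList<>();
        Files.createDirectories(outDir);
        try (ZipInputStream zip = new ZipInputStream(Files.newInputStream(docx))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (entry.isDirectory() || !name.startsWith("word/media/")) {
                    continue;
                }
                // keep only the file name so an entry cannot escape outDir
                String fileName = name.substring(name.lastIndexOf('/') + 1);
                Path target = outDir.resolve(fileName);
                // reads up to the end of the current entry; does not close the zip stream
                Files.copy(zip, target, StandardCopyOption.REPLACE_EXISTING);
                written.add(target);
            }
        }
        return written;
    }
}
```

The trade-off is that this approach knows nothing about the document model, so it cannot tell you which paragraph an image belongs to; the POI approach in the post is the one to build on if that matters.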

3 comments:

  1. I'm not a developer, i always use the free online ocr to recognize and scan text from image.

  2. I have a docx file that includes paragraphs, tables, and pictures. I want to convert it to an HTML file. How do I extract each picture together with its true position?
