Read a Word document using Apache POI (Java):
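There are several Java libraries for reading Word documents (.doc or .docx) and Excel workbooks, and Apache POI is a solid choice. With it we can read a Word document paragraph by paragraph. Note that the XWPF classes used in this post handle only .docx files; the older binary .doc format is read with the HWPF classes from poi-scratchpad.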
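Since .doc and .docx are stored differently on disk, it can help to check which container a file really uses instead of trusting its extension. The sketch below is plain Java with no POI dependency; `WordFormatSniffer`, `Container` and `detect` are hypothetical names made up for this example. It relies only on two well-known file signatures: a .docx file is a ZIP archive and starts with the bytes 50 4B 03 04, while a binary .doc file is an OLE2 container and starts with D0 CF 11 E0 A1 B1 1A E1.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Apache POI.
class WordFormatSniffer {

    // ZIP also covers .xlsx/.pptx, and OLE2 also covers .xls/.ppt,
    // so this only tells the container type, not the exact document type.
    enum Container { ZIP, OLE2, UNKNOWN }

    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};

    /** Classifies a file by the first bytes of its content. */
    static Container detect(byte[] header) {
        if (startsWith(header, ZIP_MAGIC)) return Container.ZIP;
        if (startsWith(header, OLE2_MAGIC)) return Container.OLE2;
        return Container.UNKNOWN;
    }

    /** Reads up to 8 bytes from the file and classifies them. */
    static Container detect(String filename) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(filename))) {
            byte[] header = new byte[8];
            int read = 0;
            while (read < header.length) {
                int n = in.read(header, read, header.length - read);
                if (n < 0) break; // file shorter than 8 bytes
                read += n;
            }
            return detect(Arrays.copyOf(header, read));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            System.out.println(name + " -> " + detect(name));
        }
    }
}
```

A file reported as ZIP would be opened with XWPFDocument, and one reported as OLE2 with the HWPF classes; anything else is probably not a Word document at all.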
Required JAR files:
- poi.jar
- poi-ooxml.jar
- poi-ooxml-schemas.jar
- poi-scratchpad.jar
- xmlbeans.jar
You can download all these JARs from the link below:
http://www.java2s.com
Here is a simple program that reads a .docx file using Apache POI:
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.poi.xwpf.usermodel.IBodyElement;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;
    import org.apache.poi.xwpf.usermodel.XWPFParagraph;
    import org.apache.poi.xwpf.usermodel.XWPFPicture;
    import org.apache.poi.xwpf.usermodel.XWPFRun;
    import org.apache.poi.xwpf.usermodel.XWPFTable;
    import org.apache.poi.xwpf.usermodel.XWPFTableCell;
    import org.apache.poi.xwpf.usermodel.XWPFTableRow;

    public class FileReader {

        /**
         * Read the pictures and tables of a .docx document.
         * @param filename path of the .docx file
         * @return one String[] per picture or table cell
         */
        public List<String[]> readDocument(String filename) throws IOException {
            List<String[]> varArray = new ArrayList<String[]>();
            int tableNo = 0;
            // try-with-resources closes both the stream and the document
            try (FileInputStream fis = new FileInputStream(filename);
                 XWPFDocument docx = new XWPFDocument(fis)) {
                // Walk paragraphs and tables in document order
                for (IBodyElement elem : docx.getBodyElements()) {
                    if (elem instanceof XWPFParagraph) {
                        XWPFParagraph para = (XWPFParagraph) elem;
                        String dataStyle = para.getStyle(); // style ID, may be null
                        for (XWPFRun run : para.getRuns()) {
                            for (XWPFPicture pic : run.getEmbeddedPictures()) {
                                String picdata = pic.getPictureData().getFileName();
                                detectFileExtension(varArray, "image", picdata);
                            }
                        }
                    } else if (elem instanceof XWPFTable) {
                        // "table" label is assumed; the original reused "image" here
                        readTableData((XWPFTable) elem, varArray, "table", tableNo);
                        tableNo++;
                    }
                }
            }
            return varArray;
        }

        private void readTableData(XWPFTable table, List<String[]> varArray, String st, int tableNo) {
            List<XWPFTableRow> rows = table.getRows();
            for (int i = 0; i < rows.size(); i++) {
                List<XWPFTableCell> cells = rows.get(i).getTableCells();
                for (int j = 0; j < cells.size(); j++) {
                    String cellData = cells.get(j).getText();
                    // tr and td were undefined in the original; assumed to be literal tags
                    varArray.add(new String[] {st, Integer.toString(tableNo), "tr",
                            Integer.toString(i), "td", cellData});
                }
            }
        }

        // Keep only pictures whose file name contains a known image extension
        private void detectFileExtension(List<String[]> varArray, String st, String picdata) {
            if (picdata.contains(".png") || picdata.contains(".jpg") || picdata.contains(".jpeg")) {
                varArray.add(new String[] {st, picdata});
            }
        }
    }
The program above uses three main classes:
- XWPFParagraph: reads the text or the style of a paragraph in the document.
- XWPFTable: reads table data from the document.
- XWPFPicture: reads the file name or the data of an embedded image.
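The extension check that the program applies to picture names can be tried on its own, without POI. The following is a minimal sketch under stated assumptions: `PictureNames` and `isSupportedImage` are hypothetical names, and the accepted extensions (.png, .jpg, .jpeg) are simply the ones the program above looks for. It matches on the end of the name and ignores case, which is a slightly stricter rule than the substring test used there.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Apache POI.
class PictureNames {

    // Extensions checked by the program in this post.
    private static final List<String> EXTENSIONS = Arrays.asList(".png", ".jpg", ".jpeg");

    /** Returns true if the file name ends with one of the known image extensions. */
    static boolean isSupportedImage(String fileName) {
        if (fileName == null) return false;
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (String ext : EXTENSIONS) {
            if (lower.endsWith(ext)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Sample names of the kind a picture's getFileName() might return
        for (String name : new String[] {"image1.png", "image2.JPG", "image3.emf"}) {
            System.out.println(name + " -> " + isSupportedImage(name));
        }
    }
}
```

Matching on the suffix avoids accepting a name such as "notes.png.txt", which a plain substring test would let through.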
Here are sample screenshots of a Word document being read by these Apache POI classes: