Friday, July 5, 2013

HTML To PDF Converter

HTMLToPDFConverter converts one or more HTML files to PDF files and is easy to use. The program offers two options. The first lets you convert HTML files stored on your local computer: a file open dialog appears, you select the files, and the program converts them. The conversion may take a moment to complete.

html to pdf converter local files selection


The second option converts HTML pages on the web to PDF files. Type or paste the address of the page into the Address box and click Add to add it to the conversion list. You can add as many pages as you want. After adding all the addresses to the list, click OK and wait a moment until the conversion finishes.

html to pdf converter web pages selection
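
Addresses typed into the Address box are handed to the converter exactly as entered, so a missing http:// prefix or a stray space only shows up when the conversion fails. Below is a minimal sketch of checking an address before it is added to the list. AddressCheck and normalize are hypothetical names that are not part of the program, and assuming http when no scheme is given is a choice made only for this example.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of HTMLToPDFConverter.
class AddressCheck {

    // Returns a cleaned-up http/https address, or null if the text
    // cannot be used as a web address.
    static String normalize(String typed) {
        String text = typed == null ? "" : typed.trim();
        if (text.isEmpty()) {
            return null;
        }
        // Assumption for this sketch: no scheme means the user meant http
        if (!text.contains("://")) {
            text = "http://" + text;
        }
        try {
            URI uri = new URI(text);
            String scheme = uri.getScheme();
            boolean web = "http".equalsIgnoreCase(scheme)
                    || "https".equalsIgnoreCase(scheme);
            // A web address also needs a host name
            return (web && uri.getHost() != null) ? uri.toString() : null;
        } catch (URISyntaxException e) {
            return null; // illegal characters, e.g. a space inside the host
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("www.example.com/page.html"));
        System.out.println(normalize("http://bad host/"));
    }
}
```

In the Add handler, such a check could run before the address goes into the list, and a null result could leave the text in the box for the user to correct.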


Technically, converting an HTML file to a PDF file takes a few steps:
-After the HTML file is read, it is cleaned with the Jsoup library.
-The cleaned HTML file is converted to an XHTML file with the JTidy library.
-Finally, the XHTML file is converted to a PDF file with the XMLWorker library.
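
The XHTML step matters because the last stage reads its input as XML, and ordinary HTML is often not well-formed XML (an unclosed <br> is enough to break it). The sketch below uses only the JDK's XML parser to show the difference. It is an illustration under that assumption, not what JTidy or XMLWorker do internally, and WellFormedCheck and isWellFormed are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of HTMLToPDFConverter.
class WellFormedCheck {

    // Returns true if the markup parses as well-formed XML.
    // Intended for small fragments; a document with a DOCTYPE may
    // make the parser try to fetch the referenced DTD.
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and ignores the rest,
            // which keeps the parser from printing messages to stderr
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the markup
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // HTML style: <br> is never closed
        System.out.println(isWellFormed("<p>line one<br>line two</p>"));   // false
        // XHTML style: <br /> closes itself
        System.out.println(isWellFormed("<p>line one<br />line two</p>")); // true
    }
}
```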

HTMLToPDFConverter source code:

import java.awt.*;
import java.awt.event.*;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.Scanner;
import javax.swing.*;
import javax.swing.filechooser.FileNameExtensionFilter;
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;
import org.jsoup.select.Elements;
import org.w3c.tidy.Tidy;
import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.tool.xml.XMLWorkerHelper;

public class HTMLToPDFConverter{

    //show the menu, read the choice, and start the matching conversion mode
    public static void main(String[] args){
        System.out.println("........HTML to PDF Converter......");
        Scanner sc=new Scanner(System.in);
        showOptions();
        System.out.println("Enter your choice:");
        //non-numeric input would make nextInt() throw, so treat it as invalid
        int ch=sc.hasNextInt()?sc.nextInt():-1;
        switch(ch){
            case 1: selectLocal();break;
            case 2: selectWeb();break;
            case 3: System.exit(0);break;
            default: System.out.println("Invalid choice");
        }
    }


    //let the user pick one or more local html files and convert each of them
    public static void selectLocal(){
        JFileChooser chooser = new JFileChooser();
        FileNameExtensionFilter filter = new FileNameExtensionFilter("HTML", "html", "htm");
        chooser.setFileFilter(filter);
        chooser.setMultiSelectionEnabled(true);
        int returnVal = chooser.showOpenDialog(null);
        if(returnVal == JFileChooser.APPROVE_OPTION) {
            File[] files=chooser.getSelectedFiles();
            System.out.println("Please wait...");
            for(int i=0;i<files.length;i++){
                //toURI() yields a well-formed file: URL (special characters
                //escaped, separators normalized), unlike "file:///"+path
                String address=files[i].toURI().toString();
                loadHTML(address);
                convertHTMLToPDF();
            }
            System.out.println("Conversion complete");
        }
    }
    //open the window that collects web addresses
    public static void selectWeb(){
        new LinksAdd("Add links");
    }

    //display a window to add links to the list for conversion

    static class LinksAdd extends JFrame implements ActionListener{
        DefaultListModel<String> listmodel;
        JTextField textlink;
        JLabel lblwait;

        LinksAdd(String title){
            Container cont=getContentPane();
            cont.setLayout(new BorderLayout());
            setTitle(title);
            setPreferredSize(new Dimension(600,300));
            //dispose the window when it is closed so the program can end
            setDefaultCloseOperation(JFrame.DISPOSE_ON_CLOSE);
            JLabel lbl=new JLabel("Address:");
            textlink=new JTextField(30);
            JButton btadd=new JButton("Add");
            btadd.addActionListener(this);
            JPanel panelnorth=new JPanel();
            panelnorth.add(lbl);
            panelnorth.add(textlink);
            panelnorth.add(btadd);
            listmodel=new DefaultListModel<String>();
            JList<String> linkslist=new JList<String>(listmodel);
            linkslist.setVisibleRowCount(10);
            linkslist.setFixedCellWidth(200);
            JScrollPane pane=new JScrollPane(linkslist);
            JButton btok=new JButton("OK");
            lblwait=new JLabel();
            JPanel panelsouth=new JPanel();
            panelsouth.add(btok);
            panelsouth.add(lblwait);
            btok.addActionListener(this);
            cont.add(panelnorth, BorderLayout.NORTH);
            cont.add(pane, BorderLayout.CENTER);
            cont.add(panelsouth, BorderLayout.SOUTH);

            pack();
            setVisible(true);
        }

        public void actionPerformed(ActionEvent e){
            if(e.getActionCommand().equals("Add")){
                String link=textlink.getText().trim();
                if(!link.equals("")){
                    listmodel.addElement(link);
                    textlink.setText("");
                }
            }
            else if(e.getActionCommand().equals("OK")){
                //Swing components should only be touched on the event dispatch thread,
                //so set the label and copy the addresses before the worker starts
                lblwait.setText("Please wait...");
                final Object[] links=listmodel.toArray();
                Thread t=new Thread(){
                    public void run(){
                        for(int i=0;i<links.length;i++){
                            loadHTML(links[i].toString());
                            convertHTMLToPDF();
                        }
                        //close the window back on the event dispatch thread
                        SwingUtilities.invokeLater(new Runnable(){
                            public void run(){
                                LinksAdd.this.dispose();
                            }
                        });
                    }
                };
                t.start();
            }
        }
    }

    //intermediate files go to the system temp directory instead of
    //a hard-coded d:/ path, which does not exist on every machine
    static final File CLEANED_HTML_FILE=new File(System.getProperty("java.io.tmpdir"),"cleanedhtml.html");
    static final File XHTML_FILE=new File(System.getProperty("java.io.tmpdir"),"tempfile.xhtml");

    //read, and convert the original html to xhtml file
    //by using parse method of Tidy class
    public static void loadHTML(String src){
        Tidy tidy=new Tidy();
        tidy.setMakeClean(true);
        tidy.setXHTML(true);
        try{
            URL url=new URL(src);
            File cleanedHTMLFile;
            try(InputStream in=url.openStream()){
                cleanedHTMLFile=cleanHTML(in);
            }
            try(FileInputStream fis=new FileInputStream(cleanedHTMLFile);
                FileOutputStream fos=new FileOutputStream(XHTML_FILE)){
                tidy.parse(fis,fos);
            }
        }catch(Exception e){
            System.out.println("The file can't be read: "+e.getMessage());
            System.exit(-1);
        }
    }

    //convert the xhtml file to PDF file by using
    //the parseXHtml method of the XMLWorkerHelper class
    public static void convertHTMLToPDF(){
        try(FileOutputStream pdfOut=new FileOutputStream(System.currentTimeMillis()+".pdf");
            FileInputStream xhtmlIn=new FileInputStream(XHTML_FILE)){
            Document doc=new Document();
            PdfWriter pw=PdfWriter.getInstance(doc, pdfOut);
            doc.open();
            XMLWorkerHelper.getInstance().parseXHtml(pw, doc, xhtmlIn);
            doc.close();
        }catch(Exception e){
            //report the cause so that a failed conversion can be diagnosed
            System.out.println("Conversion can't be completed: "+e.getMessage());
            System.exit(-1);
        }
    }

    //remove markup that is not on the whitelist and save the result to a file
    public static File cleanHTML(InputStream srcstream){
        try{
            org.jsoup.nodes.Document doc=Jsoup.parse(srcstream, "UTF-8", "");
            Whitelist wlist=Whitelist.relaxed();
            //ftp, http and https are protocols, not attributes; relaxed() already
            //allows them in links, so only real attribute names are added
            wlist.addAttributes(":all","style","href","class");
            //clean only the body and put the result back into the document
            String cleanedBody=Jsoup.clean(doc.body().html(),wlist);
            doc.body().html(cleanedBody);
            //write UTF-8 bytes; DataOutputStream.writeBytes keeps only the low byte of each character
            try(FileOutputStream fos=new FileOutputStream(CLEANED_HTML_FILE)){
                fos.write(doc.html().getBytes("UTF-8"));
            }
        }catch(IOException ie){System.out.println("Unable to process the file");}
        return CLEANED_HTML_FILE;
    }

    //print the menu; option 3 matches the exit case handled in main
    public static void showOptions(){
        System.out.println("1. html files from local computer");
        System.out.println("2. html files from the web");
        System.out.println("3. exit");
    }

}
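
The listing names each PDF after the current time in milliseconds, which avoids name clashes but says nothing about where the file came from. One possible alternative is to derive the name from the source address. This is only a sketch: PdfNames and pdfNameFor are hypothetical names, it assumes the address is a syntactically valid URI, and two pages with the same file name would still overwrite each other.

```java
import java.net.URI;

// Hypothetical helper, not part of HTMLToPDFConverter.
class PdfNames {

    // Builds an output name such as "page.pdf" from an address such as
    // "http://example.com/docs/page.html".
    // URI.create throws IllegalArgumentException for malformed input.
    static String pdfNameFor(String address) {
        URI uri = URI.create(address);
        String path = uri.getPath(); // null for opaque URIs such as mailto:
        String name = "";
        if (path != null) {
            // Ignore trailing slashes, then take the last path segment
            int end = path.length();
            while (end > 0 && path.charAt(end - 1) == '/') {
                end--;
            }
            int start = path.lastIndexOf('/', end - 1) + 1;
            name = path.substring(start, end);
        }
        // Drop the extension (.html, .htm, ...) if there is one
        int dot = name.lastIndexOf('.');
        if (dot > 0) {
            name = name.substring(0, dot);
        }
        // Fall back to the host name, then to a fixed name
        if (name.isEmpty()) {
            name = uri.getHost() != null ? uri.getHost() : "page";
        }
        return name + ".pdf";
    }

    public static void main(String[] args) {
        System.out.println(pdfNameFor("http://example.com/docs/page.html")); // page.pdf
        System.out.println(pdfNameFor("http://example.com/"));               // example.com.pdf
    }
}
```

Putting the timestamp in front of the derived name would keep the uniqueness of the original approach while still showing the source.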

13 comments:

  1. The program did not create the PDF with images?
    Thanks

  2. So far, there is still no free library that handles the HTML to PDF conversion task perfectly.

  3. This is useful Java code for converting HTML to PDF files. I had been looking for something like it for a long time.

    Replies
    1. You can try http://www.tagpdf.com/online/convert-pdf-to-html/ for high-quality online conversion of PDF to XHTML. There is even a 10% discount on the first order.

  4. Very informative, thanks for sharing. I came across another Java PDF component that converts PDF files to HTML format. Here is the link: Aspose.Pdf for Java. Could you please tell me how your product is different from this one?

    Replies
    1. This program can convert local HTML files to PDF files, and it can also download HTML pages from the internet and convert them to PDF files. Sorry, I am not sure what Aspose can do.

  5. Thanks for the reply, Dara.
    Here are some code examples I found in the Aspose documentation section. They can give you an overview of how the Aspose guys are doing it. Here are the links:

    http://www.aspose.com/docs/display/pdfnet/HTML+to+PDF+conversion
    http://www.aspose.com/docs/display/pdfnet/How+to+convert+HTML+to+PDF+using+InLineHTML+approach
    http://www.aspose.com/docs/display/pdfnet/Convert+PDF+file+into+HTML+format

    Thanks once again
    David

  6. I really like your method of converting HTML to PDF. However, at times a need may arise to convert JPG to PDF. For this purpose, I would like to refer to this amazing JPG to PDF converter. You must try it once and experience the smooth operation.

  7. I am getting an error while using method 1, i.e. fetching HTML files from the local computer. The error is:


    Tidy (vers 4th August 2000) Parsing "InputStream"
    line 2 column 8 - Warning: inserting missing 'title' element

    InputStream: Document content looks like HTML 2.0
    1 warnings/errors were found!

    Conversion can't be completed.


    Can you please help me resolve it?

  8. Hey, can you tell me how to fix it so it shows images in the PDF?
