Convert text to html, docx and pdf

Let’s convert text into multiple formats

This time we want to convert text into

  • html,
  • docx,
  • pdf.

Convert text to docx with the module python-docx

Let’s say we have a text file, text.txt, with these lines:

This is just a simple file

to convert

to pdf

27/9/19

To convert it to docx, we first install python-docx:

pip install python-docx

The last line of the installation output should look something like this (the version may differ):

Successfully installed python-docx-0.8.10

Now we can import the class Document from docx (that is the name the module has when we import it):

from docx import Document

Then we create an instance of the class Document, which represents a brand new docx (Word) file.

doc = Document()

Now we can do many things with doc.

Let’s load the text from text.txt and put it into doc as a paragraph:

with open("text.txt", 'r', encoding='utf-8') as file:
    doc.add_paragraph(file.read())

All we have to do now is to save the file with a name of our choice (textdoc.docx in the example below):

doc.save("textdoc.docx")

This is the whole code:

import os

from docx import Document

doc = Document()
with open("text.txt", 'r', encoding='utf-8') as file:
    doc.add_paragraph(file.read())
doc.save("text.docx")
# open the new file (os.startfile is available on Windows only)
os.startfile("text.docx")

Running the script creates text.docx and opens it in the default application for docx files.
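Reading the whole file into a single paragraph puts all the lines in one Word paragraph. If you would rather have one paragraph per line, a minimal sketch could look like the one below. The helper names split_paragraphs and txt_to_docx are made up for this example; Document, add_paragraph and save are the same python-docx calls used above.

```python
def split_paragraphs(text):
    """Return the non-empty lines of text, stripped of surrounding whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def txt_to_docx(txt_path, docx_path):
    """Write each non-empty line of txt_path as its own paragraph in docx_path."""
    # Imported here so split_paragraphs can be used without python-docx installed.
    from docx import Document

    with open(txt_path, "r", encoding="utf-8") as file:
        lines = split_paragraphs(file.read())

    doc = Document()
    for line in lines:
        doc.add_paragraph(line)
    doc.save(docx_path)
```

Calling txt_to_docx("text.txt", "text_lines.docx") would then write the four lines of the sample file as four separate paragraphs (text_lines.docx is just an example name).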

Text to pdf with wkhtmltopdf!

Do this:

  • pip install pdfkit
  • install wkhtmltopdf from https://wkhtmltopdf.org/
  • add the folder that contains wkhtmltopdf.exe to the PATH environment variable
  • restart the pc
  • use the code below
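Before running the conversion it can help to check that the PATH step worked. The sketch below uses shutil.which from the standard library for that; find_wkhtmltopdf and make_config are hypothetical helper names, while pdfkit.configuration(wkhtmltopdf=...) is the pdfkit call for pointing at an explicit executable.

```python
import shutil


def find_wkhtmltopdf(name="wkhtmltopdf"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


def make_config(exe_path):
    """Build a pdfkit configuration that points at an explicit executable."""
    # Imported lazily so the PATH check above works even without pdfkit.
    import pdfkit

    return pdfkit.configuration(wkhtmltopdf=exe_path)


path = find_wkhtmltopdf()
if path is None:
    print("wkhtmltopdf was not found on PATH")
else:
    print("wkhtmltopdf found at", path)
```

If the executable is not on PATH, you can skip the environment variable and pass configuration=make_config(...) to pdfkit.from_file or pdfkit.from_url, giving it the full path of wkhtmltopdf.exe on your machine.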

We could now convert the docx file into pdf from Word, but we want to do it with Python, of course. Opening Word by hand wastes too much time.

From TXT to HTML to PDF

It is as simple as transforming the text into html and then the html into pdf. For me this is the best way: I am used to html tags, and they are very simple for anyone to learn. The html page can be rendered easily and then converted into a pdf. In this example we will use https://wkhtmltopdf.org/ and the pdfkit module for Python.

We will simply read the txt file and replace the \n newline characters with the html tag <br>, and nothing else, to keep it simple.


The code to convert txt to pdf (via html and pdfkit)

# from txt to html, then from html to pdf
# requires wkhtmltopdf to be installed and on PATH
import os
import pdfkit

with open("text.txt") as source:
    with open("text.html", "w") as output:
        text = source.read()
        # turn every newline into an html line break
        text = text.replace("\n", "<br>")
        output.write(text)

pdfkit.from_file("text.html", "output.pdf")

# open the result (os.startfile is available on Windows only)
os.startfile("output.pdf")
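The script above only replaces newlines, so characters such as < and & in the text end up in the html unescaped and may be read as markup. A slightly more careful sketch, assuming you want a complete html page with a declared charset, could use html.escape from the standard library. The function name text_to_html and the sample string are made up for this example.

```python
import html


def text_to_html(text, title="text"):
    """Wrap plain text in a minimal HTML page, escaping special characters."""
    # Escape first, then add the line breaks, so the <br> tags survive.
    body = html.escape(text).replace("\n", "<br>\n")
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n"
        f"<body>\n{body}\n</body>\n"
        "</html>\n"
    )


page = text_to_html("1 < 2 & 3 > 2\nsecond line")
print(page)
```

The returned string can be written to text.html with encoding="utf-8" and then passed to pdfkit.from_file exactly as before.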

Convert from url with pdfkit.from_url

You can also convert an html page to pdf straight from its url. You could do the same from Chrome, by printing the page and choosing to save it as a pdf, but with this code you do not have to keep switching the print destination between the printer and "Save as PDF" whenever you just need to print a page. That saves time!

# from url to pdf
# requires wkhtmltopdf to be installed and on PATH
import os
import pdfkit

pdfkit.from_url("http://www.gutenberg.org/files/1112/1112.txt", "romeoandjuliet.pdf")

os.startfile("romeoandjuliet.pdf")

This will give you the pdf of Romeo and Juliet by William Shakespeare from http://gutenberg.org.

From text to pdf directly

You can also convert plain text to pdf without transforming it into html.

# from txt to pdf directly
# requires wkhtmltopdf to be installed and on PATH
import os
import pdfkit


pdfkit.from_file("text.txt", "text_pdf.pdf")

os.startfile("text_pdf.pdf")
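All the scripts in this post open the result with os.startfile, which exists on Windows only. If you run them on macOS or Linux, a small helper along these lines could replace that call. This is a sketch under assumptions: opener_command and open_file are hypothetical names, and it relies on the open command on macOS and on xdg-open being installed on Linux.

```python
import os
import subprocess
import sys


def opener_command(path, platform=sys.platform):
    """Return the command that opens path, or None where os.startfile applies."""
    if platform.startswith("win"):
        return None  # Windows: use os.startfile instead of a command
    if platform == "darwin":
        return ["open", path]  # macOS
    return ["xdg-open", path]  # most Linux desktops


def open_file(path):
    """Open path with the default application for the current platform."""
    command = opener_command(path)
    if command is None:
        os.startfile(path)
    else:
        subprocess.run(command, check=False)
```

With this in place, open_file("text_pdf.pdf") takes the place of os.startfile("text_pdf.pdf").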

 


Published by pythonprogramming

Started with BASIC on the Spectrum, loved JavaScript in the 90s and Python in the 2000s; now I am back with Python, still making some JavaScript stuff when needed.