Let’s convert text in multiple formats
This time we want to convert text into
- html,
- docx,
- pdf.
Convert text to docx with the module python-docx
Let’s say we have a text file with these lines of code:
This is just a simple file to convert to pdf 27/9/19
To convert it to docx, we install python-docx (pip install python-docx) and then we import this module:
The last line of the installation should be something like this (the version could be different):
Successfully installed python-docx-0.8.10
Now, we can import the class Document from docx (that’s how we call the module when we import it):
from docx import Document
Then we create an instance of the class Document that will create a brand new docx file (word).
doc = Document()
Now we can do many stuffs with doc.
Let’s load the text from text.txt and put it into doc as a paragraph
with open(input_txt, 'r', encoding='utf-8') as file: doc.add_paragraph(file.read())
All we have to do now is to save the file with a name of our choice (textdoc.docx in the example below):
doc.save("textdoc.docx")
This is the whole code:
from docx import Document doc = Document() with open("text.txt", 'r', encoding='utf-8') as file: doc.add_paragraph(file.read()) doc.save("text.docx") os.startfile("text.docx")
This is the output:
Text to pdf with wkthtmltopdf!
Do this:
- pip install pdfkit
- install https://wkthtmltopdf.org/
- add the path to wkhtmltopdf.exe to the environmental variables
- restart pc
- use the code below
We could, now, transform the docx file into pdf from word… but we want to make it with Python, of course, we do not want to open Word… too much time wasted.
From TXT=> to HTML=> to PDF
Simple as transforming text into html and then html into pdf. For me this is the best way. I am used to the html tags and it is very simple for anyone to learn them. The html page can be rendered easily and then converted in a pdf. In this example we will use https://wkhtmltopdf.org/ and the pdfkit module for Python.
We will simply read the txt file, transforming the \n new line characters into the html tag <br> and nothing else to make it simple.
Video about adding environmental variable for wkthtmltopdf.exe
The code to convert txt to pdf (via html and pdfkit)
# from txt to html # install wkthtml import os import pdfkit with open("text.txt") as file: with open ("text.html", "w") as output: file = file.read() file = file.replace("\n", "<br>") output.write(file) #os.startfile("text.txt") #os.startfile("text.html") pdfkit.from_file("text.html", "output.pdf") os.startfile("output.pdf")
Convert from url with pdfkit.from_url
You can also convert html to pdf from url, that is very convenient even if you can do it from Chrome chosing to print an html page from the browser and then choosing to save it as a pdf, becaus with this code you do not have to change from print the page to save as pdf and viceversa when you need to just print the page! Saving time!
# from txt to html # install wkthtml import os import pdfkit pdfkit.from_url("http://www.gutenberg.org/files/1112/1112.txt", "romeoandjuliet.pdf") os.startfile("romeoandjuliet.pdf")
This will give you the pdf of Romeo and Juliet by William Shakespeare from http://gutenberg.org.
From text to pdf directly
You can also convert plain text to pdf without transforming it into html.
# from txt to html # install wkthtml import os import pdfkit pdfkit.from_file("text.txt", "text_pdf.pdf") os.startfile("text_pdf.pdf")