Monday, February 01, 2010

Convert PowerPoint to HTML with Python

After I converted MS Word to HTML (and fed it to the application), the next stage was to convert MS PowerPoint to HTML.
I thought it would be rather straightforward, given the success I had converting Word to HTML with the OpenOffice headless API. It wasn't.
OpenOffice does convert ppt to HTML (filter "impress_html_Export"). The output is a set of files in which each ppt slide is converted to an image (screenshot) and an HTML page. While the screenshots are good, the HTML is not satisfactory. Images embedded in the ppt don't appear in the converted HTML, and neither do tables. In addition, the "2 column layout" produced HTML with only the left-column text, leaving the right-column text out. The same happened for any content added to a blank layout template (e.g. text boxes). Finally, numbered lists (ol) were converted to bullets (ul).
Needless to say this solution is out of the question.

So here I was, looking for a way to convert ppt to html, using Java or (preferably) Python.
Looking for a Python module to do the job, I found win32com, which may be good but is not relevant for me since our servers don't run Windows. Although win32com CAN run on Debian, I preferred working with software that is not Windows-dependent.

AND THEN... I found odfpy.
It's GPL software that describes itself as "Python API and tools to manipulate OpenDocument files".
Since an OpenOffice document is basically an archive file, this module reads and writes the archive structure, allowing for easy manipulation of all kinds of OpenOffice formats.
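To see the "archive file" part for yourself, here is a minimal sketch that uses only Python's standard zipfile module (not odfpy) to list what is stored inside an OpenDocument file. The function name list_odf_members is my own, made up for illustration.

```python
import zipfile


def list_odf_members(path):
    """Return the names of the files stored inside an OpenDocument package."""
    # An OpenDocument file is a zip archive, so zipfile can open it directly.
    with zipfile.ZipFile(path) as package:
        return package.namelist()
```

Calling it on a presentation (say, a hypothetical slides.odp) typically shows entries such as content.xml, styles.xml and META-INF/manifest.xml.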
In addition, it has some built-in scripts for common tasks, e.g. odf2xhtml (which I'm using), odfoutline, csv2odf and more.
SO, I'm converting the ppt to odp using the OpenOffice headless API, and then converting the odp to HTML using odfpy.
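The two steps can be sketched roughly like this. It is only an illustration, not my exact setup: it assumes a soffice binary on the PATH that accepts the --headless and --convert-to command-line options (as current LibreOffice builds do) instead of talking to the headless API directly, and it assumes odfpy is installed. The helper names build_convert_command and ppt_to_html are made up for this example.

```python
import subprocess
from pathlib import Path


def build_convert_command(ppt_path, out_dir, soffice="soffice"):
    """Return the command line that asks a headless office to write an .odp file."""
    # Assumption: "soffice" is on the PATH and supports --convert-to.
    return [
        soffice, "--headless",
        "--convert-to", "odp",
        "--outdir", str(out_dir),
        str(ppt_path),
    ]


def ppt_to_html(ppt_path, out_dir):
    """Convert ppt -> odp (office suite) -> XHTML (odfpy); return the HTML path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: ppt -> odp using the headless office process.
    subprocess.run(build_convert_command(ppt_path, out_dir), check=True)
    odp_path = out_dir / (Path(ppt_path).stem + ".odp")

    # Step 2: odp -> XHTML using odfpy, the same class the odf2xhtml script wraps.
    # Imported here so that step 1 can be tried even without odfpy installed.
    from odf.odf2xhtml import ODF2XHTML
    html = ODF2XHTML().odf2xhtml(str(odp_path))

    html_path = odp_path.with_suffix(".html")
    html_path.write_text(html, encoding="utf-8")
    return html_path
```

With something like this in place, ppt_to_html("deck.ppt", "out") would leave both the intermediate odp and the final HTML file in the out directory (deck.ppt being a placeholder name).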

And it works!