word转html

Python

背景

有的业务场景是用户上传word文件，读取word文件里的内容后台做处理

其中一种方法是讲word转换成html

这里有两个坑是：

word文件分高版本.docx和低版本.doc
需要提取word文件里的图片

.docx和.doc的区别是.docx是标准xml文件，.doc则是微软的自有格式，所以需要分别处理

解决方案

.doc

需要用到abiword

abiword的安装参照这里：http://www.linuxquestions.org/questions/linux-software-2/is-it-possible-to-install-abiword-on-centos-7-a-4175582704/
You can get the fc19 packages as one tar archive : abiword-fc19.tar, 8.6MB https://drive.google.com/file/d/0B7S…ew?usp=sharing
```
$ tar xvf abiword-fc19.tar
$ cd abiword-fc19/
$ yum install *.rpm
```
需要注意的是：.doc或许不到文件里面的图片
.docx

用python的mammoth包即可，可获取文件里面的图片

word_file = request.files.get('word_upload')
# 处理.doc
if word_file.filename.endswith(".doc"):
    # 将文件存到本地
    word_path = "{}/static/{}-{}.doc".format(ROOT_DIR, user.id,
                                             int(time.time()))
    word_file.save(word_path)
    word_file.close()
    # abiword命令，生成一个同名的html文件
    command_str = (
        "abiword --to html {} "
        "--exp-props='html4:yes; embed-images:yes'").format(
            word_path)
    os.system(command_str)
    html_path = word_path[:-4] + ".html"
    f = open(html_path, 'r')
    html = f.read()
    html = unicode(html, "utf-8")
    # 删除两个文件
    os.remove(word_path)
    os.remove(html_path)
# 处理.docx
elif word_file.filename.endswith(".docx"):
    # 用mammoth包即可
    result = mammoth.convert_to_html(word_file)
    html = result.value