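Converting HTML to Markdown in Python with the html2text Library: A Detailed Guide

Updated: 2020-02-21 17:01:55  Submitted by: WDC

This article explains how to convert HTML to Markdown in Python with the html2text library. Readers who need this may find it a useful reference.

If you search PyPI for html2text, the package you find is Alir3z4/html2text rather than the original. It was forked from aaronsw/html2text and extends the original's functionality. Because this fork is the one that pip installs, it is the library this article covers.

First, install it: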

pip install html2text

Using html2text from the command line

Once installed, the html2text command is available for a range of operations.

The command is used as html2text [(filename|url) [encoding]]. Running html2text -h lists the options it supports:

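Option	Description
`--version`	Show the program's version number and exit
`-h, --help`	Show the help message and exit
`--no-wrap-links`	Do not wrap links during conversion
`--ignore-emphasis`	Do not include any formatting for emphasis
`--reference-links`	Use reference-style links instead of inline links
`--ignore-links`	Do not include any formatting for links
`--protect-links`	Protect links from line breaks by surrounding them with angle brackets
`--ignore-images`	Do not include any formatting for images
`--images-to-alt`	Discard image data and keep only the alt text
`--images-with-size`	Write image tags as raw HTML with height and width attributes to preserve dimensions
`-g, --google-doc`	Convert a Google Doc that was exported as HTML
`-d, --dash-unordered-list`	Use a dash rather than an asterisk for unordered list items
`-e, --asterisk-emphasis`	Use an asterisk rather than an underscore for emphasized text
`-b BODY_WIDTH, --body-width=BODY_WIDTH`	Number of characters per output line; 0 disables wrapping
`-i LIST_INDENT, --google-list-indent=LIST_INDENT`	Number of pixels Google indents nested lists
`-s, --hide-strikethrough`	Hide struck-through text. Only relevant when -g is also specified
`--escape-all`	Escape all special characters. The output is less readable, but avoids formatting problems in corner cases
`--bypass-tables`	Format tables as HTML rather than Markdown syntax
`--single-line-break`	Use a single line break after a block element rather than two. Note: requires --body-width=0
`--unicode-snob`	Use Unicode characters throughout the document
`--no-automatic-links`	Do not use automatic links wherever they would apply
`--no-skip-internal-links`	Do not skip internal links
`--links-after-para`	Put links after each paragraph instead of at the end of the document
`--mark-code`	Mark code blocks with [code] ... [/code]
`--decode-errors=DECODE_ERRORS`	How to handle decode errors. Accepted values are 'ignore', 'strict' and 'replace'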

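Typical usage:

# Pass a URL (supported by older releases; later releases removed network fetching)
html2text http://eepurl.com/cK06Gn

# Pass a file name and set the encoding to utf-8
html2text test.html utf-8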

Using html2text in a script

Besides running html2text from the command line, we can also import it as a library in a script.

We use the following HTML text as an example:

html_content = """
<span style="font-size:14px"><a href="http://blog.yhat.com/posts/visualize-nba-pipelines.html" rel="external nofollow" target="_blank" style="color: #1173C7;text-decoration: underline;font-weight: bold;">Data Wrangling 101: Using Python to Fetch, Manipulate &amp; Visualize NBA Data</a></span><br>
A tutorial using pandas and a few other packages to build a simple datapipe for getting NBA data. Even though this tutorial is done using NBA data, you don't need to be an NBA fan to follow along. The same concepts and techniques can be applied to any project of your choosing.<br>
"""

Converting the HTML text to Markdown takes a single call:

import html2text
print(html2text.html2text(html_content))

Output:

[Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA

Data](http://blog.yhat.com/posts/visualize-nba-pipelines.html)

A tutorial using pandas and a few other packages to build a simple datapipe

for getting NBA data. Even though this tutorial is done using NBA data, you

don't need to be an NBA fan to follow along. The same concepts and techniques

can be applied to any project of your choosing.

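You can also use the options listed above by setting them as attributes on an HTML2Text instance:

import html2text
h = html2text.HTML2Text()
print(h.handle(html_content)) # same output as above

Note: each section below shows only the output with the option set. Without the option, the default value applies and the output is the same as above (unless stated otherwise).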

--ignore-emphasis

With the --ignore-emphasis option:

h.ignore_emphasis = True
print(h.handle("<p>hello, this is <em>Ele</em></p>"))

Output:

hello, this is Ele

Without the --ignore-emphasis option:

h.ignore_emphasis = False # default
print(h.handle("<p>hello, this is <em>Ele</em></p>"))

Output:

hello, this is _Ele_

--reference-links

h.inline_links = False
print(h.handle(html_content))

Output:

[Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA

Data][16]

A tutorial using pandas and a few other packages to build a simple datapipe

for getting NBA data. Even though this tutorial is done using NBA data, you

don't need to be an NBA fan to follow along. The same concepts and techniques

can be applied to any project of your choosing.

[16]: http://blog.yhat.com/posts/visualize-nba-pipelines.html

--ignore-links

h.ignore_links = True
print(h.handle(html_content))

Output:

Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA Data

A tutorial using pandas and a few other packages to build a simple datapipe

for getting NBA data. Even though this tutorial is done using NBA data, you

don't need to be an NBA fan to follow along. The same concepts and techniques

can be applied to any project of your choosing.

--protect-links

h.protect_links = True
print(h.handle(html_content))

Output:

[Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA

Data](<http://blog.yhat.com/posts/visualize-nba-pipelines.html>)

A tutorial using pandas and a few other packages to build a simple datapipe

for getting NBA data. Even though this tutorial is done using NBA data, you

don't need to be an NBA fan to follow along. The same concepts and techniques

can be applied to any project of your choosing.

--ignore-images

h.ignore_images = True
print(h.handle('<p>This is a img: <img src="https://my.oschina.net/img/hot3.png" style="max-height: 32px; max-width: 32px;" alt="hot3"> ending ...</p>'))

Output:

This is a img: ending ...

--images-to-alt

h.images_to_alt = True
print(h.handle('<p>This is a img: <img src="https://my.oschina.net/img/hot3.png" style="max-height: 32px; max-width: 32px;" alt="hot3"> ending ...</p>'))

Output:

This is a img: hot3 ending ...

--images-with-size

h.images_with_size = True
print(h.handle('<p>This is a img: <img src="https://my.oschina.net/img/hot3.png" height=32px width=32px alt="hot3"> ending ...</p>'))

Output:

This is a img: <img src='https://my.oschina.net/img/hot3.png' width='32px'

height='32px' alt='hot3' /> ending ...

--body-width

h.body_width = 0
print(h.handle(html_content))

Output:

[Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA Data](http://blog.yhat.com/posts/visualize-nba-pipelines.html)

A tutorial using pandas and a few other packages to build a simple datapipe for getting NBA data. Even though this tutorial is done using NBA data, you don't need to be an NBA fan to follow along. The same concepts and techniques can be applied to any project of your choosing.

--mark-code

h.mark_code = True
print(h.handle('<pre class="hljs css"><code class="hljs css">&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-selector-tag"><span class="hljs-selector-tag">rpm</span></span>&nbsp;<span class="hljs-selector-tag"><span class="hljs-selector-tag">-Uvh</span></span>&nbsp;<span class="hljs-selector-tag"><span class="hljs-selector-tag">erlang-solutions-1</span></span><span class="hljs-selector-class"><span class="hljs-selector-class">.0-1</span></span><span class="hljs-selector-class"><span class="hljs-selector-class">.noarch</span></span><span class="hljs-selector-class"><span class="hljs-selector-class">.rpm</span></span></code></pre>'))

Output:

[code]

rpm -Uvh erlang-solutions-1.0-1.noarch.rpm

[/code]

In this way you can script and customize an automated HTML -> Markdown workflow. The script below is one example:

# -*- coding: utf-8 -*-
# Python 3 version: the Python 2 reload(sys)/setdefaultencoding workaround is not needed
import re
import requests
from lxml import etree
import html2text


# Get the first (latest) issue from the archive page
def get_first_issue(url):
  resp = requests.get(url)
  page = etree.HTML(resp.text)
  issue_list = page.xpath("//ul[@id='archive-list']/div[@class='display_archive']/li/a")
  fst_issue = dict(issue_list[0].attrib) # copy the <a> attributes (href, title, ...)
  fst_issue["text"] = issue_list[0].text
  return fst_issue


# Get the issue content and convert it to Markdown
def get_issue_md(url):
  resp = requests.get(url)
  page = etree.HTML(resp.text)
  # alternative selector: '//table[@class="bodyTable"]'
  content = page.xpath("//table[@id='templateBody']")[0]
  h = html2text.HTML2Text()
  h.body_width = 0 # do not wrap lines
  # encoding="unicode" makes tostring return str instead of bytes
  return h.handle(etree.tostring(content, encoding="unicode"))

# Maps the newsletter's bold section titles to Chinese Markdown headings
subtitle_mapping = {
  '**From Our Sponsor**': '# 来自赞助商',
  '**News**': '# 新闻',
  '**Articles**,** Tutorials and Talks**': '# 文章，教程和讲座',
  '**Books**': '# 书籍',
  '**Interesting Projects, Tools and Libraries**': '# 好玩的项目，工具和库',
  '**Python Jobs of the Week**': '# 本周的Python工作',
  '**New Releases**': '# 最新发布',
  '**Upcoming Events and Webinars**': '# 近期活动和网络研讨会',
}
def clean_issue(content):
  # Remove 'Share Python Weekly' and everything after it
  # (re.DOTALL lets '.' match newlines so the whole tail is removed)
  content = re.sub(r'\*\*Share Python Weekly.*', '', content, flags=re.IGNORECASE | re.DOTALL)
  # Convert the section titles to headings
  for k, v in subtitle_mapping.items():
    content = content.replace(k, v)
  return content

# Output template; "原文" means "Original"
tpl_str = """原文：[{title}]({url})
---
{content}
"""
def run():
  issue_list_url = "https://us2.campaign-archive.com/home/?u=e2e180baf855ac797ef407fc7&id=9e26887fc5"
  print("Fetching the latest issue...")
  fst = get_first_issue(issue_list_url)
  #fst = {'href': 'http://eepurl.com/dqpDyL', 'title': 'Python Weekly - Issue 341'}
  print("Done. Extracting the issue content and converting it to Markdown")
  content = get_issue_md(fst['href'])
  print("Cleaning the issue content")
  content = clean_issue(content)

  print("Cleaning done, writing", fst['title'], "to a file")
  title = fst['title'].replace('- ', '').replace(' ', '_').strip()
  # Text mode with an explicit encoding, since the content is str
  with open(title + '.md', "w", encoding="utf-8") as f:
    f.write(tpl_str.format(title=fst['title'], url=fst['href'], content=content))
  print("All done. File saved as %s.md" % title)

if __name__ == '__main__':
  run()

This script runs once a week and converts the latest Python Weekly issue to Markdown.
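The cleaning step can be tried without any network access. The sketch below applies the same two operations to a small made-up string: it cuts everything from "**Share Python Weekly" onward, then swaps bold section titles for headings. The sample text and the two-entry `SUBTITLES` subset are assumptions for illustration, and `re.DOTALL` is used on the assumption that the whole tail should be removed rather than just the rest of that line.

```python
import re

# Hypothetical two-entry subset of the script's title mapping.
SUBTITLES = {
    "**News**": "# 新闻",
    "**Books**": "# 书籍",
}


def clean_issue(content):
    """Drop the sharing footer, then turn bold section titles into headings."""
    # re.DOTALL lets '.' match newlines, so the whole tail is removed.
    content = re.sub(
        r"\*\*Share Python Weekly.*", "", content,
        flags=re.IGNORECASE | re.DOTALL,
    )
    for old, new in SUBTITLES.items():
        content = content.replace(old, new)
    return content


if __name__ == "__main__":
    sample = (
        "**News**\n\nItem one\n\n"
        "**Books**\n\nItem two\n\n"
        "**Share Python Weekly**\n\nfooter links\n"
    )
    print(clean_issue(sample))
```

Plain `str.replace` is enough here because the section titles are fixed strings; a regular expression is only needed for the open-ended footer.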

That wraps up html2text. If it does not meet your needs, or you want to add features, you can fork it and modify it yourself.
