Python lxml模块的基本使用方法分析

更新时间：2019年12月21日 11:21:51 作者：Dylan HU

这篇文章主要介绍了Python lxml模块的基本使用方法,结合实例形式分析了Python安装与使用lxml模块常见操作技巧与相关注意事项,需要的朋友可以参考下

本文实例讲述了Python lxml模块的基本使用方法。分享给大家供大家参考，具体如下：

1 lxml的安装

安装方式：pip install lxml

2 lxml的使用

2.1 lxml模块的入门使用

导入lxml 的 etree 库 (导入没有提示不代表不能用)

from lxml import etree

利用etree.HTML，将字符串转化为Element对象,Element对象具有xpath的方法,返回结果的列表，能够接受bytes类型的数据和str类型的数据

html = etree.HTML(text) 
ret_list = html.xpath("xpath字符串")

把转化后的element对象转化为字符串，返回bytes类型结果 etree.tostring(element)

假设我们现有如下的html字符换，尝试对他进行操作

<div> <ul> 
<li class="item-1"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li> 
<li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
<li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
<li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
<li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> # 注意，此处缺少一个 </li> 闭合标签 
</ul> </div>

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li> 
    <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
    <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
    <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
    <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> 
    </ul> </div> '''
html = etree.HTML(text)
print(type(html)) 
handeled_html_str = etree.tostring(html).decode()
print(handeled_html_str)

输出为

<class 'lxml.etree._Element'>
<html><body><div> <ul>
        <li class="item-1"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li>
        <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li>
        <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li>
        <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li>
        <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a>
        </li></ul> </div> </body></html>

可以发现，lxml确实能够把确实的标签补充完成，但是请注意lxml是人写的，很多时候由于网页不够规范，或者是lxml的bug，即使参考url地址对应的响应去提取数据，任然获取不到，这个时候我们需要使用etree.tostring的方法，观察etree到底把html转化成了什么样子，即根据转化后的html字符串去进行数据的提取。

2.2 lxml的深入练习

接下来我们继续操作，假设每个class为item-1的li标签是1条新闻数据，如何把这条新闻数据组成一个字典

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li> 
    <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
    <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
    <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
    <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> 
    </ul> </div> '''
html = etree.HTML(text)
#获取href的列表和title的列表
href_list = html.xpath("//li[@class='item-1']/a/@href")
title_list = html.xpath("//li[@class='item-1']/a/text()")
#组装成字典
for href in href_list:
  item = {}
  item["href"] = href
  item["title"] = title_list[href_list.index(href)]
  print(item)

输出为

{'href': 'link1.html', 'title': 'first item'}
{'href': 'link2.html', 'title': 'second item'}
{'href': 'link4.html', 'title': 'fourth item'}

假设在某种情况下，某个新闻的href没有，那么会怎样呢？

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a>first item</a></li> 
    <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
    <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
    <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
    <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> 
    </ul> </div> '''

结果是

{'href': 'link2.html', 'title': 'first item'}
{'href': 'link4.html', 'title': 'second item'}

数据的对应全部错了，这不是我们想要的，接下来通过2.3小节的学习来解决这个问题

2.3 lxml模块的进阶使用

前面我们取到属性，或者是文本的时候，返回字符串但是如果我们取到的是一个节点，返回什么呢?

返回的是element对象，可以继续使用xpath方法，对此我们可以在后面的数据提取过程中：先根据某个标签进行分组，分组之后再进行数据的提取

示例如下：

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a>first item</a></li> 
    <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
    <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
    <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
    <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> 
    </ul> </div> '''
html = etree.HTML(text)
li_list = html.xpath("//li[@class='item-1']")
print(li_list)

结果为：

[<Element li at 0x11106cb48>, <Element li at 0x11106cb88>, <Element li at 0x11106cbc8>]

可以发现结果是一个element对象，这个对象能够继续使用xpath方法

先根据li标签进行分组，之后再进行数据的提取

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a>first item</a></li> 
    <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
    <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
    <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
    <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> 
    </ul> </div> '''
#根据li标签进行分组
html = etree.HTML(text)
li_list = html.xpath("//li[@class='item-1']")
#在每一组中继续进行数据的提取
for li in li_list:
  item = {}
  item["href"] = li.xpath("./a/@href")[0] if len(li.xpath("./a/@href"))>0 else None
  item["title"] = li.xpath("./a/text()")[0] if len(li.xpath("./a/text()"))>0 else None
  print(item)

结果是：

{'href': None, 'title': 'first item'}
{'href': 'link2.html', 'title': 'second item'}
{'href': 'link4.html', 'title': 'fourth item'}

前面的代码中，进行数据提取需要判断，可能某些一面不存在数据的情况，对应的可以使用三元运算符来解决

PS：这里再为大家提供几款关于xml操作的在线工具供大家参考使用：

在线XML/JSON互相转换工具：
http://tools.jb51.net/code/xmljson

在线格式化XML/在线压缩XML：
http://tools.jb51.net/code/xmlformat

XML在线压缩/格式化工具：
http://tools.jb51.net/code/xml_format_compress

XML代码在线格式化美化工具：
http://tools.jb51.net/code/xmlcodeformat

更多关于Python相关内容感兴趣的读者可查看本站专题：《Python操作xml数据技巧总结》、《Python数据结构与算法教程》、《Python Socket编程技巧总结》、《Python函数使用技巧总结》、《Python字符串操作技巧汇总》、《Python入门与进阶经典教程》及《Python文件与目录操作技巧汇总》

希望本文所述对大家Python程序设计有所帮助。

您可能感兴趣的文章:

Python中的//符号是什么意思呢
这篇文章主要介绍了Python中的//符号是什么意思，具有很好的参考价值，希望对大家有所帮助。如有错误或未考虑完全的地方，望不吝赐教
2022-05-05
python怎么使用xlwt操作excel你知道吗
这篇文章主要为大家介绍了python使用xlwt操作excel的方法，具有一定的参考价值，感兴趣的小伙伴们可以参考一下，希望能够给你带来帮助
2022-01-01
python使用selenium模拟浏览器进入好友QQ空间留言功能
这篇文章主要介绍了python使用selenium模拟浏览器进入好友QQ空间留言,在本文实现过程中需要注意的是留言框和发表按钮在不同的frame，发表在外面的一层，具体实现过程跟随小编一起看看吧
2022-04-04
Python中的Numpy矩阵操作
这篇文章主要介绍了Python中的Numpy矩阵操作，小编觉得挺不错的，现在分享给大家，也给大家做个参考。一起跟随小编过来看看吧
2018-08-08
python-web根据元素属性进行定位的方法
这篇文章主要介绍了python-web根据元素属性进行定位的方法，本文给大家介绍的非常详细，具有一定的参考借鉴价值,需要的朋友可以参考下
2019-12-12
python 从csv读数据到mysql的实例
今天小编就为大家分享一篇python 从csv读数据到mysql的实例，具有很好的参考价值，希望对大家有所帮助。一起跟随小编过来看看吧
2018-06-06
python sqlite的Row对象操作示例
这篇文章主要介绍了python sqlite的Row对象操作,结合实例形式分析了Python使用sqlite的Row对象进行数据的查询操作相关实现技巧,需要的朋友可以参考下
2019-09-09
利用Python对哥德巴赫猜想进行检验和推理
数学是一个奇妙的东西，对此，也衍生出了许多的悖论与猜想。这篇文章会对哥德巴赫猜想用编程语言进行检验和推理，感兴趣的小伙伴可以跟随小编一起学习一下
2022-12-12
python实现读取excel写入mysql的小工具详解
EXCEL 和 MySQL 大体上来说都可以算是"数据库"，MySQL貌似有EXCEL的接口，但是最近在自学Python，用Python实现了一下，下面这篇文章主要给大家介绍了关于利用python实现读取excel写入mysql的一个小工具，需要的朋友可以参考下。
2017-11-11
三种Python比较两个时间序列在图形上是否相似的方法分享
这篇文章主要为大家详细介绍了三种Python中比较两个时间序列在图形上是否相似的方法，文中的示例代码简洁易懂，感兴趣的小伙伴可以了解一下
2023-03-03