A Detailed Guide to Extracting the Content Under Specific Tags with BeautifulSoup

Updated: 2020-12-07 11:40:41 Author: qianc6350528

This article walks through ways of extracting the content under specific tags with BeautifulSoup, illustrating each one with example code.

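These are my notes from learning BeautifulSoup. My current approach when scraping is to first use find_all() to locate the tags that hold the content I need; if a single find_all() call is not enough, I use two or more. I then iterate over the find_all() results and call get_text() or get('href') to obtain the text or the link, appending each to its own list. The drawback is that the text ends up in one list and the links in another: the relationship between the two is lost, and every additional field I want to collect requires yet another empty list.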

Iterating over all tags:

soup.find_all('a')

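This finds every <a> tag in the page and returns the matches as a list. The matches can then be processed further, for example by looping over them with for and cleaning each one up to obtain its link, its text, and so on.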

# Create empty lists
links = []
txts = []
tags = soup.find_all('a')
for tag in tags:
  links.append(tag.get('href'))  # None if the tag has no href attribute
  txts.append(tag.text)          # equivalent: txts.append(tag.get_text())
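
The drawback noted earlier (text and links living in separate lists) can be avoided by storing one record per tag. The following is only a minimal sketch under stated assumptions: the HTML fragment, the `records` name, and the dictionary keys are made up for illustration, and `html.parser` is chosen simply because it ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the third link deliberately has no href.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# One dictionary per tag keeps the text and the link of the same tag together.
records = []
for tag in soup.find_all("a"):
    records.append({
        "text": tag.get_text(strip=True),
        "href": tag.get("href"),  # None when the attribute is missing
    })

for record in records:
    print(record)
```

Adding another field later means adding one more key to the dictionary rather than creating and maintaining another parallel list.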

Getting the value of an HTML attribute:

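class_names = []
tags = soup.find_all('a')
for tag in tags:
  # tag.p is the first <p> tag inside this <a> tag, or None if there is none
  if tag.p is not None:
    # get('class') returns the class names as a list, or None if unset
    class_names.append(tag.p.get('class'))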
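
To make the attribute access concrete, here is a small sketch of the common ways to read attributes from a tag. The markup and variable names are invented for this example; the calls themselves (`tag[...]`, `tag.get()`, `tag.attrs`) are standard Beautiful Soup accessors.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <a> wraps a <p>, the second does not.
html_doc = '<a href="/one" id="first"><p class="lead big">One</p></a><a href="/two">Two</a>'
soup = BeautifulSoup(html_doc, "html.parser")

first, second = soup.find_all("a")

href = first["href"]          # dictionary-style access; raises KeyError if missing
missing = second.get("id")    # get() returns None instead of raising
all_attrs = first.attrs       # every attribute of the tag as a dictionary

# class is a multi-valued attribute, so its value is a list of names.
child_classes = []
for tag in (first, second):
    child = tag.p             # first <p> inside the tag, or None
    child_classes.append(child.get("class") if child is not None else None)

print(href, missing, all_attrs, child_classes)
```

Using get() rather than square brackets is usually the safer choice in a scraper, because real pages often omit attributes on some tags.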

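Examples of using find_all():

The examples come from the Chinese translation of the Beautiful Soup documentation.

1. String

The simplest filter is a string. Pass a string to a search method and Beautiful Soup matches it against the exact tag name. The following example finds all the <b> tags in the document: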

soup.find_all('b')
# [<b>The Dormouse's story</b>]

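2. Regular expression

If you pass in a regular expression object, Beautiful Soup filters against it using its search() method. The following example finds all the tags whose names start with "b", so both <body> and <b> are found: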

import re
for tag in soup.find_all(re.compile("^b")):
  print(tag.name)
# body
# b

The following code finds all the tags whose names contain the letter "t":

for tag in soup.find_all(re.compile("t")):
  print(tag.name)
# html
# title

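3. List

If you pass in a list, Beautiful Soup returns everything that matches any item in the list. The following code finds all the <a> tags and all the <b> tags in the document: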

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
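
The three name filters above (string, regular expression, list) can be tried end to end with the following sketch. It assumes the "three sisters" sample document used in the Beautiful Soup documentation, parsed with the built-in `html.parser`; the variable names are chosen for this example only.

```python
import re
from bs4 import BeautifulSoup

# Sample document assumed for illustration (the "three sisters" example).
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# String filter: exact tag name.
bold_tags = soup.find_all("b")

# Regular-expression filters: applied to each tag name.
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]

# List filter: any tag whose name is in the list, in document order.
a_or_b = [tag.name for tag in soup.find_all(["a", "b"])]

print(bold_tags, starts_with_b, contains_t, a_or_b)
```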

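4. Function (a custom function passed to find_all)

If none of the other filters fits, you can define a function that takes a single element as its only argument. The function should return True if the element matches and False otherwise.
The following function checks the current element and returns True if it defines a class attribute but no id attribute: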

def has_class_but_no_id(tag):
  return tag.has_attr('class') and not tag.has_attr('id')

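Passing this function to find_all() returns only the <p> tags. The <a> tags are not returned because they also define "id", and <html> and <head> are not returned because they do not define a "class" attribute.
The following function finds the tags that are surrounded by string objects: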

from bs4 import NavigableString
def surrounded_by_strings(tag):
  return (isinstance(tag.next_element, NavigableString)
      and isinstance(tag.previous_element, NavigableString))

for tag in soup.find_all(surrounded_by_strings):
  print(tag.name)
# p
# a
# a
# a
# p
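
As a runnable illustration of function filters, the sketch below passes has_class_but_no_id to find_all() and also shows a function used as an attribute filter. The shortened HTML fragment and the `is_lacie_link` helper are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Shortened sample markup assumed for illustration.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")


def has_class_but_no_id(tag):
    """Match tags that define class but not id."""
    return tag.has_attr("class") and not tag.has_attr("id")


def is_lacie_link(href):
    """Attribute filter: receives the attribute value, or None if it is absent."""
    return href is not None and href.endswith("/lacie")


# Function as a tag filter: called once per tag.
matched_names = [tag.name for tag in soup.find_all(has_class_but_no_id)]

# Function as an attribute filter: called with each tag's href value.
lacie_ids = [tag["id"] for tag in soup.find_all(href=is_lacie_link)]

print(matched_names, lacie_ids)
```

A function filter is most useful when the condition combines several attributes or depends on a tag's neighbours, which the string, regular-expression, and list filters cannot express.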

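5. Searching by CSS class

Searching for tags by CSS class name is very useful, but the name of the CSS attribute, class, is a reserved word in Python, so using class as a keyword argument causes a syntax error. Starting with Beautiful Soup 4.1.1, you can search for tags with a given CSS class through the class_ argument: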

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Or:

soup.find_all("a", attrs={"class": "sister"})
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
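
One detail worth seeing in code is how class searches behave when a tag carries several classes. The sketch below uses invented markup; it suggests that searching for a single class matches any tag that has it, while searching for a full class string matches only that exact string in that exact order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <p> with two classes, one <p> with a single class.
html_doc = '<p class="body strikeout">first</p><p class="body">second</p>'
soup = BeautifulSoup(html_doc, "html.parser")

# A single class name matches every tag that has that class among its classes.
with_body = soup.find_all("p", class_="body")
with_strikeout = soup.find_all("p", class_="strikeout")

# The full attribute string matches only when the order is identical.
exact_order = soup.find_all("p", class_="body strikeout")
other_order = soup.find_all("p", class_="strikeout body")

# The attrs dictionary is an equivalent way to write the same search.
via_attrs = soup.find_all("p", attrs={"class": "body"})

print(len(with_body), len(with_strikeout), len(exact_order), len(other_order), len(via_attrs))
```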

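6. Searching by the text argument

The text argument searches the strings in the document rather than the tags. Like the name argument, text accepts a string, a regular expression, a list, a function, or True. (Since Beautiful Soup 4.4.0 the same argument is also available under the name string.) Some examples: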

soup.find_all(text="Elsie")
# ['Elsie']

soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']

soup.find_all(text=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]

def is_the_only_string_within_a_tag(s):
  """Return True if this string is the only child of its parent tag."""
  return (s == s.parent.string)

soup.find_all(text=is_the_only_string_within_a_tag)
# ["The Dormouse's story", "The Dormouse's story", 'Elsie', 'Lacie', 'Tillie', '...']

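Although the text argument is meant for finding strings, it can be combined with other arguments to filter tags: Beautiful Soup then finds the tags whose .string attribute matches the value given for text. The following code finds the <a> tags whose .string is "Elsie":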

soup.find_all("a", text="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
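
The string searches above can be reproduced with the sketch below. It assumes Beautiful Soup 4.4.0 or later and therefore uses the argument name string, the newer name for text; the shortened sample markup is also an assumption made for this example.

```python
import re
from bs4 import BeautifulSoup

# Shortened sample markup assumed for illustration.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Without a tag name, the search returns the matching strings themselves.
exact = soup.find_all(string="Elsie")
by_regex = soup.find_all(string=re.compile("Dormouse"))

# With a tag name, the search returns tags whose .string matches.
elsie_tags = soup.find_all("a", string="Elsie")

print(exact, by_regex, elsie_tags)
```

Note the difference in return types: the first two calls yield string objects, while the last yields Tag objects that still carry their attributes.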

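7. Searching only a tag's direct children

When you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants. To search only the tag's direct children, pass recursive=False.

A simple document fragment: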

<html>
 <head>
 <title>
  The Dormouse's story
 </title>
 </head>
...

Search results with and without the recursive argument:

soup.html.find_all("title")
# [<title>The Dormouse's story</title>]

soup.html.find_all("title", recursive=False)
# []
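
The effect of recursive=False can be checked with a short, self-contained sketch. The one-line document below is an assumption made for this example; it mirrors the fragment above, where <title> is a grandchild of <html> rather than a direct child.

```python
from bs4 import BeautifulSoup

# Minimal document assumed for illustration: <title> sits inside <head>.
html_doc = "<html><head><title>The Dormouse's story</title></head><body><p>text</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# Default search: all descendants of <html> are examined.
deep = soup.html.find_all("title")

# Direct children only: <title> is not a direct child of <html>.
shallow = soup.html.find_all("title", recursive=False)

# <head> and <body> are direct children of <html>, so they are found.
children = [tag.name for tag in soup.html.find_all(["head", "body"], recursive=False)]

# <title> is a direct child of <head>, so this search succeeds.
from_head = soup.head.find_all("title", recursive=False)

print(len(deep), len(shallow), children, len(from_head))
```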
