Troubleshooting Scrapy Response Subclasses in Practice

Updated: 2023-05-16 14:35:53 Author: ponponon

This article looks at a problem that comes up when using Scrapy's Response class and its subclasses (TextResponse, HtmlResponse, XmlResponse).

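Main Text

While using Scrapy to scrape wallpapers today (URL: http://pic.netbian.com/4kmein...), I ran into a few problems. I am recording them here for later discussion and reference.

Because the site is rendered dynamically, I chose to hook Scrapy up to Selenium. (Scrapy fetches pages much like the requests library does: it issues the HTTP request directly, so on its own it cannot scrape pages rendered by JavaScript.)

This means the downloader middleware has to take the Request and return a Response, and the Response is where the problem arose. The official documentation lists class scrapy.http.Response(url[, status=200, headers=None, body=b'', flags=None, request=None]), so I imported it with from scrapy.http import Response.

Running scrapy crawl girl produced the following error: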

results=response.xpath('//*[@id="main"]/div[3]/ul/li/a/img')
raise NotSupported("Response content isn't text")
scrapy.exceptions.NotSupported: Response content isn't text

Checking the relevant code:

# middlewares.py
from scrapy import signals
from scrapy.http import Response
from scrapy.exceptions import IgnoreRequest
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class Pic4KgirlDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.
    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called

        # Start the browser before the try block so that the finally
        # clause never runs against a browser that failed to start.
        self.browser = webdriver.Chrome()
        try:
            self.wait = WebDriverWait(self.browser, 10)
            self.browser.get(request.url)
            # Wait until the pagination link exists, i.e. the page has rendered.
            self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#main > div.page > a:nth-child(10)')))
            # This returns the base Response class with the rendered HTML as bytes.
            return Response(url=request.url, status=200, request=request, body=self.browser.page_source.encode('utf-8'))
        # except:
        #     raise IgnoreRequest()
        finally:
            self.browser.close()

The problem appears to be in this line:

return Response(url=request.url, status=200, request=request, body=self.browser.page_source.encode('utf-8'))

Looking at the definition of the Response class:

    @property
    def text(self):
        """For subclasses of TextResponse, this will return the body
        as text (unicode object in Python 2 and str in Python 3)
        """
        raise AttributeError("Response content isn't text")
    def css(self, *a, **kw):
        """Shortcut method implemented only by responses whose content
        is text (subclasses of TextResponse).
        """
        raise NotSupported("Response content isn't text")
    def xpath(self, *a, **kw):
        """Shortcut method implemented only by responses whose content
        is text (subclasses of TextResponse).
        """
        raise NotSupported("Response content isn't text")

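This shows that the base Response class does not implement the text-based shortcuts: text, css() and xpath() do nothing but raise. They only become usable in subclasses that override them.

Response subclasses

TextResponse objects
class scrapy.http.TextResponse(url[, encoding[, ...]])
HtmlResponse objects
class scrapy.http.HtmlResponse(url[, ...])
XmlResponse objects
class scrapy.http.XmlResponse(url[, ...])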

Take TextResponse as an example. It can be imported with from scrapy.http import TextResponse.

Its definition reads:

class TextResponse(Response):
    _DEFAULT_ENCODING = 'ascii'
    def __init__(self, *args, **kwargs):
        self._encoding = kwargs.pop('encoding', None)
        self._cached_benc = None
        self._cached_ubody = None
        self._cached_selector = None
        super(TextResponse, self).__init__(*args, **kwargs)

Here the xpath method has been overridden:

    @property
    def selector(self):
        from scrapy.selector import Selector
        if self._cached_selector is None:
            self._cached_selector = Selector(self)
        return self._cached_selector
    def xpath(self, query, **kwargs):
        return self.selector.xpath(query, **kwargs)
    def css(self, query):
        return self.selector.css(query)

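So a response that will be queried with xpath() or css() has to be built from one of the subclasses of Response rather than from Response itself. The subclasses already override the relevant methods, so nothing has to be reimplemented by the user. For an HTML page such as this one, HtmlResponse is the natural choice.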

Scrapy Crawler Beginner Tutorial 11: Request and Response

Scrapy documentation: https://doc.scrapy.org/en/lat...

Chinese translation of the documentation: https://www.jb51.net/article/248161.htm
