Python Crawler in Practice: Using Scrapy to Scrape Douban Images

Updated: 2021-06-02 11:29:24   Author: 濯君
After writing plenty of crawlers with Python's urllib and BeautifulSoup, I decided to try the well-known Python crawler framework Scrapy. This post walks through using Scrapy to download a Douban celebrity's photos step by step; readers who need this are welcome to follow along.

Using Scrapy to scrape all personal photos of a Douban film star

We will use Monica Bellucci (莫妮卡·贝鲁奇) as the example.


1. First, open a terminal in the directory where the project should live and run scrapy startproject banciyuan to create the Scrapy project.

The generated project structure is as follows:

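This is the standard skeleton that scrapy startproject generates; the spider file under spiders/ is the one added by hand in step 3:

banciyuan/
    scrapy.cfg            # deploy configuration
    banciyuan/
        __init__.py
        items.py          # item definitions (step 4)
        middlewares.py
        pipelines.py      # item pipelines (step 4)
        settings.py       # project settings (step 4)
        spiders/
            __init__.py
            banciyuan.py  # the spider from step 3 (filename assumed)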

2. To make it convenient to run the Scrapy project from PyCharm, create a main.py:

from scrapy import cmdline

# Equivalent to running `scrapy crawl banciyuan` on the command line
cmdline.execute("scrapy crawl banciyuan".split())
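main.py simply hands the split argument list to Scrapy's own command-line entry point, so running it has the same effect as executing scrapy crawl banciyuan from the project root in a terminal.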

Then open PyCharm's Edit Configurations dialog and add a Python run configuration that executes main.py, with the working directory set to the project root (so Scrapy can find scrapy.cfg). Once that is saved, the whole Scrapy project can be started just by running main.py.

3. Analyze the HTML of the photos page and create the corresponding spider.

[Screenshot: HTML structure of the Douban photos page]

from scrapy import Spider
import scrapy

from banciyuan.items import BanciyuanItem


class BanciyuanSpider(Spider):
    name = 'banciyuan'
    allowed_domains = ['movie.douban.com']
    start_urls = ["https://movie.douban.com/celebrity/1025156/photos/"]
    url = "https://movie.douban.com/celebrity/1025156/photos/"

    def parse(self, response):
        # The text of the last direct <a> in the paginator is the total page count
        num = response.xpath('//div[@class="paginator"]/a[last()]/text()').extract_first('')
        print(num)
        # Each listing page holds 30 photos, so page i starts at offset i * 30
        for i in range(int(num)):
            suffix = '?type=C&start=' + str(i * 30) + '&sortby=like&size=a&subtype=a'
            yield scrapy.Request(url=self.url + suffix, callback=self.get_page)

    def get_page(self, response):
        # Collect the detail-page link of every photo on this listing page
        href_list = response.xpath('//div[@class="article"]//div[@class="cover"]/a/@href').extract()
        for href in href_list:
            yield scrapy.Request(url=href, callback=self.get_info)

    def get_info(self, response):
        # On the photo detail page, grab the full-size image URL and the page title
        src = response.xpath(
            '//div[@class="article"]//div[@class="photo-show"]//div[@class="photo-wp"]/a[1]/img/@src').extract_first('')
        title = response.xpath('//div[@id="content"]/h1/text()').extract_first('')
        item = BanciyuanItem()
        item['title'] = title
        item['src'] = [src]  # store the URL in a list, as image pipelines expect
        yield item
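The pagination logic in parse relies on the last direct <a> child of the paginator holding the total page count. To sanity-check such an XPath without running a full crawl, it can be tried against a hand-written fragment with Scrapy's Selector; the HTML below is an assumed, simplified shape of Douban's paginator, not the real markup:

from scrapy import Selector

# Simplified, assumed paginator markup for testing the XPath offline
html = '''
<div class="paginator">
    <a href="?start=30">2</a>
    <a href="?start=60">3</a>
    <a href="?start=90">4</a>
</div>
'''

sel = Selector(text=html)
# Same expression the spider uses: text of the last direct <a> child
num = sel.xpath('//div[@class="paginator"]/a[last()]/text()').extract_first('')
print(num)  # -> 4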

4. items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class BanciyuanItem(scrapy.Item):
    # define the fields for your item here like:
    src = scrapy.Field()    # list containing the image URL
    title = scrapy.Field()  # photo page title, used to name the folder
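Note that the stock ImagesPipeline reads image URLs from an item field named image_urls by default (configurable through the IMAGES_URLS_FIELD setting); because the pipeline below overrides get_media_requests itself, the custom src field works fine here.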

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


from scrapy.pipelines.images import ImagesPipeline
import scrapy


class BanciyuanPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Request the image download, passing the item along via meta
        yield scrapy.Request(url=item['src'][0], meta={'item': item})

    def file_path(self, request, response=None, info=None, *, item=None):
        # Since Scrapy 2.4 the item is also passed in directly; reading it
        # from request.meta keeps this tied to the request created above
        item = request.meta['item']
        image_name = item['src'][0].split('/')[-1]
        # image_name = image_name.replace('.webp', '.jpg')  # optionally normalize the extension
        # Store each image under a folder named after the first word of the title
        path = '%s/%s' % (item['title'].split(' ')[0], image_name)

        return path
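Each item here carries exactly one URL, but since src is already a list, a variant could download every URL it contains. A minimal sketch (MultiImagePipeline is a hypothetical name, not part of the project above):

from scrapy.pipelines.images import ImagesPipeline
import scrapy


class MultiImagePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Download every URL in the item's src list
        for url in item['src']:
            yield scrapy.Request(url=url, meta={'item': item})

    def file_path(self, request, response=None, info=None, *, item=None):
        # Name each file after the tail of its own URL so files don't collide
        item = request.meta['item']
        return '%s/%s' % (item['title'].split(' ')[0], request.url.split('/')[-1])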

settings.py

# Scrapy settings for banciyuan project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'banciyuan'

SPIDER_MODULES = ['banciyuan.spiders']
NEWSPIDER_MODULE = 'banciyuan.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT must be a plain string, not a dict of headers
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36'


# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'banciyuan.middlewares.BanciyuanSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'banciyuan.middlewares.BanciyuanDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'banciyuan.pipelines.BanciyuanPipeline': 1,
}
IMAGES_STORE = './images'
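# Note: ImagesPipeline relies on the Pillow library for image processing,
# so install it first if needed (pip install Pillow). IMAGES_STORE is the
# root folder that the file_path() return values are joined onto.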

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

5. Crawl results

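With IMAGES_STORE = './images' and the file_path above, the downloaded files should end up in a layout roughly like the following; the folder name comes from the page title and the file names from Douban's image URLs, so these are illustrative:

images/
    莫妮卡·贝鲁奇/
        p2620825803.jpg
        p2620825804.jpg
        ...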

Reference

Source code

That's all for this walkthrough of scraping Douban images with Scrapy. For more on the topic, search 脚本之家's earlier articles, and thank you for supporting 脚本之家!
