Python Crawler in Practice: Scraping Douban Images with Scrapy

Updated: 2021-06-02 11:29:24   Author: 濯君
After writing many crawlers with Python's urllib and BeautifulSoup, I decided to try Scrapy, the well-known Python crawler framework. This post walks through, in detail, how to use Scrapy to download a celebrity's photos from Douban; readers who need this can use it as a reference.

Using Scrapy to scrape all of a film star's personal photos on Douban

Taking Monica Bellucci (莫妮卡·贝鲁奇) as the example.

[Screenshot: the celebrity's photo page on Douban]

1. First, open a terminal in the directory where the project should live and run scrapy startproject banciyuan to create the Scrapy project.

The generated project structure is as follows:

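This is Scrapy's standard startproject template; the annotations note each file's role:

banciyuan/
    scrapy.cfg            # deploy configuration
    banciyuan/            # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory holding the spider classes
            __init__.py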

2. To make it convenient to run the Scrapy project from PyCharm, create a main.py:

from scrapy import cmdline

# Equivalent to running `scrapy crawl banciyuan` from the project root
cmdline.execute("scrapy crawl banciyuan".split())

Then open Edit Configurations:

[Screenshot: PyCharm's Edit Configurations menu]

Configure it as follows; once that is done, you can run the Scrapy project by running main.py:

[Screenshot: run configuration for main.py]
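Concretely, the two fields that matter in that dialog are the script path, which should point at main.py, and the working directory, which should be the project root (the folder containing scrapy.cfg); with those set, running main.py launches the crawl exactly as the terminal command would.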

3. Analyze the HTML of the target pages and create the corresponding spider:

[Screenshot: HTML structure of the photos page]

from scrapy import Spider
import scrapy

from banciyuan.items import BanciyuanItem


class BanciyuanSpider(Spider):
    name = 'banciyuan'
    allowed_domains = ['movie.douban.com']
    start_urls = ["https://movie.douban.com/celebrity/1025156/photos/"]
    url = "https://movie.douban.com/celebrity/1025156/photos/"

    def parse(self, response):
        # The last link in the paginator holds the total page count;
        # default to '1' so a single-page gallery does not crash int()
        num = response.xpath('//div[@class="paginator"]/a[last()]/text()').extract_first('1')
        # Each listing page shows 30 thumbnails, so start advances in steps of 30
        for i in range(int(num)):
            suffix = '?type=C&start=' + str(i * 30) + '&sortby=like&size=a&subtype=a'
            yield scrapy.Request(url=self.url + suffix, callback=self.get_page)

    def get_page(self, response):
        # Collect the detail-page link of every photo on the listing page
        href_list = response.xpath('//div[@class="article"]//div[@class="cover"]/a/@href').extract()
        for href in href_list:
            yield scrapy.Request(url=href, callback=self.get_info)

    def get_info(self, response):
        # On the detail page, extract the full-size image URL and the page title
        src = response.xpath(
            '//div[@class="article"]//div[@class="photo-show"]//div[@class="photo-wp"]/a[1]/img/@src').extract_first('')
        title = response.xpath('//div[@id="content"]/h1/text()').extract_first('')
        item = BanciyuanItem()
        item['title'] = title
        item['src'] = [src]  # the pipeline expects a list of URLs
        yield item
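Before committing to these XPath expressions, it can help to test them interactively in Scrapy's shell (a sketch; Douban may refuse requests that lack the browser User-Agent configured in settings.py below):

scrapy shell "https://movie.douban.com/celebrity/1025156/photos/"
>>> response.xpath('//div[@class="paginator"]/a[last()]/text()').extract_first()
>>> response.xpath('//div[@class="article"]//div[@class="cover"]/a/@href').extract()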

4. items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class BanciyuanItem(scrapy.Item):
    # define the fields for your item here like:
    src = scrapy.Field()    # list of image URLs, consumed by the pipeline
    title = scrapy.Field()  # photo page title, used to name the output folder

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


from scrapy.pipelines.images import ImagesPipeline
import scrapy


class BanciyuanPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Request the image, passing the item along via meta so file_path can use it
        yield scrapy.Request(url=item['src'][0], meta={'item': item})

    def file_path(self, request, response=None, info=None, *, item=None):
        item = request.meta['item']
        # Name the file after the last path segment of the image URL
        image_name = item['src'][0].split('/')[-1]
        # Group files into a folder named after the first word of the page title
        path = '%s/%s' % (item['title'].split(' ')[0], image_name)

        return path
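As an aside, the subclass above exists only to control folder and file names. If hash-based file names under a flat layout are acceptable, a minimal alternative (a sketch using Scrapy's built-in IMAGES_URLS_FIELD setting, not the approach used in this project) is to register the stock pipeline directly in settings.py:

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_URLS_FIELD = 'src'  # which item field holds the list of image URLs
IMAGES_STORE = './images'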

settings.py

# Scrapy settings for banciyuan project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'banciyuan'

SPIDER_MODULES = ['banciyuan.spiders']
NEWSPIDER_MODULE = 'banciyuan.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT must be a plain string, not a dict
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36'


# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'banciyuan.middlewares.BanciyuanSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'banciyuan.middlewares.BanciyuanDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'banciyuan.pipelines.BanciyuanPipeline': 1,
}
IMAGES_STORE = './images'

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
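One dependency worth flagging: ImagesPipeline relies on Pillow to process images, so install it before running the crawl if it is not already present:

pip install Pillow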

5. Crawl results

[Screenshot: downloaded image folders]
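Given the file_path logic above and IMAGES_STORE = './images', the downloads end up grouped by the first word of each page title; the file names below are illustrative, not actual output:

images/
    莫妮卡·贝鲁奇/
        p2626826146.jpg
        p2626826147.jpg
        ...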


This concludes this article on scraping Douban images with Scrapy. For more on the topic, search 脚本之家's earlier articles, and we hope you will continue to support 脚本之家!
