
A Beginner Crawls a Site's MeiNv Photo Sets with Scrapy

piaodoo · Programming Tutorials · 2020-02-22 22:14:38 · Python tutorial

Source: the 吾爱破解 (52pojie) forum

Last edited by nongyf on 2020-1-14 at 00:12

       First of all: I had long wanted to crawl this site's meinv photo sets to my local disk. I started with a single-file crawler built on bs4, requests and friends, but every run it would grab about 8 images and then silently stall, with no error message at all. Later I learned the Scrapy framework and liked it a lot, so, half copying from others and half writing my own, I put together an entry-level Scrapy project that crawls the images and creates a directory for each set. Here it is, for what it's worth.
      Right, down to business.
     1. Since my Win10 machine has both Python 2.7 and 3.7 installed, I created the Scrapy project with: python3 -m scrapy startproject fa24spider
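
For reference, startproject generates the standard Scrapy skeleton (shown here from Scrapy's template; your version may differ slightly). Every file edited in the steps below lives inside the inner fa24spider/ package:

fa24spider/
    scrapy.cfg            # deploy configuration
    fa24spider/
        __init__.py
        items.py          # step 2
        middlewares.py
        pipelines.py      # step 5
        settings.py       # step 4
        spiders/
            __init__.py   # spiders.py from step 3 goes in this folder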
     2. Open the project directory in PyCharm and design items.py first. Put in every field that might be needed; anything that ends up unused can simply be deleted or commented out later.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Time    : 2020/1/7 11:13
# @Author  : ZekiLee

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy

class Fa24SpiderItem(scrapy.Item):
    # define the fields for your item here like:
    # photo-set title
    pic_title = scrapy.Field()
    # image URL
    pic_url = scrapy.Field()
    # image file name
    pic_name = scrapy.Field()
    # save path
    pic_path = scrapy.Field()
    # Referer used to get past the site's anti-hotlinking redirect
    referer = scrapy.Field()


   3. Now for the spider itself, spiders.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Time    : 2020/1/7 11:13
# @Author  : ZekiLee
import scrapy
from fa24spider.items import Fa24SpiderItem

class SpidersSpider(scrapy.Spider):
    name = 'spiders'
    allowed_domains = ['24fa.top']
    top_url = "https://www.24fa.top"
    start_urls = ['https://www.24fa.top/MeiNv/index.html']

    def parse(self, response):
        """
        Collect the photo-set links on each listing page.
        """
        title_link_list = response.xpath('//td[@align="center"]/a/@href').extract()
        for title_link in title_link_list:
            title_url = title_link.replace("..", self.top_url)
            yield scrapy.Request(url=title_url, callback=self.pic_parse)

        # Pagination: if there is a next page, extract its URL and yield a Request back to parse for processing
        next_page_link = response.xpath('//div[@class="pager"]//a[@title="后页"]/@href').extract_first("")
        if next_page_link:
            next_page_url = next_page_link.replace("..", self.top_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def pic_parse(self, response):
        """
        After entering a photo set, handle the image links on each of its pages.
        """
        item = Fa24SpiderItem()
        title = response.xpath('//h1[@class="title2"]/text()').extract()[0]
        item["pic_title"] = title
        # grab the Referer (the URL of the current set page)
        referer = response.url
        item["referer"] = referer
        pic_url_list = response.xpath('//div[@id="content"]//img/@src').extract()
        for pic_link in pic_url_list:
            pic_name = pic_link[-10:]  # the file name is just the last 10 characters of the image URL
            item["pic_name"] = pic_name
            pic_url = pic_link.replace("../..", self.top_url)
            item["pic_url"] = pic_url
            yield item

        # the set pages are paginated too, so their "next page" needs the same handling
        next_page_link = response.xpath('//a[@title="下一页"]/@href').extract_first("")
        if next_page_link:
            next_page_url = next_page_link.replace("../..", self.top_url)
            yield scrapy.Request(url=next_page_url, callback=self.pic_parse)
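
A side note on the URL handling above: the replace("..", self.top_url) trick works only because this site writes its links with a ".." prefix. Scrapy's built-in response.urljoin resolves any relative link against the current page's URL, so a more general version of the same loop would look like this (a sketch, not the code I actually ran):

        for title_link in title_link_list:
            # urljoin resolves "../xxx.html" against response.url itself
            yield scrapy.Request(url=response.urljoin(title_link), callback=self.pic_parse)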



    4. Next, configure settings.py
    4.1 Set the user-agent
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'



# Obey robots.txt rules
# don't obey the robots exclusion protocol, otherwise...... you know
ROBOTSTXT_OBEY = False

# tells ImagesPipeline which item field holds the image URL
# (the custom pipeline below overrides get_media_requests anyway)
IMAGES_URLS_FIELD = "pic_url"

# custom save path
IMAGES_STORE = "G:\\Fa24"
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# download delay; tune it yourself, too fast and the site may ban you
DOWNLOAD_DELAY = 3
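
If a fixed 3-second delay feels too slow, Scrapy's AutoThrottle extension can adjust the delay dynamically based on server response times instead. These are standard Scrapy settings; the values below are only illustrative:

# adaptive throttling as an alternative to a fixed DOWNLOAD_DELAY
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0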


  4.2 Register your own pipeline. The number after it is yours to change; the smaller the number, the higher the priority.
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
#    'fa24spider.pipelines.Fa24SpiderPipeline': 300,
    'fa24spider.pipelines.Fa24TopPipeline': 1,
}



   5. With settings done, the main event: how the downloading actually works. See pipelines.py; the comments explain nearly everything.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Time    : 2020/1/7 11:13
# @Author  : ZekiLee

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings
import scrapy
import os
import shutil


# the template's default no-op pipeline, kept around but unused
class Fa24SpiderPipeline(object):
    def process_item(self, item, spider):
        return item


class Fa24TopPipeline(ImagesPipeline):
    # fetch the save path configured in settings
    IMAGES_STORE = get_project_settings().get("IMAGES_STORE")

    # Override the ImagesPipeline methods:
    # first, issue the download request for each image
    def get_media_requests(self, item, info):
        image_url = item["pic_url"]
        # request headers, mainly to defeat the site's anti-hotlinking check
        header = {
            "referer": item["referer"],
            "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36",
        }
        yield scrapy.Request(image_url, headers=header)

    def item_completed(self, results, item, info):
        # image_path is the list of downloaded files, stored under the "full"
        # directory and named by hash, e.g.
        # image_path = ['full/5db315b42dfc54a0d2bd0488c87913dfc25a71ef.jpg']
        image_path = [x["path"] for ok, x in results if ok]
        if not image_path:
            # the download failed, so there is nothing to move
            return item

        # destination for the per-set classification:
        # the path defined in settings plus the photo-set title
        new_path = os.path.join(self.IMAGES_STORE, item["pic_title"])

        # create the directory if it does not exist yet
        if not os.path.exists(new_path):
            os.mkdir(new_path)

        # move the file from the default download location to the new path;
        # the original location is e.g. G:\Fa24\full/5db315....jpg, and
        # os.path.basename strips the leading "full/" from the stored path
        pic_name = os.path.basename(image_path[0])  # the hash-based file name
        old_path = os.path.join(self.IMAGES_STORE, image_path[0])
        shutil.move(old_path, os.path.join(new_path, pic_name))
        # the hash name is far too long, so rename the file
        os.rename(os.path.join(new_path, pic_name), os.path.join(new_path, item["pic_name"]))
        # hand the final image path back to the item
        item["pic_url"] = os.path.join(new_path, item["pic_name"])

        return item
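
As an aside: on newer Scrapy versions (2.4+), ImagesPipeline.file_path receives the item as a keyword argument, which makes the whole move-and-rename dance avoidable, since the final path can be computed before the download. A minimal sketch, assuming Scrapy 2.4+ (the class name here is hypothetical):

class Fa24DirectPathPipeline(ImagesPipeline):
    # images land directly in IMAGES_STORE/<set title>/<name>,
    # so no shutil.move or os.rename is needed afterwards
    def file_path(self, request, response=None, info=None, *, item=None):
        return "%s/%s" % (item["pic_title"], item["pic_name"])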

  
    6. All done! Run this in PyCharm's Terminal:
>python3 -m scrapy crawl spiders
    Then..... the images come pouring down in a flood!!
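
If the log output is too noisy to watch the downloads scroll by, any setting can be overridden from the command line with -s (standard Scrapy behavior):

>python3 -m scrapy crawl spiders -s LOG_LEVEL=INFO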

