First off: for a long time I had wanted to scrape this site's meinv photo sets to my local disk. My first attempt was a single-file crawler built with bs4, requests and similar libraries, but it would always stall after grabbing about 8 images, with no error message at all. Later I learned the Scrapy framework, liked it a lot, and, partly borrowing from others and partly writing my own code, put together an entry-level Scrapy project that crawls the images and sorts them into directories. Here it is, for what it's worth.
OK, on to the main topic.
1. Since my Win10 machine has both Python 2.7 and 3.7 installed, I create the Scrapy project with python3 -m scrapy startproject fa24spider
2. Open the project directory in PyCharm and start with items.py. Put in every field that might be needed; anything that turns out to be unused can simply be deleted or commented out later.
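For reference, startproject should generate roughly the standard Scrapy layout below (spiders.py is the file added by hand in step 3):

    fa24spider/
        scrapy.cfg
        fa24spider/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py
                spiders.py    # the spider written in step 3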
[Python]
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Time : 2020/1/7 11:13
# @Author : ZekiLee

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class Fa24SpiderItem(scrapy.Item):
    # define the fields for your item here like:
    # photo-set title
    pic_title = scrapy.Field()
    # image URL
    pic_url = scrapy.Field()
    # image file name
    pic_name = scrapy.Field()
    # save path
    pic_path = scrapy.Field()
    # referer used to get past the site's anti-hotlinking check
    referer = scrapy.Field()
3. Now write the main spider, spiders.py
[Python]
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Time : 2020/1/7 11:13
# @Author : ZekiLee

import scrapy

from fa24spider.items import Fa24SpiderItem


class SpidersSpider(scrapy.Spider):
    name = 'spiders'
    allowed_domains = ['24fa.top']
    top_url = "https://www.24fa.top"
    start_urls = ['https://www.24fa.top/MeiNv/index.html']

    def parse(self, response):
        """Collect the photo-set links on each listing page."""
        title_link_list = response.xpath('//td[@align="center"]/a/@href').extract()
        for title_link in title_link_list:
            title_url = title_link.replace("..", self.top_url)
            yield scrapy.Request(url=title_url, callback=self.pic_parse)
        # Pagination: if there is a next page ("后页"), build its URL and yield it back to parse for processing
        next_page_link = response.xpath('//div[@class="pager"]//a[@title="后页"]/@href').extract_first("")
        if next_page_link:
            next_page_url = next_page_link.replace("..", self.top_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def pic_parse(self, response):
        """Inside a photo set, handle the image links on each page."""
        title = response.xpath('//h1[@class="title2"]/text()').extract()[0]
        # The current page URL doubles as the referer for the anti-hotlinking header
        referer = response.url
        pic_url_list = response.xpath('//div[@id="content"]//img/@src').extract()
        for pic_link in pic_url_list:
            # Build a fresh item per image, so pending downloads are not affected by later mutations
            item = Fa24SpiderItem()
            item["pic_title"] = title
            item["referer"] = referer
            item["pic_name"] = pic_link[-10:]  # use the last 10 characters of the link as the file name
            item["pic_url"] = pic_link.replace("../..", self.top_url)
            yield item
        # Photo-set pages are paginated too, so handle their next page ("下一页") the same way
        next_page_link = response.xpath('//a[@title="下一页"]/@href').extract_first("")
        if next_page_link:
            next_page_url = next_page_link.replace("../..", self.top_url)
            yield scrapy.Request(url=next_page_url, callback=self.pic_parse)
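One small caveat: the .replace("..", self.top_url) trick only works while every href really starts with "../". Scrapy's response.urljoin() resolves any relative link against the current page, so a sketch like this (same XPath, only the string substitution swapped out) would be a bit more robust:
[Python]
# Sketch: resolve relative links with response.urljoin() instead of str.replace()
next_page_link = response.xpath('//div[@class="pager"]//a[@title="后页"]/@href').extract_first("")
if next_page_link:
    # urljoin() turns "../xxx.html" into an absolute URL based on response.url
    yield scrapy.Request(url=response.urljoin(next_page_link), callback=self.parse)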
4. Configure settings.py
4.1 Set the user-agent
[Python]
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
[Python]
# Obey robots.txt rules
# Don't obey robots.txt, otherwise... you know
ROBOTSTXT_OBEY = False

# Item field that holds the image URL
IMAGES_URLS_FIELD = "pic_url"
# Custom save path
IMAGES_STORE = "G:\\Fa24"

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Download delay; tune it yourself, too fast and the site may ban you
DOWNLOAD_DELAY = 3
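As the comment above hints, AutoThrottle is an alternative to a hand-picked fixed delay; a minimal sketch using Scrapy's built-in AutoThrottle settings (the values here are just examples, not what I actually ran with) would be:
[Python]
# Optional: let Scrapy adapt the request delay to the server's latency
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0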
4.2 Register a custom pipeline. The number after it is up to you; the smaller the number, the higher the priority.
[Python]
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # 'fa24spider.pipelines.Fa24SpiderPipeline': 300,
    'fa24spider.pipelines.Fa24TopPipeline': 1,
}
5. With settings basically done, here comes the main act: how the download actually happens. See pipelines.py; nearly everything is explained in the comments.
[Python]
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Time : 2020/1/7 11:13
# @Author : ZekiLee

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import os
import shutil

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings


class Fa24SpiderPipeline(object):
    def process_item(self, item, spider):
        return item


class Fa24TopPipeline(ImagesPipeline):
    # Read the save path configured in settings
    IMAGES_STORE = get_project_settings().get("IMAGES_STORE")

    # Override ImagesPipeline: send the image download request
    def get_media_requests(self, item, info):
        image_url = item["pic_url"]
        # The headers (referer + user-agent) are mainly there to get past the anti-hotlinking check
        header = {
            "referer": item["referer"],
            "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36",
        }
        yield scrapy.Request(image_url, headers=header)

    def item_completed(self, results, item, info):
        # image_path is the list of files saved under the full/ directory with hash names,
        # e.g. image_path = ['full/5db315b42dfc54a0d2bd0488c87913dfc25a71ef.jpg']
        image_path = [x["path"] for ok, x in results if ok]
        # Target directory for sorting: the path defined in settings + the photo-set title
        new_path = '%s\\%s' % (self.IMAGES_STORE, item["pic_title"])
        # Create the directory if it does not exist yet
        if not os.path.exists(new_path):
            os.mkdir(new_path)
        # Move the file from the default download path to the target directory.
        # The original path is self.IMAGES_STORE + "\\" + image_path[0],
        # e.g. G:\Fa24\full/5db315b42dfc54a0d2bd0488c87913dfc25a71ef.jpg
        # Strip the leading "full/" to get the hash-named file name
        pic_name = os.path.basename(image_path[0])
        old_path = self.IMAGES_STORE + "\\" + image_path[0]
        shutil.move(old_path, new_path + "\\" + pic_name)
        # The hash name is far too long, so rename the file
        os.rename(new_path + "\\" + pic_name, new_path + "\\" + item["pic_name"])
        # Hand the final path back on the item
        item["pic_url"] = new_path + "\\" + item["pic_name"]
        return item
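For what it's worth, newer Scrapy versions (2.4+) pass the item into ImagesPipeline.file_path(), so the per-set directory and file name can be chosen up front and the whole move/rename step disappears. A minimal sketch of that approach (not the pipeline I actually used above):
[Python]
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class Fa24DirectPathPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Same anti-hotlinking referer trick as above
        yield scrapy.Request(item["pic_url"], headers={"referer": item["referer"]})

    def file_path(self, request, response=None, info=None, *, item=None):
        # Save each image straight to <IMAGES_STORE>/<photo-set title>/<file name>
        return "%s/%s" % (item["pic_title"], item["pic_name"])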
6. Done! Run this in PyCharm's Terminal:
>python3 -m scrapy crawl spiders
Then... the images just come pouring down!!
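If you also want a record of what was scraped, Scrapy's -o flag can dump the item metadata to a file as well, e.g. python3 -m scrapy crawl spiders -o items.json (just an optional extra; the images download either way).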