Crawling every novel on 奇书网 (qisuu.la) with the Scrapy framework: learn how to use Scrapy in one article!

piaodoo, Programming Tutorials, 2020-02-22 22:13:45, Python Tutorials

This article comes from the 吾爱破解 forum.

This post was last edited by huguo002 on 2019-9-19 18:36.


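By default a Scrapy spider cannot be debugged directly in an IDE, because it is normally started from the command line with scrapy crawl. The workaround:
Create a new Python file named entrypoint.py in the project root directory and put the following in it: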

from scrapy.cmdline import execute
execute(['scrapy','crawl','qisuu'])

The IDE I use here is PyCharm.
To debug, simply run the entrypoint.py file directly!
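
As a variation on the entry point above, here is a minimal sketch that builds the argument list in a helper so that extra command-line options can be added while debugging. The helper name build_crawl_argv, the main function, and the output file name novels.json are assumptions made for illustration and are not part of the original post; -o and -L are the standard scrapy crawl options for the output file and the log level.

```python
def build_crawl_argv(spider_name, output_path=None, log_level=None):
    """Build the argument list that scrapy.cmdline.execute() expects."""
    argv = ["scrapy", "crawl", spider_name]
    if output_path:
        # -o writes the scraped items to a file (format inferred from the extension)
        argv += ["-o", output_path]
    if log_level:
        # -L sets the log level, e.g. DEBUG, INFO or WARNING
        argv += ["-L", log_level]
    return argv


def main(argv=None):
    # Imported lazily so that the helper above works even without Scrapy installed
    from scrapy.cmdline import execute

    execute(argv or build_crawl_argv("qisuu"))


# Hypothetical usage at the bottom of entrypoint.py:
# main(build_crawl_argv("qisuu", output_path="novels.json", log_level="INFO"))
```

With no options, build_crawl_argv("qisuu") produces the same list that the original entrypoint.py passes to execute, so the behaviour in the IDE should stay the same.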

[Image: 调试.jpg (debugging), uploaded 2019-9-19 18:24]

Main spider program: qisuu.py


import re
import scrapy
from bs4 import BeautifulSoup
from scrapy.http import Request

class Myspider(scrapy.Spider):

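    name='qisuu'  # spider name used by "scrapy crawl qisuu"
    allowed_domains=['qisuu.la']
    bash_url='https://www.qisuu.la/soft/sort0'  # common prefix of the category URLs
    bashurl='.html'  # common suffix of the page URLs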

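    def start_requests(self):
        # Request the first listing page of each of the 10 categories (sort01 ... sort010)
        for i in range(1,11):
            url=f'{self.bash_url}{str(i)}/index_1{self.bashurl}'
            yield Request(url,self.parse)

    def parse(self,response):
        # Read the last page number from the "尾页" (last page) link in the pager
        max_num=re.findall(r"下一页</a>.+?<a href='/soft/sort0.+?/index_(.+?).html'>尾页</a>",response.text,re.S)[0]
        # Strip the trailing "1.html" so that the URL ends with "index_"
        bashurl=str(response.url)[:-6]
        # Request every listing page of this category
        for i in range(1,int(max_num)+1):
            url=f'{bashurl}{str(i)}{self.bashurl}'
            yield Request(url,callback=self.get_name)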

    def get_name(self,response):
        lis=BeautifulSoup(response.text,'lxml').find('div',class_="listBox").find_all('li')
        for li in lis:
            novelname=li.find('a').get_text() # novel title
            novelinformation = li.find('div', class_="s").get_text() # novel information
            novelintroduce=li.find('div',class_="u").get_text() # novel synopsis
            novelurl=f"https://www.qisuu.la{li.find('a')['href']}" # novel page URL
            # Pass the title and URL on to the detail-page callback through meta
            yield Request(novelurl,callback=self.get_chapterurl,meta={'name':novelname,'url':novelurl})

    def get_chapterurl(self,response):
        # The novel title is passed in from get_name() through meta
        #novelname =BeautifulSoup(response.text,'lxml').find('h1').get_text()
        novelname=str(response.meta['name'])
        soup=BeautifulSoup(response.text,'lxml')  # parse the page once and reuse it
        lis=soup.find('div',class_="detail_right").find_all('li')
        noveclick=lis[0].get_text() # click count
        novefilesize=lis[1].get_text() # file size
        novefiletype = lis[2].get_text()  # book type
        noveupatedate = lis[3].get_text()  # update date
        novestate = lis[4].get_text()  # serialization status
        noveauthor = lis[5].get_text()  # author
        novefile_running_environment = lis[6].get_text()  # running environment
        lis=soup.find('div',class_="showDown").find_all('li')
        novefile_href=re.findall(r"'.+?','(.+?)','.+?'",str(lis[-1]),re.S)[0]  # novel download URL
        print(novelname)
        print(noveclick)
        print(novefilesize)
        print(novefiletype)
        print(noveupatedate)
        print(novestate)
        print(noveauthor)
        print(novefile_running_environment)
        print(novefile_href)
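
The pagination logic in start_requests and parse can be tried out without running a crawl. The following is a minimal sketch using only the standard library. The function names category_first_pages and listing_page_urls and the constant names are assumptions made for illustration, and unlike the spider (which indexes findall(...)[0] and would raise IndexError), this sketch falls back to the first page when no "尾页" link is found.

```python
import re

BASE_URL = "https://www.qisuu.la/soft/sort0"  # same value as bash_url in the spider
SUFFIX = ".html"                              # same value as bashurl in the spider

# Same pattern as in parse(): captures the page number in the "尾页" (last page) link
LAST_PAGE_RE = re.compile(
    r"下一页</a>.+?<a href='/soft/sort0.+?/index_(.+?).html'>尾页</a>", re.S
)


def category_first_pages(count=10):
    """First listing page of each category: sort01 ... sort0<count>."""
    return [f"{BASE_URL}{i}/index_1{SUFFIX}" for i in range(1, count + 1)]


def listing_page_urls(first_page_url, html):
    """All listing page URLs of one category, derived from its first page."""
    match = LAST_PAGE_RE.search(html)
    if match is None:
        # Assumption: with no pager link, treat the category as a single page
        return [first_page_url]
    # Drop the trailing "1.html" so that the prefix ends with "index_"
    prefix = first_page_url[: -len("1" + SUFFIX)]
    return [f"{prefix}{n}{SUFFIX}" for n in range(1, int(match.group(1)) + 1)]
```

Given a first-page URL ending in index_1.html and pager markup whose "尾页" link points at index_3.html, listing_page_urls returns the URLs for index_1, index_2 and index_3 in that order.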

[Image: 运行.gif (running), uploaded 2019-9-19 18:26]
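
The spider above only prints the fields it extracts. As one possible next step, here is a minimal sketch that gathers the same fields into a dict. The function name build_novel_record, the English field names, and the sample markup in the comment are assumptions made for illustration; the real markup on the site may differ.

```python
import re

# One key per <li> in the detail_right block, in the order the spider reads them
DETAIL_FIELDS = (
    "clicks", "file_size", "file_type", "update_date",
    "state", "author", "running_environment",
)

# Same pattern as in get_chapterurl(): the second of three quoted, comma-separated values
DOWNLOAD_RE = re.compile(r"'.+?','(.+?)','.+?'", re.S)


def build_novel_record(name, detail_texts, download_markup):
    """Combine the extracted strings into one dict per novel."""
    record = {"name": name}
    for field, text in zip(DETAIL_FIELDS, detail_texts):
        record[field] = text.strip()
    match = DOWNLOAD_RE.search(download_markup)
    # None instead of an IndexError when the pattern does not match
    record["download_url"] = match.group(1) if match else None
    return record


# Hypothetical markup of the last <li> in the showDown block:
# "<li><a onclick=\"down('1','https://example.com/book.txt','title')\">download</a></li>"
```

Inside get_chapterurl, a dict like this could be yielded rather than printed, since Scrapy accepts plain dicts as items and can then export them through its feed exports.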

Method source: 崔庆才, 静觅, "小白进阶之Scrapy第一篇" (Scrapy for beginners moving up, part 1)
https://cuiqingcai.com/3472.html/3

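If you are interested, feel free to follow along and try it yourself!
You are also welcome to discuss Python with me!
Comments and discussion are welcome!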

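If this helped you, please consider leaving a free rating!
You get one rating per day, and a quick click gives people who share more motivation!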

Article link: https://www.piaodoo.com/7851.html