Getting Started with Scrapy, Python's Most Popular Crawler Framework: Key Source Code

piaodoo · Programming Tutorials · 2020-02-22 22:09:28 · python tutorials

Source: 52pojie forum (吾爱破解论坛)

Last edited by huguo002 on 2019-11-15 09:12


Notes for my own reference.
Create the Scrapy project:
[Shell]

scrapy startproject douban







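1. Entry-point script for debugging in the PyCharm IDE
Create a new .py file:
entrypoint.py
[Python]
from scrapy.cmdline import execute

# Same as running "scrapy crawl douban" from the project root
execute(['scrapy', 'crawl', 'douban'])



The last argument, douban, is the spider name (the name attribute of the spider class); in this project it is the same as the Scrapy project name.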


2. items.py
Define the fields:
[Python]
import scrapy


class DoubanItem(scrapy.Item):
    num = scrapy.Field()        # rank number
    name = scrapy.Field()       # movie title
    introduce = scrapy.Field()  # description
    star = scrapy.Field()       # star rating
    appraise = scrapy.Field()   # number of ratings
    survey = scrapy.Field()     # one-line summary



Import the scrapy framework, then declare each field in the form:
field_name = scrapy.Field()


3. Settings file
settings.py


Crawling Douban requires request headers.
Enable the User-Agent setting:
[Python]
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)'



Enable HTTP caching for debugging:
[Python]
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'



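What enabling this means:
With these lines uncommented, Scrapy caches every request it makes. When the same request is issued again and a cached copy exists, the cached response is returned instead of contacting the site, which speeds up local debugging and reduces the load on the website. HTTPCACHE_EXPIRATION_SECS = 0 means cached responses never expire.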
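
As a rough illustration of the caching idea (not Scrapy's actual storage code), the sketch below keeps one file per URL and calls the downloader only on a cache miss. TinyFileCache and fake_download are hypothetical names made up for this example.

```python
import hashlib
import tempfile
from pathlib import Path


class TinyFileCache:
    """Illustrative stand-in for a filesystem HTTP cache (not Scrapy's code)."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path_for(self, url):
        # One file per request, named after a hash of the URL.
        return self.cache_dir / hashlib.sha1(url.encode("utf-8")).hexdigest()

    def fetch(self, url, download):
        """Return (body, from_cache); call download(url) only on a cache miss."""
        path = self._path_for(url)
        if path.exists():
            return path.read_bytes(), True
        body = download(url)
        path.write_bytes(body)
        return body, False


calls = []


def fake_download(url):
    # Hypothetical downloader that records each "network" hit.
    calls.append(url)
    return b"<html>top250</html>"


cache = TinyFileCache(tempfile.mkdtemp())
first = cache.fetch("https://movie.douban.com/top250", fake_download)
second = cache.fetch("https://movie.douban.com/top250", fake_download)
print(first[1], second[1], len(calls))
```

The second fetch is served from disk, so the downloader runs only once. This is the effect that makes repeated debugging runs fast.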


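Activate the item pipeline
Once a pipeline is defined, it has to be activated in the configuration before Scrapy will use it, so add it to settings.py. The number (300 here) sets the run order when several pipelines are enabled: lower values run first, and values are customarily kept in the 0-1000 range.
[Python]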
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}
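
The number attached to each pipeline decides the order in which items pass through them, with lower values running first. A minimal sketch of that ordering rule follows. The second pipeline, CleanTextPipeline, is an assumed name for illustration only and does not exist in this project.

```python
# Hypothetical setting with two pipelines; only DoubanPipeline exists in this project.
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
    'douban.pipelines.CleanTextPipeline': 100,  # assumed name, for illustration only
}

# Lower numbers run first, so sort the class paths by their value.
run_order = [path for path, _ in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]
print(run_order)
```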



4. Spider file
doub.py


[Python]
# -*- coding: utf-8 -*-
import scrapy
from douban.items import DoubanItem
from bs4 import BeautifulSoup  # only used by the commented-out variant below
from scrapy.http import Request


class DoubSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['movie.douban.com']
    start_urls = ["https://movie.douban.com/top250"]

    def parse(self, response):
        item_list = response.xpath('//ol[@class="grid_view"]/li')
        for item in item_list:
            douban_item = DoubanItem()
            douban_item['num'] = item.xpath('.//div[@class="pic"]/em/text()').extract_first()
            douban_item['name'] = item.xpath('.//div[@class="hd"]/a/span[1]/text()').extract_first()
            # The description is split over several text nodes: strip the
            # whitespace inside each one and keep them all, not just the last.
            introduces = item.xpath('.//div[@class="bd"]/p[1]/text()').extract()
            douban_item['introduce'] = " ".join(
                "".join(text.split()) for text in introduces if text.strip()
            )
            douban_item['star'] = item.xpath('.//div[@class="star"]/span[@class="rating_num"]/text()').extract_first()
            douban_item['appraise'] = item.xpath('.//div[@class="star"]/span[4]/text()').extract_first()
            douban_item['survey'] = item.xpath('.//p[@class="quote"]/span[@class="inq"]/text()').extract_first()
            yield douban_item
        # Follow the "next page" link; it is absent on the last page.
        next_page = response.xpath('//div[@class="paginator"]/span[@class="next"]/a/@href').extract_first()
        if next_page:
            yield Request(response.urljoin(next_page), callback=self.parse)


    '''def parse(self, response):
        paginator_urls=[]
        paginator_urls.extend(self.start_urls)
        paginators=BeautifulSoup(response.text,'lxml').find('div',class_="paginator").find_all('a')[:-1]
        for paginator in paginators:
            paginator=f"https://movie.douban.com/top250{paginator['href']}"
            paginator_urls.append(paginator)
        print(paginator_urls)

        paginator_urls=set(paginator_urls)
        for paginator_url in paginator_urls:
            print(paginator_url)
            yield Request(paginator_url,callback=self.get_content)

    def get_content(self,response):
        thispage=BeautifulSoup(response.text,'lxml').find('span',class_="thispage").get_text()
        print(thispage)'''




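Scrapy has XPath support built in; it works much like the etree XPath used in ordinary crawlers.
Mind the difference between .extract() and .extract_first(): .extract() returns a list of every match, while .extract_first() returns only the first match, or None when nothing matches.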
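
The "list of all matches" versus "first match or None" distinction can be tried without installing Scrapy. The sketch below is only an analogy using the standard library's xml.etree.ElementTree (findall versus findtext) on a made-up fragment; it is not Scrapy's selector API.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up fragment shaped like one list entry.
li = ET.fromstring(
    '<li><div class="hd"><span>Title A</span><span>Alias B</span></div></li>'
)

# All matches, always a list (comparable to .extract()).
all_titles = [span.text for span in li.findall('.//div[@class="hd"]/span')]

# First match only, or None when nothing matches (comparable to .extract_first()).
first_title = li.findtext('.//div[@class="hd"]/span')
missing = li.findtext('.//p[@class="quote"]/span')

print(all_titles, first_title, missing)
```

Because the "first match" form can yield None, any code that stores such a value should be prepared for a missing field. In this spider that can happen, for example, with the one-line summary.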


The commented-out part collects the page links with bs4 instead; that code, the page ordering and so on are not perfect.
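
The commented-out variant builds each page URL by gluing the href onto a fixed prefix. An alternative is to resolve the href against the URL of the current page. The sketch below uses the standard library's urljoin and assumes the paginator href is a bare query string such as ?start=25&filter= (an assumption based on how the spider composes its URLs). Inside a spider, Scrapy's response.urljoin() resolves against response.url in the same way.

```python
from urllib.parse import urljoin

# Assumed shape of the href found in the paginator: a bare query string.
next_href = "?start=25&filter="

page1 = "https://movie.douban.com/top250"
page2 = urljoin(page1, next_href)
page3 = urljoin(page2, "?start=50&filter=")
print(page2)
print(page3)
```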


5. pipelines.py
Write the items to a local database:
[Python]
import pymysql


class DoubanPipeline(object):
    def __init__(self):
        # Connect to the MySQL database
        self.connect = pymysql.connect(
            host="localhost",
            user="root",
            password="123456",
            db="xiaoshuo",
            port=3306,
        )
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        # Pass the values as query parameters so the driver escapes them;
        # str.format() breaks on quotes in the text and allows SQL injection.
        self.cursor.execute(
            'INSERT INTO movie (num, name, introduce, star, appraise, survey) '
            'VALUES (%s, %s, %s, %s, %s, %s)',
            (item['num'], item['name'], item['introduce'],
             item['star'], item['appraise'], item['survey']),
        )
        self.connect.commit()
        return item

    # Close the database connection when the spider finishes
    def close_spider(self, spider):
        self.cursor.close()
        self.connect.close()
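
Scraped text can contain quote characters, and a field can be None. Passing the values as query parameters, rather than formatting them into the SQL string, copes with both. The sketch below shows this with the standard library's sqlite3 as a stand-in for MySQL; the table layout and the sample item are invented for the example. sqlite3 uses ? as its placeholder, whereas PyMySQL uses %s.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movie (num TEXT, name TEXT, introduce TEXT, "
    "star TEXT, appraise TEXT, survey TEXT)"
)

# A made-up item whose title contains double quotes, which would break
# a statement assembled with str.format().
item = {
    "num": "1",
    "name": 'A "quoted" title',
    "introduce": "1994/US/Drama",
    "star": "9.7",
    "appraise": "1000000 ratings",
    "survey": None,  # some entries have no one-line summary
}

columns = ("num", "name", "introduce", "star", "appraise", "survey")
conn.execute(
    "INSERT INTO movie (num, name, introduce, star, appraise, survey) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    tuple(item[c] for c in columns),
)
conn.commit()

row = conn.execute("SELECT name, survey FROM movie").fetchone()
print(row)
```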



6. Using proxy IPs: Abuyun (阿布云)
Not tested, as I do not have an account.


[Image: scrapy.jpg]

[Image: 数据库.jpg (database)]



The project is packaged; there are two ways to get it.


Baidu Cloud:
Link: https://pan.baidu.com/s/1GX9srMbh7aJbbpC8y6ZzDw  Extraction code: zp3h


Forum attachment:
douban.rar (164.7 KB), project package, uploaded 2019-11-13 20:49




Thanks to *** teacher Dazhuang!

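Update, 2019-11-14


Extension: a Django project





1. Create the Django project

[Shell]
django-admin startproject douban_movie


2. Create the app

Using PyCharm's built-in manage.py tool (the same as running python manage.py startapp douban on the command line):

[manage.py console]
startapp douban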


3. Register the app

settings.py

INSTALLED_APPS = [...]

Add the entry:

'douban'
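
For reference, this is roughly what the list looks like afterwards, assuming the six default apps that django-admin startproject generates are still in place:

```python
# settings.py (excerpt): the default apps plus the new one.
INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'douban',  # the app created in step 2
]
```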


4. Add fields to the model

models.py


[Python]
from django.db import models

# Create your models here.
class Movie(models.Model):
    # Originally IntegerField(max_length=10); max_length is dropped because
    # IntegerField ignores it and Django reports warning fields.W122.
    num = models.IntegerField()
    name = models.CharField(max_length=50)
    introduce = models.CharField(max_length=255)
    star = models.CharField(max_length=10)
    appraise = models.CharField(max_length=255)
    survey = models.CharField(max_length=100)


5. Add a urls.py file to the app


Add this code to urls.py:

[Python]
from django.urls import path
from . import views

urlpatterns = [
    path('index/', views.hello_world),
]


In the project's urls.py, forward requests to the app's URLs:


[Python]
from django.contrib import admin
from django.urls import path, include


urlpatterns = [
    path('admin/', admin.site.urls),
    path('douban/', include('douban.urls')),
]


6. Add code to the app's views to return "hello world"


[Python]
from django.shortcuts import render
from django.http import HttpResponse

def hello_world(request):
    return HttpResponse("Hello_world!")


Visiting http://127.0.0.1:8000/douban/index/ returns the text Hello_world!


7. Switch the database

Edit settings.py, changing

[Python]
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
    }
}


to:

[Python]
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',  # database engine
        'NAME': 'douban_movie',  # database name
        'USER': 'root',  # user name (the post shows no 'PASSWORD' entry; add one if your MySQL user has a password)
        'HOST': 'localhost',  # host
        'PORT': '3306',  # port
    }
}
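
If the MySQL account does have a password, one option is to read it from the environment instead of hard-coding it in settings.py. This is a sketch under that assumption: MYSQL_USER and MYSQL_PASSWORD are variable names invented for this example, not something the original post uses.

```python
import os

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'douban_movie',
        'USER': os.environ.get('MYSQL_USER', 'root'),
        # MYSQL_PASSWORD is a hypothetical variable name, not from the post.
        'PASSWORD': os.environ.get('MYSQL_PASSWORD', ''),
        'HOST': 'localhost',
        'PORT': '3306',
    }
}
```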


Change the time zone from

[Python]
TIME_ZONE = 'UTC'

to

[Python]
TIME_ZONE = 'Asia/Shanghai'


Add this code to the __init__.py in the project package:

[Python]
import pymysql
pymysql.install_as_MySQLdb()


Database migration commands:

[Shell]
python manage.py makemigrations
python manage.py migrate



8. Errors after switching the database, and how to handle them

https://blog.csdn.net/weixin_45476498/article/details/100098297

Some of the core code follows.
Model layer
models.py
[Python]
from django.db import models

# Create your models here.
class Movie(models.Model):
    # Originally IntegerField(max_length=11); max_length is dropped because
    # IntegerField ignores it and Django reports warning fields.W122.
    num = models.IntegerField()
    name = models.CharField(max_length=50)
    introduce = models.CharField(max_length=255)
    star = models.CharField(max_length=10)
    appraise = models.CharField(max_length=255)
    survey = models.CharField(max_length=100)

    def __str__(self):
        return self.name


App routing: urls.py
[Python]
from django.urls import path
from . import views

urlpatterns = [
    path('index', views.hello_world),
    # path('index/', views.movie),
    path('index/', views.index),
]


App view layer: views.py
[Python]
from django.shortcuts import render
from django.http import HttpResponse
from .models import Movie
from django.core.paginator import Paginator

def hello_world(request):
    return HttpResponse("Hello_world!")

'''def movie(request):
    movie_list=Movie.objects.all()
    movie=movie_list[0]
    return HttpResponse('%s%s%s%s%s%s'%(movie.num,movie.name,movie.introduce,movie.star,movie.appraise,movie.survey,))'''

'''def index(request):
    movie_list=Movie.objects.all()
    return render(request,'douban/index.html',{
        'movie_list':movie_list,
    })'''

def index(request):
    # Order the rows so that the pages are stable between requests
    movie_list = Movie.objects.order_by('num')
    paginator = Paginator(movie_list, 25)  # 25 movies per page
    page = request.GET.get('page')
    # get_page() falls back to a valid page when ?page= is missing or invalid
    page_obj = paginator.get_page(page)
    return render(request, 'douban/index.html', {
        'paginator': paginator,
        'page_obj': page_obj,
    })




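Django's Paginator
[Python]
paginator = Paginator(movie_list, 25)
page2 = paginator.page(2)
print(paginator.count)               # total number of objects
print(paginator.num_pages)           # total number of pages
print(paginator.page_range)          # range of page numbers, one per page button
print(page2.has_next())              # whether there is a next page
print(page2.next_page_number())      # number of the next page
print(page2.has_previous())          # whether there is a previous page
print(page2.previous_page_number())  # number of the previous page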
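
To make these attributes concrete without a Django project, here is a toy paginator over a plain list. SimplePaginator is a hypothetical class written for this example; it only mimics the attribute names and is not Django's implementation.

```python
import math


class SimplePaginator:
    """Toy paginator for illustration; not Django's implementation."""

    def __init__(self, items, per_page):
        self.items = list(items)
        self.per_page = per_page

    @property
    def count(self):
        return len(self.items)

    @property
    def num_pages(self):
        # At least one page, even when there are no items.
        return max(1, math.ceil(self.count / self.per_page))

    @property
    def page_range(self):
        return range(1, self.num_pages + 1)

    def page(self, number):
        start = (number - 1) * self.per_page
        return self.items[start:start + self.per_page]

    def has_next(self, number):
        return number < self.num_pages

    def has_previous(self, number):
        return number > 1


movies = [f"movie {n}" for n in range(1, 251)]  # 250 placeholder entries
paginator = SimplePaginator(movies, 25)
print(paginator.count, paginator.num_pages, list(paginator.page_range)[:3])
print(paginator.page(2)[0], paginator.has_next(10), paginator.has_previous(1))
```

With 250 entries and 25 per page there are 10 pages, which matches the ten numbered links the template renders from page_range.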
 


Project routing: urls.py
[Python]
from django.contrib import admin
from django.urls import path, include


urlpatterns = [
    path('admin/', admin.site.urls),
    path('douban/', include('douban.urls')),
]


__init__.py
[Python]
import pymysql
pymysql.install_as_MySQLdb()


index.html
[HTML]
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Top 250 Movies</title>
</head>
<body>
<div>
    <table>
        {% for movie in page_obj %}
        <tr>
            <td>No.: {{ movie.num  }}</td>
            <td>Title: {{ movie.name  }}</td>
            <td>Description: {{ movie.introduce  }}</td>
            <td>Rating: {{ movie.star  }}</td>
            <td>Number of ratings: {{ movie.appraise  }}</td>
            <td>One-line summary: {{ movie.survey  }}</td>
        </tr>
        {% endfor %}
    </table>
</div>
<div>
    <ul>
        <li>
            {% if page_obj.has_previous %}
             <a href="?page={{ page_obj.previous_page_number }}">Previous</a>
            {% endif %}
        </li>
        {% for i in paginator.page_range %}
        <li>
            <a href="?page={{ i }}">{{ i }}</a>
        </li>
        {% endfor %}
        <li>
            {% if page_obj.has_next %}
            <a href="?page={{ page_obj.next_page_number }}">Next</a>
            {% endif %}
        </li>
    </ul></div>
</body>
</html>



[Image: 豆瓣.jpg (Douban)]



Result:

[Image: 效果1.jpg (result 1)]

[Image: 效果2.jpg (result 2)]



Douban Top 250 movies: https://movie.douban.com/top250


The code has a problem, so I am not packaging it this time.

Known problem

Error message (a warning, as the output shows):
[Console]
"D:\Program Files\JetBrains\PyCharm 2019.1.2\bin\runnerw64.exe" E:\douban_movie\venv\Scripts\python.exe E:/douban_movie/manage.py runserver 8000
Watching for file changes with StatReloader
Performing system checks...

System check identified some issues:

WARNINGS:
douban.Movie.num: (fields.W122) 'max_length' is ignored when used with IntegerField.
        HINT: Remove 'max_length' from field

System check identified 1 issue (0 silenced).
November 14, 2019 - 19:46:31
Django version 2.2.7, using settings 'douban_movie.settings'
Starting development server at http://127.0.0.1:8000/
Quit the server with CTRL-BREAK.

Could someone more experienced explain this? Many thanks!
