
A Scraper for 知音漫客

piaodoo · Programming Tutorials · 2020-02-22 22:14:14 · Python tutorial

Source: the 吾爱破解 forum.

This post was last edited by 天空宫阙 on 2019-10-23 23:56.

1. I found that 知音漫客 is a good site to practice on. The H5 pages already expose the real image URLs in their source, so I chose the PC site for the exercise instead: there, the image's path on the server is lightly obfuscated. The core step is deciphering chapter_addr to recover this.imgpath (the image's location on the server). In truth, a comic's location on the server is fairly fixed, with only a handful of possible patterns, but I still worked out the this.imgpath decoding by capturing and analyzing the traffic. Put plainly, it is just a substitution cipher, much like replacing each letter with the one after it, except that here the shift is applied to Unicode code points and is not necessarily one position.

[Screenshot: the function that performs the decode operation on this.imgpath]

A piece of JavaScript recovered after this.decode is deciphered:
[JavaScript]
!__cr.imgpath=__cr.imgpath.replace(/./g,function(a){return String.fromCharCode(a.charCodeAt(0)-__cr.chapter_id%10)})!

Python can mimic it like this:
[Python]
def decode(raw, chapter_id):
    # shift each code point by the last digit of chapter_id (chapter_id % 10)
    # decoding subtracts the shift; encoding adds it
    # !__cr.imgpath=__cr.imgpath.replace(/./g,function(a){return String.fromCharCode(a.charCodeAt(0)-__cr.chapter_id%10)})!
    result = ''
    for i in raw:
        result += chr(ord(i)-int(chapter_id) % 10)
    return result
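
As a quick sanity check, encoding and decoding should round-trip: the page encodes by adding chapter_id % 10 to each code point, and decode() subtracts it again. A minimal sketch with made-up values:

[Python]
chapter_id = '12345'   # made-up chapter id; shift = 12345 % 10 = 5
shift = int(chapter_id) % 10
encoded = ''.join(chr(ord(c) + shift) for c in '2019/comic/path/')
print(decode(encoded, chapter_id))  # -> 2019/comic/path/
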
So how do we get the snippet above out of this.decode in the first place? In fact, a similar operation was applied to it.

2. The complete source code is as follows:
[Python]
import requests
from bs4 import BeautifulSoup
import json
import time
import os
import re
from tqdm import tqdm

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'
}

def get_index(index_url):
    chapterslist = {}
    response = requests.get(index_url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    chapterList = soup.select('#chapterList')[0]
    chapters = chapterList.select('a')
    for chapter in chapters:
        chapterslist[chapter['title']] = chapter['href']
    return chapterslist
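
# get_index() returns {chapter title: relative link}; titles look like '1话', '2话', ...
# e.g. (hrefs hypothetical): {'1话': '447764.html', '2话': '447765.html'}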


def quote_keys_for_json(json_str):
    """给键值不带双引号的json字符串的所有键值加上双引号。
    注:解析一般的不严格的json串,可以check out https://github.com/dmeranda/demjson, 速度比标准库要慢。"""
    quote_pat = re.compile(r'".*?"')
    a = quote_pat.findall(json_str)
    json_str = quote_pat.sub('@', json_str)
    key_pat = re.compile(r'(\w+):')
    json_str = key_pat.sub(r'"\1":', json_str)
    assert json_str.count('@') == len(a)
    count = -1

    def put_back_values(match):
        nonlocal count
        count += 1
        return a[count]
    json_str = re.sub('@', put_back_values, json_str)
    return json_str

def decode(raw, chapter_id):
    # shift each code point by the last digit of chapter_id (chapter_id % 10)
    # decoding subtracts the shift; encoding adds it
    # !__cr.imgpath=__cr.imgpath.replace(/./g,function(a){return String.fromCharCode(a.charCodeAt(0)-__cr.chapter_id%10)})!
    result = ''
    for i in raw:
        result += chr(ord(i)-int(chapter_id) % 10)
    return result

def get_info(index_url, num, index_dict):
    base = index_url
    tail = index_dict[f'{num}话']
    detail_url = base + tail
    response = requests.get(detail_url, headers=headers)
    raw_address = BeautifulSoup(response.text, 'lxml').select('#content > div.comiclist > script')[0].string
    address = re.search(r'__cr.init\(({.*?})\)', raw_address, re.S)
    if not address:
        raise ValueError(f'__cr.init(...) not found at {detail_url}')
    # the payload looks like a Python dict literal, but its keys are unquoted,
    # so run it through quote_keys_for_json() first
    # quote_keys_for_json() comes from https://segmentfault.com/q/1010000006090535?_ea=1009953
    info = json.loads(quote_keys_for_json(address.group(1)))
    return info

def get_certain_chapter_links(index_url,chapter,index_dict):
    certain_chapter_links = []
    info = get_info(index_url, chapter, index_dict)
    image_path = decode(info['chapter_addr'], info['chapter_id'])
    certain_chapter_total = int(info['end_var'])
    for num in range(1,certain_chapter_total+1):
        # core concatenation, from the page JS: "//" + i + "/comic/" + this.imgpath + a
        image_address = 'http://mhpic.' + info['domain'] + '/comic/' + image_path + str(num) + '.jpg' + info['comic_definition']['middle']
        # for the high-definition variant, use info['comic_definition']['high'] instead of ['middle']
        certain_chapter_links.append(image_address)
    return certain_chapter_total,certain_chapter_links

def downloadFILE(url,name):
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'
    }
    resp = requests.get(url=url,stream=True,headers=headers)
    content_size = int(int(resp.headers['Content-Length'])/1024)
    with open(name, "wb") as f:
        print("Pkg total size is:",content_size,'k,start...')
        for data in tqdm(iterable=resp.iter_content(1024),total=content_size,unit='k',desc=name):
            f.write(data)
        print(name , "download finished!")

if __name__ == "__main__":
    # comic homepage: https://www.zymk.cn/1/
    index_url = 'https://www.zymk.cn/1/'
    index_dict = get_index(index_url)
    # download directory (zyresult)
    if not os.path.exists('zyresult'):
        os.mkdir('zyresult')
    # scrape chapters 1 through 802
    for chapter in range(1,803):
        try:
            total,certain_chapter_links = get_certain_chapter_links(index_url,chapter,index_dict)
            for i in range(0,total):
                temp = f'{str(chapter).zfill(3)}话{str(int(i)+1).zfill(2)}.jpg'
                name = os.path.join('zyresult',temp)
                url = certain_chapter_links[i]
                downloadFILE(url,name)
        except Exception as e:
            error = f'error at {chapter} ep'
            detail = str(e)
            print(error+'\n'+detail+'\n')
            with open('log.txt', 'a', encoding='utf-8') as f:
                f.write(error + '\n' + detail + '\n')
            continue
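
Before launching the full 800-chapter run, you can exercise the helpers on a single chapter; a minimal sketch (assuming chapter 1 exists in the index and the site layout is unchanged):

[Python]
index_dict = get_index('https://www.zymk.cn/1/')  # title -> relative link
total, links = get_certain_chapter_links('https://www.zymk.cn/1/', 1, index_dict)
print(total, links[0])  # page count and first image URL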

3. The end result: I downloaded the whole of 斗破苍穹 and found nothing wrong with the comic.

[Screenshot: download results for 斗破苍穹 (doupo.PNG)]


