Getting Started with Python: Scraping Novels from 精彩阅读网 --- 1

piaodoo · Programming Tutorials · 2020-02-22 22:00:55 · Python tutorials

This article comes from the 吾爱破解 forum.

This post was last edited by HaNnnng on 2018-11-14 00:51.

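This is my first post, a record of how I am learning to write web crawlers.

It is a very simple example that scrapes a novel from 精彩阅读网. To scrape a different novel, manually change the URL on line 9 of the script (the `url` variable).

This version is a procedural crawler; tomorrow I will rework it into an object-oriented one.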
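As an alternative to editing the URL by hand for each novel, the index-page URL could be built from a book ID. This is only a minimal sketch: `BASE_URL` and `book_url` are hypothetical names that do not appear in the original script, and it assumes that every book's index page follows the same `/book/<id>.html` pattern as the one URL the script uses.

```python
# Assumed site root, taken from the URL hardcoded in the script
BASE_URL = 'http://www.jingcaiyuedu.com'


def book_url(book_id):
    """Build a book's index-page URL (assumed pattern: /book/<id>.html)."""
    return '%s/book/%s.html' % (BASE_URL, book_id)


# The book ID used in the script
url = book_url(317834)
```

With a helper like this, switching novels would only mean passing a different ID, provided the assumed URL pattern really holds for the site.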


# -*- coding: utf-8 -*-
__author__ = 'lwh'
__date__ = '2018/11/10 15:12'

import requests
import re

# Fetch the novel's index page
url = 'http://www.jingcaiyuedu.com/book/317834.html'
response = requests.get(url, timeout=10)
response.encoding = 'utf-8'
html = response.text
# Extract the novel's title
title = re.findall(r'<meta property="og:novel:book_name" content="(.*?)"/>', html)[0]

# Extract the chapter list: each chapter's URL and title
dl = re.findall(r'<dl class="panel-body panel-chapterlist">  <dd class="col-md-3">.*?</dl>', html, re.S)[0]
chapter_info_list = re.findall(r'href="(.*?)">(.*?)<', dl)

# Write every chapter into one text file named after the novel;
# the with-block closes the file even if a request fails midway
with open('%s.txt' % title, 'w', encoding='utf-8') as f:
    # Download each chapter in turn
    for chapter_url, chapter_title in chapter_info_list:
        chapter_url = 'http://www.jingcaiyuedu.com%s' % chapter_url
        response = requests.get(chapter_url, timeout=10)
        response.encoding = 'utf-8'
        html = response.text
        # Extract the chapter body
        chapter_content = re.findall(r' <div class="panel-body" id="htmlContent">(.*?)</div> ', html, re.S)[0]
        # Strip line-break and paragraph tags, then remove spaces
        for junk in ('<br />', '<br>', '<p>', '</p>', ' '):
            chapter_content = chapter_content.replace(junk, '')

        f.write(chapter_title)
        f.write('\n')
        f.write(chapter_content)
        f.write('\n\n\n\n\n')
        print(chapter_url)
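
The two parsing steps in the script (pulling the chapter list out of the index page and stripping tags from a chapter body) can be tried offline, without sending any requests. The following is a minimal sketch, not the author's code: `parse_chapter_list`, `clean_chapter` and `SAMPLE_INDEX` are hypothetical names, the sample markup is made up to resemble what the script's patterns expect, and the `<dl>` pattern is slightly loosened so that it does not depend on exact whitespace.

```python
import re

# Made-up markup shaped like what the script's regexes look for (an assumption,
# not a copy of the real site)
SAMPLE_INDEX = (
    '<dl class="panel-body panel-chapterlist">'
    '<dd class="col-md-3"><a href="/book/1/1.html">Chapter 1</a></dd>'
    '<dd class="col-md-3"><a href="/book/1/2.html">Chapter 2</a></dd>'
    '</dl>'
)


def parse_chapter_list(index_html):
    """Return (url, title) pairs from the chapter-list <dl>, or [] if it is missing."""
    match = re.search(r'<dl class="panel-body panel-chapterlist">.*?</dl>',
                      index_html, re.S)
    if match is None:
        # Avoids the IndexError that findall(...)[0] raises on an empty result
        return []
    return re.findall(r'href="(.*?)">(.*?)<', match.group(0))


def clean_chapter(raw):
    """Remove the same tags and spaces that the script strips from a chapter body."""
    for junk in ('<br />', '<br>', '<p>', '</p>', ' '):
        raw = raw.replace(junk, '')
    return raw


chapters = parse_chapter_list(SAMPLE_INDEX)
```

Splitting the parsing into small functions like these makes it possible to check the regular expressions against saved HTML before running the full download loop.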

Article link: https://www.piaodoo.com/7286.html