A Novel Crawler for a Certain Book Site

piaodoo Programming Tutorials 2020-02-22 22:06:37 Python Tutorials

This article comes from the 吾爱破解 forum.

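There are many tools for parsing scraped data: regular expressions, XPath, BeautifulSoup, and so on. I had heard from more experienced people that BeautifulSoup is the slowest of these, so for a long time I had no interest in it. Today, though, I suddenly realized that BeautifulSoup has its advantages too. Here is an example: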

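Some time ago I watched a video in which a team shared a novel crawler that parsed its data with regular expressions. I happened to be learning regex at the time, so I followed along and practiced the code. But content extracted with regex needs many rounds of cleaning before you get reasonably clean text. For BeautifulSoup, all of that is easy: get_text() returns quite clean content directly.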
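To make the contrast concrete, here is a minimal sketch that extracts the same text both ways. The markup in SAMPLE_HTML and the three helper functions are made up for illustration; only the class name mainContenr is taken from the crawler in this post. Real pages will likely need more cleaning passes than the two shown.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical chapter markup, invented for this comparison.
SAMPLE_HTML = (
    '<div class="mainContenr">'
    "&nbsp;&nbsp;First paragraph.<br />"
    "&nbsp;&nbsp;Second paragraph."
    "</div>"
)


def extract_with_regex(html):
    """Capture the div body with a regex; tags and entities come along."""
    match = re.search(r'<div class="mainContenr">(.*?)</div>', html, re.S)
    return match.group(1)


def clean_regex_result(raw):
    """The extra cleaning passes the regex capture still needs."""
    text = raw.replace("<br />", "\n")  # turn line-break tags into newlines
    text = text.replace("&nbsp;", "")   # drop the indentation entities
    return text.strip()


def extract_with_bs4(html):
    """Let BeautifulSoup drop the tags and decode the entities."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_="mainContenr")
    # separator="\n" keeps the paragraphs apart; strip=True trims each piece.
    return div.get_text(separator="\n", strip=True)


raw = extract_with_regex(SAMPLE_HTML)
print(repr(raw))                            # still contains &nbsp; and <br />
print(repr(clean_regex_result(raw)))        # clean only after manual passes
print(repr(extract_with_bs4(SAMPLE_HTML)))  # clean in one call
```

The regex capture still carries the entities and tags, so every new kind of residue on the page means another replace call, while get_text() handles tags and entities in one step.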

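I am a self-taught Python beginner, so please go easy on me. If you have any experience to share, I would be grateful for your advice. Thank you.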

The BeautifulSoup version of the code is attached below. I also wrote a regex version earlier, but I seem to have lost it.

import requests
from bs4 import BeautifulSoup


class NovelSpider:
    """Novel crawler for a certain book site."""

    def __init__(self):
        self.session = requests.Session()

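    def get_novel(self, url):  # main logic
        """Download a whole novel into a single text file."""
        # Download the HTML of the novel's index page
        index_html = self.download(url, encoding="gbk")
        # Novel title
        soup = BeautifulSoup(index_html, "html.parser")
        article_title = soup.find('a', class_="article_title").get_text()

        # Extract the chapter info: (url, title) pairs
        novel_chapter_infos = self.get_chapter_info(index_html)
        # Create the output file "<novel title>.txt"; the with-block closes
        # it even if a chapter download fails partway through
        with open(f"{article_title}.txt", "w", encoding="utf-8") as fb:
            # Download the chapters one by one
            for chapter_info in novel_chapter_infos:
                # Write the chapter title
                fb.write(f"{chapter_info[1]}\n")
                # Download and write the chapter body
                content = self.get_chapter_content(chapter_info[0])
                fb.write(f"{content}\n")
                print(chapter_info)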

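    def download(self, url, encoding):
        """Download a page and return its HTML decoded with `encoding`."""
        r = self.session.get(url)
        r.encoding = encoding
        # r.text honors r.encoding; r.content would return the raw bytes
        # and ignore the encoding set above
        return r.text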

    def get_chapter_info(self, index_html):
        """Extract (url, title) for every chapter on the index page."""
        soup = BeautifulSoup(index_html, "html.parser")
        chapter_list = soup.find('div', class_="chapterNum")
        data = []
        for item in chapter_list.find_all("li"):
            link = item.find('a')
            if link is None:  # skip list items that carry no link
                continue
            data.append((link["href"], link.get_text()))

        return data

    def get_chapter_content(self, chapter_url):
        """Download one chapter and return its text."""
        chapter_html = self.download(chapter_url, encoding="gbk")

        soup = BeautifulSoup(chapter_html, "html.parser")
        content = soup.find("div", class_="mainContenr")
        # Remove the stray "style5();" text that can end up in the result
        content = content.get_text().replace("style5();", '')
        return content


if __name__ == '__main__':
    novel_url = 'http://www.quanshuwang.com/book/9/9055'
    spider = NovelSpider()
    spider.get_novel(novel_url)
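
The chapter-list step can be checked offline, without requesting anything from the site. The sketch below is a hypothetical variant of get_chapter_info: SAMPLE_INDEX, parse_chapters, and the example.com URLs are invented for illustration, and only the chapterNum class name comes from the crawler above. It also assumes that some links could be relative and resolves them with urljoin, which the original code does not do.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical index-page fragment; only the class name "chapterNum"
# is taken from the crawler above.
SAMPLE_INDEX = """
<div class="chapterNum">
  <ul>
    <li><a href="/book/9/9055/1.html">Chapter 1</a></li>
    <li><a href="http://example.com/book/9/9055/2.html">Chapter 2</a></li>
    <li>no link here</li>
  </ul>
</div>
"""


def parse_chapters(index_html, base_url):
    """Return (absolute_url, title) tuples found on an index page."""
    soup = BeautifulSoup(index_html, "html.parser")
    container = soup.find("div", class_="chapterNum")
    if container is None:
        return []  # the layout changed or this is not an index page
    chapters = []
    for item in container.find_all("li"):
        link = item.find("a")
        if link is None or not link.get("href"):
            continue  # skip list items without a usable link
        # urljoin leaves absolute URLs alone and resolves relative ones
        absolute_url = urljoin(base_url, link["href"])
        chapters.append((absolute_url, link.get_text(strip=True)))
    return chapters


chapters = parse_chapters(SAMPLE_INDEX, "http://example.com/book/9/9055")
for chapter_url, chapter_title in chapters:
    print(chapter_url, chapter_title)
```

Returning an empty list when the container is missing is one possible design choice: it lets the caller decide whether an empty result is an error, instead of failing with an AttributeError on None.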

Article link: https://www.piaodoo.com/7550.html