Scraping Movie Data from 电影天堂 (Movie Heaven)

piaodoo, Programming Tutorials, 2020-02-22 22:10:04, Python Tutorials

This article originally appeared on the 吾爱破解 (52pojie) forum.

1
How it works:
Building the target URLs:
def page_urls():
    # The list pages differ only in the page number substituted into {}
    baseurl = 'http://www.ygdy8.net/html/gndy/dyzz/list_23_{}.html'
    for i in range(1, 30):  # pages 1 through 29
        url = baseurl.format(i)
        parse_url(url)
Paging through the list only requires changing the number substituted into {}.
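As a quick check of the URL pattern, the sketch below only builds the list-page URLs and sends no requests. The helper name `build_list_urls` and its parameters are illustrative and not part of the original script; the URL template is the one used above.

```python
BASE_URL = 'http://www.ygdy8.net/html/gndy/dyzz/list_23_{}.html'


def build_list_urls(first_page=1, last_page=29):
    """Return the list-page URLs for first_page..last_page (inclusive).

    Hypothetical helper for illustration; the original script formats
    and fetches each URL inside a single loop instead.
    """
    return [BASE_URL.format(page) for page in range(first_page, last_page + 1)]


if __name__ == '__main__':
    # Print the first three list-page URLs
    for url in build_list_urls(1, 3):
        print(url)
```

Note that `range(1, 30)` in the original loop stops at page 29, which is why the default `last_page` here is 29.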

Scraping the movie detail URLs:
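def parse_url(url):
    # Site root; the hrefs on a list page are assumed to be site-relative
    domain = 'http://www.ygdy8.net'
    # headers is a request-header dict defined elsewhere in the full script
    response = requests.get(url, headers=headers)
    html = etree.HTML(response.text)
    # Every movie on a list page sits inside a <table class="tbspan">
    tables = html.xpath('//table[@class="tbspan"]//a/@href')
    for table_url in tables:
        detail_url = domain + table_url
        spider(detail_url)  # hand the detail page to spider() from section 2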
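The href values returned by the XPath query are usually not full URLs, so they have to be combined with the site address before they can be requested. One possible way to do this is `urllib.parse.urljoin` from the standard library, which resolves an href against the page it was found on. This is a sketch rather than the original author's approach: `to_absolute` is a hypothetical helper name and the sample hrefs are made up.

```python
from urllib.parse import urljoin

LIST_PAGE = 'http://www.ygdy8.net/html/gndy/dyzz/list_23_1.html'


def to_absolute(page_url, hrefs):
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


if __name__ == '__main__':
    # Made-up hrefs: one site-relative, one already absolute
    sample_hrefs = [
        '/html/gndy/dyzz/20200101/00001.html',
        'http://www.ygdy8.net/html/gndy/dyzz/20200102/00002.html',
    ]
    for url in to_absolute(LIST_PAGE, sample_hrefs):
        print(url)
```

Unlike plain string concatenation, `urljoin` leaves an already absolute href unchanged, so the same code works whichever form the site uses.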

2
Required modules:
import time
import random
import requests
from lxml import etree
import csv
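Of these imports, `csv` is never used in the excerpts shown in this post, so the saving step is presumably in the part of the script that was left out. A minimal sketch of how scraped records could be written with `csv.DictWriter` follows. The function name `save_rows`, the choice of columns, and the `utf-8-sig` encoding are assumptions for illustration, not the author's code.

```python
import csv

# Column names; a subset of the dict keys the scraper fills in
FIELDNAMES = ['名字', '年代', '豆瓣评分']


def save_rows(rows, path, fieldnames=FIELDNAMES):
    """Write a list of dicts to a CSV file, one row per movie.

    restval='' leaves a cell empty when a record lacks a field, and
    extrasaction='ignore' skips keys that are not listed in fieldnames.
    utf-8-sig writes a BOM so spreadsheet programs detect the encoding.
    """
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(
            f, fieldnames=fieldnames, restval='', extrasaction='ignore')
        writer.writeheader()
        writer.writerows(rows)
```

Calling `save_rows(records, 'movies.csv')` with a list of scraped dicts would produce one header row followed by one row per movie.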

Main program (it is fairly long, so only part is shown):
def spider(page_urls):
    # page_urls holds the URL of one movie detail page
    data = {}
    response = requests.get(page_urls, headers=headers)
    # Decode the raw bytes as GBK instead of relying on response.text
    html = etree.HTML(response.content.decode('gbk'))
    title = html.xpath('//div[@class="title_all"]//font[@color="#07519a"]/text()')[0]
    data['名字'] = title  # name
    try:
        # Second image in the Zoom block; not used further in this excerpt
        images = html.xpath('//div[@id="Zoom"]//img/@src')[1]
    except IndexError:
        images = ''
        print('No second image found')
    try:
        # First image in the Zoom block, stored as the poster
        posters = html.xpath('//div[@id="Zoom"]//img/@src')[0]
    except IndexError:
        posters = ''  # keeps the assignment below from raising NameError
        print('No poster found')
    data['海报'] = posters  # poster
    # time.sleep(random.randint(1, 2))
    zoom_ = html.xpath('//div[@id="Zoom"]')[0]
    infos = zoom_.xpath('.//text()')
    for info in infos:
        # Each field is a text node that starts with a fixed label
        if info.startswith('◎年　　代'):
            data['年代'] = info.replace('◎年　　代', '').strip()  # year
        elif info.startswith('◎产　　地'):
            data['产地'] = info.replace('◎产　　地', '').strip()  # country/region
        elif info.startswith('◎类　　别'):
            data['类别'] = info.replace('◎类　　别', '').strip()  # genre
        elif info.startswith('◎语　　言'):
            data['语言'] = info.replace('◎语　　言', '').strip()  # language
        elif info.startswith('◎上映日期'):
            data['上映日期'] = info.replace('◎上映日期', '').strip()  # release date
        elif info.startswith('◎豆瓣评分'):
            score = info.replace('◎豆瓣评分', '').strip()
            data['豆瓣评分'] = score.split('/')[0]  # Douban rating, part before '/'
        elif info.startswith('◎片　　长'):
            data['片长'] = info.replace('◎片　　长', '').strip()  # runtime
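The long `if`/`elif` chain repeats the same steps for every field: match a label at the start of the text, remove it, and store what is left. As an alternative sketch (not the original author's code), the labels can live in a dict that maps each page label to its output key. The function name `parse_info` and the sample lines are made up for illustration; the labels and keys are the ones used above.

```python
# Maps the label as it appears on the page to the key used in `data`.
# The labels contain full-width spaces and must match the page exactly.
FIELD_LABELS = {
    '◎年　　代': '年代',      # year
    '◎产　　地': '产地',      # country/region
    '◎类　　别': '类别',      # genre
    '◎语　　言': '语言',      # language
    '◎上映日期': '上映日期',  # release date
    '◎豆瓣评分': '豆瓣评分',  # Douban rating
    '◎片　　长': '片长',      # runtime
}


def parse_info(lines):
    """Extract the labelled fields from a list of text nodes."""
    data = {}
    for line in lines:
        for label, key in FIELD_LABELS.items():
            if line.startswith(label):
                # Drop the label, then surrounding whitespace
                value = line[len(label):].strip()
                if key == '豆瓣评分':
                    # Keep only the part before the first '/'
                    value = value.split('/')[0]
                data[key] = value
                break
    return data


if __name__ == '__main__':
    # Made-up text nodes in the shape the detail pages use
    sample = [
        '◎年　　代　2019',
        'Some unrelated text',
        '◎豆瓣评分　7.5/10 from 1,234 users',
    ]
    print(parse_info(sample))
```

Adding another field then means adding one dict entry instead of another `elif` branch.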
3
Result screenshot:
[Screenshot: 微信图片_20191111235622.jpg, uploaded 2019-11-11 23:56]

Article link: https://www.piaodoo.com/7742.html