This post originally appeared on the 52pojie (吾爱破解) forum.
I've been learning distributed crawling recently.
Fluency with lxml and XPath turned out to matter a lot, so I wrote a practice script that scrapes Tencent's job listings.
It doesn't use threads. So far I only know multiprocessing, and multiprocessing returns results out of order, so I didn't use it either.
IDE: PyCharm
Python version: 3.6.5
import requests
from lxml import etree

BASE_DOMAIN = "https://hr.tencent.com/"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0",
    "Cookie": "PHPSESSID=77cb9dm9pvcs7lgeu401lc0td1; pgv_pvi=9957434368; pgv_si=s9246081024",
    "Host": "hr.tencent.com",
    "Upgrade-Insecure-Requests": "1"
}

# Collect the detail-page URLs from one listing page
def get_urls(url):
    response = requests.get(url, headers=HEADERS)
    html = etree.HTML(response.text)
    hrefs = html.xpath("//td[@class='l square']/a/@href")
    # The hrefs are relative, so prepend the domain
    return [BASE_DOMAIN + href for href in hrefs]

# Scrape one detail page and collect the fields into a dict
def parse_detail_page(url):
    position = {}
    response = requests.get(url, headers=HEADERS)
    html = etree.HTML(response.text)
    table = html.xpath("//table[@class='tablelist textl']")[0]
    # Note the leading '.' in './/': without it the query would search the
    # whole document again instead of being scoped to this table
    position['title'] = table.xpath(".//td[@id='sharetitle']/text()")[0]
    place = table.xpath(".//tr[@class='c bottomline']/td//text()")
    position['workplace'] = place[0] + place[1]
    position['JobCategory'] = place[2] + place[3]
    position['Hiring'] = place[4] + place[5]
    content = table.xpath(".//ul[@class='squareli']")
    position['duty'] = content[0].xpath(".//text()")
    position['requirements'] = content[1].xpath(".//text()")
    return position

# Main loop
def spider():
    informations = []
    # Search URL for the listing pages; start={}0 steps through them in tens
    page = "https://hr.tencent.com/position.php?keywords=python&start={}0#a"
    # Number of pages to crawl
    for x in range(0, 1):
        url = page.format(x)
        for link in get_urls(url):
            informations.append(parse_detail_page(link))
    print(informations)

if __name__ == '__main__':
    spider()
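To see the lxml/XPath calls the spider relies on in isolation, here is a self-contained sketch against a static HTML snippet. The snippet and its values are made up for illustration; it just mirrors the table structure the spider expects, and shows how `.//` scopes a query to an element rather than the whole document:

```python
from lxml import etree

# Made-up snippet mimicking the structure of a job detail page
HTML = """
<table class="tablelist textl">
  <tr><td id="sharetitle">Python Engineer</td></tr>
  <tr class="c bottomline">
    <td>Location: </td><td>Shenzhen</td>
  </tr>
</table>
"""

html = etree.HTML(HTML)                                    # parse into an element tree
table = html.xpath("//table[@class='tablelist textl']")[0]  # absolute: whole document
title = table.xpath(".//td[@id='sharetitle']/text()")[0]    # relative: inside this table
place = table.xpath(".//tr[@class='c bottomline']/td//text()")

print(title)                # Python Engineer
print(place[0] + place[1])  # Location: Shenzhen
```

Concatenating adjacent text nodes (`place[0] + place[1]`) is exactly how the spider builds its `workplace` and similar fields.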