This post originally appeared on the 52pojie (吾爱破解) forum.
Background:
While studying English recently, I came across a series called The English We Speak. It isn't an especially popular resource: the audio is fairly easy to find, but the transcripts are hard to come by (bundled downloads especially), and some sites require you to register and spend points to buy them.
Eventually I found that kekenet (可可英语) hosts the transcripts online. Since there are 300-plus episodes and no batch download, I decided to scrape them with Python and turn them into a PDF for easy use, grabbing the audio along the way. The site scraped: http://m.kekenet.com/menu/14439/index.shtml
Analysis
1. Both the audio and the transcript text are in the page source.
2. The audio link can be pulled straight out with BeautifulSoup.
3. Scraping the text is slightly trickier: its position isn't fixed, so only the enclosing parent tag can be selected reliably. Inside the target <p> tags, the content of the <span> tags isn't needed, the <strong> tags aren't needed, and <br/> should become a newline.
4. Problem 3 is solved with regular expressions: convert the tag to a str, delete the <span> tags together with their content, strip the <strong> tags, replace <br/> with newlines, then parse the string back into a tag, loop over the <p> tags, take just the text with get_text(), and concatenate everything for saving.
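The cleanup in steps 3-4 can be sketched on a small made-up fragment (the HTML below is invented for illustration; the real pages have the same shape but different content):

```python
import re
from bs4 import BeautifulSoup

# A made-up fragment shaped like the real pages: the <p> holds an unwanted
# <span>, an unwanted <strong> wrapper, and a <br/> line break.
fragment = '<p><span>[00:12]</span><strong>Neil:</strong> Hello!<br/>Welcome to the show.</p>'

content = re.sub('<span.*?span>', '', fragment, flags=re.S | re.I)  # drop <span>...</span> and its content
content = re.sub('</?strong>', '', content, flags=re.S | re.I)      # strip the <strong> tags, keep their text
content = re.sub('<br/>', '\n', content, flags=re.S | re.I)         # turn <br/> into real newlines

text = BeautifulSoup(content, 'html.parser').get_text()
print(repr(text))
```

The non-greedy `<span.*?span>` stops at the first closing `span>`, so each span is removed individually rather than everything between the first opener and the last closer.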
[Python]
import requests
from bs4 import BeautifulSoup
import re
from tqdm import tqdm

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36'
}


def remove_span_tag(tag):
    # Drop <span>...</span> together with their content.
    content = str(tag)
    treated_content = re.sub('<span.*?span>', '', content, flags=re.S | re.I)
    return BeautifulSoup(treated_content, 'lxml')


def remove_strong_tag(tag):
    # Strip the <strong> tags but keep the text inside them.
    content = str(tag)
    treated_content = re.sub('</?strong>', '', content, flags=re.S | re.I)
    return BeautifulSoup(treated_content, 'lxml')


def remove2next1(string):
    # Collapse double newlines into single ones.
    return re.sub('\n\n', '\n', string, flags=re.S | re.I)


def change_br2next(tag):
    # Turn <br/> into real line breaks.
    content = str(tag)
    treated_content = re.sub('<br/>', '\n', content, flags=re.S | re.I)
    return BeautifulSoup(treated_content, 'lxml')


def get_html(url):
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        response.encoding = 'utf-8'
        return response.text


def parse_audio_text(html):
    # Pull the title, the mp3 URL, and the cleaned transcript from one episode page.
    soup = BeautifulSoup(html, 'lxml')
    title = soup.select('div.f-title')[0].string
    audio = soup.select('#show_mp3 > audio')[0].source['src']
    content = soup.select('#content > div > div.infoMain > div.f-y.w.hauto')[0]
    texts = content.select('p')
    result = ''
    for text in texts:
        result += change_br2next(remove_strong_tag(remove_span_tag(text))).get_text()
    result_text = remove2next1(result)
    return title, audio, result_text


def parse_index(html):
    # Collect the episode links from one index page.
    soup = BeautifulSoup(html, 'lxml')
    links = soup.select('.listItem')
    srcs = []
    for link in links:
        src = 'http://m.kekenet.com' + link.select('a')[0]['href']
        srcs.append(src)
    return srcs


def save_text(title, content):
    with open(title + '.txt', 'a', encoding='utf-8') as f:
        f.write(content)


def downloadFILE(url, name):
    # Stream the mp3 to disk with a tqdm progress bar.
    resp = requests.get(url=url, stream=True, headers=headers)
    content_size = int(int(resp.headers['Content-Length']) / 1024)
    with open(name, 'wb') as f:
        print('Pkg total size is:', content_size, 'k, start...')
        for data in tqdm(iterable=resp.iter_content(1024), total=content_size, unit='k', desc=name):
            f.write(data)
    print(name, 'download finished!')


if __name__ == '__main__':
    # Pages 1-23 are List_{n}.shtml; page 24 is index.shtml
    # (http://m.kekenet.com/menu/14439/index.shtml).
    for i in range(1, 24):
        url = 'http://m.kekenet.com/menu/14439/List_{}.shtml'.format(i)
        html = get_html(url)
        srcs = parse_index(html)
        print('list', i)
        for src in srcs:
            detail_html = get_html(src)
            title, audio, result_text = parse_audio_text(detail_html)
            # Use the episode number ("第N期") as the file name, zero-padded to 3 digits.
            match = re.search('第(.*?)期', title, re.S)
            if match:
                title = match.group(1).zfill(3)
            print(audio)
            print(result_text)
            save_text(title, result_text)
            downloadFILE(audio, title + '.mp3')
Finally I did the layout in Word and exported a PDF, which came to 400-plus pages, as shown below.
[Screenshot: the english we speak pdf.PNG]
I've put the scraped results on a network drive; grab them if you need them. Now I can finally enjoy studying English.
The English We Speak (BBC authentic English)
Link: https://pan.baidu.com/s/1OKO6wo1hQ1xEIOQYHd62lQ
Extraction code: 3k9t