Source: the 吾爱破解 (52pojie) forum
This post was last edited by trioKun on 2019-8-27 18:07
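I wrote this little tool out of interest and would like to share it with everyone.
Because of network latency and anti-crawling mechanisms, the script still runs rather slowly; suggestions for improvement are welcome.
A brief introduction to what the script does:
The analyzer's basic idea is similar to Weibo's built-in recommendation "XX, whom you follow, also follows YY". It crawls users' follow lists and uses BFS to descend to any level of the follow chain, uncovering many people you may know. A few simple checks filter out big-V (high-profile verified) accounts and other invalid users.
As an example, running the analyzer gives you a list of users with the information shown below. Level is the depth in the follow chain: Level=1 means you follow the user directly, Level=2 means a user you follow directly follows that user, and so on. Score indicates how closely the user is related to your network; you can also customize the weight of each factor in Score.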
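The level-limited BFS described above can be sketched on a small in-memory graph. This is only an illustration: the `follows` and `follower_counts` data, the function names, and the follower-count threshold used to drop big-V accounts are all assumptions, not the author's actual crawler or filter rules.

```python
from collections import deque


def bfs_levels(follows, root, max_level):
    """Return {uid: level} for every user within max_level hops of root."""
    levels = {root: 0}
    queue = deque([root])
    while queue:
        uid = queue.popleft()
        if levels[uid] == max_level:
            continue  # do not expand beyond the requested depth
        for followed in follows.get(uid, []):
            if followed not in levels:  # first visit gives the shortest chain
                levels[followed] = levels[uid] + 1
                queue.append(followed)
    return levels


def drop_big_accounts(levels, follower_counts, max_followers):
    """Assumed filter: discard accounts with more than max_followers followers."""
    return {uid: lvl for uid, lvl in levels.items()
            if follower_counts.get(uid, 0) <= max_followers}


# Hypothetical data standing in for crawled follow lists.
follows = {1: [2, 3], 2: [4], 3: [4, 5], 4: [6], 5: []}
follower_counts = {2: 300, 3: 120, 4: 5_000_000, 5: 80, 6: 40}

levels = bfs_levels(follows, root=1, max_level=2)
kept = drop_big_accounts(levels, follower_counts, max_followers=100_000)
print(levels)
print(kept)
```

User 6 is never reached because user 4 already sits at the maximum level, and user 4 itself is dropped by the assumed follower threshold.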
Nickname: 兴趣使然的英雄
Gender: 男
Region: 北京 海淀区
Followers: 638
Tweets: 142
Last Tweet: 2019-05-11 04:06
Home Page: https://weibo.com/1234567890
Relation Level: 3
Relation Score: 90
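The post does not show how Score is computed, so the following is only a guess at what a weighted score with adjustable factors could look like. The attribute names `dist`, `in_degree` and `bidirectional_follow` appear in the analyzer code; the `UserStats` class, the weights and the linear formula are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UserStats:
    dist: int                    # relation level (1 = followed directly)
    in_degree: float             # weighted count of chain members following this user
    bidirectional_follow: float  # weighted count of mutual follows


# Hypothetical weights; changing them changes what counts as "related".
WEIGHTS = {"in_degree": 10.0, "bidirectional_follow": 20.0}


def relation_score(user, weights=WEIGHTS):
    """Assumed linear combination of the two relation factors."""
    return (weights["in_degree"] * user.in_degree
            + weights["bidirectional_follow"] * user.bidirectional_follow)


users = [UserStats(2, 1.0, 0.0), UserStats(3, 2.5, 1.0)]
ranked = sorted(users, key=relation_score, reverse=True)
for user in ranked:
    print(user.dist, round(relation_score(user)))
```

Sorting by the score in descending order mirrors how the analyzer's output lists the most related users first.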
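The full source code and further information are in the attachment; the full source can also be downloaded from GitHub (trioKun/Weibo-Relation-Analysis-Spider).
The core part of the analyzer is given below.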
[Python]
import sys
import threading
import time

# User, scoring, calc_days_until_now, except_wrapper_func and
# get_last_tweet_time_fullver are defined elsewhere in the full source.


# Fetch each user in a worker thread to hide the long response time
class GetUser(threading.Thread):
    def __init__(self, uid, pa_ind, ch_ind):
        threading.Thread.__init__(self)
        self.uid = uid
        self.pa_index = pa_ind  # index of the parent node in usr_nodes
        self.ch_index = ch_ind  # index of the placeholder to fill in

    def run(self):
        new_usr = User(self.uid)  # network request, done outside the lock
        with Analyzer.usr_nodes_mutex:
            Analyzer.usr_nodes[self.ch_index] = new_usr
            Analyzer.usr_nodes[self.ch_index].dist = Analyzer.usr_nodes[self.pa_index].dist + 1


class Analyzer:
    usr_nodes = []  # all detected users
    usr_nodes_mutex = threading.Lock()

    def __init__(self, uid, level=2, child_threads=3):
        """
        :param uid: the user id of the root user
        :param level: search level, 2 or 3 is recommended
        :param child_threads: max number of child threads, depends on the number of your cookies
        """
        self.root_uid = uid
        self.threads = 1 + child_threads  # main thread plus the child threads
        self.bfs(level)  # construct the relationship graph by BFS

    def bfs(self, level):  # search down to the given level
        cert = list()  # certification state, decides whether a user goes into usr_nodes
        # possible values of cert[i]:
        Certed = 2**level  # threshold: certified
        Init_Cert = 0
        Pot_Cert = range(0, Certed)  # potential: not followed by enough users yet
        Not_Cert = -1  # rejected by a filter strategy
        Exist = Certed + 1  # already in usr_nodes
        uids = list()  # uids[i] <==> cert[i]

        uids.append(self.root_uid)
        self.usr_nodes.append(User(self.root_uid))
        cert.append(Exist)
        self.usr_nodes[0].dist = 0
        self.usr_nodes[0].in_degree = 0

        curr = 0
        last_level = -1
        while True:
            with self.usr_nodes_mutex:
                curr_user = self.usr_nodes[curr]
            if curr_user.dist != last_level:
                print("current level is %d" % curr_user.dist)
                last_level = curr_user.dist
            print("\t %d.scanning uid %d..." % (curr, curr_user.usr_id))

            for follow_uid in curr_user.follow_uid_list:
                if follow_uid not in uids and curr_user.dist < level:
                    # follow_uid is not yet collected
                    uids.append(follow_uid)
                    cert.append(Init_Cert)
                if follow_uid in uids:
                    u_ind = uids.index(follow_uid)
                    if curr_user.dist < level and cert[u_ind] in Pot_Cert:
                        cert[u_ind] += 2**(level - curr_user.dist)
                        if cert[u_ind] == Certed:
                            # filter: ignore users who have not tweeted for this many days
                            MaxNoTweetDays = 180
                            if calc_days_until_now(self.get_last_tweet_time(follow_uid)) > MaxNoTweetDays:
                                cert[u_ind] = Not_Cert
                            else:
                                cert[u_ind] = Exist
                                print("\t\t found new uid %d" % follow_uid)
                                with self.usr_nodes_mutex:
                                    ins_index = len(self.usr_nodes)
                                    self.usr_nodes.append(User())  # an empty User object as a placeholder
                                # threading.activeCount() is deprecated; use active_count()
                                while threading.active_count() >= self.threads:
                                    time.sleep(0.1)
                                GetUser(follow_uid, curr, ins_index).start()
                    elif cert[u_ind] == Exist:
                        # already a node: increase the corresponding attributes
                        with self.usr_nodes_mutex:
                            n_ind = self.index_usr_node(follow_uid)
                        while n_ind == -1:
                            time.sleep(0.1)  # wait until the worker thread has filled it in
                            with self.usr_nodes_mutex:
                                n_ind = self.index_usr_node(follow_uid)
                        with self.usr_nodes_mutex:
                            # divide by (dist + 1) because dist could be 0
                            self.usr_nodes[n_ind].in_degree += 1 / (curr_user.dist + 1)
                            if curr_user.usr_id in self.usr_nodes[n_ind].follow_uid_list:
                                # the two users follow each other
                                self.usr_nodes[curr].bidirectional_follow += 1 / (self.usr_nodes[n_ind].dist + 1)
                                self.usr_nodes[n_ind].bidirectional_follow += 1 / (curr_user.dist + 1)

            curr += 1
            with self.usr_nodes_mutex:
                over = bool(curr >= len(self.usr_nodes))
            if over:
                break
            while True:
                with self.usr_nodes_mutex:
                    work_out = bool(self.usr_nodes[curr].usr_id != 0)
                if work_out:
                    break
                time.sleep(0.1)  # wait until the worker thread has filled it in

    @staticmethod
    def get_last_tweet_time(uid):
        return except_wrapper_func(get_last_tweet_time_fullver, uid)

    def output(self, file=sys.stdout):
        for user in sorted(self.usr_nodes, key=scoring, reverse=True):
            if user.dist >= 1 and user.usr_id != 0:
                user.show(file=file)
                print("Relation Level: %d" % user.dist, file=file)
                print("Relation Score: %d" % round(scoring(user)), file=file)
                print(file=file)

    def index_usr_node(self, uid):
        """
        :param uid: an int, the target uid
        :return: the index of the User with that uid, or -1 if absent
        """
        index = 0
        while index < len(self.usr_nodes) and self.usr_nodes[index].usr_id != uid:
            index += 1
        if index < len(self.usr_nodes):
            return index
        else:
            return -1
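The `cert` counter in `bfs` is easy to miss: each time a scanned user at depth `d` follows a candidate, the counter grows by `2**(level - d)`, and the candidate is accepted once it reaches `2**level`. The sketch below replays only that arithmetic. The helpers `votes_needed` and `is_certified` are hypothetical names, not part of the original source, and the depths are assumed to arrive in BFS order (non-decreasing).

```python
def votes_needed(level, follower_depth):
    """Followers at a single depth needed to certify a candidate on their own."""
    certed = 2 ** level
    step = 2 ** (level - follower_depth)
    return certed // step


def is_certified(level, follower_depths):
    """Replay the counter for a candidate followed by users at the given depths."""
    certed = 2 ** level
    cert = 0
    for depth in follower_depths:
        # same guard as in bfs: only count while still below the threshold
        if depth < level and cert < certed:
            cert += 2 ** (level - depth)
    return cert == certed


print(votes_needed(3, 0), votes_needed(3, 1), votes_needed(3, 2))
print(is_certified(2, [1]), is_certified(2, [1, 1]), is_certified(3, [1, 2, 2]))
```

Read this way, anyone the root follows is accepted at once, while a user reached only through deeper levels needs to be followed by several members of the chain, which seems to be how the analyzer keeps unrelated accounts out.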
Attachment: Weibo-Relation-Analysis-Spider-master.zip (10.95 KB, uploaded 2019-8-27 16:39), the full source code of the analyzer.