Whew, it's been a while since I last wrote anything here — this blog almost really did go "from getting started to giving up". Honestly, I've been slacking off for quite a stretch. Anyway, I saw Wolf Warrior (《战狼》) at the start of the month and thoroughly enjoyed it. Then I kept running into articles online analyzing the Wolf Warrior comments, and I couldn't resist putting together my own.
Alright. Talk is cheap, show me the code…
First, of course, import urllib's request module and the re library.
from urllib import request
import re
We start by defining a function dedicated to fetching a page's HTML. To make offline analysis easier, I also save the page source to a local source.html (I've commented that part out):
def get_html_data(url):
    r = request.urlopen(url)
    html_data = r.read().decode('utf-8')
    '''
    # save the raw HTML to a local file for inspection
    with open('source.html', 'w', encoding='utf-8') as f:
        f.write(html_data)
    '''
    return html_data
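As a quick sanity check, you could call it on the first page of comments — this is the same URL template that main() builds later, with start=0 just asking for the first 20 comments:

# fetch one page of comment HTML and peek at its size
first_page = 'https://movie.douban.com/subject/26363254/comments?start=0&limit=20&sort=new_score&status=P'
print(len(get_html_data(first_page)))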
Next, we collect the text of the users' comments:
def get_vote_data(comments, html_data):
    # extract the comment text from the page and append it to comments
    reg_comment = r'<p class=""> (.*?)</p>'
    list_comment = re.compile(reg_comment, re.S).findall(html_data)
    for i in range(len(list_comment)):
        comments = comments + list_comment[i]
    return comments
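To make the regex concrete, here is a tiny invented snippet shaped like Douban's comment markup — the two sample comments are made up, not real data:

# two fake comment paragraphs in the <p class=""> ... </p> shape the regex expects
sample_html = '<p class=""> 吴京好帅</p>\n<p class=""> 场面很燃</p>'
print(get_vote_data('', sample_html))   # -> 吴京好帅场面很燃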
Data cleaning: strip the punctuation out of the comments (the regex simply keeps only the Chinese characters):
def get_cleaned_comment(comments):
    # clean the comment data: keep only the Chinese characters
    pattern = re.compile(r'[\u4e00-\u9fa5]+', re.S)
    filter_comment = pattern.findall(comments)
    cleaned_comment = ''.join(filter_comment)
    return cleaned_comment
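A quick example with an invented comment string — digits, punctuation and the like are all dropped, leaving only the Chinese text:

# everything outside the \u4e00-\u9fa5 range is removed
print(get_cleaned_comment('太燃了!!!五星好评~ 2017'))   # -> 太燃了五星好评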
Next comes word segmentation and generating the word cloud. First, of course, import the necessary modules:
import jieba                      # word segmentation
import pandas as pd               # data analysis
import numpy
import matplotlib.pyplot as plt
import matplotlib
from wordcloud import WordCloud
Then define the word-cloud function:
def word_cloud(cleaned_comment):
    # word segmentation
    segment = jieba.lcut(cleaned_comment)
    words_df = pd.DataFrame({'segment': segment})

    # drop meaningless stopwords; quoting=3 means QUOTE_NONE (never treat quotes specially)
    stopwords = pd.read_csv("stopwords.txt", index_col=False, quoting=3,
                            sep="\t", names=['stopword'], encoding='utf-8')
    words_df = words_df[~words_df.segment.isin(stopwords.stopword)]
    #print(words_df)

    # word-frequency statistics
    # (named aggregation; the old dict-style agg({"计数": numpy.size}) was removed in newer pandas)
    words_stat = words_df.groupby(by=['segment'])['segment'].agg(计数=numpy.size)
    words_stat = words_stat.reset_index().sort_values(by=['计数'], ascending=False)
    #print(words_stat.head())

    # build a word -> frequency dict from the top 1000 words
    word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}
    #print(word_frequence)
    '''
    # older versions of wordcloud wanted a list of (word, count) tuples instead
    word_frequence_list = []
    for key in word_frequence:
        temp = (key, word_frequence[key])
        word_frequence_list.append(temp)
    #print(word_frequence_list)
    '''

    # render the word cloud: set the font, background colour and max font size
    wordcloud = WordCloud(font_path='simhei.ttf', background_color='white',
                          max_font_size=80).fit_words(word_frequence)
    plt.figure()
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.show()
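As an aside, pandas is fairly heavy machinery just for counting words; the same word-to-frequency dict could be built with collections.Counter. This is only a sketch of an alternative, not what the function above uses, and the stopwords argument is assumed to be a plain Python set:

from collections import Counter

def word_frequencies(cleaned_comment, stopwords, top_n=1000):
    # segment, drop stopwords, then keep the top_n most common words as a dict
    words = [w for w in jieba.lcut(cleaned_comment) if w not in stopwords]
    return dict(Counter(words).most_common(top_n))

The resulting dict could be passed straight to fit_words(), just like word_frequence above.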
Finally, define the main function:
def main():
    comments = ''
    page = 10
    for i in range(page):
        n = i * 20
        url = 'https://movie.douban.com/subject/26363254/comments?start=%s&limit=20&sort=new_score&status=P' % n
        html_data = get_html_data(url)
        comments = get_vote_data(comments, html_data)
        print(i / page * 100, ' %')
    cleaned_comment = get_cleaned_comment(comments)
    word_cloud(cleaned_comment)

if __name__ == '__main__':
    main()
That's the code done. There is one problem with it, though: Douban has an anti-crawler mechanism. If you test too often or crawl too many pages, you trigger it and get a 403 back. In a few days I'll add a simulated login and tidy this up a bit.
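Until the simulated login is in place, one stopgap worth trying is sending a browser-like User-Agent and pausing between pages — whether that is enough to avoid Douban's 403 is only an assumption on my part. A minimal sketch (the header string and the 2-second pause are arbitrary choices of mine):

import time
from urllib import request

def get_html_data_politely(url):
    # hypothetical variant of get_html_data: same fetch, but with a
    # browser-like User-Agent header attached to the request
    req = request.Request(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'})
    time.sleep(2)   # wait a bit between pages so the requests are less bursty
    return request.urlopen(req).read().decode('utf-8')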