The Rise of Women, Through the Data Lens of The New York Times
2020-11-03 11:01
Produced by Big Data Digest
Source: medium
Compiled by: Hippo, lin, 夏雅薇
How women are portrayed in the media reflects the rise of feminism in a society.
Today, the Digest introduces a project that uses sentiment analysis, frequent-term visualization, and topic modeling to investigate how women have been represented in New York Times coverage over the past 70 years. Let's take a look.
For this investigation, the author scraped the relevant New York Times data through the Archive API on the NYT Developer Portal.
First, get an API key (link: https://developer.nytimes.com/). Don't worry, it's free! Once you have the key, it is as if you had opened the floodgates of the New York Times data dam: the data streams out continuously. Because this type of API lends itself to bulk collection, the data is not filtered in advance. If you want to rerun the experiment, follow the instructions in the Jupyter notebooks published on GitHub. A video version of this article is also available (link: https://www.youtube.com/watch?v=rK-9t1IS0A4&feature=youtu.be).
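For reference, a single Archive API call returns one month of article metadata. A minimal sketch (YOUR_KEY is a placeholder for your own key):
import requests
# Endpoint format: /svc/archive/v1/{year}/{month}.json
url = 'https://api.nytimes.com/svc/archive/v1/2019/1.json?api-key=YOUR_KEY'
response = requests.get(url).json()
print(len(response['response']['docs']))  # number of articles published in January 2019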
Through the author's project repository on GitHub (link: https://github.com/sasha-talks-tech/New-York-Times) you can also access all the instructions, code notebooks, and results, which makes replication much smoother.
Collecting Data via the API and Topic Modeling with SpaCy and Gensim
Before going any further, we run topic modeling on the bulk of New York Times articles from January 2019 through September 2020, analyzing headlines, keywords, and lead paragraphs. The goal is to separate the most prevalent issues from the perennial topics, to make sure the study stays true to the Times' mission and does not misrepresent its journalistic style.
The data collection approach was inspired by Briena Herold's very useful tutorial (link: https://towardsdatascience.com/collecting-data-from-the-new-york-times-over-any-period-of-time-3e365504004).
Let's import the necessary tools and libraries:
import os
import pandas as pd
import requests
import json
import time
import dateutil
import datetime
from dateutil.relativedelta import relativedelta
import dateutil.parser  # parse_response() below relies on dateutil.parser.parse
import glob
Determine the timeframe of the analysis:
end = datetime.date.today()
start = datetime.date(2019, 1, 1)
The following helper functions (see the tutorial: https://towardsdatascience.com/collecting-data-from-the-new-york-times-over-any-period-of-time-3e365504004) pull the relevant New York Times data through the API and save it to dedicated .csv files:
def send_request(date):
    '''Sends a request to the NYT Archive API for given date.'''
    base_url = 'https://api.nytimes.com/svc/archive/v1/'
    url = base_url + '/' + date[0] + '/' + date[1] + '.json?api-key=' + 'F9FPP1mJjiX8pAEFAxBYBg08vZECa39n'
    try:
        response = requests.get(url, verify=False).json()
    except Exception:
        return None
    time.sleep(6)
    return response

def is_valid(article, date):
    '''An article is only worth checking if it is in range, and has a headline.'''
    is_in_range = date > start and date < end
    has_headline = type(article['headline']) == dict and 'main' in article['headline'].keys()
    return is_in_range and has_headline

def parse_response(response):
    '''Parses and returns response as pandas data frame.'''
    data = {'headline': [],
            'date': [],
            'doc_type': [],
            'material_type': [],
            'section': [],
            'keywords': [],
            'lead_paragraph': []}
    articles = response['response']['docs']
    for article in articles:
        # For each article, make sure it falls within our date range
        date = dateutil.parser.parse(article['pub_date']).date()
        if is_valid(article, date):
            data['date'].append(date)
            data['headline'].append(article['headline']['main'])
            if 'section' in article:
                data['section'].append(article['section_name'])
            else:
                data['section'].append(None)
            data['doc_type'].append(article['document_type'])
            if 'type_of_material' in article:
                data['material_type'].append(article['type_of_material'])
            else:
                data['material_type'].append(None)
            keywords = [keyword['value'] for keyword in article['keywords'] if keyword['name'] == 'subject']
            data['keywords'].append(keywords)
            if 'lead_paragraph' in article:
                data['lead_paragraph'].append(article['lead_paragraph'])
            else:
                data['lead_paragraph'].append(None)
    return pd.DataFrame(data)

def get_data(dates):
    '''Sends and parses request/response to/from NYT Archive API for given dates.'''
    total = 0
    print('Date range: ' + str(dates[0]) + ' to ' + str(dates[-1]))
    if not os.path.exists('headlines'):
        os.mkdir('headlines')
    for date in dates:
        print('Working on ' + str(date) + '...')
        csv_path = 'headlines/' + date[0] + '-' + date[1] + '.csv'
        if not os.path.exists(csv_path):  # If we don't already have this month
            response = send_request(date)
            if response is not None:
                df = parse_response(response)
                total += len(df)
                df.to_csv(csv_path, index=False)
                print('Saving ' + csv_path + '...')
    print('Number of articles collected: ' + str(total))
Let's look at the helper functions in detail:
send_request(date) sends a request to the archive for the given date, converts the response to JSON, and returns it.
is_valid(article, date) checks whether an article falls within the required time range and has a headline, and returns the is_in_range and has_headline verdicts.
parse_response(response) converts the response into a DataFrame. data is a dictionary holding the DataFrame's columns, which start out empty and are appended to by this function; the function returns the finished DataFrame.
get_data(dates) uses send_request() and parse_response() to save headlines and other information to .csv files, one file per month of each year, for dates within the user-specified range. A driver sketch follows below.
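The dates list that get_data() consumes is never constructed in the excerpt; here is a minimal driver sketch in the spirit of the same tutorial (note that the '%-m' format code is POSIX-only):
# Build [year, month] string pairs for every month in the range, then collect the data.
months_in_range = [x.split(' ') for x in pd.date_range(start, end, freq='MS').strftime('%Y %-m').tolist()]
get_data(months_in_range)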
# get data file names
path = "headlines/"
filenames = glob.glob(path + "*.csv")  # collect the monthly CSVs saved under headlines/
dfs = []
print(filenames)
for filename in filenames:
    dfs.append(pd.read_csv(filename))
# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)
Importing tools and libraries:
from collections import defaultdict
import re, string #regular expressions
from gensim import corpora # this is the topic modeling library
from gensim.models import LdaModel
defaultdict is useful for counting the occurrences of unique words (https://www.geeksforgeeks.org/defaultdict-in-python/). re and string come in handy when we look for exact or fuzzy matches in text; if you are interested in text analysis, regular expressions (https://www.w3schools.com/python/python_regex.asp) will come up constantly, and there is a convenient playground for practicing them (https://regex101.com/). gensim (https://radimrehurek.com/gensim/) is the topic modeling library; once you sort out its dependencies (https://www.tutorialspoint.com/gensim/gensim_getting_started.htm), you will find it very pleasant to work with.
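A quick illustration of why defaultdict(int) beats a plain dict for counting tokens (a standalone toy example):
from collections import defaultdict
counts = defaultdict(int)                # missing keys start at 0
for token in ['women', 'vote', 'women']:
    counts[token] += 1                   # no KeyError the first time a token appears
print(dict(counts))                      # {'women': 2, 'vote': 1}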
big_frame_corpus_headline = big_frame['headline']
big_frame_corpus_keywords = big_frame['keywords']
big_frame_corpus_lead = big_frame['lead_paragraph']
Text pre-processing steps. Image by the author. Icons made by Freepik.
from nltk.corpus import stopwords
headlines = [re.sub(r'[^\w\s]','',str(item)) for item in big_frame_corpus_headline]
keywords = [re.sub(r'[^\w\s]','',str(item)) for item in big_frame_corpus_keywords]
lead = [re.sub(r'[^\w\s]','',str(item)) for item in big_frame_corpus_lead]
stopwords = set(stopwords.words('english'))
# please note: you can append to this list of pre-defined stopwords if needed
More pre-processing:
headline_texts = [[word for word in document.lower().split() if word not in stopwords] for document in headlines]
keywords_texts = [[word for word in document.lower().split() if word not in stopwords] for document in keywords]
lead_texts = [[word for word in document.lower().split() if word not in stopwords] for document in lead]
Removing less frequent words:
frequency = defaultdict(int)
for headline_text in headline_texts:
    for token in headline_text:
        frequency[token] += 1
for keywords_text in keywords_texts:
    for token in keywords_text:
        frequency[token] += 1
for lead_text in lead_texts:
    for token in lead_text:
        frequency[token] += 1
headline_texts = [[token for token in headline_text if frequency[token] > 1] for headline_text in headline_texts]
keywords_texts = [[token for token in keywords_text if frequency[token] > 1] for keywords_text in keywords_texts]
lead_texts = [[token for token in lead_text if frequency[token] > 1] for lead_text in lead_texts]
dictionary_headline = corpora.Dictionary(headline_texts)
dictionary_keywords = corpora.Dictionary(keywords_texts)
dictionary_lead = corpora.Dictionary(lead_texts)
headline_corpus = [dictionary_headline.doc2bow(headline_text) for headline_text in headline_texts]
keywords_corpus = [dictionary_keywords.doc2bow(keywords_text) for keywords_text in keywords_texts]
lead_corpus = [dictionary_lead.doc2bow(lead_text) for lead_text in lead_texts]
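Each document is now a bag of words: a list of (token_id, token_count) pairs, with ids resolved by the matching dictionary. A quick, illustrative sanity check:
print(headline_corpus[0])        # e.g. [(0, 1), (1, 1), (2, 1)]
print(dictionary_headline[0])    # the token that id 0 stands for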
Let’s decide on the optimal number of topics for our case:
NUM_TOPICS = 5
ldamodel_headlines = LdaModel(headline_corpus, num_topics=NUM_TOPICS, id2word=dictionary_headline, passes=12)
ldamodel_keywords = LdaModel(keywords_corpus, num_topics=NUM_TOPICS, id2word=dictionary_keywords, passes=12)
ldamodel_lead = LdaModel(lead_corpus, num_topics=NUM_TOPICS, id2word=dictionary_lead, passes=12)
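NUM_TOPICS = 5 is fixed by inspection here; if you want a more principled choice, gensim's CoherenceModel can score candidate topic counts. A sketch, not part of the original code:
from gensim.models import CoherenceModel
for k in range(2, 11):
    lda_k = LdaModel(headline_corpus, num_topics=k, id2word=dictionary_headline, passes=12)
    score = CoherenceModel(model=lda_k, texts=headline_texts, dictionary=dictionary_headline,
                           coherence='c_v').get_coherence()
    print(k, round(score, 3))  # pick the k with the highest coherence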
Here’s the result:
topics_headlines = ldamodel_headlines.show_topics()
for topic_headlines in topics_headlines:
    print(topic_headlines)
topics_keywords = ldamodel_keywords.show_topics()
for topic_keywords in topics_keywords:
    print(topic_keywords)
topics_lead = ldamodel_lead.show_topics()
for topic_lead in topics_lead:
    print(topic_lead)
Let’s organize those into dataframes:
word_dict_headlines = {}
for i in range(NUM_TOPICS):
    words_headlines = ldamodel_headlines.show_topic(i, topn=20)
    word_dict_headlines['Topic # ' + '{:02d}'.format(i+1)] = [w[0] for w in words_headlines]
pd.DataFrame(word_dict_headlines)
word_dict_keywords = {}
for i in range(NUM_TOPICS):
    words_keywords = ldamodel_keywords.show_topic(i, topn=20)
    word_dict_keywords['Topic # ' + '{:02d}'.format(i+1)] = [w[0] for w in words_keywords]
pd.DataFrame(word_dict_keywords)
word_dict_lead = {}
for i in range(NUM_TOPICS):
    words_lead = ldamodel_lead.show_topic(i, topn=20)
    word_dict_lead['Topic # ' + '{:02d}'.format(i+1)] = [w[0] for w in words_lead]
pd.DataFrame(word_dict_lead)
Topic modeling results. Image by the author. Icons made by Freepik.
1950 to the Present: Data Collection and Keyword Analysis
import pickle
# 'frame' is the concatenated DataFrame of articles collected for 1950 to the present,
# built with the same helper functions as above.
with open('frame_all.pickle', 'wb') as to_write:
    pickle.dump(frame, to_write)
with open('frame_all.pickle', 'rb') as read_file:
    df = pickle.load(read_file)
df['date'] = pd.to_datetime(df['date'])
df = df[df['headline'].notna()].drop_duplicates().sort_values(by='date')
df.dropna(axis=0, subset=['keywords'], inplace = True)
import ast
df.keywords = df.keywords.astype(str).str.lower().transform(ast.literal_eval)
keyword_counts = pd.Series(x for l in df['keywords'] for x in l).value_counts(ascending=False)
len(keyword_counts)
58,298 unique keywords.
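The literal_eval step above matters because lists of keywords round-trip through CSV as plain strings; a minimal, self-contained illustration:
import ast
s = "['women and girls', 'politics and government']"
print(ast.literal_eval(s.lower()))  # back to a real Python list of keywords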
project_keywords1 = [x for x in keyword_counts.keys() if 'women in politics' in x
or 'businesswoman' in x
or 'female executive' in x
or 'female leader' in x
or 'female leadership' in x
or 'successful woman' in x
or 'female entrepreneur' in x
or 'woman entrepreneur' in x
or 'women in tech' in x
or 'female technology' in x
or 'female startup' in x
or 'female founder' in x ]
df['headline'] = df['headline'].astype(str).str.lower()
Examine the headlines that contain words like woman, politics and power:
# Note: Python evaluates ('women' or 'woman' or 'female') to just 'women',
# so we use regex alternation to match any of the intended terms.
wip_headlines = df[df['headline'].str.contains('women|woman|female') & df['headline'].str.contains('politics|power|election')]
‘wip’ stands for ‘women in politics’.
Our search returned only 185 headlines. Let’s look at the keywords to supplement that.
df['keywords'].dropna()
df['keywords_joined'] = df.keywords.apply(', '.join)
df['keywords_joined'] = df['keywords_joined'].astype(str)
import re
wip_keywords = df[df['keywords_joined'].str.contains(r'(?=.*women)(?=.*politics)',regex=True)]
wip_df = pd.concat([wip_headlines, wip_keywords], axis=0, sort = True)
feminist_keywords = df[df['keywords_joined'].str.contains(r'(?=.*women)(?=.*feminist)',regex=True)]
#metoo movement:
metoo_keywords = df[df['keywords_joined'].str.contains(r'(?=.*women)(?=.*metoo)(?=.*movement)',regex=True)]
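The (?=...) lookaheads make these matches order-independent: a row qualifies if its joined keywords contain both terms anywhere. A tiny demonstration:
pattern = re.compile(r'(?=.*women)(?=.*politics)')
print(bool(pattern.search('politics and government, women and girls')))  # True
print(bool(pattern.search('women and girls, art')))                      # False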
import matplotlib.pyplot as plt
ax = df.groupby(df.date.dt.year)['headline'].count().plot(kind='bar', figsize=(20, 6))
ax.set(xlabel='Year', ylabel='Number of Articles')
ax.yaxis.set_tick_params(labelsize='large')
ax.xaxis.label.set_size(18)
ax.yaxis.label.set_size(18)
ax.set_title('Total Published Every Year', fontdict={'fontsize': 24, 'fontweight': 'medium'})
plt.show()
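project_df, used below, is not constructed anywhere in the excerpt; the following is a hedged reconstruction based on the project_keywords1 list defined earlier (the exact filter in the original may differ):
# Assumption: project_df holds the articles whose keywords match the 'strong women'
# keyword list, plus a 'year' column for grouping.
df['year'] = df['date'].dt.year
strong_women_pattern = '|'.join(re.escape(k) for k in project_keywords1)
project_df = df[df['keywords_joined'].str.contains(strong_women_pattern)]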
ax = project_df.groupby('year')['headline'].count().plot(kind='bar', figsize=(20, 6))
ax.set(xlabel='Year', ylabel='Number of Articles')
ax.yaxis.set_tick_params(labelsize='large')
ax.xaxis.label.set_size(18)
ax.yaxis.label.set_size(18)
ax.set_title('Articles About Strong Women (based on relevant keywords) Published Every Year', \
fontdict={'fontsize': 20, 'fontweight': 'medium'})
plt.show()
These clippings were obtained through TimesMachine, The New York Times' publication archive. The image was created by the author from the clippings.
N-grams, Word Clouds, and Sentiment Analysis
from sklearn.feature_extraction.text import CountVectorizer
word_vectorizer = CountVectorizer(ngram_range=(1,3), analyzer='word')
corpus = project_df['headline'].astype(str)  # assumption: the n-grams are built from the project headlines
sparse_matrix = word_vectorizer.fit_transform(corpus)
frequencies = sum(sparse_matrix).toarray()[0]
ngram_df_project = pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names_out(), columns=['frequency'])  # use get_feature_names() on scikit-learn < 1.0
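To actually read off the most common phrases, sort the frequency table (an illustrative peek):
print(ngram_df_project.sort_values(by='frequency', ascending=False).head(10))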
from wordcloud import WordCloud, STOPWORDS
all_headlines = ' '.join(project_df['headline'].str.lower())
stopwords = STOPWORDS
stopwords.add('will')
# Note: you can append your own stopwords to the existing ones.
wordcloud = WordCloud(stopwords=stopwords, background_color="white", max_words=1000, width = 480, height = 480).\
generate(all_headlines)
plt.figure(figsize=(20,10))
plt.imshow(wordcloud)
plt.axis("off");
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
sia = SIA()
results = []
for line in project_df.headline:
    pol_score = sia.polarity_scores(line)
    pol_score['headline'] = line
    results.append(pol_score)
print(results[:3])
Output:
[{'neg': 0.0, 'neu': 0.845, 'pos': 0.155, 'compound': 0.296, 'headline': 'women doctors join navy; seventeen end their training and are ordered to duty'}, {'neg': 0.18, 'neu': 0.691, 'pos': 0.129, 'compound': -0.2732, 'headline': 'n.y.u. to graduate 21 women doctors; war gave them, as others, an opportunity to enter a medical school'}, {'neg': 0.159, 'neu': 0.725, 'pos': 0.116, 'compound': -0.1531, 'headline': 'greets women doctors; dean says new york medical college has no curbs'}]
Sentiment as a dataframe:
sentiment_df = pd.DataFrame.from_records(results)
dates = project_df['year']
sentiment_df = pd.merge(sentiment_df, dates, left_index=True, right_index=True)
The code above allows us to have a timeline for our sentiment. To simplify the sentiment analysis, we are going to create some new categories for positive, negative and neutral.
sentiment_df['label'] = 0
sentiment_df.loc[sentiment_df['compound'] > 0.2, 'label'] = 1
sentiment_df.loc[sentiment_df['compound'] < -0.2, 'label'] = -1
sentiment_df.head()
To visualize overall sentiment distribution:
sentiment_df.label.value_counts(normalize=True) * 100
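The same distribution can also be plotted directly (an optional sketch):
(sentiment_df.label.value_counts(normalize=True) * 100).plot(kind='bar')
plt.ylabel('% of headlines')
plt.show()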
Image by the author, using a Slidesgo template.
import seaborn as sns
sns.lineplot(x="year", y="label", data=sentiment_df)
plt.show()
Conclusion