【Python私活案例】Pandas找到公司内最相似的员工(100元)

共 12248字,需浏览 25分钟

 ·

2021-11-12 21:58

今日份在蚂蚁老师的vip群里,有一位群友提出一个需求

我马上想到可以用所学的pandas相关知识解决

让我们来看看群友的具体需求

我们根据群友的需求,构造了出了相应的表格数据

在这里,我在name字段下先设定了一些值

开始写代码!

首先引入numpy和pandas,这俩是数据分析领域必不可少的模块

import numpy as np
# 这行代码的意思是通过设置固定的随机数种子,让你我生成的随机数是一样的
np.random.seed(666)
import pandas as pd
# 读取我们的excel文件
data = pd.read_excel("data.xlsx")

观察数据,我们有十六名员工,除了姓名,其它都是空值

data

nameJavaPythonAS400ITIDOracleBigdataSQLLeadershipManagementCreativityCommunication
0刘备NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
1关羽NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2张飞NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
3诸葛亮NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
4赵云NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
5司马懿NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
6孙权NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
7曹操NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
8张角NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
9姜维NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
10司马昭NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
11司马师NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
12魏延NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
13徐庶NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
14陆逊NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
15鲁肃NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN

将空值用0到1之间的随机数代替

def initialize(item):
    # 如果发现当前这个元素是空值,就用随机数代替
    if pd.isnull(item):
        return np.random.random()
    # 否则就原样返回(对应name那列)
    else:
        return item

通过applymap函数,将上面的自定义函数应用到表格中的每一个数据上

new_data = data.applymap(initialize)

再来看看我们构造好的数据new_data,发现已经得到预期值了

new_data

nameJavaPythonAS400ITIDOracleBigdataSQLLeadershipManagementCreativityCommunication
0刘备0.0367120.3541090.8970440.5109470.8470970.8194430.7886410.2858680.9947700.5779060.273414
1关羽0.1556750.4404370.5543690.1470620.1219570.3784830.8683640.6709420.7004970.7191900.035426
2张飞0.1384460.2375090.6884920.9863740.5154090.9101560.1709670.7998650.0206610.8022780.424132
3诸葛亮0.0458780.2956900.8752180.0145690.0410400.4063390.6960760.3068230.1782750.0063400.132899
4赵云0.9267760.1587410.2122680.9993130.0601300.5939340.2962810.4257220.6655090.9105550.824788
5司马懿0.1581740.6865190.9894810.3281450.8437830.8940610.3140480.2929650.3050310.2369820.178578
6孙权0.2347940.8542720.9530050.9736680.4839470.9044040.8032890.5223700.9496730.0964650.439397
7曹操0.2182460.7074250.8810690.3766620.8014660.7552650.2917980.9380500.4858260.3464370.066456
8张角0.4983400.5630200.0199410.6869560.4389630.9422090.5352980.0683990.1692050.6381460.246636
9姜维0.8663820.3365340.4598160.1212750.1175270.0862000.7246720.0410770.7928520.6204570.794938
10司马昭0.3872800.9111110.3610180.0247110.3521680.8743070.5609310.0664540.3038520.2678940.524910
11司马师0.0693310.6544980.1333580.3492970.0925520.5275920.5593910.0413850.4110110.3910100.340453
12魏延0.0572410.2010210.4917020.5807540.1861230.8072210.3247360.7297370.9200880.0806130.537976
13徐庶0.0362210.2059920.9539270.3833840.4943150.4168610.0893450.6407950.5134790.4120780.759755
14陆逊0.5708700.4108810.7573210.2600930.9161100.6894730.0876440.1993680.5707180.7414450.307660
15鲁肃0.4123430.4261800.4440810.2384450.2004760.6656520.8481650.5107070.9652710.1353490.436991

接下来我想将上面的表格进行自表连接,便于计算,所以设置了一个用于连接的公共字段one

new_data["one"] = 1
new_data.head(5)

nameJavaPythonAS400ITIDOracleBigdataSQLLeadershipManagementCreativityCommunicationone
0刘备0.0367120.3541090.8970440.5109470.8470970.8194430.7886410.2858680.9947700.5779060.2734141
1关羽0.1556750.4404370.5543690.1470620.1219570.3784830.8683640.6709420.7004970.7191900.0354261
2张飞0.1384460.2375090.6884920.9863740.5154090.9101560.1709670.7998650.0206610.8022780.4241321
3诸葛亮0.0458780.2956900.8752180.0145690.0410400.4063390.6960760.3068230.1782750.0063400.1328991
4赵云0.9267760.1587410.2122680.9993130.0601300.5939340.2962810.4257220.6655090.9105550.8247881

进行自表连接,可以发现我们连接后的表有256行,也就是16*16,符合预期

需要注意的是,为了区分一左右两个表的数据来源,会将来源于左表的字段加_x,来源于右表则加_y

new_data_merge = pd.merge(left=new_data, right=new_data, left_on="one", right_on="one")
new_data_merge

name_xJava_xPython_xAS400_xITID_xOracle_xBigdata_xSQL_xLeadership_xManagement_x...Python_yAS400_yITID_yOracle_yBigdata_ySQL_yLeadership_yManagement_yCreativity_yCommunication_y
0刘备0.0367120.3541090.8970440.5109470.8470970.8194430.7886410.2858680.994770...0.3541090.8970440.5109470.8470970.8194430.7886410.2858680.9947700.5779060.273414
1刘备0.0367120.3541090.8970440.5109470.8470970.8194430.7886410.2858680.994770...0.4404370.5543690.1470620.1219570.3784830.8683640.6709420.7004970.7191900.035426
2刘备0.0367120.3541090.8970440.5109470.8470970.8194430.7886410.2858680.994770...0.2375090.6884920.9863740.5154090.9101560.1709670.7998650.0206610.8022780.424132
3刘备0.0367120.3541090.8970440.5109470.8470970.8194430.7886410.2858680.994770...0.2956900.8752180.0145690.0410400.4063390.6960760.3068230.1782750.0063400.132899
4刘备0.0367120.3541090.8970440.5109470.8470970.8194430.7886410.2858680.994770...0.1587410.2122680.9993130.0601300.5939340.2962810.4257220.6655090.9105550.824788
..................................................................
251鲁肃0.4123430.4261800.4440810.2384450.2004760.6656520.8481650.5107070.965271...0.6544980.1333580.3492970.0925520.5275920.5593910.0413850.4110110.3910100.340453
252鲁肃0.4123430.4261800.4440810.2384450.2004760.6656520.8481650.5107070.965271...0.2010210.4917020.5807540.1861230.8072210.3247360.7297370.9200880.0806130.537976
253鲁肃0.4123430.4261800.4440810.2384450.2004760.6656520.8481650.5107070.965271...0.2059920.9539270.3833840.4943150.4168610.0893450.6407950.5134790.4120780.759755
254鲁肃0.4123430.4261800.4440810.2384450.2004760.6656520.8481650.5107070.965271...0.4108810.7573210.2600930.9161100.6894730.0876440.1993680.5707180.7414450.307660
255鲁肃0.4123430.4261800.4440810.2384450.2004760.6656520.8481650.5107070.965271...0.4261800.4440810.2384450.2004760.6656520.8481650.5107070.9652710.1353490.436991

256 rows × 25 columns

紧接着,我们看一下特征属性,要剔除name字段以及用于自表连接的one字段

我们也可以通过修改columns里面的值,自定义关注哪些字段

columns = list(new_data.columns)
columns.remove("name")
columns.remove("one")
columns
['Java',
'Python',
'AS400',
'ITID',
'Oracle',
'Bigdata',
'SQL',
'Leadership',
'Management',
'Creativity',
'Communication']

计算相似度的函数

我们通过“欧式距离”来表征相似度,欧式距离越大,说明两者之间差距越大,相似度越小,反之亦然

def similarity(row):
    # 传进来的row是表中的一行
    # 设定相似度的初始值是0
    sim_value = 0.0
    # 取出一行当中的每一个特征值
    # 相应的特征值相减之后的结果再进行平方,最后全部加起来,也就是“欧氏距离”的概念
    for column in columns:
        sim_value += (float(row[column+"_x"]) - float(row[column+"_y"]))**2
    return sim_value

通过设定axis=1来指定对表格中的每一行进行计算相似度的操作

new_data_merge["sim"] = new_data_merge.apply(similarity, axis=1)

来看看计算后的表格,发现多了一个字段sim

new_data_merge.sample(15)

name_xJava_xPython_xAS400_xITID_xOracle_xBigdata_xSQL_xLeadership_xManagement_x...AS400_yITID_yOracle_yBigdata_ySQL_yLeadership_yManagement_yCreativity_yCommunication_ysim
218徐庶0.0362210.2059920.9539270.3833840.4943150.4168610.0893450.6407950.513479...0.3610180.0247110.3521680.8743070.5609310.0664540.3038520.2678940.5249102.002230
62诸葛亮0.0458780.2956900.8752180.0145690.0410400.4063390.6960760.3068230.178275...0.7573210.2600930.9161100.6894730.0876440.1993680.5707180.7414450.3076602.315648
245鲁肃0.4123430.4261800.4440810.2384450.2004760.6656520.8481650.5107070.965271...0.9894810.3281450.8437830.8940610.3140480.2929650.3050310.2369820.1785781.749616
129张角0.4983400.5630200.0199410.6869560.4389630.9422090.5352980.0683990.169205...0.5543690.1470620.1219570.3784830.8683640.6709420.7004970.7191900.0354261.935264
46张飞0.1384460.2375090.6884920.9863740.5154090.9101560.1709670.7998650.020661...0.7573210.2600930.9161100.6894730.0876440.1993680.5707180.7414450.3076601.645900
149姜维0.8663820.3365340.4598160.1212750.1175270.0862000.7246720.0410770.792852...0.9894810.3281450.8437830.8940610.3140480.2929650.3050310.2369820.1785783.124456
65赵云0.9267760.1587410.2122680.9993130.0601300.5939340.2962810.4257220.665509...0.5543690.1470620.1219570.3784830.8683640.6709420.7004970.7191900.0354262.615905
20关羽0.1556750.4404370.5543690.1470620.1219570.3784830.8683640.6709420.700497...0.2122680.9993130.0601300.5939340.2962810.4257220.6655090.9105550.8247882.615905
135张角0.4983400.5630200.0199410.6869560.4389630.9422090.5352980.0683990.169205...0.8810690.3766620.8014660.7552650.2917980.9380500.4858260.3464370.0664562.136879
246鲁肃0.4123430.4261800.4440810.2384450.2004760.6656520.8481650.5107070.965271...0.9530050.9736680.4839470.9044040.8032890.5223700.9496730.0964650.4393971.155614
115曹操0.2182460.7074250.8810690.3766620.8014660.7552650.2917980.9380500.485826...0.8752180.0145690.0410400.4063390.6960760.3068230.1782750.0063400.1328991.806937
166司马昭0.3872800.9111110.3610180.0247110.3521680.8743070.5609310.0664540.303852...0.9530050.9736680.4839470.9044040.8032890.5223700.9496730.0964650.4393972.016103
194魏延0.0572410.2010210.4917020.5807540.1861230.8072210.3247360.7297370.920088...0.6884920.9863740.5154090.9101560.1709670.7998650.0206610.8022780.4241321.701496
86司马懿0.1581740.6865190.9894810.3281450.8437830.8940610.3140480.2929650.305031...0.9530050.9736680.4839470.9044040.8032890.5223700.9496730.0964650.4393971.376950
61诸葛亮0.0458780.2956900.8752180.0145690.0410400.4063390.6960760.3068230.178275...0.9539270.3833840.4943150.4168610.0893450.6407950.5134790.4120780.7597551.505521

15 rows × 26 columns

然后,在自表连接的时候肯定有“自己连自己的情况”

也就是name_x字段的值等于name_y

这些是无意义的数据,剔除

new_data_merge = new_data_merge[new_data_merge["name_x"] != new_data_merge["name_y"]].copy()

随机选5条清理后的数据看看

new_data_merge.sample(5)

name_xJava_xPython_xAS400_xITID_xOracle_xBigdata_xSQL_xLeadership_xManagement_x...AS400_yITID_yOracle_yBigdata_ySQL_yLeadership_yManagement_yCreativity_yCommunication_ysim
114曹操0.2182460.7074250.8810690.3766620.8014660.7552650.2917980.9380500.485826...0.6884920.9863740.5154090.9101560.1709670.7998650.0206610.8022780.4241321.327640
37张飞0.1384460.2375090.6884920.9863740.5154090.9101560.1709670.7998650.020661...0.9894810.3281450.8437830.8940610.3140480.2929650.3050310.2369820.1785781.572089
156姜维0.8663820.3365340.4598160.1212750.1175270.0862000.7246720.0410770.792852...0.4917020.5807540.1861230.8072210.3247360.7297370.9200880.0806130.5379762.417639
92司马懿0.1581740.6865190.9894810.3281450.8437830.8940610.3140480.2929650.305031...0.4917020.5807540.1861230.8072210.3247360.7297370.9200880.0806130.5379761.720346
90司马懿0.1581740.6865190.9894810.3281450.8437830.8940610.3140480.2929650.305031...0.3610180.0247110.3521680.8743070.5609310.0664540.3038520.2678940.5249101.065205

5 rows × 26 columns

找每个员工与他最相似的10个员工

def get_top_student(df_sub):
    # 传入的是清洗后的数据结果groupby后的子表,对应单个员工的数据
    # 例如刘备与其他人连接的所有行,曹操与其他人连接的所有行
    
    # 对这个子表按sim值,也就是相似度进行升序排序,取出前10条数据
    df_sort = df_sub.sort_values(by="sim", ascending=True).head(10)
    # 将前十个人的名字取出
    names = ",".join(list(df_sort["name_y"]))
    # 将前十个人的值相似度值取出
    sims = ",".join([str(x) for x in list(df_sort["sim"])])
    # 打包成Series返回给调用它的地方
    return pd.Series({"names": names, "sims": sims})

对清洗后的表先按name_x进行groupby分组,对每一个分组,调用上述函数

result = new_data_merge.groupby("name_x").apply(get_top_student)

结果如下

例如,与关羽最相似的前十位依次是鲁肃,司马师,诸葛亮,刘备,曹操,魏延,姜维,徐庶,司马昭,陆逊

相似值依次是 0.7735408527877448,1.0708763430514607,1.113466...

result

namessims
name_x

关羽鲁肃,司马师,诸葛亮,刘备,曹操,魏延,姜维,徐庶,司马昭,陆逊0.7735408527877448,1.0708763430514607,1.113466...
刘备孙权,司马懿,陆逊,鲁肃,曹操,关羽,魏延,徐庶,司马师,司马昭0.9632560988057848,0.9990413893821443,1.099269...
司马师司马昭,张角,鲁肃,关羽,诸葛亮,魏延,姜维,司马懿,陆逊,徐庶0.5730340940360173,0.7408606894082557,0.994871...
司马懿曹操,陆逊,刘备,司马昭,徐庶,诸葛亮,孙权,张飞,司马师,张角0.5130746529222359,0.75366469445295,0.99904138...
司马昭司马师,张角,司马懿,鲁肃,诸葛亮,陆逊,关羽,姜维,曹操,魏延0.5730340940360173,0.9338616306533777,1.065204...
姜维鲁肃,司马师,赵云,关羽,司马昭,陆逊,徐庶,诸葛亮,张角,魏延1.1997951508166318,1.522733479990684,1.5613112...
孙权刘备,鲁肃,魏延,曹操,司马懿,徐庶,司马昭,关羽,司马师,陆逊0.9632560988057848,1.1556135829895424,1.206446...
张角司马师,司马昭,陆逊,张飞,赵云,司马懿,鲁肃,关羽,刘备,魏延0.7408606894082557,0.9338616306533777,1.439048...
张飞徐庶,曹操,张角,司马懿,陆逊,魏延,赵云,刘备,司马师,孙权1.2290138375053403,1.3276402597791603,1.527254...
徐庶魏延,曹操,陆逊,司马懿,张飞,刘备,诸葛亮,鲁肃,关羽,司马师0.8881419714636151,1.1138655046817798,1.144920...
曹操司马懿,陆逊,徐庶,刘备,张飞,孙权,魏延,关羽,鲁肃,司马昭0.5130746529222359,1.067403409629831,1.1138655...
诸葛亮关羽,司马师,鲁肃,司马懿,司马昭,徐庶,魏延,曹操,刘备,姜维1.1134665626313842,1.143403490075634,1.2731639...
赵云姜维,张角,张飞,魏延,鲁肃,陆逊,司马师,徐庶,关羽,司马昭1.561311245891365,1.5859586070097609,1.9056652...
陆逊司马懿,曹操,刘备,徐庶,司马昭,张角,张飞,司马师,鲁肃,关羽0.75366469445295,1.067403409629831,1.099269594...
魏延鲁肃,徐庶,孙权,刘备,曹操,司马师,关羽,诸葛亮,张飞,司马懿0.6536739741495201,0.8881419714636151,1.206446...
鲁肃魏延,关羽,司马师,司马昭,刘备,孙权,姜维,诸葛亮,徐庶,曹操0.6536739741495201,0.7735408527877448,0.994871...

生成excel文件,方便观看

result.to_excel("相似计算结果.xlsx", index=True)

最后,推荐蚂蚁老师的Pandas数据分析课程

课程名:《Python使用Pandas入门数据分析》

部分大纲:

扫码购买:

购买课程后,加我vx:ant_learn_python,拉付费VIP群

点击《阅读原文》,也可以到达课程页面。

浏览 26
点赞
评论
收藏
分享

手机扫一扫分享

分享
举报
评论
图片
表情
推荐
点赞
评论
收藏
分享

手机扫一扫分享

分享
举报