Four Ways to Bin Numeric Values in Pandas: A Summary and Comparison
![](https://filescdn.proginn.com/d9a5ea9d7c6c043f2ecb53f18b731dae/b99ccc91dd115822ca8349dc9a0aeef1.webp)
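Source: DeepHub IMBA. This article is about 1,500 words; estimated reading time is 5 minutes.
We will discuss four ways to bin numeric values with the Python Pandas library.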
![](https://filescdn.proginn.com/f13599a60473224ebb517fc3c83f868c/895f500a557989066e78cc3e9486c8d8.webp)
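import pandas as pd  # version 1.3.5
import numpy as np

def create_df():
    # 1,000 random integer scores between 0 and 100 (inclusive)
    df = pd.DataFrame({'score': np.random.randint(0, 101, 1000)})
    return df

# Keep the returned DataFrame so that the later snippets can use `df`
df = create_df()
df.head()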
![](https://filescdn.proginn.com/be9bc7f15ed428c6620c44299bec3d6b/d4a658950811413e8d14acef326501f8.webp)
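1. between & loc
Series.between returns a boolean mask and takes three parameters. left: the left boundary. right: the right boundary. inclusive: which boundaries to include; accepted values are {"both", "neither", "left", "right"}.
We bin the scores into three grades: A: (80, 100], B: (50, 80], C: [0, 50].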
df.loc[df['score'].between(0, 50, 'both'), 'grade'] = 'C'
df.loc[df['score'].between(50, 80, 'right'), 'grade'] = 'B'
df.loc[df['score'].between(80, 100, 'right'), 'grade'] = 'A'
![](https://filescdn.proginn.com/6ed6ed5dcbfa27d6ba4af8603a5314a7/2823a596b7d4a01f1ef9839ca2314eb1.webp)
df.grade.value_counts()
![](https://filescdn.proginn.com/c38d7c9f021ab7b11b237016976aab42/6fef445962fac0f80ebff26c1d6d3091.webp)
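To see how the `inclusive` argument treats scores that sit exactly on a boundary, here is a minimal sketch. The six scores are hypothetical values chosen for illustration (they are not from the article's random data), and the string form of `inclusive` assumes pandas 1.3 or later.

```python
import pandas as pd

# Hypothetical scores placed exactly on, or just past, each bin edge
demo = pd.DataFrame({'score': [0, 50, 51, 80, 81, 100]})

# C: [0, 50], B: (50, 80], A: (80, 100]
demo.loc[demo['score'].between(0, 50, inclusive='both'), 'grade'] = 'C'
demo.loc[demo['score'].between(50, 80, inclusive='right'), 'grade'] = 'B'
demo.loc[demo['score'].between(80, 100, inclusive='right'), 'grade'] = 'A'

grades = demo['grade'].tolist()
print(grades)  # ['C', 'C', 'B', 'B', 'A', 'A']
```

Because the B and A masks exclude their left edge, a score of 50 stays in C and a score of 80 stays in B, so no row is assigned twice.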
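2. cut
x: the array to bin; it must be one-dimensional. bins: a sequence of scalars that defines the bin edges, which allows non-uniform bin widths. labels: the labels for the returned bins; there must be one label per resulting bin, that is, one fewer than the number of bin edges. include_lowest: (bool) whether the first interval should be left-inclusive.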
bins = [0, 50, 80, 100]
labels = ['C', 'B', 'A']
df['grade'] = pd.cut(x = df['score'], bins = bins, labels = labels, include_lowest = True)
![](https://filescdn.proginn.com/6ed6ed5dcbfa27d6ba4af8603a5314a7/2823a596b7d4a01f1ef9839ca2314eb1.webp)
df.grade.value_counts()
![](https://filescdn.proginn.com/c38d7c9f021ab7b11b237016976aab42/6fef445962fac0f80ebff26c1d6d3091.webp)
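The effect of `include_lowest` is easiest to see on a value equal to the lowest bin edge. The sketch below uses a small hypothetical series (not the article's random data) and compares the two settings.

```python
import pandas as pd

# Hypothetical scores, including the lowest edge (0)
scores = pd.Series([0, 50, 51, 80, 81, 100])
bins = [0, 50, 80, 100]
labels = ['C', 'B', 'A']

# First interval is [0, 50], so a score of 0 falls into C
with_lowest = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)

# Default: first interval is (0, 50], so a score of 0 matches no bin
without_lowest = pd.cut(scores, bins=bins, labels=labels)

print(with_lowest.tolist())            # ['C', 'C', 'B', 'B', 'A', 'A']
print(without_lowest.isna().tolist())  # [True, False, False, False, False, False]
```

Without `include_lowest = True`, a score of 0 silently becomes NaN, which is why the call above sets it.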
3. qcut
df['grade'], cut_bin = pd.qcut(df['score'], q = 3, labels = ['C', 'B', 'A'], retbins = True)
df.head()
![](https://filescdn.proginn.com/5d0ac42906251abfb629d9d7f4e41904/d383196c008769b2e54b1842b9b8718a.webp)
print(cut_bin)
>> [ 0. 36. 68. 100.]
C: [0, 36] B: (36, 68] A: (68, 100]
df.grade.value_counts()
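Unlike `cut`, `qcut` picks the bin edges from the data's quantiles so that each bin holds roughly the same number of rows. A minimal sketch, using a hypothetical series of the integers 1 to 99 rather than the article's random scores:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the integers 1..99, each appearing once
scores = pd.Series(np.arange(1, 100))

# q=3 asks for three quantile-based bins; retbins=True also returns the edges
grades, edges = pd.qcut(scores, q=3, labels=['C', 'B', 'A'], retbins=True)

counts = grades.value_counts(sort=False).to_dict()
print(edges)   # four edges: the minimum, the two tercile cut points, the maximum
print(counts)  # {'C': 33, 'B': 33, 'A': 33}
```

With the random scores from the article, the counts will usually be close to equal rather than identical, because tied values at an edge all land in the same bin.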
4. value_counts
df['score'].value_counts(bins = 3, sort = False)
![](https://filescdn.proginn.com/52c633f7c3abfec7d506b1cb07f68e12/78e13147a865e3d7ec9d223b88177fd9.webp)
df['score'].value_counts(bins = [0,50,80,100], sort = False)
![](https://filescdn.proginn.com/7b16e01081cdc6328a75f8de0bdeb65c/5526a9a920fb573c9184d9ff0647d64a.webp)
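Passing `bins` to `value_counts` only counts rows per interval; it does not attach a label to each row. As a rough cross-check, the sketch below compares it with `pd.cut` followed by `value_counts` on a small hypothetical series. The assumption that the two approaches agree is checked here only for this data.

```python
import pandas as pd

# Hypothetical scores, two per intended bin
scores = pd.Series([0, 50, 51, 80, 81, 100])
edges = [0, 50, 80, 100]

# Count rows per interval directly
via_value_counts = scores.value_counts(bins=edges, sort=False)

# Same counts via an explicit cut, with the lowest edge included
via_cut = pd.cut(scores, bins=edges, include_lowest=True).value_counts(sort=False)

print(via_value_counts.tolist())  # [2, 2, 2]
print(via_cut.tolist())           # [2, 2, 2]
```

If each row needs its own grade, `cut` or `qcut` is the better fit; `value_counts(bins=...)` is convenient when only the distribution matters.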
Summary