Pandas知识点-详解分组函数groupby
groupby用法和参数介绍
groupby(self, by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=no_default, observed=False, dropna=True):
by: 指定根据哪个/哪些字段分组,默认值是None,按多个字段分组时传入列表。by参数可以按位置参数的方式传入。
axis: 设置按列分组还是按行分组,0或index表示按列分组,1或columns表示按行分组,默认值为0。
level: 当DataFrame的索引为多重索引时,level参数指定用于分组的索引,可以传入多重索引中索引的下标(0,1...)或索引名,多个用列表传入。
as_index: 分组结果默认将分组列的值作为索引,如果按单列分组,结果默认是单索引,如果按多列分组,结果默认是多重索引。将as_index设置为False可以重置索引(0,1...)。
sort: 结果按分组列的值升序排列,将sort设置为False则不排序,可以提升性能。
dropna: 默认情况下,分组列的NaN在分组结果中不保留,将dropna设置为False,可以保留NaN分组。
分组对象的内部结构
# coding=utf-8
import pandas as pd
import numpy as np
vip_df = pd.DataFrame(
{'isVip': ['vip', 'svip', 'member', 'vip', 'member', 'vip', 'svip'],
'gender': ['male', 'female', 'female', 'female', 'male', 'female', 'male'],
'age': [25, 30, 40, 25, 40, 18, 30],
'vipLevel': ['LV2', 'LV5', np.nan, 'LV3', 'LV2', 'LV2', 'LV3'],
'growValue': [180, 425, np.nan, 288, 190, 110, 240]}
)
print(vip_df)
grouped = vip_df.groupby('isVip')
print(grouped)
isVip gender age vipLevel growValue
0 vip male 25 LV2 180.0
1 svip female 30 LV5 425.0
2 member female 40 NaN NaN
3 vip female 25 LV3 288.0
4 member male 40 LV2 190.0
5 vip female 18 LV2 110.0
6 svip male 30 LV3 240.0
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001C3E81D1370>
for name, group in grouped:
print(name)
print(group)
member
isVip gender age vipLevel growValue
2 member female 40 NaN NaN
4 member male 40 LV2 190.0
svip
isVip gender age vipLevel growValue
1 svip female 30 LV5 425.0
6 svip male 30 LV3 240.0
vip
isVip gender age vipLevel growValue
0 vip male 25 LV2 180.0
3 vip female 25 LV3 288.0
5 vip female 18 LV2 110.0
for group in grouped:
print(group)
print(type(group), type(group[0]), type(group[1]))
('member', isVip gender age vipLevel growValue
2 member female 40 NaN NaN
4 member male 40 LV2 190.0)
<class 'tuple'> <class 'str'> <class 'pandas.core.frame.DataFrame'>
('svip', isVip gender age vipLevel growValue
1 svip female 30 LV5 425.0
6 svip male 30 LV3 240.0)
<class 'tuple'> <class 'str'> <class 'pandas.core.frame.DataFrame'>
('vip', isVip gender age vipLevel growValue
0 vip male 25 LV2 180.0
3 vip female 25 LV3 288.0
5 vip female 18 LV2 110.0)
<class 'tuple'> <class 'str'> <class 'pandas.core.frame.DataFrame'>
print(grouped.groups)
group_name = [gn for gn in grouped.groups.keys()]
print(group_name)
group = grouped.get_group(group_name[2])
print('-'*40, '\n', group, sep='')
{'member': [2, 4], 'svip': [1, 6], 'vip': [0, 3, 5]}
['member', 'svip', 'vip']
----------------------------------------
isVip gender age vipLevel growValue
0 vip male 25 LV2 180.0
3 vip female 25 LV3 288.0
5 vip female 18 LV2 110.0
按多重索引分组
vip_multi_df = vip_df.set_index(['isVip', 'gender'])
print('-'*40, '\n', vip_multi_df, sep='')
----------------------------------------
age vipLevel growValue
isVip gender
vip male 25 LV2 180.0
svip female 30 LV5 425.0
member female 40 NaN NaN
vip female 25 LV3 288.0
member male 40 LV2 190.0
vip female 18 LV2 110.0
svip male 30 LV3 240.0
# 按多重索引中的指定索引进行分组
grouped = vip_multi_df.groupby(level='isVip')
print('-'*40, '\n', grouped.mean(), sep='')
# 按多重索引中除指定索引之外的索引分组
grouped = vip_multi_df.groupby(level=vip_multi_df.index.names.difference(['gender']))
print('-'*40, '\n', grouped.mean(), sep='')
# 按多重索引中的多个索引分组
grouped = vip_multi_df.groupby(level=[0, 1])
print('-'*40, '\n', grouped.mean(), sep='')
----------------------------------------
age growValue
isVip
member 40.000000 190.000000
svip 30.000000 332.500000
vip 22.666667 192.666667
----------------------------------------
age growValue
isVip
member 40.000000 190.000000
svip 30.000000 332.500000
vip 22.666667 192.666667
----------------------------------------
age growValue
isVip gender
member female 40.0 NaN
male 40.0 190.0
svip female 30.0 425.0
male 30.0 240.0
vip female 21.5 199.0
male 25.0 180.0
grouped = vip_multi_df.groupby('gender')
print('-'*40, '\n', grouped.mean(), sep='')
grouped = vip_multi_df.groupby(['gender', 'age'])
print('-'*40, '\n', grouped.mean(), sep='')
----------------------------------------
age growValue
gender
female 28.250000 274.333333
male 31.666667 203.333333
----------------------------------------
growValue
gender age
female 18 110.0
25 288.0
30 425.0
40 NaN
male 25 180.0
30 240.0
40 190.0
重置结果的索引
grouped = vip_multi_df.groupby(['isVip', 'gender'])
print('-'*40, '\n', grouped.mean(), sep='')
# 重设索引
grouped = vip_multi_df.groupby(['isVip', 'gender'], as_index=False)
print('-'*40, '\n', grouped.mean(), sep='')
----------------------------------------
age growValue
isVip gender
member female 40.0 NaN
male 40.0 190.0
svip female 30.0 425.0
male 30.0 240.0
vip female 21.5 199.0
male 25.0 180.0
----------------------------------------
age growValue
0 40.0 NaN
1 40.0 190.0
2 30.0 425.0
3 30.0 240.0
4 21.5 199.0
5 25.0 180.0
结果是否排序
grouped = vip_df.groupby('isVip')
print('-'*40, '\n', grouped.mean(), sep='')
grouped = vip_df.groupby('isVip', sort=False)
print('-'*40, '\n', grouped.mean(), sep='')
----------------------------------------
age growValue
isVip
member 40.000000 190.000000
svip 30.000000 332.500000
vip 22.666667 192.666667
----------------------------------------
age growValue
isVip
vip 22.666667 192.666667
svip 30.000000 332.500000
member 40.000000 190.000000
是否保留空值
grouped = vip_df.groupby('vipLevel')
print('-'*40, '\n', grouped.mean(), sep='')
grouped = vip_df.groupby('vipLevel', dropna=False)
print('-'*40, '\n', grouped.mean(), sep='')
----------------------------------------
age growValue
vipLevel
LV2 27.666667 160.0
LV3 27.500000 264.0
LV5 30.000000 425.0
----------------------------------------
age growValue
vipLevel
LV2 27.666667 160.0
LV3 27.500000 264.0
LV5 30.000000 425.0
NaN 40.000000 NaN
提取分组结果的指定列
grouped = vip_df.groupby('isVip', dropna=False)
print('-'*40, '\n', grouped['gender'], sep='')
for name, group in grouped['gender']:
print(name)
print(group)
----------------------------------------
<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000285CA1B0790>
member
2 female
4 male
Name: gender, dtype: object
svip
1 female
6 male
Name: gender, dtype: object
vip
0 male
3 female
5 female
Name: gender, dtype: object
grouped = vip_df['gender'].groupby(vip_df['isVip'])
for name, group in grouped:
print(name)
print(group)
member
2 female
4 male
Name: gender, dtype: object
svip
1 female
6 male
Name: gender, dtype: object
vip
0 male
3 female
5 female
Name: gender, dtype: object
参考文档:
[1] pandas中文网:https://www.pypandas.cn/docs/
评论