Pandas groupby

Categoricals and groupby

In [2]:
import pandas as pd

sales = pd.DataFrame({'weekday': ['Sun', 'Sun', 'Mon', 'Mon'],
                      'city': ['Austin', 'Dallas', 'Austin', 'Dallas'],
                      'bread': [139, 237, 326, 456],
                      'butter': [20, 45, 70, 98]})
In [3]:
sales
Out[3]:
   bread  butter    city weekday
0    139      20  Austin     Sun
1    237      45  Dallas     Sun
2    326      70  Austin     Mon
3    456      98  Dallas     Mon

Boolean filter and count

In [6]:
sales.loc[sales['weekday'] == 'Sun'].count()
Out[6]:
bread      2
butter     2
city       2
weekday    2
dtype: int64

Groupby and count

Split-apply-combine

  • sales.groupby('weekday').count()
    • split by 'weekday'
    • apply the count() function to each group
    • combine the counts per group
In [8]:
sales.groupby('weekday').count()
Out[8]:
         bread  butter  city
weekday
Mon          2       2     2
Sun          2       2     2
In [9]:
sales.groupby('weekday')
Out[9]:
<pandas.core.groupby.DataFrameGroupBy object at 0x000000000929FF60>
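The GroupBy object itself only records how the rows will be split; nothing is computed until an aggregation is applied. As a minimal sketch (not part of the original run), a single group can be pulled out with .get_group():
In [ ]:
# Sketch: inspect the "split" step explicitly.
grouped = sales.groupby('weekday')
grouped.ngroups             # number of groups, here 2
grouped.get_group('Sun')    # just the two Sunday rows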
In [11]:
print(dir(sales.groupby('weekday')))
['__bytes__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__format__', '__getattr__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__iter__', '__len__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_agg_doc', '_aggregate', '_aggregate_generic', '_aggregate_item_by_item', '_aggregate_multiple_funcs', '_apply_filter', '_apply_to_column_groupbys', '_apply_whitelist', '_assure_grouper', '_block_agg_axis', '_builtin_table', '_choose_path', '_concat_objects', '_constructor', '_cumcount_array', '_cython_agg_blocks', '_cython_agg_general', '_cython_table', '_cython_transform', '_decide_output_index', '_def_str', '_define_paths', '_dir_additions', '_dir_deletions', '_get_data_to_aggregate', '_get_index', '_get_indices', '_gotitem', '_group_selection', '_index_with_as_index', '_insert_inaxis_grouper_inplace', '_internal_names', '_internal_names_set', '_is_builtin_func', '_is_cython_func', '_iterate_column_groupbys', '_iterate_slices', '_make_wrapper', '_obj_with_exclusions', '_post_process_cython_aggregate', '_python_agg_general', '_python_apply_general', '_reindex_output', '_reset_cache', '_see_also_template', '_selected_obj', '_selection', '_selection_list', '_set_result_index_ordered', '_set_selection_from_grouper', '_shallow_copy', '_transform_general', '_transform_item_by_item', '_try_cast', '_wrap_agged_blocks', '_wrap_aggregated_output', '_wrap_applied_output', '_wrap_generic_output', '_wrap_transformed_output', 'agg', 'aggregate', 'all', 'any', 'apply', 'backfill', 'bfill', 'boxplot', 'bread', 'butter', 'city', 'corr', 'corrwith', 'count', 'cov', 'cumcount', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'dtypes', 'expanding', 'ffill', 'fillna', 'filter', 'first', 'get_group', 'groups', 'head', 'hist', 'idxmax', 'idxmin', 'indices', 'irow', 'last', 'mad', 'max', 'mean', 'median', 'min', 'name', 'ndim', 'ngroups', 'nth', 'ohlc', 'pad', 'pct_change', 'plot', 'prod', 'quantile', 'rank', 'resample', 'rolling', 'sem', 'shift', 'size', 'skew', 'std', 'sum', 'tail', 'take', 'transform', 'tshift', 'var', 'weekday']
In [12]:
sales.groupby('weekday').sum()
Out[12]:
         bread  butter
weekday
Mon        782     168
Sun        376      65
In [13]:
sales.groupby('weekday').std()
Out[13]:
             bread    butter
weekday
Mon      91.923882  19.79899
Sun      69.296465  17.67767

Aggregation/Reduction

  • first(), last()
  • min(), max()
In [15]:
sales.groupby('weekday').last()
Out[15]:
         bread  butter    city
weekday
Mon        456      98  Dallas
Sun        237      45  Dallas
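The other reducers listed above behave the same way; a quick sketch (not executed in the original notebook):
In [ ]:
# Sketch: the remaining reducers from the list above.
sales.groupby('weekday').first()   # first row of each group
sales.groupby('weekday').min()     # e.g. bread: Mon -> 326, Sun -> 139
sales.groupby('weekday').max()     # e.g. bread: Mon -> 456, Sun -> 237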

Groupby and sum

In [17]:
sales.groupby('weekday')['bread'].sum()
Out[17]:
weekday
Mon    782
Sun    376
Name: bread, dtype: int64
In [20]:
sales.groupby('weekday')['bread','butter'].sum()
Out[20]:
         bread  butter
weekday
Mon        782     168
Sun        376      65
In [21]:
sales.groupby('weekday')[['bread','butter']].sum()
Out[21]:
         bread  butter
weekday
Mon        782     168
Sun        376      65

Keep the DataFrame form (double brackets)

In [22]:
sales.groupby('weekday')[['bread']].sum()
Out[22]:
         bread
weekday
Mon        782
Sun        376

Groupby and mean: multi-level index

In [23]:
sales.groupby(['city','weekday']).mean()
Out[23]:
                bread  butter
city   weekday
Austin Mon        326      70
       Sun        139      20
Dallas Mon        456      98
       Sun        237      45

Customers

  • The customers Series below is index-aligned with sales
    • 'Alice' appears at index 1 and 3
In [24]:
customers = pd.Series(['Dave','Alice','Bob','Alice'])
customers
Out[24]:
0     Dave
1    Alice
2      Bob
3    Alice
dtype: object
In [25]:
sales
Out[25]:
   bread  butter    city weekday
0    139      20  Austin     Sun
1    237      45  Dallas     Sun
2    326      70  Austin     Mon
3    456      98  Dallas     Mon
In [33]:
sales.groupby(customers)['bread'].sum()
Out[33]:
Alice    693
Bob      326
Dave     139
Name: bread, dtype: int64

Categorical data

  • Advantages
    • Uses less memory (a quick check appears after the category conversion below)
    • Speeds up operations like groupby()
In [54]:
sales['weekday'].unique()
Out[54]:
array(['Sun', 'Mon'], dtype=object)
In [56]:
sales['weekday'] = sales['weekday'].astype('category')
sales
Out[56]:
   bread  butter    city weekday
0    139      20  Austin     Sun
1    237      45  Dallas     Sun
2    326      70  Austin     Mon
3    456      98  Dallas     Mon
In [57]:
sales['weekday'] 
Out[57]:
0    Sun
1    Sun
2    Mon
3    Mon
Name: weekday, dtype: category
Categories (2, object): [Mon, Sun]
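To back up the memory claim above, the footprint of the object and category versions of a column can be compared. A minimal sketch (the saving is negligible on a 4-row toy frame and grows with many repeated values):
In [ ]:
# Sketch: compare memory use of object vs. category dtype.
sales['city'].astype('object').memory_usage(deep=True)    # plain Python strings
sales['city'].astype('category').memory_usage(deep=True)  # integer codes + a small lookup table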

Multiple aggregations

In [59]:
sales.groupby('city')[['bread','butter']].agg(['max','sum'])
Out[59]:
       bread       butter
         max  sum     max  sum
city
Austin   326  465      70   90
Dallas   456  693      98  143

Custom aggregation

  • Pass a function to df.groupby(...).agg(func)
In [60]:
def data_range(series):
     return series.max() - series.min()
In [61]:
sales.groupby('weekday')[['bread', 'butter']].agg(data_range)
Out[61]:
         bread  butter
weekday
Mon        130      28
Sun         98      25

Custom aggregation: dictionaries

In [62]:
sales.groupby(customers)[['bread', 'butter']].agg({'bread':'sum', 'butter':data_range})
Out[62]:
       butter  bread
Alice      53    693
Bob         0    326
Dave        0    139

Practice on the Titanic dataset

In [34]:
titanic = pd.read_csv('train.csv')
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
In [36]:
titanic.head()
Out[36]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
  • The 'Pclass' column identifies which class of ticket the passenger purchased.
  • The 'Embarked' column indicates at which of three ports the passenger boarded the Titanic:
    • 'S' stands for Southampton, England, 'C' for Cherbourg, France, and 'Q' for Queenstown, Ireland.
  • First, group by the 'Pclass' column and count the number of rows in each class using the 'Survived' column.
  • Then group by the 'Embarked' and 'Pclass' columns and count the number of passengers.
In [38]:
# Group titanic by 'Pclass'
# Aggregate the 'Survived' column by count: p_group

p_group = titanic.groupby('Pclass')['Survived'].count()
p_group
Out[38]:
Pclass
1    216
2    184
3    491
Name: Survived, dtype: int64

Aggregating on index levels/fields

  • If you have a DataFrame with a multi-level row index, the individual levels can be used to perform the groupby. This allows advanced aggregation techniques to be applied along one or more levels in the index and across one or more columns; see the sketch below.
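None of the cells in this notebook group on an index level directly, so here is a minimal sketch that reuses the earlier sales data (the set_index call is an assumption made for illustration):
In [ ]:
# Sketch: grouping on levels of a MultiIndex instead of on columns.
sales_idx = sales.set_index(['city', 'weekday'])
sales_idx.groupby(level='weekday').sum()            # one level, by name
sales_idx.groupby(level=['city', 'weekday']).sum()  # several levels at once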

Computing multiple aggregates of multiple columns

  • The .agg() method can be used with a tuple or list of aggregations as input. When applying multiple aggregations on multiple columns, the aggregated DataFrame has a multi-level column index.
    • group passengers on the Titanic by 'Pclass' and aggregate the 'Age' and 'Fare' columns with the functions 'max' and 'median'
In [42]:
# Group titanic by 'Embarked' and 'Pclass'
# Count every column in each group: multi_group


multi_group = titanic.groupby(['Embarked','Pclass']).count()
multi_group
Out[42]:
                 PassengerId  Survived  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin
Embarked Pclass
C        1                85        85    85   85   74     85     85      85    85     66
         2                17        17    17   17   15     17     17      17    17      2
         3                66        66    66   66   41     66     66      66    66      1
Q        1                 2         2     2    2    2      2      2       2     2      2
         2                 3         3     3    3    2      3      3       3     3      1
         3                72        72    72   72   24     72     72      72    72      1
S        1               127       127   127  127  108    127    127     127   127    106
         2               164       164   164  164  156    164    164     164   164     13
         3               353       353   353  353  290    353    353     353   353     10
In [43]:
multi_survive = titanic.groupby(['Embarked','Pclass'])['Survived'].count()
multi_survive
Out[43]:
Embarked  Pclass
C         1          85
          2          17
          3          66
Q         1           2
          2           3
          3          72
S         1         127
          2         164
          3         353
Name: Survived, dtype: int64
In [65]:
# Group titanic by 'Pclass': by_class
by_class = titanic.groupby('Pclass')

# Select 'Age' and 'Fare'
by_class_sub = by_class[['Age','Fare']]

# Aggregate by_class_sub by 'max' and 'median': aggregated
aggregated = by_class_sub.agg(['max','median'])
aggregated
Out[65]:
         Age              Fare
         max median        max   median
Pclass
1       80.0   37.0   512.3292  60.2875
2       70.0   29.0    73.5000  14.2500
3       74.0   24.0    69.5500   8.0500
In [68]:
# Print the maximum age in each class
print(aggregated.loc[:, ('Age','max')])
Pclass
1    80.0
2    70.0
3    74.0
Name: (Age, max), dtype: float64
In [69]:
# Print the median fare in each class
print(aggregated.loc[:, ('Fare','median')])
Pclass
1    60.2875
2    14.2500
3     8.0500
Name: (Fare, median), dtype: float64

Grouping by another series

In [45]:
# Read life_fname into a DataFrame: life
life = pd.read_csv('https://s3.amazonaws.com/assets.datacamp.com/production/course_1650/datasets/life_expectancy.csv', 
                   index_col='Country')

# Read regions_fname into a DataFrame: regions
regions = pd.read_csv('https://s3.amazonaws.com/assets.datacamp.com/production/course_1650/datasets/regions.csv', 
                      index_col='Country')
print(life.shape)
print(regions.shape)
(204, 50)
(204, 1)
In [48]:
life.head(3)
Out[48]:
1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 ... 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
Country
Afghanistan 33.639 34.152 34.662 35.170 35.674 36.172 36.663 37.143 37.614 38.075 ... 56.583 57.071 57.582 58.102 58.618 59.124 59.612 60.079 60.524 60.947
Albania 65.475 65.863 66.122 66.316 66.500 66.702 66.948 67.251 67.595 67.966 ... 75.725 75.949 76.124 76.278 76.433 76.598 76.780 76.979 77.185 77.392
Algeria 47.953 48.389 48.806 49.205 49.592 49.976 50.366 50.767 51.195 51.670 ... 69.682 69.854 70.020 70.180 70.332 70.477 70.615 70.747 70.874 71.000

3 rows × 50 columns

In [49]:
regions.head()
Out[49]:
                                         region
Country
Afghanistan                          South Asia
Albania                   Europe & Central Asia
Algeria              Middle East & North Africa
Angola                       Sub-Saharan Africa
Antigua and Barbuda                     America

The DataFrames life and regions share the same index

In [53]:
# Group life by regions['region']: life_by_region
life_by_region = life.groupby(regions['region'])

# Print the mean over the '2010' column of life_by_region
life_by_region['2010'].mean()
Out[53]:
region
America                       74.037350
East Asia & Pacific           73.405750
Europe & Central Asia         75.656387
Middle East & North Africa    72.805333
South Asia                    68.189750
Sub-Saharan Africa            57.575080
Name: 2010, dtype: float64

Grouping on a function of the index

  • Groupby operations can also be performed on transformations of the index values. With a DatetimeIndex, we can extract portions of the datetime over which to group.
    • The sample sales data below covers February 2015, with the order timestamps as the index. The job is to group the sales by day of the week and sum the 'Units' column.
      • Is any day of the week more popular with customers? To find out, use .strftime('%a') to transform the index datetime values into abbreviated day-of-week names.
In [75]:
from pandas import Timestamp

df = pd.DataFrame({'Company': {Timestamp('2015-02-02 08:30:00'): 'Hooli',
  Timestamp('2015-02-02 21:00:00'): 'Mediacore',
  Timestamp('2015-02-03 14:00:00'): 'Initech',
  Timestamp('2015-02-04 15:30:00'): 'Streeplex',
  Timestamp('2015-02-04 22:00:00'): 'Acme Coporation',
  Timestamp('2015-02-05 02:00:00'): 'Acme Coporation',
  Timestamp('2015-02-05 22:00:00'): 'Hooli',
  Timestamp('2015-02-07 23:00:00'): 'Acme Coporation',
  Timestamp('2015-02-09 09:00:00'): 'Streeplex',
  Timestamp('2015-02-09 13:00:00'): 'Mediacore',
  Timestamp('2015-02-11 20:00:00'): 'Initech',
  Timestamp('2015-02-11 23:00:00'): 'Hooli',
  Timestamp('2015-02-16 12:00:00'): 'Hooli',
  Timestamp('2015-02-19 11:00:00'): 'Mediacore',
  Timestamp('2015-02-19 16:00:00'): 'Mediacore',
  Timestamp('2015-02-21 05:00:00'): 'Mediacore',
  Timestamp('2015-02-21 20:30:00'): 'Hooli',
  Timestamp('2015-02-25 00:30:00'): 'Initech',
  Timestamp('2015-02-26 09:00:00'): 'Streeplex'},
 'Product': {Timestamp('2015-02-02 08:30:00'): 'Software',
  Timestamp('2015-02-02 21:00:00'): 'Hardware',
  Timestamp('2015-02-03 14:00:00'): 'Software',
  Timestamp('2015-02-04 15:30:00'): 'Software',
  Timestamp('2015-02-04 22:00:00'): 'Hardware',
  Timestamp('2015-02-05 02:00:00'): 'Software',
  Timestamp('2015-02-05 22:00:00'): 'Service',
  Timestamp('2015-02-07 23:00:00'): 'Hardware',
  Timestamp('2015-02-09 09:00:00'): 'Service',
  Timestamp('2015-02-09 13:00:00'): 'Software',
  Timestamp('2015-02-11 20:00:00'): 'Software',
  Timestamp('2015-02-11 23:00:00'): 'Software',
  Timestamp('2015-02-16 12:00:00'): 'Software',
  Timestamp('2015-02-19 11:00:00'): 'Hardware',
  Timestamp('2015-02-19 16:00:00'): 'Service',
  Timestamp('2015-02-21 05:00:00'): 'Software',
  Timestamp('2015-02-21 20:30:00'): 'Hardware',
  Timestamp('2015-02-25 00:30:00'): 'Service',
  Timestamp('2015-02-26 09:00:00'): 'Service'},
 'Units': {Timestamp('2015-02-02 08:30:00'): 3,
  Timestamp('2015-02-02 21:00:00'): 9,
  Timestamp('2015-02-03 14:00:00'): 13,
  Timestamp('2015-02-04 15:30:00'): 13,
  Timestamp('2015-02-04 22:00:00'): 14,
  Timestamp('2015-02-05 02:00:00'): 19,
  Timestamp('2015-02-05 22:00:00'): 10,
  Timestamp('2015-02-07 23:00:00'): 1,
  Timestamp('2015-02-09 09:00:00'): 19,
  Timestamp('2015-02-09 13:00:00'): 7,
  Timestamp('2015-02-11 20:00:00'): 7,
  Timestamp('2015-02-11 23:00:00'): 4,
  Timestamp('2015-02-16 12:00:00'): 10,
  Timestamp('2015-02-19 11:00:00'): 16,
  Timestamp('2015-02-19 16:00:00'): 10,
  Timestamp('2015-02-21 05:00:00'): 3,
  Timestamp('2015-02-21 20:30:00'): 3,
  Timestamp('2015-02-25 00:30:00'): 10,
  Timestamp('2015-02-26 09:00:00'): 4}})
In [76]:
df
Out[76]:
                             Company   Product  Units
2015-02-02 08:30:00            Hooli  Software      3
2015-02-02 21:00:00        Mediacore  Hardware      9
2015-02-03 14:00:00          Initech  Software     13
2015-02-04 15:30:00        Streeplex  Software     13
2015-02-04 22:00:00  Acme Coporation  Hardware     14
2015-02-05 02:00:00  Acme Coporation  Software     19
2015-02-05 22:00:00            Hooli   Service     10
2015-02-07 23:00:00  Acme Coporation  Hardware      1
2015-02-09 09:00:00        Streeplex   Service     19
2015-02-09 13:00:00        Mediacore  Software      7
2015-02-11 20:00:00          Initech  Software      7
2015-02-11 23:00:00            Hooli  Software      4
2015-02-16 12:00:00            Hooli  Software     10
2015-02-19 11:00:00        Mediacore  Hardware     16
2015-02-19 16:00:00        Mediacore   Service     10
2015-02-21 05:00:00        Mediacore  Software      3
2015-02-21 20:30:00            Hooli  Hardware      3
2015-02-25 00:30:00          Initech   Service     10
2015-02-26 09:00:00        Streeplex   Service      4
In [80]:
# Create a groupby object: by_day
by_day = df.groupby(df.index.strftime('%a'))
by_day
Out[80]:
<pandas.core.groupby.DataFrameGroupBy object at 0x0000000009B116A0>
In [81]:
df.index.strftime('%a')
Out[81]:
array(['Mon', 'Mon', 'Tue', 'Wed', 'Wed', 'Thu', 'Thu', 'Sat', 'Mon',
       'Mon', 'Wed', 'Wed', 'Mon', 'Thu', 'Thu', 'Sat', 'Sat', 'Wed', 'Thu'], 
      dtype='|S3')
In [82]:
# Create sum: units_sum
units_sum = by_day['Units'].sum()
units_sum
Out[82]:
Mon    48
Sat     7
Thu    59
Tue    13
Wed    48
Name: Units, dtype: int64

Groupby and transformation

The z-score

  • scipy.stats.zscore is implemented at C level, so it is noticeably faster than a hand-written pandas version (timed below).

transformation

  • .transform() applies a function element-wise within each group.

use df.groupby(...).apply(func)

  • Similar to .transform(), but the return type is determined by what the function returns.
  • The .apply() method when used on a groupby object performs an arbitrary function on each of the groups. These functions can be aggregations, transformations or more complex workflows. The .apply() method will then combine the results in an intelligent way.
In [84]:
def Zscore(series):
     return (series - series.mean()) / series.std()
In [85]:
from scipy.stats import zscore 
In [90]:
%%timeit
Zscore(df['Units'])
10000 loops, best of 3: 130 µs per loop
In [91]:
%%timeit
zscore(df['Units'])
10000 loops, best of 3: 30.9 µs per loop

by weekdays

In [104]:
df.head()
Out[104]:
                             Company   Product  Units
2015-02-02 08:30:00            Hooli  Software      3
2015-02-02 21:00:00        Mediacore  Hardware      9
2015-02-03 14:00:00          Initech  Software     13
2015-02-04 15:30:00        Streeplex  Software     13
2015-02-04 22:00:00  Acme Coporation  Hardware     14
In [106]:
df.index.weekday
Out[106]:
array([0, 0, 1, 2, 2, 3, 3, 5, 0, 0, 2, 2, 0, 3, 3, 5, 5, 2, 3])
In [102]:
df.groupby(df.index.weekday).sum()
Out[102]:
   Units
0     48
1     13
2     48
3     59
5      7
In [110]:
df.groupby(df.index.weekday).transform(zscore)
Out[110]:
Units
2015-02-02 08:30:00 -1.250862
2015-02-02 21:00:00 -0.113715
2015-02-03 14:00:00 NaN
2015-02-04 15:30:00 0.913926
2015-02-04 22:00:00 1.182727
2015-02-05 02:00:00 1.376494
2015-02-05 22:00:00 -0.344124
2015-02-07 23:00:00 -1.414214
2015-02-09 09:00:00 1.781530
2015-02-09 13:00:00 -0.492764
2015-02-11 20:00:00 -0.698884
2015-02-11 23:00:00 -1.505289
2015-02-16 12:00:00 0.075810
2015-02-19 11:00:00 0.802955
2015-02-19 16:00:00 -0.344124
2015-02-21 05:00:00 0.707107
2015-02-21 20:30:00 0.707107
2015-02-25 00:30:00 0.107521
2015-02-26 09:00:00 -1.491202
In [111]:
df.groupby(df.index.weekday)['Units'].transform(zscore)
Out[111]:
2015-02-02 08:30:00   -1.250862
2015-02-02 21:00:00   -0.113715
2015-02-03 14:00:00         NaN
2015-02-04 15:30:00    0.913926
2015-02-04 22:00:00    1.182727
2015-02-05 02:00:00    1.376494
2015-02-05 22:00:00   -0.344124
2015-02-07 23:00:00   -1.414214
2015-02-09 09:00:00    1.781530
2015-02-09 13:00:00   -0.492764
2015-02-11 20:00:00   -0.698884
2015-02-11 23:00:00   -1.505289
2015-02-16 12:00:00    0.075810
2015-02-19 11:00:00    0.802955
2015-02-19 16:00:00   -0.344124
2015-02-21 05:00:00    0.707107
2015-02-21 20:30:00    0.707107
2015-02-25 00:30:00    0.107521
2015-02-26 09:00:00   -1.491202
Name: Units, dtype: float64
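The NaN for 2015-02-03 in both results above is expected: that timestamp is the only Tuesday in the data, so its group has no spread and the z-score works out to 0/0. A quick check (a sketch, not part of the original run):
In [ ]:
# Sketch: Tuesday (weekday == 1) contains a single row, so its group z-score is undefined.
df.loc[df.index.weekday == 1, 'Units']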
In [119]:
df.groupby(df.index.weekday).apply(sum)
Out[119]:
Company Product Units
0 HooliMediacoreStreeplexMediacoreHooli SoftwareHardwareServiceSoftwareSoftware 48
1 Initech Software 13
2 StreeplexAcme CoporationInitechHooliInitech SoftwareHardwareSoftwareSoftwareService 48
3 Acme CoporationHooliMediacoreMediacoreStreeplex SoftwareServiceHardwareServiceService 59
5 Acme CoporationMediacoreHooli HardwareSoftwareHardware 7
In [126]:
df.groupby(df.index.weekday)['Units'].apply(Zscore)
Out[126]:
2015-02-02 08:30:00   -1.118805
2015-02-02 21:00:00   -0.101710
2015-02-03 14:00:00         NaN
2015-02-04 15:30:00    0.817440
2015-02-04 22:00:00    1.057863
2015-02-05 02:00:00    1.231174
2015-02-05 22:00:00   -0.307794
2015-02-07 23:00:00   -1.154701
2015-02-09 09:00:00    1.593449
2015-02-09 13:00:00   -0.440741
2015-02-11 20:00:00   -0.625101
2015-02-11 23:00:00   -1.346372
2015-02-16 12:00:00    0.067806
2015-02-19 11:00:00    0.718185
2015-02-19 16:00:00   -0.307794
2015-02-21 05:00:00    0.577350
2015-02-21 20:30:00    0.577350
2015-02-25 00:30:00    0.096169
2015-02-26 09:00:00   -1.333772
Name: Units, dtype: float64
In [127]:
df.groupby(df.index.weekday)['Units'].apply(zscore)
Out[127]:
0    [-1.25086177191, -0.113714706537, 1.7815304024...
1                                                [nan]
2    [0.91392566197, 1.18272732726, -0.698884329742...
3    [1.37649440322, -0.344123600806, 0.80295506854...
5     [-1.41421356237, 0.707106781187, 0.707106781187]
Name: Units, dtype: object

Filling missing data (imputation) by group

  • Many statistical and machine learning packages cannot determine the best action to take when missing data entries are encountered.
    • Dealing with missing data is natural in pandas (both in using the default behavior and in defining a custom behavior).
      • sing the .dropna() method to drop missing values. Now, you will practice imputing missing values.
        • use .groupby() and .transform() to fill missing data appropriately for each group.
  • use .groupby() and .transform() to fill missing data appropriately for each group.
    • group by the 'sex' and 'pclass' columns and transform each group with a custom function to call .fillna() and impute the median value.
In [129]:
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
In [134]:
# Write a function that imputes median
def impute_median(series):
    return series.fillna(series.median())
In [138]:
# Create a groupby object: by_sex_class
by_sex_class = titanic.groupby(['Sex','Pclass'])
by_sex_class.count()
Out[138]:
               PassengerId  Survived  Name  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
Sex    Pclass
female 1                94        94    94   85     94     94      94    94     81        92
       2                76        76    76   74     76     76      76    76     10        76
       3               144       144   144  102    144    144     144   144      6       144
male   1               122       122   122  101    122    122     122   122     95       122
       2               108       108   108   99    108    108     108   108      6       108
       3               347       347   347  253    347    347     347   347      6       347
In [142]:
# Impute the median age within each group and assign to titanic['Age']
titanic['Age'] = by_sex_class['Age'].transform(impute_median)
titanic.head(4)
Out[142]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S

Groupby and filtering

In [143]:
auto = pd.read_csv('mpg.csv')
In [145]:
auto.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
mpg             398 non-null float64
cylinders       398 non-null int64
displacement    398 non-null float64
horsepower      398 non-null object
weight          398 non-null int64
acceleration    398 non-null float64
model_year      398 non-null int64
origin          398 non-null int64
name            398 non-null object
dtypes: float64(3), int64(4), object(2)
memory usage: 28.1+ KB
In [146]:
auto.head()
Out[146]:
    mpg  cylinders  displacement horsepower  weight  acceleration  model_year  origin                       name
0  18.0          8         307.0        130    3504          12.0          70       1  chevrolet chevelle malibu
1  15.0          8         350.0        165    3693          11.5          70       1          buick skylark 320
2  18.0          8         318.0        150    3436          11.0          70       1         plymouth satellite
3  16.0          8         304.0        150    3433          12.0          70       1              amc rebel sst
4  17.0          8         302.0        140    3449          10.5          70       1                ford torino

Mean MPG by year

In [148]:
auto.groupby('model_year')['mpg'].mean()
Out[148]:
model_year
70    17.689655
71    21.250000
72    18.714286
73    17.100000
74    22.703704
75    20.266667
76    21.573529
77    23.375000
78    24.061111
79    25.093103
80    33.696552
81    30.334483
82    31.709677
Name: mpg, dtype: float64

groupby object

  • A groupby object's .groups attribute is a dictionary mapping each group key to the row labels in that group.
In [155]:
splitting = auto.groupby('model_year')

print(type(splitting))

print(type(splitting.groups))
<class 'pandas.core.groupby.DataFrameGroupBy'>
<type 'dict'>
In [152]:
print(splitting.groups)
{70: [0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L], 71: [29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L, 51L, 52L, 53L, 54L, 55L, 56L], 72: [57L, 58L, 59L, 60L, 61L, 62L, 63L, 64L, 65L, 66L, 67L, 68L, 69L, 70L, 71L, 72L, 73L, 74L, 75L, 76L, 77L, 78L, 79L, 80L, 81L, 82L, 83L, 84L], 73: [85L, 86L, 87L, 88L, 89L, 90L, 91L, 92L, 93L, 94L, 95L, 96L, 97L, 98L, 99L, 100L, 101L, 102L, 103L, 104L, 105L, 106L, 107L, 108L, 109L, 110L, 111L, 112L, 113L, 114L, 115L, 116L, 117L, 118L, 119L, 120L, 121L, 122L, 123L, 124L], 74: [125L, 126L, 127L, 128L, 129L, 130L, 131L, 132L, 133L, 134L, 135L, 136L, 137L, 138L, 139L, 140L, 141L, 142L, 143L, 144L, 145L, 146L, 147L, 148L, 149L, 150L, 151L], 75: [152L, 153L, 154L, 155L, 156L, 157L, 158L, 159L, 160L, 161L, 162L, 163L, 164L, 165L, 166L, 167L, 168L, 169L, 170L, 171L, 172L, 173L, 174L, 175L, 176L, 177L, 178L, 179L, 180L, 181L], 76: [182L, 183L, 184L, 185L, 186L, 187L, 188L, 189L, 190L, 191L, 192L, 193L, 194L, 195L, 196L, 197L, 198L, 199L, 200L, 201L, 202L, 203L, 204L, 205L, 206L, 207L, 208L, 209L, 210L, 211L, 212L, 213L, 214L, 215L], 77: [216L, 217L, 218L, 219L, 220L, 221L, 222L, 223L, 224L, 225L, 226L, 227L, 228L, 229L, 230L, 231L, 232L, 233L, 234L, 235L, 236L, 237L, 238L, 239L, 240L, 241L, 242L, 243L], 78: [244L, 245L, 246L, 247L, 248L, 249L, 250L, 251L, 252L, 253L, 254L, 255L, 256L, 257L, 258L, 259L, 260L, 261L, 262L, 263L, 264L, 265L, 266L, 267L, 268L, 269L, 270L, 271L, 272L, 273L, 274L, 275L, 276L, 277L, 278L, 279L], 79: [280L, 281L, 282L, 283L, 284L, 285L, 286L, 287L, 288L, 289L, 290L, 291L, 292L, 293L, 294L, 295L, 296L, 297L, 298L, 299L, 300L, 301L, 302L, 303L, 304L, 305L, 306L, 307L, 308L], 80: [309L, 310L, 311L, 312L, 313L, 314L, 315L, 316L, 317L, 318L, 319L, 320L, 321L, 322L, 323L, 324L, 325L, 326L, 327L, 328L, 329L, 330L, 331L, 332L, 333L, 334L, 335L, 336L, 337L], 81: [338L, 339L, 340L, 341L, 342L, 343L, 344L, 345L, 346L, 347L, 348L, 349L, 350L, 351L, 352L, 353L, 354L, 355L, 356L, 357L, 358L, 359L, 360L, 361L, 362L, 363L, 364L, 365L, 366L], 82: [367L, 368L, 369L, 370L, 371L, 372L, 373L, 374L, 375L, 376L, 377L, 378L, 379L, 380L, 381L, 382L, 383L, 384L, 385L, 386L, 387L, 388L, 389L, 390L, 391L, 392L, 393L, 394L, 395L, 396L, 397L]}
In [154]:
print(splitting.groups.keys())
[70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82]

groupby object: iteration

In [157]:
for group_name, group in splitting:
    avg = group['mpg'].mean()
    print(group_name, avg)
(70, 17.689655172413794)
(71, 21.25)
(72, 18.714285714285715)
(73, 17.1)
(74, 22.703703703703702)
(75, 20.266666666666666)
(76, 21.573529411764707)
(77, 23.375)
(78, 24.061111111111114)
(79, 25.09310344827585)
(80, 33.696551724137926)
(81, 30.33448275862069)
(82, 31.70967741935484)

groupby object: iteration and filtering

In [158]:
for group_name, group in splitting:
    avg = group.loc[group['name'].str.contains('chevrolet'), 'mpg'].mean()
    print(group_name, avg)
(70, 15.666666666666666)
(71, 20.25)
(72, 15.333333333333334)
(73, 14.833333333333334)
(74, 18.666666666666668)
(75, 17.666666666666668)
(76, 23.25)
(77, 20.25)
(78, 23.233333333333334)
(79, 21.666666666666668)
(80, 30.05)
(81, 23.5)
(82, 29.0)

groupby object: comprehension

  • The same per-group means can be built with a dictionary comprehension
In [160]:
chevy_means = {year: group.loc[group['name'].str.contains('chevrolet'), 'mpg'].mean()
               for year, group in splitting}

chevy_means
Out[160]:
{70: 15.666666666666666,
 71: 20.25,
 72: 15.333333333333334,
 73: 14.833333333333334,
 74: 18.666666666666668,
 75: 17.666666666666668,
 76: 23.25,
 77: 20.25,
 78: 23.233333333333334,
 79: 21.666666666666668,
 80: 30.05,
 81: 23.5,
 82: 29.0}

Boolean groupby

In [175]:
chevy = auto['name'].str.contains('chevrolet')
print(type(chevy))
print(chevy.shape)
chevy.head()
<class 'pandas.core.series.Series'>
(398L,)
Out[175]:
0     True
1    False
2    False
3    False
4    False
Name: name, dtype: bool
  • chevy shares the same index as auto, so it can be passed straight to groupby()
In [176]:
auto.shape
Out[176]:
(398, 9)
In [174]:
auto.groupby(['model_year', chevy]).mean()
Out[174]:
mpg cylinders displacement weight acceleration origin
model_year name
70 False 17.923077 6.615385 269.230769 3315.076923 13.269231 1.346154
True 15.666667 8.000000 387.000000 3873.000000 10.166667 1.000000
71 False 21.416667 5.583333 208.041667 2985.916667 15.083333 1.500000
True 20.250000 5.500000 220.000000 3052.500000 15.500000 1.000000
72 False 19.120000 5.720000 212.700000 3195.040000 15.120000 1.600000
True 15.333333 6.666667 265.666667 3593.333333 15.166667 1.000000
73 False 17.500000 6.264706 246.617647 3339.735294 14.205882 1.441176
True 14.833333 7.000000 315.000000 3868.333333 14.916667 1.000000
74 False 23.208333 5.250000 166.541667 2835.208333 16.104167 1.750000
True 18.666667 5.333333 213.333333 3219.666667 17.000000 1.000000
75 False 20.555556 5.407407 196.444444 3117.925926 16.222222 1.518519
True 17.666667 7.333333 287.333333 3706.666667 14.500000 1.000000
76 False 21.350000 5.666667 199.566667 3097.000000 15.673333 1.533333
True 23.250000 5.500000 184.500000 2941.750000 17.950000 1.000000
77 False 23.895833 5.291667 181.500000 2929.583333 15.620833 1.666667
True 20.250000 6.500000 250.750000 3404.000000 14.325000 1.000000
78 False 24.136364 5.303030 175.696970 2857.272727 15.790909 1.666667
True 23.233333 6.000000 201.000000 2911.666667 15.966667 1.000000
79 False 25.488462 5.653846 201.884615 3021.730769 16.034615 1.307692
True 21.666667 7.333333 248.333333 3346.666667 13.900000 1.000000
80 False 33.966667 4.148148 115.185185 2439.444444 17.003704 2.296296
True 30.050000 4.000000 124.500000 2399.000000 16.000000 1.000000
81 False 30.578571 4.571429 133.964286 2515.714286 16.439286 2.000000
True 23.500000 6.000000 173.000000 2725.000000 12.600000 1.000000
82 False 32.111111 4.222222 129.925926 2424.814815 16.381481 1.740741
True 29.000000 4.000000 121.750000 2647.500000 18.375000 1.000000

Grouping and filtering with .apply()

  • By using .apply(), you can write functions that filter rows within groups. The .apply() method handles the iteration over the individual groups and then recombines the results into a Series or DataFrame.
    • Take the Titanic dataset and analyze survival rates for passengers on the 'C' deck, which held the most passengers:
      • group the dataset by 'Sex', then call .apply() with a user-defined function that computes the mean survival rate of C-deck passengers in each group.
In [178]:
titanic.head()
Out[178]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [183]:
def c_deck_survival(gr):

    c_passengers = gr['Cabin'].str.startswith('C').fillna(False)

    return gr.loc[c_passengers, 'Survived'].mean()
In [184]:
# Create a groupby object using titanic over the 'Sex' column: by_sex
by_sex = titanic.groupby('Sex')

# Apply the c_deck_survival function to each group: c_surv_by_sex
c_surv_by_sex = by_sex.apply(c_deck_survival)

c_surv_by_sex
Out[184]:
Sex
female    0.888889
male      0.343750
dtype: float64

The same result with plain pandas indexing

In [204]:
titanic.loc[titanic['Cabin'].str.startswith('C').fillna(False)].groupby('Sex')['Survived'].mean()
Out[204]:
Sex
female    0.888889
male      0.343750
Name: Survived, dtype: float64
In [196]:
titanic['Cabin'].str.startswith('C').head(10)
Out[196]:
0      NaN
1     True
2      NaN
3     True
4      NaN
5      NaN
6    False
7      NaN
8      NaN
9      NaN
Name: Cabin, dtype: object
In [197]:
titanic['Cabin'].str.startswith('C').fillna(False).head(10)
Out[197]:
0    False
1     True
2    False
3     True
4    False
5    False
6    False
7    False
8    False
9    False
Name: Cabin, dtype: bool

Grouping and filtering with .filter()

  • You can use groupby with the .filter() method to remove whole groups of rows from a DataFrame based on a boolean condition.
In [205]:
df
Out[205]:
                             Company   Product  Units
2015-02-02 08:30:00            Hooli  Software      3
2015-02-02 21:00:00        Mediacore  Hardware      9
2015-02-03 14:00:00          Initech  Software     13
2015-02-04 15:30:00        Streeplex  Software     13
2015-02-04 22:00:00  Acme Coporation  Hardware     14
2015-02-05 02:00:00  Acme Coporation  Software     19
2015-02-05 22:00:00            Hooli   Service     10
2015-02-07 23:00:00  Acme Coporation  Hardware      1
2015-02-09 09:00:00        Streeplex   Service     19
2015-02-09 13:00:00        Mediacore  Software      7
2015-02-11 20:00:00          Initech  Software      7
2015-02-11 23:00:00            Hooli  Software      4
2015-02-16 12:00:00            Hooli  Software     10
2015-02-19 11:00:00        Mediacore  Hardware     16
2015-02-19 16:00:00        Mediacore   Service     10
2015-02-21 05:00:00        Mediacore  Software      3
2015-02-21 20:30:00            Hooli  Hardware      3
2015-02-25 00:30:00          Initech   Service     10
2015-02-26 09:00:00        Streeplex   Service      4
In [209]:
# Group sales by 'Company': by_company
by_company = df.groupby('Company')

# Compute the sum of the 'Units' of by_company: by_com_sum
by_com_sum = by_company['Units'].sum()
by_com_sum
Out[209]:
Company
Acme Coporation    34
Hooli              30
Initech            30
Mediacore          45
Streeplex          36
Name: Units, dtype: int64
In [211]:
# Keep only the rows of companies whose total 'Units' is > 35: by_com_filt
by_com_filt = by_company.filter(lambda g:g['Units'].sum() > 35)
by_com_filt
Out[211]:
                       Company   Product  Units
2015-02-02 21:00:00  Mediacore  Hardware      9
2015-02-04 15:30:00  Streeplex  Software     13
2015-02-09 09:00:00  Streeplex   Service     19
2015-02-09 13:00:00  Mediacore  Software      7
2015-02-19 11:00:00  Mediacore  Hardware     16
2015-02-19 16:00:00  Mediacore   Service     10
2015-02-21 05:00:00  Mediacore  Software      3
2015-02-26 09:00:00  Streeplex   Service      4

Filtering and grouping with .map()

  • You have seen how to group by a column, or by multiple columns. Sometimes, you may instead want to group by a function or transformation of a column.
  • The key here is that the Series is indexed the same way as the DataFrame. You can also mix and match column grouping with Series grouping.
In [215]:
# Create the Boolean Series: under10
under10 = (titanic['Age'] < 10).map({True:'under 10', False:'over 10'})
print(type(under10))
under10.head(10)
<class 'pandas.core.series.Series'>
Out[215]:
0     over 10
1     over 10
2     over 10
3     over 10
4     over 10
5     over 10
6     over 10
7    under 10
8     over 10
9     over 10
Name: Age, dtype: object
  • under10 shares the same index as titanic, so it can be used as a grouping key
In [217]:
# Group by under10 and compute the survival rate
survived_mean_1 = titanic.groupby(under10)['Survived'].mean()
print(survived_mean_1)
Age
over 10     0.366707
under 10    0.612903
Name: Survived, dtype: float64
In [221]:
# Group by under10 and pclass and compute the survival rate
survived_mean_2 = titanic.groupby([under10, 'Pclass'])['Survived'].mean()
survived_mean_2
Out[221]:
Age       Pclass
over 10   1         0.629108
          2         0.419162
          3         0.222717
under 10  1         0.666667
          2         1.000000
          3         0.452381
Name: Survived, dtype: float64