Pandas groupby

Categoricals and groupby

In [2]:
import pandas as pd

sales = pd.DataFrame({'weekday': ['Sun', 'Sun', 'Mon', 'Mon'],
                      'city': ['Austin', 'Dallas', 'Austin', 'Dallas'],
                      'bread': [139, 237, 326, 456],
                      'butter': [20, 45, 70, 98]})
In [3]:
sales
Out[3]:
   bread  butter    city weekday
0    139      20  Austin     Sun
1    237      45  Dallas     Sun
2    326      70  Austin     Mon
3    456      98  Dallas     Mon

Boolean filter and count

In [6]:
sales.loc[sales['weekday'] == 'Sun'].count()
Out[6]:
bread      2
butter     2
city       2
weekday    2
dtype: int64

Groupby and count

Split-apply-combine

  • sales.groupby('weekday').count()
    • split by 'weekday'
    • apply the count() function to each group
    • combine the counts per group
In [8]:
sales.groupby('weekday').count()
Out[8]:
         bread  butter  city
weekday
Mon          2       2     2
Sun          2       2     2
In [9]:
sales.groupby('weekday')
Out[9]:
<pandas.core.groupby.DataFrameGroupBy object at 0x000000000929FF60>
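The GroupBy object itself only records how the rows will be split; nothing is computed until an aggregation is applied. As a minimal sketch (not part of the original run), a single group can be pulled out with .get_group():
In [ ]:
# Sketch: inspect the "split" step explicitly.
grouped = sales.groupby('weekday')
grouped.ngroups             # number of groups, here 2
grouped.get_group('Sun')    # just the two Sunday rows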
In [11]:
print(dir(sales.groupby('weekday')))
['__bytes__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__format__', '__getattr__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__iter__', '__len__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_agg_doc', '_aggregate', '_aggregate_generic', '_aggregate_item_by_item', '_aggregate_multiple_funcs', '_apply_filter', '_apply_to_column_groupbys', '_apply_whitelist', '_assure_grouper', '_block_agg_axis', '_builtin_table', '_choose_path', '_concat_objects', '_constructor', '_cumcount_array', '_cython_agg_blocks', '_cython_agg_general', '_cython_table', '_cython_transform', '_decide_output_index', '_def_str', '_define_paths', '_dir_additions', '_dir_deletions', '_get_data_to_aggregate', '_get_index', '_get_indices', '_gotitem', '_group_selection', '_index_with_as_index', '_insert_inaxis_grouper_inplace', '_internal_names', '_internal_names_set', '_is_builtin_func', '_is_cython_func', '_iterate_column_groupbys', '_iterate_slices', '_make_wrapper', '_obj_with_exclusions', '_post_process_cython_aggregate', '_python_agg_general', '_python_apply_general', '_reindex_output', '_reset_cache', '_see_also_template', '_selected_obj', '_selection', '_selection_list', '_set_result_index_ordered', '_set_selection_from_grouper', '_shallow_copy', '_transform_general', '_transform_item_by_item', '_try_cast', '_wrap_agged_blocks', '_wrap_aggregated_output', '_wrap_applied_output', '_wrap_generic_output', '_wrap_transformed_output', 'agg', 'aggregate', 'all', 'any', 'apply', 'backfill', 'bfill', 'boxplot', 'bread', 'butter', 'city', 'corr', 'corrwith', 'count', 'cov', 'cumcount', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'dtypes', 'expanding', 'ffill', 'fillna', 'filter', 'first', 'get_group', 'groups', 'head', 'hist', 'idxmax', 'idxmin', 'indices', 'irow', 'last', 'mad', 'max', 'mean', 'median', 'min', 'name', 'ndim', 'ngroups', 'nth', 'ohlc', 'pad', 'pct_change', 'plot', 'prod', 'quantile', 'rank', 'resample', 'rolling', 'sem', 'shift', 'size', 'skew', 'std', 'sum', 'tail', 'take', 'transform', 'tshift', 'var', 'weekday']
In [12]:
sales.groupby('weekday').sum()
Out[12]:
         bread  butter
weekday
Mon        782     168
Sun        376      65
In [13]:
sales.groupby('weekday').std()
Out[13]:
             bread    butter
weekday
Mon      91.923882  19.79899
Sun      69.296465  17.67767

Aggregation/Reduction

  • first(), last()
  • min(), max()
In [15]:
sales.groupby('weekday').last()
Out[15]:
         bread  butter    city
weekday
Mon        456      98  Dallas
Sun        237      45  Dallas
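The other reducers listed above behave the same way; a quick sketch (not executed in the original notebook):
In [ ]:
# Sketch: the remaining reducers from the list above.
sales.groupby('weekday').first()   # first row of each group
sales.groupby('weekday').min()     # e.g. bread: Mon -> 326, Sun -> 139
sales.groupby('weekday').max()     # e.g. bread: Mon -> 456, Sun -> 237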

Groupby and sum

In [17]:
sales.groupby('weekday')['bread'].sum()
Out[17]:
weekday
Mon    782
Sun    376
Name: bread, dtype: int64
In [20]:
sales.groupby('weekday')['bread','butter'].sum()
Out[20]:
         bread  butter
weekday
Mon        782     168
Sun        376      65
In [21]:
sales.groupby('weekday')[['bread','butter']].sum()
Out[21]:
         bread  butter
weekday
Mon        782     168
Sun        376      65

Keep the DataFrame form (double brackets)

In [22]:
sales.groupby('weekday')[['bread']].sum()
Out[22]:
         bread
weekday
Mon        782
Sun        376

Groupby and mean: multi-level index

In [23]:
sales.groupby(['city','weekday']).mean()
Out[23]:
                bread  butter
city   weekday
Austin Mon        326      70
       Sun        139      20
Dallas Mon        456      98
       Sun        237      45

Customers

  • The customers Series below is index-aligned with sales
    • 'Alice' appears at index 1 and 3
In [24]:
customers = pd.Series(['Dave','Alice','Bob','Alice'])
customers
Out[24]:
0     Dave
1    Alice
2      Bob
3    Alice
dtype: object
In [25]:
sales
Out[25]:
   bread  butter    city weekday
0    139      20  Austin     Sun
1    237      45  Dallas     Sun
2    326      70  Austin     Mon
3    456      98  Dallas     Mon
In [33]:
sales.groupby(customers)['bread'].sum()
Out[33]:
Alice    693
Bob      326
Dave     139
Name: bread, dtype: int64

Categorical data

  • Advantages
    • Uses less memory (a quick check appears after the category conversion below)
    • Speeds up operations like groupby()
In [54]:
sales['weekday'].unique()
Out[54]:
array(['Sun', 'Mon'], dtype=object)
In [56]:
sales['weekday'] = sales['weekday'].astype('category')
sales
Out[56]:
   bread  butter    city weekday
0    139      20  Austin     Sun
1    237      45  Dallas     Sun
2    326      70  Austin     Mon
3    456      98  Dallas     Mon
In [57]:
sales['weekday'] 
Out[57]:
0    Sun
1    Sun
2    Mon
3    Mon
Name: weekday, dtype: category
Categories (2, object): [Mon, Sun]
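To back up the memory claim above, the footprint of the object and category versions of a column can be compared. A minimal sketch (the saving is negligible on a 4-row toy frame and grows with many repeated values):
In [ ]:
# Sketch: compare memory use of object vs. category dtype.
sales['city'].astype('object').memory_usage(deep=True)    # plain Python strings
sales['city'].astype('category').memory_usage(deep=True)  # integer codes + a small lookup table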

Multiple aggregations

In [59]:
sales.groupby('city')[['bread','butter']].agg(['max','sum'])
Out[59]:
       bread       butter
         max  sum     max  sum
city
Austin   326  465      70   90
Dallas   456  693      98  143

Custom aggregation

  • Pass a function to df.groupby(...).agg(func)
In [60]:
def data_range(series):
     return series.max() - series.min()
In [61]:
sales.groupby('weekday')[['bread', 'butter']].agg(data_range)
Out[61]:
         bread  butter
weekday
Mon        130      28
Sun         98      25

Custom aggregation: dictionaries

In [62]:
sales.groupby(customers)[['bread', 'butter']].agg({'bread':'sum', 'butter':data_range})
Out[62]:
       butter  bread
Alice      53    693
Bob         0    326
Dave        0    139

Practice on the Titanic dataset

In [34]:
titanic = pd.read_csv('train.csv')
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
In [36]:
titanic.head()
Out[36]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
  • The 'Pclass' column identifies which class of ticket the passenger purchased.
  • The 'Embarked' column indicates at which of three ports the passenger boarded the Titanic:
    • 'S' stands for Southampton, England, 'C' for Cherbourg, France, and 'Q' for Queenstown, Ireland.
  • First, group by the 'Pclass' column and count the number of rows in each class using the 'Survived' column.
  • Then group by the 'Embarked' and 'Pclass' columns and count the number of passengers.
In [38]:
# Group titanic by 'Pclass'
# Aggregate the 'Survived' column by count: p_group

p_group = titanic.groupby('Pclass')['Survived'].count()
p_group
Out[38]:
Pclass
1    216
2    184
3    491
Name: Survived, dtype: int64

Aggregating on index levels/fields

  • If you have a DataFrame with a multi-level row index, the individual levels can be used to perform the groupby. This allows advanced aggregation techniques to be applied along one or more levels in the index and across one or more columns; see the sketch below.
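None of the cells in this notebook group on an index level directly, so here is a minimal sketch that reuses the earlier sales data (the set_index call is an assumption made for illustration):
In [ ]:
# Sketch: grouping on levels of a MultiIndex instead of on columns.
sales_idx = sales.set_index(['city', 'weekday'])
sales_idx.groupby(level='weekday').sum()            # one level, by name
sales_idx.groupby(level=['city', 'weekday']).sum()  # several levels at once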

Computing multiple aggregates of multiple columns

  • The .agg() method can be used with a tuple or list of aggregations as input. When applying multiple aggregations on multiple columns, the aggregated DataFrame has a multi-level column index.
    • group passengers on the Titanic by 'Pclass' and aggregate the 'Age' and 'Fare' columns with the functions 'max' and 'median'
In [42]:
# Group titanic by 'Embarked' and 'Pclass'
# Count every column in each group: multi_group


multi_group = titanic.groupby(['Embarked','Pclass']).count()
multi_group
Out[42]:
                 PassengerId  Survived  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin
Embarked Pclass
C        1                85        85    85   85   74     85     85      85    85     66
         2                17        17    17   17   15     17     17      17    17      2
         3                66        66    66   66   41     66     66      66    66      1
Q        1                 2         2     2    2    2      2      2       2     2      2
         2                 3         3     3    3    2      3      3       3     3      1
         3                72        72    72   72   24     72     72      72    72      1
S        1               127       127   127  127  108    127    127     127   127    106
         2               164       164   164  164  156    164    164     164   164     13
         3               353       353   353  353  290    353    353     353   353     10
In [43]:
multi_survive = titanic.groupby(['Embarked','Pclass'])['Survived'].count()
multi_survive
Out[43]:
Embarked  Pclass
C         1          85
          2          17
          3          66
Q         1           2
          2           3
          3          72
S         1         127
          2         164
          3         353
Name: Survived, dtype: int64
In [65]:
# Group titanic by 'Pclass': by_class
by_class = titanic.groupby('Pclass')

# Select 'Age' and 'Fare'
by_class_sub = by_class[['Age','Fare']]

# Aggregate by_class_sub by 'max' and 'median': aggregated
aggregated = by_class_sub.agg(['max','median'])
aggregated
Out[65]:
         Age              Fare
         max median        max   median
Pclass
1       80.0   37.0   512.3292  60.2875
2       70.0   29.0    73.5000  14.2500
3       74.0   24.0    69.5500   8.0500
In [68]:
# Print the maximum age in each class
print(aggregated.loc[:, ('Age','max')])
Pclass
1    80.0
2    70.0
3    74.0
Name: (Age, max), dtype: float64
In [69]:
# Print the median fare in each class
print(aggregated.loc[:, ('Fare','median')])
Pclass
1    60.2875
2    14.2500
3     8.0500
Name: (Fare, median), dtype: float64

Grouping by another series

In [45]:
# Read life_fname into a DataFrame: life
life = pd.read_csv('https://s3.amazonaws.com/assets.datacamp.com/production/course_1650/datasets/life_expectancy.csv', 
                   index_col='Country')

# Read regions_fname into a DataFrame: regions
regions = pd.read_csv('https://s3.amazonaws.com/assets.datacamp.com/production/course_1650/datasets/regions.csv', 
                      index_col='Country')
print(life.shape)
print(regions.shape)
(204, 50)
(204, 1)
In [48]:
life.head(3)
Out[48]:
1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 ... 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
Country
Afghanistan 33.639 34.152 34.662 35.170 35.674 36.172 36.663 37.143 37.614 38.075 ... 56.583 57.071 57.582 58.102 58.618 59.124 59.612 60.079 60.524 60.947
Albania 65.475 65.863 66.122 66.316 66.500 66.702 66.948 67.251 67.595 67.966 ... 75.725 75.949 76.124 76.278 76.433 76.598 76.780 76.979 77.185 77.392
Algeria 47.953 48.389 48.806 49.205 49.592 49.976 50.366 50.767 51.195 51.670 ... 69.682 69.854 70.020 70.180 70.332 70.477 70.615 70.747 70.874 71.000

3 rows × 50 columns

In [49]:
regions.head()
Out[49]:
                                         region
Country
Afghanistan                          South Asia
Albania                   Europe & Central Asia
Algeria              Middle East & North Africa
Angola                       Sub-Saharan Africa
Antigua and Barbuda                     America

The DataFrames life and regions share the same index

In [53]:
# Group life by regions['region']: life_by_region
life_by_region = life.groupby(regions['region'])

# Print the mean over the '2010' column of life_by_region
life_by_region['2010'].mean()
Out[53]:
region
America                       74.037350
East Asia & Pacific           73.405750
Europe & Central Asia         75.656387
Middle East & North Africa    72.805333
South Asia                    68.189750
Sub-Saharan Africa            57.575080
Name: 2010, dtype: float64

Grouping on a function of the index

  • Groupby operations can also be performed on transformations of the index values. With a DatetimeIndex, we can extract portions of the datetime over which to group.
    • The sample sales data below covers February 2015, with the order timestamps as the index. The job is to group the sales by day of the week and sum the 'Units' column.
      • Is any day of the week more popular with customers? To find out, use .strftime('%a') to transform the index datetime values into abbreviated day-of-week names.
In [75]:
from pandas import Timestamp

df = pd.DataFrame({'Company': {Timestamp('2015-02-02 08:30:00'): 'Hooli',
  Timestamp('2015-02-02 21:00:00'): 'Mediacore',
  Timestamp('2015-02-03 14:00:00'): 'Initech',
  Timestamp('2015-02-04 15:30:00'): 'Streeplex',
  Timestamp('2015-02-04 22:00:00'): 'Acme Coporation',
  Timestamp('2015-02-05 02:00:00'): 'Acme Coporation',
  Timestamp('2015-02-05 22:00:00'): 'Hooli',
  Timestamp('2015-02-07 23:00:00'): 'Acme Coporation',
  Timestamp('2015-02-09 09:00:00'): 'Streeplex',
  Timestamp('2015-02-09 13:00:00'): 'Mediacore',
  Timestamp('2015-02-11 20:00:00'): 'Initech',
  Timestamp('2015-02-11 23:00:00'): 'Hooli',
  Timestamp('2015-02-16 12:00:00'): 'Hooli',
  Timestamp('2015-02-19 11:00:00'): 'Mediacore',
  Timestamp('2015-02-19 16:00:00'): 'Mediacore',
  Timestamp('2015-02-21 05:00:00'): 'Mediacore',
  Timestamp('2015-02-21 20:30:00'): 'Hooli',
  Timestamp('2015-02-25 00:30:00'): 'Initech',
  Timestamp('2015-02-26 09:00:00'): 'Streeplex'},
 'Product': {Timestamp('2015-02-02 08:30:00'): 'Software',
  Timestamp('2015-02-02 21:00:00'): 'Hardware',
  Timestamp('2015-02-03 14:00:00'): 'Software',
  Timestamp('2015-02-04 15:30:00'): 'Software',
  Timestamp('2015-02-04 22:00:00'): 'Hardware',
  Timestamp('2015-02-05 02:00:00'): 'Software',
  Timestamp('2015-02-05 22:00:00'): 'Service',
  Timestamp('2015-02-07 23:00:00'): 'Hardware',
  Timestamp('2015-02-09 09:00:00'): 'Service',
  Timestamp('2015-02-09 13:00:00'): 'Software',
  Timestamp('2015-02-11 20:00:00'): 'Software',
  Timestamp('2015-02-11 23:00:00'): 'Software',
  Timestamp('2015-02-16 12:00:00'): 'Software',
  Timestamp('2015-02-19 11:00:00'): 'Hardware',
  Timestamp('2015-02-19 16:00:00'): 'Service',
  Timestamp('2015-02-21 05:00:00'): 'Software',
  Timestamp('2015-02-21 20:30:00'): 'Hardware',
  Timestamp('2015-02-25 00:30:00'): 'Service',
  Timestamp('2015-02-26 09:00:00'): 'Service'},
 'Units': {Timestamp('2015-02-02 08:30:00'): 3,
  Timestamp('2015-02-02 21:00:00'): 9,
  Timestamp('2015-02-03 14:00:00'): 13,
  Timestamp('2015-02-04 15:30:00'): 13,
  Timestamp('2015-02-04 22:00:00'): 14,
  Timestamp('2015-02-05 02:00:00'): 19,
  Timestamp('2015-02-05 22:00:00'): 10,
  Timestamp('2015-02-07 23:00:00'): 1,
  Timestamp('2015-02-09 09:00:00'): 19,
  Timestamp('2015-02-09 13:00:00'): 7,
  Timestamp('2015-02-11 20:00:00'): 7,
  Timestamp('2015-02-11 23:00:00'): 4,
  Timestamp('2015-02-16 12:00:00'): 10,
  Timestamp('2015-02-19 11:00:00'): 16,
  Timestamp('2015-02-19 16:00:00'): 10,
  Timestamp('2015-02-21 05:00:00'): 3,
  Timestamp('2015-02-21 20:30:00'): 3,
  Timestamp('2015-02-25 00:30:00'): 10,
  Timestamp('2015-02-26 09:00:00'): 4}})
In [76]:
df
Out[76]:
                             Company   Product  Units
2015-02-02 08:30:00            Hooli  Software      3
2015-02-02 21:00:00        Mediacore  Hardware      9
2015-02-03 14:00:00          Initech  Software     13
2015-02-04 15:30:00        Streeplex  Software     13
2015-02-04 22:00:00  Acme Coporation  Hardware     14
2015-02-05 02:00:00  Acme Coporation  Software     19
2015-02-05 22:00:00            Hooli   Service     10
2015-02-07 23:00:00  Acme Coporation  Hardware      1
2015-02-09 09:00:00        Streeplex   Service     19
2015-02-09 13:00:00        Mediacore  Software      7
2015-02-11 20:00:00          Initech  Software      7
2015-02-11 23:00:00            Hooli  Software      4
2015-02-16 12:00:00            Hooli  Software     10
2015-02-19 11:00:00        Mediacore  Hardware     16
2015-02-19 16:00:00        Mediacore   Service     10
2015-02-21 05:00:00        Mediacore  Software      3
2015-02-21 20:30:00            Hooli  Hardware      3
2015-02-25 00:30:00          Initech   Service     10
2015-02-26 09:00:00        Streeplex   Service      4
In [80]:
# Create a groupby object: by_day
by_day = df.groupby(df.index.strftime('%a'))
by_day
Out[80]:
<pandas.core.groupby.DataFrameGroupBy object at 0x0000000009B116A0>
In [81]:
df.index.strftime('%a')
Out[81]:
array(['Mon', 'Mon', 'Tue', 'Wed', 'Wed', 'Thu', 'Thu', 'Sat', 'Mon',
       'Mon', 'Wed', 'Wed', 'Mon', 'Thu', 'Thu', 'Sat', 'Sat', 'Wed', 'Thu'], 
      dtype='|S3')
In [82]:
# Create sum: units_sum
units_sum = by_day['Units'].sum()
units_sum
Out[82]:
Mon    48
Sat     7
Thu    59
Tue    13
Wed    48
Name: Units, dtype: int64

Groupby and transformation

The z-score

  • scipy.stats.zscore is implemented at C level, so it is noticeably faster than a hand-written pandas version (timed below).

transformation

  • .transform() applies a function element-wise within each group.

use df.groupby(...).apply(func)

  • Similar to .transform(), but the return type is determined by what the function returns.
  • The .apply() method when used on a groupby object performs an arbitrary function on each of the groups. These functions can be aggregations, transformations or more complex workflows. The .apply() method will then combine the results in an intelligent way.
In [84]:
def Zscore(series):
     return (series - series.mean()) / series.std()
In [85]:
from scipy.stats import zscore 
In [90]:
%%timeit
Zscore(df['Units'])
10000 loops, best of 3: 130 µs per loop
In [91]:
%%timeit
zscore(df['Units'])
10000 loops, best of 3: 30.9 µs per loop

by weekdays

In [104]:
df.head()
Out[104]:
                             Company   Product  Units
2015-02-02 08:30:00            Hooli  Software      3
2015-02-02 21:00:00        Mediacore  Hardware      9
2015-02-03 14:00:00          Initech  Software     13
2015-02-04 15:30:00        Streeplex  Software     13
2015-02-04 22:00:00  Acme Coporation  Hardware     14
In [106]:
df.index.weekday
Out[106]:
array([0, 0, 1, 2, 2, 3, 3, 5, 0, 0, 2, 2, 0, 3, 3, 5, 5, 2, 3])
In [102]:
df.groupby(df.index.weekday).sum()
Out[102]:
   Units
0     48
1     13
2     48
3     59
5      7
In [110]:
df.groupby(df.index.weekday).transform(zscore)
Out[110]:
Units
2015-02-02 08:30:00 -1.250862
2015-02-02 21:00:00 -0.113715
2015-02-03 14:00:00 NaN
2015-02-04 15:30:00 0.913926
2015-02-04 22:00:00 1.182727
2015-02-05 02:00:00 1.376494
2015-02-05 22:00:00 -0.344124
2015-02-07 23:00:00 -1.414214
2015-02-09 09:00:00 1.781530
2015-02-09 13:00:00 -0.492764
2015-02-11 20:00:00 -0.698884
2015-02-11 23:00:00 -1.505289
2015-02-16 12:00:00 0.075810
2015-02-19 11:00:00 0.802955
2015-02-19 16:00:00 -0.344124
2015-02-21 05:00:00 0.707107
2015-02-21 20:30:00 0.707107
2015-02-25 00:30:00 0.107521
2015-02-26 09:00:00 -1.491202
In [111]:
df.groupby(df.index.weekday)['Units'].transform(zscore)
Out[111]:
2015-02-02 08:30:00   -1.250862
2015-02-02 21:00:00   -0.113715
2015-02-03 14:00:00         NaN
2015-02-04 15:30:00    0.913926
2015-02-04 22:00:00    1.182727
2015-02-05 02:00:00    1.376494
2015-02-05 22:00:00   -0.344124
2015-02-07 23:00:00   -1.414214
2015-02-09 09:00:00    1.781530
2015-02-09 13:00:00   -0.492764
2015-02-11 20:00:00   -0.698884
2015-02-11 23:00:00   -1.505289
2015-02-16 12:00:00    0.075810
2015-02-19 11:00:00    0.802955
2015-02-19 16:00:00   -0.344124
2015-02-21 05:00:00    0.707107
2015-02-21 20:30:00    0.707107
2015-02-25 00:30:00    0.107521
2015-02-26 09:00:00   -1.491202
Name: Units, dtype: float64
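The NaN for 2015-02-03 in both results above is expected: that timestamp is the only Tuesday in the data, so its group has no spread and the z-score works out to 0/0. A quick check (a sketch, not part of the original run):
In [ ]:
# Sketch: Tuesday (weekday == 1) contains a single row, so its group z-score is undefined.
df.loc[df.index.weekday == 1, 'Units']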
In [119]:
df.groupby(df.index.weekday).apply(sum)
Out[119]:
Company Product Units
0 HooliMediacoreStreeplexMediacoreHooli SoftwareHardwareServiceSoftwareSoftware 48
1 Initech Software 13
2 StreeplexAcme CoporationInitechHooliInitech SoftwareHardwareSoftwareSoftwareService 48
3 Acme CoporationHooliMediacoreMediacoreStreeplex SoftwareServiceHardwareServiceService 59
5 Acme CoporationMediacoreHooli HardwareSoftwareHardware 7
In [126]:
df.groupby(df.index.weekday)['Units'].apply(Zscore)
Out[126]:
2015-02-02 08:30:00   -1.118805
2015-02-02 21:00:00   -0.101710
2015-02-03 14:00:00         NaN
2015-02-04 15:30:00    0.817440
2015-02-04 22:00:00    1.057863
2015-02-05 02:00:00    1.231174
2015-02-05 22:00:00   -0.307794
2015-02-07 23:00:00   -1.154701
2015-02-09 09:00:00    1.593449
2015-02-09 13:00:00   -0.440741
2015-02-11 20:00:00   -0.625101
2015-02-11 23:00:00   -1.346372
2015-02-16 12:00:00    0.067806
2015-02-19 11:00:00    0.718185
2015-02-19 16:00:00   -0.307794
2015-02-21 05:00:00    0.577350
2015-02-21 20:30:00    0.577350
2015-02-25 00:30:00    0.096169
2015-02-26 09:00:00   -1.333772
Name: Units, dtype: float64
In [127]:
df.groupby(df.index.weekday)['Units'].apply(zscore)
Out[127]:
0    [-1.25086177191, -0.113714706537, 1.7815304024...
1                                                [nan]
2    [0.91392566197, 1.18272732726, -0.698884329742...
3    [1.37649440322, -0.344123600806, 0.80295506854...
5     [-1.41421356237, 0.707106781187, 0.707106781187]
Name: Units, dtype: object

Filling missing data (imputation) by group

  • Many statistical and machine learning packages cannot determine the best action to take when missing data entries are encountered.
    • Dealing with missing data is natural in pandas (both in using the default behavior and in defining a custom behavior).
      • sing the .dropna() method to drop missing values. Now, you will practice imputing missing values.
        • use .groupby() and .transform() to fill missing data appropriately for each group.
  • use .groupby() and .transform() to fill missing data appropriately for each group.
    • group by the 'sex' and 'pclass' columns and transform each group with a custom function to call .fillna() and impute the median value.
In [129]:
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
In [134]:
# Write a function that imputes median
def impute_median(series):
    return series.fillna(series.median())
In [138]:
# Create a groupby object: by_sex_class
by_sex_class = titanic.groupby(['Sex','Pclass'])
by_sex_class.count()
Out[138]:
               PassengerId  Survived  Name  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
Sex    Pclass
female 1                94        94    94   85     94     94      94    94     81        92
       2                76        76    76   74     76     76      76    76     10        76
       3               144       144   144  102    144    144     144   144      6       144
male   1               122       122   122  101    122    122     122   122     95       122
       2               108       108   108   99    108    108     108   108      6       108
       3               347       347   347  253    347    347     347   347      6       347
In [142]:
# Impute the median age within each group and assign to titanic['Age']
titanic['Age'] = by_sex_class['Age'].transform(impute_median)
titanic.head(4)
Out[142]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S

Groupby and filtering

In [143]:
auto = pd.read_csv('mpg.csv')
In [145]:
auto.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
mpg             398 non-null float64
cylinders       398 non-null int64
displacement    398 non-null float64
horsepower      398 non-null object
weight          398 non-null int64
acceleration    398 non-null float64
model_year      398 non-null int64
origin          398 non-null int64
name            398 non-null object
dtypes: float64(3), int64(4), object(2)
memory usage: 28.1+ KB
In [146]:
auto.head()
Out[146]:
    mpg  cylinders  displacement horsepower  weight  acceleration  model_year  origin                       name
0  18.0          8         307.0        130    3504          12.0          70       1  chevrolet chevelle malibu
1  15.0          8         350.0        165    3693          11.5          70       1          buick skylark 320
2  18.0          8         318.0        150    3436          11.0          70       1         plymouth satellite
3  16.0          8         304.0        150    3433          12.0          70       1              amc rebel sst
4  17.0          8         302.0        140    3449          10.5          70       1                ford torino

Mean MPG by year

In [148]:
auto.groupby('model_year')['mpg'].mean()
Out[148]:
model_year
70    17.689655
71    21.250000
72    18.714286
73    17.100000
74    22.703704
75    20.266667
76    21.573529
77    23.375000
78    24.061111
79    25.093103
80    33.696552
81    30.334483
82    31.709677
Name: mpg, dtype: float64

groupby object

  • A groupby object's .groups attribute is a dictionary mapping each group key to the row labels in that group.
In [155]:
splitting = auto.groupby('model_year')

print(type(splitting))

print(type(splitting.groups))
<class 'pandas.core.groupby.DataFrameGroupBy'>
<type 'dict'>
In [152]:
print(splitting.groups)
{70: [0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L], 71: [29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L, 51L, 52L, 53L, 54L, 55L, 56L], 72: [57L, 58L, 59L, 60L, 61L, 62L, 63L, 64L, 65L, 66L, 67L, 68L, 69L, 70L, 71L, 72L, 73L, 74L, 75L, 76L, 77L, 78L, 79L, 80L, 81L, 82L, 83L, 84L], 73: [85L, 86L, 87L, 88L, 89L, 90L, 91L, 92L, 93L, 94L, 95L, 96L, 97L, 98L, 99L, 100L, 101L, 102L, 103L, 104L, 105L, 106L, 107L, 108L, 109L, 110L, 111L, 112L, 113L, 114L, 115L, 116L, 117L, 118L, 119L, 120L, 121L, 122L, 123L, 124L], 74: [125L, 126L, 127L, 128L, 129L, 130L, 131L, 132L, 133L, 134L, 135L, 136L, 137L, 138L, 139L, 140L, 141L, 142L, 143L, 144L, 145L, 146L, 147L, 148L, 149L, 150L, 151L], 75: [152L, 153L, 154L, 155L, 156L, 157L, 158L, 159L, 160L, 161L, 162L, 163L, 164L, 165L, 166L, 167L, 168L, 169L, 170L, 171L, 172L, 173L, 174L, 175L, 176L, 177L, 178L, 179L, 180L, 181L], 76: [182L, 183L, 184L, 185L, 186L, 187L, 188L, 189L, 190L, 191L, 192L, 193L, 194L, 195L, 196L, 197L, 198L, 199L, 200L, 201L, 202L, 203L, 204L, 205L, 206L, 207L, 208L, 209L, 210L, 211L, 212L, 213L, 214L, 215L], 77: [216L, 217L, 218L, 219L, 220L, 221L, 222L, 223L, 224L, 225L, 226L, 227L, 228L, 229L, 230L, 231L, 232L, 233L, 234L, 235L, 236L, 237L, 238L, 239L, 240L, 241L, 242L, 243L], 78: [244L, 245L, 246L, 247L, 248L, 249L, 250L, 251L, 252L, 253L, 254L, 255L, 256L, 257L, 258L, 259L, 260L, 261L, 262L, 263L, 264L, 265L, 266L, 267L, 268L, 269L, 270L, 271L, 272L, 273L, 274L, 275L, 276L, 277L, 278L, 279L], 79: [280L, 281L, 282L, 283L, 284L, 285L, 286L, 287L, 288L, 289L, 290L, 291L, 292L, 293L, 294L, 295L, 296L, 297L, 298L, 299L, 300L, 301L, 302L, 303L, 304L, 305L, 306L, 307L, 308L], 80: [309L, 310L, 311L, 312L, 313L, 314L, 315L, 316L, 317L, 318L, 319L, 320L, 321L, 322L, 323L, 324L, 325L, 326L, 327L, 328L, 329L, 330L, 331L, 332L, 333L, 334L, 335L, 336L, 337L], 81: [338L, 339L, 340L, 341L, 342L, 343L, 344L, 345L, 346L, 347L, 348L, 349L, 350L, 351L, 352L, 353L, 354L, 355L, 356L, 357L, 358L, 359L, 360L, 361L, 362L, 363L, 364L, 365L, 366L], 82: [367L, 368L, 369L, 370L, 371L, 372L, 373L, 374L, 375L, 376L, 377L, 378L, 379L, 380L, 381L, 382L, 383L, 384L, 385L, 386L, 387L, 388L, 389L, 390L, 391L, 392L, 393L, 394L, 395L, 396L, 397L]}
In [154]:
print(splitting.groups.keys())
[70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82]

groupby object: iteration

In [157]:
for group_name, group in splitting:
    avg = group['mpg'].mean()
    print(group_name, avg)
(70, 17.689655172413794)
(71, 21.25)
(72, 18.714285714285715)
(73, 17.1)
(74, 22.703703703703702)
(75, 20.266666666666666)
(76, 21.573529411764707)
(77, 23.375)
(78, 24.061111111111114)
(79, 25.09310344827585)
(80, 33.696551724137926)
(81, 30.33448275862069)
(82, 31.70967741935484)

groupby object: iteration and filtering

In [158]:
for group_name, group in splitting:
    avg = group.loc[group['name'].str.contains('chevrolet'), 'mpg'].mean()
    print(group_name, avg)
(70, 15.666666666666666)
(71, 20.25)
(72, 15.333333333333334)
(73, 14.833333333333334)
(74, 18.666666666666668)
(75, 17.666666666666668)
(76, 23.25)
(77, 20.25)
(78, 23.233333333333334)
(79, 21.666666666666668)
(80, 30.05)
(81, 23.5)
(82, 29.0)

groupby object: comprehension

  • The same per-group means can be built with a dictionary comprehension
In [160]:
chevy_means = {year: group.loc[group['name'].str.contains('chevrolet'), 'mpg'].mean()
               for year, group in splitting}

chevy_means
Out[160]:
{70: 15.666666666666666,
 71: 20.25,
 72: 15.333333333333334,
 73: 14.833333333333334,
 74: 18.666666666666668,
 75: 17.666666666666668,
 76: 23.25,
 77: 20.25,
 78: 23.233333333333334,
 79: 21.666666666666668,
 80: 30.05,
 81: 23.5,
 82: 29.0}

Boolean groupby

In [175]:
chevy = auto['name'].str.contains('chevrolet')
print(type(chevy))
print(chevy.shape)
chevy.head()
<class 'pandas.core.series.Series'>
(398L,)
Out[175]:
0     True
1    False
2    False
3    False
4    False
Name: name, dtype: bool
  • chevy shares the same index as auto, so it can be passed straight to groupby()
In [176]:
auto.shape
Out[176]:
(398, 9)
In [174]:
auto.groupby(['model_year', chevy]).mean()
Out[174]:
mpg cylinders displacement weight acceleration origin
model_year name
70 False 17.923077 6.615385 269.230769 3315.076923 13.269231 1.346154
True 15.666667 8.000000 387.000000 3873.000000 10.166667 1.000000
71 False 21.416667 5.583333 208.041667 2985.916667 15.083333 1.500000
True 20.250000 5.500000 220.000000 3052.500000 15.500000 1.000000
72 False 19.120000 5.720000 212.700000 3195.040000 15.120000 1.600000
True 15.333333 6.666667 265.666667 3593.333333 15.166667 1.000000
73 False 17.500000 6.264706 246.617647 3339.735294 14.205882 1.441176
True 14.833333 7.000000 315.000000 3868.333333 14.916667 1.000000
74 False 23.208333 5.250000 166.541667 2835.208333 16.104167 1.750000
True 18.666667 5.333333 213.333333 3219.666667 17.000000 1.000000
75 False 20.555556 5.407407 196.444444 3117.925926 16.222222 1.518519
True 17.666667 7.333333 287.333333 3706.666667 14.500000 1.000000
76 False 21.350000 5.666667 199.566667 3097.000000 15.673333 1.533333
True 23.250000 5.500000 184.500000 2941.750000 17.950000 1.000000
77 False 23.895833 5.291667 181.500000 2929.583333 15.620833 1.666667
True 20.250000 6.500000 250.750000 3404.000000 14.325000 1.000000
78 False 24.136364 5.303030 175.696970 2857.272727 15.790909 1.666667
True 23.233333 6.000000 201.000000 2911.666667 15.966667 1.000000
79 False 25.488462 5.653846 201.884615 3021.730769 16.034615 1.307692
True 21.666667 7.333333 248.333333 3346.666667 13.900000 1.000000
80 False 33.966667 4.148148 115.185185 2439.444444 17.003704 2.296296
True 30.050000 4.000000 124.500000 2399.000000 16.000000 1.000000
81 False 30.578571 4.571429 133.964286 2515.714286 16.439286 2.000000
True 23.500000 6.000000 173.000000 2725.000000 12.600000 1.000000
82 False 32.111111 4.222222 129.925926 2424.814815 16.381481 1.740741
True 29.000000 4.000000 121.750000 2647.500000 18.375000 1.000000

Grouping and filtering with .apply()

  • By using .apply(), you can write functions that filter rows within groups. The .apply() method handles the iteration over the individual groups and then recombines the results into a Series or DataFrame.
    • Take the Titanic dataset and analyze survival rates for passengers on the 'C' deck, which held the most passengers:
      • group the dataset by 'Sex', then call .apply() with a user-defined function that computes the mean survival rate of C-deck passengers in each group.
In [178]:
titanic.head()
Out[178]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [183]:
def c_deck_survival(gr):

    c_passengers = gr['Cabin'].str.startswith('C').fillna(False)

    return gr.loc[c_passengers, 'Survived'].mean()
In [184]:
# Create a groupby object using titanic over the 'Sex' column: by_sex
by_sex = titanic.groupby('Sex')

# Apply the c_deck_survival function to each group: c_surv_by_sex
c_surv_by_sex = by_sex.apply(c_deck_survival)

c_surv_by_sex
Out[184]:
Sex
female    0.888889
male      0.343750
dtype: float64

The same result with plain pandas indexing

In [204]:
titanic.loc[titanic['Cabin'].str.startswith('C').fillna(False)].groupby('Sex')['Survived'].mean()
Out[204]:
Sex
female    0.888889
male      0.343750
Name: Survived, dtype: float64
In [196]:
titanic['Cabin'].str.startswith('C').head(10)
Out[196]:
0      NaN
1     True
2      NaN
3     True
4      NaN
5      NaN
6    False
7      NaN
8      NaN
9      NaN
Name: Cabin, dtype: object
In [197]:
titanic['Cabin'].str.startswith('C').fillna(False).head(10)
Out[197]:
0    False
1     True
2    False
3     True
4    False
5    False
6    False
7    False
8    False
9    False
Name: Cabin, dtype: bool

Grouping and filtering with .filter()

  • You can use groupby with the .filter() method to remove whole groups of rows from a DataFrame based on a boolean condition.
In [205]:
df
Out[205]:
                             Company   Product  Units
2015-02-02 08:30:00            Hooli  Software      3
2015-02-02 21:00:00        Mediacore  Hardware      9
2015-02-03 14:00:00          Initech  Software     13
2015-02-04 15:30:00        Streeplex  Software     13
2015-02-04 22:00:00  Acme Coporation  Hardware     14
2015-02-05 02:00:00  Acme Coporation  Software     19
2015-02-05 22:00:00            Hooli   Service     10
2015-02-07 23:00:00  Acme Coporation  Hardware      1
2015-02-09 09:00:00        Streeplex   Service     19
2015-02-09 13:00:00        Mediacore  Software      7
2015-02-11 20:00:00          Initech  Software      7
2015-02-11 23:00:00            Hooli  Software      4
2015-02-16 12:00:00            Hooli  Software     10
2015-02-19 11:00:00        Mediacore  Hardware     16
2015-02-19 16:00:00        Mediacore   Service     10
2015-02-21 05:00:00        Mediacore  Software      3
2015-02-21 20:30:00            Hooli  Hardware      3
2015-02-25 00:30:00          Initech   Service     10
2015-02-26 09:00:00        Streeplex   Service      4
In [209]:
# Group sales by 'Company': by_company
by_company = df.groupby('Company')

# Compute the sum of the 'Units' of by_company: by_com_sum
by_com_sum = by_company['Units'].sum()
by_com_sum
Out[209]:
Company
Acme Coporation    34
Hooli              30
Initech            30
Mediacore          45
Streeplex          36
Name: Units, dtype: int64
In [211]:
# Keep only the rows of companies whose total 'Units' is > 35: by_com_filt
by_com_filt = by_company.filter(lambda g:g['Units'].sum() > 35)
by_com_filt
Out[211]:
                       Company   Product  Units
2015-02-02 21:00:00  Mediacore  Hardware      9
2015-02-04 15:30:00  Streeplex  Software     13
2015-02-09 09:00:00  Streeplex   Service     19
2015-02-09 13:00:00  Mediacore  Software      7
2015-02-19 11:00:00  Mediacore  Hardware     16
2015-02-19 16:00:00  Mediacore   Service     10
2015-02-21 05:00:00  Mediacore  Software      3
2015-02-26 09:00:00  Streeplex   Service      4

Filtering and grouping with .map()

  • You have seen how to group by a column, or by multiple columns. Sometimes, you may instead want to group by a function or transformation of a column.
  • The key here is that the Series is indexed the same way as the DataFrame. You can also mix and match column grouping with Series grouping.
In [215]:
# Create the Boolean Series: under10
under10 = (titanic['Age'] < 10).map({True:'under 10', False:'over 10'})
print(type(under10))
under10.head(10)
<class 'pandas.core.series.Series'>
Out[215]:
0     over 10
1     over 10
2     over 10
3     over 10
4     over 10
5     over 10
6     over 10
7    under 10
8     over 10
9     over 10
Name: Age, dtype: object
  • under10 shares the same index as titanic, so it can be used as a grouping key
In [217]:
# Group by under10 and compute the survival rate
survived_mean_1 = titanic.groupby(under10)['Survived'].mean()
print(survived_mean_1)
Age
over 10     0.366707
under 10    0.612903
Name: Survived, dtype: float64
In [221]:
# Group by under10 and pclass and compute the survival rate
survived_mean_2 = titanic.groupby([under10, 'Pclass'])['Survived'].mean()
survived_mean_2
Out[221]:
Age       Pclass
over 10   1         0.629108
          2         0.419162
          3         0.222717
under 10  1         0.666667
          2         1.000000
          3         0.452381
Name: Survived, dtype: float64