pivoting and pivot table

In [1]:
import pandas as pd

sales = pd.DataFrame({'eggs': {0: 47, 1: 110, 2: 221, 3: 77, 4: 132, 5: 205},
 'month': {0: 1, 1: 2, 2: 1, 3: 2, 4: 1, 5: 2},
 'salt': {0: 12.0, 1: 50.0, 2: 89.0, 3: 87.0, 4: None, 5: 60.0},
 'spam': {0: 17, 1: 31, 2: 72, 3: 20, 4: 52, 5: 55},
 'state': {0: 'CA', 1: 'CA', 2: 'NY', 3: 'NY', 4: 'TX', 5: 'TX'}})
In [2]:
sales
Out[2]:
eggs month salt spam state
0 47 1 12.0 17 CA
1 110 2 50.0 31 CA
2 221 1 89.0 72 NY
3 77 2 87.0 20 NY
4 132 1 NaN 52 TX
5 205 2 60.0 55 TX
In [3]:
sales.pivot('month','state','spam')
Out[3]:
state CA NY TX
month
1 17 72 52
2 31 20 55
In [4]:
sales.pivot(index = 'month',
            columns = 'state',
            values = 'spam')
Out[4]:
state CA NY TX
month
1 17 72 52
2 31 20 55
In [5]:
sales.pivot(index = 'month',
            columns = 'state')
Out[5]:
eggs salt spam
state CA NY TX CA NY TX CA NY TX
month
1 47 221 132 12.0 89.0 NaN 17 72 52
2 110 77 205 50.0 87.0 60.0 31 20 55
In [6]:
sales.pivot('state','month')
Out[6]:
eggs salt spam
month 1 2 1 2 1 2
state
CA 47 110 12.0 50.0 17 31
NY 221 77 89.0 87.0 72 20
TX 132 205 NaN 60.0 52 55

Pivoting a single variable

  • ### df.pivot()
    • Suppose you started a blog for a band, and you would like to log how many visitors you have had, and how many signed-up for your newsletter.
In [51]:
users = pd.DataFrame({'city': {0: 'Austin', 1: 'Dallas', 2: 'Austin', 3: 'Dallas'},
 'signups': {0: 7, 1: 12, 2: 3, 3: 5},
 'visitors': {0: 139, 1: 237, 2: 326, 3: 456},
 'weekday': {0: 'Sun', 1: 'Sun', 2: 'Mon', 3: 'Mon'}})
In [52]:
users
Out[52]:
city signups visitors weekday
0 Austin 7 139 Sun
1 Dallas 12 237 Sun
2 Austin 3 326 Mon
3 Dallas 5 456 Mon
In [53]:
users.pivot('weekday','city','visitors')
Out[53]:
city Austin Dallas
weekday
Mon 326 456
Sun 139 237
  • Pivoting all variables
In [54]:
users.pivot(index='weekday', columns='city')
Out[54]:
signups visitors
city Austin Dallas Austin Dallas
weekday
Mon 3 5 326 456
Sun 7 12 139 237

Stacking & unstacking DataFrames

  • ### stacking and unstacking DataFrames
    • ### with a MultiIndex
In [55]:
users
Out[55]:
city signups visitors weekday
0 Austin 7 139 Sun
1 Dallas 12 237 Sun
2 Austin 3 326 Mon
3 Dallas 5 456 Mon
In [83]:
users2  = users.set_index(['city','weekday']).sort_index()
users2
Out[83]:
signups visitors
city weekday
Austin Mon 3 326
Sun 7 139
Dallas Mon 5 456
Sun 12 237
  • ### Pay attention to the index, and notice that the index levels are ['city', 'weekday']. So 'weekday' - the second entry - has position 1.
    • ### This position is what corresponds to the level parameter in .stack() and .unstack() calls.
      • ### Alternatively, you can specify 'weekday' as the level instead of its position.
In [87]:
users2.unstack(level = 'weekday')
Out[87]:
signups visitors
weekday Mon Sun Mon Sun
city
Austin 3 7 326 139
Dallas 5 12 456 237
In [88]:
users2.unstack()
Out[88]:
signups visitors
weekday Mon Sun Mon Sun
city
Austin 3 7 326 139
Dallas 5 12 456 237
In [90]:
users3 = users2.unstack(1)
users3
Out[90]:
signups visitors
weekday Mon Sun Mon Sun
city
Austin 3 7 326 139
Dallas 5 12 456 237

stack

In [93]:
users3
Out[93]:
signups visitors
weekday Mon Sun Mon Sun
city
Austin 3 7 326 139
Dallas 5 12 456 237
In [94]:
users3.index
Out[94]:
Index([u'Austin', u'Dallas'], dtype='object', name=u'city')
In [98]:
users3.stack()
Out[98]:
signups visitors
city weekday
Austin Mon 3 326
Sun 7 139
Dallas Mon 5 456
Sun 12 237

Stack 2

  • ### Your job in this exercise is to unstack and then stack the 'city' level, as you did previously for 'weekday'. Note that you won't get the same DataFrame.
In [100]:
users2
Out[100]:
signups visitors
city weekday
Austin Mon 3 326
Sun 7 139
Dallas Mon 5 456
Sun 12 237
In [103]:
users2.unstack(level = 0)
Out[103]:
signups visitors
city Austin Dallas Austin Dallas
weekday
Mon 3 5 326 456
Sun 7 12 139 237
In [104]:
users2.unstack(level = 'city')
Out[104]:
signups visitors
city Austin Dallas Austin Dallas
weekday
Mon 3 5 326 456
Sun 7 12 139 237
In [105]:
users2.unstack(level = 'city').stack()
Out[105]:
signups visitors
weekday city
Mon Austin 3 326
Dallas 5 456
Sun Austin 7 139
Dallas 12 237

__

In [56]:
users.unstack()
Out[56]:
city      0    Austin
          1    Dallas
          2    Austin
          3    Dallas
signups   0         7
          1        12
          2         3
          3         5
visitors  0       139
          1       237
          2       326
          3       456
weekday   0       Sun
          1       Sun
          2       Mon
          3       Mon
dtype: object
In [57]:
pd.melt(users)
Out[57]:
variable value
0 city Austin
1 city Dallas
2 city Austin
3 city Dallas
4 signups 7
5 signups 12
6 signups 3
7 signups 5
8 visitors 139
9 visitors 237
10 visitors 326
11 visitors 456
12 weekday Sun
13 weekday Sun
14 weekday Mon
15 weekday Mon

Creating a multi-level index

In [58]:
users2 = users.set_index(['city','weekday'])
users2
Out[58]:
signups visitors
city weekday
Austin Sun 7 139
Dallas Sun 12 237
Austin Mon 3 326
Dallas Mon 5 456
In [59]:
users2.unstack('weekday')
Out[59]:
signups visitors
weekday Mon Sun Mon Sun
city
Austin 3 7 326 139
Dallas 5 12 456 237
In [60]:
users2.unstack('city')
Out[60]:
signups visitors
city Austin Dallas Austin Dallas
weekday
Mon 3 5 326 456
Sun 7 12 139 237
In [61]:
users2
Out[61]:
signups visitors
city weekday
Austin Sun 7 139
Dallas Sun 12 237
Austin Mon 3 326
Dallas Mon 5 456

Unstacking a multi-index

In [62]:
users2.unstack(1)
Out[62]:
signups visitors
weekday Mon Sun Mon Sun
city
Austin 3 7 326 139
Dallas 5 12 456 237
In [63]:
users2.unstack(level = 0)
Out[63]:
signups visitors
city Austin Dallas Austin Dallas
weekday
Mon 3 5 326 456
Sun 7 12 139 237

Stacking DataFrames

In [64]:
users2
Out[64]:
signups visitors
city weekday
Austin Sun 7 139
Dallas Sun 12 237
Austin Mon 3 326
Dallas Mon 5 456
In [65]:
users2.stack()
Out[65]:
city    weekday          
Austin  Sun      signups       7
                 visitors    139
Dallas  Sun      signups      12
                 visitors    237
Austin  Mon      signups       3
                 visitors    326
Dallas  Mon      signups       5
                 visitors    456
dtype: int64

Restoring the index order

Swapping levels

  • ### use .swaplevel(0, 1) to flip the index levels.
In [66]:
users2
Out[66]:
signups visitors
city weekday
Austin Sun 7 139
Dallas Sun 12 237
Austin Mon 3 326
Dallas Mon 5 456
In [67]:
users2.swaplevel()
Out[67]:
signups visitors
weekday city
Sun Austin 7 139
Dallas 12 237
Mon Austin 3 326
Dallas 5 456
In [68]:
users2.swaplevel(0,1)
Out[68]:
signups visitors
weekday city
Sun Austin 7 139
Dallas 12 237
Mon Austin 3 326
Dallas 5 456
In [69]:
users2.swaplevel().sort_index()
Out[69]:
signups visitors
weekday city
Mon Austin 3 326
Dallas 5 456
Sun Austin 7 139
Dallas 12 237

Tidy Data

Melting DataFrames

the goal of melting is to restore a pivoted DataFrame to its original form, or to change it from a wide shape to a long shape.

In [70]:
users
Out[70]:
city signups visitors weekday
0 Austin 7 139 Sun
1 Dallas 12 237 Sun
2 Austin 3 326 Mon
3 Dallas 5 456 Mon
  • ### You can explicitly specify the columns that should remain in the reshaped DataFrame with id_vars, and list which columns to convert into values with value_vars.
    • ### if you don't pass a name to the values in pd.melt(), you will lose the name of your variable. You can fix this by using the value_name keyword argument.
In [71]:
grouped = users.pivot('city','weekday','visitors')
grouped
Out[71]:
weekday Mon Sun
city
Austin 326 139
Dallas 456 237

restore grouped DataFrame

  • ### pd.melt()
In [72]:
grouped
Out[72]:
weekday Mon Sun
city
Austin 326 139
Dallas 456 237

reset index

In [74]:
test = grouped.copy()
test
Out[74]:
weekday Mon Sun
city
Austin 326 139
Dallas 456 237
In [77]:
test.index.name = 'cities'
test
Out[77]:
weekday Mon Sun
cities
Austin 326 139
Dallas 456 237
In [78]:
test.reset_index()
Out[78]:
weekday cities Mon Sun
0 Austin 326 139
1 Dallas 456 237

manually reset index

In [39]:
grouped['city']=grouped.index
grouped
Out[39]:
weekday Mon Sun city
city
Austin 326 139 Austin
Dallas 456 237 Dallas
In [50]:
grouped
Out[50]:
weekday Mon Sun city
city
Austin 326 139 Austin
Dallas 456 237 Dallas
In [41]:
pd.melt(grouped)
Out[41]:
weekday value
0 Mon 326
1 Mon 456
2 Sun 139
3 Sun 237
4 city Austin
5 city Dallas

Specifying id_vars

In [42]:
pd.melt(grouped,'city')
Out[42]:
city weekday value
0 Austin Mon 326
1 Dallas Mon 456
2 Austin Sun 139
3 Dallas Sun 237
In [43]:
pd.melt(grouped,'Sun')
Out[43]:
Sun weekday value
0 139 Mon 326
1 237 Mon 456
2 139 city Austin
3 237 city Dallas
In [44]:
pd.melt(grouped,id_vars='Mon')
Out[44]:
Mon weekday value
0 326 Sun 139
1 456 Sun 237
2 326 city Austin
3 456 city Dallas

Specifying value_name

In [47]:
pd.melt(grouped, id_vars=['city'],
 var_name='whichdayofweek', value_name='visitors')
Out[47]:
city whichdayofweek visitors
0 Austin Mon 326
1 Dallas Mon 456
2 Austin Sun 139
3 Dallas Sun 237

Obtaining key-value pairs with melt()

  • ### Sometimes, all you need is some key-value pairs, and the context does not matter. If said context is in the index, you can easily obtain what you want.
    • ### For example, in the users DataFrame, the visitors and signups columns lend themselves well to being represented as key-value pairs.
      • #### So if you created a hierarchical index with 'city' and 'weekday' columns as the index, you can easily extract key-value pairs for the 'visitors' and 'signups' columns by melting users and specifying col_level=0.
In [106]:
users
Out[106]:
city signups visitors weekday
0 Austin 7 139 Sun
1 Dallas 12 237 Sun
2 Austin 3 326 Mon
3 Dallas 5 456 Mon
In [107]:
# Set the new index: users_idx
users_idx = users.set_index(['city', 'weekday'])
users_idx
Out[107]:
signups visitors
city weekday
Austin Sun 7 139
Dallas Sun 12 237
Austin Mon 3 326
Dallas Mon 5 456
In [111]:
pd.melt(users_idx, col_level=0)
Out[111]:
variable value
0 signups 7
1 signups 12
2 signups 3
3 signups 5
4 visitors 139
5 visitors 237
6 visitors 326
7 visitors 456
In [112]:
users
Out[112]:
city signups visitors weekday
0 Austin 7 139 Sun
1 Dallas 12 237 Sun
2 Austin 3 326 Mon
3 Dallas 5 456 Mon
In [109]:
pd.melt(users)
Out[109]:
variable value
0 city Austin
1 city Dallas
2 city Austin
3 city Dallas
4 signups 7
5 signups 12
6 signups 3
7 signups 5
8 visitors 139
9 visitors 237
10 visitors 326
11 visitors 456
12 weekday Sun
13 weekday Sun
14 weekday Mon
15 weekday Mon

Pivot tables

  • ### df.pivot() requires Unique index-column pairs to identify values in the new table
    • ## df.pivot_table()
      • ### reshape df by summarizing
        • ### can use aggregation func
          • #### default aggregation is average
            • ### keyword, aggfunc e.g. df.pivot_table(..., aggfunc = 'count')
In [113]:
users
Out[113]:
city signups visitors weekday
0 Austin 7 139 Sun
1 Dallas 12 237 Sun
2 Austin 3 326 Mon
3 Dallas 5 456 Mon
In [114]:
users.pivot_table(index='weekday', columns='city')
Out[114]:
signups visitors
city Austin Dallas Austin Dallas
weekday
Mon 3 5 326 456
Sun 7 12 139 237
In [115]:
users.pivot('weekday','city')
Out[115]:
signups visitors
city Austin Dallas Austin Dallas
weekday
Mon 3 5 326 456
Sun 7 12 139 237

specify aggregation func

In [118]:
users.pivot_table(index='weekday', columns='city',aggfunc='count')
Out[118]:
signups visitors
city Austin Dallas Austin Dallas
weekday
Mon 1 1 1 1
Sun 1 1 1 1

can only pass index, values will be averaged

  • #### You can also use aggregation functions with in a pivot table by specifying the aggfunc parameter. In this exercise, you will practice using the 'count' and len aggregation functions - which produce the same result - on the users DataFrame.
In [120]:
users.pivot_table(index='weekday')
Out[120]:
signups visitors
weekday
Mon 4.0 391.0
Sun 9.5 188.0
In [122]:
users.pivot_table(index='weekday',aggfunc='count')
Out[122]:
city signups visitors
weekday
Mon 2 2 2
Sun 2 2 2
In [124]:
users.pivot_table(index='weekday',aggfunc=len)
Out[124]:
city signups visitors
weekday
Mon 2 2 2
Sun 2 2 2

Using margins in pivot tables

  • ### Sometimes it's useful to add totals in the margins of a pivot table. You can do this with the argument margins=True
    • ### using margins in a pivot table along with a new aggregation function: sum.
In [125]:
users
Out[125]:
city signups visitors weekday
0 Austin 7 139 Sun
1 Dallas 12 237 Sun
2 Austin 3 326 Mon
3 Dallas 5 456 Mon
In [126]:
# Create the DataFrame with the appropriate pivot table: signups_and_visitors
users.pivot_table(index='weekday', aggfunc=sum)
Out[126]:
signups visitors
weekday
Mon 8 782
Sun 19 376
In [127]:
# Add in the margins: signups_and_visitors_total 
users.pivot_table(index='weekday', aggfunc=sum, margins=True)
Out[127]:
signups visitors
weekday
Mon 8.0 782.0
Sun 19.0 376.0
All 27.0 1158.0
In [ ]: