Index objects and labeled data

pandas Data Structures

  • Key building blocks
    • Indexes: Sequence of labels
    • Series: 1D array with Index
    • Series: 1D array with Index
  • Indexes
    • Immutable (Like dictionary keys)
    • Homogenous in data type (Like NumPy arrays)

Creating a Series

In [2]:
import pandas as pd
In [3]:
prices=[10.70, 10.86, 10.74, 10.71, 10.79]
sales = pd.Series(prices)
sales
Out[3]:
0    10.70
1    10.86
2    10.74
3    10.71
4    10.79
dtype: float64

Creating an index

In [4]:
days = ['Mon', 'Tue', 'Wed', 'Thur', 'Fri']
sales = pd.Series(prices, index=days)
sales
Out[4]:
Mon     10.70
Tue     10.86
Wed     10.74
Thur    10.71
Fri     10.79
dtype: float64
In [5]:
print sales.index
print sales.index[3]
print sales.index[:2]
print sales.index[-3:]

print sales.index.name
Index([u'Mon', u'Tue', u'Wed', u'Thur', u'Fri'], dtype='object')
Thur
Index([u'Mon', u'Tue'], dtype='object')
Index([u'Wed', u'Thur', u'Fri'], dtype='object')
None
In [6]:
# Create the list of new indexes: new_idx
new_idx = [month.upper() for month in sales.index]

# Assign new_idx to sales.index
sales.index = new_idx

# Print the sales DataFrame
print(sales)
MON     10.70
TUE     10.86
WED     10.74
THUR    10.71
FRI     10.79
dtype: float64

Modifying index name

In [7]:
shares.index.name = 'weekday'
shares
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-7-dcafc145cb84> in <module>()
----> 1 shares.index.name = 'weekday'
      2 shares

NameError: name 'shares' is not defined

Modifying index entries

  • immutable
In [8]:
shares.index[2] = 'Wednesday'
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-8-6e55d7d08a13> in <module>()
----> 1 shares.index[2] = 'Wednesday'

NameError: name 'shares' is not defined

Modifying all index entries

  • will elimintate index.name as well
In [9]:
shares.index = ['Monday', 'Tuesday', 'Wednesday','Thursday', 'Friday']
shares
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-9-1fae432ccc25> in <module>()
----> 1 shares.index = ['Monday', 'Tuesday', 'Wednesday','Thursday', 'Friday']
      2 shares

NameError: name 'shares' is not defined

RangeIndex

In [10]:
census=pd.read_csv('census.csv')
census.info(False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3193 entries, 0 to 3192
Columns: 100 entries, SUMLEV to RNETMIG2015
dtypes: float64(30), int64(68), object(2)
memory usage: 2.4+ MB
In [11]:
census.columns
Out[11]:
Index([u'SUMLEV', u'REGION', u'DIVISION', u'STATE', u'COUNTY', u'STNAME',
       u'CTYNAME', u'CENSUS2010POP', u'ESTIMATESBASE2010', u'POPESTIMATE2010',
       u'POPESTIMATE2011', u'POPESTIMATE2012', u'POPESTIMATE2013',
       u'POPESTIMATE2014', u'POPESTIMATE2015', u'NPOPCHG_2010',
       u'NPOPCHG_2011', u'NPOPCHG_2012', u'NPOPCHG_2013', u'NPOPCHG_2014',
       u'NPOPCHG_2015', u'BIRTHS2010', u'BIRTHS2011', u'BIRTHS2012',
       u'BIRTHS2013', u'BIRTHS2014', u'BIRTHS2015', u'DEATHS2010',
       u'DEATHS2011', u'DEATHS2012', u'DEATHS2013', u'DEATHS2014',
       u'DEATHS2015', u'NATURALINC2010', u'NATURALINC2011', u'NATURALINC2012',
       u'NATURALINC2013', u'NATURALINC2014', u'NATURALINC2015',
       u'INTERNATIONALMIG2010', u'INTERNATIONALMIG2011',
       u'INTERNATIONALMIG2012', u'INTERNATIONALMIG2013',
       u'INTERNATIONALMIG2014', u'INTERNATIONALMIG2015', u'DOMESTICMIG2010',
       u'DOMESTICMIG2011', u'DOMESTICMIG2012', u'DOMESTICMIG2013',
       u'DOMESTICMIG2014', u'DOMESTICMIG2015', u'NETMIG2010', u'NETMIG2011',
       u'NETMIG2012', u'NETMIG2013', u'NETMIG2014', u'NETMIG2015',
       u'RESIDUAL2010', u'RESIDUAL2011', u'RESIDUAL2012', u'RESIDUAL2013',
       u'RESIDUAL2014', u'RESIDUAL2015', u'GQESTIMATESBASE2010',
       u'GQESTIMATES2010', u'GQESTIMATES2011', u'GQESTIMATES2012',
       u'GQESTIMATES2013', u'GQESTIMATES2014', u'GQESTIMATES2015',
       u'RBIRTH2011', u'RBIRTH2012', u'RBIRTH2013', u'RBIRTH2014',
       u'RBIRTH2015', u'RDEATH2011', u'RDEATH2012', u'RDEATH2013',
       u'RDEATH2014', u'RDEATH2015', u'RNATURALINC2011', u'RNATURALINC2012',
       u'RNATURALINC2013', u'RNATURALINC2014', u'RNATURALINC2015',
       u'RINTERNATIONALMIG2011', u'RINTERNATIONALMIG2012',
       u'RINTERNATIONALMIG2013', u'RINTERNATIONALMIG2014',
       u'RINTERNATIONALMIG2015', u'RDOMESTICMIG2011', u'RDOMESTICMIG2012',
       u'RDOMESTICMIG2013', u'RDOMESTICMIG2014', u'RDOMESTICMIG2015',
       u'RNETMIG2011', u'RNETMIG2012', u'RNETMIG2013', u'RNETMIG2014',
       u'RNETMIG2015'],
      dtype='object')
In [12]:
census.index=census['CENSUS2010POP']
census.index
Out[12]:
Int64Index([4779736,   54571,  182265,   27457,   22915,   57322,   10914,
              20947,  118572,   34215,
            ...
               2484,   28205,    8667,   29116,   10247,   43806,   21294,
              21118,    8533,    7208],
           dtype='int64', name=u'CENSUS2010POP', length=3193)

delete unwanted column

In [13]:
del census['CENSUS2010POP']
census.info(False)

print type(census.index)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3193 entries, 4779736 to 7208
Columns: 99 entries, SUMLEV to RNETMIG2015
dtypes: float64(30), int64(67), object(2)
memory usage: 2.4+ MB
<class 'pandas.indexes.numeric.Int64Index'>
In [14]:
census.head(3)
Out[14]:
SUMLEV REGION DIVISION STATE COUNTY STNAME CTYNAME ESTIMATESBASE2010 POPESTIMATE2010 POPESTIMATE2011 ... RDOMESTICMIG2011 RDOMESTICMIG2012 RDOMESTICMIG2013 RDOMESTICMIG2014 RDOMESTICMIG2015 RNETMIG2011 RNETMIG2012 RNETMIG2013 RNETMIG2014 RNETMIG2015
CENSUS2010POP
4779736 40 3 6 1 0 Alabama Alabama 4780127 4785161 4801108 ... 0.002295 -0.193196 0.381066 0.582002 -0.467369 1.030015 0.826644 1.383282 1.724718 0.712594
54571 50 3 6 1 1 Alabama Autauga County 54571 54660 55253 ... 7.242091 -2.915927 -3.012349 2.265971 -2.530799 7.606016 -2.626146 -2.722002 2.592270 -2.187333
182265 50 3 6 1 3 Alabama Baldwin County 182265 183193 186659 ... 14.832960 17.647293 21.845705 19.243287 17.197872 15.844176 18.559627 22.727626 20.317142 18.293499

3 rows × 99 columns

assign index at the beginning

In [15]:
df = pd.read_csv('census.csv', index_col=['STNAME','CTYNAME','COUNTY'] )
df.head(7)
Out[15]:
SUMLEV REGION DIVISION STATE CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010 POPESTIMATE2011 POPESTIMATE2012 POPESTIMATE2013 ... RDOMESTICMIG2011 RDOMESTICMIG2012 RDOMESTICMIG2013 RDOMESTICMIG2014 RDOMESTICMIG2015 RNETMIG2011 RNETMIG2012 RNETMIG2013 RNETMIG2014 RNETMIG2015
STNAME CTYNAME COUNTY
Alabama Alabama 0 40 3 6 1 4779736 4780127 4785161 4801108 4816089 4830533 ... 0.002295 -0.193196 0.381066 0.582002 -0.467369 1.030015 0.826644 1.383282 1.724718 0.712594
Autauga County 1 50 3 6 1 54571 54571 54660 55253 55175 55038 ... 7.242091 -2.915927 -3.012349 2.265971 -2.530799 7.606016 -2.626146 -2.722002 2.592270 -2.187333
Baldwin County 3 50 3 6 1 182265 182265 183193 186659 190396 195126 ... 14.832960 17.647293 21.845705 19.243287 17.197872 15.844176 18.559627 22.727626 20.317142 18.293499
Barbour County 5 50 3 6 1 27457 27457 27341 27226 27159 26973 ... -4.728132 -2.500690 -7.056824 -3.904217 -10.543299 -4.874741 -2.758113 -7.167664 -3.978583 -10.543299
Bibb County 7 50 3 6 1 22915 22919 22861 22733 22642 22512 ... -5.527043 -5.068871 -6.201001 -0.177537 0.177258 -5.088389 -4.363636 -5.403729 0.754533 1.107861
Blount County 9 50 3 6 1 57322 57322 57373 57711 57776 57734 ... 1.807375 -1.177622 -1.748766 -2.062535 -1.369970 1.859511 -0.848580 -1.402476 -1.577232 -0.884411
Bullock County 11 50 3 6 1 10914 10915 10887 10629 10606 10628 ... -30.953709 -5.180127 -1.130263 14.354290 -16.167247 -29.001673 -2.825524 1.507017 17.243790 -13.193961

7 rows × 97 columns

Hierarchical Indexing

In [16]:
from yahoo_finance import Share
In [17]:
tsla=Share('TSLA')
price = tsla.get_historical('2016-1-1','2016-12-31')
In [18]:
df = pd.DataFrame(price)
print df.shape
df.head()
(252, 8)
Out[18]:
Adj_Close Close Date High Low Open Symbol Volume
0 213.690002 213.690002 2016-12-30 217.50 211.679993 216.300003 TSLA 4642600
1 214.679993 214.679993 2016-12-29 219.199997 214.119995 218.559998 TSLA 4045000
2 219.740005 219.740005 2016-12-28 223.800003 217.199997 221.529999 TSLA 3782500
3 219.529999 219.529999 2016-12-27 222.25 214.419998 214.880005 TSLA 5915700
4 213.339996 213.339996 2016-12-23 213.449997 207.710007 208.00 TSLA 4670500
In [19]:
df.index=df['Date']
del df['Date']
df = df.sort_index()
print df.shape
df.head()
(252, 7)
Out[19]:
Adj_Close Close High Low Open Symbol Volume
Date
2016-01-04 223.410004 223.410004 231.380005 219.00 230.720001 TSLA 6827100
2016-01-05 223.429993 223.429993 226.889999 220.00 226.360001 TSLA 3186800
2016-01-06 219.039993 219.039993 220.050003 215.979996 220.00 TSLA 3779100
2016-01-07 215.649994 215.649994 218.440002 213.669998 214.190002 TSLA 3554300
2016-01-08 211.00 211.00 220.440002 210.770004 217.860001 TSLA 3628100

reset index will insert index into column

In [23]:
df.reset_index().head(7)
Out[23]:
Date Adj_Close Close High Low Open Symbol Volume
0 2016-01-04 223.410004 223.410004 231.380005 219.00 230.720001 TSLA 6827100
1 2016-01-05 223.429993 223.429993 226.889999 220.00 226.360001 TSLA 3186800
2 2016-01-06 219.039993 219.039993 220.050003 215.979996 220.00 TSLA 3779100
3 2016-01-07 215.649994 215.649994 218.440002 213.669998 214.190002 TSLA 3554300
4 2016-01-08 211.00 211.00 220.440002 210.770004 217.860001 TSLA 3628100
5 2016-01-11 207.850006 207.850006 214.449997 203.00 214.009995 TSLA 4091400
6 2016-01-12 209.970001 209.970001 213.740005 205.309998 211.600006 TSLA 3091900

MultiIndex on df

Hierarchical Indexing

Setting index

  • use list
In [124]:
df['date']=df.index
df = df.set_index(['Symbol','date']).head(7)
print df.shape
df.head(3)
(7, 6)
Out[124]:
Adj_Close Close High Low Open Volume
Symbol date
TSLA 2016-01-04 223.410004 223.410004 231.380005 219.00 230.720001 6827100
2016-01-05 223.429993 223.429993 226.889999 220.00 226.360001 3186800
2016-01-06 219.039993 219.039993 220.050003 215.979996 220.00 3779100

df.index.names

In [125]:
df
Out[125]:
Adj_Close Close High Low Open Volume
Symbol date
TSLA 2016-01-04 223.410004 223.410004 231.380005 219.00 230.720001 6827100
2016-01-05 223.429993 223.429993 226.889999 220.00 226.360001 3186800
2016-01-06 219.039993 219.039993 220.050003 215.979996 220.00 3779100
2016-01-07 215.649994 215.649994 218.440002 213.669998 214.190002 3554300
2016-01-08 211.00 211.00 220.440002 210.770004 217.860001 3628100
2016-01-11 207.850006 207.850006 214.449997 203.00 214.009995 4091400
2016-01-12 209.970001 209.970001 213.740005 205.309998 211.600006 3091900
In [126]:
print df.index.names
[u'Symbol', u'date']

indexing

In [127]:
df
Out[127]:
Adj_Close Close High Low Open Volume
Symbol date
TSLA 2016-01-04 223.410004 223.410004 231.380005 219.00 230.720001 6827100
2016-01-05 223.429993 223.429993 226.889999 220.00 226.360001 3186800
2016-01-06 219.039993 219.039993 220.050003 215.979996 220.00 3779100
2016-01-07 215.649994 215.649994 218.440002 213.669998 214.190002 3554300
2016-01-08 211.00 211.00 220.440002 210.770004 217.860001 3628100
2016-01-11 207.850006 207.850006 214.449997 203.00 214.009995 4091400
2016-01-12 209.970001 209.970001 213.740005 205.309998 211.600006 3091900

Slicing (outermost index)

In [128]:
df.loc['TSLA']
Out[128]:
Adj_Close Close High Low Open Volume
date
2016-01-04 223.410004 223.410004 231.380005 219.00 230.720001 6827100
2016-01-05 223.429993 223.429993 226.889999 220.00 226.360001 3186800
2016-01-06 219.039993 219.039993 220.050003 215.979996 220.00 3779100
2016-01-07 215.649994 215.649994 218.440002 213.669998 214.190002 3554300
2016-01-08 211.00 211.00 220.440002 210.770004 217.860001 3628100
2016-01-11 207.850006 207.850006 214.449997 203.00 214.009995 4091400
2016-01-12 209.970001 209.970001 213.740005 205.309998 211.600006 3091900
In [129]:
df.loc[('TSLA','2016-01-07')]
Out[129]:
Adj_Close    215.649994
Close        215.649994
High         218.440002
Low          213.669998
Open         214.190002
Volume          3554300
Name: (TSLA, 2016-01-07), dtype: object
In [130]:
df.loc[('TSLA','2016-01-07'),'Open']
Out[130]:
'214.190002'

fancy indexing

  • Fancy indexing (innermost index)
In [137]:
df.loc[('TSLA',['2016-01-05','2016-01-11']),['Close','Open']]
Out[137]:
Close Open
Symbol date
TSLA 2016-01-05 223.429993 226.360001
2016-01-11 207.850006 214.009995

? Slice (both indexes)

  • ### slice(start, stop[, step])
In [142]:
df.loc[(slice(None), slice('2016-01-04', '2016-01-12')),:]
Out[142]:
Adj_Close Close High Low Open Volume
Symbol date
TSLA 2016-01-04 223.410004 223.410004 231.380005 219.00 230.720001 6827100
2016-01-05 223.429993 223.429993 226.889999 220.00 226.360001 3186800
2016-01-06 219.039993 219.039993 220.050003 215.979996 220.00 3779100
2016-01-07 215.649994 215.649994 218.440002 213.669998 214.190002 3554300
2016-01-08 211.00 211.00 220.440002 210.770004 217.860001 3628100
2016-01-11 207.850006 207.850006 214.449997 203.00 214.009995 4091400
2016-01-12 209.970001 209.970001 213.740005 205.309998 211.600006 3091900

examples and practice

In [155]:
sales = pd.DataFrame({'eggs': {('CA', 1): 47,
  ('CA', 2): 110,
  ('NY', 1): 221,
  ('NY', 2): 77,
  ('TX', 1): 132,
  ('TX', 2): 205},
 'salt': {('CA', 1): 12.0,
  ('CA', 2): 50.0,
  ('NY', 1): 89.0,
  ('NY', 2): 87.0,
  ('TX', 1): None,
  ('TX', 2): 60.0},
 'spam': {('CA', 1): 17,
  ('CA', 2): 31,
  ('NY', 1): 72,
  ('NY', 2): 20,
  ('TX', 1): 52,
  ('TX', 2): 55}})

Extracting data with a MultiIndex

In [156]:
sales
Out[156]:
eggs salt spam
CA 1 47 12.0 17
2 110 50.0 31
NY 1 221 89.0 72
2 77 87.0 20
TX 1 132 NaN 52
2 205 60.0 55
In [157]:
sales.index
Out[157]:
MultiIndex(levels=[[u'CA', u'NY', u'TX'], [1, 2]],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
In [158]:
sales.index.names
Out[158]:
FrozenList([None, None])
In [162]:
# Print sales.loc[['CA', 'TX']]
sales.loc[['CA', 'TX']]
Out[162]:
eggs salt spam
CA 1 47 12.0 17
2 110 50.0 31
TX 1 132 NaN 52
2 205 60.0 55
In [163]:
# Print sales['CA':'TX']
sales.loc['CA':'TX']
Out[163]:
eggs salt spam
CA 1 47 12.0 17
2 110 50.0 31
NY 1 221 89.0 72
2 77 87.0 20
TX 1 132 NaN 52
2 205 60.0 55

Setting & sorting a MultiIndex

In [182]:
sales0 = pd.DataFrame({'eggs': {0: 47, 1: 110, 2: 221, 3: 77, 4: 132, 5: 205},
 'month': {0: 1, 1: 2, 2: 1, 3: 2, 4: 1, 5: 2},
 'salt': {0: 12.0, 1: 50.0, 2: 89.0, 3: 87.0, 4: None, 5: 60.0},
 'spam': {0: 17, 1: 31, 2: 72, 3: 20, 4: 52, 5: 55},
 'state': {0: 'CA', 1: 'CA', 2: 'NY', 3: 'NY', 4: 'TX', 5: 'TX'}})
In [183]:
sales2 = sales0.copy()
sales2
Out[183]:
eggs month salt spam state
0 47 1 12.0 17 CA
1 110 2 50.0 31 CA
2 221 1 89.0 72 NY
3 77 2 87.0 20 NY
4 132 1 NaN 52 TX
5 205 2 60.0 55 TX
In [180]:
# Set the index to be the columns ['state', 'month']: sales
sales2 = sales2.set_index(['state', 'month'])
# Sort the MultiIndex: sales
sales2 = sales2.sort_index()

# Print the sales DataFrame
sales2
Out[180]:
eggs salt spam
state month
CA 1 47 12.0 17
2 110 50.0 31
NY 1 221 89.0 72
2 77 87.0 20
TX 1 132 NaN 52
2 205 60.0 55

Using .loc[] with nonunique indexes

In [185]:
sales3 = sales0.copy()
# Set the index to the column 'state': sales
sales3 = sales3.set_index(['state'])
sales3
Out[185]:
eggs month salt spam
state
CA 47 1 12.0 17
CA 110 2 50.0 31
NY 221 1 89.0 72
NY 77 2 87.0 20
TX 132 1 NaN 52
TX 205 2 60.0 55
In [186]:
# Access the data from 'NY'
print(sales.loc['NY'])
       eggs  salt  spam
month                  
1       221  89.0    72
2        77  87.0    20

Indexing multiple levels of a MultiIndex

In [187]:
sales
Out[187]:
eggs salt spam
state month
CA 1 47 12.0 17
2 110 50.0 31
NY 1 221 89.0 72
2 77 87.0 20
TX 1 132 NaN 52
2 205 60.0 55
In [189]:
# Look up data for NY in month 1: NY_month1
NY_month1 = sales.loc[('NY', 1), :]
NY_month1
Out[189]:
eggs    221.0
salt     89.0
spam     72.0
Name: (NY, 1), dtype: float64
In [190]:
# Look up data for CA and TX in month 2: CA_TX_month2
CA_TX_month2 = sales.loc[(['CA', 'TX'], 2), :]
CA_TX_month2
Out[190]:
eggs salt spam
state month
CA 2 110 50.0 31
TX 2 205 60.0 55
In [191]:
# Look up data for all states in month 2: all_month2
all_month2 = sales.loc[(slice(None), 2), :]
all_month2
Out[191]:
eggs salt spam
state month
CA 2 110 50.0 31
NY 2 77 87.0 20
TX 2 205 60.0 55
In [ ]:
 
In [ ]:
 
In [ ]: