In [3]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
In [50]:
nohitter_times = np.array([ 843, 1613, 1101,  215,  684,  814,  278,  324,  161,  219,  545,
        715,  966,  624,   29,  450,  107,   20,   91, 1325,  124, 1468,
        104, 1309,  429,   62, 1878, 1104,  123,  251,   93,  188,  983,
        166,   96,  702,   23,  524,   26,  299,   59,   39,   12,    2,
        308, 1114,  813,  887,  645, 2088,   42, 2090,   11,  886, 1665,
       1084, 2900, 2432,  750, 4021, 1070, 1765, 1322,   26,  548, 1525,
         77, 2181, 2752,  127, 2147,  211,   41, 1575,  151,  479,  697,
        557, 2267,  542,  392,   73,  603,  233,  255,  528,  397, 1529,
       1023, 1194,  462,  583,   37,  943,  996,  480, 1497,  717,  224,
        219, 1531,  498,   44,  288,  267,  600,   52,  269, 1086,  386,
        176, 2199,  216,   54,  675, 1243,  463,  650,  171,  327,  110,
        774,  509,    8,  197,  136,   12, 1124,   64,  380,  811,  232,
        192,  731,  715,  226,  605,  539, 1491,  323,  240,  179,  702,
        156,   82, 1397,  354,  778,  603, 1001,  385,  986,  203,  149,
        576,  445,  180, 1403,  252,  675, 1351, 2983, 1568,   45,  899,
       3260, 1025,   31,  100, 2055, 4043,   79,  238, 3931, 2351,  595,
        110,  215,    0,  563,  206,  660,  242,  577,  179,  157,  192,
        192, 1848,  792, 1693,   55,  388,  225, 1134, 1172, 1555,   31,
       1582, 1044,  378, 1687, 2915,  280,  765, 2819,  511, 1521,  745,
       2491,  580, 2072, 6450,  578,  745, 1075, 1103, 1549, 1520,  138,
       1202,  296,  277,  351,  391,  950,  459,   62, 1056, 1128,  139,
        420,   87,   71,  814,  603, 1349,  162, 1027,  783,  326,  101,
        876,  381,  905,  156,  419,  239,  119,  129,  467])

How often do we get no-hitters?

The number of games played between each no-hitter in the modern era (1901-2015) of Major League Baseball is stored in the array nohitter_times.

If you assume that no-hitters are described as a Poisson process, then the time between no-hitters is Exponentially distributed. As you have seen, the Exponential distribution has a single parameter, which we will call τ, the typical interval time. The value of the parameter τ that makes the exponential distribution best match the data is the mean interval time (where time is in units of number of games) between no-hitters.
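
An aside worth checking: for exponentially distributed data t_1, ..., t_n the log-likelihood is -n*log(τ) - (t_1 + ... + t_n)/τ, and it is maximized exactly at τ equal to the sample mean. A minimal numerical check of that claim (assuming nohitter_times from the cell above):

# Exponential log-likelihood over a grid of candidate tau values
tau_grid = np.linspace(50, 3000, 3000)
loglik = -len(nohitter_times) * np.log(tau_grid) - nohitter_times.sum() / tau_grid

# The maximizer of the log-likelihood coincides with the sample mean
print(tau_grid[np.argmax(loglik)], np.mean(nohitter_times))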

Compute the value of this parameter from the data. Then, use np.random.exponential() to "repeat" the history of Major League Baseball by drawing inter-no-hitter times from an exponential distribution with the τ you found and plot the histogram as an approximation to the PDF.

NumPy, pandas, matplotlib.pyplot, and seaborn have been imported for you as np, pd, plt, and sns, respectively.

In [51]:
# Seed random number generator
np.random.seed(7)

# Compute mean no-hitter time: tau
tau = np.mean(nohitter_times)

# Draw out of an exponential distribution with parameter tau: inter_nohitter_time
inter_nohitter_time = np.random.exponential(tau, 100000)

# Plot the PDF and label axes
_ = plt.hist(inter_nohitter_time,
             bins=50, density=True, histtype='step')
_ = plt.xlabel('Games between no-hitters')
_ = plt.ylabel('PDF')

# Show the plot
plt.show()
In [52]:
def ecdf(data):
    """Compute the ECDF: sorted values x and cumulative fractions y."""
    x = np.sort(data)
    y = np.arange(1, len(data) + 1) / float(len(data))
    return x, y

Do the data follow our story?

You have modeled no-hitters using an Exponential distribution. Create an ECDF of the real data. Overlay the theoretical CDF with the ECDF from the data. This helps you to verify that the Exponential distribution describes the observed data.

It may be helpful to remind yourself of the function you created in the previous course to compute the ECDF, as well as the code you wrote to plot it.
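
Instead of approximating the theoretical CDF with exponential samples, you could also evaluate it in closed form: for an Exponential distribution with mean τ, the CDF is F(t) = 1 - exp(-t/τ). A minimal sketch of that alternative, assuming tau, nohitter_times, and ecdf() from the cells above:

# Closed-form exponential CDF as an alternative to the sampling-based x_theor, y_theor
t_grid = np.linspace(0, nohitter_times.max(), 500)
cdf_exact = 1 - np.exp(-t_grid / tau)

x_data, y_data = ecdf(nohitter_times)
plt.plot(t_grid, cdf_exact)
plt.plot(x_data, y_data, marker='.', linestyle='none')
plt.xlabel('Games between no-hitters')
plt.ylabel('CDF')
plt.show()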

In [53]:
plt.figure(figsize=[16,5])
# Create an ECDF from real data: x, y
x, y = ecdf(nohitter_times)

# Create a CDF from theoretical samples: x_theor, y_theor
x_theor, y_theor = ecdf(inter_nohitter_time)

# Overlay the plots
plt.plot(x_theor, y_theor)
plt.plot(x, y, marker='.', linestyle='none')

# Margins and axis labels
plt.margins(0.02)
plt.xlabel('Games between no-hitters')
plt.ylabel('CDF')

# Show the plot
plt.show()

How is this parameter optimal?

Now sample out of an exponential distribution with τ being twice as large as the optimal τ. Do it again for τ half as large. Make CDFs of these samples and overlay them with your data. You can see that they do not reproduce the data as well. Thus, the τ you computed from the mean inter-no-hitter times is optimal in that it best reproduces the data.

Note: In this and all subsequent exercises, the random number generator is pre-seeded for you to save you some typing.
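
As a side check (not required by the exercise), you can put a number on "do not reproduce the data as well" by comparing the exponential log-likelihood of the observed inter-no-hitter times under each candidate τ; the optimal τ should score highest. A sketch, assuming tau and nohitter_times from the cells above:

# Exponential log-likelihood of the observed data for a given mean parameter
def exp_loglik(data, mean_time):
    return -len(data) * np.log(mean_time) - np.sum(data) / mean_time

# The tau computed from the sample mean gives the largest log-likelihood
for label, candidate in [('tau/2', tau / 2), ('tau', tau), ('2*tau', 2 * tau)]:
    print(label, exp_loglik(nohitter_times, candidate))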

In [59]:
plt.figure(figsize=[16,5])

# Plot the theoretical CDFs
plt.plot(x_theor, y_theor)
plt.plot(x, y, marker='.', linestyle='none')
plt.margins(0.02)
plt.xlabel('Games between no-hitters')
plt.ylabel('CDF')

# Take samples with half tau: samples_half
samples_half = np.random.exponential(tau/2, 10000)

# Take samples with double tau: samples_double
samples_double = np.random.exponential(2*tau, 10000)

# Generate CDFs from these samples
x_half, y_half = ecdf(samples_half)
x_double, y_double = ecdf(samples_double)

# Plot these CDFs as lines
_ = plt.plot(x_half, y_half, c='r')
_ = plt.plot(x_double, y_double, c='b')

# Show the plot
plt.show()

EDA of literacy/fertility data

In the next few exercises, we will look at the correlation between female literacy and fertility (defined as the average number of children born per woman) throughout the world. For ease of analysis and interpretation, we will work with the illiteracy rate.

It is always a good idea to do some EDA ahead of our analysis. To this end, plot fertility versus illiteracy and compute the Pearson correlation coefficient. The NumPy array illiteracy has the illiteracy rate among females for most of the world's nations. The array fertility has the corresponding fertility data.

Here, it may be useful to refer back to the function you wrote in the previous course to compute the Pearson correlation coefficient.
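
If you don't recall that function, the Pearson correlation coefficient is simply the covariance of the two variables divided by the product of their standard deviations. A minimal sketch built directly from that definition (equivalent to reading the off-diagonal entry of np.corrcoef, which is what the helper in the next cell does):

# Pearson r from its definition: covariance / (std_x * std_y)
def pearson_from_definition(x, y):
    x_dev = x - np.mean(x)
    y_dev = y - np.mean(y)
    return np.sum(x_dev * y_dev) / np.sqrt(np.sum(x_dev**2) * np.sum(y_dev**2))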

In [55]:
df = pd.read_csv('./female_literacy_fertility.csv')
print(df.shape)
df.head()
(162, 5)
Out[55]:
Country Continent female literacy fertility population
0 Chine ASI 90.5 1.769 1,324,655,000
1 Inde ASI 50.8 2.682 1,139,964,932
2 USA NAM 99.0 2.077 304,060,000
3 Indonésie ASI 88.8 2.132 227,345,082
4 Brésil LAT 90.2 1.827 191,971,506
In [56]:
illiteracy, fertility = 100 - df['female literacy'], df['fertility']

def pearson_r(x, y):
    """Compute Pearson correlation coefficient between two arrays."""
    # Compute correlation matrix: corr_mat
    corr_mat = np.corrcoef(x, y)

    # Return entry [0,1]
    return corr_mat[0,1]
In [58]:
plt.figure(figsize=[16,5])

# Plot the illiteracy rate versus fertility
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')

# Set the margins and label axes
plt.margins(0.02)
_ = plt.xlabel('percent illiterate')
_ = plt.ylabel('fertility')

# Show the plot
plt.show()

# Show the Pearson correlation coefficient
print(pearson_r(illiteracy, fertility))
0.8041324026815344

Linear regression

We will assume that fertility is a linear function of the female illiteracy rate. That is, f = ai + b, where a is the slope and b is the intercept. We can think of the intercept as the minimal fertility rate, probably somewhere between one and two. The slope tells us how the fertility rate varies with illiteracy. We can find the best-fit line using np.polyfit().

Plot the data and the best fit line. Print out the slope and intercept. (Think: what are their units?)
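
For a degree-1 fit, np.polyfit() is solving an ordinary least-squares problem, so the same slope and intercept can be recovered with np.linalg.lstsq() on a design matrix whose columns are the illiteracy values and a column of ones. A sketch of that equivalence, assuming illiteracy and fertility from the cells above:

# Ordinary least squares on the design matrix [illiteracy, 1]
A = np.column_stack([illiteracy, np.ones(len(illiteracy))])
coeffs, residuals, rank, sv = np.linalg.lstsq(A, fertility, rcond=None)

# coeffs holds [slope, intercept], matching np.polyfit(illiteracy, fertility, 1)
print(coeffs)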

In [61]:
plt.figure(figsize=[16,5])

# Plot the illiteracy rate versus fertility
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')
plt.margins(0.02)
_ = plt.xlabel('percent illiterate')
_ = plt.ylabel('fertility')

# Perform a linear regression using np.polyfit(): a, b
a, b = np.polyfit(illiteracy, fertility, 1)

# Print the results to the screen
print('slope =', a, 'children per woman / percent illiterate')
print('intercept =', b, 'children per woman')

# Make theoretical line to plot
x = np.array([0, 100])
y = a * x + b

# Add regression line to your plot
_ = plt.plot(x, y)

# Draw the plot
plt.show()
slope = 0.04979854809063423 children per woman / percent illiterate
intercept = 1.888050610636557 children per woman

How is it optimal?

The function np.polyfit() that you used to get your regression parameters finds the optimal slope and intercept. It does so by minimizing the sum of the squares of the residuals, also known as the RSS (residual sum of squares). In this exercise, you will plot the function being minimized, the RSS, versus the slope parameter a. To do this, fix the intercept to be what you found in the optimization. Then, plot the RSS vs. the slope. Where is it minimal?
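
Because b is held fixed, the RSS is a parabola in a, and setting its derivative to zero gives the minimizer in closed form: a* = Σ i(f - b) / Σ i², where i is illiteracy and f is fertility. A quick check of where the minimum should land, assuming illiteracy, fertility, and the intercept b from the previous cells:

# Closed-form minimizer of the RSS when the intercept b is fixed
a_star = np.sum(illiteracy * (fertility - b)) / np.sum(illiteracy**2)
print(a_star)  # should match the slope returned by np.polyfit above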

In [62]:
# Specify slopes to consider: a_vals
a_vals = np.linspace(0, 0.1, 200)

# Initialize sum of square of residuals: rss
rss = np.empty_like(a_vals)

# Compute sum of square of residuals for each value of a_vals
for i, a in enumerate(a_vals):
    rss[i] = np.sum((fertility - a*illiteracy - b)**2)

# Plot the RSS
plt.plot(a_vals, rss, '-')
plt.xlabel('slope (children per woman / percent illiterate)')
plt.ylabel('sum of square of residuals')

plt.show()

Linear regression on appropriate Anscombe data

For practice, perform a linear regression on the data set from Anscombe's quartet that is most reasonably interpreted with linear regression.

In [64]:
ab = pd.read_csv('./anscombe.csv')
ab
Out[64]:
0 0.1 1 1.1 2 2.1 3 3.1
0 x y x y x y x y
1 10.0 8.04 10.0 9.14 10.0 7.46 8.0 6.58
2 8.0 6.95 8.0 8.14 8.0 6.77 8.0 5.76
3 13.0 7.58 13.0 8.74 13.0 12.74 8.0 7.71
4 9.0 8.81 9.0 8.77 9.0 7.11 8.0 8.84
5 11.0 8.33 11.0 9.26 11.0 7.81 8.0 8.47
6 14.0 9.96 14.0 8.10 14.0 8.84 8.0 7.04
7 6.0 7.24 6.0 6.13 6.0 6.08 8.0 5.25
8 4.0 4.26 4.0 3.10 4.0 5.39 19.0 12.50
9 12.0 10.84 12.0 9.13 12.0 8.15 8.0 5.56
10 7.0 4.82 7.0 7.26 7.0 6.42 8.0 7.91
11 5.0 5.68 5.0 4.74 5.0 5.73 8.0 6.89
In [69]:
x, y = ab.iloc[1:, 0].astype(float), ab.iloc[1:, 1].astype(float)
In [70]:
# Perform linear regression: a, b
a, b = np.polyfit(x, y, 1)

# Print the slope and intercept
print(a, b)

# Generate theoretical x and y data: x_theor, y_theor
x_theor = np.array([3, 15])
y_theor = a * x_theor + b

# Plot the Anscombe data and theoretical line
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.plot(x_theor, y_theor)

# Label the axes
plt.xlabel('x')
plt.ylabel('y')

# Show the plot
plt.show()
0.5000909090909095 3.000090909090909

Linear regression on all Anscombe data

Now, to verify that all four of the Anscombe data sets have the same slope and intercept from a linear regression, you will compute the slope and intercept for each set. The data are stored in lists: anscombe_x = [x1, x2, x3, x4] and anscombe_y = [y1, y2, y3, y4], where, for example, x2 and y2 are the x and y values for the second Anscombe data set.

In [71]:
ab.head()
Out[71]:
0 0.1 1 1.1 2 2.1 3 3.1
0 x y x y x y x y
1 10.0 8.04 10.0 9.14 10.0 7.46 8.0 6.58
2 8.0 6.95 8.0 8.14 8.0 6.77 8.0 5.76
3 13.0 7.58 13.0 8.74 13.0 12.74 8.0 7.71
4 9.0 8.81 9.0 8.77 9.0 7.11 8.0 8.84
In [72]:
anscombe_x = [ab.iloc[1:, 0].astype(float), ab.iloc[1:, 2].astype(float), 
              ab.iloc[1:, 4].astype(float), ab.iloc[1:, 6].astype(float)]

anscombe_y = [ab.iloc[1:, 1].astype(float), ab.iloc[1:, 3].astype(float), 
              ab.iloc[1:, 5].astype(float), ab.iloc[1:, 7].astype(float)]
In [73]:
# Iterate through x,y pairs
for x, y in zip(anscombe_x, anscombe_y):
    # Compute the slope and intercept: a, b
    a, b = np.polyfit(x, y, 1)

    # Print the result
    print('slope:', a, 'intercept:', b)
slope: 0.5000909090909095 intercept: 3.000090909090909
slope: 0.5000000000000004 intercept: 3.0009090909090896
slope: 0.4997272727272731 intercept: 3.0024545454545453
slope: 0.4999090909090908 intercept: 3.0017272727272735
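
The Pearson correlation coefficient is also nearly identical across the four sets (roughly 0.82 for each), which is part of what makes Anscombe's quartet instructive: the summary statistics agree even though the scatter plots below look nothing alike. A quick check using the pearson_r helper defined earlier:

# Pearson r is essentially the same for all four Anscombe data sets
for x_i, y_i in zip(anscombe_x, anscombe_y):
    print(pearson_r(x_i, y_i))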
In [88]:
plt.figure(figsize=[8,8])

for i in range(4):
    plt.subplot(2,2,i+1)
    plt.plot(anscombe_x[i], anscombe_y[i], marker='.', linestyle='None')

plt.show()