DATA Analytics: Project 3 Exercises 1 - 7

Exercise 1 - Exploratory Data Analysis:
import numpy as np
import pandas
import matplotlib.pyplot as plt

def entries_histogram(turnstile_weather):
    '''
    Before we perform any analysis, it might be useful to take a
    look at the data we're hoping to analyze. More specifically, let's
    examine the hourly entries in our NYC subway data and determine what
    distribution the data follows. This data is stored in a dataframe
    called turnstile_weather under the ['ENTRIESn_hourly'] column.

    Let's plot two histograms on the same axes to show hourly
    entries when raining vs. when not raining. Here's an example of how
    to plot histograms with pandas and matplotlib:
    turnstile_weather['column_to_graph'].hist()

    Your histogram may look similar to the bar graph in the instructor notes below.

    You can read a bit about using matplotlib and pandas to plot histograms here:
    http://pandas.pydata.org/pandas-docs/stable/visualization.html#histograms

    You can see the information contained within the turnstile weather data here:
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv
    '''

    # Hourly entries when it is raining
    x = turnstile_weather["ENTRIESn_hourly"][turnstile_weather["rain"] == 1]
    # Hourly entries when it is not raining
    y = turnstile_weather["ENTRIESn_hourly"][turnstile_weather["rain"] == 0]
    plt.figure()
    x.hist(bins=50)
    y.hist(bins=50)
    return plt

Does the data seem normally distributed? No.
Do you think we would be able to use Welch's t-test on this data? No, because the distribution is not normal, and Welch's t-test assumes that both samples come from normally distributed populations.
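
The visual judgement from the histogram can be backed by a formal check. The following is a minimal sketch, assuming a synthetic right-skewed sample as a stand-in for ENTRIESn_hourly (the numbers are generated here, not taken from the turnstile file). It uses scipy.stats.normaltest, whose null hypothesis is that the sample comes from a normal distribution, and scipy.stats.skew to measure the asymmetry.

```python
import numpy as np
import scipy.stats

# Synthetic stand-in for hourly entries (an assumption for illustration):
# exponential data is right-skewed, with most values piled up near zero.
rng = np.random.RandomState(0)
skewed_sample = rng.exponential(scale=1000.0, size=2000)

# A genuinely normal sample for comparison.
normal_sample = rng.normal(loc=1000.0, scale=200.0, size=2000)

# Null hypothesis of normaltest: the sample was drawn from a normal distribution.
_, p_skewed = scipy.stats.normaltest(skewed_sample)
_, p_normal = scipy.stats.normaltest(normal_sample)

# Skewness is close to 0 for symmetric data and positive for a long right tail.
skewness = scipy.stats.skew(skewed_sample)

print("skewed sample: p = %.3g, skewness = %.2f" % (p_skewed, skewness))
print("normal sample: p = %.3g" % p_normal)
```

A very small p-value for the skewed sample is evidence against normality, which is the situation where a rank-based test such as Mann-Whitney U is the safer choice.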

Exercise 3 - Mann Whitney U Test:

import numpy as np
import scipy
import scipy.stats
import pandas

def mann_whitney_plus_means(turnstile_weather):
    '''
    This function will consume the turnstile_weather dataframe containing
    our final turnstile weather data.

    You will want to take the means and run the Mann Whitney U-test on the
    ENTRIESn_hourly column in the turnstile_weather dataframe.

    This function should return:
        1) the mean of entries with rain
        2) the mean of entries without rain
        3) the Mann-Whitney U-statistic and p-value comparing the number of entries
           with rain and the number of entries without rain

    You should feel free to use scipy's Mann-Whitney implementation, and you
    might also find it useful to use numpy's mean function.

    Here are the functions' documentation:
    http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html

    You can look at the final turnstile weather data at the link below:
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv
    '''

    rainy = turnstile_weather[turnstile_weather['rain'] == 1]
    with_rain_mean = np.mean(rainy['ENTRIESn_hourly'])

    not_rainy = turnstile_weather[turnstile_weather['rain'] == 0]
    without_rain_mean = np.mean(not_rainy['ENTRIESn_hourly'])

    U, p = scipy.stats.mannwhitneyu(rainy['ENTRIESn_hourly'], not_rainy['ENTRIESn_hourly'])

    return with_rain_mean, without_rain_mean, U, p # leave this line for the grader

Here's your output:
(1105.4463767458733, 1090.278780151855, 1924409167.0, 0.024999912793489721)
Here's the correct output:
(1105.4463767458733, 1090.278780151855, 1924409167.0, 0.024999912793489721)
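
A note on reading the p-value: to my understanding, the SciPy releases of that period returned a one-sided p-value from mannwhitneyu by default, while recent releases default to a two-sided test, so the same call may report a different p-value today. The sketch below uses made-up samples (not the turnstile data) and passes the alternative keyword explicitly so the result does not depend on the library default. It assumes a SciPy version that accepts that keyword.

```python
import numpy as np
import scipy.stats

# Made-up samples for illustration only; these are not turnstile readings.
rainy_entries = np.array([120.0, 340.0, 560.0, 780.0, 900.0])
dry_entries = np.array([100.0, 200.0, 300.0, 400.0, 500.0])

with_rain_mean = np.mean(rainy_entries)
without_rain_mean = np.mean(dry_entries)

# Two-sided test: are the two distributions different in either direction?
U, p_two_sided = scipy.stats.mannwhitneyu(
    rainy_entries, dry_entries, alternative='two-sided')

# One-sided test: do rainy entries tend to be larger than dry entries?
_, p_greater = scipy.stats.mannwhitneyu(
    rainy_entries, dry_entries, alternative='greater')

print("means:", with_rain_mean, without_rain_mean)
print("U = %s, two-sided p = %.4f, one-sided p = %.4f" % (U, p_two_sided, p_greater))
```

Stating the alternative in the call makes it clear which hypothesis a reported p-value refers to.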

Exercise 5 - Linear Regression:

import numpy as np
import pandas

def normalize_features(array):
    """
    Normalize the features in the data set.
    """
    mu = array.mean()
    sigma = array.std()
    array_normalized = (array - mu) / sigma

    return array_normalized, mu, sigma

def compute_cost(features, values, theta):
    """
    Compute the cost function given a set of features / values,
    and the values for our thetas.

    This can be the same code as the compute_cost function in the lesson #3 exercises,
    but feel free to implement your own.
    """
    m = len(values)
    sum_of_square_errors = np.square(np.dot(features, theta) - values).sum()
    cost = sum_of_square_errors / (2*m)

    return cost

def gradient_descent(features, values, theta, alpha, num_iterations):
    """
    Perform gradient descent given a data set with an arbitrary number of features.

    This can be the same gradient descent code as in the lesson #3 exercises,
    but feel free to implement your own.
    """
    m = len(values)
    cost_history = []

    for i in range(num_iterations):
        predicted_values = np.dot(features, theta)
        theta -= (alpha / m) * np.dot((predicted_values - values), features)
        cost_history.append(compute_cost(features, values, theta))

    return theta, pandas.Series(cost_history)

def predictions(dataframe):
    '''
    The NYC turnstile data is stored in a pandas dataframe called weather_turnstile.
    Using the information stored in the dataframe, let's predict the ridership of
    the NYC subway using linear regression with gradient descent.

    You can see the information contained in the turnstile weather dataframe here:
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv

    Your prediction should have an R^2 value of 0.40 or better.

    Note: Due to the memory and CPU limitation of our Amazon EC2 instance, we will
    give you a random subset (~15%) of the data contained in
    turnstile_data_master_with_weather.csv

    If you receive a "server has encountered an error" message, that means you are
    hitting the 30-second limit that's placed on running your program. Try using a
    smaller number for num_iterations if that's the case.

    If you are using your own algorithm/models, see if you can optimize your code so
    that it runs faster.
    '''

    dummy_units = pandas.get_dummies(dataframe['UNIT'], prefix='unit')
    features = dataframe[['rain', 'precipi', 'Hour', 'meantempi']].join(dummy_units)
    values = dataframe[['ENTRIESn_hourly']]
    m = len(values)

    features, mu, sigma = normalize_features(features)
    features['ones'] = np.ones(m)
    features_array = np.array(features)
    values_array = np.array(values).flatten()

    # Set values for alpha, number of iterations.
    alpha = 0.1 # please feel free to change this value
    num_iterations = 100 # please feel free to change this value

    # Initialize theta, perform gradient descent
    theta_gradient_descent = np.zeros(len(features.columns))
    theta_gradient_descent, cost_history = gradient_descent(features_array,
                                                            values_array,
                                                            theta_gradient_descent,
                                                            alpha,
                                                            num_iterations)

    predictions = np.dot(features_array, theta_gradient_descent)
    return predictions
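
One way to sanity-check the update rule is to run it on data where the true coefficients are known. The following is a minimal sketch on synthetic, noiseless data; run_gradient_descent is a hypothetical compact re-implementation written for this example so the block is self-contained, and the slope 3.0 and intercept 5.0 are arbitrary choices.

```python
import numpy as np

def run_gradient_descent(features, values, theta, alpha, num_iterations):
    """Batch gradient descent for least squares (illustrative helper)."""
    m = len(values)
    theta = theta.astype(float).copy()  # work on a copy, leave the caller's array alone
    cost_history = []
    for _ in range(num_iterations):
        errors = np.dot(features, theta) - values
        # Step against the gradient of the squared-error cost.
        theta -= (alpha / m) * np.dot(errors, features)
        cost = np.square(np.dot(features, theta) - values).sum() / (2 * m)
        cost_history.append(cost)
    return theta, cost_history

# Synthetic data with known coefficients (assumed values for this sketch).
rng = np.random.RandomState(42)
x = rng.normal(size=200)
x_normalized = (x - x.mean()) / x.std()   # zero mean, unit variance
values = 3.0 * x_normalized + 5.0         # slope 3, intercept 5, no noise

# Column of ones lets the second theta act as the intercept.
features = np.column_stack([x_normalized, np.ones(len(x))])

theta, cost_history = run_gradient_descent(
    features, values, np.zeros(2), alpha=0.1, num_iterations=200)

print("theta:", theta)
print("first cost: %.4f, last cost: %.3g" % (cost_history[0], cost_history[-1]))
```

If the recovered theta is close to the known slope and intercept and the cost history keeps shrinking, the update rule and learning rate are behaving as intended. A cost that grows instead usually points to a learning rate that is too large or to features that were not normalized.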

Exercise 6 - Plotting Residuals:

import numpy as np
import scipy
import matplotlib.pyplot as plt

def plot_residuals(turnstile_weather, predictions):
    '''
    Using the same methods that we used to plot a histogram of entries
    per hour for our data, why don't you make a histogram of the residuals
    (that is, the difference between the original hourly entry data and the predicted values).

    Based on this residual histogram, do you have any insight into how our model
    performed? Reading a bit on this webpage might be useful:
    http://www.itl.nist.gov/div898/handbook/pri/section2/pri24.htm
    '''

    plt.figure()
    (turnstile_weather['ENTRIESn_hourly'] - predictions).hist()
    return plt

Exercise 7 - Compute R Squared:

import numpy as np
import scipy
import matplotlib.pyplot as plt
import sys

def compute_r_squared(data, predictions):
    '''
    In exercise 5, we calculated the R^2 value for you. But why don't you try
    and calculate the R^2 value yourself.

    Given a list of original data points, and also a list of predicted data points,
    write a function that will compute and return the coefficient of determination (R^2)
    for this data. numpy.mean() and numpy.sum() might both be useful here, but
    not necessary.

    Documentation about numpy.mean() and numpy.sum() below:
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html
    '''

    mean = np.mean(data)
    # Residual sum of squares: error left over after the model's predictions
    SSr = np.sum(np.square(data - predictions))
    # Total sum of squares: error of always predicting the mean
    SSt = np.sum(np.square(data - mean))
    r_squared = 1.0 - (SSr / SSt)

    return r_squared
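
A few hand-checkable cases show how to read the R^2 value. This is a small sketch with made-up numbers; r_squared is a stand-alone helper written for this example that follows the same 1 - SSr / SSt formula.

```python
import numpy as np

def r_squared(data, predictions):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    data = np.asarray(data, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    ss_res = np.sum(np.square(data - predictions))
    ss_tot = np.sum(np.square(data - np.mean(data)))
    return 1.0 - ss_res / ss_tot

observed = [1.0, 2.0, 3.0, 4.0]   # made-up observations, mean is 2.5

# Perfect predictions leave no residual error, so R^2 is 1.
perfect = r_squared(observed, observed)

# Always predicting the mean does no better than the baseline, so R^2 is 0.
baseline = r_squared(observed, [2.5, 2.5, 2.5, 2.5])

# Two predictions off by 0.5: SS_res = 0.5, SS_tot = 5.0, so R^2 = 0.9.
close = r_squared(observed, [1.5, 2.0, 3.0, 3.5])

print(perfect, baseline, close)
```

Read this way, the 0.40 target from exercise 5 means the model should explain at least 40% of the variance that a constant mean prediction leaves unexplained.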

DATA Analytics

Monday, June 23, 2014

Project 3 Exercise 1 - 7
