Thursday, June 26, 2014

Project 4

Exercise - Visualization 1:

from pandas import *
from ggplot import *
from datetime import datetime

def plot_weather_data(turnstile_weather):
    '''
    You are passed in a dataframe called turnstile_weather. 
    Use turnstile_weather along with ggplot to make a data visualization
    focused on the MTA and weather data we used in assignment #3.  
    You should feel free to implement something that we discussed in class 
    (e.g., scatterplots, line plots, or histograms) or attempt to implement
    something more advanced if you'd like.  

    Here are some suggestions for things to investigate and illustrate:
     * Ridership by time of day or day of week
     * How ridership varies based on Subway station
     * Which stations have more exits or entries at different times of day

    If you'd like to learn more about ggplot and its capabilities, take
    a look at the documentation at:
    https://pypi.python.org/pypi/ggplot/
     
    You can check out:
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv
     
    To see all the columns and data points included in the turnstile_weather 
    dataframe. 
     
    However, due to the limitation of our Amazon EC2 server, we are giving you about 1/3
    of the actual data in the turnstile_weather dataframe
    '''

   
    dataTW = turnstile_weather
    entries_DayOfMonth = dataTW[['DATEn', 'ENTRIESn_hourly']].groupby('DATEn', as_index=False).sum()
    # label each date with its weekday; '%w %A' gives e.g. '1 Monday' so days sort correctly
    entries_DayOfMonth['Day'] = [datetime.strptime(x, '%Y-%m-%d').strftime('%w %A')
                                 for x in entries_DayOfMonth['DATEn']]
    entries_Day = entries_DayOfMonth[['Day', 'ENTRIESn_hourly']].groupby('Day', as_index=False).sum()
    plot = ggplot(entries_Day, aes(x='Day', y='ENTRIESn_hourly')) + geom_bar(aes(weight='ENTRIESn_hourly'), fill='red') \
           + ggtitle('NYC Subway ridership / day of week') + xlab('Day') + ylab('Entries')
    return plot
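
A minimal sketch of exercising this locally, assuming the CSV from the Dropbox link above has been downloaded (the local filename is an assumption):

import pandas
turnstile_weather = pandas.read_csv('turnstile_data_master_with_weather.csv')
print(plot_weather_data(turnstile_weather))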




Tuesday, June 24, 2014

Data Visualization

Components of Effective Visualization:

  • Visual cues
    • Position
    • Length
    • Angle
  • Visual Encoding
    • Direction
    • Shape
    • Area
    • Volume
    • Color
      • Hue
      • Saturation
  • Coordinate systems
  • Scale / Data Types
    • Numeric
    • Categorical
    • Time Series
  • Context

Visual Cues Accuracy:

Per Cleveland and McGill's perception studies, visual cues rank roughly from most to least accurately judged: position, length, angle and direction, area, volume, then color saturation and hue.

Python Plotting Packages:

  • MatPlotLib
  • ggplot (Grammar of Graphics); a runnable sketch follows below
    • ggplot(data, aes(xVar, yVar)) + geom_point(color='red') + geom_line(color='blue') + xlab('x axis label') + ylab('y axis label')
    • print ggplot(data, aes(xVar, yVar))
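
A minimal, self-contained sketch of this layered grammar (the toy data and column names are made up for illustration):

from pandas import DataFrame
from ggplot import *

# toy data: five points on y = x^2
data = DataFrame({'xVar': range(5), 'yVar': [0, 1, 4, 9, 16]})
print(ggplot(data, aes(x='xVar', y='yVar'))
      + geom_point(color='red')
      + geom_line(color='blue')
      + xlab('x axis label')
      + ylab('y axis label'))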

Plotting in Python


from pandas import *
from ggplot import *

import pandas

def lineplot(hr_year_csv):
    # A csv file will be passed in as an argument which
    # contains two columns -- 'HR' (the number of homerun hits)
    # and 'yearID' (the year in which the homeruns were hit).
    #
    # Fill out the body of this function, lineplot, to use the
    # passed-in csv file, hr_year_csv, and create a
    # chart with points connected by lines, both colored 'red',
    # showing the number of HR by year.
    #
    # You will want to first load the csv file into a pandas dataframe
    # and use the pandas dataframe along with ggplot to create your visualization
    #
    # You can check out the data in the csv file at the link below:
    # https://www.dropbox.com/s/awgdal71hc1u06d/hr_year.csv
    #
    # You can read more about ggplot at the following link:
    # https://github.com/yhat/ggplot/
    
    inData = pandas.read_csv(hr_year_csv)
    gg = ggplot(inData, aes('yearID', 'HR')) + geom_point(color='red') + geom_line(color='red') \
         + ggtitle('Homeruns by Year') + xlab('Year') + ylab('HR')
    

    return gg

Data Types:

  • Numeric
    • Discrete
    • Continuous
  • Categorical Data
  • Time Series

Plotting Line Chart:
from pandas import *
from ggplot import *

import pandas

def lineplot_compare(hr_by_team_year_sf_la_csv):
    # Write a function, lineplot_compare, that will read a csv file
    # called hr_by_team_year_sf_la_csv and plot it using pandas and ggplot2.
    #
    # This csv file has three columns -- yearID, HR, and teamID, 
    # representing the total number of HR hit each year by the SF Giants 
    # and LA Dodgers. Produce a visualization comparing the total HR by 
    # year of the two teams. 
    # 
    # You can see the data in hr_by_team_year_sf_la_csv
    # at the link below:
    # https://www.dropbox.com/s/wn43cngo2wdle2b/hr_by_team_year_sf_la.csv
    #
    # Note that to differentiate between multiple categories on the 
    # same plot in ggplot, we can pass color in with the other arguments
    # to aes, rather than in our geometry functions.
    # 
    # For example, ggplot(data, aes(xvar, yvar, color=category_var)).  This
    # should help you.
    
    inData = pandas.read_csv(hr_by_team_year_sf_la_csv)
    gg = ggplot(inData, aes(x='yearID', y='HR', color='teamID')) + geom_point() + geom_line()

    return gg



Monday, June 23, 2014

Project 3 Exercises 1 - 7

Exercise 1 - Exploratory Data Analysis:
import numpy as np
import pandas
import matplotlib.pyplot as plt

def entries_histogram(turnstile_weather):
    '''
    Before we perform any analysis, it might be useful to take a
    look at the data we're hoping to analyze. More specifically, let's 
    examine the hourly entries in our NYC subway data and determine what
    distribution the data follows. This data is stored in a dataframe
    called turnstile_weather under the ['ENTRIESn_hourly'] column.
    
    Let's plot two histograms on the same axes to show hourly
    entries when raining vs. when not raining. Here's an example on how
    to plot histograms with pandas and matplotlib:
    turnstile_weather['column_to_graph'].hist()
    
    Your histogram may look similar to the bar graph in the instructor notes below.
    

    You can read a bit about using matplotlib and pandas to plot histograms here:
    http://pandas.pydata.org/pandas-docs/stable/visualization.html#histograms
    
    You can see the information contained within the turnstile weather data here:
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv
    '''
    
    x = turnstile_weather["ENTRIESn_hourly"][turnstile_weather["rain"] == 1]  # hourly entries when it is raining
    y = turnstile_weather["ENTRIESn_hourly"][turnstile_weather["rain"] == 0]  # hourly entries when it is not raining
    plt.figure()
    x.hist(bins=50)
    y.hist(bins=50)
    return plt

Does the data seem normally distributed? No.
Do you think we would be able to use Welch's t-test on this data? No, because the distribution is not normal.
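
One way to back up the "not normal" answer with a number is the Shapiro-Wilk test (a sketch, assuming the turnstile_weather dataframe from above; scipy.stats.shapiro is summarized near the end of these notes):

import scipy.stats

# H0: the sample was drawn from a normal distribution.
# Shapiro-Wilk is intended for modest sample sizes, so test a slice.
W, p = scipy.stats.shapiro(turnstile_weather['ENTRIESn_hourly'][:5000])
print(W, p)  # a tiny p-value rejects normality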

Exercise 3 - Mann Whitney U Test:

import numpy as np
import scipy
import scipy.stats
import pandas

def mann_whitney_plus_means(turnstile_weather):
    '''
    This function will consume the turnstile_weather dataframe containing
    our final turnstile weather data. 
    
    You will want to take the means and run the Mann Whitney U-test on the 
    ENTRIESn_hourly column in the turnstile_weather dataframe.
    
    This function should return:
        1) the mean of entries with rain
        2) the mean of entries without rain
        3) the Mann-Whitney U-statistic and p-value comparing the number of entries
           with rain and the number of entries without rain
    
    You should feel free to use scipy's Mann-Whitney implementation, and you 
    might also find it useful to use numpy's mean function.
    
    Here are the functions' documentation:
    http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
    
    You can look at the final turnstile weather data at the link below:
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv
    '''
    rainy = turnstile_weather[turnstile_weather['rain'] == 1]
    with_rain_mean = np.mean(rainy['ENTRIESn_hourly'])
    not_rainy = turnstile_weather[turnstile_weather['rain'] == 0]
    without_rain_mean = np.mean(not_rainy['ENTRIESn_hourly'])
    U , p = scipy.stats.mannwhitneyu(rainy['ENTRIESn_hourly'], not_rainy['ENTRIESn_hourly'])
    
    return with_rain_mean, without_rain_mean, U, p # leave this line for the grader

Here's your output:
(1105.4463767458733, 1090.278780151855, 1924409167.0, 0.024999912793489721)
Here's the correct output:
(1105.4463767458733, 1090.278780151855, 1924409167.0, 0.024999912793489721)
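
Note that this scipy version returns a one-sided p-value (see the scipy notes further down), so inside the function above a two-sided test would roughly double it:

p_two_sided = 2 * p  # about 0.05 for the output above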


Exercise 5 - Linear Regression:

import numpy as np
import pandas

def normalize_features(array):
    """
    Normalize the features in the data set to zero mean and unit standard deviation.
    """
    mu = array.mean()
    sigma = array.std()
    array_normalized = (array - mu) / sigma

    return array_normalized, mu, sigma

def compute_cost(features, values, theta):
    """
    Compute the cost function given a set of features / values, 
    and the values for our thetas.
    
    This can be the same code as the compute_cost function in the lesson #3 exercises,
    but feel free to implement your own.
    """
    
    m = len(values)
    sum_of_square_errors = np.square(np.dot(features, theta) - values).sum()
    cost = sum_of_square_errors / (2*m)
    return cost

def gradient_descent(features, values, theta, alpha, num_iterations):
    """
    Perform gradient descent given a data set with an arbitrary number of features.
    
    This can be the same gradient descent code as in the lesson #3 exercises,
    but feel free to implement your own.
    """
    
    m = len(values)
    cost_history = []

    for i in range(num_iterations):
        predicted_values = np.dot(features, theta)
        theta -= (alpha / m) * np.dot((predicted_values - values), features)
        cost_history.append(compute_cost(features, values, theta))
    return theta, pandas.Series(cost_history)

def predictions(dataframe):
    '''
    The NYC turnstile data is stored in a pandas dataframe called weather_turnstile.
    Using the information stored in the dataframe, let's predict the ridership of
    the NYC subway using linear regression with gradient descent.
    
    You can see the information contained in the turnstile weather dataframe here:
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv    
    
    Your prediction should have a R^2 value of 0.40 or better.
    
    Note: Due to the memory and CPU limitations of our Amazon EC2 instance, we will
    give you a random subset (~15%) of the data contained in
    turnstile_data_master_with_weather.csv
    
    If you receive a "server has encountered an error" message, that means you are 
    hitting the 30-second limit that's placed on running your program. Try using a 
    smaller number for num_iterations if that's the case.
    
    If you are using your own algorithm/models, see if you can optimize your code so 
    that it runs faster.
    '''

    dummy_units = pandas.get_dummies(dataframe['UNIT'], prefix='unit')
    features = dataframe[['rain', 'precipi', 'Hour', 'meantempi']].join(dummy_units)
    values = dataframe[['ENTRIESn_hourly']]
    m = len(values)

    features, mu, sigma = normalize_features(features)

    features['ones'] = np.ones(m)
    features_array = np.array(features)
    values_array = np.array(values).flatten()

    #Set values for alpha, number of iterations.
    alpha = 0.1 # please feel free to change this value
    num_iterations = 100 # please feel free to change this value

    #Initialize theta, perform gradient descent
    theta_gradient_descent = np.zeros(len(features.columns))
    theta_gradient_descent, cost_history = gradient_descent(features_array, 
                                                            values_array, 
                                                            theta_gradient_descent, 
                                                            alpha, 
                                                            num_iterations)
    
    predictions = np.dot(features_array, theta_gradient_descent)
    return predictions
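
A hedged sketch of checking the prediction quality end to end (the local filename is an assumption; compute_r_squared is written in Exercise 7 below):

import pandas
df = pandas.read_csv('turnstile_data_master_with_weather.csv')
predicted = predictions(df)
print(compute_r_squared(df['ENTRIESn_hourly'], predicted))  # target: 0.40 or better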


Exercise 6 - Plotting Residuals:

import numpy as np
import scipy
import matplotlib.pyplot as plt

def plot_residuals(turnstile_weather, predictions):
    '''
    Using the same methods that we used to plot a histogram of entries
    per hour for our data, why don't you make a histogram of the residuals
    (that is, the difference between the original hourly entry data and the predicted values).

    Based on this residual histogram, do you have any insight into how our model
    performed?  Reading a bit on this webpage might be useful:

    http://www.itl.nist.gov/div898/handbook/pri/section2/pri24.htm
    '''
   
    plt.figure()
    (turnstile_weather['ENTRIESn_hourly'] - predictions).hist()
    return plt
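
If the model fit well, this residual histogram would be roughly symmetric and centered at zero; a long tail on one side suggests the model systematically under- or over-predicts part of the data.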


Exercise 7 - Compute R Squared:

import numpy as np
import scipy
import matplotlib.pyplot as plt
import sys

def compute_r_squared(data, predictions):
    '''
    In exercise 5, we calculated the R^2 value for you. But why don't you try
    to calculate the R^2 value yourself.
    
    Given a list of original data points, and also a list of predicted data points,
    write a function that will compute and return the coefficient of determination (R^2)
    for this data.  numpy.mean() and numpy.sum() might both be useful here, but
    not necessary.

    Documentation about numpy.mean() and numpy.sum() below:
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html
    '''
    
    mean = np.mean(data)
    SSr = np.sum(np.square(data - predictions))
    SSt = np.sum(np.square(data - mean))
    r_squared = 1.0 - (SSr / SSt)    
    return r_squared



Basics of Statistics and Machine Learning Used to Analyze Data (3)

Machine Learning:

A branch of artificial intelligence that concerns the construction and study of systems that can learn from data.

Machine Learning vs Statistics:

Statistics is about drawing valid conclusions

It cares deeply about how the data was collected, methodology, and statistical properties of the estimator. Much of Statistics is motivated by problems where you need to know precisely what you're doing (clinical trials, other experiments).

Statistics insists on proper and rigorous methodology, and is comfortable with making and noting assumptions. It cares about how the data was collected, the resulting properties of the estimator or experiment (e.g. p-value, unbiased estimators), and the kinds of properties you would expect if you did a procedure many times.

Machine Learning is about prediction

It cares deeply about scalability and using the predictions to make decisions. Much of Machine Learning is motivated by problems that need to have answers (e.g. image recognition, text inference, ranking, computer vision, medical and healthcare, search engines.)

ML is happy to treat the algorithm as a black box as long as it works. Prediction and decision-making is king, and the algorithm is only a means to an end. It's very important in ML to make sure that your performance would improve (and not take an absurd amount of time) with more data.



Types of Machine Learning Algorithms:

  • Supervised learning algorithms are trained on labelled examples, i.e., input where the desired output is known. The supervised learning algorithm attempts to generalize a function or mapping from inputs to outputs which can then be used speculatively to generate an output for previously unseen inputs.

  • Unsupervised learning algorithms operate on unlabelled examples, i.e., input where the desired output is unknown. Here the objective is to discover structure in the data (clustering), not to generalize a mapping from inputs to outputs.


Linear Regression:
In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables denoted X.
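
Concretely, the fitted model has the form y = theta_0 + theta_1*x_1 + ... + theta_n*x_n, and fitting means choosing the theta values that minimize the squared error between predicted and observed y, which is what the gradient descent code below does.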




Gradient Descent in Python:

import numpy
import pandas

def compute_cost(features, values, theta):
    """
    Compute the cost of a list of parameters, theta, given a list of features (input 
data points) and values (output data points).
    """
    m = len(values)
    sum_of_square_errors = numpy.square(numpy.dot(features, theta) - values).sum()
    cost = sum_of_square_errors / (2*m)

    return cost

def gradient_descent(features, values, theta, alpha, num_iterations):
    """
    Perform gradient descent given a data set with an arbitrary number of features.
    """

    # Perform num_iterations updates to the elements of theta. Every time you
    # compute the cost for a given list of thetas, append it to cost_history.

    m = len(values)
    cost_history = []

    for i in range(num_iterations):
        predicted_values = numpy.dot(features, theta)
        # step theta against the gradient of the cost: (alpha/m) * X^T(X*theta - y)
        theta -= (alpha / m) * numpy.dot((predicted_values - values), features)
        cost_history.append(compute_cost(features, values, theta))

    return theta, pandas.Series(cost_history) # leave this line for the grader
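
A quick smoke test of these two functions on made-up data (the numbers are purely illustrative):

import numpy
import pandas

# one real feature plus a column of ones for the intercept; targets follow y = 2*x
features = numpy.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
values = numpy.array([2.0, 4.0, 6.0])
theta = numpy.zeros(2)

theta, cost_history = gradient_descent(features, values, theta,
                                       alpha=0.1, num_iterations=500)
print(theta)                  # should approach [2.0, 0.0]
print(cost_history.iloc[-1])  # cost should be near zero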


Coefficient of determination:
In statistics, the coefficient of determination, denoted R^2 and pronounced "R squared", indicates how well data points fit a statistical model, sometimes simply a line or curve.
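
In terms of the code below, R^2 = 1 - SSr / SSt, where SSr = sum((data - predictions)^2) is the residual sum of squares and SSt = sum((data - mean)^2) is the total sum of squares; a value of 1 means a perfect fit, and 0 means the model does no better than always predicting the mean.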


Calculating R Squared:

import numpy as np

def compute_r_squared(data, predictions):
    # Write a function that, given two input numpy arrays, 'data', and 'predictions,'
    # returns the coefficient of determination, R^2, for the model that produced 
    # predictions.
    # 
    # Numpy has a couple of functions -- np.mean() and np.sum() --
    # that you might find useful, but you don't have to use them.

    # YOUR CODE GOES HERE
    mean = np.mean(data)
    SSr = np.sum(np.square(data - predictions))
    SSt = np.sum(np.square(data - mean))
    r_squared = 1.0 - (SSr / SSt)


    return r_squared
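
A quick sanity check on toy numbers (illustrative only):

import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])
print(compute_r_squared(data, data.copy()))                # 1.0: perfect fit
print(compute_r_squared(data, np.repeat(data.mean(), 4)))  # 0.0: mean-only model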



Basics of Statistics and Machine Learning Used to Analyze Data (2)

Non-Parametric Tests:

Non-parametric statistics covers techniques that do not rely on the data belonging to any particular distribution, as well as techniques that do not assume that the structure of a model is fixed.

Mann-Whitney U Test:

A non-parametric test of the null hypothesis that two populations are the same, against an alternative hypothesis that a particular population tends to have larger values than the other.




scipy.stats.mannwhitneyu

scipy.stats.mannwhitneyu(x, y, use_continuity=True)
Computes the Mann-Whitney rank test on samples x and y.

Parameters:
x, y : array_like
    Array of samples; should be one-dimensional.
use_continuity : bool, optional
    Whether a continuity correction (1/2.) should be taken into account. Default is True.

Returns:
u : float
    The Mann-Whitney statistic.
prob : float
    One-sided p-value assuming an asymptotic normal distribution.

If you need to test whether the data is normally distributed or not, you can use:

scipy.stats.shapiro

scipy.stats.shapiro(x, a=None, reta=False)
Perform the Shapiro-Wilk test for normality.
The Shapiro-Wilk test tests the null hypothesis that the data was drawn from a normal distribution.

Parameters:
x : array_like
    Array of sample data.
a : array_like, optional
    Array of internal parameters used in the calculation. If these are not given, they will be computed internally. If x has length n, then a must have length n/2.
reta : bool, optional
    Whether or not to return the internally computed a values. The default is False.

Returns:
W : float
    The test statistic.
p-value : float
    The p-value for the hypothesis test.
a : array_like, optional
    If reta is True, then these are the internally computed "a" values that may be passed into this function on future calls.
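
A small illustration of shapiro on synthetic data (a sketch; the seed and sample sizes are arbitrary):

import numpy as np
import scipy.stats

np.random.seed(0)

normal_sample = np.random.normal(loc=0.0, scale=1.0, size=500)
W, p = scipy.stats.shapiro(normal_sample)
print(W, p)   # large p: no evidence against normality

skewed_sample = np.random.exponential(scale=1.0, size=500)
W, p = scipy.stats.shapiro(skewed_sample)
print(W, p)   # tiny p: normality rejected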