DATA Analytics: Basics of Statistics and Machine Learning Used to Analyze Data (1)

Basic Distribution Model:

Normal:

T-Test:

A t-test is any statistical hypothesis test in which the test statistic follows a Student's t distribution if the null hypothesis is true. It can be used to determine whether two sets of data are significantly different from each other. It is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known; when that scaling term is unknown and is replaced by an estimate based on the data, the test statistic follows a Student's t distribution instead.

Two Versions of the T-Test:

Equal Sample Size
Same Variance
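
Reading the two items above as the assumptions of the classic Student's test, a minimal sketch can show what changes when the equal-variance assumption is dropped (Welch's test). The sample values and variable names below are made up for illustration; `equal_var` is the `scipy.stats.ttest_ind` parameter that switches between the two tests.

```python
import numpy as np
from scipy import stats

# Hypothetical samples: unequal sizes and visibly different spreads.
a = np.array([0.28, 0.30, 0.27, 0.31, 0.29, 0.30])
b = np.array([0.20, 0.35, 0.25, 0.40])

# Student's t-test: assumes both groups share the same variance.
student = stats.ttest_ind(a, b, equal_var=True)

# Welch's t-test: does not assume equal variances.
welch = stats.ttest_ind(a, b, equal_var=False)

# Welch's statistic computed by hand: difference of means divided by
# the square root of the summed per-group variance-over-size terms.
se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
t_manual = (a.mean() - b.mean()) / se

print("Student:", student.statistic, student.pvalue)
print("Welch:  ", welch.statistic, welch.pvalue)
print("Manual Welch t:", t_manual)
```

With unequal sample sizes and spreads like these, the two tests give different statistics, and the hand-computed value should agree with the Welch result.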

P-Value

The p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.
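
As a rough illustration of this definition, the sketch below turns a t statistic into a two-sided p-value using the survival function of Student's t distribution. The statistic and the degrees of freedom are made-up numbers, not results from any data set.

```python
from scipy import stats

t_observed = 2.1   # hypothetical test statistic
dof = 18           # hypothetical degrees of freedom

# Probability, under the null hypothesis, of seeing a statistic at least
# as extreme as t_observed in either direction (two-sided test).
p_value = 2 * stats.t.sf(abs(t_observed), dof)

print(p_value)
```

A small p-value means the observed statistic would be unusual if the null hypothesis were true; it is not the probability that the null hypothesis is true.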

Welch's T-Test Exercise:

import numpy
import scipy.stats
import pandas

def compare_averages(filename):
    """
    Performs a t-test on two sets of baseball data (left-handed and right-handed hitters).

    You will be given a csv file that has three columns. A player's
    name, handedness (L for lefthanded or R for righthanded) and their
    career batting average (called 'avg'). You can look at the csv
    file via the following link:
    https://www.dropbox.com/s/xcn0u2uxm8c4n6l/baseball_data.csv

    Write a function that will read the csv file into a pandas data frame,
    and run Welch's t-test on the two cohorts defined by handedness.
    One cohort should be a data frame of right-handed batters. And the other
    cohort should be a data frame of left-handed batters.

    We have included the scipy.stats library to help you write
    or implement Welch's t-test:
    http://docs.scipy.org/doc/scipy/reference/stats.html

    At the 95% confidence level (p-value threshold of 0.05), if there is
    no difference between the two cohorts, return a tuple consisting of
    True, and then the tuple returned by scipy.stats.ttest_ind.
    If there is a difference, return a tuple consisting of
    False, and then the tuple returned by scipy.stats.ttest_ind.

    For example, the tuple that you return may look like:
    (True, (9.93570222, 0.000023))
    """
    baseballdf = pandas.read_csv(filename)

    # Split the players into the two cohorts by handedness.
    leftdf = baseballdf[baseballdf['handedness'] == "L"]
    rightdf = baseballdf[baseballdf['handedness'] == "R"]

    # equal_var=False selects Welch's t-test instead of Student's.
    stat = scipy.stats.ttest_ind(rightdf['avg'], leftdf['avg'], equal_var=False)

    # stat[1] is the p-value: above 0.05 means no significant difference.
    return (stat[1] > 0.05, stat)
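
The same steps can be tried without downloading the file. The sketch below uses a few made-up rows in the three-column layout the exercise describes; the names and averages are invented for illustration, and `alpha = 0.05` is an assumed reading of the "95%" wording.

```python
import io

import pandas
import scipy.stats

# Made-up rows in the same three-column layout as the exercise file.
csv_text = """name,handedness,avg
A,L,0.310
B,L,0.295
C,L,0.305
D,L,0.300
E,R,0.250
F,R,0.245
G,R,0.260
H,R,0.255
"""

df = pandas.read_csv(io.StringIO(csv_text))
left = df[df['handedness'] == 'L']['avg']
right = df[df['handedness'] == 'R']['avg']

# Welch's t-test (equal_var=False), as in the exercise.
result = scipy.stats.ttest_ind(right, left, equal_var=False)

alpha = 0.05  # assumed threshold for the "95%" wording
no_difference = result.pvalue > alpha

print(no_difference, result.statistic, result.pvalue)
```

Because the invented right-handed averages sit well below the left-handed ones with little spread, the test reports a difference for this toy data; real data may behave very differently.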


Friday, June 20, 2014
