DATA Analytics: Data Wrangling Project 2 : Wrangle NYC Subway and Weather Data Exercise 1

Project 2 : Wrangle NYC Subway and Weather Data

Exercise 1 - Number of Rainy Days

import pandas

import pandasql

def num_rainy_days(filename):

'''

This function should run a SQL query on a dataframe of

weather data. The SQL query should return one column and

one row - a count of the number of days in the dataframe where

the rain column is equal to 1 (i.e., the number of days it

rained). The dataframe will be titled 'weather_data'. You'll

need to provide the SQL query. You might find SQL's count function

useful for this exercise. You can read more about it here:

https://dev.mysql.com/doc/refman/5.1/en/counting-rows.html

You might also find that interpreting numbers as integers or floats may not

work initially. In order to get around this issue, it may be equal to cast

these numbers as integers. This can be done by writing cast(column as integer).

So for example, if we wanted to cast the maxtempi column as an integer, we would actually

write something like where cast(maxtempi as integer) = 76, as opposed to simply

where maxtempi = 76.

You can see the weather data that we are passing in below:

https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv

'''

weather_data = pandas.read_csv(filename)

q = """

select count(*) from weather_data where rain=1;

"""

#Execute your SQL command against the pandas frame

rainy_days = pandasql.sqldf(q.lower(), locals())

return rainy_days

Exercise 2: > Temp on Foggy and Non Foggy Days

import pandas

import pandasql

def max_temp_aggregate_by_fog(filename):

'''

This function should run a SQL query on a dataframe of

weather data. The SQL query should return two columns and

two rows - whether it was foggy or not (0 or 1) and the max

maxtempi for that fog value (i.e., the maximum max temperature

for both foggy and non-foggy days). The dataframe will be

titled 'weather_data'. You'll need to provide the SQL query.

You might also find that interpreting numbers as integers or floats may not

work initially. In order to get around this issue, it may be useful to cast

these numbers as integers. This can be done by writing cast(column as integer).

So for example, if we wanted to cast the maxtempi column as an integer, we would actually

write something like where cast(maxtempi as integer) = 76, as opposed to simply

where maxtempi = 76.

You can see the weather data that we are passing in below:

https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv

'''

weather_data = pandas.read_csv(filename)

q = """

select fog, MAX(cast (maxtempi as integer) ) from weather_data GROUP BY fog;

"""

#Execute your SQL command against the pandas frame

rainy_days = pandasql.sqldf(q.lower(), locals())

return rainy_days

06/17/2014

Exercise 3 - Mean Temp on Weekends

import pandas

import pandasql

def avg_min_temperature(filename):

'''

This function should run a SQL query on a dataframe of

weather data. The SQL query should return one column and

one row - the average meantempi on days that are a Saturday

or Sunday (i.e., the the average mean temperature on weekends).

The dataframe will be titled 'weather_data' and you can access

the date in the dataframe via the 'date' column.

You'll need to provide the SQL query.

You might also find that interpreting numbers as integers or floats may not

work initially. In order to get around this issue, it may be equal to cast

these numbers as integers. This can be done by writing cast(column as integer).

So for example, if we wanted to cast the maxtempi column as an integer, we would actually

write something like where cast(maxtempi as integer) = 76, as opposed to simply

where maxtempi = 76.

Also, you can convert dates to days of the week via the 'strftime' keyword in SQL.

For example, cast (strftime('%w', date) as integer) will return 0 if the date

is a Sunday or 6 if the date is a Saturday.

You can see the weather data that we are passing in below:

https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv

'''

weather_data = pandas.read_csv(filename)

q = """

select AVG(cast (meantempi as integer)) from weather_data where cast(strftime('%w',date) as integer) = 0 or

cast(strftime('%w',date) as integer) = 6;

"""

#Execute your SQL command against the pandas frame

mean_temp_weekends = pandasql.sqldf(q.lower(), locals())

return mean_temp_weekends

Exercise 4 - Mean Temp on Rainy Days

import pandas

import pandasql

def avg_min_temperature(filename):

'''

This function should run a SQL query on a dataframe of

weather data. More specifically you want to find the average

minimum temperature on rainy days where the minimum temperature

is greater than 55 degrees.

You might also find that interpreting numbers as integers or floats may not

work initially. In order to get around this issue, it may be equal to cast

these numbers as integers. This can be done by writing cast(column as integer).

So for example, if we wanted to cast the maxtempi column as an integer, we would actually

write something like where cast(maxtempi as integer) = 76, as opposed to simply

where maxtempi = 76.

You can see the weather data that we are passing in below:

https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv

'''

weather_data = pandas.read_csv(filename)

q = """

select AVG(cast (mintempi as integer)) from weather_data where cast(mintempi as integer) > 55

and rain =1;

"""

#Execute your SQL command against the pandas frame

mean_temp_weekends = pandasql.sqldf(q.lower(), locals())

return mean_temp_weekends

Exercise 5 - Fixing NYC Subway Turnstile Data

Input file :

A002,R051,02-00-00,05-21-11,00:00:00,REGULAR,003169391,001097585,05-21-11,04:00:00,REGULAR,003169415,001097588,05-21-11,08:00:00,REGULAR,003169431,001097607,05-21-11,12:00:00,REGULAR,003169506,001097686,05-21-11,16:00:00,REGULAR,003169693,001097734,05-21-11,20:00:00,REGULAR,003169998,001097769,05-22-11,00:00:00,REGULAR,003170119,001097792,05-22-11,04:00:00,REGULAR,003170146,001097801

Output File:

A002,R051,02-00-00,05-21-11,00:00:00,REGULAR,003169391,001097585

A002,R051,02-00-00,05-21-11,04:00:00,REGULAR,003169415,001097588

A002,R051,02-00-00,05-21-11,08:00:00,REGULAR,003169431,001097607

A002,R051,02-00-00,05-21-11,12:00:00,REGULAR,003169506,001097686

A002,R051,02-00-00,05-21-11,16:00:00,REGULAR,003169693,001097734

import csv

def fix_turnstile_data(filenames):

'''

Filenames is a list of MTA Subway turnstile text files. A link to an example

MTA Subway turnstile text file can be seen at the URL below:

http://web.mta.info/developers/data/nyct/turnstile/turnstile_110507.txt

As you can see, there are numerous data points included in each row of the

a MTA Subway turnstile text file.

You want to write a function that will update each row in the text

file so there is only one entry per row. A few examples below:

A002,R051,02-00-00,05-28-11,00:00:00,REGULAR,003178521,001100739

A002,R051,02-00-00,05-28-11,04:00:00,REGULAR,003178541,001100746

A002,R051,02-00-00,05-28-11,08:00:00,REGULAR,003178559,001100775

Write the updates to a different text file in the format of "updated_" + filename.

For example:

1) if you read in a text file called "turnstile_110521.txt"

2) you should write the updated data to "updated_turnstile_110521.txt"

The order of the fields should be preserved.

You can see a sample of the turnstile text file that's passed into this function

and the the corresponding updated file in the links below:

Sample input file:

https://www.dropbox.com/s/mpin5zv4hgrx244/turnstile_110528.txt

Sample updated file:

https://www.dropbox.com/s/074xbgio4c39b7h/solution_turnstile_110528.txt

'''

for name in filenames:

ifile = open(name, "rb")

reader = csv.reader(ifile)

outputfile = 'updated_' + name

ofile = open(outputfile, "wb")

writer = csv.writer(ofile, delimiter=',')

for row in reader:

row = [x.strip(' ') for x in row]

for i in range( 0 , 7):

if row[3+5*i:8+5*i] == []:

break;

else:

writer.writerow(row[0:3] + row[3+5*i:8+5*i])

ifile.close()

ofile.close()

DATA Analytics

Wednesday, June 18, 2014

Data Wrangling Project 2 : Wrangle NYC Subway and Weather Data Exercise 1 - 5

1 comment: