DATA Analytics: Data Wrangling Project 2 : Wrangle NYC Subway and Weather Data Exercise 6 -11

Exercise 6 - Combining NYC Subway Turnstile Data

def create_master_turnstile_file(filenames, output_file):

'''

Write a function that takes the files in the list filenames, which all have the

columns 'C/A, UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn', and consolidates

them into one file located at output_file. There should be ONE row with the column

headers, located at the top of the file.

For example, if file_1 has:

'C/A, UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn'

line 1 ...

line 2 ...

and another file, file_2 has:

'C/A, UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn'

line 3 ...

line 4 ...

line 5 ...

We need to combine file_1 and file_2 into a master_file like below:

'C/A, UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn'

line 1 ...

line 2 ...

line 3 ...

line 4 ...

line 5 ...

'''

with open(output_file, 'w') as master_file:

master_file.write('C/A,UNIT,SCP,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn\n')

#res = []

for filename in filenames:

master_file.write(''.join(open(filename).readlines()))

master_file.close()

Exercise 7 - Filtering "Non Regular" Data

import pandas

def filter_by_regular(filename):

'''

This function should read the csv file located at filename into a pandas dataframe,

and filter the dataframe to only rows where the 'DESCn' column has the value 'REGULAR'.

For example, if the pandas dataframe is as follows:

,C/A,UNIT,SCP,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn

0,A002,R051,02-00-00,05-01-11,00:00:00,REGULAR,3144312,1088151

1,A002,R051,02-00-00,05-01-11,04:00:00,DOOR,3144335,1088159

2,A002,R051,02-00-00,05-01-11,08:00:00,REGULAR,3144353,1088177

3,A002,R051,02-00-00,05-01-11,12:00:00,DOOR,3144424,1088231

The dataframe will look like below after filtering to only rows where DESCn column

has the value 'REGULAR':

0,A002,R051,02-00-00,05-01-11,00:00:00,REGULAR,3144312,1088151

2,A002,R051,02-00-00,05-01-11,08:00:00,REGULAR,3144353,1088177

'''

turnstile_data = pandas.read_csv(filename)

turnstile_data = turnstile_data[turnstile_data.DESCn.isin(["REGULAR"])]

return turnstile_data

Exercise 8 - Get Hourly Entries

import pandas

def get_hourly_entries(df):

'''

The data in the MTA Subway Turnstile data reports on the cumulative

number of entries and exits per row. Assume that you have a dataframe

called df that contains only the rows for a particular turnstile machine

(i.e., unique SCP, C/A, and UNIT). This function should change

these cumulative entry numbers to a count of entries since the last reading

(i.e., entries since the last row in the dataframe).

More specifically, you want to do two things:

1) Create a new column called ENTRIESn_hourly

2) Assign to the column the difference between ENTRIESn of the current row

and the previous row. If there is any NaN, fill/replace it with 1.

You may find the pandas functions shift() and fillna() to be helpful in this exercise.

Examples of what your dataframe should look like at the end of this exercise:

C/A UNIT SCP DATEn TIMEn DESCn ENTRIESn EXITSn ENTRIESn_hourly

0 A002 R051 02-00-00 05-01-11 00:00:00 REGULAR 3144312 1088151 1

1 A002 R051 02-00-00 05-01-11 04:00:00 REGULAR 3144335 1088159 23

2 A002 R051 02-00-00 05-01-11 08:00:00 REGULAR 3144353 1088177 18

3 A002 R051 02-00-00 05-01-11 12:00:00 REGULAR 3144424 1088231 71

4 A002 R051 02-00-00 05-01-11 16:00:00 REGULAR 3144594 1088275 170

5 A002 R051 02-00-00 05-01-11 20:00:00 REGULAR 3144808 1088317 214

6 A002 R051 02-00-00 05-02-11 00:00:00 REGULAR 3144895 1088328 87

7 A002 R051 02-00-00 05-02-11 04:00:00 REGULAR 3144905 1088331 10

8 A002 R051 02-00-00 05-02-11 08:00:00 REGULAR 3144941 1088420 36

9 A002 R051 02-00-00 05-02-11 12:00:00 REGULAR 3145094 1088753 153

10 A002 R051 02-00-00 05-02-11 16:00:00 REGULAR 3145337 1088823 243

...

'''

df['ENTRIESn_hourly'] = df['ENTRIESn'] - df['ENTRIESn'].shift(periods=1)

df = df.fillna(1)

return df

Exercise 9 - Get Hourly Exits

import pandas

def get_hourly_exits(df):

'''

The data in the MTA Subway Turnstile data reports on the cumulative

number of entries and exits per row. Assume that you have a dataframe

called df that contains only the rows for a particular turnstile machine

(i.e., unique SCP, C/A, and UNIT). This function should change

these cumulative exit numbers to a count of exits since the last reading

(i.e., exits since the last row in the dataframe).

More specifically, you want to do two things:

1) Create a new column called EXITSn_hourly

2) Assign to the column the difference between EXITSn of the current row

and the previous row. If there is any NaN, fill/replace it with 0.

You may find the pandas functions shift() and fillna() to be helpful in this exercise.

Example dataframe below:

Unnamed: 0 C/A UNIT SCP DATEn TIMEn DESCn ENTRIESn EXITSn ENTRIESn_hourly EXITSn_hourly

0 0 A002 R051 02-00-00 05-01-11 00:00:00 REGULAR 3144312 1088151 0 0

1 1 A002 R051 02-00-00 05-01-11 04:00:00 REGULAR 3144335 1088159 23 8

2 2 A002 R051 02-00-00 05-01-11 08:00:00 REGULAR 3144353 1088177 18 18

3 3 A002 R051 02-00-00 05-01-11 12:00:00 REGULAR 3144424 1088231 71 54

4 4 A002 R051 02-00-00 05-01-11 16:00:00 REGULAR 3144594 1088275 170 44

5 5 A002 R051 02-00-00 05-01-11 20:00:00 REGULAR 3144808 1088317 214 42

6 6 A002 R051 02-00-00 05-02-11 00:00:00 REGULAR 3144895 1088328 87 11

7 7 A002 R051 02-00-00 05-02-11 04:00:00 REGULAR 3144905 1088331 10 3

8 8 A002 R051 02-00-00 05-02-11 08:00:00 REGULAR 3144941 1088420 36 89

9 9 A002 R051 02-00-00 05-02-11 12:00:00 REGULAR 3145094 1088753 153 333

'''

df['EXITSn_hourly'] = df['EXITSn'] - df['EXITSn'].shift(periods=1)

df = df.fillna(0)

return df

Exercise 10 - Time to Hour

import pandas

import time as ti

def time_to_hour(time):

'''

Given an input variable time that represents time in the format of:

00:00:00 (hour:minutes:seconds)

Write a function to extract the hour part from the input variable time

and return it as an integer. For example:

1) if hour is 00, your code should return 0

2) if hour is 01, your code should return 1

3) if hour is 21, your code should return 21

Please return hour as an integer.

'''

hour = ti.strptime(time, "%H:%M:%S")[3]

return hour

Exercise 11 - Reformat Subway Dates

import datetime

import time as ti

def reformat_subway_dates(date):

'''

The dates in our subway data are formatted in the format month-day-year.

The dates in our weather underground data are formatted year-month-date.

In order to join these two data sets together, we'll want the dates formatted

the same way. Write a function that takes as its input a date in the MTA Subway

data format, and returns a date in the weather underground format.

Hint:

There is a useful function in the datetime library called strptime.

More info can be seen here:

http://docs.python.org/2/library/datetime.html#datetime.datetime.strptime

'''

temp = ti.strptime(date, "%m-%d-%y")

temp1 = temp[0]

temp2 = temp[1]

temp3 = temp[2]

dt = datetime.datetime(temp[0], temp[1], temp[2])

date_formatted = dt.strftime("%Y-%m-%d")

return date_formatted

DATA Analytics

Wednesday, June 18, 2014

Data Wrangling Project 2 : Wrangle NYC Subway and Weather Data Exercise 6 -11

No comments:

Post a Comment