Exercise 6 - Combining NYC Subway Turnstile Data
def create_master_turnstile_file(filenames, output_file):
'''
Write a
function that takes the files in the list filenames, which all have the
columns 'C/A,
UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn', and consolidates
them into one
file located at output_file. There
should be ONE row with the column
headers,
located at the top of the file.
For example, if
file_1 has:
'C/A, UNIT,
SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn'
line 1 ...
line 2 ...
and another
file, file_2 has:
'C/A, UNIT,
SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn'
line 3 ...
line 4 ...
line 5 ...
We need to
combine file_1 and file_2 into a master_file like below:
'C/A, UNIT,
SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn'
line 1 ...
line 2 ...
line 3 ...
line 4 ...
line 5 ...
'''
with
open(output_file, 'w') as master_file:
master_file.write('C/A,UNIT,SCP,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn\n')
#res = []
for filename
in filenames:
master_file.write(''.join(open(filename).readlines()))
master_file.close()
Exercise 7 - Filtering "Non Regular"
Data
import pandas
def filter_by_regular(filename):
'''
This function
should read the csv file located at filename into a pandas dataframe,
and filter the
dataframe to only rows where the 'DESCn' column has the value 'REGULAR'.
For example, if the
pandas dataframe is as follows:
,C/A,UNIT,SCP,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn
0,A002,R051,02-00-00,05-01-11,00:00:00,REGULAR,3144312,1088151
1,A002,R051,02-00-00,05-01-11,04:00:00,DOOR,3144335,1088159
2,A002,R051,02-00-00,05-01-11,08:00:00,REGULAR,3144353,1088177
3,A002,R051,02-00-00,05-01-11,12:00:00,DOOR,3144424,1088231
The dataframe will look like below after
filtering to only rows where DESCn column
has the value
'REGULAR':
0,A002,R051,02-00-00,05-01-11,00:00:00,REGULAR,3144312,1088151
2,A002,R051,02-00-00,05-01-11,08:00:00,REGULAR,3144353,1088177
'''
turnstile_data =
pandas.read_csv(filename)
turnstile_data =
turnstile_data[turnstile_data.DESCn.isin(["REGULAR"])]
return
turnstile_data
Exercise 8 - Get
Hourly Entries
import pandas
def get_hourly_entries(df):
'''
The data in the MTA
Subway Turnstile data reports on the cumulative
number of entries
and exits per row. Assume that you have
a dataframe
called df that
contains only the rows for a particular turnstile machine
(i.e., unique SCP,
C/A, and UNIT). This function should
change
these cumulative
entry numbers to a count of entries since the last reading
(i.e., entries
since the last row in the dataframe).
More specifically,
you want to do two things:
1) Create a new
column called ENTRIESn_hourly
2) Assign to the
column the difference between ENTRIESn of the current row
and the
previous row. If there is any NaN, fill/replace it with 1.
You may find the
pandas functions shift() and fillna() to be helpful in this exercise.
Examples of what
your dataframe should look like at the end of this exercise:
C/A UNIT
SCP DATEn TIMEn
DESCn ENTRIESn EXITSn
ENTRIESn_hourly
0 A002
R051 02-00-00 05-01-11
00:00:00 REGULAR 3144312
1088151 1
1 A002
R051 02-00-00 05-01-11
04:00:00 REGULAR 3144335
1088159 23
2 A002
R051 02-00-00 05-01-11
08:00:00 REGULAR 3144353
1088177 18
3 A002
R051 02-00-00 05-01-11
12:00:00 REGULAR 3144424
1088231 71
4 A002
R051 02-00-00 05-01-11
16:00:00 REGULAR 3144594
1088275 170
5 A002
R051 02-00-00 05-01-11
20:00:00 REGULAR 3144808
1088317 214
6 A002
R051 02-00-00 05-02-11
00:00:00 REGULAR 3144895
1088328 87
7 A002
R051 02-00-00 05-02-11
04:00:00 REGULAR 3144905
1088331 10
8 A002
R051 02-00-00 05-02-11
08:00:00 REGULAR 3144941
1088420 36
9 A002
R051 02-00-00 05-02-11
12:00:00 REGULAR 3145094
1088753 153
10 A002
R051 02-00-00 05-02-11
16:00:00 REGULAR 3145337
1088823 243
...
...
'''
df['ENTRIESn_hourly'] = df['ENTRIESn'] - df['ENTRIESn'].shift(periods=1)
df = df.fillna(1)
return df
Exercise 9 - Get
Hourly Exits
import pandas
def get_hourly_exits(df):
'''
The data in the MTA
Subway Turnstile data reports on the cumulative
number of entries
and exits per row. Assume that you have
a dataframe
called df that
contains only the rows for a particular turnstile machine
(i.e., unique SCP,
C/A, and UNIT). This function should
change
these cumulative
exit numbers to a count of exits since the last reading
(i.e., exits since
the last row in the dataframe).
More specifically,
you want to do two things:
1) Create a new
column called EXITSn_hourly
2) Assign to the
column the difference between EXITSn of the current row
and the
previous row. If there is any NaN, fill/replace it with 0.
You may find the
pandas functions shift() and fillna() to be helpful in this exercise.
Example dataframe below:
Unnamed:
0 C/A
UNIT SCP DATEn
TIMEn DESCn ENTRIESn
EXITSn ENTRIESn_hourly EXITSn_hourly
0 0
A002 R051 02-00-00
05-01-11 00:00:00 REGULAR
3144312 1088151 0 0
1 1
A002 R051 02-00-00
05-01-11 04:00:00 REGULAR
3144335 1088159 23 8
2 2
A002 R051 02-00-00
05-01-11 08:00:00 REGULAR
3144353 1088177 18 18
3 3
A002 R051 02-00-00
05-01-11 12:00:00 REGULAR
3144424 1088231 71 54
4 4
A002 R051 02-00-00
05-01-11 16:00:00 REGULAR
3144594 1088275 170 44
5 5
A002 R051 02-00-00
05-01-11 20:00:00 REGULAR
3144808 1088317 214 42
6 6
A002 R051 02-00-00
05-02-11 00:00:00 REGULAR
3144895 1088328 87 11
7 7
A002 R051 02-00-00
05-02-11 04:00:00 REGULAR
3144905 1088331 10 3
8 8
A002 R051 02-00-00
05-02-11 08:00:00 REGULAR
3144941 1088420 36 89
9 9
A002 R051 02-00-00
05-02-11 12:00:00 REGULAR
3145094 1088753 153 333
'''
df['EXITSn_hourly']
= df['EXITSn'] - df['EXITSn'].shift(periods=1)
df = df.fillna(0)
return df
Exercise 10 - Time to
Hour
import pandas
import time as ti
def time_to_hour(time):
'''
Given an input
variable time that represents time in the format of:
00:00:00
(hour:minutes:seconds)
Write a function to
extract the hour part from the input variable time
and return it as an
integer. For example:
1) if hour is 00,
your code should return 0
2) if hour is
01, your code should return 1
3) if hour is
21, your code should return 21
Please return hour
as an integer.
'''
hour =
ti.strptime(time, "%H:%M:%S")[3]
return hour
Exercise 11 - Reformat
Subway Dates
import datetime
import time as ti
def reformat_subway_dates(date):
'''
The dates in our
subway data are formatted in the format month-day-year.
The dates in our
weather underground data are formatted year-month-date.
In order to join
these two data sets together, we'll want the dates formatted
the same way. Write a function that takes as its input a
date in the MTA Subway
data format, and
returns a date in the weather underground format.
Hint:
There is a useful
function in the datetime library called strptime.
More info can be
seen here:
http://docs.python.org/2/library/datetime.html#datetime.datetime.strptime
'''
temp =
ti.strptime(date, "%m-%d-%y")
temp1 = temp[0]
temp2 = temp[1]
temp3 = temp[2]
dt =
datetime.datetime(temp[0], temp[1], temp[2])
date_formatted =
dt.strftime("%Y-%m-%d")
return
date_formatted
No comments:
Post a Comment