Exercise 6 - Combining NYC Subway Turnstile Data
def create_master_turnstile_file(filenames, output_file):
    '''
    Write a
function that takes the files in the list filenames, which all have the 
    columns 'C/A,
UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn', and consolidates
    them into one
file located at output_file.  There
should be ONE row with the column
    headers,
located at the top of the file.
    For example, if
file_1 has:
    'C/A, UNIT,
SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn'
    line 1 ...
    line 2 ...
    and another
file, file_2 has:
    'C/A, UNIT,
SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn'
    line 3 ...
    line 4 ...
    line 5 ...
    We need to
combine file_1 and file_2 into a master_file like below:
     'C/A, UNIT,
SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn'
    line 1 ...
    line 2 ...
    line 3 ...
    line 4 ...
    line 5 ...
    '''
    with
open(output_file, 'w') as master_file:
      
master_file.write('C/A,UNIT,SCP,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn\n')
       #res = []
       for filename
in filenames:
           
master_file.write(''.join(open(filename).readlines()))
      
master_file.close()     
 Exercise 7 - Filtering "Non Regular"
Data
 import pandas
def filter_by_regular(filename):
    '''
    This function
should read the csv file located at filename into a pandas dataframe,
    and filter the
dataframe to only rows where the 'DESCn' column has the value 'REGULAR'.
    For example, if the
pandas dataframe is as follows:
   
,C/A,UNIT,SCP,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn
    0,A002,R051,02-00-00,05-01-11,00:00:00,REGULAR,3144312,1088151
   
1,A002,R051,02-00-00,05-01-11,04:00:00,DOOR,3144335,1088159
   
2,A002,R051,02-00-00,05-01-11,08:00:00,REGULAR,3144353,1088177
   
3,A002,R051,02-00-00,05-01-11,12:00:00,DOOR,3144424,1088231
    The dataframe will look like below after
filtering to only rows where DESCn column
    has the value
'REGULAR':
   
0,A002,R051,02-00-00,05-01-11,00:00:00,REGULAR,3144312,1088151
   
2,A002,R051,02-00-00,05-01-11,08:00:00,REGULAR,3144353,1088177
    '''
    turnstile_data =
pandas.read_csv(filename)
    turnstile_data =
turnstile_data[turnstile_data.DESCn.isin(["REGULAR"])]
    return
turnstile_data
Exercise 8 - Get
Hourly Entries
import pandas
def get_hourly_entries(df):
    '''
    The data in the MTA
Subway Turnstile data reports on the cumulative
    number of entries
and exits per row.  Assume that you have
a dataframe
    called df that
contains only the rows for a particular turnstile machine
    (i.e., unique SCP,
C/A, and UNIT).  This function should
change
    these cumulative
entry numbers to a count of entries since the last reading
    (i.e., entries
since the last row in the dataframe).
    More specifically,
you want to do two things:
       1) Create a new
column called ENTRIESn_hourly
       2) Assign to the
column the difference between ENTRIESn of the current row 
          and the
previous row. If there is any NaN, fill/replace it with 1.
    You may find the
pandas functions shift() and fillna() to be helpful in this exercise.
    Examples of what
your dataframe should look like at the end of this exercise:
           C/A  UNIT      
SCP     DATEn     TIMEn   
DESCn  ENTRIESn    EXITSn 
ENTRIESn_hourly
    0     A002 
R051  02-00-00  05-01-11 
00:00:00  REGULAR   3144312  
1088151                1
    1     A002 
R051  02-00-00  05-01-11 
04:00:00  REGULAR   3144335  
1088159               23
    2     A002 
R051  02-00-00  05-01-11 
08:00:00  REGULAR   3144353  
1088177               18
    3     A002 
R051  02-00-00  05-01-11 
12:00:00  REGULAR   3144424  
1088231               71
    4     A002 
R051  02-00-00  05-01-11 
16:00:00  REGULAR   3144594  
1088275              170
    5     A002 
R051  02-00-00  05-01-11 
20:00:00  REGULAR   3144808  
1088317              214
    6     A002 
R051  02-00-00  05-02-11 
00:00:00  REGULAR   3144895  
1088328               87
    7     A002 
R051  02-00-00  05-02-11 
04:00:00  REGULAR   3144905  
1088331               10
    8     A002 
R051  02-00-00  05-02-11 
08:00:00  REGULAR   3144941  
1088420               36
    9     A002 
R051  02-00-00  05-02-11 
12:00:00  REGULAR   3145094  
1088753              153
    10    A002 
R051  02-00-00  05-02-11 
16:00:00  REGULAR   3145337  
1088823              243
    ...
    ...
    '''
   
df['ENTRIESn_hourly'] = df['ENTRIESn'] - df['ENTRIESn'].shift(periods=1)
    df = df.fillna(1)
    return df
Exercise 9 - Get
Hourly Exits
import pandas
def get_hourly_exits(df):
    '''
    The data in the MTA
Subway Turnstile data reports on the cumulative
    number of entries
and exits per row.  Assume that you have
a dataframe
    called df that
contains only the rows for a particular turnstile machine
    (i.e., unique SCP,
C/A, and UNIT).  This function should
change
    these cumulative
exit numbers to a count of exits since the last reading
    (i.e., exits since
the last row in the dataframe).
    More specifically,
you want to do two things:
       1) Create a new
column called EXITSn_hourly
       2) Assign to the
column the difference between EXITSn of the current row 
          and the
previous row. If there is any NaN, fill/replace it with 0.
    You may find the
pandas functions shift() and fillna() to be helpful in this exercise.
    Example dataframe below:
          Unnamed:
0   C/A 
UNIT       SCP     DATEn    
TIMEn    DESCn  ENTRIESn   
EXITSn  ENTRIESn_hourly  EXITSn_hourly
    0              0 
A002  R051  02-00-00 
05-01-11  00:00:00  REGULAR  
3144312   1088151                0              0
    1              1 
A002  R051  02-00-00 
05-01-11  04:00:00  REGULAR  
3144335   1088159               23              8
    2              2 
A002  R051  02-00-00 
05-01-11  08:00:00  REGULAR  
3144353   1088177               18             18
    3              3 
A002  R051  02-00-00 
05-01-11  12:00:00  REGULAR  
3144424   1088231               71             54
    4              4 
A002  R051  02-00-00 
05-01-11  16:00:00  REGULAR  
3144594   1088275              170             44
    5              5 
A002  R051  02-00-00 
05-01-11  20:00:00  REGULAR  
3144808   1088317              214             42
    6              6 
A002  R051  02-00-00 
05-02-11  00:00:00  REGULAR  
3144895   1088328               87             11
    7              7 
A002  R051  02-00-00 
05-02-11  04:00:00  REGULAR  
3144905   1088331               10              3
    8              8 
A002  R051  02-00-00 
05-02-11  08:00:00  REGULAR  
3144941   1088420               36             89
    9              9 
A002  R051  02-00-00 
05-02-11  12:00:00  REGULAR  
3145094   1088753              153            333
    '''
    df['EXITSn_hourly']
= df['EXITSn'] - df['EXITSn'].shift(periods=1) 
    df = df.fillna(0)
    return df
Exercise 10 - Time to
Hour
import pandas
import time as ti
def time_to_hour(time):
    '''
    Given an input
variable time that represents time in the format of:
    00:00:00
(hour:minutes:seconds)
    Write a function to
extract the hour part from the input variable time
    and return it as an
integer. For example:
        1) if hour is 00,
your code should return 0
        2) if hour is
01, your code should return 1
        3) if hour is
21, your code should return 21
    Please return hour
as an integer.
    '''
    hour =
ti.strptime(time, "%H:%M:%S")[3]
    return hour
Exercise 11 - Reformat
Subway Dates
import datetime
import time as ti
def reformat_subway_dates(date):
    '''
    The dates in our
subway data are formatted in the format month-day-year.
    The dates in our
weather underground data are formatted year-month-date.
    In order to join
these two data sets together, we'll want the dates formatted
    the same way.  Write a function that takes as its input a
date in the MTA Subway
    data format, and
returns a date in the weather underground format.
    Hint: 
    There is a useful
function in the datetime library called strptime. 
    More info can be
seen here:
   
http://docs.python.org/2/library/datetime.html#datetime.datetime.strptime
    '''
    temp =
ti.strptime(date, "%m-%d-%y")
    temp1 = temp[0]
    temp2 = temp[1]
    temp3 = temp[2]
    dt =
datetime.datetime(temp[0], temp[1], temp[2])
    date_formatted =
dt.strftime("%Y-%m-%d")
    return
date_formatted
 
No comments:
Post a Comment