Exercise 3 - Busiest Hour
import sys
import logging

from util import mapper_logfile
logging.basicConfig(filename=mapper_logfile, format='%(message)s',
                    level=logging.INFO, filemode='w')

def mapper():
    """
    In this exercise, for each turnstile unit, you will determine the date and time
    (in the span of this data set) at which the most people entered through the unit.

    The input to the mapper will be the final Subway-MTA dataset, the same as
    in the previous exercise. You can check out the csv and its structure below:
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv

    For each line, the mapper should return the UNIT, ENTRIESn_hourly, DATEn, and
    TIMEn columns, separated by tabs. For example:
    'R001\t100000.0\t2011-05-01\t01:00:00'

    Since you are printing the output of your program, printing a debug
    statement will interfere with the operation of the grader. Instead,
    use the logging module, which we've configured to log to a file printed
    when you click "Test Run". For example:
    logging.info("My debugging message")
    """
    for line in sys.stdin:
        data = line.strip().split(",")
        # Skip the header row, and any row that does not have the expected
        # 22 columns (indexing such a row could raise an IndexError).
        if len(data) != 22 or data[6] == 'ENTRIESn_hourly':
            continue
        # Columns: 1 = UNIT, 6 = ENTRIESn_hourly, 2 = DATEn, 3 = TIMEn.
        # A single-argument print(...) behaves the same in Python 2 and 3.
        print("{0}\t{1}\t{2}\t{3}".format(data[1], data[6], data[2], data[3]))

mapper()
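
To try the mapper logic without the grader, the same filtering can be wrapped in a function that takes lines directly instead of reading sys.stdin. The following is only a sketch: map_lines and make_row are hypothetical helper names, the sample rows are made up for illustration, and the sketch assumes rows of unexpected length should be skipped along with the header.

```python
NUM_COLUMNS = 22  # column count the mapper checks for


def make_row(unit, date, time, entries):
    """Build a made-up 22-column CSV line; only columns 1, 2, 3 and 6 matter here."""
    fields = ["0"] * NUM_COLUMNS
    fields[1], fields[2], fields[3], fields[6] = unit, date, time, entries
    return ",".join(fields)


def map_lines(lines):
    """Yield tab-separated UNIT, ENTRIESn_hourly, DATEn, TIMEn for each data row."""
    for line in lines:
        data = line.strip().split(",")
        # Assumption: skip the header row and any row of unexpected length.
        if len(data) != NUM_COLUMNS or data[6] == "ENTRIESn_hourly":
            continue
        yield "{0}\t{1}\t{2}\t{3}".format(data[1], data[6], data[2], data[3])


# Illustrative input: a header, one valid row, and one malformed row.
sample = [
    make_row("UNIT", "DATEn", "TIMEn", "ENTRIESn_hourly"),
    make_row("R001", "2011-05-01", "01:00:00", "100000.0"),
    "short,row",
]
mapped = list(map_lines(sample))
for record in mapped:
    print(record)
```

Only the valid row survives; the header and the malformed row are dropped.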
---------------------------------------------------------------------------------------------------------------------
import sys
import logging

from util import reducer_logfile
logging.basicConfig(filename=reducer_logfile, format='%(message)s',
                    level=logging.INFO, filemode='w')

def reducer():
    '''
    Write a reducer that will compute the busiest date and time (that is, the
    date and time with the most entries) for each turnstile unit. Ties should
    be broken in favor of datetimes that are later on in the month of May. You
    may assume that the contents of the reducer will be sorted so that all entries
    corresponding to a given UNIT will be grouped together.

    The reducer should print its output with the UNIT name, the datetime (which
    is the DATEn followed by the TIMEn column, separated by a single space), and
    the number of entries at this datetime, separated by tabs.

    For example, the output of the reducer should look like this:
    R001 2011-05-11 17:00:00 31213.0
    R002 2011-05-12 21:00:00 4295.0
    R003 2011-05-05 12:00:00 995.0
    R004 2011-05-12 12:00:00 2318.0
    R005 2011-05-10 12:00:00 2705.0
    R006 2011-05-25 12:00:00 2784.0
    R007 2011-05-10 12:00:00 1763.0
    R008 2011-05-12 12:00:00 1724.0
    R009 2011-05-05 12:00:00 1230.0
    R010 2011-05-09 18:00:00 30916.0
    ...
    ...

    Since you are printing the output of your program, printing a debug
    statement will interfere with the operation of the grader. Instead,
    use the logging module, which we've configured to log to a file printed
    when you click "Test Run". For example:
    logging.info("My debugging message")
    '''
    max_entries = 0.0
    old_key = None
    datetime = ''

    for line in sys.stdin:
        data = line.strip().split('\t')
        if len(data) != 4:
            continue  # skip malformed lines

        this_key, count, date, time = data
        count = float(count)

        # A new UNIT has started: emit the result for the previous one and reset.
        if old_key is not None and old_key != this_key:
            print("{0}\t{1}\t{2}".format(old_key, datetime, max_entries))
            max_entries = 0.0
            datetime = ''

        old_key = this_key

        # ">=" lets a later line win a tie, which favors later datetimes
        # as long as each unit's lines arrive in chronological order.
        if count >= max_entries:
            max_entries = count
            datetime = date + ' ' + time

    # Emit the result for the last UNIT.
    if old_key is not None:
        print("{0}\t{1}\t{2}".format(old_key, datetime, max_entries))
        logging.info("{0}\t{1}\t{2}".format(old_key, datetime, max_entries))

reducer()
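
The tie-breaking rule can be checked locally by running the reducer logic over a few mapper-style records. This is a rough sketch rather than the grader's actual pipeline: reduce_lines is a hypothetical generator version of the reducer, the records are made up, and the sorted() call stands in for the shuffle step under the assumption that each unit's records reach the reducer in chronological order.

```python
def reduce_lines(lines):
    """Yield 'UNIT, datetime, entries' (tab-separated) for each unit's busiest datetime."""
    old_key, best_datetime, max_entries = None, "", 0.0
    for line in lines:
        data = line.strip().split("\t")
        if len(data) != 4:
            continue  # skip malformed records
        this_key, count, date, time = data
        count = float(count)
        # A new unit has started: emit the previous unit's result and reset.
        if old_key is not None and old_key != this_key:
            yield "{0}\t{1}\t{2}".format(old_key, best_datetime, max_entries)
            best_datetime, max_entries = "", 0.0
        old_key = this_key
        # ">=" means the later of two tied records wins.
        if count >= max_entries:
            max_entries = count
            best_datetime = date + " " + time
    if old_key is not None:
        yield "{0}\t{1}\t{2}".format(old_key, best_datetime, max_entries)


# Made-up mapper output, deliberately out of order; R001 has a tie at 300.0.
mapped = [
    "R002\t50.0\t2011-05-02\t04:00:00",
    "R001\t300.0\t2011-05-03\t12:00:00",
    "R001\t300.0\t2011-05-01\t12:00:00",
    "R001\t120.0\t2011-05-02\t08:00:00",
]


def sort_key(record):
    unit, _count, date, time = record.split("\t")
    return (unit, date, time)


# Stand-in for the shuffle step: group by unit, then order chronologically.
shuffled = sorted(mapped, key=sort_key)
results = list(reduce_lines(shuffled))
for result in results:
    print(result)
```

For R001 the tie at 300.0 resolves to 2011-05-03, the later date. If within-unit ordering were not guaranteed, comparing (count, datetime) pairs explicitly would make the tie-break independent of input order, since these date and time strings sort chronologically.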