Saturday, July 26, 2014

Realistic Test Data Generator

Great site for generating CSV files with random and customized data:
http://www.mockaroo.com/

Wednesday, July 23, 2014

Harvard Business Review, Article #3, Why IT Fumbles Analytics

HBR  
Why IT Fumbles Analytics
by Donald A. Marchand and Joe Peppard Jan 2013
http://hbr.org/2013/01/why-it-fumbles-analytics/ar/1

Preface

  • Massive amounts of data now available from internal and external sources
  • Companies are spending heavily on IT tools and hiring data scientists
  • They treat their big data and analytics projects the same way they treat all IT projects 
  • Not realizing that the two are completely different animals
Conventional approach to an IT project

  • Focuses on building and deploying the technology :
    • On time 
    • To plan
    • Within budget
  • The information requirements and technology specifications are established up front, at the design stage
  • This approach works fine if the goal is to improve business processes
  • Such projects improve efficiency, lower costs, and increase productivity, yet executives are still dissatisfied.
    • The reason: once the system goes live, no one pays any attention to figuring out how to use the information it generates to make better decisions or gain deeper insights
Where is the problem?
  • It’s crucial to understand how people create and use information
  • Project teams need members well versed in the cognitive and behavioral sciences, not just in engineering, computer science, and math
  • Deploying analytical IT tools is relatively easy. Understanding how they might be used is much less clear
  • No one knows the decisions the tools will be asked to support 
  • No one knows the questions they will be expected to help answer
  • Therefore, a big data or analytics project can’t be treated like a conventional, large IT project with its:
    • defined outcomes
    • required tasks
    • detailed plans
Five guidelines for taking this voyage of discovery
  • Place People at the Heart of the Initiative
    • WRONG: The logic behind many investments in IT tools and big data initiatives is that giving managers more high-quality information more rapidly will:
      • Improve their decisions
      • Help them solve problems
      • Help them gain valuable insights
    • Why this is wrong: it ignores the fact that
      • Managers might discard information no matter how good it is
      • They have various biases
      • They might not have the cognitive ability to use information effectively
      • The reality is that many people, including managers, are uncomfortable working with data
    • Instead, the initiative must place users, the people who will create meaning from the information, at its heart
      • Understand how they do or do not use data in reaching conclusions and making decisions
  • Emphasize Information Use as the Way to Unlock Value from IT
    • People don’t think in a vacuum; they make sense of situations on the basis of their own knowledge, mental models, and experiences
    • They also use information in different ways, depending on the context
    • An organization’s culture can frame how people make decisions, collaborate, and share knowledge
    • People use information dynamically and iteratively
    • The design of most IT systems takes into account the data that have been identified as important and controllable
      • That approach is fine for activities that are highly structured and whose tasks can be precisely described
      • It is ideal for moving information from the human domain into the technology domain
    • The problem is that many organizations mistakenly apply the same design philosophy to the task of getting data out of the technology domain and into the human domain
    • Analytics projects succeed by challenging and improving the way: 
      • Information is used
      • Questions are answered
      • Decisions are made
    • Here are some ways to do this:
      • Ask second-order questions, that is, questions about questions
      • Discover what data you do and do not have
      • Give IT project teams the freedom to reframe business problems

  • Equip IT Project Teams With Cognitive and Behavioral Scientists
    • Most IT professionals have engineering, computer science, and math backgrounds
    • They are generally very logical and are strong process thinkers
    • Big Data needs people who understand how people:
      • Perceive problems 
      • Use information 
      • Analyze data in developing solutions ideas and knowledge
    • This shift is akin to the move from traditional economics to behavioral economics
    • Organizations that want employees to be more data-oriented in their thinking and decision making must train them to know:
      • When to draw on data
      • How to frame questions
      • How to build hypotheses
      • How to conduct experiments
      • How to interpret results

  • Focus on Learning
    • Big data and other analytics projects are more akin to scientific research and clinical trials than to IT initiatives
    • In short, they are opportunities for discovery
    • How to make learning a central focus of big data and analytics projects:
      • Promote and facilitate a culture of information sharing
      • Expose your assumptions, biases, and blind spots
      • Strive to demonstrate cause and effect
      • Identify the appropriate techniques and tools

  • Worry More About Solving Business Problems than About Deploying Technology
    • Conventional IT project management is risk-averse
    • Big data projects should focus less on managing the risks of deploying technology and more on solving business problems
    • The paradox is that the technologies that were supposed to help manage data are now causing a massive deluge


Monday, July 21, 2014

Harvard Business Review, Article #2, The New Patterns of Innovation

The New Patterns of Innovation
How to use data to drive growth 
by Rashik Parmar, Ian Mackenzie, David Cohn, and David Gann Jan 2014


http://hbr.org/2014/01/the-new-patterns-of-innovation/ar/1

Idea in Brief

  • THE CHALLENGE
    • Established companies are notoriously bad at finding new ways to make money, despite the pressure on them to grow.
  • THE ANALYSIS
    • Most companies own or have access to information that could be used to expand old businesses or build new ones. These opportunities exist because of the explosion in digital data, analytic tools, and cloud computing.
  • THE SOLUTION
    • Answering a series of questions—from “What data can we access that we’re not capturing now?” to “Can we deliver one of our capabilities as a digital service?”—will help companies find ways to unlock new business value.



Traditional, tested ways of framing the search for ideas

  • Competency based
    • It asks, How can we build on the capabilities and assets that already make us distinctive to enter new businesses and markets?
  • Customer focused: 
    • What does a close study of customers’ behavior tell us about their tacit, unmet needs?
  • Changes in the business environment: 
    • If we follow “megatrends” or other shifts to their logical conclusion, what future business opportunities will become clear?
  • Fourth approach: 
    • It complements the existing frameworks but focuses on opportunities generated by the explosion in digital information and tools
    • How can we create value for customers using data and analytic tools we own or could have access to?
Patterns
  1. Using data that physical objects now generate (or could generate) to improve a product or service or create new business value.
  2. Digitizing physical assets.
  3. Combining data within and across industries.
  4. Trading data
  5. Codifying a capability
Pattern 1
Augmenting Products to Generate Data
  • Because of advances in sensors, wireless communications, and big data, it’s now feasible to gather and crunch enormous amounts of data in a variety of contexts.
  • Such capabilities, in turn, can become the basis of new services or new business models.

PATTERN 2
Digitizing Assets

  • Over the past two decades, the digitization of music, books, and video has famously upended entertainment industries, spawning new models such as iTunes, streaming video services, e-readers, ...
  • Digitization of health records, of course, is expected to revolutionize the health care industry, by making the treatment of patients more efficient and appropriate

PATTERN 3
Combining Data Within and Across Industries
  • The science of big data, along with new IT standards that allow enhanced data integration, makes it possible to coordinate information across industries or sectors in new ways.
  • The goal is to encourage the private sector to develop new business models, such as shared-delivery services in specific areas
PATTERN 4
Trading Data
  • The ability to combine disparate data sets allows companies to develop a variety of new offerings for adjacent businesses.
PATTERN 5
Codifying a Distinctive Service Capability
  • Now companies have a practical way to take the processes they’ve perfected, standardize them, and sell them to other parties
  • Cloud computing has put such opportunities within even closer reach, because it allows companies to:
    • easily distribute software
    • simplify version control 
    • offer customers “pay as you go” pricing
COMBINING THE PATTERNS
  • The five patterns are a helpful way to structure a conversation about new business ideas, but actual initiatives often encompass two or three of the patterns
Key Questions
  • questions designed to inventory the raw material out of which new business value can be carved
    • What data do we have?
    • What data can we access that we are not capturing?
    • What data could we create from our products or operations?
    • What helpful data could we get from others?
    • What data do others have that we could use in a joint initiative?
  • Armed with the answers, the team cycles back through each pattern to explore whether it, or perhaps a modification or combination of patterns, could be applicable in the company’s business context
AUGMENTING PRODUCTS
  • Which of the data relate to our products and their use?
  • Which do we now keep and which could we start keeping?
  • What insights could be developed from the data?
  • How could those insights provide new value to us, our customers, our suppliers, our competitors, or players in another industry?
DIGITIZING ASSETS
  • Which of our assets are either wholly or essentially digital?
  • How can we use their digital nature to improve or augment their value?
  • Do we have physical assets that could be turned into digital assets?

COMBINING DATA
  • How might our data be combined with data held by others to create new value?
  • Could we act as the catalyst for value creation by integrating data held by other players?
  • Who would benefit from this integration and what business model would make it attractive to us and our collaborators?

TRADING DATA
  • How could our data be structured and analyzed to yield higher-value information?
  • Is there value in this data to us internally, to our current customers, to potential new customers, or to another industry?

CODIFYING A CAPABILITY
  • Do we possess a distinctive capability that others would value?
  • Is there a way to standardize this capability so that it could be broadly useful?
  • Can we deliver this capability as a digital service?
  • Who in our industry or other industries would find this attractive?
  • How could the gathering, management, and analysis of our data help us develop a capability that we could codify?

Success Factors






Thursday, July 17, 2014

Harvard Business Review, Article #1, Analytics 3.0

HBR  Analytics 3.0
by Thomas H. Davenport Dec 2013
http://hbr.org/2013/12/analytics-30/ar/1

The Evolution of Analytics
Briefly, it is a new resolve to apply:

  • Powerful data-gathering 
  • Analysis methods 

not just to a company’s operations but also to its offerings, embedding data smartness into the products and services customers buy.


History 

  • The use of data to make decisions is not a new idea.
  • It is as old as decision making itself. 
  • But the field of business analytics was born in the mid-1950s
Analytics 1.0: the era of “business intelligence”


  • For the first time, data about production processes, sales, customer interactions, and more were recorded, aggregated, and analyzed
  • This was the era of the enterprise data warehouse, used to capture information, and of business intelligence software, used to query and report it
  • Readying a data set for inclusion in a warehouse was difficult
    • Analysts spent much of their time preparing data for analysis and relatively little time on the analysis itself


Analytics 2.0: the era of big data
  • The era began in the mid-2000s, when internet-based and social network firms, primarily in Silicon Valley (Google, eBay, ...), began to amass and analyze new kinds of information
  • Big data also came to be distinguished from small data because it was not generated purely by a firm’s internal transaction systems
  • Big data couldn’t fit or be analyzed fast enough on a single server
    •  So it was processed with Hadoop, an open source software framework for fast batch data processing across parallel servers
  • To deal with relatively unstructured data companies turned to a new class of databases known as NoSQL
  • Machine-learning methods (semi automated model development and testing) were used to rapidly generate models from the fast-moving data



Analytics 3.0: the era of data-enriched offerings

  • The pioneering big data firms in Silicon Valley began investing in analytics to support customer-facing products, services, and features
  • The common thread in these companies is a management resolve to compete on analytics
    • not only in the traditional sense (by improving internal business decisions) 
    • but also by creating more-valuable products and services
  • Any company, in any industry, can develop valuable products and services from its aggregated data
  • This strategic change in focus means a new role for analytics within organizations


Ten Requirements for Capitalizing on Analytics 3.0
  1. Multiple types of data, often combined
  2. A new set of data management options
    1. In the 1.0 era, firms used data warehouses as the basis for analysis. 
    2. In the 2.0 era, they focused on Hadoop clusters and NoSQL databases. 
    3. Today the technology answer is “all of the above”
  3. Faster technologies and methods of analysis
  4. Embedded analytics
    1. Consistent with the increased speed of data processing and analysis, models in Analytics 3.0 are often embedded into operational and decision processes
    2. Some firms are embedding analytics into fully automated systems through 
      1. Scoring algorithms 
      2. Analytics-based rules.
  5. Data discovery
  6. Cross-disciplinary data teams
    1. Companies now employ data hackers, who excel at extracting and structuring information, to work with analysts, who excel at modeling it
  7. Chief analytics officers
  8. Prescriptive analytics
    1. There have always been three types of analytics:
      1. Descriptive, which reports on the past
      2. Predictive, which uses models based on past data to predict the future 
      3. Prescriptive, which uses models to specify optimal behaviors and actions
    2. Analytics 3.0 includes all three types but emphasizes the last
  9. Analytics on an industrial scale 
    1. Analytics 3.0 provides an opportunity to scale those processes to industrial strength
    2. Creating many more models through machine learning can let an organization become much more granular and precise in its predictions
  10. New ways of deciding and managing 
    1. Some of the changes prompted by the widespread availability of big data will not yield much certainty
    2. Big data flows continuously; consider the analysis of brand sentiment derived from social media sources, where metrics will inevitably rise and fall over time
    3. Additional uncertainty arises from the nature of big data relationships. Unless they are derived from formal testing, the results from big data generally involve correlation, not causation
Creating Value in the Data Economy
  • Analytics 3.0 is the direction of change and the new model for competing on analytics.

Thomas H. Davenport is the President’s Distinguished Professor of IT and Management at Babson College, a fellow of the MIT Center for Digital Business, a senior adviser to Deloitte Analytics, and a cofounder of the International Institute for Analytics (for which the ideas in this article were generated). He is a coauthor of Keeping Up with the Quants (Harvard Business Review Press, 2013) and the author of Big Data at Work (forthcoming from Harvard Business Review Press).

Friday, July 11, 2014

Google Analytics - Reporting

Building Reports with Dimensions & Metrics


The building blocks of every report in Google Analytics are :
  • Dimensions : A dimension describes characteristics of your data
  • Metrics : Metrics are the quantitative measurements of your data
Reporting API


To use the reporting APIs, you have to build your own application. This application needs to be able to write and send a query to the reporting API. The API uses the query to retrieve data from the aggregate tables, and then sends a response back to your application containing the data that was requested.
Each query sent to the API must contain specific information, including the ID of the view that you would like to retrieve data from, the start and end dates for the report, and the dimensions and metrics you want. Within the query you can also specify how to filter, segment and order the data just like you can with tools in the online reporting interface.
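As a rough illustration, here is a minimal Python sketch of such a query against the v3 Core Reporting API using the google-api-python-client library. The view ID, date range, and the pre-authorized authorized_http object are placeholders you would supply yourself; obtaining OAuth 2.0 credentials is outside the scope of this sketch.

from googleapiclient.discovery import build

# Assumes `authorized_http` is an httplib2.Http object already authorized via OAuth 2.0.
service = build('analytics', 'v3', http=authorized_http)

response = service.data().ga().get(
    ids='ga:12345678',                    # placeholder view (profile) ID
    start_date='2014-06-01',              # report start date
    end_date='2014-06-30',                # report end date
    metrics='ga:sessions,ga:pageviews',   # the quantitative measurements
    dimensions='ga:date,ga:medium',       # the descriptive characteristics
    filters='ga:medium==organic',         # optional: filter the data
    sort='-ga:sessions',                  # optional: order by sessions, descending
    max_results=100).execute()

for row in response.get('rows', []):      # each row holds the requested dimensions and metrics
    print '\t'.join(row)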

Report Sampling
Report sampling is an analytics practice that generates reports based on a small, random subset of your data instead of using all of your  available data. Sampling lets programs, including Google Analytics, calculate the data for your reports faster than if every single piece of data is included during the generation process.

When does sampling happen?
  • During processing
  • Modifying one of the standard reports in Google Analytics by adding a segment, secondary dimension, or another customization

Adjusting the sample size
The number of sessions used to calculate the report is called the “sample size.” If you increase the sample size, you’ll include more sessions in your calculation, but it’ll take longer to generate your report. If you decrease the sample size, you’ll include fewer sessions in your calculation, but your report will be generated faster.
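As a toy illustration of the trade-off (not how Google Analytics itself computes reports), the sketch below estimates a total from a random subset of made-up sessions; raising sample_size makes the estimate more accurate but slower to compute.

import random

random.seed(42)
all_sessions = [random.randint(1, 10) for _ in range(100000)]  # fake pageviews per session
sample_size = 25000                                            # the adjustable "sample size"

sample = random.sample(all_sessions, sample_size)
estimate = sum(sample) * (len(all_sessions) / float(sample_size))

print 'estimated total:', int(estimate)
print 'exact total:    ', sum(all_sessions)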


The sampling limit
Google Analytics sets a maximum number of sessions that can be used to calculate your reports. If you go over that limit, your data gets sampled.
One way to stay below the limit is to shorten the date range in your report, which reduces the number of sessions Google Analytics needs to calculate your request

Google Analytics - Processing & configuration

These two components work together to organize and transform the data that you collect into the information you see in your reports.

Processing data and applying your configuration settings
During processing, there are four major transformations that happen to the data. You can control parts of these transformations using the configuration settings in your Properties and Views.
  • First, Google Analytics organizes the hits you’ve collected into users and sessions. There is a standard set of rules that Google Analytics follows to differentiate users and sessions, but you can customize some of these rules through your configuration settings.
  • Second, data from other sources can be joined with data collected via the tracking code. For example, you can configure Google Analytics to import data from Google AdWords, Google AdSense or Google Webmaster Tools. You can even configure Google Analytics to import data from other non-Google systems.
  • Third, Google Analytics processing will modify your data according to any configuration rules you’ve added. These configurations tell Google Analytics what specific data to include or exclude from your reports, or change the way the data’s formatted.
  • Finally, the data goes through a process called “aggregation.” During this step, Google Analytics prepares the data for analysis by organizing it in meaningful ways and storing it in database tables. This way, your reports can be generated quickly from the database tables whenever you need them.

How hits are organized by users
First, let’s talk about how Google Analytics creates users. The first time a device loads your content and a hit is recorded, Google Analytics creates a random, unique ID that is associated with the device. Each unique ID is considered to be a unique user in Google Analytics. This unique ID is sent to Google Analytics in each hit, and every time a new ID is detected, Google Analytics counts a New User. When Google Analytics sees an existing ID in a hit, it counts a Returning User.
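The sketch below illustrates that idea in Python, with a hypothetical local JSON file standing in for the browser-side cookie where the ID normally lives; it is not Google Analytics’ actual implementation.

import os
import json
import uuid

CID_FILE = 'ga_client_id.json'   # hypothetical stand-in for the client-side cookie

def get_or_create_client_id():
    """Return (client_id, is_new_user) for this device."""
    if os.path.exists(CID_FILE):
        with open(CID_FILE) as f:
            return json.load(f)['cid'], False   # existing ID -> Returning User
    cid = str(uuid.uuid4())                     # random, unique ID on first load
    with open(CID_FILE, 'w') as f:
        json.dump({'cid': cid}, f)
    return cid, True                            # new ID -> New User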


How hits are organized into sessions
A session in Google Analytics is a collection of interactions, or hits, from a specific user during a defined period of time. These interactions can include pageviews, events or e-commerce transactions.
A single user can have multiple sessions. Those sessions can occur on the same day, or over several days, weeks, or months. As soon as one session ends, there is then an opportunity to start a new session. But how does Google Analytics know that a session has ended?
By default, a session ends after 30 minutes of inactivity
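A minimal sketch of that rule, assuming a single user's hit timestamps sorted in ascending order: any gap longer than 30 minutes starts a new session.

from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)   # default inactivity window

def sessionize(hit_times):
    """Group a user's hit timestamps (sorted ascending) into sessions."""
    sessions = []
    for t in hit_times:
        if sessions and t - sessions[-1][-1] <= SESSION_TIMEOUT:
            sessions[-1].append(t)        # still within the current session
        else:
            sessions.append([t])          # inactivity gap -> a new session begins
    return sessions

hits = [datetime(2014, 7, 11, 9, 0),
        datetime(2014, 7, 11, 9, 10),
        datetime(2014, 7, 11, 10, 5)]     # 55-minute gap before the last hit
print len(sessionize(hits))               # -> 2 sessions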


Importing data into Google Analytics
The most common way to get data into Google Analytics is through your tracking code, but you can also add data from other sources. By adding data into Google Analytics, you can give more context to your analysis.

There are two ways to add data into your Google Analytics account without using the tracking code: through account linking and through Data Import. Both are managed via your Configuration settings in the Admin section of Google Analytics. Any data that you add from these sources will be processed along with all the hits you collect from the tracking code


Account linking
You can link various Google products directly to Google Analytics via your account settings. This includes:
  • Google AdWords
  • Google AdSense
  • Google Webmaster Tools

Data Import
In addition to account linking, you can add data to Google Analytics using the Data Import feature. This might include advertising data, customer data, product data, or any other data.
To import data into Google Analytics, there must be a “key” that exists both in the data that Google Analytics collects and in the data you want to import.
There are two ways to import data into Google Analytics:
  • Dimension Widening : With Dimension Widening, you can import just about any data into Google Analytics
  • Cost Data Import : The other kind of data import is called Cost Data Import. You use this feature specifically to add data that shows the amount of money you spent on your non-Google advertising. Importing cost data lets Google Analytics calculate the return-on-investment of your non-Google ads
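The “key” requirement can be pictured outside Google Analytics as a simple join. In this hypothetical pandas sketch, the shared column product_sku exists both in the collected data and in the file being imported, which is what lets the imported dimension widen the report data.

import pandas as pd

collected = pd.DataFrame({'product_sku': ['A1', 'B2'],
                          'sessions':    [120, 45]})          # data from the tracking code
imported  = pd.DataFrame({'product_sku': ['A1', 'B2'],
                          'category':    ['Shoes', 'Hats']})  # uploaded Data Import file

# Joining on the shared key attaches the imported dimension to the collected metrics.
print collected.merge(imported, on='product_sku')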
Transforming & Aggregating Data
An important part of processing is data transformation and aggregation

Common configuration settings: 
  • Filters : Filters provide a flexible way you can modify the data within each view. You can use them to exclude data, include data, or actually change how the data looks in your reports. Filters help you transform the data so it’s better aligned with your reporting needs.
  • Goals : Another way to transform your data is to set up Goals. When you set up Goals, Google Analytics creates new metrics for your reports

  • Channel Grouping and Content Grouping : Grouping is another way you can transform your data. With grouping, you can aggregate certain pieces of data together so you can analyze the collective performance. You can create two types of groups in Google Analytics: Channel groups and Content groups.


Data aggregation
All of your configuration settings, including Filters, Goals, and Grouping, are applied to your data before it goes through aggregation, the final step of data processing.
During aggregation, Google Analytics creates and organizes your report dimensions into tables, called aggregate tables. Google Analytics pre-calculates your reporting metrics for each value of a dimension and stores them in the corresponding table. When you open a Google Analytics report, a query is sent to the aggregate tables that are full of this prepared data, and returns the specific dimensions and metrics for the report. Storing data in these tables makes it faster for your reports to access data when you request it.
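A toy way to picture an aggregate table: metrics pre-calculated per dimension value, so that opening a report becomes a fast lookup rather than a scan of raw hits. The values below are invented.

aggregate_table = {                       # dimension value -> pre-calculated metrics
    'organic':  {'sessions': 1200, 'pageviews': 3400},
    'referral': {'sessions':  310, 'pageviews':  890},
}

# A report "query" against the prepared data is just a lookup.
print aggregate_table['organic']['sessions']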

Google Analytics - Data collection

Google Analytics uses tracking code to collect data. It doesn’t matter if you are tracking a website, mobile app or other digital environment -- it’s the tracking code that gathers and sends the data back to your account for reporting.



How tracking works
Depending on the environment you want to track -- a website, mobile app, or other digital experience -- Google Analytics uses different tracking technology to create the data hits. For example, there is specific tracking code to create hits for websites and different code to create hits for mobile apps.
In addition to creating hits, the tracking code also performs another critical function. It identifies new users and returning users. We’ll explain how the tracking code does this in later lessons.
Another key function of the tracking code is to connect your data to your Google Analytics account. This is accomplished through a unique identifier embedded in your tracking code.


Google Analytics uses different tracking technology to measure user activity depending on the specific environment you want to track -- websites, mobile apps, or other digital experiences. To track data from a website, Google Analytics provides a standard snippet of JavaScript tracking code. This snippet references a JavaScript library called analytics.js that controls what data is collected.


Adding the Google Analytics JavaScript code to your website
You simply add the standard code snippet before the closing </head> tag in the HTML of every web page you want to track. This snippet generates a pageview hit each time a page is loaded. It’s essential that you place the Google Analytics tracking code on every page of your site

Collecting and sending data with the Measurement Protocol
The Measurement Protocol lets you send data to Google Analytics from any web-connected device. Recall that the Google Analytics JavaScript and mobile SDKs automatically build hits to send data to Google Analytics. However, when you want to collect data from a different device, you must manually build the data collection hits. The Measurement Protocol defines how to construct the hits and how to send them to Google Analytics.
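A minimal Python sketch of a manually built pageview hit sent to the Measurement Protocol’s /collect endpoint; the tracking ID, client ID, and page path below are placeholders.

import urllib
import urllib2

payload = {
    'v':   1,                                        # Measurement Protocol version
    'tid': 'UA-XXXXX-Y',                             # placeholder tracking/property ID
    'cid': '35009a79-1a05-49d7-b876-2b884d0f825b',   # placeholder anonymous client ID
    't':   'pageview',                               # hit type
    'dp':  '/kiosk/home',                            # hypothetical page path for this hit
}
# POSTing the urlencoded payload sends the hit to Google Analytics.
urllib2.urlopen('http://www.google-analytics.com/collect',
                data=urllib.urlencode(payload))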

Google Analytics - Platform Principles

There are four main parts of the Google Analytics platform:

  • Collection
  • Configuration
  • Processing
  • Reporting
Collection
Collection is all about getting data into your Google Analytics account.
To collect data, you need to add Google Analytics code to your website, mobile app or other digital environment you want to measure. This tracking code provides a set of instructions to Google Analytics, telling it which user interactions it should pay attention to and which data it should collect. The way the data is collected depends on the environment you want to track.
Processing & Configuration
During data processing, Google Analytics transforms the raw data from collection using the settings in your Google Analytics account. These settings, also known as the configuration, help you align the data more closely with your measurement plan and business objectives.  
Reporting
After Google Analytics has finished processing, you can access and analyze your data using the reporting interface, which includes easy-to-use reporting tools and data visualizations. It’s also possible to systematically access your data using the Google Analytics Core Reporting API. Using the API you can build your own reporting tools or extract your data directly into third-party reporting tools.
Conclusion
Throughout the rest of this course, we will dive deeper into key topics about collection, configuration, processing and reporting. Having a comprehensive understanding of each of these platform components will help you better understand the data you see in Google Analytics. It will also prepare you for more advanced topics about how you can customize your data.


Data Model

There are three components to this data model:
  • Users : a user is a visitor to your website or app
  • Sessions : a session is the time they spend there
  • Interactions : interactions are what they do while they’re there



Wednesday, July 2, 2014

Project 5 Exercise 3

 Exercise 3 - Busiest Hour

import sys
import string
import logging

from util import mapper_logfile
logging.basicConfig(filename=mapper_logfile, format='%(message)s',
                    level=logging.INFO, filemode='w')

def mapper():
    """
    In this exercise, for each turnstile unit, you will determine the date and time 
    (in the span of this data set) at which the most people entered through the unit.
    
    The input to the mapper will be the final Subway-MTA dataset, the same as
    in the previous exercise. You can check out the csv and its structure below:
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv

    For each line, the mapper should return the UNIT, ENTRIESn_hourly, DATEn, and 
    TIMEn columns, separated by tabs. For example:
    'R001\t100000.0\t2011-05-01\t01:00:00'

    Since you are printing the output of your program, printing a debug 
    statement will interfere with the operation of the grader. Instead, 
    use the logging module, which we've configured to log to a file printed 
    when you click "Test Run". For example:
    logging.info("My debugging message")
    """


    for line in sys.stdin:
        data = line.strip().split(",")

        # Skip the header row and any malformed lines that don't have all 22 columns.
        if len(data) != 22 or data[6] == 'ENTRIESn_hourly':
            continue

        print "{0}\t{1}\t{2}\t{3}".format(data[1], data[6], data[2], data[3])
        

mapper()
---------------------------------------------------------------------------------------------------------------------
import sys
import logging

from util import reducer_logfile
logging.basicConfig(filename=reducer_logfile, format='%(message)s',
                    level=logging.INFO, filemode='w')

def reducer():
    '''
    Write a reducer that will compute the busiest date and time (that is, the 
    date and time with the most entries) for each turnstile unit. Ties should 
    be broken in favor of datetimes that are later on in the month of May. You 
    may assume that the contents of the reducer will be sorted so that all entries 
    corresponding to a given UNIT will be grouped together.
    
    The reducer should print its output with the UNIT name, the datetime (which 
    is the DATEn followed by the TIMEn column, separated by a single space), and 
    the number of entries at this datetime, separated by tabs.

    For example, the output of the reducer should look like this:
    R001    2011-05-11 17:00:00   31213.0
    R002 2011-05-12 21:00:00   4295.0
    R003 2011-05-05 12:00:00   995.0
    R004 2011-05-12 12:00:00   2318.0
    R005 2011-05-10 12:00:00   2705.0
    R006 2011-05-25 12:00:00   2784.0
    R007 2011-05-10 12:00:00   1763.0
    R008 2011-05-12 12:00:00   1724.0
    R009 2011-05-05 12:00:00   1230.0
    R010 2011-05-09 18:00:00   30916.0
    ...
    ...

    Since you are printing the output of your program, printing a debug 
    statement will interfere with the operation of the grader. Instead, 
    use the logging module, which we've configured to log to a file printed 
    when you click "Test Run". For example:
    logging.info("My debugging message")
    '''

    max_entries = 0.0
    old_key = None
    datetime = ''

    for line in sys.stdin:

        data = line.strip().split('\t')

        if len(data) != 4:
            continue

        this_key, count, date, time  = data
        count = float(count)

        if old_key and old_key != this_key:
            print "{0}\t{1}\t{2}".format(old_key, datetime, max_entries)
            max_entries = 0
            datetime = ''

        old_key = this_key
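        # Using '>=' (rather than '>') means that when counts tie, the datetime
        # seen later in the sorted input wins.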
        if count >= max_entries:
            max_entries = count
            datetime = str(date) + ' ' + str(time)

    if old_key != None:
        print "{0}\t{1}\t{2}".format(old_key, datetime, max_entries)
        logging.info("{0}\t{1}\t{2}".format(old_key, datetime, max_entries))
reducer()









Project 5 Exercise 2

Exercise 2 - Ridership by Weather Type


import sys
import string
import logging

from util import mapper_logfile
logging.basicConfig(filename=mapper_logfile, format='%(message)s',
                    level=logging.INFO, filemode='w')

def mapper():
    '''
    For this exercise, compute the average value of the ENTRIESn_hourly column 
    for different weather types. Weather type will be defined based on the 
    combination of the columns fog and rain (which are boolean values).
    For example, one output of our reducer would be the average hourly entries 
    across all hours when it was raining but not foggy.

    Each line of input will be a row from our final Subway-MTA dataset in csv format.
    You can check out the input csv file and its structure below:
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv
    
    Note that this is a comma-separated file.

    This mapper should PRINT (not return) the weather type as the key (use the 
    given helper function to format the weather type correctly) and the number in 
    the ENTRIESn_hourly column as the value. They should be separated by a tab.
    For example: 'fog-norain\t12345'
    
    Since you are printing the output of your program, printing a debug 
    statement will interfere with the operation of the grader. Instead, 
    use the logging module, which we've configured to log to a file printed 
    when you click "Test Run". For example:
    logging.info("My debugging message")
    '''

    # Takes in variables indicating whether it is foggy and/or rainy and
    # returns a formatted key that you should output.  The variables passed in
    # can be booleans, ints (0 for false and 1 for true) or floats (0.0 for
    # false and 1.0 for true), but the strings '0.0' and '1.0' will not work,
    # so make sure you convert these values to an appropriate type before
    # calling the function.
    def format_key(fog, rain):
        return '{}fog-{}rain'.format(
            '' if fog else 'no',
            '' if rain else 'no'
        )


    for line in sys.stdin:
        data = line.strip().split(',')
       
        if len(data) !=22 or data[6] == "ENTRIESn_hourly":
            continue
        else:
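            # data[14] and data[15] hold the fog and rain flags; data[6] is ENTRIESn_hourly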
            print "{0}\t{1}".format(format_key(float(data[14]),float(data[15])), data[6])
            logging.info("{0}\t{1}".format(format_key(float(data[14]),float(data[15])), data[6]))
       

mapper()
--------------------------------------------------------------------------------------------------------------------------------------------------
import sys
import logging
import numpy
from util import reducer_logfile
logging.basicConfig(filename=reducer_logfile, format='%(message)s',
                    level=logging.INFO, filemode='w')

def reducer():
    '''
    Given the output of the mapper for this assignment, the reducer should
    print one row per weather type, along with the average value of
    ENTRIESn_hourly for that weather type, separated by a tab. You can assume
    that the input to the reducer will be sorted by weather type, such that all
    entries corresponding to a given weather type will be grouped together.

    In order to compute the average value of ENTRIESn_hourly, you'll need to
    keep track of both the total riders per weather type and the number of
    hours with that weather type. That's why we've initialized the variable 
    riders and num_hours below. Feel free to use a different data structure in 
    your solution, though.

    An example output row might look like this:
    'fog-norain\t1105.32467557'

    Since you are printing the output of your program, printing a debug 
    statement will interfere with the operation of the grader. Instead, 
    use the logging module, which we've configured to log to a file printed 
    when you click "Test Run". For example:
    logging.info("My debugging message")
    '''

    entries = 0.0
    avg = 0.0
    num  = 0
    old_key = None

    for line in sys.stdin:
        data = line.strip().split("\t")
        if len(data) !=2:
            continue
        this_key, count = data
        
        if old_key and old_key != this_key:
            print "{0}\t{1}".format(old_key,avg)
            entries = 0
            num = 0
        old_key = this_key
        entries += float(count)
        num += 1
        avg =  entries / num
        
    if old_key != None:
        print "{0}\t{1}".format(old_key, avg)
        logging.info("{0}\t{1}".format(old_key, avg))


reducer()