Saturday, July 26, 2014

Realistic Test Data Generator

Great site for generating CSV files with random and customized data:
http://www.mockaroo.com/

Wednesday, July 23, 2014

Harvard Business Review, Article #3, Why IT Fumbles Analytics

HBR  
Why IT Fumbles Analytics
by Donald A. Marchand and Joe Peppard Jan 2013
http://hbr.org/2013/01/why-it-fumbles-analytics/ar/1

Preface

  • Massive amounts of data now available from internal and external sources
  • Companies are spending heavily on IT tools and hiring data scientists
  • They treat their big data and analytics projects the same way they treat all IT projects 
  • Not realizing that the two are completely different animals
Conventional approach to an IT project

  • Focuses on building and deploying the technology :
    • On time 
    • To plan
    • Within budget
  • The information requirements and technology specifications are established up front, at the design stage
  • This approach works fine if the goal is to improve business processes
  • Such projects improve efficiency, lower costs, and increase productivity, yet executives are still dissatisfied.
    • The reason: once the system goes live, no one pays any attention to figuring out how to use the information it generates to make better decisions or gain deeper insights
Where is the problem?
  • It’s crucial to understand how people create and use information
  • Project teams need members well versed in the cognitive and behavioral sciences, not just in engineering, computer science, and math
  • Deploying analytical IT tools is relatively easy. Understanding how they might be used is much less clear
  • No one knows the decisions the tools will be asked to support 
  • No one knows the questions they will be expected to help answer
  • Therefore, a big data or analytics project can’t be treated like a conventional, large IT project with its:
    • defined outcomes
    • required tasks
    • detailed plans
Five guidelines for taking this voyage of discovery
  • Place People at the Heart of the Initiative
    • WRONG: The logic behind many investments in IT tools and big data initiatives is that giving managers more high-quality information more rapidly will:
      • Improve their decisions
      • Help them solve problems
      • Help them gain valuable insights
    • Why this is wrong: it ignores the fact that
      • Managers might discard information no matter how good it is
      • They have various biases
      • They might not have the cognitive ability to use information effectively
      • The reality is that many people, including managers, are uncomfortable working with data
    • Instead, the initiative must place users, the people who will create meaning from the information, at its heart
      • Understand how they do or do not use data in reaching conclusions and making decisions
  • Emphasize Information Use as the Way to Unlock Value from IT
    • People don’t think in a vacuum; they make sense of situations on the basis of their own knowledge, mental models, and experiences
    • They also use information in different ways, depending on the context
    • An organization’s culture can frame how people make decisions, collaborate, and share knowledge
    • People use information dynamically and iteratively
    • The design of most IT systems takes into account the data that have been identified as important and controllable
      • That approach is fine for activities that are highly structured and whose tasks can be precisely described
      • It is ideal for moving information from the human domain into the technology domain
    • The problem is that many organizations mistakenly apply the same design philosophy to the task of getting data out of the technology domain and into the human domain
    • Analytics projects succeed by challenging and improving the way: 
      • Information is used
      • Questions are answered
      • Decisions are made
    • Here are some ways to do this:
      • Ask second-order questions, that is, questions about questions
      • Discover what data you do and do not have
      • Give IT project teams the freedom to reframe business problems

  • Equip IT Project Teams With Cognitive and Behavioral Scientists
    • Most IT professionals have engineering, computer science, and math backgrounds
    • They are generally very logical and are strong process thinkers
    • Big Data needs people who understand how people:
      • Perceive problems 
      • Use information 
      • Analyze data in developing solutions ideas and knowledge
    • This shift is akin to the move from traditional economics to behavioral economics
    • Organizations that want employees to be more data-oriented in their thinking and decision making must train them to know:
      • When to draw on data
      • How to frame questions
      • How to build hypotheses
      • How to conduct experiments
      • How to interpret results

  • Focus on Learning
    • Big data and other analytics projects are more akin to scientific research and clinical trials than to IT initiatives
    • In short, they are opportunities for discovery
    • How to make learning a central focus of big data and analytics projects:
      • Promote and facilitate a culture of information sharing
      • Expose your assumptions, biases, and blind spots
      • Strive to demonstrate cause and effect
      • Identify the appropriate techniques and tools

  • Worry More About Solving Business Problems than About Deploying Technology
    • Conventional IT project management is risk-averse
    • Big data projects should focus less on managing the risks of deploying technology and more on solving business problems
    • The paradox is that the technologies that were supposed to help manage data are now causing a massive deluge


Monday, July 21, 2014

Harvard Business Review, Article #2, The New Patterns of Innovation

The New Patterns of Innovation
How to use data to drive growth 
by Rashik Parmar, Ian Mackenzie, David Cohn, and David Gann Jan 2014


http://hbr.org/2014/01/the-new-patterns-of-innovation/ar/1

Idea in Brief

  • THE CHALLENGE
    • Established companies are notoriously bad at finding new ways to make money, despite the pressure on them to grow.
  • THE ANALYSIS
    • Most companies own or have access to information that could be used to expand old businesses or build new ones. These opportunities exist because of the explosion in digital data, analytic tools, and cloud computing.
  • THE SOLUTION
    • Answering a series of questions—from “What data can we access that we’re not capturing now?” to “Can we deliver one of our capabilities as a digital service?”—will help companies find ways to unlock new business value.



Traditional, tested ways of framing the search for ideas

  • Competency based
    • It asks, How can we build on the capabilities and assets that already make us distinctive to enter new businesses and markets?
  • Customer focused: 
    • What does a close study of customers’ behavior tell us about their tacit, unmet needs?
  • Changes in the business environment: 
    • If we follow “megatrends” or other shifts to their logical conclusion, what future business opportunities will become clear?
  • Fourth approach: 
    • It complements the existing frameworks but focuses on opportunities generated by the explosion in digital information and tools
    • How can we create value for customers using data and analytic tools we own or could have access to?
Patterns
  1. Using data that physical objects now generate (or could generate) to improve a product or service or create new business value.
  2. Digitizing physical assets.
  3. Combining data within and across industries.
  4. Trading data
  5. Codifying a capability
Pattern 1
Augmenting Products to Generate Data
  • Because of advances in sensors, wireless communications, and big data, it’s now feasible to gather and crunch enormous amounts of data in a variety of contexts.
  • Such capabilities, in turn, can become the basis of new services or new business models.

PATTERN 2
Digitizing Assets

  • Over the past two decades, the digitization of music, books, and video has famously upended entertainment industries, spawning new models such as iTunes, streaming video services, e-readers, ...
  • Digitization of health records, of course, is expected to revolutionize the health care industry, by making the treatment of patients more efficient and appropriate

PATTERN 3
Combining Data Within and Across Industries
  • The science of big data, along with new IT standards that allow enhanced data integration, makes it possible to coordinate information across industries or sectors in new ways.
  • The goal is to encourage the private sector to develop new business models, such as shared-delivery services in specific areas
PATTERN 4
Trading Data
  • The ability to combine disparate data sets allows companies to develop a variety of new offerings for adjacent businesses.
PATTERN 5
Codifying a Distinctive Service Capability
  • Now companies have a practical way to take the processes they’ve perfected, standardize them, and sell them to other parties
  • Cloud computing has put such opportunities within even closer reach, because it allows companies to:
    • easily distribute software
    • simplify version control 
    • offer customers “pay as you go” pricing
COMBINING THE PATTERNS
  • The five patterns are a helpful way to structure a conversation about new business ideas, but actual initiatives often encompass two or three of the patterns
Key Questions
  • questions designed to inventory the raw material out of which new business value can be carved
    • What data do we have?
    • What data can we access that we are not capturing?
    • What data could we create from our products or operations?
    • What helpful data could we get from others?
    • What data do others have that we could use in a joint initiative?
  • Armed with the answers, the team cycles back through each pattern to explore whether it, or perhaps a modification or combination of patterns, could be applicable in the company’s business context
AUGMENTING PRODUCTS
  • Which of the data relate to our products and their use?
  • Which do we now keep and which could we start keeping?
  • What insights could be developed from the data?
  • How could those insights provide new value to us, our customers, our suppliers, our competitors, or players in another industry?
DIGITIZING ASSETS
  • Which of our assets are either wholly or essentially digital?
  • How can we use their digital nature to improve or augment their value?
  • Do we have physical assets that could be turned into digital assets?

COMBINING DATA
  • How might our data be combined with data held by others to create new value?
  • Could we act as the catalyst for value creation by integrating data held by other players?
  • Who would benefit from this integration and what business model would make it attractive to us and our collaborators?

TRADING DATA
  • How could our data be structured and analyzed to yield higher-value information?
  • Is there value in this data to us internally, to our current customers, to potential new customers, or to another industry?

CODIFYING A CAPABILITY
  • Do we possess a distinctive capability that others would value?
  • Is there a way to standardize this capability so that it could be broadly useful?
  • Can we deliver this capability as a digital service?
  • Who in our industry or other industries would find this attractive?
  • How could the gathering, management, and analysis of our data help us develop a capability that we could codify?

Success Factors






Thursday, July 17, 2014

Harvard Business Review, Article #1, Analytics 3.0

HBR  Analytics 3.0
by Thomas H. Davenport Dec 2013
http://hbr.org/2013/12/analytics-30/ar/1

The Evolution of Analytics
Briefly, it is a new resolve to apply:

  • Powerful data-gathering 
  • Analysis methods 

not just to a company’s operations but also to its offerings, embedding data smartness into the products and services customers buy.


History 

  • The use of data to make decisions is not a new idea.
  • It is as old as decision making itself. 
  • But the field of business analytics was born in the mid-1950s
Analytics 1.0: the era of “business intelligence”


  • For the first time, data about production processes, sales, customer interactions, and more were recorded, aggregated, and analyzed
  • This was the era of the enterprise data warehouse, used to capture information, and of business intelligence software, used to query and report it
  • Readying a data set for inclusion in a warehouse was difficult
    • Analysts spent much of their time preparing data for analysis and relatively little time on the analysis itself


Analytics 2.0: the era of big data
  • The era began in the mid-2000s, when internet-based and social network firms, primarily in Silicon Valley (Google, eBay, ...), began to amass and analyze new kinds of information
  • Big data also came to be distinguished from small data because it was not generated purely by a firm’s internal transaction systems
  • Big data couldn’t fit or be analyzed fast enough on a single server
    •  So it was processed with Hadoop, an open source software framework for fast batch data processing across parallel servers
  • To deal with relatively unstructured data companies turned to a new class of databases known as NoSQL
  • Machine-learning methods (semi automated model development and testing) were used to rapidly generate models from the fast-moving data



Analytics 3.0: the era of data-enriched offerings

  • The pioneering big data firms in Silicon Valley began investing in analytics to support customer-facing products, services, and features
  • The common thread in these companies is a management resolve to compete on analytics
    • not only in the traditional sense (by improving internal business decisions) 
    • but also by creating more-valuable products and services
  • Any company, in any industry, can develop valuable products and services from its aggregated data
  • This strategic change in focus means a new role for analytics within organizations


Ten Requirements for Capitalizing on Analytics 3.0
  1. Multiple types of data, often combined
  2. A new set of data management options
    1. In the 1.0 era, firms used data warehouses as the basis for analysis. 
    2. In the 2.0 era, they focused on Hadoop clusters and NoSQL databases. 
    3. Today the technology answer is “all of the above”
  3. Faster technologies and methods of analysis
  4. Embedded analytics
    1. Consistent with the increased speed of data processing and analysis, models in Analytics 3.0 are often embedded into operational and decision processes
    2. Some firms are embedding analytics into fully automated systems through 
      1. Scoring algorithms 
      2. Analytics-based rules.
  5. Data discovery
  6. Cross-disciplinary data teams
    1. Companies now employ data hackers, who excel at extracting and structuring information, to work with analysts, who excel at modeling it
  7. Chief analytics officers
  8. Prescriptive analytics
    1. There have always been three types of analytics:
      1. Descriptive, which reports on the past
      2. Predictive, which uses models based on past data to predict the future 
      3. Prescriptive, which uses models to specify optimal behaviors and actions
    2. Analytics 3.0 includes all three types but emphasizes the last
  9. Analytics on an industrial scale 
    1. Analytics 3.0 provides an opportunity to scale those processes to industrial strength
    2. Creating many more models through machine learning can let an organization become much more granular and precise in its predictions
  10. New ways of deciding and managing 
    1. Some of the changes prompted by the widespread availability of big data will not yield much certainty
    2. Big data flows continuously; consider the analysis of brand sentiment derived from social media sources, where metrics will inevitably rise and fall over time
    3. Additional uncertainty arises from the nature of big data relationships. Unless they are derived from formal testing, the results from big data generally involve correlation, not causation
Creating Value in the Data Economy
  • Analytics 3.0 is the direction of change and the new model for competing on analytics.

Thomas H. Davenport is the President’s Distinguished Professor of IT and Management at Babson College, a fellow of the MIT Center for Digital Business, a senior adviser to Deloitte Analytics, and a cofounder of the International Institute for Analytics (for which the ideas in this article were generated). He is a coauthor of Keeping Up with the Quants (Harvard Business Review Press, 2013) and the author of Big Data at Work (forthcoming from Harvard Business Review Press).

Friday, July 11, 2014

Google Analytics - Reporting

Building Reports with Dimensions & Metrics


The building blocks of every report in Google Analytics are :
  • Dimensions : A dimension describes characteristics of your data
  • Metrics : Metrics are the quantitative measurements of your data
Reporting API


To use the reporting APIs, you have to build your own application. This application needs to be able to write and send a query to the reporting API. The API uses the query to retrieve data from the aggregate tables, and then sends a response back to your application containing the data that was requested.
Each query sent to the API must contain specific information, including the ID of the view that you would like to retrieve data from, the start and end dates for the report, and the dimensions and metrics you want. Within the query you can also specify how to filter, segment and order the data just like you can with tools in the online reporting interface.
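As a rough illustration, here is a minimal Python sketch of such a query against the v3 Core Reporting API using the google-api-python-client library. The view ID, date range, and the pre-authorized authorized_http object are placeholders you would supply yourself; obtaining OAuth 2.0 credentials is outside the scope of this sketch.

from googleapiclient.discovery import build

# Assumes `authorized_http` is an httplib2.Http object already authorized via OAuth 2.0.
service = build('analytics', 'v3', http=authorized_http)

response = service.data().ga().get(
    ids='ga:12345678',                    # placeholder view (profile) ID
    start_date='2014-06-01',              # report start date
    end_date='2014-06-30',                # report end date
    metrics='ga:sessions,ga:pageviews',   # the quantitative measurements
    dimensions='ga:date,ga:medium',       # the descriptive characteristics
    filters='ga:medium==organic',         # optional: filter the data
    sort='-ga:sessions',                  # optional: order by sessions, descending
    max_results=100).execute()

for row in response.get('rows', []):      # each row holds the requested dimensions and metrics
    print '\t'.join(row)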

Report Sampling
Report sampling is an analytics practice that generates reports based on a small, random subset of your data instead of using all of your  available data. Sampling lets programs, including Google Analytics, calculate the data for your reports faster than if every single piece of data is included during the generation process.

When does sampling happen?
  • During processing
  • Modifying one of the standard reports in Google Analytics by adding a segment, secondary dimension, or another customization

Adjusting the sample size
The number of sessions used to calculate the report is called the “sample size.” If you increase the sample size, you’ll include more sessions in your calculation, but it’ll take longer to generate your report. If you decrease the sample size, you’ll include fewer sessions in your calculation, but your report will be generated faster.
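As a toy illustration of the trade-off (not how Google Analytics itself computes reports), the sketch below estimates a total from a random subset of made-up sessions; raising sample_size makes the estimate more accurate but slower to compute.

import random

random.seed(42)
all_sessions = [random.randint(1, 10) for _ in range(100000)]  # fake pageviews per session
sample_size = 25000                                            # the adjustable "sample size"

sample = random.sample(all_sessions, sample_size)
estimate = sum(sample) * (len(all_sessions) / float(sample_size))

print 'estimated total:', int(estimate)
print 'exact total:    ', sum(all_sessions)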


The sampling limit
Google Analytics sets a maximum number of sessions that can be used to calculate your reports. If you go over that limit, your data gets sampled.
One way to stay below the limit is to shorten the date range in your report, which reduces the number of sessions Google Analytics needs to calculate your request

Google Analytics - Processing & configuration

These two components work together to organize and transform the data that you collect into the information you see in your reports.

Processing data and applying your configuration settings
During processing, there are four major transformations that happen to the data. You can control parts of these transformations using the configuration settings in your Properties and Views.
  • First, Google Analytics organizes the hits you’ve collected into users and sessions. There is a standard set of rules that Google Analytics follows to differentiate users and sessions, but you can customize some of these rules through your configuration settings.
  • Second, data from other sources can be joined with data collected via the tracking code. For example, you can configure Google Analytics to import data from Google AdWords, Google AdSense or Google Webmaster Tools. You can even configure Google Analytics to import data from other non-Google systems.
  • Third, Google Analytics processing will modify your data according to any configuration rules you’ve added. These configurations tell Google Analytics what specific data to include or exclude from your reports, or change the way the data’s formatted.
  • Finally, the data goes through a process called “aggregation.” During this step, Google Analytics prepares the data for analysis by organizing it in meaningful ways and storing it in database tables. This way, your reports can be generated quickly from the database tables whenever you need them.

How hits are organized by users
First, let’s talk about how Google Analytics creates users. The first time a device loads your content and a hit is recorded, Google Analytics creates a random, unique ID that is associated with the device. Each unique ID is considered to be a unique user in Google Analytics. This unique ID is sent to Google Analytics in each hit, and every time a new ID is detected, Google Analytics counts a New User. When Google Analytics sees an existing ID in a hit, it counts a Returning User.
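The sketch below illustrates that idea in Python, with a hypothetical local JSON file standing in for the browser-side cookie where the ID normally lives; it is not Google Analytics’ actual implementation.

import os
import json
import uuid

CID_FILE = 'ga_client_id.json'   # hypothetical stand-in for the client-side cookie

def get_or_create_client_id():
    """Return (client_id, is_new_user) for this device."""
    if os.path.exists(CID_FILE):
        with open(CID_FILE) as f:
            return json.load(f)['cid'], False   # existing ID -> Returning User
    cid = str(uuid.uuid4())                     # random, unique ID on first load
    with open(CID_FILE, 'w') as f:
        json.dump({'cid': cid}, f)
    return cid, True                            # new ID -> New User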


How hits are organized into sessions
A session in Google Analytics is a collection of interactions, or hits, from a specific user during a defined period of time. These interactions can include pageviews, events or e-commerce transactions.
A single user can have multiple sessions. Those sessions can occur on the same day, or over several days, weeks, or months. As soon as one session ends, there is then an opportunity to start a new session. But how does Google Analytics know that a session has ended?
By default, a session ends after 30 minutes of inactivity
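A minimal sketch of that rule, assuming a single user's hit timestamps sorted in ascending order: any gap longer than 30 minutes starts a new session.

from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)   # default inactivity window

def sessionize(hit_times):
    """Group a user's hit timestamps (sorted ascending) into sessions."""
    sessions = []
    for t in hit_times:
        if sessions and t - sessions[-1][-1] <= SESSION_TIMEOUT:
            sessions[-1].append(t)        # still within the current session
        else:
            sessions.append([t])          # inactivity gap -> a new session begins
    return sessions

hits = [datetime(2014, 7, 11, 9, 0),
        datetime(2014, 7, 11, 9, 10),
        datetime(2014, 7, 11, 10, 5)]     # 55-minute gap before the last hit
print len(sessionize(hits))               # -> 2 sessions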


Importing data into Google Analytics
The most common way to get data into Google Analytics is through your tracking code, but you can also add data from other sources. By adding data into Google Analytics, you can give more context to your analysis.

There are two ways to add data into your Google Analytics account without using the tracking code: through account linking and through Data Import. Both are managed via your Configuration settings in the Admin section of Google Analytics. Any data that you add from these sources will be processed along with all the hits you collect from the tracking code


Account linking
You can link various Google products directly to Google Analytics via your account settings. This includes:
  • Google AdWords
  • Google AdSense
  • Google Webmaster Tools

Data Import
In addition to account linking, you can add data to Google Analytics using the Data Import feature. This might include advertising data, customer data, product data, or any other data.
To import data into Google Analytics, there must be a “key” that exists both in the data that Google Analytics collects and in the data you want to import.
There are two ways to import data into Google Analytics:
  • Dimension Widening : With Dimension Widening, you can import just about any data into Google Analytics
  • Cost Data Import : The other kind of data import is called Cost Data Import. You use this feature specifically to add data that shows the amount of money you spent on your non-Google advertising. Importing cost data lets Google Analytics calculate the return-on-investment of your non-Google ads
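The “key” requirement can be pictured outside Google Analytics as a simple join. In this hypothetical pandas sketch, the shared column product_sku exists both in the collected data and in the file being imported, which is what lets the imported dimension widen the report data.

import pandas as pd

collected = pd.DataFrame({'product_sku': ['A1', 'B2'],
                          'sessions':    [120, 45]})          # data from the tracking code
imported  = pd.DataFrame({'product_sku': ['A1', 'B2'],
                          'category':    ['Shoes', 'Hats']})  # uploaded Data Import file

# Joining on the shared key attaches the imported dimension to the collected metrics.
print collected.merge(imported, on='product_sku')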
Transforming & Aggregating Data
An important part of processing is data transformation and aggregation

Common configuration settings: 
  • Filters : Filters provide a flexible way you can modify the data within each view. You can use them to exclude data, include data, or actually change how the data looks in your reports. Filters help you transform the data so it’s better aligned with your reporting needs.
  • Goals : Another way to transform your data is to set up Goals. When you set up Goals, Google Analytics creates new metrics for your reports

  • Channel Grouping and Content Grouping : Grouping is another way you can transform your data. With grouping, you can aggregate certain pieces of data together so you can analyze the collective performance. You can create two types of groups in Google Analytics: Channel groups and Content groups.


Data aggregation
All of your configuration settings, including Filters, Goals, and Grouping, are applied to your data before it goes through aggregation, the final step of data processing.
During aggregation, Google Analytics creates and organizes your report dimensions into tables, called aggregate tables. Google Analytics pre-calculates your reporting metrics for each value of a dimension and stores them in the corresponding table. When you open a Google Analytics report, a query is sent to the aggregate tables that are full of this prepared data, and returns the specific dimensions and metrics for the report. Storing data in these tables makes it faster for your reports to access data when you request it.
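A toy way to picture an aggregate table: metrics pre-calculated per dimension value, so that opening a report becomes a fast lookup rather than a scan of raw hits. The values below are invented.

aggregate_table = {                       # dimension value -> pre-calculated metrics
    'organic':  {'sessions': 1200, 'pageviews': 3400},
    'referral': {'sessions':  310, 'pageviews':  890},
}

# A report "query" against the prepared data is just a lookup.
print aggregate_table['organic']['sessions']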

Google Analytics - Data collection

Google Analytics uses tracking code to collect data. It doesn’t matter if you are tracking a website, mobile app or other digital environment -- it’s the tracking code that gathers and sends the data back to your account for reporting.



How tracking works
Depending on the environment you want to track -- a website, mobile app, or other digital experience -- Google Analytics uses different tracking technology to create the data hits. For example, there is specific tracking code to create hits for websites and different code to create hits for mobile apps.
In addition to creating hits, the tracking code also performs another critical function. It identifies new users and returning users. We’ll explain how the tracking code does this in later lessons.
Another key function of the tracking code is to connect your data to your Google Analytics account. This is accomplished through a unique identifier embedded in your tracking code.


Google Analytics uses different tracking technology to measure user activity depending on the specific environment you want to track -- websites, mobile apps, or other digital experiences. To track data from a website, Google Analytics provides a standard snippet of JavaScript tracking code. This snippet references a JavaScript library called analytics.js that controls what data is collected.


Adding the Google Analytics JavaScript code to your website
You simply add the standard code snippet before the closing </head> tag in the HTML of every web page you want to track. This snippet generates a pageview hit each time a page is loaded. It’s essential that you place the Google Analytics tracking code on every page of your site

Collecting and sending data with the Measurement Protocol
The Measurement Protocol lets you send data to Google Analytics from any web-connected device. Recall that the Google Analytics JavaScript and mobile SDKs automatically build hits to send data to Google Analytics. However, when you want to collect data from a different device, you must manually build the data collection hits. The Measurement Protocol defines how to construct the hits and how to send them to Google Analytics.
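A minimal Python sketch of a manually built pageview hit sent to the Measurement Protocol’s /collect endpoint; the tracking ID, client ID, and page path below are placeholders.

import urllib
import urllib2

payload = {
    'v':   1,                                        # Measurement Protocol version
    'tid': 'UA-XXXXX-Y',                             # placeholder tracking/property ID
    'cid': '35009a79-1a05-49d7-b876-2b884d0f825b',   # placeholder anonymous client ID
    't':   'pageview',                               # hit type
    'dp':  '/kiosk/home',                            # hypothetical page path for this hit
}
# POSTing the urlencoded payload sends the hit to Google Analytics.
urllib2.urlopen('http://www.google-analytics.com/collect',
                data=urllib.urlencode(payload))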

Google Analytics - Platform Principles

There are four main parts of the Google Analytics platform:

  • Collection
  • Configuration
  • Processing
  • Reporting
Collection
Collection is all about getting data into your Google Analytics account.
To collect data, you need to add Google Analytics code to your website, mobile app or other digital environment you want to measure. This tracking code provides a set of instructions to Google Analytics, telling it which user interactions it should pay attention to and which data it should collect. The way the data is collected depends on the environment you want to track.
Processing & Configuration
During data processing, Google Analytics transforms the raw data from collection using the settings in your Google Analytics account. These settings, also known as the configuration, help you align the data more closely with your measurement plan and business objectives.  
Reporting
After Google Analytics has finished processing, you can access and analyze your data using the reporting interface, which includes easy-to-use reporting tools and data visualizations. It’s also possible to systematically access your data using the Google Analytics Core Reporting API. Using the API you can build your own reporting tools or extract your data directly into third-party reporting tools.
Conclusion
Throughout the rest of this course, we will dive deeper into key topics about collection, configuration, processing and reporting. Having a comprehensive understanding of each of these platform components will help you better understand the data you see in Google Analytics. It will also prepare you for more advanced topics about how you can customize your data.


Data Model

There are three components to this data model:
  • Users : a user is a visitor to your website or app
  • Sessions : a session is the time they spend there
  • Interactions : interactions are what they do while they’re there



Wednesday, July 2, 2014

Project 5 Exercise 3

 Exercise 3 - Busiest Hour

import sys
import string
import logging

from util import mapper_logfile
logging.basicConfig(filename=mapper_logfile, format='%(message)s',
                    level=logging.INFO, filemode='w')

def mapper():
    """
    In this exercise, for each turnstile unit, you will determine the date and time 
    (in the span of this data set) at which the most people entered through the unit.
    
    The input to the mapper will be the final Subway-MTA dataset, the same as
    in the previous exercise. You can check out the csv and its structure below:
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv

    For each line, the mapper should return the UNIT, ENTRIESn_hourly, DATEn, and 
    TIMEn columns, separated by tabs. For example:
    'R001\t100000.0\t2011-05-01\t01:00:00'

    Since you are printing the output of your program, printing a debug 
    statement will interfere with the operation of the grader. Instead, 
    use the logging module, which we've configured to log to a file printed 
    when you click "Test Run". For example:
    logging.info("My debugging message")
    """


    for line in sys.stdin:
        data = line.strip().split(",")

        # Skip the header row and any malformed lines that don't have all 22 columns.
        if len(data) != 22 or data[6] == 'ENTRIESn_hourly':
            continue

        print "{0}\t{1}\t{2}\t{3}".format(data[1], data[6], data[2], data[3])
        

mapper()
---------------------------------------------------------------------------------------------------------------------
import sys
import logging

from util import reducer_logfile
logging.basicConfig(filename=reducer_logfile, format='%(message)s',
                    level=logging.INFO, filemode='w')

def reducer():
    '''
    Write a reducer that will compute the busiest date and time (that is, the 
    date and time with the most entries) for each turnstile unit. Ties should 
    be broken in favor of datetimes that are later on in the month of May. You 
    may assume that the contents of the reducer will be sorted so that all entries 
    corresponding to a given UNIT will be grouped together.
    
    The reducer should print its output with the UNIT name, the datetime (which 
    is the DATEn followed by the TIMEn column, separated by a single space), and 
    the number of entries at this datetime, separated by tabs.

    For example, the output of the reducer should look like this:
    R001    2011-05-11 17:00:00   31213.0
    R002 2011-05-12 21:00:00   4295.0
    R003 2011-05-05 12:00:00   995.0
    R004 2011-05-12 12:00:00   2318.0
    R005 2011-05-10 12:00:00   2705.0
    R006 2011-05-25 12:00:00   2784.0
    R007 2011-05-10 12:00:00   1763.0
    R008 2011-05-12 12:00:00   1724.0
    R009 2011-05-05 12:00:00   1230.0
    R010 2011-05-09 18:00:00   30916.0
    ...
    ...

    Since you are printing the output of your program, printing a debug 
    statement will interfere with the operation of the grader. Instead, 
    use the logging module, which we've configured to log to a file printed 
    when you click "Test Run". For example:
    logging.info("My debugging message")
    '''

    max_entries = 0.0
    old_key = None
    datetime = ''

    for line in sys.stdin:

        data = line.strip().split('\t')

        if len(data) != 4:
            continue

        this_key, count, date, time  = data
        count = float(count)

        if old_key and old_key != this_key:
            print "{0}\t{1}\t{2}".format(old_key, datetime, max_entries)
            max_entries = 0
            datetime = ''

        old_key = this_key
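        # Using '>=' (rather than '>') means that when counts tie, the datetime
        # seen later in the sorted input wins.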
        if count >= max_entries:
            max_entries = count
            datetime = str(date) + ' ' + str(time)

    if old_key != None:
        print "{0}\t{1}\t{2}".format(old_key, datetime, max_entries)
        logging.info("{0}\t{1}\t{2}".format(old_key, datetime, max_entries))
reducer()









Project 5 Exercise 2

Exercise 2 - Ridership by Weather Type


import sys
import string
import logging

from util import mapper_logfile
logging.basicConfig(filename=mapper_logfile, format='%(message)s',
                    level=logging.INFO, filemode='w')

def mapper():
    '''
    For this exercise, compute the average value of the ENTRIESn_hourly column 
    for different weather types. Weather type will be defined based on the 
    combination of the columns fog and rain (which are boolean values).
    For example, one output of our reducer would be the average hourly entries 
    across all hours when it was raining but not foggy.

    Each line of input will be a row from our final Subway-MTA dataset in csv format.
    You can check out the input csv file and its structure below:
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv
    
    Note that this is a comma-separated file.

    This mapper should PRINT (not return) the weather type as the key (use the 
    given helper function to format the weather type correctly) and the number in 
    the ENTRIESn_hourly column as the value. They should be separated by a tab.
    For example: 'fog-norain\t12345'
    
    Since you are printing the output of your program, printing a debug 
    statement will interfere with the operation of the grader. Instead, 
    use the logging module, which we've configured to log to a file printed 
    when you click "Test Run". For example:
    logging.info("My debugging message")
    '''

    # Takes in variables indicating whether it is foggy and/or rainy and
    # returns a formatted key that you should output.  The variables passed in
    # can be booleans, ints (0 for false and 1 for true) or floats (0.0 for
    # false and 1.0 for true), but the strings '0.0' and '1.0' will not work,
    # so make sure you convert these values to an appropriate type before
    # calling the function.
    def format_key(fog, rain):
        return '{}fog-{}rain'.format(
            '' if fog else 'no',
            '' if rain else 'no'
        )


    for line in sys.stdin:
        data = line.strip().split(',')
       
        if len(data) !=22 or data[6] == "ENTRIESn_hourly":
            continue
        else:
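            # data[14] and data[15] hold the fog and rain flags; data[6] is ENTRIESn_hourly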
            print "{0}\t{1}".format(format_key(float(data[14]),float(data[15])), data[6])
            logging.info("{0}\t{1}".format(format_key(float(data[14]),float(data[15])), data[6]))
       

mapper()
--------------------------------------------------------------------------------------------------------------------------------------------------
import sys
import logging
import numpy
from util import reducer_logfile
logging.basicConfig(filename=reducer_logfile, format='%(message)s',
                    level=logging.INFO, filemode='w')

def reducer():
    '''
    Given the output of the mapper for this assignment, the reducer should
    print one row per weather type, along with the average value of
    ENTRIESn_hourly for that weather type, separated by a tab. You can assume
    that the input to the reducer will be sorted by weather type, such that all
    entries corresponding to a given weather type will be grouped together.

    In order to compute the average value of ENTRIESn_hourly, you'll need to
    keep track of both the total riders per weather type and the number of
    hours with that weather type. That's why we've initialized the variable 
    riders and num_hours below. Feel free to use a different data structure in 
    your solution, though.

    An example output row might look like this:
    'fog-norain\t1105.32467557'

    Since you are printing the output of your program, printing a debug 
    statement will interfere with the operation of the grader. Instead, 
    use the logging module, which we've configured to log to a file printed 
    when you click "Test Run". For example:
    logging.info("My debugging message")
    '''

    entries = 0.0
    avg = 0.0
    num  = 0
    old_key = None

    for line in sys.stdin:
        data = line.strip().split("\t")
        if len(data) !=2:
            continue
        this_key, count = data
        
        if old_key and old_key != this_key:
            print "{0}\t{1}".format(old_key,avg)
            entries = 0
            num = 0
        old_key = this_key
        entries += float(count)
        num += 1
        avg =  entries / num
        
    if old_key != None:
        print "{0}\t{1}".format(old_key, avg)
        logging.info("{0}\t{1}".format(old_key, avg))


reducer()