Wednesday, June 11, 2014

Data Wrangling

Reference: https://www.udacity.com/course/ud359

Data Sources: files, databases, web sites
Data is messy.
Acquiring Data: a lot of data lives in text files, on government websites, and elsewhere on the internet.
Common Data Formats: CSV, XML, JSON
XML and JSON support nested structures, but CSV doesn't.
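A small sketch of that difference, using made-up album data (the album name and track titles here are just illustrative):

```python
import csv
import io
import json

# JSON can nest structures: an album object holding a list of track objects
album_json = '{"album": {"name": "Loud", "tracks": [{"title": "S&M"}, {"title": "Only Girl"}]}}'
album = json.loads(album_json)
first_track = album["album"]["tracks"][0]["title"]

# CSV is flat: each row is a plain sequence of values, so the one-album-to-many-tracks
# relationship must be simulated by repeating the album name on every row
csv_text = "album,track\nLoud,S&M\nLoud,Only Girl\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
```

The JSON version keeps the album/track hierarchy explicit; the CSV version flattens it into repeated rows.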
CSV Exercise:
import pandas

def add_full_name(path_to_csv, path_to_new_csv):
    #Assume you will be reading in a csv file with the same columns that the
    #Lahman baseball data set has -- most importantly, there are columns
    #called 'nameFirst' and 'nameLast'.
    #1) Write a function that reads a csv
    #located at "path_to_csv" into a pandas dataframe and adds a new column
    #called 'nameFull' with a player's full name.
    #
    #For example:
    #   for Hank Aaron, nameFull would be 'Hank Aaron',
    #
    #2) Write the data in the pandas dataFrame to a new csv file located at
    #path_to_new_csv

    #WRITE YOUR CODE HERE
    baseballData = pandas.read_csv(path_to_csv)
    baseballData['nameFull'] = baseballData['nameFirst'] + ' ' + baseballData['nameLast']
    baseballData.to_csv(path_to_new_csv)


Relational Databases:
. DB schema = blueprint of the database
. DB queries: written in SQL
Simple Query Quiz:
import pandas
import pandasql

def select_first_50(filename):
    # Read in our aadhaar_data csv to a pandas dataframe.  Afterwards, we rename the columns
    # by replacing spaces with underscores and setting all characters to lowercase, so the
    # column names more closely resemble columns names one might find in a table.
    aadhaar_data = pandas.read_csv(filename)
    aadhaar_data.rename(columns = lambda x: x.replace(' ', '_').lower(), inplace=True)

    # Select out the first 50 values for "registrar" and "enrolment_agency"
    # in the aadhaar_data table using SQL syntax.
    #
    # Note that "enrolment_agency" is spelled with one l. Also, the order
    # of the select does matter. Make sure you select registrar then enrolment agency
    # in your query.
    q = """
    select registrar, enrolment_agency from aadhaar_data limit 50;
    """
   
    #Execute your SQL command against the pandas frame
    aadhaar_solution = pandasql.sqldf(q.lower(), locals())
    return aadhaar_solution   


Complex Query Quiz:

import pandas
import pandasql

def aggregate_query(filename):
    # Read in our aadhaar_data csv to a pandas dataframe.  Afterwards, we rename the columns
    # by replacing spaces with underscores and setting all characters to lowercase, so the
    # column names more closely resemble columns names one might find in a table.
   
    aadhaar_data = pandas.read_csv(filename)
    aadhaar_data.rename(columns = lambda x: x.replace(' ', '_').lower(), inplace=True)

    # Write a query that will select from the aadhaar_data table how many men and how
    # many women over the age of 50 have had aadhaar generated for them in each district
    #
    # Note that in this quiz, the SQL query keywords are case sensitive.
    # For example, if you want to do a sum make sure you type 'sum' rather than 'SUM'.
    #

    # The possible columns to select from aadhaar data are:
    #     1) Registrar
    #     2) Enrolment Agency
    #     3) State
    #     4) District
    #     5) Sub District
    #     6) Pin Code
    #     7) Gender
    #     8) Age
    #     9) Aadhaar generated
    #     10) Enrolment Rejected
    #     11) Residents providing email,
    #     12) Residents providing mobile number
    #
    # You can download a copy of the aadhaar data that we are passing
    # into this exercise below:
    # https://www.dropbox.com/s/vn8t4uulbsfmalo/aadhaar_data.csv
       
    q = """
    select district, gender, sum(aadhaar_generated)
    from aadhaar_data
    where age > 50
    group by district, gender
    order by district, gender;
    """

    # Execute your SQL command against the pandas frame
    aadhaar_solution = pandasql.sqldf(q.lower(), locals())
    return aadhaar_solution   


Access data on web sites:
Application Programming Interface (API)
Twitter uses Representational State Transfer (REST).
API Example:
http://ws.audioscrobbler.com/2.0/?method=album.getinfo&api_key=[API_KEY]&artist=Rihanna&album=loud&format=json
http://www.last.fm/api/show/album.getInfo


API Exercise:
import json
import requests

def api_get_request(url):
    # In this exercise, you want to call the last.fm API to get a list of the
    # top artists in Spain.
    #
    # Once you've done this, return the name of the number 1 top artist in Spain.
    url = 'http://ws.audioscrobbler.com/2.0/?method=geo.gettopartists&country=spain&api_key=0a5de916777ad83d6de29347308556a1&format=json'
    data = requests.get(url).text
    data = json.loads(data)
    artists = data['topartists']
    topArtist = artists['artist'][0]

    topArtistName = topArtist['name']
    listeners = topArtist['listeners']
    print(topArtistName + '   ' + listeners)
    return topArtistName  # return the top artist in Spain


Sanity Checking Data:
Pandas' describe() function works well here.
It computes summary statistics on the data, including:
count, mean, standard deviation (std), minimum (min), maximum (max), and the 25%, 50%, and 75% percentiles
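A quick sketch of describe() on a made-up 'weight' column (the values here are invented for illustration):

```python
import pandas

# A small made-up 'weight' column to sanity-check
df = pandas.DataFrame({'weight': [180, 200, 190, 210, 175]})

# describe() returns count, mean, std, min, 25%, 50%, 75%, and max in one call
stats = df['weight'].describe()
print(stats)
```

Scanning these statistics is a fast way to spot impossible values (e.g., a negative minimum weight) before doing any real analysis.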

Missing Values in our Data:
Why are they missing?
1) Some data is missing because of privacy settings
2) Some data is missing because of data-entry mistakes
3) System errors

Dealing with missing data:
1) Partial Deletion
                1-1) Listwise deletion: exclude the whole record from analysis
                1-2) Pairwise deletion: exclude just the missing field from the calculation, not the whole record, because we can still use the other fields in the record for analysis
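The two deletion strategies above can be sketched with pandas, on made-up records (ages and weights here are invented for illustration):

```python
import numpy
import pandas

# Made-up records, each with one missing field
df = pandas.DataFrame({'age': [25, numpy.nan, 40],
                       'weight': [180, 200, numpy.nan]})

# Listwise deletion: drop every record that has any missing field
listwise = df.dropna()

# Pairwise deletion: each calculation simply skips the missing field,
# keeping the rest of the record available for other calculations
mean_age = df['age'].mean()        # uses the two non-missing ages
mean_weight = df['weight'].mean()  # uses the two non-missing weights
```

Listwise deletion keeps only one of the three records here, while pairwise deletion still uses two values for each mean.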
               
2) Imputation
When we have limited data and removing records with missing fields would cause the loss of a large amount of information, we try to impute the data instead:
                2-1) Use the mean value of that field across all other records as the missing value
                Pros: we don't change the mean of the whole data set
                Cons: correlations between values will not be accurate
                2-2) Use linear regression:
                To estimate the missing value, we build a model from the records that do have data for the missing field, then use that model to predict the missing value.
                Cons: over-emphasizes the existing data
                Pros: suggests an exact value for the missing data
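A minimal sketch of regression imputation, assuming we predict a missing weight from height with a simple one-variable fit via numpy.polyfit (the course doesn't specify an implementation, and the height/weight values here are made up):

```python
import numpy
import pandas

# Made-up records: one player is missing a weight
df = pandas.DataFrame({'height': [70, 72, 74, 76],
                       'weight': [170, 180, numpy.nan, 200]})

# Fit weight = a * height + b on the complete records only
complete = df.dropna()
a, b = numpy.polyfit(complete['height'], complete['weight'], 1)

# Predict the missing weight from that record's own height
missing = df['weight'].isnull()
df.loc[missing, 'weight'] = a * df.loc[missing, 'height'] + b
```

Unlike mean imputation, the filled value depends on the record's other fields, which is exactly why it tends to over-emphasize the patterns already in the data.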

Imputation Exercise:

import pandas
import numpy

def imputation(filename):
    # Pandas dataframes have a method called 'fillna(value)', such that you can
    # pass in a single value to replace any NAs in a dataframe or series. You
    # can call it like this:
    #     dataframe['column'] = dataframe['column'].fillna(value)
    #
    # Using the numpy.mean function, which calculates the mean of a numpy
    # array, impute any missing values in our Lahman baseball
    # data sets 'weight' column by setting them equal to the average weight.
    #
    # You can access the 'weight' colum in the baseball data frame by
    # calling baseball['weight']

    #YOUR CODE GOES HERE
    baseball = pandas.read_csv(filename)
    baseball['weight'] = baseball['weight'].fillna(numpy.mean(baseball['weight']))
    return baseball

Exercise 1 - Number of Rainy Days
import pandas
import pandasql


def num_rainy_days(filename):
    '''
    This function should run a SQL query on a dataframe of
    weather data.  The SQL query should return one column and
    one row - a count of the number of days in the dataframe where
    the rain column is equal to 1 (i.e., the number of days it
    rained).  The dataframe will be titled 'weather_data'. You'll
    need to provide the SQL query.  You might find SQL's count function
    useful for this exercise.  You can read more about it here:
   
    https://dev.mysql.com/doc/refman/5.1/en/counting-rows.html
   
    You might also find that interpreting numbers as integers or floats may not
    work initially.  In order to get around this issue, it may be useful to cast
    these numbers as integers.  This can be done by writing cast(column as integer).
    So for example, if we wanted to cast the maxtempi column as an integer, we would actually
    write something like where cast(maxtempi as integer) = 76, as opposed to simply
    where maxtempi = 76.
   
    You can see the weather data that we are passing in below:
    https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv
    '''
    weather_data = pandas.read_csv(filename)

    q = """
    select count(*) from weather_data where rain=1;
    """
   
    #Execute your SQL command against the pandas frame
    rainy_days = pandasql.sqldf(q.lower(), locals())
    return rainy_days

Exercise 2: Max Temp on Foggy and Non-Foggy Days


import pandas
import pandasql


def max_temp_aggregate_by_fog(filename):
    '''
    This function should run a SQL query on a dataframe of
    weather data.  The SQL query should return two columns and
    two rows - whether it was foggy or not (0 or 1) and the max
    maxtempi for that fog value (i.e., the maximum max temperature
    for both foggy and non-foggy days).  The dataframe will be
    titled 'weather_data'. You'll need to provide the SQL query.
   
    You might also find that interpreting numbers as integers or floats may not
    work initially.  In order to get around this issue, it may be useful to cast
    these numbers as integers.  This can be done by writing cast(column as integer).
    So for example, if we wanted to cast the maxtempi column as an integer, we would actually
    write something like where cast(maxtempi as integer) = 76, as opposed to simply
    where maxtempi = 76.
   
    You can see the weather data that we are passing in below:
    https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv
    '''
    weather_data = pandas.read_csv(filename)

    q = """
        select fog, MAX(cast (maxtempi as integer) ) from weather_data GROUP BY fog;
    """
   
    #Execute your SQL command against the pandas frame
    rainy_days = pandasql.sqldf(q.lower(), locals())
    return rainy_days
