Wednesday, June 11, 2014

Data Wrangling

Reference: https://www.udacity.com/course/ud359

Data Sources: files, databases, web sites
Data is messy.
Acquiring Data: a lot of data lives in text files, on government websites, and elsewhere on the internet.
Common Data Formats: CSV, XML, JSON
XML and JSON support nested structures, but CSV doesn't.
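A small sketch of that difference, using made-up album data (the album name and track titles here are just illustrative):

```python
import csv
import io
import json

# JSON can nest structures: an album object holding a list of track objects
album_json = '{"album": {"name": "Loud", "tracks": [{"title": "S&M"}, {"title": "Only Girl"}]}}'
album = json.loads(album_json)
first_track = album["album"]["tracks"][0]["title"]

# CSV is flat: each row is a plain sequence of values, so the one-album-to-many-tracks
# relationship must be simulated by repeating the album name on every row
csv_text = "album,track\nLoud,S&M\nLoud,Only Girl\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
```

The JSON version keeps the album/track hierarchy explicit; the CSV version flattens it into repeated rows.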
CSV Exercise:
import pandas

def add_full_name(path_to_csv, path_to_new_csv):
    #Assume you will be reading in a csv file with the same columns that the
    #Lahman baseball data set has -- most importantly, there are columns
    #called 'nameFirst' and 'nameLast'.
    #1) Write a function that reads a csv
    #located at "path_to_csv" into a pandas dataframe and adds a new column
    #called 'nameFull' with a player's full name.
    #
    #For example:
    #   for Hank Aaron, nameFull would be 'Hank Aaron',
    #
    #2) Write the data in the pandas dataFrame to a new csv file located at
    #path_to_new_csv

    #WRITE YOUR CODE HERE
    baseballData = pandas.read_csv(path_to_csv)
    baseballData['nameFull'] = baseballData['nameFirst'] + ' ' + baseballData['nameLast']
    baseballData.to_csv(path_to_new_csv)


Relational Databases:
. DB schema = blueprint of the database
. DB queries: written in SQL
Simple Query Quiz:
import pandas
import pandasql

def select_first_50(filename):
    # Read in our aadhaar_data csv to a pandas dataframe.  Afterwards, we rename the columns
    # by replacing spaces with underscores and setting all characters to lowercase, so the
    # column names more closely resemble columns names one might find in a table.
    aadhaar_data = pandas.read_csv(filename)
    aadhaar_data.rename(columns = lambda x: x.replace(' ', '_').lower(), inplace=True)

    # Select out the first 50 values for "registrar" and "enrolment_agency"
    # in the aadhaar_data table using SQL syntax.
    #
    # Note that "enrolment_agency" is spelled with one l. Also, the order
    # of the select does matter. Make sure you select registrar then enrolment agency
    # in your query.
    q = """
    select registrar, enrolment_agency from aadhaar_data limit 50;
    """
   
    #Execute your SQL command against the pandas frame
    aadhaar_solution = pandasql.sqldf(q.lower(), locals())
    return aadhaar_solution   


Complex Query Quiz:

import pandas
import pandasql

def aggregate_query(filename):
    # Read in our aadhaar_data csv to a pandas dataframe.  Afterwards, we rename the columns
    # by replacing spaces with underscores and setting all characters to lowercase, so the
    # column names more closely resemble columns names one might find in a table.
   
    aadhaar_data = pandas.read_csv(filename)
    aadhaar_data.rename(columns = lambda x: x.replace(' ', '_').lower(), inplace=True)

    # Write a query that will select from the aadhaar_data table how many men and how
    # many women over the age of 50 have had aadhaar generated for them in each district
    #
    # Note that in this quiz, the SQL query keywords are case sensitive.
    # For example, if you want to do a sum make sure you type 'sum' rather than 'SUM'.
    #

    # The possible columns to select from aadhaar data are:
    #     1) Registrar
    #     2) Enrolment Agency
    #     3) State
    #     4) District
    #     5) Sub District
    #     6) Pin Code
    #     7) Gender
    #     8) Age
    #     9) Aadhaar generated
    #     10) Enrolment Rejected
    #     11) Residents providing email,
    #     12) Residents providing mobile number
    #
    # You can download a copy of the aadhaar data that we are passing
    # into this exercise below:
    # https://www.dropbox.com/s/vn8t4uulbsfmalo/aadhaar_data.csv
       
    q = """
    select district, gender, sum(aadhaar_generated)
    from aadhaar_data
    where age > 50
    group by district, gender
    order by district, gender;
    """

    # Execute your SQL command against the pandas frame
    aadhaar_solution = pandasql.sqldf(q.lower(), locals())
    return aadhaar_solution   


Access data on web sites:
Application Programming Interface (API)
Twitter uses Representational State Transfer (REST).
API Example:
http://ws.audioscrobbler.com/2.0/?method=album.getinfo&api_key=[API_KEY]&artist=Rihanna&album=loud&format=json
http://www.last.fm/api/show/album.getInfo


API Exercise:
import json
import requests

def api_get_request(url):
    # In this exercise, you want to call the last.fm API to get a list of the
    # top artists in Spain.
    #
    # Once you've done this, return the name of the number 1 top artist in Spain.
    url = 'http://ws.audioscrobbler.com/2.0/?method=geo.gettopartists&country=spain&api_key=0a5de916777ad83d6de29347308556a1&format=json'
    data = requests.get(url).text
    data = json.loads(data)
    artists = data['topartists']
    topArtist = artists['artist'][0]

    topArtistName = topArtist['name']
    listeners = topArtist['listeners']
    print(topArtistName + '   ' + listeners)
    return topArtistName  # return the top artist in Spain


Sanity Checking Data:
Pandas' describe() function works well here.
It computes summary statistics on the data, including:
count, mean, standard deviation (std), minimum (min), maximum (max), and the 25%, 50%, and 75% percentiles
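A quick sketch of describe() on a made-up 'weight' column (the values here are invented for illustration):

```python
import pandas

# A small made-up 'weight' column to sanity-check
df = pandas.DataFrame({'weight': [180, 200, 190, 210, 175]})

# describe() returns count, mean, std, min, 25%, 50%, 75%, and max in one call
stats = df['weight'].describe()
print(stats)
```

Scanning these statistics is a fast way to spot impossible values (e.g., a negative minimum weight) before doing any real analysis.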

Missing Values in our Data:
Why are they missing?
1) Some data is missing because of privacy settings
2) Some data is missing because of data-entry mistakes
3) System errors

Dealing with missing data:
1) Partial Deletion
                1-1) Listwise deletion: exclude the whole record from analysis
                1-2) Pairwise deletion: exclude just the missing field from the calculation, not the whole record, because we can still use the other fields in the record for analysis
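The two deletion strategies above can be sketched with pandas, on made-up records (ages and weights here are invented for illustration):

```python
import numpy
import pandas

# Made-up records, each with one missing field
df = pandas.DataFrame({'age': [25, numpy.nan, 40],
                       'weight': [180, 200, numpy.nan]})

# Listwise deletion: drop every record that has any missing field
listwise = df.dropna()

# Pairwise deletion: each calculation simply skips the missing field,
# keeping the rest of the record available for other calculations
mean_age = df['age'].mean()        # uses the two non-missing ages
mean_weight = df['weight'].mean()  # uses the two non-missing weights
```

Listwise deletion keeps only one of the three records here, while pairwise deletion still uses two values for each mean.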
               
2) Imputation
When we have limited data and removing records with missing fields would cause the loss of a large amount of information, we try to impute the data instead:
                2-1) Use the mean value of that field across all other records as the missing value
                Pros: we don't change the mean of the whole data set
                Cons: correlations between values will not be accurate
                2-2) Use linear regression:
                To estimate the missing value, we build a model from the records that do have data for the missing field, then use that model to predict the missing value.
                Cons: over-emphasizes the existing data
                Pros: suggests an exact value for the missing data
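A minimal sketch of regression imputation, assuming we predict a missing weight from height with a simple one-variable fit via numpy.polyfit (the course doesn't specify an implementation, and the height/weight values here are made up):

```python
import numpy
import pandas

# Made-up records: one player is missing a weight
df = pandas.DataFrame({'height': [70, 72, 74, 76],
                       'weight': [170, 180, numpy.nan, 200]})

# Fit weight = a * height + b on the complete records only
complete = df.dropna()
a, b = numpy.polyfit(complete['height'], complete['weight'], 1)

# Predict the missing weight from that record's own height
missing = df['weight'].isnull()
df.loc[missing, 'weight'] = a * df.loc[missing, 'height'] + b
```

Unlike mean imputation, the filled value depends on the record's other fields, which is exactly why it tends to over-emphasize the patterns already in the data.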

Imputation Exercise:

import pandas
import numpy

def imputation(filename):
    # Pandas dataframes have a method called 'fillna(value)', such that you can
    # pass in a single value to replace any NAs in a dataframe or series. You
    # can call it like this:
    #     dataframe['column'] = dataframe['column'].fillna(value)
    #
    # Using the numpy.mean function, which calculates the mean of a numpy
    # array, impute any missing values in our Lahman baseball
    # data sets 'weight' column by setting them equal to the average weight.
    #
    # You can access the 'weight' colum in the baseball data frame by
    # calling baseball['weight']

    #YOUR CODE GOES HERE
    baseball = pandas.read_csv(filename)
    baseball['weight'] = baseball['weight'].fillna(numpy.mean(baseball['weight']))
    return baseball

Exercise 1 - Number of Rainy Days
import pandas
import pandasql


def num_rainy_days(filename):
    '''
    This function should run a SQL query on a dataframe of
    weather data.  The SQL query should return one column and
    one row - a count of the number of days in the dataframe where
    the rain column is equal to 1 (i.e., the number of days it
    rained).  The dataframe will be titled 'weather_data'. You'll
    need to provide the SQL query.  You might find SQL's count function
    useful for this exercise.  You can read more about it here:
   
    https://dev.mysql.com/doc/refman/5.1/en/counting-rows.html
   
    You might also find that interpreting numbers as integers or floats may not
    work initially.  In order to get around this issue, it may be useful to cast
    these numbers as integers.  This can be done by writing cast(column as integer).
    So for example, if we wanted to cast the maxtempi column as an integer, we would actually
    write something like where cast(maxtempi as integer) = 76, as opposed to simply
    where maxtempi = 76.
   
    You can see the weather data that we are passing in below:
    https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv
    '''
    weather_data = pandas.read_csv(filename)

    q = """
    select count(*) from weather_data where rain=1;
    """
   
    #Execute your SQL command against the pandas frame
    rainy_days = pandasql.sqldf(q.lower(), locals())
    return rainy_days

Exercise 2: Max Temp on Foggy and Non-Foggy Days


import pandas
import pandasql


def max_temp_aggregate_by_fog(filename):
    '''
    This function should run a SQL query on a dataframe of
    weather data.  The SQL query should return two columns and
    two rows - whether it was foggy or not (0 or 1) and the max
    maxtempi for that fog value (i.e., the maximum max temperature
    for both foggy and non-foggy days).  The dataframe will be
    titled 'weather_data'. You'll need to provide the SQL query.
   
    You might also find that interpreting numbers as integers or floats may not
    work initially.  In order to get around this issue, it may be useful to cast
    these numbers as integers.  This can be done by writing cast(column as integer).
    So for example, if we wanted to cast the maxtempi column as an integer, we would actually
    write something like where cast(maxtempi as integer) = 76, as opposed to simply
    where maxtempi = 76.
   
    You can see the weather data that we are passing in below:
    https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv
    '''
    weather_data = pandas.read_csv(filename)

    q = """
        select fog, MAX(cast (maxtempi as integer) ) from weather_data GROUP BY fog;
    """
   
    #Execute your SQL command against the pandas frame
    rainy_days = pandasql.sqldf(q.lower(), locals())
    return rainy_days
