Reference: https://www.udacity.com/course/ud359
Data Sources: Files, Databases, Web sites
Data is messy.
Acquiring Data: a lot of data lives in text files, on government websites, and elsewhere on the internet.
Common Data Formats: CSV, XML, JSON
XML and JSON support nested structures, but CSV doesn't.
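As a quick illustration of why that matters, here is a small sketch of parsing a nested JSON document with Python's standard json module (the record itself is made up for the example):

```python
import json

# A JSON document can nest one structure inside another,
# which a single flat CSV row cannot express
record = json.loads("""
{
  "name": "Hank Aaron",
  "teams": [
    {"name": "Braves", "years": 21},
    {"name": "Brewers", "years": 2}
  ]
}
""")

# The nested list of teams parses into a real list of dicts,
# not a flat string
print(record['teams'][0]['name'])
```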
CSV Exercise:
import pandas

def add_full_name(path_to_csv, path_to_new_csv):
    # Assume you will be reading in a csv file with the same columns that the
    # Lahman baseball data set has -- most importantly, there are columns
    # called 'nameFirst' and 'nameLast'.
    #
    # 1) Write a function that reads a csv located at "path_to_csv" into a
    #    pandas dataframe and adds a new column called 'nameFull' with a
    #    player's full name.
    #
    #    For example: for Hank Aaron, nameFull would be 'Hank Aaron'.
    #
    # 2) Write the data in the pandas dataframe to a new csv file located at
    #    path_to_new_csv.

    #WRITE YOUR CODE HERE
    baseballData = pandas.read_csv(path_to_csv)
    baseballData['nameFull'] = baseballData['nameFirst'] + ' ' + baseballData['nameLast']
    baseballData.to_csv(path_to_new_csv)
Relational Databases:
. DB Schema = the blueprint of the database
. DB Queries: written in SQL
Simple Query Quiz:
import pandas
import pandasql

def select_first_50(filename):
    # Read in our aadhaar_data csv to a pandas dataframe. Afterwards, we
    # rename the columns by replacing spaces with underscores and setting
    # all characters to lowercase, so the column names more closely resemble
    # column names one might find in a table.
    aadhaar_data = pandas.read_csv(filename)
    aadhaar_data.rename(columns=lambda x: x.replace(' ', '_').lower(), inplace=True)

    # Select out the first 50 values for "registrar" and "enrolment_agency"
    # in the aadhaar_data table using SQL syntax.
    #
    # Note that "enrolment_agency" is spelled with one l. Also, the order of
    # the select does matter. Make sure you select registrar then enrolment
    # agency in your query.
    q = """
        select registrar, enrolment_agency from aadhaar_data limit 50;
        """

    # Execute your SQL command against the pandas frame
    aadhaar_solution = pandasql.sqldf(q.lower(), locals())
    return aadhaar_solution
Complex Query Quiz:
import pandas
import pandasql

def aggregate_query(filename):
    # Read in our aadhaar_data csv to a pandas dataframe. Afterwards, we
    # rename the columns by replacing spaces with underscores and setting
    # all characters to lowercase, so the column names more closely resemble
    # column names one might find in a table.
    aadhaar_data = pandas.read_csv(filename)
    aadhaar_data.rename(columns=lambda x: x.replace(' ', '_').lower(), inplace=True)

    # Write a query that will select from the aadhaar_data table how many
    # men and how many women over the age of 50 have had aadhaar generated
    # for them in each district.
    #
    # Note that in this quiz, the SQL query keywords are case sensitive.
    # For example, if you want to do a sum make sure you type 'sum' rather
    # than 'SUM'.
    #
    # The possible columns to select from aadhaar data are:
    # 1) Registrar
    # 2) Enrolment Agency
    # 3) State
    # 4) District
    # 5) Sub District
    # 6) Pin Code
    # 7) Gender
    # 8) Age
    # 9) Aadhaar generated
    # 10) Enrolment Rejected
    # 11) Residents providing email
    # 12) Residents providing mobile number
    #
    # You can download a copy of the aadhaar data that we are passing into
    # this exercise below:
    # https://www.dropbox.com/s/vn8t4uulbsfmalo/aadhaar_data.csv
    q = """
        select district, gender, sum(aadhaar_generated)
        from aadhaar_data
        where age > 50
        group by district, gender
        order by district, gender;
        """

    # Execute your SQL command against the pandas frame
    aadhaar_solution = pandasql.sqldf(q.lower(), locals())
    return aadhaar_solution
Access data on web sites:
API = Application Programming Interface
The Twitter API uses REST (Representational State Transfer).
API Example (last.fm album.getInfo):
http://ws.audioscrobbler.com/2.0/?method=album.getinfo&api_key=[API_KEY]&artist=Rihanna&album=loud&format=json
http://www.last.fm/api/show/album.getInfo
API Exercise:
import json
import requests

def api_get_request(url):
    # In this exercise, you want to call the last.fm API to get a list of
    # the top artists in Spain.
    #
    # Once you've done this, return the name of the number 1 top artist in
    # Spain.
    url = 'http://ws.audioscrobbler.com/2.0/?method=geo.gettopartists&country=spain&api_key=0a5de916777ad83d6de29347308556a1&format=json'
    data = requests.get(url).text
    data = json.loads(data)
    artists = data['topartists']
    topArtist = artists['artist'][0]
    topArtistName = topArtist['name']
    listeners = topArtist['listeners']
    print(topArtistName + ' ' + listeners)
    return topArtistName  # return the top artist in Spain
Sanity Checking Data:
Pandas' describe() function works well here.
It computes summary statistics on the data, including:
count, mean, standard deviation (std), minimum (min),
maximum (max), and the 25%, 50%, and 75% percentiles.
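A minimal sketch of describe() on a tiny made-up frame (real data would come from read_csv):

```python
import pandas

# A small illustrative dataframe
df = pandas.DataFrame({'weight': [180, 200, 175, 190]})

# describe() reports count, mean, std, min, 25%, 50%, 75%, and max
stats = df['weight'].describe()
print(stats)
```

Eyeballing these eight numbers is often enough to spot obviously bad values, such as a negative weight or an impossible maximum.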
Missing Values in our Data:
Why are they missing? Common reasons:
1) Some data is missing because of privacy settings
2) Some data is missing because of data-entry mistakes
3) System errors
Dealing with missing data:
1) Partial Deletion
    1-1) Listwise Deletion: exclude the whole record from the analysis
    1-2) Pairwise Deletion: exclude just the missing field from the calculation, not the
    whole record, because we can still use the other fields in the record for analysis
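Both deletion strategies can be sketched with pandas; the columns and values below are made up for illustration:

```python
import numpy
import pandas

# Illustrative records with missing fields (NaN)
df = pandas.DataFrame({'age': [25, numpy.nan, 40],
                       'height': [70, 72, numpy.nan]})

# Listwise deletion: drop every record that has any missing field
listwise = df.dropna()

# Pairwise deletion: each calculation uses whatever fields are present;
# pandas' mean() skips NaN values by default
mean_age = df['age'].mean()        # uses rows 0 and 2
mean_height = df['height'].mean()  # uses rows 0 and 1
```

Note how listwise deletion keeps only one of the three records here, while pairwise deletion still uses two records for each mean.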
2) Imputation
    When we have limited data and removing records with missing fields would
    lose a large amount of information, we try to impute the data instead.
    1) Mean imputation: use the mean value of all other records for the missing value
        Pros: we don't change the mean of the whole data
        Cons: correlations between values will not be accurate
    2) Linear Regression:
        To estimate a missing value, we build an equation from the records that
        do have data for the missing field, training the model on the data we
        already have.
        Pros: an exact value suggestion for each missing entry
        Cons: over-emphasizes the existing data
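A sketch of regression imputation using numpy.polyfit; the 'height' and 'weight' columns and the values are illustrative, not the real Lahman data:

```python
import numpy
import pandas

# Illustrative data: one weight is missing but its height is known
df = pandas.DataFrame({'height': [70.0, 72.0, 74.0, 76.0],
                       'weight': [170.0, 180.0, numpy.nan, 200.0]})

# Fit a line weight = a * height + b on the complete records only
known = df.dropna()
a, b = numpy.polyfit(known['height'], known['weight'], 1)

# Predict each missing weight from its height
missing = df['weight'].isnull()
df.loc[missing, 'weight'] = a * df.loc[missing, 'height'] + b
```

This fills the gap with the model's prediction rather than a single global mean, at the cost of trusting the fitted line completely.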
Imputation Exercise:
import pandas
import numpy

def imputation(filename):
    # Pandas dataframes have a method called 'fillna(value)', such that you
    # can pass in a single value to replace any NAs in a dataframe or
    # series. You can call it like this:
    #     dataframe['column'] = dataframe['column'].fillna(value)
    #
    # Using the numpy.mean function, which calculates the mean of a numpy
    # array, impute any missing values in our Lahman baseball data set's
    # 'weight' column by setting them equal to the average weight.
    #
    # You can access the 'weight' column in the baseball data frame by
    # calling baseball['weight']

    #YOUR CODE GOES HERE
    baseball = pandas.read_csv(filename)
    baseball['weight'] = baseball['weight'].fillna(numpy.mean(baseball['weight']))
    return baseball
Exercise 1 - Number of Rainy Days
import pandas
import pandasql

def num_rainy_days(filename):
    '''
    This function should run a SQL query on a dataframe of weather data.
    The SQL query should return one column and one row - a count of the
    number of days in the dataframe where the rain column is equal to 1
    (i.e., the number of days it rained). The dataframe will be titled
    'weather_data'. You'll need to provide the SQL query. You might find
    SQL's count function useful for this exercise. You can read more about
    it here:
    https://dev.mysql.com/doc/refman/5.1/en/counting-rows.html

    You might also find that interpreting numbers as integers or floats may
    not work initially. In order to get around this issue, it may be useful
    to cast these numbers as integers. This can be done by writing
    cast(column as integer). So for example, if we wanted to cast the
    maxtempi column as an integer, we would actually write something like
    where cast(maxtempi as integer) = 76, as opposed to simply
    where maxtempi = 76.

    You can see the weather data that we are passing in below:
    https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv
    '''
    weather_data = pandas.read_csv(filename)

    q = """
        select count(*) from weather_data where rain = 1;
        """

    # Execute your SQL command against the pandas frame
    rainy_days = pandasql.sqldf(q.lower(), locals())
    return rainy_days
Exercise 2: Max Temp on Foggy and Non-Foggy Days

import pandas
import pandasql

def max_temp_aggregate_by_fog(filename):
    '''
    This function should run a SQL query on a dataframe of weather data.
    The SQL query should return two columns and two rows - whether it was
    foggy or not (0 or 1) and the max maxtempi for that fog value (i.e.,
    the maximum max temperature for both foggy and non-foggy days). The
    dataframe will be titled 'weather_data'. You'll need to provide the
    SQL query.

    You might also find that interpreting numbers as integers or floats may
    not work initially. In order to get around this issue, it may be useful
    to cast these numbers as integers. This can be done by writing
    cast(column as integer). So for example, if we wanted to cast the
    maxtempi column as an integer, we would actually write something like
    where cast(maxtempi as integer) = 76, as opposed to simply
    where maxtempi = 76.

    You can see the weather data that we are passing in below:
    https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv
    '''
    weather_data = pandas.read_csv(filename)

    q = """
        select fog, max(cast(maxtempi as integer))
        from weather_data
        group by fog;
        """

    # Execute your SQL command against the pandas frame
    foggy_days = pandasql.sqldf(q.lower(), locals())
    return foggy_days