Project 2 :
Wrangle NYC Subway and Weather
Data
Exercise 1 - Number of Rainy Days
import pandas
import pandasql
def num_rainy_days(filename):
'''
This function
should run a SQL query on a dataframe of
weather data. The SQL query should return one column and
one row - a count
of the number of days in the dataframe where
the rain column is
equal to 1 (i.e., the number of days it
rained). The dataframe will be titled 'weather_data'.
You'll
need to provide
the SQL query. You might find SQL's
count function
useful for this
exercise. You can read more about it
here:
https://dev.mysql.com/doc/refman/5.1/en/counting-rows.html
You might also
find that interpreting numbers as integers or floats may not
work initially. In order to get around this issue, it may be
equal to cast
these numbers as
integers. This can be done by writing
cast(column as integer).
So for example, if
we wanted to cast the maxtempi column as an integer, we would actually
write something
like where cast(maxtempi as integer) = 76, as opposed to simply
where maxtempi =
76.
You can see the
weather data that we are passing in below:
https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv
'''
weather_data =
pandas.read_csv(filename)
q =
"""
select count(*)
from weather_data where rain=1;
"""
#Execute your SQL
command against the pandas frame
rainy_days =
pandasql.sqldf(q.lower(), locals())
return rainy_days
Exercise 2: > Temp on Foggy
and Non Foggy Days
import pandas
import pandasql
def max_temp_aggregate_by_fog(filename):
'''
This function
should run a SQL query on a dataframe of
weather data. The SQL query should return two columns and
two rows - whether
it was foggy or not (0 or 1) and the max
maxtempi for that
fog value (i.e., the maximum max temperature
for both foggy and
non-foggy days). The dataframe will be
titled
'weather_data'. You'll need to provide the SQL query.
You might also
find that interpreting numbers as integers or floats may not
work
initially. In order to get around this
issue, it may be useful to cast
these numbers as
integers. This can be done by writing
cast(column as integer).
So for example, if
we wanted to cast the maxtempi column as an integer, we would actually
write something
like where cast(maxtempi as integer) = 76, as opposed to simply
where maxtempi =
76.
You can see the
weather data that we are passing in below:
https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv
'''
weather_data =
pandas.read_csv(filename)
q =
"""
select fog,
MAX(cast (maxtempi as integer) ) from weather_data GROUP BY fog;
"""
#Execute your SQL
command against the pandas frame
rainy_days =
pandasql.sqldf(q.lower(), locals())
return rainy_days
06/17/2014
Exercise
3 - Mean Temp on Weekends
import pandas
import pandasql
def avg_min_temperature(filename):
'''
This function
should run a SQL query on a dataframe of
weather data. The SQL query should return one column and
one row - the
average meantempi on days that are a Saturday
or Sunday (i.e.,
the the average mean temperature on weekends).
The dataframe will
be titled 'weather_data' and you can access
the date in the
dataframe via the 'date' column.
You'll need to
provide the SQL query.
You might also
find that interpreting numbers as integers or floats may not
work
initially. In order to get around this
issue, it may be equal to cast
these numbers as
integers. This can be done by writing
cast(column as integer).
So for example, if
we wanted to cast the maxtempi column as an integer, we would actually
write something
like where cast(maxtempi as integer) = 76, as opposed to simply
where maxtempi =
76.
Also, you can
convert dates to days of the week via the 'strftime' keyword in SQL.
For example, cast
(strftime('%w', date) as integer) will return 0 if the date
is a Sunday or 6
if the date is a Saturday.
You can see the
weather data that we are passing in below:
https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv
'''
weather_data =
pandas.read_csv(filename)
q =
"""
select AVG(cast
(meantempi as integer)) from weather_data where cast(strftime('%w',date) as
integer) = 0 or
cast(strftime('%w',date) as integer) = 6;
"""
#Execute your SQL
command against the pandas frame
mean_temp_weekends
= pandasql.sqldf(q.lower(), locals())
return
mean_temp_weekends
Exercise 4
- Mean Temp on Rainy Days
import pandas
import pandasql
def avg_min_temperature(filename):
'''
This function
should run a SQL query on a dataframe of
weather data.
More specifically you want to find the average
minimum
temperature on rainy days where the minimum temperature
is greater than
55 degrees.
You might also
find that interpreting numbers as integers or floats may not
work
initially. In order to get around this
issue, it may be equal to cast
these numbers
as integers. This can be done by writing
cast(column as integer).
So for example,
if we wanted to cast the maxtempi column as an integer, we would actually
write something
like where cast(maxtempi as integer) = 76, as opposed to simply
where maxtempi
= 76.
You can see the
weather data that we are passing in below:
https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv
'''
weather_data =
pandas.read_csv(filename)
q =
"""
select
AVG(cast (mintempi as integer)) from weather_data where cast(mintempi as
integer) > 55
and rain
=1;
"""
#Execute your
SQL command against the pandas frame
mean_temp_weekends = pandasql.sqldf(q.lower(), locals())
return
mean_temp_weekends
Exercise 5
- Fixing NYC Subway Turnstile Data
Input file :
A002,R051,02-00-00,05-21-11,00:00:00,REGULAR,003169391,001097585,05-21-11,04:00:00,REGULAR,003169415,001097588,05-21-11,08:00:00,REGULAR,003169431,001097607,05-21-11,12:00:00,REGULAR,003169506,001097686,05-21-11,16:00:00,REGULAR,003169693,001097734,05-21-11,20:00:00,REGULAR,003169998,001097769,05-22-11,00:00:00,REGULAR,003170119,001097792,05-22-11,04:00:00,REGULAR,003170146,001097801
Output File:
A002,R051,02-00-00,05-21-11,00:00:00,REGULAR,003169391,001097585
A002,R051,02-00-00,05-21-11,04:00:00,REGULAR,003169415,001097588
A002,R051,02-00-00,05-21-11,08:00:00,REGULAR,003169431,001097607
A002,R051,02-00-00,05-21-11,12:00:00,REGULAR,003169506,001097686
A002,R051,02-00-00,05-21-11,16:00:00,REGULAR,003169693,001097734
import csv
def fix_turnstile_data(filenames):
'''
Filenames is a
list of MTA Subway turnstile text files. A link to an example
MTA Subway
turnstile text file can be seen at the URL below:
http://web.mta.info/developers/data/nyct/turnstile/turnstile_110507.txt
As you can see,
there are numerous data points included in each row of the
a MTA Subway
turnstile text file.
You want to
write a function that will update each row in the text
file so there
is only one entry per row. A few examples below:
A002,R051,02-00-00,05-28-11,00:00:00,REGULAR,003178521,001100739
A002,R051,02-00-00,05-28-11,04:00:00,REGULAR,003178541,001100746
A002,R051,02-00-00,05-28-11,08:00:00,REGULAR,003178559,001100775
Write the
updates to a different text file in the format of "updated_" +
filename.
For example:
1) if you
read in a text file called "turnstile_110521.txt"
2) you
should write the updated data to "updated_turnstile_110521.txt"
The order of
the fields should be preserved.
You can see a
sample of the turnstile text file that's passed into this function
and the the
corresponding updated file in the links below:
Sample input
file:
https://www.dropbox.com/s/mpin5zv4hgrx244/turnstile_110528.txt
Sample updated
file:
https://www.dropbox.com/s/074xbgio4c39b7h/solution_turnstile_110528.txt
'''
for name in
filenames:
ifile = open(name, "rb")
reader =
csv.reader(ifile)
outputfile
= 'updated_' + name
ofile = open(outputfile, "wb")
writer =
csv.writer(ofile, delimiter=',')
for row in
reader:
row
= [x.strip(' ') for x in row]
for
i in range( 0 , 7):
if row[3+5*i:8+5*i] == []:
break;
else:
writer.writerow(row[0:3] + row[3+5*i:8+5*i])
ifile.close()
ofile.close()
Nice and good article. It is very useful for me to learn and understand easily. Thanks for sharing your valuable information and time. Please keep updating MicroStrategy Online course Hyderabad
ReplyDelete