The US Department of Agriculture publishes price estimates for fruits and vegetables online. The most recent estimates are based on a 2013 survey of US retail stores.
The estimates are provided as a collection of MS Excel files, with one file per fruit or vegetable. The assignment3_data.zip
file contains the fruit and vegetable files in the directories fruit
and vegetables
, respectively.
Exercise 1.1. Use pandas to extract the "Fresh" row(s) from the fruit Excel files. Combine the data into a single data frame. Your data frame should look something like this:
type | food | form | price_per_lb | yield | lb_per_cup | price_per_cup |
---|---|---|---|---|---|---|
fruit | watermelon | Fresh1 | 0.333412 | 0.52 | 0.330693 | 0.212033 |
fruit | cantaloupe | Fresh1 | 0.535874 | 0.51 | 0.374786 | 0.3938 |
vegetables | onions | Fresh1 | 1.03811 | 0.9 | 0.35274 | 0.406868 |
... |
It's okay if the rows and columns of your data frame are in a different order. These modules are especially relevant:
str
methodsos
os.path
read_excel()
, concat()
, .fillna()
, .str
, plotting methodsAsk questions and search the documentation/web to find the functions you need.
import pandas as pd
import numpy as np
import os.path
import re
# get the file names in fruit directory
#filenames = os.listdir("./assignment3_data/fruit")
# create a new dataframe
# http://stackoverflow.com/questions/27420263/pandas-parse-merged-header-columns-from-excel
# this website talked about how to merge two rows as header and two column as one element
# .fillna() function to fill na: axis = 0 means column and method = pad or ffill means fill na by using data from previous one. If 2 is na, fill na by using 1's value.
# .concat() axis = 1 equal to cbind; axis = 0 equal to rbind; default is 0;
# .append() have similar function with concat
def filetidy(path,filenames,label):
fdf = pd.DataFrame()
for x in filenames:
# skip the 1st row
ft = pd.read_excel(path+"/"+x,skiprows=1,header=None)
# for the header: the first two rows are all header.
for i in [3,4,6]:
ft.iloc[0,i] = str(ft.iloc[0,i])+" "+str(ft.iloc[1,i])
# change price to be price_per_lb
#ft.iloc[0][1] = str(ft.iloc[0][1])+" "+str(ft.iloc[2][2])
# size_per_lb
#ft.iloc[0][4] = str(ft.iloc[0][4])+" "+str(ft.iloc[2][5])
ft.iloc[0,2] = "price unit"
ft.iloc[0,5] = "size unit"
#change column names to the one I cleaned up
ft.columns = ft.iloc[0]
ft = ft.loc[2:]
# choose the row that has "fresh"; fresh[2]: means the 3rd column; fresh["Form"]: means "Form" this column; fresh.loc[0]: the row whose index label is 0; .iloc: refers to index position; loc: refers to index label and it also accept boolean arraies
#ft = ft.loc[ft["Form"]==label] # ref: http://stackoverflow.com/questions/21800169/python-pandas-get-index-of-rows-which-column-matches-certain-value
ft = ft.loc[ft["Form"].apply(lambda y: bool(re.match(label+".*",str(y))))] # use str(y) in case there y type is not a string
ft["name"] = re.search("(.*).xlsx",x).group(1)
# append the table with previous ones
fdf = fdf.append(ft,ignore_index=True)
#df_fresh = pd.concat(fresh,ignore_index=True)
# delete the last two columns which are all nans
fdf = fdf.ix[:,0:7]
return(fdf)
path = "./assignment3_data/fruit"
label = "Fresh"
filenames = os.listdir("./assignment3_data/fruit")
fresh_fruit = filetidy(path,filenames,label)
fresh_fruit
Exercise 1.2. Reuse your code from exercise 1.1 to extract the "Fresh" row(s) from the vegetable Excel files.
Does your code produce the correct prices for tomatoes? If not, why not? Do any other files have the same problem as the tomatoes file?
You don't need to extract the prices for these problem files. However, make sure the prices are extracted for files like asparagus that don't have this problem.
# For vegetable
path = "./assignment3_data/vegetables"
label = "Fresh"
filenames = os.listdir("./assignment3_data/vegetables")
fresh_veg = filetidy(path,filenames,label)
fresh_veg
# For tomatoes
filenames = ["tomatoes.xlsx"]
path = "./assignment3_data/vegetables"
label = "Fresh"
filetidy(path,filenames,label)
From the table above we can see that there is no price for tomatoes.
By taking a look at the tomatoes.xlsx file, we can know that there is only one cell for Form columns, and other cells are empty. Thus, our dataframe above is correct.
There are also many other files that have no price for fresh rows. For example "broccoli.xlsx".
Exercise 1.3. Remove rows without a price from the vegetable data frame and then combine the fruit and vegetable data frames. Make sure all columns of numbers are numeric (not strings).
# Remove rows without a price from the vegetable data frame
fresh_veg_c = fresh_veg[pd.notnull(fresh_veg["price unit"])]
fresh_veg_c
# combine the fruit and vegetable data frames
# adding a column that indicate the type of each dataframe
fresh = [fresh_fruit,fresh_veg_c]
fresh = pd.concat(fresh, keys = ["fruits","vegetables"])
fresh
# make sure all the variables are numerical, not a string
for y in fresh.columns:
print fresh[y].dtype
print fresh.columns.values
# change them to be numerical
fresh.iloc[:,[0,1,3,4]] = fresh.iloc[:,[0,1,3,4]].apply(lambda x: pd.to_numeric(x, errors='ignore'))
# save file
fresh.to_csv('fresh.csv')
Exercise 1.4. Discuss the questions below (a paragraph each is sufficient). Use plots to support your ideas.
1.What kinds of fruits are the most expensive (per pound)? What kinds are the least expensive?
#(1)fresh_fruit.icol[:,1] is average retail price
# max
print "max:"+fresh_fruit.iloc[fresh_fruit.iloc[:,1].idxmax(),5]
# min
print "min:"+fresh_fruit.iloc[fresh_fruit.iloc[:,1].idxmin(),5]
import matplotlib
import matplotlib.pyplot as plt
# use plot
fresh_fruit.plot(x="name",y="Average retail price ",kind="line")
plt.show()
2.How do the price distributions compare for fruit and vegetables?
f, a = plt.subplots(2,1)
fresh.xs("vegetables").plot(x="name",y="Average retail price ",kind="line",ax=a[0],color="red")
fresh.xs("fruits").plot(x="name",y="Average retail price ",kind="line",ax=a[1])
plt.show()
The plot above is vegetables, the plot in the bottom is fruits. By comparing there values we can know that fruits' price are ranging from 0-6+. However, most vegetables' price are ranging from 0-3. Thus, fruits are comparatively more expensive.
3.Which foods are the best value for the price?
# use 'Average price per cup equivalent' as the standard for best value
print fresh.iloc[:,0].idxmax()
veg = fresh.ix["vegetables"]
veg.iloc[2,5]
fresh.plot(x="name",y="Average price per cup equivalent",kind="line")
plt.show()
Asparagus are the best value for the price among all the food.
4.What's something surprising about this data set?
# the average price for asparagus
fresh.loc[fresh["name"] =="asparagus"]
# the average price for asparagus
fresh.loc[fresh["name"] =="raspberries"]
Even though average retail price of raspberries are twice of the price of asparagus, asparagus have the best value price. This may tell us the price of raspberries should not be as high as they are now.
5.Which foods do you expect to provide the best combination of price, yield, and nutrition? A future assignment may combine this data set with another so you can check your hypothesis
Based on the plots and data we have and our common sense, comparing with fruits, vegetables should provide lower price, higher yield and pretty much the same nutrition.