In this assignment, you'll combine the assignment 3 data set with nutrition data from the USDA Food Composition Databases. The CSV file fresh.csv
contains the fresh fruits and vegetables data you extracted in assignment 3.
The USDA Food Composition Databases have a documented web API that returns data in JSON format . You need a key in order to use the API. Only 1000 requests are allowed per hour, so it would be a good idea to use caching.
Sign up for an API key here. The key will work with any Data.gov API. You may need the key again later in the quarter, so make sure you save it.
These modules may be useful:
Exercise 1.1. Read the search request documentation, then write a function called ndb_search()
that makes a search request. The function should accept the search term as an argument. The function should return the search result items as a list (for 0 items, return an empty list).
Note that the search url is: https://api.nal.usda.gov/ndb/search
As an example, a search for "quail eggs"
should return this list:
[{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'CHAOKOH, QUAIL EGG IN BRINE, UPC: 044738074186',
u'ndbno': u'45094707',
u'offset': 0},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'L&W, QUAIL EGGS, UPC: 024072000256',
u'ndbno': u'45094890',
u'offset': 1},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'BUDDHA, QUAIL EGGS IN BRINE, UPC: 761934535098',
u'ndbno': u'45099560',
u'offset': 2},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'GRAN SABANA, QUAIL EGGS, UPC: 819140010103',
u'ndbno': u'45169279',
u'offset': 3},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u"D'ARTAGNAN, QUAIL EGGS, UPC: 736622102630",
u'ndbno': u'45178254',
u'offset': 4},
{u'ds': u'SR',
u'group': u'Dairy and Egg Products',
u'name': u'Egg, quail, whole, fresh, raw',
u'ndbno': u'01140',
u'offset': 5}]
As usual, make sure you document and test your function.
import requests
import requests_cache
from urlparse import urlunparse, urlparse
import pandas as pd
# Set up a cache for requests.
requests_cache.install_cache("cache")
key = "HwyOBVU5rcviLL3vPvvL9EHaQWEdMMoh3Tw6yU7Z"
def ndb_search(item):
"""
input: name of the item. e.g: 'quail eggs'
output: list
"""
urll = list(urlparse("https://api.nal.usda.gov/ndb/search/?format=json&q=butter&sort=n&max=25&offset=0&api_key=DEMO_KEY"))
#method = "ed/collegescorecard/v1/schools.json"
params = "format=json&q={}&sort=n&max=25&offset=0&api_key=HwyOBVU5rcviLL3vPvvL9EHaQWEdMMoh3Tw6yU7Z".format(item)
urll[4] = params
response = requests.get(urlunparse(urll))
# if there is no item in my search
if response.json().keys()==[u'errors']:
# print out the error message
return response.json()[u'errors'][u'error'][0][u'message']
# convert the output of ndb_search() to a data frame with pd.DataFrame()
else:
results = response.json()[u'list'][u'item']
return results
ndb_search("quail eggs")
Exercise 1.2. Use your search function to get NDB numbers for the foods in the fresh.csv
file. It's okay if you don't get an NDB number for every food, but try to come up with a strategy that gets most of them. Discuss your strategy in a short paragraph.
Hints:
ndb_search()
to a data frame with pd.DataFrame()
.pd.merge()
.(1) the foods are all raw and unbranded 1. so do I need to add "raw" for search? If I do not add "raw" and search for cabbage, there will be results like :"Cabbage, kimchi". kimchi is not a raw food. Also, there are some name with "raw" and "juice". Apparently "juice" is not a fresh food. We should drop these rows. 2. how to define unbranded items? 2.1 if the fruit is unbranded, the group column will have value "Fruits and Fruit Juices" or "Vegetables and Vegetable Products". 2.2 But there are also exceptions like "American Indian/Alaska Native Foods". 2.3 for banded item, the value in group column will be "Branded Food Products Database" 3. if there is nothing find in the website by using "food(the fresh name) raw", try to find others 4. sometimes there will return multiple ndb numbers. Try to find the one that best matches food.
# make some change of the ndb_search() function
def ndb_search(item):
"""
input: name of the item. e.g: 'quail eggs'
output: dataframe with ndb numbers
"""
urll = list(urlparse("https://api.nal.usda.gov/ndb/search/?format=json&q=butter&sort=n&max=25&offset=0&api_key=DEMO_KEY"))
#method = "ed/collegescorecard/v1/schools.json"
params = "format=json&q={}&sort=n&max=25&offset=0&api_key=HwyOBVU5rcviLL3vPvvL9EHaQWEdMMoh3Tw6yU7Z".format(item)
urll[4] = params
response = requests.get(urlunparse(urll))
# if there is no item in my search
if response.json().keys()==[u'errors']:
# print out the error message
return response.json()[u'errors'][u'error'][0][u'message']
# convert the output of ndb_search() to a data frame with pd.DataFrame()
else:
results = response.json()[u'list'][u'item']
return pd.DataFrame(results)
fresh = pd.read_csv("fresh.csv")
print fresh.shape
print len(fresh.food.unique())
# results below shows that there are 47 unique food names however, there are 49 rows in fresh dataset.
# By taking a look of the fresh dataframe, we know "cucumbers" and "cabbage" are the two items that got repeated.
print fresh[fresh["food"]=="cucumbers"].index.tolist()
fresh = fresh.drop(26)
print fresh[fresh["food"]=="cabbage"].index.tolist()
fresh = fresh.drop(36)
import re
def get_food(item):
# only return the foods which are raw and unbranded
tb = ndb_search(item+" "+"raw")
if not type(tb) == unicode: # when type(tb) == unicode, tb has value "u'Your search resulted in zero results.Change your parameters and try again'"
tb_n = tb[(~tb.group.str.contains("Branded")) & ((tb.group.str.contains("Fruit")) | (tb.group.str.contains("Vegetable")))&(~tb.name.str.contains("juice"))].copy()
if not tb_n.empty:
#print type(re.search("^([0-9]+)(\s)+([0-9]+).*",str(tb_n.iloc[0:1,3])).group(3))
# type of tb_n.iloc[0:1,3]], I can not extract the number directly
# "\s" means space
tb_n1 = pd.DataFrame({"ndbno": re.search("^([0-9]+)(\s)+([0-9]+).*",str(tb_n.iloc[0:1,3])).group(3),"food":[item]})
# if i do not use regular expression, use df.x.str.strip("0characters") and ("change to numeric": pd.to_numeric(df.x) x column in df)
#tb_n1 = pd.DataFrame({"ndbno": pd.to_numeric(tb_n.iloc[0:1,3]),"food":[item]})
# but return to numeric is not a good method because for string "09876" it will turn to be 9876. But what we want is exactly "09876"
#tb_n1 = pd.DataFrame({"ndbno": tb_n.ndbno.str.split("\s"),"food":[item]})
# when I want to make a dataframe and there is only one object, I need to use ["the value"]. otherwise, there will be error
#tb_n1["fresh"] = item
return tb_n1
dbn_fresh = pd.concat([get_food(x) for x in fresh["food"]])
print dbn_fresh
fresh
# merge this two tables
fresh_n = pd.merge(fresh,dbn_fresh,how="right")
fresh_n.head()
Exercise 1.3. Read the food reports V2 documentation, then write a function called ndb_report()
that requests a basic food report. The function should accept the NDB number as an argument and return the list of nutrients for the food.
Note that the report url is: https://api.nal.usda.gov/ndb/V2/reports
For example, for "09279"
(raw plums) the first element of the returned list should be:
{u'group': u'Proximates',
u'measures': [{u'eqv': 165.0,
u'eunit': u'g',
u'label': u'cup, sliced',
u'qty': 1.0,
u'value': u'143.93'},
{u'eqv': 66.0,
u'eunit': u'g',
u'label': u'fruit (2-1/8" dia)',
u'qty': 1.0,
u'value': u'57.57'},
{u'eqv': 151.0,
u'eunit': u'g',
u'label': u'NLEA serving',
u'qty': 1.0,
u'value': u'131.72'}],
u'name': u'Water',
u'nutrient_id': u'255',
u'unit': u'g',
u'value': u'87.23'}
Be sure to document and test your function.
def ndb_report(ndb):
"""
input: name of the ndb. e.g: '09279'
output: list of nutritions
"""
url = "https://api.nal.usda.gov/ndb/V2/reports"
response = requests.get(url, params = {
"api_key": "HwyOBVU5rcviLL3vPvvL9EHaQWEdMMoh3Tw6yU7Z",
"ndbno": ndb
#"type": "b"
#"format":"json"
})
# if there is no item in my search
if response.json().keys()==[u'errors']:
# print out the error message
return response.json()[u'errors'][u'error'][0][u'message']
# convert the output of ndb_search() to a data frame with pd.DataFrame()
else:
results = response.json()[u'foods'][0][u'food'][u'nutrients']
return results
pd.DataFrame(ndb_report("11098")).head()
Exercise 1.4. Which foods provide the best combination of price, yield, and nutrition? You can use kilocalories as a measure of "nutrition" here, but more a detailed analysis is better. Use plots to support your analysis.
1.Add a column that named "nutrition"?
no, there are many kinds of nutrition. I will choose those noramlly shown in the package: protein; Carbohydrate; Sodium; cholesterol; Energy
2. do I need to transfer the unit here?
Yes: compare the same thing with the same unit
3. what kinds of plots should I use? barplot? scatter plot(fruits and veg)?
3.1 barplots: (protein; Carbohydrate; Sodium; cholesterol; Energy), price and yield for each fruit and vegetable.
3.2 scatter plots: make pairwise comparison (e.g energy vs price) for different groups (e.g fruits vs vegetables)
3.3 plot the distribution of one quantity for different groups. The way of creating groups is based on the results we have before. (e.g we already know berry has higher price then other fruits. We can divide fruits into berries and notberries)
# the dataframe with fresh name, price, yield is called "fresh_n"
# for create a dataframe for each nutrition: protein, Carbohydrate, Sodium, cholesterol, Energy (be careful about the unit, also make sure the type in each cell is numerical)
for y in ["Protein","Carbohydrate","Energy"]:
locals()['df_{}'.format(y)] = []
for x in fresh_n["ndbno"]:
df = pd.DataFrame(ndb_report(x))
# keep the "ndbno" in the table
df["ndbno"] = [x]*df.shape[0]
locals()['df_{}'.format(y)].append(df.loc[df.name.str.contains(y)])
locals()['nutr_{}'.format(y)] = pd.concat(locals()['df_{}'.format(y)])
locals()['nutr_{}'.format(y)] = pd.merge(locals()['nutr_{}'.format(y)],fresh_n)
nutr_Protein.head()
# plot 1: barplot
# Matplotlib
import matplotlib.pyplot as plt
import numpy as np
# for fruits
fresh_f = fresh_n.loc[fresh_n.type.str.contains("fruit")]
def barplt(dataframe,x):
fig, ax = plt.subplots(1, 1)
Int = np.arange(dataframe.shape[0])
bar_width = 0.8
plt.barh(Int,dataframe[x])
plt.yticks(Int + bar_width / 2, dataframe["food"])
barplt(fresh_f,"price_per_lb")
plt.xlabel('Price')
plt.ylabel('Fruits')
plt.title('Bar chart of Price for each fruit')
fig.autofmt_xdate()
plt.show()
Plot 1
# for vegetables
# for fruits
fresh_v = fresh_n.loc[fresh_n.type.str.contains("vegetable")]
barplt(fresh_v,"price_per_lb")
plt.xlabel("Price")
plt.ylabel('Vegetable')
plt.title('Bar chart of Price for each vegetable')
fig.autofmt_xdate()
plt.show()
Plot 2
By analysing those two plots (plot1 and plot2) we know that most of the fruits and vegetables have price less than \$3 per lb. But there is only okra and asparagus in vegetable have price higher than \$3. For fruits which have relatively higher price are berries, like blueberries, cherries and raspberries.
For fruits that have relatively lower prices are some common ones like banana, grapefruit, cantaloupe, watermelon and pineapple. For vegetables, potatoes, sweet-potatoes have the lowest prices.
# yield
# for fruits
barplt(fresh_f,"yield")
plt.xlabel('Yield')
plt.ylabel('Fruits')
plt.title('Bar chart of Yield for each fruit')
fig.autofmt_xdate()
plt.show()
Plot 3
barplt(fresh_v,"yield")
plt.ylabel('Vegetables')
plt.xlabel('Yield')
plt.title('Bar chart of Yield for each vegetable')
fig.autofmt_xdate()
plt.show()
Plot 4
From plot3 we know that for the fruits have lower price like banana, pineapple and grapefruit, they also have lower yields. However, for those berries, they have much higher yields.
From plot 4 we know that for vegetables, artichoke has the lowest yield. Brussels sprouts and kale have the highest yields. Kale and artichoke have relatively high price.
# nutrition content
#(1) protein
pro_f = nutr_Protein.loc[nutr_Protein.type.str.contains("fruit")]
#fresh_v = fresh_n.loc[fresh_n.type.str.contains("vegetable")]
barplt(pro_f,"value")
plt.ylabel('Fruit')
plt.xlabel('Protein')
plt.title('Bar chart of Protein for each fruit')
fig.autofmt_xdate()
plt.show()
Plot 5
pro_v = nutr_Protein.loc[nutr_Protein.type.str.contains("vegetable")]
barplt(pro_v,"value")
plt.ylabel('vegetable')
plt.xlabel('Protein')
plt.title('Bar chart of Protein for each vegetable')
fig.autofmt_xdate()
plt.show()
Plot 6
From plot 5 we know that most of the protein in fruits are less than 2g. Only grape has the largest protein value which is lager than 5g.
From plot 6 we know that there are several vegetables that have protein value larger than 2g. Kale has the largest protein value: 4.3g among vegetables. Corn sweet, brussels sprouts and artichoke also have relatively high protein which are larger than 3g.
#(2) Carbohydrate
ca_f = nutr_Carbohydrate.loc[nutr_Carbohydrate.type.str.contains("fruit")]
barplt(ca_f,"value")
plt.ylabel('Fruit')
plt.xlabel('Carbohydrate')
plt.title('Bar chart of Carbohydrate for each fruit')
fig.autofmt_xdate()
plt.show()
Plot 7
ca_v = nutr_Carbohydrate.loc[nutr_Carbohydrate.type.str.contains("vegetables")]
barplt(ca_v,"value")
plt.ylabel('Vegetables')
plt.xlabel('Carbohydrate')
plt.title('Bar chart of Carbohydrate for each vegetable')
fig.autofmt_xdate()
plt.show()
Plot 8
From plot 7 we know that for all the fruits, the content of Carbohydrate are about the same. Banana and orange have the highest Carbohydrate. Cherries has the lowest.
Plot 8 shows that for the vegetables, the content of their Carbohydrate are comparatively lower than fruits. Potatoes and corn sweet have the highest Carbohydrate. Cabbage and cucumbers have the lowest.
# (3)Energy
en_f = nutr_Energy.loc[nutr_Energy.type.str.contains("fruit")]
barplt(en_f,"value")
plt.ylabel('Fruit')
plt.xlabel('Energy')
plt.title('Bar chart of Energy for each fruit')
fig.autofmt_xdate()
plt.show()
Plot 9
en_v = nutr_Energy.loc[nutr_Energy.type.str.contains("vegetables")]
barplt(ca_v,"value")
plt.ylabel('Vegetables')
plt.xlabel('Energy')
plt.title('Bar chart of Energy for each vegetable')
fig.autofmt_xdate()
plt.show()
Plot 10
By comparing the value of energy in plot 9 and plot 10 we know that fruits have much larger energy than vegetables. For vegetables, corn sweet and potatoes have the largest energy content (about 19 kcal). For the fruits, orange, grape and banana have the largest energy content (about 95 kcal).
Now, let's summary all the conclusion we have: (1)Price
By analysing those two plots (plot1 and plot2) we know that most of the fruits and vegetables have price less than \$3 per lb. But there is only okra and asparagus in vegetable have price higher than \$3. For fruits which have relatively higher price are berries, like blueberries, cherries and raspberries.
For fruits that have relatively lower prices are some common ones like banana, grapefruit, cantaloupe, watermelon and pineapple. For vegetables, potatoes, sweet-potatoes have the lowest prices. (2)Yield
From plot3 we know that for the fruits have lower price like banana, pineapple and grapefruit, they also have lower yields. However, for those berries, they have much higher yields.
From plot 4 we know that for vegetables, artichoke has the lowest yield. Brussels sprouts and kale have the highest yields. Kale and artichoke have relatively high price. (3)Protein
From plot 5 we know that most of the protein in fruits are less than 2g. Only grape has the largest protein value which is lager than 5g.
From plot 6 we know that there are several vegetables that have protein value larger than 2g. Kale has the largest protein value: 4.3g among vegetables. Corn sweet, brussels sprouts and artichoke also have relatively high protein which are larger than 3g. (4)Carbohydrate
From plot 7 we know that for all the fruits, the content of Carbohydrate are about the same. Banana and orange have the highest Carbohydrate. Cherries has the lowest.
Plot 8 shows that for the vegetables, the content of their Carbohydrate are comparatively lower than fruits. Potatoes and corn sweet have the highest Carbohydrate. Cabbage and cucumbers have the lowest. (5)Energy
By comparing the value of energy in plot 9 and plot 10 we know that fruits have much larger energy than vegetables. For vegetables, corn sweet and potatoes have the largest energy content (about 19 kcal). For the fruits, orange, grape and banana have the largest energy content (about 95 kcal).
In order to find the best combination of price, yield and nutrition, we want to find the food that have the lowest price, highest nutrition and highest yield. For nutrition content, we want to find a food that has the highest protein, lowest Carbohydrate and highest Energy. Thus, by considering all those aspects, my choice of the best food is among Brussels sprouts, corn sweet, potatoes, banana and grape.