In this assignment, you'll analyze a collection of data sets from the San Francisco Open Data Portal and Zillow. The data sets have been stored in the SQLite database sf_data.sqlite
, which you can download here. The database contains the following tables:
Table | Description |
---|---|
crime | Crime reports dating back to 2010. |
mobile_food_locations | List of all locations where mobile food vendors sell. |
mobile_food_permits | List of all mobile food vendor permits. More details here. |
mobile_food_schedule | Schedules for mobile food vendors. |
noise | Noise complaints dating back to August 2015. |
parking | List of all parking lots. |
parks | List of all parks. |
schools | List of all schools. |
zillow | Zillow rent and housing statistics dating back to 1996. More details here. |
The mobile_food_ tables are explicitly connected through the locationid and permit columns. The other tables are not connected, but you may be able to connect them using dates, latitude/longitude, or postal codes.
Shapefiles for US postal codes are available here. These may be useful for converting latitude/longitude to postal codes.
Shapefiles for San Francisco Neighborhoods are available here.
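As a quick illustration (a sketch, not part of the assignment data), the ZIP shapefile can be used to convert latitude/longitude to a postal code with a spatial join; the column name ZCTA5CE10 comes from the Census ZCTA shapefile used later in this notebook, and op="within" follows the older geopandas API.
import pandas as pd
import geopandas as gpd
import shapely.geometry as geom
zips = gpd.read_file("zipcodes/cb_2015_us_zcta510_500k.shp")
# one example point (roughly downtown SF) standing in for any table's Lon/Lat values
pt = gpd.GeoDataFrame(pd.DataFrame({"Lon": [-122.4194], "Lat": [37.7749]}),
                      geometry=[geom.Point(-122.4194, 37.7749)], crs=zips.crs)
# each point picks up the ZIP (ZCTA) polygon that contains it
pt_with_zip = gpd.sjoin(pt, zips[["ZCTA5CE10", "geometry"]], how="left", op="within")
pt_with_zip["ZCTA5CE10"]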
Exercise 1.1. Which mobile food vendor(s) sells at the most locations?
import sqlite3 as sql
db = sql.connect("sf_data.sqlite")
db.row_factory = sql.Row
import pandas as pd
#pd.read_sql("SELECT * FROM sqlite_master", db)
#db.close()
pd.read_sql("SELECT * FROM mobile_food_permits LIMIT 4", db)
pd.read_sql("SELECT * FROM mobile_food_locations LIMIT 4",db)
pd.read_sql("SELECT * FROM mobile_food_schedule LIMIT 4",db)
The mobile_food_schedule table contains both locationid and permit, so we can use it to join the information in mobile_food_permits and mobile_food_locations.
# create a new table linking each permit's facility type to the locations where it sells
db.execute(
    "CREATE TABLE Vendor_loc AS "
    "SELECT S.permit, S.locationid, P.FacilityType "
    "FROM mobile_food_schedule AS S INNER JOIN mobile_food_permits AS P ON S.permit = P.permit"
)
#db.execute("DROP TABLE Vendor_loc")
pd.read_sql("SELECT * FROM Vendor_loc LIMIT 4",db)
# count the distinct locationid values for each facility type; a higher count means that facility type sells at more locations
pd.read_sql("SELECT COUNT(DISTINCT(locationid)) AS loc_count, FacilityType FROM Vendor_loc GROUP BY FacilityType",db)
From the table above, trucks (FacilityType 'Truck') sell at the most locations.
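To read the top facility type off directly instead of scanning the whole table, the same query can be ordered and limited (a small cross-check using the Vendor_loc table created above).
pd.read_sql("SELECT FacilityType, COUNT(DISTINCT locationid) AS loc_count "
            "FROM Vendor_loc GROUP BY FacilityType "
            "ORDER BY loc_count DESC LIMIT 1", db)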
Exercise 1.2. Ask and use the database to analyze 5 questions about San Francisco. For each question, write at least 150 words and support your answer with plots. Make a map for at least 2 of the 5 questions.
You should try to come up with some questions on your own, but these are examples of reasonable questions:
Please make sure to clearly state each of your questions in your submission.
# Make Jupyter display matplotlib plots automatically.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import geopandas as gpd
import pysal as ps
import plotly.plotly as py
import plotly.graph_objs as go
import plotly.tools as tls
tls.set_credentials_file(username='63xinxin', api_key='mDtzarbNAj4XSeno8Bu2')
# Make plots larger.
plt.rcParams['figure.figsize'] = (12, 12)
pd.read_sql("SELECT * FROM zillow LIMIT 4",db)
# housing price data is stored in the zillow table
medp = pd.read_sql("SELECT AVG(MedianSoldPricePerSqft_AllHomes) AS MedianPrice, RegionName FROM zillow GROUP BY RegionName", db)
medp.head()
# zillow only identifies areas by ZIP code (RegionName); merge it with the ZIP shapefile to get geometries
#zillow = pd.read_sql("SELECT * FROM zillow", db)
medp.dtypes
zips = gpd.read_file("zipcodes/cb_2015_us_zcta510_500k.shp")
zips = zips.rename(columns={"ZCTA5CE10":"RegionName"})
zips.dtypes
sf = gpd.read_file("SF/geo_export_fbb4e1da-9b27-4620-ad7a-0cf514236ce5.shp")
base = sf
#sf.plot()
zips["RegionName"] = zips["RegionName"].apply(pd.to_numeric)
medpzip = pd.merge(medp,zips,on="RegionName",how="inner")
medpzip.head() # the first row is the information of South San Francisco
medpzip = medpzip.iloc[1:,:] # drop the first row (South San Francisco); iloc indexes [rows, columns]
# change to gpd geo dataframe
medpzip = gpd.GeoDataFrame( medpzip, geometry = medpzip["geometry"] )
medpzip['coords'] = medpzip['geometry'].apply(lambda x: x.representative_point().coords[:])
medpzip['coords'] = [coords[0] for coords in medpzip['coords']]
#import pysal as ps # in order to show the scale and use "scheme=..."
medpzip.plot(column='MedianPrice',scheme='QUANTILES',k=9, legend=True,colormap=plt.cm.Blues_r)
for idx, row in medpzip.iterrows():
    plt.annotate(s=row['RegionName'], xy=row['coords'],
                 horizontalalignment='center')
Plot 1
According to the colored map, the northeastern part of San Francisco has the highest all-homes sold prices: ZIP codes 94105, 94104, and 94123 are the most expensive. The southern and western parts of SF have the lowest prices, with ZIP codes 94134, 94124, and 94132 at the bottom. This result is plausible because the northeastern part of SF contains the Financial District and has a higher population density than other areas.
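As a cross-check of the map (a small sketch using the medpzip frame built above), the most and least expensive ZIP codes can be listed directly.
# three most and three least expensive ZIPs by average median sold price per sqft
print(medpzip.sort_values("MedianPrice", ascending=False)[["RegionName", "MedianPrice"]].head(3))
print(medpzip.sort_values("MedianPrice")[["RegionName", "MedianPrice"]].head(3))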
zillow = pd.read_sql("SELECT * FROM zillow",db)
zillow["Year"] = zillow.Date.str.extract("([0-9]{4}).*",expand=False)
zillow.to_sql("zillow_y", db, if_exists="replace")
pd.read_sql("SELECT * FROM zillow_y LIMIT 4",db)
zillow_y = pd.read_sql("SELECT AVG(MedianSoldPricePerSqft_AllHomes) AS Price, Year FROM zillow_y GROUP BY Year ",db)
trace = go.Scatter(
    x = zillow_y.Year,
    y = zillow_y["Price"],
    mode = "markers")
data = [trace]
layout = dict(title = "Median Sold Price per Sqft of All Homes by Year",
              xaxis = dict(title = "Year"),
              yaxis = dict(title = "Median Sold Price per Sqft"))
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='SoldPrice 2010-2017')
There is a clear increasing trend in the median sold price of all homes, apart from a dip from 2007 to 2011 caused by the recession.
Find the most dangerous area of the city
crime = pd.read_sql("SELECT * FROM crime",db)
crime["Datetime"] = crime.Datetime.str.strip()
crime.dtypes
# extract the year from the Datetime column using pandas string methods
# (see http://pandas.pydata.org/pandas-docs/stable/text.html)
crime["Year"] = crime.Datetime.str.extract("([0-9]{4}).*", expand=False)
# create a crime table with the year information in it
crime.to_sql("crime_y", db, if_exists="replace")
We extract the year from Datetime and build a table grouped by year; we can then count incidents per year, and also group by location to see how many incidents occur at each place.
pd.read_sql("SELECT DISTINCT(Year) FROM crime_y ORDER BY Year ",db)
The crime data cover the years 2011 through 2017.
# total incidents per year
crime_year = pd.read_sql("SELECT SUM(IncidntNum) AS IncidntNumSum, Year FROM crime_y GROUP BY Year ORDER BY Year",db)
crime_year.Year = crime_year.Year.apply(lambda x: pd.to_numeric(x))
crime_year["Year"]
#crime_year["IncidntNumSum"]
crime_year["Year"]
# plot the yearly incident totals on a log scale
trace = go.Scatter(
    x = crime_year.Year,
    y = np.log(crime_year["IncidntNumSum"]),
    mode = "markers")
data = [trace]
layout = dict(title = "Crime Incidents from 2010 to 2017",
              xaxis = dict(title = "Year"),
              yaxis = dict(title = "Log of Crime Incidents"))
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='Criminal 2010-2017')
From the plot above, there is an increasing trend in crime incidents. Since the data for 2017 are incomplete, we do not take that year into account.
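If we want the trend itself to exclude the incomplete year (a small sketch using the crime_year frame built above), 2017 can be dropped before plotting.
# keep only the complete years when examining the trend
crime_year_complete = crime_year[crime_year.Year < 2017]
crime_year_complete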
crime_address = pd.read_sql("SELECT SUM(IncidntNum) AS IncidntNumSum, Address, Lon, Lat FROM crime GROUP BY Address",db)
import shapely.geometry as geom
def pdtoGeo(df):
    lonlat = [geom.Point(lon, lat) for lon, lat in zip(df.Lon, df.Lat)]
    df = df.drop(["Lon", "Lat"], axis = 1) # drop the Lon/Lat columns
    df = gpd.GeoDataFrame(df, geometry = lonlat, crs = {'init' :'epsg:4326'})
    return df
crime_address = pdtoGeo(crime_address)
crime_address.geometry.head()
crime_address.head()
#import pysal as ps # in order to show the scale and use "scheme=..."
#import pysal as ps
#import shapely.geometry as geom
crime_address.plot()
import pysal as ps
base = sf.plot(color="grey")
crime_address.plot(ax=base,column="IncidntNumSum",marker="o",markersize =7,scheme='QUANTILES',k=6, legend=True,colormap=plt.cm.Blues_r,alpha=0.5)
This plot shows every location where a crime incident has been reported; the color represents the total number of incidents that occurred at that place.
The plot above shows the overall crime incidents across 2010-2017 in San Francisco. Since 2016 has the highest crime count, let us take a look at the distribution of crime incidents in 2016.
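As a quick check that 2016 is indeed the largest complete year (a sketch using the crime_year frame from above):
crime_year[crime_year.Year < 2017].sort_values("IncidntNumSum", ascending=False).head(1)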
crime2016 = pd.read_sql("SELECT SUM(IncidntNum) AS IncidntNumSum, Address, Lon, Lat FROM crime_y WHERE Year = 2016 GROUP BY Address ",db)
crime2016 = pdtoGeo(crime2016)
#import pysal as ps
base = sf.plot()
crime2016.plot(ax=base,column="IncidntNumSum",marker="o",markersize = 12,scheme='QUANTILES',k=5, legend=True,colormap=plt.cm.Blues_r,alpha=0.8)
crime2016.dtypes
From the plot above, the distribution of crime incidents in 2016 is similar to the distribution across all years. Thus we can conclude that the most dangerous part of San Francisco is the southeastern area.
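To see the specific hotspots behind the map (a quick check using the crime2016 frame from above), the addresses with the largest totals can be listed directly.
crime2016.sort_values("IncidntNumSum", ascending=False)[["Address", "IncidntNumSum"]].head(5)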
pd.read_sql("SELECT * FROM noise LIMIT 4",db)
#mobile_food_permits mobile_food_schedule
pd.read_sql("SELECT * FROM mobile_food_locations LIMIT 4",db)
Draw a map in which different colors represent vendor locations and noise complaints.
# rename the coordinate columns of mobile_food_locations to match the noise table
#db.execute("ALTER TABLE mobile_food_locations RENAME COLUMN Latitude TO Lat")
# SQLite's ALTER TABLE cannot rename or drop columns (only add them), so add new Lat/Lon
# columns and copy the values: http://stackoverflow.com/questions/4007014/alter-column-in-sqlite
db.execute("ALTER TABLE mobile_food_locations ADD Lat REAL")
db.execute("UPDATE mobile_food_locations SET Lat = Latitude")
db.execute("ALTER TABLE mobile_food_locations ADD Lon REAL")
db.execute("UPDATE mobile_food_locations SET Lon = Longitude")
# create a new column indicating which table each row came from
db.execute("ALTER TABLE mobile_food_locations ADD Indicator TEXT")
db.execute("UPDATE mobile_food_locations SET Indicator = 'Vendor'")
# also add an indicator in noise
db.execute("ALTER TABLE noise ADD Indicator TEXT")
db.execute("UPDATE noise SET Indicator = 'Noise'")
pd.read_sql("SELECT * FROM mobile_food_locations LIMIT 4",db)
pd.read_sql("SELECT * FROM noise LIMIT 4",db)
#vendor_loc = pd.read_sql("SELECT Longitude, Latitude, Indicator FROM mobile_food_locations",db)
vendor_loc = pd.read_sql("SELECT Lon, Lat, Indicator FROM mobile_food_locations",db)
#vendor_loc = vendor_loc.rename(columns={"Longitude":"Lon","Latitude":"Lat"})
noise_loc = pd.read_sql("SELECT Lon, Lat, Indicator FROM noise",db)
noise_loc.dtypes
vendor_noise = pd.concat([vendor_loc,noise_loc])
vendor_noise = pdtoGeo(vendor_noise)
vendor_noise.head()
vendor_noise.geometry.head()
#base = sf.plot()
#noise_loc = pdtoGeo(noise_loc)
vendor_loc = pdtoGeo(vendor_loc)
#vendor_noise.plot(ax=base)
vendor_loc.plot()
At first the plot did not show all the points in the expected range. One observation has longitude and latitude equal to zero; that is likely the reason, so we drop it.
# re-read the raw vendor locations and drop the observation with zero coordinates
vendor_loc = pd.read_sql("SELECT Lon, Lat, Indicator FROM mobile_food_locations", db)
vendor_loc = vendor_loc.loc[~(vendor_loc.Lat == 0)]
vendor_noise = pd.concat([vendor_loc, noise_loc])
vendor_noise = pdtoGeo(vendor_noise)
base = sf.plot(color="grey")
vendor_noise.plot(ax=base,column="Indicator",markersize=7,alpha=0.5,legend=True,colormap=plt.cm.Blues_r)
It is not clear whether the vendor points are hidden underneath the complaint points, so I also drew a plot of the vendors separately. From this plot we can see that most of the vendors and noise complaints are in the northeastern part of SF, probably because of the high population density in that area. For the other areas there is no obvious sign of a close relationship between noise and vendors: the noise complaints cover a much larger area than the vendor locations.
vendor_loc = pdtoGeo(vendor_loc)
base = sf.plot(color="blue")
vendor_loc.plot(ax=base,markersize=7,legend=True)
From this plot we can see that vendors are mostly located in the western part of SF. Comparing it with the previous plot, which shows both noise and vendor points, we can conclude that there is no close relationship between vendors and noise, because there are not many noise complaints in the southeastern part of SF even though vendors are spread densely there.
Since there is no strong spatial correlation between vendors and noise complaints, we do not analyze their relationship over time.
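One way to put a rough number on this (a sketch, assuming the vendor_loc and noise_loc frames built above): project both layers to a metric CRS and measure what share of noise complaints fall within 200 m of any vendor location.
vendors_m = vendor_loc.to_crs(epsg=26910)   # UTM zone 10N covers San Francisco
noise_m = pdtoGeo(noise_loc.dropna(subset=["Lon", "Lat"])).to_crs(epsg=26910)
vendor_area = vendors_m.buffer(200).unary_union   # union of 200 m discs around vendors
share_near = noise_m.geometry.apply(vendor_area.contains).mean()
print("Share of noise complaints within 200 m of a vendor:", share_near)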
The best places to find food trucks are where the truck density is highest, and the best times to find food trucks are the periods when the largest number of food trucks are working. So for this question, we first find the best place and then the best time.
pd.read_sql("SELECT DISTINCT(FacilityType) FROM mobile_food_permits LIMIT 4",db)
# create table for Truck
db.execute("CREATE TABLE Truck AS "
"SELECT * FROM mobile_food_permits WHERE FacilityType = 'Truck' ")
#pd.read_sql("SELECT * FROM mobile_food_permits LIMIT 4",db)
pd.read_sql("SELECT * FROM Truck LIMIT 4",db)
# combine time information
db.execute("CREATE TABLE Truck_schedule AS "
"SELECT * FROM Truck AS T INNER JOIN mobile_food_schedule AS S ON T.permit = S.permit ")
#pd.read_sql("SELECT * FROM mobile_food_schedule LIMIT 4",db)
pd.read_sql("SELECT * FROM Truck_schedule LIMIT 4",db)
Obviously, we need to use the mobile_food_locations table for location information.
pd.read_sql("SELECT * FROM mobile_food_locations LIMIT 4",db)
# combine location and schedule
pd.read_sql("SELECT S.locationid, S.DayOfWeek, S.EndHour, S.StartHour, L.Latitude, L.Longitude, L.Address "
"FROM Truck_schedule AS S INNER JOIN mobile_food_locations AS L ON L.locationid = S.locationid ORDER BY L.Latitude LIMIT 4",db)
We can see from the table above that some latitude and longitude values are zero. Searching those addresses in Google shows that they are actually located in SF, so these coordinates are simply missing and could in principle be corrected.
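One way to correct them would be to geocode the addresses (a sketch, assuming a recent version of the geopy package and network access; the address below is just an illustrative placeholder).
from geopy.geocoders import Nominatim
geocoder = Nominatim(user_agent="sf_food_trucks")
# look up one placeholder address; a real fix would loop over the ~40 affected addresses
result = geocoder.geocode("Market St and 5th St, San Francisco, CA")
if result is not None:
    print(result.latitude, result.longitude)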
# number of lat == 0
pd.read_sql("SELECT COUNT(DISTINCT(L.Address)) AS Counts "
"FROM Truck_schedule AS S INNER JOIN mobile_food_locations AS L ON L.locationid = S.locationid WHERE L.Latitude = 0",db)
Since we have no longitude/latitude information for those locations, and there are only 40 of them, which is small compared to the total number of truck locations (609), I drop them directly.
db.execute("CREATE TABLE Truck_sch_loc AS "
"SELECT S.locationid, S.DayOfWeek, S.EndHour, S.StartHour, L.Latitude, L.Longitude, L.Address "
"FROM Truck_schedule AS S INNER JOIN mobile_food_locations AS L ON L.locationid = S.locationid WHERE L.Latitude != 0")
Truck = pd.read_sql("SELECT * FROM Truck_sch_loc",db)
Truck.loc[:15]
Truck = Truck.rename(columns={"Longitude":"Lon","Latitude":"Lat"})
Truck = pdtoGeo(Truck)
base = sf.plot(color="blue")
Truck.plot(ax=base,markersize=7,legend=True)
This map of truck locations is very similar to the one we drew earlier for all vendors. Referring back to the map labeled with ZIP codes, the area with the most trucks is 94104, which contains the Financial District and is near Union Square.
# count schedule entries by day of week to find the busiest day
pd.read_sql("SELECT COUNT(DayOfWeek) AS Fre_DayOfWeek , DayOfWeek FROM Truck_sch_loc GROUP BY DayOfWeek ORDER BY Fre_DayOfWeek",db)
Monday has the largest number of trucks working, and Sunday has the smallest.
# combine the start and end hour into a single field
# use CAST() to convert the numeric hours to strings and || to concatenate them
db.execute("CREATE TABLE Truck_Dur AS "
"SELECT CAST(StartHour AS TEXT)||'_'||CAST(EndHour AS TEXT) AS Duration FROM Truck_sch_loc")
pd.read_sql("SELECT * FROM Truck_Dur LIMIT 5",db)
pd.read_sql("SELECT COUNT(Duration) AS Fre_Dur , Duration FROM Truck_Dur GROUP BY Duration ORDER BY Fre_Dur DESC LIMIT 10",db)
From the table above, most trucks work from 9 to 10, and many also work from 10 to 11. Thus the best time to find a truck is 9 to 11 on Monday.
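As a joint check of day and time (a sketch using the Truck_sch_loc table created above), we can count schedule entries by day of week and start hour together.
pd.read_sql("SELECT DayOfWeek, StartHour, COUNT(*) AS n "
            "FROM Truck_sch_loc "
            "GROUP BY DayOfWeek, StartHour ORDER BY n DESC LIMIT 5", db)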
Let us learn more about the schools table. It contains each school's name, category, and general type.
pd.read_sql("SELECT * FROM schools LIMIT 5",db)
# distinct Category
pd.read_sql("SELECT COUNT(Category) AS Fre_Catogory, Category FROM schools GROUP BY Category ORDER BY Fre_Catogory desc",db)
# distinct GeneralType
pd.read_sql("SELECT COUNT(GeneralType) AS Fre_GeneralType, GeneralType FROM schools GROUP BY GeneralType ORDER BY Fre_GeneralType",db)
# Entity
pd.read_sql("SELECT COUNT(Entity) AS Fre_Entity, Entity FROM schools GROUP BY Entity ORDER BY Fre_Entity",db)
We can make a map in which the fill colors represent the median all-homes sold price and the points are schools, with the point colors representing a school attribute (GradeRange, GeneralType, or Entity).
school = pd.read_sql("SELECT * FROM schools",db)
school = pdtoGeo(school)
#GradeRange
base = medpzip.plot(column='MedianPrice',scheme='QUANTILES',k=6, colormap=plt.cm.Blues_r)
school.plot(ax=base, column="GradeRange", legend=True,markersize=9)
There is no noticeable difference in location among schools with different GradeRange values. Overall, schools are mostly located in the northeastern part of SF.
#GeneralType
base = medpzip.plot(column='MedianPrice',scheme='QUANTILES',k=6, colormap=plt.cm.Blues_r)
school.plot(ax=base, column="GeneralType", legend=True,markersize=9)
Schools with GeneralType "CC" are mostly located in the northeastern part of SF, though several of them are in areas with low housing prices.
base = medpzip.plot(column='MedianPrice',scheme='QUANTILES',k=6, colormap=plt.cm.Blues_r)
school.plot(ax=base, column="Entity", legend=True,markersize=9)
This plot shows a similar pattern: most schools are located in the northeastern part of SF, where housing prices are also higher. So there is some relationship between schools and housing prices. If we had data on school ratings, we might be able to see a more obvious relationship between schools and housing prices.
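A rough quantitative check (a sketch, assuming both layers are in lat/lon coordinates and that geopandas has spatial-index support installed): spatially join the schools to the ZIP polygons and see whether ZIPs with higher median prices also contain more schools.
zips_price = gpd.GeoDataFrame(medpzip[["RegionName", "MedianPrice"]],
                              geometry=medpzip["geometry"], crs={'init': 'epsg:4326'})
schools_by_zip = gpd.sjoin(school, zips_price, how="inner", op="within")
n_schools = (schools_by_zip.groupby(["RegionName", "MedianPrice"])
             .size().reset_index(name="n_schools"))
print(n_schools[["MedianPrice", "n_schools"]].corr())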