Assignment 5

In this assignment, you'll scrape text from The California Aggie and then analyze the text.

The Aggie is organized by category into article lists. For example, there's a Campus News list, Arts & Culture list, and Sports list. Notice that each list has multiple pages, with a maximum of 15 articles per page.

The goal of exercises 1.1 - 1.3 is to scrape articles from the Aggie for analysis in exercise 1.4.

Exercise 1.1. Write a function that extracts all of the links to articles in an Aggie article list. The function should:

  • Have a parameter url for the URL of the article list.

  • Have a parameter page for the number of pages to fetch links from. The default should be 1.

  • Return a list of article URLs (each URL should be a string).

Test your function on 2-3 different categories to make sure it works.

Hints:

  • Be polite to The Aggie and save time by setting up requests_cache before you write your function.

  • Start by getting your function to work for just 1 page. Once that works, have your function call itself to get additional pages.

  • You can use lxml.html or BeautifulSoup to scrape HTML. Choose one and use it throughout the entire assignment.
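
One way to structure the multi-page logic from the hints is sketched here. It uses a loop rather than the recursion the hint suggests. The helper names page_urls and collect_links are hypothetical, fetch_links stands in for whatever single-page scraper you write, and the /page/N pattern is assumed from the URLs the solution cell requests. The demonstration uses a fake scraper so it runs without network access.

```python
def page_urls(url, pages=1):
    """Return the URLs of the first `pages` pages of an article list."""
    base = url.rstrip('/')
    return ['{}/page/{}'.format(base, n) for n in range(1, pages + 1)]


def collect_links(url, pages=1, fetch_links=None):
    """Gather article links from several pages, preserving order.

    fetch_links is a placeholder for a function that takes one page URL
    and returns the article URLs found on that page.
    """
    links = []
    for page_url in page_urls(url, pages):
        for link in fetch_links(page_url):
            if link not in links:  # skip duplicates across pages
                links.append(link)
    return links


# Offline demonstration with a fake single-page scraper.
def fake_fetch(page_url):
    return [page_url + '/article-a', page_url + '/article-b']


demo = collect_links('https://theaggie.org/campus', pages=2, fetch_links=fake_fetch)
print(demo)
```

Passing the scraper in as an argument keeps the paging logic testable on its own, separate from any HTML parsing.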

In [534]:
# If pip fails to install a package, try conda install instead. That is how ggplot was installed here.
import requests
import requests_cache
import requests_ftp
import lxml
from bs4 import BeautifulSoup
from collections import Counter
from matplotlib import pyplot as plt
from urlparse import urlunparse, urlparse
import pandas as pd
plt.style.use('ggplot')
requests_cache.install_cache('cache')  # cache responses on disk so repeated runs do not re-download pages
%matplotlib inline
In [535]:
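def url_lxml(url, page):
    """Fetch page number `page` of the list at `url` and parse it with BeautifulSoup."""
    response = requests.get(url + '/page/{}'.format(page))
    html = response.text  # response body decoded to unicode text
    # Use the "lxml" HTML parser. The "lxml-xml" parser expects well-formed XML,
    # so on an HTML page it typically produces a much shorter, truncated tree.
    return BeautifulSoup(html, "lxml")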
In [536]:
def link_art(url, page):
    """
    input: URL of an article list and the page number to fetch
    output: list of article URLs (strings) found on that page
    """
    all_links = []
    article = url_lxml(url, page)
    art_content = article.find_all(name="h2", attrs={"class": "entry-title"})
    for art in art_content:
        try:
            # art.a goes straight to the first <a> tag inside the heading
            all_links.append(art.a["href"])
        except TypeError:
            # heading without a link: art.a is None, so skip it
            pass
    return all_links
In [465]:
link_art("https://theaggie.org/campus",6)
Out[465]:
['https://theaggie.org/2016/11/29/advocacy-groups-write-letters-to-uc-president-amid-concerns-of-anti-semitism/',
 'https://theaggie.org/2016/11/29/student-health-and-counseling-services-launches-nap-campaign/',
 'https://theaggie.org/2016/11/28/uc-davis-receives-760-million-for-research/',
 'https://theaggie.org/2016/11/27/two-sexual-assault-occurrences-reported-during-fall-quarter/',
 'https://theaggie.org/2016/11/27/this-week-in-senate-32/',
 'https://theaggie.org/2016/11/22/the-life-of-former-chancellor-linda-p-b-katehi-post-resignation/',
 'https://theaggie.org/2016/11/21/uc-davis-releases-2015-2016-annual-campus-travel-survey-results/',
 'https://theaggie.org/2016/11/21/plant-and-animal-sciences-at-uc-davis-rank-number-one-in-the-world/',
 'https://theaggie.org/2016/11/20/achieve-uc-program-encourages-students-to-apply-to-ucs/',
 'https://theaggie.org/2016/11/20/uc-transfer-application-deadline-extended/',
 'https://theaggie.org/2016/11/18/anti-diversity-posters-discovered-on-campus/',
 'https://theaggie.org/2016/11/17/s-p-e-a-k-community-bands-together-on-quad-to-protest-trump-presidency/',
 'https://theaggie.org/2016/11/17/university-of-california-among-largest-source-of-donations-to-clinton/',
 'https://theaggie.org/2016/11/17/this-week-in-senate-31/',
 'https://theaggie.org/2016/11/16/matthew-mcfadden-confirmed-as-new-interim-senator/']
In [466]:
link_art("https://theaggie.org/city",6)
Out[466]:
['https://theaggie.org/2016/10/09/downtown-davis-receives-artsy-public-pianos/',
 'https://theaggie.org/2016/10/07/hopeful-hyatt-house-hotel-denied-approval-by-planning-commission/',
 'https://theaggie.org/2016/10/04/police-logs/',
 'https://theaggie.org/2016/09/27/davis-farmers-market-hits-the-stands-for-its-40th-year/',
 'https://theaggie.org/2016/09/22/bike-city-usa/',
 'https://theaggie.org/2016/06/03/wakeboarding-team-breaks-guinness-world-record-in-woodland/',
 'https://theaggie.org/2016/06/03/davis-city-council-recognizes-social-justice-advocates/',
 'https://theaggie.org/2016/06/02/clintons-cover-california-from-north-to-south/',
 'https://theaggie.org/2016/06/02/come-for-the-beer-stay-for-the-cause-at-davis-beer-and-cider-festival/',
 'https://theaggie.org/2016/06/02/sacramento-black-book-fair-kicks-off-june-3/',
 'https://theaggie.org/2016/06/02/10-events-happening-in-and-around-davis-this-summer/',
 'https://theaggie.org/2016/06/01/davis-arts-center-presents-june-pop-up-series/',
 'https://theaggie.org/2016/06/01/bat-walk-and-talk-all-summer-long/',
 'https://theaggie.org/2016/05/31/donald-trump-to-hold-rally-in-sacramento/',
 'https://theaggie.org/2016/05/31/california-moves-toward-implementing-earthquake-warning-system/']

Exercise 1.2. Write a function that extracts the title, text, and author of an Aggie article. The function should:

  • Have a parameter url for the URL of the article.

  • For the author, extract the "Written By" line that appears at the end of most articles. You don't have to extract the author's name from this line.

  • Return a dictionary with keys "url", "title", "text", and "author". The values for these should be the article url, title, text, and author, respectively.

For example, for this article your function should return something similar to this:

{
    'author': u'Written By: Bianca Antunez \xa0\u2014\xa0city@theaggie.org',
    'text': u'Davis residents create financial model to make city\'s financial state more transparent To increase transparency between the city\'s financial situation and the community, three residents created a model called Project Toto which aims to improve how the city communicates its finances in an easily accessible design. Jeff Miller and Matt Williams, who are members of Davis\' Finance and Budget Commission, joined together with Davis entrepreneur Bob Fung to create the model plan to bring the project to the Finance and Budget Commission in February, according to Kelly Stachowicz, assistant city manager. "City staff appreciate the efforts that have gone into this, and the interest in trying to look at the city\'s potential financial position over the long term," Stachowicz said in an email interview. "We all have a shared goal to plan for a sound fiscal future with few surprises. We believe the Project Toto effort will mesh well with our other efforts as we build the budget for the next fiscal year and beyond." Project Toto complements the city\'s effort to amplify the transparency of city decisions to community members. The aim is to increase the understanding about the city\'s financial situation and make the information more accessible and easier to understand. The project is mostly a tool for public education, but can also make predictions about potential decisions regarding the city\'s financial future. Once completed, the program will allow residents to manipulate variables to see their eventual consequences, such as tax increases or extensions and proposed developments "This really isn\'t a budget, it is a forecast to see the intervention of these decisions," Williams said in an interview with The Davis Enterprise. "What happens if we extend the sales tax? What does it do given the other numbers that are in?" 
Project Toto enables users, whether it be a curious Davis resident, a concerned community member or a city leader, with the ability to project city finances with differing variables. The online program consists of the 400-page city budget for the 2016-2017 fiscal year, the previous budget, staff reports and consultant analyses. All of the documents are cited and accessible to the public within Project Toto. "It\'s a model that very easily lends itself to visual representation," Mayor Robb Davis said. "You can see the impacts of decisions the council makes on the fiscal health of the city." Complementary to this program, there is also a more advanced version of the model with more in-depth analyses of the city\'s finances. However, for an easy-to-understand, simplistic overview, Project Toto should be enough to help residents comprehend Davis finances. There is still more to do on the project, but its creators are hard at work trying to finalize it before the 2017-2018 fiscal year budget. "It\'s something I have been very much supportive of," Davis said. "Transparency is not just something that I have been supportive of but something we have stated as a city council objective [ ] this fits very well with our attempt to inform the public of our challenges with our fiscal situation." ',
    'title': 'Project Toto aims to address questions regarding city finances',
    'url': 'https://theaggie.org/2017/02/14/project-toto-aims-to-address-questions-regarding-city-finances/'
}

Hints:

  • The author line is always the last line of the last paragraph.

  • Python 2 displays some Unicode characters as \uXXXX. For instance, \u201c is a left-facing quotation mark. You can convert most of these to ASCII characters with the method call (on a string)

    .translate({ 0x2018:0x27, 0x2019:0x27, 0x201C:0x22, 0x201D:0x22, 0x2026:0x20 })

    If you're curious about these characters, you can look them up on this page, or read more about what Unicode is.
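
A small illustration of the translate call from the hint, applied to a made-up string. The mapping is the one given above; the name ASCII_PUNCTUATION and the sample text are assumptions for the example.

```python
# Map curly quotes to straight quotes and the ellipsis to a space.
ASCII_PUNCTUATION = {
    0x2018: 0x27,  # left single quote  -> '
    0x2019: 0x27,  # right single quote -> '
    0x201C: 0x22,  # left double quote  -> "
    0x201D: 0x22,  # right double quote -> "
    0x2026: 0x20,  # ellipsis           -> space
}

raw = u'\u201cIt\u2019s a model\u2026\u201d'
clean = raw.translate(ASCII_PUNCTUATION)
print(clean)  # "It's a model "
```

The call works on unicode strings, so it has to happen before any conversion to ASCII bytes.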

In [537]:
# Apply .translate({ 0x2018:0x27, 0x2019:0x27, 0x201C:0x22, 0x201D:0x22, 0x2026:0x20 }) to the
# unicode text before encoding to ASCII. encode('ascii','ignore') alone drops curly quotes,
# which is why "city's" shows up as "citys" in the output of this function.
import unicodedata
import re
def extra_content(url):
    """
    input: url of an article
    output: dictionary with author, text, title and url
    """
    # parse the article page
    alllxml = url_lxml(url, 1)
    adict = {}
    # title
    titlelxml = alllxml.find_all(name="h1", attrs={"class": "entry-title", "itemprop": "headline"})
    try:
        title = titlelxml[0].text.strip().encode('ascii', 'ignore')
    except IndexError:
        title = None
    adict["title"] = title
    # paragraphs, with non-ASCII characters dropped
    textlxml = alllxml.find_all("p")
    texts1 = [x.text.strip().encode('ascii', 'ignore') for x in textlxml]
    # The author line is usually the second-to-last paragraph and sometimes the last,
    # so the article text is everything before the last two paragraphs.
    # Joining with "" runs paragraphs together ("transparentTo"); " ".join(...) would keep word boundaries.
    adict["text"] = "".join(texts1[:-2])
    # author name: the text between "Written By:" and the e-mail address
    # \s matches whitespace; [a-zA-Z -]+ matches letters, spaces and hyphens
    author_re = r".*:\s*([a-zA-Z -]+)\s.*@"
    try:
        author = re.search(author_re, texts1[-2]).group(1)
    except (AttributeError, IndexError):
        try:
            author = re.search(author_re, texts1[-1]).group(1)
        except (AttributeError, IndexError):
            author = None
    adict["author"] = author
    adict["url"] = url
    return adict
In [227]:
extra_content('https://theaggie.org/2017/02/14/project-toto-aims-to-address-questions-regarding-city-finances/')
Out[227]:
{'author': 'Bianca Antunez',
 'text': 'Davis residents create financial model to make citys financial state more transparentTo increase transparency between the citys financial situation and the community, three residents created a model called Project Toto which aims to improve how the city communicates its finances in an easily accessible design.Jeff Miller and Matt Williams, who are members of Davis Finance and Budget Commission, joined together with Davis entrepreneur Bob Fung to create the model plan to bring the project to the Finance and Budget Commission in February, according to Kelly Stachowicz, assistant city manager.City staff appreciate the efforts that have gone into this, and the interest in trying to look at the citys potential financial position over the long term, Stachowicz said in an email interview. We all have a shared goal to plan for a sound fiscal future with few surprises. We believe the Project Toto effort will mesh well with our other efforts as we build the budget for the next fiscal year and beyond.Project Toto complements the citys effort to amplify the transparency of city decisions to community members. The aim is to increase the understanding about the citys financial situation and make the information more accessible and easier to understand.The project is mostly a tool for public education, but can also make predictions about potential decisions regarding the citys financial future. Once completed, the program will allow residents to manipulate variables to see their eventual consequences, such as tax increases or extensions and proposed developmentsThis really isnt a budget, it is a forecast to see the intervention of these decisions, Williams said in an interview with The Davis Enterprise. What happens if we extend the sales tax? 
What does it do given the other numbers that are in?Project Toto enables users, whether it be a curious Davis resident, a concerned community member or a city leader, with the ability to project city finances with differing variables.The online program consists of the 400-page city budget for the 2016-2017 fiscal year, the previous budget, staff reports and consultant analyses. All of the documents are cited and accessible to the public within Project Toto.Its a model that very easily lends itself to visual representation, Mayor Robb Davis said. You can see the impacts of decisions the council makes on the fiscal health of the city.Complementary to this program, there is also a more advanced version of the model with more in-depth analyses of the citys finances. However, for an easy-to-understand, simplistic overview, Project Toto should be enough to help residents comprehend Davis finances.There is still more to do on the project, but its creators are hard at work trying to finalize it before the 2017-2018 fiscal year budget.Its something I have been very much supportive of, Davis said. Transparency is not just something that I have been supportive of but something we have stated as a city council objective [] this fits very well with our attempt to inform the public of our challenges with our fiscal situation.',
 'title': 'Project Toto aims to address questions regarding city finances',
 'url': 'https://theaggie.org/2017/02/14/project-toto-aims-to-address-questions-regarding-city-finances/'}
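
The author extraction can be tried in isolation. This sketch reuses the regular expression from extra_content on the ASCII form of the example byline; the helper name author_name is hypothetical.

```python
import re

# Same pattern as in extra_content, written as a raw string.
AUTHOR_PATTERN = re.compile(r".*:\s*([a-zA-Z -]+)\s.*@")


def author_name(line):
    """Return the author name from a 'Written By' line, or None."""
    match = AUTHOR_PATTERN.search(line)
    return match.group(1).strip() if match else None


print(author_name('Written By: Bianca Antunez city@theaggie.org'))  # Bianca Antunez
print(author_name('No byline here'))  # None
```

The character class only accepts ASCII letters, spaces and hyphens, so a byline with other characters in the name may be cut short or not matched at all.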

Exercise 1.3. Use your functions from exercises 1.1 and 1.2 to get a data frame of 60 Campus News articles and a data frame of 60 City News articles. Add a column to each that indicates the category, then combine them into one big data frame.

The "text" column of this data frame will be your corpus for natural language processing in exercise 1.4.
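
The tagging step can be sketched without pandas, assuming each article is a dictionary like the one returned in exercise 1.2. The helper tag_category and the two sample records are placeholders; pd.DataFrame(records) would then turn the combined list into one data frame.

```python
def tag_category(articles, category):
    """Return copies of the article dicts with a 'category' key added."""
    return [dict(article, category=category) for article in articles]


# Placeholder records, not real articles.
campus = [{'title': 'Campus story', 'text': '...', 'author': 'A', 'url': 'u1'}]
city = [{'title': 'City story', 'text': '...', 'author': 'B', 'url': 'u2'}]

records = tag_category(campus, 'campus') + tag_category(city, 'city')
print([r['category'] for r in records])  # ['campus', 'city']
```

Tagging each record before combining avoids relying on the row order of the final data frame.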

In [538]:
# Each page lists 15 articles, so 4 pages give 60 articles.
# The category column is added after the data frames are built.
# link_art returns the article links for one page.
def create_df(news_link):                  # pages 1-4
    all_link = [link_art(news_link,page) for page in range(1,5)]
    news = []
    for links in all_link:
        news_art = [extra_content(link) for link in links]
        news.append(news_art)
    return pd.concat([pd.DataFrame(new) for new in news])    
camp_df = create_df('https://theaggie.org/campus')
city_df = create_df('https://theaggie.org/city')
In [240]:
print city_df.shape
print camp_df.shape
(60, 4)
(60, 4)
In [539]:
news_df = pd.concat([camp_df,city_df])
In [540]:
# "+" concatenates the two lists into one flat list of 120 labels
category = ["campus"]*60 + ["city"]*60
news_df["category"] = category
# replace the repeated per-page index with 0..119
news_df = news_df.reset_index(drop=True)
In [280]:
news_df
Out[280]:
author text title url category
0 Alyssa Vandenberg Six senators, new executive team electedCurren... 2017 Winter Quarter election results https://theaggie.org/2017/02/24/2017-winter-qu... campus
1 Aaron Liss and Raul Castellanos Wells Fargo faces fraud, predatory lending cha... University of California, Davis City Council s... https://theaggie.org/2017/02/23/university-of-... campus
2 Kimia Akbari Faculty, students recount personal tales of im... Academics unite in peaceful rally against immi... https://theaggie.org/2017/02/23/academics-unit... campus
3 Kenton Goldsby Opening date pushed back to May 1Students have... Memorial Union to reopen Spring Quarter https://theaggie.org/2017/02/23/memorial-union... campus
4 Ivan Valenzuela Veto included revision abandoning creation of ... ASUCD President Alex Lee vetoes amendment for ... https://theaggie.org/2017/02/23/asucd-presiden... campus
5 Alyssa Vandenberg Shaheens name to remain on ballot, his votes w... Senate candidate Zaki Shaheen withdraws from race https://theaggie.org/2017/02/22/senate-candida... campus
6 Aaron Liss Students receive email warnings from UC Davis ... UC Davis experiences several recent hate-based... https://theaggie.org/2017/02/21/uc-davis-exper... campus
7 Alyssa Vandenberg UC Board of Regents to vote on the appointment... UC President selects Gary May as new UC Davis ... https://theaggie.org/2017/02/21/uc-president-s... campus
8 Jeanna Totah Tighter policies require greater approval of o... Katehi controversy prompts decline of UC admin... https://theaggie.org/2017/02/20/katehi-controv... campus
9 Ivan Valenzuela SR #7 asks university to increase capacity for... ASUCD Senate passes resolution submitting comm... https://theaggie.org/2017/02/20/asucd-senate-p... campus
10 Yvonne Leong UC Davis leads in sustainability with largest ... UC releases 2016 Annual Report on Sustainable ... https://theaggie.org/2017/02/20/uc-releases-20... campus
11 Kenton Goldsby Speakers, including Interim Chancellor Ralph J... UC Davis Global Affairs holds discussion on Pr... https://theaggie.org/2017/02/19/uc-davis-globa... campus
12 Kimia Akbari Executive order has immediate consequences for... Trumps immigration ban affects UC Davis community https://theaggie.org/2017/02/19/trumps-immigra... campus
13 Kaitlyn Cheung Student protesters march from MU flagpole to M... UC Davis students participate in UC-wide #NoDA... https://theaggie.org/2017/02/17/uc-davis-stude... campus
14 Jayashri Padmanabhan Conference entails full day of speakers, panel... UC Davis holds first mental health conference https://theaggie.org/2017/02/17/uc-davis-holds... campus
15 Demi Caceres Last week in SenateThe ASUCD Senate meeting wa... Last week in Senate https://theaggie.org/2017/02/16/last-week-in-s... campus
16 Alyssa Vandenberg and Emilie DeFazio Executive: Josh Dalavai and Adilla JamaludinIn... 2017 ASUCD Winter Elections Meet the Candidates https://theaggie.org/2017/02/16/2017-asucd-win... campus
17 Ivan Valenzuela New showcase provides opportunity for students... Shields Library hosts new exhibit for Davis ce... https://theaggie.org/2017/02/14/shields-librar... campus
18 Demi Caceres Students promote fruit and vegetable meals via... Student Health and Counseling Services hosts S... https://theaggie.org/2017/02/14/student-health... campus
19 Lindsay Floyd New fees to pay for equipment replacementTo co... PE classes may charge additional fees https://theaggie.org/2017/02/13/pe-classes-may... campus
20 Jeanna Totah Recipients each rewarded $25,000 for researchU... 11 new Chancellor Fellows honored for 2016 https://theaggie.org/2017/02/12/11-new-chancel... campus
21 Aaron Liss Muslim Student Association curates five-part D... Muslim students respond to recent political ev... https://theaggie.org/2017/02/12/muslim-student... campus
22 Lindsay Floyd Events to promote safe sexOn Feb. 1, Student H... Sexcessful Campaign launched in time for Valen... https://theaggie.org/2017/02/12/sexcessful-cam... campus
23 Alyssa Vandenberg Chan replaces former senator Sam ParkMichael C... Michael Chan sworn in as interim senator https://theaggie.org/2017/02/10/michael-chan-s... campus
24 Kenton Goldsby Regents approve tuition increase in 16-4 voteT... University of California Regents meet, approve... https://theaggie.org/2017/02/09/university-of-... campus
25 Yvonne Leong Last week in SenateThe ASUCD Senate meeting wa... Last week in Senate https://theaggie.org/2017/02/09/last-week-in-s... campus
26 Jayashri Padmanabhan Funding to expand innovation, entrepreneurship... UC Davis receives $2.2 million from Assembly B... https://theaggie.org/2017/02/09/uc-davis-recei... campus
27 Ivan Valenzuela Davis College Democrats host Dodd for question... Senator Bill Dodd visits UC Davis https://theaggie.org/2017/02/06/senator-bill-d... campus
28 Kenton Goldsby Law to affect students selected to attend Nati... AB 1887 prevents use of state funds, including... https://theaggie.org/2017/02/05/ab-1887-preven... campus
29 Jayashri Padmanabhan Kathleen Salvaty to oversee implementation of ... UC system hires Title IX coordinator https://theaggie.org/2017/02/02/uc-system-hire... campus
... ... ... ... ... ...
90 Raul Castellanos Downtown Davis offers a wide range of Thai cui... No such thing as too much Thai food https://theaggie.org/2017/01/17/no-such-thing-... city
91 Kaelyn Tuermer-Lee Petco Foundation awards $10,000 to local anima... A dog named Disney wins grant money for Rotts ... https://theaggie.org/2017/01/16/a-dog-named-di... city
92 Andie Books by Mail program will deliver materials s... Yolo County Library materials to be more widel... https://theaggie.org/2017/01/15/yolo-county-li... city
93 Sam Seasons Greetings EditionDec. 24Downstairs nei... Police Logs https://theaggie.org/2017/01/12/police-logs-7/ city
94 Dianna Rivera The Davis Manor Neighborhood hosts first Holid... Neighbors unite https://theaggie.org/2017/01/12/neighbors-unite/ city
95 Kaelyn Tuermer-Lee Local author Matt Biers-Ariels latest novelMan... Sparks fly in Light the Fire https://theaggie.org/2016/12/09/sparks-fly-in-... city
96 Raul Castellanos Volunteer program distributes bicycles, aims t... Bike Campaign offers bicycles to those who can... https://theaggie.org/2016/12/09/bike-campaign-... city
97 Juno Bhardwaj-shah Speakers criticize unconstitutional systemOn N... Bail reform advocates gather at annual fundraiser https://theaggie.org/2016/12/08/bail-reform-ad... city
98 Samantha Solomon Live-action retelling of Christmas poem promis... Twas the Night Before Christmas in Old Sacramento https://theaggie.org/2016/12/07/twas-the-night... city
99 Sam Solomon 35th annual tree lighting ceremony kicks off w... Childrens Candlelight Parade lights up downtow... https://theaggie.org/2016/12/05/childrens-cand... city
100 Sam Solomon Another week of why did people call the police... Police Logs https://theaggie.org/2016/12/04/police-logs-6/ city
101 Dianna Rivera The Yolo County Childrens Alliance celebrates ... The season of giving https://theaggie.org/2016/12/04/the-season-of-... city
102 Samantha Solomon Students, activists call for solidarity with S... NoDAPL protest erupts in downtown Davis https://theaggie.org/2016/12/02/nodapl-protest... city
103 Andie Joldersma Participants gathered for 5k, 10k, half marath... Davis Turkey Trot: more than just another race https://theaggie.org/2016/12/02/davis-turkey-t... city
104 Andie Joldersma Citizens can expect more cost-competitive clea... Affordable, clean, green energy is coming to Y... https://theaggie.org/2016/11/30/affordable-cle... city
105 Bianca Antunez Davis community renews local parcel tax for K-... Measure H passes, voters support Davis schools https://theaggie.org/2016/11/29/measure-h-pass... city
106 Anya Rehon College alcohol use, high-risk drinking discus... Local residents attend Davis town hall meeting https://theaggie.org/2016/11/29/local-resident... city
107 Sam Solomon Looks like its been an interesting weekNov. 15... Police Logs https://theaggie.org/2016/11/27/police-logs-5/ city
108 Raul Castellanos Ornamental piano outside Mishkas Caf vandalize... Public piano destroyed in act of vandalism https://theaggie.org/2016/11/27/public-piano-d... city
109 Samantha Solomon Davis residents light candles, promote sanctua... Holding the Light https://theaggie.org/2016/11/27/holding-the-li... city
110 Dianna Rivera Dont be left in the dark now that Daylight Sav... Sun down, bike lights out https://theaggie.org/2016/11/22/sun-down-bike-... city
111 Sam Solomon Nov. 7Subject stated our pizza is ready and th... Police Logs https://theaggie.org/2016/11/22/police-logs-4/ city
112 Kaelyn Tuermer-Lee Helping the community through Davis Community ... Tune in to Watermelon Musics strings-for-food ... https://theaggie.org/2016/11/21/tune-in-to-wat... city
113 Andie Joldersma Protesters gather at State Capitol Building fo... Water is sacred, water is life https://theaggie.org/2016/11/20/water-is-sacre... city
114 Anya Rehon The Yolo County community comes together for h... The Yolo Food Bank addresses food insecurity https://theaggie.org/2016/11/17/the-yolo-food-... city
115 Bianca Antunez Election results are in; Davis community conce... Nov. 8 2016: An Election Day many may never fo... https://theaggie.org/2016/11/17/nov-8-2016-an-... city
116 None More turkeys, more tomfoolery, more accidental... Police Logs https://theaggie.org/2016/11/15/police-logs-3/ city
117 Bianca Antunez Participants line up for Thanksgiving 5k befor... Yolo Food Banks eighth Annual Running of the T... https://theaggie.org/2016/11/15/yolo-food-bank... city
118 Raul Castellanos Bernie Sanders visits Sacramento to rally for ... Return of the Bern https://theaggie.org/2016/11/15/return-of-the-... city
119 Alana Joldersma Indoor facility will provide a cafeteria, clas... Construction of the All Student Center at Davi... https://theaggie.org/2016/11/14/construction-o... city

120 rows × 5 columns

Exercise 1.4. Use the Aggie corpus to answer the following questions. Use plots to support your analysis.

  • What topics does the Aggie cover the most? Do city articles typically cover different topics than campus articles?

  • What are the titles of the top 3 pairs of most similar articles? Examine each pair of articles. What words do they have in common?

  • Do you think this corpus is representative of the Aggie? Why or why not? What kinds of inference can this corpus support? Explain your reasoning.

Hints:

  • The nltk book and scikit-learn documentation may be helpful here.

  • You can determine whether city articles are "near" campus articles from the similarity matrix or with k-nearest neighbors.

  • If you want, you can use the wordcloud package to plot a word cloud. To install the package, run

    conda install -c https://conda.anaconda.org/amueller wordcloud

    in a terminal. Word clouds look nice and are easy to read, but are less precise than bar plots.
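
For the similar-pairs question, a minimal pure-Python sketch of cosine similarity over sparse term-weight vectors. scikit-learn offers the same idea at scale. The function names and the toy weights are invented for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar_pairs(vectors, k=3):
    """Return the k pairs of document ids with the highest similarity."""
    scored = [(cosine(vectors[a], vectors[b]), a, b)
              for a, b in combinations(sorted(vectors), 2)]
    scored.sort(reverse=True)
    return [(a, b, score) for score, a, b in scored[:k]]


# Toy weights, invented for illustration only.
toy = {
    'doc1': {'senat': 1.0, 'vote': 1.0},
    'doc2': {'senat': 1.0, 'vote': 1.0, 'elect': 1.0},
    'doc3': {'bike': 1.0},
}
print(most_similar_pairs(toy, k=1))
```

Comparing each pair once, via combinations, leaves out the trivial similarity of a document with itself.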

In [541]:
import numpy as np
import nltk
from nltk import corpus
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from matplotlib import pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
In [542]:
tokenize = nltk.word_tokenize
def stem(tokens,stemmer = PorterStemmer().stem):
    return [stemmer(w.lower()) for w in tokens] 

def lemmatize(text):
    """
    Extract simple lemmas based on tokenization and stemming
    Input: string
    Output: list of strings (lemmata)
    """
    return stem(tokenize(text))
In [543]:
textd = {}  # lemma -> titles of the documents containing that lemma
textall = []  # raw text of every article
toks = set()  # all lemmata seen in the corpus
for art in range(120):
    t = news_df.get_value(art,col="text")
    d = news_df.get_value(art,col="title")
    textall.append(t)
    s = set(lemmatize(t))
    toks |= s  # set union; sets do not support "+"
    for tok in s:
        textd.setdefault(tok, []).append(d)

docids = {}  # title -> integer id of the document
N = 120
for i in range(N):
    docids[news_df.get_value(i,col="title")] = i

tok_list = list(toks)  # a list of lemmata
m = len(tok_list)  # number of distinct lemmata
tokids = {tok: j for j, tok in enumerate(tok_list)}  # lemma -> integer id
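
The bookkeeping in the cell above can also be written with collections.defaultdict. This is a sketch on a toy corpus: build_index, the two sample documents and the whitespace tokenizer are placeholders, not the notebook's lemmatizer.

```python
from collections import defaultdict


def build_index(docs, tokenize):
    """Map each token to the set of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in set(tokenize(text)):
            postings[tok].add(doc_id)
    # Sorting gives every token a stable integer id.
    tokids = {tok: j for j, tok in enumerate(sorted(postings))}
    return dict(postings), tokids


# Toy corpus and a whitespace tokenizer, both placeholders.
docs = {0: 'the senate voted', 1: 'the city voted'}
postings, tokids = build_index(docs, lambda text: text.lower().split())
print(postings['voted'])
print(tokids)
```

Keying the postings by document id rather than by title keeps two articles with the same title from being counted as one document.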
In [544]:
# lemma -> number of distinct documents (by title) containing it
numd = {key:len(set(val)) for key,val in textd.items()}

logN = np.log(N)
# lemma -> smoothed idf, log(N) - log(1 + df); lemmata found in only one document are dropped
idf_smooth = {key:logN - np.log(1 + val) for key, val in numd.items() if val > 1}
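
A quick check of the smoothed idf formula, log(N) - log(1 + df), using only the math module. The helper name smoothed_idf is hypothetical. With N = 120, a lemma found in two documents scores log(40) and one found in three scores log(30).

```python
import math


def smoothed_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: log(N) - log(1 + df)."""
    return math.log(n_docs) - math.log(1 + doc_freq)


print(smoothed_idf(120, 2))  # about 3.6889, i.e. log(40)
print(smoothed_idf(120, 3))  # about 3.4012, i.e. log(30)
```

Lemmata that appear in many documents score low, so the high values in the output belong to rare terms.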
In [302]:
idf_smooth
Out[302]:
{'1,800': 3.6888794541139358,
 u'hatr': 3.401197381662155,
 'four': 2.3025850929940455,
 u'protest': 2.0149030205422647,
 'sleep': 3.6888794541139358,
 'asian': 3.401197381662155,
 'oldest': 3.6888794541139358,
 'hate': 2.2225423853205091,
 'whose': 3.1780538303479453,
 'saylor': 3.6888794541139358,
 'voter': 3.6888794541139358,
 u'bike': 2.7080502011022101,
 'under': 1.7429693050586228,
 '@': 3.401197381662155,
 u'everi': 1.6094379124341001,
 'risk': 2.7080502011022101,
 u'compassion': 3.6888794541139358,
 'blanket': 3.6888794541139358,
 u'rise': 3.1780538303479453,
 u'years.th': 3.6888794541139358,
 u'voic': 2.4849066497879999,
 u'tenni': 3.6888794541139358,
 'jack': 3.6888794541139358,
 u'unitran': 3.6888794541139358,
 u'govern': 2.0149030205422647,
 'jacob': 3.6888794541139358,
 'affect': 1.9542783987258296,
 u'school': 1.2611312181658842,
 u'scholar': 2.9957322735539909,
 u'later.th': 3.6888794541139358,
 u'showcas': 3.6888794541139358,
 u'environmentally-friendli': 3.6888794541139358,
 u'enjoy': 2.0794415416798357,
 'plaza': 3.1780538303479453,
 u'speci': 3.401197381662155,
 'miller': 3.6888794541139358,
 'bacon': 2.9957322735539909,
 u'request': 2.5902671654458262,
 'budget': 2.4849066497879999,
 u'consequ': 2.5902671654458262,
 'second': 1.9542783987258296,
 u'diseas': 2.9957322735539909,
 u'empow': 2.7080502011022101,
 'gorman': 3.401197381662155,
 'nahabedian': 3.6888794541139358,
 'campus.i': 2.9957322735539909,
 u'illumin': 3.6888794541139358,
 'even': 1.4916548767777167,
 u'employ': 3.6888794541139358,
 u'dialogu': 3.6888794541139358,
 u'neg': 2.7080502011022101,
 'near': 3.1780538303479453,
 'fossil': 2.8415815937267324,
 'dr.': 3.401197381662155,
 'new': 0.78015855754957464,
 'net': 3.6888794541139358,
 'ever': 2.5902671654458262,
 'disney': 3.6888794541139358,
 'told': 3.1780538303479453,
 'mee': 3.401197381662155,
 u'ongo': 3.401197381662155,
 u'intellectu': 3.6888794541139358,
 u'abov': 3.6888794541139358,
 'sigala': 3.6888794541139358,
 'never': 3.6888794541139358,
 u'accus': 3.6888794541139358,
 'here': 1.6094379124341001,
 'met': 2.8415815937267324,
 u'accur': 3.6888794541139358,
 'path': 2.8415815937267324,
 '100': 2.4849066497879999,
 u'enrol': 2.9957322735539909,
 u'shemeri': 3.6888794541139358,
 'luther': 3.6888794541139358,
 u'daughter': 2.8415815937267324,
 'forum': 3.1780538303479453,
 u'militari': 3.6888794541139358,
 u'community.th': 3.401197381662155,
 u'student-faculti': 3.6888794541139358,
 'credit': 3.401197381662155,
 u'harass': 3.1780538303479453,
 u'mentor': 3.1780538303479453,
 u'studi': 1.6964492894237297,
 u'controversi': 2.7080502011022101,
 u'counti': 1.6964492894237297,
 u'imposs': 3.6888794541139358,
 u'volunt': 2.3895964699836751,
 'coho': 3.6888794541139358,
 'campaign': 2.0794415416798357,
 'brought': 2.3025850929940455,
 u'attitud': 3.6888794541139358,
 u'scientif': 3.6888794541139358,
 'total': 2.3025850929940455,
 u'unit': 1.8971199848858813,
 u'highli': 3.1780538303479453,
 'sarah': 3.401197381662155,
 'dna': 3.6888794541139358,
 'spoke': 2.1484344131667874,
 'would': 1.0262916270884834,
 u'prescript': 3.6888794541139358,
 'hamidian': 3.6888794541139358,
 'program': 1.1499055830556602,
 'music': 2.4849066497879999,
 'asset': 3.6888794541139358,
 u'recommend': 3.1780538303479453,
 'type': 2.5902671654458262,
 'tell': 2.4849066497879999,
 u'relat': 1.4916548767777167,
 'officio': 3.6888794541139358,
 u'notic': 2.3895964699836751,
 u'hurt': 3.1780538303479453,
 u'warn': 3.6888794541139358,
 'phone': 2.7080502011022101,
 'warm': 3.1780538303479453,
 u'adult': 3.401197381662155,
 'former': 1.9542783987258296,
 '90': 3.1780538303479453,
 u'hole': 3.6888794541139358,
 u'hold': 2.2225423853205091,
 u'pantri': 3.401197381662155,
 'must': 2.0149030205422647,
 'me': 1.791759469228055,
 u'irvin': 3.6888794541139358,
 'word': 2.5902671654458262,
 'room': 2.3025850929940455,
 '1997': 3.6888794541139358,
 u'flore': 3.1780538303479453,
 u'work': 0.66035735773695414,
 'mu': 3.401197381662155,
 u'wors': 3.6888794541139358,
 'my': 1.2909841813155656,
 u'advocaci': 2.5902671654458262,
 u'climat': 2.1484344131667874,
 u'give': 1.4916548767777167,
 u'indic': 3.401197381662155,
 '10,000': 3.1780538303479453,
 'woman': 3.401197381662155,
 'want': 1.0739196760777379,
 '2.2': 3.6888794541139358,
 'david': 3.6888794541139358,
 u'attract': 2.8415815937267324,
 'motion': 2.9957322735539909,
 u'end': 1.9542783987258296,
 u'turn': 2.4849066497879999,
 u'polic': 2.1484344131667874,
 'travel': 2.3025850929940455,
 u'ceremoni': 3.1780538303479453,
 'how': 1.1765738301378215,
 u'recoveri': 3.6888794541139358,
 'interview': 2.2225423853205091,
 u'disappoint': 3.1780538303479453,
 u'perspect': 2.8415815937267324,
 u'confid': 3.401197381662155,
 'shaitaj': 2.9957322735539909,
 u'recogn': 2.4849066497879999,
 'after': 0.98082925301172619,
 u'community.w': 3.6888794541139358,
 'lab': 3.1780538303479453,
 u'befor': 1.455287232606842,
 u'beauti': 2.8415815937267324,
 u'law': 2.0794415416798357,
 u'demonstr': 2.5902671654458262,
 'attempt': 3.1780538303479453,
 'sioux': 3.6888794541139358,
 'third': 3.401197381662155,
 u'services.w': 3.6888794541139358,
 'think': 1.0986122886681096,
 u'detent': 3.6888794541139358,
 'greek': 3.6888794541139358,
 'maintain': 2.1484344131667874,
 'green': 2.9957322735539909,
 'south': 3.401197381662155,
 u'reloc': 3.401197381662155,
 u'enter': 2.7080502011022101,
 u'fan': 3.6888794541139358,
 'nodapl': 3.6888794541139358,
 'order': 1.3535045382968995,
 u'wind': 3.6888794541139358,
 'wine': 3.6888794541139358,
 u'oper': 2.3895964699836751,
 'consent': 3.401197381662155,
 u'offici': 2.0794415416798357,
 u'failur': 3.6888794541139358,
 u'becaus': 0.81719982922992385,
 u'incid': 2.4849066497879999,
 u'appar': 3.6888794541139358,
 'mayor': 1.8430527636156055,
 u'fit': 2.9957322735539909,
 'backpack': 3.6888794541139358,
 ',': 0.025317807984289509,
 'better': 2.0794415416798357,
 u'offic': 1.455287232606842,
 u'comprehens': 3.401197381662155,
 'easier': 3.6888794541139358,
 'then': 1.3535045382968995,
 'dec.': 2.8415815937267324,
 u'anim': 3.1780538303479453,
 u'proce': 3.6888794541139358,
 u'enrich': 3.1780538303479453,
 'slate': 3.401197381662155,
 'safe': 2.3025850929940455,
 'break': 2.8415815937267324,
 u'promis': 3.6888794541139358,
 'they': 0.5108256237659905,
 u'battl': 3.6888794541139358,
 u'life.thi': 3.6888794541139358,
 'one': 0.55338523818478613,
 'silver': 3.401197381662155,
 'bank': 3.401197381662155,
 u'choic': 2.7080502011022101,
 'alex': 2.5902671654458262,
 'meat': 3.401197381662155,
 u'accommod': 2.9957322735539909,
 u'luca': 3.6888794541139358,
 'each': 1.5293952047605637,
 'went': 2.0149030205422647,
 'side': 2.8415815937267324,
 u'mean': 1.8430527636156055,
 u'healthcar': 3.401197381662155,
 u'resum': 3.6888794541139358,
 u'oppos': 2.7080502011022101,
 'taught': 2.9957322735539909,
 '2,500': 3.6888794541139358,
 'logo': 3.6888794541139358,
 u'commiss': 2.2225423853205091,
 'saturday': 3.401197381662155,
 'rp': 3.1780538303479453,
 u'network': 2.7080502011022101,
 u'goe': 2.7080502011022101,
 u'facil': 2.4849066497879999,
 'crucial': 3.401197381662155,
 'content': 3.1780538303479453,
 'laid': 3.1780538303479453,
 'daniel': 3.6888794541139358,
 'adapt': 3.401197381662155,
 'got': 2.5902671654458262,
 'forth': 3.6888794541139358,
 'twice': 3.401197381662155,
 'u.s.': 1.8430527636156055,
 'ruttkay': 3.6888794541139358,
 u'situat': 2.7080502011022101,
 'free': 1.7429693050586228,
 u'standard': 2.8415815937267324,
 u'jennif': 3.401197381662155,
 u'disenfranchis': 3.6888794541139358,
 u'issues.th': 3.401197381662155,
 u'status.th': 3.6888794541139358,
 'atm': 3.6888794541139358,
 'moment': 3.1780538303479453,
 u'renew': 2.9957322735539909,
 u'unabl': 3.401197381662155,
 'loud': 3.6888794541139358,
 u'rang': 2.8415815937267324,
 u'grade': 3.401197381662155,
 u'dhaliw': 2.8415815937267324,
 u'wast': 3.1780538303479453,
 u'rank': 3.6888794541139358,
 u'capac': 3.1780538303479453,
 u'restrict': 3.401197381662155,
 u'instruct': 2.8415815937267324,
 u'alreadi': 2.0794415416798357,
 u'agre': 2.7080502011022101,
 u'primari': 3.6888794541139358,
 'feb.': 1.5686159179138452,
 u'sourc': 2.9957322735539909,
 u'nomin': 2.9957322735539909,
 'their': 0.48342664957787562,
 'top': 2.8415815937267324,
 u'sometim': 2.9957322735539909,
 u'necessarili': 2.9957322735539909,
 u'master': 3.401197381662155,
 'too': 2.5902671654458262,
 'john': 3.401197381662155,
 u'listen': 2.8415815937267324,
 'growth': 3.401197381662155,
 'segundo': 3.6888794541139358,
 u'tool': 3.1780538303479453,
 'took': 2.0794415416798357,
 u'direct': 2.1484344131667874,
 u'legislatur': 3.6888794541139358,
 u'email.th': 3.6888794541139358,
 u'conserv': 2.4849066497879999,
 'hadnt': 3.6888794541139358,
 'white': 2.0794415416798357,
 u'silli': 3.6888794541139358,
 u'target': 2.8415815937267324,
 u'sacramento.th': 3.6888794541139358,
 u'provid': 0.81719982922992385,
 'tree': 3.6888794541139358,
 u'biolog': 2.7080502011022101,
 'project': 1.4916548767777167,
 'matter': 2.9957322735539909,
 u'anxieti': 3.6888794541139358,
 u'minut': 3.1780538303479453,
 u'entail': 3.6888794541139358,
 u'runner': 3.6888794541139358,
 'modern': 3.401197381662155,
 'mind': 3.1780538303479453,
 'mine': 3.6888794541139358,
 'raw': 3.6888794541139358,
 'manner': 3.6888794541139358,
 'seen': 2.5902671654458262,
 u'seem': 2.4849066497879999,
 u'seek': 2.1484344131667874,
 u'strength': 3.6888794541139358,
 u'recreat': 2.9957322735539909,
 u'especi': 1.6964492894237297,
 u'thoma': 3.6888794541139358,
 u'thoroughli': 3.6888794541139358,
 u'neurobiolog': 3.401197381662155,
 u'rider': 3.6888794541139358,
 u'predominantli': 3.6888794541139358,
 'blue': 3.401197381662155,
 u'insur': 3.6888794541139358,
 u'plenti': 3.6888794541139358,
 'though': 2.7080502011022101,
 u'russel': 3.401197381662155,
 u'object': 2.7080502011022101,
 'what': 0.95885034629295074,
 u'marin': 3.6888794541139358,
 'regular': 3.6888794541139358,
 'third-year': 1.7429693050586228,
 u'letter': 2.7080502011022101,
 'drought': 3.6888794541139358,
 u'everyth': 2.4849066497879999,
 u'tradit': 2.7080502011022101,
 'don': 3.401197381662155,
 'professor': 2.0794415416798357,
 u'camp': 3.6888794541139358,
 'alumni': 3.401197381662155,
 'dog': 2.9957322735539909,
 u'hospit': 2.7080502011022101,
 u'alumnu': 3.6888794541139358,
 u'declar': 2.9957322735539909,
 u'tech': 3.1780538303479453,
 u'oppress': 3.401197381662155,
 'came': 1.9542783987258296,
 u'treati': 3.6888794541139358,
 'hunger': 3.6888794541139358,
 u'opposit': 3.401197381662155,
 u'advisori': 3.6888794541139358,
 u'sanctuari': 2.9957322735539909,
 u'inaugur': 3.1780538303479453,
 u'academi': 3.1780538303479453,
 'earth': 3.6888794541139358,
 u'environ': 2.2225423853205091,
 u'bail': 3.401197381662155,
 u'involv': 1.7429693050586228,
 u'despit': 2.7080502011022101,
 u'acquir': 3.6888794541139358,
 u'explain': 1.7429693050586228,
 u'restaur': 3.401197381662155,
 'lgbtqia': 3.6888794541139358,
 u'theme': 3.1780538303479453,
 u'busi': 1.5293952047605637,
 u'stephani': 3.6888794541139358,
 'constant': 3.6888794541139358,
 'rice': 3.6888794541139358,
 'plate': 3.6888794541139358,
 'yfb': 3.6888794541139358,
 'wide': 3.1780538303479453,
 'de': 3.401197381662155,
 'stop': 2.3895964699836751,
 'dc': 3.6888794541139358,
 '25,000': 3.6888794541139358,
 u'report': 1.6964492894237297,
 'instructor': 3.6888794541139358,
 u'receiv': 1.2611312181658842,
 u'earn': 3.1780538303479453,
 'bar': 3.401197381662155,
 'spokesperson': 3.401197381662155,
 u'politician': 3.1780538303479453,
 u'secretari': 3.401197381662155,
 u'indigen': 3.401197381662155,
 'bad': 2.8415815937267324,
 'ban': 2.8415815937267324,
 u'septemb': 3.6888794541139358,
 'respond': 2.4849066497879999,
 'human': 2.4849066497879999,
 'fair': 3.401197381662155,
 'specialist': 2.9957322735539909,
 u'resist': 2.9957322735539909,
 u'mandatori': 3.6888794541139358,
 u'result': 2.0794415416798357,
 u'respons': 1.6964492894237297,
 u'fail': 2.8415815937267324,
 'weird': 3.6888794541139358,
 u'news': 2.1484344131667874,
 'best': 2.3025850929940455,
 'juliet': 3.6888794541139358,
 u'awar': 2.0794415416798357,
 'co2': 3.6888794541139358,
 'away': 2.2225423853205091,
 u'discoveri': 3.6888794541139358,
 u'figur': 2.8415815937267324,
 'score': 3.6888794541139358,
 'drawn': 3.1780538303479453,
 u'approach': 2.1484344131667874,
 'berkeley': 3.1780538303479453,
 u'attribut': 3.6888794541139358,
 u'accord': 1.455287232606842,
 'men': 3.1780538303479453,
 u'wa': 0.22314355131420971,
 'adilla': 3.6888794541139358,
 u'extens': 2.9957322735539909,
 'drew': 3.1780538303479453,
 u'intox': 3.6888794541139358,
 u'feloni': 3.401197381662155,
 'kitchen': 3.401197381662155,
 u'protect': 1.7429693050586228,
 u'accident': 3.6888794541139358,
 u'expos': 2.9957322735539909,
 'ill': 2.8415815937267324,
 'against': 1.6964492894237297,
 u'countri': 1.6519975268528961,
 u'compromis': 3.6888794541139358,
 'and': 0.0,
 'had': 1.0986122886681096,
 u'inher': 3.401197381662155,
 '2nd': 3.6888794541139358,
 '250': 3.1780538303479453,
 u'guid': 2.9957322735539909,
 'speak': 1.9542783987258296,
 'him.jan': 3.6888794541139358,
 'bathroom': 3.401197381662155,
 'iraq': 3.6888794541139358,
 u'angel': 3.401197381662155,
 u'union': 2.3025850929940455,
 'three': 1.4916548767777167,
 'been': 0.61310447288640901,
 '.': -0.0082988028146955273,
 u'revenu': 2.9957322735539909,
 'beer': 3.6888794541139358,
 'much': 1.8430527636156055,
 'interest': 1.6964492894237297,
 'basic': 2.5902671654458262,
 u'quickli': 3.401197381662155,
 'life': 1.8430527636156055,
 u'regul': 3.1780538303479453,
 u'worker': 2.8415815937267324,
 'ariana': 3.6888794541139358,
 'jcoc': 3.6888794541139358,
 'child': 2.9957322735539909,
 '165': 3.6888794541139358,
 u'davi': 0.21278076427866299,
 'mohammad': 3.6888794541139358,
 'adela': 3.6888794541139358,
 u'ident': 2.8415815937267324,
 u'affirm': 2.9957322735539909,
 u'servic': 1.2039728043259359,
 u'properti': 2.3895964699836751,
 u'commerci': 2.9957322735539909,
 'air': 3.1780538303479453,
 u'aim': 1.8971199848858813,
 u'calcul': 3.6888794541139358,
 u'monse': 3.6888794541139358,
 u'publicli': 3.6888794541139358,
 'aid': 2.3895964699836751,
 'seven': 2.3895964699836751,
 u'garag': 3.401197381662155,
 'mexico': 3.6888794541139358,
 'is': 0.10536051565782589,
 u'it': 0.09614386055290236,
 'ii': 3.401197381662155,
 'cant': 2.9957322735539909,
 'im': 1.9542783987258296,
 'in': -0.0082988028146955273,
 'id': 3.1780538303479453,
 u'sever': 1.6094379124341001,
 'if': 1.0986122886681096,
 'grown': 3.401197381662155,
 u'jame': 3.6888794541139358,
 u'perform': 2.2225423853205091,
 u'suggest': 2.3895964699836751,
 u'make': 0.74444047494749555,
 u'transpar': 3.401197381662155,
 u'wound': 3.6888794541139358,
 'airport': 3.6888794541139358,
 'dean': 2.9957322735539909,
 'complex': 2.9957322735539909,
 u'jerri': 3.6888794541139358,
 u'complet': 2.1484344131667874,
 u'elli': 3.6888794541139358,
 u'evid': 3.6888794541139358,
 'sacramento': 1.6519975268528961,
 u'redevelop': 3.6888794541139358,
 'rain': 3.1780538303479453,
 u'hand': 3.401197381662155,
 u'fairli': 3.6888794541139358,
 u'rais': 2.3025850929940455,
 'scale': 3.6888794541139358,
 u'kid': 2.8415815937267324,
 'kept': 3.6888794541139358,
 u'thu': 2.8415815937267324,
 u'1970': 3.401197381662155,
 'contact': 2.4849066497879999,
 u'shortli': 3.6888794541139358,
 u'thi': 0.23361485118150505,
 'the': -0.0082988028146955273,
 u'campu': 0.87546873735389985,
 u'legisl': 2.3895964699836751,
 'left': 2.3895964699836751,
 u'identifi': 2.8415815937267324,
 'capitol': 3.1780538303479453,
 'just': 0.83624802420061828,
 u'photo': 2.9957322735539909,
 '2020.in': 3.6888794541139358,
 u'victim': 3.1780538303479453,
 'yet': 2.7080502011022101,
 u'languag': 3.1780538303479453,
 u'previous': 2.9957322735539909,
 'katehi': 3.1780538303479453,
 u'easi': 3.6888794541139358,
 'josh': 3.6888794541139358,
 'spread': 2.9957322735539909,
 'board': 1.9542783987258296,
 '5k': 3.401197381662155,
 'prison': 3.1780538303479453,
 u'els': 2.9957322735539909,
 'east': 2.9957322735539909,
 'hat': 3.6888794541139358,
 'gave': 2.5902671654458262,
 u'applic': 2.5902671654458262,
 u'mayb': 3.1780538303479453,
 u'preserv': 2.8415815937267324,
 u'donat': 2.0794415416798357,
 'background': 2.2225423853205091,
 'sunoco': 3.6888794541139358,
 u'readili': 3.401197381662155,
 u'athlet': 3.1780538303479453,
 'apart': 2.9957322735539909,
 u'measur': 2.3025850929940455,
 'gift': 3.6888794541139358,
 u'specif': 2.3025850929940455,
 '54': 3.1780538303479453,
 'sandhu': 2.9957322735539909,
 u'panelist': 3.6888794541139358,
 u'remind': 2.5902671654458262,
 '37': 3.6888794541139358,
 u'night': 2.7080502011022101,
 'hung': 3.6888794541139358,
 u'flagpol': 3.6888794541139358,
 'attorney': 2.4849066497879999,
 'right': 1.3217558399823193,
 'old': 2.7080502011022101,
 u'crowd': 2.5902671654458262,
 u'percentag': 3.1780538303479453,
 '50': 2.4849066497879999,
 u'rile': 3.6888794541139358,
 'dear': 3.6888794541139358,
 u'proudli': 3.6888794541139358,
 u'elderli': 3.6888794541139358,
 u'transmiss': 3.6888794541139358,
 u'cooper': 3.6888794541139358,
 'combat': 3.1780538303479453,
 'for': 0.068992871486951657,
 u'enolog': 3.6888794541139358,
 'fox': 3.401197381662155,
 'p.m.': 2.3895964699836751,
 u'condit': 2.9957322735539909,
 u'underserv': 3.6888794541139358,
 'core': 3.401197381662155,
 u'plu': 3.6888794541139358,
 u'sensibl': 3.6888794541139358,
 'tour': 3.1780538303479453,
 u'insecur': 3.401197381662155,
 u'pose': 3.401197381662155,
 u'confer': 3.1780538303479453,
 u'colleg': 1.4916548767777167,
 u'promot': 2.1484344131667874,
 u'peer': 2.9957322735539909,
 u'post': 2.3895964699836751,
 'super': 3.401197381662155,
 u'describ': 2.5902671654458262,
 'chapter': 2.5902671654458262,
 u'bylaw': 3.6888794541139358,
 u'slightli': 3.401197381662155,
 u'surround': 2.3025850929940455,
 u'unfortun': 3.401197381662155,
 u'festiv': 3.1780538303479453,
 'dinner': 3.6888794541139358,
 'afternoon': 3.6888794541139358,
 u'commit': 2.1484344131667874,
 u'produc': 2.5902671654458262,
 'civil': 2.8415815937267324,
 u'tackl': 3.6888794541139358,
 u'profession': 2.5902671654458262,
 'down': 1.6964492894237297,
 u'creativ': 2.4849066497879999,
 u'resili': 3.6888794541139358,
 u'formerli': 3.6888794541139358,
 u'opportun': 1.3862943611198904,
 u'manageri': 3.1780538303479453,
 'deal': 2.8415815937267324,
 u'frustrat': 3.6888794541139358,
 'support': 0.87546873735389985,
 u'closur': 3.6888794541139358,
 u'transform': 3.6888794541139358,
 'fight': 3.401197381662155,
 u'avail': 2.1484344131667874,
 u'reli': 3.6888794541139358,
 'editor': 3.401197381662155,
 'way': 1.3217558399823193,
 'spring': 2.5902671654458262,
 'call': 1.2611312181658842,
 u'war': 3.6888794541139358,
 u'analysi': 2.9957322735539909,
 u'head': 2.4849066497879999,
 'yolo': 2.0149030205422647,
 'form': 2.5902671654458262,
 u'solidar': 2.3895964699836751,
 u'forc': 2.3895964699836751,
 u'forb': 3.401197381662155,
 u'heal': 3.401197381662155,
 u'armi': 3.6888794541139358,
 u'surveil': 3.6888794541139358,
 'hear': 2.7080502011022101,
 'solar': 3.401197381662155,
 'true': 3.401197381662155,
 'analyst': 3.401197381662155,
 'absent': 3.401197381662155,
 u'counsel': 2.7080502011022101,
 u'wineri': 3.401197381662155,
 u'intern': 1.6964492894237297,
 'until': 2.2225423853205091,
 'vanguard': 3.6888794541139358,
 'jan.': 1.0262916270884834,
 u'fundament': 3.6888794541139358,
 'cair-sv': 3.6888794541139358,
 u'retir': 3.1780538303479453,
 '150': 2.8415815937267324,
 'later': 2.5902671654458262,
 'classic': 3.6888794541139358,
 'upset': 3.1780538303479453,
 'proven': 3.6888794541139358,
 u'drive': 2.7080502011022101,
 u'exist': 2.3025850929940455,
 u'desir': 3.401197381662155,
 u'ship': 3.6888794541139358,
 'ramirez': 2.9957322735539909,
 'mold': 3.6888794541139358,
 'trip': 3.401197381662155,
 u'impos': 3.401197381662155,
 'said.for': 3.401197381662155,
 'floor': 3.1780538303479453,
 u'holli': 3.6888794541139358,
 u'excel': 2.3895964699836751,
 'actor': 3.6888794541139358,
 u'flood': 3.6888794541139358,
 'role': 2.7080502011022101,
 u'test': 3.401197381662155,
 u'tie': 3.401197381662155,
 u'unwilling': 3.6888794541139358,
 'roll': 2.9957322735539909,
 u'realiti': 3.1780538303479453,
 u'legitim': 3.6888794541139358,
 'intend': 3.401197381662155,
 u'benefici': 3.6888794541139358,
 'felt': 2.9957322735539909,
 'outreach': 2.7080502011022101,
 'fell': 3.6888794541139358,
 u'intent': 3.1780538303479453,
 u'award': 2.0794415416798357,
 u'consid': 2.0149030205422647,
 u'easili': 3.1780538303479453,
 'weekend': 3.401197381662155,
 'billion': 3.6888794541139358,
 'grief': 3.6888794541139358,
 u'femal': 2.5902671654458262,
 'chairperson': 3.401197381662155,
 'longer': 2.7080502011022101,
 u'anywher': 2.8415815937267324,
 u'ignor': 3.6888794541139358,
 'time': 0.89567144467141935,
 u'push': 2.7080502011022101,
 u'serious': 3.401197381662155,
 u'daili': 3.1780538303479453,
 u'recipi': 2.9957322735539909,
 'concept': 3.6888794541139358,
 u'consum': 3.1780538303479453,
 u'focus': 2.0794415416798357,
 'drop-off': 3.6888794541139358,
 u'signific': 2.3895964699836751,
 u'supplement': 3.6888794541139358,
 'milo': 2.5902671654458262,
 'chair': 2.0794415416798357,
 u'decid': 2.0794415416798357,
 u'middl': 2.9957322735539909,
 u'grape': 3.6888794541139358,
 'shkreli': 3.6888794541139358,
 u'flash': 3.6888794541139358,
 'father': 3.6888794541139358,
 u'environment': 1.4916548767777167,
 u'certainli': 3.1780538303479453,
 u'decis': 2.1484344131667874,
 u'sociolog': 3.401197381662155,
 'oversight': 3.401197381662155,
 'brown': 3.1780538303479453,
 'vet': 3.6888794541139358,
 u'low-incom': 3.401197381662155,
 'string': 3.6888794541139358,
 'ibrahim': 3.6888794541139358,
 u'convict': 3.6888794541139358,
 'join': 2.0149030205422647,
 'exact': 3.6888794541139358,
 'wore': 3.6888794541139358,
 u'valuabl': 3.6888794541139358,
 u'administr': 1.5686159179138452,
 u'level': 2.0794415416798357,
 'tear': 3.6888794541139358,
 u'die': 3.6888794541139358,
 '1996': 3.6888794541139358,
 u'democrat': 2.9957322735539909,
 u'item': 2.8415815937267324,
 'team': 2.0149030205422647,
 'quick': 3.6888794541139358,
 'nagey': 3.6888794541139358,
 u'guy': 3.1780538303479453,
 u'round': 3.6888794541139358,
 u'prevent': 2.5902671654458262,
 'pork': 3.401197381662155,
 u'outlin': 3.401197381662155,
 'trend': 3.1780538303479453,
 u'compens': 3.1780538303479453,
 u'sign': 1.8430527636156055,
 u'cost': 2.3895964699836751,
 'patient': 2.9957322735539909,
 '6:10': 3.6888794541139358,
 u'appear': 3.401197381662155,
 u'energi': 2.3025850929940455,
 u'current': 1.2039728043259359,
 u'suspect': 3.401197381662155,
 u'uc-wid': 3.401197381662155,
 u'appeal': 2.7080502011022101,
 u'gener': 1.6964492894237297,
 'muslim': 2.7080502011022101,
 'french': 3.6888794541139358,
 'water': 2.4849066497879999,
 u'entertain': 3.6888794541139358,
 u'address': 1.791759469228055,
 u'locat': 2.0794415416798357,
 'along': 2.5902671654458262,
 u'destigmat': 3.6888794541139358,
 u'teacher': 3.1780538303479453,
 'wait': 2.9957322735539909,
 u'alto': 3.6888794541139358,
 'male': 2.9957322735539909,
 u'invit': 2.8415815937267324,
 'proud': 2.8415815937267324,
 u'healthi': 2.8415815937267324,
 u'extrem': 2.5902671654458262,
 'bob': 3.1780538303479453,
 u'ourselv': 3.401197381662155,
 u'orient': 3.401197381662155,
 'love': 1.791759469228055,
 'extra': 3.1780538303479453,
 'prefer': 3.6888794541139358,
 u'leav': 2.3025850929940455,
 u'seattl': 3.401197381662155,
 u'instal': 3.401197381662155,
 'fbi': 3.6888794541139358,
 'should': 1.455287232606842,
 u'mobil': 3.6888794541139358,
 u'market': 1.8430527636156055,
 u'two-third': 3.6888794541139358,
 'prove': 3.1780538303479453,
 'sake': 3.6888794541139358,
 u'univers': 1.1239300966523995,
 u'visit': 2.3895964699836751,
 'by': 0.24419696051204198,
 u'everybodi': 3.401197381662155,
 'live': 1.5293952047605637,
 u'loung': 3.6888794541139358,
 'scope': 3.6888794541139358,
 'checkout': 3.6888794541139358,
 u'suicid': 3.6888794541139358,
 'today': 2.3025850929940455,
 u'capit': 3.1780538303479453,
 'said': 0.13353139262452274,
 u'afford': 2.1484344131667874,
 u'peopl': 0.61310447288640901,
 'curriculum': 3.6888794541139358,
 u'enhanc': 2.7080502011022101,
 'downtown': 1.8430527636156055,
 'visual': 3.401197381662155,
 u'examin': 3.6888794541139358,
 'effort': 1.6964492894237297,
 'behalf': 3.6888794541139358,
 u'religi': 2.8415815937267324,
 u'demolish': 3.6888794541139358,
 u'slogan': 3.1780538303479453,
 u'keynot': 3.6888794541139358,
 'car': 3.401197381662155,
 u'prepar': 2.3895964699836751,
 u'judg': 2.8415815937267324,
 u'focu': 2.1484344131667874,
 u'imper': 3.6888794541139358,
 u'cat': 3.6888794541139358,
 u'whatev': 2.9957322735539909,
 u'purpos': 2.5902671654458262,
 u'preclud': 3.6888794541139358,
 'heart': 2.8415815937267324,
 u'complic': 3.6888794541139358,
 u'trump': 1.6964492894237297,
 u'predict': 2.9957322735539909,
 u'curri': 3.6888794541139358,
 u'topic': 2.4849066497879999,
 'heard': 2.5902671654458262,
 u'critic': 2.0794415416798357,
 'council': 1.8971199848858813,
 u'recycl': 3.6888794541139358,
 u'agenc': 2.3895964699836751,
 u'contamin': 3.6888794541139358,
 'stole': 3.6888794541139358,
 'occur': 2.5902671654458262,
 u'clair': 3.6888794541139358,
 'pink': 3.6888794541139358,
 u'multipl': 2.4849066497879999,
 'winter': 2.3025850929940455,
 u'assemblymemb': 3.401197381662155,
 u'economi': 3.401197381662155,
 'write': 2.8415815937267324,
 u'alway': 2.0149030205422647,
 'sunday': 3.401197381662155,
 'vital': 2.9957322735539909,
 u'anyon': 2.4849066497879999,
 'fourth': 3.6888794541139358,
 'sworn': 3.6888794541139358,
 u'pathway': 3.401197381662155,
 u'product': 2.3895964699836751,
 u'said.davi': 3.1780538303479453,
 u'superintend': 3.401197381662155,
 u'spot': 3.1780538303479453,
 'bear': 3.6888794541139358,
 'date': 3.401197381662155,
 'such': 1.0986122886681096,
 u'grow': 2.3895964699836751,
 u'man': 3.6888794541139358,
 u'classroom': 3.401197381662155,
 u'stress': 2.5902671654458262,
 u'practic': 2.4849066497879999,
 u'secur': 2.4849066497879999,
 u'ideolog': 3.6888794541139358,
 u'inform': 1.4201959127955717,
 'switch': 3.6888794541139358,
 'so': 0.5108256237659905,
 'african': 3.401197381662155,
 u'offend': 3.6888794541139358,
 'tall': 3.401197381662155,
 u'riversid': 3.401197381662155,
 u'talk': 1.9542783987258296,
 u'shield': 3.6888794541139358,
 u'anticip': 3.401197381662155,
 u'approv': 2.1484344131667874,
 u'tip': 3.401197381662155,
 'brain': 3.6888794541139358,
 'citizenship': 3.6888794541139358,
 u'equip': 2.9957322735539909,
 'still': 1.6964492894237297,
 u'mainli': 3.6888794541139358,
 u'dynam': 3.6888794541139358,
 u'entiti': 3.6888794541139358,
 u'ethic': 3.1780538303479453,
 u'conjunct': 3.6888794541139358,
 'group': 1.5686159179138452,
 u'thank': 2.4849066497879999,
 u'polici': 1.4916548767777167,
 u'passag': 3.1780538303479453,
 u'platform': 2.5902671654458262,
 u'window': 2.8415815937267324,
 u'torr': 3.6888794541139358,
 'main': 2.0149030205422647,
 'halt': 3.401197381662155,
 '3': 2.5902671654458262,
 u'financi': 2.2225423853205091,
 u'automobil': 3.401197381662155,
 u'initi': 1.791759469228055,
 u'nation': 1.455287232606842,
 'answer': 2.4849066497879999,
 'kkk': 3.6888794541139358,
 'half': 2.9957322735539909,
 'not': 0.26570316573300534,
 'now': 1.3862943611198904,
 'nop': 3.6888794541139358,
 u'discuss': 1.6964492894237297,
 'nor': 3.6888794541139358,
 u'introduct': 3.6888794541139358,
 'wont': 2.7080502011022101,
 '&': 2.8415815937267324,
 u'term': 2.5902671654458262,
 u'name': 2.2225423853205091,
 u'entrepreneur': 3.6888794541139358,
 u'drop': 3.401197381662155,
 u'separ': 3.401197381662155,
 u'magazin': 3.1780538303479453,
 'wong': 3.6888794541139358,
 'rock': 2.9957322735539909,
 u'januari': 3.1780538303479453,
 u'quarter': 2.5902671654458262,
 'years.i': 3.6888794541139358,
 'nick': 2.8415815937267324,
 u'muslim-major': 3.6888794541139358,
 u'replac': 2.4849066497879999,
 u'individu': 1.6519975268528961,
 u'continu': 1.0498221244986774,
 u'ensur': 2.0149030205422647,
 u'sponsor': 2.8415815937267324,
 'year': 0.58279912339108009,
 'happen': 2.0149030205422647,
 u'baselin': 3.6888794541139358,
 'canada': 3.1780538303479453,
 'shown': 3.1780538303479453,
 u'accomplish': 2.8415815937267324,
 'jackson': 3.6888794541139358,
 u'space': 1.9542783987258296,
 u'profit': 3.1780538303479453,
 u'bespok': 3.6888794541139358,
 'internet': 2.9957322735539909,
 'fargo': 3.6888794541139358,
 'sweet': 3.401197381662155,
 u'correct': 3.6888794541139358,
 u'integr': 3.1780538303479453,
 'e.': 3.6888794541139358,
 'state': 0.91629073187415466,
 'million': 1.8971199848858813,
 'seventh': 3.1780538303479453,
 u'argu': 2.8415815937267324,
 u'headlin': 3.6888794541139358,
 u'flyer': 3.6888794541139358,
 'california': 1.0033021088637848,
 'span': 3.6888794541139358,
 u'landfil': 3.6888794541139358,
 'card': 2.9957322735539909,
 'care': 1.6519975268528961,
 u'refus': 3.401197381662155,
 'honest': 3.6888794541139358,
 u'recov': 3.6888794541139358,
 u'thing': 1.3535045382968995,
 'place': 1.2039728043259359,
 u'greenhous': 3.6888794541139358,
 'nonprofit': 2.2225423853205091,
 u'principl': 3.1780538303479453,
 'childhood': 3.401197381662155,
 'frequent': 3.401197381662155,
 'first': 1.0986122886681096,
 u'origin': 2.4849066497879999,
 u'directli': 2.3895964699836751,
 u'carri': 3.1780538303479453,
 u'onc': 2.2225423853205091,
 'yourself': 2.7080502011022101,
 'submit': 3.1780538303479453,
 'spanish': 3.6888794541139358,
 'vote': 2.0149030205422647,
 u'happi': 2.9957322735539909,
 u'open': 1.6519975268528961,
 'given': 1.791759469228055,
 'boulevard': 3.1780538303479453,
 u'christma': 3.6888794541139358,
 'district': 2.3025850929940455,
 'bloom': 3.6888794541139358,
 'caught': 3.6888794541139358,
 u'breed': 3.401197381662155,
 'iac': 3.6888794541139358,
 'plastic': 3.401197381662155,
 u'citi': 1.2039728043259359,
 '2': 2.5902671654458262,
 u'draft': 3.6888794541139358,
 u'proposit': 2.9957322735539909,
 u'conveni': 3.6888794541139358,
 u'cite': 3.1780538303479453,
 u'friend': 2.3025850929940455,
 u'transgend': 3.6888794541139358,
 'hub': 3.6888794541139358,
 'that': 0.087011376989629241,
 'season': 2.5902671654458262,
 u'logist': 3.6888794541139358,
 u'mostli': 3.1780538303479453,
 'quad': 3.1780538303479453,
 'than': 1.1239300966523995,
 'boyfriend': 3.6888794541139358,
 '11': 2.8415815937267324,
 '10': 1.6094379124341001,
 '13': 3.1780538303479453,
 '12': 2.5902671654458262,
 '15': 2.3895964699836751,
 '14': 2.7080502011022101,
 '17': 2.5902671654458262,
 '16': 2.7080502011022101,
 '19': 2.8415815937267324,
 '18': 2.8415815937267324,
 'recruit': 3.401197381662155,
 u'banner': 3.401197381662155,
 'were': 0.56798403760593885,
 u'posit': 1.6964492894237297,
 'counselor': 3.1780538303479453,
 u'seri': 2.3025850929940455,
 u'prospect': 3.6888794541139358,
 'fork': 3.6888794541139358,
 u'rene': 3.6888794541139358,
 'san': 2.4849066497879999,
 ...}
In [303]:
plt.hist(idf_smooth.values(),bins=20)
Out[303]:
(array([  17.,   10.,   18.,   16.,   19.,   27.,   35.,   49.,   65.,
          56.,  104.,   86.,  147.,   77.,  205.,  131.,  179.,  252.,
         341.,  781.]),
 array([-0.0082988 ,  0.17656011,  0.36141902,  0.54627794,  0.73113685,
         0.91599576,  1.10085467,  1.28571359,  1.4705725 ,  1.65543141,
         1.84029033,  2.02514924,  2.21000815,  2.39486706,  2.57972598,
         2.76458489,  2.9494438 ,  3.13430272,  3.31916163,  3.50402054,
         3.68887945]),
 <a list of 20 Patch objects>)
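The printed values look consistent with a smoothed inverse document frequency of the form ln(N / (1 + df)) with N = 120 articles. This is inferred from the numbers in the output, not taken from the code, so treat it as an assumption. Under that assumption, a term that occurs in every article gets a slightly negative score, which would explain why the lowest bin edge is below zero. A minimal sketch (the function name smoothed_idf is hypothetical):

```python
import math

def smoothed_idf(n_docs, doc_freq):
    # Assumed form: ln(N / (1 + df)); the +1 avoids division by zero
    return math.log(float(n_docs) / (1 + doc_freq))

n_docs = 120  # assumed corpus size, matching the (120L, 120L) similarity matrix
idf_everywhere = smoothed_idf(n_docs, 120)  # term found in every article
idf_half = smoothed_idf(n_docs, 61)         # term found in about half of them
idf_rare = smoothed_idf(n_docs, 2)          # term found in only two articles
```

With these inputs the three results line up with the scores printed for '.', 'work' and the rarest terms in the dictionary above.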
In [545]:
from os import path
from wordcloud import WordCloud
# all articles; join with a space so words at article boundaries do not fuse
wordcloud = WordCloud().generate(" ".join(textall))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
In [380]:
# for the campus news
wordcloud = WordCloud().generate(" ".join(textall[:59]))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
In [382]:
# for the city
wordcloud = WordCloud().generate(" ".join(textall[59:119]))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

The three plots above show that the key words across all articles are "Davis", "UC", "student", "community", "people" and "city".
For the campus news articles, the key words are "UC", "student", "Davis" and "campus".
For the city news articles, the key words are "Davis", "city", "community", "people", "food", "student" and "Sacramento".
Thus the main topics of campus news and city news do not differ much: both center on Davis and its students.

In [546]:
# the previous parts only computed the inverse document frequency (idf)
# now compute tf-idf, i.e. the idf weighted by the term frequency tf(t,d)
vectorizer = TfidfVectorizer(tokenizer=lemmatize,stop_words="english",smooth_idf=True,norm=None)
# textall is a list of raw files
tfs = vectorizer.fit_transform(textall)
In [547]:
sim = tfs.dot(tfs.T)
In [548]:
sim.mean()
Out[548]:
1547.5883685507699
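Because the vectorizer above is created with norm=None, tfs.dot(tfs.T) is a raw dot product, so longer articles tend to get larger scores regardless of topic. A minimal sketch of the length-normalized alternative, cosine similarity, on a small made-up matrix (toy and cosine_sim are illustrative names, not part of this notebook):

```python
import numpy as np

def cosine_sim(mat):
    # scale every row to unit Euclidean length, then take dot products
    norms = np.sqrt((mat ** 2).sum(axis=1))
    unit = mat / norms[:, np.newaxis]
    return unit.dot(unit.T)

toy = np.array([[1.0, 2.0, 0.0],
                [10.0, 20.0, 0.0],  # same direction as row 0, ten times longer
                [0.0, 0.0, 3.0]])   # shares no terms with the other rows

raw = toy.dot(toy.T)   # raw[0, 1] is much larger than raw[0, 0]
cos = cosine_sim(toy)  # cos[0, 1] is 1 because the rows point the same way
```

On the real data, one way to get the same effect would be to build the TfidfVectorizer with norm='l2' instead of norm=None, in which case the dot product of the rows is already the cosine similarity.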
In [549]:
# convert the sparse matrix to a dense np.array with .toarray()
# np.where(simay == simay.max()) would then give the index of the largest value
simay = sim.toarray()
#np.where(simay == simay.max())
simay.shape
Out[549]:
(120L, 120L)
In [550]:
import numpy as np
#plt.imshow(simay, cmap='hot', interpolation='nearest')
plt.pcolor(simay)
plt.show()
# extract the upper diagonal
#simay_upp = np.triu(simay,k=0)

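The heatmap above shows that document 16 has higher similarity scores with the other documents than most articles do. Because the tf-idf vectors are not normalized (norm=None), this may partly reflect the length of that article rather than its content.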

In [595]:
# transform into a 1D array, ordered row by row
#simay_upp1 = np.reshape(simay,(1,np.product(simay.shape)))
#simay_upp1

simay_upp = np.triu(simay,k=1) # k=1 keeps only the entries above the diagonal; the diagonal and lower triangle become 0
# order it to find the largest ones
def sim_art(n):   # get the index of sorted data
    flat = simay_upp.flatten()
    indices = np.argpartition(flat, -n)[-n:]
    indices = indices[np.argsort(-flat[indices])]
    loc = np.unravel_index(indices, simay.shape)
    print [docids.keys()[docids.values().index(loc[0][n-1])], loc[0][n-1]]
    print [docids.keys()[docids.values().index(loc[1][n-1])], loc[1][n-1]]
In [571]:
def largest_indices(ary, n):
    """Returns the indices of the n largest values in a numpy array, largest first."""
    flat = ary.flatten()
    indices = np.argpartition(flat, -n)[-n:]
    indices = indices[np.argsort(-flat[indices])]
    return np.unravel_index(indices, ary.shape)
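To see what the argpartition / argsort / unravel_index chain returns, here is a small worked example on a made-up 3 x 3 upper-triangular matrix. The function is repeated so the snippet runs on its own; toy and pairs are illustrative names only.

```python
import numpy as np

def largest_indices(ary, n):
    flat = ary.flatten()
    # argpartition moves the n largest values to the last n slots (unordered)
    indices = np.argpartition(flat, -n)[-n:]
    # sort just those n positions by value, largest first
    indices = indices[np.argsort(-flat[indices])]
    # map flat positions back to (row, column) coordinates
    return np.unravel_index(indices, ary.shape)

toy = np.array([[0.0, 5.0, 1.0],
                [0.0, 0.0, 9.0],
                [0.0, 0.0, 0.0]])

rows, cols = largest_indices(toy, 2)
pairs = list(zip(rows.tolist(), cols.tolist()))  # [(1, 2), (0, 1)]
```

The largest entry, 9, sits at row 1, column 2, and the second largest, 5, at row 0, column 1, so the pairs come back in that order.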
In [596]:
# the most similar one
sim_art(1)
['UC Davis holds first mental health conference', 14]
['UC Davis to host first ever mental health conference', 35]
In [597]:
# compare the word clouds of these two articles
wordcloud = WordCloud().generate("".join(textall[14]))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
wordcloud = WordCloud().generate("".join(textall[35]))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

The word clouds above show that the most similar pair of articles is "UC Davis holds first mental health conference" and "UC Davis to host first ever mental health conference". The words they have in common include "mental", "health", "student" and "conference".

In [598]:
# the 2nd most similar pair
sim_art(2)
['UC Davis holds first mental health conference', 14]
['2017 ASUCD Winter Elections  Meet the Candidates', 16]
In [599]:
# compare the word clouds of these two articles
wordcloud = WordCloud().generate("".join(textall[14]))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
wordcloud = WordCloud().generate("".join(textall[16]))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

The word clouds above show that the 2nd most similar pair of articles is 'UC Davis holds first mental health conference' and '2017 ASUCD Winter Elections Meet the Candidates'. The words they have in common include "student" and "year".

In [600]:
# the 3rd most similar pair
sim_art(3)
['2017 ASUCD Winter Elections  Meet the Candidates', 16]
['Nov. 8 2016: An Election Day many may never forget', 115]
In [601]:
wordcloud = WordCloud().generate("".join(textall[16]))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
wordcloud = WordCloud().generate("".join(textall[115]))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

The word clouds above show that the 3rd most similar pair of articles is '2017 ASUCD Winter Elections Meet the Candidates' and 'Nov. 8 2016: An Election Day many may never forget'. They are similar because both articles are about elections and government.

For the last question, I assume that the corpus in question is the 16th article. Comparing the word cloud of the whole Aggie corpus with the word cloud of the 16th article shows that they share words such as "student", "community" and "davis". I therefore think it is representative of the Aggie.
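The choice of article 16 could also be checked numerically instead of being assumed. A minimal sketch, on a made-up similarity matrix (not the Aggie data), that ranks documents by their mean similarity to all the other documents; toy_sim, mean_sim and most_central are illustrative names, and the same steps could be applied to simay.

```python
import numpy as np

# Hypothetical symmetric similarity matrix for 4 documents.
toy_sim = np.array([[1.0, 0.2, 0.6, 0.1],
                    [0.2, 1.0, 0.5, 0.3],
                    [0.6, 0.5, 1.0, 0.4],
                    [0.1, 0.3, 0.4, 1.0]])

# Zero out the diagonal so a document's similarity with itself is ignored.
off_diag = toy_sim - np.diag(np.diag(toy_sim))

# Mean similarity of each document to the other n - 1 documents.
mean_sim = off_diag.sum(axis=1) / (toy_sim.shape[0] - 1)

# The document with the highest mean similarity is the most "central" one.
most_central = int(np.argmax(mean_sim))
print(most_central, mean_sim)
```

In this toy matrix document 2 has the highest mean similarity. On unnormalised scores this ranking tends to favour long articles, so it is best read together with the word-cloud comparison.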