In this assignment, you'll scrape text from The California Aggie and then analyze the text.
The Aggie is organized by category into article lists. For example, there's a Campus News list, Arts & Culture list, and Sports list. Notice that each list has multiple pages, with a maximum of 15 articles per page.
The goal of exercises 1.1 - 1.3 is to scrape articles from the Aggie for analysis in exercise 1.4.
Exercise 1.1. Write a function that extracts all of the links to articles in an Aggie article list. The function should:
Have a parameter url for the URL of the article list.
Have a parameter page for the number of pages to fetch links from. The default should be 1.
Return a list of article URLs (each URL should be a string).
Test your function on 2-3 different categories to make sure it works.
Hints:
Be polite to The Aggie and save time by setting up requests_cache before you write your function.
Start by getting your function to work for just 1 page. Once that works, have your function call itself to get additional pages.
You can use lxml.html or BeautifulSoup to scrape HTML. Choose one and use it throughout the entire assignment.
# If a package fails to install with pip, try conda install instead (that is how ggplot was installed here).
import requests
import requests_cache
import lxml
from bs4 import BeautifulSoup
from collections import Counter
from matplotlib import pyplot as plt
import pandas as pd
plt.style.use('ggplot')
# cache HTTP responses so repeated runs do not hit The Aggie's servers again
requests_cache.install_cache('cache')
%matplotlib inline
def url_lxml(url, page):
    """Fetch one page of an Aggie URL and parse it with BeautifulSoup."""
    response = requests.get(url + '/page/{}'.format(page))
    html = response.text  # .text gives the decoded response body
    # use the "lxml" HTML parser; the "lxml-xml" parser treats the page as XML and
    # silently drops most of it because the page is not well-formed XML
    return BeautifulSoup(html, "lxml")
def link_art(url, page=1):
    """
    Input: URL of an article list and a page number (default 1)
    Output: list of article URLs on that page
    """
    all_links = []
    article = url_lxml(url, page)
    # article titles on a list page are wrapped in <h2 class="entry-title"> tags
    art_content = article.find_all(name="h2", attrs={"class": "entry-title"})
    for art in art_content:
        try:
            # .a goes directly to the nested <a> tag; a few entries have no link,
            # so art.a is None and the subscript raises TypeError
            all_links.append(art.a["href"])
        except TypeError:
            pass
    return all_links
link_art("https://theaggie.org/campus",6)
link_art("https://theaggie.org/city",6)
Exercise 1.2. Write a function that extracts the title, text, and author of an Aggie article. The function should:
Have a parameter url for the URL of the article.
For the author, extract the "Written By" line that appears at the end of most articles. You don't have to extract the author's name from this line.
Return a dictionary with keys "url", "title", "text", and "author". The values for these should be the article url, title, text, and author, respectively.
For example, for this article your function should return something similar to this:
{
'author': u'Written By: Bianca Antunez \xa0\u2014\xa0city@theaggie.org',
'text': u'Davis residents create financial model to make city\'s financial state more transparent To increase transparency between the city\'s financial situation and the community, three residents created a model called Project Toto which aims to improve how the city communicates its finances in an easily accessible design. Jeff Miller and Matt Williams, who are members of Davis\' Finance and Budget Commission, joined together with Davis entrepreneur Bob Fung to create the model plan to bring the project to the Finance and Budget Commission in February, according to Kelly Stachowicz, assistant city manager. "City staff appreciate the efforts that have gone into this, and the interest in trying to look at the city\'s potential financial position over the long term," Stachowicz said in an email interview. "We all have a shared goal to plan for a sound fiscal future with few surprises. We believe the Project Toto effort will mesh well with our other efforts as we build the budget for the next fiscal year and beyond." Project Toto complements the city\'s effort to amplify the transparency of city decisions to community members. The aim is to increase the understanding about the city\'s financial situation and make the information more accessible and easier to understand. The project is mostly a tool for public education, but can also make predictions about potential decisions regarding the city\'s financial future. Once completed, the program will allow residents to manipulate variables to see their eventual consequences, such as tax increases or extensions and proposed developments "This really isn\'t a budget, it is a forecast to see the intervention of these decisions," Williams said in an interview with The Davis Enterprise. "What happens if we extend the sales tax? What does it do given the other numbers that are in?" Project Toto enables users, whether it be a curious Davis resident, a concerned community member or a city leader, with the ability to project city finances with differing variables. The online program consists of the 400-page city budget for the 2016-2017 fiscal year, the previous budget, staff reports and consultant analyses. All of the documents are cited and accessible to the public within Project Toto. "It\'s a model that very easily lends itself to visual representation," Mayor Robb Davis said. "You can see the impacts of decisions the council makes on the fiscal health of the city." Complementary to this program, there is also a more advanced version of the model with more in-depth analyses of the city\'s finances. However, for an easy-to-understand, simplistic overview, Project Toto should be enough to help residents comprehend Davis finances. There is still more to do on the project, but its creators are hard at work trying to finalize it before the 2017-2018 fiscal year budget. "It\'s something I have been very much supportive of," Davis said. "Transparency is not just something that I have been supportive of but something we have stated as a city council objective [ ] this fits very well with our attempt to inform the public of our challenges with our fiscal situation." ',
'title': 'Project Toto aims to address questions regarding city finances',
'url': 'https://theaggie.org/2017/02/14/project-toto-aims-to-address-questions-regarding-city-finances/'
}
Hints:
The author line is always the last line of the last paragraph.
Python 2 displays some Unicode characters as \uXXXX. For instance, \u201c is a left-facing quotation mark.
You can convert most of these to ASCII characters with the method call (on a string) .translate({ 0x2018:0x27, 0x2019:0x27, 0x201C:0x22, 0x201D:0x22, 0x2026:0x20 })
If you're curious about these characters, you can look them up on this page, or read more about what Unicode is.
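A small sketch of the .translate() hint, using a made-up example string: for a Python 2 unicode object, the dict maps Unicode code points for curly quotes and the ellipsis to their ASCII replacements.

# map curly quotes to straight quotes and the ellipsis to a space
curly = u'\u201cDavis\u201d \u2018budget\u2019\u2026'
print curly.translate({0x2018: 0x27, 0x2019: 0x27, 0x201C: 0x22, 0x201D: 0x22, 0x2026: 0x20})
# "Davis" 'budget'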
# The .translate() mapping from the hint converts curly quotes to ASCII characters;
# the code below instead strips non-ASCII characters with .encode('ascii', 'ignore').
# Note that in Python 2, BeautifulSoup returns text as unicode objects.
import re
def extra_content(url):
    """
    Input: URL of an Aggie article
    Output: dictionary with keys "author", "text", "title" and "url"
    """
    # parse the article page (the ::before entries shown in the browser inspector are
    # CSS pseudo-elements, not part of the HTML that gets scraped)
    alllxml = url_lxml(url, 1)
    adict = {"author": None, "text": None, "title": None, "url": url}
    # title: the headline is in <h1 class="entry-title" itemprop="headline">
    titlelxml = alllxml.find_all(name="h1", attrs={"class": "entry-title", "itemprop": "headline"})
    try:
        title = titlelxml[0].text.strip().encode('ascii', 'ignore')
    except IndexError:
        title = None
    adict["title"] = title
    # text: collect every paragraph, stripping non-ASCII characters
    textlxml = alllxml.find_all("p")
    texts1 = [x.text.strip().encode('ascii', 'ignore') for x in textlxml]
    # the author line is usually the second-to-last paragraph, so the article body is
    # everything up to the last two paragraphs
    adict["text"] = "".join(texts1[:-2])
    # author: the line looks like "Written By: Name -- section@theaggie.org";
    # the group captures the name between the colon and the email address
    try:
        author = re.search(r".*:\s*([a-zA-Z -]+)\s.*@", texts1[-2]).group(1)
    except AttributeError:
        try:
            author = re.search(r".*:\s*([a-zA-Z -]+)\s.*@", texts1[-1]).group(1)
        except AttributeError:
            author = None
    adict["author"] = author
    return adict
extra_content('https://theaggie.org/2017/02/14/project-toto-aims-to-address-questions-regarding-city-finances/')
Exercise 1.3. Use your functions from exercises 1.1 and 1.2 to get a data frame of 60 Campus News articles and a data frame of 60 City News articles. Add a column to each that indicates the category, then combine them into one big data frame.
The "text" column of this data frame will be your corpus for natural language processing in exercise 1.4.
# each page has 15 articles, so 4 pages give the 60 articles needed per category
def create_df(news_link):
    """Scrape pages 1-4 of an article list and return a data frame of its articles."""
    all_link = [link_art(news_link, page) for page in range(1, 5)]
    news = []
    for links in all_link:
        news_art = [extra_content(link) for link in links]
        news.append(news_art)
    return pd.concat([pd.DataFrame(new) for new in news])
camp_df = create_df('https://theaggie.org/campus')
city_df = create_df('https://theaggie.org/city')
print city_df.shape
print camp_df.shape
news_df = pd.concat([camp_df,city_df])
# "+" concatenates the two lists instead of nesting them
category = ["campus"]*60 + ["city"]*60
news_df["category"] = category
# reset the index value
news_df = news_df.set_index([range(120)])
news_df
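As a quick, optional sanity check, the category counts and overall shape can be verified before moving on:

# should show 60 campus articles and 60 city articles, 120 rows in total
print news_df["category"].value_counts()
print news_df.shape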
Exercise 1.4. Use the Aggie corpus to answer the following questions. Use plots to support your analysis.
What topics does the Aggie cover the most? Do city articles typically cover different topics than campus articles?
What are the titles of the top 3 pairs of most similar articles? Examine each pair of articles. What words do they have in common?
Do you think this corpus is representative of the Aggie? Why or why not? What kinds of inference can this corpus support? Explain your reasoning.
Hints:
The nltk book and scikit-learn documentation may be helpful here.
You can determine whether city articles are "near" campus articles from the similarity matrix or with k-nearest neighbors (a minimal sketch follows these hints).
If you want, you can use the wordcloud package to plot a word cloud. To install the package, run
conda install -c https://conda.anaconda.org/amueller wordcloud
in a terminal. Word clouds look nice and are easy to read, but are less precise than bar plots.
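A hedged sketch of the k-nearest-neighbors hint above: for each article, look at the categories of its five nearest neighbours in tf-idf space. If campus and city articles covered very different topics, most neighbours would share the query article's category. This assumes news_df from exercise 1.3 is available; the names vec, X, nn and cats are introduced only for this sketch, and it is just one possible way to approach the question.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(news_df["text"])
# brute-force search is required for the cosine metric on sparse input
nn = NearestNeighbors(n_neighbors=6, metric="cosine", algorithm="brute").fit(X)
_, idx = nn.kneighbors(X)
# fraction of the 5 nearest neighbours (excluding the article itself) that share its category
cats = news_df["category"].values
print (cats[idx[:, 1:]] == cats[:, None]).mean()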
import numpy as np
import nltk
from nltk import corpus
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from matplotlib import pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
tokenize = nltk.word_tokenize
def stem(tokens,stemmer = PorterStemmer().stem):
return [stemmer(w.lower()) for w in tokens]
def lemmatize(text):
"""
Extract simple lemmas based on tokenization and stemming
Input: string
Output: list of strings (lemmata)
"""
return stem(tokenize(text))
textd = {}    # dictionary from lemma to the titles of the documents containing it
textall = []  # raw text of every article, in data frame order
toks = set()  # vocabulary: every lemma seen so far
for art in range(120):
    t = news_df.get_value(art, col="text")
    textall.append(t)
    d = news_df.get_value(art, col="title")
    s = set(lemmatize(t))
    # accumulate the vocabulary; "|" is set union (lists are joined with "+", sets with "|")
    toks = toks | s
    for tok in s:
        try:  # lemma : list of titles containing it
            textd[tok].append(d)
        except KeyError:
            textd[tok] = [d]
docids = {} # dictionary from document title to an integer id for the document
N = 120
for i in range(120): # title : i
docids[news_df.get_value(i,col="title")] = i
tokids = {} #dictionary of lemma to integer id for the lemma
tok_list = list(toks) # a list of lemmata
m = len(tok_list) # the length of lemmata
for j in xrange(m): # lemma : j
tokids[tok_list[j]] = j
# dictionary: lemma: number of documents this lemma occurs
numd = {key:len(set(val)) for key,val in textd.items()}
logN = np.log(120)
# lemma : its smoothed idf
idf_smooth = {key:logN - np.log(1 + val) for key, val in numd.items() if val > 1}
idf_smooth
plt.hist(idf_smooth.values(),bins=20)
from wordcloud import WordCloud
# word cloud over all articles (join with spaces so adjacent articles do not run together)
wordcloud = WordCloud().generate(" ".join(textall))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
# for the campus news (rows 0-59 of the data frame)
wordcloud = WordCloud().generate(" ".join(textall[:60]))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
# for the city news (rows 60-119 of the data frame)
wordcloud = WordCloud().generate(" ".join(textall[60:]))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
From the previous three plots, the key words across all articles are "Davis", "UC", "student", "community", "people" and "city".
For the campus news articles, the key words are "UC", "student", "Davis" and "campus".
For the city news articles, the key words are "Davis", "city", "community", "people", "food", "student" and "Sacramento".
Thus there is no large difference between the main topics of campus news and city news.
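The hints note that word clouds are less precise than bar plots, so here is a hedged sketch to back up the reading above: count the most common lemmata in each category and compare them side by side. It assumes the NLTK stopwords corpus is downloaded, and top_lemmata is a helper introduced only for this sketch.

from collections import Counter
from nltk.corpus import stopwords

stop = set(stopwords.words("english"))
stemmer = PorterStemmer().stem

def top_lemmata(texts, n=10):
    # tokenize, drop stop words and non-alphabetic tokens, then stem and count
    words = [w.lower() for t in texts for w in tokenize(t)]
    counts = Counter(stemmer(w) for w in words if w.isalpha() and w not in stop)
    return counts.most_common(n)

for i, (label, top) in enumerate([("campus", top_lemmata(textall[:60])),
                                  ("city", top_lemmata(textall[60:]))]):
    plt.subplot(1, 2, i + 1)
    plt.barh(range(len(top)), [c for _, c in top])
    plt.yticks(range(len(top)), [w for w, _ in top])
    plt.title("Top lemmata: " + label)
plt.tight_layout()
plt.show()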
# the previous cells only compute the inverse document frequency;
# now weight it by the term frequency tf(t,d) to get tf-idf scores
vectorizer = TfidfVectorizer(tokenizer=lemmatize,stop_words="english",smooth_idf=True,norm=None)
# textall is a list of raw files
tfs = vectorizer.fit_transform(textall)
sim = tfs.dot(tfs.T)
sim.mean()
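Because the vectorizer above uses norm=None, tfs.dot(tfs.T) is an unnormalized dot-product similarity, so longer articles tend to get larger scores. A cosine-normalized variant (an optional alternative for comparison, not the matrix used for the results below) would be:

from sklearn.metrics.pairwise import cosine_similarity

# cosine similarity rescales each tf-idf vector to unit length first,
# so article length no longer inflates the scores
cos_sim = cosine_similarity(tfs)
cos_sim.mean()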
# Find the most similar pairs of articles.
# Convert the sparse matrix to a numpy array with .toarray(); then, for example,
# np.where(simay == simay.max()) gives the indices of the largest value.
simay = sim.toarray()
simay.shape
# heatmap of the pairwise similarity matrix
plt.pcolor(simay)
plt.show()
From the heatmap above, document 16 appears more similar to the other documents than most.
simay_upp = np.triu(simay, k=1)  # k=1 zeroes the diagonal and lower triangle, so each pair is counted once
# sort the remaining entries to find the largest similarities
def sim_art(n):
    """Print the titles and row indices of the n-th most similar pair of articles."""
    flat = simay_upp.flatten()
    # np.argpartition puts the n largest values in the last n positions (unsorted),
    # then np.argsort orders those n flat indices by decreasing similarity
    indices = np.argpartition(flat, -n)[-n:]
    indices = indices[np.argsort(-flat[indices])]
    loc = np.unravel_index(indices, simay.shape)
    # look the integer ids back up in docids to recover the titles
    print [docids.keys()[docids.values().index(loc[0][n-1])], loc[0][n-1]]
    print [docids.keys()[docids.values().index(loc[1][n-1])], loc[1][n-1]]
def largest_indices(ary, n):
    """Return the indices of the n largest values in a numpy array (via np.argpartition)."""
    flat = ary.flatten()
    indices = np.argpartition(flat, -n)[-n:]
    indices = indices[np.argsort(-flat[indices])]
    return np.unravel_index(indices, ary.shape)
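As a compact alternative to calling sim_art once per rank, a small sketch building on largest_indices and news_df lists the three most similar pairs and their titles in one pass:

# titles of the three most similar pairs of articles
rows, cols = largest_indices(simay_upp, 3)
titles = news_df["title"].values
for i, j in zip(rows, cols):
    print [titles[i], titles[j]]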
# the most similar one
sim_art(1)
# word clouds for these two articles, to see which words they share
wordcloud = WordCloud().generate(textall[14])
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
wordcloud = WordCloud().generate(textall[35])
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
From the plots above, the most similar pair of articles is "UC Davis holds first mental health conference" and "UC Davis to host first ever mental health conference". The words they have in common are "mental", "health", "student" and "conference".
# the 2nd most similar pair
sim_art(2)
# word clouds for these two articles, to see which words they share
wordcloud = WordCloud().generate(textall[14])
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
wordcloud = WordCloud().generate(textall[16])
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
From the plots above, the second most similar pair is 'UC Davis holds first mental health conference' and '2017 ASUCD Winter Elections Meet the Candidates'. The words they have in common are "student" and "year".
# the 3rd most similar pair
sim_art(3)
wordcloud = WordCloud().generate(textall[16])
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
wordcloud = WordCloud().generate(textall[115])
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
From the plots above, the third most similar pair is '2017 ASUCD Winter Elections Meet the Candidates' and 'Nov. 8 2016: An Election Day many may never forget'. They are similar because both articles are about elections and government.
For the last question, take the 16th article as an example. Comparing the word cloud of the whole Aggie corpus with the word cloud of the 16th article, they share common words such as "student", "community" and "Davis". Thus I think the corpus is representative of the Aggie.