Finn Årup Nielsen
DTU Compute
Technical University of Denmark
September 22, 2014
Overview
Get the stuff: Crawling, search
Converting: HTML processing/stripping, format conversion
Tokenization, identifying and splitting words and sentences.
Word normalization, finding the stem of the word, e.g., talked → talk
Text classification (supervised), e.g., spam detection.
Handling errors
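As an illustration (not the slide's original code), a minimal sketch of handling download errors with a retry loop; the URL, number of retries and delay are only examples:

import time
import urllib2

def fetch(url, retries=3, delay=5):
    """Download a URL, retrying a few times on temporary errors."""
    for attempt in range(retries):
        try:
            return urllib2.urlopen(url).read()
        except (urllib2.URLError, IOError):
            if attempt == retries - 1:
                raise                  # give up after the last attempt
            time.sleep(delay)          # wait a bit before retrying

html = fetch("http://www.dtu.dk")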
Serial
Parallel
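As a sketch (assuming the multiprocessing module; the URLs are only examples), serial versus parallel download of a list of pages can be compared like this:

from multiprocessing import Pool
import urllib2

urls = ["http://www.dtu.dk", "http://www.wikipedia.org", "http://www.python.org"]

def download(url):
    return urllib2.urlopen(url).read()

# Serial: one URL at a time
pages_serial = [download(url) for url in urls]

# Parallel: a pool of worker processes downloads several URLs at once
pool = Pool(processes=4)
pages_parallel = pool.map(download, urls)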
Combinations
It becomes more complicated:
When you download in parallel you need to make sure that you are not
downloading from the same server in parallel.
When you need to keep track of downloading errors (should they be
postponed or dropped?). A sketch combining both concerns follows below.
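A possible sketch (not the slide's code) that downloads different servers in parallel while keeping requests to the same server serial, and records errors for later inspection; the URLs and pool size are only examples:

from collections import defaultdict
from multiprocessing import Pool
from urlparse import urlparse
import urllib2

def download_host(urls):
    """Download URLs from a single host one at a time, recording errors."""
    pages, errors = {}, {}
    for url in urls:
        try:
            pages[url] = urllib2.urlopen(url).read()
        except urllib2.URLError as error:
            errors[url] = str(error)      # decide later: postpone or drop
    return pages, errors

urls = ["http://www.dtu.dk/English.aspx", "http://www.dtu.dk/Service/Kontakt.aspx",
        "http://www.wikipedia.org"]

# Group the URLs by server so each host is only queried serially
by_host = defaultdict(list)
for url in urls:
    by_host[urlparse(url).netloc].append(url)

pool = Pool(processes=4)
results = pool.map(download_host, by_host.values())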
f.entries[i].title
f.entries[i].subtitle
f.entries[i].link
f.entries[i].updated
f.entries[i].updated_parsed
f.entries[i].summary
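These attributes correspond to the feedparser module's feed entries; a minimal usage sketch (the feed URL is only an example, and which fields are present depends on the feed):

import feedparser

f = feedparser.parse("http://planet.python.org/rss20.xml")
for entry in f.entries[:5]:
    # Print a few of the available fields for each entry
    print(entry.title)
    print(entry.link)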
Reading JSON . . .
JSON (JavaScript Object Notation), http://json.org, is a lightweight
data interchange format particularly used on the Web.
Python implements JSON encoding and decoding with, among others, the
json and simplejson modules.
simplejson and the newer json use, e.g., loads() and dumps(), whereas the older
json module used read() and write(). See http://docs.python.org/library/json.html
JSON data structures are mapped to corresponding Python structures.
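As a small illustration (made-up data), a round trip with the json module:

import json

data = {"city": "Aarhus", "country": "Denmark"}
s = json.dumps(data)          # encode a Python dict as a JSON string
restored = json.loads(s)      # decode the JSON string back to a dict
print(restored["city"])       # Aarhus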
. . . Reading JSON
MediaWikis may export some of their data in JSON format, and here is an
example with Wikipedia, querying for pages that embed a template:
import urllib, simplejson
url = "http://en.wikipedia.org/w/api.php?" + \
"action=query&list=embeddedin&" + \
"eititle=Template:Infobox_Single_nucleotide_polymorphism&" + \
"format=json"
data = simplejson.load(urllib.urlopen(url))
data['query']['embeddedin'][0]
gives {u'ns': 0, u'pageid': 238300, u'title': u'Factor V Leiden'}
Here the Wikipedia article Factor V Leiden contains (has embedded) the
template Infobox Single nucleotide polymorphism
(Note that MediaWiki may need to be called several times to retrieve
all results for the query, using data['query-continue'].)
Reading HTML . . .
HTML contains tags and content. There are several ways to strip the tags
and get at the content.
1. Simple regular expression, e.g., re.sub('<.*?>', '', s)
2. htmllib module with the formatter module.
3. Use nltk.clean_html() (Bird et al., 2009, p. 82). This function uses
HTMLParser.
4. BeautifulSoup module is a robust HTML parser (Segaran, 2007, p. 45+),
sketched below.
5. lxml.etree.HTML
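As a small illustration of option 4 (a sketch with made-up HTML):

from BeautifulSoup import BeautifulSoup

html = "<html><body><h1>Results</h1><p>Denmark and Botswana</p></body></html>"
soup = BeautifulSoup(html)
# Collect the text nodes and ignore the tags
print(' '.join(soup.findAll(text=True)))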
. . . Reading HTML
The htmllib module can parse HTML documents (Martelli, 2006, p. 580+):
import htmllib, formatter, urllib
p = htmllib.HTMLParser(formatter.NullFormatter())
p.feed(urllib.urlopen('http://www.dtu.dk').read())
p.close()
for url in p.anchorlist: print url
The result is a printout of the list of URLs from http://www.dtu.dk:
/English.aspx
/Service/Indeks.aspx
/Service/Kontakt.aspx
/Service/Telefonbog.aspx
http://www.alumne.dtu.dk
http://portalen.dtu.dk
...
Reading XML
xml.dom: Document Object Model. With xml.dom.minidom
xml.sax: Simple API for XML (and an obsolete xmllib)
xml.etree: ElementTree XML library
Example with the minidom module, searching on a tag name:
>>> s = """<persons> <person> <name>Ole</name> </person>
<person> <name>Jan</name> </person> </persons>"""
>>> import xml.dom.minidom
>>> dom = xml.dom.minidom.parseString(s)
>>> for element in dom.getElementsByTagName("name"):
...     print(element.firstChild.nodeValue)
...
Ole
Jan
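For comparison, roughly the same extraction can be done with xml.etree.ElementTree; a sketch (not from the slide) reusing the string s from above:

>>> import xml.etree.ElementTree as ET
>>> root = ET.fromstring(s)
>>> for element in root.iter("name"):
...     print(element.text)
...
Ole
Jan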
Generating HTML . . .
The simple way:
>>> results = [('Denmark', 5000000), ('Botswana', 1700000)]
>>> res = '<tr>'.join([ '<td>%s<td>%d' % (r[0], r[1]) for r in results ])
>>> s = """<html><head><title>Results</title></head>
<body><table>%s</table></body></html>""" % res
>>> s
'<html><head><title>Results</title></head>\n<body><table><td>Denmark
<td>5000000<tr><td>Botswana<td>1700000</table></body></html>'
If the input is not known, it may contain parts that need escaping:
>>> results = [('Denmark (<Sweden)', 5000000), (r'<script
type="text/javascript"> window.open("http://www.dtu.dk/", "Buy
Viagra")</script>', 1700000)]
>>> open('test.html', 'w').write(s)
Input should be sanitized and output should be escaped.
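For instance, escaping can be done with cgi.escape (or xml.sax.saxutils.escape); a small sketch, not from the slide:

>>> from cgi import escape
>>> escape('<script>window.open("http://www.dtu.dk/", "Buy Viagra")</script>', quote=True)
'&lt;script&gt;window.open(&quot;http://www.dtu.dk/&quot;, &quot;Buy Viagra&quot;)&lt;/script&gt;'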
Word normalization . . .
Converting talking, talk, talked, Talk, etc. to the lexeme talk
(Bird et al., 2009, page 107)
>>> porter = nltk.PorterStemmer()
>>> [porter.stem(t.lower()) for t in tokens]
['to', 'suppos', 'that', 'the', 'eye', 'with', 'all', 'it', 'inimit',
'contriv', 'for', 'adjust', 'the', 'focu', 'to', 'differ', 'distanc',
',', 'for', 'admit', 'differ', 'amount', 'of', 'light', ',', 'and',
Another stemmer is lancaster.stem()
The Snowball stemmer works for non-English languages, e.g., Danish:
>>> from nltk.stem.snowball import SnowballStemmer
>>> stemmer = SnowballStemmer("danish")
>>> stemmer.stem('universiteterne')
'universitet'
. . . Word normalization
Normalize with a word list (WordNet):
>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(token) for token in tokens]
['To', 'suppose', 'that', 'the', 'eye', 'with', 'all', 'it',
'inimitable', 'contrivance', 'for', 'adjusting', 'the', 'focus', 'to',
'different', 'distance', ',', 'for', 'admitting', 'different',
'amount', 'of', 'light', ',', 'and', 'for', 'the', 'correction',
Here the words 'contrivances' and 'distances' have lost the plural s, and
'its' the genitive s.
Word categories
Part-of-speech tagging with NLTK
>>> words = nltk.word_tokenize(s)
>>> nltk.pos_tag(words)
[('To', 'TO'), ('suppose', 'VB'), ('that', 'IN'), ('the', 'DT'),
('eye', 'NN'), ('with', 'IN'), ('all', 'DT'), ('its', 'PRP$'),
('inimitable', 'JJ'), ('contrivances', 'NNS'), ('for', 'IN'),
NN noun, VB verb, JJ adjective, RB adverb, etc.; see the list of common tags.
>>> tagged = nltk.pos_tag(words)
>>> [word for (word, tag) in tagged if tag == 'JJ']
['inimitable', 'different', 'different', 'light', 'spherical',
'chromatic', 'natural', 'confess', 'absurd']
'confess' is wrongly tagged.
Some examples
Keyword extraction . . .
Consider the text (the course description used as the variable text below).
We want to extract 'computer programming', 'statistics', 'linear algebra',
'advanced programming' (or perhaps just 'programming'!?), 'data analysis',
'machine learning'.
But we do not want 'and linear' or 'courses such', i.e., not just bigrams.
Note the lack of verbs and the missing period.
. . . Keyword extraction . . .
Let's see what NLTK's part-of-speech tagger can do:
>>> text = ("Computer programming (e.g., 02101 or 02102), statistics "
"(such as 02323, 02402 or 02403) and linear algebra (such as 01007) "
"More advanced programming and data analysis, e.g., Machine Learning "
"(02450 or 02457), or courses such as 02105 or 01917")
>>> tagged = nltk.pos_tag(nltk.word_tokenize(text))
>>> tagged
[('Computer', 'NN'), ('programming', 'NN'), ('(', ':'), ('e.g.',
'NNP'), (',', ','), ('02101', 'CD'), ('or', 'CC'), ('02102', 'CD'),
(')', 'CD'), (',', ','), ('statistics', 'NNS'), ('(', 'VBP'), ('such',
'JJ'), ('as', 'IN'), ('02323', 'CD'), (',', ','), ('02402', 'CD'),
('or', 'CC'), ('02403', 'CD'), (')', 'CD'), ('and', 'CC'), ('linear',
'JJ'), ('algebra', 'NN'), ('(', ':'), ('such', 'JJ'), ('as', 'IN'),
('01007', 'CD'), (')', 'CD'), ('More', 'NNP'), ('advanced', 'VBD'),
('programming', 'VBG'), ('and', 'CC'), ('data', 'NNS'), ('analysis',
'NN'), (',', ','), ('e.g.', 'NNP'), (',', ','), ('Machine', 'NNP'),
('Learning', 'NNP'), ('(', 'NNP'), ('02450', 'CD'), ('or', 'CC'),
('02457', 'CD'), (')', 'CD'), (',', ','), ('or', 'CC'), ('courses',
'NNS'), ('such', 'JJ'), ('as', 'IN'), ('02105', 'CD'), ('or', 'CC'),
('01917', 'CD')]
. . . Keyword extraction . . .
Idea: assemble consecutive nouns. Here is a first attempt:
phrases, phrase = [], ""
for (word, tag) in tagged:
    if tag[:2] == "NN":
        if phrase == "": phrase = word
        else: phrase += " " + word
    elif phrase != "":
        phrases.append(phrase.lower())
        phrase = ""
Result:
>>> phrases
['computer programming', 'e.g.', 'statistics', 'algebra', 'programming',
'data analysis', 'e.g.', 'machine learning (', 'courses']
Well . . . Not quite right. More control structures, stopword lists, . . . ?
. . . Keyword extraction . . .
Chunking: make a small grammar with a regular expression that, e.g., catches
a sentence part; here we call it a noun phrase (NP):
>>> grammar = "NP: { <JJ>*<NN.?>+ }"
>>> cp = nltk.RegexpParser(grammar)
>>> cp.parse(tagged)
Tree('S', [Tree('NP', [('Computer', 'NN'), ('programming', 'NN')]),
('(', ':'), Tree('NP', [('e.g.', 'NNP')]), (',', ','), ('02101',
'CD'), ('or', 'CC'), ('02102', 'CD'), (')', 'CD'), (',', ','),
Tree('NP', [('statistics', 'NNS')]), ('(', 'VBP'), ('such', 'JJ'),
('as', 'IN'), ('02323', 'CD'), (',', ','), ('02402', 'CD'), ('or',
'CC'), ('02403', 'CD'), (')', 'CD'), ('and', 'CC'), Tree('NP',
...
. . . Keyword extraction . . .
Extract the NP parts:
def extract_chunks(tree, filter='NP'):
    extract_word = lambda leaf: leaf[0].lower()
    chunks = []
    if hasattr(tree, 'node'):
        if tree.node == filter:
            chunks = [" ".join(map(extract_word, tree.leaves()))]
        else:
            for child in tree:
                cs = extract_chunks(child, filter=filter)
                if cs != []:
                    chunks.append(cs[0])
    return chunks
>>> extract_chunks(cp.parse(tagged))
['computer programming', 'e.g.', 'statistics', 'linear algebra',
'more', 'data analysis', 'e.g.', 'machine learning (', 'courses']
Still not quite right.
# Imports as in Scrapy 0.x (assumed; not shown on the slide)
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class DtuSpider(CrawlSpider):
    name = "dtu"
    allowed_domains = ["dtu.dk"]
    start_urls = ["http://www.dtu.dk"]

    rules = (Rule(SgmlLinkExtractor(), callback="parse_items", follow=True),)

    def parse_items(self, response):
        print(response.url)
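With the spider placed in a Scrapy project it can be started with the command scrapy crawl dtu; the follow=True rule makes it continue through the links it finds within the dtu.dk domain and print each visited URL.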
MediaWiki
For MediaWikis (e.g., Wikipedia) look at Pywikipediabot.
Download it and set up user-config.py.
Here I have set up a configuration for wikilit.referata.com:
>>> import pywikibot
>>> site = pywikibot.Site('en', 'wikilit')
>>> pagename = "Chitu Okoli"
>>> wikipage = pywikibot.Page(site, pagename)
>>> text = wikipage.get(get_redirect=True)
u'{{Researcher\n|name=Chitu Okoli\n|surname=Okoli\n|affiliat ...
There is also wikipage.put() for writing to the wiki.
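A minimal sketch of writing a modified page back; the appended wikitext is only an example, and the optional arguments of put() vary between Pywikibot versions:

>>> newtext = text + "\n<!-- example edit -->"   # hypothetical change
>>> wikipage.put(newtext)                        # write the new wikitext to the wiki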
from pylab import *
from urllib2 import urlopen
from simplejson import load
from re import findall

url = "http://neuro.compute.dtu.dk/w/api.php?" + \
    "action=query&list=blocks&" + \
    "bkprop=id|user|by|timestamp|expiry|reason|range|flags&" + \
    "bklimit=500&format=json"
data = load(urlopen(url))
users = [block['user'] for block in data['query']['blocks'] if 'user' in block]
ip_users = filter(lambda s: findall(r'^\d+', s), users)
ip = map(lambda s: int(findall(r'\d+', s)[0]), ip_users)
dummy = hist(ip, arange(256), orientation='horizontal')
xlabel('Number of blocks'); ylabel('First byte of IP')
show()
Email mining . . .
Read in a small email data set with three classes, 'conference', 'job'
and 'spam' (Szymkowiak et al., 2001; Larsen et al., 2002b; Larsen et al.,
2002a; Szymkowiak-Have et al., 2006):
documents = [dict(
    email=open("conference/%d.txt" % n).read().strip(),
    category="conference") for n in range(1, 372)]
documents.extend([dict(
    email=open("job/%d.txt" % n).read().strip(),
    category="job") for n in range(1, 275)])
documents.extend([dict(
    email=open("spam/%d.txt" % n).read().strip(),
    category="spam") for n in range(1, 799)])
Now the data is contained in documents[i]['email'] and the category in
documents[i]['category'].
. . . Email mining . . .
Parse the emails with the email module and keep the body text, strip
the HTML tags (if any) and split the text into words:
from email import message_from_string
from BeautifulSoup import BeautifulSoup as BS
from re import split

for n in range(len(documents)):
    html = message_from_string(documents[n]['email']).get_payload()
    while not isinstance(html, str):              # Multipart problem
        html = html[0].get_payload()
    text = ''.join(BS(html).findAll(text=True))   # Strip HTML
    documents[n]['html'] = html
    documents[n]['text'] = text
    documents[n]['words'] = split('\W+', text)    # Find words
. . . Email mining . . .
Document classification à la (Bird et al., 2009, p. 227+) with NLTK:
import nltk
all_words = nltk.FreqDist(w.lower() for d in documents for w in d['words'])
word_features = all_words.keys()[:2000]
word_features now contains the 2000 most common words across the
corpus. This variable is used to define a feature extractor:
def document_features(document):
    document_words = set(document['words'])
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
Each document now has an associated dictionary with True or False for
whether a specific word appears in the document.
. . . Email mining . . .
Scramble the data set to mix conference, job and spam email:
import random
random.shuffle(documents)
Build the feature sets for NLTK's functions:
featuresets = [(document_features(d), d['category']) for d in documents]
Split the 1443 emails into training and test set:
train_set, test_set = featuresets[721:], featuresets[:721]
Train a naive Bayes classifier (Bird et al., 2009, p. 247+):
classifier = nltk.NaiveBayesClassifier.train(train_set)
. . . Email mining
Evaluate the classifier performance on the test set and show the features
(i.e., words) important for the classification:
>>> classifier.classify(document_features(documents[34]))
'spam'
>>> documents[34]['text'][:60]
u'BENCHMARK PRINT SUPPLY\nLASER PRINTER CARTRIDGES JUST FOR'
>>> print nltk.classify.accuracy(classifier, test_set)
0.890429958391
>>> classifier.show_most_informative_features(4)
Most Informative Features
   contains(candidates) = True              job : spam   =   75.2 : 1.0
 contains(presentations) = True          confer : spam   =   73.6 : 1.0
     contains(networks) = True           confer : spam   =   70.4 : 1.0
      contains(science) = True              job : spam   =   69.0 : 1.0
More information
"Recursively Scraping Web Pages With Scrapy", a tutorial by Michael Herman.
"Text Classification for Sentiment Analysis: Naive Bayes Classifier" by
Jacob Perkins.
Summary
For web crawling there are the basic tools urllib/urllib2 and requests.
For extraction and parsing of content, Python has, e.g., regular expression
handling in the re module and BeautifulSoup.
There are specialized modules for JSON, feeds and XML.
Scrapy is a large framework for crawling and extraction.
The NLTK package contains numerous natural language processing methods:
sentence and word tokenization, part-of-speech tagging, chunking,
classification, ...
References
Bird, S., Klein, E., and Loper, E. (2009). Natural Language Processing with Python. O'Reilly, Sebastopol,
California. ISBN 9780596516499.
Larsen, J., Hansen, L. K., Have, A. S., Christiansen, T., and Kolenda, T. (2002a).
Webmining: learning from the world wide web. Computational Statistics & Data Analysis, 38(4):517-532.
DOI: 10.1016/S0167-9473(01)00076-7.
Larsen, J., Szymkowiak, A., and Hansen, L. K. (2002b). Probabilistic hierarchical clustering with labeled
and unlabeled data. International Journal of Knowledge-Based Intelligent Engineering Systems,
6(1):56-62. http://isp.imm.dtu.dk/publications/2001/larsen.kes.pdf.
Martelli, A. (2006). Python in a Nutshell. In a Nutshell. O'Reilly, Sebastopol, California, second edition.
Martelli, A., Ravenscroft, A. M., and Ascher, D., editors (2005). Python Cookbook. O'Reilly, Sebastopol,
California, 2nd edition.
Nielsen, F. Å. (2003). The Brede database: a small database for functional neuroimaging. NeuroImage,
19(2). http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/2879/pdf/imm2879.pdf. Presented
at the 9th International Conference on Functional Mapping of the Human Brain, June 19-22, 2003, New
York, NY.
Pilgrim, M. (2004). Dive into Python.
Segaran, T. (2007). Programming Collective Intelligence. O'Reilly, Sebastopol, California.
Szymkowiak, A., Larsen, J., and Hansen, L. K. (2001). Hierarchical clustering for datamining. In Baba,
N., Jain, L. C., and Howlett, R. J., editors, Proceedings of KES-2001 Fifth International Conference on
Knowledge-Based Intelligent Information Engineering Systems & Allied Technologies, pages 261-265.
http://isp.imm.dtu.dk/publications/2001/szymkowiak.kes2001.pdf.
Szymkowiak-Have, A., Girolami, M. A., and Larsen, J. (2006). Clustering via kernel decomposition. IEEE
Transactions on Neural Networks, 17(1):256-264. http://eprints.gla.ac.uk/3682/01/symoviak3682.pdf.
Index
Apache, 4
download, 9, 10, 12
gdata, 50, 51
HTML, 20
JSON, 5, 15, 16
lxml, 26
multiprocessing, 10
robotparser, 3
robots.txt, 2, 3
simplejson, 15, 57
YouTube, 50, 51