
Python programming — text and web mining

Finn Årup Nielsen

DTU Compute, Technical University of Denmark

September 22, 2014


Overview

Get the stuff: Crawling, search

Converting: HTML processing/stripping, format conversion

Tokenization, identifying and splitting words and sentences.

Word normalization, finding the stem of the word, e.g., “talked” → “talk”

Text classification (supervised), e.g., spam detection.


Web crawling issues

Honor robots.txt — the file on the Web server that describes what you are allowed to crawl and not.

Tell the Web server who you are.

Handling errors and warnings gracefully, e.g., the 404 (“Not found”).

Don’t overload the Web server you are downloading from, especially if you do it in parallel.

Consider parallel download for large-scale crawling.


Crawling restrictions in robots.txt

Example with rule:

Disallow: /wiki/Special:Search

meaning that http://neuro.compute.dtu.dk/wiki/Special:Search should not be crawled.
Python module robotparser for handling rules:

>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url("http://neuro.compute.dtu.dk/robots.txt")
>>> rp.read()   # Reads the robots.txt
>>> rp.can_fetch("*", "http://neuro.compute.dtu.dk/wiki/Special:Search")
False
>>> rp.can_fetch("*", "http://neuro.compute.dtu.dk/movies/")
True


Tell the Web server who you are

Use of urllib2 module to set the User-agent of the HTTP request:

import urllib2

opener = urllib2.build_opener()
opener.addheaders = [("User-agent", "fnielsenbot/0.1 (Finn A. Nielsen)")]
response = opener.open("http://neuro.compute.dtu.dk")

This will give the following entry (here split into two lines) in the Apache Web server log (/var/log/apache2/access.log):

130.225.70.226 - - [31/Aug/2011:15:55:28 +0200] "GET / HTTP/1.1" 200 6685
"-" "fnielsenbot/0.1 (Finn A. Nielsen)"

This allows a Web server administrator to block you if you put too much load on the Web server.

See also (Pilgrim, 2004, section 11.5) “Setting the User-Agent”.


The requests module

urllib and urllib2 are in the Python Standard Library.

Outside this PSL is requests, which some regard as more convenient (“for humans”), e.g., setting the user-agent and requesting a page is only one line:

import requests

response = requests.get("http://neuro.compute.dtu.dk",
                        headers={'User-Agent': "fnielsenbot/0.1"})

The response object also has a JSON conversion method:

>>> url = "http://da.wikipedia.org/w/api.php"   # the Danish Wikipedia API endpoint
>>> params = {"action": "query", "prop": "links", "pllimit": "500",
...           "format": "json"}
>>> params.update(titles="Python (programmeringssprog)")
>>> requests.get(url, params=params).json()["query"]["pages"].values()[0]["links"]
[{u'ns': 0, u'title': u'Aspektorienteret programmering'},
 {u'ns': 0, u'title': u'Eiffel (programmeringssprog)'},
 {u'ns': 0, u'title': u'Funktionel programmering'},
 ...]


Handling errors

>>> import urllib
>>> urllib.urlopen("http://neuro.compute.dtu.dk/Does_not_exist").read()[64:93]
'<title>404 Not Found</title>'

Oops! You may need to look at getcode() on the response:

>>> response = urllib.urlopen("http://neuro.compute.dtu.dk/Does_not_exist")
>>> response.getcode()
404

urllib2 throws an exception:

import urllib2

opener = urllib2.build_opener()
try:
    response = opener.open('http://neuro.compute.dtu.dk/Does_not_exist')
except urllib2.URLError as e:
    print(e.code)   # In this case: 404


Handling errors with requests

The requests library does not raise exceptions by default on ‘ordinary’ HTTP errors, but you can call the raise_for_status() method:

>>> import requests
>>> response = requests.get("http://neuro.compute.dtu.dk/Does_not_exist")
>>> response.status_code
404
>>> response.ok
False
>>> response.raise_for_status()
[...]
requests.exceptions.HTTPError: 404 Client Error: Not Found

Note that requests does raise errors in some cases, e.g., on a name service error with requests.get('http://asdf.dtu.dk').
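A minimal pattern for handling both kinds of failure in one place is to catch requests.exceptions.RequestException, the common base class (a sketch):

import requests

try:
    response = requests.get("http://neuro.compute.dtu.dk/Does_not_exist")
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    # Catches HTTPError from raise_for_status() as well as
    # connection and name service errors from get()
    print(e)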


Don’t overload Web servers

Don’t overload the Web server by making a request right after the response.

Put in a time.sleep(a_few_seconds) to be nice.

Some big websites have automatic load restrictions and need authentication, e.g., Twitter.
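A minimal sketch of such a polite serial fetch; the URL list is only a placeholder:

import time, urllib2

urls = ['http://neuro.compute.dtu.dk/', 'http://neuro.compute.dtu.dk/wiki/']  # placeholders
pages = []
for url in urls:
    pages.append(urllib2.urlopen(url).read())
    time.sleep(3)   # be nice: pause a few seconds between requests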


Serial large-scale download

Serial download from 4 different Web servers:

import time, urllib2

urls = ['http://dr.dk', 'http://nytimes.com', 'http://bbc.co.uk',
        'http://finnaarupnielsen.wordpress.com']

start = time.time()
result1 = [(time.time()-start, urllib2.urlopen(url).read(), time.time()-start)
           for url in urls]

Plot download times:

from pylab import *
hold(True)
for n, r in enumerate(result1):
    plot([n+1, n+1], r[::2], 'k-', linewidth=30, solid_capstyle='butt')
ylabel('Time [seconds]'); grid(True); axis((0, 5, 0, 4)); show()


Parallel large-scale download

The twisted event-driven network engine (http://twistedmatrix.com) could be used; for an example see the RSS feed aggregator in the Python Cookbook (Martelli et al., 2005, section 14.12).

Or use multiprocessing:

import multiprocessing, time, urllib2

def download((url, start)):
    return (time.time()-start, urllib2.urlopen(url).read(), time.time()-start)

pool = multiprocessing.Pool(processes=4)
start = time.time()
result2 = pool.map(download, zip(urls, [start]*4))


[Figure: download times for serial and parallel download.]

In this small case the parallel download is almost twice as fast.


Combinations

It becomes more complicated:

When you download in parallel and need to make sure that you are not downloading from the same server in parallel.

When you need to keep track of downloading errors (should they be postponed or dropped?)
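One way to handle the first point is to group the URLs by host and only parallelize across hosts; a rough sketch, where the URLs and the error policy (errors are simply dropped) are illustrative:

import multiprocessing, time, urllib2, urlparse
from collections import defaultdict

def download_host(urls):
    # Serial fetch within one host, with a pause between requests;
    # failed downloads are recorded as None (dropped, not postponed)
    pages = []
    for url in urls:
        try:
            pages.append(urllib2.urlopen(url).read())
        except urllib2.URLError:
            pages.append(None)
        time.sleep(1)
    return pages

urls = ['http://dr.dk/', 'http://dr.dk/nyheder', 'http://bbc.co.uk/']  # placeholders
by_host = defaultdict(list)
for url in urls:
    by_host[urlparse.urlparse(url).netloc].append(url)

pool = multiprocessing.Pool(processes=4)          # parallel across hosts
results = pool.map(download_host, by_host.values())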


Reading feeds with feedparser

Mark Pilgrim’s Python module feedparser for RSS and Atom XML files.

feedparser.parse() may read from a URL, file, stream or string. Example with Google blog search returning “atoms”:

import feedparser

url = "http://blogsearch.google.dk/blogsearch_feeds?" + \
      "q=visitdenmark&output=atom"
f = feedparser.parse(url); f.entries[0].title

gives

u'<b>VisitDenmark</b> fjerner fupvideo fra nettet - Politiken.dk'

Some feed fields may contain HTML markup. feedparser does HTML sanitizing and removes, e.g., the <script> tag.

For mass download see also Valentino Volonghi and Peter Cogolo’s module with twisted in (Martelli et al., 2005).


Reading feeds with feedparser

Some of the most useful fields in the feedparser dictionary (see also the feedparser reference):

f.bozo                        # Indicates if errors occurred during parsing
f.feed.title                  # Title of feed, e.g., blog title
f.feed.link                   # Link to the blog
f.feed.links[0].href          # URL to feed
f.entries[i].title            # Title of post (HTML)
f.entries[i].subtitle         # Subtitle of the post (HTML)
f.entries[i].link             # Link to post
f.entries[i].updated          # Date of post in string
f.entries[i].updated_parsed   # Parsed date in tuple
f.entries[i].summary          # Posting (HTML)

 

The summary field may be only partial.
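A small sketch tying the fields together, printing the date and title of each posting from the feed used earlier:

import feedparser

f = feedparser.parse("http://blogsearch.google.dk/blogsearch_feeds?"
                     "q=visitdenmark&output=atom")
if not f.bozo:                        # only proceed if parsing went well
    for entry in f.entries:
        print entry.updated, entry.title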


Reading JSON

JSON (JavaScript Object Notation), http://json.org, is a lightweight data interchange format particularly used on the Web.

Python implements JSON encoding and decoding with, among others, the json and simplejson modules.

simplejson and newer json use, e.g., loads() and dumps() whereas older json uses read() and write(). See http://docs.python.org/library/json.html

>>> s = simplejson.dumps({'Denmark': {'towns': ['Copenhagen', u'Århus'],
...                                   'population': 5000000}})
>>> print s
{"Denmark": {"towns": ["Copenhagen", "\u00c5rhus"], "population": 5000000}}
>>> data = simplejson.loads(s)
>>> print data['Denmark']['towns'][1]   # Note Unicode
Århus

JSON data structures are mapped to corresponding Python structures.


Reading JSON

MediaWikis may export some of their data in JSON format, and here is an example with Wikipedia querying for an embedded “template”:

import urllib, simplejson

url = "http://en.wikipedia.org/w/api.php?" + \
      "action=query&list=embeddedin&" + \
      "eititle=Template:Infobox_Single_nucleotide_polymorphism&" + \
      "format=json"
data = simplejson.load(urllib.urlopen(url))

data['query']['embeddedin'][0]

gives

{u'ns': 0, u'pageid': 238300, u'title': u'Factor V Leiden'}

Here the Wikipedia article Factor V Leiden contains (has embedded) the template Infobox Single nucleotide polymorphism.

(Note that MediaWiki may need to be called several times to retrieve all results for the query, by using data['query-continue'].)
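A minimal sketch of such a continuation loop, assuming the query-continue convention of the MediaWiki API of that time (the eicontinue parameter name follows that convention):

import urllib, simplejson

url = ("http://en.wikipedia.org/w/api.php?action=query&list=embeddedin&"
       "eititle=Template:Infobox_Single_nucleotide_polymorphism&format=json")
pages, continuation = [], ""
while True:
    data = simplejson.load(urllib.urlopen(url + continuation))
    pages.extend(data['query']['embeddedin'])
    if 'query-continue' not in data:
        break                                     # no more batches
    continuation = "&eicontinue=" + urllib.quote(
        data['query-continue']['embeddedin']['eicontinue'])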


Regular expressions with re

>>> import re
>>> s = 'The following is some text with a link to <a href="http://www.dtu.dk">DTU</a>'

Substitute “<...>” with an empty string with re.sub():

>>> re.sub('<.*?>', '', s)
'The following is some text with a link to DTU'

Escaping non-alphanumeric characters in a string:

>>> print re.escape(u'Escape non-alphanumerics ", \, #, Å and =')
Escape\ non\-alphanumerics\ \"\,\ \\\,\ \#\,\ \Å\ and\ \=

XML-like matching with the named group <(?P<name>...)> construct:

>>> s = '<name>Ole</name><name>Lars</name>'
>>> re.findall('<(?P<tag>\w+)>(.*?)</(?P=tag)>', s)
[('name', 'Ole'), ('name', 'Lars')]


Regular expressions with re

Non-greedy match of content of a <description> tag:

>>> s = """<description>This is a
... multiline string.</description>"""
>>> re.search('<description>(.+?)</description>', s, re.DOTALL).groups()
('This is a\nmultiline string.',)

Find Danish telephone numbers in a string with initial compile():

>>> s = '(+45) 45253921 4525 39 21 2800 45 45 25 39 21'
>>> r = re.compile(r'((?:(?:\(\+?\d{2,3}\))|\+?\d{2,3})?(?: ?\d){8})')
>>> r.search(s).group()
'(+45) 45253921'
>>> r.findall(s)
['(+45) 45253921', '4525 39 21', '45 45 25 39']


Regular expressions with re

Unicode letter match with [^\W\d_]+, meaning one or more characters that are not non-alphanumeric, not digits and not underscore (\xc5 is Unicode “Å”):

>>> re.findall('[^\W\d_]+', u'F Å Nielsen', re.UNICODE)
[u'F', u'\xc5', u'Nielsen']

Matching the word immediately after “the” regardless of case:

>>> s = 'The dog, the cat and the mouse in the USA'
>>> re.findall('the ([a-z]+)', s, re.IGNORECASE)
['dog', 'cat', 'mouse', 'USA']


Reading HTML

HTML contains tags and content. There are several ways to strip the tags and get at the content:

1. Simple regular expression, e.g., re.sub(’<.*?>’, ’’, s)

2. htmllib module with the formatter module.

3. Use nltk.clean_html() (Bird et al., 2009, p. 82). This function uses HTMLParser.

4. BeautifulSoup module is a robust HTML parser (Segaran, 2007, p. 45+).

5. lxml.etree.HTML (see the sketch below)
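For the last option, a minimal sketch of tag stripping and link extraction with lxml (assuming the lxml package is installed):

import urllib2
from lxml import etree

html = urllib2.urlopen('http://www.dtu.dk').read()
tree = etree.HTML(html)                    # parses even somewhat broken HTML
urls = tree.xpath('//a/@href')             # all href attributes of <a> tags
text = " ".join(tree.xpath('//text()'))    # crude content extraction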


Reading HTML

The htmllib module can parse HTML documents (Martelli, 2006, p. 580+):

import htmllib, formatter, urllib

p = htmllib.HTMLParser(formatter.NullFormatter())
p.feed(urllib.urlopen('http://www.dtu.dk').read())
p.close()
for url in p.anchorlist:
    print url

The result is a printout of the list of URLs from 'http://www.dtu.dk':

/English.aspx

/Service/Indeks.aspx

/Service/Kontakt.aspx

/Service/Telefonbog.aspx

http://www.alumne.dtu.dk

http://portalen.dtu.dk


Robust HTML reading

Consider an HTML file, test.html, with an error (a "&gt;" is missing in an end tag):

<html>
<body>
<h1>Here is an error</h1
<h2>Subsection</h2>
</body>
</html>

Earlier versions of nltk and HTMLParser would generate an error. NLTK can now handle it:

>>> import nltk
>>> nltk.clean_html(open('test.html').read())
'Here is an error Subsection'


Robust HTML reading

BeautifulSoup survives the missing “>” in the end tag:

>>> from BeautifulSoup import BeautifulSoup as BS
>>> html = open('test.html').read()
>>> BS(html).findAll(text=True)
[u'\n', u'\n', u'Here is an error', u'Subsection', u'\n', u'\n', u'\n']

Another example with extraction of links from http://dtu.dk:

>>> from urllib2 import urlopen
>>> html = urlopen('http://dtu.dk').read()
>>> ahrefs = BS(html).findAll(name='a', attrs={'href': True})
>>> urls = [dict(a.attrs)['href'] for a in ahrefs]
>>> urls[0:3]
[u'/English.aspx', u'/Service/Indeks.aspx', u'/Service/Kontakt.aspx']


Reading XML

xml.dom: Document Object Model, with xml.dom.minidom

xml.sax: Simple API for XML (and an obsolete xmllib)

xml.etree: ElementTree XML library

Example with minidom module with searching on a tag name:

>>> s = """<persons>
  <person>
    <name>Ole</name>
  </person>
  <person>
    <name>Jan</name>
  </person>
</persons>"""
>>> import xml.dom.minidom
>>> dom = xml.dom.minidom.parseString(s)
>>> for element in dom.getElementsByTagName("name"):
...     print(element.firstChild.nodeValue)
...
Ole
Jan


Reading XML: traversing the elements

>>> s = """<persons>
  <person id="1">
    <name>Ole</name>
    <topic>Bioinformatics</topic>
  </person>
  <person id="2">
    <name>Jan</name>
    <topic>Signals</topic>
  </person>
</persons>"""
>>> import xml.etree.ElementTree
>>> x = xml.etree.ElementTree.fromstring(s)
>>> [x.tag, x.text, x.getchildren()[0].tag, x.getchildren()[0].attrib,
     x.getchildren()[0].text, x.getchildren()[0].getchildren()[0].tag,
     x.getchildren()[0].getchildren()[0].text]
['persons', '\n  ', 'person', {'id': '1'}, '\n    ', 'name', 'Ole']
>>> import xml.dom.minidom
>>> y = xml.dom.minidom.parseString(s)
>>> [y.firstChild.nodeName, y.firstChild.firstChild.nodeValue,
     y.firstChild.firstChild.nextSibling.nodeName]
[u'persons', u'\n  ', u'person']


Other xml packages: lxml and BeautifulSoup

Outside the Python standard library (with the xml packages) is the lxml package.

lxml's documentation claims that lxml.etree is much faster than ElementTree in the standard xml package.

Also note that BeautifulSoup will read XML files.
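A minimal sketch of how one might test that claim on a small synthetic document (the sizes and timings here are illustrative and machine-dependent):

import timeit

setup = "s = '<persons>' + '<person><name>Ole</name></person>' * 1000 + '</persons>'"
t_std = timeit.timeit("xml.etree.ElementTree.fromstring(s)",
                      setup="import xml.etree.ElementTree; " + setup,
                      number=100)
t_lxml = timeit.timeit("lxml.etree.fromstring(s)",
                       setup="import lxml.etree; " + setup, number=100)
print t_std, t_lxml          # lxml is typically the smaller number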


Generating HTML

The simple way:

>>> results = [('Denmark', 5000000), ('Botswana', 1700000)]
>>> res = '<tr>'.join([ '<td>%s<td>%d' % (r[0], r[1]) for r in results ])
>>> s = """<html><head><title>Results</title></head>
<body><table>%s</table></body></html>""" % res
>>> s
'<html><head><title>Results</title></head>\n<body><table><td>Denmark<td>5000000<tr><td>Botswana<td>1700000</table></body></html>'

If the input is not known it may contain parts needing escapes:

>>> results = [('Denmark (<Sweden)', 5000000),
...            (r'''<script type="text/javascript">
... window.open("http://www.dtu.dk/", "Buy Viagra")</script>''', 1700000)]
>>> open('test.html', 'w').write(s)

Input should be sanitized and output should be escaped.
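A minimal sketch of the escaping step with the standard library's cgi.escape() (mentioned on the next slide):

import cgi

res = '<tr>'.join([ '<td>%s<td>%d' % (cgi.escape('%s' % r[0]), r[1])
                    for r in results ])
# '<' in 'Denmark (<Sweden)' and the <script> element are now
# rendered as harmless text instead of being interpreted as markup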


Generating HTML the outdated way

[Figure: the rendered test.html page.]

Writing an HTML file with the HTMLgen module and the code below will generate an HTML file as shown to the left.

Another HTML generation module is Richard Jones' html module (http://pypi.python.org/pypi/html), and see also the cgi.escape() function.

import HTMLgen

doc = HTMLgen.SimpleDocument(title="Results")
doc.append(HTMLgen.Heading(1, "The results"))
table = HTMLgen.Table(heading=["Country", "Population"])
table.body = [[ HTMLgen.Text('%s' % r[0]), r[1] ] for r in results ]
doc.append(table)
doc.write("test.html")


Better way for generating HTML

Probably a better way to generate HTML is with one of the many template engine modules, e.g., Cheetah (see example in the CherryPy documentation), Django (obviously for Django), Jinja2, Mako, tornado.template (for Tornado), ...

>>> from jinja2 import Template
>>> tmpl = Template(u"""<html><body><h1>{{ name|escape }}</h1>
... </body></html>""")
>>> tmpl.render(name=u"Finn <Årup> Nielsen")
u'<html><body><h1>Finn &lt;\xc5rup&gt; Nielsen</h1></body></html>'


Natural language Toolkit

The Natural Language Toolkit (NLTK), described in the book (Bird et al., 2009) and included with "import nltk", contains data and a number of classes and functions:

nltk.corpus: standard natural language processing corpora

nltk.tokenize, nltk.stem: sentence and word segmentation, and stemming or lemmatization

nltk.tag: part-of-speech tagging

nltk.classify, nltk.cluster: supervised and unsupervised classification

And a number of other modules: nltk.collocations, nltk.chunk, nltk.parse, nltk.sem, nltk.inference, nltk.metrics, nltk.probability, nltk.app, nltk.chat
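The corpora and trained models are downloaded separately from the code; a one-time step, either interactively with nltk.download() or by name (the package names below are assumptions about what the later examples need):

import nltk

nltk.download('punkt')                        # sentence tokenization models
nltk.download('wordnet')                      # data for the WordNet lemmatizer
nltk.download('maxent_treebank_pos_tagger')   # model for nltk.pos_tag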


Splitting words: Word tokenization

>>> s = """To suppose that the eye with all its inimitable contrivances
for adjusting the focus to different distances, for admitting different
amounts of light, and for the correction of spherical and chromatic
aberration, could have been formed by natural selection, seems, I freely
confess, absurd in the highest degree."""

>>> s.split()
['To', 'suppose', 'that', 'the', 'eye', 'with', 'all', 'its', 'inimitable',
 'contrivances', 'for', 'adjusting', 'the', 'focus', 'to', 'different',
 'distances,', ...]
>>> re.split('\W+', s)   # Split on non-alphanumeric
['To', 'suppose', 'that', 'the', 'eye', 'with', 'all', 'its', 'inimitable',
 'contrivances', 'for', 'adjusting', 'the', 'focus', 'to', 'different',
 'distances', ...]


Splitting words: Word tokenization

A text example from Wikipedia with numbers:

>>> s = """Enron Corporation (former NYSE ticker symbol ENE) was
an American energy company based in Houston, Texas. Before its
bankruptcy in late 2001, Enron employed approximately 22,000[1] and was
one of the world's leading electricity, natural gas, pulp and paper, and
communications companies, with claimed revenues of nearly $101 billion
in 2000."""

For re.split('\W+', s) there is a problem with the genitive (world's) and numbers (22,000).


Splitting words: Word tokenization

Word tokenization inspired by (Bird et al., 2009, page 111):

>>> pattern = r"""(?ux)              # Set Unicode and verbose flag
(?:[^\W\d_]\.)+                      # Abbreviation
| [^\W\d_]+(?:-[^\W\d_])*(?:'s)?     # Words with optional hyphens
| \d{4}                              # Year
| \d{1,3}(?:,\d{3})*                 # Number
| \$\d+(?:\.\d{2})?                  # Dollars
| \d{1,3}(?:\.\d+)?\s%               # Percentage
| \.\.\.                             # Ellipsis
| [.,;"'?!():-_`/]                   #
"""
>>> import re
>>> re.findall(pattern, s)
>>> import nltk
>>> nltk.regexp_tokenize(s, pattern)


Splitting words: Word tokenization

From informal, quickly written text (YouTube):

>>> s = u"""Det er SÅ LATTERLIGT/PLAT!!
-Det har jo ingen sammenhæng med, hvad DK repræsenterer!! ARGHHH!!"""
>>> re.findall(pattern, s)
[u'Det', u'er', u'S\xc5', u'LATTERLIGT', u'/', u'PLAT', u'!', u'!',
 u'Det', u'har', u'jo', u'ingen', u'sammenh\xe6ng', u'med', u',', u'hvad',
 u'DK', u'repr\xe6senterer', u'!', u'!', u'ARGHHH', u'!', u'!']

Problem with emoticons such as ":o(": they are not treated as a single "word".

Difficult to construct a general tokenizer.
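One could add an emoticon alternative in front of the word patterns; a rough sketch where the emoticon regular expression is only illustrative:

import re

pattern2 = r"""(?ux)            # Unicode and verbose flag
[:;8][-o']?[()DPp]+             # a few emoticons such as :o( or ;-)
| [^\W\d_]+                     # words
| [.,;"?!]                      # punctuation
"""
re.findall(pattern2, u'What a pity :o(')
# [u'What', u'a', u'pity', u':o(']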


Word normalization

Converting "talking", "talk", "talked", "Talk", etc., to the lexeme "talk" (Bird et al., 2009, page 107):

>>> porter = nltk.PorterStemmer()
>>> [porter.stem(t.lower()) for t in tokens]
['to', 'suppos', 'that', 'the', 'eye', 'with', 'all', 'it', 'inimit',
 'contriv', 'for', 'adjust', 'the', 'focu', 'to', 'differ', 'distanc', ',',
 'for', 'admit', 'differ', 'amount', 'of', 'light', ',', 'and', ...]

Another stemmer is lancaster.stem()

The Snowball stemmer works for non-English languages, e.g., Danish:

>>> from nltk.stem.snowball import SnowballStemmer
>>> stemmer = SnowballStemmer("danish")
>>> stemmer.stem('universiteterne')
'universitet'


Word normalization

Normalize with a word list (WordNet):

>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(token) for token in tokens]
['To', 'suppose', 'that', 'the', 'eye', 'with', 'all', 'it', 'inimitable',
 'contrivance', 'for', 'adjusting', 'the', 'focus', 'to', 'different',
 'distance', ',', 'for', 'admitting', 'different', 'amount', 'of', 'light',
 ',', 'and', 'for', 'the', 'correction', ...]

Here words “contrivances” and “distances” have lost the plural “s” and “its” the genitive “s”.


Word categories

Part-of-speech tagging with NLTK

>>> words = nltk.word_tokenize(s)
>>> nltk.pos_tag(words)
[('To', 'TO'), ('suppose', 'VB'), ('that', 'IN'), ('the', 'DT'),
 ('eye', 'NN'), ('with', 'IN'), ('all', 'DT'), ('its', 'PRP$'),
 ('inimitable', 'JJ'), ('contrivances', 'NNS'), ('for', 'IN'), ...]

NN is noun, VB verb, JJ adjective, RB adverb, etc.; see the list of common tags.

>>> tagged = nltk.pos_tag(words)
>>> [word for (word, tag) in tagged if tag == 'JJ']
['inimitable', 'different', 'different', 'light', 'spherical', 'chromatic',
 'natural', 'confess', 'absurd']

"confess" is wrongly tagged.


Some examples


Keyword extraction

Consider the text:

“Computer programming (e.g., 02101 or 02102), statistics (such as 02323, 02402 or 02403) and linear algebra (such as 01007) More advanced programming and data analysis, e.g., Machine Learning (02450 or 02457), or courses such as 02105 or 01917”

We want to extract "computer programming", "statistics", "linear algebra", "advanced programming" (or perhaps just "programming"!?), "data analysis", "machine learning".

But we do not want “and linear” or “courses such”, i.e., not just bigrams.

Note the lack of verbs and the missing period.


Keyword extraction

Let's see what NLTK's part-of-speech tagger can do:

>>> text = ("Computer programming (e.g., 02101 or 02102), statistics "
...         "(such as 02323, 02402 or 02403) and linear algebra (such as 01007) "
...         "More advanced programming and data analysis, e.g., Machine Learning "
...         "(02450 or 02457), or courses such as 02105 or 01917")
>>> tagged = nltk.pos_tag(nltk.word_tokenize(text))
>>> tagged
[('Computer', 'NN'), ('programming', 'NN'), ('(', ':'), ('e.g.', 'NNP'),
 (',', ','), ('02101', 'CD'), ('or', 'CC'), ('02102', 'CD'), (')', 'CD'),
 (',', ','), ('statistics', 'NNS'), ('(', 'VBP'), ('such', 'JJ'),
 ('as', 'IN'), ('02323', 'CD'), (',', ','), ('02402', 'CD'), ('or', 'CC'),
 ('02403', 'CD'), (')', 'CD'), ('and', 'CC'), ('linear', 'JJ'),
 ('algebra', 'NN'), ('(', ':'), ('such', 'JJ'), ('as', 'IN'),
 ('01007', 'CD'), (')', 'CD'), ('More', 'NNP'), ('advanced', 'VBD'),
 ('programming', 'VBG'), ('and', 'CC'), ('data', 'NNS'), ('analysis', 'NN'),
 (',', ','), ('e.g.', 'NNP'), (',', ','), ('Machine', 'NNP'),
 ('Learning', 'NNP'), ('(', 'NNP'), ('02450', 'CD'), ('or', 'CC'),
 ('02457', 'CD'), (')', 'CD'), (',', ','), ('or', 'CC'), ('courses', 'NNS'),
 ('such', 'JJ'), ('as', 'IN'), ('02105', 'CD'), ('or', 'CC'),
 ('01917', 'CD')]

Note an embarrassing error: ('(', 'NNP').


Keyword extraction

Idea: assemble consecutive nouns. Here a first attempt:

phrases, phrase = [], ""
for (word, tag) in tagged:
    if tag[:2] == 'NN':
        if phrase == "":
            phrase = word
        else:
            phrase += " " + word
    elif phrase != "":
        phrases.append(phrase.lower())
        phrase = ""

Result:

>>> phrases
['computer programming', 'e.g.', 'statistics', 'algebra', 'programming',
 'data analysis', 'e.g.', 'machine learning (', 'courses']

Well... Not quite right. More control structures, stopword lists, ...?


Keyword extraction

Chunking: make a small grammar with a regular expression that, e.g., catches a sentence part; here we call it a noun phrase (NP):

>>> grammar = "NP: { <JJ>*<NN.?>+ }"
>>> cp = nltk.RegexpParser(grammar)
>>> cp.parse(tagged)
Tree('S', [Tree('NP', [('Computer', 'NN'), ('programming', 'NN')]),
 ('(', ':'), Tree('NP', [('e.g.', 'NNP')]), (',', ','), ('02101', 'CD'),
 ('or', 'CC'), ('02102', 'CD'), (')', 'CD'), (',', ','),
 Tree('NP', [('statistics', 'NNS')]), ('(', 'VBP'), ('such', 'JJ'),
 ('as', 'IN'), ('02323', 'CD'), (',', ','), ('02402', 'CD'), ('or', 'CC'),
 ('02403', 'CD'), (')', 'CD'), ('and', 'CC'), Tree('NP', ...


NLTK can produce parse trees. Here the first chunk:

>>> list(cp.parse(tagged))[0].draw()


Keyword extraction

Extract the NP parts:

def extract_chunks(tree, filter='NP'):
    extract_word = lambda leaf: leaf[0].lower()
    chunks = []
    if hasattr(tree, 'node'):
        if tree.node == filter:
            chunks = [ " ".join(map(extract_word, tree.leaves())) ]
        else:
            for child in tree:
                cs = extract_chunks(child, filter=filter)
                if cs != []:
                    chunks.append(cs[0])
    return chunks

>>> extract_chunks(cp.parse(tagged))
['computer programming', 'e.g.', 'statistics', 'linear algebra', 'more',
 'data analysis', 'e.g.', 'machine learning (', 'courses']

Still not quite right.


Checking keyword extraction on new data set

text = """To give an introduction to advanced