Boston
Contents
  Html Structure
  Goal of html content extraction
  Html Stripping
  Java Options
    BoilerPipe
    Apache Tika
  Python Options
    BoilerPipe
    Web API
    Html2Text
    Beautiful Soup
Html Structure
<HTML>
  <HEAD>
    <TITLE>Title of the page.</TITLE>
  </HEAD>
  <BODY>
    Page content.
  </BODY>
</HTML>
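A minimal sketch of how this structure can be read programmatically, using Python's standard `html.parser` module. The class name `TitleBodyParser` and the helper `split_title_and_body` are hypothetical names chosen for illustration; the sample input is the page shown above.

```python
from html.parser import HTMLParser


class TitleBodyParser(HTMLParser):
    """Collects the text inside <title> and <body> separately."""

    def __init__(self):
        super().__init__()
        self._current = None  # "title", "body", or None
        self.title_parts = []
        self.body_parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> arrives as "title".
        if tag in ("title", "body"):
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._current == "title":
            self.title_parts.append(text)
        elif self._current == "body":
            self.body_parts.append(text)


def split_title_and_body(html):
    """Return (title_text, body_text) for a simple HTML page."""
    parser = TitleBodyParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.title_parts), " ".join(parser.body_parts)


SAMPLE = """<HTML>
<HEAD><TITLE>Title of the page.</TITLE></HEAD>
<BODY>Page content.</BODY>
</HTML>"""

title, body = split_title_and_body(SAMPLE)
print(title)
print(body)
```

This only separates the two top-level regions; it makes no attempt to decide which parts of the body are relevant.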
Example Page
http://www.minecraft.net/about.jsp
  Navigation links at the top
  Main text in the body
  Buy Now on the side
  Copyright on the bottom
HTML Stripping
Regular Expression replacement: <[^>]+>
In Java:
  String noHTMLString = htmlString.replaceAll("\\<.*?>", "");
In Python:
  re.compile(r'<.*?>').sub('', toStrip)  # Must import re!
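As a runnable version of the Python one-liner, here is a small sketch using the `<[^>]+>` pattern from this slide. The function name `strip_tags` is a hypothetical helper, not part of any library.

```python
import re

# Matches anything that looks like a tag: "<", one or more non-">" chars, ">".
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(html):
    """Remove everything that looks like an HTML tag from the string."""
    return TAG_RE.sub("", html)


print(strip_tags("<p>Hello <b>world</b></p>"))
print(strip_tags("<script>var x = 1;</script>Text"))
```

The second call shows the limit of this approach: the tags are removed but the script body stays in the output, so navigation text, scripts, and other non-content material still have to be filtered some other way.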
BoilerPipe
Tool that intelligently removes html tags (and even irrelevant text).
  Much smarter than a regular expression.
  Provides several extraction methods.
  Returns text in a variety of formats.
are important
BoilerPipe Extractors
ARTICLE_EXTRACTOR: Specializes in finding articles.
DEFAULT_EXTRACTOR: Picks up more than just articles. Filters navigation links.
CANOLA_EXTRACTOR: Extractor based on krdwrd.
KEEP_EVERYTHING_EXTRACTOR: Gets everything. Could use this for extracting the title.
BoilerPipe Tests
Try BoilerPipe. No setup required!
http://boilerpipe-web.appspot.com/
1. Download: http://code.google.com/p/boilerpipe/downloads/detail?name=boilerpipe-1.1.0-bin.tar.gz
2. Extract and add all JARs to path/workspace.
3. Adapt code from next slide.
Apache Tika
Apache Tika: Java library that can parse many
  Allows traversal of the parse tree as parse events, meaning the entire document need not be in memory at one time to parse it.
  Parses and preserves metadata.
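Tika itself is a Java library, and the sketch below does not use its API. It only illustrates the event-driven idea described above, using Python's standard `html.parser`: the document is fed in small chunks, the parser fires a callback for each start tag, and the whole page never has to be held as one string. The names `TagCounter` and `count_tags` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start-tag events as they are reported by the parser."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_tags(chunks):
    """Feed an iterable of text chunks to the parser one at a time."""
    parser = TagCounter()
    for chunk in chunks:
        # A tag split across two chunks is buffered until it is complete.
        parser.feed(chunk)
    parser.close()
    return parser.counts


# Chunks deliberately cut through the middle of tags.
chunks = ["<HTML><HE", "AD><TITLE>Title</TI", "TLE></HEAD><BODY>Pa", "ge</BODY></HTML>"]
print(count_tags(chunks))
```

In practice the chunks would come from a file or network stream read a block at a time rather than from a list.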
1. Web API
Generate a URL that requests a text file from the test site.
Unfortunately, your system will fail if the site is unavailable.
You can use different arguments to get different formats.
Experiment with using the web site to find out what kinds of
URL has three parts:
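A minimal sketch of generating such a request URL in Python. The `/extract` endpoint path and the query parameter names `url`, `extractor`, and `output` are assumptions made for illustration, as is the helper name `build_request_url`; experiment with the web site to confirm the actual arguments. The sketch only builds the URL and does not contact the site.

```python
from urllib.parse import urlencode

# Assumed endpoint on the test site; verify against the site itself.
BASE = "http://boilerpipe-web.appspot.com/extract"


def build_request_url(page_url, extractor="ArticleExtractor", output="text"):
    """Build a request URL; parameter names are assumptions, not confirmed."""
    query = urlencode({"url": page_url, "extractor": extractor, "output": output})
    return BASE + "?" + query


print(build_request_url("http://www.minecraft.net/about.jsp"))
```

Because the system fails when the site is unavailable, any code that fetches this URL would reasonably wrap the request in error handling and fall back to a local extractor.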
3. Html2Text
Get it at:
4. Beautiful Soup
  Generates a parse tree of a webpage.
  Have to find relevant content on your own.
  Handles pages made with bad markup.
Additional Resources
CLEANEVAL: A contest for html extractors.
http://cleaneval.sigwac.org.uk/
Questions?