
Chris

Boston

Contents
Html Structure
Goal of html content extraction
Html Stripping
Java Options
    BoilerPipe
    Apache Tika
Python Options
    BoilerPipe Web API
    Html2Text
    Beautiful Soup

Html Structure
<HTML>
  <HEAD>
    <TITLE> Title of the page. </TITLE>
  </HEAD>
  <BODY> Page content. </BODY>
</HTML>

Example Page
http://www.minecraft.net/about.jsp
Navigation links at the top
Main text in the body
Buy Now on the side
Copyright on the bottom

Goal of Html Extraction

Given HTML, identify the relevant text.
Strip page navigation links.
Strip site-specific text (copyright, etc.).
Many tools can find data based on where it occurs in the structure of the html.
In our case, we are trying to strip out words that are obviously related to site functions.

HTML Stripping
Regular Expression replacement: <[^>]+>
In Java:
    String noHTMLString = htmlString.replaceAll("\\<.*?>", "");
In Python:
    import re
    re.compile(r'<.*?>').sub('', toStrip)
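A minimal sketch of the regular expression approach, using the pattern from this slide. The function name `strip_tags` is hypothetical, and replacing each tag with a space (rather than an empty string) is an assumption made here so that words on either side of a tag do not run together.

```python
import re
from html import unescape

# Pattern from the slide: "<", then one or more non-">" characters, then ">".
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(html):
    """Remove anything that looks like a tag and tidy the whitespace.

    Hypothetical helper for illustration only.
    """
    text = TAG_PATTERN.sub(" ", html)   # drop the tags, keep the text between them
    text = unescape(text)               # turn entities such as &amp; back into characters
    return " ".join(text.split())       # collapse runs of whitespace


page = ("<HTML> <HEAD> <TITLE> Title of the page. </TITLE> </HEAD> "
        "<BODY> Page content. </BODY> </HTML>")
print(strip_tags(page))
```

Note that this keeps every piece of text on the page, including navigation links and copyright notices, which is the limitation the tools on the following slides try to address.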

Example Stripped Page

BoilerPipe
Tool that intelligently removes html tags (and even irrelevant text).
Much smarter than a regular expression.
Provides several extraction methods.
Returns text in a variety of formats.

How Boilerpipe Extracts Content


Retrieves Html given URL (optional)
Parses Html to find text content
Separates text into text blocks
Uses a variety of classifiers to determine which blocks are important
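The parse, split and classify steps above can be imitated with a toy sketch. This is not BoilerPipe's code: the class `BlockSplitter`, the function `extract_content`, the list of block tags, and the two thresholds are all assumptions chosen for illustration. It keeps a block only if it has enough words and few of them sit inside links.

```python
from html.parser import HTMLParser

# Tags assumed (for this sketch) to start or end a text block.
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3"}


class BlockSplitter(HTMLParser):
    """Splits a page into text blocks and counts linked words per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # list of (text, word_count, linked_word_count)
        self._words = []
        self._linked = 0
        self._in_link = 0
        self._skip = 0        # depth inside <script>/<style>

    def flush_block(self):
        if self._words:
            self.blocks.append((" ".join(self._words), len(self._words), self._linked))
        self._words, self._linked = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush_block()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush_block()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._linked += len(words)


def extract_content(html, min_words=5, max_link_density=0.3):
    """Keep blocks that are long enough and are not mostly link text."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    splitter.flush_block()
    kept = [text for text, words, linked in splitter.blocks
            if words >= min_words and linked / words <= max_link_density]
    return "\n".join(kept)


sample = ("<html><body>"
          "<div><a href='/'>Home</a> <a href='/buy'>Buy Now</a></div>"
          "<p>This is the main text of the page and it is long enough.</p>"
          "<div>Copyright notice</div>"
          "</body></html>")
print(extract_content(sample))
```

On the sample page, the navigation block is dropped because all of its words are links, and the copyright block is dropped because it is too short.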

BoilerPipe Extractors
ARTICLE_EXTRACTOR: Specializes in finding articles.
DEFAULT_EXTRACTOR: Picks up more than just articles. Filters navigation links.
CANOLA_EXTRACTOR: Extractor based on krdwrd.
KEEP_EVERYTHING_EXTRACTOR: Gets everything. Could use this for extracting the title.

BoilerPipe Tests
Try BoilerPipe. No setup required! http://boilerpipe-web.appspot.com/

Getting Started with BoilerPipe

1. Download BoilerPipe: http://code.google.com/p/boilerpipe/downloads/detail?name=boilerpipe-1.1.0-bin.tar.gz
2. Extract and add all JARs to path/workspace.
3. Adapt code from next slide.

Example Java Code


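    import java.net.URL;
    import de.l3s.boilerpipe.extractors.CommonExtractors;
    import de.l3s.boilerpipe.extractors.ExtractorBase;
    // Fetches the page at targetUrl and returns the text BoilerPipe extracts from it.
    public static String extractFromUrl(String targetUrl) throws Exception {
        ExtractorBase extractor = CommonExtractors.ARTICLE_EXTRACTOR;
        return extractor.getText(new URL(targetUrl));
    }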

Apache Tika
Apache Tika: Java library that can parse many formats, including html.
Lets you have a lot of control.
http://tika.apache.org/

Apache Tika Features


Unlike BoilerPipe, Apache Tika can generate an xml parse tree from documents of almost any format.
Allows traversal of the parse tree as parse events, meaning the entire document need not be in memory at one time to parse it.
Parses and preserves metadata.
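The idea of handling a document as parse events can be illustrated in Python. This sketch uses the standard library's `html.parser`, not Tika, and the names `TitleScanner` and `scan_in_chunks` are hypothetical. The parser is fed one chunk at a time and reacts to start-tag, end-tag and text events, so no tree of the whole document is ever built.

```python
from html.parser import HTMLParser


class TitleScanner(HTMLParser):
    """Reacts to parse events as they arrive; keeps only small running totals."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_chars = 0       # non-whitespace characters seen outside <title>
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        else:
            self.text_chars += len("".join(data.split()))


def scan_in_chunks(chunks):
    """Feed the document piece by piece, as if reading it from a stream."""
    parser = TitleScanner()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.title.strip(), parser.text_chars


# The chunk boundaries fall in the middle of a tag and in the middle of a word.
chunks = ["<html><head><title>Title of the page.</title></he",
          "ad><body>Page con",
          "tent.</body></html>"]
print(scan_in_chunks(chunks))
```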

Options for Python

1. Use the BoilerPipe Web API
2. Make a simple helper BoilerPipe JAR, then do the heavy lifting in python.
3. Html2Text
4. Beautiful Soup

1. Web API
Generate a URL that requests a text file from the test site.
Unfortunately, your system will fail if the site is unavailable.
You can use different arguments to get different formats.
Experiment with using the web site to find out what kinds of options are available.

Url has three parts:

1. http://boilerpipe-web.appspot.com/extract?url=
2. http://www.myurl.net/
3. &extractor=ArticleExtractor&output=text
   (Choose your Extractor type and return type here)
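The three parts above can be assembled with the standard library so that the target address is percent-encoded safely. This is a sketch: the function name `build_extract_url` is hypothetical, and only the `url`, `extractor` and `output` parameters shown on this slide are used.

```python
from urllib.parse import urlencode

# Part 1: the web API endpoint from the slide.
BASE = "http://boilerpipe-web.appspot.com/extract"


def build_extract_url(page_url, extractor="ArticleExtractor", output="text"):
    """Combine the endpoint, the page address, and the extractor/output options."""
    query = urlencode({"url": page_url, "extractor": extractor, "output": output})
    return BASE + "?" + query


print(build_extract_url("http://www.myurl.net/"))
```

Encoding matters when the target address carries its own query string, since an unescaped `&` in it would otherwise be read as the start of the next parameter.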

1. Web API: Example Python Code


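    import urllib.request
    import html2text

    def extract(url):
        # Python 3 version; assumes the response is UTF-8 encoded.
        fullUrl = "http://boilerpipe-web.appspot.com/extract?url="
        fullUrl += url
        fullUrl += "&extractor=ArticleExtractor&output=text"
        html = urllib.request.urlopen(fullUrl)
        return html2text.html2text(html.read().decode("utf-8"), fullUrl)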

2. Call Executable JAR from Python


    import os

    if __name__ == "__main__":
        startingDir = os.getcwd()  # remember the current directory
        jarDir = "Path/To/Jar"  # placeholder: directory that holds the JAR
        os.chdir(jarDir)  # change to our test directory
        os.system("java -jar myJar.jar myParameters")
        os.chdir(startingDir)  # change back to where we started
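To do the heavy lifting in Python, the JAR's output has to come back to the script, which `os.system` does not provide. One possible variant uses `subprocess.run`, which can run the command in a given directory and capture what it prints. The helper names `jar_command` and `run_in_dir` are hypothetical, and `myJar.jar` and `myParameters` are the placeholders from the slide.

```python
import subprocess


def jar_command(jar_name, *params):
    """Build the argument list for `java -jar <jar_name> <params...>`."""
    return ["java", "-jar", jar_name, *params]


def run_in_dir(command, working_dir):
    """Run a command inside working_dir and return what it wrote to stdout.

    Raises subprocess.CalledProcessError if the command exits with an error.
    """
    completed = subprocess.run(
        command,
        cwd=working_dir,        # replaces the chdir / chdir-back pair
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


# Usage, assuming Java is installed and the JAR prints the extracted text:
# text = run_in_dir(jar_command("myJar.jar", "myParameters"), "Path/To/Jar")
```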

3. Html2Text
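Get it at: http://www.aaronsw.com/2002/html2text/
Demo also on web site.
Example code:

    import html2text
    import urllib.request

    # url holds the address of the page to convert (Python 3; assumes UTF-8).
    test = urllib.request.urlopen(url)
    result = html2text.html2text(test.read().decode("utf-8"), url)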

4. Beautiful Soup
Generates a parse tree of a webpage
Have to find relevant content on your own
Handles pages made with bad markup
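A small sketch of finding the relevant content yourself with Beautiful Soup (the `bs4` package must be installed). The function name `main_text` and the choice of which tags to discard are assumptions; a real page will need its own rules.

```python
from bs4 import BeautifulSoup


def main_text(html):
    """Return the text of the <p> elements left after dropping obvious site chrome."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumed rule: these elements never hold the main content.
    for tag in soup.find_all(["script", "style", "nav", "footer"]):
        tag.decompose()
    paragraphs = [p.get_text().strip() for p in soup.find_all("p")]
    return "\n".join(p for p in paragraphs if p)


sample = ("<html><body>"
          "<nav><p>Home</p></nav>"
          "<p>Main text in the body.</p>"
          "<footer><p>Copyright</p></footer>"
          "</body></html>")
print(main_text(sample))
```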

Additional Resources
CLEANEVAL: A contest for html extractors.
http://cleaneval.sigwac.org.uk/

Questions?