Boiler Pipe

Chris
Boston
Contents
Html Structure
Goal of html content extraction
Html Stripping
Java Options
  BoilerPipe
  Apache Tika
Python Options
  BoilerPipe Web API
  Html2Text
  Beautiful Soup
Html Structure
<HTML>
  <HEAD>
    <TITLE> Title of the page. </TITLE>
  </HEAD>
  <BODY>
    Page content.
  </BODY>
</HTML>
Example Page
http://www.minecraft.net/about.jsp
Navigation links at the top
Main text in the body
Buy Now on the side
Copyright on the bottom
Goal of Html Extraction

Given HTML, identify the relevant text.
Strip page navigation links.
Strip site-specific text (copyright, etc.).
Many tools can find data based on where it occurs in the structure of the html. In our case, we are trying to strip out words that are obviously related to site functions.
HTML Stripping
Regular expression replacement: <[^>]+>
In Java:
    String noHTMLString = htmlString.replaceAll("\\<.*?>", "");
In Python:
    re.compile(r'<.*?>').sub('', toStrip)  # Must import re!
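A minimal Python sketch of the regular expression approach, run on a made-up page. The sample markup and the name strip_tags are illustrative assumptions, and the sketch substitutes a space rather than an empty string so that neighbouring words do not run together. Note what survives: the tags are gone, but the navigation and copyright text are still there.

```python
import re

# Anything that looks like a tag: "<", one or more non-">" characters, ">".
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(html):
    """Replace tag-like substrings with spaces, then collapse whitespace."""
    text = TAG_PATTERN.sub(" ", html)
    return " ".join(text.split())


# Hypothetical page: navigation links, main text, and a copyright line.
sample = (
    "<html><body>"
    "<a href='/home'>Home</a> <a href='/buy'>Buy Now</a>"
    "<p>Main text in the body.</p>"
    "<p>Copyright notice</p>"
    "</body></html>"
)

print(strip_tags(sample))
```

The regular expression has no way to tell the main text from the boilerplate around it, which is the gap the tools on the following slides try to fill.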
Example Stripped Page
BoilerPipe
Tool that intelligently removes html tags (and even irrelevant text).
Much smarter than a regular expression.
Provides several extraction methods.
Returns text in a variety of formats.
How Boilerpipe Extracts Content

Retrieves Html given URL (optional)
Parses Html to find text content
Separates text into text blocks
Uses a variety of classifiers to determine which blocks are important
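The steps above can be illustrated with a toy Python sketch. This is not BoilerPipe's actual algorithm or classifiers: it only shows the general idea of splitting a page into text blocks and keeping blocks by simple features. The features used here (word count and the share of words inside links), the thresholds, the class and function names, and the sample page are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags that are treated as block boundaries in this toy example.
BLOCK_TAGS = {"p", "div", "li", "ul", "td", "h1", "h2", "h3", "br"}


class BlockSplitter(HTMLParser):
    """Splits a page into text blocks and counts the linked words in each."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished blocks as (words, linked_word_count)
        self.words = []     # words of the block being built
        self.linked = 0     # how many of those words sit inside <a> tags
        self.in_link = False

    def finish_block(self):
        if self.words:
            self.blocks.append((self.words, self.linked))
        self.words, self.linked = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.finish_block()
        elif tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.finish_block()
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        tokens = data.split()
        self.words.extend(tokens)
        if self.in_link:
            self.linked += len(tokens)


def extract_content(html, min_words=5, max_link_density=0.3):
    """Keep blocks that are long enough and are not mostly link text."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    splitter.finish_block()
    kept = []
    for words, linked in splitter.blocks:
        link_density = linked / len(words)
        if len(words) >= min_words and link_density <= max_link_density:
            kept.append(" ".join(words))
    return kept


# Hypothetical page: a navigation bar, one paragraph, and a footer.
page = (
    "<div><a href='/'>Home</a> <a href='/buy'>Buy Now</a></div>"
    "<p>This paragraph is the main text of the page and it is fairly long.</p>"
    "<div>Copyright 2011</div>"
)

print(extract_content(page))
```

On this sample, the navigation block is dropped because it is all link text and the footer is dropped because it is too short, leaving only the paragraph.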
BoilerPipe Extractors
ARTICLE_EXTRACTOR: Specializes in finding articles.
DEFAULT_EXTRACTOR: Picks up more than just articles. Filters navigation links.
CANOLA_EXTRACTOR: Extractor based on krdwrd.
KEEP_EVERYTHING_EXTRACTOR: Gets everything. Could use this for extracting the title.
BoilerPipe Tests
Try BoilerPipe. No setup required! http://boilerpipe-web.appspot.com/
Getting Started with BoilerPipe

1. Download BoilerPipe: http://code.google.com/p/boilerpipe/downloads/detail?name=boilerpipe-1.1.0-bin.tar.gz
2. Extract and add all JARs to path/workspace.
3. Adapt code from next slide.
Example Java Code

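import java.net.URL;
import de.l3s.boilerpipe.extractors.CommonExtractors;
import de.l3s.boilerpipe.extractors.ExtractorBase;

// Downloads the page at targetUrl and returns the extracted article text.
public static String extractFromUrl(String targetUrl) throws Exception {
    ExtractorBase extractor = CommonExtractors.ARTICLE_EXTRACTOR;
    return extractor.getText(new URL(targetUrl));
}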
Apache Tika
Apache Tika: Java library that can parse many formats, including html.
Lets you have a lot of control.

http://tika.apache.org/
Apache Tika Features

Unlike BoilerPipe, Apache Tika can generate an XML parse tree from documents of almost any format.

Allows traversal of the parse tree as parse events, meaning the entire document need not be in memory at one time to parse it.

Parses and preserves metadata.
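The event-based traversal mentioned above can be illustrated by analogy in Python. This sketch does not use Tika (which is a Java library); it uses the standard library's html.parser to show the same idea: the document is fed to the parser in small chunks, handlers react to events as they arrive, and only running totals are kept. The class name, the chunk size, and the choice of what to count are assumptions for illustration.

```python
from html.parser import HTMLParser


class EventCounter(HTMLParser):
    """Reacts to parse events and keeps only running totals, not the text."""

    def __init__(self):
        super().__init__()
        self.start_tags = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        self.start_tags += 1

    def handle_data(self, data):
        # Count non-whitespace characters; nothing else is stored.
        self.text_chars += sum(1 for ch in data if not ch.isspace())


def count_in_chunks(document, chunk_size=8):
    """Feed the document piece by piece, as if it were read from a stream."""
    counter = EventCounter()
    for start in range(0, len(document), chunk_size):
        counter.feed(document[start:start + chunk_size])
    counter.close()
    return counter.start_tags, counter.text_chars


document = (
    "<HTML><HEAD><TITLE>Title of the page.</TITLE></HEAD>"
    "<BODY>Page content.</BODY></HTML>"
)

print(count_in_chunks(document))
```

The totals do not depend on the chunk size, because the parser buffers a tag that is cut in half until the rest of it arrives.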
Options for Python

1. Use the BoilerPipe Web API
2. Make a simple helper BoilerPipe JAR, then do the heavy lifting in python.
3. Html2Text
4. Beautiful Soup
1. Web API
Generate a URL that requests a text file from the test site.
Unfortunately, your system will fail if the site is unavailable.
You can use different arguments to get different formats.
Experiment with the web site to find out what kinds of options are available.

Url has three parts:
1. http://boilerpipe-web.appspot.com/extract?url=
2. http://www.myurl.net/
3. &extractor=ArticleExtractor&output=text
Choose your Extractor type and return type in part 3.
1. Web API: Example Python Code

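import urllib.parse
import urllib.request
import html2text

def extract(url):
    fullUrl = "http://boilerpipe-web.appspot.com/extract?url="
    # Percent-encode the target so its own "?" and "&" are not read as API arguments.
    fullUrl += urllib.parse.quote(url, safe="")
    fullUrl += "&extractor=ArticleExtractor&output=text"
    html = urllib.request.urlopen(fullUrl)  # Python 3; was urllib.urlopen in Python 2
    return html2text.html2text(html.read().decode("utf-8"), fullUrl)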
2. Call Executable JAR from Python

import os

if __name__ == "__main__":
    startingDir = os.getcwd()  # remember the current directory
    jarDir = "Path/To/Jar"     # placeholder: the directory that holds the JAR
    os.chdir(jarDir)           # change to our test directory
    os.system("java -jar myJar.jar myParameters")
    os.chdir(startingDir)      # change back to where we started
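os.system returns only the exit status, so the JAR's printed output never reaches the Python code. A possible alternative is sketched below with subprocess.run, which can capture that output and takes a working directory, so no os.chdir is needed. The function names are assumptions for illustration, and the sketch assumes the helper JAR prints the extracted text to standard output.

```python
import subprocess


def build_command(jar_name, *args):
    """Build the argument list for running an executable JAR."""
    return ["java", "-jar", jar_name, *args]


def run_jar(jar_dir, jar_name, *args):
    """Run the JAR inside jar_dir and return what it printed to stdout."""
    completed = subprocess.run(
        build_command(jar_name, *args),
        cwd=jar_dir,          # run in the JAR's directory without os.chdir
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return completed.stdout
```

For example, run_jar("Path/To/Jar", "myJar.jar", "myParameters") would return the JAR's output as a string that the rest of the Python code can process.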
3. Html2Text
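Get it at: http://www.aaronsw.com/2002/html2text/
Demo also on web site.
Example code:
    import html2text
    import urllib.request
    url = "http://www.myurl.net/"  # placeholder: the page to convert
    test = urllib.request.urlopen(url)  # Python 3; was urllib.urlopen in Python 2
    result = html2text.html2text(test.read().decode("utf-8"), url)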
4. Beautiful Soup
Generates a parse tree of a webpage
Have to find relevant content on your own
Handles pages made with bad markup
Additional Resources
CLEANEVAL: A contest for html extractors. http://cleaneval.sigwac.org.uk/
Questions?