Extracting Tourism Information From The Web

Diploma Thesis (Diplomarbeit)
Extracting Tourism Information from the Web
carried out at the Institute of Information Systems, Database and Artificial Intelligence Group, Vienna University of Technology, under the supervision of o.Univ.Prof. Dipl.-Ing. Dr.techn. Georg Gottlob and Dipl.-Ing. Dr.techn. Marcus Herzog
by Emanuela Schwab, Vorgartenstr. 134-138/2/108, 1020 Wien
April 9, 2002
Date
Signature
Contents
1 Introduction
  1.1 The Aim of the Thesis
  1.2 Structure of the Thesis
2 Using Lixto
  2.1 Introduction
  2.2 Patterns
    2.2.1 Pattern Categories
    2.2.2 Pattern Tree
    2.2.3 Packing and Expanding Parts of the Tree
    2.2.4 Adding Patterns
    2.2.5 Removing Patterns
    2.2.6 Testing a Pattern
  2.3 Filters
    2.3.1 Adding Filters
    2.3.2 Manipulating Filters
  2.4 Conditions
    2.4.1 External and Internal Conditions
    2.4.2 Range Conditions
  2.5 Saving and Loading Programs
  2.6 Show Program
  2.7 Digital Camera Example
  2.8 Digital Camera Exercise
3 InfoPipes
  3.1 The User Interface and Components
    3.1.1 Source Component
    3.1.2 Integrator Component
    3.1.3 Transformer Component
    3.1.4 Deliverer Component
4 Location Based Data Retrieval
  4.1 Location Based Services
    4.1.1 Locating the Device
  4.2 Geographic Information System (GIS)
    4.2.1 What Is GIS?
    4.2.2 GIS Software
    4.2.3 Data for the GIS
5 Night Owl Application
  5.1 Introduction
    5.1.1 Services
    5.1.2 Static and Dynamic Searches
    5.1.3 Location Based Service
    5.1.4 The User Interface and InfoPipes
  5.2 Data Sources
    5.2.1 Common Output Structure
    5.2.2 Gastronomy
    5.2.3 Cinema
    5.2.4 Pharmacy
    5.2.5 Grocery
    5.2.6 Map and Routing
    5.2.7 Max.mobil
  5.3 IP Architecture
    5.3.1 Gastronomy Pipe
    5.3.2 Cinema Pipe
    5.3.3 Pharmacy Pipe
    5.3.4 Grocery Pipe
  5.4 Use Case Descriptions
    5.4.1 U1 - User Interaction
    5.4.2 S1 - Static Search
    5.4.3 S2 - Dynamic Search
    5.4.4 S3 - System Generates Map of Vicinity and Routing Information
    5.4.5 M1 - Max Mobile Phone Location
6 Conclusion and Future Goals
A Extraction Programs and Outputs
  A.1 Gastronomy
    A.1.1 Elog Rules
  A.2 Cinema
    A.2.1 Elog Rules
  A.3 Pharmacy
    A.3.1 Elog Rules for Vienna
  A.4 Grocery
    A.4.1 Elog Rules
    A.4.2 XSLT
    A.4.3 Master Source XML Input
  A.5 Routing and Map
    A.5.1 Elog Rules
List of Tables
5.1 Query Parameters of the Gastronomy Page
5.2 Query Parameters of the Cinema Page
5.3 Query Parameters of the Cinema Page
5.4 Query Parameters for the Routing Information and Map Generation
List of Figures
2.1 Pattern Tree Example
2.2 Complex Pattern Tree Example
2.3 Expanding a Subtree
2.4 Four Steps of Adding a New Pattern
2.5 Removing Patterns: Warning
2.6 The Control Panel During Pattern Testing
2.7 Testing Mode
2.8 First Way of Selecting the Parent Pattern for a Filter
2.9 Second Way of Selecting the Parent Pattern for a Filter
2.10 Select Example Parent Instance
2.11 Selecting the Sample Instance of a New Filter
2.12 Choose the Path Specification Mode
2.13 Manual Path Specification
2.14 Specify Attribute Construction Mode
2.15 Attribute Construction Mode: Categories
2.16 You can also enter a Condition for the Content
2.17 Attribute Construction Mode: Custom
2.18 Syntactical Concepts
2.19 Semantical Concepts
2.20 String Filter Creation
2.21 Manipulating Filters
2.22 Select the Type of Condition You Want
2.23 External Condition: Select Distance Tolerance
2.24 Negated External Condition: Select Distance Tolerance
2.25 Range Condition
2.26 The Program Menu
2.27 Long Program View
2.28 Elog Program View
2.29 Example Page
2.30 Selecting the Camera Table
2.31 Range Restriction
2.32 Selecting a Camera Row
2.33 Attribute Construction for the Rating Pattern
2.34 Attribute Construction to Extract the Camera Name
2.35 Attribute Construction for the Price Pattern
2.36 Adding Pattern ratingValue
2.37 Attribute Construction for the Next Pattern
2.38 Example Page for the Exercise in Section 2.8
2.39 Attribute Construction for the Value of the Detail
2.40 Pattern Tree of the Exercise Program
3.1 The Main Window of InfoPipes
3.2 The Three Different Popup Menus
3.3 Source Configuration Step 1: Enter an URL
3.4 Source Configuration Step 2: Add a Content Extractor
3.5 Source Configuration Step 4: Configure Scheduler
3.6 Integrator Configuration Step 1: Select an Output Structure
3.7 Integrator Configuration Step 2: Refine Output Structure
3.8 Integrator Configuration Step 3: Define Input to Output Mapping
3.9 Integrator Configuration Step 4: Semantic Mapping
3.10 Integrator Configuration Step 5: Edit XSLT Stylesheet
3.11 Transformer Configuration: Define Mapping and Joining
3.12 Transformer Configuration: Define Selection Criteria (Admin View)
3.13 Transformer Configuration: Define Selection Criteria (User View)
3.14 Deliverer Configuration Step 1 and 2: Configure Output Device and Scheduler
3.15 Deliverer Configuration Step 3: Compose Output
4.1 The Flexible Architecture based upon Mobile Location Center
4.2 Layers of an GIS
5.1 Access Point
5.2 Prototype of UI: Search Mask
5.3 Prototype of UI: Result List
5.4 Prototype of UI: Map of Vicinity
5.5 Prototype of UI: Routing Information
5.6 Search on the Gastroweb Page
5.7 The Result List on the Gastroweb Page
5.8 Restaurant Details of the Gastroweb Result List
5.9 Falter Search
5.10 Falter Result List
5.11 Falter Cinema Details
5.12 Falter Film Details
5.13 Night Pharmacies Vienna Search
5.14 Night Pharmacies Vienna Result List
5.15 List of Shops
5.16 Search Parameter for Routing and Map
5.17 Routing Information
5.18 Map of Vicinity
5.19 Example for a request to the GIS server of max.mobil
5.20 Example answer for the request in figure 5.19
5.21 Gastronomy Pipe
5.22 Cinema Pipe
5.23 Pharmacy Pipe
5.24 Admin View of the Grocery Transformer
5.25 Grocery Pipe
5.26 Interaction of the Use Cases
Chapter 1
Introduction
The Web has become a major access channel to information repositories of all kinds. Because it is based on open standards, which result in low entry costs for publishers and free navigation tools for end users, it has become the de facto standard for publishing information online. In many cases, HTML pages are generated automatically from databases or other highly structured sources. Because HTML does not support logical markup as XML does, the original structure and coherence of the data are lost. HTML offers convenient access for human users, but it is not suitable for processing by computer programs.
Consequently, information sources on the Web often exist unconnected to each other, like isolated islands of information. For instance, if you go to an American online bookstore and want a price quote in euros, you have to get the price from the bookstore (in US dollars), go to another Web site offering information on currency exchange rates, and do the conversion to euros there. Furthermore, you have to do all of this by hand: fetching and navigating Web documents, copying the price, and filling out forms.
We would really like to automate the entire process: Web awareness of Web services (services taking advantage of one another) and interoperability with other Web services and legacy databases. To reach these goals, we need Web wrappers. In the database community, a wrapper is a software component that converts queries and data from one model to another. In the Web environment, wrappers can extract information that is contained in an HTML document and make its structure explicit and usable for further processing.
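To make the idea of a Web wrapper more concrete, the following minimal sketch (which is not part of Lixto) pulls a price out of an HTML fragment and emits it again as XML with explicit structure. The page fragment, the class name price, the XML element names and the exchange rate are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class PriceWrapper(HTMLParser):
    """Collects the text of every element whose class attribute is 'price'."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # 'price' is a hypothetical class name used by the sample page below.
        if dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_endtag(self, tag):
        self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def wrap(html, usd_to_eur):
    """Turn presentation-oriented HTML into XML with explicit structure."""
    parser = PriceWrapper()
    parser.feed(html)
    book = ET.Element("book")
    for text in parser.prices:
        usd = float(text.lstrip("$"))
        ET.SubElement(book, "price", currency="USD").text = f"{usd:.2f}"
        ET.SubElement(book, "price", currency="EUR").text = f"{usd * usd_to_eur:.2f}"
    return ET.tostring(book, encoding="unicode")


# Hypothetical page fragment and exchange rate, for illustration only.
SAMPLE = '<html><body><b>Some Book</b> <span class="price">$20.00</span></body></html>'
print(wrap(SAMPLE, usd_to_eur=0.5))
```

Once the price is available as an XML element with a currency attribute, the conversion step from the bookstore example no longer requires a human to copy values between pages.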
1.1 The Aim of the Thesis
At the Database and Artificial Intelligence Group of the Institute of Information Systems at the Vienna University of Technology, two programs were developed that together provide much more functionality than a simple Web wrapper. They make it possible to automate the whole process of data extraction, transformation, integration and delivery. These two programs are called Lixto and InfoPipes. The aim of this thesis is to write a demo application using these programs.
Lixto is a fully visual, interactive system for the generation of wrappers, based on newly developed algorithms and a declarative language for the definition of HTML/XML wrappers. Lixto offers a powerful interactive visual interface, allows for expressive and flexible data extraction, and uses intuitive hierarchical extraction as well as string extraction techniques. Lixto translates relevant parts of Web pages into XML.
InfoPipes is a fully visual tool for creating information processing pipelines. Such a pipeline starts with the XML output of a wrapper created by Lixto, which is fully integrated into InfoPipes. The next stages in the pipeline allow for the integration of different sources and the transformation of the XML structure. The last stage is always the delivery of data, which can be done in many different formats, including XML, HTML, and SMS.
The detailed objectives of the thesis are: the definition of an application scenario for mobile users in the domain of tourism; the specification of requirements and use cases and the design of the application; and the investigation and analysis of the Web information as well as the creation of the extraction programs.
1.2 Structure of the Thesis
In chapter 2 we will look at the Lixto extraction tool in more detail. Chapter 3 provides a brief overview of the user interface of InfoPipes. Chapter 4 describes location based services, such as locating mobile devices and working with geographical information systems. In chapter 5 the demo application is presented. The last chapter presents conclusions and directions for future study. The appendix contains the extraction programs and some of the style sheets used.
Chapter 2
Using Lixto
2.1 Introduction
The Lixto Wrapper Generator is a visual and interactive tool for creating wrapper programs, which are used to extract data from HTML pages and convert this information into XML. Such a Lixto program comprises a hierarchical structure of patterns (see section 2.2). Each pattern comprises at least one filter (see section 2.3) and can have several child patterns. Filters specify which parts of the Web page should be extracted, and they can be restricted with conditions (see section 2.4). An element is extracted by a filter only if it satisfies all conditions of that filter, but it is extracted by a pattern as soon as it satisfies any one of the pattern's filters.
The process of writing a Lixto wrapper consists of creating patterns, adding filters to them, testing the matching instances (sections 2.2.6 and 2.3.2) and optionally adding conditions to the filters. The actual information extracted (when testing patterns or filters, or when running the Lixto program) depends on the input page. One program can be applied to a class of structurally similar Web pages.
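The relationship between patterns, filters and conditions can be sketched in a few lines of code. This is only a simplified model of the semantics described above, not the way Lixto is implemented: the class names, the representation of page elements as plain strings and the sample data are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Condition = Callable[[str], bool]


@dataclass
class Filter:
    """A filter matches an element only if ALL of its conditions hold."""
    selects: Callable[[str], bool]
    conditions: List[Condition] = field(default_factory=list)

    def matches(self, element: str) -> bool:
        return self.selects(element) and all(c(element) for c in self.conditions)


@dataclass
class Pattern:
    """A pattern extracts an element if ANY of its filters matches."""
    name: str
    filters: List[Filter]

    def extract(self, elements: List[str]) -> List[str]:
        return [e for e in elements if any(f.matches(e) for f in self.filters)]


# Hypothetical page content, reduced to a list of text elements.
elements = ["EUR 299", "USD 350", "Canon", "EUR 5"]
price = Pattern("price", [
    # First filter: starts with EUR, restricted by one length condition.
    Filter(lambda e: e.startswith("EUR"), [lambda e: len(e) > 5]),
    # Second filter: starts with USD, no conditions.
    Filter(lambda e: e.startswith("USD")),
])
extracted = price.extract(elements)
print(extracted)
```

Here "EUR 5" is rejected because it fails the condition of the first filter and does not match the second filter at all, whereas "USD 350" is extracted although it fails the first filter.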
2.2 Patterns
Patterns are the basic constructs of a Lixto program and define its hierarchical structure. Each pattern is subsequently mapped to an XML tag, and all child patterns correspond to XML tags nested inside the parent tag.
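The mapping from the pattern hierarchy to nested XML tags can be illustrated as follows. The sketch assumes that an extraction result is available as nested pairs of pattern name and instance; this data layout and the pattern names are hypothetical and do not reflect Lixto's internal representation.

```python
import xml.etree.ElementTree as ET


def to_xml(pattern_name, instance):
    """Map a pattern instance to an XML element named after the pattern.

    `instance` is either extracted text (a leaf pattern) or a list of
    (child_pattern_name, child_instance) pairs.
    """
    element = ET.Element(pattern_name)
    if isinstance(instance, str):
        element.text = instance
    else:
        for child_name, child_instance in instance:
            element.append(to_xml(child_name, child_instance))
    return element


# Hypothetical extraction result: pattern names and values are made up.
result = ("cameras", [
    ("camera", [("name", "Model A"), ("price", "299")]),
    ("camera", [("name", "Model B"), ("price", "350")]),
])
xml = ET.tostring(to_xml(*result), encoding="unicode")
print(xml)
```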
2.2.1 Pattern Categories
Depending on the type of information you want to extract, you have to choose one of three pattern categories:
Tree patterns serve to extract parts of documents corresponding to HTML elements or lists of elements, e.g. a table, a paragraph, a list of table rows, or content elements.
String patterns serve to extract textual strings from visible and invisible parts of a document. This can be an email address or a zip code, but also an attribute value such as the name of an image.
Document patterns serve to extract a whole Web page and are used for navigation to Web pages. The initial pattern of every Lixto program, called rootDocument, is an example of a document pattern, but you can also add document patterns later in your program to extract data from further linked Web pages.
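As a rough illustration of what a string pattern extracts, the following sketch uses regular expressions to obtain a zip code from the visible text and the name of an image from an attribute value. It does not use Lixto; the page fragment and both regular expressions are assumptions made for this example.

```python
import re

# Hypothetical fragment of a page; address and file name are made up.
HTML = '<p>Cafe Example, 1020 Wien</p><img src="/images/cafe_front.jpg">'

# Visible part: strip the tags first, then look for a four-digit zip code.
visible_text = re.sub(r"<[^>]+>", " ", HTML)
zip_codes = re.findall(r"\b\d{4}\b", visible_text)

# Invisible part: the file name inside the src attribute of an image.
image_names = re.findall(r'<img[^>]*\bsrc="(?:[^"]*/)?([^"]+)"', HTML)

print(zip_codes, image_names)
```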
2.2.2 Pattern Tree
In the pattern tree you can see the pattern name, the pattern category, the default parent pattern, the filters and, optionally, the conditions assigned to a filter. The filters of a pattern are shown directly below the pattern to which they belong. Conditions are shown below the filter to which they belong. All other information is shown on the same line as the pattern name (see figure 2.1).
Figure 2.1: Pattern Tree Example
Complex Pattern Trees
Internally, the system arranges the patterns as a graph, but it displays them as a serialized tree. A pattern therefore occurs more than once in the pattern tree if it contains filters with different parent patterns. However, each filter can be edited in only one place; all other occurrences are read only. If you select such a read-only filter, the Jump to enabled-button is displayed, which takes you to the editable version. This instance is in the subtree of the filter's parent pattern (see figure 2.2).
Figure 2.2: Complex Pattern Tree Example
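How a pattern graph turns into a tree view with repeated entries can be sketched as follows. The dictionary layout, the pattern and filter names and the textual output format are hypothetical; the sketch also assumes a program without recursive patterns, since it would not terminate on a cyclic graph.

```python
# Each pattern lists its filters as (filter_name, parent_pattern) pairs.
# Hypothetical program: 'price' has filters with two different parents.
PATTERNS = {
    "camera": [("f1", "rootDocument")],
    "offer": [("f1", "rootDocument")],
    "price": [("f1", "camera"), ("f2", "offer")],
}


def serialize(parent, depth=0):
    """Return the tree view lines below `parent`.

    A pattern is listed under every pattern that is the parent of one of its
    filters. A filter is marked editable only in the subtree of its own
    parent pattern; elsewhere it is read only.
    """
    lines = []
    for name, filters in PATTERNS.items():
        if not any(p == parent for _, p in filters):
            continue
        lines.append("  " * depth + name)
        for filter_name, filter_parent in filters:
            state = "editable" if filter_parent == parent else "read only"
            lines.append("  " * (depth + 1) + f"{filter_name} [{state}]")
        lines.extend(serialize(name, depth + 1))
    return lines


tree = ["rootDocument"] + serialize("rootDocument", 1)
print("\n".join(tree))
```

In the output the pattern price appears twice, once below camera and once below offer, and each of its two filters is editable in exactly one of these places.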
2.2.3 Packing and Expanding Parts of the Tree
For large programs it is sometimes useful to hide parts of the tree. Select the pattern whose child patterns and filters you want to hide and click on the Pack-button. If you select a packed pattern, the buttons in the control panel change and you can click on the Expand-button to show the whole subtree (see figure 2.3).
Figure 2.3: Expanding a Subtree
The arrows in front of the pattern names indicate whether subtrees are expanded or packed. An arrow to the left indicates an expanded subtree, whereas an arrow to the right indicates a packed subtree.
2.2.4 Adding Patterns
New patterns are added in the following four steps (see figure 2.4):
Step 1: Select the parent pattern in the pattern tree (see section 2.2.2).
Step 2: Enter a unique pattern name consisting of letters and digits only. The pattern name will be used later on as the default name of the XML element.
Step 3: Select the desired pattern category (see section 2.2.1).
Step 4: Click on the Add pattern-button.
Figure 2.4: Four Steps of Adding a New Pattern
By clicking on the Add pattern-button, the new pattern is added to the pattern tree (see section 2.2.2). Because no filter has been defined for it yet, missing filter is shown below the pattern name. (How to add a filter to the pattern is described in section 2.3.) In the visual display each pattern is shown together with its pattern type (abbreviated as T for tree patterns, S for string patterns, or D for document patterns).
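The two naming rules of step 2 can be expressed as a small check. The function below is a hypothetical helper, not part of Lixto, and it assumes that "letters" means ASCII letters; whether the tool also accepts other letters is not stated here.

```python
import re


def validate_pattern_name(name, existing_names):
    """Check the two rules from step 2: letters and digits only, and unique.

    Returns an error message, or None if the name is acceptable.
    """
    if not re.fullmatch(r"[A-Za-z0-9]+", name):
        return "name must consist of letters and digits only"
    if name in existing_names:
        return "name is already used by another pattern"
    return None


# Hypothetical names of the patterns that already exist in the program.
existing = {"rootDocument", "camera"}
print(validate_pattern_name("ratingValue", existing))
print(validate_pattern_name("rating value", existing))
```

Restricting names in this way keeps every pattern name usable as the default name of an XML element without further escaping.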
2.2.5 Removing Patterns
To remove a pattern from the tree, select it in the pattern tree and click on the Delete pattern-button. If the pattern has subpatterns, you will be asked whether you want to delete all descendant patterns and filters (see figure 2.5).
Figure 2.5: Removing Patterns: Warning
2.2.6 Testing a Pattern
To test which parts of the document match a pattern, select the desired pattern and click on the Test pattern-button. In the document window, the matching instances of the selected pattern are highlighted in two alternating colors and the control panel looks as shown in figure 2.6.
Figure 2.6: The Control Panel During Pattern Testing
The listbox labeled Matched documents lists all documents from which at least one instance is extracted, and you can select the one you want to view in the document window. The number of instances extracted from a document is shown in parentheses after its URL. The line beneath the listbox gives the number of extracted instances in all documents and in the selected document only. If the checkbox blink is enabled, which is the default, the highlighted text blinks. To close this view, click on the close-button.
Testing Mode
You can choose between two testing modes: partial and general. In partial mode, only the current path to the selected instance of the pattern/filter is tested, and not the paths to any other occurrences of this pattern/filter in the tree. In general mode, all occurrences of the selected pattern are tested. This makes a difference for patterns that occur in more than one place in the pattern tree, e.g. when a pattern contains a filter with a parent pattern other than the default parent pattern of the target pattern. With the field labeled max. docs you can limit the number of documents searched. This is useful for avoiding overly deep recursion when testing programs that contain document patterns.
Figure 2.7: Testing Mode
2.3 Filters
Filters specify how data is extracted within the context of a particular parent pattern. If a pattern has more than one filter, only one of them has to be satisfied for a data element to be extracted.
2.3.1 Adding Filters
Depending on the category of the pattern, you get different menus for filter creation. The first three steps are the same for all categories.
Step 1: Select the pattern in the pattern tree to which you want to add the filter (see section 2.2.2).
Step 2: Select the parent pattern which is referenced by the filter. The filter is defined relative to the data of the parent pattern. In other words, you can only extract data that occurs inside the chosen parent pattern. The parent pattern of the filter does not have to be the same as the default parent pattern of the pattern for which you define the filter, although in most cases it is. You have to specify the parent pattern for each filter you define. There are two ways of selecting the parent pattern for the filter:
1. Select the desired parent pattern in the listbox in the lower right-hand corner of the control panel (see figure 2.8).
2. If you do not select a parent pattern in the listbox, you will be asked to select a parent pattern in the pattern tree after you have clicked on the Add filter-button (see figure 2.9). If a parent pattern occurs multiple times in your extraction program, you may have to use this way in order to select the desired parent pattern instance.
Figure 2.8: First Way of Selecting the Parent Pattern for a Filter
Figure 2.9: Second Way of Selecting the Parent Pattern for a Filter
Step 3: Click on the Add filter-button.
The next steps depend on the pattern category and are explained for each category separately.
Tree Filter
Step 4: If the parent pattern matches multiple times in the document, you have to choose which instance you want to use for the creation of the filter (see figure 2.10). In the listbox labeled Matched documents you can select one of the documents entered in the document manager. With the Prev-button, the Next-button, or directly with the mouse, you can select the parent instance on the displayed page which you want to use. Your selection is highlighted in green and you can continue the filter creation by clicking on the Select this instance-button.
Figure 2.10: Select Example Parent Instance
Step 5: The document window shows only the selected (or the only matching) instance of the parent pattern. Select the region of the page the filter should extract. You can do this either with two consecutive mouse clicks (at the beginning and the end of the region) or with a double click on the desired (leaf) element of the document model. Again, your selection is highlighted in green and you can continue the filter creation by clicking on the Select this instance-button. If the selection does not show exactly what you intended to mark, you can use the cursor keys or the corresponding buttons on the left of the control panel to move or resize the selection (see figure 2.11). The path and attributes of the current selection are also shown in the control panel, which is very useful for checking whether you have selected the correct region. If you click on the Cancel-button, the whole filter creation is canceled.
Depending on the selection made in step five, one of two slightly different kinds of tree filters is created. The first represents a single element of the HTML page (e.g. a table or a paragraph), whereas the second represents a sequence of successive HTML elements. If you visualize the HTML code as a tree, the first kind corresponds to exactly one node (which does not have to be a leaf node), and the second corresponds to a sequence of nodes with the same parent node that does not include all child nodes of that parent, because otherwise you would have selected the parent node itself.
Step 6: The system automatically detects the HTML path to the selected item(s), but you can adjust that path if you want. You have to choose between the following three modes (see figure 2.12):
Figure 2.11: Selecting the Sample Instance of a New Filter
Figure 2.12: Choose the Path Specification Mode
default: The system automatically creates the default path, which consists of the sequence of elements of the tree path separated by wildcards. (Hint: Usually this mode suffices. Later versions will support some AI techniques here, too.)
manual: For each detected element of the path you can select whether a star prefix (= wildcard) or no prefix should be added, or whether the element should be omitted completely (see figure 2.13).
minimal: The system takes only the last path component, with a wildcard before it.
Figure 2.13: Manual Path Specification
If you selected a sequence of elements in step five, you have to do the path specification for the parent node and for the first and last child nodes separately.
Step 7: Specify the attribute construction mode (see figure 2.14). If you enter conditions for some of the attributes of the selected item, it is more likely that only items of interest will match the filter.
Figure 2.14: Specify Attribute Construction Mode
The different attribute construction modes are:
Default: The system automatically selects the same attributes you selected last time in the categories mode. If you have not used the categories mode yet, it selects the href attribute if available.
All exactly: The system takes all attributes and their values exactly as discovered in the selected instance.
None: The system does not take any attributes, i.e. the attribute condition set is empty.
Categories: You can specify which attributes to consider by their purpose in the HTML page (see figure 2.15).
Figure 2.15: Attribute Construction Mode: Categories
Only categories which apply to the selected item are enabled. Select the categories you want to use and then select, in the listbox labeled values selection, one of the comparison types given below.
ignore value: The attribute has to exist, but its value is ignored.
exactly: The value of the attribute must exactly match the corresponding value of the selected item.
contains: The value of the attribute has to contain the value of the selected attribute.
It is also possible to add restrictions on the content (see figure 2.16). For that purpose the listbox of comparison types has one more entry, matches regexp. In the textbox on the right you have to enter the value for the content or, if you selected matches regexp in the listbox, the regular expression. Regular expressions are a very powerful tool for expressing conditions on strings, but explaining them is beyond the scope of this thesis. For more details on regular expressions see for example [10] or [12].
Custom: You have full control over attribute selection (see figure 2.17). In the custom attribute construction window only attributes of the selected item are shown. Select all attributes you want to use and select a comparison type for each of them in the listbox. You can then enter text in the textbox on the right.
Figure 2.16: You can also enter a Condition for the Content

Figure 2.17: Attribute Construction Mode: Custom

In this construction mode four additional comparison types are present:

is syntactical concept
contains syntactical concept
is semantical concept
contains semantical concept

If you choose, for example, is syntactical concept, you can select one of several predefined syntactical concepts like date or email (and possibly some methods such as isFormer, which performs a date comparison) (see figure 2.18). How to define your own concepts will be explained in a later chapter. Semantical concepts work similarly to syntactical concepts (see figure 2.19).

Figure 2.18: Syntactical Concepts

Figure 2.19: Semantical Concepts

If you selected a sequence of elements in step five, you have to do the attribute construction separately for the parent node and for the first and last child nodes.
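The comparison types ignore value, exactly, contains and matches regexp can be illustrated with a small Python sketch. This is not Lixto's implementation: the function names attribute_matches and element_matches, the representation of an element as a dictionary of attributes, and the use of a regular expression search (rather than a full match) are assumptions made only for illustration.

```python
import re


def attribute_matches(attrs, name, mode, value=None):
    """Check one attribute condition against an element's attributes.

    attrs: dict mapping attribute names to values (hypothetical model).
    mode:  "ignore value", "exactly", "contains" or "matches regexp".
    value: reference value, or the regular expression for "matches regexp".
    """
    if name not in attrs:
        # In every mode the attribute has to exist.
        return False
    actual = attrs[name]
    if mode == "ignore value":
        return True
    if mode == "exactly":
        return actual == value
    if mode == "contains":
        return value in actual
    if mode == "matches regexp":
        # Assumption: the expression may match anywhere in the value.
        return re.search(value, actual) is not None
    raise ValueError("unknown comparison type: " + mode)


def element_matches(attrs, conditions):
    """An element matches if all conditions hold.

    An empty condition list corresponds to the mode None and
    therefore matches every element.
    """
    return all(attribute_matches(attrs, n, m, v) for n, m, v in conditions)


image = {"src": "/img/stars45.gif", "alt": "Rating: 4.5"}
print(element_matches(image, [("alt", "contains", "Rating")]))
print(element_matches(image, [("alt", "exactly", "Rating")]))
print(element_matches(image, [("href", "ignore value", None)]))
```

The more conditions are listed, the fewer elements match, which is why entering attribute conditions makes it more likely that only items of interest are selected.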
String Filter

Basic support for string rules is implemented so far.

Step 4: Select which type of condition you want to use and enter a value (see figure 2.20).

Figure 2.20: String Filter Creation

It will also be possible to extract attribute values such as a link (href) or the name of an image.

Document Filter

Document filters are used to concatenate information originating from several HTML pages into one single XML file. You can integrate a linked page anywhere in your program. There are two situations where document filters are typically used:

1. Detailed information is found on pages linked to from the original page. Usually, these pages have a different structure than the main page and are parsed with a subtree of the program.
2. Long lists are often split into several pages connected via next links. These pages usually all have the same structure and a recursive program is used to parse them. This is done by adding a new document filter to the rootDocument pattern which follows the next link (for an example, see section 2.7).

Creating a document filter is very simple: you only have to specify the parent pattern for this kind of filter. This can either be a string pattern defining the URL or a tree pattern with a href attribute. In the second case the value of the href attribute is used for navigation.
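The second situation, a list split over several pages, can be sketched in Python as follows. The sketch only illustrates the idea of a recursive program that follows next links; the PAGES dictionary is a hypothetical in-memory stand-in for real web pages, and the function extract is not part of Lixto.

```python
# Hypothetical stand-in for a paginated listing:
# URL -> (items on that page, URL of the next link or None).
PAGES = {
    "list?page=1": (["Camera A", "Camera B"], "list?page=2"),
    "list?page=2": (["Camera C"], "list?page=3"),
    "list?page=3": (["Camera D"], None),
}


def extract(url, seen=None):
    """Extract the items of one page, then follow its next link.

    The recursive call plays the role of the document filter that is
    added to the root pattern and has the next link as its parent.
    """
    seen = set() if seen is None else seen
    if url is None or url in seen:
        # Stop at the last page, or if a next link leads back to a
        # page that has already been visited.
        return []
    seen.add(url)
    items, next_url = PAGES[url]
    return list(items) + extract(next_url, seen)


print(extract("list?page=1"))
```

All pages are parsed by the same function because they share the same structure; pages with a different structure (the first situation) would instead be handled by a separate part of the program.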
2.3.2 Manipulating Filters

When you select a filter in the pattern tree, buttons for manipulating filters appear (see figure 2.21). Conditions and ranges are explained in section 2.4.

Deleting Filters

To delete the selected filter, click on the Delete filter-button.

Figure 2.21: Manipulating Filters

Testing Filters

Testing filters works like testing patterns (see section 2.2.6), but only one filter is tested at a time. Don't be confused if testing shows some empty pattern or filter instances, indicated by highlighted vertical lines. This is due to limits of the underlying tree model and will be improved in later versions. (Currently such instances have to be removed by applying conditions.)
2.4 Conditions

If testing your filter selects too much data, you can restrict it by adding conditions. To add an internal or external condition to a filter, select the filter in the control panel and then click on the Add condition-button (see figure 2.21). To add a range restriction, click on the Restrict range-button; to remove a range restriction, click on the Clear range-button. To remove a condition, select the condition in the pattern tree and click on the Delete condition-button.
2.4.1 External and Internal Conditions

Select the type of condition you want (see figure 2.22).

Figure 2.22: Select the Type of Condition You Want

With external conditions you can specify which element must (not) occur before or after this pattern, whereas with internal conditions you can specify which element must (not) occur inside the filter, or which element must be the first (last) child in a sequence of elements. First and last child conditions (i.e. starts (ends) with conditions) are not supported in the current version.
Adding an External Condition

To add an external condition the following steps are necessary. (It is assumed that you have already selected External condition or Negated external condition, see figure 2.22.) Most of them are similar to the steps of adding a filter and are explained in detail there (see section 2.3.1).

Step 1: Select an example instance of the filter.
Step 2: Select the desired example instance that has to occur before or after the previously selected instance.
Step 3: Select the distance tolerance (see figure 2.23). A distance of zero percent means that the item has to occur at exactly the same distance, whereas a distance of one hundred percent means that it can occur anywhere before (after) the matching instance.
Step 4: Select the path specification mode.
Step 5: Select the attribute construction mode.

Figure 2.23: External Condition: Select Distance Tolerance

Adding a Negated External Condition

To add a negated external condition the following steps are necessary. (It is assumed that you have already selected Negated external condition, see figure 2.22.) Most of them are similar to the steps of adding a filter and are explained in detail there (see section 2.3.1).

Step 1: Select an example instance of the filter.
Step 2: Select the desired example instance that must not occur before or after the previously selected instance.
Step 3: Select the distance tolerance (see figure 2.24). A distance of zero percent means that only items at exactly the same distance are considered, whereas a distance of one hundred percent means that all items before (after) the matching instance are considered.
Step 4: Select the path specification mode.
Step 5: Select the attribute construction mode.

Figure 2.24: Negated External Condition: Select Distance Tolerance

Adding an Internal Condition

To add an internal condition the following steps are necessary. (It is assumed that you have already selected Internal condition, see figure 2.22.) Most of them are similar to the steps of adding a filter and are explained in detail there (see section 2.3.1).

Step 1: Select an example instance of the filter.
Step 2: Select the desired example instance that has to occur inside the previously selected instance.
Step 3: Select the path specification mode.
Step 4: Select the attribute construction mode.

Adding a Negated Internal Condition

To add a negated internal condition the following steps are necessary. (It is assumed that you have already selected Negated internal condition, see figure 2.22.) Most of them are similar to the steps of adding a filter and are explained in detail there (see section 2.3.1).

Step 1: Select an example instance of the filter.
Step 2: Select the desired example instance that must not occur inside the previously selected instance.
Step 3: Select the path specification mode.
Step 4: Select the attribute construction mode.
2.4.2 Range Conditions

Another way to restrict the matching instances of a filter is to specify a range, e.g. if you are only interested in the first three rows of a table. To add a range condition, click on the Restrict range-button. Now you have to enter the first target number and the number of targets you want to extract (see figure 2.25).
Figure 2.25: Range Condition
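The effect of a range condition can be sketched in a few lines of Python. The function restrict_range is hypothetical, and the sketch assumes that target numbers start at 1, which the text above does not state explicitly.

```python
def restrict_range(instances, first_target, number_of_targets):
    """Keep number_of_targets instances, starting at first_target.

    Assumption: target numbers are counted from 1.
    """
    if first_target < 1 or number_of_targets < 0:
        raise ValueError("first_target must be >= 1, number_of_targets >= 0")
    start = first_target - 1
    return instances[start:start + number_of_targets]


rows = ["row 1", "row 2", "row 3", "row 4", "row 5"]
print(restrict_range(rows, 1, 3))  # the first three rows of the table
```

If fewer instances exist than requested, the sketch simply returns those that are available, so a generous number of targets does no harm.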
Figure 2.26: The Program Menu

2.5 Saving and Loading Programs

To save or load a Lixto program, select the menu item Program (see figure 2.26). Select

New to start a new program.
Open to open an existing program.
Save to save your program. If you save the program for the first time, you will be asked for a name; otherwise it is saved under the same name as before. The program is saved in XML.
Save as to save your program under a new name.
Import elog to import a program previously exported as elog. Elog is the programming language used internally by Lixto.
Import wlg to import a program previously exported as wlg. This is the serialized Java object; its use is not recommended.
Export elog to export your program as elog (e.g. for manual editing of a program).
Export wlg to export your program as wlg (not recommended).
2.6 Show Program

You can view your program in an extra window by selecting Show program in the Program menu (see figure 2.26). You can choose between three display modes:

short: This is the same view of the pattern tree as in the control panel.
long: Similar to short, but filter and condition definitions are shown as well (see figure 2.27). The syntax used for the definitions is similar to the syntax of elog (the programming language used internally by Lixto). For a detailed description of elog see [4] and [2].
elog: This shows the elog rules of the program (see figure 2.28).

Figure 2.27: Long Program View

Figure 2.28: Elog Program View
2.7 Digital Camera Example

In this example most of the mechanisms explained above are used and explained step by step. The starting point is the page (see figure 2.29)

http://www.epinions.com/elec-Photo-Cameras-All-Olympus-Digital

which lists digital cameras. In the example you will extract the names, prices (if available), and ratings of all cameras. As the list is split over several pages, you will use recursion to extract the data on the following pages too. The pattern tree of the complete program is shown in figure 2.27, so you can check at any time whether your pattern tree looks right.
Figure 2.29: Example Page

Begin

Start a new program and add the above URL in the document manager.

Pattern cameraTable

Most HTML pages are structured via tables, and it is a good start to define a pattern which extracts the table(s) with the relevant information. Add a new tree pattern called cameraTable to the new Lixto program (see section 2.2.4).

Filter for pattern cameraTable

To extract the desired table, add a filter to the cameraTable pattern (see section 2.3.1).

1. Select the pattern cameraTable.
2. The only possible parent pattern is the rootDocument, therefore specify it.
3. Click on the Add filter-button.
4. Select the example instance with two mouse clicks. After a few tries you will recognize that the first click has to go before the Rank text and the second one after select items above for side-by-side comparison at the end of the page (see figure 2.30). The whole camera table (and a few lines above and below, which don't look like they belong to the table but do) is now highlighted in green and you can click on the Select this instance-button.
5. Specify the path specification mode. If the system asks you to specify the path for a tree region, you selected more than one table and have to repeat the filter creation. If you want to see the created path, select manual; otherwise default will do.
6. Select none as attribute construction mode and click on the Next-button.

Figure 2.30: Selecting the Camera Table

Test pattern cameraTable

Test the pattern (or filter) to see how many tables match the filter. Because more than the desired table matches, you have to restrict the filter.

Restrict Range

One way to restrict the filter to select only the desired table is to add a range restriction (see section 2.4.2).

1. Select the filter in the pattern tree and click on the Restrict range-button.
2. Select the camera table with one mouse click or with the Next-button or the Prev-button and then click on the Select this instance-button.
3. Enter one as the number of targets (see figure 2.31) and click on the Next-button.

Figure 2.31: Range Restriction

If you test the cameraTable pattern again, only one instance is found.
Pattern cameraRow

Next add the tree pattern cameraRow to extract single rows of the camera table. This pattern serves to collect all data of one camera, like name and price. It is possible to extract these items directly without this pattern, but then the relation between individual items would be lost.

Filter for Pattern cameraRow

1. Select the pattern cameraRow in the pattern tree.
2. Select cameraTable as parent pattern for the filter.
3. Click on the Add filter-button.
4. Select one single row of the table with the mouse. To do so, the first click has to be above the checkbox and the second click has to be at the beginning of the last item (where the cursor is shaped like a hand) (see figure 2.32). One single camera row is now highlighted in green and you can click on the Select this instance-button.
5. Select any path specification mode you want. It doesn't make any difference in this case.
6. Select none as attribute construction mode.

Figure 2.32: Selecting a Camera Row

Condition for Pattern cameraRow

Testing the filter shows that too many rows are selected. Therefore add a condition to the filter.

1. Select the filter in the pattern tree.
2. Click on the Add condition-button.
3. Select internal condition.
4. Select one row with camera data as sample instance.
5. Select the stars of the rating with two mouse clicks and click on the Select this instance-button.
6. Select any path specification mode.
7. Select Custom as attribute construction mode and there select the attribute alt, which has to contain the string Rating (see figure 2.33).

Figure 2.33: Attribute Construction for the Rating Pattern

Now testing shows only 10 items selected, which is exactly what you want.

Patterns name, price, and rating

For each of name, price, and rating add a new pattern to cameraRow.

Filter for Pattern name

If you look at the page, the characteristic of the name is that it is a hyperlink and bold.

1. Select the pattern name in the pattern tree.
2. Select cameraRow as parent pattern for the filter.
3. Click on the Add filter-button.
4. Select any row you like as example instance.
5. Select the name link with a double click.
6. Select any path specification you want.
7. Select Categories as attribute construction mode. Three attribute categories are found and selected. You don't have to change anything (see figure 2.34).

Figure 2.34: Attribute Construction to Extract the Camera Name

Test pattern name

If you test the newly created filter, exactly ten instances are found.

Filter for Pattern price

Characteristic for the price is the $ sign.

1. Select the pattern price in the pattern tree.
2. Select cameraRow as parent pattern for the filter.
3. Click on the Add filter-button.
4. Select any row you like as example instance.
5. Select the price item with a double click.
6. Select any path specification you want.
7. Select Custom as attribute construction mode and select elementtext, which has to contain a $ (see figure 2.35).

Figure 2.35: Attribute Construction for the Price Pattern

Test pattern price

If you test the newly created filter, exactly ten instances are found.

Filter for Pattern rating

The extraction of the rating is a bit more complicated because the value is stored in an attribute. First we have to extract the item itself, then we can define a new pattern which extracts the attribute value from it.

1. Select the pattern rating in the pattern tree.
2. Select cameraRow as parent pattern for the filter.
3. Click on the Add filter-button.
4. Select any row you like as example instance.
5. Select the image of the rating with two mouse clicks.
6. Select any path specification you want.
7. Select Custom as attribute construction mode and select the attribute alt, which has to contain the string Rating (see figure 2.33).

Pattern ratingValue

Add the pattern ratingValue as child of rating. This time you have to choose string as pattern category (see figure 2.36).
Figure 2.36: Adding Pattern ratingValue

Filter for Pattern ratingValue

To extract the value of the alt attribute, add a string filter to the pattern.

1. Select the pattern ratingValue in the pattern tree.
2. Select rating as parent pattern for the filter.
3. Click on the Add filter-button.
4. Select the radio button Attribute and leave the text field empty.
5. Select a parent pattern instance (any instance will do).
6. Select the alt attribute.
Pattern next

To extract the items on the following pages as well, define a pattern which extracts the next link. Its parent pattern is the rootDocument.

Filter for Pattern next

1. Select the pattern next in the pattern tree.
2. Select rootDocument as parent pattern for the filter.
3. Click on the Add filter-button.
4. Select the next link with a double click.
5. Select minimal path specification.
6. Select Categories as attribute construction mode. There select Hyperlinks with value selection ignore value and also select Content, which has to contain the string Next (see figure 2.37).
Figure 2.37: Attribute Construction for the Next Pattern
Recursion: Additional Filter for Pattern rootDocument

The rootDocument should also extract documents linked with the next button. Add a new filter with next as parent pattern to the rootDocument. That's all!

Test the Recursion

Select the rootDocument and click on the Test pattern-button. Only one instance is found because the testing mode is set to partial. If you select general, more instances will be found.

XML output

To view the XML output of your program select XML Output in the Program menu. The full program is shown in figure 2.26.
2.8 Digital Camera Exercise

In this exercise you will extract detailed information about single cameras from the previous example. In contrast to the previous section, the individual steps are not described in detail; instead, a brief guide with some hints is given. Detailed descriptions of the cameras are found by following the links of the camera names (below the picture on the sample page of the last example) and there by following the links View full details. One example of such a page is (see figure 2.38):

http://www.epinions.com/elec-Photo-Cameras-All-Olympus_C-2100_Ultra_Zoom/additive_~1

Your program will extract the camera name, the text of the description and the technical details.

Step 1: Start a new program and add the above URL to the document manager.
Step 2: Extract the camera name. Characteristic for the name are the href attribute and the font attributes. Use the categories attribute construction mode.
Step 3: Extract the whole product description table, including the title Product Description and the body with the description itself. Limit the matching instances by specifying an internal condition containing the title. (You could also use a range condition, but in this case an internal condition is more robust against layout changes of the page.)
Step 4: Extract the description itself. Be aware that on some pages the description section consists of several paragraphs whereas on other pages it consists of only one.
Step 5: Test the newly created filter and add a range condition that excludes the title. Be sure to enter a high enough number of targets so that all paragraphs are extracted, even on pages which use more of them.
Step 6: Extract the whole details table and use an appropriate method to select only this table.
Step 7: Extract one single row of the details table.
Step 8: Extract the name of the detail, like Flash or Optical Zoom. If you use the font attribute (bold), only the name of the detail is extracted; otherwise you have to use conditions.
Step 9: Extract the value of the detail. If you use the custom attribute construction mode, enter the regular expression .+ for the elementtext and also use the columns attribute (see figure 2.39), you will not have to use further conditions.
Step 10: Look at the XML output and change filter definitions if required.
Step 11: Choose a different details page (from another camera) and test the XML output as well.

hint: You can avoid empty instances by specifying the attribute condition regexp .+ for the elementtext (. stands for any character and + says that there has to be at least one of them).

Your final program should be similar to the one shown in figure 2.40.
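The effect of the regular expression .+ from the hint can be tried out independently of Lixto. The following Python sketch uses the standard re module; the function drop_empty and the sample strings are made up for illustration, and Lixto's own regular expression dialect may differ in details.

```python
import re

# "." matches any character (except a line break in Python),
# "+" requires at least one occurrence of it.
NON_EMPTY = re.compile(r".+")


def drop_empty(instances):
    """Keep only instances whose text contains at least one character."""
    return [text for text in instances if NON_EMPTY.search(text)]


print(drop_empty(["3x", "", "Built-in flash", ""]))
```

Note that a text consisting only of blanks still satisfies .+, so such instances would need a stricter expression.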
Figure 2.38: Example Page for the Exercise in Section 2.8
Figure 2.39: Attribute Construction for the Value of the Detail
Figure 2.40: Pattern Tree of the Exercise Program
Chapter 3
InfoPipes
InfoPipes is a fully visual tool for creating information processing pipelines. Such pipelines consist of several different components, each responsible for a specific task. Because most components use XML both as input and output format, they can be arranged in almost any order. Also, pipelines need not be linear: components can combine input from more than one component.

There are four types of components: source, deliverer, transformer, and integrator.

The beginning of a pipe is always a source component, because its input is not XML but the URL of a Web page. It cannot be used in the middle of a pipe, but it is possible to have more than one source in a pipe.

The final stage of a pipe is a deliverer. This is the only component which can produce output in a format other than XML. It, too, cannot be used in the middle of a pipe. The actual output format of a specific deliverer depends on its configuration. Deliverers exist for, among other formats, XML, HTML, and SMS. If more than one output format is desired, a separate deliverer has to be added for each format.

The other two types of components, the integrator and the transformer, can be put in any order and combination between source and deliverer components. The integrator component is used to integrate several different sources. It provides syntactic and semantic mapping of XML elements. This is useful for combining several pages with different XML structures into one single XML document with a uniform structure, or for changing the structure of a single input. The transformer component provides two different features: joining different inputs into one common data set by combining elements and attributes from the inputs on a per-element basis, and restricting the output data by selection.
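The placement rules for the four component types can be summarized in a short Python sketch. It is not part of InfoPipes: the function validate_pipe, the component names and the data layout are hypothetical, and the rule that every component other than a source needs at least one input is an assumption derived from the descriptions of the individual components.

```python
KINDS = {"source", "integrator", "transformer", "deliverer"}


def validate_pipe(components, connections):
    """Return a list of rule violations for a pipe (empty if none).

    components:  dict mapping a component name to its type.
    connections: list of (from_name, to_name) pairs.
    """
    problems = []
    inputs = {name: 0 for name in components}
    outputs = {name: 0 for name in components}

    for name, kind in components.items():
        if kind not in KINDS:
            problems.append(f"{name}: unknown component type {kind!r}")

    for src, dst in connections:
        if src not in components or dst not in components:
            problems.append(f"{src} -> {dst}: unknown component")
            continue
        outputs[src] += 1
        inputs[dst] += 1

    for name, kind in components.items():
        if kind == "source" and inputs[name] > 0:
            problems.append(f"{name}: a source cannot take input from a component")
        if kind == "deliverer" and outputs[name] > 0:
            problems.append(f"{name}: a deliverer cannot feed another component")
        if kind in ("integrator", "transformer", "deliverer") and inputs[name] == 0:
            problems.append(f"{name}: needs at least one input")
    return problems


# A non-linear pipe: two sources, one integrator, one transformer
# and one deliverer per desired output format.
components = {
    "hotels_a": "source",
    "hotels_b": "source",
    "merge": "integrator",
    "cheap_only": "transformer",
    "as_html": "deliverer",
    "as_sms": "deliverer",
}
connections = [
    ("hotels_a", "merge"),
    ("hotels_b", "merge"),
    ("merge", "cheap_only"),
    ("cheap_only", "as_html"),
    ("cheap_only", "as_sms"),
]
print(validate_pipe(components, connections))
```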
3.1 The User Interface and Components

This section gives a brief overview of the most important elements to show what can be done with InfoPipes. For a more detailed description see [15]. After a user logs into the system, the main window is shown (see figure 3.1). The part on the left-hand side is used to manage a user's pipes. It contains a list of the user's pipes and links to add a new (empty) pipe or a shared pipe.
Figure 3.1: The Main Window of InfoPipes
A shared pipe is a pipe configured by another user which can be used as part of one's own pipes. Its elements can only be configured by the pipe's owner, but additional elements can be added and then configured by other users. On the right-hand side the applet window is located, in which the components of the selected pipe are shown and can be manipulated in a visual, interactive way. The right mouse button pops up a context-sensitive menu. This menu allows you to add new components, to configure and connect components, and to remove connections (see figure 3.2).

Figure 3.2: The Three Different Popup Menus

The functions in the component menu are:

Configure: A new configuration window is opened, which is different for each component type. The configuration process for each component is described in the next sections.
Connect To: After selecting this menu item, the user can establish a connection to the next component in the pipe by drawing a line from one component to another.
Remove: The selected component is removed from the pipe.
Caption: A small window for entering a new caption is opened.
Subtype: A small window for entering the name of the subtype is opened.
Disable/Enable: This toggles a component's activation status. Only active components update their content when the pipe is triggered. This is useful for testing pipes.
3.1.1 Source Component

The source component navigates through web pages, extracts the information and transforms the data to XML using Lixto extraction programs. The configuration of the source consists of the following steps:

step 1: Enter a URL and use the web browser to navigate to the page you want to extract information from (see figure 3.3).
step 2: Add a content extractor (e.g. a Lixto program you prepared for this web page) (see figure 3.4).
step 3: Optionally view the XML output.
step 4: Configure the scheduler, which controls when and how often the XML is extracted from the web page (see figure 3.5).
Figure 3.3: Source Configuration Step 1: Enter a URL

Figure 3.4: Source Configuration Step 2: Add a Content Extractor

Figure 3.5: Source Configuration Step 4: Configure Scheduler
3.1.2 Integrator Component

The integrator component provides a unified view of numerous input XML documents. It maps all input schemes to a single output scheme using syntactic and semantic mapping. With syntactic mapping it is possible to reorganize the hierarchy of the XML tags, to rename them, or to add new ones. With semantic mapping the content of XML tags can be transformed. Before you can configure the integrator, you have to connect an input to it. The configuration of the integrator consists of the following steps:

step 1: Define an output structure, which can either be the structure of one of the inputs or be built from scratch (see figure 3.6).
step 2: Refine the output structure. New elements or attributes can be inserted and existing ones can be renamed or deleted (see figure 3.7).
step 3: For each connected source component define the mapping of the input elements/attributes to output elements/attributes (syntactic mapping) (see figure 3.8).
step 4: Optionally define regular expressions used to transform the content of elements (semantic mapping) (see figure 3.9).
step 5: Optionally edit the generated XSLT stylesheet manually (see figure 3.10).
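The difference between syntactic and semantic mapping can be sketched in Python with the standard xml.etree.ElementTree and re modules. The sketch is only an illustration under stated assumptions: the two input documents, the tag names and the function integrate are invented, and the real integrator generates an XSLT stylesheet rather than running code like this.

```python
import re
import xml.etree.ElementTree as ET

# Two hypothetical inputs with different structures.
SHOP_A = "<cameras><camera><title>C-2100</title><cost>$499.00</cost></camera></cameras>"
SHOP_B = "<list><item><name>E-10</name><price>USD 1299</price></item></list>"

# Syntactic mapping per input: record tag, and input tag -> output tag.
MAPPINGS = [
    (SHOP_A, "camera", {"title": "name", "cost": "price"}),
    (SHOP_B, "item", {"name": "name", "price": "price"}),
]

# Semantic mapping: output tag -> (regular expression, replacement).
# Here everything except digits and the decimal point is removed.
SEMANTIC = {"price": (r"[^0-9.]", "")}


def integrate(mappings, semantic):
    """Map all inputs to one output structure and concatenate the records."""
    root = ET.Element("cameras")
    for xml_text, record_tag, tag_map in mappings:
        for record in ET.fromstring(xml_text).iter(record_tag):
            out = ET.SubElement(root, "camera")
            for child in record:
                if child.tag not in tag_map:
                    continue  # unmapped input elements are dropped
                target = tag_map[child.tag]
                text = child.text or ""
                if target in semantic:
                    pattern, replacement = semantic[target]
                    text = re.sub(pattern, replacement, text)
                ET.SubElement(out, target).text = text
    return root


result = integrate(MAPPINGS, SEMANTIC)
print(ET.tostring(result, encoding="unicode"))
```

The tag map corresponds to step 3 (renaming and reorganizing elements), the regular expression to step 4 (transforming content).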
Figure 3.6: Integrator Configuration Step 1: Select an Output Structure

Figure 3.7: Integrator Configuration Step 2: Refine Output Structure

Figure 3.8: Integrator Configuration Step 3: Define Input to Output Mapping

Figure 3.9: Integrator Configuration Step 4: Semantic Mapping

Figure 3.10: Integrator Configuration Step 5: Edit XSLT Stylesheet
3.1.3 Transformer Component

The transformer provides two different functions. It can be used to join inputs into a common data set or to restrict the output data by defining selection criteria. Combining inputs works similarly to joining tables in a relational database system: for each element, attributes from the inputs are combined into one output element (see figure 3.11). In contrast, the integrator converts the format of its inputs to a common output format and concatenates the records found in each. The second use of the transformer is to filter elements from a list of elements according to the specified selection criteria. If this component is distributed as part of a shared pipe, the user of that pipe can set the values of these selection parameters (see figures 3.12 and 3.13).
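The two functions of the transformer, joining and selecting, correspond to well-known relational operations. The following Python sketch shows both on lists of dictionaries; the functions join and select, the key name and the sample records are hypothetical and only illustrate the idea, not the transformer's actual implementation.

```python
def join(left, right, key):
    """Combine records from two inputs that share the same key value."""
    by_key = {record[key]: record for record in right}
    joined = []
    for record in left:
        match = by_key.get(record[key])
        if match is not None:
            # One output element carrying the attributes of both inputs.
            joined.append({**record, **match})
    return joined


def select(records, predicate):
    """Restrict the output to the records that fulfil the criterion."""
    return [record for record in records if predicate(record)]


prices = [{"name": "C-2100", "price": 499.0}, {"name": "E-10", "price": 1299.0}]
ratings = [{"name": "C-2100", "rating": 4.5}, {"name": "E-10", "rating": 4.0}]

cameras = join(prices, ratings, "name")
cheap = select(cameras, lambda record: record["price"] < 1000)
print(cheap)
```

An integrator, by contrast, would produce four records from these two inputs instead of two combined ones. In a shared pipe, the threshold 1000 would be the kind of selection parameter the user of the pipe can set.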
Figure 3.11: Transformer Configuration: Define Mapping and Joining

Figure 3.12: Transformer Configuration: Define Selection Criteria (Admin View)

Figure 3.13: Transformer Configuration: Define Selection Criteria (User View)
3.1.4 Deliverer Component

The job of this component is to deliver information to the end user. Deliverers for several different formats and delivery methods are available: SMS, Email, HTML, WML, and XML. For SMS an email gateway is used, so both SMS and Email use the same method. For all other formats the output is saved as a file on the web server. Before you can configure the deliverer, you have to connect an input to it. The configuration of a deliverer consists of the following steps:

step 1: Select the desired output device and configure its parameters (e.g. mobile phone number or file name) (see figure 3.14).
step 2: Configure the scheduler, which controls when and how often data shall be delivered. It is also possible to specify that data shall only be delivered if changed (see figure 3.14).
step 3: Compose the output from static text and elements of the input (see figure 3.15).
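Delivering data only if it has changed can be realized in several ways; how InfoPipes does it is not described here. One simple possibility, shown as a hedged Python sketch, is to remember a digest of the last delivered content. The class ChangeDetector is hypothetical.

```python
import hashlib


class ChangeDetector:
    """Remember a digest of the last delivered content."""

    def __init__(self):
        self._last_digest = None

    def should_deliver(self, content):
        """Return True only if content differs from the last delivery."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest == self._last_digest:
            return False
        self._last_digest = digest
        return True


detector = ChangeDetector()
for output in ["price: 499", "price: 499", "price: 479"]:
    if detector.should_deliver(output):
        print("deliver:", output)
```

With this approach the scheduler can trigger the deliverer as often as configured, while the user only receives a message when the composed output differs from the previous one.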
Figure 3.14: Deliverer Configuration Step 1 and 2: Configure Output Device and Scheduler

Figure 3.15: Deliverer Configuration Step 3: Compose Output
Chapter 4
Location Based Data Retrieval

4.1 Location Based Services
A Location Based Service (LBS) is a service that finds the geographical location of a mobile device and provides services based on this location information. For example, a person at a shopping mall asks for the nearest low-budget restaurant. He needs only the names and addresses of restaurants within walking distance, say within one kilometer, out of a database of over two thousand restaurants in the city, spread over 1600 square kilometers.

The legal foundation for Location Based Services has been established by a ruling of the U.S. Federal Communications Commission (www.fcc.gov) which requires network operators to provide public emergency response services with the caller's location, with an accuracy of at least 125 meters, and the callback phone number. This gave rise to a new and dynamic field called LBS, where the service is based on the geographical location of the calling device. Developments in the fields of positioning systems, communications and GIS opened a wide range of possibilities for providing users with a customized service depending on their geographical location.
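The restaurant example amounts to a distance query around the position of the device. A minimal Python sketch, assuming positions are given as latitude and longitude in degrees and using the haversine formula on a spherical Earth, could look as follows. The function names and the coordinates are invented for illustration; a real service would use a spatial index instead of scanning all entries.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearby(position, places, max_km=1.0):
    """Names of all places within max_km of position, nearest first."""
    lat, lon = position
    hits = []
    for name, place_lat, place_lon in places:
        d = distance_km(lat, lon, place_lat, place_lon)
        if d <= max_km:
            hits.append((d, name))
    return [name for _d, name in sorted(hits)]


# Invented sample data: (name, latitude, longitude).
restaurants = [
    ("Restaurant A", 48.2100, 16.3750),
    ("Restaurant B", 48.2500, 16.4500),
    ("Restaurant C", 48.2050, 16.3700),
]
print(nearby((48.2082, 16.3738), restaurants))
```

The accuracy of the positioning technology directly limits how useful such a query is: with a position error of several hundred meters, a radius of one kilometer is about the smallest that still makes sense.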
4.1.1 Locating the Device

Location-finding equipment vendors provide the actual location technology. Location technologies for handsets include GPS, Overlay Triangulation technologies and Cell of Origin information. Cell of Origin information, although the least accurate, is the locating technique most widely used by network operators. The different technologies for locating the mobile device are:

Cell of Origin (COO): COO uses the network base station cell area to identify the location of the caller. The accuracy depends on the cell area, which can be as small as 150 meters in diameter in an urban area. Although the accuracy is not high and COO cannot be used for emergency services, it is popular among operators because it does not require any modifications to the handset or the network and is therefore comparatively cheap to deploy.

Time of Arrival (TOA): Here the difference in the time of arrival of the signal from the mobile device at more than one base station is used to calculate the location of the device. This needs time synchronization in the cellular network, using GPS or atomic clocks at each base station. The cell sites are equipped with location measurement units (LMUs). By measuring the signal received from the mobile phone, the LMUs can triangulate the user's position. While TOA is more accurate than COO, cost/benefit analysis does not favor this technology, as the cost of implementing it is very high due to the large number of LMUs required.

Angle of Arrival (AOA): AOA requires a complex antenna array at each cell site. These antennas work together to determine the angle (relative to the cell site) from which a cellular signal originated.

Enhanced Observed Time Difference (E-OTD): E-OTD systems operate by overlaying the cellular network with location receivers, used as location measurement units (LMUs), at multiple sites geographically dispersed over a wide area. Each of these LMUs has an accurate timing source. When a signal from at least three base stations is received by an E-OTD enabled mobile device and the LMU, the time differences of arrival of the signal from each base transceiver station (BTS) at the handset and the LMU are calculated. The differences in time are combined to produce intersecting hyperbolic lines from which the location is estimated. E-OTD schemes offer greater positioning accuracy than Cell of Origin, between 50 and 125 meters, but have a slower speed of response, typically around five seconds, and require modified software in the handsets.

Assisted GPS (AGPS): The last main category is assisted global positioning services (AGPS). AGPS can achieve an accuracy of up to ten meters, but is expensive for end users as they have to obtain a GPS-equipped handset. Also, to operate, the GPS handset needs to be in sight of three or more satellites, which makes its use difficult in dense urban areas and indoors.
Flexible Architecture: The architecture being adopted today by many network operators is based upon mobile location centers (MLC). The MLC separates the technology used to locate the device from the application the location information will be put to. Since many applications can function quite well with Cell of Origin information, network operators can deploy advanced location technology gradually and need not wait for complete coverage to offer new services.
Figure 4.1: The Flexible Architecture based upon Mobile Location Center
4.2 Geographic Information System (GIS)

4.2.1 What Is GIS?
Geographic Information Systems (GIS) are mapping programs that link information about where things are with information about what they are. Unlike a paper map, where what you see is all you get, a GIS map can dynamically combine many layers of information. A paper map shows representations of cities and roads, mountains and rivers, railroads and political boundaries. The cities are represented by little dots or circles, the roads by black lines, the mountain peaks by tiny triangles, and the lakes by small blue areas. A digital map created by a GIS also has dots, or points, that represent features such as cities, lines that represent features such as roads, and small areas that represent features such as lakes. In contrast to a paper map, however, the information shown comes from a database that can be queried by the user: the user can select what kind of information to show and can request more detailed information about a specific feature. Each piece of information on the map is located on a layer, which users can turn on or off according to their needs. One layer could be made up of all the roads, another could represent all the buildings, while a third could represent the customers of a business (see figure 4.2).
Figure 4.2: Layers of a GIS

The advantage of a GIS over paper maps is that it offers the ability to select what information will be displayed, which makes it suitable for more than one use. A business person trying to map customers in a particular city will want to see different information than a water engineer who wants to see the water pipelines for the same city. Both may start with a common map (a street and neighborhood map of the city), but the information they add to that map will differ. What distinguishes GIS from other forms of information systems, such as databases and spreadsheets, is that GIS deals with spatial information. GIS has the capability to relate layers of data for the same points in space, combining, analyzing and, finally, mapping out the results.
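The layer idea can be illustrated with a small Python sketch. It is not taken from any GIS product; the class `Layer`, the function `visible_features`, and the sample features are hypothetical names and data chosen only to show how a shared base map and user-specific layers can be switched on and off.

```python
from dataclasses import dataclass, field


@dataclass
class Layer:
    """One thematic layer: a name, a visibility flag, and its features."""
    name: str
    visible: bool = True
    features: list = field(default_factory=list)  # (label, x, y) tuples


def visible_features(layers):
    """Return the features of all layers that are currently turned on."""
    return [f for layer in layers if layer.visible for f in layer.features]


# Hypothetical sample data: a shared base map plus two user-specific layers.
layers = [
    Layer("roads", features=[("Ringstrasse", 2.0, 3.0)]),
    Layer("customers", features=[("Customer A", 2.5, 3.1)]),
    Layer("water pipelines", visible=False, features=[("Main pipe", 1.0, 4.0)]),
]

# The business person sees roads and customers, but not the pipelines.
business_view = visible_features(layers)
```

Turning the "water pipelines" layer on and the "customers" layer off would give the water engineer's view of the same base map.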
4.2.2 GIS Software
GIS software provides the functions and tools needed to store, analyze, and display information about places. The key components of GIS software are
tools for entering and manipulating geographic information such as addresses or political boundaries;
a database management system (DBMS);
tools that create intelligent digital maps one can analyze, query for more information, or print for presentation; and
an easy-to-use graphical user interface (GUI).

GIS software systems range from low-end business-mapping software appropriate for displaying sales territories to high-end software capable of managing and studying large protected natural areas.
4.2.3 Data for the GIS
The most common data storage formats of a GIS are raster data structures and vector data structures. Some GIS can only cope with one of these two formats, while others can combine data in different formats.

Vector Data

The elements or components of the spatial data in a vector based GIS are constructed from the graphic primitives: points, lines, and areas.

Point features are zero-dimensional and therefore have no length or area. A point could be used to represent the position of a soil sample location, a manhole cover, or the corner of a land parcel. Note, however, that a feature may be a point at a small scale while being an area at a large scale, e.g. a tree will be represented as a point at a scale of 1:2500000 but would be an area feature at a scale of 1:500. In a GIS, points are also important as they mark the position where two lines intersect.

Line features comprise two or more points and are one-dimensional. Consequently, line features have a length but no area. Lines are used to represent features such as roads, rivers and cadastral boundaries. Again, line features are scale dependent. A river will be represented as a line at a scale of 1:2500000 but would be an area feature at a scale of 1:500.

Area features comprise a series of one or more lines and are two-dimensional; they therefore have both area and perimeter length. Areas represent features such as soil types, vegetation assemblages and land ownership parcels.

The inherent functionality of a vector based GIS depends on the system being able to recognize points, lines and areas as individual entities. Furthermore, the system needs to know which points comprise which lines and which lines comprise which areas. To add to the complexity, a line feature may be an entity in itself (e.g. a road) but may also be part of an area (e.g. a land parcel boundary). An additional feature of vector GIS data storage is the recording of the relationship between each of the entities in a database.
This gives the database a built-in intelligence that enables us to ask "What lies to the north of area X?" or "If I drive in a northerly direction along road Y, what soil type is on my left?" This intelligence is gained through the use of a topological data structure.

Raster Data Structures

Unlike a vector system, which uses a series of points, lines and areas to represent features in the geographic database, a raster data structure divides the geographic region into a regular grid of squares or grid cells. Each cell contains a single value relating to the feature or entity being represented. Points are represented by coding one grid cell (or pixel). Lines are represented by encoding a linear array of grid cells. Areas are represented by encoding a cluster of grid cells. Image data, which can include such diverse elements as satellite images, aerial photographs, and scanned data, can also be represented in the raster data structure and can therefore be used in a GIS.

Coordinates and Geocoding

Data used by a GIS contain either an explicit geographic reference, such as latitude and longitude coordinates, or an implicit reference, such as an address, postal code, census tract name, forest stand identifier, or road name. To correlate different features, a GIS requires explicit references. It can create these explicit references from implicit references by an automated process called geocoding. Geocoding is the process of matching records in two databases: for example an address database (without map position information) and a reference street map or other address dictionary (with known map position information). Geocoding software links records in the two databases by matching street names and addresses. When database records are successfully matched to a reference street map database, the records are tagged with the correct map positions, typically latitude/longitude coordinates. Thereafter, the database (table) carries its own position information and can be mapped without the reference street map or address dictionary.
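The matching step of geocoding can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: records match when their addresses are equal after lower-casing and collapsing whitespace, whereas real geocoding software uses far more tolerant matching. The function names, the sample addresses and the coordinates are hypothetical.

```python
def normalize(address):
    """Lower-case an address and collapse whitespace so that records match."""
    return " ".join(address.lower().split())


def geocode(records, reference):
    """Tag each record with the (lat, lon) of its matching reference entry.

    records   -- list of dicts with an "address" key (no position yet)
    reference -- dict mapping address strings to (lat, lon) tuples
    Returns (matched, unmatched); matched records gain "lat" and "lon".
    """
    lookup = {normalize(addr): pos for addr, pos in reference.items()}
    matched, unmatched = [], []
    for record in records:
        pos = lookup.get(normalize(record["address"]))
        if pos is None:
            unmatched.append(record)
        else:
            matched.append({**record, "lat": pos[0], "lon": pos[1]})
    return matched, unmatched


# Hypothetical reference street map and address database (made-up positions).
reference = {"Kelsenstrasse 2, Wien": (48.18, 16.39)}
records = [
    {"name": "Shop A", "address": "kelsenstrasse 2,  wien"},
    {"name": "Shop B", "address": "Unknown Road 1, Wien"},
]
matched, unmatched = geocode(records, reference)
```

After this step the matched records carry their own position information, so they can be mapped without consulting the reference table again.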
A coordinate system specifies the units used to locate features in two-dimensional space and the origin point of those units. Latitude and longitude form such a coordinate system (often called the geographic coordinate system). If you are using an established GIS database, all data will probably already be encoded in the same coordinate system and projection. If you are collecting data from various sources, though, you need to verify this and convert coordinate systems as necessary.
Chapter 5
Night Owl Application

5.1 Introduction
The night owl application offers information which is relevant for mobile users during the night hours. The information is split into different topics such as pharmacy or gastronomy. Search results are constrained by a given time and location: only services that are open at the given time and close to the given location are returned. It is also possible to view a map of the surroundings of the retrieved item. The search can be performed using a PDA, at one of the public access points of apc interactive solutions ag, which are spread over Austria (see figure 5.1), or with a standard Web browser.
Figure 5.1: Access Point
5.1.1 Services
At this level of implementation four information sources are available. For each source one representative Web page is considered. Some services cover only the city of Vienna, because more sources are available for Vienna.

Gastronomy: With this service different kinds of restaurants all over Austria are searchable. The URL for this service is: http://www.gastroweb.at/ Although this Web page is designed to cover all of Austria, the data population for some states is very small and the results are not very satisfying.

Cinema: With this service the current cinema program is searchable, and descriptions of the films are also available. The URL for this service is: http://www.falter.at/programm/kino/kino.php This service is available for all of Austria.

Pharmacy: Each night some of the pharmacies in Austria are open. The mode of operation differs from state to state. In most cases the pharmacies are divided into groups, and there are calendars in which one or more groups are scheduled for each day. For Vienna the calendar is available under the following URL: http://mumps.apotronik.at/mweb/msm.web?EP=AKAL

Grocery: Where can I buy some milk or a bottle of wine or something to eat even if it is eight o'clock in the evening? This question is answered by the grocery service. It includes grocery stores near railway stations, which are allowed to stay open during the night, as well as gas stations with a shop or bakers with extended opening hours. The URL for this service (which only covers Vienna) is: http://www.hauptstadt.at/milch/Startseite.htm
5.1.2 Static and Dynamic Searches
The Web sources of the four services can be divided into two different types. The first type includes the gastronomy and the cinema service. For these services a search on the Web page is necessary. It is not possible (and not desirable) to extract all available data at once. For each service different search criteria are available; they are described in detail later. Furthermore, the cinema data changes each week and therefore has to be extracted regularly. All search restrictions are applied dynamically on the Web form, for each user separately. The second type includes the grocery and the pharmacy service. Here the data pool is not very large and can be extracted at once. Moreover, there are no specific search possibilities on the Web page itself.
5.1.3 Location Based Service
The original idea was to make this service location-based, that is, hits are sorted by their distance to the position of the user. As the evaluation has shown, this can only be done with the help of GIS software, which is too complex to integrate into InfoPipes at the moment. Therefore the exact position is not used for the search; only the state boundaries (in the case of Vienna, the districts) are considered. GIS software could also produce a map of the vicinity and routing information. As I do not use any GIS software, I have to produce them in another way, namely by using a Web page offering this service. The source is described in section 5.2.6. For details about location based services (LBS) and geographic information systems (GIS) see chapter 4.
5.1.4 The User Interface and InfoPipes
The UI for this application is a Web page which will be integrated into a portal. The portal will be responsible for the connection and data transfer to InfoPipes, so these actions are not described in detail. Because the UI design is not part of the thesis, its functionality is only described as far as it is needed for understanding the application. A prototype of the UI is shown in figures 5.2 to 5.5.
Figure 5.2: Prototype of UI: Search Mask
Figure 5.3: Prototype of UI: Result List
Figure 5.4: Prototype of UI: Map of Vicinity
Figure 5.5: Prototype of UI: Routing Information
5.2 Data Sources
In this section the source Web pages are described, together with the structure of the XML produced by extracting each page and, where needed, the transformations applied to this XML structure.
5.2.1 Common Output Structure
Before the individual sources are examined, we give a list and description of the tags which are the same for all four services (but not for the map and routing information). For the XML output we use the following notation: the XML code is listed on the left-hand side and the occurrence of the elements is described on the right-hand side. The letters used have the following meaning:

1: may only occur once
n: may occur multiple times
m: mandatory
o: optional

The content of the elements is omitted here for better readability; therefore empty elements may occur if there are no nested elements.

<document>              (1,m)
  <item>                (n,m)
    <name/>             (1,m)
    <street/>           (1,m)
    <zipcode/>          (1,m)
    <city/>             (1,m)
    <phone/>            (n,o)
    <web/>              (n,o)
    <description/>      (n,o)
  </item>
</document>
The <phone>-tag, <web>-tag, <description>-tag and of course also the <item>-tag can occur multiple times; all others may occur only once in the context of their parent tag. The <document>-tag, <item>-tag, <name>-tag, <street>-tag, <zipcode>-tag and <city>-tag are mandatory; all others are optional.
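The occurrence rules of the common output structure can be checked mechanically. The following Python sketch uses the standard `xml.etree.ElementTree` module; the function name `check_document` and the sample document are hypothetical and not part of InfoPipes. It assumes that "mandatory" for <item> means that at least one item must be present.

```python
import xml.etree.ElementTree as ET

MANDATORY_ONCE = ("name", "street", "zipcode", "city")   # (1,m)
OPTIONAL_MANY = ("phone", "web", "description")          # (n,o)


def check_document(xml_text):
    """Return a list of violations of the common output structure."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != "document":
        problems.append("root element must be <document>")
    items = root.findall("item")
    if not items:
        problems.append("<document> needs at least one <item>")
    for index, item in enumerate(items, start=1):
        # Mandatory tags must occur exactly once per item.
        for tag in MANDATORY_ONCE:
            count = len(item.findall(tag))
            if count != 1:
                problems.append(f"item {index}: <{tag}> occurs {count} times")
        # Anything outside the seven known tags is reported.
        for child in item:
            if child.tag not in MANDATORY_ONCE + OPTIONAL_MANY:
                problems.append(f"item {index}: unexpected <{child.tag}>")
    return problems


# Hypothetical sample: one item with two phone numbers.
sample = """<document>
  <item>
    <name>Example Shop</name><street>Example Street 1</street>
    <zipcode>1020</zipcode><city>Wien</city>
    <phone>000</phone><phone>111</phone>
  </item>
</document>"""
problems = check_document(sample)
```

An empty list means the document respects the structure; each string in a non-empty list names one violated rule.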
5.2.2 Gastronomy
URL: http://www.gastroweb.at

From the start page select "österreichweit" at the bottom of the page and then select "Wer hat jetzt offen" on the right-hand side of the newly opened page to get the search window described below. This form is used because it allows a date and time to be entered as selection criteria (see figure 5.6). If only restaurants which offer hot meals at the given time are of interest, the option "Wo gibt es jetzt warme Küche" has to be used. The query parameters (see next section) are the same for both queries.
Figure 5.6: Search on the Gastroweb Page
Query Parameters
The search possibilities of this page are various, and not all of them are used in this application. The ones used are described in table 5.1. As can be seen there, the different restaurant types are not very well subdivided, as there are too many similar categories (like Restaurant and Hotelrestaurant). Some restaurants are in more than one category, so it is not easy to find, e.g., all locations where you can get warm dishes. It would be a good idea to combine the types into more general groups, but then the search becomes much more difficult: for each type on the page an extra search has to be performed, the result lists have to be combined and, as some entries are in more than one group, duplicates have to be removed.

Output - Source

The structure of the XML file that is produced by the source component looks like the following:

<document>
  <restaurant>          (n,m)
    <name/>             (1,m)
    <street/>           (1,m)
    <zipcode/>          (1,m)
    <city/>             (1,m)
    <phone/>            (n,o)
    <hours>             (n,m)
      <opening/>        (1,m)
      <closing/>        (1,m)
      <day/>            (1,m)
    </hours>
  </restaurant>
</document>
The <restaurant>-tag wraps all information about one single restaurant and occurs multiple times in the XML document. For each day on which it is open, a restaurant has an <hours>-tag which includes an <opening>-tag, a <closing>-tag and a <day>-tag.
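Combining the result lists of several per-type searches and removing duplicates, as discussed for the grouped restaurant types, could look like the following Python sketch. It is not part of the application described here. The function name `merge_results` is hypothetical, and treating two entries as the same restaurant when name, street and zipcode agree is an assumption.

```python
def merge_results(result_lists):
    """Combine several result lists, keeping the first copy of each restaurant.

    Each restaurant is a dict with at least "name", "street" and "zipcode".
    Two entries count as duplicates when these three fields agree
    (an assumption made for this sketch).
    """
    seen = set()
    merged = []
    for results in result_lists:
        for restaurant in results:
            key = (restaurant["name"], restaurant["street"], restaurant["zipcode"])
            if key not in seen:
                seen.add(key)
                merged.append(restaurant)
    return merged


# Hypothetical results of two searches for different restaurant types.
type_a = [{"name": "A", "street": "S 1", "zipcode": "1010"}]
type_b = [
    {"name": "A", "street": "S 1", "zipcode": "1010"},
    {"name": "B", "street": "S 2", "zipcode": "1020"},
]
combined = merge_results([type_a, type_b])
```

The order of the first occurrences is preserved, so the merged list still follows the order in which the individual searches were performed.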
5.2.3 Cinema
URL: http://www.falter.at/programm/kino/kino.php

This URL leads directly to the form page for the cinema and film search (see figure 5.9). No further navigation is needed.

Query Parameters

The query parameters are described in table 5.2. The time constraint offered by the form does not work correctly. A selection of 20 - 21, for example, returns all films starting later than 20:00, and not only those starting between 20:00 and 21:00.
parameter | name | type | selection list (caption) | selection list value
day | tag | text | yyyy-mm-dd |
hour | stunde | select | 00, 01, 02, ..., 23 | 00, 01, 02, ..., 23
minute | minute | select | 00, 15, 30, 45 | 00, 15, 30, 45
state | bundesland | select | keine Einschränkung, Wien, Niederösterreich, Oberösterreich, Burgenland, Steiermark, Salzburg, Kärnten, Tirol, Vorarlberg | empty, W, N, O, B, St, Sa, K, T, V
type | lokaltyp2 | select | American/Cocktail Bar, Bar/Clubbar, Bierbeisl, Bierlokal, Bistro, Bäckerei/Konditorei, Cafe/Bar, Cafe Konditorei, Cafe Restaurant, Catering, Discothek, Eissalon, Espresso, Fastfood-Lokal, Gasthaus, Gasthaus/Wiener Beisl, Heurigenrestaurant, Heuriger, Hotelrestaurant, Imbiss, Irishpub, Kaffeehaus, Pub, Restaurant, Spezialitätenrestaurant, Stadtheuriger, Stehcafe, Suppenspezialitäten, Vinothek, Wiener Beisl, Würstelstand | 26, 10, 20, 11, 23, 31, 22, 7, 6, 27, 28, 14, 8, 15, 19, 4, 21, 16, 3, 24, 13, 5, 12, 2, 9, 17, 32, 29, 18, 30, 25

Table 5.1: Query Parameters of the Gastronomy Page
Figure 5.7: The Result List on the Gastroweb Page
Figure 5.8: Restaurant Details of the Gastroweb Result List
Figure 5.9: Falter Search

parameter | name | type | selection list (caption) | selection list value
date | datum | select | HEUTE, [day], dd.mm., [day], dd.mm., ..., Alle verfügbaren Termine | yyyy-mm-dd, yyyy-mm-dd, ..., empty
time | zeit | select | Alle verfügbaren Zeiten, vor 15.00 Uhr, 15.00 - 16.00 Uhr, ..., 23.00 - 24.00 Uhr | empty, 14, 15, ..., 23
state | bdsland | select | Wien, Niederösterreich, Burgenland, Steiermark, Kärnten, Salzburg, Tirol, Vorarlberg, Oberösterreich | 1, 2, 3, 4, 5, 6, 7, 8, 9
city | ort | text | |
district | bezirk | select | Alle Bezirke, 1., Innere Stadt, 2., Leopoldstadt, ..., 22., Donaustadt | empty, 1010, 1020, ..., 1220
cinema name | kino | text | |
handicapped accessible | behgerecht | checkbox | |
original version | original | checkbox | |

Table 5.2: Query Parameters of the Cinema Page
Output - Source

The structure of the XML file that is produced by the source component looks like this:

<document>
  <cinemaRegion>          (n,m)
    <cinemaLink>          (1,m)
      <cinemaName/>       (1,m)
      <web/>              (n,o)
      <phone/>            (n,o)
      <zipcode/>          (1,m)
      <street/>           (1,m)
      <city/>             (1,m)
    </cinemaLink>
    <filmRegion>          (n,m)
      <startingTime/>     (n,m)
      <filmLink>          (1,m)
        <filmTitle/>      (1,m)
        <filmDesc/>       (n,m)
      </filmLink>
    </filmRegion>
  </cinemaRegion>
  <date>13.02.</date>     (1,m)
</document>
Each cinema is wrapped in a <cinemaRegion>-tag. Each <cinemaRegion>-tag contains one <cinemaLink>-tag, which holds the data of the cinema, and one <filmRegion>-tag for each film playing. Each <filmRegion>-tag contains one <startingTime>-tag for each starting time of the film and a <filmLink>-tag in which the title and the description of the film are nested. The description of the film contains the name of the film, production country and year, director and actors, and a description of the story.
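Because the time constraint of the search form also returns later showings, the extracted result could be filtered once more on the client side. The following Python sketch shows one way to do this on the XML structure above. It is not part of the described pipe, the function name `films_in_window` is hypothetical, and it assumes that starting times are extracted as text of the form "HH:MM", which the source does not state.

```python
import xml.etree.ElementTree as ET


def films_in_window(xml_text, start_hour, end_hour):
    """List (cinema, film, time) for showings with start_hour <= hour < end_hour."""
    root = ET.fromstring(xml_text)
    hits = []
    for cinema in root.findall("cinemaRegion"):
        cinema_name = cinema.findtext("cinemaLink/cinemaName", default="")
        for film in cinema.findall("filmRegion"):
            title = film.findtext("filmLink/filmTitle", default="")
            for showing in film.findall("startingTime"):
                text = (showing.text or "").strip()
                hour = int(text.split(":")[0])  # assumes the "HH:MM" format
                if start_hour <= hour < end_hour:
                    hits.append((cinema_name, title, text))
    return hits


# Hypothetical extraction result with one early and one late showing.
sample = """<document>
  <cinemaRegion>
    <cinemaLink><cinemaName>Example Kino</cinemaName></cinemaLink>
    <filmRegion>
      <startingTime>20:15</startingTime>
      <startingTime>22:30</startingTime>
      <filmLink><filmTitle>Example Film</filmTitle></filmLink>
    </filmRegion>
  </cinemaRegion>
  <date>13.02.</date>
</document>"""
evening = films_in_window(sample, 20, 21)
```

With the sample above, only the 20:15 showing remains for the window from 20 to 21.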
5.2.4 Pharmacy
URL: http://mumps.apotronik.at/mweb/msm.web?EP=AKAL

The page that lists which pharmacies have night service is not directly accessible, because it is the result of a previous search. The above URL leads to this search page (see figure 5.13), where date and district can be entered. We use the possibility to get the list for all dates and all districts at once. The form fields used are described in table 5.3, where the boldfaced values are selected by InfoPipes.

Output - Source

The structure of the XML file that is produced by the source component looks like this:

<document>
  <pharmacyRegion>      (n,m)
    <name/>             (1,m)
    <city/>             (1,m)
    <street/>           (1,m)
    <zipcode/>          (1,m)
    <phone/>            (n,o)
    <group/>            (1,m)
  </pharmacyRegion>
</document>

Figure 5.10: Falter Result List

Figure 5.11: Falter Cinema Details

parameter | name | type | selection list (caption) | selection list value
day | TAG | select | keine Einschränkung, vorgestern, [day], dd-MM-yyyy, gestern, [day], dd-MM-yyyy, heute, [day], dd-MM-yyyy, morgen, [day], dd-MM-yyyy, übermorgen, [day], dd-MM-yyyy | empty, -2, -1, 0, 1, 2
district | DISTRICT | select | alle Bezirke, 1. Inner Stadt, ..., 23. Liesing, xx. Schwechat | 0, 1, ..., 23, 30

Table 5.3: Query Parameters of the Pharmacy Page

Figure 5.12: Falter Film Details

Figure 5.13: Night Pharmacies Vienna Search
Figure 5.14: Night Pharmacies Vienna Result List
Transformation

The pharmacies are divided into seven different groups numbered from one to seven. Each group is on duty on a different day, and on weekends the same group is on duty for both days. The groups change regularly in the order of their number. To transform a date into a group number, the following formula is used:

w ... week of the year (1 to 52)
d ... day of the week (Monday = 1, ..., Saturday = Sunday = 6)
g ... group

g = d - w + 6 (mod 7)

The calculated group is on duty from 8 a.m. until 8 a.m. the next day. For times between midnight and 8 a.m., the day number of the previous day therefore has to be used. This transformation has to be done by the UI.
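The date-to-group transformation can be sketched in Python as follows. The sketch reads the formula as g = (d - w + 6) mod 7, which matches the stated rotation (six duty slots per week, seven groups), and it takes the week of the year from the ISO calendar; both are assumptions, and ISO weeks can reach 53 while the text speaks of 1 to 52. The function name `pharmacy_group` is hypothetical. How the residue 0 maps to a group number is not stated in the text, so the raw residue is returned.

```python
from datetime import datetime, timedelta


def pharmacy_group(moment):
    """Evaluate g = (d - w + 6) mod 7 for the duty period containing `moment`."""
    # Duty runs from 8 a.m. to 8 a.m., so early-morning hours
    # belong to the duty period of the day before.
    if moment.hour < 8:
        moment -= timedelta(days=1)
    _, week, weekday = moment.isocalendar()  # ISO week and weekday (Monday = 1)
    d = min(weekday, 6)                      # Saturday = Sunday = 6
    return (d - week + 6) % 7


# Wednesday, 13 February 2002 (ISO week 7), in the evening
# and in the following early morning.
evening = pharmacy_group(datetime(2002, 2, 13, 22, 0))
early_morning = pharmacy_group(datetime(2002, 2, 14, 3, 0))
```

Both calls fall into the same duty period and therefore yield the same value; Saturday and Sunday of one week also yield the same value, as required for weekends.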
5.2.5 Grocery
URL: www.hauptstadt.at/milch/Startseite.htm
This page lists, for each district of Vienna, shops with extended opening hours. Each district resides on its own page (see figure 5.15).
Figure 5.15: List of Shops
Output - Source

The structure of the XML file that is produced by the source component looks like this:

<document>
  <shop>
    <name/>
    <street/>
    <opening>
      <timeFrom/>
      <timeTo/>
      <dayFrom/>
      <dayTo/>
    </opening>
    <web/>
    <phone/>
    <reachable/>
    <description/>
  </shop>
  <districtnumber/>
  <districtname/>
</document>
(n,m) (n,m) (n,m) (1,m) (1,m) (1,o) (1,o) (n,o) (n,o) (1,o) (1,o) (1,m) (1,m)
The structure of the XML is very much like the default structure described at the beginning of this section (see section 5.2). The opening hours are described below; the <reachable>-tag includes a description of how to reach the shop by public transport; the <description>-tag includes a verbal description of the shop; the tags districtnumber and districtname occur only once for each district and have to be mapped to each shop later on.

Opening Hours Transformation

The opening hours on this page have to be transformed into a form that is easier to search. The page uses different formats, which have to be handled separately:

1. Mo-Fr 9-22h
2. Do 12-20h
3. Mo, Fr 10-22h
4. täglich 0-24h

The first step is done in Lixto, where the different styles are extracted into different patterns and where the time interval is also split into separate elements. The XML tags corresponding to the above examples are:

1. <opening>
     <timeFrom>9</timeFrom>
     <timeTo>22</timeTo>
     <dayFrom>Mo</dayFrom>
     <dayTo>Fr</dayTo>
   </opening>
2. <opening>
     <timeFrom>12</timeFrom>
     <timeTo>20</timeTo>
     <daySingle>Do</daySingle>
   </opening>
3. <opening>
     <timeFrom>10</timeFrom>
     <timeTo>22</timeTo>
     <daySingle>Mo</daySingle>
     <daySingle>Fr</daySingle>
   </opening>
4. <opening>
     <timeFrom>0</timeFrom>
     <timeTo>24</timeTo>
     <daily>tägl</daily>
   </opening>
As a next step, an extra tag has to be generated for each day. This is done in InfoPipes with a manually generated stylesheet. (The stylesheet is shown in appendix A.4.2.) For each day of the week on which the shop is open, the final structure then looks like the following:

<opening>
  <timeFrom>10</timeFrom>
  <timeTo>22</timeTo>
  <day>Fr</day>
</opening>

With this structure it is easy to determine all open shops for a given date and time.
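The effect of this expansion can be illustrated in Python. The sketch below is not the stylesheet from appendix A.4.2 but a stand-in that produces one (day, from, to) entry per opening day from the extracted <opening> elements. The function name `expand_opening` is hypothetical, the abbreviations Di, Mi, Sa and So are assumed to be used alongside the Mo, Do and Fr seen in the examples, and day ranges that wrap around the end of the week are not handled.

```python
import xml.etree.ElementTree as ET

# Assumed day abbreviations in week order (only Mo, Do, Fr appear in the text).
DAYS = ["Mo", "Di", "Mi", "Do", "Fr", "Sa", "So"]


def expand_opening(opening):
    """Turn one extracted <opening> element into a list of (day, from, to)."""
    time_from = int(opening.findtext("timeFrom"))
    time_to = int(opening.findtext("timeTo"))
    if opening.find("daily") is not None:
        days = list(DAYS)                      # open every day
    elif opening.find("dayFrom") is not None:
        first = DAYS.index(opening.findtext("dayFrom"))
        last = DAYS.index(opening.findtext("dayTo"))
        days = DAYS[first:last + 1]            # inclusive day range
    else:
        days = [el.text for el in opening.findall("daySingle")]
    return [(day, time_from, time_to) for day in days]


weekdays = expand_opening(ET.fromstring(
    "<opening><timeFrom>9</timeFrom><timeTo>22</timeTo>"
    "<dayFrom>Mo</dayFrom><dayTo>Fr</dayTo></opening>"))
singles = expand_opening(ET.fromstring(
    "<opening><timeFrom>10</timeFrom><timeTo>22</timeTo>"
    "<daySingle>Mo</daySingle><daySingle>Fr</daySingle></opening>"))
```

Each resulting tuple corresponds to one <opening> element of the final structure, so a lookup for a given day and hour reduces to a simple comparison.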
5.2.6 Map and Routing
URL: http://www.shellgeostar.at/share/iti.asp

On this page two locations can be entered, and the system generates a verbal routing description from one location to the other as well as an image of this route (see figure 5.16) and, optionally, an image of the destination.
Figure 5.16: Search Parameter for Routing and Map
what | name | type | selection list (caption) | selection list value
Ausgangsort: | | | |
Strasse | ITI_START_ADDRESS | text | |
Stadt | ITI_START_CITYNAME | text | |
Postleitzahl | ITI_START_ZIPCODE | text | |
Land | ITI_START_COUNTRYCODE | select | Österreich, ... | AT, ...
Ziel: | | | |
Strasse | ITI_END_ADDRESS | text | |
Stadt | ITI_END_CITYNAME | text | |
Postleitzahl | ITI_END_ZIPCODE | text | |
Land | ITI_END_COUNTRYCODE | select | Österreich, ... | AT, ...

Table 5.4: Query Parameters for the Routing Information and Map generation

Query Parameters

The query parameters are listed in table 5.4. The country is already preselected with Österreich because of the Austrian URL (the same site is also available under .com or .de, among others), and the language on the page is German.

Output - Source

<document>
  <step>                (n,m)
    <stepNumber/>       (1,m)
    <distance/>         (1,o)
    <time/>             (1,o)
    <stepDescription/>  (1,m)
  </step>
  <timeanddistance/>    (1,m)
  <routeImageURL/>      (1,m)
  <destImgURL/>         (1,m)
</document>

The routing information consists of several steps, each of which resides in its own <step>-tag. Each step has a number, a distance, a time and a verbal description of where to go. At the end of the document a summary of the travel time and distance is given (<timeanddistance>-tag). The images of the whole route and of the destination point can be viewed using the extracted URLs once the transformation explained below has been applied.

Transformation

In the URLs of the images, every &amp; has to be replaced by a plain &, and the ../ has to be replaced by http://www.shellgeostar.at.
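The URL transformation for the images can be sketched in Python as follows. The function name `fix_image_url` and the example path are hypothetical. The sketch assumes that the extracted URL starts with ../ and that a slash is intended between the host name and the remaining path.

```python
BASE_URL = "http://www.shellgeostar.at"


def fix_image_url(extracted):
    """Turn an extracted relative image URL into an absolute, unescaped one."""
    # Undo the XML escaping of the ampersands.
    url = extracted.replace("&amp;", "&")
    # Replace the leading ../ by the host (assumes a slash belongs in between).
    if url.startswith("../"):
        url = BASE_URL + "/" + url[len("../"):]
    return url


# Hypothetical extracted value of <routeImageURL>.
image_url = fix_image_url("../maps/route.asp?id=1&amp;size=2")
```

URLs that do not start with ../ are returned with only the ampersands unescaped, so the function can be applied to both image tags without further checks.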
Figure 5.17: Routing Information
Figure 5.18: Map of Vicinity
5.2.7 Max.mobil
To locate a mobile phone, a connection to the T-Mobile location server has to be established. The location server has an HTTP interface which is used for communication. To prevent third parties from intercepting or manipulating the information transferred, the location server supports secure communication via SSL, including the use of certificates to authenticate the requesting party and the T-Mobile platform.

The communication between client and location server is message based. The exact specification of the messages exchanged between the T-Mobile location server and an application, in this case the Night Owl application, is not given here; only a brief overview is provided. All messages from the client to the location server are issued by sending an HTTP POST request to a URL on the location server. All reply messages are sent as formatted strings containing the parameters specified for the specific message. Each parameter is enclosed in angle brackets (< and >). Whitespace is not significant.

Before sending location requests, the client must establish a connection to the location server with a Connection Setup-message. Unless an error occurs, the request is answered with a Connection Confirmation-message. After the connection is established, a location request including a mobile phone number can be sent. After the location server has located the mobile phone, a Location Response-message is returned. When the client no longer uses the connection (no further connection requests will be sent), a Connection Release-message should be sent, which is confirmed by a Connection Release Confirmation-message, and the connection is terminated. The coordinates obtained by this procedure are the coordinates of the nearest transmitter (see Cell of Origin in chapter 4).
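Since the reply messages are strings whose parameters are enclosed in angle brackets, a client has to split them into their parameters before use. The following Python sketch shows one way to do so. The function name `parse_reply` and the example reply are hypothetical, because the actual message specification is not reproduced here; the sketch only relies on the two stated rules (angle brackets, insignificant whitespace).

```python
import re


def parse_reply(reply):
    """Return the parameters of a reply message, in order of appearance."""
    # Each parameter is enclosed in < and >; surrounding whitespace is dropped.
    return [param.strip() for param in re.findall(r"<([^<>]*)>", reply)]


# Hypothetical reply string; real messages follow the T-Mobile specification.
parameters = parse_reply("<CONFIRM> < 42 >\n<OK>")
```

The meaning of each position in the returned list depends on the message type and would have to be taken from the specification.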
http://.../gate?class=maxLocate&proc=locatePoi&x=627437&y=4808%
+10&poi=ICQ%2DPOI&cont_poi=3&desc=BOTH&surround=0&details=REMARK%2BADDRESS

Figure 5.19: Example for a request to the GIS server of max.mobil

<?xml version="1.0" encoding="iso-8859-1" ?>
<!DOCTYPE locateReply (View Source for full doctype...)>
<locateReply>
  <ret_value>00</ret_value>
  <address>
    <country>A</country>
    <region>WIEN</region>
    <city>WIEN</city>
    <citydistrict>WIEN 3.,LANDSTRA_E</citydistrict>
    <street1>KELSENSTRASSE</street1>
  </address>
</locateReply>

Figure 5.20: Example answer for the request in figure 5.19
To convert these coordinates into street addresses, the GIS interface of max.mobil can be used. It, too, uses the HTTP protocol, but unlike the location server, no connection has to be established, and results are sent in XML. The only message type which is relevant for us is the locate poi-message, in which the coordinates already obtained are included (see figures 5.19 and 5.20 for an example). Unfortunately the location server provides coordinates in WGS84 format, whereas the GIS server expects them in Austrian Lambert coordinates. Therefore the coordinates have to be transformed before the GIS server is used. (Max.mobil will provide a method for this conversion, but none is documented yet.)
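Reading the address out of such an XML answer is straightforward with Python's standard `xml.etree.ElementTree` module. The sketch below uses the element names from figure 5.20; the function name `parse_locate_reply` is hypothetical, and the DOCTYPE line of the figure is left out of the sample because it is shown there only in abbreviated form. What the value of <ret_value> signifies is not stated here, so it is returned unchanged.

```python
import xml.etree.ElementTree as ET


def parse_locate_reply(xml_bytes):
    """Return (ret_value, address dict) from a locateReply document."""
    root = ET.fromstring(xml_bytes)
    address = root.find("address")
    # Map each child of <address> to its text, e.g. "street1" -> "KELSENSTRASSE".
    fields = {} if address is None else {child.tag: child.text for child in address}
    return root.findtext("ret_value"), fields


# Sample answer modelled on figure 5.20 (DOCTYPE omitted).
reply = b"""<?xml version="1.0" encoding="iso-8859-1" ?>
<locateReply>
  <ret_value>00</ret_value>
  <address>
    <country>A</country>
    <region>WIEN</region>
    <city>WIEN</city>
    <citydistrict>WIEN 3.,LANDSTRA_E</citydistrict>
    <street1>KELSENSTRASSE</street1>
  </address>
</locateReply>"""
ret_value, address = parse_locate_reply(reply)
```

The resulting dictionary can then be used to prefill the location fields of the search.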
5.3 IP Architecture
In this section the architecture of the IP components is described for each service. For each service a separate pipe is generated, which can be used as a subpipe by other applications. The pipes do not contain any deliverer, because delivery is the task of the user interface for this application, which is not part of the thesis. So there are four pipes, labeled Gastronomy, Cinema, Pharmacy, and Grocery.
5.3.1 Gastronomy Pipe
This pipe is very simple and contains only one user-parameterized source component. The XML output of this source already has a suitable format, and therefore no further transformations have to be done. Because the selection is done by the source component, there is also no need for further restrictions. In future releases it will be possible to add further sources to get a better coverage of Austria. These different sources have to be combined with an integrator. A graphic of the pipe is shown in figure 5.21.
Figure 5.21: Gastronomy Pipe
5.3.2 Cinema Pipe
For this pipe the same arguments apply as for the gastronomy pipe. It also consists of only one user-configurable source, and no other components are needed. A graphic of the pipe is shown in figure 5.22.
Figure 5.22: Cinema Pipe
5.3.3 Pharmacy Pipe
The pipe for the pharmacies contains only the source for Vienna. Further restrictions on the group and the district are performed with a transformer. A graphic of the pipe is shown in figure 5.23.
5.3.4 Grocery Pipe
Figure 5.23: Pharmacy Pipe

In this pipe, the InfoPipes capability of recursive wrapping is used. For each district of Vienna a separate Web page listing the groceries in that district exists, and they are collected with an other type link in InfoPipes. One source component, the master, extracts all links to the target pages (in this case its input is a static XML document, because the links don't change), and these links are used in a second source component, the detail source component, to extract information from these pages. The master source component replaces the links with the data extracted by the detail source component, which is connected to it. The next component is an integrator which changes the structure of the output using a manually generated stylesheet, which can be found in appendix A.4.2. Finally, a transformer selects the shops with the desired combination of district and opening hours (see figure 5.24).
Figure 5.24: Admin View of the Grocery Transformer

A graphic of the pipe is shown in figure 5.25.
Figure 5.25: Grocery Pipe
5.4 Use Case Descriptions
Figure 5.26 gives an overview of how the use cases described in this section interact with each other. Use cases in boxes with dashed lines are optional and are only invoked if the user chooses so. Use case U1 includes all the others as sub use cases.
Figure 5.26: Interaction of the Use Cases
5.4.1 U1 - User Interaction
Characteristic Information

Goal in Context: User receives a list of nearby open locations and views a map of the vicinity and routing information to one of them.
Primary Actor: User
Initial Action: User loads night owl page

Main Success Scenario

1. User loads night owl page
2. User selects one of the four available services and enters service specific data (if needed). The service specific query parameters are described in section 5.2.
3. User enters location data (which is the state and, in the case of Vienna, also the district) (opt.)
4. User enters date and time (opt.)
5. User enters cardinality of return set (opt. because there is a default value)
6. User triggers search
7. UI passes the data to InfoPipes, IP performs search (see use cases S1 and S2), generates a result list containing at least the names and addresses of the matching items, and passes it to the UI, which displays the list
8. User browses the list, chooses one item, and requests map of vicinity
9. UI passes request for map to InfoPipes, IP performs search (see use case S3) and returns URL of map to UI, which displays the map
10. User requests routing information for the selected item
11. UI passes request for routing information to InfoPipes, IP performs search (see use case S3) and returns routing information to UI, which displays the routing information
12. User is happy ;-)

Variations

Automatic location detection: In case of the PDA frontend and the use of a max.mobil phone, the locating service of max.mobil will be used to determine the current position of the user (see section 5.4.5). In case of an Access Point the location of the terminal will be used, but in all other cases the location must be entered.

System date: If date and/or time are not entered, the current date and/or time is used. For the cinema service the possible date range is limited to approximately one week, because the cinema program changes every week and data is only available for the current and the following week.

Extensions

User input is incomplete: If mandatory data is missing, the user is prompted to enter it (check is performed by the UI).

User input is incorrect:
invalid date or time: The user is prompted to enter a valid date or time (check is performed by the UI).
date out of range for cinema service: No check can be performed in advance, but an empty result list will be produced.
invalid source specific data: No check can be performed in advance, but an empty result list will be produced.

No items found: This can happen because of too restrictive or incorrect search values, but also because of too few restrictions, in which case the server asks for more restrictions. In all cases an error message will be displayed.

Service unavailable: If the service is unavailable (server down, maintenance work, ...) an error message will be displayed.

Automatic location detection unavailable or error occurred: An error message will be displayed and the user is prompted to enter his location manually.

Too many items found: If the result list contains more items than the entered cardinality, it has to be limited. The user has to be informed that more items are available.
80
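The "too many items found" extension can be illustrated with a short sketch. This is not taken from the Night Owl implementation; the function name `limit_results` and the returned flag are assumptions made for illustration only.

```python
from typing import List, Tuple, TypeVar

T = TypeVar("T")


def limit_results(items: List[T], cardinality: int) -> Tuple[List[T], bool]:
    """Truncate a result list to the requested cardinality (hypothetical helper).

    Returns the (possibly shortened) list and a flag that tells the UI
    whether more items are available than are shown.
    """
    if cardinality < 0:
        raise ValueError("cardinality must not be negative")
    more_available = len(items) > cardinality
    return items[:cardinality], more_available


if __name__ == "__main__":
    shown, more = limit_results(["A", "B", "C", "D"], 3)
    print(shown, more)
```

Returning the flag together with the list lets the UI decide how to tell the user that the list was shortened.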
5.4.2 S1 - Static Search

Characteristic Information
Goal in Context: Generating a list of groceries or pharmacies
Primary Actor: User
Initial Action: User selected the grocery or pharmacy service and triggered a search (see use case U1).
Precondition: The UI checked the input data and triggers InfoPipes.

Main Success Scenario
1. User data is used as input for the transformer parameters. Date, time, and district of Vienna are required (these two services are only available for Vienna).
2. The appropriate items are selected. (For more details about the structure of the data and how items can be selected, see section 5.2.)
3. The selected items are delivered as XML.

Extensions
No items found: If no items are found, an error message will be displayed.
Service unavailable: If the service is unavailable (server down, maintenance work, ...), an error message will be displayed.
Too many items found: If the result list contains more items than the entered cardinality, it has to be limited. The user has to be informed that more items are available.
5.4.3 S2 - Dynamic Search

Characteristic Information
Goal in Context: Generating a list of restaurants or cinemas
Primary Actor: User
Initial Action: User selected the gastronomy or cinema service and triggered a search (see use case U1).
Precondition: The UI checked the input data and triggers InfoPipes.

Main Success Scenario
1. User input is used as input for the source component parameters.
2. The parameters are entered on the form and the server query is triggered.
3. Source component extracts the result list and transmits it to the deliverer. (No further selection is needed.)
4. The extracted items are delivered as XML.

Extensions
User input is incorrect:
    date out of range for cinema service: No check can be performed in advance, but an empty result list will be produced.
    invalid source-specific data: No check can be performed in advance, but an empty result list will be produced.
No items found: This can happen because the search values are too restrictive or incorrect, but also because too few restrictions were given and the server asks for more. In all cases an error message will be displayed.
Service unavailable: If the service is unavailable (server down, maintenance work, ...), an error message will be displayed.
Too many items found: If the result list contains more items than the entered cardinality, it has to be limited. The user has to be informed that more items are available.
5.4.4 S3 - System Generates Map of Vicinity and Routing Information

Characteristic Information
Goal in Context: Produce a map of the vicinity and routing information for the selected item
Primary Actor: User
Initial Action: User wants to see a map or routing information

Main Success Scenario
1. User selected an item in the result list (see use case U1) and wants to see the map of the vicinity or routing information for this item.
2. If automatic location detection is not used, the user can enter his exact position (optional).
3. Address information of the item and the address of the user are transmitted to the source component.
4. Source component enters the address data on the form and triggers the query on the corresponding server.
5. Source component extracts the URL of the generated maps and the textual description of the route and transmits them to the deliverer.
6. Deliverer component delivers the data.

Extensions
Service unavailable: If the service is unavailable (server down, maintenance work, ...), an error message will be displayed.
Missing address data: If any mandatory data is missing, the server opens a warning window and the user will be informed.
Wrong address: If some parts of the entered addresses are invalid or not unique (e.g. there are at least eight different cities called Hart in Austria), the server provides a list of possible completions. This list is forwarded to the user, who selects one item from it.
5.4.5 M1 - Max Mobile Phone Location

As the locating service of max.mobil is not yet working, this use case gives only a brief overview of how it could work.

Characteristic Information
Goal in Context: Obtain the state and optionally the district of the position of the mobile user.
Primary Actor: IP
Initial Action: User selects a service and does not enter any location

Main Success Scenario
1. User selects a service and does not enter any location
2. IP sends a location request to max.mobil
3. max.mobil determines the position of the mobile phone
4. IP parses the return document
5. IP converts the coordinates from WGS84 into Austrian Lambert
6. IP sends a second location request with the coordinates to max.mobil
7. IP parses the return document
8. IP transmits the data to the next IP component
Chapter 6
Conclusion and Future Goals

The first attempt to make this application location based revealed two major problems.

The first is that max.mobil is at the moment still implementing the interface for obtaining the position of mobile phones, and therefore information about how this service will work was scarce. It was also not possible to test the service to get a feeling for how to use it and how to integrate it into the application.

The second problem is that it is difficult to calculate the distance between two given points (as is needed to order the result items by distance). For this purpose one should use GIS software that is capable of transferring address data into geographical data and also of performing distance calculations and generating maps and routing information. Unfortunately such GIS software is very complex and therefore also expensive. Furthermore, the inclusion of a GIS into the existing system (InfoPipes) is too complex at the moment.

Some remarks about the sources:

Gastronomy: In future releases a server other than gastroweb should be included, because the data pool is very small, especially for states other than Vienna.
Cinema: This service is a good example of the use of Lixto and InfoPipes, because the data changes regularly and is interesting for many people.
Pharmacies: In future releases some effort should be made to include states other than Vienna as well. There are other pages for some areas, but no global rule for night services exists for all of Austria. Therefore each region has to be treated separately.
Grocery: There are almost no sources available for this service. It would be good to find some sources which cover other parts of Austria too.
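To make the second problem more concrete: once both points are available as geographic coordinates, the straight-line (great-circle) distance can be approximated with the haversine formula. The sketch below is only an illustration under that assumption; it does not solve the harder part named above, namely turning addresses into coordinates, and it measures distance as the crow flies rather than along streets. The function names and the item layout are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two WGS84 points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def sort_by_distance(user_pos, items):
    """Order items by distance from the user.

    `user_pos` is a (lat, lon) tuple; each item is a dict with hypothetical
    keys "lat" and "lon" holding already geocoded coordinates.
    """
    lat, lon = user_pos
    return sorted(items, key=lambda it: haversine_km(lat, lon, it["lat"], it["lon"]))


if __name__ == "__main__":
    # Illustrative coordinates only, not taken from any of the sources.
    items = [{"name": "far", "lat": 48.30, "lon": 16.50},
             {"name": "near", "lat": 48.21, "lon": 16.38}]
    print([it["name"] for it in sort_by_distance((48.20, 16.37), items)])
```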
Appendix A
Extraction Programs and Outputs

A.1 Gastronomy

A.1.1 Elog Rules

rootDocument(X0, X1) :-
    null(_, X0),
    getDocument(X0, X1).

rootDocument(X0, X1) :-
    nextlink(_, X0),
    getDocumentFromHref(X0, X1).

link(X0, X1) :-
    rootDocument(_, X0),
    subelem(X0, (.*.body.*.table.*.tr.*.td.*.table.*.tr.*.td.*.p.*.content, [(href, visitenkarte, substring)]), X1).

restaurant(X0, X1) :-
    link(_, X0),
    getDocumentFromHref(X0, X1).

details04(X0, X1) :-
    restaurant(_, X0),
    subelem(X0, (.*.body.*.table.*.tr.*.td.*.table.*.tr.*.td.*.table, [(elementtext, ÖFFNUNGSZEITEN, substring)]), X1).

details05(X0, X1) :-
    details04(_, X0),
    subelem(X0, (.*.tr, []), X1),
    notcontains(X1, (.*.td.*.p.*.content, [(elementtext, ÖFFNUNGSZEITEN, substring)])).

hours(X0, X1) :-
    details05(_, X0),
    subtext(X0, [A-Za-z]+[0-9][0-9]{2}:[0-9]{2}[0-9]{2}:[0-9]{2}, X1).

time(X0, X1) :-
    hours(_, X0),
    subtext(X0, [0-9]{2}:[0-9]{2}[0-9]{2}:[0-9]{2}, X1).

opening(X0, X1) :-
    time(_, X0),
    subtext(X0, [0-9]{2}:[0-9]{2}, X1).

closing(X0, X1) :-
    time(_, X0),
    subtext(X0, [0-9]{2}:[0-9]{2}$, X1).

day(X0, X1) :-
    hours(_, X0),
    subtext(X0, [A-Za-z]+, X1).

details01(X0, X1) :-
    restaurant(_, X0),
    subelem(X0, (.*.body.*.table.*.tr.*.td.*.table.*.tr.*.td.*.p, [(elementtext, ([\ s ].+|.{2}.), regexp), (columns, [2]:[2], substring)]), X1) [0, 0].

street(X0, X1) :-
    details01(_, X0),
    subelem(X0, (.*.content, [(elementtext, ([\ s ].+|.{2}.), regexp)]), X1) [1, 1].

name(X0, X1) :-
    details01(_, X0),
    subelem(X0, (.*.content, [(elementtext, ([\ s ].+|.{2}.), regexp)]), X1) [0, 0].

details03(X0, X1) :-
    details01(_, X0),
    subelem(X0, (.*.content, [(elementtext, Tel.:, substring)]), X1).

phone(X0, X1) :-
    details03(_, X0),
    subtext(X0, [0-9]., X1).

details02(X0, X1) :-
    details01(_, X0),
    subelem(X0, (.*.content, [(elementtext, [0-9]{4}., regexp)]), X1).

zipcode(X0, X1) :-
    details02(_, X0),
    subtext(X0, [0-9]{4}, X1).

city(X0, X1) :-
    details02(_, X0),
    subtext(X0, [A-Za-z]., X1).

nextlink(X0, X1) :-
    rootDocument(_, X0),
    subelem(X0, (.*.body.*.table.*.tr.*.td.*.table.*.tr.*.td.*.table.*.tr.*.td.*.p.*.content, [(elementtext, Weitere >>, substring), (href, , substring)]), X1).
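The rules hours, time, opening, closing and day above narrow a text region step by step: each subtext pattern is applied only inside the match of its parent pattern. The following Python sketch mimics that cascade with the standard `re` module. It is an illustration, not the Lixto implementation; the exact patterns are assumed readings (in particular the `[^0-9]*` gap, the `-` between the two times and the `^` anchor), and the sample strings in the usage are hypothetical.

```python
import re

# Assumed readings of the Elog subtext patterns (see lead-in).
HOURS = re.compile(r"[A-Za-z]+[^0-9]*[0-9]{2}:[0-9]{2}-[0-9]{2}:[0-9]{2}")
TIME = re.compile(r"[0-9]{2}:[0-9]{2}-[0-9]{2}:[0-9]{2}")
OPENING = re.compile(r"^[0-9]{2}:[0-9]{2}")
CLOSING = re.compile(r"[0-9]{2}:[0-9]{2}$")
DAY = re.compile(r"[A-Za-z]+")


def extract_hours(text: str) -> list:
    """Return one dict per 'day from-to' entry found in the text."""
    results = []
    for hours in HOURS.finditer(text):      # parent pattern "hours"
        span = hours.group(0)
        day = DAY.search(span).group(0)     # child of "hours"
        time = TIME.search(span).group(0)   # child of "hours"
        results.append({
            "day": day,
            "opening": OPENING.search(time).group(0),  # child of "time"
            "closing": CLOSING.search(time).group(0),  # child of "time"
        })
    return results


if __name__ == "__main__":
    print(extract_hours("Mo 10:00-22:00, Di 11:00-23:00"))
```

Applying each pattern only to its parent's match keeps the individual expressions short, which is the same design idea as the pattern hierarchy in the Elog program.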
A.2 Cinema

A.2.1 Elog Rules

rootDocument(X0, X1) :-
    null(_, X0),
    getDocument(X0, X1).

rootDocument(X0, X1) :-
    nextLink(_, X0),
    getDocumentFromHref(X0, X1).

allFilms(X0, X1) :-
    rootDocument(_, X0),
    subelem(X0, (.*.body.*.table.*.tr.*.td.*.table.*.tr.*.td, []), X1),
    contains(X1, (.*.p.*.content, [(href, kinosuche, substring)]), X3).

cinemaRegion(X0, X1) :-
    allFilms(_, X0),
    subsq(X0, (.*.p, []), (.content, [(href, kinosuche, substring)]), (.content, []), X1, 0, 1),
    after(X0, X1, (.*.p.*.content, [(href, kinosuche, substring)]), 0.0, 0.0, X3, X4).

cinemaRegion(X0, X1) :-
    allFilms(_, X0),
    subsq(X0, (.*.p, []), (.content, [(href, kinosuche, substring)]), (.content, []), X1, 0, 1),
    notafter(X0, X1, (.*.p.*.content, []), 50.0).

cinemaLink(X0, X1) :-
    cinemaRegion(_, X0),
    subelem(X0, (.*.content, [(href, kinosuche, substring)]), X1).

cinemaName(X0, X1) :-
    cinemaLink(_, X0),
    subatt(X0, elementtext, X1).

cinemaDoc(X0, X1) :-
    cinemaLink(_, X0),
    getDocumentFromHref(X0, X1).

adressRegion(X0, X1) :-
    cinemaDoc(_, X0),
    subelem(X0, (.*.body.*.table.*.tr.*.td.*.table.*.tr.*.td, [(elementtext, Telefon, substring)]), X1).

web(X0, X1) :-
    adressRegion(_, X0),
    subelem(X0, (.*.p.*.content, [(href, , substring)]), X1).

phoneRegion(X0, X1) :-
    adressRegion(_, X0),
    subelem(X0, (.*.p.*.content, [(elementtext, Telefon, substring)]), X1).

phone(X0, X1) :-
    phoneRegion(_, X0),
    subtext(X0, [0-9][a-z], X1).

adressRegion2(X0, X1) :-
    adressRegion(_, X0),
    subelem(X0, (.*.p.*.content, [(elementtext, [0-9]{4}., regexp)]), X1).

zipcode(X0, X1) :-
    adressRegion2(_, X0),
    subtext(X0, [0-9]{4}, X1).

adressRegion3(X0, X1) :-
    adressRegion2(_, X0),
    subtext(X0, ,., X1).

street(X0, X1) :-
    adressRegion3(_, X0),
    subtext(X0, [,], X1).

adressRegion4(X0, X1) :-
    adressRegion2(_, X0),
    subtext(X0, [0-9]{4}[,], X1).

city(X0, X1) :-
    adressRegion4(_, X0),
    subtext(X0, [ 0-9]., X1).

filmRegion(X0, X1) :-
    cinemaRegion(_, X0),
    subsq(X0, (, []), (.content, [(href, filmdetail, substring)]), (.content, [(elementtext, Zeit:, substring)]), X1, 0, 1).

timeRegion(X0, X1) :-
    filmRegion(_, X0),
    subelem(X0, (.*.content, [(elementtext, Zeit, substring)]), X1).

startingTime(X0, X1) :-
    timeRegion(_, X0),
    subtext(X0, [0-9][0-9]\.[0-9][0-9], X1).

original(X0, X1) :-
    timeRegion(_, X0),
    subtext(X0, OF|OV|OmU, X1).

filmLink(X0, X1) :-
    filmRegion(_, X0),
    subelem(X0, (.*.content, [(href, filmdetail, substring)]), X1).

filmTitle(X0, X1) :-
    filmLink(_, X0),
    subatt(X0, elementtext, X1).

filmDoc(X0, X1) :-
    filmLink(_, X0),
    getDocumentFromHref(X0, X1).

filmDesc(X0, X1) :-
    filmDoc(_, X0),
    subelem(X0, (.*.body.*.table.*.tr.*.td.*.table.*.tr.*.td, [(elementtext, [\ s ].+, regexp)]), X1),
    before(X0, X1, (.*.body.*.table.*.tr.*.td.*.table.*.tr.*.td.*.p.*.content, [(elementtext, Das beste Kinoprogramm Österreichs!, substring)]), 1.0, 89.0, X3, X4),
    after(X0, X1, (.*.body.*.table.*.tr.*.td.*.table.*.tr.*.td.*.p.*.content, [(elementtext, Termine:, substring)]), 0.0, 95.0, X5, X6).

nextLink(X0, X1) :-
    rootDocument(_, X0),
    subelem(X0, (.*.body.*.table.*.tr.*.td.*.table.*.tr.*.td.*.p.*.content, [(elementtext, nächste, substring), (href, , substring)]), X1) [0, 0].

date(X0, X1) :-
    rootDocument(_, X0),
    subelem(X0, (.*.body.*.table.*.tr.*.td.*.table.*.tr.*.td.*.p.*.content, [(elementtext, [0-9][0-9].[0-9][0-9]., regexp)]), X1),
    before(X0, X1, (.*.body.*.table.*.tr.*.td.*.table.*.tr.*.td.*.p.*.content, [(elementtext, Datum, substring)]), 0.0, 0.0, X3, X4).
A.3 Pharmacy

A.3.1 Elog Rules for Vienna

rootDocument(X0, X1) :-
    null(_, X0),
    getDocument(X0, X1).

allPharmacies(X0, X1) :-
    rootDocument(_, X0),
    subelem(X0, (.*.body.*.table, []), X1),
    notcontains(X1, (.*.tr.*.td.*.p.*.content, [(elementtext, webmaster@aponet.at, substring)])).

pharmacyRegion(X0, X1) :-
    allPharmacies(_, X0),
    subelem(X0, (.*.tr, [(columns, , substring)]), X1),
    notcontains(X1, (.*.td.*.p.*.content, [(font-weight, bold, substring), (b, , substring)])).

name(X0, X1) :-
    pharmacyRegion(_, X0),
    subelem(X0, (.*.td.*.p.*.content, [(elementtext, .[ A-Za-z]+., regexp), (columns, :[1], substring)]), X1).

city(X0, X1) :-
    pharmacyRegion(_, X0),
    subelem(X0, (.*.td.*.p.*.content, [(elementtext, [A-Za-z]+, regexp), (columns, :[4], substring)]), X1).

street(X0, X1) :-
    pharmacyRegion(_, X0),
    subelem(X0, (.*.td.*.p.*.content, [(columns, :[2], substring)]), X1) [0, 0].

phoneRegion(X0, X1) :-
    pharmacyRegion(_, X0),
    subelem(X0, (.*.td.*.p.*.content, [(elementtext, Tel:, substring), (columns, :[2], substring)]), X1).

phone(X0, X1) :-
    phoneRegion(_, X0),
    subtext(X0, [0-9]., X1).

zipcode(X0, X1) :-
    pharmacyRegion(_, X0),
    subelem(X0, (.*.td.*.p.*.content, [(elementtext, [0-9]., regexp), (columns, :[3], substring)]), X1).

group(X0, X1) :-
    pharmacyRegion(_, X0),
    subelem(X0, (.*.td.*.p.*.content, [(elementtext, .[0-9]., regexp), (columns, :[5], substring)]), X1).
A.4 Grocery

A.4.1 Elog Rules

rootDocument(X0, X1) :-
    null(_, X0),
    getDocument(X0, X1).

shop(X0, X1) :-
    rootDocument(_, X0),
    subelem(X0, (.*.body.*.table, []), X1).

name(X0, X1) :-
    shop(_, X0),
    subelem(X0, (.tr.td.table.tr.td.table.tr.td.p, [(elementtext, ([\ w].), regexp)]), X1).

shopdetails(X0, X1) :-
    shop(_, X0),
    subelem(X0, (.*.tr.*.td.*.table.*.tr.*.td.*.table.*.tr.*.td.*.table, []), X1).

shopdetrow(X0, X1) :-
    shopdetails(_, X0),
    subelem(X0, (.*.tr.*.td, []), X1).

web(X0, X1) :-
    shopdetrow(_, X0),
    subelem(X0, (.*.p.*.content, [(a, , substring), (href, , substring)]), X1).

detail05(X0, X1) :-
    shopdetrow(_, X0),
    subelem(X0, (.*.p, [(elementtext, Kommentar, substring)]), X1).

description(X0, X1) :-
    detail05(_, X0),
    subelem(X0, (.*.content, [(elementtext, ([\ s ].+|.{2}.), regexp)]), X1),
    notcontains(X1, (, [(elementtext, Kommentar, substring)])).

detail04(X0, X1) :-
    shopdetrow(_, X0),
    subelem(X0, (.*.p, [(elementtext, Erreichbar, substring)]), X1).

reachable(X0, X1) :-
    detail04(_, X0),
    subelem(X0, (.*.content, [(elementtext, ([\ s ].+|.{2}.), regexp)]), X1),
    notcontains(X1, (, [(elementtext, Erreichbar, substring)])).

detail03(X0, X1) :-
    shopdetrow(_, X0),
    subelem(X0, (.*.p, [(elementtext, Telefon, substring)]), X1).

phone(X0, X1) :-
    detail03(_, X0),
    subelem(X0, (.*.content, [(elementtext, [0-9]., regexp)]), X1).

detail02(X0, X1) :-
    shopdetrow(_, X0),
    subelem(X0, (.*.p, [(elementtext, Öffnungszeiten, substring)]), X1).

detail06(X0, X1) :-
    detail02(_, X0),
    subelem(X0, (.*.content, [(elementtext, .[ A-Za-z0-9]+., regexp)]), X1),
    notcontains(X1, (, [(elementtext, Öffnungszeiten, substring)])).

opening(X0, X1) :-
    detail06(_, X0),
    subtext(X0, [AZagiz][h], X1).

detail07(X0, X1) :-
    opening(_, X0),
    subtext(X0, [A-Za-z]{2},[0-9], X1).

daySingle(X0, X1) :-
    detail07(_, X0),
    subtext(X0, [A-Za-z]{2}, X1).

daySingle(X0, X1) :-
    opening(_, X0),
    subtext(X0, [A-Za-z]{2}(?= [0-9]), X1).

daySingle(X0, X1) :-
    opening(_, X0),
    subtext(X0, So/Ftg., X1).

time(X0, X1) :-
    opening(_, X0),
    subtext(X0, [0-9].[0-9], X1).

detail10(X0, X1) :-
    time(_, X0),
    subtext(X0, [0-9.]+, X1).

timeFrom(X0, X1) :-
    detail10(_, X0),
    subtext(X0, [0-9.]+, X1).

detail11(X0, X1) :-
    time(_, X0),
    subtext(X0, [0-9.]+, X1).

timeTo(X0, X1) :-
    detail11(_, X0),
    subtext(X0, [0-9.]+, X1).

dayFromTo(X0, X1) :-
    opening(_, X0),
    subtext(X0, [A-Za-z]{2}[A-Za-z]{2}, X1).

detail08(X0, X1) :-
    dayFromTo(_, X0),
    subtext(X0, [a-zA-Z]{2}, X1).

dayFrom(X0, X1) :-
    detail08(_, X0),
    subtext(X0, [a-zA-Z]{2}, X1).

detail09(X0, X1) :-
    dayFromTo(_, X0),
    subtext(X0, [a-zA-Z]{2}, X1).

dayTo(X0, X1) :-
    detail09(_, X0),
    subtext(X0, [a-zA-Z]{2}, X1).

daily(X0, X1) :-
    opening(_, X0),
    subtext(X0, tägl, X1).

detail01(X0, X1) :-
    shopdetrow(_, X0),
    subelem(X0, (.*.p, [(elementtext, Adresse, substring)]), X1).

street(X0, X1) :-
    detail01(_, X0),
    subelem(X0, (.*.content, [(elementtext, ([\ s ].+|.{2}.), regexp)]), X1),
    notcontains(X1, (, [(elementtext, Adresse, substring)])).

district(X0, X1) :-
    rootDocument(_, X0),
    subelem(X0, (.*.body.*.p.*.content, [(elementtext, ([\ s ].+|.{2}.), regexp)]), X1) [0, 0].

districtnumber(X0, X1) :-
    district(_, X0),
    subtext(X0, [0-9]+, X1).

districtname(X0, X1) :-
    district(_, X0),
    subtext(X0, [a-zA-Z]., X1).
A.4.2 XSLT

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<xsl:template name="day">
  <xsl:param name="dayname"/>
  <day> <xsl:value-of select="$dayname"/> </day>
  <xsl:for-each select="../timeFrom">
    <opening> <xsl:value-of select="text()"/> </opening>
  </xsl:for-each>
  <xsl:for-each select="../timeTo">
    <closing> <xsl:value-of select="text()"/> </closing>
  </xsl:for-each>
</xsl:template>

<!-- This file is automatically generated by the Integrator Component. Do not edit! -->
<xsl:template match="/">
<document>
<rootDocument>
<xsl:for-each select="/document">
<xsl:for-each select="./rootDocument">
<xsl:for-each select="./shop">
<shop>
  <xsl:for-each select="./name">
    <name> <xsl:value-of select="text()"/> </name>
  </xsl:for-each>
  <xsl:for-each select="./street">
    <street> <xsl:value-of select="text()"/> </street>
  </xsl:for-each>
  <hours>
  <xsl:for-each select="./opening">
    <xsl:for-each select="./daySingle">
      <xsl:call-template name="day">
        <xsl:with-param name="dayname" select="text()"/>
      </xsl:call-template>
    </xsl:for-each>
    <xsl:for-each select="./daily">
      <xsl:call-template name="day">
        <xsl:with-param name="dayname" select="'Mo'"/>
      </xsl:call-template>
      <xsl:call-template name="day">
        <xsl:with-param name="dayname" select="'Di'"/>
      </xsl:call-template>
      <xsl:call-template name="day">
        <xsl:with-param name="dayname" select="'Mi'"/>
      </xsl:call-template>
      <xsl:call-template name="day">
        <xsl:with-param name="dayname" select="'Do'"/>
      </xsl:call-template>
      <xsl:call-template name="day">
        <xsl:with-param name="dayname" select="'Fr'"/>
      </xsl:call-template>
      <xsl:call-template name="day">
        <xsl:with-param name="dayname" select="'Sa'"/>
      </xsl:call-template>
      <xsl:call-template name="day">
        <xsl:with-param name="dayname" select="'So'"/>
      </xsl:call-template>
    </xsl:for-each>
    <xsl:for-each select="./dayFrom">
      <xsl:variable name="A"><xsl:value-of select="text()"/></xsl:variable>
      <xsl:for-each select="../dayTo">
        <xsl:variable name="Z"><xsl:value-of select="text()"/></xsl:variable>
        <xsl:if test="$A='Mo'">
          <xsl:call-template name="day">
            <xsl:with-param name="dayname" select="'Mo'"/>
          </xsl:call-template>
        </xsl:if>
        <xsl:if test="($A='Mo' or $A='Di') and ($Z='Di' or $Z='Mi' or $Z='Do' or $Z='Fr' or $Z='Sa' or $Z='So')">
          <xsl:call-template name="day">
            <xsl:with-param name="dayname" select="'Di'"/>
          </xsl:call-template>
        </xsl:if>
        <xsl:if test="($A='Mo' or $A='Di' or $A='Mi') and ($Z='Mi' or $Z='Do' or $Z='Fr' or $Z='Sa' or $Z='So')">
          <xsl:call-template name="day">
            <xsl:with-param name="dayname" select="'Mi'"/>
          </xsl:call-template>
        </xsl:if>
        <xsl:if test="($A='Mo' or $A='Di' or $A='Mi' or $A='Do') and ($Z='Do' or $Z='Fr' or $Z='Sa' or $Z='So')">
          <xsl:call-template name="day">
            <xsl:with-param name="dayname" select="'Do'"/>
          </xsl:call-template>
        </xsl:if>
        <xsl:if test="($A='Mo' or $A='Di' or $A='Mi' or $A='Do' or $A='Fr') and ($Z='Fr' or $Z='Sa' or $Z='So')">
          <xsl:call-template name="day">
            <xsl:with-param name="dayname" select="'Fr'"/>
          </xsl:call-template>
        </xsl:if>
        <xsl:if test="($A='Mo' or $A='Di' or $A='Mi' or $A='Do' or $A='Fr' or $A='Sa') and ($Z='Sa' or $Z='So')">
          <xsl:call-template name="day">
            <xsl:with-param name="dayname" select="'Sa'"/>
          </xsl:call-template>
        </xsl:if>
        <xsl:if test="$Z='So'">
          <xsl:call-template name="day">
            <xsl:with-param name="dayname" select="'So'"/>
          </xsl:call-template>
        </xsl:if>
      </xsl:for-each>
    </xsl:for-each>
  </xsl:for-each>
  </hours>
  <xsl:for-each select="./web">
    <web> <xsl:value-of select="text()"/> </web>
  </xsl:for-each>
  <xsl:for-each select="./reachable">
    <reachable> <xsl:value-of select="text()"/> </reachable>
  </xsl:for-each>
  <xsl:for-each select="./description">
    <description> <xsl:value-of select="text()"/> </description>
  </xsl:for-each>
  <xsl:for-each select="./phone">
    <phone> <xsl:value-of select="text()"/> </phone>
  </xsl:for-each>
  <xsl:for-each select="../districtnumber">
    <district> <xsl:value-of select="text()"/> </district>
  </xsl:for-each>
  <city> Wien </city>
</shop>
</xsl:for-each>
</xsl:for-each>
</xsl:for-each>
</rootDocument>
</document>
</xsl:template>

<!-- Filtering out what we don't want -->
<xsl:template match="text()"> </xsl:template>

</xsl:stylesheet>
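The chain of xsl:if tests in the stylesheet emits every weekday that lies between dayFrom ($A) and dayTo ($Z). The same logic can be written compactly as a list slice. The sketch below is only meant to make that logic easier to follow; the function name `expand_day_range` is hypothetical and not part of the application. Like the stylesheet, it yields nothing for a range that wraps around the end of the week (e.g. Sa to Mo).

```python
DAYS = ["Mo", "Di", "Mi", "Do", "Fr", "Sa", "So"]  # German weekday abbreviations


def expand_day_range(day_from: str, day_to: str) -> list:
    """Return all weekdays from day_from to day_to, both inclusive.

    A day d is included exactly when index(day_from) <= index(d) <= index(day_to),
    which is what the xsl:if conditions express one day at a time.
    """
    start = DAYS.index(day_from)
    end = DAYS.index(day_to)
    return DAYS[start:end + 1]


if __name__ == "__main__":
    print(expand_day_range("Mo", "Fr"))
```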
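A.4.3 Master Source XML Input

<?xml version="1.0" encoding="ISO-8859-1"?>
<document>
<rootDocument> <otl subtype="milch" url="milch/01.htm" baseurl="http://www.hauptstadt.at"></otl> </rootDocument>
<rootDocument> <otl subtype="milch" url="milch/02.htm" baseurl="http://www.hauptstadt.at"></otl> </rootDocument>
<rootDocument> <otl subtype="milch" url="milch/03.htm" baseurl="http://www.hauptstadt.at"></otl> </rootDocument>
<rootDocument> <otl subtype="milch" url="milch/04.htm" baseurl="http://www.hauptstadt.at"></otl> </rootDocument>
<rootDocument> <otl subtype="milch" url="milch/05.htm" baseurl="http://www.hauptstadt.at"></otl> </rootDocument>
<rootDocument> <otl subtype="milch" url="milch/06.htm" baseurl="http://www.hauptstadt.at"></otl> </rootDocument>
<rootDocument> <otl subtype="milch" url="milch/07.htm" baseurl="http://www.hauptstadt.at"></otl> </rootDocument>
<rootDocument> <otl subtype="milch" url="milch/08.htm" baseurl="http://www.hauptstadt.at"></otl> </rootDocument>
<rootDocument> <otl subtype="milch" url="milch/09.htm" baseurl="http://www.hauptstadt.at"></otl> </rootDocument>
<rootDocument> <otl subtype="milch" url="milch/10.htm" baseurl="http://www.hauptstadt.at"></otl> </rootDocument>
<rootDocument> <otl subtype="milch" url="milch/11.htm" baseurl="http://www.hauptstadt.at"></otl> </rootDocument>
<rootDocument> <otl subtype="milch" url="milch/12.htm" baseurl="http://www.hauptstadt.at"></otl> </rootDocument>
<rootDocument> <otl subtype="milch" url="milch/13.htm" baseurl="http://www.hauptstadt.at"></otl> </rootDocument>
<rootDocument> <otl subtype="milch" url="milch/14.htm" baseurl="http://www.hauptstadt.at"></otl> </rootDocument>
<rootDocument> <otl subtype="milch" url="milch/15.htm" baseurl="http://www.hauptstadt.at"></otl> </rootDocument>
<rootDocument> <otl subtype="milch" url="milch/16.htm" baseurl="http://www.hauptstadt.at"></otl> </rootDocument>
<rootDocument> <otl subtype="milch" url="milch/17.htm" baseurl="http://www.hauptstadt.at"></otl> </rootDocument>
<rootDocument> <otl subtype="milch" url="milch/18.htm" baseurl="http://www.hauptstadt.at"></otl> </rootDocument>
<rootDocument> <otl subtype="milch" url="milch/19.htm" baseurl="http://www.hauptstadt.at"></otl> </rootDocument>
<rootDocument> <otl subtype="milch" url="milch/20.htm" baseurl="http://www.hauptstadt.at"></otl> </rootDocument>
<rootDocument> <otl subtype="milch" url="milch/21.htm" baseurl="http://www.hauptstadt.at"></otl> </rootDocument>
<rootDocument> <otl subtype="milch" url="milch/22.htm" baseurl="http://www.hauptstadt.at"></otl> </rootDocument>
<rootDocument> <otl subtype="milch" url="milch/23.htm" baseurl="http://www.hauptstadt.at"></otl> </rootDocument>
</document>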
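The master source input has a very regular structure: one rootDocument element per page, each holding one otl element whose url attribute carries a running two-digit page number. As an illustration of that structure only (this is not how the original file was produced), the following sketch builds such a document with Python's standard `xml.etree.ElementTree`. The function name `build_master_source` and the assumption that the url always has the form subtype/NN.htm are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_master_source(subtype: str, base_url: str,
                        first_page: int, last_page: int) -> ET.Element:
    """Build a <document> with one <rootDocument>/<otl> pair per page.

    Assumes (for illustration) that page URLs follow the pattern
    '<subtype>/<two-digit page number>.htm'.
    """
    document = ET.Element("document")
    for page in range(first_page, last_page + 1):
        root = ET.SubElement(document, "rootDocument")
        ET.SubElement(root, "otl", {
            "subtype": subtype,
            "url": f"{subtype}/{page:02d}.htm",
            "baseurl": base_url,
        })
    return document


if __name__ == "__main__":
    doc = build_master_source("milch", "http://www.hauptstadt.at", 1, 23)
    print(ET.tostring(doc, encoding="unicode"))
```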
A.5 Routing and Map

A.5.1 Elog Rules

rootDocument(X0, X1) :-
    null(_, X0),
    getDocument(X0, X1).

details01(X0, X1) :-
    rootDocument(_, X0),
    subelem(X0, (.*.body.*.table.*.tr.*.td.*.table.*.tr.*.td.*.table.*.tr.*.td.*.table, [(elementtext, Etappe, substring)]), X1),
    contains(X1, (.*.tr.*.td.*.p.*.content, [(elementtext, Wegbeschreibung, substring)]), X3).

step(X0, X1) :-
    details01(_, X0),
    subelem(X0, (.*.tr, [(elementtext, [0-9]., regexp)]), X1).

stepNumber(X0, X1) :-
    step(_, X0),
    subelem(X0, (.*.td.*.p, [(columns, [1][1, 2, 3][1]:[1], substring)]), X1).

distance(X0, X1) :-
    step(_, X0),
    subelem(X0, (.*.td.*.p, [(columns, [1][1, 2, 3][1]:[2], substring)]), X1).

time(X0, X1) :-
    step(_, X0),
    subelem(X0, (.*.td.*.p, [(columns, [1][1, 2, 3][1]:[3], substring)]), X1).

stepDescription(X0, X1) :-
    step(_, X0),
    subelem(X0, (.*.td.*.p, [(columns, [1][1, 2, 3][1]:[4], substring)]), X1).

timeanddistance(X0, X1) :-
    rootDocument(_, X0),
    subelem(X0, (.*.body.*.table.*.tr.*.td.*.table.*.tr.*.td.*.table.*.tr.*.td.*.table.*.tr.*.td.*.p, [(elementtext, Reisezeit, substring)]), X1),
    contains(X1, (.*.content, [(elementtext, Entfernung, substring)]), X3).

image(X0, X1) :-
    rootDocument(_, X0),
    subelem(X0, (.*.body.*.table.*.tr.*.td.*.table.*.tr.*.td.*.table.*.tr.*.td.*.p.*.img, [(src, image.asp, substring)]), X1).

imageURL(X0, X1) :-
    image(_, X0),
    subatt(X0, src, X1).

destImgLink(X0, X1) :-
    rootDocument(_, X0),
    subelem(X0, (.*.body.*.table.*.tr.*.td.*.table.*.tr.*.td.*.table.*.tr.*.td.*.table.*.tr.*.td.*.table.*.tr.*.td.*.p.*.content, [(elementtext, Zielort, substring), (href, map.asp, substring)]), X1).

destImgDoc(X0, X1) :-
    destImgLink(_, X0),
    getDocumentFromHref(X0, X1).

destImg(X0, X1) :-
    destImgDoc(_, X0),
    subelem(X0, (.*.body.*.input, [(src, image.asp, substring)]), X1).

destImageURL(X0, X1) :-
    destImg(_, X0),
    subatt(X0, src, X1).
Bibliography
[1] Kristion V.B. Andersen, Michael Cheng, and Rasmus Klitgaard-Nielsen. Location Base Information Query for Personal Digital Assistants. 2001.
[2] Robert Baumgartner, Sergio Flesca, and Georg Gottlob. Declarative information extraction, web crawling and recursive wrapping with Lixto, a.
[3] Robert Baumgartner, Sergio Flesca, and Georg Gottlob. Supervised wrapper generation with Lixto, b.
[4] Robert Baumgartner, Sergio Flesca, and Georg Gottlob. Visual web information extraction with Lixto. In Proceedings of the 27th VLDB Conference, Roma, Italy, 2001.
[5] Alistair Cockburn. Structuring use cases with goals. Journal of Object-Oriented Programming, Sept-Oct 1997a.
[6] Alistair Cockburn. Structuring use cases with goals. Journal of Object-Oriented Programming, Nov-Dec 1997b.
[7] Robert Eckstein and Michel Casabianca. XML Pocket Reference. O'Reilly, 2001.
[8] GIS. Introduction to GIS and geospatial data. URL http://www.kingston.ac.uk/geog/gis/intro.htm.
[9] Gunter Grieser, Klaus P. Jantke, Steffen Lange, and Bernd Thomas. A unifying approach to HTML wrapper representation and learning. In Proceedings of the Third International Conference on Discovery Science, 2000.
[10] Andy Oram and Jeffrey E. Friedl. Mastering Regular Expressions: Powerful Techniques for Perl and Other Tools. O'Reilly Nutshell, 2000.
[11] Michael Kay. XSLT Programmer's Reference. Wrox Press, 2000.
[12] A.M. Kuchling. Regular expression howto. 1998. URL http://py-howto.sourceforge.net/regex/regex.html.
[13] Susan Lilly. How to avoid use-case pitfalls, Jan 2000.
[14] Joseph M. Piwowar. Introduction to geographic information systems. Geography 255. URL http://www.fes.uwaterloo.ca/crs/geog255.f99/Introduction/Introduction.html.
[15] Alois Waser. Test Automation A Case Study. Technical University of Vienna, 2002.
