
Creating Ontologies from Web documents

David SÁNCHEZ, Antonio MORENO. Department of Computer Science and Mathematics, Universitat Rovira i Virgili (URV). Avda. Països Catalans, 26. 43007 Tarragona. {dsanchez, amoreno}@etse.urv.es
Abstract. In this paper we present a methodology to build an ontology automatically, extracting information from the World Wide Web starting from an initial keyword. The ontology represents a taxonomy of classes and gives the user a general view of the concepts, and the most significant sites, that can be found on the Web for the specified keyword's domain. The system makes intensive use of a publicly available search engine, extracts concepts (based on their relation to the initial one and on statistical data about their appearances) and represents the result in a standard way. Keywords. Ontology building, information extraction, World Wide Web, OWL.

Introduction

In recent years the growth of the Information Society has been very significant, providing fast data access and information exchange all around the world. However, classical human-readable data resources (such as electronic books or web sites) present serious problems for achieving machine interoperability. This is why a structured way of representing information is required, and ontologies [6] (machine-processable representations that contain the semantic information of a domain) can be very useful: they allow information to be transferred and processed effectively in a distributed environment. Moreover, many authors [3, 4, 12, 13] use an ontology's semantic data to improve the search for information in unstructured documents (which represent almost 100% of the available resources). Therefore, building an ontology that represents a specific domain is a critical process that has to be performed carefully. However, manual ontology building is a difficult task that requires extended knowledge of the domain (an expert) and, in many cases, the result may be incomplete or inaccurate.

In order to ease the ontology construction process, automatic methodologies can be used [8]. The idea is to extract structured knowledge, such as concepts and semantic relations, from unstructured resources that cover the main topics of the domain. Concretely, the solution that we propose in this paper is to use the information available on the World Wide Web to create the ontology. This method has the advantage that the ontology is built automatically and fully represents the current state of a domain (based on the web pages that cover the specific topic).

So, in this paper, we present a methodology to extract information from the Web in order to build an ontology for a given domain. Moreover, during the building process, the most representative web sites for each ontology concept are retrieved. A prototype has been implemented to test this method. The rest of the paper is organised as follows: section 1 describes the methodology developed to build the ontology and the standard language used to represent it. Section 2 contains information about the implemented prototype and some tests. Section 3 presents some conclusions and proposes future lines of work.

1. Ontology building methodology

In this section we describe the methodology used to discover and select representative concepts and web sites for a domain and to construct the final ontology. The algorithm analyses a large number of web sites in order to find important concepts for a domain by studying the initial keyword's neighbourhood (we assume that words that appear near the specified keyword are closely related to it). The candidate concepts are then processed in order to select the most adequate ones by performing a statistical analysis. The selected classes are finally incorporated into the ontology. For each one, the main web sites from which it was extracted are stored, and the process is repeated recursively in order to find new terms and build a hierarchy of concepts (a high-level sketch of this recursive loop is given below). The resulting taxonomy of terms can be the basis for finding more complex ontological relations between concepts [8], or it can be used to guide a search for information or a classification process over a document corpus [3, 4, 12, 13].
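To make the control flow concrete, the following minimal Java sketch mirrors the loop just described. It is an illustration only: the three helper methods are stubs standing in for the search, parsing and statistical selection phases, which the actual prototype implements with the tools listed in section 2.

    import java.util.*;

    // Minimal sketch of the recursive taxonomy-building loop (illustrative only).
    class OntologyClass {
        String name;                                 // e.g. "optic biosensor"
        List<String> urls = new ArrayList<>();       // most representative sites
        List<OntologyClass> subclasses = new ArrayList<>();
        OntologyClass(String name) { this.name = name; }
    }

    class Candidate {
        String word;                 // stemmed candidate concept, e.g. "OPTIC"
        List<String> sourceUrls;     // pages where it was found
        Candidate(String w, List<String> u) { word = w; sourceUrls = u; }
    }

    public class TaxonomyBuilder {
        int maxDepth = Integer.MAX_VALUE;   // the depth can be left unconstrained

        OntologyClass build(String keyword, int depth) {
            OntologyClass node = new OntologyClass(keyword);
            if (depth >= maxDepth) return node;
            List<String> pages = searchWeb(keyword);   // query the search engine
            for (Candidate c : selectClasses(extractCandidates(keyword, pages))) {
                node.urls.addAll(c.sourceUrls);
                // recurse with the refined keyword, e.g. "optic biosensor"
                node.subclasses.add(build(c.word + " " + keyword, depth + 1));
            }
            return node;
        }

        // Stubs standing in for the search, parsing and statistical selection.
        List<String> searchWeb(String keyword) { return List.of(); }
        List<Candidate> extractCandidates(String kw, List<String> pages) { return List.of(); }
        List<Candidate> selectClasses(List<Candidate> candidates) { return List.of(); }
    }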
"keyword"
Search Constraints

Web documents

Input parameters

Parsing

}
Name Attributes

...

Name Attributes

...

Name Candidate Attributes Concepts

"class+keyword"

Figure 1. Ontology building algorithm

}
Class selection
Selection Constraints
Class URL's Class URL's

...

Class URL's

...

}
OWL Ontology

1.1 Building algorithm

In more detail, the algorithm shown in Figure 1 has the following phases. It starts with a keyword that is representative enough of a specific domain (e.g. biosensor) and a set of parameters that constrain the search and the concept selection (described below). It then uses a publicly available search engine (Google) to obtain the most representative web sites that contain that keyword. The search constraints are the following:

Maximum number of pages returned by the search engine: this parameter bounds the size of the search. For a general keyword with a large number of results (10,000 or more), analysing between 5% and 10% of them (starting from the most popular ones) produces quite representative results.

"class+keyword"

"class+keyword"

Filter of similar sites: for a general keyword (e.g. car), enabling this filter hides web sites that belong to the same web domain, obtaining a set of results that covers a wider spectrum. For a concrete word (e.g. biosensor), with a smaller number of results, disabling this filter returns the whole set of pages (even sub-pages of a domain), allowing wider searches.

For each web site returned, an exhaustive analysis is performed in order to obtain useful information. Concretely: different non-HTML document formats (pdf, ps, doc, ppt, etc.) are processed by obtaining the HTML version from Google's cache; for each "Not found" or "Unable to show" page, the parser tries to obtain the web site's data from Google's cache; redirections are followed until the final site is found; frame-based sites are also considered, obtaining the complete set of texts by analysing each web subframe.

The parser returns the useful text from each site (rejecting tags and visual information) and tries to find the initial keyword (e.g. biosensor). For each match, it analyses the immediately preceding word (e.g. optical biosensor); if it fulfils a set of prerequisites, it is selected as a candidate concept. Concretely, the parser verifies the following: words must have a minimum size and must be represented with a standard ASCII character set (not Japanese, for example); they must be "relevant words", so prepositions, determiners and very common words ("stop words") are rejected; each word is analysed from its morphological root (e.g. fluorescence and fluorescent are considered the same word, and their attribute values - described below - are merged: for example, the numbers of appearances of both words are added). A stemming algorithm for the English language is used to conflate plurals, verbal forms, etc.

For each selected candidate concept (some examples are given in Table 1), a statistical analysis is performed in order to select the most representative ones. Concretely, we consider the following attributes:

Total number of appearances (over all the analysed web sites): this is a measure of the concept's relevance for the domain and allows us to eliminate very specific terms (e.g. company names like Questlink) or not directly related ones (e.g. novel).

Number of different web sites that contain the concept at least once: this gives a measure of the word's generality within the domain (e.g. amperometric is quite common, but Dupont is not).

Estimated number of results returned by the search engine for the selected concept alone: this indicates the global generality of the word and allows us to avoid widely-used ones (e.g. advanced).

Estimated number of results returned by the search engine for the selected concept joined with the initial keyword: this is a measure of association between the two terms (e.g. "optic biosensor" gives many results but "techs biosensor" does not).

Ratio between the last two measures: this indicates the strength of the relation between the concept and the keyword (e.g. "amperometric biosensor" is much more relevant than "government biosensor").

A sketch of this selection test is given below.
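The selection test over these five attributes can be sketched in Java as follows. This is not the prototype's code: the field and method names are ours, and the threshold values are those of the example run described in section 2.1.

    // Sketch of the statistical selection filter over candidate attributes.
    // Threshold values follow the example configuration of section 2.1.
    class CandidateStats {
        int totalAppearances;      // over all analysed pages
        int differentPages;        // pages containing the concept at least once
        long searchResultsAlone;   // engine estimate for the concept alone
        long searchResultsJoined;  // engine estimate for "concept keyword"

        double ratio() {
            return searchResultsAlone == 0 ? 0.0
                 : (double) searchResultsJoined / searchResultsAlone;
        }

        boolean isSelected() {
            return totalAppearances >= 5              // minimum total hits
                && differentPages >= 2                // minimum distinct web sites
                && searchResultsAlone <= 10_000_000L  // avoid very general words
                && searchResultsJoined >= 10          // avoid very concrete words
                && ratio() >= 0.0001;                 // keep closely related words
        }
    }

Checked against Table 1, this filter behaves as described: amperometric passes (ratio 0.051), government fails on the ratio, advanced fails on the global result count, and techs fails on the joined result count.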

Only the concepts (a small percentage of the candidate list) whose attributes fit a set of specified constraints (a range of values for each parameter) are selected (marked in bold in Table 1). For each one, a new keyword is constructed by joining the new concept with the initial one (e.g. "optic biosensor"), and the algorithm is executed again from the beginning. This process is repeated recursively until a selected depth level is reached or no more results are found (e.g. reusable quartz fiber optic biosensor has no subclasses). Each new execution has its own search and selection parameter values, because the searched keyword is more restrictive (the constraints have to be relaxed in order to obtain a significant number of final results).

The obtained result is a hierarchy that is stored as an ontology. Each class name is represented by its morphological root (e.g. optical = OPTIC). However, if a word has different derivative forms, all of them are evaluated independently (e.g. optic, optical). Moreover, each class stores the concept's attributes described previously and the set of URLs from which it was selected. The sites associated with these URLs are the most representative ones for each concept of the ontology (those from which the concepts were selected).

Finally, an ontology refinement process is performed in order to obtain a more compact taxonomy and avoid redundancy. In this process, classes and subclasses that have the same set of associated URLs are merged, because we consider them closely related: in the search process, the two concepts have always appeared together. For example, the hierarchy optic -> fiber -> quartz -> reusable -> rapid will result in optic -> fiber -> rapid_reusable_quartz, because the last 3 subclasses have the same web sets. Moreover, the lists of URLs are processed in order to avoid redundancies between the classes' sets (e.g. if a web address is stored in a subclass, it is deleted from the superclass set). A sketch of this merging step is shown below.
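The refinement step can be sketched as follows, reusing the OntologyClass structure from the earlier sketch. It is an illustrative reading of the rule above (merge a class with its only child when both carry the same URL set, and delete from a class any URL already stored in one of its subclasses), not the prototype's actual code.

    import java.util.HashSet;

    // Illustrative refinement pass (reuses the OntologyClass sketched earlier).
    public class Refiner {

        // Merge single-child chains whose URL sets coincide, e.g.
        // quartz -> reusable -> rapid becomes rapid_reusable_quartz.
        static void mergeRedundantChains(OntologyClass node) {
            for (OntologyClass child : node.subclasses) mergeRedundantChains(child);
            while (node.subclasses.size() == 1) {
                OntologyClass only = node.subclasses.get(0);
                if (!new HashSet<>(node.urls).equals(new HashSet<>(only.urls))) break;
                node.name = only.name + "_" + node.name;  // concatenate the names
                node.subclasses = only.subclasses;        // absorb the child
            }
        }

        // Delete from each superclass any URL already stored in a subclass.
        static void removeRedundantUrls(OntologyClass node) {
            for (OntologyClass child : node.subclasses) {
                node.urls.removeAll(child.urls);
                removeRedundantUrls(child);
            }
        }
    }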

Table 1. Candidate concepts for the biosensor ontology. Words in bold represent the selected classes (merged ones - with the same root - in italic); the other rows are a reduced list of some of the rejected concepts (attribute values that do not fulfil the selection constraints are shown in italic).

Concept          Morph. root   #Appear.  #Diff. pages  #Search results  #Joined results  Result ratio
fluorescence     FLUORESC      1         1             410000           56               1.3E-4
fluorescent      FLUORESC      36        19            616000           130              2.1E-4
optic            OPTIC         8         6             945000           450              4.6E-4
optical          OPTIC         16        13            4270000          1100             2.5E-4
amperometric     AMPEROMETR    14        12            9840             505              0.051
based            BASE          14        9             8790000          1560             1.7E-4
electrochemical  ELECTROCHEM   11        9             126000           447              0.003
glucose          GLUCOS        21        18            811000           737              9.1E-4
resonance        RESON         6         6             803000           481              5.9E-4
salomon          SALOMON       48        35            597000           326              5.4E-4
spr              SPR           6         5             350000           306              8.7E-4
government       GOVERN        6         5             5560000          40               7.1E-6
advanced         ADVANC        16        12            10050000         425              4.2E-5
novel            NOVEL         4         4             5150000          383              7.4E-5
techs            TECH          9         3             301000           0                0.0
ambri            AMBRI         2         2             2400             170              0.07
dupont           DUPONT        6         1             951000           17               1.7E-5
questlink        QUESTLINK     3         2             44600            0                0.0

1.2 Ontology representation

The final ontology is stored in a standard representation language: OWL [16]. The Web Ontology Language is a semantic markup language for publishing and sharing ontologies on the World Wide Web. It is developed as a vocabulary extension of RDF [17] (Resource Description Framework) and is derived from the DAML+OIL [14] Web Ontology Language. It is designed for use by applications that need to process the content of information, and it facilitates greater machine interpretability by providing additional vocabulary along with a formal semantics. These features make it easy to find relations between the discovered classes and subclasses (equivalences, intersections, unions, etc.), so the final hierarchy of terms is presented to the user in a refined way. Moreover, OWL is supported by many ontology visualizers and editors, such as Protégé 2.0, allowing the user to explore, understand, analyse or even modify the resulting ontology easily. In order to evaluate the correctness of the results, a set of formal tests can be performed (Protégé provides tests for finding loops, inconsistencies or redundancies). However, the evaluation from a semantic point of view can only be made by comparing the results with other existing semantic studies or through an analysis performed by a domain expert.

Once the ontology is created, it is easy to obtain the most representative web sites for each category (concept), because their URLs are stored in each subclass frame. Moreover, they cover the full spectrum (or at least the most important part) of the resources available on the Web at this time. This list can easily be updated by performing simple (and fast) individual searches for each subclass keyword.

2. The prototype

In order to test the performance of the ontology construction methodology, we have built a prototype implementing the algorithm described previously. The program has been fully implemented in Java, because a large number of libraries are available that ease the retrieval and parsing of web pages and the construction of ontologies. Concretely, the tools used are the following:

Stemmers 1.0: provides a stemming algorithm for finding the morphological root of an English word.

HTML Parser 1.4: a powerful HTML parser that allows extracting and processing text from a web site.

Google Web APIs (beta): the library that the Google search engine provides to programmers for making queries and retrieving search results.

OWL API 1.2: one of the first libraries providing functions for constructing and managing ontologies in OWL.

Moreover, we have used Protégé as an ontology visualization and editing tool, with the ezOWL plug-in that creates a visual representation of an OWL ontology.

2.1 Execution example

As an example, we have used the word "biosensor" as the initial keyword for the domain that comprises the different types of this device. In order to constrain the search for this initial word, we have defined the following parameters:

Candidate concepts must have a minimum length of 2 characters.

The maximum number of web sites per search has been set to 1000 (a good value, because "biosensor" has over 20000 different results on Google's API).

The maximum depth level has not been constrained: the system searches for subclasses until it finds no more.

On the first level, the filter for similar pages has been enabled, to obtain as many different web sites as possible. For deeper levels it has been disabled, in order to obtain more results (even if they belong to the same web domain).

The minimum number of total hits has been set to 5 (on at least 2 different web sites) for the first level. For deeper ones it is progressively decreased, down to 1 occurrence for levels deeper than 3.

The maximum number of results returned by Google for each new concept has been set to 10,000,000 (to avoid very general words), and the minimum number of results joining the initial keyword with the new one has been set to 10 (to avoid very concrete words). The minimum ratio between these two numbers has been set to 0.0001 (to select only closely related words). These settings are summarised in the sketch below.
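Gathered into a single configuration object, these settings might look as follows; the field names are ours, and only the values are taken from the experiment described above.

    // Example configuration for the "biosensor" run; field names are illustrative,
    // values are those stated above (null depth = unconstrained search).
    record SearchParams(
            int minWordLength,       // 2
            int maxSitesPerSearch,   // 1000
            Integer maxDepth,        // null: search until no more subclasses
            int minTotalHits,        // 5 on the first level, relaxed to 1 below depth 3
            int minDifferentSites,   // 2 on the first level
            long maxResultsAlone,    // 10_000_000: avoid very general words
            int minJoinedResults,    // 10: avoid very concrete words
            double minRatio) {}      // 0.0001: keep only closely related words

    // SearchParams biosensorRun =
    //     new SearchParams(2, 1000, null, 5, 2, 10_000_000L, 10, 0.0001);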

With these parameters the search has been performed, obtaining the ontology shown in Figure 3 (left), visualized in Protégé 2.0. The candidate concepts for the first level of the search, together with the attribute values used for class selection, are shown in Table 1. The resulting taxonomy is formally correct (it has passed all the ontology tests provided by Protégé) and quite accurate with respect to the biosensor classification that can be found in [18]. Concretely, all the basic classes specified in that document (amperometric/potentiometric, enzyme, optical and chemical) have been found.
OPTIC BIOSENSOR
  Optic Biosensor
    http://mywebpages.comcast.net/tfs-jdownward/Web_Pages/TFS_HH01_Fluorometer.html
    http://www.roanoke.edu/Chemistry/JSteehler/Web/fiber.htm
    AWD Fiber Optic Biosensor
      http://www.fbodaily.com/cbd/archive/1997/01(January)/15-Jan-1997/Aawd002.htm
    Fiber Optic Biosensor
      http://www.isb.vt.edu/brarg/brasym94/rogers.htm
      http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=146999&rendertype=abstract
      http://www.photonics.com/todaysheadlines/XQ/ASP/url.lookup/id.493/QX/today.htm
      http://www.bdt.fat.org.br/binas/Library/book/rogers.html
      http://www.age.psu.edu/FAC/IRUDAYARAJ/biosensors/Soojin3-FiberOpt.htm
      http://www.age.psu.edu/FAC/IRUDAYARAJ/biosensors/Soojin3-FiberOpt_files/slide0001.htm
      http://flux.aps.org/meetings/YR97/BAPSSES97/abs/S1900013.html
      http://www.baeg.engr.uark.edu/FACULTY/yanbinli/projects/project6.html
      http://www.verbund-sensorik.de/projects/Projekt13028_19_eng.pdf
      Chemiluminescence Fiber Optic Biosensor
        http://ift.confex.com/ift/2001/techprogram/paper_7902.htm
      Evanescent Wave Fiber Optic Biosensor
        http://ieeexplore.ieee.org/xpl/abs_free.jsp?arNumber=294007
      Model HHXX Series Fiber Optic Biosensor
        http://ic.net/~tfs/Web_Pages/TFS_HH01_Fluorometer.html
      Rapid Reusable Quartz Fiber Optic Biosensor
        http://www.nal.usda.gov/ttic/tektran/data/000006/79/0000067907.html
      Real-time Fiber Optic Biosensor
        http://www.imd3.org/tang.doc
  Optical Biosensor
    http://www.cfdrc.com/applications/biotechnology/biosensor.html
    http://www.cfdrc.com/applications/biotechnology/microspheres.html
    http://www-users.med.cornell.edu/~jawagne/surf.plasmon.res.biosensor.html
    http://www.nanoptics.com/biosensor.htm
    http://www1.elsevier.com/vj/microtas/46/show/indexes/all_externals.htt?KEY=Optical+biosensor
    http://www1.elsevier.com/vj/microtas/46/show/indexes/all_externals.htt?KEY=Optical+biosensor+microarray
    http://www.ch.ic.ac.uk/leatherbarrow/PDF/Edwards%20etal%20(1997)%20J%20Mol%20Recog%2010,%20128.pdf
    http://www.ibmh.msk.su/gpbm2002/ppt/archakov/
    http://www.ee.umd.edu/LaserLab/research.html
    Affinity-based Optical Biosensor
      http://www.biochem.utah.edu/files/Long_Chapter_Affinity_Chr.pdf
    Coupling Optical Biosensor
      http://www.iscpubs.com/articles/abl/b0011fit.pdf
    Fiber Optical Biosensor
      http://www.isb.vt.edu/brarg/brasym94/rogers.htm
      http://ieeexplore.ieee.org/xpl/abs_free.jsp?arNumber=294007
      http://www.bdt.fat.org.br/binas/Library/book/rogers.html
    Generic Optical Biosensor
      http://www2.elen.utah.edu/~blair/R/research.html
    Integrated Optical Biosensor
      http://cism.jpl.nasa.gov/events/workshop/abstracts/Swanson.pdf
      http://www.optics.arizona.edu/library/PublicationDetail.asp?PubID=10285
    Mobile Multi-channel Optical Biosensor
      http://www.ee.washington.edu/research/spr/projects.htm
    Multi-analyte Optical Biosensor
      http://instruct1.cit.cornell.edu/Courses/nsfreu/baeumner.htm
    Plastic Colorimetric Resonant Optical Biosensor
      http://www.pcm411.com/sensors/abstracts/abs043.pdf

Figure 3. Left: biosensor ontology; right: URL hierarchy for the OPTIC Biosensor subclass.
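As an illustration of how such a taxonomy is encoded, the following OWL fragment sketches one of the subclass links of Figure 3 and one of the equivalences discussed below. The identifiers and file layout are our assumption, not the prototype's literal output; only the standard OWL/RDF vocabulary is used.

    <!-- Illustrative OWL fragment (assumed identifiers, standard vocabulary). -->
    <rdf:RDF xmlns:owl="http://www.w3.org/2002/07/owl#"
             xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
      <owl:Class rdf:ID="OpticBiosensor"/>
      <owl:Class rdf:ID="FiberOpticBiosensor">
        <rdfs:subClassOf rdf:resource="#OpticBiosensor"/>
      </owl:Class>
      <owl:Class rdf:ID="AmperometricGlucoseBiosensor"/>
      <owl:Class rdf:ID="GlucoseAmperometricBiosensor">
        <owl:equivalentClass rdf:resource="#AmperometricGlucoseBiosensor"/>
      </owl:Class>
    </rdf:RDF>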

As mentioned previously, OWL allows us to find interclass relations automatically, such as intersections, inclusions or equalities. For example, in this case we found that the "amperometric biosensor" class includes the "glucose amperometric biosensor" subclass, and the "glucose biosensor" class includes the "amperometric glucose biosensor" subclass. As "glucose amperometric biosensor" and "amperometric glucose biosensor" are equivalent (a fact shown automatically by Protégé, which marks the classes with a different colour), their subclass hierarchies are merged, yielding a new taxonomy. Moreover, for each obtained class a list of URLs has been stored, allowing the user to access the most representative web sites for each concept. The retrieved URLs for the "OPTIC biosensor" subclass are shown in Figure 3 (right). As can be seen, different non-HTML file types (e.g. PDFs) have been retrieved.

3. Conclusions and future work

Some authors have worked on ontology learning from different kinds of structured information sources, such as databases, knowledge bases or dictionaries [7]. However, taking into consideration the amount of resources easily available on the Internet, we believe that ontology creation from unstructured documents such as web pages is an important line of research. In this sense, many authors [2, 5, 8, 11] have put their effort into processing natural language texts. In most cases, an ontology of basic relations is used as a semantic repository (WordNet [1]) from which word meanings and senses can be extracted and linguistic analysis performed. Moreover, in some of these approaches, ontology learning is performed over an existing representative ontology of the explored domain, and in most cases a carefully selected corpus of relevant documents is used as the starting point.

In contrast, the proposed methodology does not start from any kind of predefined knowledge of the domain, and it only uses publicly available web search engines. By performing a statistical analysis, new knowledge is discovered and processed recursively, building a hierarchy of representative classes. The obtained taxonomy truly represents the state of the art of a given concept on the WWW, and the hierarchically structured list of the most representative web sites for each class is a great help for finding and accessing the desired web resources.

As future lines of research, several topics can be proposed:

To ease the definition of the search and selection parameters, a pre-analysis could be performed on the initial keyword in order to estimate the most adequate values for the domain. For example, the total number of results for the concept gives a measure of its generality (suggesting more restrictive or more relaxed constraints).

In this first prototype we have mainly used the word preceding the keyword to create the ontology. However, the following words can indicate the domain where the main concept is applied, allowing more general searches and wider ontologies. For example, for the biosensor case, some of the words returned by a "posterior word" analysis are: design, group, research, technology, system, application, etc.

Several executions from the same initial keyword at different times can give different taxonomies. A study of the changes can show how a domain evolves.

For each class, an extended analysis of the relevant web sites could be performed to find possible attributes and values that describe important characteristics (e.g. the name of a company) or closely related words (like a topic signature [9]). More complex relations between classes could be extracted from an exhaustive analysis of the overlap between the obtained URLs, the possible attributes (and their values), or multiple subclass dependencies (at different depth levels).

Acknowledgements

We would like to thank David Isern and Jaime Bocio, members of the hTechSight project [4], for their help. This work has been supported by the "Departament d'Universitats, Recerca i Societat de la Informació" of Catalonia.

References

[1] WordNet: a lexical database for the English language. Web page: http://www.cogsci.princeton.edu/wn.
[2] O. Ansa, E. Hovy, E. Aguirre, D. Martínez, Enriching very large ontologies using the WWW, in Proceedings of the Workshop on Ontology Construction of the European Conference on AI (ECAI-00), 2000.
[3] H. Alani, S. Kim, D. Millard, M. Weal, W. Hall, H. Lewis, and N. Shadbolt, Automatic Ontology-Based Knowledge Extraction from Web Documents, IEEE Intelligent Systems, 14-21, IEEE Computer Society, 2003.
[4] A. Aldea, R. Bañares-Alcántara, J. Bocio, J. Gramajo, D. Isern, J. Jiménez, A. Kokossis, A. Moreno, and D. Riaño, An ontology-based knowledge management platform, in Workshop on Information Integration on the Web (IIWEB-03) at IJCAI-03, 177-182, 2003.
[5] E. Alfonseca and S. Manandhar, An unsupervised method for general named entity recognition and automated concept discovery, in Proceedings of the 1st International Conference on General WordNet, 2002.
[6] D. Fensel, Ontologies: A Silver Bullet for Knowledge Management and Electronic Commerce, Springer Verlag, 2001.
[7] D. Manzano-Macho, A. Gómez-Pérez, A survey of ontology learning methods and techniques, OntoWeb: Ontology-based Information Exchange for Knowledge Management and Electronic Commerce, 2000.
[8] A. Maedche, R. Volz, J.U. Kietz, A Method for Semi-Automatic Ontology Acquisition from a Corporate Intranet, EKAW-2000 Workshop on Ontologies and Texts, 2000.
[9] C.Y. Lin and E.H. Hovy, The Automated Acquisition of Topic Signatures for Text Summarization, in Proceedings of the COLING Conference, 2000.
[10] A. Maedche, Ontology Learning for the Semantic Web, volume 665, Kluwer Academic Publishers, 2001.
[11] P. Velardi, R. Navigli, Ontology Learning and Its Application to Automated Terminology Translation, IEEE Intelligent Systems, 22-31, 2003.
[12] A. Sheth, Ontology-driven information search, integration and analysis, Net Object Days and MATES, 2003.
[13] L. Magnin, H. Snoussi, J. Nie, Toward an Ontology-based Web Extraction, The Fifteenth Canadian Conference on Artificial Intelligence, 2002.
[14] DAML+OIL. W3C. Web page: http://www.w3c.org/TR/daml+oil-reference.
[15] Extensible Mark-up Language (XML). W3C. Web page: http://www.w3c.org/XML/.
[16] OWL Web Ontology Language. W3C. Web page: http://www.w3c.org/TR/owl-features/.
[17] Resource Description Framework (RDF). W3C. Web page: http://www.w3c.org/RDF/.
[18] M. Woods, Biosensors, The World and I, volume 2, 176, http://worldandI.com/public/1987/frebruary/ns2.cfm, 1987.
