
Investigating ConceptNet

Dustin Smith {smitda1@wfu.edu}; Advisor: Stan Thomas, Ph.D.
Wake Forest Department of Computer Science
December 2004
Abstract

There are many interdependent levels of organization behind natural text, and those which address the meaning are the most difficult for computers to work with. However, programs which deal with natural language will inevitably make mistakes if they ignore semantics. Human communication makes use of thousands of common sense facts, beliefs, and associations, and the ConceptNet project attempts to make this information available to software. In addition to aiding Natural Language Processing (NLP), the natural language knowledge representation used in ConceptNet makes it ideal for various common sense reasoning applications. This paper explores the organization of ConceptNet and the CN2 project. Finally, it discusses future directions for ConceptNet and possible improvements.

1 Introduction

Programs which deal with natural language (e.g., spelling and grammar validators, translators and information indexing agents) will inevitably make mistakes, no matter how sophisticated, unless they address the meaning behind the text. ConceptNet is a knowledge base that contains everyday knowledge, which is relevant to the semantic and pragmatic levels of human language. ConceptNet has both immediate applications (e.g., Interface Agents) as well as the potential to aid Natural Language Processing (NLP). Ideally, conceptual databases will be used to enrich programs which involve natural language so that they can deal effectively with semantics.

ConceptNet is a large, freely available common sense semantic network for the Artificial Intelligence (AI) community. To date it contains 1.6 million common sense statements that were derived from the contributions of more than 14,000 authors[7]. These concepts encompass thousands of pieces of knowledge that most adults already know. The ConceptNet knowledge base stemmed from the related OMCS project, which aimed to give computers access to knowledge like: "Touching fire will burn you", "Hitting somebody will make them unhappy" and "A clock tells time". The relationships ConceptNet uses can be combined and processed by more sophisticated reasoning models. The data from ConceptNet was automatically derived from the OMCS-2 corpus and the current version is 2.1.

This paper is the result of an exploration into ConceptNet (Sections 1 and 2), its applications (3) and weaknesses, and some of the author's attempts to improve the knowledge base (4, 5).

1.1 History of ConceptNet

In 2000, researchers at MIT's Media Lab began the Open Mind Common Sense project (OMCS). OMCS is accessed through a website¹ where the general Internet public can contribute common sense knowledge by answering fill-in-the-blank questionnaires[12]. For example, a web form would prompt the user with a statement such as "A hammer is used to ____", instructing the person to fill in an appropriate response (e.g., hit, drive nails, pound). The responses are stored in a database, which is periodically mined to generate predicate lists for the various ConceptNet, LifeNet, and StoryNet projects.

The predicates which comprise ConceptNet's knowledge base originated as patterns mapped from OMCS that have been converted into semi-structured natural language binary relations. Next, during the normalization process, all of the predicates undergo various filtering and lexical distillation. Verbs and nouns are reduced to their canonical base forms; determiners (e.g., a, the) and modals (e.g., may, could, will) are removed. Additionally, Part of Speech (POS) taggers are used to ensure predicates all correspond to a specific set of valid word orders. Afterwards, during the relaxation phase, duplicate entries are merged. By exploiting the underlying taxonomic structure, the IsA hierarchy is used to lift knowledge up from child nodes to their parents. If concepts X1, X2, X3 all have relationships in the format (IsA X Y), which implies they are in the same taxonomic subset, and (PropertyOf X Z), indicating they all share the same property Z, then a new predicate of the form (PropertyOf Y Z) would be inferred and added to the predicates.

Each predicate also has two numeric metadata attributes, f and i, which have the default value of zero. Whenever the same relationship is derived from the OMCS corpus, the predicate's f value is incremented. If during the relaxation phase a duplicate predicate is inferred, then i is incremented. In addition to the concept predicates, ConceptNet contains a platform-independent Python interface and developer's API with natural language parsing software.

¹ http://www.openmind.org/commonsense/

2 ConceptNet's Applications

Ambiguity, redundancy, and noisy data undermine the feasibility of using ConceptNet for programs which demand a high degree of accuracy. Despite this, certain applications, such as non-intrusive Interface Agents, can still benefit from the common sense knowledge that ConceptNet possesses. The API that accompanies ConceptNet v2.1 permits contextual realm-filtering, topic generation, analogy-making, projection, affect sensing, and concept identification/approximating[7]. As will be discussed in section 3.2, the defeasible knowledge within ConceptNet may be ideal for common sense reasoning among natural language processing software in consequence of its ambiguity and redundancy.

2.1 Interface Agents

Recently a number of people, including those involved with the development of ConceptNet, have used the knowledge base to build interactive applications that deal intelligently with (often user-supplied) English text. These applications are dissimilar to those of earlier common sense efforts, like Cyc. Instead of the common sense component being the center of the program (e.g., question answering applications where the software seeks to correctly respond to human-level text), ConceptNet has been recommended for fail-safe agents which do not purport to deliver intelligent conclusions without exception. Interface Agents are non-intrusive fail-safe agents which run within an interface and learn from a user's activity to provide assistance or help improve efficiency.

They run in the background, occasionally interjecting the results of their common sense reasoning for the user to consider. If their results are wrong or do not apply to the given case, the user can simply ignore their advice. They can accumulate various common sense related inferences, making helpful suggestions to the user, or ask the user to fill in missing information that may help produce better results[5]. Although the development of Interface Agents is less ambitious than all-encompassing projects like Cyc, there are still many uses for common sense reasoning and applications which can be immediately realized. Additionally, the sort of common sense knowledge which ConceptNet aims to provide is crucial for developing more intelligent NLP software.

2.2 Natural Language Processing

Human communication makes use of thousands of common sense facts, beliefs and causality chains in speech and written text. Because this information is obvious to most people, it is rarely stated explicitly. However, the meaning behind the communicated message is heavily dependent on underlying assumptions, which makes it very difficult for a program to make more than a shallow assessment of the text. ConceptNet contains thousands of common sense assertions, and thus it should be considered for NLP applications. Here is an example statement which illustrates the need for common sense knowledge in NLP programs:

    The teacher said, "That was an A paper."

The teacher's remark would likely be rejected if parsed by grammar-correcting software. The software is likely to conclude that the string "an A paper" is invalid on the syntactical level (seeing two articles). A typical human reader would have no trouble with it since they already have the knowledge: a paper is a type of assignment², teachers grade their students' assignments, and grades can come in the form of letters.

Semantic resolution is challenging because it requires both a large amount of background knowledge and human-like ways to reason with this knowledge.³

² The predicates in ConceptNet are acknowledged to be defeasible, in that they are not always true; they can be made void, and often they are only true within certain contexts.
³ Or, at least, a program will need enough good tricks to accomplish its objective well.

There are many different levels of organization behind both speech and text, but those levels which encompass the meaning behind words are the most difficult to deal with from a computational standpoint. Listed are three such levels [4]:

Semantics - the meaning behind words and groups of words.
Pragmatics - the use of language to accomplish tasks (e.g., to give a command, share an idea, or draw attention).
Discourse - making sense of linguistic units larger than a single utterance.

The inability to deal with these three levels of language is what holds NLP back from being able to understand what words mean. Even applications that seem well suited to computation, such as grammar correction software (since syntax is a formal set of rules), will not always successfully parse a language where colloquialisms and homonyms abound.
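The "an A paper" example from this section can be made concrete with a deliberately naive, syntax-only rule. The sketch below is an illustration only: the function name and the article list are assumptions, and it does not model any real grammar checker. It shows how a rule that knows nothing about letter grades flags the teacher's remark.

```python
import re

# Words the naive rule treats as articles, compared case-insensitively.
ARTICLES = {"a", "an", "the"}

def flag_adjacent_articles(sentence: str) -> list[tuple[str, str]]:
    """Return pairs of adjacent tokens that both look like articles."""
    tokens = re.findall(r"[A-Za-z]+", sentence)
    return [
        (left, right)
        for left, right in zip(tokens, tokens[1:])
        if left.lower() in ARTICLES and right.lower() in ARTICLES
    ]

remark = 'The teacher said, "That was an A paper."'
print(flag_adjacent_articles(remark))  # [('an', 'A')]
```

Resolving the false alarm requires exactly the background knowledge listed above: that a paper is an assignment and that grades can be letters.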

3 Organization of ConceptNet

ConceptNet expresses predicates in the form:

    (Relationship Concept Concept)

3.1 Relationships

The predicates in ConceptNet use 20 relationship types which fall into 8 categories. Each category contains one or more types of relationships. Listed are the categories and their corresponding relationship types:

K-Lines: ConceptuallyRelatedTo, ThematicKLine, SuperThematicKLine
Things: IsA, PartOf, PropertyOf, DefinedAs, MadeOf
Spatial: LocationOf
Events: SubeventOf, PrerequisiteEventOf, FirstSubeventOf, LastSubeventOf
Causal: EffectOf, DesirousEffectOf
Affective: MotivationOf, DesireOf
Functional: CapableOfReceivingAction, UsedFor
Agents: CapableOf

These various types of relationships cover a vast spectrum of problem categories which are commonly available to humans. The information can be used to answer questions about where and what an object is, what it is used for, and its possible motivations and goals for a given action. Sequential information is also available, so that for a given event, events that commonly precede or follow are related.

The concepts themselves are often called nodes, of which there are 300,000. All relationships are binary (having two arguments) and there are twenty relationship types.

One problem with this organization is that all nodes are treated equally. This can lead to poorly directed reasoning attempts, like: "What was the chair's motivation for breaking?" The lack of consistency in the information makes complex reasoning chains unlikely to succeed. Also, some nodes have more conceptual relationships than others, thus they are more useful to reasoning methods. Helpful information like this should be available at a metadata level, so that the denser nodes (which have more relations) are tried before the sparsely connected nodes, which are more likely to lead to dead ends. Unfortunately, the available metadata in ConceptNet is scarcely useful because its values are not well distributed among concepts. In my CN2 project, I worked towards this objective by implementing a connectivity index. The method used and its results are explained in 4.2.1.

3.2 Representing Knowledge with Natural Language

ConceptNet uses semi-structured natural language to represent information. Its authors chose this format because of the straightforwardness of natural language, which they also thought to be ideal for common sense reasoning. First, an explanation of why the way knowledge is represented is such an important issue:

Knowledge representation (KR) is a key issue in AI. Essentially, all KR systems are methods for representing surrogates for entities in the real world (or a virtual domain). These models are never entirely accurate; they always contain some discrepancies or omissions, because perfect knowledge representation is impossible[2].⁴ Most importantly, the way the knowledge is represented entails the ways the knowledge can be manipulated. In other words, KR is tightly bound to the reasoning methods that are deployed on it.

⁴ A surrogate representation can never have perfect fidelity with the corresponding real world object it represents. Usually only specific aspects of the object are represented (thus some aspects are omitted); moreover, even a full-blown replica of the external object would still differ from the original, at least in location.
⁵ http://www.cogsci.princeton.edu/wn/

Its authors wanted ConceptNet to cover common sense knowledge, similar to the Cyc project, while incorporating the ease-of-use of WordNet⁵, which uses a KR based in natural language. Using semi-structured natural language in their representations marked a divergence from the mainstream opinions concerning common sense reasoning in AI. Liu and Singh [6] make a good argument against using solely
logic for common sense representations: For common sense reasoning, which ConceptNet is built for, recursive definitions are commonplace, multiple answers can exist simultaneously, and contradictions are permitted. However, representations which use logic require strict consistency among predicates in their knowledge bases. The existence of two contradictory predicates would jeopardize the whole system. John McCarthy, a proponent of using logic in common sense reasoning, used the following to illustrate a representation which is problematic for logical consistency. Although most people will agree that the statements "Birds can fly" and "Penguins are birds" are both true, there is a logical inconsistency with reality that cannot be avoided while maintaining these two beliefs[8]. Penguins and ostriches are anomalies among feathered vertebrates with wings; they are birds that cannot fly.⁶

Artificial Intelligence researcher Marvin Minsky argues that redundancy and inconsistency are properties of human-like intelligence. In other words, the approach which ConceptNet takes is more appropriate for its objective than the other common sense initiatives that use logical deduction exclusively. Minsky believes that the overlapping, interconnecting networks of knowledge are what allow human common sense reasoning to be effective in various situations[10]:

    "If you understand something in only one way then you scarcely understand it at all. For then, if anything should go wrong, you'll have no other place to go. But if you represent something in multiple ways, then when one of them fails you can switch to another until you find one that works for you. It's the same when you solve a new kind of problem: along with refining the method you used, you also should try to find other ways to do it. Then whenever you get into trouble, you'll be able to switch to a different technique. If you only know a single technique, then you'll get stuck when that method fails. But if you have multiple ways to proceed, then you can deal with more kinds of predicaments."

The normalized natural language fragments make ConceptNet a good candidate for NLP applications: it is easier to map natural language to itself than to a closed domain of symbols. However, this connectionist approach also raises some implementation issues. Singh and Liu point out that the ambiguity in natural language will result in redundant relationships[6]. Additionally, there are many context-sensitive relationships that are simply false when taken in the wrong word sense. These two factors limit the immediate types of applications for ConceptNet; however, there are some ways to address both of these issues, and some ideas are brought up in section 5 of this paper.

3.3 Overview: The Substance

The architecture of ConceptNet is well designed; however, the data within is often problematic. Examination of the actual data shows that very few nodes were inherited, based on the figures from the metadata. Accounting for all predicates in the non-concise files, all nodes had fewer than 4 inferences in each value for f and i, and very few nodes had more than one (fig. 1).

Figure 1: Distribution of f and i values.

As a result, this metadata is not helpful for distinguishing between two otherwise identical concepts. The K-Line⁷ category contains almost three times as many assertions as the other relations combined, and so they are located in a separate file in the ConceptNet kit. A breakdown of the full (non-concise) distribution of predicates in the various categories is given in fig. 2.
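The "almost three times" claim can be checked with simple arithmetic on the category totals reported in Figure 2. The snippet below is only a verification sketch; the dictionary and function names are illustrative, and the counts are copied from that figure.

```python
# Assertion counts per category, copied from the Figure 2 breakdown.
ASSERTIONS = {
    "K-Lines": 1_035_035,
    "Functional": 103_556,
    "Agents": 89_313,
    "Things": 46_828,
    "Events": 35_317,
    "Affective": 31_196,
    "Spatial": 28_805,
    "Causal": 15_303,
}

def kline_ratio(counts: dict[str, int]) -> float:
    """Ratio of K-Line assertions to all other categories combined."""
    others = sum(n for name, n in counts.items() if name != "K-Lines")
    return counts["K-Lines"] / others

others_total = sum(ASSERTIONS.values()) - ASSERTIONS["K-Lines"]
print(others_total)                       # 350318
print(round(kline_ratio(ASSERTIONS), 2))  # 2.95
```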
⁶ To be fair, McCarthy came up with a solution for these atypical cases: he defined a separate category of predicates with the prefix ab, indicating abnormality. In turn, this approach has an added complexity that alternatives do not.

⁷ K-Lines: contextual identifiers that relate a given concept to a theme. They are the most general of the concept types.

Category / Relationship        Assertions
K-Lines                         1,035,035
  ConceptuallyRelatedTo           816,737
  SuperThematicKLine              160,181
  ThematicKLine                    58,117
Functional                        103,556
  CapableOfReceivingAction         57,600
  UsedFor                          45,956
Agents                             89,313
  CapableOf                        89,313
Things                             46,828
  IsA                              16,720
  PartOf                           12,934
  PropertyOf                        9,135
  DefinedAs                         6,520
  MadeOf                            1,519
Events                             35,317
  SubEventOf                       22,764
  FirstSubEventOf                   4,453
  PrerequisiteEventOf               4,092
  LastSubEventOf                    4,008
Affective                          31,196
  MotivationOf                     24,483
  DesireOf                          6,713
Spatial                            28,805
  LocationOf                       28,805
Causal                             15,303
  EffectOf                          9,057
  DesirousEffectOf                  6,246

Figure 2: Breakdown of concepts over categories.

4 CN2 Explorer

As part of an undergraduate research project, I investigated methods to reorganize the data in ConceptNet so that reasoning methods could be more easily and consistently applied. Most of my work centered around the IsA nodes; the objective was to build a consistent taxonomic hierarchy that categorized all of the concepts. This idea was similar to the inference scheme deployed in the relaxation phase of ConceptNet's generation. The idea is that a given node should be able to inherit relationships from its parent nodes (at any level). For example, with these concepts:

    (IsA Governor Politician)
    (IsA Politician Person)
    (CapableOf Person Speaking)

A reasoning system should be able to infer that a governor, being a person, is capable of all the things that people are (in this case: speaking). This structure would allow the reasoning methods to treat nodes differently (e.g., recognize the distinction between objects and agents) and also permit a more effective use of the existing relationships, without necessitating repeating information for each individual concept. By creating a well-formed ontological tree using IsA relationships, all of the other relations in ConceptNet would benefit. There were two major issues that needed to be resolved first:

Removing Cyclic Relationships. There were several nodes that had cyclic relationships like: (IsA Something Object) and (IsA Object Something). The existence of nodes like this would cause non-termination problems for reasoning agents if they tried to traverse this taxonomic tree. The object/something type problems could be eliminated with one SQL statement; however, cyclic relationships that had more than one degree of separation were difficult to detect.

Dealing With Bad Data. Unfortunately, the "mass collaboration from the public" approach affects the quality of data. ConceptNet contains enough misspelled words, false concepts, and overly specific data that it is troublesome to organize on a large scale. These problems result in a lot of repeated information (i.e., duplicate information with different spelling) that makes any effort to bring consistency difficult.

In the following sections I explore the steps taken to develop CN2, my findings, and some concluding remarks concerning future directions of this and similar projects.

4.1 From Python to SQL

As a client-side Python application, the potential of a full semantic network could be realized; however, the volume of the predicate files impaired performance when rendering graph structures. The objectives of CN2 necessitated modifying and querying large bundles of concepts at once, which is why SQL was most appropriate. In order to port the predicates to a file which could be read by a DBMS, conversion to a comma-delimited file was necessary. I wrote a program to convert predicates of the form:

    (LocationOf plane at airport f=8;i=0;)
    (LocationOf army in war f=3;i=0;)

into CSV format.⁸ The data was then transferred into a Microsoft SQL Server 2000 database.
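As a rough illustration of this conversion step, the sketch below turns predicate lines into CSV text. It is not the author's converter. It assumes that each argument in the raw predicate file is double-quoted, e.g. (LocationOf "plane" "at airport" "f=8;i=0;"); the quotation marks do not appear in the examples above, so that format, the column names, and the function names are all assumptions made for the example.

```python
import csv
import io
import re

# Assumed raw format (the quotes are an assumption, not shown in the text):
#   (LocationOf "plane" "at airport" "f=8;i=0;")
PREDICATE_RE = re.compile(
    r'^\((\w+)\s+"([^"]*)"\s+"([^"]*)"\s+"f=(\d+);i=(\d+);"\)$'
)

def predicate_to_row(line: str) -> list:
    """Split one predicate into [relation, concept1, concept2, f, i]."""
    match = PREDICATE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized predicate: {line!r}")
    relation, first, second, f, i = match.groups()
    return [relation, first, second, int(f), int(i)]

def predicates_to_csv(lines) -> str:
    """Convert an iterable of predicate lines to CSV text, skipping blanks."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["relation", "concept1", "concept2", "f", "i"])
    for line in lines:
        if line.strip():
            writer.writerow(predicate_to_row(line))
    return buffer.getvalue()

sample = [
    '(LocationOf "plane" "at airport" "f=8;i=0;")',
    '(LocationOf "army" "in war" "f=3;i=0;")',
]
print(predicates_to_csv(sample))
```

Keeping f and i as separate integer columns lets the DBMS filter or sort on them directly once the file is imported.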

4.2 Adding Metadata

When the data is listed in a database instead of a graph, access to information behind each concept is limited. For example, it is impossible to directly compute in a declarative language (like SQL) exactly how many hierarchical levels were below or above a given IsA node. Fortunately, there were two alternatives:

1. The specific DBMS used in the project permitted a more powerful language called Transact-SQL (T-SQL), which included iteration and conditional looping.
2. Building automated scripts (which continually self-loaded until they reached completion) in ColdFusion allowed automation of tasks and full access to a procedural language.

For IsA concepts, it was helpful to have information concerning the concept's distinctiveness at hand. I added a metadata field which was the total of how many other nodes shared that same parent node. The nodes with the highest value for this were person (96), place (95), instrument (86), animal (73), and tool (64). In other words, there were 96 different assertions of the form:

    (IsA X Person)

Additionally, for each IsA concept, boolean values for metadata Up and Down were added. Given a node, these specified whether or not there were any parent or child nodes, respectively.

4.2.1 Connectivity Index

Another metadata attribute, which was added to all nodes and not just those in IsA relationships, was a weighted index value. The index value is a relative index which specifies the number of connections the first concept has in a given assertion compared to a base (average) value.

If each node had one relationship for each relationship type⁹, they would have a total of 17. Using this as the average value, any node that had 17 relationships would have a connectivity index value of 100 (the base value). Each index value is calculated as (100 × Connections)/17.

Interestingly, there was a rather large number of independent concepts that did not have any connections at all (which proved to be one of the most difficult problems in reasoning with knowledge from ConceptNet). Of the 191,334 distinct nodes in the modified data set:¹⁰ 2,520 concepts had more than 17 connections (above base index), 206 were equal, and 188,708 had below 17 (most of which had none). In other words, much of the data is not well connected to the rest. The real average of relations per concept was 1.175. Here are the top 5 most connected concepts, in descending order of their connectivity index value:

#   Node     Connections   Index Value
1   person        17,838       104,929
2   human          1,369         8,052
3   child          1,186         6,976
4   man            1,086         6,388
5   dog              971         5,711

Slightly over 10% of all relationships involved these top 5 concepts. What can be gathered from this information (besides evidence that dog is man's best friend)? The distribution of concepts was rather heavily biased towards concepts of specific types: those which involve people.

⁸ Available for download and execution on Windows: http://www.ogghelp.com/dsmith/conceptnet/predToCSV.zip
⁹ The K-Line category of relationships was ignored altogether.
¹⁰ This is after applying many cleaning methods to eliminate redundant data.

5 Concluding Remarks

5.1 Re-organization

The future direction of ConceptNet will depend on how effectively its data can be regulated. There is great potential based on its architecture: many of the types of relationships it represents are indeed of practical value; however, much of the helpful common sense knowledge is often spoiled by the minority of noisy data. Fortunately there are many possible ways to help eliminate bad data and improve the quality of existing and future data.

One approach attacks the problem from where the information originates: those involved with the mass collaboration. Perhaps methods to increase the motivation of the knowledge base's contributors, including ways to assure them that their contributions have been purposeful thus far, will stimulate their desire to contribute quality information.

Other possible approaches deal with the actual infrastructure. For example, a semi-structured approach, such as using the IsA relations to form a hierarchical taxonomy from which all other nodes are
interconnected, will permit the development of more sophisticated common sense reasoning methods to be executed upon the data. Merging ConceptNet and WordNet may expedite this effort, as the latter project already offers a well organized IsA ontology. (Something similar was already accomplished with an early edition of OMCS data.¹¹)

Another option is to approach the problem of reorganization in the same spirit as the rest of the OMCS collective, through mass collaboration. Chklovski developed a method of data validation in his Learner knowledge acquisition project. From a web-based interface, statements would be inferred (via cumulative analogy) from the existing data and human contributors would verify the generated assertions[1]. I have also taken this approach with CN2's online parser.¹² From user input, it can infer motivations by traversing up to two levels of IsA connections. The user may then delete incorrect information or fill in missing information. If this approach is taken, however, it should be done in only one location (to avoid splitting the project), at which many users already contribute.
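A depth-limited walk over IsA links of the kind described here might look like the sketch below. It is a hypothetical illustration, not the CN2 parser's code: the toy assertions, function names, and the max_levels parameter are assumptions. The visited set also guards against the cyclic IsA pairs noted in Section 4.

```python
from collections import defaultdict

# Toy assertions; the first three mirror the governor example in Section 4.
ASSERTIONS = [
    ("IsA", "governor", "politician"),
    ("IsA", "politician", "person"),
    ("CapableOf", "person", "speaking"),
    ("IsA", "something", "object"),   # cyclic pair of the kind noted in Section 4
    ("IsA", "object", "something"),
]

def build_index(assertions):
    """Group assertions into IsA parent links and all other relations."""
    parents = defaultdict(set)
    facts = defaultdict(set)
    for relation, first, second in assertions:
        if relation == "IsA":
            parents[first].add(second)
        else:
            facts[first].add((relation, second))
    return parents, facts

def inherited_facts(concept, parents, facts, max_levels=2):
    """Collect facts from the concept and its ancestors up to max_levels away."""
    found = set(facts.get(concept, set()))
    visited = {concept}
    frontier = {concept}
    for _ in range(max_levels):
        # Step one IsA level up, never revisiting a node (guards against cycles).
        frontier = {p for node in frontier for p in parents.get(node, set())} - visited
        if not frontier:
            break
        visited |= frontier
        for node in frontier:
            found |= facts.get(node, set())
    return found

parents, facts = build_index(ASSERTIONS)
print(inherited_facts("governor", parents, facts))   # {('CapableOf', 'speaking')}
print(inherited_facts("something", parents, facts))  # set()
```

With max_levels=1 the governor example yields nothing, since the capability sits two IsA steps away; that is why the depth limit matters for what can be inferred.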

References

[1] Chklovski, T. Using analogy to acquire commonsense knowledge from human contributors. Tech. Rep. AITR-2003-002, MIT AI Lab, Feb. 2003. Available online at ftp://publications.ai.mit.edu/aipublications/2003/AITR-2003-002.pdf.
[2] Davis, R., Shrobe, H., and Szolovits, P. What is a knowledge representation? AI Magazine 14, 1 (1993), 17.
[3] Friedland, N. Project Halo: Towards a digital Aristotle. AI Magazine 25, 4 (2004), 29.
[4] Jurafsky, D., and Martin, J. H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2000.
[5] Lieberman, H., Liu, H., Singh, P., and Barry, B. Beating some common sense into interactive applications. AI Magazine 25, 4 (2004), 63.
[6] Liu, H., and Singh, P. Commonsense reasoning in and over natural language. In Proceedings of the 8th International Conference on Knowledge-Based Intelligent Information & Engineering Systems (KES-2004) (Wellington, New Zealand, 2004), Springer-Verlag.
[7] Liu, H., and Singh, P. ConceptNet: a practical commonsense reasoning tool-kit. BT Technology Journal 22, 4 (2004), 211.
[8] McCarthy, J. Applications of circumscription to formalizing common sense knowledge. Artificial Intelligence 28 (1986), 89-116. Reprinted in [9].
[9] McCarthy, J. Formalization of common sense, papers by John McCarthy edited by V. Lifschitz. Ablex, 1990.
[10] Minsky, M. The Emotion Machine. Forthcoming; Simon & Schuster, 2005.
[11] Richardson, M., and Domingos, P. Building large knowledge bases by mass collaboration, 2003.
[12] Singh, P., Lin, T., Mueller, E., Lim, G., Perkins, T., and Zhu, W. Open mind common sense: Knowledge acquisition from the general public, 2002.

5.2 Targeting Relevant Applications

A knowledge base's relevance to types of problems determines which applications it is most appropriate for. Some have claimed that the lack of targeting relevant information is what has stunted the progress of the Cyc project: "The initial Cyc philosophy of simply entering knowledge regardless of its possible uses is arguably one of the main reasons it has failed to have a significant impact so far"[11]. In a recent independent study which surveyed three different knowledge representation and reasoning systems, those which were designed specifically for their objective produced better results than the massive amount of unspecialized knowledge in the Cyc system[3]. Based on the findings described in 4.2.1, ConceptNet may be very useful for applications which deal with social, interpersonal information. Targeting a specific domain of knowledge can be done at the knowledge acquisition level and from the perspective of which projects users of ConceptNet choose to implement. In any case, focusing on ConceptNet's current strengths may set an unprecedented course for the development of programs which exhibit social intelligence.
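The connectivity figures from Section 4.2.1 that this recommendation rests on can be recomputed from the formula given there, (100 × Connections)/17. The sketch below is illustrative only: the names are assumptions, and truncating to a whole number is inferred from the published index values rather than stated in the text.

```python
BASE_RELATION_TYPES = 17  # 20 relationship types minus the 3 ignored K-Line types
BASE_INDEX = 100          # index assigned to a node with exactly 17 connections

def connectivity_index(connections: int) -> int:
    """(100 x Connections) / 17, truncated to a whole number."""
    return (BASE_INDEX * connections) // BASE_RELATION_TYPES

# Connection counts for the five most connected nodes reported in Section 4.2.1.
TOP_FIVE = {"person": 17_838, "human": 1_369, "child": 1_186, "man": 1_086, "dog": 971}

for node, connections in TOP_FIVE.items():
    print(node, connectivity_index(connections))
# person 104929
# human 8052
# child 6976
# man 6388
# dog 5711
```

All five of the most connected nodes denote people or their close companions, which is the bias towards social information that the paragraph above builds on.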
¹¹ http://www.eturner.net/omcsnetcpp/wordnet/
¹² http://www.ogghelp.com/cn2/
