
Egyptian Computer Science Journal, ECS, Vol. 36 No. 4, September 2012

ISSN-1110-2586

Legal Indexing Aid System


RAMMAL Mahmoud (a), AL ACHKAR Mona (b)
Legal Informatics center, Lebanese University, BP 5395/116, Beirut Lebanon, a rammal.mahmoud@gmail.com, bmaj_aj@hotmail.com

Abstract
Our study focuses on the analysis of the keywords assigned to the titles of texts published in the Lebanese official journal, which contains legislative and regulatory texts. The study also covers the related legal lexicon generated through the manual processing of these titles over more than two decades. The aim of the study is to describe a legal indexing aid system developed to meet the need for a homogeneous legal vocabulary, in order to achieve consistency and to enhance access to and retrieval of legal information. Our experiment shows that most of the keywords assigned to represent the content of legal documents, through manual analysis and processing of their titles, can be assigned automatically. The keywords can be extracted by a system built from pre-designed patterns and algorithms based on the frozen structure of the titles, which are analyzed and grouped according to their objectives. We use the local grammar approach, with a finite state automaton representing each group. The system aims to automatically find and extract the keywords assigned by the indexers and to suggest or generate further potential keywords based on a set of features calculated for each node of a title.

Keywords: Artificial intelligence, Legal indexing decision support system, local grammar, linguistic information retrieval, text parsing, Arabic natural language processing, Legal informatics.

1. Scope of the project


The official journal, or gazette, is where the parliament and the executive bodies and agencies of the Lebanese government publish legislative and regulatory texts. In general, each text is published under a specific title according to its subject matter. The content of these laws and regulations is immensely varied, since it deals with various aspects of citizens' lives and of the government's activities and relationships [18]. It covers all facets of human activity: social, economic, academic, political, and commercial. The gazette holds the most important place in the public and professional perception of the Lebanese legal system. Consequently, access to its content is of the utmost importance, not only for professionals and decision makers in the public and private sectors, but also for the


laymen. Nevertheless, given the huge number and variety of the published texts, searching for specific legislation or regulatory material is a difficult and time-consuming task. Hence, indexing the gazette in order to enhance information search and retrieval was the first step toward establishing the Legal Data Bank at the Center of Studies and Research in Legal Informatics at the Lebanese University [1]. However, many technical obstacles emerged. Some were due to the fact that these texts are written in Arabic, while adequate applications, software, and linguistic tools were not available. Others were due to the nature of legal language, which requires specific background knowledge related not only to knowledge representation, but also to the exact meaning of legal concepts and their relevant context. Accordingly, it was decided to rely on human indexing with a controlled vocabulary methodology.

2. Objectives of the project


Our objective is to create an indexing aid (expert) system that automates, as far as possible, the indexing of legal texts published in the official journal. The system will assist indexers in selecting appropriate terms and applying indexing rules. Moreover, it will reduce indexing costs in terms of time and effort, and achieve greater accuracy through vocabulary homogenization. Hence, we aim to implement thesaurus-based indexing alongside legal language patterning and semantic network techniques. The system is also designed to deal not only with the description of the titles' content, but also with the dynamic nature of the document database.

3. Preparatory activities
The operation relies on human indexing. It is based on respect for the special structure of the document, as well as on its relationship to three categories of clusters: administrative, legal, and topical. These categories were chosen on the basis of the explicit relations of the text, as shown by the core subject of its title, the administration concerned with its implementation, and the legal domain to which it is connected. The processing operation mainly consists of describing the content of the title, which indicates the topic of the text, by assigning keywords. For texts published without a title, the indexer assigns one. To do this, the indexer goes back either to the body of the document itself or to the titles of the texts it refers to. This practice is due to the dynamic linkage that often exists between legal texts, which makes references a determining element, not only in describing the content, but also in building the contextual environment and deciding on the accuracy of implicit concepts. It reveals the importance of the network of titles and texts in specifying the semantic environment. These links refer to laws or regulations that have been modified, implemented, detailed, or repealed. Assigned keywords reflect the explicit concepts contained in the title and the document, but also the implicit ones. Given that a computer cannot understand natural language and the


implicit meaning it bears, the implicit concepts weighed heavily in the choice of the human indexing methodology. This indexing effort has been ongoing for almost three decades and has generated a list of keywords that was used to develop linguistic tools, among them an extensive lexicon specifically designed to ensure the consistency of the indexing vocabulary. Nevertheless, despite the satisfactory results it yields, human indexing remains expensive and biased by subjectivity, since it relies heavily on personal understanding and interpretation of the analyzed content, which in turn depends on the expertise, the scientific knowledge, and the personal background of the indexer [11]. Moreover, a study done at the center revealed not only that keyword consistency is questionable (since it differs from one indexing session to another and from one indexer to another), but also that the operation proved neither cost-effective nor efficient in attaining the methodology's stated objectives, such as:
- Understanding the individuality of the text.
- Expressing the meaning of this distinctiveness accurately and consistently through specific keywords.
- Controlling the indexing vocabulary.
- Achieving search and retrieval pertinence worthy of the financial cost invested in the work.

3.1 Evaluation process

Given that human indexing was adopted mainly to achieve what automatic indexing cannot, namely understanding and describing the implicit content of the text, it was natural to evaluate the output of the methodology through a study analyzing the nature of the assigned keywords (apparent or implicit) and their role in describing the content, in light of the fact that the gazette's vocabulary is a strict language. A total of 5593 keywords were used to describe 1098 legislative and regulatory texts. The study's results showed the following:
- 79% of the keywords used to describe the apparent concepts were descriptors taken literally either from the title or from the body of the document itself. Moreover, 93% of the apparent keywords were taken as is, without any grammatical or syntactic change, 2.9% were derived forms, and 4.1% were synonyms.
- The keywords meant to describe implicit concepts were used in only 59.4% of the texts and represented only 17.3% of the 5593 keywords, while the remaining 2.8% of the keywords were false descriptors.
- A closer look at these keywords showed that only 51.6% were precise, while 4.6% were broader terms, 40.9% were contextual, and 0.7% were inaccurate. Besides, 45% of these keywords represent either administrative divisions or proper nouns.


3.2 Assessment process

In this context, we came to believe that automating the keyword assignment operation is a must, especially when the above results are combined with some well-known elements such as:
- The strict nature of legislative and regulatory language, which always uses the same terms to describe the same concepts.
- The particular linguistic nature of the titles, which start with one of a limited number of specific verbs.
- The dangers of the inconsistency of human indexing, as well as its lack of pertinence even when carried out by persons with thorough knowledge of the domain.
- The importance of the networks of texts, which help to determine and specify the implicit concepts.

4. Related work
In general, keyword extraction methods use statistical, linguistic, or hybrid approaches. They explicitly use the information contained in a document, such as word frequency and word position [22], or TF x IDF (term frequency and inverse document frequency) combined with the distance between words or the part of speech (POS) of a word [14][26][6]. Recently, with the expansion of the Semantic Web, extraction approaches have also worked at the semantic level, as in [10][7]. To build a help system for selecting keywords, we use a particular linguistic tool, the local grammar [13], which is better suited to representing frozen texts where only limited variation in form is possible. Local grammars have been used for information extraction: many systems have been developed to extract proper names in different languages [17][4], or to extract dates, times, and measures [5]. A local grammar is represented by an FSA and creates a syntactic chain between the words of a title. In this paper, we use local grammars to represent the titles of the texts of the official journal in order to select keywords that semantically represent each title.
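To make the statistical family of approaches mentioned above concrete, the following is a minimal sketch of TF x IDF weighting. It is not part of the system described in this paper, which relies on local grammars; the function name `tf_idf`, the whitespace tokenization, the logarithmic IDF variant, and the sample titles are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


# Hypothetical, simplified titles used only to exercise the function.
titles = [
    "conclude cooperation agreement youth sports",
    "conclude executive program tourism cooperation",
    "renew license",
]
scores = tf_idf(titles)
```

Under this weighting, a term that appears in only one title (such as "youth") receives a higher weight than a term shared by several titles (such as "conclude"), which is why purely statistical methods tend to miss the recurring verbs that structure the gazette titles.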

5. Database of Official Lebanese Journal


The database of the Official Lebanese Journal contains the texts published from 1918 to 2012, representing more than 62000 legal texts. It is the historical file of the laws and decrees in force. The system is currently in operation, and a free information retrieval service is available on the web [21]. It is associated with a legal lexicon of 40000 terms, defined and linked by synonym relations. These relations are important for the accurate and effective choice of words, not only in communication, translation, and content representation, but also in pertinent access to the information.


The titles of the official journal texts are the element that describes the main subject of any given text. Hence, the sample we chose to work with consists of about 32000 titles published over a decade, from 2000 to 2010. It represents 52% of all the texts published in the official journal.

6. Project execution and implementation


Looking at the syntactic structure of the titles, we notice a consistent pattern: they always start with a verb, which plays the determining role in generating the main guidelines for content representation and description. Moreover, the same verb is most often followed by the same group of words, which convey the subject, the object, the parties concerned by the text, etc. While the position of the words may change slightly, this has no implication for the meaning or for the keyword selection process. Accordingly, the titles follow a model based on the nature of the first word, which is a verb, and on the meaning and semantic environment of this starting word.

At the indexing level, the verbs and the following groups of words taken from the titles are the keywords usually chosen to describe explicit meaning. Keywords not taken from the titles are either synonyms of a word that appears in the title or in the body of the text, or are derived from the semantic relation between the words that appear in the title. Building on these facts, our work starts by categorizing the titles according to the verbs they start with, and the methodology relies heavily on the structure of the title itself. The groups of words that follow the verb are what we call the scheme of the verb. As an example, Table 1 shows a sample of verbs and their numbers of occurrences.

Table 1. Sample of verbs used in the titles
Translated Verb    # occurrence
Establishment      3631
Transfer           2959
License            2812
Accept             1944
Generation         1657
Renew              1066
Set                1017
Leave              930
Ratification       862
Identify           516
Considered         403
Recognition        379
Conclude           345
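The categorization step that produces counts like those in Table 1 can be sketched as follows. This is a minimal illustration, assuming that each title has already been tokenized so that its first token is the verb; the function name `group_by_first_word` and the second and third sample titles are hypothetical, and English translations stand in for the Arabic originals.

```python
from collections import Counter, defaultdict


def group_by_first_word(titles):
    """Group titles under their first word, which is assumed to be the verb."""
    groups = defaultdict(list)
    for title in titles:
        words = title.split()
        if words:  # skip empty titles
            groups[words[0].lower()].append(title)
    return groups


# Illustrative titles only; the real corpus holds about 32000 Arabic titles.
titles = [
    "Conclude a cooperation agreement in the field of youth and sports",
    "Renew the license of a private school",
    "Conclude a loan agreement",
]
groups = group_by_first_word(titles)
# Number of titles per verb, comparable to the "# occurrence" column.
counts = Counter({verb: len(members) for verb, members in groups.items()})
```

Each resulting group can then be studied on its own to describe the scheme of its verb.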


The study of each scheme suggests that titles are sufficiently frozen to be described by local grammars [13], which allow the construction of finite state automata (FSA) representing the local grammar for each group type.

7. The architecture of the developed expert systems


The FSA is composed of states, or nodes [25]. The initial state represents the verb, and the other states are the different parts related to the verb. Keywords are extracted by an analyzer of the FSA, and the extracted words are normalized using the dictionary of terms with its morphological forms and synonyms. The architecture of the system is shown in Fig. 1.
[Diagram: the Title enters the Analyzer (filter the title; stem the terms of the title; apply the local grammar; apply the rules), which draws on the Resources (local grammar; rules; list of synonyms; list of morphological variations). The output passes through Keyword extraction, then Verification & approval, then Insertion into the database.]

Figure (1): The architecture of the extraction system

The system is organized into three modules: the analyzer, the resources, and the database system. The latter carries out a three-fold operation: keyword extraction, verification and approval, and insertion into the database. The system uses a set of resources: the local grammar, the rules, the list of synonyms, and the list of morphological variations. The title is the input to the system.


The analyzer is the main component of the system. Its essential feature is that it is syntax-directed. It starts with a filter that selects the first term of the title; the terms of the title are then normalized using the synonyms list and the morphological variations list. After that, the analyzer applies the FSA corresponding to the first word of the title. A set of rules defines the semantic relations between the terms in order to facilitate the choice of keywords. The rules are generally expressed in 'IF-THEN' form. An example of a rule is as follows: IF the [Object] contains "cooperation agreement" and the [Subject] contains "youth and sports" THEN the keywords are "Sports cooperation" and "Youth cooperation".
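The normalization step and the IF-THEN rule described above can be sketched as follows. This is only an illustration under stated assumptions: the `Rule` class, the `VARIANTS` table, the substring-based matching, and the function names are hypothetical and do not reproduce the actual resources or rule engine of the system.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    conditions: dict  # slot name -> phrase that must appear in that slot
    keywords: list    # keywords added when every condition holds

    def fires(self, slots):
        return all(
            phrase in slots.get(slot, "") for slot, phrase in self.conditions.items()
        )


# Hypothetical morphological variants mapped to a canonical form.
VARIANTS = {"sport": "sports", "agreements": "agreement"}

# The example rule from the text, encoded as data.
RULES = [
    Rule(
        conditions={"object": "cooperation agreement", "subject": "youth and sports"},
        keywords=["Sports cooperation", "Youth cooperation"],
    )
]


def normalize(text, variants):
    """Lower-case the text and replace each word by its canonical form."""
    return " ".join(variants.get(word, word) for word in text.lower().split())


def apply_rules(slots, rules):
    """Normalize every slot, then collect the keywords of each rule that fires."""
    normalized = {name: normalize(value, VARIANTS) for name, value in slots.items()}
    keywords = []
    for rule in rules:
        if rule.fires(normalized):
            keywords.extend(rule.keywords)
    return keywords


slots = {"object": "Cooperation Agreements", "subject": "the field of youth and sport"}
suggested = apply_rules(slots, RULES)
```

Keeping the rules as data rather than code is one possible design choice: it would let indexers review and extend the rule set without touching the analyzer.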

8. Implementation and Results


The scheme of the verb can be seen as a frame with different slots that represent the states [6]. To each state we assign a list of words that allow the transition in the graph. After a state is analyzed, an action is triggered to add keywords based on the current state and the previous state. Through the semantic relations between the states, this method makes it possible to add implicit keywords. For example, we apply our system to titles of the verb type "Conclude". The graph in Figure (2) represents the FSA for the verb type "Conclude". It is composed of four states: the initial state, which activates the FSA, represents the first word of the title; the Object describes the type of agreement; the Subject describes the objectives; and the Parties are those concerned by the signed agreement.
Figure (2): General FSA for the verb "Conclude" (states: Verb, Object, Subject, Parties)

The grayed states, Object and Subject, are the names of sub-graphs. The Object sub-graph, which recognizes the type of agreement governed by the verb "conclude", is composed of two lists of single words that facilitate the composition of the keywords, as shown in Figure (3). The Subject sub-graph is composed of two boxes: it first recognizes prepositional expressions such as "about", "concerning", and "in the field"; the other state is the sentence that contains the objectives of the agreement, as shown in Figure (4). The Parties state is not processed by our system. Further study is needed to build a specialized ontology that can represent the variety, the particularities, and the structures of the different state departments and administrations usually involved in the signature of conventions and contracts between the Lebanese government and foreign countries, international and regional organizations, NGOs, and governments. The parties are added manually. Table 2 shows a sample of titles recognized by the system.


Table 2: Sample of Official Lebanese Journal titles
1. Conclude a cooperation agreement in the field of youth and sports between the Government of the Republic of Lebanon and the Government of the Hashemite Kingdom of Jordan
2. The conclusion of an executive program in the field of tourism cooperation between the Government of the Republic of Lebanon and the Government of Arab Republic of Egypt

After analyzing the title, the system can add keywords as shown in Table 3.

Table 3: Keywords extracted by the system
State                                                   Keywords
Verb: conclude                                          Conclude; signature
Object: cooperation agreement                           Cooperation; Cooperation agreement
Subject: the field of youth and sports                  Sports cooperation; Youth cooperation
Parties: the Government of the Republic of Lebanon      Not processed
  and the Government of the Hashemite Kingdom of Jordan
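The way the automaton fills the states listed in Table 3 can be sketched as follows. This is a simplified illustration, not the system's implementation: it works on the English translation of the first title in Table 2, and the word lists (`VERBS`, `ARTICLES`, `SUBJECT_MARKERS`, `PARTIES_MARKER`) and the function `parse_title` are assumptions introduced here. Unlike Table 3, the sketch drops the introducing expression "in the field of" from the Subject slot.

```python
VERBS = {"conclude"}
ARTICLES = {"a", "an", "the"}
# Expressions that trigger the transition Object -> Subject.
SUBJECT_MARKERS = [("in", "the", "field", "of"), ("concerning",), ("about",)]
# Word that triggers the transition to the Parties state.
PARTIES_MARKER = "between"


def parse_title(title):
    """Fill the Object, Subject, and Parties slots of a 'Conclude' title."""
    tokens = title.lower().split()
    if not tokens or tokens[0] not in VERBS:
        return None  # no automaton is defined for this first word
    slots = {"verb": tokens[0], "object": [], "subject": [], "parties": []}
    state = "object"
    i = 1
    while i < len(tokens):
        if state == "object":
            marker = next(
                (m for m in SUBJECT_MARKERS if tuple(tokens[i:i + len(m)]) == m),
                None,
            )
            if marker:
                state = "subject"
                i += len(marker)
                continue
            if tokens[i] in ARTICLES:
                i += 1  # articles do not contribute to the Object slot
                continue
        if state in ("object", "subject") and tokens[i] == PARTIES_MARKER:
            state = "parties"
            i += 1
            continue
        slots[state].append(tokens[i])
        i += 1
    return {k: (v if k == "verb" else " ".join(v)) for k, v in slots.items()}


title = (
    "Conclude a cooperation agreement in the field of youth and sports "
    "between the Government of the Republic of Lebanon and the Government "
    "of the Hashemite Kingdom of Jordan"
)
result = parse_title(title)
```

The filled slots can then be passed to IF-THEN rules to generate keywords, while the Parties slot is left for manual processing, as in the system described here.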


Figure (3): Detailed sub-graph for the Object

Figure (4): Sub-graph of the Subject


9. Benefits of the Developed System


The advantages of the system can be measured at two levels: processing and access. At the processing level, it improves the chances of a consistent and accurate representation of the content, through automatic guidance of the indexers as well as through vocabulary homogenization. It also provides the opportunity to reduce effort and time. Moreover, the system can serve as a problem-solving aid, especially when used by beginner indexers. At the same time, it helps reviewers decide objectively on the accuracy of an indexing term. At the access level, it enhances the pertinence of access and retrieval, since it reduces the silence and the noise that result from the irrelevance and the disparity of keywords describing the same concepts.

10. Conclusion
This system was implemented on a sample taken from the existing database. The indexers who reviewed the results were satisfied with the consistency, exactness, and precision of the assigned keywords. They even reported similarity between the automatically assigned keywords and the manually assigned ones. This observation is confirmed by a similarity study between the terms extracted manually and those extracted automatically. We used the cosine coefficient [22], which is one of the methods used to measure the similarity between two groups of words. The terms are represented by a weighted vector: we simply give the weight 1 to a term present in the list and 0 otherwise. The result is about 0.76. We explain this result by the absence of proper names, which are not integrated into the system, and by the fact that some titles do not exactly match the structure of the local grammar. As stated, geographical entities, personal names, and institution names have not been integrated into the processing operation. Their variety and their huge number make them very difficult to process in the absence of specific ontologies. Accordingly, they still need to be represented through human intervention. The next step in the project will focus on automating the categorization task, which involves classifying the official journal's texts according to their content into three categories: legal, administrative, and thematic. In addition, we will work on the list of keywords to build a legal domain ontology [2] [23].
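The similarity measure used in the study above can be sketched as follows. With weights of 1 for a term present in a list and 0 otherwise, the cosine coefficient reduces to the number of shared terms divided by the square root of the product of the two list sizes. The function name `cosine_binary` and the two keyword lists are hypothetical and serve only as an illustration; they are not the data behind the reported value of about 0.76.

```python
import math


def cosine_binary(manual, automatic):
    """Cosine coefficient between two keyword lists with 0/1 term weights."""
    a, b = set(manual), set(automatic)
    if not a or not b:
        return 0.0  # similarity is undefined for an empty list; treated as 0 here
    # Dot product = number of shared terms; each norm = sqrt(number of terms).
    return len(a & b) / math.sqrt(len(a) * len(b))


# Hypothetical keyword lists for a single title.
manual = ["conclude", "cooperation agreement", "sports cooperation",
          "youth cooperation", "jordan"]
automatic = ["conclude", "cooperation", "cooperation agreement",
             "sports cooperation", "youth cooperation"]
similarity = cosine_binary(manual, automatic)
```

In this example the manual list contains a proper name that the automatic list lacks, which lowers the score in the same way as the missing proper names discussed above.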


References
[1] Al Achkar Mona (2007), Official Journal Indexing at the Legal Informatics Center, Lebanese University. Internal Report.
[2] A. Salem, Marco Alphonce (2010), Web-Based Ontologies for Breast and Lung Cancer, 6th International Conference of Euro-Mediterranean Medical Informatics and Telemedicine.
[3] Barnbrook G. (2002). Defining Language: A local grammar of definition sentences. Amsterdam: John Benjamins Publishers.
[4] Choi, Key-sun, Nam, Jee-sun (1997). A Local-Grammar-based Approach to Recognizing of Proper Names in Korean Texts. Proceedings of the Fifth Workshop on Very Large Corpora, University/Hong-Kong University of Science and Technology, pp. 2730-288.
[5] Constant M. (2002), Methods for constructing lexicon-grammar resources: the example of measure expressions, Proceedings of the 3rd Language Resources and Evaluation Conference, Las Palmas.
[6] E. Frank, G.W. Paynter, I.H. Witten, C. Gutwin, C.G. Nevill-Manning (1999), Domain specific keyphrase extraction, In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, Morgan Kaufmann, pp. 667-668.
[7] Ercan, G., Cicekli, I. (2007). Using lexical chains for keyword extraction, In Information Processing and Management 43(6), pp. 1705-1714.
[8] Friburger N., Maurel D. (2001). Finite state transducer cascade to extract proper nouns in French text, 2nd Conference on Implementing and Application of Automata, in Lecture Notes in Computer Science, Pretoria (South Africa).
[9] Frantzi, K.T., Ananiadou, S. (1996). A hybrid approach to term recognition. Proc. Int. Conf. on Natural Language Processing and Industrial Applications, Université de Moncton, Canada.
[10] Grineva, M., Grinev, M., Lizorkin, D. (2009). Extracting key terms from noisy and multitheme documents. In Proceedings of the 18th International Conference on WWW, pp. 661-670.
[11] Ginger Shields (2005), What are the main differences between human indexing and automatic indexing? LI-842 Automatic Indexing Assignment, April 26, 2005.
[12] Gonenc Ercan, Ilyas Cicekli, Using Lexical Chains for Keyword Extraction, In Information Processing & Management, Vol. 43, Issue 6, November 2007, pp. 1705-1714.
[13] Gross M. (1993). Local grammars and their representation by finite automata. In M. Hoey (ed.), Data, Description, Discourse, pp. 26-38. HarperCollins, London.


[14] Mihalcea, R., Tarau, P. (2004). TextRank: Bringing order into texts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-04), pp. 233-242.
[15] Harris Z. (1991). A Theory of Language and Information: A Mathematical Approach. Oxford: Clarendon Press.
[16] Hunston, S., Sinclair, J. (2000). A Local Grammar of Evaluation. In Evaluation in Text: Authorial Stance and the Construction of Discourse, Hunston, S. & Thompson, G. (eds), Oxford, Oxford University Press, pp. 75-100.
[17] H. N. Traboulsi (2006), Named Entity Recognition: A Local Grammar-based Approach, Ph.D. dissertation, Dept. of Computing, Surrey Univ., Guildford.
[18] Legal Informatics Center (1993): Data Base. Lebanese University publications (in Arabic), Beirut.
[19] Peter D. Karp (1993), The Design Space of Frame Knowledge Representation Systems, Artificial Intelligence Center, SRI International, note #512.
[20] P. Turney (2000), Learning to extract keyphrases from text, In Journal of Information Retrieval, 2(4), pp. 303-336.
[21] RAMMAL M. (2006), Access to Legal Documents on the Web: The Lebanese Experience, in The 2nd International Workshop on New Trends in Information, NTIT2006, Homs, Syria.
[22] Salton G., Buckley C. (1988), Term-weighting approaches in automatic text retrieval, Information Processing and Management, 24(5), pp. 513-523.
[23] Sartor, G., Casanovas, P., Biasiotti, M., Fernández-Barrera (2011), Approaches to Legal Ontologies, Law, Governance and Technology Series, Vol. 1, 1st Edition.
[24] Silberztein M. (1997), The Lexical Analysis of Natural Languages, In E. Roche, Y. Schabes (eds.), Finite State Language Processing, The MIT Press, Cambridge, MA.
[25] Woods W.A. (1970), Transition Network Grammars for Natural Language Analysis, Communications of the ACM, 13:10.
[26] Zhang, K., Xu, H., Tang, J., Li, J.-Z. (2006). Keyword extraction using support vector machine. In Lecture Notes in Computer Science: Advances in Web-Age Information Management, pp. 85-96.
