The overarching goal is to turn text into data for analysis, via the application of natural language processing (NLP) and analytical methods.
I. INTRODUCTION
Text mining, sometimes referred to as text data mining and roughly equivalent to text analytics, is the process of deriving high-quality information from text. High-quality information is typically derived by discovering patterns and trends through means such as statistical pattern learning. Text mining usually involves structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness.

Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities). Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics.
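As a minimal illustration of the structuring step described above, the following sketch (a hypothetical example, not part of the system described in this paper) tokenizes raw text and derives a word-frequency distribution, one of the simplest lexical analyses mentioned:

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split on non-letter characters (a deliberately simple scheme)
    return re.findall(r"[a-z]+", text.lower())

def word_frequencies(text):
    # Structure the unstructured input: token list -> frequency table
    return Counter(tokenize(text))

freqs = word_frequencies("Text mining turns text into data; text data can then be mined.")
print(freqs.most_common(2))  # [('text', 3), ('data', 2)]
```

A real pipeline would add linguistic features (part-of-speech tags, lemmas) and store the result in a database, but the structuring idea is the same.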
Manuscript received 2 September 2013.
S. Manikandan, Assistant Professor, Department of Information Technology, Vel Tech Multi Tech Dr. Rangarajan Dr. Sakunthala Engineering College (e-mail: manik.bhuvan@gmail.com).
P. Vijay Anand, Assistant Professor, Department of Information Technology, Vel Tech Multi Tech Dr. Rangarajan Dr. Sakunthala Engineering College (e-mail: vijayanandparthasarathy@gmail.com).
R. Prabhu, Assistant Professor, Department of Information Technology, Vel Tech Multi Tech Dr. Rangarajan Dr. Sakunthala Engineering College (e-mail: dprpit@gmail.com).
D. Suresh Babu, Assistant Professor, Department of Information Technology, Vel Tech Multi Tech Dr. Rangarajan Dr. Sakunthala Engineering College (e-mail: sureshbabu.me@gmail.com).
Measuring vocabulary consistency means measuring the consistency of the words used in a paper prepared for acceptance. The input paper is traversed in full, and the words are collected and stored in a database. These words are compared with a corpus dictionary containing a pre-graded vocabulary, and the words from the paper are graded against the corpus.

Controlled vocabularies provide a way to organize knowledge for subsequent retrieval. They are used in subject indexing schemes, subject headings, thesauri, taxonomies and other forms of knowledge organization systems. Controlled vocabulary schemes mandate the use of predefined, authorized terms preselected by the designer of the vocabulary, in contrast to natural language vocabularies, where there is no restriction on the vocabulary. When indexing a document, the indexer also has to choose the level of indexing exhaustivity, i.e., the level of detail in which the document is described. For example, with low indexing exhaustivity, minor aspects of the work will not be described with index terms. In general, the higher the indexing exhaustivity, the more terms indexed for each document. In recent years free-text search as a means of access to documents has become popular. This involves using natural language indexing with the indexing exhaustivity set to its maximum (every word in the text is indexed). Many studies have compared the efficiency and effectiveness of free-text searches against documents indexed by experts using a few well-chosen controlled vocabulary descriptors.

II. RELATED WORK
2.1 A Study of Indexing Consistency [1] [6] [7]
The article compares the indexing consistency between Library of Congress (LC) and British Library (BL) catalogers with regard to their use of the Library of Congress Subject Headings (LCSH).
Eighty-two titles, published in 1987 in the field of Library and Information Science (LIS), were identified for comparison, and for each title the LC subject headings assigned by both LC and BL catalogers were compared. By applying Hooper's "consistency of a pair" equation, the average indexing consistency value was found for the 82 titles. The average indexing consistency value between LC and BL catalogers is 16 percent for exact matches and 36 percent for partial matches. The major findings of the study are discussed, and the Appendix provides examples of LCSH headings assigned by both LC and BL catalogers for the same document, along with their consistency values.

Indexing consistency in a group of indexers is defined as "the degree of agreement in the representation of the essential information content of the document by certain sets of indexing terms selected individually and independently by each of the indexers in the group". It should be borne in mind that the sample used in this study is small. Moreover, since two different subject indexing tools (i.e., LCSH and PRECIS) are used at LC and BL, it may not be very meaningful, if at all, to compare the two groups of catalogers. Yet the indexing consistency value found in this study is similar to those reported in other consistency studies. In conclusion, the indexing consistency value between LC and BL catalogers for books in the field of LIS is 16 percent for exact matches and 36 percent for both exact and partial matches, which is quite low.
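Hooper's "consistency of a pair" measure cited above is usually written as C = A/(A + M + N), where A is the number of index terms on which the two indexers agree and M, N are the terms unique to each indexer. A small sketch (illustrative only, not the study's actual code):

```python
def hooper_consistency(terms_a, terms_b):
    # Hooper's "consistency of a pair": A / (A + M + N), where
    # A = terms assigned by both indexers,
    # M, N = terms assigned by only one of them.
    a_set, b_set = set(terms_a), set(terms_b)
    agreed = len(a_set & b_set)       # A
    total = len(a_set | b_set)        # A + M + N
    return agreed / total if total else 1.0

# Hypothetical term sets for one title, one from each cataloging agency.
lc = {"Cataloging", "Subject headings", "Libraries"}
bl = {"Cataloging", "Indexing"}
print(round(hooper_consistency(lc, bl), 2))  # 0.25: one shared term out of four distinct
```

Averaging this value over all 82 titles yields the study's reported consistency percentages.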
2.2 Measuring Inter-Indexer Consistency Using a Thesaurus [2]
The proposed consistency measure can be evaluated against the performance of a set of professional human indexers. Alternatively, for request-oriented indexing, where a document's retrievability is more important than the consistency of its representation, the weights could be derived from searchers' relevance judgments [10]. We plan to use this measure to assess the quality of automatically produced key phrases and to compare them with ones extracted by human indexers. Analysis of the conceptual relations between the phrases, instead of simple matching of their stems, will provide a sounder basis for judging the usability of automatic extraction in real-world applications.
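The "simple matching of their stems" baseline mentioned above could be sketched as follows. The suffix-stripping stemmer here is a deliberately naive stand-in (a real system would use a Porter-style stemmer), and the key phrase lists are hypothetical:

```python
def naive_stem(word):
    # Crude suffix stripping; a stand-in for a proper stemmer such as Porter's.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def stem_overlap(phrases_a, phrases_b):
    # Match key phrases by comparing the stems of their component words.
    stems = lambda p: tuple(naive_stem(w) for w in p.lower().split())
    a = {stems(p) for p in phrases_a}
    b = {stems(p) for p in phrases_b}
    return len(a & b)

auto = ["indexing consistency", "controlled vocabularies"]
human = ["indexed consistency", "controlled vocabulary"]
# The first pair matches ("index" == "index"), but the second does not
# ("vocabulari" != "vocabulary"), illustrating the weakness of stem matching.
print(stem_overlap(auto, human))  # 1
```

Failures like the second pair are exactly why the authors argue for conceptual-relation analysis over stem matching.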
III. SYSTEM AND ADVERSARY MODEL
3.1 EXISTING SYSTEM
The existing system relies on a committee of members who go through the paper and check it for consistency. The paper to be published may be written by a group of people involved in the research, each of whom submits his or her own portion of the work. When the final paper is prepared by combining these contributions, it may not be consistent at the vocabulary level. When the committee members traverse the paper, they point out the inconsistencies and re-circulate the paper for correction. This takes more time, and an additional evaluator may be needed for documenting the paper.

3.2 PROPOSED SYSTEM
In our proposed system, the application traverses the paper and collects each word. These collected words are stored in a database for comparison. The stored words are fetched and compared with a manually maintained dictionary. When the words match the respective high, medium, or low table, they are graded accordingly. This is done for every word, and the entire paper is graded section by section. Finally, the average grade is computed, and the remaining words are brought to that average by suggesting synonyms of the appropriate grade. Thus the final paper reflects the average of all the graded words, and its vocabulary is said to be consistent. When certain lemmas are not found, they are given to the author for manual alteration, and the paper again goes through the consistency check.

3.2.1 ENTRY MODULE
The Entry module maintains the user profile, tracks failed login attempts, and blocks users who exceed three failed attempts. When a new user uploads a paper, a randomly generated number is issued as that user's identifier. The user can upload only .doc, .docx, .pdf and .txt files as input.

3.2.2 TRAVERSE MODULE
The input paper is traversed and the words are collected. The collected word list is compared with the corpus dictionary words. The resulting list, with words graded as high, medium, or low, is stored in a separate database.

3.2.3 CONSISTENCY MODULE
The separated words are checked for consistency of occurrence, along with the frequency of occurrence of corpus-graded words. Their meanings are checked against their corresponding occurrences, and a synonym for each flagged word is suggested along with its definition. These suggestions are given as output so that the user can replace the necessary words as desired. The altered paper may then go through the consistency check again, resulting in a consistent paper as output.
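The traverse-grade-suggest pipeline described in Sections 3.2.2 and 3.2.3 can be sketched as follows. The graded dictionary and synonym table here are hypothetical placeholders for the database tables the system assumes:

```python
import re
from statistics import mean

# Hypothetical pre-graded corpus dictionary and synonym table
# (in the proposed system these live in a database).
GRADES = {"utilize": 3, "use": 1, "demonstrate": 3, "show": 1, "method": 2, "approach": 2}
SYNONYMS = {"use": "utilize", "utilize": "use", "show": "demonstrate", "demonstrate": "show"}

def traverse(text):
    # Traverse module: collect the words of the paper.
    return re.findall(r"[a-z]+", text.lower())

def grade_paper(text):
    # Consistency module: grade each known word, compute the paper's average
    # grade, and suggest synonyms for words that deviate from it.
    words = [w for w in traverse(text) if w in GRADES]
    if not words:
        return []
    avg = mean(GRADES[w] for w in words)
    suggestions = []
    for w in words:
        if GRADES[w] != round(avg) and w in SYNONYMS:
            suggestions.append((w, SYNONYMS[w]))
    return suggestions

print(grade_paper("We utilize a method to show results."))
# [('utilize', 'use'), ('show', 'demonstrate')]
```

Words absent from the graded dictionary (the "lemmas not found" case) would simply be reported back to the author for manual alteration.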
REFERENCES
[1] Tonta, Y. "A study of indexing consistency: consistency between the Library of Congress and the British Library catalogers."
[2] Medelyan, O. "Measuring inter-indexer consistency using a thesaurus."
[3] Hooper, R.S. (1965). Indexer consistency tests: origin, measurements, results and utilization. IBM, Bethesda.
[4] Iivonen, M. (1995). Consistency in the selection of search concepts and search terms. Information Processing and Management, 31(2), 173-190.
[5] Markey, K. (1984). Inter-indexer consistency tests. Library and Information Science Research, 6, 155-177.
[6] Rolling, L. (1981). Indexing consistency, quality and efficiency. Information Processing and Management, 17, 69-76.
[7] Zunde, P., & Dexter, M.E. (1969). Indexing consistency and quality. American Documentation, 20, 259-267.
[8] Asadi, A., Schwartz, R., & Makhoul, J. (1991). Automatic modeling for adding new words to a large-vocabulary continuous speech recognition system. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 305-308.
[9] Sun, J., Qu, H., Chakrabarti, D., & Faloutsos, C. Relevance search and anomaly detection using bipartite graphs. Carnegie Mellon University, University of Pittsburgh, Yahoo! Research.
[10] Lloret, E., & Palomar, M. Challenging issues of automatic summarization: relevance detection and quality-based evaluation. Department of Software and Computing Systems, University of Alicante, Spain.
BIBLIOGRAPHY
Mr. S. Manikandan received his B.Tech degree in Information Technology in 2006 from Anna University, Chennai, India, and his M.E degree in Systems Engineering and Operation in 2009 from Anna University, Chennai, India. He is currently employed at Vel Tech Multi Tech Engineering College, Chennai. His research interests include Data Mining, Cryptography, and Mobile Networks.

Mr. D. Suresh Babu received his B.Tech degree in Information Technology in 2006 from Anna University, Chennai, and his M.E degree in Software Engineering in 2009 from Anna University, Chennai. He is currently employed at Vel Tech Multi Tech Engineering College, Chennai. His research interests include Ad Hoc Networks, Data Mining, Network Security, and Mobile Networks.

Mr. R. Prabhu received his B.Tech degree in Information Technology in 2004 from Periyar University, Salem, and his M.Tech degree in Information Technology in 2008 from Sathyabama University, Chennai. He is currently employed at Vel Tech Multi Tech Engineering College, Chennai. His research interests include Ad Hoc Networks, Data Mining, and Mobile Networks.

Mr. P. Vijay Anand received his B.Tech degree in Information Technology in 2005 from Anna University, Chennai, and his M.Tech degree in Information Technology in 2011 from Sathyabama University, Chennai. He is currently employed at Vel Tech Multi Tech Engineering College, Chennai. His research interests include Ad Hoc Networks, Data Mining, Network Security, and Mobile Networks.
IV. CONCLUSION
Our project is a modest venture to let an author check the consistency of a research article on his or her own. Several user-friendly features have also been adopted. The package aims to satisfy the requirements of the committee members who check a paper's consistency, so that a third person is not needed for documenting the paper or for leveling its vocabulary. The application helps the author by finding inconsistent vocabulary terms and suggesting replacements at the vocabulary level that suits the grade. When the paper is traversed again, the consistency level should remain the same. The suggested words are accompanied by their definitions, which helps the author choose the appropriate word for each term.

V. FUTURE ENHANCEMENT
A future enhancement would be concept mapping. At present, when the paper is checked, the application traverses it and reports the consistency level; it also accepts papers with no identifiable concepts, finding and suggesting grades anyway. Concept checking is currently deferred to a later stage, with uploads blocked via a database that holds the results of the concept check; it could instead be performed while checking consistency itself. This can be achieved with an ontology, which helps map concepts and check for worthy ideas. When the concepts are also mapped, our application becomes complete.