
International Conference on Computer Applications 2012

Volume 5



















International Conference on Computer Applications 2012
Volume 5
In association with
Association of Scientists, Developers and Faculties (ASDF), India
Association for Computing Machinery (ACM)
Science & Engineering Research Support Society (SERSC), Korea

Knowledge Engineering, Software Engineering Systems

27-31 January 2012
Pondicherry, India


Editor-in-Chief
K. Kokula Krishna Hari

Editors:
E Saikishore, T R Srinivasan, D Loganathan,
K Bomannaraja and R Ponnusamy




Published by

Association of Scientists, Developers and Faculties
Address: 27, 3rd main road, Kumaran Nagar Extn., Lawspet, Pondicherry-65008
Email: admin@asdf.org.in || www.asdf.org.in

International Conference on Computer Applications (ICCA 2012)
VOLUME 5

Editor-in-Chief: K. Kokula Krishna Hari
Editors: E Saikishore, T R Srinivasan, D Loganathan, K Bomannaraja and R Ponnusamy

Copyright 2012 ICCA 2012 Organizers. All rights reserved.

This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including
photocopying, recording or any information storage and retrieval system now known or to be invented, without written
permission from the ICCA 2012 Organizers or the Publisher.

Disclaimer:
No responsibility is assumed by the ICCA 2012 Organizers/Publisher for any injury and/or damage to persons or
property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products
or ideas contained in the material herein. The contents of the papers are as submitted and approved by the contributors
after changes in formatting. Whilst every attempt has been made to ensure that all aspects of the papers are uniform in
style, the ICCA 2012 Organizers, Publisher and Editor(s) will not be responsible in any way for the accuracy,
correctness or representation of any statements or documents presented in the papers.


ISBN-13: 978-81-920575-8-3
ISBN-10: 81-920575-8-3











PREFACE

These proceedings are a part of the International Conference on Computer Applications 2012, which was
held in Pondicherry, India from 27 January 2012 to 31 January 2012. This conference was hosted by
Techno Forum Research and Development Centre, Pondicherry in association with the Association for
Computing Machinery (ACM), the Association of Scientists, Developers and Faculties (ASDF), India,
the British Computer Society (BCS), UK and the Science and Engineering Research Support Society (SERSC),
Korea.
The world is changing. From shopping malls to transport terminals, aircraft to passenger ships, the
infrastructure of society has to cope with ever more intense and complex flows of people. Today,
more than ever, safety, efficiency and comfort are issues that must be addressed by all designers. The
World Trade Centre disaster brought into tragic focus the need for well-designed evacuation systems.
The new regulatory framework in the marine industry acknowledges not only the importance of
ensuring that the built environment is safe, but also the central role that evacuation simulation can
play in achieving this.
An additional need is to design spaces for efficiency, ensuring that maximum throughput can be
achieved during normal operations, and for comfort, ensuring that the resulting flows offer little
opportunity for needless queuing or excessive congestion. These complex demands challenge
traditional prescriptive design guides and regulations. Designers and regulators are consequently
turning to performance-based analysis and regulations facilitated by the new generation of people
movement models.
While great changes have been achieved in these past years, still more remains to be achieved, much of
which still seems as far away as the blue-sky visions of the 1970s. But for all the challenges, capabilities
continue to advance at phenomenal speed. Even three years ago it may have been considered a challenge
to perform a network design involving the evacuation of 45,000 people from a 120-story building, but with
today's sophisticated modelling tools and high-end PCs, this is now possible. Today's challenges are much
more ambitious and involve simulating the movement and behaviour of over one million people in city-sized
geometries. The management of these networks is also easier; more specifically, all 45,000 people can be
monitored by a single person sitting in his cabin. This is evidence of the development achieved these days.
As such, the conference represents a unique opportunity for experts and beginners to gain insight into
this rapidly developing field.
I would also like to thank all those who cooperated in bringing out these proceedings for you, chief
among them my mom Mrs. K. Lakshmi and my dad Mr. J. Kunasekaran. Apart from them, my worthy
gang of friends, including Dr. S. Prithiv Rajan, Chairman of this conference, Dr. R. S. Sudhakar, Patron
of this conference, Dr. A. Manikandan and Dr. S. Avinash, Conveners of this conference, Dr. E. Sai
Kishore, Organizing Secretary of this conference, and the entire team worked along with me for the
rapid success of the conference over the past year from the date of initiating this conference. I also need
to appreciate Prof. T. R. Srinivasan and his team at Vidyaa Vikas College of Engineering and Technology
for helping to make the publication job easy.
Finally, I thank my family, friends, students and colleagues for their constant encouragement and
support in making this conference possible.
-- K. Kokula Krishna Hari
Editor-in-Chief









Organizing Committee

Chief Patron
Kokula Krishna Hari K, Founder & President, Techno Forum Group, Pondicherry, India

Patron
Sudhakar R S, Chief Executive Officer(CEO), Techno Forum Group, Pondicherry, India

Chairman
Prithiv Rajan S, Chairman & Advisor, Techno Forum Group, Pondicherry, India

Convener
Manikandan A, Chief Human Resources Officer(CHRO), Techno Forum Group, India

Organizing Secretary
Sai Kishore E, Chief Information Officer, Techno Forum Group, India

Operations Chair
G S Tomar, Director, MIR Labs, Gwalior, India

International Chair
Maaruf Ali, Executive Director, ASDF - Europe

Hospitality
Muthualagan R, Alagappa College of Technology, Chennai

Industry Liaison Chair
Manikandan S, Executive Secretary, Techno Forum Group, India

Technical Panels Chair
Debnath Bhattacharyya, Executive Director, ASDF - West Bengal, India

Technical Chair
Samir Kumar Bandyopadhyay, Former Registrar, University of Calcutta, India
Ponnusamy R, President, Artificial Intelligence Association of India, India
Srinivasan T R, Vice-Principal, Vidyaa Vikas College of Engineering and Technology

Workshops Panel Chair
Loganathan D, Department of Computer Science and Engineering, Pondicherry Engineering College, India

MIS Co-Ordinator
Harish G, Trustee, Techno Forum Research and Development Centre, Pondicherry

Academic Chair
Bommanna Raja K, Principal, Excel College of Engineering for Women, India
Tai-Hoon Kim, Professor & Chairman, Dept. of Multimedia, Hannam University, Korea



















TECHNICAL REVIEWERS
Adethya Sudarsanan Cognizant Technology Solutions, India
Ainuddin University of Malaya, Malaysia
Ajay Chakravarthy University of Southampton, UK
Alessandro Rizzi University of Milan, Italy
Al-Sakib Khan Pathan International Islamic University, Malaysia
Angelina Geetha B S Abdur Rahman University, Chennai
Aramudhan M PKIET, Karaikal, India
Arivazhagan S Mepco Schlenk Engineering College, India
Arokiasamy A Anjalai Ammal Mahalingam Engineering College, India
Arul Lawrence Selvakumar A Adhiparasakthi Engineering College, India
Arulmurugan V Pondicherry University, India
Aruna Deoskar Institute of Industrial & Computer Management and Research, Pune
Ashish Chaurasia Gyan Ganga Institute of Technology & Sciences, Jabalpur, India
Ashish Rastogi Guru Ghasidas University, India
Ashutosh Kumar Dubey Trinity Institute of Technology & Research, India
Avadhani P S Andhra University, India
Bhavana Gupta All Saints College of Technology, India
Bing Shi University of Southampton, UK
C Arun R. M. K. College of Engineering and Technology, India
Chandrasekaran M Government College of Engineering, Salem, India
Chandrasekaran S Rajalakshmi Engineering College, Chennai, India
Chaudhari A L University of Pune, India
Ching-Hsien Hsu Chung Hua University, Taiwan
Chitra Krishnamoorthy St Josephs College of Engineering and Technology, India
Christian Esteve Rothenberg CPqD (Telecom Research Center), Brazil
Chun-Chieh Huang Minghsin University of Science and Technology, Taiwan
Darshan M Golla Andhra University, India
Elvinia Riccobene University of Milan, Italy
Fazidah Othman University of Malaya, Malaysia
Fulvio Frati University of Milan, Italy
G Jeyakumar Amrita School of Engineering, India
Geetharamani R Rajalakshmi Engineering College, Chennai, India
Gemikonakli O Middlesex University, UK
Ghassemlooy Z Northumbria University, UK
Gregorio Martinez Perez University of Murcia, Spain
Hamid Abdulla University of Malaya, Malaysia
Hanumantha Reddy T Rao Bahadur Y Mahabaleswarappa Engineering College, Bellary
Hari Mohan Pandey NMIMS University, India
Helge Langseth Norwegian University of Science and Technology, Norway
Ion Tutanescu University of Pitesti, Romania
Jaime Lloret Universidad Politecnica de Valencia, Spain
Jeya Mala D Thiagarajar College of Engineering, India
Jinjun Chen University of Technology Sydney, Australia
Joel Rodrigues University of Beira Interior, Portugal
John Sanjeev Kumar A Thiagarajar College of Engineering, India
Joseph M Mother Terasa College of Engineering & Technology, India
K Gopalan Professor, Purdue University Calumet, US
K N Rao Andhra University, India
Kachwala T NMIMS University, India
Kannan Balasubramanian Mepco Schlenk Engineering College, India
Kannan N Jayaram College of Engineering and Technology, Trichy, India
Kasturi Dewi Varathan University of Malaya, Malaysia
Kathirvel A Karpaga Vinayaga College of Engineering & Technology, India
Kavita Singh University of Delhi, India
Kiran Kumari Patil Reva Institute of Technology and Management, Bangalore, India
Krishnamachar Sreenivasan IIT-KGP, India
Kumar D Periyar Maniammai University, Thanjavur, India
Lajos Hanzo Chair of Telecommunications, University of Southampton, UK
Longbing Cao University of Technology, Sydney
Lugmayr Artur Texas State University, United States
M HariHaraSudhan Pondicherry University, India
Maheswaran R Mepco Schlenk Engineering College, India
Malmurugan N Kalaignar Karunanidhi Institute of Technology, India
Manju Lata Agarwal University of Delhi, India
Mazliza Othman University of Malaya, Malaysia
Mohammad M Banat Jordan University of Science and Technology
Moni S NIC - GoI, India
Mónica Aguilar Igartua Universitat Politècnica de Catalunya, Spain
Mukesh D. Patil Indian Institute of Technology, Mumbai, India
Murthy B K Department of Information and Technology - GoI, India
Nagarajan S K Annamalai University, India
Nilanjan Chattopadhyay S P Jain Institute of Management & Research, Mumbai, India
Niloy Ganguly IIT-KGP, India
Nornazlita Hussin University of Malaya, Malaysia
Panchanatham N Annamalai University, India
Parvatha Varthini B St Josephs College of Engineering, India
Parveen Begam MAM College of Engineering and Technology, Trichy
Pascal Hitzler Wright State University, Dayton, US
Pijush Kanti Bhattacharjee Assam University, Assam, India
Ponnammal Natarajan Rajalakshmi Engineering College, Chennai, India
Poorna Balakrishnan Easwari Engineering College, India
Poornachandra S RMD Engineering College, India
Pradip Kumar Bala IIT, Roorkee
Prasanna N TMG College, India
Prem Shankar Goel Chairman - RAE, DRDO-GoI, India
Priyesh Kanungo Patel Group of Institutions, India
Radha S SSN College of Engineering, Chennai, India
Radhakrishnan V Mookamibigai College of Engineering, India
Raja K Narasu's Sarathy Institute of Technology, India
Ram Shanmugam Texas State University, United States
Ramkumar J VLB Janakiammal college of Arts & Science, India
Rao D H Jain College of Engineering, India
Ravichandran C G R V S College of Engineering and Technology, India
Ravikant Swami Arni University, India
Raviraja S University of Malaya, Malaysia
Rishad A Shafik University of Southampton, UK
Rudra P Pradhan IIT-KGP, India
Sahaaya Arul Mary S A Jayaram College of Engineering & Technology, India
Sanjay Chaudhary DA-IICT, India
Sanjay K Jain University of Delhi, India
Satheesh Kumar KG Asian School of Business, Trivandrum, India
Saurabh Dutta Dr B C Roy Engineering College, Durgapur, India
Senthamarai Kannan S Thiagarajar College of Engineering, India
Senthil Arasu B National Institute of Technology - Trichy, India
Senthil Kumar A V Hindustan College, Coimbatore, India
Shanmugam A Bannari Amman Institute of Technology, Erode, India
Sharon Pande NMIMS University, India
Sheila Anand Rajalakshmi Engineering College, Chennai, India
Shenbagaraj R Mepco Schlenk Engineering College, India
Shilpa Bhalerao FCA Acropolis Institute of Technology and Research
Singaravel G K. S. R. College of Engineering, India
Sivabalan A SMK Fomra Institute of Technology, India
Sivakumar D Anna University, Chennai
Sivakumar V J National Institute of Technology - Trichy, India
Sivasubramanian A St Josephs College of Engineering and Technology, India
Sreenivasa Reddy E Acharya Nagarjuna University, India
Sri Devi Ravana University of Malaya, Malaysia
Srinivasan A MNM Jain Engineering College, Chennai
Srinivasan K S Easwari Engineering College, Chennai, India
Stefanos Gritzalis University of the Aegean, Greece
Stelvio Cimato University of Milan, Italy
Subramanian K IGNOU, India
Suresh G R DMI College of Engineering, Chennai, India
Tulika Pandey Department of Information and Technology - GoI, India
Vasudha Bhatnagar University of Delhi, India
Venkataramani Y Saranathan College of Engineering, India
Verma R S Joint Director, Department of Information and Technology - GoI, India
Vijayalakshmi K Mepco Schlenk Engineering College, India
Vijayalakshmi S Vellore Institute of Technology, India
Ville Luotonen Hermia Limited, Spain
Vimala Balakrishnan University of Malaya, Malaysia
Vishnuprasad Nagadevara Indian Institute of Management - Bangalore, India
Wang Wei University of Nottingham, Malaysia
Yulei Wu Chinese Academy of Sciences, China


Part I
Proceedings of the Second International Conference on
Computer Applications 2012
ICCA 12
Volume 5
Applying Database Independent Technique to Discover Interoperability Problem

Sachin Yadav, Department of MCA, Shri Ram Swroop Memorial Group of Professional Colleges, Lucknow, India
Shefali Verma, Department of MCA, Shri Ram Swroop Memorial Group of Professional Colleges, Lucknow, India


Abstract: These days many software companies understand the importance of developing applications
that do not depend on a specific database type (e.g. Oracle, SQL Server, DB2), which allows customers
to choose the platform they already use. In general, software developers recognize that their customers
are responsible for database maintenance and must use existing platforms and personnel. In this paper,
we discuss the differences between databases (the SQL Server and Oracle platforms) from an application
perspective and the possible methods of developing applications in a non-database-dependent environment.
With a design-for-databases tool you can create a data model for a specific DBMS, or you can create a data
model independent of the implementation of any particular DBMS. You can choose the DBMS you want to
use at the time that you need to connect to your target database. Database-independent modeling is
particularly useful if you want to implement a data model on more than one DBMS.
Index Terms: Oracle, SQL Server, DB2, Database
Independent technique

I. INTRODUCTION
One of the goals of software development is to create software that can potentially run on many
different computing platforms, and with the rising popularity of Java we are coming close to achieving
this goal. Unfortunately, most database applications are tied to a specific relational database management
system (RDBMS) even when they are written in Java using the Java Database Connectivity (JDBC)
specification. This is due to the fact that different RDBMS vendors created their own extensions to the
ANSI SQL standard to handle primary key generation, database triggers, stored package programs
(procedures and functions), etc. In addition to the SQL differences, the transactional (record management)
behaviors of different RDBMSs are usually incompatible (e.g. trigger firing
scope). These differences create a huge challenge for system developers who must create software that
runs on different databases. In this article, we will explore how to architect and implement a software
system using Java and JDBC that could potentially work with many backend RDBMSs.
Many articles have been written that describe the general differences between Oracle and SQL Server
from the organizational and database administrator standpoint. In this article, I will focus on the
differences between the SQL Server and Oracle platforms from an application perspective and discuss
possible methods of developing applications in a non-database-dependent environment.
At this time, I will not talk about the differences between the two platforms that are transparent to an
application, such as table partitioning and indexing.
II. DEFINING COMMON INTERFACES AND LANGUAGES
These are a few common languages and interfaces that allow database independence within
applications and can supposedly be used with any relational database in the same way:
ANSI, the American National Standards Institute, is a voluntary membership organization (run with
private funding) that develops national consensus standards for a wide variety of devices and procedures.
In the database area, ANSI defines the standard for writing SQL commands, giving the ability to run a
command in any database without changing the command's syntax.
ODBC, the Open Database Connectivity interface by Microsoft, allows applications to access data in
database management systems (DBMSs) using SQL as a standard for accessing the data. ODBC permits
maximum interoperability, which means a single application can access different DBMSs. Application
end-users can then add ODBC database drivers to link the application to their choice of DBMS.
OLE DB, the successor to ODBC, is a set of software components that allows a front end such as a GUI
based on VB, C++, Access or whatever to connect with a back end such as SQL Server, Oracle, DB2,
MySQL, etc. In many cases the OLE DB components offer much better
performance than the older ODBC.
JDBC (Java Database Connectivity) is the industry-standard API for database-independent connectivity
between the Java programming language and a wide range of databases: SQL databases and other tabular
data sources, such as spreadsheets or flat files. The JDBC API provides a call-level API for SQL-based
database access.
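Since the paragraph above is the closest the paper comes to showing how JDBC keeps application code vendor-neutral, here is a minimal, hedged Java sketch of the idea (our own illustration, not code from the paper): the driver is selected purely by the JDBC URL read from a configuration file, and the table and property names are invented for the example.

import java.io.FileInputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Properties;

public class PortableQuery {
    public static void main(String[] args) throws Exception {
        // db.properties holds jdbc.url, jdbc.user and jdbc.password for the chosen DBMS.
        Properties cfg = new Properties();
        try (FileInputStream in = new FileInputStream("db.properties")) {
            cfg.load(in);
        }

        // DriverManager picks the vendor driver that matches the JDBC URL,
        // so no vendor-specific class is referenced in the application code.
        try (Connection con = DriverManager.getConnection(
                cfg.getProperty("jdbc.url"),
                cfg.getProperty("jdbc.user"),
                cfg.getProperty("jdbc.password"));
             PreparedStatement ps = con.prepareStatement(
                "SELECT emp_id, emp_name FROM employee WHERE dept_id = ?")) {
            ps.setInt(1, 10);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt(1) + " " + rs.getString(2));
                }
            }
        }
    }
}

The same class runs against Oracle, SQL Server or DB2 by editing db.properties and placing the corresponding driver on the classpath, which is exactly the portability the section describes.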
III. DATABASE-INDEPENDENT MODELING
With a design-for-databases tool you can create a data model for a specific DBMS, or you can create a
data model independent of the implementation of any particular DBMS. You can choose the DBMS you
want to use at the time that you need to connect to your target database or when you want to forward
engineer. Database-independent modeling is particularly useful if you want to implement a data model on
more than one DBMS. Please note that database-independent modeling is only supported in the Expert
edition of Design for Databases.

CREATING A DATABASE-INDEPENDENT MODEL
You can create a new, empty database-independent data model document using the New Project
command on the File menu. In the New Project dialog you have to select the type of model you want to
create: a model for a specific DBMS or a database-independent model.
It is possible to convert an existing model for a specific target DBMS to a database-independent model.
You can do that using the Switch Target DBMS command on the File menu.
When you are done with your data model, Design for Databases lets you create the database or export
SQL scripts for database creation. When you have a database-independent model, you have to choose the
target DBMS (database type) you want to use in the Generate Database dialog.
You can also alter an existing database-independent model. In that case you choose the target DBMS
when you connect to your target database.




Figure: Forward engineering to a target DBMS

Figure: Select a database type when generating from a database-independent model

MAPPING PORTABLE DATA TYPES
There are significant variations between the data types supported by different database products.
When you have a database-independent model, you work with a set of portable data types. These data
types use widely accepted type names such as INTEGER, NUMERIC or VARCHAR. These data types
must be mapped to the correct target DBMS data types. The data type conversion rules are used for this
conversion. You can modify the data type conversion rules in the data type conversion rules editor when
that is necessary. Design for Databases will automatically convert your portable data types to the correct
data types for the selected target DBMS when the database is generated.


Figure: Edit data type conversion rules
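As an illustration of what a data type conversion rule table looks like, the following Java sketch maps a few portable type names to plausible Oracle and SQL Server types; the concrete mappings are assumptions for the example, not the rule set shipped with any particular design tool.

import java.util.HashMap;
import java.util.Map;

public class TypeConversionRules {
    // Portable type name -> (target DBMS -> concrete type); mappings are illustrative only.
    private static final Map<String, Map<String, String>> RULES = new HashMap<>();
    static {
        RULES.put("INTEGER",  Map.of("ORACLE", "NUMBER(10)",    "MSSQL", "INT"));
        RULES.put("NUMERIC",  Map.of("ORACLE", "NUMBER(18,4)",  "MSSQL", "NUMERIC(18,4)"));
        RULES.put("VARCHAR",  Map.of("ORACLE", "VARCHAR2(255)", "MSSQL", "VARCHAR(255)"));
        RULES.put("DATETIME", Map.of("ORACLE", "DATE",          "MSSQL", "DATETIME"));
    }

    public static String convert(String portableType, String targetDbms) {
        Map<String, String> byDbms = RULES.get(portableType.toUpperCase());
        if (byDbms == null || !byDbms.containsKey(targetDbms)) {
            throw new IllegalArgumentException(
                "No conversion rule for " + portableType + " on " + targetDbms);
        }
        return byDbms.get(targetDbms);
    }

    public static void main(String[] args) {
        System.out.println(convert("VARCHAR", "ORACLE")); // VARCHAR2(255)
        System.out.println(convert("VARCHAR", "MSSQL"));  // VARCHAR(255)
    }
}

A real tool keeps such rules editable, which is what the conversion rules editor mentioned above provides.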

USING CONDITIONAL GENERATION
Conditional generation (IFDEF...ENDIF) lets you generate selected parts of your data model. You can
define the generation directive(s) that must be used for the model-to-database generation. This is very
useful when you have a database-independent model. With conditional generation you can use
target-DBMS-specific code when that is necessary.
For example, let's say you have a database-independent model that you want to forward engineer to
MS SQL Server and Oracle. You can define a trigger in the database-independent model. Trigger code is
database specific, so you may have code that should only be added to the Oracle or to the MS SQL Server
output. You can then use (user-defined) generation directives such as ORACLE and MSSQL in your
trigger code (see the screenshot below).

Figure: Generation directive used in trigger code

When you forward engineer or compare your model to a database, you can select the generation
directives you want to use for the model-to-database generation. You can define multiple generation
directives.



Figure: Defining generation directives
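To illustrate the effect of generation directives conceptually, the Java sketch below tags DBMS-specific trigger code with a directive name and emits a fragment only when its directive is among the ones selected for generation; the directive names ORACLE and MSSQL follow the example above, while the class and the trigger text are invented for illustration.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class ConditionalGeneration {

    // DBMS-specific fragments guarded by a generation directive (IFDEF ... ENDIF in the tool).
    private static final Map<String, String> GUARDED_FRAGMENTS = new LinkedHashMap<>();
    static {
        GUARDED_FRAGMENTS.put("ORACLE",
            "CREATE OR REPLACE TRIGGER trg_audit BEFORE INSERT ON orders ...");
        GUARDED_FRAGMENTS.put("MSSQL",
            "CREATE TRIGGER trg_audit ON orders AFTER INSERT AS ...");
    }

    /** Emits only the fragments whose directive was selected for this generation run. */
    public static String generate(Set<String> activeDirectives) {
        StringBuilder script = new StringBuilder();
        for (Map.Entry<String, String> fragment : GUARDED_FRAGMENTS.entrySet()) {
            if (activeDirectives.contains(fragment.getKey())) {
                script.append(fragment.getValue()).append(System.lineSeparator());
            }
        }
        return script.toString();
    }

    public static void main(String[] args) {
        // Forward engineering to Oracle: only the ORACLE-guarded trigger code is generated.
        System.out.println(generate(Set.of("ORACLE")));
    }
}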




















IV. POSSIBLE SOLUTIONS
We have seen three possible solutions to the database interoperability problem.

Solution 1: Maintain two versions of the application, one for Oracle and one for SQL Server.
Advantage: no need to handle SQL command versioning.
Disadvantage: duplicate code; every change must be applied in both versions.

Solution 2: Use a common language (ANSI/ODBC/OLE DB/...) as much as possible and handle the
differing commands with IF statements in the application.
Advantage: a single application to maintain.
Disadvantages: if there are many non-ANSI commands, the code might become large, which can affect
the application's performance; the code might also become complex due to the many IF statements.

Solution 3: Keep the database commands in a database or INI file and read them into a cache when the
application starts.
Advantages: no need for IF statements in the application; SQL commands can be changed on the fly,
without recompiling the application after a modification.
Disadvantage: SQL command management could become complex.
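A minimal sketch of solution 3, under the assumption that the commands live in a plain properties file (the file naming scheme and command keys below are our own illustration): all SQL text stays outside the application, one file per target database, and it is read into a cache once at start-up so statements can be adjusted without recompiling.

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

public class SqlCommandCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    // Loads e.g. sql-oracle.properties or sql-mssql.properties once at application start.
    public SqlCommandCache(String targetDbms) throws IOException {
        Properties p = new Properties();
        try (FileInputStream in =
                 new FileInputStream("sql-" + targetDbms.toLowerCase() + ".properties")) {
            p.load(in);
        }
        for (String key : p.stringPropertyNames()) {
            cache.put(key, p.getProperty(key));
        }
    }

    // Application code asks for commands by logical name, never by SQL text.
    public String get(String commandName) {
        String sql = cache.get(commandName);
        if (sql == null) {
            throw new IllegalStateException("No SQL registered for " + commandName);
        }
        return sql;
    }
}

// Usage (illustrative):
//   SqlCommandCache sql = new SqlCommandCache("oracle");
//   PreparedStatement ps = con.prepareStatement(sql.get("employee.findByDept"));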
















V. CONCLUSION
In this article, we have discussed important guidelines for designing and developing database
application systems that can be portable among different databases. We have seen that if we would like
to develop a database-independent application, we should plan the solution carefully, taking into
consideration the application's complexity at the database level and the total amount of code needed.
During the planning process, it is important to think in terms of the application's future growth.

REFERENCES

[1] T-SQL and PL/SQL common language for database-independent applications, by Michelle Gutsiest (2006).
[2] http://ryanfarley.com/blog/archive/2004/04/03/495.aspx
[3] http://www.datanamic.com/dezign/index.html
[4] The following are various commercial databases (their trademarks and copyrights):
Oracle - http://oracle.com
DB2 (UDB) - http://www.ibm.com
Microsoft SQL Server - http://www.microsoft.com
Java/JDBC/J2EE - http://java.sun.com
[5] Edward K. Yu, "Database Independent! Myth or Reality?", Senior Solutions Architect, Advanced
Solutions Group, University of South Carolina (2003).


Management and flexible transaction of uncertain data in XML Database

Prof. Umesh Dwivedi, Dept. of Computer Application, SRMGPC, Lucknow, India
Prof. Firoz Ahamad, Dept. of Computer Application, SRMGPC, Lucknow, India


Abstract: Processing XML documents that store precise values about objects in the real world in
multi-user database management environments requires a suitable storage model for XML data, support
for typical XML document processing (XDP) interfaces, and concurrency control (CC) mechanisms
tailored to the XML data model; in addition, uncertain data can be represented in terms of a continuous
probability distribution. In this paper, we sketch the architecture and interfaces of our prototype
probabilistic XML database management system model, which can be connected to any existing
relational DBMS, is capable of storing continuous uncertain data in XML documents, and provides
declarative and navigational data access for concurrent transactions. The probabilistic XML data model
is based on the probabilistic tree.
Instead of enumerating explicitly the data values with their associated probabilities, a probability
density function represents a continuous probability distribution in terms of integrals.
In order to query this data, we present a query language containing query operations that are based on
probability theory. Next, we introduce some new query operators supporting the aggregation of
continuous probability distributions using the same semantic foundation.
A proof of concept demonstrates the outcomes of our study towards the management of continuous
uncertain data. This proof of concept allows the end-user to query XML documents containing
continuous uncertain data.
I. INTRODUCTION
Database Management Systems (DBMSs) play an important role in today's world. Almost every
information system contains one or more databases [5]. The application
domain ranges from those things that we daily use (e.g.
phone book on a cellular phone) to highly sophisticated
systems (e.g. container terminal management system in a
main port). From a traditional perspective, databases are
used to store precise values about objects in the real
world. However, today there are many applications in which data is uncertain and/or imprecise.
It is frequently the case that these applications use a
conventional DBMS and process the uncertainty outside
the DBMS.
Run an experiment on available DBMSs with
collaboratively used XML documents [16] and you will
experience a "performance catastrophe" meaning that all
transactional

operations are processed in strict serial order. Storing
XML documents into relational DBMSs forces the
developers to use simple CLOBs or to choose among an
innumerable number of algorithms mapping the semi-
structured documents to tables and columns (the so-called
shredding). In any case, there are no specific provisions to
process concurrent transactions guaranteeing the ACID
properties and using typical XDP interfaces like SAX [2],
DOM [16], and XQuery [16] simultaneously. Especially
isolation in relational DBMS does not take the properties
of the semi-structured XML data model into account and
causes disastrous locking behavior by blocking entire
CLOBs or tables. Native XML database systems often use
mature storage engines tailored to relational structures
[13]. Because their XML document mapping is usually
based on fixed numbering schemes used to identify XML
elements, they primarily support efficient document
retrieval and query evaluation. Frequent concurrent and transaction-safe modifications would lead to
renumbering of large document parts, which could cause unacceptable reorganization overhead and
degrade XML processing in performance-critical workload situations. As a rare example of an
update-oriented system, Natix [5] is designed to support concurrent transaction processing, but
accomplishes alternative solutions for data storage and transaction isolation as compared to our proposal.
Our approach aims at the adequate support of all known types of XDP interfaces (event-based like SAX,
navigational like DOM, and declarative like XQuery) and provides the well-known ACID properties [7]
for their concurrent execution. We have implemented the XML Transaction Coordinator (XTC) [9], an
(O)RDBMS-connectable DBMS for XML documents, called XDBMS for short, as a testbed for empirical
transaction processing on XML documents. Here, we present its advantages for concurrent transaction
processing in a native XDBMS, achieved by a storage model and CC mechanisms tailored to the XML
data model. This specific CC improves not only collaborative XDP but also SQL applications when
"ROX: Relational Over XML" [8] becomes true. An overview of the XTC architecture and its XDP
interfaces is sketched in Section 2. Concurrent data access is supported by locks tailored to the taDOM
tree [10], a data model which extends the DOM tree, as outlined in Sections 3 and 4, thereby providing
tunable, fine-grained lock granularity and lock escalation as well as navigational transaction path locking
inside an XML document. In

Section 5, we give a first impression of concurrent
transaction processing gains, before we wrap up with
conclusions and some aspects of future work in Section 6.
System Architecture and XDP Interfaces
Our XTC database engine (XTCserver) adheres to the
widely used five-layer DBMS architecture [11]. In Figure
1, we concentrate on the representation and mapping of
XML documents. Processing of relational data is not a
focus of this paper. The file-services layer operates on the
bit pattern stored on external, non-volatile storage devices.
In collaboration with the OS file system, the i/o managers
store the physical data into extensible container files; their
uniform block length is configurable to the characteristics
of the XML documents to be stored. A buffer manager per
container file handles fixing and unfixing of pages in main
memory and provides a replacement algorithm for them
which can be optimized to the anticipated reference
locality inherent in the respective XDP applications. Using
pages as basic storage units, the record, index, and catalog
managers form the access services. The record manager
maintains in a set of pages the tree-connected nodes of
XML documents as physically adjacent records.
Each record is addressed by a unique life-time ID
managed within a B-tree by the index manager [9]. This is
essential to allow for fine-grained concurrency control
which requires lock acquisition on unique identifiable
nodes (see Section 4). The catalog manager provides for
the database metadata. The node manager implementing
the navigational access layer transforms the records from
their internal physical into an external representation,
thereby managing the lock acquisition to isolate the
concurrent transactions. The XML-services layer contains
the XML manager responsible for declarative document
access, e.g., evaluation of XPath queries or XSLT
transformations [16]. At the top of our architecture, the
agents of the interface layer make the functionality of the
XML and node services available to common internet
browsers, ftp clients, and the XTCdriver thereby achieving
declarative / set-oriented as well as navigational / node-
oriented interfaces. The XTCdriver linked to client-side
applications provides for methods to execute XPath-like
queries and to manipulate documents via the SAX or DOM
API. Each API accesses the stored documents within a
transaction to be started by the XTCdriver. Transactions
can be processed in the well-known isolation levels
uncommitted, committed, repeatable, and serializable [1].
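The paper does not show the XTCdriver calls themselves; as a rough analogy only, the sketch below uses the standard java.sql API to select the four isolation levels named above on a conventional JDBC connection (the connection URL and credentials are placeholders).

import java.sql.Connection;
import java.sql.DriverManager;

public class IsolationLevelDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder URL/credentials; any JDBC-accessible DBMS would do for this illustration.
        try (Connection con = DriverManager.getConnection(
                "jdbc:exampledb://localhost/xmldb", "user", "secret")) {
            con.setAutoCommit(false); // explicit transaction boundaries

            // The four levels referred to in the text map onto the standard JDBC constants:
            // TRANSACTION_READ_UNCOMMITTED, TRANSACTION_READ_COMMITTED,
            // TRANSACTION_REPEATABLE_READ, TRANSACTION_SERIALIZABLE.
            con.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ);

            // ... perform document reads/updates here ...

            con.commit(); // or con.rollback()
        }
    }
}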


II. STORAGE MODEL
Efficient and effective synchronization of concurrent
XDP is greatly facilitated if we use a specialized internal
representation which enables fine-granular locking. For
this reason, we will introduce two new node types:
attributeRoot and string. This representational
enhancement does not influence the user operations and
their semantics on the XML document, but is solely
exploited by the lock manager to achieve certain kinds of
optimizations when an XML document is modified in a
cooperative environment. As a running example, we,
therefore, refer to an XML document which is slightly
enhanced for our purpose to a so-called taDOM tree [10],
as shown in Figure 2. AttributeRoot separates the various
attribute nodes from their element node. Instead of locking
all attribute nodes separately when the DOM method
getAttributes() is invoked, the lock manager obtains the
same effect by a single lock on attributeRoot. Hence, such
a lock does not affect parallelism, but leads to more
effective lock handling and, thus, potentially to better
performance. A string node, in contrast, is attached to the
respective text or attribute node and exclusively contains
the value of this node. Because reference to that value
requires an explicit invocation of getValue() with a
preceding lock request, a simple existence test on a text or
attribute node avoids locking such nodes. Hence, a
transaction only navigating across such nodes will not be
blocked, although a concurrent transaction may have
modified them and may still hold exclusive locks on them.



It is essential for the locking performance to provide a
suitable storage structure for taDOM trees which supports
a flexible storage layout that allows a distinguishable
(separate) node representation of all node types to achieve
fine-grained locking. Therefore, we have implemented
various container types which enable effective storage of
very large and very small attribute and element nodes as
well as combinations thereof [9]. Furthermore, fast access
to and identification of all nodes of an XML document is
mandatory to enable efficient processing of direct-access
methods, navigational methods,
and lock management. For this reason, our record
manager assigns to each node a unique node ID (rapidly
accessible via a B-tree) and stores the node as a record in a
data page. The tree order of the XML nodes is preserved
by the physical order of the records within logically
consecutive pages (chained by next/previous page
pointers) together with a so-called level indicator per
record.
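To make the storage description more concrete, the following Java sketch models a stored node record with the fields named in the text: a unique node ID, a level indicator, a node type (including the two new types attributeRoot and string) and an optional value. It is an illustrative data structure of our own, not the actual XTC record layout.

public class TaDomRecord {

    // Node types of the taDOM tree, including the two types introduced above.
    public enum NodeType { ELEMENT, ATTRIBUTE, TEXT, ATTRIBUTE_ROOT, STRING }

    private final long nodeId;   // unique life-time ID, located via a B-tree
    private final int level;     // level indicator preserving the tree depth
    private final NodeType type;
    private final String name;   // element/attribute name; null for string nodes
    private final String value;  // only STRING nodes carry the actual value

    public TaDomRecord(long nodeId, int level, NodeType type, String name, String value) {
        this.nodeId = nodeId;
        this.level = level;
        this.type = type;
        this.name = name;
        this.value = value;
    }

    public long getNodeId() { return nodeId; }
    public int getLevel() { return level; }
    public NodeType getType() { return type; }
    public String getName() { return name; }

    // In the real system, reading the value of a text/attribute node requires a lock
    // request, whereas a plain existence test on the parent node does not.
    public String getValue() { return value; }
}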
III. CONCURRENCY CONTROL
So far, we have explained the newly introduced node
types and how fast and selective access to all nodes of an
XML document can be guaranteed. In a concurrent
environment,
the various types of XML operations have to be
synchronized using appropriate protocols entirely
transparent to the different XDP interfaces supported.
Hence, a lock manager is responsible for the acquisition
and maintenance of locks, processing of the quite complex
locking protocols and their adherence to correctness
criteria, as well as optimization issues such as adequate
lock granularity and lock escalation. Because the DOM
API not only supports navigation starting from the
document root, but also allows jumps "out of the blue" to
an arbitrary node within the document, locks must be acquired automatically (that is, by the lock
manager) in either case for the path of ancestor nodes. The currently
accessed node is called context node in the following. This
up-to-the-root locking procedure is performed as follows:
If such an ancestor path is traversed the first time and if the
IDs of the ancestors are not present in the so-called parent
index (on-demand indexing of structural relationships [9])
for this path, the record manager is invoked to access
stored records thereby searching all ancestor records. The
IDs of these records are saved in the parent index. Hence,
future traversals of this ancestor path can be processed via
the parent index only. Navigational locking of children or
siblings is optimized by such structural indexes in a similar
way. The lock modes depend on the type of access to be
performed, for which we have tailored the node lock
compatibilities and defined the rules for lock conversion as
outlined in Section 4.1 and Section 4.2. To achieve optimal
parallelism, we discuss means to tune lock granularities
and lock escalation in Section 4.3. When an XML
document has to be traversed by navigational methods,
then the actual navigation paths also need strict
synchronization. This means, a sequence of method
calls must always obtain the same sequence of result
nodes. To support this demand, we present so-called
navigation locks in Section 4.4. Furthermore, query access
methods also need strict synchronization to accomplish the
well-known repeatable read property and, in addition, the
prevention of phantoms in rare cases. Our specific solution
is outlined in Section 4.5.
4.1 Node Locks
While traversing or modifying an XML document, a
transaction has to acquire a lock in an adequate mode for
each node before accessing it. Because the nodes in an
XML document are organized by a tree structure, the
principles of multi-granularity locking schemes can be
applied. The method calls of the different XDP interfaces
used by an application are interpreted by the lock manager
to select the appropriate lock modes for the entire ancestor
path. Such tree locking is similar to multi-granularity
locking in relational environments (SQL) where intention
locks communicate a transaction's processing needs to
concurrent transactions. In particular, they prevent a
subtree s from being locked in a mode incompatible to
locks already granted to s or subtrees of s. However, there
is a major difference, because the nodes in an ancestor path
are part of the document and carry user data, whereas, in a
relational DB, user data is exclusively stored in the leaves
(records) of the tree (DAG) whose higher-level nodes are
formed by organizational concepts (e.g., table, segment,
DB). For example, it makes perfect sense to lock an
intermediate XML node n for reads, while in the subtree of
n another transaction may perform updates. For this and
other reasons, we differentiate the read and write

operations thereby replacing the well-known (IR, R) and
(IX, X) lock modes with (NR, LR, SR) and (IX, CX, X)
modes, respectively. As in the multi-granularity scheme,
the U mode plays a special role because it permits lock
conversion. Figure 3a contains the compatibility matrix for
our lock modes whose effects are described now:
An NR lock mode (node read) is requested for
reading the context node. To isolate such a read access, an
NR lock has to be acquired for each node in the ancestor
path. Note, the NR mode takes over the role of IR together
with a specialized R, because it only locks the specified
node, but not any descendant nodes. An IX lock mode
(intention exclusive) indicates the intent to perform write
operations somewhere in the subtree but not on a direct-
child node of the node being locked (see CX lock).

An LR lock mode (level read) locks the context node
together with its direct-child nodes for shared access. For
example, the method getChildNodes() only requires an LR
lock on the context node and not individual NR locks for
all child nodes. Similarly, an LR lock, requested for an
attributeRoot node, locks all its attributes implicitly (to
save lock requests for the getAttributes() method).
An SR lock mode (subtree read) is requested for the
context node c as the root of subtree s to perform read
operations on all nodes belonging to s. Hence, the entire
subtree is granted for shared access. An SR lock on c is
typically used if s is completely reconstructed to be printed
out as an XML fragment.
A CX lock mode (child exclusive) on context node c
indicates the existence of an X lock on some direct-child
node and prohibits inconsistent locking states by
preventing
LR and SR lock modes. In contrast, it does not prohibit
other CX locks on c, because separate direct-child nodes of
c may be exclusively locked by concurrent transactions.
A U lock mode (update option) supports a read
operation on context node c with the option to convert the
mode for subsequent write access. It can be either
converted back to a read lock if the inspection of c shows
that no update action is
needed or to an X lock after all existing read locks on c
are released. Note the asymmetry in the compatibility definition between U and (NR, IX, LR, SR, CX),
which prevents granting further read locks on c, thereby enhancing protocol fairness, that is, avoiding
transaction starvation.
To modify the context node c (updating its contents
or deleting c and its entire subtree), an X lock mode
(exclusive) is needed for c. It implies a CX lock for its
parent node and an IX lock for all other ancestors up to the
document root.
Note again, this differing behavior of CX and IX locks
is needed to enable compatibility of IX and LR locks and
to enforce incompatibility of CX and LR locks.
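The up-to-the-root rules for read and write access described above can be summarised in a few lines of code. The sketch below is our own paraphrase of those rules; the lock manager and node interfaces are assumed, and compatibility checking as well as the remaining lock modes are omitted.

public class AncestorPathLocking {

    /** Lock modes used by the protocol described in the text. */
    public enum LockMode { NR, LR, SR, IX, CX, X, U }

    /** Minimal view of a taDOM node: only the parent link is needed here (null at the root). */
    public interface Node {
        Node parent();
    }

    /** Assumed lock-manager hook: requests the given mode on a node on behalf of transaction tx. */
    public interface LockManager {
        void request(long tx, Node node, LockMode mode);
    }

    /** Reading the context node: an NR lock on the node itself and on every ancestor. */
    public static void lockForRead(LockManager lm, long tx, Node context) {
        for (Node n = context; n != null; n = n.parent()) {
            lm.request(tx, n, LockMode.NR);
        }
    }

    /** Modifying the context node: X on the node, CX on its parent, IX on all further ancestors. */
    public static void lockForWrite(LockManager lm, long tx, Node context) {
        lm.request(tx, context, LockMode.X);
        Node parent = context.parent();
        if (parent != null) {
            lm.request(tx, parent, LockMode.CX);
            for (Node n = parent.parent(); n != null; n = n.parent()) {
                lm.request(tx, n, LockMode.IX);
            }
        }
    }
}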
Figure 3b represents a cutout of the taDOM tree depicted in Figure 2 and illustrates the result of the
following example: Transaction T1 starts modifying the value "Darcy" and, therefore, acquires an X lock
for the corresponding string node. The lock manager complements this action by accessing all ancestors
and by acquiring a CX lock for the parent and IX locks for all further ancestors. Simultaneously,
transaction T2 wants to delete the entire <editor> node including the string "Gerbag", for which T2 must
acquire an X lock. This lock request, however, cannot be immediately granted because of the existing IX
lock of T1. Hence, T2, placing its request in the lock request queue (LRQ: X2), must synchronously wait
for the release of the IX lock of T1 on the <editor> node. Meanwhile, transaction T3 is generating a list
of all book titles and has, therefore, requested an LR lock for the <bib> node to obtain read access to all
direct-child nodes, thereby using the level-read optimization. To access the title strings for each <book>
node, the paths downwards to them are locked by NR locks. Note, LR3 on <bib> implicitly locks the
<book> nodes in shared mode and does not prohibit updates somewhere deeper in the tree. If X2 is
eventually granted for the <editor> node, T2 gets its CX lock on the <book> node and its IX locks
granted up to the root.
4.2 Node Lock Conversion
The compatibility matrix shown in Figure 3a describes
the compatibility of locks acquired on the same node by
separate transactions. If a transaction T already holds a
lock and requests a lock in a more restrictive or
incomparable mode on the same node, we would have to
keep two locks for T on this node. In general, k locks per
transaction and node are conceivable. This proceeding
would require longer lists of granted locks per node and a
more complex run-time inspection algorithm checking for
lock compatibility.
Therefore, we replace all locks of a transaction per
node with a single lock in a mode giving sufficient
isolation. The corresponding rules are specified by the lock
conversion matrix in Figure 4, which determines the
resulting lock for context node c, if a transaction already
holds a lock (matrix header row) and requests a further
lock (matrix header column) on c. A lock l1 specified by
an additional subscripted lock l2 (e.g., CX_NR) means that

l1 has to be acquired on c and l2 has to be acquired on
each direct-child node of c. An example for this procedure
is given in the following paragraph.
IV. PERFORMANCE EVALUATION
In our first experiment, we consider the basic cost of
lock management described so far. For this purpose, we
use the xmlgen tool of the XMark XML benchmark project
[15] to generate a variety of XML documents consisting of
5,000 up to 25,000 individual XML nodes. The documents
are stored in our native XDBMS [9] and accessed by a
client-side DOM application requesting every node by a
separate RMI call. To reveal lock management overhead,
each XML document is reconstructed by a consecutive
traversal in depth-first order under isolation levels
committed and repeatable read. Isolation level committed
certainly provides higher degrees of concurrency with
(potentially) lesser degrees of consistency of shared
documents; when used, the programmer accepts a
responsibility to achieve full consistency. Depending on
the position of the node to be locked, it may cause much
more overhead, because each individual node access
requires
short read locks along its ancestor path. In contrast,
isolation level repeatable read sets long locks until
transaction commit and, hence, does not need to
repetitively lock ancestor nodes. In fact, they are already
locked due to the depth-first traversal.
The second experiment illustrates the benefits for
transaction throughput depending on the chosen isolation
level and lock-depth value. For this purpose, we extend the
sample document of Figure 2 to a library database by
grouping the books into specific topics and adding a
persons directory. The DataGuide describing the resulting
XML document is depicted in Figure 9. We created the
library document with 500 persons and
25,000 books grouped into 50 specific topics. The
resulting document (requiring approximately 6.4 MB)
consists of 483,317 XML nodes and is stored in our
XDBMS [9]. We apply different transaction types
simulating typical read/write access to XML documents.
Transaction TB is searching for a book with a randomly
selected title. This simulates a query of a library visitor.
The activities of the library employees are represented by
transactions TP, TL, and TR. Transaction TP is searching
for a randomly chosen person by his/her last name.
Transactions TL and TR are simulating the lending of
books.
Transaction TL randomly locates a person and a book
to be lent; then it adds a new child node containing the
person's id to the <history> element within the located
<book>
subtree. Transaction TR "returns" the book by setting
the return attribute of the corresponding <lend> element to
the current system date.
"Interactive" Transaction Processing
Transactions interrupted by human interactions ("the
human is in the loop") or performing complex operations
may exhibit drastically increased lock duration times.
While the average transaction response time and lock
duration was far less than a second in batch processing
mode, now the average lock duration was "artificially"
increased probably by more than a factor of 10. As a
consequence, the finer granularity of locks and the
duration of short read locks gained in importance for transaction throughput, while the relative effect of lock
management overhead was essentially scaled down.
Longer lock durations and, in turn, blocking times reduced
the number of successful commits (write and overall
transactions) to about 50% and 10% as shown in Figure
11a and b and caused a relative performance behavior as
anticipated in relational environments.



In general, transaction throughput can be increased by
decreasing the level of isolation (from repeatable read
down to uncommitted) or increasing the lock depth (if
possible).
As observed at lock depths 4 to 7 in Section 5.1, all
transactions can be executed in parallel and our XDBMS
approaches stable transaction throughput in this
experiment.
For future benchmarks, we expect the gap between
uncommitted and committed to grow larger for "deeper"
XML documents (longer paths from the root to the leaves).
Similarly, the gap between committed and repeatable read
widens with an increasing percentage of write transactions
(causing more waiting cycles).
V. RELATED WORK, CONCLUSIONS AND FUTURE WORK
So far, only a few papers deal with fine-grained CC in
XML documents. DGLOCK [6] explores a path-oriented
protocol for semantic locking on DataGuides. It is running
in a layer on top of a commercial DBMS and can,
therefore, not reach the fine granularity and flexibility of
our approach. In particular, it cannot support ID-based
access and position- based predicates. Another path-
oriented protocol is proposed in [3, 4] which also seems to
be limited as far as the full expressiveness of XPath
predicates and direct jumps into subtrees are concerned. To

our knowledge, the only competing approach which is also
navigation oriented comes from the locking protocols
designed for Natix [12]. They are also tailored to typical
APIs for XDP. While the proposed lock modes are
different to ours, the entire protocol behavior should be
compared. Currently, we have the advantage that we do
not need to simulate our protocols, but we can measure
their performance on existing benchmarks and get real
numbers. In this paper, we have primarily explored
transaction isolation issues for collaborative XML
document processing. We first sketched the design and
implementation of our native XML database management
system. For concurrent transaction processing, we have
introduced our concepts enabling fine-granular
concurrency control on taDOM trees representing our
natively stored XML documents. As the key part, we have
described the locking protocols for direct and navigational
access to individual nodes of a taDOM tree, thereby
supporting different isolation levels. The performance
evaluation has revealed the locking overhead of our
complex protocols, but, on the other hand, has
confirmed the viability, effectiveness, and benefits of
our approach. As a striking observation, lower isolation
levels on XML documents do not necessarily guarantee
better transaction throughput, because the potentially
higher transaction parallelism may be (over-)compensated
by higher lock management overhead. There are many
other issues that wait to be resolved: For example, we did
not say much about the usefulness of optimization features
offered. Effective phantom control needs to be
implemented and evaluated (thereby providing for
isolation level serializable), based on the ideas we
described. Then, we can start to systematically evaluate the
huge parameter space available for collaborative XML
processing (fan-out and depth of XML trees, mix of
transactional operations, benchmarks for specific
application domains, degree of application concurrency,
optimization of protocols, etc.). Acknowledgements. The
anonymous referees who pestered us with many questions
helped to improve the final version of this paper.

REFERENCES
[1] American National Standard for Information
Technology. Database Languages - SQL - Part 2:
Foundation (1999)
[2] D. Brownell. SAX2. O'Reilly (2002)
[3] S. Dekeyser, J. Hidders. Path Locks for XML
Document Collaboration. Proc. 3rd Conf. on Web
Information Systems Engineering (WISE), Singapore, 105-
114 (2002)
[4] S. Dekeyser, J. Hidders, J. Paredaens. A Transaction
Model for XML Databases. World Wide Web Journal
7(2): 29-57 (2004)
[5] T. Fiebig, S. Helmer, C.-C. Kanne, G. Moerkotte, J.
Neumann, R. Schiele, T. Westmann. Natix: A Technology
Overview. A.B. Chaudri et al. (Eds.): Web, Web Services,
and Database Systems, NODe 2002, Erfurt, Germany,
LNCS 2593, Springer, 12-33 (2003)
[6] T. Grabs, K. Böhm, H.-J. Schek: XMLTM: Efficient
Transaction Management for XML Documents. Proc.
ACM CIKM Conf., McLean, VA, 142-152 (2002)
[7] J. Gray, A. Reuter. Transaction Processing: Concepts
and Techniques. Morgan Kaufmann (1993)
[8] A. Halverson, V. Josifovski, G. Lohman, H. Pirahesh,
M. Mörschel. ROX: Relational Over XML. Proc. 30th
VLDB Conf., Toronto (2004)
[9] M. Haustein, T. Härder. Fine-Grained Management of
Natively Stored XML Documents, submitted (2004)
[10] M. Haustein, T. Härder. taDOM: A Tailored
Synchronization Concept with Tunable Lock Granularity
for the DOM API. Proc. 7th ADBIS Conf., Dresden,
Germany, 88-102 (2003)
[11] T. Härder, A. Reuter. Concepts for Implementing a
Centralized Database Management System. Proc.
Computing Symposium on Application Systems
Development, Nürnberg, Germany, 28-60 (1983)
[12] S. Helmer, C.-C. Kanne, G. Moerkotte. Evaluating
Lock-Based Protocols for Cooperation on XML
Documents. SIGMOD Record 33(1): 58-63 (2004)
[13] H. V. Jagadish, S. Al-Khalifa, A. Chapman.
TIMBER: A Native XML Database. The VLDB Journal
11(4): 274-291 (2002)
[14] J. R. Jordan, J. Banerjee, R. B. Batman: Precision
Locks. Proc. ACM SIGMOD Conf., Ann Arbor, Michigan,
143-147 (1981)
[15] A. Schmidt, F. Waas, M. Kersten. XMark: A
Benchmark for XML Data Management. Proc. 28th VLDB
Conf., Hong Kong, China, 974-985 (2002)
[16] W3C Recommendations. http://www.w3c.org (2004)
COMPONENT WEIGHT ASSIGNMENT ALGORITHM WITH SVM: MINING REVIEWS FOR SOCIAL AND ECONOMIC SUPPORT

T. Kohilakanagalakshmi, M.C.A., M.Phil., M.E., Dept. of Computer Science & Engg., Anna University of Technology Chennai, Chennai, India
Tina Esther Trueman, M.E., Dept. of Computer Science & Engg., Anna University of Technology Chennai, Chennai, India

Abstract: The high volume of reviews that are typically published for a single product makes it harder
for individuals as well as manufacturers to locate the best reviews and understand the true underlying
quality of a product. In this project, we re-examine the impact of reviews on economic outcomes like
product sales and see how different factors affect social outcomes like the extent of their perceived
usefulness. Our elementary econometric analysis using two-stage regression reveals that the extent of
subjectivity, informativeness, readability, and linguistic correctness in reviews matters in influencing
sales and perceived usefulness. By using a component weight assignment algorithm along with an SVM
classifier we can accurately predict the impact of reviews on sales and their perceived usefulness.
Keywords- Data mining, Text mining, elementary
econometric analysis, weight assignment algorithm

1. INTRODUCTION
Interpersonal conversation, or word-of-mouth (WOM), is one of the important factors affecting product
sales. Potential buyers can gather information on the quality of a product through other consumers'
WOM. Online reviews are a new form of WOM, which can not only increase product awareness among
potential buyers but can also affect their buying decisions. With the development of online review
systems, consumers can express personal opinions on a particular product freely online without being
limited to face-to-face interactions.
Due to the rapid growth of the Internet, the ability of
users to create and publish content has created active
electronic communities that provide a wealth of product
information. Reviewers contribute time and energy to
generate reviews, enabling a social structure that provides
benefits both for the users and the firms that host electronic
markets. In such a context, who says what and how they say it matters. On the flip side, a large number of reviews for a single product may also make it harder for individuals to track the gist of users' discussions and evaluate the true underlying quality of a product.
Reviews are either allotted an extremely high rating or
an extremely low rating. In such situations, the average

numerical star rating assigned to a product may not convey
a lot of information to a prospective buyer or to the
manufacturer who tries to understand what aspects of its
product are important. Instead, the reader has to read the
actual reviews to examine which of the positive and which
of the negative attributes of a product are of interest. So far,
the best effort for ranking reviews for consumers comes in
the form of peer reviewing in review forums, where
customers give helpful votes to other reviews in order to
signal their informativeness.
Unfortunately, the helpful votes are not a useful feature
for ranking recent reviews: the helpful votes are
accumulated over a long period of time, and hence cannot
be used for review placement in a short- or medium-term
time frame. Similarly, merchants need to know which aspects of reviews are the most informative from the consumers' perspective. Such reviews are likely to be the
most helpful for merchants, as they contain valuable
information about what aspects of the product are driving
the sales up or down.
In this project, we propose techniques for predicting the helpfulness and importance of a review so that we can have:
- A consumer-oriented ranking mechanism, which can potentially rank the reviews according to their expected helpfulness (i.e., estimating the social impact);
- A manufacturer-oriented ranking mechanism, which can potentially rank the reviews according to their expected influence on sales (i.e., estimating the economic impact).

To better understand what factors influence consumers' perception of usefulness and what factors affect consumers most, we conduct a two-level study. First, we perform an explanatory econometric analysis, trying to identify what aspects of a review (and of a reviewer) are important determinants of its usefulness and impact.
Then, at the second level, we build a predictive model using the component weight assignment model with SVM that offers significant predictive power and allows us to predict with high accuracy how peer consumers are going to rate a review and how sales will be affected by the posted review.
Our algorithms are based on the idea that the writing style of the review plays an important role in determining the perceived helpfulness by other fellow customers and the review's economic impact. We perform multiple levels of automatic text analysis to identify characteristics of the review that are important. We perform our analysis at the lexical, grammatical, semantic, and stylistic levels to identify text features that have high predictive power in identifying the perceived helpfulness and the economic impact of a review.

II. THEORETICAL FRAMEWORK AND RELATED
LITERATURE

Online product reviews affect sales because they provide information about the product or the vendor to potential consumers. Prior research has demonstrated an association between numeric ratings of reviews (review valence) and subsequent sales of the book on that site [2], [3], [4], or between review volume and sales [5].
However, prior work has not looked at how the textual characteristics of a review affect sales. Our hypothesis is that the text of product reviews affects sales: reviews of reasonable length that are easy to read and lack spelling and grammar errors should be, all else being equal, more helpful and influential than reviews that are difficult to read and have errors. Reviewers also write subjective opinions that portray reviewers' emotions about product features, or more objective statements that portray factual data about product features, or a mix of both. Keeping these in mind, we test the following hypotheses:
Hypothesis 1a. All else equal, a change in the
subjectivity level and mixture of objective and subjective
statements in reviews will be associated with a change in
sales.
Hypothesis 1b. All else equal, a change in the
readability score of reviews will be associated with a
change in sales.
Hypothesis 1c. All else equal, a decrease in the proportion of spelling errors in reviews will be positively related to sales.
We take a two-pronged approach, building on methodologies from economics and from data mining, and construct both explanatory and predictive models to better understand the impact of different factors. For the prediction model we use the component weight assignment algorithm with SVM. Interestingly, all prior research uses Random Forests [1], which have the following main drawbacks:
- Each tree is grown at least partially at random.
- Randomness is injected by growing each tree on a different random subsample of the training data.
- Randomness is injected into the split selection process, so that the splitter at any node is determined partly at random.
- Each tree is grown to the largest extent possible, with no pruning; this occupies more memory space.
- Random forests have been observed to overfit on some datasets with noisy classification/regression tasks.

III. TEXTUAL ANALYSIS OF REVIEWS

A. Readability Analysis

A review that is easy to read will be more helpful than
another that has spelling mistakes and is difficult to read.
To measure the cognitive effort that a user needs in order to
read a review, we measured the length of a review in
sentences, words, and characters. Specifically, we
computed [6]
Automated Readability Index
Gunning fog Index
SMOG
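The three readability measures above can be computed directly from surface statistics of the review text. The following is a minimal illustrative sketch (not the paper's code), in Python, using the standard published formulas and a naive vowel-group heuristic for syllable counting; all function and variable names are ours.

import re, math

def _syllables(word):
    # crude heuristic: count groups of vowels; real systems use a dictionary
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability_scores(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    chars = sum(len(w) for w in words)
    n_sent, n_words = max(1, len(sentences)), max(1, len(words))
    complex_words = sum(1 for w in words if _syllables(w) >= 3)
    ari = 4.71 * (chars / n_words) + 0.5 * (n_words / n_sent) - 21.43
    fog = 0.4 * ((n_words / n_sent) + 100.0 * complex_words / n_words)
    smog = 1.0430 * math.sqrt(complex_words * 30.0 / n_sent) + 3.1291
    return {"ARI": ari, "GunningFog": fog, "SMOG": smog}

print(readability_scores("This camera is easy to use. The pictures are remarkably sharp."))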

B. Subjectivity Analysis

Following this definition, we then generated a training set with two classes of documents:
- A set of objective documents that contains the product descriptions of each of the products in our data set.
- A set of subjective documents that contains randomly retrieved reviews.
Instead of classifying each review as subjective or objective, we classified each sentence in each review as either objective or subjective, keeping the probability of being subjective, Pr_subj(s), for each sentence s.
Hence, for each review, we have a subjectivity score for each of the sentences. Based on the classification scores for the sentences in each review, we derived the average probability AvgProb(r) of the review r being subjective, defined as the mean value of the Pr_subj(s_j) values for the sentences s_1, ..., s_n in the review r:

AvgProb(r) = (1/n) * sum_{j=1..n} Pr_subj(s_j)
Since the same review may be a mixture of objective and subjective sentences, we also kept the standard deviation DevProb(r) of the subjectivity scores Pr_subj(s_j) for the sentences in each review:

DevProb(r) = sqrt( (1/n) * sum_{j=1..n} (Pr_subj(s_j) - AvgProb(r))^2 )
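As a small illustration of the two measures just defined, the sketch below (our own illustrative code, not the paper's) computes AvgProb(r) and DevProb(r) from a list of per-sentence subjectivity probabilities produced by any sentence-level classifier.

import math

def subjectivity_features(sentence_probs):
    # sentence_probs[j] = Pr_subj(s_j) for each sentence of the review
    n = len(sentence_probs)
    avg_prob = sum(sentence_probs) / n
    dev_prob = math.sqrt(sum((p - avg_prob) ** 2 for p in sentence_probs) / n)
    return avg_prob, dev_prob

# e.g. a four-sentence review: two subjective and two fairly objective sentences
print(subjectivity_features([0.9, 0.8, 0.2, 0.3]))   # (0.55, ~0.304)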

IV. EXPLANATORY ECONOMETRIC ANALYSIS
First, we perform an explanatory econometric analysis,
trying to identify what aspects of a review are important
determinants of its usefulness and impact. The unit of
observation in our analysis is a product-date, and the
dependent variable is ln(SalesRank), the log of sales rank
of product k in time t. Specifically, to study the impact of
reviews and the quality of reviews on sales, we estimate
the following model:

Log(SalesRank)_kt = α + β1·Log(SalesPrice_kt)
                      + β2·AvgProb_k(t-1)
                      + β3·DevProb_k(t-1)
                      + β4·AvgReviewRating_k(t-1)
                      + β5·Log(No.ofReviews_k(t-1))
                      + β6·Readability_k(t-1)
                      + β7·Log(SpellingErrors_k(t-1))
                      + β8·AnyDisclosure_k(t-1)
                      + β9·Log(ElapsedDate_kt)
                      + μ_k + ε_kt

where μ_k is a product fixed effect that accounts for unobserved heterogeneity across products and ε_kt is the error term. (The other variables are described in Table 1.)
The econometric analysis is done using two-stage least-squares regression to reveal the extent to which subjectivity, readability, informativeness, and correctness in reviews matter. If some explanatory variables are correlated with the errors, then ordinary least-squares regression gives biased and inconsistent estimates. To control for this potential problem, we use a Two-Stage Least-Squares (2SLS) regression with instrumental variables. Under the 2SLS approach, in the first stage, each endogenous variable is regressed on all valid instruments, including the full set of exogenous variables in the main regression. Since the instruments are exogenous, these approximations of the endogenous covariates will not be correlated with the error term, so intuitively they provide a way to analyze the relationship between the dependent variable and the endogenous covariates. In the second stage, each endogenous covariate is replaced with its approximation estimated in the first stage and the regression is estimated as usual.
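For concreteness, the following is a minimal sketch of the 2SLS procedure with a single endogenous regressor, written with plain NumPy on synthetic data; the variable names (y, X_exog, x_endog, Z) are illustrative and are not the paper's actual review and sales variables.

import numpy as np

def two_stage_least_squares(y, X_exog, x_endog, Z):
    # First stage: regress the endogenous covariate on instruments + exogenous variables.
    n = len(y)
    ones = np.ones((n, 1))
    first_stage_X = np.hstack([ones, X_exog, Z])
    gamma, *_ = np.linalg.lstsq(first_stage_X, x_endog, rcond=None)
    x_hat = first_stage_X @ gamma
    # Second stage: replace the endogenous covariate with its fitted values.
    second_stage_X = np.hstack([ones, X_exog, x_hat.reshape(-1, 1)])
    beta, *_ = np.linalg.lstsq(second_stage_X, y, rcond=None)
    return beta          # [intercept, exogenous coefficients, endogenous coefficient]

rng = np.random.default_rng(0)
n = 500
Z = rng.normal(size=(n, 1))                 # instrument
u = rng.normal(size=n)                      # error correlated with the endogenous regressor
x_endog = Z[:, 0] + u + rng.normal(size=n)
X_exog = rng.normal(size=(n, 1))
y = 1.0 + 0.5 * X_exog[:, 0] - 2.0 * x_endog + u
print(two_stage_least_squares(y, X_exog, x_endog, Z))   # close to [1.0, 0.5, -2.0]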

V. PREDICTIVE MODELING

At this second level, we build a predictive model using the component weight assignment model with SVM that offers significant predictive power and allows us to predict with high accuracy how peer consumers are going to rate a review and how sales will be affected by the posted review. Prediction is done using the component weight assignment algorithm. Our algorithm is based on the condition that the sum of the component weights should be close to 1 for a useful review and close to 0 for a non-useful review. This assignment tries to identify the probability of subjective comments, which help to increase product sales.

Input:  Potential useful review vector set P
        Non-useful review vector set N
Output: Useful review vector set D

C1: a classification algorithm with adjustable parameters W that identifies useful review vector pairs from P
C2: a supervised classifier (SVM)

Algorithm:
1)  D = ∅
2)  Set the parameters W of C1 according to N
3)  Use C1 to get a set of useful review vector pairs d1 from P
4)  Use C1 to get a set of useful review vector pairs f from N
5)  P = P - d1
6)  while |d1| != 0
7)    N = N - f
8)    D = D + d1 + f
9)    Train C2 using D and N
10)   Classify P using C2 and get a set of newly identified useful review vector pairs d2
11)   P = P - d2
12)   D = D + d2
13)   Adjust the parameters W of C1 according to N and D
14)   Use C1 to get a new set of useful review vector pairs d1 from P
15)   Use C1 to get a new set of useful review vector pairs f from N
16) end while
17) Return D
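The following Python sketch is one possible reading of the loop above, using scikit-learn's SVC as the supervised classifier C2; here C1 is a simple component-weight threshold rule, and the weight-update step is a placeholder heuristic rather than the authors' definition.

import numpy as np
from sklearn.svm import SVC

def c1_select(vectors, weights, threshold=0.5):
    # weighted component score close to 1 -> useful, close to 0 -> not useful
    if len(vectors) == 0:
        return vectors, vectors
    mask = vectors @ weights >= threshold
    return vectors[mask], vectors[~mask]

def mine_useful_reviews(P, N, n_rounds=5):
    n_features = P.shape[1]
    weights = np.full(n_features, 1.0 / n_features)        # step 2 (heuristic initialisation)
    D = np.empty((0, n_features))
    d1, P = c1_select(P, weights)                           # steps 3 and 5
    f, N = c1_select(N, weights)                            # steps 4 and 7
    for _ in range(n_rounds):                               # step 6: while |d1| != 0
        if len(d1) == 0 or len(N) == 0:
            break
        D = np.vstack([D, d1, f])                           # step 8
        X = np.vstack([D, N])
        y = np.concatenate([np.ones(len(D)), np.zeros(len(N))])
        svm = SVC(kernel="rbf", gamma="scale").fit(X, y)    # step 9: train C2
        if len(P):
            keep = svm.predict(P) == 1                      # step 10
            d2, P = P[keep], P[~keep]                       # steps 11 and 12
            D = np.vstack([D, d2])
        weights = D.mean(axis=0) / max(D.mean(axis=0).sum(), 1e-9)   # step 13 (heuristic)
        d1, P = c1_select(P, weights)                       # step 14
        f, N = c1_select(N, weights)                        # step 15
    return D                                                # step 17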
VI. TABLE 1
Variables collected for the work

Product and Sales:
  Retail Price        - The retail price of the product
  Sales Rank          - The sales rank within the product category
  Average Rating      - Average rating of the posted reviews
  Number of Reviews   - Number of reviews posted for the product
  Elapsed Date        - Number of days since the release date

Individual Review:
  Moderate Review     - Does the review have a moderate rating (3-star rating) or not
  Helpful Votes       - The number of helpful votes for the review
  Total Votes         - The total number of votes for the review
  Helpfulness         - HelpfulVotes / TotalVotes

Reviewer Characteristics:
  Real Name           - Has the reviewer disclosed his/her real name?
  Nick Name           - Does the reviewer have a nickname labeled in the profile?
  Birthday            - Does the reviewer list his/her birthday?
  Location            - Does the reviewer disclose his/her location?
  Any Disclosure      - Does the reviewer list any of the above in the reviewer profile?

Review Readability:
  Length (chars)      - Length of the review in characters
  Length (words)      - Length of the review in words
  Length (sentences)  - Length of the review in sentences
  Spelling Errors     - Number of spelling errors in the review
  ARI                 - Automated Readability Index
  Gunning Index       - Gunning FOG index for the review
  SMOG                - Simple Measure of Gobbledygook score for the review

Review Subjectivity:
  AvgProb             - Average probability of a sentence in the review being subjective
  DevProb             - The standard deviation of the subjectivity probability

VII. CONCLUSION
The essential step to find the impact of reviews is
to find the most useful review and the one which is not
useful. This is done with the help of Two stage regression
classifier. Then the predictive model is constructed using
Component Weight Assignment algorithm with SVM to
examine whether, given an existing review , how well can
we predict the helpfulness and impact of unseen review.

REFERENCES

[1] Anindya Ghose and Panagiotis G. Ipeirotis, "Estimating the Helpfulness and Economic Impact of Product Reviews: Mining Text and Reviewer Characteristics," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 10, October 2011.
[2] C. Dellarocas, N.F. Awad, and X.M. Zhang, Exploring
the Value of Online Product Ratings in Revenue
Forecasting: The Case of Motion Pictures, Working Paper,
Robert H. Smith School Research Paper, 2007.
[3] J.A. Chevalier and D. Mayzlin, The Effect of Word of
Mouth on Sales: Online Book Reviews, J. Marketing
Research, vol. 43, no. 3, pp. 345-354, Aug. 2006.
[4] D. Reinstein and C.M. Snyder, The Influence of Expert
Reviews on Consumer Demand for Experience Goods: A
Case Study of Movie Critics, J. Industrial Economics, vol.
53, no. 1, pp. 27-51, Mar. 2005.
[5] C. Forman, A. Ghose, and B. Wiesenfeld, Examining
the Relationship between Reviews and Sales: The Role of
Reviewer Identity Disclosure in Electronic Markets,
Information Systems Research, vol. 19, no. 3, pp. 291-313,
Sept. 2008.
[6] W.H. DuBay, The Principles of Readability, Impact
Information,
http://www.nald.ca/library/research/readab/readab.pdf,
2004.



[7] C. Danescu-Niculescu-Mizil, G. Kossinets, J. Kleinberg, and L. Lee, "How Opinions Are Received by Online Communities: A Case Study on Amazon.com Helpfulness Votes," Proc. 18th Int'l Conf. World Wide Web (WWW '09), pp. 141-150, 2009.
Dataset Privacy for Organizations: A Review
Pankaj Kumar Sharma
Shri Ramswaroop Memorial Group of Professional Colleges
Lucknow, India

Abstract-- Data management within organizations has seen a surge of initiatives for security, reuse, and storage of data within and outside the organization. When mining data and granting access to it, care of the data must be put into practice. Initiatives such as master data management, cloud computing, and data set privacy concern the large data sets of an organization that are transferred for testing, consolidation, authentication, and data management. In this paper an approach is introduced in which a new layer is added by the governance group for data management; this layer is also used for preserving the privacy of information that is sensitive and personal.
Keywords- Privacy,Data mining, Data management, Data
warehousing, Data governance.
I. INTRODUCTION
Privacy is the right of individuals to be left alone and
to choose what information about them to share with
others. When personal data is collected and stored in databases used within an organization or disseminated to others, it can be used to violate the privacy rights of individuals. The objective of data privacy is to protect data from inadvertent or unauthorized disclosure to unauthorized persons for unauthorized purposes.
Ensuring data privacy requires both legal and technical solutions. While a DBMS can improve access control, it can never stop someone who gains access to the data from later using it for unauthorized purposes.
Cloud services have been predicted to take over large-scale data analysis and business intelligence. The data volumes in such cases are beyond the capacity of a single organization to handle, or handling them in-house is not a good investment.
Data management and governance are both interesting and challenging facets. While preserving the privacy of business-sensitive information, meaningful insight through the consolidation of internal and external data is essential. Today the big problems with data are its huge volume and the fact that locking information away is no longer possible. Access to sensitive business and customer information is required, yet organizations end up facing loss of production, revenue, and customer confidence.
Cloud computing and virtualization make security
more difficult to maintain. In the following figure-1 Data
needs for various groups of an organization are depicted.


II. PRESERVING DATA PRIVACY
The stored data is very useful; nowadays it is repurposed for additional benefits such as BI (Business Intelligence), analytics, and mining.

Figure 1.

The data is shared with software service providers and third parties with whom the organization may collaborate. For the security of this data, different applications are required at different levels; data storage and transfer are protected such that the legal requirements on sensitive information are not compromised and the data is protected against theft or misuse.
Certain techniques have been used previously for keeping data private, such as substitution, transposition, encryption, and data marking. These techniques are not sufficient for applications like BI, mining, and analytics; they mainly help in preparing test data and serve other non-strategic purposes.
Basically, in B2C (business-to-customer) e-commerce, a data perturbation technique has been proposed, wherein the user data is initially distorted and is then regenerated in a probabilistic manner before being provided to the eventual miner.
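As a concrete illustration of the additive-noise flavour of data perturbation mentioned above (our own sketch, with hypothetical numbers), each user's value is distorted before release, yet aggregate statistics can still be estimated probabilistically by the eventual miner because the noise distribution is known.

import numpy as np

rng = np.random.default_rng(42)
true_incomes = rng.normal(75_000, 10_000, size=10_000)   # hypothetical user data
noise = rng.normal(0, 20_000, size=true_incomes.shape)   # noise with a known distribution
released = true_incomes + noise                          # what actually leaves the client

print(round(released.mean()))                            # close to the true mean (~75,000)
print(round(np.abs(released - true_incomes).mean()))     # individual values are heavily distorted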
For the governance team it is very important to know where the information resides and who accesses it, in order to set up policies and restrict illegal practices.
There are problems when internal sensitive information about employees and customers is available on file-sharing networks, as it is susceptible to identity theft or fraud.
A published survey showed that more than 50% of the organizations who provide sensitive data for project-related purposes do not preserve the privacy of the data before sharing it. The way the various groups in an organization access data is shown in the figure.
In organizations, various levels of hierarchy are followed when data is provided to users, so checks on the disclosure of sensitive data are needed. In this paper, as a solution for preserving the privacy of customer and business information, a privacy layer is embedded as a part of the organization
data architecture, placed after authentication.
Data access management for the different users in the organization hierarchy becomes challenging. For preserving data privacy, data republication and the maintenance of additional storage for role- and privilege-based access are not profitable for many organizations. So we need a solution to these problems with minimal additional hardware, data duplication, user management, and maintenance requirements.
III. PROPOSED APPROACH
In the approach proposed for preserving privacy, there is no separate maintenance for each new data set. Previously, a number of algorithms have been used for data privacy, but they may be unsuitable for data mining or online analytical processing; instead, a privacy layer is built into the designed architecture (refer to Figure 2).
In this approach there is no need to re-engineer the business activities and technology, and the authorization and authentication processes within the organization need not be modified.









Figure 2. Architecture with Privacy Layer
The data source used may consist of online transactions. In this case the data that requires privacy changes frequently, so consolidating data from various sources is not easy. Data stores in the cloud, internal or external, may also be encompassed by the privacy layer. For mining-related needs, the data at the root level may be directly exposed to experts, where the knowledge worker is properly authorized to access multiple sources. To address the security concerns of a governance team within an organization, the privacy control is applied to the data set.
This privacy layer will provide secured request processing for data and will check the privilege details of the requester: what is the level of privileges?
A. Descriptions about Privacy Layer
The access to sensitive data would be governed by the
level of the hierarchy in the organization and the current
role and responsibility of the requester. Thus the access
privilege and the role definition will be the key inputs for
the functionality embedded in the Privacy Layer. The
authorization and authentication verifications will
continue to be maintained at the respective layers of the
data and application architecture. This check alone may not rule out the exposure of customer- and business-sensitive information requested from the consolidated data at the base level or any level of aggregation.
B. Components of Privacy Layer
Rule Engine and Data Range Customizer are the key
components of the privacy layer, a sample data flow is
shown in the Figure 3.


Figure 3. A sample data flow involving the privacy layer

These two components are discussed below.
The Rule Engine will maintain the following:
1. A User Portfolio - access, role privileges, history.
2. Computational Algorithms - for each requester, depending upon the role and privileges, it will calculate and present a personalized data set.
3. Rule Manager Interface - required by the privacy administrator for maintaining the rule engine and implementing the policies laid down by the governance team.

The desensitizing technique utilizing the Data Range Customizer will generate data ranges and count the occurrences of instances in each. The range for a requested subject of measure will be static for a requester and will only alter if his privilege or role is changed. Let us explain this with an example.

Table 1. For an actual annual income of a customer of $75,000 per annum, the following ranges will be presented to the user.

Role               Privilege level   Data Range ($,000)
External User      Low               60-90
Internal Operator  Medium-Low        65-85
Managerial         Medium            70-80
Knowledge Worker   Medium-High       75-80
CXO                High              75

Table 1 may not present an actual view of data ranges; it is a hypothetical one used to explain the personalized data set creation technique of the privacy layer. The length of the range provided to an external user for analysis will be higher in value compared to that of an internal operator or manager. Until the time they are playing their current role, users will have the ranges fixed for them as provided in Table 1. The internal operator will be allocated the range 65-85 until the rule engine entries are altered for this user with a different role or changed privilege. The data sets for the manager will display ranges like 70-80 for fact measures and other data he has privileges to access and analyze. The knowledge worker, assumed to have higher privileges, would have a shorter data range, and the CXO can fetch actual values.
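A minimal sketch of this behaviour is given below (an assumed design, not code from the paper): the Rule Engine maps the requester's role to fixed offsets around the actual value, reproducing the example ranges of Table 1, and only the highest privilege level receives the actual value.

PRIVILEGE_OFFSETS = {              # (below, above) the actual value, in $,000
    "External User": (15, 15),     # 75 -> 60-90
    "Internal Operator": (10, 10), # 75 -> 65-85
    "Managerial": (5, 5),          # 75 -> 70-80
    "Knowledge Worker": (0, 5),    # 75 -> 75-80
    "CXO": (0, 0),                 # actual value
}

def customize_range(actual_value_k, role):
    below, above = PRIVILEGE_OFFSETS[role]
    return (actual_value_k - below, actual_value_k + above)

print(customize_range(75, "Internal Operator"))   # (65, 85)
print(customize_range(75, "CXO"))                 # (75, 75)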
Now, if a conclusion from the analyzed data is submitted in the form of a report by the internal operator, based on his data range, the manager would be able to get a redistributed report for the range of data he is entitled to. When this report is viewed by the CXO, he will be able to analyze it further with the root-level data available in the data store. The use of this approach may require changes to the way the data is presented to the user, unlike the traditional tabular format often provided by tools used to query databases. However, unlike other techniques used for preserving data privacy, the proposed approach will not alter the raw data and will be useful in conducting a meaningful mining kind of analysis.
The establishment of a governance body and senior management sponsorship will be a must for the system to achieve its desired objective. In addition to defining the role and range sets for each requester, the security infrastructure at the data level and object level will also be important to hinder unauthorized access.
The data ranges can be presented for various types of
measure required for data mining and analysis. For
instance:
1. Item Sold: 100-500 for a given class of products
2. Age Group: 25 to 35
3. Sales Amount: USD 10,000 to USD 50,000
4. Number of visits: 1 to 10 by a class of customers
The validity of the ranges may not pass the Chi-square test, but the point here is not to provide perfect ranges for the requested data; rather, it is to avoid disclosure of private and sensitive business and customer information in the form of raw data and to prevent the unauthorized user from being able to regenerate the actual values.
IV. CONCLUSIONS
In this paper an approach has been presented to
overcome the common drawbacks available in most of the
techniques used for desensitizing business and customer
data. The solution proposed overcomes these drawbacks
by presenting personalized data sets in the form of ranges
for analysis and controlled by the role and privileges
defined by the governance body of an organization.
The main benefit of this approach is that it will not alter the data in the original data stores, and no additional data stores will be required for housing desensitized or synthetic data for specific sets of users. Also, as the original data is not altered, there will be no requirement to maintain separate data integrity, and hence all types of users (external, operational, knowledge workers, or management) will be able to relate their results meaningfully, as the data originates from the common data store, though the data sets will be controlled by a set of governing rules.
The only additional requirements will be the creation of an application for the administrator interface, a small rule database, and code development and maintenance for the Data Range Customizer. The code for the Customizer, once designed and implemented, will not have to undergo changes with new data loads, and will have minimal impact due to any strategic changes implemented by the business at a process level or changes modeled within applications developed for strategic purposes. This is because every requester of data who has been granted the privilege to access the data stores will be able to analyze the data sets as long as his profile is maintained in the privacy layer. The addition of large volumes of data in bulk will not require specific changes to the Data Range Customizer or the rules engine. After every successful data load to the original data stores, a mere refresh of the query will enable the requester to view the revised data set.
When the data is shared outside the legal bounds of the organization, external user access will have a minimal impact, as the data shared with such a user is sufficiently desensitized that he is not able to guess or derive the original values. As the root-level data is restricted to a specific hierarchy in the organization, the disclosure of data, say by means of peer-to-peer disclosure, will not be difficult to track. In situations where an outsourced consultant is provided with the data sets for delivering results post-analysis, he will not have any privilege to access the actual root level, and yet his results will not vary far from the truth.
Future work on this model is along the lines of service-oriented architecture, where the suggested approach of having a privacy layer will be introduced as a service over the existing portfolio of information management.
This approach can be implemented within an organization for data access ranging from data stores meant for analysis to transactional data stores. This solution can be made an integral part of the enterprise data security and privacy service model.
REFERENCES
[1] Kobielus, J., Advanced Analytics Predictions For 2010
http://blogs.forrester.com/business_process/2009/12/advanced-analytics-
predictions-for-2010.html.
[2] HP, Top 10 trends in Business Intelligence for 2010, Business
White Paper, Rev. 1, February 2010;
http://h20195.www2.hp.com/v2/GetPDF.aspx/4AA0-6420ENW.pdf.
[3] State of Enterprise Security 2010,
http://www.symantec.com/content/en/us/about/presskits/SES_report_Fe
b2010.pdf.
[4] Widespread Data Breaches Uncovered by FTC Probe
http://www.ftc.gov/opa/2010/02/p2palert.shtm.
[5] Korolov, M., 'Vast Gaps' in Data Protection, Information
Management Online, March 10, 2010.
http://www.information-management.com/ news/data_
protection_security-10017342-1.html.
[6] http://web-docs.stern.nyu.edu/ old_web/emplibrary /Datta
_PreservingPrivacy.pdf
[7] http://en.wikipedia.org/wiki/Data_ set_ (IBM_ main frame)
[8] http://www.misti.com/default.asp?Page=10&pcID= 5838
[9] http://www.dsci.in/taxonomypage/416
[10] http://databases.about.com/od/security/a/databas eroles.htm

Novel Approach for Discovery of Frequently Occurring Sequences

Vibhavari Kamble
PG Student, Computer Department,
Pune Institute of computer
Technology, India

Emmanuel M.
Head of IT department,
I.T. Department,
Pune Institute of Computer
Technology, India
Anupama Phakatkar
Assistant Professor,
Computer Department,
Pune Institute of computer
Technology, India



Abstract Frequent pattern mining from sequential datasets
is an important data mining method. It has various
applications like discovery of motifs in DNA sequences,
financial industry, the analysis of web log,
telecommunication, customer shopping sequences and the
investigation of scientific or medical processes etc. Motif
mining requires efficient mining of approximate patterns that
are contiguous. The main challenge in discovering frequently
occurring patterns is to allow for some noise or mismatches in
the matching process. Existing algorithms focus on mining
subsequences but very few algorithms find approximate
pattern mining. In this paper we propose approach for
finding frequently occurring approximate sequences from
sequential datasets. Proposed work uses the concept of
hamming distance and depth first traversing of suffix tree for
discovering frequent pattern with fixed length, maximum
distance & minimum support.
Keywords-data mining; sequence mining; frequent patterns;
suffix tree; hamming distance;
I. INTRODUCTION
Sequential pattern mining deals with data in large
sequential data sets. Sequence mining has gained
popularity in marketing in retail industry, biomedical
research, DNA sequence patterns, financial industry, and
telecommunication. Its most common applications are the discovery of motifs in DNA sequences, the use of data mining in the financial industry to identify interesting share price movements, the analysis of web logs for web usage, customer shopping sequences, the investigation of scientific or medical processes, and so on. The results of pattern mining can be used for business management, marketing, planning, and prediction.
discovering frequent patterns is to allow for some noise in
the matching process. The most important part of pattern
discovery is the definition of a pattern and similarity
between two patterns which may vary from one application
to another.


Agrawal started work in this area with association rule mining, which analyses customer buying behaviour by finding associations between the items that customers buy. Discovering frequent patterns is a computationally expensive process, and counting the instances of a pattern requires a large amount of processing time. A huge amount of literature is available, and a number of algorithms have been proposed for mining frequent patterns or itemsets. These algorithms differ in the way they traverse the itemset lattice and in the way they handle the database, i.e., how many passes they make over the entire database and how they reduce the size of the processed database in each pass.
In this paper we focus on repeated occurrences of short approximate sequences, i.e., occurrences that are not always identical; such subsequences are also called frequently occurring approximate sequences. There are two types of subsequences: non-contiguous subsequences and contiguous subsequences [8]. If sequence A = abcacbcc and B = abccb, then sequence B is a non-contiguous subsequence of A, obtained by choosing the first, second, third, fifth, and sixth elements of sequence A. We focus on discovering contiguous subsequences of fixed length, because non-contiguous subsequence mining is not applicable in DNA and protein sequence mining applications. Some algorithms [1], [2], [3], [4] are available to mine contiguous subsequences. In this paper we propose an approach for discovering frequently occurring approximate patterns of fixed length. It uses Hamming distance, a suffix array, and depth-first order. This approach is applicable in many real-life and biomedical applications, such as bioinformatics for finding patterns in long noisy DNA sequences and protein motif mining. Section 2 presents related work, section 3 describes the problem definition, section 4 describes the frequently occurring pattern discovery method, and section 5 gives a summary.
II. RELATED WORK
J. Han and M. Kamber [5] state that data mining is the process of extracting knowledge from large databases. Data mining is also referred to as knowledge discovery from data, or KDD. There exist a large number of algorithms for sequential pattern mining, and each algorithm has different features. Early work focused on mining association rules. Later, the AprioriAll algorithm was derived from the Apriori algorithm. In these types of algorithms, candidate sequences are generated and stored; the main goal of subsequent algorithms is to reduce the number of candidate sequences generated so as to minimize input/output cost. In another type of algorithm, support is counted and used to test the frequency. The key strategy here is to eliminate any database or data structure that has to be maintained all the time for counting the support. Usually, a proposed algorithm also has a proposed data structure: SPADE (Zaki, 1998) uses vertical databases, PrefixSpan [6] uses projected databases, and FLAME [8] uses a count suffix tree. Current algorithms in the area can be classified into three main categories, namely apriori-based, pattern-growth, and early-pruning, with a fourth category as a hybrid of the main three [7].
The problem of subsequence mining was introduced in [9]. Later, several other algorithms were proposed as improvements on [9], such as SPADE [10] and BIDE [1]. A statistical sampling based method [11] uses a compatibility matrix to find patterns in the presence of noise. Most algorithms, like CloSpan [2], are presented to mine exact motifs; such algorithms do not allow noise in the matching process. The FLAME [8] algorithm discovers motifs and allows noise in the matching process, so it is efficient for approximate substrings. The FAS-Miner [12] algorithm can discover patterns of longer lengths and higher supports; it uses a suffix array to store suffixes, sorted in lexicographic order. In [13] an algorithm is proposed for finding frequent approximate sequential patterns; it uses the Hamming distance model and a break-down-and-build-up methodology.
III. PROBLEM DEFINITION
In this paper we propose an approach to efficiently mine all frequently occurring approximate substrings. Many challenges arise in sequential mining; for example, projection and prefix-extension techniques need more space. Another challenge is to allow some noise in the matching process: every substring needs to be compared with other substrings to check for mismatches. We want to find the instances of a pattern in the presence of noise, but we do not want to match unrelated subsequences which may have a large number of mismatches. The input sequence database is composed of symbols from a finite discrete alphabet set. Let Σ denote the finite discrete alphabet set. An input sequence S is an ordered list of items or events; for example, S is a DNA sequence over {A, C, G, T}. The i-th item of a sequence S is denoted S[i]. Consider two substrings p and q of S having the same length n. The Hamming distance d(p, q) of two strings p and q is the number of mismatching characters. Equation (1) is from [12]:

d(p, q) = |I|,  I = { i | p_i ≠ q_i, 1 ≤ i ≤ n }   (1)

In this model, two strings are considered the same if and only if they are of equal length and their distance is less than or equal to a user-specified distance. Here we consider the (L, d, k) model for discovering patterns. In this model, L denotes the length of the frequent pattern string, d denotes the maximum number of mismatches, i.e., the Hamming distance between the pattern string and an instance of that string, and k denotes the minimum support, that is, the minimum number of instances of the pattern in the input sequential database; the database may be a single string or a set of strings. Such a model is commonly used for finding DNA motifs in computational biology. The proposed approach outputs the model strings that have sufficient support, that is, support greater than or equal to the predefined k.
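As an illustration of the (L, d, k) model (our own sketch, not the proposed algorithm, which uses a version space tree and a suffix array), the Hamming distance and a brute-force support count over a toy sequence database can be written as follows.

def hamming(p, q):
    # number of mismatching positions between two equal-length strings
    assert len(p) == len(q)
    return sum(1 for a, b in zip(p, q) if a != b)

def support(model, sequences, d):
    # number of length-|model| windows in the database within distance d of the model string
    L = len(model)
    return sum(1 for s in sequences
                 for i in range(len(s) - L + 1)
                 if hamming(model, s[i:i + L]) <= d)

db = ["ACGTACGGT", "ACGAACGT"]
print(hamming("ACGT", "ACGA"))      # 1
print(support("ACGT", db, d=1))     # 4, so ACGT is frequent under (L=4, d=1, k=3)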
IV. FREQUENTLY OCCURRING PATTERN DISCOVERY
Floratou et al. [8] have proposed an efficient method for discovering approximate frequent patterns. We propose an approach using a suffix array and a version space tree. Two suffix trees are used in the FLAME algorithm; instead, we use one suffix array and one version space tree. Suppose that values of the length L, distance d, and minimum support k are given as input. Traditional approaches consider only those substrings of a given length that actually occur in the input sequential database, but this fails to find all possible patterns. A frequently occurring pattern of model (L, d, k) is a string of length L that occurs at least k times in the input data set, with each occurrence being within a Hamming distance of d from the model string. For two strings A and B of the same length n, the Hamming distance between them is defined as Dist(A, B) = |I|, I = { i | A_i ≠ B_i, 1 ≤ i ≤ n }, from [12]. The Hamming distance between two strings A and B is illustrated in Fig. 1 from [12]; in this example the Hamming distance between A and B is 2 (the positions which are not highlighted).

A = AABCACDCABCD
B = AACCABDCABCD

Fig. 1 The Hamming distance between strings A and B is 2

The approach to find such patterns is: first find all possible strings (s_1, s_2, ..., s_n) of length L over the finite discrete alphabet set Σ; then, for each string s_i, find its possible instances such that each instance is at a distance less than or equal to d from s_i, drawn from the same set (s_1, s_2, ..., s_n).
Now we have each string s_i and its group of possible instances, and we find the support of each string from its group, i.e., (G_1, G_2, ..., G_n). To find the support, construct a version space tree or count suffix tree on the actual input data set, and a suffix array on the set of all possible instances from group G_i. Here we use a suffix array to reduce the space usage, i.e., a suffix array requires less space than a suffix tree. We suggest an alternative to the count suffix tree on the input
data set. An explanation of the count suffix tree from reference [8] is given below.
This suffix tree is traversed in depth-first order to find the support of each instance from the group. The count suffix tree [8] for the string ABBCACA is shown in Figure 2. In this count suffix tree, the first internal node contains the value 3, which indicates that the subtree rooted at that internal node has 3 leaf nodes. The string on the edge from the root node to an internal node represents the prefix for all strings on the edges from that internal node to its leaf nodes; e.g., the string B is the common prefix, and the edges below it continue as BCACA and CACA. In short, the count suffix tree on the input data combines the work common to finding the support for models like CABBDE and CABBDF: it computes support at various nodes in the model space and prunes away large portions of the model space.

Fig. 2 Count Suffix tree for string S = ABBCACA.

Instead of a count suffix tree we can use the FAVST algorithm [14]. The FAVST algorithm first processes the minimum frequency predicate and then scans the database to construct an initial version space tree. This algorithm is efficient in terms of computation time and memory usage. The FAVST algorithm was proposed by Lee et al. [14] for mining string databases under constraints. We use the FAVST algorithm for counting the frequency of strings from the input data sets, modifying the original FAVST algorithm from [14] as per our requirement. The input to this algorithm is a single sequential string or a set of sequential strings and a minimum frequency (s, D_i, k). If the input is a single sequential string, then it is divided into subsets D_1, D_2, ..., D_n. The output is a version space tree representing the strings satisfying the minimum frequency. The algorithm initially scans the first data subset D_1 and builds an initial version space tree (VST). After constructing the initial version space tree, it processes the remaining input data subsets D_2, ..., D_n without adding new nodes to the initial VST; it only counts the frequencies and marks the nodes that do not satisfy the minimum frequency. Such a mark represents that we are not interested in that string, so branches containing only marked nodes are pruned away after processing each input data subset D_i. This pruning reduces the number of substring patterns to be checked against further input subsets.
The resultant version space tree contains only strings with minimum frequency k. This approach will provide better results because most of the previous methods use projected data, breadth-first search, generate-and-test, and multiple scans of the database, which require a large amount of main memory and processing time. In our approach, the use of the FAVST algorithm and the suffix array data structure addresses the problem of memory consumption, while the partitioning of the search space allows efficient memory management, lower memory utilization, and less processing time.
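The following sketch illustrates the subset-wise counting-and-filtering idea just described, using a flat dictionary as a simplified stand-in for the version space tree (the real algorithm prunes marked branches of the tree after each subset; here candidates are simply restricted to substrings seen in the first subset and filtered against k at the end).

from collections import Counter

def frequent_substrings(subsets, L, k):
    first, rest = subsets[0], subsets[1:]
    # candidates come from the first data subset only
    counts = Counter(first[i:i + L] for i in range(len(first) - L + 1))
    for d in rest:
        for i in range(len(d) - L + 1):
            w = d[i:i + L]
            if w in counts:           # no new candidates after D1
                counts[w] += 1
    return {w: c for w, c in counts.items() if c >= k}

subsets = ["ABBCACA", "ABBCABBA", "CABBCA"]    # hypothetical data subsets D1, D2, D3
print(frequent_substrings(subsets, L=3, k=3))  # {'ABB': 4, 'BBC': 3, 'BCA': 3}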
V. CONCLUSIONS
In this paper we focused on the contiguous subsequence mining problem and proposed an approach to discover frequently occurring approximate patterns in sequential databases. This approach will give better results than previous methods because we use the Hamming distance concept, a version space tree, and a suffix array. Our approach will discover all frequent approximate patterns of the model (L, d, k). This type of pattern discovery is important in computational biology to detect short subsequences that occur frequently in a given set of sequences.

REFERENCES

[1] J. Wang and J. Han, "BIDE: Efficient Mining of Frequent Closed Sequences," in ICDE, 2004, pp. 79-90.
[2] X. Yan, J. Han, and R. Afshar, "CloSpan: Mining Closed Sequential Patterns in Large Datasets," in SDM, 2003.
[3] M. J. Zaki, "Sequence Mining in Categorical Domains: Incorporating Constraints," in CIKM, 2000, pp. 422-429.
[4] J. Pei, J. Han, and W. Wang, "Mining Sequential Patterns with Constraints in Large Databases," in CIKM, 2002, pp. 18-25.
[5] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
[6] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. Hsu, "PrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth," in ICDE, 2001, pp. 215-224.
[7] N. R. Mabroukeh and C. I. Ezeife, "A taxonomy of sequential pattern mining algorithms," ACM Comput. Surv. 43, 1, Article 3, 2010.
[8] A. Floratou, S. Tata, and J. M. Patel, "Efficient and Accurate Discovery of Patterns in Sequence Data Sets," IEEE Trans. Knowledge and Data Engineering, vol. 23, pp. 1154-1168, Aug. 2011.
[9] R. Agrawal and R. Srikant, "Mining sequential patterns," in Proc. 1995 Int. Conf. Data Engineering (ICDE'95), pp. 3-14, Taipei, Taiwan, Mar. 1995.
[10] M. J. Zaki, "SPADE: An Efficient Algorithm for Mining Frequent Sequences," Machine Learning, vol. 42, no. 1/2, pp. 31-60, 2001.
[11] J. Yang, W. Wang, P. S. Yu, and J. Han, "Mining Long Sequential Patterns in a Noisy Environment," in SIGMOD, 2002, pp. 406-417.
[12] X. Ji and J. Bailey, "An efficient technique for mining approximately frequent substring patterns," Seventh IEEE International Conference on Data Mining, pp. 325-330, 2007.
[13] F. Zhu, X. Yan, J. Han, and P. Yu, "Efficient Discovery of Frequent Approximate Sequential Patterns," Seventh IEEE International Conference on Data Mining (ICDM), pp. 751-756, Oct. 2007.
[14] S. D. Lee and L. De Raedt, "An Efficient Algorithm for Mining String Databases under Constraints (Extended Abstract)," http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.98.354.
Visual and Web Usage Mining Based On Extended Server Logs

Miss Dipti S. Zambre
G.H.R.I.E.M.
Jalgaon, India

Abstract—Web analysts may drive the website improvement process by understanding web navigational data and the current usage of the site. Web usage mining is used to evaluate the usage of a website, with the web server log file used as the source of web data. Among web usage mining methods, graph mining covers the complex web browsing behaviours of users, while visualization techniques provide an understanding of the structure of a particular website and of web surfers' behaviour when visiting that site. This paper presents a web usage mining method which combines graph-based web usage mining with a visualization technique; this helps us to reconstruct user sessions exactly as they occurred, and based on these data we find web usage patterns and navigational patterns with more accuracy.
Keywords— Graph Based Web Usage Mining, Page Browsing Time, Web Visualisation, Web Attacks
I. INTRODUCTION
To improve the design of a website we should find out how it is used, by analysing users' browsing behaviour [1] and navigational patterns [4]. Statistical analysis and web usage mining are two ways to analyse users' web browsing behaviour, and visualization is used to analyse users' navigational patterns. The results of statistical analysis include page views, page browsing time, and so on. Web usage mining applies data mining methods to discover web usage patterns from web usage data. Here a graph mining method, one type of data mining method, is used to analyse the web usage data [2]. A website's hyperlink structure is not always easy to navigate, so users leave before completing a goal. To achieve the insight required to understand users' behaviour, web analysts make extensive use of web metrics derived from the field of Web Usage Mining (WUM). Nevertheless, the vast amount of information available to be analysed and understood emphasizes the need for exploratory tools to help analysts in their website improvement process.
There are many factors that may influence users' behaviour. For instance, discovering the entry and exit points of the site may provide useful information on how users arrive at a site, where they abandon it, and what they did during their visit.
Web usage data can be collected from three sources:
server level, client level, and proxy level. Statistical

analysis, web usage mining, and visualisation usually use the server log as the main data source, which is a server-level data source. The server log is a file automatically created by the web server and kept on the server; it contains data about the requests that are sent to the web server.
The combination of statistical analysis and web usage mining, considering client-side data, presents a powerful method to evaluate the usage of a website. Among the statistical analysis results, browsing time is a good scale on which to evaluate the website and its users; for example, in e-learning systems where we expect students to spend a minimum duration of time on a certain page, a value less than this minimum may represent inattention of the student or weakness of the webpage. Among web usage mining methods, graph mining can discover users' access patterns from complex browsing behaviour.
We study the problems web experts suffer in their day-to-day work and the properties of the tools that they are currently using, in order to get feedback on how to improve the functionalities of the Website Exploration Tool (WET), already introduced in [3]. According to our preliminary poster [4] and the feedback provided by the web analysts, the main contribution of this paper is the presentation of a set of techniques that enable the representation of web navigational data, providing combined visual abstractions that can be customized by the user, as well as recomputed by changing the focus of interest.
Information presented in a visual format is learned and remembered better than information presented textually or verbally. The human brain is structured so that visual processing occurs rapidly and in parallel: given a complicated visual scene, humans can immediately pick out important features in a matter of milliseconds, and they excel at the processing of visual information.
II. RELATED PAPER
A data mining system can display the traffic in a web browser, filtered by source/destination host, protocol, or alert, using bar graphs or pie charts. In one such system, the logs of a web server were processed and a frequency-based log reduction system was used in order to select the traffic for the visualisation of the web requests and the detection of unauthorised traffic.
We consider the website structure as a vertex-weighted directed graph and a user navigation path as a traversal on it, where a node represents a webpage, an edge represents a link between web pages, and the weight of a node represents the browsing time of the webpage. Then we apply the graph mining method presented by Seong Dae Lee et al. [5] to discover web usage patterns.
III. PROPOSED METHOD
The server log format depends on the configuration of the web server. The common server log includes the following seven fields: Remote-Host, Identification, Auth-User, Date-Time, HTTP-Request, Status-Code, and Transfer-Volume. The extended common server log format has two additional fields, the Referrer and User-Agent fields. The Referrer field can be used to reconstruct the user navigation path, and the Date-Time field can be used to discover the page browsing time. When a request is sent to the web server, one record including some or all of the above fields is appended to the server log. If we rely only on the server log, we are confronted with two problems during the construction of user sessions.
Users have five options during web browsing: open a webpage in the current window, open a webpage in a new window, open a webpage in a new tab, switch between tabs or windows, or move to visited pages by clicking the back button or forward button of the web browser.

Definition 1. We define the Open function to represent a user's web browsing behavior as follows:

Open (Referrer, Target, Type);

Here Referrer is the current webpage being browsed by the user; for the first request, where there is no opened page, it is empty (denoted –). Target is the page which is requested through Referrer. Type is one of the five web browsing options: CurrentWindow, NewWindow, NewTab, Switch, and Return.
Now, suppose the user follows this scenario:

Open (–, A, NewWindow);
Open (A, B, NewTab);
Open (A, C, NewTab);
Open (B, D, CurrentWindow);

Open (C, D, CurrentWindow);
Open (D, E, CurrentWindow);

Fig. 1(a) shows the sequence of page requests in the server log. When we want to reconstruct the user's navigation path, we cannot determine which D is the referrer of E (Fig. 1(b)). In other words, the navigation path is not clear (Fig. 1(c)).

Definition 2. Session Start is the time at which the user session is started; it is equal to the time of the first user request. It is denoted SS.

Definition 3. Session End is the time at which the user session expires; it occurs when the user does not send any request to the web server for a certain duration. It is denoted SE.

Definition 4. Request Time is the time of a webpage request. The request time of page X is denoted R_X.

Definition 5. Browsing Time is the duration of a webpage visit. The browsing time of page X is denoted B_X.

For example, consider the following scenario:

Open (–, A, NewWindow);
Open (A, B, NewTab);
Open (A, C, NewTab);

There are now three open pages in the client web browser. Fig. 2(a) shows the server log data and Fig. 2(b) shows the browsing time of the pages based on the server log data. However, the user may have switched back to B just two seconds after opening C; in that case the real browsing times of the pages are as in Fig. 2(c).


Referrer   Target
–          A
A          B
A          C
B          D
C          D
D          E

Navigation path: <A, B, D, E> or <A, C, D, E>

Figure 1: (a) Sequence of page requests in the server log, (b) unknown referrer, (c) unknown navigation path

(a) Server log data: page A requested at R_A, page B at R_B, page C at R_C.
(b) Browsing times based on the server log: B_A = R_B - R_A, B_B = R_C - R_B, B_C = S_E - R_C.
(c) Real browsing times (the user switched back to B two seconds after opening C): B_A = R_B - R_A, B_B = (R_C - R_B) + (S_E - R_C - 2), B_C = 2.

Figure 2: (a) Server log data, (b) browsing time of pages based on the server log, (c) real browsing time
IV. PROCESS OF PROPOSED PAPER
The proposed method consists of four phases: phase 1 is related to monitoring and recording users' web browsing behavior, including client-side web browsing behavior [6]; phase 2 is related to converting the recorded web usage data into a graph structure [6]; phase 3 is related to applying a graph mining method to discover web usage patterns; and phase 4 is related to applying visualization to discover users' navigational patterns. Fig. 3 shows the process of the proposed method.
Figure 3. Process of the proposed method.
Phase 1: record users' web browsing behavior. To do this, we design an AJAX interface. The interface is in fact a customized web application server which is able to monitor users' browsing behaviors on the client side. It handles five events: OnSessionStart, OnSessionEnd, OnPageRequest, OnPageLoad, and OnPageFocus. OnSessionStart, OnSessionEnd, and OnPageRequest are server-side events; OnPageLoad and OnPageFocus are client-side events.

OnSessionStart: occurs when the user requests a web page from the web server for the first time. When this event is raised, a unique ID, the SessionID, is generated; it helps to segregate user sessions from each other.
Algorithm, OnSessionStart
{
// Generate Session ID
SessionID = GenerateSessionID();
}

OnPageRequest: this event is raised every time the user requests a webpage from the web server. When this event is raised, a unique ID, the PageID, is generated and assigned to the requested page (PR). A script is registered on the client machine to monitor client-side browsing behaviors.

Algorithm, OnPageRequest
{
// Generate Unique ID
PageID = GeneratePageID();
//Assign a unique ID (PageID) to requested page (PR)
AssignPageID (PR, PageID);
//Register script to monitor client side browsing
behaviors
RegisterClientScript ();
}

OnPageLoad: this is a client-side event that occurs when a web page is loaded in the user's browser. When this event is raised, data about the user's activity, such as PageID of Referrer, PageID of Target, Browsing Time, Event Name, Date, and Time, are recorded in the session.

Algorithm, OnPageLoad
{
// write the information into the session
WriteToSession (SessionID, TargetPageID, ReferrerPageID, BrowsingTime, Event, Date, Time);
}

OnPageFocus: this is a client-side event that occurs when the user focuses on a web page. When this event is raised, data similar to that of the OnPageLoad event is recorded in the session.

Algorithm, OnPageFocus
{
// write the information into the session
WriteToSession (SessionID, TargetPageID, ReferrerPageID, BrowsingTime, Event, Date, Time);
}

OnSessionEnd: it occurs when the user does not send any request to the web server for a predefined duration of time, called the Session Timeout. When this event is raised, the content of the user session is recorded in a file named Log File on the server. The log file is like a server log, but it is created by the web application and customized to our needs.
Algorithm, OnSessionEnd
{
//Write session data into log file
for each Record R in Session
{
WriteToLogFile(R);
}
}
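The event handlers above all write the same kind of record. A minimal Python sketch of such a record and of the in-memory session buffer that OnSessionEnd flushes to the log file could look as follows (field names follow the pseudocode above; the implementation details are assumptions, not the authors' code):

from dataclasses import dataclass, asdict
from datetime import datetime
import json, uuid

@dataclass
class UsageRecord:
    # Fields written by WriteToSession in the pseudocode above.
    session_id: str
    target_page_id: str
    referrer_page_id: str
    browsing_time: float
    event: str          # e.g. "OnPageLoad" or "OnPageFocus"
    date: str
    time: str

class Session:
    def __init__(self):
        self.session_id = uuid.uuid4().hex   # stands in for GenerateSessionID()
        self.records = []

    def write(self, target, referrer, browsing_time, event):
        now = datetime.now()
        self.records.append(UsageRecord(self.session_id, target, referrer,
                                        browsing_time, event,
                                        now.date().isoformat(), now.time().isoformat()))

    def flush_to_log(self, path):
        # OnSessionEnd: write every record of the session into the custom log file.
        with open(path, "a") as f:
            for r in self.records:
                f.write(json.dumps(asdict(r)) + "\n")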

Phase 2: in this phase we convert the web usage data into a graph structure, in a form that can be used by the graph mining method.

Definition 6. Considering the definitions of weighted directed graph, base graph, and traversal [5], in this paper the base graph and the session graph represent the site structure as a graph where vertices represent web pages, edges represent links between web pages, and the weights of vertices represent the browsing time of web pages. A traversal is a sequence of consecutive web pages along a sequence of links between web pages on a base graph. The weight of a traversal is the sum of the browsing-time weights of the web pages in the traversal. A traversal database is a set of traversals.
In this phase, we use the Log File to make the web usage data ready for the graph and visual mining algorithms. The base graph is constructed, users' sessions are separated, users' traversals are discovered, the traversal database is constructed, and the browsing time of each page is calculated and assigned to the corresponding vertex in the base graph.
Algorithm, Preprocess
{
for each SessionID SID in Log File
{
Traversal T = FindTraversal(SID);
AppendToTraversalDB(SID, T);
for each Page P in T
{
BrowsingTime BT = CalculateBrowsingTime(P);
AssignBrowsingTime(P, BT);
}
}
}
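A sketch of Phase 2 in Python, under the assumption that the log file produced in Phase 1 can be read as (SessionID, Referrer, Target, BrowsingTime) tuples; it builds the traversal database and assigns accumulated browsing time to each vertex of the base graph (illustrative code, not the authors' implementation):

from collections import defaultdict

def preprocess(log_records):
    """log_records: iterable of (session_id, referrer, target, browsing_time)."""
    base_graph_edges = set()
    vertex_weight = defaultdict(float)        # browsing time per page
    traversal_db = defaultdict(list)          # session_id -> ordered list of pages

    for session_id, referrer, target, browsing_time in log_records:
        if referrer:
            base_graph_edges.add((referrer, target))
        traversal_db[session_id].append(target)
        vertex_weight[target] += browsing_time

    return base_graph_edges, dict(vertex_weight), dict(traversal_db)

# Example using the scenario of Figure 1 for a single session (times are assumed).
log = [("s1", None, "A", 12), ("s1", "A", "B", 8), ("s1", "A", "C", 5),
       ("s1", "B", "D", 20), ("s1", "D", "E", 9)]
edges, weights, traversals = preprocess(log)
print(traversals["s1"])   # ['A', 'B', 'C', 'D', 'E']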

Sessions Graph
Users' navigational patterns are more complicated than just a tree. Therefore, we have developed a visualisation dedicated to the representation of detailed sessions based on the focus of interest of the web analyst. That is, WET allows the analyst to fix any node as the root of any of the available hierarchies, modifying the focus of interest of the visualisations.


Figure 4. A graph representing a website
Phase 3: in this phase, the web usage data, already converted into a form usable by the graph mining method [6], is processed: the graph mining method is applied to discover weighted frequent patterns.

Definition 7. A pattern P is said to be weighted frequent when the weight of its traversal is greater than or equal to a given Minimum Browsing Time (MBT).

Algorithm, Mining Weighted Frequent Pattern
{
// the mining method is applied
MineWeightedFrequentPattern(BaseGraph, TraversalDB, MBT);
}
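Definition 7 can be expressed directly in code. The sketch below is an illustration of the MBT filter only, not the full mining algorithm of [5]; it keeps a traversal when the sum of the browsing-time weights of its pages reaches a given MBT (data values are assumed):

def traversal_weight(traversal, vertex_weight):
    # Weight of a traversal = sum of the browsing-time weights of its pages (Definition 6).
    return sum(vertex_weight.get(page, 0) for page in traversal)

def weighted_frequent_patterns(traversal_db, vertex_weight, mbt):
    """Return the traversals whose weight >= Minimum Browsing Time (Definition 7)."""
    return {sid: t for sid, t in traversal_db.items()
            if traversal_weight(t, vertex_weight) >= mbt}

weights = {"A": 12, "B": 8, "C": 5, "D": 20, "E": 9}
db = {"s1": ["A", "B", "D", "E"], "s2": ["A", "C"]}
print(weighted_frequent_patterns(db, weights, mbt=30))   # only s1 qualifies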

Phase 4: in this phase, the Website Exploration Tool (WET) is used. WET is a visual tool to support web mining. Its main characteristic is its ability to overlay the visualisation of web metrics and web usage on top of meaningful representations of a website. The main goal of WET is to assist in the conversion of web data into information by providing an already known context in which web analysts may interpret the data. In that sense, WET generates representative hierarchies from the website structure or from the aggregation of users' navigation. After this action is performed, the system queries a database where users' sessions have been previously stored, collecting all the sessions that started at that node. Such sessions are depicted using a Force Directed Layout. In this case, edge width and transparency are used to represent the sessions' frequency, which corresponds to the number of users that started navigating at the focus node and passed through every node in the visualisation.
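A minimal sketch of this kind of force-directed session view, assuming the networkx and matplotlib libraries are available; edge width encodes how many sessions traversed each link. This is only an approximation of what WET does, with made-up session data:

import networkx as nx
import matplotlib.pyplot as plt

def draw_sessions(sessions):
    """sessions: list of page sequences that all start at the focus node."""
    g = nx.DiGraph()
    for path in sessions:
        for src, dst in zip(path, path[1:]):
            if g.has_edge(src, dst):
                g[src][dst]["count"] += 1
            else:
                g.add_edge(src, dst, count=1)

    pos = nx.spring_layout(g, seed=42)               # force-directed layout
    widths = [g[u][v]["count"] for u, v in g.edges()]
    nx.draw(g, pos, with_labels=True, width=widths, node_color="lightgrey")
    plt.show()

draw_sessions([["A", "B", "D", "E"], ["A", "C", "D"], ["A", "B", "D"]])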

IV. VISUALISATION
One of these exploratory techniques is the use of 3D visualisations as a background context onto which information regarding web metrics, such as the number of visits per page, page rank or entry points, can be mapped.
The preprocessor module examines the web requests to detect malicious traffic. Microsoft Internet Information Services (IIS) has been used in order to study the various types of attacks [7].
The web attack classes used, with the associated
visualisation colour, are the following:

CMD: Unix or Windows command execution attempt (crimson)
INS: Code insertions of HTML, JavaScript, SQL, Perl, Access DB (dark orchid)
TBA: Trojan backdoor attempt (deep pink)
MAI: Different mails such as sendmail, formail, email etc. (forest green)
BOV: Buffer overflow (cyan)
CGI: CGI scripts (gold)
IIS: IIS server attacks (blue)
CSS: Cross Site Scripting or Server Side Include (coral)
MISC: Miscellaneous: Coldfusion, Unicode and malicious web request options such as PROPFIND, CONNECT, OPTIONS, SEARCH, DEBUG, PUT and TRACE (dark orange).
Normal traffic is visualised in black and malicious traffic in 9 different colours, one for each attack class. This visual separation was necessary because normal traffic overloads the display and the security analyst cannot quickly interpret the malicious attempts. When visualising both normal and malicious traffic, the security analyst spends more time navigating through the graph trying to eliminate normal traffic by zooming into the coloured part of the display than he would if he had only a coloured graph to contend with.
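The class-to-colour mapping above is essentially a lookup table. The small Python sketch below shows how a preprocessor might tag requests for the visualisation; the matching rules here are simplified placeholders and are not the detection logic of [7]:

# Colour per attack class, as listed above; normal traffic is black.
ATTACK_COLOURS = {
    "CMD": "crimson", "INS": "darkorchid", "TBA": "deeppink",
    "MAI": "forestgreen", "BOV": "cyan", "CGI": "gold",
    "IIS": "blue", "CSS": "coral", "MISC": "darkorange",
}

def classify_request(request_line):
    """Very rough placeholder rules; a real preprocessor would use proper signatures."""
    line = request_line.lower()
    if "cmd.exe" in line or "/bin/sh" in line:
        return "CMD"
    if "<script" in line or "select " in line:
        return "INS"
    if any(m in line for m in ("propfind", "connect", "options", "trace")):
        return "MISC"
    return None          # treated as normal traffic

def colour_for(request_line):
    cls = classify_request(request_line)
    return ATTACK_COLOURS.get(cls, "black")

print(colour_for("GET /scripts/..%255c../winnt/system32/cmd.exe?/c+dir"))   # crimson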
Figure 5 shows how the visualisation helps in finding interesting links, by first depicting all the links within the site (Figure 5(a)) and then filtering them according to predefined usage thresholds (Figure 5(b)).





Figure 5. (a) All the links of a website

Figure 5. (b) Radial tree with the most relevant links after applying a filter

Figure 5. The filtering system allows the identification of the most used links (right) from the cluttered visualisation of the whole web graph.


WET has also been provided with a highlighting system
that enables the highlighting across the different visual
metaphors. Hence, any interaction with a metaphor may
also be reflected in any of the available visualizations.

V. SNAPSHOT


VI. CONCLUSION
Among statistical analyses we took page browsing time into account, and among web usage mining methods we chose graph-based web usage mining. We considered the website as a graph and the user navigation path as a traversal on it. We then applied a graph mining method to discover web usage patterns and visual mining to discover the users' navigational patterns. These patterns help administrators evaluate the current usage of their website. In this paper we introduced an exploratory tool that allows drilling down through web data. The tool, an improved version of WET, provides a set of combined visual abstractions that can be visually customised as well as recomputed by changing the focus of interest; the system also contemplates the visualisation of users' sessions using a Force Directed Graph. It offers a visualisation of web traffic that enables rapid perception and detection of unauthorised traffic, together with the capability to isolate malicious traffic for immediate analysis and response. The visualisation considerably reduces the time required for data analysis and, at the same time, provides insights which might otherwise be missed during textual analysis.

REFERENCES
[1] I-Hsien Ting, Chris Kimble and Daniel Kudenko, "Applying Web Usage Mining Techniques to Discover Potential Browsing Problems of Users", Seventh IEEE International Conference on Advanced Learning Technologies (ICALT 2007).
[2] R. Iváncsy and I. Vajk, "Frequent Pattern Mining in Web Log Data", Acta Polytechnica Hungarica, Journal of Applied Sciences at Budapest Tech Hungary, Special Issue on Computational Intelligence, Vol. 4, No. 1, pp. 77-99, 2006.
[3] V. Pascual and J. C. Dürsteler, "WET: a prototype of an exploratory search system for web mining to assess usability", in IV '07: Proceedings of the 11th International Conference Information Visualization, pp. 211-215, IEEE Computer Society, 2007.
[4] V. Pascual-Cid, R. Baeza-Yates, J. C. Dürsteler, S. Minguez and C. Middleton, "New Techniques for Visualising Web Navigational Data", 2009 13th International Conference Information Visualisation.
[5] Seong Dae Lee and Hyu Chan Park, "Mining Weighted Frequent Patterns from Path Traversals on Weighted Graph", IJCSNS International Journal of Computer Science and Network Security, Vol. 7, No. 4, April 2007.
[6] Mehdi Heydari, Raed Ali Helal and Khairil Imran Ghauth, "A Graph-Based Web Usage Mining Method Considering Client Side Data", 2009 International Conference on Electrical Engineering and Informatics, 5-7 August 2009, Selangor, Malaysia.
[7] I. Xydas, G. Miaoulis, P.-F. Bonnefoi, D. Plemenos and D. Ghazanfarpour, "3D Graph Visualisation of Web Normal and Malicious Traffic".


OPINION MINING: Extracting and Analyzing Customers Opinion on the Internet
Chandrashekhar D. Badgujar
Asst. Prof., Computer Engineering Department,
GHRIEM, Jalgaon, India


Abstract- A wealth of information is available on the Internet as customer reviews, comments, newsgroup posts, discussion forums or blogs, which are collectively called user generated content. This information can be used to derive the public reputation of service providers. To do this, data mining techniques, especially the recently emerged opinion mining, can be a useful tool. In this paper we present a state-of-the-art review of opinion mining from online customer feedback. An opinion mining approach is presented which allows an automated extraction, aggregation, and analysis of customer opinions on products by using text mining. Thus, strengths and weaknesses judged by customers can be detected at an early stage, and starting points for product design and marketing can be identified. The application of the approach is illustrated by a case study from the automotive industry.
Keywords- Data mining; text mining; web mining; opinion
mining; Web 2.0; social networks; forum analysis; product
rating
I. INTRODUCTION
The Internet is increasingly changing from a medium of
distribution to a platform of interaction. Customer
discussions in Web 2.0 are a valuable source of information
for companies and enable a new kind of market research. An
abundance of customer opinions is available in Internet
forums free of charge and up-to-date. Manual analysis of
customer opinions is only possible to a certain extent and
very time-consuming due to the multitude of contributions.
Opinions on products and their features can be extracted
from discussion forums, aggregated, and analyzed
automatically by using enhanced text mining concepts. The
task of Opinion Mining is technically challenging because it requires natural language processing, which is itself a tedious job. However, it has great potential and usefulness in real-life applications. For example, to improve the quality of their products and services, businesses are always interested in finding out their customers' opinions or feedback. Opinion Mining is concerned with the opinion a document expresses rather than with the topic of the document.
II. BACKGROUND
As human beings, people like to express their own opinions. They are also interested in knowing about others'

opinions on anything they care about, especially whenever they need to make a decision. Before the emergence of the Internet, very little written opinion was available in the market. At that time, if an individual needed to make a decision, he/she typically asked for opinions from friends and family. When an organization needed to find the opinions of the general public about its products and services, it conducted surveys and focus groups. With the rapid expansion of e-Commerce, more users are becoming comfortable with the Web and an increasing number of people are writing reviews (Wang and Zhou, 2009). As a
result, the number of reviews that a product receives grows
rapidly. With the explosive growth of the user generated
content on the Web, the world has changed. One can post
reviews of products at merchant sites and express views on
almost anything in Internet forums, discussion groups, and
blogs, which are collectively called the user generated
content (Pang and Lee, 2008). Now, if one wants to buy a product, it is no longer necessary to ask friends and family, because there are plenty of product reviews on the Web which give the opinions of existing users of the product. For a company, it may no longer be necessary to conduct surveys, to organize focus groups or to employ external consultants in
order to find consumer opinions or sentiments about its
products and those of its competitors. The existing online
customer feedback can be used effectively to fulfill that
objective. The technology of opinion mining thus has a
tremendous scope for practical applications. In order to
enhance customer satisfaction and shopping experience, it
has become a common practice for online merchants to
enable their customers to review or to express opinions on
the products that they have purchased. If an individual wants
to purchase that product, it is useful to see a summary of
opinions of existing users so that he/she can make an
informed decision. This is better than reading a large number
of reviews to form a mental picture of the strengths and
weaknesses of the product. For this reason, mining and
organizing opinions from different sources are important for
individuals and organizations. The basic idea is that, from the customer review text, an overall opinion, either positive, negative or neutral, is calculated. The summation of these values gives the public reputation of that particular service or service provider. Opinions regarding different elements or features of the service can also be considered. Most existing techniques utilize a list of opinion-bearing words, generally called an opinion lexicon, for this purpose (Ding et al. 2008). Opinion words are words that
express desirable (e.g. awesome, fantastic, great, amazing,
exceptional, excellent, best, etc.) or undesirable (e.g. bad,
poor, frustrating, disappointing, horrible, terrible, worst,
sucks etc.) states (Lu and Zhai, 2008).
III. OPINION MINING
Text mining aims at detecting interesting patterns and
knowledge within texts. Texts can contain two types of
information: Facts and Opinions. While traditional text
mining focuses on the analysis of facts, opinion mining
concentrates on attitudes. Three main fields of research
predominate in opinion mining:
A. Sentiment classification
Sentiment classification deals with classifying entire documents according to the opinions expressed towards certain objects. Pang et al. [5], for example, divide movie reviews into the classes positive and negative, as Turney [7] does with various product reviews.
B. Feature based opinion mining
Feature-based opinion mining, on the other hand, considers the opinions on features of certain objects in sentences. Popescu and Etzioni [6] and Liu et al. [3], for example, examine reviews of scanner and camera features at the sentence level.
C. Comparative opinion mining
The aim of comparative opinion mining is to detect the preferred object in sentences in which objects are compared with each other. Ganapathibhotla and Liu [1], for example, filter out the preferred cameras in comparative sentences.
IV. APPROACH
The presented opinion mining approach belongs to the category of feature-based opinion mining and aims at extracting and analyzing customer opinions on products in forum postings. It comprises four succeeding steps: selection, extraction, aggregation, and analysis (see Fig. 1). In the first step, the most relevant discussion forums out of a multitude of forums on the Internet are selected and their postings are downloaded. In the next step, product features and their evaluations are extracted from the postings by text mining and stored in a database. Subsequently, product features and their evaluations are aggregated in order to give an overview of which product features are judged to be positive or negative. Finally, product feature evaluations are analyzed. The goal is to detect strengths and weaknesses of different (e.g., rival) products and to reveal associations between them. To illustrate the approach, customer opinions on products of Germany's automotive industry have been analyzed.

Figure 1. Opinion mining Approach.
The automotive industry is not only one of the
economically most important industries, making up one
quarter of the total turnover of the German industry in 2008,
but is also one of the most discussed industries in the
German-speaking Web 2.0. According to a study by the
Web-2.0-agency Ethority, about one third of the German-
speaking Web-2.0-discussions concern the automotive
industry [Ethority 2008].
A. Selection
In order to identify the most important discussion forums, a relevance index is used which assigns a value to each forum according to its reach and size. A forum's reach is determined by the average ranking in the search engines Google, Yahoo, Yoodo and MSN when entering the word "forum" and the given product (e.g., car). The size is determined by the number of users, contributions, and topics. All discussion forums are ranked according to these criteria. The relevance index is calculated as the arithmetic mean of these ranks. Figure 2 shows the five most relevant discussion forums. 900 postings have been selected from the most relevant forum, motor-talk: 300 for each of the car models Audi A3, BMW 3-Series, and VW Golf.
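A Python sketch of the relevance index computation described above: each forum is ranked separately by users, postings, subjects and search-engine position, and the index is the arithmetic mean of those ranks. This is an illustration only; the exact ranking pool and weighting of the study are not reproduced here, and the data are taken from Figure 2.

def relevance_index(forums):
    """forums: dict name -> dict with 'users', 'postings', 'subjects', 'search_rank'."""
    names = list(forums)

    def rank_by(key, reverse):
        ordered = sorted(names, key=lambda n: forums[n][key], reverse=reverse)
        return {name: pos + 1 for pos, name in enumerate(ordered)}

    # Bigger is better for the size criteria, smaller is better for the search-engine rank.
    ranks = [rank_by("users", True), rank_by("postings", True),
             rank_by("subjects", True), rank_by("search_rank", False)]
    return {n: sum(r[n] for r in ranks) / len(ranks) for n in names}

forums = {"motor-talk": {"users": 808767, "postings": 16494172, "subjects": 1877013, "search_rank": 2},
          "autoextrem": {"users": 107864, "postings": 1393945, "subjects": 171573, "search_rank": 4}}
print(relevance_index(forums))   # motor-talk obtains the smallest (best) index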





Forum          Amount of Users   Amount of Postings   Amount of Subjects   Ranking in Search Engines   Relevance Index
Motor-talk     808767            16494172             1877013              2                           1.13
autoextrem     107864            1393945              171573               4                           2.50
pagenstecher   29549             936666               62296                9                           5.00
Carpassion     68132             622278               31113                10                          6.00
hunny          12400             106057               19175                3                           6.67

Figure 2. Forum Ranking.
B. Extraction
The aim of extraction is to identify product features and
their ratings in forum postings. Product features comprise
physical components of a product (product components) as
well as attributes associated with the product (product
attributes). Figure 3 shows some relevant components and
attributes of a car.


Figure 3. Components and attributes of a car.
The evaluations show how the customers judge the
various product features. There are two concepts to measure
evaluations: polarity and intensity. In this approach, polarity
is modeled by three classes: positive, negative, and neutral.
Intensity is divided into the classes strong, weak, and
medium (see Fig. 4).



Figure 4. Polarity and intensity of evaluations.
Identification of product features and evaluations is a classification task. Passages of forum postings are divided into classes according to their linguistic attributes (words and their grammatical function). Support vector machines enable the learning of binary classification rules. To classify sentences according to their polarity, three rules are learned: Positive versus Not Positive, Negative versus Not Negative, and Neutral versus Not Neutral. The final class is decided by a majority vote. As input, the support vector machines use a training dataset which contains text passages along with their linguistic attributes and associated classes.
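A minimal sketch of this polarity classification step using scikit-learn, assuming a small labelled training set of text passages; three binary one-vs-rest SVMs are trained and the final class is taken from the highest decision score, which approximates (but is not identical to) the majority-vote scheme described above. The training sentences and feature representation are assumptions for illustration only.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Tiny illustrative training set (text passage, polarity class).
train = [("the engine is fantastic", "positive"),
         ("handling feels great and sporty", "positive"),
         ("the gearbox is terrible", "negative"),
         ("poor build quality, very disappointing", "negative"),
         ("the car has five doors", "neutral"),
         ("it was delivered in march", "neutral")]

texts, labels = zip(*train)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# One binary SVM per class: positive vs not-positive, negative vs not-negative, ...
classifiers = {}
for cls in ("positive", "negative", "neutral"):
    y = [1 if label == cls else 0 for label in labels]
    classifiers[cls] = LinearSVC().fit(X, y)

def classify(sentence):
    x = vectorizer.transform([sentence])
    # Each binary classifier scores the sentence; the highest score decides the class.
    scores = {cls: clf.decision_function(x)[0] for cls, clf in classifiers.items()}
    return max(scores, key=scores.get)

print(classify("the seats are really great"))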
C. Aggregation
Aggregation aims at connecting and summarizing
product features and evaluations. It takes place at two
different levels of abstraction. At the upper level, an
overview is given as to how positive or negative product
features are rated. For example, Figure 5 (left hand side)
shows the normalized differences in positive and negative
evaluations of all product components and attributes of the
Audi A3. At the lower level of abstraction, the evaluations of
interesting product features and elements can be further
differentiated according to their intensity. Figure 5 (right
hand side) gives a detailed insight into the intensity of the
evaluations of the two best rated product attributes sportiness
and drivability of the Audi.
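The upper level of aggregation can be reduced to a simple normalized difference between positive and negative mention counts per feature. A short sketch with illustrative numbers (not the study's data):

def normalized_difference(counts):
    """counts: feature -> dict with 'positive', 'negative', 'neutral' mention counts."""
    result = {}
    for feature, c in counts.items():
        total = c["positive"] + c["negative"] + c["neutral"]
        result[feature] = (c["positive"] - c["negative"]) / total if total else 0.0
    return result

counts = {"sportiness":  {"positive": 42, "negative": 6,  "neutral": 12},
          "drivability": {"positive": 35, "negative": 9,  "neutral": 20},
          "gearbox":     {"positive": 8,  "negative": 27, "neutral": 10}}
print(normalized_difference(counts))   # positive values = strengths, negative = weaknesses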
D. Analysis
The aim of the analysis phase is to examine product
features and their evaluations in order to find starting points
for improving product design and sales promotion. It
comprises analysis of competitors and identification of
associations. The competitor analysis allows the comparison
of the evaluations of different products. Figure 6 (left hand
side) contrasts the positive evaluations with the negative
ones as far as suitability for daily use, drivability, and
reliability of the three car models are concerned. Association
identification shows which product components customers
frequently associate with which product attributes. The
thickness of the edges in the network indicates the frequency
of association. Fig. 6 (right hand side) shows a part of the
association network.
V. APPLICATION AREA
Opinion Mining is now becoming available in commercial applications. Systems like the University of Maryland's opinion analysis system OASYS (http://oasys.umiacs.umd.edu/oasys) achieved high accuracy on sentiment queries when tested against human opinion data (Subramanian, 2009). As more accurate opinion information becomes available commercially, the demand for such systems is increasing. BuzzMetrics, Umbria and SentiMetrix (www.sentimetris.com) offer opinion mining services on a continuous scale. Such services are used by corporations, governments, nonprofits and also individuals. Opinion Mining can be used in various fields to meet varied purposes. Binali et al. (2009) presented some application domains where Opinion Mining can be beneficial, listing the areas of shopping, entertainment, government, research and development, marketing and education for e-Learning, with some examples of current applications. There are many other fields which could also benefit from opinion mining information at the individual or organizational level.
VI. FUTURE CHALLENGES
Beyond what have been discussed so far, we also need to
deal with the issue of opinion spam, which refers to writing
fake or bogus reviews that try to deliberately mislead readers
or automated systems by giving untruthful positive or
negative opinions to promote a target object or damage the
reputation of another object. Detecting such spam is vital as
we go forward because spam can make sentiment analysis
useless. Opinion mining suffers from several different
challenges, such as determining which segment of text is
opinionated, identifying the opinion holder, determining the
positive or negative strength of an opinion. The following are the general challenges pointed out by different authors:
Authority
Non-expert opinion
Domain dependence
Language differences
Effects of syntax on semantics
In addition, sentence and document complexity, contextual sentiments, heterogeneous documents, reference resolution, and modal operators such as might, could, and should remain challenging problems in this area.
VII. CONCLUSION
Opinion Mining has the potential to be used from the individual level up to the organizational level, such as companies and government. People and organizations from several domains could benefit in various ways by applying Opinion Mining techniques to online customer feedback. In this paper we have reviewed the current research work in the area of Opinion Mining. We have analyzed several approaches taken by researchers to extract an overall opinion from unstructured text expressed as opinion. We have classified and critically evaluated the existing work. We strongly believe that this study will help new researchers explore cutting-edge areas of interest in Opinion Mining.
REFERENCES
[1] Freimut Bodendorf and Carolin Kaiser, "Mining Customer Opinions on the Internet: A Case Study in the Automotive Industry", 2010 Third International Conference on Knowledge Discovery and Data Mining.
[2] Touhid Bhuiyan, Yue Xu and Audun Josang, "State-of-the-Art Review on Opinion Mining from Online Customers' Feedback", Proceedings of the 9th Asia-Pacific Complex Systems Conference, 2009.
[3] M. Ganapathibhotla and B. Liu, "Mining Opinions in Comparative Sentences", in: Proc. of the 22nd Intl. Conf. on Computational Linguistics, Manchester, 2008, pp. 241-248.
[4] Ethority, "Brands in Social Media", Hightext Verlag, 2008.
[5] B. Liu, M. Hu and J. Cheng, "Opinion Observer: Analyzing and Comparing Opinions on the Web", Proc. of the Intl. World Wide Web Conf., Chiba, 2005.
[6] B. Liu, "Web Data Mining: Exploring Hyperlinks, Contents and Usage Data", Springer, Berlin-Heidelberg, 2007.
[7] B. Pang, L. Lee and S. Vaithyanathan, "Thumbs up? Sentiment Classification Using Machine Learning Techniques", Proc. of the 2002 Conf. on Empirical Methods in Natural Language Processing, 2002.
[8] A. M. Popescu and O. Etzioni, "Extracting Product Features and Opinions from Reviews", in: Natural Language Processing and Text Mining, Springer, London, 2007, pp. 9-28.
[9] P. Turney, "Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews", Proc. of the 40th Ann. Meeting of the Assoc. for Computational Linguistics, 2002.
[10] Hsinchun Chen and David Zimbra, "AI and Opinion Mining".





Figure 5. Aggregation on Different Levels of Abstraction

Figure 6. Analysis of Competitors and Associations.


Parallel Association Rule for Data Mining in Multidimensional Databases

T.R.SRINIVASAN
Dept of Information Technology
Vidyaa Vikas College of Engineering
and Technology,
(ANNA UNIVERSITY)
Tiruchengode, Tamilnadu, India


C.D.RAJAGANAPATHY
Dept of Computer Applications
Vidyaa Vikas College of Engineering
and Technology,
(ANNA UNIVERSITY)
Tiruchengode, Tamilnadu, India


S.GOPIKRISHNAN
Dept of Computer Science and
Engineering
The Kavery Engineering College,
(ANNA UNIVERSITY)
Mecheri, Tamilnadu, India


Abstract:- One of the important problems in data mining is
discovering association rules from databases of transactions
where each transaction consists of a set of items in multi
dimensional database environment. The most time
consuming operation in this discovery process is the
computation of the frequency of the occurrences of
interesting subset of items (called candidates) in the
database of transactions. To prune the exponentially large
space of datasets, most existing algorithms consider only
those datasets that have a user defined minimum support.
Even with the pruning, the task of finding all association
rules requires a lot of computation power and memory.
Parallel computers offer a potential solution to the
computation requirement of this task, provided efficient
and scalable parallel algorithms can be designed. In this
paper, we present two new parallel association rule mining algorithms for multidimensional databases. The Parallel Data Distribution algorithm efficiently uses the aggregate memory of the parallel computer by employing a dataset partitioning scheme and uses an efficient communication mechanism to move data among the databases. The Hybrid Distribution algorithm further improves upon the Parallel Data Distribution algorithm by dynamically partitioning the dataset to maintain good load balance. The experimental results
on a Cray T3D parallel computer show that the Hybrid
Distribution algorithm scales linearly, exploits the aggregate
memory better, and can generate more association rules
with a single scan of database per pass.

Keywords: Data mining, parallel Algorithm, Data
Distribution, Multidimensional Database, Efficient
communication.
I. INTRODUCTION
Recently, Database Mining" has begun to attract
strong attention. Because of the progress of bar-code
technology, point-of-sales systems in retail company
become to generate large amount of transaction data, but
such data being archived and not being used efficiently.
The advance of microprocessor and secondary storage
technologies allows us to analyze this vast amount of
transaction log data to extract interesting customer
behaviors. Database mining is the method of efficient
discovery of useful information such as rules and
previously unknown patterns existing between data items
embedded in large databases, which allows more
effective utilization of existing data.

One of the important problems in data mining
[SADC93] is discovering association rules from
databases of transactions, where each transaction
contains a set of items. The most time consuming
operation in this discovery process is the computation of
the frequencies of the occurrence of subsets of items,
also called candidates, in the database of transactions.
Since usually such transaction-based databases contain a
large number of distinct items, the total number of
candidates is prohibitively large. Hence, current
association rule discovery techniques [AS94, HS95,
SON95, and SA95] try to prune the search space by
requiring a minimum level of support for candidates
under consideration. Support is a measure of the number
of occurrences of the candidates in database transactions.
Apriori [AS94] is a recent state-of-the-art algorithm that aggressively prunes the set of potential candidates of size k by using the following observation: a candidate of size k can meet the minimum level of support only if all of its subsets also meet the minimum level of support. In the k-th iteration, this algorithm
computes the occurrences of potential candidates of size
k in each of the transactions. To do this task efficiently,
the algorithm maintains all potential candidates of size k
in a hash tree. This Algorithm does not require the
transactions to stay in main memory, but requires the
hash trees to stay in main memory. Even with the highly
effective pruning method of Apriori, the task of finding
all association rules can require a lot of computation
power that is available only in parallel computers.
Furthermore, the size of the main memory in the serial
computer puts an upper limit on the number of the
candidates that can be considered in any iteration without
requiring multiple scans of the data. This effectively puts
a lower bound on the minimum level of support imposed
on candidates under consideration. Parallel computers
also offer increased memory to solve this problem.
II. MINING ASSOCIATION RULES
First we introduce some basic concepts of association rules, using the formalism presented in [1]. Let I = {i1, i2, ..., im} be a set of literals, called items. Let D = {t1, t2, ..., tN} be a set of transactions, where each transaction t is a set of items such that t is a subset of I. A transaction has an associated unique identifier called TID. We say that a transaction t contains a set of items X if X is a subset of t. The itemset X has support s in the transaction set D if s% of the transactions in D contain X; we denote this by s = support(X). An association rule is an implication of the


form X => Y, where X and Y are subsets of I and X and Y are disjoint. Each rule has two measures of value, support and confidence. The support of the rule X => Y is support(X U Y). The confidence c of the rule X => Y in the transaction set D means that c% of the transactions in D that contain X also contain Y, which can be written as the ratio support(X U Y) / support(X). The problem of mining association rules is to find all the rules that satisfy a user-specified minimum support and minimum confidence, which can be decomposed into two sub-problems:
1. Find all itemsets that have support above the user-specified minimum support. These itemsets are called the large itemsets.
2. For each large itemset, derive all rules that have more than the user-specified minimum confidence, as follows: for a large itemset X and any Y that is a proper subset of X, if support(X) / support(X - Y) >= minimum confidence, then the rule (X - Y) => Y is derived.
For example, let T1 = {1, 3, 4}, T2 = {1, 2}, T3 = {2, 4}, T4 = {1, 2, 3, 5}, T5 = {1, 3, 5} be the transaction database. Let the minimum support and minimum confidence be 60% and 70% respectively. Then the first step generates the large itemsets {1}, {2}, {3}, {1, 3}. In the second step, the association rules 1 => 3 (support = 60%, confidence = 75%) and 3 => 1 (support = 60%, confidence = 100%) are derived. After finding all large itemsets, association rules are derived in a straightforward manner.
This second sub-problem is not a big issue. However, because of the large scale of the transaction data sets used in database mining, the first sub-problem is nontrivial. Much of the research to date has focused on the first sub-problem. Here we briefly explain the Apriori algorithm for finding all large itemsets, proposed in [2], since the parallel algorithms proposed in this paper are based on this algorithm. Figure 1 gives an overview of the algorithm, using the notation given in Table 1.

Table 1: Notation
k-itemset: an itemset having k items.
Lk: set of large k-itemsets, whose support is larger than the user-specified minimum support.
Ck: set of candidate k-itemsets, which are potentially large itemsets.

Database: Sample

Transaction   Item
183           Apple
167           Orange
129           Mango
193           Grapes
159           Cherry

In the first pass (pass 1), the support count for each item is counted by scanning the transaction database. Hereafter we prepare a field named support count for each itemset, which is used to measure how many times the itemset appears in transactions. Since an itemset here contains just a single item, each item has a support count field. All the items which satisfy the minimum support are picked out. These items are called the large 1-itemset (L1). Here a k-itemset is defined as a set of k items. In the second pass (pass 2), the 2-itemsets are generated using the large 1-itemset; these are called the candidate 2-itemsets (C2). Then the support count of the candidate 2-itemsets is counted by scanning the transaction database. Here the support count of an itemset means the number of transactions which contain the itemset. At the end of scanning the transaction data, the large 2-itemsets (L2) which satisfy the minimum support are determined.

L1 := large 1-itemsets
k := 2
while ( Lk-1 is not empty ) do
  Ck := the candidates of size k generated from Lk-1
  for all transactions t in D
    increment the support count of all candidates in Ck that are contained in t
  Lk := all candidates in Ck which satisfy minimum support
  k := k + 1
end
Answer := union over k of Lk

Figure 1: Apriori algorithm

The following describes the k-th iteration, pass k:
1. Generate candidate itemsets: the candidate k-itemsets (Ck) are generated using the large (k-1)-itemsets (Lk-1) determined in the previous pass (see Section 2.1).
2. Count support: the support count for the candidate k-itemsets is counted by scanning the transaction database.
3. Determine large itemsets: the candidate k-itemsets are checked for whether they satisfy the minimum support; the large k-itemsets (Lk) which satisfy the minimum support are determined.
4. The procedure terminates when the large itemset becomes empty. Otherwise k := k + 1
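A compact Python sketch of the Apriori loop of Figure 1. This is a didactic version only: the hash-tree support counting of the original algorithm is replaced here by straightforward subset tests.

from itertools import combinations

def apriori(transactions, min_support):
    """transactions: list of sets of items; min_support: fraction in [0, 1]."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Pass 1: large 1-itemsets.
    items = {item for t in transactions for item in t}
    large = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

    k = 2
    while large[-1]:
        prev = large[-1]
        # Candidate generation: join large (k-1)-itemsets and keep sets of size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune candidates with an infrequent (k-1)-subset, then count support.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        large.append({c for c in candidates if support(c) >= min_support})
        k += 1

    return [itemset for level in large for itemset in level]

db = [{1, 3, 4}, {1, 2}, {2, 4}, {1, 2, 3, 5}, {1, 3, 5}]
print(apriori(db, 0.6))   # {1}, {2}, {3} and {1, 3}, as in the worked example above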
III. PROBLEM DESCRIPTION
This section is largely based on the description of the problem in [1] and [2]. Formally, the problem can be stated as follows. Let I = {i1, i2, ..., im} be a set of m distinct literals called items. D is a set of variable-length transactions over I. Each transaction contains a set of items {i1, i2, ..., ik} drawn from I. A transaction also has an associated unique identifier called TID. An association rule is an implication of the form X => Y, where X and Y are subsets of I and X and Y are disjoint. X is called the antecedent and Y is called the consequent of the rule. In general, a set of items (such as the antecedent or the consequent of a rule) is called an itemset. The number of items in an itemset is called the length of the itemset. Itemsets of length k are referred to as k-itemsets. For an itemset X.Y, if Y is an m-itemset then Y is called an m-extension of X.
Each itemset has an associated measure of statistical significance called support. For an itemset X contained in I, support(X) = s if the fraction of transactions in D containing X equals s. A rule has a measure of its strength called confidence, defined as the ratio support(X U Y) / support(X). The problem of mining association rules is to generate all rules that have support and confidence greater than some user-specified minimum support and minimum confidence thresholds, respectively. This problem can be decomposed into the following sub-problems:
1. All itemsets that have support above the user-specified minimum support are generated. These itemsets are called the large itemsets. All others are said to be small.
2. For each large itemset, all the rules that have minimum confidence are generated as follows:

Fig 2.1 Transactions and itemsets in the sample database

For a large itemset X and any Y that is a proper subset of X, if support(X) / support(X - Y) >= minimum confidence, then the rule (X - Y) => Y is a valid rule. (In this paper we use the terminology introduced in [1].)
For example, let T1 = {A, B, C}, T2 = {A, B, D}, T3 = {A, D, E} and T4 = {A, B, D} be the only transactions in the database. Let the minimum support and minimum confidence be 0.5 and 0.8 respectively. Then the large itemsets are the following: {A}, {B}, {D}, {A,B}, {A,D} and {A,B,D}. The valid rules are B => A and D => A. The
second sub problem, i.e., generating rules given all large
itemsets and their supports, is relatively straightforward.
However, discovering all large itemsets and their supports is a nontrivial problem if the cardinality of the set of items, |I|, and the database, D, are large. For example, if |I| = m, the number of possible distinct itemsets is 2^m. The problem is to identify which of this large number of itemsets has the minimum support for the given set of transactions. For very small values of m, it is possible to set up 2^m counters, one for each distinct itemset, and count the support for every itemset by scanning the database once. However, for many applications m can be more than 1,000. Clearly, this approach is impractical.
To reduce the combinatorial search space, all algorithms
exploit the following property: any subset of a large
itemset must also be large. Conversely, all extensions of
a small itemset are also small. This property is used by
all existing algorithms for mining association rules as
follows: initially support for all item sets of length 1 (1-
itemsets) are tested by scanning the database. The item
sets that are found to be small are discarded. A set of 2-
itemsets called candidate item sets are generated by
extending the large 1-itemsets generated in the previous
pass by one (l-extensions) and their support is tested by
scanning the database. Itemsets that are found to be large
are again extended by one and their support is tested. In
general, the k-th iteration contains the following steps:
1. The set of candidate k-itemsets is generated by 1-extensions of the large (k-1)-itemsets generated in the previous iteration.
2. Supports for the candidate k-itemsets are generated by a pass over the database.
3. Itemsets that do not have the minimum support are discarded and the remaining itemsets are called large k-itemsets.

This process is repeated until no larger item sets are
found.
IV. PREVIOUS WORK
The problem of generating association rules was first introduced in [1], and an algorithm called AIS was proposed for mining all association rules. An algorithm called SETM was proposed to solve this problem using relational operations. Two new algorithms called Apriori and AprioriTid were then proposed. These algorithms achieved significant improvements over the previous algorithms. The rule generation process was also extended to include multiple items in the consequent, and an efficient algorithm for generating the rules was presented.
The algorithms vary mainly in (a) how the candidate itemsets are generated, and (b) how the supports for the candidate itemsets are counted. In [1], the candidate itemsets are generated on the fly during the pass over the database. For every transaction, candidate itemsets are generated by extending the large itemsets from the previous pass with the items in the transaction, such that the new itemsets are contained in that transaction. In [2], candidate itemsets are generated in a separate step using only the large itemsets from the previous pass. This is performed by joining the set of large itemsets with itself. The resulting candidate set is further pruned to eliminate any itemsets whose subsets are not contained in the previous large itemsets. This technique produces a much smaller candidate set than the former technique.
Supports for the candidate itemsets are determined as follows. For each transaction, the set of all candidate itemsets that are contained in that transaction is identified. The counts for these itemsets are then incremented by one. In [1] the authors do not describe the data structures used for this subset operation. Apriori and AprioriTid differ in the data structures used for generating the supports for candidate itemsets. In Apriori, the candidate itemsets are compared with the transactions to determine if they are contained in the transaction. A hash-tree structure is used to restrict the set of candidate itemsets compared, so that subset testing is optimized. Bitmaps are used in place of transactions to make the testing fast. In AprioriTid, after every pass, an encoding of all the large itemsets contained in a transaction is used in place of the transaction. In the next pass, candidate itemsets are tested for inclusion in a transaction by checking whether the large itemsets used to generate the candidate itemsets are contained in the encoding of the transaction. In Apriori, the subset testing is performed for every transaction in each pass. However, in AprioriTid, if a transaction does not contain any large itemsets in the current pass, that transaction is not considered in subsequent passes. Consequently, in later passes, the size of the encoding can be much smaller than the actual database. A hybrid algorithm is also proposed which uses Apriori for the initial passes and switches to AprioriTid for the later passes.
4.1 Discovering Large Itemsets
Algorithms for discovering large itemsets make multiple passes over the data. In the first pass, we count the support of individual items and determine which of them are large, i.e. have minimum support. In each subsequent pass, we start with a seed set of itemsets found to be large in the previous pass. We use this seed set for generating new potentially large itemsets, called candidate itemsets, and count the actual support for these candidate itemsets during the pass over the data. At the end of the pass, we determine which of the candidate itemsets are actually large, and they become the seed for the next pass. This process continues until no new large itemsets are found.
The Apriori and AprioriTid algorithms we propose differ fundamentally from the AIS [4] and SETM [13] algorithms in terms of which candidate itemsets are counted in a pass and in the way those candidates are generated. In both the AIS and SETM algorithms, candidate itemsets are generated on the fly during the pass as the data is being read. Specifically, after reading a transaction, it is determined which of the itemsets found large in the previous pass are present in the transaction. New candidate itemsets are generated by extending these large itemsets with other items in the transaction. However, as we will see, the disadvantage is that this results in unnecessarily generating and counting too many candidate itemsets that turn out to be small.
The Apriori and AprioriTid algorithms generate the candidate itemsets to be counted in a pass by using only the itemsets found large in the previous pass, without considering the transactions in the database. The basic intuition is that any subset of a large itemset must be large. Therefore, the candidate itemsets having k items can be generated by joining large itemsets having k-1 items, and deleting those that contain any subset that is not large. This procedure results in the generation of a much smaller number of candidate itemsets. The AprioriTid algorithm has the additional property that the database is not used at all for counting the support of candidate itemsets after the first pass. Rather, an encoding of the candidate itemsets used in the previous pass is employed for this purpose. In later passes, the size of this encoding can become much smaller than the database, thus saving much reading effort. We will explain these points in more detail when we describe the algorithms.
Notation: We assume that the items in each transaction are kept sorted in lexicographic order. It is straightforward to adapt these algorithms to the case where the database D is kept normalized and each database record is a <TID, item> pair, where TID is the identifier of the corresponding transaction. We call the number of items in an itemset its size, and call an itemset of size k a k-itemset. Items within an itemset are kept in lexicographic order. We use the notation c[1], c[2], ..., c[k] to represent a k-itemset c consisting of items c[1], c[2], ..., c[k], where c[1] < c[2] < ... < c[k]. If c = X.Y and Y is an m-itemset, we also call Y an m-extension of X. Associated with each itemset is a count field to store the support for this itemset. The count field is initialized to zero when the itemset is first created. We summarize in Table 1 the notation used in the algorithms. The set Ck is used by AprioriTid and will be further discussed when we describe this algorithm.
4.2 Subset Function
Candidate item sets Ck are stored in a hash-tree. A
node of the hash-tree either contains a list of item sets (a
leaf node) or a hash table (an interior node). In an interior
node, each bucket of the hash table points to another
node. The root of the hash-tree is defined to be at depth
1. An interior node at depth d points to nodes at depth
d+1. Itemsets are stored in the leaves. When we add an
itemset c, we start from the root and go down the tree until we reach a leaf. At an interior node at depth d, we decide which branch to follow by applying a hash function to the d-th item of the itemset. All nodes are initially created as leaf nodes. When the number of itemsets in a leaf node exceeds a specified threshold, the leaf node is converted to an interior node. Starting from the root node, the subset function finds all the candidates contained in a transaction t as follows. If we are at a leaf, we find which of the itemsets in the leaf are contained in t and add references to them to the answer set. If we are at an interior node and we have reached it by hashing on the item i, we hash on each item that comes after i in t and recursively apply this procedure to the node in the corresponding bucket. For the root node, we hash on every item in t. To see why the subset function returns the desired set of references, consider what happens at the root node. For any itemset c contained in transaction t, the first item of c must be in t. At the root, by hashing on every item in t, we ensure that we only ignore itemsets that start with an item not in t. Similar arguments apply at lower depths. The only additional factor is that, since the items in any itemset are ordered, if we reach the current node by hashing on the item i, we only need to consider the items in t that occur after i.
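A simplified Python sketch of the hash-tree idea described above. Leaf splitting and depth handling are reduced to the essentials, and the bucket count and leaf capacity are arbitrary; this is an illustration, not the paper's data structure.

class HashTreeNode:
    def __init__(self, depth=0, leaf_capacity=3, buckets=5):
        self.depth, self.capacity, self.buckets = depth, leaf_capacity, buckets
        self.itemsets = []      # used while the node is a leaf
        self.children = None    # bucket -> child node, once converted to interior

    def _hash(self, item):
        return hash(item) % self.buckets

    def _child_for(self, itemset):
        b = self._hash(itemset[self.depth])
        if b not in self.children:
            self.children[b] = HashTreeNode(self.depth + 1, self.capacity, self.buckets)
        return self.children[b]

    def insert(self, itemset):                      # itemset: sorted tuple of items
        if self.children is None:
            self.itemsets.append(itemset)
            if len(self.itemsets) > self.capacity and self.depth < len(itemset):
                self.children = {}
                for stored in self.itemsets:        # convert the leaf to an interior node
                    self._child_for(stored).insert(stored)
                self.itemsets = []
        else:
            self._child_for(itemset).insert(itemset)

    def candidates_in(self, transaction, found=None, remaining=None):
        """Collect every stored itemset that is contained in the (sorted) transaction."""
        if found is None:
            found, remaining = set(), transaction
        if self.children is None:
            found.update(c for c in self.itemsets if set(c) <= set(transaction))
        else:
            for i, item in enumerate(remaining):
                b = self._hash(item)
                if b in self.children:
                    # Only items after the one we hashed on are used for deeper hashing.
                    self.children[b].candidates_in(transaction, found, remaining[i + 1:])
        return found

tree = HashTreeNode()
for c in [(1, 2), (1, 3), (2, 3), (2, 4), (3, 5)]:
    tree.insert(c)
print(tree.candidates_in((1, 2, 3, 5)))   # {(1, 2), (1, 3), (2, 3), (3, 5)}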
4.3 Data Structures
We assign each candidate itemset a unique number, called its ID. Each set of candidate itemsets Ck is kept in an array indexed by the IDs of the itemsets in Ck. A member of Ck is now of the form <TID, set-of-itemsets>. Each Ck is stored in a sequential structure. The apriori-gen function generates a candidate k-itemset ck by joining two large (k-1)-itemsets. We maintain two additional fields for each candidate itemset: (i) generators and (ii) extensions. The generators field of a candidate itemset ck stores the IDs of the two large (k-1)-itemsets whose join generated ck. The extensions field of an itemset ck stores the IDs of all the (k+1)-candidates that are extensions of ck. Thus, when a candidate ck is generated by joining two large (k-1)-itemsets l1 and l2, we save the IDs of l1 and l2 in the generators field for ck. At the same time, the ID of ck is added to the extensions field of l1. We now describe how Step 7 of Figure 2 is implemented using the above data structures. Recall that the set-of-itemsets field of an entry t in Ck-1 gives the IDs of all (k-1)-candidates contained in transaction t.TID. For each such candidate ck-1, the extensions field gives Tk, the set of IDs of all the candidate k-itemsets that are extensions of ck-1. For each ck in Tk, the generators field gives the IDs of the two itemsets that generated ck. If these itemsets are present in the entry for t.set-of-itemsets, we can conclude that ck is present in transaction t.TID, and add ck to Ct.
V. ALGORITHMS
The problem of discovering generalized association rules can be decomposed into three parts, described below.

Rule #   Rule
1        Clothes => Footwear
2        Outerwear => Footwear
3        Jackets => Footwear

Itemset     Support
Apple       2
Orange      3
Mango       1
Grapes      2
Cherry      4
Avocado     3
Banana      5
Pineapple   3

Table 2: Example - Support

1. Find all sets of items (itemsets) whose support is greater than the user-specified minimum support. Itemsets with minimum support are called frequent itemsets.
2. Use the frequent itemsets to generate the desired rules. The general idea is that if, say, ABCD and AB are frequent itemsets, then we can determine if the rule AB => CD holds by computing the ratio conf = support(ABCD)/support(AB). If conf >= minconf, then the rule holds. (The rule will have minimum support because ABCD is frequent.)
3. Prune all uninteresting rules from this set.
In the rest of this section, we look at algorithms for finding all frequent itemsets where the items can be from any level of the taxonomy. Given the frequent itemsets, the algorithm in [1] [2] can be used to generate rules. We first describe the obvious approach for finding frequent itemsets, and then present our two algorithms.
5.1 Algorithm Basic
Consider the problem of deciding whether a transaction T supports an itemset X. If we take the raw transaction, this involves checking for each item x in X whether x or some descendant of x is present in the transaction. The task becomes much simpler if we first add all the ancestors of each item in T to T; let us call this extended transaction T'. Now T supports X if and only if T' is a superset of X. Hence a straightforward way to find generalized association rules would be to run any of the algorithms for finding association rules [1] [2] [5] [6] [7] on the extended transactions. We discuss below the generalization of the Apriori algorithm given in [2]. The algorithm below gives an overview, using the notation introduced earlier (a k-itemset is an itemset having k items). The first pass of the algorithm simply counts item occurrences to determine the frequent 1-itemsets. Note that items in the itemsets can come from the leaves of the taxonomy or from interior nodes. A subsequent pass, say pass k, consists of two phases, described after the Basic algorithm below. (In earlier papers [1] [2], itemsets with minimum support were called large itemsets. However, some readers associated "large" with the number of items in the itemset rather than its support, so we switch the terminology to frequent itemsets.)

Mining algorithm based on support:

Rule         Support   Transaction
Import       84%       77.2%
Export       74%       84.6%
Storages     93%       77.2%
Collations   98%       84.6%

Table 3: Mining algorithm based on support

L1 := {frequent 1-itemsets};
k := 2;   // k represents the pass number
while ( Lk-1 is not empty ) do
begin
  Ck := New candidates of size k generated from Lk-1;
  forall transactions t in D do
  begin
    Add all ancestors of each item in t to t, removing any duplicates;
    Increment the count of all candidates in Ck that are contained in t;
  end
  Lk := All candidates in Ck with minimum support;
  k := k + 1;
end
Answer := union over k of Lk;

Algorithm Basic

In each pass k, the frequent itemsets Lk-1 found in the (k-1)th pass are used to generate the candidate itemsets Ck, using the apriori candidate generation function described in the next paragraph. Next, the
database is scanned and the support of the candidates in Ck is counted. For fast counting, we need to efficiently determine the candidates in Ck that are contained in a given transaction t. We reuse the hash-tree data structure described in [2] for this purpose.
Candidate Generation. Given Lk-1, the set of all frequent (k-1)-itemsets, we want to generate a superset of the set of all frequent k-itemsets. Candidates may include leaf-level items as well as interior nodes in the taxonomy. The intuition behind this procedure is that if an itemset X has minimum support, so do all subsets of X. For simplicity, assume the items in each itemset are kept sorted in lexicographic order. First, in the join step, we join Lk-1 with Lk-1:

insert into Ck
select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1;

Next, in the prune step, we delete all itemsets c in Ck such that some (k-1)-subset of c is not in Lk-1.
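The join and prune steps just described can be written compactly in Python; a sketch assuming itemsets are kept as sorted tuples (illustrative and equivalent in spirit to the SQL formulation above, not the paper's implementation):

from itertools import combinations

def apriori_gen(large_prev):
    """large_prev: set of frequent (k-1)-itemsets, each a sorted tuple. Returns Ck."""
    k_minus_1 = len(next(iter(large_prev)))
    candidates = set()

    # Join step: p and q share the first k-2 items and p[k-2] < q[k-2].
    for p in large_prev:
        for q in large_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))

    # Prune step: drop candidates that have an infrequent (k-1)-subset.
    return {c for c in candidates
            if all(s in large_prev for s in combinations(c, k_minus_1))}

L2 = {(1, 2), (1, 3), (1, 5), (2, 3)}
print(apriori_gen(L2))   # {(1, 2, 3)}: (1, 2, 5) and (1, 3, 5) are pruned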
5.2 Algorithm Cumulate
We add several optimizations to the Basic algorithm to develop the algorithm "Cumulate". The name indicates that all itemsets of a certain size are counted in one pass, unlike the Stratify algorithm described in the next subsection.
1. Filtering the ancestors added to transactions. We do not have to add all ancestors of the items in a transaction t to t. Instead, we just need to add those ancestors that are in one (or more) of the candidate itemsets being counted in the current pass. In fact, if the original item is not in any of the itemsets, it can be dropped from the transaction.
For example, assume the parent of Jacket is Outerwear, and the parent of Outerwear is Clothes. Let {Clothes, Shoes} be the only itemset being counted. Then, in any transaction containing Jacket, we simply replace Jacket by Clothes. We do not need to keep Jacket in the transaction, nor do we need to add Outerwear to the transaction.
2. Pre-computing ancestors. Rather than finding
ancestors for each item by traversing the taxonomy
graph, we can pre-compute the ancestors for each item.
We can drop ancestors that are not present in any of the
candidates at the same time.
3. Pruning itemsets containing an item and its ancestor. We first present two lemmas to justify this optimization. Lemma 1: The support of an itemset X that contains both an item x and its ancestor x' will be the same as the support of the itemset X - {x'}. Lemma 2: If Lk, the set of frequent k-itemsets, does not include any itemset that contains both an item and its ancestor, the set of candidates Ck+1 generated by the candidate generation procedure described above will not include any itemset that contains both an item and its ancestor. Proofs of these lemmas are given in [9]. Lemma 1 shows that we need not count any itemset which contains both an item and its ancestor. The pseudocode of the resulting algorithm, Cumulate, is given below; its first step computes T*, the set of ancestors of each item, from the taxonomy T.
Compute T*, the set of ancestors of each item, from T;   // Optimization 2
L1 := {frequent 1-itemsets};
k := 2;   // k represents the pass number
while ( Lk-1 ≠ ∅ ) do
begin
    Ck := New candidates of size k generated from Lk-1;
    if (k = 2) then
        Delete any candidate in C2 that consists of an item and its ancestor;   // Optimization 3
    Delete any ancestors in T* that are not present in any of the candidates in Ck;   // Optimization 1
    forall transactions t ∈ D do
    begin
        foreach item x ∈ t do
            Add all ancestors of x in T* to t;
        Remove any duplicates from t;
        Increment the count of all candidates in Ck that are contained in t;
    end
    Lk := All candidates in Ck with minimum support;
    k := k + 1;
end
Answer := ⋃k Lk;

Fig 5.1 Optimization of sample database on data sets
Algorithm Cumulate implements this optimization by pruning the candidate itemsets of size two which consist of an item and its ancestor. Lemma 2 shows that pruning these candidates is sufficient to ensure that we never generate candidates in subsequent passes which contain both an item and its ancestor. The pseudocode above (Figure 6) gives an overview of the Cumulate algorithm.
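As a hedged illustration of optimizations 1 and 2, the sketch below (illustrative Python with hypothetical data; not the authors' implementation) pre-computes the ancestors of every item, keeps only those ancestors that occur in some candidate of the current pass, and then counts candidate support over the extended transactions.

from collections import defaultdict

def precompute_ancestors(taxonomy):
    """Optimization 2: ancestor closure for each item (child -> parent map)."""
    closure = {}
    for item in taxonomy:
        anc, node = set(), item
        while node in taxonomy:
            node = taxonomy[node]
            anc.add(node)
        closure[item] = anc
    return closure

def count_pass(transactions, candidates, closure):
    """One counting pass of a Cumulate-style algorithm over the database."""
    # Optimization 1: only ancestors appearing in some candidate are useful.
    useful = set().union(*candidates) if candidates else set()
    filtered = {i: a & useful for i, a in closure.items()}

    counts = defaultdict(int)
    for t in transactions:
        extended = set(t)
        for item in t:
            extended |= filtered.get(item, set())
        for c in candidates:
            if c <= extended:
                counts[frozenset(c)] += 1
    return counts

taxonomy = {"Jacket": "Outerwear", "Outerwear": "Clothes"}
transactions = [["Jacket", "Shoes"], ["Outerwear", "Shoes"], ["Shoes"]]
candidates = [frozenset({"Clothes", "Shoes"})]
print(count_pass(transactions, candidates, precompute_ancestors(taxonomy)))
# {Clothes, Shoes} is counted in 2 of the 3 transactions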
5.3 Stratification
We motivate this algorithm with an example. Let
{Clothes, Shoes}, {Outerwear, Shoes) and {Jacket,
Shoes} be candidate itemsets to be counted, with
Jacket being the child of Outerwear, and
Outerwear the child of Clothes. If {Clothes, Shoes}
does not have minimum support, we do not have to count
either {Outerwear, Shoes} or {Jacket, Shoes}. Thus,
rather than counting all candidates of a given size in the
same pass as in Cumulate, it may be faster to first count
the support of {Clothes, Shoes}, then count {Outerwear,
Shoes} if {Clothes, Shoes} turns out to have minimum
support, and finally count {Jacket, Shoes} if (Outerwear,
Shoes) also has minimum support. Of course, the extra
cost in making multiple passes over the database may be
more than the benefit of counting fewer itemsets. We
will discuss this tradeoff in more detail shortly. We


develop this algorithm by first presenting the straightforward version, Stratify, and then describe refinements that use sampling to increase its effectiveness (the Estimate and Est Merge versions). The optimizations we introduced for the Cumulate algorithm apply to this algorithm as well.
5.4 Stratify
Consider the partial ordering induced by the taxonomy DAG on a set of itemsets. Itemsets with no parents are considered to be at depth 0. For other itemsets, the depth of an itemset X is defined to be depth(X) = max({depth(X') | X' is a parent of X}) + 1. We first count all itemsets C0 at depth 0. After deleting candidates that are descendants of those itemsets in C0 that did not have minimum support, we count the remaining itemsets at depth 1 (C1). After deleting candidates that are descendants of the itemsets in C1 without minimum support, we count the itemsets at depth 2, etc. If there are only a few candidates at depth n, we can count candidates at different depths (n, n+1, ...) together to reduce the overhead of making multiple passes.
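The depth ordering used by Stratify can be computed directly from the parent relation on candidate itemsets. The sketch below is an illustrative Python version under the simplifying assumption that a parent of a candidate is obtained by replacing a single item with its taxonomy parent; candidates with no parent among the candidates get depth 0, and the rest get one more than the maximum depth of their parents.

def itemset_parents(candidate, candidates, taxonomy):
    """Candidates reachable by replacing one item with its taxonomy parent."""
    parents = []
    for item in candidate:
        if item in taxonomy:
            generalized = frozenset(candidate - {item} | {taxonomy[item]})
            if generalized in candidates:
                parents.append(generalized)
    return parents

def stratify_depths(candidates, taxonomy):
    """depth(X) = 0 if X has no parent candidate, else max(depth(parent)) + 1."""
    depths = {}
    def depth(c):
        if c not in depths:
            ps = itemset_parents(c, candidates, taxonomy)
            depths[c] = 0 if not ps else max(depth(p) for p in ps) + 1
        return depths[c]
    for c in candidates:
        depth(c)
    return depths

taxonomy = {"Jacket": "Outerwear", "Outerwear": "Clothes"}
cands = {frozenset({"Clothes", "Shoes"}),
         frozenset({"Outerwear", "Shoes"}),
         frozenset({"Jacket", "Shoes"})}
print(stratify_depths(cands, taxonomy))
# {Clothes, Shoes} -> 0, {Outerwear, Shoes} -> 1, {Jacket, Shoes} -> 2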
There is a tradeoff between the number of itemsets
counted (CPU time) and the number of passes over the
database (IO+CPU time). One extreme would be to make
a pass over the database for the candidates at each depth.
This would result in a minimal number of itemsets being
counted, but we may waste a lot of time in scanning the
database multiple times. The other extreme would be to
make just one pass for all the candidates, which is what
Cumulate does. This would result in counting many
itemsets that do not have minimum support and whose
parents do not have minimum support. In our
implementation, we used the heuristic (empirically
determined) that we should count at least 20% of the
candidates in each pass.
VI EXPERIMENTS
In previous sections we have described the new
algorithm and given some analysis to show that we
expect it to be efficient. We implemented the algorithm
to run in main memory and read a file of transactions.
The execution times given are for running the algorithm
on the IBM RISC System/6000 350 with a clock speed of
41.1 MHz. In [4], a data set was used that consists of
sales data obtained from a large retailing company with a
total of 46,873 customer transactions. The experiments
were conducted using this data set.
6.1 Variation of relationships
We first study how the size of the Ri (trans-id and items) relation varies with each iteration of algorithm SETM. In Figure 5 we show the variation in the size (in Kbytes) of Ri with iteration i for the retailing data set. Curves are shown for different values of minimum support, where minimum support is varied from 0.1% to 5%. The maximum size of the rules is 3; hence in all cases |R4| = 0 (with |Ri| denoting the cardinality of Ri). Also, the starting relations are the same and hence |R1| = 115,568 in all cases. If the minimum support is small enough (<= 0.1%), the size of relation Ri can first increase and then decrease, but the general trend is that the size of relation Ri decreases. For large values of minimum support, |Ri| decreases quite rapidly from the first iteration to the second. This sharp decrease is delayed somewhat for the smaller values of minimum support. Hence, using small values of minimum support allows us to obtain more rules. In general, it also allows us to obtain rules with more items in the antecedent. For example, if the minimum support is reduced to 0.05%, we obtain rules with 3 items in the antecedent. We expect the Ci (count) relations to be small enough to fit in memory. We now study how the cardinality |Ci| of these relations varies with iteration number. Figure 6 shows curves for different values of minimum support. The values of |Ci| measure the number of item combinations that could garner enough support. We observe that for small values of minimum support the value of |Ci| increases initially before decreasing with later iterations. Since |Ci| is a measure of how many rules can possibly be generated, we again see the importance of handling small values of minimum support. Also, the starting relations are the same and hence |C1| = 59 for all minimum support values.




Fig 6.1 Relationship of Sample database on Multidimensional
Database

6.2 Execution times
We measured the execution times of our set-oriented algorithm SETM for various values of the minimum support. We varied the minimum support from 0.1% to 5%. The execution times are shown in the following table.

Minimum Support (%)    Execution Time (seconds)
0.1                    6.90
0.5                    5.30
1                      4.64
2                      4.22
5                      3.97

Figure 6.1: Size of relation Ri
We see that algorithm SETM is very stable: the execution time varies from about 7 seconds at 0.1% minimum support to about 4 seconds at 5% minimum support. In this paper, we have investigated a set-oriented approach to mining association rules. We have shown that by following a set-oriented methodology, we arrived at a simple algorithm. The algorithm is straightforward, its basic steps being sorting and merge-scan join, and it could be implemented easily in a relational database system.
database system. The major contribution of this paper is
that it shows that at least some aspects of data mining can
be carried out by using general query languages such as
SQL, rather than by developing specialized black box
algorithms.



Fig 6.2 Execution Time of Sample database on Transactions

6.3 Iteration Number
The algorithm exhibits good performance and stable
behavior, with execution time almost insensitive to the
chosen minimum support. For a real-life data set,
execution times are on the order of 4-7 seconds. The
simple and clean form of our algorithm makes it easily
extensible and facilitates integration into an (interactive)
data mining system. We are investigating extending the
algorithm in order to handle additional kinds of mining,
e.g., relating association rules to customer classes.



Figure 6.3: Cardinality of Ci



Fig 6.4: Iteration Number
VII SUMMARY
We introduced the problem of mining Parallel
Association Rule for Data Mining in Multidimensional
Databases. Given a large database of customer
transactions, where each transaction consists of a set of
items, and a taxonomy (is-a hierarchy) on the items, we
find associations between items at any level of the
taxonomy. Earlier work on association rules did not
consider the presence of taxonomies, and restricted the
items in the association rules to the leaf-level items in the
taxonomy. An obvious solution to the problem is to
replace each transaction with an extended transaction
that contains all the items in the original transaction as
well as all the ancestors of each item in the original
transaction. We could then run any of the earlier
algorithms for mining association rules on these extended
transactions to get generalized association rules.
However, this Basic approach is not very fast. We
presented two new algorithms, Cumulate and Est Merge.
Empirical evaluation showed that these two algorithms
run 2 to 5 times faster than Basic; for one real-life
dataset, the performance gap was more than 100 times.
Between the two algorithms, Est Merge performs
somewhat better than Cumulate, with the performance
gap increasing as the size of the database increases. Both
Est Merge and Cumulate exhibit linear scale-up with the
number of transactions. A problem users experience in
applying association rules to real problems is that many
uninteresting or redundant rules are generated along with


the interesting rules. We developed a new interest
measure that uses the taxonomy information to prune
redundant rules.
The intuition behind this measure is that if the
support and confidence of a rule are close to their
expected values based on an ancestor of the rule, the rule
can be considered redundant. This measure was able to
prune 40% to 60% of the rules on two real-life datasets.
In contrast, an interest measure based on statistical
significance that did not use taxonomies was not able to
prune even 1% of the rules.
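As a hedged illustration of this pruning idea (the exact interest measure is defined in the underlying work and is not reproduced here), a rule can be flagged as redundant when its observed support is within a factor R of the support expected from an ancestor rule, scaled by the relative frequencies of the specialized items:

def expected_support(ancestor_rule_support, item_supports, ancestor_item_supports):
    """Support expected for a specialized rule, scaled down from its ancestor.

    item_supports / ancestor_item_supports: per-item supports of the
    specialized items and of the corresponding ancestor items.
    """
    expected = ancestor_rule_support
    for child, parent in zip(item_supports, ancestor_item_supports):
        expected *= child / parent
    return expected

def is_redundant(rule_support, ancestor_rule_support,
                 item_supports, ancestor_item_supports, R=1.1):
    """Redundant if observed support is within a factor R of the expectation."""
    exp = expected_support(ancestor_rule_support, item_supports, ancestor_item_supports)
    return rule_support < R * exp

# Example with hypothetical numbers: rule Jacket -> Shoes vs ancestor Outerwear -> Shoes
print(is_redundant(rule_support=0.04, ancestor_rule_support=0.10,
                   item_supports=[0.05], ancestor_item_supports=[0.12], R=1.1))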
VIII. ACKNOWLEDGMENT
This paper is proposed on the basis of the future work of other existing systems, and any new research ideas arising from it are warmly welcomed. We would also like to thank the anonymous reviewers for their valuable and constructive comments.

REFERENCES

[1.] A.Savasere, E. Omiecinski, and S. Navathe, An
Efficient Algorithm for Mining Association Rules in
Large Databases, Proc. 21st Very Large Data Bases
Conf., pp. 432-443, 1995.
[2.] C.H. Papadimitriou and K. Steiglitz, Combinatorial
Optimization: Algorithms and Complexity.
Englewood Cliffs, NJ: Prentice-Hall, 1982.
[3.] E.H. Han, G. Karypis, and V. Kumar, Scalable
Parallel Data Mining for Association Rules, Proc.
1997 ACM-SIGMOD Int'l Conf. Management of
Data, 1997.
[4.] G. Piatetsky-Shapiro. Discovery, Analysis, and
Presentation of Strong Rules. In Knowledge
Discovery in Databases, pages 229-248. AAAI/MIT
Press, 1991.
[5.] J. Pearl, editor. Probabilistic Reasoning in Intelligent
Systems: Networks of Plausible Inference. Morgan
Kaufman, 1992.
[6.] J.S. Park, M.S. Chen, and P.S. Yu, Efficient Parallel
Data Mining for Association Rules, Proc. Fourth
Int'l Conf. Information and Knowledge
Management, 1995.
[7.] J. Han, Y. Cai, and N. Cercone. Knowledge
discovery in databases: An attribute oriented
approach. In Proc. of the VLDB Conference, pages 547-559, Vancouver, British Columbia, Canada,
1992.
[8.] M. Holsheimer and A. Siebes. Data mining: The
search for knowledge in databases. Technical Report
CS-R9406, CWI, Netherlands, 1994.
[9.] M. Houtsma and A. Swami. Set-oriented mining of
association rules. Research Report RJ 9567, IBM
Almaden Research Center, San Jose, California,
October 1993.
[10.] N. Alon and J. H. Spencer. The Probabilistic Method. John Wiley Inc., New York, 1992.
[11.] R. Agrawal and R. Srikant. Fast algorithms for
mining association rules. In Proc. of the VLDB
Conference, Santiago, Chile, September 1994.
Expanded version available as IBM Research Report
RJ9839, June 1994.
[12.] R. Agrawal and J.C. Shafer, Parallel Mining of
Association Rules, IEEE Trans. Knowledge and
Data Eng., vol. 8, no. 6, pp. 962-969, Dec. 1996.
[13.] R. Agrawal, T. Imielinski, and A. Swami, Mining
Association Rules Between Sets of Items in Large
Databases, Proc. 1993 ACMSIGMOD Int'l Conf.
Management of Data, 1993.
[14.] T. Hagerup and C. Rüb. A guided tour of Chernoff bounds. Information Processing Letters, 33:305-308, 1989/90.
[15.] T. Shintani and M. Kitsuregawa, Consideration on Parallelization of Database Mining, In Institute of Electronics, Information and Communication Engineering Japan (SIG-CPSY95-88), Technical Report, Vol. 95, No. 47, pp. 57-62, December 1995.
[16.] T. Shintani and M. Kitsuregawa, Hash Based
Parallel Algorithms for Mining Association Rules,
Proc. Conf. Parallel and Distributed Information
Systems, 1996.
[17.] V. Kumar, A. Grama, A. Gupta, and G. Karypis, Introduction to Parallel Computing: Algorithm Design and Analysis. Redwood City: Benjamin Cummings/Addison Wesley, 1994.




TAXONOMIES, CHALLENGES AND APPROACHES TO AUTOMATIC
WEB QUERY CLASSIFICATION

Mohammad Shahid
Deptt. Of Comp.Sci
Singhania University,Raj. (India)


ABSTRACT-Users with an information need start with the
search engines by submitting their queries. The search engine
tries to retrieve the documents most relevant to the queries
submitted by the users. On the economic front, personalisation of the search result page is inevitable. Web query classification is an area of research which tries to address these problems. But web query classification has a lot of
problems inherent in the characteristics of the queries. So
researchers have tried various ways of classifying queries and
have come up with various solutions catering to the various
problems in web query classification. This paper is an
attempt to summarise the three dimensions in web query
classification. They are the characteristics of the query, the
various web query taxonomies proposed and the techniques
employed by the researchers who embark on the automatic
classification of web queries.
Keywords : Information Retrieval; Query Classification;
Taxonomies; Web Knowledge; Query log; Machine Learning;
Natural Language Processing; Clustering

I. INTRODUCTION
The starting point for 88% of the users of the internet is
the search engine. Users are familiar with the natural
language required to type in their requirement query and
not the way the search engine treats their query. The
search engine on the other hand attempts to retrieve a
set of web documents with related advertisements on the
search engine result page. The goal is well achieved if
the retrieved documents intersect perfectly with the
relevant documents and if the advertisements are relevant
to the user. Automatic classification of web queries is a
research undertaken to satisfy the above objectives.

II. CHARACTERISTICS OF A WEB QUERY,
SEARCH ENGINE USER
To aid in the automatic classification of web queries,
knowledge of the characteristics of the web queries is a
necessity. The behaviour of the search engine user after
submitting the query was also studied. This section
summarises some of the findings about the properties of a
web query and the behaviour of the user. The queries are
invariably short with the mean number of terms per query

Proc. of the Intl. Conf. on Computer Applications
Volume 1. Copyright 2012 Techno Forum Group, India.
ISBN: 978-81-920575-8-3 :: doi: 10. 73368/ISBN_0768
ACM #: dber.imera.10. 73368
being 2.6. Single term queries constitute 48.4% of the
queries submitted. The query terms may be words, names
of persons or locations, URLs, operators, empty terms,
special acronyms, program segments or malicious code.
Most of the queries are rare though the web query
vocabulary is a much bigger superset of the natural
language vocabulary. The usage of non-English and
unclassifiable queries is found to be on the rise with the
number getting tripled between 1997 and 2002. The
queries exhibit the property of polysemy with around 30%
queries having more than 3 possible meanings. The
meaning of queries evolves over time. A word which is
unheard of today would be a reality tomorrow with the web
users searching for the same. [1][2][3]. The way users
modify queries and view results has also been studied. The mean number of pages viewed after submitting a query is 1.7, and in 85% of cases only the first result screen is viewed. Users are becoming less tolerant of providing feedback and their level of interactivity is found to be on the decline [1][2]. It was
observed that 12% of the users modified queries by adding
or removing terms and queries were totally modified
35.2% of the times[4]. With the session length used in the
analysis not being obvious, it is difficult to conclude
if the queries were modified because satisfactory
answers were not returned or they started searching on an
entirely new topic. These above mentioned characteristics
have posed major challenges in the automatic classification
of web queries.

III. A TAXONOMY ON WEB QUERIES
Automatic classification of web queries has seen many taxonomies being proposed, based on user surveys and log
analysis. Broder [5] and Rose et.al [6] classified queries
based on user intent and Gravano et.al [8] saw queries
falling into the category of global and local queries.
KDDCUP 2005 gave a set of 67 two level target categories
which many researchers tried classifying the topical
queries into. In the absence of a clear cut target category
for topical queries many researchers have proposed their
own topical categories to check their solution with, while
some have opted clustering their queries.
A. User Intent based Taxonomy
Three major works which built up a taxonomy based on
user intent are discussed here. Broder classified queries
based on the user intent using user survey and transaction
log analysis. He found that on an average 43.5% queries
were informational (topic relevant tasks), 22.25% were
navigational (homepage finding tasks) and 33% were
transactional (service finding tasks) queries [5]. Rose et.al
used the Altavista query log and human editors and
classified queries into 3 broad categories with some sub-
categories. They found that on an average approximately
61.93%, 13.5% and 24.56% of queries were information,
navigational and resource respectively. The type resource could be mapped to Broder's transactional queries, and the deviation in the percentages found between Broder's work and Rose et.al's work was attributed to Rose et.al looking at only the session-initial queries. Rose et.al sub-categorised informational queries as
directed if a user wants to know something about a topic. If
a user wanted to know the full information about a topic
they were taken as undirected queries. Directed queries
were sub-categorised as closed and open based on whether
a single clear answer or answers with unconstrained depth
was expected. Queries with an intention to get advice,
ideas, suggestions and instructions were categorised as
informational queries under the sub-category advice.
Queries that want to locate a service or product rather than
know what the product or service is all about are locate
type informational queries. List type informational queries
constituted queries expecting a list of valuable websites for
a topic. Resources were sub-categorised as download,
entertainment, interact and obtain based on the intent[6]. In
the extensive research done by Jansen et.al on hierarchical
user intent taxonomies, navigational queries were further
sub-categorised into navigation to a
transactional site and navigation to an informational
site [7].
B. Geographical Locality based Taxonomy
Gravano et.al used the Excite query log to determine
that more relevant pages could be returned if queries could
be differentiated as global and local queries based on the
geographical locality. Using the rule based classifier
RIPPER, log-linear regression, C4.5 and SVM with a
linear and Gaussian kernel with equal and unequal error
costs they tried classifying the queries into global and local
queries. The maximum accuracy was achieved with log
linear regression and the maximum F-measure was attained
with linear SVM with unequal error costs. To achieve
unequal error costs, a false local assignment was given
twice the penalty of a false global assignment [8].
C. Query Ambiguity based Taxonomy
Word Sense Disambiguation is inevitable for effective
document retrieval [9]. 16% of all queries are ambiguous
and a work on resolving ambiguity saw 3 new taxonomies
being proposed. Ambiguity was resolved by transforming
it to a query classification problem with queries being
classified as ambiguous, broad and clear. SVM with RBF
kernel was used to solve the query ambiguity model [10].
D. Faceted Query Classification
With too much importance placed on high level user
goals, a recent attempt decided to create a broader impact
by classifying queries into primitives corresponding to
facets. Queries were classified as ambiguous, authority
sensitive, temporal sensitive and spatial sensitive.
Ambiguity is similar to the one suggested by Song et.
al[10] and spatial sensitivity is similar to Gravano et.als [8]
classification. Authority sensitivity looks at the level of
authority and trust required in the answer. A query is said
to be temporally sensitive if the answer differs according to
the time
period at which the query is posted. To maintain
uniformity with previous work, the categories were
mapped into the categories given by Rose et.al [11].

IV. AUTOMATIC WEB QUERY
CLASSIFICATION
Web queries can be classified manually as is done in
Yahoo and OpenDirectory [12] or using automated
techniques. To achieve accuracy in classification, human
editors have been involved in the classification of web
documents into directories. But the magnitude at which the
web grows makes manual classification obsolete. So many
attempts have been undertaken to automatically classify
queries. While classifying queries, they have been
modelled as a bag of words or as an expression where the
syntactic position is considered. Vector representation of
the query is commonly used due to the ease with which
similarity between vectors can be calculated and for the
easy application of the machine learning techniques. Many
similarity functions like cosine similarity, Jaccard similarity and edit distance have been used. Preprocessing of the queries is inevitable given the amount of noise inherent in user queries; stemming, lemmatization and stop-word removal are the commonly performed preprocessing steps. Classification techniques can be classified
into post-retrieval and preretrieval methods based on
whether the query is classified before or after the search
result is returned. Based on the technique employed, they
can be classified as follows.
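As an illustration of the vector representation and similarity computation mentioned above, a minimal sketch (illustrative Python; a real system would add stemming, a proper stop-word list and TF-IDF weighting) might look as follows:

import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "in", "for", "to"}   # tiny illustrative list

def to_vector(query):
    """Bag-of-words vector for a query after lowercasing and stop-word removal."""
    terms = [t for t in query.lower().split() if t not in STOP_WORDS]
    return Counter(terms)

def cosine_similarity(q1, q2):
    v1, v2 = to_vector(q1), to_vector(q2)
    dot = sum(v1[t] * v2[t] for t in v1)
    norm = math.sqrt(sum(c * c for c in v1.values())) * \
           math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity("cheap flights to delhi", "flights delhi cheap tickets"))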
A. Using Search Result and Web Knowledge
To solve the classification problem researchers have
used various types of information available in the web.
Some of the web knowledge commonly employed for
classification is the directory services, Wordnet and
thesaurus. Online resources like Wordnet and thesaurus
have been extensively used for query expansion and
modification.
The search results list returned by the search engines
for the queries posted has been seen as a pseudo-relevance
feedback. The returned top documents can be treated
individually or the whole result page can be treated as one
meta-data for feature generation. But it was found that
treating the returned results as individuals gave better
results than considering the result page as one document
[13]. Then the option of using only the summaries returned
or using the entire document content was considered and it
was found that using only the query summary returned in
the search result outperforms using the entire text of the
document[14][15]. The KDDCUP participants started a
trend of using the search engine directory services and the
categories of the open directory project to classify against
the KDDCUP 2005's 67 two-level target categories [3][16][17]. Shen et.al used an ensemble of
classifiers. They tried classifying the queries onto the 67
target categories based on the categories returned by the
Google directory service and ODP. But due to the low
recall achieved by the above mentioned synonym based
mapping, SVM with a linear kernel was used to classify
the web documents returned for the query [3]. But this
model required the classifier to be retrained every time the
target category was changed. So along with web
knowledge a bridging classifier which needed to be trained
only once was used as an intermediate taxonomy [14].
Taksa et.al focussed on employing machine learning
techniques to the short text classification problem which
lacks sufficient features for training. The sufficient features
were invariably retrieved from the web. Six age-indicative classes were taken, and each was represented by 10 query terms considered as representatives of the class. These
queries were sent through Google and the top 10 result
documents were taken as representatives of the particular
query. Thus the query along with the retrieved documents
was taken as the training corpus. Each term was taken as a
feature and using information gain the terms with the
highest information gain were considered and a supervised
text classification was performed[18]. The amount of
confidence on the returned web documents is a matter of
concern. With hundreds of millions of queries being
submitted per day [19], the feasibility of this type of
classification in a large-scale operational environment
seems unlikely. But the problem of automatic web query
classification has benefited from the search results and vice
versa better search results can be retrieved via proper
automatic web query classification.
B. Using Content, Anchor Link and URL Information
Document content, anchor links, hyperlinks and URLs
have been considerably used in task based classifications.
The level of importance to be placed on the above-mentioned information is a point of contention, as shown by various
researchers. The importance of inlinks, authorities, hubs
and anchors in web query classification were compared by
Westerveld et.al and they concluded that URLs prove
to be a useful feature in identifying navigational queries,
while inlinks give a marginal improvement in the
identification. The results also showed that using content
information with other sources improves the performance
over content only information [20]. Craswell et.al
compared link- based ranking methods and content-based
ranking methods and concluded that anchor texts provide
better features than the document content for site finding
tasks [21]. To differentiate navigational and informational
queries Kang et.al assumed that if anchor texts and
homepages of websites contained more of the query terms,
they were navigational queries. Further documents with
root type URL were put as a database of navigational
queries. The other documents formed a database of
informational queries. Then the number of times a query
term occurred in each database was calculated and if the
number was more in a particular collection, the query was
put into that class. As a third way of predicting the user
goal, multi term queries were considered. Pointwise mutual
information was calculated for the queries to find the
dependency of the multi-term query in a collection. Higher
the mutual information, the more a query was assumed to
be of a particular type. They concluded that title, URL and
link information were good for navigational queries and
bad for informational queries. Combining the above 3
approaches with POS information which is dealt with in
section 4.4 resulted in a precision of 91.7% and a recall of
61.5% [22].
Kang et.al's work was compared by Lee et.al, who considered user click behaviour, average number of clicks per query and anchor link distribution as features. The features were then linearly combined and the effectiveness was checked with a goal prediction graph and linear regression. They disproved Kang et.al's hypothesis of the anchor usage rate being higher for navigational queries and said that the first 2 approaches would also not work because of the lack of a consistent difference between the 2 databases. But it is to be noted that Lee et.al's work was on a potentially biased dataset [23]. Liu et.al crawled 202M
web pages and concluded that the anchor text evidence is
applicable only for less than 20% queries. Further it
showed that query identification can be query content
based for navigational queries and search context based for
queries which can be navigational or informational[24]. Lu
et.al generated 24 URL features and 63 anchor text features
before classifying queries based on Broder's categories.
Their result showed that the above features proved useful
only when coupled with click behaviour[25].
C. Using Machine Learning and Statistical Techniques
Typically, the short nature of the queries provides very
few features and is therefore a weak source for machine
learning. But researchers have circumvented this by adding
features from the query log, by using web search results,
synonyms and statistical methods. The lack of training data
is handled by uncertainty sampling where queries taken
from a large unlabeled query log is given to human
classifiers and are used as training data [18]. The Naive
Bayes classifier assigns probabilistic classes to many
unlabelled examples and the Perceptron learner was found
to perform well on queries with strong features [26].
Machine learning methods like the Naive Bayes model,
maximum entropy model, support vector machines and
stochastic gradient boosting tree have been used to identify
navigational queries. It was concluded that the stochastic
gradient boosting tree with linear SVM feature selection
gives the best result. It also showed that linear SVM and
boosting tree outperform information gain in feature
selection [25]. Kang classified the linked objects of a
hyperlink based on the file type and possible action of the
linked object. The tagged anchor texts were then used to
identify cue expressions of a linked type and were used as
training data in TiMBL a supervised learner [27].
D.Using Natural Language Processing
Simple natural language processing techniques like
stemming, paraphrasing and POS tagging to high-level
techniques like selectional preferences have been used in
automatic web query classification. The POS information
like be verbs was used to identify informational queries
by Kang et.al [22]. Kowalczyk et.al analysed the queries
posed to open domain question answering systems. By
shallow linguistic analysis, six query features like the type
of initial query word, main focus, main verb, rest of the
query, named entities and prepositional phrases were
taken. They were used by a rule based classifier to
differentiate queries as location, name, number, person,
time and other [28]. Beitzel et.al used selectional
preferences on a large unlabeled query log. He developed a
rule based classifier to classify queries with a minimum of
two words. He found that the latent relationship between
queries was well expressed by selection preference. The
queries were first converted into a set of head-tail (x,y)
pairs. Weighted forward pairs (x,u) and backward pairs
(u,y) where u represents a category were created. Forward
pairs were created when y was associated with at least one
u and backward pairs are created when x was associated
with at least one u. Each lexicon had a series of terms
called the thesaurus class associated with it in each
category. So when x or y was
part of the thesaurus class, they were associated with u.
Rules were formulated based on the selectional preference
and the weighted pairs were checked against the initial and
final tokens of a given query. He also used n-grams, a popular statistical natural language approach, and showed that exact-match and n-gram approaches achieve better precision for popular queries [26].
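A rough sketch of the forward/backward pair idea is given below, under the simplifying assumptions that category labels are available for some training queries and that pair weights are raw co-occurrence counts; the actual weighting and rule formulation in [26] are more elaborate.

from collections import Counter, defaultdict

def mine_pairs(labeled_queries):
    """Build weighted forward pairs (head -> category) and backward pairs
    (tail -> category) from two-word labeled training queries."""
    forward, backward = defaultdict(Counter), defaultdict(Counter)
    for query, category in labeled_queries:
        head, tail = query.lower().split()[:2]
        forward[head][category] += 1
        backward[tail][category] += 1
    return forward, backward

def classify(query, forward, backward):
    """Score categories by the pairs matched by the query's head and tail tokens."""
    tokens = query.lower().split()
    head, tail = tokens[0], tokens[-1]
    scores = Counter()
    scores.update(forward.get(head, {}))
    scores.update(backward.get(tail, {}))
    return scores.most_common(1)[0][0] if scores else None

train = [("lyrics yesterday", "music"), ("lyrics imagine", "music"),
         ("weather paris", "weather"), ("weather tokyo", "weather")]
fwd, bwd = mine_pairs(train)
print(classify("lyrics hallelujah", fwd, bwd))   # music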

E. Query Log Analysis and Click through Behaviour

Analysis of search engine logs has been seen as a way
to understand the user intent based on the history of the
user behaviour and user click behaviour has been
extensively used to understand the user
intent[24][25][29][30]. Query logs and cookie information
have been surveyed to table the trends in the web search
and to elicit the characteristics of the query[1][2][4]. Liu
et.al proposed that a user looking for a particular site
would click very few results(less effort assumption) and
they would mostly be the first few results (cover page
assumption) to identify navigational
queries [24]. Jansen et.al used transaction logs to
identify the features of navigational, transactional and
informational queries. The derived features and a series of
if-then rules were then used to classify the queries based on
the task [7]. He et.al compared ridge regression with a
combination of regular regression and search click
information and found an improvement in the performance.
His assumption was that if a user clicks a document then it
is relevant to the query [30]. But Agichtein et.al found that
there is an inherent bias to click the top result even when it
was not a relevant document. But this valuable information
was not considered in those techniques employing user
click behavior. Agichtein et.al modelled user behaviour as
a set of features like presentation, click through and
browsing features and then used RankNet a neural net
tuning algorithm to identify the best results for a
navigational query [31].
F. Unsupervised Clustering
An alternative to query classification is to allow the
queries to cluster into similar classes. Beeferman et.al used
a bipartite graph constructed from the click through data
with queries on one side and URLs on the other. Co-
occurring in a click through record initiated an edge
between the vertices on either side. They found the
similarity between the edges on either side using a
similarity function and merged same side vertices that were
similar. When the algorithm terminated the similar queries
and URLs were clustered with edges between related
queries and URLs [32]. Wen et.al based their work on the
principles that if 2 queries have same or similar terms they
refer to the same user intent and that if the same document
is selected for 2 different queries, then the queries are
similar. Similarities based on query contents and user
clicks were taken and linearly combined. This resulted in
queries with similar terms and the queries of similar
topics getting clustered. For clustering DBSCAN and
incremental DBSCAN were used [33].
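A minimal sketch of the linear combination of content similarity and click similarity described above is shown below (illustrative Python; the clustering step with DBSCAN used in [33] is omitted):

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def combined_similarity(q1_terms, q2_terms, q1_clicks, q2_clicks, alpha=0.5):
    """Linear combination of query-term overlap and clicked-document overlap."""
    content = jaccard(q1_terms, q2_terms)
    clicks = jaccard(q1_clicks, q2_clicks)
    return alpha * content + (1 - alpha) * clicks

print(combined_similarity(
    ["apple", "store"], ["apple", "shop"],
    {"apple.com"}, {"apple.com", "store.apple.com"}))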

V. ACKNOWLEDGEMENT
I would like to thank my official supervisors and
Professor Dr. Mohd. Hussain for all their help, advice and
encouragement.

VI. CONCLUSION
Classification can be user intent based, pre-defined
category based or just an unsupervised clustering. But
comparison of various query classification techniques is
complicated due to the absence of proper approved
benchmarks. Though KDDCUP 2005's category list is taken as a benchmark by some, the presence of a standardised benchmark to compare research ideas is
required. Manual surveys seem unavoidable because of the
absence of a proper benchmark. The vector model was
mostly used to represent queries. An ensemble of
classifiers was commonly used to mitigate the error
possibly introduced by a classifier. Various data mining
and machine learning techniques, computational
linguistics, the web knowledge, user behaviour and
statistical techniques have been used to classify queries.
Behavioural pattern and training data have invariably
employed human survey and an extensive reliance on
query logs is noticed. There is currently a shift towards e-commerce [1]. On average, 29% of queries are transactional [5][6]. When the query "web document collection" was submitted, the intent was to retrieve a collection of web documents, but a list of documents merely containing those keywords appeared in the popular search engine Google. Most of the previous query classification based on user intent has tried classifying queries as navigational or informational/transactional. Kang et.al suggested extending the work to transactional queries, though no explicit solution was given. Except for efforts by Kang and Jansen et.al, transactional query classification is still in its infancy. Jansen et.al derived some features for transactional queries which resemble Rose et.al's subcategories and Kang's file types [6][7][27]. More research needs to take place in classifying short-length queries and transactional queries. The reliance on web information, though it seems inevitable due to the rapidly changing nature of the web, needs to be minimised so that classification can be implemented in large-scale operational environments. Most
of the classification techniques rely on using the search
result returned by the search engines. But research needs to
be undertaken to retrieve the relevant documents in the
search result based on the classification. More
concentration on pre-retrieval classification is the need of
the hour. A new generation of search engines which
understands the intent behind the user query is needed to
provide the user a satisfying experience on the web.

REFERENCES
[1]. A. Spink, B. J. Jansen, D. Wolfram and T. Saracevic,
From E-Sex to E-Commerce: Web Search Changes,
IEEE Computer, Vol. 35(3), pp. 107-109, 2002.
[2]. A. Spink, D. Wolfram, B. J. Jansen and T. Saracevic,
Searching the Web: The public and their queries,
Journal of the American Society for Information
Science and Technology, Vol. 52(3), pp. 226-234,
2001.
[3]. D. Shen, R. Pan, J. Sun, J. Pan, K. Wu, J. Yin, and Q.
Yang. Query enrichment for web-query
classification, ACM TOIS, 24:320-352, July 2006.
[4]. C. Silverstein, M. Henzinger, H. Marais, & M.
Moricz, Analysis of a very large Web search engine
query log, SIGIR Forum, 33(1), 6-12, 1999.
[5]. A. Broder, A Taxonomy of Web Search, SIGIR
Forum, pp. 3-10, 36(2), 2002.
[6]. Daniel E Rose & Danny Levinson, Understanding
User Goals in Web Search In Proceedings of the 13th
International Conference on World Wide Web, pp. 13-
19, New York, New York, 2004.
[7]. Bernard J Jansen, Danielle L Booth, Amanda Spink,
Determining the informational, navigational, and
transactional intent of web queries, Information
Processing and Management, Vol. 44, No. 3, pp. 1251
-1266, May 2008.
[8]. L. Gravano, V. Hatzivassiloglou, and R. Lichtenstein,
Categorizing web queries according to geographical
locality, In CIKM'03, 2003.
[9]. M. Sanderson, Retrieving with good sense,
Information Retrieval, 2(1):49-69, 2000.
[10]. R. Song, Z. Luo, J.-R. Wen, Y. Yu, and H.-W.
Hon, Identifying ambiguous queries in web search.
In Proc. WWW, pp. 1169-1170, 2007.
[11]. Bang Viet Nguyen, Min-Yen Kan, Functional
faceted web query analysis, WWW2007, 2007.
[12]. www.dmoz.org
[13]. Andrei Broder, Marcus Fontoura, Evgeniy
Gabrilovich, Amruta Joshi, Vanja Josifovski, Tong
Zhang, Robust Classification of Rare Queries Using
Web Knowledge, Proceedings of the 30
th
annual
international ACM SIGIR conference on Research and
development in information retrieval, pp. 231 238,
2007.
[14]. D. Shen, J. Sun, Q. Yang, and Z. Chen,
Building bridges for web query classification, In
SIGIR'06, pp. 131- 138, 2006.
[15]. S. Beitzel, E. Jensen, A. Chowdhury, and O.
Frieder, Varying approaches to topical web query
classification, In SIGIR07: Proceedings of the 30th
annual international ACM SIGIR conference on
Research and Development, pages 783-784, 2007.
[16]. D. Vogel, S. Bickel, P. Haider, R. Schimpfky, P.
Siemen, S. Bridges, and T. Scheffer, Classifying
search engine queries using the web as background
knowledge, In SIGKDD Explorations, volume 7.
ACM, 2005.
[17]. Zsolt T Kardkovacs, D. Tikk, and A. Bansaghi,
The Ferrety algorithm for the KDD Cup 2005
problem, SIGKDD Explor. Newsl. 7, 2, 111-116,
2005.
[18]. Isak Taksa, Sarah Zelikovitz, Amanda Spink,
Using Web Search Logs to Identify Query
Classification Terms, Proceedings of the International
Conference on Information Technology, pp. 469-474, 2007.
[19]. D. Sullivan, "Searches Per Day," vol. 2003:
Search Engine Watch, 2003.
[20]. T. Westerveld, , W. Kraaij, and D. Hiemstra,
Retrieving Web Pages using Content, Links, URLs
and Anchors, In Proceedings of Text Retrieval
Conference (TREC- 10), pp. 663-672, 2001.
[21]. N. Craswell, D. Hawking and S. Robertson.
Effective Site Finding using Link Anchor
Information, In Proceedings of ACM SIGIR 01,
2001.
[22]. I. H. Kang, G. Kim, Query Type Classification
for Web Document Retrieval, In Proceedings of the
26th Annual International ACM SIGIR conference on
Research and Development in Information Retrieval,
pp 64-71, Toronto, Canada, 2003.
[23]. U. Lee, Z. Liu, and J. Cho, Automatic
Identification of User Goals in Web Search, In
Proceedings of the 14th International World Wide
Web Conference (WWW), Chiba, Japan, 2005.
[24]. Yiqun Liu, Min Zhang, Liyun Ru, Automatic
Query Type Identification Based on Click Through
Information, AIRS 2006: 593-600, 2006.
[25]. Yumao Lu, Fuchun Peng, Xin Li, Nawaaz
Ahmed, Coupling feature selection and machine
learning methods for navigational query
identification, Proceedings of the 15th ACM
international conference on Information and
knowledge management, pp. 682 - 689 , 2006
[26]. S. M. Beitzel, E. C. Jensen, D. D. Lewis, A.
Chowdhury, & O. Frieder, Automatic classification
of Web queries using very large unlabeled query
logs, ACM Transactions on Information Systems,
25(2) (Article No. 9), 2007.
[27]. I.H. Kang. Transactional query identification in
Web search, In Asian Information Retrieval
Symposium, 2005.
[28]. Pawel Kowalczyk, Ingrid Zukerman, and
Michael Niemann, Analyzing the Effect of Query
Class on Document Retrieval Performance, AI 2004:
Advances in Artificial Intelligence, pp. 550-561, 2005.
[29]. Xiao Li, Ye-Yi Wang, Alex Acero, Learning
query intent from regularized click graphs,
Proceedings of the 31st annual international ACM
SIGIR conference on Research and development in
information retrieval, pp.339-346, 2008.
[30]. Xiaofei He, Pradhuman Jhala, Regularized
query classification using search click information,
The journal of the pattern recognition society, pp.
2283- 2288, 2008.
[31]. Eugene Agichtein, Zijian Zheng, Identifying
best bet" web search results by mining past user
behaviour, Proceedings of the 12th ACM SIGKDD
International Conference on Knowledge discovery and
data mining, pp. 902 - 908 , 2006
[32]. D. Beeferman, and A. Berger, Agglomerative
clustering of a search engine query log, In Proc. Of
the Sixth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, Boston:
ACM Press, pp. 407-415, 2000.
[33]. J.R. Wen, J.-Y. Nie, H.-J. Zhang, Query
clustering using user logs, ACM Transactions on
Information Systems (TOIS), pp. 59-81.

Bayesian MAP Model for Edge Preserving Image Restoration: A Survey

Greeshma T.R
Student, Dept. of Computer Science and Engineering
MES College of Engineering, Kuttippuram
Kerala


Abstract.Image restoration is a dynamic field of
research. The need for efficient image restoration
methods has grown with the massive production of
digital images and movies of all kinds. It often happens
that in an image acquisition system, an acquired image
has less desirable quality than the original image due to
various imperfections and/or physical limitations in the
image formation and transmission processes. Thus the
main objective of image restoration is to improve the
general quality of an image or removing defects from it.
The two main considerations in recovery procedures are
categorized as blur and noise. In the case of images with
presence of both blur and noises, it is impossible to
recover a valuable approximation of the image of interest
without using some a priori information about its
properties. The instability of image restoration is
overcome by using a priori information which leads to
the concept of image regularization. A lot of
regularization methods are developed to cop up with the
criteria of estimating high quality image representations.
The Maximum A posteriori Probability (MAP) based
Bayesian approach provide a systematic and flexible
framework for this. This paper presents a survey on
image restoration based on various prior models such as
tikhonov, TV, wavelet etc in the Bayesian MAP
framework.

Keywords-Image restoration, Image Regularization, prior, Bayesian model, MAP estimation, total variation (TV), wavelet.
I. INTRODUCTION
With the advancement of digital technology, digital images have become an important part of our day-to-day life. These images span many industries, such as the medical, military and scientific domains. They range from images taken with a low-cost CCD camera to those from expensive and complex magnetic resonance imaging. Noise and blur are inherent to most
of the imaging domains. Millions of images and movies
are either taken in poor conditions or transferred over
different communication channels which are highly
prone to noise. Restoration has been a well studied
problem in the image processing community and
continues to attract researchers with an aim to achieve a
better estimate in the presence of noise.

Proc. of the Intl. Conf. on Computer Applications
Volume 1. Copyright 2012 Techno Forum Group,
India. ISBN: 978-81-920575-8-3 :: doi: 10.
73375/ISBN_0768
ACM #: dber.imera.10. 73375
A. Digital Image
A digital image is usually represented as a matrix
of grey or color values. Each element in the matrix is
called pixel, i.e. picture element. Each pixel may consist
of one or more bits of information, representing the
brightness of the image at that point. For color images
there are 3 matrices each for red, green and blue (RGB).
An N x N image can be represented as a vector in R^(N^2) space.

Figure 1. Digital image representation

Image processing is a set of techniques to enhance the raw images received from cameras and sensors and make them suitable for various day-to-day applications. Image restoration is a classical problem in image processing. It deals with the reconstruction of a clean image from a blurred and noisy image contaminated with additive white Gaussian noise (AWGN). This problem gets a lot of attention from the image processing research community because millions of images and videos are taken in noisy conditions or communicated through noisy channels, and they may also be blurred in different ways. It requires a systematic approach that takes into account the
entire process of image formation and provides a
foundation for the subsequent steps of image processing.
Normally images are degraded by two major
effects such as blur and noise. Blurring is fundamental to
imaging process and it occurs due to various
phenomenon like atmospheric turbulence, camera out of
focus, motion etc. Noise is also introduced into images during transmission. This necessitates image restoration, since the degraded images are visually annoying and are poor inputs for compression and analysis.
Image restoration refers to removal or minimization of
degradations in an image. This includes deblurring of
images, noise filtering, and correction of geometric
distortion or non-linearity due to sensors. The ultimate
goal of image restoration is to get a high quality image
estimate by compensating or undoing defects which
degrade an image.
The remainder of this paper is organized as
follows. In section II, we describe about Image
Restoration Problem. Section III presents Image
regularization concept, section IV focuses on Bayesian
MAP concept, V and VI depicts related works and
Conclusion.
II. IMAGE RESTORATION PROBLEM
An ideal image x is measured in the presence of additive white Gaussian noise. The measured image y is thus defined as y = Hx + n, where H represents the convolution matrix that models the blurring effect and n represents the random error, which is taken as noise. The image restoration problem deals with designing an algorithm that can remove the unwanted information from y, getting as close as possible to the original image x.
The figure below is a pictorial representation of an image degradation-restoration model. The first half shows the degradation operation, in which the image is blurred with some degradation function H and a noise vector is added. The second half is the area under consideration in this work.


Figure 2. Basic image degradation restoration model
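To make the degradation model concrete, the sketch below (illustrative Python with NumPy; a hypothetical box blur stands in for the convolution matrix H) simulates y = Hx + n from a clean image x.

import numpy as np

def degrade(x, kernel_size=5, noise_sigma=10.0, seed=0):
    """Simulate y = Hx + n: box blur followed by additive white Gaussian noise.
    x is a 2-D grayscale image array."""
    rng = np.random.default_rng(seed)
    k = kernel_size
    pad = k // 2
    xp = np.pad(x.astype(float), pad, mode="reflect")
    blurred = np.zeros_like(x, dtype=float)
    # Direct (naive) convolution with a k x k box kernel.
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            blurred[i, j] = xp[i:i + k, j:j + k].mean()
    noise = rng.normal(0.0, noise_sigma, size=x.shape)
    return blurred + noise

x = np.tile(np.linspace(0, 255, 64), (64, 1))   # synthetic test image
y = degrade(x)
print(y.shape, float(y.mean()))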
In the case of images with blur, it is possible to come
up with a very good estimate of the actual blurring
function and undo the blur to restore the original image
by the effective inverse filtering operation. Inverse
filtering is a deconvolution method in which the inverse
operation of degradation function is performed. It is
evident that the inverse operation does not work well for images which are corrupted by both blur and noise, because the inverse operation results in amplification of the noise. Employing any kind of denoising method in such a situation smooths out the fine details of the image. The best method that can be employed in such a situation is to use some prior information to calculate the image estimate by mathematical or statistical approaches. The Bayesian formulation offers a systematic and flexible way of image regularization and provides a rigorous framework for estimation of the model parameters.

III. IMAGE REGULARIZATION
Image regularization is an active field of research within image restoration. This method focuses on incorporating some prior information to resolve the ill-posedness of the restoration problem. These priors are normally taken as a penalty on complexity, such as a smoothness constraint.
The general formulation for regularization techniques is

x_est = argmin_x ||y - Hx||_2^2 + λ ||Lx||_2^2

where ||y - Hx||_2^2 is the error (data fidelity) term, λ is the regularization parameter and ||Lx||_2^2 is the penalty term.
IV. BAYESIAN MAP CONCEPT
The Bayesian formulation of the image restoration
problem provides a systematic and flexible way for
regularization using MAP estimation. In the Bayesian
framework, the inverse problem can be expressed as a
problem to find the posterior density of unknown data
from the realization observed. The best method that can
be employed for this purpose is the MAP which is a
mode of the posterior distribution. By Maximum A
posteriori probability (MAP), the estimate of an unobserved quantity is calculated from empirical data. It can be seen as a regularization of maximum likelihood estimation that incorporates a prior distribution. This method is best used
for a range of image restoration problems, as it provides
a computationally efficient means to deal with the
shortcomings of filtering operation without sacrificing
the computational simplicity of the filtering approach.

The Bayesian estimation is based on the relation

P(x|y) ∝ P(y|x) P(x)
(Posterior ∝ Likelihood × Prior)

In this framework the inverse problem is expressed as the problem of finding the posterior density P_posterior(x) from the observed realization y_observed as

P_posterior(x) = P_prior(x) P(y_observed | x) / P(y_observed)

There will be an infinite number of solutions to this equation, from which the most optimal result is to be extracted. Here, by using the concept of the MAP method, it is easy to calculate the mode of the posterior as

argmax_x P(x|y) = argmax_x P(y|x) P(x)

By the MAP method, the underlying scene x which maximizes this Bayesian expression is calculated, which gives the restored estimate.
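As a hedged illustration only (not one of the surveyed methods), the sketch below computes the MAP estimate under a Gaussian likelihood and a simple zero-mean Gaussian prior, which amounts to minimizing ||y - Hx||^2 + lambda*||x||^2 (Tikhonov regularization with L = I) by gradient descent; a box blur plays the role of H and, being symmetric, of its adjoint as well.

import numpy as np

def box_blur(img, k=5):
    """H: k x k box blur; symmetric, so it also serves as H^T."""
    pad = k // 2
    xp = np.pad(img, pad, mode="reflect")
    out = np.zeros_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = xp[i:i + k, j:j + k].mean()
    return out

def map_restore(y, lam=0.05, step=0.5, iters=100):
    """Gradient descent on ||y - Hx||^2 + lam * ||x||^2 (Tikhonov, L = I).
    This is the MAP estimate for Gaussian noise with a zero-mean Gaussian prior."""
    x = y.copy()
    for _ in range(iters):
        residual = box_blur(x) - y                        # Hx - y
        grad = 2.0 * box_blur(residual) + 2.0 * lam * x   # 2 H^T(Hx - y) + 2 lam x
        x -= step * grad
    return x

rng = np.random.default_rng(0)
clean = np.tile(np.linspace(0.0, 1.0, 32), (32, 1))
y = box_blur(clean) + rng.normal(0.0, 0.02, clean.shape)
x_hat = map_restore(y)
print(float(np.mean((x_hat - clean) ** 2)))   # MSE of the restored estimate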
V. RELATED WORK
There are extensive works on the development and
evaluation of various image restoration techniques. Edge
preserving image reconstruction has been proposed to
recover the original image from its degraded version
while this method also preserves the fine details in the
image such as edges. The problem is that enhancement of
fine detail (or edges) is equivalent to enhancement of
noise. So we have to reconstruct the image in such a way that it is recovered from blur and noise while preserving its edge details. All the restoration methods surveyed here focus on prior-based Bayesian MAP estimation.
In Bayesian estimation there is wide freedom in the choice of prior, and the quality of restoration varies with the prior chosen. This work provides an analytical survey of various prior-based image
reconstruction methods in the Bayesian framework.
S. Derin Babacan et al. [3] proposed novel variational methods based on the hierarchical Bayesian formulation, and provide approximations to the posterior distributions of the image, blur, and model parameters rather than point estimates. Casting TV-based deconvolution as a Bayesian estimation problem provides advantages in blind deconvolution, such as a means to estimate the
uncertainties of the estimates. The blind deconvolution
problem is formulated using a hierarchical Bayesian
model, and variational inference is utilized to
approximate the posterior distributions of the unknown
parameters rather than point estimates. Approximating
the posterior distribution makes evaluating the
uncertainty of the
estimates possible. The unknown parameters of the
Bayesian formulation can be calculated automatically
using only the observation or using also prior knowledge
with different confidence values to improve the
performance of the algorithms.
The experimental results showed that the quality of the restorations can be improved by utilizing prior knowledge about the parameters. Comparing the restoration of the Lena image with the same confidence parameters for the TV approach (ISNR = 3.19 dB) and the SAR1 approach (ISNR = 1.26 dB) shows that the TV-based approach is more successful at removing the blur while providing smooth restorations with less ringing. However, the method still exhibits some edge smoothing, and the rate of convergence of the TV algorithm depends on the image size.
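The effect of a TV-style penalty can be illustrated with a much simpler, non-Bayesian sketch, assuming a smoothed TV term and plain gradient descent on a 2-D image; this is only meant to show the edge-preserving behaviour of the TV prior and is not the variational inference procedure of [3].

import numpy as np

# Simplified smoothed-TV denoising sketch (illustrative only; parameters
# lam, eps and step are hand-picked, not values from the surveyed work).
def tv_denoise(y, lam=0.2, step=0.05, iters=300, eps=0.1):
    x = y.copy()
    for _ in range(iters):
        dx = np.diff(x, axis=1, append=x[:, -1:])   # horizontal forward differences
        dy = np.diff(x, axis=0, append=x[-1:, :])   # vertical forward differences
        mag = np.sqrt(dx**2 + dy**2 + eps**2)       # smoothed gradient magnitude
        # backward-difference divergence of (dx/mag, dy/mag) approximates the TV gradient
        div = (np.diff(dx / mag, axis=1, prepend=(dx / mag)[:, :1])
               + np.diff(dy / mag, axis=0, prepend=(dy / mag)[:1, :]))
        x -= step * ((x - y) - lam * div)           # data term plus TV term
    return x

noisy = np.clip(np.kron(np.eye(2), np.ones((16, 16)))      # blocky test image
                + 0.2 * np.random.default_rng(1).standard_normal((32, 32)), 0, 1)
print(np.round(tv_denoise(noisy)[0, :8], 2))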
M. Dirk Robinson et al. [4] discussed a wavelet-based deconvolution and restoration algorithm. Wavelet methods build on the idea that an image can be represented accurately by a small set of wavelet coefficients. Here the image is represented by the DWT while the convolution operators are specified in the Fourier domain, and the Fourier-wavelet super-resolution method combines multiple aliased low-quality images to produce a single high-resolution, high-quality image. The efficiency of the algorithm stems from separating the multiframe deconvolution (restoration) step from the wavelet-based denoising step, which allows nonlinear denoising to be achieved in a non-iterative fashion.
In this method, a set of aliased low-quality images is first captured. These images are broken down into small tiles that are processed separately, and a Wiener filter is applied to each tile to fuse and deblur it. The wavelet signal power is then estimated by first very coarsely denoising the sharpened image: a simple hard-thresholding wavelet denoising step is applied to obtain the coarsely denoised image using one set of scaling and wavelet functions, after which the wavelet estimate is updated and a new wavelet denoising pass is applied to denoise the images efficiently, resulting in high-contrast super-resolved images. The main defect of the wavelet method is the presence of ringing artifacts.
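A bare-bones sketch of the hard-thresholding wavelet denoising step mentioned above is given below; it assumes the PyWavelets package and a hand-picked threshold, whereas [4] estimates the wavelet signal power and performs Wiener-filter fusion of multiple frames first.

import numpy as np
import pywt  # PyWavelets

# Minimal hard-threshold wavelet denoising sketch (threshold chosen by hand;
# the method in [4] estimates signal power and combines multiple frames).
def wavelet_hard_denoise(image, wavelet="db4", level=2, thresh=0.1):
    coeffs = pywt.wavedec2(image, wavelet, level=level)
    denoised = [coeffs[0]]                       # keep the approximation band
    for detail_bands in coeffs[1:]:
        denoised.append(tuple(pywt.threshold(band, thresh, mode="hard")
                              for band in detail_bands))
    return pywt.waverec2(denoised, wavelet)

rng = np.random.default_rng(0)
clean = np.outer(np.sin(np.linspace(0, 3, 64)), np.cos(np.linspace(0, 3, 64)))
noisy = clean + 0.05 * rng.standard_normal(clean.shape)
restored = wavelet_hard_denoise(noisy)
print(float(np.mean((noisy - clean)**2)), float(np.mean((restored - clean)**2)))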
Nelly Pustelnik et al. [5] introduced an image restoration algorithm based on a combined, or hybrid, method of total variation and wavelet regularization. The main advantage of the hybrid method is that it compensates for the drawbacks of each method while making use of the advantages of both. Here the image is decomposed into a number of wavelet subbands and effective total-variation estimation is applied to each of them, after which the proximity operators can be computed easily. The main advantages of the method are: 1) it deals directly with the true noise likelihood without requiring any approximation of it; 2) it permits the use of sophisticated regularization functions; and 3) it can be implemented on multiple architectures. The work has been tested on several natural and synthetic images.
The method is quantitatively evaluated in terms of PSNR on the blurred Boat image with an initial PSNR of 11.2 dB. The results show that it provides a better restoration, with an improved PSNR of 18.8 dB, whereas the TV approach achieves 17.8 dB and the wavelet approach 18.0 dB. The method is, however, relatively complex.
David Humphrey and David Taubman [6] review a restoration method based on MAP estimation of images. In this method, a piecewise stationary Gaussian prior is used in calculating the Maximum A Posteriori estimate of the underlying data from the observed data using the linear shift-invariant Wiener filter. The method deals with images that are degraded by both blur and noise.
In this paper, the image is first segmented and each region in the segmentation is modelled with an independent Gaussian prior, while some auxiliary data are maintained so that the Wiener filter can be used at all points in the image. The notion of the extension of a region is introduced, which can be created by cutting out the appropriate parts from the other regions. Segmentation is the most important stage of this system, since it helps to detect the regions of interest using some prior information. A MAP estimate is then performed for each region to recover the hidden data in it; a local cost function is applied and the method iterates until the intended image estimate is obtained for each segment. Finally, all the segments of interest are combined. This method is efficient in the problem of demosaicking of
Proc. of the Second International Conference on Computer Applications 2012 [ICCA 2012] 51
www.asdf.org.in 2012 :: Techno Forum Research and Development Centre, Pondicherry, India www.icca.org.in
digital camera images, and the results show that this method preserves edges efficiently.
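For reference, a frequency-domain Wiener deconvolution sketch is given below; it assumes a known blur kernel and a single global noise-to-signal ratio, whereas [6] applies the filter per segment under a piecewise stationary Gaussian prior.

import numpy as np

# Frequency-domain Wiener deconvolution sketch (global NSR assumed; the
# method in [6] works per segment with a piecewise stationary prior).
def wiener_deconvolve(blurred, kernel, nsr=0.01):
    H = np.fft.fft2(kernel, s=blurred.shape)          # kernel transfer function
    G = np.fft.fft2(blurred)
    W = np.conj(H) / (np.abs(H) ** 2 + nsr)           # Wiener filter
    return np.real(np.fft.ifft2(W * G))

rng = np.random.default_rng(0)
image = np.zeros((64, 64)); image[16:48, 16:48] = 1.0  # simple square scene
kernel = np.ones((5, 5)) / 25.0                        # assumed uniform blur
blurred = np.real(np.fft.ifft2(np.fft.fft2(image) *
                               np.fft.fft2(kernel, s=image.shape)))
blurred += 0.01 * rng.standard_normal(image.shape)
restored = wiener_deconvolve(blurred, kernel, nsr=0.01)
print(float(np.mean((blurred - image)**2)), float(np.mean((restored - image)**2)))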
From the literature survey it is observed that the Bayesian framework is a well-suited environment for image restoration problems in which both blur and noise are present [1]. The Bayesian algorithm marginalizes the hidden variables using the maximum likelihood function, and MAP-based estimation in the Bayesian framework is well suited to calculating the posterior probability of the hidden data from the observed information [2]. In this framework there is a lot of freedom in the choice of prior, and it provides a systematic and flexible way to restore images, with a minimum ISNR of 5 dB, as shown in the table below.

Table 1: SNR improvement using Bayesian MAP methods

Method            Image Analyzed     Degraded Image SNR    Restored Image SNR
Bayesian Method   Cameraman image    PSNR = 35.0 dB        PSNR = 43.03 dB
MAP estimation    Lena image         PSNR = 23.23 dB       PSNR = 29.99 dB

Here the quality of the regularization varies with the choice of prior, and a great deal of time and energy has been invested by the image processing community in modelling adequate priors. A common way to construct the prior is based on intuitive expectations about the data content, and a better prior adds more to the quality of the reproduced image. In earlier years, edge-preserving image regularization was based on the L2-norm Tikhonov method. Standard Tikhonov regularization corresponds to a quadratic regularization term with a linear solution; such quadratic cost functionals are of limited use for measuring the plausibility of natural images, and this prior causes over-smoothing because it strongly penalizes large edges. The L2 norm was therefore replaced by the L1 norm in later years.
A popular choice is the differential L1 norm used in total variation, which penalizes the modulus of the gradient and is an optimal regularization for piecewise-constant solutions. The use of the L1 norm formally introduced the idea of sparsity into image processing, and this sparsity concept led to the development of the wavelet approach. The underlying idea of wavelet regularization is that natural images tend to be sparse in the wavelet domain; hence, among all possible candidates, the preferred solution has only a few significant (non-zero) wavelet coefficients, i.e. the image can be represented by a smaller number of coefficients.
The figure below shows a qualitative, visual comparison of the restoration results of the various regularization methods discussed above, in the case of a T2 image degraded to a PSNR of 30 dB [7].




Figure 3. Comparison of different regularization with respect to T2
image [7]
From the survey it is seen that wavelet is
competitive with TV regularization both in terms of SNR
and computation time. It also appears that the prior
corresponding to L1 regularization may be better adapted
to images than the classical L2 term. Wavelet
reconstruction usually outperforms TV for images that
contain textured areas and/or many small details. Even
though the hybrid method accommodates more complex
procedures to restore the image, it results in negligible
improvement in restoration both qualitatively and
quantitatively. The table below compares the different regularization methods.

Table 2: SNR improvement with respect to various priors

Method            Image Analyzed    Degraded Image SNR    Restored Image SNR
Tikhonov          T2 MRI image      PSNR = 30.0 dB        ISNR = 35.0 dB
Total Variation   Boat image        SNR = 11.2 dB         SNR = 17.8 dB
Wavelet           Boat image        SNR = 11.2 dB         SNR = 18.0 dB
Hybrid            Boat image        SNR = 11.2 dB         SNR = 18.8 dB

VI. DISCUSSION AND CONCLUSION
In imaging applications it often happens that images are subject to various imperfections and physical limitations, such as noise and blur, so that the resulting image is not what we desire. To reduce the noise amplification caused by common inverse operations, regularization in the Bayesian framework with the concept of Maximum A Posteriori (MAP) probability
is a first-class approach, allowing realistic restoration while preserving fine image details, particularly edges. The Bayesian framework also offers considerable freedom in the selection of priors.
Earlier works focused on different regularization methods, both piecewise and continuous, and they still leave room for improvement in the context of preserving edges while removing noise. The first method considered here is the Tikhonov regularization scheme, in which an L2-based prior is used. Because of its over-smoothing, another method, Total Variation, was developed based on the variational differences within images. The TV concept led to the idea of sparsity, which in turn guided the development of the wavelet method, in which the image is represented sparsely in the wavelet domain. Most recent works have been developed on the wavelet basis, but this method still shows some smoothing effect. Nowadays, edge-preserving restoration focuses on how to implement sparsity in the pixel domain rather than the wavelet domain in order to achieve better restoration. Such methods are expected to be competitive both qualitatively and quantitatively.
REFERENCES

[1] Javier Mateos, W. Tom E. Bishopi, Rafael Molina and Aggelos K. Katsaggelos, "Local Bayesian Image Restoration Using Variational Methods and Gamma-Normal Distributions," IEEE Transactions on Image Processing, 2009.
[2] Masayuki Tanaka, Takafumi Kanda and Masatoshi Okutomi, "Progressive MAP-Based Deconvolution with Pixel-Dependent Gaussian Prior," 2010 International Conference on Pattern Recognition, 2010.
[3] S. Derin Babacan, Rafael Molina and Aggelos K. Katsaggelos, "Variational Bayesian Blind Deconvolution Using a Total Variation Prior," IEEE Transactions on Image Processing, 2007.
[4] M. Dirk Robinson, Cynthia A. Toth, Joseph Y. Lo and Sina Farsiu, "Efficient Fourier-Wavelet Super-Resolution," IEEE Transactions on Image Processing, vol. 19, no. 10, Oct. 2010.
[5] Nelly Pustelnik, Caroline Chaux and Jean-Christophe Pesquet, "Parallel Proximal Algorithm for Image Restoration Using Hybrid Regularization," IEEE Transactions on Image Processing, vol. 20, no. 9, Sep. 2011.
[6] David Humphrey and David Taubman, "A Filtering Approach to Edge Preserving MAP Estimation of Images," IEEE Transactions on Image Processing, vol. 20, no. 5, May 2011.
[7] M. Guerquin-Kern, D. Van De Ville, C. Vonesch, J. C. Baritaux, K. P. Pruessmann and M. Unser, "Wavelet Regularized Reconstruction for Rapid MRI," IEEE Transactions on Image Processing, May 2009.




Opinion mining for Decision Making in Medical Decision Support system
-A Survey-

A.Ananda Shankar
Assoc. Professor, Dept of ISE, Reva ITM.,
Bangaluru, India.
Abstract - Experts are asked to provide their opinions in situations of uncertainty. Experts can make more accurate decisions because they have accumulated knowledge about the problem from their past experience; however, since experience is subjective, different experts propose different diagnoses and decisions based on the same facts about the same disease gathered from the observation of a patient. Machine learning methods also possess background knowledge encoded in their induction algorithms. In this paper we present different methods in which such background knowledge is modified and can therefore produce different opinions on the same observation, exposing multiple opinions, e.g. multiple opinions (MO) of experts.

Keywords - Computer Supported Cooperative Work, Opinion Mining, Multiple Opinion.

I. INTRODUCTION
Highly complex problems are usually solved by a team of experts, and decision making that involves uncertain data is challenging for human decision makers. Today, collaborative decision making is an important human activity with many practical applications in medicine, society, economy, management, engineering and other domains.
With the rapid advent of the internet and information technology, researchers face new challenges in the theory and methods of Computer Supported Cooperative Work (CSCW). Existing groupware also lacks efficient tools to effectively analyse, predict and support the social facets of collective decisions in medical diagnosis [1]. One of the challenges in collaborative decision making is social decision making in a computer-mediated environment, which is further complicated in medical applications. Nowadays, in the medical domain, it is difficult to make decisions about complex diseases, and doctors therefore seek multiple opinions from different experts in order to achieve an accurate diagnosis [2].
In such an environment, CSCW provides good assistance. In a collaborative environment experts can make highly accurate decisions, as they have background knowledge about the problem from their past experience. Since experience is very subjective, different experts propose different diagnoses and decisions from the same facts gathered from different or the same observations of a patient, and different experience can lead to a different diagnosis for the same specific case. A couple of decades ago, people deemed doctors the ultimate authority; nowadays, doctors and patients alike seek multiple opinions [2].

Proc. of the Intl. Conf. on Computer Applications, Volume 1. Copyright 2012 Techno Forum Group, India. ISBN: 978-81-920575-8-3 :: doi: 10.73382/ISBN_0768 :: ACM #: dber.imera.10.73382
Successful diagnosis relies on the experts' past experience, or the knowledge gained from that experience. Experts can make predictions from previous observations (solved cases) and produce diagnoses for new cases; using this experience they can suggest good opinions about the diseases. Similarly, different experts with different background knowledge (experience) can suggest different opinions, which leads to multiple opinions.
In the multi-criteria decision making problem, a situation arises in which each expert has his own opinion and estimated value arising from his particular experience and knowledge. The opinions of the experts may be close to one another or in conflict. How to combine the individual opinions into a common consensus is an important issue in group decision making, and we address this issue by considering different techniques.
II. EXISTING METHODS
A. Models of discrete opinion
A. Ising model
The Ising model is applied to discrete opinions; here this refers to binary opinion values on a particular question. In many cases an agent must take a simple yes-or-no decision, and such situations arise very often in daily life. We do not deal here with all the models but emphasize the more interesting ones. Consider a community of N participants, where each individual i holds one of two opposite opinions on a certain subject, denoted by σ_i(t) = ±1, i = 1, 2, ..., N. The Ising model is mostly used in statistical mechanics, and it has been successful in describing numerous phenomena in physics and in social communities. Since the Ising model is appropriate only for binary opinions, other models are needed to describe the evolution of opinion when the opinion values are continuous or discrete multi-valued [3].
B. Voter model
The voter model is an old model for binary opinions. Each person updates his opinion by randomly adopting the opinion of a selected neighbour, i.e. σ_i = σ_j, and the probability of taking another's opinion is proportional to the number of neighbours holding that opinion. In contrast to the Ising model, this probability depends linearly on the number of neighbours. Consensus is reached when everybody shares the same opinion; as is well known, the time needed to reach consensus increases with a power of the group size, in which case aggregation may not be possible [4].
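A minimal simulation sketch of the voter-model update rule described above is shown below; the ring topology, population size and number of steps are arbitrary assumptions made only for illustration.

import numpy as np

# Voter model sketch on a ring of N agents with binary opinions +/-1:
# at each step a random agent copies the opinion of a random neighbour.
rng = np.random.default_rng(0)
N, steps = 100, 20000
opinions = rng.choice([-1, 1], size=N)

for _ in range(steps):
    i = rng.integers(N)
    j = (i + rng.choice([-1, 1])) % N      # pick the left or right neighbour on the ring
    opinions[i] = opinions[j]              # adopt the neighbour's opinion
    if np.all(opinions == opinions[0]):    # consensus reached
        break

print("final magnetisation:", opinions.mean())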

C. Consensus with Constrained Convergence Rate: Agreement in Communities
This study focuses on a class of discrete-time multi-agent systems modelling opinions with decaying confidence. Essentially, an agreement protocol is proposed that imposes a prescribed convergence rate. Under that constraint, global consensus may not be achieved and only local agreements may be reached; the agents reaching a local agreement form communities inside the network. In this model it is shown that communities correspond to asymptotically connected components of the network, and an algebraic characterization of communities is given in terms of the eigenvalues of the matrix defining the collective opinions of the communities [5].
B. Models of continuous opinion
A. Deffuant model:
In the Deffuant model, agents adjust continuous opinions whenever their difference in opinion is below a given threshold: high thresholds yield convergence of opinions toward an average opinion, whereas low thresholds result in several opinion clusters. The model can also be applied to investigate network interactions. Each agent i is initially given an opinion x_i, usually drawn from the interval [0, 1]. Opinion updating is based on random binary encounters: at each time step an individual randomly chooses a neighbouring agent to interact with. Denoting the interacting agents' opinions by x_i and x_j, the parameter μ is the so-called convergence parameter, and its value lies in the interval [0, 1/2]. If the difference of the two agents' opinions exceeds the threshold, nothing happens; otherwise the Deffuant model uses a compromise strategy: after a constructive debate, the opinions of the interacting agents move closer to each other by the relative amount μ, and the two agents converge toward the average of their opinions after the discussion. The simulations of the Deffuant model can be read as follows: with many people and few opinions, nearly all opinions have some followers and the number of final opinion clusters nearly always agrees with the total number of opinions, while in the opposite case of many opinions for few people nearly every person forms a separate opinion cluster [6].
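A compact sketch of the Deffuant update rule is given below; the threshold d, convergence parameter mu and population size are illustrative choices, not values from [6].

import numpy as np

# Deffuant model sketch: pairwise bounded-confidence compromise.
rng = np.random.default_rng(0)
N, d, mu, steps = 200, 0.2, 0.5, 50000
x = rng.random(N)                          # opinions initially uniform in [0, 1]

for _ in range(steps):
    i, j = rng.integers(N, size=2)
    if i != j and abs(x[i] - x[j]) < d:    # interact only within the threshold
        shift = mu * (x[j] - x[i])
        x[i], x[j] = x[i] + shift, x[j] - shift   # move toward each other

print("opinion clusters (rounded):", np.unique(np.round(x, 1)))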
B. HK model:
The model proposed by Hegselmann and Krause (HK) is similar to the Deffuant model, though it has not been investigated by as many people so far. The difference between the two lies in the evolution equations: in the HK model the new opinion of a person is the arithmetic average over the opinions of the whole population within the bounded confidence, whereas in the Deffuant model each person randomly selects another person and both move their opinions toward each other by an amount proportional to their opinion difference. The Deffuant model is suitable for describing the opinions of large populations where people meet in small groups, like pairs; in contrast, the HK rule is appropriate for formal meetings, where there is an effective interaction involving many people at the same time. Since at every Monte Carlo step the randomly selected agent i takes the average of the opinions of those agents j such that |x_i(t) - x_j(t)| < ε, this averaging process makes the algorithm very time-consuming compared with the Deffuant model, which is essentially the reason why most of the computational sociophysics community does not find the HK model attractive. We study here HK models that fit cases in real life. Recently, Santo Fortunato extended the HK model to integer opinions instead of real numbers; for the case of a society in which everybody can talk to everybody else, he finds that the chance of reaching consensus is much higher than in other models: if the number of possible opinions Q ≤ 7, consensus is always reached, while for Q > 7 the number of surviving opinions is approximately the same independently of the population size N. The main result of the simulations is the existence of a threshold Qc: for Q ≤ Qc consensus is always reached, while for Q > Qc more than a single opinion survives in the final configuration. It is remarked that this threshold is higher than in the discretized version of the Deffuant model. This shows that the dynamics of the HK model is the most suitable to explain how competing factions can find an agreement and to justify the stability of political coalitions with several parties, as in Italy. This versatility of the generalized model makes it more suitable than others for explaining how consensus can be reached in a variety of situations [7].
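For comparison with the Deffuant sketch, a synchronous bounded-confidence (HK-style) averaging sketch is given below; the confidence bound and population size are again arbitrary illustrative choices.

import numpy as np

# Hegselmann-Krause sketch: every agent replaces its opinion with the mean
# of all opinions lying within its confidence bound eps (synchronous update).
rng = np.random.default_rng(0)
N, eps, rounds = 100, 0.15, 30
x = rng.random(N)

for _ in range(rounds):
    new_x = np.empty_like(x)
    for i in range(N):
        neighbours = np.abs(x - x[i]) < eps       # agents within the confidence bound
        new_x[i] = x[neighbours].mean()           # arithmetic average of their opinions
    x = new_x

print("surviving opinion clusters:", np.unique(np.round(x, 2)))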
C. Models of opinion dynamics
A. Vector Opinion Dynamics Model:
Most people hold a variety of opinions on different
topics ranging from sports, entertainment, spiritual
beliefs to moral principles. These can be based on
personal reflections or on their interactions with others.
How do we influence others in our social network and
how do they influence us and how do we reach
consensus? In this method, we present our investigations
based on the use of multiple opinions (a vector of
opinions) that should be considered to determine
consensus in a society. We have extended the Deffuant model and tested it on two well-known network topologies, the Barabasi-Albert network and the Erdos-Renyi network, and have implemented a two-phase filtering process to determine the consensus [8].
B. Pheromone Dynamics Model:
A method to simulate the dynamics of public
opinion formation with pheromone is proposed. It is
assumed that an individual would sense and deposit
pheromone during its opinion formation, and that it changes and updates its opinion based on the transition probability
determined by the distribution of pheromone. Using this
model, simulations with several sets of parameters are
carried out. The results show that the phenomena of
consensus and bifurcation emerge in the evolution of
public opinion under the action of pheromone. We also
find that the evolution needs more time to reach steady
state for the greater size of individuals, and for the less
evaporation of pheromone. Moreover, the model can be
used to explain the influence of opinion leader on other
individuals in the process of opinion formation. The
interactions between individuals and environment can be
modelled successfully with pheromone[9].
C. Bounded Confidence Dynamics Model :
We present and analyse a model of Opinion Dynamics and Bounded Confidence in a stochastic movement world. There are two mechanisms for interaction: 'Eyeshot' limits the set of neighbours around
the agent and 'Bounded Confidence' chooses the agents
to exchange the opinion in the set. Every time step, agent
i looks for the agents in its eyeshot and adjusts their
opinion based on the algorithm of Bounded Confidence.
When the exchange ends, every agent moves itself in a
random direction and waits for the next time step. There
are three special agents in the model, infector, extremist
and leader. The infector is specified as an agent with
large eyeshot and the extremist is the agent with high
confidence. The leader possesses both high confidence
and large eyeshot. We simulated the opinion formation
process using the proposed model, results show the
system is more realistic than the classic BC model[10].
D. Majority rule Dynamics model :
This model was proposed by Serge Galam 25 years ago and is notable for its more realistic properties. All agents take the majority opinion inside their group; this is the basic principle of the majority rule (MR) model, which was proposed to describe public debates, and the model was used to successfully predict the win of the rightists in the French presidential election in 2000. So far the MR model has been extended to multi-state opinions, and modifications of the MR model include: a model where agents can move in space, a dynamics where each agent interacts with a variable number of neighbours, and the introduction of a probability to favour a particular opinion. By citing the future directions, we hope this will guide readers to study this fascinating field intensively [11].
E. Opinion Dynamics in Heterogeneous Network model:
This paper studies the opinion dynamics model
recently introduced by Hegselmann and Krause: each
agent in a group maintains a real number describing its
opinion; and each agent updates its opinion by averaging
all other opinions that are within some given confidence
range. The confidence ranges are distinct for each agent.
This heterogeneity and state dependent topology leads to
poorly-understood complex dynamic behaviour. We
classify the agents via their interconnection topology and,
accordingly, compute the equilibrium of the system. We
conjecture that any trajectory of this model eventually
converges to a steady state under fixed topology. To
establish this conjecture, we derive two novel sufficient
conditions: both conditions guarantee convergence and
constant topology for infinite time, while one condition
also guarantees monotonicity of the convergence. In the
evolution under fixed topology for infinite time, we
define leader groups that determine the followers' rate and direction of convergence [12].
F. Socially-Constrained Exogenously-Driven Opinion
model:
A number of studies have explored the
dynamics of opinion change among interacting
knowledge workers, using different modelling techniques.
We are particularly interested in the transition from
cognitive convergence (a positive group phenomenon) to
collapse (which can lead to overlooking critical
information). This method extends previous agent-based
studies of this subject in two directions. First, we allow
agents to belong to distinct social groups and explore the
effect of varying degrees of within-group affinity.
Second, we provide exogenous drivers of agent opinion
in the form of a dynamic set of documents that they may
query. We exhibit a metastable configuration of this system with three distinct phases, and develop an operational metric for distinguishing convergence from collapse in the final phase. Then we use this metric to explore the system's dynamics, over the space defined by
social affinity and precision of queries against documents,
and under a range of different functions for the influence
that an interaction partner has on an agent[13].
III. CONCLUSIONS
In this paper we propose a novel Computer Supported Cooperative Work (CSCW) based model to deal with opinion mining for decision making in a medical decision support system.

REFERENCES

[1] Indiramma M. and Ananda Kumar K. R., "Collaborative decision making framework for multi-agent system," Proceedings of the International Conference on Computer & Communication Engineering 2008, May 13-15, 2008, Malaysia.
[2] Mitja Lenic, Petra Povalej, Milan Zorman and Peter Kokol, Faculty of Electrical Engineering and Computer Science, University of Maribor, Slovenia, Proceedings of the 17th IEEE Symposium on Computer-Based Medical Systems (CBMS'04), 1063-7125/04, 2004 IEEE.
[3] Dietrich Stauffer, "Sociophysics: the Sznajd model and its applications," Computer Physics Communications 146 (2002) 93-98.
[4] Andre L. M. Vilela and F. G. Brady Moreira, "Majority-vote model with different agents," Physica A 388 (2009) 4171-4178.
[5] Irinel Morarescu and Antoine Girard, 49th IEEE Conference on Decision and Control, December 15-17, 2010, Hilton Atlanta Hotel, Atlanta, GA, USA.
[6] D. Stauffer, Journal of Artificial Societies and Social Simulation 5, No. 1, paper 4 (2002) (jasss.soc.surrey.ac.uk); D. Stauffer, AIP Conference Proceedings 690, 147 (2003).
[7] Clelia M. Bordogna and Ezequiel V. Albano, "Dynamic behavior of a social model for opinion formation," Physical Review E 76, 061125 (2007).
[8] Alya Alaali, Maryam A. Purvis and Bastin Tony Roy Savarimuthu, "Vector Opinion Dynamics: An extended model for consensus in social networks," 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology.
[9] Fan Jia, Yun Liu, Fei Ding and Di Xie, "A Pheromone Model for Public Opinion Formation," 978-1-4211-55586-7/10/2010 IEEE.
[10] Shusong Li and Shiyong Zhang, "Leader and Follower: Agents in an Opinion Dynamics and Bounded Confidence Model on the Stochastic Movement World," 2010 Second International Conference on Computational Intelligence and Natural Computing (CINC).
[11] Andre L. M. Vilela and F. G. Brady Moreira, "Majority-vote model with different agents," Physica A 388 (2009) 4171-4178.
[12] Anahita Mirtabatabaei and Francesco Bullo, "On Opinion Dynamics in Heterogeneous Networks," American Control Conference, O'Farrell Street, San Francisco, CA, USA, June 29, 2011.
[13] H. Van Dyke Parunak, Elizabeth Downs and Andrew Yinger, Vector Research Center, Jacobs Engineering Group, Ann Arbor, MI 48105 USA, "Socially-Constrained Exogenously-Driven Opinion Dynamics," 2011 Fifth IEEE International Conference on Self-Adaptive and Self-Organizing Systems.


































A Novel Ranking Algorithms for Ordering Web Search Results

Gaurav Agarwal
Research Scholar
Manav Bharti University
Solan, (H.P.)


Abstract- Web mining is the process of retrieving documents from the World Wide Web that are relevant to the user's needs. Search engines are used for this purpose; a search engine displays search results based on the relevancy of the documents to the given query. Relevancies are calculated
according to the popularity of the document (Popularity
score) and the occurrence of the query keywords (Content
score) inside the document. Ranking algorithms are used to
calculate the popularity score. There are different ranking
algorithms available in the literature. The basis of all the
ranking algorithms is the link structure of the web. This
paper provides an insight into popular search engines and
ranking algorithms available over the web.

Keywords: Link structure, PageRank, Search engine, Web
mining

I. INTRODUCTION
Web mining is used to search the content of the Web, to perform link analysis and to identify users' past behavior in order to predict future usage of the web.
Based on the above, web mining is divided into three
categories such as Web Content Mining (WCM), Web
Structure Mining (WSM), and Web Usage Mining (WUM)
[1, 2, 3]. WCM discovers the useful information from the
web documents by applying some traditional data mining
techniques. WSM deals with the discovery of relationships
between web pages by analyzing web structures.
Relationship is calculated using the linkage among the
documents. Based on this relationship, the importance of a
page can be calculated. WUM mines user log files to
identify users' behavior in viewing web pages. This information is helpful for predicting users' behavior and making future decisions.
All existing search engines perform Web
Structure Mining using in-links and out-links of the web
pages to identify the popularity of a page. Based on the
popularity, ranks are assigned to the web documents. A
page is more important if it is pointed to by many pages. Using this concept as a base, many algorithms have been devised to rank pages according to their importance.
Different WSM algorithms were devised by researchers
to calculate the popularity score.

Proc. of the Intl. Conf. on Computer Applications, Volume 1. Copyright 2012 Techno Forum Group, India. ISBN: 978-81-920575-8-3 :: doi: 10.73389/ISBN_0768 :: ACM #: dber.imera.10.73389

One such algorithm is the PageRank [4] algorithm, which has proved to be a very effective algorithm for computing the popularity score of the
results of search engines. Improvement in the PageRank
algorithm is done by HITS [5] using the concept of
authorities and hubs. Authoritative pages have more
number of incoming links and hub pages have more
number of outgoing links. SALSA [6] combines the
random walk method of PageRank along with the hub and
authority concept of HITS. Topic Sensitive Page Rank
algorithm [7] computes a set of PageRank vectors, biased
using a set of representative topics, to capture more
accurately the notion of importance with respect to a
particular topic. Weighted PageRank algorithm[8] is an
improvement of page rank algorithm in which the higher
scores are assigned to more important pages instead of
dividing the page rank equally to all the out-link pages.
Associated PageRank algorithm [9] orders search results
using the scores calculated based on the relevance between
web documents.
Rank of a page can be calculated offline or online.
When calculated online, the user will have to wait for a
longer time for getting the results. To reduce the online
computational time, popular search engines calculate the
popularity score (Link score) of the page offline and
content score online. During query processing, content
score is calculated based on the occurrence of query terms
in different parts of the web page. The calculated content
score is then multiplied with the existing offline popularity
score to produce the rank of each web page.
The remainder of this paper is organized as follows:
Section 2 gives an overview of the related work. Section 3
discusses the search engines available. Section 4 discusses
the ranking algorithms available in the literature. Section 5
concludes this paper and future work is also mentioned.

II. RELATED WORK
A significant amount of research has recently focused
on link-based ranking techniques, i.e., techniques that use
the hyperlink (or graph) structure of the web to identify
interesting pages or relationships between pages. The
success of such link-based ranking techniques has also
motivated a large amount of research focusing on the basic
structure of the web [10], efficient computation with
massive web graphs [11], and other applications of link-
based techniques such as finding related pages [12],
classifying pages [13], crawling important pages [14] or
pages on a particular topic [15], or web data mining [16].
One such link based technique is the PageRank
technique used by the Google search engine [4], which
assigns a popularity score to each page on the web based
on the number and importance of other pages linking to it.
Another approach introduced by Kleinberg [5] is
the HITS algorithm which calculates the popularity score
based on the authority score and hub score for a web page
and this method has subsequently been modified and extended [17, 18, 19, 20]. These approaches first identify pages of interest through term-based techniques and then analyse only the graph neighborhood of these pages.
Both PageRank and HITS algorithms are based on
an iterative process defined on the web graph that assigns
higher scores to pages that are more central. One primary
difference is that HITS runs at query time. The advantage of
query time calculation is that the ranking process is tuned
towards a particular query, e.g., by incorporating term-
based techniques.
The main drawback of online calculation is the
significant overhead in performing an iterative process for
each of the thousands of queries per second that are
submitted to the major search engines. PageRank, on the
other hand, is independent of the query posed by a user;
this means that it can be pre-computed and then used to
optimize the layout of the inverted index structure
accordingly.
SALSA [6] combines the random walk method of
PageRank along with the hub and authority concept of
HITS. Topic Sensitive PageRank algorithm [7] computes a
set of PageRank vectors instead of a single PageRank
vector used by PageRank algorithm, each vector
corresponds to a specific topic. The Weighted PageRank algorithm [8] assigns higher scores to more important pages instead of dividing the rank of a page equally among all the out-link pages. The Associated PageRank algorithm [9]
calculates the similarity among the documents and based
on this similarity the rank of a web page is calculated.

III. SEARCH ENGINES
3.1 Working Methodology
The following are the operations done by the
search engines,
1. Crawling the web pages
2. Creating an Index
3. Page rank calculation
4. Query processing
5. Ordering search results
6. Displaying search results
It is necessary for the search engines to maintain an
index of all the web pages along with their popularity
score. Web crawlers are used to collect the web pages
hosted on the web. Different types of crawlers employ
different mechanisms to crawl the web. Usually, the root
set of pages is given as an input to the crawler and it
follows the links on each root page. Limitations on the
number of pages to be crawled can be set as stopping
criteria for the crawler program. Once the pages are crawled, they are stored in a repository and indices are created that point to the repository. Using the indices and
repository, popularity score is calculated by generating a
web graph with all web pages as nodes and links as edges.
Directory search engines store only a description of each web page instead of storing the entire web page content.
When the user issues a query to a search engine, it
parses the query by removing all stop words. With the
keywords extracted from the query, it searches for the web
pages which have those keywords in it. Searching of these
pages is done by first converting the keywords into their
corresponding word IDs and then searching it in the word
index. Documents pointed to by these word IDs are taken
and ordered. Ordering of the pages thus retrieved is done
by identifying the occurrence of the keywords in different
parts of the page and using the popularity score which was
pre-computed and available with each document. Steps 3
and 4 can be interchanged if the page rank is calculated
online.

3.2 Types of Search Engines
Search engines are broadly classified into four
types:
i. Spider or Crawler based search engines
ii. Directory based search engines
iii. Hybrid Search Engines
iv. Meta Search Engines
3.2.1 Spider or Crawler based search engines
Many existing search engines, such as Google, use software programs called spiders or crawlers [21] which crawl the web to collect web pages and store the collected pages in a repository and in an index. Initially, spiders are given a root set of pages and then fetch further web pages by following the links on the root set. After the web pages are fetched, the spiders form an index by extracting useful information from them. The extracted information is stored in the respective indexes, namely a document index (forward index) and a word index (inverted index).
In a forward index, information about each document is stored with the DocID as the unique field. Each record of a forward index includes the contents of the tags and the bag of wordIDs, which point to the wordIDs in an inverted index. The inverted index stores the details of all the words together with their occurrences in every document. It is built from the forward index by sorting the contents of the barrels in the order of word IDs, and consists of document lists, one list for each word, grouped into barrels by word IDs.
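As an illustration of the forward and inverted index structures described above, the simplified sketch below builds both from a few toy documents; real engines additionally store tag context and positions and pack the postings into barrels.

from collections import defaultdict

# Toy forward and inverted indexes (illustrative; real engines also store
# tag context and positions and sort postings into barrels by word ID).
docs = {
    1: "fruit apple fruit market",
    2: "apple pen",
    3: "fruit juice shop",
}

forward_index = {}                 # doc ID -> bag of words with frequencies
inverted_index = defaultdict(dict) # word -> {doc ID: term frequency}

for doc_id, text in docs.items():
    bag = defaultdict(int)
    for word in text.split():
        bag[word] += 1
    forward_index[doc_id] = dict(bag)
    for word, freq in bag.items():
        inverted_index[word][doc_id] = freq

print(forward_index[1])               # {'fruit': 2, 'apple': 1, 'market': 1}
print(dict(inverted_index["apple"]))  # {1: 1, 2: 1}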
3.2.2 Directory based search engines
A directory based search engine [22], such as the Open
Directory Project relies on humans for reviewing the web
pages. Previously yahoo used to be a directory based
search engine and now it also uses crawlers to collect web
pages. In directory based search engines, a directory gets
the information about every site from submissions, which
include a short description to the directory. A search
through this kind of search engine looks for matches only
in the submitted descriptions. Changing of web pages does
not have any effect on the listing of the pages.
3.2.3 Hybrid search engines
Hybrid search engines combine the best features of the
crawler based search engines and the directory based
search engines. In this, crawler program is used to crawl
the web pages as in the crawler based search engines and
the crawled pages are reviewed by a reviewer as in the
directory based search engines.
3.2.4 Meta search engines
Meta search engines collect web pages from
different search engines and display it to the user [23]. It
does not maintain an index of its own. Instead, it collects
the pages dynamically after receiving the query from the
user. There are two ways to order the collected search
results.
i. Using the ranks provided by the search engines
from which the pages are retrieved
ii. Calculating a new page rank after collecting the
pages
3.3 Sample Search Engines
Since the content of the web is changing regularly, it is
better to use crawler based search engines or the
combination of crawler based and directory based search
engines. Following are some of the popular crawler based
search engines:
1. Google
2. Gigablast
3. Lycos
4. Yahoo
3.3.1 Google
Google search engine has two important features
that helps it to produce high precision results. First, it
calculates popularity score of each page using the link
structure statically. It then calculates the content score
based on the occurrence of a keyword in the page. Finally,
the content score is added with the popularity score which
was calculated earlier to produce the final page rank. It
uses around 200 factors for finding the occurrence of a
keyword. Based on the position of occurrence of a
keyword in a document, different weights are given. For
example, higher weight is added when the keyword occurs
in a title of a document.
It has a repository of thousands of millions of web
pages. It creates an index structure in which each word is
having its word ID and each document is assigned with the
doc ID.
The forward index consists of barrels and each barrel
has its own range of word IDs. A record in a barrel consists
of a doc ID followed by word IDs with the details of the
document.
The inverted index is used to store the details of all the
words with the occurrence of it in every document. The
index is made from the forward index by sorting the
contents of the barrels in the order of word IDs. Inverted
index consists of document lists, one list for each word and
grouped into barrels by word IDs.
Each doc ID in an inverted index is associated with an
array having three values. First value indicates the
occurrence of a term in the title tag of the corresponding
document which can be 0 or 1. This value is 1 when the
term appears in the title tag and 0 when it does not occur in
the title tag. Second value gives the occurrence of a term in
the meta tag of a document. This value is 1 when the term
appears in the meta tag and 0 when it does not occur in the
meta tag. Third value indicates frequency of the term in the
document.

Term 11 (fruit) - 33 [1, 0, 20], 47 [1, 0, 11], 110 [0, 1,
10]

Term 25 (apple) - 36 [1, 1, 5], 47 [1, 0, 7], 110 [1, 0,
20]

Term 30 (pen) - 10 [1, 0, 16], 66 [0, 1, 14]
When the user enters a query such as fruit apple,
google extracts the documents which contain the terms
fruit or apple. Content score is calculated for the extracted
documents of the query fruit apple such as

Term 11(fruit)-(1+0+20 ) x ( 1+0+11 ) x ( 0+1+10 )
=2772

Term 25(apple)-(1+1+5 ) x ( 1+0+7 ) x ( 1+0+20 ) =
1176

These content scores are added with the
popularity score which was calculated using the link
structure of the web page.
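The content-score arithmetic shown above can be reproduced with a small snippet; this mirrors only the illustrative scoring in this example, not the full set of roughly 200 factors mentioned earlier.

# Reproduce the illustrative content-score calculation above:
# each posting is [in_title, in_meta, term_frequency] for one document.
postings = {
    "fruit": {33: [1, 0, 20], 47: [1, 0, 11], 110: [0, 1, 10]},
    "apple": {36: [1, 1, 5], 47: [1, 0, 7], 110: [1, 0, 20]},
}

def content_score(term):
    score = 1
    for doc_id, (title, meta, freq) in postings[term].items():
        score *= (title + meta + freq)      # multiply the per-document sums
    return score

print(content_score("fruit"))   # 21 * 12 * 11 = 2772
print(content_score("apple"))   # 7 * 8 * 21 = 1176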

3.3.2 Gigablast
Gigablast is a web search engine with over 2 billion
pages in its index. Gigablast offers fast service, relevant
search query results, and a couple of extra search features.
Gigablast's advantage is the dynamic generation of related topics on a per-query basis. Basically, Gigablast offers search suggestions based on the search query. Using these suggestions, users can refine their queries
by adding terms so that the search can be narrowed down
to a specific topic. It also uses the link structure of the web
to identify the popularity of the web page. The number of
factors used to identify the occurrence of a keyword in a
document is very less when compared to google.
3.3.3 Lycos
Lycos is one of the longest-established search engines. It keeps up through strong relevancy ranking capabilities and a mix of features. As is the trend with the major
search tools, Lycos is a conglomeration of databases,
online services and other Internet properties. It uses T-Rex
spider to crawl the web and does not weigh meta tag
heavily. It uses only URL, title and word frequency for
ranking. Sponsored entries in the lycos page are provided
by overture. Category links are provided by open directory
and web site links are provided by fast search.
3.3.4 Yahoo
It is a directory based search engine and it uses the page
rank algorithm to arrange the web documents in a specific
order. Compared to Google, it uses fewer features when calculating the relevancy of a document to the keyword. In Yahoo, some search results are displayed
along with an image which provides identification for that
document. Like google, it also allows the user to form the
query using special characters and operators but the
options are only limited. It displays the categories or
clusters such as wikipedia on a screen so that the user can
select the relevant document easily. It provides guidance
for framing the query by adding keywords with the already
given query.
IV. RANKING ALGORITHMS
One of the major challenges in information retrieval is
the ranking of search results. In the context of web search,
where the data is massive and queries rarely contain more
than three terms, most searches produce a large collection of results. Since the majority of search engine users examine only the first few result pages [12], effective ranking algorithms are key to satisfying users' needs.
Leading search engines use different features for
ranking the search results. The features include textual
similarity between query and documents, the popularity of
documents and finally hyperlink between web pages,
which is viewed as a form of peer endorsement among
content providers. There are many research works going on
in link-based ranking algorithms. Most of the research
works has centered around proposing new link-based
ranking algorithms or improving the computational
efficiency of existing ones.
The following are the popular ranking algorithms:
i. PageRank
ii. HITS
iii. SALSA
iv. Topic Sensitive PageRank
v. Weighted PageRank
vi. Relevance based PageRank
4.1 PageRank
PageRank algorithm [4] is a commonly used
algorithm in Web Structure Mining. It measures the
importance of pages by analyzing the links [24]. PageRank was developed at Google and is named after Larry Page, Google's co-founder.
Google first retrieves a list of pages which are relevant
pages to a given query based on the occurrence of
keywords in URL and specific tags. Then it uses popularity
score which was calculated independent of the query to
adjust the results so that more relevant pages are likely to
be displayed at the beginning of the list of search results
[25].
The PageRank algorithm works by considering the in-degree of each page. Each page distributes its rank to all the pages it links to (its out-links): the more out-links a page has, the smaller the share of its rank passed to each of them, and a page with more in-links is likely to obtain a higher ranking.

   PR(u) = (1 - d) + d ( PR(P_1)/L(P_1) + PR(P_2)/L(P_2) + ... + PR(P_n)/L(P_n) )

where P_1, ..., P_n are the pages linking to page u, PR(P_i) is the PageRank (popularity score) of page P_i, L(P_i) is the number of out-links of page P_i, and d is the damping factor, normally set to 0.85.
Figure 1 shows a simple distribution of PageRank. The ranks of pages P4 and P5 are set using the ranks of the pages that link to them; P5 is assigned the higher rank since it has more in-links. The ranks are calculated as

   PR(P4) = 0.15 + 0.85 (0.5 + 0.33) = 0.855

   PR(P5) = 0.15 + 0.85 (0.5 + 0.33 + 0.5) = 1.281

Figure 1: PageRank calculation
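A compact power-iteration sketch of the PageRank formula above is given below; the five-page link graph is made up for illustration and d is the usual 0.85.

# PageRank power-iteration sketch on a small, made-up link graph.
# links[p] lists the pages that page p points to; d is the damping factor.
links = {
    "P1": ["P4", "P5"],
    "P2": ["P5"],
    "P3": ["P4", "P5"],
    "P4": ["P5"],
    "P5": ["P1"],
}
d, iters = 0.85, 50
pr = {p: 1.0 for p in links}                      # initial ranks

for _ in range(iters):
    new_pr = {}
    for page in links:
        incoming = [q for q in links if page in links[q]]
        new_pr[page] = (1 - d) + d * sum(pr[q] / len(links[q]) for q in incoming)
    pr = new_pr

print({p: round(v, 3) for p, v in pr.items()})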
4.2 HITS
HITS (Hypertext Induced Topic Selection) [5]
algorithm is query dependent. After receiving a query from
the user, it collects the pages that have the keywords of the query and forms a graph of authorities and hubs. Web pages pointed to by many hyperlinks are called authorities, while web pages that point to many other pages are called hubs. A strong authority is a page that has links from many highly scored hubs, and a popular hub is a page that points to highly scored authorities.

Steps
1. Construct an adjacency matrix A from the neighborhood graph N, which indicates the connectivity of all the nodes.

2. Calculate the authority vector v from the adjacency matrix using

   v_k = (A^T . A) . v_(k-1)

3. Calculate the hub vector u from the adjacency matrix using

   u_k = (A . A^T) . u_(k-1)

4. Rank the pages using the hub and authority vectors so formed.

A major advantage of the HITS algorithm is its dual rankings: HITS presents two ranked lists to the user, one with the more authoritative pages related to the query and the other with the best hub documents. The authority score can be used when the search is research-oriented, while the hub score can be used when the search is broad. A disadvantage of HITS is that the algorithm is query-dependent: at query time the neighborhood graph must be constructed and at least one eigenvector must be computed. Another disadvantage of HITS is its susceptibility to spamming: by adding links to and from a web page it is possible to change the hub and authority scores, and since the hub and authority scores are interdependent, increasing the hub score by introducing more out-links on a page automatically increases the authority scores of the pages it points to.
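A minimal HITS iteration sketch is shown below, using an assumed toy adjacency matrix and normalizing the scores at each step.

import numpy as np

# HITS sketch: iterate authority and hub scores on a toy adjacency matrix.
# A[i, j] = 1 means page i links to page j (graph chosen arbitrarily).
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

auth = np.ones(A.shape[0])
hub = np.ones(A.shape[0])
for _ in range(50):
    auth = A.T @ hub                 # pages pointed to by good hubs
    hub = A @ auth                   # pages pointing to good authorities
    auth /= np.linalg.norm(auth)     # normalize to keep scores bounded
    hub /= np.linalg.norm(hub)

print("authorities:", np.round(auth, 3))
print("hubs:       ", np.round(hub, 3))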
4.3 SALSA
SALSA, the Stochastic Approach for Link-Structure
Analysis [6] is based on the theory of Markov chains, and
uses the stochastic properties of random walks done on a
collection of pages. It combines the random walk method
of PageRank with hub and authority technique of HITS.
The meta algorithm used by both HITS and SALSA is
similar but the basic difference between the two methods is
the formation of the association matrix. HITS algorithm
considers the tight connection between the nodes of the
graph but SALSA considers light connection by
performing random walk in the graph. It overcomes the
disadvantages of HITS such as spamming and topic drift
by introducing less interdependence between hub and
authority scores.
The input to SALSA contains a collection of web pages
C which are returned by a search engine for a given query
t. Intuition suggests that authoritative pages on query t
should be visible from many pages in the subgraph induced
by C. Thus, a random walk on this subgraph will visit t-
authorities with high probability.
The following figure shows the neighborhood graph
and the bipartite graph which includes the hub side and
authority side.


Figure 2a: Bipartite graph.   Figure 2b: Hubs and authorities.

Steps
1. A bipartite graph is drawn with hubs on one side and authorities on the other:
   i. the hub side contains the nodes V_h that have out-degree greater than zero, and
   ii. the authority side contains the nodes V_a that have in-degree greater than zero.

2. The adjacency matrix L of the neighborhood graph N is formed.

3. The hub and authority matrices are formed as

   H = L_r L_c^T
   A = L_c^T L_r

   where L_r is L with each non-zero row divided by its row sum, and L_c is L with each non-zero column divided by its column sum.

4. Eigenvectors are computed from the hub and authority matrices.

5. Based on the hub and authority vectors, the pages are ranked.
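The matrix construction in these steps can be sketched as follows, on an assumed toy graph; the principal eigenvectors of H and A (obtained here by power iteration) give the hub and authority scores.

import numpy as np

# SALSA sketch: build row- and column-normalized link matrices and take the
# principal eigenvectors of the hub and authority chains (toy graph assumed).
L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

row_sums = L.sum(axis=1, keepdims=True)
col_sums = L.sum(axis=0, keepdims=True)
Lr = np.divide(L, row_sums, out=np.zeros_like(L), where=row_sums > 0)
Lc = np.divide(L, col_sums, out=np.zeros_like(L), where=col_sums > 0)

H = Lr @ Lc.T          # hub chain
A = Lc.T @ Lr          # authority chain

def principal_eigvec(M, iters=200):
    v = np.ones(M.shape[0])
    for _ in range(iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return v

print("hub scores:      ", np.round(principal_eigvec(H), 3))
print("authority scores:", np.round(principal_eigvec(A), 3))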

4.4 Topic Sensitive Page Rank
In the traditional PageRank algorithm, for
ordering search results a single PageRank vector is
computed using the link structure of the Web. This is
calculated independent of a particular query. To yield more
accurate search results, a set of PageRank vectors are
computed where each vector corresponds to a particular
topic[7]. These vectors are calculated offline. By using the
computed vectors, importance of pages is calculated at
query time. During query processing time, these
importance scores are combined based on the topics of the
query to form a composite PageRank score for those pages
matching the query. For ordinary keyword search queries,
Topic-sensitive PageRank scores are computed for pages
returned for a given query using the terms in the query.

Steps
1. A non-uniform damping vector p = v_j is used, where v_ji = 1/|T_j| if i belongs to T_j and v_ji = 0 otherwise, and T_j is the set of URLs in the Open Directory Project category c_j.

2. Given a query q, let q' be the context of q (q' includes the terms of the query) and let q_i' be the ith term in the query context q'. For the given query, the following is calculated for each category c_j:

   P(c_j | q') = P(c_j) P(q' | c_j) / P(q') ∝ P(c_j) ∏_i P(q_i' | c_j)

   where P(q_i' | c_j) is calculated from the class term-vector.

   Using the text index, the URLs of all documents containing the original query terms q are retrieved.

   Finally, the query-sensitive importance score of each of these retrieved URLs d is calculated as

   S_qd = Σ_j P(c_j | q') . rank_jd
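A small sketch of the final score-combination step S_qd is given below, assuming that the per-topic PageRank values rank_jd were computed offline and the topic probabilities P(c_j | q') online; all numbers are made-up placeholders.

# Combine precomputed topic-specific PageRank vectors into a
# query-sensitive score S_qd = sum_j P(c_j | q') * rank_jd.
# All numbers below are illustrative placeholders.
topic_prob = {"Sports": 0.7, "Health": 0.2, "Business": 0.1}   # P(c_j | q')

# rank[topic][doc] : topic-specific PageRank computed offline
rank = {
    "Sports":   {"d1": 0.30, "d2": 0.05, "d3": 0.10},
    "Health":   {"d1": 0.02, "d2": 0.25, "d3": 0.08},
    "Business": {"d1": 0.01, "d2": 0.04, "d3": 0.20},
}

docs = ["d1", "d2", "d3"]
scores = {d: sum(topic_prob[c] * rank[c][d] for c in topic_prob) for d in docs}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))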

4.5 Weighted PageRank
In all the above methods, the rank of a page is equally
distributed to the out-link pages. Weighted PageRank
algorithm [8] assigns larger rank values to more important
(popular) pages instead of dividing the rank value of a page
evenly among its out-link pages. Each out-link page gets a
value proportional to its popularity (its number of in-links
and out-links). The popularity from the number of in-links
and out-links is recorded as W_in(v,u) and W_out(v,u), respectively.
W_in(v,u) is the weight of link (v,u), calculated based on the number of in-links of page u and the number of in-links of all reference pages of page v:

   W_in(v,u) = I_u / Σ_{p in R(v)} I_p

where I_u and I_p represent the number of in-links of page u and page p, respectively, and R(v) denotes the reference page list of page v.
W_out(v,u) is the weight of link (v,u), calculated based on the number of out-links of page u and the number of out-links of all reference pages of page v:

   W_out(v,u) = O_u / Σ_{p in R(v)} O_p

where O_u and O_p represent the number of out-links of page u and page p, respectively.
The PageRank formula is modified as

   PR(u) = (1 - d) + d Σ_{v in B(u)} PR(v) W_in(v,u) W_out(v,u)

where B(u) is the set of pages that link to page u.
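A small sketch of this weighted variant on an assumed toy graph is given below; the in-link and out-link weights follow the formulas above and the ranks are iterated in the same way as for standard PageRank.

# Weighted PageRank sketch: rank is distributed according to in-link and
# out-link weights W_in and W_out instead of being split evenly (toy graph).
links = {
    "P1": ["P2", "P3"],
    "P2": ["P3"],
    "P3": ["P1", "P4"],
    "P4": ["P3"],
}
pages = list(links)
in_count = {p: sum(p in out for out in links.values()) for p in pages}
out_count = {p: len(links[p]) for p in pages}

def w_in(v, u):   # I_u / sum of I_p over the reference pages of v
    return in_count[u] / sum(in_count[p] for p in links[v])

def w_out(v, u):  # O_u / sum of O_p over the reference pages of v
    return out_count[u] / sum(out_count[p] for p in links[v])

d = 0.85
pr = {p: 1.0 for p in pages}
for _ in range(50):
    pr = {u: (1 - d) + d * sum(pr[v] * w_in(v, u) * w_out(v, u)
                               for v in pages if u in links[v])
          for u in pages}
print({p: round(s, 3) for p, s in pr.items()})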

4.6 Relevance based PageRank
This algorithm [9] orders search results by calculating the relevance between web documents. It achieves a higher degree of relevance than the original algorithm and reduces the query-time effort of Topic-Sensitive PageRank. A document Di is identified by a set of frequent
terms Tj and the terms may be weighted by Wj according
to their importance in each document. In order to measure
the similarity between different documents Dp and Dq, the
m most frequent terms are retrieved from documents.
Two different scenarios are used to calculate the similarity among the documents. In the first scenario, the most frequent term sets (MFTS) from different documents are compared with the most frequent term sets of web pages in ODP (Open Directory Project) categories; in this way, the relevance degree of the target documents for a certain category is calculated.
In the second scenario, the semantic distance between the most frequent term sets of the documents Dp and Dq is used to compute the content relevance.
Scenario I

1. Get categories C1, C2, ..., Cn from ODP; n is the number of categories in ODP.

2. Extract the web pages of each category to get the most frequent term sets (MFTS).

Ci = {Ti,j | Ti,j is the jth most frequent term of the web pages in category i, where i ranges from 1 to n and j ranges from 1 to m}

n is the number of categories, m is the number of terms in the MFTS.

C1: T1,1, T1,2, ..., T1,m
C2: T2,1, T2,2, ..., T2,m
...
Cn: Tn,1, Tn,2, ..., Tn,m

Weights (W) can be assigned to each term as follows:
C1: W1,1, W1,2, ..., W1,m
C2: W2,1, W2,2, ..., W2,m
...
Cn: Wn,1, Wn,2, ..., Wn,m

3. Get the MFTS of page A: TA,1, TA,2, ..., TA,p.

4. The relevance function for page A to category i is

R(TA,h, Ti,k) = Wi,k / Distance(TA,h, Ti,k)

where h ranges from 1 to p, i ranges from 1 to n, and k ranges from 1 to m;
TA,h is the hth most frequent term of page A;
Ti,k is the kth most frequent term of category i;
Distance(TA,h, Ti,k) is the distance function obtained by comparing the semantic distance via WordNet.

5. RA(h,i) = Max (over k) [ R(TA,h, Ti,k) ],
where h ranges from 1 to p and i ranges from 1 to n.
Thus we can get RA(1,i), RA(2,i), RA(3,i), ..., RA(p,i) for each category i.

6. Get the relevance score of page A in each category Ci:

SA(i) = SUM (over h = 1 to p) TFh * RA(h,i),  for i = 1 to n

where TFh is the hth term frequency for the MFTS of page A. In this way, SA(1), SA(2), ..., SA(n) are computed.

7. Repeat steps 3 to 6 to get the relevance score of page B in each category: SB(1), SB(2), ..., SB(n). Then

RAB = ( SUM (over K = 1 to n) SA(K) * SB(K) ) / n
Scenario II

1. Get the MFTS of page A: TA,1, TA,2, ..., TA,n.

2. Get the MFTS of page B: TB,1, TB,2, ..., TB,m.
.

3. R(TA,h, TB,g) = 1 / Distance(TA,h, TB,g),
where h ranges from 1 to n and g ranges from 1 to m;
TA,h is the hth most frequent term of page A;
TB,g is the gth most frequent term of page B.

4. RAB = ( SUM (over h = 1 to n) SUM (over g = 1 to m) [ TFA,h * TFB,g * R(TA,h, TB,g) ] ) / ( SUM (over h = 1 to n) SUM (over g = 1 to m) [ TFA,h * TFB,g ] )

where TFA,h is the hth term frequency of page A and TFB,g is the gth term frequency of page B.

The distance function is:

Distance(TA,h, TB,g) = 1 if TA,h = TB,g; otherwise it is the semantic distance between the two terms (obtained via WordNet, as in Scenario I).

The associated PageRank is defined as:

PR(Vi) = (1 - d) + d * SUM (over j in In(Vi)) [ Rji * PR(Vj) / SUM (over k in Out(Vj)) Rjk ]

where In(Vi) is the set of pages linking to Vi, Out(Vj) is the set of pages that Vj links to, and Rji is the relevance value of page j to page i.

The advantage of this method is that the similarity among the documents is considered, so that more relevant documents are displayed before less relevant ones. Its computational complexity is the main disadvantage of this method.
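As an illustration of Scenario II, the sketch below computes the content-relevance score RAB between two pages from their most frequent terms; the term lists, frequencies and the trivial distance function are invented placeholders (a real system would use WordNet semantic distances).

# Invented most frequent term sets with term frequencies for two pages.
page_a = {"court": 5, "land": 3, "injunction": 2}     # term -> TFA,h
page_b = {"land": 4, "possession": 2, "court": 1}     # term -> TFB,g

def distance(t1, t2):
    # Stand-in for the WordNet semantic distance: 1 for identical terms,
    # an arbitrary larger value otherwise.
    return 1.0 if t1 == t2 else 4.0

def relevance(t1, t2):
    # R(TA,h, TB,g) = 1 / Distance(TA,h, TB,g)
    return 1.0 / distance(t1, t2)

def r_ab(a, b):
    num = sum(tf_a * tf_b * relevance(ta, tb)
              for ta, tf_a in a.items() for tb, tf_b in b.items())
    den = sum(tf_a * tf_b for tf_a in a.values() for tf_b in b.values())
    return num / den

print(round(r_ab(page_a, page_b), 3))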
V. ACKNOWLEDGEMENT
I would like to thank my official supervisors and
Professor Dr. Prashant Kumar Pandey for all their help,
advice and encouragement.

VI. CONCLUSION AND FUTURE WORK
This paper presents an overview of existing search engines and ranking algorithms. Search engines retrieve documents that contain the query terms and rank these documents using a popularity score and a content score. In all search engines the popularity score is calculated based on the link structure. In the traditional PageRank algorithm, the rank of a page is distributed evenly among its out-link pages. HITS produces two different ranks for each web page, namely an authority score and a hub score. SALSA combines the random-walk method of PageRank with the authority and hub concepts of HITS; the difference between HITS and SALSA is that the interdependencies between the authority score and the hub score are removed in SALSA. Topic-Sensitive PageRank calculates a set of PageRank vectors instead of a single PageRank vector, where each vector corresponds to a particular topic. It calculates PageRank at query time,
which increases the computational complexity. The Weighted PageRank algorithm assigns weights to the links based on the number of in-links and out-links, so that the rank is distributed according to popularity. Similarity among the documents is considered in the Associated PageRank algorithm when calculating the PageRank. Among the above-mentioned algorithms, the Associated PageRank algorithm brings relevant documents to the beginning of the result list, since the popularity score of a document depends on the links from similar pages. In future work, a new method can be introduced to calculate the similarity among the linked pages using different similarity functions.

REFERENCES
[1]. R. Kosala and H. Blockeel, Web mining
research: A survey, ACM SIGKDD Explorations,
2(1):115, 2000.
[2]. S. Madria, S. S. Bhowmick, W. K. Ng, and E.-P.
Lim. Research issues in web data mining, In
Proceedings of the Conference on Data Warehousing
and Knowledge Discovery, pages 303319, 1999.
[3]. S. Pal, V. Talwar, and P. Mitra, Web mining in
soft computing framework : Relevance, state of the art
and future directions, IEEE Trans. Neural Networks,
13(5):11631177, 2002.
[4]. S. Brin and L. Page, The anatomy of a large-
scale hypertextual web search engine, In Proc. of the
Seventh World Wide Web Conference, 1998.
[5]. J. Kleinberg, Authoritative sources in a
hyperlinked environment, Proc. of the 9th ACM-
SIAM Symposium on Discrete Algorithms, pages 668-
677, January 1998.
[6]. R. Lempel and S. Moran, SALSA: The
Stochastic Approach for Link-Structure analysis,
ACM Transactions on Information Systems, Vol. 19,
No. 2, Pages 131160, April 2001.
[7]. Taher H. Haveliwala, Topic Sensitive Page
Rank, WWW2002, May 711, 2002.
[8]. Wenpu Xing, Ali Ghorbani, "Weighted PageRank
Algorithm", cnsr, pp.305-314, Second Annual
Conference on Communication Networks and Services
Research (CNSR'04), 2004
[9]. Chia-Chen Yen, Jih-Shih Hsu, Pagerank
Algorithm Improvement by Page Relevance
Measurement, Journal of Convergence Information
Technology, Volume 5, Number 8, October 2010.
[10]. A. Broder, R. Kumar, F. Maghoul, P. Raghavan,
S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener,
Graph structure in the web: experiments and models
In 9th Int. World Wide Web Conference, 2000.
[11]. K. Bharat, A. Broder, M. Henzinger, P. Kumar,
and S. Venkatasubramanian, The connectivity server:
Fast access to linkage information on the web, In 7th
Int. World Wide Web Conference, May 1998.
[12]. K. Bharat and M. Henzinger, Improved
algorithms for topic distillation in a hyperlinked
environment, In Proc. 21st Int. Conf. on Research and
Development in Inf. Retrieval (SIGIR), August 1998.
[13]. S. Chakrabarti, B. Dom, and P. Indyk, Enhanced
hypertext categorization using hyperlinks, In Proc. of
the ACM SIGMOD Int. Conf. on Management of Data,
pages 307-318, June 1998.
[14]. J. Cho, H. Garcia-Molina, and L. Page, Efficient
crawling through URL ordering, In 7th Int. World
Wide Web Conference, May 1998.
[15]. S. Chakrabarti, M. van den Berg, and B. Dom,
Focused crawling: A new approach to topic-specific
web resource discovery, In Proc. of the 8th Int. World
Wide Web Conference, May 1999.
[16]. R. Kumar, P. Raghavan, S. Rajagopalan, and A.
Tomkins, Extracting large-scale knowledge bases from
the web, In Proc. of the 25th Int. Conf. on Very Large
Data Bases, September 1999.
[17]. Ask Jeeves, Inc. Teoma search engine.
http://www.teoma.com.
[18]. S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg,
P. Raghavan, and S. Rajagopalan, Automatic resource
list compilation by analyzing hyperlink structure and
associated text, In Proc. of the 7th Int. World Wide
Web Conference, May 1998.
[19]. R. Lempel and S. Moran, The Stochastic
Approach for Link-Structure Analysis (SALSA) and the
TKC Effect, In Proc. of the 9th Int. World Wide Web
Conference, May 2000.
[20]. D. Zhang and Y. Dong, An efficient algorithm to
rank web resources, In Proc. of the 9th Int. World
Wide Web Conference, May 2000.
[21]. Marios D. Dikaiakos, Athena Stassopoulou,
Loizos Papageorgiou, An investigation of web crawler
behavior: characterization and metrics, Computer
Communications, Vol. 28, No. 8, pp. 880-897, 2005.
[22]. M.P.S.Bhatia, Divya Gupta, Discussion on Web
Crawlers of Search Engine, Proceedings of 2nd
National Conference on Challenges & Opportunities in
Information Technology (COIT-2008) RIMT-IET,
Mandi Gobindgarh. March 29, 2008.
[23]. Manoj and Elizebeth Jacub (2008), Information
retrieval on internet using meta-search engines: A
review, Journal of Science and Industrial Research,
Vol 67, pp: 739-746, 2008.
[24]. L. Page, S. Brin, R. Motwani, and T. Winograd,
The pagerank citation ranking: Bringing order to the
web, Technical report Stanford Digital Libraries
SIDL-WP-1999-0120, 1999.
[25]. C. Ridings and M. Shishigin, Pagerank
uncovered, Technical report, 2002.
PERFORMANCE & EVALUATION OF DSS USING OLAP TECHNIQUES IN
THE DATA WAREHOUSE

Renu Yadav
Research Scholar
Manav Bharti University
Solan, (H.P)


Abstract- In the recent era, the IT industry has introduced many techniques that provide better decision making using a data warehouse. Performance and evaluation are the main problems of databases; to avoid these problems, OLAP techniques are now in vogue. Traditional database techniques are effective for some kinds of applications, but data warehouse techniques are suitable for all types of applications. We analyze the control in a Decision Support System (DSS) based on a data warehouse, using OLAP techniques for the evaluation of different sets of alternatives for the purpose of decision making. The objective of this paper is to present a comparative analysis of OLAP techniques. The goal of the analysis is to find a strategy for better decision making using a data warehouse.
Keywords: Data warehousing, OLAP techniques, Decision Support System (DSS).
I. INTRODUCTION
The basic objective of the database is the freedom to share information. The judiciary in many countries is recognizing communication through the Internet as evidence.
decision support applications were developed and studied.
Researchers used multiple frameworks to help build and
understand these systems. Today one can organize the
history of DSS into the five broad DSS categories namely:
communication-driven, data-driven, document- driven,
knowledge-driven and model-driven decision support
systems.
The effectiveness of the judiciary system, and its very survival in such an environment, depends on decision-making processes that are productive, agile, innovative and reputable. Increasingly, the development and use of decision support systems need to be cognizant of the contextual aspects of the environment in which these decisional processes unfold.
This paper emphasizes legal-based decision support in turbulent and high-velocity environments, examining issues concerned with computer-based support of decision makers in challenging environments [9].

Proc. of the Intl. Conf. on Computer Applications
Volume 1. Copyright 2012 Techno Forum Group, India.
ISBN: 978-81-920575-8-3 :: doi: 10. 73396/ISBN_0768
ACM #: dber.imera.10. 73396

Our study focuses on legal based decision support
applications and research studies related to model and data-
oriented systems, management expert systems,
multidimensional data analysis, query and reporting tools,
online analytical processing (OLAP), group DSS and
document management. All of these technologies have
been used to support legal based decision making.
A: Decision Support System
A Decision Support System (DSS) is an umbrella term
used to describe any computer application that enhances
the user's ability to make decisions. More specifically, the
term is usually used to describe a computer-based system
designed to help decision-makers use data, knowledge and
communications technology to identify problems and make
decisions to solve those problems.
Decision support systems help firms achieve competitive advantage, and the evidence indicates that legal practitioners use sophisticated data-driven and document-driven decision support systems to obtain information that was buried for many years in filing cabinets or archived on computer tapes [7].
A decision web portal can provide access to
information from different systems, synchronize relevant
and personalized information, support collaboration and
extend decision support to plaintiffs and defendants [8].
Decision makers face a very large number of
heterogeneous contextual cues. Some of these pieces of
information are always relevant (time period, unpredicted
event, etc.) while others are only used in some cases
(number of problems in the case, number of acts in a case
etc.) [10].
B: Data Warehousing
A data warehouse is a subject-oriented, integrated,
time-variant and non-volatile collection of data in support
of management's decision making process [1]. In a table, the focus of analysis usually contains numeric data called measures; for example, the analysis of case numbers represents analysis needs in numeric form. A dimension represents the database application designed according to the needs of the individual user, and it may also have descriptive attributes such as the names of the parties, the type of problem, etc. In the 21st century, data has increased manifold, which has affected the performance of the
database and degraded the related applications. The idea of data warehousing is that the data stored for business analysis can be accessed most effectively by separating it from the data in the operational system [2]. Today's data warehousing systems support very sophisticated online analysis, including multi-dimensional analysis. Data mining is a technology that provides sophisticated analysis for applications such as the data warehouse [3, 4, 5, 6].
C: OLAP Techniques
The data warehouse and Online Analytical Processing (OLAP) play a key role in a legal intelligence system. With the increasing amount of spatial data stored in legal databases, how to utilize this spatial information to gain insight into legal data from the geospatial point of view is becoming an important issue for data warehousing and OLAP. OLAP systems allow the decision-making user to dynamically manipulate the data contained in a data warehouse. OLAP systems use a structured cube based on dimensions, measures and hierarchies. Hierarchies allow the user to see detailed as well as generalized data using the roll-up and drill-down operations. OLAP is a new and very promising area, both from a theoretical and a practical point of view, for describing OLAP operations over legal data. Depending on the queries, a legal OLAP system should capture different types of summarized data.
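To make the roll-up and drill-down operations concrete, the following pandas sketch aggregates a small invented fact table of legal cases along a Court -> Police Station hierarchy; the column names and sample rows are assumptions for illustration only, not the paper's actual data set.

import pandas as pd

# Invented fact table: one row per case; the measure is the case count.
cases = pd.DataFrame([
    {"court": "Judge Small Causes Lucknow",          "police_station": "Kaiserbagh",   "decided": "No"},
    {"court": "Judge Small Causes Lucknow",          "police_station": "Hazratganj",   "decided": "No"},
    {"court": "Civil Judge Senior Division Lucknow", "police_station": "Khurshedbagh", "decided": "No"},
    {"court": "Civil Judge Senior Division Lucknow", "police_station": "Kaiserbagh",   "decided": "Yes"},
])

# Drill-down: detailed view at the (court, police station) level.
detail = cases.groupby(["court", "police_station"]).size().rename("case_count")
print(detail)

# Roll-up: generalize along the hierarchy to the court level only.
rolled_up = cases.groupby("court").size().rename("case_count")
print(rolled_up)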

II. PERFORMANCE AND EVALUATION
OLAP Technique is a broad, interdisciplinary field
with a rich body of methods. The suitability of these
methods depends significantly on the data set they are
applied to; some problems may be solved with a simple
linear regression, while others demand for more
sophisticated techniques. As it is still unclear which OLAP
methods perform particularly well on which kind of
simulation performance data, a system for the convenient
generation of selection functions will have to provide
several alternatives. Performance databases are essential
tools for the thorough experimental analysis of algorithms,
as they support the experimenter in dealing with the
typically vast amounts of performance data as easily as
possible.
Fig. 1 shows the performance analysis of the database using OLAP techniques for the evaluation of problems.
Fig. 1 Performance of Database using OLAP Techniques
The above figure shows the analysis of database performance using OLAP techniques, in which different types of problems are collected and solutions to these problems are found.

Table 1
Legal Cases Background

Id; Plaintiff Name; Defendant Name; Court Name; Police Station Name; Problem; Relief Asked; Case Decided

1; Hussainabad And Allied Trust; Nisar Ahmad; Civil Judge Mohanlalaganj Senior Division Lucknow; Thakurganj; Nisar Has Occupied Land Of H. Trust Illegally; Suit For Possession & Permanent Injunction; No
2; Mohd Ismail; Shakeel Khan; 1st Additional Judge Small Causes Lucknow; Kaiserbagh; Mohd Want To Obtain House For Residential Purpose; Suit For Release Of The House In Case Of Tenant; No
3; R Kumar Rastogi; Prem Kumar Rastogi And Two Others; 2nd Additional Civil Judge Senior Division, Lucknow; Kaiserbagh; Want The Legal Ownership On The Land; Suit For Possession & Permanent Injunction; Yes
4; Aarifa Jamal; Mushtaq Alam; Judge Small Causes Lucknow; Khalabazar; Illegal Sub Let Of The Landlords Shop; Suit For Injunction & Recovery Of Arrears Of Rent; No
5; Shrimati Sarita; Prem Kumar Gupta; Additional Principal Judge Family Court Lucknow; Thakurganj; Deserted By Husband Illegally; Suit For Maintenance Amount Of Rs 10000 Per Month; Yes
6; Nirmal Kumar Badlani; Sudhashu Dinger; Civil Judge South Junior Division Luvknow; Khrshed Bagh; Defendant Want Illegal Possession On Owners Land; Suit For Possession & Permanent Injunction; No
7; Maharaja Agrasen Dharmath Trust; K C Gupta; Judge Small Causes Lucknow; Thakurganj; Want Back Possession On His Part Of House; Suit For Injunction & Recovery Of Arrears Of Rent; Yes
8; Bahar Alam; Hussainabad And Allied Trust; Civil Judge Junior Division Lucknow; Kaiserbagh; Defendant Is Throwing Plaintiff Forcefully; Suit For Possession & Permanent Injunction; No
9; Fazal Abbas; Hussainabad And Allied Trust Lucknow; Civil Judge South Junior Division Lucknow; Thakurganj; Wrong Sell Deed Is Prepared; Suit For A Permanent Injunction; No
10; Sunil Kumar Jolly; Messers Ansal Housing & Construction Ltd.; Civil Judge Senior Division Lucknow; Khurshedbagh; Forcefully Selling Land As Recovery Of Loan Amount; Suit For A Plot Of Land; No
11; Vinay Singh; State Of U.P. And 5 Others; 2nd Additional Civil Judge Senior Division Lucknow; Kaiserbagh; Breach Of Selling Contact Of Shop; Suit For Perpetual And Mandatory Injunction; Yes
12; Suresh Chandra Garg; Lala Pannalal Trust And Others; 3rd Additional Civil Judge Junior Division Lucknow; Kaiserbagh; Lack Of Interest Of Ex Party; Suit For A Permanent Injunction; No
13; Dayanand Seva Sansthan, Arya Samaj; Upkaram Karoti Society; Civil Judge North Denior Division Lucknow; Khurshedabagh; Want The Illegal Ownership On The Land; Suit For A Plot Of Land; No
14; Babulal Devprasad; Pannalal & Sons; Civil Judge Junior Division Lucknow; Kaiserbagh; Want Back Possession Of His Shop; Suit For A Plot Of Land; Yes
15; Triveni Singh; Hussainabad & Allied Trust; 2nd Additional Civil Judge Senior Division Lucknow; Thakurganj; Forcefully Selling Land Of Plaintiff; Suit For Possession And Permanent Injunction; Yes
16; Ms Geeta Singh; Chadra Shekhar; 12th Additional District Judge Lucknow; Kaiserbagh; Chandra Shehkar Is Alleging That There Is No Will; Suit For Grant Of Probate On The Basis Of Will; No
17; Ram Gupta; Anil Kumar Gupta; Judge Small Causes Lucknow; Hazratganj; Ejectment Of Shop In Tenancy Of Anil Kumar Gupta; Suit For Ejectment And Arrears Of Rent; No
18; Sanjay Verma; Deepak Singh; Judge Small Causes Lucknow; Kaiserbagh; Mr. Sharma Want To Do Medical Practice In His Shop; Suit For Ejectment Of Tenant; No
19; Dr. Rashid Anwar Siddiqui & Others; Dilkash Sehkari Awas Samiti & Others; Civil Judge Malihabad Senior Division Lucknow; Malihabad; Plaintiff Want To Set Aside Ex-Party Judgment; Suit For Setting Aside Ex-Party Judgment; No
20; Ram Chandra Rastogi & Another; Rajendra Prasad & Another; Civil Judge North Junior Division Lucknow; Thakurganj; Restraining To Take Illegal & Forceful Possession; Suit For A Permanent Injunction; Yes

Table 1 provides background data on the legal cases that came before the courts and clearly shows that the two groups are indeed comparable in terms of problems, relief and decision.
Experiment Design: The system is a database front-end
client that gives access to information about all cases on
Legal matters and decision estimation that have been coded
according to the classification categories problems,
estimation approach, study context, and data set.
Table 2 shows the court names and case numbers; the table gives the sum of case numbers with reference to each court name.
Court Name - Case Number
CIVIL JUDGE MOHANLALAGANJ SENIOR DIVISION LUCKNOW - 1.00
1ST ADDITIONAL JUDGE SMALL CAUSES LUCKNOW - 2.00
2ND ADDITIONAL CIVIL JUDGE SENIOR DIVISION, LUCKNOW - 3.00
ADDITIONAL PRINCIPAL JUDGE FAMILY COURT LUCKNOW - 5.00
CIVIL JUDGE SOUTH JUNIOR DIVISION LUVKNOW - 6.00
CIVIL JUDGE SOUTH JUNIOR DIVISION LUCKNOW - 9.00
CIVIL JUDGE SENIOR DIVISION LUCKNOW - 10.00
3RD ADDITIONAL CIVIL JUDGE JUNIOR DIVISION LUCKNOW - 12.00
CIVIL JUDGE NORTH DENIOR DIVISION LUCKNOW - 13.00
12TH ADDITIONAL DISTRICT JUDGE LUCKNOW - 16.00
CIVIL JUDGE MALIHABAD SENIOR DIVISION LUCKNOW - 19.00
CIVIL JUDGE NORTH JUNIOR DIVISION LUCKNOW - 20.00
CIVIL JUDGE JUNIOR DIVISION LUCKNOW - 22.00
2ND ADDITIONAL CIVIL JUDGE SENIOR DIVISION LUCKNOW - 26.00
JUDGE SMALL CAUSES LUCKNOW - 46.00
All COURT NAME - 210.00

The subjects were also told that to collect valid data, a
few rules had to be followed. First, they needed to use the
preconfigured development environment (e.g., the Eclipse
IDE). Next, they had to work independently: They could
not get help
from colleagues or the experimenters. Technical
questions to the latter had to be asked via e-mail. The
reason for this was twofold:
1. So that the subjects would not engage in
a technical conversation with the experimenters
and
2. So that answers were carefully recorded.

III. ACKNOWLEDGEMENT
I would like to thank my official supervisors and
Professor Dr. Prashant Kumar Pandey for all their help,
advice and encouragement.

CONCLUSION
This paper will help in explaining the origin of the
various technological aspects that are converging to
provide integrated support for legal practitioners working
alone and in teams to manage legal issues and make more
rational decisions. This study focuses on collecting more
firsthand cases and in building a more complete mosaic of
what was occurring in courts and in judicial system to
build and use DSS. OLAP and DSS can be collectively
applied to solve complex legal problems. OLAP techniques are used in making better-informed legislative decisions. The
objective of this paper is to throw light on the usefulness of
using OLAP techniques to improve the performance of
Decision Support System (DSS) in legal matters.

REFERENCES
[1]. W. Inmon. Building the Data Warehouse. John Wiley
& Sons, 2002.
[2]. Alexis Leon (2008). Enterprise Resource Planning
(Second Edition), chapter 2, Page no. 73, TMH
Publication.
[3]. Berson, A., Data Warehousing, Data Mining and OLAP, McGraw-Hill, 1997.
[4]. Michael, L.G. and Bel, G.R., Data mining - a powerful information creating tool, OCLC Systems & Services, Vol. 15:2, 1999, pp. 81-90.
[5]. Robert, S.C., Joseph, A.V. and David, B., Microsoft Data Warehousing, John Wiley & Sons, 1999.
[6]. Margaret, A.H. and Rod, H., Facilitating corporate knowledge: building the data warehouse, Information & Computer Security, Vol. 5:5, 1997, pp. 170-174.
[7]. Daniel. J. Power: Decision Support System Concepts
and Resources for managers, 2002, page 19 -24.
[8]. Manel Mora, Guisseppi Forgionne, and Jatinder N.D.
Gupta: Decision Making Support Systems:
Achievements, Trends & Challenges for the new
decade.
[9]. Frada Burstein and Clyde W. Holsapple, Decision
support systems in context, Published online: 29
February 2008.
[10]. Patrick Brezillon and Juliette Brezillon, Context-sensitive decision support systems in road safety, published online: 4 March 2008, Springer-Verlag 2008.


Development and Evaluation of a Document Clustering Engine
Monali B. Kulkarni
Computer Division
Bhabha Atomic Research Centre
Mumbai, India

Abstract Document clustering is the automatic grouping of
text documents into clusters so that documents within a
cluster have high similarity in comparison to one another, but
are dissimilar to documents in other clusters. It is widely
applicable in areas such as search engines, web mining, and
information retrieval. A very common use of document
clustering is in distinguishing mails between a fixed set of
classes, such as 'spam' and 'non spam', or 'personal' and
'technical'.
There are many document clustering algorithms proposed
in the literature. Some of them are implemented and available
as open source software. Since they are very general in terms
of applicability, they may not work very well in specific
applications. In this project we developed and implemented a
document clustering algorithm tuned to specific applications,
e.g., to flag a document technical or otherwise, which will
enable us to track all emails with technical documents as
attachments, or to classify problems posted on helpdesk as
Electrical or Civil complaints or email issues. The clustering
engine so developed was compared for performance with
available open software.
Keywords-(Clustering, Naive Bayes, Principal Component Analysis, Artificial Neural Network)
I. INTRODUCTION AND MOTIVATION
Today's organizations face a vast volume of data and
information. Most of the data is stored in different types of
documents but only a few people (often only the authors of
the documents) know where to locate them. There are
plenty of ways to approach the problem of organizing
knowledge in a company. Here we concentrated on the
document clustering.
Large organizations like Bhabha Atomic Research
Centre generate a huge amount of data per day, via emails,
attachments, or in the form of problems posted. This data is
approximately 3 to 4 GB. Distinguishing such a huge data
set manually is a very difficult task. Hence, a Document
Clustering engine was proposed which will do the
categorization of this data automatically and efficiently.
The main classification objective, particularly with
respect to knowledge management, is to simplify access to
and processing of available data. Classification supports
analyzing the knowledge and, thus, can ease thereafter

Proc. of the Intl. Conf. on Computer Applications
Volume 1. Copyright 2012 Techno Forum Group, India.
ISBN: 978-81-920575-8-3 :: doi: 10. 73403/ISBN_0768
ACM #: dber.imera.10. 73403
- retrieval
- organization
- visualization
- development, and
- exchange of knowledge
Document clustering is an unsupervised learning
procedure. Cluster analysis is targeted on exploring
similarities in the contents of the documents and arranges
them in groups according to these properties.
The common challenges in document clustering are:
- Document retrieval,
- Complexity of natural languages,
- High dimensionality,
- Accuracy in clustering,
- Speed at which they are accurately clustered,
- Versatility of requirements e.g. Topic extraction,
pattern matching etc.
Document categorization may be viewed as assigning documents to a predefined set of categories. Usually this set
is created during the training of the classifier with the so
called training documents. The classifier will preserve the
training pattern in some way which will be used the next
time when documents are to be classified.
Most of the methods used in document classification
have been used in data mining applications. The data
analyzed by data mining are numerical and, therefore,
already in the format required by the algorithms.
To apply these algorithms for document classification
one has to convert the words of the documents into
corresponding numerical representations. This step is
called document preprocessing and subsumes feature
extraction, feature selection, and document representation
as activities.
Naive Bayes and Artificial Neural Network were
selected to solve the problem in hand. The reason behind
algorithm selection is that Naive Bayes has a naive
mathematical approach to the problem of classification
which assumes total independence among the features
extracted from a document, and ANN acts as a black box
returning the predicted category for the given input feature
sequence.
A. Naive Bayes Algorithm
Naive Bayes Classifier (NBC), a simple probabilistic classifier based on Bayes' theorem with strong (naive) independence assumptions, solves document classification
problems, e.g., spam filtering, efficiently. However, the
Bayes Classifier(BC) requires labeled training data.
Using the Bayes rule, we can calculate the probability of a document being classified into each of the available categories. For a given document, the category for which the algorithm gives the highest probability is considered to be the resulting category for that document [1][3].
B. Principal Component Analysis
Principal component analysis (PCA) is a classical statistical method. This linear transform has been widely
used in data analysis and compression [8][9]. It is a way of
identifying patterns in data, and expressing data in such a
way as to highlight their similarities and differences.
Another important advantage of PCA is that once the
patterns in the data have been found, the compression of
the data, i.e. reduction in the number of dimensions, is
without much loss of information.
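A minimal, generic PCA sketch is shown below (an illustration of the idea, not the engine's code): a small invented document-term matrix is centred, the covariance matrix is diagonalized, and the documents are projected onto the top components.

import numpy as np

# Invented document-term frequency matrix: rows = documents, columns = terms.
X = np.array([[3., 0., 1., 2.],
              [2., 1., 0., 3.],
              [0., 4., 2., 0.],
              [1., 3., 3., 0.]])

# Centre the data, then diagonalize the covariance matrix.
X_centered = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X_centered, rowvar=False))

# Keep the k components with the largest eigenvalues and project the documents.
k = 2
top = np.argsort(eigvals)[::-1][:k]
X_reduced = X_centered @ eigvecs[:, top]
print(X_reduced.shape)   # (4, 2): each document is now described by 2 features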
C. Artificial Neural Network
ANN have been used in recent years for modeling
complex systems where no explicit equations are known, or
the equations are too ideal to represent the real world. ANN
can form predictive models from data available from past
history. ANN training is done by learning from known
examples. A network of simple mathematical neurons is
connected by weights. Adjusting the weights between the
neurons does the training of the ANN. Advanced
algorithms can train large ANN models, with thousands of
inputs and outputs [11].
Neural networks are a powerful technique for solving many real-world problems. They have the ability to learn from experience in order to improve their performance and to
adapt themselves to changes in the environment. In
addition to that they are able to deal with incomplete
information or noisy data and can be very effective
especially in situations where it is not possible to define the
rules or steps that lead to the solution of a problem.
II. IMPLEMENTATION
A. Efficient preprocessor Implementation
The Pre-processing tasks broadly include tokenizing,
stop word removal, removing nonprintable characters,
feature selection and generating a numeric equivalent for
every unique feature selected.
The software is used for clustering the incoming mails
as well as their attachments, which may be in a format
other than text. Hence the pre-processor developed for this
software also incorporates file format conversion features
to convert documents in other format into text format.
Currently, the pre-processor supports .PDF to .TXT
conversion. Some other popular file formats are to be
included in the near future.
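A simplified sketch of this kind of pre-processing is given below (tokenizing, stop-word removal, stripping non-printable characters and mapping each unique feature to a numeric id); the tiny stop-word list and the tokenizer are stand-ins, and file-format conversion is omitted.

import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}   # tiny stand-in list

def preprocess(text, vocabulary):
    # Strip non-printable characters and tokenize on alphabetic runs.
    text = "".join(ch for ch in text if ch.isprintable())
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]
    # Assign a unique numeric id to every new feature (term).
    for t in tokens:
        vocabulary.setdefault(t, len(vocabulary))
    # Return a (feature id -> frequency) mapping for this document.
    return {vocabulary[t]: c for t, c in Counter(tokens).items()}

vocab = {}
print(preprocess("The power supply of the server failed again.", vocab))
print(vocab)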
B. Naive Bayes Implementation
Bayes Theorem:
The Naive Bayes algorithm follows Bayes' theorem, which is stated as:
P(A|B) = P(B|A) * P(A) / P(B)

The probability of A happening given B is determined
from the probability of B given A, the probability of A
occurring and the probability of B. The Bayes Rule enables
the calculation of the likelihood of event A given that B has
happened. This is used in text classification to determine
the probability that a document B is of type A just by
looking at the frequencies of words in the document.
For the purposes of text classification, the Bayes Rule
is used to determine the category a document falls into by
determining the most probable category. That is, given this
document with these words in it, which category does it fall
into?
Using the Bayes Rule, we can calculate P(Ci|D) by
computing:
P(Ci|D) = P(D|Ci) * P(Ci) / P(D)

Here Ci represents a category and i ranges from 1 to
number of categories available.
P(Ci|D) is the probability that document D is in
category Ci; that is, the probability that given the set of
words in D, they appear in category Ci. P(D|Ci) is the
probability that for a given category Ci, the words in D
appear in that category.
P(Ci) is the probability of a given category; that is, the
probability of a document being in category Ci without
considering its contents. P(D) is the probability of that
specific document occurring.
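A compact sketch of this classification rule, with the naive independence assumption and add-one (Laplace) smoothing, is shown below; the toy training corpus and category names are invented, and the code is illustrative rather than the engine's actual implementation.

import math
from collections import Counter, defaultdict

# Tiny invented training corpus: (tokens, category).
training = [
    (["power", "supply", "failure"], "Electrical"),
    (["fuse", "power", "tripped"],   "Electrical"),
    (["mail", "server", "bounce"],   "Email"),
    (["mailbox", "quota", "mail"],   "Email"),
]
categories = sorted({c for _, c in training})
class_docs = Counter(c for _, c in training)
term_counts = defaultdict(Counter)
for tokens, c in training:
    term_counts[c].update(tokens)
vocab = {t for tokens, _ in training for t in tokens}

def classify(tokens):
    scores = {}
    for c in categories:
        # log P(Ci) + sum of log P(word | Ci), with add-one smoothing.
        log_prob = math.log(class_docs[c] / len(training))
        total = sum(term_counts[c].values())
        for t in tokens:
            log_prob += math.log((term_counts[c][t] + 1) / (total + len(vocab)))
        scores[c] = log_prob
    return max(scores, key=scores.get)

print(classify(["power", "tripped"]))    # -> "Electrical" for this toy data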

The block diagram of the system is shown in the Fig. 1:

Fig.1. Block diagram of the complete system

As shown in Fig. 1, input documents initially go
through the preprocessor, where they are refined by
removing unimportant text. Preprocessor performs various
activities such as tokenizing, removing stop words etc.
Then the important features are extracted from the cleaned
data. Preprocessor internally converts these features into
numerical form which is understood by the system. Hence,
every document is internally represented by a two
dimensional array of features and their frequencies in a
particular document. The frequencies used are normalized
frequencies which normalizes the term counts over the
documents collection. Hence, a term which occurs
infrequently will make the same contribution to the distance
between documents as a very common term. A resultant
two dimensional matrix is shown in Fig. 2.
Fig.2. Two dimensional matrix of document features
and their frequencies.
All the unique features are stored in a vocabulary for
future use and all these features have unique key Ids to
identify them uniquely.
This two dimensional matrix is given as input to both
the clustering techniques. Both the techniques function in
two phases.
- Training phase
- Classification phase
During the training phase, the Naive Bayes Classifier calculates the probability of a given input document being classified into each of the available clusters, and the cluster having the highest probability for the document is assumed to be its destination.
The probabilities for all the clusters, returned by Naive
Bayes Classifier are stored and sent as input to the Artificial
Neural Network along with the two dimensional matrix.
The ANN tries to find the resultant cluster for the given
document and will output its calculated results.
The purpose of this two way clustering is to improve the
accuracy with which documents are clustered.
C. ANN Implementation
The ANN used here is a feed-forward back-propagation neural network. Back-propagation is a learning algorithm for adjusting the weights. Input data is propagated through the network until it reaches the output layer, and an error value for the network is calculated. The network then uses this value to train itself by propagating the error backwards, after which the forward pass starts again. The cycle is repeated until the error is minimized.
ANN will always have fixed number of inputs, whereas
the size of two dimensional matrix generated from the input
document will always depend upon the input document
itself. Hence, some dimensionality reduction steps must be
performed in order to make these two sizes equal to each
other. There are two options:
- Principal Component Analysis (PCA)
- Sort the two-dimensional matrix on frequency values and select the top m features out of n (m < n)
The PCA does not work well if the variations in the
input data are not high. It works well under the condition
that the data set has homogeneous density. Hence, the
second method of sorting was used. In this method all the
features were sorted in descending order of normalized
frequencies. And then top p frequencies were selected in a
particular cluster.
The library used for the ANN implementation is FANN (Fast Artificial Neural Network). It requires the input file to be in a particular format; the preprocessor converts incoming documents to that format, and the resulting file is then given as input to the ANN.
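The authors use the FANN library; as a hedged stand-in (scikit-learn is an assumption here, not the toolkit actually used), the sketch below trains a small feed-forward back-propagation network on an invented fixed-size feature matrix, e.g. the top-m normalized term frequencies concatenated with the NBC cluster probabilities, just to show the overall data flow.

import numpy as np
from sklearn.neural_network import MLPClassifier

# Invented training data: each row = top-m normalized term frequencies
# followed by the per-cluster probabilities produced by the NBC.
X_train = np.array([
    [0.30, 0.10, 0.05, 0.80, 0.20],
    [0.25, 0.15, 0.00, 0.75, 0.25],
    [0.05, 0.40, 0.30, 0.20, 0.80],
    [0.00, 0.35, 0.45, 0.15, 0.85],
])
y_train = np.array([0, 0, 1, 1])          # cluster labels

# Feed-forward network trained with back-propagation.
model = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
model.fit(X_train, y_train)

X_new = np.array([[0.28, 0.12, 0.02, 0.78, 0.22]])
print(model.predict(X_new))               # predicted cluster for the new document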
III. EXPERIMENT
A. Data
The data set used was extracted from
(http://kdd.ics.uci.edu/). The data set consists of 20,000
messages from 20 news groups. The data was manually
clustered into 20 possible clusters. The manual cluster size
was 1,000 documents per cluster. This approach represents a
daily, normal operation of information seeking and
categorization in organizations today, where vast amount of
information is gathered and categorized.
B. Method
The 20,000 documents were analyzed using classical
text analysis methodology. Documents in formats other than text are converted into .TXT format; currently the clustering engine supports only PDF to .TXT conversion. Stop words were removed, and the resulting terms went through a further stemming process using the Porter stemming algorithm [15]. Finally, normalized weighted term vectors
representing the documents were generated.
These terms were applied to Naive Bayes classifier
where after performing the mathematics it outputs the
resulting probabilities corresponding to all the clusters.
These output probabilities along with the normalized
weighted term vectors were given to ANN and the resultant
cluster was found. Outputs from both these techniques were
compared and the final output is decided by the clustering
engine.
The two way clustering improves the accuracy in
clustering given input documents.
In order to evaluate performance of both the techniques,
both automatic clustering results were compared to the
manual categorization.
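As a simple illustration of how the automatic clusters can be scored against the manual categorization (the exact metrics reported by MALLET are not reproduced here), the sketch below computes plain accuracy from two invented label lists.

# Invented ground-truth (manual) labels and predicted cluster labels.
manual    = ["tech", "tech", "sport", "sport", "sport", "tech"]
predicted = ["tech", "sport", "sport", "sport", "tech", "tech"]

# Accuracy: fraction of documents placed in the same category as the manual clustering.
correct = sum(1 for m, p in zip(manual, predicted) if m == p)
print(f"accuracy = {correct / len(manual):.2%}")   # 66.67% for this toy example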
C. Comparison
The newly developed system is compared with one of
the open source software available. We have used MALLET
(Machine Learning for Language Toolkit) for judging the
performance of the clustering engine. MALLET is open
source software used for natural language processing,
document classification, clustering, topic modelling,
information extraction, and other machine learning
applications to text. It includes efficient routines for
converting text to "features", a wide variety of algorithms
(including Naive Bayes, Maximum Entropy, and Decision
Trees), and code for evaluating classifier performance using
several commonly used metrics [12].
IV. EXPERIMENTAL RESULTS
The software works in two phases employing one
technique in each phase. The Naive Bayes Classifier was tested on a manually created data corpus of 100 documents, manually clustered into 5 different clusters, each having 20
documents. The results were compared with MALLET, and
are shown in Table I.


TABLE I. COMPARISON OF TEST RESULTS

Property: MALLET | NBC
Training speed: 10.732 s (import documents) + 0.560 s (train) | 749 ms
Classification speed: < 1 s | 83 ms
Accuracy: 89% | 83%

The next phase employs ANN for clustering. The
documents after classification by NBC are sent to the ANN
for clustering. The implementation of the ANN has been finished, and testing is in progress. The results are expected by mid-December.

V. CONCLUSION
In this paper, we have examined whether pipelining the output of the NBC to the ANN improves the accuracy of clustering documents, or whether the two techniques work better when run independently. A reasonable tradeoff between speed and accuracy is to be found.
The NBC works fairly well, giving a good tradeoff
between speed and accuracy. The tests regarding the output
of NBC as input to ANN along with the document to be
clustered are yet to be done. The tests will be performed to
check if the accuracy in clustering the document improves
by employing two different techniques or the two
techniques work well when they are run independently.
These results are expected in mid December.

VI. COPYRIGHT FORMS AND REPRINT ORDERS
Copyright 2011, by Computer Division, Bhabha Atomic
Research Centre, Trombay Mumbai 400085.
ACKNOWLEDGMENT
My deep sense of gratitude to Shri. A.G. Apte (Head,
Computer Division), BARC and Dr. P. K. Pal (SO H,
DRHR), BARC, for their support and guidance. I am
heartily thankful to my guide J. J. Kulkarni, Technical
Advisor Phool Chand and Rohitashva Sharma for their
encouragement, guidance. Thanks and appreciation to the
helpful people at BARC.

REFERENCES
[1]
Ioan Pop, An approach of the naive Bayes classifier for the document classification.
[2]
I. Rish, An empirical study of the naive Bayes classifier, IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence.
[3]
Laurene V. Fausett, Fundamentals of Neural Networks.
[4]
Lee Zhi Sam, Mohd Aizaini Maarof, Ali Selamat,
Automated web pages classification with integration
of principle component analysis (PCA) and
independent component analysis (ICA) as feature
reduction, Proceedings of International Conference on
Man-Machine Systems 2006, September 15-16, 2006,
Langkawi, Malaysia.
[5]
http://www.drdobbs.com/architecture-and-
design/184406064
[6]
Jon Shlens, A tutorial on principal component analysis: derivation, decomposition, and singular value decomposition, 25 March 2003.
[7]
Mohamed N. Nounou, Bhavik R. Bakshi, Bayesian principal component analysis.
[8]
Erkki Oja. Subspace methods of pattern recognition,
volume 6 of Pattern recognition and image processing
series. John Wiley & Sons, 1983.
[9]
Rafael C. Gonzalez and Richard E. Woods. Digital
image processing. Addison Wessley Publishing
Company, 1992.
[10]
Erkki Oja. Neural networks, principal components,
and subspaces. International Journal of Neural
Systems, 1(1):61-68, 1989.
[11]
Diana Goren-Bar, Tsvi Kuflic, Dror Lev, Supervised
learning for automatic classification of documents
using self organizing Maps.
[12]
http://mallet.cs.umass.edu/






Knowledge Management Strategies for Implementation of Green Technologies

Flight Lieutenant Sonali Shirpurkar Badkas


(Ex Indian Air Force Aeronautical Engineer Officer)
Agra, India


Abstract-Knowledge is the only treasure you can give
entirely without running short of it.
-An African proverb.
Over the past decade, it has become clear for the scientific
community that the linkages between Community,
Climate Change and Conservation are deeper than they
appear to be. The existing pattern of energy use and
demand is now a global concern not only due to its impact
on climate but also due to their exhaustive nature.
Renewable energy is therefore creating a lot of excitement in terms of meeting energy demands and reducing environmental impact. Alternatives such as bio-fuel, wind, solar and biomass have great potential for ensuring sustainable growth; this is what we term Green Technology (GT). The effective implementation of GT brings in Knowledge Management (KM). KM is the leveraging of collective wisdom to increase responsiveness and innovation. Knowledge management is not one single discipline; rather, it is an integration of numerous endeavours and fields of study. This paper provides a framework for characterizing the various tools (methods, practices and technologies) available to Knowledge Management practitioners and for using them in the implementation of Green Technology.
high-level overview of a number of key terms and
concepts, describes the framework, provides examples of
how to use it, and explores a variety of potential
application areas.

Key Words: KM, KM Strategy, Green Technology, Energy
Conservation, Solar Energy


I. KNOWLEDGE MANAGEMENT
Knowledge Management (KM) comprises a range
of strategies and practices used in an organization to
identify, create, represent, distribute, and enable
adoption of insights and experiences. Such insights and
experiences comprise knowledge, either embodied in
individuals or embedded in organizational processes or
practice.
KM is not information management. It is the
process of transforming unstructured data into
contextual information and then applying that
information. Knowledge as contextual information is
the ability to draw on information and combine it with
experience by applying it to a particular situation or
circumstance when it is needed.

Proc. of the Intl. Conf. on Computer Applications
Volume 1. Copyright 2012 Techno Forum Group,
India. ISBN: 978-81-920575-8-3 :: doi: 10.
73410/ISBN_0768
ACM #: dber.imera.10. 73410
Many large companies and non-profit organizations
have resources dedicated to internal KM efforts, often
as a part of their 'business strategy', 'information
technology', or 'human resource management'
departments (Addicott, McGivern and Ferlie 2006).
Several consulting companies also exist that provide
strategy and advice regarding KM to these
organisations.
There is little consensus on a Knowledge Management definition because most knowledge management discussions revolve around information management. They are the codification or classification
systems that help to capture and codify knowledge but
they do not take knowledge to the next step of infusing
it into the enterprise (or creating a learning
organization). And from there, pushing it still further to
the application of that information in a value added
context for day to day activities. It is only with the
application of information coupled with experience that
something becomes knowledge, it is NOT some
system.
For the enterprise to continue to wring value out
of any technological implementations or other
technology investments, the organisation must change.
For the organization to be effective, technology must
support the capture, organization, and implementation
of the unstructured knowledge contained in peoples
heads, or jotted down on crib sheets. This is not an easy
task.
Knowledge is not data and information. Data
consists of facts, observations, occurrences, numbers,
and things that are objectively perceived.
Information is a collection of various aggregated or
synthesized data points. From there, Knowledge is the
mix of information, experience, and context adding
value to an activity or process
Knowledge Management is the systematic process
by which an organization maximizes the uncodified and
codified knowledge within an organization.

II. KM DIMENSIONS
An established discipline since 1991 (see Nonaka
1991), KM includes courses taught in the fields of
business administration, information systems,
management, and library and information sciences
(Alavia and Leidner 1999). More recently, other fields
have started contributing to KM research; these include
information and media, computer science, public health,
and public policy.
One proposed framework for categorizing the
dimensions of knowledge distinguishes between tacit
knowledge and explicit knowledge as depicted in Fig-
1. Tacit knowledge represents internalized knowledge
that an individual may not be consciously aware of,
such as how he or she accomplishes particular tasks. At
the opposite end of the spectrum, explicit knowledge
represents knowledge that the individual holds
consciously in mental focus, in a form that can easily be
communicated to others. (Alavi and Leidner 2001).
Similarly, Hayes and Walsham (2003) describe content
and relational perspectives of knowledge and
knowledge management as two fundamentally different
epistemological perspectives. The content perspective
suggest that knowledge is easily stored because it may
be codified, while the relational perspective recognizes
the contextual and relational aspects of knowledge
which can make knowledge difficult to share outside of
the specific location where the knowledge is developed.
Early research suggested that a successful KM effort
needs to convert internalized tacit knowledge into
explicit knowledge in order to share it, but the same
effort must also permit individuals to internalize and
make personally meaningful any codified knowledge
retrieved from the KM effort. Subsequent research into
KM suggested that a distinction between tacit
knowledge and explicit knowledge represented an
oversimplification and that the notion of explicit
knowledge is self-contradictory. Specifically, for
knowledge to be made explicit, it must be translated
into information (i.e., symbols outside of our heads)
(Serenko and Bontis 2004). Later on, Ikujiro Nonaka
proposed a model (SECI for Socialization,
Externalization, Combination, Internalization) which
considers a spiraling knowledge process interaction
between explicit knowledge and tacit knowledge
(Nonaka & Takeuchi 1995). In this model, knowledge
follows a cycle in which implicit knowledge is
'extracted' to become explicit knowledge, and explicit
knowledge is 're-internalized' into implicit knowledge.
More recently, together with Georg von Krogh, Nonaka
returned to his earlier work in an attempt to move the
debate about knowledge conversion forwards (Nonaka
& von Krogh 2009).

III. KM STRATEGIES
Knowledge may be accessed at three stages: before,
during, or after KM-related activities. Different
organizations have tried various knowledge capture
incentives, including making content submission
mandatory and incorporating rewards into performance
measurement plans. Considerable controversy exists
over whether incentives work or not in this field and no
consensus has emerged.
One strategy to KM involves actively managing
knowledge (push strategy). In such an instance,
individuals strive to explicitly encode their knowledge
into a shared knowledge repository, such as a database,
as well as retrieving knowledge they need that other
individuals have provided to the repository [12]. This is also commonly known as the Codification approach to KM.
Another strategy to KM involves individuals
making knowledge requests of experts associated with a
particular subject on an ad hoc basis (pull strategy). In
such an instance, expert individual(s) can provide their
insights to the particular person or people needing this
(Snowden 2002). This is also commonly known as the
Personalization approach to KM.

IV. GREEN TECHNOLOGY
Environmental technology or "green technology" is
the application of the environmental sciences to
conserve the natural environment and resources, and by
curbing the negative impacts of human involvement.
Sustainable development is the core of environmental
technologies. When applying sustainable development
as a solution for environmental issues, the solutions
need to be socially equitable, economically viable, and
environmentally sound.
Some environmental technologies that retain
sustainable development are; recycling, water
purification, sewage treatment, remediation, flue gas
treatment, solid waste management, renewable energy,
and others.
The term "technology" refers to the application of
knowledge for practical purposes. The field of "green
technology" encompasses a continuously evolving
group of methods and materials, from techniques for
generating energy to non-toxic cleaning products.
The present expectation is that this field will bring
innovation and changes in daily life of similar
magnitude to the "information technology" explosion
over the last two decades. In these early stages, it is
impossible to predict what "green technology" may
eventually encompass.
The goals that inform developments in this rapidly
growing field include:
Sustainability - meeting the needs of society in
ways that can continue indefinitely into the future
without damaging or depleting natural resources. In
short, meeting present needs without compromising the
ability of future generations to meet their own needs.
"Cradle to cradle" design - ending the "cradle
to grave" cycle of manufactured products, by creating
products that can be fully reclaimed or re-used.
Source reduction - reducing waste and
pollution by changing patterns of production and
consumption.
Innovation - developing alternatives to
technologies - whether fossil fuel or chemical intensive
agriculture - that have been demonstrated to damage
health and the environment; and
Viability - creating a centre of economic
activity around technologies and products that benefit
the environment, speeding their implementation and
creating new careers that truly protect the planet.
In the light of above two main scenarios it is
important to understand each terminology clearly and
its implementation. This is where the implementation of
Knowledge Management strategies in Green
Technology aspect would help. The exact steps for
what, how, when, where, which, by whom and why in
knowing Green Technology would be done in the back
drop of Knowledge Management.
V. ENERGY CONSERVATION
Energy conservation refers to efforts made to reduce
energy consumption. Energy conservation can be
achieved through increased efficient energy use, in
conjunction with decreased energy consumption and/or
reduced consumption from conventional energy
sources. The Energy Conservation Act was passed in 2001.
The state of the environment has been in decline over the past decades, and if people do not make the proper changes, things could get really bad. However, it
seems that many people are willing to make the change
and to adopt a green living style. They use green
products, green energy sources, and they use the green
technology. The green technology uses the
environmental science and the modern technology in
order to reduce the damage done to the nature, and to
reverse the side effects of the human activities.
Energy conservation can result in increased
financial capital, environmental quality, national
security, personal security, and human comfort.
Individuals and organizations that are direct consumers
of energy choose to conserve energy to reduce energy
costs and promote economic security. Industrial and
commercial users can increase energy use efficiency to
maximize profit.
The main task of green technology is to change the way in which we think and the way in which we live. It is a very new type of technology and is still at a development stage. The problem is that it faces harsh competition from companies that are not interested in preserving the environment at all: money dictates their actions, and as long as they make money they do not care about the health of the environment or of the individual. The aim of green technology is to find a way of living our lives as we are used to, but without sacrificing the environment in the process. Recycling is one of the biggest weapons of green technology; people are encouraged to recycle as much as possible, because this is one of the most important processes through which we can reduce the overall damage done to the environment.
There are various elements which can be included in the green technology category, such as green energy. This is one of the most important elements, because the majority of pollution comes from the energy sources we currently use. The problem is that the fossil fuels we currently rely on are depleting, and in the future we might be forced to make the change; we might not have a different option.

VI. CONSERVATION AND EFFICIENCY- THE
ENERGY PYRAMID
Energy conservation and Energy efficiency are
presently the most powerful tools in our transition to a
clean energy future. As depicted in Fig-2, The Energy
Pyramid, renewable energy is an important piece of our
energy future, but the largest opportunities are currently
in energy conservation and efficiency.
When we talk about renewable energy, there are many ways in which it can be harnessed, one of them being solar energy. The scarcity of electricity is felt everywhere, leading to load shedding off and on. The structured analysis shows the use of KM for the implementation of GT.
VII. SOLAR ENERGY
The history of lighting is dominated by the use of
natural light. The Romans recognized a right to light as
early as the 6th century and English law echoed these
judgments with the Prescription Act of 1832. In the 20th century artificial lighting became the main source of interior illumination, but day lighting techniques and hybrid solar lighting solutions are ways to reduce energy consumption.
The Change Structure as shown in Fig -3 depicts
that changes do take place and it is necessary to analyse
and understand the requirements at various stages and
through various angles.
A. Present Stage
We need to understand the present technical, social and cultural subsystems. In this particular scenario we know that electricity is a basic need, and technically we have advanced in producing power. However, there are still places where electricity has not even reached, especially in a few rural and tribal areas. Also, due to the increase in population, the supply-versus-demand ratio is imbalanced, giving rise to the need for more.
B. Assessment Stage
When we analyse and assess a situation we certainly get a better hold over the problems faced; thereafter, possible solutions can be thought of. Here, when we talk about the scarcity of electricity, we can think of developing a strategy.
C. Developing Future Strategy
We can think of:
- Save electricity.
- Generate electricity.
- Use green technology equipment which consumes less power.
D. Future Stage
Again here we need to understand the future
technical, social and cultural subsystems. The most
important aspect is the availability of resources and its
cost effectiveness.
Day lighting systems collect and distribute sunlight
to provide interior illumination. This passive technology
directly offsets energy use by replacing artificial
lighting, and indirectly offsets non-solar energy use by
reducing the need for air-conditioning. Although
difficult to quantify, the use of natural lighting also
offers physiological and psychological benefits
compared to artificial lighting. Day lighting design
implies careful selection of window types, sizes and
orientation; exterior shading devices may be considered
as well. Individual features include saw tooth roofs,
clerestory windows, light shelves, skylights and light
tubes. They may be incorporated into existing
structures, but are most effective when integrated into a
solar design package that accounts for factors such as
glare, heat flux and time-of-use. When day lighting
features are properly implemented they can reduce
lighting-related energy requirements by 25%.
Hybrid solar lighting is an active solar method of
providing interior illumination. HSL systems collect
sunlight using focusing mirrors that track the Sun and
use optical fibers to transmit it inside the building to
supplement conventional lighting. In single-story
applications these systems are able to transmit 50% of
the direct sunlight received.
Solar lights that charge during the day and light up
at dusk are a common sight along walkways. Solar-
charged lanterns have become popular in developing
countries where they provide a safer and cheaper
alternative to kerosene lamps.
Although daylight saving time is promoted as a way
to use sunlight to save energy, recent research has been
limited and reports contradictory results: several studies
report savings, but just as many suggest no effect or
even a net loss, particularly when gasoline consumption
is taken into account. Electricity use is greatly affected
by geography, climate and economics, making it hard to
generalize from single studies.
VIII. CONCLUSION
When we talk of conserving energy it is equally
important to use the knowledge effectively so as to
produce beneficial and cost effective results. Though
every renewable energy will have its limitations, the use
of available resources is extremely essential to conserve
energy. Few points to be considered are:
The use of telecommuting by major
corporations is a significant opportunity to conserve
energy, as many Americans now work in service jobs
that enable them to work from home instead of
commuting to work each day.
Electric motors consume more than 60% of all
electrical energy generated and are responsible for the
loss of 10 to 20% of all electricity converted into
mechanical energy.
Consumers are often poorly informed of the savings from energy-efficient products. The research one must put into conserving energy is often too time consuming and costly when there are cheaper products and technologies available that use today's fossil fuels. Some governments and NGOs are attempting to reduce this complexity with ecolabels that make differences in energy efficiency easy to research while shopping.
Technology needs to be able to change behavioral patterns; it can do this by allowing energy users, business and residential, to see graphically the impact their energy use can have in their workplaces or homes. Advanced real-time energy metering is able to help people save energy by their actions. Rather than wasteful automatic energy-saving technologies, real-time energy monitors and meters such as the Energy Detective, Enigin Plc's Eniscope, Ecowizard, or solutions like EDSA's Paladin Live are examples of such solutions.
It is frequently argued that effective energy conservation requires more than informing consumers about energy consumption, for example through smart meters at home or ecolabels while shopping. People need practical and tailored advice on how to reduce energy consumption in order to make change easy and lasting. This applies both to efficiency investments, such as investment in building renovation, and to behavioral change, for example turning down the heating. To provide the kind of information and support people need to invest money, time and effort in energy conservation, it is important to understand and link to people's topical concerns.

Fig 1 - The Knowledge Spiral as described by Nonaka & Takeuchi.
Fig 2 - Conservation and Efficiency: The Energy Pyramid.
Fig 3 - The Change Structure.
A CORRELATIVE ANALYSIS OF METHODOLOGIES INVOLVED IN ONTOLOGY CONSTRUCTION FOR E-LEARNING

D. REBECCA AMULYA
Department of Computer Science and Engineering, RMD Engineering College, Kavaraipettai-601206, Tamilnadu, India.

R.M. SURESH
Department of Computer Science and Engineering, RMD Engineering College, Kavaraipettai-601206, Tamilnadu, India.

Abstract: This paper is a correlative study of some of the existing techniques and approaches in ontology construction with respect to e-learning. The goal of this paper is to provide a comprehensive idea of the methodologies used in some of the approaches to build ontologies for an e-learning system. Presently many Manual, Semi-automatic and Automatic approaches are available to build ontologies; out of these, certain methodologies that are very apt are chosen and analyzed, and the results are compared and tabulated.

Keywords: Ontology construction; e-learning

Proc. of the Intl. Conf. on Computer Applications Volume 1. Copyright 2012 Techno Forum Group, India. ISBN: 978-81-920575-8-3 :: doi: 10.73417/ISBN_0768ACM #: dber.imera.10.73417

I. MOTIVATION

In this era of greater technological intelligence and influence of technology in all walks of life, the usage of the internet in learning and teaching has gained a lot of importance. This method is called E-LEARNING, defined as [1]: E-learning is commonly referred to as the intentional use of networked information and communications technology in teaching and learning.

E-learning is gaining importance due to the emerging technologies. It is used as a means of learning by people in all categories like science, engineering etc. Some of the benefits of e-learning are as follows:
- It increases conceptual understanding through the use of interactivity and animation, and through the use of audio and video.
- Learners can learn at one's own pace and time. This is especially important for learners who cannot attend school physically during the regular hours.
- The system may retain records of discussion and allows for later reference through the use of threaded discussion on bulletin boards.
- E-learning permits instructors to develop materials using the world-wide resources of the Web.

According to George Siemens, Fig 1 represents some of the main areas of specification in E-learning. Considering the benefits of E-learning, to provide more specific ideas in learning concepts, ontologies have been used to build relationships among learning concepts.

Ontology [2] in information systems is a large number of ideas and concepts gathered together in a hierarchical order. [2] defined ontology as "A formal explicit specification of a shared conceptualization". A subfield of ontology engineering called ontology building (construction) is used to develop domain ontologies based on the relationships and concepts of that particular domain. The availability of a variety of ontology building tools has led to the development of domain ontologies for E-learning.

Figure 1. Categories

II. METHODOLOGIES USED IN ONTOLOGY CONSTRUCTION FOR E-LEARNING.

This section provides a survey of the existing methodologies; the approaches and techniques are studied and the results are tabulated.

The methodology proposed by [3] is a fully automated method to build domain ontologies from the given lecture notes (PDF files, i.e. soft copies of text books) based on the user's query, which is usually a keyword. The system also provides pre-requisites and follow-ups based on the query, and it also provides an ontology graph representing the hierarchy of the lecture notes. The system does not use WordNet for reference because ontologies are to be created as and when needed in this method. To provide more relevant keyword searching, the Term Frequency - Inverse Document Frequency (tf-idf) weighting scheme is used for extracting terms and the apriori algorithm is used to find the relationship among the keywords. [4] provides a semi-automatic method to construct domain ontologies with an e-learning perspective. In this method all types of documents are considered, i.e. documents like word files
with .doc or .pdf or .txt extensions and PowerPoint presentation files with .ppt extension. A Java based tool called GATE file loader is used to load the documents from the heterogeneous database of files. Then, based on the document, two results are provided: first, a domain ontology based on the outline of the document, and second, the semantic relationships between them.
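To make the term-extraction and association step described for [3] more concrete, the sketch below shows one way such a pipeline could look: a tf-idf-style score to pick candidate keywords per document, followed by a simple frequent-pair (apriori-like) pass to find associations between the extracted terms. This is only an illustrative Python sketch, not the authors' system; the toy corpus, the top-3 cut-off and the support threshold are assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus standing in for lecture notes (the method in [3] reads PDF files).
docs = [
    "stack push pop stack overflow",
    "queue enqueue dequeue queue",
    "stack queue linked list",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf_idf(term, doc_tokens):
    """Illustrative tf-idf score of a term within one document."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for d in tokenized if term in d)
    return tf * math.log(N / df)

# Keep the highest scoring terms of each document as candidate keywords.
keywords_per_doc = []
for doc_tokens in tokenized:
    scores = {t: tf_idf(t, doc_tokens) for t in set(doc_tokens)}
    top = sorted(scores, key=scores.get, reverse=True)[:3]
    keywords_per_doc.append(set(top))

# Apriori-style pass: keyword pairs that co-occur in at least `min_support`
# documents are kept as candidate associations (edges of the ontology graph).
min_support = 1
pair_counts = Counter()
for kws in keywords_per_doc:
    for pair in combinations(sorted(kws), 2):
        pair_counts[pair] += 1

associations = [pair for pair, count in pair_counts.items() if count >= min_support]
print(associations)
```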
Yao-tang Yu et al [5] proposed a method for constructing structured ontologies with the help of data clustering and pattern tree mining. This methodology uses Latent Semantic Analysis, a dimension reduction matrix and feature vectors along with formal concept analysis. The document matrix is used to construct the structured ontology based on relations of keywords. Yun Hong-yan et al's [6] methodology is based on the IEEE 1074-2006 standard knowledge engineering approach. The approach consists of acquiring the ontology, implementing the ontology, evaluating the ontology and formalizing the ontology.
A fuzzy logic based methodology for ontology extraction to facilitate adaptive e-learning was proposed by Lau et al [10]. In this approach any type of text document is considered and concept maps are developed for those documents. The idea behind this approach is to develop a fuzzy domain ontology extraction method which eases the knowledge acquisition involved in manual construction of domain ontologies. A framework for automated concept map generation is developed, which is applied to an E-learning system to facilitate adaptive learning.
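As a rough illustration of what an automated, fuzzy concept-map extractor might compute, the snippet below assigns each concept pair a membership value in [0, 1] derived from normalised co-occurrence counts; pairs above a cut-off become concept-map edges. This is a hedged sketch of the general idea only, not the method of [10]; the concept data and the threshold are invented for the example.

```python
from collections import Counter
from itertools import combinations

# Concepts observed in each document (toy data).
docs_concepts = [
    {"ontology", "e-learning", "concept map"},
    {"ontology", "concept map"},
    {"e-learning", "adaptive learning"},
]

occurrence = Counter()
co_occurrence = Counter()
for concepts in docs_concepts:
    occurrence.update(concepts)
    for a, b in combinations(sorted(concepts), 2):
        co_occurrence[(a, b)] += 1

def membership(a, b):
    """Fuzzy strength of a relation: co-occurrence normalised by the rarer concept."""
    return co_occurrence[(a, b)] / min(occurrence[a], occurrence[b])

threshold = 0.5
edges = {pair: membership(*pair) for pair in co_occurrence
         if membership(*pair) >= threshold}
print(edges)  # concept-map edges with fuzzy weights in [0, 1]
```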
Jose et al [11] proposed a domain ontology method for personalized E-learning in educational systems. In this method a description of the material composing the E-Learning course has been made reusable by developing domain ontologies for the E-Learning course. The resources considered here are the learning style and the hardware and software features of the device used. The learning style used is the Felder-Silverman Learning Style, and the device description is based on the FIPA (Foundation for Intelligent Physical Agents) device ontology. From the IEEE LOM (Learning Object Metadata) standard, certain elements have been chosen to describe the resources' metadata. Throughout the methodology, Jose et al concentrated on reusability of the E-Learning resources with domain ontologies.

A semi-automatic method to extract ontologies from software documentation was put forth by Sabou Marta [12]. The motivation for this methodology was based on web service functionalities: these services are a precursor to performing reasoning tasks of higher complexity, but the services and their functionalities were constrained by the fact that the acquisition time is very large. To overcome this constraint, a semi-automatic method to extract ontologies especially from software documentation was proposed. The process involved in this approach is based on a Golden Standard for ontology extraction. It involves tagging a corpus from which automatic pair extraction is done, from which the significant pairs are identified and a manual ontology is constructed.
[13] proposed a method for the construction of domain ontologies from tables. The method aims at the automatic transformation of arbitrary table-like structures into knowledge models, i.e. frames, and at the extension for merging frames into domain ontologies. Hurst's model is used as the base for this methodology.
Thus some methodologies for constructing domain ontologies have been studied and the various approaches analyzed. The next section provides the analysis of the methodologies.

III. CORRELATIVE ANALYSIS OF THE METHODOLOGIES STUDIED.
Below is the tabular column showing the comparative analysis of the above discussed methods. The parameters considered for the analysis are as follows: the input to each method, its goal, the technique used, parameters of concern (e.g. no. of keywords), results and type of method. The methodologies under study are from a wide range, starting from the newest method to some of the oldest methods.
TABLE 1. COMPARISON OF THE METHODOLOGIES
(Columns: S.No; Title of the paper; Input; Goal; Technique used; Parameters of concern; Results/Performance; Type of method)

1. Automated Building of Domain Ontologies from Lecture Notes
Input: PDF files. Goal: To automatically construct an ontology (dependency graph) from the given lecture notes. Technique used: tf-idf scheme for extracting terms, Lucene for indexing, apriori algorithm for finding the association between the extracted terms. Parameters of concern: No. of keywords. Results/Performance: Based on the expert's keywords and the keywords generated by the proposed system, precision and recall results are generated. Type of method: Fully automatic.

2. Ontology Extraction for Knowledge Reuse: the E-learning Perspective
Input: PDF, Text, PPT, DOC files. Goal: To define a complete methodology for automatic knowledge extraction, in the form of ontological concepts, from a knowledge base of heterogeneous documents. Technique used: String based technique, graph analysis, semantic analysis. Parameters of concern: No. of documents. Results/Performance: Evaluation and validation of the implemented system in terms of precision and recall measures for documents. Type of method: Semi-automatic.

3. A Structured Ontology Construction by Using Data Clustering and Pattern Tree Mining
Input: Keywords. Goal: Proposes a structured ontology construction based on data clustering and pattern tree mining. Technique used: Formal concept analysis. Parameters of concern: Relations of keywords. Results/Performance: A document matrix is established with the results. Type of method: Manual.

4. Development of Domain Ontology for E-learning Course
Input: Subjective in nature (for the C language). Goal: To analyze the problems in the development of domain ontology and demonstrate the application of the proposed knowledge engineering approach for the development of an e-learning course ontology. Technique used: not specified. Parameters of concern: Roles and concepts are the important parameters. Results/Performance: Subjective evaluation could be done on the basis of questionnaires filled in by students at the end of the course; objective evaluation will be future work. Type of method: Manual.

5. Towards a Fuzzy Domain Ontology Extraction Method for Adaptive E-learning
Input: Any text documents. Goal: To use fuzzy logic for domain ontology extraction to facilitate adaptive learning. Technique used: A framework for automatic concept map generation. Parameters of concern: Automatically generated concept maps. Results/Performance: Results of the concept maps are evaluated using an adjacency matrix. Type of method: Automatic concept maps.

6. Domain Ontology for Personalized E-learning in Educational Systems
Input: not specified. Goal: To propose a domain ontology to describe learning materials that compose an adaptive course. Technique used: Felder-Silverman learning style model and FIPA. Parameters of concern: Learning style and software and hardware features of the used device. Results/Performance: not specified. Type of method: Manual.

7. Extracting Ontologies from Software Documentation: a Semi-automatic Method and its Evaluation
Input: Software documents. Goal: To describe a semi-automatic method to extract ontologies from software documentation. Technique used: not specified. Parameters of concern: Extracted pairs and no. of documents. Results/Performance: Evaluation of the pairs using precision and recall. Type of method: Semi-automatic.

8. Construction of Domain Ontologies from Tables
Input: Arbitrary table-like structures. Goal: Automatic transformation of arbitrary table-like structures into knowledge models, i.e. frames, and the extension for merging frames into domain ontologies. Technique used: Hurst's model. Parameters of concern: Table layouts of any files. Results/Performance: The efficiency E of the approach is measured according to the portion of correctly transformed tables. Type of method: Automatic with respect to tables.

IV. CONCLUSIONS AND FURTHER RESEARCH
In this paper a survey is done by comparing different methodologies with the help of certain characteristics. This evaluation will help in choosing the best methodology, based on the input to be used, when building and developing ontologies in a particular domain. The methods have been studied and surveyed and the results have been analyzed. Further research points include the development of a single methodology to construct domain ontologies for E-Learning that comprises all the best features of the studied approaches.

REFERENCES
[1]. Som Naidu,E-Learning- A Guidebook of
Principles, Procedures and Practices, CEMCA, 2nd
revised edition, pp.11, 2006.
[2]. Thomas R.Gruber, Towards Principles for the
design of ontologies used for knowledge
sharing.International Journal on Human-Computer
Studies, Vol. 43, Issues 5-6 , pp. 907-928,December
1995.
[3]. NeelaMadhv Gantayat and Sridhar Iyer,Automated
building of domain ontologies from lecture notes,
IEEE Conference on Technology for
Education,pp.89-95, July 2011.
[4]. Gaeta,M.;Orciuoli,F.;Paolozzi,S.;Salerno,S.,
Ontology Extraction for Knowledge Reuse: The e-
Learning Perspective, IEEE Transactions on
systems,man and cybernetics, vol 41, issue 4 pp.798-
809,July 2011.
[5]. Yao-tang yu and chien chang hsu., A Structured
Ontology by using Data Clustering and Pattern Tree
Mining. IEEE Proceedings of the international
conference on machine learning and cybernetics,
2011.
[6]. Yun hong-yan ,xuijan liang, wei moji, xong jing.
Development of Domain Ontologies for e-learning
course. IEEE symposium on IT in medicine and
education, 2009.
[7]. DA-YOU LIU, Learning owl ontologies from free
texts. In Machine Learning and Cybernetics,
volume 2, pages 1233 1237, 2004.
[8]. Juan Ramos, Using tf-idf to determine word
relevance in document queries. First International
Conference on Machine Learning, 2003.
[9]. Raymond Y.K Lau, Albert Y.K.Chung, Dawei Song
and Qiang Huang,Towards Fuzzy Domain
Ontology Based Concept Map Generation for E-
Learning, Proceedings of the sixth International
Conference on web-based learning, volume 4823
,pages 90-101, 2008.
[10]. R.Y.K Lau ,Dawei Song, Yuefeng li, chang
T.C.H jin-ring hao, Towards a fuzzy domain
ontology extraction method for adaptive e-
learning, IEEE Transactions on knowledge and data
engineering ,2009.
[11]. Jose M.Gascuena. Antonio Fernandez-
Caballero and Pascual Gonazalez, Domain
Ontologies for personalized E-Learning in
Educational Systems, in 6th international
conference on advanced learning technologies, 2006.
[12]. Sabou Marta,Extracting Ontologies from
Software Documentation a Semi-Automatic Method
and its evaluation,Workshop on Ontology
Learning and Population , 2004.
[13]. Aleksander pivk and Matjaz Gams,
Construction of domain ontologies from tables,
Proceedings of the International Multiconference,
2005.
[14]. M.Hurst, Layout and Language: Beyond
Simple Text for Information interaction-modelling
the table., proceedings of the 2nd International
Conference on Multimodal Interfaces, 1999.
[15]. A.Gomez Perez, A Survey on Ontology Tools
.Onto Web Deliverable , 2002.


Ranking of Documents in Semantic Web using Particle Swarm Optimization

V Francis Densil Raj, Faculty - MCA, Anna University of Technology, Madurai, India
S Sanjeeve Kumar, Faculty - CSE, Anna University of Technology, Madurai, India
C.M. Selvarani, Professor - CSE, Pannai Engg. College, Sivagangai, India

Abstract---Particle Swarm Optimization is a biologically
inspired population based optimization technique that has
been successfully applied to various problems in Science and
Engineering. With the massive growth and large volume of the
web it is the function of search engine to recover results based
on the user preferences. But most of the time the user gets
useless pages. The next generation web architecture, semantic
web reduces the burden of the user by performing search
based on semantics instead of keywords. Even in the context of
semantic technologies an optimization problem occurs but is rarely considered. In this paper, we propose a relation based page ranking algorithm using Particle Swarm Optimization which
can be used along with semantic search engine to optimize the
results of web search. The proposed method uses Jena API and
GATE tool API and the documents can be recovered based on
their annotation features and relations. A preliminary
experiment shows that the proposed method is feasible and
generates relevant documents in higher ranking.
Keywords: Conceptual Graph (CG), GATE Tool API, Jena API, Semantic Web

I. INTRODUCTION
The semantic web which comes under web 3.0 is the
extension of the current web which aims for better data
automation and recovery of documents based on user
preference. The main advantage of semantic web with the
use of ontology is to enhance the search mechanism [2].
Ontology is used to describe the concepts and relations.
Next comes the use of RDF(S) and web ontology language,
which are W3C recommended data models used to represent
the ontology [3]. The method used in constructing semantic
web is by using the term defined in the ontology as
metadata to markup the web contents. Annotations are
based on classes of concepts and relations among them.
The semantic web is a mesh of information linked in
such a way as to be easily processable by machines on a
global scale. It is an efficient way of representing data on
the World Wide Web or as a globally linked database.
Today search engine plays an important role in extracting
knowledge from the web [1]. However user gets useless
web pages even through popular web search engines. The reason is that the pages are recovered just based on the keywords entered by the user and not by analyzing the meaning, which leads the users to irrelevant pages.

Proc. of the Intl. Conf. on Computer Applications Volume 1. Copyright 2012 Techno Forum Group, India. ISBN: 978-81-920575-8-3 :: doi: 10.73424/ISBN_0768ACM #: dber.imera.10.73424
Moreover the semantic web also aims to achieve better data
automation, reuse and interoperability. The main advantage
of semantic web is to enhance search mechanisms with the
use of ontology. The resource Description Framework
(RDF) and Web Ontology Language(OWL) are W3C
recommended data models used to represent ontology. The
basic method for constructing the semantic web is to use the terms defined in ontology as metadata to mark up the web's content. It is generally accepted that ontology refers to a formal specification of conceptualization.
Various information retrieval models are Boolean
model, vector space model, probabilistic model and hyper
link model [4]. There have been works which employ
semantic web technology for information and retrieval such
as KIM [4]. Moreover traditional search engines do not have
proper infrastructure for exploiting relation based
information that belongs to semantic annotations for a web
page. The semantic web solves this problem to improve in
ranking of web pages, techniques based on heuristics can be
considered.
In semantic web the web pages has semantic metadata
that contains additional details about the web page itself.
Semantic annotations are based on concepts and relations
among them. A number of semantic web tools work only
with few semantic web architecture like RDF editors or
ontology editors. [4]. The current research focuses on
enhanced information retrieval through semantic metadata.
Various annotation tools are SHOE, Annotea, and Protégé.
Several research projects are going on about ontology based
information retrieval.
In this paper we propose an ontology based information
retrieval model which uses PSO to rank the documents
based on their relevance. Semantic annotation is based on
classes of concepts and relations among them. The use of
relations among concepts is embedded with semantic
annotation in ranking improves the accuracy of query
results. Particle Swarm Optimisation is effective for global
search in finding solutions to nondeterministic problems.
Moreover Particle Swarm Optimisation method enhances
adaptability of meta searching. Similarity measure uses tag
structure based weights in addition to standard weighing
schemes.


II. REVIEW OF RELATED WORKS
With the development of semantic web varied studies
are emerging. The ontology based annotations for
information retrieval is not new [5]. But the initial works on
semantic web did not focus on the semantic relations which
plays a vital role in the semantic web. In order to achieve
the full benefits on the semantic contents relation based
page ranking scheme is needed [6] [7].
Initially semantic web search is enhanced with ranking
based on similarity score measurement through the distance
between the user queries and retrieved resource [8].
Similarity is computed as the ratio between the concepts in
the user query and the relation instances in the semantic
knowledge base. This process is applied to all properties.
Since the user is requested to specify all the relations of
interest it exceeds the number of concepts [8]. A different
methodology has been exploited in SemRank[9], the idea is
to rank the results based on the information conveyed by the
result. The problem with this approach is scalability for
huge semantic web environment.
Currently a challenge when querying information using
semantics offered by ontology is how to extract information
from ontology more efficiently [11]. Ontolook uses graph
based representations of web page annotation through
conceptual graphs, where concepts and relations are
modeled as vertices and edges. It extracts candidate relation
keyword set submitted to the annotated database which
reduces uninteresting pages in result set. The user need not
specify any relation but the limitation is that it does not use
any ranking strategy. The use of various techniques is not
feasible since they cannot be applied to concept-relation
based framework. The authors suggest that relation based
page rank algorithm is needed.
Semantic annotation is about assigning to the entities in
the text links to their semantic description [16]. Annotation
provides additional information about web contents so that
better decision on content can be made. Annotation
ontology tells us what kind of property and value types
should be used in describing a resource. Domain ontologies are used for annotation. Manual annotation of documents is a high-cost and error-prone task. However, there is still some work to do to achieve complete automation of annotation. The classical model is incapable of supporting logical inference.
In this paper we focus on ranking based on the
underlying structure of the ontology and web pages are
ranked based on the relevance score. The proposed work is
the extension of [12]; Particle Swarm Optimisation is used to find the subgraph based on a concept and its relations. The
fitness value is calculated based on number of relations and
the best fitness is the minimum number of relations. The
proposed method is not intended to replace the ranking
strategies of actual search engines, but it simply relies on
the relevance information. Moreover it produces semantic
aware result set with increased hit ratio in query processing
by the user.

III. OVERVIEW OF RANKING STRATEGY
This section gives the details of the system architecture, page ranking through semantic relation extraction, and the construction of the concept-relation graph representation for ranking.



A. System Architecture


Figure 1. System Architecture

Fig 1 shows the system architecture of ranking through
semantic relation extraction. The crawler program collects
the web pages on the internet with its semantic markup and
corresponding ontology, described in an OWL document in
the internet. The collected web pages are transported to web
page database for future use.
The ontology, OWL document is given to the OWL
parser which maps the ontology to relational databases.
Moreover RDF metadata is interpreted by OWL parser and
stored in knowledgebase. The user interface allows for the
definition of query by the user which is passed to the
ranking logic based on particle swarm optimization. The
ordered result set generated by this module is returned as
result set to the user.
The prototype uses a travel ontology which is generated using Protégé, an ontology editor tool. Using Jena, a framework for the semantic web, the documents can be retrieved from corpora not only based on their textual contents but also based on their features, which improves the recovery of relevant documents.
B. Semantic Relation Extraction
Speed of response and the size of the index are factors in
user happiness. It seems reasonable to assume that relevance
of results is the most important factor: blindingly fast,
useless answers do not make a user happy. However, user
perceptions do not always coincide with system designers'
notions of quality.To measure ad hoc information retrieval
effectiveness in the standard way, we need a test collection
consisting of three things:
1. A document collection
2. A test suite of information needs, expressible as
queries


3. A set of relevance judgments, standardly a binary assessment of either relevant or non relevant for each query-document pair.
The standard approach to information retrieval system
evaluation revolves around the notion of relevant and non
relevant documents. With respect to a user information
need, a document in the test collection is given a binary
classification as either relevant or non relevant. This
decision is referred to as the gold standard or ground truth
judgment of relevance.
The test document collection and suite of information
needs have to be of a reasonable size: you need to average
performance over fairly large test sets, as results are highly
variable over different documents and information needs.
As a rule of thumb, 50 information needs has usually been
found to be a sufficient minimum.
A document is relevant if it addresses the stated
information need, not because it just happens to contain all
the words in the query. If a user types python into a web
search engine, they might want to know where they can
purchase a pet python. Or they might want information on
the programming language Python. From a one word query,
it is very difficult for a system to know what the information
need is. But, nevertheless, the user has one, and can judge
the returned results on the basis of their relevance to it. To
evaluate a system, one can require an overt expression of an
information need, which can be used for judging returned
documents as relevant or non relevant. In such cases, the
correct procedure is to have one or more development test
collections, and to tune the parameters on the development
test collection.

C. Semantic Relation based Page Ranking
Initially keyword combination given as input by the user
is analyzed and the corresponding concept is identified. The
concept is sent to the ontology database to retrieve the
relations defined in the ontology. After all concepts and
relations are identified concept-relation graph can be
formed.
Consider that the user specifies the keyword India and
selects the concepts as Destination or City, then the user
adds the next keyword Hotel and the concept is
Accommodation. In this case the set of annotated pages
contains keywords India and Hotel and the associated
concepts Destination and Accommodation.
The traditional search engines like Google will return
both the pages without considering the semantic markup. If
the relation based search mechanism is used one can get
pages only if there exists enough relations and concepts.
This mechanism provides keyword isolated searching. It is based on the premise that the more relations link each concept with every other concept in the ontology, the higher is the probability that the page contains exactly those relations in which the user is interested.
D. Concept Relation Graph Representation for
Ranking
A conceptual graph (CG) is a notation for logic based on
the existential graphs and the semantic networks of artificial
intelligence. Concept relational graph (CR graph) is used to
represent the ontology and annotation where the concepts
are represented through ellipses and the relations through
edges.
Conceptual graphs are formally defined in an abstract
syntax that is independent of any notation, but the
formalism can be represented in several different concrete
notations. A portion of ontology graph for travel domain is
shown in Fig 2 .The existing relations between the 2
concepts in the domain are indicated by means of connected
vertices in the graph.
The ontology graph is represented as G(C, R), where C represents the set of concepts C = {C1, C2, ..., Cn} and R represents the set of relations R = {R1, R2, ..., Rn}. An ontology subgraph is constructed by removing the relations which are not relevant to the given keyword. In each subgraph there are some quantitative relations between concepts: the larger the number near an arc, the more relations exist between the concepts.

Figure 2. Ontology Graph for Travel Domain

The keyword is submitted by the user, based on which keyword pairs are generated. Suppose the keywords are represented as k1, k2, ..., kn and the subgraphs are represented as G1, G2, ..., Gp. This graph is generated through a genetic algorithm based edge minimization method. In keyword based search only matching keywords in the document are considered and the relation between them is lost, but with this approach the relation can be retained.
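A minimal way to picture the concept-relation graph G(C, R) and the keyword-driven subgraph construction described above is sketched below: the ontology is held as a set of labelled edges, and relations that do not touch the query concepts are dropped. This is only an illustration; the travel concepts and relation names are invented for the example, while the paper's actual graphs come from the OWL travel ontology.

```python
# Ontology graph G(C, R): each relation is a labelled edge (concept, relation, concept).
relations = [
    ("Destination", "hasAccommodation", "Accommodation"),
    ("Destination", "hasActivity", "Activity"),
    ("Accommodation", "locatedIn", "City"),
    ("Activity", "requiresEquipment", "Equipment"),
]

def concept_relation_subgraph(query_concepts):
    """Drop relations that do not touch any of the query concepts."""
    return [(a, r, b) for (a, r, b) in relations
            if a in query_concepts or b in query_concepts]

# Keywords "India" and "Hotel" map to the concepts below (see Section III-C).
sub = concept_relation_subgraph({"Destination", "City", "Accommodation"})
for edge in sub:
    print(edge)
```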
E. JENA API
Jena is a Java API which can be used to create
and manipulate RDF graphs like this one. Jena has object
classes to represent graphs, resources, properties and
literals. The interfaces representing resources, properties
and literals are called Resource, Property and Literal
respectively. In Jena, a graph is called a model and is
represented by the Model interface. Protégé maintains a copy of the OWL model using the Jena API, and changes in the Protégé model are synchronized with the OWL objects. This technology ensures that all language elements that Protégé does not support in its own meta-class hierarchy at least remain untouched when saved back to a file. Editing OWL files with Protégé is therefore lossless.
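The snippet below is a language-agnostic analogue, written in plain Python rather than Jena, of the model/resource/property/literal idea described above: a model is simply a set of (resource, property, value) statements that can be created and queried. It is meant only to make the data model concrete; the resource names are hypothetical, and the actual prototype works with the Java Jena API and Protégé as stated in the paper.

```python
# A tiny stand-in for an RDF-style model: statements are (resource, property, value)
# triples, where the value may be another resource identifier or a literal.
class Model:
    def __init__(self):
        self.statements = set()

    def add(self, resource, prop, value):
        self.statements.add((resource, prop, value))

    def objects(self, resource, prop):
        """All values of the given property for the given resource."""
        return [o for (s, p, o) in self.statements if s == resource and p == prop]

model = Model()
model.add("ex:TajHotel", "rdf:type", "travel:Accommodation")
model.add("ex:TajHotel", "travel:locatedIn", "ex:India")
model.add("ex:TajHotel", "rdfs:label", "Taj Hotel")   # a literal value

print(model.objects("ex:TajHotel", "travel:locatedIn"))  # ['ex:India']
```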
IV. PARTICLE SWARM OPTIMIZATION
Particle swarm optimization was first introduced by Kennedy and Eberhart [13], [14], [15] as an optimization technique based on the movement and intelligence of a swarm. It is inspired by the social behavior and dynamics of movement of birds and fish. PSO uses a number of particles that
constitute a swarm moving around in the search space to
find the best solution. Each particle is treated as a point in
the search space which adjusts its flying according to its
own flying experience and other particles' flying experience.
The original PSO formulae define each particle as a potential solution to a problem in D-dimensional space. The position of particle i is represented in equation (1):

X_i = (x_i1, x_i2, ..., x_iD)    (1)

Each particle also maintains a memory of its previous best position, represented in equation (2):

P_i = (p_i1, p_i2, ..., p_iD)    (2)

A particle in a swarm moves; hence, it has a velocity, which can be represented in equation (3):

V_i = (v_i1, v_i2, ..., v_iD)    (3)

Each particle knows its best value so far (pbest) and its
position. Moreover, each particle knows the best value so
far
in the group (gbest) among pbests. This information is
an analogy of knowledge of how the other particles around them have performed. Each particle tries to modify its position using the following information:
- the distance between the current position and pbest;
- the distance between the current position and gbest.

Initially, the PSO algorithm randomly selects candidate
solutions (particles) within the search space. During each
iteration of the algorithm, each particle is evaluated by the
objective function being optimized, determining the fitness
of the solution. A new velocity value for each particle is
calculated using the equation (4)

v_id = w * v_id + c1 * rand() * (p_id - x_id) + c2 * rand() * (p_gd - x_id)    (4)

where
  v_id : velocity of particle i
  x_id : current position of particle i
  w : weighting function (inertia weight)
  c1, c2 : coefficients that determine the relative influence of the social and cognitive components
  p_id : pbest of particle i
  p_gd : gbest of the group


The index of each particle is represented by i, so v_i(t) is the velocity of particle i at time t and x_i(t) is the position of particle i at time t. Parameters c1, c2 and w are user-supplied coefficients. The values r1 and r2 are random values regenerated for each velocity update. The value x̂_i(t) is the individual best candidate solution for particle i at time t, and g(t) is the swarm's global best candidate solution at time t.
Once the velocity for each particle is calculated, each particle's position is updated by applying the new velocity to the particle's previous position using equation (5). This process is repeated until some stopping condition is met.

x_id = x_id + v_id    (5)

Figure 3. General Flowchart of PSO
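Purely to make equations (4) and (5) concrete, here is a compact sketch of the velocity and position updates for one swarm iteration; the coefficient values, the bounds and the toy objective function are placeholders chosen for the example, not values from the paper.

```python
import random

def pso_step(positions, velocities, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One PSO iteration: velocity update (eqn 4) then position update (eqn 5)."""
    for i in range(len(positions)):
        for d in range(len(positions[i])):
            r1, r2 = random.random(), random.random()
            velocities[i][d] = (w * velocities[i][d]
                                + c1 * r1 * (pbest[i][d] - positions[i][d])
                                + c2 * r2 * (gbest[d] - positions[i][d]))
            positions[i][d] += velocities[i][d]  # eqn (5)
    return positions, velocities

# Example: 3 particles in a 2-dimensional search space; in practice this step
# is repeated, and pbest/gbest are refreshed, until a stopping condition is met.
pos = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
vel = [[0.0, 0.0] for _ in range(3)]
pbest = [p[:] for p in pos]
gbest = min(pbest, key=lambda p: sum(x * x for x in p))  # toy objective
pos, vel = pso_step(pos, vel, pbest, gbest)
```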

A. PSO Based Page Ranking
Genetic Mining of HTML structures use the HTML tag
weights to improve the performance of document retrieval
system [10]. The term that exists in title, bold and anchor
tags add more weight to the document than the other terms.
The document retrieval performance is improved depending
on the structural importance of the document.
To extend the vector space model to support structured ranking, the occurrences within each document must include a structure. The weight of a term in the document is basically computed from the term frequency (tf) and the inverse document frequency (idf).
Steps to determine the weights w_{i,j} on terms:
Definition: Let N be the total number of documents in the system and n_i be the number of documents in which the index term k_i appears. Let freq_{i,j} be the raw frequency of term k_i in the document d_j (i.e. the number of times the term k_i is mentioned in the text of the document d_j). Then, the normalized frequency f_{i,j} of term k_i in the document d_j is given by Eqn (6):

f_{i,j} = freq_{i,j} / max_l freq_{l,j}    (6)

where the maximum is computed over all terms which are mentioned in the text of the document d_j. If the term k_i does not appear in the document d_j then f_{i,j} = 0. The inverse document frequency idf_i for k_i is given by Eqn (7):

idf_i = log(N / n_i)    (7)

The weights can then be computed through Eqn (8):

w_{i,j} = f_{i,j} * log(N / n_i)    (8)
Such term-weighting strategies are called tf-idf schemes
which are used in predicting the term frequency and
document frequency. The keyword based analysis collects
the set of keywords or terms that occur frequently together
and then finds the correlation among them.
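A direct transcription of Eqns (6)-(8) into code might look like the following; the toy documents and the query term are invented, and this is just an illustration of the weighting scheme rather than the ranking system itself.

```python
import math

# Toy document collection; each document is a list of terms.
docs = {
    "d1": "water pollution water contamination".split(),
    "d2": "mineral resources water".split(),
    "d3": "environmental pollution impact".split(),
}
N = len(docs)

def normalized_tf(term, doc):
    """Eqn (6): raw frequency divided by the maximum frequency in the document."""
    freqs = {t: doc.count(t) for t in doc}
    return freqs.get(term, 0) / max(freqs.values())

def idf(term):
    """Eqn (7): idf_i = log(N / n_i)."""
    n_i = sum(1 for doc in docs.values() if term in doc)
    return math.log(N / n_i) if n_i else 0.0

def weight(term, doc):
    """Eqn (8): w_ij = f_ij * log(N / n_i)."""
    return normalized_tf(term, doc) * idf(term)

for name, doc in docs.items():
    print(name, round(weight("water", doc), 3))
```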
The semantic information retrieval model can be built
and associated with the documents using domain ontology
which describes the concepts. Relation has to be extracted
based on the concepts and with the available relations and
concepts concept relational graph can be constructed. The
vertices represent the concepts (C1, C2, ..., Cn) and the edges represent the relations (R1, R2, ..., Rn).
The concept relation subgraph can be generated by an edge removal method, partitioning the graph into disjoint subsets. Particle Swarm Optimisation is applied to find the subgraph based on a concept and its relations. A candidate is represented as a particle which encodes the vertices in the subgraph. Fitness refers to the rate at which the individuals being sampled contribute to the next best fitness value. The optimal fitness is the minimum number of edges connecting to the concepts that are not related to the keywords. The size of the particle is initialized as N and the initial particle velocity is generated randomly. Particle Swarm Optimisation uses a particle velocity which is updated internally. Fitness is defined as the total number of relations that connect to concept nodes, and the next fitness is the minimum value of this count. The retrieved page result set covers all relation arcs in the concept relation subgraph.
Fitness = count(R_k)

where R_k is the total number of relations connected to different concepts in the graph.
The edge minimization algorithm based on Particle Swarm Optimization is as follows:

Algorithm PSOEdgemin()
Input: Graph G(V,E) and concept
{
  Generate initial population at random
  Initialize particles
  Let R_k be the total relations connected to different concepts in the graph
  Repeat
    For each particle calculate the particle velocity using eqn (4)
    Evaluate fitness: Fitness = count(R_k)
    If fitness is better than pbest, set current value as new pbest
  Until (optimal fitness is reached)
  For every edge in R_k
    {include the edge in the concept relation subgraph}
}
Output: Best individual and fitness that contain web pages with keyword and relation pairs generated from the subgraph.
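To illustrate how the fitness in the algorithm above could be evaluated for one candidate subgraph, the sketch below counts the relations that touch concepts unrelated to the query keywords; a lower count is better, matching the statement that the optimal fitness is the minimum number of such edges. The particle encoding, the edge list and the keyword-to-concept mapping are simplified assumptions made for the example.

```python
# A candidate particle encodes a subset of ontology edges (concept, relation, concept).
ontology_edges = [
    ("Destination", "hasAccommodation", "Accommodation"),
    ("Accommodation", "locatedIn", "City"),
    ("Activity", "requiresEquipment", "Equipment"),
]
query_concepts = {"Destination", "Accommodation", "City"}

def fitness(candidate_edges):
    """Fitness = count(R_k): relations touching concepts unrelated to the keywords.

    The PSO search prefers candidates where this count is minimal, i.e. the
    subgraph stays as close as possible to the query concepts.
    """
    return sum(1 for (a, _, b) in candidate_edges
               if a not in query_concepts or b not in query_concepts)

particle_a = ontology_edges        # keeps the unrelated Activity/Equipment edge
particle_b = ontology_edges[:2]    # only edges between query-related concepts
print(fitness(particle_a), fitness(particle_b))  # 1 0 -> particle_b is fitter
```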
V. RESULTS AND DISCUSSIONS
The Keyword based analysis collects set of keywords or
terms that occur frequently together and finds the
correlation relationship among them. The semantic
information retrieval knowledge base has been built and
associated to the information document base by using domain ontologies that describe the concepts. The proposed method is evaluated using the USGS repository.
environmental science documents collected from USGS
repository is annotated using ontology and it is represented
in RDF using GATE tool API. Using Jena SPARQL query
model documents can be retrieved from Corpora not only
based on their textual content but also based on their
features. The Concept-Relational Graph generates minimum
edges covering the concept. The documents that are
annotated with concept are retrieved with higher ranking.
The predefined base ontology described based on USGS
scientific directory provides the basis for semantic indexing
of documents with no embedded annotations.
The documents are annotated with the concept instances
from the knowledge base. The query weights can be set by
the user or automatically derived by concept frequency
analysis. To evaluate the model precision and recall
measures are applied. The collection of HTML documents
from USGS science directory, collection of 15 queries and
collection of relevant documents for each query are
prepared.

TABLE I Sample Queries
Mineral Resources
Water Resources
Contamination : water
Environmental Pollution
Pollution Impact
Table I gives the set of sample queries. GATE Tool is
used for implementation. The weight for each query term is
assigned and the query is run against the document taken
from USGS and returns the relevant information. The
similarity is evaluated using the ranking model. The
experiment shows that the weights learned using swarm
intelligence with document annotation has increased the
precision of retrieved document.
Once the experimental setup is made, it is tested with
the IR functionality in GATE. GATE comes with a full featured Information Retrieval (IR) subsystem which uses the popular open source full text search engine Lucene. In GATE IR the documents can be recovered from corpora not only based on their textual content but also according to their annotations. An annotation tool plugin is available in GATE which enables the user to manually annotate a text with respect to one or more ontologies.
ontology must be selected from a pull down list of available
ontologies. The documents that are annotated with concepts
are retrieved with higher ranking as shown in Table II.

Table II Document Rank Comparison

Document Name | Keyword based Ranking | Ranking with PSO based Concept Relations
First  | 0.05   | 0.058
Second | 0.184  | 0.191
Third  | 0.063  | 0.068
Fourth | 0.0203 | 0.0209
Fifth  | 0.084  | 0.088


Figure 4 Document Rank Comparison Keyword vs PSO based

The system takes the query, executes it against the knowledge base and returns the matching documents. A query weight gives the importance of a concept in the information needed by the user. Measures such as precision and recall are used to evaluate the proportion of retrieved documents that are relevant.
Precision P is defined as the proportion of retrieved documents that are relevant, given in equation (9), where R is the number of retrieved documents that are relevant and L is the number of retrieved documents:

P = R / L        (9)

Recall is defined as the proportion of relevant documents that are retrieved, given in equation (10), where Re is the number of relevant documents:

Recall = R / Re        (10)

Here L is the number of retrieved documents, Re is the number of relevant documents and R is the number of retrieved relevant documents.

Instead of a simple keyword index, the semantic search system processes a semantic query against the knowledge base, which returns the relevant documents. The results show that semantic information retrieval improves document ranking. Better precision is achieved by using structured document annotation weights and PSO-based learning of the concept.
TABLE III Comparison of Average Precision

Average Precision    Keyword based    PSO based retrieval    Ratio
Top 10 documents     0.3761           0.4118                 1.094922
Top 20 documents     0.1983           0.2095                 1.05648

VI. CONCLUSION
The future web, represented by the semantic web, provides structured data and a framework for knowledge representation of web information. It improves search strategies and returns relevant documents to the user. The semantic web provides a technique to generate metadata that semantically annotates the web page. The proposed method combines an ontology-based approach with Swarm Intelligence to form a concept relation graph based on the user's keywords. The relations are extracted from the ontology and concept relation pairs are generated. The web pages returned are relevant and satisfy the user's expectations, since the approach includes the semantics of the keywords and their relations. A future enhancement is to work with multiple ontologies characterized by billions of web pages.
REFERENCES
[1] L. Ding, T. Finn, A. Joshi, R. Pan, R. S. Cost, Y. Peng, P. Reddivari, V. Doshi and J. Sachs, "Search on the Semantic Web", ACM Intl. Conf. on Information and Knowledge Management, pp. 652-659, 2004.
[2] Ahu Sieg, Bamshad Mobasher and Robin Burke, "Learning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search", IEEE Intelligent Informatics Bulletin, 8, pp. 7-18, 2007.
[3] Mehrnoush Shamsfard, Azadeh Nematzadeh and Sarah Motiee, "Rank: An Ontology Based System for Ranking Documents", International Journal of Computer Science, 1, pp. 225-231, 2006.
[4] Sun Kim and Byoung-Tak Zhang, "Genetic Mining of HTML Structures for Effective Web Document Retrieval", Applied Intelligence, 18, pp. 243-256, 2003.
[5] Wang Wei, Payam M. Barjaghi and Andrzej Bargiela, "Semantic Enhanced Information Search and Retrieval", Sixth International Conference on Advanced Language and Web Information Technology, pp. 218-223, 2007.
[6] L. Ding, T. Finn, A. Joshi, R. Pan, R. S. Cost, Y. Peng, P. Reddivari, V. Doshi and J. Sachs, "Swoogle: A Search and Metadata Engine on the Semantic Web", Computer, vol. 38, no. 10, pp. 62-69, Oct. 2005.
[7] K. Anyanwu, A. Maduko and A. Sheth, "SemRank: Ranking Complex Relation Search Results on the Semantic Web", Proc. 14th Intl. Conf. WWW, pp. 117-127, 2005.
[8] R. Guha, R. McCool and E. Miller, "Semantic Search", Proc. 12th Intl. Conf. WWW, pp. 700-709, 2003.
[9] Yufei Li, Yuan Wang and Xiaotao Huang, "A Relation-Based Search Engine in Semantic Web", IEEE Trans. Knowledge and Data Engineering, vol. 19, no. 2, pp. 273-282, Feb. 2007.
[10] T. Preibe, C. Schlanger and G. Pernul, "A Search Engine for RDF Metadata", Proc. 15th Intl. Workshop on Database and Expert Systems Applications, pp. 168-172, 2004.
[11] B. Aleman-Meza, C. Halaschek, I. Arpinar and A. Sheth, "A Context-Aware Semantic Association Ranking", Proc. First Intl. Workshop on Semantic Web and Databases, pp. 33-50, 2003.
[12] Fabrizio Lamberti and Claudio Demartini, "A Relation-Based Page Rank Algorithm for Semantic Web Search Engines", IEEE Trans. Knowledge and Data Engineering, vol. 21, no. 1, pp. 123-136, Jan. 2009.
[13] J. Kennedy and R. C. Eberhart, "Particle swarm optimization", Proceedings of the IEEE International Conference on Neural Networks IV, pp. 1942-1948, 1995.
[14] R. C. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory", Proceedings of the 6th International Symposium on Micro Machine and Human Science, pp. 39-43, 1995.
[15] R. C. Eberhart and Y. H. Shi, "Comparison between genetic algorithms and particle swarm optimization", Proceedings of the Annual Conference on Evolutionary Programming, pp. 611-616, 1998.
[16] Maedche, A., S. Staab, N. Stojanovic, R. Studer and Y. Sure, "Semantic Portal: The SEAL Approach", Spinning the Semantic Web, pp. 317-359, 2003.

Literature Survey and Proposed model to Enhance Security of Database
Naveen Kolhe (Author)
M. Tech. Scholar (CSE)
BIST
Bhopal, India

Abstract: We know that encryption is one of the strongest security solutions for a database, but developing a database encryption strategy must take many factors into consideration. Basically, encryption should be performed where the data originates. This paper examines the various issues of implementing database encryption and makes recommendations. The paper first presents the main relevant challenges: data security, encryption overhead and key management. Furthermore, we conclude with a benchmark using the proposed design criteria, such as the proposed encryption key model and the proposed encryption algorithm model. Finally, in this survey we focus on the academic work and propose a design-oriented framework which can be used by database encryption providers as well as DBAs.
Keywords- Encryption; Decryption; Transparent Database Encryption (TDE); Encryption algorithm
I. INTRODUCTION
If we want to provide the highest level of security for confidential information stored in a database, we can convert the information into an unreadable, encrypted form. The goal of encryption is to make data unreadable to unauthorized readers and extremely hard to decipher should unauthorized readers try to access it. Encryption is performed using encryption keys (see Figure 1). The key is what makes encrypted data hard to read without authorization. Keys are used for both encryption and decryption, and they are often stored so that encrypted data can be decrypted at a later date. The security of encrypted data depends on several factors, such as which algorithm is used, the key size, and how the algorithm was implemented in the product.
Figure 1. Simple Encryption of data

For example, many databases use predefined encryption algorithms such as AES, DEA, IDEA, Blowfish (BF), RSA and many more to protect sensitive information, but some existing encryption algorithms have long been considered insecure for protecting data for any significant length of time. Additionally, different algorithms perform differently; one encryption algorithm may be insecure yet faster than another. As we know, 3DES is about three times slower than DES. Although encryption provides strong security and is accepted by companies, organizations and industry to keep data private, encryption can affect our data and our database: it can increase data size and decrease performance. For example, in the case of a private card number, does company policy require the entire number to be encrypted? The decision on how much of the data must be encrypted is often the first step in determining the overall architecture of the solution. Encryption affects data size. The ciphers used to encrypt blocks of text in a database often produce output in fixed block sizes and require the input data to match this output size, or the input will be padded. Encryption operations, especially on smaller data items, may therefore increase the size of the stored data in a database table and force us to resize database columns.
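As a small illustration of this size expansion (our own example, not part of the surveyed work), the following Python snippet encrypts a short value with a symmetric cipher from the third-party cryptography package and compares the plaintext and ciphertext lengths.

from cryptography.fernet import Fernet

key = Fernet.generate_key()            # in practice the key must be stored and managed securely
cipher = Fernet(key)

card_number = b"4111-1111-1111-1111"   # hypothetical sensitive column value
token = cipher.encrypt(card_number)

# The 19 plaintext bytes become a noticeably larger ciphertext token (roughly 120 bytes
# here), because of padding, the IV, the authentication tag and the encoding.
print(len(card_number), len(token))
assert cipher.decrypt(token) == card_number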
II. LITERATURE SURVEY
First, the technique named "A Database Encryption Solution that is Protecting against External and Internal Threats and Meeting Regulatory Requirements" [9] is discussed. The authors present a practical implementation of field-level encryption in enterprise database systems, based on research and practical experience from many years of commercial use of cryptography in database security. They show how column-level database encryption is the only solution capable of protecting against external and internal threats while at the same time meeting all regulatory requirements. They use the key concepts of a security dictionary and type-transparent cryptography. They outline the different strategies for encrypting stored data so that the best decision can be made for each situation, and for each individual field in the database, in order to practically handle different security and operating requirements. They present a policy-driven solution that allows transparent data-level encryption without changing the data field type or length.
Another technique, "A Novel Framework for Database Security based on Mixed Cryptography" [5], is discussed next. The objective of that research is to propose a general database cryptography framework without fixing the symmetric or asymmetric encryption algorithms used within it. The encryption algorithms affect the performance of query processing and the security analysis. The framework does not rely on a specific symmetric encryption algorithm; any symmetric algorithm with an appropriate key size can be used. It is also possible to apply different encryption algorithms on different sides: for example, the client could use algorithm A, the trusted third party could apply algorithm B, and the server could use algorithm C. This provides a flexible and more secure encryption environment. The work analyzes the security of data storage and data transmission against the basic attack classes, comprising inside and outside attacks, without technical details of the encryption techniques used.
Another technique, "Transparent Data Encryption - Solution for Security of Database Contents" [3], uses Microsoft SQL Server 2008 to secure the database. In this technique one needs to create a master key, obtain a certificate protected by the master key, create a database encryption key protected by the certificate, and then set the database to use encryption. Transparent Data Encryption of the database file in Microsoft SQL Server 2008 is performed at the page level: the pages of an encrypted database are encrypted before they are written to disk and decrypted when read into memory. The Service Master Key is created at SQL Server setup time and is encrypted by DPAPI. The Service Master Key encrypts the Database Master Key for the master database. The Database Master Key of the master database creates the certificate, and the certificate encrypts the database encryption key in the user database. The entire database is thus secured by the Database Master Key of the user database through Transparent Database Encryption [3].
Another technique, "An Authentication Scheme for Preventing SQL Injection Attack Using Hybrid Encryption (PSQLIA-HBE)" [4], uses an authentication scheme based on a hybrid encryption algorithm, a combination of the Advanced Encryption Standard (AES) and RSA, for preventing SQL injection attacks. In this technique a unique secret key is assigned to every user, and the server has a private/public key pair for RSA encryption. Two levels of encryption are applied to the login query, as follows (a sketch is given below):
Symmetric key encryption of the user name and password using the user's secret key.
Asymmetric key encryption of the query using the server's public key.
According to Indrani Balasundaram and E. Ramaraj [4], this technique is highly secure because of the hybrid encryption scheme.
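A rough Python sketch of this two-level idea is given below, using the third-party cryptography package; the key sizes, the OAEP padding and the Fernet wrapper for the symmetric level are illustrative choices on our part, not the exact specification of the scheme in [4].

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

# Symmetric level: the user's secret key protects the credentials.
user_secret = Fernet.generate_key()
enc_credentials = Fernet(user_secret).encrypt(b"alice:pa55w0rd")

# Asymmetric level: the server's RSA key pair protects the login query.
server_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
server_public = server_private.public_key()

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
query = b"SELECT * FROM users WHERE name = ? AND pwd = ?"
enc_query = server_public.encrypt(query, oaep)

# Only the server (private key holder) can recover the query, and only the
# holder of user_secret can recover the credentials.
assert server_private.decrypt(enc_query, oaep) == query
assert Fernet(user_secret).decrypt(enc_credentials) == b"alice:pa55w0rd"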
III. PROBLEM FORMULATION
In [3, 4] personal information is sent over a public network, in the form of text values, images, PDF files and many other formats, to be stored in a database. Our analysis shows that only one level of security is used. From a security point of view this is very weak: an eavesdropper may be able to recover the meaning of the cipher text using a brute force attack. Another important point is that such an algorithm is not efficient; since the database is case sensitive, we must also ensure that the information is properly stored in the database. A further factor is the memory utilization of the algorithm: storing unnecessary values requires a large amount of memory space.
We have observed in [9, 5, 4, 3] that the performance of the DBMS degrades because of the algorithms used for encryption. From the studies [5, 3], DES is insecure, 3DES is slow, and most of the algorithms use 128-bit keys at a minimum. An inherent vulnerability of DBMS-based encryption is that the encryption key used to encrypt the data will likely be stored in a table inside the database, protected only by DBMS access controls.
IV. PROPOSED WORK
As corporate networks become more and more open to the outside to accommodate suppliers, customers and partners, network perimeter security is no longer sufficient to protect data. Industry experts have long recommended a defense-in-depth approach that adds layers of security around the data. With the network regarded as inherently insecure, encrypting the data is the best option, often cited as the "last line of defense". In terms of database security, encryption secures the actual data within the database and protects backups, which means the data remains protected even in the event of a data breach. Modern approaches to database encryption, such as the Transparent Data Encryption (TDE) architectures introduced by Oracle and Microsoft, make it easier for organizations to deploy database encryption because TDE does not require any changes to database applications.
The objective of this research is not only to make the database unreadable; it also extends into user authentication, that is, providing the recipient with assurance that the encrypted database originated from a trusted source. Everybody has secrets; some have more than others. When information is transmitted from one node to another to be stored in a database, it is important to protect it while it is in transit. Cryptography is one of the best techniques for this: it takes readable data, transforms it into unreadable data for the purpose of secure transmission, and then uses a key to transform it back into readable data when it reaches its destination. Furthermore, this research aims to develop a new encryption algorithm model, as well as an algorithm that applies key hopping, variable key lengths and algorithm hopping, which differs from conventional cryptographic algorithms. The algorithm is based on a symmetric block cipher. The performance and strength of the new algorithm are expected to be better than those of conventional cryptographic algorithms and highly effective against brute force attacks.
A. Planning a Database to be Secure using Encryption
Strategy
Whenever we start to design a database encryption strategy for securing the information stored in the database, we need to understand the following:
How encryption works.
How data flows in our application.
How database protection fits into our organization's overall security policy.
Once we have assessed the security and encryption needs of the sensitive data gathered by our application, we need to pick a course of action to ensure it is protected once it reaches the database. There are two strategies we can use:
Using the encryption features of our Database Management System (DBMS).
Performing encryption and decryption outside the database.
Each of these approaches has its advantages and disadvantages. In this research we outline the single best strategy for encrypting stored data.
Figure 2 shows a typical three-tier application architecture. The data may be at risk of exposure in any of the three tiers, or as it travels between components. Encryption performed by the DBMS can protect data at rest, but we must decide whether we also require protection for data while it is moving between the applications and the database. Sending sensitive information over the Internet or within our corporate network as clear text defeats the point of encrypting it in the database to provide data privacy. Good security practice is to protect sensitive data in both cases: as it is transferred over the network and at rest.
Figure 2. Typical three-tier application architecture

In this work we concentrate on the keys that encrypt and decrypt data; our database protection solution is only as good as the protection of those keys. Security depends on two factors: where the keys are stored and who has access to them. The following points play an important role in database security:
How many encryption keys will be required?
How will we manage the keys?
Where will the keys be stored?
How will we protect access to the encryption keys?
How often should keys change?
Management of encryption keys is a difficult problem to solve. Using a single key makes the database implementation simpler, but it also means all our sensitive data is vulnerable if the key is broken by intruders. Using multiple keys makes the task of finding the right key to decrypt data more complicated. Because encrypted data can only be decrypted with its corresponding key, any system or application will need to know how to find that key; the more systems that know the encryption key, the higher the risk that the key will be exposed, unless a strong access management system is applied. A good rule of thumb is to use only one key to encrypt information between two parties. Another question is where to store the keys; one solution is to store them in a restricted portion of a database table or file.
Implementing a database encryption strategy: to effectively secure our databases using encryption, three issues are of primary importance: where to perform the encryption, where to store the encryption keys and who has access to the encryption keys. The process of encryption can be performed either within the database, if our DBMS supports the encryption features we need, or outside the DBMS, where encryption processing and key storage are offloaded to centralized encryption servers. A simple key-wrapping sketch is given below.
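As a simplified illustration of the key-storage point (our own assumption, not the proposed model), a master key held outside the database can wrap the per-column data keys, so that only the wrapped keys sit near the data. The sketch below again uses the third-party Python cryptography package.

from cryptography.fernet import Fernet

master_key = Fernet.generate_key()            # would live in a key store or HSM, not in the database
master = Fernet(master_key)

data_key = Fernet.generate_key()              # key actually used to encrypt a sensitive column
wrapped_data_key = master.encrypt(data_key)   # only this wrapped form is stored alongside the data

# To use the column key, unwrap it with the master key first.
column_cipher = Fernet(master.decrypt(wrapped_data_key))
ciphertext = column_cipher.encrypt(b"confidential value")
assert column_cipher.decrypt(ciphertext) == b"confidential value"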
B. Block Diagram of Proposed System
Generally a series of mathematical operations is used during encryption and decryption to generate an alternate form of the data; the sequence of these operations is called an algorithm. To help distinguish between the two forms of data, the unencrypted data is referred to as plaintext and the encrypted data as ciphertext. Encryption is used to ensure that information is hidden from anyone for whom it is not intended, even those who can see the encrypted data. The process of reverting ciphertext to its original plaintext is called decryption. This process is illustrated in Figure 3 below, and Figure 4 shows the proposed idea.
Figure 3:- Encryption and Decryption



C. Block Diagram of Proposed Idea:
Figure 4:- Block Diagram of Proposed Idea
V. CONCLUSION
In this section the performance of the proposed encryption scheme is analyzed in detail. We discuss the security analysis of the proposed encryption scheme, including important aspects such as the complexity of time and space. However, it is difficult to evaluate a specific algorithm, since many factors must be considered: security, the features of the algorithm, the complexity of time and space, and so on; research on the running time of algorithms is therefore one of the important aspects. In the past, the running time of an algorithm was usually evaluated by comparing its time complexity, while in this research we propose a new evaluation model and two evaluation modes to measure the running time of existing cryptographic algorithms and our proposed algorithm.
REFERENCES
[1] Maram Balajee, "UNICODE and Colors Integration Tool for Encryption and Decryption", International Journal on Computer Science and Engineering (IJCSE), ISSN: 0975-3397, Vol. 3, No. 3, Mar. 2011.
[2] A. Rathika, Parvathy Nair and M. Ramya, "A High Throughput Algorithm for Data Encryption", International Journal of Computer Applications (0975-8887), Vol. 13, No. 5, January 2011.
[3] Dr. Anwar Pasha Abdul Gafoor Deshmukh and Dr. Riyazuddin Qureshi, "Transparent Data Encryption - Solution for Security of Database Contents", (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 2, No. 3, March 2011.
[4] Indrani Balasundaram and E. Ramaraj, "An Authentication Scheme for Preventing SQL Injection Attack Using Hybrid Encryption (PSQLIA-HBE)", European Journal of Scientific Research, ISSN 1450-25X, Vol. 53, No. 3, pp. 359-368, 2011.
[5] Hasan Kadhem, Toshiyuki Amagasa and Hiroyuki Kitagawa, "A Novel Framework for Database Security based on Mixed Cryptography", Fourth International Conference on Internet and Web Applications and Services, IEEE, 2009.
[6] Yan Wang and Ming Hu, "Timing evaluation of the known cryptographic algorithms", 2009 International Conference on Computational Intelligence and Security, IEEE, DOI 10.1109/CIS.2009.81.
[7] A. Nadeem, "A performance comparison of data encryption algorithms", IEEE Information and Communication Technologies, 2006.
[8] W. Stallings, Cryptography and Network Security, Prentice Hall, 4th Ed., 2005.
[9] "A Database Encryption Solution that is Protecting against External and Internal Threats and Meeting Regulatory Requirements", 2004.
[10] IEEE Transactions on Circuits and Systems for Video Technology: Special Issue on Authentication, Copyright Protection, and Information Hiding, Vol. 13, No. 8, August 2003.
[11] P. Wayner, Disappearing Cryptography: Information Hiding: Steganography and Watermarking, Morgan Kaufmann, 2nd edition, 2002.
[12] D. R. Stinson, Cryptography: Theory and Practice, CRC Press, Inc., 2002.
[13] N. F. Johnson and S. Jajodia, "Steganalysis of images created using current steganography software", Proceedings of the International Information Hiding Workshop (IHW'98), April 1998.
[14] A. Menezes, P. Van Oorschot and S. Vanstone, Handbook of Applied Cryptography, CRC Press, 1996.

Classification of Seating Position Specific Patterns in Road Traffic Accident Data
through Data Mining Techniques
S.Shanthi
Senior Lecturer (Ph.D. Scholar)
Department of Computer Science and Engineering
Rajalakshmi Institute of Technology
Kuthambakkam, Chennai, India

Dr.R.Geetha Ramani
Professor and Head
Department of Computer Science and Engineering
Rajalakshmi Engineering College
Thandalam, Chennai, India

Abstract: It is important to know the many environmental and road-related factors which influence the number of road accidents. This paper summarizes the performance of the classification algorithms C4.5, CART, ID3, Naïve Bayes and RndTree applied to modelling the Seating_Position related patterns that occur in traffic accidents. The training dataset used for this research work is obtained from the Fatality Analysis Reporting System (FARS), which is provided through the University of Alabama's Critical Analysis Reporting Environment (CARE) system. The results are demonstrated through accuracy measures such as recall, precision and ROC curves, and reveal that RndTree outperforms all the other algorithms.
Keywords- Data Mining; Classification Algorithms; Accuracy Measures; ROC curve; Road Accident Data
I. INTRODUCTION
Data Mining [3] has attracted a great deal of attention in the information industry and in society due to the wide availability of huge amounts of data and the need to convert such data into useful information and knowledge. The information and knowledge gained [3] can be used for applications ranging from market analysis, fraud detection and customer retention to production control and science exploration. Data mining techniques include association, classification, prediction, clustering, etc. Classification algorithms are used to classify large volumes of data and provide interesting results.
Applying data mining techniques to social issues has become popular of late. Fatalities due to road accidents contribute heavily to the total death rate of the world. Road traffic injuries are one of the top three causes of death and show a steep socioeconomic gradient [15]. Much of the literature analyses the road-related factors which increase the death ratio. The attribute Seating_Position has been selected as the class attribute for our study.
General data mining techniques, including association, classification, prediction and clustering, can be applied to many applications such as marketing, traffic surveillance, fraud detection, biomedicine, etc. Classification algorithms give interesting results from a large set of data attributes.

In this paper we focus on Seating_Position based classification to find patterns in road accident data by applying various classification algorithms, viz. Naïve Bayes, ID3, RndTree, C4.5 and CART. Among these algorithms, the RndTree algorithm gives better results. The rest of this paper is organized as follows. A summary of the related work on classification algorithms is given in Section II. Section III illustrates the methodology used, which includes the training dataset description, system design, classification algorithms and classifier accuracy measures. In Section IV we present and discuss the experimental results. Finally, Section V concludes the paper.
II. RELATED WORK
Various studies which have been conducted to
emphasize the use of classification algorithms are
discussed in this section.
Non-parametric classification tree techniques are
applied by the authors of [2] to analyze Taiwan accident
data from the year 2001. They developed a CART model
to find the relationship between injury severity and
driver/vehicle characteristics, highway/environment
variables, and accident variables.
Performance measurements (e.g., accuracy, sensitivity,
and specificity), Receiver Operating Characteristic (ROC)
curve and Area Under the receiver operating characteristic
Curve (AUC) were used to measure the efficiency of the
proposed classifier [6].
The performance of data mining and statistical
techniques has been compared in [18] by varying the
number of independent variables, the types of independent
variables, the sample size and the number of classes of the
independent variables. The results have shown that the
artificial neural network performance was faster than that
of the other methods as the number of classes of
categorical variable increased.
Various application domains are used in [9] to study the
different relationships and groupings among the
performance metrics, thus facilitating the selection of
performance metrics that capture relatively independent
aspects of a classifiers performance. Factor analysis is
applied in [9] to the classifier performance space.
While evaluating credit risk in [4], Logistic Regression and SVM algorithms give the best classification accuracy, and the SVM shows higher robustness and generalization ability compared to the other algorithms. The C4.5 algorithm is sensitive to the input data and its classification accuracy is unstable, but it has better explanatory power [4].
The authors of [8] used a combination of cluster
analysis, regression analysis, and geographical information
system (GIS) techniques to group homogeneous accident
data, estimate the number of traffic accidents, and assess
fatal risk in Hong Kong.
At the IEEE International Conference on Data Mining (ICDM) held in December 2006, the authors of [14] presented the top 10 data mining algorithms, viz. C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes and CART. They covered classification, clustering, statistical learning, association analysis and link mining, which are all among the most important topics in data mining research and development.
In our research work we focused on Seating_Position
specific classification to find accident patterns in road
accident data using various classification algorithms. Next
section illustrates the methodology used in our research
work which includes CART, ID3, NaiveBayes, RndTree
and C4.5 algorithms.
III. METHODOLOGY
This research work focuses on Seating_Position specific classification of road accident patterns. The existing classification algorithms, viz. Naïve Bayes, ID3, RndTree, C4.5 and CART, are adopted for the classification. The error rates of all the algorithms have been compared; the RndTree algorithm produces classification results with a 4.95% misclassification rate, which is lower than the other algorithms' error rates.
The details of the work are given in the following subsections.
A. Training Dataset Description
We carry out the experiment with the road accident training dataset obtained from the Fatality Analysis Reporting System (FARS) [16], which is provided by the Critical Analysis Reporting Environment (CARE) system. This safety data consists of U.S. road accident information from 2005 to 2009 and comprises 459,549 records and 44 attributes.
To train the classifiers we selected the accident details of the year 2009, which consist of 77,125 records with 16 attributes. The selected dataset of 77,125 records is divided into a training dataset, used to build the model, and a test dataset, used to evaluate the model. The list of attributes and their descriptions is given in Table I.
TABLE I. TRAINING DATASET ATTRIBUTES DESCRIPTION

Attribute             Description
State                 State in which the accident occurred
Harmful_Event         First harmful event that occurred during the accident
Manner_of_Collision   Manner of collision
Person_Type           Driver/Passenger
Seating_Position      Seating position
Age_Range             Age range of the person involved
Gender                Male/Female
Race/Ethnicity        Nationality of the people involved in the accident
Injury_Severity       Injury severity
AirBag                Location of the airbag
Protection_System     Type of protection system used
Ejection_Path         Path from which the persons were ejected
Alcohol_Test          Method of alcohol test
Drug_Involvement      Consumption of drugs
Accident_Location     Accident location
Related_Factors       Road related factors
We have applied the classification algorithms using the Seating_Position attribute as the class attribute. The next subsection describes the system model used in this study.
B. System Design
This section describes the steps used in this research
work. The steps used in this work are depicted in Fig. 1.
Figure 1. Steps involved in the study
[Flowchart: training dataset (accident data) and test dataset (accident data) -> data preprocessing -> classification algorithms (ID3, CART, C4.5, RndTree, Naïve Bayes) -> classifier -> knowledge base (trained rules) -> accuracy evaluation (precision, recall, ROC) on test data -> predicted patterns]
After preprocessing, the training set is given as input to the classifiers (CART, ID3, Naïve Bayes, RndTree, C4.5). The results are evaluated based on error rates, and RndTree is found to give the best results with a 4.95% misclassification rate. The error rates are compared using accuracy measures such as precision, recall and ROC, and RndTree shows a significant difference in accuracy. The test dataset is applied to evaluate the results.
C. Classification Algorithms
This section illustrates the classification algorithms used, viz. RndTree, C4.5, CART, Naïve Bayes and ID3. The accuracy of the RndTree decision tree algorithm is better than that of the other classification algorithms [12] when it is used with feature selection algorithms. An advantage of decision tree algorithms is that it is easy to derive rules from them.
1) RndTree
Random tree [10] can be applied to both regression and classification problems. The method combines the bagging idea with random selection of features in order to construct a collection of decision trees with controlled variation; a brief library-based sketch is given after the steps below. Each tree is constructed using the following algorithm:
Let the number of training cases be N, and the
number of variables in the classifier be M.
We are told the number m of input variables to be
used to determine the decision at a node of the tree;
m should be much less than M.
Choose a training set for this tree by choosing n
times with replacement from all N available training
cases (i.e. take a bootstrap sample).
Use the rest of the cases to estimate the error of the
tree, by predicting their classes.
For each node of the tree, randomly choose m
variables on which to base the decision at that node.
Calculate the best split based on these m variables
in the training set.
Each tree is fully grown and not pruned (as may be
done in constructing a normal tree classifier).
For prediction a new sample is pushed down the
tree. It is assigned the label of the training sample in
the terminal node it ends up in.
This procedure is iterated over all trees in the
ensemble, and the average vote of all trees is
reported as random forest prediction [10].
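The steps above essentially describe a random forest of such trees. A minimal sketch with scikit-learn is shown below; the input file name, the one-hot encoding of the categorical attributes and the parameter values are assumptions for illustration, not the exact FARS setup used in the paper.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical pre-processed accident table with Seating_Position as the class attribute.
df = pd.read_csv("fars_2009_preprocessed.csv")
X = pd.get_dummies(df.drop(columns=["Seating_Position"]))  # categorical attributes -> indicator columns
y = df["Seating_Position"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# n_estimators trees, each grown on a bootstrap sample with a random subset of
# features (max_features) considered at every split, as in the steps above.
model = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=1)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))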
2) C4.5
Given a set S of cases, C4.5 grows an initial tree [14]
using the divide-and-conquer algorithm as follows:
If all the cases in S belong to the same class or S is
small, the tree is a leaf labeled with the most
frequent class in S.
Otherwise, choose a test based on a single attribute
with two or more outcomes. Make this test the root
of the tree with one branch for each outcome of the
test, partition S into corresponding subsets S1, S2, .
. . according to the outcome for each case, and
apply the same procedure recursively to each
subset.
3) CART
It is an exhaustive search for univariate splits method
for categorical or ordered predictor variables [1]. With this
method, all possible splits for each predictor variable at
each node are examined to find the split producing the
largest improvement in goodness of fit (or equivalently, the
largest reduction in lack of fit). For categorical predictor
variables with k levels present at a node, there are 2(k-1) -
1 possible contrasts between two sets of levels of the
predictor. For ordered predictors with k distinct levels
present at a node, there are k -1 midpoints between distinct
levels. Thus it can be seen that the number of possible
splits that must be examined can become very large when
there are large numbers of predictors with many levels that
must be examined at many nodes.
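As a small check of these counts (our own example, not taken from the paper), the following snippet enumerates the two-way groupings of a categorical predictor's levels and confirms the 2^(k-1) - 1 figure, together with the k - 1 cut points for an ordered predictor.

from itertools import combinations

def categorical_splits(levels):
    # Enumerate the distinct two-set partitions of a categorical predictor's levels.
    levels = list(levels)
    anchor, rest = levels[0], levels[1:]     # fix one level to avoid mirrored duplicates
    splits = []
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {anchor, *combo}
            right = set(levels) - left
            if right:                        # both sides must be non-empty
                splits.append((left, right))
    return splits

k = 4
print(len(categorical_splits(range(k))), 2 ** (k - 1) - 1)  # both print 7
print(k - 1)                                                # ordered predictor: 3 midpoints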
4) Naïve Bayes
The Naïve Bayes classifier technique is based on Bayes' theorem [1] and is particularly suited when the dimensionality of the inputs is high. Despite its simplicity, Naïve Bayes can often outperform more sophisticated classification methods. Naïve Bayes classifiers can handle an arbitrary number of independent variables, whether continuous or categorical. Given a set of variables X = {x1, x2, x3, ..., xd}, we want to construct the posterior probability of the event Cj among a set of possible outcomes C = {c1, c2, c3, ..., cd}. In more familiar language, X is the set of predictors and C is the set of categorical levels present in the dependent variable. Bayes' rule is

p(Cj | x1, x2, ..., xd) is proportional to p(x1, x2, ..., xd | Cj) p(Cj)

where p(Cj | x1, x2, ..., xd) is the posterior probability of class membership, i.e., the probability that X belongs to Cj. Naïve Bayes assumes that the conditional probabilities of the independent variables are statistically independent, so the class-conditional likelihood factorizes into a product of per-attribute probabilities. Using Bayes' rule above, we label a new case X with the class level Cj that achieves the highest posterior probability.
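A hand-rolled illustration of this rule on made-up counts (not the accident data) is given below; the prior and conditional probability tables are purely hypothetical.

def naive_bayes_posterior(x, priors, likelihoods):
    # x: dict attribute -> value; priors: class -> p(C);
    # likelihoods: class -> attribute -> value -> p(value | class).
    scores = {}
    for c, prior in priors.items():
        score = prior
        for attr, value in x.items():
            score *= likelihoods[c][attr].get(value, 1e-6)  # tiny floor for unseen values
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}        # normalize to posteriors

priors = {"Front_Seat": 0.8, "Second_Seat": 0.2}
likelihoods = {
    "Front_Seat":  {"Person_Type": {"Driver": 0.9, "Passenger": 0.1}},
    "Second_Seat": {"Person_Type": {"Driver": 0.05, "Passenger": 0.95}},
}
print(naive_bayes_posterior({"Person_Type": "Passenger"}, priors, likelihoods))
# Front_Seat gets a low posterior and Second_Seat a high one, as expected.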
5) ID3
Iterative Dichotomiser 3 (ID3) is a decision tree induction algorithm. In the decision tree, each node corresponds to a non-categorical attribute and each arc to a possible value of that attribute. A leaf of the tree specifies the expected value of the categorical attribute for the records described by the path from the root to that leaf. At each node of the decision tree, the associated attribute should be the non-categorical attribute which is most informative among the attributes not yet considered on the path from the root. Entropy is used to measure how informative a node is [5]. The ID3 algorithm takes all unused attributes and computes their entropy with respect to the test samples, then chooses the attribute for which the entropy is minimum (or, equivalently, the information gain is maximum), as illustrated below.
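The following small example (synthetic class counts, not the accident data) computes the entropy of a parent node and the information gain of a candidate attribute in the way described above.

from math import log2

def entropy(counts):
    # Shannon entropy of a class-count vector.
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def information_gain(parent_counts, child_counts_list):
    # Parent entropy minus the size-weighted entropy of the children.
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - weighted

# Parent node: 9 positive / 5 negative examples; an attribute splits it into
# three children with the class counts below.
parent = [9, 5]
children = [[2, 3], [4, 0], [3, 2]]
print(round(entropy(parent), 3))                      # ~0.940
print(round(information_gain(parent, children), 3))   # ~0.247; the attribute with the highest gain is chosen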
6) Accuracy Measures
The accuracy of a classifier on a given set is the
percentage of test set tuples that are correctly classified by
the classifier [3]. The confusion matrix is a useful tool for
analyzing the efficiency of the classifiers [3]. Given two
classes the contingency or confusion matrix can be given
as in Table II.
TABLE II. CONTINGENCY TABLE FOR TWO CLASSES

                    Predicted Class 1       Predicted Class 2
Actual Class 1      True Positive (TP)      False Negative (FN)
Actual Class 2      False Positive (FP)     True Negative (TN)

TP refers to the positive tuples and TN to the negative tuples that were correctly classified by the classifier; FN refers to the positive tuples and FP to the negative tuples that were incorrectly classified by the classifier [3]. The sensitivity (also called recall or True Positive Rate, TPR), precision, False Positive Rate (FPR) and accuracy can be calculated using the following equations [3]:

Recall (TPR) = TP / (TP + FN)
Precision = TP / (TP + FP)
FPR = FP / (FP + TN)
Accuracy = (TP + TN) / (TP + FP + FN + TN)
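For illustration, these measures can be recomputed from a two-class confusion matrix as follows; the counts used here are made up, not taken from the experiments.

def metrics(tp, fp, fn, tn):
    recall = tp / (tp + fn)                     # TPR / sensitivity
    precision = tp / (tp + fp)
    fpr = fp / (fp + tn)                        # false positive rate
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return recall, precision, fpr, accuracy

recall, precision, fpr, accuracy = metrics(tp=90, fp=10, fn=5, tn=95)
print(f"Recall={recall:.3f} Precision={precision:.3f} FPR={fpr:.3f} Accuracy={accuracy:.3f}")
# Recall=0.947 Precision=0.900 FPR=0.095 Accuracy=0.925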
Receiver Operating Characteristic (ROC) graphs have long been used in signal detection theory to depict the tradeoff between hit rates and false alarm rates over a noisy channel [17]. The ROC curve is a plot of TPR against FPR (False Positive Rate), which depicts the relative trade-off between true positives and false positives [17, 3]. The ROC curve space for two classifiers is given in Fig. 2 [11]. If the curve is closer to the diagonal line, the model is less accurate [3]. The area under the receiver operating characteristic curve (AUC) is calculated to assess the prediction accuracy besides the sensitivity, specificity and accuracy. An area of 0.5 represents a random test; values of AUC < 0.7 represent poor prediction, while AUC > 0.8 represents good prediction [17].

Figure 2. ROC Curve Space
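A brief sketch of how such an ROC curve and its AUC can be computed with scikit-learn is given below; the labels and scores are synthetic, not the FARS results reported later in Table V.

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.65, 0.6, 0.4, 0.3, 0.25, 0.2, 0.1]  # classifier confidence for class 1

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr, tpr)))                       # points of the ROC curve (FPR vs. TPR)
print("AUC =", roc_auc_score(y_true, y_score))   # a value above 0.8 would indicate good prediction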
IV. EXPERIMENT RESULTS
We have used Tanagra for our experimental study. It offers several data mining methods from the areas of exploratory data analysis, statistical learning, machine learning and databases [13]. We divided the accident dataset into two parts: a training dataset used to build the model and a test dataset used to evaluate it.
A. Experimental Results of Base Classifiers
In this phase we applied the base algorithms RndTree, C4.5, CART, Naïve Bayes and ID3 to classify the training dataset. The results of these models are evaluated based on their error rates, precision and recall. The error rates of the C4.5 and RndTree classifiers are given in Fig. 3 and Fig. 4 respectively.

Figure 3. Error Rate of C4.5 Classifier
RndTree gives 4.95% misclassification rate and C4.5
gives 5.95% misclassification rate.

Figure 4. Error Rate of RndTree Classifier
Similarly the error rates of all other algorithms are
given in the Table.III.
TABLE III. ERROR RATES OF BASE CLASSIFIERS

Classifier      Error Rate
C4.5            0.0595
CART            0.0759
ID3             0.0749
Naïve Bayes     0.0923
RndTree         0.0495
It is clear that the error rate produced by the RndTree algorithm is much lower than that of all the other algorithms; its misclassification percentage is the smallest. The graph in Fig. 5 shows the comparison of the error rates of all the classifiers used.

Figure 5. Error Rate of Base Classifiers
The error rate of the CART algorithm according to the growing set and the pruning set is given in Fig. 6.

Figure 6. Error Rate of CART Classifier According to Tree Complexity
The error rate is higher for the pruning set than for the growing set. The comparison between the accuracies of the RndTree and C4.5 classifiers is given in Fig. 7.

Figure 7. Base Classifiers Accuracy- Training Dataset
From Fig.7 it is clear that among base classifiers
RndTree results in high accuracy.
B. Experimental Results of Accuracy Measures
In this research work we used accuracy measures such as recall (True Positive Rate, TPR) and precision, which should be high, and the False Positive Rate (FPR), which should be low, in order to have high accuracy. Different classification algorithms may have their own characteristics on the same dataset [7]. In this work the performance of RndTree is better than that of C4.5, CART, ID3 and Naïve Bayes: the RndTree classifier is more specific and more sensitive. Most of the records with the Seating_Position value Front_Seat have been classified correctly, which increases the accuracy of RndTree. The results have been evaluated using the test dataset, and the outcomes for the training data and the test data are the same. The precision and recall accuracy measures of the C4.5, CART, ID3, RndTree and Naïve Bayes algorithms are given in Table IV.
TABLE IV. CLASSIFIERS - ACCURACY MEASURES

                 C4.5               CART               ID3                Naïve Bayes        RndTree
Class            Prec.    Recall    Prec.    Recall    Prec.    Recall    Prec.    Recall    Prec.    Recall
Front_Seat       0.9708   0.9764    0.9715   0.9661    0.9688   0.9699    0.9655   0.9498    0.9652   0.9865
Others           0.9885   0.9804    0.9983   0.9629    0.9987   0.9605    0.9981   0.9625    0.9932   0.9887
Second_Seat      0.7640   0.8420    0.6856   0.8357    0.6932   0.8297    0.6486   0.7853    0.8407   0.8004
Unknown          0.7952   0.5243    0.7611   0.3227    0.7490   0.3704    0.4342   0.5273    0.9200   0.7309
Cargo_Area       0.8034   0.5731    0.5055   0.3950    0.4864   0.2625    0.5567   0.3339    0.8351   0.6670
Third_Seat       0.6333   0.0896    0.0000   0.0000    0.6129   0.0299    0.3333   0.0283    0.9130   0.3962
Trailing_Unit    0.9600   0.6316    0.0000   0.0000    0.0000   0.0000    0.4030   0.7105    0.8929   0.6579
Fourth_Seat      0.0000   0.0000    0.0000   0.0000    0.0000   0.0000    0.2000   0.0385    0.9000   0.3462
Fig. 8 shows the graph which depicts the accuracy measures of the base classifiers. It is clear from the graph that RndTree gives high precision and recall values.


Figure 8. Classifiers Accuracy Measures
Fig. 9 shows the performance measures using ROC curves. Score 5 (RndTree) gives the curve which is nearest to the perfection point (i.e., 1).

Figure 9. Classifiers Accuracy- Training Dataset
Table V lists the AUC values of all the classifiers for different sizes of the training data. Though the AUC of all the classifiers is greater than 0.7, the AUC of RndTree (0.9873) is higher than that of the other classifiers, which confirms that RndTree is a comparatively better classifier. We obtained the same results for the test data.
TABLE V. ROC CURVE RESULTS
Sample size: 77125; positive examples: 9433; negative examples: 67692
AUC: C4.5 (Score 1) = 0.9779, CART (Score 2) = 0.9605, ID3 (Score 3) = 0.9723, NaiveBayes (Score 4) = 0.9611, RndTree (Score 5) = 0.9873
Each row below lists the target size (%) followed, for C4.5, CART, ID3, NaiveBayes and RndTree in turn, by the Score, FPR and TPR values.
0 1 0 0 0.8337 0 0 1 0 0 0.9983 0 0 1 0 0
5 0.8125 0.0051 0.3722 0.74 0.0096 0.3397 0.7486 0.0068 0.3601 0.8217 0.0104 0.3341 1 0 0.4088
10 0.6667 0.0196 0.6766 0.6358 0.0294 0.6065 0.6 0.0249 0.6388 0.6476 0.0304 0.5996 0.5714 0.0113 0.7364
15 0.3125 0.0466 0.8918 0.3354 0.0541 0.8382 0.4231 0.053 0.8459 0.4389 0.0605 0.7921 0.3019 0.0439 0.9115
20 0.1132 0.094 0.9609 0.0985 0.1033 0.8938 0.125 0.0947 0.9557 0.2213 0.1012 0.909 0.0843 0.0889 0.9969
25 0.0323 0.147 0.9891 0.0985 0.1547 0.9336 0.0571 0.1468 0.9907 0.0812 0.1502 0.9659 0 0.1455 1
30 0.0209 0.2027 0.9984 0.0985 0.2063 0.9723 0 0.2024 1 0.0176 0.2035 0.9928 0 0.2024 1
35 0.0002 0.2594 1 0.0003 0.2594 0.9999 0 0.2594 1 0.0001 0.2594 0.9998 0 0.2594 1
40 0 0.3164 1 0.0003 0.3164 1 0 0.3164 1 0 0.3164 1 0 0.3164 1
45 0 0.3734 1 0 0.3734 1 0 0.3734 1 0 0.3734 1 0 0.3734 1
50 0 0.4303 1 0 0.4303 1 0 0.4303 1 0 0.4303 1 0 0.4303 1
55 0 0.4873 1 0 0.4873 1 0 0.4873 1 0 0.4873 1 0 0.4873 1
60 0 0.5443 1 0 0.5443 1 0 0.5443 1 0 0.5443 1 0 0.5443 1
65 0 0.6012 1 0 0.6012 1 0 0.6012 1 0 0.6012 1 0 0.6012 1
70 0 0.6582 1 0 0.6582 1 0 0.6582 1 0 0.6582 1 0 0.6582 1
75 0 0.7152 1 0 0.7152 1 0 0.7152 1 0 0.7152 1 0 0.7152 1
80 0 0.7721 1 0 0.7721 1 0 0.7721 1 0 0.7721 1 0 0.7721 1
85 0 0.8291 1 0 0.8291 1 0 0.8291 1 0 0.8291 1 0 0.8291 1
90 0 0.8861 1 0 0.8861 1 0 0.8861 1 0 0.8861 1 0 0.8861 1
95 0 0.943 1 0 0.943 1 0 0.943 1 0 0.943 1 0 0.943 1
100 0 1 1 0 1 1 0 1 1 0 1 1 0 1 1

V. CONCLUSION
In this paper we analyzed the road accident training dataset using the C4.5, CART, ID3, Naïve Bayes and RndTree algorithms to find patterns using Seating_Position based classification. Among these algorithms, RndTree gives the highest accuracy. The accuracy is evaluated based on precision, recall and ROC curves. The results show that RndTree gives 95.05% accuracy.
REFERENCES
[1] CART, http://www.statsoft.com/textbook/classification-trees
[2] Chang, L. and H. Wang, "Analysis of traffic injury severity: An application of non-parametric classification tree techniques", Accident Analysis and Prevention, Vol. 38(5), pp. 1019-1027, 2006.
[3] Han, J. and Kamber, M., Data Mining: Concepts and Techniques, Academic Press, ISBN 1-55860-489-8.
[4] Hong Yu, Xiaolei Huang, Xiaorong Hu and Hengwen Cai, "A comparative study on data mining algorithms for individual credit risk evaluation", Int. Conference on Management of e-Commerce and e-Government, 2010.
[5] ID3, http://www.cis.temple.edu/~ingargio/cis587/readings/id3-c45.html
[6] Jaree Thongkam, Guandong Xu and Yanchun Zhang, "AdaBoost algorithm with random forests for predicting breast cancer survivability", International Joint Conference on Neural Networks, 2008.
[7] Jingran Wen, Xiaoyan Zhang, Ye Xu, Zuofeng Li and Lei Liu, "Comparison of AdaBoost and logistic regression for detecting colorectal cancer patients with synchronous liver metastasis", International Conference on Biomedical and Pharmaceutical Engineering, December 2-4, 2009.
[8] Kwok Suen Ng, W. T. Hung and Wing-gun Wong, "An algorithm for assessing the risk of traffic accidents", Journal of Safety Research, Vol. 33, pp. 387-410, 2002.
[9] Naeem Seliya, Taghi M. Khoshgoftaar and Jason Van Hulse, "A study on the relationships of classifier performance metrics", IEEE International Conference on Tools with Artificial Intelligence, pp. 59-66, 2009.
[10] Random Tree Algorithm, http://www.answers.com
[11] ROC Space, http://en.wikipedia.org/wiki/File:ROC_space-2.png
[12] S. Shanthi and Dr. R. Geetha Ramani, "Classification of Vehicle Collision Patterns in Road Accidents using Data Mining Algorithms", Int. Journal of Computer Applications, Vol. 35, No. 12, pp. 30-37.
[13] Tanagra data mining tutorials, http://data-mining-tutorials.blogspot.com
[14] Xindong Wu, Vipin Kumar, J. Ross Quinlan, et al., "Top 10 algorithms in data mining", Knowledge and Information Systems, Vol. 14, pp. 1-37.
[15] World Health Organization, "Global status report on road safety: time for action", Geneva, 2009.
[16] www.nhtsa.gov, FARS analytic reference guide.
[17] www.cs.iastate.edu/~jtian/cs573/WWW/Lectures/lecture06-ClassifierEvaluation-2up.pdf, Classifier Evaluation Techniques.
[18] Yong Soo Kim, "Comparison of the decision tree, artificial neural network, and linear regression methods based on the number and types of independent variables and sample size", Expert Systems with Applications, Vol. 34, pp. 1227-1234, 2008.


Efficient Classifier for Classification of Hepatitis C Virus Clinical Data through
Data Mining Algorithms and Techniques

Shomona Gracia Jacob (1)
Department of Computer Science and Engineering
Rajalakshmi Engineering College
Chennai, India
(1) Ph.D Research Scholar, E-mail:

Nancy. P (3)
Department of Computer Science and Engineering
(1) Ph.D Research Scholar, E-mail:

Dr. R. Geetha Ramani (2)
Department of Computer Science and Engineering
Rajalakshmi Engineering College
Chennai, India
(2) Professor & Head, Department of CSE



Abstract: Data mining refers to the task of identifying and extracting related and vital facts/information from a large collection of exhaustive data. A key application area of data mining in the field of clinical research is the classification of medical records for the purpose of predicting future trends. The focal point of this paper is to evaluate the performance of twenty classification algorithms on the cancer-associated (oncovirus) Hepatitis C virus dataset from the UCI Machine Learning Repository. This research work involves performing binary classification on the dataset, which comprises 155 instances and 19 predictor features. The performance of the classification algorithms reveals that Random Tree classification and Quinlan's C4.5 algorithm produce 100 percent classifier accuracy on this dataset with proper tuning of the algorithmic parameters. Moreover, the accuracy of the classifier was tested with new test data to verify its precision in classification.
Keywords- Data Mining; Classification; Hepatitis C Virus; Clinical Data
I. INTRODUCTION
Data mining [1] is the technique applied to scrutinize data from different perspectives and then recapitulate it into constructive, meaningful and effective information. From a technical point of view, the goal of data mining [2] is to extract knowledge from a data set in a human-understandable structure, and it involves databases, data management, data preprocessing, classification, etc. Classification is one of the significant phases in data mining; it aims to place an unknown set of records into one of a number of already known, pre-defined groups. Given a set of instances, it evaluates the values each example has for each of the predictor features, makes decisions and places the new record into the class that best describes it. It is the task of generalizing known structure to apply to a new, previously unseen record. Classification [3] is a data mining function that assigns items in a collection to target categories or classes; the goal of classification is to accurately predict the target class for each case in the data.

Clinical data mining [4, 5] is the application of data mining techniques to records containing medical data. In this paper we perform binary classification, which refers to a classification process involving only two target values. In this research our aim is to classify the patient records into two categories, viz. fatal cases of Hepatitis C Virus (HCV) and non-fatal cases of HCV.
Any virus that can cause cancer is termed an oncovirus [6]. HCV infection is the leading cause of liver transplantation in the U.S. and is a risk factor for liver cancer. HCV [7] is one of several viruses that cause hepatitis (inflammation of the liver). Up to 85% of individuals who are initially (acutely) infected with HCV fail to eliminate the virus and tend to become chronically infected. Hepatitis C infection is an infection of the liver caused by the hepatitis C virus (HCV). Over decades, chronic infection with HCV damages the liver and can cause liver failure. In the U.S. [8, 9], more than three million people are chronically infected with HCV, and there are 8,000 to 10,000 deaths each year in the U.S. related to HCV infection.
The painful symptoms at advanced stages of HCV infection, and the current state of affairs where people tend to be ignorant of preventive measures until the disease becomes a snare to their well-being, are perturbing. The growing number of HCV-infected patients and the imperfect response of infected patients to therapy are certainly alarming. The agony faced by the patients and the need to make people aware of the consequences of this oncovirus infection have been the driving force behind this research.
In this paper we compare the performance of twenty classification algorithms, viz. Binary Logistic Regression (BLR), Quinlan's C4.5 decision tree algorithm (C4.5), Partial Least Squares for Classification (C-PLS), Classification Tree (C-RT), Cost-Sensitive Classification Tree (CS-CRT), Cost-Sensitive Decision Tree algorithm (CS-MC4), SVM for Classification (C-SVC), Iterative Dichotomiser (ID3), K-Nearest Neighbor (K-NN), Linear Discriminant Analysis (LDA), Logistic Regression, Multilayer Perceptron (MP), Multinomial Logistic Regression (MLR), Naïve Bayes Continuous (NBC), Partial Least Squares Discriminant/Linear Discriminant Analysis (PLS-DA/LDA), Prototype-Nearest Neighbor (P-NN), Radial Basis Function (RBF), Random Tree (Rnd Tree) and Support Vector Machine (SVM) classification algorithms, in terms of classification rate or accuracy, and propose a data mining framework for the design of an efficient classifier. A rough sketch of this kind of comparison is given below.
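As a rough sketch of such a comparison (only a handful of the twenty algorithms are shown, and the file name, target column and cross-validation setup are assumptions on our part, not the paper's exact protocol), the evaluation could look as follows with scikit-learn.

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

df = pd.read_csv("hepatitis_preprocessed.csv")   # hypothetical cleaned UCI hepatitis data
X, y = df.drop(columns=["Class"]), df["Class"]   # binary target: fatal / non-fatal

models = {
    "C4.5-like tree": DecisionTreeClassifier(criterion="entropy"),
    "Random Tree/Forest": RandomForestClassifier(n_estimators=100),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)  # 10-fold cross-validated accuracy
    print(f"{name:20s} mean accuracy = {scores.mean():.3f}")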
A. Paper Organisation
Section 2 reviews the related work in the area of data
mining. Section 3 presents the proposed system design. In
Section 4 we discuss the experimental results while Section
5 concludes the paper.
II. RELATED WORK
Previous research on application of data mining
techniques on clinical data is succinctly presented here.
Jilani et al. [10] presented an automatic diagnosis
system based on Neural Network for hepatitis virus. The
system has two stages, feature extraction and classification
stages. In the feature extraction stage, the hepatitis features
were obtained from the UCI Repository of Machine
Learning Databases. Missing values of the instances were
adjusted using local mean method. Then, the number of
these features was reduced to 6 from 19 due to relative
significance of fields. In the classification stage, these
reduced features were given as inputs to the Neural
Network classifier. The classification accuracy of this ANN diagnosis was 99.1% in the training phase, while the test phase gave 100 percent accuracy. However, they limited the scope of their work to Neural Network learning.
Yasin et al. [11] scrutinize factors that contribute significantly to the risk of hepatitis C virus infection.
First, the dimension of the problem was trimmed down.
Next binary logistic regression was applied to classify the
cases by using qualitative and quantitative approaches for
data reduction. The three stage procedure produced more
than 89% accurate classification. Gong et al. [12] reported an 80 percent classifier accuracy on the Hepatitis C virus dataset. However, they implemented only the ID3, J48, Bayes Net, Naïve Bayes, Linear Regression, Logistic and Prism algorithms on the pre-processed data. Anagnostopoulos and
Maglogiannis [13] deal with breast cancer diagnostic and
prognostic estimations employing neural networks over the
Wisconsin Breast Cancer datasets. They employ a
probabilistic approach to solve the diagnosis problem.
Moreover regression algorithms estimate the time interval
that corresponds to the right end-point of the patient's
disease-free survival time or the time when the tumour
recurs (time-to-recur). For the diagnosis problem, the
accuracy of the neural network in terms of sensitivity and
specificity was measured at 98.6% and 97.5% respectively,
using the leave-one-out test method. In the case of the
prognosis problem, the accuracy of the neural network was
measured through a stratified tenfold cross-validation
approach. Sensitivity ranged between 80.5% and 91.8%,
while specificity ranged between 91.9% and 97.9%,
depending on the tested fold and the partition of the
predicted period. The prognostic recurrence predictions
were then further evaluated using survival analysis and
compared with other techniques found in the literature. Mullins et al. [14] applied a new data mining technique named "Healthminer" to a large cohort of 667,000 inpatient and
outpatient records from an academic digital system. Their
principal goal was to investigate the potential value of
searching these databases, without bias, for novel
biomedical insights. Their results conclude that
unsupervised data mining of large clinical repositories is
feasible. W. H. Wolberg et al. [15] have reported their findings on breast cancer. Their paper described the accuracy of the
system in diagnostically classifying 569 (212 malignant and
357 benign) breast cancer cases and its prospective
accuracy in testing on 75 (23 malignant, 51 benign, and 1
papilloma with atypia) newly obtained samples.
Additionally, prognostic implications of the system were
explored. The prospective accuracy was estimated at 97.2%
with 96.7% sensitivity and 97.5% specificity using ten-fold
cross validation.
The proposed design of the data mining framework is
described in the following section.
III. PROPOSED SYSTEM DESIGN
The proposed data mining framework comprises two phases: a training phase and a test phase. The proposed data mining framework is portrayed in Figure 1.
Figure 1. Proposed System Design of Classifier.
Phase I (Training Phase): the training dataset (HCV data) passes through data visualization and pre-processing, then classification with the candidate algorithms (Random Tree, ID3, KNN, C4.5, C-PLS, LDA, SVM, Naïve Bayes); the classifiers' performance is compared to select the best classifier and its classification rules.
Phase II (Test Phase): test data is given to the trained classifier, which outputs the classifier result (fatal/non-fatal case).
The training phase involves the process of learning patterns
and rules that will facilitate the formulation of decisions in
new cases. The test phase substantiates the fact that the
trained classifier is able to accurately predict the nature of a
new Hepatitis C infected case suggesting whether the case
is of fatal or non-fatal nature. The training phase includes
the following components:
- Training dataset: a collection of Hepatitis C virus infected cases
- Data Pre-processing
- Classification
The test phase incorporates the process of
classifying a new test data utilizing the classification rules
generated in the training phase and reports 100 percent
accuracy.
A. Hepatitis C Dataset (Training Data)
The Hepatitis C virus dataset [16] comprises 155 instances and 19 predictor attributes. The target attribute is binary, that is, the outcome can be either of two classes that indicate whether the medical record pertains to a fatal case or a non-fatal case. The attributes and their IDs are listed in Table 1. There are 32 fatal cases and 123 non-fatal cases.
B. Data Pre-processing
The HCV data records are downloaded in the form of a
text file. These records are then imported into MS-Excel
and placed under the appropriate column headings. The
missing values are replaced with appropriate values. These
records are then loaded into TANAGRA [17], a data mining
tool with the predictor and target attributes defined using
appropriate techniques. Table 1 describes the HCV
attributes used in training the classifier.
TABLE I DESCRIPTION OF ATTRIBUTES OF HEPATITIS C VIRUS DATASET
S.No Attribute Possible Values
1 Class Fatal -1,Not Fatal-2
2 Age 10 to 80
3 Sex Male-1, Female-2
4 Steroid No-1, Yes-2
5 Antiviral No-1, Yes-2
6 Fatigue No-1, Yes-2
7 Malaise No-1, Yes-2
8 Anorexia No-1, Yes-2
9 Liver big No-1, Yes-2
10 Liver firm No-1, Yes-2
11 Spleen palpable No-1, Yes-2
12 Spiders No-1, Yes-2
13 Ascites No-1, Yes-2
14 Varices No-1, Yes-2
15 Bilirubin 0.39 to 4.00
16 Alk phosphate 0 to 4
17 Sgot 0 to 250
18 Albumin 0 to 6
19 Protime 0 to 90
20 Histology No-1, Yes-2
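As an illustration of the pre-processing described in the preceding sub-section, a minimal Python sketch is given below. It is not the tool chain used in this paper (MS-Excel and TANAGRA); the file name hepatitis.data, the use of pandas and the choice of column means for the missing values are assumptions made purely for illustration, following the column order of Table 1.

# Illustrative pre-processing sketch; the paper itself uses MS-Excel and TANAGRA.
import pandas as pd

columns = ["class", "age", "sex", "steroid", "antiviral", "fatigue", "malaise",
           "anorexia", "liver_big", "liver_firm", "spleen_palpable", "spiders",
           "ascites", "varices", "bilirubin", "alk_phosphate", "sgot",
           "albumin", "protime", "histology"]

# The UCI file marks missing values with '?'.
data = pd.read_csv("hepatitis.data", names=columns, na_values="?")

# Replace missing values with an appropriate substitute (column means here,
# purely for illustration) and separate the predictors from the binary target.
data = data.fillna(data.mean())
X = data.drop(columns="class")
y = data["class"]          # 1 = fatal (F), 2 = non-fatal (NF)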
The classification phase and the algorithms producing
100 percent classifier accuracy are explained in the
following sub-section.
C. Classification Algorithms
Classification is the technique applied to place an incoming data record under a particular class or category based on the evaluation of its features. The classification [3] phase of clinical data mining places emphasis on understanding the clinical data, acting as a tool for healthcare professionals, and developing a data analysis methodology suitable for medical data. Classification is the most frequently used data mining
function. Classification algorithms [3, 18] predict one or
more discrete variables, based on the other attributes in the
dataset. When the prediction is done on continuous
variables, it is termed regression. We describe in detail the Random Tree and Quinlan's C4.5 [19] algorithms, which generated 100 percent classifier accuracy on the HCV dataset.
1) Quinlan's C4.5 algorithm
C4.5 is an algorithm developed by Ross Quinlan [18, 19] to generate a decision tree. C4.5 builds decision
trees from a set of training data in the same way as ID3,
using the concept of information entropy. The training data
is a set S = s1, s2, ... of already classified samples. Each sample si = (x1, x2, ...) is a vector where x1, x2, ... represent attributes or features of the sample. The training data is augmented with a vector C = c1, c2, ... where c1, c2, ... represent the class to which each sample belongs. At each
node of the tree, C4.5 [20] chooses one attribute of the data
that most effectively splits its set of samples into subsets
enriched in one class or the other. The attribute with the
highest normalized information gain is chosen to make the
decision. The C4.5 algorithm [21] then recurses on the
smaller sublists. Sample rules generated from C4.5
algorithm are given below. The minimum size of the leaves
and the confidence level need to be set to 1 to ensure
accurate classification.
ASCITES >= 1.5000
  IF LIVER BIG
    AGE < 47.0000 then Class = NF (100.00 % of 1 example)
    AGE >= 47.0000 then Class = F (100.00 % of 2 examples)
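The split criterion described above (normalized information gain, or gain ratio) can be sketched as follows. This is a generic Python illustration, not the TANAGRA implementation; the function names are ours and the samples are assumed to be dictionaries of attribute values.

# Entropy, information gain and the normalized gain (gain ratio) that C4.5
# uses to choose the attribute that most effectively splits the samples.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain_ratio(rows, labels, attribute):
    # Partition the class labels by the value taken on the candidate attribute.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    total = len(labels)
    remainder = sum(len(part) / total * entropy(part)
                    for part in partitions.values())
    gain = entropy(labels) - remainder
    split_info = entropy([row[attribute] for row in rows])
    return gain / split_info if split_info > 0 else 0.0

The attribute with the highest gain ratio would be chosen at each node, as described above, and the algorithm would then recurse on the resulting sublists.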
2) Random Tree Algorithm
Random trees have been introduced by Leo Breiman and
Adele Cutler [22]. The algorithm can deal with both
classification and regression problems [23]. The random
trees classifier takes the input feature vector, classifies it
with every tree in the forest, and outputs the class label that
received the majority of votes [17, 24]. Sample rules
generated by the Random Tree algorithm are given below.
ASCITES >= 1.5000
  SGOT < 7.0000
    AGE < 37.0000 then Class = NF (100.00 % of 1 example)
    AGE >= 37.0000 then Class = F (100.00 % of 2 examples)
  SGOT >= 7.0000
    IF LIVER BIG
      AGE < 47.5000 then Class = NF (100.00 % of 1 example)
      AGE >= 47.5000 then Class = F (100.00 % of 1 example)
Decision Tree for the above rule is presented in Figure 2.
Figure 2. Decision Tree from Random Tree Algorithm (the tree splits on Ascites >= 1.5, then SGOT < 7, Age < 37, Liver Big and Age < 47, leading to the NF and F leaves).
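The voting rule of the random trees classifier described above can be condensed into a short sketch. It assumes a collection of already trained tree objects exposing a predict method; it illustrates only the majority vote, not Breiman and Cutler's training procedure.

# Each tree in the forest classifies the feature vector; the label with the
# most votes is returned (assumes every tree object has a predict method).
from collections import Counter

def forest_predict(trees, feature_vector):
    votes = Counter(tree.predict(feature_vector) for tree in trees)
    return votes.most_common(1)[0][0]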
D. Test Phase
Once the classifier is trained to recognize the patterns in
the input datasets, we need to test the accuracy of the
classifier on a new data set. The accuracy of the classifier is
measured in terms of the misclassification rate. The results
obtained by classification are elaborated in the following
section.
IV. PERFORMANCE EVALUATION
The training data comprises 155 cases, each comprising 19 independent features and 1 target attribute. These features are loaded into TANAGRA [17]
and the classification algorithms are applied. The
performance measures used to evaluate the classification
algorithms are described below:
A. Performance Measures
The measures taken into consideration to assess the
performance of the classification algorithms are stated as
given by Han and Kamber [1].
1) Accuracy
The accuracy [1] of a classifier on a given test set is the
percentage of test set tuples that are correctly classified
by the classifier.
2) Misclassification Rate
The error rate [1, 25] is also called the misclassification
rate. It is simply 1-Acc (M), where Acc (M) is the
accuracy of M.
3) Precision
Precision [1] refers to the fraction of the data that is correctly classified by the classification algorithm; a precision of 1.000 indicates 100% accuracy.
4) Recall
Recall [1] is the percentage of the information relevant to the class that is correctly classified.
5) Confusion Matrix
A confusion matrix [25] contains information about
actual and predicted classifications done by a classification
system. Performance of such systems is commonly
evaluated using the data in the matrix.
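All of the measures above can be computed directly from the four cells of the confusion matrix. The following small sketch is illustrative only (the function name and argument layout are ours); the example values correspond to the error-free confusion matrix reported later for the Random Tree classifier.

# Accuracy, misclassification rate, precision and recall from the four cells
# of a binary confusion matrix (tn, fp, fn, tp).
def evaluate(tn, fp, fn, tp):
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    error_rate = 1.0 - accuracy                      # 1 - Acc(M)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, error_rate, precision, recall

# Example: 123 NF and 32 F cases, all classified correctly.
print(evaluate(tn=123, fp=0, fn=0, tp=32))           # (1.0, 0.0, 1.0, 1.0)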
B. Experimental Results.
The results (sample) obtained after classification on the
Hepatitis C Virus dataset from the UCI Machine Learning
Repository are given in the following paragraphs. The
comparative performance measures in terms of accuracy,
of all the twenty classification algorithms are tabulated in
Table 2.
TABLE II. ACCURACY OF CLASSIFICATION ALGORITHMS

S.No  Classification Algorithm           Accuracy (%)
1     BLR                                89.68
2     C4.5                               100
3     C-PLS                              77.42
4     C-RT                               79.35
5     CS-CRT                             79.35
6     CS-MC4                             85.81
7     C-SVC                              87.74
8     ID3                                79.35
9     KNN                                87.1
10    LDA                                88.39
11    Log-Reg                            79.35
12    Multilayer Perceptron              91.61
13    Multinomial Logistic Regression    89.68
14    Naïve Bayes Continuous             83.87
15    PLS-DA                             88.39
16    PLS-LDA                            87.74
17    Prototype-NN                       80.65
18    Radial Basis Function              82.58
19    Rnd Tree                           100
20    SVM                                85.81
The results of the Random Tree classification algorithm and Quinlan's C4.5 algorithm are portrayed in Figure 3 and Figure 4 respectively. The rates of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) from the confusion matrix clearly reveal 100 percent accurate classification for these algorithms. It is essential to estimate the parameters for the execution of these algorithms. The number of attributes for split in Random Tree needs to be set to greater than 4 for 100 percent accuracy.
Supervised Learning 1 (Rnd Tree)
Nb att for split = 10
Classifier performance: error rate 0.0000
Values prediction: NF - Recall 1.0000, 1-Precision 0.0000; F - Recall 1.0000, 1-Precision 0.0000
Confusion matrix:
        NF    F    Sum
NF     123    0    123
F        0   32     32
Sum    123   32    155

Figure 3. Random Tree Classification Result
The minimum size of the leaves and the confidence
level in the C4.5 algorithm needs to be set to 1 to obtain
100 percent accurate classification. The result of the C4.5
algorithm is given in Figure 4. The next better performing
algorithm is the classification algorithm using Multilayer
Perceptron that reports an accuracy of 91.61%.
Supervised Learning 2 (C4.5)
Decision tree (C4.5) parameters: min size of leaves = 1, confidence level for pessimistic pruning = 1.00
Classifier performance: error rate 0.0000
Values prediction: NF - Recall 1.0000, 1-Precision 0.0000; F - Recall 1.0000, 1-Precision 0.0000
Confusion matrix:
        NF    F    Sum
NF     123    0    123
F        0   32     32
Sum    123   32    155

Figure 4. Quinlan's C4.5 Classification Result
The graphical representation of the performance of
classification algorithms is portrayed in Figure 5.
Figure 5. Graphical Representation of Classifier Performance Comparison
V. CONCLUSION
A major area of active research in data mining is the
cosmic sphere of medical diagnosis and prognosis.
Classification is one of the imperative phases in data
mining that is indispensable to the field of medical
research. In this paper, we have brought out the
performance of classification algorithms on the Hepatitis C
virus dataset that comprises of individual medical cases
containing information on whether certain indications and
clinical findings could be critical or not. Proper
classification of this data and efficient training of a
classification system would definitely enable clinicians and
oncologists to predict the severity of a new clinical case
with similar reported symptoms and would enable them to
combat such viruses effectively. We substantiate the fact that the Random Tree and Quinlan's C4.5 classification algorithms produce 100 percent classification accuracy on this dataset. We have tested the
accuracy of this classifier system with a test dataset and
conclude that this system gives 100 percent accuracy.
REFERENCES
[1] J. Han and M. Kamber, Data Mining; Concepts and
Techniques, Morgan Kaufmann Publishers, 2000.
[2] Tan, Steinbach, Kumar, Introduction to Data Mining,
2004.
[3] Matthew N. Anyanwu, Sajjan G. Shiva, Comparative
Analysis of Serial Decision Tree Classification
Algorithms, International Journal of Computer
Science and Security, (IJCSS) Volume (3) : Issue (3)
[4] J. Iavindrasana, G. Cohen, A. Depeursinge, H. Müller, R. Meyer, A. Geissbuhler, Clinical Data Mining: A Review. IMIA Yearbook, 2009. ISSN: 0943-4747, Issue 2009:1, pp. 121-133.
[5] Norén GN, Bate A, Hopstadius J, Star K, Edwards IR. Temporal Pattern Discovery for Trends and Transient Effects: Its Application to Patient Records. Proceedings of the Fourteenth International Conference on Knowledge Discovery and Data Mining, SIGKDD 2008, pages 963-971. Las Vegas, NV, 2008.
[6] Parkin, Donald Maxwell (2006). "The global health burden of infection-associated cancers in the year 2002". International Journal of Cancer 118 (12): 3030-3044. doi:10.1002/ijc.21731. PMID 16404738.
[7] World Health Organization. Viral Cancers: Hepatitis
C. Retrieved April 3, 2008 from
http://www.who.int/vaccine_research/diseases/viral_ca
ncers/en/index2.html
[8] Rosen, HR (2011 Jun 23). "Clinical practice. Chronic
hepatitis C infection.". The New England journal of
medicine 364 (25): 2429-38. PMID 21696309.
[9] Armstrong GL, Wasley A, Simard EP et al. The
prevalence of hepatitis C virus infection in the United
States, 1999. Annals of Internal Medicine. 2006;
144:705-714
[10] Tahseen A Jilani, Huda Yasin and Madiha
Mohammad Yasin. Article: PCA-ANN for
Classification of Hepatitis-C Patients.International
Journal of Computer Applications 14(7):16, February
2011. Published by Foundation of Computer Science.
[11] Huda Yasin, Tahseen A Jilani and Madiha Danish.
Article: Hepatitis-C Classification using Data Mining
Techniques.International Journal of Computer
Applications 24(3):16, June 2011. Published by
Foundation of Computer Science.
[12] Diaconis,P. & Efron,B. (1983). Computer-Intensive
Methods in
Statistics. Scientific American, Volume 248.
[13] Ioannis Anagnostopoulos and Ilias Maglogiannis,
Neural network-based diagnostic and Prognostic
estimations in breast cancer microscopic instances,
Medical and Biological Engineering and Computing,
Volume 44, Number 9, 773-784, 2006.
[14] Irene M. Mullins, Mir S. Siadaty, Jason Lyman, Ken
Scully, Carleton T. Garrett, W. Greg Miller, Rudy
Muller, Barry Robeson, Chid Apte, Sholom Weiss,
Isidore Rigoutsos, Daniel Platt, Simona Cohen,
William A. Knaus, Data mining and clinical data
repositories: Insights from a 667,000 patient data set,
Elsevier- Computers in Biology and Medicine, August
2005.
[15] W.H. Wolberg, W.N. Street, and O.L. Mangasarian,
Image analysis and machine learning applied to Breast
cancer diagnosis and prognosis, Analytical and
Quantitative Cytology and Histology, Vol. 17, No. 2,
pages 77-87, April 1995
[16] UCI Irvine Machine Learning Repository ,
http://archive.ics.uci.edu/ml/datasets/Hepatitis
[17] Tanagra Data Mining tutorials, http://data-mining-
tutorials.blogspot.com/
[18] Jeffrey W. Seifert, Data Mining:An Overview, CRS
Report for Congress, Received through the CRS Web
[19] Ron Kohavi and Ross Quinlan, Decision Tree
Discovery, October 10, 1999.
[20] Quinlan, J. R. C4.5: Programs for Machine Learning.
Morgan Kaufmann Publishers, 1993.
[21] J. R. Quinlan. Improved use of continuous attributes in
c4.5. Journal of Artificial Intelligence Research, 4:77-
90, 1996.
[22] Leo Breiman, Adele Cuttler, Random Trees,
http://www.stat.berkeley.edu/users/breiman/RandomF
orests/
[23] Jeff Hussmann, Tree Classifiers and Random Forests.
[24] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, Dan Steinberg, Top 10 algorithms in data mining, Knowledge and Information Systems (2008) 14:1-37. DOI 10.1007/s10115-007-0114-2
[25] Ron Kohavi and Foster Provost, On Applied Research
in Machine Learning. Editorial for the Special Issue on
Applications of Machine Learning and the Knowledge
Discovery Process (volume 30, Number 2/3,
February/March 1998)
Scalable, Robust, Efficient Location Database Architecture Based on the Location-
Independent PTNs
Kaptan Singh
Computer Science & Engg
T.I.E.I.T.
Bhopal, India

Abstract: This paper describes current and proposed
protocols for mobility management for public land mobile
network (PLMN). First, a review is provided of location
management algorithms for personal communication systems
(PCS) implemented over a PLMN network. In this article
we focus on, different database management schemes which
reduces the user profile lookup time and the signaling
traffic. The next-generation mobile network will support
terminal mobility, personal mobility, and service provider
portability, making global roaming seamless. A location-
independent personal telecommunication number (PTN)
scheme is conducive to implementing such a global mobile
system. However, the non-geographic PTNs coupled with the
anticipated large number of mobile users in future mobile
networks may introduce very large centralized databases.
This necessitates research into the design and performance
of high- throughput database technologies used in mobile
systems to ensure that future systems will be able to carry
efficiently the anticipated loads. This paper proposes a scalable, robust, and efficient location database architecture based on the location-independent PTNs. The proposed
multi-tree database architecture consists of a number of
database subsystems, each of which is a three-level tree
structure and is connected to the others only through its
root. By exploiting the localized nature of calling and
mobility patterns, the proposed architecture effectively
reduces the database loads as well as the signaling traffic
incurred by the location registration and call delivery
procedures. Results have revealed that the proposed
database architecture for location management can
effectively support the anticipated high user density in the
future mobile networks.

Keywords- Database Architecture, Location Management,
Location Tracking, Mobile Networks.

I. INTRODUCTION
The next-generation mobile network will be an integrated
global system that provides heterogeneous services across
network providers, network backbones, and geographical regions [1]. Global roaming is a basic service of the future mobile networks, where terminal mobility, personal mobility, and service provider portability must be supported.
Proc. of the Intl. Conf. on Computer Applications Volume 1. Copyright 2012 Techno Forum Group, India. ISBN: 978-81-920575-8-3 :: doi: 10.73312/ISBN_0768 ACM #: dber.imera.10.73312
A non-
geographic personal telecommunication number (PTN) for
each mobile user is desirable to implement these types of
mobile freedom. With location-independent PTNs, users can access their personalized services regardless of terminal or attachment point to the network; they can move into a different service provider's network and continue to receive subscribed services without changing their PTNs. Another advantage of the flat PTN scheme is that it is much more efficient in terms of capacity than the location-dependent numbering scheme, where the capacity of the subscriber number (SN) may be exhausted in a highly populated area whereas the SN's capacity is wasted in a sparsely populated area [2]. However, using
the location-independent numbering plan may introduce
large centralized databases into a mobile system. To
make things worse, each call may require an interrogation
to the centralized databases, thus signaling traffic will
grow considerably and call setup time may increase
dramatically. The large centralized databases may
become the bottleneck of the global mobile system, thus
necessitating research into the design and performance of
high-throughput database technologies as used in mobile
networks to meet future demands. Location management
is one of the most important functions to support global
roaming. Location management procedures involve
numerous operations in various databases. These
databases record the relevant information of a mobile
user, trace the user's location by updating the relevant database entries, and map the user's PTN to its current
location. In current cellular networks location tracking is
based on two types of location databases [3], [4]: the
home location register (HLR) and the visitor location
register (VLR). In general, there is an HLR for each
mobile network. Each mobile subscriber has a service
profile stored in the HLR. The user profile contains
information such as the service types subscribed, the
user's current location, etc. The VLR where a mobile terminal (MT) resides also keeps a copy of the MT's user profile. A VLR is usually collocated with a mobile
switching center (MSC), which controls a group of registration areas (RAs). Whenever an MT changes its RA, the HLR is updated to point to the new location, and the MT is deregistered from the old VLR. As an incoming call arrives, the called MT's HLR is queried to get the location of the serving VLR of the MT, then a routing address request message is sent to the MSC/VLR. The MSC allocates a temporary local directory number (TLDN) to the called MT and sends back the TLDN to the HLR, which in turn relays this information to the calling MSC. A connection to the called MSC then can be set up through the SS7 network. An MSC/VLR may not know the address of an MT's HLR, and a global title translation (GTT) is needed to get the address of the MT's HLR. With the two-level HLR-VLR database architecture, the HLR needs to be accessed for each location update or call delivery. Due to an expected much higher user density in the future mobile networks, the updating and querying loads on the location databases will be very heavy [5] and the two-level database architecture will become infeasible.
In this paper, a distributed hierarchical database architecture based on the location-independent PTN plan is proposed to support location tracking in a global mobile system. The proposed database system is a multi-tree structure (Fig. 1), consisting of a number of distributed database subsystems (DSs), each of which is a three-level tree structure. More than three levels may be adopted in a DS. However, adding more levels will introduce longer delays in location registration and call delivery. These DSs communicate with each other only through their root databases, DB0s, which are connected to the others by the public switched telephone network (PSTN), ATM networks, or other networks.

Fig. 1. Proposed multi-tree database architecture.
The proposed database architecture is motivated by the following.
1) A location-independent PTN provides a basis for global roaming in the next-generation mobile networks where terminal mobility, personal mobility, and service provider portability will be implemented. A mobile subscriber can retain its lifelong PTN regardless of its location and service provider.
2) The multi-tree database architecture is much more robust than the one-root hierarchical architecture. In the proposed architecture, an MT's profile is stored in one of the root databases according to its current location. Thus, each root database only maintains a small portion of the user profiles in the global mobile system. The crash of one root database will not disrupt the operation of other root databases, and the recovery of the failed root database is much easier than in the one-root database architecture where all user profiles need to be recovered once the root is crashed.
3) The multi-tree database architecture is scalable, which is crucial to support a continuously increasing number of mobile subscribers in future mobile networks. When the capacity of a root database is saturated, a new DS is readily added. More importantly, the end-to-end delay in location registration and call delivery will not increase due to such an expansion in the mobile network. On the other hand, with the one-root structure, when the capacity of the root or a high-level database is saturated, more levels of databases need to be added in order to reduce the burden on the root or high-level databases. This will increase the delays in location registration and call delivery.
4) The proposed multi-tree database system is easy to expand and maintain in the multi-operator environment of a global mobile system. With the multi-tree architecture, each service provider can have its own DSs and it is straightforward for a service provider to expand its service coverage by adding new DSs. It is also easy to operate and manage a DS when the DS is wholly owned by a single service provider. The one-root architecture, however, may not have such advantages.
The proposed architecture is compared with the one-root architecture as well as the HLR-VLR architecture in terms of the signaling loads due to location registration and call delivery. Numerical results have demonstrated that the proposed database architecture outperforms the one-root architecture and the HLR-VLR architecture, and can effectively cope with the anticipated high access rates to various location databases in future mobile networks while meeting the end-to-end delay requirements for location registration and call delivery.
The rest of the paper is organized as follows: Section 2 describes the existing system. Section 3 deals with the description of the multi-tree distributed database architecture. Section 4 deals with the simulation setup. Section 5 gives the conclusion.
II. EXISTING SYSTEM
The auxiliary strategies try to exploit the spatial and
temporal locality in each user's calling and mobility
patterns to reduce the signaling traffic and database
loads. Examples include the forwarding strategy, the
anchoring strategy, the caching strategy, and the
replication strategy.
In the forwarding strategy [6], [7], a forwarding
pointer is set up in the old VLR pointing to the new VLR
of an MT to avoid a location update at the HLR as the
MT changes its RA. When a call for the MT arrives, the
HLR is queried first to determine the first VLR which
the MT was registered at, and a forwarding pointer chain
is followed to locate the MT in its current VLR. The
forwarding strategy reduces location update signaling but
increases the call setup delay. Thus, the length of the
forwarding point chain needs to be limited. It is shown
that this scheme may not always result in a cost savings
as compared to the standard IS-41 scheme. The
forwarding scheme is effective only when the call arrival
rate is low relative to the mobility rate for an MT.
With the anchoring strategy [8], location updates are
performed at a nearby VLR (i.e., a local anchor) for an MT to reduce signaling traffic between the HLR and the VLRs. The HLR maintains a pointer to the MT's local anchor. As an incoming call occurs, the HLR forwards
the call to the local anchor, which in turn queries the
serving VLR of the MT for a TLDN. The call delivery
time is increased due to one extra database query to the
local anchor. Similar to the forwarding scheme, the local
anchoring scheme is efficient only when an MT's call arrival rate is low relative to its mobility rate.
With the caching strategy [9], an MT's location obtained from a previous call is cached and re-used for subsequent calls to that MT. After a cache entry of the MT's location
information is created at a signal transfer point (STP), if
another call for the MT is received by the STP, the
STP will forward the call to the VLR as specified by the
cache. If the MT is still in the same VLR, a hit occurs
and the call is successfully delivered. However, if the
MT has moved to another VLR, a miss occurs and the
IS-41 call delivery process has to be followed to find the
MT, thus incurring a much longer setup delay. When an
MT changes its location more often than receiving calls,
the caching scheme may become inefficient in reducing
cost.
In the replication strategy [10], an MT's location is
replicated at selected local databases, so that calls to the
MT originating from the service area of these replicated
databases can be routed without querying the HLR.
When the MT changes its location, all replicated
databases need to be updated for the MT, thus incurring a
high database update load and signaling traffic, especially
for highly mobile users.
In summary, each auxiliary strategy outperforms the
IS-41 only under certain calling and mobility
parameters. As the cell sizes become smaller to support
an increasing user density and the number of mobile
subscribers increases, even these augmentations will not
be sufficient to meet the future demands of mobile
networks. It becomes obvious that reducing the access rate
to the centralized HLR is a critical step to support an
increasing number of mobile subscribers.
The hierarchical database architecture can reduce the
access load on an upper-level database by distributing
query load into the lower-level databases, thus it has
been studied extensively in previous research. In [11],
an extra level of databases called directory registers
(DRs), was added between the HLR and the VLRs of
current cellular systems. The DR periodically computes
the location information distribution strategy for each
associated MT in order to achieve a reduced access rate
to the HLR. The performance of this scheme depends
on the availability and accuracy of the users' calling and
mobility parameters. It is usually computationally
intensive to obtain these parameters. Given the large
number of MTs, the burden on the DRs would be very
heavy.
A multilevel hierarchical database architecture was
introduced in [12] for location tracking in personal
communications systems (PCSs). However, the numbering
plan used to identify an MT is location-dependent, which
is similar to the telephone number plan, thus the MT
needs to be allocated a new number whenever it changes
its home service area. In, a distributed database
architecture based on the IEEE 802.6 MAN was proposed,
only suitable for MAN-wide mobile systems. In [13],
three-level hierarchical database architecture for mobile
networks was presented based on the location-independent
numbering plan.
There is a common drawback with the database
architectures proposed in these previous studies. The
whole database system had only one centralized root
database, where all user profiles were maintained. For
a worldwide- scale mobile system, it would be impractical
to store and manage subscriber information in a single
centralized database due to the expected huge number
of subscribers. Furthermore, the crash of the root
database may paralyze the entire system.

III. MULTI-TREE DISTRIBUTED DATABASE ARCHITECTURE

A. Organization of Location Databases
1) Organization of DB0:
The DB0 consists of an index file and a data file. With
the location-independent numbering plan being adopted,
every subscriber in the whole mobile system has an
entry in the index file. If the direct file is used, each
index entry only contains a pointer. When a user is
residing in the current DS area, the pointer points to the user's service profile stored in the data file. The user's service profile contains a pointer to the DB1 where the user is visiting. When the user is staying in another DS, the pointer in the user's index entry points to the DB0 associated with that DS. All entries in the index file are allocated the same size of storage and stored in increasing order of the users' PTNs so that direct addressing can be used to retrieve a record from the index file. Note that the PTN does not need to be stored in the index entry. On the other hand, the T-tree or the B-tree needs to include the PTN in each index entry and store other index management information, thus requiring more memory capacity than the direct file.
Therefore, the direct file is the best choice for the index
file of the DB0. In the data file, each user residing in
the current DS area is allocated a record to store the
users service profile. Note that the access time of the
DB0 is independent of the database size when the direct
file technique is employed (but the access time is affected
by the access frequency of the DB0). This scalability
feature is very useful for future mobility applications
since the number of subscribers is expected to increase
steadily.
2) Organization of DB1:
Each DB1 consists only of one part: the index file, in
which each user currently residing in the DB1 area has a
data item. Each data item in the index file consists of
two fields: the user's PTN and a pointer to the DB2 that the user is currently visiting. No other user information
is stored in the DB1. Results from Section V reveal that
the T-tree is a preferable technique for the index file of
DB1.
3) Organization of DB2:
Each of the database DB2s consists of two parts: the
index file and the data file. Each user currently
residing in the DB2 area has an entry in the index.
Each entry in the index consists of two fields: the user's
PTN and a pointer to the user record in the data file that
stores the service profiles for each user currently visiting
this DB2 area.
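To make the three organizations above concrete, a condensed Python sketch is given below. Plain dictionaries stand in for the direct file, T-tree and B-tree index structures discussed in the paper, and the class and field names are illustrative only.

# Condensed sketch of the DB0/DB1/DB2 organizations described above.
class DB2:                   # lowest level: controls one (or a few) RAs
    def __init__(self):
        self.index = {}      # PTN -> service profile of users visiting here

class DB1:                   # middle level
    def __init__(self):
        self.index = {}      # PTN -> DB2 the user is currently visiting

class DB0:                   # root database of one database subsystem (DS)
    def __init__(self):
        self.index = {}      # every PTN: local DB1 if resident, else remote DB0
        self.profiles = {}   # service profiles of users residing in this DS

A dictionary lookup keyed by the PTN plays the role of the direct addressing that, as noted above, keeps the DB0 access time independent of the database size.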
In this section, the location tracking procedures are
described based on the proposed multi-tree database
architecture as well as the proposed database
organizations. Location tracking consists of two
procedures: the location registration procedure and the
call delivery procedure. Location registration is the
procedure through which a user reports its location to the
network whenever the user enters a new location. As an
incoming call arrives, the call delivery procedure is
invoked to deliver the call to the user. For simplicity, in
this paper, it is assumed that a DB2 only controls one
RA. In real applications, a DB2 may control several
RAs.
B. Location Registration Procedure
With the previously defined file structures of DB0,
DB1, and DB2 as well as the proposed multi-tree location
database architecture, the location update procedure in a
global mobile system can be described as follows.
1) When a user enters a new RA, a registration request
message is sent to the associated DB2 which in turn
sends a registration request message to the DB1
controlling this area. If the user has no entry in this DB1,
go to step 3; otherwise, go to step 2.
2) The fact that the user has an entry in this DB1
indicates that the new DB2 is within the same DB1
area as the old DB2. A pointer to the new DB2
replaces the old one in the user's entry in the DB1. No
further query to the DB0 is needed. The DB1 sends a
registration cancellation message to the old DB2, then go
to step 8.
3) The fact that the user has no entry in this DB1
indicates that the user has moved to a new DB1 area. In
the new DB1 an index entry is added to contain a pointer
to the new DB2 of the user. An update request is also
sent to the associated DB0.
4) The DB0 is checked to see if it contains the user's service profile. If not, this means that the user has entered a new DS, so go to step 5a; otherwise, the DB0 updates the user's service profile to point to the new DB1 and sends a registration cancellation message to the old DB1, then go to step 7.
5) a) The new DB0 sends a query to the old DB0 to request the user's service profile.
b) The new DB0 stores the user's service profile and updates the service profile to point to the new DB1. A copy of the user's service profile is also sent to the new DB2.
6) a) The old DB0 sends the user's service profile to the new DB0.
b) The old DB0 updates the user's entry in the index file to point to the new DB0, and deletes the user's service profile from its data file. A registration cancellation message is sent to the old DB1.
7) The old DB1 deletes the user's index entry, and sends a registration cancellation message to the old DB2.
8) If the old DB2 is in the same DS as the new DB2, a copy of the user's service profile is sent to the new DB2. The user's index entry as well as the user's service profile is removed from the old DB2.
9) After receiving the user's service profile, the new DB2 sets up an index entry for the user and creates the user's service profile. The location registration procedure is completed.
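A simplified sketch of the registration logic above is given below for moves inside a single DS (steps 1-4 and 9); the inter-DS profile transfer of steps 5-8 is omitted, and the function names reuse the illustrative dictionary-based organization sketched earlier.

# Simplified registration for a move inside one DS, following steps 2, 3-4 and 9.
def register(ptn, new_db2, db1, db0):
    if ptn in db1.index:                      # step 2: same DB1 area, repoint only
        old_db2 = db1.index[ptn]
        db1.index[ptn] = new_db2
        cancel_registration(old_db2, ptn)
    else:                                     # steps 3-4: user entered a new DB1 area
        db1.index[ptn] = new_db2
        old_db1 = db0.profiles[ptn]["db1"]    # profile keeps a pointer to its DB1
        db0.profiles[ptn]["db1"] = db1
        cancel_registration(old_db1, ptn)
    new_db2.index[ptn] = db0.profiles[ptn]    # step 9: copy of the service profile

def cancel_registration(db, ptn):
    db.index.pop(ptn, None)                   # registration cancellation message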
Note that when a user changes its DS, with the preceding location registration procedure, only the old DB0 points to the new DB0 directly. All other DB0s (except for the new DB0) still point to the old DB0. A forwarding pointer chain corresponding to each of these DB0s is generated, like in the general forwarding strategy [8]. The length of these forwarding pointer chains will increase as the user continues to change its DS. As a result, the end-to-end setup delay will increase for inter-DS calls. Compared to the single-root structure, the proposed multi-tree structure achieves its robustness, scalability, maintainability, etc., at the expense of the need to synchronize the DB0s to contain the call setup delay as an MT changes its DS. There exists a tradeoff between the overhead of DB0 synchronization and the call setup delay of inter-DS calls.
Specifically, if the always-synchronization strategy is employed, i.e., the index files in all DB0s are updated to point to the new DB0 upon each DS change, a large amount of signaling traffic as well as database access load will be triggered within a short time if the number of DSs is large, but the setup delay of inter-DS calls is minimized. On the other hand, if the never-synchronization strategy is adopted, i.e., the index file of a DB0 is never updated upon DS changes, the length of the forwarding chain continues to increase as an MT changes its DS, and so does the setup delay of inter-DS calls. One solution to this issue is to adjust the forwarding pointer chain during call delivery, called on-demand synchronization. Specifically, when an inter-DS call has to traverse a forwarding chain going through more than two DB0s, the calling DB0 updates the callee's index entry to point to the called DB0 directly. With this method, only the setup delay of the call that adjusts the forwarding pointer chain is increased. Another DB0 synchronization strategy is to update a group of selected DB0s that generate relatively high call rates to the moving MT as the MT changes its DS. The rest of the DB0s are updated during the first inter-DS calls. We refer to this strategy as partial synchronization. Compared to the on-demand synchronization strategy, this strategy achieves a smaller expected setup delay of inter-DS calls by modestly increasing the synchronization traffic. This approach essentially combines the advantages of the forwarding strategy and the replication strategy. Note that the performance of the discussed DB0 synchronization strategies is closely related to the call-to-mobility ratio of an MT. Due to limited space, this issue will be addressed in future study.
Another issue that needs to be addressed is the security and privacy of the user's service profile when it is transferred between DB0s. If the involved DB0s belong to the same service provider, no problem exists. If the user's service profile is moved between two DSs operated by different service providers, security issues should be considered. The security issue is out of the scope of this paper and will not be addressed further.
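The forwarding pointer chain between DB0s and its on-demand adjustment discussed above might be sketched as follows; this is an illustration under the same dictionary-based organization assumed earlier, not the paper's implementation.

# Follow the forwarding pointers to the DB0 that holds the profile and, if the
# chain passed through more than two DB0s, repoint the calling DB0 directly
# (the on-demand synchronization strategy described above).
def resolve_db0(calling_db0, ptn):
    chain = [calling_db0]
    db0 = calling_db0
    while ptn not in db0.profiles:      # index entry points to another DB0
        db0 = db0.index[ptn]
        chain.append(db0)
    if len(chain) > 2:                  # on-demand synchronization
        calling_db0.index[ptn] = db0
    return db0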
C. Call Delivery Procedure
When an incoming call arrives, the call delivery
procedure for the callee can be performed in the
following steps:
1) When a call is detected in the caller's MSC, the caller's DB2 is checked to see if an index entry for the
callee exists. If yes, go to step 5, and no further queries
to the DB1 and the DB0 are required. Otherwise, a
query is sent to the associated DB1, then go to step 2.
2) The DB1 examines if the callee has an entry in
its index file. If yes, go to step 4, and no further query
to the DB0 is required. Otherwise, a query is sent to
the associated DB0, then go to step 3.
3) The DB0 examines if the callee is associated with
one of its DB1s. If yes, the DB0 sends a routing
address request message to the DB1, then go to step 4;
otherwise, go to step 7.
4) The DB1 determines the callee's DB2 and sends a
query to the DB2 to request the routing address.
5) The DB2 searches for the callee. If the callee is
found, a TLDN is allocated to the callee and sent back
to the calling MSC.
6) After receiving the TLDN, the calling MSC
sets up a connection to the called MSC associated
with the callee's current DB2. Then the call delivery
process stops.
7) If the callee is residing in another DS, a query is
sent to the associated DB0. The searching process is
repeated from step 3.
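Under the same illustrative data structures, the call delivery lookup above can be condensed as follows; resolve_db0 is the forwarding-chain helper sketched in the previous subsection, and the returned DB2 is where the TLDN would be allocated.

# Call delivery lookup: caller's DB2 first, then its DB1, then the DB0 level.
def deliver_call(caller_db2, caller_db1, caller_db0, callee_ptn):
    if callee_ptn in caller_db2.index:                # step 1: callee is nearby
        return caller_db2
    if callee_ptn in caller_db1.index:                # step 2: same DB1 area
        return caller_db1.index[callee_ptn]
    target_db0 = resolve_db0(caller_db0, callee_ptn)  # steps 3 and 7
    target_db1 = target_db0.profiles[callee_ptn]["db1"]
    return target_db1.index[callee_ptn]               # step 4: the callee's DB2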
Fig. 2. Flow chart of location registration procedure.
Fig. 3. Flow chart of call delivery procedure.

It is worthwhile to point out that no GTT is required in the location registration and call delivery procedures based on the proposed database architecture. This will simplify the deployment of the proposed strategy while reducing the overall system cost.
IV. SIMULATION SETUP
We have a wireless network of 4 clusters. Each cluster has one database provider which can provide database services to clients (marked as green circles); the other nodes in the clusters are clients. We have four database access servers (similar to HLRs) which contain the details of the database services and their locations (marked as red circles).
Input: Enter the node id of the node to be moved and the new cluster no. it wants to move to. Enter the client id which requires some database service from a provider and the corresponding service no. (which can be 111, 222, 333, 444).
Output: The algorithm finds the service no. in the home ds server; if it finds the location, then the client can get the services from the respective provider. If the service no. is not found in the home ds server, it searches for the ds which contains the service no., then it contacts the correct ds which contains the location of the provider. After getting the provider information, the client gets the respective service from the provider.

Location Server Information After Registration
Database Server Information After Update

Figure: Simulation shows that mobile node 7 moves to cluster no. 4 and mobile node 5 starts requesting service from mobile node 8.
Figure shows the packet forwarded ratio with respect to the simulation time, which is taken as 40 seconds.
V. CONCLUSION
A distributed multi-tree database architecture has been proposed for location management in a global mobile system, where the location-independent PTNs are employed to support seamless global roaming. The proposed database architecture is scalable, robust, and efficient. Compared to the existing two-level location database architecture, the proposed database architecture can support a much higher user density while reducing the signaling load significantly. Compared to the one-root tree architecture, the proposed architecture provides better scalability and reliability while supporting a larger user population at a lower signaling cost. For performance evaluation, an analysis model was developed. Numerical results have revealed that the proposed database architecture can effectively handle the anticipated high update and query rates to the location databases in future mobile networks. The proposed database access structures are also suitable for other large centralized databases in mobile networks, such as the authentication center and the equipment identity register.
REFERENCES
[1] I. F. Akyildiz, J. Mcnair, J. S. M. Ho, H. Uzunalioglu, and W. Wang, Mobility management in next-generation wireless systems, Proc. IEEE, vol. 87, pp. 1347-1384, Aug. 1999.
[2] E. D. Sykas and M. E. Theologou, Numbering and addressing in IBCN for mobile communications, Proc. IEEE, vol. 79, pp. 230-240, Feb. 1991.
[3] Y.-B. Lin and I. Chlamtac, Wireless and Mobile Network Architecture. New York: Wiley, 2001.
[4] S. Mohan and R. Jain, Two user location strategies for personal communications services, IEEE Pers. Commun., pp. 42-50, First Quarter 1994.
[5] C. N. Lo and R. S. Wolff, Estimated network database transaction volume to support wireless personal data communications applications, in Proc. IEEE Int. Conf. Communications, May 1993, pp. 1257-1263.
[6] I.-R. Chen, T.-M. Chen, and C. Lee, Agent-based forwarding strategies for reducing location management cost in mobile networks, ACM/Baltzer J. Mobile Netw. Applicat., vol. 6, no. 2, pp. 105-115, 2001.
[7] R. Jain and Y.-B. Lin, An auxiliary user location strategy employing forwarding pointers to reduce network impacts of PCS, ACM/Baltzer J. Wireless Netw., vol. 1, no. 2, pp. 197-210, July 1995.
[8] , Local anchor scheme for reducing signaling costs in personal communications networks, IEEE/ACM Trans. Networking, vol. 4, pp. 709-725, Oct. 1996.
[9] R. Jain, Y.-B. Lin, C. Lo, and S. Mohan, A caching strategy to reduce network impacts of PCS, IEEE J. Select. Areas Commun., vol. 12, pp. 1434-1444, Oct. 1994.
[10] N. Shivakumar, J. Jannink, and J. Widom, Per-user profile replication in mobile environments: Algorithms, analysis, and simulation results, ACM/Baltzer J. Mobile Netw. Applicat., vol. 2, no. 2, pp. 129-140, Oct. 1997.
[11] J. S. M. Ho and I. F. Akyildiz, Dynamic hierarchical database architecture for location management in PCS networks, IEEE/ACM Trans. Networking, vol. 5, pp. 646-660, Oct. 1997.
[12] J. Z. Wang, A fully distributed location registration strategy for universal personal communication systems, IEEE J. Select. Areas Commun., vol. 11, pp. 850-860, Aug. 1993.
[13] X. Qiu and V. O. K. Li, Performance analysis of PCS mobility management database system, in Proc. IEEE IC3N, Sept. 1995, pp. 434-444.
Regression analysis using log data
Anup S. Jawanjal
PG Student, Department of Computer Engineering
Pune Institute of Computer Technology
Pune, India


Abstract: Log files are an important source of information for a system because they depict the current status of the system. They record all the events happening in the system. Manual analysis of these log files is a tedious job because log files are large in size. To automate the log analysis process
various tools and techniques have been proposed in the past.
Also tools have been proposed to make use of log files for
profiling, automated problem detection, system monitoring,
system management etc. In this paper we propose a
framework by which log files can be used for regression
analysis of the software system. The framework compares log
files from two different builds and generates analysis report.
Each log line is taken into account, so the framework does an
analysis that is very detailed.
Keywords-log analysis, regression analysis, log abstraction
I. INTRODUCTION
Most software systems collect information about their activities in log files. As events happen in the system, the message that corresponds to each event is added to the log file. The information in the log files consists of the start or end of events or actions of the software system, state information and error information. Each log line typically contains date and time information, user information, application information, and event information. Logs are often collected for system monitoring, system debugging and fault diagnosis. Numerous log file analysis tools and techniques are available to carry out a variety of analyses. Insights of varying degrees are achieved by log file analysis. These include, but are not limited to, fault detection by monitoring, fault isolation, operational profiling, etc.
Tools like Splunk [1] and Swatch [2] are used to monitor log files. Splunk is a log management tool, and Swatch is a log monitoring tool. Swatch monitors log
files by reading every event message line that is
appended to the log file, and compares it with rules
where the conditional part of each rule is a regular
expression (rules are stored in a textual configuration
file). If the regular expression of a certain rule matches
the event message line, Swatch executes the
action part of the rule.
Proc. of the Intl. Conf. on Computer Applications Volume 1. Copyright 2012 Techno Forum Group, India. ISBN: 978-81-920575-8-3 :: doi: 10.73340/ISBN_0768 ACM #: dber.imera.10.73340
A. Analyzing system logs
Each line in a log file is a combination of a static message type and variable parameter information. The separation of static fields from dynamic fields is called log abstraction. The dynamic field can be time information, date information, an IP address, a port number or any application-specific information. To separate these two fields we need to identify common patterns in these events. Most commonly, regular expressions are built from the event patterns. When we parse the log file we compare each log line against these regular expressions and set the event type and other information of the matched regular expression. The dynamic information in a log line varies because each application stores only the information relevant to it in the log file. Once the event type is determined, various techniques such as event correlation, data clustering and pattern mining can be used to extract useful information from the log files.
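A small sketch of this log abstraction step is given below. The message patterns are invented examples (the first one happens to match the sample log line shown later in Section III); they are not the patterns used by Splunk, Swatch or any other tool.

# Regular expressions separate the static message type from the dynamic fields.
import re

patterns = [
    ("CALL_DONE", re.compile(
        r"(?P<date>\S+) (?P<time>\S+) Call To (?P<sub>\w+) Done: (?P<ms>\d+) msec")),
    ("LOGIN", re.compile(
        r"(?P<date>\S+) (?P<time>\S+) User (?P<user>\w+) logged in from (?P<ip>[\d.]+)")),
]

def abstract(line):
    for event_type, pattern in patterns:
        match = pattern.match(line)
        if match:
            return event_type, match.groupdict()   # static type + dynamic fields
    return "UNKNOWN", {}

print(abstract("10/1/2011 12:34:30 Call To Subprogram1 Done: 12 msec"))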
B. Contribution of the paper
For a software system, new features are added and the collection of these features results in a software build. For some systems this build may be produced on a daily basis. Due to the addition of new features, current system components may get affected. For this purpose regression analysis needs to be done, which tells if any system component is taking more time than it was taking in the previous build.
In this paper we propose a framework that makes use of
log files for regression analysis. The framework takes log
file from different builds, compares them and gives the
regression report.
The paper is organized as follows: Section II discusses previous work regarding different uses of log files. Section III discusses the issues related to using log files for regression analysis, and Section IV discusses the problem using set theory. In Section V the proposed framework is described. Section VI discusses future work on the proposed technique.
II. DIFFERENT USE OF LOG FILE
Log files have been used for various purposes such as profiling, automated system problem detection, system monitoring, system management, etc.
We discuss the work done in the past in the following
subsections.
A. Testing using logs
In [3], a framework is presented for automatically
analyzing log files. It describes a language for specifying
analyzer programs. The language permits compositional,
compact speciations of software, which acts as test
oracles; which can be used for unit and system level
testing..
B. Use of logs for problem detection and troubleshooting
CLUEBOX [4] proposes a non-intrusive toolkit that aids
rapid problem diagnosis. It employs machine learning
techniques on the available performance logs to
characterize workloads, predict performance and
discover anomalous behavior. By identifying the
most relevant anomalies to focus on, CLUEBOX
automates the most onerous aspects of performance
troubleshooting.
In [5], a technique is presented to automatically analyze log files and retrieve important information to identify failure causes. It automatically identifies dependencies between events and values in logs corresponding to legal executions, generates models of legal behaviors, and compares log files collected during failing executions with the generated models to detect anomalous event sequences, which are presented to users.
C. Use of logs for system management
In [6], text mining techniques are applied to categorize messages in log files into common situations, categorization accuracy is improved by considering the temporal characteristics of log messages, temporal mining techniques are developed to discover the relationships between different events, and visualization tools are used to evaluate and validate the interesting temporal patterns for system management.
In his paper, Vaarandi [7] uses event correlation techniques to build a network management tool. The tool is lightweight, open source and implements platform independent event correlation. The tool, called SEC, is a rule-based event correlation tool that receives its input events from a file stream and produces output events by executing user-specified shell commands. Regular files, named pipes, and standard input are supported as input. To handle input events regardless of their format, SEC uses regular expressions to recognize them; SEC regular expressions can match complex patterns spanning several input lines.
III. RELATED ISSUES
A. Dynamic information in log line
The dynamic information in a log line generally consists of the time at which the event happened, the date, parameters such as IP address and port number (if network components are involved), application information, etc. To use a log file for regression analysis we need to record the time taken by each event. Such a log line would look like this:

10/1/2011 12:34:30 Call To Subprogram1 Done: 12 msec

Recording time against each event requires a fine-grained partition of events; for example, we should not record only the total time of an event whose sub-events may take varying amounts of time. Hence log files that record the time taken by each event serve as input to our framework.
B. Structure of log files
The following diagram shows the general structure in which events in a software program get executed. An activity defines a distinct group of events in the log file; for example, for a mailing software system the activities would be Open Mail, Close Mail, Send Mail, etc.



Figure 1. Sample event execution structure

As shown in Figure 1, the structure is nested and all the events under one activity are bounded by the activity start and end, whether they are nested or sequential. Hence every event in the system can be bounded by the activity under which it happens. As a log file merely records all the events happening in the system, its structure is the same as the event execution structure.
C. Handling large log files
As events are appended to a log file it grows, and since old events are not deleted the log file can become very large. Normally a new log file is created for every new session. Previous work on log analysis uses techniques such as clustering, pattern mining and event correlation; to handle large log files, it correlates events happening at different times to learn more about the health of the system. Since we are doing regression analysis, however, we need to take each event into account.
To handle large log files we therefore divide them based on activities. Activities define sets of events that are distinct from one another; two activities might contain some events in common, but we can clearly separate them based on the activity name. All the events happening inside
each activity will be handled separately. This
allows us to handle large log files and to do detailed
regression analysis.
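A minimal Java sketch of this activity-based division is shown below; it assumes hypothetical "Activity <name> Start" and "Activity <name> End" marker lines, so the marker format is an illustrative assumption rather than a prescribed log format.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ActivitySplitter {
    // Groups log lines under the activity in which they occur.
    // Assumes hypothetical markers "Activity <name> Start" / "Activity <name> End".
    public static Map<String, List<String>> split(List<String> lines) {
        Map<String, List<String>> byActivity = new HashMap<>();
        String current = null;
        for (String line : lines) {
            if (line.matches(".*Activity (\\S+) Start.*")) {
                current = line.replaceAll(".*Activity (\\S+) Start.*", "$1");
                byActivity.putIfAbsent(current, new ArrayList<>());
            } else if (line.matches(".*Activity (\\S+) End.*")) {
                current = null;                       // close the open activity
            } else if (current != null) {
                byActivity.get(current).add(line);    // event belongs to the open activity
            }
        }
        return byActivity;
    }
}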

IV. PROBLEM DESCRIPTION
We define a system set S as

S = {I, T, R}

Where, I: Input set, T: Tree set, R: Analysis report

Set I = {Fp, Fn} defines input set

Where Fp=log file of previous build
Fn=log file of new build

Each log file consists of an event sequence E1, ..., En, where Ei = (ti, ei) is an event, ei is the type of Ei and ti is the time required to execute Ei. Other dynamic information such as the time of event execution, date, etc. may be present, but only the event type and the time taken by the event are required for regression analysis.

These log files are big, so they are broken into smaller sets called activities.

Set I become,

I = {Fpu, Fnu}

Where Fpu= log file of previous build for activity u
Fnu=log file of new build for activity u

Fpu ⊆ Fp and Fnu ⊆ Fn

Fp = {Fpu1, Fpu2, ..., Fpun}

Fn = {Fnu1, Fnu2, ..., Fnun}, such that

Fpu1 ∪ Fpu2 ∪ ... ∪ Fpun = Fp and

Fpu1 ∩ Fpu2 ∩ ... ∩ Fpun = ∅

The same holds for Fn.

In order to compare two log files we need to convert them to a common format. This format should allow events to be compared individually and allow extra events to be identified, because two different builds contain some distinct events. We use a tree as the common format for comparing log files; a tree allows us to support nested program structure and handle extra events.
Tree set is defined as follows:

T = {Tpu, Tnu}

Tpu = {P} and Tnu = {P}, where P = {P1, P2, P3, ..., Pn} is the parent set,

and each Pi = {C1, C2, ..., Cn} is the set of children of that parent.

Function f is applied to both Fpu and Fnu; it checks each event and stores it in the proper hierarchy:

f: Fpu → Tpu and f: Fnu → Tnu

Function g compares the two trees and produces the analysis report:

g: Tpu × Tnu → R

Each build version may consist of different events, which leads to different parents in different versions. The relation between these parents can be shown with the following Venn diagram.



Figure 2. Venn diagram showing the relation between parent objects

A parent may contain child events. The number of children in the same parent differs from build to build; a new build might contain some new sub-events.


Figure 3. Venn diagram showing the relation between child objects
As we compare the trees we record each common event and store against it the time taken in the different builds. An event that is distinct to one build contains only the time of the build in which it appears. The final report generated is:

R = {e, tp, tn, r}

where e = event type, tp = time required by e in the previous build, tn = time required by e in the new build, and r is the regression, i.e. the difference in time between the two builds.
V. MODEL GENERATION
A. Using tree structure
As we parse each log line we separate the static field from the dynamic fields. Log events may be broadly divided into two types: those that have time information associated with them and those that do not. Generally, events such as activity start or end and subprogram start or end have no time information associated with them; we treat these events as parents, and events having time information as children. To store additional information such as title, level, etc., we use objects to represent parents and children.




Figure 4. Parent and child object properties

The parent object has properties such as title: the name of the parent (the same as the event type); level: its position in the tree; children: a list containing all its children; and parent_children: parents that are lower in the hierarchy than this parent. The child object has properties such as title: the name of the child; time: the time of event execution; parent: the parent of this child; and type: either time or no_time, because certain children have no time information associated with them. More properties can be added as and when required.
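A minimal Java sketch of the parent and child objects described above is given below; the field types and the camel-case name parentChildren are assumptions made for illustration, while the properties themselves follow Figure 4.

import java.util.ArrayList;
import java.util.List;

// Sketch of the node objects described above; field names follow Figure 4,
// the types are assumptions for illustration.
class ParentNode {
    String title;                                         // event type (activity or subprogram name)
    int level;                                            // position (depth) in the tree
    List<ChildNode> children = new ArrayList<>();         // timed events under this parent
    List<ParentNode> parentChildren = new ArrayList<>();  // nested parents below this one

    ParentNode(String title, int level) {
        this.title = title;
        this.level = level;
    }
}

class ChildNode {
    String title;        // event name
    long timeMsec;       // time taken by the event, if any
    ParentNode parent;   // enclosing parent
    String type;         // "time" or "no_time"

    ChildNode(String title, long timeMsec, ParentNode parent, String type) {
        this.title = title;
        this.timeMsec = timeMsec;
        this.parent = parent;
        this.type = type;
    }
}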
B. Proposed algorithm for building the tree
The algorithm in Figure 5 is used to build the tree. It is applied to both log files (from the new build and the previous build) one by one.



Figure 5. Algorithm for building the model

In the above algorithm, regex stands for regular expression. We identify common patterns for recognizing events and store them as regular expressions. In the algorithm, a child regex and a parent regex broadly divide the events, but in practice there may be several child and parent regexes. The algorithm requires several data structures when implemented: a stack for keeping track of the current parent, and two separate lists per parent for storing all its children and parent_children. When the algorithm finishes we get two top level objects (one for each file); these two objects contain all the information required to compare each child and parent object.
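Since Figure 5 cannot be reproduced here, the following Java sketch illustrates the stack-based construction described above; it assumes the ParentNode and ChildNode classes sketched earlier and uses simplified stand-in regular expressions, so it is an illustration of the idea rather than the exact algorithm of Figure 5.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TreeBuilder {
    // Simplified stand-ins for the parent and child regexes mentioned in the text.
    private static final Pattern PARENT_START = Pattern.compile("^(\\S+) Start$");
    private static final Pattern PARENT_END   = Pattern.compile("^(\\S+) End$");
    private static final Pattern TIMED_CHILD  = Pattern.compile("^(.+): (\\d+) msec$");

    public static ParentNode build(List<String> lines) {
        ParentNode root = new ParentNode("root", 0);
        Deque<ParentNode> stack = new ArrayDeque<>();   // tracks the current parent
        stack.push(root);
        for (String line : lines) {
            Matcher start = PARENT_START.matcher(line);
            Matcher end = PARENT_END.matcher(line);
            Matcher child = TIMED_CHILD.matcher(line);
            if (start.matches()) {
                ParentNode p = new ParentNode(start.group(1), stack.size());
                stack.peek().parentChildren.add(p);     // nest under the current parent
                stack.push(p);
            } else if (end.matches() && stack.size() > 1) {
                stack.pop();                            // close the current parent
            } else if (child.matches()) {
                stack.peek().children.add(new ChildNode(
                        child.group(1), Long.parseLong(child.group(2)),
                        stack.peek(), "time"));
            }
        }
        return root;                                    // top level object for the whole log
    }
}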
C. Proposed algorithm for comparing trees
The algorithm in Figure 6 is used to compare the trees from the two builds. The top level objects are passed to the algorithm the first time; it then iterates through all the children and parent_children. The algorithm first handles all the children of the parents passed to it, then checks for any extra children, and after that calls the CompareTrees function again with the new set of parents. If some parents remain unmatched, they are extra and are treated separately.



Figure 6. Algorithm for comparing trees




Figure 7. CompareChildren subroutine

For an extra child or parent, regression analysis cannot be done, as no corresponding child or parent is present in the other tree; we therefore simply set the regression to 0. The algorithm consists of several loops, but it is efficient because not all of them are executed to completion. If we consider two subsequent builds, the number of extra children or parents is very small. Hence, in the CompareChildren subroutine, the first repeat-until loop is executed exactly once for each child; likewise, the two while loops that check for extra children in each tree are executed exactly once for each extra child, and the same holds for the two while loops that check for extra parents. Hence the technique is efficient.
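The following Java sketch illustrates the comparison logic described for Figures 6 and 7: children with the same title are matched and both times recorded, unmatched items get a regression of 0, and the comparison recurses into nested parents. It again assumes the node classes sketched earlier, introduces a hypothetical RegressionRecord type for the report entries, and, for brevity, omits the symmetric handling of extra children and parents that appear only in the new build.

import java.util.List;

public class TreeComparer {
    // One line of the analysis report R = {e, tp, tn, r}.
    static class RegressionRecord {
        String event; long tPrev; long tNew; long regression;
        RegressionRecord(String e, long tp, long tn) {
            event = e; tPrev = tp; tNew = tn; regression = tn - tp;
        }
    }

    public static void compareTrees(ParentNode prev, ParentNode next,
                                    List<RegressionRecord> report) {
        // Compare children with the same title; unmatched children get regression 0.
        for (ChildNode cp : prev.children) {
            ChildNode match = null;
            for (ChildNode cn : next.children) {
                if (cn.title.equals(cp.title)) { match = cn; break; }
            }
            if (match != null) {
                report.add(new RegressionRecord(cp.title, cp.timeMsec, match.timeMsec));
            } else {
                RegressionRecord extra = new RegressionRecord(cp.title, cp.timeMsec, 0);
                extra.regression = 0;                 // no counterpart in the new build
                report.add(extra);
            }
        }
        // Recurse into nested parents that exist in both builds.
        for (ParentNode pp : prev.parentChildren) {
            for (ParentNode pn : next.parentChildren) {
                if (pn.title.equals(pp.title)) { compareTrees(pp, pn, report); break; }
            }
        }
    }
}

The report list produced here corresponds to the set R = {e, tp, tn, r} defined in Section IV.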
The framework goes through three main stages. In the first stage it takes two log files as input. In the second stage it parses each line in the log files and builds parent and child objects. In the third and last stage it compares these objects and builds the analysis report. It uses the first algorithm for building the objects and the second algorithm for comparing them. The properties of the objects are assigned so that one top level parent object can give information about the whole tree; hence only this top level parent object needs to be passed for comparison. If during comparison the algorithm finds any extra parent or child events, it sets their regression to 0 and shows their time under the appropriate build. From this report a developer can quickly find the regression suspects, i.e. the events that cause regression in the program. We can output in the report only those suspects whose time exceeds a user-specified value.
VI. FUTURE WORK
By using the tree model we can efficiently compare log files. Storing each tree node as an object allows us to store various properties with it, and a property can be added to a parent or child node as and when required. The model uses regular expressions to identify event types; these expressions are written manually by identifying each event type in the system. The framework can be improved by identifying the different event types automatically. Different types of reports could also be generated so that developers can quickly find the regression suspects, and the division of the log file into activities should also be done automatically.

REFERENCES

[1] Splunk, http://www.splunk.com/ (accessed 12/10/2011).
[2] Stephen E. Hansen and E. Todd Atkins, "Automated System Monitoring and Notification With Swatch," in Proceedings of the USENIX 7th System Administration Conference, 1993, pp. 145-152.
[3] J. Andrews, "Testing using Log File Analysis: Tools, Methods, and Issues," in 13th IEEE International Conference on Automated Software Engineering, 13-16 Oct 1998, Honolulu, HI, USA, pp. 157-166.
[4] S. Sandeep, M. Swapna, T. Niranjan, S. Susarala and S. Nandi, "CLUEBOX: A Performance Log Analyzer for Automated Troubleshooting," in WASL'08: Proceedings of the First USENIX Conference on Analysis of System Logs, 2008.
[5] L. Mariani and F. Pastore, "Automated Identification of Failure Causes in System Logs," in 19th International Symposium on Software Reliability Engineering, Seattle, WA, 2008, pp. 117-126.
[6] T. Li, W. Peng, S. Ma and H. Wang, "An Integrated Framework on Mining Logs Files for Computing System Management," IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 2010, pp. 90-99.
[7] R. Vaarandi, "Platform independent event correlation tool for network management," in Symposium on Network Operations and Management, 2010, pp. 907-909.





Name Alias Detection Using Graph mining method


Jakhete Sumitra Amit
PG Student,
M.E.Computer
Pune Institute of computer
technology,
Pune, India

Prof. Dhramdhikari Shweta
Associate Professor, Information
Technology,
Pune Institute of computer
Technology,
Pune, India

Prof. Madhuri Wakode
Assistant Professor, Computer
Department,
Pune Institute of computer
Technology,
Pune, India

Abstract - Coreferences are occurrences of a word or word sequence that refer to a given entity, such as a person name, a location or an organization. Identifying the correct reference to an entity among a list of references is required in many tasks such as information retrieval, sentiment analysis and person name disambiguation, as well as in biomedical fields. Most previous work has addressed lexical ambiguity; in this paper we propose a method based on referential ambiguity that extracts the correct aliases for a given name. Given a person name, optionally with context data such as location and organization, the method retrieves the top-K snippets, to a depth of up to level two, from a web search engine. Candidate aliases are extracted with the help of lexical patterns. To find the correct alias from the list of candidates we use frequent subgraph discovery based on a priori-based graph mining. The method improves precision and recall compared with the previous baseline method.
Subject Descriptors: H.2.8 [Database Management]: Database Applications; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; I.2.7 [Artificial Intelligence]: Natural Language Processing.

Keywords: Web Mining, Text Mining, Web Text Analysis, Graph
Mining

I. INTRODUCTION
Finding references to a particular entity on the web is an important task, as it helps the information retrieval process. On the Web, such references are referred to as aliases. Searching for a particular entity, for example by name, is difficult because different entities share the same name. Various name disambiguation algorithms exist for removing the ambiguity between namesakes; they solve the problem known as lexical ambiguity. Another difficulty in identifying entities is that a single entity can be identified by multiple names; this problem is called referential ambiguity.
Previous research has mainly focused on handling lexical
ambiguity. For example, the Bollywood star Amitabh Bachchan is often called Shahenshah of Bollywood, a three-word alias, whereas Big B is a one-word alias and Angry Young Man is a three-word alias. The famous cricket player Sachin Tendulkar is also known as Master Blaster, a two-word alias. Many different kinds of terms are used as aliases, such as a profession (doctor), the name of a role, or the title of a drama.

II. RELATED WORKS
Finding correct aliases is important in information retrieval. In [2], a method is proposed that uses extraction techniques to automatically extract significant entities, such as the names of other persons, organizations, and locations, from each webpage. In addition, it extracts and parses HTML and Web related data on each webpage, such as hyperlinks and email addresses. The algorithm then views all this information in a unified way, as an Entity-Relationship Graph in which entities (e.g., people, organizations, locations, webpages) are interconnected via relationships (e.g., webpage-mentions-person, relationships derived from hyperlinks, etc.). This method is used to find information related to a particular person on the web. In [3], a new method is proposed to measure the similarity between documents: similarity via knowledge base (SKB). The method uses a knowledge base to find the important keywords in the shared contexts of documents and to calculate the weight of the shared contexts more easily. These similarity results are then used for agglomerative clustering to group related documents together.
The name disambiguation problem is handled by various methods such as [4] and [5], but this work does not handle the problem of referential ambiguity, i.e. identifying the correct reference for an entity. In [1], a method is proposed for automatic discovery of personal name aliases from the web: for a given name query, lexical patterns are first extracted based on a training dataset, and a large set of candidate aliases is then extracted from the snippets retrieved from a web search engine. Various ranking scores are integrated into a single function using a support vector machine to extract the correct alias. That work considers the co-occurrence graph only up to the first level.
We propose a novel approach to finding the aliases of a given entity, which is threefold:
A user interface approach for input with context data such as organization and location, retrieving the top-K pages [2].

A second-order word co-occurrence graph-based data representation, together with a frequent subgraph discovery, a priori-based graph mining algorithm, which improves efficiency [7, 8].

To measure the association between a name and an alias, we use statistics such as the Co-occurrence Significance Ratio [9] and AltaVista-based Mutual Information [10], which improve the accuracy of correct alias finding.

III. PROPOSED METHODS
Information Extraction (IE) is a technology based on
analysing natural language in order to extract snippets of
information.
I Entity Search on the Web
Our first algorithm is based on entity search. Through the web-based user interface the user gives a name as the query input; the top-K pages are retrieved using a web search engine API such as Google, and after preprocessing this data is considered for the entity resolution problem.
Let S be the system, such that
S = {Ip, Op, D, CD, P, O, Srn, D1, D2, PA, C, Su, F}
Ip = input of the system
Ip = {N1, N2, N3, ..., Nn} // set of names
CD = {c1, c2, c3, ..., cn} // set of context data
L = {l1, l2, l3, ..., ln} // set of locations
O = {O1, O2, O3, ..., On} // set of organizations
where N is the name of the person being queried by the user. A context Ci can be either
(a) an OR combination of locations from L, or
(b) an OR combination of organizations from O.
Here we use two main types of queries:
N AND Ci AND Cj
N AND Ci OR Cj
Let D = {d1, d2, ..., dK} be the set of the top-K returned web snippets.
II Candidate Alias Extraction
We consider a set of lexical patterns [2] that are commonly found with a name and its alias. These patterns, together with the real name, are used to extract a set of candidate aliases.
EC = {c1, c2, c3, ..., cn | 1 <= n(C) <= 5} // set of extracted candidates
W1 = {a, an, the} // set of stop words
C = EC - W1 (1)
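The lexical patterns themselves come from [2] and are not listed here; purely for illustration, the Java sketch below uses a single hypothetical "also known as / aka / alias" pattern to show how candidates can be pulled from snippets and stop words removed as in equation (1).

import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CandidateExtractor {
    private static final Set<String> STOP_WORDS = new LinkedHashSet<>(
            Arrays.asList("a", "an", "the"));            // W1 in equation (1)

    // Extracts candidate aliases for a real name from retrieved snippets.
    // The single "also known as / aka / alias" pattern is a stand-in for the
    // lexical patterns of [2].
    public static Set<String> extract(String name, List<String> snippets) {
        Pattern p = Pattern.compile(Pattern.quote(name)
                + "\\s*,?\\s*(?:also known as|aka|alias)\\s+([A-Z][\\w ]{1,30})");
        Set<String> candidates = new LinkedHashSet<>();
        for (String snippet : snippets) {
            Matcher m = p.matcher(snippet);
            while (m.find()) {
                String candidate = m.group(1).trim();
                if (!STOP_WORDS.contains(candidate.toLowerCase())) {
                    candidates.add(candidate);           // C = EC - W1
                }
            }
        }
        return candidates;
    }
}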
III Word Co-occurrence Graph Using Second Order and Graph-Based Data Mining
Graph mining is an important tool for transforming graphical data into graphical information. A graph is a pair G = (V, E) where V is a set of vertices and E is a set of edges; edges connect one vertex to another and can be represented as pairs of vertices. Words that co-occur in a certain context are likely to share between them some of their importance. Co-occurrence, dependency and semantic networks are three ways to build a graph from a given text.
In a word co-occurrence graph, each word wi in the vocabulary V is represented by a node. Because a one-to-one mapping holds between a word and a node, for simplicity we use wi to represent both the word and the corresponding node in the graph. An edge eij ∈ E is created between two nodes wi, wj if they co-occur. Given a personal name p, represented by a node p in the co-occurrence graph, our objective is to identify the nodes that represent aliases of p.
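A minimal Java sketch of such a co-occurrence graph, stored as an adjacency map in the hash-table style discussed below, is given here; treating every pair of words in the same snippet as co-occurring is an illustrative simplification of the actual windowing used.

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CooccurrenceGraph {
    // Adjacency map: each word w_i maps to the set of words it co-occurs with.
    private final Map<String, Set<String>> adjacency = new HashMap<>();

    // Adds edges between every pair of words appearing in the same snippet.
    public void addSnippet(List<String> tokens) {
        for (String wi : tokens) {
            for (String wj : tokens) {
                if (!wi.equals(wj)) {
                    adjacency.computeIfAbsent(wi, k -> new HashSet<>()).add(wj);
                }
            }
        }
    }

    public Set<String> neighbours(String word) {
        return adjacency.getOrDefault(word, new HashSet<>());
    }

    public Map<String, Set<String>> asMap() {
        return adjacency;
    }
}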
Graph-based representations of real-world problems have been helpful due to their improved clarity and efficient use in finding solutions. The hash table scheme uses a hash function to map keys to their corresponding values; its advantage over other representations is the speed with which it retrieves results, which is useful in the case of very large input graphs. Subgraph discovery is one of the well-addressed problems in the graph-mining domain; its two main sub-tasks are frequent subgraph discovery and dense subgraph discovery.
A graph-based ranking algorithm is a way of deciding the importance of a vertex within a graph. The TextRank score associated with a vertex is defined in equation (2) as

S(Va) = (1 - d) + d * Σ_{Vb ∈ In(Va)} S(Vb) / |Out(Vb)|    (2)

where S(Va) is the score of vertex a, S(Vb) is the score of vertex b, In(Va) is the set of vertices pointing to a (so |In(Va)| is the in-degree of a), Out(Vb) is the set of vertices that b points to (so |Out(Vb)| is the out-degree of b), and d is a parameter set between 0 and 1 [8].
Given a graph, let In(V) be the set of vertices that point to vertex V (its predecessors), and Out(V) be the set of vertices that V points to (its successors). We use this method to weight terms when classifying text. In our proposed method, to build the graph of a document's text we consider co-occurrences of name and alias relations.
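For illustration, the following Java sketch iteratively computes the score of equation (2) over an undirected co-occurrence graph represented as an adjacency map; the damping factor d = 0.85 and the fixed iteration count are conventional choices and not values prescribed by this paper.

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class TextRank {
    // Iteratively computes S(Va) = (1 - d) + d * sum over Vb in In(Va) of S(Vb)/|Out(Vb)|
    // on an undirected graph, where In and Out coincide with the neighbour set.
    public static Map<String, Double> score(Map<String, Set<String>> graph,
                                            double d, int iterations) {
        Map<String, Double> s = new HashMap<>();
        for (String v : graph.keySet()) s.put(v, 1.0);           // initial scores
        for (int it = 0; it < iterations; it++) {
            Map<String, Double> next = new HashMap<>();
            for (String a : graph.keySet()) {
                double sum = 0.0;
                for (String b : graph.get(a)) {                  // predecessors of a
                    int outDegree = graph.getOrDefault(b, Set.of()).size();
                    if (outDegree > 0) sum += s.get(b) / outDegree;
                }
                next.put(a, (1 - d) + d * sum);
            }
            s = next;
        }
        return s;
    }
}

A typical call would be TextRank.score(graph.asMap(), 0.85, 30) on the co-occurrence graph sketched earlier.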
IV Ranking Scores Applied to the Candidate Aliases to Identify the Correct Alias
Considering the noise in web snippets, candidates extracted by the shallow lexical patterns might include some invalid aliases. From among these candidates, we must identify those that are most likely to be correct aliases of a given name. We model this alias recognition problem as one of ranking candidates with respect to a given name, such that the candidates most likely to be correct aliases are assigned a higher rank.
Co-Occurrences in Anchor Texts
We define a name p and a candidate alias x as co-
occurring, if p and x appear in two different inbound anchor
texts of a url u. Moreover, we define co-occurrence
frequency (CF) as the number of different urls in which they
co-occur. It is noteworthy that we do not consider co-
occurrences of an alias and a name in the same anchor text.
Lexical co-occurrence is an important indicator of word association, and this has motivated several co-occurrence measures for word association such as PMI (Church and Hanks, 1989), LLR (Dunning, 1993), Dice (Dice, 1945), and CWCD (Washtell and Markert, 2009). We use a measure of word association based on a new notion of statistical significance for lexical co-occurrences, the Co-occurrence Significance Ratio (CSR), computed as defined in [9] (equation (3)).
We also use AltaVista-based Mutual Information (AVMI) [10], which improves the accuracy of correct alias finding:
AVMI(N, A) = log2( hits(N NEAR A) / (hits(N) * hits(A)) )    (4)
Here, hits(N NEAR A) is the number of hits (documents) returned by AltaVista for a query in which the two target terms are connected by the NEAR operator, and hits(w) is the number of hits returned for a single-term query w.
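The small Java sketch below computes a PMI-style association from hit counts in the spirit of equation (4); the hit counts in the example are made up, and the exact normalization should be taken from [10].

public class AltaVistaMI {
    // PMI-style association from web hit counts, in the spirit of equation (4):
    // higher when the name and alias co-occur more often than chance.
    public static double avmi(long hitsNear, long hitsName, long hitsAlias, long totalPages) {
        double pJoint = (double) hitsNear / totalPages;
        double pName  = (double) hitsName / totalPages;
        double pAlias = (double) hitsAlias / totalPages;
        return Math.log(pJoint / (pName * pAlias)) / Math.log(2);   // log base 2
    }

    public static void main(String[] args) {
        // Made-up counts for illustration only.
        System.out.println(avmi(1200, 500_000, 80_000, 1_000_000_000L));
    }
}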
IV CONCLUSION
We have proposed a name alias detection method that considers a second-order word co-occurrence graph, uses the CSR and AVMI measures for association, and uses graph mining for classification. We expect it to maximize recall and improve precision in the relation detection task.

REFERENCES
[1] Danushka Bollegala, Yutaka Matsuo, and Mitsuru Ishizuka, "Automatic Discovery of Personal Name Aliases from the Web," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 6, June 2011.
[2] Dmitri V. Kalashnikov, Zhaoqi Chen, Rabia Nuray-Turan, Sharad Mehrotra and Zheng Zhang, "Web People Search via Connection Analysis," IEEE International Conference on Data Engineering, 2009.
[3] Quang Minh Vu, Tomonari Masada, Atsuhiro Takasu and Jun Adachi, "Disambiguation of People in Web Search Using a Knowledge Base," IEEE, 2007.
[4] Meijuan Yin, Junyong Luo, Ding Cao, Xiaonan Liu and Yongxing Tan, "User Name Alias Extraction in Emails," I.J. Image, Graphics and Signal Processing, 2011, 3, 1-9.
[5] Yu-Chuan Wei, Ming-Shun Lin and Hsin-Hsi Chen, "Name Disambiguation in Person Information Mining," Proceedings of the 2006 International Conference on Web Intelligence (WI'06).
[6] Md. Rafiqul Islam and Md. Rakibul Islam, "An Effective Term Weighting Method Using Random Walk Model for Text Classification," in Proceedings of the 11th International Conference on Computer and Information Technology (ICCIT 2008), 25-27 December 2008, Khulna, Bangladesh.
[8] Samer Hassan, Rada Mihalcea and Carmen Banea, "Random Walk Term Weighting for Improved Text Classification," in Proceedings of TextGraphs: 2nd Workshop on Graph Based Methods for Natural Language Processing, ACL, pp. 53-60, 2006.
[9] Dipak L. Chaudhari, Om P. Damani and Srivatsan Laxman, "Lexical Co-occurrence, Statistical Significance, and Word Association."
[10] Marco Baroni and Sabrina Bisi, "Using cooccurrence statistics and the web to discover synonyms in a technical language," in Proceedings of LREC 2004.
[11] http://www.google.com/, accessed 24.12.2011.




Part II
Proceedings of the Second International Conference on
Computer Applications 2012
ICCA 12
Volume 5
Proc. of the Intl. Conf. on Computer Applications Volume 1.
Copyright 2012 Techno Forum Group, India.
ISBN: 978-81-920575-8-3 :: doi: 10.72843/ISBN_0768
ACM #: dber.imera.10.72843
Implementation of the DES Framework On SOA


Dr. Manish Kumar Mr. Ashish Avasthi Mr.Gaurav Mishra
SRMCET, Lucknow SRMCEM, Lucknow SRMCEM, Lucknow
Uttar Pradesh, India Uttar Pradesh, India Uttar Pradesh, India
dr.manish.2000@gmail.com ashishsrmcem@gmail.com gauravhanu@rediffmail.com



Abstract - Most enterprise software contains implementations of business logic that is dynamic. Rules are the explicit instructions that determine the appropriate response or action that should be taken in response to a set of specific conditions. Rules can be used in many flexible ways to reach decisions that can be carried out by virtually any application or business process. Most existing systems experience remote accessibility problems, and the majority of them are not service oriented, which makes accessing such applications remotely problematic. This document gives an introduction to web services, followed by a description of how service-oriented architecture applies to web services with regard to the three principles of modularity, encapsulation and loose coupling. It focuses on making the exposed rules and services as easy to access as a local database. The system has authentication and authorization mechanisms: services are shown to a user according to his privileges, and access control over these services can be imposed using role based menus. It also takes advantage of an encryption and decryption mechanism for security: calls to the services are made with encrypted data (typically encrypted with DES), and the service provider decrypts the data and invokes the appropriate rules.

Keywords: Service oriented Architecture, Encryption, DES
I. INTRODUCTION
The functions and rules are turned into services by exposing them as web services. Calls to the services are encrypted and decrypted for security. First, Service Oriented Architecture (SOA) is discussed, followed by web services and encryption-decryption.
A service is an abstract resource that represents a capability of performing tasks that form a coherent functionality from the point of view of provider entities and requestor entities. To be used, a service must be realized by a concrete provider agent.




1.1. Service Oriented Architecture (SOA)
SOA is a lightweight environment for dynamically
discovering and using services on a network [1]. It
presents an approach for building distributed systems
that deliver application functionality as services to
either end-user applications or other services. Figure 1
depicts the architectural stack and the elements that
might be observed in a service oriented architecture.
The architectural stack is divided into two halves: the left half addresses the functional aspects of the architecture and the right half addresses the quality of service aspects. These elements are described in detail as follows.


Functional aspects include
Transport is the mechanism used to move
service requests and service responses
between service consumer and service
provider
Service Communication Protocol is an agreed
mechanism that the service provider and the
service consumer use to communicate.
Service Description is an agreed schema for
describing what the service is, how it should
be invoked, and what data is required to
invoke the service successfully.
Service describes an actual service that is
made available for use.
Business Process is a collection of services,
invoked in a particular sequence with a
particular set of rules, to meet a business
requirement.
The Service Registry is a repository of
services and data descriptions used by service
providers to publish their services, and
service consumers to discover available
services.

Quality of service aspects include
Policy is a set of conditions or rules under
which a service provider makes the service
available to consumers.
Security is the set of rules that might be
applied to the identification, authorization,
and access control of service consumers
invoking services.
Transaction is the set of attributes that might
be applied to a group of services to deliver a
consistent result.
Management is the set of attributes that might
be applied to managing the services provided
or consumed.
1.2. Service Proxy
The service provider supplies a service proxy to
the service consumer. The service consumer executes
the request by calling an API function on the proxy.
The service proxy finds a contract and a reference to
the service provider in the registry. It then formats the
request message and executes the request on behalf of
the consumer. The service proxy can enhance
performance by caching remote references and data.
When a proxy caches a remote reference, subsequent
service calls will not require additional registry calls.
By storing service contracts locally, the consumer
reduces the number of network hops required to
execute the service.
Proxies can improve performance by eliminating
network calls altogether by performing some functions
locally. The proxy design pattern [9] states that the
proxy is simply a local reference to a remote object. If
the proxy in any way changes the interface of the
remote service, then technically, it is no longer a
proxy.
1.3. Service Lease
Service lease, which the registry grants the service
consumer, specifies the amount of time the contract is
valid: only from the time the consumer requests it
from the registry to the time specified by the lease.
When the lease runs out, the consumer must request a
new lease from the registry. Without the notion of a
lease, a consumer could bind to a service forever and
never rebind to its contract again.
1.4. SOA Characteristics
Each system's software architecture reflects the
different principles and set of tradeoffs used by the
designers. Service-oriented software architecture has
these characteristics [2]
Services are discoverable and dynamically
bound.
Services are self-contained and modular.
Services stress interoperability.
Services are loosely coupled.
Services have a network-addressable
interface.
Services have coarse-grained interfaces.
Services are location-transparent.
Services are composable.
Service-oriented architecture supports self
healing.
1.5. Web Services
A web service is a piece of business logic, located
somewhere on the Internet, that is accessible through
standard based Internet protocols such as HTTP or
SMTP [3 and 4].
A web service is a software application identified by a URI, whose interfaces and bindings are capable of being defined, described and discovered as XML artifacts. A web service supports direct interactions with other software agents using XML based message exchanges via Internet-based protocols [5]. It is a software system designed to support interoperable machine-to-machine interaction over a network, and it has an interface described in a machine-processable format.
Other systems interact with the web service in a manner prescribed by its description using SOAP messages, typically conveyed using HTTP with an XML serialization in conjunction with other Web-related standards. A web service has the following special behavioral characteristics:
XML based: Web Service technology is based on
standardized XML [15] and supported globally by
most major technology firms. XML provides a
language neutral way for representing data.
Communication happens between the service
provider and the consumer through XML message
interchange.
Loosely coupled: A tightly coupled system
implies that the client and server logic are closely
tied to one another, implying that if one interface
changes, the other must also be updated. Adopting
a loosely coupled architecture tends to make
software systems more manageable and allows
simpler integration between different systems.
Ability to be synchronous or asynchronous: In
synchronous invocations, the client blocks and
waits for the service to complete its operation
before continuing. Asynchronous operations allow
a client to invoke a service and then execute other
functions. Asynchronous clients retrieve their
result at a later point in time, while synchronous
clients receive their result when the service has
completed. Asynchronous capability is a key
factor in enabling loosely coupled systems.
Supports Remote Procedure Calls (RPCs):
Web services allow clients to invoke procedures,
functions, and methods on remote objects using
an XML-based protocol.
Supports document exchange: Web services
support the transparent exchange of documents to
facilitate business integration.

1.6. Web Services and Related Protocols
Several communication mechanisms across different system platforms already exist, such as DCOM (Distributed Component Object Model, Microsoft), CORBA (Common Object Request Broker Architecture, OMG) and RMI (Remote Method Invocation, Java); they provide tightly coupled communication and work well within a LAN. The Web Service mechanism provides a loosely coupled communication approach and facilities to integrate functionality from other system platforms. Web Service technology is built upon a set of open and standard Internet protocols, making interoperability between different application platforms more convenient. In nature, a web service can be regarded as a set of components or objects deployed on the WWW. Data of many kinds is exchanged in the XML document format, remote invocation is standardized with the SOAP protocol, services are described with WSDL, registered in the UDDI registry, and finally published to the web server. Figure 2 below illustrates the Web Service protocols.
1.7. Web Service Protocol Stack
The Web services stack [8], shown in the figure, categorizes the technology of Web services into a layered model. A typical Web Service application paradigm is as follows: first, the Service Provider defines and creates the service components and describes the access entry and remote calling interface with WSDL files for the Service Customer. Second, the Service Customer invokes the remote service according to the interface declared by the WSDL file. Third, the Web Service components receive the invocation request, execute the service, and return the result to the Service Customer.



Fig. 2 Web Service Protocol Stack
1.8. The Major Web Service Technologies
Simple Object Access Protocol (SOAP) provides
a standard packaging structure for transporting XML
documents over a variety of standard Internet
technologies, including SMTP, HTTP, and FTP [12].
It also defines encoding and binding standards for
encoding non-XML RPC invocations in XML for
transport. SOAP provides a simple structure for doing
RPC: document exchange. By having a standard
transport mechanism, heterogeneous clients and
servers can suddenly become interoperable. .NET
clients can invoke EJBs exposed through SOAP, and
Java clients can invoke .NET Components exposed
through SOAP.
Web Service Description Language (WSDL) is
an XML technology that describes the interface of a
web service in a standardized way [10 and 13]. WSDL
standardizes how a web service represents the input
and output parameters of an invocation externally, the
function's structure, the nature of the invocation (in
only, in/out, etc.), and the service's protocol binding.
WSDL allows disparate clients to automatically
understand how to interact with a web service.
Universal Description, Discovery, and
Integration (UDDI) provides a worldwide registry of
web services for advertisement, discovery, and
integration purposes [14]. UDDI is used to discover
available web services by searching for names,
identifiers, categories, or the specifications
implemented by the web service. UDDI provides a
structure for representing businesses, business
relationships, web services, specification metadata,
and web service access points [11].
1.9. Web Service Architecture
Web Services Architecture is built on the Internet
and the XML family of technologies. The overview of
this architecture is shown in the graphic below.



Fig. 3 Web Service Architecture

The architecture is divided into three conceptual
layers: SOAP, SOAP modules, and infrastructure
protocols. Modules express elements of functionality
that can be used individually or combined to achieve
composite and higher-level results. Infrastructure
protocols build on SOAP modules to provide end-to-
end functionality. Protocols at this layer tend to have
semantically-rich finite state machines as part of their
definition. They maintain state across a sequence of
messages and may aggregate the effect of many
messages to achieve a higher-level result. Web
Services Architecture rests on four key design
principles:
Modular: Rather than define large, monolithic
specifications this architecture is built on modular
components. These can be composed into solutions to
offer the exact set of features required by the problem
at hand.
General-Purpose: Web Services Architecture is
designed for a wide range of XML Web service
scenarios, from B2B and EAI solutions to peer-to-peer
applications and B2C services. Each module is
uniformly expressed whether used individually or in
combination with others, and is independent of any
limited application domain.
Federated: Web Services Architecture does not
require central servers or centralized administrative
functions. The architecture enables communication
across trust boundaries and autonomous entities.
Neither the modules nor the protocols make any
assumption about the implementation technology at
the message endpoints. Any technology can be used;
technologies can be transparently upgraded over time;
services can be delegated or brought inside over time.
Standards-Based: Web Services
Architecture builds entirely on the baseline
XML Web services specifications SOAP,
WSDL, and UDDI. The four specifications, WS-Security, WS-License, WS-Routing and WS-Referral, are built on the SOAP family of XML interoperability technologies.
With WS-Security [16], XML Web services can
examine incoming SOAP messages and,
based on an evaluation of the credentials,
determine whether or not to process the
request. WS-Security supports a wide range
of digital credentials and technologies
including both public key and symmetric key
cryptography.
WS-License describes how several common
license formats, including X.509 certificates
and Kerberos tickets, can be used as WS-
Security credentials. WS-License includes
extensibility mechanisms that enable new
license formats to be easily incorporated into
the specification.
WS-Routing [7] provides addressing
mechanisms that enable specification of a
complete message path for the message. This
enables one-way messaging, two-way
messaging such as request/response, peer-to-
peer conversations, and long running dialogs.
WS-Referral is a simple SOAP extension
that enables the routing between SOAP nodes
on a message path to be dynamically
configured. This configuration protocol
enables SOAP nodes to efficiently delegate
part or all of their processing responsibility to
other SOAP nodes. WS-Security, WS-License, WS-Routing, and WS-Referral can be used together as they are modular.
1.10. Encryption and Decryption
Encryption is the process of converting normal
data or plaintext to something incomprehensible or
cipher-text by applying mathematical transformations.
These transformations are known as encryption
algorithms and require an encryption key. Decryption
is the reverse process of getting back the original data
from the cipher text using a decryption key. The
encryption key and the decryption key could be the
same as in symmetric or secret key cryptography, or
different as in asymmetric or public key cryptography.
Algorithms: A number of encryption algorithms have
been developed over time for both symmetric and
asymmetric cryptography. The ones supported by the
default providers in J2SE v1.4 are DES, Triple DES,
Blowfish, PBE with MD5 and DES, and PBE with
MD5 and Triple DES. Note that these are all
symmetric algorithms. DES keys are 64 bits in length,
of which only 56 are effectively available as one bit
per byte is used for parity. This makes DES encryption
quite vulnerable to brute force attack. Triple DES, an
algorithm derived from DES, uses 128-bit keys (112
effective bits) and is considered much more secure.
Blowfish, another symmetric key encryption
algorithm, could use any key with size up to 448 bits,
although 128-bit keys are used most often. Blowfish is
faster than Triple DES but has a slow key setup time,
meaning the overall speed may be less if many
different keys are used for small segments of data.
Algorithms PBE with MD5 and DES and PBE with
MD5 and Triple DES take a password string as the
key and use the algorithm specified in PKCS#5
standards. For this paper we used DES (Data Encryption Standard) for simplicity.
II. PROBLEM
Business rules can be implemented as services which are then called by clients when required. This paper focuses on exposing the rules as web services. Earlier systems were not service oriented, and the majority of calls to the required system might fail at the firewall. As a remedy, we expose the rules as web services. A web service is a way of implementing service orientation. With service orientation a system
can use the services provided by any other machine, and this also guarantees a response from the desired system. Calls to a web service typically do not fail at firewalls due to the inbuilt security features. The system has authentication and authorization mechanisms and, in addition, takes advantage of an encryption mechanism: calls to the services are made with encrypted data. The mediators, called agents, are exposed as web services; an agent decrypts the data and invokes the appropriate rules upon request. Agents play an important role as the interfaces between the client and the rules. Clients cannot directly access the rules because they can only see the agents, as only the agents are exposed at the web service layer. Agents are therefore an important aspect of this paper.
III. IMPLEMENTATION
Implementation mainly deals with the distribution
of the services across several nodes. A simple stock
inventory system example is taken to illustrate this. It
takes care of stock maintenance, sales and generation
of the invoices. Role based authentication provides the
access to appropriate services. The administrator is
given access to all the services available in the
machine. Calls are made to the services only with the
encrypted data. The service providers decrypt the data
and process the requests. The system is divided into three modules: Production, Finance and Administration.
Production: This module focuses on the maintenance of inventory. After authentication of a production user, the application makes a call to the UDDI registry to get information about the available services. Once it gets the reply from the registry, the page lists all the services currently operable, namely AddStock, GetStock, DeleteStock and UpdateStock.
AddStock: As the name implies, it adds the stock into
the inventory through the web service interface. When
the user clicks on AddStock link the application
displays the form with the appropriate fields. Once
Insert Stock button located in the page is clicked,
firstly, the data validation is being performed.
Secondly, the jsp page encrypts the user information
using DES algorithm and posts it to the AddStock
service agent. The service agent upon receiving the
data from client decrypts it and calls the AddStock
Rules. The AddStock rules in turn check the current data against the existing records; if the data is not found, it is inserted into the database, otherwise an appropriate exception or result is returned to the client via the agent.
GetStock: GetStock service retrieves the total stock
available from the inventory. If the user clicks on
GetStock link, the jsp page makes a call to the
GetStock service agent. Upon receiving a call from the
client, the agent makes a call to GetStock Rules.
The GetStock Rules then form an XML string from the database, encrypt it and send it to the client via the agent. Once the page receives the data from the agent, it decrypts it, parses the XML document and displays it in the page. A sample XML document formed by the GetStock Rules from the database looks like this:
<Products>
<product>
<id>134</id>
<name>qwqw</name>
<type>123</type>
<quantity>1234</quantity>
<cost>12345</cost>
</product>
<product>
<id>215</id>
<name>gfrd</name>
<type>dfdf</type>
<quantity>45</quantity>
<cost>1000</cost>
</product>
</Products>
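For illustration, the sketch below shows one way the client page could parse such a document after decrypting it, using the standard javax.xml DOM API; the element names follow the sample above, and the class name StockXmlParser is hypothetical.

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class StockXmlParser {
    // Parses the decrypted <Products> document and prints each product.
    public static void print(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList products = doc.getElementsByTagName("product");
        for (int i = 0; i < products.getLength(); i++) {
            Element p = (Element) products.item(i);
            String name = p.getElementsByTagName("name").item(0).getTextContent();
            String quantity = p.getElementsByTagName("quantity").item(0).getTextContent();
            String cost = p.getElementsByTagName("cost").item(0).getTextContent();
            System.out.println(name + ": quantity=" + quantity + ", cost=" + cost);
        }
    }
}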
DeleteStock and UpdateStock: DeleteStock and UpdateStock remove and modify a particular stock item in the dataset, respectively.
Finance Module : Finance module looks for the
services related to the cost. All cost related services
such as update cost (UpdateCost) and products of cost
within the specified range (CostInRange) comes under
this category. When the finance user logs in, the
application makes a call to the UDDI registry to know
the financial services available in the machine. Once it
gets the reply from the registry the page lists all the
services currently operable.
UpdateCost and CostInRange: Update Cost service
enables the client to modify the cost of a product. Cost
in Range service displays the products within a user-specified cost range. In either case all communication happens through the agent and uses the encryption and decryption mechanism.
Administration Module
The Administration module has higher authority than the previous two modules and is provided with some extra features. The services offered by this module are:
User Related Queries
Services available at this machine
Check Services @ host
User Related Queries: The unique feature of the
administration module is processing user related
queries. This allows its user to create, edit and delete users.
Services available at this machine: The previous two modules, Production and Finance, have some restrictions; for example, a service provided for finance can never be seen by a production user and
vice versa. The administrator has the privilege of seeing all the services provided by the machine.
Check Services @ host: The administrator can also check the services offered by a host. This is implemented with the help of the UDDI registry, which records all services and the hosts from which they are offered.
3.1 Encryption and Decryption
Agents are exposed to the web service layer and
clients can only see the agents. As the data is sent
through the network, the data must be provided with
security. For this reason, every call made to the agent
is encrypted and decryption is done at respective
pages.
Implementation: The Java class DesEncrypter is responsible for data encryption and decryption. It internally uses standard Java classes such as javax.crypto.Cipher, KeyGenerator and SecretKey to produce the encrypted and decrypted strings; these three classes are provided by default with Java from version 1.4. The implementation uses two methods for this purpose:
public String encrypt(String value);
public String decrypt(String value);
The encrypt method takes the string to be encrypted as input and produces the encrypted string as output. It is a public method which can be used by any Java class that needs encryption. The decrypt method takes the encrypted string as input and produces the decrypted (original) string as output. It is also a public method which can be used by any Java class that needs decryption.
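The DesEncrypter class itself is not reproduced in the paper; the following is a minimal sketch of such a class with the two public methods named above, built on javax.crypto.Cipher, KeyGenerator and SecretKey. Encoding the ciphertext as Base64 text for transport and generating a fresh key in the constructor are assumptions made for illustration.

import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

public class DesEncrypter {
    private final SecretKey key;

    public DesEncrypter() throws Exception {
        // Illustrative only: a fresh 56-bit-effective DES key per instance.
        this.key = KeyGenerator.getInstance("DES").generateKey();
    }

    // Encrypts a plaintext string and returns Base64 text safe to send over HTTP.
    public String encrypt(String value) throws Exception {
        Cipher cipher = Cipher.getInstance("DES");
        cipher.init(Cipher.ENCRYPT_MODE, key);
        byte[] enc = cipher.doFinal(value.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(enc);
    }

    // Reverses encrypt(): decodes the Base64 text and recovers the original string.
    public String decrypt(String value) throws Exception {
        Cipher cipher = Cipher.getInstance("DES");
        cipher.init(Cipher.DECRYPT_MODE, key);
        byte[] dec = cipher.doFinal(Base64.getDecoder().decode(value));
        return new String(dec, StandardCharsets.UTF_8);
    }
}

In the deployed system the same key would have to be shared between the calling page and the service agent rather than generated per instance; key distribution is outside the scope of this paper.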
3.2 Architecture
Figure 4 depicts the architecture of the proposed system. A call to a service goes via the routing agent: upon receiving the request from the client, the routing agent encrypts the information required to invoke the service and forwards the call to the appropriate service agent. The service agents A and B are exposed as web services.


Fig. 4 Architecture of the proposed system

Only these agents are available to the outside world, and they are able to access the rule services A and B. Upon receiving a request, a service agent decrypts the received information and then calls the rule services.
3.3 UDDI Implementation
UDDI plays a major role in the web service environment. All service providers publish their services into the registry, a process often called registration. Once registration is successful, the UDDI registry holds the information regarding that service. Whenever a client wants to access a service, a call is first made to the registry, which replies to the client with the requested information. In this case the registry is implemented in a MySQL server database: any service deployed at any machine registers its existence in the registry database table and updates the database when it is undeployed as well.
IV. RESULT
This implementation provides a better understanding of web service definition and deployment, with simple encryption and decryption using DES, for a simple inventory system with a standard set of services; the overhead of authentication and encryption is not considerable since it is constant.
V. CONCLUSION
This paper deals with a simple implementation of rules as web services, with minimal support for web personalization. Access is role based, without using any semantic techniques, and key distribution is not discussed here. The approach combines rule-based processing with service-oriented computing to allow easy integration into client applications and to make it easy for applications to consume rule services (i.e., rules are exposed as web services). It also introduces distributed rule processing, which allows execution of rules over several machines in a way that is completely transparent to the client applications. Finally, it provides a role based authentication and authorization mechanism.
REFERENCES
[1]. P. Herzum, "Web Services and Service-Oriented Architectures," Executive Report, vol. 4, no. 10, Cutter Distributed Enterprise Architecture Advisory Service, 2002.
[2]. M. Stevens, "Service-Oriented Architecture Introduction, Part 2," Developer.com, http://softwaredev.earthweb.com/msnet/article/0,,10527_1014371,00.html, 2002.
[3]. M. Fisher, "Introduction to web services," part of the Java web services tutorial, Aug. 2002, http://java.sun.com/webservices/docs/1.0/tutorial.
[4]. M. Menasce and V. Almeida, Capacity Planning for Web Services, Prentice Hall, 2001.
[5]. W3C, "Web Services Architecture Requirements," Oct. 2002, http://w3c.org/TR/wsa-reqs.
[6]. S. Petrovic, "Rule-Based Expert Systems," 2006, http://www.cs.nott.ac.uk/~sxp/ES3/index.htm.
[7]. Microsoft Corporation, "Web Services Routing Protocol (WS-Routing)," Oct. 2001, http://msdn.microsoft.com/webservices.
[8]. H. Kreger, "Web Services Conceptual Architecture (WSCA 1.0)," IBM, May 2001, www3.ibm.com/software/solutions/webservices/pdf/WSCA.pdf.
[9]. E. Gamma, R. Helm, R. Johnson, and J. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley, 1994.
[10]. W. Sadiq and S. Kumar, "Web Service Description Usage Scenarios," World Wide Web Consortium (W3C), June 2002, http://www.w3c.org/TR/wsdescusecases.
[11]. T. Bellwood et al., "UDDI version 3.0," July 2002, http://uddi.org/pubs/uddi_v3.html.
[12]. D. Box et al., "Simple Object Access Protocol (SOAP) 1.1," May 2000, http://www.w3c.org/TR/SOAP/.
[13]. R. Chinnici, M. Gudgin, J. J. Moreau and S. Weerawarana, "Web Service Description Language (WSDL) version 1.2," July 2002, http://w3c.org/TR/wsdl12.
[14]. UDDI Consortium, "UDDI Executive White Paper," Nov. 2001, http://uddi.org/pubs/UDDI_Executive_White_Paper.pdf.
[15]. W3C, "Extensible Markup Language (XML) 1.0 (Second Edition)," Oct. 2000, http://w3c.org/TR/REC-xml.
[16]. B. Atkinson et al., "Web Services Security (WS-Security), version 1.0," April 2002, http://www.ibm.com/developerworks/library/ws-secure/.
[17]. B. Orriens, J. Yang, and M. Papazoglou, "A Rule Driven Approach for Developing Adaptive Service Oriented Business Collaboration," in Proceedings of the 3rd International Conference on Service Oriented Computing (ICSOC 05), Amsterdam, The Netherlands, Dec. 2005.
[18]. F. Rosenberg and S. Dustdar, "Design and Implementation of a Service-Oriented Business Rules Broker," in Proceedings of the 1st IEEE International Workshop on Service-oriented Solutions for Cooperative Organizations (SoS4CO 05), 2005.
[19]. M. Chhabra and H. Lu, "Towards Agent Based Web Services," 6th IEEE/ACIS International Conference on Computer and Information Science (ICIS 2007), 2007.
[20]. D. Greenwood and M. Calisti, "Engineering Web Services - Agent Integration," 2004 IEEE International Conference on Systems, Man and Cybernetics, 2004.
[21]. E. M. Maximilien and P. Singh, "Agent-based architecture for autonomic Web service selection," in Proc. of the Workshop on Web Services and Agent Based Engineering, Sydney, Australia, July 2003.
[22]. M. O. Shafiq, Y. Ding, and D. Fensel, "Bridging Multi Agent Systems and Web Services: towards interoperability between Software Agents and Semantic Web Services," in Proceedings of the 10th IEEE International Enterprise Distributed Object Computing Conference (EDOC'06), IEEE, 2006.
[23]. C. Nagl, F. Rosenberg, and S. Dustdar, "VIDRE - A Distributed Service-Oriented Business Rule Engine based on RuleML," in Proceedings of the 10th IEEE International Enterprise Distributed Object Computing Conference.

Proc. of the Intl. Conf. on Computer Applications Volume 1.
Copyright 2012 Techno Forum Group, India.
ISBN: 978-81-920575-8-3 :: doi: 10.72857/ISBN_0768
ACM #: dber.imera.10.72857

Agent Oriented Software Engineering Methodologies: A Review

Abhishek Singh Rathore
Department of Computer Science & Engineering
Maulana Azad National Institute of Technology
Bhopal, India
abhishekatujjain@gmail.com
Dr. Devshri Roy
Department of Computer Science & Engineering
Maulana Azad National Institute of Technology
Bhopal, India
droy.iit@gmail.com


Abstract: Agent oriented software engineering (AOSE) is a rapidly expanding field of academic research and real-world applications. AOSE provides various methodologies and tools with which developers can construct agent-oriented solutions. Agent orientation offers higher-level abstractions and mechanisms that address additional issues such as knowledge representation and reasoning, coordination, and cooperation among heterogeneous and autonomous parties. This paper provides a review of various AOSE methodologies.
Keywords- Intelligent Agents, AOSE, methodologies, Gaia,
Tropos, MaSE, Prometheus, ROADMAP, ADELFE, MAS
I. INTRODUCTION
A software engineering methodology is an organized process for the production of software using predefined techniques. An agent-oriented software engineering methodology [1] is one that uses the notion of agent or actor in all stages of its process. A number of AOSE methodologies exist that provide solutions to various problems; however, no methodology has been allocated sufficient research resources [2] to address all aspects, namely historical, economical, maintenance, team programming, and design and programming aspects. It is therefore necessary to understand the strengths and the applicability of the various methodologies.
This paper reviews six major AOSE methodologies: Gaia, ROADMAP, Tropos, MaSE, Prometheus, and ADELFE. These methodologies are evaluated on the basis of the framework provided in [2][3]. The aim of this study is not to find the one right methodology, but to identify the weaknesses and strengths of each methodology.
A. Gaia
Gaia [4] is the first complete methodology proposed for the analysis and design of Multi Agent Systems (MAS). Its scope covers the analysis and design phases, which provide sufficient detail for implementation. Gaia borrows some notations and terminology from object-oriented analysis and design [5]. Gaia's concepts [4] are divided into two categories: abstract and concrete.

Abstract entities are used during analysis. They include roles, permissions, responsibilities, protocols, activities, liveness properties and safety properties.

Figure 1. Models of the Gaia methodology [6].

Concrete entities are used in the design process and affect the system at runtime. They include agent types and services.
The analysis phase focuses on understanding the system and its structure. Roles are identified in this phase and the interactions between different roles are modeled. Roles are atomic and cannot be defined in terms of other roles. A role is defined by four attributes: responsibilities, permissions, activities and protocols.
Responsibilities determine the role's actual functionality. They can be divided into liveness properties and safety properties [7]. Liveness properties describe the tasks the role has to carry out so that something good happens in the system; they state that the agent must fulfil certain conditions. Consider the example of a music player: the role PlaylistLoader has the responsibility to load songs whenever the playlist is empty and to jump to the first track whenever the playlist has ended. Liveness properties are expressed as regular expressions over the operators shown in Table I; a worked expression is sketched after this list. A safety property states that nothing bad happens; it
requires that an agent acting in the system preserves its role. For example, it may list the legal values that a variable or resource can take.
Permissions describe the rights associated with a role and the resources it is allowed to access.

TABLE I. GAIA OPERATORS FOR LIVENESS PROPERTY [6]

Operator | Interpretation
x | y    | x or y occurs
x+       | x occurs 1 or more times
[x]      | x is optional
x . y    | x followed by y
x*       | x occurs 0 or more times
xω       | x occurs infinitely often
x || y   | x and y interleaved

Activities are computations carried out by the agent without interacting with other agents; they are the tasks associated with a role.
Protocols are the interactions with other roles.
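As an illustration of the notation, a liveness expression for the PlaylistLoader role mentioned above might be written as follows; the activity names are hypothetical, chosen only to fit the music-player example rather than taken from the Gaia literature.

    PLAYLISTLOADER = (LoadSongs . PlayTrack* . JumpToFirstTrack)ω

Read with Table I, this says that the role forever repeats the sequence: load songs, play zero or more tracks, then jump back to the first track.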
In the design phase, the roles and protocols represented in the role and interaction models (at the analysis stage) are mapped into concrete constructs [8]. The agent model identifies the agent types that will make up the system. The services model identifies the main services that are required to realize the agents' roles. The acquaintance model documents the lines of communication between the different agents.
B. Role-Oriented Analysis and Design for Multiagent
Programming (ROADMAP)
ROADMAP is an extension [9] of Gaia with a dynamic role hierarchy. In ROADMAP, a system is viewed as an organization of agents consisting of a role hierarchy and an agent hierarchy. The role hierarchy represents the system specification and thus the agents' behaviour, while the system implementation is specified in the agent hierarchy, which provides the functionalities of the system.


Figure 2. ROADMAP analysis and design models [9].

In ROADMAP, the models are divided vertically into domain-specific models, application-specific models, and reusable-services models. Domain-specific models contain reusable high-level domain information, while generic and reusable low-level software components are captured in the reusable-services models.
A use-case model is introduced in the analysis phase to cover requirements gathering [10]. In the design phase, the analysis models are refined to reflect design decisions. Roles and protocols in ROADMAP act as message filters. ROADMAP combines these notions with AUML notation to overcome the poor notation of Gaia, an extension that was later adopted by Gaia itself.
C. Tropos
Tropos is a requirement-driven [11] approach, i.e. domain and system requirements analysis drives the development process. Tropos is based on two key features [12]: first, the notion of agent is used in all software development phases; second, the methodology emphasizes early requirements analysis.
Tropos uses the concepts of actor, goal, and (actor) dependency as primitive concepts for modeling an application. The overall process can be seen in Fig. 3. The methodology spans four phases: Early Requirements, Late Requirements, Architectural Design and Detailed Design.
In the early requirements phase, the emphasis is on understanding the problem. Two diagrams, the actor diagram and the goal diagram, are used to define the problem.
In the late requirements phase, the system is described within its operational environment, together with its relevant functions and qualities, so that an early view of the system is obtained.
In the architectural design phase, the system's global architecture is defined in terms of subsystems interconnected through data, control and other dependencies.
In the detailed design phase, the behaviour of each architectural component is defined in further detail. The agents' goals, beliefs and capabilities are specified in detail, along with the interactions between them.


Figure 3. Process diagram for Tropos [13].
D. Multiagent Systems Engineering (MaSE)


Figure 4. MaSE Methodology [14].
MaSE is an architecture-independent methodology for providing agent-based solutions to real-world problems. MaSE takes an initial system specification and produces a set of formal design documents in a graphically based style [14]. It guides a designer from an initial set of requirements through the analysis, design and implementation of a working Multi Agent System (MAS). Object-orientation techniques are the foundation of MaSE and carry over to the implementation of the MAS.
In the analysis phase, system goals are derived from a set of requirements and modeled in a Goal Hierarchy Diagram. Roles are then defined to achieve these goals; MaSE uses use cases and sequence diagrams to derive roles from specific goals.
The design phase includes four steps. In the first step, roles are assigned to specific agent types. In the second step, the conversations between agent classes are defined for robustness. In the third step, the internal architecture and reasoning processes of the agent classes are designed. Finally, in the last step, the number and location of the agents to be deployed are defined in a Deployment Diagram.
E. Prometheus
The Prometheus methodology defines a detailed process
for specifying, designing, implementing and
testing/debugging agent-oriented software systems.
Prometheus is intended to be a practical [15] and detailed
methodology. Prometheus supports the design of agents that
are based on goals and plans.

Figure 5. Phases of Prometheus [16].
The Prometheus methodology consists of three phases [16].
In the system specification phase, the system is specified using goals and scenarios. Basic functionalities are identified in terms of actions, percepts and external data, and system usage is captured in scenarios.
In the architectural design phase, agent types are identified on the basis of the functionalities. Interaction diagrams and then interaction protocols are derived from the use-case scenarios, and the overall structure of the system is identified.
In the detailed design phase, agents are refined in terms of capabilities, plans and data. The implementation phase is omitted from Fig. 5 because it depends on the implementation platform used.
F. ADELFE
ADELFE [17] is an agent-oriented methodology for designing Adaptive Multi-Agent Systems (AMAS). It is based on object-oriented methodology, following the Rational
Unified Process (RUP), and uses AUML notations to guide the AMAS designer in building a solution according to RUP [18][17].
The ADELFE process consists of six work definitions: Preliminary Requirements, Final Requirements, Analysis, Design, Implementation and Tests. The ADELFE process is described at three levels: Work Definitions (WDi), Activities (Aj) and Steps (Sk), as shown in Fig. 6.


Figure 6. ADELFE process.
In the preliminary requirements phase the main focus is to define the system and to establish an agreement on the preliminary requirements. It includes [19]:
A1: Define User Requirements
A2: Validate User Requirements
A3: Define Consensual Requirements
A4: Establish Keywords Set
A5: Extract limit and constraints
The final requirements phase transforms the requirements into a use-case model and manages the requirements and their priorities [19]. It includes:
A6: Characterize environment
S1: Determine entities
S2: Define context
S3: Characterize environment
A7: Determine use-cases
S1: Draw up an inventory of use cases
S2: Identify cooperation failures
S3: Elaborate sequence diagrams
A8: Elaborate UI Prototypes
A9: Validate UI Prototypes
In the analysis phase, agents are identified. It includes [19]:
A10: Analyze the domain
S1: Identify classes
S2: Study interclass relationship
S3: Construct preliminary class diagram
A11: Verify the AMAS adequacy
S1: Verify the global level AMAS adequacy
S2: Verify the local level AMAS adequacy
A12: Identify agents
S1: Study entities in domain context
S2: Identify potentially cooperative entities
S3: Determine agents
A13: Study interaction between entities
S1: Study the active and passive entity relationships
S2: Identify the active entities relationships
A14: Study the agent relationships
In the design phase, non-functional requirements and the solution domain are modeled to prepare for the implementation and testing of the system. It includes [19]:
A15: Study the detailed architecture and the multi
agents model
S1: Determine packages
S2: Determine classes
S3: Use design patterns
S4: Elaborate component and class diagrams
A16: Study the interaction language
A17: Design an agent
S1: Define its skills
S2: Define its aptitudes
S3: Define its interaction language
S4: Define its world representation
S5: Define its Non cooperative situations
A18: Fast prototyping
A19: Complete design diagrams
S1: Enhance design diagrams
S2: Design dynamic behaviors.
II. COMPARATIVE EVALUATION
A methodology is a set of guidelines covering the whole lifecycle of system development, both technically and managerially. This section examines multiple dimensions, covering the major facets relevant to methodology evaluation. The evaluation is based on the information about the examined methodologies that is available in publications. The framework's four aspects are: concepts and properties, notations and modeling techniques, development process, and pragmatics.
A. Metric
For the evaluation process, a scale of High, Good, Average and Poor is used.
1) High
It indicates that the methodology fully supports the property, with only minor deficiencies. Its range lies between 95% and 100% support of the property.
2) Good
It indicates that the methodology addresses the property with few problems. Its range lies between 70% and 95% support of the property.
3) Average
It indicates that the methodology addresses the property to some extent, with major issues remaining unaddressed. Its range lies between 35% and 70% support of the property.
4) Poor
It indicates that the methodology does not support the property, or that little information is available. Its range lies between 0% and 35% support of the property.
B. Concepts and Properties
A concept is a notion derived from specific instances
within a problem domain. A property is a special capability
or a characteristic.
1) Autonomy
An agent is called autonomous when it can work without any supervision.
In Gaia, an agent's behaviour is described in terms of role schemas and all roles are atomic in nature, which represents the autonomy of the roles; Gaia can therefore be graded as High.
In Tropos, autonomy is achieved through each phase; the various artifacts of each stage, such as roles, capabilities and interactions, provide autonomy, and hence Tropos can be graded as High.
In MaSE, autonomy is achieved through roles. Roles are units of system requirements, and agent classes are based on roles, thus providing autonomy; MaSE can be graded as High.
In Prometheus, functionalities are encapsulated in terms of goals, which can be slightly affected by external data. Hence, it can be graded as Good.
In ROADMAP, the agents' behaviour is encapsulated in the dynamic role hierarchy and roles are atomic; it can thus be graded as High.
In ADELFE, functionality is encapsulated in behavioural rules. These rules are triggered from the agent's state, and hence ADELFE can be graded as Good.
2) Reactiveness
It is the agent's ability to respond in a timely manner to changes in the environment.
In Gaia, reactiveness can be achieved through the liveness properties of a role's responsibilities, so Gaia can be graded as High in terms of reactiveness.
In Tropos, reactiveness is achieved through interaction diagrams and activity diagrams in the detailed design phase only, and hence it can be graded as Average.
In MaSE, reactiveness can be achieved through sequence diagrams, but it is not expressed explicitly, and thus it can be graded as Average.
In Prometheus, reactiveness can be achieved by events but is not expressed explicitly, and thus it can be graded as Average.
In ROADMAP, the use of the keywords before, during and after limits the applicability of the liveness properties, and thus it can be graded as Good.
In ADELFE, reactiveness can be achieved through active entities, but in some cases AMAS does not support dynamic behavior, and hence it can be graded as Average.
3) Proactiveness
It is the ability of an agent to pursue new goals.
In Gaia, proactiveness can be achieved through the liveness properties of a role's responsibilities, so Gaia can be graded as High in terms of proactiveness.
In Tropos, proactiveness is achieved through activity diagrams, and hence it can be graded as High.
In MaSE, proactiveness is achieved through the communication of role tasks with other roles, which is modeled by finite state machines. Thus, it can be graded as High.
In Prometheus, proactiveness is achieved through goals and subgoals, and hence it can be graded as High.
In ROADMAP, the ability to add new subgoals is an advantage over Gaia, and hence it can be graded as High.
In ADELFE, cooperative agents support proactiveness, and hence it can be graded as High.
4) Sociality
It is the agent's ability to interact with other agents through messages and to understand those messages.
In Gaia, sociality aspects can be expressed in the agent acquaintance model and in the organizational structure and rules, but no explicit definition of the organizational structure within a Multi Agent System is given; thus Gaia can be graded as Average in terms of sociality.
In Tropos, sociality aspects can be expressed by agent interaction diagrams and UML class diagrams, but the actor relationships still have issues and cannot be fully clarified; thus Tropos can be graded as Average.
MaSE does not support social aspects other than communication between agents, so MaSE can be graded as Average in terms of sociality.
Prometheus supports message interaction through interaction protocols and hence can be graded as High.
The extended role and protocol models in ROADMAP add sociality to Gaia, and hence it can be graded as Good.
TABLE II. THE MEASURE OF AGENT PROPERTIES

Properties    | Gaia    | Tropos  | MaSE    | Prometheus | ROADMAP | ADELFE
Autonomy      | High    | High    | High    | Good       | High    | Good
Reactiveness  | High    | Average | Average | Average    | Good    | Average
Proactiveness | High    | High    | High    | High       | High    | High
Sociality     | Average | Average | Average | High       | Good    | Average

The coverage of the framework building block by various AOSE methodologies is shown in Table III.
TABLE III. THE MEASURE OF AGENT CONCEPTS

Concept      | Gaia                     | Tropos            | MaSE               | Prometheus        | ROADMAP           | ADELFE
Agent        | Agent Type               | Actor             | Agent Type         | Actor             | Agent             | Actor
Belief       | -                        | Goal              | Goal, Task, State  | Goal              | -                 | -
Desire       | -                        | Goal              | Goal, Task, State  | Goal              | -                 | -
Intention    | -                        | Plan              | Goal, Task, State  | Plan              | -                 | -
Message      | Protocol                 | Activity Diagram  | Conversation       | Protocol          | Agent types       | Protocol
Norm         | Organization Rule        | Organization Rule | Organization Rule  | Organization Rule | Organization Rule | Organization Rule
Organization | Organization             | -                 | -                  | Agent Descriptors | Organization      | -
Protocol     | Protocol                 | Agent Interaction | Conversation, Task | Protocol          | Role, Protocol    | Agent Interaction
Role         | Role                     | -                 | Role               | Role              | Role              | -
Society      | Organization             | -                 | -                  | -                 | Organization      | Cooperation
Task         | Activity, Responsibility | Capability        | Task               | Capability        | Responsibility    | Task
C. Notations and Modeling Techniques
Notations are symbols used to represent elements within a system. A modeling technique is a set of models that represents a system at different levels of abstraction and different facets of the system.
1) Accessibility
It is the ease or simplicity of understanding and using a method, which enhances both experts' and novices' ability to use a new concept.
In Gaia, the models are easy to understand and use, but liveness properties are expressed in terms of regular expressions, which are not easy for everyone to understand. Gaia can be graded as Average in terms of accessibility.
In Tropos, the models are easy to understand and use, but because of the iterative structure each model depends on, or is derived from, the model of the previous stage, which reduces accessibility. Hence, Tropos can be graded as Average.
MaSE has a concrete process and is easy to understand and follow, and thus can be graded as High.
Prometheus uses notions from object-oriented technology and has simple models, but because of the iterative structure each model depends on the previous stage's models, which reduces accessibility. Hence, Prometheus can be graded as Average.
In ROADMAP, the models are also easy to understand and use, but the new keywords after, before and during do not add simplicity, and hence it can be graded as Average.
ADELFE uses the notions of UML, which are easy to understand and follow, but the AMAS theory is sometimes not easy for everyone to understand, and hence it can be graded as Good.
2) Analyzability
It is the capability to check the internal consistency of models or to identify unclear aspects; it is usually supported by automatic tools.
Gaia does not deal with this issue, so it can be graded as Poor in terms of analyzability.
In Tropos, a tool called the Tool for Agent Oriented Modeling (TAOM4E) supports each stage, but it is still not clear how the internal consistency of the models is checked; Tropos can therefore be graded as Average.
The O-MaSE tool and process is used at every stage of MaSE to check the consistency of the MaSE models, and hence it can be graded as High.
The Prometheus Design Tool (PDT) checks internal inconsistencies at each stage, and hence Prometheus can be graded as High.
ROADMAP also does not deal with this issue, so it can be graded as Poor in terms of analyzability.
ADELFE is supported by UML and AUML, and hence internal consistency is maintained. Thus, it can be graded as Good.
3) Complexity Management
It is the ability to maintain various levels of abstraction or a hierarchical structure.
Gaia does not have any hierarchical representation or any other form of complexity management, so it can be graded as Poor.
Tropos supports a hierarchical structure and, using UML notations, supports complexity management, but the criteria for deriving soft goals from goals are still unclear; hence it can be graded as Good.
MaSE has various levels of abstraction but does not have any explicit rules for managing complexity, and thus can be graded as Average.
In Prometheus, the grouping of functionality can maintain levels of abstraction, but when capabilities are nested the complexity increases. Hence, it can be graded as Average.
In ROADMAP, a hierarchical structure with a dynamic nature is added, and hence it can be graded as Average.
ADELFE supports various levels of abstraction, but navigation between the levels is poor, and hence it can be graded as Average.
4) Executability
It is the capability to test aspects of the specification, either by simulation or by providing prototype models.
Gaia does not deal with this issue, so it can be graded as Poor in terms of executability.
Tropos supports code generation for Jadex, which is a BDI extension of JADE, and hence supports test-case generation for Jadex and JADE. Thus Tropos can be graded as Good.
O-MaSE provides testing of specifications in an early phase, but MaSE does not support any simulation or prototypes. Hence MaSE can be graded as Average.
PDT supports specification testing by providing a prototype, and hence Prometheus can be graded as Good.
ROADMAP does not deal with this issue and hence can be graded as Poor.
ADELFE carries out specification testing with the OpenTool tool, but does not define any formal testing plans or prototypes, and hence can be graded as Average.
5) Expressiveness
It is the applicability of the notation to the following aspects [3]:
the structure of the system;
the knowledge encapsulated within the system;
the system's ontology;
the data flow within the system;
the control flow within the system;
the concurrent activities within the system;
the resource constraints within the system;
the system's physical architecture;
the agents' mobility;
the interaction of the system with external systems;
the user interface definitions.
In Gaia,
Structure of system is not defined
Knowledge is encapsulated in role and not defined
explicitly
Ontology: NA
Data flow: Text representation
Control Flow: not identified
Concurrent Activities: NA
Resource Constraints: partially
Physical Architecture: NA
Agents Mobility: NA
Interaction with external system is not well
defined
User Interface definitions: NA
So Gaia can be graded as Average in terms of
expressiveness.
In Tropos,
Structure of system is not defined clearly
Knowledge is represented in terms of goals: hard
goals and soft goals.
Ontology: NA
Data flow: NA
Control Flow: It is expressed in terms of activity
diagram
Concurrent Activities: NA
Resource Constraints: NA
Physical Architecture: NA
Agents Mobility: NA
Interaction with external system is not well
defined explicitly
User Interface definitions: NA
So Tropos can be graded as Average in terms of
expressiveness.
In MaSE,
Structure of system is defined in Agent
Architecture Diagram
Knowledge is encapsulated in system and not
defined explicitly
Ontology: O-MaSE supports ontology
Data flow: NA
Control Flow: Concurrent Diagrams express
control flow
Concurrent Activities: Concurrent diagrams
Resource Constraints: partially
Physical Architecture: It is expressed by
Deployment diagrams
Agents Mobility: MaSE supports agents mobility
by move activities
Interaction with external system is defined in
design phase by conversation
User Interface definitions: NA
So MaSE can be graded as Good in terms of
expressiveness.
In Prometheus,
Structure of system is defined in system overview
diagram
Knowledge is represented in terms of goals.
Ontology: NA
Data flow: It is expressed in terms of agent
overview diagram
Control Flow: It is expressed in terms of capability
diagram.
Concurrent Activities: NA
Resource Constraints: NA
Physical Architecture: system overview diagram
Agents Mobility: NA
Interaction with external system is defined by
interaction diagrams.
User Interface definitions: NA
So Prometheus can be graded as Average in terms of
expressiveness.
In ROADMAP,
Structure of system is still not defined explicitly.
Knowledge is represented in terms of knowledge
and role models.
Ontology: NA
Data flow: It is expressed in terms of text.
Control Flow: Not expressed explicitly.
Concurrent Activities: NA
Resource Constraints: NA
Physical Architecture: Not expressed explicitly.
Agents Mobility: NA
Interaction with external system is defined by
interaction diagrams.
User Interface definitions: NA
So ROADMAP can be graded as Average in terms of
expressiveness.
In ADELFE,
Structure of system is defined by agent structure
diagram.
Knowledge is represented in terms of behavioural
rules.
Ontology: NA
Data flow: It is expressed in collaboration
diagrams.
Control Flow: It is expressed in sequence diagram.
Concurrent Activities: NA
Resource Constraints: NA
Physical Architecture: Not expressed explicitly.
Agents Mobility: NA
Interaction with external system is not defined
explicitly.
User Interface definitions: NA
So ADELFE can be graded as Average in terms of
expressiveness.
6) Modularity
It is the ability to divide the system and proceed incrementally, so that new additions do not affect the rest of the system.
Responsibilities, activities and protocols make Gaia modular in the sense that roles can be assigned dynamically, but any change to a role may affect the protocols, and thus the internal architecture, and may change the role's permissions, which lowers Gaia's modularity. Hence Gaia can be graded as Average in terms of modularity.
Modularity is fully supported by Tropos, hence it can be graded as High.
In MaSE, modularity is partially achieved through architecture style templates, hence it can be graded as Average.
Modularity is fully supported by Prometheus, hence it can be graded as High.
The dynamic role hierarchy in ROADMAP permits a role to modify other roles, and hence it can be graded as High.
Although ADELFE is based on UML, it does not have any explicit rules for integrating new features, and thus can be graded as Average.
7) Preciseness
It is the ability to provide unambiguous definitions of the models.
In Gaia, the notations used for the models have clear meanings and do not lead to misinterpretation. Responsibilities, which encapsulate the functionality of the system, are represented through liveness and safety properties, giving a clear meaning to the specifications. Thus, Gaia can be graded as High in terms of preciseness.
Tropos semantics are unambiguous and provide clear definitions, and hence it can be graded as High.
MaSE semantics are unambiguous and provide clear definitions, and hence it can be graded as High.
Prometheus also provides unambiguous definitions of its models and process, and hence can be graded as High.
ROADMAP also provides unambiguous definitions and hence can be graded as High.
ADELFE also provides unambiguous definitions and hence can be graded as High.
TABLE IV. THE SCALE OF MODELING

Modeling Techniques   | Gaia    | Tropos  | MaSE    | Prometheus | ROADMAP | ADELFE
Accessibility         | Average | Average | High    | Average    | Average | Good
Analyzability         | Poor    | Average | High    | High       | Poor    | Good
Complexity Management | Poor    | Good    | Average | Average    | Average | Average
Executability         | Poor    | Good    | Average | Good       | Poor    | Average
Expressiveness        | Average | Average | Good    | Average    | Average | Average
Modularity            | Average | High    | Average | High       | High    | Average
Preciseness           | High    | High    | High    | High       | High    | High
D. Development Process
A development process is a series of actions that results in a working computerized system.

1) Development context
It specifies whether the methodology is capable of developing new software and whether it supports reverse engineering and reengineering.
Gaia supports the development of new software and reengineering with reusable components, but it does not cover implementation aspects, so it cannot be used for reverse engineering. Therefore, it can be graded as Good in the development context.
Tropos supports the development of new software, prototyping and reengineering with reusable components. Its iterative structure supports reverse modeling between different stages, but it does not support going from code back to models. Hence, it can be graded as Good.
MaSE supports the development of new software, prototyping and reengineering with reusable components, but it cannot transform models
backwards, having issues in reverse engineering.
Hence, it can be graded as Good.
Prometheus supports all aspects of development
context except reverse engineering and hence can
be graded as Good.
ROADMAP does not solve the issue of reverse
engineering and hence can be graded as Good.
ADELFE supports all aspects of development
context except reverse engineering and hence can
be graded as Good.
2) Lifecycle coverage
It specifies whether a particular methodology covers all stages of the software development life cycle: requirements gathering, analysis, design, implementation and testing.
Gaia supports only the analysis and design phases, so it can be graded as Average in terms of lifecycle coverage.
Tropos does not support the testing phase and thus can be graded as Good in lifecycle coverage.
MaSE does not support the testing phase and thus can be graded as Good in lifecycle coverage.
Prometheus does not support the testing phase and thus can be graded as Good.
Use cases are added in ROADMAP to cover the requirements specification stage, which is not present in Gaia. Thus, it can be graded as Good.
ADELFE covers the stages from specification to design and hence can be graded as Good.
3) Stages
It specifies the activities within each stage of the methodology.
In Gaia, very few guidelines are available for the analysis and design phases, so Gaia can be graded as Average.
In Tropos, only brief details of each stage are available. Thus, it can be graded as Average.
O-MaSE provides guidelines for every stage of MaSE. Hence, it can be graded as High.
PDT provides guidelines for every stage of Prometheus, but in some cases user input is still required for finalization, and hence it can be graded as Good.
In ROADMAP, still only a few guidelines are available for the different stages, and hence it can be graded as Average.
The OpenTool and Interactive tools developed to support ADELFE explain the activities at each stage, and hence it can be graded as Good.
4) Verification and Validation
It checks whether verification and validation rules are present in the methodology to check its deliverables.
Gaia does not support any verification and validation rules and is thus graded as Poor.
Using Formal Tropos, the user may perform verification and validation of the analysis and design models. Hence, Tropos can be graded as Good.
MaSE provides some checking of the models at each stage, but does not provide any verification and validation rules for the initial requirements that finally result in a deliverable. Hence, it can be graded as Average.
Prometheus does not support any verification and validation rules and hence can be graded as Poor.
ROADMAP does not support any verification and validation rules and is thus graded as Poor.
ADELFE validates the specification over the lifecycle, but formal verification and validation rules are not expressed explicitly, and hence it can be graded as Average.
5) Quality assurance
It checks whether quality assurance rules are present in the methodology to check its deliverables.
Gaia does not support any quality assurance rules,
thus, graded as Poor.
Tropos does not support any quality assurance
rules, thus, graded as Poor.
MaSE does not support any quality assurance
rules, thus, graded as Poor.
Prometheus does not support any quality assurance
rules, thus, graded as Poor.
ROADMAP does not support any quality
assurance rules, thus, graded as Poor.
ADELFE does not support any quality assurance
rules, thus, graded as Poor.
6) Project management guidelines
It checks whether project management guidelines are available in the methodology.
Gaia does not support any Project management
guidelines, thus, graded as Poor.
Tropos does not support any Project management
guidelines, thus, graded as Poor.
MaSE does not support any Project management
guidelines, thus, graded as Poor.
Prometheus does not support any Project
management guidelines, thus, graded as Poor.
ROADMAP does not support any Project
management guidelines, thus, graded as Poor.
ADELFE does not support any Project
management guidelines, thus, graded as Poor.
TABLE V. THE MEASURE OF PROCESS

Process                       | Gaia    | Tropos  | MaSE    | Prometheus | ROADMAP | ADELFE
Development context           | Good    | Good    | Good    | Good       | Good    | Good
Lifecycle coverage            | Average | Good    | Good    | Good       | Good    | Good
Stages                        | Average | Average | High    | Good       | Average | Good
Verification and validation   | Poor    | Good    | Average | Poor       | Poor    | Average
Quality assurance             | Poor    | Poor    | Poor    | Poor       | Poor    | Poor
Project management guidelines | Poor    | Poor    | Poor    | Poor       | Poor    | Poor
E. Pragmatics
It refers to dealing with practical aspects of using a
methodology.
1) Resources
It describes the resources available for the methodology, such as textbooks, online support, vendor support and automated tools.
Not much material is available for Gaia, nor much online or vendor support, hence it can be graded as Average.
A lot of research material is available for Tropos, but it is lacking training courses, and hence it can be graded as Good.
A number of publications are available for MaSE, as well as the agentTool and O-MaSE tools, but it is lacking training courses, and hence it can be graded as Good.
The Prometheus Design Tool is available for Prometheus, but not much online support and material is available. Thus, it can be graded as Average.
Not much material is available for ROADMAP, and hence it can be graded as Average.
The OpenTool and Interactive tools are available for ADELFE. Online material is also available, but there is no online group, and hence it can be graded as Good.
2) Required Expertise
It describes the level of background knowledge and expertise needed by the user to learn the particular methodology.
The user is required to have modeling knowledge as well as knowledge of regular expressions in order to fully exploit the Gaia methodology, and many developers avoid the use of regular expressions; hence it can be graded as Average.




Tropos does not need any special knowledge; it uses UML notations and hence can be graded as High.
MaSE does not need any special knowledge, as it has object-orientation techniques as its foundation. Hence, it can be graded as Good.
Prometheus is based on object-oriented concepts with UML notation, but it uses textual data representation, and hence can be graded as Good.
ROADMAP does not address this issue, and hence its grading remains the same: Average.
ADELFE uses the UML and AUML notations, so not much expertise is required, and hence it can be graded as Good.
3) Language
This issue checks whether a particular methodology is targeted at a specific language or architecture.
Gaia is not targeted at a specific language or architecture, hence it can be graded as High.
Tropos is based on the BDI concept and hence can be graded as Average.
MaSE is not targeted at a specific language or architecture, hence it can be graded as High.
Prometheus is based on the BDI concept and hence can be graded as Average.
ROADMAP is also not targeted at a specific language or architecture, and hence can be graded as High.
ADELFE is not intended for any particular language, but ADELFE stereotypes are defined in the OTScript language, and hence it can be graded as Average.
4) Domain Applicability
This issue checks whether the methodology is restricted to a particular problem domain.
Gaia is not intended for any particular domain but does not support dynamic development, hence it can be graded as Good in domain applicability.
Tropos is not intended for any special domain; it is based on the BDI concept, and thus it can be graded as High.
MaSE is a general-purpose methodology, not targeted at a specific domain. Hence, it can be graded as High.
Prometheus is not targeted at a specific domain, but small and simple agents, as well as simulation applications, may not use capabilities, and hence it can be graded as Good.
ROADMAP is not intended for any particular domain and hence can be graded as High.
ADELFE is not a general-purpose methodology; it is intended for applications that require an adaptive MAS design using the AMAS theory. Hence, it can be graded as Poor.
5) Scalability
This issue checks whether the methodology is appropriate for handling the intended scale of applications.
Gaia has a simple structure that supports scaling, but it has a few issues in modularity, so scaling is not fully addressed; it can therefore be graded as Good.
Tropos has an iterative structure but does not have any subsystem concept, and hence can be graded as Average.
MaSE does not provide any explicit rules regarding a subsystem concept, and hence can be graded as Average.
Prometheus does not provide any explicit rules regarding a subsystem concept, and hence can be graded as Average.
The scaling issues of Gaia are still not solved by ROADMAP; hence its grading remains Good.
ADELFE does not provide any explicit rules for subsystems and hence can be graded as Average.

TABLE VI. THE MEASURE OF PRAGMATICS

Pragmatics           | Gaia    | Tropos  | MaSE    | Prometheus | ROADMAP | ADELFE
Resources            | Average | Good    | Good    | Average    | Average | Good
Required expertise   | Average | High    | Good    | Good       | Average | Good
Language             | High    | Average | High    | Average    | High    | Average
Domain applicability | Good    | High    | High    | Good       | High    | Poor
Scalability          | Good    | Average | Average | Average    | Good    | Average
III. CONCLUSION
In this paper, we have reviewed and evaluated six methodologies using the framework proposed in [2][3]. Examining Tables II to VI, all methodologies support most of the aspects, but each methodology is lacking in requirements specification, validation, verification, quality assurance, project management guidelines and reverse engineering. Further research can be carried out in this area, which may help AOSE to become practically accepted by industry.
REFERENCES
[1] Leon S. Sterling, Kuldar Taveter: The Art of Agent-Oriented Modeling, p. 192, The MIT Press, Cambridge, Massachusetts / London, England, 2009.
[2] Yubo Jia, Chengwei Huang, Hao Cai: A Comparison of Three Agent-Oriented Software Development Methodologies: MaSE, Gaia, and Tropos, IEEE 978-1-4244-5076-3/09, 2009.
[3] Arnon Sturm, Onn Shehory: A Framework for Evaluating Agent-Oriented Methodologies, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.5.4461, accessed 30/3/2011.
[4] Michael Wooldridge, Nicholas R. Jennings, David Kinny: The Gaia Methodology for Agent-Oriented Analysis and Design, Autonomous Agents and Multi-Agent Systems, 3, 285-312, Kluwer Academic Publishers, 2000.
[5] D. Coleman, P. Arnold, S. Bodoff, C. Dollin, H. Gilchrist, F. Hayes, and P. Jeremaes: Object-Oriented Development: The Fusion Method, Prentice Hall International, Hemel Hempstead, England, 1994.
[6] Franco Zambonelli, Nicholas R. Jennings, Michael Wooldridge: Developing Multiagent Systems: The Gaia Methodology, ACM Transactions on Software Engineering and Methodology, Vol. 12, No. 3, July 2003.
[7] A. Pnueli: Specification and Development of Reactive Systems, in Information Processing 86, Elsevier Science Publishers B.V., Amsterdam, The Netherlands, 1986.
[8] Luca Cernuzzi, Thomas Juan, Leon Sterling, and Franco Zambonelli: The Gaia Methodology: Basic Concepts and Extensions, http://mars.ing.unimo.it/Zambonelli/PDF/MSEASchapter.pdf, accessed 29/3/2011.
[9] Leon S. Sterling, Kuldar Taveter: The Art of Agent-Oriented Modeling, pp. 220-222, The MIT Press, Cambridge, Massachusetts / London, England, 2009.
[10] Thomas Juan, Adrian Pearce, Leon Sterling: ROADMAP: Extending the Gaia Methodology for Complex Open Systems, AAMAS '02: Proceedings of the First International Joint Conference on Autonomous Agents and Multiagent Systems: Part 1, ACM, New York, NY, USA, 2002.
[11] Bresciani, P., Giorgini, P., Giunchiglia, F., Mylopoulos, J., Perini, A.: Tropos: An Agent-Oriented Software Development Methodology, Autonomous Agents and Multi-Agent Systems, Kluwer Academic Publishers, Hingham, MA, USA, Volume 8, Issue 3, May 2004.
[12] Paolo Giorgini, Manuel Kolp, John Mylopoulos, Marco Pistore: The Tropos Methodology, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.7985, accessed 30/3/2011.
[13] Duy Cu Nguyen, Anna Perini, and Paolo Tonella: A Goal-Oriented Software Testing Methodology, in M. Luck and L. Padgham (Eds.), AOSE 2007, LNCS 4951, pp. 58-72, Springer-Verlag, Berlin, Heidelberg, 2008.
[14] Mark F. Wood, Scott A. DeLoach: An Overview of the Multiagent Systems Engineering Methodology, Proceedings of the First International Workshop on Agent-Oriented Software Engineering, 10 June 2000, Limerick, Ireland; P. Ciancarini, M. Wooldridge (Eds.), Lecture Notes in Computer Science, Vol. 1957, Springer-Verlag, Berlin, January 2001.
[15] Lin Padgham and Michael Winikoff: The Prometheus Methodology, RMIT University, April 2004, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.84.8307&rep=rep1&type=pdf, accessed 29/3/2011.
[16] Lin Padgham and Michael Winikoff: Prometheus: A Methodology for Developing Intelligent Agents, Agent-Oriented Software Engineering III, Third International Workshop, AOSE 2002, Bologna, Italy, July 15, 2002.
[17] Sylvain Rougemaille, Jean-Paul Arcangeli, Marie-Pierre Gleizes, and Frederic Migeon: ADELFE Design, AMAS-ML in Action: A Case Study, Engineering Societies in the Agents World IX, Lecture Notes in Computer Science, Volume 5485, pp. 105-120, 2009, DOI: 10.1007/978-3-642-02562-4_6.
[18] Carole Bernon, Marie-Pierre Gleizes, Sylvain Peyruqueou, Gauthier Picard: ADELFE, a Methodology for Adaptive Multi-Agent Systems Engineering, ESAW '02: Proceedings of the 3rd International Conference on Engineering Societies in the Agents World III, Springer-Verlag, Berlin, Heidelberg, 2003.
[19] V. M. B. Werneck, A. Y. Kano, L. M. Cysneiros: Evaluating ADELFE Methodology in the Requirements Identification, http://www.math.yorku.ca/~cysneiro/articles/Adelfe-WER07-FinalVersion-1.pdf, accessed 13/4/2011.

Proc. of the Intl. Conf. on Computer Applications Volume 1.
Copyright 2012 Techno Forum Group, India.
ISBN: 978-81-920575-8-3 :: doi: 10.72864/ISBN_0768
ACM #: dber.imera.10.72864

Bit-Pattern Representation for Design Pattern Identification


Anil Patil
Department of Computer Engineering
Pune Institute of Computer Technology
Pune, India
patil.anil@hotmail.com

Pravin Game
Department of Computer Engineering
Pune Institute of Computer Technology
Pune, India
pravingame@gmail.com


Abstract: Design patterns give solutions to recurring problems and help to reuse the experience of experts in software design. These patterns have been widely used in developing many software systems, but the pattern-related information is not available in the implementation. Identifying the instances of design patterns in the source code can therefore help software maintainers to understand the system design. In this paper we propose an approach for identifying design patterns based on a bit-pattern matrix representation of the system and of the design patterns.
Keywords- Design patterns, reverse engineering, software
maintenance.
I. INTRODUCTION
Design patterns are increasingly used across the software industry for developing software systems. Design patterns are common solutions to common problems: they show how to build systems following good object-oriented design principles. They do not give code; they give a general solution to a design problem that can be applied to a specific application.
According to the definition in [1], a design pattern describes a problem which occurs over and over again in our environment, and then describes the core of the solution to that problem in such a way that the solution can be used a million times over. Patterns are descriptions of communicating classes and objects that are used to solve a general design problem in a particular context; they describe the classes and objects along with their roles, collaborations and the distribution of their responsibilities. The use of design patterns promotes software reuse and offers loose coupling between the constituents of a system. With patterns, changes can be made easily in the code and development becomes faster, thus reducing development cost.
Due to a lack of design documentation, understanding the structure of large software systems becomes difficult. After deployment, the design information and original architecture of a system are generally lost, and only the source code is available for understanding the system. If the source code is very large, say containing hundreds of classes and thousands of lines of code, it is difficult to recover the original design, so comprehending the software design becomes very time consuming and a burden for software maintainers. Generally about 50% of maintenance time and cost is spent on program understanding, so any help in understanding the source code would reduce this cost. Maintainers therefore need automated support for understanding the structure of a software system.
Software design patterns document the experience of design experts and help to capture the original design decisions. Hence, identifying instances of design patterns in the source code or design can help to understand the system and to trace back to the original design decisions; once the pattern-related information is identified in the source code, maintenance becomes easier. From a forward engineering perspective, design patterns help to design high-quality systems, whereas from a reverse engineering perspective, by analyzing the code and identifying instances of design patterns, the design of the system can be recovered and understood. Identifying design patterns is thus an essential part of reverse engineering.
In this paper we propose a new approach to identify design patterns in a system. We present a bit-pattern structure to encode the structural properties of the system and of the design patterns, and we use matrices to identify pattern instances.
The paper is organized as follows: the next section discusses related work, and Section 3 presents the mathematical specification of the proposed identification process.
II. RELATED WORK
Various pattern identification techniques have been proposed. N. Pettersson and W. Lowe [2] proposed a technique for pattern identification based on filtering program graphs, which are generally non-planar; the filtering removes information that is not important for pattern detection and turns the graph into a planar graph. The main idea is safe filtering, in which some nodes and edge types of the program graph can be removed without affecting the result of detection. For non-planar graphs it
removes edges and tries to make the graph planar, reducing detection time.
The approach proposed by Kramer [3] in the Pat system mainly focuses on the detection of five structural patterns. Design information is extracted from a CASE tool. In the Pat system both the patterns and the design are represented as Prolog rules, and a Prolog engine performs the actual search. The main limitation is that it is unable to detect behavioral patterns.
Reducing the search space is a main concern in the pattern identification process. The approach proposed by R. Fiutem [4] is based on a multi-stage filtering strategy that avoids a combinatorial explosion of the search space. Both the system and the pattern are transformed into the Abstract Object Language (AOL) as an intermediate representation, and software metrics are used to determine the candidate sets of pattern constituents.
Identification of modified versions of design patterns was introduced by Tsantalis [5], based on similarity scoring between two graphs. He creates a set of matrices describing the pattern and system structures and computes similarity scores between those matrices; the approach is able to identify patterns that have been modified from their original representation. The approach proposed by Bernadi and Antonio [7] is based on model-driven detection. It detects pattern variants; the detection process is based on a declarative specification, a meta-model is defined to represent the design patterns and the system to mine, and an algorithm is used to match the two models.
A multi-layered approach is proposed in [6], which identifies instances of design motifs in the source code of a system. It basically identifies relationships among classes and instances of design motifs, and it provides a three-step identification process based on a UML-like class model. This approach creates three models, namely a source code model, an idiom-level model, and a design-level model. To describe the constituents of a class diagram, such as classes, fields, methods, interfaces, inheritance and the relationships among them, PADL (Pattern and Abstract-level Description Language) is used. In the first step the source code of the system is analyzed and a model is created. In the second layer this model is enriched to identify binary class relationships such as association, aggregation and composition, and the idiom-level model is created. In the last layer the model is enriched again and the micro-architectures similar to the specified design motif are identified. The main characteristic of this approach is that it ensures a traceability link between the different layers, from the source code up to the identified micro-architecture; each layer is a refinement of its previous layer. A meta-model is created to define these different models.
III. PROPOSED IDENTIFICATION PROCESS: MATHEMATICAL SPECIFICATION
The identification process typically involves structural and behavioral analysis. Every design pattern has its own structural and behavioral characteristics. The structural characteristics comprise structural information, which includes classes, interfaces, abstract classes and their relationships, such as association, generalization, aggregation and composition. In structural analysis this structural information is extracted from the source code and matched against the structural information of the patterns. We take advantage of available reverse engineering tools, such as IBM Rational Software Architect, to recover the class diagram from the source code. We now describe this procedure in mathematical terms.
A. Mathematical Model for Structural Analysis
Any software system can be considered as a collection of classes, methods and various attributes inside the classes. So let S be a software system such that

S = {s1, s2, s3, s4, ..., sn}

where si, 1 ≤ i ≤ n, can be a class, a method or an attribute. Now let us assume that

C = {c1, c2, c3, c4, ..., ck}

represents the set of all k, k ≤ n, classes in system S, and C ⊆ S. Each class ci, 1 ≤ i ≤ k, has a set of methods and attributes as its constituents. So we represent each class ci as

ci = {m1, m2, m3, ..., a1, a2, a3, ...}

Here m and a represent methods and attributes respectively. Consider a set R representing the relationships between the classes of the system, defined as

R = {r1, r2, r3, r4, ...}

For example,

R = {association, aggregation, generalization, ...}

Class to class relationship: the relation between two classes ci and cj can be represented as a function f : C × C → 2^R. For example,

f(c1, c2) = {r2, r4}

Here, if we treat r2 as association and r4 as generalization, it means class c1 and class c2 have generalization and association relationships between them.
The approach proposed by Tsantalis et al. [5] uses matrices to represent the pattern structure. Taking motivation from that work, we represent the system and pattern structures in an enhanced way. In an object-oriented system, the structural information is represented by class diagrams, and these diagrams can easily be represented by a square matrix. Each matrix cell entry represents the relationship between the corresponding classes. Instead of creating a separate matrix for each relationship as in [5], we create one matrix to represent all classes and their relationships. The class diagram of the source code and the class diagram of the design pattern to be detected are transformed into matrices, and then matching between these two matrices takes place to identify instances of the pattern. After this matching we obtain the classes which are involved in the pattern.
a) System Matrix-
The relationships between the various classes of a system can be defined as a k × k matrix A = (aij), where aij = f(ci, cj), ci, cj ∈ C, 1 ≤ i, j ≤ k and |C| = k. Here aij represents the bit-pattern between two classes ci and cj, which is nothing but the relationships between these two classes. The bit-pattern structure will be explained later.
Thus we have defined the matrix structure for the system; now we define the matrix for a pattern.
b) Design Pattern Matrix-
Let the set DP represent the set of design patterns which we have to identify in our software system:

DP = {observer, adapter, state, ...}

Let the set D represent the set of classes participating in a design pattern p, p ∈ DP:

D = {c1, c2, c3, c4, ..., cm}

Now we represent the structure of every design pattern as an m × m matrix DPM = (dij), where dij = f(ci, cj), ci, cj ∈ D and |D| = m. So again here dij represents a bit-pattern between classes ci and cj, which encodes the particular relationships between classes ci and cj of the design pattern. Table 1 shows the matrix for the Decorator design pattern. Each cell entry represents the bit-pattern between the respective classes.
c) Bit-Pattern-
There may exist more than one relationship between two classes ci and cj. So to define multiple relationships between these two classes we present a bit-pattern structure as follows.
We consider only the association, generalization and aggregation relationships to identify pattern instances. Since we are considering only 3 relationships, we use a 3-bit binary pattern to represent them. We assume the structure of the 3-bit binary pattern, from left to right, as:

1st bit position represents generalization.
2nd bit position represents association.
3rd bit position represents aggregation.

So the bit-pattern for the relationships between ci and cj is:

Generalization | Association | Aggregation
     0/1       |     0/1     |     0/1

If any of these three relationships exists between ci and cj we set the respective bit to 1, otherwise 0. If there does not exist any relationship between two classes ci and cj, the bit-pattern will be 000.
For example, consider the Decorator design pattern [1] as shown in fig 1. The 3-bit pattern for the classes ConcreteComponent and Component is 100. According to our assumed 3-bit pattern structure, the 1st bit position (from left to right) is 1, which indicates that there is a generalization relationship between the ConcreteComponent and Component classes. The binary pattern for the Decorator and Component classes is 101. Since the 1st and 3rd bits are 1, these two classes have both generalization and aggregation relationships between them. We use binary shift operations to recover the encoded relationships. The shift operation is fast, so we use it for better performance. Since we are considering only 3 relationships, we use 3 right shift operations.
The first right shift operation gives aggregation, the second right shift operation gives association, and the third right shift operation gives the generalization relationship. For example, in Table 1 the matrix cell entry for Decorator and Component is 101. After the first right shift operation we get bit 1, which indicates the presence of an aggregation relationship. The second right shift operation results in 0, so there is no association. The third right shift gives bit 1, so there exists generalization between the Decorator and Component classes. So in this way we can represent multiple relationships between two classes by using a bit-pattern, and by performing shift operations we can recover the encoded relationships.
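As an illustration of this encoding (our own sketch, not the implementation used in this work), the three relationships can be packed into a 3-bit value and recovered again with shift-and-mask operations; the bit layout follows the convention assumed above.

# Illustrative sketch of the 3-bit relationship encoding described above.
# Bit layout (left to right): generalization, association, aggregation.
GENERALIZATION = 0b100
ASSOCIATION    = 0b010
AGGREGATION    = 0b001

def encode(generalization=False, association=False, aggregation=False):
    """Pack the relationships between two classes into one 3-bit pattern."""
    pattern = 0
    if generalization:
        pattern |= GENERALIZATION
    if association:
        pattern |= ASSOCIATION
    if aggregation:
        pattern |= AGGREGATION
    return pattern

def decode(pattern):
    """Recover the encoded relationships with right shifts: the first bit
    shifted out is aggregation, then association, then generalization."""
    found = []
    for name in ("aggregation", "association", "generalization"):
        if pattern & 1:        # test the bit about to be shifted out
            found.append(name)
        pattern >>= 1          # right shift to expose the next relationship
    return found

# Decorator/Component cell of Table 1: generalization + aggregation = 101.
cell = encode(generalization=True, aggregation=True)
print(format(cell, "03b"))     # -> 101
print(decode(cell))            # -> ['aggregation', 'generalization']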
Table 1 shows the matrix for the Decorator design pattern. Each cell entry represents the bit-pattern between the respective classes.

Fig. 1. Decorator pattern.

Table 1: Matrix for the Decorator design pattern.

                    Component | ConcreteComponent | Decorator | ConcreteDecorator
Component             000     |       000         |    000    |       000
ConcreteComponent     100     |       000         |    000    |       000
Decorator             101     |       000         |    000    |       000
ConcreteDecorator     000     |       000         |    100    |       000

d) Matching-
For matching we treat every bit-pattern as a binary digit, so every entry in a matrix represents a binary number. Here we identify the sub-matrix of the system matrix which is similar to the matrix of the pattern to be detected. So we define it as

match(DPM(m×m), A(k×k)) → B

Here every element of DPM is matched with the elements of matrix A, and if a correct match is found those classes are included in the output set B. So the resultant set B gives us the detected pattern along with the classes which are involved in it.

B. Dynamic Analysis
After structural analysis we get the instances of design patterns which are structurally similar to the pattern to be detected. This is given as input to the behavioral analysis. Because of this the search space is greatly reduced, since we concentrate only on the instances detected during structural analysis. Additionally, most of the design patterns have behavioral characteristics, which typically represent how objects collaborate and work together. This behavior is typically described by method invocations. Then the matching between this behavior and the behavior of the pattern is checked. By applying this on the system we can get more exact instances of design patterns.

C. Similar patterns detection
There are patterns which are structurally similar to each other, like the State and Strategy patterns [1]. Existing approaches such as DeMIMA [6] and design pattern detection using similarity scoring [5] cannot distinguish between the State and Strategy patterns because of their similar structure, as shown in fig 2 and fig 3. Since these two diagrams are structurally the same, if we build matrices for these two patterns we get the same matrix structure, so the identification process cannot distinguish between the two patterns. Though these patterns have the same structure, their intent, i.e. the description of the goals behind the pattern and the reason for using it, is different. Also their motivation, i.e. the scenario consisting of the problem in which the pattern can be applied, is different. So the pattern detection method must detect such kinds of patterns distinctly.

Fig. 2. State Pattern.

Fig. 3. Strategy Pattern.

One possible solution is to use programming language naming standards for pattern detection. Suppose that during development the developers apply the State pattern; then while naming classes they give names to the classes as the class name followed by State, e.g. class FrozenState, class TCPListenState etc. The same thing happens for the methods inside those classes. So if we find classes whose names contain State, they are more likely to be part of a State pattern instead of a Strategy pattern. So by using this technique we can distinguish between those patterns which are similar to each other.
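As a toy illustration of the naming-convention heuristic just described, a detector could add a simple vote on the class names of a structurally matched instance; the example names and decision rule below are purely illustrative.

def vote_state_vs_strategy(class_names):
    """Heuristic sketch: if any participating class name contains 'State',
    prefer the State pattern; if any contains 'Strategy', prefer Strategy;
    otherwise the structural match alone cannot tell the two apart."""
    names = [n.lower() for n in class_names]
    if any("state" in n for n in names):
        return "State pattern"
    if any("strategy" in n for n in names):
        return "Strategy pattern"
    return "undecided (structurally identical)"

print(vote_state_vs_strategy(["TCPConnection", "TCPListenState", "FrozenState"]))
# -> State pattern
print(vote_state_vs_strategy(["Sorter", "QuickSortStrategy"]))
# -> Strategy pattern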
IV. CONCLUSION
The main aim of reverse engineering is to discover the original design structure of a software system by analyzing its implementation. By retrieving this information, the reconstruction and maintenance phases become easier. But design patterns are not directly traceable from the system source code. We presented a new identification approach which uses bit-patterns to represent the relationships between classes so that they can be used to detect pattern instances. We mainly concentrate on structural analysis. Our future work will focus on dynamic analysis so that more accurate instances of patterns can be found.
REFERENCES
[1] E. Gamma, R. Helm, R. Johnson, and J. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley, 1995.
[2] N. Pettersson and W. Lowe, "Efficient and Accurate Software Pattern Detection", 13th Asia-Pacific Software Engineering Conference, December 2006.
[3] C. Kramer and L. Prechelt, "Design Recovery by Automated Search for Structural Design Patterns in Object-Oriented Software", Proceedings of the Third Working Conference on Reverse Engineering, November 8-10, 1996, Monterey, California.
[4] G. Antoniol, R. Fiutem, and L. Cristoforetti, "Design Pattern Recovery in Object-Oriented Software", 6th International Workshop on Program Comprehension (IWPC '98): proceedings, June 24-26, 1998, Ischia, Italy.
[5] N. Tsantalis, A. Chatzigeorgiou, G. Stephanides, and S. Halkidis, "Design Pattern Detection Using Similarity Scoring", IEEE Transactions on Software Engineering, vol. 32, no. 11, pp. 896-909, Nov. 2006.
[6] Yann-Gael Gueheneuc and G. Antoniol, "DeMIMA: A Multilayered Approach for Design Pattern Identification", IEEE Trans. Software Engineering, pp. 667-684, 2008.
[7] Mario Luca Bernardi and Giuseppe Antonio Di Lucca, "Model Driven Detection of Design Patterns", Software Maintenance (ICSM), 2010 IEEE International Conference, 12-18 Sep. 2010.
Proc. of the Intl. Conf. on Computer Applications Volume 1.
Copyright 2012 Techno Forum Group, India.
ISBN: 978-81-920575-8-3 :: doi: 10. 72871/ISBN_0768
ACM #: dber.imera.10. 72871


Calculating the Cohesion and Coupling for Class Diagram through Slicing Techniques

Akhilesh Kumar
Guru Nanak Dev Engineering College, Ludhiana, Punjab (India)
Abstract--High cohesion or module strength indicates that a system has been well partitioned into components which have strong internal relationships between attributes, methods and classes. Cohesion is an important factor in terms of software design. Coupling indicates the degree of interdependence among the components of a software system. Low coupling is thought to be a desirable goal in software construction, leading to better values of internal attributes such as maintainability, reusability and reliability. A coupling metric captures the degree of interaction and relationship among the class dependency graph elements, attributes and methods, in a software system. A slicing technique is used to slice the class dependency graph; program slicing describes a mechanism which allows the automatic generation of a slice. In this paper, we propose a new technique for calculating cohesion and coupling for class diagrams through slicing techniques. The proposed technique builds the class dependency graph and the dependencies between attribute-attribute, attribute-method and method-method pairs. Results indicate that the proposed scheme yields significantly low coupling and high cohesion, so the system is more reliable and efficient and the credibility of the class diagram is appropriate.
Keywords--Cohesion, Coupling, Class Slicing, Class Dependency Graph, Static Slicing, Dynamic Slicing, CBCD.
I. INTRODUCTION
Cohesion and coupling are important factors in terms of software design. Coupling indicates the degree of interdependence among the components of a software system. Classes play an important role in object-oriented system design. Entities in an application domain are captured as classes, and applications are built with objects which are instantiated from classes. The Unified Modelling Language (UML) plays an important role in building object-oriented software. UML diagrams are divided into two general sets, structural modelling diagrams and behavioural modelling diagrams; this work deals with structural modelling diagrams. Class diagrams show the different classes that make up a system and how they relate to each other. Class diagrams are said to be static diagrams because they show the classes, along with their methods and attributes, as well as the static relationships between them. We base our work on a class dependency graph (CDG) representation of UML class diagrams [11]. A CDG represents a set of classes and their relationships. Coupling represents the degree of interdependence among the components of a class diagram of a software system [5]. Good software design should have low coupling; higher coupling makes a system more complex, and highly inter-related modules are difficult to understand.
Cohesion indicates module strength in a system. High cohesion or module strength indicates that a system has been well partitioned into components which have strong internal relationships between attributes, methods and classes. Cohesion is an important factor in terms of software design. Class cohesion is the property of connectivity among the aggregate elements of a single class. Most class cohesion measures are based on an abstract representation of the method-attribute reference relationships and/or method similarity relationships, and they calculate the class cohesion based on the number of edges in the graph [2, 6].

Class slicing is a decomposition technique that removes class components not relevant to a chosen computation, referred to as a slicing criterion. The remaining components form an executable class called a slice that computes a projection of the original class's semantics. The class slice consists of the parts of a class that may affect the values computed at some point of interest, referred to as the slicing criterion. Class slicing can be used in debugging to locate the source of errors more easily. Class slicing describes a mechanism which allows the automatic generation of a slice. All statements affecting or affected by the variables mentioned in the slicing criterion become part of the slice.
The slice-based coupling and cohesion are determined on the basis of the dependencies among the class slices and the design matrix for a UML model. A static slice can be computed by identifying the different architectural elements and the dependencies among them for a UML model. These selectively identified architectural elements can comprise classes and their objects, different attributes and the method calls. We collectively term these identified architectural elements a slice of the architecture. These architectural elements are identified based on a slicing criterion. In the following, we define a slicing criterion and its corresponding computed static slice for an architectural model. For the calculation of the coupling and cohesion measures, sets of slices and their intersections, comparable to the use of slice profiles, are needed. A class diagram is converted into a dependency graph, and the attributes, methods and their relationships are identified. A slice is applied on the basis of a class slicing criterion to calculate the cohesion on the basis of cohesion metrics. In this paper, we propose a new technique for calculating cohesion and coupling for class diagrams through slicing techniques.
The proposed technique builds the class dependency graph and the dependencies between attribute-attribute, attribute-method and method-method pairs. The new scheme shows low coupling and high cohesion, so the designed system is more reliable and efficient and the credibility of the class diagram is appropriate. The rest of the paper is organised as follows: we give the related work in section II. The design procedure for the design metrics is highlighted in section III. The calculation of slice-based coupling and cohesion of the class dependency graph is given in sections IV and V. Performance evaluation is described in section VI. Finally, the conclusion is given in section VII.

II. RELATED WORK
Although the most important factors of software products are based on design properties such as coupling, there is little work in this area. Most existing measures capture coupling between modules using source code, which is only available after implementation [3]. These measures also demonstrate that the values of coupling and cohesion can be used for assessing deterioration effects [12]. Coupling is a measure of the strength of inter-component connections, and cohesion is a measure of the mutual affinity of the subcomponents of a component. Within the scope of this contribution we are interested in how these measures are calculated and what they indicate. As adumbrated in the introduction, a practical way of calculating coupling and cohesion measures is to make use of slices. A coupling measure named Coupling Between Classes Diagram (CBCD) is defined and empirically validated in [2]. With the CBCD measure, class A is coupled to class B if A uses B's member methods and/or instance variables. CBCD counts the number of classes to which a given class is coupled. Coupling refers to the degree of interdependence between parts of the design. To measure coupling in class diagrams there are three types of metrics; in this paper we measure coupling performance. A measure of coupling is useful to determine the complexity of software. CBCD for a class is a count of the number of other classes to which it is coupled. The definition of CBCD deals with the instance variables and the total number of methods of the class. When two classes are coupled, the methods and instance variables defined in one class are used by the other class. Multiple accesses to one class are counted as one access.

III. DESIGN PROCEDURE FOR DESIGN METRICS
A dependency is used to model a wide range of dependent relationships between model elements. It would normally be used early in the design process, where it is known that there is some kind of link between two elements. The trace relationship is a specialization of a dependency. There are different types of dependency, but we generally use four types: system dependency, class dependency, data dependency and control dependency. A class dependency concerns the construction of class dependence graphs for single classes, derived classes and interacting classes. This section also discusses the way in which our graphs represent polymorphism. For example, figure 1 shows the class diagram of the employee information system, which has two classes. The first is the Employee class, with four attributes (employee Id, first Name, last Name, email) and setter methods such as set Department( ), set Email( ), set First Name( ), set Last Name( ), set Manager( ) and set Employee Id( ). The Department class has four attributes (department ID, Name, City, State) and four methods: set department ID( ), set Name( ), set City( ) and set State( ). Between the two classes there are dependencies: from the Employee class side the dependency is many-to-one, meaning many employees depend upon one department, while the one-to-many dependency means one department has many employees.


EMPLOYEE

-employee Id : int
-first Name : String
-last Name : String
-email : String


+get Department( ) : Department
+get Email( ) : String
+get Employee Id( ) : int
+get First Name( ) : String
+get Last Name( ) : String
+get Manager ( ) : Employee
+set Department(department : Department)
+set Email( email : String)
+set Employee Id( employee Id : int)
+set First Name(first Name : String)
+set Last Name( last Name : String)
+set Manager ( manager : Employee)
m

1

DEPARTMENT

-department ID : int
-name : String
-City : String
-State : String


+get department ID( ) : int
+get name( ) : String
+get City( ) : String
+get State( ) : String
+set department ID(department ID: int )
+set name(name: String)
+set City(city : String)
+set State( state : String)


Figure 1: Class Diagram of Employee Information

3.1 Scenarios of Class Diagram (Employee information)
In a CDG, classes and their attributes, methods and their call parameters, together with method return values, are represented as different types of nodes. These are connected by using appropriate dependence edges in the CDG. Member dependence edges represent the class memberships of methods and attributes, while method dependence edges represent the dependence of the call parameters and return values (if any) on a method. Data dependence edges represent the flow of data among the statements of a class method. Based on the class dependency graph, the direct read dependence from methods to attributes, the direct write dependence from attributes to methods, and the direct call dependence between methods can easily be derived. However, the flow dependence between attributes is not so straightforward.

Let c be a class, with its relationships:
(1) If m ∈ M(c) and a ∈ RA(m), then m is read dependent on a, denoted by m → a.
(2) If m ∈ M(c) and a ∈ WA(m), then a is write dependent on m, denoted by a → m.
(3) If m1, m2 ∈ M(c) and m2 ∈ CALL(m1), then m1 is call dependent on m2, denoted by m1 → m2.



Figure 2: Dependency Graph of Employee Information
3.2 Static slicing of class dependency graph
Slices are computed using a dependence graph. The static slicing technique uses static analysis to derive slices; that is, the dependency graph is analyzed and the slices are computed for all possible input values. The slice-based coupling and cohesion are determined on the basis of the dependencies among the program slices and the design matrix for a UML model. A static slice can be computed by identifying the different architectural elements and the dependencies among them for a UML model. These selectively identified architectural elements can comprise classes and their objects, different attributes, and the method calls. We collectively term these identified architectural elements a slice of the architecture. These architectural elements are identified based on a slicing criterion. In the following, we define a slicing criterion and its corresponding computed static slice for an architectural model. For the calculation of the coupling and cohesion measures, sets of slices and their intersections, comparable to the use of slice profiles, are needed. Static slices are conservative and therefore contain more statements than necessary. A static program slice S consists of all statements in a program P that may affect the value of a variable v at some point p. The slice is defined for a slicing criterion

C = (x, V) ... (I)

where x is a class and V is a subset of the variables in the class. A static slice includes all the statements that affect variable v for the set of all possible inputs at the point of interest (at statement x). Class-level slice: a class-level slicing criterion is a triple

C = (P, C, Vc) ... (II)

where C is a class in a certain module P, and Vc is the set of variables defined or used in C. A class-level slice is the set of all classes affecting and affected by the values of the variable set Vc. In [7], a hierarchical slice model is established as a framework for measuring object-oriented coupling.
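To make the dependence rules (1)-(3) and the slicing criterion concrete, the following minimal sketch (our own simplification, with node names loosely following Figure 1) stores the read, write and call dependences as directed edges and computes a static slice as the set of all nodes affecting or affected by the criterion node.

from collections import defaultdict

class ClassDependencyGraph:
    """Minimal CDG sketch: nodes are attribute and method names; edges follow
    rules (1)-(3): m -> a for reads, a -> m for writes, m1 -> m2 for calls."""

    def __init__(self):
        self.succ = defaultdict(set)   # outgoing dependence edges
        self.pred = defaultdict(set)   # incoming dependence edges

    def add_edge(self, source, target):
        self.succ[source].add(target)
        self.pred[target].add(source)

    def _reach(self, start, edges):
        """All nodes reachable from `start` over the given edge map."""
        seen, stack = set(), [start]
        while stack:
            for nxt in edges[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def slice(self, criterion):
        """Static slice of a criterion node: every node affecting or
        affected by it (backward plus forward reachability)."""
        return self._reach(criterion, self.succ) | self._reach(criterion, self.pred)

# A small fragment of the Department class of Figure 1 (illustrative edges).
cdg = ClassDependencyGraph()
cdg.add_edge("get d-id", "d-id")    # rule (1): get d-id reads attribute d-id
cdg.add_edge("d-id", "set d-id")    # rule (2): d-id is write dependent on set d-id
cdg.add_edge("name", "set name")    # rule (2) for another attribute
print(sorted(cdg.slice("d-id")))    # -> ['get d-id', 'set d-id']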

IV. CALCULATION OF SLICE BASED COUPLING
OF CLASS DEPENDENCY GRAPH
This section explains slice-based coupling metrics of the class dependency graph, measuring coupling between classes, i.e. which class is directly coupled with which other class. In this paper we measure coupling on the basis of the dependencies of attributes and their methods. Coupling is a measure of the information flow between two classes. The metric used is shown below:

Flow P(c1, c2) implies that information flows from class c1 to class c2 in a module P, the employee information module.

Flow P(c2, c1) implies that information flows from class c2 to class c1 in a module P, the employee information module.

We calculate the coupling between the Employee and Department classes and vice-versa.

Slicing Criterion (Department, D-id)

Figure 3: Slice based CDG of Employee Information

Coupling (Employee, Department) = 1/8 ≈ 0.1
Coupling (Department, Employee) = 0

V. CALCULATION OF SLICE BASED COHESION
OF CLASS DEPENDENCY GRAPH.
This section presents slice-based cohesion metrics and details the formulae used in their calculation. First, a class diagram is converted into a dependency graph and the attributes, methods and their relationships are identified. A slice is applied on the basis of a program slicing criterion, and then the cohesion is calculated on the basis of the cohesion metrics. Worked examples illustrate the calculation of slice-based cohesion metrics for a class dependency graph. There are two slices, shown in figures 4 and 5.

Dep*(n): consists of all attributes or methods on which n directly or indirectly (potentially) depends. Nc is the total number of nodes of the class [2].

DRC: a Dependence Relationship based Cohesion measure for classes.







Slicing criterion (Employee, e-id)

Figure 4: Slice based CDG of Employee Class

The cohesion measures for each dependency graph are tabulated below.
Table 1: Cohesion of Employee Dependency Graph

Node            | Dep*(n)                                                              | Dep_D(n)
Employee-id     | {Set e-id, First Name, Last Name, Email, Set F N, Set L N, Set e-m} | 0.87
First Name      | {Set e-id, Last Name, Set F N, Set L N, e-id}                        | 0.62
Last Name       | {Set e-id, First Name, Email, Set F N, Set L N}                      | 0.62
E-mail          | {Set e-id, First Name, Set F N, Set L N, Set e-m, e-id, Last Name}   | 0.87
Set Employee-id | {e-m, First Name, Email, Set F N, Set L N, Set e-m, e-id}            | 0.87
Set First Name  | {Set e-id, First Name, Email, L N, Set L N, Set e-m, e-id}           | 0.87
Set Last Name   | {Set e-id, First Name, Email, Set F N, L N, Set e-m, e-id}           | 0.87
Set E-mail      | {Set e-id, First Name, Email, Set F N, Set L N, L N, e-id}           | 0.87

Now the total cohesion of the Employee class is

= (0.87 + 0.62 + 0.62 + 0.87 + 0.87 + 0.87 + 0.87 + 0.87) / 8
= 6.46 / 8 = 0.81
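The Dep_D(n) values in Table 1 are consistent with the ratio |Dep*(n)| / Nc with Nc = 8 (for instance 7/8 ≈ 0.87 and 5/8 ≈ 0.62), and the class cohesion is then the average of these ratios. Under that reading, the following small sketch reproduces the total computed above.

def dep_d(dep_star, nc):
    """Dep_D(n) taken as |Dep*(n)| / Nc: the fraction of the class's Nc nodes
    on which node n directly or indirectly (potentially) depends."""
    return len(dep_star) / nc

def class_cohesion(dep_star_by_node, nc):
    """Class cohesion taken as the average Dep_D(n) over all class nodes."""
    return sum(dep_d(s, nc) for s in dep_star_by_node.values()) / len(dep_star_by_node)

# Dep*(n) sets of the Employee class, abbreviated as in Table 1.
employee = {
    "e-id":     {"set e-id", "f-name", "l-name", "e-mail", "set f-n", "set l-n", "set e-m"},
    "f-name":   {"set e-id", "l-name", "set f-n", "set l-n", "e-id"},
    "l-name":   {"set e-id", "f-name", "e-mail", "set f-n", "set l-n"},
    "e-mail":   {"set e-id", "f-name", "set f-n", "set l-n", "set e-m", "e-id", "l-name"},
    "set e-id": {"e-m", "f-name", "e-mail", "set f-n", "set l-n", "set e-m", "e-id"},
    "set f-n":  {"set e-id", "f-name", "e-mail", "l-n", "set l-n", "set e-m", "e-id"},
    "set l-n":  {"set e-id", "f-name", "e-mail", "set f-n", "l-n", "set e-m", "e-id"},
    "set e-m":  {"set e-id", "f-name", "e-mail", "set f-n", "set l-n", "l-n", "e-id"},
}
print(round(class_cohesion(employee, nc=8), 2))   # -> 0.81, matching the total above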

Slicing Criterion (Department, d-id)


Table 2: Cohesion of Department Dependency Graph

Node      | Dep*(n)                                                   | Dep_D(n)
D-id      | {Set d-id, Name, Set Name, City, Set City, State, Set S}  | 0.87
Name      | {Set d-id, Name, D-id, City, Set City, State, Set S}      | 0.87
City      | {State, Set S, Set City}                                   | 0.37
State     | {Set d-id, Name, Set Name, City, Set City, D-id, Set S}   | 0.87
Set D-id  | {d-id, Name, Set Name, City, Set City, State, Set S}      | 0.87
Set Name  | {Set d-id, Name, D-id, City, Set City, State, Set S}      | 0.87
Set City  | {City, State, Set S}                                       | 0.37
Set State | {Set d-id, Name, Set Name, City, Set City, State, d-id}   | 0.87

Now the total cohesion of the Department class is

= (0.87 + 0.87 + 0.37 + 0.87 + 0.87 + 0.87 + 0.37 + 0.87) / 8
= 5.96 / 8 = 0.74
VI. PERFORMANCE ANALYSIS
Performance analysis is used to correlate the data. Pearson's linear correlation is used to quantify the relation between metrics. Such correlations measure linear associations between variables. The output is a correlation coefficient, reported as the value R, and the coefficients of a linear model [17]. The strength of R can be interpreted as follows: if R is between 0.8 and 1.0, it shows a strong association; if R is between 0.5 and 0.8, it shows a moderate association; otherwise it shows a weak or no association. A negative value of R indicates an inverse correlation. In our employee information example of coupling (section IV) we found R equal to 0.1; this result shows low coupling, which is better for a good software design. In the employee information example for slicing (section V) we found R equal to 0.81; this result indicates high cohesion, which is also better for a good software design. Finally, these results show that the software system is more reliable.
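For reference, here is a small sketch of Pearson's linear correlation together with the interpretation bands used above; the metric samples are hypothetical.

from math import sqrt

def pearson_r(xs, ys):
    """Pearson's linear correlation coefficient between two metric samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def interpret(r):
    """Interpretation bands stated in this section."""
    if r < 0:
        return "inverse correlation"
    if r >= 0.8:
        return "strong association"
    if r >= 0.5:
        return "moderate association"
    return "weak or no association"

# Hypothetical cohesion values and an associated quality metric for five classes.
cohesion = [0.81, 0.74, 0.66, 0.90, 0.58]
quality  = [0.78, 0.70, 0.60, 0.92, 0.55]
r = pearson_r(cohesion, quality)
print(round(r, 2), interpret(r))   # a value close to 1.0 -> strong association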

VII. CONCLUSION
Cohesion and coupling are important factors in terms of software design. Coupling indicates the degree of interdependence among the components of a software system. This paper presented a coupling and cohesion measure for classes based on class dependence relationships. The relationships among the members of a class are hence well characterized and can be used to objectively evaluate the cohesiveness of a class. The proposed technique builds the class dependency graph and the dependencies between attribute-attribute, attribute-method and method-method pairs. The coupling measure focuses only on the dependencies between classes on the basis of the dependencies of attributes and their methods (for example friendship between classes, specialization, and aggregation). Our calculated results indicate that the proposed scheme yields significantly low coupling and high cohesion, so the system is more reliable and efficient and the credibility of the class diagram is appropriate.
REFERENCES
[1] Lionel C. Briand, John W. Daly and Jürgen Wüst, "A Unified Framework for Coupling Measurement in Object-Oriented Systems", IEEE Transactions on Software Engineering, Vol. 25, No. 1, 1999.
[2] Yuming Zhou, Lijie Wen, Jianmin Wang and Yujian Chen, "A Dependence Relationships Based Cohesion Measure for Classes", 10th Asia-Pacific Software Engineering Conference (APSEC), Department of Computer Science & Engineering, Southeast University, Nanjing, 2003.
[3] Erik Arisholm, "Dynamic Coupling Measures for Object-Oriented Software", Proceedings of the Eighth IEEE Symposium on Software Metrics (METRICS-02), 2002.
[4] Denys Poshyvanyk, and Andrian Marcus, The
Conceptual Coupling Metrics for Object-
Oriented Systems, IEEE International
Conference on Software Maintenance (ICSM'06)
Department of Computer Science Wayne State
University Detroit Michigan, 2006.

[5] Lionel Briand, Prem Devanbu and Walcelio Melo, "An Investigation into Coupling Measures for C++", Proc. of the 19th International Conference on Software Engineering, pp. 18-23, May 1997.
[6] James M. Bieman and Byung-Kyoo Kang,
Measuring Design-Level Cohesion, IEEE
transactions on software engineering, Vol. 24,
NO. 2, Feb-1998.
[7] Bixin Li, "A Hierarchical Slice Based Framework for Object Oriented Coupling Measurement", Turku Centre for Computer Science, TUCS Technical Report, 2001.
[8] Timothy M. Meyers and David Binkley, "Slice-Based Cohesion Metrics and Software Intervention", Loyola College in Maryland, Baltimore, Maryland 21210-2699, 1999.
[9] V. Krishnapriya and Dr. K. Ramar, Exploring
the Difference between Object Oriented Class
Inheritance and Interfaces Using Coupling
Measures, International Conference on
Advances in Computer Engineering, 2010.
[10] L. C. Briand, Y. Labiche, and Y. Wang, "An investigation of graph-based class integration test order strategies", IEEE Transactions on Software Engineering, pp. 594-607, 2003.
[11] L. C. Briand, J. Feng and Y. Labiche, "Using genetic algorithms and coupling measures to devise optimal integration test orders", Proceedings of the 14th International Conference on Software Engineering and Knowledge Engineering, pp. 43-50, 2002.
[12] A. Abdurazik and A. J. Offutt, "Using coupling-based weights for the class integration and test order problem", The Computer Journal, pp. 557-570, 2009.
[13] P. Green, Lane, Rainer, Scholz, An
Introduction to Slice-Based Cohesion and
Coupling Metrics, Technical Report No. 488,
University of Hertfordshire, School of Computer
Science, 2009
[14] Timothy M. Meyers and David Binkley, Slice-
Based Cohesion Metrics and Software
Intervention, Loyola College in Maryland
Baltimore, Maryland PP 21210-2699.
[15] Zhenqiang Chen,Yuming Zhou and Baowen Xu,
A Novel Approach to Measuring Class
Cohesion Based on Dependence Analysis, IEEE
International Conference on Software
Maintenance (ICSM.02), PP. 7695-1819 , 2002
[16] Jaiprakash T. Lallchandani and R. Mall ,Static
Slicing of UML Architectural Models, UML
Architectural Models, in Journal of Object
Technology, Vol. 8, PP. 159-188, Jan-Feb 2009.
[17]Timothy M. Meyers and David Binkley, an
Empirical Study of Slice-Based Cohesion and
Coupling Metrics, ACM Transactions on
Software Engineering and Methodology
(TOSEM), Dec 2009.








Proc. of the Intl. Conf. on Computer Applications Volume 1.
Copyright 2012 Techno Forum Group, India.
ISBN: 978-81-920575-8-3 :: doi: 10. 72878/ISBN_0768
ACM #: dber.imera.10. 72878

A QoS based conceptual analysis in Requirement Engineering for
Service-Based Systems

Ms.G.Rajalakshmi
PG Scholar, CSE Dept
Mailam Engg. College, Mailam
Tamil Nadu, India


Mrs.T.Priyaradhikadevi
Research Scholar, CSE Dept
Mailam Engg. College, Mailam
Tamil Nadu, India


Dr.R.M.S. Parvathi
Research Guide, CSE Dept
Segundar Engg. College
Trichencode,Tamil Nadu, India

Abstract-- Non-functional Requirements (NFRs) along with the Functional Requirements (FRs) are the bedrock of software development. If requirements are not properly considered at the early stages of software development, it may become very complex to address them later. The goal of Service Oriented Architecture (SOA) is to allow the creation, development and deployment of services. This paper analyses the requirements in the early stage and provides a mapping technique by which these requirements are classified to provide the best service on the web.
Keywords-- Requirement Engineering, Web Services, QoS, service discovery, service matchmaking, service selection.
I. INTRODUCTION
Software Engineering is the establishment and use of
software engineering principles in order to obtain
economically software that is reliable and works efficiently
on real machines. Software Engineering can be treated as a
layered technology with requirement, design, and
implementation as some of its layers. The goal of software
engineering is to produce quality software, which is
delivered on time, within budget, and which satisfies the
customer's requirements and users' needs. A Service
approach must rest on an organizational commitment to
quality. The bedrock that supports software engineering is
a quality focus. The initial phase starts with the gathering
of requirements based on the needs of a customer or a user.
Design phase specifies how the developed software should
perform its tasks. It consists of modularization and detailed
interfaces of design elements, their algorithms and
procedures, and the data type needed to support the
architecture and to satisfy its needs. Testing is made to
verify and validate the process before deployment.
Qualities can be differentiated as internal and external
quality. The external qualities are visible to the user such
as reliability, efficiency, usability etc., whereas the internal qualities concern the developers: verifiability, maintainability, adaptability etc.
Requirement engineering is a process of discovering
the needs of stake holders and documenting them for
analysis, communication and implementation. The two
types of requirements are functional and non functional
requirements. Functional requirements or behavioral
requirements define functions of the product. It includes
the input that the software gets and the output it generates.
Non-functional or non-behavioral requirements are the
properties of software such as portability, reliability,
testability, efficiency and modifiability. Requirements are
developed through requirement engineering which is a
process that includes a set of activities such as requirement
inception, elicitation, elaboration etc.
The main contribution of our work is to translate the
customer's desire for a set of defined capabilities into a
working service. The complete information necessary for a
product or service is achieved through requirement
engineering. The overall requirements are elicited from the
customer. These requirements encompass information and
control needs, function and behavior, overall performance,
design interfacing constraints, and other specific needs. It
provides the appropriate mechanism for understanding
what the customer wants, analyzing need, assessing
feasibility, negotiating a reasonable solution, specifying the
solution unambiguously, validating the specification, and
managing the requirements as they are transformed into an
operational system.
Web Service is a software system designed to support
interoperable machine-to-machine interaction over a
network. Web services interact with each other, fulfilling
tasks and requests that, in turn, carry out parts of complex
transactions or workflows.
A Web Service provides programmatic access to a
service. The main reason why Web Services are becoming
important is ease of integration. They run on all kinds of
machines, from the desktop to the mainframe. The World
Wide Web is like a large content library that enables to
share and distribute information. Web Services enhance
this by enabling the sharing and distribution of services.
The development of UDDI (Universal Description,
Discovery and Integration), SOAP (Simple Object Access
Protocol) and WSDL (Web Service Definition Language)
standards are the foundation of Web Services.
Our paper provides better QoS in terms of requirements, which helps the customer to differentiate the best service among the different services discovered, thus providing a decision basis for choosing an efficient QoS based service. A framework for dynamic QoS adaptation for a service-based system is developed, where the requirements are transformed into probabilistic values for optimization.
II. BACKGROUND
Kritikos and Plexousakis [1] contributed an analysis of the requirements for a semantically rich QoS in the phases of WS matchmaking and selection, describing two problems in [6]: syntactic discovery, which provides inaccurate results, and the absence of a selection mechanism for choosing the best service among multiple services which perform the same function. The first problem is solved by providing semantic web and WS technologies, while the other is solved based on the non-functional properties collectively known as QoS. Filtering and ranking are performed for the WS matchmaking and selection phases respectively.
Alrifai and Risse [2] highlighted the problems in local and global optimization and proposed a hybrid composition technique for WS selection containing a distributed QoS computation model for WSs and a QoS-aware service selection approach. A feasible and optimal selection is made if all QoS constraints are satisfied and the overall utility is maximized. Service brokers are used to filter out services that violate local constraints. The simple additive weighting method is used for ranking the discovered services.
L. Taher, El Khatib and Basha [3] presented a framework and a dynamic QoS matchmaking algorithm which provides dynamic web service selection. A generic QoS Information and Computation framework is provided for QoS based service selection. An aeRegistry is provided, which is an implementation of the standard service registry (UDDI), and a Registry Ontology is used to present the semantics of their proposal.
Dongyun Liu and Hong Mei [4] contributed a feature-orientation approach for mapping requirements to software architecture. The object-oriented and traditional structured methods are the focus. A system's functionality is bundled and described as a feature which characterizes a perspective of the system. Feature elicitation, analysis and refinement are performed.
David and Masoud [5] contributed a mapping concept
for non-functional requirements in terms of cloud
applications. Distributed Ensembles of Virtual Appliances
(DEVA) is a model described to represent complex systems
with QoS. The feasibility of the approach is demonstrated by modeling the expected number of requests per second and the response time of a web application hosted in the cloud.
III. SCOPE OF OUR PAPER
The Web Services developed, deployed and published by the Service Providers mean nothing unless the Service consumers can search, locate and bind to them. The typical interactions involve publish, find and bind operations. Similarly, finding the web service suitable for the customer involves the following phases: service discovery, service matchmaking and service selection. Initially, based on the requirements gathered, we have to identify or discover the services available for the customer; then, among the various services provided, we have to match the requirements to provide a feasible solution; and finally the selection has to be made.



Figure 1. Scope of our survey.
A. Cause effect analysis associated with Web Service
Discovery
The problems related to UDDI are tabulated below along with their causes and effects.
TABLE I. PROBLEMS ASSOCIATED WITH WEB SERVICE DISCOVERY
Problem: Current UDDI implementations are limited in scope. UDDI is not innately designed to publish and store the QoS requirements and other non-functional requirements of a service [13,10,11].
Reason: UDDI allows search on limited attributes of a service, namely the Service Name (selected by the Service Provider), a keyReference (unique for a service), or a categoryBag (listing all the business categories) [13,10,11].
Effect: This problem makes it difficult to store within the UDDI the run-time performance parameters of the service capturing its QoS parameters. It is also difficult to capture the customer feedback about the service and store it to analyze and improve on these valuable metrics [13,10,11,12,8,9].

Problem: Public UDDI registries, which were run by IBM, Microsoft, SAP and NTT Com., were shut down at the beginning of 2006 [8].
Reason: There was no consensus regarding ownership of the root UDDI registries. UBRs used to contain listings of businesses that no longer existed and sites that were no longer active [11,8].
Effect: There is no universal registry where all Web Services are published. This makes it difficult to check the performance, scalability and statistical gathering of data. An earlier work carried out by Su Myeon Kim and Marcel-Catalin Rosu [11] reports that only one-third of the 1200 registered services referenced a valid WSDL.

The problem behind the WSDL approach is stated in figure 2. Similarly, the usage of UDDI gives us only limited scope, since it is not mainly designed to publish and store the QoS requirements and other non-functional requirements. To overcome this problem we use the Web Service Modeling Ontology based Quality of Service (WSMO-QoS), where an extension is made for the quality metrics, value attributes, and their corresponding measurements.


Figure 2. Problem with WSDL.
B. Related Approaches
As in [15], [16], [17], [18], [19], [20], the QoS requirements, their evaluation and their optimization are tabulated below.
TABLE II. OVERVIEW OF RELATED APPROACHES
QoS Requirements: QoS-driven service selection for reputation, execution time, availability, price.
QoS Evaluation: Simple aggregation functions.
QoS Optimization: Population Diversity Handling Genetic Algorithm.

QoS Requirements: QoS-driven workflow management for availability.
QoS Evaluation: Simple aggregation functions for sequential, choice, parallel and iterative web service composition.
QoS Optimization: Hill-climbing based selection; redundancy mechanisms.

QoS Requirements: QoSMOS service selection for performance and reliability.
QoS Evaluation: Analytic solving of Markov models.
QoS Optimization: Probabilistic model checking.

IV. MAPPING REQUIREMENTS IN TERMS OF WS
To provide a better quality of service in terms of requirements, we propose a mapping of the phases of requirement engineering to the web service phases. When the user's requirements are stated, the description is prioritized on the basis of a utility function.
A. Problem Recognition
WS functional specifications are not enough to handle the service discovery process. This is because:
1) WS need to be automatically and dynamically discovered and selected at runtime. A mechanism needs to be in place to ensure that this automatic discovery happens and the best services are chosen. This needs specifications in the service profile beyond the mere functional aspects of a WS.
2) With the abundance of WS created by many service providers, often a number of Web Services satisfy the functional requirements of a service request. Methods have evolved to rank and select the best Web Services for a request among a list of candidate Web Services which can provide similar functionality.
Thus we proceed to the non-functional requirements, focusing on the predominant factor, Quality of Service (QoS). One consumer may want a service that offers the fastest response time, while for another, reliability or constant availability could be the criteria, and a third consumer may treat security as the most important parameter. So we classify the requirements as normal, exact and exciting, and further we map them with the partial, exact and super matches of the services discovered. By using mixed integer programming we perform the ranking and filtering of the services based on the requirement classification.
B. Conceptual Overview
When the requirement of the customer is given, the process begins. First the requirements are grouped under categories such as normal, exact and exciting. We discover the services related to the requirements specified. Then we perform the mapping, which is the matchmaking phase of the service, as shown in the figure. Finally, an efficient service is selected based on the QoS in the selection phase.


Figure 3. Mapping concept.

Figure 3 depicts the matching pattern of the requirements with the service matches. Considering a student information system, the basic requirements are registration number, name, department etc. Thus if the user provides the name of a candidate, then the matching is made and the candidates available with the same name are provided. In the case of the web service, based on the requirements given, the services available similar to the requirements are provided. Thus partial matches are obtained, which contain those QoS offers that have at least one worse solution than those of the QoS request. This is mapped with the normal requirements. This usually covers the category of users who are unaware of their needs and thus perform a random search on the web; discovering the correct match for what they expect is tedious for a service discovery engine.

In the second case the customer is aware of the needs and specifies the exact requirements essential for his need. He might specify the name, department and other essential requirements needed for discovering his need. Here the exact match can be found, which contains those QoS offers that have a subset of the solutions expressed by the QoS request. It is obtained by filtering the data based upon the requirements specified. But still there can be some ambiguity for the customer, because a variety of services are provided for a single matching criterion itself, so he has to choose an optimum one among the rest. Some algorithms are available to choose the optimum service.

Usually the ontology-based category of approaches has all the potential to be more accurate. For example, if the customer specifies a unique identification of the candidate (e.g. Reg. No.) along with other requirements, then the super match is found, which contains those QoS offers that have at least one better solution and all others equivalent to those solutions of the QoS request. But it is highly unlikely that a customer can state a unique requirement to choose an appropriate service on the web. So we use a hybrid matchmaking approach to match the expected service extracted from the request.
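Under this (deliberately simplified) reading, an offer is a super match when it meets every requested value and improves on at least one, an exact match when it meets them exactly, and a partial match when at least one value is worse. A minimal sketch, with illustrative metric names and assuming every metric is "higher is better":

def classify_offer(offer, request):
    """Classify a QoS offer against a QoS request.  Both are dicts mapping
    metric names to values; for simplicity every metric is treated as
    'higher is better' (a real matchmaker would normalise directions)."""
    worse  = any(offer[m] < request[m] for m in request)
    better = any(offer[m] > request[m] for m in request)
    if worse:
        return "partial match"          # at least one requested value not met
    return "super match" if better else "exact match"

request = {"availability": 0.95, "reliability": 0.90}
print(classify_offer({"availability": 0.99, "reliability": 0.90}, request))  # super match
print(classify_offer({"availability": 0.95, "reliability": 0.90}, request))  # exact match
print(classify_offer({"availability": 0.80, "reliability": 0.99}, request))  # partial match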
C. Local Selection vs. Global Optimisation
Consider a distributed environment where the candidate web services are managed by distributed service brokers; thus a local selection is to be made when a global requirement is given. We select one service from each group of service candidates independently of the other groups. Using a given utility function, the values of the different QoS criteria are mapped to a single utility value, and the service with the maximum utility value is selected. This approach is very efficient in terms of computation time, as the time complexity of the local optimization approach is O(l), where l is the number of service candidates in each group, but it cannot provide end-to-end quality of service. In a global approach, where a composition of services is made, end-to-end QoS can be achieved; so we combine the two into a hybrid approach to provide end-to-end QoS. Quality levels are extracted and services are classified based on them. This classification is done by using MIP with binary decision variables and objective or constraint functions. Filtering and ranking are made based on the requirements provided.
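A minimal sketch of the local selection step described above: for each group of functionally equivalent candidates, the service with the maximum utility value is picked independently (O(l) per group). The task names, QoS values and the throwaway utility function are illustrative; a SAW-style utility is sketched after the next subsection.

def local_selection(groups, utility):
    """Pick, independently for each group of functionally equivalent
    candidates, the service with the highest utility value."""
    return {task: max(candidates, key=utility) for task, candidates in groups.items()}

groups = {
    "payment":  [{"name": "s1", "availability": 0.99, "price": 4.0},
                 {"name": "s2", "availability": 0.95, "price": 1.0}],
    "shipping": [{"name": "s3", "availability": 0.90, "price": 2.0},
                 {"name": "s4", "availability": 0.97, "price": 2.5}],
}
utility = lambda s: s["availability"] - 0.01 * s["price"]   # toy utility
print({task: best["name"] for task, best in local_selection(groups, utility).items()})
# -> {'payment': 's1', 'shipping': 's4'}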
D. Discovery
Figure 4 depicts the conceptual overview of QoS-aware selection of web services. When the requirements are specified by the user, the discovery engine uses the existing infrastructure, e.g. UDDI, to locate the available web services based upon the requirement description. As a result, a list of candidate services matching the criteria is obtained. The goal of the QoS based selection is to focus on the non-functional properties and to select the optimum service requested by the user. The requirements are decomposed into several local constraints.

Figure 4. Conceptual Overview.

Thus when a request is given, a list of candidates for each requirement class is obtained based on the local constraints, along with the utility function for each constraint. We select one web service for each requirement class such that all the constraints are satisfied and the overall utility is maximized. We get a feasible solution if all the constraints are satisfied, and we get an optimal solution if the overall utility is maximized. Our service candidates are ranked by using the utility function value. We extract quality levels for the requirements given and map them with the local constraints.
As in [6], a mixed-integer programming (MIP) problem is one where some of the decision variables are constrained to have only integer values (i.e. whole numbers such as -1, 0, 1, 2, etc.) at the optimal solution. The use of integer variables greatly expands the scope of useful optimization problems that can be defined and solved. An important special case is a decision variable x1 that is integer with 0 ≤ x1 ≤ 1. This forces x1 to be either 0 or 1 at the solution. Variables like x1, called 0-1 or binary integer variables, can be used to model yes/no decisions, such as whether to build a plant or buy a piece of equipment.
If the consumer has reasonable preferences about consumption in different circumstances, then we use a utility function to describe these preferences. Thus, in order to evaluate the multi-dimensional quality of a given web service, a utility function is used. The function maps the quality vector Qs into a single real value, to enable sorting and ranking of service candidates. In this paper we use a Multiple Attribute Decision Making approach for the utility function, i.e. the Simple Additive Weighting (SAW) technique [1]. The utility computation involves scaling the QoS attribute values to allow a uniform measurement of the multi-dimensional service qualities independent of their units and ranges. The scaling process is then followed by a weighting process for representing user priorities and preferences. In the scaling process each QoS attribute value is transformed into a value between 0 and 1, by comparing it with the minimum and maximum possible values according to the available QoS information of the service candidates. Filtering is performed on the services that violate the constraints. Ranking is made by using the utility values obtained.
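A sketch of the SAW-style utility just described: each QoS attribute is scaled to [0, 1] against the minimum and maximum over the candidate set (inverting the scale for attributes where lower is better, such as response time), and the scaled values are combined with user-supplied weights. The attribute names, weights and values are illustrative.

def saw_utility(candidates, weights, lower_is_better=()):
    """Simple Additive Weighting sketch: min-max scale every QoS attribute
    over the candidate set, invert attributes where smaller is preferable,
    then return the weighted sum for each candidate."""
    scores = []
    for c in candidates:
        score = 0.0
        for attr, w in weights.items():
            values = [x[attr] for x in candidates]
            lo, hi = min(values), max(values)
            scaled = 0.0 if hi == lo else (c[attr] - lo) / (hi - lo)
            if attr in lower_is_better:
                scaled = 1.0 - scaled
            score += w * scaled
        scores.append(score)
    return scores

candidates = [
    {"response_time": 120, "availability": 0.990},
    {"response_time": 300, "availability": 0.999},
    {"response_time": 150, "availability": 0.950},
]
weights = {"response_time": 0.6, "availability": 0.4}
scores = saw_utility(candidates, weights, lower_is_better={"response_time"})
print(scores.index(max(scores)))   # -> 0, the top-ranked (fast, fairly available) service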
E. Selection
The selection process is completely personalized
(preference-oriented) as the ranking is based on user
preferences regarding the QoS criteria that are more
important to him (along with the degree of importance) and
different users may have different preferences. User
preferences can be given separately to the selection
algorithm or they can be obtained from the metric
ontology. Alternatively, they can be collected and derived
from user context (remember our previous discussion).
User preferences should include: weights given to metrics
(either generally or with respect to their domain or group),
preferred domains or QoS groups (along with their weight
if the importance of domains or groups differs),
information about whether the increase of a metric benefits the requester, maximum normalized values of metrics, etc.
Apart from user preferences, other information about the
QoS criteria/metrics must be available like the type and
range set of a metric, the groups or domains the metric
belongs to, what is the ordering of the range set of the
metric, etc. This other type of information is
collected/derived by the description of QoS metrics in
(user or provider-defined or commonly agreed) metric
ontology. When the selection algorithm collects all the
appropriate input, it starts processing available QoS offers
in order to give to each of them a suitable rank/degree
according to user preferences.
F. Framework
We use a tool-based framework for our analysis known as QoSMOS (QoS Management and Optimization of Service-Based Systems). It dynamically adapts to changes of the states in the system to achieve the QoS required by the environment of the system. It adopts a hybrid way of optimizing by combining existing tools, and it transforms the requirements into probabilistic values for optimization. Markov models are employed for the transformation of the requirements. The phases are Monitor, Analyse, Plan and Execute. The performance and workload are monitored. The necessary QoS specified by the user is analysed. Based on this analysis, a plan is made for adaptation of the system. According to the plan, the workflow is replaced to provide an optimized system. As in [7], realization is achieved by existing tools such as KAMI [20],[21],[22], PRISM [23],[24] and GPAC [12], where KAMI is a framework used for runtime modeling of a service-based system, PRISM is a probabilistic model checker, and GPAC is used for self-managing systems.

Figure 5. QoSMOS Framework.
V. CONCLUSION
In this paper, after defining the basic concepts behind QoS and its attributes, we have proposed a matchmaking pattern for selecting the best service among the discovered services based on the requirements given. The QoSMOS framework helps us to provide dynamic optimization and QoS management. Currently we are working on the validation of this concept and on the metrics used for determining the QoS of a provided service.
REFERENCES
[1] K. Kritikos, D. Plexousakis, Requirements for QoS-based
Web Service Description and Discovery, IEEE
Transactions on Services computing, vol. 2, no. 4, pp.320-
337,October 2009
[2] K.Kritikos, QoS-Based Web Service Description and
Discovery, PhD thesis, Computer Science Dept., Univ. of
Crete, Dec. 2008.
[3] Radu Calinescu, Lars Grunske, Marta Kwiatkowska,Raffaela
Mirandola and Giordano Tamburrelli, Dynamic QoS
Management and Optimization in Service-Based Systems,
IEEE Transactions on Software Engineering,vol 37, no.3
May/June 2011.
[4] L. Zeng, B. Benatallah, A.H.H. Ngu, M. Dumas, J.
Kalagnanam, and H. Chang, QoS-Aware Middleware for
Web Services Composition, IEEE Trans. Software Eng.,
vol. 30, no. 5, pp. 311-327, May 2004.
[5] C. Zhang, S. Su, and J. Chen, DiGA: Population Diversity
Handling Genetic Algorithm for QoS-Aware Web Services
Selection, Computer Comm., vol. 30, no. 5, pp. 1082-1090,
May 2007.
[6] Y. Ma and C. Zhang, Quick Convergence of Genetic
Algorithm for QoS-Driven Web Service Selection,
Computer Networks, vol. 52, no. 5, pp. 1093-1104, 2008.
[7] C. Bettini, D. Maggiorini, and D. Riboni, Distributed
Context Monitoring for the Adaptation of Continuous
Services, World Wide Web, vol. 10, no. 4, pp. 503-528,
2007.
[8] M.Z. Kwiatkowska, G. Norman, and D. Parker,
Probabilistic Symbolic Model Checking with PRISM: A
Hybrid Approach, Intl J. Software Tools for Technology
Transfer, vol. 6, no. 2, pp. 128- 142, Aug. 2004.
[9] M.Z. Kwiatkowska, G. Norman, J. Sproston, and F. Wang,
Symbolic Model Checking for Probabilistic Timed
Automata, Information and Computation, vol. 205, no. 7,
pp. 1027-1077, 2007.
[10] Adrian Mello, 2002 Breathing new life into UDDI, Tech
Update, ZDNET.com.
[11] A.Blum, UDDI as an Extended Web Services Registry:
Versioning, quality of service, and more. White paper, SOA
World magazine, Vol. 4(6), 2004.
[12] Mydhili K Nair, V.Gopalakrishna, Look Before You Leap:
A Survey of Web Service Discovery, In Proceedings of the
International Journal of Computer Applications (0975-
8887), vol 7, no. 5, September 2010.
[13] Mohammad Alrifai and Thomas Risse, Combining Global Optimization with Local Selection for Efficient QoS-aware Service Composition, in Proc. WWW '09, the 18th International Conference on World Wide Web, ACM, 2009.
[14] L. Taher, H. El Khatib, R. Basha, A Framework and QoS Matchmaking Algorithm for Dynamic Web Services Selection, in Proceedings of the 2nd International Conference on Innovations in Information Technology (IIT'05), 2005.
[15] Dongyun Liu and Hong Mei, Mapping Requirements to
Software Architecture by Feature-Orientation, in Workshop
on software Requirements to Architectures(STRAW), pp -
69-76, ICSE 2003 : Portland, Oregon, USA.
[16] David and S. Masoud, Mapping Non-functional
Requirements to Cloud applications, In Proceedings of the
2011 International Conference on Software Engineering and
Knowledge Engineering (SEKE 2011), Miami, Florida, in
press.
[17] Holger Lausen and Thomas Haselwanter,2007, Finding
Web Services, In the Proceedings of European Semantic
Technology Conference(ESTC 07)
[18] Daniel Bachlechner et al, 2006, Web Service Discovery
A Reality Check, In the 3rd European Semantic Web
Conference.
[19] Ali ShaikhAli, Rashid Al-Ali et al, 2003, UDDIe: An
Extended Registry for Web Services, IEEE Workshop on
Service Oriented Computing: Models, Architectures and
Applications
[20] Su Myeon Kim and Marcel-Catalin Rosu, 2004, A survey
of public web services, In the Proceedings of Proceedings
of the 13th International Conference on the World Wide
Web.
[21] S. Su, C. Zhang, and J. Chen, An Improved Genetic
Algorithm for Web Services Selection, Proc. Seventh IFIP
WG 6.1 Intl Conf. Distributed Application and
Interoperable Systems, J. Indulska and K. Raymond, eds.,
pp. 284-295, 2007.
[22] H. Guo, J. Huai, H. Li, T. Deng, Y. Li, and Z. Du, ANGEL:
Optimal Configuration for High Available Service
Composition, IEEE Intl Conf. Web Services, pp. 280-287,
2007.
[23] I. Epifani, C. Ghezzi, R. Mirandola, and G. Tamburrelli,
Model Evolution by Runtime Parameter Adaptation, Proc.
31st Intl Conf. Software Eng., pp. 111-121, 2009.
[24] L. Baresi and S. Guinea, Towards Dynamic Monitoring of
WS-BPEL Processes, Proc. Third Intl Conf. Service
Oriented Computing, 2005.

















Proc. of the Intl. Conf. on Computer Applications
Volume 1. Copyright 2012 Techno Forum Group, India.
ISBN: 978-81-920575-8-3 :: doi: 10. 72885/ISBN_0768
ACM #: dber.imera.10. 72885

Review Paper on Software Testing

Ratika Gupta, MCA Department, SRMGPC, Lucknow, UP
Saumya Rawat, MCA Department, SRMGPC, Lucknow, UP
Ruchi Sharma, MCA Department, SRMGPC, Lucknow, UP





Abstract - Software testing is the activity of checking software in order to ensure that it meets all its requirements and is free of errors. It is any activity aimed at evaluating an attribute or capability of a program or system and determining that it meets its required results. Although crucial to software quality and widely deployed by programmers and testers, software testing still remains an art, due to our limited understanding of the principles of software. The difficulty in software testing stems from the complexity of software: we cannot completely test a program of even moderate complexity. Testing is more than what we know as debugging. The purpose of testing ranges from quality assurance, verification and validation, to reliability estimation. Testing can be used as a generic metric as well. Two major areas of testing are correctness testing and reliability testing. Software testing can be seen as a trade-off between budget, time and quality.
Keywords: generic metric, trade-off.

I. INTRODUCTION
Software testing is one of the oldest activities in the history of digital computers. As fourth-generation computers were developed, the importance of testing increased, and so did the time devoted to the testing process. Software testing is said to be the phase of verification and validation of software, and testing is an important means of assessing the quality of software. Generally 40-50% of development time is spent on testing. It is commonly assumed that the testing process starts only after the coding phase, but that is a myth: the actual process of testing starts as early as the requirement-gathering phase, where we check the feasibility of the software requirements that the client has given for development. Software testing is an empirical technical investigation conducted to provide stakeholders with information about the quality of the product or service under test.

II. WHAT IS SOFTWARE TESTING
Testing is the phase of finding the undiscovered errors in the software; its responsibility is not to remove the errors. A primary purpose of testing is to detect software failures so that defects may be discovered and corrected. Testing cannot establish that a product functions properly under all conditions; it can only establish that it does not function properly under specific conditions.
Unlike most physical systems, the defects of software are design errors, not manufacturing defects. Software does not suffer from corrosion or wear-and-tear; generally it does not change until it is upgraded or becomes obsolete. So the design defects or bugs in the software remain buried until activation. Software bugs exist in any software module of moderate size, not because programmers are careless, but because the complexity of software is generally intractable and humans have only a limited ability to manage complexity. Design defects cannot be ruled out for any complex system, and that is a fact.

A. The Goal of Testing
In different publications, the definition of testing varies according to the purpose, process, and level of testing described. Miller gives a good description of testing in [2]:
The general aim of testing is to authenticate the quality of software systems by systematically exercising the software in carefully controlled circumstances and environments.
Miller's description of testing views most software quality assurance activities as testing. He contends that testing should have the major intent of finding errors. A test is said to be good if it has a high probability of finding an undiscovered error, and a successful test is one that uncovers an undiscovered error.
Finding design defects in any software is very difficult, again because of the complexity of the software. Testing boundary values is not sufficient to guarantee correctness, because software and digital systems in general are not continuous. Complete testing is infeasible, as we cannot test all possible input values. Exhaustively testing even a simple program that adds two 32-bit integer inputs (yielding 2^64 distinct test cases) would take hundreds of millions of years, even if tests were performed at a rate of thousands per second. So obviously, for a realistic software module, the complexity can be far beyond the example mentioned here. And if inputs from the real world are involved, the problem gets worse, because timing, unpredictable environmental effects and human interactions are all possible input parameters under consideration.
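The magnitude involved is easy to check with a rough calculation (assuming, for illustration, a rate of one thousand tests per second):

    # Rough estimate of exhaustively testing a two-input 32-bit adder.
    cases = 2 ** 64                          # distinct input combinations
    rate = 1_000                             # tests executed per second (assumed)
    years = cases / rate / (60 * 60 * 24 * 365)
    print(f"{years:.1e} years")              # roughly 5.8e8, i.e. hundreds of millions of years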
The complications increase further due to the dynamic nature of programs. If a failure occurs during preliminary testing, the code can be changed so that the software works for the test case on which it failed and the problem appears corrected, but its behaviour on the pre-error test cases that it passed before can no longer be guaranteed. Testing then has to be restarted, and the expense incurred in this is often prohibitive.
Regardless of all the limitations that testing has, it is still an integral part of software development, and it is carried out at every level of the process, starting as soon as development starts. More than 50% of development time is spent on testing. The purposes of performing testing are as follows:
To improve quality.
For Verification & Validation (V&V)
For reliability estimation
A. To improve quality
There can be huge losses from bugs, as nowadays computers and software are used in critical applications. Bugs in critical systems have caused airplane crashes, allowed space shuttle missions to go awry, halted trading on the stock market, and worse. One of the major software failures is the explosion of the Ariane 5 [5]. Bugs can kill. Bugs can cause disasters. The so-called year 2000 (Y2K) bug gave birth to a cottage industry of consultants and programming tools dedicated to making sure the modern world did not come to a screeching halt on the first day of the new century. In a computerized embedded world, the quality and reliability of software is a matter of life and death.
Quality means conformance to the specified design requirements. Being correct, the minimum requirement of quality, means performing as required under specified circumstances. Debugging, a narrow view of software testing, is performed heavily by the programmer to find design defects. The imperfection of human nature makes it almost impossible to get a moderately complex program correct the first time. Finding these problems and getting them fixed is the purpose of debugging in the programming phase.
B. For Verification & Validation (V&V)
Testing is heavily used as a tool in the Verification & Validation process. Testers can make claims based on interpretations of the testing results: either the product works under certain situations, or it does not work. We can also compare the quality of different products built to the same specification, based on results from the same tests. Quality itself cannot be tested directly; we can only test factors related to quality to make it visible. Quality is tested along three factors:
Functionality
Engineering
Adaptability
These three factors are the dimensions of the software quality space. Each dimension may be broken down into its component factors and considerations at successively lower levels of detail. Some examples are shown in Table 1:
Table 1. Typical Software Quality Factors
Functionality (exterior quality): Correctness, Reliability, Usability
Engineering (interior quality): Efficiency, Testability, Documentation
Adaptability (future quality): Flexibility, Reusability, Maintainability
Importance of these factors varies from application to
application. Any system where human lives are at stake
must place extreme emphasis on reliability and
integrity.
A drawback of testing is that it can only cover specific test cases, and a finite number of test cases cannot validate the working of the software for all possible inputs; yet when the software fails for even one test case, it shows that the software does not work. A dirty test, or negative test, refers to a test aimed at breaking the software, or showing that it does not work. A piece of software must have sufficient exception-handling capability to survive a significant level of dirty tests.
A testable design is a design that can be easily validated, falsified and maintained. Because testing is a rigorous effort and requires significant time and cost, design for testability is also an important design rule for software development.
C. For reliability estimation
Software reliability has important relations with many aspects of software, including its structure and the amount of testing it has been subjected to. Based on an operational profile (an estimate of the relative frequency of use of various inputs to the program), testing can serve as a statistical sampling method to gain failure data for reliability estimation.
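As a simple illustration of this idea (a sketch only, assuming test inputs are drawn according to the operational profile and that a test oracle is available), the failure rate observed during such testing yields a crude reliability estimate:

    # Crude reliability estimate from operational-profile-driven testing (illustrative only).
    import random

    def estimate_reliability(profile, run_test, n=10_000):
        # profile: list of (input_generator, probability); run_test returns True on success.
        gens, probs = zip(*profile)
        failures = 0
        for _ in range(n):
            gen = random.choices(gens, weights=probs, k=1)[0]
            if not run_test(gen()):
                failures += 1
        return 1 - failures / n              # estimated probability of failure-free execution

    # Hypothetical usage: inputs 0-9 occur 80% of the time, 10-99 the remaining 20%.
    profile = [(lambda: random.randint(0, 9), 0.8), (lambda: random.randint(10, 99), 0.2)]
    print(estimate_reliability(profile, run_test=lambda x: x != 7, n=1_000))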
Over the last 20-30 years we have still been using largely the same testing techniques, which is why testing cannot yet be called a science; it is still an art. Software testing can be very costly, and we can never be sure that the specifications are correct. No verification system can verify every correct program, and we can never be certain that a verification system is itself correct either.

III. THE TAXONOMY OF TESTING
TECHNIQUES
Software testing is a very broad area. Testing starts at a very early stage and also involves many other technical and non-technical areas, such as specification, design and implementation, maintenance, and process and management issues in software engineering. There is a plethora of testing methods and techniques, serving multiple purposes in different life-cycle phases. On the basis of purpose, software testing can be divided into correctness testing, performance testing, reliability testing and security testing. On the basis of life-cycle phase, it can be divided into requirements phase testing, design phase testing, program phase testing, evaluating test results, installation phase testing, acceptance testing and maintenance testing. On the basis of scope, it can be divided into unit testing, component testing, integration testing, and system testing.


IV. THE TESTING SPECTRUM AT VARIOUS
LEVELS
Software testing cannot be meaningful for activities
of validation and verification unless there is a
specification for the software. Software can be a single
module or unit of code, or an entire system. Depending
on the size of the development and the development
methods, specification of software can range from a
single document to a complex hierarchy of documents.
A hierarchy of software specifications will typically
contain three or more levels of software specification
documents.
a. The Requirements Specification: It specifies what the
software is required to do and may also specify
constraints on how this may be achieved.
b. The Architectural Design Specification: It describes
the architecture of a design which implements the
requirements. Components included in the software and
the relationship between them will be described in this
document.
c. Detailed Design Specifications: They describe how each component in the software, down to individual units, is to be implemented.
With such a hierarchy of specifications, it is
possible to test software at various stages of the
development, for conformance with each specification.
The testing done at each level of software
development is different in nature and has different
objectives as follows:

Unit Testing: It tests the basic unit of
software, which is the smallest testable piece
of software, and is often called unit,
module, or component interchangeably.

Software Integration Testing: In this
progressively larger groups of tested software
components corresponding to elements of the
architectural design are integrated and tested
until the software works as a whole.

System Testing: System test is often based on the functional/requirement specification of the system. Non-functional quality attributes, such as reliability, security, and maintainability, are also checked.

Acceptance Testing: It is done when the completed system is handed over from the developers to the customers or users. The purpose of acceptance testing is to give confidence that the system is working, rather than to find errors. It often includes a subset of the system tests, witnessed by the customers for the software or system.

Once each level of software specification has been
written, the next step is to design the tests. The tests
should be designed before the software is implemented.
Test results are evaluated once the tests have been
applied within each level of testing. If a problem is
encountered, then either the tests are revised and
applied again, or the software is fixed and the tests
applied again. This is repeated until no problems are
encountered, at which point development can proceed to
the next level of testing.
Testing does not end following the conclusion of
acceptance testing. Software maintenance is also an
important aspect to fix problems which show up
during use and to accommodate new requirements.
Software tests have to be repeated, modified and
extended. The effort to revise and repeat tests
consequently forms a major part of the overall cost of
developing and maintaining software. The term
regression testing is used in this context to refer to the
repetition of earlier successful tests in order to make
sure that changes to the software have not introduced
side effects.
Static vs. Dynamic Testing: Based on whether actual execution of the software under evaluation is needed or not, there are two major categories of quality assurance activities:
Static Testing focuses on the range of methods that
are used to determine or estimate software quality
without reference to actual executions. Techniques
in this area include code inspection, program
analysis, symbolic analysis, and model checking.
Dynamic Analysis deals with specific methods for
finding and/or approximating software quality
through actual executions, i.e., with real data and
under real or simulated circumstances. It involves
executing programmed code with a given set of test
cases. Techniques in this area include synthesis of
inputs, the use of structurally dictated testing
procedures, and the automation of testing
environment generation. Typical techniques for this
are either using stubs/drivers or execution from a
debugger environment.
Functional And Structural Techniques
The information flow of testing is shown in Figure 2. As we can see, the testing process includes the configuration of proper inputs, execution of the software over the input, and the analysis of the output. The Software Configuration consists of requirements specification, design specification, source code, and so on. The Test Configuration includes test cases, test plan and procedures, and testing tools.
A testing technique specifies the strategy used in
testing to select input test cases and analyze test
results. Different quality aspects of a software
system are depicted by different techniques, and there
are two major categories of testing techniques:
functional and structural.
Functional Testing: In this testing technique the software program or system under test is viewed as a black box. It is also termed data-driven, input/output-driven or requirements-based testing. The selection of test cases for functional testing is based on the requirement or design specification of the software entity under test, without any knowledge of the internal implementation. Black-box testing methods include equivalence partitioning, boundary value analysis, fuzz testing, model-based testing and specification-based testing. Functional testing emphasizes the external behavior of the software entity. The tester treats the software under test as a black box such that only the inputs, outputs and specification are visible, and correctness is determined by observing the outputs for the corresponding inputs. During testing, various inputs are exercised and the outputs are compared against the specification to validate their correctness. All test cases are derived from the specification; no implementation details of the code are considered.
Ideally we would be tempted to exhaustively test the input space. But as stated above, it is impossible to do exhaustive testing for most programs, let alone when considering invalid inputs, timing, sequencing, and resource variables. To make things worse, we can never be sure that the specification is correct or complete. Due to limitations of the language used in specifications (usually natural language), ambiguity is often inevitable. Even if we use some type of formal or restricted language, we may still fail to write down all the possible cases in the specification. Sometimes the specification itself becomes an intractable problem: it is not possible to specify precisely every situation that can be encountered using a limited number of words. And people can seldom specify clearly what they want; they usually can tell whether a prototype is, or is not, what they want only after it has been finished. Specification problems contribute approximately 30 percent of all bugs in software.
Research in black-box testing mainly focuses on how to maximize the effectiveness of testing with minimum cost, usually measured by the number of test cases. It is not possible to exhaust the input space, but it is possible to exhaustively test a subset of the input space. Partitioning is one of the common techniques: if we partition the input space and assume all the input values in a partition are equivalent, then we only need to test one representative value in each partition to sufficiently cover the whole input space. Domain testing partitions the input domain into regions and considers the input values in each domain an equivalence class. Domains can then be exhaustively tested and covered by selecting one or more representative values in each domain. Boundary values are of special interest; experience shows that test cases that explore boundary conditions have a higher payoff than test cases that do not. Boundary value analysis requires one or more boundary values to be selected as representative test cases, as illustrated in the sketch below. The major difficulty with domain testing is that incorrect domain definitions in the specification cannot be adequately discovered. Good partitioning can be done with knowledge of the software structure. A sound testing plan will contain not only black-box testing but also white-box approaches, and combinations of the two.
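As a small illustration (the partitions and boundaries below are invented for a hypothetical "valid age is 18 to 60" rule, not taken from the text), test cases can be chosen as one representative per equivalence class plus the values around each boundary:

    # Equivalence classes and boundary values for a hypothetical "valid age is 18..60" rule.
    def is_valid_age(age):
        return 18 <= age <= 60

    partitions = {"below_range": 10, "in_range": 35, "above_range": 75}   # one representative each
    boundaries = [17, 18, 19, 59, 60, 61]                                 # values around both edges

    for name, value in partitions.items():
        print(name, value, is_valid_age(value))
    for value in boundaries:
        print("boundary", value, is_valid_age(value))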
Structural Testing: In this technique the software entity is viewed as a white box. The selection of test cases is based on the implementation of the software entity. The goal of selecting such test cases is to cause the execution of specific spots in the software entity, such as specific statements, program branches or paths. The expected results are evaluated against a set of coverage criteria; examples of coverage criteria include path coverage, branch coverage, and data-flow coverage. Structural testing emphasizes the internal structure of the software entity. There are many techniques available in white-box testing. The intention of exhausting some aspect of the software is still strong in white-box testing, and some degree of exhaustion can be achieved, such as executing each line of code at least once (statement coverage), traversing every branch statement (branch coverage), or covering all possible combinations of true and false condition predicates (multiple condition coverage).
Techniques such as control-flow, loop, and data-flow testing map the corresponding flow structure of the software into a directed graph. Test cases are carefully selected based on the criterion that all the nodes or paths are covered or traversed at least once. This helps to discover unnecessary "dead" code, i.e. code that is of no use or never gets executed at all, which cannot be discovered by normal functional testing.
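For example (an invented function, used only to illustrate the coverage idea), the single test x = -1 already executes every statement of the code below, but branch coverage additionally requires a test in which the condition is false, such as x = 2:

    # Branch coverage illustration on an invented function.
    def classify(x):
        label = "non-negative"
        if x < 0:
            label = "negative"
        return label

    # x = -1 executes every statement; branch coverage also needs the
    # false outcome of the condition, covered by x = 2.
    for t in [-1, 2]:
        print(t, classify(t))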


Figure 2: Testing Information Flow

Grey Box Testing: Grey box testing involves having knowledge of internal data structures and algorithms in order to design the test cases, while testing at the user, or black-box, level. The tester is not required to have full access to the software's source code.

Manipulating input data and formatting output do
not qualify as grey box, because the input and output
are clearly outside of the "black-box" that we are
calling the system under test. This distinction is
particularly important when conducting integration
testing between two modules of code written by two
different developers, where only the interfaces are
exposed for test. However, modifying a data repository
does qualify as grey box, as the user would not
normally be able to change the data outside of the
system under test. Grey box testing may also include
reverse engineering to determine, for instance,
boundary values or error messages.
A grey box tester is permitted to set up his or her testing environment, for instance by seeding a database, and can observe the state of the product being tested after performing certain actions. For instance, the tester may fire an SQL query on the database and then inspect the database to ensure that the expected changes have been reflected. Grey box testing implements intelligent test scenarios based on limited information. This particularly applies to data type handling, exception handling, and so on.
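A minimal sketch of this style of test, using an in-memory SQLite database as a stand-in for the real data repository (the table and the function under test are hypothetical):

    # Grey-box style test: seed a database, exercise the system, then inspect its state.
    import sqlite3

    def deactivate_user(conn, user_id):
        # Hypothetical function under test.
        conn.execute("UPDATE users SET active = 0 WHERE id = ?", (user_id,))
        conn.commit()

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, active INTEGER)")
    conn.execute("INSERT INTO users VALUES (1, 1)")           # seed the database
    deactivate_user(conn, 1)                                   # perform the action
    state = conn.execute("SELECT active FROM users WHERE id = 1").fetchone()[0]
    assert state == 0, "expected the user to be deactivated"   # observe the resulting state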
Alpha and Beta Testing: Alpha: In this phase, developers generally test the software using white box techniques. Additional validation is then performed using black box or grey box techniques by another testing team. Moving to black box testing inside the organization is known as the alpha release. Alpha testing is simulated or actual operational testing by potential users/customers or an independent test team at the developers' site. It is often employed for off-the-shelf software as a form of internal acceptance testing, before the software goes to beta testing.
Beta: Beta (named after the second letter of the
Greek alphabet) is the software development phase
following alpha. It generally begins when the software
is feature complete. The focus of beta testing is
reducing impacts to users, often incorporating usability
testing. The process of delivering a beta version to the
users is called beta release and this is typically the first
time that the software is available outside of the
organization that developed it.
The users of a beta version are called beta testers.
They are usually customers or prospective customers of
the organization that develops the software, willing to
test the software without charge, often receiving the
final software free of charge or for a reduced price.
Non-Functional Testing:
It refers to aspects of the software that may not be related to a specific function or use, such as scalability or other performance characteristics, behavior under certain constraints, or security. Non-functional requirements tend to be those that reflect the quality of the product, particularly from the perspective of its users. Non-functional testing verifies that the software functions properly even when it receives invalid or unexpected inputs. Software fault injection, in the form of fuzzing, is an example of non-functional testing. Non-functional testing, especially for software, is designed to establish whether the device under test can tolerate invalid or unexpected inputs, thereby establishing the robustness of input validation routines as well as error-management routines.
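A toy fuzzing harness of this kind might look as follows; the record parser being fuzzed is a hypothetical example, and the harness simply distinguishes controlled rejections from robustness failures:

    # Toy fuzz harness: feed random byte strings to a hypothetical input-validation routine.
    import random

    def parse_record(data: bytes):
        # Hypothetical routine under test; it should reject malformed input cleanly.
        text = data.decode("utf-8")            # may raise UnicodeDecodeError
        name, age = text.split(",")            # may raise ValueError
        return name.strip(), int(age)          # may raise ValueError

    random.seed(0)
    for _ in range(1_000):
        blob = bytes(random.randrange(256) for _ in range(random.randrange(1, 20)))
        try:
            parse_record(blob)
        except (UnicodeDecodeError, ValueError):
            pass                               # expected, controlled rejection
        except Exception as exc:               # anything else is a robustness failure
            print("robustness failure:", repr(blob), exc)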
Software Performance Testing and Load Testing:
Performance testing is executed to determine how fast a system or sub-system performs under a particular workload. It can also serve to validate and verify other quality attributes of the system, such as scalability, reliability and resource usage.
Load testing is primarily concerned with testing that a system can continue to operate under a specific load, whether that be large quantities of data or a large number of users; this is generally referred to as software scalability. When performed as a non-functional activity, the related load testing activity is often referred to as endurance testing.
Volume testing is a way to test functionality, stress testing is a way to test reliability, and load testing is a way to test performance. There is little agreement on what the specific goals of load testing are, and the terms load testing, performance testing, reliability testing, and volume testing are often used interchangeably.
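A minimal sketch of a load-style measurement is shown below; the operation being timed is a placeholder for a real request to the system under test:

    # Minimal load-test sketch: measure throughput of a placeholder operation.
    import time

    def operation():
        sum(range(10_000))                     # stand-in for a request to the system under test

    requests = 5_000
    start = time.perf_counter()
    for _ in range(requests):
        operation()
    elapsed = time.perf_counter() - start
    print(f"{requests / elapsed:.0f} requests/second over {elapsed:.2f} s")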
Stability Testing:
Stability testing checks whether the software can continuously function well over, or beyond, an acceptable period. This activity of non-functional software testing is often referred to as load (or endurance) testing.
Usability testing
Usability testing is needed to check if the user
interface is easy to use and understand. It is concerned
mainly with the use of the application.
Security testing
Security testing is essential for software that
processes confidential data to prevent system intrusion
by hackers. Software quality, reliability and security are
tightly coupled. Flaws in software can be exploited by
intruders to open security holes. With the development
of the Internet, software security problems are
becoming even more severe. Many critical software
applications and services have integrated security
measures against malicious attacks. The purpose of security testing of these systems includes identifying and removing software flaws that may potentially lead to security violations, and validating the effectiveness of
security measures. Simulated security attacks can be
performed to find vulnerabilities.
Reliability Testing: Software reliability refers to the probability of failure-free operation of a system. It is related to many aspects of software, including the testing process. Directly estimating software reliability by appraising its related factors can be difficult. Testing is an effective sampling method to measure software reliability. Guided by the operational profile, software testing (usually black-box testing) can be used to obtain failure data, and an estimation model can then be used to analyze the data to estimate the present reliability and predict future reliability. Based on the estimation, the developers can decide whether to release the software, and the users can decide whether to adopt and use it. The risk of using the software can also be assessed based on reliability information. Some researchers advocate that the primary goal of testing should be to measure the dependability of tested software.
There is agreement on the inherent meaning of dependable software: it does not fail in unexpected or catastrophic ways. Robustness testing and stress testing are variants of reliability testing based on this simple criterion. By robustness of a software component we mean the degree to which it can function correctly in the presence of exceptional inputs or stressful environmental conditions. In robustness testing the functional correctness of the software is not of concern, in contrast to correctness testing; it only watches for robustness problems such as machine crashes, process hangs or abnormal termination. The oracle is relatively simple, and therefore robustness testing can be made more portable and scalable than correctness testing. This research has drawn more and more interest recently, most of it using commercial operating systems as its target. Stress testing, or load testing, is often used to test the whole system rather than the software alone. In such tests the software or system is exercised at or beyond the specified limits. Typical stresses include resource exhaustion, bursts of activity, and sustained high loads.

V. RECENT SOFTWARE TESTING INNOVATIONS
A. Robustness testing: A software system is robust if it can handle inappropriate inputs: if we provide anomalous inputs to the system or place it in an inappropriate environment, the system still performs in an acceptable manner. This type of testing is directly related to hardware and software fault injection. An example is the FUZZ system, which randomly injects input data into selected operating system kernel and utility programs to facilitate an empirical examination of operating system robustness [Miller et al., 1990].
In that study it was possible to crash between 25% and 33% of the utility programs associated with 4.3 BSD, SunOS 3.2, SunOS 4.0, SCO Unix, AOS Unix, and AIX 1.1 Unix [Miller et al., 1990]. Subsequent studies that incorporated additional operating systems such as AIX 3.2, Solaris 2.3, IRIX 5.1.1.2, NEXTSTEP 3.2 and Slackware Linux 2.1.0 indicate that there was a noticeable improvement in operating system utility robustness during the intervening years between the first and second studies [Miller et al., 1998]. Interestingly, the use of FUZZ on the utilities associated with a GNU/Linux operating system showed that these tools exhibited significantly higher levels of robustness than those of the commercial operating systems [Miller et al., 1998].
Testing Spreadsheets: Software testing has generally focused on the testing and analysis of programs written in procedural or object-oriented programming languages. The form-based visual programming paradigm, of which spreadsheet programs are a noteworthy example, is also an important mode of software development, and recently the testing and analysis of spreadsheet programs has become an area of research. Spreadsheet languages have rarely been studied; this is a serious omission, since research shows that many spreadsheets created with these languages contain faults, and support is needed to help spreadsheet programmers determine their reliability.
Examples: Rothermel et al. and Fisher et al. have
focused on the testing and analysis of programs that are
written in spreadsheet languages [Fisher et al., 2002a,b,
Rothermel et al., 1997, 2000, 2001b]. Indeed, Frederick
Brooks makes the following observation about
spreadsheets and databases: These powerful tools, so
obvious in retrospect and yet so late in appearing, lend
themselves to a myriad of uses, some quite unorthodox
[Jr., 1995].
Database-Driven Application Testing: Even simple software applications often have complicated and ever-changing operating environments, which increases the number of interfaces, and the interactions across these interfaces need to be tested. Device drivers, operating systems, and databases are all aspects of a software system's environment that are often ignored during testing [Whittaker, 2000, Whittaker and Voas, 2000]. Relatively little research has specifically focused on the testing and analysis of applications that interact with databases. Recent software testing techniques can test database-driven applications, written in a general purpose programming language such as Java, C, or C++, that include embedded structured query language (SQL) statements designed to interact with a relational database [Chan and Cheung, 1999a,b]. In this approach, the embedded SQL statements within a database-driven application are transformed into general purpose programming language constructs: the authors provide C code segments that describe the selection, projection, union, difference, and Cartesian product operators that form the relational algebra and closely mirror the structured query language. Once the embedded SQL statements have been transformed into general purpose programming language constructs, it becomes possible to apply traditional test adequacy criteria to applications that interact with one or more relational databases.
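The flavour of that transformation can be illustrated in miniature (here in Python rather than C, over invented in-memory relations) by expressing relational operators as ordinary language constructs:

    # Miniature relational operators over in-memory rows, illustrating how embedded SQL
    # can be mirrored by general-purpose language constructs (invented example data).
    def selection(rows, predicate):
        return [r for r in rows if predicate(r)]

    def projection(rows, columns):
        return [{c: r[c] for c in columns} for r in rows]

    def union(a, b):
        return a + [r for r in b if r not in a]

    employees = [{"id": 1, "dept": "QA"}, {"id": 2, "dept": "Dev"}]
    # Equivalent of: SELECT id FROM employees WHERE dept = 'QA'
    print(projection(selection(employees, lambda r: r["dept"] == "QA"), ["id"]))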


VI. WHY TESTING IS SO ESSENTIAL
To describe the strategies for system test cases.
To find external, incorrect behaviour of a program by testing and analysis of the program.
To identify the part or collection of program source statements that cause a program failure.
To find mistakes made by a programmer during the implementation of a software system.
To analyze the program under test, with the test adequacy criterion, and produce a list of tests that must be provided in order to create a completely adequate test suite.
To create test cases for the program under test, manually or automatically; automatic test case generation is an attempt to satisfy the constraints imposed by the selected test adequacy criteria.
To measure the quality of an existing test suite for a specific test adequacy criterion.

VII. CONCLUSION
Software testing is a very broad area in software engineering. There are many test cases and test suites to check that the system works properly with the inputs injected into it within the environment in which it is used. There are also various levels and phases of testing that ensure the output the system provides is the same as the desired output, and this is done by finding the errors in the program. With testing techniques we check the system's reliability, robustness, usability, security, functionality and so on. Testing is more than just debugging: it is not only used to find bugs and debug them, but also in validation, verification, and reliability measurement. Testing is very expensive, and automation is a good way to cut down cost and time. Testing efficiency and effectiveness are the criteria for coverage-based testing techniques. Complete testing is infeasible; we cannot test software for all possible test cases, and complexity is the root cause. At some point software testing has to be stopped and the product has to be shipped. The stopping time can be decided by the trade-off of time and budget, or by whether the reliability estimate of the software product meets the requirement. Testing may not be the most effective method to improve software quality; alternative methods, such as inspection and clean-room engineering, may be even better.
In today's scenario, according to our findings, performance testing and reliability testing are the most important kinds of testing, since most systems these days are real-time systems in which the main focus is on the performance of the software. In a real-time system, if the system produces its results only after the specified time, those results are of no use, and this is considered a failure.

REFERENCES
[1] http://www.kaner.com/pdfs/ETatQAI.pdf (page 11)
[2] http://en.wikipedia.org/wiki/Software_release_life_cycle
[3] http://en.wikipedia.org/wiki/Software_testing
[4] http://www.softwaretestinghelp.com/types-of-software-testing/
[5] Ian Sommerville, Software Engineering, 7th Edition, 2004, Chapter 23.
[6] Gregory M. Kapfhammer, Department of Computer Science, Allegheny College, gkapfham@allegheny.edu; Edsger W. Dijkstra [Dijkstra, 1968].


Proc. of the Intl. Conf. on Computer Applications
Volume 1. Copyright 2012 Techno Forum Group, India.
ISBN: 978-81-920575-8-3 :: doi: 10. 72892/ISBN_0768
ACM #: dber.imera.10. 72892

Empirical Validation of Quality Characteristics of Academic Websites

Ritu Shrivastava
Department of Computer Science and Engineering
Sagar Institute of Research Technology & Science
Bhopal 462041, India









Abstract - Web-sites are domain intensive, and some important categories are social, cultural, entertainment, e-commerce, e-government, museum, tourism, and academic sites. It is obvious that the domains of Web-sites differ significantly, and hence a common yardstick cannot be applied to measure the quality of all Web-sites. Signore, Offutt, Loranca and Espinosa, Olsina and Rossi, Tripathi and Kumar, and others have tried to define quality characteristics that are domain specific. Attempts have also been made to empirically validate these quality characteristic models. In this work, the authors describe the evaluation design and the implementation of a measurement criterion for every measurable attribute of Web-sites. The method of weighted averages is used to evaluate the usability and functionality of Web-sites of academic institutes. The evaluation results are shown in graphical form.
Keywords: Web-site Quality, Academic domain,
Hierarchical model, Attributes, Metrics
I. INTRODUCTION
The World Wide Web (WWW) is growing at a rapid pace as many new Web-sites are uploaded every day. Often the quality of Web-sites is unsatisfactory and basic Web principles like interoperability and accessibility are ignored [9, 13, 14]. The main reasons for this lack of quality are the unavailability of staff trained in Web technologies/engineering and the orientation of the Web towards a more complex XML-based architecture [9, 13, 14].
Web-sites can be categorized as social, cultural, e-commerce, e-government, museum, tourism, entertainment, and academic intensive. It is obvious that the domains of Web-sites differ significantly, and hence a common yardstick cannot be applied to measure the quality of all Web-sites. Loranca et al. [8] and Olsina et al. [10] have identified attributes, sub-attributes, and metrics for e-commerce based Web-sites. Olsina et al. [11] have also specified metrics for Web-sites of museums. Tripathi and Kumar [16] have specified quality characteristics for e-commerce based Web-sites of Indian origin from the user's point of view. Recently, Shrivastava, Rana and Kumar [12] have specified characteristics, sub-characteristics and metrics to measure the external quality of academic Web-sites from the user's point of view.
The aim of this research is to evolve a methodology that can be applied to measure the external quality of academic institution Web-sites. We consider the Web-site quality measurement process from the point of view of the user (that is, external quality) only.

II. LITERATURE SURVEY
Some widely used software quality models were proposed by Boehm, Brown and Lipow [1], and McCall and Covano [2]. International bodies such as ISO and CEN (European) are trying to consolidate the definition of quality, starting from the awareness that quality is an attribute that changes with the developer's perspective and action context [4, 6, 7]. The ISO/IEC 9126 model [7] defines three views of quality: the user's view, the developer's view, and the manager's view. Users are interested in external quality attributes, i.e., usability and functionality, while developers are interested in internal quality attributes such as maintainability, portability, etc. This model is hierarchical and contains six major quality attributes, each very broad in nature. They are subdivided into 27 sub-attributes that contribute to external quality and 21 sub-attributes that contribute to internal quality.
Offutt [9] was the first to talk about quality characteristics of Web-sites. Signore [13, 14] discussed the poor quality of Web-sites and ways to improve them. Olsina et al. [10, 11] have proposed hierarchical models of attributes, sub-attributes and metrics for assessing the quality of Web-sites of the museum and e-commerce domains; they have also developed a technique called WebQEM to measure the quality of these sites [10]. Tripathi and Kumar [16] have identified attributes, sub-attributes and metrics for Indian-origin e-commerce Web-sites and have validated the proposed quality characteristics model both theoretically and empirically (see [15]). Recently, Shrivastava, Rana and Kumar [12] have proposed and theoretically validated a hierarchical model of attributes, sub-attributes and metrics for evaluating the quality of Web-sites of the academic domain. In this research, we propose a methodology that can be applied to measure the external quality of Web-sites; it is described in the next section.

III. EVALUATING EXTERNAL QUALITY
Our methodology suggests that the evaluator should identify user needs (expectations) from Web-sites, along with the common practice of describing quality characteristics as defined in the works of Offutt [9], Signore [13, 14], the ISO/IEC 9126-1 standard [7] and IEEE Std. 1061 [6]. The identified attributes and sub-attributes should be expressed in terms of lower-abstraction attributes (metrics) that are directly measurable.
Our quality evaluation process consists of the following three phases:
1. Quality Requirement Definition and Specification:
Here, evaluators select a quality model, say ISO 9126-1, which specifies general quality characteristics of software products. Depending upon the evaluation goal (internal or external), they select appropriate attributes (characteristics) from the quality model [6, 7, 9], together with user expectations (viewpoint) translated into characteristics, sub-characteristics and metrics. The selected characteristics, sub-characteristics and metrics are translated into a quality requirement tree. In our case, we prepared the quality requirement tree using this principle and validated it in the paper [12]. This tree is reproduced for reference in Fig. 1.
2. Elementary Evaluation Design and Implementation of Measurement Criterion:
Elementary evaluation consists of evaluation design and implementation. Here, each measurable attribute A_i of the quality requirement tree is associated with a variable X_i which can take a real value in the closed interval [0, 1] for that attribute (metric). It should be noted that the measured metric value does not by itself represent the elementary requirement satisfaction level, so it becomes necessary to define an elementary criterion function that yields an elementary indicator or satisfaction level. For example, saying that the total number of missing links on a Web-site is 151 does not, by itself, give a clear picture about missing links, nor does it help in comparing two Web-sites. In such cases we define an indirect metric X = (number of invalid links) / (total number of links on the Web-site).
We can now define the elementary criterion function (or elementary quality preference EP) as

EP = 1 (full satisfaction), if X = 0
EP = (X_max - X) / X_max, if 0 < X < X_max
EP = 0 (no satisfaction), if X >= X_max

where X_max is some agreed threshold value for invalid links.
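For instance, the invalid-links criterion above translates directly into code; in the sketch below the threshold X_max = 0.1 is an assumed example value, not one prescribed by the methodology:

    # Elementary preference for the invalid-links metric (threshold is an assumed example).
    def elementary_preference(invalid_links, total_links, x_max=0.1):
        x = invalid_links / total_links          # indirect metric X
        if x == 0:
            return 1.0                           # full satisfaction
        if x >= x_max:
            return 0.0                           # no satisfaction
        return (x_max - x) / x_max

    print(elementary_preference(5, 200))         # X = 0.025 -> EP = 0.75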
We have further created templates for each measurable attribute of the quality tree of Fig. 1. Each template provides the full definition and measurement criterion for the attribute and its elementary quality preference (EP). A sample template is given in Table 1.
3. Global Evaluation Design and Implementation - Combining all Measurements to Rank Web-sites:
Here, we select an aggregation criterion and a scoring model to rank Web-sites globally. This makes our evaluation model structured, accurate, and easy to apply. For aggregation, we can use either a linear additive model [5] or a non-linear multi-criteria scoring model [3]. Both use weights to capture the relative importance of the measurable attributes (metrics) in the quality tree. In the case of the additive model, the aggregation and the partial/global preferences (P/GP) can be calculated using the formula




P/GP = \sum_{i=1}^{m} W_i \, EP_i        (1)

where W_i are the weights and EP_i are the elementary preferences in the closed interval [0, 1] (or expressed as percentage values). For any EP_i, 0 <= EP_i <= 1 (or 0 <= EP_i <= 100 in percentage). Further, \sum_{i=1}^{m} W_i = 1 and W_i > 0 for each i, i = 1, 2, ..., m.

It should be noted that the basic arithmetic aggregation operator for inputs in equation (1) is the plus (+) connector, so equation (1) cannot be used to model input simultaneity. The non-linear multi-criteria scoring model is used to represent input simultaneity, replaceability, etc. This is a generalized additive model, called the Logic Scoring of Preferences (LSP) model (see [3]), and is expressed as

P/GP = \left( \sum_{i=1}^{m} W_i \, EP_i^{r} \right)^{1/r}, \quad i = 1, 2, \ldots, m        (2)

where r is a real number and the weights W_i satisfy the same conditions as in equation (1). The parameter r is selected to achieve the desired logical relationship and polarization intensity. Equation (2) is additive when r = 1, which models neutrality relationships; it models input replaceability or disjunction when r > 1, and input conjunction or simultaneity when r < 1.
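Both aggregation schemes are easy to express in code; the weights and elementary preferences in the sketch below are invented values used only for illustration:

    # Additive (equation 1) and LSP-style (equation 2) aggregation of elementary preferences.
    def additive(weights, eps):
        return sum(w * ep for w, ep in zip(weights, eps))

    def lsp(weights, eps, r):
        # r = 1 reduces to the additive model; r > 1 models replaceability, r < 1 simultaneity.
        return sum(w * ep ** r for w, ep in zip(weights, eps)) ** (1 / r)

    weights = [0.6, 0.4]                 # must sum to 1
    eps = [0.75, 1.0]                    # elementary preferences in [0, 1]
    print(additive(weights, eps))        # 0.85
    print(lsp(weights, eps, r=0.5))      # conjunctive-leaning score, slightly below 0.85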
IV. APPLYING THE METHODOLOGY
Following the guidelines given in Section 3 and the hierarchical tree of quality characteristics (Fig. 1), we have measured the metric values of the Web-sites of four academic institutions, viz., I.I.T. Delhi, M.A.N.I.T. Bhopal, B.I.T.S. Pilani, and C.B.I.T. Hyderabad. During the evaluation process we defined, for each quantifiable attribute, the basis for the elementary evaluation criterion so that measurement becomes unambiguous. For this, we created templates as shown in Table 1 for each characteristic of the hierarchical tree of Fig. 1 and measured each metric (measurements were taken between 1st and 15th April 2011). It was not possible to measure all values either manually or automatically alone, and therefore both methods were used. The measured values of some metrics are given in Table 2. We have used the additive model (equation (1)) to calculate the usability and functionality of the sites. The values are shown in Fig. 2 & 3.


V. CONCLUSION
The paper describes a simple scientific method for measuring the external quality of Web-sites. It emphasizes that user needs, evaluation goals and international guidelines for quality measurement should be the guiding force in deciding the characteristics, sub-characteristics, and metrics to be used for measuring quality. The elementary evaluation design of Section 3 is applied to measure the metric values of Fig. 1, and the measured values are given in Table 2. The global usability and functionality of the sites, calculated using equation (1), are given in Fig. 2 & 3. The work on partial and global evaluation using the generalized model (equation (2)) and LSP is in progress and will be reported soon.
It is suggested that the method can be used to assess the usability and functionality of any Web-site before it is hosted on the Web.

REFERENCES
[1] Boehm B., Brown J., Lipow M., Quantitative evaluation of software quality, Proc. Intern. Conference on Software Engineering, IEEE Computer Society Press, pp. 592-605, 1976.
[2] Covano J., McCall J., A framework for measurement of software quality, Proc. ACM Software Quality Assurance Workshop, pp. 133-139, 1978.
[3] Dujmovic J. J., A Method for Evaluation and Selection of Complex Hardware and Software Systems, Proc. 2nd Intern. Conf. on Resource Management and Performance Evaluation of Computer Systems, Vol. 1, Computer Measurement Group, Turnersville, N.J., pp. 368-378, 1996.
[4] Fenton N. E. and Pfleeger S. L., Software Metrics: A Rigorous Approach, 2nd Edition, PWS Publishing Company, 1997.
[5] Gilb T., Software Metrics, Chartwell-Bratt, Cambridge, Mass., 1976.
[6] IEEE Std. 1061, IEEE Standard for Software Quality Metrics Methodology, 1992.
[7] ISO/IEC 9126-1: Software Engineering - Product Quality - Part 1: Quality Model (2000): http://www.usabilitynet.org/tools/international.html#9126-1.
[8] Loranca M. B., Espinosa J. E., et al., Study for classification of quality attributes in Argentinean e-commerce sites, Proc. 16th IEEE Intern. Conf. on Electronics, Communication & Computers, 2006.
[9] Offutt J., Quality attributes of Web software applications, IEEE Software, March/April, pp. 25-32, 2002.
[10] Olsina L. and Rossi G., Measuring Web application quality with WebQEM, IEEE Multimedia, pp. 20-29, Oct-Dec 2002.
[11] Olsina L., Website quality evaluation method: A case study of Museums, 2nd Workshop on Software Engineering over the Internet, ICSE 1999.
[12] Shrivastava R., Rana J. L. and Kumar M., Specifying and Validating Quality Characteristics for Academic Web-sites of Indian Origin, Intern. Journ. of Computer Sc. and Information Security, Vol. 8, No. 4, 2010.
[13] Signore O., Towards a quality model for Web-sites, CMG Poland Annual Conference, Warsaw, 9-10 May 2005, http://www.w3c.it/papers/cmg2005Poland-quality.pdf.
[14] Signore O., et al., Web accessibility principles, International Context and Italian Regulations, EuroCMG, Vienna, 19-21 Sept. 2004, http://www.w3c.it/paperseurocmg2004.pdf.
[15] Tripathi P., Kumar M. and Shrivastava N., Ranking of Indian E-commerce Web-applications by measuring quality factors, Proc. of 9th ACIS Intern. Conf. on Software Engineering, AI, Networking and Parallel/Distributed Computing, Hilton Phuket, Thailand (Proc. published by IEEE Comp. Soc.), Aug 6-8, 2008.
[16] Tripathi P., Kumar M., Some observations on quality models for Web-applications, Proc. of Intern. Conf. on Web Engineering and Applications, Bhubaneshwar, Orissa, India, 23-24 Dec 2006 (Proc. published by Macmillan, 2006).





























Table 2. Metric Measured Values

Attribute  Weight  Global Weight  IIT, Delhi  MANIT, Bhopal  BITS, Pilani  CBIT, Hyderabad
1.1.1      0.2     0.6            100         100            100           100
1.1.2      0.2                    100         100            100           100
1.1.3      0.2                    0           0              0             0
1.1.4      0.2                    100         0              80            100
1.1.5      0.2                    80          0              100           0
1.2.1      0.2     0.4            100         0              100           0
1.2.3      0.2                    100         0              0             0
1.2.4      0.2                    100         80             0             0
1.2.5      0.2                    100         60             0             0
1.2.7      0.2                    100         0              0             0
2.1.1      0.3     0.25           100         80             100           0
2.1.2      0.35                   100         100            100           100
2.1.3      0.35                   100         100            100           100
2.2.1      0.2     0.25           100         0              0             100
2.2.2      0.2                    100         0              0             100
2.2.3      0.2                    90          80             70            70
2.2.4      0.2                    100         100            100           100
2.2.5      0.2                    0           0              0             0
2.3.1.1    0.35    0.2            100         100            100           100
2.3.1.3    0.35                   100         60             100           40
2.3.1.4    0.3                    100         0              0             0
2.3.3.1    0.25    0.3            100         100            100           100
2.3.3.2    0.25                   100         100            100           100
2.3.3.3    0.25                   100         0              0             0
2.3.3.4    0.25                   100         100            100           100

(The Global Weight column value applies to the whole attribute group that follows it.)



1 Usability
1.1. Global Site understandability
1.1.1 Site Map(location map)
1.1.2 Table of Content
1.1.3 Alphabetical Index
1.1.4 Campus Image Map
1.1.5 Guided Tour
1.2. On-line Feedback and Help Features
1.2.1 Student Oriented Help
1.2.2 Search Help
1.2.3 Web-site last Update Indicator
1.2.4 E-mail Directory
1.2.5 Phone Directory
1.2.6 FAQ
1.2.7 On-line Feedback in form of
Questionnaire
1.3. Interface and Aesthetic Features
1.3.1 Link Color Style Uniformity
1.3.2 Global Style Uniformity
1.3.3 What is New Feature
1.3.4 Grouping of Main Control Objects

2 Functionality
2.1. Search Mechanism
2.1.1 People Search
2.1.2 Course Search
2.1.3 Academic Department Search
2.1.4 Global Search
2.2. Navigation and Browsing
2.2.1 Path Indicator
2.2.2 Current Position Indicator
2.2.3 Average Links Per Page
2.2.4 Vertical Scrolling
2.2.5 Horizontal Scrolling

2.3. Student-Oriented Features
2.3.1 Academic Infrastructure Information
2.3.1.1 Library Information
2.3.1.2 Laboratory Information
2.3.1.3 Research Facility Information
2.3.1.4 Central Computing Facility Information
2.3.2 Student Service Information
2.3.2.1 Hostel Facility Information
2.3.2.2 Sport Facilities
2.3.2.3 Canteen Facility Information
2.3.2.4 Scholarship Information
2.3.2.5 Doctor/Medical Facility Information
2.3.3 Academic Information
2.3.3.1 Courses Offered Information
2.3.3.2 Academic Unit (Department) Information
2.3.3.3 Academic Unit Site Map
2.3.3.4 Syllabus Information
2.3.3.5 Syllabus Search
2.3.4 Enrollment Information
2.3.4.1 Notification uploaded
2.3.4.2 Form Fill/Download
2.3.5 Online Services
2.3.5.1 Grade/ Result Information
2.3.5.2 Fee dues/Deposit Information
2.3.5.3 News Group Services

3 Reliability
3.1. Link and Other Errors
3.1.1 Dangling Links
3.1.2 Invalid Links
3.1.3 Unimplemented Links
3.1.4 Browser Difference Error
3.1.5 Unexpected Under Construction Pages
4 Efficiency
4.1. Performance
4.1.1 Matching of Link Title and Page Information
4.1.2 Support for Text-only Version
4.1.3 Global Readability
4.1.4 Multilingual Support


Fig. 1 Quality Characteristics Hierarchical Tree for Academic Institute Web-sites

Table 1 A Sample Template for Measuring Functionality

Template                          Illustrative Example
Title (code)                      Functionality (2)
Type                              Characteristic
Sub-characteristic (code)         Search Mechanism (2.1)
Definition and comments           The capability of the Web-site to maintain a specified level of search mechanism
Subtitle (code)                   Academic Department Search (2.1.3)
Type                              Attribute
Definition and comments           Represents the facility to search for any department in the institute
Metric criterion                  To find out whether such a search mechanism exists on the Web-site
Data collection                   Whether data is gathered manually or automatically through tools (here: manually)
Elementary Preference Function    EP = 1 if the search mechanism exists; EP = 0 if it does not
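
The weights in Table 2 and the elementary preferences (EP) defined in templates such as Table 1 combine into higher-level scores. The minimal Python sketch below illustrates one way this combination can be computed, assuming simple additive weighting within a sub-characteristic; the LSP-based aggregation referred to in the paper may use a different (logic-based) operator, so the formula here is illustrative only, with values taken from the IIT Delhi column of Table 2.

```python
# Illustrative aggregation of elementary preferences (EP values from Table 2)
# using attribute and group weights. Simple additive weighting is assumed here;
# the LSP method referred to in the paper may aggregate differently.

# Attribute scores for sub-characteristic 1.1 "Global Site Understandability"
# (IIT Delhi column of Table 2) and their local weights.
attribute_scores = {"1.1.1": 100, "1.1.2": 100, "1.1.3": 0, "1.1.4": 100, "1.1.5": 80}
attribute_weights = {"1.1.1": 0.2, "1.1.2": 0.2, "1.1.3": 0.2, "1.1.4": 0.2, "1.1.5": 0.2}
group_weight = 0.6  # global weight of sub-characteristic 1.1 in Table 2

def weighted_score(scores, weights):
    """Combine attribute scores additively using their local weights."""
    return sum(scores[a] * weights[a] for a in scores)

sub_score = weighted_score(attribute_scores, attribute_weights)  # 76.0
contribution = group_weight * sub_score                          # 45.6 towards Usability

print(f"1.1 sub-characteristic score: {sub_score:.1f}")
print(f"weighted contribution to Usability: {contribution:.1f}")
```
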
Proc. of the Intl. Conf. on Computer Applications
Volume 1. Copyright 2012 Techno Forum Group, India.
ISBN: 978-81-920575-8-3 :: doi: 10.72899/ISBN_0768
ACM #: dber.imera.10.72899

Virtual Integrated Test Team for Telecommunication Ventures
Manju Geogy, Dept. of MCA, RV College of Engineering, Bangalore, India
Dr. K.A. Sumithra Devi, Director, RV College of Engineering, Bangalore, India


Abstract: Software is used, and its complexity felt, in every walk of life, from satellites and submarines to e-commerce websites, medicine and research. Software testing professionals therefore need to keep pace with an ever-changing software world and adopt innovative approaches, such as crowd testing, pairwise testing and virtualisation testing, that ease the effort and keep the focus on high-quality deliverables. Testing raises many problems; among the major ones are deciding when to stop testing, achieving test coverage, testers not getting sufficient time to test, developers testing their own code, and the skill set of the testers. This paper surveys the existing types of software testing methods and brings out the best option for system testing of a telecommunication project.
Keywords-virtual team, crowd testing, virtualisation testing
and pairwise testing.
I. INTRODUCTION
Software testing is an investigation conducted to provide
stakeholders with information about the quality of the
product or service under test [1]. Software testing can also
provide an objective, independent view of the software to
allow the business to appreciate and understand the risks of
software implementation. Test techniques include, but are
not limited to, the process of executing a program or
application with the intent of finding errors.
The mission of the test team is not merely to perform
testing, but to help minimize the risk of product failure. [2]
Testers look for manifest problems in the product, potential
problems, and the absence of problems. They explore,
assess, track, and report product quality, so that others in the
project can make informed decisions about product
development.
Several trends affect software testing: the globalization of software and systems development, which pushes for lower cost and for automation of testing in the early test phases; the commoditization of information technology and high technology, with users demanding cheaper rates for high-quality products; the growing risk posed by compliance and regulation laws; and the need for education and certification to keep up with the latest technologies [3].
In today's world, the telecommunication infrastructure of a country is one of the most important factors affecting that country's development [4]. Hence the focus on telecommunication projects and on the testing needed to deliver good quality.
II. TELECOMMUNICATION PROJECT SCENARIOS
Telecommunication, simply put, is the transmission of information over significant distances in order to communicate: the process of sending, propagating and receiving an analogue or digital information signal over a physical point-to-point or point-to-multipoint transmission medium, whether wired, optical fibre or wireless [5].
Operations support systems (OSS) are the computer systems used by telecommunications service providers to deal with the telecom network itself, supporting processes such as maintaining network inventory, provisioning services, configuring network components and managing faults. Business support systems (BSS) deal with customers, supporting processes such as taking orders, processing bills and collecting payments.
OSS projects therefore involve mostly network and configuration testing, while BSS projects involve testing of new features and enhancements to the supporting processes.
III. NEED FOR TESTING IN THESE PROJECTS
As technology advances, the complexity of software applications increases and testing becomes more crucial. With competition and time-to-market pressure being particularly high in the telecommunication field, the need of the hour is for testing techniques that save testing time, reduce cost and still deliver good quality on schedule. Telecommunication projects typically continue as maintenance projects, so subject matter experts are necessary, and most projects need regression testing to confirm that changes have not broken the existing system, since any breakage directly impacts the business. Because the telecommunication domain is vast and service providers compete fiercely, every newly added feature must be tested to assure that it works as expected; otherwise customers switch to other providers. Testing is therefore key for all telecommunication projects.

IV. TESTING TYPES FOR TELECOMMUNICATION PROJECTS
Crowd testing: Here, rather than relying on a dedicated team of testers (in-house or outsourced), companies rely on virtual test teams, created on demand, to get complete test coverage and reduce the time to market for their applications. The company defines its test requirements in terms of scenarios, environments and the type of testing (functional, performance, etc.) [6]. A crowd-test vendor identifies a pool of testers that meet the requirements, creates a project and assigns work. Testers check the application, report bugs and communicate with the company via an online portal. Crowd-testing vendors also provide other tools, such as powerful reporting engines and test optimisation utilities. The main advantage is a reduction in test cycle time.
Virtualisation testing: Here a computer is divided into multiple execution environments that provide isolated sandboxes for running applications, enabling more testing with existing resources and centralized configuration management [7]. It requires more management and close monitoring, as teams can be located in different parts of the globe, and it requires more tooling. As an offshoot, virtualisation ensures that test labs reduce their energy footprint, resulting in a positive environmental impact as well as significant savings.
All-pairs testing, or pairwise testing, is a combinatorial software testing method that, for each pair of input parameters to a system, tests all possible discrete combinations of those parameters [8]. It is considered a reasonable cost-benefit compromise between exhaustive combinatorial testing and less thorough methods that fail to exercise all possible pairs of parameters. Pairwise testing is typically used together with other quality assurance techniques such as unit testing, symbolic execution, fuzz testing and code review.
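
To make the idea concrete, the following minimal Python sketch compares an exhaustive test matrix with a hand-picked pairwise suite for three hypothetical parameters of a telecom ordering flow (the parameter names, values and test cases are illustrative assumptions, not taken from any particular system) and verifies that the smaller suite still covers every pair of parameter values.

```python
from itertools import combinations, product

# Hypothetical parameters for a telecom ordering scenario (illustrative only).
params = {
    "plan":    ["prepaid", "postpaid"],
    "channel": ["web", "retail", "app"],
    "payment": ["card", "cash"],
}

exhaustive = list(product(*params.values()))  # 2 x 3 x 2 = 12 test cases

# A smaller candidate suite; its pairwise coverage is verified below.
pairwise_suite = [
    ("prepaid",  "web",    "card"),
    ("prepaid",  "retail", "cash"),
    ("prepaid",  "app",    "cash"),
    ("postpaid", "web",    "cash"),
    ("postpaid", "retail", "card"),
    ("postpaid", "app",    "card"),
]

def covered_pairs(cases, names):
    """Return the set of (param_i, value_i, param_j, value_j) pairs a suite covers."""
    pairs = set()
    for case in cases:
        for i, j in combinations(range(len(names)), 2):
            pairs.add((names[i], case[i], names[j], case[j]))
    return pairs

names = list(params)
required = covered_pairs(exhaustive, names)
achieved = covered_pairs(pairwise_suite, names)

print(f"exhaustive cases: {len(exhaustive)}, pairwise suite: {len(pairwise_suite)}")
print("all parameter-value pairs covered:", required == achieved)  # True
```

The six pairwise cases exercise all sixteen parameter-value pairs that the twelve exhaustive cases exercise, which is exactly the cost-benefit compromise described above.
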
V. PROPOSED MODEL
Keeping with the new trends, and based on the need for testing suited specifically to telecommunication system software, this paper suggests the following concept. Any telecom project needs a few key systems/applications: an ordering system, a provisioning system, a billing system and reporting. Independent teams are built for these systems, and a virtual team is then set up to work specifically on a single project.


Fig 1: Virtual team setup for a specific telecom project

The virtual team consists of a tester representing each system/application and a project Test Manager for the project under test. Each tester reports to the project Test Manager only on the progress and details of that specific project; for every other need he contacts his own system/application Test Manager. In Fig. 1 the team is drawn in dotted lines to show that it is a virtual team set up for the specific project. The project Test Manager leads this virtual team until the end of the project. With centralized tools, the Test Manager can track the progress of the project being executed by the virtual team. Once project execution is completed and all deliverables are reviewed and closed, the team is dissolved back into the respective systems.
Here we extend the concept of crowd testing with a difference: the project Test Manager is a technical test manager. The Test Manager understands the whole project, decides what different types of testing are required, and is responsible only for the deliverables of that project. Each system/application writes its own test cases and gets them reviewed with the Test Manager along with the other systems/applications; this is key for the integration testing.
Each application can be located in the same or different parts of the globe, and each has its own system environment for system testing. Timely status and bug reports are updated in online tools so that all the applications/systems can review the status along with the project Test Manager.
Once all systems/applications complete their system testing, they all take on the system integration testing, executed on an environment similar to the user environment. Here the project Test Manager needs to coordinate with all the systems/applications to assure that the correct set of test cases is executed on time, so that data flows smoothly to the downstream applications and they can continue with their testing. This incorporates pairwise testing scenarios in the integration testing between adjacent upstream and downstream applications.
[Fig. 1 blocks: Project Test Manager; Ordering and reporting system; Provisioning system; Billing system]
In this model, timely coordination and communication with the team are very important.
Advantages: teams can grow in their fields of interest and the company gains excellent subject matter experts in the respective systems cost-effectively, along with an increase in the motivation of the respective resources; the company thereby builds a very good knowledge base. Training and people-related issues for the team members are handled by the respective application Test Managers, so the project Test Manager can concentrate purely on the technical challenges, the project deliverables and quality. With good planning there is efficient resource utilization across teams, and if any issues come up during user acceptance testing, the Test Manager or the respective application subject matter expert can help out technically faster.
Disadvantages: testers might end up working on multiple projects and sometimes lose focus on critical projects. The project Test Manager's knowledge is key to understanding technical issues upfront and alerting the respective team members; otherwise the whole project is jeopardized.
VI. CONCLUSIONS
As stated, with the advent of ever more telecommunication projects, the problem of finishing testing on time with good quality plays a major role. In this scenario, the proposed concept helps to build an efficient telecom testing team across the respective systems (ordering, provisioning, billing and reporting). The project Test Manager can help clear the technical challenges and ease the path of project execution. Team members can work on multiple projects if the projects are well planned, leading to efficient utilization of resources.

REFERENCES
[1] C. Kaner, "Exploratory Testing," Florida Institute of Technology, Quality Assurance Institute Worldwide Annual Software Testing Conference, Orlando, FL, November 2006.
[2] J. Bach, "Managing a Software Test Team," presented at Software Development, 1996.
[3] R. Black, "Five Trends Affecting Software Testing," Toronto Association of Systems and Software Quality (TASSQ), 30 May 2006.
[4] D. Bekele, "The Ethiopian Telecommunications: Past, Present and Future," 1st ESS Conference on Ethiopian Telecommunications in the Information Age, Washington, DC, 2 July 1996.
[5] B. Lane, "What is Mass Media?" suite101.com, 29 June 2007.
[6] T. Vohra, "Trends in Software Testing," itMagZ.com, 8 April 2009.
[7] S. Jagtap, "Test Lab Automation using Virtualization."
[8] R. Black, Pragmatic Software Testing: Becoming an Effective and Efficient Test Professional. New York: Wiley, 2007, p. 240. ISBN 978-0-470-12790-2.





Proc. of the Intl. Conf. on Computer Applications
Volume 1. Copyright 2012 Techno Forum Group, India.
ISBN: 978-81-920575-8-3 :: doi: 10.72906/ISBN_0768
ACM #: dber.imera.10.72906

Towards Agile Project Management
Nitin Uikey and Ugrasen Suman
School of Computer Science & IT, Devi Ahilya University, Indore, India


Abstract: Agile software project management is based on principles such as welcoming change, focusing on customer satisfaction, delivering parts of the functionality frequently, collaborating, and reflecting and learning continuously. The generic project management process defined in the PMBOK (Project Management Body of Knowledge), developed by the PMI (Project Management Institute), is structured around five process groups and nine knowledge areas. Agile methodology is a software development approach that follows an iterative and incremental style of development along with efficient change management, whereas the PMBOK is a set of project management practices with step-by-step processes that emphasize documentation. Still, some PMBOK project management processes can be incorporated into agile software development methodology. The objective of this paper is therefore to perform comparative studies analyzing both agile methodologies and the PMBOK. The focus is to identify possibilities and make agile software development more reliable and better managed. Incorporating project management practices in agile-based methods will help teams with a traditional mindset to proceed and survive in an agile environment. The study can also benefit the software development and project management community by increasing the productivity and manageability of the software.
Keywords- Agile Methodologies, PMBOK, Process Groups,
Knowledge Areas, SCRUM.
I. INTRODUCTION
Identifying better ways of developing software by doing it and guiding others to do it is the idea of the agile manifesto. This can be achieved by valuing individuals and interactions over processes and tools, working software over comprehensive documentation, customer collaboration over contract negotiation, and responding to change over following a plan [1]. Agile methodologies include various methods such as Extreme Programming (XP), Scrum, Feature Driven Development (FDD), Crystal methodologies, etc.
Companies and practitioners are facing the challenge of understanding and embracing agile development methodologies [2]. Today, information technology professionals are under tremendous pressure to deliver quality products and services in order to respond to an always dynamic and fast-changing market [5]. Adopting an agile framework while keeping generic project management in mind is thus the major challenge for industry in introducing agile methodology [2].
Industries face many problems and challenges while moving from a traditional development process towards agile practices [3]. Most agile development methodologies were created by software process experts as an attempt to improve existing processes, and presently there is no standard project management process for agile-based development. On the other hand, the PMBOK provides an effective project management approach and defines a total of 44 project processes that describe activities throughout a project life cycle [4]. Traditional project management approaches rely more heavily on software development life cycles such as the waterfall model [5].
It is observed that many of the practices identified in the PMBOK are quite compatible with agile practices [6], and different methodologies and tools can be used to implement an agile project management framework [7]. Studies suggest that project organization, stakeholder expectation management, scope management, etc. are always important factors leading to project success, provided they are managed properly [5]. In this paper, the PMBOK and agile methodologies are compared on the basis of certain project management factors. Thereafter, PMBOK process groups and knowledge areas can be introduced into agile-based development. In this way, agile methodologies can be made more effective and understandable in practice.
The rest of the paper is organized as follows. Section II describes the PMBOK, a set of principles defined for project management. Agile methodologies, with a focus on SCRUM, are discussed in Section III. Incorporating the PMBOK in agile is discussed in Section IV. Section V describes the mapping of project management to SCRUM, and Section VI contains a comparative chart of agile and generic project management based on certain factors. Concluding remarks and future research work are presented in Sections VII and VIII.
II. THE PMBOK
The PMBOK guide provides and promotes a common vocabulary within the project management profession for discussing, writing about and applying Project Management (PM) concepts [7]. The framework provided by the PMBOK includes the project stakeholders' needs and expectations, the project management knowledge areas, project management tools and techniques, and the contribution of successful projects to the enterprise [8]. Stakeholders are the people involved in or affected by project activities, each having a different interest in or motivation for the project. Stakeholders may be sponsors, the project team, support staff, customers, users, suppliers, etc., and their needs and expectations are very important for the success of the project.
A. Project Management Knowledge Areas (PMKA):
PMKA describes the abilities that project managers must
develop for successful project completion [8]. These areas
focus on ideas and objectives of the project and provide the
overall management activities to be performed in a project
such as scope, time, cost, quality, human resource,
communication, risk, procurement and integration.
Project scope management includes the processes required to ensure that the project includes all the work required to complete it successfully. It ensures that the project team and the stakeholders develop the same understanding of the processes being used and of the product the project will produce [4]. It involves five main processes: scope planning and definition, creating the work breakdown structure (WBS), and scope verification and control. Project time management involves the processes concerning the timely completion of the project; it enables the length of activities to be calculated and a realistic schedule to be produced from the estimates. It consists of the following: activity definition and sequencing, activity resource and duration estimating, and schedule development and control.
Project cost management includes processes that ensure the project is completed within the cost approved by the organization; these are cost estimating, budgeting and control. Project quality management describes the processes that ensure the project meets the customer's needs and expectations and the objective for which it was undertaken; these are quality planning, quality assurance and quality control. Project human resource management includes the processes required for organizing and managing the people involved in the project effectively and efficiently; these are human resource planning and acquiring, developing and managing the project team.
Project communication management describes the processes concerning communication mechanisms and ensures timely and appropriate information generation, collection, dissemination, storage and finally disposition in a proper way. For communication management
the processes are communication planning, information
distribution, performance reporting and managing
stakeholders. Project risk management involves processes which help in identifying possible risks to the project and taking the necessary action accordingly. The processes for risk
management are risk management planning, risk
identification, quantitative and qualitative risk analysis, risk
response planning and risk monitoring and control. Project
procurement management includes the processes that deal
with acquiring resources and services for a project
completion from outside the project team. The processes for
procurement management include planning purchases and
acquisitions, planning contracting, request seller responses,
selecting seller, contract administration and contract closure.
Project integration management involves coordinating all
the knowledge areas throughout a project life cycle. This
ensures that all the activities are performed at the right time
and in right manner to complete a project successfully.
Changes made in any one area of the project must be
integrated into the rest of the project [9]. It consists of
various processes such as developing a project charter, a
preliminary project scope statement, a project management
plan, directing and managing project execution, monitoring
and controlling project work, integrating change control, and
project closure.
B. Project Management Process Groups (PMPG):
A process is a series of actions directed towards a
particular result [8]. PMBOK defines five project
management process groups, namely initiation, planning, executing, monitoring and controlling, and closing [4]. Each of the five process groups is characterized by the completion of certain tasks in any project [8]. The initiating process includes defining and
authorizing a project or a project phase. An organization
invests considerable amount of time and thought in
choosing a project. Thus, project selection is important to
ensure that the right kind of project for the right reasons is
initiated.
Planning process includes devising and maintaining a
workable plan to ensure that project is progressing in right
direction and that the project addresses the organization's and stakeholders' needs. The planning process is involved in all
knowledge areas as it relates to the project throughout the
development time. To accommodate changes in the project,
the project team revises the plans during each phase of the
project life cycle. Executing process includes coordination
among people and other resources to execute project plan
and produce the product, services or results of the project or
any phase.
Monitoring and controlling include measuring and
monitoring progress on regular basis to ensure that the
project team and its activities are moving in right direction
to achieve project objectives. Corrective actions are taken
when project teams do not perform and progresses against
the plan. Closing of project brings the project or a project
phase to an orderly end, involving stakeholder and customer
acceptance of the final products and services. All the
required deliverables are checked for their completion and
the product is presented before the customer.

III. AGILE SOFTWARE DEVELOPMENT PRACTICES
The objective of the agile software development manifesto is to uncover better ways of developing software by doing it and helping others do it [1]. The widely adopted agile methodologies include XP, SCRUM, FDD, etc., and their practices are based on the beliefs and practices defined by the manifesto for agile software development [11]. Among all agile methodologies, the most widely used are XP and SCRUM. XP is based on the planning game and small-release coding guidelines, while maintaining an iterative and feedback-driven nature. The SCRUM approach has been developed for managing the software development process in a volatile environment, based on flexibility, adaptability and productivity [11]. SCRUM is one of the commonly used agile development methodologies, developed by Jeff Sutherland and further formalized by Ken Schwaber [2]. The SCRUM methodology appears simple but deeply influences the work experience of the project team [13].
A. SCRUM Process
SCRUM is an iterative, incremental methodology for software development. It is a management framework for incremental product development using one or more cross-functional, self-organizing teams, built around a product backlog, sprints, reviews and burn-down charts. SCRUM provides a structure of team roles, meetings, rules and artifacts, and teams are responsible for creating and adapting their processes within this framework.
For any product to be built, the desired features are collected from users, customers, executives and other stakeholders, often written in the form of user stories, into the product backlog. The team has a fixed time frame, called a sprint, in which to complete the work, with a prioritized sprint backlog in hand, and plans several sprints to complete the whole product. Sprints are short-duration milestones that allow the team to tackle a manageable chunk of the project and get it to a ship-ready state; they generally range from a couple of weeks to as long as 30 days, depending on the product release cycle. SCRUM insists on short daily stand-up meetings; meeting daily, the team feels confident that everyone is on top of their tasks. Team members spend a total of 15 minutes reporting to each other on progress, the work each individual will do today, and any impediments faced. After the sprint execution, the team holds a sprint review meeting to demonstrate a working product increment to the product owner and other stakeholders. At the end of each sprint, a subset of the product is brought to a ship-ready state, fully tested with all the features of that sprint. Extension of a sprint is an indicator that the project is not on schedule and that corrective measures are necessary; it is therefore extremely important to monitor progress with the help of a burn-down chart.
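
The following minimal Python sketch shows, under assumed figures (a 10-day sprint and illustrative daily totals of remaining work, not taken from the paper), how the data behind a burn-down chart can be tracked and how a simple schedule check can be derived from it.

```python
# Minimal burn-down tracking sketch. The sprint length, committed effort and
# daily remaining-work figures below are illustrative assumptions only.
sprint_days = 10
committed_effort = 40  # e.g. story points or ideal hours committed for the sprint

# Remaining work recorded at the end of each day so far (day 1 .. day 6).
remaining_by_day = [36, 33, 31, 26, 22, 20]

def ideal_remaining(day: int) -> float:
    """Ideal burn-down line: linear from the committed effort down to zero."""
    return committed_effort * (1 - day / sprint_days)

# Compare actual progress with the ideal line for each recorded day.
for day, remaining in enumerate(remaining_by_day, start=1):
    ideal = ideal_remaining(day)
    status = "behind" if remaining > ideal else "on track"
    print(f"day {day:2d}: remaining {remaining:5.1f} vs ideal {ideal:5.1f} -> {status}")

# A naive forecast from the average daily burn rate observed so far.
burned = committed_effort - remaining_by_day[-1]
burn_rate = burned / len(remaining_by_day)
days_needed = len(remaining_by_day) + remaining_by_day[-1] / burn_rate
print(f"projected completion after ~{days_needed:.1f} days (sprint is {sprint_days})")
```
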
The role of the product owner is to ensure that the right features get into the product backlog, representing the users' and customers' requirements, and to help set the direction of the product. The role of the scrum master is to make certain the project is progressing smoothly and that every team member has the resources needed to complete the task. The scrum master sets up meetings, monitors the work being done and facilitates release planning, acting much like a project manager. The team is cross-functional, self-organizing and intensively collaborative; it pulls a few features from the product backlog, called the sprint backlog, and proceeds with the implementation [11].

IV. INCORPORATING PMBOK IN AGILE ENVIRONMENT
The motivation behind incorporating the PMBOK in agile methodology is to identify and organize possible alternatives or improvements to existing agile project management, since PMBOK processes are organized by knowledge area and process group. The PMBOK contains a wealth of great ideas and processes that help project managers deliver projects successfully. Theoretically, all of the PMBOK practices can be implemented in any agile-based project [14]. The PMBOK is a guide rather than a methodology, and it can be tailored and customized to specific needs [5][7].
A. Distinguishing PMBOK Process Groups with
SCRUM
SCRUM has four phases, namely product backlog, sprint backlog, sprint and product increment. Similarly, the PMBOK contains five project management process groups: initiating, planning, execution, monitoring and controlling, and closeout. In the following subsections, we compare the project management process groups with the SCRUM process activities.
1) Initiating process versus product backlog: In
conventional project management, initiating process
includes recognizing and then starting a new project. In this process group, top management commitment is required; thus a project charter and a preliminary scope statement are created. The scope statement reflects the strategic planning of
the organization expressing the vision, mission, goals,
objectives and strategies of the organization [8]. This
identification of strategic planning of the organization will
help in future planning of the information system to be
created.
In the SCRUM process, the project is initiated by capturing the project requirements and functionalities required by the customer in a list called the product backlog [12][5]. It is the responsibility of the product owner to state all the requirements on behalf of the stakeholders involved in the project.

Figure 1. SCRUM Process
2) Planning process versus sprint backlog: The main
purpose of planning process is to guide execution. It ensures
the project is heading in the right direction. Planning is
conducted in all knowledge areas involved in the project.
This guides project team to correctly carry out their task.
In SCRUM, after finalizing the product backlog, the
team develops a sprint backlog. This contains a prioritized
list of tasks that must be performed to deliver a completed
increment of the product by the end of the sprint [15]. The
task in the sprint backlog is so identified that each task is
completed within the prescribed timeframe. It is the
collective responsibility of self organized, self managed and
cross functional teams to identify the tasks and prioritize
them accordingly to turn product backlog into a final
shippable product.
Before actually starting with the sprint, the team starts
with the sprint planning meeting, where the functionality to
be achieved during the sprint is determined [12]. This is the planning phase in SCRUM, as the team collaboratively and collectively plans each coming sprint and repeats this after the completion of every sprint. In the sprint planning meeting the team discusses all aspects of the chosen sprint backlog to be executed in the sprint. Estimates of schedule, cost, architecture design and component-level design are discussed at this level [5].
3) Executing process versus sprint: Executing process
in PMPG ensures the activities planned are completed
successfully. Actual products of the project are developed
involving most of the resources [8]. Execution of quality
measures, human resource, communication, resource
acquisition and integration are performed in this process.
In SCRUM, the sprint is the phase in which the tasks are executed. In a sprint, the team has a fixed amount of time, usually 2-4 weeks, to complete the work with the prioritized sprint backlog in hand [11]. During the execution of the sprint, the team holds a daily scrum meeting of about 15 minutes, where each member briefly describes their tasks and concerns about the sprint backlog and the resources needed to complete the task [2].
4) Closing out process versus product increment:
Closing out process in PMPG includes formalizing
acceptance of the project or project phase and ending it
efficiently [8]. This process is performed at the end of the
development cycle, which involves more of administrative
activities such as archiving project files, demonstrating the
product to the customer, documenting lessons learnt and
providing experience as a feedback for the new project.
The outcome of the sprint process in SCRUM is product
increment. It is the closure of the SCRUM cycle, which
includes preparation of the product release, demonstration
of product increment to the customer and documenting
lessons learnt for future reference [5]. After each sprint the team reports to the stakeholders on the sprint that has just completed. Any change in the backlog or product increment is then managed in the next development cycle [16].
5) Monitoring and controlling process versus SCRUM
process: Monitoring and controlling process ensures that the
team is progressing in the right direction with all resources in hand. The project manager and the team monitor and keep track of progress against the plan and take the necessary actions whenever needed [8]. This is achieved by reporting performance to stakeholders and accommodating any change to the process required for successful completion of the project.
In SCRUM, the role of the scrum master is the same as that of the project manager, who is responsible for managing and defining the work [15]. The scrum master sets up meetings, monitors the work being done, facilitates release planning and takes the necessary actions to remove impediments that occur during the development cycle.
Monitoring and controlling in SCRUM starts from
product backlog and continues till shippable product release.
In SCRUM, the scrum master and the project team are responsible for managing and controlling activities throughout the process. Any changes observed during development can easily be accommodated and the necessary actions taken accordingly.
After creating the product backlog, a sprint planning meeting is held, where the backlog is prioritized by the product owner and the members of the team define the tasks that can be completed in the coming sprint [5]. During the sprint, a daily scrum meeting of about 15 minutes is held, where the team shares experiences, decides the work to be done and reports the problems faced. This meeting tracks the progress and performance of the team and induces the changes necessary to remove impediments and ensure successful completion of the project [16].
A sprint review meeting is held at the final stage of
producing product increment. The team demonstrates the
functionality to the customer and the management. Here, the customer can get better insight into the desired functionality and can suggest changes. Progress is monitored against the plan created in the sprint planning meeting. After each sprint, the product backlog is revised along with the performance baseline, thus revising the project cost and duration estimates.
B. Distinguishing PMBOK Knowledge Areas with
SCRUM process
The PMBOK has nine knowledge areas, which represent the abilities a project manager should possess to achieve success. There are four core functions that lead to specific project objectives, i.e. scope, time, cost and quality management, and four facilitating functions through which the specific project objectives are achieved, i.e. risk, human resource, communication and procurement management. Integration management coordinates all the other knowledge areas throughout the life cycle [4].
SCRUM, in contrast, is a process for incrementally building software in an ever-changing environment using one or more cross-functional, self-organizing teams. SCRUM provides
a structure of team roles, meetings, rules and artifacts and
teams are responsible for creating and adapting their
processes within the framework.
1) Analyzing project integration management: In
SCRUM, project integration management can be performed by verifying management's approval and funding for the project, as this is very important for sustaining the project's life, and by validating the development tools and infrastructure needed so that the project continues without failure. Further, a strong change management procedure supported by the product backlog and sprint backlog, an initial plan for sprint execution, and guidance and management of project execution are required. Finally, refinement of the system architecture to support changes and performance of the closure procedure wind up the project development.
2) Project scope management: In SCRUM
development, the project scope is based on high-level requirements [17]. Since agile software development welcomes new ideas at any stage, the project team is ready to embrace changes in the project's scope as they occur; an important aspect of agile software development is therefore its ability to manage scope creep. In agile software development the work is divided into small work products called the sprint backlog [17].
Project scope management in SCRUM includes performing domain analysis for development and creating a comprehensive product backlog, which defines the boundaries of the product. Choosing tasks for the sprint backlog, defining the functionality for each release, selecting the release most appropriate for immediate development, and tracking progress on the assigned backlog items at the end of each sprint ensure that a better understanding is developed between the customer and the team.
3) Project time management: A sprint is a 2-4 week cycle in which all the tasks taken from the sprint backlog are to be completed. Before the sprint, the team selects the items from the sprint backlog and defines the time each task will take to complete. At the end of the sprint, a sprint review meeting is held, which tracks the performance of the team with respect to effectiveness and schedule.
Project time management in SCRUM thus includes defining the functionality and delivery date for each release and analyzing the monthly iterations; a short estimation sketch follows.
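
As a rough illustration of release-level time management, the sketch below estimates how many sprints, and hence which delivery date, a release needs from the remaining backlog estimate. The velocity-based calculation and all the figures are assumptions made for illustration; the paper itself does not prescribe an estimation formula.

```python
# Illustrative release-date estimate from sprint length and observed velocity.
# All numbers are assumed for the example; they are not taken from the paper.
from datetime import date, timedelta
from math import ceil

sprint_length_days = 14          # a 2-week sprint
remaining_backlog_points = 120   # estimated effort left for the release
velocity_per_sprint = 18         # points the team has completed per sprint so far

sprints_needed = ceil(remaining_backlog_points / velocity_per_sprint)
release_date = date(2012, 1, 2) + timedelta(days=sprints_needed * sprint_length_days)

print(f"sprints needed: {sprints_needed}")
print(f"estimated release date: {release_date.isoformat()}")
```
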
4) Project cost management: In SCRUM, the cost of product development is managed at the planning phase, based on previous experience or on the user stories or functionality of the product. Also, as SCRUM supports changes at any stage of development, the cost can be updated as needed in negotiation with the customer.
Thus, project cost management in SCRUM includes cost estimation at the sprint backlog or release level, and cost control and updating based on changes.
5) Project quality management: Quality management in
SCRUM is performed at various phases such as sprint
planning meeting, sprint review meeting and daily scrum
meeting. Here, the team decides the standards to be used during development, discusses how to improve quality aspects, applies quality measures to the product release, and checks quality control.
Therefore, project quality management in SCRUM includes identifying relevant quality standards, implementing quality at sprint planning, assurance at the daily scrum, and quality control at the sprint review meeting.
6) Project human resource management: Project teams in agile software development are self-organizing and cross-functional, able to perform any task effectively and efficiently. Before the commencement of the project, the scrum master and the kind of team required to ensure project success are identified. It also becomes necessary to recruit a new person when a project member is absent, so that the project runs smoothly and on schedule.
Thus, project HR management for SCRUM includes human resource planning, recruiting the project team, and managing the project team as and when required.
7) Project communication management: At the daily scrum meeting the team discusses its progress and impediments, tracks progress and resolves obstacles as quickly as possible. In the meeting the team also communicates the priorities of backlog items to other team members, keeping everyone informed about the progress [16].
Project communication management in SCRUM includes planning the communication needs of stakeholders, distributing information and performance reporting.
8) Project risk management: In SCRUM, project risk management is carried out in every iteration. At the daily scrum meeting any identified risk is discussed and managed.
Project risk management in SCRUM includes initial risk identification, risk response planning and risk review.
9) Project procurement management: Presently, procurement management is not addressed in SCRUM at all. However, project procurement management in SCRUM can be viewed as acquiring resources for a project from outside the organization, and should be the same as suggested in the PMBOK.

V. MAPPING PM PROCESS AND PMBOK AREAS
WITH SCRUM
The PMBOK presents a set of standard terminology and guidelines for project management, which can be tailored to the requirements of the project. The generic project management process is a mature process that has evolved through continuous improvement, while agile is an evolving paradigm with modified processes suitable for small- to medium-scale projects that fulfils the demands of customers effectively. But for any model to produce a successful outcome, a project management philosophy is required. The mapping of PMBOK process groups and knowledge areas to agile software development shown in Table 1 therefore highlights the similarities and identifies the areas where further improvement is possible.
Agile methodologies are unique in that they are evolutionary models and support iterative execution. The outputs produced are modular and delivered in short releases. Due to the iterative execution, the project management activities are performed repeatedly for each cycle. The review meetings used to track the status of the project are conducted frequently, to keep the project from going off track. The customer is an important entity in agile methodologies, owing to continuous collaboration with and commitment to the project. Agile methodology welcomes changing requirements even in the middle of the process; these can easily be accommodated in forthcoming sprints. The team working on an agile project is itself unique, as it is self-directed and self-organized.
On the other hand, generic project management is performed in a progressive and elaborative manner. The project management activities are mature and flexible enough to be optimized according to the project.
the project is covered and managed through all aspects.
Emphasis on quality management through maturity models
and standards such as Capability Maturity Models, ISO, Six
Sigma, etc. is given. In generic project management,
detailed documentation is prepared, which covers all the
areas of the project, and it is able to handle any kind of
project i.e. small, large or life critical.

VI. AGILE METHODOLOGIES VERSUS GENERIC PROJECT
MANAGEMENT
Traditional software development methods such as the waterfall life cycle are widely practiced and dominate the system development environment, while surveys and studies suggest the growing popularity of agile methods. The arrival of these methodologies has divided the software community into two camps, traditionalists and agilists, each proclaiming the superiority of their methodology [18].
But, before deciding which is more appropriate for any
project, it is essential to compare both traditional and agile
methodologies based on certain factors, as shown in Table
2. The factors which helped to distinguish both approaches
are as follows:
Focus and situational approach: defines the focus of the methodology and the situations in which it is best applied.
Requirements and technology: indicates the requirement gathering methods and the technology more favourable for development.
Risk management: defines how efficiently risk is managed.
Communication: states the mode of communication taking place in each methodology.
Configuration change management: shows how efficiently and easily any change occurring during development is managed.
Team size, location, characteristics and composition: gives details about the team.
Roles and responsibilities: identifies the roles and responsibilities of the stakeholders and teams involved in development.
Complexity handling and regulatory requirements: suggests how complex a system and what level of project the methodology is capable of handling.
Development models: lists the models falling under each category.
Cost estimation techniques: states the cost estimation techniques used in each methodology.
VII. DISCUSSION
It is observed that agile methodology can be improved by incorporating generic project management practices for the development of quality products. For managing large projects with agile methodology, the agile framework can be adopted in a progressive and optimized way at certain phases. Risk should be managed as efficiently as in generic project management, and for quality improvement, quality standards and certifications should be incorporated by documenting each phase sufficiently.
In 1995, the Standish Group conducted a comprehensive study of corporate technology projects to identify the rate of failures and their primary causes [20]. Results were evaluated from companies of varying sizes that were developing multiple kinds of projects, and a survey was conducted to determine the factors that cause a project to succeed or fail. According to the Chaos Surveys, conducted every two years by the Standish Group, the factors identified as contributing to project success or failure are user involvement, executive support, clear business objectives, emotional maturity, optimization, project management expertise, execution, agile process, and tools and infrastructure [19][20][21]. Almost all of the factors identified in the Chaos Surveys fall under the category of project management; project management therefore becomes very important to the success of a project. As the new methodologies evolve from past improvements, the success rate of projects is gradually rising, leaving behind failed and challenged projects. As shown in Figure 2, the project success rate in 1994 was 16 percent; it gradually improved, and in 2008 the success rate was reported to be 32 percent, with a corresponding decrease in the failure rate. At the same time, surveys published by Scott W. Ambler report the success of the agile over the traditional approach, as shown in Figure 3 [22].
VIII. CONCLUSION AND FUTURE WORK
We have identified possible improvements to agile project management that make the agile software development process more reliable and better managed, and that help teams with a traditional mindset to grasp the differences and move to an agile environment. To achieve this objective, we have performed a mapping between project management practices and the SCRUM process and, at the same time, created a comparative chart between traditional and agile processes that sets out the distinctive features of each. It can be concluded that no major difference lies between the two methodologies: the processes are performed in almost similar ways, but with a few different terminologies. Generic project management is a mature set of practices, and agile is tailored from it to achieve success.
A comparative study between the traditional and agile approaches is also presented, based on certain project management factors. It is observed that for more complex situations, large teams and heavily regulated industries, the traditional approach is more suitable, while less complex, research-oriented, small-team and commercial projects can easily be developed using the agile approach.
REFERENCES
[1] Manifesto for Agile Software Development, www.agilemanifesto.org.
[2] Hoda, R., Marshall, S., and Noble, J. 2008. Agile Project
Management. In Proceedings of the New Zealand Computer Science
Research Student Conference. (April 2008). NZCSRSC 2008.
Christchurch, New Zealand.
[3] Baig, A., and Muram, F. 2010. Agile Practices in the Industrial
Context. In Proceedings of Interesting Results in Computer Science
and Engineering. (October 28, 2010). IRCSE '10. Mälardalen University, Sweden.
[4] Project Management Body of Knowledge, www.pmi.org.
[5] Fitsilis, P. 2008. Comparing PMBOK and Agile Project Management
software development processes. In Proceedings of Advances in
Computer and Information Sciences and Engineering. 2008, 378-383.
DOI=10.1007/978-1-4020-8741-7_68.
[6] Sliger, M. Relating PMBOK Practices to Agile Practices.
www.stickyminds.com.
[7] Stevens, D. Can Agile Learn Anything From the PMBOK?.
http://www.dennisstevens.com/2010/06/19/can-agile-learn-anything-
from-the-pmbok/.
[8] Schwalbe, K. 2007. Information Technology Project Management.
Thomson Course Technology.
[9] Chaudhuri, N. and Vainshtein, N. 2007. Agile Project Management:
PMBOK vs Agile. Agile Project Leader Network, Chapter Meeting,
Washington DC, August 2, 2007.
[10] Mishra, S.C., Kumar, V., and Kumar, U. 2006. Success Factors of
Agile Software Development. In Proceedings of The 2006 World
Congress in Computer Science, Computer Engineering, and Applied
Computing. (June 26-29, 2006). WORLDCOMP06. 2006. Las
Vegas, Nevada, USA.
[11] Uikey, N., Suman, U. and Ramani, A.K. (2011). A Documented
Approach in Agile Software Development. In International Journal
of Software Engineering. (IJSE). Volume 2 Issue 2. CSC Journals,
Kuala Lumpur, Malaysia. 2011. 13-22.
[12] Howell, G., and Koskela, L. The Theory of Project Management:
Explanation to Novel Methods. (2002). In Proceedings of IGLC-10.
Aug. 2002, Gramado, Brazil.
[13] Pudusserry, A. Agile Project Management Implementation Approach.
Project Management Research Institute.
http://www.collabteam.com/scrum3/AgileProjectManagementImplem
antationApproach.pdf.
[14] Krieger, T. Agile vs. PMBOK: Oil and water or delicious salad
dressing? http://blogs.captechconsulting.com/blog/thomas-
krieger/agile-vs-pmbok-oil-and-water-or-delicious-salad-dressing.
[15] Drnovšek, S. and Mahnič, V. Agile Software Project Management with Scrum. (January 29, 2005). EUNIS 2005. Manchester, UK.
[16] Janoff, N.S. and Rising, L. The Scrum Software Development
Process for Small Teams. (2000). In Journal IEEE Software. Volume
17 Issue 4, July 2000 IEEE Computer Society Press Los Alamitos,
CA, USA. Doi=10.1109/52.854065.
[17] Rehman, I.U., Rauf, A., Shahid, A., and Ullah, S. Scope management
in agile versus traditional software development methods. In
Proceedings of the 2010 National Software Engineering Conference.
NSEC '10. ACM New York, NY, USA. 2010.
Doi=10.1145/1890810.1890820.
[18] Mahapatra, R., Mangalaraj, G. and Nerur, S. Challenges of Migrating
to Agile Methodologies. Magazine Communications of the ACM -
Adaptive complex enterprises. Volume 48 Issue 5, May 2005. ACM
New York, NY, USA. Doi=10.1145/1060710.1060712
[19] Chaos report 2009 project success factors.
http://pmbullets.blogspot.com/2010/04/chaos-report-2009-project-
success.html.
[20] CHAOS Summary 2009.
http://www.portal.state.pa.us/portal/server.pt/document/690719/chaos
_summary_2009_pdf.
[21] Hastie, S. What makes Information Systems Project Successful?
www.softed.com.
[22] Ambler, S. W. Surveys Exploring The Current State of Information
Technology Practices. http://www.ambysoft.com/surveys/

Figure 2. Overall Project Success





Figure 3. Agile and Traditional Project Success Rate
TABLE I. MAPPING PM PROCESS AND PMBOK AREAS WITH SCRUM PROCESS

Initiating
  PMBOK knowledge areas: Project charter; scope definition.
  SCRUM: Create product backlog list; verification of management approval and funding; availability of development tools and infrastructure.

Planning
  PMBOK knowledge areas: Project plan; scope planning and definition; construct WBS; activity definition and sequencing; activity resource and duration estimation; schedule development; cost estimation and budgeting; quality planning; human resource planning; communication planning; risk planning, risk identification and analysis; risk response planning; plan purchases and acquisitions; plan contracts.
  SCRUM: Divide product backlog into a comprehensive sprint backlog; define the delivery date and functionality included in each release; prioritize sprint backlog; select the release most appropriate for immediate development; sprint planning meeting; plan and adjust quality standards with which the product will conform; initial assessment of risk during sprint planning.

Execution
  PMBOK knowledge areas: Direct and manage project execution based on planning; acquire and develop the project team; information distribution; perform quality assurance tasks; request seller responses and select sellers.
  SCRUM: Appointment of project teams as per release; perform sprint; communication of standards to the project team.

Monitoring and Control
  PMBOK knowledge areas: Monitor and control project work; integrated change control; scope verification and scope control; schedule control; cost control; perform quality control measures; guide and manage the project team; performance reporting to all stakeholders; risk monitoring and control; contract administration.
  SCRUM: Change management procedure; refinement of system architecture to support changes; daily scrum meeting; sprint review meeting; risk review during review meetings; review of progress for assigned backlog items; reviewing monthly iterations for schedule.

Close Out
  PMBOK knowledge areas: Close contract and close project.
  SCRUM: Demonstration of product increment; sprint retrospective.




TABLE II. Generic Project Management versus Agile Methodologies

Factor: Focus and situational approach
  Project Management: Generic, process oriented, predictive, plan based; requires up-front planning
  Agile Project Management: Specific, people oriented, adaptive, on-going planning; ability to react to changes and uncertainty

Factor: Requirements and technology
  Project Management: Knowable early, largely stable; no restriction on technology
  Agile Project Management: Largely emergent, rapid change, unknown; favours object-oriented technology

Factor: Risk management
  Project Management: Uses structure; well understood risk; minor impact on occurrence
  Agile Project Management: Uses flexibility; risk is seen in early stages; unknown risk; major impact on occurrence

Factor: Communication
  Project Management: Formal, well documented and structured
  Agile Project Management: Informal, with emphasis on constant face-to-face interaction

Factor: Configuration change management
  Project Management: Formal change control with Change Control Boards; less flexibility to the customer; the customer must go through a proper change control system to introduce change
  Agile Project Management: Limited or no formal change control within iterations; more flexibility to the customer at controlled cost

Factor: Team size, location, characteristics and composition
  Project Management: Can handle large teams; teams can be geographically dispersed; the team is structured vertically and can go to many levels; the project manager manages the team and the project; the project manager defines work and prioritizes it; scope is frozen
  Agile Project Management: Can only handle small teams; teams should be geographically collocated; self-organized and self-managed teams; teams are structured horizontally and restricted to one or two levels; the team and the Scrum master together manage the project; the product owner writes story points and prioritizes them; the sprint is frozen

Factor: Roles and responsibilities
  Project Management: Well defined; project roles with separation of duties; the project manager is totally responsible for any failure of the project; customer involvement is low
  Agile Project Management: Well defined teams; the team and the Scrum master are collectively responsible for the project; the customer is part of the whole exercise and is present at every sprint meeting

Factor: Complexity handling and regulatory requirements
  Project Management: Can handle more complex situations, with more up-front modeling and planning; suits heavily regulated industries requiring thorough documentation with formal approvals
  Agile Project Management: Less complex situations are better addressed due to the adaptive modeling process; small commercial products with multiple versions can be developed

Factor: Development model
  Project Management: Life cycle models
  Agile Project Management: The evolutionary-delivery model

Factor: Cost estimation techniques
  Project Management: Traditional estimation methods: COCOMO, function points, etc.
  Agile Project Management: Story points, velocity of project, ideal time







Proc. of the Intl. Conf. on Computer Applications
Volume 1. Copyright 2012 Techno Forum Group, India.
ISBN: 978-81-920575-8-3 :: doi: 10. 72913/ISBN_0768
ACM #: dber.imera.10. 72913

UML Designing and Comparative Performance of Various Computer Networks
Neeraj Verma
Research Scholar
Manav Bharti University
Solan, (H.P)

Abstract-The main aim of this work is the designing of UML
models of various computer architectures such as Single
instruction stream over single data Stream (SISD), Single
instruction stream over multiple data Stream (SIMD), Multiple
instruction stream over single data Stream (MISD) and
Multiple instruction stream over multiple data Stream
(MIMD). This paper shows UML class diagrams of various
topologies and these diagrams are compared with the existing
diagrams of these topologies. Activity diagrams of computer
architecture models are also shown. The paper shows the
performance of mesh, ring and linear bus network topology on
the basis of tables and graphs which are plotted on number of
nodes Vs number of links and number of nodes Vs network size
(network diameter) for each type respectively and a
comparative study of the three topologies is made. It is
concluded that as the number of nodes in a network increases,
mesh network topology is recommended as it has the minimum
space complexity.

Keywords: Network Topologies, Execution Stages, Static
Interconnection, Unified Modeling Language (UML), Class
Diagrams, Activity Diagrams.

I.INTRODUCTION
In [1] authors Michael Blaha and James Rambaugh
proposed Object-Oriented Modeling and Design with UML.
In [2] James Rambaugh, Michael Blaha, William
Premcrlani, Prederic Eddy and William Lorenson, explained
Object Oriented Modeling and Design with UML.In [3] Kai
Hwang published a book on Advance Computer
Architecture including parallelism, scalability and
programmability.
In [4] The Unified Modeling Language Reference Manual
was explained by James Rambaugh, Ivan Jacobson and
Grady Booch. In [5] Grady Booch, James Rambaugh and
Ivan Jacobson gave The Unified Modeling Language User
Guide. In [6] Ivar Jacobson, Magnus Christerson, Patric
Jonsson and Gunnar Overgaard explained Object Oriented
Software Engineering: A Use Case Driven Approach. In [7]
The Unified Software Development Process was explained
by Ivar Jacobson, Grady Booch and James Rumbaugh.
In [8] James Rumbaugh, Grady Booch and Ivar
Jacobson, Unified Modeling Reference Manual, this book
provides a real-world guide to working with UML. It
includes all facets of today's UML standard. [9] Grady
Booch, James Rumbaugh, Ivar Jacobson proposed User
Guide for Unified Modeling Language. In [10] the website
provides Parallel Computer Communication Topology. In
[11] Simon Bennett, John Skelton and Ken Lunn propose an
outline of UML. In [12] S. Pllana and T. Fahringer propose
Modeling Parallel Applications with UML, which explains that
the Unified Modeling Language (UML) offers an extensive
set of diagrams for modeling; however, the semantics of
specific diagrams are not always clear enough to decide how to
model specific aspects of parallel applications.
Unified Modeling Language is a graphical language used
for expressing software or system requirements, architecture,
and design efficiently. UML can be used for communication
with other developers, clients, and increasingly, with
automated tools that generate parts of a system.
The Unified Modeling Language (UML) is a widely used
de-facto standard visual modeling language that is used in
the software industry for modeling software. It is used by
practitioners to visualize, communicate, and implement their
designs. It is a general purpose, broadly-applicable, tool-
supported, industry-standardized modeling language.

II.BACKGROUND
A process may be defined as a connected set of actions
or events that produce continuation or gradual change.







Fig 1. Representation of Process
A group of processes that have a predefined regular interconnection topology, such as a mesh, ring or bus, may be defined as a process topology [12].
In the figure below, the definition of a stereotype processing unit based on the base class Class is depicted. A stereotype is intended to extend a UML element so that it is an instance of a new metaclass. These are used to model processes or threads.




Class Process (attributes: Process_id: Integer, In_time: String, Out_time: String, Process_priority: Integer; operations: Create(), Delete(), Update(), Fork/Join())














Fig:1(A) Representation of Processing Unit






Fig:1(B) Multiple Instances of Process
A stereotype refers to a base class in the UML
metamodel. It indicates what type of element is stereotyped.
Here processing unit is a modeling element which is
defined by stereotyping the base class Class. The
compartment of stereotype Processing Unit named Tags
specifies a list of tag definitions that include id, type and
cardinality. Tag id is used to uniquely identify the modeling
element, type specifies the type of processing unit and
cardinality tag is used to specify the number of elements of a
set of instances. Fig. 1(B) shows multiple instances of the class Process based on the stereotype processing unit, which is used to model a process. Here abc is defined as an object.
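As a rough illustration only (this C++ rendering is our own and is not part of the UML model itself), the stereotyped Process element of Fig. 1(A)/1(B) could be sketched as a plain data structure; the member names follow the figure, while the code is only an assumption:

    #include <string>

    // Sketch of the Process class of Fig. 1(B), together with the tag values
    // (id, type, cardinality) of the <<Processing Unit>> stereotype it uses.
    struct Process {
        // tag definitions of the stereotype <<Processing Unit>>
        int         id;            // uniquely identifies the modeling element
        std::string type;          // type of processing unit
        int         cardinality;   // number of elements in a set of instances

        // attributes shown in the class compartment
        int         process_id;
        std::string in_time;
        std::string out_time;
        int         process_priority;

        // operations shown in the class compartment
        void create() {}
        void remove() {}     // "Delete()" in the figure; delete is a C++ keyword
        void update() {}
        void forkJoin() {}   // "Fork/Join()" in the figure
    };

    // Example instance corresponding to the object "abc" of Fig. 1(B):
    // Process abc;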

III.UML DESIGNS FOR TOPOLOGIES
The mapping or arrangement of the elements of a
network gives rise to certain basic topologies that may then
be combined to form more complex topologies. The term
"topology" refers to the layout of connected devices on a
network. The most common of these topologies are:
A. Mesh
B. Ring
C. Bus
A.MESH TOPOLOGY
In a mesh topology, each computer is connected to every
other computer by a separate cable. This configuration
provides redundant paths through the network, so, if any
cable or node fails, there are many other ways for two nodes
to communicate.





Fig.2. Representation of Mesh
B. RING TOPOLOGY
Ring topology is a type of network topology in which
each of the nodes of the network is connected to two other
nodes in the network and the first and last nodes are
connected to each other, forming a ring.








Fig.3. Representation of Ring
C. BUS TOPOLOGY
In a bus topology, all computers are attached to a
continuous cable which connects them in a straight line. It is
called a common backbone or a trunk that connects all
devices. The backbone is a single cable that functions as a
shared communication medium. All the devices attach or tap
into it with an interface connector.





Fig 4. Representation of Linear Bus

IV.UML MODEL FOR NETWORK
TOPOLOGIES

A.UML Model for Mesh Topology
The simplest connection topology is an n-dimensional
mesh. In a 1-D mesh all nodes are arranged in a line, where
the interior nodes have two and the boundary nodes have
one neighbor(s). A two-dimensional mesh can be created by
having each node in a two-dimensional array connected to
all its four nearest neighbors.













Fig.5.UML Class Diagram of Mesh Topology
The above class diagram shows how the processes are
connected to each other and executed in the corresponding
processing unit. The mesh is considered here for (MN+1)
processes.
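As an illustration only (this sketch is ours and not taken from the paper), the neighbour relations of the two-dimensional mesh of processes shown in Fig. 5 could be generated programmatically in C++ as follows, linking each process to its nearest neighbours:

    #include <vector>

    // Build the neighbour lists of the mesh of processes P00..PMN of Fig. 5:
    // process (r, c) is linked to its up/down/left/right neighbours.
    std::vector<std::vector<int> > buildMesh(int M, int N) {
        const int rows = M + 1, cols = N + 1;
        std::vector<std::vector<int> > adj(rows * cols);
        for (int r = 0; r < rows; ++r) {
            for (int c = 0; c < cols; ++c) {
                const int id = r * cols + c;
                if (r > 0)        adj[id].push_back((r - 1) * cols + c);
                if (r < rows - 1) adj[id].push_back((r + 1) * cols + c);
                if (c > 0)        adj[id].push_back(r * cols + (c - 1));
                if (c < cols - 1) adj[id].push_back(r * cols + (c + 1));
            }
        }
        return adj;   // interior processes get four neighbours, corners get two
    }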


B.UML Model for Ring Topology
In this type of network topology, each node of the network is connected to two other nodes, with the first and last nodes being connected to each other; thus it forms a ring. The class diagram of the ring topology is designed for (N+1) processes.






Fig.6. UML Class Diagram For Ring Topology
The above diagram shows that each process is connected to the next, and the first process is connected to the last one, forming a ring type of network.
C. UML Model For Bus Topology
With the bus topology, all workstations are connected
directly to the main backbone that carries the data. Traffic
generated by any computer will travel across the backbone
and be received by all workstations.




Fig.7. Class Diagram of Bus Topology
In the above diagram, all the processes are connected to
each other forming a bus network.

V.EXECUTION STAGES OF A PROCESS
A linear pipeline processor is a cascade of processing stages. The stages are linearly connected to perform a fixed function over a stream of data flowing from one end to the other.
A pipeline can execute a stream of instructions in an overlapped manner. The execution of an instruction consists of a sequence of operations including instruction fetch, decode, operand fetch, execute and write-back phases. These phases are well suited to the overlapped execution of a linear pipeline.
The instruction pipelining is shown in the figure below.












Fig8.UML Class Diagram For Instruction Pipeline Execution Stages

First of all, the instruction is fetched from a process into
buffers called instruction cache. The instruction is then
decoded which reveals the instruction function to be
performed and identifies the resource needed for performing
the function. The operands are also read from the registers.
The instructions are executed in one or several execute
stages in the arithmetic and logic unit, which is responsible
for the execution. Next, the data address translation occurs in
the data cache access and the data tag is checked. The last is
the write back stage, which writes the result into the registers
and then to the process.
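As a rough illustration of why these phases benefit from overlapped execution (this sketch is ours and assumes the textbook case of one instruction issued per clock with no stalls; it is not a measurement from this work), the cycle counts can be expressed in C++:

    // Cycle counts for executing n instructions on a k-stage linear pipeline,
    // assuming one instruction enters the pipeline per clock and no stalls occur.
    long long pipelinedCycles(long long k, long long n)  { return k + n - 1; }
    long long sequentialCycles(long long k, long long n) { return k * n; }

    // Example: with the five phases named above (instruction fetch, decode,
    // operand fetch, execute, write-back) and 1000 instructions:
    //   pipelinedCycles(5, 1000)  = 1004 clock cycles
    //   sequentialCycles(5, 1000) = 5000 clock cycles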
VI.SAMPLE ACTIVITY DIAGRAM FOR
COMPUTER ARCHITECTURES
Activity diagrams have the ability to expand and show who has the responsibility for each activity in a process. To do so, the diagram is separated into parallel segments called swimlanes. Each swimlane shows the name of a role at the top and presents the activities of that role. Transitions can occur from one swimlane to another. Each swimlane is separated from its neighbours by vertical solid lines.

A.UML Activity Diagram For SISD Model
SISD is a Single instruction stream over single data
Stream sequential computer. It exploits no parallelism in
either the instruction or the data streams. The figure below
depicts the mapping of a Single Instruction Single Data
(SISD) application.
















Fig9.Activity Diagram of SISD Model

B.UML Activity Diagram For SIMD Model
SIMD is a computer which exploits multiple data
streams against a single instruction stream to perform
operations which may be naturally parallelised.
Figure 10 illustrates the mapping of a Single Instruction
Multiple Data (SIMD) application. The swimlane Process is
responsible for all activities of the program.



































Fig10. Activity Diagram of SIMD Model

C.UML Activity Diagram For MIMD Model
MIMD is multiple autonomous processors
simultaneously executing different instructions on different
data.
Figure 11 illustrates the mapping of a Multiple
Instruction Multiple Data (MIMD) application. Sample
Action 1, Sample Action 2 and Sample Action 5 are
processed by swimlane Process0. Sample Action 3, Sample
Action 4 are respectively processed by Process1 and
Process2.














Fig11. Activity Diagram of MIMD Model

D.UML Activity Diagram For MISD Model
MISD machines have many processing elements, all of
which execute independent streams of instructions.
However, all the processing elements work on the same data
stream.
Figure 12 shows the mapping of a Multiple Instruction
Single Data (MISD) application. Sample Action 1, and
Sample Action 4 are processed by swimlane Process0.
Sample Action 2, Sample Action 3 are processed by
Process1 and Process2 respectively.






































Fig12.Activity Diagram of MISD Model

VII. RESULTS AND DISCUSSIONS
COMPARISON OF PERFORMANCE AMONG MESH,
RING AND BUS
The performance of mesh, ring and bus network
topology can be compared on the basis of the network size,
which is shown in the table below:

Type of topology Network size
2D-Mesh 2(r-1)
Ring N/2
Bus N-1

where r is the bisection width, r = √N, and N is the number of nodes.
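These expressions can be evaluated directly. The C++ fragment below is only our illustration (the function names are ours); with r = √N it reproduces the values listed in Table 4.4 further below:

    #include <cmath>

    // Network size (diameter) as a function of the number of nodes N,
    // with r = sqrt(N) for the two-dimensional mesh.
    double meshSize(double N) { return 2.0 * (std::sqrt(N) - 1.0); }
    double ringSize(double N) { return N / 2.0; }
    double busSize (double N) { return N - 1.0; }

    // Example: for N = 100 nodes, meshSize = 18, ringSize = 50 and
    // busSize = 99, matching the corresponding row of Table 4.4.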
This comparison can be clearly made on the basis of
increasing number of nodes as shown in the table below:










Table 4.4 Comparison of Space Complexity for Mesh, Ring and Linear
Bus Topology

Number of Nodes    Network Size
                   Mesh      Ring     Bus
1                  0         1        0
2                  0.828     1        1
3                  1.464     2        2
4                  2         2        3
5                  2.472     3        4
6                  2.898     3        5
10                 4.324     5        9
100                18        50       99
10000              198       5000     9999

Figure 13. Comparison of Performance of Mesh, Ring and Bus Network Topology (Number of Nodes vs. Network Size)

In Figure 13, the comparison of space complexities for the mesh, ring and linear bus network topologies is done taking the number of nodes up to 6. However, the figure can be expanded for a number of nodes up to 10000. As the space complexity of the mesh network topology is the minimum, as shown in the figure, mesh network topology is recommended for a large number of nodes.

VIII. ACKNOWLEDGEMENT
I would like to thank my official supervisors and
Professor Dr. Prashant Kumar Pandey for all their help,
advice and encouragement.

IX. FUTURE SCOPE
The above work can be extended in many directions and
some of these are given below:
UML modeling is also helpful to make designs for
the instruction pipelining, non-linear pipelining, Sparc
Architecture, Intel Pentium Architecture, and
performance of these models can also be judged by
supplying attributes through objects.
In place of UML Modeling, researchers can also
use the other modeling techniques like Agile Modeling,
etc.
By the use of Agile Modeling, researchers can also
judge the performance of designed models as it is done
by the use of UML model and by processing the
attributes through objects.
Other static and dynamic network topologies are another direction of expansion, to check their suitability for the network setup; considering them with the help of UML modeling can further expand the work.

REFERENCES
[1]. Michael Blaha, James Rambaugh,Object-Oriented Modeling
and design with UML, second edition, 2007 issue.
[2]. James Rambaugh, Michael Blaha, William Premcrlani, Prederic
Eddy, William Lorenson, Object Oriented Modeling and
Design , 2003 reprint.
[3]. Kai Hwang, Advance Computer Architecture,2001 edition,9-
12.
[4]. James Rambaugh, Ivan Jacobson, Grady Booch, The Unified
Modeling language Referance Manual, second edition, 2005.
[5]. Grady Booch, James Rambaugh, Ivan Jacobson, The Unified
Modeling language User Guide, 1999 issue
[6]. Ivar Jacobson, Magnus Christerson, Patric Jonsson, Gunnar
Overgaard,Object Oriented Software Engineering: A Use Case
Driven Approach, 1992 edition.
[7]. Ivar Jacobson, Grady Booch, James Rumbaugh,The Unified
Software Development Process, 1999 issue.
[8]. James Rumbaugh, Grady Booch and Ivar Jacobson, Unified
Modeling Reference Manual,isbn 0-201-30998-X
[9]. Grady Booch, James Rumbaugh, Ivar Jacobson, Unified
Modeling Language User Guide, 2nd Edition, 2005 issue,
ISBN: 0321267974
[10]. Parallel Computer Communication Topology [on-line]available
at www.gigaflop.demon.co.uk/comp/chapt7.htm
[11]. Simon Bennett, John Skelton, Ken Lunn, Schaum's Outline
of UML: Second Edition ,2005 issue, ISBN: 0077107411.
[12]. S.Pllana, T.Fahringer Modeling Parrallel applications with
UML 2002
[13]. Kai Hwang, Advance Computer Architecture, 2001,pg:9-12.





















Proc. of the Intl. Conf. on Computer Applications


Volume 1. Copyright 2012 Techno Forum Group, India.
ISBN: 978-81-920575-8-3 :: doi: 10. 72920/ISBN_0768
ACM #: dber.imera.10. 72920

Hierarchical Energy Efficient Protocol for LLC in Wireless Sensor Networks

K. Gandhimathi [1], M. Vinothini [2]
[1] Department of Information Technology, Bannari Amman Institute of Technology, Sathyamangalam, TamilNadu
[2] M.E. Software Engineering, Bannari Amman Institute of Technology, Sathyamangalam, TamilNadu

Abstract - Wireless Sensor Networks is a collection of wireless
sensor nodes dynamically forming a temporary network
without the aid of any established infrastructure or
centralized administration. Routing protocols in wireless
sensor networks helps node to send and receive packets.
Traditional hierarchical routing algorithms combine
adaptability to changing environments with energy aware
aspects. In this paper a new distributed, self-organizing,
robust and energy efficient, hierarchical energy efficient
routing protocol is proposed for sensor networks .The
proposed routing is based on Low-Power Localized Clustering
(LLC).In this structure, which nodes can construct from the
position of their 1-hop neighbors. This also describes route
maintenance protocols to respond to predicted sensor failures
and addition of new sensors. The LLC to a k-level
hierarchical algorithm will improve the network energy
management. The LLC protocol is compared with the existing
Localized Power Efficient Data Aggregation Protocol (L-
PEDAP) and Ad hoc On demand Distance Vector routing
protocol (AODV). The performance of these three routing
protocols (L-PEDAP, AODV, and LLC) is based on metrics
such as packet delivery ratio, end to end delay, energy
consumption and throughput, Also this work will analyze and
compare the performance of protocols using NS-2.34, when
various other mobility models are applied to these protocols.
Keywords - mobile sensor networks, mobile node, mobility
model.
I. INTRODUCTION
WSN are self-configuring, self-healing networks
consisting of mobile or static sensor nodes connected
wirelessly to form an arbitrary topology. Since WSNs are not currently deployed on a large scale, research in this area is mostly simulation based [1]. Mobile wireless sensor networks owe their name to the presence of a mobile sink. The advantages of mobile WSNs over static WSNs are better energy efficiency, improved coverage, enhanced target tracking and superior channel capacity [2].
nodes affects the throughput of the protocol because the
bandwidth reservation made or the control information
exchanged may end with no use, if the node mobility is
very high [3]. Figure 1 shows the mobile sensor network
scenario in which the position of a mobile node at time t, (t
+1), and (t+2) are shown as A, B and C respectively.
Performance of routing protocols is studied with the
MANETS using different mobility models. Among other
simulation parameters, the mobility model plays a very
important role in determining the protocol performance in
MSN. Hence it is essential to study and analyze various
mobility models and their effect on MSN. This paper
compares the two different protocols with two mobility
models and their performance with parameters like Packet
delivery ratio, End to End delay, Energy consumption and
throughput in MSN. Figure 2 shows the design flow of how
the mobility metrics are added to the mobility model and
the protocol performance with the connected paths is
analyzed.



Figure 1. MOBILE SENSOR NETWORK SCENARIO

II. RELATED WORKS
The effects of various mobility models and the
performance of two routing protocols Dynamic Source
Routing (DSR-Reactive Protocol) and Destination
Sequenced Distance Vector (DSDV-Proactive Protocol) is
studied in [4]. Performance comparison has also been conducted across varying node densities and numbers of hops. Experimental results illustrate that the performance of a routing protocol varies across different mobility models, node densities and lengths of data paths. Mobile wireless ad hoc networks are infrastructure-less and often operate in unattended mode, so it is significant to compare the various routing protocols for a better understanding and implementation of them. In this paper, a comparison of the performance of routing protocols such as Ad hoc On-Demand Distance Vector routing (AODV) and the Localized Power Efficient Data Aggregation Protocol (L-PEDAP) is discussed. The comparison results are graphically depicted and explained [5].




Figure 2. DESIGN FLOW

The mobility model is the most important factors in the
performance evaluation of a mobile ad hoc network
(MANET). Traditionally, the random waypoint mobility
model has been used to model the node mobility, where the
movement of one node is modeled as independent from all
others. However, in large scale military scenarios, mobility
coherence among nodes is quite common. One typical
mobility behavior is group mobility. Thus, to investigate
military MANET scenarios, an underlying realistic
mobility model is highly desired. In this paper a virtual
track based group mobility model (VT model) which
closely approximates the mobility patterns in military
MANET scenarios is proposed. It models various types of
node mobility such as group moving nodes, individually
moving nodes as well as static nodes. Moreover, the VT
model not only models the group mobility, it also models
the dynamics of group mobility such as group merge and
split. Simulation experiments show that the choice of
mobility model has significant impact on network
performance [6]

III. MOBILITY MODELS
Mobility models exhibit two different types of dependency: spatial and temporal. The mobility of a node may be constrained and limited by the physical laws of acceleration, velocity and rate of change of direction. Spatial dependence is a measure of how node movement directions are correlated: two nodes moving in the same direction have high spatial dependency. The current velocity of a mobile node may depend on its previous velocity, so the velocities of a single node at different time slots are correlated; this mobility characteristic is called the temporal dependency of velocity [1]. Frequently used mobility models include the Random Waypoint and Energy Consumption models. We compare the performance of these models with parameters such as packet delivery ratio, end-to-end delay, energy consumption, throughput and hop count using two different routing protocols.
A. Random way point mobility model
The Random Waypoint model is the most commonly
used mobility model in research community. At every
instant, a node randomly chooses a destination and moves
towards it with a velocity chosen randomly from a uniform
distribution [0, V_max], where V_max is the maximum
allowable velocity for every mobile node. After reaching
the destination, the node stops for a duration defined by the
'pause time' parameter. After this duration, it again chooses
a random destination and repeats the whole process until
the simulation ends. Figure 3 illustrates an example topography showing the movement of nodes for the Random Mobility Model.
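As an illustration only (a minimal sketch under the assumptions stated above, not code from this work), one step of a Random Waypoint node could be written in C++ as follows:

    #include <cmath>
    #include <cstdlib>

    // One node of the Random Waypoint model: move towards a randomly chosen
    // destination with a speed drawn uniformly from [0, vMax], pause for
    // pauseTime on arrival, then choose a new destination and repeat.
    struct WaypointNode {
        double x, y;           // current position
        double destX, destY;   // current destination
        double speed;          // current speed
        double pauseLeft;      // remaining pause time

        WaypointNode() : x(0), y(0), destX(0), destY(0), speed(0), pauseLeft(0) {}

        static double uniform(double lo, double hi) {
            return lo + (hi - lo) * std::rand() / double(RAND_MAX);
        }

        void step(double dt, double areaX, double areaY,
                  double vMax, double pauseTime) {
            if (pauseLeft > 0) { pauseLeft -= dt; return; }
            const double dx = destX - x, dy = destY - y;
            const double dist = std::sqrt(dx * dx + dy * dy);
            if (dist <= speed * dt) {            // destination reached
                x = destX; y = destY;
                pauseLeft = pauseTime;           // pause, then pick a new waypoint
                destX = uniform(0, areaX);
                destY = uniform(0, areaY);
                speed = uniform(0, vMax);
            } else {
                x += speed * dt * dx / dist;     // move towards the destination
                y += speed * dt * dy / dist;
            }
        }
    };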


Figure 3.TOPOGRAPHY SHOWING THE MOVEMENT OF
NODES FOR RANDOM MOBILITY MODEL.



B. Energy Consumption Model


There are different models proposed for modeling energy consumption in sensor nodes. Here, we use the first order radio model proposed in [2]. In this model, the energy consumed to transmit a k-bit packet over a distance d (denoted as E_tx) and the energy consumed to receive a k-bit packet (denoted as E_rx) are given as follows:

E_tx(k, d) = a*k + b*k*d^n,    E_rx(k) = c*k.    (1)

In this model, a and c are the energy consumption constants of the transmitter and receiver electronics, respectively, and b is the energy consumption constant for the transmit amplifier. There are various studies in the literature based on this model. S. Singh et al. [2] propose and use this model, assuming that a = c = 50, b = 0.1 and n = 2. On the other hand, in [10], (a + c) = 2 x 10^8, b = 1 and n = 4.
According to the proposed model, if the total energy cost of transmitting a k-bit packet from a node i to a neighboring node j is denoted as C_ij(k), then C_ij(k) is given as follows:

C_ij(k) = a*k + b*k*d_ij^n,    if j is the sink.    (2)

The costs of transmission of one packet to another node
and to the sink are different, since the sink has no energy
constraints and its cost for receiving messages is ignored.
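The radio model of Eqs. (1) and (2) can be evaluated directly. The C++ fragment below is only our own sketch of that calculation; following the remark above, the receive cost c*k is added when the receiving node is an ordinary neighbour and dropped when it is the sink:

    #include <cmath>

    // First-order radio model of Eq. (1): k bits sent over distance d.
    double txEnergy(double k, double d, double a, double b, double n) {
        return a * k + b * k * std::pow(d, n);     // E_tx(k, d)
    }
    double rxEnergy(double k, double c) {
        return c * k;                              // E_rx(k)
    }

    // Total cost of sending k bits from node i to a neighbouring node j over
    // distance dij: the receive cost is ignored when j is the sink (Eq. (2)).
    double linkCost(double k, double dij, bool jIsSink,
                    double a, double b, double c, double n) {
        const double tx = txEnergy(k, dij, a, b, n);
        return jIsSink ? tx : tx + rxEnergy(k, c);
    }

    // Example with the constants of [2]: linkCost(1, 10.0, false, 50, 0.1, 50, 2)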

IV. ROUTING PROTOCOLS
A. Ad-Hoc on Demand Distance Vector Routing
(AODV)
AODV is a distance vector type routing protocol.
This protocol does not maintain routes to destinations that are not actively used. As long as the nodes have valid routes to each other, AODV does not play a role. It uses Route Request (RREQ), Route Reply (RREP) and Route Error (RERR) messages to discover and maintain routes. When a node wants a route to a destination, it broadcasts an RREQ to the entire network until the destination is reached or a fresh route is found. Then an RREP is sent back to the source with the discovered path. When a node detects that a route is no longer valid, it broadcasts an RERR message [4][7].
B. Localized Power Efficient Data Aggregation
Routing Protocol(L-PEDAP)
L-PEDAP is a table-driven algorithm based on LMST routing. Each node has a routing table containing the destination, the next hop and the number of hops to the destination. The nodes periodically broadcast updates. A sequence number is tagged with time, and the shortest path to a destination is used. If a node detects that a route to a destination is broken, the hop count is set to infinity and its sequence number is updated to an odd number; even numbers represent the sequence numbers of connected paths. The sequence numbers enable the mobile nodes to distinguish stale routes from new ones, thereby avoiding the formation of routing loops [4][7][8].

V. EXPERIMENT AND RESULTS
Each model is implemented with the L-PEDAP and AODV protocols and their performance is analyzed with various node densities such as 10, 20, 30, 40 and 50 with a standard 802.11 MAC layer. The packet type generated in the trace file is UDP. In the simulation scenario we used omni-directional antennas with a transmission range of 250 m. Our simulation results show that the packet delivery ratio is higher in L-PEDAP when compared with AODV, while energy consumption and end-to-end delay are lower in L-PEDAP when compared with AODV.

TABLE 1. SIMULATION PARAMETERS

Area: 1000 m X 1000 m
MAC: IEEE 802.15.4
Transmission range: 250 m
Number of source nodes: 10, 20, 30, 40, 50
Simulation time: 180 sec
Mobility: Random Waypoint
Traffic type: CBR
Transport layer: UDP
Packet size: 1400 bytes

A. PACKET DELIVERY FRACTION




Figure 4. PACKET DELIVERY FRACTION
In the figure, the X axis represents the varying number of nodes and the Y axis the packet delivery fraction. Based on Figure 4, L-PEDAP performs better when the number of nodes increases, because the nodes become more stationary, which leads to a more stable path from source to destination. The performance of AODV drops as the number of nodes increases because more packets are dropped due to link breaks.
B. END TO - END DELAY

Figure 5. END TO END DELAY


In the figure, the X axis represents the varying number of nodes and the Y axis the end-to-end delay in milliseconds. Based on Figure 5, L-PEDAP did not produce much delay even as the number of nodes increased; it performs better than the AODV protocol.
C. ENERGY CONSUMPTION
In the figure, the X axis represents the varying number of nodes and the Y axis the energy consumption.



Figure 6. ENERGY CONSUMPTION
From Figure 6, L-PEDAP is less prone to route instability compared to AODV. For L-PEDAP, the routing energy consumption is not as strongly affected as it is in AODV.
D.THROUGHPUT
In the figure, the X axis represents the varying number of nodes and the Y axis the throughput.



Figure 7. THROUGHPUT
The number of packets received in L-PEDAP is higher when compared to the AODV protocol. It is also dependent on time: as time increases, the number of packets received at the receiver side also increases correspondingly.

VI. CONCLUSION AND FUTURE WORK
The performance of the L-PEDAP and AODV routing protocols was measured with respect to metrics like packet delivery ratio, end-to-end delay, energy consumption and throughput over a single scenario (varying the number of nodes). The results indicate that the performance of L-PEDAP is superior to regular AODV. It is also observed that the performance is better, especially when the number of nodes in the network is higher. When the number of nodes increased beyond 20, the performance of AODV degenerated due to the fact that a lot of control packets are generated. It is also observed that L-PEDAP is better than the AODV protocol in both PDR and throughput. The reason for the performance drop at 10 to 20 nodes is the varying source and destination nodes and placement barriers in the network topology. It is concluded that the energy consumption and end-to-end delay of AODV are higher than those of L-PEDAP when the number of nodes is high (link breakage occurs). Furthermore, the performance comparison can be extended with other mobility models. In future, the X-LLC protocol will be implemented for this scenario and compared with the existing protocols (AODV and L-PEDAP).


REFERENCES

[1] F. Bai, A. Helmy, A Survey of Mobility Modeling and
Analysis in Wireless Adhoc Networks in Wireless Ad Hoc and Sensor
Networks, Kluwer Academic publishers,2004 .
[2] S. Singh, M. Woo, and C.S. Raghavendra, Power-Aware
Routing in Mobile Ad Hoc Networks, Proc. Intl Conf. Mobile
Computing and Networking, pp. 181-190, 1998.
[3] C.E. Perkins and E.M. Royer, Ad-Hoc On-Demand Distance
Vector Routing, Proc. Second IEEE Workshop Mobile Computing
Systems and Applications, p. 90, 1999.
[4] C. Intanagonwiwat, R. Govindan, and D. Estrin, Directed
Diffusion: A Scalable and Robust Communication Paradigm for Sensor
Networks, Proc. Intl Conf. Mobile Computing and Networking, pp.
56-67, 2000
[5] S. Lindsey and C.S. Raghavendra, Pegasis: Power-Efficient
Gathering in Sensor Information Systems, Proc. IEEE Aerospace
Conf., Mar. 2002.
[6] V. Rodoplu and T. H. Meng, "Minimum energy mobile wireless
networks", IEEE Journal on Selected Areas in Communications, vol. 17,
no. 8, Aug. 1999, pp. 1333-1344.



























[7] L. Li and J. Y. Halpern, "Minimum-energy mobile wireless
networks revisited", Proceedings IEEE ICC01, Helsinki, Finland, June
2001, pp. 278-283.
[8] W. R. Heinzelman, J. Kulik, and H. Balakrishnan, "Adaptive
protocols for information dissemination in wireless sensor networks",
Proceedings ACM MobiCom '99, Seattle, WA, Aug.1999, pp. 174-185
[9] Y.C. Hu, A. Perrig, and D.B. Johnson. Ariadne: A Secure On-
Demand Routing Protocol for Ad Hoc Networks. In Proceedings of the
Eighth ACM International Conference on Mobile Computing and
Networking (MobiCom 2002), September 23-28, 2002.
[10] Y.C. Hu, D.B. Johnson, and A. Perrig. Secure Efficient
Distance Vector Routing Protocol in Mobile wireless Ad Hoc
Networks.In Proceedings of the Fourth IEEE Workshop on Mobile
Computing Systems and Applications (WMCSA 2002), June 2002.




























Proc. of the Intl. Conf. on Computer Applications
Volume 1. Copyright 2012 Techno Forum Group, India.
ISBN: 978-81-920575-8-3 :: doi: 10. 72927/ISBN_0768
ACM #: dber.imera.10. 72927

SEARCH BASED OPTIMIZATION FOR TEST DATA GENERATION USING
GENETIC APPROACHES
S.Kuppuraj
Department of Computer Science
Anna University
Chennai, India

S.Priya
Department of Computer Science
Anna University
Chennai, India

Abstract - In software testing, it is often desirable to find test
inputs that exercise specific program features. To find these
inputs by hand is extremely time-consuming, especially when
the software is complex. Therefore, many attempts have been
made to automate the process. Among those processes, this
paper addresses Genetic Algorithm (GA) which has been used
successfully to automate the generation of Test data for
software. The test data were derived from the program's
structure with the aim to traverse every statement in the
program. The investigation uses fitness functions based on the
branch (No of statements covered) distance. The input
variables are represented in binary code. The power of using
GAs lies in their ability to handle input data which may be of
complex structure. Thus, the problem of test data generation is
treated entirely as an optimization problem. The advantage of
GAs is that through the search and optimization process, test
sets are improved such that they are at or close to the input
subdomain boundaries.
Keywords-Software testing, Genetic Algorithm, Fitness
function, Optimization.
I. INTRODUCTION
1.1 overview
Almost 50% of the software production development
cost is expended in software testing. It consumes resources
and adds nothing to the product in terms of functionality.
Therefore, much effort has been spent in the automatic
software test data generation in order to significantly reduce
the cost of developing software. One objective of
software testing is to find errors and program structure
faults. However, a problem might be to decide when to
stop testing the software, e.g. if no errors are found or, how
long does one keep looking, if several errors are found. In
order to test software, test data have to be generated and
some test data are better at finding errors than others.
Therefore, a systematic testing system has to differentiate
good (suitable) test data from bad test (unsuitable) data, and
so it should be able to detect good test data if they are
generated.
Nowadays testing tools can automatically generate
test data that will satisfy certain criteria, such as branch
testing, path testing, etc. However, these tools have
problems, when complicated software is tested. A testing
tool should be general, robust and generate the right test
data corresponding to the testing criteria for use in the real
world of software testing. Therefore, a search algorithm of a
tool must decide where the best values (test data) lie and
concentrate its search there. It can be difficult to find correct
test data because conditions or predicates in the software
restrict the input domain, that is, the set of valid data. Test data
that are good for one program are not necessarily
appropriate for another program even if they have the same
functionality. Therefore, an adaptive testing tool for the
software under test is necessary. Adaptive means that it
monitors the effectiveness of the test data to the
environment in order to produce new solutions with the
attempt to maximize the test effectiveness. There are
number of test-data generation techniques that have been
automated earlier.

II. NEED OF TEST AUTOMATION
Testing is the most critical phase in the software
development life cycle. The main objective of testing is to
prove that the software product as a minimum meets a set of
pre-established acceptance criteria under a prescribed set of
environmental circumstances. There are two components to
this objective. The first component is to prove that the
requirements specification from which the software was
designed is correct. The second component is to prove that
the design and coding correctly respond to the requirements
.Correctness means that function, performance, and timing
requirements match acceptance criteria. Following are the
objectives that software testing follows:
A) Testing is a process of executing a program with
the intent of finding an error.
B) A good test case is one that has a high probability
of finding an as-yet undiscovered error
Principles of Software Testing:
a. All tests should be traceable to customer
requirements: The objective of system testing is to
uncover errors. It follows that the most severe defects are
those that cause the program to fail to meet its requirements.
b. Tests should be planned long before testing begins:
Test planning can begin as soon as the requirements model
is complete. Detailed definition of test cases can begin as
soon as the design model has been solidified. Therefore, all
tests can be planned and designed before any code has been
generated.
c. Testing should begin in the small and progress toward testing in the large: The first tests planned and executed generally focus on individual program modules. As testing progresses, testing shifts focus in an attempt to find errors in integrated clusters of modules and ultimately in the entire system.
d. Exhaustive testing is not possible: The number of path permutations for even a moderately sized program is exceptionally large. For this reason, it is impossible to execute every combination of paths during testing.
e. To be most effective, testing should be conducted by an independent third party: By "most effective" we mean testing that has the highest probability of finding errors. For this reason, the software engineer who created the system is not the best person to conduct all tests for the software.

III. EVOLUTIONARY ALGORITHMS AND
SIMPLE GENETIC ALGORITHM
Although the idea of using the principles of natural
evolution on problem solving dates back to the late 1940s, it
was during the 1960s that the first evolutionary algorithms,
the Evolution Strategies (ES) and the Evolutionary
Programming (EP), emerged. Together with the GA and the
Genetic Programming, they make up the four main branches
of the EAs. The main idea behind the EAs is to evolve a
population of individuals (candidate solutions for the
problem) through competition, mating and mutation, so that
the average quality of the population is systematically
increased in the direction of the solution of the problem at
hand. The evolutionary process of the candidate solutions is
stochastic and guided by the setting of adjustable
parameters. In an analogy with a natural ecosystem, in EA
different organisms (solutions) coexist and compete. The
more adapted to the design space will be more prone to
reproduce and generate descendants. On the other hand, the
worst individuals will have fewer or no offspring. In an
optimization problem, the fitness of each individual is
proportional to the value of the objective (cost) function,
also called fitness function. GAs are probably the most well-
known and used form of EAs.

IV. GENETIC ALGORITHM
A. Population and Generation
A population is like a list of several guesses (database).
It consists of information about the individuals. As in
nature, a population in GAs has several members in order to
be a healthy population. If the population is too small
inbreeding could produce unhealthy members. The
population changes from one generation to next. The
individuals in a population represent solutions. The
advantage of using a population with many members is that
many points in a space are searched in one generation. A
generation is said to have prematurely converged when the
population of a generation has a uniform structure at all
positions of the genes without reaching an optimal structure.
This means that the GA or the generation has lost the power
of searching and generating much better solutions.
Generally, small populations can quickly find good
solutions because they require less computational effort
(fewer evaluations per generation). However, premature
convergence is generally obtained in a small population and
the effect is that the population is often stuck on a local
optimum. This can be compensated by an increased
mutation probability, which would be similar to a random
testing method. Larger populations are less likely to be
caught by local optima, but generally take longer to find
good solutions and an excessive amount of function
evaluation per generation is required.
A larger population allows the space to be sampled
more thoroughly, resulting in more accurate statistics.
Samples of small populations are not as thorough and
produce results with a higher variance. Therefore, a method
must be set up in order to find a strategy, which will avoid
converging towards a non-optimum solution.
B. Chromosomal Representation
A chromosome can be a binary string or a more
elaborate data structure. The initial pool of chromosomes
can be randomly produced or manually created. GAs
population is an array of POP individuals. The
representation of the chromosome can itself affect the
performance of a GA-based function optimizer. There are
different methods of representing a chromosome in the
genetic algorithm, e.g. using binary, Gray, integer or
floating data types. Some of the encoding types are
discussed in this section.
Chromosomes could be:
Bit strings: (0101 ... 1100)
Real numbers: (43.2 -33.1 ... 0.0 89.2) Permutations of
element: (E11 E3 E7 ...E15)
Lists of rules: (R1 R2 R3 ... R22 R23)
Program elements: (genetic programming)
... Any data structure.
C. Fitness Function Evaluation
The fitness function directly affects the convergence speed of the genetic algorithm and whether it can be used to find the best solution. A good fitness function can improve the efficiency of the search for the best solution and achieve its purpose. Here, FF = the count of 1s in the given binary input.
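A minimal C++ sketch of this fitness function (our illustration, assuming the chromosome is held as a string of '0'/'1' characters):

    #include <string>

    // Fitness of a binary-coded individual: the number of 1s in its bit string.
    int fitness(const std::string& chromosome) {
        int ones = 0;
        for (std::string::size_type i = 0; i < chromosome.size(); ++i)
            if (chromosome[i] == '1') ++ones;
        return ones;
    }

    // Example: fitness("110110") returns 4.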
D. Selection Scheme
There are many different techniques which a genetic algorithm can use to select the individuals to be copied over into the next generation; listed below are some of the common methods for selection.
a. Elitist selection
b. Fitness-proportionate selection
c. Roulette-wheel selection
d. Tournament selection
e. Rank selection.
Roulette Wheel Selection - Parents are selected according to their fitness. The better the chromosomes are, the more chances they have to be selected. Imagine a roulette wheel on which all chromosomes in the population are placed. This can be simulated by the following algorithm.
[Sum] Calculate the sum of all chromosome fitnesses in the population - sum S.
[Select] Generate a random number r from the interval (0, S).
[Loop] Go through the population summing fitnesses from 0 - sum s. When the sum s is greater than r, stop and return the chromosome you have reached. Of course, step [Sum] is performed only once for each population.
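The three steps above can be sketched in C++ as follows (our illustration; the fitness values are assumed to be non-negative):

    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    // Roulette-wheel selection following the [Sum], [Select] and [Loop] steps:
    // an individual is picked with probability proportional to its fitness.
    int rouletteSelect(const std::vector<double>& fitness) {
        double S = 0;                                          // [Sum]
        for (std::size_t i = 0; i < fitness.size(); ++i) S += fitness[i];
        const double r = S * std::rand() / double(RAND_MAX);   // [Select]
        double s = 0;                                          // [Loop]
        for (std::size_t i = 0; i < fitness.size(); ++i) {
            s += fitness[i];
            if (s >= r) return static_cast<int>(i);
        }
        return static_cast<int>(fitness.size()) - 1;   // guard against rounding
    }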
E. Reproduction
Crossover - Crossover selects genes from the parent chromosomes and creates a new offspring. The simplest way to do this is to choose a crossover point at random, copy everything before the crossover point from the first parent and everything after the crossover point from the second parent. Mutation is a random process that changes individual genes of the offspring. Crossover can look like this (| is the crossover point); a code sketch of both operators is given after Table 3.3 below.

Table 3.2 Single point crossover

Chromosome 1:  11011 | 00100110110
Chromosome 2:  11011 | 11000011110
Offspring 1:   11011 | 11000011110
Offspring 2:   11011 | 00100110110

Table 3.3 Mutation

Original offspring 1:  1101111000011110
Original offspring 2:  1101100100110110
Mutated offspring 1:   1100111000011110
Mutated offspring 2:   1101101100110110
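A minimal C++ sketch of the two operators (our illustration; chromosomes are assumed to be strings of '0'/'1' characters), which reproduces the offspring of Table 3.2 for a crossover point of 5:

    #include <cstdlib>
    #include <string>

    // Single-point crossover: everything before the crossover point is copied
    // from the first parent, everything after it from the second parent.
    std::string crossover(const std::string& p1, const std::string& p2,
                          std::string::size_type point) {
        return p1.substr(0, point) + p2.substr(point);
    }

    // Mutation: flip each bit of the offspring with probability pMutation.
    void mutate(std::string& chromosome, double pMutation) {
        for (std::string::size_type i = 0; i < chromosome.size(); ++i)
            if (std::rand() / double(RAND_MAX) < pMutation)
                chromosome[i] = (chromosome[i] == '0') ? '1' : '0';
    }

    // Example: crossover("1101100100110110", "1101111000011110", 5) yields
    // "1101111000011110", i.e. Offspring 1 of Table 3.2.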

F. Parameters of GA
There are two basic parameters of GA - crossover
probability and mutation probability.
Crossover Probability: how often crossover will be performed. If there is no crossover, offspring are exact copies of the parents. If there is crossover, offspring are made from parts of both parents' chromosomes.
a. If crossover probability is 100%, then all offspring
are made by crossover.
b. If it is 0%, whole new generation is made from
exact copies of chromosomes from old population (but
this does not mean that the new generation is the same!).
Crossover is made in hope that new chromosomes will
contain good parts of old chromosomes and therefore the
new chromosomes will be better. However, it is good to
leave some part of old populations survive to next
generation.
Mutation Probability: how often parts of chromosome
will be mutated. If there is no mutation, offspring are
generated immediately after crossover (or directly copied)
without any change. If mutation is performed, one or more
parts of a chromosome are changed. If mutation probability
is 100%, whole chromosome is changed, if it is 0%, nothing
is changed. Mutation generally prevents the GA from falling
into local extremes. Mutation should not occur very often,
because then GA will in fact change to random search.

V. FITNESS FUNCTION
The fitness function chosen to evaluate each test data
candidate was Similarity, proposed by Lin and Yeh [Lin and
Yeh 2001]. This fitness function extends the Hamming
Distance and has the ability to quantify the distance between
two given paths. Given a target path and a current one, the
similarity between them is calculated from n-order sets of
ordered and cascaded branches for each of the paths being
compared. After that, for each order, the distance between
the set of each path is given by the symmetric difference
between them. Then, this distance is normalized to become
a real number, and the similarity between these n-order sets is found by subtracting this normalized distance from the value 1. Finally, the total similarity between two paths will
be the sum of the similarities of all n-order sets, each one
associated with a weighting factor which is usually found by
experience. For further details, take a look at the work of
Lin and Yeh [Lin and Yeh 2001]. Similarity was chosen for
some reasons. First, it is a fitness function specially
developed for path testing. Second, it does not take into
account if test data values are closer to boundaries.
The new fitness function focuses on path coverage: the fitness calculated for each test input is based on the number of statements it covers in the given input program.
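One possible reading of this fitness function is sketched below in C++ (our illustration; the names and the use of a target path are assumptions, not taken from the paper): the fitness of a test input is the number of statements of the chosen target path that appear in the execution trace obtained by running the instrumented program with that input.

    #include <cstddef>
    #include <set>
    #include <vector>

    // Coverage-based fitness (one possible interpretation): the number of
    // statements of the chosen target path that were actually executed when
    // the program under test was run with the candidate test data.
    int coverageFitness(const std::vector<int>& targetPath,
                        const std::set<int>& executedStatements) {
        int covered = 0;
        for (std::size_t i = 0; i < targetPath.size(); ++i)
            if (executedStatements.count(targetPath[i]) > 0) ++covered;
        return covered;
    }

    // For example, targetPath could be one of the statement paths of the
    // triangle program listed in Section VI below.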

VI. EXPERIMENTAL STUDIES
A. Triangle Classification Problem
The triangle classification program has been widely used in the research area of software testing.
Preprocessed triangle program:

1. #include<iostream.h>;
2. #include<conio.h>;
3. #include<math.h>;
4. void main();
5. {;
6. int a,b,c;
7. float s,area;
8. clrscr();
9. cout<<"Enter the sides of triangle:";
10. cin>>a>>b>>c;
11. if((a+b)>c &&(b+c)>a &&(a+c)>b);
12. {;
13. if(a==b && b==c);
14. {;
15. cout<<"It is equilateral triangle";
16. };
17. Else if(a==b || a==c || b==c);
18. {;
19. cout<<"\nIt is Isosceles triangle";
20. };
21. Else;
22. {;
23. cout<<"It is Scalene triangle";
24. };
25. s=(a+b+c)/2;
26. area=sqrt(s*(s-a)*(s-b)*(s-c));
27. cout<<"\nThe area is "<<area;
28. };
29. Else;
30. {;
31. cout<<"\nIt is not a triangle";
32. };
33. getch();
34. };

Path 1 - Not a Triangle:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 29, 30, 31, 32, 33, 34]

Path 2 - Equilateral Triangle:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 25, 26,
27, 28, 33, 34]

Path 3 - Isosceles Triangle:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 17, 18, 19, 20, 25, 26,
27, 28, 33, 34]

Path 4 - Scalene Triangle:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 21, 22, 23, 24, 25, 26,
27, 28, 33, 34]

B. Experimental Settings
Settings of standard genetic algorithm are as following:
(1) Coding: binary string
(2) Length of chromosome: 24bits
(3) Population size: from 40 to 1000
(4) Roulette wheel selection
(5) One-point crossover probability = 0.9
(6) Mutation probability = 0.04
The first generation of test data was generated from their
domain at random.
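For reference, the settings above can be gathered into a single configuration structure; the C++ struct below is only our illustration, and the values are those listed above:

    // Experimental settings of the standard genetic algorithm (Section VI.B).
    struct GASettings {
        int    chromosomeLength;   // 24 bits, binary coding
        int    populationSize;     // varied from 40 to 1000
        double crossoverRate;      // one-point crossover probability, 0.9
        double mutationRate;       // mutation probability, 0.04
        // selection scheme: roulette wheel; the first generation is random
    };

    // Example: GASettings settings = { 24, 40, 0.9, 0.04 };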
C. Experimental Results



Figure 1 Test Cases Vs Average Runs

Figure 1 shows the experimental results for the triangle program using the genetic algorithm and the modified genetic algorithm. The results show that the average number of runs taken for 1000 test cases by the modified genetic algorithm with the new fitness function is lower than that of the standard genetic algorithm.



Figure 2 Test Cases Vs Time (Sec)

VII. CONCLUSION AND FUTUREWORK
The new features of GAs make the test data generation process easy and find a nearly global optimum. Even though the GA yields the fittest test data after many iterations, within a limited number of iterations the solution derived by the GA may be trapped around a local optimum and fail to locate the required global optimum. The modified GA also has some limitations, so additional work is needed to increase the efficiency of the results. The experiments conducted so far are based on relatively small examples, and more research needs to be conducted with larger commercial examples. In order to overcome these issues, the GA will be used along with Tabu Search in future work.



REFERENCES

[1] J. H. Andrews, T. Menzies, F. C. H. Li, "Genetic Algorithms for Randomized Unit Testing", IEEE Transactions on Software Engineering, vol. 37, no. 1, January/February 2011.
[2] Wang Xibo, Su Na, "Automatic Test Data Generation for Path Testing Using Genetic Algorithms", Third International Conference on Measuring Technology and Mechatronics Automation, IEEE Computer Society, pp. 121-127, 2011.
[3] Shaukat Ali, Lionel C. Briand, Hadi Hemmati, Rajwinder K. Panesar-Walawege, "A Systematic Review of the Application and Empirical Investigation of Search-Based Test Case Generation", IEEE Transactions on Software Engineering, vol. 36, no. 6, November/December 2010.
[4] Jian Hua, Xuzhou, "An Approach to Automatic Generating Test Data for Multi-path Coverage by Genetic Algorithm", IEEE Sixth International Conference on Natural Computation (ICNC), pp. 27-33, 2010.
[5] Abhishek Rathore, Atul Bohara, Gupta Prashil R., Lakshmi Prashanth T S, Praveen Ranjan Srivastava, "Application of Genetic Algorithm and Tabu Search in Software Testing", ACM, pp. 774-787, 2010.
[6] Sangeeta Tanwer and Dharmender Kumar, "Automatic Test Case Generation of C Program Using CFG", IJCSI International Journal of Computer Science Issues, vol. 7, issue 4, no. 8, July 2010.
[7] Yong Chen, Yong Zhong, Tingting Shi, Jingyong Liu, "Comparison of Two Fitness Functions for GA-based Path-Oriented Test Data Generation", IEEE Fifth International Conference on Natural Computation, pp. 215-222, 2009.
[8] Praveen Ranjan Srivastava, Priyanka Gupta, Yogita Arrawatia, Suman Yadav, "Use of Genetic Algorithm in Generation of Feasible Test Data", ACM SIGSOFT Software Engineering Notes, vol. 34, no. 2, March 2009.
[9] Praveen Ranjan Srivastava, Tai-hoon Kim, "Application of Genetic Algorithm in Software Testing", International Journal of Software Engineering and Its Applications, vol. 3, no. 4, October 2009.
[10] Nai-Hsin Pan, Po-Wen Hsaio, Kuei-Yen Chen, "A Study of Project Scheduling Optimization Using Tabu Search", Engineering Applications of Artificial Intelligence, Elsevier, vol. 21, pp. 54-69, 2008.
[11] S. Berner, R. Weber, and R. K. Keller, "Enhancing Software Testing by Judicious Use of Code Coverage Information", Proc. 29th Int'l Conf. Software Eng., pp. 612-620, May 2007.
[12] A. Watkins and E. M. Hufnagel, "Evolutionary Test Data Generation: A Comparison of Fitness Functions", Software Practice and Experience, vol. 36, pp. 95-116, Jan. 2006.
[13] Nashat Mansour, Miran Salame, "Data Generation for Path Testing", ACM Software Quality Journal, vol. 12, pp. 121-136, 2004.
[14] B. Korel, "Automated Software Test Data Generation", IEEE Transactions on Software Engineering, 16(8): 870-879, August 1990.



Proc. of the Intl. Conf. on Computer Applications
Volume 1. Copyright 2012 Techno Forum Group, India.
ISBN: 978-81-920575-8-3 :: doi: 10. 72934/ISBN_0768
ACM #: dber.imera.10. 72934

Software designing for shrinkage testing station DIL 402C
Piotr Kumiski, Marek Hebda
Cracow University of Technology
Institute of Material Engineering,
Cracow, Poland
e-mail: mhebda@pk.edu.pl


Abstract - The purpose of this paper is to systematize the basic requirements placed on the programmer when developing applications for laboratory use, and to analyse the problems that emerged during the construction and implementation of algorithms for determining density variations in a sample on the basis of dilatometric data from tests of sintered metal powder samples.

Key words: Software designing, dilatometry analysis, sintering
shrinkage
I. INTRODUCTION
In many cases the most useful tools for laboratory tests are integrated software platforms for analysis and data processing. However, there are situations in which the most convenient and effective solution is a dedicated application closely matched to the requirements of the study, one which reduces the time devoted to research and supports the laboratory team without requiring knowledge of advanced techniques for computer processing of the data gathered during laboratory trials.
The article is divided into two parts: the first presents the theoretical basis that can be drawn on during project implementation; the second is a detailed description of the steps taken to build a specific project.

II. THEORETICAL BASIS
With the advent of numerous RAD programming environments, writing a fully functional windowed application has become possible not only for experienced programmers but also for ordinary users. Writing powerful applications has become easier, but it still requires some experience and extensive knowledge of topics related to ergonomic design of the "front-end". People trying to write their own applications, which will benefit others besides themselves, should note some basic principles:
What is obvious to you need not be equally clear to the end user.
If the source code becomes unreadable to you, it will be all the more incomprehensible to someone who later tries to modify it.
When using a windowed interface, remember that the form should not have unnecessary "bells and whistles"; they often reflect negatively on the functionality and usability of the application.
Simplicity does not indicate a lack of skill; even the simplest projects that facilitate the work and automate the analysis of large amounts of data can greatly shorten the period "from research to results".
The basic tools for any programmer are a pen and paper; investigate all the possibilities on paper first. Paper will accept everything, whereas a user upset and confused in a maze of options will not.
Choose the environment best suited to the task in front of you, taking into consideration the need for rapid access to files, the ease of building an application framework, whether the information will be collected in real time or read from removable memory, and your own skills and the flexibility of the tool.
With these rules in mind, you can face your project without having to worry about struggling with the development environment interface, inventing solutions in the middle of the task, or constantly wrestling with modified procedures and functions as well as with the variables that condition the steps of individual algorithms.
Laboratory software is a separate group of tasks requiring a specific approach; often it is an application that operates on large amounts of data, requiring the programmer to think in terms of multi-dimensional data structures, to easily create complex data types, etc.
Even small projects can provide many interesting insights and uncover difficult issues that are hard to reach in a purely theoretical way.
A. Software design for converting thermal shrinkage into density variation
The aim of this project was to write software that, on the basis of thermal shrinkage, calculates the density changes in sintered metal powder samples.
Based on the research work of Maca et al. [1], it was
possible to use ready-made mathematical tools to achieve
the project objective.
It can be assumed that during the cooling of the
sample, there is no sintering process, and therefore the
cooling process is accompanied only by a change in volume; following this, the coefficient of thermal expansion (CTE) can be calculated on the basis of the recorded expansion:

$CTE = \dfrac{\varepsilon_p - \varepsilon_{T_{max}}}{(T_p - T_{max}) \cdot 100}$    (1)

where:
$\varepsilon_p$ - shrinkage after cooling (%),
$\varepsilon_{T_{max}}$ - shrinkage at the end of the process (%),
$T_p$ - temperature after cooling (°C),
$T_{max}$ - temperature at the end of the process (°C).

The CTE value allows the calculation of the difference between the thermal dilatation and the dimensional changes at a given point in time, making it possible to obtain the so-called technological shrinkage:

$\varepsilon_{techn}(t,T) = \varepsilon(t,T) - 100 \cdot CTE \cdot (T - T_p)$    (2)

where:
$\varepsilon(t,T)$ - thermal shrinkage,
$t$ - time,
$T$ - temperature.

To transform the time or temperature dependence into density curves, you need to know the value of the density before sintering ($\rho_{gd}$) or the density after sintering ($\rho_f$). Knowing this information, the following equation can be written:

$\rho(t,T) = \rho_{gd} \cdot \dfrac{100^3}{(100 - \varepsilon_{techn}(t,T))^3}$    (3)

Very often after the sintering process the pores are closed; in this case the density measurement is easier and more precise than the measurement of the initial density, and there is a small difference in the mathematical notation:

$\rho(t,T) = \rho_f \cdot \dfrac{(100 - \varepsilon_{T_{max}})^3}{(100 - \varepsilon_{techn}(t,T))^3}$    (4)

These dependencies, (3) and (4), describe isotropic dimensional changes.
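As a minimal illustration of how equations (1)-(4) can be turned into code (this sketch is not the authors' implementation, and the input values in the example are made up), the density calculation might look as follows:

    # Illustrative sketch: CTE, technological shrinkage and density from
    # dilatometric data, following Eqs. (1)-(4). rho_gd (density before
    # sintering) and rho_f (density after sintering) are user-supplied values.

    def cte(eps_p, eps_tmax, t_p, t_max):
        # Eq. (1): coefficient of thermal expansion from the cooling segment
        return (eps_p - eps_tmax) / ((t_p - t_max) * 100.0)

    def technological_shrinkage(eps, temp, cte_value, t_p):
        # Eq. (2): remove the thermal dilatation component from the measured shrinkage
        return eps - 100.0 * cte_value * (temp - t_p)

    def density_from_green(rho_gd, eps_techn):
        # Eq. (3): density curve referenced to the density before sintering
        return rho_gd * 100.0 ** 3 / (100.0 - eps_techn) ** 3

    def density_from_final(rho_f, eps_techn, eps_tmax):
        # Eq. (4): density curve referenced to the density after sintering
        return rho_f * (100.0 - eps_tmax) ** 3 / (100.0 - eps_techn) ** 3

    # Example with made-up numbers: shrinkage in %, temperatures in degrees C
    c = cte(eps_p=-1.20, eps_tmax=-0.85, t_p=30.0, t_max=1250.0)
    eps_t = technological_shrinkage(eps=-0.95, temp=600.0, cte_value=c, t_p=30.0)
    print(density_from_final(rho_f=7.1, eps_techn=eps_t, eps_tmax=-1.20))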






The coefficient of shrinkage anisotropy can be defined by the ratios between the transversal (radial) and longitudinal (axial) strains:

$K_1 = \dfrac{\varepsilon_a}{\varepsilon_l}$,  $K_2 = \dfrac{\varepsilon_b}{\varepsilon_l}$    (5, 6)

where:
$\varepsilon_a$, $\varepsilon_b$ - the final transversal (radial) sample shrinkages (%),
$\varepsilon_l$ - the final longitudinal (axial) sample shrinkage (%).

For the anisotropic dimensional changes, the equation describing the dependence of the density on time and temperature takes the form:

$\rho(t,T) = \rho_f \cdot \dfrac{(100 - \varepsilon_{T_{max}})(100 - \varepsilon_{T_{max}} K_1)(100 - \varepsilon_{T_{max}} K_2)}{(100 - \varepsilon_{techn}(t,T))(100 - \varepsilon_{techn}(t,T) K_1)(100 - \varepsilon_{techn}(t,T) K_2)}$    (7)

Having already defined mathematical tools, you can go
to the next step in the task, building the application
skeleton.
The input data file contains the columns of values in a
structured sequence describing the temperature, time and
value of shrinkage at specific temperature.
Even in the implementation of such a simple task,
drawing a solution diagram is useful in implementing
appropriate solutions in the selected programming
language (Figure 1).








Figure 1. Typical scheme of building applications.

As the first step it was decided that the best solution would be to open the input file read-only (Figure 2); the results will be recorded to a file whose name is preset by the user.














[Figure 1 flowchart boxes: Read data from file -> Data analysis -> Save the results]

























Figure 2. An example of an algorithm for handling documents in storage.
The most important decision in this step was to determine whether the application would operate on the file on disk the whole time, or whether all the data would be loaded into memory. The input file format (Fig. 3) suggested the second option: an algorithm should be written that splits each line at the mark separating the individual values and assigns them successively to a structure containing the types: real, real, real, integer.

116.48100; 9.15000; 0.12648;1
116.68300; 9.16667; 0.12686;1
116.89000; 9.18333; 0.12721;1
117.08400; 9.20000; 0.12759;1
117.28200; 9.21667; 0.12795;1
117.48000; 9.23333; 0.12832;1
117.69400; 9.25000; 0.12870;1
117.88500; 9.26667; 0.12907;1
118.07900; 9.28333; 0.12944;1
118.27300; 9.30000; 0.12979;1
118.48600; 9.31667; 0.13017;1
Figure 3. Example of the file format saved by the software supplied with the DIL 402C unit.
The file format shown above suggests creating a multi-dimensional array of values in memory, allowing free movement within a given field using suitably typed indices.
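A minimal sketch of such a loading routine is given below; it assumes the semicolon-separated format of Fig. 3, the function and file names are hypothetical, and it is not the authors' code.

    # Illustrative sketch: load the records shown in Fig. 3
    # (temperature; time; shrinkage; segment) into memory as tuples
    # of types (real, real, real, integer).

    def load_dilatometer_file(path):
        records = []
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                parts = [p.strip() for p in line.split(";")]
                if len(parts) != 4:
                    continue  # skip malformed lines instead of aborting the whole read
                temperature, time_min, shrinkage = (float(p) for p in parts[:3])
                segment = int(parts[3])
                records.append((temperature, time_min, shrinkage, segment))
        return records

    # Usage (hypothetical file name):
    # data = load_dilatometer_file("sample_dil402c.txt")
    # temperatures = [r[0] for r in data]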
After loading the files, we can go to the next step: data preparation and analysis. Here the task becomes more complicated because some values are not listed in the file, so appropriate edit boxes must be provided in the main window for the user to enter these values (Figure 4). In this case error control must also be remembered (situations in which errors may occur, e.g. division by 0 or the admissibility of negative numbers).



Figure 4. Edit field for entering the value by user.

When analyzing the data loaded into memory, the only "difficulty" is the appropriate use of conditional statements which, inside suitably chosen loops, perform the required calculations. In the course of preparation, a simple mechanism of "guiding the user" was used, i.e. further options are activated step by step as the program runs, which excludes confusion and disorientation in selecting the appropriate option.
Data visualization was prepared using a standard component provided by the package (Figure 5); its ease of use and the possibilities it offers were sufficient for this project.



Figure 5. Component of a development environment, used
for data visualization of AstaloyCrL with addition of 1 wt.% PCA and 1
wt.% silicon carbide after 10 h of mechanical alloying process, sintered
in an atmosphere of hydrogen.

[Figure 2 flowchart boxes: Opening the file -> is opening possible? If No: send a reading error with a description. If Yes: dynamically allocate memory for data, enter the file name in the dialog box title, information for the user.]
Representative examples of plots recorded during the dilatometric analyses are shown in Fig. 5. Astaloy CrL powder with the addition of 1 wt.% PCA and 1 wt.% silicon carbide after 10 h of the mechanical alloying process was sintered in a pure hydrogen atmosphere. During heating to the isothermal sintering temperature, Astaloy CrL showed a continuous thermal expansion with a sharp shrinkage at about 810 °C as a consequence of the sintering process. An opposite effect for the investigated compositions was detected for the density changes.

The chart generated by the program allows the plotted area to be zoomed seamlessly in order to trace the percentage variation of density in detail. The project is characterized by the simplicity and clarity that should be absolutely required of this type of application: it dispenses with unnecessary features, and replacing a sophisticated data visualization package with embedded solutions allows the data to be analyzed quickly and easily, giving satisfactory results.

III. CONCLUSION
In summary, even such simple projects can provide a lot of experience. In daily laboratory work programming skill is needed, whether it is the simple use of VB macros in software of the Excel type or advanced stand-alone applications.

IV. ACKNOWLEDGMENT
The work was supported by the Polish Ministry of
Science and Higher Education within grant No N N508
393237.

REFERENCES
[1] K. Maca, V. Pouchly, A. R. Boccaccini, "Sintering Densification Curve - A Practical Approach for Its Construction from Dilatometric Shrinkage Data", Science of Sintering, vol. 40, 2008, pp. 117-122.




Proc. of the Intl. Conf. on Computer Applications
Volume 1. Copyright 2012 Techno Forum Group, India.
ISBN: 978-81-920575-8-3 :: doi: 10.72850/ISBN_0768
ACM #: dber.imera.10.72850


Web Application Testing Using Inheritance Method
(Testing a Web Application Using a single Object)


C.D.RAJAGANAPATHY
Dept of Computer Applications
Vidyaa Vikas College of Engineering
and Technology,
(ANNA UNIVERSITY)
Tiruchengode, Tamilnadu, India
cdrajaganapathy@gmail.com

T.R.SRINIVASAN
Dept of Information Technology
Vidyaa Vikas College of Engineering
and Technology,
(ANNA UNIVERSITY)
Tiruchengode, Tamilnadu, India
srini_trs@yahoo.com

S.GOPIKRISHNAN
Dept of Computer Science and
Engineering
The Kavery Engineering College,
(ANNA UNIVERSITY)
Mecheri, Tamilnadu, India
gopikrishnanme@gmail.com


Abstract - In this paper, we propose an inheritance technique for web application testing. While the inheritance-based class implementation greatly reduces the complexity of web applications, a four-level class approach can be employed to perform structural testing on them. In this approach, data flow analysis is performed by a Functional testing class, a Verification and Validation testing class, an Object testing class and a Base testing class, from the low abstraction level to the higher abstraction level. Each test class in the approach takes charge of the testing at one abstraction level for a particular type of web document or object.

I. INTRODUCTION

Web applications are those applications that
communicate with clients via web protocols such as
HTTP, WSDL or SOAP for example. Their
preponderance and ubiquity are direct results of the
pervasiveness of the World Wide Web as well as the
increasing utilization by developers of web browsers to
host their applications. Web browsers provide the
developer with a unique opportunity to exploit the web
browsers inherent capability to relieve them of the
burden of having to cater for the various operating
systems or hardware platforms available to the intended
users of the application.
Today web applications run the gamut from dating sites to gambling sites, ecommerce, on-line banking, airline bookings, and corporate websites, right through to applications that are not meant for general access but nevertheless use web browsers to be hosted. Of this type there are many and the number is growing. At DDI Health many of our main products are web applications, to name a few: FIT (Filmless Image Technology), a filmless digital radiological imaging solution; Pathology Web Portal, a portal that permits secure access over the web for medical practitioners to view patient pathology reports; and DDI's RICS (pathology Request Image Capture System). Many
more examples may be cited. Suffice it to say that the
realm of web applications is very rich and growing.

II. THE DIFFICULTY OF TESTING WEB
APPLICATIONS

Developing web applications essentially is no
different to any other software application; you start
with ideas, mock-ups and requirements and in a
managed fashion work towards a working solution.
Testing web applications should be just as routine a
task. You start with requirements and mock-ups,
decide on the priorities for testing, design the tests,
execute the tests and report on the outcomes. You go
through this cycle two, three, or more times until you
reach your quality objectives (if you are lucky) or you
simply run out of time.
That is an ideal. The reality can be much different
and more often than not is. All of a sudden what you
thought were requirements agreed and understood
begin to change, and not just in the ones or twos but
eventually in avalanches as the end users or major
stakeholders begin to question and challenge original
concepts and thoughts. What was originally agreed
would be a link now changes to a button (fair
enough). Then someone decides that alteration to
signing in is required for legal reasons. Now signing
or logging in requires acknowledgement of legal
terms and conditions and further requires a check box
feature that must be checked before the Login
button is enabled as absolute proof of
acknowledgment. OK, that is good, we can handle
that. Another change arrives soon after; now it is
required that a unique error message for each error
type that may arise be displayed instead of the
general error message that simply alerted the user to
errors in their entries. The error messages now will be
unique to each entry field and will uniquely reflect the type of error needing correction; it shall be presented in red, 12-point Calibri font, with the field in question highlighted by a red background. We can test for that; it just means a lot more combinations than we figured on, but we will attend to that, just more tests to design and execute.
There is nothing strange in any of this; these are the day-to-day challenges that face a tester and the test schedule, and undermine the test plan.




Web applications are a strange beast indeed; they
are a hybrid between a web site and a regular
application; they provide feature rich capabilities to
potentially millions simultaneously. Web applications
provide the user with an extremely rich experience
which is way beyond what mere websites alone can
offer. Consider an ecommerce web application, a
Florist web site say Interflora. You as a user are
at liberty of storing your billing and shipping details,
further you may specify multiple payment methods
which may include personal and corporate credit
cards. You may wish to have multiple orders placed
on a regular basis to different recipients. This suggest
that within the Interflora web application a highly
sophisticated address book and order processing
application is built in.
Consider the complexity of testing robustly just
this feature. Superimpose on this problem all the
issues related to cross browser compatibility,
aesthetics and usability issues and you begin to
appreciate the magnitude of the test problem.
Get it wrong, miss something, or fail to conduct
adequate progression and regression testing and real
problems may occur real fast out in production. The
issue is that defects that are missed in testing, upon
release of the web application, are immediately
available to and are at the mercy of a very large,
unforgiving and vocal user base. From thousands
to millions of users will be exposed to and use the
web applications in ways that a test department of
five, ten, twenty or a hundred may not be able to
design and execute enough tests enough times for.
Simply the number of potential usage patterns
immediately and simultaneously imposed on the web
application will exceed the number of patterns
derived by the test team in the project schedule
available to them. It can be an overwhelming
experience.
Adequate testing relies upon adequate test
generation, execution and maintenance. Applications
that have exposure to very large user bases are the
most challenging problems facing software test
organizations. These organizations must contend with
software complexity which is ever increasing whilst
test schedules are ever shrinking. In a standalone
application if it goes wrong the cries from the field
wont be heard for a while short while, but a while
none the less; they may even arrive in a steady
stream. On the other hand web applications gone
wrong in the wild have the potential and capacity to
release a deluge of complaint and dissatisfaction
almost immediately, the worst of which is revenue
loss to the stakeholders.

III. RELATED STUDIES
In recent years, several approaches and tools have been proposed for web application testing. However, because of the complexity and heterogeneous representations in web applications, these approaches are suitable only for parts of web applications. Further extensions are necessary for these approaches to decrease the complexity of web applications and provide greater support for the heterogeneity in web applications.
Tonella and Ricca extend traditional approaches and propose a 2-layer model for web application testing. In their model, navigation models and control flow models are integrated to analyze web applications at the high and low abstraction levels, respectively. With these two models, white-box approaches can be used to test web applications, at both the high level and the low level. However, there exists a gap between the abstraction levels of the two models, because navigation models are designed for navigation testing and are adequate for system testing, while control flow models without extension are more suitable for unit testing. Kallepalli and Tian present a statistical approach with unified Markov models to test web applications. Instead of analyzing web applications directly, this model collects artifacts such as control flow, data flow, workload, error information, etc. from log files which are generated during the execution of web applications. Based on the collected artifacts, this approach can clarify the users' focus and guide the testing. The reliability of the application can be obtained from the error information. One disadvantage of this approach is that it is inadequate for a new web application because it requires valid log files. Additionally, the preciseness of the approach is limited by the log files. Errors that cannot be recorded in log files, such as incorrect information in generated pages, cannot be detected by this approach.
Di Lucca et al. report another testing approach for web applications. In their approach, a decision table is applied to generate the functional test cases for single units in web applications. Integration testing is completed after the system graph is built. A support tool named WAT (Web Application Tool) is implemented to assist in test case generation, test case execution and test case result evaluation. The major problem with this approach is that the decision table analysis and test case generation are totally based on human interaction. The approach itself does not provide any means to decrease the complexity of web applications; hence it has to rely on the experience of developers.
Kung et al. propose an object-oriented model, the web test model (WTM), to model and analyze web applications from the object, behavior and structure aspects, respectively. To construct the WTM they define a set of concepts for web applications. After that, diagrams such as navigation state diagrams, inter-procedural control flow graphs, object control flow graphs, etc. can be built to model web applications. Based on these diagrams, a four-level data flow testing approach is proposed. A potential problem of these approaches is that they are generic approaches and need further extensions to




support the heterogeneous representations in web applications.

Different from previous approaches, an inheritance technique is a good solution for decreasing the complexity of web applications. The approach, based on the Agent Based Testing (ABT) model of rational agents, describes a set of abstract classes and new diagrams to support the generation of test classes. When testing web applications, a specific test class can be generated from these abstract classes for each type of representation in the application, such that testing can be finished by the collaboration of a set of test classes. Each test class takes charge of testing a particular type of web document or object by certain testing methods. Testing web applications is completed by the co-operation of a set of agents. Test agents are rooted in the ABT model, a model based on the theory of rational actions in humans which has been successfully deployed in many sophisticated applications. A test class at least contains:
applications. A test class at least contains
Beliefs that represent the test artifacts related
to a specific testing approach and a particular type of
web document or object under test, such as source
code, specification, test model , test cases, test result,
etc.
Goals that describe the test criteria, such as
requirements on minimum coverage in data flow
analysis.
Plans that depicts the actions to be
performed to achieve the goals. They include actions
sequences, cost, preconditions, etc
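As a rough illustration of this class structure (a sketch under assumed names only, not the implementation described in this paper), the abstract test class and the four derived levels could be expressed as follows:

    # Illustrative sketch: an abstract test class holding beliefs, goals and
    # plans, with four concrete levels derived from it by inheritance.
    from abc import ABC, abstractmethod

    class AbstractTestClass(ABC):
        def __init__(self, beliefs, goals, plans):
            self.beliefs = beliefs      # test artifacts: source, specs, test model, cases, results
            self.goals = goals          # test criteria, e.g. minimum def-use coverage
            self.plans = plans          # action sequences, cost, preconditions

        @abstractmethod
        def run_tests(self, target):
            """Execute this level's testing on a web document or object."""

    class FunctionTestClass(AbstractTestClass):
        def run_tests(self, target):
            return f"data-flow testing of individual functions in {target}"

    class VerificationValidationTestClass(AbstractTestClass):
        def run_tests(self, target):
            return f"testing invocation relations among functions in {target}"

    class ObjectTestClass(AbstractTestClass):
        def run_tests(self, target):
            return f"testing objects produced by functions in {target}"

    class BaseTestClass(AbstractTestClass):
        def __init__(self, beliefs, goals, plans):
            super().__init__(beliefs, goals, plans)
            # the high-level class creates the lower-level test objects it delegates to
            self.children = [FunctionTestClass(beliefs, goals, plans),
                             VerificationValidationTestClass(beliefs, goals, plans),
                             ObjectTestClass(beliefs, goals, plans)]

        def run_tests(self, target):
            results = [child.run_tests(target) for child in self.children]
            results.append(f"application-level testing of global/shared data in {target}")
            return results

    print(BaseTestClass({}, {"min_def_use_coverage": 0.8}, []).run_tests("login page"))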
Based on the above class environment, we extend the previous structural testing approach for web applications. This approach analyses the data flow information in web applications and performs four-level testing of web applications. The four levels include the Function class, the Cluster class, the Object class and the Base testing class, from the low abstraction level to the high abstraction level. Each level of testing on a particular type of web document or object is managed by a different test agent. In the process of web application testing, a high-level test class can create a set of low-level test objects and ask them to complete the corresponding low-level testing. The high-level test class itself focuses on the comparatively high-level testing that cannot be finished by the low-level test objects. Consequently, a high-level testing task may be completed by the cooperation of a set of low-level test objects and high-level test objects.
Similar to the data flow analysis methods for traditional application testing, data flow analysis for web applications requires adequate test models and proper test criteria. A control flow graph annotated with data flow information is a generally accepted approach for traditional application modeling. In web application testing, a control flow graph has to be extended to handle new features such as novel data handling mechanisms, heterogeneous representations, etc.
Test criteria in data flow analysis include all-definitions, all-uses, all-c-uses, all-p-uses, etc. In practice, it may be infeasible to cover all the achievable def-use pairs in web application testing. Alternatively, an acceptable percentage of coverage of def-use pairs is a good criterion for web application testing.

IV. OBJECTIVE
A class can be viewed as an autonomous object that is designed and implemented to complete certain tasks by co-operating with the environment and other classes.
Specific test objects are generated from the abstract classes. Each test class takes charge of testing a particular type of web document by certain testing methods. Testing web applications is completed by the co-operation of a set of class objects. This approach analyses the data flow information in web applications and performs four-level testing of web applications. The four classes are the Function class, the Verification and Validation class, the Object class and the Base test class, from the low level to the high level. Each level of testing on a particular type of web document or object is managed by different test objects.

V. FUNCTION CLASS
In web applications, the function class is administrated by function class objects. Each test object takes charge of a particular type of individual function based on the information about def-use pairs in the function. Functions may be implemented in a particular programming language (Java, C#, etc.) or a specific scripting language (JSP, ASP, Perl, etc.). A specific test object includes:
Beliefs: test artifacts for a particular type of function, such as source code, specification, the test model for the target functions, specifications of test cases, etc. The test model for each target function can be depicted by a control flow graph marked with data flow information. Nodes in the control flow graph represent statements, and edges represent the control transfer.






Fig. 5.1. Showing only the DOM Validator and TestSuite
Generator as examples of possible Testing implementations.


VI. VERIFICATION AND VALIDATION
CLASS
The Verification and Validation class tries to test a set of functions with invocation relations in web applications. In addition to direct function calls, functions in web applications can be invoked by submission requests from client pages.

Fig 6.1 Functional Description of Verification and Validation Class

VII. OBJECT CLASS

The Object class is employed to test the objects produced by different functions within a web document. User interaction may trigger different invocation sequences of functions within a web document or an object.


Fig 7.1 Functional Description of Object Class

VIII. BASE TESTING CLASS
The Base testing class is used to test the application's global variables. In web applications, thousands of clients may execute the same web application at the same time, and data can be shared and transferred among these clients. This class holds a set of plans for the target application: collect test artifacts for the target application; generate test objects for the Function class, the Verification and Validation class and the Object class and perform the corresponding level of testing; set up the test criteria for the target application; execute test cases to verify the target application; and determine whether the test criteria are satisfied; if not, repeat the same process, else terminate.




Fig 8.1 Functional Description of Base Test Class

IX. FEATURES
When compared with other approaches, the inheritance-based testing approach can easily obtain




flexibility and extensibility by introducing new types of test classes. Thus the complexity of web applications brought by heterogeneous representations and new data flow mechanisms can be greatly decreased. Based on the inheritance-based technique, a four-level data flow testing approach, as well as the corresponding four levels of test objects, can be employed to test web applications.

X. CONCLUSION
We present an inheritance-based testing approach for web applications. Compared with other approaches, the inheritance-based testing approach can easily obtain flexibility by introducing new types of test classes. Thus the complexity of web applications brought by heterogeneous representations and new data flow mechanisms can be greatly decreased.

XI. FUTURE WORK
The basic features of the framework have been implemented. Moreover, integrating more testing approaches, such as navigation testing, object state testing, statistical testing, etc., is still necessary to arrive at a systematic testing approach for web applications.

XII. ACKNOWLEDGMENTS
This paper is based on and extends the future work of other existing systems; any new research ideas arising from it are warmly welcomed. We would also like to thank the anonymous reviewers for their valuable and constructive comments.

REFERENCES

[1] A. G. Lucca and A. R. Fasolino, Testing
Web-Based Applications: The State of the
Art and Future Trends, Information and
Software Technology, Vol. 48, No. 12,
2006, pp. 1172-1186.
[2] H. Liu and H. Beng Kuan Tan, Automated
verification and test case generation for
input validation, in Proceedings of the 2006
international workshop on Automation of
Software test. 2006, ACM: Shanghai, China.
p. 29-35.
[3] S. Artzi, A. Kiezun, J. Dolby, F. Tip, D.
Dig, A. Paradkar, and M. D. Ernst. Finding
bugs in dynamic web applications. In Proc.
Int. Symp. On Software Testing and
Analysis (ISSTA'08), pages 261-272.
ACM, 2008.
[4] G. Lucca and A. R. Fasolino, Web
Application Testing, Web Engineering,
Springer, Berlin, Chapter 7, 2006, pp. 219-
260. doi:10.1007/3-540-28218-1_7
[5] A. Bertolino, A. Polini, P. Inverardi and H. Muccini,
"Towards Anti-Model-based Testing" (extended abstract),
in Proc. International Conference on Dependable
Systems and Networks (DSN 2004), Florence,
28 June - 1 July 2004, pp. 124-125.
[6] A. Bertolino. Software testing research:
Achievements, challenges, dreams. In ICSE
Future of Software Engineering (FOSE07),
pages 85-103. IEEE Computer Society,
2007.
[7] S. Murugesan, Web Application
Development: Challenges and the Role of
Web Engineering, J. Karat and J.
Vanderdonckt, Eds., Web Engineering,
Modelling and Implementing Web
Applications, Springer, Berlin, 2008, pp. 7-
32.
[8] Software Engineering Book of Knowledge
IEEE 2004 Version
[9] R. V. Binder. Testing object-oriented
systems: models, patterns, and
tools. Addison-Wesley, 1999.

[10] L. A. Clarke and D. S. Rosenblum. A
historical perspective on runtime
assertion checking in software
development. ACM SIGSOFT Software
Engineering Notes, 31(3):25-37, 2006.
[12] L. de Alfaro. Model checking the world
wide web. In Proceedings
of the 13th International Conference on
Computer Aided Verification
(CAV'01), pages 337-349. Springer-Verlag,
2001.
[13] Y. Qi, D. Kung and E. Wong, An Agent-
Based Data-Flow Testing Approach for Web
Applications, Journal of Information and
Software Technology, Vol. 48, No. 12,
2006, pp.1159-1171.
doi:10.1016/j.infsof.2006.06.005
[14] J. Pava, C. Enoex and Y. Hernandez, A
Self-Configuring Test Harness for Web
Applications, Proceedings of the 47th
Annual Southeast Regional Conference,
South Carolina, 2009, pp. 1-6.
doi:10.1145/1566445.1566533
[15] H. Raffelt, T. Margaria, B. Steffen and M.
Merten, Hybrid Test of Web Applications
with Webtest, Proceedings of the 2008
Workshop on Testing, Analysis, and
Verification of Web Services and
Applications, Seattle, 2008, pp. 1-7.
doi:10.1145/1390832.1390833
[16] P. Tonella and F. Ricca, A 2-Layer Model
for the White- Box Testing of Web
Applications, Proceedings of the 6th IEEE
International Workshop on the Web Site
Evolution, Chicago, 11 September 2004, pp.
11-19.


Proc. of the Intl. Conf. on Computer Applications
Volume 1. Copyright 2012 Techno Forum Group, India.
ISBN: 978-81-920575-8-3 :: doi: 10. 72941/ISBN_0768
ACM #: dber.imera.10. 72941

An Improved Approach to F-COCOMO using PI Membership Function





A.K. Soni
Department of IT
School of Engg. and Tech.
Sharda University,
Greater Noida, India

Rachna Soni
Department of C. S. and Applications
DAV College for Girls,
Yamunanagar, India

Anupama Kaushik
Department of IT
Maharaja Surajmal Institute of Tech.
Delhi, India




Abstract Software cost estimation predicts the amount of
effort and development time required to build a software
system. It is one of the most critical tasks and it helps the
software industries to effectively manage their software
development process. There are a number of cost estimation
models. The most widely used model is Constructive Cost
Model (COCOMO). All these models are centred on using the
future software size as the major determinant of effort.
However, estimates at the early stages of the development are
the most difficult to obtain due to imprecise and limited
details available regarding the project. Fuzzy logic based cost
estimation models address the vagueness and imprecision
present in these models to make reliable and accurate
estimates of effort. The aim of this paper is to provide a new
approach to implement F-COCOMO using PI membership
function and compare its efficiency in effort estimation with
other fuzzy membership functions. Size, mode of
development and all the 15 cost drivers of intermediate
COCOMO model are fuzzified in this approach.

Keywords Project management, Software Effort
Estimation, Fuzzy Logic, COCOMO model, Membership
Functions
I. INTRODUCTION
The software industry is very competitive, and accurate cost estimation helps to establish a position in the market. It can help industries to better analyse the feasibility of a project and to effectively manage the software development process [1]. The effort prediction aspect of software cost estimation is concerned with predicting the person-hours required to accomplish the task. The development of effective software effort prediction models has been a research target for quite a long time [2].
In the last few decades many software effort estimation
models have been developed. The algorithmic models use
a mathematical formula to predict project cost based on the
estimates of project size, the number of software engineers,
and other process and product factors [3]. These models
can be built by analysing the costs and attributes of
completed projects and finding the closest fit formula to
actual experience. COCOMO (Constructive Cost Model),
is the best known algorithmic cost model published by
Barry Boehm in 1981 [4]. It was developed from the
analysis of sixty three software projects. Boehm provided
three levels of the model called Basic COCOMO,
Intermediate COCOMO and Detailed COCOMO. These
conventional approaches lack effectiveness and robustness in their results. These models require inputs which are difficult to obtain during the early stages of a software development project. They have difficulty in modelling the inherent complex relationships between the contributing factors, are unable to handle categorical data, and lack reasoning capabilities [5]. The
limitations of algorithmic models led to the exploration of
the non-algorithmic models which are soft computing
based.
Non-algorithmic models for cost estimation encompass methodologies based on fuzzy logic (FL), artificial neural networks (ANN) and evolutionary computation (EC). These methodologies handle real-life situations by providing flexible information processing capabilities.
Fuzzy logic based cost estimation is most appropriate
when vague and imprecise information is to be handled.
The first realization of the fuzziness of several aspects
of COCOMO was that of Fei and Liu [6] called F-
COCOMO. The reason for fuzziness to be considered in
COCOMO lies in the fact that the division of evaluation
and rating of some involved factors, which have important
influence upon development cost, are vague and indistinct.
Fuzzy logic based cost estimation allows inputs and
outputs to be represented linguistically and hence
contributing to more accuracy in effort estimation [7].
The rest of the paper is organized as follows: Section 2
gives an introduction on fuzzy logic and fuzzy logic
systems. Section 3 discusses the proposed framework.
Section 4 gives the algorithm used to implement the
proposed framework. Section 5 provides the experimental
results. Section 6 gives conclusions and future research.
II. FUZZY LOGIC AND FUZZY LOGIC SYSTEMS
Fuzzy Logic (FL) is a methodology to solve problems
which are too complex to be understood quantitatively. FL
is based on fuzzy set theory and was introduced in 1965 by Prof. Zadeh in the paper "Fuzzy Sets" [8]. It is a theory of classes with unsharp boundaries, and is considered an extension of classical set theory. The main motivation behind FL was the existence of imprecision in the measurement process. It provides capabilities that allow handling both quantitative and qualitative information within one model.
from the fuzzy set theory to deal with reasoning that is
approximate rather than precise. Fuzzy sets are sets whose
elements have degrees of membership. In classical set
theory, the membership of elements in a set is assessed in

binary terms according to bivalent condition i.e. an
element either belongs or does not belong to the set. By
contrast, fuzzy set theory is described with the aid of the
membership function valued in the real unit interval of
[0, 1]. Fuzzy sets allow partial membership. A fuzzy set A
is defined by giving a reference set X, called the universe
and a mapping;

A
: X 0,1

called the membership function of the fuzzy set A
A
(x),
for x X is interpreted as the degree of membership of x
in the fuzzy set A. A membership function is a curve that
defines how each point in the input space is mapped to a
membership value between 0 and 1. The higher the
membership x has in the fuzzy set A, the more true that x
is A. The membership functions (MFs) may be triangular,
trapezoidal, Gaussian, Pi, parabolic etc. In this paper we
have discussed three kinds of MFs i.e. triangular (Fig. 1),
Gaussian (Fig. 2) and Pi (Fig.3) [9].
The triangular membership function is specified by a
triplet (a, b, c) as follows:
$\mathrm{Triangle}(x; a, b, c) = \begin{cases} 0, & x \le a \\ \dfrac{x-a}{b-a}, & a \le x \le b \\ \dfrac{c-x}{c-b}, & b \le x \le c \\ 0, & x \ge c \end{cases}$    (1)
The parameters a and c locate the feet of the
triangle and the parameter b locates the peak which is
shown in Fig.1.
The Gaussian membership function is specified by two
parameters (m, $\sigma$) as follows, shown in Fig. 2:

$\mathrm{Gaussian}(x; m, \sigma) = \exp\left(-\dfrac{(x-m)^2}{2\sigma^2}\right)$    (2)

The Pi membership function is specified by four
parameters (a b c d) as follows:

$\Pi(x; a, b, c, d) = \begin{cases} 0, & x \le a \\ 2\left(\dfrac{x-a}{b-a}\right)^2, & a \le x \le \dfrac{a+b}{2} \\ 1 - 2\left(\dfrac{x-b}{b-a}\right)^2, & \dfrac{a+b}{2} \le x \le b \\ 1, & b \le x \le c \\ 1 - 2\left(\dfrac{x-c}{d-c}\right)^2, & c \le x \le \dfrac{c+d}{2} \\ 2\left(\dfrac{x-d}{d-c}\right)^2, & \dfrac{c+d}{2} \le x \le d \\ 0, & x \ge d \end{cases}$    (3)

The parameters a and d locate the feet of the
curve, while b and c locate its shoulders as shown in
Fig.3.
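For illustration, the three membership functions defined in (1)-(3) can be implemented directly, without assuming any particular fuzzy toolbox; the parameter values used in the example follow Figs. 1-3. This is only a sketch and not the FIS used later in the paper.

    # Illustrative sketch: triangular, Gaussian and Pi membership functions.
    import math

    def tri_mf(x, a, b, c):
        # Eq. (1): feet at a and c, peak at b
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

    def gauss_mf(x, m, sigma):
        # Eq. (2): centre m, width sigma
        return math.exp(-((x - m) ** 2) / (2.0 * sigma ** 2))

    def pi_mf(x, a, b, c, d):
        # Eq. (3): feet at a and d, shoulders at b and c
        if x <= a or x >= d:
            return 0.0
        if x <= (a + b) / 2.0:
            return 2.0 * ((x - a) / (b - a)) ** 2
        if x <= b:
            return 1.0 - 2.0 * ((x - b) / (b - a)) ** 2
        if x <= c:
            return 1.0
        if x <= (c + d) / 2.0:
            return 1.0 - 2.0 * ((x - c) / (d - c)) ** 2
        return 2.0 * ((x - d) / (d - c)) ** 2

    # The parameter sets of Figs. 1-3:
    print(tri_mf(5, 3, 6, 8), gauss_mf(4, 2, 5), pi_mf(7, 1, 4, 5, 10))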




Figure 1. A triangular MF specified by (3, 6, 8).


Figure 2. A Gaussian MF specified by (2, 5).



Figure 3. A Pi MF specified by (1, 4, 5, 10).

Fuzzy Logic System (FLS) is the name given to any
system that has a direct relationship with fuzzy concepts.
The most popular fuzzy logic systems in the literature may be classified into three types [10]: pure fuzzy logic systems, Takagi and Sugeno's fuzzy system, and fuzzy logic systems with a fuzzifier and defuzzifier. As most of the engineering applications use crisp data as input and produce crisp data as output, the last type is the most widely used, where the fuzzifier maps crisp inputs into fuzzy sets and the defuzzifier maps fuzzy sets into crisp outputs. This type of fuzzy logic system was proposed by Mamdani [11].
A general fuzzy logic system includes the following
elements [12]:

1. Fuzzification Process: Here the membership
functions are applied to the numerical value of
input variables, to determine how much the input
variables fit the linguistic terms.
2. Knowledge Base: It is a set of expert control rules
needed to achieve a goal. The knowledge base is
usually expressed as a number of IF-THEN rules
based on the domain experts knowledge.
3. Fuzzy Inference Mechanism: It performs various
fuzzy logic operations by using knowledge base to
convert fuzzy inputs to fuzzy outputs.
4. Defuzzification Process: Here the conclusion of the
fuzzy rule set is translated into a crisp number
before results can be used in practice.
III. PROPOSED FRAMEWORK
This research developed a new fuzzy logic based
framework to handle the imprecision and uncertainty
present in the most widely used COCOMO model. The
COCOMO model is a set of three models: basic,
intermediate, and detailed [4]. This research used
intermediate COCOMO model because it has estimation
accuracy that is greater than the basic version, and at the
same time comparable to the detailed version [7].
COCOMO model takes the following as input: (1) the
estimated size of the software product in thousands of
Delivered Source Instructions (KDSI) adjusted for code
reuse; (2) the project development mode, which determines the constants A and B (B is also called the scaling factor); (3) 15 cost drivers [4]. The development mode is one of three categories of software development: organic, semi-detached, and embedded. The corresponding coefficient A takes only three values, {3.2, 3, 2.8}, which reflect the difficulty of the development.
cost drivers that influence the effort to produce the
software product. Cost drivers have up to six levels of
rating: Very Low, Low, Nominal, High, Very High, and
Extra High. Each rating has a corresponding real number
(effort multiplier), based upon the factor and the degree to
which the factor can influence productivity. The estimated
effort in person-months (PM) for the intermediate
COCOMO is given as:
$\mathrm{Effort} = A \cdot [\mathrm{KDSI}]^B \cdot \prod_{i=1}^{15} EM_i$    (4)
The constant A in (4) is also known as the productivity coefficient. The coefficient A and the scale factor B depend on the development mode, as given in Table I.
TABLE I. COCOMO MODE COEFFICIENTS AND SCALE FACTORS VALUES
MODE A B
Organic 3.2 1.05
Semi-Detached 3 1.12
Embedded 2.8 1.20

The contribution of the effort multipliers corresponding to the respective cost drivers is introduced into the effort estimation formula by multiplying them together. The numerical value of the i-th cost driver is $EM_i$, and the product of all the multipliers is called the effort adjustment factor (EAF).

The actual effort in person-months (PM), $PM_{total}$, is the product of the nominal effort (i.e. the effort without the cost drivers) and the EAF, as given in (5):

$PM_{total} = PM_{nominal} \times EAF$    (5)

where $PM_{nominal} = A \cdot [\mathrm{KDSI}]^B$ and $EAF = \prod_{i=1}^{15} EM_i$.
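As a worked illustration of (4) and (5) with crisp (non-fuzzified) inputs, the intermediate COCOMO effort can be computed as follows; the coefficients follow Table I, while the effort multiplier values in the example are made up.

    # Illustrative sketch: crisp intermediate COCOMO effort, Eqs. (4)-(5).
    from math import prod

    MODE_COEFFS = {"organic": (3.2, 1.05), "semi-detached": (3.0, 1.12), "embedded": (2.8, 1.20)}

    def intermediate_cocomo_effort(kdsi, mode, effort_multipliers):
        a, b = MODE_COEFFS[mode]
        pm_nominal = a * kdsi ** b                 # PM_nominal = A * [KDSI]^B
        eaf = prod(effort_multipliers)             # EAF = product of the 15 EM_i
        return pm_nominal * eaf                    # PM_total = PM_nominal * EAF

    # Example: 10 KDSI organic project with all cost drivers rated Nominal (EM_i = 1.0)
    print(round(intermediate_cocomo_effort(10, "organic", [1.0] * 15), 2))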

By studying the behavior of COCOMO, a new fuzzy approach for effort estimation is proposed, using the Pi membership function given in (3) to deal with linguistic data.


IV. ALGORITHM
In this framework all the parameters of the COCOMO model, i.e. Size, Mode and the 15 cost drivers, are fuzzified using the Fuzzy Inference System (FIS) of Matlab. The framework for the fuzzy logic implementation is shown in Fig. 4.



Figure 4. Fuzzy framework: the inputs are passed to the FIS, which applies the fuzzy rules and produces the output.

Step 1. Choice of membership functions

In this approach three types of membership functions (MF), namely the triangular MF, the Pi MF and the Gaussian MF, are chosen for COCOMO's effort estimation.

Step 2. Fuzzification of Nominal effort

A Fuzzy Inference System (FIS) is developed to calculate the nominal effort. The inputs to this system are MODE and SIZE; the output is the fuzzy nominal effort. The input variables SIZE and MODE, represented as Pi MFs, are shown in Fig. 5 and Fig. 6 respectively. The output variable NOMINAL EFFORT, represented as a Pi MF, is shown in Fig. 7. In the same way, FIS are developed using the triangular and Gaussian MFs.




Figure 5. Input variable SIZE represented as Pi MF




Figure 6. Input variable MODE represented as Pi MF



Figure 7. Output variable EFFORT represented as Pi MF

The fuzzy rules obtained for size, mode and effort are as follows (a sketch of how such rules can be evaluated is given after the list):
If size is s1 and mode is organic then effort is e11
If size is s1 and mode is semi-detached then effort is e12
If size is s1 and mode is embedded then effort is e13
If size is s2 and mode is organic then effort is e21
If size is s2 and mode is semi-detached then effort is e22
If size is s2 and mode is embedded then effort is e23
...
If size is s11 and mode is organic then effort is e11_1
If size is s11 and mode is semi-detached then effort is
e11_2
If size is s11 and mode is embedded then effort is e11_3
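A much-simplified sketch of how rules of this form can be evaluated in a Mamdani-style inference step is shown below. The Pi MF follows (3), but the partitions s1 and s2, the consequent sets e11 and e21, and the effort universe are assumptions for illustration only; they do not correspond to the actual FIS built in Matlab.

    # Illustrative sketch: Mamdani-style evaluation of two "size/mode -> effort" rules.
    import numpy as np

    def pi_mf(x, a, b, c, d):
        x = np.asarray(x, dtype=float)
        y = np.zeros_like(x)
        up, dn = (x > a) & (x < b), (x > c) & (x < d)
        y[(x >= b) & (x <= c)] = 1.0
        m1 = up & (x <= (a + b) / 2); y[m1] = 2 * ((x[m1] - a) / (b - a)) ** 2
        m2 = up & (x > (a + b) / 2);  y[m2] = 1 - 2 * ((x[m2] - b) / (b - a)) ** 2
        m3 = dn & (x <= (c + d) / 2); y[m3] = 1 - 2 * ((x[m3] - c) / (d - c)) ** 2
        m4 = dn & (x > (c + d) / 2);  y[m4] = 2 * ((x[m4] - d) / (d - c)) ** 2
        return y

    effort_universe = np.linspace(0, 100, 1001)
    # Hypothetical consequent sets for "size s1 / organic" and "size s2 / organic":
    e11 = pi_mf(effort_universe, 5, 10, 15, 20)
    e21 = pi_mf(effort_universe, 15, 25, 35, 45)

    def nominal_effort(size_kdsi):
        # Antecedent memberships of the crisp size in the (assumed) sets s1 and s2
        mu_s1 = float(pi_mf([size_kdsi], 0, 2, 6, 10)[0])
        mu_s2 = float(pi_mf([size_kdsi], 6, 10, 18, 25)[0])
        # Rule firing (mode "organic" is crisp here), min implication, max aggregation
        aggregated = np.maximum(np.minimum(mu_s1, e11), np.minimum(mu_s2, e21))
        if aggregated.sum() == 0:
            return 0.0
        return float((effort_universe * aggregated).sum() / aggregated.sum())  # centroid

    print(round(nominal_effort(8.0), 2))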

Step 3. Fuzzification of Cost Drivers

All cost drivers are fuzzified using a separate FIS for every cost driver. Each cost driver is also fuzzified using all three MFs; thus, a total of 15*3 = 45 FIS are developed for the cost drivers. A sample fuzzification of VEXP based on Tables II and III is shown in Fig. 8 and Fig. 9 using the triangular MF (TMF). The same is implemented using the Gaussian and Pi MFs also.
TABLE II. VEXP COST DRIVER RANGE SPECIFIED IN MONTHS
Very low Low Nominal High
1 4 12 36


TABLE III. VEXP EFFORT MULTIPLIER RANGE DEFINITION
Very low Low Nominal High
1.21 1.10 1.00 0.90




Figure 8. Antecedent MFs for VEXP represented as TMF



Figure 9. Consequent MFs for VEXP represented as TMF

Rules obtained from Fig. 8 and Fig. 9 are:

If vexpa (vexp antecedent) is vlow then vexpc (vexp
consequent) is incsig (increased significantly)
If vexpa is low then vexpc is inc (increasing)
If vexpa is nom (nominal) then vexpc is uc (unchanged)
If vexpa is high then vexpc is dec (decreasing)

Sample fuzzification of TIME based on Table IV and V is
shown in Fig. 10 and Fig. 11 using Pi MF. The same is
implemented using triangular and Gaussian MFs also.

TABLE IV. TIME COST DRIVER RANGE IN TERMS OF PERCENTAGE
Nominal High Very High Extra High
<=50 70 85 95

TABLE V. TIME EFFORT MULTIPLIER RANGE DEFINITION
Nominal High Very High Extra
High
1.00 1.11 1.30 1.66



Figure 10. Antecedent MFs for TIME represented as Pi MF



Figure 11. Consequent MFs for TIME represented as Pi MF
Rules obtained from Fig. 10 and Fig. 11 are:
If timea (time antecedent) is nom (nominal) then timec
(time consequent) is uc (unchanged)
If timea is high then timec is inc (increasing)
If timea is vhigh (very high) then timec is incsig
(increasing significantly)
If timea is ehigh (extra high) then timec is incdras
(increasing drastically)
Sample fuzzification of PCAP based on Table VI and VII
is shown in Fig. 12 and Fig. 13 using Gaussian MF.
The same is implemented using triangular and Pi MFs
also.

TABLE VI. PCAP COST DRIVER RANGE DEFINED IN TERMS OF PERCENTILE
Very Low Low Nominal High Very High
15 35 55 75 90

TABLE VII. PCAP EFFORT MULTIPLIER RANGE DEFINITION
Very low Low Nominal High Very High
1.42 1.17 1.00 0.86 0.70




Figure 12. Antecedent MFs for PCAP represented as Gauss MF




Figure 13. Consequent MFs for PCAP represented as Gauss MF

Rules obtained from Fig. 12 and Fig.13 are:

If pcapa (pcap antecedent) is vlow (very low) then pcapc
(pcap consequent) is incsig (increasing significantly)
If pcapa is low then pcapc is inc (increasing)
If pcapa is nom (nominal) then pcapc is uc (unchanged)
If pcapa is high then pcapc is dec (decreasing)
If pcapa is vhigh(very high) then pcapc is decsig
(decreasing significantly)

Step 4: Estimated effort calculation by integrating
components

Estimated effort is obtained by multiplication of
nominal effort obtained from step 2 and EAF obtained
from step 3 (by multiplying effort multipliers
corresponding to each cost driver).
V. EXPERIMENTS AND RESULTS
Experiments are done by taking some of the original
projects from COCOMO81 dataset [4]. The estimated
efforts using COCOMO, Triangular MF, Pi MF and
Gaussian MF obtained are tabulated and compared. They
are shown in Table VIII.
The evaluation consists in comparing the accuracy of the estimated effort with the actual effort. There are many evaluation criteria for software effort estimation; among them we applied the most frequently used one, the Magnitude of Relative Error (MRE), defined as in (6):

$MRE = \dfrac{|\mathrm{Actual\ Effort} - \mathrm{Estimated\ Effort}|}{\mathrm{Actual\ Effort}} \times 100$    (6)


The software development efforts obtained using COCOMO and the different membership functions were observed. After analysing the results obtained by applying the triangular, Pi and Gaussian MFs, it is observed that the effort estimation of the proposed model gives more precise results for most projects compared to the triangular and Gaussian membership functions. The effort estimated by fuzzifying the size, mode and all 15 cost drivers using the Pi MF yields a better estimate.
The magnitude of relative error (MRE) was calculated
using (6). For example, the MRE calculated for project
ID (P.ID) 6 for COCOMO, triangular, Gaussian and the
proposed model is 21.6, 20.33, 75 and 20.16 respectively.
This clearly shows that there is a reduction in the relative error, so the proposed model is more suitable for effort estimation.
VI. CONCLUSIONS AND FUTURE WORK
Referring to Table VIII, effort estimation using the Pi MF yields better results for most of the projects when compared with the other methods. Thus it is concluded that the new approach using the Pi MF is better than using the TMF (triangular membership function), the Gaussian MF and
Intermediate COCOMO. By suitably adjusting the values of the parameters in the FIS we can optimize the estimated effort. Future work includes deploying a similar algorithm for COCOMO II and other cost estimation models. Newer techniques like Type-2 fuzzy logic can also be applied for more accurate predictions of software effort.



TABLE VIII. ESTIMATED EFFORT USING DIFFERENT APPROACHES

P.ID  MODE  SIZE  EAF   Actual  COCOMO   TMF            Pi MF          Gauss MF
                        Effort  Effort   EAF   Effort   EAF   Effort   EAF   Effort
1     3.2   6.9   0.40    8       9.8    0.41   12.84   0.41   12.81   0.71   22
2     3     90    0.70  453     326      0.70  327.28   0.70  326.59   0.88  409.76
3     3.2   13    2.81   98     133      2.84   89.90   2.83   89.46   2.64  110.82
4     3.2   15    0.35   12      20      0.36   17.78   0.36   17.73   0.63   35.13
5     3.2   6.2   0.39    8       8.4    0.39   12.28   0.39   12.24   0.68   19.11
6     3.2   5.3   0.25    6       4.7    0.26    7.22   0.26    7.21   0.45   10.50
7     3.2   23    0.38   36      33      0.38   28.15   0.38   28.07   0.63   55.37
8     3.2   6.3   0.34   18       7.5    0.34   10.79   0.34   10.74   0.60   17.24
9     2.8   25    1.09  130     145      1.09  154.68   1.09  154.22   1.34  186.99
10    3.2   28    0.45   50      47      0.45   50.23   0.44   49.99   0.78   88.13
11    3     9.1   1.15   38      42      1.05   44.26   1.10   46.61   1.19   47.24
12    2.8   10    0.39   15      17      0.39   18.96   0.38   18.90   0.52   23.28

REFERENCES
[1] M. Kazemifard, A. Zaeri, N. Ghasem-Aghaee, M. A. Nematbakhsh, F. Mardukhi, "Fuzzy Emotional COCOMO II Software Cost Estimation (FECSCE) using Multi-Agent Systems", Applied Soft Computing, vol. 11, pp. 2260-2270, 2011, Elsevier.
[2] B. Boehm, C. Abts, S. Chulani, "Software Development Cost Estimation Approaches: A Survey", University of Southern California Centre for Software Engineering, Technical Report USC-CSE-2000-505, 2000.
[3] Ch. Satyananda Reddy, KVSN Raju, "An Improved Fuzzy Approach for COCOMO's Effort Estimation using Gaussian Membership Function", Journal of Software, vol. 4, no. 5, July 2009.
[4] B. W. Boehm, Software Engineering Economics, Prentice Hall, Englewood Cliffs, NJ, 1981.
[5] M. O. Saliu, M. Ahmed, "Soft Computing based Effort Prediction Systems - A Survey", in: E. Damiani, L. C. Jain (Eds.), Computational Intelligence in Software Engineering, Springer-Verlag, July 2004, ISBN 3-540-22030-5.
[6] Z. Fei, X. Liu, "f-COCOMO: Fuzzy Constructive Cost Model in Software Engineering", Proceedings of the IEEE International Conference on Fuzzy Systems, IEEE Press, New York, 1992, pp. 331-337.
[7] A. Idri and A. Abran, "COCOMO Cost Model using Fuzzy Logic", 7th International Conference on Fuzzy Theory and Technology, Atlantic City, New Jersey, March 2000.
[8] L. A. Zadeh, "Fuzzy Sets", Information and Control, vol. 8, pp. 338-353, 1965.
[9] http://www.mathworks.com/help/toolbox/fuzzy
[10] L. Wang, Adaptive Fuzzy Systems and Control: Design and Stability Analysis, Prentice Hall Inc., Englewood Cliffs, NJ, 1994.
[11] E. H. Mamdani, "Applications of Fuzzy Algorithms for Simple Dynamic Plant", Proceedings of IEEE, 121(12), 1974.
[12] S. N. Sivanandam, S. N. Deepa, Principles of Soft Computing, Wiley India, 2007.