
General Graduate Exams Exploration and Visualization of Information in Search Engines

by

Panagiotis Papadakos

Presented to Graduate Studies Committee of the Computer Science Department of the University of Crete Heraklion, May 2009


Contents

Table of Contents
List of Figures
List of Tables
1 Motivation
2 Interaction Paradigms and Visualization in Information Retrieval (IR)
  2.1 Information Retrieval
  2.2 Information Space and User Information Needs
    2.2.1 Micro and Macro Level of Information
    2.2.2 Information Space
    2.2.3 User Information Needs
  2.3 Interaction Paradigms
    2.3.1 Query Searching vs Browsing
    2.3.2 Differences
    2.3.3 Three Paradigms
  2.4 Visualization
    2.4.1 Definition
    2.4.2 Scientific and Information Visualization
3 Interaction Paradigms and Related Techniques
  3.1 Results Clustering
    3.1.1 Clustering Requirements
    3.1.2 Clustering Algorithms Classification
      Hierarchical and Non-Hierarchical Approaches
      Document-based and Snippet-based Approaches
    3.1.3 Cluster Presentation & User Interaction
  3.2 Facets and Dynamic Taxonomies
    3.2.1 Introduction
    3.2.2 Taxonomy Design
    3.2.3 Framework
    3.2.4 User Interface Design
4 Visualization Models and Metaphors
  4.1 Multiple Reference Points Based Models (MRPBM)
  4.2 Euclidean Spatial Characteristic Based Model (ESCBM)
  4.3 Pathfinder Associative Network (PFNET)
  4.4 Multidimensional Scaling Models (MDS)
  4.5 Self-organizing Map Model (SOM)
  4.6 Metaphors
    4.6.1 Metaphors for the Semantic Framework Presentation
    4.6.2 Metaphors for Information Retrieval Interaction
5 Vision and Research Methodology
  5.1 Challenges
  5.2 Vision
  5.3 Proposed Methodology
    5.3.1 Information Visualization Framework
    5.3.2 Metrics for Exploratory Search
    5.3.3 Interaction Models for Exploratory Search
    5.3.4 Exploratory Search and Visualization
    5.3.5 Implementation
    5.3.6 Evaluation and Improvements
  5.4 Work Done
    5.4.1 ODBMS Index Representations
    5.4.2 FleXplorer, A Framework for Providing Faceted and Dynamic Taxonomy-based Information Exploration
    5.4.3 Exploratory Web Searching with Dynamic Taxonomies and Results Clustering
6 Conclusion
Appendices
A Acronyms
References

List of Figures

3.1 Clusty, a Snippet-based Clustering Approach
3.2 Quintura Word Cloud
3.3 grokker Generates an Euler Diagram
3.4 Top-200 Web Search Results Clustering Displayed Using Two-level TreeMaps
3.5 Kartoo Generates a Thematic Map
3.6 Example of a Materialized Faceted Taxonomy
3.7 ContentLandscape Applies Collapsible Panel Pattern for Zooming
3.8 FacetZoom Combines Ideas from Zoomable User Interfaces (UIs) With Faceted Search
3.9 Faceted Search for Small Screens in the FaThumb Prototype
3.10 Flamenco Allows Choosing Between a Search Over All Results or Within Current Focus
4.1 Display of 4 Reference Points in a Fixed Reference Point Environment
4.2 VIBE Using 5 Reference Points
4.3 WebStar Using 4 Reference Points (RPs). Snapshots During a Full Rotation of international Reference Point
4.4 Display of the Projected Cosine Model, in Distance-Angle DARE Model
4.5 Display of the Projected Cosine Model, in the Angle-Angle TOFIR Model
4.6 Display of the Projected Distance Model, in the Distance-Distance GUIDO Model
4.7 Display of Original Network (left) and Final PFNET Network (right)
4.8 Display of ThemeScape and Galaxy Visualizations of IN-SPIRE Visualization Program
4.9 A SOM Feature Map
4.10 A 3D Cone Tree (left) and a Basic Hyperbolic Tree (right)
4.11 Perspective Wall (left) and ThemeRiver (right)
4.12 DataLens, a 3D Pyramid Lens
4.13 Gridl Prototype Displays Search Results Along Two Axes
4.14 HotMaps, a 2D Visualization of How Query Terms Relate to Search Results

List of Tables

3.1 Basic Notions and Notations
3.2 Interaction Notions and Notations

Chapter 1

Motivation

The daily use of computers as tools for work, education, communication and entertainment produces a huge volume of data. As recent surveys state¹, the world produces between 1 and 2 exabytes (2^60 bytes) of unique information per year, 90% of which is digital, with a 50% annual growth rate. In addition, these new data are more complex and more dynamic.

The interaction paradigm adopted by current Information Retrieval (IR) systems and Web Search Engines (WSEs) is a simple rectangular text box, in which the user most of the time enters one or two terms and the system returns a ranked list of results. This paradigm has proven very useful for finding specific information and is very simple and intuitive to use. However, such systems do not provide adequate support for information needs that have an exploratory nature and/or aim at decision making. User studies have shown that casual users usually inspect only the first page of results and do not exploit any of the query language operators (not even Boolean queries) that are offered. Instead, they issue very small queries, which they reformulate in an iterative process based on the returned results [73, 62]. On the other hand, the powerful and expressive query languages that are usually offered for structured information (e.g. for the Semantic Web) are not fully utilized, in the sense that the formulation of queries is a laborious and difficult task for end users.

In this highly demanding and growing information environment, new intuitive and more user friendly user interfaces (UIs) have to be created, providing effective and efficient services for retrieving and exploring the available information and supporting users in the various decision making tasks and processes. The fields of IR, Information Visualization (IV) and Human Computer Interaction (HCI) have to collaborate in order to provide new intuitive and interactive UIs, where the information is presented, organized, and analyzed, giving the user the ability to recognize patterns and relations. For example, to select a hotel or a product to buy, it is not enough to return the list of choices that satisfy user-provided criteria. Ranking the available choices according to user-based (i.e. preference) or statistical criteria is also required. Furthermore, exploration services are required that provide users with comprehensive summaries of the available choices, enabling them to quickly grasp the information landscape, restrict their focus, and thus gradually approach the most desired choices. For this reason, efforts to exploit the above query languages in models of exploration/navigation have started to appear [56, 49, 30, 4].

In summary, the constantly increasing volume of data and the requirements of our digital economy call for intuitive modes of interaction, involving flexible and efficient navigation and advanced visualization.

¹ http://www.sims.berkeley.edu/research/projects/how-much-info-2003/

Computer Science Department

University of Crete

Chapter 2

Interaction Paradigms and Visualization in IR

2.1 Information Retrieval


Information Retrieval (IR) is the domain focusing on searching, exploring and discovering information, either from organized textual and data repositories or from the World Wide Web (WWW), in order to satisfy users' information needs. However, since the information environment is constantly growing, another important aspect of IR systems is their ability to organize this information. This organization can facilitate the creation of innovative, more intuitive and user friendly UIs, which will provide users with efficient ways of mapping, organizing and grouping the available information. This can enable users to discover new patterns and relationships in the available information and to satisfy their information needs faster and more accurately.

2.2 Information Space and User Information Needs


2.2.1 Micro and Macro Level of Information
Information from an infobase can be divided into two levels. The first is the micro level, which refers to individual objects or documents, such as contents, snippets or full text. This is the direct and obvious information. The other is the macro level, which refers to aggregated information about objects or documents from the collection. This information is not direct but is generated from the individual objects of the collection, and relies on the way the information is organized and presented. Such information can provide object connections, rhythms, trends, patterns and relationships, explaining information at the micro level. For the same data, the aggregate information at the macro level can vary with the information organization method and the information presentation. By navigating information at the macro level the user can gain a better understanding of the provided collection and find unexpected insights [92]. An IR system should provide access to both levels of information, by browsing and query searching.

2.2.2 Information Space


The information space can be conceived as an abstract and multidimensional space. Its structure is based on the semantic characteristics and relationships derived from the organization of the collection data set, which enables users to explore and discover information from the data collection. An information space can be constituted by intrinsic attributes such as keywords, citations, hyperlinks, and authors, or by extrinsic structures like a subject directory, a thesaurus system, or an organized search result list. Combinations of intrinsic attributes and extrinsic structures can also form an information space. Since information does not inherently constitute a space, to describe its spatial characteristics we have to define basic topological properties like distance, direction and angle. For instance, the distance between two objects can be the length of the shortest path in a hyperlink, citation or hierarchical structure, or the Euclidean distance in the Vector Space Model (VSM). Direction has a special meaning in hyperlink and citation based systems, since if an object links to or cites another object, the one object directs to the other. In a multidimensional vector-space based IR system, the angle between vectors is used as a retrieval model. Finally, the information space has to be reduced from N dimensions to 1, 2 or 3 in order to be perceived by humans, which can lead to user disorientation and ambiguity [92].
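The three properties named above (distance, direction and angle) can be made concrete with a small sketch. It is only an illustration under stated assumptions: the function names, the two-term document vectors and the three-page link graph are hypothetical and do not come from [92].

```python
import math
from collections import deque


def euclidean_distance(u, v):
    """Straight-line distance between two term-weight vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def angle_between(u, v):
    """Angle in radians between two non-zero vectors (the basis of cosine retrieval)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)


def link_distance(links, source, target):
    """Length of the shortest directed hyperlink path, or None if unreachable."""
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        page, hops = queue.popleft()
        if page == target:
            return hops
        for neighbour in links.get(page, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return None


# Hypothetical toy data: two documents over a two-term vocabulary, and a link graph.
doc_a, doc_b = [3.0, 0.0], [0.0, 4.0]
links = {"A": ["B"], "B": ["C"], "C": []}

print(euclidean_distance(doc_a, doc_b))  # 5.0
print(link_distance(links, "A", "C"))    # 2
print(link_distance(links, "C", "A"))    # None: links are directed
```

The last two calls show why direction matters in link based spaces: page C is two hops from A, but A cannot be reached from C at all.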

2.2.3 User Information Needs


User studies have shown that almost 60% of search tasks are exploratory [62]. The user does not know his information need accurately, he provides only 2-5 words, and focalized search very commonly leads to inadequate interactions and poor results. Unfortunately, the available user interfaces (UIs) do not aid the user in formulating his query. Furthermore, such systems do not provide adequate support for information needs that have an exploratory nature and/or aim at decision making. The answers returned are simple ranked lists of results, with no organization and no information on the macro level of the infobase. Casual users usually inspect only the first page of results and do not exploit any of the query language operators (not even Boolean queries) that are offered. Instead, they issue very small queries, which they reformulate in an iterative process based on the returned results [73, 62]. On the other hand, the powerful and expressive query languages that are usually offered for structured information (e.g. for the Semantic Web) are not fully utilized, in the sense that the formulation of queries is a laborious and difficult task.

2.3 Interaction Paradigms


2.3.1 Query Searching vs Browsing
In IR two paradigms are widely recognized: the first is query searching and the second is browsing. Query searching is the paradigm where the user tries to describe his information need with a group of relevant and important terms. The query is then analyzed by the IR engine and a list of related documents (ordered by the ranking model in use) is returned. Most IR engines also display snippets of the relevant parts of the returned documents. Because matching is done on the terms themselves, for ambiguous words, where a word can have many meanings, the system might return non-relevant results, which the user might perceive as a search failure. On the other hand, browsing refers to UIs which allow the user to view, search and scan either the whole information or part of it. This enables the user to explore and discover information, along with data relationships and patterns. A UI for browsing should provide smooth and structured browsing. Methods for information browsing include hyperlink and hierarchical structures. However, huge volumes of data require the appropriate use of automatic data analysis techniques prior to visualization. According to [81], browsing is useful when (a) there is good underlying structure, so items close to one another are similar, (b) users are unfamiliar with the contents of the collection, (c) users have a limited understanding of the organization of a system and prefer a less cognitively loaded method of exploration, (d) it is difficult to verbalize the underlying information need and (e) the information is easier to recognize than to describe.
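To illustrate the query searching side, the following minimal sketch ranks documents by how often they contain the query terms. The scoring rule (a raw term count), the tokenizer and the three toy documents are assumptions made for illustration only; they stand in for whatever ranking model an actual IR engine uses.

```python
def tokenize(text):
    """Lower-case and split on whitespace; real engines do far more."""
    return text.lower().split()


def rank(query, documents):
    """Return (doc_id, score) pairs sorted by descending term-count score."""
    terms = tokenize(query)
    scored = []
    for doc_id, text in documents.items():
        tokens = tokenize(text)
        score = sum(tokens.count(term) for term in terms)
        if score > 0:  # documents with no matching term are not returned
            scored.append((doc_id, score))
    # Highest score first; ties are broken by document id for determinism.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical collection containing an ambiguous word.
documents = {
    "d1": "jaguar car dealer sells jaguar sedans",
    "d2": "the jaguar is a big cat",
    "d3": "used car prices",
}

print(rank("jaguar", documents))  # [('d1', 2), ('d2', 1)]
```

A user interested in the animal receives the car dealer first, because matching happens purely at the lexical level.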

2.3.2 Differences
According to [92], the differences between query searching and browsing include:

Judgment of Relevance: Query searching is based on keyword matching between query terms and surrogates of documents in a database, at a lexical level. On the other hand, the relevance judgment in browsing is made by the users and is a concept matching process.

Continuity: The retrieval process is continuous for browsing, while it is discrete for query searching. Selecting a browsing path, examining a context, and judging relevance are continuous and controlled by the user during browsing, while after executing a query, the internal query process and the ranking of the results are a black box for the users.

Cost in Time and Effort: Browsing is a time and effort consuming action, since the user must remember the browsing path, search the contents and make decisions, while query searching involves only term selection and query formulation.

Information Seeking Behavior: Browsing is a system based seeking behavior (i.e. what the system can offer), while query searching is a seeking behavior based on what the user wants.

Iteration: Browsing is completed by a series of iterative acts, like getting an overview of the available information, fixing on a target and examining it more closely, and then moving on and starting the cycle again. Query searching, on the other hand, requires the definition of the query terms, the formulation of the query and the examination of the results. Query searching might also be iterative, since the results might not fulfill the information needs of the user.

Granularity: Using browsing the user can evaluate one relevant item at a time, while query searching provides a group of retrieved documents.

Clarity of Information Need: When a user starts an information seeking process, he might not have defined a clear information need. In such a case, browsing is more appropriate, since it does not require a definite target, while query searching requires a relatively well-conceived information need, for which keywords can be chosen and a query can be formulated.

Interactivity: Browsing is an interactive process by nature, which makes it more complicated and challenging, while query searching has fewer steps and less interaction.

Retrieval Results: The results of browsing are richer and more diverse, since browsing can lead to a wide range of retrieval results (i.e. from contextual, structural and relational information to individual objects), while query searching only retrieves a ranked list of documents.


2.3.3 Three Paradigms


Although query searching and browsing are different ways to seek information, they can be synthesized. There are three basic paradigms:

Querying and Browsing (QB): In this paradigm, an initial query is submitted to the system to restrict the infobase. Then the results are visualized in a visualization environment and browsed by users.

Browsing and Querying (BQ): Information at the macro level is presented and browsed, and then information at the micro level is searched and highlighted in the visualization contexts.

Browsing Only (BO): Information at the macro level is displayed and browsed. This paradigm does not integrate any query searching components.

Query searching on its own is not categorized as a paradigm here, because it is the traditional IR paradigm, which does not require a visual space.

2.4 Visualization
2.4.1 Definition
According to [51], visualization is a method of computing which transforms the symbolic into the geometric, enables researchers to observe their simulations and computations, offers a method for seeing the unseen, enriches the process of scientific discovery, and fosters profound and unexpected insights. Visualization is the process of transforming data, information, and knowledge into graphic presentations to support tasks such as data analysis, information exploration, information explanation, trend prediction, pattern detection, rhythm discovery, and so on. Without the assistance of visualization, people perceive or comprehend less of the data, information, or knowledge, for a variety of reasons. Such reasons may include the limitations of human vision, or the invisibility and abstractness of the data, information and knowledge. Visualization requires certain methods or algorithms to convert raw data into a meaningful, interpretable, and displayable form that visually conveys information to users.

2.4.2 Scientific and Information Visualization

Visualization can be classified into two categories: scientific visualization and information visualization. Scientific visualization is used most of the time to show things that are either too fast or too slow for the eye to perceive, structures much smaller or larger than human scale, or phenomena that people cannot directly see, like x-ray or infrared radiation [52]. Examples include shapes of molecules, missile tracking, astrophysics, fluid dynamics, medical images, etc. On the other hand, information visualization is generally used to view abstract information. Examples include visual reasoning, visual data modeling, visual programming, information retrieval visualization, visualization of program execution, visual languages, spatial reasoning, and visualization of systems [82].

Although their fundamental design principles, implementation means, and issues are common, information visualization, contrary to scientific visualization, does not have an inherent spatial structure or geometry of data to display. For information visualization, a spatial structure or framework for the semantic relationships among data must be created. Finding or defining such a spatial structure is challenging, because data in an information space may be multifaceted and the relationships among data are interwoven and complicated. Furthermore, data may be of diverse nature. Defining such a spatial structure for information visualization is a complicated and creative process. Salient and displayable attributes must be extracted from objects, a semantic framework for displayable objects must be established, information must be organized, and objects must be projected onto the structure in such a way that the user will be able to search and find objects and object relationships [92].


Chapter 3

Interaction Paradigms and Related Techniques

In this chapter we discuss two interaction paradigms: Results Clustering, and Faceted or Dynamic Taxonomies.

3.1 Results Clustering


Results clustering is a data analysis method that can organize a dataset into categorical groups (clusters), based on certain data association criteria. Different similarity measures can lead to different clustering results. Items or objects within the same group/cluster are more similar to each other than to items in a different group/cluster. Clustering is considered an unsupervised learning process because it can automatically reveal intrinsic categorical patterns from a dataset. The categories produced by a clustering algorithm depend on the nature of the dataset, the association criteria of the clustering, and the distribution of data items in the dataset. The advantage of clustering is that it can be easily applied to any collection, revealing interesting and unexpected associations and trends. Its disadvantages are the lack of predictability, the conflation of many dimensions simultaneously, the difficulty of labeling the groups and the counterintuitiveness of cluster hierarchies [25].


3.1.1 Clustering Requirements


Results clustering algorithms should satisfy several requirements. First of all, the generated clusters should be characterized by high intra-cluster similarity. Moreover, results clustering algorithms should be efficient and scalable, since clustering is an online task and the size of the retrieved document set can vary. Usually only the top C documents are clustered in order to increase performance. In addition, the presentation of each cluster should be concise and accurate, to allow users to detect what they need quickly. Cluster labeling is the task of deriving readable and meaningful (single-word or multiple-word) names for clusters, in order to help the user recognize the clusters/topics he is interested in. Such labels must be predictive, descriptive, concise and syntactically correct. Finally, it should be possible to provide high quality clusters based on small document snippets rather than the whole documents.

3.1.2 Clustering Algorithms Classification

We can categorize clustering algorithms using two different classification schemes, based on either the structure of the clusters or the infobase that these algorithms are applied to.

Hierarchical and Non-Hierarchical Approaches

The first scheme classifies clustering algorithms as either non-hierarchical (partitioning clustering algorithms) or hierarchical [61]. The major difference between these two types is that the hierarchical algorithms generate a hierarchy of clustered items, while the non-hierarchical ones partition the items in a single-level structure.

Non-Hierarchical Approaches

Algorithms of this kind partition N items into K categories (K must be predefined). One of the most popular non-hierarchical algorithms is K-means [48] and its variants [36, 11], which is based on a simple iterative scheme for finding a locally minimal solution. The algorithm starts with a guess about the solution and then readjusts the cluster centroids until reaching a local optimum. A centroid is a special, artificially created item in a cluster which is used to represent that cluster for various purposes. It is defined as the average of the coordinates of all items in the cluster which it represents. A cluster membership function is a method for judging whether an item is assigned to a cluster or not during the clustering process. The main advantages of this algorithm are its simplicity and speed, which allow it to run on large datasets. Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments.

Another non-hierarchical algorithm is Fuzzy c-means [32]. In fuzzy clustering, each point has a degree of belonging to clusters, as in fuzzy logic, rather than belonging completely to just one cluster. Thus, points on the edge of a cluster may be in the cluster to a lesser degree than points in the center of the cluster. The algorithm minimizes intra-cluster variance as well, but has the same problems as K-means: the minimum is a local minimum, and the results depend on the initial choice of weights.

QT (quality threshold) clustering [29] is an alternative method of partitioning data, invented for gene clustering. It requires more computing power than K-means, but does not require specifying the number of clusters a priori, and always returns the same result when run several times. The user chooses a maximum diameter for clusters, and the algorithm builds a candidate cluster for each point by including the closest point, the next closest, and so on, until the diameter of the cluster surpasses the threshold. The candidate cluster with the most points is the first true cluster. The algorithm then recurses on the reduced set of points to find the rest of the clusters.

STC and its variations, described later in the Section Snippet-based Approaches, are also non-hierarchical approaches.

Hierarchical Approaches

A hierarchical clustering algorithm yields a tree structure, which is also called a dendrogram. In such a structure, a child sub-cluster has to overlap with its parent cluster. The clustering process in such algorithms is recursive, meaning that successive sub-clusters are generated from an existing cluster, and so on. There are two basic strategies for creating this structure: agglomerative (bottom-up) algorithms and divisive (top-down) algorithms. The agglomerative algorithm first clusters the input items, forming a set of clusters, and then merges close clusters from the existing cluster set to form a parent cluster, based on a similarity measure. The algorithm ends when all clusters have been merged into one parent cluster, the root of the tree [36].
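The agglomerative strategy just described can be sketched in a few lines. This is a naive illustration and not the algorithm of [36]: it assumes Euclidean distance between points and single-link (closest pair) distance between clusters, which is only one of several possible similarity schemes, and the names `single_link` and `agglomerate` are hypothetical.

```python
import math


def single_link(cluster_a, cluster_b):
    """Cluster distance = distance of the closest pair of points (an assumption)."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)


def agglomerate(points):
    """Merge the two closest clusters until a single root cluster remains.

    Returns the merges in the order performed, each as a pair
    (members_a, members_b); the last merge forms the root of the dendrogram.
    """
    clusters = [[p] for p in points]  # start: every item is its own cluster
    merges = []
    while len(clusters) > 1:
        # Examine every pair of current clusters and pick the closest one.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda pair: single_link(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # Delete the higher index first (j > i) so that index i stays valid.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
    return merges


# Hypothetical 2D items: two close points and one outlier.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
for left, right in agglomerate(points):
    print(left, "+", right)
```

Each round compares all pairs of clusters, and a merge is never undone once it has been recorded.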
Different variations may employ different similarity measuring schemes [94]. The divisive algorithm takes the opposite direction. It starts with the root of the tree and breaks down one large cluster into several smaller clusters. The recursion stops when certain criteria are met. Agglomerative clustering algorithms are more popular than divisive ones.

The above methods usually suffer from their inability to perform adjustments once a merge or split has been performed. This inflexibility often lowers the clustering accuracy. Furthermore, due to the complexity of computing the similarity between every pair of clusters, such algorithms do not scale to large data sets in document clustering.

Another approach is the Hierarchical Frequent Term-based Clustering (HFTC) method, proposed in [1]. This algorithm exploits the notion of frequent itemsets¹ used in data mining. HFTC greedily selects the next frequent itemset, which represents the next cluster, minimizing the overlap of clusters in terms of shared documents. Experiments have shown that this algorithm is not scalable [21]. A different approach based on the idea of frequent itemsets is Frequent Itemset Hierarchical Clustering (FIHC). FIHC uses global frequent itemsets² to construct clusters, which reduces the dimensionality of the document set, making this algorithm more efficient and scalable.

¹ A frequent itemset is a set of words which occur together in some minimum fraction of documents in a cluster.
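The notion of a frequent itemset that HFTC and FIHC build on can be illustrated with a brute-force sketch. It implements neither of the two algorithms: it only enumerates the term sets whose support (the fraction of documents containing all of their terms) reaches a threshold. The function name, the `max_size` cut-off and the toy documents are assumptions made for illustration.

```python
from itertools import combinations


def frequent_itemsets(documents, min_support, max_size=2):
    """Return {itemset: support} for term sets that occur together in at
    least min_support (a fraction in (0, 1]) of the documents."""
    doc_terms = [set(text.lower().split()) for text in documents]
    vocabulary = sorted(set().union(*doc_terms)) if doc_terms else []
    result = {}
    for size in range(1, max_size + 1):
        # Brute force: test every term combination of the given size.
        for candidate in combinations(vocabulary, size):
            items = frozenset(candidate)
            support = sum(items <= terms for terms in doc_terms) / len(doc_terms)
            if support >= min_support:
                result[items] = support
    return result


# Hypothetical snippets.
docs = [
    "web search engine",
    "web search results",
    "search engine ranking",
    "image clustering",
]

for items, support in frequent_itemsets(docs, min_support=0.5).items():
    print(sorted(items), support)
```

Each frequent itemset can be read as a candidate cluster: the documents that contain all of its terms. Enumerating every combination is exponential in the itemset size, so this form is only suitable for tiny vocabularies.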

Document-based and Snippet-based Approaches

Clustering can be applied either to the original documents (as in [11, 27, 21]) or to their (query-dependent) snippets (as in [86, 79, 71, 19, 88, 23, 77]). For instance, clustering Meta Web Search Engines (MWSEs) (e.g. clusty.com) use the results of one or more search engines (e.g. Google, Yahoo!) in order to increase coverage/relevance. Meta-search engines therefore have direct access only to the snippets returned by the queried search engines. Clustering the snippets rather than the whole documents makes clustering algorithms faster. Some clustering algorithms [19, 15, 84] use internal or external sources of knowledge such as Web directories3 (e.g. DMoz4, Yahoo! Directory), dictionaries (e.g. WordNet) and thesauri, online encyclopedias (e.g. Wikipedia5) and other online knowledge bases. These external sources are exploited to identify key phrases that represent the contents of the retrieved documents, or to enrich the extracted words/phrases in order to optimize the clustering and improve the quality of cluster labels.

Document Vector-based Approaches

The traditional clustering algorithms above, either flat (like K-means) or hierarchical (agglomerative or divisive), are based not on snippets but on the original document vectors and on a similarity measure. Another such approach is ESTC (Extended STC) [10], an extension of STC (described later in the Section Snippet-based Approaches) that is appropriate for application over full texts (not snippets). To reduce the (roughly two orders of magnitude) increased number of clusters, a different scoring function and cluster selection algorithm are adopted. The cluster selection algorithm is based on a greedy search that aims at reducing the overlap and increasing the coverage of the final clusters.

In brief, such approaches can be applied only on a stand-alone engine (since they require access to the entire document vectors) and they are computationally expensive. Furthermore, clustering over full text is not appropriate for a (Meta) WSE, since the full text may not be available or may be too expensive to process.
2 Frequent itemsets that appear together in more than a minimum fraction of the whole document set.
3 A web directory is a listing of websites organized in a hierarchy or interconnected list of categories.
4 www.dmoz.org
5 www.wikipedia.org

Snippet-based Approaches

Figure 3.1: Clusty, a Snippet-based Clustering Approach

Snippet-based approaches rely on snippets, and there are already a few engines that provide such clustering services. Clusty6, shown in Figure 3.1, is probably the most famous one. Suffix Tree Clustering (STC) [86] is a key algorithm in this domain and is used by the Grouper [87] and Carrot2 [79, 71] MWSEs. It treats each snippet as an ordered sequence of words, identifies the phrases (ordered sequences of one or more words) that are common to groups of documents by building a suffix tree structure, and returns a flat set of clusters that are naturally overlapping. Several variations of STC have been proposed. For instance, the trie can be constructed with N-grams instead of the original suffixes. The resulting trie has lower memory requirements (since suffixes are no longer than N words) and its building time is reduced, but fewer common phrases are discovered, which may hurt the quality of the final clusters. Specifically, when N is smaller than the length of the true common phrases, the cluster labels can be unreadable. To overcome this shortcoming, [33] proposed a join operation. A variant of STC with N-gram is STC with X-gram [77], where X is an adaptive variable. It has lower memory requirements and is faster than both STC with N-gram and the original STC, since it maintains fewer words. It is claimed to generate more readable labels than STC with N-gram, as it inserts more true common phrases in the suffix tree and joins partial phrases to construct true common phrases; however, no user study results have been reported in the literature, and the reported performance improvements are small.
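The idea of grouping snippets by the phrases they share can be illustrated with a small sketch. It deliberately does not build a suffix tree; instead it indexes word N-grams in a dictionary, which corresponds only loosely to the N-gram variant mentioned above. The function name `phrase_clusters`, the `min_docs` threshold and the toy snippets are assumptions made for illustration.

```python
from collections import defaultdict

def phrase_clusters(snippets, n=2, min_docs=2):
    """Map each phrase of at most n words to the snippets that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(snippets):
        words = text.lower().split()
        for size in range(1, n + 1):
            for start in range(len(words) - size + 1):
                phrase = " ".join(words[start:start + size])
                index[phrase].add(doc_id)
    # Keep phrases shared by at least min_docs snippets; clusters may overlap.
    return {p: docs for p, docs in index.items() if len(docs) >= min_docs}

snippets = ["jaguar car dealer", "jaguar car review", "jaguar big cat"]
clusters = phrase_clusters(snippets)
```

With `n=2` the phrase "jaguar car" becomes a candidate cluster label for the first two snippets. Limiting `n` keeps the index small, but a true common phrase longer than `n` words would be split into partial phrases, which is the readability problem noted in the text.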
6 www.clusty.com


Another snippet-based clustering approach is TermRank [23]. TermRank succeeds in ranking discriminative terms higher than ambiguous terms, and ambiguous terms higher than common terms. The top T terms can then be used as feature vectors in K-means or any other document vector-based clustering algorithm. This approach requires knowing TF, works on single words rather than phrases, and no evaluation results over snippets have been reported in the literature. Another approach is Findex [34], a statistical algorithm that extracts candidate phrases by moving a window with a length of 1..|P| words across the sentences (P), and fKWIC, which extracts the most frequent keyword contexts, which must be phrases that contain at least one of the query words. In contrast to STC, Findex merges clusters not on the basis of common documents but on the similarity of the extracted phrases. However, no comparative results regarding cluster label quality have been reported in the literature.

Finally, there are snippet-based approaches that use external resources (lexical or training data). For instance, SNAKET7 [19] (a MWSE) uses the DMoz web directory for ranking the gapped sentences8 extracted from the snippets. Deep Classifier [84] trims the large hierarchy returned by an online Web directory into a narrow one and combines it with the results of a search engine, making use of a discriminative naive Bayesian classifier. Another (supervised) machine learning technique is Salient Phrases Extraction [88]. It extracts salient phrases as candidate cluster names from the list of titles and snippets of the answer, and ranks them using a regression model over five different properties, learned from human training data. Another approach, which uses several external resources such as WordNet and Wikipedia to identify useful terms and organize them hierarchically, is described in [15].
Other extensions of STC for oriental languages and for cases where external resources are available are described in [89, 78].

3.1.3 Cluster Presentation & User Interaction


Although cluster presentation and user interaction approaches are somewhat orthogonal to the clustering algorithms employed, they are crucial for providing flexible and effective access services to end users. In most cases, clusters are presented using lists or trees. Some variations are described next. A well-known interaction paradigm that involves clustering is Scatter/Gather [11, 27], which provides an interactive interface that lets users select clusters; the documents of the selected clusters are then clustered again, the new clusters are presented, and so on.
7 SNippet Aggregation for Knowledge ExtracTion
8 Gapped sentences are sequences of terms occurring non-contiguously in the snippets.


Figure 3.2: Quintura Word Cloud

Clusty9 is an extension of Vivisimo that offers a new feature, called remix clustering, which clusters the same search results again while ignoring the topics that the user has already seen. Another approach for the presentation layer is provided by Quintura10, shown in Figure 3.2. It extracts keywords from the search results and builds a word cloud (visual map). The name of each cluster is placed in a 2D area. The positions of the names are based on their distance, while font size indicates the size of each cluster. Clicking words in the cloud refines the user query. SNAKET's [19] interface offers a personalization feature that is performed at the client side: the user can select a set of labels and then ask SNAKET to filter out (from the ranked list) all those snippets that do not belong to the folders labeled by the selected labels. SOMs have been used to support exploration of a document space, in order to search for patterns and gain overviews of the available documents and the relationships between them [42] (Figure 3.4). Another information visualization alternative, Citiviz, displays the clusters in search results using a hyperbolic tree and a scatterplot. Several (M)WSEs incorporate visualizations similar to both treemaps and hyperbolic trees. grokker11, shown in Figure 3.3, clusters documents into a hierarchy and produces an Euler diagram, a coloured circle for each top-level cluster with sub-clusters nested recursively, in which the user can zoom in.
9 www.clusty.com
10 www.quintura.com
11 www.grokker.com


Figure 3.3: grokker Generates an Euler Diagram

Another example is Kartoo12, shown in Figure 3.5, which generates a thematic map from the top dozen search results for a query, laying out small icons that represent the results on the map, with which the user can interact.

3.2 Facets and Dynamic Taxonomies


Dynamic taxonomies (also known as faceted search systems) [64] are a general knowledge management model based on a multidimensional classification of heterogeneous data objects, and are used to explore and browse complex information bases in a guided, yet unconstrained way through a visual interface. Features of faceted metadata search include (a) display of current results in multiple categorization schemes (facets) (e.g. based on metadata terms, such as size, price or date), (b) display of categories leading to non-empty results, and (c) display of the count of the indexed objects of each category (i.e. the number of results the user will get by selecting this category).
12 www.kartoo.com


Figure 3.4: Top-200 Web Search Results Clustering Displayed Using Two-level TreeMaps

Figure 3.5: Kartoo Generates a Thematic Map

3.2.1 Introduction
Static taxonomies (such as Yahoo!'s), based on a hierarchy of concepts, can be used to select areas of interest and restrict the portion of the infobase that is retrieved. The creation of such taxonomies is usually a manual process, although automatic and semi-automatic techniques have been proposed. However, static taxonomies are not scalable for large information bases [65], and the number of documents rapidly becomes too large for manual inspection. On the other hand, dynamic taxonomies [63, 64, 76] (also known as faceted search systems) are a general knowledge management model based on a multidimensional classification of heterogeneous data objects, and are used to explore/browse complex information bases in a guided, yet unconstrained way through a visual interface. Features of faceted metadata search include:

• display of current results in multiple categorization schemes (facets) (e.g. based on metadata terms, such as size, price or date)
• display of categories leading to non-empty results (Poka-Yoke13)
• display of the count of the indexed objects of each category (i.e. the number of results the user will get by selecting this category)

Such systems focus on user-centered interactive exploratory access, and propose a holistic approach in which modeling, interface and interaction issues are considered together. One of the key factors of this model is simplicity, which makes it easily understandable and usable by end users. The user always deals with a single conceptual representation of the infobase. The conceptual schema of a dynamic taxonomy is a plain taxonomy: a hierarchy going from the most general to the most specific concepts, based on subsumption. Directed acyclic graph taxonomies modelling multiple inheritance are supported but rarely required. The user is guided towards his goal, because at each stage he has a complete list of all the concepts related to the current focus, which can be used to further refine the exploration. Furthermore, as in traditional search methods, the infobase can be restricted and a reduced taxonomy can be created. The user is in charge of the interaction and can freely explore the infobase, discovering unexpected relationships. By construction, no empty results can occur, because they are automatically pruned.

Usability studies [26, 85] show that, despite slow response times, dynamic taxonomies produce a faster overall interaction and a significantly better recall (both actual and perceived) than access through text retrieval. Dynamic taxonomies converge very quickly to small result sets, as described in [65]. For example, 3 zoom operations on terminal concepts are sufficient to reduce a 10,000,000 object infobase described by a compact taxonomy with 1,000 concepts to an average of 10 objects. Finally, the conceptual organization of dynamic taxonomies makes it possible to gather user interests at a precise conceptual level by simply monitoring the zoom operations issued and the concepts the user focuses on.
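The three features of faceted search listed above (multiple facets, pruning of empty categories, and per-category counts) can be illustrated with a minimal Python sketch. The data layout (one dictionary per object) and the function names `facet_counts` and `zoom` are assumptions made for this example; they are not taken from any of the cited systems.

```python
from collections import Counter

def facet_counts(objects, facets):
    """For each facet, count the objects of the current result per value."""
    counts = {f: Counter() for f in facets}
    for obj in objects:
        for f in facets:
            if f in obj:
                counts[f][obj[f]] += 1
    # Only values carried by at least one object are counted,
    # so categories leading to empty results never show up.
    return counts

def zoom(objects, facet, value):
    """Restrict the current result to the objects with the selected value."""
    return [o for o in objects if o.get(facet) == value]

# Hypothetical infobase with two facets.
items = [
    {"type": "camera", "price": "low"},
    {"type": "camera", "price": "high"},
    {"type": "phone", "price": "low"},
]
focus = zoom(items, "type", "camera")
counts = facet_counts(focus, ["type", "price"])
```

After zooming on `camera`, the value `phone` no longer appears among the counts of the `type` facet, so the user cannot select a category that would lead to an empty result.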
13 Poka-Yoke is a Japanese term that means fail-safing or mistake-proofing.


Examples of applications of faceted metadata search include: e-commerce (e.g. eBay), library and bibliographic portals (e.g. DBLP), museum portals (e.g. [49] and Europeana14), mobile phone browsers (e.g. [35]), specialized search engines and portals (e.g. [50]), the Semantic Web (e.g. [30, 49, 56]), general purpose web search engines (e.g. Google Base), and other frameworks (e.g. mSpace [67]).

3.2.2 Taxonomy Design


The most accurate way to create a taxonomy is to build the categories by hand. Unfortunately, manual classification is expensive and infeasible for many practical document collections, and especially for a WSE document collection. Automatic clustering techniques generate clusters that are typically labeled using a set of keywords, which leads to unpredictable and unintuitive labels. An alternative to clustering is to generate hierarchies of terms for browsing the database. [66] introduced subsumption hierarchies, and [45] showed experimentally that subsumption hierarchies outperform lexical hierarchies [60]. Another approach is to use the hierarchical structure of WordNet15,16 to offer a hierarchy view over the topics [40].

WordNet, together with a tree-minimization algorithm, is also used in [72] to create an appropriate concept hierarchy for a database. All these techniques generate a single hierarchy for browsing the database. A supervised approach for extracting useful facets from a collection of text or text-annotated data is described in [14]; it relies on WordNet hypernyms17 and on a Support Vector Machine (SVM) classifier to assign new keywords to facets. More recent work [15, 13] provides an unsupervised technique for extracting useful facet terms, which expands a database using WordNet and Wikipedia to identify important terms.

3.2.3 Framework
Table 3.1 formally defines, and introduces notations for, terms, terminologies, taxonomies, faceted taxonomies, interpretations, descriptions and materialized faceted taxonomies, as described in [76]. In brief, Obj is a set of objects (the set of all documents indexed by the WSE), T is a set of terms, and the elements of Obj can be described with respect to one or more aspects (facets), where each aspect is associated with a value domain, finite or infinite, which may be ordered (in the general case we could have a partial order $(T, \leq)$). The description of an object with respect to one facet consists of assigning to the object one or more terms from the taxonomy that corresponds to that facet.

Table 3.2 defines the required notions and notations regarding user interaction. The user explores or navigates the information space by setting and changing his focus. The notion of focus can be intensional or extensional. Specifically, any set of terms, i.e. any conjunction of terms (or any boolean expression of terms), is a possible focus. For example, the initial focus can be the empty compound term, or the top term of a facet. However, the user can also start from an arbitrary set of objects, and this is the common case in the context of a WSE. In that case the focus is defined extensionally. Specifically, if A is the result of a free text query q, then the interaction is based on the restriction of the materialized faceted taxonomy on A (as defined in the bottom part of Table 3.2). At any point during the interaction, the immediate zoom-in/out/side points, along with count information, are computed and provided to the user. When the user selects one of these points, the selected term is added to the focus, and so on. An example of a materialized faceted taxonomy is shown in Figure 3.6.

14 http://www.europeana.eu
15 WordNet is a lexical database that groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets.
16 http://wordnet.princeton.edu/
17 A hypernym is a word whose meaning includes the meanings of other words, as the meaning of the term animal includes the meaning of cat, dog, parrot.

Figure 3.6: Example of a Materialized Faceted Taxonomy

Foci are considered to be redundancy free. A focus ctx (i.e. $ctx \subseteq T$) is redundancy free if $ctx = \min_{\leq}(ctx)$. For example, ctx = {Greece, Crete} is not redundancy free, because $\min_{\leq}(ctx) = \{Crete\}$. The contents (or extension) of a focus ctx is the set of objects $\bar{I}(ctx)$. This notion can be refined in order to distinguish the shallow contents $I(ctx)$ from the deep contents $\bar{I}(ctx)$.
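As a small illustration of the redundancy-free condition, the sketch below computes the minimal (most specific) terms of a focus, assuming the subsumption relation is supplied as a map from each term to all of its broader terms. The map `broader` and the function name `minimal` are hypothetical and serve only this example.

```python
def minimal(ctx, broader):
    """Return the minimal (most specific) terms of a focus.

    `broader` maps each term to the set of all its broader terms.
    A term is dropped when it is broader than another term of the focus.
    """
    return {t for t in ctx
            if not any(t in broader.get(u, set()) for u in ctx if u != t)}

# Hypothetical taxonomy fragment: Crete is narrower than Greece.
broader = {"Crete": {"Greece"}, "Greece": set()}

ctx = {"Greece", "Crete"}
reduced = minimal(ctx, broader)
```

Here `reduced` contains only `Crete`, matching the example in the text: `Greece` is redundant because every object under `Crete` is already under `Greece`.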

3.2.4 User Interface Design


System implementations for dynamic taxonomies and faceted search allow a wide range of query possibilities on the data. Only when these are made accessible through appropriate UIs can the resulting applications support a variety of search, browsing and analysis tasks. Such systems should at least support the three basic characteristics of faceted and dynamic taxonomies: they should display non-empty results, in multiple categorization schemes (facets), along with the count of the indexed objects of each category. Additional UI functionality is usually accompanied by additional complexity and visual clutter.

Selection and de-selection of zoom-points is of central importance in faceted search. If only one concept should be selectable at a time within a facet, traditional single-select controls such as radio buttons, drop-down list controls or simple links can be used. The standard multi-select elements, on the other hand, are check boxes. For instance, the yelp18 web application provides check boxes for multi-select facets and simple links for facets with exclusive selection. Alternatives for allowing both modes in a facet would be dedicated controls, or modifier keys (such as pressing shift while clicking). For the range selection navigation mode, slider controls can allow the specification of upper and lower bounds on the result set. De-selection should be as easy as concept selection. Additionally, if breadcrumbs or a similar filter summary, summarizing single or all facets, are present, these should include the option to clear individual filters as well. Buttons for resetting single facets or all filter options can also help users to zoom out quickly.

18 http://www.yelp.com

Figure 3.7: ContentLandscape Applies Collapsible Panel Pattern for Zooming

For flat facets, i.e. facets without a hierarchical relation between their concepts, simple list widgets are usually used. List sorting can either be alphabetical, or dynamically updated by the number of assigned items in the current result set. For navigating hierarchies, a number of different presentation and navigation options exist, which include: Explorer Tree (not very space efficient); Zoom and Replace, which replaces the facet widget content with the level below (used in Flamenco19 [85]); Collapsible panels, hierarchical widgets based on the accordion pattern20 (used in the ContentLandscape application [70], Figure 3.7); and Continuous Zooming, where hierarchical facets are displayed as space-filling widgets, which allow a fast traversal across all levels while simultaneously maintaining context (used in the FacetZoom prototype [12], Figure 3.8). The number of indexed items for each facet and zoom-point can be shown by numbers (after the labels), bar charts, height of facets, colour, etc. Visgets [18] extends this principle by featuring a whole range of visualizations. FaThumb [35] enables faceted search on mobile devices (Figure 3.9). The filter area is grouped in nine zones, corresponding to the nine digit keys on mobile phones. The middle zone serves as a spatial overview during navigation. The surrounding eight zones allow the user to select hierarchy branches and repeatedly zoom in on subtrees. The left shortcut key adds the currently selected concept to the query, and the right one allows the user to quickly jump back to the top.

Figure 3.8: FacetZoom Combines Ideas from Zoomable UIs With Faceted Search

Query searching can be done either over all results or within the current focus, as shown in Figure 3.10. Moreover, in order to quickly locate zoom-points in a facet and avoid having to navigate large hierarchies even though the target concept may already be known by name, direct access to facet items can be achieved with a keyword search over the concept labels (/facet [30]). Since the number of available facets can be very large, ways to reduce the screen space they use are discussed in [24], and include collapsible facet widgets (such as those used by the Getty Images faceted navigation interface21) and expandable filter areas (i.e. a More button). Furthermore, systems should be able to determine which facet-value pairs the interface should provide to a user. Personalization allows the system to present the facet-value pairs that can help the user quickly find the documents that he is most interested in. Existing approaches include content-based personalization, where a recommendation system monitors the user's actions and pushes documents that match his user profile; collaborative faceted search personalization, where the system recommends items to a user by leveraging information from other users with similar tastes and preferences; and finally an ontological approach, which uses the distance between values of an ontology to measure the relevance to users [75].

Figure 3.9: Faceted Search for Small Screens in the FaThumb Prototype

Figure 3.10: Flamenco Allows Choosing Between a Search Over All Results or Within Current Focus

19 Online demos available at http://flamenco.berkeley.edu/
20 http://www.welie.com/patterns/showPattern.php?patternID=accordion
21 http://gettyimages.com



Name | Notation | Definition
terminology | $T$ | a set of names, called terms (they may capture both categorical and numeric values)
subsumption | $\leq$ | a partial order (reflexive, transitive and antisymmetric)
taxonomy | $(T, \leq)$ | $T$ is a terminology, $\leq$ a subsumption relation over $T$
broaders of $t$ | $B^+(t)$ | $\{ t' \mid t < t' \}$
narrowers of $t$ | $N^+(t)$ | $\{ t' \mid t' < t \}$
direct broaders of $t$ | $B(t)$ | $\mathrm{minimal}_{<}(B^+(t))$
direct narrowers of $t$ | $N(t)$ | $\mathrm{maximal}_{<}(N^+(t))$
faceted taxonomy | $F = \{F_1, ..., F_k\}$ | $F_i = (T_i, \leq_i)$, for $i = 1, ..., k$, and all $T_i$ are disjoint
compound term over $T$ | $s$ | any subset of $T$ (i.e., any element of $P(T)$)
compound ordering | $\preceq$ | $s \preceq s'$ iff $\forall t' \in s' \; \exists t \in s$ s.t. $t \leq t'$
broaders of $s$ | $B^+(s)$ | $\{ s' \in P(T) \mid s \prec s' \}$
narrowers of $s$ | $N^+(s)$ | $\{ s' \in P(T) \mid s' \prec s \}$
direct broaders of $s$ | $B(s)$ | $\mathrm{minimal}_{\preceq}(B^+(s))$
direct narrowers of $s$ | $N(s)$ | $\mathrm{maximal}_{\preceq}(N^+(s))$
object domain | $Obj$ | any denumerable set of objects
interpretation of $T$ | $I$ | any function $I : T \to 2^{Obj}$
materialized faceted taxonomy | $(F, I)$ | $F$ is a faceted taxonomy $\{F_1, ..., F_k\}$, $I$ is an interpretation of $T = \bigcup_{i=1,k} T_i$
Top element | $\top_i$ | $\top_i = \mathrm{maximal}_{\leq}(T_i)$
ordering of interpretations | $I \sqsubseteq I'$ | $I(t) \subseteq I'(t)$ for each $t \in T$
model of $(T, \leq)$ induced by $I$ | $\bar{I}$ | $\bar{I}(t) = \bigcup \{ I(t') \mid t' \leq t \}$ (it is the minimal model that is greater than $I$)
extension of $s$ in $I$ and in $\bar{I}$ | $I(s)$, $\bar{I}(s)$ | $I(s) = \bigcap \{ I(t) \mid t \in s \}$ and $\bar{I}(s) = \bigcap \{ \bar{I}(t) \mid t \in s \}$
Description of $o$ wrt $I$ | $D_I(o)$ | $D_I(o) = \{ t \in T \mid o \in I(t) \}$
Description of $o$ wrt $\bar{I}$ | $D_{\bar{I}}(o)$ | $D_{\bar{I}}(o) = \{ t \in T \mid o \in \bar{I}(t) \} = \bigcup_{t \in D_I(o)} (\{t\} \cup B^+(t))$
Description of a set of objects $A$ wrt $I$ | $D_I(A)$ | $D_I(A) = \bigcup_{o \in A} D_I(o)$

Table 3.1: Basic Notions and Notations
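A minimal sketch of two notions from Table 3.1, the model induced by an interpretation and the description of an object, is given below. It assumes that the subsumption relation is supplied as a map from each term to the set of its narrower-or-equal terms; this map, the toy facet and all names are hypothetical.

```python
def model(interp, narrower_or_equal):
    """Induced model: a term also receives the objects of its narrower terms."""
    return {t: set().union(*(interp.get(u, set()) for u in narrower_or_equal[t]))
            for t in narrower_or_equal}

def description(interp, o):
    """The terms whose interpretation contains object o."""
    return {t for t, objs in interp.items() if o in objs}

# Hypothetical facet: Heraklion is narrower than Crete, Crete than Greece.
narrower_or_equal = {
    "Greece": {"Greece", "Crete", "Heraklion"},
    "Crete": {"Crete", "Heraklion"},
    "Heraklion": {"Heraklion"},
}
interp = {"Greece": {"d1"}, "Crete": {"d2"}, "Heraklion": {"d3"}}
interp_bar = model(interp, narrower_or_equal)
```

In this toy example `description(interp, "d3")` yields only `Heraklion`, while `description(interp_bar, "d3")` also yields its broader terms `Crete` and `Greece`, in line with the last identity for the description of an object in the table.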


Name | Notation | Definition
focus | $ctx$ | any subset of $T$ such that $ctx = \mathrm{minimal}(ctx)$
focus projection on a facet $i$ | $ctx_i$ | $ctx_i = ctx \cap T_i$

Kinds of zoom points w.r.t. a facet $i$ while being at $ctx$ | Notation | Definition(s)
zoom points | $AZ_i(ctx)$ | $= \{ t \in T_i \mid \bar{I}(ctx) \cap \bar{I}(t) \neq \emptyset \}$
zoom-in points | $Z_i^+(ctx)$ | $= AZ_i(ctx) \cap N^+(ctx_i)$
immediate zoom-in points | $Z_i(ctx)$ | $= \mathrm{maximal}(Z_i^+(ctx)) = AZ_i(ctx) \cap N(ctx_i)$
zoom-side points | $ZR_i^+(ctx)$ | $= AZ_i(ctx) \setminus \{ ctx_i \cup N^+(ctx_i) \cup B^+(ctx_i) \}$
immediate zoom-side points | $ZR_i(ctx)$ | $= \mathrm{maximal}(ZR_i^+(ctx))$

Restriction over an object set | Notation | Definition(s)
restricted object set | $A$ | any subset of $Obj$
reduced interpretation | $I_A$ | $I_A(t) = I(t) \cap A$
reduced terminology | $T_A$ | $= \{ t \in T \mid \bar{I}_A(t) \neq \emptyset \} = \{ t \in T \mid \bar{I}(t) \cap A \neq \emptyset \} = \bigcup_{o \in A} B^+(D_I(o))$

Table 3.2: Interaction Notions and Notations
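The zoom points of Table 3.2, together with their counts, could be computed along the following lines. This is only a sketch under simplifying assumptions: the deep interpretation is given directly as a dictionary from terms to object sets, a focus is a plain set of terms, and the distinction between zoom-in and zoom-side points is ignored. All names and data are hypothetical.

```python
def extension(interp_bar, ctx, all_objects):
    """Objects indexed under every term of the focus (all objects if empty)."""
    result = set(all_objects)
    for t in ctx:
        result &= interp_bar[t]
    return result

def zoom_points(interp_bar, facet_terms, ctx, all_objects):
    """Terms of one facet, with counts, that keep the result non-empty."""
    current = extension(interp_bar, ctx, all_objects)
    return {t: len(current & interp_bar[t])
            for t in facet_terms if current & interp_bar[t]}

# Hypothetical deep interpretation over two facets (location and type).
interp_bar = {
    "Greece": {"d1", "d2", "d3"}, "Crete": {"d2", "d3"}, "Italy": {"d4"},
    "hotel": {"d1", "d2"}, "museum": {"d3", "d4"},
}
objs = {"d1", "d2", "d3", "d4"}
type_points = zoom_points(interp_bar, ["hotel", "museum"], {"Crete"}, objs)
place_points = zoom_points(interp_bar, ["Greece", "Crete", "Italy"], {"hotel"}, objs)
```

With the focus on `hotel`, the term `Italy` is not offered as a zoom point because no hotel in the toy data is located there. Restricting the computation to the answer A of a free text query amounts to passing A as `all_objects`.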


Chapter 4

Visualization Models and Metaphors

In this chapter we discuss five different visualization models. First, we discuss MRPBM, which is based on reference points (RPs); we then analyze ESCBM, which is based on the VSM ranking model and its spatial characteristics. The next one is PFNET, which uses associative networks, and the fourth is MDS, a group of methods used to discover empirical relationships among investigated objects. Finally, we discuss SOM, a nonlinear topology-preserving projection method that converts a high-dimensional space into a low-dimensional grid, as well as different visualization metaphors.

4.1 Multiple Reference Points Based Models (MRPBM)


MRPBM models are visualization algorithms that display the results of a search not in the classical linear order, but by projecting them onto a low-dimensional visual space. They can effectively handle complex information needs by using multiple RPs. An RP, or Point of Interest (POI), is a search criterion against which documents or surrogates are matched, and from which search results are generated and presented to the users. In a broad sense, an RP represents the users' information needs and any information related to those needs, from user preferences and search history to query terms or browsed documents. Multiple RPs can form a low-dimensional visual space, and documents can be mapped onto the space based upon their attraction to the RPs. Visualization models based on multiple RPs can be classified into three categories:

Fixed Multiple RPs Models

These models use multiple RPs with a fixed position, and can be used for both vector-based and Boolean-based IR systems. The representative model is InfoCrystal [69]. In the Boolean context, each RP is equivalent to a term or a Boolean sub-expression from a Boolean query. The visual space is a polygon whose vertices are the RPs. The sides of the polygon have equal length, so that the RPs are evenly arranged in the visual space. The retrieved results are displayed inside the polygon. The polygon is partitioned into N exclusive tiers, represented as concentric rings, where N is the number of RPs. The first tier displays results related to only one RP, the second tier results related to two RPs, etc. Figure 4.1 shows a fixed multiple RPs model.

Figure 4.1: Display of 4 Reference Points in a Fixed Reference Point Environment
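The assignment of results to tiers in a fixed multiple RPs model can be sketched as follows: each document is placed in the tier given by the number of RPs it matches. This is not InfoCrystal's actual implementation; the representation of documents as term sets, the function name `tiers` and the decision to leave out documents that match no RP are assumptions made for illustration.

```python
from collections import defaultdict

def tiers(doc_terms, reference_points):
    """Group documents by the number of reference points they match."""
    result = defaultdict(list)
    rps = frozenset(reference_points)
    for doc, terms in doc_terms.items():
        matched = frozenset(terms) & rps
        if matched:  # assumption: documents matching no RP are left out
            result[len(matched)].append((doc, sorted(matched)))
    return dict(result)

# Hypothetical documents and four RPs (query terms).
docs = {
    "d1": {"search", "ranking"},
    "d2": {"search", "ranking", "clustering"},
    "d3": {"visualization"},
    "d4": {"weather"},
}
rps = ["search", "ranking", "clustering", "visualization"]
by_tier = tiers(docs, rps)
```

In this toy data, `d3` falls in the first tier, `d1` in the second and `d2` in the third, while the fourth tier (documents related to all four RPs) stays empty.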

Movable Multiple RPs Models

These models use multiple RPs, which can be manipulated by the user, while the semantic connections of the displayed objects are still maintained in the visual space. VIBE [55] and its variations, VRVIBE [2] and LyberWorld [28], are such models. The primary benefit of this approach is that the user may arbitrarily place an RP in any interesting area, such as near another RP, a document or a cluster of documents, and observe the impact of the RP on that area. According to the algorithm, the position of a document is strongly related to the similarities between the document and a group of predefined RPs. The positions of all related RPs in the visual space play a very important role in positioning a projected document. In addition, the ultimate position of a document is calculated by taking into consideration the relevance between the document and its related RPs. Initially, the first two related RPs are selected in order to calculate the position of the document. The new position of the document serves as an intermediate RP for further consideration, and the process continues until all related RPs have been considered. If the user adds, removes, or changes the position of any RP, the whole algorithm must be executed again.

Figure 4.2 shows a snapshot of VIBE. In this example 5 RPs (circles) are used and documents are represented as rectangles. Documents that contain at least one of the descriptors indicated by the user when initiating the search are considered relevant. The documents whose descriptors coincide more with those of an RP are placed closer to that RP. The user can also expand the icons of useful documents by simply drawing a box around the document or documents of interest; a list of the chosen selection is then shown. Clicking with the mouse on any of the documents on the list opens another window with the complete document. One characteristic that makes the system interactive is that the user may add, change or remove RPs on the screen. When any of these changes is carried out, the system automatically relaunches the search query and re-orders the found documents to present the relationships between documents and those between POIs.

Automatic RPs Rotation Models

This model is a similarity-ratio-based model and was introduced with WebStar [93] to visualize link structures. The uniqueness of this model is that it adds a new feature, the automatic rotation of an RP, to the 2D visual space. The visual space is built on a polar coordinate system, where the origin of the visual space is a central document (focus point), specified or selected by users, and RPs are evenly distributed on a sphere with the focus point as center. All of the relevant documents are scattered within the visual space based on their projection angle (which is similarity based) and distance (which is not). When an RP is selected, it automatically rotates around the sphere. As a consequence, related documents are attracted and also rotated. Figure 4.3 shows the WebStar system. The central document (focus point) is denoted with a blue square at the center of the circle, while the four RPs (sport, research, international, library) are represented with yellow squares, evenly distributed outside the circle. Documents are the pink squares scattered inside the circle. In this example the user has selected the international RP, coloured in red, which is rotated around the circle. Notice how documents change position as the RP rotates.

Figure 4.2: VIBE Using 5 Reference Points

Both the models for fixed and movable multiple RPs require at least three RPs to project documents in their visual spaces, while the model for automatic RPs rotation requires at least one RP in conjunction with the focus point. Furthermore, visualization models for multiple RPs can be 2D or 3D and can be applied to either Boolean or vector based information systems. The position of any RP can be controlled and manipulated by users at will. It is this flexibility of manipulation that enables users to compare and analyze the impact of two reference points on documents, and to identify good or poor discriminative terms. Such models can be used to visualize Internet hyperlinks, search results from an information retrieval system, a full text, and term discriminative analysis.
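The incremental positioning described for VIBE can be sketched as follows. This is a minimal sketch, not VIBE's actual implementation: it assumes that each document has a non-negative score per RP and that merging two points places the document on the segment between them in proportion to those scores (the text does not spell out the formula). The names `place_document`, `rps` and `scores` are illustrative.

```python
def place_document(rps, scores):
    """Position a document among reference points (RPs).

    rps    -- dict mapping an RP name to its (x, y) screen position
    scores -- dict mapping an RP name to the document's score for that RP
    Returns (x, y), or None when the document is related to no RP.
    """
    related = [(rps[name], s) for name, s in scores.items() if s > 0]
    if not related:
        return None
    (x, y), weight = related[0]
    for (rx, ry), s in related[1:]:
        # The running position acts as an intermediate RP whose weight
        # is the sum of the scores merged so far (assumed rule).
        total = weight + s
        x = (x * weight + rx * s) / total
        y = (y * weight + ry * s) / total
        weight = total
    return (x, y)


rps = {"sport": (0.0, 0.0), "research": (10.0, 0.0), "library": (0.0, 10.0)}
print(place_document(rps, {"sport": 1, "research": 1}))  # midway between the two RPs
print(place_document(rps, {"sport": 1, "research": 3}))  # pulled towards "research"
```

Because every RP position enters the result, moving, adding or removing an RP changes the output for all related documents, which is why the whole layout has to be recomputed after such a change.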

4.2 Euclidean Spatial Characteristic Based Model (ESCBM)


These visualization models are based on the VSM and its spatial characteristics. The basic Euclidean spatial elements such as point, distance, and angle may have a special connection to information retrieval in the context of the vector-based space. For instance, a document or RP in a vector-based space corresponds to a spatial point in the Euclidean space. Euclidean distance between documents and RPs can be used as an indicator of their similarity. Their visual spaces are 2D and, in order to construct them, they use two RPs, which serve as view points, one major (KVP) and one minor (AVP). These RPs, the reference axis that they form and the distance between them are all selected by the user and affect the placement of the relevant documents.

Figure 4.3: WebStar Using 4 RPs. Snapshots During a Full Rotation of international Reference Point

The projection conversion equation for an IR evaluation model is crucial for displaying it in the visual space. The complexity of a conversion equation depends upon multiple factors, such as the definition of the visual space and the nature of the retrieval evaluation model. Some equations are simple and straightforward while others may be complicated. The significance of visualizing an IR evaluation model is not only to make the invisible internal retrieval process transparent to users but also to allow them to manipulate the model in the visual space at will. In this context, three visualization models have been proposed.

Distance-angle Based Model

In this model the visual projection distance and angle are defined for any document Di. The projection distance is the distance from the document Di to the KVP, and the projection angle is the angle formed by the lines KVP-Di and KVP-AVP, in the vector space. The valid display area of this model is a half-infinite plank, where the X-axis and Y-axis are defined as the visual projection angle and distance respectively. The width of the X-axis is always equal to π and the width of the Y-axis is infinite. KVP is always mapped onto the origin of the visual space, because its visual projection distance is 0 and its angle is defined as 0. AVP is mapped onto the Y-axis, because its visual projection distance is the length between the two reference points and its visual projection angle is defined as 0. The distance between the two reference points does not affect this model. DARE [90] is such a model and Figure 4.4 shows the display of the projected cosine model using DARE. The angle α is the retrieval threshold, while R2 is AVP. D1 is a document situated within the retrieval area defined by the angle α, and D2 is any document located on one boundary of the angle α. Users may drag the vertical retrieval line to any place within the valid display area to increase or decrease the retrieval area.
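The distance-angle projection described above can be illustrated with a short sketch. It assumes documents and view points are plain coordinate tuples in the vector space; the function name `dare_projection` is illustrative and is not taken from DARE itself.

```python
import math


def dare_projection(doc, kvp, avp):
    """Return (angle, distance) of `doc` as seen from KVP.

    angle    -- angle between the lines KVP-doc and KVP-AVP, in [0, pi]
    distance -- Euclidean distance from doc to KVP
    """
    v = [d - k for d, k in zip(doc, kvp)]  # vector KVP -> doc
    a = [p - k for p, k in zip(avp, kvp)]  # vector KVP -> AVP (reference axis)
    dist = math.sqrt(sum(c * c for c in v))
    axis = math.sqrt(sum(c * c for c in a))
    if dist == 0 or axis == 0:
        return (0.0, dist)  # the angle is defined as 0 at KVP itself
    cos = sum(x * y for x, y in zip(v, a)) / (dist * axis)
    cos = max(-1.0, min(1.0, cos))  # guard against rounding outside [-1, 1]
    return (math.acos(cos), dist)


kvp, avp = (0.0, 0.0, 0.0), (2.0, 0.0, 0.0)
print(dare_projection((0.0, 3.0, 0.0), kvp, avp))  # angle pi/2, distance 3
print(dare_projection(avp, kvp, avp))              # AVP: angle 0, distance 2
```

Since the angle never leaves [0, π] while the distance is unbounded, the projected points fill exactly the half-infinite plank described in the text.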

Figure 4.4: Display of the Projected Cosine Model, in Distance-Angle DARE Model

Angle-angle Based Model

In this model two visual projection angles are defined for any document Di. The first angle (α) is the angle formed by the lines KVP-Di and KVP-AVP, and the second one (β) is the angle formed by the lines AVP-Di and KVP-AVP, both of them in the vector space. The two angles α and β are assigned to the X-axis and Y-axis. The minimum and maximum values of the two angles α and β are 0 and π respectively. The valid display area is a triangle and the two reference points are projected at (π/2, 0) and (0, π/2) respectively. This model again is not affected by the distance between the two RPs. TOFIR [91] is an example of such a model, shown in Figure 4.5. The retrieval threshold is an angle, the origin of the vector space is KVP, while R2 is AVP. In the figure O is the projected origin of the vector space. The horizontal line defines the retrieval area, and can be manipulated by the users.
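A brief sketch of the angle-angle projection follows. It assumes coordinate tuples for the document and the two view points; `angle_at` and `tofir_projection` are illustrative names, not TOFIR's API. The two angles are interior angles of the triangle KVP, AVP, Di, so their sum cannot exceed π, which is why the valid display area is a triangle.

```python
import math


def angle_at(vertex, p, q):
    """Angle at `vertex` between the rays vertex->p and vertex->q, in [0, pi]."""
    u = [a - b for a, b in zip(p, vertex)]
    w = [a - b for a, b in zip(q, vertex)]
    nu = math.sqrt(sum(c * c for c in u))
    nw = math.sqrt(sum(c * c for c in w))
    if nu == 0 or nw == 0:
        return 0.0  # degenerate case: treat the angle as 0
    cos = sum(x * y for x, y in zip(u, w)) / (nu * nw)
    return math.acos(max(-1.0, min(1.0, cos)))


def tofir_projection(doc, kvp, avp):
    """Return the two projection angles (at KVP, at AVP) for a document."""
    return (angle_at(kvp, doc, avp), angle_at(avp, doc, kvp))


x, y = tofir_projection((1.0, 1.0), (0.0, 0.0), (2.0, 0.0))
print(x, y)                 # both angles are pi/4 for this symmetric case
print(x + y <= math.pi)     # the pair always stays inside the triangle
```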

Figure 4.5: Display of the Projected Cosine Model, in the Angle-Angle TOFIR Model

Distance-distance Based Model

In this model two visual projection distances are defined for any document Di. The first is the distance from the document Di to the KVP, and the second is the distance from the document Di to the AVP, both of them in the vector space. The two projection distances are assigned to the X-axis and Y-axis. The valid display area is a half-infinite plank. It forms a π/4 angle against the X-axis or the Y-axis, its two corners are connected to the X-axis and Y-axis respectively, and its width is dynamic and determined by the distance between the two RPs. GUIDO [54] is such a model. Figure 4.6 shows the distance model in GUIDO.

Figure 4.6: Display of the Projected Distance Model, in the Distance-Distance GUIDO Model

One of the distinguishing characteristics of these visualization models is their capacity to visualize traditional IR evaluation models in addition to visualizing relationships among documents. Document distributions in these visual spaces change accordingly when the RPs change. This implies that the displayed document configurations in the visual spaces can be customized based upon users' dynamic information needs.
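The shape of the distance-distance display area follows from the triangle inequality, as the sketch below illustrates. It assumes coordinate tuples for documents and view points; `guido_projection` and `in_valid_band` are illustrative names rather than GUIDO's own.

```python
import math


def guido_projection(doc, kvp, avp):
    """Return (distance to KVP, distance to AVP) for a document."""
    return (math.dist(doc, kvp), math.dist(doc, avp))


def in_valid_band(x, y, rp_distance, eps=1e-9):
    """Check the triangle inequality |x - y| <= D <= x + y.

    D is the distance between the two view points. The first condition
    is a band parallel to the diagonal (the pi/4 plank); the second cuts
    it off between the corners (D, 0) and (0, D).
    """
    return abs(x - y) <= rp_distance + eps and x + y >= rp_distance - eps


kvp, avp = (0.0, 0.0), (4.0, 0.0)
x, y = guido_projection((0.0, 3.0), kvp, avp)
print(x, y)                                      # 3.0 5.0
print(in_valid_band(x, y, math.dist(kvp, avp)))  # True
print(in_valid_band(1.0, 10.0, 4.0))             # False: no real document maps here
```

A larger distance between the two view points widens the band, which matches the statement that the width of the plank is determined by the distance between the two RPs.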

4.3 Pathfinder Associative Network (PFNET)


The Pathfinder associative network (PFNET) is a structural and procedural modeling technique that extracts underlying connection patterns in proximity data and represents them spatially in a class of networks [8]. The power of the Pathfinder associative network is its ability to discard insignificant links in the original network while preserving the salient semantic structure of the network. The simplified network still maintains the proximity connections and fundamental characteristics of the original network. The main idea is to discard the redundant paths and keep the significant ones. PFNET uses the triangle inequality to identify paths with the lowest weights in the network, eliminate redundant ones, and make the network more economical. Figure 4.7 displays the original network and the final PFNET network. Moreover, the principle of the triangle inequality can be extended to an abstract space. In that case, connection proximity between two points may be measured in other forms, such as invisible semantic similarity between two objects, rather than distance.

Figure 4.7: Display of Original Network (left) and Final PFNET Network (right)

Application of a PFNET to a domain problem requires identifying two basic elements: the first is the objects which are used as nodes in the network, and the second is the proximity relationship between two objects, which is used to form a link between them. Proximity can be procured by either a human-interference method or an automatic computation method. Different objects and proximity methods can lead to different Pathfinder associative networks.

The Pathfinder network technique is very effective and efficient for displaying complex relationships among objects, such as sophisticated semantic networks. As an IV means, it can be applied to a wide spectrum of IR environments, ranging from information searches [7, 20], author co-citation analysis¹ [80] and term co-occurrence analysis² [16] to Internet information representation [6]. Specifically for query searching, after a query is submitted to the network, the relevance between the query and a document is calculated using the Pearson correlation coefficient, and the relevance is indicated by the height of a spike rising from the document [7]. In another case [20, 16], both the query and a document are converted into Pathfinder associative networks, and the similarity between a query and a document is the similarity between the two Pathfinder networks. The proximity algorithm consists of two parts. The first part is defined as the ratio of the number of terms common to a query and a document to the number of all terms in the query. The second part measures the network structure similarity between the query network and a document network. The value of this part increases when nodes (terms) connected in the query network also appear closely connected in the document network. Finally, the two parts are weighted and integrated into a final similarity value.

¹ Phenomena occurring when the authors of two different papers both co-cite the same paper(s) in their work.
² Keywords appearing together in a predefined length of text in the same document.

The weaknesses of the Pathfinder associative network include its computational complexity, which may prevent PFNET from visualizing a large dataset and from dynamically modifying a PFNET in response to interactions between users and the network. Another disadvantage of PFNETs in the present state of development is that people have no way of knowing the features upon which similarity judgments are made, so the semantic content of links is not easily discernible. Finally, PFNET cannot generate a local visual configuration based on users' individual information needs; it only produces a global overview of a data collection.
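The pruning idea behind PFNET can be sketched with an all-pairs shortest path computation. This is a simplified illustration under stated assumptions: link weights are treated as distances (smaller means closer), the length of a path is the sum of its link weights, and paths of any length are considered. Pathfinder networks are normally parameterized by r and q, which are fixed here to r = 1 and q = n - 1. The function name `pathfinder_prune` is illustrative.

```python
INF = float("inf")


def pathfinder_prune(weights):
    """Return the links (i, j), i < j, that satisfy the triangle inequality.

    weights -- symmetric n x n matrix of link distances, INF where no link exists
    """
    n = len(weights)
    # Floyd-Warshall: shortest path length between every pair of nodes.
    dist = [row[:] for row in weights]
    for i in range(n):
        dist[i][i] = 0.0
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    # A direct link is redundant when some indirect path is shorter.
    return {
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if weights[i][j] != INF and weights[i][j] <= dist[i][j]
    }


network = [
    [0.0, 1.0, 3.0],
    [1.0, 0.0, 1.0],
    [3.0, 1.0, 0.0],
]
print(sorted(pathfinder_prune(network)))  # [(0, 1), (1, 2)]
```

The triple loop is cubic in the number of nodes, which gives a concrete sense of the computational cost mentioned among the weaknesses of PFNET.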

4.4 Multidimensional Scaling Models (MDS)


The MDS technique consists of a group of methods used to discover empirical relationships among investigated objects, by visualizing them and presenting their geographic representation in a low dimensional display space. It can be used to reveal and illustrate hidden patterns for a set of proximity measures among objects for multivariate, exploratory, and visual data analysis. An MDS algorithm starts with a matrix of item-item similarities, and then assigns a location to each item in N-dimensional space (N is specified a priori), where users may perceive and analyze the relationships among the displayed objects. For sufficiently small N, the resulting locations may be displayed in a graph or 3D visualisation. The more similar two objects are, the closer to each other they are placed, and vice versa.

One of the MDS technique's advantages is the diversity of its algorithms, each of which handles different situations. They can be classified into metric and non-metric MDS algorithms, based upon the type of input proximity data. The non-metric MDS algorithm is applied to qualitative³ proximity data, while metric MDS is applied to quantitative⁴ proximity data. Another category of MDS technique is the classical MDS algorithm, which is used with quantitative proximity data. Applications of MDS in IR can be roughly categorized into two groups, based on the proximity definition: one uses a co-citation method to define the proximity metric, and the other uses a non-co-citation method such as traditional distance-based or angle-based similarity measures.

³ Qualitative proximity data refers to ordinal data.
⁴ Quantitative proximity data refers to ratio-scaled data.

However, applying traditional MDS to a very large data set may be prohibitively slow, since it uses a linear algebra solution for the problem, which is computationally costly and makes heavy demands on storage. On the other hand, the non-metric (metric) MDS method looks for the best match between the original proximity of two objects and their Euclidean distance in a low dimensional space, using an iterative process that starts with a random initial configuration. The Kruskal algorithm, which is used for the minimization, is iterative and simple, and its computational complexity is in practice almost O(N). Furthermore, the huge number of displayed objects in a low dimensional space raises concerns in terms of efficient system implementation and information representation in the MDS display space, for interactive systems. To solve the problem, people use the supernode method [68], which visualizes object clusters and objects at different levels respectively. In the MDS visual display space, documents are clustered first so that documents highly related in terms of co-citation form new supernodes. So instead of individual documents, the system displays these supernodes. Documents within a supernode can be visualized, at a lower level, if users zoom in on a selected cluster.
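The iterative approach (random initial configuration, then repeated improvement of the match between target proximities and layout distances) can be illustrated with a small sketch. It is a simplified metric variant that minimizes raw stress by gradient descent with step halving; it is not Kruskal's algorithm, which additionally fits a monotone regression for ordinal data. The names `stress` and `mds_layout` and all parameter values are assumptions made for the example.

```python
import math
import random


def stress(points, target):
    """Raw stress: squared mismatch between layout distances and target proximities."""
    n = len(points)
    return sum(
        (math.dist(points[i], points[j]) - target[i][j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )


def mds_layout(target, dims=2, iterations=200, step=0.05, seed=0):
    """Return (points, final_stress) for a symmetric target distance matrix."""
    rng = random.Random(seed)
    n = len(target)
    points = [[rng.random() for _ in range(dims)] for _ in range(n)]
    current = stress(points, target)
    for _ in range(iterations):
        # Gradient of the raw stress with respect to every coordinate.
        grad = [[0.0] * dims for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(points[i], points[j]) or 1e-12
                coeff = 2.0 * (d - target[i][j]) / d
                for k in range(dims):
                    grad[i][k] += coeff * (points[i][k] - points[j][k])
        trial = [
            [p[k] - step * g[k] for k in range(dims)]
            for p, g in zip(points, grad)
        ]
        trial_stress = stress(trial, target)
        if trial_stress < current:
            points, current = trial, trial_stress  # accept the improvement
        else:
            step /= 2  # overshoot: retry with a smaller step
    return points, current


r2 = math.sqrt(2)
square = [[0, 1, r2, 1], [1, 0, 1, r2], [r2, 1, 0, 1], [1, r2, 1, 0]]
layout, final_stress = mds_layout(square)
print(final_stress)
```

Each iteration touches every pair of objects, so the cost per iteration grows quadratically with the number of displayed objects; this is one reason clustering documents into supernodes helps interactive systems.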

Figure 4.8: Display of ThemeScape and Galaxy Visualizations of IN-SPIRE Visualization Program

Another potential problem is the intuitive representation of projected objects in a low dimensional MDS space. It is extremely important for users to easily understand and meaningfully interpret the graphic presentation. Towards that aim, the MDS approach was combined with the so-called ecological approach, in order to take advantage of natural display formats that humans are used to [83]. The ecological landscape is an MDS display space, which consists of a group of ecologically connected local landscapes. Each landscape represents an object cluster. The size of each local landscape is related to the number of documents containing the thematic term which defines the local landscape. A document is positioned based upon its indexing terms, the thematic term, and the category assigned to the document. Figure 4.8 shows the ThemeScape and Galaxy visualizations using the IN-SPIRE software.

4.5 Self-organizing Map Model (SOM)


The SOM (a neural network) is a nonlinear topology-preserving projection method that converts a high dimensional space into a low dimensional (1D, 2D, or 3D) grid (feature map), as shown in Figure 4.9. Three spaces are involved in SOM: the high dimensional document vector space (associated with objects), the high dimensional weight vector space (associated with the nodes of the display grid), and the low dimensional visual space (the display grid). During the learning process, each input vector is randomly picked and assigned to the closest neuron, whose weight vector is the most relevant one. After the training process, the documents are projected onto the feature map and labels are assigned to the feature map areas (most of the time based on weights).
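The learning process can be sketched as below. The text only states that a randomly picked input is assigned to the closest neuron; the neighbourhood update, the Gaussian influence and the linearly decaying learning rate and radius are the standard SOM training rule, added here as assumptions. The names `train_som` and `best_matching_unit` and all default values are illustrative.

```python
import math
import random


def best_matching_unit(weights, x):
    """Grid node whose weight vector is closest to the input vector x."""
    return min(weights, key=lambda node: math.dist(weights[node], x))


def train_som(vectors, rows, cols, epochs=200, lr0=0.5, seed=0):
    """Train a rows x cols feature map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    radius0 = max(rows, cols) / 2
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    for t in range(epochs):
        frac = t / epochs
        lr = lr0 * (1 - frac)                    # learning rate decays over time
        radius = max(radius0 * (1 - frac), 0.5)  # neighbourhood shrinks over time
        x = rng.choice(vectors)                  # randomly picked input vector
        bmu = best_matching_unit(weights, x)
        for node, w in weights.items():
            grid_dist = math.dist(node, bmu)     # distance on the display grid
            if grid_dist <= radius:
                influence = math.exp(-(grid_dist ** 2) / (2 * radius ** 2))
                for k in range(dim):
                    w[k] += lr * influence * (x[k] - w[k])
    return weights


docs = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
som = train_som(docs, rows=2, cols=2)
for doc in docs:
    print(doc, "->", best_matching_unit(som, doc))
```

Updating the neighbours of the winning node along with the node itself is what pulls related inputs onto adjacent grid cells, the topology-preserving property the section relies on.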

Figure 4.9: A SOM Feature Map

Each partitioned area in the map represents one or more concepts and the documents associated with them. The size of each area in the map indicates term occurrence frequencies or the possible size of the projected documents. After term labeling, semantically related areas are also connected. The neighboring relations of areas show intrinsic semantic associations among them, because according to the algorithm only relevant concepts are adjacent in the feature map. The degree of relevance between two neighboring areas can be judged by the shape and length of the border separating them: the longer the shared border, the more relevant the two neighboring areas.

During feature map navigation, users can select any interesting concept term labeled on an area by clicking it in the map. This makes the system list all document titles, or even full texts, associated with the selected area. Moreover, after a document of interest is found, the system can show users all semantically relevant documents by pulling out all documents associated with the area this document belongs to.

This technique was first introduced to visualize document relations from document titles, then full texts or documents, which were categorized into a 2D grid based upon their contents [47, 46]. In addition, it has been applied to visualize more dynamic and diverse Internet information, as in WEBSOM [39, 44]. To tackle the massive amount of information, a multi-layered graphic SOM approach to Internet information categorization was presented, in which a recursive process of analyzing Web pages and creating submaps was executed [5]. In a retrieval algorithm based on SOM [43], after the map was created, each node was assigned a centroid vector. The centroid vector was generated from the average weights of all associated document vectors. After a query was submitted to the system, it was compared with all centroid vectors. The best matching centroid vectors were selected and the corresponding associated documents were pulled out as search results. Users can also submit a query to the feature map. The query terms are compared with the weight vectors directly. The nodes with the best matching weight vectors are highlighted in the context of the feature map. In this way, users can identify retrieved nodes, the associated documents, and their distributions as well. Using SOM, people can explore and discover a complex hidden term semantic network.

Despite their appeal, SOM techniques have some restrictions and weaknesses. Computational complexity is one of the disadvantages, especially for a large data set. The training process requires repeated iterations over the input signals to reach convergence. Moreover, unlike other IR visualization models such as the ESCBM models, SOM cannot effectively visualize traditional IR models, since meaningful geometric characteristics are lost. Finally, after training, the SOM structure stays fixed. So it cannot be customized based upon each individual user's needs, which would provide more flexibility for users' dynamic and diverse information needs.
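The centroid-based retrieval described above for [43] can be sketched as follows. The averaging of document vectors per node follows the text; the use of cosine similarity to compare the query with the centroids is an assumption, since the measure is not named. The names `node_centroids` and `retrieve` and the toy data are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 when either is the zero vector."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def node_centroids(assignments, doc_vectors):
    """Average the vectors of the documents mapped to each map node."""
    centroids = {}
    for node, doc_ids in assignments.items():
        if doc_ids:
            dim = len(doc_vectors[doc_ids[0]])
            centroids[node] = [
                sum(doc_vectors[d][k] for d in doc_ids) / len(doc_ids)
                for k in range(dim)
            ]
    return centroids


def retrieve(query, assignments, doc_vectors, top_nodes=1):
    """Return the documents of the nodes whose centroids best match the query."""
    centroids = node_centroids(assignments, doc_vectors)
    ranked = sorted(
        centroids, key=lambda n: cosine(query, centroids[n]), reverse=True
    )
    return [d for node in ranked[:top_nodes] for d in assignments[node]]


doc_vectors = {"d1": [1.0, 0.0], "d2": [0.8, 0.2], "d3": [0.0, 1.0]}
assignments = {(0, 0): ["d1", "d2"], (0, 1): ["d3"]}
print(retrieve([1.0, 0.0], assignments, doc_vectors))  # ['d1', 'd2']
print(retrieve([0.0, 1.0], assignments, doc_vectors))  # ['d3']
```

Comparing the query with one centroid per node, instead of with every document, keeps query time proportional to the size of the map rather than the size of the collection.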

4.6 Metaphors
Metaphors are widely applied to visualization for IR. A metaphor can be thought of as understanding and experiencing one thing in terms of another. Metaphors can be categorized into two categories: a) metaphors for the semantic framework presentation and b) metaphors for IR interactions.

4.6.1 Metaphors for the Semantic Framework Presentation


One of the primary characteristics of an IR visualization environment is the demonstration of object semantic relationships. Objects must be positioned and projected onto a meaningful framework to form a visual configuration, where the internal structure and semantic connections of objects are shown. Such metaphors include:

Map

A map is a familiar concept. Important properties such as location, area, neighborhood, distance, height and scale can be used to express semantic relationships about a dataset. Examples are WebMap and Visual Net.

Landscapes

A landscape brings in a variety of physical geographic features, like fields, valleys, mountains, paths and rivers, to express data relationships. Examples are SPIRE and VxInsight.

Solar System

Celestial objects such as planets and asteroids have their own orbits around the sun, and pull each other due to gravity. A defined central (focus) point is regarded as the sun, while scattered objects are planets. The gravity in the visual space is defined as the semantic strength between the central point and an object. This metaphor is used in WebStar.

Galaxy

In this metaphor, stars represent web pages, as in ALIVE. Browsed pages jump to the outermost rim of the galaxy, while unbrowsed pages are gradually drawn towards the center of the galaxy and eventually disappear. Another variant has documents as stars, clustered in galaxies, as in SPIRE.

Topic Islands

In this metaphor, islands represent topics. The location and size of an island depend upon the relationship among the involved topics and the number of documents associated with the island respectively.

File Pile

This metaphor is designed to support the casual organization of documents. All items/documents can be automatically classified into several meaningful piles on a table. Each pile indicates a certain subject/category, and users can put items into it or pull items out of it.

Library

A library can be utilized as a metaphor to organize documents. The mapping of the virtual entities of directories onto the structure of a library is very natural and straightforward.

Bookshelf

In the same manner, a bookshelf provides a natural framework to organize data. Book icons in the bookshelf can present categories and classifications, or different book types. The size and thickness of books can be associated with the number of books within a category. Examples include Visual Net, Forager and LibViewer.

Hierarchical structures

These metaphors are widely used to organize information, where the parent, children and sibling relationships of a hierarchical structure need to be metaphorically presented. The Disk Tree and WEBKVDS visualization methods select a disc layout to display complicated tree structures, where different layers represent different levels. Hyperbolic Trees (Figure 4.10) make the entire tree visible at once. The unit disk gives a fish-eye lens view, giving more emphasis to nodes which are in focus and displaying nodes further out of focus closer to the boundary of the disk. Other methods include simple Trees, TreeMaps and ConeTrees (Figure 4.10).

Figure 4.10: A 3D Cone Tree (left) and a Basic Hyperbolic Tree (right)

Time related

Time is sometimes a crucial factor for certain data. It is used as a browsing thread to organize the data and guide users through a series of events. In Perspective Wall (Figure 4.11), a 3D environment, one dimension is reserved for time (publishing time of data), and by moving the wall the user browses events. ThemeRiver (Figure 4.11) applies the river metaphor to visually demonstrate topic changes. The horizontal flow of the river represents the flow of time. Presence Era uses geological layers in sedimentary rocks to present the time factor in its interface. Users can look into the history by examining various layer patterns.

4.6.2 Metaphors for Information Retrieval Interaction


The interactions between users and an IR visualization environment are vital, since searching, browsing, judging relevance and other activities are done by interactions.

Figure 4.11: Perspective Wall (left) and ThemeRiver (right)

Browsing in an information visualization environment is a necessary and crucial means of finding information. A lens is a special reading tool that allows readers to focus on a special interest area in a visual space and exclude irrelevant areas. Lenses are used in many visualization applications like VIBE and Perspective Wall. Such lenses can magnify the specific area, filter out information, or provide the fish-eye effect, which magnifies the focused area while the remaining objects are minimized but not eliminated, providing a technique to smoothly and gradually transfer from one area to another. DataLens (Figure 4.12) provides a 3D pyramid view of data. The users are able to control the focus area and the degree of detail.
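The fish-eye effect mentioned above can be illustrated in one dimension. The sketch assumes the distortion function G(t) = (d + 1)t / (dt + 1) from Sarkar and Brown's graphical fisheye views, chosen here only for illustration; the text does not say which function the cited systems use. The name `fisheye` and the default distortion value are assumptions.

```python
def fisheye(x, focus, distortion=3.0, lo=0.0, hi=1.0):
    """Map coordinate x in [lo, hi] to its distorted position around `focus`.

    distortion -- 0 leaves the view unchanged; larger values magnify the
                  neighbourhood of the focus more strongly
    """
    if x == focus:
        return x
    boundary = hi if x > focus else lo
    span = boundary - focus      # signed distance from the focus to the boundary
    if span == 0:
        return x
    t = (x - focus) / span       # normalized distance from the focus, in [0, 1]
    g = (distortion + 1) * t / (distortion * t + 1)
    return focus + g * span


for x in (0.5, 0.6, 0.9, 1.0):
    print(x, "->", round(fisheye(x, focus=0.5), 4))
```

Points near the focus are spread apart while distant points are squeezed towards the boundary without ever leaving the view, which is the "minimized but not eliminated" behaviour.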

Figure 4.12: DataLens, a 3D Pyramid Lens

The walking metaphor simulates a browsing operation by using the way humans move in their everyday life. Moving forward, backward, left and right, or turning the head left, right, up or down are provided to allow flexible exploration of the visual space.

Turning a page, skipping pages, and ruffling pages are common reading behaviors. WebBook provides these behaviors: the user turns a page by clicking on the right or left page of a metaphorical book. As the page turns, content appears gradually. By clicking on the right or left edge of the book, the user can skip pages, based on the relative distance from the current page position.

Furthermore, the correct understanding and proper use of Boolean search, implemented in most WSEs, is not a simple task. Filter/Flow attempts to simplify the complex Boolean query formulation process by using pipelines and water control. Documents are depicted as water which flows through the pipelines and is controlled by a series of valves which form the basic AND or OR operators. The flow finally reaches a result pool, after a series of filtering processes.

Semantic Filter simulates a traditional punch card retrieval system. A card represents a topic or subject and it has a grid system. Each grid cell corresponds to a document, and the correspondence is the same for all cards. If a document is related to the topic or subject, then the corresponding grid cell of the document is punched. The retrieval process is simple. The user selects a group of cards relevant to his information need and checks the grid. If he can see through a grid cell, then the corresponding document is retrieved, since this document is relevant to all cards.

Other approaches have focused on representing results in meaningful and configurable grids. Each axis is a specific metric, such as time, location, theme, size or format. The GRiDL prototype (Figure 4.13) displays search result overviews within the ACM Digital Library in a matrix using two hierarchical categories. The users can easily identify interesting results by cross-referencing the two dimensions [74].
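The punch card retrieval of Semantic Filter amounts to a Boolean AND over the selected topics, which the following sketch makes explicit. Representing each card as an integer bitmask is an implementation choice made for this example only; `punch_card` and `see_through` are illustrative names.

```python
def punch_card(num_docs, related_docs):
    """Build a card as a bitmask: bit i is set when document i is punched."""
    mask = 0
    for doc in related_docs:
        if not 0 <= doc < num_docs:
            raise ValueError(f"document index {doc} is outside the grid")
        mask |= 1 << doc
    return mask


def see_through(cards, num_docs):
    """Stack the cards: a document is retrieved when its cell is punched on every card."""
    stacked = (1 << num_docs) - 1  # with no cards, every cell is open
    for card in cards:
        stacked &= card
    return [doc for doc in range(num_docs) if (stacked >> doc) & 1]


retrieval = punch_card(6, [0, 2, 3])      # documents about "retrieval"
visualization = punch_card(6, [2, 3, 5])  # documents about "visualization"
print(see_through([retrieval, visualization], 6))  # [2, 3]
```

Adding a card can only close cells, never open them, so each additional topic narrows the result set in the same way an extra AND valve does in Filter/Flow.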

Figure 4.13: GRiDL Prototype Displays Search Results Along Two Axes

HotMaps⁵ [31] (Figure 4.14) is a meta-search system that provides an abstract representation of the entire set of search results retrieved, extending the mosaic metaphor. It supports interactive exploration via nested sorting of Web search results based on query term frequencies. By clicking on any of the 30 most frequent terms, the search results can be re-sorted so that the documents that make frequent use of these terms are moved to the top of the list. The colour codes represent the frequencies of each of the terms in the query. Dark red represents query terms that appear frequently; light yellow represents query terms that are used infrequently. User studies show an increase in speed and effectiveness and a reduction in missed documents when comparing HotMap to the list-based representation used by Google.
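Nested sorting by term frequency can be sketched with a composite sort key. This is an illustration of the idea only, not the system's code: the result structure (a `title` plus a `tf` dictionary of term frequencies) and the name `nested_sort` are assumptions.

```python
def nested_sort(results, selected_terms):
    """Order results by the frequency of each selected term, in click order.

    The first term is the primary key; later terms only break ties.
    """
    return sorted(
        results,
        key=lambda r: tuple(-r["tf"].get(term, 0) for term in selected_terms),
    )


results = [
    {"title": "A", "tf": {"search": 2, "visual": 5}},
    {"title": "B", "tf": {"search": 4, "visual": 1}},
    {"title": "C", "tf": {"search": 4, "visual": 3}},
]
print([r["title"] for r in nested_sort(results, ["search", "visual"])])  # ['C', 'B', 'A']
print([r["title"] for r in nested_sort(results, ["visual"])])            # ['A', 'C', 'B']
```

Because Python's sort is stable, documents that tie on every selected term keep the order in which the underlying search engine returned them.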

Figure 4.14: HotMaps, a 2D Visualization of How Query Terms Relate to Search Results

⁵ www.thehotmap.com/


Chapter 5

Vision and Research Methodology

In this chapter we analyze challenges in IR visualization, the vision of this dissertation, the suggested methodology and the work that has already been done.

5.1 Challenges
IR visualization is an emerging field, and there are still many issues and challenges.

So Many Users, So Many Data

Nowadays, users are not only computer experts, but ordinary people of any age with little or no computer and IR expertise. Moreover, many of them do not speak English, they make spelling errors, and they have diverse and complicated needs. On the other hand, online information has grown and continues to grow rapidly, so it is difficult to maintain a complete and current index of all of the information on the web. The validity of information may vary significantly and spam pages are intentionally created for profit or deception. In addition, new information is available in a growing variety of formats and media, like blogs, instant messaging, email, speech, images, music and video. To cope with this mixture of data, IR systems must seamlessly integrate across a variety of media, sources, and formats, providing a common UI for all of them in order to improve information access.

Integration of Existing Visualization Techniques

The diversity of IR visualization models poses the question of whether they can be synthesized into one visualization environment, in order to take advantage of their strengths and overcome their weaknesses.

Two basic strategies in this direction are identified. The first is to display multiple visual configurations simultaneously in a larger visualization environment. The second is more complex, since it synthesizes various visualization approaches into one new visualization approach. In this case, their data structures should be compatible and their displayed attributes should be complementary. Recently, there has been an effort to investigate the motivation and the feasibility of designing a declarative language for the specification of visualization and interaction methods, which will allow the formal expression of structure, appearance, behavior and communication between the various structures of information visualization [9, 22].

Full-text Visualization

Visualization for a full text differs from visualization for an entire text repository, since the number of displayed objects is relatively smaller. Displayed objects for a full text in a visual space can be defined as chapters, paragraphs, sentences or keywords. Issues related to full-text visualization include ways to integrate visualization for both a full text and a collection into one visualization environment, development of visualization models for full text, definition of objects and calculation of their similarity, and finally how to construct meaningful semantic frameworks.

Screen real estate

The more data presented and viewed in a visual space, the more screen real estate is needed. On the other hand, computers are part of our everyday life, even in the form of small pocket devices and mobile phones with very small displays. Therefore, looking for a balance between the amount of displayed data and the readability of the visual space is a fundamental issue. This can be done by narrowing down the displayed area to a local area (zooming in or out).
In addition, decreasing the number of displayed objects in the space, based on specific object attributes or user preferences, can also increase the effectiveness and efficiency of IR systems.

2D vs 3D

A fundamental issue is whether 3D representations are suitable for IR visualization, since this kind of visualization is based on an abstract data space with no physical structure. It has been argued that a 3D display is clearly more effective for physical data that includes 3D spatial variables, while 2D has a long and effective history for abstract data [3]. In addition, 3D displays place high demands on interaction and navigation (i.e. they require special control devices and have increased system and user overload and technical complexity), especially if they are displayed on a 2D display.

Metaphors

The search for new metaphors (e.g. object icon representations) is an open research topic in the field. Metaphors stem from a culture, and the diversity of users leads to the integration of different metaphors. Furthermore, the evaluation of a metaphor is a crucial issue: when a metaphor is applied, to what extent it matches the target appropriately, to what extent it preserves the salient and meaningful properties of the target, whether it reduces the users' cognitive workload, whether it improves effectiveness and efficiency, etc.

Evaluation

Evaluation for IR visualization refers to measuring the extent to which people use it to achieve goals in terms of effectiveness, efficiency and satisfaction. The evaluation of such systems is difficult because of the complexity of data relationships, the diversity of displayed data, the interactive nature of exploratory search, and the perceptual and cognitive abilities offered. Important parts of retrieval results, such as trends, patterns, clusters, and other aggregate information, are difficult to measure and no specific metrics are available. Finally, it is difficult to come up with a universal evaluation system.

New Paradigms

Information at the macro-level is not searchable in any of the existing BQ, BO, and QB interaction paradigms. Finding a new information retrieval visualization paradigm is a huge challenge, because it is difficult to identify and define meaningful and searchable objects from the aggregate information.

Efficiency (Data Structures and Algorithms)

IR visualization is an interactive and on-line process, and as such it should be efficient for users to deploy. Data structures and algorithms which improve performance and response times will always be in demand, since the available volume of information increases at a very high rate. Furthermore, current indices used in WSEs were designed for efficiency in the classical IR tasks.
With the current exploration needs and the diversity of data formats, new indices are needed that facilitate IR visualization systems and support unstructured, semi-structured and structured information.

Adaptation/Personalization. IR visualization systems should provide efficient and effective access for exploratory information needs. Ways to extend the user actions, in order to further ease the interaction and to restrict the infobase to those parts of the information that the user is interested in, should be investigated. Currently, preference management and adaptation require the user to formulate complex expressions or to interact with complex UIs. The above observations justify the need for flexible and universal access methods that offer on-line preference elicitation and support. Requirements of such explorative environments include: a) simplicity (users should be able to understand and use the interaction immediately), b) expressiveness (it should be possible for the user to interactively specify complex preference structures), and c) acceptance by users (the resulting interaction should be effective and desired by the users).

5.2 Vision
The general objective of this dissertation is to elaborate on methods and techniques for exploring and progressively refining large volumes of heterogeneous information, through the provision of appropriate models of interaction (interaction paradigms) and visualization techniques. The goal is to support answers (or information spaces) that are advanced in accuracy and completeness and appropriate for supporting decision making, without obliging the user to formulate complex queries or to use specific user profiles (predefined personalized services). Instead, we will attempt to provide the user with all beneficial options to adjust or restrict the information received, in a summarized, concise and intuitive manner. The navigation should provide appropriate data visualization techniques, in order to exploit the ability of the human brain for rapid understanding and perception of visual information, and the ability to discover patterns and relationships through it [37, 92, 76, 53, 41].

Apart from the methodological issues for achieving the above, this dissertation will focus on (a) performance, so that the resulting methods and techniques are applicable on large volumes of information, (b) flexibility of applicability, so that they are also applicable to different types of information (from simple text and unstructured data to semi-structured and structured data), and (c) adaptability (or on-site configuration) of these services. With regard to adaptability, the operation should be simple and systematic so that it can support the adaptation of services based on the environment (context) of the user.

Our testbeds will include search engines of the Web, as well as repositories of metadata (social or not). Specifically for the case of semantic metadata, we will test our methods on repositories of metadata expressed with respect to ontologies (e.g. the CIDOC CRM ontology [17] or other ontologies for scientific data).

5.3 Proposed Methodology


5.3.1 Information Visualization Framework
Initially we will select and formally define the framework for information visualization. This framework should support multi-dimensional data. Dimensions can either be pre-computed (based on specific attributes of objects) or computed in real time (e.g. using on-line results clustering). The dimension values may or may not be hierarchically organized, and the classification of objects can be precise or weighted (fuzzy). Ways to adapt metadata of the Semantic Web to this framework will be investigated.
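As an illustration of the framework outlined above, the following minimal Python sketch models a dimension whose values may be hierarchically organized and whose classification of objects may be precise (weight 1.0) or weighted. All names (Term, Dimension, add_term, classify, objects_under) are hypothetical and chosen for this example only; the formal framework itself remains to be defined.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """A value of a dimension; parent is None for top-level or flat values."""
    name: str
    parent: "Term | None" = None

    def ancestors(self):
        node = self.parent
        while node is not None:
            yield node
            node = node.parent


@dataclass
class Dimension:
    name: str
    precomputed: bool = True  # False for dimensions computed in real time
    terms: dict = field(default_factory=dict)           # term name -> Term
    classification: dict = field(default_factory=dict)  # object id -> {term name: weight}

    def add_term(self, name, parent=None):
        self.terms[name] = Term(name, self.terms[parent] if parent else None)

    def classify(self, obj_id, term, weight=1.0):
        """weight == 1.0 models a precise classification, lower values a fuzzy one."""
        if term not in self.terms:
            raise KeyError(f"unknown term: {term}")
        self.classification.setdefault(obj_id, {})[term] = weight

    def objects_under(self, term, min_weight=0.0):
        """Objects classified under term or under any of its descendants."""
        result = set()
        for obj_id, weights in self.classification.items():
            for name, weight in weights.items():
                lineage = {name} | {a.name for a in self.terms[name].ancestors()}
                if term in lineage and weight >= min_weight:
                    result.add(obj_id)
        return result
```

For example, after adding the terms "Science" and "Physics" (with parent "Science") and classifying an object under "Physics", a call to objects_under("Science") also returns that object, because the classification is inherited along the hierarchy.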


5.3.2 Metrics for Exploratory Search


The next step is the selection and definition of appropriate metrics for the evaluation of exploratory search and IR visualization systems. These metrics will be used for the design and evaluation of exploratory services and of appropriate automatic adaptation functionality, taking into account the requirements of mobile devices and phones with small displays.

5.3.3 Interaction Models for Exploratory Search


In this phase we will select and define appropriate interaction paradigms for exploratory search. In more detail, and based on the metrics defined above, we will:

Base our Approach on the Dynamic and Faceted Taxonomies Interaction Model. The user explores or navigates the information space by setting and changing the focus. At any point during the interaction, we compute and provide to the user the immediate zoom-in/out/side points along with count information. When the user selects one of these points, the selected term is added to the focus, and so on. Our goal is to describe every aspect of a WSE as a facet (e.g. even different retrieval models could be described by one).

Merge Clustering Techniques with Dynamic Faceted Taxonomies. We will try to combine automatic results clustering and dynamic faceted taxonomies. For this purpose, we can create a new facet that holds the terms and the hierarchy returned by the clustering algorithm. This work will concentrate on: (a) motivating the need for exploiting both explicit and mined metadata, (b) showing how automatic results clustering can be combined with the interaction paradigm of dynamic taxonomies, and (c) proving that this approach is feasible as an on-line task.

Define Strength Based Exploratory Search. In this interaction paradigm the user will be able to define a specific threshold (e.g. by using a slider bar) to restrict the number of results. This threshold can be defined either for all the returned objects or for the objects of specific facets. Moreover, it can be defined for specific attributes of a facet, such as the number of immediate children, the count, etc. Using this interaction model the user could get a better understanding of the returned objects and their relationships.
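The focus, zoom-in points with count information, and the strength threshold described above can be sketched as follows. This is a simplified illustration with flat (non-hierarchical) facets and hypothetical toy data; the function names are assumptions made for this example and do not correspond to the FleXplorer API.

```python
# Hypothetical toy data: object id -> {facet name: set of terms}
OBJECTS = {
    "d1": {"Language": {"en"}, "Type": {"pdf"}},
    "d2": {"Language": {"en"}, "Type": {"html"}},
    "d3": {"Language": {"el"}, "Type": {"html"}},
    "d4": {"Language": {"en"}, "Type": {"html"}},
}


def extension(focus, objects=OBJECTS):
    """Ids of the objects that carry every (facet, term) pair of the focus."""
    return {
        oid
        for oid, description in objects.items()
        if all(term in description.get(facet, set()) for facet, term in focus)
    }


def zoom_in_points(focus, objects=OBJECTS):
    """Map each (facet, term) not yet in the focus to the number of
    focus objects that would remain if it were added to the focus."""
    counts = {}
    for oid in extension(focus, objects):
        for facet, terms in objects[oid].items():
            for term in terms:
                if (facet, term) not in focus:
                    counts[(facet, term)] = counts.get((facet, term), 0) + 1
    return counts


def apply_threshold(counts, min_count):
    """Strength-based restriction: keep only zoom points with enough objects."""
    return {point: n for point, n in counts.items() if n >= min_count}
```

With the focus {("Language", "en")}, the sketch offers the zoom-in points ("Type", "pdf") and ("Type", "html"), and a threshold of 2 keeps only the latter. A full implementation would also have to handle term hierarchies and zoom-out/side points, which are omitted here.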
Rank and Reduce Facets and Zoom Points. Since our intention is to provide a rich variety of facets to the user, the ranking of the available facets and zoom points will play an important role in a positive user experience. Only the top K facets and zoom points will be presented to the user, while options to visit any other facet or zoom point will also be provided. As a result, we have to provide appropriate ranking methods for facets and zoom points. These methods can be based on the count information of indexed objects, sibling and children information of zoom points, lexicographic ranking of terms, or TF*IDF weighting. The reduction of facets and zoom points will be especially important for mobile devices with limited displays.

Define Preferences Using Dynamic Taxonomies. We will investigate how we can extend the user actions in order to further ease the interaction and to speed up the restriction of the focus to those parts of the information space that the user is interested in. These actions will affect the appearance and presentation order of facets, terms (zoom-in/side points) and objects of the focus. The semantics will be described using the framework described in [38]. The requirements of such functionality are: (a) simplicity, so that users can understand and use it, (b) expressiveness, so that even complex preference structures can be specified, and (c) acceptance by the users.

Define Composition and Graph Functions. Furthermore, we will examine how the user could compose a new facet by merging two or more different facets through an appropriate interaction model. Set operations like union, intersection and difference, applied on the terms of the facets or on their indexed objects, could help the user fulfill his information need.
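Two of the ideas above, count-based top-K ranking of zoom points and facet composition through set operations, can be sketched in a few lines of Python. The representation of a facet as a dictionary from terms to sets of object ids, the function names, and the exact semantics of each operation are assumptions made for this illustration; the actual interaction model is still to be defined.

```python
def top_k(counts, k):
    """Rank zoom points by descending count, breaking ties
    lexicographically, and keep only the first k."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


def compose(facet_a, facet_b, op):
    """Compose two facets given as {term: set of object ids}.

    union:        all terms of both facets, object sets merged per term
    intersection: terms common to both facets, object sets intersected
    difference:   terms of facet_a that do not occur in facet_b
    """
    if op == "union":
        return {
            term: facet_a.get(term, set()) | facet_b.get(term, set())
            for term in facet_a.keys() | facet_b.keys()
        }
    if op == "intersection":
        return {term: facet_a[term] & facet_b[term]
                for term in facet_a.keys() & facet_b.keys()}
    if op == "difference":
        return {term: set(facet_a[term])
                for term in facet_a.keys() - facet_b.keys()}
    raise ValueError(f"unknown operation: {op}")
```

Other ranking criteria mentioned above (sibling and children information, TF*IDF weighting) would replace the sort key in top_k; the count-based key is used here only because it is the simplest to state.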

5.3.4 Exploratory Search and Visualization


In this phase we will determine the dimensionality of the visual space (i.e. 1D, 2D or 3D) and its coordinate system (i.e. orthogonal, polar, or parallel). In addition, based on the data attributes (e.g. strings, numbers) we want to visualize, we will have to define the type of the axes, which can be nominal, ordinal or quantitative. Then, we will have to define appropriate mappings of objects onto the previously defined visualization space. This will determine the final position of each individual object. Criteria like user-defined reference systems, relevance to other related objects, or the definition of visibility thresholds can influence their final position.

Furthermore, we will try to provide graph functionality through an intuitive UI, for the visual representation of two or more facets. The user will be able to create different types of 2D or 3D graphs by dragging facets or zoom points to a well-defined area. Moreover, he will be able to define each aspect of the graph, like the type of graph, axis type, values of the axes, order, etc. Such graphs can help users get an instant overview of the infobase. For example, using the zoom point of an author's name in the Author facet together with the Year facet, a user could see the number of papers of a specific author during a defined time period. For all the above, we will take into consideration psychology theories, like pre-attentive processing and Gestalt theory, for more intuitive user interaction.
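The Author/Year example above amounts to a cross-tabulation of one zoom point against another facet. The sketch below computes the data series behind such a graph and renders it as a plain-text bar chart; the paper records, author names and function names are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter

# Hypothetical toy infobase: each paper has a set of authors and a year.
PAPERS = [
    {"id": "p1", "Author": {"A. Author"}, "Year": 2006},
    {"id": "p2", "Author": {"A. Author", "B. Author"}, "Year": 2006},
    {"id": "p3", "Author": {"A. Author"}, "Year": 2008},
    {"id": "p4", "Author": {"B. Author"}, "Year": 2007},
]


def counts_by_year(papers, author, first_year, last_year):
    """Number of papers by author for each year in [first_year, last_year]."""
    counter = Counter(
        paper["Year"]
        for paper in papers
        if author in paper["Author"] and first_year <= paper["Year"] <= last_year
    )
    return [(year, counter.get(year, 0)) for year in range(first_year, last_year + 1)]


def text_bars(series):
    """Render (label, count) pairs as a simple horizontal bar chart."""
    return "\n".join(f"{label} {'#' * count} {count}" for label, count in series)
```

Here the Year facet plays the role of an ordinal axis and the count that of a quantitative axis; in the envisioned UI the same series would feed a 2D graph instead of text.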

5.3.5 Implementation
During this phase, methodological and implementation-specific issues will have to be addressed. The resulting methods and techniques should be applicable on large volumes of information. Furthermore, they should also be applicable to different types of information (from simple text and unstructured data to semi-structured and structured data). Our testbeds will include search engines of the Web, as well as repositories of metadata (social or not). Specifically, for the case of semantic metadata, we will test our methods on repositories of metadata expressed with respect to ontologies (e.g. the CIDOC CRM ontology or other ontologies for scientific data). As a running example for the implementation of the proposed interaction paradigms, we will use Mitos [59, 58] (footnote 1), which is a prototype WSE (footnote 2), the FleXplorer API [76], which supports the interaction paradigm of faceted taxonomies, and AJAX.

5.3.6 Evaluation and Improvements


In this phase we will conduct a large-scale user-based evaluation of the provided visualization and exploratory search functionality. An automatic evaluation based on the defined metrics will also be conducted. After analyzing the results of the evaluations, possible flaws and drawbacks of the developed functionality will be identified, leading to further improvements of the designed system. A second user-based evaluation will then be conducted in order to validate these improvements.

5.4 Work Done


5.4.1 ODBMS Index Representations
The Mitos index is based on an ODBMS (footnote 3) instead of an inverted file. Such a design can support the indexing of both structured and unstructured documents, but it is not as efficient as an inverted file with regard to space and speed. In [58] we proposed and evaluated three different representations. The first representation, named PR, is a naive relational representation that simulates an inverted file. The OR and COR representations, on the other hand, exploit the set-valued attributes offered by the ODBMS in order to reduce the space needed to store the occurrences of words in the documents. In the experiments conducted with these representations, COR was found to be the most efficient: it required one order of magnitude less space and was two orders of magnitude faster in query evaluation than PR.

Footnote 1: http://groogle.csd.uoc.gr:8080/mitos/
Footnote 2: Under development by the Department of Computer Science of the University of Crete and FORTH-ICS
Footnote 3: PostgreSQL

5.4.2 FleXplorer, A Framework for Providing Faceted and Dynamic Taxonomy-based Information Exploration
FleXplorer is a main-memory API (footnote 4) that allows managing (creating, deleting, modifying) terms, taxonomies, facets and object descriptions [76]. It supports both finite and infinite terminologies (e.g. numerically valued attributes), as well as explicitly and intensionally defined taxonomies. The former can be classification schemes and thesauri; the latter can be hierarchically organized intervals (based on the cover relation), etc. Regarding user interaction, the framework provides methods for setting the focus and getting the applicable zoom points. Regarding FleXplorer's performance, the computation of zoom-in points with count information is more expensive than without. Using the main-memory format, which is the fastest of the available formats, in 1 sec we can compute the zoom-in points of 240,000 results with count information, or of 540,000 results without count information. These timings are sufficient to support on-line tasks.
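To make the notion of an intensionally defined taxonomy more concrete, the sketch below generates a chain of nested intervals for a numeric value on demand, instead of storing every term explicitly. It assumes that the cover relation can be read as interval containment and that intervals are refined by halving; both are assumptions made for this illustration, and the function names are not taken from FleXplorer.

```python
def interval_path(value, low, high, depth):
    """Chain of nested half-open intervals [lo, hi) that contain value,
    from the broadest (the root) to the narrowest, refined by halving."""
    if not (low <= value < high):
        raise ValueError("value lies outside the root interval")
    path = [(low, high)]
    for _ in range(depth):
        mid = (low + high) / 2
        if value < mid:
            high = mid
        else:
            low = mid
        path.append((low, high))
    return path


def covers(outer, inner):
    """True if the interval outer contains the interval inner."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]
```

Because the intervals are computed rather than enumerated, a numerically valued attribute with an infinite set of possible values can still be browsed through a finite number of zoom points at each level.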

5.4.3 Exploratory Web Searching with Dynamic Taxonomies and Results Clustering
In [57] we propose an interaction method that exploits both explicit and mined metadata in order to enrich Web searching with exploration services. On-line results clustering is useful for providing users with overviews of the results, thus allowing them to restrict their focus to the desired parts. On the other hand, the various metadata that are available to a WSE (e.g. domain, language, date and document information) are commonly exploited only through the advanced (form-based) search facilities that some WSEs offer and that users rarely use. We propose an approach that combines both kinds of metadata by adopting the interaction paradigm of dynamic taxonomies and faceted exploration. This combination results in an effective, flexible and efficient exploration experience.
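The combination described above can be illustrated by treating the output of any results clustering algorithm as one more facet next to the explicit metadata facets. In the following sketch the facet representation ({facet: {term: set of object ids}}), the facet name "By clustering" and the function names are hypothetical, and the clustering step itself is assumed to have already produced labeled clusters.

```python
def add_cluster_facet(facets, clusters, name="By clustering"):
    """Return a copy of facets extended with a facet built from clustering
    output, given as {cluster label: set of object ids}."""
    enriched = {
        facet: {term: set(ids) for term, ids in terms.items()}
        for facet, terms in facets.items()
    }
    enriched[name] = {label: set(ids) for label, ids in clusters.items()}
    return enriched


def restrict(facets, selections):
    """Object ids that satisfy every selected (facet, term) pair; with no
    selection, all classified objects are returned."""
    if not selections:
        return {oid for terms in facets.values() for ids in terms.values() for oid in ids}
    selected_sets = [facets[facet][term] for facet, term in selections]
    return set.intersection(*selected_sets)
```

Once the mined cluster labels live in the same structure as the explicit metadata, a single restriction can mix both kinds, for instance a language term together with a cluster label.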

Footnote 4: Application Programming Interface


Chapter 6

Conclusion

In this dissertation, we plan to elaborate on exploration services for WSEs, based on the interaction paradigm of Dynamic and Faceted Taxonomies, in order to support answers that are advanced in accuracy and completeness and appropriate for exploratory information needs. The user will not be obliged to formulate complex queries or to use specific user profiles. Instead, we will attempt to provide users with all beneficial options to adjust or restrict the information received, in a summarized, concise and intuitive manner. The navigation should provide appropriate data visualization techniques, in order to exploit the ability of the human brain for rapid understanding and perception of visual information, and the ability to discover patterns and relationships through it.

Regarding the interaction paradigms and visualization models described in Chapters 3 and 4, each one of them has its own advantages, which differentiate it from the other methods. The uniqueness of ESCBM is that an internal retrieval process of an IR model, such as the cosine model, can be visualized in the visual space, while the salient characteristic of MRPBM is the flexibility of the reference points. The advantage of PFNET lies in the optimal structure it produces, while SOM offers the ability to explore and discover a complex hidden semantic network of terms (as in clustering). The power of MDS is that the data used in multidimensional scaling analysis is relatively free of any distributional assumption, and that it can handle various types of data, ranging from ordinal data to interval data to ratio data. Dynamic and Faceted Taxonomies, on the other hand, provide an intuitive UI that users are already familiar with, which helps them reach their information goal in an effective and efficient way.

Furthermore, the ability to change the reference system in MRPBM and ESCBM visualizations offers a dynamic visualization environment, contrary to PFNET, SOM and MDS, which provide a stable visual configuration based on the similarities of the documents. In a broad sense, a reference system can be regarded as a special form of a user query. In this way, both ESCBM and MRPBM are classified into the QB interaction paradigm, because a reference system can be seen as a query that narrows down a search to a subspace and then visualizes the results for browsing. On the other hand, PFNET, SOM and MDS can be classified into the BQ paradigm or the browsing-only BO paradigm, since they do not have a clearly defined reference system that acts as a query to narrow down their results in the visual space. Dynamic and Faceted Taxonomies are the only approach that supports all three interaction paradigms. The user can either browse the taxonomies until he finds the desired information (BO), or start browsing the taxonomies and then formulate a query to restrict the infobase (BQ), or start by executing a query and then browse the taxonomy (QB). Clustering, on the other hand, is a QB interaction paradigm.

Furthermore, this dissertation will also focus on investigating ways to integrate the different visualization models described in Chapter 4 into the interaction paradigm of Dynamic and Faceted Taxonomies. In this way, we will provide a synthesized visualization environment that takes advantage of the strengths of each visualization model and at the same time overcomes its weaknesses. Apart from the methodological issues for achieving all the above, we will also focus on (a) performance, so that the resulting methods and techniques are applicable on large volumes of information, (b) flexibility of applicability, so that they are also applicable to different types of information (from simple text and unstructured data to semi-structured and structured data), and (c) adaptability (or on-site configuration) of these services.
With regard to adaptability, the operation should be simple and systematic so that it can support the adaptation of services based on the environment (context) of the user.


Appendix A

Acronyms

BO      Browsing Only
BQ      Browsing and Querying
ESCBM   Euclidean Spatial Characteristic Based Model
HCI     Human Computer Interaction
IR      Information Retrieval
IV      Information Visualization
MDS     Multidimensional Scaling Models
MRPBM   Multiple Reference Points Based Models
MWSE    Meta Web Search Engine
PFNET   Pathfinder Associative Network
POI     Point of Interest
QB      Querying and Browsing
RP      Reference Point
SOM     Self-organizing Map Model
UI      User Interface
VSM     Vector Space Model
WSE     Web Search Engine
WWW     World Wide Web


References
[1] F. Beil, M. Ester, and X. Xu. Frequent term-based text clustering. In KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 436-442, New York, NY, USA, 2002. ACM.
[2] S. Benford, D. Snowdon, C. Greenhalgh, R. Ingram, I. Knox, and C. Brown. Vr-vibe: A virtual environment for co-operative information retrieval. Computer Graphics Forum, 14(3):349-360, 1995.
[3] S. K. Card, J. Mackinlay, and B. Shneiderman. Readings in Information Visualization: Using Vision to Think. Series in Interactive Technologies. The Morgan Kaufmann, 1999.
[4] K. Chakrabarti, S. Chaudhuri, and S. Hwang. Automatic categorization of query results. In SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data, pages 755-766, New York, NY, USA, 2004. ACM.
[5] C. Chen. Behavioural patterns of collaborative writing with hypertext - a state transition approach. In HCI '96: Proceedings of HCI on People and Computers XI, pages 265-279, London, UK, 1996. Springer-Verlag.
[6] C. Chen. Structuring and Visualizing the WWW by Generalized Similarity Analysis. In Proceedings of Hypertext '97, 1997.
[7] C. Chen. Visualising semantic spaces and author co-citation networks in digital libraries. In Information Processing and Management, pages 401-420, 1999.
[8] N. J. Cooke, K. J. Neville, and A. L. Rowe. Procedural network representations of sequential data. Hum.-Comput. Interact., 11(1):29-68, 1996.
[9] Joseph A. Cottam and Andrew Lumsdaine. Thisstar: Declarative visualization prototype. In IEEE Symposium on Information Visualization, 2007.


[10] D. Crabtree, X. Gao, and P. Andreae. Improving Web Clustering by Cluster Selection. In Procs of the IEEE/WIC/ACM International Conference on Web Intelligence (WI'05), pages 172-178, 2005.
[11] D.R. Cutting, D. Karger, J.O. Pedersen, and J.W. Tukey. Scatter/Gather: a cluster-based approach to browsing large document collections. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 318-329, Copenhagen, Denmark, June 1992. ACM Press New York, NY, USA.
[12] R. Dachselt, M. Frisch, and M. Weiland. Facetzoom: a continuous multi-scale widget for navigating hierarchical metadata. In CHI '08: Proceeding of the twenty-sixth annual SIGCHI conference on Human factors in computing systems, pages 1353-1356, New York, NY, USA, 2008. ACM.
[13] W. Dakka, R. Dayal, and P.G. Ipeirotis. Automatic discovery of useful facet terms. SIGIR Faceted Search Workshop, 2006.
[14] W. Dakka, P. G. Ipeirotis, and K. R. Wood. Automatic construction of multifaceted browsing interfaces. In CIKM '05: Proceedings of the 14th ACM international conference on Information and knowledge management, pages 768-775, New York, NY, USA, 2005. ACM.
[15] W. Dakka and P.G. Ipeirotis. Automatic Extraction of Useful Facet Hierarchies from Text Databases. In Proceedings of the 24th International Conference on Data Engineering, ICDE 2008, pages 466-475, April 2008.
[16] D. W. Dearholt and R. W. Schvaneveldt. Properties of pathfinder networks. pages 1-30, 1990.
[17] M. Doerr and N Crofts. Electronic Communication on Diverse Data - The Role of an Object-Oriented CIDOC Reference Model. In Proceedings CIDOC'98, Melbourne, October 1998.
[18] M. Dörk, S. Carpendale, C. Collins, and C. Williamson. Visgets: Coordinated visualizations for web-based information exploration and discovery. IEEE Transactions on Visualization and Computer Graphics, 14(6), December 2008.
[19] P. Ferragina and A. Gulli. A personalized search engine based on web-snippet hierarchical clustering. In Proceedings of the 14th international conference on World Wide Web, WWW 2005 - Special interest tracks and posters, volume 5, pages 801-810, May 2005.
[20] R. H. Fowler, W. A. L. Fowler, and B. A. Wilson. Integrating query thesaurus, and documents through a common visual representation. In SIGIR '91: Proceedings of the 14th annual international ACM SIGIR conference on Research and development in information retrieval, pages 142-151, New York, NY, USA, 1991. ACM.


[21] B.C.M. Fung, K. Wang, and M. Ester. Hierarchical Document Clustering Using Frequent Itemsets. In Proceedings of the SIAM International Conference on Data Mining, volume 30, San Francisco, CA, USA, May 2003.
[22] M. Hemmje G. Jaeschke, M. Leissler. Modeling interactive, 3-dimensional information visualizations supporting information seeking behaviors. pages 109-125, 2005.
[23] F. Gelgi, H. Davulcu, and S. Vadrevu. Term ranking for clustering web search results. In 10th International Workshop on the Web and Databases, WebDB 2007, Beijing, China, June 2007.
[24] M. Hearst. Design recommendations for hierarchical faceted search interfaces. ACM SIGIR Workshop on Faceted Search, 2006.
[25] M. A. Hearst. Clustering versus faceted categories for information exploration. Commun. ACM, 49(4):59-61, April 2006.
[26] M. A. Hearst, A. Elliott, J. English, R. Sinha, K. Swearingen, and K. Yee. Finding the flow in web site search. Commun. ACM, 45(9):42-49, 2002.
[27] M.A. Hearst and J.O. Pedersen. Reexamining the cluster hypothesis: scatter/gather on retrieval results. In Proceedings of the 19th annual international ACM SIGIR conference on Research and Development in Information Retrieval, pages 76-84, Zurich, Switzerland (Special Issue of the SIGIR Forum), August 1996. ACM Press New York, NY, USA.
[28] M. Hemmje. Lyberworld: a 3d graphical user interface for fulltext retrieval. In Proc. ACM SIGCHI '95, pages 417-418. ACM, 1995.
[29] L. J. Heyer, S. Kruglyak, and S. Yooseph. Exploring expression data: Identification and analysis of coexpressed genes. Genome Res., 9(11):1106-1115, November 1999.
[30] M. Hildebrand, J. Ossenbruggen, and L. Hardman. /facet: A browser for heterogeneous semantic web repositories. In International Semantic Web Conference, pages 272-285, 2006.
[31] O. Hoeber and X. D. Yang. Hotmap: Supporting visual exploration of web search results. Journal of the American Society for Information Science and Technology, 9999(9999):NA+, 2008.
[32] F. Höppner, F. Klawonn, R. Kruse, and T. Runkler. Fuzzy Cluster Analysis. John Wiley & Sons, Inc., 1999.
[33] J. Janruang and W. Kreesuradej. A New Web Search Result Clustering based on True Common Phrase Label Discovery. In Proceedings of the International Conference on Computational Intelligence for Modelling Control and Automation and International Conference on Intelligent Agents Web Technologies and International Commerce. IEEE Computer Society Washington, DC, USA, 2006.

[34] M. Käki. Findex: properties of two web search result categorizing algorithms. In Proc. IADIS Intl. Conference on World Wide Web/Internet, Lisbon, Portugal, October 2005.
[35] A. K. Karlson, G. G. Robertson, D. C. Robbins, M. P. Czerwinski, and G. R. Smith. Fathumb: a facet-based interface for mobile search. In CHI '06: Proceedings of the SIGCHI conference on Human Factors in computing systems, pages 711-720, New York, NY, USA, 2006. ACM.
[36] L. Kaufman and P. J. Rousseeuw. Finding groups in data: an introduction to cluster analysis. Wiley Series in Probability and Mathematical Statistics. Applied Probability and Statistics. New York: Wiley, 1990.
[37] D. Keim, G. Andrienko, J. Fekete, C. Görg, J. Kohlhammer, and G. Melançon. Visual analytics: Definition, process, and challenges. pages 154-175. 2008.
[38] W. Kießling. Foundations of preferences in database systems. In VLDB '02: Proceedings of the 28th international conference on Very Large Data Bases, pages 311-322. VLDB Endowment, 2002.
[39] T. Kohonen, S. Kaski, K. Lagus, J. Salojarvi, V. Paatero, and A. Saarela. Self organization of a massive document collection. IEEE Transactions on Neural Networks, 11:574-585, 2000.
[40] J. Kominek and R. Kazman. Accessing multimedia through concept clustering. In CHI '97: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 19-26, New York, NY, USA, 1997. ACM.
[41] S. Koshman. Visualization-based information retrieval on the web. Library & Information Science Research, 28(2):192-207, 2006.
[42] W. M. Kules, III. Supporting exploratory web search with meaningful and stable categorized overviews. PhD thesis, College Park, MD, USA, 2006. Adviser: Ben Shneiderman.
[43] K. Lagus. Text retrieval using self-organized document maps. Neural Process. Lett., 15(1):21-29, 2002.
[44] K. Lagus, T. Honkela, S. Kaski, and T. Kohonen. Websom for textual data mining. Artif. Intell. Rev., 13(5-6):345-364, 1999.
[45] D. Lawrie and B. W. Croft. Discovering and comparing topic hierarchies. In Proceedings of RIAO 2000 Conference, pages 314-330, 2000.


[46] X. Lin. Map displays for information retrieval. J. Am. Soc. Inf. Sci., 48(1):40-54, 1997.
[47] X. Lin, D. Soergel, and G. Marchionini. A self-organizing semantic map for information retrieval. In SIGIR '91: Proceedings of the 14th annual international ACM SIGIR conference on Research and development in information retrieval, pages 262-269, New York, NY, USA, 1991. ACM.
[48] J. Macqueen. Some methods for classification and analysis of multivariate observations. In Proc. Fifth Berkeley Symp. on Math. Statist. and Prob., volume 1, pages 281-297, 1967.
[49] E. Mäkelä, E. Hyvönen, and S. Saarela. Ontogator - a semantic view-based search engine service for web applications. In International Semantic Web Conference, pages 847-860, Athens, GA, USA, Nov. 2006.
[50] E. Mäkelä, K. Viljanen, P. Lindgren, M. Laukkanen, and E. Hyvönen. Semantic yellow page service discovery: The veturi portal. In poster paper at ISWC '05, Nov. 2005.
[51] B. H. McCormick, T. A. DeFanti, and M. D. (ed) Brown. Visualization in Scientific Computing. ACM SIGGRAPH, New York, 1987.
[52] T. Munzner. Guest editor's introduction: Information visualization. IEEE Computer Graphics and Applications, 22(1):20-21, 2002.
[53] A. Noack. Energy models for graph clustering. J. Graph Algorithms Appl., 11(2):453-480, 2007.
[54] A. Nuchprayoon and R. R. Korfhage. Guido: Visualizing document retrieval. Visual Languages, IEEE Symposium on, 0:184, 1997.
[55] K. A. Olsen, R. R. Korfhage, K. M. Sochats, M. B. Spring, and J. G. Williams. Visualization of a document collection: the vibe system. Inf. Process. Manage., 29(1):69-81, 1993.
[56] E. Oren, R. Delbru, and S. Decker. Extending faceted navigation for rdf data. In ISWC, 2006.
[57] P. Papadakos, S. Kopidaki, N. Armenatzoglou, and Y. Tzitzikas. Exploratory web searching with dynamic taxonomies and results clustering. In ECDL '09: Proceedings of the 13th European Conference on Digital Libraries, September 2009 (to appear).
[58] P. Papadakos, Y. Theoharis, Y. Marketakis, N. Armenatzoglou, and Y. Tzitzikas. Mitos: Design and Evaluation of a DBMS-based Web Search Engine. In Proc. of 12th Pan-Hellenic Conference of Informatics (PCI2008), Greece, August 2008.


[59] P. Papadakos, G. Vasiliadis, Y. Theoharis, N. Armenatzoglou, S. Kopidaki, Y. Marketakis, M. Daskalakis, K. Karamaroudis, G. Linardakis, G. Makrydakis, V. Papathanasiou, L. Sardis, P. Tsialiamanis, G. Troullinou, K. Vandikas, D. Velegrakis, and Y. Tzitzikas. The Anatomy of Mitos Web Search Engine. CoRR, Information Retrieval, abs/0803.2220, 2008. Available at http://arxiv.org/abs/0803.2220.
[60] G. W. Paynter and I. H. Witten. A combined phrase and thesaurus browser for large document collections. In ECDL, pages 25-36, 2001.
[61] E. Rasmussen. Clustering algorithms. pages 419-442, 1992.
[62] D. E. Rose and D. Levinson. Understanding user goals in web search. In WWW '04: Proceedings of the 13th international conference on World Wide Web, pages 13-19. ACM Press, 2004.
[63] G. M. Sacco. Navigating the cd-rom. In Proc. Int. Conf. Business of CD-ROM, 1987.
[64] G. M. Sacco. Dynamic Taxonomies: A Model for Large Information Bases. IEEE Transactions on Knowledge and Data Engineering, 12(3):468-479, May 2000.
[65] G. M. Sacco. Analysis and validation of information access through mono, multidimensional and dynamic taxonomies. In Flexible Query Answering Systems, 7th International Conference, FQAS 2006, Milan, Italy, June 7-10, 2006, Proceedings, volume 4027 of Lecture Notes in Computer Science, pages 659-670. Springer, 2006.
[66] M. Sanderson and B. Croft. Deriving concept hierarchies from text. In SIGIR '99: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 206-213, New York, NY, USA, 1999. ACM Press.
[67] M.C. Schraefel, Maria Karam, and Shengdong Zhao. mSpace: Interaction Design for User-Determined, Adaptable Domain Exploration in Hypermedia. In Procs of Workshop on Adaptive Hypermedia and Adaptive Web Based Systems, pages 217-235, Nottingham, UK, Aug. 2003.
[68] H Small and E Garfield. The geography of science: disciplinary and national mappings. J. Inf. Sci., 11(4):147-159, 1986.
[69] A. Spoerri. Infocrystal: a visual tool for information retrieval. In VIS '93: Proceedings of the 4th conference on Visualization '93, pages 150-157, 1993.
[70] M. Stefaner and B. Muller. Elastic lists for facet browsers. In Dynamic Taxonomies and Faceted Search (FIND), DEXA Workshops, pages 217-221. IEEE Computer Society, 2007.


[71] J. Stefanowski and D. Weiss. Carrot2 and language properties in web search results clustering. In Proceedings of the International Atlantic Web Intelligence Conference, Madrid, Spain, May 2003.

[72] E. Stoica and M. A. Hearst. Nearly-automated metadata hierarchy creation. In HLT-NAACL 2004: Short Papers, pages 117-120, 2004.

[73] I. Taksa, A. Spink, and R. Goldberg. A task-oriented approach to search engine usability studies. JSW, 3(1):63-73, 2008.

[74] L. Terveen, W. Hill, and B. Amento. Constructing, organizing, and visualizing collections of topically related web resources. ACM Transactions on Computer-Human Interaction, 6:67-94, 1999.

[75] M. Tvarožek and M. Bieliková. Personalized faceted navigation in the semantic web. pages 511-515, 2007.

[76] Y. Tzitzikas, N. Armenatzoglou, and P. Papadakos. FleXplorer: A Framework for Providing Faceted and Dynamic Taxonomy-based Information Exploration. In Procs. of the 2nd Intern. Workshop on Dynamic Taxonomies and Faceted Search, FIND 2008 (in conjunction with DEXA 2008), Torino, Italy, Sep. 1-5, 2008.

[77] J. Wang, Y. Mo, B. Huang, J. Wen, and L. He. Web Search Results Clustering Based on a Novel Suffix Tree Structure. In Autonomic and Trusted Computing, 5th International Conference, ATC 2008, volume 5060, pages 540-554, Oslo, Norway, June 2008.

[78] Y. Wang and M. Kitsuregawa. Use link-based clustering to improve Web search results. In Proceedings of the Second International Conference on Web Information System Engineering (WISE 2001), 2001.

[79] D. Weiss and J. Stefanowski. Web search results clustering in Polish: Experimental evaluation of Carrot. In Intelligent Information Processing and Web Mining: Proceedings of the International IIS: IIPWM'03, Zakopane, Poland, June 2003.

[80] H. D. White. Author cocitation analysis and Pearson's r. J. Am. Soc. Inf. Sci. Technol., 54(13):1250-1259, 2003.

[81] J. J. Wijk. The value of visualization. Visualization Conference, IEEE, 0:11, 2005.

[82] J.G. Williams, K.M. Sochats, and E. Morse. Visualization. Annual Review of Information Science and Technology (ARIST), 30(1):161-207, 1995.

[83] J. A. Wise. The ecological approach to text visualization. Journal of the American Society for Information Science, 50:1224-1233, 1999.


[84] D. Xing, G.R. Xue, Q. Yang, and Y. Yu. Deep classifier: automatically categorizing search results into large-scale hierarchies. In Proceedings of the international conference on Web search and web data mining, pages 139-148. ACM New York, NY, USA, 2008.

[85] K. Yee, K. Swearingen, K. Li, and M. Hearst. Faceted metadata for image search and browsing. In CHI '03: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 401-408, New York, NY, USA, 2003. ACM.

[86] O. Zamir and O. Etzioni. Web document clustering: a feasibility demonstration. In Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, pages 46-54, Melbourne, Australia, August 1998. ACM Press New York, NY, USA.

[87] O. Zamir and O. Etzioni. Grouper: A dynamic clustering interface to web search results. Computer Networks, 31(11-16):1361-1374, 1999.

[88] H. J. Zeng, Q. C. He, Z. Chen, W. Y. Ma, and J. Ma. Learning to cluster web search results. In Proceedings of the 27th annual international conference on Research and development in information retrieval, pages 210-217, Sheffield, UK, July 2004. ACM Press New York, NY, USA.

[89] D. Zhang and Y. Dong. Semantic, Hierarchical, Online Clustering of Web Search Results. In Advanced Web Technologies and Applications, 6th Asia-Pacific Web Conference, APWeb 2004, pages 69-78, Hangzhou, China, April 2004.

[90] J. Zhang. The characteristic analysis of the dare visual space. Inf. Retr., 4(1):61-78, 2001.

[91] J. Zhang. Tofir: a tool of facilitating information retrieval - introduce a visual retrieval model. Inf. Process. Manage., 37(4):639-657, 2001.

[92] J. Zhang. Visualization for Information Retrieval. Springer-Verlag, 2008.

[93] J. Zhang and T. N. Nguyen. Webstar: a visualization model for hyperlink structures. Inf. Process. Manage., 41(4):1003-1018, 2005.

[94] Y. Zhao and G. Karypis. Evaluation of hierarchical clustering algorithms for document datasets. In Data Mining and Knowledge Discovery, pages 515-524. ACM Press, 2002.
