Journal of Computing and Information Technology - CIT 12, 2004, 3, 175–194

A Taxonomy of Information
Retrieval Models and Tools

Gerardo Canfora and Luigi Cerulo


RCOST – Research Centre on Software Technology, University of Sannio, Benevento, Italy

Information retrieval is attracting significant attention due to the exponential growth of the amount of information available in digital format. The proliferation of information retrieval objects, including algorithms, methods, technologies, and tools, makes it difficult to assess their capabilities and features and to understand the relationships that exist among them. In addition, the terminology is often confusing and misleading, as different terms are used to denote the same, or similar, tasks.

This paper proposes a taxonomy of information retrieval models and tools and provides precise definitions for the key terms. The taxonomy consists of superimposing two views: a vertical taxonomy, which classifies IR models with respect to a set of basic features, and a horizontal taxonomy, which classifies IR systems and services with respect to the tasks they support.

The aim is to provide a framework for classifying existing information retrieval models and tools and a solid point from which to assess future developments in the field.

Keywords: information retrieval, taxonomy, tools, models.

1. Introduction

In recent years information retrieval has become an important research subject, because the amount of information available in digital formats has grown exponentially and the need to retrieve relevant information has assumed crucial importance. The World Wide Web and digital libraries have shown a large audience the importance of effective mechanisms and tools for retrieving documents from a very large document collection based on user information needs.

Information Retrieval (IR) is the scientific discipline that deals with the analysis, design and implementation of computerized systems that address the representation, organization of, and access to large amounts of heterogeneous information encoded in digital format [58].

In this paper we focus on text document retrieval, in which the information is represented by text documents. Therefore, for the purposes of this paper, the terms information and documents are used interchangeably. Text document retrieval is the most traditional subfield of IR; however, IR comprises other subfields, such as image retrieval, speech retrieval, information generation, query answering, and text summarization, that we do not cover in this paper.

A key feature of a text IR system is retrieving the documents that can satisfy the information needs of a user from a large collection of documents. Such systems, especially in the context of the web, are usually known as search engines, so in the rest of the paper we will consider search engine as a synonym of information retrieval system. IR systems prepare the collection of documents for retrieval through an indexing step. User information needs are usually represented by keywords or phrases, which are themselves indexed, although more complex representation languages are available. This representation, which inevitably causes a loss of information, is usually known as a query. Indexing can assume different forms according to the model adopted to represent both the documents in the collection and the user information needs. Many current IR systems exploit ranked IR methods, i.e. they rank the documents in the collection based on a measure of their relevance with respect to the user information needs as represented by a query.

The proliferation of information retrieval algorithms, methods, technologies, and tools is
making it more difficult to assess the features and the characteristics of each IR aspect and to understand the relationships that exist among them. The terminology is often confusing; for example, terms such as crawling, indexing, and spidering are often used to denote similar tasks, with no clear distinction of the differences.

In this paper we propose a classification of IR models and tools and provide definitions for the key terms. The classification consists of superimposing two views: one for the IR models and one for the IR objects, either tools or services. A vertical taxonomy classifies IR models with respect to a set of basic features, and a horizontal taxonomy classifies IR objects with respect to their tasks, form, and context. The vertical taxonomy is built by breaking down two basic features of any IR model: the representation, that is, the model adopted to represent both the documents and the user queries; and the reasoning, which refers to the framework adopted to resolve a representation similarity problem. The horizontal taxonomy is derived from an analysis of the application areas of IR.

1.1. Related Works

In the literature, several studies have been proposed that outline classifications of IR models and tools. However, most of these studies do not cover the entire spectrum of IR objects; the reasons can be found either in the age of the papers or in the specific objectives of the studies. For example, in 1984 Smith and Warner [69] published a document representation taxonomy with the aim of relating new research works to previous works and suggesting new areas of research. Nowadays, this taxonomy is largely incomplete, because it does not consider, for example, the representation of structured documents. In 1987 Belkin and Croft [20] published a classification of the most important retrieval techniques in which no reference is made to the relevance feedback model, because, as the authors explicitly state, relevance feedback is not considered a retrieval technique, but rather an aid to refining the retrieval model.

In a more recent work, Paijmans [54] made an interesting analysis of the most important retrieval models. The approach adopted to construct a taxonomy of IR models consists of identifying a generic model that forms a basis for a variety of more specific models. Paijmans identified the vector document model as the basis for building the classification and showed how the vector model can subsume other popular models. Whilst this constitutes a concise style of classification, it is unable to classify IR techniques that are not derived from the vector-based model, such as the logic-based techniques.

Our approach is different, as we start from a classification of the basic features of IR models and proceed with a classification of the objects produced in the various fields of information retrieval in terms of tools and services. The flexibility of this faceted view is evident when we consider that different information retrieval objects can be based on the same information retrieval model, and the same information retrieval model can be exploited to implement different information retrieval objects. For example, the classic vector model, generally presented as a retrieval technique, can be used for building information filtering and document clustering tools, too. The latter are different information retrieval objects that exploit the same information retrieval model.

1.2. Content and Structure of the Paper

There are two main viewpoints that characterize information retrieval: we call these two viewpoints information retrieval objects and information retrieval models. The former is generally an artifact that exists in the form of a tool or a service and responds to the "what" question; the latter is a set of theories on which the information retrieval object is based and responds to the "how" question. The two aspects are related, as one object can be based on more than one model and one model can be the basis for more than one object. On this framework we have built a horizontal taxonomy and a vertical taxonomy. The horizontal taxonomy refers to IR objects, while the vertical one considers IR models.

The remainder of the paper is organized as follows. Sections 2 and 3 introduce the vertical and the horizontal taxonomies, together with examples of their application. Section 4 superimposes the vertical and horizontal taxonomies and shows how this can be used to obtain a mapping of the object's features on the underlying models.
2. Vertical Taxonomy

Modeling the process of information retrieval is complex, because many parts are, by their nature, vague and difficult to formalize. The human component assumes an important role and many concepts, such as relevance and information needs, are subjective. Therefore, information retrieval models can be very complex and, consequently, their classification can be hard. However, in the definition of any IR model we can identify some common aspects. Generally, the first step is the representation of documents and information needs. From these representations a reasoning strategy is defined that solves a representation similarity problem to compute the relevance of documents with respect to queries. Various strategies have been introduced with the aim of improving the retrieval process: we classify these methodologies under the reasoning component.

Representation and Reasoning can be used to characterize an information retrieval model. For example, in [52] an information retrieval model is characterized as a quadruple {D, Q, F, R(q,d)} where:

- D is a set of logical views for the documents in the collection (Representation component);
- Q is a set of logical views for the user information needs (Representation component);
- F is a framework for modeling document representations, queries and their relationships (Reasoning component);
- R(q,d) is a ranking function which associates a real number with a query q ∈ Q and a document d ∈ D (Reasoning component).

Fig. 1. Vertical taxonomy.
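As a rough illustration of the quadruple above, the following Python sketch instantiates it for a toy collection. The choices are assumptions made only for this example and are not taken from [52]: the logical views in D and Q are bags of words (term-count dictionaries), the framework F is plain vector algebra, and R(q,d) is the inner product of the two vectors. The names `logical_view`, `rank`, and `docs` are hypothetical.

```python
from collections import Counter

def logical_view(text):
    """Map raw text to a bag-of-words logical view (assumed representation)."""
    return Counter(text.lower().split())

def R(q, d):
    """Ranking function R(q, d): inner product of the two term-count vectors."""
    return float(sum(weight * d[term] for term, weight in q.items()))

def rank(query_text, docs):
    """Return document ids sorted by decreasing R(q, d)."""
    q = logical_view(query_text)  # an element of Q
    D = {doc_id: logical_view(text) for doc_id, text in docs.items()}
    return sorted(D, key=lambda doc_id: R(q, D[doc_id]), reverse=True)

# Toy collection, invented for the example.
docs = {
    "d1": "information retrieval models",
    "d2": "retrieval of images",
    "d3": "software testing tools",
}
ranking = rank("information retrieval", docs)
print(ranking)
```

Swapping `logical_view` (the representation) or `R` (the reasoning) for another choice yields a different IR model without touching the rest of the sketch, which is the separation the taxonomy relies on.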


An information retrieval model can be modeled as a couple <Rp, Rs> where Rp is the representation model of documents and queries, and Rs is a framework for modeling the relationship between document and query representations, which is the reasoning strategy. Every component can be divided into subcomponents and for every subcomponent we can build a tree of possible approaches and solutions presented in the literature, as shown in Fig. 1.

Defining the approaches used for each component identifies an IR model. For example, the couple <Rp, Rs>:

Rp(query) = { keyword-based }
Rp(document) = { weighted vector }
Rs(with logic) = { vector algebra }

identifies the well-known vector model, as we will discuss later. We will now go into each of these components.

2.1. Representation

A fundamental component of an IR system is the representation of the information itself: information can be processed only if it is represented in some way.

In text information retrieval, representation means representing documents and queries. A document is the representation of the information the author wished to encode; it is the unit of information that can be retrieved by an IR system. Queries are the representation of the information needs of a user.

Any text can be characterized by using four attributes: syntax, structure, semantics, and style. A text has a given syntax and a structure, which are usually dictated by the application or by the person who created it. Text also has a semantics, specified by the author of the document. Additionally, a document may have a presentation style associated with it, which specifies how it should be displayed or printed. In many approaches to text representation the style is coupled with the document syntax and structure (see for example the LaTeX document preparation system [40]). Modern representations, such as XML [80], separate the representation of syntax and structure, which are defined either by a DTD or an XSD, from style, which is captured by XSL.

Whilst documents are characterized by syntax, structure, semantics and style, the structure and semantics of text are generally sufficient to characterize queries.

Query Representation

A query is the representation of a user's information need. An information need originates from a problem that the user has to solve; it is implicit in the user's mind and its purpose is to bridge a knowledge gap. An information need can be of three types [50]: known item information need, conscious information need, and confused information need. The first is when users search for, or verify the existence of, documents they know. The second is when users search for documents they do not know, but that concern a subject they know. The third is when users know neither the documents nor the subject. The following classes of query representations can be identified:

- Keyword-based. This is the simplest form for a query. It is composed of keywords, and the documents containing such keywords are searched for. Keyword-based queries are popular, because they are intuitive and easy to express. Usually, a keyword query is a single word, but, in general, it can be a more complex combination of (Boolean) operations applied to several words.
  - Single word. It is the most elementary query that can be formulated in a text retrieval system. Depending on the reasoning component, the result of a single word query is generally the set of documents containing at least one occurrence of the searched word.
  - Boolean. It is the oldest and still widely used form of combining the keywords in a query. A Boolean query is an expression whose elements are keywords, Boolean operators and a precedence notation. In addition to classical Boolean operators, several new operators have been proposed, such as the NEAR operator, which allows context search capabilities, and the fuzzy Boolean operator, which relaxes the meaning of canonical AND and OR.
- Pattern-based. It is a more specific query formulation, which allows the specification
of text having some properties. A pattern is a set of syntactic features that must occur in a text segment. The segments satisfying the pattern specification are said to match the pattern.
- Structural. Structural queries are a mechanism to improve the retrieval quality of structured information. This mechanism is generally built on top of the basic queries with the addition of structural constraints expressed using containment, proximity, or other restrictions on the structural elements in the documents. Structural queries can be categorized into three main categories: fixed structure, hypertext, and hierarchical structure. The first is the simplest form and, for this reason, it is more restrictive. The documents are divided into a set of fields, each of which contains some text. A fixed structural query restricts the search to text contained in certain document fields. The hypertext is probably the most flexible form of structuring. It is a directed graph where the nodes hold some text and the links represent connections between the nodes. However, it is not possible to query the hypertext structural connectivity, but only the text content of the nodes. This transforms the retrieval activity into a navigational activity (browsing task). The hierarchical structure is an intermediate structuring model and represents a natural decomposition for many text collections (books, articles, structured programs, etc.). For example, XML is the most prominent structural representation model and XPath [81] is a query language for addressing pieces of content in the hierarchical structure.

Document Representation

A document is a retrievable element of the document space of an information retrieval system. It can be considered as the minimal resource that an information retrieval system can retrieve. Historically, documents have been represented by a set of terms called keywords, which are usually extracted from the text or inserted by the author. The following are the most significant types of document representation:

- Stream of characters. Text is represented as a stream of characters and no interpretation is made on its structure or semantic content.
- Vector space. The basic principle of this text representation model is to consider that each document is described by a vector of components that are representative of the semantic content of the document. Traditional vector space approaches use a set of keywords, called index terms, but other types of representative components, such as n-grams, are also used. An index term is a word whose semantics helps in identifying the document's main themes. Of course, not all terms of a document are useful for describing the document content. In fact, there are index terms which are vaguer than others. Deciding the importance of terms is not a trivial task. In a large collection of documents a word which appears in each document is useless as an index term, because it does not discriminate between documents. On the other hand, a term that appears in one document will likely describe the content of this document ([45], [83]). Vector representations can be further categorized as follows.
  - Binary. The text document is represented as a binary vector of terms. Each element of the vector represents a term and its value is '1' if the term appears in the document, '0' otherwise.
  - Weighted. In this case element values are real numbers between 0 and 1, called term weights, and represent the affinity of the term with respect to the document. A widespread method to compute the term weights exploits two factors [58]: Term Frequency (TF) and Inverse Document Frequency (IDF). The first provides a measure of how well the term describes the document contents (intra-cluster similarity); the second measures how well the term can discriminate documents among the collection (inter-cluster dissimilarity). A well-known term weighting scheme, valid for generic collections, is the product of the TF and IDF factors. Several variations are described by Salton and Buckley [66].
    - Latent semantic. In the traditional vector space approach each document is represented by a vector of n components, where n is the number of terms occurring in the collection (dimension
of the document space). Latent Semantic Indexing (LSI) [27] reduces the dimension of the document space by capturing term-to-term statistical relationships. The document space is then represented by a new coordinate system of dimension k < n, called k-space (or LSI space), in which each of the k dimensions is a derived concept often called LSI factor or LSI feature. LSI features are identified by using a method for matrix decomposition called Singular Value Decomposition (SVD). The derived concepts may be thought of as artificial concepts; they represent extracted common meaning components of many different words and documents.
    - Fuzzy subset. Fuzzy set theories deal with the representation of classes whose boundaries are not well defined. Each element of the class is associated with a membership function that defines the membership degree of the element in the class. In many fuzzy representation approaches the TF-IDF function of the weighted vector model is used as the fuzzy membership function ([35], [37]).
  - N-Gram. The n-gram approach is in some respects an evolution of vector space approaches. In the traditional vector space approaches the dimensions of the document space for a given collection of documents are the words (or sometimes phrases) that occur in the collection. By contrast, in the n-gram approach, the dimensions of the document space are n-grams: strings of n consecutive characters extracted from the text without considering word lengths, or even word boundaries. Hence, the n-gram is a remarkably pure statistical approach, one that measures the statistical properties of strings of text in the given collection and does not consider the vocabulary, lexical, or semantic properties of the natural language in which the documents are written. The n-gram length (n) and the method for extracting n-grams from documents vary from one author to another. In [22] Damashek uses n-grams of length 5 and 6 for clustering text by language and topic. He uses a sliding window approach in which n-grams are obtained by moving a window of n characters through a document or a query, one character at a time. Some authors [82] also use n-grams that cross word boundaries, i.e., that start within one word, end in another word, and include the space characters that separate consecutive words.
- Structural. Structural documents, similarly to structural queries, are a mechanism to improve the retrieval quality. The main idea is to enrich documents with additional information that allows a computer to make part of the semantic content explicit. XML is the most prominent standard for modeling these aspects of information.

2.2. Reasoning

With the term reasoning we refer to the set of methods, models, and technologies used to match document and query representations in a retrieval task. Closely related to the reasoning component is the concept of relevance. The primary goal of an information retrieval system is to retrieve the documents relevant to a query. The reasoning component defines the framework to measure the relevance between documents and queries using their representations.

A key question to address in order to understand the reasoning component of an IR system is to find a precise definition for relevance. This is still an open problem within the IR community; the literature reports different definitions, but a widespread definition is [67]:

Relevance is the (A) of a (B) existing between a (C) and a (D) as determined by an (E).

Where:

(A) measure, estimate, judgment, ...
(B) utility, matching, satisfaction, ...
(C) document, document representation, information provided, ...
(D) question, question representation, information need, ...
(E) requester, intermediary, expert, ...
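Returning briefly to the N-Gram representation of Section 2.1, the sliding-window extraction it describes can be sketched in a few lines of Python. This is a minimal sketch, not the procedure of [22] or [82]: the function names `char_ngrams` and `ngram_profile` are hypothetical, and no text normalization (case folding, whitespace collapsing) is applied, which real systems may well add.

```python
from collections import Counter

def char_ngrams(text, n):
    """Slide a window of n characters over text, one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_profile(text, n):
    """Count n-gram occurrences; the distinct n-grams act as space dimensions."""
    return Counter(char_ngrams(text, n))

# The window crosses the word boundary, so some 5-grams contain the space.
grams = char_ngrams("text mining", 5)
profile = ngram_profile("abab", 2)
print(grams)
print(profile)
```

Because the window ignores word boundaries, n-grams such as "t min" span two words, which is the behaviour the text attributes to some authors; restricting the window to single words would be a different design choice.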
An attempt to clarify this definition has been proposed by Mizzaro [51]. Starting from an accurate analysis of the interactions between the users and the system, the paper identifies various types of relevance on which it is possible to define an order relation.

An information retrieval reasoning strategy can be one (or any combination) of: reasoning with logic, reasoning with uncertainty, and reasoning with learning. A reasoning with logic approach deals especially with models developed as logical-mathematical theories. A reasoning with uncertainty approach is useful whenever the system is unable to assess the truth of all the aspects of the environment in which it operates. In these cases its behavior is affected by uncertainty. This is due to many reasons: it does not understand the environment properties; there are many variables to process and not enough time available, etc. Reasoning with learning approaches apply inductive machine learning techniques. Machine learning is concerned with systems that learn from experience. In a classical system, the system designer inserts all the knowledge. Whenever the designer does not possess complete knowledge of the system's application domain, a learning mechanism is the only way to acquire new knowledge. Learning mechanisms are used both to fulfil an objective and to improve it. In IR the primary goal is to improve retrieval effectiveness, for example, in terms of precision and recall.

Most of the classical information retrieval models deal with the reasoning with logic and reasoning with uncertainty strategies. The first includes, for example, methods based on first-order logic ([47], [8], [6]), and methods based on Boolean and vector algebra ([74], [64], [25], [78], [77]). The second includes methods in which the vagueness and uncertainty aspects of IR are treated in terms of probabilistic and fuzzy set approaches. Since many information retrieval aspects are affected by vagueness and uncertainty, many reasoning processes based on uncertainty have been proposed ([59], [13], [14], [76], [10], [53], [49], [48], [63], [70]). Machine learning techniques have gained growing popularity in the past ten years ([23], [16], [43]).

Recently, several novel approaches have been proposed, based on either graph theory ([12], [24], [33], [55]) or formal ontology [31].

Reasoning with Logic

- Logic. The logical approach to information retrieval can be formulated in terms of the logical formula P(d → n), where the arrow is the conditional connective formalized by a logic to be chosen and P is the predicate: "the representation of document d is relevant to the representation of information need n". The central problem is selecting the right implication connective, i.e. selecting the logic whose implication connective best mirrors relevance. An overview of the role of logic in information retrieval is reported in [68].
- Algebra. Algebra calculus is the most common approach. Under this item we include the reasoning strategies which are based on a set of operations defined in an algebraic field.
  - Boolean algebra. In the conventional Boolean algebra reasoning strategy the query Boolean expression is computed to verify whether a document either satisfies a query (is relevant) or does not satisfy it (is non-relevant). No ranking is possible, and this is a significant limitation. A number of extended Boolean models have been developed to provide ranked output. These extended Boolean models employ extended Boolean operators (also called soft Boolean operators) [42].
  - Vector algebra. Using a weighting scheme for document and query representations, the vector algebra approach computes a numeric similarity between the query and each document. The documents can then be ranked according to how similar they are to the query. The usual similarity measure exploited in document vector space is the inner product between the query vector and a given document vector [65]. If both vectors have been cosine normalized, then the inner product represents the cosine of the angle between the two vectors; hence this similarity measure is often called cosine similarity. Other well-known variants of similarity functions are Dice's coefficient and Jaccard's coefficient [58].
- Graph theories. Graph theories deal with structures formed by vertices and edges. The
application of graph algorithms to information retrieval became more interesting with the advent of the web. Web resources can be well modelled with a graph structure in which documents represent vertices and hyperlinks represent edges. In [24] a Maximum Flow method is introduced to identify web communities. Previous graph-based approaches were applied to bibliographic documents and were principally based on bibliometric methods such as co-citation and bibliographic coupling. Some of these are used in the web context, too. Such algorithms include: the PageRank algorithm [12], on which the Google [104] web search engine is based, the HITS algorithm [33], and the SAE algorithm [55].

Reasoning with Uncertainty

- Probability theories. Probabilistic theories were introduced by Robertson and Sparck Jones [59]. The fundamental reasoning approach is based on the following assumption: given a user query and a document in the collection, the probabilistic reasoning process tries to estimate the probability that the user will find the document interesting. There exist some alternative approaches based on Bayesian networks. In particular, the inference network model [71] has been used in the INQUERY system [13], while reference [57] introduces a generalization called belief network.
- Fuzzy set theories. Fuzzy IR models have been defined to overcome the limitations of the crisp Boolean IR models, in particular to manage the vagueness and incompleteness of users in query formulation. Fuzzy extended Boolean models are a superstructure of the Boolean model by means of which existing Boolean IR systems can be extended without redesigning them completely. The standard Boolean models apply an exact match between the query and the document representations, and then partition the document base into two sets: the retrieved documents and the rejected ones. As a consequence of this crisp behavior, they are liable to reject useful items as a result of too restrictive queries, and to retrieve useless material in reply to excessively general queries. Thus, softening the retrieval activity to rank the retrieved items in decreasing order of relevance to a user query can greatly improve the effectiveness of such systems. This objective can be reached by extending the Boolean model in several ways [35]. In the fuzzy extensions of document representations the aim is to provide more specific and exhaustive representations of the documents' information content, in order to reduce the imprecision and incompleteness of the Boolean indexing. For example, a document can be represented as a fuzzy set of terms. In the fuzzy generalization of the Boolean query language the objective is to have a more expressive query language, in order to capture the vagueness of the user needs as well as to simplify the user-system interaction. Various approaches have been proposed. One of these introduces soft connectives of selection criteria [11], characterized by a parametric behavior which can be set between the two extremes "AND" and "OR". In other approaches, the Boolean query language has been generalized by defining aggregation operators as linguistic quantifiers, such as "at least k" or "about k".

Reasoning with Learning

Several authors have proposed the use of machine learning approaches in IR. The most frequently used techniques include [16]: multiple layered and feed-forward neural networks such as back propagation networks [62], symbolic and inductive learning algorithms such as ID3 [56] and ID5R [72], and evolution-based algorithms such as genetic algorithms [34].

- Neural networks. Neural network computing seems to fit well with conventional retrieval models such as the vector space model and the probabilistic model. One of the first applications in IR comes from Belew [7]. He developed a three-layer neural network of authors, index terms, and documents. The system used relevance feedback from its user to change its representation of authors, index terms, and documents over time. An evolution of this application has been introduced by Kwok [39], who uses a modified Hebbian learning rule to reformulate probabilistic information retrieval. In other applications the
Neural Network approach has been used for more specific tasks. For example, in [44], a Kohonen's self-organizing feature map was applied to construct a self-organizing representation of the semantic relationships between documents. A Neural Network document clustering algorithm was developed in [46]. The Hopfield neural network's parallel relaxation method was used in [17] for concept-based document retrieval and exploration.
- Symbolic learning. In IR the use of symbolic learning is more limited with respect to other learning techniques. In [9] a symbolic learning technique is used for automatic text classification. The symbolic learning process represents the numeric classification results in terms of IF-THEN rules. In [26] a regression method and ID3 were used to implement a feature-based indexing technique. In [18] ID3 and the incremental ID5R algorithm were adopted for information retrieval. Both algorithms were able to use user-supplied samples of desired documents to construct decision trees of important keywords which could represent the user's query.
- Genetic algorithms. Several genetic algorithm implementations have been developed in the context of IR. [29] presents a genetic algorithm-based approach to document indexing, in which competing document descriptions (binary vectors of terms) are associated with a document and altered over time by using genetic mutation and crossover operators. In this design, a keyword represents a gene (bit pattern), a document, which is a vector of keywords (bit string), represents an individual, and a collection of documents, initially judged relevant by a user, represents the initial population. Based on a Jaccard's matching function, the initial population evolves through generations and eventually converges to an optimal, improved population. In [30] a similar approach is adopted for document clustering.

2.3. An Example

Table 1. Vertical taxonomy of a set of Information Retrieval Models.

As an example of application of the vertical taxonomy, we have taken some relevant works from the IR models field and tried to classify
184 A Taxonomy of Information Retrieval Models and Tools

them using the vertical taxonomy. We iden- identified by three components, as illustrated in
tify each information retrieval model in relation Fig. 2: Tasks, Form, and Context.
to the representation and reasoning components
described above. This is shown in Tab. 1. A
notable aspect is that many models contain the 3.1. Tasks
weighted vector as a representation component;
this is why Paijmans  54 ] introduced the vector Information retrieval tasks are concerned with
document model. a particular aspect of information retrieval de-
rived from a user point of view and should not
be confused with the tasks in an information
retrieval process, such as query formulation,
3. Horizontal Taxonomy query expansion, comparison, ranking, docu-
ment presentation. An information retrieval ob-
The vertical taxonomy alone is not sufficient to ject can support one or more tasks and a task
take into account all the objects that have been can be stand-alone or it can be integrated in
produced under the IR umbrella. Users do not a process to perform a larger task. We have
interact with a model, but generally they use a identified the following tasks: ad hoc retrieval,
software tool that is able to solve an information known item search, interactive retrieval, filter-
retrieval problem. This calls for the introduc- ing, browsing, clustering, mining, gathering and
tion of a further dimension, a new viewpoint that crawling. Sometime they are known by differ-
we call horizontal taxonomy. Through the hor- ent names because they are inherited from var-
izontal taxonomy we classify information re- ious research areas.
trieval objects. An information retrieval object
is an artifact that solves a more or less general Ad Hoc Retrieval
IR problem. An information retrieval object is
An ad hoc retrieval task is characterized by an
arbitrary subject of the search and a short du-
ration  73 ]. It is typically performed by a re-
searcher doing a literature search in a library.
In this environment the retrieval system knows
the set of documents to be searched, but cannot
anticipate the particular topic that will be inves-
tigated  73 ]. A retrieval system’s response to an
ad hoc search is generally a list of documents
ranked by decreasing similarity to the query.
The internet search engines are examples of in-
formation retrieval objects from which one can
perform ad hoc search.

Known Item Search

A known item search is similar to an ad hoc


search, but the target of the search is a partic-
ular document (or a small set of documents)
that the searcher knows to exist in the collec-
tion and wants to find it  73 ]. An information
retrieval object that performs this task usually
implements a precise query language (for ex-
ample, structural query language) with which
a searcher can reach parts of a document with
known structure and semantics. For example,
in the library environment, a researcher that will
Fig. 2. Horizontal taxonomy. retrieve all articles by an author.
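The difference between an ad hoc search and a known item search can be illustrated with a minimal sketch. It is not taken from any of the cited systems: the toy collection `DOCS`, its field names, and the functions `ad_hoc_search` and `known_item_search` are hypothetical, and raw term frequencies with cosine similarity are assumed as one possible similarity measure among many.

```python
import math
from collections import Counter

# Hypothetical toy collection; the records and field names are illustrative only.
DOCS = [
    {"id": 1, "author": "Salton", "text": "term weighting approaches in automatic retrieval"},
    {"id": 2, "author": "Rocchio", "text": "relevance feedback in information retrieval"},
    {"id": 3, "author": "Salton", "text": "automatic text processing by computer"},
]


def tf_vector(text):
    """Bag-of-words term-frequency vector (no stemming, no stop-word removal)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def ad_hoc_search(query, docs):
    """Ad hoc retrieval: ids of matching documents, ranked by decreasing similarity."""
    q = tf_vector(query)
    scored = [(cosine(q, tf_vector(d["text"])), d["id"]) for d in docs]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ranked if score > 0]


def known_item_search(field, value, docs):
    """Known item search: exact match on a structured field, no ranking involved."""
    return [d["id"] for d in docs if d.get(field) == value]
```

Under these assumptions, `ad_hoc_search("relevance feedback retrieval", DOCS)` returns a ranked list of partially matching documents, whereas `known_item_search("author", "Salton", DOCS)` returns exactly the records whose author field matches.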

Interactive Retrieval

A user’s judgment of the usefulness of a document may vary during an information seeking activity [38]; this can be captured by the system through an interactive information retrieval task. During the interactive task the system attempts to perceive how the user interacts with it and, as a consequence, it can modify the current search strategy [60]. Classical relevance feedback approaches [61] can be seen as early techniques for interactive retrieval; the user interaction is captured as yes/no judgments of document relevance. The system uses these judgments to expand and/or reweight the query [32].

Filtering

Also known as selective dissemination of information, or text routing, filtering combines aspects of text retrieval and text categorization. Like text categorization, a text filtering system processes documents in real time and assigns them to zero or more classes. However, like text retrieval, each class is typically associated with the information needs of one or a small group of users. Each user, or user group, can typically add, remove, or modify the queries, or profiles, according to their needs. Examples include NewsSieve [100], a client/server USENET news filtering system that can be used in a desktop environment; NewsWeeder [87], an experimental USENET news filtering service; and SIFT, the Stanford Information Filtering Tool [86], which includes two selective dissemination services, one for computer science technical reports and one for USENET news articles.

Browsing

When users are not interested in posing a specific query to the system, but instead invest some time in exploring the document space, looking for interesting references, they are browsing the space rather than searching. There are three types of browsing, namely flat, structure-guided, and hypertext. In flat browsing the user explores a document space which has a flat organization; for example, files in a directory. In structure-guided browsing the user is generally guided by a hierarchical structure in which documents are organized in categories and subcategories. The hypertext model introduces a navigational structure which allows a user to browse text in a non-sequential manner. The web is the best-known example of a hypertext structure.

Clustering

The term emerges from the statistics community, where it is well known as classification analysis and discriminant analysis [3]. In the artificial intelligence community, the task is often called concept learning. Clustering is the automatic recognition and generation of categories of entities, which can be text documents. It is usually based on some similarity measure between documents, as well as an explicit or implicit definition of what distinguishing characteristics the groups of documents should have. It is generally used to improve the retrieval process, because the search can be restricted to a set of categories of interest. Related to clustering is categorizing, which is the recognition and assignment of a document to one or more pre-existing categories. An example of a categorization tool is CORA (Computer Science Research Paper Search Engine) [84], an automatic categorizing tool for scientific papers. An example of a categorizing service is the Yahoo Directory [99]; in this case the categorization is performed manually, by human experts.

Mining

Mining is the process of automatically extracting key information from text documents. Such extraction tasks include language identification, feature extraction, terminology extraction, predominant theme extraction, abbreviation extraction, and relation extraction. LEXA [89] is an example of corpus processing software, while the IBM text miner [91] is a mining tool integrated with the homonymous text search engine.

Gathering

This is an activity involving proactive acquisition of information from possibly heterogeneous sources. Metasearch engines exemplify a particular type of gathering task; Metacrawler [92] and InFind [116] are two examples. They combine the outputs of several search engines and present the results as if produced by a single search engine.

Crawling

Crawling is concerned with selecting new sources of information, or updating existing ones, that will be processed by subsequent activities, for example mining and/or gathering. It is also known as the indexing process and, especially in the Web context, as spidering. Well-known examples are Scooter [94], ArchitextSpider [110], Sidewinder [112], Slurp [102], and Guliver [114], the spiders of Altavista [93], Excite [109], Infoseek [111], Inktomi [101], and Northern Light [113], respectively.

3.2. Form

The form refers to the way in which the object is supplied to the final user. It can be supplied in the form of a tool or a service. When the object is implemented as a software product, it is a tool. It exists because, for example, a company has produced it as a commercial product. It can be distributed, installed, sold, etc. When the object exists only in one, or a few, instances used to deliver some information retrieval services, it is a service. Examples are search engines on the web.

3.3. Context

The context of an information retrieval object regards its domain of application. It can be general or specific. A general purpose information retrieval object operates on heterogeneous domains and contents, unlike a context-specific system, which operates on document collections belonging to a specific domain, such as legal and business documents, technical papers, etc. Notable examples of general purpose objects are web search engines, where the high heterogeneity of the information calls for a very general purpose approach. Google [104], Altavista [93], and Infoseek [111] are some general purpose engines that currently operate on the web. A specialized retrieval system is one that is developed with a particular application domain in mind. For instance, the LEXIS-NEXIS [119] retrieval system is a specialized retrieval system that provides access to a very large collection of legal and business documents. Similarly, the ResearchIndex service [105] provides free access to a large collection of scientific papers.

3.4. An Example

As we did with the vertical taxonomy, here we apply the horizontal taxonomy to a set of information retrieval objects. We have chosen 31 objects from various sources: research labs, companies, and institutions.

The main classification scheme consists of identifying, for each object, its horizontal components included in Fig. 2. This is done by analyzing the object as a black box and trying to gather information about what it does. The result is shown in the Appendix, in which information retrieval objects are listed with some informative notes and references. The presence of a cross indicates that the corresponding horizontal component is supported by the information retrieval object.

4. Concluding Remarks

For the purpose of simplicity, we have conducted the classification along two separate paths: a horizontal taxonomy and a vertical taxonomy. In reality, these taxonomies are not disjoint and in this concluding section we show how these two important aspects of information retrieval can be combined. We have already remarked that an information retrieval object can be based on more than one model and an information retrieval model can be the basis for more than one object.

The vertical dimension classifies information retrieval models based on a two-component view, namely representation and reasoning. The horizontal dimension classifies information retrieval objects with respect to the application areas. Indeed, objects can themselves be classified with respect to the vertical components, namely representation and reasoning. We call this further classification of an IR object the vertical projection of the object; Tab. 2 shows the vertical projection for the IR objects referred to in the Appendix. Note that a few rows in the table are left blank, as we were not able to access the information needed to produce the vertical projections of the related objects.

Table 2. Vertical projections.

In recent years, information retrieval has assumed an increasing importance because of the dramatic growth of the amount of information available in digital formats. The proliferation of information retrieval algorithms, methods, technologies, and tools calls for the definition of basic concepts and terminology; this is useful to assess the features and the characteristics of each IR object and to understand the relationships that exist between the objects. In this paper we have proposed a taxonomy of IR objects, accompanied by definitions for the key terms. This taxonomy is a tentative first step in classifying IR models and tools, since it does not cover all aspects of IR. The market and the development of IR technologies are still evolving and this evolution will make some observations contained in this paper obsolete. As a result, this work will need to be updated incrementally as the technology develops. However, we think that the taxonomy presented in this paper provides a good starting point for such continuous updating.

One of the main limitations of the taxonomy presented in this paper is the fact that it covers only text information retrieval. Indeed, current information needs require more and more integrated retrieval models and tools that combine the traditional retrieval of text documents with the retrieval of multimedia content, such as images and speech, and even structured data from databases. Therefore, there is room for improvement of the proposed taxonomy and we are currently working on extending it in order to include other important aspects of IR not covered here, primarily the retrieval of multimedia content.

5. Acknowledgment

The work described in this paper has been supported by the EUREKA Project E!2235, IKF – Information and Knowledge Fusion.

References

[1] AGOSTI, M., CRESTANI, F., TACHIR: a Tool for the Automated Construction of Hypertexts in Information Retrieval, Proceedings of RIAO, Rockefeller University, (1994), New York (USA).
[2] ANANDEEP S., SYCARA, P.K., A Learning Personal Agent for Text Filtering and Notification, Proceedings of the International Conference of Knowledge Based Systems, (1996), (http://www.ri.cmu.edu/pubs/pub_2174.html).
[3] ANDERBERG, M.R., Cluster analysis for applications, Academic Press, New York, 1973.
[4] BAEZA-YATES, R., GONNET, G., Efficient text searching of regular expressions, Proceedings of the 16th International Colloquium on Automata, Languages and Programming, LNCS 372, (1989), pp. 46–62, Berlin (Germany).
[5] BAEZA-YATES, R., NAVARRO, G., Fast approximate string matching, Algorithmica, 23(2), (1999), pp. 127–158.
[6] BEERI, C., KORNATZKY, Y., A logical query language for hypertext systems, Proceedings of the European Conference on Hypertext, (1990), pp. 67–80, Versailles (France).
[7] BELEW, R.K., Adaptive information retrieval, Proceedings of the 12th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, (1989), pp. 11–20, Cambridge (MA).
[8] BERND T., Logic Programs for Intelligent Web Search, Proceedings of the 11th International Symposium on Methodologies for Intelligent Systems, (1999), LNAI 1609, Warsaw (Poland).
[9] BLOSSEVILLE, M.J., HEBRAIL, G., MONTEIL, M.G., PENOT, N., Automatic document classification: natural language processing, statistical analysis, and expert system techniques used together, Proceedings of the 15th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, (1992), pp. 51–57, Copenhagen (Denmark).
[10] BOOKSTEIN, A., Fuzzy request: an approach to weighted Boolean searches, Journal of the American Society for Information Science, 31, (1980), pp. 240–247.
[11] BORDOGNA, G., PASI, G., A Fuzzy Linguistic Approach Generalizing Boolean Information Retrieval; a Model and Its Evaluation, Journal of the American Society for Information Science, 44, (1993), pp. 70–82.
[12] BRIN, S., PAGE, L., MOTWANI, R., WINOGRAD, T., The PageRank Citation Ranking: Bringing Order to the Web, Technical report, Stanford University, 1998.
[13] BROGLIO, J., CALLAN, J.P., CROFT, W.B., NACHBAR, D.W., Document retrieval and routing using INQUERY system, Proceedings of the 3rd Retrieval Conference TREC, (1995), pp. 29–38, Gaithersburg (Maryland).
[14] CALLAN, J., Document filtering with inference network, Proceedings of the 19th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1996), pp. 262–269, Zurich (Switzerland).
[15] CHANG, S.J., RICE, R.E., Browsing: a multidimensional framework, Annual Review of Information Science and Technology, 28, (1993), pp. 231–276.

[16] CHEN, H., Machine learning for information retrieval: neural networks, symbolic learning, and genetic algorithms, Journal of the American Society for Information Science, 46(3), (1995), pp. 194–216.
[17] CHEN, H., LYNCH, K.J., BASU, K., NG, D.T., Generating, integrating, and activating thesauri for concept-based document retrieval, IEEE EXPERT, Special Series on Artificial Intelligence in Text-based Information Systems, 8(2), (1993), pp. 25–34.
[18] CHEN, H., SHE, L., Inductive query by examples (IQBE): A machine learning approach, Proceedings of the 27th Annual International Conference on System Sciences, Information Sharing and Knowledge Discovery Track, (1994), Maui (Hawaii).
[19] COOPER, W.S., GEY, F.C., DABNEY, D.P., Probabilistic retrieval based on staged logistic regression, Proceedings of the 15th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1992), pp. 198–210, Copenhagen (Denmark).
[20] CROFT, W.B., Approaches to intelligent information retrieval, Information Processing and Management, 23(4), (1987), pp. 249–254.
[21] CUTTING, D.R., PEDERSEN, J.O., KARGER, D., TUKEY, J.W., Scatter/gather: a cluster-based approach to browsing large document collections, Proceedings of the 15th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1992), pp. 318–329, Copenhagen (Denmark).
[22] DAMASHEK, M., Gauging similarity with n-grams: Language-independent categorization of text, Science, 267, (1995), pp. 843–848.
[23] DOSZKOCS, T.E., REGGIA, J., LIN, X., Connectionist models and information retrieval, Annual Review of Information Science and Technology, 25, (1990), pp. 209–260.
[24] FLAKE, G.W., LAWRENCE, S., GILES, C.L., COETZEE, F.M., Self Organization and Identification of Web Communities, Journal of the IEEE Computer Society, 35(3), (2002), pp. 66–71.
[25] FOX, E.A., Extending the Boolean and vector space models of information retrieval with P-norm queries and multiple concept types, PhD thesis, Cornell University, 1983.
[26] FUHR, N., HARTMANN, S., KNORZ, G., LUSTIG, G., SCHWANTNER, M., TZERAS, K., AIR/X – a rule-based multistage indexing system for large subject fields, Proceedings of the 8th National Conference on Artificial Intelligence, (1990), pp. 789–895, Boston (MA).
[27] FURNAS, G.W., DEERWESTER, S., DUMAIS, S.T., LANDAUER, T.K., HARSHMAN, R.A., STREETER, L.A., LOCHBAUM, K.E., Information retrieval using a singular value decomposition model of latent semantic structure, Proceedings of the 11th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1988), pp. 257–265, Grenoble (France).
[28] GARFIELD, E., Citation Indexing: Its Theory and Application in Science, John Wiley & Sons, New York, 1979.
[29] GORDON, M., Probabilistic and genetic algorithms for document retrieval, Communications of the ACM, 31(10), (1988), pp. 1208–1218.
[30] GORDON, M.D., User-based document clustering by redescribing subject descriptions with a genetic algorithm, Journal of the American Society for Information Science, 42(5), (1991), pp. 311–322.
[31] GUARINO, N., MASOLO, C., VETERE, G., Ontoseek: Content-Based access to the web, IEEE Intelligent Systems, 14(3), (1999), pp. 70–80.
[32] HAINES, D., CROFT, W.B., Relevance feedback and inference networks, Proceedings of the 16th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1993), pp. 2–11, Pittsburgh (USA).
[33] KLEINBERG, J.M., Authoritative Sources in a Hyperlinked Environment, Proceedings of the 9th Annual Int. ACM SIAM Symposium on Discrete Algorithms, (1998), pp. 668–677, New York (USA).
[34] KOZA, J.R., Genetic Programming: On the Programming of Computers by Means of Natural Selection, The MIT Press, Cambridge, MA, 1992.
[35] KRAFT, D., BUELL, D.A., Fuzzy sets and generalized Boolean retrieval systems, International Journal of Man-Machine Studies, 19, (1983), pp. 45–56.
[36] KRAFT, D., PETRY, F.E., BUCKLES, B.P., SADASIVAN, T., The use of genetic programming to build queries for information retrieval, IEEE Symposium on Evolutionary Computation, (1994), pp. 468–473, Orlando (USA).
[37] KRAFT, D.H., BORDOGNA, G., PASI, G., Fuzzy set techniques in information retrieval, in J. Bezdek, D. Dubois and H. Prade (eds), Fuzzy Sets in Approximate Reasoning and Information Systems, 3(8), (1999), pp. 469–510, Kluwer Academic Publishers.
[38] KUHLTHAU, C.C., Inside the search process: Information seeking from the user’s perspective, Journal of the American Society for Information Science, 42(5), (1991), pp. 361–371.
[39] KWOK, K.L., A neural network for probabilistic information retrieval, Proceedings of the 12th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1989), pp. 202–210, Cambridge (USA).
[40] LAMPORT, L., LaTeX: A Document Preparation System, User’s Guide and Reference Manual; 2nd edition, Prentice Hall, 1994.
[41] LAYAIDA, R., BOUGHANEM, M., CARON, A., Constructing an information retrieval system with neural networks, Lecture Notes in Computer Science, 856, (1994), pp. 561–570.

[42] LEE, J.H., Properties of extended Boolean models in information retrieval, Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, (1994), pp. 182–190.
[43] LEWIS, D.D., Learning in intelligent information retrieval, Proceedings of the 8th International Workshop on Machine Learning, (1991), pp. 235–239, Morgan Kaufmann.
[44] LIN, X., SOERGEL, D., MARCHIONINI, G., A self-organizing semantic map for information retrieval, Proceedings of the 14th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1991), pp. 262–269, Chicago (IL).
[45] LUHN, H.P., A statistical approach to mechanized encoding and searching of library information, IBM Journal of Research and Development, 1, (1957), pp. 309–317.
[46] MACLEOD, K.J., ROBERTSON, W., A neural algorithm for document clustering, Information Processing & Management, 27(4), (1991), pp. 337–346.
[47] MCCUNE, B., TONG, R., DEAN, J.S., SHAPIRO, D., Rubric: a system for rule-based information retrieval, IEEE Transactions on Software Engineering, 11(9), (1985).
[48] MIYAMOTO, S., NAKAYAMA, K., Fuzzy information retrieval based on a fuzzy pseudo thesaurus, IEEE Transactions on Systems, Man and Cybernetics, 16(2), (1986), pp. 278–282.
[49] MIYAMOTO, S., TERUHISA, M., KAZUHIKO, N., Generation of a Pseudothesaurus for Information Retrieval based on co-occurrences and fuzzy set operations, IEEE Transactions on Systems, Man and Cybernetics, 13(1), (1983), pp. 62–69.
[50] MIZZARO, S., A cognitive analysis of information retrieval, Proceedings of CoLIS2, (1996), pp. 233–250, Copenhagen (Denmark).
[51] MIZZARO, S., How many relevancies in information retrieval?, Interacting with Computers, 10(3), (1998), pp. 305–322.
[52] BAEZA-YATES, R., RIBEIRO-NETO, B., Modern Information Retrieval, Addison Wesley, New York, 1999.
[53] OGAWA, Y., MORITA, T., KOBAYASHI, K., A fuzzy document retrieval system using the keyword connection matrix and a learning method, Fuzzy Sets and Systems, 39, (1991), pp. 163–179.
[54] PAIJMANS, H., Explorations in the document vector model of information retrieval, Dissertation, Tilburg University, 1999. http://pi0959.kub.nl:2080/Paai/Bibliogr/
[55] PIROLLI, P., PITKOW, J., RAO, R., Silk from Sow’s Ear: Extracting Usable Structures from the web, Proceedings of the ACM Conference on Human Factors in Computing Systems, (1996), pp. 118–125, New York (USA).
[56] QUINLAN, J.R., Learning efficient classification procedures and their application to chess and games, Machine Learning, an Artificial Intelligence Approach, (1983), pp. 463–482, Tioga Publishing Company, Palo Alto, CA.
[57] RIBEIRO-NETO, B.A., MUNTZ, R., A Belief network model for IR, Proceedings of the 19th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1996), pp. 253–260, Zurich (Switzerland).
[58] RIJSBERGEN, C.J., Information Retrieval, Butterworths, London, 1979.
[59] ROBERTSON, S.E., SPARCK JONES, K., Relevance weighting of search terms, Journal of the American Society for Information Science, 27(3), (1976), pp. 129–146.
[60] ROBINS, D., Interactive Information Retrieval: Context and Basic Notions, Informing Science, 3(2), (2000), pp. 57–61.
[61] ROCCHIO, J.J., Relevance Feedback in Information Retrieval, Prentice Hall, 1971.
[62] RUMELHART, D.E., HINTON, G.E., WILLIAMS, R.J., Learning Internal Representations by Error Propagation, Parallel Distributed Processing, (1986), pp. 318–362, The MIT Press, Cambridge, MA.
[63] SACHS, W.M., An approach to associative retrieval through the theory of fuzzy sets, Journal of the American Society for Information Science, (1976), pp. 85–87.
[64] SALTON, G., The SMART Retrieval System – Experiments in Automatic Document Processing, Prentice Hall, New York, 1971.
[65] SALTON, G., Automatic text processing: The transformation, analysis, and retrieval of information by computer, Addison-Wesley, 1989.
[66] SALTON, G., BUCKLEY, C., Term weighting approaches in automatic retrieval, Information Processing and Management, 24(5), (1988), pp. 513–523.
[67] SARACEVIC, T., RELEVANCE: A Review of and a Framework for the thinking of the notion in information science, Journal of the American Society for Information Science, 26(6), (1975), pp. 321–343.
[68] SEBASTIANI, F., On the Role of Logic in Information Retrieval, Information Processing & Management, 34(1), (1998), pp. 1–18.
[69] SMITH, L.C., WARNER, A.J., A taxonomy of representation in information retrieval design, Journal of Information Science, 8, (1984), pp. 113–121.
[70] TAHANI, V.A., A fuzzy model of document retrieval systems, Information Processing and Management, 12, (1976), pp. 177–187.

[71] TURTLE, H., CROFT, W.B., Inference networks for document retrieval, Proceedings of the 13th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1990), pp. 1–24, Brussels (Belgium).
[72] UTGOFF, P.E., Incremental induction of decision trees, Machine Learning, 4, (1989), pp. 161–186.
[73] VOORHEES, E.M., HARMAN, D., Overview of TREC 2001, National Institute of Standards and Technology, 2001.
[74] WALLER, W.G., KRAFT, D.H., A Mathematical Model of a Weighted Boolean Retrieval System, Information Processing & Management, 15(5), (1979), pp. 235–245.
[75] WILKINSON, R., HINGSTON, P., Using the cosine measure in neural network for document retrieval, Proceedings of the 14th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1991), pp. 202–210, Chicago (USA).
[76] WONG, S.K.M., YAO, Y.Y., On modeling information retrieval with probabilistic inference, ACM Transactions on Information Systems, 13(1), (1995), pp. 39–68.
[77] WONG, S.K.M., ZIARKO, W., RAGHAVAN, V.V., WONG, P.C.N., On Extending the Vector Space Model for Boolean Query Processing, Proceedings of the 9th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1986), pp. 175–185, Pisa (Italy).
[78] WONG, S.K.M., ZIARKO, W., WONG, P.C.N., Generalized vector space model in information retrieval, Proceedings of the 8th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1985), pp. 18–25, New York (USA).
[79] WU, S., MANBER, U., Agrep: a fast approximate pattern matching tool, Proceedings of USENIX Technical Conference, (1992), pp. 153–162, San Francisco (USA).
[80] XML eXtensible Markup Language 1.0 (Second Edition) W3C Recommendation 6 October 2000. http://www.w3.org/XML/
[81] XPath XML Path Language 1.0 W3C Recommendation 16 November 1999. http://www.w3.org/TR/xpath
[82] YANNAKOUDAKIS, E.J., GOYAL, P., HUGGIL, J.A., The generation and use of text fragments for data compression, Information Processing and Management, 18, (1982), pp. 15–21.
[83] ZIPF, G.K., Human Behaviour and the Principle of Least Effort, Addison-Wesley, Cambridge, 1949.
[84] CORA. http://cora.whizbang.com
[85] TACHIR. http://www.dei.unipd.it/~ims/tachir.html
[86] SIFT. ftp://db.stanford.edu/pub/sift/sift-1.1-netnews.tar.Z
[87] NewsWeeder. http://anther.learning.cs.cmu.edu/ifhome.html
[88] Grep. http://www.gnu.org
[89] LEXA. http://nora.hd.uib.no/lexainf.html
[90] OCP. http://info.ox.ac.uk/ctitext/resguide/resources/o125.html
[91] IBM text miner. http://www.ibm.com
[92] Metacrawler. http://www.metacrawler.com/
[93] Altavista. http://www.altavista.com
[94] Scooter. http://www.altavista.com
[95] INQUERY. http://www-ciir.cs.umass.edu
[96] SMART. ftp://ftp.cs.cornell.edu/pub/smart/
[97] ILA. Internet Learning Agent. http://www.cs.washington.edu/homes/map/ila.html
[98] WebLearner. http://www.ics.uci.edu/~pazzani/Coldlist.html
[99] Yahoo Directory. http://www.yahoo.com
[100] NewsSieve. http://www.newssieve.com/
[101] Inktomi. http://www.inktomi.com
[102] Slurp. http://www.inktomi.com
[103] Isearch. http://www.cnidr.org/isearch.html
[104] Google. http://www.google.com
[105] ResearchIndex. http://www.researchindex.com
[106] Agrep, Glimpse. http://glimpse.cs.arizona.edu/
[107] Scatter/Gather. http://www.sims.berkeley.edu/~hearst/sg-overview.html
[108] Amalthaea. http://lcs.www.media.mit.edu/~moux/papers/PAAM96/PAAM96.html
[109] Excite. http://www.excite.com
[110] ArchitextSpider. http://www.excite.com
[111] Infoseek. http://www.infoseek.com
[112] Sidewinder. http://www.infoseek.com
[113] Northern Light. http://www.northernlight.com
[114] Guliver. http://www.northernlight.com
[115] WEBSOM. http://websom.hut.fi/websom
[116] Infind. http://www.infind.com
[117] Lycos. http://www.lycos.com
[118] GeoSearch. http://www.northernlight.com
[119] LEXIS-NEXIS. http://www.lexis-nexis.com

Appendix: Horizontal Taxonomy of a Set of Information Retrieval Objects



Received: September, 2002
Revised: January, 2004
Accepted: May, 2004

Contact address:
Gerardo Canfora
Research Centre on Software Technology
Department of Engineering
University of Sannio
Palazzo ex Poste – Via Traiano
82100 Benevento
ITALY
e-mail: gerardo.canfora@unisannio.it

GERARDO CANFORA received the Laurea degree in electronic engineering from the University of Naples, Federico II, Italy, in 1989. He is currently a full professor of computer science at the Faculty of Engineering and the Director of the Research Centre on Software Technology (RCOST) of the University of Sannio in Benevento, Italy. From 1990 to 1991, he was with the Italian National Research Council (CNR). During 1992, he was at the Department of Informatica e Sistemistica of the University of Naples, Federico II, Italy. From 1992 to 1993, he was a visiting researcher at the Centre for Software Maintenance of the University of Durham, UK. In 1993, he joined the Faculty of Engineering of the University of Sannio in Benevento, Italy. He has served on the program committees of a number of international conferences. He was a program co-chair of the 1997 International Workshop on Program Comprehension and of the 2001 International Conference and the General Chair of the 2003 European Conference on Software Maintenance and Reengineering. His research interests include software maintenance, program comprehension, reverse engineering, workflow management, document and knowledge management, and information retrieval. He serves on the Editorial Board of the IEEE Transactions on Software Engineering. He is a member of the IEEE and the IEEE Computer Society.

LUIGI CERULO received the Laurea degree in computer engineering from the University of Sannio, Italy, in 2001. He is currently an assistant researcher at the Research Centre on Software Technology (RCOST) of the University of Sannio in Benevento, Italy. His research interests include information retrieval, fuzzy logic, and visual languages.
