A Taxonomy of Information
Retrieval Models and Tools
Information retrieval is attracting significant attention due to the exponential growth of the amount of information available in digital format. The proliferation of information retrieval objects, including algorithms, methods, technologies, and tools, makes it difficult to assess their capabilities and features and to understand the relationships that exist among them. In addition, the terminology is often confusing and misleading, as different terms are used to denote the same, or similar, tasks.

This paper proposes a taxonomy of information retrieval models and tools and provides precise definitions for the key terms. The taxonomy consists of superimposing two views: a vertical taxonomy, which classifies IR models with respect to a set of basic features, and a horizontal taxonomy, which classifies IR systems and services with respect to the tasks they support.

The aim is to provide a framework for classifying existing information retrieval models and tools and a solid point to assess future developments in the field.

Keywords: information retrieval, taxonomy, tools, models.

1. Introduction

In recent years information retrieval has become an important subject of much research, because the amount of information available in digital formats has grown exponentially and the need for retrieving relevant information has assumed a crucial importance. The World Wide Web and the Digital Libraries have shown to a large audience the importance of effective mechanisms and tools to retrieve documents from a very large document collection based on user information needs.

Information Retrieval (IR) is the scientific discipline that deals with the analysis, design and implementation of computerized systems that address the representation, organization of, and access to large amounts of heterogeneous information encoded in digital format [58].

In this paper we focus on text document retrieval, in which the information is represented by text documents. Therefore, for the purposes of this paper, the terms information and documents are used interchangeably. Text document retrieval is the most traditional subfield of IR; however, IR comprises other subfields, such as image retrieval, speech retrieval, information generation, query answering, and text summarization, that we do not cover in this paper.

A key feature of a text IR system is retrieving the documents that can satisfy the information needs of a user from a large collection of documents. Such systems, especially in the context of the web, are usually known as search engines, so that in the rest of the paper we will consider search engine as a synonym of information retrieval system. IR systems prepare the collection of documents for retrieval through an indexing step. User information needs are usually represented by keywords or phrases, which are themselves indexed, although more complex representation languages are available. This representation, which inevitably causes a loss of information, is usually known as a query. Indexing can assume different forms according to the model adopted to represent both the documents in the collection and the user information needs. Many current IR systems exploit ranked IR methods, i.e. they rank the documents in the collection based on a measure of their relevance with respect to the user information needs as represented by a query.

The proliferation of information retrieval algorithms, methods, technologies, and tools is
making it more difficult to assess the features and the characteristics of each IR aspect and to understand the relationships that exist among them. The terminology is often confusing; for example, terms such as crawling, indexing, and spidering are often used to denote similar tasks, with no clear distinction of the differences.

In this paper we propose a classification of IR models and tools and provide definitions for the key terms. The classification consists of superimposing two views: one for the IR models and one for the IR objects, either tools or services. A vertical taxonomy classifies IR models with respect to a set of basic features, and a horizontal taxonomy classifies IR objects with respect to their tasks, form, and context. The vertical taxonomy is built by exploding two basic features of any IR model: the representation, that is, the model adopted to represent both the documents and the user queries; and the reasoning, which refers to the framework adopted to resolve a representation similarity problem. The horizontal taxonomy is derived from an analysis of the application areas of IR.

[...] variety of more specific models. Paijmans identified the vector document model as the basis for building the classification and showed how the vector model can subsume other popular models. Whilst this constitutes a concise style of classification, it is unable to classify IR techniques that are not derived from the vector-based model, such as the logic-based techniques.

Our approach is different, as we start from a classification of the basic features of IR models and proceed with a classification of the objects produced in the various fields of information retrieval in terms of tools and services. The flexibility of this faceted view is evident when we consider that different information retrieval objects can be based on the same information retrieval model, and the same information retrieval model can be exploited to implement different information retrieval objects. For example, the classic vector model, generally presented as a retrieval technique, can be used for building information filtering and document clustering tools, too. The latter are different information retrieval objects that exploit the same information retrieval model.
An information retrieval model can be modeled as a couple <Rp, Rs>, where Rp is the representation model of documents and queries, and Rs is a framework for modeling the relationship between document and query representations, which is the reasoning strategy. Every component can be divided into subcomponents and for every subcomponent we can build a tree of possible approaches and solutions presented in the literature, as shown in Fig. 1.

Defining the approaches used for each component identifies an IR model. For example, the couple <Rp, Rs>:

    Rp(query) = { keyword-based }
    Rp(document) = { weighted vector }
    Rs(with logic) = { vector algebra }

identifies the well-known vector model, as we will discuss later. We will now go into each of these components.

2.1. Representation

A fundamental component of an IR system is the representation of the information itself: information can be processed if it is represented in some way.

In text information retrieval, representation means representing documents and queries. A document is the representation of the information the author wished to encode; it is the unit of information that can be retrieved by an IR system. Queries are the representation of the information needs of a user.

Any text can be characterized by using four attributes: syntax, structure, semantics, and style. A text has a given syntax and a structure, which are usually dictated by the application or by the person who created it. Text also has a semantics, specified by the author of the document. Additionally, a document may have a presentation style associated with it, which specifies how it should be displayed or printed. In many approaches to text representation the style is coupled with the document syntax and structure (see for example the LaTeX document preparation system [40]). Modern representations, such as XML [80], separate the representation of syntax and structure, which are defined either by a DTD or an XSD, from style, which is captured by XSL.

Whilst documents are characterized by syntax, structure, semantics and style, the structure and semantics of text are generally sufficient to characterize queries.

Query Representation

A query is the representation of a user's information need. The information need originates from a problem that the user has to solve; it is implicit in the user's mind and its purpose is to bridge a knowledge gap. An information need can be of three types [50]: known item information need, conscious information need, and confused information need. The first is when users search for, or verify the existence of, documents they know. The second is when users search for documents they do not know, but that regard a subject they know. The third is when users know neither the documents nor the subject. The following classes of query representations can be identified:

Keyword-based. This is the simplest form for a query. It is composed of keywords, and the documents containing such keywords are searched for. Keyword-based queries are popular, because they are intuitive and easy to express. Usually, a keyword query is a single word, but, in general, it can be a more complex combination of (Boolean) operations applied to several words.

    — Single word. It is the most elementary query that can be formulated in a text retrieval system. Depending on the reasoning component, the result of a single word query is generally the set of documents containing at least one occurrence of the searched word.

    — Boolean. It is the oldest and still widely used form of combining the keywords in a query. A Boolean query is an expression whose elements are keywords, Boolean operators and a precedence notation. In addition to classical Boolean operators, several new operators have been proposed, such as the NEAR operator, which allows context search capabilities, and the fuzzy Boolean operator, which relaxes the meaning of canonical AND and OR.

Pattern-based. It is a more specific query formulation, which allows the specification
of text having some properties. A pattern is a set of syntactic features that must occur in a text segment. The segments satisfying the pattern specification are said to match the pattern.

Structural. Structural queries are a mechanism to improve the retrieval quality of structured information. This mechanism is generally built on top of the basic queries with the addition of structural constraints expressed using containment, proximity, or other restrictions on the structural elements in the documents. Structural queries can be categorized into three main categories: fixed structure, hypertext, and hierarchical structure. The first is the simplest form and, for this reason, it is more restrictive. The documents are divided into a set of fields, each of which contains some text. A fixed structural query restricts the search to text contained in certain document fields. The hypertext is probably the most flexible form of structuring. It is a directed graph where the nodes hold some text and the links represent connections between the nodes. However, it is not possible to query the hypertext structural connectivity, but only the text content of the nodes. This transforms the retrieval activity into a navigational activity (browsing task). The hierarchical structure is an intermediate structuring model and represents a natural decomposition for many text collections (books, articles, structured programs, etc.). For example, XML is the most prominent structural representation model and XPath [81] is a query language for addressing pieces of content in the hierarchical structure.

Document Representation

A document is a retrievable element of the document space of an information retrieval system. It can be considered as the minimal resource that an information retrieval system can retrieve. Historically, documents have been represented by a set of terms called keywords, which are usually extracted from the text or inserted by the author. The following are the most significant types of document representation:

Stream of characters. Text is represented as a stream of characters and no interpretation is made on its structure or semantic content.

Vector space. The basic principle of this text representation model is to consider that each document is described by a vector of components that are representative of the semantic content of the document. Traditional vector space approaches use a set of keywords, called index terms, but other types of representative components, such as n-grams, are also used. An index term is a word whose semantics helps in identifying the document's main themes. Of course, not all terms of a document are useful for describing the document content. In fact, there are index terms which are vaguer than others. Deciding the importance of terms is not a trivial task. In a large collection of documents a word which appears in each document is useless as an index term, because it does not discriminate between documents. On the other hand, a term that appears in one document will likely describe the content of this document ([45], [83]). Vector representations can be further categorized as follows.

    — Binary. The text document is represented as a binary vector of terms. Each element of the vector represents a term and its value is '1' if the term appears in the document, '0' otherwise.

    — Weighted. In this case element values are real numbers between 0 and 1, called term weights, and represent the affinity of the term with respect to the document. A widespread method to compute the term weights exploits two factors [58]: Term Frequency (TF) and Inverse Document Frequency (IDF). The first provides a measure of how well the term describes the document contents (intra-cluster similarity); the second measures how well the term can discriminate documents among the collection (inter-cluster dissimilarity). A well-known term weighting scheme, valid for generic collections, is the product between the TF and IDF factors. Several variations are described by Salton and Buckley [66].

        Latent semantic. In the traditional vector space approach each document is represented by a vector of n components, where n is the number of terms occurring in the collection (dimension
        of the document space). Latent Semantic Indexing (LSI) [27] reduces the dimension of the document space by capturing term-to-term statistical relationships. The document space is then represented by a new coordinate system of dimension k < n, called k-space (or LSI space), in which each of the k dimensions is a derived concept often called LSI factor or LSI feature. LSI features are identified by using a method for matrix decomposition called Singular Value Decomposition (SVD). The derived concepts may be thought of as artificial concepts; they represent extracted common meaning components of many different words and documents.

        Fuzzy subset. Fuzzy set theories deal with the representation of classes whose boundaries are not well defined. Each element of the class is associated with a membership function that defines the membership degree of the element in the class. In many fuzzy representation approaches the TF-IDF function of the weighted vector model is used as the fuzzy membership function ([35], [37]).

    — N-Gram. The n-gram approach is in some respects an evolution of vector space approaches. In the traditional vector space approaches the dimensions of the document space for a given collection of documents are the words (or sometimes phrases) that occur in the collection. By contrast, in the n-gram approach, the dimensions of the document space are n-grams: strings of n consecutive characters extracted from the text without considering word lengths, or even word boundaries. Hence, the n-gram is a remarkably pure statistical approach, one that measures the statistical properties of strings of text in the given collection and does not consider the vocabulary, lexical, or semantic properties of the natural language in which the documents are written. The n-gram length (n) and the method for extracting n-grams from documents vary from one author to another. In [22] Damashek uses n-grams of length 5 and 6 for clustering text by language and topic. He uses a sliding window approach in which n-grams are obtained by moving a window of n characters through a document or a query, one character at a time. Some authors [82] also use n-grams that cross word boundaries, i.e., that start within one word, end in another word, and include the space characters that separate consecutive words.

Structural. Structural documents, similarly to structural queries, are a mechanism to improve the retrieval quality. The main idea is to enrich documents with additional information that allows a computer to make part of the semantic content explicit. XML is the most prominent standard for modeling these aspects of information.

2.2. Reasoning

With the term reasoning we refer to the set of methods, models, and technologies used to match document and query representations in a retrieval task. Closely related to the reasoning component is the concept of relevance. The primary goal of an information retrieval system is to retrieve the documents relevant to a query. The reasoning component defines the framework to measure the relevance between documents and queries using their representations.

A key question to address in order to understand the reasoning component of an IR system is to find a precise definition for relevance. This is still an open problem within the IR community; the literature reports different definitions, but a widespread definition is [67]:

    Relevance is the (A) of a (B) existing between a (C) and a (D) as determined by an (E).

Where:

    (A). measure, estimate, judgment...
    (B). utility, matching, satisfaction...
    (C). document, document representation, information provided...
    (D). question, question representation, information need...
    (E). request, intermediary, expert...
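
As a concrete illustration of the document representations described in Section 2.1, the following Python sketch builds binary and weighted (TF-IDF) term vectors and extracts character n-grams with a sliding window. It is a minimal sketch under stated assumptions, not the formulation used by any of the cited systems: the whitespace tokenizer, the function names, the toy collection, and the particular normalization (term frequency divided by the largest frequency in the document, IDF divided by log N so that weights stay between 0 and 1) are all illustrative choices. Salton and Buckley [66] describe several other variants.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an illustrative simplification)."""
    return text.lower().split()


def binary_vector(doc_tokens, vocabulary):
    """Binary representation: 1 if the term occurs in the document, else 0."""
    present = set(doc_tokens)
    return [1 if term in present else 0 for term in vocabulary]


def tfidf_vectors(docs_tokens, vocabulary):
    """Weighted representation: normalized TF times normalized IDF.

    The normalization is an assumption made for this sketch so that
    every weight falls in the interval [0, 1].
    """
    n_docs = len(docs_tokens)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in docs_tokens for term in set(tokens))
    log_n = math.log(n_docs) if n_docs > 1 else 1.0
    vectors = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        max_count = max(counts.values()) if counts else 1
        vector = []
        for term in vocabulary:
            tf = counts[term] / max_count
            idf = math.log(n_docs / df[term]) / log_n if df[term] else 0.0
            vector.append(tf * idf)
        vectors.append(vector)
    return vectors


def char_ngrams(text, n):
    """Slide a window of n characters over the text, one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Hypothetical toy collection used only for illustration.
docs = ["information retrieval models", "retrieval tools", "image retrieval"]
tokens = [tokenize(d) for d in docs]
vocabulary = sorted({t for doc in tokens for t in doc})
weights = tfidf_vectors(tokens, vocabulary)
```

In this toy collection the term "retrieval" occurs in every document, so its weight is zero in every vector, which mirrors the observation that a word appearing in each document does not discriminate between documents. Calling char_ngrams on a string that contains spaces yields n-grams that cross word boundaries.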
[...] application of graph algorithms to information retrieval becomes more interesting with the advent of the web. Web resources can be well modelled with a graph structure in which documents represent vertices and hyperlinks represent edges. In [24] a Maximum Flow method is introduced to identify web communities. Previous graph-based approaches were applied to bibliographic documents and were principally based on bibliometric methods such as co-citation and bibliographic coupling. Some of these are used in the web context, too. Such algorithms include: the PageRank algorithm [12], on which the Google [104] web search engine is based, the HITS algorithm [33], and the SAE algorithm [55].

Reasoning with Uncertainty

Probability theories. Probabilistic theories were introduced by Robertson and Sparck Jones [59]. The fundamental reasoning approach is based on the following assumption: given a user query and a document in the collection, the probabilistic reasoning process tries to estimate the probability that the user will find the document interesting. There exist some alternative approaches based on Bayesian networks. In particular, the inference network model [71] has been used in the INQUERY system [13], while reference [57] introduces a generalization called belief network.

Fuzzy set theories. Fuzzy IR models have been defined to overcome the limitations of the crisp Boolean IR models, in particular to manage the vagueness and incompleteness of users in query formulation. Fuzzy extended Boolean models are a superstructure of the Boolean model by means of which existing Boolean IR systems can be extended without redesigning them completely. The standard Boolean models apply an exact match between the query and the document representations, and then partition the document base into two sets: the retrieved documents and the rejected ones. As a consequence of this crisp behavior, they are liable to reject useful items as a result of too restrictive queries, and to retrieve useless material in reply to excessively general queries. Thus, softening the retrieval activity to rank the retrieved items in decreasing order of relevance to a user query can greatly improve the effectiveness of such systems. This objective can be reached by extending the Boolean model in several ways [35]. In the fuzzy extensions of document representations the aim is to provide more specific and exhaustive representations of the documents' information content, in order to reduce the imprecision and incompleteness of the Boolean indexing. For example, a document can be represented as a fuzzy set of terms. In the fuzzy generalization of the Boolean query language the objective is to have a more expressive query language, in order to capture the vagueness of the user needs as well as to simplify the user-system interaction. Various approaches have been proposed. One of these introduces soft connectives of selection criteria [11], characterized by a parametric behavior which can be set between the two extremes "AND" and "OR". In other approaches, the Boolean query language has been generalized by defining aggregation operators as linguistic quantifiers, such as "at least k" or "about k".

Reasoning with Learning

Several authors have proposed the use of machine learning approaches in IR. The most frequently used techniques include [16]: multiple layered and feed-forward neural networks such as back propagation networks [62], symbolic and inductive learning algorithms such as ID3 [56] and ID5R [72], and evolution-based algorithms such as genetic algorithms [34].

Neural networks. Neural network computing seems to fit well with conventional retrieval models such as the vector space model and the probabilistic model. One of the first applications in IR comes from Belew [7]. He developed a three-layer neural network of authors, index terms, and documents. The system used relevance feedback from its user to change its representation of authors, index terms, and documents over time. An evolution of this application has been introduced by Kwok [39], who uses a modified Hebbian learning rule to reformulate probabilistic information retrieval. In other applications the
Neural Network approach has been used for more specific tasks. For example, in [44], a Kohonen's self-organizing feature map was applied to construct a self-organizing representation of the semantic relationships between documents. A Neural Network document clustering algorithm was developed in [46]. The Hopfield neural network's parallel relaxation method was used in [17] for concept-based document retrieval and exploration.

Symbolic learning. In IR the use of symbolic learning is more limited with respect to other learning techniques. In [9] a symbolic learning technique is used for automatic text classification. The symbolic learning process represents the numeric classification results in terms of IF-THEN rules. In [26] a regression method and ID3 were used to implement a feature-based indexing technique. In [18] ID3 and the incremental ID5R algorithm were adopted for information retrieval. Both algorithms were able to use user-supplied samples of desired documents to construct decision trees of important keywords which could represent the user's query.

Genetic algorithms. Several genetic algorithm implementations have been developed in the context of IR. [29] presents a genetic algorithm-based approach to document indexing, in which competing document descriptions (binary vectors of terms) are associated with a document and altered over time by using genetic mutation and crossover operators. In this design, a keyword represents a gene (bit pattern), a document, which is a vector of keywords (bit string), represents an individual, and a collection of documents, initially judged relevant by a user, represents the initial population. Based on a Jaccard's matching function, the initial population evolves through generations and eventually converges to an optimal, improved population. In [30] a similar approach is adopted for document clustering.

2.3. An Example

As an example of application of the vertical taxonomy, we have taken some relevant works from the IR models field and tried to classify
them using the vertical taxonomy. We identify each information retrieval model in relation to the representation and reasoning components described above. This is shown in Tab. 1. A notable aspect is that many models contain the weighted vector as a representation component; this is why Paijmans [54] introduced the vector document model.

3. Horizontal Taxonomy

The vertical taxonomy alone is not sufficient to take into account all the objects that have been produced under the IR umbrella. Users do not interact with a model, but generally they use a software tool that is able to solve an information retrieval problem. This calls for the introduction of a further dimension, a new viewpoint that we call horizontal taxonomy. Through the horizontal taxonomy we classify information retrieval objects. An information retrieval object is an artifact that solves a more or less general IR problem. An information retrieval object is identified by three components, as illustrated in Fig. 2: Tasks, Form, and Context.

3.1. Tasks

Information retrieval tasks are concerned with a particular aspect of information retrieval derived from a user point of view and should not be confused with the tasks in an information retrieval process, such as query formulation, query expansion, comparison, ranking, and document presentation. An information retrieval object can support one or more tasks, and a task can be stand-alone or it can be integrated in a process to perform a larger task. We have identified the following tasks: ad hoc retrieval, known item search, interactive retrieval, filtering, browsing, clustering, mining, gathering and crawling. Sometimes they are known by different names because they are inherited from various research areas.

Ad Hoc Retrieval

An ad hoc retrieval task is characterized by an arbitrary subject of the search and a short duration [73]. It is typically performed by a researcher doing a literature search in a library. In this environment the retrieval system knows the set of documents to be searched, but cannot anticipate the particular topic that will be investigated [73]. A retrieval system's response to an ad hoc search is generally a list of documents ranked by decreasing similarity to the query. Internet search engines are examples of information retrieval objects with which one can perform ad hoc search.
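
The ranked response described above can be sketched in Python as follows. This is a minimal illustration, assuming raw term-frequency vectors and cosine similarity as the similarity measure; the function names and the toy collection are hypothetical, and real search engines rely on considerably more elaborate indexing and weighting.

```python
import math
from collections import Counter


def term_vector(text):
    """Raw term-frequency vector as a sparse Counter (assumed representation)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v[term] for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, collection):
    """Return (doc_id, score) pairs sorted by decreasing similarity to the query."""
    q = term_vector(query)
    scored = [(doc_id, cosine(q, term_vector(text)))
              for doc_id, text in collection.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical collection known to the system before any query arrives.
collection = {
    "d1": "taxonomy of information retrieval models",
    "d2": "image retrieval",
    "d3": "genetic algorithms for clustering",
}
ranking = rank("information retrieval", collection)
```

The collection is fixed in advance while the query is arbitrary, which matches the ad hoc setting: the system knows the documents but cannot anticipate the topic. Documents sharing no term with the query receive a score of zero and end up at the bottom of the list.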
[...] areas. Indeed, objects can themselves be classified with respect to the vertical components, namely representation and reasoning. We call this further classification of an IR object the vertical projection of the object; Tab. 2 shows the vertical projection for the IR objects referred to in the Appendix. Note that a few rows in the table are left blank, as we were not able to access the information needed to produce the vertical projections of the related objects.

In recent years, information retrieval has assumed an increasing importance because of the dramatic growth of the amount of information available in digital formats. The proliferation of information retrieval algorithms, methods,
technologies, and tools calls for the definition of basic concepts and terminology; this is useful to assess the features and the characteristics of each IR object and to understand the relationships that exist between the objects. In this paper we have proposed a taxonomy of IR objects, accompanied by definitions for the key terms. This taxonomy is a tentative first step in classifying IR models and tools, since it does not cover all aspects of IR. The market and the development of IR technologies are still evolving and this evolution will make some observations contained in this paper obsolete. As a result, this work will need to be updated incrementally as the technology develops. However, we think that the taxonomy presented in this paper provides a good starting point for such a continuous updating.

One of the main limitations of the taxonomy presented in this paper is the fact that it covers only text information retrieval. Indeed, current information needs require more and more integrated retrieval models and tools that combine the traditional retrieval of text documents with the retrieval of multimedia content, such as images and speech, and even structured data from databases. Therefore, there is room for improvement of the proposed taxonomy and we are currently working on extending it in order to include other important aspects of IR not covered here, primarily the retrieval of multimedia content.

References

[4] BAEZA-YATES, R., GONNET, G., Efficient text searching of regular expressions, Proceedings of the 16th International Colloquium on Automata, Languages and Programming, LNCS 372, (1989), pp. 46–62, Berlin (Germany).

[5] BAEZA-YATES, R., NAVARRO, G., Fast approximate string matching, Algorithmica, 23(2), (1999), pp. 127–158.

[6] BEERI, C., KORNATZKY, Y., A logical query language for hypertext systems, Proceedings of the European Conference on Hypertext, (1990), pp. 67–80, Versailles (France).

[7] BELEW, R.K., Adaptive information retrieval, Proceedings of the 12th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, (1989), pp. 11–20, Cambridge (MA).

[8] BERND T., Logic Programs for Intelligent Web Search, Proceedings of the 11th International Symposium on Methodologies for Intelligent Systems, (1999), LNAI 1609, Warsaw (Poland).

[9] BLOSSEVILLE, M.J., HEBRAIL, G., MONTEIL, M.G., PENOT, N., Automatic document classification: natural language processing, statistical analysis, and expert system techniques used together, Proceedings of the 15th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, (1992), pp. 51–57, Copenhagen (Denmark).

[10] BOOKSTEIN A., Fuzzy request: an approach to weighted Boolean searches, Journal of the American Society for Information Science, 31, (1980), pp. 240–247.
[16] CHEN, H., Machine learning for information retrieval: neural networks, symbolic learning, and genetic algorithms, Journal of the American Society for Information Science, 46(3), (1995), pp. 194–216.
[17] CHEN, H., LYNCH, K.J., BASU, K., NG, D.T., Generating, integrating, and activating thesauri for concept-based document retrieval, IEEE EXPERT, Special Series on Artificial Intelligence in Text-based Information Systems, 8(2), (1993), pp. 25–34.
[18] CHEN, H., SHE, L., Inductive query by examples (IQBE): A machine learning approach, Proceedings of the 27th Annual International Conference on System Sciences, Information Sharing and Knowledge Discovery Track, (1994), Maui (Hawaii).
[19] COOPER, W.S., GEY, F.C., DABNEY, D.P., Probabilistic retrieval based on staged logistic regression, Proceedings of the 15th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1992), pp. 198–210, Copenhagen (Denmark).
[20] CROFT, W.B., Approaches to intelligent information retrieval, Information Processing and Management, 23(4), (1987), pp. 249–254.
[21] CUTTING, D.R., PEDERSEN, J.O., KARGER, D., TUKEY, J.W., Scatter/gather: a cluster-based approach to browsing large document collections, Proceedings of the 15th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1992), pp. 318–329, Copenhagen (Denmark).
[22] DAMASHEK, M., Gauging similarity with n-grams: Language-independent categorization of text, Science, 267, (1995), pp. 843–848.
[23] DOSZKOCS, T.E., REGGIA, J., LIN, X., Connectionist models and information retrieval, Annual Review of Information Science and Technology, 25, (1990), pp. 209–260.
[24] FLAKE, G.W., LAWRENCE, S., GILES, C.L., COETZEE, F.M., Self Organization and Identification of Web Communities, Journal of the IEEE Computer Society, 35(3), (2002), pp. 66–71.
[25] FOX, E.A., Extending the Boolean and vector space models of information retrieval with P-norm queries and multiple concept types, PhD thesis, Cornell University, 1983.
[26] FUHR, N., HARTMANN, S., KNORZ, G., LUSTIG, G., SCHWANTNER, M., TZERAS, K., AIR/X – a rule-based multistage indexing system for large subject fields, Proceedings of the 8th National Conference on Artificial Intelligence, (1990), pp. 789–895, Boston (MA).
[27] FURNAS, G.W., DEERWESTER, S., DUMAIS, S.T., LANDAUER, T.K., HARSHMAN, R.A., STREETER, L.A., LOCHBAUM, K.E., Information retrieval using a singular value decomposition model of latent semantic structure, Proceedings of the 11th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1998), pp. 257–265, Grenoble (France).
[28] GARFIELD, E., Citation Indexing: Its Theory and Application in Science, John Wiley & Sons, New York, 1979.
[29] GORDON, M., Probabilistic and genetic algorithms for document retrieval, Communications of the ACM, 31(10), (1988), pp. 1208–1218.
[30] GORDON, M.D., User-based document clustering by redescribing subject descriptions with a genetic algorithm, Journal of the American Society for Information Science, 42(5), (1991), pp. 311–322.
[31] GUARINO, N., MASOLO, C., VETERE, G., Ontoseek: Content-Based access to the web, IEEE Intelligent Systems, 14(3), (1999), pp. 70–80.
[32] HAINES, D., CROFT, W.B., Relevance feedback and inference networks, Proceedings of the 16th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1993), pp. 2–11, Pittsburgh (USA).
[33] KLEINBERG, J.M., Authoritative Sources in a Hyperlinked Environment, Proceedings of the 9th Annual Int. ACM SIAM Symposium on Discrete Algorithms, (1998), pp. 668–677, New York (USA).
[34] KOZA, J.R., Genetic Programming: On the Programming of Computers by Means of Natural Selection, The MIT Press, Cambridge, MA, 1992.
[35] KRAFT, D., BUELL, D.A., Fuzzy sets and generalized Boolean retrieval systems, International Journal of Man-machine Studies, 19, (1983), pp. 45–56.
[36] KRAFT, D., PETRY, F.E., BUCKLES, B.P., SADASIVAN, T., The use of genetic programming to build queries for information retrieval, IEEE Symposium on Evolutionary Computation, (1994), pp. 468–473, Orlando (USA).
[37] KRAFT, D.H., BORDOGNA, G., PASI, G., Fuzzy set techniques in information retrieval, in J. Bezdek, D. Dubois and H. Prade (eds), Fuzzy Sets in Approximate Reasoning and Information Systems, 3(8), (1999), pp. 469–510, Kluwer Academic Publishers.
[38] KUHLTHAU, C.C., Inside the search process: Information seeking from the user’s perspective, Journal of the American Society for Information Science, 42(5), (1991), pp. 361–371.
[39] KWOK, K.L., A neural network for probabilistic information retrieval, Proceedings of the 12th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1989), pp. 202–210, Cambridge (USA).
[40] LAMPORT, L., LaTeX: A Document Preparation System, User’s Guide and Reference Manual, 2nd edition, Prentice Hall, 1994.
[41] LAYAIDA, R., BOUGHANEM, M., CARON, A., Constructing an information retrieval system with neural networks, Lecture Notes in Computer Science, 856, (1994), pp. 561–570.
[42] LEE, J.H., Properties of extended Boolean models in information retrieval, Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, (1994), pp. 182–190.
[43] LEWIS, D.D., Learning in intelligent information retrieval, Proceedings of the 8th International Workshop on Machine Learning, (1991), pp. 235–239, Morgan Kaufmann.
[44] LIN, X., SOERGEL, D., MARCHIONINI, G., A self-organizing semantic map for information retrieval, Proceedings of the 14th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1991), pp. 262–269, Chicago (IL).
[45] LUHN, H.P., A statistical approach to mechanized encoding and searching of library information, IBM Journal of Research and Development, 1, (1957), pp. 309–317.
[46] MACLEOD, K.J., ROBERTSON, W., A neural algorithm for document clustering, Information Processing & Management, 27(4), (1991), pp. 337–346.
[47] MCCUNE, B., TONG, R., DEAN, J.S., SHAPIRO, D., Rubric: a system for rule-based information retrieval, IEEE Transactions on Software Engineering, 11(9), (1985).
[48] MIYAMOTO, S., NAKAYAMA, K., Fuzzy information retrieval based on a fuzzy pseudo thesaurus, IEEE Transactions on Systems, Man and Cybernetics, 16(2), (1986), pp. 278–282.
[49] MIYAMOTO, S., TERUHISA, M., KAZUHIKO, N., Generation of a Pseudothesaurus for Information Retrieval based on co-occurrences and fuzzy set operations, IEEE Transactions on Systems, Man and Cybernetics, 13(1), (1983), pp. 62–69.
[50] MIZZARO, S., A cognitive analysis of information retrieval, Proceedings of CoLIS2, (1996), pp. 233–250, Copenhagen (Denmark).
[51] MIZZARO, S., How many relevancies in information retrieval?, Interacting with Computers, 10(3), (1998), pp. 305–322.
[52] BAEZA-YATES, R., RIBEIRO-NETO, B., Modern Information Retrieval, Addison Wesley, New York, 1999.
[53] OGAWA, Y., MORITA, T., KOBAYASHI, K., A fuzzy document retrieval system using the keyword connection matrix and a learning method, Fuzzy Sets and Systems, 39, (1991), pp. 163–179.
[54] PAIJMANS, H., Explorations in the document vector model of information retrieval, Dissertation, Tilburg University, 1999. http://pi0959.kub.nl:2080/Paai/Bibliogr/
[55] PIROLLI, P., PITKOW, J., RAO, R., Silk from Sow’s Ear: Extracting Usable Structures from the web, Proceedings of the ACM Conference on Human Factors in Computing Systems, (1996), pp. 118–125, New York (USA).
[56] QUINLAN, J.R., Learning efficient classification procedures and their application to chess and games, Machine Learning, an Artificial Intelligence Approach, (1983), pp. 463–482, Tioga Publishing Company, Palo Alto, CA.
[57] RIBEIRO-NETO, B.A., MUNTZ, R., A Belief network model for IR, Proceedings of the 19th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1996), pp. 253–260, Zurich (Switzerland).
[58] RIJSBERGEN, C.J., Information Retrieval, Butterworths, London, 1979.
[59] ROBERTSON, S.E., SPARCK JONES, K., Relevance weighting of search terms, Journal of the American Society for Information Science, 27(3), (1976), pp. 129–146.
[60] ROBINS, D., Interactive Information Retrieval: Context and Basic Notions, Information Science, 3(2), (2000), pp. 57–61.
[61] ROCCHIO, J.J., Relevance Feedback in Information Retrieval, Prentice Hall, 1971.
[62] RUMELHART, D.E., HINTON, G.E., WILLIAMS, R.J., Learning Internal Representations by Error Propagation, Parallel Distributed Processing, (1986), pp. 318–362, The MIT Press, Cambridge, MA.
[63] SACHS, W.M., An approach to associative retrieval through the theory of fuzzy sets, Journal of the American Society for Information Science, (1976), pp. 85–87.
[64] SALTON, G., The SMART Retrieval System – Experiments in Automatic Document Processing, Prentice Hall, New York, 1971.
[65] SALTON, G., Automatic text processing: The transformation, analysis, and retrieval of information by computer, Addison-Wesley, 1989.
[66] SALTON, G., BUCKLEY, C., Term weighting approaches in automatic retrieval, Information Processing and Management, 24(5), (1988), pp. 513–523.
[67] SARACEVIC, T., RELEVANCE: A Review of and a Framework for the thinking of the notion in information science, Journal of the American Society for Information Science, 26(6), (1975), pp. 321–343.
[68] SEBASTIANI, F., On the Role of Logic in Information Retrieval, Information Processing & Management, 34(1), (1998), pp. 1–18.
[69] SMITH, L.C., WARNER, A.J., A taxonomy of representation in information retrieval design, Journal of Information Science, 8, (1984), pp. 113–121.
[70] TAHANI, V.A., A fuzzy model of document retrieval systems, Information Processing and Management, 12, (1976), pp. 177–187.
Received: September, 2002
Revised: January, 2004
Accepted: May, 2004

Contact address:
Gerardo Canfora
Research Centre on Software Technology
Department of Engineering
University of Sannio
Palazzo ex Poste – Via Traiano
82100 Benevento
ITALY
e-mail: gerardo.canfora@unisannio.it

GERARDO CANFORA received the Laurea degree in electronic engineering from the University of Naples, Federico II, Italy, in 1989. He is currently a full professor of computer science at the Faculty of Engineering and the Director of the Research Centre on Software Technology (RCOST) of the University of Sannio in Benevento, Italy. From 1990 to 1991, he was with the Italian National Research Council (CNR). During 1992, he was at the Department of Informatica e Sistemistica of the University of Naples, Federico II, Italy. From 1992 to 1993, he was a visiting researcher at the Centre for Software Maintenance of the University of Durham, UK. In 1993, he joined the Faculty of Engineering of the University of Sannio in Benevento, Italy. He has served on the program committees of a number of international conferences. He was a program co-chair of the 1997 International Workshop on Program Comprehension and of the 2001 International Conference and the General Chair of the 2003 European Conference on Software Maintenance and Reengineering. His research interests include software maintenance, program comprehension, reverse engineering, workflow management, document and knowledge management, and information retrieval. He serves on the Editorial Board of the IEEE Transactions on Software Engineering. He is a member of the IEEE and the IEEE Computer Society.