Abstract

This work deals with the implementation of a logical model of Information Retrieval. Specifically, we present algorithms for document ranking within the Belief Revision framework. As a result, the logical model that underlies our proposal can be efficiently implemented within realistic systems. Besides the inherent advantages introduced by logic, the expressiveness is extended with respect to classical systems because documents are represented as unrestricted propositional formulas. As well as representing classical vectors, the model can deal with partial descriptions of documents. Scenarios that can benefit from these more expressive representations are discussed.

1 Introduction

The actual applicability of logical models of Information Retrieval (IR) to the construction of practical systems is controversial. In this work we implement document ranking within a logical framework, thereby helping to show that IR systems can be based on logical models. Moreover, such systems can benefit from the management of incomplete descriptions, which is an inherent characteristic of logic. Generality and formalization of well-known IR notions are additional advantages of the use of logic for IR.

Documents rarely fulfil queries in a complete way. Therefore, a realistic logical model for IR should not decide relevance using the classical entailment $d \models q$. Basically, classical entailment is too strong and cannot represent partial relevance [12]. Classical entailment represents the notion of logical consequence, i.e. $\alpha \models \beta$ holds iff $\beta$ is satisfied in all the interpretations satisfying $\alpha$. Van Rijsbergen [19] showed that any logical approach should quantify relevance taking into account the minimal change that must be made in order to establish the truth of the entailment. Several approaches have followed this line of research and recent compendia can be found in [3, 13]. Among several alternatives, we focus here on the framework proposed in [16], where Belief Revision (BR) techniques were used for quantifying relevance between documents and queries. Unfortunately, a direct implementation of the model proposed in [16] would require exponential time to decide relevance. In this work we therefore present an implementation of document ranking within the BR model whose computational complexity is reduced with respect to the direct application of [16]. This way, the actual applicability of the theoretical framework described in [16] is ensured.

Besides the classical representation of documents, i.e. sets of terms, more expressive representations are introduced. Consequently, one can build more expressive IR systems having efficient procedures for computing similarity. Those partial representations are very helpful to include retrieval situations in the model [15]. Furthermore, representing documents with general propositional formulas can improve precision, especially in specific domains. From the algorithms presented in this paper, one can also extract an analysis of the tradeoff between the expressiveness of documents and queries and the efficiency of the system.

The rest of the paper is organized as follows. Section 2 presents how the BR framework was used in [16] to obtain a similarity measure. Section 3 shows the algorithm developed for the case when documents and queries represent sets of terms and the algorithm that can deal with more expressive documents and queries. Section 4 depicts the performance results of some experiments we have done with the algorithms. Some points of discussion and future lines of work are presented in section 5. The paper ends with some conclusions.

2 A similarity measure using Belief Revision

Belief Revision addresses the problem of accommodating a new piece of information into a knowledge base. A key issue within the BR theory is to keep consistency in the knowledge base when contradicting new information arrives. A BR operator has to establish a method for selecting the information that must remain after the arrival of contradicting information. The Principle of Minimal Change
states that as much old information as possible should be preserved. This principle underlies any reasonable revision operator. Model-based approaches to BR establish an order among logical interpretations. Next we briefly sketch the similarity measure based on BR described in [16].

First, we present some preliminaries. In this paper we focus on propositional languages. The propositional alphabet is denoted by $\mathcal{P}$. Interpretations are functions from the propositional alphabet, $\mathcal{P}$, to the set $\{true, false\}$. A model of a formula is an interpretation that makes the formula true and $Mod(\psi)$ denotes the set containing all the models of the formula $\psi$. The symmetric difference between two sets $A$ and $B$, $A \triangle B$, is defined as $A \triangle B = (A \cup B) \setminus (A \cap B)$, where $\setminus$ is the regular set difference.

Dalal's revision operator [4] defines the difference between two interpretations as the set of propositional letters on which they differ.

$$Diff(I, J) = \{p \in \mathcal{P} \mid I \models p \text{ iff } J \models \neg p\}$$

Then, a measure of distance between interpretations can be obtained from the number of differing propositional letters.

$$Dist(I, J) = |Diff(I, J)|$$

In the following we represent an interpretation by the set of propositional letters that it maps into true. Therefore, the difference between two interpretations can be computed using the symmetric difference between their respective sets.

The distance between the set of models of a formula $\psi$ and a given interpretation $I$ is defined as the distance from $I$ to its closest interpretation in $Mod(\psi)$:

$$Dist(Mod(\psi), I) = \min_{M \in Mod(\psi)} Dist(M, I)$$

Given a formula $\psi$, an order between interpretations can be derived from the closeness of each interpretation to the set of models of the formula $\psi$. That is, for any formula $\psi$, Dalal's total pre-order $\leq_\psi$ is defined as:

$$I \leq_\psi J \text{ iff } Dist(Mod(\psi), I) \leq Dist(Mod(\psi), J)$$

Given a theory $\psi$ to be revised with a new piece of information $\mu$, $\psi \circ_D \mu$ denotes the theory revised by Dalal's operator. The models of the revised theory are the models of the new information that are the closest to the theory:

$$Mod(\psi \circ_D \mu) = Min(Mod(\mu), \leq_\psi)$$

This framework was applied to IR as follows. Each propositional letter of the alphabet $\mathcal{P}$ represents one index term. Queries are modeled as propositional formulas that in the BR framework play the role of theories. Documents are propositional formulas and play the role of new information. Thus, following Dalal's development, a notion of closeness to the query can be obtained within the revision $q \circ_D d$. If a document is represented by a formula that has only one model, each document can be identified by its only model, and the measure of distance from the model of the document to the set of models of a query can be regarded as a measure of distance from the document to the query itself.

With partial document representations, documents have several models. The measure of distance from the document to the query is the average of the distances from each model of the document to the set of models of the query.

$$distance(d, q) = \frac{\sum_{m \in Mod(d)} Dist(Mod(q), m)}{|Mod(d)|}$$

Note that the situation described before, when a document is identified by its only model, is a particular case of the previous formula. A similarity measure, BRsim, can be directly defined from $distance$ by normalization. Some examples of the computation of this measure can be found in section 3.

In [16] a restricted document representation was proposed in order to get some equivalences with classical measures. Specifically, a classical vector with binary weights was modeled as a conjunction of propositional letters where each letter of the alphabet appears, either positive or negative. In this case document representations have only one model and BRsim is equivalent to the inner product query-document similarity measure. Partial representations of documents have more than one model. It is in this general case that the logical model implies a significant improvement with respect to classical models, because a greater expressiveness of the representations is achieved.

3 Implementing Document Ranking

The translation of model-based BR approaches into efficient algorithms has been a problem of great concern in the BR community. In general, it has proved hard to develop efficient algorithms for solving large problems. In [8] a lower bound for knowledge base revision was identified: base revision is at least as hard as deciding propositional satisfiability, which is a well known NP-Complete problem. In fact, Dalal showed [4] that his revision is an NP-Complete problem and provided a method for computing it. However, some studies have demonstrated that restricted problems can be solved within limited bounds [2]. Liberatore and Schaerf [14] identified reductions from circumscription into BR and vice versa. However, to rank documents we need the distances used within the BR process, whereas the reduction to a circumscription problem directly produces the final formula of the revised theory.

A direct translation of Dalal's revision into an algorithm requires a table of symmetric differences between all the models of the theory and all the models of the new information. This computation takes exponential time. Thus, even when the document has only one model, all the models of the query have to be computed. In contrast, the algorithms presented in this work do not compute all the models of the theory and the new information. The basic assumption is that both query and document have to be expressed in disjunctive normal form (DNF). A DNF formula
has the form $\psi_1 \vee \psi_2 \vee \ldots$ where each $\psi_j$ is a conjunction $l_1 \wedge l_2 \wedge \ldots$, where each $l_j$ is a literal, i.e. a propositional letter or its negation. A DNF formula can be represented as a set of clauses $\psi = \{\psi_1, \psi_2, \ldots\}$. Each clause is a set of literals representing their conjunction, i.e. a clause represents a $\psi_j$. The whole set represents the disjunction of all the clauses. The important point is that a conjunction of literals can be thought of as a partial model, representing the set of models resulting from fixing the truth value of the atoms appearing in the conjunction and combining the truth values of the atoms not appearing in the conjunction.

Instead of a measure of distance between interpretations, a measure of distance between clauses, CDist, is defined. The difference between two clauses, $\psi_i$ and $\mu_j$, is the set of literals in $\psi_i$ whose negation is in $\mu_j$:

$$CDiff(\psi_i, \mu_j) = \{l \in \psi_i \mid \neg l \in \mu_j\}$$

The distance between two clauses is given by the cardinality of their difference:

$$CDist(\psi_i, \mu_j) = |CDiff(\psi_i, \mu_j)|$$

3.1 Algorithm for the simple case

The simple case arises when both document and query represent sets of terms. In classical IR this case corresponds to a representation as a vector for both elements. On the logical side, this corresponds to the fact that representations are conjunctions of propositional letters. In this case, query and document are directly in DNF form and can both be represented as a set with one clause, i.e. $\psi = \{\psi_1\}$ and $\mu = \{\mu_1\}$.

Algorithm 1:
Procedure Similarity($\psi$, $\mu$)
Input: query $\psi = \{\psi_1\}$
       document $\mu = \{\mu_1\}$
Output: BRsim($\psi$, $\mu$)

1. Compute $CDist(\psi_1, \mu_1)$
2. $distance = CDist(\psi_1, \mu_1) + \frac{|\psi_1 \setminus (\psi_1 \cap \mu_1)| - CDist(\psi_1, \mu_1)}{2}$
3. Return $\left(1 - \frac{distance}{|\psi_1|}\right)$

The value of $CDist(\psi_1, \mu_1)$ represents the number of literals in $\psi_1$ (query terms) that appear in $\mu_1$ (document terms) with the opposite value. This means that any pair of models of $\psi$ and $\mu$ will differ on the interpretation of these terms. Hence, every model of $\mu$ lies at a distance of at least $CDist(\psi_1, \mu_1)$ from any model of $\psi$. That is the reason why $CDist(\psi_1, \mu_1)$ is directly added to $distance$. The set $\psi_1 \setminus (\psi_1 \cap \mu_1)$ contains the literals in $\psi_1$ that do not belong to $\mu_1$. Therefore, the value $|\psi_1 \setminus (\psi_1 \cap \mu_1)| - CDist(\psi_1, \mu_1)$ is the number of literals in $\psi_1$ whose letter does not appear in $\mu_1$, either positive or negative. As these letters do not appear in the representation of $\mu$, half of the models of $\mu$ will map the letter into true and the other half will map it into false. On the other hand, as a consequence of the presence of the literal in $\psi_1$, all the models of $\psi$ have to map that letter into the same truth value. Therefore, whatever this fixed truth value is, half of the models of $\mu$ will have the opposite one. This produces an increment of $\frac{|\psi_1 \setminus (\psi_1 \cap \mu_1)| - CDist(\psi_1, \mu_1)}{2}$ in the distance. Finally, the distance is transformed into a similarity value in the interval [0,1]. This normalization uses the fact that the greatest value of $distance$ is $|\psi_1|$.

The computation of $CDist(\psi_1, \mu_1)$ in step 1 can be done by traversing the literals in $\psi_1$ and checking whether the opposite literal belongs to $\mu_1$. It can also be done with the reciprocal process, that is, traversing $\mu_1$ and checking in $\psi_1$. Each check can be done in unit time because an array can be used to store which literals belong to a clause. Then, step 1 can be done in linear time w.r.t. the size of $\psi_1$ or $\mu_1$. For similar reasons, the computation of $|\psi_1 \setminus (\psi_1 \cap \mu_1)|$ in step 2 can also be accomplished in linear time with respect to the size of either clause. Consequently, this algorithm can be run in linear time w.r.t. the size of either $\psi_1$ or $\mu_1$. As $\psi_1$ represents the query and $\mu_1$ the document, $\psi_1$ is expected to have fewer literals than $\mu_1$ and the most efficient implementation of the algorithm can be done with complexity $O(|\psi_1|)$.

Let us analyze the use of the previous algorithm for IR. Classical systems consider representations of documents that have information about the presence or absence of all the index terms. On the logical side, this case corresponds to the fact that documents are total theories, i.e. $\mu_1$ contains all the index terms either positive or negative. As a consequence, all the query terms appear in $\mu_1$ and $distance = CDist(\psi_1, \mu_1)$. As $CDist(\psi_1, \mu_1)$ counts the number of differing terms between the query and the document, the result of the algorithm (after the normalization) is equivalent to the inner product query-document matching function. On the other hand, when a document is a partial theory its representation does not store information about all the index terms. This is not a regular assumption in classical systems. Let us think about a query that mentions one of those index terms that do not appear in the document representation. In this case, the model does not assume that the document is (or is not) really about that index term. Instead, it considers a distance value of 0.5 for each query term not present in the document representation. This behavior is captured in the formula by $\frac{|\psi_1 \setminus (\psi_1 \cap \mu_1)| - CDist(\psi_1, \mu_1)}{2}$.

It is important to note that algorithm 1 ensures an efficient implementation of the model proposed in [16]. Algorithm 1 deals with the constrained representations proposed in that work and it computes similarity in linear time w.r.t. the size of the query.
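The three steps of Algorithm 1 can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the string encoding of literals (a leading `~` marks a negated letter) and the function names `negate`, `cdist` and `br_sim_simple` are assumptions made for this example. Clauses are stored as `frozenset`s, so each membership check takes constant expected time, in the spirit of the array-based check described above.

```python
def negate(literal: str) -> str:
    """Return the opposite literal; '~a' is our (assumed) encoding of 'not a'."""
    return literal[1:] if literal.startswith("~") else "~" + literal


def cdist(query_clause: frozenset, doc_clause: frozenset) -> int:
    """CDist: number of query literals whose negation occurs in the document clause."""
    return sum(1 for lit in query_clause if negate(lit) in doc_clause)


def br_sim_simple(query_clause: frozenset, doc_clause: frozenset) -> float:
    """Algorithm 1 for a one-clause query and a one-clause document."""
    if not query_clause:
        raise ValueError("the query clause must contain at least one literal")
    conflicts = cdist(query_clause, doc_clause)           # step 1
    not_shared = len(query_clause - doc_clause)           # |psi_1 \ (psi_1 & mu_1)|
    distance = conflicts + (not_shared - conflicts) / 2   # step 2
    return 1 - distance / len(query_clause)               # step 3


# Query q = a AND c; documents d1 = (NOT a) AND b, d2 = a AND (NOT b) AND c.
q = frozenset({"a", "c"})
d1 = frozenset({"~a", "b"})
d2 = frozenset({"a", "~b", "c"})
print(br_sim_simple(q, d1))  # 0.25
print(br_sim_simple(q, d2))  # 1.0
```

Only the query clause is traversed, which matches the $O(|\psi_1|)$ bound discussed above under the assumption of constant-time set lookups.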
Example 1: In this example we show the computation of BRsim using tables of symmetric differences between interpretations and the computation of BRsim using Algorithm 1. Let the propositional alphabet $\mathcal{P}$, the documents $d_1$ and $d_2$ and the query $q$ be defined as:

$\mathcal{P} = \{a, b, c, d, e\}$
$q = a \wedge c$
$d_1 = \neg a \wedge b$
$d_2 = a \wedge \neg b \wedge c$

The computation of similarity from symmetric differences between interpretations is shown in fig. 1. The following lines depict the computation of the similarity using Algorithm 1.

Document $d_1$
Input:
Query (DNF): $\psi = \{\psi_1\}$, $\psi_1 = \{a, c\}$
Document (DNF): $\mu = \{\mu_1\}$, $\mu_1 = \{\neg a, b\}$
1. $CDist(\psi_1, \mu_1) = |\{l \in \psi_1 \mid \neg l \in \mu_1\}| = |\{a\}| = 1$
2. $distance = 1 + \frac{|\{a, c\} \setminus \emptyset| - 1}{2} = 1.5$
3. $1 - \frac{distance}{|\psi_1|} = 1 - \frac{1.5}{2} = 0.25$. Return(0.25).

Document $d_2$
Input:
Query (DNF): $\psi = \{\psi_1\}$, $\psi_1 = \{a, c\}$
Document (DNF): $\mu = \{\mu_1\}$, $\mu_1 = \{a, \neg b, c\}$
1. $CDist(\psi_1, \mu_1) = |\{l \in \psi_1 \mid \neg l \in \mu_1\}| = |\emptyset| = 0$
2. $distance = 0 + \frac{|\{a, c\} \setminus \{a, c\}| - 0}{2} = 0$
3. $1 - \frac{distance}{|\psi_1|} = 1 - \frac{0}{2} = 1$. Return(1).

Note that using tables of symmetric differences we need to compute many models and the distances between them. On the other hand, algorithm 1 only takes a few steps and it does not need any model.

[...] with any combination of AND's and OR's. The fact that query representations are DNF formulas does not imply that users have to articulate their information needs in this form; rather, the system translates user information needs into DNF form. Any propositional formula can be translated into its DNF equivalent.

In this section we develop an algorithm which computes the similarity measure between a document and a query, both in DNF. It is important to note that now the measure is not equivalent to a classical one because document representations are different from those used by classical models.

The algorithm traverses the set of models of the document $\mu$, and for each model computes its distance to the query $\psi$. These distances are accumulated and, finally, the total number of models of $\mu$ is used to get the average of the distances to the query. The advantage lies in the fact that no models of the query are needed. The distance from each model of the document to the query is computed by transforming the model into a set of literals and comparing it with each conjunction of the query. Reflecting Dalal's semantics, the least distance is selected to be the distance from the model to the query.

Before developing the algorithm some preliminaries are shown. The size of the propositional alphabet will be denoted by $S$, $S = |\mathcal{P}|$. As stated before, an interpretation is denoted by the set of letters mapped into true. Then, given an interpretation $m$, $LIT(m)$ represents the transformation of $m$ into a set of literals, i.e. $LIT(m) = m \cup \{\neg l \mid l \in \mathcal{P} \setminus m\}$. The symbols $min_\psi$ and $max_\psi$ are the sizes of the smallest and the biggest clause in $\psi$, respectively. The symbol $max_\mu$ represents the maximum size of a clause in $\mu$.

Algorithm 2:
Procedure Similarity($\psi$, $\mu$)
Input: query $\psi = \{\psi_1, \psi_2, \ldots\}$
Document d1
Symmetric differences between query and document models (rows: query models; columns: models of d1):

Query model | {b}         | {b,c}     | {b,d}       | {b,e}       | {b,c,d} | {b,c,e}   | {b,d,e}     | {b,c,d,e}
{a,c}       | {a,b,c}     | {a,b}     | {a,b,c,d}   | {a,b,c,e}   | {a,b,d} | {a,b,e}   | {a,b,c,d,e} | {a,b,d,e}
{a,b,c}     | {a,c}       | {a}       | {a,c,d}     | {a,c,e}     | {a,d}   | {a,e}     | {a,c,d,e}   | {a,d,e}
{a,c,d}     | {a,b,c,d}   | {a,b,d}   | {a,b,c}     | {a,b,c,d,e} | {a,b}   | {a,b,d,e} | {a,b,c,e}   | {a,b,e}
{a,c,e}     | {a,b,c,e}   | {a,b,e}   | {a,b,c,d,e} | {a,b,c}     | {a,b,d,e} | {a,b}   | {a,b,c,d}   | {a,b,d}
{a,b,c,d}   | {a,c,d}     | {a,d}     | {a,c}       | {a,c,d,e}   | {a}     | {a,d,e}   | {a,c,e}     | {a,e}
{a,b,c,e}   | {a,c,e}     | {a,e}     | {a,c,d,e}   | {a,c}       | {a,d,e} | {a}       | {a,c,d}     | {a,d}
{a,c,d,e}   | {a,b,c,d,e} | {a,b,d,e} | {a,b,c,e}   | {a,b,c,d}   | {a,b,e} | {a,b,d}   | {a,b,c}     | {a,b}
{a,b,c,d,e} | {a,c,d,e}   | {a,d,e}   | {a,c,e}     | {a,c,d}     | {a,e}   | {a,d}     | {a,c}       | {a}

Cardinalities and computation of the distance:

Query model | {b} | {b,c} | {b,d} | {b,e} | {b,c,d} | {b,c,e} | {b,d,e} | {b,c,d,e}
{a,c}       |  3  |   2   |   4   |   4   |    3    |    3    |    5    |     4
{a,b,c}     |  2  |   1   |   3   |   3   |    2    |    2    |    4    |     3
{a,c,d}     |  4  |   3   |   3   |   5   |    2    |    4    |    4    |     3
{a,c,e}     |  4  |   3   |   5   |   3   |    4    |    2    |    4    |     3
{a,b,c,d}   |  3  |   2   |   2   |   4   |    1    |    3    |    3    |     2
{a,b,c,e}   |  3  |   2   |   4   |   2   |    3    |    1    |    3    |     2
{a,c,d,e}   |  5  |   4   |   4   |   4   |    3    |    3    |    3    |     2
{a,b,c,d,e} |  4  |   3   |   3   |   3   |    2    |    2    |    2    |     1
Column min  |  2  |   1   |   2   |   2   |    1    |    1    |    2    |     1

The last row gives $Dist(Mod(q), m_i) = \min_{m \in Mod(q)} Dist(m, m_i)$ for each model $m_i$ of $d_1$.

$$distance(d_1, q) = \frac{\sum_{m \in Mod(d_1)} Dist(Mod(q), m)}{|Mod(d_1)|} = \frac{12}{8} = 1.5$$

Finally, BRsim($d_1$, $q$) is computed from $distance(d_1, q)$ using $k$, the number of literals appearing in the query: BRsim($d_1$, $q$) $= 1 - \frac{distance(d_1, q)}{k} = 1 - \frac{1.5}{2} = 0.25$
Document d2
Symmetric differences between query and document models (rows: query models; columns: models of d2):

Query model | {a,c}   | {a,c,d} | {a,c,e} | {a,c,d,e}
{a,c}       | ∅       | {d}     | {e}     | {d,e}
{a,b,c}     | {b}     | {b,d}   | {b,e}   | {b,d,e}
{a,c,d}     | {d}     | ∅       | {d,e}   | {e}
{a,c,e}     | {e}     | {d,e}   | ∅       | {d}
{a,b,c,d}   | {b,d}   | {b}     | {b,d,e} | {b,e}
{a,b,c,e}   | {b,e}   | {b,d,e} | {b}     | {b,d}
{a,c,d,e}   | {d,e}   | {e}     | {d}     | ∅
{a,b,c,d,e} | {b,d,e} | {b,e}   | {b,d}   | {b}

Cardinalities and computation of the distance:

Query model | {a,c} | {a,c,d} | {a,c,e} | {a,c,d,e}
{a,c}       |   0   |    1    |    1    |     2
{a,b,c}     |   1   |    2    |    2    |     3
{a,c,d}     |   1   |    0    |    2    |     1
{a,c,e}     |   1   |    2    |    0    |     1
{a,b,c,d}   |   2   |    1    |    3    |     2
{a,b,c,e}   |   2   |    3    |    1    |     2
{a,c,d,e}   |   2   |    1    |    1    |     0
{a,b,c,d,e} |   3   |    2    |    2    |     1
Column min  |   0   |    0    |    0    |     0

The last row gives $Dist(Mod(q), m_i) = \min_{m \in Mod(q)} Dist(m, m_i)$ for each model $m_i$ of $d_2$.

$$distance(d_2, q) = \frac{\sum_{m \in Mod(d_2)} Dist(Mod(q), m)}{|Mod(d_2)|} = 0$$

Finally, BRsim($d_2$, $q$) is computed from $distance(d_2, q)$ using $k$, the number of literals appearing in the query: BRsim($d_2$, $q$) $= 1 - \frac{0}{2} = 1$
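The table-based computation for Example 1 can be reproduced with a short brute-force sketch in Python. It follows the model-based definitions of Section 2 directly (enumerate the models, take symmetric differences, keep the minimum per document model, average), so it takes time exponential in the alphabet size and is meant only as a check of the figures. The helper names `models` and `dalal_distance`, and the encoding of a conjunction as a pair of "true letters" and "false letters", are assumptions made for this illustration.

```python
from itertools import combinations


def models(alphabet, true_letters=(), false_letters=()):
    """Enumerate the models of a conjunction of literals as frozensets of true letters."""
    fixed = set(true_letters) | set(false_letters)
    free = sorted(set(alphabet) - fixed)      # letters the conjunction leaves open
    base = frozenset(true_letters)
    for size in range(len(free) + 1):
        for extra in combinations(free, size):
            yield base | frozenset(extra)


def dalal_distance(query_models, doc_models):
    """Average, over document models, of the distance to the closest query model."""
    query_models = list(query_models)
    doc_models = list(doc_models)
    # len(m ^ qm) is the cardinality of the symmetric difference, i.e. Dist(m, qm).
    total = sum(min(len(m ^ qm) for qm in query_models) for m in doc_models)
    return total / len(doc_models)


P = "abcde"
q_models = list(models(P, true_letters="ac"))                      # q  = a AND c
d1_models = list(models(P, true_letters="b", false_letters="a"))   # d1 = NOT a AND b
d2_models = list(models(P, true_letters="ac", false_letters="b"))  # d2 = a AND NOT b AND c

k = 2  # number of literals appearing in the query
dist_d1 = dalal_distance(q_models, d1_models)
dist_d2 = dalal_distance(q_models, d2_models)
print(dist_d1, 1 - dist_d1 / k)  # 1.5 0.25
print(dist_d2, 1 - dist_d2 / k)  # 0.0 1.0
```

The 8 query models and the 8 (respectively 4) document models enumerated here are the row and column headers of the two tables above.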
Document d1
Symmetric differences between query and document models (rows: query models; columns: models of d1):

Query model | {a,b,c} | {a,b,c,d} | {a,b,d}
{a,c}       | {b}     | {b,d}     | {b,c,d}
{a,b,c}     | ∅       | {d}       | {c,d}
{a,c,d}     | {b,d}   | {b}       | {b,c}
{a,b,c,d}   | {d}     | ∅         | {c}
{a,d}       | {b,c,d} | {b,c}     | {b}
{a,b,d}     | {c,d}   | {c}       | ∅

Cardinalities and computation of the distance:

Query model | {a,b,c} | {a,b,c,d} | {a,b,d}
{a,c}       |    1    |     2     |    3
{a,b,c}     |    0    |     1     |    2
{a,c,d}     |    2    |     1     |    2
{a,b,c,d}   |    1    |     0     |    1
{a,d}       |    3    |     2     |    1
{a,b,d}     |    2    |     1     |    0
Column min  |    0    |     0     |    0

The last row gives $Dist(Mod(q), m_i) = \min_{m \in Mod(q)} Dist(m, m_i)$ for each model $m_i$ of $d_1$.

$$distance(d_1, q) = \frac{\sum_{m \in Mod(d_1)} Dist(Mod(q), m)}{|Mod(d_1)|} = 0$$

Finally, BRsim($d_1$, $q$) is computed from $distance(d_1, q)$ using $min_\psi$, the size of the smallest conjunction in $\psi$: BRsim($d_1$, $q$) $= 1 - \frac{0}{2} = 1$
Document d2
Symmetric differences between query and document models (rows: query models; columns: models of d2):

Query model | {b,c}     | {a,b,c} | {b,c,d} | {a,b,c,d} | {a,b}   | {a,b,d}
{a,c}       | {a,b}     | {b}     | {a,b,d} | {b,d}     | {b,c}   | {b,c,d}
{a,b,c}     | {a}       | ∅       | {a,d}   | {d}       | {c}     | {c,d}
{a,c,d}     | {a,b,d}   | {b,d}   | {a,b}   | {b}       | {b,c,d} | {b,c}
{a,b,c,d}   | {a,d}     | {d}     | {a}     | ∅         | {c,d}   | {c}
{a,d}       | {a,b,c,d} | {b,c,d} | {a,b,c} | {b,c}     | {b,d}   | {b}
{a,b,d}     | {a,c,d}   | {c,d}   | {a,c}   | {c}       | {d}     | ∅

Cardinalities and computation of the distance:

Query model | {b,c} | {a,b,c} | {b,c,d} | {a,b,c,d} | {a,b} | {a,b,d}
{a,c}       |   2   |    1    |    3    |     2     |   2   |    3
{a,b,c}     |   1   |    0    |    2    |     1     |   1   |    2
{a,c,d}     |   3   |    2    |    2    |     1     |   3   |    2
{a,b,c,d}   |   2   |    1    |    1    |     0     |   2   |    1
{a,d}       |   4   |    3    |    3    |     2     |   2   |    1
{a,b,d}     |   3   |    2    |    2    |     1     |   1   |    0
Column min  |   1   |    0    |    1    |     0     |   1   |    0

The last row gives $Dist(Mod(q), m_i) = \min_{m \in Mod(q)} Dist(m, m_i)$ for each model $m_i$ of $d_2$.

$$distance(d_2, q) = \frac{\sum_{m \in Mod(d_2)} Dist(Mod(q), m)}{|Mod(d_2)|} = \frac{3}{6} = 0.5$$

Finally, BRsim($d_2$, $q$) $= 1 - \frac{distance(d_2, q)}{min_\psi} = 1 - \frac{0.5}{2} = 0.75$
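The values in these two tables can also be obtained without enumerating any model of the query, by comparing $LIT(m)$ for each document model $m$ with every clause of the DNF query and keeping the least number of contradicted literals. The Python sketch below illustrates that idea only; it is not the paper's Algorithm 2, since the document models are simply listed by hand as read off the tables. The alphabet $\{a, b, c, d\}$ is likewise inferred from the tables, and the function names and the `~` prefix for negation are assumptions made for this example.

```python
def negate(literal):
    """Return the opposite literal; '~a' is our (assumed) encoding of 'not a'."""
    return literal[1:] if literal.startswith("~") else "~" + literal


def lit(model, alphabet):
    """LIT(m): letters true in m, plus '~l' for every other letter of the alphabet."""
    return frozenset(model) | {"~" + l for l in set(alphabet) - set(model)}


def model_to_query_distance(model, query_dnf, alphabet):
    """Least number of literals of a query clause that the model contradicts."""
    literals = lit(model, alphabet)
    return min(sum(1 for l in clause if negate(l) in literals)
               for clause in query_dnf)


def br_sim_from_models(doc_models, query_dnf, alphabet):
    """Average the per-model distances and normalise by the smallest query clause."""
    total = sum(model_to_query_distance(m, query_dnf, alphabet) for m in doc_models)
    distance = total / len(doc_models)
    smallest = min(len(clause) for clause in query_dnf)
    return 1 - distance / smallest


alphabet = "abcd"                                       # inferred from the tables
query = [frozenset({"a", "c"}), frozenset({"a", "d"})]  # (a AND c) OR (a AND d)
d1_models = ["abc", "abcd", "abd"]                      # column headers, first table
d2_models = ["bc", "abc", "bcd", "abcd", "ab", "abd"]   # column headers, second table

print(br_sim_from_models(d1_models, query, alphabet))   # 1.0
print(br_sim_from_models(d2_models, query, alphabet))   # 0.75
```

For a total model, the number of literals of a clause that it contradicts equals its Dalal distance to the closest model of that clause, which is why the per-model minima agree with the last row of each cardinality table.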
1. Document d1
Input:
Query in DNF form: $\psi = \{\psi_1, \psi_2\}$, $\psi_1 = \{a, c\}$, $\psi_2 = \{a, d\}$
2. Document d2
Input:
Query in DNF form: $\psi = \{\psi_1, \psi_2\}$, $\psi_1 = \{a, c\}$, $\psi_2 = \{a, d\}$