Index Terms: Quantum Logic; Textual Wrapper; Local Appropriator; Natural Language Interface; Information Retrieval
1 INTRODUCTION
The information age has brought complex challenges related to the management of information and its sources. Information is useless unless it can be found and used. Searching for information takes place almost every minute, and this act, which can be defined both formally and informally, has made searching a basic human activity. The process is characterized by the fact that the user requests documents via a query in order to satisfy an information need (IN). The problem can therefore be broadly classified into two parts: (1) the appropriate presentation of the user's information need and (2) the relevance of the retrieved documents. The first problem is largely due to the user's knowledge of the domain of discussion. This context forms the basis of the Information Retrieval (IR) problem: how do we distinguish what we want from the sea of what we do not want, especially the bad, some of which may be irrelevant, unreliable, inaccurate, outdated or misleading? The IR problem becomes more acute as knowledge grows, as the number of media and platforms increases, as media become more integrated, as platforms become more interoperable, and as we face information overload.

Representations of text documents have traditionally been based on the occurrence of terms in documents. Almost all the well-established models are built on term occurrence. These models include the Boolean Model (BM), the Binary Independence Model (BIM), the Vector Space Model (VSM), the Language Model (LM) and, lately, the Fuzzy Retrieval Model (FRM). They are discussed in detail in [1][2][3]. Statistical analysis, which is based on the occurrence of terms in documents, has been shown to be the most successful recent method of text indexing [4]. The approach in this work departs sharply from these earlier approaches. Here, documents are considered useful or relevant based on the text they contain. These texts are human-usable texts, which are the basis of language presentation, and it is therefore natural to treat them as natural language text. Objects, whether textual or multimedia, will be considered as states of
A.O. Enikuomehin is with the Computer Department of Lagos State University, Nigeria. M.O. Rahman is with the Department of Computer Science, Lagos State University, Nigeria. J.S. Sadiku is with the Department of Computer Science, University of Ilorin, Nigeria.
physical systems, and their features (such as terms) can be viewed as physical observables to be measured in such systems. The relation of Quantum Mechanics to Information Retrieval in this direction was proposed earlier by Van Rijsbergen [5]; however, since a holistic implementation was not achieved, the proposal did not extend to implementation in Natural Language systems. We can employ the techniques of Quantum Mechanics as prescribed in [6] to represent information and documents as measurements of observables in a physical system. The foundations of quantum theory currently rest on abstract mathematical formulations known as Hilbert spaces and C*-algebras. These abstractions work well for calculating the probability of a particular outcome in an experiment, but they lack the intuitive physical meaning that physicists crave: the elegance of Einstein's theory of special relativity, for instance, which says that the speed of light is constant and that the laws of physics do not change from one reference frame to the next [7]. The purification principle was subsequently developed: a system with uncertain properties (a mixed state) is always part of a larger pure state that can, in principle, be completely known. Consider the pion. This particle, which has a spin of zero, can decay into two spinning photons. Each single photon is in a mixed state: it has an equal chance of spinning up or down. The pair of photons together, though, comprises a pure state in which they must always spin in opposite directions. The authors of [8] conclude that there is only one way in which a theory can satisfy the purification postulate: it must contain entangled states. The other option, that the theory contains no mixed states (that is, that the probabilities of outcomes in any measurement are either 0 or 1, as in classical deterministic theory), cannot hold, since one can always prepare mixed states by mixing deterministic ones.
The purification postulate alone allows some of the key features of quantum information processing to be derived, such as the no-cloning theorem or teleportation [9]. These presentations were collectively analyzed, and scientists presented some additional axioms to support their application in information retrieval. This is not complex: in document retrieval, the essence of reference is determined by terms, which is closely analogous to the process of extracting the features of a physical system. The extraction is centrally controlled by a form of object relation. This relation will be referred to as the textual relation, to signify that it is performed on the text terms of retrieved documents. The textual relation is related to the term frequency of documents, which again makes it natural to represent it as a textual relation.

The main idea in this work is to present a formulation for representing documents such that the resulting retrieval process can be implemented on a natural language interface. Such a development will assist in building an interface that can perform information retrieval on Natural Language text. Quantum mechanics is an important part of physics, and an underlying logic has been designed for it in theory. This refers to the widely used quantum theory [10]. Quantum theory offers impressive advantages over classical theory in the estimation of physical parameters, as shown in several works supported by core physics principles [11], [12], [13], [14], [15]. The prototypical example is the estimation of an unknown phase shift, where the variance vanishes as $N^{-2}$ with the number $N$ of accesses to the phase-shifting process, whereas classical statistics over independent copies would give the scaling $N^{-1}$.

The user with an information need has been shown to be in an anomalous state of knowledge [16]. In this view, the state of the user changes as more information is obtained. This is closely related to physical phenomena such as wave function collapse, in which a wave function initially in a superposition of several different possible eigenstates appears to reduce to a single one of those states after interaction with an observer; that is, the physical possibilities reduce to a single possibility as seen by an observer. The state is described as an element of a projective Hilbert space, expressed in Dirac (bra-ket) notation as a vector:
$$|\psi\rangle = \sum_i c_i\,|\phi_i\rangle \qquad (1)$$
The kets
of processes in IR is the degree of relevance of results of each model. In this work, a matching algorithm is also developed as a methodology for establishing the relevance level of components obtained from documents as regards the content of the query
can be incorporated into the system. The quantum consideration begins with a review of [18], which establishes that for any quantum measurement to be usable, it must obey the following three properties: (i) idempotency, (ii) ordered structure and (iii) the possibility of non-commutativity. In the LA, each of these properties is formulated under these categories and can be discussed further as follows. (i) Idempotency: multiple applications yield the same result as a single application. That is, applying our text wrapping operation for a given query any number of times returns the same result as applying it once. This can also be explained with an example: Let D be a document; D will either exist as
of repository, we thus extend to any type of database where $N$ represents the total number of documents and $n_i$ is the total number of documents containing the index term $t_i$
[20] we have
propriately. Then, let D be the natural language text D = "for us or not for us, very simple". Applying TW(is, 2) to D appropriates it to D1 = "us, very simple". If TW(is, 2) is then performed again, no further reduction takes place, because all the terms are already contained within the defined space. (ii) Ordered structure is formulated from the type of relation the terms possess. Given a natural language text as above, the same result is achieved on application. In the following algorithm, the probability of the token is not considered, as the first consideration is the determination of relevance by appropriately expressing the terms that were not reduced in the document. A model for this formulation is presented. Meanwhile, it must be noted that the process of generating relevance includes a methodology for relaxing the user's query. An appropriator index performs the relaxations. For a universe of documents, this can be applied as a transformation of the proposal of Roelleke [19]:
) as it will
also leave the terms wrapped by LAi(i). This can mathematically be expressed as
(7)
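The idempotency and non-commutativity properties discussed in this section can be illustrated with a minimal sketch. It assumes one possible reading of the appropriator: LA(t, w) keeps every token within w positions of an occurrence of the central term t and blanks the rest. The function name `local_appropriator`, the use of `None` for unwrapped positions and the sample sentence are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Optional

Token = Optional[str]  # None marks a position whose term is no longer known


def local_appropriator(tokens: List[Token], term: str, width: int) -> List[Token]:
    """Hypothetical LA(term, width): keep tokens within `width` positions of
    any surviving occurrence of `term`; blank every other position."""
    centres = [i for i, tok in enumerate(tokens) if tok == term]
    return [
        tok if any(abs(i - c) <= width for c in centres) else None
        for i, tok in enumerate(tokens)
    ]


doc = "students use ICT for research".split()

# Idempotency: a second application of the same appropriator changes nothing.
once = local_appropriator(doc, "use", 2)
twice = local_appropriator(once, "use", 2)

# Non-commutativity: applying two appropriators in a different order
# can leave a different set of surviving tokens.
use_then_for = local_appropriator(local_appropriator(doc, "use", 2), "for", 1)
for_then_use = local_appropriator(local_appropriator(doc, "for", 1), "use", 2)
```

Under this reading, applying LA("for", 1) first blanks the token "use", so the second appropriator finds no central term and nothing survives, whereas the opposite order keeps "ICT for". Positions are preserved rather than deleted so that the width is always measured on the original document.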
The results of applying LAs at different central terms t are not the same, and thus they do not commute. This formulation compares well with the principles of quantum theory, where some baseline properties, such as measurement, are incompatible with other properties, such as wavelength. This is the situation in wave collapse. States in quantum theory are called quantum states, and they refer to the complete and maximal summary of the characteristics of the quantum system at a moment in time. [21] maintained that the
for predicting probability results of measurements of physical systems. The states are represented by a unit vector in Hilbert space, that is, the states are such that
idf(ti) (4)
$$\|\psi\|^2 = \langle\psi|\psi\rangle = 1 \qquad (8)$$
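The unit-norm condition on states can be checked numerically. The sketch below assumes a state given by a few arbitrary amplitudes over basis terms; the function names and the sample amplitudes are illustrative only. It rescales the amplitudes to a unit vector and reads the squared magnitudes as measurement probabilities.

```python
import math
from typing import List


def normalize(amplitudes: List[complex]) -> List[complex]:
    """Rescale amplitudes so that the squared norm <psi|psi> equals 1."""
    norm = math.sqrt(sum(abs(c) ** 2 for c in amplitudes))
    if norm == 0:
        raise ValueError("the zero vector cannot be normalized")
    return [c / norm for c in amplitudes]


def probabilities(state: List[complex]) -> List[float]:
    """Squared magnitudes |c_i|^2 of the state, one per basis element."""
    return [abs(c) ** 2 for c in state]


# Assumed example: unnormalized amplitudes over three basis terms.
state = normalize([3 + 0j, 4j, 0j])
probs = probabilities(state)
```

Because the state is a unit vector, the probabilities sum to one, which is what allows the amplitudes to be read as a probability distribution over terms.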
place just after the tokens are parsed. So in our work, the index term is key for generating expressiveness and relations for the terms. The appropriator makes full use of propositions such as the textual relation between terms, which helps in generating probabilities for terms in the given document. If the proposition is Dr(Q), then the terms are appropriated, leaving those terms that
Will is the central term $t_i$ with width $w$, then Smith has associated width $w-1$ because the two terms appear together. This can be represented as
several related terms that are commonly used together such as in Robin Van Parsee.
D rQ . fulfils
Thus, the appropriations can be thought of as propositions themselves, which allows some logical operations to be defined on them. Similar propositions have been used in logics for the implementation of retrieval processes, and can also be extended and used in the discussion of our appropriator. These operations are the complement, the union and the intersection. The complement excludes all terms that are not in the set Dr(Q), and thus they
4 TEXTUAL MEASURE
From the above, it can be seen that any term of a document has a textual relationship with the others. The rigorous mathematical foundation of quantum mechanics is generally agreed to be based on von Neumann's formulation, which uses the notion of a state of quantum logic: a probability measure on the collection of quantum events, i.e., on the closed subspaces of the Hilbert space. An observable, discrete or continuous, is defined as a mapping from the subsets of the real line to the collection of states of the quantum logic. These two notions are then used to derive the expected value of an observable, which provides a consistent quantum theory treating discrete and continuous observables in a uniform manner. Gleason's fundamental theorem shows that, for Hilbert spaces of dimension greater than two, the states of the logic are in one-to-one correspondence with density operators, i.e., trace class operators on the Hilbert space with trace one, and thus allows us to work with the more convenient density operators instead of the probability measures on the collection of closed subspaces of the Hilbert space. Gleason's theorem states that any totally additive measure on the closed subspaces, or projections, of a Hilbert space of dimension greater than two is given by a positive operator of trace class. All standard programming languages allow for non-trivial recursion, a fundamental feature of computation which can potentially result in the non-termination of programs [23]. The search space is reduced following the application of
do not form the next search space. Given two appropriator indices D on a set of documents, the union returns all terms that satisfy (DrQ), without the ones where
D rQ > 1. The intersection defines a similar case but considers the meet. The following holds mathematically:
. (9)
The simplest form of the appropriators is one that ignores all terms except the one, maintaining that every other term satisfies
( Dt rQ) .
fined as: LA(t, 0). This is easy to represent in the search space as a one-dimensional projector. The application of this kind of appropriator to a natural language text naturally makes the projectors orthogonal to one another, since if one is applied to the document, applying another causes the other terms to be ignored, such that
The LA further explains the reality of the proposition of [22] on the co-occurrence of terms. Given terms $t_i$ and $t_j$ in $w$, with n(Ei) and n(tj) determined by the width $w$, all in a document D, the general space is reduced by applying LA($t_i$, $w$) to D until a term is reached where LA($t_i$, 0) can be applied, such that the resulting words are enumerated. Consider a document that contains "Will Smith" together, with the LA centered on "Will" (see Section 2) with respect to some relation $w$, called the width, which is a measure of the position of the central term in the document. Thus, if
the textual measure, as non-relevant terms are ignored. Thus, a truth value for every token can be obtained using the probability of occurrence of the term. Gleason's theorem explains the natural way of presenting probabilities. Gleason's theorem can be generalized as follows: Let
Given three documents:
d1: social sciences student with ICT knowledge
d2: philosophy lecturers do not use ICT
d3: ICT staff
and the query q: student usage of ICT.
The set K of index terms is therefore K = (social, science, student, ICT, with, knowledge, philosophy, lecturer, do, use, staff), with $t_1$ = social, $t_2$ = science, ..., $t_{11}$ = staff, which can be transformed as follows. We assume the above documents are presented as part of a large corpus, so we can present the relative frequencies as A(3), B(2), C(1). If we assume that the collection contains 10,000 documents and that the document frequencies of these terms are A(50), B(1300), C(250), then, with log denoting the natural logarithm:
LA(D1): tf = 3/3; idf = log(10000/50) = 5.3; tf-idf = 5.3
LA(D2): tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3
LA(D3): tf = 1/3; idf = log(10000/250) = 3.7; tf-idf = 1.2
This generates the same result when implemented with
rable (real or complex) Hilbert space H of dimension at least 3. There exists a positive self-adjoint operator T of trace class such that, for all closed subspaces of H, we have
tion onto L, and an operator T, of trace class provided T is positive and its trace is finite. The interest is the measure in which (11) trace
The above defines the Gleason way of generating an associated probability. The trace is referred to in this work as the pre-probability. In linear algebra, the trace of an n-by-n square matrix A is defined to be the sum of the elements on the main diagonal (the diagonal from the upper left to the lower right) of A, i.e.,
$$\mathrm{tr}(A) = \sum_{i=1}^{n} a_{ii} \qquad (11)$$
where $a_{ii}$ represents the entry in the ith row and ith column of A. The trace of a matrix is the sum of its (complex) eigenvalues, and it is invariant with respect to a change of basis. Its properties were also described in [22]. Literally, given M as a set of objects about Male and F as a set of objects about Female. If
our transformation:
LA(di )(t1 ) =
(14)
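The tf-idf arithmetic of the three-document example can be reproduced with a short sketch. It assumes, from the figures given in the text, that tf is the raw count divided by the largest count (3) and that idf uses the natural logarithm of N over the document frequency; the labels A, B, C and the helper name `tf_idf` are illustrative.

```python
import math
from typing import Dict

N_DOCS = 10_000  # assumed collection size from the example
raw_counts: Dict[str, int] = {"A": 3, "B": 2, "C": 1}
doc_freq: Dict[str, int] = {"A": 50, "B": 1300, "C": 250}


def tf_idf(count: int, max_count: int, df: int, n_docs: int) -> float:
    """tf normalized by the largest count, idf as the natural log of n_docs / df."""
    tf = count / max_count
    idf = math.log(n_docs / df)
    return tf * idf


max_count = max(raw_counts.values())
scores = {
    key: tf_idf(raw_counts[key], max_count, doc_freq[key], N_DOCS)
    for key in raw_counts
}

# Rank the entries from most to least relevant.
ranking = sorted(scores, key=scores.get, reverse=True)
```

The ranking places A first, in agreement with the example. The unrounded scores can differ in the second decimal from the figures quoted in the text, which appear to round the idf before multiplying.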
Thus, the document d1 is the most relevant. After the application of the Tw, the unwrapped text occupies an uncertain position, since information about it is unknown and it is no longer considered. The other terms, that is, those that are not the central term, have a probability of almost zero but not zero. However, the probability distribution of the central term is zero; the others, except the central term, will have a probability, which gives us
M F
can think of tr(.) as an indexing operation which generates a set of objects for a given set of attributes. If tr(M) generates for male, tr(F) generates for female. Then
tr ( M F ) tr (M L) = {E V a t aM F ) .(13) tr ( M F ) is not an
1 N k 1
The concept of textual relation is now inherent, as the probability distribution is based on length. Since the central term determines the probability distribution
artificial class which is only achievable by defining an appropriate probability. With probability, a non-Boolean logic has been initialized; thus we can insist that the probability obeys some given conditions. As an example,
FRAMEWORK
determined. It should be noted that when the Tw acts on such documents, some of the terms will be unwrapped and thus cannot be determined. The unwrapped term can be any term except the central term. Thus, if the information
p
5.1 Intersection
The intersection of two Tws is the transformation that preserves the information preserved by both Tws:
is given in terms of position as:
= I positions (18)
Nv
Therefore, the information after applying the text wrapper cannot be represented like the previous one above. The complete information after Tw can be given as (19), where $L_D$ is the length of the document, $N_v$ the size of the Natural Language Text and $N_u$ the number of unwrapped tokens:
$$I(w(t,p)D) = N_u + (L_D - N_u)\left(1 - \frac{\log(N_v - 1)}{\log(N_v)}\right)$$
(20)

6 CONCLUSION
In this paper, we have provided a formalism for retrieving natural language text based on the wrapping techniques of text positioning. This process is well justified, as it supports some benchmark mathematical processes. The justification is similar to the behavior of physical systems when subjected to representation by measurement. This has not only led to further understanding of the IR process but also provides a framework for the improvement of such models. We present a Local Appropriator for which the implementation framework is completed. We introduced an improvement of search results based on reduction of the search space, using appropriate measures to determine relevance. The reduced search space can then be separated, with each document as a vector. Since the interest is in implementation in a Natural Language interface, we extended the work to understanding the relationship between terms. This paper has
(17) It should be noted that the same set of text in the same positions is wrapped, such that it can be said to have been preserved. The positions where the term can be determined are assigned one probability, and the others, with unidentified positions, another probability. Thus, if all the positions of a document are independent, the total information of the document is precisely its length when all terms are
Tw[D,(A)D(B)]
only discussed the retrieval process extensively; however, a major task is the appropriate formulation of the information need. A further direction from here is therefore to apply the same techniques to the query formulation process.
REFERENCES
[1] C.J. Van Rijsbergen, Information Retrieval. Butterworths, Chapter 2: Automatic text analysis, 1979.
[2] J. Clerk Maxwell, A Treatise on Electricity and Magnetism, 3rd ed., vol. 2. Oxford: Clarendon, 1892, pp. 68–79.
[3] R. Nicole, R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.
[4] W.B. Croft, A language modeling approach to information retrieval. In Proc. of SIGIR '98, 1998, pp. 275–281.
[5] C.J. Van Rijsbergen, The Geometry of Information Retrieval. Cambridge University Press, 2004.
[6] A. Huertas-Rosero, L. Azzopardi and C. Van Rijsbergen, Characterizing through erasing: A theoretical framework for representing documents inspired by quantum theory. In P.D. Bruza, W. Lawless, C.J. V.R. (eds.), Proc. 2nd AAAI Quantum Interaction Symposium, Oxford, U.K., College Publications, pp. 160–163, 2008.
[7] A. Einstein, B. Podolsky and N. Rosen, Can quantum-mechanical description of physical reality be considered complete? Physical Review 47(10), 1935, pp. 777–780.
[8] G. Chiribella, G.M. D'Ariano and P. Perinotti, "Probabilistic theories with purification", Phys. Rev. A 81, 062348, 2010.
[9] G. Chiribella, G.M. D'Ariano and M.F. Sacchi, Optimal estimation of group parameters using entanglement, Phys. Rev. A 72, 042338, 2005.
[10] R.P. Feynman, Lectures on Physics: Quantum Mechanics, Volume 3. Addison-Wesley, 1963.
[11] G.M. D'Ariano, R. Demkowicz-Dobrzanski, P. Perinotti and M.F. Sacchi, Quantum-state decorrelation, Phys. Rev. A 77,
[12] [...] in Advances in Quantum Theory, AIP Conf. Proc. 1327, 7 (2011); also arXiv:1012.0535.
[13] A. Khrennikov, Quantum-like formalism for cognitive measurements. Biosystems 70(3), 2003, pp. 211–233.
[14] I. Schmitt, Quantum query processing: unifying database querying and information retrieval. Otto-von-Guericke-Universität Magdeburg, Institut für Technische Informationssysteme, 2006.
[15] L. Hardy, Quantum theory from five reasonable axioms, 2001.
[16] N.J. Belkin, Anomalous State of Knowledge. In K.E. Fisher, S. Erdelez and E.F. McKechnie (eds.), Theories of Information Behavior: A Researcher's Guide. Medford, NJ: Information Today, 2005, pp. 44–48.
[17] O. Enikuomehin and J.S. Sadiku, LANLI: A natural language interfacing tool for relational database query, International Journal of Advance Research in Computer Science and Engineering, 2012.
[18] M.A. Nielsen and I.L. Chuang, Quantum Computation and Quantum Information. Cambridge, UK: Cambridge University Press, 2000.
[19] T. Roelleke, A frequency-based and a Poisson-based definition of the probability of being informative. In J. Callan, G. Cormack, C. Clarke, D. Hawking and A. Smeaton (eds.), SIGIR 2003: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, New York, 2003, pp. 227–234.
[20] J. Łukasiewicz, O logice trójwartościowej (in Polish). Ruch Filozoficzny 5, 1920, pp. 170–171. English translation: On three-valued logic, in L. Borkowski (ed.), Selected Works by Jan Łukasiewicz, North-Holland, Amsterdam, 1970, pp. 87–88. ISBN 0-7204-2252-3.
[21] A.M. Gleason, Measures on the closed subspaces of a Hilbert space. Journal of Mathematics and Mechanics 6, 1957, pp. 885–894.
[22] P. Bruza and D. Song, Towards context sensitive information inference. Journal of the American Society for Information Science and Technology 54(3), 2003, pp. 321–334.
[23] A. Edalat, An extension of Gleason's theorem for quantum computation. Imperial College London. International Journal of Theoretical Physics 43(7), 2004, pp. 1827–1840.

A.O. Enikuomehin is the head of training at the ICT centre of Lagos State University. He is a programmer with expertise in interface generation for natural language processes. He has published in reputable journals. M.O. Rahman is the Head of the Department of Computer Science at Lagos State University. He has published widely in the area of computer science. J.S. Sadiku is the Head of the Department of Computer Science, University of Ilorin, Nigeria.