BITS Pilani
Pilani Campus
N.MEHALA
FACULTY, CS/IS GROUP
INFORMATION RETRIEVAL
CS F469
Second Semester 2014-15
3/5/2015
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Cross-language IR:
Documents and user requests are in different languages (bilingual IR)
IR system: Documents (L1) -> Results (L1)
Cross-language IR (CLIR) system: Request (L1, source language) -> Documents (L2) -> Results (L2, target language)
Paul Clough, Bridging the language gap: making digital collections available to a multilingual society,
presentation, 2005
Multilingual IR (MLIR) system: Request (L?) -> Documents (L2), Documents (L3), Documents (L4), e.g. the Web
Why CLIR?
Top Ten Languages Used in the Web (Number of Internet Users by Language)

Language                  % of all         Internet users   Internet       Internet growth   2007 estimated world
                          Internet users   by language      penetration    for language      population for the
                                                            by language    (2000-2007)       language
English                   29.5 %           328,666,386      28.7 %         139.6 %           1,143,218,916
Chinese                   14.3 %           159,001,513      11.8 %         392.2 %           1,351,737,925
Spanish                   8.0 %            88,920,232       20.2 %         260.3 %           439,284,783
Japanese                  7.7 %            86,300,000       67.1 %         83.3 %            128,646,345
German                    5.3 %            58,711,687       61.1 %         113.2 %           96,025,053
French                    5.0 %            55,521,294       14.3 %         355.2 %           387,820,873
Portuguese                3.6 %            40,216,760       17.2 %         430.8 %           234,099,347
Korean                    3.1 %            34,120,000       45.6 %         79.2 %            74,811,368
Italian                   2.8 %            30,763,940       51.7 %         133.1 %           59,546,696
Arabic                    2.6 %            28,540,700       8.4 %          931.8 %           340,548,157
Top ten languages         81.7 %           910,762,512      21.4 %         181.4 %           4,255,739,462
Rest of the languages     18.3 %           203,511,914      8.8 %          444.5 %           2,318,926,955
WORLD TOTAL               100.0 %          1,114,274,426    16.9 %         208.7 %           6,574,666,417
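The penetration column appears to be the number of Internet users divided by the estimated population for the language. That relationship is an assumption inferred from the figures, not something the slide states. A minimal Python sketch that recomputes it for three rows copied from the table:

```python
# (Internet users, 2007 estimated population) copied from the table.
ROWS = {
    "English": (328_666_386, 1_143_218_916),
    "Chinese": (159_001_513, 1_351_737_925),
    "Arabic": (28_540_700, 340_548_157),
}


def penetration(users: int, population: int) -> float:
    """Assumed definition: Internet users as a percentage of the population."""
    return 100.0 * users / population


for language, (users, population) in ROWS.items():
    print(f"{language}: {penetration(users, population):.1f} %")
```

The low penetration but very high growth for languages such as Arabic and Chinese is what motivates CLIR: a growing share of users and content is not in English.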
Figure (search process): Resource, Query, Search, Selection, Documents, Examination, Documents, Delivery
Feedback loops: system discovery, vocabulary discovery, concept discovery, document discovery, source reselection
Paul Clough, Bridging the language gap: making digital collections available to a multilingual society,
presentation, 2005
D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR-3615.
1996
CLIR problems
Handling non-ASCII character sets
Untranslatable search keys, e.g. compound words, proper names, special terms
Multi-word concepts, e.g. phrases and idioms
Ambiguity, e.g. homonymy and polysemy
Word inflections, e.g. plurals and gender
Paul Clough, Bridging the language gap: making digital collections available to a multilingual society,
presentation, 2005
Ari Pirkola, et al. Dictionary-Based Cross-Language Information Retrieval: Problems, Methods, and Research Findings. Information Retrieval, Vol. 4. 2001
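The first problem in the list, handling non-ASCII character sets, can be illustrated with Unicode normalization. The sketch below uses Python's standard `unicodedata` module. The helper name `normalize_key` is hypothetical, and the choice of NFC plus case folding is one reasonable option, not a method taken from the cited sources.

```python
import unicodedata


def normalize_key(term: str) -> str:
    """Return a canonical form of a search key for matching (hypothetical helper)."""
    # NFC composes a base letter and its combining marks into one code point;
    # casefold() is a Unicode-aware, more aggressive form of lowercasing.
    return unicodedata.normalize("NFC", term).casefold()


composed = "caf\u00e9"     # 'é' stored as a single code point
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent

# The two strings look identical on screen but differ code point by code point.
print(composed == decomposed)
print(normalize_key(composed) == normalize_key(decomposed))
```

Without a step of this kind, a query and a document can contain the same word and still fail to match in an exact-match index.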
Ontology
A representation of concepts and the relationships between them
Thesaurus
In its most common sense, a listing of words with similar, related, or opposite meanings
It does not include definitions of the words
Bilingual dictionary
A list of words together with additional word-specific information
Bilingual controlled vocabulary
A carefully selected list of words and phrases used to tag units of information (documents or works) so that they can be retrieved more easily by a search
Corpora
The document collection itself
D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR-3615.
1996
Jimmy Lin, Cross-Language and Multimedia Information Retrieval. Slides for LBSC 796/INFM 718R. 2006
Wikipedia. Related pages.
Metamodel.com. What are the differences between a vocabulary, a taxonomy, a thesaurus, an ontology, and a
meta-model? http://www.metamodel.com/article.php?story=20030115211223271. 2004
Approaches to CLIR
An Architecture of Multilingual Information Access
Multiple Languages
Multilingual Resources
Language Identification (LI)
Information Extraction
Information Filtering
Information Retrieval
Query Translation
Text Classification
Document Translation
Text Summarization
Text Processing
Language Translation
User Interface (UI)
Native Language(s)
Information Retrieval
Information Science
Artificial Intelligence
Speech Recognition
Computational Linguistics
Information Science
User interface
Interactive search technique
Thesaurus construction
Evaluation
Computational Linguistics
Language identification
Morphological analysis
Stylistic analysis
Part-of-speech tagging
Identifying occurrences of phrases
Using parallel corpora
Using comparable corpora
Filtering
Relevance Feedback
Document representation
Latent semantic indexing
Generalized vector space model
Collection fusion
Passage retrieval
Similarity thesaurus
Local context analysis
Automatic query expansion
Fuzzy term matching
Adapting retrieval methods to collection
Building cheap test collections
Evaluation
Artificial Intelligence
Machine translation
Machine learning
Template extraction and matching
Building large knowledge bases
Semantic network
Speech Recognition
Signal processing
Pattern matching
Phone lattice
Background noise elimination
Speech segmentation
Modeling speech prosody
Building test databases
Evaluation
Design Decisions
What to index?
Free text or controlled vocabulary
What to translate?
Queries or documents
Where to get translation knowledge?
Dictionary, ontology, training corpus
Query Translation
English queries -> Translation System -> Chinese queries -> Retrieval Engine (over Chinese Document Collection) -> Results (Chinese documents) -> select, examine
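The query translation flow can be sketched in a few lines of Python. Everything named here is hypothetical: the toy English-Chinese dictionary `EN_ZH`, the three-document collection `ZH_DOCS`, and the substring-count scoring are stand-ins chosen for illustration, not the translation system or retrieval engine of the slide.

```python
from typing import Dict, List, Tuple

# Hypothetical toy English->Chinese dictionary; "talks" has two candidate
# translations, and this sketch simply keeps both.
EN_ZH: Dict[str, List[str]] = {
    "ireland": ["爱尔兰"],
    "peace": ["和平"],
    "talks": ["会谈", "谈判"],
}

# Toy Chinese document collection (illustrative sentences only).
ZH_DOCS: Dict[str, str] = {
    "d1": "爱尔兰和平会谈取得进展",
    "d2": "北京今天天气晴朗",
    "d3": "两国恢复和平谈判",
}


def translate_query(query: str) -> List[str]:
    """Replace each English term by all of its dictionary translations."""
    translated: List[str] = []
    for term in query.lower().split():
        translated.extend(EN_ZH.get(term, []))  # unknown terms are dropped
    return translated


def retrieve(zh_terms: List[str]) -> List[Tuple[str, int]]:
    """Rank documents by how many translated terms they contain."""
    scores = {doc_id: sum(term in text for term in zh_terms)
              for doc_id, text in ZH_DOCS.items()}
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(doc_id, score) for doc_id, score in ranked if score > 0]


results = retrieve(translate_query("Ireland peace talks"))
print(results)
```

Only the short query is translated at search time and the documents stay in their original language, so the user still has to select and examine results written in Chinese.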
Document Translation
Chinese Document Collection -> Translation System -> English Document Collection
English queries -> Retrieval Engine (over English Document Collection) -> Results -> select, examine
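Document translation moves the translation step to indexing time. The Python sketch below assumes a hypothetical word-by-word dictionary `ZH_EN` as a stand-in for a real translation system, and assumes the Chinese documents are already segmented into words. Both are simplifications made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical toy Chinese->English dictionary, one translation per word.
ZH_EN: Dict[str, str] = {
    "爱尔兰": "ireland", "和平": "peace", "会谈": "talks",
    "北京": "beijing", "天气": "weather",
}

# Toy collection, already segmented into words (segmentation is a separate
# problem and is assumed solved here).
ZH_DOCS: Dict[str, List[str]] = {
    "d1": ["爱尔兰", "和平", "会谈"],
    "d2": ["北京", "天气"],
}


def translate_document(words: List[str]) -> List[str]:
    """Word-by-word stand-in for a real translation system."""
    return [ZH_EN[w] for w in words if w in ZH_EN]


def build_index(docs: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Translate every document once, offline, and index the English terms."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, words in docs.items():
        for term in translate_document(words):
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the documents that contain every English query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()


INDEX = build_index(ZH_DOCS)
print(search(INDEX, "peace talks"))
```

The translation cost is paid once per document rather than once per query, and the English query is matched against the translated index without any translation at search time.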
Tradeoffs
Query Translation
Often easier
Disambiguation of query terms may be difficult with short queries
Document Translation
Documents can be translated and stored offline
Automatic translation can be slow
Which is better?
Often depends on the availability of language-specific resources (e.g., morphological analyzers)
Both approaches present challenges for interaction
Early Development
Controlled Vocabulary
A controlled vocabulary information retrieval system can be very useful in the hands of a skilled searcher, but end users often find free text searching more helpful.
Experience has shown that, although the domain knowledge encoded in a thesaurus permits experienced users to form more precise queries, casual and intermittent users have difficulty exploiting the expressive power of a traditional query interface in exact match retrieval systems.
Controlled vocabulary text retrieval systems are widely used in libraries.
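The contrast between the two search styles can be made concrete with a small Python sketch. The catalogue entries and descriptors below are invented for illustration, and `controlled_search` and `free_text_search` are hypothetical helpers, not the interface of any real library system.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Record:
    doc_id: str
    text: str
    descriptors: FrozenSet[str]  # terms assigned from the controlled vocabulary


# Hypothetical catalogue entries, invented for illustration.
CATALOGUE: List[Record] = [
    Record("d1", "New treatments for heart attack patients",
           frozenset({"MYOCARDIAL INFARCTION", "THERAPY"})),
    Record("d2", "Diet and the risk of a heart attack",
           frozenset({"MYOCARDIAL INFARCTION", "NUTRITION"})),
    Record("d3", "Attack rates of influenza in schools",
           frozenset({"INFLUENZA", "EPIDEMIOLOGY"})),
]


def controlled_search(descriptors: List[str]) -> List[str]:
    """Exact match: a record must carry every requested descriptor."""
    wanted = {d.upper() for d in descriptors}
    return [r.doc_id for r in CATALOGUE if wanted <= r.descriptors]


def free_text_search(query: str) -> List[str]:
    """Match records whose text contains any of the query words."""
    words = set(query.lower().split())
    return [r.doc_id for r in CATALOGUE if words & set(r.text.lower().split())]


print(controlled_search(["myocardial infarction", "therapy"]))  # skilled searcher
print(controlled_search(["heart attack"]))                      # casual user, wrong term
print(free_text_search("heart attack"))                         # casual user, free text
```

In this toy catalogue the searcher who knows the descriptors gets a single precise hit, the casual user who types an everyday phrase into the exact-match interface gets nothing, and the free text search finds the relevant records along with an unrelated one that happens to contain the word "attack".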
Knowledge-based Techniques for Free Text Searching
Figure labels: phrase identification; words to be transliterated; Hindi-English dictionaries; Collection
Example: Ireland peace talks
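One way to read these labels is as a dictionary-based pipeline: identify phrases, look them up in a bilingual dictionary, and set aside the words that have no entry so they can be transliterated. The Python sketch below follows that reading. The English-to-Hindi direction, the toy entries in `EN_HI`, and the greedy longest-match strategy are all assumptions made for illustration. The transliteration step itself is not implemented.

```python
from typing import Dict, List, Tuple

# Hypothetical toy English->Hindi entries; multi-word phrases are stored as
# keys so that they are translated as a unit rather than word by word.
EN_HI: Dict[str, str] = {
    "peace talks": "शांति वार्ता",
    "peace": "शांति",
    "talks": "बातचीत",
}
MAX_PHRASE_LEN = 3  # longest phrase, in words, that is tried


def analyse_query(query: str) -> Tuple[List[str], List[str]]:
    """Greedy longest-match phrase identification over the dictionary.

    Returns (translations, words_to_transliterate).
    """
    words = query.lower().split()
    translations: List[str] = []
    to_transliterate: List[str] = []
    i = 0
    while i < len(words):
        for n in range(min(MAX_PHRASE_LEN, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in EN_HI:
                translations.append(EN_HI[candidate])
                i += n
                break
        else:
            # No entry, not even for the single word: leave it for a
            # transliteration step (outside the scope of this sketch).
            to_transliterate.append(words[i])
            i += 1
    return translations, to_transliterate


translations, pending = analyse_query("Ireland peace talks")
print(translations, pending)
```

For the example query, "peace talks" is matched as a phrase before its individual words are tried, and "Ireland", a proper name with no dictionary entry, ends up in the list of words to be transliterated.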