
BITS Pilani, Pilani Campus

N. MEHALA
FACULTY, CS/IS GROUP

INFORMATION RETRIEVAL
CS F469
Second Semester 2014-15

3/5/2015

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Cross Language Information Retrieval (CLIR)
• A subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query.
• E.g., using Hindi queries to retrieve English documents
• Example CLIR applications
  - Cross-language retrieval from texts
  - Cross-language retrieval from audio and images


Monolingual vs. Bilingual vs. Multilingual
Monolingual IR:
Documents and user requests in the same language
[Diagram: Request (L1) -> IR system -> Results (L1); the IR system searches Documents (L1)]

Cross-language IR:
Documents and user requests are in different languages (bilingual IR)
[Diagram: Request (L1, source language) -> Cross-language IR (CLIR) system -> Results (L2, target language); the CLIR system searches Documents (L2)]

Paul Clough, Bridging the language gap: making digital collections available to a multilingual society,
presentation, 2005

Monolingual vs. Bilingual vs. Multilingual (cont.)
Multilingual IR:
Documents in collection in different languages, search requests in any language
[Diagram: Request (L?) -> Multilingual IR (MLIR) system -> Results (L2, L3 or L4); the MLIR system searches Documents (L2), Documents (L3) and Documents (L4), e.g. the Web]


Why CLIR?
Top Ten Languages Used in the Web (Number of Internet Users by Language), Mar. 10, 2007

| Top ten languages in the Internet | % of all Internet users | Internet users by language | Internet penetration by language | Internet growth for language (2000 - 2007) | 2007 estimate, world population for the language |
|---|---|---|---|---|---|
| English | 29.5 % | 328,666,386 | 28.7 % | 139.6 % | 1,143,218,916 |
| Chinese | 14.3 % | 159,001,513 | 11.8 % | 392.2 % | 1,351,737,925 |
| Spanish | 8.0 % | 88,920,232 | 20.2 % | 260.3 % | 439,284,783 |
| Japanese | 7.7 % | 86,300,000 | 67.1 % | 83.3 % | 128,646,345 |
| German | 5.3 % | 58,711,687 | 61.1 % | 113.2 % | 96,025,053 |
| French | 5.0 % | 55,521,294 | 14.3 % | 355.2 % | 387,820,873 |
| Portuguese | 3.6 % | 40,216,760 | 17.2 % | 430.8 % | 234,099,347 |
| Korean | 3.1 % | 34,120,000 | 45.6 % | 79.2 % | 74,811,368 |
| Italian | 2.8 % | 30,763,940 | 51.7 % | 133.1 % | 59,546,696 |
| Arabic | 2.6 % | 28,540,700 | 8.4 % | 931.8 % | 340,548,157 |
| TOP TEN LANGUAGES | 81.7 % | 910,762,512 | 21.4 % | 181.4 % | 4,255,739,462 |
| Rest of World Languages | 18.3 % | 203,511,914 | 8.8 % | 444.5 % | 2,318,926,955 |
| WORLD TOTAL | 100.0 % | 1,114,274,426 | 16.9 % | 208.7 % | 6,574,666,417 |

Source: Internet World Stats, http://www.internetworldstats.com/stats7.htm
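
The columns of the table above are related by simple ratios, which can be checked with a short Python sketch. The user and population counts are copied from the table; the variable names are chosen for this example only, and the reading of "penetration" as users divided by estimated speakers is an assumption that happens to reproduce the published figure for English.

```python
# Internet users by language, copied from the table (Internet World Stats, Mar. 10, 2007).
users = {
    "English": 328_666_386, "Chinese": 159_001_513, "Spanish": 88_920_232,
    "Japanese": 86_300_000, "German": 58_711_687, "French": 55_521_294,
    "Portuguese": 40_216_760, "Korean": 34_120_000, "Italian": 30_763_940,
    "Arabic": 28_540_700,
}
rest_of_world_users = 203_511_914
world_users = 1_114_274_426
english_population = 1_143_218_916  # 2007 estimate of English speakers

# Subtotal for the ten listed languages.
top_ten_users = sum(users.values())

# Share of all Internet users = users for the language / world users.
english_share = 100 * users["English"] / world_users

# Penetration (assumed definition) = users for the language / speakers of the language.
english_penetration = 100 * users["English"] / english_population

print(top_ten_users, top_ten_users + rest_of_world_users)
print(round(english_share, 1), round(english_penetration, 1))
```

The subtotal and the world total agree with the corresponding table rows, which suggests the flattened columns were read back in the right order.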



The Information Retrieval Cycle
If you can't understand the documents:
• How do you formulate a query?
• How do you know something is worth looking at?
• How can you understand the retrieved documents?

[Diagram: Source Selection -> (Resource) -> Query Formulation -> (Query) -> Search -> (Ranked List) -> Selection -> (Documents) -> Examination -> (Documents) -> Delivery. Feedback loops: system discovery, vocabulary discovery, concept discovery, document discovery, source reselection.]


Why CLIR? (cont.)
• A collection may contain documents in many different languages, e.g. the Web. It would be impractical to form a query in each language.
• The documents may be expressed in more than one language. For example:
  - Technical documents in which English jargon appears intermixed with narrative text in another language.
  - Academic works which cite the titles of documents in different languages.
• The user is not sufficiently fluent to express a query in a language, but is able to make use of the documents that are identified.
• The user is monolingual and wants to query in their native language, because they:
  - can judge relevance even if the results are not translated
  - have access to document translation

Paul Clough, Bridging the language gap: making digital collections available to a multilingual society,
presentation, 2005
D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR-3615.
1996

CLIR problems
• Handling non-ASCII character sets
• Untranslatable search keys, e.g. compound words, proper names, special terms
• Multi-word concepts, e.g. phrases and idioms
• Ambiguity, e.g. homonymy and polysemy
• Word inflections, e.g. plurals and gender


Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005
Ari Pirkola, et al. Dictionary-Based Cross-Language Information Retrieval: Problems, Methods, and Research Findings. Information Retrieval, Vol. 4. 2001
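
As a small illustration of the first problem on this slide (non-ASCII character sets), the same visible word can be stored as different code point sequences, so an index and a query may fail to match unless both are normalized. The sketch below uses Python's standard `unicodedata` module; the helper name `normalize_term` and the choice of NFC plus case folding are assumptions for this example, not a prescribed method.

```python
import unicodedata

def normalize_term(term: str) -> str:
    """Return a canonical form of a term: NFC-normalized and case-folded."""
    return unicodedata.normalize("NFC", term).casefold()

composed = "caf\u00e9"      # 'é' stored as a single code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# The raw strings differ even though they render identically.
raw_equal = composed == decomposed

# After normalization the two spellings compare equal.
normalized_equal = normalize_term(composed) == normalize_term(decomposed)

print(raw_equal, normalized_equal)
```

Applying the same normalization at indexing time and at query time is the point; which normalization form to use depends on the collection.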

The General Problem (cont.)
• Traditional IR identifies relevant documents in the same language as the query (monolingual IR)
• Cross-language information retrieval (CLIR) tries to identify relevant documents in a language different from that of the query
• This problem is more and more acute for IR on the Web due to the fact that the Web is a truly multilingual environment

Resources for Translation
Ontology
  • Representation of concepts and relationships
Thesaurus
  • In its most common sense: a listing of words with similar, related, or opposite meanings
  • It does not include the definitions of words
Bilingual dictionary
  • A list of words together with additional word-specific information
Bilingual controlled vocabulary
  • A carefully selected list of words and phrases, which are used to tag units of information (document or work) so that they may be more easily retrieved by a search
Corpora
  • The document collection itself

D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR-3615.
1996
Jimmy Lin, Cross-Language and Multimedia Information Retrieval. Slides for LBSC 796/INFM 718R. 2006
Wikipedia. Related pages.
Metamodel.com. What are the differences between a vocabulary, a taxonomy, a thesaurus, an ontology, and a
meta-model? http://www.metamodel.com/article.php?story=20030115211223271. 2004


Approaches to CLIR


An Architecture of Multilingual Information Access
[Diagram components, top to bottom:
 Multiple Languages
 Multilingual Resources
 Language Identification (LI)
 Text Processing: Information Extraction, Information Filtering, Information Retrieval, Text Classification, Text Summarization
 Language Translation: Query Translation, Document Translation
 User Interface (UI)
 Native Language(s)]


Building Blocks for CLIR

Information Retrieval
Information Science
Artificial Intelligence
Speech Recognition
Computational Linguistics


Information Science

User interface
Interactive search technique
Thesaurus construction
Evaluation


Computational Linguistics

Language identification
Morphological analysis
Stylistic analysis
Part-of-speech tagging
Identifying occurrences of phrases
Using parallel corpora
Using comparable corpora
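
To make the first item in this list concrete, language identification can be approximated by counting how many very common function words of each language occur in a text. The sketch below is a toy version under stated assumptions: the three tiny stopword sets and the function name `identify_language` are invented for illustration, and practical systems typically rely on much larger word lists or character n-gram statistics.

```python
# Tiny, hand-picked stopword samples (illustrative assumption, far from complete).
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is"},
    "fr": {"le", "la", "et", "de", "les", "est"},
    "de": {"der", "und", "das", "ist", "nicht", "ein"},
}

def identify_language(text):
    """Return the language whose stopwords occur most often, or None if none occur."""
    tokens = text.lower().split()
    scores = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(identify_language("the peace talks in Ireland"))
print(identify_language("la paix et la justice"))
print(identify_language("der Frieden ist nicht sicher"))
```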


Computational Linguistics (Continued)

Aligning documents
Identifying occurrences of geographic and temporal concepts
Stochastic language models
Word disambiguation
Lexicons (morphology, part-of-speech)
Bilingual dictionaries (terms and possible translations)


Information Retrieval (w/o CL)

Filtering
Relevance Feedback
Document representation
Latent semantic indexing
Generalized vector space model
Collection fusion
Passage retrieval


Information Retrieval (Continued)

Similarity thesaurus
Local context analysis
Automatic query expansion
Fuzzy term matching
Adapting retrieval methods to the collection
Building cheap test collections
Evaluation
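
Of the items above, fuzzy term matching is easy to demonstrate: a query term that has no exact entry in the index vocabulary (for example a misspelled or transliterated proper name) is matched to the most similar indexed terms. This is a minimal sketch using `difflib.get_close_matches` from the Python standard library; the vocabulary, the wrapper name `fuzzy_match`, and the 0.8 cutoff are assumptions for the example.

```python
import difflib

# Hypothetical index vocabulary of a small English collection.
index_terms = ["ireland", "iceland", "island", "peace", "talks"]

def fuzzy_match(term, vocabulary, cutoff=0.8):
    """Return up to three vocabulary terms similar to `term`, best match first."""
    return difflib.get_close_matches(term.lower(), vocabulary, n=3, cutoff=cutoff)

# "Irland" is not in the vocabulary, but close spellings are.
matches = fuzzy_match("Irland", index_terms)
print(matches)
```

A lower cutoff recovers more spelling variants at the cost of more false matches, so the threshold is a tuning decision per collection.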


Artificial Intelligence

Machine translation
Machine learning
Template extraction and matching
Building large knowledge bases
Semantic network


Speech Recognition

Signal processing
Pattern matching
Phone lattice
Background noise elimination
Speech segmentation
Modeling speech prosody
Building test databases
Evaluation

Design Decisions
What to index?
Free text or controlled vocabulary
What to translate?
Queries or documents
Where to get translation knowledge?
Dictionary, ontology, training corpus


Query Translation
[Diagram: English queries -> Translation System -> Chinese queries -> Retrieval Engine (searching Chinese documents from the Chinese Document Collection) -> Results -> select / examine]
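
The query translation flow on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the system the slide depicts: the term list `EN_ZH`, the three pre-segmented toy documents, and the function names are invented for the example, and scoring is plain term counting rather than a real ranking model.

```python
from collections import Counter

# Toy English -> Chinese term list (hypothetical; one term may have several translations).
EN_ZH = {"peace": ["和平"], "talks": ["会谈", "谈判"]}

# Toy Chinese collection, already segmented into space-separated tokens.
DOCS = {
    "d1": "和平 会谈 今天 开始",
    "d2": "经济 谈判 继续",
    "d3": "天气 预报",
}

def translate_query(query, term_list):
    """Replace each query term with all of its listed translations."""
    translated = []
    for term in query.lower().split():
        # Terms with no entry are dropped in this sketch.
        translated.extend(term_list.get(term, []))
    return translated

def retrieve(query_terms, docs):
    """Rank documents by the number of query-term occurrences they contain."""
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(text.split())
        score = sum(counts[t] for t in query_terms)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=lambda d: (-scores[d], d))

zh_query = translate_query("peace talks", EN_ZH)
ranking = retrieve(zh_query, DOCS)
print(zh_query, ranking)
```

Keeping every listed translation, as done here, is the simplest policy; it also shows why short queries are hard to disambiguate, since nothing in the query says which translation of "talks" was meant.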

Document Translation
[Diagram: Chinese Document Collection -> Translation System -> English Document Collection; English queries -> Retrieval Engine (searching the English Document Collection) -> Results -> select / examine]
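
For comparison with query translation, the document translation flow can be sketched the same way: the collection is translated once, offline, and queries are then run monolingually against the translated index. Everything named here (`ZH_EN`, `ZH_DOCS`, the word-by-word gloss, the Boolean AND search) is an assumption for illustration; a real system would use a full translation component and a ranked retrieval model.

```python
from collections import defaultdict

# Toy Chinese -> English term list (hypothetical, one translation per term).
ZH_EN = {"和平": "peace", "会谈": "talks", "谈判": "negotiations", "经济": "economy"}

# Toy Chinese collection, already segmented into space-separated tokens.
ZH_DOCS = {"d1": "和平 会谈", "d2": "经济 谈判"}

def translate_document(text, term_list):
    """Word-by-word gloss; tokens without an entry are kept untranslated."""
    return [term_list.get(tok, tok) for tok in text.split()]

def build_index(docs, term_list):
    """Translate every document once, offline, and build an inverted index."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in translate_document(text, term_list):
            index[term].add(doc_id)
    return index

def search(query, index):
    """Return ids of documents that contain every query term."""
    postings = [index.get(t, set()) for t in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

index = build_index(ZH_DOCS, ZH_EN)
print(search("peace talks", index))
print(search("economy", index))
```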

Tradeoffs

Query Translation
• Often easier
• Disambiguation of query terms may be difficult with short queries
Document Translation
• Documents can be translated and stored offline
• Automatic translation can be slow
Which is better?
• Often depends on the availability of language-specific resources (e.g., morphological analyzers)
• Both approaches present challenges for interaction


Early Development

1964 International Road Research Documentation
  English, French and German thesaurus
1969 Pevzner
  Exact match with a large Russian/English thesaurus
1970 Salton
  Ranked retrieval with small English/German dictionary
1971 UNESCO
  Proposed standard for multilingual thesauri


Controlled Vocabulary Matures


1977 IBM STAIRS-TLS
Large-scale commercial cross-language IR
1978 ISO Standard 5964
Guidelines for developing multilingual thesauri
1984 EUROVOC thesaurus
Now includes all 9 EC languages
1985 ISO Standard 5964 revised


Free Text Developments


1970, 1973 Salton
Hand-coded bilingual term lists
1990 Latent Semantic Indexing
1994 European multilingual IR project
First precision/recall evaluation
1996 SIGIR Cross-lingual IR workshop
1998 EU/NSF digital library working group


Controlled Vocabulary
• A controlled vocabulary information retrieval system can be very useful in the hands of a skilled searcher, but end users often find free text searching to be more helpful.
• Experience has shown that although the domain knowledge encoded in a thesaurus permits experienced users to form more precise queries, casual and intermittent users have difficulty exploiting the expressive power of a traditional query interface in exact match retrieval systems.
• Controlled vocabulary text retrieval systems are widely used in libraries.
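
A controlled vocabulary crosses languages naturally when documents are tagged with language-independent concept identifiers and each concept carries a label per language. The sketch below shows exact match retrieval over such tags; the two-concept thesaurus, the document ids, and the function names are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical multilingual thesaurus: concept id -> preferred label per language.
THESAURUS = {
    "C1": {"en": "road safety", "fr": "sécurité routière", "de": "Verkehrssicherheit"},
    "C2": {"en": "bridge", "fr": "pont", "de": "Brücke"},
}

# Documents are tagged with concept ids, independent of their own language.
DOC_TAGS = {
    "doc_en_1": {"C1"},
    "doc_fr_7": {"C1", "C2"},
    "doc_de_3": {"C2"},
}

def concept_for(descriptor, lang):
    """Look up the concept id whose label in `lang` equals the descriptor."""
    for concept_id, labels in THESAURUS.items():
        if labels.get(lang, "").casefold() == descriptor.casefold():
            return concept_id
    return None

def search(descriptor, lang):
    """Exact match: return all documents tagged with the descriptor's concept."""
    concept_id = concept_for(descriptor, lang)
    if concept_id is None:
        return []  # not a valid descriptor in this vocabulary
    return sorted(d for d, tags in DOC_TAGS.items() if concept_id in tags)

print(search("pont", "fr"))
print(search("Road Safety", "en"))
```

The empty result for an unknown descriptor illustrates the difficulty mentioned above: a searcher who does not know the vocabulary's exact terms gets nothing back.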


Knowledge-based Techniques for Free Text Searching


Knowledge Structures for IR

Ontology
  Representation of concepts and relationships
Thesaurus
  Ontology specialized for retrieval
Bilingual lexicon
  Ontology specialized for machine translation
Bilingual dictionary
  Ontology specialized for human translation


Dictionary-based Query Translation
[Diagram components: phrase identification; words to be transliterated; Hindi-English dictionaries; Collection. Example query terms: Ireland, peace, talks]
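
The components named on this slide can be combined in a small sketch: phrases are looked up first, remaining words are looked up one by one, and words with no dictionary entry are set aside for transliteration. The Hindi entries below (`HI_EN_WORDS`, `HI_EN_PHRASES`) and the sample query are assumptions chosen to echo the slide's "Ireland peace talks" example; the transliteration step itself is not implemented here.

```python
# Toy Hindi -> English dictionaries (hypothetical entries for illustration).
HI_EN_WORDS = {"शांति": ["peace", "calm"], "वार्ता": ["talks", "conversation"]}
HI_EN_PHRASES = {("शांति", "वार्ता"): ["peace talks"]}

def translate_query(query):
    """Return (English translations, source words left for transliteration)."""
    tokens = query.split()
    translations, to_transliterate = [], []
    i = 0
    while i < len(tokens):
        bigram = tuple(tokens[i:i + 2])
        if bigram in HI_EN_PHRASES:
            # Phrase identification: prefer a phrase entry over word-by-word lookup.
            translations.extend(HI_EN_PHRASES[bigram])
            i += 2
        elif tokens[i] in HI_EN_WORDS:
            # Word lookup keeps every listed translation (ambiguity is not resolved).
            translations.extend(HI_EN_WORDS[tokens[i]])
            i += 1
        else:
            # No entry (e.g. a proper name): candidate for transliteration.
            to_transliterate.append(tokens[i])
            i += 1
    return translations, to_transliterate

print(translate_query("आयरलैंड शांति वार्ता"))
print(translate_query("शांति"))
```

Matching the phrase before the individual words avoids the extra candidates ("calm", "conversation") that word-by-word lookup would add to the translated query.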

