Myanmar Search Engine

Nyi Lynn Seck

EC (MCPA)
Search Engine Evolution
● 1st generation (use only “on page” data)
– Text data, word frequency, language
● 2nd generation (use off-page, web-specific data)
– Link (or connectivity) analysis
– Click-through data (What people click)
– Anchor-text (How people refer to this page)
● 3rd generation (answer “the need behind the query”)
– Semantic analysis - what is this about?
– Focus on user need, rather than on query
– Context determination
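Link (or connectivity) analysis in the second generation can be illustrated with PageRank, one well-known example of the idea. The slides do not name a specific algorithm, so this is only a minimal sketch under that assumption; the toy link graph, the damping factor of 0.85 and the function name are illustrative choices.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" receives the most incoming links.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_links)
best_page = max(ranks, key=ranks.get)
```

A page that many others point to ends up with a higher score, which is the "off-page" signal the second generation adds to on-page text data.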
Text Mining Research Area
● Information Retrieval (IR)
– Search Engines
– Classification
– Recommendation
● Information Extraction (IE)
– Screen scraping
– Product Information (e.g. price) scraping
● Information Understanding
– Natural Language Processing (NLP)
– Question Answering
– Concept Extraction from Newsgroups
– Visualization
– Summarization
● Cross-Lingual Text Mining
● Trend Detection
– Outlier Detection
Classical Indexing
Indexing
– Keyword Indexing
– Subject Indexing (Classification)
– Collocate subjects
– Define & Assign code (Call Number) to document
Tokenization
Tokenization is the process of splitting text into individual units (tokens), such as words, that can then be indexed
Assign a unique ID to each word and keep it in a lexicon
Remove stop/noise words before or after tokenization
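The steps on this slide (split text into tokens, drop stop words, give each remaining word an ID in a lexicon) can be sketched as follows. This assumes space-delimited English text; Myanmar script does not separate words with spaces, so a real system would need word segmentation first. The stop-word subset and function names are illustrative only.

```python
import re

# Small illustrative subset of an English stop-word list.
STOP_WORDS = {"a", "about", "above", "after", "again", "all"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_lexicon(tokens, lexicon=None):
    """Assign the next free integer ID to each word not yet in the lexicon."""
    lexicon = {} if lexicon is None else lexicon
    for tok in tokens:
        if tok not in lexicon:
            lexicon[tok] = len(lexicon)
    return lexicon


tokens = tokenize("All about search: a search engine again and again")
kept = [t for t in tokens if t not in STOP_WORDS]  # stop words removed after tokenization
lexicon = build_lexicon(kept)
ids = [lexicon[t] for t in kept]
```

Storing IDs rather than strings keeps the index compact; the lexicon is the only place where the word text itself has to live.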
Stemming , Lemmatization
Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form.
Lemmatization is the process of reducing an inflected spelling to its lexical root or lemma form. The lemma form is the base form or headword you would find in a dictionary. The combination of the lemma form with its word class (noun, verb, etc.) is called the lexeme.
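The difference between the two can be shown with a toy English example: stemming applies suffix rules, while lemmatization looks words up in a dictionary. The suffix list and the lemma dictionary below are hypothetical and deliberately tiny; this is a sketch of the idea, not a usable stemmer.

```python
# Hypothetical, deliberately tiny rule set and dictionary.
SUFFIXES = ("ing", "ed", "es", "s")
LEMMAS = {"went": "go", "better": "good", "mice": "mouse"}


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)
```

Here `stem("searching")` gives `"search"`, but `stem("engines")` gives `"engin"`, which is not a dictionary word. An irregular form such as "went" is untouched by the suffix rules, while `lemmatize("went")` gives `"go"`.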
[Myanmar-script stemming and lemmatization examples; the original characters did not survive extraction]

Inverted Index
Formula & Algorithm?
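As one possible answer to the algorithm question, an inverted index maps each term to the documents that contain it, so a query only touches the postings of its own terms. The sketch below assumes whitespace-separated tokens and a boolean AND query; the sample documents and function names are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_all(index, query):
    """Return IDs of documents containing every query term (boolean AND)."""
    postings = [set(index.get(t, [])) for t in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical three-document collection.
docs = {
    1: "myanmar search engine",
    2: "search engine evolution",
    3: "myanmar language",
}
index = build_inverted_index(docs)
hits = search_all(index, "myanmar search")
```

With these documents, the postings for "search" are documents 1 and 2, and the AND query "myanmar search" matches only document 1.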
Stop Words
What stop words will be used in a Myanmar search engine?
English
a
able
about
above
abroad
according
accordingly
across
actually
adj
after
afterwards
again
against
ago
ahead
ain't
all
allow
allows
almost
alone
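The slide leaves the Myanmar stop-word list as an open question. One common, language-independent starting point is to treat tokens that occur in most documents as candidates and review them by hand. The sketch below assumes the corpus has already been segmented into tokens; the threshold and the sample documents are arbitrary, and this is not presented as the method used in the talk.

```python
from collections import Counter


def stop_word_candidates(documents, min_doc_ratio=0.8):
    """Return tokens that occur in at least min_doc_ratio of the documents."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each token once per document
    threshold = min_doc_ratio * len(documents)
    return sorted(t for t, df in doc_freq.items() if df >= threshold)


# Hypothetical pre-tokenized corpus (English stand-ins for segmented Myanmar words).
documents = [
    ["the", "search", "engine"],
    ["the", "crawler", "and", "the", "index"],
    ["the", "lexicon", "and", "stemming"],
    ["a", "query", "and", "the", "ranking"],
]
candidates = stop_word_candidates(documents, min_doc_ratio=0.75)
```

In this toy corpus "the" appears in all four documents and "and" in three, so both pass the 75% threshold, while content words such as "search" do not.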
NGram
[Myanmar-script example: a sentence split into syllables, followed by word lists and 2 Gram, 3 Gram and 4 Gram groupings of the same sentence; the original characters did not survive extraction]
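An n-gram is a run of n consecutive units, here syllables. The sketch below uses the usual sliding-window definition; whether the slide's example used overlapping windows cannot be determined from the surviving text. The placeholder syllables `s1` to `s5` stand in for real Myanmar syllables.

```python
def ngrams(units, n):
    """Return all runs of n consecutive units as tuples."""
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]


# Placeholder syllables standing in for a segmented Myanmar sentence.
syllables = ["s1", "s2", "s3", "s4", "s5"]
bigrams = ngrams(syllables, 2)
trigrams = ngrams(syllables, 3)
fourgrams = ngrams(syllables, 4)
```

A sequence of five syllables yields four 2-grams, three 3-grams and two 4-grams; asking for an n larger than the sequence gives an empty list.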
Myanmar Word Segmentation using Syllable level Longest Matching: Hla Hla Htay
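The general idea of syllable-level longest matching can be sketched as a greedy scan: at each position, take the longest run of syllables that forms a word in the lexicon. This is a minimal illustration of the technique, not the cited author's implementation; the romanized syllables, the lexicon and the `max_len` limit are all hypothetical.

```python
def segment_longest_match(syllables, lexicon, max_len=4):
    """Greedily group syllables into the longest word found in the lexicon."""
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = tuple(syllables[i:i + size])
            # A single syllable is always accepted so the scan never stalls.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical lexicon of words, each stored as a tuple of syllables.
lexicon = {("my", "an", "mar"), ("my", "an"), ("search",), ("en", "gine")}
syllables = ["my", "an", "mar", "search", "en", "gine", "x"]
words = segment_longest_match(syllables, lexicon)
```

Because longer candidates are tried first, "my an mar" is chosen over the shorter lexicon entry "my an", and the unknown syllable "x" falls back to a one-syllable word.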
Simple Myanmar Syllable Structure
C
C+M
C+M+V
C+M+V+K
C+M+V+K+D
C+M+V+D
C+M+K
C+M+K+D
C+M+D
C+V
C+V+K
C+V+K+D
C+V+D
C+K
C+K+D
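The fifteen patterns above all start with C and then add M, V, K and D in that fixed order, each at most once. The sketch below treats the letters as opaque category labels (the slide does not spell out what they stand for) and compares the listed set with a compact regular expression; the helper names are illustrative.

```python
import re

# The fifteen patterns from the slide, written without the "+" separators.
LISTED_PATTERNS = {
    "C", "CM", "CMV", "CMVK", "CMVKD", "CMVD", "CMK", "CMKD", "CMD",
    "CV", "CVK", "CVKD", "CVD", "CK", "CKD",
}

# Compact form: C followed by optional M, V, K, D in that order.
COMPACT = re.compile(r"CM?V?K?D?")


def is_listed(pattern):
    """True if the pattern is one of the fifteen on the slide."""
    return pattern in LISTED_PATTERNS


def matches_compact(pattern):
    """True if the whole pattern fits the compact regular expression."""
    return COMPACT.fullmatch(pattern) is not None
```

Every listed pattern matches the compact expression. The expression also accepts "CD", which is the one combination of this shape that does not appear in the list.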
Language Specific Search Engine
Basic Architecture
Language-specific crawler
Crawler
Corpus / Lexicon
Language identification
Page repository
Parser
Indexer
Ranking engine
Query engine (query, results)
Pann Yu Mon, Management and Information System Engineering Department, Nagaoka University of Technology, Japan
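The language identification step in this architecture decides whether a crawled page belongs in the repository. The slides do not say how it was done; one simple heuristic, shown here only as an assumption, is to measure the share of characters that fall in the Unicode Myanmar block (U+1000 to U+109F). The threshold of 0.5 and the function names are illustrative.

```python
def myanmar_ratio(text):
    """Fraction of non-space characters in the Unicode Myanmar block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    in_block = sum(1 for c in chars if "\u1000" <= c <= "\u109f")
    return in_block / len(chars)


def is_myanmar_page(text, threshold=0.5):
    """Keep the page if enough of its characters are Myanmar script."""
    return myanmar_ratio(text) >= threshold
```

A crawler could call `is_myanmar_page` on the extracted page text and skip pages below the threshold. A code-point test like this does not distinguish between different font encodings that share the same range.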
Crawling Coverage
Crawling parameters
  Seed URLs        35
  Level of depth   6
  Crawling time    2 weeks
  CPU              2.40 GHz
  Memory           1 GB
  Connection       100 Mbit per second

Domains          Number of collected pages
  .mm            3,555 [1.1%]
  .com           276,554 [83.2%]
  Other gTLDs    52,245 [15.7%]
  Total          332,354 [100.0%]

10th July 2008
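The seed URLs and level of depth in the parameters above describe a breadth-first crawl that stops following links beyond a fixed depth. The sketch below shows that idea on a hypothetical in-memory link graph in place of real HTTP fetches, then tallies the collected URLs by top-level domain as in the coverage table. All URLs and function names are made up.

```python
from collections import Counter, deque
from urllib.parse import urlparse


def crawl(seed_urls, get_links, max_depth):
    """Breadth-first traversal from the seeds, following links up to max_depth."""
    seen = set(seed_urls)
    queue = deque((url, 0) for url in seed_urls)
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links from pages at the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen


def tally_by_tld(urls):
    """Count URLs per top-level domain label, e.g. 'mm' or 'com'."""
    return Counter(urlparse(u).hostname.rsplit(".", 1)[-1] for u in urls)


# Hypothetical link graph standing in for fetched pages.
GRAPH = {
    "http://a.example.mm/": ["http://b.example.com/", "http://c.example.com/"],
    "http://b.example.com/": ["http://d.example.org/"],
    "http://d.example.org/": ["http://e.example.com/"],
}
pages = crawl(["http://a.example.mm/"], lambda u: GRAPH.get(u, []), max_depth=2)
counts = tally_by_tld(pages)
```

With a depth limit of 2, the page two links away from the seed is collected but its own outgoing link is not followed, so the fifth URL in the graph is never reached.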
