
Myanmar Search Engine

Nyi Lynn Seck


EC (MCPA)
Search Engine Evolution

● 1st generation (uses only “on-page” data)
– Text data, word frequency, language

● 2nd generation (uses off-page, web-specific data)
– Link (or connectivity) analysis
– Click-through data (what people click)
– Anchor text (how people refer to this page)

● 3rd generation (answers “the need behind the query”)
– Semantic analysis: what is this about?
– Focus on the user's need rather than on the query
– Context determination
Text Mining Research Area
● Information Retrieval (IR)
– Search Engines
– Classification
– Recommendation
● Information Extraction (IE)
– Screen scraping
– Product information (e.g. price) scraping
● Information Understanding
– Natural Language Processing (NLP)
– Question Answering
– Concept Extraction from Newsgroups
– Visualization
– Summarization
● Cross-Lingual Text Mining
● Trend Detection
– Outlier Detection
Classical Indexing
● Indexing
– Keyword Indexing
– Subject Indexing (Classification)
– Collocate subjects
– Define and assign a code (Call Number) to each document
Tokenization

● Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful units called tokens

● Assign a unique ID to each word and keep it in a lexicon

● Remove stop/noise words before or after tokenization
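
The two steps of splitting text into tokens and assigning each distinct word an ID can be sketched as follows. This is a minimal illustration on English text, not the presenter's implementation: the function names `tokenize` and `build_lexicon` and the sample sentence are assumptions. Myanmar text is not written with spaces between words, so a regular-expression split like this one would not be enough for it.

```python
import re

def tokenize(text):
    """Split text into lowercase tokens made of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_lexicon(tokens, lexicon=None):
    """Give every distinct token an integer ID, in order of first appearance."""
    lexicon = {} if lexicon is None else lexicon
    for token in tokens:
        if token not in lexicon:
            lexicon[token] = len(lexicon)
    return lexicon

# Hypothetical sample sentence, used only for illustration.
tokens = tokenize("The crawler fetches pages; the indexer indexes pages.")
lexicon = build_lexicon(tokens)
token_ids = [lexicon[t] for t in tokens]
print(lexicon)
print(token_ids)
```

Storing IDs instead of strings keeps later structures (such as an index) compact, and the lexicon can be extended by passing it back into `build_lexicon` with new tokens.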
Stemming, Lemmatization

● Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, generally a written word form.

● Lemmatization is the process of reducing an inflected spelling to its lexical root or lemma form. The lemma form is the base form or headword you would find in a dictionary. The combination of the lemma form with its word class (noun, verb, etc.) is called the lexeme.
[Myanmar-script examples; the characters are not recoverable from the source]
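
The difference between the two definitions can be illustrated with a toy English example. This is only a sketch: the suffix list and the lemma dictionary below are hypothetical and deliberately tiny, and real stemmers and lemmatizers use much richer rules and vocabularies.

```python
# Hypothetical suffix list for a crude rule-based stemmer.
SUFFIXES = ("ing", "ed", "s")

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical dictionary mapping inflected spellings to their lemma.
LEMMAS = {"ran": "run", "mice": "mouse", "crawling": "crawl"}

def lemmatize(word):
    """Look the word up in the lemma dictionary; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

for word in ["crawling", "indexed", "pages", "ran", "mice"]:
    print(word, stem(word), lemmatize(word))
```

Suffix stripping handles regular forms such as "crawling" but leaves irregular forms such as "ran" and "mice" untouched, which is where a dictionary-based lemma lookup helps.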


Inverted Index
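
As a rough illustration of the idea, an inverted index maps each term to the documents that contain it, so a query can be answered by intersecting posting lists instead of scanning every document. The sketch below assumes whitespace-separated terms and a hypothetical three-document collection; the names `build_inverted_index` and `search` are illustrative only.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the sorted list of IDs of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search(index, query):
    """Return the IDs of documents that contain every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

# Hypothetical toy collection, used only for illustration.
documents = {
    1: "myanmar search engine",
    2: "search engine evolution",
    3: "myanmar word segmentation",
}
index = build_inverted_index(documents)
print(index["myanmar"])
print(search(index, "search engine"))
```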
Formula & Algorithm?
Stop Words
What stop words will be used in a Myanmar search engine?

English:
a, able, about, above, abroad, according, accordingly, across, actually, adj, after, afterwards, again, against, ago, ahead, ain't, all, allow, allows, almost, alone
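
Stop-word filtering itself is a simple set lookup. The sketch below uses only the English words listed on this slide; the function name `remove_stop_words` and the sample phrase are assumptions, and a Myanmar stop-word list would have to be compiled separately.

```python
# The English stop words shown on the slide (only the start of a longer list).
STOP_WORDS = {
    "a", "able", "about", "above", "abroad", "according", "accordingly",
    "across", "actually", "adj", "after", "afterwards", "again", "against",
    "ago", "ahead", "ain't", "all", "allow", "allows", "almost", "alone",
}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]

# Hypothetical sample phrase, used only for illustration.
tokens = "a page about search engines across all domains".split()
filtered = remove_stop_words(tokens)
print(filtered)
```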
NGram
[Myanmar-script example: a sentence split into units delimited by |, followed by its 2-gram, 3-gram, and 4-gram groupings; the characters are not recoverable from the source]
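
Generating n-grams from a sequence of units can be sketched in a few lines. The units here are placeholder strings standing in for syllables, since the original script is not available; the function name `ngrams` is an assumption.

```python
def ngrams(units, n):
    """Return every run of n consecutive units as a tuple, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]

# Placeholder syllables; real input would be Myanmar syllables.
syllables = ["s1", "s2", "s3", "s4", "s5"]

for n in (2, 3, 4):
    print(n, ngrams(syllables, n))
```

A sequence of k units yields k - n + 1 n-grams, and none at all when n is larger than k.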
Myanmar Word Segmentation Using Syllable-Level Longest Matching: Hla Hla Htay
Simple Myanmar Syllable Structure

C
C+M
C+M+V
C+M+V+K
C+M+V+K+D
C+M+V+D
C+M+K
C+M+K+D
C+M+D
C+V
C+V+K
C+V+K+D
C+V+D
C+K
C+K+D
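
The patterns listed above can be checked mechanically. The sketch below treats each pattern as a string of category letters; the slide does not define the letters (C and V presumably stand for consonant and vowel), so the code works on the letters alone. The names `VALID_PATTERNS`, `is_valid_syllable`, and `SYLLABLE_RE` are assumptions. The regular expression is just a compact restatement of the list: C followed by optional M, V, K, D in that order, with the single combination C+D excluded because it does not appear on the slide.

```python
import re

# The 15 syllable patterns from the slide, written without the "+" signs.
VALID_PATTERNS = {
    "C", "CM", "CMV", "CMVK", "CMVKD", "CMVD", "CMK", "CMKD", "CMD",
    "CV", "CVK", "CVKD", "CVD", "CK", "CKD",
}

def is_valid_syllable(pattern):
    """Check a pattern such as 'C+M+V' (spaces and '+' are ignored) against the list."""
    return pattern.replace("+", "").replace(" ", "") in VALID_PATTERNS

# Compact equivalent: C, then optional M, V, K, D in order, but not "CD" alone.
SYLLABLE_RE = re.compile(r"C(?!D$)M?V?K?D?")

print(is_valid_syllable("C+M+V+K+D"))
print(is_valid_syllable("C+D"))
print(bool(SYLLABLE_RE.fullmatch("CVK")))
```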
Language Specific Search Engine
Basic Architecture

● Language specific crawler
– Crawler
– Language Identification
– Corpus/Lexicon
– Page repository
● Parser
● Indexer
● Ranking engine
● Query engine (query in, results out)

Pann Yu Mon, Management and Information System Engineering Department, Nagaoka University of Technology, Japan
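
The slide does not say how the language identification step decides whether a page is in Myanmar. One simple heuristic, shown here purely as an assumption and not as the method of the cited work, is to measure how many characters fall in the Unicode Myanmar block (U+1000 to U+109F). The function names and the 0.5 threshold are hypothetical, and pages encoded with legacy non-Unicode fonts would need a different check.

```python
def myanmar_ratio(text):
    """Fraction of non-whitespace characters in the Unicode Myanmar block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    in_block = sum(1 for ch in chars if "\u1000" <= ch <= "\u109f")
    return in_block / len(chars)

def is_myanmar_page(text, threshold=0.5):
    """Treat a page as Myanmar when the ratio reaches the (arbitrary) threshold."""
    return myanmar_ratio(text) >= threshold

# Two Myanmar-block characters followed by two Latin letters.
sample = "\u1000\u1001 ab"
print(myanmar_ratio(sample))
print(is_myanmar_page("search engine"))
```

A crawler built along these lines would store only the pages that pass the check in the page repository and follow their outgoing links.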
Crawling Coverage

● Crawling parameters
– Seed URLs: 35
– Level of depth: 6
– Crawling time: 2 weeks
– CPU: 2.40 GHz
– Memory: 1 GB
– Connection: 100 Mbit per second

● Number of pages collected, by domain

Domain          Pages collected
.mm                   3,555 [  1.1%]
.com                276,554 [ 83.2%]
Other gTLDs          52,245 [ 15.7%]
Total               332,354 [100.0%]
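
The percentages in the coverage figures follow directly from the page counts, as this short check shows. The variable names are illustrative; the counts are the ones reported on the slide.

```python
# Page counts per domain as reported on the slide.
pages_by_domain = {".mm": 3555, ".com": 276554, "Other gTLDs": 52245}

total = sum(pages_by_domain.values())

# Share of each domain in percent, rounded to one decimal place.
shares = {domain: round(100 * count / total, 1)
          for domain, count in pages_by_domain.items()}

print(total)
print(shares)
```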

10th July 2008
