Additional Slides

Multimedia Database System
Arry Akhmad Arman

Laboratory for Signal and System
Electrical Engineering Department of ITB
Email: aa@lss.ee.itb.ac.id, aa_arman@rocketmail.com
Syllabus
• Conventional Database (RDBMS, SQL)
• DBMS and MMDBMS, RDBMS vs OODBMS
• Media Representation
• Spatial Data Structures
• Image Database
• Document Database
• Audio Database
• Video Database
• Multimedia Database
• Storage System
• Media Presentation & Servers
Lecture Schedule
• Conventional Database (RDBMS, SQL), 4 sessions of 2 hours each, Aciek
• DBMS and MMDBMS, RDBMS vs OODBMS
• Media Representation
• Spatial Data Structures
• Image Database
• Document Database
• Audio Database
• Video Database
• Multimedia Database
• Storage System
• Media Presentation & Servers

References
• General references on DBMS/RDBMS and SQL
• Principles of Multimedia Database Systems, V. S. Subrahmanian

Quick Review to
Database Systems
Arry Akhmad Arman

Laboratory for Signal and System
Electrical Engineering Department of ITB
Email: aa@lss.ee.itb.ac.id, aa_arman@rocketmail.com
Architecture
• Centralized (mainframe to minicomputer systems)
• File sharing in a LAN
• Client-server

© 1999-2001, ARRY AKHMAD ARMAN - Electrical Engineering Dept. of ITB
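Client Server Architecture
[Figure: the user works on the client side. The client sends a request (SQL) to the server (DBMS) on the server side, which reads the database. The server returns the result as data only; the client presents the result to the user as data plus display format.]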

Client Server Architecture
[Figure: user, client, network, server, database. Requests travel from the client side to the server side over the network, and results travel back.]
The client and the server can be far apart, connected through a network.

Structured Query Language
• SQL is a standard language for issuing queries (requests) to a database system.
• SQL is used on many different platforms.
• Because SQL is shared across platforms, users can work with different platforms easily.
[Figure: a query (SQL) is sent to the database system, which returns the result.]

Open Database Connectivity (ODBC)
[Figure: SQL queries pass through ODBC, which forwards each query to Database System #1 or Database System #2 and passes the result back.]
Multimedia Database
[Figure: the same client-server arrangement (user, client, request, server, database, result), with the database holding multimedia data.]

Multimedia Data
• Image data
• Video data
• Audio data
• Text/document data
• Handwritten data

Sample Multimedia Scenario
[Figure: a police application connected to six data sources: surveillance data, audio/phone data, geographical information, still image data, relational data, and document data.]
Example Query : Image Query
• A photograph of the suspect is available.
• Query
“Retrieve all images from the image library in which the person appearing in the currently displayed photograph appears”
• Possibilities
– If the user already knows the identity of the person in the photograph, the query can be expressed textually.
– If the user does not know the identity, the query needs the help of image processing (recognition).
Example Query : Audio Query
• The police have an audio recording of a telephone conversation between the suspect and another person.
• Query
– “Find all parts of the conversation that contain the word ‘money’”
– Identify the person speaking with A
• The first query requires speech recognition technology.
• The second query requires speaker recognition technology.

Example Query : Video Query
• The police have video recordings from cameras installed at a number of surveillance locations.
• Query
– “Find all video segments that show one person handing an item to another person”

Example Query : Text/Doc Query
• The police access a news library.
• Query
– “Find all articles related to A”

Example Query : Relational Query
• The police access a bank database.
• Query
– “Find all money transfers from A to other people with a value of more than 100,000,000”
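
The relational query above maps directly onto SQL. The sketch below is only an illustration using Python's built-in sqlite3 module; the `transfers` table, its columns, and the sample rows are hypothetical and do not come from the slides.

```python
import sqlite3

# Hypothetical schema: one row per money transfer (not from the slides).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transfers ("
    "id INTEGER PRIMARY KEY, sender TEXT, receiver TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO transfers (sender, receiver, amount) VALUES (?, ?, ?)",
    [
        ("A", "B", 250_000_000),
        ("A", "C", 50_000_000),
        ("D", "A", 300_000_000),
        ("A", "E", 100_000_001),
    ],
)


def large_transfers_from(connection, sender, threshold):
    """Return (receiver, amount) for transfers from `sender` above `threshold`."""
    cursor = connection.execute(
        "SELECT receiver, amount FROM transfers "
        "WHERE sender = ? AND receiver <> ? AND amount > ? "
        "ORDER BY amount DESC",
        (sender, sender, threshold),
    )
    return cursor.fetchall()


result = large_transfers_from(conn, "A", 100_000_000)
print(result)
```

Unlike the image, audio, and video queries, this one needs no media analysis: a conventional RDBMS answers it with a single SELECT.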

Multimedia Database Applications
• Education/Library
• Film Industry
• Television/News Broadcasting
• Travel Industry
• Documentation Center

Multidimensional
Data Structure
Arry Akhmad Arman

Electrical Engineering Department
Institut Teknologi Bandung
Last update: September 2005

RDBMS
A relational database management system (RDBMS) is a system that manipulates tables.
Informally speaking, a table consists of rows and columns.
Each row is called a tuple.
Relationships can be created between tables.
Disadvantages of RDBMS
Data is organized in the form of relatively “flat” tuples.
The schemas of relations are relatively static.
Relationships that might exist between the content of (part of) one table and (part of) another relational table must be explicitly encoded.
Quick Introduction to
Object Oriented Database
© 2002, Arry Akhmad Arman

Laboratory for Signal & Systems
Bandung Institute of Technology
email : aa_arman@BitSmart.com
Relational Database vs Object Oriented
RDBMS
• Data is organized in the form of relatively “flat” tuples.
• The schemas of relations are relatively static.
• Relationships that might exist between the content of (part of) one table and (part of) another relational table must be explicitly encoded through the use of constructs such as integrity constraints.
Object Oriented
• Objects, which are manipulated by applications.
• Classes, which are collections of objects. A class also contains associated methods (algorithms specifying how the objects in that class are to be manipulated).
• A hierarchy that imposes an acyclic graph structure on the set of classes.

Multidimensional
Data Structures
© 2002, Arry Akhmad Arman

N-dimensional Data
• Most media data requires the ability to reason about both time and space.
• Such data is typically referred to as n-dimensional data, reflecting the fact that the data has associated attributes drawn from an n-dimensional space. Examples:
– Typical two-dimensional coordinates (x, y)
– Three-dimensional coordinates (x, y, z)
– Space-time (x, y, z, t)
• Most techniques for storing n-dimensional data use a “hierarchical” decomposition of space, typically represented by various kinds of trees: k-d trees, point quadtrees, MX quadtrees, and R-trees.

K-d Trees : Introduction
• The k-d tree is used to store k-dimensional point data such as that shown in the figure below.
• For k=2, the k-d tree is a 2-d tree
• For k=3, the k-d tree is a 3-d tree

K-d Trees : Node Structure
nodetype = record
  INFO : infotype;
  XVAL : real;
  YVAL : real;
  LLINK : ^nodetype;
  RLINK : ^nodetype;
end;
[Figure: node layout with the fields INFO, XVAL, YVAL in the top row and LLINK, RLINK in the bottom row.]

K-d Trees : Rules (1)
• Suppose T is a pointer to the root of a 2-d tree. If N is a node in this tree, then the level of node N is defined as
Level(N) = 0 if N is the root of the tree
Level(N) = Level(P) + 1 if N’s parent is P

K-d Trees : Rules (2)
• If N is a node in the tree such that level(N) is even, then
– every node M in the subtree rooted at N.LLINK has the property that M.XVAL < N.XVAL,
– every node P in the subtree rooted at N.RLINK has the property that P.XVAL >= N.XVAL.
• If N is a node in the tree such that level(N) is odd, then
– every node M in the subtree rooted at N.LLINK has the property that M.YVAL < N.YVAL,
– every node P in the subtree rooted at N.RLINK has the property that P.YVAL >= N.YVAL.

K-d Tree : Insertion
City (XVAL, YVAL)

Banja Luka (19, 45)
Derventa (40, 50)
Teslic (38, 38)
Tuzla (54, 40)
Sinj (4,4)
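
The insertion of these five cities can be traced with a short sketch. This is an illustrative Python version of a 2-d tree, not the slides' own code: the `Node` class and `insert` function are names introduced here, and the rule applied is "compare XVAL on even levels and YVAL on odd levels; go left if strictly smaller, otherwise right".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    info: str
    xval: float
    yval: float
    llink: Optional["Node"] = None
    rlink: Optional["Node"] = None


def insert(root, info, x, y, level=0):
    """Insert a point into a 2-d tree and return the (possibly new) root."""
    if root is None:
        return Node(info, x, y)
    # Even levels discriminate on x, odd levels on y.
    if level % 2 == 0:
        go_left = x < root.xval
    else:
        go_left = y < root.yval
    if go_left:
        root.llink = insert(root.llink, info, x, y, level + 1)
    else:
        root.rlink = insert(root.rlink, info, x, y, level + 1)
    return root


cities = [
    ("Banja Luka", 19, 45),
    ("Derventa", 40, 50),
    ("Teslic", 38, 38),
    ("Tuzla", 54, 40),
    ("Sinj", 4, 4),
]

root = None
for name, x, y in cities:
    root = insert(root, name, x, y)
```

With this insertion order, Banja Luka is the root, Sinj its left child, Derventa its right child, Teslic the left child of Derventa, and Tuzla the right child of Teslic. A different insertion order would give a different tree.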

K-d Tree : Insertion (2)

K-d Tree : Characteristics
• The tree structure depends on the order in which nodes are inserted.

• Each node of the tree represents an actual point.
• Every node always holds information (there are no empty intermediate nodes).
• An information node may be a leaf, the root, or an interior node. This complicates deletion when the node to be deleted is not a leaf.

Point Quad Tree : Node Structure
nodetype = record
  INFO : infotype;
  XVAL : real;
  YVAL : real;
  NW : ^nodetype;
  SW : ^nodetype;
  NE : ^nodetype;
  SE : ^nodetype;
end;
[Figure: node layout with the fields INFO, XVAL, YVAL in the top row and NW, SW, NE, SE in the bottom row.]

Point Quad Tree : Rules (1)

Point Quad Tree : Rules (2)

Point Quad Tree : Insertion
City (XVAL, YVAL)

Banja Luka (19, 45)
Derventa (40, 50)
Teslic (38, 38)
Tuzla (54, 40)
Sinj (4,4)
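
A point quadtree splits the plane into four quadrants at every node instead of two half-planes. The sketch below is an illustration in Python, not the slides' own code. The names `QNode`, `quadrant`, `insert`, and `depth` are introduced here, and the tie-breaking rule (points with an equal x go east, points with an equal y go north) is an assumption, since the rule slides are empty in this copy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QNode:
    info: str
    xval: float
    yval: float
    nw: Optional["QNode"] = None
    sw: Optional["QNode"] = None
    ne: Optional["QNode"] = None
    se: Optional["QNode"] = None


def quadrant(node, x, y):
    """Name of the child quadrant of `node` that contains point (x, y)."""
    if x < node.xval:
        return "nw" if y >= node.yval else "sw"
    return "ne" if y >= node.yval else "se"


def insert(root, info, x, y):
    """Insert a point into a point quadtree and return the root."""
    if root is None:
        return QNode(info, x, y)
    q = quadrant(root, x, y)
    setattr(root, q, insert(getattr(root, q), info, x, y))
    return root


def depth(node):
    """Number of levels in the tree (0 for an empty tree)."""
    if node is None:
        return 0
    return 1 + max(depth(c) for c in (node.nw, node.sw, node.ne, node.se))


cities = [
    ("Banja Luka", 19, 45),
    ("Derventa", 40, 50),
    ("Teslic", 38, 38),
    ("Tuzla", 54, 40),
    ("Sinj", 4, 4),
]

root = None
for name, x, y in cities:
    root = insert(root, name, x, y)

levels = depth(root)
```

For this insertion order the tree has three levels: Derventa (NE), Teslic (SE), and Sinj (SW) hang directly off Banja Luka, and Tuzla is the NE child of Teslic.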

Point Quad Tree : Insertion (2)

Point Quad Tree : Characteristics
• The tree structure depends on the order in which nodes are inserted.

• Each node of the tree represents an actual point.
• Every node always holds information (there are no empty intermediate nodes).
• An information node may be a leaf, the root, or an interior node. This complicates deletion when the node to be deleted is not a leaf.
• The tree is shallower than the corresponding k-d tree.

MX Quad Tree : Node Structure
nodetype = record
  INFO : infotype;
  XVAL : real;
  YVAL : real;
  NW : ^nodetype;
  SW : ^nodetype;
  NE : ^nodetype;
  SE : ^nodetype;
end;
[Figure: node layout with the fields INFO, XVAL, YVAL in the top row and NW, SW, NE, SE in the bottom row.]

MX Quad Tree : Rules (1)

Multimedia Database:
Image Databases
Arry Akhmad Arman
School of Electrical Engineering and Informatics
Institut Teknologi Bandung
Last update: October 17, 2006
Arry Akhmad Arman, School of Electrical Engineering and Informatics, Information Technology, ITB
Objectives
• What new concepts will you learn in this chapter?
– What is an image database?
– Queries in image databases (different from textual databases)
– Image analysis techniques (needed by the query process)
• New data abstractions for implementation
• New techniques needed to implement these abstractions
• What technological features support these implementation methods?
Applications of Image Databases
• NASA’s EOSDIS project has collected a tremendous amount of data pertaining to the earth.
• Image data from the Hubble telescope
• Citizen database
• Image documentation
• X-ray pictures in hospitals
Simple Image Database
Name    Image File
John    John1.jpg
Budi    Budi1.jpg
What is the difference?
• In textual databases, people pose a query as a “clear” textual condition.
– Find all people with NAME=‘BUDI’
– Find all people with ADDRESS=‘Asia Afrika’
• In an “ideal” image database, a query may be posed in a more “natural” or “complicated” way.
– Here’s a picture of a person. Find all persons who look similar to the picture and show their complete identity.
How are images stored in a computer system?
• Pixel
• Resolution
• Image type: B/W, grayscale, color
• Color depth
• Memory capacity
• File standards
Image Compression
[Figure: the original image (storage size = x) passes through compression to give a compressed representation of the image (storage size = y, with y < x); decompression reverses the step.]
Compression types: lossy, lossless
Raw Images
• The content of an image consists of objects.
• Such objects in an image can have a variety of associated properties, such as:
– Shape descriptor: describes the shape/location of the region within which the object is located.
– Property descriptor: describes the properties of an individual pixel or a group of pixels. In general, it is infeasible to associate properties with individual pixels, and hence cells will be used most of the time.
Example
• Shape descriptor
– rectangle: XLB=10, XUB=60, YLB=5, YUB=50
• Property descriptor
– the pixel at location (14,17) has the following property values: Red=5, Green=1, Blue=3.
• Instead of specifying properties for each pixel as in the above example, we may split a region of (a x b) pixels into (m x n) cells, where
– a mod m = 0 and b mod n = 0
– m < a and n < b
Definitions
• Formally speaking, we have the following definitions that give precise, mathematical form to the previous intuitions.
• Definitions
– Definition 1 : Grid resolution of the image
– Definition 2 : Cell property
– Definition 3 : Object shape
– Definition 4 : Rectangle
– Definition 5 : Image Database (IDB)
Definition 1 : Grid Resolution
• Every image I has an associated pair of positive integers (m, n), called the grid resolution of the image.
• This divides the image into (m x n) cells of equal size, called the image grid.
Definition 2 : Cell Property
• A cell property is a triple (Name, Values, Method), where
– Name is a string denoting the property’s name,
– Values is a set of values that the property may assume,
– Method is an algorithm that tells us how to compute the property involved.
• Examples
– (bwcolor, {b,w}, bwalgo)
– (graylevel, [0,1], grayalgo)
$$grayalgo(cell) = \frac{\sum_{XLB \le i \le XUB} \; \sum_{YLB \le j \le YUB} findgray(i, j)}{(XUB - XLB)(YUB - YLB)}$$
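
As a rough illustration, grayalgo can be written as an average over the pixels of a cell. The Python sketch below makes two assumptions that are not in the slides: the image is a list of rows of gray levels already scaled to [0, 1], and the cell bounds are treated as half-open (XLB ≤ i < XUB, YLB ≤ j < YUB) so that the number of pixels summed equals the denominator (XUB − XLB)(YUB − YLB).

```python
def findgray(image, i, j):
    """Gray level of pixel (i, j); assumed to be stored already scaled to [0, 1]."""
    return image[j][i]


def grayalgo(image, xlb, xub, ylb, yub):
    """Average gray level of the cell xlb <= i < xub, ylb <= j < yub."""
    total = sum(
        findgray(image, i, j)
        for i in range(xlb, xub)
        for j in range(ylb, yub)
    )
    return total / ((xub - xlb) * (yub - ylb))


# Hypothetical 4 x 4 image split into a 2 x 2 grid of cells.
image = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.5, 0.5, 0.2, 0.4],
    [0.5, 0.5, 0.6, 0.8],
]

cell_values = {
    "top-left": grayalgo(image, 0, 2, 0, 2),
    "top-right": grayalgo(image, 2, 4, 0, 2),
    "bottom-left": grayalgo(image, 0, 2, 2, 4),
    "bottom-right": grayalgo(image, 2, 4, 2, 4),
}
```

Storing one value per cell instead of one per pixel is what makes the property table manageable: here 16 pixel values shrink to 4 cell values.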
Definition 3 : Object Shape
• An object shape is any set P of points such that if p, q ∈ P, then there exists a sequence of points p1, …, pn, all in P, such that:
– p = p1 and q = pn and
– for all 1 ≤ i ≤ n, ……… (see page 104)
[Figure: a current pixel and its neighboring pixels; all yellow pixels are valid positions for the next pixel.]
Definition 4: Rectangle
• A rectangle is a simple object shape descriptor (compared to Definition 3).
• A rectangle is an object shape P such that there exist integers XLB, XUB, YLB, YUB with
$$P = \{(x, y) \mid XLB \le x < XUB \;\wedge\; YLB \le y < YUB\}$$
Definition 5 : Image Database (IDB)
• Image Database (IDB) consists of a

triple (GI, Prop, Rec) where
– GI is a set of gridded images of the form (Image, m, n)
– Prop is a set of cell properties, and
– Rec is a mapping that associates, with each image, a
set of rectangles denoting objects.
Two Important Things
• When representing image data of the form described above, two major factors must be taken into account:
– First, images are often very large objects, (p1 x p2) pixel arrays. Storing properties on a per-pixel basis is usually infeasible. We need image compression algorithms.
– Second, given an image I (compressed or raw), there is a critical need to determine what “features” appear in the image. This is typically done by breaking up the image into a set of homogeneous rectangular regions, each of which is called a segment. The process of finding these segments is called segmentation.

Compressed Image Representation
• Consider a two-dimensional image I consisting of (p1 x p2) pixels.
• I(x, y) is a number denoting one or more attributes of the pixel.
• Reasoning about an image by considering all the pixels is not feasible because each of p1, p2 may be 1024 or more, leading to over a million entries in the image matrix I.
• A common approach is to transform this matrix I into a compressed representation of the matrix.
Transformation of an image
[Figure: the original image is transformed into its compressed representation.]
Creation of compressed representation
• Creating a compressed representation consists of two parts:
– Size selection.
The size h of the compressed representation is selected by the image database designer. The larger the size, the greater the fidelity. However, as the size increases, so does the complexity of creating an index for manipulating such representations, and of searching this index.
– Transform selection.
DFT (Discrete Fourier Transform)
$$DFT(x, y) = \frac{1}{p_1 p_2} \sum_{a=0}^{p_1-1} \sum_{b=0}^{p_2-1} I(a, b) \times e^{-j\left(\frac{2\pi x a}{p_1} + \frac{2\pi y b}{p_2}\right)}$$
• There is an inverse DFT transformation.

• The DFT is 100% invertible (not all transformations are), but in practice this property is often sacrificed by combining the DFT with other, non-invertible operations.
• The DFT preserves Euclidean distance.
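
Both properties can be checked numerically on a tiny image. The sketch below is a direct, unoptimized pure-Python evaluation of the 2-D DFT with the 1/(p1 p2) normalization; the function names and the two sample images are illustrative only. Under this particular normalization, distances between images are preserved up to the constant factor 1/sqrt(p1 p2), which the code undoes before comparing.

```python
import cmath
import math


def dft2(image):
    """2-D DFT with 1/(p1*p2) normalization, evaluated directly."""
    p1, p2 = len(image), len(image[0])
    out = [[0j] * p2 for _ in range(p1)]
    for x in range(p1):
        for y in range(p2):
            total = 0j
            for a in range(p1):
                for b in range(p2):
                    angle = 2 * math.pi * (x * a / p1 + y * b / p2)
                    total += image[a][b] * cmath.exp(-1j * angle)
            out[x][y] = total / (p1 * p2)
    return out


def idft2(coeffs):
    """Inverse of dft2: recovers the original image (as complex numbers)."""
    p1, p2 = len(coeffs), len(coeffs[0])
    out = [[0j] * p2 for _ in range(p1)]
    for a in range(p1):
        for b in range(p2):
            total = 0j
            for x in range(p1):
                for y in range(p2):
                    angle = 2 * math.pi * (x * a / p1 + y * b / p2)
                    total += coeffs[x][y] * cmath.exp(1j * angle)
            out[a][b] = total
    return out


def euclid(u, v):
    """Euclidean distance between two equally sized matrices."""
    return math.sqrt(
        sum(abs(p - q) ** 2 for ru, rv in zip(u, v) for p, q in zip(ru, rv))
    )


img_a = [[1.0, 2.0], [3.0, 4.0]]
img_b = [[0.0, 2.0], [5.0, 1.0]]

round_trip = idft2(dft2(img_a))
d_pixels = euclid(img_a, img_b)
# Undo the constant 1/sqrt(p1*p2) factor introduced by the normalization.
d_freq = euclid(dft2(img_a), dft2(img_b)) * math.sqrt(2 * 2)
```

The direct evaluation costs on the order of (p1 p2)^2 operations, so it is only suitable for illustration; practical systems use a fast Fourier transform.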
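DCT (Discrete Cosine Transform)
$$DCT(i, j) = \frac{2}{\sqrt{p_1 p_2}} \, \alpha(i) \, \alpha(j) \sum_{r=0}^{p_1-1} \sum_{s=0}^{p_2-1} I(r, s) \cos\frac{(2r+1)\pi i}{2 p_1} \times \cos\frac{(2s+1)\pi j}{2 p_2}$$
where α(0) = 1/√2 and α(k) = 1 for k > 0.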
• Basis for JPEG compression

• Like the DFT, the DCT is invertible
• The DCT can be computed quickly (as can the DFT)
• Both the DFT and the DCT have been modified in numerous ways to support different desired performance (speed, quality)
Wavelet
• Wavelet transforms are a newer class of techniques for image compression.
Transformed Image Index
• Indexing a large volume of data is not easy.
• We can index the image database at the level of the compressed representation (what is the problem?)
[Figure: the user queries an index built on the compressed representations of the raw (original) images.]
Segmentation : Overview
• So far, we have proceeded under the optimistic assumption that the regions in an image where features of interest lie can somehow be identified, and that the content of these “interesting regions” can then somehow be determined.
• We will quickly overview how, given any image, we can separate the image into homogeneous regions called segments.
• This process is called segmentation.
Segmentation : Connected Regions
• Suppose I is an image containing (m x n) cells.
• A connected region R in image I is a set of cells such that if cells (x1, y1), (x2, y2) ∈ R, there is a sequence of cells C1, …, Cn in R such that:
– C1 = (x1, y1) and
– Cn = (x2, y2) and
– the Euclidean distance between cells Ci and Ci+1 is 1 for all 1 ≤ i < n.
[Figure: an image divided into connected regions R1, R2, and R3.]
Segmentation : Homogeneity Predicate
[Figure: connected region → homogeneity predicate → true or false.]
• A homogeneity predicate associated with an image I is a function H that takes any connected region R in image I as input and returns either “true” or “false”: true if the region is homogeneous, false otherwise.
• See the examples on pages 109-111.
Segmentation : Definition
• Given an image I represented as a set of (m x n) pixels, we define a segmentation of image I with respect to a homogeneity predicate H to be a set R1, …, Rk of regions such that:
1. Ri ∩ Rj = ∅ for all 1 ≤ i ≠ j ≤ k
2. I = R1 ∪ ... ∪ Rk
3. H(Ri) = true for all 1 ≤ i ≤ k, and
4. for all distinct i, j, 1 ≤ i, j ≤ k, such that Ri ∪ Rj is a connected region, it is the case that H(Ri ∪ Rj) = false
See the example on page 112.
Strategy : Split & Merge
• Split
We start with the whole image. If it is homogeneous, then we are done, and the image is a valid segmentation of itself. Otherwise, we split the image into two parts and recursively repeat this process till we find a set ……
• Merge
We now check which of the regions Ri can be merged together.
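
The two phases can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm as given in the reference book. It assumes a region is a set of (x, y) cells, that a region is homogeneous when its largest and smallest cell values differ by at most a tolerance `tol`, that splitting halves the longer side, and that two regions are adjacent when some pair of their cells lies at Euclidean distance 1.

```python
def homogeneous(image, cells, tol):
    """Assumed predicate H: cell values in the region differ by at most tol."""
    values = [image[y][x] for (x, y) in cells]
    return max(values) - min(values) <= tol


def split(image, x0, y0, x1, y1, tol):
    """Recursively halve the rectangle [x0, x1) x [y0, y1) until homogeneous."""
    cells = frozenset((x, y) for x in range(x0, x1) for y in range(y0, y1))
    if homogeneous(image, cells, tol):
        return [cells]
    if x1 - x0 >= y1 - y0:  # split the longer side
        mid = (x0 + x1) // 2
        return split(image, x0, y0, mid, y1, tol) + split(image, mid, y0, x1, y1, tol)
    mid = (y0 + y1) // 2
    return split(image, x0, y0, x1, mid, tol) + split(image, x0, mid, x1, y1, tol)


def adjacent(a, b):
    """True if some cell of a is at Euclidean distance 1 from a cell of b."""
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return any((x + dx, y + dy) in b for (x, y) in a for dx, dy in steps)


def merge(image, regions, tol):
    """Merge adjacent regions while their union stays homogeneous."""
    regions = list(regions)
    changed = True
    while changed:
        changed = False
        for i in range(len(regions)):
            for j in range(i + 1, len(regions)):
                union = regions[i] | regions[j]
                if adjacent(regions[i], regions[j]) and homogeneous(image, union, tol):
                    regions[i] = union
                    del regions[j]
                    changed = True
                    break
            if changed:
                break
    return regions


image = [
    [1, 1, 9, 9],
    [1, 1, 9, 9],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]

pieces = split(image, 0, 0, 4, 4, tol=0)
segments = merge(image, pieces, tol=0)
```

On this image the split phase produces three rectangles (the left half, the block of 9s, and the lower-right block of 1s), and the merge phase joins the two blocks of 1s, leaving two segments.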
Similarity-Based Retrieval
Approach : (1) Metric Approach and (2) Transformation Approach
Similarity : Metric Approach
• Based on the distance between images
• Problem: the computation is very expensive.
[Figure: the distance d between two 1024 x 1024 color images.]
$$d(o_1, o_2) = \sum_{i=1}^{1024} \sum_{j=1}^{1024} \left( diff_r[i, j] + diff_g[i, j] + diff_b[i, j] \right)$$
How many operations?
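
A rough count answers the question. The Python sketch below assumes, since the slide does not define it, that diff_c[i, j] is the squared difference of color channel c at pixel (i, j). Each image is taken to be a list of rows of (r, g, b) tuples; the tiny sample images are hypothetical.

```python
def distance(o1, o2):
    """Sum of per-channel squared differences over all pixels (assumed diff)."""
    total = 0
    for row1, row2 in zip(o1, o2):
        for (r1, g1, b1), (r2, g2, b2) in zip(row1, row2):
            total += (r1 - r2) ** 2 + (g1 - g2) ** 2 + (b1 - b2) ** 2
    return total


# Tiny 1 x 2 images, just to exercise the function.
o1 = [[(5, 1, 3), (0, 0, 0)]]
o2 = [[(4, 1, 5), (0, 2, 0)]]
d = distance(o1, o2)

# For two 1024 x 1024 color images, one comparison needs this many
# channel differences (plus about as many additions).
channel_diffs = 1024 * 1024 * 3
print(d, channel_diffs)
```

That is about 3.1 million channel differences for a single pair of images, and a query has to repeat this for every image in the database, which motivates mapping images to a much smaller feature space first.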

Reducing Computation Load
[Figure: a map fe takes the original images, which live in an (n+2)-dimensional space, to features of the original images in an s-dimensional space; an indexing algorithm then builds an index over the object repository.]
This condition must be satisfied:

If o1, o2, o3 are objects such that d(o1, o2) ≤ d(o1, o3), then
d’(fe(o1), fe(o2)) ≤ d’(fe(o1), fe(o3))
Transformation Approach
• There is a set of transformation operators (translation, rotation, scaling, etc.)
• The transformation from o (original shape) into o’ (destination shape) is a sequence of transformation operations.
• Each transformation has a cost.
• The more similar two objects are, the lower the cost of transforming one into the other.
[Figure: original shape → T1 → T2 → … → Tn → destination shape.]
Example 1
Example 2
Alternatives for Implementation
• Representing the IDB as relations
• Representing the IDB with spatial data structures

Objects in an image are usually represented by rectangles. We can use an R-tree to store these objects.
• Representing the IDB with image transformations

Compress each image into a single real-valued vector that still represents the features of the image. The vector size is no more than 100 or 200.
Text/Document Databases
© 1999-2002, Arry Akhmad Arman

Document Database Sample
D1 Jose Orojuelo’s Operations in Bosnia
D2 The Medellin Cartel’s Financial Organization
D3 The Cali Cartel’s Distribution Network
D4 Banking Operation and Money Laundering
D5 Profile of Hector Gomez
D6 Connection Between Terrorism and Asian Dope
Operations
D7 Hector Gomez : How He Gave Agents the Slip in Cali
D8 Sex, Drugs, and Videotape
D9 The Iranian Connection
D10 Boating and Drugs : Slip Owned by the Cali Cartel

Document Database Sample
D1 Internet for Long-Distance Study
D2 Elementary School Teachers Receive Computer Training
D3 Simulation Using Computers
D4 The Computer as One of the Key Technologies in Information Technology
D5 The Digital Divide Can Be Narrowed with the Right Education Programs
D6 Distance Learning over the Global Network
D7 Learning Without Face-to-Face Meetings over the Internet
D8 The KPU Has Confirmed the New President of the Republic of Indonesia
D9 The President Elected in the General Election Will Appoint a New Cabinet
D10 Vote Counting in the Presidential Election Is Carried Out with the Help of IT

Synonymy and Polysemy
• Problems in document retrieval: synonymy and polysemy
• Synonymy
Given a topic T, the word T does not occur anywhere in a document D, even though the document D is in fact closely related to the topic T in question.
• Polysemy
The same word may mean many different things in different contexts.

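Document Retrieval
[Figure: Venn diagram. Within the set of all documents, the set of relevant documents overlaps the set of documents returned by the document retrieval algorithm.]

Sample of Doc Retrieval
[Figure: the same Venn diagram with counts: 50 documents are relevant but not returned, 20 are relevant and returned, and 150 are returned but not relevant.]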

Metrics for Document Retrieval
• Two important metrics can be used to measure the quality of a document retrieval system:
– Precision.
The precision of an algorithm A for information retrieval, with respect to suitable test sets and relevance definitions, measures how many of the answers returned by the algorithm are in fact correct.
– Recall.
The recall of an algorithm A measures how many of the right documents are in fact retrieved by the query.

Precision and Recall
$$P_t = 100 \times \frac{1 + card(\{d \in D_{test} \mid d \in A(t) \wedge relevant(t, d) \text{ is true}\})}{1 + card(\{d \in D_{test} \mid d \in A(t)\})}$$
$$R_t = 100 \times \frac{1 + card(\{d \in D_{test} \mid d \in A(t) \wedge relevant(t, d) \text{ is true}\})}{1 + card(\{d \in D_{test} \mid relevant(t, d) \text{ is true}\})}$$
• D is a finite set of documents.

• A is an information retrieval algorithm that takes a topic string t as input and returns a set A(t) of documents as output, that is, A(t) ⊆ D.

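Precision & Recall
[Figure: Venn diagram of all documents: 150 returned but not relevant, 20 relevant and returned, 50 relevant but not returned.]
$$P = \frac{20 + 1}{20 + 150 + 1} = \frac{21}{171}$$
$$R = \frac{20 + 1}{20 + 50 + 1} = \frac{21}{71}$$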
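
The same numbers can be reproduced with a few lines of Python. This is only a sketch: documents are represented by integer ids chosen so that 20 are both relevant and returned, 50 are relevant only, and 150 are returned only. The function applies the formulas with the factor of 100, so the results are percentages.

```python
def precision_recall(returned, relevant):
    """Precision and recall (in percent) with the +1 smoothing terms."""
    hits = len(returned & relevant)
    precision = 100 * (1 + hits) / (1 + len(returned))
    recall = 100 * (1 + hits) / (1 + len(relevant))
    return precision, recall


relevant = set(range(0, 70))     # 50 relevant only + 20 shared (ids 50..69)
returned = set(range(50, 220))   # 20 shared + 150 returned only

precision, recall = precision_recall(returned, relevant)
print(round(precision, 2), round(recall, 2))
```

Precision comes out at 100 × 21/171 and recall at 100 × 21/71, matching the fractions on the slide.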

Stop Lists, Word Stems
and Frequency Tables
• A Stop List associated with a document set D is a set StopL of words that are deemed “irrelevant”, even though they may appear frequently. For example: the, and, for, etc.
• Word Stems is a table of records, each containing a list of different words that have the same meaning.
• A Frequency Table, FreqT, associated with D and T is an (M x N) matrix such that FreqT(i, j) equals the number of occurrences of the word ti in document dj. The frequency table can be used to find similar documents.

Frequency Tables
Term/Doc   d1    d2    d3    d4    d5    d6
t1         615   390   10    10    18    65
t2         15    4     76    217   91    816
t3         2     8     815   142   765   1
t4         312   511   677   11    711   2
t5         45    33    516   64    491   59
• Which documents are similar?

• “Find 25 documents that are maximally relevant with respect to banking operations and drugs”

Closeness
• Term distance
Suppose vecQ(i) denotes the number of occurrences of term ti in Q. Then the term distance between Q and document dr is given by
$$\sqrt{\sum_{j=1}^{M} \left( vec_Q(j) - FreqT(j, r) \right)^2}$$
• Cosine distance
This metric is used extensively in the document database world:
$$\frac{\sum_{j=1}^{M} vec_Q(j) \times FreqT(j, r)}{\sqrt{\sum_{j=1}^{M} vec_Q(j)^2 \times \sum_{j=1}^{M} FreqT(j, r)^2}}$$
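
Both measures can be tried on the six-document frequency table shown earlier. The Python sketch below uses each document's own column as the query vector, which is an assumption made for illustration; the function names are also introduced here. Note that the term distance is smaller for closer documents, while the cosine measure is larger (closer to 1) for more similar ones.

```python
import math

# Columns of the frequency table: counts of terms t1..t5 per document.
FREQ_T = {
    "d1": [615, 15, 2, 312, 45],
    "d2": [390, 4, 8, 511, 33],
    "d3": [10, 76, 815, 677, 516],
    "d4": [10, 217, 142, 11, 64],
    "d5": [18, 91, 765, 711, 491],
    "d6": [65, 816, 1, 2, 59],
}


def term_distance(q, d):
    """Euclidean distance between a query vector and a document column."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, d)))


def cosine(q, d):
    """Cosine of the angle between a query vector and a document column."""
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q) * sum(b * b for b in d))
    return dot / norm


def most_similar(name):
    """Document (other than `name`) with the highest cosine to `name`."""
    others = [n for n in FREQ_T if n != name]
    return max(others, key=lambda n: cosine(FREQ_T[name], FREQ_T[n]))


pairs = {name: most_similar(name) for name in FREQ_T}
print(pairs)
```

By the cosine measure the table falls into three pairs of similar documents: d1 with d2, d3 with d5, and d4 with d6.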

Latent Semantic Indexing
• Similar documents have similar word frequencies.

• Problem: the frequency table matrix can be very large!
• LSI uses a technique called SVD (Singular Value Decomposition), well known in matrix theory, to reduce the size of the frequency matrix.

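Size Reduction through SVD
$$FreqT = T \times S \times D^T$$
T and D are orthogonal matrices. S is the diagonal matrix of singular values.
Sizes: FreqT (M x N) = T (M x M) × S (M x R) × D^T (R x N)
After reduction: FreqT (M x N) ≈ T* (M x k) × S* (k x k) × D*^T (k x N)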
Sample
[Figure: a decomposition with five singular values. T has entries a^i_j (M rows, 5 columns), S = diag(20, 16, 12, 0.08, 0.004), and D^T has entries b^i_j (5 rows, N columns). Keeping only the three largest singular values gives S* = diag(20, 16, 12), with T reduced to its first 3 columns and D^T to its first 3 rows.]

Example
Without SVD
• M = number of terms = 1,000,000
• N = number of documents = 10,000
• The M x N matrix needs 10,000,000,000 entries
Using SVD (R=200)

• M x R = 1,000,000 x 200 = 200,000,000 entries
• R x R = 200 x 200 = 40,000 entries
• R x N = 200 x 10,000 = 2,000,000 entries
• Total: approximately 202,000,000 entries
• This can be reduced further with special storage techniques (because most of the entries are zero)
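
The arithmetic can be checked with a short Python snippet. The variable names are illustrative; the values of M, N, and R are the ones from the example.

```python
M = 1_000_000   # number of terms
N = 10_000      # number of documents
R = 200         # number of singular values kept

full_entries = M * N
reduced_entries = M * R + R * R + R * N
ratio = full_entries / reduced_entries

print(full_entries, reduced_entries, round(ratio, 1))
```

The exact reduced total is 202,040,000 entries, roughly a 49-fold saving over the full frequency matrix.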

Document Retrieval Using SVD
$$x \circ y = \sum_{i=1}^{w} x_i \times y_i \quad \text{(dot product)}$$
Suppose di and dj are two documents.

The similarity of these two documents with respect to the SVD representation
$$T^* \times S^* \times D^{*T}$$
is the dot product of two columns of the matrix D*^T:
$$\sum_{z=1}^{R} D^{*T}[i, z] \times D^{*T}[j, z]$$

II-1 II-2
synthesizer ini tidak dapat menghasilkan ucapan dengan tingkat kealamian yang
II.3 Konversi dari Teks ke Ucapan tinggi.
Suatu sistem Speech Synthesizer atau Text to Speech pada prinsipnya terdiri dari
Synthesizer yang menggunakan teknik diphone concatenation bekerja dengan
dua sub sistem, yaitu : cara menggabung-gabungkan segmen-segmen bunyi yang telah direkam
(1) bagian Konverter Teks ke Fonem (Text to Phoneme), serta sebelumnya. Setiap segmen berupa diphone (gabungan dua buah fonem).
Synthesizer jenis ini dapat menghasilkan bunyi ucapan dengan tingkat kealamian
(2) bagian Konverter Fonem to Ucapan (Phoneme to Speech).
(naturalness) yang tinggi.
Bagian Konverter Teks ke Fonem berfungsi mengolah kalimat masukan dalam
Struktur sistem seperti di atas pada prinsipnya merupakan konfigurasi tipikal yang
suatu bahasa tertentu yang berbentuk teks menjadi rangkaian kode-kode bunyi
digunakan pada berbagai sistem Text to Speech berbagai bahasa. Namun
yang biasanya direpresentasikan dengan kode fonem, durasi serta pitch-nya.
demikian, pada setiap sub-sistem terdapat sifat-sifat serta proses-proses yang
Bagian ini bersifat sangat language dependant. Untuk suatu bahasa baru, bagian
sangat spesifik terhadap bahasa yang digunakan.
ini harus dikembangkan secara lengkap khusus untuk bahasa tersebut.
Konversi dari teks ke fonem sangat dipengaruhi oleh aturan-aturan yang berlaku
dalam suatu bahasa. Pada prinsipnya proses ini melakukan konversi dari simbol-
Text KONVERTER KONVERTER Ucapan simbol tekstual menjadi simbol-simbol fonetik yang merepresentasikan unit bunyi
TEXT KE FONEM FONEM KE UCAPAN terkecil dalam suatu bahasa. Setiap bahasa memiliki aturan cara pembacaan dan
cara pengucapan teks yang sangat spesifik. Hal ini menyebabkan implementasi
Kode-kode fonem, unit konverter teks ke fonem menjadi sangat spesifik terhadap suatu bahasa.
pitch dan durasi
Untuk mendapatkan ucapan yang lebih alami, ucapan yang dihasilkan harus
Gambar II.7 Sistem Text to Speech
memiliki intonasi (prosody). Secara kuantisasi, intonasi adalah perubahan nilai
Bagian Konverter Fonem ke Ucapan akan menerima masukan berupa kode-kode pitch (frekuensi dasar) selama pengucapan kalimat dilakukan atau pitch sebagai
fonem serta pitch dan durasi yang dihasilkan oleh bagian sebelumnya. fungsi waktu. Pada prakteknya, informasi pembentuk intonasi berupa data-data
Berdasarkan kode-kode tersebut, bagian Konverter Fonem ke Ucapan akan pitch serta durasi pengucapannya untuk setiap fonem yang dibangkitkan. Nilai-
menghasilkan bunyi atau sinyal ucapan yang sesuai dengan kalimat yang ingin nilai yang dihasilkan diperoleh dari suatu model intonasi. Intonasi bersifat sangat
diucapkan. Ada beberapa alternatif teknik yang dapat digunakan untuk spesifik untuk setiap bahasa, sehingga model yang diperlukan untuk
implementasi bagian ini. Dua teknik yang banyak digunakan adalah formant membangkitkan data-data intonasi menjadi sangat spesifik juga untuk suatu
synthesizer, serta diphone concatenation. bahasa. Beberapa model umum, pernah dikembangkan untuk intonasi, tetapi
Formant synthesizer bekerja berdasarkan suatu model matematis yang akan untuk digunakan pada suatu bahasa masih perlu banyak penyesuaian yang harus
dilakukan.
melakukan komputasi untuk menghasilkan sinyal ucapan yang diinginkan.
Synthesizer jenis ini telah lama digunakan pada berbagai aplikasi. Walaupun dapat Konverter fonem ke ucapan berfungsi untuk membangkitkan sinyal ucapan
menghasilkan ucapan dengan tingkat kemudahan interpretasi yang baik, berdasarkan kode-kode fonem yang dihasilkan dari proses sebelumnya. Sub
sistem ini harus memiliki pustaka setiap unit ucapan dari suatu bahasa. Pada
Document2 Document2
II-3 II-4
sistem concatenation, sistem harus didukung oleh suatu diphone database yang Tahap berikutnya adalah melakukan konversi dari teks yang sudah secara lengkap
berisi rekaman segmen-segmen ucapan yang berupa diphone. Ucapan dalam suatu merepresentasikan kalimat yang ingin diucapkan menjadi kode-kode fonem.
bahasa dibentuk dari satu set bunyi yang mungkin berbeda untuk setiap bahasa, Konversi teks menjadi fonem biasanya dilakukan dengan dua cara. Sebagian
oleh karena itu setiap bahasa harus dilengkapi dengan diphone database yang proses konversi dapat dilakukan dengan aturan konversi yang sederhana dan
berbeda. berlaku umum untuk berbagai kondisi. Sebagian proses lainnya bersifat
Tahapan-tahapan utama konversi dari teks menjadi ucapan dapat dinyatakan kondisional, tergantung dari huruf-huruf atau fonem-fonem tetangganya, bahkan
terdapat bentuk-bentuk translasi yang tidak dapat ditemukan keteraturannya.
dengan diagram seperti terlihat pada Gambar II.8.
Konversi yang teratur dapat diimplementasikan dengan tabel konversi yang berisi
Text
pasangan antara urutan huruf dan urutan fonem, bahkan mungkin hanya berisi
satu huruf dan satu fonem. Aturan yang lebih sulit biasanya diimplementasikan
Text
Normalization dengan tabel konversi yang akan diterapkan jika kondisi rangkaian huruf tetangga
kiri dan kanannya terpenuhi. Contoh bentuk aturan konversi huruf ke fonem yang
memenuhi teknik tersebut adalah sebagai berikut.
Conversion
eTexttoPhonem
Exception Dictionary Letter-to-Phoneme
Lookup Conversion Left-context [letter-set] right-context = phoneme string
Huruf tertentu yang ditunjuk dalam posisi [letter-set] akan dikonversikan menjadi
suatu fonem dalam “phoneme string” jika left-context dan right context terpenuhi.
Prosody
Generation Bahasa Inggris termasuk bahasa yang tidak mempunyai keteraturan konversi teks
ke fonem. Suatu TTS bahasa Inggris biasanya dilengkapi dengan suatu basis data
Phonetic
yang berisi ribuan kata serta konversi padanan urutan fonemnya. Bahasa
Analysis
Indonesis termasuk bahasa yang jelas aturan konversinya. Sebagian besar kata
PhonemetoSpeech
Speech Parameters Indonesia dapat dikonversikan menjadi fonem dengan aturan yang jelas, walaupun
Generation
Conversion
Speech tetap ada kondisi-kondisi yang tidak dapat ditemukan keteraturannya. Sebagai
Production
contoh, simbol huruf e dapat diucapkan sebagai e pepet atau e taling, artinya
Speech Waveform
harus dikonversikan menjadi fonem yang berbeda untuk kondisi yang berbeda.
Dalam blok diagram di atas, kondisi yang masih dapat ditangani oleh aturan
Gambar II.8. Konversi dari Teks ke Ucapan (dimodifikasi dari Pelton, 1992) diimplementasikan dengan blok Letter to Phoneme Conversion. Konversi yang
tidak teratur ditangani oleh bagian Exception Dictionary Lookup.
Hasil dari tahap tersebut adalah rangkaian fonem yang merepresentasikan bunyi
Tahap normalisasi teks berfungsi untuk mengubah semua teks kalimat yang ingin
kalimat yang ingin diucapkan. Bagian prosody generator akan melengkapi setiap
diucapkan menjadi teks yang secara lengkap memperlihatkan cara
unit fonem yang dihasilkan dengan data durasi pengucapannya serta pitchnya.
pengucapannya. Lihat contoh kalimat dan hasil normalisasinya pada Gambar II.9.
Data durasi serta pitch diperoleh berdasarkan kombinasi antara tabel atau database
Document2 Document2
II-5 II-6
serta model prosody. Secara simbolik, hasil dari bagian ini sudah menghasilkan
informasi yang cukup untuk menghasilkan ucapan yang diinginkan.
One further stage that is still often performed is Phonetic Analysis. It can be regarded as a refinement stage that makes corrections at the sound level. For example, in Indonesian the phoneme /k/ in the word bapak is never pronounced distinctly, and a phoneme /y/ is inserted when the word rupiah is pronounced.

Text : bapak membeli 5 kerang seharga Rp 500,-
Text Normalization : bapak membeli lima kerang seharga lima ratus rupiah
Exception Dictionary Lookup / Letter-to-Phoneme Conversion : /b//a//p//a//k/ ... /k//e//r//a//N/ ... /r//u//p//i//a//h/
Prosody Generation : /b/, 40 ms, 90 Hz; /a/, 56 ms, 95 Hz; /p/, 35 ms, 96 Hz; /a/, 75 ms, 105 Hz; /k/, 40 ms, 104 Hz; ...; /a/, 60 ms, 102 Hz; /h/, 45 ms, 100 Hz
Phonetic Analysis : /b/, 40 ms, 90 Hz; /a/, 56 ms, 95 Hz; /p/, 35 ms, 96 Hz; /a/, 75 ms, 105 Hz; /k/, 40 ms, 104 Hz; ...; /a/, 60 ms, 102 Hz; /h/, 45 ms, 100 Hz
Speech Parameters Generation, Speech Production (Formant Synth.) : Speech Waveform

Figure II.9. Quantities at Each Stage of the Text-to-Speech Conversion Process (modified from Pelton, 1992)

II.4 Synthesis Using the Diphone Concatenation Method

In a speech synthesizer that uses diphone concatenation, speech is in principle formed by arranging a number of matching diphones so that the desired utterance is obtained. For example, Figure II.10 shows how the word or utterance “komputer” is built from the diphones _k, ko, om, and so on.

_k ko om mp pu ut te er r_

Figure II.10. Forming the Utterance “komputer” from Its Diphones

For the speech synthesizer to be able to pronounce every possible word or sentence in a language, the system must be supported by a diphone database containing every diphone combination that can occur. The diphone concatenation engine, or diphone processing unit, receives as input a list of the phonemes to be spoken, each accompanied by its duration and its pitch or frequency. Controlling duration and pitch shapes the desired intonation. Based on the phoneme list it receives, this unit determines the appropriate sequence of diphones. It then smooths the joins between diphones and manipulates duration and pitch.
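The step from a phoneme list to the diphone sequence can be sketched in a few lines. The function names and the use of "_" as the silence marker are assumptions for illustration (the marker follows the _k ... r_ notation above); smoothing and pitch manipulation are not modeled.

```python
def phonemes_to_diphones(phonemes, silence="_"):
    """Return the diphone names that span a phoneme sequence.

    The sequence is padded with silence on both sides, then every pair of
    neighbours forms one diphone.
    """
    padded = [silence] + list(phonemes) + [silence]
    return [left + right for left, right in zip(padded, padded[1:])]


def plan_utterance(prosodic_phonemes):
    """Take (phoneme, duration_ms, pitch_hz) tuples, as produced by the
    prosody stage, and return the diphone list and the total duration."""
    phonemes = [p for p, _, _ in prosodic_phonemes]
    total_ms = sum(d for _, d, _ in prosodic_phonemes)
    return phonemes_to_diphones(phonemes), total_ms


if __name__ == "__main__":
    print(phonemes_to_diphones(list("komputer")))
    # Durations and pitches for "bapak" as listed in Figure II.9.
    bapak = [("b", 40, 90), ("a", 56, 95), ("p", 35, 96),
             ("a", 75, 105), ("k", 40, 104)]
    print(plan_utterance(bapak))
```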
A number of techniques for diphone processing have been developed by various groups, among them autoregressive (AR), Glottal AR, hybrid harmonic/stochastic, time-domain PSOLA (TD-PSOLA), multiband resynthesis PSOLA (MBR-PSOLA), and Linear Prediction PSOLA (LP-PSOLA) [Dut97]. Table II.1 compares the characteristics of these techniques.

[Figure II.11 diagram: the phoneme list k, o, m, p, u, t, e, r (each with dur and pitch) enters the Diphone Concatenation Engine, which uses the Diphone Database and outputs the utterance "komputer"]
Figure II.11. Forming the Utterance “komputer” from Its Diphones

Table II.1. Comparison of the Characteristics of Several Diphone Processing Techniques

II.5 Formation and Characteristics of the Speech Signal

II.5.1 The Speech Production System

Human speech is produced by a speech production system formed by the human speech organs. The process begins with the formulation of a message in the speaker's brain. The message is converted into commands that are sent to the speech organs, which finally produce speech corresponding to the message to be spoken.

Figure II.12. X-Ray Photograph of a Cross-Section of the Human Speech Organs [Rab93]

Figure II.12 shows an X-ray photograph of a cross-section of the human speech organs. The vocal tract is marked in the figure by a dashed line; it begins at the vocal cords, or glottis, and ends at the mouth. The vocal tract consists of the pharynx (the connection between the esophagus and the mouth) and the mouth. The male vocal tract is typically about 17 cm long. The cross-sectional area of the vocal tract is determined by the tongue, lips, jaw, and the rear part of the palate; it ranges from 20 cm² down to nearly zero. The nasal tract begins at the rear part of the palate and ends at the nostrils. Under certain conditions, nasal sounds are emitted through this cavity.

Figure II.13 shows a simplified model of the human speech production system. Speech formation begins with a stream of air produced by the lungs, which work much like a piston or pump being pressed to generate air pressure. When the vocal cords are tense, the airflow makes them vibrate and produces what is called a voiced speech sound.
When the vocal cords are relaxed, the airflow passes through a narrow region of the vocal tract and causes turbulence, producing what is known as an unvoiced sound.

Figure II.13. Model of the Human Speech Production System [Rab93]

Speech is produced as a sequence of its constituent sound components. Each distinct sound component is formed by differences in the position, shape, and size of the speech organs, which keep changing during speech production.

II.5.2 Speech Signal Representation

A speech signal varies with time, but its rate of change is relatively slow. Observed over a short interval (between 5 and 100 milliseconds), its characteristics are practically constant; observed over a longer interval, they change according to the sentence being spoken. Figure II.14 shows an example speech signal for the English phrase “It’s time” spoken by a male speaker. Each row of the figure shows a 100-millisecond slice of the signal, so the whole figure shows 500 milliseconds of speech.

Figure II.14. Example Speech Signal “It’s time” [Rab93]

There are various ways to classify the parts or components of a speech signal. A simple one is to classify them into three states: (1) silence (S), when no speech is being produced; (2) unvoiced (U), when the vocal cords do not vibrate, so the resulting sound is aperiodic or random; and (3) voiced (V), when the vocal cords vibrate, producing a quasi-periodic sound.

Figure II.14 above is already labeled with S, U, and V, which makes the differences between these states easier to observe. The first row and the beginning of the second row are marked S, meaning that this part represents silence, where the speaker has not yet said anything. The small amplitude visible in that period is background noise that was recorded along with the speech.
A short unvoiced (U) period appears before the first vowel in the word “It”. It is followed by a fairly long voiced (V) region representing the vowel “i”. Next comes another unvoiced (U) region, representing the decay of the “i” sound. This is followed by silence (S), which is part of the phoneme “t”, and so on.
This example makes it clear that segmenting speech into S, U, and V is not exact: some regions cannot be assigned unambiguously to one of the three categories. One reason is that the speech organs do not change discretely from one state to another, so the transition sound between segments takes a form that is hard to classify. In addition, some speech segments resemble silence or even contain silence.
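One common way to attempt such a labeling automatically is to look at short-time energy and zero-crossing rate per frame. The text above does not prescribe this method; the sketch below is an assumed technique with hypothetical thresholds (`energy_thr`, `zcr_thr`), and, as the text notes, any such hard decision will mislabel transition regions.

```python
import math


def classify_frame(samples, energy_thr=1e-4, zcr_thr=0.25):
    """Label one frame as 'S' (silence), 'U' (unvoiced) or 'V' (voiced).

    Low energy is treated as silence; among the remaining frames, a high
    zero-crossing rate (noise-like) is treated as unvoiced and a low one
    (quasi-periodic) as voiced. Both thresholds are illustrative only.
    """
    n = len(samples)
    energy = sum(x * x for x in samples) / n
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a >= 0) != (b >= 0))
    zcr = crossings / (n - 1)
    if energy < energy_thr:
        return "S"
    return "U" if zcr > zcr_thr else "V"


def label_signal(signal, frame_len):
    """Split a signal into non-overlapping frames and label each one."""
    return "".join(classify_frame(signal[s:s + frame_len])
                   for s in range(0, len(signal) - frame_len + 1, frame_len))


def sine(freq_hz, fs, n, amp=0.5):
    """Synthetic stand-in for a voiced segment."""
    return [amp * math.sin(2 * math.pi * freq_hz * i / fs) for i in range(n)]


if __name__ == "__main__":
    silence = [0.0] * 80
    noise_like = [0.1 if i % 2 == 0 else -0.1 for i in range(80)]
    voiced = sine(100, 8000, 80)
    print(label_signal(silence + noise_like + voiced, 80))
```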
Representing the signal as amplitude against time, as in the previous figure, is often not enough to obtain quantitative measures that are effective for analyzing an utterance. For speech analysis, a spectral representation using a spectrogram, as shown in Figure II.15, is used more often. A spectrogram makes it possible to identify the frequency components of a speech segment. Speech segments that look similar in the time domain are easier to tell apart in a spectrogram by comparing their frequency components.

Figure II.15. Wideband Spectrogram, Narrowband Spectrogram, and Speech Amplitude for the sentence “Every Salt Breeze Comes From Sea” [Rab93]

Spectrograms are divided into wideband spectrograms and narrowband spectrograms. A wideband spectrogram is a spectral analysis over a 15-millisecond interval using a filter with a bandwidth of 125 Hz, with the analysis repeated every 1 millisecond. A narrowband spectrogram is a spectral analysis over a 50-millisecond interval using a filter with a bandwidth of 40 Hz, with the analysis repeated every 1 millisecond. A wideband spectrogram shows the main frequency components of an utterance clearly, as in the top panel of Figure II.15. Some of the non-dominant frequency components are not visible in a wideband spectrogram. A narrowband spectrogram is used to see the frequency components in more detail, as in the second panel from the top of Figure II.15.

In TTS research and development, spectral analysis is used, among other things, to segment the components of the speech signal, to identify the frequency components of speech segments, and to analyze the fundamental frequency, which is needed for analyzing intonation.
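The effect of the analysis interval can be tried out with a small short-time Fourier analysis. This is only a sketch under assumptions: it uses a Hamming window and a naive DFT in place of the filter bank described above, and the function names and the synthetic 1000 Hz test tone are hypothetical. It reuses the 15 ms and 50 ms intervals and the 1 ms step quoted in the text.

```python
import cmath
import math


def frame_spectrum(frame):
    """Magnitude spectrum of one Hamming-windowed frame (naive DFT)."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]


def spectrogram(signal, fs, window_ms, hop_ms=1.0):
    """Return a list of magnitude spectra, one per analysis step.

    window_ms=15 mimics the wideband setting and window_ms=50 the
    narrowband one; bin k corresponds to k * fs / window_length Hz.
    """
    win = int(round(fs * window_ms / 1000))
    hop = max(1, int(round(fs * hop_ms / 1000)))
    return [frame_spectrum(signal[s:s + win])
            for s in range(0, len(signal) - win + 1, hop)]


def peak_bin(spectrum):
    """Index of the strongest frequency bin."""
    return max(range(len(spectrum)), key=spectrum.__getitem__)


if __name__ == "__main__":
    fs = 8000
    tone = [math.sin(2 * math.pi * 1000 * i / fs) for i in range(400)]
    wide = spectrogram(tone, fs, window_ms=15)
    narrow = spectrogram(tone, fs, window_ms=50)
    print(len(wide), "wideband frames, bin spacing", fs / 120, "Hz")
    print(len(narrow), "narrowband frame, bin spacing", fs / 400, "Hz")
```

The shorter window gives many frames with coarse frequency spacing, the longer one fewer frames with finer spacing, which is the trade-off behind the two spectrogram types.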
II.5.3 Speech Signal Characteristics

The smallest unit of sound that humans can distinguish is called a phoneme. A spoken word or sentence can in principle be viewed as a sequence of phonemes. The set of phonemes differs from language to language. Each phoneme is denoted by a unique symbol. Several phoneme naming standards are currently in use (Rabiner 1993, _____), among them (1) IPA (International Phonetic Alphabet) 1, (2) ARPABET, and (3) SAMPA. Table II.1 lists the phonemes of American English and their representation in IPA and ARPABET symbols.

Table II.1
Phonemes of American English in the IPA and ARPABET Standards [Rab93]

Each phoneme has different characteristics. Figure II.16 shows the list of phonemes and their classification for American English.

Figure II.16. List and Classification of American English Phonemes [Rab93]

1 An alphabet system devised by l’Association Phonetique Internationale in 1897 at the initiative of Otto Jespersen, so that people could learn and record the pronunciation of various languages accurately and avoid inconsistency; it is based on the Latin alphabet with various additions [Yus98]
II.5.3.1 Vowels

Vowel speech signals have a quasi-periodic shape, as shown in Figure II.17. Each vowel has particular frequency components that distinguish the character of one vowel phoneme from another, as shown in the spectrograms in Figure II.18. The English vowel phonemes include /IY/, /IH/, /EH/, /AE/, /AA/, /ER/, /AH/, /AX/, /AO/, /UW/, /UH/, and /OW/. Research to identify the characteristics of Indonesian vowel phonemes has been carried out and published by Arry Akhmad Arman (Arman, 1999).

Figure II.17. Waveforms of English Vowel Speech Signals [Rab93]

Figure II.18. Spectrograms of English Vowel Speech Signals [Rab93]

II.5.3.2 Diphthongs

A diphthong is in principle two consecutive vowel phonemes pronounced without a pause. The English diphthong phonemes include /AY/, /OY/, /AW/, and /EY/. The characteristics of a diphthong resemble those of its constituent vowel phonemes, together with the transition between them.

Figure II.19. Spectrograms of English Diphthong Speech Signals [Rab93]

II.5.3.3 Nasal Consonants

Nasal consonants are produced with glottal excitation while the vocal tract is totally constricted at some point along the oral passage. The rear part of the palate is lowered, so air flows through the nasal tract and the sound is radiated at the nostrils. The English nasal consonants are /M/, /N/, and /NX/. Example waveforms and spectrograms are shown in Figure II.20.
Figure II.20. Example Waveforms and Spectrograms of English Nasal Consonants [Rab93]

II.5.3.4 Fricative Consonants

Fricative consonants can in principle be divided into unvoiced and voiced fricatives. The English unvoiced fricatives are /F/, /TH/, /S/, and /SH/, and the voiced fricatives are /V/, /Z/, and /ZH/. Unvoiced fricatives are formed by exciting the vocal tract with a steady airflow, which causes turbulence at the constricted region of the vocal tract. Voiced fricatives differ somewhat: their sound is produced by two excitation sources, the turbulence at the constriction and, in addition, the glottis.

Figure II.21. Example Waveforms and Spectrograms of English Fricative Consonants [Rab93]

II.5.3.5 Stop Consonants

Like fricatives, stop consonants can be divided into unvoiced and voiced stops. Stop consonants have a shape unlike that of the other consonants: they show a transient, non-continuous pattern. They are formed by building up pressure behind a total constriction somewhere in the oral cavity, immediately followed by a release. For the phoneme /B/ the constriction is at the lips, for /D/ it is behind the front teeth, and for /G/ it is near the rear part of the palate. During the period of total constriction no sound is emitted from the mouth, so these phonemes always contain a part that resembles silence. The English unvoiced stops are /P/, /T/, and /K/, and the voiced stops are /B/, /D/, and /G/.

Figure II.22. Example Waveforms and Spectrograms of English Stop Consonants [Rab93]
Video Databases
© 1999-2002, Arry Akhmad Arman

Type of Video Request
z Retrieving a specified video
z Identifying and retrieving video segments

Which Aspect of Video to Store?
z Video Sample #1 : Educational Video

Consider an 8-hour, one-day short-course lecture given
by Prof. X on the topic “Video Database”.
z Items of interest could include the following :

People : Prof. X, students who might ask a question
Activities : Lecturing(quadtree, Prof. X), Questioning(Erica, Prof. X), etc.

z Video Sample #2 : Video Recording in Airport

People : Denis Dopeman, pilot, member of ground staff
Inanimate objects : airplanes, cars, trucks, warehouse
Activities : plane_landing(Airbus320,GIA239,Line-A, 150KHz), ...

z Video Sample #3 : “Sound of Music” Movie

People : Maria, Count Von Trapp, ...
Inanimate objects : piano in Count Von Trapp house
Animate objects : ducks and birds in the pond
Activities : singing, dancing, …….

Common Characteristics
z From the 3 examples above we notice that they all share
certain common characteristics : objects and activities.
z We are now ready to start formally defining a video database
through a sequence of definitions (Def 7.1 - 7.6)

Definitions
z Property
z Object Scheme
z Object Instance
z Activity Scheme
z Activity

Definition : Property (Def 7.1)
z A property is a pair (pname, Values), where

pname is the name of the property and
Values is a set.
z Examples :
(height, R+) : “height” property with positive real values.
(primarycolors, {red, green, blue})
(license_plate, X)
(shirtcolor, Colors)

Definition : Object Scheme (Def 7.2)
z An object scheme is a pair (fd, fi) where :

fd is a set of frame-dependent properties,
fi is a set of frame-independent properties, and
fi and fd are disjoint sets.
z The elements of fd and fi are properties as defined in Def 7.1

Definition : Object Instance (Def 7.3)
z An object instance is a triple (oid, os, ip), where :

oid is a string called the object-id,
os = (fd, fi) is an object scheme (see def 7.2)
ip is a set of statements such that
for each property in fi, ip contains at most one property instance , and
for each property in fd and each frame f of the video, ip contains at most
one property instance. This property instance is denoted by the expression
pname = v IN f.
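Definitions 7.1 to 7.3 can be mirrored in a small data structure. The sketch below is an assumption-laden illustration, not the textbook's implementation: each set Values is stood in for by a membership-test function, disjointness of fd and fi is checked on property names, and the class and method names are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

# A membership test standing in for the (possibly infinite) set "Values".
Values = Callable[[object], bool]


@dataclass
class ObjectScheme:
    fd: Dict[str, Values]  # frame-dependent properties, keyed by pname
    fi: Dict[str, Values]  # frame-independent properties, keyed by pname

    def __post_init__(self):
        overlap = set(self.fd) & set(self.fi)
        if overlap:
            raise ValueError(f"fd and fi must be disjoint, shared: {overlap}")


@dataclass
class ObjectInstance:
    oid: str
    os: ObjectScheme
    # Dictionaries give "at most one property instance" by construction.
    fi_values: Dict[str, object] = field(default_factory=dict)
    fd_values: Dict[Tuple[str, int], object] = field(default_factory=dict)

    def set_fi(self, pname, v):
        """Record pname = v for a frame-independent property."""
        if not self.os.fi[pname](v):
            raise ValueError(f"{v!r} is not an allowed value for {pname}")
        self.fi_values[pname] = v

    def set_fd(self, pname, v, frame):
        """Record pname = v IN frame for a frame-dependent property."""
        if not self.os.fd[pname](v):
            raise ValueError(f"{v!r} is not an allowed value for {pname}")
        self.fd_values[(pname, frame)] = v


# Hypothetical scheme for people; the value tests are illustrative.
person = ObjectScheme(
    fd={"at": lambda v: isinstance(v, str)},
    fi={"age": lambda v: isinstance(v, int) and v >= 0,
        "height": lambda v: isinstance(v, (int, float)) and v > 0},
)
jane = ObjectInstance("Jane Shady", person)
jane.set_fi("age", 35)
jane.set_fi("height", 170)
jane.set_fd("at", "path_front", frame=1)
jane.set_fd("at", "path_middle", frame=2)
```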

Sample : Frame-dependent property
Frame  Objects        Frame-dependent properties
1      Jane Shady     Has(briefcase), at(path_front)
       Dopeman_house  Door(closed)
       briefcase
2      Jane Shady     Has(briefcase), at(path_middle)
       Denis Dopeman  At(door)
       Dopeman_house  Door(open)
       briefcase
3      Jane Shady     Has(briefcase), at(door)
       Denis Dopeman  At(door)
       briefcase
4      Jane Shady     At(door)
       Denis Dopeman  Has(briefcase), At(door)
       briefcase
5      Jane Shady     at(middle_path)
       Dopeman_house  Door(closed)
       briefcase

Sample : Frame-independent property
Object         Frame-independent property  Value
Jane Shady     Age                         35
               height                      170 cm
Dopeman_house  Address                     6717 Pimmit Drive, Falls Church, VA 22047
               Type                        Brick
               color                       brown
Denis Dopeman  Age                         56
               height                      186 cm
briefcase      Color                       Black
               Length                      40 cm
               width                       31 cm

Definition : Activity Scheme (Def 7.4)
z An activity scheme, ACT_SCH, is a finite set of properties

such that if (pname, Values1) and (pname, Values2) are both
in ACT_SCH, then Values1=Values2.
z Example : the scheme of activity ExchangeObject has three pairs

(Giver, Person)
(Receiver, Person)
(Item, Thing)
An activity of this scheme : Giver = Jane Shady, Receiver = Denis Dopeman, Item = briefcase

Definition : Activity (Def 7.5)
z An Activity is a pair consisting of

AcID, the name of the activity of scheme ACT_SCH (Def 7.4)
for each pair(pname, Values) ∈ ACT_SCH, an equation of the
form pname=v, where v ∈ Values
z Example :
Activity Lecturing may have the
scheme {(Lecturer, Person), (Topic,
String)} and equations :
Lecturer = Prof. Felix
Topic = Video Databases
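Definitions 7.4 and 7.5 can be sketched as a dictionary plus a small validating constructor. Storing the scheme as a dictionary keyed by pname automatically satisfies the condition that the same pname cannot appear with two different value sets. The function name `make_activity` and the finite sets used for Person and Thing are assumptions for this example.

```python
# Hypothetical finite value sets for the ExchangeObject example.
PERSON = {"Jane Shady", "Denis Dopeman"}
THING = {"briefcase"}

# An activity scheme: pname -> Values. A dict cannot hold the same pname
# twice, so (pname, Values1) and (pname, Values2) with Values1 != Values2
# cannot both occur.
EXCHANGE_OBJECT = {"Giver": PERSON, "Receiver": PERSON, "Item": THING}


def make_activity(ac_id, scheme, equations):
    """Build an activity (AcID, equations) after checking Def 7.5:
    one equation pname = v per property in the scheme, with v in Values."""
    if set(equations) != set(scheme):
        raise ValueError("exactly one equation is required per scheme property")
    for pname, v in equations.items():
        if v not in scheme[pname]:
            raise ValueError(f"{v!r} is not in Values of {pname}")
    return (ac_id, dict(equations))


exchange = make_activity(
    "ExchangeObject",
    EXCHANGE_OBJECT,
    {"Giver": "Jane Shady", "Receiver": "Denis Dopeman", "Item": "briefcase"},
)
```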

Definition : Video Content (Def 7.6)
z Let us suppose that framenum(v) specifies the total number
of frames of video v. The content of v consists of a
triple (OBJ, AC, λ) where :
OBJ = {oid1, …, oidn} is a finite set of object instances,
AC = {AcID1, …, AcIDk} is a finite set of activities/events, and
λ is a map from {1, …, framenum(v)} to 2^(OBJ ∪ AC), the set of all subsets of OBJ ∪ AC
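A frame-based content triple can be sketched directly as two sets and a dictionary. The example reuses the five frames of the Jane Shady sample; placing the ExchangeObject activity in frames 3 and 4 is an assumption made only for illustration, as is the helper name `make_content`.

```python
def make_content(obj, ac, lam, framenum):
    """Validate and return the triple (OBJ, AC, lambda).

    lam maps a frame number to the objects/activities present in it.
    Frames missing from lam are treated as empty, and every entry must
    come from OBJ or AC.
    """
    obj, ac = frozenset(obj), frozenset(ac)
    universe = obj | ac
    full = {}
    for f in range(1, framenum + 1):
        present = frozenset(lam.get(f, ()))
        if not present <= universe:
            unknown = sorted(present - universe)
            raise ValueError(f"frame {f} mentions unknown items: {unknown}")
        full[f] = present
    return obj, ac, full


OBJ = {"Jane Shady", "Denis Dopeman", "Dopeman_house", "briefcase"}
AC = {"ExchangeObject"}
LAM = {
    1: {"Jane Shady", "Dopeman_house", "briefcase"},
    2: {"Jane Shady", "Denis Dopeman", "Dopeman_house", "briefcase"},
    3: {"Jane Shady", "Denis Dopeman", "briefcase", "ExchangeObject"},
    4: {"Jane Shady", "Denis Dopeman", "briefcase", "ExchangeObject"},
    5: {"Jane Shady", "Dopeman_house", "briefcase"},
}

obj, ac, lam = make_content(OBJ, AC, LAM, framenum=5)
# Number of (frame, item) pairs a frame-based representation has to store.
entries = sum(len(items) for items in lam.values())
```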

Problem : large number of frames!
z If we specify λ for each frame, this would be inefficient,
because (for example) a typical 90-minute movie, running
at 30 fps, would consist of approximately 90 x 60 x 30 =
162,000 frames!
z Associating every object and activity with each
individual frame may be computationally infeasible.
z We need to find a more compact representation.

z A more feasible representation is “segment based” rather
than “frame based”.

Definition : Video Library
z A video library, VidLib, consists of 5-tuples
(VidContent, Vid_Id, frame_num, R, plm), where
VidContent is the content of the video,
Vid_Id is the name of the video,
frame_num is the number of frames in the video,
R is a set of relations about the video as a whole, and
plm is a placement mapping that specifies the address of different
parts of the video

Organization of Simple Video Library
VideoContent   Vid_Id     framenum  R            plm
Video          Vid1.mpg   9999      Date, place  Placement
Content        Vid2.mpg   4000      Date, place  mapping
Structure      Vid3.mpg   16000     Date, place  representation
               ...        …         ...
               Vid20.mpg  5000      Date, place

Frame Sequence (Def 7.8)
z A frame sequence is a pair [i,j), where 1 ≤ i ≤ j ≤ n.

It represents the set of all frames between i (inclusive) and j
(non-inclusive).
z Example :
frame sequence [6,12) denotes the set of
frames {6, 7, 8, 9, 10, 11}

Partial Ordering (Def 7.9)
z The partial ordering symbol is ⊑ (a square-cornered ⊆)

z [i1, j1) ⊑ [i2, j2) iff i1 < j1 ≤ i2 < j2.
Intuitively, partial ordering means that the sequence of frames
[i1, j1) precedes the sequence of frames denoted by [i2, j2).
z [i1, j1) ⊏ [i2, j2) is the same ordering with the extra condition j1 ≠ i2

Well ordered (Def 7.10)
z A set X of frame sequences is said to be well-ordered iff

X is finite, i.e., X = {[i1, j1), …, [ir, jr)} for some integer r, and
[i1, j1) ⊑ [i2, j2) ⊑ … ⊑ [ir, jr)

Solid Set of Frame Sequences (Def 7.11)
z A set X of frame sequences is said to be solid iff

X is well ordered, and
there is no pair of frame sequences in X of the form [i1, i2) and [i2, i3)
Practical view : if two or more sequences are directly connected (one ends
where the next begins) but are kept as separate sequences, the set is not solid.
z Example :
X = {[1,5),[5,7),[9,11)}
X is well ordered, but X is not solid --> [1,5) and [5,7)
X = {[1,7),[9,11)}
X is well ordered, and solid
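Definitions 7.8 to 7.11 translate almost line by line into code. In this sketch a frame sequence [i,j) is represented as the tuple (i, j); the function names are assumptions. Checking only neighbours after sorting is enough here because the ordering is transitive.

```python
def frames(seq):
    """The set of frames denoted by the frame sequence [i, j)."""
    i, j = seq
    return set(range(i, j))


def precedes(a, b):
    """Def 7.9: [i1, j1) precedes [i2, j2) iff i1 < j1 <= i2 < j2."""
    (i1, j1), (i2, j2) = a, b
    return i1 < j1 <= i2 < j2


def is_well_ordered(X):
    """Def 7.10: the sequences of X can be arranged in one chain.

    If such a chain exists it is sorted by start frame, so it suffices
    to sort and check each neighbouring pair.
    """
    ordered = sorted(X)
    return all(precedes(a, b) for a, b in zip(ordered, ordered[1:]))


def is_solid(X):
    """Def 7.11: well ordered, and no sequence ends exactly where
    another one begins."""
    ordered = sorted(X)
    return is_well_ordered(X) and all(
        a[1] != b[0] for a, b in zip(ordered, ordered[1:]))


if __name__ == "__main__":
    print(sorted(frames((6, 12))))
    print(is_well_ordered({(1, 5), (5, 7), (9, 11)}), is_solid({(1, 5), (5, 7), (9, 11)}))
    print(is_well_ordered({(1, 7), (9, 11)}), is_solid({(1, 7), (9, 11)}))
```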

Segment Association Map (Def 7.12)
z Suppose (OBJ, AC, λ) represents the content of video v.

A segment association map σV associated with video v is
the map defined as follows :
σV’s domain is (OBJ ∪ AC)
σV returns, for each x ∈ (OBJ ∪ AC), a solid set of frame
sequences, denoted σV(x), such that :
if [s,e) ∈ σV(x), then for all s ≤ f < e, it is the case that x ∈ λ(f), and
for all frames f and all x ∈ OBJ ∪ AC, if x ∈ λ(f), then there exists a
frame sequence [s,e) ∈ σV(x) such that f ∈ [s,e)
z In short, if (OBJ, AC, λ) is the content of video v, then
replacing the map λ by the map σV may often represent
a substantial saving.
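One way to obtain such a map from a frame-based λ is a single pass that records maximal runs of consecutive frames. This is a sketch under assumptions, not the textbook's algorithm: λ is a dictionary from frame number to a set of items, frame sequences [s,e) are tuples (s, e), and the function name is hypothetical. Because each run is maximal, two runs of the same item are always separated by at least one frame, which is what makes each σV(x) solid.

```python
def segment_association_map(lam, framenum):
    """Build sigma: item -> list of maximal frame sequences (s, e).

    lam maps frame numbers 1..framenum to the set of objects and
    activities present in that frame; missing frames count as empty.
    """
    sigma = {}
    open_start = {}  # item -> first frame of its currently open run
    # Iterate one step past the last frame so that open runs get closed.
    for f in range(1, framenum + 2):
        present = lam.get(f, frozenset()) if f <= framenum else frozenset()
        for x in list(open_start):
            if x not in present:
                sigma.setdefault(x, []).append((open_start.pop(x), f))
        for x in present:
            open_start.setdefault(x, f)
    return sigma


if __name__ == "__main__":
    lam = {1: {"a"}, 2: {"a", "b"}, 3: {"b"}, 4: set(), 5: {"a"}}
    sigma = segment_association_map(lam, framenum=5)
    frame_entries = sum(len(items) for items in lam.values())
    segment_entries = sum(len(segs) for segs in sigma.values())
    print(sigma)
    print(frame_entries, "frame-based entries vs", segment_entries, "segments")
```

The saving grows with the length of the runs: an object visible for thousands of consecutive frames costs one segment instead of thousands of per-frame entries.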

Sample
z Frame oriented
Object1 = 1250 frames
Object2 = 1500 frames
z Segment oriented
Object1 : fr 250-750, fr 1750-2500
Object2 : fr 250-1000, fr 2250-2500
Object3 : fr 1000-1750, fr 3250-5000
…...

