1.1 Statement of the Problem

Given an unordered tree T with a random set of nodes N = {N1, N2, N3, ..., Nn-1, Nn, null}, where n is the total number of nodes of tree T. Every node t ∈ N has children t = {left, right}, where left is the left child and right is the right child of t, with left ≠ t, right ≠ t, and left ≠ right. Here the left and right children have no value relationship with their parent node t, because the set N has no order. So, searching for a certain datum d in a tree T is a function S(T, d) → N, where N is the node containing the searched data. During this process, if t ≠ d where t ∈ T, the probability that the data can be found in either subtree of t is 50%. As a result, traversing the entire tree and evaluating it node by node is the only solution, which increases the search space. From this observation the researcher found two major problems:

(1) there is no data or information in the nodes of an unordered tree that can serve as a basis for estimating which subtree of the current node has the higher chance of containing an arbitrary datum; (2) there is no existing search algorithm for unordered trees that reduces the overall search space when searching for arbitrary data.

1.2 Objective of the Study

Given that in an unordered tree traversing the entire tree is the only solution, because the values of parent and child nodes bear no relation to their arrangement, the objectives of the study, aimed at reducing the number of nodes involved during the search process, are as follows:

(1) to add two data fields to every node of a tree, one representing the left subtree (the left-bit-frequency) and the other representing the right subtree (the right-bit-frequency); each field holds the binary-digit frequency counts of its subtree, where a frequency is the count of 1's over the binary equivalents of all data in that subtree, to be used during the search process; (2) to create a search algorithm that reduces the search space by evaluating the added data of a node as the basis for which sub-tree is evaluated first, in case the data is not in the current node; (3) to test and verify whether the added data and the designed algorithm increase accuracy and reduce the overall search space compared to plain traversal.

1.3 Significance of the Study

This new approach might open a new idea in the field of computer science: for every random datum in a given set there may exist a property of the datum that can be used, and grouped, to estimate the probability or frequency of the portion of the set in which it is located. With this algorithm, a reduced search space is expected in general compared to the traditional approach. The algorithm can be of great help in data mining, biology, and chemistry wherever searching in unordered tree structures is required.


1.4 Scope and Limitations

This study focuses only on rooted binary trees, specifically the searching process. It assumes that the data in a tree are all whole numbers and unique, so that no duplicate data exist; if characters are used, their decimal equivalents are considered. The length of the binary frequency data is limited to 8 positions in our samples, so the data are limited to the range 1 to 255. Deleting nodes from a tree is not covered; the study focuses only on searching.
CHAPTER II

REVIEW OF RELATED LITERATURE

Mining frequent trees is very useful in domains like bioinformatics, web mining, and mining semi-structured data. Mohammed J. Zaki [1] conducted a study in 2005 titled Efficiently Mining Frequent Embedded Unordered Trees. It introduces the SLEUTH algorithm for mining frequent, unordered, embedded subtrees in a database of labeled trees. Before SLEUTH can be used, it performs several phases: (1) it uses an algorithm that enumerates all embedded, unordered trees [1, p. 3]; (2) prefix extension is used to correctly enumerate all ordered embedded or induced trees [1, p. 8], and for unordered trees a further check determines whether the new extension is the canonical form for its automorphism group, in which case it is a valid extension [1, p. 8]; once all the candidates are enumerated, it extends the notion of scope-list joins to compute the frequency of unordered trees; (3) SLEUTH uses scope-list joins for fast frequency computation for a new extension; (4) SLEUTH is more efficient than SLEUTH-F K F2, and is generally comparable to TreeMiner, which mines only ordered subtrees, even though SLEUTH has to check whether a subtree is in canonical form [1, p. 18]. In that study, phase (3) performs a frequency computation for embedded unordered subtrees, whereas in our study we perform frequency counting of 1-bits at the same position.

In the same year, a study titled Mining Databases of Labeled Trees using Canonical Forms [Yun Chi, 2005] addressed one important issue in mining databases of labeled rooted trees: finding frequently occurring subtrees. In his study, the number of transactions in the database that support an itemset S is called the frequency of the itemset. The frequency of a candidate itemset needs to be counted in order to determine whether the itemset is frequent. The first frequency counting method is based on direct checking: for each transaction, the frequency of all candidate itemsets supported by the transaction is increased by one [2, p. 26]. That study and ours both use frequency counting, but they differ in what is being counted, because our study counts the 1-bits of binary data.

In terms of mining frequent itemsets, [Shariq and Abdul, 2006] present a novel bit-vector projection technique they call Projected-Bit-Regions (PBR), together with its implementation Ramp (an itemset mining algorithm). For efficient projection of bit-vectors, the goal of the projection is to combine bitwise only those regions of the head bit-vector bitmap(head) with the tail item X bit-vector bitmap(X) that contain a value greater than zero, and to skip all others. With projection using PBR, each node Y of the search space contains an array of valid region indexes, PBR_Y, which guides the frequency counting procedure to traverse only those regions that have an index in the array and skip all others [3, p. 2]. This paper relates to our concept in that it checks regions of the head (or parent) node and skips or prunes a sub-area if it shows a zero value, but the process of obtaining the regions differs. The same bit-vector approach is used by [Zahoor, Shariq, Rauf 2007], but their study mines the N-most interesting frequent itemsets [4].

In the same year, [Jose L. Balcazar, Albert Bifet, and Antoni Lozano] proposed a representation of ordered trees, described a combinatorial characterization and some properties, and used them to propose an efficient algorithm for mining frequent closed subtrees from a set of input trees. They then focused on unordered trees and showed that intrinsic characterizations of their representation provide a way of avoiding the repeated exploration of unordered trees, giving an efficient algorithm for mining frequent closed unordered trees [5]. That study involves comparing ordered and unordered trees to avoid repeated exploration of unordered trees; our study could be used at that phase, since it is a search algorithm for unordered trees, and the frequencies we obtain could be used further in their study.

Another study proposed to exploit lattices of itemsets, from which optimal decision trees can be extracted in linear time, giving several strategies to efficiently build these lattices. Their experiments show that under the same constraints DL8 has better test results than C4.5, confirming that exhaustive search does not always imply overfitting. The results also show that DL8 is a useful and interesting tool for learning decision trees under constraints [6]. That study could improve even more if our search were used, because of the unordered tree structure of a decision tree.

Another study compared the memory requirements and support counting performance of the FP-tree and the compressed Patricia trie against several novel variants of vertical bit vectors. First, borrowing ideas from the VLDB domain, they compress vertical bit vectors using WAH encoding. Second, they evaluate the Gray code rank-based transaction reordering scheme and show that, in practice, simple lexicographic ordering, obtained by applying LSB radix sort, outperforms this scheme. Led by these results, they propose HDO, a novel Hamming-distance-based greedy transaction reordering scheme, and aHDO, a linear-time approximation to HDO. They present results of experiments performed on 15 common datasets with varying degrees of sparseness and show that HDO-reordered, WAH-encoded bit vectors can take as little as 5% of the uncompressed space, while aHDO achieves similar compression on sparse datasets. Finally, with results from over a billion database- and data-mining-style frequency query executions, they show that bitmap-based approaches result in up to hundreds of times faster support counting, and that HDO-WAH encoded bitmaps offer the best space-time tradeoff [7]. Unlike our study, which uses the bits of every datum to obtain frequencies, that study represents itemset transactions as bit patterns.

CHAPTER III

RESEARCH METHODOLOGY

The concept of the study is to use the binary representation of every datum of the tree, particularly the 1-bits of the binary number, count their frequency, and store the counts at the parent node. All the binary frequencies of the left subtree are stored in the lbf of the parent node, and those of the right subtree in the rbf.

3.1 Trees

A rooted, labeled tree T = (V, E) is a directed, acyclic, connected graph with V = {0, 1, ..., n} as the set of vertices (or nodes) and E = {(x, y) | x, y ∈ V} as the set of edges. One distinguished vertex r ∈ V is designated the root, and for all x ∈ V there is a unique path from r to x. Further, l: V → L is a labeling function mapping vertices to a set of labels L = {l1, l2, ...}. In an ordered tree the children of each vertex are ordered (i.e., if a vertex has k children, we can designate them as the first child, second child, and so on up to the kth child); otherwise, the tree is unordered. If x, y ∈ V and there is a path from x to y, then x is called an ancestor of y (and y a descendant of x), denoted x ≤_p y, where p is the length of the path from x to y. If x ≤_1 y (i.e., x is an immediate ancestor), then x is called the parent of y, and y the child of x. If x and y have the same parent, x and y are called siblings, and if they have a common ancestor, they are called cousins [1].
Illustration 3.1 Rooted ordered and unordered trees

3.1 Binary equivalence and 1-bit position

For every real number there is an equivalent value in a different numbering system. Here we will use the binary numbering system, since its digits form the set {1, 0} of distinct digits, which provides a sequence of 1's and 0's for a certain datum. Using this idea, we can examine whether the binary equivalents lead us to a simple pattern.

Figure 3.1.1: Decimal number to binary number conversion using the short-cut method

example 1:
  Decimal base   128  64  32  16   8   4   2   1
  10₁₀        =    0   0   0   0   1   0   1   0   (binary number, bits)
  position:        7   6   5   4   3   2   1   0

example 2:
  Decimal base   128  64  32  16   8   4   2   1
  7₁₀         =    0   0   0   0   0   1   1   1
  position:        7   6   5   4   3   2   1   0

example 3:
  Decimal base   128  64  32  16   8   4   2   1
  128₁₀       =    1   0   0   0   0   0   0   0
  137₁₀       =    1   0   0   0   1   0   0   1
  98₁₀        =    0   1   1   0   0   0   1   0
  70₁₀        =    0   1   0   0   0   1   1   0

We can get the decimal equivalent of each binary number by using the equation:

  N = Σ_{i=0..p} b_i · 2^i        Equation 3.1.1: binary number to decimal

This equation will be used to convert a binary number to decimal, and

  x = Σ_{i=0..p} b_i              Equation 3.1.2: sum of 1-bits in a binary number

is the equation used to sum the 1-bits in a binary number, where p is the highest position of the binary number and b_i is the bit value at position i. The position of every bit is very important throughout this study.

Converting the binary number of example 1 to a decimal number:

  decimal number 10₁₀ = 0 0 0 0 1 0 1 0   (positions 7 6 5 4 3 2 1 0), p = 7

  (i)  Σ_{i=0..p} b_i·2^i = 0(2⁰)+1(2¹)+0(2²)+1(2³)+0(2⁴)+0(2⁵)+0(2⁶)+0(2⁷) = 10
  (ii) Σ_{i=0..p} b_i·2^i = 0+2+0+8+0+0+0+0 = 10
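Equations 3.1.1 and 3.1.2 can be sketched in a few lines of Python; the helper names are ours, not from the study:

```python
def bits_of(n, p=7):
    """Return the bits [b_0, b_1, ..., b_p] of n, least significant first."""
    return [(n >> i) & 1 for i in range(p + 1)]

def to_decimal(bits):
    """Equation 3.1.1: N = sum of b_i * 2^i over i = 0..p."""
    return sum(b * 2 ** i for i, b in enumerate(bits))

def one_bit_count(bits):
    """Equation 3.1.2: x = sum of b_i, the number of 1-bits."""
    return sum(bits)

b = bits_of(10)          # 10 = 0000 1010, so b_1 = b_3 = 1
print(to_decimal(b))     # 10
print(one_bit_count(b))  # 2
```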
i=0

3.2 1-Bit Frequency

The result of counting all 1-bits at each position over the binary equivalents of the node data, with respect to the positions up to p, is called the '1-bit frequency', as shown in Figure 3.2.1 below:

Figure 3.2.1: 1-bit frequency of a set of decimal numbers

  Decimal base   128  64  32  16   8   4   2   1
  10₁₀        =    0   0   0   0   1   0   1   0
  7₁₀         =    0   0   0   0   0   1   1   1
  128₁₀       =    1   0   0   0   0   0   0   0
  137₁₀       =    1   0   0   0   1   0   0   1
  98₁₀        =    0   1   1   0   0   0   1   0
  70₁₀        =    0   1   0   0   0   1   1   0
  ------------------------------------------------
  Total: 450₁₀     2   2   1   0   2   2   4   2   (1-bit frequency)
  position:        7   6   5   4   3   2   1   0

Theorem 3.2.1: If x is the frequency at a given position, then x is also the number of decimal values associated with that specific position.

Proof 3.2.1: 1₁₀ = 1₂ gives a 1-bit frequency of 1 at position 0. The frequency value is 1 at position 0; therefore there is only one number associated with position 0.

Proof 3.2.2: Given the binary numbers,

  1₁₀ = 0 0 1
  2₁₀ = 0 1 0
  4₁₀ = 1 0 0
  3₁₀ = 0 1 1
  -----------
        1 2 2   (1-bit frequencies)
        2 1 0   (positions)

there are two numbers associated with position 0 (the numbers 1 and 3), two numbers associated with position 1 (the numbers 2 and 3), and only one number associated with position 2 (the number 4).

Based on Figure 3.2.1, from all the binary numbers we obtain one set of 1-bit frequencies. By Theorem 3.2.1, every frequency value is associated with some of the decimal numbers.
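The 1-bit frequency of Figure 3.2.1 can be reproduced with a short sketch (a sketch, not the study's implementation; frequencies are stored least-significant-position first, so index i matches position i, whereas the figure lists position 7 first):

```python
def bit_frequency(numbers, p=7):
    """f_i = how many of the numbers have a 1-bit at position i."""
    return [sum((n >> i) & 1 for n in numbers) for i in range(p + 1)]

data = [10, 7, 128, 137, 98, 70]
f = bit_frequency(data)
print(f)                   # [2, 4, 2, 2, 0, 1, 2, 2]  (positions 0..7)
print(list(reversed(f)))   # [2, 2, 1, 0, 2, 2, 4, 2]  as printed in Figure 3.2.1
print(sum(data))           # 450
```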

3.3 Sum and decimal equivalent of frequency

We can get the decimal equivalent of a frequency by using Equation 3.1.1 and replacing b_i with f_i. Using Figure 3.2.1, let f = {2, 2, 1, 0, 2, 2, 4, 2}, let p be the total number of positions, with p = 7, and let N be the decimal number, which is N = 450. We can get its decimal equivalent by:

  N = Σ_{i=0..p} f_i · 2^i        Equation 3.3.1: frequency to decimal number

  (i)   Σ_{i=0..p} f_i(2^i) = 2(2⁰)+4(2¹)+2(2²)+2(2³)+0(2⁴)+1(2⁵)+2(2⁶)+2(2⁷) = 450
  (ii)  Σ_{i=0..p} f_i(2^i) = 2(1)+4(2)+2(4)+2(8)+0(16)+1(32)+2(64)+2(128) = 450
  (iii) Σ_{i=0..p} f_i(2^i) = 2+8+8+16+0+32+128+256 = 450
i=0

With this, we can say that the decimal number 450 can be obtained from the frequency {2, 2, 1, 0, 2, 2, 4, 2}, with decimal contributions per position of {2, 8, 8, 16, 0, 32, 128, 256}. This is a very important aspect of our study, since it reveals how much each position contributes to the whole number 450. Another aspect we consider is the sum of frequency: by Theorem 3.2.1, the sum of frequency is the total number of 1-bits associated with all those numbers. We can get the sum of frequency using Equation 3.1.2 and replacing b_i with f_i. Using Figure 3.2.1, let f = {2, 2, 1, 0, 2, 2, 4, 2}, let p be the total number of positions, p = 7, and let x be the sum of 1-bits, which is 15. Then we can get the sum of frequency by:

  x = Σ_{i=0..p} f_i              Equation 3.3: sum of frequency

  (i) Σ_{i=0..p} f_i = 2+4+2+2+0+1+2+2 = 15
i=0

The equation shows that 15 one-bits are involved in the combinations making up the decimal number 450. With 15 bits in the combination, we can ask: can we compute the probability that a datum exists using the bit combinations? The answer is yes; all we need are the positions of the 1-bits of its binary equivalent.
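Equations 3.3.1 and 3.3 can be checked numerically with a small sketch (frequencies listed least-significant-position first, so index i matches 2^i):

```python
f = [2, 4, 2, 2, 0, 1, 2, 2]  # 1-bit frequencies at positions 0..7 (Figure 3.2.1)

# Equation 3.3.1: N = sum of f_i * 2^i recovers the total of all the data.
N = sum(fi * 2 ** i for i, fi in enumerate(f))
print(N)  # 450, i.e. 10 + 7 + 128 + 137 + 98 + 70

# Equation 3.3: x = sum of f_i is the total number of 1-bits involved.
x = sum(f)
print(x)  # 15
```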

3.4 Probability and combination

With all the ideas and knowledge discussed above, we now come to one important point of our study, which is the probability. Computing the probability is a very important part of our search, since our algorithm will depend on it to decide which subtree to evaluate first. We take it little by little; let us first consider this:

Table 3.4: Frequency and its possible decimal numbers

  Frequency    Total positions   Decimal numbers involved   Total combination of decimal numbers
  {1}          1                 1                          1
  {1,1}        2                 1, 2, 3                    3
  {1,1,1}      3                 1, 2, 3, 4, 5, 6, 7        7
  {1,1,1,1}    4                 1, 2, 3, ..., 14, 15       15

In Table 3.4 we focus only on the positions, since they are obtained from the positions of the 1-bits. The number of positions p must be the number of 1-bits of a certain number x, and with that we can compute the probability of x by:

  P(x) = (1 / C) × 100            Equation 3.4: probability that a number exists for a given frequency

  C = 2^p − 1                     Equation 3.5: total possible combinations of binary numbers

where the sum of frequency comes from Equation 3.3, and C is the total number of possible values that exist within the total positions p. With this probability equation we can now identify the percent chance that a datum exists given the total positions p; i.e., what is the percent chance that the decimal number 7 exists, given that the total position of the 1-bits of the binary number 7 is 2? Using the equations above, we can compute the probability by:

  (1 / (2² − 1)) × 100 = (1 / (4 − 1)) × 100 = (1/3) × 100 = (0.33333) × 100 = 33.33333

So the probability is 33.33%, due to the fact that the frequency {1,1,1} of 7 can also be obtained from other combinations of binary numbers. The combination count is the total number of ways a certain frequency can be obtained. Consider Table 3.4: for the frequency {1} there is of course only one binary number that can represent that frequency, the binary number 1. The frequency {1,1} can be obtained from 3 possible binary numbers: 11, 01, and 10. The frequency {1,1,1,1} can be obtained from 15 possible binary numbers.
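The combination counts in Table 3.4 and the worked example for the number 7 can be sketched as follows (the function names are ours, not from the study):

```python
def combinations(p):
    """Equation 3.5: distinct non-zero binary numbers over p positions."""
    return 2 ** p - 1

def probability(p):
    """Equation 3.4: percent chance that one specific number is present."""
    return 100.0 / combinations(p)

for p in (1, 2, 3, 4):
    print(p, combinations(p))       # 1 -> 1, 2 -> 3, 3 -> 7, 4 -> 15, as in Table 3.4

print(round(probability(2), 5))     # 33.33333, the worked example above
```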

3.5 Most-significant-position (msp) and Less-significant-position (lsp)

We already know that the frequency is the sum over all 1-bits of every binary number with respect to their positions. Splitting the positions into msp and lsp splits that sum accordingly:

  Σ_{i ∈ msp} f_i + Σ_{i ∈ lsp} f_i = Σ_{i=0..p} f_i

For example:

Figure 3.5.1: Binary numbers, frequency, and positions

  10₁₀  = 0 0 0 0 1 0 1 0
  7₁₀   = 0 0 0 0 0 1 1 1
  128₁₀ = 1 0 0 0 0 0 0 0
  ------------------------
  Total: 145₁₀
          1 0 0 0 1 1 2 1   (1-bit frequency)
          7 6 5 4 3 2 1 0   (positions)

Given the frequencies f = {1, 0, 0, 0, 1, 1, 2, 1} from Figure 3.5.1, and letting x = 7 with binary equivalent bits {1, 1, 1} at 1-bit positions {2, 1, 0}, the most-significant-positions (msp) are the 1-bit positions of an arbitrary binary number when mapped onto the positions of a certain frequency. For example:

Figure 3.5.2: Mapping msp to a certain frequency

  x = 7₁₀ = 1 1 1₂
            2 1 0   (positions)

  f = { 1 , 0 , 0 , 0 , 1 , 1 , 2 , 1 }
        7   6   5   4   3   2   1   0   (positions)

  msp = { 2 , 1 , 0 },  lsp = { 7 , 6 , 5 , 4 , 3 }

Figure 3.5.2 shows the mapping of msp onto the positions of the frequency; from it we can also identify the lsp. Every value of msp and lsp also has a position of its own, as shown below:

Figure 3.5.3: Position of every element of msp and lsp

  msp = { 2 , 1 , 0 }          lsp = { 7 , 6 , 5 , 4 , 3 }
          2   1   0 positions          4   3   2   1   0 positions

The less-significant-positions are the positions that are not among the 1-bit positions of the binary number of the arbitrary datum, here the decimal number 7. Another example:

Figure 3.5.4: Mapping msp to a certain frequency

  x = 136₁₀ = 1 0 0 0 1 0 0 0₂
              7 6 5 4 3 2 1 0   (positions)

  f = { 1 , 0 , 0 , 0 , 1 , 1 , 2 , 1 }
        7   6   5   4   3   2   1   0   (positions)

  msp = { 7 , 3 },        lsp = { 6 , 5 , 4 , 2 , 1 , 0 }
          1   0 positions         5   4   3   2   1   0 positions

The lbf information will be used to represent the left sub-tree of a node and the rbf to represent the right sub-tree of the node, as shown below:

Figure 3.5.5: lbf represents the left subtree and rbf represents the right subtree
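Extracting the msp and lsp of a datum is a short computation; this sketch assumes 8 positions and lists positions in ascending order (the figures above list them descending):

```python
def msp_lsp(x, p=7):
    """Split positions 0..p into msp (1-bit positions of x) and lsp (the rest)."""
    msp = [i for i in range(p + 1) if (x >> i) & 1]
    lsp = [i for i in range(p + 1) if not (x >> i) & 1]
    return msp, lsp

print(msp_lsp(7))    # ([0, 1, 2], [3, 4, 5, 6, 7])   -- Figure 3.5.2
print(msp_lsp(136))  # ([3, 7], [0, 1, 2, 4, 5, 6])   -- Figure 3.5.4
```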
3.6 Probability theorem

Now, the most important part of our study is to state a probability theorem, having established frequency, positions, combination, most-significant-positions (msp), and less-significant-positions (lsp). This theorem is the basis of our algorithm as it searches for an arbitrary datum in an unordered tree. The theorem considers two sets of frequencies, since every node of an unordered binary tree has at most two children, the left child and the right child.

Theorem 3.6.1

Let f and f1 be sets of 1-bit frequencies and b the binary number of a number N. Let msp be the set of b's 1-bit positions and lsp the set of b's 0-bit positions, so that msp ∪ lsp covers all positions of b. Let x and y be the lengths of msp and lsp, respectively. Let

  a = Σ_{i ∈ lsp} f_i,   a1 = Σ_{i ∈ lsp} f1_i,   c = Σ_{i ∈ msp} f_i,   c1 = Σ_{i ∈ msp} f1_i.

Then the percentage of f with respect to b's msp and lsp is:

  P(f) = ((a1 + 1) / (a + a1 + 2^c − 1)) × 100     Equation 3.6.1: probability equation for frequency f

and the percentage of f1 with respect to b's msp and lsp is:

  P(f1) = ((a + 1) / (a + a1 + 2^c1 − 1)) × 100    Equation 3.6.2: probability equation for frequency f1

If P(f) > P(f1) with respect to b's msp and lsp, then there is a higher percentage that b is within f compared to f1.

Proof:

  10₁₀  = 0 0 0 0 1 0 1 0
  7₁₀   = 0 0 0 0 0 1 1 1
  128₁₀ = 1 0 0 0 0 0 0 0
  ------------------------
  Total: 145₁₀
          1 0 0 0 1 1 2 1   (f)

  137₁₀ = 1 0 0 0 1 0 0 1
  98₁₀  = 0 1 1 0 0 0 1 0
  70₁₀  = 0 1 0 0 0 1 1 0
  ------------------------
  Total: 305₁₀
          1 2 1 0 1 1 2 1   (f1)
          7 6 5 4 3 2 1 0   (1-bit frequency positions)

Using the above frequencies, f = {1,0,0,0,1,1,2,1} and f1 = {1,2,1,0,1,1,2,1}, let N = 10 with binary form b, and let p be the set of bit positions, such that:

  b = { 0 0 0 0 1 0 1 0 } = 10₁₀
  p = { 7 6 5 4 3 2 1 0 }

  msp = { 3, 1 }                      lsp = { 7, 6, 5, 4, 2, 0 }
  c  = Σ f over msp  = 1 + 2 = 3      a  = Σ f over lsp  = 1+0+0+0+1+1 = 3
  c1 = Σ f1 over msp = 1 + 2 = 3      a1 = Σ f1 over lsp = 1+2+1+0+1+1 = 6

Using our probability equations, we get the percentages of both f and f1 with respect to b:

  P(f)  = ((6 + 1) / (3 + 6 + 2³ − 1)) × 100 = (7/16) × 100 = 43.75 percent
  P(f1) = ((3 + 1) / (3 + 6 + 2³ − 1)) × 100 = (4/16) × 100 = 25 percent

Since P(f) > P(f1), f has the higher probability of containing the number 10.
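The arithmetic of the proof can be replayed in a short sketch (the helper names and the least-significant-first frequency layout are ours):

```python
def bit_frequency(numbers, p=7):
    """f_i = count of 1-bits at position i over all numbers."""
    return [sum((n >> i) & 1 for n in numbers) for i in range(p + 1)]

def score(f_own, f_other, msp, lsp):
    """Equations 3.6.1/3.6.2: percentage that N lies under frequency f_own."""
    a  = sum(f_own[i] for i in lsp)    # own lsp sum
    a1 = sum(f_other[i] for i in lsp)  # other side's lsp sum
    c  = sum(f_own[i] for i in msp)    # own msp sum
    return 100.0 * (a1 + 1) / (a + a1 + 2 ** c - 1)

f  = bit_frequency([10, 7, 128])     # the left-hand frequency set of the proof
f1 = bit_frequency([137, 98, 70])    # the right-hand frequency set
msp = [1, 3]                         # 1-bit positions of N = 10 = 0000 1010
lsp = [0, 2, 4, 5, 6, 7]

print(score(f, f1, msp, lsp))   # 43.75
print(score(f1, f, msp, lsp))   # 25.0
```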

3.7 Data structure

In order for every node to hold bit-frequencies, two data fields are added to the node's basic structure, the lbf and the rbf, so that every time a node is created it is ready to hold bit-frequencies.

Figure 3.7.1: Basic structure model

Below is a sample tree giving an overview of what a binary unordered tree looks like when this basic data structure is applied. For every node that has no left child or right child, the missing child is assumed to be NULL.

Figure 3.7.2: Unordered tree with lbf and rbf data added

Every node holds bit-frequencies: one represents the left sub-tree (the left-bit-frequency, lbf) and one the right sub-tree (the right-bit-frequency, rbf). Each bit-frequency represents an entire sub-tree. The bit-frequencies are updated every time a node is added, but only in the nodes that are part of the backtracking path, and only one of the two bit-frequencies of a node is updated. When the backtracking comes from the left child's sub-tree, only the lbf of the parent node is updated; the same happens with the rbf of a parent node when the backtracking comes from the right sub-tree.

3.8 Obtaining left-bit-frequency and right-bit-frequency

The lbf and rbf are among the most important parts of our study, since obtaining them is its first objective. There are two phases to obtain and update the lbf and rbf: (1) convert the datum N into binary form and get its msp; (2) update the msp positions of the lbf or rbf of all ancestor nodes, based on the msp of datum N, by increasing each msp position's value by 1. The direction of the backtracking decides which frequency of the current node is updated: if the backtracking comes from the left sub-tree of the parent node, then the msp positions of the parent's lbf are updated; otherwise the msp positions of its rbf are updated.

Figure 3.8.1: updating msp of lbf or rbf of ancestors node
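The node structure of Section 3.7 and the update of Section 3.8 can be sketched together as follows; the Node shape and the explicit ancestor path are our assumptions, not the study's implementation:

```python
class Node:
    """A node carrying, besides data and child links, the lbf/rbf arrays."""
    def __init__(self, data, p=7):
        self.data, self.left, self.right = data, None, None
        self.lbf = [0] * (p + 1)  # left-bit-frequency: summarizes the left subtree
        self.rbf = [0] * (p + 1)  # right-bit-frequency: summarizes the right subtree

def msp(x, p=7):
    """1-bit positions of x (phase 1 of Section 3.8)."""
    return [i for i in range(p + 1) if (x >> i) & 1]

def update_ancestors(path, data):
    """Phase 2: path = [(ancestor, 'L' or 'R'), ...], the side each ancestor
    was left on while descending; bump that side's frequency at every msp."""
    for node, side in path:
        freq = node.lbf if side == 'L' else node.rbf
        for i in msp(data):
            freq[i] += 1  # one more 1-bit at position i in that subtree

root = Node(2)
root.left = Node(5)
update_ancestors([(root, 'L')], 5)  # 5 = 101₂: positions 0 and 2
print(root.lbf)  # [1, 0, 1, 0, 0, 0, 0, 0]
```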


3.9 Search

To search for an arbitrary datum, its binary equivalent must be evaluated to obtain the set of its 1-bit positions; this set is called the most-significant-positions (msp), and the positions not in the set are called the less-significant-positions (lsp). Let x be the datum we are going to search for in tree T, and let b be the binary equivalent of x; from b we get the msp and lsp of x.

If the current node's datum is not equal to x, there are two phases of evaluation to determine which sub-tree of the node is evaluated first. Phase 1: using the msp of x, if there is a zero value at any msp position of both lbf and rbf, then it is certain that the datum exists in neither the left nor the right sub-tree, and the search stops. If only the lbf shows a zero value at any msp position, then the left sub-tree is pruned and the search proceeds to the right sub-tree; the same happens symmetrically if only the rbf shows a zero value at any msp position. Phase 2: using the lsp of x, the sums of lbf and rbf over the lsp positions are compared; the search proceeds to the subtree with the lower sum, or to the left sub-tree by default. The derived theorem is used at this phase.

Note that non-zero values in both lbf and rbf at every msp position do not mean the datum exists. For example, the frequency 111 can be obtained from the binary numbers 001, 010, and 100, i.e., the decimal numbers 1, 2, and 4, so when searching for the number 7 we cannot automatically conclude that it exists. Likewise, the frequency 111 can be obtained from the binary number 111 alone, so if we search for 1, 2, and 4 we cannot automatically conclude that all those numbers exist.

Figure 3.4: Arbitrary data that does not exist in the tree

Figure 3.5: Left sub-tree pruned

3.10 Algorithm
BFS(R, D)
1.  t ← NULL
2.  if R equals NULL then goto step 22
3.  else t ← R
4.  if t.data equals D then goto step 22
5.  for every n ∈ msp of D
6.      if LBF[n] equals zero then
7.          prune the left subtree
8.      if RBF[n] equals zero then
9.          prune the right subtree
10. end for
11. sum LBF and RBF over every n ∈ lsp of D
12. if sum of LBF <= sum of RBF then
13.     t ← BFS(left sub-tree of t, D)
14.     if t equals NULL then
15.         t ← BFS(right sub-tree of t, D)
16.     end if
17. else
18.     t ← BFS(right sub-tree of t, D)
19.     if t equals NULL then
20.         t ← BFS(left sub-tree of t, D)
21.     end if
22. return t
23. end
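The pseudocode above can be made runnable as a sketch. The phase-1 pruning test and the phase-2 lsp-sum ordering follow the text; the Node shape and the annotate() helper, which fills lbf/rbf from the actual subtrees instead of insert-time updates, are our assumptions:

```python
class Node:
    def __init__(self, data, p=7):
        self.data, self.left, self.right = data, None, None
        self.lbf = [0] * (p + 1)
        self.rbf = [0] * (p + 1)

def bit_frequency(numbers, p=7):
    return [sum((n >> i) & 1 for n in numbers) for i in range(p + 1)]

def subtree_data(node):
    return [] if node is None else (
        [node.data] + subtree_data(node.left) + subtree_data(node.right))

def annotate(node):
    """Fill lbf/rbf from the actual subtrees (stand-in for insert-time updates)."""
    if node is None:
        return
    node.lbf = bit_frequency(subtree_data(node.left))
    node.rbf = bit_frequency(subtree_data(node.right))
    annotate(node.left)
    annotate(node.right)

def bfs(node, d, p=7):
    if node is None or node.data == d:
        return node
    msp = [i for i in range(p + 1) if (d >> i) & 1]
    lsp = [i for i in range(p + 1) if not (d >> i) & 1]
    candidates = []
    # Phase 1: a zero at any msp position proves d is absent in that subtree.
    if all(node.lbf[i] > 0 for i in msp):
        candidates.append((sum(node.lbf[i] for i in lsp), 0, node.left))
    if all(node.rbf[i] > 0 for i in msp):
        candidates.append((sum(node.rbf[i] for i in lsp), 1, node.right))
    # Phase 2: try the subtree with the smaller lsp sum first (left on ties).
    for _, _, child in sorted(candidates, key=lambda t: (t[0], t[1])):
        found = bfs(child, d, p)
        if found is not None:
            return found
    return None

# Build a small unordered tree 4(2(6, 9), 5(7)) and fill its frequencies.
root = Node(4)
root.left, root.right = Node(2), Node(5)
root.left.left, root.left.right = Node(6), Node(9)
root.right.left = Node(7)
annotate(root)
print(bfs(root, 9).data)  # 9
print(bfs(root, 3))       # None: 3 is not in the tree
```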

3.11 Time complexity: Though the algorithm reduces the number of nodes involved overall, it takes extra time for the filtering and evaluation process.

Table 3.1: Proposed algorithm time complexity

                                       | Time complexity                                       | Search space
  Method                               | Search      | Add frequency      | Total summary      | complexity
  Binary Frequency Search Algorithm    | O(p log n)  | O(2p · k · 2(log n)) | O(p k log n)     | O(p + n)

Based on Table 3.1, p represents the number of frequency positions, and k represents the internal nodes that are candidates for backtracking in case the search misses.

Table 3.2: Time complexity of traversal vs. the proposed algorithm

  Search method   | Worst case     | Best case    | Space complexity
  Traversal       | O(n)           | O(log n)     | O(1 + n)
  BFS             | O(p k log n)   | O(p log n)   | O(p + n)

CHAPTER IV

RESULTS AND DISCUSSIONS

In order to observe and examine the results of the algorithm, two sample unordered trees were created and the data were searched one by one. Every node that is part of the path during the searching process was recorded. The traditional searching method was used first, followed by the designed algorithm (BFSA).

Illustration 4.1: unordered tree with 20 nodes

Table 4.1: Result

  Search | Path, Traditional (A)                                  | Path, BFSA (B)                       | A  | B
  5      | 2,5                                                    | 2,7,15,1,25,23,5                     | 1  | 6
  10     | 2,5,10                                                 | 2,5,10                               | 2  | 2
  6      | 2,5,10,6                                               | 2,5,10,6                             | 3  | 3
  3      | 2,5,10,6,3                                             | 2,7,15,1,25,11,23,5,10,3             | 4  | 9
  30     | 2,5,10,6,3,30                                          | 2,5,10,3,30                          | 5  | 4
  41     | 2,5,10,6,3,30,41                                       | 2,5,10,3,30,41                       | 6  | 5
  18     | 2,5,10,6,3,30,41,18                                    | 2,5,10,3,30,18                       | 7  | 5
  48     | 2,5,10,6,3,30,41,18,48                                 | 2,5,10,3,30,18,48                    | 8  | 6
  4      | 2,5,10,6,3,30,41,18,48,4                               | 2,5,10,6,3,30,18,48,4                | 9  | 8
  13     | 2,5,10,6,3,30,41,18,48,4,13                            | 2,7,15,1,25,5,10,3,30,18,48,13       | 10 | 11
  7      | 2,5,10,6,3,30,41,18,48,4,13,7                          | 2,7                                  | 11 | 1
  20     | 2,5,10,6,3,30,41,18,48,4,13,7,20                       | 2,5,10,3,18,48,30,7,20               | 12 | 8
  15     | 2,5,10,6,3,30,41,18,48,4,13,7,20,15                    | 2,7,15                               | 13 | 2
  1      | 2,5,10,6,3,30,41,18,48,4,13,7,20,15,1                  | 2,7,15,1                             | 14 | 3
  9      | 2,5,10,6,3,30,41,18,48,4,13,7,20,15,1,9                | 2,7,15,1,9                           | 15 | 4
  25     | 2,5,10,6,3,30,41,18,48,4,13,7,20,15,1,9,25             | 2,7,15,1,25                          | 16 | 4
  11     | 2,5,10,6,3,30,41,18,48,4,13,7,20,15,1,9,25,11          | 2,7,15,1,25,11                       | 17 | 5
  23     | 2,5,10,6,3,30,41,18,48,4,13,7,20,15,1,9,25,11,23       | 2,7,15,1,25,23                       | 18 | 5
  8      | 2,5,10,6,3,30,41,18,48,4,13,7,20,15,1,9,25,11,23,8     | 2,7,15,8                             | 19 | 3
  12     | 2,5,10,6,3,30,41,18,48,4,13,7,20,15,1,9,25,11,23,8,ne  | 2,5,10,3,30,18,48,13,7,15,1,25,ne    | 20 | 12
  TOTAL NODES                                                                                          210 | 106

Note: ne = not exist

Based on the result in Table 4.1 and Illustration 4.1, for 20 nodes including the root, traditional search visits a grand total of 210 nodes when searching the data one by one, including a search for a datum that is not in the tree, while the Binary Frequency Search Algorithm visits only 106 nodes. With this result, BFSA reduced the nodes involved by more than 49%.

Illustration 4.2: unordered tree with 11 nodes

Table 4.2: Result

  Search | Path, Traditional (A)       | Path, BFSA (B)  | A  | B
  1      | 4,2,6,9,8,10,5,7,1          | 4,5,7,1         | 8  | 3
  2      | 4,2                         | 4,2             | 1  | 1
  3      | 4,2,6,9,8,10,5,7,1,3        | 4,5,7,3         | 9  | 3
  5      | 4,2,6,9,8,10,5              | 4,5             | 6  | 1
  6      | 4,2,6                       | 4,2,6           | 2  | 2
  7      | 4,2,6,9,8,10,5,7            | 4,5,7           | 7  | 2
  8      | 4,2,6,9,8                   | 4,2,8           | 4  | 2
  9      | 4,2,6,9                     | 4,2,6,9         | 3  | 3
  10     | 4,2,6,9,8,10                | 4,2,8,10        | 5  | 3
  11     | 4,2,6,9,8,10,5,7,1,3,11     | 4,2,6,5,3,11    | 10 | 5
  TOTAL NODES                                             55 | 25

Based on the result in Table 4.2 and Illustration 4.2, for 11 nodes including the root, traditional search visits a grand total of 55 nodes when searching the data one by one, while the Binary Frequency Search Algorithm visits only 25 nodes. With this result, BFSA reduced the nodes involved by more than 54%.

SIMULATION RESULTS

In order to check the reliability of the algorithm, with respect to how much it reduces the nodes involved and its accuracy when choosing between the two subtrees (the left and the right), a series of simulations was made with random data to create sample trees; random data were used to ensure that every tree constructed during the simulation is unique. Below are the graphs of the simulation results:

Illustration 4.3: Average result of reduced involved nodes
(Unordered tree simulation and searching: percentage of reduced nodes, roughly 56% to 60.5%, plotted against iterations 0 to 2200.)
Based on the results shown in the illustration above, from 10 up to 2000 simulated trees, the algorithm shows an overall efficiency in reducing the nodes involved of greater than 59%.
Illustration 4.4: Average result of accuracy
(Average accuracy during simulation: accuracy percentage, roughly 65.5% to 68.5%, plotted against 0 to 2200 sample unordered trees.)

For the accuracy of the algorithm, based on the illustration above, over samples of up to 2000 simulated unordered trees the algorithm shows an accuracy greater than 66%.

CHAPTER V
CONCLUSIONS AND RECOMMENDATIONS

Based on the results, the algorithm clearly reduces the nodes involved in the search process overall and can easily determine whether a sub-tree must be pruned. Because of the extra filtering and evaluation processes, the algorithm takes extra time, with a time complexity of O(p k log n). The algorithm is best used in data mining search processes where traversing the entire unordered tree is to be avoided and the data sets are unique.

For further studies, the researcher recommends enhancing the algorithm so that it eliminates the k in the O(p k log n) time complexity, in which case the algorithm becomes faster and more precise in its searching than traversal. The algorithm can also be used for mining frequent subtrees or embedded subtrees.

APPENDICES
A. Glossary of Terms

frequency: The total count of 1-bits in a column.

less-significant-positions (lsp): The set of 0-bit position(s) with respect to the binary equivalent of the arbitrary datum to be searched.

most-significant-positions (msp): The set of 1-bit position(s) with respect to the binary equivalent of the arbitrary datum to be searched.

left-bit-frequency (lbf): A data field containing frequencies that represent the left sub-tree of the current node.

right-bit-frequency (rbf): A data field containing frequencies that represent the right sub-tree of the current node.
BIBLIOGRAPHY

[1] Mohammed J. Zaki: Efficiently Mining Frequent Embedded Unordered Trees, 2005.
[2] Yun Chi: Mining Databases of Labeled Trees using Canonical Forms, 2005.
[3] Shariq Bashir and Abdul Rauf Baig: Ramp: High Performance Frequent Itemset Mining with Efficient Bit-Vector Projection Technique, 2007.
[4] Zahoor Jan, Shariq Bashir, A. Rauf Baig: Applying Bit-Vector Projection for Mining of N-Most Interesting Frequent Itemsets, 2007.
[5] Jose L. Balcazar, Albert Bifet and Antoni Lozano: Mining Frequent Closed Unordered Trees Through Natural Representations, 2007.
[6] Siegfried Nijssen, Elisa Fromont: Mining Optimal Decision Trees from Itemset Lattices, 2007.
[7] Hassan H. Malik and John R. Kender: Optimizing Frequency Queries for Data Mining Applications, 2007.
