1.1 Statement of the Problem

Given an unordered tree T with a random set of nodes N = {N1, N2, N3, ..., Nn-1, Nn, null}, where n is the total number of nodes of tree T. Every node t ∈ N has children t = {left, right}, where left is the left child and right is the right child of t, with left ≠ t, right ≠ t, and left ≠ right. Here the left and right children have no value relationship with their parent node t, because the set N has no order. So, searching for a certain datum d in a tree T is a function S(T, d) → N, where N is the node containing the searched data. During this process, if t ≠ d where t ∈ T, the probability that the data can be found in either subtree of t is 50%. As a result, traversing the entire tree and evaluating it node by node is the only solution, which increases the search space. From this observation the researcher found two major problems:

(1) there is no data or information in the nodes of an unordered tree that can serve as a basis for estimating which subtree of the current node has the higher chance of containing an arbitrary datum; (2) there is no existing search algorithm for unordered trees that reduces the overall search space when searching for arbitrary data.

1.2 Objective of the Study

Given that in an unordered tree traversing the entire tree is the only solution, because the values of parent and child nodes bear no relation to their arrangement, the objectives of the study, aimed at reducing the number of nodes involved during the search process, are as follows:

(1) to add two data fields to every node of a tree, one representing the left subtree (the left-bit-frequency) and the other representing the right subtree (the right-bit-frequency); each field holds the binary-digit frequency counts of its subtree, where a frequency is the count of 1's over the binary equivalents of all data in that subtree, to be used during the search process; (2) to create a search algorithm that reduces the search space by evaluating the added data of a node as the basis for which sub-tree is evaluated first, in case the data is not in the current node; (3) to test and verify whether the added data and the designed algorithm increase accuracy and reduce the overall search space compared to plain traversal.

1.3 Significance of the Study

This new approach might open a new idea in the field of computer science: for every random datum in a given set there may exist a property of the datum that can be used, and grouped, to estimate the probability or frequency of the portion of the set in which it is located. With this algorithm, a reduced search space is expected in general compared to the traditional approach. The algorithm can be of great help in data mining, biology, and chemistry wherever searching in unordered tree structures is required.


1.4 Scope and Limitations

This study focuses only on rooted binary trees, specifically the searching process. It assumes that the data in a tree are all whole numbers and unique, so that no duplicate data exist; if characters are used, their decimal equivalents are considered. The length of the binary frequency data is limited to 8 positions in our samples, so the data are limited to the range 1 to 255. Deleting nodes from a tree is not covered; the study focuses only on searching.
CHAPTER II

REVIEW OF RELATED LITERATURE

Mining frequent trees is very useful in domains like bioinformatics, web mining, and mining semi-structured data. Mohammed J. Zaki [1] conducted a study in 2005 titled Efficiently Mining Frequent Embedded Unordered Trees. It introduces the SLEUTH algorithm for mining frequent, unordered, embedded subtrees in a database of labeled trees. Before SLEUTH can be used, it performs several phases: (1) it uses an algorithm that enumerates all embedded, unordered trees [1, p. 3]; (2) prefix extension is used to correctly enumerate all ordered embedded or induced trees [1, p. 8], and for unordered trees a further check determines whether the new extension is the canonical form for its automorphism group, in which case it is a valid extension [1, p. 8]; once all the candidates are enumerated, it extends the notion of scope-list joins to compute the frequency of unordered trees; (3) SLEUTH uses scope-list joins for fast frequency computation for a new extension; (4) SLEUTH is more efficient than SLEUTH-F K F2, and is generally comparable to TreeMiner, which mines only ordered subtrees, even though SLEUTH has to check whether a subtree is in canonical form [1, p. 18]. In that study, phase (3) performs a frequency computation for embedded unordered subtrees, whereas in our study we perform frequency counting of 1-bits at the same position.

In the same year, a study titled Mining Databases of Labeled Trees using Canonical Forms [Yun Chi, 2005] addressed one important issue in mining databases of labeled rooted trees: finding frequently occurring subtrees. In his study, the number of transactions in the database that support an itemset S is called the frequency of the itemset. The frequency of a candidate itemset needs to be counted in order to determine whether the itemset is frequent. The first frequency counting method is based on direct checking: for each transaction, the frequency of all candidate itemsets supported by the transaction is increased by one [2, p. 26]. That study and ours both use frequency counting, but they differ in what is being counted, because our study counts the 1-bits of binary data.

In terms of mining frequent itemsets, [Shariq and Abdul, 2006] present a novel bit-vector projection technique they call Projected-Bit-Regions (PBR), together with its implementation Ramp (an itemset mining algorithm). For efficient projection of bit-vectors, the goal of the projection is to combine bitwise only those regions of the head bit-vector bitmap(head) with the tail item X bit-vector bitmap(X) that contain a value greater than zero, and to skip all others. With projection using PBR, each node Y of the search space contains an array of valid region indexes, PBR_Y, which guides the frequency counting procedure to traverse only those regions that have an index in the array and skip all others [3, p. 2]. This paper relates to our concept in that it checks regions of the head (or parent) node and skips or prunes a sub-area if it shows a zero value, but the process of obtaining the regions differs. The same bit-vector approach is used by [Zahoor, Shariq, Rauf 2007], but their study mines the N-most interesting frequent itemsets [4].

In the same year, [Jose L. Balcazar, Albert Bifet, and Antoni Lozano] proposed a representation of ordered trees, described a combinatorial characterization and some properties, and used them to propose an efficient algorithm for mining frequent closed subtrees from a set of input trees. They then focused on unordered trees and showed that intrinsic characterizations of their representation provide a way of avoiding the repeated exploration of unordered trees, giving an efficient algorithm for mining frequent closed unordered trees [5]. That study involves comparing ordered and unordered trees to avoid repeated exploration of unordered trees; our study could be used at that phase, since it is a search algorithm for unordered trees, and the frequencies we obtain could be used further in their study.

Another study proposed to exploit lattices of itemsets, from which optimal decision trees can be extracted in linear time, giving several strategies to efficiently build these lattices. Their experiments show that under the same constraints DL8 has better test results than C4.5, confirming that exhaustive search does not always imply overfitting. The results also show that DL8 is a useful and interesting tool for learning decision trees under constraints [6]. That study could improve even more if our search were used, because of the unordered tree structure of a decision tree.

Another study compared the memory requirements and support counting performance of the FP-tree and the compressed Patricia trie against several novel variants of vertical bit vectors. First, borrowing ideas from the VLDB domain, they compress vertical bit vectors using WAH encoding. Second, they evaluate the Gray code rank-based transaction reordering scheme and show that, in practice, simple lexicographic ordering, obtained by applying LSB radix sort, outperforms this scheme. Led by these results, they propose HDO, a novel Hamming-distance-based greedy transaction reordering scheme, and aHDO, a linear-time approximation to HDO. They present results of experiments performed on 15 common datasets with varying degrees of sparseness and show that HDO-reordered, WAH-encoded bit vectors can take as little as 5% of the uncompressed space, while aHDO achieves similar compression on sparse datasets. Finally, with results from over a billion database- and data-mining-style frequency query executions, they show that bitmap-based approaches result in up to hundreds of times faster support counting, and that HDO-WAH encoded bitmaps offer the best space-time tradeoff [7]. Unlike our study, which uses the bits of every datum to obtain frequencies, that study represents itemset transactions as bit patterns.

CHAPTER III

RESEARCH METHODOLOGY

The concept of the study is to use the binary representation of every datum of the tree, particularly the 1-bits of the binary number, count their frequency, and store the counts at the parent node. All the binary frequencies of the left subtree are stored in the lbf of the parent node, and those of the right subtree in the rbf.

3.1 Trees

A rooted, labeled tree T = (V, E) is a directed, acyclic, connected graph with V = {0, 1, ..., n} as the set of vertices (or nodes) and E = {(x, y) | x, y ∈ V} as the set of edges. One distinguished vertex r ∈ V is designated the root, and for all x ∈ V there is a unique path from r to x. Further, l: V → L is a labeling function mapping vertices to a set of labels L = {l1, l2, ...}. In an ordered tree the children of each vertex are ordered (i.e., if a vertex has k children, we can designate them as the first child, second child, and so on up to the kth child); otherwise, the tree is unordered. If x, y ∈ V and there is a path from x to y, then x is called an ancestor of y (and y a descendant of x), denoted x ≤_p y, where p is the length of the path from x to y. If x ≤_1 y (i.e., x is an immediate ancestor), then x is called the parent of y, and y the child of x. If x and y have the same parent, x and y are called siblings, and if they have a common ancestor, they are called cousins [1].
Illustration 3.1 Rooted ordered and unordered trees

3.1 Binary equivalence and 1-bit position

For every real number there is an equivalent value in a different numbering system. Here we will use the binary numbering system, since its digits form the set {1, 0} of distinct digits, which provides a sequence of 1's and 0's for a certain datum. Using this idea, we can examine whether the binary equivalents lead us to a simple pattern.

Figure 3.1.1: Decimal number to binary number conversion using the short-cut method

example 1:
  Decimal base   128  64  32  16   8   4   2   1
  10₁₀        =    0   0   0   0   1   0   1   0   (binary number, bits)
  position:        7   6   5   4   3   2   1   0

example 2:
  Decimal base   128  64  32  16   8   4   2   1
  7₁₀         =    0   0   0   0   0   1   1   1
  position:        7   6   5   4   3   2   1   0

example 3:
  Decimal base   128  64  32  16   8   4   2   1
  128₁₀       =    1   0   0   0   0   0   0   0
  137₁₀       =    1   0   0   0   1   0   0   1
  98₁₀        =    0   1   1   0   0   0   1   0
  70₁₀        =    0   1   0   0   0   1   1   0

We can get the decimal equivalent of each binary number by using the equation:

  N = Σ_{i=0..p} b_i · 2^i        Equation 3.1.1: binary number to decimal

This equation will be used to convert a binary number to decimal, and

  x = Σ_{i=0..p} b_i              Equation 3.1.2: sum of 1-bits in a binary number

is the equation used to sum the 1-bits in a binary number, where p is the highest position of the binary number and b_i is the bit value at position i. The position of every bit is very important throughout this study.

Converting the binary number of example 1 to a decimal number:

  decimal number 10₁₀ = 0 0 0 0 1 0 1 0   (positions 7 6 5 4 3 2 1 0), p = 7

  (i)  Σ_{i=0..p} b_i·2^i = 0(2⁰)+1(2¹)+0(2²)+1(2³)+0(2⁴)+0(2⁵)+0(2⁶)+0(2⁷) = 10
  (ii) Σ_{i=0..p} b_i·2^i = 0+2+0+8+0+0+0+0 = 10
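Equations 3.1.1 and 3.1.2 can be sketched in a few lines of Python; the helper names are ours, not from the study:

```python
def bits_of(n, p=7):
    """Return the bits [b_0, b_1, ..., b_p] of n, least significant first."""
    return [(n >> i) & 1 for i in range(p + 1)]

def to_decimal(bits):
    """Equation 3.1.1: N = sum of b_i * 2^i over i = 0..p."""
    return sum(b * 2 ** i for i, b in enumerate(bits))

def one_bit_count(bits):
    """Equation 3.1.2: x = sum of b_i, the number of 1-bits."""
    return sum(bits)

b = bits_of(10)          # 10 = 0000 1010, so b_1 = b_3 = 1
print(to_decimal(b))     # 10
print(one_bit_count(b))  # 2
```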
i=0

3.2 1-Bit Frequency

The result of counting all 1-bits at each position over the binary equivalents of the node data, with respect to the positions up to p, is called the '1-bit frequency', as shown in Figure 3.2.1 below:

Figure 3.2.1: 1-bit frequency of a set of decimal numbers

  Decimal base   128  64  32  16   8   4   2   1
  10₁₀        =    0   0   0   0   1   0   1   0
  7₁₀         =    0   0   0   0   0   1   1   1
  128₁₀       =    1   0   0   0   0   0   0   0
  137₁₀       =    1   0   0   0   1   0   0   1
  98₁₀        =    0   1   1   0   0   0   1   0
  70₁₀        =    0   1   0   0   0   1   1   0
  ------------------------------------------------
  Total: 450₁₀     2   2   1   0   2   2   4   2   (1-bit frequency)
  position:        7   6   5   4   3   2   1   0

Theorem 3.2.1: If x is the frequency at a given position, then x is also the number of decimal values associated with that specific position.

Proof 3.2.1: 1₁₀ = 1₂ gives a 1-bit frequency of 1 at position 0. The frequency value is 1 at position 0; therefore there is only one number associated with position 0.

Proof 3.2.2: Given the binary numbers,

  1₁₀ = 0 0 1
  2₁₀ = 0 1 0
  4₁₀ = 1 0 0
  3₁₀ = 0 1 1
  -----------
        1 2 2   (1-bit frequencies)
        2 1 0   (positions)

there are two numbers associated with position 0 (the numbers 1 and 3), two numbers associated with position 1 (the numbers 2 and 3), and only one number associated with position 2 (the number 4).

Based on Figure 3.2.1, from all the binary numbers we obtain one set of 1-bit frequencies. By Theorem 3.2.1, every frequency value is associated with some of the decimal numbers.
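The 1-bit frequency of Figure 3.2.1 can be reproduced with a short sketch (a sketch, not the study's implementation; frequencies are stored least-significant-position first, so index i matches position i, whereas the figure lists position 7 first):

```python
def bit_frequency(numbers, p=7):
    """f_i = how many of the numbers have a 1-bit at position i."""
    return [sum((n >> i) & 1 for n in numbers) for i in range(p + 1)]

data = [10, 7, 128, 137, 98, 70]
f = bit_frequency(data)
print(f)                   # [2, 4, 2, 2, 0, 1, 2, 2]  (positions 0..7)
print(list(reversed(f)))   # [2, 2, 1, 0, 2, 2, 4, 2]  as printed in Figure 3.2.1
print(sum(data))           # 450
```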

3.3 Sum and decimal equivalent of frequency

We can get the decimal equivalent of a frequency by using Equation 3.1.1 and replacing b_i with f_i. Using Figure 3.2.1, let f = {2, 2, 1, 0, 2, 2, 4, 2}, let p be the total number of positions, with p = 7, and let N be the decimal number, which is N = 450. We can get its decimal equivalent by:

  N = Σ_{i=0..p} f_i · 2^i        Equation 3.3.1: frequency to decimal number

  (i)   Σ_{i=0..p} f_i(2^i) = 2(2⁰)+4(2¹)+2(2²)+2(2³)+0(2⁴)+1(2⁵)+2(2⁶)+2(2⁷) = 450
  (ii)  Σ_{i=0..p} f_i(2^i) = 2(1)+4(2)+2(4)+2(8)+0(16)+1(32)+2(64)+2(128) = 450
  (iii) Σ_{i=0..p} f_i(2^i) = 2+8+8+16+0+32+128+256 = 450
i=0

With this, we can say that the decimal number 450 can be obtained from the frequency {2, 2, 1, 0, 2, 2, 4, 2}, with decimal contributions per position of {2, 8, 8, 16, 0, 32, 128, 256}. This is a very important aspect of our study, since it reveals how much each position contributes to the whole number 450. Another aspect we consider is the sum of frequency: by Theorem 3.2.1, the sum of frequency is the total number of 1-bits associated with all those numbers. We can get the sum of frequency using Equation 3.1.2 and replacing b_i with f_i. Using Figure 3.2.1, let f = {2, 2, 1, 0, 2, 2, 4, 2}, let p be the total number of positions, p = 7, and let x be the sum of 1-bits, which is 15. Then we can get the sum of frequency by:

  x = Σ_{i=0..p} f_i              Equation 3.3: sum of frequency

  (i) Σ_{i=0..p} f_i = 2+4+2+2+0+1+2+2 = 15
i=0

The equation shows that 15 one-bits are involved in the combinations making up the decimal number 450. With 15 bits in the combination, we can ask: can we compute the probability that a datum exists using the bit combinations? The answer is yes; all we need are the positions of the 1-bits of its binary equivalent.
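Equations 3.3.1 and 3.3 can be checked numerically with a small sketch (frequencies listed least-significant-position first, so index i matches 2^i):

```python
f = [2, 4, 2, 2, 0, 1, 2, 2]  # 1-bit frequencies at positions 0..7 (Figure 3.2.1)

# Equation 3.3.1: N = sum of f_i * 2^i recovers the total of all the data.
N = sum(fi * 2 ** i for i, fi in enumerate(f))
print(N)  # 450, i.e. 10 + 7 + 128 + 137 + 98 + 70

# Equation 3.3: x = sum of f_i is the total number of 1-bits involved.
x = sum(f)
print(x)  # 15
```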

3.4 Probability and combination

With all the ideas and knowledge discussed above, we now come to one important point of our study, which is the probability. Computing the probability is a very important part of our search, since our algorithm will depend on it to decide which subtree to evaluate first. We take it little by little; let us first consider this:

Table 3.4: Frequency and its possible decimal numbers

  Frequency    Total positions   Decimal numbers involved   Total combination of decimal numbers
  {1}          1                 1                          1
  {1,1}        2                 1, 2, 3                    3
  {1,1,1}      3                 1, 2, 3, 4, 5, 6, 7        7
  {1,1,1,1}    4                 1, 2, 3, ..., 14, 15       15

In Table 3.4 we focus only on the positions, since they are obtained from the positions of the 1-bits. The number of positions p must be the number of 1-bits of a certain number x, and with that we can compute the probability of x by:

  P(x) = (1 / C) × 100            Equation 3.4: probability that a number exists for a given frequency

  C = 2^p − 1                     Equation 3.5: total possible combinations of binary numbers

where the sum of frequency comes from Equation 3.3, and C is the total number of possible values that exist within the total positions p. With this probability equation we can now identify the percent chance that a datum exists given the total positions p; i.e., what is the percent chance that the decimal number 7 exists, given that the total position of the 1-bits of the binary number 7 is 2? Using the equations above, we can compute the probability by:

  (1 / (2² − 1)) × 100 = (1 / (4 − 1)) × 100 = (1/3) × 100 = (0.33333) × 100 = 33.33333

So the probability is 33.33%, due to the fact that the frequency {1,1,1} of 7 can also be obtained from other combinations of binary numbers. The combination count is the total number of ways a certain frequency can be obtained. Consider Table 3.4: for the frequency {1} there is of course only one binary number that can represent that frequency, the binary number 1. The frequency {1,1} can be obtained from 3 possible binary numbers: 11, 01, and 10. The frequency {1,1,1,1} can be obtained from 15 possible binary numbers.
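The combination counts in Table 3.4 and the worked example for the number 7 can be sketched as follows (the function names are ours, not from the study):

```python
def combinations(p):
    """Equation 3.5: distinct non-zero binary numbers over p positions."""
    return 2 ** p - 1

def probability(p):
    """Equation 3.4: percent chance that one specific number is present."""
    return 100.0 / combinations(p)

for p in (1, 2, 3, 4):
    print(p, combinations(p))       # 1 -> 1, 2 -> 3, 3 -> 7, 4 -> 15, as in Table 3.4

print(round(probability(2), 5))     # 33.33333, the worked example above
```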

3.5 Most-significant-position (msp) and Less-significant-position (lsp)

We already know that the frequency is the sum over all 1-bits of every binary number with respect to their positions. Splitting the positions into msp and lsp splits that sum accordingly:

  Σ_{i ∈ msp} f_i + Σ_{i ∈ lsp} f_i = Σ_{i=0..p} f_i

For example:

Figure 3.5.1: Binary numbers, frequency, and positions

  10₁₀  = 0 0 0 0 1 0 1 0
  7₁₀   = 0 0 0 0 0 1 1 1
  128₁₀ = 1 0 0 0 0 0 0 0
  ------------------------
  Total: 145₁₀
          1 0 0 0 1 1 2 1   (1-bit frequency)
          7 6 5 4 3 2 1 0   (positions)

Given the frequencies f = {1, 0, 0, 0, 1, 1, 2, 1} from Figure 3.5.1, and letting x = 7 with binary equivalent bits {1, 1, 1} at 1-bit positions {2, 1, 0}, the most-significant-positions (msp) are the 1-bit positions of an arbitrary binary number when mapped onto the positions of a certain frequency. For example:

Figure 3.5.2: Mapping msp to a certain frequency

  x = 7₁₀ = 1 1 1₂
            2 1 0   (positions)

  f = { 1 , 0 , 0 , 0 , 1 , 1 , 2 , 1 }
        7   6   5   4   3   2   1   0   (positions)

  msp = { 2 , 1 , 0 },  lsp = { 7 , 6 , 5 , 4 , 3 }

Figure 3.5.2 shows the mapping of msp onto the positions of the frequency; from it we can also identify the lsp. Every value of msp and lsp also has a position of its own, as shown below:

Figure 3.5.3: Position of every element of msp and lsp

  msp = { 2 , 1 , 0 }          lsp = { 7 , 6 , 5 , 4 , 3 }
          2   1   0 positions          4   3   2   1   0 positions

The less-significant-positions are the positions that are not among the 1-bit positions of the binary number of the arbitrary datum, here the decimal number 7. Another example:

Figure 3.5.4: Mapping msp to a certain frequency

  x = 136₁₀ = 1 0 0 0 1 0 0 0₂
              7 6 5 4 3 2 1 0   (positions)

  f = { 1 , 0 , 0 , 0 , 1 , 1 , 2 , 1 }
        7   6   5   4   3   2   1   0   (positions)

  msp = { 7 , 3 },        lsp = { 6 , 5 , 4 , 2 , 1 , 0 }
          1   0 positions         5   4   3   2   1   0 positions

The lbf information will be used to represent the left sub-tree of a node and the rbf to represent the right sub-tree of the node, as shown below:

Figure 3.5.5: lbf represents the left subtree and rbf represents the right subtree
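Extracting the msp and lsp of a datum is a short computation; this sketch assumes 8 positions and lists positions in ascending order (the figures above list them descending):

```python
def msp_lsp(x, p=7):
    """Split positions 0..p into msp (1-bit positions of x) and lsp (the rest)."""
    msp = [i for i in range(p + 1) if (x >> i) & 1]
    lsp = [i for i in range(p + 1) if not (x >> i) & 1]
    return msp, lsp

print(msp_lsp(7))    # ([0, 1, 2], [3, 4, 5, 6, 7])   -- Figure 3.5.2
print(msp_lsp(136))  # ([3, 7], [0, 1, 2, 4, 5, 6])   -- Figure 3.5.4
```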
3.6 Probability theorem

Now, the most important part of our study is to state a probability theorem, having established frequency, positions, combination, most-significant-positions (msp), and less-significant-positions (lsp). This theorem is the basis of our algorithm as it searches for an arbitrary datum in an unordered tree. The theorem considers two sets of frequencies, since every node of an unordered binary tree has at most two children, the left child and the right child.

Theorem 3.6.1

Let f and f1 be sets of 1-bit frequencies and b the binary number of a number N. Let msp be the set of b's 1-bit positions and lsp the set of b's 0-bit positions, so that msp ∪ lsp covers all positions of b. Let x and y be the lengths of msp and lsp, respectively. Let

  a = Σ_{i ∈ lsp} f_i,   a1 = Σ_{i ∈ lsp} f1_i,   c = Σ_{i ∈ msp} f_i,   c1 = Σ_{i ∈ msp} f1_i.

Then the percentage of f with respect to b's msp and lsp is:

  P(f) = ((a1 + 1) / (a + a1 + 2^c − 1)) × 100     Equation 3.6.1: probability equation for frequency f

and the percentage of f1 with respect to b's msp and lsp is:

  P(f1) = ((a + 1) / (a + a1 + 2^c1 − 1)) × 100    Equation 3.6.2: probability equation for frequency f1

If P(f) > P(f1) with respect to b's msp and lsp, then there is a higher percentage that b is within f compared to f1.

Proof:

  10₁₀  = 0 0 0 0 1 0 1 0
  7₁₀   = 0 0 0 0 0 1 1 1
  128₁₀ = 1 0 0 0 0 0 0 0
  ------------------------
  Total: 145₁₀
          1 0 0 0 1 1 2 1   (f)

  137₁₀ = 1 0 0 0 1 0 0 1
  98₁₀  = 0 1 1 0 0 0 1 0
  70₁₀  = 0 1 0 0 0 1 1 0
  ------------------------
  Total: 305₁₀
          1 2 1 0 1 1 2 1   (f1)
          7 6 5 4 3 2 1 0   (1-bit frequency positions)

Using the above frequencies, f = {1,0,0,0,1,1,2,1} and f1 = {1,2,1,0,1,1,2,1}, let N = 10 with binary form b, and let p be the set of bit positions, such that:

  b = { 0 0 0 0 1 0 1 0 } = 10₁₀
  p = { 7 6 5 4 3 2 1 0 }

  msp = { 3, 1 }                      lsp = { 7, 6, 5, 4, 2, 0 }
  c  = Σ f over msp  = 1 + 2 = 3      a  = Σ f over lsp  = 1+0+0+0+1+1 = 3
  c1 = Σ f1 over msp = 1 + 2 = 3      a1 = Σ f1 over lsp = 1+2+1+0+1+1 = 6

Using our probability equations, we get the percentages of both f and f1 with respect to b:

  P(f)  = ((6 + 1) / (3 + 6 + 2³ − 1)) × 100 = (7/16) × 100 = 43.75 percent
  P(f1) = ((3 + 1) / (3 + 6 + 2³ − 1)) × 100 = (4/16) × 100 = 25 percent

Since P(f) > P(f1), f has the higher probability of containing the number 10.
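The arithmetic of the proof can be replayed in a short sketch (the helper names and the least-significant-first frequency layout are ours):

```python
def bit_frequency(numbers, p=7):
    """f_i = count of 1-bits at position i over all numbers."""
    return [sum((n >> i) & 1 for n in numbers) for i in range(p + 1)]

def score(f_own, f_other, msp, lsp):
    """Equations 3.6.1/3.6.2: percentage that N lies under frequency f_own."""
    a  = sum(f_own[i] for i in lsp)    # own lsp sum
    a1 = sum(f_other[i] for i in lsp)  # other side's lsp sum
    c  = sum(f_own[i] for i in msp)    # own msp sum
    return 100.0 * (a1 + 1) / (a + a1 + 2 ** c - 1)

f  = bit_frequency([10, 7, 128])     # the left-hand frequency set of the proof
f1 = bit_frequency([137, 98, 70])    # the right-hand frequency set
msp = [1, 3]                         # 1-bit positions of N = 10 = 0000 1010
lsp = [0, 2, 4, 5, 6, 7]

print(score(f, f1, msp, lsp))   # 43.75
print(score(f1, f, msp, lsp))   # 25.0
```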

3.7 Data structure

In order for every node to hold bit-frequencies, two data fields are added to the node's basic structure, the lbf and the rbf, so that every time a node is created it is ready to hold bit-frequencies.

Figure 3.7.1: Basic structure model

Below is a sample tree giving an overview of what a binary unordered tree looks like when this basic data structure is applied. For every node that has no left child or right child, the missing child is assumed to be NULL.

Figure 3.7.2: Unordered tree with lbf and rbf data added

Every node holds bit-frequencies: one represents the left sub-tree (the left-bit-frequency, lbf) and one the right sub-tree (the right-bit-frequency, rbf). Each bit-frequency represents an entire sub-tree. The bit-frequencies are updated every time a node is added, but only in the nodes that are part of the backtracking path, and only one of the two bit-frequencies of a node is updated. When the backtracking comes from the left child's sub-tree, only the lbf of the parent node is updated; the same happens with the rbf of a parent node when the backtracking comes from the right sub-tree.

3.8 Obtaining left-bit-frequency and right-bit-frequency

The lbf and rbf are among the most important parts of our study, since obtaining them is its first objective. There are two phases to obtain and update the lbf and rbf: (1) convert the datum N into binary form and get its msp; (2) update the msp positions of the lbf or rbf of all ancestor nodes, based on the msp of datum N, by increasing each msp position's value by 1. The direction of the backtracking decides which frequency of the current node is updated: if the backtracking comes from the left sub-tree of the parent node, then the msp positions of the parent's lbf are updated; otherwise the msp positions of its rbf are updated.

Figure 3.8.1: updating msp of lbf or rbf of ancestors node
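The node structure of Section 3.7 and the update of Section 3.8 can be sketched together as follows; the Node shape and the explicit ancestor path are our assumptions, not the study's implementation:

```python
class Node:
    """A node carrying, besides data and child links, the lbf/rbf arrays."""
    def __init__(self, data, p=7):
        self.data, self.left, self.right = data, None, None
        self.lbf = [0] * (p + 1)  # left-bit-frequency: summarizes the left subtree
        self.rbf = [0] * (p + 1)  # right-bit-frequency: summarizes the right subtree

def msp(x, p=7):
    """1-bit positions of x (phase 1 of Section 3.8)."""
    return [i for i in range(p + 1) if (x >> i) & 1]

def update_ancestors(path, data):
    """Phase 2: path = [(ancestor, 'L' or 'R'), ...], the side each ancestor
    was left on while descending; bump that side's frequency at every msp."""
    for node, side in path:
        freq = node.lbf if side == 'L' else node.rbf
        for i in msp(data):
            freq[i] += 1  # one more 1-bit at position i in that subtree

root = Node(2)
root.left = Node(5)
update_ancestors([(root, 'L')], 5)  # 5 = 101₂: positions 0 and 2
print(root.lbf)  # [1, 0, 1, 0, 0, 0, 0, 0]
```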


3.9 Search

To search for an arbitrary datum, its binary equivalent must be evaluated to obtain the set of its 1-bit positions; this set is called the most-significant-positions (msp), and the positions not in the set are called the less-significant-positions (lsp). Let x be the datum we are going to search for in tree T, and let b be the binary equivalent of x; from b we get the msp and lsp of x.

If the current node's datum is not equal to x, there are two phases of evaluation to determine which sub-tree of the node is evaluated first. Phase 1: using the msp of x, if there is a zero value at any msp position of both lbf and rbf, then it is certain that the datum exists in neither the left nor the right sub-tree, and the search stops. If only the lbf shows a zero value at any msp position, then the left sub-tree is pruned and the search proceeds to the right sub-tree; the same happens symmetrically if only the rbf shows a zero value at any msp position. Phase 2: using the lsp of x, the sums of lbf and rbf over the lsp positions are compared; the search proceeds to the subtree with the lower sum, or to the left sub-tree by default. The derived theorem is used at this phase.

Note that non-zero values in both lbf and rbf at every msp position do not mean the datum exists. For example, the frequency 111 can be obtained from the binary numbers 001, 010, and 100, i.e., the decimal numbers 1, 2, and 4, so when searching for the number 7 we cannot automatically conclude that it exists. Likewise, the frequency 111 can be obtained from the binary number 111 alone, so if we search for 1, 2, and 4 we cannot automatically conclude that all those numbers exist.

Figure 3.4: Arbitrary data that does not exist in the tree

Figure 3.5: Left sub-tree pruned

3.10 Algorithm
BFS(R, D)
1.  t ← NULL
2.  if R equals NULL then goto step 22
3.  else t ← R
4.  if t.data equals D then goto step 22
5.  for every n ∈ msp of D
6.      if LBF[n] equals zero then
7.          prune the left subtree
8.      if RBF[n] equals zero then
9.          prune the right subtree
10. end for
11. sum LBF and RBF over every n ∈ lsp of D
12. if sum of LBF <= sum of RBF then
13.     t ← BFS(left sub-tree of t, D)
14.     if t equals NULL then
15.         t ← BFS(right sub-tree of t, D)
16.     end if
17. else
18.     t ← BFS(right sub-tree of t, D)
19.     if t equals NULL then
20.         t ← BFS(left sub-tree of t, D)
21.     end if
22. return t
23. end
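The pseudocode above can be made runnable as a sketch. The phase-1 pruning test and the phase-2 lsp-sum ordering follow the text; the Node shape and the annotate() helper, which fills lbf/rbf from the actual subtrees instead of insert-time updates, are our assumptions:

```python
class Node:
    def __init__(self, data, p=7):
        self.data, self.left, self.right = data, None, None
        self.lbf = [0] * (p + 1)
        self.rbf = [0] * (p + 1)

def bit_frequency(numbers, p=7):
    return [sum((n >> i) & 1 for n in numbers) for i in range(p + 1)]

def subtree_data(node):
    return [] if node is None else (
        [node.data] + subtree_data(node.left) + subtree_data(node.right))

def annotate(node):
    """Fill lbf/rbf from the actual subtrees (stand-in for insert-time updates)."""
    if node is None:
        return
    node.lbf = bit_frequency(subtree_data(node.left))
    node.rbf = bit_frequency(subtree_data(node.right))
    annotate(node.left)
    annotate(node.right)

def bfs(node, d, p=7):
    if node is None or node.data == d:
        return node
    msp = [i for i in range(p + 1) if (d >> i) & 1]
    lsp = [i for i in range(p + 1) if not (d >> i) & 1]
    candidates = []
    # Phase 1: a zero at any msp position proves d is absent in that subtree.
    if all(node.lbf[i] > 0 for i in msp):
        candidates.append((sum(node.lbf[i] for i in lsp), 0, node.left))
    if all(node.rbf[i] > 0 for i in msp):
        candidates.append((sum(node.rbf[i] for i in lsp), 1, node.right))
    # Phase 2: try the subtree with the smaller lsp sum first (left on ties).
    for _, _, child in sorted(candidates, key=lambda t: (t[0], t[1])):
        found = bfs(child, d, p)
        if found is not None:
            return found
    return None

# Build a small unordered tree 4(2(6, 9), 5(7)) and fill its frequencies.
root = Node(4)
root.left, root.right = Node(2), Node(5)
root.left.left, root.left.right = Node(6), Node(9)
root.right.left = Node(7)
annotate(root)
print(bfs(root, 9).data)  # 9
print(bfs(root, 3))       # None: 3 is not in the tree
```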

3.11 Time complexity: Though the algorithm reduces the number of nodes involved overall, it takes extra time for the filtering and evaluation process.

Table 3.1: Proposed algorithm time complexity

                                       | Time complexity                                       | Search space
  Method                               | Search      | Add frequency      | Total summary      | complexity
  Binary Frequency Search Algorithm    | O(p log n)  | O(2p · k · 2(log n)) | O(p k log n)     | O(p + n)

Based on Table 3.1, p represents the number of frequency positions, and k represents the internal nodes that are candidates for backtracking in case the search misses.

Table 3.2: Time complexity of traversal vs. the proposed algorithm

  Search method   | Worst case     | Best case    | Space complexity
  Traversal       | O(n)           | O(log n)     | O(1 + n)
  BFS             | O(p k log n)   | O(p log n)   | O(p + n)

CHAPTER IV

RESULTS AND DISCUSSIONS

In order to observe and examine the results of the algorithm, two sample unordered trees were created and the data were searched one by one. Every node that is part of the path during the searching process was recorded. The traditional searching method was used first, followed by the designed algorithm (BFSA).

Illustration 4.1: unordered tree with 20 nodes

Table 4.1: Result

  Search | Path, Traditional (A)                                  | Path, BFSA (B)                       | A  | B
  5      | 2,5                                                    | 2,7,15,1,25,23,5                     | 1  | 6
  10     | 2,5,10                                                 | 2,5,10                               | 2  | 2
  6      | 2,5,10,6                                               | 2,5,10,6                             | 3  | 3
  3      | 2,5,10,6,3                                             | 2,7,15,1,25,11,23,5,10,3             | 4  | 9
  30     | 2,5,10,6,3,30                                          | 2,5,10,3,30                          | 5  | 4
  41     | 2,5,10,6,3,30,41                                       | 2,5,10,3,30,41                       | 6  | 5
  18     | 2,5,10,6,3,30,41,18                                    | 2,5,10,3,30,18                       | 7  | 5
  48     | 2,5,10,6,3,30,41,18,48                                 | 2,5,10,3,30,18,48                    | 8  | 6
  4      | 2,5,10,6,3,30,41,18,48,4                               | 2,5,10,6,3,30,18,48,4                | 9  | 8
  13     | 2,5,10,6,3,30,41,18,48,4,13                            | 2,7,15,1,25,5,10,3,30,18,48,13       | 10 | 11
  7      | 2,5,10,6,3,30,41,18,48,4,13,7                          | 2,7                                  | 11 | 1
  20     | 2,5,10,6,3,30,41,18,48,4,13,7,20                       | 2,5,10,3,18,48,30,7,20               | 12 | 8
  15     | 2,5,10,6,3,30,41,18,48,4,13,7,20,15                    | 2,7,15                               | 13 | 2
  1      | 2,5,10,6,3,30,41,18,48,4,13,7,20,15,1                  | 2,7,15,1                             | 14 | 3
  9      | 2,5,10,6,3,30,41,18,48,4,13,7,20,15,1,9                | 2,7,15,1,9                           | 15 | 4
  25     | 2,5,10,6,3,30,41,18,48,4,13,7,20,15,1,9,25             | 2,7,15,1,25                          | 16 | 4
  11     | 2,5,10,6,3,30,41,18,48,4,13,7,20,15,1,9,25,11          | 2,7,15,1,25,11                       | 17 | 5
  23     | 2,5,10,6,3,30,41,18,48,4,13,7,20,15,1,9,25,11,23       | 2,7,15,1,25,23                       | 18 | 5
  8      | 2,5,10,6,3,30,41,18,48,4,13,7,20,15,1,9,25,11,23,8     | 2,7,15,8                             | 19 | 3
  12     | 2,5,10,6,3,30,41,18,48,4,13,7,20,15,1,9,25,11,23,8,ne  | 2,5,10,3,30,18,48,13,7,15,1,25,ne    | 20 | 12
  TOTAL NODES                                                                                          210 | 106

Note: ne = not exist

Based on the result in Table 4.1 and Illustration 4.1, for 20 nodes including the root, traditional search visits a grand total of 210 nodes when searching the data one by one, including a search for a datum that is not in the tree, while the Binary Frequency Search Algorithm visits only 106 nodes. With this result, BFSA reduced the nodes involved by more than 49%.

Illustration 4.2: unordered tree with 11 nodes

Table 4.2: Result

  Search | Path, Traditional (A)       | Path, BFSA (B)  | A  | B
  1      | 4,2,6,9,8,10,5,7,1          | 4,5,7,1         | 8  | 3
  2      | 4,2                         | 4,2             | 1  | 1
  3      | 4,2,6,9,8,10,5,7,1,3        | 4,5,7,3         | 9  | 3
  5      | 4,2,6,9,8,10,5              | 4,5             | 6  | 1
  6      | 4,2,6                       | 4,2,6           | 2  | 2
  7      | 4,2,6,9,8,10,5,7            | 4,5,7           | 7  | 2
  8      | 4,2,6,9,8                   | 4,2,8           | 4  | 2
  9      | 4,2,6,9                     | 4,2,6,9         | 3  | 3
  10     | 4,2,6,9,8,10                | 4,2,8,10        | 5  | 3
  11     | 4,2,6,9,8,10,5,7,1,3,11     | 4,2,6,5,3,11    | 10 | 5
  TOTAL NODES                                             55 | 25

Based on the result in Table 4.2 and Illustration 4.2, for 11 nodes including the root, traditional search visits a grand total of 55 nodes when searching the data one by one, while the Binary Frequency Search Algorithm visits only 25 nodes. With this result, BFSA reduced the nodes involved by more than 54%.

SIMULATION RESULTS

In order to check the reliability of the algorithm, with respect to how much it reduces the nodes involved and its accuracy when choosing between the two subtrees (the left and the right), a series of simulations was made with random data to create sample trees; random data were used to ensure that every tree constructed during the simulation is unique. Below are the graphs of the simulation results:

Illustration 4.3: Average result of reduced involved nodes
(Unordered tree simulation and searching: percentage of reduced nodes, roughly 56% to 60.5%, plotted against iterations 0 to 2200.)
Based on the results shown in the illustration above, from 10 up to 2000 simulated trees, the algorithm shows an overall efficiency in reducing the nodes involved of greater than 59%.
Illustration 4.4: Average result of accuracy
(Average accuracy during simulation: accuracy percentage, roughly 65.5% to 68.5%, plotted against 0 to 2200 sample unordered trees.)

For the accuracy of the algorithm, based on the illustration above, over samples of up to 2000 simulated unordered trees the algorithm shows an accuracy greater than 66%.

CHAPTER V
CONCLUSIONS AND RECOMMENDATIONS

Based on the results, the algorithm clearly reduces the nodes involved in the search process overall and can easily determine whether a sub-tree must be pruned. Because of the extra filtering and evaluation processes, the algorithm takes extra time, with a time complexity of O(p k log n). The algorithm is best used in data mining search processes where traversing the entire unordered tree is to be avoided and the data sets are unique.

For further studies, the researcher recommends enhancing the algorithm so that it eliminates the k in the O(p k log n) time complexity, in which case the algorithm becomes faster and more precise in its searching than traversal. The algorithm can also be used for mining frequent subtrees or embedded subtrees.

APPENDICES
A. Glossary of Terms

frequency: The total count of 1-bits in a column.

less-significant-positions (lsp): The set of 0-bit position(s) with respect to the binary equivalent of the arbitrary datum to be searched.

most-significant-positions (msp): The set of 1-bit position(s) with respect to the binary equivalent of the arbitrary datum to be searched.

left-bit-frequency (lbf): A data field containing frequencies that represent the left sub-tree of the current node.

right-bit-frequency (rbf): A data field containing frequencies that represent the right sub-tree of the current node.
BIBLIOGRAPHY

[1] Mohammed J. Zaki: Efficiently Mining Frequent Embedded Unordered Trees, 2005.
[2] Yun Chi: Mining Databases of Labeled Trees using Canonical Forms, 2005.
[3] Shariq Bashir and Abdul Rauf Baig: Ramp: High Performance Frequent Itemset Mining with Efficient Bit-Vector Projection Technique, 2007.
[4] Zahoor Jan, Shariq Bashir, A. Rauf Baig: Applying Bit-Vector Projection for Mining of N-Most Interesting Frequent Itemsets, 2007.
[5] Jose L. Balcazar, Albert Bifet and Antoni Lozano: Mining Frequent Closed Unordered Trees Through Natural Representations, 2007.
[6] Siegfried Nijssen, Elisa Fromont: Mining Optimal Decision Trees from Itemset Lattices, 2007.
[7] Hassan H. Malik and John R. Kender: Optimizing Frequency Queries for Data Mining Applications, 2007.
