
Chapter 2

The Basic Concepts of Algorithms


The design and analysis of algorithms is a rather complicated field in computer science. Since we shall model every problem in molecular biology as an algorithmic problem in this book, we shall give a brief introduction to some basic concepts of algorithms in this chapter, in the hope that the reader can gain a basic understanding of issues related to algorithms. It is quite difficult to present all of the terms in the field of algorithms formally. We will therefore take an informal approach; that is, all of the basic concepts will be presented in an informal way.
2.1 The Minimal Spanning Tree Problem
Every computer program is based upon some kind of algorithm, and efficient algorithms produce efficient programs. Let us consider the minimal spanning tree problem. We believe that the reader will quickly understand why it is so important to study algorithm design.
There are two versions of the spanning tree problem. We will introduce one version of them first.
Consider Figure 2.1.
Figure 2.1: A Set of Planar Points
A tree is a graph without cycles, and a spanning tree of a set of points is a tree consisting of all of the points. In Figure 2.2, we show three spanning trees of the set of points in Figure 2.1.
Figure 2.2: Three Spanning Trees of the Set of Points in Figure 2.1
Among the three spanning trees, the tree in Figure 2.2(a) is the shortest, and this is the one we are interested in. Thus the minimal spanning tree problem is defined as follows: we are given a set of points and we are asked to find a spanning tree with the shortest total length.
How can we find a minimal spanning tree? A very straightforward algorithm is to enumerate all
possible spanning trees and one of them must be what we are looking for. Figure 2.3 shows all of the
possible spanning trees for three points. As can be seen, there are only three of them.
Figure 2.3: All Possible Spanning Trees for Three Points
For four points, as shown in Figure 2.4, there are sixteen possible spanning trees. In general, it can be shown that given n points, there are n^(n-2) possible spanning trees for them. Thus if we have 5 points, there are already 5^3 = 125 possible spanning trees. If n = 100, there will be 100^98 possible spanning trees. Even if we have an algorithm to generate all spanning trees, time will not allow us to do so: no computer can finish this enumeration within any reasonable amount of time.
Figure 2.4: All Possible Spanning Trees for Four Points
Yet, there is an efficient algorithm to solve this minimal spanning tree problem. Let us first introduce Prim's algorithm.
2.1.1 Prim's Algorithm
Consider the points in Figure 2.1 again. Suppose we start with any point, say point b. The nearest neighbor of point b is point a. We now connect point a with point b, as shown in Figure 2.5(a). Let us denote the set {a, b} as X and the set of the remaining points as Y. We now find the shortest distance between points in X and points in Y, which is that between b and e. We add e to the minimal spanning tree by connecting b with e, as shown in Figure 2.5(b). Now X = {a, b, e} and Y = {c, d, f}. During the whole process, we continuously maintain two sets, namely X and Y: X consists of all of the points in the partially created minimal spanning tree and Y consists of all of the remaining points. In each step of Prim's algorithm, we find a shortest distance between X and Y and add a new point to the tree, until Y is empty. For the points in Figure 2.1, the process of constructing a minimal spanning tree through this method is shown in Figure 2.5.
Figure 2.5: The Process of Constructing a Minimal Spanning Tree Based upon Prim's Algorithm
In the above, we assumed that the input is a set of planar points. We can generalize the problem so that the input is a connected graph in which each edge is associated with a positive weight. It can be easily seen that a set of planar points corresponds to a graph in which there is an edge between every two vertices and the weight associated with each edge is simply the Euclidean distance between the two points. If there is an edge between every pair of vertices, we call this kind of graph a complete graph; thus a set of planar points corresponds to a complete graph. Note that in a general graph, it is possible that there is no edge between two vertices. A typical graph is shown in Figure 2.6.
Throughout the entire book, we shall always denote a graph by G = (V,E) where V is the set of
vertices in G and E is the set of edges in G.
Figure 2.6: A General Graph
We now present Prim's algorithm as follows:
Algorithm 2.1 Prim's Algorithm to Construct a Minimal Spanning Tree
Input: A weighted, connected and undirected graph G = (V,E).
Output: A minimal spanning tree of G.
Step 1: Let x be any vertex in V. Let X = {x} and Y = V \ {x}.
Step 2: Select an edge (u,v) from E such that u ∈ X, v ∈ Y, and (u,v) has the smallest weight among edges between X and Y.
Step 3: Connect u to v. Let X = X ∪ {v} and Y = Y \ {v}.
Step 4: If Y is empty, terminate; the resulting tree is a minimal spanning tree. Otherwise, go to Step 2.
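To make these steps concrete, the following Python sketch is our own illustration, not the book's code; the adjacency-dictionary representation of the graph is an assumption we make for the example.

def prim(graph):
    # graph: dict mapping each vertex to a dict {neighbor: weight} of a
    # weighted, connected, undirected graph. Returns the tree edges.
    vertices = list(graph)
    x = {vertices[0]}                      # Step 1: X = {x}, Y = V \ X
    tree = []
    while len(x) < len(vertices):          # Step 4: repeat until Y is empty
        # Step 2: lightest edge (u, v) with u in X and v in Y.
        u, v = min(((a, b) for a in x for b in graph[a] if b not in x),
                   key=lambda e: graph[e[0]][e[1]])
        tree.append((u, v))                # Step 3: connect u to v ...
        x.add(v)                           # ... and move v from Y into X
    return tree

g = {'a': {'b': 1, 'c': 4},
     'b': {'a': 1, 'c': 2, 'd': 5},
     'c': {'a': 4, 'b': 2, 'd': 3},
     'd': {'b': 5, 'c': 3}}
print(prim(g))   # [('a', 'b'), ('b', 'c'), ('c', 'd')]

The naive scan of all edges between X and Y at every step is easy to read but not the fastest choice; a priority queue of candidate edges is the usual refinement.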
Let us consider the graph in Figure 2.7. The process of applying Prim's algorithm to this graph is illustrated in Figure 2.8.
Figure 2.7: A General Graph
Figure 2.8: The Process of Applying Prim's Algorithm to the Graph in Figure 2.7
We will not formally prove the correctness of Prim's algorithm; the reader can find the proof in almost every textbook on algorithms. Yet we can already see the importance of algorithms: if one does not know of the existence of such an efficient algorithm for constructing minimal spanning trees, one can never construct a minimal spanning tree when the input size is large. It would be disastrous for anyone to use an exhaustive search method in this case.
In the following, we will introduce another algorithm to construct a minimal spanning tree. This
algorithm is called Kruskal's algorithm.
2.1.2 Kruskal's Algorithm
Kruskal's algorithm to construct a minimal spanning tree is quite similar to Prim's algorithm. It first sorts all of the edges in the graph into an ascending sequence. Then the edges are added into a partially constructed minimal spanning tree one by one. Each time an edge is added, we check whether a cycle is formed; if a cycle is formed, we discard this edge. The algorithm terminates when the tree contains n − 1 edges, where n is the number of vertices.
Algorithm 2.2 Kruskal's Algorithm to Construct a Minimal Spanning Tree
Input: A weighted, connected and undirected graph G = (V,E).
Output: A minimal spanning tree of G.
Step 1: T ← ∅.
Step 2: while T contains fewer than n − 1 edges do
Choose an edge (v,w) from E of the smallest weight.
Delete (v,w) from E.
If the adding of (v,w) does not create a cycle in T then
add (v,w) to T;
else discard (v,w).
end while
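As with Prim's algorithm, a short Python sketch may help. It is our own illustration, and the union-find structure used for the cycle test is a standard choice that the text does not prescribe.

def kruskal(vertices, edges):
    # vertices: iterable of vertex names; edges: list of (weight, u, v)
    # triples. Returns the edges of a minimal spanning tree.
    parent = {v: v for v in vertices}

    def find(v):                       # root of v's component
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    tree = []
    for w, u, v in sorted(edges):      # ascending order of weight
        ru, rv = find(u), find(v)
        if ru != rv:                   # adding (u, v) creates no cycle
            parent[ru] = rv            # merge the two components
            tree.append((u, v))
            if len(tree) == len(parent) - 1:
                break                  # the tree has n - 1 edges
        # otherwise discard (u, v)
    return tree

print(kruskal('abcd', [(1, 'a', 'b'), (2, 'b', 'c'),
                       (3, 'c', 'd'), (4, 'a', 'c'), (5, 'b', 'd')]))
# [('a', 'b'), ('b', 'c'), ('c', 'd')]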
Let us consider the graph in Figure 2.7. The process of applying Kruskal's algorithm to this graph is
illustrated in Figure 2.9.
Both algorithms presented above are efficient.
Figure 2.9: The Process of Applying Kruskal's Algorithm to the Graph in Figure 2.7
2.2 The Longest Common Subsequence Problem
In this section, we will introduce another seemingly difficult problem and again show an efficient
algorithm to solve this problem.
Consider the following sequence:
S: ABDDRGTY.
The following sequences are all subsequences of S:
ABDDRGT, DGTY, DDRGT, BDR, ABDTY, DDRT, GY, ADRG,
ABTY, ABG, DDY, AR, BTY, TY, Y
Suppose that we are given two sequences as follows:
S1: ABDDRGTY
S2: CDEDGRRT
Then we can see that the following sequences are subsequences of both S1 and S2:
DRT
DGT
DDRT
All of the above sequences are called common subsequences of S1 and S2. The longest common subsequence problem is defined as follows: given two sequences, find a longest common subsequence of them. If one is not familiar with algorithm design, one may be totally lost with this problem. Algorithm 2.3 is a naive algorithm to solve this problem.
Algorithm 2.3 A Naive Algorithm for the Longest Common Subsequence Problem
Step 1: Generate all of the subsequences of, say, S1.
Step 2: Starting with the longest one, check whether it is a subsequence of S2.
Step 3: If it is not, delete this subsequence and return to Step 2.
Step 4: Otherwise, return this subsequence as a longest common subsequence of S1 and S2.
The trouble is that it is exceedingly time-consuming to generate all of the subsequences of a sequence.
A much better and much more efficient algorithm employs a strategy called the dynamic
programming strategy. Since this strategy is used very often to solve problems in molecular biology,
we shall describe it in the next section.
2.3 The Dynamic Programming Strategy
The dynamic programming strategy can be explained by considering the graph in Figure 2.10. Our problem is to find a shortest route from vertex S to vertex T. As can be seen, there are three branches from S; thus we have at least three routes, namely going through A, going through B and going through C. We have no idea which one is the shortest, but we have the following principle: if we go through a vertex X, we should then find a shortest route from X to T.
Figure 2.10: A Graph
Let d(x,y) denote the shortest distance between vertices x and y. We have the following equation:

d(S,T) = min{ d(S,A) + d(A,T), d(S,B) + d(B,T), d(S,C) + d(C,T) }        (2.1)
The question is: how do we find the shortest route from, say, vertex A to vertex T? Note that we can use the same principle here. That is, the problem of finding a shortest route from A to T is the same as the problem of finding a shortest route from S to T, except that the size of the problem is now smaller.
The shortest route finding problem can now be solved systematically as follows:

d(S,T) = min{ d(S,A) + d(A,T), d(S,B) + d(B,T), d(S,C) + d(C,T) }
       = min{ 15 + d(A,T), 18 + d(B,T), 3 + d(C,T) }

d(A,T) = min{ d(A,D) + d(D,T), d(A,E) + d(E,T) }
       = min{ 11 + d(D,T), 10 + d(E,T) }
       = min{ 11 + 41, 10 + 21 } = 31                                    (2.2)

d(B,T) = min{ d(B,E) + d(E,T), d(B,F) + d(F,T), d(B,G) + d(G,T) }
       = min{ 9 + d(E,T), 1 + d(F,T), 2 + d(G,T) }
       = min{ 9 + 21, 1 + 3, 2 + 21 } = 4                                (2.3)

d(C,T) = min{ d(C,G) + d(G,T), d(C,H) + d(H,T) }
       = min{ 14 + 21, 16 + 27 } = 35                                    (2.4)

Substituting (2.2), (2.3) and (2.4) into (2.1), we obtain d(S,T) = min{15 + 31, 18 + 4, 3 + 35} = 22, which implies that the shortest route from S to T is S → B → F → T. As shown above, the basic idea of the dynamic programming strategy is to decompose a large problem into several sub-problems, each of which is identical to the original problem except that its size is smaller. Thus the dynamic programming strategy always solves a problem recursively.
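The recursive principle translates directly into code. The sketch below is our own illustration: the successor table is an assumed stand-in that folds the sub-distances d(A,T) = 31, d(B,T) = 4 and d(C,T) = 35 computed above into single edges, since the full edge weights of Figure 2.10 are not recoverable here, and memoization ensures each sub-problem is solved only once.

from functools import lru_cache

# succ[x] maps each vertex reachable from x to the weight of edge (x, y).
succ = {'S': {'A': 15, 'B': 18, 'C': 3},
        'A': {'T': 31}, 'B': {'T': 4}, 'C': {'T': 35},
        'T': {}}

@lru_cache(maxsize=None)
def d(x):
    # Shortest distance from x to T, by the principle above.
    if x == 'T':
        return 0
    return min(w + d(y) for y, w in succ[x].items())

print(d('S'))   # 22, matching the route S -> B -> F -> T found above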
In the next section, we go back to the longest common subsequence problem and show how the
dynamic programming strategy can be applied to solve the problem.
2.4 Application of the Dynamic Programming Strategy to Solve the
Longest Common Subsequence Problem
The longest common subsequence problem was presented in Section 2.2. It was also pointed out that we cannot solve the problem in any naive and unsophisticated way. In this section, we shall show that this problem can be solved elegantly by using the dynamic programming strategy.
We are given two sequences S1 = a1 a2 ... am and S2 = b1 b2 ... bn. Consider am and bn. There are two cases:
Case 1: am = bn. In this case am, which is equal to bn, must be included in the longest common subsequence. The longest common subsequence of S1 and S2 is therefore the longest common subsequence of a1 a2 ... a(m−1) and b1 b2 ... b(n−1), plus am.
Case 2: am ≠ bn. Then we find two longest common subsequences: that of a1 a2 ... am and b1 b2 ... b(n−1), and that of a1 a2 ... a(m−1) and b1 b2 ... bn. Among these two, we choose the longer one, and the longest common subsequence of S1 and S2 must be this longer one.
To summarize, the dynamic programming strategy decomposes the longest common subsequence problem into three identical sub-problems, each of which is of smaller size. Each sub-problem can now be solved recursively. In the following, to simplify the discussion, let us concentrate on finding the length of a longest common subsequence. It will be obvious that our algorithm can be easily extended to find a longest common subsequence itself.
Let LCS(i,j) denote the length of a longest common subsequence of a1 a2 ... ai and b1 b2 ... bj. LCS(i,j) can be found by the following formula:

LCS(i,j) = LCS(i−1, j−1) + 1 if ai = bj
LCS(i,j) = max{ LCS(i−1, j), LCS(i, j−1) } if ai ≠ bj
LCS(i,0) = LCS(0,j) = LCS(0,0) = 0 for all i and j.
The following is an algorithm to find the length of a longest common subsequence based upon the dynamic programming strategy:
Algorithm 2.4 An Algorithm to Find the Length of a Longest Common Subsequence Based upon the Dynamic Programming Strategy
Input: A = a1 a2 ... am and B = b1 b2 ... bn.
Output: The length of a longest common subsequence of A and B, denoted as LCS(m,n).
Step 1: LCS(i,0) ← 0 and LCS(0,j) ← 0 for all i and j.
Step 2: for i = 1 to m do
for j = 1 to n do
if ai = bj then LCS(i,j) ← LCS(i−1, j−1) + 1
else LCS(i,j) ← max{ LCS(i−1, j), LCS(i, j−1) }
end for
end for
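A direct Python transcription of Algorithm 2.4 is sketched below; it is our own illustration of the formula above.

def lcs_length(a, b):
    # table[i][j] holds LCS(i, j); row 0 and column 0 are the boundary
    # values LCS(i, 0) = LCS(0, j) = 0 set in Step 1.
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]

print(lcs_length("ABDDRGTY", "CDEDGRRT"))   # 4, e.g. DDRT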
Let us consider an example:
A = AGCT
and B = CGT.
The entire process of finding the length of a longest common subsequence of A and B is illustrated in Table 2.1. By tracing back, we can find two longest common subsequences: CT and GT.
Let us consider another example: A = aabcdec and B = badea. Table 2.2 illustrates the process. Again, it can be seen that we have two longest common subsequences, namely bde and ade.
Table 2.1: The Process of Finding the Length of a Longest Common Subsequence of AGCT and CGT
Table 2.2: The Process of Finding the Length of a Longest Common Subsequence of A = aabcdec and
B = badea
2.5 The Time-Complexity of Algorithms
In the above sections, we showed that it is important to be able to design efficient algorithms. Or, to put it another way, we may say that many problems can hardly be solved if we cannot design efficient algorithms. Therefore, we now come to a critical question: how do we measure the efficiency of algorithms?
We usually say that an algorithm is efficient if the program based upon this algorithm runs very fast. But whether a program runs fast or not sometimes depends on the hardware and on the skill of the programmers, both of which are irrelevant to the algorithm itself.
In algorithm analysis, we always choose a particular step of the algorithm. Then we try to see how many such steps are needed to complete the program. For instance, in all sorting algorithms, the comparison of data cannot be avoided. Therefore, we often use the number of comparisons of data as the time-complexity of a sorting algorithm.
Let us consider the straight insertion sort algorithm. We are given a sequence of numbers x[1], x[2], ..., x[n]. The straight insertion sort algorithm scans this sequence; if x[i] is found to be smaller than x[i−1], we put x[i] to the left of x[i−1]. This process is continued until the number to the left of x[i] is not larger than it.
Algorithm 2.5 The Straight Insertion Sort Algorithm
Input: A sequence of numbers x[1], x[2], ..., x[n].
Output: The sorted sequence of x[1], x[2], ..., x[n].
for j = 2 to n do
i ← j − 1
x ← x[j]
while i > 0 and x < x[i] do
x[i + 1] ← x[i]
i ← i − 1
end while
x[i + 1] ← x
end for
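A runnable Python version of Algorithm 2.5 is sketched below (our own illustration); the three data movements discussed in the analysis that follows are marked in the comments.

def straight_insertion_sort(x):
    # Sorts the list x in place and returns it.
    for j in range(1, len(x)):         # positions 2..n in the text
        t = x[j]                       # data movement: x <- x[j]
        i = j - 1
        while i >= 0 and t < x[i]:
            x[i + 1] = x[i]            # data movement: x[i+1] <- x[i]
            i -= 1
        x[i + 1] = t                   # data movement: x[i+1] <- x
    return x

print(straight_insertion_sort([9, 17, 1, 5, 10]))   # [1, 5, 9, 10, 17]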
Suppose that the input sequence is 9, 17, 1, 5, 10. The straight insertion sort sorts this sequence into a
sorted sequence as follows:
9
9,17
1,9,17
1,5,9,17
1,5,9,10,17
In this sorting algorithm, the dominating steps are data movements. There are three data movements, namely x ← x[j], x[i+1] ← x[i], and x[i+1] ← x. We can use the number of data movements to measure the time-complexity of the algorithm. In the algorithm, there are one outer loop and one inner loop. For the outer loop, the data movement operations x ← x[j] and x[i+1] ← x are always executed, no matter what the input is. For the inner loop, there is only one data movement, namely x[i+1] ← x[i], and it is executed only if the inner loop is executed; in other words, whether this operation is executed depends on the input data. Let us denote the number of data movements executed in the inner loop for x[i] by d_i. The total number of data movements for the straight insertion sort is

X = (2 + d_2) + (2 + d_3) + ... + (2 + d_n) = 2(n − 1) + (d_2 + d_3 + ... + d_n).
Best Case:
The best case occurs when the input sequence is already sorted. In this case the inner loop is never executed, every d_i is 0, and X = 2(n − 1).
Worst Case:
The worst case occurs when the input sequence is reversely sorted. In such a case,
d_2 = 1, d_3 = 2, ..., d_n = n − 1.
Thus,
X = 2(n − 1) + (1 + 2 + ... + (n − 1)) = 2(n − 1) + n(n − 1)/2 = (n − 1)(n + 4)/2,
which is approximately n^2/2.
Average Case:
To conduct an analysis of the average case, note that when x[i] is being considered, the first i − 1 numbers have already been sorted. If x[i] is the largest among the first i numbers, the inner loop will not be executed. If x[i] is the jth largest number among the first i numbers, there will be j − 1 data movements executed in the inner loop. The probability that x[i] is the jth largest among the first i numbers is 1/i, for 1 ≤ j ≤ i. Therefore the average value of d_i is

(0 + 1 + ... + (i − 1))/i = (i − 1)/2.
The average case time-complexity for the straight insertion sort is therefore

X = 2(n − 1) + (1/2)(1 + 2 + ... + (n − 1)) = 2(n − 1) + n(n − 1)/4 = (n − 1)(n + 8)/4,

which is approximately n^2/4.
In summary, the time-complexities of the straight insertion sort are as follows:

Straight Insertion Sort Algorithm
Best Case: 2(n − 1)
Average Case: (1/4)(n − 1)(n + 8)
Worst Case: (1/2)(n − 1)(n + 4)
In the above, we showed how an algorithm can be analyzed. For each algorithm, there are three time-
complexities: the worst case time-complexity, the average case time-complexity and the best case
time-complexity. For practical purposes, the average case time-complexity is the most important one.
Unfortunately, it is usually very difficult to obtain an average case time-complexity. In this book,
unless we specify clearly, whenever we mention the time-complexity, we mean the worst case time-
complexity.
Now, suppose we have a time-complexity equal to n^2 + n. It can be easily seen that as n becomes very large, the term n^2 dominates; that is, we may ignore the term n when n is large enough. We now present a formal definition for this idea.
Definition 2.1: f(n) = O(g(n)) if and only if there exist two positive constants c and n0 such that |f(n)| ≤ c|g(n)| for all n ≥ n0.
This notation is called the big-O notation. If f(n) is O(g(n)), then f(n) is bounded by g(n), in a certain sense, when n is large enough. If the time-complexity of an algorithm is O(g(n)), it will take no more than c·g(n) steps to run this algorithm when n is large enough, for some constant c. Note that n is the size of the input data.
Assume that the time-complexity of an algorithm is f(n) = n^2 + n. Then

f(n) = n^2 + n = n^2 (1 + 1/n) ≤ 2 n^2 for n ≥ 1.

Thus we say that the time-complexity is O(n^2), because we can choose c and n0 to be 2 and 1 respectively.
It is customary to use O(1) to represent a constant. Let us go back to the time-complexities of the straight insertion sort algorithm. Using the big-O notation, we now have:

Straight Insertion Sort Algorithm
Best Case: O(n)
Average Case: O(n^2)
Worst Case: O(n^2)
Many other algorithms have been analyzed. Their time-complexities are as follows:

The Binary Search Algorithm
Best Case: O(1)
Average Case: O(log n)
Worst Case: O(log n)

The Straight Selection Sort
Best Case: O(1)
Average Case: O(n log n)
Worst Case: O(n^2)

Quicksort
Best Case: O(n log n)
Average Case: O(n log n)
Worst Case: O(n^2)

Heapsort
Best Case: O(n)
Average Case: O(n log n)
Worst Case: O(n log n)

The Dynamic Programming Approach to Find a Longest Common Subsequence
Best Case: O(n^2)
Average Case: O(n^2)
Worst Case: O(n^2)
In Table 2.3, we list different time-complexity functions in terms of the input size.
Table 2.3: Time-Complexity Functions
As can be seen in this table, it is quite important to be able to design algorithms with low time-complexities. For instance, suppose that one algorithm has O(n) time-complexity and another one has O(log n) time-complexity. When n = 10000, the algorithm with O(log n) time-complexity takes far fewer steps than the algorithm with O(n) time-complexity.
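To give these comparisons a concrete flavor, the short Python script below (our own illustration, not part of the original text) prints a few common time-complexity functions for several input sizes.

import math

for n in (10, 100, 1000, 10000):
    print(f"n = {n:>6}: log n = {math.log2(n):6.1f}, "
          f"n log n = {n * math.log2(n):12.0f}, n^2 = {n * n:>12}")

# At n = 10000, log n is about 13.3 while n^2 is already 100,000,000.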
2.6 The 2-Dimensional Maxima Finding Problem and the Divide and Conquer Strategy
In the above sections, we showed the importance of designing efficient algorithms. In this section, we shall introduce another problem, called the 2-dimensional maxima finding problem. A straightforward algorithm for this problem has O(n^2) time-complexity, yet there is an algorithm that solves the same problem with O(n log n) time-complexity. This algorithm is based upon the divide and conquer strategy.
In the 2-dimensional space, a point (x1, y1) is said to dominate another point (x2, y2) if x1 > x2 and y1 > y2. If a point is not dominated by any other point, it is called a maxima. For example, in Figure 2.11, all of the circled points are maximas. The maxima finding problem is defined as follows: given a set of n points, find all of the maximas of these points.
Figure 2.11: A Set of 2-dimensional Points
A straightforward algorithm to find all of these maximas is to conduct an exhaustive search. That is, for each point, we compare it with all of the other points to see whether it is dominated by any of them. The total number of comparisons needed is therefore n(n − 1)/2. This means that the time-complexity of the straightforward algorithm is O(n^2).
Let us consider Figure 2.12, in which the set S of input points is divided into two sets: those to the left of L and those to the right of L. This line L, which is perpendicular to the X-axis, is a median line; that is, L is determined by the median of the x-values of all of the points. Denote the set of points to the left of L by SL and the set of points to the right of L by SR. The divide and conquer approach finds the maximas of SL and SR separately. In Figure 2.12, P1, P2 and P3 are maximas of SL, and P6, P8 and P10 are maximas of SR.
Figure 2.12: The Sets of Points in Figure 2.11 Divided by a Median Line
We observe that the maximas of SR must be maximas of the whole set, but some of the maximas of SL may not be maximas of S; for instance, P3 is not. To determine whether a maxima of SL is a maxima of S, let us observe Figure 2.13, in which the maximas of SL and SR are all projected onto the Y-axis. A maxima u of SL is a maxima of S if and only if there is no maxima v of SR whose y-value is higher than that of u. In Figure 2.13, it can be seen that P3 is not a maxima of S because the y-values of P6 and P8 are both higher than the y-value of P3. We may now conclude that the set of maximas of S is {P1, P2, P6, P8, P10}.
Figure 2.13: The Projection of Maximas onto the Y-axis
We can see that the divide and conquer approach consists of two stages. First, it divides the set of input data into two subsets S1 and S2, and we solve the two sub-problems on S1 and S2 separately. Let us denote the two sub-solutions by A1 and A2. The second stage is the merging scheme, which merges A1 and A2 into the final solution A.
The reader may be puzzled about one point: how are we going to find the maximas of, say, SL? The answer is that we find the maximas of SL and SR by using this divide and conquer algorithm recursively. That is, SL, for example, is divided into two subsets again and maximas are found for both of them. This gives the following algorithm for finding maximas based upon the divide and conquer strategy.
Algorithm 2.6 An Algorithm to Find Maximas Based upon the Divide and Conquer Strategy
Input: A set S of 2-dimensional points.
Output: The maximas of S.
Step 1: If S contains only one point, return it as a maxima. Otherwise, find a line L perpendicular to the X-axis which separates S into SL and SR, each of which consists of n/2 points.
Step 2: Recursively find the maximas of SL and SR.
Step 3: Project the maximas of SL and SR onto L and sort these points according to their y-values. Conduct a linear scan on the projections and discard each maxima of SL whose y-value is less than the y-value of some maxima of SR.
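A compact Python sketch of Algorithm 2.6 follows; it is our own illustration. For simplicity it assumes the points have distinct x-values and splits on the middle of the x-sorted order instead of calling an O(n) median finding routine, so it corresponds to the O(n log^2 n) version analyzed below rather than an optimized implementation.

def maximas(points):
    # points: list of (x, y) pairs. Returns the maximas of the set.
    if len(points) <= 1:               # Step 1: a single point is a maxima
        return list(points)
    pts = sorted(points)               # split at the median x-value
    mid = len(pts) // 2
    left = maximas(pts[:mid])          # Step 2: maximas of S_L ...
    right = maximas(pts[mid:])         # ... and of S_R
    # Step 3: every maxima of S_R survives; a maxima of S_L survives
    # only if no maxima of S_R has a larger y-value.
    best_right_y = max(y for _, y in right)
    return [p for p in left if p[1] > best_right_y] + right

print(maximas([(1, 4), (2, 3), (3, 5), (4, 1), (5, 2)]))
# [(3, 5), (5, 2)]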
Although in our algorithm we divide the set of points into subsets so small that each contains only one point, it is not necessary to do so. In practice, we can divide the set into subsets containing any small constant number of points and solve the problem for such a small subset by a naive and straightforward method. Imagine that the final subset contains only two points; the maximas of these two points can be determined by just one comparison.
To give the reader some feeling for the "recursiveness" of the divide and conquer strategy, let us consider the set of 8 points in Figure 2.14. To simplify our discussion, let us divide our set of 8 points into 4 subsets, each of which contains 2 points, as shown in Figure 2.14.
Figure 2.14: The Dividing of Points into 4 Subsets
Our algorithm will proceed as follows:
(1) The sets of maximas for S11, S12, S21 and S22 are separately found to be A11 = {P1}, A12 = {P3, P4}, A21 = {P5, P6} and A22 = {P7, P8} respectively.
(2) Merging S11 and S12 will eliminate P1. Thus A1 = {P3, P4}.
(3) Merging S21 and S22 will eliminate P5 and P6. Thus A2 = {P7, P8}.
(4) Merging A1 and A2 will eliminate P4. Thus the set of maximas is A = {P3, P7, P8}.
We will now conduct an analysis of Algorithm 2.6. In the algorithm, there is a step which finds the median of a set of numbers. For example, consider the set {17, 5, 7, 26, 10, 21, 13, 31, 4}. The median is 13: there are four numbers, namely 5, 7, 10 and 4, which are smaller than 13, and four numbers, namely 17, 26, 21 and 31, which are greater than 13. To find such a median, it appears that we have to sort the numbers; for this case, the sorted sequence is 4, 5, 7, 10, 13, 17, 21, 26, 31 and the median 13 can then be easily determined. The trouble is that the time-complexity of sorting is at least O(n log n), so if we use a median finding algorithm which employs sorting, its time-complexity is at least O(n log n). In the following section, we shall show that there is a median finding algorithm whose time-complexity is O(n). This algorithm is based upon the prune and search strategy. Meanwhile, in the analysis of Algorithm 2.6, we shall use this fact.
The time-complexity of Algorithm 2.6 depends on the time-complexities of the following steps:
(1) In Step 1, the splitting step, there is a median finding operation. As we pointed out in the above paragraph, this can be accomplished in O(n) steps. This means that the time-complexity of the splitting step is O(n).
(2) In Step 2, there are two sub-problems. Each of them is of size n/2 and will be recursively solved by using the same algorithm.
(3) In Step 3, the merging step, there is a sorting step and a linear scan step. The sorting takes O(n log n) steps and the linear scan takes O(n) steps, so it takes O(n log n) steps to complete Step 3. Therefore the time-complexity of the merging step is O(n log n).
Let T(n) denote the time-complexity of the algorithm, and let S(n) and M(n) denote the time-complexities of the splitting and merging steps respectively. Then

T(n) = 2T(n/2) + S(n) + M(n) for n > 1, and
T(n) = b for n = 1,

where b is a constant. Using the definition of the big-O notation, we have

T(n) ≤ 2T(n/2) + c1 n + c2 n log n ≤ 2T(n/2) + c n log n

where c is a constant. Let us assume that n = 2^k for some k. We now have

T(n) ≤ 2T(n/2) + c n log n
     ≤ 2(2T(n/4) + c (n/2) log(n/2)) + c n log n
     = 4T(n/4) + c n (log(n/2) + log n)
     ≤ ...
     ≤ n T(1) + c n (log n + log(n/2) + log(n/4) + ... + log 2)
     = b n + c n (1 + log n)(log n)/2
     = b n + (c/2) n log^2 n + (c/2) n log n.

Therefore, T(n) = O(n log^2 n).
Although O(n log^2 n) is much better than O(n^2), we can still do better. In the following, we shall show how to improve the algorithm so that its time-complexity becomes O(n log n). In Algorithm 2.6, we have to perform sorting in every merging step, and this is the main reason why the time-complexity is O(n log^2 n). Suppose we perform a preprocessing step by sorting the points according to their y-values before we start the divide and conquer maxima finding algorithm. This preprocessing takes O(n log n) time. The total time-complexity is now O(n log n) + T(n), where

T(n) = 2T(n/2) + O(n) + O(n) for n > 1, and
T(n) = b for n = 1.

It can be easily shown that T(n) = O(n log n), and the total time-complexity, including the preprocessing step, is therefore O(n log n).
2.7 The Selection Problem and the Prune and Search Strategy
In the maxima finding algorithm based upon the divide and conquer strategy, we need to find the median of a set of numbers. This median finding problem can be generalized to the selection problem, which is defined as follows: we are given a set S of n numbers and a number k, and we are asked to find the kth smallest (or the kth largest) number among the numbers in S. It is obvious that the median finding problem is a special case of the more general selection problem.
To solve the selection problem, an easy method is to sort the numbers; after sorting, the kth smallest (or the kth largest) number can be found immediately. In this section, we show that we can avoid sorting by using the prune and search strategy. The prune and search strategy is again a strategy which is recursive in some sense; that is, an algorithm based upon this strategy is always recursive.
We are given n data items. Suppose that we have a mechanism in which, after each iteration, a constant fraction, say f, of the input data is eliminated; the problem is solved when the problem size has been reduced to a reasonably small number. Let T(n) be the time needed to solve a problem of size n using the prune and search strategy, and assume that the time needed to eliminate the fraction f of the n data items is O(n^k). Then

T(n) = T((1 − f)n) + O(n^k) for n > 5, and
T(n) = c for n ≤ 5.

For sufficiently large n, we have

T(n) ≤ T((1 − f)n) + c n^k
     ≤ T((1 − f)^2 n) + c n^k + c (1 − f)^k n^k
     ≤ ...
     ≤ c' + c n^k (1 + (1 − f)^k + (1 − f)^{2k} + ... + (1 − f)^{pk}).

Since (1 − f)^k < 1, the sum in parentheses is bounded by a constant as n → ∞, and therefore

T(n) = O(n^k).
We now explain why the prune and search strategy can be applied to solve the selection problem. Given a set S of n numbers, suppose that there is a number p which divides S into three subsets S1, S2 and S3: S1 contains all of the numbers smaller than p, S2 contains all of the numbers equal to p, and S3 contains all of the numbers greater than p. Then we have the following cases:
Case 1: The size of S1 is at least k. In this case, the kth smallest number of S must be located in S1 and we can prune away S2 and S3.
Case 2: The condition of Case 1 does not hold, but the combined size of S1 and S2 is at least k. In this case, the kth smallest number of S must be equal to p.
Case 3: Neither the condition of Case 1 nor that of Case 2 holds. In this case, the kth smallest number of S must be located in S3 and we can prune away S1 and S2.
The problem is to determine an appropriate p. This number p must guarantee that a constant fraction of the numbers can be eliminated. Algorithm 2.7 can be used to find such a p.
Algorithm 2.7 A Subroutine to Find p from n Numbers for the Selection Problem
Input: A set S of n numbers.
Output: The number p which is to be used in the algorithm to find the kth smallest number based upon the prune and search strategy.
Step 1: Divide S into ⌈n/5⌉ subsets of 5 numbers each, adding dummy elements to the last subset if necessary.
Step 2: Sort each of the 5-number subsets.
Step 3: Find the median mi of the ith subset. Recursively find the median of m1, m2, ..., m⌈n/5⌉ by using the selection algorithm itself. Let p be this median.
Figure 2.15: The Execution of Algorithm 2.7
That the selected p guarantees that at least 1/4 of the input data can be eliminated is illustrated in Figure 2.15.
Some points about Algorithm 2.7 are in order. First, it is not essential that the input set be divided into subsets containing 5 numbers; we may divide it into subsets each containing, say, 7 numbers. Our algorithm works as long as each subset contains a constant number of numbers. Note that as long as the input size is a constant, it takes O(1), meaning a constant number of, steps to complete the algorithm. Thus, each sorting performed in Step 2 takes a constant number of steps. For Step 3, p is found by using the selection algorithm itself recursively.
The following is the algorithm based upon the prune and search strategy to find the kth smallest
number.
Algorithm 2.8 A Prune and Search Algorithm to Find the kth Smallest Number
Input: A set S of n numbers.
Output: The kth smallest number of S.
Step 1: Divide S into ⌈n/5⌉ subsets of 5 numbers each; if n is not a multiple of 5, add dummy elements to the last subset so that it contains five elements.
Step 2: Sort each subset of elements.
Step 3: Use Algorithm 2.7 to determine p.
Step 4: Partition S into three subsets S1, S2 and S3, containing the numbers less than p, equal to p, and greater than p, respectively.
Step 5: If |S1| ≥ k, discard S2 and S3 and select the kth smallest number of S1 in the next iteration; else if |S1| + |S2| ≥ k, p is the kth smallest number of S; otherwise, let k' = k − |S1| − |S2| and select the k'th smallest number from S3 in the next iteration.
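The following Python sketch (our own illustration) combines Algorithms 2.7 and 2.8: the median of the medians of the 5-number subsets serves as p, and one of the three subsets is kept for the next iteration.

def select(s, k):
    # Returns the kth smallest (k >= 1) number of the list s.
    if len(s) <= 5:
        return sorted(s)[k - 1]        # small inputs: solve directly
    # Algorithm 2.7: p is the median of the subset medians.
    medians = [sorted(s[i:i + 5])[len(s[i:i + 5]) // 2]
               for i in range(0, len(s), 5)]
    p = select(medians, (len(medians) + 1) // 2)
    s1 = [x for x in s if x < p]       # numbers less than p
    s2 = [x for x in s if x == p]      # numbers equal to p
    s3 = [x for x in s if x > p]       # numbers greater than p
    if k <= len(s1):                   # Case 1
        return select(s1, k)
    if k <= len(s1) + len(s2):         # Case 2
        return p
    return select(s3, k - len(s1) - len(s2))   # Case 3

print(select([17, 5, 7, 26, 10, 21, 13, 31, 4], 5))   # 13, the median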
Let T(n) denote the time-complexity of the algorithm. Then

T(n) ≤ T((3/4)n) + T(n/5) + O(n).

The first term, T((3/4)n), is due to the fact that at least 1/4 of the input data will be eliminated after each iteration. The second term, T(n/5), is due to the fact that during the execution of Algorithm 2.7 we have to solve a selection problem involving n/5 numbers. The third term, O(n), is due to the fact that dividing the n numbers into subsets and partitioning S take O(n) steps. It can be proved that T(n) = O(n). The proof is rather complicated and is omitted here.
Although we may dislike time-complexities such as O(n^3), O(n^5) and so on, they are not so bad compared with time-complexities such as O(2^n) or O(n!). When n = 10000, 2^n is an exceedingly large number; if an algorithm has such a time-complexity, the problem can never be solved by any computer when n is large. An algorithm is a polynomial algorithm if its time-complexity is O(p(n)) where p(n) is a polynomial function, such as n^2, n^4 and so on. An algorithm is an exponential algorithm if its time-complexity cannot be bounded by a polynomial function. There are many problems which have polynomial algorithms: the sorting problem, the minimal spanning tree problem and the longest common subsequence problem all have polynomial algorithms. A problem is called a polynomial problem if there exists a polynomial algorithm to solve it. Unfortunately, there are also many problems which, up to now, have no known polynomial algorithms. We are interested in one question: is it possible that in the future some polynomial algorithms will be found for them? This question will be answered in the next section.
2.8 The NP-Complete Problems
The concept of NP-completeness is perhaps the most difficult one in the field of design and analysis
of algorithms. It is impossible to present this idea formally. We shall instead present an informal
discussion of these concepts.
Let us first define some problems.
The partition problem: We are given a set S of numbers and we are asked to determine whether S can be partitioned into two subsets S1 and S2 such that the sum of the elements in S1 is equal to the sum of the elements in S2.
For example, let S = {13, 2, 17, 20, 8}. The answer to this problem instance is "yes" because we can partition S into S1 = {13, 17} and S2 = {2, 20, 8}.
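To see why such a decision problem can be expensive to answer, consider the brute-force Python decision procedure below (our own illustration): it simply tries every subset, and therefore takes exponential time in the worst case.

from itertools import combinations

def can_partition(numbers):
    # Decides the partition problem by exhaustive search over subsets.
    total = sum(numbers)
    if total % 2 != 0:                 # an odd total can never be split
        return False
    items = list(numbers)
    for r in range(1, len(items)):     # all nonempty proper subsets
        for s1 in combinations(items, r):
            if sum(s1) == total // 2:
                return True
    return False

print(can_partition([13, 2, 17, 20, 8]))   # True: {13, 17} vs {2, 20, 8}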
The Sum of Subset Problem: We are given a set S of numbers and a constant c, and we are asked to determine whether there exists a subset S' of S such that the sum of the elements in S' is equal to c.
For example, let S = {12, 9, 33, 42, 7, 10, 5} and c = 24. The answer to this problem instance is "yes", as there exists S' = {9, 10, 5} and the sum of the elements in S' is equal to 24. If c is 6, the answer will be "no".
The Satisfiability Problem: We are given a Boolean formula X and we are asked whether there exists an assignment of true or false to the variables in X which makes X true.
For example, let X be (x1 ∨ x2 ∨ x3) ∧ (¬x1 ∨ x2) ∧ (¬x2 ∨ x3). Then the following assignment will make X true and the answer will be "yes":
x1 = F, x2 = F, x3 = T.
If X is x1 ∧ ¬x1, there will be no assignment which can make X true and the answer will be "no".
The Minimal Spanning Tree Problem: Given a graph G, find a spanning tree T of G with the minimum total length.
The Traveling Salesperson Problem: Given a graph G = (V,E), find a cycle of edges of this graph such that every vertex in the graph is visited exactly once and the total length of the cycle is minimum.
For example, consider Figure 2.16. There are two cycles satisfying our condition: C1 = a → b → e → d → c → f → a and C2 = a → c → b → e → d → f → a. C1 is shorter and is the solution of this problem instance.
Figure 2.16: A Graph
For the partition problem, the sum of subset problem and the satisfiability problem, their solutions are
either "yes" or "no". They are called decision problems. The minimal spanning tree problem and the
traveling salesperson problem are called optimization problems.
For an optimization problem, there is always a decision problem corresponding to it. For instance, consider the minimal spanning tree problem; we can define a decision version of it as follows: given a graph G, determine whether there exists a spanning tree of G whose total length is less than a given constant c. This decision version can be solved once the minimal spanning tree problem, which is an optimization problem, is solved: suppose the total length of the minimal spanning tree is a; if a < c, the answer is "yes"; otherwise, the answer is "no". The decision version of the minimal spanning tree problem is called the minimal spanning tree decision problem.
Similarly, we can define the longest common subsequence decision problem as follows: given two sequences, determine whether there exists a common subsequence of them whose length is greater than a given constant c. The decision version will be solved as soon as the optimization problem is solved.
In general, optimization problems are more difficult than decision problems. To investigate whether
an optimization problem is difficult to solve, we merely have to see whether its decision version is
difficult or not. If the decision version is difficult already, the optimization version must be difficult.
Before discussing NP-complete problems, note that there is a term called NP problem. We cannot
formally define NP problems here as it is too complicated to do so. The reader may just remember
the following: (1) NP problems are all decision problems. (2) Nearly all of the decision problems are
NP problems. Among the NP problems, there are many problems which have polynomial algorithms.
They are called P problems. For instance, the minimal spanning tree decision problem and the longest
common subsequence decision problem are all P problems. There are also a large set of problems
which, up to now, have no polynomial algorithms.
Figure 2.17: NP Problems
NP-complete problems constitute a subset of NP problems, as shown in Figure 2.17. A precise and formal definition of NP-complete problems cannot be given in this book, but some important properties of NP-complete problems can be stated as follows:
(1) Up to now, no NP-complete problem has any worst case polynomial algorithm.
(2) If any NP-complete problem can be solved in polynomial time in worst case, all NP problems,
including all NP-complete problems, can be solved in polynomial time in worst case.
(3) Whether a problem is NP-complete or not has to be formally proved and there are thousands of
problems proved to be NP-complete problems.
(4) If the decision version of an optimization problem is NP-complete, this optimization problem is
called NP-hard.
Based upon the above facts, we can conclude that all NP-complete and NP-hard problems must be difficult problems. Not only do they not have polynomial algorithms at present, it is also quite unlikely that they will have polynomial algorithms in the future, because of the second property stated above.
The satisfiability problem is a famous NP-complete problem. The traveling salesperson problem is an
NP-hard problem. Many other problems, such as the chromatic number problem, vertex covering
problem, bin packing problem, 0/1 knapsack problem and the art museum problem are all NP-hard. In
the future, we will often claim that a certain problem is NP-complete without giving a formal proof.
Once a problem is said to be NP-complete, it means that it is quite unlikely that a polynomial algorithm can be designed for it. In fact, the reader should not even try to find a polynomial algorithm for it. But the reader must understand that we cannot say that there exist no polynomial algorithms for NP-complete problems. We are merely saying that the chance of having such algorithms is very small.
It should be noted here that NP-completeness refers to worst cases. Thus, it is still possible to find an algorithm for an NP-complete problem which has polynomial time-complexity in the average case. It is our experience that this is also quite difficult, as the analysis of average cases is usually quite difficult to begin with. It is also possible to design algorithms which perform rather well in practice even though we cannot give an average case analysis of them.
Should we give up hope when we have proved that a problem is NP-hard? No, we should not. In the next section, we shall introduce the concept of approximation algorithms: whenever we have proved a problem to be NP-complete, we should try to design an approximation algorithm for it.
2.9 Approximation Algorithms
As indicated in the previous section, many optimization problems are NP-hard. This means that it is quite unlikely that polynomial algorithms can be designed for these problems. Thus it is desirable to have approximation algorithms which produce approximate solutions with polynomial time-complexities.
Figure 2.18: A Graph
Let us consider the vertex covering problem. Given a graph G = (V,E), the vertex covering problem requires us to find a minimum-size set of vertices from V which covers all of the edges in E. For instance, for the graph in Figure 2.18, vertex a covers all edges, so the solution is {a}. For the graph in Figure 2.19, the solution is {b,d}.
It has been proved that the vertex covering problem is NP-complete. Algorithm 2.9, given below, is an approximation algorithm for this problem.
Figure 2.19: A Graph
Let us apply this approximation algorithm to the graph in Figure 2.18. Suppose we pick edge (a,d). We can see that all other edges are incident to {a,d}; thus {a,d} is the approximate solution. Note that the optimum solution is {a}, so the size of the approximate solution is twice as large as that of the optimum solution.
Now we apply the algorithm to the graph in Figure 2.19. Suppose we pick (c,d). S will be {c,d}, and edges (b,c) and (d,e) will be eliminated. Edge (a,b) still remains, so we pick (a,b). The final S will be {a,b,c,d}. It was pointed out above that the optimum solution is {b,d}. Thus the approximation algorithm has again produced an approximate solution twice the size of the optimum solution.
Algorithm 2.9 An Approximation Algorithm to Solve the Vertex Covering Problem
Input: A graph G = (V,E).
Output: An approximate solution S for the vertex covering problem, with performance ratio 2.
Step 1: Pick any remaining edge e. Put the two end vertices u and v of e into S.
Step 2: Eliminate all edges which are incident to u or v.
Step 3: If there is no edge left, output S as the approximate solution. Otherwise, go to Step 1.
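A direct Python rendering of Algorithm 2.9 is sketched below (our own illustration). The path graph in the usage line is an assumed stand-in, since the exact edges of Figure 2.19 cannot be recovered from the text.

def approx_vertex_cover(edges):
    # edges: list of (u, v) pairs. Returns a vertex cover whose size is
    # at most twice the optimum, following Steps 1-3 of Algorithm 2.9.
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:   # edge not yet covered
            cover.update((u, v))                # put both endpoints in S
    return cover

print(approx_vertex_cover([('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e')]))
# e.g. {'a', 'b', 'c', 'd'}; an optimum cover of this path is {'b', 'd'}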
It is by no means accidental that in each case the approximate solution is at most twice as large as the optimum solution. We shall now prove that Algorithm 2.9 always performs in such a way.
Let App denote the size of the solution produced by an approximation algorithm and let Opt denote the size of an optimal solution. The performance ratio of the approximation algorithm, denoted as α, is defined as

α = App / Opt.

For some approximation algorithms, the performance ratio α may be a function of the input size n; for instance, it may be O(log n) or O(n). For some approximation algorithms, the performance ratios are constants. In general, we would like an approximation algorithm to have a constant performance ratio, and this constant should be as small as possible. For Algorithm 2.9, we shall prove that the performance ratio is less than or equal to 2.
Let k edges be chosen by our approximation algorithm. Then App = 2k. Since an optimal solution must be a vertex cover, every edge, in particular every chosen edge, must be covered by at least one vertex of the optimal solution. Due to the special property of our approximation algorithm, no two of the chosen edges share a vertex, so no two of them can be covered by the same vertex. Thus, we have

k ≤ Opt.

This means that

App = 2k ≤ 2 Opt.

We conclude that α ≤ 2 for our approximation algorithm.