Human Intelligence

Unit-1
Overview and Search Technique

Introduction to Artificial Intelligence
Definition of AI
What is AI?
Artificial Intelligence is concerned with the design of
intelligence in an artificial device.
The term was coined by McCarthy in 1956.
There are two ideas in the definition.
1. Intelligence
2. Artificial device
What is intelligence?
Is it that which characterizes humans? Or is there an absolute
standard of judgment?
Accordingly there are two possibilities:
A system with intelligence is expected to behave as
intelligently as a human
A system with intelligence is expected to behave in the best
possible manner
Secondly what type of behavior are we talking about?
Are we looking at the thought process or reasoning ability of
the system?
Or are we only interested in the final manifestations of the
system in terms of its actions?
Given this scenario different interpretations have been used
by different researchers as defining the scope and view of
Artificial Intelligence.
1. One view is that artificial intelligence is about designing
systems that are as intelligent as humans.
This view involves trying to understand human thought and an
effort to build machines that emulate the human thought process.
This view is the cognitive science approach to AI.
2. The second approach is best embodied by the concept of the
Turing Test. Turing held that in the future computers could be
programmed to acquire abilities rivaling human intelligence. As
part of his argument Turing put forward the idea of an 'imitation
game', in which a human being and a computer would be
interrogated under conditions where the interrogator would not
know which was which, the communication being entirely by
textual messages. Turing argued that if the interrogator could not
distinguish them by questioning, then it would be unreasonable
not to call the computer intelligent. Turing's 'imitation game' is
now usually called 'the Turing test' for intelligence.
3. The logic and laws-of-thought approach deals with the study of
ideal or rational thought processes and inference. The emphasis in
this case is on the inference mechanism and its properties. That
is, how the system arrives at a conclusion, or the reasoning behind
its selection of actions, is very important in this point of view. The
soundness and completeness of the inference mechanisms are
important here.
4. The fourth view of AI is that it is the study of rational agents.
This view deals with building machines that act rationally. The
focus is on how the system acts and performs, and not so much
on the reasoning process. A rational agent is one that acts
rationally, that is, in the best possible manner.
Problem Solving:
Strong AI aims to build machines that can truly reason and
solve problems. These machines should be self-aware and their
overall intellectual ability needs to be indistinguishable from
that of a human being. Excessive optimism in the 1950s and
1960s concerning strong AI has given way to an appreciation of
the extreme difficulty of the problem. Strong AI maintains that
suitably programmed machines are capable of cognitive mental
states.
Weak AI: Deals with the creation of some form of computer-based
artificial intelligence that cannot truly reason and solve
problems, but can act as if it were intelligent. Weak AI holds that
suitably programmed machines can simulate human cognition.
Applied AI: Aims to produce commercially viable "smart"
systems such as, for example, a security system that is able to
recognize the faces of people who are permitted to enter a
particular building. Applied AI has already enjoyed considerable
success.
Cognitive AI: Computers are used to test theories about how the
human mind works--for example, theories about how we
recognize faces and other objects, or about how we solve
abstract problems.
Best First Search: Best-first search is a way of
combining the advantages of both depth-first and breadth-first
search into a single method.
One way of combining the two is to follow a single path
at a time, but switch paths whenever some competing path looks
more promising than the current one does.
At each step of the best-first-search process, we select
the most promising of the nodes we have generated so far. This
is done by applying an appropriate heuristic to each of them. We
then expand the chosen node by using the rules to generate its
successors. If one of them is a solution, we can quit. If not, all
those new nodes are added to the set of nodes generated so far.
Again the most promising node is selected and the process
continues.
The figure shows the beginning of a best-first search procedure.
Initially, there is only one node, so it will be expanded. Doing so
generates three new nodes. The heuristic function, which, in this
example, is an estimate of the cost of getting to a solution from a
given node, is applied to each of these new nodes. Since node D
is the most promising, it is expanded next, producing two
successor nodes, E and F. The heuristic function is then
applied to them. Now another path, that going through node B,
looks more promising, so it is pursued, generating nodes G and
H. But again when these new nodes are evaluated they look less
promising than another path, so attention is returned to the path
through D to E. E is then expanded, yielding nodes I and J. At
the next step, J will be expanded, since it is the most promising.
This process can continue until a solution is found.
Fig: A Best-first Search (Steps 1 to 5)

The actual operation of the algorithm is very simple.
It proceeds in steps, expanding one node at each step, until it
generates a node that corresponds to a goal state. At each step, it
picks the most promising of the nodes that have so far been
generated but not expanded. It generates the successors of the
chosen node, applies the heuristic function to them, and adds
them to the list of open nodes, after checking to see if any of
them have been generated before. By doing this check, we can
guarantee that each node only appears once in this graph,
although many nodes may point to it as a successor. Then the
next step begins.
The process can be summarized as follows.
Algorithm: Best-First Search
1. Start with OPEN containing just the initial state.
2. Until a goal is found or there are no nodes left on OPEN
do:
(a) Pick the best node on OPEN.
(b) Generate its successors.
(c) For each successor do:
i. If it has not been generated before, evaluate it, add it to
OPEN, and record its parent.
ii. If it has been generated before, change the parent if this
new path is better than the previous one. In that case,
update the cost of getting to this node and to any successors
that this node may already have.
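The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the names best_first_search, GRAPH and H are hypothetical, the heuristic values are made up so that the run mirrors the A, D, B, E, J expansion order described earlier, nodes are ranked by the heuristic alone, and step (c)ii (re-parenting when a better path is found) is left out because no path costs are tracked.

```python
import heapq
import itertools


def best_first_search(start, is_goal, successors, h):
    """Greedy best-first search; returns the path to a goal or None."""
    tie = itertools.count()                     # tie-breaker for equal h values
    open_heap = [(h(start), next(tie), start)]  # OPEN, ordered by heuristic value
    parent = {start: None}                      # every node generated so far
    while open_heap:
        _, _, node = heapq.heappop(open_heap)   # (a) pick the best node on OPEN
        if is_goal(node):
            path = []
            while node is not None:             # follow parent links back to start
                path.append(node)
                node = parent[node]
            return path[::-1]
        for child in successors(node):          # (b) generate its successors
            if child not in parent:             # (c)i. not generated before
                parent[child] = node
                heapq.heappush(open_heap, (h(child), next(tie), child))
    return None


# Hypothetical graph and heuristic values (lower means more promising).
GRAPH = {"A": ["B", "C", "D"], "D": ["E", "F"], "B": ["G", "H"], "E": ["I", "J"]}
H = {"A": 9, "B": 3, "C": 5, "D": 1, "E": 4, "F": 6, "G": 6, "H": 5, "I": 2, "J": 1}

path = best_first_search("A", lambda n: n == "J", lambda n: GRAPH.get(n, []), H.get)
print(path)  # ['A', 'D', 'E', 'J']
```

The heap plays the role of OPEN, and the parent dictionary doubles as the record of nodes that have already been generated.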
A* Search: The best-first search algorithm that was just
presented is a simplification of an algorithm called A*, which
was first presented by Hart et al. [1968; 1972]. This algorithm
uses the same f, g, and h functions, as well as the lists OPEN
and CLOSED.
The form of the heuristic estimation function for A* is
f*(n) = g*(n) + h*(n)
where the two components g*(n) and h*(n) are estimates of
the cost (or distance) from the start node to node n and the
cost from node n to a goal node, respectively.
Nodes on the open list are nodes that have been
generated but not yet expanded, while nodes on the closed list
are nodes that have been expanded and whose children are,
therefore, available to the search program. The A* algorithm
proceeds as follows.
Algorithm: A*
1. Place the starting node s on open.
2. If open is empty, stop and return failure.
3. Remove from open the node n that has the smallest value
of f*(n). If the node is a goal node, return success and
stop. Otherwise,
4. Expand n, generating all of its successors n', and place n
on closed. For every successor n', if n' is not already on
open or closed, attach a back-pointer to n, compute f*(n'),
and place it on open.
5. Each n' that is already on open or closed should be
attached to back-pointers which reflect the lowest g*(n')
path. If n' was on closed and its pointer was changed,
remove it and place it on open.
6. Return to step 2.
It has been shown that the A* algorithm is both complete and
admissible, provided that h* never overestimates the true cost h.
Thus, A* will always find an optimal path if one
exists. The efficiency of an A* algorithm depends on how
closely h* approximates h and on the cost of computing f*.
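A compact Python sketch of the A* steps is given below. It is only an illustration under assumptions: the function name a_star, the small weighted graph EDGES and the heuristic table H_EST are hypothetical, and the open list is kept as a heap in which outdated entries are simply skipped when they are popped.

```python
import heapq
import itertools


def a_star(start, goal, neighbors, h):
    """neighbors(n) yields (successor, step_cost); h(n) estimates the cost to goal."""
    tie = itertools.count()
    g = {start: 0}                       # cheapest known cost from start
    parent = {start: None}               # back-pointers
    open_heap = [(h(start), next(tie), start)]
    closed = set()
    while open_heap:                     # step 2: an empty open list means failure
        _, _, n = heapq.heappop(open_heap)   # step 3: smallest f = g + h
        if n in closed:
            continue                     # outdated entry for an expanded node
        if n == goal:
            cost, path = g[n], []
            while n is not None:
                path.append(n)
                n = parent[n]
            return path[::-1], cost
        closed.add(n)                    # step 4: n is expanded
        for m, step in neighbors(n):
            new_g = g[n] + step
            if m not in g or new_g < g[m]:
                g[m] = new_g             # step 5: keep the lowest-cost back-pointer
                parent[m] = n
                closed.discard(m)        # reopen m if it had been closed
                heapq.heappush(open_heap, (new_g + h(m), next(tie), m))
    return None, float("inf")


# Hypothetical weighted graph and an admissible heuristic for it.
EDGES = {"S": [("A", 1), ("B", 4)], "A": [("B", 2), ("G", 6)], "B": [("G", 3)], "G": []}
H_EST = {"S": 5, "A": 4, "B": 3, "G": 0}

print(a_star("S", "G", lambda n: EDGES[n], H_EST.get))  # (['S', 'A', 'B', 'G'], 6)
```

In this run the direct edge from A to G (total cost 7) is generated first, but the cheaper route through B (total cost 6) replaces it before G is selected for expansion.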
AO* algorithm:
1. Place the starting node s on open.
2. Using the search tree constructed thus far, compute the
most promising solution tree T0.
3. Select a node n that is both on open and a part of T0.
Remove n from open and place it on closed.
4. If n is a terminal goal node, label n as solved. If the
solution of n results in any of n's ancestors being solved,
label all the ancestors as solved. If the start node s is
solved, exit with success where T0 is the solution tree.
Remove from open all nodes with a solved ancestor.
5. If n is not a solvable node (operators cannot be applied),
label n as unsolvable. If the start node is labeled as
unsolvable, exit with failure. If any of n's ancestors
become unsolvable because n is, label them unsolvable as
well. Remove from open all nodes with unsolvable
ancestors.
6. Otherwise, expand node n, generating all of its successors.
For each such successor node that contains more than one
subproblem, generate their successors to give individual
subproblems. Attach to each newly generated node a back
pointer to its predecessor. Compute the cost estimate h* for
each newly generated node and place all such nodes that do
not yet have descendants on open. Next, recompute the
values of h* at n and each ancestor of n.
7. Return to step 2.
Hill Climbing Search: Hill climbing gets its name
from the way the nodes are selected for expansion. At each
point in the search path, a successor node that appears to lead
most quickly to the top of the hill (the goal) is selected for
exploration. This method requires that some information be
available with which to evaluate and order the most promising
choices.
Hill climbing is like depth-first searching where the
most promising child is selected for expansion. When the
children have been generated, alternative choices are
evaluated using some type of heuristic function. The path that
appears most promising is then chosen and no further
reference to the parent or other children is retained. This
process continues from node to node with previously
expanded nodes being discarded.
Hill climbing can produce substantial savings over blind
searches when an informative, reliable function is available to
guide the search to a global goal. It suffers from some serious
drawbacks when this is not the case. Potential problem types
named after certain terrestrial anomalies are the foothill,
ridge, and plateau traps.
The foothill trap results when local maxima or peaks
are found. In this case the children all have less promising
goal distances than the parent node. The search is essentially
trapped at the local node with no indication of goal direction.
Possible remedies are to try moving in
some arbitrary direction a few generations in the hope that the
real goal direction will become evident, or to backtrack to an
ancestor node and try a secondary path choice.
A second potential problem occurs when several
adjoining nodes have higher values than surrounding nodes.
This is the equivalent of a ridge. Finally, the search may encounter a
plateau type of structure, that is, an area in which all
neighboring nodes have the same values. Once again, one of
the methods noted above must be tried to escape the trap.
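The basic procedure and the foothill trap can be illustrated with a short Python sketch. The names hill_climb, LANDSCAPE and neighbours are hypothetical, and the one-dimensional landscape is invented for the example; higher values are better, and the search stops as soon as no child improves on the current node.

```python
def hill_climb(start, successors, value):
    """Move to the best child until no child is better than the current node."""
    current = start
    while True:
        children = list(successors(current))
        if not children:
            return current
        best = max(children, key=value)
        if value(best) <= value(current):
            return current      # local maximum (foothill) or plateau
        current = best          # the parent and other children are discarded


# Hypothetical landscape: position -> height, with a foothill at index 1
# and the global peak at index 4.
LANDSCAPE = [1, 3, 2, 5, 9, 4]


def neighbours(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(LANDSCAPE)]


print(hill_climb(0, neighbours, lambda i: LANDSCAPE[i]))  # 1
print(hill_climb(2, neighbours, lambda i: LANDSCAPE[i]))  # 4
```

Starting at index 0 the search is trapped on the foothill at index 1, while starting at index 2 it reaches the global peak, which shows how strongly the outcome depends on the starting point when the evaluation function has local maxima.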
Breadth-First Search: Breadth-first searches are
performed by exploring all nodes at a given depth before
proceeding to the next level. This means that all immediate
children of nodes are explored before any of the children's
children are considered. Breadth-first tree search is illustrated
in the figure. It has the obvious advantage of always finding a
minimal path length solution where one exists. A great many
nodes may need to be explored before a solution is found,
especially if the tree is very full. It uses a queue structure to
hold all generated but still unexplored nodes. The
breadth-first algorithm proceeds as follows.
BREADTH-FIRST SEARCH
1. Place the starting node s on the queue.
2. If the queue is empty, return failure and stop.
3. If the first element on the queue is a goal node g,
return success and stop. Otherwise,
4. Remove and expand the first element from the queue
and place all the children at the end of the queue in
any order.
5. Return to step 2.
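A minimal Python sketch of these steps is shown below. The names breadth_first_search and TREE are hypothetical and the tree is made up for the example. The queue holds whole paths rather than single nodes so that the solution path can be returned, and, as in the tree search described here, no check for repeated states is made.

```python
from collections import deque


def breadth_first_search(start, is_goal, children):
    """Return the shallowest path from start to a goal node, or None."""
    queue = deque([[start]])             # step 1: the queue holds paths
    while queue:                         # step 2: an empty queue means failure
        path = queue.popleft()           # take the first element of the queue
        node = path[-1]
        if is_goal(node):                # step 3: goal test
            return path
        for child in children(node):     # step 4: children go to the end
            queue.append(path + [child])
    return None


# Hypothetical search tree given as node -> list of children.
TREE = {"S": ["A", "B"], "A": ["C", "D"], "B": ["E"], "D": ["G"], "E": ["G"]}

print(breadth_first_search("S", lambda n: n == "G", lambda n: TREE.get(n, [])))
# ['S', 'A', 'D', 'G']
```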
Mini-max search: The mini-max search procedure is a
depth-limited search procedure. The idea is to start at the current
position and use the plausible-move generator to generate the set
of possible successor positions. Now we can apply the static
evaluation function to those positions and simply choose the
best one.
The starting position is exactly as good for us as the position
generated by the best move we can make next. Here we assume
that the static evaluation function returns large values to indicate
good situations for us, so our goal is to maximize the value of
the static evaluation function of the next board position.
An example of this operation is shown in fig1. It assumes a
static evaluation function that returns values ranging from -10 to
10, with 10 indicating a win for us, -10 a win for the opponent,
and 0 an even match. Since our goal is to maximize the value of
the heuristic function, we choose to move to B. Backing B's
value up to A, we can conclude that A's value is 8, since we
know we can move to a position with a value of 8.
Fig1. One ply search and two ply search

But since we know that the static evaluation function is not
completely accurate, we would like to carry the search farther
ahead than one ply. This could be very important, for example, in
a chess game in which we are in the middle of a piece exchange.
After our move, the situation would appear to be very good, but
if we look one move ahead, we will see that one of our pieces
also gets captured and so the situation is not as favorable as it seemed.
Once the values from the second ply are backed up, it becomes
clear that the correct move for us to make at the first level, given
the information we have available, is C, since there is nothing
the opponent can do from there to produce a value worse than
-2. This process can be repeated for as many ply as time allows,
and the more accurate evaluations that are produced can be used
to choose the correct move at the top level. The alternation of
maximizing and minimizing at alternate ply when evaluations
are being pushed back up corresponds to the opposing strategies
of the two players and gives this method the name minimax.
Having described informally the operation of the minimax
procedure, we now describe it precisely. It is a straightforward
recursive procedure that relies on two auxiliary procedures that
are specific to the game being played:
1. MOVGEN(position, player): The plausible-move generator,
which returns a list of nodes representing the moves that can be
made by player in position. We call the two players PLAYER-ONE
and PLAYER-TWO; in a chess program, we might use the
names BLACK and WHITE instead.
2. STATIC(position, player): The static evaluation function,
which returns a number representing the goodness of position
from the standpoint of player.
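A possible Python rendering of the recursive procedure is sketched below, under simplifying assumptions. The helpers movgen and static stand in for MOVGEN and STATIC but take only a position, with values always seen from the maximizing player's standpoint. The tree and its values (GAME_TREE, STATIC_VALUES) are illustrative, chosen to agree with the one-ply and two-ply conclusions in the text.

```python
def minimax(position, depth, maximizing, movgen, static):
    """Return (backed-up value, best move) for position, searching depth plies."""
    moves = movgen(position)
    if depth == 0 or not moves:
        return static(position), None
    best_move = None
    if maximizing:
        best = float("-inf")
        for move in moves:
            value, _ = minimax(move, depth - 1, False, movgen, static)
            if value > best:
                best, best_move = value, move
    else:
        best = float("inf")
        for move in moves:
            value, _ = minimax(move, depth - 1, True, movgen, static)
            if value < best:
                best, best_move = value, move
    return best, best_move


# Illustrative game tree and static values (assumptions, not read from the figure).
GAME_TREE = {"A": ["B", "C", "D"], "B": ["E", "F", "G"], "C": ["H", "I"], "D": ["J", "K"]}
STATIC_VALUES = {"B": 8, "C": 3, "D": -2,
                 "E": 9, "F": -6, "G": 0, "H": 0, "I": -2, "J": -4, "K": -3}


def movgen(position):
    return GAME_TREE.get(position, [])


def static(position):
    return STATIC_VALUES[position]


print(minimax("A", 1, True, movgen, static))  # (8, 'B')
print(minimax("A", 2, True, movgen, static))  # (-2, 'C')
```

With a one-ply search B looks best, but once the opponent's replies are backed up in the two-ply search the preferred move changes to C.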
Heuristic function
A heuristic function, or simply a heuristic, is a function that
ranks alternatives in various search algorithms at each branching
step based on the available information (heuristically) in order to
make a decision about which branch to follow during a search.
Shortest paths
For example, for shortest path problems, a heuristic is a
function, h(n) defined on the nodes of a search tree, which
serves as an estimate of the cost of the cheapest path from that
node to the goal node. Heuristics are used by informed search
algorithms such as Greedy best-first search and A* to choose the
best node to explore. Greedy best-first search will choose the
node that has the lowest value for the heuristic function. A*
search will expand nodes that have the lowest value for g(n) +
h(n), where g(n) is the (exact) cost of the path from the initial
state to the current node. If h(n) is admissible, that is, if h(n)
never overestimates the cost of reaching the goal, then A*
will always find an optimal solution.
The classical problem involving heuristics is the n-puzzle.
Commonly used heuristics for this problem include counting the
number of misplaced tiles and finding the sum of the Manhattan
distances between each block and its position in the goal
configuration. Note that both are admissible.
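Both heuristics are easy to state in code. The following Python sketch assumes the 8-puzzle stored as a tuple of nine numbers read row by row, with 0 standing for the blank; the goal layout GOAL and the sample state are arbitrary choices made for illustration.

```python
GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)   # assumed goal layout, 0 is the blank


def misplaced_tiles(state, goal=GOAL):
    """Number of tiles (the blank excluded) that are not in their goal position."""
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)


def manhattan(state, goal=GOAL, width=3):
    """Sum over all tiles of the horizontal plus vertical distance to the goal."""
    total = 0
    for index, tile in enumerate(state):
        if tile == 0:
            continue                              # the blank is not counted
        target = goal.index(tile)
        total += abs(index // width - target // width)   # row distance
        total += abs(index % width - target % width)     # column distance
    return total


state = (2, 8, 3, 1, 6, 4, 7, 0, 5)
print(misplaced_tiles(state))  # 6
print(manhattan(state))        # 9
```

Every misplaced tile is at least one move away from its goal square, so the Manhattan estimate is never smaller than the misplaced-tiles count.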
Effect of heuristics on computational
performance
In any searching problem where there are b choices at each node
and a depth of d at the goal node, a naive searching algorithm
would potentially have to search around b^d nodes before finding
a solution. Heuristics improve the efficiency of search
algorithms by reducing the branching factor from b to a lower
constant b', using a cutoff mechanism. The branching factor can
be used for defining a partial order on the heuristics, such that
h1(n) < h2(n) if h1(n) has a lower branching factor than h2(n) for a
given node n of the search tree. Heuristics giving lower
branching factors at every node in the search tree are preferred
for the resolution of a particular problem, as they are more
computationally efficient.
Finding heuristics
The problem of finding an admissible heuristic with a low
branching factor for common search tasks has been extensively
researched in the artificial intelligence community. Several
common techniques are used:
Solution costs of sub-problems often serve as useful
estimates of the overall solution cost. These are always
admissible. For example, a heuristic for a 10-puzzle might
be the cost of moving tiles 1-5 into their correct places. A
common idea is to use a pattern database that stores the
exact solution cost of every sub problem instance.
The solution of a relaxed problem often serves as a useful
admissible estimate of the original. For example,
Manhattan distance is a relaxed version of the n-puzzle
problem, because we assume we can move each tile to its
position independently of moving the other tiles.
Given a set of admissible heuristic functions
h1(n),h2(n),...,hi(n), the function h(n) =
max{h1(n),h2(n),...,hi(n)} is an admissible heuristic that
dominates all of them.
Using these techniques a program called ABSOLVER was
written (1993) by A.E. Prieditis for automatically generating
heuristics for a given problem. ABSOLVER generated a new
heuristic for the 8-puzzle better than any pre-existing heuristic
and found the first useful heuristic for solving the Rubik's Cube.
Consistency and Admissibility
If a heuristic function never overestimates the cost of reaching
the goal, then it is called an admissible heuristic function.
If h(n) is consistent, that is, if h(n) <= c(n, n') + h(n') for
every successor n' of n, where c(n, n') is the step cost, then
the values of f(n) = g(n) + h(n) along any path are non-decreasing.
Alpha-beta pruning: Alpha-beta pruning is a search
algorithm which seeks to reduce the number of nodes that
are evaluated by the minimax algorithm in its search tree. It
is an adversarial search algorithm used commonly for
machine playing of two-player games (Tic-tac-toe, Chess,
Go, etc.). It stops completely evaluating a move when at least
one possibility has been found that proves the move to be
worse than a previously examined move. Such moves need
not be evaluated further. Alpha-beta pruning is a sound
optimization in that it does not change the score of the result
of the algorithm it optimizes.
History
Allen Newell and Herbert Simon who used what John
McCarthy calls an "approximation"[1] in 1958 wrote that
alpha-beta "appears to have been reinvented a number of
times".[2] Arthur Samuel had an early version and Richards,
Hart, Levine and/or Edwards found alpha-beta independently
in the United States.[3] McCarthy proposed similar ideas
during the Dartmouth Conference in 1956 and suggested it
to a group of his students including Alan Kotok at MIT in
1961.[4] Alexander Brudno independently discovered the
alpha-beta algorithm, publishing his results in 1963.[5]
Donald Knuth and Ronald W. Moore refined the algorithm
in 1975[6][7] and it continued to be advanced.
Improvements over naive minimax
An illustration of alpha-beta pruning. The grayed-out subtrees
need not be explored (when moves are evaluated from left to
right), since we know the group of subtrees as a whole yields
the value of an equivalent subtree or worse, and as such cannot
influence the final result. The max and min levels represent the
turn of the player and the adversary, respectively.
The benefit of alpha-beta pruning lies in the fact that branches of
the search tree can be eliminated. The search time can in this
way be limited to the 'more promising' subtree, and a deeper
search can be performed in the same time. Like its predecessor,
it belongs to the branch and bound class of algorithms. The
optimization reduces the effective depth to slightly more than
half that of simple minimax if the nodes are evaluated in an
optimal or near optimal order (best choice for side on move
ordered first at each node).
With an (average or constant) branching factor of b, and a search
depth of d plies, the maximum number of leaf node positions
evaluated (when the move ordering is pessimal) is O(b*b*...*b)
= O(b^d), the same as a simple minimax search. If the move
ordering for the search is optimal (meaning the best moves are
always searched first), the number of leaf node positions
evaluated is about O(b*1*b*1*...*b) for odd depth and
O(b*1*b*1*...*1) for even depth, or O(b^(d/2)) = O(sqrt(b^d)). In the
latter case, where the ply of a search is even, the effective
branching factor is reduced to its square root, or, equivalently,
the search can go twice as deep with the same amount of
computation. The explanation of b*1*b*1*... is that all the first
player's moves must be studied to find the best one, but for each,
only the best second player's move is needed to refute all but the
first (and best) first player move; alpha-beta ensures no other
second player moves need be considered. If b=40 (as in chess),
and the search depth is 12 plies, the ratio between optimal and
pessimal sorting is a factor of nearly 40^6, or about 4 billion times.
Normally during alpha-beta, the subtrees are temporarily
dominated by either a first player advantage (when many first
player moves are good, and at each search depth the first move
checked by the first player is adequate, but all second player
responses are required to try and find a refutation), or vice versa.
This advantage can switch sides many times during the search if
the move ordering is incorrect, each time leading to inefficiency.
As the number of positions searched decreases exponentially
each move nearer the current position, it is worth spending
considerable effort on sorting early moves. An improved sort at
any depth will exponentially reduce the total number of
positions searched, but sorting all positions at depths near the
root node is relatively cheap as there are so few of them. In
practice, the move ordering is often determined by the results of
earlier, smaller searches, such as through iterative deepening.
The algorithm maintains two values, alpha and beta, which
represent the minimum score that the maximizing player is
assured of and the maximum score that the minimizing player is
assured of, respectively. Initially alpha is negative infinity and
beta is positive infinity. As the recursion progresses the
"window" becomes smaller. When beta becomes less than alpha,
it means that the current position cannot be the result of best
play by both players and hence need not be explored further.
Additionally, this algorithm can be trivially modified to return
an entire principal variation in addition to the score. Some more
aggressive algorithms such as MTD(f) do not easily permit such
a modification.
Pseudo code
function alphabeta(node, depth, alpha, beta, Player)
    if depth = 0 or node is a terminal node
        return the heuristic value of node
    if Player = MaxPlayer
        for each child of node
            alpha := max(alpha, alphabeta(child, depth-1, alpha, beta, not(Player)))
            if beta <= alpha
                break    (* Beta cut-off *)
        return alpha
    else
        for each child of node
            beta := min(beta, alphabeta(child, depth-1, alpha, beta, not(Player)))
            if beta <= alpha
                break    (* Alpha cut-off *)
        return beta
(* Initial call *)
alphabeta(origin, depth, -infinity, +infinity, MaxPlayer)
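The pseudocode can be turned into a runnable Python sketch along the following lines. The tree AB_TREE, the leaf values AB_LEAVES and the evaluated list are hypothetical and exist only to make the pruning visible; the children and value helpers are passed in as parameters, which is a choice made for this example.

```python
import math


def alphabeta(node, depth, alpha, beta, maximizing, children, value):
    """Minimax value of node with alpha-beta cut-offs."""
    kids = children(node)
    if depth == 0 or not kids:
        return value(node)
    if maximizing:
        for child in kids:
            alpha = max(alpha, alphabeta(child, depth - 1, alpha, beta,
                                         False, children, value))
            if beta <= alpha:
                break            # beta cut-off
        return alpha
    for child in kids:
        beta = min(beta, alphabeta(child, depth - 1, alpha, beta,
                                   True, children, value))
        if beta <= alpha:
            break                # alpha cut-off
    return beta


# Hypothetical two-ply game tree and leaf values.
AB_TREE = {"A": ["B", "C", "D"], "B": ["E", "F", "G"], "C": ["H", "I"], "D": ["J", "K"]}
AB_LEAVES = {"E": 9, "F": -6, "G": 0, "H": 0, "I": -2, "J": -4, "K": -3}
evaluated = []                   # records which leaves were actually scored


def leaf_value(node):
    evaluated.append(node)
    return AB_LEAVES[node]


result = alphabeta("A", 2, -math.inf, math.inf, True,
                   lambda n: AB_TREE.get(n, []), leaf_value)
print(result)     # -2
print(evaluated)  # ['E', 'F', 'G', 'H', 'I', 'J']
```

Leaf K is never evaluated: once J shows that D is worth at most -4 to the maximizing player, who is already assured of -2 through C, the rest of D's subtree cannot change the result.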
Heuristic improvements
Alpha-beta search can be made even faster by considering only a
narrow search window (generally determined by guesswork
based on experience). This is known as aspiration search. In the
extreme case, the search is performed with alpha and beta equal;
a technique known as zero-window search, null-window search,
or scout search. This is particularly useful for win/loss searches
near the end of a game where the extra depth gained from the
narrow window and a simple win/loss evaluation function may
lead to a conclusive result. If an aspiration search fails, it is
straightforward to detect whether it failed high (high edge of
window was too low) or low (lower edge of window was too
high). This gives information about what window values might
be useful in a re-search of the position.
Constraint Satisfaction: Constraint satisfaction is the
process of finding a solution to a set of constraints that impose
conditions that the variables must satisfy. A solution is therefore
an assignment of values to the variables that satisfies all constraints.
The techniques used in constraint satisfaction depend on the
kind of constraints being considered. Often used are constraints
on a finite domain, to the point that constraint satisfaction
problems are typically identified with problems based on
constraints on a finite domain. Such problems are usually solved
via search, in particular a form of backtracking or local search.
Constraint propagation methods are also used on such
problems; most of them are incomplete in general, that is, they
may solve the problem or prove it unsatisfiable, but not always.
Constraint propagation methods are also used in conjunction
with search to make a given problem simpler to solve. Other
considered kinds of constraints are on real or rational numbers;
solving problems on these constraints is done via variable
elimination or the simplex algorithm.
Constraint satisfaction originated in the field of artificial
intelligence in the 1970s (see for example (Laurière 1978)).
During the 1980s and 1990s, the embedding of constraints into a
programming language was developed. Languages often used
for constraint programming are Prolog and C++.
Constraint satisfaction problem

Constraints enumerate the possible values a set of variables
may take. Informally, a finite domain is a finite set of
arbitrary elements. A constraint satisfaction problem on such a
domain contains a set of variables whose values can only be
taken from the domain, and a set of constraints, each
constraint specifying the allowed values for a group of
variables. A solution to this problem is an evaluation of the
variables that satisfies all constraints. In other words, a
solution is a way of assigning a value to each variable
such that all constraints are satisfied by these values.
In practice, constraints are often expressed in compact form,
rather than enumerating all values of the variables that would
satisfy the constraint. One of the most used constraints is the one
establishing that the values of the affected variables must be all
different.
Problems that can be expressed as constraint satisfaction
problems are the Eight queens puzzle, the Sudoku solving
problem, the Boolean satisfiability problem, scheduling
problems and various problems on graphs such as the graph
coloring problem.
While usually not included in the above definition of a
constraint satisfaction problem, arithmetic equations and
inequalities bound the values of the variables they contain and
can therefore be considered a form of constraints. Their domain
is the set of numbers (either integer, rational, or real), which is
infinite: therefore, the relations of these constraints may be
infinite as well; for example, X = Y + 1 has an infinite number of
pairs of satisfying values. Arithmetic equations and inequalities
are often not considered within the definition of a "constraint
satisfaction problem", which is limited to finite domains. They
are however used often in constraint programming.
Solving
Constraint satisfaction problems on finite domains are typically
solved using a form of search. The most used techniques are
variants of backtracking, constraint propagation, and local
search. These techniques are used on problems with nonlinear
constraints.
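As an illustration of the backtracking approach on a finite domain, the Python sketch below colors a small graph so that adjacent nodes get different colors. The function names solve and different_colours, the edge list EDGES_CSP and the color names are all hypothetical; variables are tried in a fixed order and no constraint propagation is performed.

```python
def solve(variables, domains, consistent, assignment=None):
    """Plain backtracking: extend the assignment one variable at a time."""
    if assignment is None:
        assignment = {}
    if len(assignment) == len(variables):
        return dict(assignment)                  # every variable has a value
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        assignment[var] = value
        if consistent(assignment):               # check constraints so far
            result = solve(variables, domains, consistent, assignment)
            if result is not None:
                return result
        del assignment[var]                      # undo and try the next value
    return None                                  # forces the caller to backtrack


# Hypothetical graph-coloring instance: adjacent nodes must differ.
EDGES_CSP = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
NODES = ["A", "B", "C", "D"]


def different_colours(assignment):
    return all(assignment[x] != assignment[y]
               for x, y in EDGES_CSP if x in assignment and y in assignment)


three = {n: ["red", "green", "blue"] for n in NODES}
two = {n: ["red", "green"] for n in NODES}

print(solve(NODES, three, different_colours))
# {'A': 'red', 'B': 'green', 'C': 'blue', 'D': 'red'}
print(solve(NODES, two, different_colours))  # None
```

With only two colors the triangle A, B, C cannot be colored, so the search exhausts every partial assignment and reports that the problem is unsatisfiable.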
Variable elimination and the simplex algorithm are used for
solving linear and polynomial equations and inequalities, and
problems containing variables with infinite domain. These are
typically solved as optimization problems in which the
optimized function is the number of violated constraints.
Complexity
Solving a constraint satisfaction problem on a finite domain is
an NP-complete problem. Research has shown a number of
tractable subcases, some limiting the allowed constraint
relations, some requiring the scopes of constraints to form a tree,
possibly in a reformulated version of the problem. Research has
also established a relationship of the constraint satisfaction
problem with problems in other areas such as finite model
theory.
Constraint programming
Constraint programming is the use of constraints as a
programming language to encode and solve problems. This is
often done by embedding constraints into a programming
language, which is called the host language. Constraint
programming originated from a formalization of equalities of
terms in Prolog II, leading to a general framework for
embedding constraints into a logic programming language. The
most common host languages are Prolog, C++, and Java, but
other languages have been used as well.
Constraint logic programming
A constraint logic program is a logic program that contains
constraints in the bodies of clauses. As an example, the clause
A(X):-X>0,B(X) is a clause containing the constraint X>0 in the
body. Constraints can also be present in the goal. The constraints
in the goal and in the clauses used to prove the goal are
accumulated into a set called the constraint store. This set contains
the constraints the interpreter has assumed satisfiable in order to
proceed in the evaluation. As a result, if this set is detected to be
unsatisfiable, the interpreter backtracks. Equations of terms, as
used in logic programming, are considered a particular form of
constraints which can be simplified using unification. As a
result, the constraint store can be considered an extension of the
concept of substitution that is used in regular logic
programming. The most common kinds of constraints used in
constraint logic programming are constraints over
integers/rational/real numbers and constraints over finite
domains.
Concurrent constraint logic programming languages have also
been developed. They significantly differ from non-concurrent
constraint logic programming in that they are aimed at
programming concurrent processes that may not terminate.
Constraint handling rules can be seen as a form of concurrent
constraint logic programming, but are also sometimes used
within a non-concurrent constraint logic programming language.
They allow rewriting constraints or inferring new ones based
on the truth of conditions.
Constraint satisfaction toolkits
Constraint satisfaction toolkits are software libraries for
imperative programming languages that are used to encode and
solve a constraint satisfaction problem.
Cassowary constraint solver, an open source project for
constraint satisfaction (accessible from C, Java, Python and
other languages).
Comet, a commercial programming language and toolkit.
Gecode, an open source portable toolkit written in C++,
developed as a production-quality and highly efficient
implementation of a complete theoretical background.
JaCoP (solver), an open source Java constraint solver.
Koalog, a commercial Java-based constraint solver.
logilab-constraint, an open source constraint solver written
in pure Python with constraint propagation algorithms.
MINION, an open-source constraint solver written in C++,
with a small language for the purpose of specifying
models/problems.
ZDC, an open source program developed in the
Computer-Aided Constraint Satisfaction Project for
modeling and solving constraint satisfaction problems.
Other constraint programming languages

Constraint toolkits are a way for embedding constraints into an
imperative programming language. However, they are only used
as external libraries for encoding and solving problems. An
approach in which constraints are integrated into an imperative
programming language is taken in the Kaleidoscope
programming language.
Constraints have also been embedded into functional
programming languages.
Evaluation function: An evaluation function, also known
as a heuristic evaluation function or static evaluation
function, is a function used by game-playing programs to
estimate the value or goodness of a position in the minimax and
related algorithms. The evaluation function is typically designed
to prioritize speed over accuracy; the function looks only at
the current position and does not explore possible moves.
In chess
One popular strategy for constructing evaluation functions is as
a weighted sum of various factors that are thought to influence
the value of a position. For instance, an evaluation function for
chess might take the form
c1 * material + c2 * mobility + c3 * king safety + c4 * center
control +...
Chess beginners, as well as the simplest of chess programs,
evaluate the position taking only "material" into account, i.e.
they assign a numerical score for each piece (with pieces of
opposite color having scores of opposite sign) and sum up the
score over all the pieces on the board. On the whole, computer
evaluation functions of even advanced programs tend to be more
materialistic than human evaluations. This is compensated for
by the increased speed of evaluation, which allows more plies to
be examined. As a result, some chess programs may rely too
much on tactics at the expense of strategy.
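A material-only evaluation and the weighted-sum form can be sketched as follows. The piece values (pawn 1, knight 3, bishop 3, rook 5, queen 9) are a common convention and are not given in the text; the board is assumed to be written as the piece-placement field of a FEN string, with uppercase letters for White and lowercase for Black. The function names and the feature names passed to evaluate are illustrative.

```python
# Conventional piece values in pawns (an assumption, not from the text).
PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9, "K": 0}


def material(board: str) -> int:
    """Sum piece values: White pieces count positive, Black negative.

    Digits and slashes in the FEN placement string are skipped because
    they are not keys of PIECE_VALUES.
    """
    score = 0
    for ch in board:
        if ch.upper() in PIECE_VALUES:
            value = PIECE_VALUES[ch.upper()]
            score += value if ch.isupper() else -value
    return score


def evaluate(features: dict, weights: dict) -> float:
    """Weighted sum c1 * material + c2 * mobility + ... over named features."""
    return sum(weights[name] * features[name] for name in weights)


start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR"
no_black_queen = "rnb1kbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR"
```

Here material(start) is 0 because the sides are balanced, and material(no_black_queen) is 9 in White's favor.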
Game tree: A game tree is a directed graph whose nodes are
positions in a game and whose edges are moves. The complete
game tree for a game is the game tree starting at the initial
position and containing all possible moves from each position.
Fig: The first two ply of the game tree for tic-tac-toe.
The diagram shows the first two levels, or ply, in the game tree
for tic-tac-toe. We consider all the rotations and reflections of
positions as being equivalent, so the first player has three
choices of move: in the center, at the edge, or in the corner. The
second player has two choices for the reply if the first player
played in the center, otherwise five choices. And so on.
The number of leaf nodes in the complete game tree is the
number of possible different ways the game can be played. For
example, the game tree for tic-tac-toe has 26,830 leaf nodes.
Game trees are important in artificial intelligence because one
way to pick the best move in a game is to search the game tree
using the minimax algorithm or its variants. The game tree for
tic-tac-toe is easily searchable, but the complete game trees for
larger games like chess are much too large to search. Instead, a
chess-playing program searches a partial game tree: typically
as many ply from the current position as it can search in the time
available. Except for the case of "pathological" game trees [1]
(which seem to be quite rare in practice), increasing the search
depth (i.e., the number of ply searched) generally improves the
chance of picking the best move.
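A depth-limited minimax search over a partial game tree might look like the sketch below. The tree, the node names and the static scores are invented for illustration; when the depth limit is reached, the static evaluation of the node stands in for the unexplored subtree.

```python
def minimax(node, depth, maximizing, children, static_value):
    """Minimax value of node, searching at most depth ply below it.

    children maps a node to its successor nodes; static_value maps a
    node to its heuristic score from the first player's point of view.
    """
    kids = children.get(node, [])
    if depth == 0 or not kids:
        return static_value[node]
    values = [minimax(k, depth - 1, not maximizing, children, static_value)
              for k in kids]
    return max(values) if maximizing else min(values)


# Hypothetical two-ply tree with made-up static scores.
children = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
static_value = {"root": 0, "a": 4, "b": 6, "a1": 3, "a2": 5, "b1": 2, "b2": 9}

shallow = minimax("root", 1, True, children, static_value)
deep = minimax("root", 2, True, children, static_value)
```

With these numbers the one-ply search prefers move b (value 6), while the two-ply search sees that the opponent can answer b with b1 and returns 3 via move a, which illustrates why deeper search usually picks better moves.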
Two-person games can also be represented as and-or trees. For
the first player to win a game there must exist a winning move
for all moves of the second player. This is represented in the
and-or tree by using disjunction to represent the first player's
alternative moves and using conjunction to represent all of the
second player's moves.
Solving Game Trees
Fig: An arbitrary game tree that has been fully colored.
With a complete game tree, it is possible to "solve" the game,
that is to say, find a sequence of moves that either the first or
second player can follow that will guarantee either a win or tie.
The algorithm can be described recursively as follows.
1. Color the final ply of the game tree so that all wins for
player 1 are colored one way, all wins for player 2 are
colored another way, and all ties are colored a third
way.
2. Look at the next ply up. If any child of a node is colored
as a win for the player who moves at that node, color the
node for that player. If all children are colored for the
other player, color the node for the other player.
Otherwise, color the node a tie.
3. Repeat for each ply, moving upwards, until all nodes
are colored. The color of the root node will determine
the nature of the game.
The diagram shows a game tree for an arbitrary game, colored
using the above algorithm.
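The coloring procedure can be written recursively, as in this sketch. The labels P1, P2 and TIE stand for the three colors; the tree and its leaf colors are hypothetical, and the function assumes the two players strictly alternate moves.

```python
P1, P2, TIE = "P1", "P2", "TIE"


def color(node, to_move, children, leaf_color):
    """Color node by backing up leaf colors; to_move is the player at node."""
    kids = children.get(node, [])
    if not kids:
        return leaf_color[node]          # step 1: leaves are already colored
    other = P2 if to_move == P1 else P1
    colors = [color(k, other, children, leaf_color) for k in kids]
    if to_move in colors:                # the mover can reach a win
        return to_move
    if all(c == other for c in colors):  # every move loses
        return other
    return TIE                           # best the mover can reach is a tie


# Hypothetical tree: player 1 moves at the root, player 2 at a and b.
children = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
leaf_color = {"a1": P1, "a2": TIE, "b1": P2, "b2": P1}
root_color = color("root", P1, children, leaf_color)
```

In this example player 2 can hold a to a tie and win from b, so the root is colored a tie: player 1 should move to a.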
It is usually possible to solve a game (in this technical sense of
"solve") using only a subset of the game tree, since in many
games a move need not be analyzed if there is another move that
is better for the same player (for example alpha-beta pruning can
be used in many deterministic games).
Any sub tree that can be used to solve the game is known as a
decision tree, and the sizes of decision trees of various shapes
are used as measures of game complexity.
Game of chance: A game of chance is a game whose
outcome is strongly influenced by some randomizing device,
and upon which contestants may or may not wager money or
anything of monetary value. Common devices used include dice,
spinning tops, playing cards, roulette wheels or numbered balls
drawn from a container.
Any game of chance that involves anything of monetary value is
gambling.
Gambling is known in nearly all human societies, even though
many have passed laws restricting it. Early people used the
knucklebones of sheep as dice. Some people develop a
psychological addiction to gambling, and will risk even food and
shelter to continue.
Some games of chance may also involve a certain degree of
skill. This is especially true where the player or players have
decisions to make based upon previous or incomplete
knowledge, such as poker and blackjack. In other games like
roulette and baccarat the player may only choose the amount of
the bet and the thing he/she wants to bet on; the rest is up to
chance. These games are therefore still considered games of
chance, with only a small amount of skill required [1]. The
distinction between 'chance' and 'skill' is relevant because in
some countries chance games are illegal or at least regulated,
whereas skill games are not.
Unit-2
Knowledge Representation
Introduction to Knowledge Representation (KR)
The notion of a knowledge representation can best be understood
in terms of five distinct roles it plays, each crucial to the
task at hand:
A knowledge representation (KR) is most fundamentally a
surrogate, a substitute for the thing itself, used to enable
an entity to determine consequences by thinking rather
than acting, i.e., by reasoning about the world rather than
taking action in it.
It is a set of ontological commitments, i.e., an answer to
the question: In what terms should I think about the
world?
It is a fragmentary theory of intelligent reasoning,
expressed in terms of three components: (i) the
representation's fundamental conception of intelligent
reasoning; (ii) the set of inferences the representation
sanctions; and (iii) the set of inferences it recommends.
It is a medium for pragmatically efficient computation,
i.e., the computational environment in which thinking is
accomplished. One contribution to this pragmatic
efficiency is supplied by the guidance a representation
provides for organizing information so as to facilitate
making the recommended inferences.
It is a medium of human expression, i.e., a language in
which we say things about the world.
Knowledge representation is needed for library classification
and for processing concepts in an information system. In the
field of artificial intelligence, problem solving can be simplified
by an appropriate choice of knowledge representation.
Representing the knowledge in one way may make the solution
simple, while an unfortunate choice of representation may make
the solution difficult or obscure; the analogy is to make
computations in Hindu-Arabic numerals or in Roman numerals;
long division is simpler in one and harder in the other. Likewise,
there is no representation that can serve all purposes or make
every problem equally approachable.
Properties for Knowledge Representation Systems
The following properties should be possessed by a knowledge
representation system.
Representational Adequacy
- the ability to represent the required knowledge;
Inferential Adequacy
- the ability to manipulate the knowledge represented to produce
new knowledge corresponding to that inferred from the original;
Inferential Efficiency
- the ability to direct the inferential mechanisms into the most
productive directions by storing appropriate guides;
Acquisition Efficiency
- the ability to acquire new knowledge using automatic methods
wherever possible rather than reliance on human intervention.
Predicate Logic:
Propositional logic combines atoms.
- An atom contains no propositional connectives.
- Atoms have no structure (today_is_wet, john_likes_apples).
Predicates allow us to talk about objects.
- Properties: is_wet(today)
- Relations: likes(john, apples)
- Each is either true or false.
In predicate logic each atom is a predicate,
e.g. first order logic, higher-order logic.
First Order Logic
More expressive logic than propositional.
Used in this course (Lecture 6 on representation in FOL).
Constants are objects: john, apples
Predicates are properties and relations: likes(john, apples)
Functions transform objects: likes(john, fruit_of(apple_tree))
Variables represent any object: likes(X, apples)
Quantifiers qualify values of variables:
- True for all objects (Universal): ∀X. likes(X, apples)
- Exists at least one object (Existential): ∃X. likes(X, apples)
Example: FOL Sentence
Every rose has a thorn:
For all X
if (X is a rose)
then there exists Y
(X has Y) and (Y is a thorn)
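In symbols the sentence reads ∀X. rose(X) → ∃Y. (has(X, Y) ∧ thorn(Y)). Over a finite domain the quantifiers can be checked directly, with all playing the role of ∀ and any the role of ∃. The domain, the sets and the function name below are invented for illustration.

```python
# Hypothetical finite universe and relations.
domain = {"rose1", "rose2", "thorn1", "thorn2", "daisy"}
rose = {"rose1", "rose2"}
thorn = {"thorn1", "thorn2"}
has = {("rose1", "thorn1"), ("rose2", "thorn2")}


def every_rose_has_a_thorn(domain, rose, thorn, has):
    """For all X: rose(X) implies there exists Y with has(X, Y) and thorn(Y)."""
    return all(
        (x not in rose) or any((x, y) in has and y in thorn for y in domain)
        for x in domain
    )


holds = every_rose_has_a_thorn(domain, rose, thorn, has)
```

The implication rose(X) → ... is encoded as (not rose(X)) or ..., so objects that are not roses, such as daisy, satisfy it trivially.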
Higher Order Logic
More expressive than first order.
Functions and predicates are also objects:
- Described by predicates: binary(addition)
- Transformed by functions: differentiate(square)
- Can quantify over both.
E.g. define red functions as having zero at 17.
Much harder to reason with.
Forward Chaining: In forward chaining the rules are
examined one after the other in a certain order. The order might
be the sequence in which the rules were entered into the rule set
or some other sequence as specified by the user. As each rule is
examined, the expert system attempts to evaluate whether the
condition is true or false.
Rule evaluation: When the condition is true, the rule is fired
and the next rule is examined. When the condition is false, the
rule is not fired and the next rule is examined.
It is possible that a rule cannot be evaluated as true or
false. Perhaps the condition includes one or more variables with
unknown values. In that case the rule condition is unknown.
When a rule condition is unknown, the rule is not fired and the
next rule is examined.
The iterative reasoning process: The process of examining
one rule after the other continues until a complete pass has been
made through the entire rule set. More than one pass usually is
necessary in order to assign a value to the goal variable. Perhaps
the information needed to evaluate one rule is produced by
another rule that is examined subsequently. After the second rule
is fired, the first rule can be evaluated on the next pass.
The passes continue as long as it is possible to fire rules.
When no more rules can be fired, the reasoning process ceases.
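The pass-by-pass process described above can be sketched as a loop. The rule format (name, condition, action), the variable names and the choice to fire each rule at most once are assumptions made for this example; a condition that refers to a variable with no value yet is treated as unknown.

```python
def forward_chain(rules, facts):
    """Examine rules in order, pass after pass, until none can fire.

    Returns the final facts and the number of passes made.
    """
    facts = dict(facts)
    fired = set()
    passes = 0
    while True:
        passes += 1
        fired_this_pass = False
        for name, condition, action in rules:
            if name in fired:
                continue                 # assumption: a rule fires only once
            try:
                holds = condition(facts)
            except KeyError:
                continue                 # condition unknown: do not fire
            if holds:
                facts.update(action)     # firing assigns new values
                fired.add(name)
                fired_this_pass = True
        if not fired_this_pass:
            return facts, passes


# rule1 needs a value that only rule2 produces, so it fires on pass 2.
rules = [
    ("rule1", lambda f: f["sales_up"], {"advice": "expand"}),
    ("rule2", lambda f: f["this_year_sales"] > f["last_year_sales"],
     {"sales_up": True}),
]
result, passes = forward_chain(
    rules, {"this_year_sales": 120, "last_year_sales": 100})
```

The third pass fires nothing, which is how the loop detects that reasoning has ceased.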
Example of forward reasoning: Letters are used for the
conditions and actions to keep the illustration simple. In rule 1,
for example, if condition A exists then action B is taken.
Condition A might be
THIS.YEAR.SALES>LAST.YEAR.SALES
Backward chaining: Backward chaining is an inference
method used in automated theorem provers, proof assistants and
other artificial intelligence applications. It is one of the two most
commonly used methods of reasoning with inference rules and
logical implications; the other is forward chaining. Backward
chaining is implemented in logic programming by SLD
resolution. Both methods are based on the modus ponens inference
rule.
Backward chaining starts with a list of goals (or a hypothesis)
and works backwards from the consequent to the antecedent to
see if there is data available that will support any of these
consequents. An inference engine using backward chaining
would search the inference rules until it finds one which has a
consequent (Then clause) that matches a desired goal. If the
antecedent (If clause) of that rule is not known to be true, then it
is added to the list of goals (in order for one's goal to be
confirmed one must also provide data that confirms this new
rule).
For example, suppose that the goal is to conclude the color of
my pet Fritz, given that he croaks and eats flies, and that the rule
base contains the following four rules:
1. If X croaks and eats flies Then X is a frog
2. If X chirps and sings Then X is a canary
3. If X is a frog Then X is green
4. If X is a canary Then X is yellow
This rule base would be searched and the third and fourth rules
would be selected, because their consequents (Then Fritz is
green, Then Fritz is yellow) match the goal (to determine Fritz's
color). It is not yet known that Fritz is a frog, so both the
antecedents (If Fritz is a frog, If Fritz is a canary) are added to
the goal list. The rule base is again searched and this time the
first two rules are selected, because their consequents (Then X
is a frog, Then X is a canary) match the new goals that were just
added to the list. The antecedent (If Fritz croaks and eats flies) is
known to be true and therefore it can be concluded that Fritz is a
frog, and not a canary. The goal of determining Fritz's color is
now achieved (Fritz is green if he is a frog, and yellow if he is a
canary, but he is a frog since he croaks and eats flies; therefore,
Fritz is green).
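The Fritz example can be traced with a small goal-driven prover. This is a minimal sketch: the rules are written for a single individual, so the variable X is dropped, and there is no protection against cyclic rule sets. The names RULES, prove and color_of are illustrative.

```python
# The four rules as (antecedents, consequent) pairs.
RULES = [
    (("croaks", "eats flies"), "is a frog"),
    (("chirps", "sings"), "is a canary"),
    (("is a frog",), "is green"),
    (("is a canary",), "is yellow"),
]


def prove(goal, facts, rules=RULES):
    """A goal holds if it is a known fact, or if some rule concludes it
    and all of that rule's antecedents can be proved in turn."""
    if goal in facts:
        return True
    for antecedents, consequent in rules:
        if consequent == goal and all(prove(a, facts, rules)
                                      for a in antecedents):
            return True
    return False


def color_of(facts):
    """Try each color goal in turn and return the first one proved."""
    for color in ("is green", "is yellow"):
        if prove(color, facts):
            return color
    return None


fritz = {"croaks", "eats flies"}
answer = color_of(fritz)
```

Working backwards from "is green" adds "is a frog" as a subgoal, which in turn is reduced to the known facts "croaks" and "eats flies".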
Note that the goals always match the affirmed versions of the
consequents of implications (and not the negated versions as in
modus tollens) and even then, their antecedents are then
considered as the new goals (and not the conclusions as in
affirming the consequent) which ultimately must match known
facts (usually defined as consequents whose antecedents are
always true); thus, the inference rule which is used is modus
ponens.
Because the list of goals determines which rules are selected and
used, this method is called goal-driven, in contrast to data-driven
forward-chaining inference. The backward chaining approach is
often employed by expert systems.
Conceptual Dependency formalism: Conceptual
dependency (CD) is a theory of natural language processing
which mainly deals with the representation of the semantics of a
language. The main motivations for the development of CD as a
knowledge representation technique are given below:
To construct computer programs that can understand
natural language.
To make inferences from the statements and also to
identify conditions in which two sentences can have
similar meaning.
To provide facilities for the system to take part in
dialogues and answer questions.
To provide a means of representation which is language
independent.
Knowledge is represented in CD by elements called
conceptual structures. The basis of CD representation is that
two sentences which have identical meaning must have only
one representation, and implicitly packed information must be
explicitly stated.
In order that knowledge is represented in CD form, certain
primitive actions have been developed. The table lists the
primitive CD actions.
Apart from the primitive CD actions, one has to make use of
the following six categories of types of objects.
1. PPs: (picture producers)
Only physical objects are picture producers.
2. ACTs: Actions are done by an actor to an object.
The table gives the major ACTs.
Table: Primitive CD forms
CD primitive action   Explanation
1. ATRANS             Transfer of an abstract relationship (e.g. give)
2. PTRANS             Transfer of the physical location of an object (e.g. go)
3. PROPEL             Application of physical force to an object (e.g. throw)
4. MOVE               Movement of a body part of an animal by the animal
5. GRASP              Grasping of an object by an actor (e.g. hold)
6. INGEST             Taking of an object by an animal to the inside of that animal (e.g. drink, eat)
7. EXPEL              Expulsion of an object from inside the body by an animal to the world (e.g. spit)
8. MTRANS             Transfer of mental information between animals or within an animal (e.g. tell)
9. MBUILD             Construction of new information from old information (e.g. decide)
10. SPEAK             Action of producing sound (e.g. say)
3. LOCs: Locations
Every action takes place at some location, which serves as
source and destination.
4. Ts: Times
An action can take place at a particular location at a given
specified time. The time can be represented on an absolute scale
or relative scale.
5. AAs: Action aiders
These serve as modifiers of actions. For example, the ACT PROPEL
has a speed factor associated with it, which is an action aider.
6. PAs: Picture aiders
These serve as aiders of picture producers. Every object that
serves as a PP needs certain characteristics by which it is
defined. PAs serve PPs by defining these characteristics.
There are certain rules by which the conceptual categories of
types of objects discussed can be combined.
CD models provide the following advantages for representing
knowledge.
The ACT primitives help in representing wide knowledge in a
succinct way. To illustrate this, consider the following verbs,
which all correspond to transfer of mental information.
-see
-learn
-hear
-inform
-remember
In CD representation all these are represented using the single
ACT primitive MTRANS. They are not represented
individually as given. Similarly, different verbs that indicate
various activities are clubbed under unique ACT primitives,
thereby reducing the number of inference rules.
The main goal of CD representation is to make explicit what is
implicit. That is why every statement that is made has not only
the actors and objects but also time and location, source and
destination.
The following set of conceptual tenses makes the usage of CD
still more precise.
O-Object case relationship
R-recipient case relationship
P-past
F-future
T-transition
Ts-start transition
Tf-finished transition
K-continuing
?-interrogative
/-negative
Nil-present
Delta-timeless
C-conditional
CD brought forward the notion of language independence because
all ACTs are language-independent primitives.
Semantic Nets: The main idea behind semantic nets is that the
meaning of a concept comes from the ways in which it is
connected to other concepts. In a semantic net, information is
represented as a set of nodes connected to each other by a set of
labeled arcs, which represent relationship among the nodes. A
fragment of a typical semantic net is shown in the figure.
Person --isa--> Mammal
Person --has-part--> Nose
Pee-Wee-Reese --instance--> Person
Pee-Wee-Reese --team--> Brooklyn-Dodgers
Pee-Wee-Reese --uniform-color--> Blue
Fig: A semantic network

This network contains an example of both the isa and instance
relations, as well as some other, more domain-specific relations
like team and uniform-color. In this network, we could use
inheritance to derive the additional relation
has-part (Pee-Wee-Reese, Nose)
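Inheritance over such a network can be sketched by storing the labeled arcs in a dictionary and following instance and isa links upward until the requested relation is found. The dictionary layout and the function name lookup are assumptions for this example, and each node is assumed to have at most one arc per label.

```python
# Arcs of the network, keyed by (node, arc label).
ARCS = {
    ("Person", "isa"): "Mammal",
    ("Person", "has-part"): "Nose",
    ("Pee-Wee-Reese", "instance"): "Person",
    ("Pee-Wee-Reese", "team"): "Brooklyn-Dodgers",
    ("Pee-Wee-Reese", "uniform-color"): "Blue",
}


def lookup(node, relation, arcs=ARCS):
    """Return the value of relation for node, inheriting through
    instance and isa arcs when the node has no such arc itself."""
    while node is not None:
        if (node, relation) in arcs:
            return arcs[(node, relation)]
        node = arcs.get((node, "instance")) or arcs.get((node, "isa"))
    return None


inherited = lookup("Pee-Wee-Reese", "has-part")
```

Pee-Wee-Reese has no has-part arc of his own, so the search moves to Person, where the arc to Nose is found.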

1. Intersection Search. One of the early ways that semantic nets
were used was to find relationships among objects by spreading
activation out from each of two nodes and seeing where the
activation met. This process is called intersection search. Using
this process, it is possible to use the network of the figure to
answer questions such as: what is the connection between the
Brooklyn Dodgers and Blue?
2. Representing Nonbinary Predicates. Semantic nets are a
natural way to represent relationships that would appear as
ground instances of binary predicates in predicate logic. For
example, some of the arcs from the figure could be represented in
logic as
isa (Person, Mammal)
instance (Pee-Wee-Reese, Person)
team (Pee-Wee-Reese, Brooklyn-Dodgers)
uniform-color (Pee-Wee-Reese, Blue)
But the knowledge expressed by predicates of other arities can
also be expressed in semantic nets. Many unary predicates in
logic can be thought of as binary predicates using very
general-purpose predicates such as isa and instance. So for
example,
man (Marcus)
could be rewritten as
instance (Marcus, Man)
thereby making it easy to represent in a semantic net.
3. Partitioned semantic Nets. Suppose we want to represent
simple quantified expression in semantic nets. One way to do
this is to partition the semantic net into a hierarchical set of
spaces, each of which corresponds to the scope of one or more
variables. To see how this works, consider first the simple net
shown in the figure. This net corresponds to the statement
The dog bit the mail carrier.
The nodes Dogs, Bite, and Mail-Carrier represent the classes of
dogs, bitings, and mail carriers, respectively, while the nodes d,
b, and m represent a particular dog, a particular biting, and a
particular mail carrier. This fact can easily be represented by
a single net with no partitioning.
But now suppose that we want to represent the fact
Every dog has bitten a mail carrier.
Or, in logic:
∀x: dog(x) → ∃y: mail-carrier(y) ∧ bite(x, y)
To represent this fact, it is necessary to encode the scope of the
universally quantified variable x.
Fig: Using partitioned semantic nets (figure not reproduced; it
shows the nodes Dogs, Bite, and Mail-Carrier, three isa arcs, and
assailant and victim arcs)

Frame: A frame is a collection of attributes (usually called slots)
and associated values (and possibly constraints on values) that
describes some entity in the world. Sometimes a frame
describes an entity in some absolute sense; sometimes it
represents the entity from a particular point of view. A single
frame taken alone is rarely useful. Instead, we build frame
systems out of collections of frames that are connected to each
other.
Set theory provides a good basis for understanding frame
systems. Although not all frame systems are defined this way,
we do so here. In this view, each frame represents either a class
(a set) or an instance (an element of a class). To see how this
works, consider a frame system shown in fig. In this example,
the frames person, adult-male, ML-baseball-player, pitcher, and
ML-baseball team are all classes. The frames pee-wee-Reese
and Brooklyn-Dodgers are instances.
Person
  isa: Mammal
  cardinality: 6,000,000,000
  *handed: Right
Adult-Male
  isa: Person
  cardinality: 2,000,000,000
  *height: 5-10
ML-Baseball-Player
  isa: Adult-Male
  cardinality: 624
  *height: 6-1
  *bats: equal to handed
  *batting-average: .252
  *team:
  *uniform-color:
Fielder
  isa: ML-Baseball-Player
  cardinality: 376
  *batting-average: .262
Pee-Wee-Reese
  instance: Fielder
  height: 5-10
  bats: Right
  batting-average: .309
  team: Brooklyn-Dodgers
  uniform-color: Blue
ML-Baseball-Team
  isa: Team
  cardinality: 26
  *team-size: 24
  *manager:
Brooklyn-Dodgers
  instance: ML-Baseball-Team
  team-size: 24
  manager: Leo-Durocher
  players: (Pee-Wee-Reese)
Fig: A simplified frame system
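Slot lookup in such a frame system can be sketched as follows. The sketch assumes one reading of the figure, namely that starred slots on a class frame hold default values inherited by its members, and that a value on a more specific class overrides one on a more general class. Only a few of the slots are copied, and the function name get_slot and the frame Hypothetical-Rookie are invented for the example.

```python
FRAMES = {
    "Person": {"isa": "Mammal", "*handed": "Right"},
    "Adult-Male": {"isa": "Person", "*height": "5-10"},
    "ML-Baseball-Player": {"isa": "Adult-Male", "*height": "6-1",
                           "*batting-average": 0.252},
    "Fielder": {"isa": "ML-Baseball-Player", "*batting-average": 0.262},
    "Pee-Wee-Reese": {"instance": "Fielder", "height": "5-10",
                      "batting-average": 0.309, "team": "Brooklyn-Dodgers"},
    # Invented instance with no slots of its own, to show pure inheritance.
    "Hypothetical-Rookie": {"instance": "Fielder"},
}


def get_slot(frame, slot, frames=FRAMES):
    """Return the frame's own value for slot, else the nearest default
    (starred slot) found by climbing the instance and isa links."""
    own = frames.get(frame, {})
    if slot in own:
        return own[slot]
    parent = own.get("instance") or own.get("isa")
    while parent in frames:
        if "*" + slot in frames[parent]:
            return frames[parent]["*" + slot]
        parent = frames[parent].get("isa")
    return None


reese_height = get_slot("Pee-Wee-Reese", "height")
rookie_height = get_slot("Hypothetical-Rookie", "height")
```

Pee-Wee-Reese keeps his own height of 5-10, while the rookie inherits 6-1 from ML-Baseball-Player, the nearest class that supplies a default.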
Wff: Not all strings can represent propositions of the predicate
logic. Those which produce a proposition when their symbols
are interpreted must follow the rules given below, and they are
called wffs (well-formed formulas) of the first order predicate
logic.
Rules for constructing Wffs
A predicate name followed by a list of variables such as P(x, y),
where P is a predicate name, and x and y are variables, is called
an atomic formula.
Wffs are constructed using the following rules:
1. True and False are wffs.
2. Each propositional constant (i.e. specific proposition), and
each propositional variable (i.e. a variable representing
propositions) are wffs.
3. Each atomic formula (i.e. a specific predicate with
variables) is a wff.
4. If A, B, and C are wffs, then so are ¬A, (A ∧ B), (A ∨ B),
(A → B), and (A ↔ B).
5. If x is a variable (representing objects of the universe of
discourse), and A is a wff, then so are ∀x A and ∃x A.
For example, "The capital of Virginia is Richmond." is a
specific proposition. Hence it is a wff by Rule 2.
Let B be a predicate name representing "being blue" and let
x be a variable. Then B(x) is an atomic formula meaning "x
is blue". Thus it is a wff by Rule 3 above. By applying
Rule 5 to B(x), ∀x B(x) is a wff and so is ∃x B(x). Then by
applying Rule 4 to them, ∀x B(x) ∧ ∃x B(x) is seen to be a
wff. Similarly, if R is a predicate name representing "being
round", then R(x) is an atomic formula. Hence it is a wff.
By applying Rule 4 to B(x) and R(x), a wff B(x) ∧ R(x) is
obtained.
In this manner, larger and more complex wffs can be
constructed following the rules given above.
Note, however, that strings that cannot be constructed by
using those rules are not wffs. For example, ∀x B(x)R(x)
and B(∃x) are NOT wffs, NOR are B(R(x)) and
B(∃x R(x)).
One way to check whether or not an expression is a wff
is to try to state it in English. If you can translate it into a
correct English sentence, then it is a wff.
More examples: To express the fact that Tom is taller than
John, we can use the atomic formula taller(Tom, John),
which is a wff. This wff can also be part of some compound
statement such as taller(Tom, John) ∧ ¬taller(John, Tom),
which is also a wff.
If x is a variable representing people in the world, then
taller(x, Tom), ∀x taller(x, Tom), ∃x taller(x, Tom), and
∃x ∀y taller(x, y) are all wffs, among others.
However, taller(∃x, John) and taller(Tom ∧ Mary, Jim), for
example, are NOT wffs.
Unit-3
Handling Uncertainty and learning
Fuzzy Logic: In the techniques discussed so far, we have not
modified the mathematical underpinnings provided by set theory
and logic. We have instead augmented those ideas with additional
constructs provided by probability theory. Here we take a
different approach and briefly consider what happens if we make
fundamental changes to our idea of set membership and
corresponding changes to our definitions of logical operations.
The motivation for fuzzy sets is provided by the need
to represent such propositions as:
John is very tall.
Mary is slightly ill.
Sue and Linda are close friends.
Exceptions to the rule are nearly impossible.
Most Frenchmen are not very tall.
While traditional set theory defines set membership as a
Boolean predicate, fuzzy set theory allows us to represent set
membership as a possibility distribution.
Once set membership has been redefined in this way, it is
possible to define a reasoning system based on techniques for
combining distributions. Such reasoners have been applied in
control systems for devices as diverse as trains and washing
machines.
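A graded notion of membership and the matching logical operations can be sketched as below. None of the specifics come from the text: the breakpoints of the tall membership function are invented, the min, max and 1 - x operators are the commonly used ones for fuzzy and, or and not, and modelling the hedge "very" by squaring is one common convention.

```python
def tall(height_cm: float) -> float:
    """Degree of membership in the fuzzy set 'tall'.

    0 at or below 160 cm, 1 at or above 190 cm, linear in between
    (breakpoints chosen only for illustration).
    """
    return min(1.0, max(0.0, (height_cm - 160.0) / 30.0))


def fuzzy_and(a: float, b: float) -> float:
    return min(a, b)


def fuzzy_or(a: float, b: float) -> float:
    return max(a, b)


def fuzzy_not(a: float) -> float:
    return 1.0 - a


def very(a: float) -> float:
    """Hedge that concentrates membership, here by squaring."""
    return a ** 2


john_tall = tall(175.0)
john_very_tall = very(john_tall)
```

Someone 175 cm tall is tall to degree 0.5 and very tall only to degree 0.25, which captures the graded meaning of "John is very tall" that a Boolean predicate cannot.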
Dempster-Shafer Theory: This theory was developed by
Dempster (1968) and Shafer (1976). This approach considers sets
of propositions and assigns to each of them an interval
[Belief, Plausibility]
in which the degree of belief must lie. Belief (usually
denoted Bel) measures the strength of the evidence in favor of a
set of propositions. It ranges from 0 (indicating no evidence) to
1 (denoting certainty).
A belief function, Bel, corresponding to a specific mass
function m, is defined for the set A as the sum of beliefs
committed to every subset of A by m. That is, Bel(A) is a
measure of the total support or belief committed to the set A
and sets a minimum value for its likelihood. It is defined in
terms of all belief assigned to A as well as to all proper
subsets of A. Thus,
$$\mathrm{Bel}(A) = \sum_{B \subseteq A} m(B)$$
For example, if U contains the mutually exclusive subsets A, B,
C, and D then
Bel({A,C,D}) = m({A,C,D}) + m({A,C}) + m({A,D}) + m({C,D})
+ m({A}) + m({C}) + m({D}).
In Dempster-Shafer theory, a belief interval can also be defined
for a subset A. It is represented as the subinterval [Bel(A),
Pl(A)] of [0, 1]. Bel(A) is also called the support of A, and
Pl(A) = 1 - Bel(¬A) the plausibility of A, where ¬A denotes the
complement of A in U.
We define Bel(∅) = 0 to signify that no belief should be
assigned to the empty set and Bel(U) = 1 to show that the truth
is contained within U. The subsets A of U are called the focal
elements of the support function Bel when m(A) > 0.
Since Bel(A) only partially describes the beliefs about
proposition A, it is useful to also have a measure of the extent
to which one believes in ¬A, that is, the doubts regarding A.
For this, we define the doubt of A as D(A) = Bel(¬A). From this
definition it will be seen that the upper bound of the belief
interval noted above, Pl(A), can be expressed as
Pl(A) = 1 - D(A) = 1 - Bel(¬A). Pl(A) represents an upper
belief limit on the proposition A. The belief interval,
[Bel(A), Pl(A)], is also sometimes referred to as the confidence
in A, while the quantity Pl(A) - Bel(A) is referred to as the
uncertainty in A. It can be shown that
Pl(∅) = 0, Pl(U) = 1
For all A,
Pl(A) ≥ Bel(A)
Bel(A) + Bel(¬A) ≤ 1,
Pl(A) + Pl(¬A) ≥ 1, and
For A ⊆ B,
Bel(A) ≤ Bel(B), Pl(A) ≤ Pl(B)
As an example of the above concepts, recall once again the
problem of identifying which of the terrorist organizations A, B,
C, and D could have been responsible for the attack. The possible
subsets of U in this case form a lattice of sixteen subsets (fig).
{A, B, C, D}
{A, B, C} {A, B, D} {A, C, D} {B, C, D}
{A, B} {A, C} {A, D} {B, C} {B, D} {C, D}
{A} {B} {C} {D}
∅
Fig: Lattice of subsets of the universe U.
Assume one piece of evidence supports the belief
that groups A and C were responsible to a degree of m1({A, C})
= 0.6, and another source of evidence disproves the belief that C
was involved (and therefore supports the belief that the three
organizations A, B, and D were responsible), that is, m2({A, B,
D}) = 0.7. The remaining mass of each source is assigned to U,
so m1(U) = 0.4 and m2(U) = 0.3. To obtain the pooled evidence,
we compute the following quantities:
m1 ⊕ m2({A}) = (0.6)(0.7) = 0.42
m1 ⊕ m2({A, C}) = (0.6)(0.3) = 0.18
m1 ⊕ m2({A, B, D}) = (0.4)(0.7) = 0.28
m1 ⊕ m2(U) = (0.4)(0.3) = 0.12
m1 ⊕ m2 = 0 for all other subsets of U
Bel1({A, C}) = m({A, C}) + m({A}) + m({C})
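The pooled values can be reproduced with a short sketch of Dempster's rule of combination: the masses of every pair of focal elements are multiplied and assigned to the intersection of the pair. The function names are illustrative. The division by one minus the conflicting mass is the usual normalization step; it has no effect in this example, where no two focal elements are disjoint, and the sketch assumes the two sources are not in total conflict.

```python
from itertools import product


def combine(m1, m2):
    """Combine two mass functions given as {frozenset: mass} dicts."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb      # mass that would fall on the empty set
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


def bel(m, a):
    """Belief: total mass committed to subsets of a."""
    return sum(w for s, w in m.items() if s <= a)


def pl(m, a):
    """Plausibility: total mass on sets that intersect a."""
    return sum(w for s, w in m.items() if s & a)


U = frozenset("ABCD")
m1 = {frozenset("AC"): 0.6, U: 0.4}
m2 = {frozenset("ABD"): 0.7, U: 0.3}
m12 = combine(m1, m2)
belief_ac = bel(m12, frozenset("AC"))
```

With the pooled masses, the belief in {A, C} is 0.18 + 0.42 + 0 = 0.60, and its plausibility is 1 because every focal element intersects {A, C}.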
Bayes Theorem: An important goal for many problem-solving
systems is to collect evidence as the system goes along
and to modify its behavior on the basis of evidence. To model
this behavior, we need a statistical theory of evidence. Bayesian
statistics is such a theory. The fundamental notion of Bayesian
statistics is that of conditional probability:
P (H/E)
Read this expression as the probability of hypothesis H given

that we have observed evidence E. To compute this, we need to
take into account the prior probability of H and the extent to
which E provides evidence of H. To do this, we need to define a
universe that contains an exhaustive, mutually exclusive set of
His among which we are trying to discriminate then, let
P(Hi/E) = the probability that hypothesis Hi is true given
evidence E.
P(E/Hi) = the probability that we will observe evidence E given
that hypothesis Hi is true.
P(Hi) = the a priori probability that hypothesis Hi is true in the
absence of any specific evidence. These probabilities are called
prior probabilities, or priors.
k = the number of possible hypotheses.
Bayes theorem then states that
$$P(H_i/E) = \frac{P(E/H_i) \cdot P(H_i)}{\sum_{n=1}^{k} P(E/H_n) \cdot P(H_n)}$$
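A small numeric illustration may help here. The following Python sketch applies Bayes theorem to three mutually exclusive hypotheses; the hypothesis names and all probabilities are made-up values chosen only for illustration, and the function name posterior is an assumption.

```python
def posterior(priors, likelihoods):
    """Return P(Hi/E) for every hypothesis Hi.

    priors[h]      is P(h), the prior probability of hypothesis h
    likelihoods[h] is P(E/h), the probability of the evidence under h
    """
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    evidence = sum(joint.values())  # the denominator: sum of P(E/Hn) * P(Hn)
    if evidence == 0:
        raise ValueError("the evidence is impossible under every hypothesis")
    return {h: value / evidence for h, value in joint.items()}

# Hypothetical numbers: the priors must sum to 1 because the hypotheses
# are assumed to be exhaustive and mutually exclusive.
priors = {"H1": 0.5, "H2": 0.3, "H3": 0.2}
likelihoods = {"H1": 0.1, "H2": 0.4, "H3": 0.5}

post = posterior(priors, likelihoods)
for h, p in post.items():
    print(h, round(p, 3))
```

Although H1 has the largest prior, the evidence is unlikely under H1, so with these numbers H2 ends up with the largest posterior.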
Specifically, when we say P (A/B), we are describing
the conditional probability of A given that the only evidence we
have is B. If there is also other relevant evidence, then it too
must be considered. Suppose, for example, that we are solving a
medical diagnosis problem. Consider the following assertions:
S: patient has spots
M: patient has measles
F: patient has high fever
Without any additional evidence, the presence of spots serves
as evidence in favor of measles. It also serves as evidence of
fever since measles would cause fever. But, since spots and
fever are not independent events, we cannot just sum their
effects; instead, we need to represent explicitly the conditional
probability that arises from their conjunction. In general, given a
prior body of evidence e and some new observation E, we need
to compute
$$P(H/E, e) = P(H/E) \cdot \frac{P(e/E, H)}{P(e/E)}$$
Unfortunately, in an arbitrarily complex world, the size of
the set of joint probabilities that we require in order to
compute this function grows as 2^n if there are n different
propositions being considered. This makes using Bayes theorem
intractable for several reasons:
The knowledge acquisition problem is insurmountable; too
many probabilities have to be provided.
The space that would be required to store all the
probabilities is too large.
The time required to compute the probabilities is too large.
Despite these problems, though, Bayesian statistics provide
an attractive basis for an uncertain reasoning system. As a
result, several mechanisms for exploiting its power while at
the same time making it tractable have been developed.
Learning: One of the most often heard criticisms of AI is that
machines cannot be called intelligent until they are able to learn
to do new things and to adapt to new situations, rather than
simply doing as they are told to do. There can be little question
that the ability to adapt to new surroundings and to solve new
problems is an important characteristic of intelligent entities.
The question is how a program can interpret its inputs in such a
way that its performance gradually improves.
Learning denotes changes in the system that are
adaptive in the sense that they enable the system to do the same
task or tasks drawn from the same population more efficiently
and more effectively the next time.
Learning covers a wide range of phenomena.
1. At one end of the spectrum is skill refinement. People get
better at many tasks simply by practicing. The more you
ride a bicycle or play tennis, the better you get.
2. At the other end of this spectrum lies knowledge
acquisition. Knowledge is generally acquired through
experience.
3. Many AI programs are able to improve their performance
substantially through rote-learning techniques.
4. Another way we learn is through taking advice from others.
Advice taking is similar to rote learning, but high-level advice
may not be in a form simple enough for a program to use
directly in problem solving.
5. People also learn through their own problem solving
experience. After solving a complex problem, we
remember the structure of the problem and the methods we
used to solve it. The next time we see the problem, we can
solve it more efficiently. Moreover, we can generalize from
our experience to solve related problems more easily. A program
that learns this way remembers its experiences and generalizes from
them, but it does not receive new knowledge from the outside
world. In large problem spaces, however, efficiency gains
are critical. Learning can mean the difference between
solving a problem rapidly and not solving it at all. In
addition, programs that learn through problem-solving
experience may be able to come up with qualitatively better
solutions in the future.
6. Another form of learning that does involve stimuli from the
outside is learning from examples. Learning from examples
usually involves a teacher who helps us classify things by
correcting us when we are wrong. Sometimes, however, a
program can discover things without the aid of a teacher.
Learning is itself a problem-solving process.
Learning Model: Learning can be accomplished using a
number of different methods. For example, we can learn by
memorizing facts, by being told, or by studying examples like
problem solutions. Learning requires that new knowledge
structures be created from some form of input stimulus. This
new knowledge must then be assimilated into a knowledge
base and be tested in some way for its utility. Testing means
that the knowledge should be used in the performance of
some task from which meaningful feedback can be obtained,
where the feedback provides some measure of the accuracy
and usefulness of the newly acquired knowledge.
The learning model is depicted in the figure, where the
environment has been included as part of the overall learner
system. The environment may be regarded as either a form of
nature which produces random stimuli or as a more organized
training source such as a teacher which provides carefully
selected training examples for the learner component.
Fig. Learning Model (figure not reproduced; its blocks are the Environment or Teacher, the Learner Component, the Knowledge Base, the Performance Component, and the Critic or Performance Evaluator, connected by Stimuli/Examples, Feedback, Tasks, and Response)
The actual form of environment used will depend on the
particular learning paradigm. In any case, some representation
language must be assumed for communication between the
environment and the learner. The language may be the same
representation scheme as that used in the knowledge base (such
as a form of predicate calculus). When they are chosen to be the
same we say the single representation trick is being used. This
usually results in a simpler implementation since it is not
necessary to transform between two or more different
representations.
Inputs to the learner component may be physical stimuli
of some type or descriptive, symbolic training examples. The
information conveyed to the learner component is used to create
and modify knowledge structures in the knowledge base.
When given a task, the performance component produces a
response describing actions in performing the task. The critic
module then evaluates this response relative to an optimal
response.
The cycle described above may be repeated a number of
times until the performance of the system has reached some
acceptable level, until a known learning goal has been reached,
or until changes cease to occur in the knowledge base after some
chosen number of training examples has been observed.
There are several important factors which influence a
system's ability to learn in addition to the form of representation
used. They include the types of training provided, the form and
extent of any initial background knowledge, the type of
feedback provided, and the learning algorithms used (fig).
Fig. Factors affecting learning performance (figure not reproduced; the factors shown are background knowledge, feedback, learning algorithms, training scenario, and representation scheme, leading to the resultant performance)
Finally the learning algorithms themselves determine to a large
extent how successful a learning system will be. The algorithms
control the search to find and build the knowledge structures.
We then expect that algorithms that extract much of the
useful information from training examples and take advantage of
any background knowledge will outperform those that do not.
Supervised learning: Supervised learning is the machine
learning task of inferring a function from supervised training
data. The training data consist of a set of training examples. In
supervised learning, each example is a pair consisting of an
input object (typically a vector) and a desired output value (also
called the supervisory signal). A supervised learning algorithm
analyzes the training data and produces an inferred function,
which is called a classifier (if the output is discrete) or a
regression function (if the output is continuous). The inferred
function should predict
the correct output value for any valid input object. This requires
the learning algorithm to generalize from the training data to
unseen situations in a "reasonable" way; the assumptions it relies
on to do so are called its inductive bias. The parallel task in
human and animal psychology is often referred to as concept
learning.
Overview
In order to solve a given problem of supervised learning, one has
to perform the following steps:
1. Determine the type of training examples. Before doing
anything else, the engineer should decide what kind of data
is to be used as an example. For instance, this might be a
single handwritten character, an entire handwritten word, or
an entire line of handwriting.
2. Gather a training set. The training set needs to be
representative of the real-world use of the function. Thus, a
set of input objects is gathered and corresponding outputs
are also gathered, either from human experts or from
measurements.
3. Determine the input feature representation of the learned
function. The accuracy of the learned function depends
strongly on how the input object is represented. Typically,
the input object is transformed into a feature vector, which
contains a number of features that are descriptive of the
object. The number of features should not be too large,
because of the curse of dimensionality; but should contain
enough information to accurately predict the output.
4. Determine the structure of the learned function and
corresponding learning algorithm. For example, the
engineer may choose to use support vector machines or
decision trees.
5. Complete the design. Run the learning algorithm on the
gathered training set. Some supervised learning algorithms
require the user to determine certain control parameters.
These parameters may be adjusted by optimizing
performance on a subset (called a validation set) of the
training set, or via cross-validation.
6. Evaluate the accuracy of the learned function. After
parameter adjustment and learning, the performance of the
resulting function should be measured on a test set that is
separate from the training set.
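The six steps above can be sketched end to end in a few lines. The following Python example is only an illustration under stated assumptions: the data are synthetic two-dimensional points drawn from two Gaussian clusters, the learner is a plain k-nearest-neighbor classifier, and the control parameter k is chosen on a validation set. None of these choices comes from the notes.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training examples nearest to x."""
    nearest = sorted(
        train,
        key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x)),
    )[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

def accuracy(train, data, k):
    """Fraction of examples in data that are classified correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in data)
    return hits / len(data)

# Steps 1-3: each example is a feature vector of two numbers plus a label.
rng = random.Random(0)  # fixed seed so the run is repeatable

def sample(label, centre):
    return ([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)], label)

data = [sample(0, 0.0) for _ in range(60)] + [sample(1, 3.0) for _ in range(60)]
rng.shuffle(data)

# Step 4: the structure of the learned function is a k-nearest-neighbor vote.
# Step 5: tune the control parameter k on a validation set.
train, validation, test = data[:72], data[72:96], data[96:]
best_k = max([1, 3, 5, 7], key=lambda k: accuracy(train, validation, k))

# Step 6: measure accuracy on a test set that took no part in training or tuning.
test_accuracy = accuracy(train, test, best_k)
print("chosen k:", best_k, "test accuracy:", test_accuracy)
```

Keeping the test set out of both training and parameter tuning is what makes the final figure an honest estimate of performance on unseen inputs.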
Factors to consider
Factors to consider when choosing and applying a learning
algorithm include the following:
1. Heterogeneity of the data. If the feature vectors include
features of many different kinds (discrete, discrete ordered,
counts, continuous values), some algorithms are easier to
apply than others. Many algorithms, including Support
Vector Machines, linear regression, logistic regression,
neural networks, and nearest neighbor methods, require that
the input features be numerical and scaled to similar ranges
(e.g., to the [-1,1] interval). Methods that employ a distance
function, such as nearest neighbor methods and support
vector machines with Gaussian kernels, are particularly
sensitive to this. An advantage of decision trees is that they
easily handle heterogeneous data.
2. Redundancy in the data. If the input features contain
redundant information (e.g., highly correlated features),
some learning algorithms (e.g., linear regression, logistic
regression, and distance based methods) will perform
poorly because of numerical instabilities. These
problems can often be solved by imposing some form of
regularization.
3. Presence of interactions and non-linearities. If each of the
features makes an independent contribution to the output,
then algorithms based on linear functions (e.g., linear
regression, logistic regression, Support Vector Machines,
naive Bayes) and distance functions (e.g., nearest neighbor
methods, support vector machines with Gaussian kernels)
generally perform well. However, if there are complex
interactions among features, then algorithms such as
decision trees and neural networks work better, because
they are specifically designed to discover these interactions.
Linear methods can also be applied, but the engineer must
manually specify the interactions when using them.
How supervised learning algorithms work
Given a set of $N$ training examples of the form
$\{(x_1, y_1), \ldots, (x_N, y_N)\}$, a learning algorithm seeks a function
$g: X \to Y$,
where X is the input space and Y is the output space. The
function g is an element of some space of possible functions G,
usually called the hypothesis space. It is sometimes convenient
to represent g using a scoring function
$f: X \times Y \to \mathbb{R}$ such that g
is defined as returning the y value that gives the highest score:
$g(x) = \arg\max_{y} f(x, y)$. Let F denote the space of scoring
functions.
Although G and F can be any space of functions, many learning
algorithms are probabilistic models where g takes the form of a
conditional probability model g(x) = P(y | x), or f takes the form
of a joint probability model f(x,y) = P(x,y). For example, naive
Bayes and linear discriminant analysis are joint probability
models, whereas logistic regression is a conditional probability
model.
There are two basic approaches to choosing f or g: empirical risk
minimization and structural risk minimization[3]. Empirical risk
minimization seeks the function that best fits the training data.
Structural risk minimization includes a penalty function that controls
the bias/variance tradeoff.
In both cases, it is assumed that the training set consists of a
sample of $N$ independent and identically distributed pairs,
$(x_i, y_i)$.
In order to measure how well a function fits the training data, a
loss function
$L: Y \times Y \to \mathbb{R}^{\ge 0}$ is defined. For training example
$(x_i, y_i)$, the loss of predicting the value $\hat{y}$ is
$L(y_i, \hat{y})$.
The risk R(g) of function g is defined as the expected loss of g.
This can be estimated from the training data as
$$R_{emp}(g) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, g(x_i)).$$
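The estimate of the risk from the training data is simply an average of per-example losses. The following Python sketch computes it for a squared loss; the candidate function g, the four training pairs, and the names empirical_risk and squared_loss are assumptions made for illustration.

```python
def squared_loss(y, y_hat):
    """Loss of predicting y_hat when the true value is y."""
    return (y - y_hat) ** 2

def empirical_risk(g, examples, loss):
    """Average loss of g over the training pairs (x_i, y_i)."""
    return sum(loss(y, g(x)) for x, y in examples) / len(examples)

# Hypothetical training pairs, roughly following y = 2x.
examples = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]

def g(x):
    """A candidate function from the hypothesis space."""
    return 2.0 * x

risk = empirical_risk(g, examples, squared_loss)
print("empirical risk:", round(risk, 4))
```

Empirical risk minimization would pick, among all candidate functions in G, the one for which this average is smallest.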
Generalizations of supervised learning
There are several ways in which the standard supervised
learning problem can be generalized:
Semi-supervised learning: In this setting, the desired output
values are provided only for a subset of the training data. The
remaining data is unlabeled.
Active learning: Instead of assuming that all of the training
examples are given at the start, active learning algorithms
interactively collect new examples, typically by making queries
to a human user. Often, the queries are based on unlabeled data,
which is a scenario that combines semi-supervised learning with
active learning.
Structured prediction: When the desired output value is a
complex object, such as a parse tree or a labeled graph, then
standard methods must be extended.
Learning to rank: When the input is a set of objects and the
desired output is a ranking of those objects, then again the
standard methods must be extended.
Unsupervised Learning: What if a neural network is given
no feedback for its outputs, not even a real-valued
reinforcement? Can the network learn anything useful? The
unintuitive answer is yes.
Fig. Data for unsupervised learning (table not reproduced; its rows are the animals Dog, Cat, Bat, Whale, Canary, Robin, Ostrich, Snake, Lizard, and Alligator, and its columns are the features has-hair?, has-scales?, has-feathers?, flies?, lives-in-water?, and lays-eggs?)
This form of learning is called unsupervised learning
because no teacher is required. Given a set of input data, the
network is allowed to play with it to try to discover regularities
and relationships between the different parts of the input.
Learning is often made possible through some notion of which
features in the input sets are important. But often we do not
know in advance which features are important, and asking a
learning system to deal with raw input data can be
computationally expensive. Unsupervised learning can be used
as a feature discovery module that precedes supervised
learning.
Consider the data in the figure. The group of ten animals, each
described by its own set of features, breaks down naturally into
three groups: mammals, reptiles and birds. We would like to
build a network that can learn which group a particular animal
belongs to, and to generalize so that it can identify animals it has
not yet seen. We can easily accomplish this with a six-input,
three-output back propagation network. We simply present the
network with an input, observe its output, and update its weights
based on the errors it makes. Without a teacher, however, the
error cannot be computed, so we must seek other methods.
Our first problem is to ensure that only one of the three output
units becomes active for any given input. One solution to this
problem is to let the network settle, find the output unit with the
highest level of activation, and set that unit to 1 and all other
output units to 0. In other words, the output unit with the highest
activation is the only one we consider to be active. A more
neural-like solution is to have the output units fight among
themselves for control of an input vector.
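The first solution described above, keeping only the most active output unit, can be written directly. This is a minimal Python sketch; the function name winner_take_all and the activation values are assumptions used for illustration.

```python
def winner_take_all(activations):
    """Set the most active output unit to 1 and all other units to 0."""
    winner = max(range(len(activations)), key=lambda i: activations[i])
    return [1 if i == winner else 0 for i in range(len(activations))]

# Hypothetical activations of the three output units after the network settles.
outputs = winner_take_all([0.2, 0.7, 0.4])
print(outputs)  # [0, 1, 0]
```

If two units tie for the highest activation, this sketch picks the one with the lower index, which is an arbitrary choice.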
Learning by Induction: What is "Learning by Induction"?
Simply put, it is learning by watching. You watch what others
do, then you do that. Below is a more formal explanation of
inductive vs. deductive logic:
In logic, we often refer to the two broad methods of reasoning as
the deductive and inductive approaches.
Deductive reasoning works from the more general to the more
specific. Sometimes this is informally called a "top-down"
approach. We might begin with thinking up a theory about our
topic of interest. We then narrow that down into more specific
hypotheses that we can test. We narrow down even further when
we collect observations to address the hypotheses. This
ultimately leads us to be able to test the hypotheses with specific
data -- a confirmation (or not) of our original theories.
Inductive reasoning works the other way, moving from specific
observations to broader generalizations and theories. Informally,
we sometimes call this a "bottom up" approach. In inductive
reasoning, we begin with specific observations and measures,
begin to detect patterns and regularities, formulate some
tentative hypotheses that we can explore, and finally end up
developing some general conclusions or theories. (Thanks to
William M.K. Trochim for these definitions).
To translate this into an approach to learning a skill, deductive
learning is someone TELLING you what to do, while inductive
learning is someone SHOWING you what to do. Remember the
saying "a picture is worth a thousand words"? That means, in a
given amount of time, a person can be SHOWN a thousand
times more information than they could be TOLD in the same
amount of time. I can access a picture or pattern much more
quickly than the equivalent description of that picture or pattern
in words.
Athletes often practice "visualization" before they undertake an
action. But in order to visualize something, you need to have a
picture in your head to visualize. How do you get those pictures
in your head? By WATCHING. Who do you watch?
Professionals. This is the key. Pay attention here. When you
want to learn a skill:
WATCH PROFESSIONALS DO IT BEFORE YOU DO IT.
DO NOT DO IT YOURSELF FIRST.
Going out and doing a sport without having seen AND
STUDIED professionals doing that sport is THE NUMBER
ONE MISTAKE people make. They force themselves to play,
their brain says "what do we do now?", another part of the brain
looks for examples (pictures) of what to do, and, finding none,
says "just do anything". So they try to generate behavior to
accomplish something within the rules of the sport. If they
"keep score" and try to "win" and avoid "losing", the negative
impact is multiplied tenfold.
Yet this is EXACTLY what most people do and what most ARE
TOLD to do! "Interested in tennis? Grab a racquet, join a
league, get out there and have fun!" Then what happens? They
have no training, they try to do what it takes to "win", and to do so,
they manufacture awful strokes just TO BE ABLE to play
(remember, they joined a league, so they have to keep score and
win!), these awful strokes get ingrained by repetition, they
produce terrible results, and they are very difficult to unlearn, so
progress, despite lessons (mostly in the useless form of words),
is slow or nonexistent. Then they quit.
When you finally pick up a racquet and go out to play, and your
brain says "what do we do now?", your head will be filled with
pictures of professionals perfectly doing what you are trying to
do. You will not know how to do it incorrectly, because you
have never seen it done incorrectly. You will try to do what they
do, and you will almost immediately proceed to an advanced
intermediate level. You will be a beginner for a short period of
time, if at all, and improvement will be a matter of adding to and
refining what you are doing, not stripping down and unlearning
bad patterns. And since you are not keeping score, you focus
purely on technique. If you hit one into the net, just pull another
ball out of your pocket and do it again. No big deal, no drama,
no guilt. Just hit another. When you feel you can hit all of your
shots somewhat professionally, maybe you can actually play
someone and keep score. You will love the positive feedback of
beating players who have been playing much longer than you
have. You will wonder how they could have played for so long
and still "play like that". Don't they know it's done "this way?"
What professional does it "that way?" Don't they watch tennis
on TV? Who does that? I just started and I know that's wrong.
All these thoughts will make you feel like a genius.
So how does all of this relate to chess? Simply put, play
over the games of professional players and see how they play
before you play anybody. Try to imitate them instead of trying
to reinvent the wheel. Play over the games of lots of different
players and then decide which one or two you like. The ones
you like are the ones where you say after playing over one of
their games, "I would love to play a game like that!" Then just
concentrate on those one or two players. Study and play the
openings they play. Get books where they comment on their
own games. Maybe they will say what they were thinking
during the game. Try to play like them. During your games,
think "What would he do in this position?" Personally, I like
Morphy for his rapid development and attacks, Alekhine for his
creativeness in all positions, and Spassky for his ability to play
all types of positions and create attacks in calm positions.
Learning Decision tree
Decision tree learning, used in data mining and machine
learning, uses a decision tree as a predictive model which maps
observations about an item to conclusions about the item's target
value. More descriptive names for such tree models are
classification trees or regression trees. In these tree structures,
leaves represent classifications and branches represent
conjunctions of features that lead to those classifications.
In decision analysis, a decision tree can be used to visually and
explicitly represent decisions and decision making. In data
mining, a decision tree describes data but not decisions; rather
the resulting classification tree can be an input for decision
making.
General
Decision tree learning is a common method used in data mining.
The goal is to create a model that predicts the value of a target
variable based on several input variables. Each interior node
corresponds to one of the input variables; there are edges to
children for each of the possible values of that input variable.
Each leaf represents a value of the target variable given the
values of the input variables represented by the path from the
root to the leaf.
A tree can be "learned" by splitting the source set into subsets
based on an attribute value test. This process is repeated on each
derived subset in a recursive manner called recursive
partitioning. The recursion is completed when the subset at a
node all has the same value of the target variable, or when
splitting no longer adds value to the predictions.
Data comes in records of the form:
$$(\mathbf{x}, Y) = (x_1, x_2, x_3, \ldots, x_k, Y)$$
The dependent variable, Y, is the target variable that we are
trying to understand, classify or generalize. The vector x is
composed of the input variables, x1, x2, x3 etc., that are used for
that task.
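Recursive partitioning as described above can be sketched in a few dozen lines. The following Python example is an illustration only: it picks the attribute whose split leaves the fewest misclassified records (a simple stand-in for the information-based criteria used by real tree learners), and the animal records are made-up values, not the data of any figure in these notes.

```python
from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def split(records, attr):
    """Partition records by their value of attribute attr."""
    groups = {}
    for x, y in records:
        groups.setdefault(x[attr], []).append((x, y))
    return groups

def errors(records):
    """Records misclassified if we always predict the majority label."""
    labels = [y for _, y in records]
    return len(labels) - Counter(labels).most_common(1)[0][1]

def learn_tree(records, attributes):
    """Return a label (leaf) or a pair (attribute, {value: subtree})."""
    labels = [y for _, y in records]
    if len(set(labels)) == 1 or not attributes:
        return majority(labels)  # all records agree, or nothing left to test
    best = min(
        attributes,
        key=lambda a: sum(errors(g) for g in split(records, a).values()),
    )
    groups = split(records, best)
    if sum(errors(g) for g in groups.values()) >= errors(records):
        return majority(labels)  # splitting no longer adds value
    rest = [a for a in attributes if a != best]
    return (best, {value: learn_tree(g, rest) for value, g in groups.items()})

def classify(tree, x):
    """Follow the path from the root to a leaf for input x."""
    while isinstance(tree, tuple):
        attr, children = tree
        tree = children[x[attr]]  # raises KeyError for an unseen value
    return tree

# Hypothetical records of the form (x, Y).
records = [
    ({"has_feathers": True, "lays_eggs": True}, "bird"),
    ({"has_feathers": False, "lays_eggs": True}, "reptile"),
    ({"has_feathers": False, "lays_eggs": False}, "mammal"),
    ({"has_feathers": True, "lays_eggs": True}, "bird"),
    ({"has_feathers": False, "lays_eggs": False}, "mammal"),
]
tree = learn_tree(records, ["has_feathers", "lays_eggs"])
print(tree)
print(classify(tree, {"has_feathers": False, "lays_eggs": True}))  # reptile
```

Each interior node of the learned tree tests one input variable and each leaf holds a value of the target variable, which matches the structure described in this section.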
Types:
Classification tree analysis is when the predicted outcome is
the class to which the data belongs.
Regression tree analysis is when the predicted outcome can
be considered a real number (e.g. the price of a house, or a
patient's length of stay in a hospital).
Classification And Regression Tree (CART) analysis is
used to refer to both of the above procedures, first introduced by
Breiman et al.
Chi-squared Automatic Interaction Detector (CHAID)
performs multi-level splits when computing classification trees. [2]
A Random Forest classifier uses a number of decision trees
in order to improve the classification rate.
Boosted Trees can be used for regression-type and
classification-type problems
Decision tree advantages
Decision trees have various advantages:
Simple to understand and interpret. People are able to
understand decision tree models after a brief explanation.
Requires little data preparation. Other techniques often
require data normalization, the creation of dummy variables, and
the removal of blank values.
Able to handle both numerical and categorical data.
Other techniques are usually specialized in analyzing
datasets that have only one type of variable. Ex: relation
rules can be used only with nominal variables while neural
networks can be used only with numerical variables.
Uses a white box model. If a given situation is observable
in a model the explanation for the condition is easily
explained by Boolean logic. An example of a black box
model is an artificial neural network since the explanation
for the results is difficult to understand.
Possible to validate a model using statistical tests. That
makes it possible to account for the reliability of the model.
Robust. Performs well even if its assumptions are
somewhat violated by the true model from which the data
were generated.
Performs well with large data in a short time. Large
amounts of data can be analyzed using personal computers
in a time short enough to enable stakeholders to take
decisions based on its analysis.
Truth maintenance system: A truth maintenance system,
or TMS, is a knowledge representation method for representing
both beliefs and their dependencies. The name truth
maintenance is due to the ability of these systems to restore
consistency.
It is also termed as a belief revision system; a truth maintenance
system maintains consistency between old believed knowledge
and current believed knowledge in the knowledge base (KB)
through revision. If the current believed statements contradict
the knowledge in KB, then the KB is updated with the new
knowledge. It may happen that the same data later becomes
valid again, so that the previous knowledge is required in the KB.
If the previous data is no longer present, it has to be inferred
again. If the previous knowledge was kept with the KB, no
re-derivation of the same knowledge is needed. Hence the use of
a TMS to avoid such re-derivation; it keeps track of the contradictory
data with the help of a dependency record. This record reflects
the retractions and additions, which makes the inference engine
(IE) aware of its current belief set.
Each statement having at least one valid justification is made a
part of the current belief set. When a contradiction is found, the
statement(s) responsible for the contradiction are identified and
an appropriate one is retracted. This may in turn result in the
addition of new statements to the KB. This process is called
dependency-directed backtracking.
The TMS maintains the records in the form of a dependency
network. Each node in the network is one of the entries in the
KB (a premise, antecedent, inference rule, etc.). Each arc
of the network represents the inference steps from which the
node was derived.
Premise: A premise is a fundamental belief which is assumed to
be always true. Premises do not need justifications. They
are the base from which justifications for all other nodes
are stated.
There are two types of justification for each node. They are:
1. Support List [SL]
2. Conditional Proof [CP]
Many kinds of truth maintenance systems exist. Two major
types are single-context and multi-context truth maintenance. In
single context systems, consistency is maintained among all
facts in memory (database). Multi-context systems allow
consistency to be relevant to a subset of facts in memory (a
context) according to the history of logical inference. This is
achieved by tagging each fact or deduction with its logical
history. Multi-agent truth maintenance systems perform truth
maintenance across multiple memories, often located on
different machines. De Kleer's ATMS (1986) was utilized in
systems based upon KEE on the Lisp Machine. The first
multi-agent TMS was created by Mason and Johnson. It was a
multi-context system. Bridgeland and Huhns created the first
single-context multi-agent system.
Dependency directed Backtracking: Dependency directed
backtracking is a problem solving technique for efficiently
evading contradictions. It is invoked when the problem solver
discovers that its current state is inconsistent. The goal is, in a
single operation, to change the problem solver's current state to
one that contains neither the contradiction just uncovered
nor any contradiction encountered previously. This is achieved
by consulting records of the inferences the problem solver has
performed and records of previous contradictions, which
dependency directed backtracking has constructed in response
to previous contradictions.
Contrast to backtracking: Dependency directed backtracking
was developed to avoid the deficiencies of chronological
backtracking. Consider the application of chronological
backtracking to the following task (see fig): First do one of A or
B, then one of C or D, and then one of E or F. Assume that each
step requires significant problem solving effort and that A and C
together or B and E together produce a contradiction that is only
uncovered after significant effort. Fig illustrates the sequence of
problem solving states that chronological backtracking goes
through to find all solutions (6, 7, 11 and 14).
Backtracking to an Appropriate Choice: The first deficiency
of chronological backtracking is illustrated by the unnecessary
state 4. The contradiction discovered in state 3 depends on
choices A and C and not E. Therefore, replacing the choice E
with F and working on state 4 is futile, as this change does not
remove the contradiction. Unlike chronological backtracking,
which replaces the most recent choice, dependency directed
backtracking replaces a choice that caused the contradiction. The
discovery that state 3 is inconsistent causes immediate
backtracking to state 5. To be able to determine which choices
underlie the contradiction requires that the problem solver store
dependency records with every datum that it infers.
Avoiding Rediscovering Contradictions: The second deficiency
of chronological backtracking is illustrated by unnecessary
state 13. The contradiction discovered in state 10 depends on B
and E. As E is the most recent choice, chronological and
dependency directed backtracking are indistinguishable, both
backtracking to state 11. However, as B and E are known to be
inconsistent with each other, there is no point in rediscovering
this contradiction by working in state 13. Dependency directed
backtracking records the inconsistent combination as a no-good
and skips any later state that contains it.
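The saving from recording no-goods can be seen on the A/B, C/D, E/F task. The following Python sketch is a simplification made for illustration: it walks through complete states in chronological order instead of jumping within a search tree, it counts how often the costly consistency check runs, and the names search and expensive_check are assumptions. It does not reproduce the state numbering of the figure.

```python
from itertools import product

# Ground truth that the problem solver can only discover through effort:
# A with C is contradictory, and so is B with E.
CONFLICTS = [frozenset("AC"), frozenset("BE")]

def expensive_check(state):
    """Stand-in for significant problem-solving effort.

    Returns the set of choices that cause a contradiction, or None.
    """
    for conflict in CONFLICTS:
        if conflict <= state:
            return conflict
    return None

def search(choice_points, use_nogoods):
    """Find all consistent states; return them with the number of checks run."""
    nogoods, solutions, checks = [], [], 0
    for combo in product(*choice_points):
        state = frozenset(combo)
        if use_nogoods and any(nogood <= state for nogood in nogoods):
            continue  # contains a recorded contradiction: skip without effort
        checks += 1
        culprit = expensive_check(state)
        if culprit is None:
            solutions.append("".join(combo))
        else:
            nogoods.append(culprit)  # dependency record of the contradiction
    return solutions, checks

choice_points = [("A", "B"), ("C", "D"), ("E", "F")]
chronological = search(choice_points, use_nogoods=False)
dependency_directed = search(choice_points, use_nogoods=True)
print(chronological)        # (['ADE', 'ADF', 'BCF', 'BDF'], 8)
print(dependency_directed)  # (['ADE', 'ADF', 'BCF', 'BDF'], 6)
```

Both runs find the same four solutions, but the run with no-goods avoids the two checks that would only rediscover a contradiction already known.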
Disadvantages:
1. Dependency-directed backtracking incurs a significant time
and space overhead as it requires the maintenance of
dependency records and an additional no-good database.
Thus the effort required to maintain the dependencies may
be more than the problem-solving effort saved.
2. If the problem solver is logically complete and finishes all
work on a state before considering the next, the problem of
backtracking to an inappropriate choice cannot occur.
3. In such cases much of the advantage of dependency-directed
backtracking is irrelevant. However, most practical
problem solvers are neither logically complete nor finish all
possible work on a state before considering another.
Fuzzy function:
A membership function is a fuzzy function that assigns to each
element its degree of membership in a fuzzy set, a value between
0 and 1. Fuzzy logic depends upon membership functions.
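As an illustration, one commonly used shape is the triangular membership function. The sketch below is an assumption chosen for the example, not something prescribed by these notes; the function name triangular and the temperature values are hypothetical.

```python
def triangular(x, a, b, c):
    """Degree of membership of x in a triangular fuzzy set.

    a and c are the feet of the triangle (membership 0) and b is the
    peak (membership 1). Assumes a < b < c.
    """
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical fuzzy set "comfortable temperature", peaking at 25 degrees.
for temperature in (20, 22.5, 25, 28, 35):
    print(temperature, triangular(temperature, 20, 25, 30))
```

Values between the feet and the peak receive a partial degree of membership, which is what distinguishes a fuzzy set from an ordinary set.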
Unit-4
Natural Language processing and planning
Backward chaining: Backward chaining (or backward
reasoning) is an inference method used in automated theorem
provers, proof assistants and other artificial intelligence
applications. It is one of the two most commonly used methods
of reasoning with inference rules and logical implications; the
other is forward chaining. Backward chaining is implemented in
logic programming by SLD resolution. Both methods are based on
the modus ponens inference rule.
Backward chaining starts with a list of goals (or a hypothesis)
and works backwards from the consequent to the antecedent to
see if there is data available that will support any of these
consequents. An inference engine using backward chaining
searches the inference rules until it finds one which has a
consequent (Then clause) that matches a desired goal. If the
antecedent (If clause) of that rule is not known to be true, then it
is added to the list of goals (for the original goal to be
confirmed, one must also provide data that confirms this new
goal).
For example, suppose that the goal is to conclude the color of
my pet Fritz, given that he croaks and eats flies, and that the rule
base contains the following four rules:
1. If X croaks and eats flies Then X is a frog
2. If X chirps and sings Then X is a canary
3. If X is a frog Then X is green
4. If X is a canary Then X is yellow
This rule base would be searched and the third and fourth rules
would be selected, because their consequents (Then Fritz is
green, Then Fritz is yellow) match the goal (to determine Fritz's
color). It is not yet known that Fritz is a frog, so both the
antecedents (If Fritz is a frog, If Fritz is a canary) are added to
the goal list. The rule base is again searched and this time the
first two rules are selected, because their consequents (Then X
is a frog, Then X is a canary) match the new goals that were just
added to the list. The antecedent (If Fritz croaks and eats flies) is
known to be true and therefore it can be concluded that Fritz is a
frog, and not a canary. The goal of determining Fritz's color is
now achieved (Fritz is green if he is a frog, and yellow if he is a
canary, but he is a frog since he croaks and eats flies; therefore,
Fritz is green).
Note that the goals always match the affirmed versions of the
consequents of implications (and not the negated versions as in
modus tollens) and even then, their antecedents are then
considered as the new goals (and not the conclusions as in
affirming the consequent) which ultimately must match known
facts (usually defined as consequents whose antecedents are
always true); thus, the inference rule which is used is modus
ponens.
Because the list of goals determines which rules are selected and
used, this method is called goal-driven, in contrast to data-driven
forward-chaining inference. The backward chaining approach is
often employed by expert systems. Programming languages such as
Prolog, Knowledge Machine and ECLiPSe support backward
chaining within their inference engines.
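The Fritz example can be traced with a minimal Python sketch of backward chaining. It is a simplification under stated assumptions: rules are propositional (the variable X is dropped because Fritz is the only individual), the rule base has no cycles, and the names RULES, FACTS and prove are hypothetical.

```python
# Each rule is (set of antecedents, consequent), mirroring rules 1 to 4.
RULES = [
    ({"croaks", "eats flies"}, "frog"),
    ({"chirps", "sings"}, "canary"),
    ({"frog"}, "green"),
    ({"canary"}, "yellow"),
]
FACTS = {"croaks", "eats flies"}  # what is known about Fritz


def prove(goal, facts, rules):
    """Goal-driven proof: a goal holds if it is a known fact, or if some
    rule concludes it and all of that rule's antecedents can be proved."""
    if goal in facts:
        return True
    return any(
        all(prove(subgoal, facts, rules) for subgoal in antecedents)
        for antecedents, consequent in rules
        if consequent == goal
    )


colors = [c for c in ("green", "yellow") if prove(c, FACTS, RULES)]
print(colors)
```

The search starts from the candidate colors and works back to the known facts, so the canary rules are examined but fail at the subgoals "chirps" and "sings".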
Parsing: A parser is an algorithm for inferring the structure
of its input, guided by a grammar that dictates what structures
are possible or probable. In an ordinary parser, the input is a
string, and the grammar ranges over strings. This section explores
generalizations of ordinary parsing algorithms that allow the
input to consist of string tuples and/or the grammar to range
over string tuples. Such inference algorithms can perform
various kinds of analysis on parallel texts, also known as
multitexts.
Figure 1 shows some of the ways in which ordinary parsing
can be generalized. A synchronous parser is an algorithm that
can infer the syntactic structure of each component text in a
multitext and simultaneously infer the correspondence relation
between these structures. When a parser's input can have fewer
dimensions than the parser's grammar, we call it a translator.
When a parser's grammar can have fewer dimensions than the
parser's input, we call it a synchronizer. The corresponding
processes are called translation and synchronization.
To our knowledge, synchronization has never been explored
as a class of algorithms. Neither has the relationship between
parsing and word alignment. The relationship between
translation and ordinary parsing was noted a long time ago, but here
we articulate it in more detail: ordinary parsing is a special
case of synchronous parsing, which is a special case of
translation. This paper offers an informal guided tour of the
generalized parsing algorithms in Figure 1. It culminates with a
recipe for using these algorithms to train and apply a
syntax-aware statistical machine translation (SMT) system.
Machine translation: Multitree-based statistical machine
translation (MTSMT) is an architecture for SMT that revolves
around multitrees. Figure 2 shows how to build and use a
rudimentary MTSMT system, starting from some multitext and
one or more monolingual treebanks.
The recipe follows:
T1. Induce a word-to-word translation model.
T2. Induce PCFGs from the relative frequencies of productions
in the monolingual treebanks.
T3. Synchronize some multitext.
T4. Induce an initial PMTG from the relative frequencies of
productions in the multitreebank.
T5. Re-estimate the PMTG parameters, using a
synchronous parser with the expectation semiring.
A1. Use the PMTG to infer the most probable multitree
covering new input text.
A2. Linearize the output dimensions of the multitree.
Steps T2, T4 and A2 are trivial. Steps T1, T3, T5, and A1 are
instances of the generalized parsers.
Figure 2 is only an architecture. Computational
complexity and generalization error stand in the
way of its practical implementation. Nevertheless,
it is satisfying to note that all the non-trivial algorithms
in Figure 2 are special cases of Translator CT.
It is therefore possible to implement an MTSMT
system using just one inference algorithm, parameterized
by a grammar, a semiring, and a search
strategy. An advantage of building an MT system in
this manner is that improvements invented for ordinary
parsing algorithms can often be applied to all
the main components of the system. For example,
Melamed (2003) showed how to reduce the computational
complexity of a synchronous parser by Θ(n^3), just by changing
the logic.
The same optimization can be applied to the inference
algorithms. With proper software design, such optimizations
need never be implemented more than once. For simplicity, the
algorithms here are based on CKY logic. However, the
architecture in Figure 2 can also be implemented using
generalizations of more sophisticated parsing logics, such as
those inherent in Earley or head-driven parsers.
Benefits of Machine Translation: There are three research
benefits of using generalized parsers to build MT systems.
1. We can take advantage of past and future research on
making parsers more accurate and more efficient.
2. Therefore, we can concentrate our efforts on better models,
without worrying about MT-specific search algorithms.
3. More generally and most importantly, this approach
encourages MT research to be less specialized and more
transparently related to the rest of computational
linguistics.
Block world: The techniques we are about to discuss can be
applied in a wide variety of task domains, and they have been.
But to make it easy to compare the variety of methods we
consider, it is useful to look at all of them in a
single domain that is complex enough that the need for each of
the mechanisms is apparent, yet simple enough that
easy-to-follow examples can be found. The blocks world is such a
domain. There is a flat surface on which blocks can be placed.
There are a number of square blocks, all the same size. They can
be stacked one upon another. There is a robot arm that can
manipulate the blocks. The actions it can perform include:
UNSTACK (A, B) - pick up block A from its current
position on block B. The arm must be empty and block A
must have no blocks on top of it.
STACK (A, B) - place block A on block B. The arm must
already be holding A and the surface of B must be clear.
PICKUP (A) - pick up block A from the table and hold it.
The arm must be empty and there must be nothing on top of block A.
PUTDOWN (A) - put block A down on the table. The arm
must have been holding block A.
Notice that in the world we have described, the robot arm can
hold only one block at a time. Also, since all blocks are the same
size, each block can have at most one other block directly on top
of it.
In order to specify both the conditions under which an operation
may be performed and the results of performing it, we need to
use the following predicates:
ON (A, B) - block A is on block B.
ONTABLE (A) - block A is on the table.
CLEAR (A) - there is nothing on top of block A.
HOLDING (A) - The arm is holding block A.
ARMEMPTY- the arm is holding nothing.
Various logical statements are true in this blocks world. For
example,
[∃x: HOLDING (x)] → ¬ARMEMPTY
∀x: ONTABLE (x) → ¬∃y: ON (x, y)
∀x: [¬∃y: ON (y, x)] → CLEAR (x)
The first of these statements says simply that if the arm is
holding anything, then it is not empty. The second says that if a
block is on the table, then it is not also on another block. The
third says that any block with no blocks on it is clear.
Components of planning system: The components of planning
systems are as follows:
1. Choose the best rule to apply next based on the best
available information.
2. Apply the chosen rule to compute the new problem state
that arises from the application.
3. Detect when a solution has been found.
4. Detect dead ends so that they can be abandoned and the
system's effort directed in more fruitful directions.
5. Detect when an almost correct solution has been found and
employ special techniques to make it totally correct.
Choosing rules to apply. The most widely used technique for
selecting appropriate rules to apply is first to isolate a set of
differences between the desired goal state and the current state
and then to identify those rules that are relevant to reducing
those differences. If several rules are found, a variety of other
heuristic information can be exploited to choose among them.
Applying Rules. In simpler systems, each rule simply specifies
the entire problem state
that would result from its application. Now, however, we must
be able to deal with rules that specify only a small part of the
complete problem state.
One way is to describe, for each action, each of the changes it
makes to the state description. In addition, some statement that
everything else remains unchanged is also necessary. Fig1 shows
how a state, called S0, of a simple blocks world problem could
be represented.
ON (A, B, S0) ^
ONTABLE (B, S0) ^
CLEAR (A, S0)
Fig1: a simple blocks world description
If we start with the situation shown in Fig1, we would describe it
as
ON (A, B) ^ ONTABLE (B) ^ CLEAR (A)
After applying the operator UNSTACK (A, B), our description
of the world would be
ONTABLE (B) ^ CLEAR (A) ^ CLEAR (B) ^ HOLDING (A)
STRIPS-style operators that correspond to the blocks world
operations we have discussed are shown in fig2. Notice that for
simple rules such as these the PRECONDITION list is often
identical to the DELETE list. In order to pick up a block, the
robot arm must be empty; as soon as it picks up a block, it is no
longer empty. But preconditions are not always deleted. For
example, in order for the arm to pick up a block, the block must
have no other blocks on top of it; after it is picked up, it still has
no blocks on top of it. This is the reason that the
PRECONDITION and DELETE lists must be specified
separately.
STACK(X, Y)
P: CLEAR(Y) ^ HOLDING(X)
D: CLEAR(Y) ^ HOLDING(X)
A: ARMEMPTY ^ ON(X, Y)
UNSTACK(X, Y)
P: ON(X, Y) ^ CLEAR(X) ^ ARMEMPTY
D: ON(X, Y) ^ ARMEMPTY
A: HOLDING(X) ^ CLEAR(Y)
PICKUP(X)
P: CLEAR(X) ^ ONTABLE(X) ^ ARMEMPTY
D: ONTABLE(X) ^ ARMEMPTY
A: HOLDING(X)
PUTDOWN(X)
P: HOLDING(X)
D: HOLDING(X)
A: ONTABLE(X) ^ ARMEMPTY
Fig2: STRIPS-style operators for the blocks world
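A minimal Python sketch of how such operators might be applied to a state is given below. It assumes a state is a set of ground predicates written as strings, and it takes UNSTACK to add HOLDING(X) and CLEAR(Y), which agrees with the description of the world after UNSTACK (A, B) given earlier. The names Operator, unstack, putdown and apply are hypothetical, and the initial state adds ARMEMPTY so that the preconditions of UNSTACK hold.

```python
from typing import NamedTuple


class Operator(NamedTuple):
    name: str
    pre: frozenset      # PRECONDITION list
    delete: frozenset   # DELETE list
    add: frozenset      # ADD list


def unstack(x, y):
    return Operator(
        name=f"UNSTACK({x},{y})",
        pre=frozenset({f"ON({x},{y})", f"CLEAR({x})", "ARMEMPTY"}),
        delete=frozenset({f"ON({x},{y})", "ARMEMPTY"}),
        add=frozenset({f"HOLDING({x})", f"CLEAR({y})"}),
    )


def putdown(x):
    return Operator(
        name=f"PUTDOWN({x})",
        pre=frozenset({f"HOLDING({x})"}),
        delete=frozenset({f"HOLDING({x})"}),
        add=frozenset({f"ONTABLE({x})", "ARMEMPTY"}),
    )


def apply(state, op):
    """Return the successor state, or raise if a precondition is missing."""
    if not op.pre <= state:
        raise ValueError(f"{op.name}: unmet preconditions {sorted(op.pre - state)}")
    # Everything not named in the DELETE list carries over unchanged.
    return (state - op.delete) | op.add


s0 = frozenset({"ON(A,B)", "ONTABLE(B)", "CLEAR(A)", "ARMEMPTY"})
s1 = apply(s0, unstack("A", "B"))
s2 = apply(s1, putdown("A"))
print(sorted(s1))
print(sorted(s2))
```

Because only the DELETE and ADD lists change the state, no separate statement that "everything else remains unchanged" is needed; the set difference and union carry the rest of the state over.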

Detecting a solution. A planning system has succeeded in
finding a solution to a problem when it has found a sequence of
operators that transforms the initial problem state into the goal
state.
Detecting Dead ends. As a planning system is searching for a
sequence of operators to solve a particular problem, it must be
able to detect when it is exploring a path that can never lead to a
solution. The same reasoning mechanisms that can be used to
detect a solution can often be used for detecting a dead end.
Goal stack planning: One technique developed for
solving compound goals that may interact is the use of a goal
stack. This was the approach used by STRIPS. In this method,
the problem solver makes use of a single stack that contains both
goals and operators that have been proposed to satisfy those
goals. The problem solver also relies on a database that
describes the current situation and a set of operators described as
PRECONDITION, ADD, and DELETE lists. To see how this
method works, let us carry it through for the simple example
shown in fig.
Start: ON (B, A) ^ ONTABLE (A) ^ ONTABLE (C) ^ ONTABLE (D) ^ ARMEMPTY
Goal: ON (C, A) ^ ON (B, D) ^ ONTABLE (A) ^ ONTABLE (D)
Fig: a very simple blocks world problem
When we begin solving this problem, the goal stack is simply
ON (C, A) ^ ON (B, D) ^ ONTABLE (A) ^ ONTABLE (D)
But we want to separate this problem into four subproblems,
one for each component of the original goal. Two of the
subproblems, ONTABLE (A) and ONTABLE (D), are already true in the
initial state. Depending on the order in which we want to tackle
the remaining subproblems, there are two goal stacks that could be
created as our first step, where each line represents one goal on
the stack and OTAD is an abbreviation for ONTABLE (A) ^
ONTABLE (D):
[1]
ON (C, A)
ON (B, D)
ON (C, A) ^ ON (B, D) ^ OTAD
[2]
ON (B, D)
ON (C, A)
ON (C, A) ^ ON (B, D) ^ OTAD
To continue with the example we started above, let us assume
that we choose first to explore alternative 1. Alternative 2 will
also lead to a solution. In fact, it finds one so trivially that it is
not very interesting. Exploring alternative 1, we first check to
see whether ON (C, A) is true in the current state.
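The first step of this example, building the alternative goal stacks, can be sketched in Python. This is only an illustration: it assumes the start state ON(B,A) ^ ONTABLE(A) ^ ONTABLE(C) ^ ONTABLE(D) ^ ARMEMPTY read from the figure, it writes the compound goal out in full instead of abbreviating it, and the function name initial_goal_stacks is hypothetical.

```python
from itertools import permutations

start = {"ON(B,A)", "ONTABLE(A)", "ONTABLE(C)", "ONTABLE(D)", "ARMEMPTY"}
goal = ["ON(C,A)", "ON(B,D)", "ONTABLE(A)", "ONTABLE(D)"]


def initial_goal_stacks(start, goal):
    """Return the candidate stacks, each listed from top of stack to bottom."""
    compound = " ^ ".join(goal)
    # Subgoals already true in the start state need no work of their own.
    open_goals = [g for g in goal if g not in start]
    # One stack per ordering of the open subgoals; the compound goal stays
    # at the bottom so that it is re-checked after the subgoals are solved.
    return [list(order) + [compound] for order in permutations(open_goals)]


stacks = initial_goal_stacks(start, goal)
for number, stack in enumerate(stacks, start=1):
    print(f"[{number}]")
    for entry in stack:
        print(entry)
```

Keeping the compound goal at the bottom matters because solving one subgoal may undo another; the final check catches such interactions.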
Partial-Order Planner. Any planner that maintains a partial
solution as a totally ordered list of steps found so far is called a
total-order planner, or a linear planner. Alternatively, if we
only represent partial-order constraints on steps, then we have a
partial-order planner, which is also called a non-linear
planner. In this case, we specify a set of temporal constraints
between pairs of steps of the form S1 < S2, meaning that step S1
comes before, but not necessarily immediately before, step S2.
We also show this temporal constraint in graph form as
S1 ---------> S2
STRIPS is a total-order planner, as are situation-space
progression and regression planners.
Principle of Least Commitment
The principle of least commitment is the idea of never making a
choice unless required to do so. In other words, only do
something if it's necessary. The advantage of using this principle
is that we try to avoid doing work that might have to be undone
later, hence avoiding wasted work. In planning, one application
of this principle is to never order plan steps unless it's necessary
for some reason. So, partial-order planners exhibit this property
because constraints ordering steps will only be inserted when
necessary. On the other hand, situation-space progression
planners make commitments about the order of steps as they try
to find a solution and therefore may make mistakes from poor
guesses about the right order of steps.
Representing a Partial-Order Plan

A partial-order plan will be represented as a graph that describes
the temporal constraints between plan steps selected so far. That
is, each node will represent a single step in the plan (i.e., an
instance of one of the operators), and an arc will designate a
temporal constraint between the two steps connected by the arc.
For example,
S1 --------> S2 ----------> S5
 |\                          ^
 | \-----------------+       |
 |                   v       |
 +------> S3 ------> S4 -----+
graphically represents the temporal constraints S1 < S2, S1 <
S3, S1 < S4, S2 < S5, S3 < S4, and S4 < S5. This partial-order
plan implicitly represents the following three total-order plans,
each of which is consistent with all of the given constraints:
[S1,S2,S3,S4,S5], [S1,S3,S2,S4,S5], and [S1,S3,S4,S2,S5].
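The three total-order plans can be checked mechanically. The following Python sketch enumerates every ordering of the steps and keeps those consistent with the constraints; brute-force enumeration is an assumption made for clarity and is only practical for small plans. The function name linearizations is hypothetical.

```python
from itertools import permutations


def linearizations(steps, orderings):
    """Return every total order of steps that satisfies all 'a < b' constraints."""
    result = []
    for candidate in permutations(steps):
        position = {step: index for index, step in enumerate(candidate)}
        if all(position[a] < position[b] for a, b in orderings):
            result.append(list(candidate))
    return result


steps = ["S1", "S2", "S3", "S4", "S5"]
orderings = [("S1", "S2"), ("S1", "S3"), ("S1", "S4"),
             ("S2", "S5"), ("S3", "S4"), ("S4", "S5")]

for plan in linearizations(steps, orderings):
    print(plan)
```

Only the relative position of S2 with respect to S3 and S4 is left open by the constraints, which is why exactly three linearizations remain.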
Partial-Order Planner (POP) Algorithm
function pop(initial-state, conjunctive-goal, operators)
  // non-deterministic algorithm
  plan = make-initial-plan(initial-state, conjunctive-goal);
  loop:
  begin
    if solution?(plan) then return plan;
    (S-need, c) = select-subgoal(plan); // choose an unsolved goal
    choose-operator(plan, operators, S-need, c);
        // select an operator to solve that goal and revise plan
    resolve-threats(plan); // fix any threats created
  end
end
function solution?(plan)
  if causal-links-establishing-all-preconditions-of-all-steps(plan)
     and all-threats-resolved(plan)
     and all-temporal-ordering-constraints-consistent(plan)
     and all-variable-bindings-consistent(plan)
  then return true;
  else return false;
end
function select-subgoal(plan)
  pick a plan step S-need from steps(plan) with a precondition c
      that has not been achieved;
  return (S-need, c);
end
procedure choose-operator(plan, operators, S-need, c)
  // solve "open precondition" of some step
  choose a step S-add by either
    Step Addition: adding a new step from operators that
        has c in its Add-list
    or Simple Establishment: picking an existing step in Steps(plan)
        that has c in its Add-list;
  if no such step then return fail;
  add causal link "S-add --->c S-need" to Links(plan);
  add temporal ordering constraint "S-add < S-need" to Orderings(plan);
  if S-add is a newly added step then
  begin
    add S-add to Steps(plan);
    add "Start < S-add" and "S-add < Finish" to Orderings(plan);
  end
end
procedure resolve-threats(plan)
  foreach S-threat that threatens link "Si --->c Sj" in Links(plan)
  begin // "declobber" threat
    choose either
      Demotion: add "S-threat < Si" to Orderings(plan)
      or Promotion: add "Sj < S-threat" to Orderings(plan);
    if not(consistent(plan)) then return fail;
  end
end
Recursive transition network: A recursive transition network
("RTN") is a graph theoretical schematic used to represent the
rules of a context-free grammar. RTNs have application to
programming languages, natural language and lexical analysis.
Any sentence that is constructed according to the rules of an
RTN is said to be "well-formed." The structural elements of a
well-formed sentence may also be well-formed sentences by
themselves, or they may be simpler structures. This is why
RTNs are described as recursive.
A sentence is generated by an RTN by applying the generative
rules specified in the RTN itself. These represent any set of
rules or a function consisting of a finite number of steps.
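To make the recursion concrete, here is a small Python sketch of an RTN used as a recognizer. Everything in it is an assumption made for illustration: the three networks (S, NP, PP), the tiny lexicon, and the function names traverse and accepts are hypothetical. The sketch also assumes that no network can re-enter itself without first consuming a word, since otherwise the recursion would not terminate.

```python
# Each network starts in state 0. An arc is (from_state, label, to_state);
# a label is either a word category or the name of another network.
RTN = {
    "S": {"arcs": [(0, "NP", 1), (1, "verb", 2), (2, "NP", 3)], "final": {2, 3}},
    "NP": {"arcs": [(0, "det", 1), (1, "adj", 1), (1, "noun", 2), (2, "PP", 2)],
           "final": {2}},
    "PP": {"arcs": [(0, "prep", 1), (1, "NP", 2)], "final": {2}},
}
LEXICON = {"the": "det", "a": "det", "big": "adj", "dog": "noun", "cat": "noun",
           "park": "noun", "chased": "verb", "sleeps": "verb", "in": "prep"}


def traverse(net, state, words, i):
    """Yield each input position at which network `net` can finish,
    having started from `state` with the words from position i onward."""
    if state in RTN[net]["final"]:
        yield i
    for source, label, target in RTN[net]["arcs"]:
        if source != state:
            continue
        if label in RTN:
            # The arc names another network: traverse it recursively,
            # then continue in this network from where it finished.
            for j in traverse(label, 0, words, i):
                yield from traverse(net, target, words, j)
        elif i < len(words) and LEXICON.get(words[i]) == label:
            yield from traverse(net, target, words, i + 1)


def accepts(sentence):
    words = sentence.split()
    return any(end == len(words) for end in traverse("S", 0, words, 0))


print(accepts("the dog in the park chased a cat"))
print(accepts("dog the chased"))
```

The NP network calls PP, and PP calls NP again, so a noun phrase can contain further noun phrases. This mutual reference between networks is the recursion that gives RTNs their name.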
Unit-5
Expert System and AI languages
Introduction: An expert system is a set of programs that
manipulate encoded knowledge to solve problems in a
specialized domain that normally requires human expertise. An
expert system's knowledge is obtained from expert sources and
coded in a form suitable for the system to use in its inference
or reasoning processes. The expert knowledge must be obtained
from specialists or other sources of expertise, such as texts,
journal articles, and data bases. This type of knowledge usually
requires much training and experience in some specialized field
such as medicine, geology, system configuration, or engineering
design.
Characteristic Features of Expert systems: Expert
systems differ from conventional computer systems in several
important ways.
1. Expert systems use knowledge rather than data to control
the solution process. Much of the knowledge used is
heuristic in nature rather than algorithmic.
2. The knowledge is encoded and maintained as an entity
separate from the control program. As such, it is not
compiled together with the control program itself. This
permits the incremental addition and modification
(refinement) of the knowledge base without recompilation
of the control programs. Furthermore, it is possible in
some cases to use different knowledge bases with the same
control programs to produce different types of expert
systems.
3. Expert systems are capable of explaining how a particular
conclusion was reached, and why requested information is
needed during a consultation. This is important as it gives
the user a chance to assess and understand the system's
reasoning ability, thereby improving the user's confidence
in the system.
4. Expert systems use symbolic representations for
knowledge (rules, networks, or frames) and perform their
inference through symbolic computations that closely
resemble manipulations of natural language.
Expert systems often reason with metaknowledge; that is, they
reason with knowledge about themselves, and their own
knowledge limits and capabilities.
Applications:
1. Different types of medical diagnoses (internal medicine,
pulmonary diseases, infectious blood disease, and so on).
2. Diagnosis of complex electronic and electrochemical
systems.
3. Diagnosis of diesel electric locomotion systems.
4. Diagnosis of software development projects.
5. Forecasting crop damage.
6. Location of faults in computer and communication systems.
Fig. An expert system model (components: problem domain, expert
and systems analyst, development engine, knowledge base,
inference engine, user interface, user).
Rule-Based System Architectures: The most common
form of architecture used in expert and other types of
knowledge-based systems is the production system, also
called the rule-based system. This type of system uses
knowledge encoded in the form of production rules, that is,
if-then rules.
IF: Condition-1 and Condition-2 and Condition-3
THEN: Take Action-4
IF: The temperature is greater than 200 degrees, and
the water level is low
THEN: Open the safety valve.
A & B & C & D → E & F
Each rule represents a small chunk of knowledge
relating to the given domain of expertise. A number of related
rules may collectively correspond to a chain of inferences which
leads from some initially known facts to some useful conclusions.
When the known facts support the conditions of a rule, the
conclusion or action part of the rule is then accepted as known
(or at least known with some degree of certainty).
Inference in production systems is
accomplished by a process of chaining through the rules
recursively, either in a forward or backward direction, until a
conclusion is reached or until failure occurs. The selection of
rules used in the chaining process is determined by matching
current facts against the domain knowledge or variables in rules
and choosing among a candidate set of rules the ones that meet
some given criteria, such as specificity. The inference process is
typically carried out in an interactive mode with the user
providing input parameters needed to complete the chaining
process.
Fig. Components of a typical expert system (user, I/O interface
with input and output, explanation module, inference engine,
editor, case history file, working memory, knowledge base,
learning module).

The Knowledge Base: The Knowledge base contains facts and
rules about some specialized knowledge domain.
The Inference Process: The inference engine accepts user input
queries and responses to questions through the I/O interface and
uses this dynamic information together with the static
knowledge (the rules and facts) stored in the knowledge base.
The knowledge in the knowledge base is used to derive
conclusions about the current case or situation as presented by
the user's input.
During the match stage, the contents of working
memory are compared to facts and rules contained in the
knowledge base. When consistent matches are found, the
corresponding rules are placed in a conflict set. To find an
appropriate and consistent match, substitutions (instantiations)
may be required. Once all the matched rules have been added to
the conflict set during a given cycle, one of the rules is selected
for execution.
When the left side of a sequence of rules is
instantiated first and the rules are executed from left to right, the
process is called forward chaining. This is also known as
data-driven inference since input data are used to guide the direction
of the inference process. For example, we can chain forward to
show that when a student is encouraged, is healthy, and has
goals, the student will succeed.
ENCOURAGED (student) → MOTIVATED (student)
MOTIVATED (student) & HEALTHY (student) → WORKHARD (student)
WORKHARD (student) & HASGOALS (student) → EXCELL (student)
EXCELL (student) → SUCCEED (student)
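The student example can be run as a small forward-chaining loop in Python. This is a sketch under stated assumptions: the rules are propositional (the single individual, student, is left implicit), conflict resolution is reduced to firing every applicable rule, and the names RULES and forward_chain, as well as the lower-case fact names, are hypothetical.

```python
# Each rule is (set of conditions, conclusion).
RULES = [
    ({"encouraged"}, "motivated"),
    ({"motivated", "healthy"}, "works_hard"),
    ({"works_hard", "has_goals"}, "excels"),
    ({"excels"}, "succeeds"),
]


def forward_chain(facts, rules):
    """Data-driven inference: keep firing rules whose conditions are all
    known until no rule adds a new fact."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


derived = forward_chain({"encouraged", "healthy", "has_goals"}, RULES)
print("succeeds" in derived)
```

If one of the input facts is withheld, for example has_goals, the chain stops at works_hard and the conclusion is not reached.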
On the other hand, when the right side of the rules is
instantiated first, the left-hand conditions become subgoals.
These subgoals may in turn cause sub-subgoals to be
established, and so on until facts are found to match the lowest
subgoal conditions. When this form of inference takes place,
we say that backward chaining is performed. This form of
inference is also known as goal-driven inference since an initial
goal establishes the backward direction of the inferring.
Explanation Module: The Explanation module provides the
user with an explanation of the reasoning process when
requested. This is done in response to a how query or a why
query.
To respond to a how query, the explanation module traces
the chain of rules fired during a consultation with the user. The
sequence of rules that led to the conclusion is then printed for
the user in an easy to understand human-language style. This
permits the user to actually see the reasoning process followed
by the system in arriving at the conclusion. If the user does not
agree with the reasoning steps presented they may be changed
using the editor.
To respond to a why query, the explanation module
must be able to explain why certain information is needed by the
inference engine to complete a step in the reasoning process
before it can proceed. For example, in diagnosing a car that will
not start, a system might be asked why it needs to know the
status of the distributor spark.
Building a knowledge Base: The editor is used by developers
to create new rules for addition to the knowledge base, to delete
outmoded rules, or to modify existing rules in some way.
Some editors run consistency tests on newly created rules. Such
systems also prompt the user for missing information.
The I/O Interface: The input-output interface permits the user
to communicate with the system in a more natural way by
permitting the use of simple selection menus or the use of a
restricted language which is close to a natural language. This
means that the system must have special prompts or a
specialized vocabulary which encompasses the terminology of
the given domain of expertise.
Nonproduction System Architectures: Instead of rules,
these systems employ more structured representation schemes
like associative or semantic networks, frame and rule structures,
and decision trees, or even specialized networks like neural
networks.
Associative or Semantic Network Architectures: We know
that an associative network is a network made up of nodes
connected by directed arcs. The nodes represent objects,
attributes, concepts, or other basic entities, and the arcs, which
are labeled, describe the relationship between the two nodes they
connect. Special network links include the ISA and HASPART
links which designate an object as being a certain type of object
(belonging to a class of objects) and as being a subpart of
another object, respectively.
Associative network representations are especially useful
in depicting hierarchical knowledge structures, where property
inheritance is common. More often, these network
representations are used in natural language or computer vision
systems or in conjunction with some other form of
representation.
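Property inheritance over ISA links can be sketched in a few lines of Python. The network contents (canary, bird, animal and their properties) are hypothetical examples, and the names EDGES and lookup are assumptions made for this illustration.

```python
# node -> {arc label: target node or value}
EDGES = {
    "animal": {"can": "breathe"},
    "bird": {"ISA": "animal", "HASPART": "wings"},
    "canary": {"ISA": "bird", "color": "yellow"},
}


def lookup(node, relation):
    """Follow ISA links upward until some node supplies the relation."""
    while node is not None:
        links = EDGES.get(node, {})
        if relation in links:
            return links[relation]
        node = links.get("ISA")  # inherit from the more general class
    return None


print(lookup("canary", "color"))    # stored on the node itself
print(lookup("canary", "HASPART"))  # inherited from bird
print(lookup("canary", "can"))      # inherited from animal
```

Storing a property once at the most general node that has it, and letting more specific nodes inherit it, is what makes the representation compact for hierarchical knowledge.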
Frame Architectures: Frames are structured sets of closely
related knowledge, such as an object or concept name, the
object's main attributes and their corresponding values, and
possibly some attached procedures (if-needed, if-added,
if-removed procedures). The attributes, values, and procedures are
stored in specified slots and slot facets of the frame. Individual
frames are usually linked together as a network, much like the
nodes in an associative network.
Decision Tree Architectures: Knowledge for expert systems
may be stored in the form of a decision tree when the knowledge
can be structured in a top-to-bottom manner. For example, the
identification of objects (equipment, faults, physical objects,
diseases) can be made through a decision tree structure. Initial
and intermediate nodes in the tree correspond to object
attributes, and terminal nodes correspond to the identities of
objects. Attribute values for an object determine a path to a leaf
node in the tree which contains the object's identification. Each
object attribute corresponds to a nonterminal node in the tree and
each branch of the decision tree corresponds to an attribute value
or set of values.
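A decision tree of this kind can be represented and traversed with a short Python sketch. The fault-finding tree below is a made-up example, and the names TREE and identify are hypothetical.

```python
# Nonterminal nodes are dicts naming an attribute and its branches;
# terminal nodes are plain strings giving the identification.
TREE = {
    "attribute": "engine_cranks",
    "branches": {
        "no": "dead battery",
        "yes": {
            "attribute": "fuel_gauge",
            "branches": {"empty": "out of fuel", "ok": "ignition fault"},
        },
    },
}


def identify(tree, observed):
    """Follow the branch matching each observed attribute value to a leaf."""
    node = tree
    while isinstance(node, dict):
        value = observed[node["attribute"]]
        node = node["branches"][value]
    return node


print(identify(TREE, {"engine_cranks": "yes", "fuel_gauge": "empty"}))
print(identify(TREE, {"engine_cranks": "no"}))
```

Only the attributes along the chosen path are consulted, so the second query succeeds without a value for fuel_gauge.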
Blackboard system Architectures: Blackboard architectures
refer to a special type of knowledge-based system which uses a
form of opportunistic reasoning. This differs from pure forward
or pure backward chaining in production systems in that either
direction may be chosen dynamically at each stage in the
problem solution process.
Blackboard systems are composed of three functional
components, as depicted in fig.
1. There are a number of knowledge sources which are
separate and independent sets of coded knowledge. Each
knowledge source may be thought of as a specialist in
some limited area needed to solve a given subset of
problems; the sources may contain knowledge in the form
of procedures, rules, or other schemes.
2. A globally accessible data base structure, called a
blackboard, contains the current problem state and
information needed by the knowledge sources (input data,
partial solutions, control data, alternatives, and final
solutions). The knowledge sources make changes to the
blackboard data that incrementally lead to a solution.
Communication and interaction between the knowledge
sources takes place solely through the blackboard.
3. Control information may be contained within the sources,
on the blackboard, or possibly in a separate module. The
control knowledge monitors the changes to the blackboard
and determines what the immediate focus of attention
should be in solving the problem.
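The interplay of the three components can be illustrated with a toy Python sketch. It is a deliberately small assumption-laden example: the blackboard is a dictionary, the two knowledge sources (parse_numbers and add_up) and the task of summing numbers in a text are hypothetical, and the control component simply lets the first applicable source act in each cycle.

```python
def parse_numbers(blackboard):
    """Knowledge source: turns raw text into a list of numbers."""
    if "text" in blackboard and "numbers" not in blackboard:
        blackboard["numbers"] = [int(token) for token in blackboard["text"].split()]
        return True
    return False


def add_up(blackboard):
    """Knowledge source: turns a list of numbers into a total."""
    if "numbers" in blackboard and "total" not in blackboard:
        blackboard["total"] = sum(blackboard["numbers"])
        return True
    return False


def control(blackboard, sources):
    """Control: in each cycle, let the first source that can contribute act;
    stop when a solution is posted or no source can make progress."""
    progress = True
    while progress and "total" not in blackboard:
        progress = any(source(blackboard) for source in sources)
    return blackboard


result = control({"text": "1 2 3"}, [add_up, parse_numbers])
print(result["total"])
```

The sources never call each other; each one only reads and writes the blackboard, so the order in which they are listed does not affect the outcome here.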
Analogical Reasoning Architectures: Expert systems based
on analogical architectures solve new problems as humans do, by
finding a similar problem whose solution is known and applying
the known solution to the new problem, possibly with some
modifications. For example, if we know a method of proving that
the product of two even integers is even, we can successfully
prove that the product of two odd integers is odd through much
the same proof steps. Expert systems using analogical
architectures will require a large knowledge base having
numerous problem solutions and other previously encountered
situations or episodes.
Fig. Components of blackboard systems (blackboard, knowledge
sources, control information).
Neural Network Architectures: Neural networks are large
networks of simple processing elements or nodes which process
information dynamically in response to external inputs. The
nodes are simplified models of neurons. The knowledge in a
neural network is distributed throughout the network in the form
of internode connections and weighted links which form the
inputs to the nodes. The link weights serve to enhance or inhibit
the input stimuli values, which are then added together at the
nodes. If the sum of all the inputs to a node exceeds some
threshold value T, the node executes and produces an output
which is passed on to other nodes or is used to produce some
output response.
Neural networks were originally inspired by the human nervous
system, although they are greatly simplified models of it.
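The threshold rule for a single node can be sketched as follows. The function name node_output and the particular weights and thresholds are assumptions chosen for illustration; the output is taken to be 1 when the node executes and 0 otherwise.

```python
from typing import Sequence


def node_output(inputs: Sequence[float], weights: Sequence[float],
                threshold: float) -> int:
    """Return 1 if the weighted sum of the inputs exceeds the threshold T."""
    if len(inputs) != len(weights):
        raise ValueError("each input needs exactly one link weight")
    # Positive weights enhance an input, negative weights inhibit it.
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0


# With both weights 1.0 and T = 1.5 the node behaves like a logical AND.
print(node_output([1, 1], [1.0, 1.0], 1.5))   # both inputs active
print(node_output([1, 0], [1.0, 1.0], 1.5))   # only one input active

# A negative weight lets the second input inhibit the first.
print(node_output([1, 1], [1.0, -1.0], 0.5))
```

The knowledge of such a node lies entirely in its weights and threshold, which is what is meant by knowledge being distributed over the links.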
Knowledge acquisition: Knowledge for expert systems must
be derived from expert sources such as experts in the given field,
journal articles, texts, reports, databases, and so on. Elicitation
of the right knowledge can take several person-years and cost
hundreds of thousands of dollars. This process is now
recognized as one of the main bottlenecks in building expert and
other knowledge-based systems. Consequently, much effort has
been devoted to more effective methods of acquisition and
coding.
Pulling together and correctly interpreting the right
knowledge to solve a set of complex tasks is an onerous job.
Typically, experts do not know what specific knowledge is
being applied, nor just how it is applied, in the solution of
a given problem. Even if they do know, it is likely they are
unable to articulate the problem solving process well enough to
capture the low-level knowledge used and the inferring
processes applied. This difficulty has led to the use of AI
experts (called knowledge engineers) who serve as
intermediaries between the domain expert and the system. The
knowledge engineer elicits information from the experts and
codes this knowledge into a form suitable for use in the expert
system.
The knowledge elicitation process is depicted in the figure
below. To elicit the requisite knowledge, a knowledge engineer
conducts extensive interviews with domain experts. During the
interviews, the expert is asked to solve typical problems in the
domain of interest and to explain his or her solutions.
Fig. The knowledge acquisition process (domain expert, knowledge
engineer, system editor, knowledge base).

Using the knowledge gained from the experts and other
sources, the knowledge engineer codes the knowledge in the
form of rules or some other representation scheme. This
knowledge is then used to solve sample problems for review.
Errors and omissions are uncovered and corrected, and
additional knowledge is added as needed. The process is
repeated until a sufficient body of knowledge has been collected
to solve a large class of problems in the chosen domain. The
whole process may take as many as tens of person-years.
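The code-test-correct cycle just described can be illustrated with a small sketch. Everything in it is hypothetical: the car-fault rules, the sample problems, and the function names forward_chain and review are invented for the example, and simple forward chaining stands in for whatever inference method a real system would use.

```python
from typing import FrozenSet, List, Set, Tuple

Rule = Tuple[FrozenSet[str], str]      # (conditions, conclusion)
Sample = Tuple[Set[str], str]          # (given facts, expected conclusion)


def forward_chain(rules: List[Rule], facts: Set[str]) -> Set[str]:
    """Apply rules repeatedly until no new conclusion can be added."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


def review(rules: List[Rule], samples: List[Sample]) -> List[Sample]:
    """Return the sample problems the current rule base fails to solve."""
    return [(facts, expected) for facts, expected in samples
            if expected not in forward_chain(rules, facts)]


# First version of the rule base, as coded after the interviews.
rules: List[Rule] = [
    (frozenset({"engine_cranks", "no_spark"}), "ignition_fault"),
    (frozenset({"ignition_fault"}), "check_spark_plugs"),
]
samples: List[Sample] = [
    ({"engine_cranks", "no_spark"}, "check_spark_plugs"),
    ({"engine_silent"}, "check_battery"),
]

print(review(rules, samples))          # reveals one omission

# The omission is corrected by adding a rule, and the review is repeated.
rules.append((frozenset({"engine_silent"}), "check_battery"))
print(review(rules, samples))
```

Each pass of review plays the role of the sample-problem review in the text: failures point to missing or wrong rules, which are then fixed before the next pass.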
Lisp: Lisp is one of the oldest computer programming
languages. It was invented by John McCarthy during the late
1950s, shortly after the development of FORTRAN. Lisp is
particularly suited for AI programs because of its ability to
process symbolic information effectively. It is a language with a
simple syntax, little or no explicit data typing, and dynamic
memory management.
Lisp has become the language of choice for most
AI practitioners. It was practically unheard of outside the research
community until AI began to gain some popularity ten to fifteen
years ago. Since then, special LISP processing machines have
been built and its popularity has spread to many new sectors of
business and government.
The basic building blocks of Lisp are the atom, the list, and
the string. An atom is a number or a string of contiguous
characters, including numbers and special characters. A list is a
sequence of atoms and/or other lists enclosed within
parentheses. A string is a group of characters enclosed in double
quotation marks. For example, 42 and foo are atoms, (a (b c) 7)
is a list, and "hello" is a string.
Lisp programs run either on an interpreter or as
compiled code. The interpreter examines source programs in a
repeated loop, called the read-evaluate-print loop. This loop
reads the program code, evaluates it, and prints the values
returned by the program. The interpreter signals its readiness to
accept code for execution by printing a prompt such as the ->
symbol.
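The read-evaluate-print loop can be sketched in Python for a tiny subset of Lisp-like prefix arithmetic. This is an illustration under stated assumptions, not how a real Lisp interpreter is built: only integers and the operators +, * and - are supported, the function names tokenize, read, evaluate and repl are invented here, and the loop takes its input from a list of strings instead of the keyboard.

```python
import math
from typing import List, Union

Expr = Union[int, str, list]   # integer atom, symbol atom, or nested list


def tokenize(source: str) -> List[str]:
    """Split source text into parentheses and atoms."""
    return source.replace("(", " ( ").replace(")", " ) ").split()


def read(tokens: List[str]) -> Expr:
    """Read: consume the tokens of one expression, returning nested lists."""
    token = tokens.pop(0)
    if token == "(":
        items = []
        while tokens[0] != ")":
            items.append(read(tokens))
        tokens.pop(0)          # discard the closing ")"
        return items
    try:
        return int(token)      # numeric atom
    except ValueError:
        return token           # symbolic atom


OPERATORS = {
    "+": lambda *args: sum(args),
    "*": lambda *args: math.prod(args),
    "-": lambda a, b: a - b,
}


def evaluate(expr: Expr) -> int:
    """Evaluate: numbers are themselves; a list applies its first element."""
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        raise NameError(f"unbound symbol: {expr}")
    operator, *arguments = expr
    return OPERATORS[operator](*[evaluate(arg) for arg in arguments])


def repl(lines: List[str]) -> List[int]:
    """Loop: read each line, evaluate it, and print the resulting value."""
    values = []
    for line in lines:
        value = evaluate(read(tokenize(line)))
        print("->", line)
        print(value)
        values.append(value)
    return values


repl(["(+ 1 2)", "(* 2 (+ 3 4))", "(- 10 4)"])
```

The read step also shows the atom and list building blocks at work: the text (a (b c) 7) is read into a three-element list whose second element is itself a list.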
Examples
Here are examples of Common Lisp code.
The basic "Hello world" program:
(print "Hello world")
As the reader may have noticed from the above discussion, Lisp
syntax lends itself naturally to recursion. Mathematical problems
such as the enumeration of recursively defined sets are simple to
express in this notation.
Evaluate a number's factorial:
(defun factorial (n)
  (if (<= n 1)
      1
      (* n (factorial (- n 1)))))
An alternative implementation, often faster than the previous
version if the Lisp system has tail recursion optimization:
(defun factorial (n &optional (acc 1))
  (if (<= n 1)
      acc
      (factorial (- n 1) (* acc n))))
Contrast with an iterative version which uses Common Lisp's
loop macro:
(defun factorial (n)
  (loop for i from 1 to n
        for fac = 1 then (* fac i)
        finally (return fac)))
The following function reverses a list. (Lisp's built-in reverse
function does the same thing.)
(defun -reverse (list)
  (let ((return-value '()))
    (dolist (e list) (push e return-value))
    return-value))
Prolog: Prolog is a general purpose logic programming
language associated with artificial intelligence and
computational linguistics.
Prolog has its roots in formal logic, and unlike many other
programming languages, Prolog is declarative: the program
logic is expressed in terms of relations, represented as facts and
rules. A computation is initiated by running a query over these
relations.[4]
The language was first conceived by a group around Alain
Colmerauer in Marseille, France, in the early 1970s and the first
Prolog system was developed in 1972 by Colmerauer with
Philippe Roussel.[5][6]
Prolog was one of the first logic programming languages,[7] and
remains among the most popular such languages today, with
many free and commercial implementations available. While
initially aimed at natural language processing, the language has
since then stretched far into other areas like theorem proving,[8]
expert systems,[9] games, automated answering systems,
ontologies and sophisticated control systems. Modern Prolog
environments support creating graphical user interfaces, as well
as administrative and networked applications.
Syntax and semantics. In Prolog, program logic is expressed in
terms of relations, and a computation is initiated by running a
query over these relations. Relations and queries are constructed
using Prolog's single data type, the term. Relations are defined
by clauses. Given a query, the Prolog engine attempts to find a
resolution refutation of the negated query. If the negated query
can be refuted, i.e., an instantiation for all free variables is found
that makes the union of clauses and the singleton set consisting
of the negated query false, it follows that the original query, with
the found instantiation applied, is a logical consequence of the
program. This makes Prolog (and other logic programming
languages) particularly useful for database, symbolic
mathematics, and language parsing applications. Because Prolog
allows impure predicates, checking the truth value of certain
special predicates may have some deliberate side effect, such as
printing a value to the screen. Because of this, the programmer
is permitted to use some amount of conventional imperative
programming when the logical paradigm is inconvenient. It has
a purely logical subset, called "pure Prolog", as well as a
number of extralogical features.
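The idea of answering a query by finding an instantiation of its variables can be illustrated with a much simplified Python sketch. It is not how a Prolog engine works: it only matches a single query against ground facts (a special case of unification), with no rules, no resolution and no backtracking over conjunctions. The facts and the names is_variable, match and query are assumptions made for the example, and the underscore is treated like any other named variable.

```python
from typing import Dict, List, Optional, Tuple

Term = Tuple[str, ...]   # ("likes", "john", "apples") stands for likes(john, apples)

FACTS: List[Term] = [
    ("likes", "john", "apples"),
    ("likes", "mary", "apples"),
    ("likes", "john", "pears"),
]


def is_variable(name: str) -> bool:
    """Prolog convention: variables start with an upper-case letter or '_'."""
    return name[:1].isupper() or name.startswith("_")


def match(pattern: Term, fact: Term) -> Optional[Dict[str, str]]:
    """Return variable bindings that make pattern equal to fact, or None."""
    if len(pattern) != len(fact):
        return None
    bindings: Dict[str, str] = {}
    for wanted, actual in zip(pattern, fact):
        if is_variable(wanted):
            # A variable seen before must keep the value it was given.
            if bindings.get(wanted, actual) != actual:
                return None
            bindings[wanted] = actual
        elif wanted != actual:
            return None
    return bindings


def query(pattern: Term, facts: List[Term]) -> List[Dict[str, str]]:
    """Collect one set of bindings for every fact that the pattern matches."""
    answers = []
    for fact in facts:
        bindings = match(pattern, fact)
        if bindings is not None:
            answers.append(bindings)
    return answers


print(query(("likes", "X", "apples"), FACTS))    # who likes apples?
print(query(("likes", "mary", "pears"), FACTS))  # no matching fact
```

An empty answer list corresponds to a failed query, while an answer with empty bindings corresponds to a query without variables that succeeds.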
Data types
Prolog's single data type is the term. Terms are either atoms,
numbers, variables or compound terms.
An atom is a general-purpose name with no inherent
meaning. Examples of atoms include x, blue, 'Taco', and
'some atom'.
Numbers can be floats or integers.
Variables are denoted by a string consisting of letters,
numbers and underscore characters, and beginning with an
upper-case letter or underscore. Variables closely resemble
variables in logic in that they are placeholders for arbitrary
terms.
A compound term is composed of an atom called a
"functor" and a number of "arguments", which are again
terms. Compound terms are ordinarily written as a functor
followed by a comma-separated list of argument terms,
which is contained in parentheses. The number of
arguments is called the term's arity. An atom can be
regarded as a compound term with arity zero. Examples of
compound terms are truck_year('Mazda', 1986) and
'Person_Friends'(zelda,[tom,jim]).
Special cases of compound terms:
Lists: A list is an ordered collection of terms. It is denoted by
square brackets with the terms separated by commas or, in
the case of the empty list, []. For example, [1,2,3] or
[red,green,blue].
Strings: A sequence of characters surrounded by quotes is
equivalent to a list of (numeric) character codes, generally
in the local character encoding, or Unicode if the system
supports Unicode. For example, "to be, or not to be".
Examples
Here follow some example programs written in Prolog.
Hello world
An example of a query:
?- write('Hello world!'), nl.
Hello world!
true.
Compiler optimization
Any computation can be expressed declaratively as a sequence
of state transitions. As an example, an optimizing compiler with
three optimization passes could be implemented as a relation
between an initial program and its optimized form:
program_optimized(Prog0, Prog) :-
    optimization_pass_1(Prog0, Prog1),
    optimization_pass_2(Prog1, Prog2),
    optimization_pass_3(Prog2, Prog).
or equivalently using DCG notation:
program_optimized --> optimization_pass_1,
    optimization_pass_2, optimization_pass_3.
Quicksort
The Quicksort sorting algorithm, relating a list to its sorted
version:
partition([], _, [], []).
partition([X|Xs], Pivot, Smalls, Bigs) :-
    (   X @< Pivot ->
        Smalls = [X|Rest],
        partition(Xs, Pivot, Rest, Bigs)
    ;   Bigs = [X|Rest],
        partition(Xs, Pivot, Smalls, Rest)
    ).
quicksort([]) --> [].
quicksort([X|Xs]) -->
    { partition(Xs, X, Smaller, Bigger) },
    quicksort(Smaller), [X], quicksort(Bigger).
Dynamic programming
The following Prolog program uses dynamic programming to
find the longest common subsequence of two lists in polynomial
time. The clause database is used for memoization:
:- dynamic(stored/1).
memo(Goal) :- ( stored(Goal) -> true ; Goal, assertz(stored(Goal)) ).
lcs([], _, []) :- !.
lcs(_, [], []) :- !.
lcs([X|Xs], [X|Ys], [X|Ls]) :- !, memo(lcs(Xs, Ys, Ls)).
lcs([X|Xs], [Y|Ys], Ls) :-
    memo(lcs([X|Xs], Ys, Ls1)), memo(lcs(Xs, [Y|Ys], Ls2)),
    length(Ls1, L1), length(Ls2, L2),
    (   L1 >= L2 -> Ls = Ls1 ; Ls = Ls2 ).
Example query:
?- lcs([x,m,j,y,a,u,z], [m,z,j,a,w,x,u], Ls).
Ls = [m, j, a, u]
Design patterns
A design pattern is a general reusable solution to a commonly
occurring problem in software design. In Prolog, design patterns
go under various names: skeletons and techniques, clichés,
program schemata, and logic description schemata. An
alternative to design patterns is higher order programming.
Higher-order programming
By definition, first-order logic does not allow quantification
over predicates. A higher-order predicate is a predicate that takes
one or more other predicates as arguments. Prolog already has
some built-in higher-order predicates such as call/1, findall/3,
setof/3, and bagof/3.[16] Furthermore, since arbitrary Prolog
goals can be constructed and evaluated at run-time, it is easy to
write higher-order predicates like maplist/2, which applies an
arbitrary predicate to each member of a given list, and sublist/3,
which filters elements that satisfy a given predicate, also
allowing for currying.[15]
To convert solutions from temporal representation (answer
substitutions on backtracking) to spatial representation (terms),
Prolog has various all-solutions predicates that collect all answer
substitutions of a given query in a list. This can be used for list
comprehension. For example, perfect numbers equal the sum of
their proper divisors:
perfect(N) :-
    between(1, inf, N), U is N // 2,
    findall(D, (between(1,U,D), N mod D =:= 0), Ds),
    sumlist(Ds, N).
This can be used to enumerate perfect numbers, and also to
check whether a number is perfect.
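For readers more familiar with list comprehensions in other languages, the same computation can be sketched in Python. This is only a comparison under assumptions: the function names is_perfect and perfect_numbers are invented here, and, unlike the Prolog predicate, the enumeration needs an explicit upper limit because it does not run lazily over all integers.

```python
from typing import List


def is_perfect(n: int) -> bool:
    """Check whether n equals the sum of its proper divisors."""
    # Plays the role of findall/3: collect every D in 1..N//2 that divides N.
    divisors = [d for d in range(1, n // 2 + 1) if n % d == 0]
    return n > 0 and sum(divisors) == n


def perfect_numbers(limit: int) -> List[int]:
    """Enumerate the perfect numbers from 1 up to and including limit."""
    return [n for n in range(1, limit + 1) if is_perfect(n)]


print(perfect_numbers(500))
```

The Prolog version serves both purposes with one definition, since the same predicate checks a given N and generates candidates when N is unbound; the Python sketch needs two functions for that.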
Mycin: Mycin was an early expert system developed over five
or six years in the early 1970s at Stanford University. It was
written in Lisp as the doctoral dissertation of Edward Shortliffe
under the direction of Bruce Buchanan, Stanley N. Cohen and
others. It arose in the laboratory that had created the earlier
Dendral expert system, but emphasized the use of judgmental
rules that had elements of uncertainty (known as certainty
factors) associated with them. This expert system was designed
to identify bacteria causing severe infections, such as bacteremia
and meningitis, and to recommend antibiotics, with the dosage
adjusted for the patient's body weight. The name derives from
the antibiotics themselves, as many antibiotics have the suffix
"-mycin". The Mycin system was also used for the diagnosis of
blood clotting diseases.
Method
MYCIN operated using a fairly simple inference engine, and a
knowledge base of ~600 rules. It would query the physician
running the program via a long series of simple yes/no or textual
questions. At the end, it provided a list of possible culprit
bacteria ranked from high to low based on the probability of
each diagnosis, its confidence in each diagnosis' probability, the
reasoning behind each diagnosis (that is, MYCIN would also list
the questions and rules which led it to rank a diagnosis a
particular way), and its recommended course of drug treatment.
Despite MYCIN's success, it sparked debate about the use of its
ad hoc, but principled, uncertainty framework known as
"certainty factors". The developers performed studies showing
that MYCIN's performance was minimally affected by
perturbations in the uncertainty metrics associated with
individual rules, suggesting that the power in the system was
related more to its knowledge representation and reasoning
scheme than to the details of its numerical uncertainty model.
Some observers felt that it should have been possible to use
classical Bayesian statistics. MYCIN's developers argued that
this would require either unrealistic assumptions of probabilistic
independence, or require the experts to provide estimates for an
unfeasibly large number of conditional probabilities.[1][2]
Subsequent studies showed that the certainty factor model
could indeed be interpreted in a probabilistic sense, and
highlighted problems with the implied assumptions of such a
model. However, the modular structure of the system would
prove very successful, leading to the development of graphical
models such as Bayesian networks.
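The text does not give the arithmetic behind certainty factors, so the following is a sketch under stated assumptions. It uses the combination rule as it is commonly described for MYCIN-style systems, with factors ranging from -1 (certainly false) to 1 (certainly true); the mixed-sign case follows the form usually attributed to the later EMYCIN work. The function names and the organism labels are hypothetical.

```python
from functools import reduce
from typing import Dict, List, Tuple


def combine_cf(cf1: float, cf2: float) -> float:
    """Combine two certainty factors in [-1, 1] bearing on one hypothesis."""
    if cf1 >= 0 and cf2 >= 0:
        return cf1 + cf2 * (1 - cf1)      # two supporting pieces of evidence
    if cf1 < 0 and cf2 < 0:
        return cf1 + cf2 * (1 + cf1)      # two opposing pieces of evidence
    denominator = 1 - min(abs(cf1), abs(cf2))
    if denominator == 0:
        raise ValueError("cannot combine total belief with total disbelief")
    return (cf1 + cf2) / denominator      # conflicting evidence


def rank(evidence: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Rank hypotheses from high to low by their combined certainty factor."""
    combined = {name: reduce(combine_cf, cfs, 0.0)
                for name, cfs in evidence.items()}
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rule conclusions for three candidate organisms.
evidence = {
    "organism_a": [0.6, 0.4],
    "organism_b": [0.7],
    "organism_c": [0.5, -0.3],
}
for name, cf in rank(evidence):
    print(name, round(cf, 3))

# A small perturbation of the rule weights leaves the ranking unchanged.
perturbed = dict(evidence, organism_a=[0.55, 0.45])
print([name for name, _ in rank(perturbed)])
```

The last lines give a feel for the robustness finding mentioned above: moderate changes to individual certainty factors often leave the order of the hypotheses intact, although this toy example proves nothing about MYCIN itself.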
Results
Research conducted at the Stanford Medical School found
MYCIN to propose an acceptable therapy in about 69% of cases,
which was better than the performance of infectious disease
experts who were judged using the same criteria. This study is
often cited as showing the potential for disagreement about
therapeutic decisions, even among experts, when there is no
"gold standard" for correct treatment.
Practical use
MYCIN was never actually used in practice. This wasn't because
of any weakness in its performance. As mentioned, in tests it
outperformed members of the Stanford medical school faculty.
Some observers raised ethical and legal issues related to the use
of computers in medicine: if a program gives the wrong
diagnosis or recommends the wrong therapy, who should be held
responsible? However, the greatest problem, and the reason that
MYCIN was not used in routine practice, was the state of
technologies for system integration, especially at the time it was
developed. MYCIN was a stand-alone system that required a
user to enter all relevant information about a patient by typing in
response to questions that MYCIN would pose. The program ran
on a large time-shared system, available over the early Internet
(Arpanet), before personal computers were developed. In the
modern era, such a system would be integrated with medical
record systems, would extract answers to questions from patient
databases, and would be much less dependent on physician entry
of information. In the 1970s, a session with MYCIN could easily
consume 30 minutes or more, an unrealistic time commitment
for a busy clinician.
A difficulty that rose to prominence during the development of
MYCIN and subsequent complex expert systems has been the
extraction of the necessary knowledge from human experts in
the relevant fields into the rule base for the inference engine
to use (so-called knowledge engineering).