
Handout No. 7
Phil.015
April 12, 2016
ELEMENTS OF DECISION THEORY
The elements of decision theory are similar to those of the theory of games in
that decision theory may be considered as the theory of a two-person game, in which
nature takes the role of one of the players, and the other player is of course the
statistician. Because the problems are simpler, it is useful to start with decisions in
the absence of data. Often certain decision problems involving data can cleverly be
converted into no-data problems.
I. NO-DATA DECISION PROBLEMS
The basic conceptual ingredients of a no-data decision model are the same as
those of a zero-sum two-person game-theoretic model, namely:
(i) A non-empty set $\Theta = \{\theta_1, \theta_2, \ldots\}$ called the state space (sometimes
referred to as the parameter space), of possible states of nature.
(ii) A non-empty set $A = \{a_1, a_2, \ldots\}$ called the action space, of actions
available to the decision-maker. And
(iii) a loss function $\lambda : \Theta \times A \to \mathbb{R}$ that assigns a real number $\lambda(\theta, a)$
to each pair $(\theta, a)$, specified by the state of nature $\theta$ and action $a$. The
value $\lambda(\theta, a)$ represents the loss incurred when the decision-maker takes
action $a$ and nature is in state $\theta$. Although $\lambda(\theta, a)$ can even be a
negative number, representing gain or payoff, statisticians usually think of
$\lambda(\theta, a)$ conservatively as a loss and take 0 to be the smallest loss (i.e., no
loss at all). That is to say, under this convention $\lambda(\theta, a)$ is always a
non-negative real number.
Simply, nature chooses a point $\theta$ in $\Theta$, and the statistician (without being
informed of the choice nature has made) chooses an action $a$ in $A$. As a
consequence of these two choices, the statistician (decision-maker) loses an
amount $\lambda(\theta, a)$.
In a mathematical sense,[1] the triple $\langle \Theta, A, \lambda \rangle$ with the specifications itemized
above is called a no-data decision model (or a two-person game).
Here are some examples:
[1] An essential part of understanding how a mathematical method works in decision theory is being
able to view it somewhat abstractly as a mathematical phenomenon. This approach helps to see
how an abstract idea might work in areas other than the one we may be interested in at the moment.

1. Consider the decision problem faced by Alma: whether or not to hold a
garage sale the following Saturday. To announce the sale, she would have
to pay $30 to place a notice in the local newspaper. If the day is sunny,
she expects to receive about $200 from her customers; the net return after
deducting the cost of the ad is $170. If it rains, however, no sale is likely to
occur, and Alma loses the $30 spent on the newspaper ad. Her default action
is to hold no sale, for which she neither gains nor loses anything, come
rain or shine. The obvious no-data decision model $\langle \Theta, A, \lambda \rangle$ is specified by
setting:
(a) $\Theta = \{\theta_1, \theta_2\}$, where $\theta_1$ = sunny, $\theta_2$ = rainy. (This set belongs to
nature, viewed as player I.)
(b) $A = \{a_1, a_2\}$, where $a_1$ = no sale, $a_2$ = sale. (These are the
possible choices of player II, the decision-maker.)
(c) The loss function (in dollars) $\lambda : \Theta \times A \to \mathbb{R}$ is given by the loss
table:

            $\theta_1$    $\theta_2$
  $a_1$        170           170
  $a_2$          0           200
Thus $\lambda(\theta_1, a_1) = 170$, $\lambda(\theta_2, a_2) = 200$, etc. Note that the real-valued
payoff or utility function (gain in dollars) $U : \Theta \times A \to \mathbb{R}$ is given
by the table

            $\theta_1$    $\theta_2$
  $a_1$          0             0
  $a_2$        170           -30

The loss function $\lambda$ is obtained from the payoff (utility) function $U$ by
subtracting each payoff from the largest payoff in the table, 170 dollars. We do
this in order to get non-negative real numbers for all losses, so that the
smallest loss is 0. This example is discussed later with additional information
and conceptual enrichment.
Formally, we set
$$\lambda(\theta_i, a) =_{df} U_{\max} - U(\theta_i, a),$$
where $U_{\max}$ is the largest value $U(\theta_j, a')$ over all states $\theta_j$ in $\Theta$ and all
actions $a'$ in $A$.
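The conversion from payoffs to losses can be sketched in a few lines of Python. This is only an illustration: the function name `loss_from_utility` and the dictionary layout are assumptions made here, and the reference payoff is taken to be the largest entry of the whole payoff table (170), which is the reading that reproduces the loss table of Example 1.

```python
# Payoff (utility) table of Example 1 in dollars, stored as U[action][state].
# The string keys are illustrative labels, not notation from the handout.
U = {
    "a1": {"sunny": 0, "rainy": 0},      # a1 = no sale
    "a2": {"sunny": 170, "rainy": -30},  # a2 = sale
}


def loss_from_utility(utility):
    """Turn payoffs into non-negative losses: loss = (largest payoff) - payoff."""
    u_max = max(value for row in utility.values() for value in row.values())
    return {
        action: {state: u_max - value for state, value in row.items()}
        for action, row in utility.items()
    }


loss = loss_from_utility(U)
for action, row in loss.items():
    print(action, row)
```

The smallest loss produced this way is always 0, attained at the state-action pair with the largest payoff.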

2. Homemaker Alma Jones can cook spaghetti, hamburger, or steak for dinner. She has
learned from her past experience that if her husband is in a good mood, she can
serve him spaghetti and save some money, but if he is in a bad mood, only a juicy
steak will calm him down and make him bearable. Clearly, there are three states of
nature (Mr. Jones's possible moods) and there are three actions available to Alma,
leading to another finitary decision model $\langle \Theta, A, \lambda \rangle$, where
(a) $\Theta = \{\theta_1, \theta_2, \theta_3\}$, where $\theta_1$ = Mr. Jones is in a good mood, $\theta_2$ =
Mr. Jones is in a normal mood, and $\theta_3$ = Mr. Jones is in a bad mood;
(b) $A = \{a_1, a_2, a_3\}$, where $a_1$ = prepare spaghetti, $a_2$ = prepare hamburger,
and $a_3$ = prepare steak.
(c) The loss function (for food, gasoline, etc.) $\lambda : \Theta \times A \to \mathbb{R}$ is defined
by the loss table

            $\theta_1$    $\theta_2$    $\theta_3$
  $a_1$          0             5            10
  $a_2$          2             3             9
  $a_3$          4             5             6

3. In estimation theory it is standard to specify the pertinent decision model
$\langle \Theta, A, \lambda \rangle$ by simply setting $\Theta = A = \mathbb{R}$ (real line) and $\lambda(\theta, a) = (\theta - a)^2$
(quadratic loss). In particular, suppose a coin has an unknown probability $\theta$
(state or propensity) in $\Theta = [0, 1]$ (the closed unit interval) of coming
up heads. We wish to estimate this probability on the basis of one toss
of the coin. Clearly, each toss-based estimate corresponds to an element
$a$ in $A = [0, 1]$. Here the loss function is given by the quadratic loss
$\lambda(\theta, a) = (\theta - a)^2$.
Given the decision models above, the fundamental problem is: what course of
action should the decision maker (Alma, in the first two examples) take? It is
natural to search for the best course of action, i.e., an action that brings the
smallest loss no matter what the true state of nature. For this we need a
principle (scheme, procedure) that algorithmically specifies the action that is
best according to the principle used. Generally, a principle leads to a partial
ordering $\preceq$ on the set $A$ of available courses of action, and any such ordering
may be considered a principle. Here the relation $a_1 \preceq a_2$ means that action $a_2$
is at least as good as action $a_1$. In brief, a decision principle is modeled by
the pair $\langle A, \preceq \rangle$. As we shall see, we may (for example) define the ordering
$a_1 \preceq a_2$ by the formula $\forall\theta\,[\lambda(\theta, a_2) \le \lambda(\theta, a_1)]$.
Two important elementary principles are basic to the study of decision models:
1. The Minimax Principle:
A distinct type of partial ordering of actions in a decision model is obtained
by ordering the actions according to the worst that could happen to the
decision maker. In other words, an action $a$ is preferred to an action $a'$ if
and only if
$$\max\{\lambda(\theta_1, a), \lambda(\theta_2, a), \ldots, \lambda(\theta_n, a)\} < \max\{\lambda(\theta_1, a'), \lambda(\theta_2, a'), \ldots, \lambda(\theta_n, a')\}.$$
Here, for each action, we consider the losses over the various possible states
of nature and record the largest of them. This provides an ordering among the
possible actions. Now, according to the minimax principle, the decision maker
takes the action $a$ for which the maximum loss
$\max\{\lambda(\theta_1, a), \lambda(\theta_2, a), \ldots, \lambda(\theta_n, a)\}$ is the smallest.
Writing $M(a)$ for the maximum loss of action $a$, it is easy to see that in
Example 1 (garage sale) we have
$$M(a_1) = \max\{\lambda(\theta_1, a_1), \lambda(\theta_2, a_1)\} = 170$$
$$M(a_2) = \max\{\lambda(\theta_1, a_2), \lambda(\theta_2, a_2)\} = 200$$
so that Alma should choose $a_1$ (no sale), because it has the smaller maximum loss.
In the same way, we find that in Example 2 (what food to prepare), according to
the minimax principle the best choice is action $a_3$ (prepare steak). We see that
the minimax principle reflects a pessimistic attitude.
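The minimax computation for both examples can be sketched as follows. The helper name `minimax_action` and the dictionary layout are assumptions of this sketch. The Example 2 table is read here with one row of losses per action, a1 = (0, 5, 10), a2 = (2, 3, 9), a3 = (4, 5, 6) across the states θ1, θ2, θ3, which is the reading under which the minimax choice is a3 as stated above.

```python
def minimax_action(loss):
    """Return (best action, worst-case loss per action).

    `loss` maps each action to a dict {state: loss}. The best action is the
    one whose maximum loss over the states is smallest.
    """
    worst = {action: max(row.values()) for action, row in loss.items()}
    best = min(worst, key=worst.get)
    return best, worst


# Example 1 (garage sale): th1 = sunny, th2 = rainy.
garage_loss = {
    "a1": {"th1": 170, "th2": 170},  # no sale
    "a2": {"th1": 0, "th2": 200},    # sale
}

# Example 2 (dinner): th1 = good, th2 = normal, th3 = bad mood.
# Orientation of the table is an assumption (see the paragraph above).
dinner_loss = {
    "a1": {"th1": 0, "th2": 5, "th3": 10},  # spaghetti
    "a2": {"th1": 2, "th2": 3, "th3": 9},   # hamburger
    "a3": {"th1": 4, "th2": 5, "th3": 6},   # steak
}

for name, table in [("garage sale", garage_loss), ("dinner", dinner_loss)]:
    best, worst = minimax_action(table)
    print(name, "worst-case losses:", worst, "minimax action:", best)
```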
2. The Bayes Minimum Expected Loss Principle:
Bayesians take the point of view that the possible states of nature (parameters)
can be seen as values of a single random variable $\Theta$, so that we have
outcomes (events) $\Theta = \theta_1$, $\Theta = \theta_2$, $\ldots$ with known prior probabilities
$p_0(\theta_1) = P(\Theta = \theta_1)$, $p_0(\theta_2) = P(\Theta = \theta_2)$, $\ldots$. For example, Alma may
believe, based on her past experience, that sunny days are far more frequent
than rainy days, and she may even be able to put a reasonably accurate
percentage value on the occurrence of sunny vs. rainy.
Given a decision model $\langle \Theta, A, \lambda \rangle$ together with a prior probability
distribution $p_0(\theta_1), p_0(\theta_2), p_0(\theta_3), \ldots$ on possible states in $\Theta$ (serving as an
epistemic enrichment of $\Theta$ in the extant decision model, so that now we
use $\langle \Theta, p_0 \rangle$), we define the Bayes loss of action $a_i$ by the expected value
(average loss)

$$B(a_i) =_{df} B_{p_0}(a_i) = \lambda(\theta_1, a_i)\,p_0(\theta_1) + \lambda(\theta_2, a_i)\,p_0(\theta_2) + \cdots + \lambda(\theta_n, a_i)\,p_0(\theta_n).$$

Thus, given a prior distribution $p_0(\theta_i)$ on the states of nature, the loss
incurred for a given action $a$ is now a random variable with expected
value $B(a)$. The Bayes action is then defined to be the action $a^*$ that
minimizes the Bayes loss $B(a)$. Thus, the computation of expected losses
$B(a_1), B(a_2), B(a_3), \ldots$ according to a given prior distribution provides a
means of partially ordering the available actions in $A$, say in the form
$B(a_1) < B(a_2) < B(a_3) < \cdots$. The action that is farthest to the left on
this scale is the most desirable from the Bayesian perspective.
Suppose the prior probability distribution $p_0(\theta) = P(\Theta = \theta)$ on states of
nature in Example 1 (garage sale) is given by the table

  $\theta$          $\theta_1$    $\theta_2$
  $p_0(\theta)$       0.7           0.3

Find the Bayes action that minimizes the average loss. Do the same in
Example 2 (prepare dinner for Mr. Jones), given that the prior probability
distribution on states of nature is given by the table

  $\theta$          $\theta_1$    $\theta_2$    $\theta_3$
  $p_0(\theta)$       0.5           0.3           0.2
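One way to work through this exercise is the short Python sketch below. The helper name `bayes_losses` and the data layout are assumptions of the sketch, and the Example 2 loss table is again read with one row per action, a1 = (0, 5, 10), a2 = (2, 3, 9), a3 = (4, 5, 6), consistent with the minimax result stated earlier.

```python
def bayes_losses(loss, prior):
    """Expected loss B(a) of every action under the prior p0 over states."""
    return {
        action: sum(row[state] * prior[state] for state in prior)
        for action, row in loss.items()
    }


garage_loss = {
    "a1": {"th1": 170, "th2": 170},  # no sale
    "a2": {"th1": 0, "th2": 200},    # sale
}
garage_prior = {"th1": 0.7, "th2": 0.3}

dinner_loss = {  # table orientation is an assumption of this sketch
    "a1": {"th1": 0, "th2": 5, "th3": 10},  # spaghetti
    "a2": {"th1": 2, "th2": 3, "th3": 9},   # hamburger
    "a3": {"th1": 4, "th2": 5, "th3": 6},   # steak
}
dinner_prior = {"th1": 0.5, "th2": 0.3, "th3": 0.2}

garage_B = bayes_losses(garage_loss, garage_prior)
dinner_B = bayes_losses(dinner_loss, dinner_prior)

# The Bayes action minimizes the expected loss.
garage_best = min(garage_B, key=garage_B.get)
dinner_best = min(dinner_B, key=dinner_B.get)
print(garage_B, garage_best)
print(dinner_B, dinner_best)
```

Under these inputs the expected losses are 170 and 60 for the garage sale, and 3.5, 3.7 and 4.7 for the dinner, so the sketch selects a2 (sale) and a1 (spaghetti) respectively. Note that the Bayes and minimax choices differ in both examples.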

Considerably better decisions become available if, in addition to prior knowledge,
there is access to various observation results of a designated parent random
variable $X$, whose values are assumed to depend on the states of nature in the
form of (conditional) probability distributions $p(x \mid \theta) = P(X = x \mid \Theta = \theta)$.
Simply, in an observation setting we think of the states of nature as causes
and the values $x$ of parent variable $X$ as effects, but of course a precisely
formulated cause-effect relationship is only statistical. Unfortunately, the
welcome feature of having additional information from observation data somewhat
complicates the no-data decision models treated above.
II. DATA-BASED DECISION PROBLEMS
To give a correct mathematical structure to the process of information gathering,
we assume that there is a parent random variable $X$ taking its values in
an observation space (sample space) $\mathcal{X}$, whose known probability distribution
$p(x \mid \theta)$ depends on the true state of nature $\theta$. Thus, what we have so far
is a decision model $\langle \Theta, A, \lambda \rangle$ coupled with a random variable $X$ with range $\mathcal{X}$
and known probability distribution $p_X(x \mid \theta)$. For each $\theta$ there is a probability
measure
$$P(X \le x \mid \theta) = \int_{-\infty}^{x} p_X(x' \mid \theta)\, dx',$$
specified by $p_X(x \mid \theta)$.

For example, let $\Theta = \{\frac{3}{4}, \frac{1}{2}\}$ be the set of possible (unknown) probabilities of
getting a head in one toss of a (fair or biased) coin. We think of $\theta_1 = \frac{3}{4}$ as one
of the states a coin would be in if it were considerably biased in favor of heads,
and of course $\theta_2 = \frac{1}{2}$ encodes the state of a fair (balanced) coin. The problem
is that the decision maker does not know the true state of the coin. One
major way to find out is to observe the coin's behavior in a longer sequence of
tosses. Let $X$ be the success random variable, counting the number of heads
that come up in each given sequence of trials, whose distribution is given by the
binomial
$$P(X = x \mid \Theta = \theta_i) = p_X(x \mid \theta_i) = \binom{n}{x}\, \theta_i^{x}\, (1 - \theta_i)^{n - x},$$
where $x = 0, 1, 2, \ldots, n$ are the possible values of the success random variable
$X$, $i = 1, 2$ for the possible states, and natural number $n$ is the sample size.
The pertinent decision space $A = \{a_1, a_2\}$ consists of two decisions: $a_1$ =
the coin is fair and $a_2$ = the coin is biased. The fundamental question here is
what decision rule or strategy should the decision maker use in light of
information obtained from observing the values of $X$.
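The binomial distribution of the head count can be evaluated directly, as in the sketch below. The sample size n = 10 and the function name `binom_pmf` are choices made for this illustration only; they do not come from the handout.

```python
from math import comb


def binom_pmf(x, n, theta):
    """P(X = x | theta): probability of x heads in n independent tosses."""
    return comb(n, x) * theta**x * (1 - theta) ** (n - x)


n = 10                                    # assumed sample size
theta_biased, theta_fair = 3 / 4, 1 / 2   # the two states of the coin

# Distribution of the number of heads under each state of nature.
for x in range(n + 1):
    print(x, round(binom_pmf(x, n, theta_biased), 4),
          round(binom_pmf(x, n, theta_fair), 4))
```

Comparing the two columns shows how a large head count is far more probable under the biased state than under the fair one, which is what makes observing $X$ informative about the state.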
Given a decision rule $d$, on the basis of recording the value $x$ of random variable
$X$ the decision maker chooses action $d(x) = a$ in $A$. What is new here is the
decision function (also known as a decision rule or a strategy) $d : \mathcal{X} \to A$ that
assigns to each outcome $x$ of $X$ in its observation (sample) space $\mathcal{X}$ a unique
action $d(x)$ in $A$. Of course, the decision maker may apply several different
decision functions, some good and some not so good, or even foolish. So,
now the primary problem is no longer the choice of the right action in $A$
but the choice of the right decision function $d$ in the decision function space
$A^{\mathcal{X}}$, comprised of all possible mappings from $\mathcal{X}$ to $A$. Because $X$ is a random
variable, now the loss $\lambda(\theta, d(X))$ becomes a random quantity that has to be
averaged over all possible values of $X$. Note that in the coin example above,
$X$ takes the $n + 1$ values $0, 1, \ldots, n$, so the decision space $A^{\mathcal{X}}$ consists of
$2^{n+1}$ decision functions. Fortunately, since most of them are uninteresting from
the standpoint of decision theory, we shall focus only on a much smaller subset
of so-called admissible decision functions.
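To make the size of the decision function space concrete, the sketch below enumerates every mapping from observations to actions for the coin example. The value n = 3 and the helper name `decision_functions` are assumptions of this illustration. Since the head count $X$ takes the n + 1 values 0, 1, ..., n and there are two actions, the enumeration yields 2^(n+1) rules.

```python
from itertools import product


def decision_functions(observations, actions):
    """List every mapping d: observations -> actions as a dict."""
    observations = list(observations)
    return [
        dict(zip(observations, choice))
        for choice in product(actions, repeat=len(observations))
    ]


n = 3                                   # assumed number of tosses
observations = range(n + 1)             # possible head counts 0..n
actions = ["a1", "a2"]                  # a1 = fair, a2 = biased

rules = decision_functions(observations, actions)
print(len(rules))                       # number of decision functions
print(rules[0])                         # the rule that always answers a1
```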
The first step in setting up a data-based decision model is the introduction of the
risk function $\rho : \Theta \times A^{\mathcal{X}} \to \mathbb{R}$ that assigns to each state of nature $\theta$ in $\Theta$
and to each decision $d : \mathcal{X} \to A$ in $A^{\mathcal{X}}$ the numerical value $\rho(\theta, d)$, interpreted
as the risk incurred by using decision rule $d$, when the true state of nature
is $\theta$. Remember, the loss $\lambda(\theta, d(X))$ is now a random quantity, because it is a
function of $X$. The classical decision-theoretic definition of the risk function
$\rho : \Theta \times A^{\mathcal{X}} \to \mathbb{R}$ (expected value of the random loss) is as follows:

$$\rho(\theta, d) = \lambda(\theta, d(x_1))\,p(x_1 \mid \theta) + \lambda(\theta, d(x_2))\,p(x_2 \mid \theta) + \cdots + \lambda(\theta, d(x_n))\,p(x_n \mid \theta),$$

where $x_1, x_2, \ldots, x_n$ are the possible values of parent random variable $X$, and
$p(x_1 \mid \theta) = P(X = x_1 \mid \Theta = \theta)$ denotes the probability that parent random variable
$X$ takes value $x_1$, given that the state of nature is $\theta$. Likewise for $p(x_2 \mid \theta) =
P(X = x_2 \mid \Theta = \theta)$, etc. Finally, $d(x_1)$ is the action assigned by decision rule $d$
to outcome $x_1$. We interpret $d(x_2), \ldots, d(x_n)$ similarly.
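The risk formula can be illustrated on the coin example with a small sketch. Several ingredients here are hypothetical and not taken from the handout: the sample size of 4 tosses, a 0-1 loss (loss 1 for a wrong verdict, 0 for a correct one), and a threshold rule that declares the coin biased when at least 3 heads appear.

```python
from math import comb

N_TOSSES = 4                 # assumed sample size
FAIR, BIASED = 0.5, 0.75     # the two states of nature


def binom_pmf(x, theta):
    """p(x | theta) for the number of heads in N_TOSSES tosses."""
    return comb(N_TOSSES, x) * theta**x * (1 - theta) ** (N_TOSSES - x)


def zero_one_loss(theta, action):
    """Hypothetical loss: 0 for the correct verdict, 1 otherwise."""
    correct = "a1" if theta == FAIR else "a2"   # a1 = fair, a2 = biased
    return 0.0 if action == correct else 1.0


def threshold_rule(x):
    """Hypothetical decision rule: say 'biased' on 3 or more heads."""
    return "a2" if x >= 3 else "a1"


def risk(theta, rule):
    """rho(theta, d): loss of d(x) averaged over the distribution of X."""
    return sum(
        zero_one_loss(theta, rule(x)) * binom_pmf(x, theta)
        for x in range(N_TOSSES + 1)
    )


risk_fair = risk(FAIR, threshold_rule)       # chance of wrongly saying 'biased'
risk_biased = risk(BIASED, threshold_rule)   # chance of wrongly saying 'fair'
print(risk_fair, risk_biased)
```

With a 0-1 loss the risk at each state is simply the probability of a wrong verdict in that state, which is why the two numbers behave like the two error probabilities of a test.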
Notice that the no-data decision model $\langle \Theta, A, \lambda \rangle$, treated earlier, has been
replaced by a brand-new, data-based decision model $\langle \Theta, A^{\mathcal{X}}, \rho \rangle$, in which the
decision function space $A^{\mathcal{X}}$ has an underlying structure, including the probability
distributions $p(x \mid \theta)$ (one for each $\theta$) for $X$, whose exploitation is the main
objective of data-based decision theory.


We do not yet know which is the best decision function in a data-based decision
model. As in the no-data case, we can use:
1. The minimax principle by calculating the maximum risk
$$M(d) =_{df} \max\{\rho(\theta_1, d), \rho(\theta_2, d), \ldots, \rho(\theta_m, d)\}$$
for all admissible $d$ over the various possible states of nature $\theta_1, \theta_2, \ldots, \theta_m$.
Then choose the decision function $d^*$ that minimizes the maximums
$M(d), M(d'), M(d''), \ldots$ in the model. In this way, a partial ordering $\preceq$
is defined over the decision function space $A^{\mathcal{X}}$, ranking all decision rules
from a (pessimistic) minimax perspective.
2. The Bayes minimum expected value principle assumes that the states of
nature come with prior probability weights $p_0(\theta)$, so that one can determine
the Bayes risk (average risk over the possible states of nature)

$$B(d) =_{df} B_{p_0}(d) = \rho(\theta_1, d)\,p_0(\theta_1) + \rho(\theta_2, d)\,p_0(\theta_2) + \cdots + \rho(\theta_m, d)\,p_0(\theta_m)$$

for using decision function $d$. The preferred decision function $d^*$ is the
one that minimizes the Bayes risk $B(d^*)$. Thus, there is a family of decision
functions $d_1, d_2, d_3, \ldots$ with respective Bayes risks $B(d_1), B(d_2), B(d_3), \ldots$
that have to be calculated. A decision rule $d^*$ is called a Bayes decision
function just in case its Bayes risk $B(d^*)$ is the smallest in the list
$B(d_1), B(d_2), B(d_3), \ldots$.
Let us consider further the examples introduced earlier.
1. Suppose that in addition to the givens in Example 1 (Alma's garage sale),
Alma has two possible forecasts (based on how her neighbors have profited
from their recent garage sales), captured by the values of the parent random
variable $X$: outcome $x_1$ forecasts a great sale and $x_2$ forecasts a weak
sale. The pertinent frequencies in the form of probability distributions
$p(x \mid \theta_1)$ and $p(x \mid \theta_2)$ are given by the table

                 $x_1$    $x_2$
  $\theta_1$      0.7      0.3
  $\theta_2$      0.4      0.6

How many distinct decision functions are there? We have $4 = 2^2$ possible
decision rules, including some patently wrong ones that totally ignore the
data:


            $d_1$    $d_2$    $d_3$    $d_4$
  $x_1$     $a_1$    $a_1$    $a_2$    $a_2$
  $x_2$     $a_1$    $a_2$    $a_1$    $a_2$

Note that decision rule $d_1$ picks action $a_1 = d_1(x_1) = d_1(x_2)$ independently
of what the observation result tells Alma. The only decision rule that
makes good empirical sense is $d_3$. It assigns action $a_2$ (sale) to the forecast
$x_1$ of a great sale and it assigns action $a_1$ (no sale) to the forecast $x_2$
of a weak sale. In applications, bad decision rules are typically
eliminated by the dominance relation.
Recall that a decision function $d$ dominates another decision function $d'$
just in case $\forall\theta\,[\rho(\theta, d) \le \rho(\theta, d')]$. We say that $d$ dominates $d'$ strictly,
or alternatively that decision function $d$ is better than $d'$, just in case
$\forall\theta\,[\rho(\theta, d) \le \rho(\theta, d')]$ and $\exists\theta\,[\rho(\theta, d) < \rho(\theta, d')]$. A decision function $d$
is said to be admissible if and only if $d$ is not dominated strictly by any
other decision function. From now on, we confine our attention to the
subset of admissible decision functions in $A^{\mathcal{X}}$. In view of dominance,
the other decision functions will never be picked. But of course, in general
there are many admissible decision functions in $A^{\mathcal{X}}$, forming a kind of
boundary.
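The dominance check can be carried out mechanically for Alma's four decision rules. The sketch below uses the loss table of Example 1 and the forecast table above; the function names and the dictionary layout are assumptions of this illustration.

```python
from itertools import product

# lambda(theta, a) from Example 1 and p(x | theta) from the forecast table.
LOSS = {"th1": {"a1": 170, "a2": 0}, "th2": {"a1": 170, "a2": 200}}
LIKELIHOOD = {"th1": {"x1": 0.7, "x2": 0.3}, "th2": {"x1": 0.4, "x2": 0.6}}
OBS = ["x1", "x2"]
ACTIONS = ["a1", "a2"]

# d1..d4 in the order of the table: (a1,a1), (a1,a2), (a2,a1), (a2,a2).
RULES = {
    f"d{i + 1}": dict(zip(OBS, choice))
    for i, choice in enumerate(product(ACTIONS, repeat=len(OBS)))
}


def risk(theta, rule):
    """rho(theta, d): expected loss of rule d when the state is theta."""
    return sum(LOSS[theta][rule[x]] * LIKELIHOOD[theta][x] for x in OBS)


RISKS = {name: {th: risk(th, rule) for th in LOSS} for name, rule in RULES.items()}


def strictly_dominates(r, s):
    """r is no worse than s in every state and better in at least one."""
    return all(r[t] <= s[t] for t in r) and any(r[t] < s[t] for t in r)


admissible = [
    d for d in RULES
    if not any(strictly_dominates(RISKS[e], RISKS[d]) for e in RULES if e != d)
]
print(RISKS)
print(admissible)
```

With these tables the rule $d_2$, which acts against the forecast, is strictly dominated by $d_3$, while $d_1$, $d_3$ and $d_4$ remain admissible.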
Getting back to Alma's problem, first we calculate the risks
$\rho(\theta_1, d_2)$, $\rho(\theta_1, d_3)$, $\rho(\theta_2, d_2)$ and $\rho(\theta_2, d_3)$.
Then, given the prior $p_0(\theta_1)$, $p_0(\theta_2)$, we calculate
the Bayes risks $B(d_2)$ and $B(d_3)$. Finally, we choose the decision function
$d^*$ with the smallest Bayes risk.
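These calculations can be sketched as follows for all four rules at once. The prior (0.7, 0.3) is borrowed from the no-data version of Example 1 as an assumption, and the names used in the code are illustrative.

```python
LOSS = {"th1": {"a1": 170, "a2": 0}, "th2": {"a1": 170, "a2": 200}}
LIKELIHOOD = {"th1": {"x1": 0.7, "x2": 0.3}, "th2": {"x1": 0.4, "x2": 0.6}}
PRIOR = {"th1": 0.7, "th2": 0.3}   # assumed: prior from the no-data exercise

RULES = {
    "d1": {"x1": "a1", "x2": "a1"},
    "d2": {"x1": "a1", "x2": "a2"},
    "d3": {"x1": "a2", "x2": "a1"},
    "d4": {"x1": "a2", "x2": "a2"},
}


def risk(theta, rule):
    """rho(theta, d) = sum over x of lambda(theta, d(x)) * p(x | theta)."""
    return sum(LOSS[theta][rule[x]] * p for x, p in LIKELIHOOD[theta].items())


def bayes_risk(rule):
    """B(d): risk averaged over the states with the prior weights."""
    return sum(risk(theta, rule) * PRIOR[theta] for theta in PRIOR)


def max_risk(rule):
    """M(d): largest risk over the states (used by the minimax principle)."""
    return max(risk(theta, rule) for theta in PRIOR)


bayes = {name: bayes_risk(rule) for name, rule in RULES.items()}
worst = {name: max_risk(rule) for name, rule in RULES.items()}
bayes_rule = min(bayes, key=bayes.get)
minimax_rule = min(worst, key=worst.get)
print(bayes, bayes_rule)
print(worst, minimax_rule)
```

Under these inputs $B(d_2) = 139.7$ and $B(d_3) = 90.3$, so $d_3$ is preferred to $d_2$. When all four rules are compared, the smallest Bayes risk in this sketch belongs to $d_4$ (60) and the smallest maximum risk to $d_1$ (170), so the outcome depends noticeably on the prior and on the principle used.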
2. Returning to the Alma Jones example (Example 2, choices in preparing
dinner), suppose Alma lost the afternoon paper that Mr. Jones loves to
read after work. When Mr. Jones returns home, she will have to tell him
that she lost the damn paper. That could make Mr. Jones terribly mad or
maybe not too mad. This has happened many times before. Alma foresees
4 possible responses from Mr. Jones that will help her decide which food to
prepare for dinner. Here the parent random variable $X$ takes four values,
$x_1, x_2, x_3, x_4$, where Mr. Jones's responses are encoded precisely by these
values:
$x_1$ = "Newspapers will get lost."
$x_2$ = "I keep telling you: a place for everything and everything in its place."
$x_3$ = "Why did I ever get married?"
$x_4$ = an absent-minded, far-away look.
Based on past experience, the frequencies $p(x \mid \theta_1)$, $p(x \mid \theta_2)$, $p(x \mid \theta_3)$ of Mr.
Jones's possible responses are given by the table

                 $x_1$    $x_2$    $x_3$    $x_4$
  $\theta_1$      0.5      0.4      0.1      0
  $\theta_2$      0.2      0.5      0.2      0.1
  $\theta_3$      0        0.2      0.5      0.3

Here the first row specifies the probability distribution $p(x \mid \theta_1)$, the second
row gives $p(x \mid \theta_2)$, and the third row defines $p(x \mid \theta_3)$ for $X$.
Evidently, here the risk function for $\theta_1$ and decision rule $d$ is given by
$$\rho(\theta_1, d) = \lambda(\theta_1, d(x_1))\,p(x_1 \mid \theta_1) + \lambda(\theta_1, d(x_2))\,p(x_2 \mid \theta_1) + \cdots + \lambda(\theta_1, d(x_4))\,p(x_4 \mid \theta_1).$$
Now, suppose Alma knows that Mr. Jones is in a good mood 30% of the
time, in a normal mood 50% of the time, and in a bad mood 20% of the
time. The resulting enriched data-based decision model $\langle \Theta, A^{\mathcal{X}}, \rho \rangle$
allows her to calculate the Bayes risks
$$B(d) = \rho(\theta_1, d)\,p_0(\theta_1) + \rho(\theta_2, d)\,p_0(\theta_2) + \rho(\theta_3, d)\,p_0(\theta_3)$$
for all admissible $d$ by substituting the given percentages for $p_0(\theta)$. Alma
can now choose the best strategy, based on her data about Mr. Jones's
behavior, reflecting whether Mr. Jones is in a good, normal or bad mood.
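A brute-force version of this calculation is sketched below. It evaluates the Bayes risk of all 3^4 = 81 decision rules rather than only the admissible ones, which is feasible at this size. The loss table is read with one row per action, a1 = (0, 5, 10), a2 = (2, 3, 9), a3 = (4, 5, 6) across good, normal and bad mood; this orientation and all names in the code are assumptions of the sketch.

```python
from itertools import product

STATES = ["good", "normal", "bad"]
ACTIONS = ["a1", "a2", "a3"]          # spaghetti, hamburger, steak
OBS = ["x1", "x2", "x3", "x4"]        # Mr. Jones's four possible responses

# lambda(theta, a); orientation of the Example 2 table is assumed.
LOSS = {
    "good": {"a1": 0, "a2": 2, "a3": 4},
    "normal": {"a1": 5, "a2": 3, "a3": 5},
    "bad": {"a1": 10, "a2": 9, "a3": 6},
}
# p(x | theta), one row per state, as in the response table.
LIKELIHOOD = {
    "good": {"x1": 0.5, "x2": 0.4, "x3": 0.1, "x4": 0.0},
    "normal": {"x1": 0.2, "x2": 0.5, "x3": 0.2, "x4": 0.1},
    "bad": {"x1": 0.0, "x2": 0.2, "x3": 0.5, "x4": 0.3},
}
PRIOR = {"good": 0.3, "normal": 0.5, "bad": 0.2}


def risk(state, rule):
    """rho(theta, d): expected loss of rule d in a given state."""
    return sum(LOSS[state][rule[x]] * LIKELIHOOD[state][x] for x in OBS)


def bayes_risk(rule):
    """B(d): risk averaged over the states with the prior weights."""
    return sum(risk(state, rule) * PRIOR[state] for state in STATES)


all_rules = [dict(zip(OBS, choice)) for choice in product(ACTIONS, repeat=len(OBS))]
best_rule = min(all_rules, key=bayes_risk)
best_value = bayes_risk(best_rule)
print(len(all_rules), best_rule, round(best_value, 2))
```

Under these assumptions the rule with the smallest Bayes risk serves spaghetti after $x_1$, hamburger after $x_2$, and steak after $x_3$ or $x_4$, with Bayes risk 3.68.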
Suppose we observe a parent random variable $X$ with possible values $x_1, x_2, x_3$
and note that the observation outcome was $X = x_1$. Because we can calculate
the marginal probability $P(X = x_1)$ from the weighted average
$$P(X = x_1) = p(x_1 \mid \theta_1)\,p_0(\theta_1) + p(x_1 \mid \theta_2)\,p_0(\theta_2) + \cdots + p(x_1 \mid \theta_m)\,p_0(\theta_m)$$
of conditional probabilities and prior, the posterior probability
$$p_1(\theta_i \mid x_1) = P(\Theta = \theta_i \mid X = x_1) = p_0(\theta_i)\,\frac{p(x_1 \mid \theta_i)}{P(X = x_1)}$$
(known from the Bayes theorem) can be used in determining the conditional
Bayes risk
$$B(d \mid x_1) = \rho(\theta_1, d)\,p_1(\theta_1 \mid x_1) + \rho(\theta_2, d)\,p_1(\theta_2 \mid x_1) + \rho(\theta_3, d)\,p_1(\theta_3 \mid x_1)$$
for all $d$. As before, the decision maker chooses the decision function $d^*$ provided
that its Bayes risk $B(d^* \mid x_1)$ is the smallest among all Bayes risks $B(d_j \mid x_1)$ of
decision functions.
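The posterior update can be sketched with the garage-sale forecast data. The prior (0.7, 0.3) is taken from the no-data version of Example 1 as an assumption, and the function name `posterior` is illustrative.

```python
def posterior(prior, likelihood, x):
    """Bayes theorem: return (posterior over states, marginal P(X = x)).

    prior[state] is p0(state); likelihood[state][x] is p(x | state).
    """
    joint = {state: prior[state] * likelihood[state][x] for state in prior}
    marginal = sum(joint.values())   # P(X = x) as a prior-weighted average
    return {state: j / marginal for state, j in joint.items()}, marginal


prior = {"sunny": 0.7, "rainy": 0.3}          # assumed prior
likelihood = {
    "sunny": {"x1": 0.7, "x2": 0.3},          # p(x | theta_1)
    "rainy": {"x1": 0.4, "x2": 0.6},          # p(x | theta_2)
}

post_x1, marginal_x1 = posterior(prior, likelihood, "x1")
print(marginal_x1, post_x1)
```

Observing the forecast $x_1$ (great sale) raises the probability of a sunny day from 0.7 to 0.49/0.61, roughly 0.80; these posterior weights are the ones that enter the conditional Bayes risk.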