

Phil.015

April 12, 2016

ELEMENTS OF DECISION THEORY

The elements of decision theory are similar to those of the theory of games, in that decision theory may be considered as the theory of a two-person game in which nature takes the role of one of the players, and the other player is of course the statistician. Because the problems are simpler, it is useful to start with decisions in the absence of data. Often certain decision problems involving data can cleverly be converted into no-data problems.

I. NO-DATA DECISION PROBLEMS

The basic conceptual ingredients of a no-data decision model are the same as those of a zero-sum two-person game-theoretic model, namely:

(i) A non-empty set Θ = {θ1, θ2, …}, called the state space (sometimes referred to as the parameter space), of possible states of nature.

(ii) A non-empty set A = {a1, a2, …}, called the action space, of actions available to the decision-maker. And

(iii) a loss function L : Θ × A → ℝ that assigns a real number L(θ, a) to each pair (θ, a), specified by the state of nature θ and the action a. The value L(θ, a) represents the loss incurred when the decision-maker takes action a and nature is in state θ. Although L(θ, a) can even be a negative number, representing gain or payoff, statisticians usually think of L(θ, a) conservatively as a loss and take 0 to be the smallest loss (i.e., no loss at all). That is to say, L(θ, a) is always a non-negative real number.

Simply, nature chooses a point θ in Θ, and the statistician (without being informed of the choice nature has made) chooses an action a in A. As a consequence of these two choices, the statistician (decision-maker) loses an amount L(θ, a).

In a mathematical sense,¹ the triple ⟨Θ, A, L⟩ with the specifications itemized above is called a no-data decision model (or a two-person game).

Here are some examples:

¹ An essential part of understanding how a mathematical method works in decision theory is being able to view it somewhat abstractly as a mathematical phenomenon. This approach helps to see how an abstract idea might work in areas other than the one we may be interested in at the moment.

1. Alma is considering holding a garage sale the following Saturday. To announce the sale, she would have to pay $30 to place a notice in the local newspaper. If the day is sunny, she expects to receive about $200 from her customers; the net return after deducting the cost of the ad is $170. If it rains, however, no sale is likely to occur, and Alma suffers a loss due to the newspaper ad. Her default action is to hold no sale, for which she neither gains nor loses anything, come rain or shine. The obvious no-data decision model ⟨Θ, A, L⟩ is specified by setting:

(a) Θ = {θ1, θ2}, where θ1 = sunny and θ2 = rainy. (This set belongs to nature, viewed as player I.)

(b) A = {a1, a2}, where a1 = no sale and a2 = sale. (These are the possible choices of player II, the decision-maker.)

(c) The loss function (in dollars) L : Θ × A → ℝ is given by the loss table:

        a1    a2
θ1     170     0
θ2     170   200

Alternatively, the same situation can be described by a payoff or utility function (gain in dollars) U : Θ × A → ℝ, given by the table

        a1    a2
θ1       0   170
θ2       0   -30

The loss table above is obtained from this utility table by reversing the sign of each entry and simply adding $170 to it. We do this in order to get non-negative real numbers for all losses. So the worst loss is 0. This example is discussed later with additional information and conceptual enrichment. Formally, we set L(θi, a) =df Umax − U(θi, a), where Umax = max{U(θi, aj) : θi ∈ Θ, aj ∈ A} is the largest utility in the table (here Umax = $170).
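To make this construction concrete, here is a minimal Python sketch (the state and action names are merely labels; the numbers come from the tables above):

    # Utility table for the garage-sale example (gains in dollars).
    U = {("sunny", "no sale"): 0, ("sunny", "sale"): 170,
         ("rainy", "no sale"): 0, ("rainy", "sale"): -30}

    # Loss = (largest utility in the whole table) - utility, so every loss is
    # non-negative and the best outcome has loss 0.
    U_max = max(U.values())                         # 170
    L = {pair: U_max - u for pair, u in U.items()}
    print(L)  # losses 170, 0, 170, 200, matching the loss table above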

2. Homemaker Alma Jones can cook spaghetti, hamburger, or steak for dinner. She has learned from her past experience that if her husband is in a good mood, she can serve him spaghetti and save some money, but if he is in a bad mood, only a juicy steak will calm him down and make him bearable. Clearly, there are three states of nature (Mr. Jones's possible moods) and there are three actions available to Alma, leading to another finitary decision model ⟨Θ, A, L⟩, where

(a) Θ = {θ1, θ2, θ3}, where θ1 = Mr. Jones is in a good mood, θ2 = Mr. Jones is in a normal mood, and θ3 = Mr. Jones is in a bad mood.

(b) A = {a1 , a2 , a3 }, where a1 = prepare spaghetti, a2 = prepare hamburger,

and a3 = prepare steak.

(c) The loss function (for food, gasoline, etc.) L : Θ × A → ℝ is defined by the loss table

        a1    a2    a3
θ1       0     2     4
θ2       5     3     5
θ3      10     9     6

3. In estimation problems, the state space and the action space are typically infinite. For instance, we obtain a decision model ⟨Θ, A, L⟩ by simply setting Θ = A = ℝ (the real line) and L(θ, a) = (θ − a)² (quadratic loss). In particular, suppose a coin has an unknown probability (state or propensity) θ in Θ = [0, 1] (the closed unit interval) of coming up heads. We wish to estimate this probability on the basis of one toss of the coin. Clearly, each toss-based estimate corresponds to an element a in A = [0, 1]. Here the loss function is given by the quadratic loss L(θ, a) = (θ − a)².

Given the decision models above, the fundamental problem is: what course of action should Alma take? It is natural to search for the best course of action, i.e., an action that brings the smallest loss no matter what the true state of nature is. For this we need a principle (scheme, procedure) that algorithmically specifies the action that is best according to the principle used. Generally, a principle leads to a partial ordering ≼ on the set A of available courses of action, and any such ordering may be considered a principle. Here the relation a1 ≼ a2 means that action a2 is at least as good as action a1. In brief, a decision principle is modeled by the pair ⟨A, ≼⟩. As we shall see, we may (for example) define the ordering a1 ≼ a2 by the formula (∀θ ∈ Θ)[L(θ, a2) ≤ L(θ, a1)].

Two important elementary principles are basic to the study of decision models:

1. The Minimax Principle:

A distinct type of partial ordering of actions in a decision model is obtained by ordering the actions according to the worst that could happen to the decision maker. In other words, an action a is preferred to an action a′ if and only if

max{L(θ1, a), L(θ2, a), …, L(θn, a)} < max{L(θ1, a′), L(θ2, a′), …, L(θn, a′)}.

Here we consider the losses over the various possible states of nature and record, for each action, the worst (maximal) loss it could produce. This provides a worst-case comparison of the available actions. According to the minimax principle, the decision maker takes the action a for which the maximum loss max{L(θ1, a), L(θ2, a), …, L(θn, a)} is the smallest.

It is easy to see that in Example 1 (garage sale) we have

M(a1) = max{L(θ1, a1), L(θ2, a1)} = 170
M(a2) = max{L(θ1, a2), L(θ2, a2)} = 200

so that Alma should choose a1, because it has the smaller maximum.

In the same way as above, we find that in Example 2 (what food to prepare), according to the minimax principle the best choice is action a3

(prepare steak). We see that the minimax principle reflects a pessimistic

attitude.
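A minimal Python sketch of the minimax computation, using the loss tables of Examples 1 and 2 above (each action is listed with its losses over the states θ1, θ2 and, in Example 2, θ3):

    # losses[a] = losses of action a over the states of nature, in order.
    garage_sale = {"a1 (no sale)": [170, 170],
                   "a2 (sale)":    [0, 200]}
    dinner      = {"a1 (spaghetti)": [0, 5, 10],
                   "a2 (hamburger)": [2, 3, 9],
                   "a3 (steak)":     [4, 5, 6]}

    def minimax_action(losses):
        # Pick the action whose worst-case (maximum) loss M(a) is smallest.
        return min(losses, key=lambda a: max(losses[a]))

    print(minimax_action(garage_sale))  # a1 (no sale)
    print(minimax_action(dinner))       # a3 (steak)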

2. The Bayes Minimum Expected Loss Principle:

Bayesians take the point of view that the possible states of nature (parameters) can be seen as values of a single random variable Θ, so that we have outcomes (events) Θ = θ1, Θ = θ2, …, with known prior probabilities p0(θ1) = P(Θ = θ1), p0(θ2) = P(Θ = θ2), …. For example, Alma may believe, based on her past experience, that sunny days are far more frequent than rainy days, and she may even be able to put a reasonably accurate percentage value on the occurrence of sunny vs. rainy days.

Given a decision model ⟨Θ, A, L⟩ together with a prior probability distribution p0(θ1), p0(θ2), p0(θ3), … on the possible states in Θ (serving as an epistemic enrichment of Θ in the extant decision model, so that now we use ⟨Θ, p0⟩), we define the Bayes loss of action ai by the expected value (average loss)

B(ai) = L(θ1, ai) p0(θ1) + L(θ2, ai) p0(θ2) + L(θ3, ai) p0(θ3) + ⋯

Thus, given a prior distribution p0(θi) on the states of nature, the loss incurred for a given action a is a random variable whose expected value is B(a). The Bayes action is then defined to be the action a that minimizes the Bayes loss B(a). Thus, the computation of the expected losses B(a1), B(a2), B(a3), … according to a given prior distribution provides a means of partially ordering the available actions in A, say in the form B(a1) ≤ B(a2) ≤ B(a3) ≤ ⋯. The action that is farthest to the left on this scale is the most desirable from the Bayesian perspective.

Suppose the prior probability distribution p0(θ) = P(Θ = θ) on the states of nature in Example 1 (garage sale) is given by the table

          θ1     θ2
p0(θ)    0.7    0.3

Find the Bayes action that minimizes the average losses. Do the same in Example 2 (prepare dinner for Mr. Jones), given that the prior probability distribution on the states of nature is given by the table

          θ1     θ2     θ3
p0(θ)    0.5    0.3    0.2
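For this exercise, a minimal Python sketch of the Bayes computation (the same loss tables as in the previous sketch, with the two priors given above):

    # losses[a] = losses of action a over the states theta1, theta2, (theta3).
    garage_sale = {"a1 (no sale)": [170, 170], "a2 (sale)": [0, 200]}
    dinner      = {"a1 (spaghetti)": [0, 5, 10],
                   "a2 (hamburger)": [2, 3, 9],
                   "a3 (steak)":     [4, 5, 6]}

    def bayes_losses(losses, prior):
        # B(a) = sum over states of p0(theta) * L(theta, a).
        return {a: sum(p * l for p, l in zip(prior, losses[a])) for a in losses}

    print(bayes_losses(garage_sale, [0.7, 0.3]))       # pick the smallest B(a)
    print(bayes_losses(dinner,      [0.5, 0.3, 0.2]))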

Considerably better decisions become available if, in addition to prior knowledge, there is access to various observation results of a designated parent random variable X, whose values are assumed to depend on the states of nature in the form of (conditional) probability distributions p(x|θ) = P(X = x | Θ = θ). Simply, in an observation setting we think of the states of nature as causes and the values x of the parent variable X as effects, but of course a precisely formulated cause-effect relationship is only statistical. Unfortunately, the welcome feature of having additional information from observation data somewhat complicates the no-data decision models treated above.

II. DATA-BASED DECISION PROBLEMS

To give a correct mathematical structure to the process of information gathering, we assume that there is a parent random variable X taking its values in an observation space (sample space) X, whose known probability distribution p(x|θ) depends on the true state of nature θ ∈ Θ. Thus, what we have so far is a decision model ⟨Θ, A, L⟩ coupled with a random variable X with range X and known probability distribution p_X(x|θ). For each θ ∈ Θ there is a probability measure P(X ≤ x | θ) = ∫ p_X(x′ | θ) dx′ (the integral taken over all x′ ≤ x), specified by p_X(x | θ); in the discrete case the integral is replaced by a sum.

For example, let Θ = {θ1, θ2}, where θ stands for the unknown probability of getting a head in one toss of a (fair or biased) coin. We think of θ1 = 3/4 as one of the states a coin would be in if it were considerably biased in favor of heads, and of course θ2 = 1/2 encodes the state of a fair (balanced) coin. The problem is that the decision maker does not know the true state of the coin. One major way to find out is to observe the coin's behavior in a longer sequence of tosses. Let X be the success random variable, counting the number of heads that come up in each given sequence of trials, whose distribution is given by the binomial formula

P(X = x | Θ = θi) = p_X(x | θi) = (n choose x) θi^x (1 − θi)^(n − x),

where x = 0, 1, …, n are the possible values of X, i = 1, 2 ranges over the possible states, and the natural number n is the sample size.

The pertinent decision space A = {a1, a2} consists of two decisions: a1 = the coin is fair and a2 = the coin is biased. The fundamental question here is what decision rule or strategy the decision maker should use in light of the information obtained from observing the values of X.
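A minimal Python sketch of the likelihoods p(x | θ) for this coin problem; the sample size n = 10 is an arbitrary choice made only for illustration:

    from math import comb

    def binom_pmf(x, n, theta):
        # P(X = x | theta): probability of exactly x heads in n tosses.
        return comb(n, x) * theta**x * (1 - theta)**(n - x)

    n = 10                       # illustrative sample size
    for theta in (0.75, 0.5):    # theta1 = 3/4 (biased), theta2 = 1/2 (fair)
        pmf = [round(binom_pmf(x, n, theta), 4) for x in range(n + 1)]
        print(theta, pmf)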

Given a decision rule d, on the basis of recording the value x of the random variable X, the decision maker chooses action d(x) = a in A. What is new here is the decision function (also known as a decision rule or a strategy) d : X → A that assigns to each outcome x of X in its observation (sample) space X a unique action d(x) in A. Of course, the decision maker may apply several different decision functions, some good and some not so good, or even foolish. So, now the primary problem is no longer the choice of the right action in A but the choice of the right decision function d in the decision function space A^X, comprised of all possible mappings from X to A. Because X is a random variable, the loss L(θ, d(X)) now becomes a random quantity that has to be averaged over all possible values of X. Note that in the coin example above, the decision function space A^X consists of 2^(n+1) decision functions, one for each way of assigning an action to the n + 1 possible values 0, 1, …, n of X. Fortunately, since most of them are uninteresting from the standpoint of decision theory, we shall focus only on a much smaller subset of so-called admissible decision functions.

The first step in setting up a data-based decision model is the introduction of the risk function R : Θ × A^X → ℝ that assigns to each state of nature θ in Θ and to each decision function d : X → A in A^X the numerical value R(θ, d), interpreted as the risk incurred by using decision rule d when the true state of nature is θ. Remember, the loss L(θ, d(X)) is now a random quantity, because it is a function of X. The classical decision-theoretic definition of the risk function R : Θ × A^X → ℝ (expected value of the random loss) is as follows:

R(θ, d) = L(θ, d(x1)) p(x1|θ) + L(θ, d(x2)) p(x2|θ) + ⋯ + L(θ, d(xn)) p(xn|θ),

where x1, x2, …, xn are the possible values of the parent random variable X, and p(x1|θ) = P(X = x1 | Θ = θ) denotes the probability that the parent random variable X takes the value x1, given that the state of nature is θ. Likewise for p(x2|θ) = P(X = x2 | Θ = θ), etc. Finally, d(x1) is the action assigned by decision rule d to outcome x1. We interpret d(x2), …, d(xn) similarly.

Notice that the no-data decision model ⟨Θ, A, L⟩, treated earlier, has been replaced by a brand-new, data-based decision model ⟨Θ, A^X, R⟩, in which the decision function space A^X has an underlying structure, including the probability distributions p(x|θ) (one for each θ) for X, whose exploitation is the main task in data-based decision making.

We do not yet know which is the best decision function in a data-based decision model. As in the no-data case, we can use:

1. The minimax principle, by calculating the maximum risk

M(d) =df max{R(θ1, d), R(θ2, d), …, R(θm, d)}

for all admissible d over the various possible states of nature θ1, θ2, …, θm. Then choose the decision function that minimizes the maxima M(d), M(d′), M(d″), … in the model. In this way, a partial ordering ≼ is defined over the decision function space A^X, ranking all decision rules from a (pessimistic) minimax perspective.

2. The Bayes minimum expected value principle assumes that the states of nature come with prior probability weights p0(θ), so that one can determine the Bayes risk (average risk over the possible states of nature) B(d) for using decision function d. The preferred decision function d* is the one that minimizes the Bayes risk B(d*). Thus, there is a family of decision functions d1, d2, d3, … with respective Bayes risks B(d1), B(d2), B(d3), … that have to be calculated. A decision rule d* is called a Bayes decision function just in case its Bayes risk B(d*) is the smallest in the list B(d1), B(d2), B(d3), ….

Let us consider further the examples introduced earlier.

1. Suppose that, in addition to the givens in Example 1 (Alma's garage sale), Alma has two possible forecasts (based on how her neighbors have profited from their recent garage sales), captured by the values of the parent random variable X: outcome x1 forecasts a great sale and x2 forecasts a weak sale. The pertinent frequencies, in the form of probability distributions p(x | θ1) and p(x | θ2), are given by the table

        x1     x2
θ1     0.7    0.3
θ2     0.4    0.6

Since X has two possible values and there are two available actions, there are 2² = 4 possible decision rules, including some patently wrong ones that totally ignore the data:

        x1     x2
d1      a1     a1
d2      a1     a2
d3      a2     a1
d4      a2     a2

Rules d1 and d4 prescribe the same action regardless of what the observation result tells Alma. The only decision rule that makes good empirical sense is d3: it assigns action a2 (sale) to the forecast x1 of a great sale, and it assigns action a1 (no sale) to the forecast x2 of a weak sale. In applications, the bad decision rules are automatically eliminated by the dominance relation.

Recall that a decision function d dominates another decision function d′ just in case (∀θ ∈ Θ)[R(θ, d) ≤ R(θ, d′)]. We say that d dominates d′ strictly, or alternatively that decision function d is better than d′, just in case (∀θ ∈ Θ)[R(θ, d) ≤ R(θ, d′)] and (∃θ ∈ Θ)[R(θ, d) < R(θ, d′)]. A decision function d is said to be admissible if and only if d is not dominated strictly by any other decision function. From now on, we confine our attention to the subset of admissible decision functions in A^X. In view of dominance, the other decision functions will never be picked. But of course, in general there are many admissible decision functions in A^X, forming a kind of boundary.

Getting back to Alma's problem, first we calculate the risks

R(θ1, d2), R(θ1, d3), R(θ2, d2), and R(θ2, d3).

Then, depending on the availability of the prior p0(θ1), p0(θ2), we calculate the Bayes risks B(d2) and B(d3). Finally, we choose the decision function d with the smaller Bayes risk.
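A minimal Python sketch of this calculation; it enumerates all four rules rather than only d2 and d3, and uses the loss table, forecast probabilities, and prior given above:

    from itertools import product

    loss  = {("sunny", "a1"): 170, ("sunny", "a2"): 0,
             ("rainy", "a1"): 170, ("rainy", "a2"): 200}
    p_x   = {"sunny": {"x1": 0.7, "x2": 0.3},   # p(x | theta)
             "rainy": {"x1": 0.4, "x2": 0.6}}
    prior = {"sunny": 0.7, "rainy": 0.3}        # p0(theta)

    def risk(theta, d):
        # R(theta, d): expected loss of rule d when the true state is theta.
        return sum(p_x[theta][x] * loss[(theta, d[x])] for x in d)

    # All 2^2 = 4 decision rules d : {x1, x2} -> {a1, a2}.
    for acts in product(("a1", "a2"), repeat=2):
        d = dict(zip(("x1", "x2"), acts))
        risks = {t: risk(t, d) for t in prior}
        bayes = sum(prior[t] * risks[t] for t in prior)
        print(d, risks, "Bayes risk:", round(bayes, 2))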

2. Returning to the Alma Jones example (Example 2, choices in preparing dinner), suppose Alma lost the afternoon paper that Mr. Jones loves to read after work. When Mr. Jones returns home, she will have to tell him that she lost the damn paper. That could make Mr. Jones terribly mad, or maybe not too mad. This has happened many times before. Alma foresees four possible responses from Mr. Jones that will help her decide which food to prepare for dinner. Here the parent random variable X takes four values, x1, x2, x3, x4, where Mr. Jones's responses are encoded precisely by these values:

x1 = "I keep telling you: a place for everything and everything in its place."
x2 = "Why did I ever get married?"
x3 = an absent-minded, far-away look.
x4 = …

The conditional probability distributions p(x|θ) of Mr. Jones's possible responses are given by the table

        x1     x2     x3     x4
θ1     0.5    0.4    0.1    0
θ2     0.2    0.5    0.2    0.1
θ3     0      0.2    0.5    0.3

Here the first row specifies the probability distribution p(x|θ1), the second row gives p(x|θ2), and the third row defines p(x|θ3) for X.

Evidently, here the risk function for θ1 and decision rule d is given by

R(θ1, d) = L(θ1, d(x1)) p(x1|θ1) + L(θ1, d(x2)) p(x2|θ1) + ⋯ + L(θ1, d(x4)) p(x4|θ1).

Now, suppose Alma knows that Mr. Jones is in a good mood 30% of the time, in a normal mood 50% of the time, and in a bad mood 20% of the time. The resulting enriched data-based decision model ⟨Θ, A^X, R⟩, together with this prior p0, allows us to calculate the Bayes risks

B(d) = R(θ1, d) p0(θ1) + R(θ2, d) p0(θ2) + R(θ3, d) p0(θ3)

for all admissible d by substituting the given percentages for p0(θ). Alma can now choose the best strategy, based on her data about Mr. Jones's behavior, reflecting whether Mr. Jones is in a good, normal, or bad mood.
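A minimal Python sketch of this search over decision rules, using the loss table of Example 2, the response probabilities above, and the stated mood percentages as the prior:

    from itertools import product

    states, xs = ("good", "normal", "bad"), ("x1", "x2", "x3", "x4")
    actions    = ("spaghetti", "hamburger", "steak")

    loss  = {"good":   {"spaghetti": 0,  "hamburger": 2, "steak": 4},
             "normal": {"spaghetti": 5,  "hamburger": 3, "steak": 5},
             "bad":    {"spaghetti": 10, "hamburger": 9, "steak": 6}}
    p_x   = {"good":   {"x1": 0.5, "x2": 0.4, "x3": 0.1, "x4": 0.0},
             "normal": {"x1": 0.2, "x2": 0.5, "x3": 0.2, "x4": 0.1},
             "bad":    {"x1": 0.0, "x2": 0.2, "x3": 0.5, "x4": 0.3}}
    prior = {"good": 0.3, "normal": 0.5, "bad": 0.2}

    def bayes_risk(d):
        # B(d) = sum over states of p0(theta) * R(theta, d).
        return sum(prior[t] * sum(p_x[t][x] * loss[t][d[x]] for x in xs)
                   for t in states)

    # Search all 3^4 = 81 decision rules d : {x1, ..., x4} -> actions.
    rules = [dict(zip(xs, acts)) for acts in product(actions, repeat=4)]
    best  = min(rules, key=bayes_risk)
    print(best, round(bayes_risk(best), 3))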

Suppose we observe a parent random variable X with possible values x1, x2, x3 and note that the observation outcome was X = x1. Because we can calculate the marginal probability P(X = x1) from the weighted average

P(X = x1) = p(x1|θ1) p0(θ1) + p(x1|θ2) p0(θ2) + ⋯ + p(x1|θm) p0(θm)

of the conditional probabilities and the prior, the posterior probability

p1(θi | x1) = P(Θ = θi | X = x1) = p0(θi) p(x1|θi) / P(X = x1)

(known from Bayes' theorem) can be used in determining the conditional Bayes risk

B(d | x1) = R(θ1, d) p1(θ1|x1) + R(θ2, d) p1(θ2|x1) + R(θ3, d) p1(θ3|x1)

for all d. As before, the decision maker chooses the decision function d* whose Bayes risk B(d* | x1) is the smallest among all Bayes risks B(dj | x1) of the available decision functions.
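A minimal Python sketch of the posterior computation for the dinner example, assuming the observed outcome is X = x1 and using the prior and the first column of the response table above:

    # Bayes' theorem: p1(theta | x1) = p0(theta) * p(x1 | theta) / P(X = x1).
    prior = {"good": 0.3, "normal": 0.5, "bad": 0.2}   # p0(theta)
    p_x1  = {"good": 0.5, "normal": 0.2, "bad": 0.0}   # p(x1 | theta)

    marginal  = sum(prior[t] * p_x1[t] for t in prior)           # P(X = x1) = 0.25
    posterior = {t: prior[t] * p_x1[t] / marginal for t in prior}
    print(posterior)   # {'good': 0.6, 'normal': 0.4, 'bad': 0.0}

    # These posterior weights then replace the prior p0 in the Bayes-risk sum,
    # giving the conditional Bayes risk B(d | x1) described above.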
