
Chapter 5

Statistical Decision Problems

Decision problems called statistical are those in which there
are data on the state of nature, hopefully containing
information that can be used to make a better decision. The
availability of data will generally provide some illumination,
so that in the selection of an action one is not completely in
the dark concerning the state of nature. In practice, however,
one still faces the danger of taking a bad action if the
information contained in the data is not intelligently utilized.

We assume that the data contain information about the true state
of nature. For the general discussion, the notation X, either a
single random variable or a random vector, will be employed
to refer to the data; X will have a (joint) density function given by

f(x | θ),  x ∈ S_X,  θ ∈ Θ.

Leong YK & Wong WY Introduction to Statistical Decisions 2

If θ is treated as a random variable, then the above density
function will be regarded as the conditional density function
of X given that the state of nature is Θ = θ.

Let (Θ, A, L) be a decision problem and X an observable
random variable. A procedure for using data as an aid to
choosing an action will involve a rule that assigns an action
to each possible observed value of X.

Decision rule

A function

d : S_X → A

is called a nonrandomized decision rule.

Example 5.2.1

Suppose (Θ, A, L) is a decision problem with A = { a1 , a2 },
and suppose X is an observable random variable such that
S_X = { x1 , x2 }. Then there are four distinct nonrandomized
decision rules, as in the following table:

           d1   d2   d3   d4
x = x1     a2   a2   a1   a1
    x2     a2   a1   a2   a1


Example 5.2.2

Suppose (Θ, A, L) is a decision problem with A = { a1 , a2 },
and suppose X is an observable random variable such that
S_X = { x1 , x2 , x3 }. Then there are eight distinct
nonrandomized decision rules, as in the following table:

           d1   d2   d3   d4   d5   d6   d7   d8
x = x1     a2   a2   a2   a1   a2   a1   a1   a1
    x2     a2   a2   a1   a2   a1   a2   a1   a1
    x3     a2   a1   a2   a2   a1   a1   a2   a1

When a decision function d(x) is used, the loss incurred will
depend not only on the true state of nature θ but also on the
value of the observable random variable X. Since X is
random, the loss incurred

L(θ, d(X))

is a random variable. The expected value of this random loss
is called the risk function of the decision rule d. That is,

Risk Function

The risk function of the decision rule d, when
the state of nature is θ, is defined to be

R(θ, d) = E[L(θ, d(X))]


For the purpose of making decisions with data, the action space
is expanded to the set of (nonrandomized) decision rules, denoted
by D, and the no-data decision problem (Θ, A, L) is extended to
the statistical decision problem denoted by (Θ, D, R).

Example 5.3.1

Consider the decision problem with loss table given by

         a1   a2
θ1        0    4
θ2        4    0

and suppose the observable random variable X has density
function given by

         f(x; θ1)   f(x; θ2)
x = 0       0.6        0.2
x = 1       0.4        0.8

The set of nonrandomized decision rules is

         d1   d2   d3   d4
x = 0    a2   a2   a1   a1
x = 1    a2   a1   a2   a1


The risk functions of these rules are:

             d1    d2    d3    d4
R(θ1, d)      4    2.4   1.6    0
R(θ2, d)      0    3.2   0.8    4

The decision rules can also be represented graphically by
means of their risk points.
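The risk computation R(θ, d) = Σ_x L(θ, d(x)) f(x; θ) for Example 5.3.1 can be sketched as follows. This is an illustrative sketch only: the state labels "t1"/"t2" and the dictionary layout are ours, while the loss, density, and rule tables are copied from the text.

```python
# Risk computation for Example 5.3.1 (a sketch; names "t1"/"t2" are ours).
# Loss L(theta, a) and density f(x; theta) are taken from the tables above.
loss = {("t1", "a1"): 0, ("t1", "a2"): 4,
        ("t2", "a1"): 4, ("t2", "a2"): 0}
density = {("t1", 0): 0.6, ("t1", 1): 0.4,
           ("t2", 0): 0.2, ("t2", 1): 0.8}
# The four nonrandomized rules d : {0, 1} -> {a1, a2}.
rules = {"d1": {0: "a2", 1: "a2"}, "d2": {0: "a2", 1: "a1"},
         "d3": {0: "a1", 1: "a2"}, "d4": {0: "a1", 1: "a1"}}

def risk(theta, d):
    """R(theta, d) = E[L(theta, d(X))] = sum_x L(theta, d(x)) f(x; theta)."""
    return sum(loss[(theta, d[x])] * density[(theta, x)] for x in (0, 1))

# Reproduce the risk table from the text.
for name, expected in {"d1": (4.0, 0.0), "d2": (2.4, 3.2),
                       "d3": (1.6, 0.8), "d4": (0.0, 4.0)}.items():
    r = (risk("t1", rules[name]), risk("t2", rules[name]))
    assert all(abs(a - b) < 1e-9 for a, b in zip(r, expected))
```

The same two-line function reproduces every risk table in this chapter once the loss and density dictionaries are swapped in.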

Remark

Decisions d1 and d4, which ignore the data, give risk
points which are exactly the same as the corresponding
loss points of the no-data problem.

The straight line joining the risk points of d1 and d4
consists of the loss points (as regards risks) of the
randomized mixtures of a2 and a1.

An intelligent use of data, such as decision rule d3, can
improve the expected losses.


Using the data foolishly, such as decision rule d2, can make
the expected losses worse.

How many distinct nonrandomized decision rules are there in a
given statistical decision problem in which there are n available
actions and the observed random variable has m possible
values?

Answer: n^m
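The count n^m can be checked by direct enumeration: a rule is one choice of action per observable value, so the rules are exactly the elements of the m-fold Cartesian product of A. A minimal sketch (the function name `all_rules` is ours):

```python
# Sketch: enumerating all nonrandomized rules d : S_X -> A.
from itertools import product

def all_rules(sample_space, actions):
    """Each rule assigns one action to each observable value,
    so there are len(actions) ** len(sample_space) rules in all."""
    return [dict(zip(sample_space, choice))
            for choice in product(actions, repeat=len(sample_space))]

assert len(all_rules(["x1", "x2"], ["a1", "a2"])) == 2 ** 2        # Example 5.2.1
assert len(all_rules(["x1", "x2", "x3"], ["a1", "a2"])) == 2 ** 3  # Example 5.2.2
```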

Example 5.3.2

Consider a decision problem in which the loss matrix is
given by

         a1   a2
θ1        0    1
θ2        6    5

Suppose that the observed random variable X takes three
possible values, with density function given in the following
table:

             x1    x2    x3
f(x; θ1)    0.6   0.3   0.1
f(x; θ2)    0.1   0.4   0.5

The nonrandomized decision rules are tabulated as
follows:


        d1   d2   d3   d4   d5   d6   d7   d8
x1      a2   a2   a2   a1   a2   a1   a1   a1
x2      a2   a2   a1   a2   a1   a2   a1   a1
x3      a2   a1   a2   a2   a1   a1   a2   a1

The corresponding risk functions are given in the following
table:

             d1    d2    d3    d4    d5    d6    d7    d8
R(θ1, d)    1.0   0.9   0.7   0.4   0.6   0.3   0.1   0.0
R(θ2, d)    5.0   5.5   5.4   5.1   5.9   5.6   5.5   6.0

These risk points are presented graphically in the risk plane
(figure omitted).


The example shows that one can reduce the expected
losses if the data are wisely used. The set of risk points
(R(θ1, d), R(θ2, d)) representing the various decision rules
is pulled in toward the origin, where decisions are as
perfect as they could be.

For this to happen, the data must contain some information
about the state of nature. For, if the distribution of X is
independent of θ, say

f(x; θ) = g(x),

then the risks of a given decision rule d are

R(θi, d) = E[L(θi, d(X))] = Σ_x L(θi, d(x)) g(x).

If the state space contains m elements, then

( R(θ1, d), R(θ2, d), …, R(θm, d) )ᵀ = Σ_x g(x) ( L(θ1, d(x)), L(θ2, d(x)), …, L(θm, d(x)) )ᵀ,

which is a convex combination of the loss points.


Example 5.3.3

Consider the decision problem stated in Example 5.3.2, with
loss table given by

         a1   a2
θ1        0    1
θ2        6    5

but suppose now that the density function of the observable
random variable X is given by

             x1    x2    x3
f(x; θ1)    0.6   0.3   0.1
f(x; θ2)    0.6   0.3   0.1

so that the distribution of X does not depend on θ. The risk
of a decision rule, say d2 = (x1, x2, x3 ↦ a2, a2, a1), would be

R(θ1, d2) = 1·0.6 + 1·0.3 + 0·0.1
R(θ2, d2) = 5·0.6 + 5·0.3 + 6·0.1,

or, written in vector form,

( R(θ1, d2), R(θ2, d2) )ᵀ = 0.9·(1, 5)ᵀ + 0.1·(0, 6)ᵀ.

This is a convex combination of the loss points of the pure
actions a2 and a1.


The problem of selecting a decision rule, given the risk
function, is exactly the same as the problem of selecting an
action. The risk function plays the same role as the loss
function in the no-data decision problem.

Admissible Decision Rules

The concepts of dominance and admissibility extend in an
obvious way to decision rules.

We say that decision rule d is dominated by
decision rule d0 if

R(θ, d0) ≤ R(θ, d) for all θ ∈ Θ.

If the above inequality is strict for some state θ,
then decision rule d is said to be inadmissible.
A decision rule which is not inadmissible is called
an admissible decision rule.

Example 5.4.1

Consider the decision problem stated in Example 5.3.1. The
risk points of the nonrandomized decision rules are
reproduced here for easy reference.

             d1    d2    d3    d4
R(θ1, d)      4    2.4   1.6    0
R(θ2, d)      0    3.2   0.8    4

Decision rule d2 is dominated by d3 and is therefore
inadmissible. The other three decision rules are admissible.

Minimax Principle

As in the no-data case, it is necessary to devise a scheme of
preferences so that under this ordering one can select the most
desirable decision rule.

Decision rule d0 is said to be a minimax decision
rule if

max_θ R(θ, d0) ≤ max_θ R(θ, d) for all d ∈ D.


Example 5.4.2

The risk points of the decision rules considered in
Example 5.4.1 are reproduced as follows:

                 d1    d2    d3    d4
R(θ1, d)          4    2.4   1.6    0
R(θ2, d)          0    3.2   0.8    4
max_θ R(θ, d)     4    3.2   1.6    4

So d3 is the minimax decision rule.
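Minimax selection is a one-line computation once the risk points are tabulated: pick the rule whose worst-case risk over the states is smallest. A minimal sketch for Example 5.4.2 (the dictionary layout is ours; the risk points are from the table):

```python
# Minimax selection for Example 5.4.2 (a sketch; rule names as in the text).
# Each rule maps to its risk point (R(theta1, d), R(theta2, d)).
risks = {"d1": (4.0, 0.0), "d2": (2.4, 3.2), "d3": (1.6, 0.8), "d4": (0.0, 4.0)}

# Pick the rule whose worst-case (maximum over states) risk is smallest.
minimax_rule = min(risks, key=lambda d: max(risks[d]))
print(minimax_rule)  # d3, with worst-case risk 1.6
```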

The minimax decision rule is obtained by moving a wedge,
whose vertex is on the 45° line and whose sides are parallel to
the coordinate axes, up toward the risk set.

As in the no-data decision problem, if one is going to take a
minimax approach, it can make a difference whether one uses
regrets or losses.


Note that there are two ways to introduce the idea of regret:
either

(a) applying it to the initial loss function
    (Loss → Regret → Data → Expected Regret), or
(b) applying it to the risk function
    (Loss → Data → Risk → Regretized Risk).

For (a), the regret function is

Lr(θ, ai) = L(θ, ai) − min_{a∈A} L(θ, a).

Note that min_{a∈A} L(θ, a) depends solely on the state of
nature. So for each observed value X = x,

Lr(θ, d(x)) = L(θ, d(x)) − min_{a∈A} L(θ, a).

Thus the expected regret is

E[Lr(θ, d(X))] = E[L(θ, d(X))] − min_{a∈A} L(θ, a) = R(θ, d) − min_{a∈A} L(θ, a).

Now, for any decision rule d,

min_{a∈A} L(θ, a) ≤ L(θ, d(x)) for all x.

Therefore,

min_{a∈A} L(θ, a) ≤ R(θ, d),

and hence

min_{a∈A} L(θ, a) ≤ min_{d∈D} R(θ, d).

On the other hand, since A ⊂ D, and R(θ, d*) = L(θ, a) if
d*(x) ≡ a, we have

min_{d∈D} R(θ, d) ≤ L(θ, a) for all a ∈ A,

and therefore

min_{d∈D} R(θ, d) = min_{a∈A} L(θ, a).

Consequently, for (b),

Rr(θ, d) = E[Lr(θ, d(X))] = R(θ, d) − min_{d′∈D} R(θ, d′).

This shows that the expected regret is the same as the
regretized risk.

Example 5.4.3

Reconsider the statistical decision problem in Example 5.3.2.

The risk functions of the nonrandomized decision rules were

tabulated as follows:


                     d1    d2    d3    d4    d5    d6    d7    d8
R(θ1, d)            1.0   0.9   0.7   0.4   0.6   0.3   0.1   0.0
R(θ2, d)            5.0   5.5   5.4   5.1   5.9   5.6   5.5   6.0

Since min_a L(θ1, a) = 0 and min_a L(θ2, a) = 5, the expected
regrets are as follows:

                     d1    d2    d3    d4    d5    d6    d7    d8
E[Lr(θ1, d(X))]     1.0   0.9   0.7   0.4   0.6   0.3   0.1   0.0
E[Lr(θ2, d(X))]     0.0   0.5   0.4   0.1   0.9   0.6   0.5   1.0

From the table, d4 is the minimax nonrandomized regret
decision rule. The minimax randomized regret decision rule
can be obtained graphically by moving the wedge with
vertex on the 45° line up to the set of expected regret points.
It is clear that the minimax randomized regret rule is a
mixture of d4 and d7, namely

      d1   d2   d3   d4   d5   d6   d7    d8
p̃ = ( 0    0    0    p    0    0   1−p    0 )

or simply denoted as

p̃ = (0, 0, 0, p, 0, 0, 1−p, 0).


Setting E[Lr(θ1, p̃)] = E[Lr(θ2, p̃)], i.e.,

0.4p + 0.1(1 − p) = 0.1p + 0.5(1 − p), or 3p + 1 = 5 − 4p,

gives p = 4/7.
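The equalizer weight p can be solved exactly with rational arithmetic. A minimal sketch (variable names `r4`, `r7` are ours; the regret values are from the table above):

```python
# Solving for the equalizer mixture of d4 and d7 in Example 5.4.3 (a sketch).
from fractions import Fraction

# Expected regrets (state theta1, state theta2) of d4 and d7 from the table.
r4 = (Fraction(4, 10), Fraction(1, 10))
r7 = (Fraction(1, 10), Fraction(5, 10))

# The mixture p*d4 + (1-p)*d7 has equal regret in both states when
#   p*r4[0] + (1-p)*r7[0] = p*r4[1] + (1-p)*r7[1].
# Rearranged as a*p + b = 0:
a = (r4[0] - r7[0]) - (r4[1] - r7[1])
b = r7[0] - r7[1]
p = -b / a
print(p)  # prints 4/7
```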

Bayes Principle

Another scheme for ordering decision rules is that of
assigning prior probabilities π(θ) to the various states of
nature and determining the average risk over these states.


Suppose that π(θ) = P(Θ = θ) is the prior
probability of the state of nature θ. Then the Bayes
risk of the decision rule d in a statistical decision
problem (Θ, D, R) is defined to be

R(π, d) = Σ_θ R(θ, d) π(θ).

Decision rule d0 is called a Bayes decision rule
against π if

R(π, d0) ≤ R(π, d) for all d ∈ D.

Posterior Distribution

The Bayes approach to selecting an optimal decision rule
involves the assumption that the state of nature is random
with probability function π(θ). The probability function of
the observed random variable X will be regarded as the
conditional distribution given that Θ = θ, and is written as
P(X = x | Θ = θ), or simply as f(x | θ). We shall denote the
conditional distribution of Θ given that X = x by π(θ | x).
Note that

π(θ | x) P(X = x) = P(X = x | Θ = θ) π(θ).   (*)


In fact, both sides of the above equation represent the joint
probability of X and Θ. We call the conditional distribution
π(θ | x) the posterior distribution of Θ given that X = x.
As π(θ | x) is considered as a function of θ while x is fixed,
π(θ | x) is proportional to the product P(X = x | Θ = θ) π(θ),
and we write

π(θ | x) ∝ P(X = x | Θ = θ) π(θ).   (**)

Example 5.4.4

Suppose that the conditional probabilities of X given Θ = θ
are

P(X = x1 | Θ = θ1) = 1/4 = 1 − P(X = x2 | Θ = θ1)
P(X = x1 | Θ = θ2) = 2/3 = 1 − P(X = x2 | Θ = θ2),

and suppose the prior probability function of Θ is given by

P(Θ = θ1) = π(θ1) = w
P(Θ = θ2) = π(θ2) = 1 − w,   0 ≤ w ≤ 1.

For X = x1,

π(θ1 | x1) = 3w / (8 − 5w),   π(θ2 | x1) = 1 − π(θ1 | x1) = (8 − 8w) / (8 − 5w).


For X = x2,

π(θ1 | x2) = 9w / (4 + 5w),   π(θ2 | x2) = (4 − 4w) / (4 + 5w).
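The closed forms in Example 5.4.4 can be checked by direct application of (**) with exact rational arithmetic. A minimal sketch (the state labels "t1"/"t2" and the choice w = 1/2 are ours; any w works):

```python
# Posterior computation for Example 5.4.4 (a sketch; w chosen arbitrarily).
from fractions import Fraction

w = Fraction(1, 2)  # prior P(Theta = theta1); any 0 <= w <= 1 works
prior = {"t1": w, "t2": 1 - w}
lik = {("x1", "t1"): Fraction(1, 4), ("x2", "t1"): Fraction(3, 4),
       ("x1", "t2"): Fraction(2, 3), ("x2", "t2"): Fraction(1, 3)}

def posterior(x):
    """pi(theta | x) proportional to P(X = x | theta) * pi(theta),
    normalised over theta."""
    joint = {t: lik[(x, t)] * prior[t] for t in prior}
    total = sum(joint.values())
    return {t: joint[t] / total for t in joint}

# Agrees with the closed forms pi(theta1|x1) = 3w/(8-5w), pi(theta1|x2) = 9w/(4+5w).
assert posterior("x1")["t1"] == 3 * w / (8 - 5 * w)
assert posterior("x2")["t1"] == 9 * w / (4 + 5 * w)
```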

Successive Observations

If an observation can alter prior odds to posterior odds, it

would seem that further observation, applied to the first

posterior distribution as though it were a prior, should result

in yet a new posterior distribution.

Indeed, if one uses the first posterior distribution as a prior
distribution with the new data, the resulting posterior
distribution is the same as if one had waited until all the data
were at hand and used the original prior distribution to obtain
a final posterior distribution.


To see this, suppose that X1 and X2 are independent
observations with density functions f1(x1 | θ) and f2(x2 | θ),
respectively. Then

π(θ | x1) ∝ f1(x1 | θ) π(θ).

Thus the posterior density function of Θ given that X1 = x1
is

π(θ | x1) = c1(x1) f1(x1 | θ) π(θ).

By regarding π1(θ) = π(θ | x1) as the prior density function
of Θ, the posterior density function of Θ when X2 = x2
becomes available is given by

π2(θ | x2) ∝ f2(x2 | θ) π1(θ) ∝ f2(x2 | θ) f1(x1 | θ) π(θ).

Since f1(x1 | θ) f2(x2 | θ) is the joint density function of
X1 and X2, this shows that π2(θ | x2) is the posterior
density of Θ given the observed vector (x1, x2).
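The sequential-update claim can be verified numerically for independent observations. A toy sketch: the prior and the likelihoods below are illustrative choices of ours, not values from the text.

```python
# Sketch of the successive-observation claim: updating the posterior one
# observation at a time gives the same result as a single batch update.
from fractions import Fraction

prior = {"t1": Fraction(1, 3), "t2": Fraction(2, 3)}   # illustrative prior
# Illustrative likelihoods f(x | theta) for x in {0, 1} (i.i.d. observations).
lik = {"t1": {0: Fraction(3, 4), 1: Fraction(1, 4)},
       "t2": {0: Fraction(1, 3), 1: Fraction(2, 3)}}

def update(pi, x):
    """One Bayes step: pi(theta | x) proportional to f(x | theta) pi(theta)."""
    joint = {t: lik[t][x] * pi[t] for t in pi}
    total = sum(joint.values())
    return {t: joint[t] / total for t in joint}

data = [1, 0, 1]

# Sequential: feed each observation into the previous posterior.
seq = prior
for x in data:
    seq = update(seq, x)

# Batch: multiply the whole likelihood f(x1|t) f(x2|t) f(x3|t) into the prior.
joint = {t: prior[t] for t in prior}
for x in data:
    joint = {t: joint[t] * lik[t][x] for t in joint}
total = sum(joint.values())
batch = {t: joint[t] / total for t in joint}

assert seq == batch  # exact equality, thanks to rational arithmetic
```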

Recall that the Bayes risk of a decision rule d against the
prior probability function π is

R(π, d) = E_π[R(Θ, d)] = Σ_θ R(θ, d) π(θ).


Now

R(π, d) = Σ_θ Σ_x L(θ, d(x)) f(x | θ) π(θ)
        = Σ_x [ Σ_θ L(θ, d(x)) π(θ | x) ] f(x),   (*)

where f(x) represents the marginal density function of X.
It follows from (*) that the Bayes decision rule d0 is such
that, for each observed value x,

Σ_θ L(θ, d0(x)) π(θ | x) ≤ Σ_θ L(θ, d(x)) π(θ | x) for all d ∈ D.

Example 5.4.5

Consider a decision problem with loss matrix

         a1   a2
θ1        0    8
θ2        4    0

Suppose that the statistician can observe a random variable
X with the following conditional distributions:

P(X = 0 | Θ = θ1) = 3/4,   P(X = 0 | Θ = θ2) = 1/3
P(X = 1 | Θ = θ1) = 1/4,   P(X = 1 | Θ = θ2) = 2/3.

It is required to construct a Bayes decision rule against the
following prior distribution of Θ:


π : P(Θ = θ1) = w,   P(Θ = θ2) = 1 − w,   0 ≤ w ≤ 1.

For x = 0,

π(θ1 | 0) ∝ P(X = 0 | Θ = θ1) P(Θ = θ1) = 3w/4
π(θ2 | 0) ∝ P(X = 0 | Θ = θ2) P(Θ = θ2) = (1 − w)/3.

This implies that

π(θ1 | 0) = (3w/4) / (3w/4 + (1 − w)/3) = 9w / (9w + 4(1 − w))
π(θ2 | 0) = 4(1 − w) / (9w + 4(1 − w)).

The posterior expected losses of the two actions are

Σ_θ L(θ, a1) π(θ | 0) = 0·π(θ1 | 0) + 4·π(θ2 | 0) = 16(1 − w) / (9w + 4(1 − w))
Σ_θ L(θ, a2) π(θ | 0) = 8·π(θ1 | 0) + 0·π(θ2 | 0) = 72w / (9w + 4(1 − w)).


Hence d0(0) = a1 iff 16(1 − w) ≤ 72w, i.e., iff w ≥ 2/11.

Similarly, for x = 1,

π(θ1 | 1) ∝ P(X = 1 | Θ = θ1) P(Θ = θ1) = w/4
π(θ2 | 1) ∝ P(X = 1 | Θ = θ2) P(Θ = θ2) = 2(1 − w)/3,

so that

π(θ1 | 1) = (w/4) / (w/4 + 2(1 − w)/3) = 3w / (3w + 8(1 − w))
π(θ2 | 1) = 8(1 − w) / (3w + 8(1 − w)).

Therefore, d0(1) = a1 iff 32(1 − w) ≤ 24w, i.e., iff w ≥ 4/7.

Conclusion: The Bayes decision rule is

d0 = d1,  0 ≤ w ≤ 2/11
d0 = d3,  2/11 ≤ w ≤ 4/7
d0 = d4,  4/7 ≤ w ≤ 1.


         d1   d2   d3   d4
x = 0    a2   a2   a1   a1
x = 1    a2   a1   a2   a1

             d1    d2    d3    d4
R(θ1, d)      8     6     2     0
R(θ2, d)      0    8/3   4/3    4

From the risk table, d3 is the minimax nonrandomized
decision rule.
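The Bayes rule of Example 5.4.5 can be checked by minimizing the posterior expected loss directly; only the unnormalised posterior weights f(x | θ)π(θ) are needed, since the normalising constant is common to both actions. A minimal sketch (state labels "t1"/"t2" and the function name are ours):

```python
# Bayes rule for Example 5.4.5 as a function of the prior weight w (a sketch).
from fractions import Fraction

loss = {("t1", "a1"): 0, ("t1", "a2"): 8,
        ("t2", "a1"): 4, ("t2", "a2"): 0}
lik = {(0, "t1"): Fraction(3, 4), (1, "t1"): Fraction(1, 4),
       (0, "t2"): Fraction(1, 3), (1, "t2"): Fraction(2, 3)}

def bayes_action(x, w):
    """Minimise sum_theta L(theta, a) * f(x | theta) * pi(theta) over a.
    Normalisation is omitted because it does not affect the argmin."""
    post = {"t1": lik[(x, "t1")] * w, "t2": lik[(x, "t2")] * (1 - w)}
    return min(("a1", "a2"),
               key=lambda a: sum(loss[(t, a)] * post[t] for t in post))

# Thresholds from the text: d0(0) = a1 iff w >= 2/11, d0(1) = a1 iff w >= 4/7.
assert bayes_action(0, Fraction(1, 11)) == "a2"
assert bayes_action(0, Fraction(3, 11)) == "a1"
assert bayes_action(1, Fraction(1, 2)) == "a2"
assert bayes_action(1, Fraction(5, 7)) == "a1"
```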


Example 5.4.6

In the above example (Example 5.4.5), the (randomized)
decision rule with constant risk is a mixture of d3 and d4.
The slope of the line joining the risk points of d3 and d4 is

m = (4 − 4/3) / (0 − 2) = −4/3.

Hence the prior vector π = <w, 1 − w> which is perpendicular
to the vector joining the risk points of d3 and d4 is given by

(1 − w)/w = 3/4, or w = 4/7.

This implies that the randomized decision rule d* with
constant risk,

d* = (0, 0, 6/7, 1/7),

is Bayes against the prior distribution

π : P(Θ = θ1) = 4/7 and P(Θ = θ2) = 3/7.

Notice that d* is the only admissible decision rule with
constant risk (figure omitted).


If δ0 is the unique Bayes decision rule against a prior π,
then δ0 is admissible. More generally, if δ0 is Bayes against
a prior distribution π with

P(Θ = θi) = pi > 0 for all i = 1, …, n,

then δ0 is admissible.

5.5 Sufficiency

It is common practice for statisticians, when confronted with
a mass of data, to compute some simple measure from the
data and then base statistical procedures on this simpler
quantity. Computing such a simpler measure is called
reducing the data; the measures themselves are called
statistics.

A question arises naturally: how much reduction of the data
is possible without losing information regarding the state of
nature?


Sufficient Statistic

A statistic T = t(X̃) is said to be sufficient for a family of
density functions { f(· | θ) ; θ ∈ Θ } if π(θ | x̃1) = π(θ | x̃2)
for any prior distribution of Θ and any data x̃1 and x̃2 of
the same size from the family with t(x̃1) = t(x̃2).

Factorization Theorem

Suppose that f(x̃ | θ) represents the joint density
function of the observed random vector X̃. The
statistic T = t(X̃) is sufficient for θ if and only if

f(x̃ | θ) = g(t(x̃); θ) h(x̃),

where g depends on x̃ only through t(x̃) and h does
not depend on θ.

The notion of sufficiency is due to Fisher, in the early 1920s.
The classical definition is:

A statistic T = t(X̃) is said to be sufficient for the
family of density functions { f(· ; θ) : θ ∈ Θ } if the
conditional distribution of X̃ given T = t does not
depend on θ.


The reason for performing an experiment whose distribution
depends on θ is to learn about θ; if the distribution does not
depend on θ, there is no point in performing the experiment,
since no information about θ is gained by doing so. This
shows, intuitively, why nothing is lost if the data X̃ are
reduced to a sufficient statistic T.

~

More precisely, given any decision rule d ( X ) , there is a decision

rule (T ) for which R ( , ) = R ( , d ) .The concept of

sufficiency is useful for it allows us to focus on decision rules that

are functions of sufficient statistics. We shall illustrate this fact by

example (see Example 5.6.1) and state without proof the

theoretical result as given below.

~

depending only on sufficient statistic T = t ( X ) so that

R( , d ) = R( , ).

The rule δ may be constructed as follows: suppose T = t is
observed. Then generate a value X̃ = x̃* from the conditional
distribution of X̃ given T = t, and take action d(x̃*).


δ(t) is generally a randomized decision rule, since the
action taken depends on x̃*, and x̃* comes about by
performing the auxiliary experiment X̃ | T = t. Thus, for
given θ and T = t, the loss of δ is random, and what is
relevant is the expected loss:

L(θ, δ(t)) = E[L(θ, d(X̃)) | T = t] = Σ_{x̃ : t(x̃)=t} L(θ, d(x̃)) P(X̃ = x̃ | T = t).

Hence

R(θ, δ) = E[L(θ, δ(T))]
        = Σ_t L(θ, δ(t)) P(T = t)
        = Σ_t Σ_{x̃ : t(x̃)=t} L(θ, d(x̃)) P(X̃ = x̃ | T = t) P(T = t)
        = Σ_{x̃} L(θ, d(x̃)) P(X̃ = x̃)
        = R(θ, d).


Example 5.6.1

Consider a decision problem (Θ, A, L) in which A = { a1 , a2 }.
Let X1 and X2 be a random sample of size 2 from a Bernoulli
distribution with parameter θ. Then T = X1 + X2 is sufficient for
θ. Moreover, T is a binomial random variable with parameters
(2, θ).

Consider the decision rule

d(x̃) = a1 if (x1, x2) = (0, 0) or (0, 1);  a2 if (x1, x2) = (1, 0) or (1, 1),

and the rule δ based on T given by

δ(t) = a1 if t = 0;  a2 if t = 2;
δ(1) = a1 if a toss of a fair coin yields a Head, a2 if a Tail.

(Given T = 1, X̃ is (0, 1) or (1, 0) with probability 1/2 each,
and d takes a1 on (0, 1) and a2 on (1, 0); this is the fair-coin
randomization built into δ(1).) Then

R(θ, δ) = E[L(θ, δ(T))]
        = L(θ, a1) P(T = 0) + L(θ, δ(1)) P(T = 1) + L(θ, a2) P(T = 2)
        = L(θ, a1)(1 − θ)² + (1/2)[L(θ, a1) + L(θ, a2)]·2θ(1 − θ) + L(θ, a2)θ²
        = L(θ, a1){(1 − θ)² + θ(1 − θ)} + L(θ, a2){θ(1 − θ) + θ²}
        = L(θ, a1){P(X̃ = (0,0)) + P(X̃ = (0,1))} + L(θ, a2){P(X̃ = (1,1)) + P(X̃ = (1,0))}
        = R(θ, d).
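The identity R(θ, δ) = R(θ, d) for this example can be checked numerically over a grid of θ. A minimal sketch: the loss values `l1` and `l2` stand in for L(θ, a1) and L(θ, a2) and are test values of ours, not from the text.

```python
# Checking R(theta, delta) = R(theta, d) for Example 5.6.1 (a sketch;
# l1, l2 are stand-ins for the losses L(theta, a1), L(theta, a2)).
def risk_d(theta, l1, l2):
    """Risk of d: action a1 on {(0,0), (0,1)}, a2 on {(1,0), (1,1)}."""
    p = {(x1, x2): (theta if x1 else 1 - theta) * (theta if x2 else 1 - theta)
         for x1 in (0, 1) for x2 in (0, 1)}
    return l1 * (p[(0, 0)] + p[(0, 1)]) + l2 * (p[(1, 0)] + p[(1, 1)])

def risk_delta(theta, l1, l2):
    """Risk of delta: a1 if T = 0, a2 if T = 2, a fair coin between them if T = 1."""
    pT = {0: (1 - theta) ** 2, 1: 2 * theta * (1 - theta), 2: theta ** 2}
    return l1 * pT[0] + (l1 + l2) / 2 * pT[1] + l2 * pT[2]

# Both reduce algebraically to l1*(1 - theta) + l2*theta.
for theta in (0.0, 0.3, 0.5, 0.9):
    assert abs(risk_d(theta, 1.0, 5.0) - risk_delta(theta, 1.0, 5.0)) < 1e-12
```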
