
Stat 13 Final Review

A. Probability tables to use
B. Variance algebra, correlation, covariance, regression
C. Probability and conditional probability


Stat 13 Final Review
A. Probability tables to use

Before (midterm)
Normal distribution
How to standardize? Mean; Variance = (standard deviation)^2
Review lecture 3, especially slide 4

After
Chi-square distribution, t-distribution
Degrees of freedom (d.f.):
1. For one sample, d.f. = n - 1 (lecture 12, slide 9; lectures 13, 14)
2. For frequency/count data, d.f. = number of cells - 1 - number of parameters estimated (lectures 15, 16, 20/21)
3. For linear regression, d.f. = sample size - 2 (lecture 25)
Pearson's chi-square statistic: $\sum (\text{observed} - \text{expected})^2 / \text{expected}$
For a test of independence, degrees of freedom equal (# columns - 1)(# rows - 1)


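Stat 13 Final Review
Part B

Before (midterm)
Variance algebra, confidence interval
Independent: $\mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$
Dependent: $\mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) - 2\,\mathrm{Cov}(X, Y)$
Standard error of the mean (lectures 6, 7)

After
Regression line: slope equals $r\,[\mathrm{SD}(Y)/\mathrm{SD}(X)]$, where $r$ is the correlation coefficient (lectures 23, 24, 25)
Correlation = $\mathrm{Cov}(X, Y) / [\mathrm{SD}(X)\,\mathrm{SD}(Y)]$

Consistency: if you use n - 1 in computing the SD, then use n - 1 for averaging the products.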
Practice: step by step for covariance, variance, and correlation coefficient.

x    y    X-EX   Y-EY   product   (X-EX)^2   (Y-EY)^2
2    4    -5     -1.5    7.5      25         2.25
4    3    -3     -2.5    7.5       9         6.25
6    6    -1      0.5   -0.5       1         0.25
8    5     1     -0.5   -0.5       1         0.25
10   8     3      2.5    7.5       9         6.25
12   7     5      1.5    7.5      25         2.25

EX = 7, EY = 5.5, SD(X) = sqrt(35/3) = 3.4, SD(Y) = 1.7
Cov = 29/6, Corr = Cov/[SD(X) SD(Y)] = 0.828
Use the population version, so divide by n.
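The step-by-step table can be reproduced with a few lines of Python. This is only a sketch using the six (x, y) pairs from the practice table and the population version (divide by n); the variable names are chosen here for illustration.

```python
from math import sqrt

# Data from the practice table
x = [2, 4, 6, 8, 10, 12]
y = [4, 3, 6, 5, 8, 7]
n = len(x)

# Step 1: means
ex = sum(x) / n
ey = sum(y) / n

# Step 2: deviations from the mean
dx = [xi - ex for xi in x]
dy = [yi - ey for yi in y]

# Step 3: average the products and the squares (population version: divide by n)
cov = sum(a * b for a, b in zip(dx, dy)) / n
var_x = sum(a * a for a in dx) / n
var_y = sum(b * b for b in dy) / n

# Step 4: correlation = cov / (SD(X) * SD(Y))
corr = cov / (sqrt(var_x) * sqrt(var_y))

print(f"EX={ex}, EY={ey}")
print(f"SD(X)={sqrt(var_x):.2f}, SD(Y)={sqrt(var_y):.2f}")
print(f"Cov={cov:.4f}, Corr={corr:.3f}")
```

If n - 1 were used instead of n in both the SD and the averaged product, the correlation would come out the same, which is the consistency point made above.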
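Algebra for variance, covariance

$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$
$\mathrm{Var}(X) = \mathrm{Cov}(X, X)$
$\mathrm{Var}(X + a) = \mathrm{Var}(X)$
$\mathrm{Cov}(X + a, Y + b) = \mathrm{Cov}(X, Y)$
$\mathrm{Cov}(aX, bY) = ab\,\mathrm{Cov}(X, Y)$
$\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$
$\mathrm{Cov}(X + Y, Z) = \mathrm{Cov}(X, Z) + \mathrm{Cov}(Y, Z)$
$\mathrm{Cov}(X + Y, V + W) = \mathrm{Cov}(X, V) + \mathrm{Cov}(X, W) + \mathrm{Cov}(Y, V) + \mathrm{Cov}(Y, W)$

TRICK: pretend all means are zero; $(X + Y)(V + W) = XV + XW + YV + YW$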
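A quick numerical check of a few of these rules, as a sketch only: it reuses the six (x, y) pairs from the covariance practice table, and the helper names `mean`, `cov`, and `var` are defined here for illustration (population version, divide by n).

```python
x = [2, 4, 6, 8, 10, 12]
y = [4, 3, 6, 5, 8, 7]

def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    """Population covariance: average product of deviations (divide by n)."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

def var(a):
    """Var(X) = Cov(X, X)."""
    return cov(a, a)

# Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
s = [xi + yi for xi, yi in zip(x, y)]
lhs = var(s)
rhs = var(x) + var(y) + 2 * cov(x, y)

# Var(X + a) = Var(X) and Var(aX) = a^2 Var(X)
shifted = var([xi + 100 for xi in x])
scaled = var([3 * xi for xi in x])

print(lhs, rhs)
print(shifted, var(x))
print(scaled, 9 * var(x))
```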
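Lecture 7: Accuracy of the sample mean $\bar{X}$
$\mathrm{Var}(\bar{X}) = \mathrm{Var}(X)$ divided by the sample size $n$
What is $\bar{X}$ (X bar)? It is called the sample mean.

Standard error of the mean = $\mathrm{SD}(\bar{X})$ = $\mathrm{SD}(X)$ divided by the square root of $n$
As the sample size increases, the sample mean becomes more and more accurate in estimating the population mean.
Sample size needed to meet an accuracy requirement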
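The sample-size question can be answered by inverting SD(X)/sqrt(n). The sketch below assumes the accuracy requirement is stated as a target standard error; the function names and the numbers in the usage lines (SD 15, target 3) are hypothetical and not taken from the lecture.

```python
from math import ceil, sqrt

def standard_error(sd, n):
    """Standard error of the mean: SD(X) / sqrt(n)."""
    return sd / sqrt(n)

def required_sample_size(sd, target_se):
    """Smallest n with SD(X)/sqrt(n) <= target_se, i.e. n >= (sd/target_se)^2.

    The ratio is rounded to 9 decimals before taking the ceiling so that
    floating-point noise does not push an exact answer up by one.
    """
    return ceil(round((sd / target_se) ** 2, 9))

# Hypothetical requirement: population SD 15, standard error at most 3
n_needed = required_sample_size(15, 3)
print(n_needed, standard_error(15, n_needed))
```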

Stat 13 Final Review
Part C
Probability function: mean and standard deviation; lectures 19, 20, 21
Conditional probability: tree, table; should know how to update probability (Bayes theorem); lectures 17, 18


Binomial and Poisson
You should remember the binomial:
$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$

I will provide the Poisson formula in the exam; you should know how to use it:
$P(X = x) = e^{-\lambda} \lambda^x / x!$, where $e = 2.71828\ldots$


Office hours next week
Monday, Wednesday 3-4 pm
My office: Geology 4608


Lecture 3: Normal distribution, stem-and-leaf, histogram
Idealized population: a box of infinitely many tickets; each ticket has a value.
Random variable and probability statement, e.g. P(X < 85)
Notation, Greek letters: mean (expected value) and standard deviation, $E(X) = \mu$, $\mathrm{SD}(X) = \sigma$, $\mathrm{Var}(X) = \sigma^2$
Examples
Empirical distribution: stem-and-leaf, histogram
Three variants of histogram: frequency, relative frequency, density (called "standardized" in the book)
Same shape with different vertical scales
Density = relative frequency / length of interval


Given a box of tickets with values that come from a normal distribution with mean 75 and standard deviation 15, what is the probability that a randomly selected ticket will have a value less than 85?

Let X be the number selected (a random variable). We want Pr(X < 85).


How does the normal table work?
Start from Z = 0.0, then Z = 0.1
An increasing pattern is observed
On the negative side of Z, use symmetry


How to standardize?
Find the mean
Find the standard deviation
Z = (X - mean)/SD

Reverse questions:
How to recover X from Z?
How to recover X from a percentile?

Suppose 20 percent of the students fail the exam.
What is the passing grade?
Go from the percentage to Z, using the normal table
Convert Z into X, using X = mean + Z times SD
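The two reverse steps can be sketched with Python's standard library in place of the printed normal table. The slide does not state the exam's mean and SD, so the values 75 and 15 from the ticket example are assumed here purely for illustration.

```python
from statistics import NormalDist

# Assumed for illustration: exam scores are normal with mean 75 and SD 15
mean, sd = 75, 15

# Step 1: go from the percentage (20% fail) to Z
z = NormalDist().inv_cdf(0.20)

# Step 2: convert Z into X using X = mean + Z * SD
passing_grade = mean + z * sd

print(f"Z = {z:.3f}, passing grade = {passing_grade:.1f}")
```

Because 20% is below one half, Z is negative, so the passing grade falls below the mean.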


Probability for an interval
P(60 < X < 85)

Draw the curve (locate the mean and the endpoints of the interval)

P(60 < X < 85) = P(X < 85) - P(X < 60), where
P(X < 60) = P(Z < (60 - 75)/15) = P(Z < -1) = 1 - P(Z < 1) = 1 - .841 = about .16
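The same interval probability can be checked numerically. This is a minimal sketch using `statistics.NormalDist` from the Python standard library in place of the normal table, with the mean 75 and SD 15 of the ticket example.

```python
from statistics import NormalDist

scores = NormalDist(mu=75, sigma=15)

# P(60 < X < 85) = P(X < 85) - P(X < 60)
p_below_85 = scores.cdf(85)
p_below_60 = scores.cdf(60)   # same as P(Z < -1)
p_interval = p_below_85 - p_below_60

# Symmetry check: P(Z < -1) = 1 - P(Z < 1)
p_symmetry = 1 - NormalDist().cdf(1)

print(f"P(X<85)={p_below_85:.4f}  P(X<60)={p_below_60:.4f}  P(60<X<85)={p_interval:.4f}")
```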

Lecture 12: Brownian motion, chi-square distribution, d.f.
Adjusted schedule ahead
Chi-square distribution (lot of supplementary material, come to class!!!): 1 lecture
Hypothesis testing (about the SD of measurement error) and P-value (why n - 1? supplement): 1 lecture
Chi-square test for model validation (chapter 11)
Probability calculation (chapter 4)
Binomial distribution and Poisson (chapter 5, supplement, horse-kick death cavalry data, hitting the lottery, SARS infection)
Correlation, prediction, regression (supplement)
t-distribution, F-distribution


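Slide 9 of lecture 12
$R^2 = (X_1 - A)^2 + (X_2 - A)^2 + \cdots + (X_n - A)^2$; $A = (X_1 + \cdots + X_n)/n$ = average

$R^2$ follows a chi-square distribution with $n - 1$ degrees of freedom (when each X is normal with variance 1).

If the variance of each normal X is $\sigma^2$, then $D^2/\sigma^2$ follows a chi-square distribution with $n$ degrees of freedom.
$R^2/\sigma^2$ follows a chi-square distribution with $n - 1$ degrees of freedom; this is also true even if the mean of the normal distribution (for each X) is not zero (why?)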

Lecture 13: Chi-square and sample variance
Finish the discussion of the chi-square distribution from lecture 12
Expected value of the sum of squares equals n - 1.
Why divide by n - 1 in computing the sample variance?
It gives an unbiased estimate of the true variance of the measurement error.
Testing a hypothesis about the true SD of the measurement error
Confidence interval for the true SD of the measurement error


Slide 4, lecture 13
Measurement error = reading from an instrument - true value
One biotech company specializing in microarray gene expression profiling claims they can measure the expression level of a gene with an error of size .1 (that is, after testing their method numerous times, they found that the standard deviation of their measurement errors is 0.1). The distribution of errors follows a normal distribution with mean 0 (unbiased).
Cells from a tumor tissue of a patient are sent to this company for a microarray assay. To assure consistency, the company repeats the assay 4 times. The result for one gene, P53 (the most well-studied tumor suppressor gene), is 1.1, 1.4, 1.5, 1.2.

Is there enough evidence to reject the company's claim about the accuracy of measurement? Note that the sample SD is sqrt(0.1/3), bigger than 0.1.
This problem can be solved by using the chi-square distribution. We ask how likely it is to observe a sample SD this big; if the probability is small, then we have good evidence that the claim may be false. (next lecture)



Lecture 14: chi-square test, P-value
Measurement error (review from lecture 13)
Null hypothesis; alternative hypothesis
Evidence against the null hypothesis
Measuring the strength of evidence by P-value
Presetting the significance level
Conclusion
Confidence interval


The test statistic is obtained by experience or statistical training; it depends on the formulation of the problem and how the data are related to the hypothesis.
Find the strength of evidence by P-value: for a future set of data generated under the null hypothesis, compute the probability that the summary test statistic will be as large as or even greater than the one obtained from the current data. If the P-value is very small, then either the null hypothesis is false or you are extremely unlucky. So a statistician will argue that this is strong evidence against the null hypothesis.
If the P-value is smaller than a prespecified level (called the significance level, 5% for example), then the null hypothesis is rejected.


Back to the microarray example
H0: true SD $\sigma = 0.1$ (denote 0.1 by $\sigma_0$)
H1: true SD $\sigma > 0.1$ (because this is the main concern; you don't care if the SD is small)
Summary:
Sample SD $s$ = square root of (sum of squares/(n - 1)) = 0.18
where sum of squares = $(1.1 - 1.3)^2 + (1.2 - 1.3)^2 + (1.4 - 1.3)^2 + (1.5 - 1.3)^2 = 0.1$, $n = 4$
The ratio $s/\sigma_0 = 1.8$: is it too big?
The P-value consideration:
Suppose a future data set (n = 4) will be collected.
Let $s$ be the sample SD from this future data set; it is random; so what is the probability that $s/\sigma_0$ will be as big as or bigger than 1.8? $P(s/\sigma_0 > 1.8)$


$P(s/\sigma_0 > 1.8)$
But to find the probability we need to use the chi-square distribution:
Recall that sum of squares / true variance follows a chi-square distribution;
therefore, equivalently, we compute
P(future sum of squares/$\sigma_0^2$ > sum of squares from the currently available data/$\sigma_0^2$) (recall $\sigma_0$ is the value claimed under the null hypothesis).


Once again, if the data were generated again, then sum of squares / true variance is random and follows a chi-square distribution with n - 1 degrees of freedom, where sum of squares = sum of squared distances between each data point and the sample mean.
Note: sum of squares = (n - 1) x sample variance = (n - 1)(sample SD)^2
P-value = P(chi-square random variable > value computed from data) = P(chi-square random variable > 10.0)
For our case, n = 4; so look at the chi-square distribution with d.f. = 3; from the table (cutoffs 9.348 for .025 and 11.34 for .01) we see that the P-value is between .025 and .01, so reject the null hypothesis at the 5% significance level.

The value computed from the available data = .10/.01 = 10
(note sum of squares = .1, true variance = $.1^2$ = .01)
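The table lookup can be cross-checked numerically. The sketch below computes the test statistic from the four P53 readings and uses a closed-form upper-tail formula that holds for the chi-square distribution with exactly 3 degrees of freedom; the function name `chi2_sf_df3` is made up for this example, and other degrees of freedom would need a table or a statistics library.

```python
from math import erfc, exp, pi, sqrt

def chi2_sf_df3(x):
    """P(chi-square with 3 d.f. > x), using the closed form valid for d.f. = 3."""
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)

data = [1.1, 1.4, 1.5, 1.2]   # four repeated assays of P53
sigma0 = 0.1                  # SD claimed under the null hypothesis

mean = sum(data) / len(data)
sum_of_squares = sum((v - mean) ** 2 for v in data)
statistic = sum_of_squares / sigma0 ** 2   # compare with chi-square, d.f. = n - 1 = 3
p_value = chi2_sf_df3(statistic)

print(f"sum of squares = {sum_of_squares:.3f}")
print(f"statistic = {statistic:.2f}, P-value = {p_value:.4f}")
```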
Confidence interval
A 95% confidence interval for the true variance $\sigma^2$ is
(sum of squares/C2, sum of squares/C1)
where C1 and C2 are the cutting points from the chi-square table with d.f. = n - 1 so that
P(chi-square random variable > C1) = .975
P(chi-square random variable > C2) = .025
This interval is derived from
$P(C_1 < \text{sum of squares}/\sigma^2 < C_2) = .95$
For our data, sum of squares = .1; from the d.f. = 3 row of the table, C1 = .216, C2 = 9.348; so the confidence interval for $\sigma^2$ is .0107 to .4629. How about a confidence interval for $\sigma$?
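One way to approach the closing question is to take square roots of both endpoints, since the square root is an increasing function. The sketch below uses the table cutoffs quoted above (C1 = .216, C2 = 9.348 for d.f. = 3); the variable names are chosen here for illustration.

```python
from math import sqrt

sum_of_squares = 0.1
c1, c2 = 0.216, 9.348   # chi-square cutoffs for d.f. = 3 (upper tail .975 and .025)

# 95% confidence interval for the variance
var_low = sum_of_squares / c2
var_high = sum_of_squares / c1

# Square roots of the endpoints give an interval for the SD
sd_low, sd_high = sqrt(var_low), sqrt(var_high)

print(f"variance: {var_low:.4f} to {var_high:.4f}")
print(f"SD:       {sd_low:.3f} to {sd_high:.3f}")
```

The claimed SD of 0.1 lies just below the lower end of the SD interval, which agrees with rejecting the claim in the one-sided test.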

Lecture 15: Categorical data and chi-square tests
Continuous variable: height, weight, gene expression level, lethal dosage of an anticancer compound, etc.
Ordinal
Categorical variable: sex, profession, political party, blood type, eye color, phenotype, genotype
Questions: does smoking cause lung cancer? Do smokers have a high lung cancer rate?
Do the 4 nucleotides, A, T, G, C, occur equally likely?


Lecture 16: chi-square test (continued)
Suppose 160 pairs of consecutive nucleotides are selected at random.
Are the data compatible with the independent occurrence assumption?

      A    T    G    C
A    15   10   13    7
T    10   13    7   10
G    10   10   10   10
C     5   12   10    8


Independence implies that the joint probability equals the product of the marginal probabilities
Let P(first nucleotide = A) = $P_{A1}$, P(first nucleotide = T) = $P_{T1}$, and so on
Let P(second nucleotide = A) = $P_{A2}$, P(second nucleotide = T) = $P_{T2}$, and so on
$P(AA) = P_{A1} P_{A2}$
$P(AT) = P_{A1} P_{T2}$
We do not assume $P_{A1} = P_{A2}$, and so on

Expected value in ( ); d.f. = (# of rows - 1)(# of columns - 1)

      A            T            G            C
A    15 (11.25)   10 (12.66)   13 (11.25)    7 (9.84)
T    10 (10)      13 (11.25)    7 (10)      10 (8.75)
G    10 (10)      10 (11.25)   10 (10)      10 (8.75)
C     5 (8.75)    12 (9.84)    10 (8.75)     8 (7.66)

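Pearson's chi-square statistic = 6.84 < 16.92 (the 5% cutoff for d.f. = 9), so the P-value is above .05 and the data are compatible with the independence assumption.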
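The expected counts and Pearson's statistic can be recomputed directly from the 4 x 4 table of observed pair counts. This is a sketch in plain Python with variable names chosen here; each expected count is (row total x column total)/grand total, which is what independence with estimated marginal probabilities implies.

```python
# Observed counts of consecutive nucleotide pairs, rows and columns in the order A, T, G, C
observed = [
    [15, 10, 13, 7],
    [10, 13, 7, 10],
    [10, 10, 10, 10],
    [5, 12, 10, 8],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Pearson's chi-square statistic: sum of (observed - expected)^2 / expected
statistic = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)

for exp_row in expected:
    print(["%.2f" % e for e in exp_row])
print(f"statistic = {statistic:.2f}, d.f. = {df}")
```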
Simple or composite hypothesis
Simple: parameters are completely specified
Composite: parameters are not specified and have to be estimated from the data

Loss of 1 degree of freedom per parameter estimated
Number of parameters estimated = (# of rows - 1) + (# of columns - 1)
So the d.f. for the chi-square test is # of cells - 1 - (# of rows - 1) - (# of columns - 1) = (# of rows - 1)(# of columns - 1)

Test of independence in a contingency table
Are SARS death rates independent of countries? Data from the LA Times, as of Monday 5 pm (from Wednesday, April 30, 2003)

         China   Hong Kong   Singapore   Canada   others
cases    3303    1557        199         344      243
death     148     138         23          21       11

d.f. = 1 times 4 = 4; but wait, convert to a death/alive table first

Expected values in ( ); d.f. = 4

         China           Hong Kong     Singapore   Canada        others        total
death     148 (199.5)     138 (94)      23 (12)     21 (20.8)     11 (14.7)     341
alive    3155 (3103.5)   1419 (1463)   176 (187)   323 (323.2)   232 (228.3)   5305
total    3303            1557          199         344           243           5646

Pearson's chi-square statistic = 47.67 > 18.47; P-value < .001; reject the null hypothesis: the data are incompatible with the independence assumption
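The conversion to a death/alive table and the resulting statistic can be sketched as follows. The counts are those in the table; the variable names and the loop structure are choices made for this example.

```python
countries = ["China", "Hong Kong", "Singapore", "Canada", "others"]
cases = [3303, 1557, 199, 344, 243]
deaths = [148, 138, 23, 21, 11]

# Convert cases/deaths into a death/alive table: the two rows must partition the cases
alive = [c - d for c, d in zip(cases, deaths)]

total_cases = sum(cases)
overall_death_rate = sum(deaths) / total_cases   # pooled rate under independence

statistic = 0.0
for name, c, d, a in zip(countries, cases, deaths, alive):
    expected_dead = c * overall_death_rate
    expected_alive = c - expected_dead
    statistic += (d - expected_dead) ** 2 / expected_dead
    statistic += (a - expected_alive) ** 2 / expected_alive
    print(f"{name:10s} expected deaths {expected_dead:7.1f}, expected alive {expected_alive:7.1f}")

df = (2 - 1) * (len(countries) - 1)
print(f"statistic = {statistic:.2f}, d.f. = {df}")
```

Using the cases row directly as if it were a category would double count the deaths, which is why the table is converted first.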
Lectures 20/21: Poisson distribution
As a limit to the binomial when n is large and p is small.
A theorem by Siméon Denis Poisson (1781-1840).
Parameter $\lambda = np$ = expected value
As n is large and p is small, the binomial probability can be approximated by the Poisson probability function
$P(X = x) = e^{-\lambda} \lambda^x / x!$, where $e = 2.71828\ldots$
Ion channel modeling: n = number of channels in cells and p is the probability of opening for each channel


Binomial and Poisson approximation

x    Binomial (n = 100, p = .01)    Poisson ($\lambda = np = 1$)
0    .366032                        .367879
1    .36973                         .367879
2    .184865                        .183940
3    .06099                         .061313
4    .014942                        .015328
5    .002898                        .003066
6    .000463                        .000511
7
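The two columns can be regenerated from the probability functions. This is a sketch with helper names (`binomial_pmf`, `poisson_pmf`) chosen here, using only the Python standard library.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial with n trials and success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson with parameter lam."""
    return exp(-lam) * lam ** x / factorial(x)

n, p = 100, 0.01
lam = n * p   # Poisson parameter = np = expected value

print(" x   binomial   Poisson")
for x in range(8):
    print(f"{x:2d}   {binomial_pmf(x, n, p):.6f}   {poisson_pmf(x, lam):.6f}")
```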


Advantage: no need to know n and p; estimate the parameter $\lambda$ from data

X = number of deaths   frequencies
0                      109
1                       65
2                       22
3                        3
4                        1
total                  200
200 yearly reports of death by horse kick from 10 cavalry corps over a period of 20 years in the 19th century by Prussian officials.

x    Data frequencies   Poisson probability   Expected frequencies
0    109                .5435                 108.7
1     65                .3315                  66.3
2     22                .101                   20.2
3      3                .0205                   4.1
4      1                .003                    0.6
     200
Pool the last two cells and conduct a chi-square test to see if the Poisson model is compatible with the data or not. Degrees of freedom = 4 - 1 - 1 = 2. Pearson's statistic = .304; P-value is .859 (you can only tell it is between .95 and .2 from the table in the book); accept the null hypothesis: the data are compatible with the model
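The fit can be retraced in a few lines. This sketch estimates the Poisson parameter by the sample mean, pools the cells for 3 and 4 deaths, and uses the fact that the upper tail of a chi-square with 2 degrees of freedom is exp(-x/2). Because the expected frequencies are not rounded here, the statistic comes out close to, but not exactly equal to, the .304 quoted above.

```python
from math import exp, factorial

# Horse-kick data: number of deaths -> number of yearly reports
counts = {0: 109, 1: 65, 2: 22, 3: 3, 4: 1}
n_reports = sum(counts.values())

# Estimate the Poisson parameter by the sample mean
lam = sum(x * f for x, f in counts.items()) / n_reports

# Expected frequencies under the fitted Poisson model
expected = {x: n_reports * exp(-lam) * lam ** x / factorial(x) for x in counts}

# Pool the last two cells (3 and 4 deaths)
obs_pooled = [counts[0], counts[1], counts[2], counts[3] + counts[4]]
exp_pooled = [expected[0], expected[1], expected[2], expected[3] + expected[4]]

statistic = sum((o - e) ** 2 / e for o, e in zip(obs_pooled, exp_pooled))
df = len(obs_pooled) - 1 - 1       # cells - 1 - one estimated parameter
p_value = exp(-statistic / 2)      # exact upper tail for d.f. = 2 only

print(f"lambda = {lam:.2f}")
print(f"statistic = {statistic:.3f}, d.f. = {df}, P-value = {p_value:.3f}")
```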
Rutherford and Geiger (1910)
Polonium source placed a short distance from a small screen. For each of 2608 eighth-minute intervals, they recorded the number of alpha particles impinging on the screen.

Other related applications in medical imaging: X-ray, PET scan (positron emission tomography), MRI

# of particles   Observed frequency   Expected freq.
0                 57                   54
1                203                  211
2                383                  407
3                525                  526
4                532                  508
5                408                  394
6                273                  254
7                139                  140
8                 45                   68
9                 27                   29
10                10                   11
11+                6                    6
Pearson's chi-square statistic = 12.955; d.f. = 12 - 1 - 1 = 10
Poisson parameter = 3.87; P-value between .2 and .25. Accept the null hypothesis: the data are compatible with the Poisson model


Poisson process for modeling the number of event occurrences in a spatial or temporal domain
Homogeneity: the rate of occurrence is uniform
Independent occurrence in non-overlapping areas
Non-clumping



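Stat 13 lecture 25: regression (continued; SE, t and chi-square)
Simple linear regression model:
$Y = \beta_0 + \beta_1 X + \varepsilon$
Assumption: $\varepsilon$ is normal with mean 0 and variance $\sigma^2$
The fitted line is obtained by minimizing the sum of squared residuals; that is, finding $\beta_0$ and $\beta_1$ so that

$(Y_1 - \beta_0 - \beta_1 X_1)^2 + \cdots + (Y_n - \beta_0 - \beta_1 X_n)^2$ is as small as possible
This method is called the least squares method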


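The least squares line is the same as the regression line discussed before
It follows that the estimated slope $\hat\beta_1$ can be computed by
$r\,[\mathrm{SD}(Y)/\mathrm{SD}(X)] = [\mathrm{Cov}(X, Y)/(\mathrm{SD}(X)\,\mathrm{SD}(Y))]\,[\mathrm{SD}(Y)/\mathrm{SD}(X)] = \mathrm{Cov}(X, Y)/\mathrm{Var}(X)$ (this is the same as the equation for $\hat\beta_1$ on page 518)
The intercept is estimated by putting x = 0 in the regression line, yielding the equation on page 518
Therefore, there is no need to memorize the equation for the least squares line; computationally it is advantageous to use $\mathrm{Cov}(X, Y)/\mathrm{Var}(X)$ instead of $r\,[\mathrm{SD}(Y)/\mathrm{SD}(X)]$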


Finding residuals and estimating the variance of $\varepsilon$
Residuals = differences between Y and the regression line (the fitted line)
An unbiased estimate of $\sigma^2$ is [sum of squared residuals]/(n - 2)
Why divide by (n - 2)?
The degrees of freedom are n - 2 because two parameters were estimated
[sum of squared residuals]/$\sigma^2$ follows a chi-square distribution with n - 2 degrees of freedom.


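Hypothesis testing for slope
The slope estimate $\hat\beta_1$ is random
It follows a normal distribution with mean equal to the true $\beta_1$ and variance equal to $\sigma^2/[n\,\mathrm{Var}(X)]$
Because $\sigma^2$ is unknown, we have to estimate it from the data; the SE (standard error) of the slope estimate is equal to the square root of the above, with $\sigma^2$ replaced by its estimate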

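t-distribution
Suppose an estimate $\hat\theta$ is normal with mean $\theta$ and variance $c^2\sigma^2$.
Suppose $\sigma^2$ is estimated by $s^2$, which is related to a chi-square distribution.
Then $(\hat\theta - \theta)/(c\,s)$ follows a t-distribution with degrees of freedom equal to the chi-square degrees of freedom.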


An example
Determining small quantities of calcium in the presence of magnesium is a difficult problem for analytical chemists. One method involves the use of alcohol as a solvent.
The data below show the results when the method was applied to 10 mixtures with known quantities of CaO. The second column gives the amount of CaO recovered.
Question of interest: test to see if the intercept is 0; test to see if the slope is 1.


X: CaO present   Y: CaO recovered   Fitted value   Residual
 4.0              3.7                3.751         -.051
 8.0              7.8                7.73           .070
12.5             12.1               12.206         -.106
16.0             15.6               15.688         -.088
20.0             19.8               19.667          .133
25.0             24.5               24.641         -.141
31.0             31.1               30.609          .491
36.0             35.5               35.583         -.083
40.0             39.4               39.562         -.161
40.0             39.5               39.562         -.062

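Least Squares Estimates:

            Estimate     Standard error
Constant    -0.228090    (0.137840)
Predictor    0.994757    (5.219485E-3)

R Squared: 0.999780 (squared correlation)
Sigma hat: 0.206722 (estimate of SD($\varepsilon$))
Number of cases: 10
Degrees of freedom: 8

Test of slope = 1: (1 - 0.994757)/5.219485E-3 = 1.0045
Test of intercept = 0: .22809/.1378 = 1.6547
Compare each ratio with the t-distribution with 8 degrees of freedom; both are below the two-sided 5% cutoff of 2.306.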
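The printed estimates can be reproduced from the ten (X, Y) pairs. This is a sketch in plain Python: the slope uses Cov(X, Y)/Var(X) written as a ratio of sums, and its standard error is the square root of sigma-hat squared over n Var(X), as in the slides. The standard error of the intercept uses the usual textbook formula, which does not appear in the slides and is included here as an assumption.

```python
from math import sqrt

x = [4.0, 8.0, 12.5, 16.0, 20.0, 25.0, 31.0, 36.0, 40.0, 40.0]   # CaO present
y = [3.7, 7.8, 12.1, 15.6, 19.8, 24.5, 31.1, 35.5, 39.4, 39.5]   # CaO recovered
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((a - mean_x) ** 2 for a in x)                        # n * Var(X)
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))   # n * Cov(X, Y)

slope = sxy / sxx                      # Cov(X, Y) / Var(X)
intercept = mean_y - slope * mean_x

# Residuals and the estimate of SD(epsilon), dividing by n - 2
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
sigma_hat = sqrt(sum(r * r for r in residuals) / (n - 2))

se_slope = sigma_hat / sqrt(sxx)
# Assumption: standard textbook formula for the intercept's standard error
se_intercept = sigma_hat * sqrt(1 / n + mean_x ** 2 / sxx)

# t ratios for H0: slope = 1 and H0: intercept = 0 (d.f. = n - 2 = 8)
t_slope = (1 - slope) / se_slope
t_intercept = (0 - intercept) / se_intercept

print(f"intercept = {intercept:.6f} (SE {se_intercept:.6f})")
print(f"slope     = {slope:.6f} (SE {se_slope:.6e})")
print(f"sigma hat = {sigma_hat:.6f}")
print(f"t for slope = {t_slope:.4f}, t for intercept = {t_intercept:.4f}")
```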
