
An Introduction to Statistics in Matlab

This guide is derived from the help files within Matlab. It identifies important concepts when working with Matlab as well as statistical tools that one might commonly use. It is not intended to be all-inclusive or explain all functions in great detail, but to serve as a quick reference when working with statistics in Matlab. For more information regarding the concepts addressed in this guide or for Matlab operations that this guide omits, please refer to the Matlab help files, accessed through the menus within Matlab.

Table of Contents
1. An Introduction to Matrices and Arithmetic Operators
   a. Matrices
   b. Operators
2. Basic Statistical Commands
3. Hypothesis Testing
   a. Z-test
   b. T-test
   c. Two-sample t-test
   d. ANOVA
4. Bootstrap
5. Plotting with Matlab
6. Regression and Curve-fitting
   a. Command-based Regression
      i. Setting up the Design Matrix by Hand
      ii. Using polyfit
   b. Graphical-based Regression
   c. Curve fitting tool
7. Working with Probability Distributions
   a. CDF
   b. Parameter Estimation
   c. Quantiles
   d. Loglikelihood
   e. PDF
   f. Random number generation


An Introduction to Matrices and Arithmetic Operators

In Matlab, matrix operations are the basis for all calculations. A matrix is a two-dimensional array of real or complex numbers. A column vector is an m-by-1 matrix, created by the statement: c = [3;1;4];

A row vector is a 1-by-n matrix, created by the statement: r = [3 1 4];

A scalar is a 1-by-1 matrix, created by the statement: s = 7;

To create an m-by-n matrix, you may string together row vectors, separated by a semicolon: m = [1 4 7 10; 2 5 8 11; 3 6 9 12];

Note that the use of the semicolon at the end of each line prevents the answer from being immediately returned by Matlab. To immediately view the output, simply leave off the semicolon.

The dimensions of a matrix are displayed for each matrix (excluding vectors) in the workspace window. Alternatively, one can use the following command: a = size(A);

To obtain the transpose of a matrix, use "'" (the apostrophe). For example, g = [4 3 1]; h = g'

(which returns the column vector h = [4; 3; 1])

The identity matrix may be obtained using the command "eye". i = eye(2) (which returns i = [1 0; 0 1])

i = eye(3,2) (which returns i = [1 0; 0 1; 0 0])

To get the inverse of a matrix, use "inv". A = [1 1 1; 1 2 3; 1 3 6]; X = inv(A)

(which returns X = [3 -3 1; -3 5 -2; 1 -2 1])

To get the determinant of a matrix, use "det". d = det(A) (which returns d = 1)

To get a matrix of ones [or zeros], use the following commands: a = ones(2,3); b = zeros(2,3); which will return 2-by-3 matrices of ones [or zeros, respectively].

For simplicity (and because one often needs matrices of ones or zeros that are the same size as other matrices), use the command: c = zeros(size(C)); to get a matrix of zeros that has the same dimensions as C.

Missing values may be indicated with the value NaN. To remove these values from your named variable, use the following command: x = x(~isnan(x)); This may be necessary when performing certain calculations, as a command may simply return NaN when you wished to disregard these values.
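As a brief illustration of the NaN handling described above, the following sketch uses a small made-up vector (the values are hypothetical, chosen only for the example) to show how a missing value propagates through mean and how logical indexing with isnan removes it.

```matlab
% Hypothetical data with two missing values
x = [2.5 NaN 4.0 NaN 3.5];

m_raw   = mean(x);          % NaN, because the NaN entries propagate
x_clean = x(~isnan(x));     % keep only the entries that are not NaN
m_clean = mean(x_clean);    % mean of the remaining three values
```

Note the "~" (logical not): without it, the indexing would keep only the NaN entries.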

To add and subtract matrices, they must be of the same size (same number of rows and columns). Then, simply use "+" or "-". For example, A = [2 3; 4 5]; B = [2 1; 5 8]; r = [3 1 4]; s = 7;

C = A + B

(which returns C = [4 4; 9 13])

To do matrix multiplication, simply use "*". For example, using c and r as defined above, w = r * c (which returns w = 26) w = c * r (which returns w = [9 3 12; 3 1 4; 12 4 16])

Similarly, with scalars, w = r * s (which returns w = [21 7 28])

To raise a matrix to a power (using matrix multiplication), simply use "^". C = A^2 (with A from above) (which returns C = [16 21; 28 37])

To do element-by-element multiplication, use the command immultiply (part of the Image Processing Toolbox). x = [3 1 4]; y = [5 6 4]; g = immultiply(x,y) (which returns g = [15 6 16])

One can also use the command .* g = x.*y (which returns g = [15 6 16])

Similarly, for element by element division, use the command ./ g = x./y (which returns g = [0.6000 0.1667 1.0000])

Similarly, for raising each element in a matrix to a power, use the command .^

A = [1 2; 3 4]; B = A.^2

(which returns B = [1 4; 9 16])
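The difference between the matrix operators and their element-by-element counterparts is easy to overlook, so a small side-by-side sketch may help. The matrices below are hypothetical values chosen for the example.

```matlab
A = [1 2; 3 4];
B = [2 0; 1 3];

P_matrix  = A * B;     % matrix product:            [4 6; 10 12]
P_element = A .* B;    % element-by-element product: [2 0; 3 12]
S_matrix  = A ^ 2;     % A * A:                     [7 10; 15 22]
S_element = A .^ 2;    % each element squared:      [1 4; 9 16]
```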

To find the minimum [or maximum] of a vector, use the command min(a) [or max(a) respectively]. If a is a matrix, the command will return a vector of the minimum [or maximum] of each column of a. To return the minimum [or maximum] of each row, use min(a,[],2) [or max(a,[],2) respectively].

To obtain the difference between successive elements of a vector (for example, to find the interspike time from a series of spike times), use the command diff. d = diff(a);
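For instance, the interspike-interval calculation mentioned above might look like the following sketch. The spike times are made up for illustration and are assumed to be in milliseconds.

```matlab
spikes = [12 30 37 61 64];   % hypothetical spike times (ms)

isi      = diff(spikes);     % interspike intervals: [18 7 24 3]
shortest = min(isi);         % 3
longest  = max(isi);         % 24
```

Note that diff returns a vector with one fewer element than its input.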

To find e raised to a power, use the command: h = exp(x);

To find the logarithm base e of a variable, use the command: h = log(x);

To find the logarithm base 10 of a variable, use the command: h = log10(x);

Basic statistical commands

Several of the most common statistical operators are briefly described here.

mean (average of vectors and matrices) m = mean(X) calculates the sample average of a vector, or the mean of each column of a matrix

median (median of vectors and matrices) m = median(X) calculates the median value of a vector, or the median of each column of a matrix

std (standard deviation of a sample) y = std(X) computes the sample standard deviation of the data in X

var (variance of a sample) y = var(X) computes the variance of the data in X

iqr (interquartile range of a sample) y = iqr(X) computes the difference between the 75th and 25th percentiles of the sample X

cov (covariance matrix) C = cov(X) C = cov(x,y) computes the covariance matrix

corr (linear or rank correlation) RHO = corr(X) RHO = corr(X,Y,...) [RHO, PVAL] = corr(X,Y,...) returns a matrix containing pairwise correlation between each pair of columns in the matrix X (or the pairs of columns in the matrices X and Y). If PVAL is saved (using the last command), it also returns the matrix of p-values for testing the null hypothesis of zero correlation.

corrcoef (correlation coefficients) R = corrcoef(X) R = corrcoef(x,y) [R,P] = corrcoef(x,y) returns a matrix of correlation coefficients (and p-values testing for presence of non-zero correlation if desired)
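A short sketch of a few of these commands on made-up data follows. The numbers are hypothetical; the point is only to show the column-wise behaviour on matrices and that std and var use the sample (n-1) normalization by default.

```matlab
x = [2 4 4 4 5 5 7 9];        % hypothetical sample

m   = mean(x);                % 5
med = median(x);              % 4.5
v   = var(x);                 % sum((x-5).^2)/(n-1) = 32/7
s   = std(x);                 % sqrt(32/7)

X = [1 10; 2 20; 3 30];       % two columns, treated as two variables
colMeans = mean(X);           % [2 20], one mean per column
R = corrcoef(X(:,1), X(:,2)); % 2-by-2 matrix; off-diagonal entries are 1 here
```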

Hypothesis Testing
The following is an example of simulated data used to illustrate several common statistical hypothesis tests. Macaque monkeys were studied in an experiment in which a stimulus was presented to them, and the time between the stimulus presentation and the hand movement (hand) was recorded. In addition, the time between the presentation of the stimulus and an increase in neural activity (neural) was also recorded.

neural = [17.1335 12.1186 14.4286 19.8125 15.8087 20.2471 15.1181 11.3353 12.6164 20.1423 14.5888 10.7160 23.2002 17.3276 20.5080]

hand = [28.3377 32.5013 32.8495 40.6467 37.1507 38.3092 46.7327 28.4141 34.6986 38.4544 38.7637 28.2532 44.4557 44.7567 40.9032]

Z-test
This is used to test the null hypothesis that the mean of a batch of data is equal to a particular value mu, assuming the variance is known. Suppose that in our experiment, we know that the standard deviation of the time to move the hand is 6. We wish to know whether or not the mean time to move the hand is different from 34. We also choose a level of significance (alpha) of 0.05: we will claim that the mean is different from 34 only if, were the mean truly 34 and the experiment repeated over and over, a sample mean as extreme or more extreme than the one we obtained would occur no more than five percent of the time. To perform the test, use the command: [h,sig,ci,zval] = ztest(x,m,sigma,alpha) Applied to our data and test: [h, sig, ci, zval] = ztest(hand, 34, 6, 0.05)

The values returned are h (0 if we should not reject the null hypothesis that the mean is equal to 34, 1 if we should), sig (the probability, under the null hypothesis, of observing a value of Z as extreme or more extreme than the one obtained, i.e. the p-value), ci (a (1-alpha)*100% confidence interval for the mean), and zval (the value of the z-statistic).

Here, we would not reject the null hypothesis that the mean is equal to 34.
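To make the quantities returned by ztest less of a black box, the following sketch recomputes the z-statistic and two-sided p-value by hand, using only base Matlab functions. The variable names mu0, sigma, and alpha are chosen for this example, and the hand values are listed in the order they appear in this guide; this is an illustration of the textbook formulas, not a description of how ztest is implemented internally.

```matlab
hand = [28.3377 32.5013 32.8495 40.6467 37.1507 38.3092 46.7327 28.4141 ...
        34.6986 38.4544 38.7637 28.2532 44.4557 44.7567 40.9032];

mu0   = 34;     % hypothesized mean
sigma = 6;      % standard deviation, assumed known
alpha = 0.05;   % level of significance

n    = numel(hand);
zval = (mean(hand) - mu0) / (sigma / sqrt(n));  % z-statistic
p    = erfc(abs(zval) / sqrt(2));               % two-sided p-value, 2*(1 - Phi(|z|))
h    = p < alpha;                               % true if we reject the null hypothesis
```

The p-value comes out just above 0.05, which is why the null hypothesis is not rejected at this level.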

T-test
Often the true standard deviation is not known and we must work with the sample standard deviation. In this case, we use a t-test. The command is similar to that of the z-test. [h,p,ci] = ttest(x,m,alpha) Again, testing the mean to be 34 or not, we run the command: [h, p, ci] = ttest(hand,34,0.05)

Once again, we would fail to reject our null hypothesis that the mean is actually 34.
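The t-statistic behind this result can be sketched by hand in the same way. The block below uses only base Matlab; the two-sided p-value is obtained from the regularized incomplete beta function (betainc), which is a standard identity for Student's t distribution rather than anything specific to ttest. The name mu0 is chosen for this example.

```matlab
hand = [28.3377 32.5013 32.8495 40.6467 37.1507 38.3092 46.7327 28.4141 ...
        34.6986 38.4544 38.7637 28.2532 44.4557 44.7567 40.9032];

mu0 = 34;                 % hypothesized mean
n   = numel(hand);
df  = n - 1;              % degrees of freedom

% t-statistic: same form as the z-statistic, but with the sample standard deviation
tval = (mean(hand) - mu0) / (std(hand) / sqrt(n));

% Two-sided p-value for Student's t with df degrees of freedom
p = betainc(df / (df + tval^2), df/2, 0.5);
```

The p-value lies between 0.05 and 0.10, consistent with failing to reject the null hypothesis at alpha = 0.05.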

Both z- and t-tests may be performed using one-sided hypotheses as well, as opposed to the two-sided tests that were performed above. This is done by using the command [h,sig,ci,zval] = ztest(x,m,sigma,alpha,tail) or [h,p,ci] = ttest(x,m,alpha,tail), using 'right' for the alternative mu>m and 'left' for the alternative mu<m in place of tail.

Two-sample t-test
To test the null hypothesis that two samples have the same mean, one may use a two-sample t-test. By default, Matlab makes the assumption that the standard deviations for the two samples are equal. To perform the test, use the command: [h,p,ci] = ttest2(x,y,alpha,tail)

Suppose, for the sake of an example, we wished to verify that the time to hand movement and the time to neural response were in fact different. In our case, the command would be: [h,p,ci] = ttest2(neural,hand,0.05) (with the default tail setting being 'both' for testing two-sided hypotheses). The result clearly indicates that we should reject the hypothesis that the two means are the same, with a p-value of 7.83 x 10^-12. The confidence interval given is a 95% confidence interval for the difference of the means.
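As a rough check on the two-sample result, the pooled-variance t-statistic can be computed directly. This sketch assumes equal variances in the two groups (the default assumption noted above) and again uses betainc from base Matlab for the two-sided p-value; the data values are listed in the order they appear in this guide.

```matlab
neural = [17.1335 12.1186 14.4286 19.8125 15.8087 20.2471 15.1181 11.3353 ...
          12.6164 20.1423 14.5888 10.7160 23.2002 17.3276 20.5080];
hand   = [28.3377 32.5013 32.8495 40.6467 37.1507 38.3092 46.7327 28.4141 ...
          34.6986 38.4544 38.7637 28.2532 44.4557 44.7567 40.9032];

n1 = numel(neural);
n2 = numel(hand);
df = n1 + n2 - 2;                                    % degrees of freedom

% Pooled estimate of the common variance
sp2 = ((n1 - 1)*var(neural) + (n2 - 1)*var(hand)) / df;

% t-statistic for the difference of the two means
tval = (mean(neural) - mean(hand)) / sqrt(sp2 * (1/n1 + 1/n2));

% Two-sided p-value for Student's t with df degrees of freedom
p = betainc(df / (df + tval^2), df/2, 0.5);
```

The statistic is large and negative (the neural response precedes the hand movement), and the p-value is far below 0.05.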

ANOVA
One use of ANOVA (analysis of variance) is to test the null hypothesis that the means of several groups are all the same. The alternative hypothesis is that at least one group has a different mean from the others. The command to perform this test is: [p,table,stats] = anova1(X) where X is a matrix in which the different columns represent different groups. For example, suppose we have 3 monkeys doing the aforementioned task. The time for each monkey to move its hand is recorded as follows: A = [24 26 35 42 15 26 24 35 25 21]; B = [19 25 34 16 24 34 26 25 14 18]; C = [32 31 42 25 28 27 34 32 29 28];

To perform the test to see if all of the means are equal, use the following series of commands (the row vectors are transposed so that each monkey occupies one column of X): X = [A' B' C']; [p,table,stats] = anova1(X)

This will return an ANOVA table, showing the value of the F-statistic and p-value (which is greater than 0.05, so we fail to reject the hypothesis that the means are all the same) as well as a boxplot of the three different groups.
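The F-statistic reported in the ANOVA table can be reproduced by hand from the between-group and within-group sums of squares. The sketch below arranges the three groups as columns and uses only base Matlab; names such as SSB and SSW are chosen for this example.

```matlab
A = [24 26 35 42 15 26 24 35 25 21];
B = [19 25 34 16 24 34 26 25 14 18];
C = [32 31 42 25 28 27 34 32 29 28];

X = [A' B' C'];               % one column per group
[n, k] = size(X);             % n observations per group, k groups

groupMeans = mean(X);         % 1-by-k vector of group means
grandMean  = mean(X(:));      % mean of all observations

SSB = n * sum((groupMeans - grandMean).^2);          % between-group sum of squares
SSW = sum(sum((X - repmat(groupMeans, n, 1)).^2));   % within-group sum of squares

F = (SSB / (k - 1)) / (SSW / (n*k - k));             % F-statistic on (k-1, n*k-k) d.f.
```

Here F is about 3.02, which is below the 5% critical value of the F distribution with (2, 27) degrees of freedom (roughly 3.35), in line with a p-value above 0.05.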


Bootstrap
The bootstrap involves choosing random samples with replacement from a data set and applying the same statistical function to each sample. Sampling with replacement means that every sample is returned to the data set after sampling, so a particular data vector from the original data set could appear multiple times in a given bootstrap sample. The number of elements in each bootstrap sample equals the number of elements in the original data set. To perform the bootstrap, use the command: [bootstat,bootsam] = bootstrp(nboot,bootfun,d1,d2,...) Here, nboot is the number of times we are going to sample, bootfun is the function we are interested in, and d1, etc. are the input data. Each row of the output, bootstat, contains the results of applying bootfun to one bootstrap sample. If bootfun returns multiple output arguments, only the first is stored in bootstat. If the first output from bootfun is a matrix, the matrix is reshaped to a row vector for storage in bootstat. Each column in bootsam contains indices of the values that were drawn from the original data sets to constitute the corresponding bootstrap sample.
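To show what the resampling described above amounts to, here is a minimal hand-rolled sketch that bootstraps the mean of the hand data using only base Matlab. It is meant as an illustration of the idea, not as a substitute for bootstrp; the percentile interval at the end is one simple (and assumed) way to summarize the bootstrap distribution.

```matlab
hand = [28.3377 32.5013 32.8495 40.6467 37.1507 38.3092 46.7327 28.4141 ...
        34.6986 38.4544 38.7637 28.2532 44.4557 44.7567 40.9032];

n     = numel(hand);
nboot = 1000;                       % number of bootstrap samples

% Each column of idx holds the indices of one bootstrap sample,
% drawn with replacement (comparable to the bootsam output of bootstrp)
idx = randi(n, n, nboot);

% Mean of each bootstrap sample (1-by-nboot)
bootMeans = mean(hand(idx), 1);

% Simple 95% percentile interval for the mean
sorted = sort(bootMeans);
ci = [sorted(round(0.025*nboot)) sorted(round(0.975*nboot))];

% With the Statistics Toolbox, a comparable call would be:
%   bootstat = bootstrp(nboot, @mean, hand(:));
```

Because the samples are random, the interval will differ slightly from run to run.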


Plotting with Matlab

Some of the most commonly used methods to create plots are listed below:

boxplot (boxplots of a data sample) boxplot(X) returns a boxplot of each column of X

hist (plot histograms) hist(y,nb) [n,x] = hist(y,nb) draws a histogram with nb bins (default is 10). You may have the frequency counts (n) and bin locations (x) returned as well.

hist3 (3-dimensional histogram of bivariate data) hist3(X,nbins) hist3(X,'Edges',edges) draws a 3-D histogram with nbins(1) x nbins(2) bins (default is 10x10). The edges of the dimensions may be specified.

plot (linear 2-D plot) plot(X,Y) creates a linear plot, connecting the points defined by the vectors X and Y (if X and Y are matrices, columns or rows of the two matrices are matched up, depending upon the dimensions of the matrices). Plot may be modified to create scatter plots and other plots by adjusting the options available (see help menu for more information).

plot3 (3-D line plot) plot3(X,Y,Z) creates a line plot, connecting the points defined by the vectors X, Y, and Z (which may be matrices, as discussed above)

scatter (2-D scatter/bubble graph) scatter(X,Y) creates a 2-D scatter plot, with bubbles at the locations specified by the vectors X and Y (which must be of the same size)

scatter3 (3-D scatter plot) scatter3(X,Y,Z) creates a 3-D scatter plot, with bubbles at the locations specified by the vectors X, Y, and Z (which must be of the same size)

normplot (normal probability plot) normplot(X) displays a normal probability plot of the data in X. For matrices, normplot displays a line for each column of X.

probplot (probability plot) probplot(distname,Y) creates a probability plot for the specified distribution

qqplot (quantile-quantile plot of two samples) qqplot(X,Y) displays a quantile-quantile plot of two samples. If the two samples do come from the same distribution, the plot will be linear.
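As a small example of the hist command listed above, the sketch below requests the counts and bin centers for the neural data instead of drawing the figure. The choice of 5 bins is arbitrary and made only for this example.

```matlab
neural = [17.1335 12.1186 14.4286 19.8125 15.8087 20.2471 15.1181 11.3353 ...
          12.6164 20.1423 14.5888 10.7160 23.2002 17.3276 20.5080];

% With output arguments, hist returns the counts and bin centers
% rather than drawing the histogram
[n, x] = hist(neural, 5);

total = sum(n);   % every observation falls into exactly one bin
```

Calling hist(neural,5) without output arguments would draw the corresponding histogram.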

Regression and curve-fitting

Matlab has both a command based system to perform regression and curve-fitting as well as a graphical interface, allowing the user to point-and-click to perform the analysis.

Command based Regression

Setting up the Design Matrix by Hand
Suppose one wishes to model the data y using the polynomial function: y = a + b*t + c*t^2

To perform this regression, one first needs to create the design matrix X, which consists of a column of ones and two columns corresponding to t and t^2. X = [ones(size(t)) t t.^2];

To perform the regression, use the backslash operator (do not confuse with the division operator): r = X\y;

r will now contain three values, corresponding to a, b, and c in the form above.

This can also be done for linear-in-the-parameters regression. Suppose one wishes to model the data y using the function: y = a + b*exp(-t) + c*t*exp(-t)

Again, one creates the design matrix and then proceeds with the regression: X = [ones(size(t)) exp(-t) t.*exp(-t)]; r = X\y;


To perform multiple regression, a similar procedure is followed. Suppose one wishes to model the data y using the function: y = a + b*x + c*z

Again, one creates the design matrix and then proceeds with the regression: X = [ones(size(x)) x z]; r = X\y;

Note that x, y, and z must be column vectors of the same size for the regression to work.
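A quick way to convince oneself that the backslash approach recovers the coefficients is to try it on data generated from a known polynomial. The sketch below uses made-up, noise-free data with a = 1, b = 2, c = 3 (values chosen only for this illustration).

```matlab
t = (0:5)';                        % column vector of sample points
y = 1 + 2*t + 3*t.^2;              % noise-free data from a known quadratic

X = [ones(size(t)) t t.^2];        % design matrix: columns for 1, t, and t^2
r = X \ y;                         % least-squares solution; r is approximately [1; 2; 3]
```

With real, noisy data the recovered coefficients would of course only approximate the underlying values.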

Using polyfit
Alternatively, one can also perform polynomial regression without first setting up the design matrix.

Again, suppose one wishes to model the data y using the polynomial function: y = a + b*t + c*t^2 The command polyfit will set up the design matrix and perform the regression in one step. p = polyfit(t,y,2); (where 2 is the degree of the polynomial being fitted) Note that polyfit returns the coefficients in descending powers of t, so here p = [c b a].

Suppose one now wishes to evaluate the fit at a series of points s. The command polyval allows this evaluation. fit = polyval(p,s); (where p is the fit from polyfit)
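The polyfit and polyval calls above can be tried on the same kind of made-up, noise-free quadratic (a = 1, b = 2, c = 3, chosen only for illustration). The sketch also shows the descending-power ordering of the coefficients.

```matlab
t = (0:5)';
y = 1 + 2*t + 3*t.^2;        % noise-free data from a known quadratic

p = polyfit(t, y, 2);        % approximately [3 2 1]: coefficients of t^2, t, and 1

s   = [0.5 6];               % points at which to evaluate the fit
fit = polyval(p, s);         % approximately [2.75 121]
```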

To use the same commands to make exponential fits, merely transform the variables. Suppose one wishes to model the data y using the function: y = exp(a + b*t + c*t^2) Taking the logarithm of both sides gives log(y) = a + b*t + c*t^2, which is a polynomial in t, so one would use the command: p = polyfit(t,log(y),2);
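As an illustration of the log transform, the sketch below assumes the model y = exp(a + b*t + c*t^2), generates noise-free data with a = 0.5, b = 0.2, c = -0.1 (hypothetical values), and recovers the coefficients by fitting a quadratic to log(y).

```matlab
t = (0:0.5:3)';
y = exp(0.5 + 0.2*t - 0.1*t.^2);   % data from the assumed exponential model

p    = polyfit(t, log(y), 2);      % approximately [-0.1 0.2 0.5], i.e. [c b a]
yfit = exp(polyval(p, t));         % fitted values back on the original scale
```

Note that this requires all values of y to be positive, and that with noisy data the fit minimizes the error in log(y) rather than in y itself.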

To get confidence intervals using polyfit, use the following set of commands: Returning to our example where we are using the model: y = a + b*t + c*t^2 First, one performs the regression, but also saves the error-estimation structure (as d): [p,d] = polyfit(t,y,2); One then evaluates along the fitted curve as before (using the points s): [fit,del] = polyval(p,s,d); where del is the estimated standard error of prediction at each point. An approximate 95% point-wise interval at each point s can be obtained from (fit-1.96*del, fit+1.96*del).
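The sequence of commands above might be put together as in the following sketch. The data are made up: a quadratic plus a small deterministic wiggle standing in for noise, so that the example gives the same result every time it is run.

```matlab
t = (0:0.5:5)';
y = 1 + 2*t + 3*t.^2 + 0.5*sin(7*t);   % quadratic plus a small perturbation

[p, d] = polyfit(t, y, 2);             % d holds the information needed for error estimates

s = (0:1:5)';                          % points at which to evaluate the fit
[fit, del] = polyval(p, s, d);         % del: estimated standard error of prediction

lower = fit - 1.96*del;                % approximate 95% point-wise band
upper = fit + 1.96*del;
```

The band could then be drawn with, for example, plot(s,fit,s,lower,s,upper).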

Graphical based regression

A quick way to perform regressions is through the basic fitting tool that Matlab provides. Its options are limited, but it provides a method to perform regressions and immediately view the results. To begin, suppose we have a set of data y which we would like to regress on the variable x. First, plot the data using the command: plot(x,y) Now, on the resulting figure, click on Tools in the menu bar and select Basic Fitting. This will open another window. On this window, check the fit you wish to use on the data (choosing from interpolating splines, linear regression, quadratic regression, and multiple polynomial regression). You can choose multiple types at the same time. The resulting fit will be plotted along with your data. To obtain residual plots, simply check off "Plot residuals" with the options of choosing a bar or scatter plot. You can also include the regression equation in your plot by checking the "Show equations" box. Pressing the arrow on the bottom of the menu will open up another panel in the window, showing the coefficients used to fit the model. Clicking "Save to workspace" will allow you to save the fit as a variable name.

Pressing the arrow pointing right on the bottom of the menu again will open up another panel in the window. This panel allows you to input values of x at which to evaluate the fit. These fitted values may also be saved to the workspace, similar to the fit itself, by pressing "Save to workspace."

Curve fitting tool

Another graphical based tool to perform regression is the Curve fitting tool. This tool has more options than the Basic fitting tool, including more possible fits. These fits include polynomial fits, Gaussian fits, interpolating fits (including cubic splines), powers, rational expressions of polynomials, smoothing splines, trigonometric fits, and Weibull. Also included in the curve fitting tool are the ability to exclude outliers and other points while performing the regression and the inclusion of point-wise confidence bands.

The following are step-by-step directions for working with the curve fitting tool:

Store the data to be analyzed in variables x, y, .... Next, type in the command: cftool This command will open the curve fitting tool.

Click on the Data... button. The Data screen will open.

Import workspace vectors using the dropdown menu for X Data and Y Data. Note that variables must be of the same class. For example, for neuronal spike data, the y variable is often an integer. If the x variable is of the class double, change the y variable to the class double in order to proceed. When X and Y Data are selected (and weights if desired), click on Create data set. Note that smoothed data sets may also be created by clicking on the Smooth tab on the top of the Data screen. Only a small variety of smoothing options are available. Choosing a method and clicking on Create smoothed data will create a data set of data smoothed using the appropriate method.

On the Curve Fitting Tool window, the button Plotting... will allow you to choose which data sets you wish to be displayed on a graph at the same time. Click the checkmark next to the dataset to show the plot; toggle the check off to remove that dataset from the plot.

On the Curve Fitting Tool window, the button Fitting... will allow you to create a fit for the data. First, click on the New fit button. Choose the appropriate dataset and the type of fit from the drop down menu boxes. The different types of fits will also have a set of choices from which to choose a fit. For example, polynomial allows one to choose linear, quadratic, cubic,...9th degree polynomial.

After choosing the appropriate type of fit, hit apply to calculate the fit. The fit will be plotted in the Curve Fitting Tool window and will be saved as the chosen Fit Name, the default being "fit 1". The Table of Fits in the Fitting window will give appropriate information about the fit including SSE and R-squared. Additional information may be added to the table by pressing the Table options... button and checking the desired quantities.

For additional analysis of the fit, select the Analysis... button on the Curve Fitting Tool. This will open the Analysis window. The first option gives the fit to analyze and the values of X at which to analyze the fit. (The notation is starting X : distance between X values : ending X.) Checking "Evaluate fit at Xi" will allow you to calculate prediction or confidence bands at each value of X indicated above. The other options allow calculations of the first and second derivatives of the fit and an integral of the fit. By choosing "Plot results," the confidence intervals and/or derivatives will be plotted in a separate window, "Curve Fitting Analysis." Note that some fits, namely smoothing splines, will not allow calculations of prediction or confidence bands.

The Exclude... button on the Curve Fitting Tool allows for removal of particular points or regions when calculating a fit. Define a new Exclusion rule name and select a data set. To exclude particular points, check next to the index number of the corresponding X/Y pair. To exclude sections of data, enter the value(s) that you do not wish to consider in the Exclude Sections area. When finished, press Create exclusion rule. This exclusion rule may be accessed when using the Fitting... option by choosing the appropriate Exclusion rule in the dropdown menu in the Fitting window.


Working with Probability Distributions

There are several distinct functions dealing with different distribution functions. They are virtually identical except in distribution and necessary parameters.

CDF
The cumulative distribution function (or cdf) returns the integrated (or summed) density from negative infinity to the specified value X. The basic form for the command is ****cdf(X,param1,param2,...), where **** is substituted with the appropriate abbreviation for the distribution being used (listed below) and the necessary parameters are given.

p = ****cdf(X,param1,param2,...) computes cdf at each value in X using parameters specified
cdf(name,X,A1,A2,A3)
betacdf(X,A,B) - beta
binocdf(X,N,P) - binomial
chi2cdf(X,V) - chi-squared
ecdf(y) - empirical cdf
expcdf(X,MU) - exponential
fcdf(X,V1,V2) - F
gamcdf(X,A,B) - gamma
geocdf(X,P) - geometric
hygecdf(X,M,K,N) - hypergeometric
logncdf(X,MU,SIGMA) - lognormal
nbincdf(X,R,P) - negative binomial
ncfcdf(X,NU1,NU2,DELTA) - noncentral F
nctcdf(X,NU,DELTA) - noncentral T
ncx2cdf(X,V,DELTA) - noncentral chi-squared
normcdf(X,MU,SIGMA) - normal
poiscdf(X,LAMBDA) - Poisson
raylcdf(X,B) - Rayleigh
tcdf(X,V) - Student's t
unidcdf(X,N) - discrete uniform
unifcdf(X,A,B) - continuous uniform
wblcdf(X,A,B) - Weibull


Parameter Estimation
Supposing one knows the appropriate distribution from which data was obtained, one can estimate the appropriate parameters for that distribution. The command is ****fit(data,alpha), again substituting the appropriate abbreviation for ****, and alpha is provided so as to obtain a confidence interval for the estimates. Note that the abbreviations for the densities are the same as in CDF, although not all densities are available.

[phat,pci] = ****fit(data,alpha) estimates parameters for a distribution for the data; pci contains the confidence intervals for the estimates, using the appropriate value of alpha (default = 0.05)
mle(data,'distribution',dist)
betafit(data,alpha)
binofit(x,n,alpha) - binomial; takes the number of successes x and the number of trials n
expfit(data,alpha)
gamfit(data,alpha)
lognfit(data,alpha)
nbinfit(data,alpha)
normfit(data,alpha)
poisfit(data,alpha)
raylfit(data,alpha)
unifit(data,alpha) - for continuous uniform
wblfit(data,alpha)

Quantiles
Quantiles can be obtained from a particular density, given the appropriate parameters. Given a value (or vector) P, the quantiles are calculated for each number in P using the command ****inv(P,param1,param2,...) where **** and the parameters are as before.

q = ****inv(P,param1,param2,...) computes the inverse of the cdf for the probabilities in P using parameters specified
icdf(name,P,A1,A2,A3)
betainv(P,A,B)
chi2inv(P,V)
expinv(P,MU)
finv(P,V1,V2)
gaminv(P,A,B)
geoinv(Y,P)
hygeinv(P,M,K,N)
logninv(P,MU,SIGMA)
nbininv(Y,R,P)
ncfinv(P,NU1,NU2,DELTA)
nctinv(P,NU,DELTA)
ncx2inv(P,V,DELTA)
norminv(P,MU,SIGMA)
poisinv(P,LAMBDA)
raylinv(P,B)
tinv(P,V)
unidinv(P,N)
unifinv(P,A,B)
wblinv(P,A,B)
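The cdf and inverse-cdf commands are inverses of each other, which the following sketch illustrates for the standard normal distribution. The closed-form check with erfc is a standard identity for the normal cdf, included here only as a sanity check on the toolbox call.

```matlab
p = normcdf(1.96, 0, 1);      % P(Z <= 1.96) for a standard normal, about 0.975
q = norminv(0.975, 0, 1);     % 97.5th percentile of a standard normal, about 1.96

x_back = norminv(normcdf(1.2, 0, 1), 0, 1);   % round trip returns approximately 1.2

% Same cdf value from the base Matlab function erfc
p_manual = 0.5 * erfc(-1.96 / sqrt(2));
```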

Loglikelihood
The value of the negative loglikelihood, evaluated at given parameters and data, can be obtained using the command ****like(params,data). Note that this option is available for only a select number of distributions.

nlogL = ****like(params,data) returns the negative of the log-likelihood function for the parameters specified in the vector params for the observations in the vector data
betalike(params,data)
explike(param,data)
gamlike(params,data)
lognlike(params,data)
normlike(params,data)
wbllike(params,data)
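To make clear what quantity is being returned, the sketch below computes the negative log-likelihood of a normal distribution by hand in two equivalent ways, using only base Matlab. The data and parameter values are hypothetical; with the Statistics Toolbox, normlike([mu sigma],data) should give the same number.

```matlab
data  = [4.1 5.3 6.0 4.8 5.5];   % hypothetical observations
mu    = 5;                       % assumed mean
sigma = 1;                       % assumed standard deviation
n     = numel(data);

% Closed form of the normal negative log-likelihood
nlogL = n/2 * log(2*pi*sigma^2) + sum((data - mu).^2) / (2*sigma^2);

% Equivalent: minus the sum of the log densities
dens   = exp(-(data - mu).^2 / (2*sigma^2)) / (sigma*sqrt(2*pi));
nlogL2 = -sum(log(dens));
```

Smaller values indicate parameters under which the data are more likely.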


PDF
The probability density function (PDF) returns the value of the density function at a particular value of X, given the necessary parameters. As with the CDF, the command is ****pdf(X,param1,param2,...) where the substitutions are made as before.

Y = ****pdf(X,param1,param2,...) computes pdf at each value in X using parameters specified
pdf(name,X,A1,A2,A3)
betapdf(X,A,B)
chi2pdf(X,V)
exppdf(X,MU)
fpdf(X,V1,V2)
gampdf(X,A,B)
geopdf(X,P)
hygepdf(X,M,K,N)
lognpdf(X,MU,SIGMA)
mvnpdf(X,mu,SIGMA) - multivariate normal
nbinpdf(X,R,P)
ncfpdf(X,NU1,NU2,DELTA)
nctpdf(X,V,DELTA)
ncx2pdf(X,V,DELTA)
normpdf(X,MU,SIGMA)
poispdf(X,LAMBDA)
raylpdf(X,B)
tpdf(X,V)
unidpdf(X,N)
unifpdf(X,A,B)
wblpdf(X,A,B)

Random number generation

Probably one of the most useful commands when performing simulations, the random number generator allows one to generate data from a distribution, given the necessary parameters. The commands for this function are of the form ****rnd(param1,param2,...,m,n) where m and n indicate the dimensions of the matrix that is being generated. ****, param1, param2,... are as before.

R = ****rnd(param1,param2,m,n) returns an m x n matrix of random numbers from a distribution with parameters specified
random(name,A1,A2,A3,m,n)
betarnd(A,B,m,n)
chi2rnd(V,m,n)
exprnd(MU,m,n)
frnd(V1,V2,m,n)
gamrnd(A,B,m,n)
geornd(P,m,n)
hygernd(M,K,N,m,n)
iwishrnd(SIGMA,df) - inverse Wishart
lognrnd(MU,SIGMA,m,n)
mvnrnd(mu,SIGMA) - multivariate normal
mvtrnd(C,df,cases) - multivariate t (C = correlation matrix, output has cases rows and p (col of C) columns)
nbinrnd(R,P,m,n)
ncfrnd(NU1,NU2,DELTA,m,n)
nctrnd(V,DELTA,m,n)
ncx2rnd(V,DELTA,m,n)
normrnd(MU,SIGMA,m,n)
poisrnd(LAMBDA,m,n)
raylrnd(B,m,n)
trnd(V,m,n)
unidrnd(N,m,n)
unifrnd(A,B,m,n)
wblrnd(A,B,m,n)
wishrnd(SIGMA,df) - Wishart
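Random number generation and parameter estimation fit together naturally in a simulation: one generates data from a distribution with known parameters and checks that the estimates land near them. The sketch below does this for a normal distribution; the parameter values (mean 10, standard deviation 2) and the sample size are arbitrary choices for this example.

```matlab
R = normrnd(10, 2, 1000, 1);       % 1000-by-1 sample from a normal(10, 2) distribution

[muhat, sigmahat] = normfit(R);    % estimates of the mean and standard deviation
```

Because the sample is random, muhat and sigmahat will be close to, but not exactly, 10 and 2, and will change from run to run.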