
Introduction to Excel for First Year Statistics
in the School of Mathematical Sciences

CONTENTS
THE VERY BASIC BASICS
1. ENTERING AND MODIFYING INFORMATION
   a) Formulae in Excel
   b) Writing Formulae
2. USING DATA ANALYSIS TOOLPAK
   2.1 Loading Data Analysis in MS Excel 2007
   2.2 Features in Data Analysis
   2.3 To Obtain Descriptive Statistics
   2.4 To Obtain a Histogram
   2.5 Generating Random Numbers
   2.6 Scatterplot and line of best fit (Regression)
3. EXCEL FUNCTIONS FOR DISTRIBUTIONS
4. NAMING CELLS and RANGES
5. FURTHER RESOURCES

Prepared by Dr Dianne Atkinson 2010. School of Mathematical Sciences, Monash University, Clayton, Vic. 3800

Introduction to Excel
The goal of this brief manual is to help you get started with Microsoft Excel 2007 or above for basic computations, data summary and data analysis. Excel will be a powerful tool for you in preparing reports in all disciplines of science. Note that this manual is not a comprehensive guide to Excel. Its main purpose is to give you background in the features used in statistics, of which you have probably had little or no experience. You will need to refer to it throughout this course.

THE VERY BASIC BASICS


When you first open MS Excel, a new file, called a workbook, is displayed on your screen. A workbook consists of various sheets. When you open a new workbook, it contains three worksheets: near the bottom of the screen you should see tabs labelled Sheet1, Sheet2 and Sheet3. You can add, delete or copy worksheets as needed by right-clicking on a tab to bring up a menu. Separate sheets help to organise your work when you have many sections of related information, all within the one file. Sheets can also be renamed by double-clicking on a tab and typing.

A spreadsheet is an array of cells organised in rows and columns with an alphanumeric system. In Excel 2003, rows are numbered 1, 2, 3, ... to 65,536 and columns are labelled alphabetically: A, B, C, ..., X, Y, Z, AA, AB, AC, ... to IV (256 columns in total). Excel 2007 and above extend this to 1,048,576 rows and 16,384 columns (A to XFD). Each cell is identified by the column and row that intersect at its location. Thus D3 is the cell reference for the cell in the fourth column and the third row.

There is an alternative cell reference style, called R1C1, in which the columns are also designated by numbers; e.g. R3C4, the cell at row 3 and column 4, is D3. If your computer defaults to this setting, it can be changed:
- In Office 2007 and above: click the Office symbol in the top left-hand corner > Excel Options > Formulas and untick the box R1C1 reference style.
- In Office 2003: go to Options > General and untick the box R1C1 reference style.

1. ENTERING AND MODIFYING INFORMATION


Three types of information can be entered into a cell: labels, values and formulae. Press the <Enter> key when you have typed the information into a cell, or use the arrow keys to move to another cell. Cells can be formatted to represent different types of data via Format>Cell on the top toolbar; e.g. choosing the currency number style will put in the $ sign. More on formatting later.

Labels are character strings that are typically used for headings or comments. Example: Growth of investment

Values are numbers such as 1.21, $6.75 or 5%, entered as 1.21, 6.75 or 0.05 into cells formatted for number, currency or percentage, respectively.

Formulae are mathematical expressions that use the values or formulae in other cells to create new values. Examples:
= B3
= A7+1
= B7*(1+B4)
= SUM(A7:A12)

a) Formulae in Excel
All formulae must begin with an equals sign (=) and are entered directly into the cell. A formula can be typed by hand using the operators described below, or built from the in-built functions. A formula may contain constants or references to cells containing values (see the examples above) to calculate the final result.

Operators
Arithmetic operator    Description       Example   Result
+ (plus sign)          Addition          =3+3      6
- (minus sign)         Subtraction       =3-1      2
* (asterisk)           Multiplication    =3*3      9
/ (forward slash)      Division          =3/3      1
^ (caret)              Exponentiation    =2^3      8

The Paste Function Wizard on the toolbar is a library of in-built functions, so you do not have to know all of Excel's shorthand. Investigate the menus, particularly the Statistical one. The wizard is also a way to learn about specific functions: each function is described briefly there, and more detail is available via Help.

Some common functions built in to Excel are:

Built-in Function   Description                         Example
SQRT                Square root                         =SQRT(A6)
LOG                 Log in base 10                      =LOG(B1)
LN                  Natural log                         =LN(A1)
EXP                 Exponential                         =EXP(A1)
SUM                 Sum                                 =SUM(D1:D23)
AVERAGE             Average                             =AVERAGE(D1:D23)
STDEV               Standard deviation                  =STDEV(D1:D23)
QUARTILE            Quartile, e.g. first quartile Q1    =QUARTILE(D1:D23,1)
SUMSQ               Sum of squares                      =SUMSQ(D1:D23)

b) Writing Formulae
Exercise 1: Value of an Investment
Open a new spreadsheet and enter the information in the diagram for annually compounded interest:
(Don't worry that the columns do not appear wide enough. You can change the column width easily by putting the cursor on the line between column labels A and B at the top of the sheet and dragging it to the desired width.)

The formula =B3 will show the value 1000 when you press Enter.

To make values appear as currency or percentages, format the numbers from the Number section on the toolbar: select the cell of interest and press the $ or the % button as required.

The 0 and 1 in cells A7 and A8 of the Year column show the pattern of wanting to count in 1s. Highlight these two cells (A7 and A8) and use the Fill Handle of NOTE 1 below to copy this pattern down the column to A57 (50 years).

NOTE 1: Copying a cell


Use the mouse to position the cursor on the bottom right corner of the highlighted cell. The cursor changes from a block cross to a thin cross (+), which is the Fill Handle. Hold down the left mouse button and drag down.

Now select cell B8 and enter the formula =B7*(1+B4). Cell B8 should contain the number 1050, with the formula displayed in the formula bar when B8 is the active cell. We now want to copy the formula in B8 into the value cells for each remaining year: B9 to B57. Both references in cell B8 are relative, so if we were to copy this formula to B9, the formula in B9 would be =B8*(1+B5).

The problem here is that cell B5 does not contain the interest rate. We need to make part of this reference absolute by placing a dollar sign ($) in front of the part of the cell designation that would otherwise change; when copying down a column, this is the row number. The formula can then be copied without changing the reference to the constant value. Select B8, click in the formula bar at the top and edit the formula to =B7*(1+B$4). The value 1050 will remain in cell B8. ($B$4 would also be OK.) Select cell B8 and copy down to cell B57 using the fill handle (+) as before.
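As a cross-check outside Excel, the column of values produced by copying =B7*(1+B$4) down to B57 can be sketched in a few lines of Python. This is only an illustrative sketch: the principal of 1000 and the rate of 5% are the values implied by the exercise (B7 shows 1000 and B8 shows 1050), and the variable names are invented for the example.

```python
# Assumed inputs, inferred from the exercise: B3 = 1000 (principal), B4 = 5% (rate).
principal = 1000.0
rate = 0.05

# values[0] plays the role of B7 (year 0); each later entry mirrors
# "previous value * (1 + rate)", i.e. the formula =B7*(1+B$4) copied down.
values = [principal]
for year in range(1, 51):
    values.append(values[-1] * (1 + rate))

for year in (0, 1, 2, 50):
    print(f"Year {year:2d}: {values[year]:10.2f}")
```

The rate is read from one fixed place on every pass of the loop, which is what the absolute reference B$4 achieves in the spreadsheet.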

Your spreadsheet should now look like this:

NOTE 2: Relative versus Absolute Referencing


Relative references
When you create a formula, references to cells or ranges are usually used to find values already in the spreadsheet. If you are in C4 and use the value in cell B3 in a formula (=B3), Excel interprets this as: find the value one cell to the left (C to B) and one row up (4 to 3) and use it here. This is a relative reference. When you copy a formula that uses relative references, Excel automatically adjusts the references in the pasted formula. So copying down the column by one cell would change this to =B4, and copying across the row by one cell would change it to =C3.

Absolute references
If you don't want Excel to adjust references when you copy a formula to a different cell, use an absolute reference or a name. You create an absolute reference by placing a dollar sign ($):
- in front of the row number for copying down the column,
- in front of the column letter for copying across the row, or
- in front of both.

To see the power of a spreadsheet, we need only change the Principal or Interest Rate, and the spreadsheet automatically calculates the new values.
- Change the interest rate to 10% and observe the changes.
- Change the principal to $5000 and observe the changes.
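The adjustment rule in NOTE 2 can be imitated with a short Python sketch. This is not how Excel is implemented; it is a simplified model, with invented function names, that shifts every part of a cell reference not protected by a $ sign. It assumes simple A1-style references and would be confused by function names that end in digits (e.g. LOG10).

```python
import re

# A cell reference: optional $, column letters, optional $, row digits.
CELL_REF = re.compile(r"(\$?)([A-Z]+)(\$?)(\d+)")

def col_to_num(letters):
    """Convert column letters to a number: A -> 1, Z -> 26, AA -> 27."""
    n = 0
    for ch in letters:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n

def num_to_col(n):
    """Convert a column number back to letters: 1 -> A, 27 -> AA."""
    letters = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters

def copy_formula(formula, rows_down=0, cols_right=0):
    """Return the formula as it would read after copying it by the given offset."""
    def shift(match):
        col_abs, col, row_abs, row = match.groups()
        if not col_abs:  # no $ before the column letter: relative, so shift it
            col = num_to_col(col_to_num(col) + cols_right)
        if not row_abs:  # no $ before the row number: relative, so shift it
            row = str(int(row) + rows_down)
        return f"{col_abs}{col}{row_abs}{row}"
    return CELL_REF.sub(shift, formula)

print(copy_formula("=B7*(1+B4)", rows_down=1))   # both references move
print(copy_formula("=B7*(1+B$4)", rows_down=1))  # the row of B$4 stays fixed
```

The first call reproduces the problem met in Exercise 1 (the rate reference drifts to B5), and the second shows the repaired formula keeping its reference to row 4.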

2. USING DATA ANALYSIS TOOLPAK

2.1 Loading Data Analysis in MS Excel 2007
Click on the Office symbol and select Excel Options at the bottom of the drop-down page. On the new screen, select Add-ins on the left-hand side.

At the bottom of this screen you see Manage Excel Add-ins next to Go. Click Go.

An add-ins pop-up screen appears. Tick the box next to Analysis ToolPak and click OK. A pop-up screen may appear with the message Feature not currently installed. Would you like to install now? Click YES. A Wait for Configuration Process message follows; be patient, as this may take a few minutes. The Data Analysis option should then appear at the right-hand side of the Data tab, in a new section called Analysis. This procedure only needs to be done once, as Data Analysis will remain after you log out.

Loading Data Analysis in Excel 2003
In Excel 2003 the basic statistical operations are found under Tools as Data Analysis. First ensure that the Analysis ToolPak is enabled so that Data Analysis shows under Tools:
- Select Tools > Add-Ins.
- Tick the box for Analysis ToolPak.
- Click OK.
Data Analysis should now show as an option under the Tools menu.

Data Analysis and Mac: Excel 2003 for Mac does have the Data Analysis tool, BUT Excel 2007 and above DO NOT include this feature, because Microsoft decided not to support VBA (Visual Basic for Applications), in which the macros in Data Analysis are written.

2.2 Features in Data Analysis


Within Data Analysis there are built-in multi-stage procedures or tests. Note in particular the following features:

- Descriptive Statistics describe the centre and spread of a data set for a single variable. Statistics such as mean, median, standard deviation, maximum and minimum are displayed in one table. The various functions can be found individually using the Paste Function Wizard on the toolbar, BUT many descriptors together can be obtained in one table via Data Analysis.
- A Histogram is the bar chart showing the frequency distribution of values in a data set for a single variable. It gives a visual picture of the spread of the data.
- Random Number Generation is useful for allocating a random number to each item in an ordered list; sorting the list by the random number then randomises it. This is important when choosing a random sample from a list.
- Regression is used to investigate the appropriateness and strength of a straight line fitted to two-variable (x,y) data. How does the variable y change with a change in variable x?

For STA1010 only, see also:

- t-Test (of different types): a hypothesis test comparing the means of two groups, performed directly on the data from two samples.
- Anova: Single Factor: a hypothesis test of the question: is there a difference between the means of more than 2 treatment groups?

2.3 To Obtain Descriptive Statistics


Exercise 2: Uni-variable sample data
In 1798 the English scientist Henry Cavendish measured the density of the earth by careful work with a torsion balance. The variable recorded was the density of the earth as a multiple of the density of water. Here are his measurements: 5.50, 5.61, 4.88, 5.07, 5.26, 5.55, 5.36, 5.29, 5.58, 5.65, 5.57, 5.53, 5.62, 5.29, 5.44, 5.34, 5.79, 5.10, 5.27, 5.39, 5.42, 5.47, 5.63, 5.34, 5.46, 5.30, 5.75, 5.68, 5.85

Enter these values in an Excel column, eg cells A2 to A30 with a heading in A1.

Select Data Analysis > Descriptive Statistics
- Input Range: A1:A30, by highlighting the data (I have included the heading row!).
- Tick Labels in first row IF you did include the heading.
- Click the radio button for Grouped by Columns, as your data is down a column.
- Click the radio button for Output range and enter C1 in the space. (Cell C1 will be the upper left-hand cell of the output generated.)
- Tick the Summary Statistics box.
- OK

You will notice that a large amount of information is calculated, much of which (e.g. kurtosis, skewness) will not concern this unit. Note the measures of the centre of the data: mean (5.45) and median (5.46). Note the measures of spread of the data: standard deviation (0.22) and range (= max - min = 5.85 - 4.88 = 0.97). The other important measure of spread is the Interquartile Range (= Q3 - Q1), which cannot be found here. You can find the quartiles using the built-in function =QUARTILE(cell range, x) (see the common functions in Section 1a).
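The summary values quoted for the Cavendish data can be reproduced outside Excel as a check. The sketch below uses Python's standard statistics module; it assumes that the "inclusive" quantile method corresponds to the interpolation used by Excel's QUARTILE function, which should be confirmed against your own Excel output.

```python
import statistics

# Cavendish's 29 measurements of the density of the earth (multiples of water).
density = [
    5.50, 5.61, 4.88, 5.07, 5.26, 5.55, 5.36, 5.29, 5.58, 5.65,
    5.57, 5.53, 5.62, 5.29, 5.44, 5.34, 5.79, 5.10, 5.27, 5.39,
    5.42, 5.47, 5.63, 5.34, 5.46, 5.30, 5.75, 5.68, 5.85,
]

mean = statistics.mean(density)
median = statistics.median(density)
stdev = statistics.stdev(density)          # sample standard deviation, like STDEV
data_range = max(density) - min(density)

# Quartiles by linear interpolation over the sorted data (assumed to match QUARTILE).
q1, q2, q3 = statistics.quantiles(density, n=4, method="inclusive")
iqr = q3 - q1

print(f"mean   = {mean:.2f}")
print(f"median = {median:.2f}")
print(f"stdev  = {stdev:.2f}")
print(f"range  = {data_range:.2f}")
print(f"Q1 = {q1:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}")
```

The last line supplies the Interquartile Range, the one measure of spread that the Descriptive Statistics table does not report.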

2.4 To Obtain a Histogram


A frequency distribution is the count of the occurrences of data values that fall within set intervals over the range of the values in the sample. For example, how many data values in the Cavendish experiment are between 5.2 and 5.4? A histogram is the bar chart representation of the distribution of values. A good level of detail is about 8-12 bars, with very few intervals having very small counts (0, 1 or 2).

The first thing to do is to identify the range of values and the intervals in which you wish to count the data (called the BIN range). These intervals should be uniform and make sense in practical terms. To choose our own bin intervals, we need to look at the highest and lowest values and establish an interval size that fits about 8-10 intervals (bars) in this range.

For the Cavendish experiment: take a step below the minimum and a step above the maximum, so the data lie between about 4.8 and 6.0. This is a span of 1.2, and needing about 8-10 bars means intervals of 0.1 (12 bars) or 0.2 (6 bars). Opt for the latter (intervals of 0.2): if 0.1 was used there would be too many zero and very small counts. The upper limits for the intervals are then 4.8, 5.0, 5.2, 5.4, 5.6, 5.8 and 6.0.

In the Excel spreadsheet:
- Set up a column (column F) called BIN and enter all these values in the column (one number per cell, with the heading in F1 and the values in F2 to F8).
- Select Data Analysis > Histogram
- Input range: A1:A30
- Bin range: F1:F8
- Tick Labels
- Output options: click the radio button for Output range, put the cursor in the blank space and enter H1
- Tick the Chart output box
- OK
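The frequency table behind the histogram can be sketched in Python as a check on the bin choice. The sketch assumes the counting convention of the Histogram tool, where a value is counted in the first bin whose upper limit it does not exceed and anything above the last limit goes into a final "More" row.

```python
from bisect import bisect_left

density = [
    5.50, 5.61, 4.88, 5.07, 5.26, 5.55, 5.36, 5.29, 5.58, 5.65,
    5.57, 5.53, 5.62, 5.29, 5.44, 5.34, 5.79, 5.10, 5.27, 5.39,
    5.42, 5.47, 5.63, 5.34, 5.46, 5.30, 5.75, 5.68, 5.85,
]

# Upper limit of each interval, as entered in the BIN column.
bin_upper_limits = [4.8, 5.0, 5.2, 5.4, 5.6, 5.8, 6.0]

# One counter per bin, plus a final slot for values above the last limit ("More").
counts = [0] * (len(bin_upper_limits) + 1)
for value in density:
    # bisect_left gives the index of the first limit that is >= value.
    counts[bisect_left(bin_upper_limits, value)] += 1

for limit, count in zip(bin_upper_limits, counts):
    print(f"<= {limit:.1f}: {count}")
print(f"More  : {counts[-1]}")
```

With intervals of 0.2 only the two end bins hold very small counts, which supports the choice made above.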

NOTE that the output is a table of frequencies and a chart. The chart from Excel does need editing to have an acceptable appearance:


This edit was achieved as follows:
- Select the Bin label; you can edit the wording by typing and pressing Enter.
- Select the heading Histogram and press Delete. A caption below a figure is a better explanation of the figure in a report.
- Remove the legend Frequency by highlighting it and pressing Delete.
- Click on one of the blue bars and right-click for the menu. Select Format Data Series; in Series Options slide the Gap Width to No Gap; in Border Color select Solid Line and make the Color black to outline the bars.


2.5 Generating Random Numbers


There are several ways of generating random numbers, including drawing out of a hat, drawing out of a hat with replacement, and Sampling using Excel Data Analysis. We will use a very efficient way to obtain a random sequence without repetition: Random Number Generation using Excel Data Analysis.

Exercise 3: Randomising an Ordered List


Set up an ordered list in a column in Excel, e.g. the alphabet in order. Say we wish to choose 5 letters at random from the alphabet, with all letters having an equal chance of being selected. In the column next to the ordered list, place a random number next to every letter in the list: open Data Analysis > Random Number Generation and enter the following values:
- Number of variables = 1 (for 1 column of random numbers)
- Number of random numbers = 26 (appropriate to the question)
- Distribution = Uniform. Leave as Between 0 and 1.
- Enter the last 4 digits of your ID number as the random seed.
- Output range: B2 as the start of the column.
- Leave everything else blank.
- OK

This is now an ORDERED LIST with a corresponding RANDOM NUMBER. If the two columns are kept linked and the RANDOM column is sorted into order, the LIST column will correspondingly be RANDOMISED:
- Highlight the entire 26x2 array. You can include the headings to make it easier.
- Open Data > Sort.
- Tick My data has headers if column headings were included.
- In the Sort by drop-down menu, select the column Random Number.

The LIST will now be randomised, and the first 5 letters will be my randomly chosen letters: D, R, P, N, and E in this example.
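The same idea (attach a random number to each item, sort by that number, take the first few items) can be sketched in Python. The seed 1234 is a hypothetical stand-in for the last 4 digits of an ID number, and Python's generator differs from Excel's, so the letters chosen will not match the example above.

```python
import random
import string

rng = random.Random(1234)  # hypothetical seed; use your own 4 digits

letters = list(string.ascii_uppercase)  # the ordered list A to Z

# Pair every letter with a uniform random number between 0 and 1,
# then sort the pairs by that number, as Data > Sort does on the two columns.
keyed = [(rng.random(), letter) for letter in letters]
keyed.sort()

randomised = [letter for _, letter in keyed]
chosen = randomised[:5]
print("Randomised list:", "".join(randomised))
print("First 5 letters:", chosen)
```

Because each letter appears exactly once in the sorted list, the five letters chosen are drawn without repetition.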


2.6 Scatterplot and line of best fit (Regression)


A scatterplot is the graphical representation of x-y data. The line of best fit is the straight line placed on the data such that the data points are, overall, as close as possible to the line. Excel uses the in-built Method of Least Squares to position the line of best fit. The line is described by its equation and the closeness of fit by the correlation coefficient; both can be obtained in the scatterplot. A full regression analysis, including the residual plot and inference, is obtained in Data Analysis.

Exercise 4: Relationship between Weight and Blood Pressure


The table lists the systolic blood pressure and weight (kg) of a group of males, aged 50-55, who have been diagnosed with high BP. Enter this information into Excel.

(i) Produce scatterplot
Highlight the data of the x and y columns. Excel always takes the column furthest to the left as x and any columns to the right as y. Go to Insert tab > Charts and choose Scatter > Scatter with markers only. This produces the basic plot.

Like all Excel charts this plot requires editing (axes labels, line of best fit and its equation, correlation) to be presentable:

These edits were achieved as follows:
- With the plot selected (left-click in the plot area), go to the Chart Tools tab at the top centre. Select Chart Layouts. From the drop-down menu choose the style that gives you the axes labels and, if wanted, the added trendline and its equation.
- Select the heading and press Delete. A caption below a figure is a better explanation of the figure in a report.
- Select the legend and press Delete. Only one (x,y) pair (called a series) is plotted, so there is no need for a legend.
- Select the minor horizontal gridlines and press Delete.
- Change the x-axis and y-axis labels by selecting each and typing an appropriate label with units. The text first appears only in the bar at the top; after typing, press Enter to place the words at the selected axis.
- Change each axis scale so that the data fills the whole plot area:
  - Place the cursor near the x-axis or one of its values and left-click to select the axis.
  - Right-click for the menu and select Format Axis.
  - Change Minimum to Fixed and type 75.
  Repeat these 3 steps for the y-axis and enter 100 as the minimum.
- Move the equation box to a clearer position.

(ii) Regression Analysis including Residual plot
Regression analysis involves placing the line of best fit AND assessing the goodness of fit of the added trendline through a residual plot. The full inferential statistical analysis of the slope is also given (STA1010 only). The residual is the difference between the data value and the line's value at each point. The residuals will be randomly distributed along the line IF the line is an appropriate representation of the data. If the linear relationship of the line of best fit is NOT appropriate, a pattern (like a curve or U) will be seen in the residuals as you progress along the line.

To determine the regression line, residuals and residual plot, open Data Analysis > Regression. Note that it asks for the Y range first!
- Input Y Range: B3:B13
- Input X Range: A3:A13
- Tick the Labels box if headings are included
- Output range: A15
- Tick Residuals and Residual Plots (do not ask for the Line Fit Plot here, as it is not a good plot)
- OK

The SUMMARY OUTPUT is a large table that contains full regression information.

In SCI1020 the important information here is:
- Correlation (Multiple R) and correlation squared (R Square).
- The line of best fit equation, given by y = mx + c, where
  c = Coefficients for Intercept
  m = Coefficients for slope: Weight in this example (sometimes the title is X Variable)
  The equation in the example is BP = 0.7506 Weight + 72.489 (as seen before on the scatterplot).
- Residual plot.

In STA1010 the information needed is as for SCI1020, plus the inference on the slope, which involves the P-value, Lower 95% and Upper 95% entries in the row for the X variable.

The residual plot is randomly scattered so the linear relationship (described by this line of best fit) is appropriate to the data in this range.
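The quantities read from the regression output (slope, intercept, correlation and residuals) follow from the Method of Least Squares, which can be sketched directly. The blood pressure table is not reproduced in this text, so the x and y values below are invented purely for illustration and are NOT the exercise data; the function name is also an assumption of this sketch.

```python
import math

def least_squares(x, y):
    """Return slope m, intercept c, correlation r and residuals for y = m*x + c."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    # Residual = data value minus the line's value at the same x.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return slope, intercept, r, residuals

# Invented illustrative values (not the weight / blood pressure table).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept, r, residuals = least_squares(x, y)
print(f"y = {slope:.4f} x + {intercept:.4f}")
print(f"Multiple R = {abs(r):.4f}, R Square = {r * r:.4f}")
print("Residuals:", [round(e, 3) for e in residuals])
```

The residuals of a least squares line with an intercept always sum to zero, so it is the pattern of the residuals along the line, not their total, that indicates whether a straight line is appropriate.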


3. EXCEL FUNCTIONS FOR DISTRIBUTIONS


The following functions are used in Inferential Statistics, as will be explained in the unit's lectures and support classes.

NORMSDIST
Returns the standard normal cumulative distribution function. The distribution has a mean of 0 (zero) and a standard deviation of one. NORMSDIST gives the probability, from the sample's z-value, of that sample or LESS.
Syntax in Excel is: =NORMSDIST(z)
z is the value for which you want the distribution.
NB: The Standard Normal distribution in Excel and in tables is a LEFT-SIDED interval area. If you want the RIGHT-SIDED area (the greater than probability) then the formula is =1-NORMSDIST(z). The total area under any distribution is 1 (100% of possibilities).

NORMSINV
Returns the inverse of the standard normal cumulative distribution. The distribution has a mean of zero and a standard deviation of one. This is NORMSDIST in reverse: given the probability (of that z-value or less), what is the standardised score, the z-value?
Syntax in Excel is: =NORMSINV(probability)
probability is a probability corresponding to the normal distribution (as in the diagram: the left-hand interval area).
NB: The Standard Normal distribution in Excel and in tables is a LEFT-SIDED interval area.

TDIST
Returns the Percentage Points (probability) for the Student's t-distribution, where a numeric value (x) is a calculated value of t for which the Percentage Points are to be computed. The Student's t-distribution is used to find the p-value in the hypothesis testing of means when the population standard deviation, σ, is not known (as is the usual case). TDIST gives the probability, from the sample's t-value, of that sample or more extreme.
Syntax in Excel is: =TDIST(x,degrees_freedom,tails)
x is the numeric value of the t-value.
degrees_freedom = sample size - 1 = (n-1)
tails specifies the number of distribution tails to return. If tails = 1, TDIST returns the one-tailed distribution. If tails = 2, TDIST returns the two-tailed distribution.
NB: The t-distribution in Excel and tables gives a TAIL area.
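The left-sided convention of NORMSDIST and NORMSINV described in this section can be checked with Python's standard library. In this sketch the lower-case function names are only wrappers named after the Excel functions for readability; they are not part of any library.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)  # mean 0, standard deviation 1

def normsdist(z):
    """Left-sided area: probability of a z-value of z or LESS."""
    return standard_normal.cdf(z)

def normsinv(probability):
    """Reverse of normsdist: the z-value with the given left-sided area."""
    return standard_normal.inv_cdf(probability)

z = 1.96
left_area = normsdist(z)
right_area = 1 - normsdist(z)  # the "greater than" probability

print(f"P(Z <= {z}) = {left_area:.4f}")
print(f"P(Z >  {z}) = {right_area:.4f}")
print(f"z for a left area of 0.975 = {normsinv(0.975):.4f}")
```

The two areas add to 1, which is the reason the right-sided probability is obtained as 1 minus the left-sided one.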


TINV
Returns the t-value of the Student's t-distribution as a function of the probability and the degrees of freedom. This is TDIST in reverse: given the probability, what is the t-value?
Syntax in Excel is: =TINV(probability,degrees_freedom)
probability is the probability associated with the two-tailed Student's t-distribution.
degrees_freedom is the number of degrees of freedom with which to characterize the distribution, = sample size - 1 = (n-1)

Remarks
A one-tailed t-value can be returned by replacing probability with 2*probability. For a probability of 0.05 and degrees of freedom of 10, the two-tailed value is calculated with TINV(0.05,10), which returns 2.228. The one-tailed value for the same probability and degrees of freedom can be calculated with TINV(2*0.05,10), which returns 1.812.

CHIDIST
Returns the one-tailed probability of the chi-squared distribution. The χ² distribution is associated with a χ² test. Use the χ² test to compare observed and expected values. By comparing the observed results with the expected ones, you can decide whether your original hypothesis is valid. CHIDIST gives the probability, from the sample's chi-squared value, of that sample or MORE EXTREME (the tail area).
Syntax in Excel is: =CHIDIST(x,degrees_freedom)
x is the value at which you want to evaluate the distribution.
degrees_freedom is the number of degrees of freedom = (#rows-1) x (#columns-1)
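The tail area that CHIDIST returns can be sketched in Python for whole-number degrees of freedom, using the standard closed-form series for the chi-squared tail. This is an independent illustration, not Excel's own algorithm; the function name simply mirrors the Excel one.

```python
import math

def chidist(x, degrees_freedom):
    """Right-tail probability P(chi-squared > x) for integer degrees of freedom."""
    if x < 0 or degrees_freedom < 1:
        raise ValueError("x must be >= 0 and degrees_freedom >= 1")
    half = x / 2.0
    if degrees_freedom % 2 == 0:
        # Even df = 2m: exp(-x/2) * sum over i = 0..m-1 of (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, degrees_freedom // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df = 2m+1: erfc(sqrt(x/2)) plus
    # sqrt(2x/pi) * exp(-x/2) * sum over j = 0..m-1 of x^j / (1*3*...*(2j+1))
    tail = math.erfc(math.sqrt(half))
    m = degrees_freedom // 2
    if m > 0:
        term, total = 1.0, 1.0
        for j in range(1, m):
            term *= x / (2 * j + 1)
            total += term
        tail += math.sqrt(2.0 * x / math.pi) * math.exp(-half) * total
    return tail

# Familiar 5% critical values from chi-squared tables, for df = 1 to 4.
for df, critical in [(1, 3.841), (2, 5.991), (3, 7.815), (4, 9.488)]:
    print(f"df = {df}: tail area beyond {critical} = {chidist(critical, df):.4f}")
```

Each line should print a tail area close to 0.05, confirming that the function returns the "more extreme" (right-hand) area.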


4. NAMING CELLS and RANGES


Instead of cell references, it is much clearer to use the actual names of the variables in a formula. This can be achieved by naming the cells. This is an advanced feature that is very useful to know, as it makes the equations in your spreadsheet directly readable. In business this is essential for transparency of the procedures.

Exercise 5: Celsius to Fahrenheit Conversion


Use the Fill in a Series technique to enter in column A a heading of Celsius and values ranging from -20 to 50, in steps of 1. In cell B1, enter the heading Fahrenheit.

To designate the column A cells with the name Celsius:
- Select the range A2:A72.
- Select the Formulas tab and select Name Manager.
- Select New.

Excel anticipates that the column heading will be the name and has it there already. If you want different, just type it in. Select OK.

Enter in cell B2 the formula for the Fahrenheit temperature conversion as =1.8*Celsius+32. This is instead of =1.8*A2+32 which is not as informative. Note spelling and capitals must be the same as the defined name. Copy this formula down to cell B72 to show the conversion for all Celsius values. Use the Microsoft Excel Help facility to find out more on how to name cells and ranges for yourself.
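The readability gained from a named range is the same benefit a named variable gives in a programming language. As a small sketch (the variable names are chosen to echo the column headings), the conversion table of Exercise 5 could be written in Python as:

```python
# Celsius values from -20 to 50 in steps of 1, as in cells A2:A72.
celsius_values = list(range(-20, 51))

# The named formula =1.8*Celsius+32, applied to every value in the column.
fahrenheit_values = [1.8 * celsius + 32 for celsius in celsius_values]

for celsius, fahrenheit in zip(celsius_values[:3], fahrenheit_values[:3]):
    print(f"{celsius:4d} C = {fahrenheit:6.1f} F")
```

Reading 1.8 * celsius + 32 requires no look-up of what a cell such as A2 contains, which is the point of naming the range.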


5. FURTHER RESOURCES
Some useful resources for extra help:
- Microsoft Excel Help, found on the top toolbar of the program.
- Microsoft's Support page for Excel: http://office.microsoft.com/en-us/excel-help/ . Choose your version and investigate the Help and How to lists.
- Excel 2007 Training, Microsoft Office Online: an audio course with many and various tutorials in the basics, e.g. Get to know Excel (http://office.microsoft.com/en-us/excel-help/CH010224830.aspx).
- For Excel 2003, but very good for basics in any version: Clemson U. Physics Excel tutorial, http://phoenix.phys.clemson.edu/tutorials/excel/ (Copyright 2006, Clemson University).
- http://www.youtube.com : of mixed quality, but one suggestion is an uploader called ExcelIsFun, who has a series on Excel Statistics, e.g. Excel Statistics 31: Histogram using Data Analysis Add In.
