
PPT1 - DATA MANAGEMENT

IMPORTANCE OF STATISTICS TO MANAGERS

• Properly present and describe data or information;
• Come up with conclusions about large populations based only on information obtained from samples;
• Add essence to decisions made;
• Improve a process of decision making to reduce guesswork;
• Gather reliable data and/or forecast.

STATISTICS
Language        Term                     Meaning
Latin           Status                   State
                Datum                    Fact of Information
Modern Latin    Statisticum Collegium    Council of State
Italian         Statista                 Statesman or Politician
German          Statistik                Science of State
- Introduced by Gottfried Achenwall (1749)
- Originally designated the analysis of data about the state, signifying the “Science of State”

• Branch of mathematics that transforms data into useful information for decision makers
• Science of conducting studies to collect, organize, summarize, present, analyze, and draw conclusions from a set of quantitative data
• It is also concerned with the use of probability theory to estimate population parameters

SOME REASONS TO STUDY STATISTICS:
Health Care     Computer Skills        Information Management
Auditing        Process Improvement    Technical Literacy
Marketing       Health Care            Quality Improvement
Purchasing      Product Warranty       Operations Management
Medicine

STATISTICAL CHALLENGES
The ideal data analyst (business professionals using statistics) should possess these characteristics:
• Is technically current (e.g., software-wise).
• Communicates well.
• Is proactive.
• Has a broad outlook.
• Is flexible.
• Focuses on the main problem.
• Meets deadlines.
• Knows his/her limitations and is willing to ask for help.
• Can deal with imperfect information.
• Has professional integrity.

TWO BRANCHES OF STATISTICS:
1. Descriptive Statistics
- Concerned with collecting, organizing, summarizing, presenting, and analyzing numerical data.
- Is that branch of statistics that presents techniques for describing a set of measurements.
2. Inferential Statistics
- Drawing conclusions and/or making decisions concerning a population based only on sample data.
- Also called statistical inference or inductive statistics
- Its main concern is to analyze the organized data, leading to predictions or inferences.
- Implies that before carrying out an inference, appropriate and correct descriptive measures or methods are employed to bring out good results

Descriptive Statistics        Inferential Statistics
Collect Data                  Estimation
  Ex. Survey                    Ex. Estimate the population mean weight using the sample mean weight
Present Data                  Hypothesis Testing
  Ex. Tables and Graphs         Ex. Test the claim that the population mean weight is 120 pounds
Characterize Data             Drawing conclusions and/or making decisions concerning a population based on sample results
  Ex. Sample Mean

EXAMPLES OF DESCRIPTIVE STATISTICS:
1. The table further shows that 14 out of 20 students are using Globe network.
2. According to the Court Administration of the Philippines, 14% of trial-ready civil actions and equity cases in Metro Manila during 1993 were decided in less than six months (May 14, 1995).
3. Cigarettes were associated with 29% of the 4,470 civilian fire deaths in 1989 (The Book of Odds, Plume, 1991)

EXAMPLES OF INFERENTIAL STATISTICS:
• The National Eye Institute has halted a clinical trial on a type of eye surgery, calling it ineffective and possibly harmful to a person’s vision.
• “Allergy therapy may make bees go away” (April, 1995)
• The Gallup Poll says 1 out of 10 Filipinos is a member of a health club or fitness center.
• Drinking decaffeinated coffee can raise cholesterol levels by 7%. (Philippine Heart Association)

Exercises: Determine whether each of the following statistical claims is descriptive or inferential.
1. One out of every five people is an especially appealing target for hungry mosquitoes.
2. Almost 85% of lung cancers in men and 45% in women are tobacco-related.
3. The risk of heart attack is attributed to obesity.
4. Native Americans are significantly more likely to be hit crossing the streets than are people of other ethnicities.
5. There is an 80% chance that, in a room full of 30 people, at least two people will share the same birthday.

BASIC TERMS IN STATISTICS

Data – are the different values (measurements or observations) that the variables can assume.

IMPORTANCE OF DATA:
• Data are needed to provide the necessary input to a survey.
• Data are needed to provide the necessary input to a study.
• Data are needed to measure performance of an ongoing service or production process.
• Data are needed to evaluate conformance to standards.
• Data are needed to assist in formulating alternative courses of action in a decision-making process.
• Data are needed to satisfy our curiosity.

Observation – a single member of a collection of items that we want to study, such as a person, firm, or region.
Variable – a characteristic of the subject or individual, such as an employee’s income or an invoice amount.
Data set – consists of all the values of all of the variables for all of the observations we have chosen to observe.
- A data set may consist of many variables.
- The questions that can be explored and the analytical techniques that can be used will depend upon the data type and the number of variables.

POPULATION AND SAMPLE
Population
- refers to the groups or aggregates of people, objects, materials, events, or things of any form.
- consists of all subjects (human or otherwise) that are being studied.
- Measured by Parameters
Sample
- consists of a few or more members of the population.
- Is a subgroup of the population selected for analysis
- a portion of the population, or a subset from a set of units.
- Measured by Statistics

PARAMETER AND STATISTIC
Parameter
- are the measures of the population.
- any numerical summary measure based on data from a population. (Bluman, 2010)
Statistic (estimate)
- is a measure of the sample.

SAMPLE AND CENSUS
Sample
- involves looking only at some items selected from the population.
Census
- is an examination of all items in a defined population.
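
The difference between a parameter (computed from a census of the whole population) and a statistic (computed from a sample) can be sketched in a few lines of Python. This is only an illustration: the population of weights is simulated, and the function name `parameter_and_statistic` is hypothetical rather than taken from the slides.

```python
import random
import statistics

def parameter_and_statistic(population, n, seed=0):
    """Return (population mean, sample mean) for a simple random sample of size n."""
    rng = random.Random(seed)            # fixed seed so the sketch is repeatable
    sample = rng.sample(population, n)   # sample drawn without replacement
    mu = statistics.mean(population)     # parameter: needs every member (a census)
    x_bar = statistics.mean(sample)      # statistic: uses the sample only
    return mu, x_bar

# Hypothetical population: simulated weights (in pounds) of 1,000 people.
gen = random.Random(42)
weights = [round(gen.gauss(120, 15), 1) for _ in range(1000)]

mu, x_bar = parameter_and_statistic(weights, n=50)
print(f"parameter (population mean): {mu:.1f}")
print(f"statistic (sample mean):     {x_bar:.1f}")
```

The sample mean will usually be close to, but not equal to, the population mean; using it to estimate the parameter is the "Estimation" row of the inferential column in the table above.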

VARIABLE AND CONSTANT

Variable – refers to a characteristic or property whereby the members of the group or set vary or differ from one another.
Constant – refers to a property whereby the members of the group do not differ from one another.

TYPES OF VARIABLES
Categorical (Qualitative) Variable
- Represent differences in quality, character, or kind but not in amount.
- Categorical variables have values that can only be placed into categories, such as “yes” and “no.”
Numerical (Quantitative) Variable
- Are numerical in nature and can be ordered or ranked.
- Numerical variables have values that represent quantities.

TYPES OF NUMERICAL VARIABLES
Discrete variables
- generate numerical answers that arise from counting.
- is a variable whose values can be counted using integral values only.
Continuous variables
- generate numerical answers that arise from measuring.
- a variable that can assume any numerical value over an interval or intervals
- a variable that can assume fractions or decimals.

DEPENDENT AND INDEPENDENT VARIABLE
Independent Variable – the variable that predicts the value of the other variable.
Dependent Variable – variable whose value is being predicted.

LEVELS OF MEASUREMENT
Categorical Variables
1. Nominal Scale – it is the lowest level of data measurement.
- The numerical result in measuring variables is used for identification purposes only.
- It does not signify any quantitative value.
- Use numbers for the purpose of identifying name or membership in a group or category.
- For mutually exclusive, unordered categories
- For example, your study might compare five different genotypes (types of species). You can code the five genotypes with numbers if you want, but the order is arbitrary and any calculations (for example, computing an average) would be meaningless.
- Classifies data into distinct categories in which no ranking is implied

2. Ordinal Scale – it has all the properties of the nominal scale.
- The numbers obtained not only identify variables but also give order/rank or inequalities.
- It has no equivalence of intervals.
- One where the order matters but not the difference between values.
Example:
• You might ask patients to express the amount of pain they are feeling on a scale of 1 to 10. A score of 7 means more pain than a score of 5, and that is more than a score of 3. But the difference between a pain of 7 and 5 may not be the same as that between 5 and 3. The values simply express an order.
• Movie ratings, from * to *****.
• Rating a product purchased online by giving it up to 5 stars
- Classifies data into distinct categories in which ranking is implied

Numerical Variables
3. Interval Scale – it has all the properties of the ordinal scale.
- Quantitative differences can be determined.
- Indicates an actual amount, and there is an equal unit of measurement separating each score, specifically equal intervals.
- The zero point of the interval scale is arbitrary and does not reflect an absence of the attribute.
- It does not have a true value of zero

4. Ratio scale – it has all the properties of the interval scale.
- It is an interval scale with the additional property that its zero position indicates the absence of the quantity being measured.
- Similar to interval data, but has an absolute zero and multiples are meaningful.
- A ratio variable has a clear definition of 0.0. When the variable equals 0.0, there is none of that variable.

NOTES:
• Variables like height, weight, enzyme activity are ratio variables.
• Whereas temperature, expressed in °F or °C, is not a ratio variable.
• A temperature of 0.0 on either of those scales does not mean ‘no heat’.
• However, temperature in Kelvin is a ratio variable, as 0.0 Kelvin really does mean ‘no heat’.
• Another counterexample is pH. It is not a ratio variable, as pH = 0 just means 1 molar of H+, and the definition of molar is fairly arbitrary.
• A pH of 0.0 does not mean ‘no acidity’ (quite the opposite!).
• When working with ratio variables, but not interval variables, you can look at the ratio of two measurements.
• A weight of 4 grams is twice a weight of 2 grams, because weight is a ratio variable.
• A temperature of 100 degrees C is not twice as hot as 50 degrees C, because temperature in °C is not a ratio variable.
• A pH of 3 is not twice as acidic as a pH of 6, because pH is not a ratio variable.

EXAMPLES OF LEVEL OF MEASUREMENT
Nominal
• Gender, ethnic origin, personality, vices, nationality
• Account number, tax identification number, student number, telephone number
Ordinal
• Passed or failed
• The first prize winner in a quiz bee is better than the 2nd prize winner
• Excellent is higher than satisfactory
• Very good is higher than good
• Rating an instructor/student in a class (excellent, very good, good, poor)
• More difficult than, happier than, etc.
Other examples of ordinal data are:
• Social class (mass, elite)
• Income (high, average, low)
• Ordering of meals by preference
• Responses to items on an instrument (always, sometimes, never)
• Grades (A, B, C, D, E)
• Rating scales (based on scores & percentage)
• Build of people or amount of product (small, medium, large)
Interval
• IQ test scores of employees in a company (a score of zero doesn’t mean the person has no intelligence)
• Temperature in Fahrenheit or Celsius (zero doesn’t mean absence of heat)
Ratio
• The amount of money in your pocket
• Time, Age, Height, Weight
• Number of students inside the classroom
• Shares of stock

SOURCES OF DATA
1. Primary Sources – The data collector is the one using the data for analysis. The data are collected by the researcher himself.

• By direct observation or measurement
• By interview (questionnaires or rating scales)
• By mailing recording or reporting forms
  - Ordinary or special mails
  - Courier services
  - E-mail and fax
• Data from a political survey
• Data collected from an experiment

2. Secondary Sources – The data compilers are secondary sources of data. The information is taken from published or unpublished materials previously gathered by other researchers or agencies.
• Analyzing census data
• Examining data from print journals or data published on the internet (books, newspapers, magazines, journals, published and unpublished theses and dissertations)
• Registration (registry of births/deaths, marriages)

RANDOM VARIABLES
- are variables whose values are determined by chance.
- are needed since one cannot do arithmetic operations on words
- enable us to compute statistics, such as average or variance.
- Example:
  • Selecting a product from a manufacturing process
    1 – defective   2 – non-defective
  • Selecting a student from each class
    1 – male   2 – female
  • Inspection of the availability of resources
    1 – more than enough   2 – enough   3 – scarce resource

TWO TYPES OF RANDOM VARIABLES
1. Categorical Random Variables
2. Numerical Random Variables

TWO TYPES OF NUMERICAL RANDOM VARIABLES
1. Discrete Random Variables
2. Continuous Random Variables

RANGE OF APPLICATIONS:
A few examples of business applications of statistics are given below:
1. An auditor can use random sampling techniques to audit the accounts receivable transactions to make inferences and decisions about the validity of the total accounts receivable number reported on the company’s balance sheet.
2. A financial analyst may use regression and correlation to understand the relationship of a financial ratio to a set of other variables in business.
3. A sales manager may use statistical techniques to forecast sales as well as production cost for the coming year.
4. A manager may use probability and statistics in the evaluation of alternative projects or investments.
5. A market researcher may use a test of significance about a group of buyers to which the firm wishes to sell a particular product.
6. A product developer may use statistics to determine what potential customers want in a new product being developed.
7. A production manager tests manufacturing processes to ensure that products are produced with desired customer and regulatory specifications.
8. Restaurant managers use statistics to examine variability of important performance measures, such as customer orders and food cost, to plan work schedules and estimate material purchases.

FOUR PROCESSES OF STATISTICS:
1. Collection
2. Description
3. Analysis
4. Interpretation

METHODS OF DATA COLLECTION
The following are some methods of collecting data.
1. The Direct or Interview Method
2. The Indirect or Questionnaire Method
3. The Registration Method
4. The Experimental or Observation Method

COLLECTION OF DATA
1. Data are needed to provide the necessary input to a survey.
2. Data are needed to provide the necessary input to a study.
3. Data are needed to measure performance of an ongoing service or production process.
4. Data are needed to evaluate conformance to standards.
5. Data are needed to assist in formulating alternative courses of action in a decision-making process.
6. Data are needed to satisfy our curiosity.

PPT2: SAMPLING TECHNIQUES
Sampling is that part of statistical practice concerned with the selection of individual observations intended to yield some knowledge about a population of concern, especially for the purposes of statistical inference.

SAMPLING TECHNIQUES

1. Random Sampling
- Is a sampling technique where we select a group of subjects (a sample) for study from a larger group (a population).
- Each individual is chosen entirely by chance and each member of the population has a known, but possibly non-equal, chance of being included in the sample.

2. Simple Random Sampling
- Each member of the population has an equal chance to be included in the sample gathered.
• Lottery or Fishbowl technique
• Table of Random Numbers – random numbers can be generated by a random number table, software program or a calculator.

Example: There are 800 students currently enrolled in your school. You wish to form a sample of ten students to answer some survey questions.
• Assign numbers 001 to 800 to each student.
• On the table of random numbers, choose a starting place at random (anywhere, say, the 5th column, 2nd row).
• Read numbers in groupings of three digits. Get the first 10 groupings that correspond to a student number.
• 261, 046, 731, 800, 701, 349, 866, 675, 199, 723, 596 (866 is larger than 800 and is discarded, which leaves ten usable numbers.)

SAMPLING WITH OR WITHOUT REPLACEMENT
• Without Replacement
- Not allowing duplicates when sampling
• With Replacement
- Allowing duplicates when sampling

• Instinctively most people believe that sampling without replacement is preferred over sampling with replacement because allowing duplicates in our sample seems odd.
• In reality, sampling without replacement can be a problem when our sample size n is close to our population size N.
• At some point in the sampling process, the remaining items in the population will no longer have the same probability of being selected as the items we chose at the beginning of the sampling process. This could lead to a bias.
• When should we worry about sampling without replacement?
- Only when the population is finite and the sample size is close to the population size.

Note:
• A common criterion is that a finite population is effectively infinite if the sample is less than 5 percent of the population (i.e., if n/N ≤ .05).
• An equivalent statement is that a population is effectively infinite when it is at least 20 times as large as the sample (i.e., when N/n ≥ 20).

3. Systematic Random Sampling
- Obtained when we choose every “nth” individual in a population.
- The items or individuals are arranged in some way, perhaps alphabetically or in some other order.

4. Stratified Random Sampling
- A population is first divided into subsets, called strata, based on homogeneity.
- Stratified sampling involves selecting independent samples from a number of subpopulations, groups or strata within the population
- Example: Suppose a farmer wishes to work out the average milk yield of each cow type in his herd which consists of Ayrshire, Friesian, Galloway and Jersey cows. He could divide up his herd into the four sub-groups and take samples from these.

5. Cluster Sampling
- Can be done by subdividing the population into smaller units or clusters, usually along geographic boundaries, then selecting at random only some primary units where the study would then be concentrated.
- Strata consist of geographical regions.
• One-stage cluster sampling – sample consists of all elements in each of k randomly chosen subregions (clusters).
• Two-stage cluster sampling – first choose k subregions (clusters), then choose a random sample of elements within each cluster

6. Multi-Stage Sampling
- Combination of several sampling techniques.
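
Several of the probability sampling techniques on this page (simple random, systematic, stratified) and the n/N ≤ .05 rule of thumb can be sketched with Python's standard `random` module. This is a minimal illustration under stated assumptions: the function names are hypothetical, the 800 student numbers follow the example in the text, and the herd of 25 cows per breed is invented for the demonstration.

```python
import random

def simple_random_sample(population, n, rng):
    """Every member has an equal chance; drawn without replacement (no duplicates)."""
    return rng.sample(population, n)

def systematic_sample(population, n, rng):
    """Choose a random start, then take every k-th member of the ordered list."""
    k = len(population) // n          # sampling interval
    start = rng.randrange(k)          # random start inside the first interval
    return [population[start + i * k] for i in range(n)]

def stratified_sample(strata, n_per_stratum, rng):
    """Draw an independent simple random sample from each stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def effectively_infinite(n, N):
    """Rule of thumb from the note on replacement: n/N <= 0.05."""
    return n / N <= 0.05

rng = random.Random(2024)                 # seeded so the run is repeatable
students = list(range(1, 801))            # student numbers 001 to 800

print(simple_random_sample(students, 10, rng))
print(systematic_sample(students, 10, rng))

# Hypothetical herd: 25 cows of each breed named in the stratified example.
herd = {breed: [f"{breed}-{i}" for i in range(1, 26)]
        for breed in ("Ayrshire", "Friesian", "Galloway", "Jersey")}
print(stratified_sample(herd, 3, rng))

print(effectively_infinite(10, 800))      # True, since 10/800 = 0.0125
```

Software random numbers play the role of the printed table of random numbers here, and because `rng.sample` never returns a number outside the list, no value such as 866 has to be discarded.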

- Multi-stage sampling is usually used by researchers who are interested in studying a very large population, say, all the provinces included in CALABARZON.
- This is done by starting the selection of the members of the sample using cluster sampling and then dividing each cluster into strata. Then from each stratum individuals are drawn randomly using simple random sampling.

NON-PROBABILITY SAMPLING
- is a sampling technique wherein members of the sample are drawn from the population based on the judgment of the researchers.
- The result of the study using this technique is relatively biased.
- This technique lacks objectivity of selection; hence, it is sometimes called subjective sampling.
- Non-probability sampling techniques are sometimes used because they are convenient and economical.
- Researchers use this method because it is inexpensive and easy to conduct.

1. Convenience Sampling – Take advantage of whatever sample is available at that moment. A quick way to sample
2. Purposive Sampling
3. Judgment Sampling
- A non-probability sampling method that relies on the expertise of the sampler to choose items that are representative of the population.
- Can be affected by subconscious bias (i.e., non-randomness in the choice).
- Quota sampling is a special kind of judgment sampling, in which the interviewer chooses a certain number of people in each category.
4. Focus Groups – A panel of individuals chosen to be representative of a wider population, formed for open-ended discussion and idea gathering

OTHER DATA COLLECTION METHODS
• Point-of-sale (POS) systems can collect real-time data on purchases at retail or convenience stores, restaurants, and gas stations.
• Many companies use loyalty cards that are swiped during the purchase. These loyalty cards have the customer’s information, which can be matched to the purchase just made.
• Businesses also send out e-mail surveys to loyal customers on a regular basis to get feedback on their products and services.
• Facebook can track your Internet searches using its software algorithms.
• Google also tracks Internet searches and provides these data through its Google Analytics services.

SAMPLE SIZE
• The necessary sample size depends on the inherent variability of the quantity being measured and the desired precision of the estimate.
• For example, the caffeine content of Mountain Dew is fairly consistent because each can or bottle is filled at the factory, so a small sample size would suffice to estimate the mean.
• In contrast, the amount of caffeine in an individually brewed cup of Bigelow Raspberry Royale tea varies widely because people let it steep for varying lengths of time, so a larger sample would be needed to estimate the mean.
• The purposes of the investigation, the costs of sampling, the budget, and time constraints also are taken into account in deciding on sample size.

SOURCES OF ERROR OR BIAS
• In sampling, the word bias refers to a systematic tendency to over- or underestimate a population parameter of interest.
• However, the words bias and error are often used interchangeably.
• The word error generally refers to problems in sample methodology that lead to inaccurate estimates of a population parameter.
• No matter how careful you are when conducting a survey, you will encounter potential sources of error.

• Nonresponse bias occurs when those who respond have characteristics different from those who don’t respond.
  • For example, people with caller ID, answering machines, blocked or unlisted numbers, or cell phones are likely to be missed in telephone surveys. Because these are generally more affluent individuals, their socioeconomic class may be underrepresented in the poll.
• A special case is selection bias, a self-selected sample.
  • For example, a talk show host who invites viewers to take a web survey about their sex lives will attract plenty of respondents. But those who are willing to reveal details of their personal lives (and who have time to complete the survey) are likely to differ substantially from those who dislike such surveys or are too busy (and probably weren’t watching the show anyway).
• Response error occurs when respondents deliberately give false information to mimic socially acceptable answers, to avoid embarrassment, or to protect personal information.
• Coverage error occurs when some important segment of the target population is systematically missed.
• Measurement error results when the survey questions do not accurately reveal the construct being assessed.
• Interviewer error occurs when the interviewer’s facial expressions, tone of voice, or appearance influences the responses.
• Sampling error is uncontrollable random error that is inherent in any random sample.
• Even when using a random sampling method, it is possible that the sample will contain unusual responses. This cannot be prevented and is generally undetectable. It is not an error on your part.

PPT3: SURVEY RESEARCH
BASIC STEPS IN SURVEY RESEARCH
Step 1. State the goals of the research.
Step 2. Develop the budget (time, money, staff).
Step 3. Create a research design (target population, frame, sample size).
Step 4. Choose a survey type and method of administration.
Step 5. Design a data collection instrument (questionnaire).
Step 6. Pretest the survey instrument and revise as needed.
Step 7. Administer the survey (follow up if needed).
Step 8. Code the data and analyze it.

SURVEY TYPES
Type of Survey and Characteristics

Mail
• You need a well-targeted and current mailing list (people move a lot).
• Low response rates are typical and nonresponse bias is expected (non-respondents differ from those who respond).
• Zip code lists (often costly) are an attractive option to define strata of similar income, education, and attitudes.
• To encourage participation, a cover letter should clearly explain the uses to which the data will be put.
• Plan for follow-up mailings.

Telephone
• Random dialing yields very low response and is poorly targeted.
• Purchased phone lists help reach the target population, though a low response rate still is typical (disconnected phones, caller screening, answering machines, work hours, no-call lists).
• Other sources of nonresponse bias include the growing number of non-English speakers and distrust caused by scams and spam.

Interviews
• Interviewing is expensive and time-consuming, yet trading sample size for high-quality results may still be worth it.
• Interviews must be carefully handled, so interviewers must be well-trained – an added cost.
• But you can obtain information on complex or sensitive topics (e.g., gender discrimination in companies, birth control practices, diet and exercise habits).

Web
• Web surveys are growing in popularity, but are subject to nonresponse bias because those who participate may differ from those who feel too busy, don’t own computers or distrust your motives (scams and spam are again to blame).
• This type of survey works best when targeted to a well-defined interest group on a question of self-interest (e.g., views of CPAs on new proposed accounting rules, frequent flyer views on airline security).

Direct Observation
• This can be done in a controlled setting (e.g., psychology lab) but requires informed consent, which can change behavior.
• Unobtrusive observation is possible in some non-lab settings (e.g., what percentage of airline passengers carry on more than two bags, what percentage of SUVs carry no passengers, what percentage of drivers wear seat belts).

SURVEY GUIDELINES
• Planning - What is the purpose of the survey? Consider staff expertise, needed skills, degree of precision, budget.
• Design - Invest time and money in designing the survey. Use books and references to avoid unnecessary errors.
• Quality - Take care in preparing a quality survey so that people will take you seriously.
• Pilot Test - Pretest on friends or co-workers to make sure the survey is clear.
• Buy-In - Improve response rates by stating the purpose of the survey, offering a token of appreciation or paving the way with endorsements.
• Expertise - Work with a consultant early on.

QUESTIONNAIRE DESIGN
• Use a lot of white space in layout.
• Begin with short, clear instructions.
• State the survey purpose.
• Assure anonymity.
• Instruct on how to submit the completed survey.
• Break survey into naturally occurring sections.
• Let respondents bypass sections that are not applicable (e.g., “if you answered no to Question 7, skip directly to Question 15”).
• Pretest and revise as needed.
• Keep as short as possible.

QUESTION WORDING
- The way a question is asked has a profound influence on the response.
- Example:
  1. Shall state taxes be cut?
  2. Shall state taxes be cut, if it means reducing highway maintenance?
  3. Shall state taxes be cut, if it means firing teachers and police?
- Make sure you have covered all the possibilities.
- Example:
  Are you married? Yes or No?
- Overlapping classes or unclear categories are a problem.
- Example:
  How old is your father?
  • 35 – 45
  • 45 – 55
  • 55 – 65
  • 65 or older

CODING AND DATA SCREENING
• Responses are usually coded numerically (e.g., 1 = male, 2 = female).
• Missing values are typically denoted by special characters (e.g., blank, “.” or “*”).
• Discard questionnaires that are flawed or missing many responses.
• Watch for multiple responses, outrageous or inconsistent replies or range answers.
• Follow up if necessary and always document your data-coding decisions.

DATA FILE FORMAT
Enter data into a spreadsheet or database as a “flat file” (n subjects x m variables matrix).

ADVICE ON COPYING DATA
• Using commas (,), dollar signs ($), or percents (%) as part of the values may result in your data being treated as text values.
• A numerical variable may only contain the digits 0-9, a decimal point, and a minus sign.
• To avoid round-off errors, format the data column as plain numbers with the desired number of decimal places before you copy the data to a statistical package.

PPT4: THE MEASURES OF VARIATION
SIGNIFICANCE OF THE MEASURE:
1. It categorically supports the descriptive value of the measures of central tendency.
2. It functions as a measure of risk or uncertainty in the field of finance.
3. It provides measures of volatility in considering alternatives for pricing commodities.
4. It may be commonly used as a measure of error in the field of forecasting.

COMPUTATIONAL PROCEDURES:
1. Range – is the scale distance between the highest and the lowest value in a given set of data
2. Mean Absolute Deviation – the average distance of the values from the arithmetic mean

Formula for Ungrouped Data: $MAD = \frac{\sum |x - \bar{x}|}{n}$
Formula for Grouped Data: $MAD = \frac{\sum f\,|x - \bar{x}|}{n}$

3. Quartile Deviation – one half of the distance between the first and third quartiles.

Formula for Quartile Deviation: $QD = \frac{Q_3 - Q_1}{2}$

4. Variance and Standard Deviation
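
As a rough companion to the computational procedures in this part, the sketch below computes the range, the mean absolute deviation (ungrouped formula), the quartile deviation, and the sample variance and standard deviation named in item 4. The data values and function names are hypothetical, not from the slides. Quartiles are taken with the default rule of `statistics.quantiles`, which is an assumption: other textbooks and packages use slightly different quartile conventions and may give a different QD.

```python
import statistics

def value_range(data):
    """Range: distance between the highest and the lowest value."""
    return max(data) - min(data)

def mean_absolute_deviation(data):
    """MAD for ungrouped data: sum(|x - mean|) / n."""
    x_bar = statistics.mean(data)
    return sum(abs(x - x_bar) for x in data) / len(data)

def quartile_deviation(data):
    """QD = (Q3 - Q1) / 2, with quartiles from statistics.quantiles (default rule)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return (q3 - q1) / 2

scores = [2, 4, 6, 8, 10]                    # hypothetical data set

print(value_range(scores))                   # 8
print(mean_absolute_deviation(scores))       # 2.4
print(quartile_deviation(scores))
print(statistics.variance(scores))           # sample variance, divides by n - 1
print(statistics.stdev(scores))              # sample standard deviation
```

For population data, `statistics.pvariance` and `statistics.pstdev` divide by N instead of n - 1.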

• Variance ( s2) – the average squared difference or deviation from Tables and Graphs for Univariate Numerical Data
• Univariate Data – those data having one scalar components.
the mean They involve only one variable.
• Standard Deviation (s) – the average square root of the variance
Ungrouped Data and Grouped Data
PROPERTIES OF STANDARD DEVIATION: 1. Ordered Array (Data Array)
1. Standard Deviation is only used to measure the spread or 2. Stem and Leaf Display
dispersion around the mean of a data set
2. Standard Deviation is never negative Illustration
3. For data with approximately the same mean, the greater the
Construct a stem and Leaf display and frequency distribution table with
spread, the greater the Standard Deviation
five classes for the following College Algebra grades of 30 students.
4. If all the values of a data set are the same, the Standard
Deviation is zero (because each value is equal to the mean)
70 83 87 76 80 87 75 84 85 76 81 82 89 77 84 86 71 80 80 79 84 86
93 83 85 88 72 84 84 92
Population Variance Sample Variance
Answer:

σ 2=
∑ ( x− x́)2 s2=
∑ (x− x́)2 7
8
01256679
00012334444455667789
8
20

N n−1 9 23 2

Population Standard Deviation Sample Standard Deviation


2
σ =√ σ s= √ s2 Grouped Data
1. Frequency Distribution Table - is a summary table in which data
are arranged into conveniently established, numerically ordered
class groupings or categories.

FDT using Microsoft Excel


1. Highlight the data Column.
Illustration: 2. Choose “Insert” from the menu bar.
3. Select “Pivot Table”.
4. Select “Existing Worksheet”, then click OK.
5. Click the column name (i.e., “Scores”).
6. Select one cell on the output column, then right click.
7. Select “Group”, then enter the desired interval size.
8. Right click any interval then drag the interval column title to
“Values”

Illustration:
Frequency Distribution of the Philippines’ Richest as of 082509

Column headings: Net Worth in Million of Dollars | Number of Persons | X | Cf< | Cf> | Proportions of Persons | % of Persons | % of Persons "Less than" Lower Boundary of Class Interval

Net Worth (Million Dollars)   Number of Persons   X
38 - 370                      26                  204
371 - 703                     8                   537
704 - 1036                    3                   870
1037 - 1369                   1                   1203
1370 - 1702                   1                   1536

Illustration 2:
Column headings: Net Worth in Million of Dollars | Number of Persons | X | Cf< | Cf> | Proportions of Persons | % of Persons | % of Persons "Less than" Lower Boundary of Class Interval

7 - 9      2
10 - 12    8
13 - 15    14
16 - 18    17
19 - 21    20    20

RELATIVE MEASURES OF VARIABILITY

Coefficient of Variation
- Allows comparison of the variability of scores in two sets of data that do not necessarily measure the same thing.
- Denoted by ‘v’ (written CV in the formula).
- It is useful to use this coefficient when the means of the distributions being compared are far apart, or the data are in different units.

Formula of Coefficient of Variation
$CV = \dfrac{s}{\bar{x}} \times 100$
Where:
s = standard deviation
$\bar{x}$ = mean

Illustration:

                          Set A           Set B
Mean                      7.4             7
Sample Variance and SD    1.3 and 1.14    2.5 and 1.58
CV                        0.15 or 15%     0.23 or 23%
IQR                       2               3
QD                        1               1.5
Median                    7               7

PPT5: DESCRIPTIVE STATISTICS
Descriptive Statistics
- A method of collection, organization, summarization, analysis and presentation of quantitative data.
- It is a tool/measurement that aims to describe a particular sample data.

FUNCTIONS OF DESCRIPTIVE STATISTICS
1. It provides quantitative and qualitative description about the sample data.
2. It summarizes in a single quantitative value a particular characteristic of sample data.
3. It conveys significant idea about the sample data relative to the study.
4. It presents the collected data for easy and further analysis using tables and graphs, percentages, measures of central tendency or variability, skewness and kurtosis, among other measures.

DATA PRESENTATION USING TABLES AND GRAPHS

PPT6: TRUNCATED MEAN, OUTLIERS, AND GEOMETRIC MEAN

OUTLIERS
Outliers - are observations that lie an abnormal distance from other values in a random sample from a population.
• In cost accounting, an outlier could be a cost or its related level of activity that is out of line with other observations.
• An outlier can be detected by plotting each observation's cost and related level of activity onto a graph or scatter diagram. If one of those points deviates from the pattern of the other points, it is said to be an outlier. The outlier could be the result of an accounting error, an unusual charge, or a unique change in volume.
• To avoid developing an incorrect formula for estimating future costs, the outlier should be investigated and perhaps excluded.
• Removal of outliers is meant to “objectively” detect and reject outliers that are due to systematic errors by the experimentalist.

Procedure for determining the Outliers using the IQR (Tukey’s Method)
1. Arrange the data in order and find Q1 and Q3.
2. Find the interquartile range IQR = Q3 - Q1.
3. Multiply the IQR by 1.5.
4. Subtract the value obtained in step 3 from Q1 and add the value to Q3.
5. Check the data set for any value that is smaller than Q1 – 1.5(IQR) or larger than Q3 + 1.5(IQR), that is, any value outside the interval [Q1 – 1.5(IQR), Q3 + 1.5(IQR)].

Illustration:
Check the following data set for outliers: 5, 6, 12, 13, 15, 18, 22, 50
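
Tukey's IQR procedure for the outlier illustration (5, 6, 12, 13, 15, 18, 22, 50) can be sketched in Python as follows. This is only a sketch under one assumption: the quartile positions are taken as (n + 1)/4 and 3(n + 1)/4 with linear interpolation between neighbouring values. Other quartile conventions give slightly different fences. The helper name `value_at_position` is hypothetical.

```python
def value_at_position(sorted_values, position):
    """Interpolate the value at a 1-based, possibly fractional, position."""
    rank = int(position)          # whole part of the position
    decimal = position - rank     # decimal part of the position
    if rank < 1:
        return sorted_values[0]
    if rank >= len(sorted_values):
        return sorted_values[-1]
    lower_value = sorted_values[rank - 1]
    higher_value = sorted_values[rank]
    return lower_value + decimal * (higher_value - lower_value)

data = sorted([5, 6, 12, 13, 15, 18, 22, 50])   # step 1: arrange the data
n = len(data)
q1 = value_at_position(data, (n + 1) / 4)
q3 = value_at_position(data, 3 * (n + 1) / 4)
iqr = q3 - q1                                    # step 2
step = 1.5 * iqr                                 # step 3
lower_fence = q1 - step                          # step 4
upper_fence = q3 + step
outliers = [x for x in data if x < lower_fence or x > upper_fence]  # step 5

print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```

Under this quartile convention the only value outside the fences is 50.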

\
TRUNCATED MEAN
- Obtained by computing the arithmetic mean but removing first the extreme value or outliers (Q) on both ends of the distribution (lowest and/or highest).
- Removed for the following reasons:
a. to eliminate the directional effect of these values.
b. to give a better descriptive average of the data.

Illustration:
Given the interest rates: 2.12, 2.16, 2.14, 3.12, 2.13.
Determine if 3.12 is an outlier.

GEOMETRIC MEAN RATE OF RETURN
This is done by computing the nth root of the product of n numerical values, then subtracting 1.

Formula:
$\bar{R}_G = \sqrt[n]{x_1 x_2 x_3 x_4 \cdots x_n} - 1$
Where:
x = 1 + r
r = growth rate on the given period

Uses of the Geometric Mean Rate of Return:
1. An alternative measure of average if the arithmetic mean overestimates or underestimates the average of the data.
2. Commonly used for data such as periodic percentage increase/decrease.
3. Widely used to measure the correct average of periodic growth rates.

GEOMETRIC MEAN
- Calculated by raising the product of a series of numbers to the inverse of the total length of the series.
- Used in finance to calculate average growth rates and is referred to as the compounded annual growth rate.
- Most useful when numbers in the series are not independent of each other or if numbers tend to make large fluctuations.
- Applications of the geometric mean are most common in business and finance, where it is commonly used when dealing with percentages to calculate growth rates and returns on a portfolio of securities.
- Geometric mean is also used in certain financial and stock market indexes, such as Financial Times' Value Line Geometric index.

Illustration:
Consider a stock that grows by 10% in year one, declines by 20% in year two, grows 9% in year three, and grows by 30% in year four. The geometric mean of the growth rate is calculated as:

$[(1+0.1)(1-0.2)(1+0.09)(1+0.3)]^{1/4} - 1 = 0.0567$ or 5.67% annually.

Illustration 2:
Philippine Stock Exchange’s Index Monthly Growth Rates over 3 months: +5%, -3%, +10%.

Answer: 0.03861 or 3.86%

PPT7: DATA MANAGEMENT NON-COMPARATIVE MEASUREMENT SCALES

Measurement Scales
The various types of scales used in business research fall into two broad categories:
1. Comparative Scale
- In comparative scaling, the respondent is asked to compare one brand or product against another.
2. Non-Comparative Scale
- With non-comparative scaling, respondents need only evaluate a single product or brand.
- Their evaluation is independent of the other products and/or brands which the researcher is studying.
- Non-comparative scaling is frequently referred to as monadic scaling.
- This is the more widely used type of scale in commercial marketing research studies.

COMPARATIVE AND NON-COMPARATIVE MEASUREMENT SCALES

Comparative Scales
Paired comparison: Researchers wish to find out which are the most important factors in determining the demand for a product. Conversely, they may wish to know which of the most important factors are acting to prevent the widespread adoption of a product.

Example:
A very poor farmer's response to the first design of an animal-drawn mould board plough. A combination of exploratory research and shrewd observation suggested that the following factors played a role in the shaping of the attitudes of those farmers who feel negatively towards the design:
• Does not ridge
• Does not work for inter-cropping
• Far too expensive
• New technology too risky
• Too difficult to carry.

Non-comparative Scales
Continuous Rating Scales: The respondents are asked to give a rating by placing a mark at the appropriate position on a continuous line. The scale can be written on a card and shown to the respondent during the interview. Two versions of a continuous rating scale are depicted in figure 3.7.

TWO COMMONLY USED SCALES IN SURVEYS AND RESEARCH:
1. Semantic Scale
- Makes use of words rather than numbers. Respondents describe their feelings about the situation, brand or product on scales with semantic labels.
- The Semantic Differential Scale measures connotative meaning.
- This scale allows a researcher to measure a subject’s attitude toward a particular concept.

Personality or Character Inventories
These are designed to measure certain traits of individuals or assess their feelings about themselves.

2. Likert Scale
- A commonly used attitude scale where subjects are asked to circle the word or number that best represents how they feel about the topics included in the questions or statements in the scale.
- Named after Rensis Likert
- Uni-dimensional scaling method
- It is a psychometric response scale often used in questionnaires, and is the most widely used scale in survey research.
- A Likert scale is what is termed a summated instrument scale.
- This means that the items making up a Likert scale are summed to produce a total score.

\
- A Likert scale is a composite of itemized scales.
Illustration:
A study of annual revenue growth for the apartment sector from 4th quarter 2004 to 3rd quarter 2014 used the coefficient of variation to compare historical volatility with average annual revenue change at both the national and metro level.
So, according to the coefficient of variation, on a standalone basis,
Phoenix ranks as one of the riskiest metros for investors. Then, when
you look at the less risky metros – specifically Washington DC, SF,
Miami and San Diego – these cities have some common traits, such as
high barriers to entry, strong growth in the 20 to 34-year-old age cohort
(group) and expensive single-family housing.

Steps in Constructing a Likert Scale


1. Define the concept or attitude to be measured.
2. Create the items.
3. Rate the items.
4. Select the items.
5. Administer the scale.

How to score Likert Scales


1. Reduce to nominal scale, group into accept or reject categories.
2. Treat as ordinal data, use Mann-Whitney test, Wilcoxon signed
rank test or Kruskal Wallis test.
3. If summed up to produce a total score, use parametric test like
the t-test and analysis of variance.
PPT8: RISK AND RETURN OF INVESTMENT
TERMS:
1. Expected Value – is the weighted mean rate of return of the financial asset or investment, commonly the average expected returns relative to its total value.
2. Weight – is the measure of possibility (probability) of the portfolio return; it is the proportion of each asset/investment relative to the total value of the portfolio.
3. Portfolio – is a financial concern or interest of an individual or company.
4. Portfolio Management – deals with the process of analyzing financial investments leading to a decision that benefits the individual or company (investors).
5. Portfolio Return – is the expected value of a certain financial asset or investment.
6. Risk – is the quantification of uncertainty of future values. It is a variation around the mean of the data. It is the difference of each outcome and the mean (Variance).

To Illustrate the EV (Expected Return)
XYZ Corporation is planning to invest in Stock A. The firm expects that the possible returns are dependent on the state of the economy. Determine the expected return on their planned investment.

State of Economy    Possible Returns (Ri)    Probability Distributions (Pi)
Recession           10%                      0.20
Normal              15%                      0.60
Prosperity          20%                      0.20

Expected return = (10% x 0.20) + (15% x 0.60) + (20% x 0.20)
= 0.15 or 15%

Note:
• The smaller the coefficient of variation, the less risk per unit of return an investor takes on. (low risk of not attaining a high return)
• The higher the coefficient of variation, the more risk per unit of return an investor takes on. (high risk of not attaining a high return)
• The coefficient of variation represents the ratio of the standard deviation to the mean, and it is a useful statistic for comparing the degree of variation from one data series to another, even if the means are drastically different from one another.
• Most investors are risk-averse; they want to minimize their risk per unit of return.
• Coefficient of variation provides a standardized measure for comparing risk and return of different investments. A rational investor would select the investment with the lowest coefficient of variation.
• The coefficient of variation measures the amount of risk per unit of return. The ratio is useful when comparing investments with varying degrees of standard deviation and returns. The ratio is calculated as follows:

$\text{Coefficient of Variation} = \dfrac{\text{Standard Deviation}}{\text{Average Gain}}$

• To compare how much riskier one investment is than another, we use the ratio of the coefficients of variation of the two investments, say investment A over investment B.

$\text{Ratio} = \dfrac{\text{CV of Investment A}}{\text{CV of Investment B}}$
Where A = Larger CV and B = Lower CV

PORTFOLIO RETURN
A portfolio is a collection of investments all owned by an individual or a firm.

Illustration:
ABC Corporation is considering investing in three stocks. Listed below are the expected returns for each investment.

Stocks    Individual Expected Returns (%)    Amount Invested
FLI       10                                 20,000
PA        15                                 50,000
WEB       18                                 30,000

What is the expected portfolio return?
Each weight is the amount invested divided by the total of 100,000, giving 20%, 50% and 30%.
Expected return = (10% x 20%) + (15% x 50%) + (18% x 30%)
= 14.90%

EXPECTED RETURN
$EV = \sum_{i=1}^{N} P_i r_i$ (probability times rate of return, summed over all possible outcomes)

MEASURING RISK – STANDARD DEVIATION
$\sigma = \sqrt{\sum_{i=1}^{N} (r_i - \hat{r})^2 P_i}$

Illustration:
Consider the following annual dividends of 3 stocks (in thousands) for 5 years:
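
The expected return, standard deviation and portfolio return formulas can be checked with a short Python sketch. It reuses the XYZ Corporation and ABC Corporation figures from the illustrations; the variable names are illustrative only, and the coefficient of variation at the end is simply the standard deviation divided by the expected return, as in the note on risk per unit of return.

```python
import math

# XYZ Corporation, Stock A: possible returns and their probabilities.
returns = [0.10, 0.15, 0.20]
probabilities = [0.20, 0.60, 0.20]

# Expected return: sum of probability times rate of return.
expected_return = sum(p * r for p, r in zip(probabilities, returns))

# Risk: probability-weighted standard deviation around the expected return.
variance = sum(p * (r - expected_return) ** 2
               for p, r in zip(probabilities, returns))
std_dev = math.sqrt(variance)
coefficient_of_variation = std_dev / expected_return

# ABC Corporation portfolio: weights are amounts invested over the total.
amounts = {"FLI": 20_000, "PA": 50_000, "WEB": 30_000}
stock_returns = {"FLI": 0.10, "PA": 0.15, "WEB": 0.18}
total_invested = sum(amounts.values())
portfolio_return = sum(amounts[s] / total_invested * stock_returns[s]
                       for s in amounts)

print(f"Expected return: {expected_return:.2%}")
print(f"Standard deviation: {std_dev:.4f}")
print(f"Coefficient of variation: {coefficient_of_variation:.3f}")
print(f"Portfolio return: {portfolio_return:.2%}")
```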

MEASURING RISK
Risk is the exposure to uncertainty or danger resulting in changes in the expected return of a given investment.
• Standard deviation (σi) measures total, or stand-alone, risk.
• The larger σi is, the lower the probability that actual returns will be close to expected returns.
• Larger σi is associated with a wider probability distribution of returns.

a. Which of these stocks has the greatest average annual dividend in 5 years?
b. Which is the best stock in terms of its annual dividend?
c. Which among the 3 stocks is least risky?
d. Which among the 3 stocks is the riskiest?
e. Determine how much riskier Stock C is compared to Stock A.

\
f. How does the measure of variability help in the decision?

Answer:

PPT9: MEASURES OF SHAPE

MEASURES OF SHAPE (SKEWNESS AND KURTOSIS)

• SKEWNESS
- Tells you the amount and direction of skew (departure from horizontal symmetry)
- Is the degree of symmetry, or departure from symmetry, of a set of data
- It also indicates whether the curve has a longer tail on the right (positively skewed) or a longer tail on the left (negatively skewed)

Rule of Thumb in Interpreting the Skewness Number
If skewness is less than –1 or greater than +1: Highly Skewed (positively or negatively)
If skewness is between –1 and –½ or between +½ and +1: Moderately Skewed (positively or negatively)
If skewness is between –½ and +½: Approximately Symmetric

TYPES OF SKEWNESS
1. Positive skewness
- The mode is located at the highest point of the curve.
- The median is the middle value of the distribution.
- The mean is greater than the median and the mode and is found towards the upper tail end of the curve (right side).
- The mean is pulled by the few but very high numerical observations.

2. Negative skewness
- The mode is the highest point
- The median is the middle value
- The mean is less than the median and the mode and is found towards the lower tail end of the curve (left side)

3. Symmetrical (No skew)
- The mean, the median and the mode are found at the center of the distribution (normal distribution)

COEFFICIENT OF SKEWNESS (SK)
Formula:
$sk = \dfrac{3(\bar{x} - \tilde{x})}{s}$
where $\bar{x}$ is the mean, $\tilde{x}$ is the median and s is the standard deviation.
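
The coefficient of skewness sk = 3(mean − median)/s and the rule of thumb for reading it can be sketched in Python as below. The function names are hypothetical. The data are the 12 city temperatures and the Boracay summary statistics used in the illustrations for this topic, and the sample standard deviation (n − 1 divisor) is assumed for the temperatures.

```python
import statistics

def pearson_skewness(mean, median, std_dev):
    """Coefficient of skewness: 3 * (mean - median) / standard deviation."""
    return 3 * (mean - median) / std_dev

def interpret_skewness(sk):
    """Apply the rule of thumb for the size of the skewness number."""
    if abs(sk) > 1:
        return "highly skewed"
    if abs(sk) >= 0.5:
        return "moderately skewed"
    return "approximately symmetric"

# Temperatures in 12 cities: statistics computed from the raw data.
temperatures = [72, 74, 75, 77, 78, 79, 82, 85, 86, 90, 93, 94]
sk_temperatures = pearson_skewness(statistics.mean(temperatures),
                                   statistics.median(temperatures),
                                   statistics.stdev(temperatures))

# Boracay vacationers: summary statistics given directly.
sk_ages = pearson_skewness(41.83, 43.23, 14.32)

print(round(sk_temperatures, 2), interpret_skewness(sk_temperatures))
print(round(sk_ages, 2), interpret_skewness(sk_ages))
```

A positive result means the longer tail is on the right; a negative result means it is on the left.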
PROPERTIES OF SKEWNESS
• Similar in shape to a normal distribution except that it is not symmetrical: the left half of the polygon is not a mirror image of the right half.
• The mean, median, and the mode are not equal
• If the mean is larger than the median and the mode, the distribution is said to have a positive skewness or skewed to the right
• If the mean is smallest of the three averages, the distribution is negatively skewed or skewed to the left

THE EFFECT OF SKEW ON THE MEAN AND MEDIAN
• A distribution is skewed if one of its tails is longer than the other
• Positive Skew – means that it has a long tail in the positive direction
• Negative Skew – It has a long tail in the negative direction
• Symmetric Distribution – Has no skew. Meaning, the mean, median, and the mode all coincide.

INTERPRETATION OF SKEWNESS:
• Positive Skewness – the data are positively skewed or skewed right, meaning that the right tail of the distribution is longer than the left
• Negative Skewness – the data are negatively skewed or skewed left, meaning that the left tail is longer.
• Skewness = 0 – the data are perfectly symmetrical.
• A skewness of exactly zero is unlikely for real-world data

THREE EXAMPLES OF DISTRIBUTIONS, EACH WITH DIFFERENT SKEWNESS:

Illustration:
Below are the temperature readings in 12 cities in the country during summer:
72, 74, 75, 77, 78, 79, 82, 85, 86, 90, 93, 94

Solve for the skewness then interpret the result.

Illustration 2:
The age distribution of 60 vacationers in Boracay Beach was observed. It was found that the mean age is 41.83, the median is 43.23 and s = 14.32. Find the value of skewness.

\
Answer:
$sk = \dfrac{3(\bar{x} - \tilde{x})}{s} = \dfrac{3(41.83 - 43.23)}{14.32} = -0.29$
The value -0.29 is between -0.5 and +0.5. This indicates that the curve is approximately symmetric, which is consistent with an approximately normal distribution.

- The truncated mean is obtained by computing the arithmetic mean but removing first the extreme value or outliers (Q) on both ends of the distribution (lowest and/or highest).
- These are removed for the following reasons:
a) To eliminate the directional effect of these values.
b) To give a better descriptive average of the data.

Example: Given the interest rates: 2.12, 2.16, 2.14, 3.12, 2.13. Determine if 3.12 is an outlier.

GEOMETRIC MEAN OF RATE OF RETURN
This is done by computing the nth root of the product of n numerical values, then subtracting 1.

$\bar{R}_G = \sqrt[n]{x_1 x_2 x_3 x_4 \cdots x_n} - 1$
Where:
x = 1 + r
r = growth rate on the given period

USES OF THE GEOMETRIC MEAN RATE OF RETURN:
1. An alternative measure of average if the arithmetic mean overestimates or underestimates the average of the data.
2. Commonly used for data such as periodic percentage increase/decrease.
3. Widely used to measure the correct average of periodic growth rates.

• KURTOSIS
- Tells you how tall and sharp the central peak is, relative to a standard bell curve.
- From the Greek word κυρτός, kyrtos or kurtos, meaning bulging.
- Any measure of the "peakedness" of a frequency distribution
- Refers to the peakedness or flatness of a frequency distribution
- Is a measure of flatness of the distribution.
• Heavier tailed distributions have larger kurtosis measures
• The normal distribution has a kurtosis of 3.
- Kurtosis characterizes the relative peakedness or flatness of a distribution compared with the normal distribution.
• Positive Kurtosis – indicates a relatively peaked distribution
• Negative Kurtosis – indicates a relatively flat distribution

CLASSIFICATIONS OF KURTOSIS:
1. Mesokurtic – Normal distribution
- The distribution of data wherein more items are clustered around a central value and there will be a few extremes.
- The mean = median = mode
2. Leptokurtic – A distribution that is more peaked than the normal
3. Platykurtic – The one that is flatter than the normal

Kurtosis for Ungrouped Data: $Ku = \dfrac{\sum (x - \bar{x})^4}{n s^4}$
Kurtosis for Grouped Data: $Ku = \dfrac{\sum f (x - \bar{x})^4}{n s^4}$

Because the normal distribution has Ku = 3 under these formulas, the classification below is read on the excess kurtosis, Ku − 3:
Ku − 3 = 0 → Mesokurtic
Ku − 3 > 0 → Leptokurtic
Ku − 3 < 0 → Platykurtic
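
The kurtosis formula for ungrouped data can be sketched in Python as follows, using the 12 city temperatures from the illustration. Two assumptions are made and should be checked against the course convention: s is the sample standard deviation (n − 1 divisor), and the mesokurtic/leptokurtic/platykurtic labels are applied to Ku − 3, since the notes state both that the normal distribution has a kurtosis of 3 and that the classification is centred on 0.

```python
import statistics

def kurtosis(values):
    """Ku = sum((x - mean)^4) / (n * s^4), with s the sample std deviation."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return sum((x - mean) ** 4 for x in values) / (n * s ** 4)

def classify_kurtosis(ku, tolerance=1e-9):
    """Classify using the excess kurtosis Ku - 3 (assumed convention)."""
    excess = ku - 3
    if abs(excess) <= tolerance:
        return "mesokurtic"
    return "leptokurtic" if excess > 0 else "platykurtic"

temperatures = [72, 74, 75, 77, 78, 79, 82, 85, 86, 90, 93, 94]
ku = kurtosis(temperatures)
shape = classify_kurtosis(ku)

print(round(ku, 2), shape)
```

For these temperatures Ku comes out well below 3, so the distribution is flatter than the normal curve under this convention.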
Illustration:
Below are the temperature readings in 12 cities in the country during summer:
72, 74, 75, 77, 78, 79, 82, 85, 86, 90, 93, 94

Solve for the kurtosis then interpret the result.

PPT10: THE MEASURES OF CENTRAL TENDENCY

MEAN OF UNGROUPED DATA
• The sum of the values divided by the number of values, often called the "average."
• Add all of the values together.
• Divide by the number of values to obtain the mean.

Illustration:
The mean of 7, 12, 24, 20, 19 is 16.4

Arithmetic Mean: $\bar{x} = \dfrac{\sum x_i}{n}$

GEOMETRIC MEAN
- Calculated by raising the product of a series of numbers to the inverse of the total length of the series.
- Used in finance to calculate average growth rates and is referred to as the compounded annual growth rate.
- Most useful when numbers in the series are not independent of each other or if numbers tend to make large fluctuations.
- Applications of the geometric mean are most common in business and finance, where it is commonly used when dealing with percentages to calculate growth rates and returns on a portfolio of securities.
- Geometric mean is also used in certain financial and stock market indexes, such as Financial Times' Value Line Geometric index.

Growth Rates Examples:
1. Consider a stock that grows by 10% in year one, declines by 20% in year two, grows 9% in year three, and grows by 30% in year four. The geometric mean of the growth rate is calculated as:

$[(1+0.1)(1-0.2)(1+0.09)(1+0.3)]^{1/4} - 1 = 0.0567$ or 5.67% annually.

2. Philippine Stock Exchange’s Index Monthly Growth Rates over 3 months: +5%, -3%, +10%.
Answer: 3.86% per month
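
Both growth-rate examples can be reproduced with a few lines of Python. This is a minimal sketch of the geometric mean rate of return: each rate r is turned into a growth factor x = 1 + r, the factors are multiplied, the nth root is taken, and 1 is subtracted. The function name is hypothetical, and `math.prod` assumes Python 3.8 or later.

```python
import math

def geometric_mean_return(rates):
    """Geometric mean rate of return for a list of periodic growth rates."""
    growth_factors = [1 + r for r in rates]      # x = 1 + r
    product = math.prod(growth_factors)          # x1 * x2 * ... * xn
    return product ** (1 / len(rates)) - 1       # nth root, minus 1

stock_rate = geometric_mean_return([0.10, -0.20, 0.09, 0.30])
index_rate = geometric_mean_return([0.05, -0.03, 0.10])

print(f"Stock: {stock_rate:.2%} per year")
print(f"Index: {index_rate:.2%} per month")
```

The arithmetic mean of the stock's four rates would be 7.25%, which overstates the compounded growth; the geometric mean gives the rate that, applied every year, reproduces the same ending value.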
WEIGHTED MEAN

COMBINED MEAN

Answer the following problems:
1. Three sections of a statistics class containing 28, 32, and 35 students averaged 83, 80, and 76, respectively, on the same final examination. What is the combined mean for all three sections?
2. On a vacation trip a family bought 21.3 liters of gasoline at 39.9 cents per liter, 18.7 liters at 42.9 cents per liter, and 23.5 liters at 40.9 cents per liter. Find the mean price paid per liter.
3. A survey of a random sample of people leaving an amusement park showed an average expenditure of $10.30 for the evening. The average expenditure for the 20 girls in the sample was $9.70 and for the boys it was $11.10. How many boys are there in the random sample?
4. A savings and loan association makes one car loan of P5000 at 10.5% interest, a second car loan of P6300 at 10.8% interest, and a third car loan of P4500 at 11% interest. What is the average percentage return to the savings and loan association for these three loans?

TRUNCATED MEAN

MEDIAN FOR UNGROUPED DATA
Median
- The value which divides the values into two equal halves, with half of the values being lower than the median and half higher than the median.
- The middlemost value in the list of items arranged in increasing or decreasing order.

Properties of Median:
1. It is the value midway between the highest and the lowest value in a rank order distribution.
2. It is the point that divides the distribution into two equal parts.

To find Median:
1. Sort the values into ascending order.
2. If you have an odd number of values, the median is the middle value.
3. If you have an even number of values, the median is the arithmetic mean of the two middle values.

MODE
- The most frequently-occurring value (or values).
- The value that appears most often.

\
To find the Mode:
1. Calculate the frequencies for all of the values in the data
2. The mode is the value (or values) with the highest frequency

Example:
For individuals having the following ages: 18, 18, 19, 20, 20, 20, 21, 23

Three Types of Mode:
• Unimodal - a distribution of only one mode
Example: 7, 3, 9, 7, 2, 1, 4, 7
• Bimodal - a distribution of two modes
Example: 1, 2, 8, 5, 1, 2, 4, 9, 6
• Trimodal - a distribution of three modes
Example: 9, 4, 6, 4, 1, 5, 1, 7, 9, 2, 3

Below are the temperature readings in 12 cities in the country during the summer:
72, 74, 75, 77, 78, 79, 82, 85, 86, 90, 93, 94

Solve the following:
Decile (3)         Decile (5)         Decile (8)
Percentile (25)    Percentile (75)    Percentile (85.5)

PPT11: INFERENTIAL STATISTICS (TEST OF NORMALITY)
THREE PRINCIPAL USES OF INFERENTIAL METHODS
1) Estimation
2) Hypothesis Testing
3) Regression Analysis and Correlation Analysis

APPLICATION
1. The mean grade of the first year, second year, and third year
students in Math in a particular high school is 80. If there are 32
first year, 29 second year, and 38 third year in this group whose
average are 84 and 80 for the first year and second year,
respectively, find the average of the third-year students of this
high school.
2. A statistics instructor computes final grades based on quizzes,
long tests, and final exam giving them weights of 2, 3, and 5,
respectively. If a student had grades of 87, 91, and 88, for
quizzes, long tests, and final exam, respectively, find the student’s
final grade in Statistics.

MEASURES OF LOCATION
1. Median
2. Quantiles

QUANTILES
- are also averages of position or location of the desired item.
- (n/2, n/4, 3n/4, n/10, n/100…)

1. Quartiles - divide the distribution into four equal parts.
• Quartile One (Q1) – measures the first one-fourth of the given distribution (n/4)
• Quartile Two (Q2) – divides the distribution into half and therefore will be the same as the median. (n/2)
• Quartile Three (Q3) – measures the point separating the third from the fourth or last quartile. (3n/4)
2. Deciles – divide the distribution into ten equal parts
3. Percentiles – divide the distribution into one hundred equal parts

QUANTILES FOR UNGROUPED DATA
Quantiles are also measures of location.

ESTIMATION
Estimation is the process of deriving the value of the parameter from the information obtained from a sample. The population is usually large and the parameter is always unknown, thus only estimates of the true parameter are derived or obtained from the sample.

CORRELATION ANALYSIS AND REGRESSION ANALYSIS
• Correlation Analysis is a statistical technique used in determining whether a relationship exists between variables of interest.
• Regression Analysis is a statistical method that identifies the relationship between quantitative variables.
- Once the relationship is established, regression moves on to the prediction capability of inferential statistics.

HYPOTHESIS TESTING
Hypothesis Testing is a decision-making process for evaluating claims or beliefs about a population.
The researcher:
1) States the particular hypothesis that will be evaluated;
2) Gives a significance level;
3) Selects a sample from the population;
4) Collects the data;
5) Performs the calculations required for the test; and
6) Makes a probabilistic decision.

TEST OF NORMALITY
Two descriptive techniques to determine if the sample data fits the
normal distribution:
1) Compare the defining summary statistics of the sampled data with
the properties of the normal distribution.
2) Construct the normal probability plot.

There are several ways to check for normality.


• The easiest way is to draw a histogram and check its shape. If the
histogram is not approximately bell shaped, then the data are not
normally distributed.
• Skewness can be checked using the Pearson Coefficient (PC) of
skewness also known as Pearson’s Index of Skewness.
• In addition, the data should be checked for outliers. Even one or
two outliers can have a big effect on normality.

Constructing the Histogram
1. First enter the data on the first column.
2. Enter the bin numbers (upper levels/upper boundaries) in the range C4:C10.
3. On the Data tab, in the Analysis group, click Data Analysis.
4. Select Histogram and click OK.
5. Select the given data (range A2:A19).
6. Click in the Bin Range box and select upper levels/upper boundaries (range C4:C8).
7. Click the Output Range option button, click in the Output Range box and select any cell to display your output (say cell F3).
8. Check Chart Output.
9. Click OK.
10. Properly label your chart.
11. To remove the space between the bars, right click a bar, click Format Data Series and change the Gap Width to 0%.
12. To add borders, right click a bar, click Format Data Series, click the Fill & Line icon, click Border and select a color.

QUARTILES
The procedure for finding the quartiles, whether the number of items in the data set is even or odd, is as follows:

To determine location/position:
$Q_1 = \tfrac{1}{4}(n+1)$     $Q_2 = \tfrac{1}{2}(n+1)$     $Q_3 = \tfrac{3}{4}(n+1)$

To find the value:
$Q_1 = LV + Dec.(HV - LV)$
where LV and HV are the data values just below and just above the computed position, and Dec. is the decimal part of the position.

DECILES
A decile can be calculated by finding the proportion of each decile in relation to the whole distribution.

PERCENTILES
(Same procedure as quartiles and deciles)
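
The position-and-interpolation procedure for quartiles, deciles and percentiles can be sketched in Python as below. It is a sketch under the assumption that the k-th decile sits at position (k/10)(n + 1) and the k-th percentile at (k/100)(n + 1), mirroring the (n + 1) rule for quartiles, and that LV, HV and Dec. are read as the lower value, higher value and decimal part of the position. The function name is hypothetical, and the data are the 12 city temperatures from the exercise on deciles and percentiles.

```python
def quantile_value(values, fraction):
    """Value at position fraction * (n + 1), using LV + Dec * (HV - LV)."""
    ordered = sorted(values)
    n = len(ordered)
    position = fraction * (n + 1)
    rank = int(position)           # whole part of the position
    decimal = position - rank      # decimal part of the position
    if rank < 1:
        return ordered[0]
    if rank >= n:
        return ordered[-1]
    lower_value = ordered[rank - 1]
    higher_value = ordered[rank]
    return lower_value + decimal * (higher_value - lower_value)

temperatures = [72, 74, 75, 77, 78, 79, 82, 85, 86, 90, 93, 94]

decile_3 = quantile_value(temperatures, 3 / 10)
decile_5 = quantile_value(temperatures, 5 / 10)
decile_8 = quantile_value(temperatures, 8 / 10)
percentile_25 = quantile_value(temperatures, 25 / 100)
percentile_75 = quantile_value(temperatures, 75 / 100)

print(round(decile_3, 2), round(decile_5, 2), round(decile_8, 2))
print(round(percentile_25, 2), round(percentile_75, 2))
```

Decile 5 and the median coincide, and percentile 25 and percentile 75 are the same points as Q1 and Q3, which is a quick consistency check on any hand computation.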
DETERMINING NORMALITY USING THE PEARSON COEFFICIENT
Illustration:

\
$PC = \dfrac{3(\bar{X} - \text{median})}{s}$

If the index is greater than or equal to +1 or less than or equal to -1, it can be concluded that the data are significantly skewed.

$PC \le -1$ or $PC \ge +1$

Example 1:
A survey of high-tech firms showed the number of days’ inventory they had on hand. Determine if the data are approximately normally distributed.

5, 29, 34, 44, 45, 63, 68, 74, 74, 81, 88, 91, 97, 98, 113, 118, 151, 158

$PC = \dfrac{3(\bar{X} - \text{median})}{s} = \dfrac{3(79.5 - 77.5)}{40.5} \approx 0.148$

The data are approximately normally distributed.

Example 2:
The data shown consist of games played each year in the career of Baseball Hall of Famer Bill Mazeroski. Determine if the data are approximately normal using Pearson’s Index of Skewness.

81, 148, 152, 135, 151, 152, 159, 142, 34, 162, 130, 162, 163, 143, 67, 112, 70

$PC = \dfrac{3(\bar{X} - \text{median})}{s} = \dfrac{3(127.24 - 143)}{39.87} \approx -1.19$

The distribution is significantly skewed to the left.

EVALUATION OF DESCRIPTIVE MEASURES TO CHECK FOR NORMALITY
Four ways to check for normality:
a. The shape is symmetrical and bell-shaped.
b. The mean, median, mode coincide.
c. The interquartile range (IQR = Q3 – Q1) equals 1.33 standard deviations.
d. The distribution is infinitely continuous.

PROCEDURE FOR CHECKING FOR NORMALITY OF SAMPLE DATA
1. Compare if the summary statistics follow the model – mean, median, and mode coincide, the interquartile range is 1.33 times the standard deviation (1.33 σ), and the range is approximately 6 standard deviations (6 σ).
2. Observe if the appearance of the graphs is symmetric and bell-shaped. Construct a stem-and-leaf display and a box-and-whisker plot for a moderately sized data set and a frequency histogram or polygon for a large data set.
3. Determine how the data are distributed in its range – whether approximately 2/3 of the data lie within 1 standard deviation about the mean, 4/5 of the data are within 1.28 standard deviations about the mean, and approximately 19/20 of the data are within 2 standard deviations about the mean.

WHAT IS A BOX AND WHISKER PLOT?
Box and whisker plot
- Also called a box plot; it displays the five-number summary of a set of data. The five-number summary is the minimum, first quartile, median, third quartile, and maximum.
- In a box plot, we draw a box from the first quartile to the third quartile. A vertical line goes through the box at the median. The whiskers go from each quartile to the minimum or maximum.
- A box and whisker plot is a way of summarizing a set of data measured on an interval scale. It is often used in exploratory data analysis.
- This type of graph is used to show the shape of the distribution, its central value, and its variability.
- In a box and whisker plot:
• the ends of the box are the upper and lower quartiles, so the box spans the interquartile range
• the median is marked by a vertical line inside the box
• the whiskers are the two lines outside the box that extend to the highest and lowest observations.

Please note that box and whisker plots can be drawn either vertically or horizontally.

DETERMINING NORMALITY USING THE BOXPLOT
- You can tell the shape of the histogram (distribution), in many cases at least, by just looking at the box plot, and you can also estimate whether the mean is less than or greater than the median.
- Recall that the mean is impacted by especially large or small values, even if there are just a few of them, while the median is more stable with respect to exceptional values.
- Therefore:
• If the distribution is normal, there are few exceptionally large or small values. The mean will be about the same as the median, and the box plot will look symmetric.
• If the distribution is skewed to the right, most values are 'small', but there are a few exceptionally large ones. Those exceptional values will impact the mean and pull it to the right, so that the mean will be greater than the median. The box plot will look as if the box was shifted to the left so that the right tail will be longer, and the median will be closer to the left line of the box in the box plot.
• If the distribution is skewed to the left, most values are 'large', but there are a few exceptionally small ones. Those exceptional values will impact the mean and pull it to the left, so that the mean will be less than the median. The box plot will look as if the box was shifted to the right so that the left tail will be longer, and the median will be closer to the right line of the box in the box plot.

As a quick way to remember skewedness:
• longer tail on the left means skewed to the left means mean on the left of median (smaller)
• longer tail on the right means skewed to the right means mean on the right of median (larger)
• tails equally long means normal means mean about equal to median

\
 Because of symmetry of the standard normal curve, O1 And On
will have the same numerical value but oppositely signed, O1 will
be negative and On will be positive.

Illustration:
Shop Sales
699    996
836    1037
915    1085
921    1119
978    1208
983    1223

Compute for O1, O2, O3, …, On: (proportion or area under the curve)
O1st = 1/13 = 0.0769      O2nd = 2/13 = 0.1538      O3rd = 3/13 = 0.2308
O11th = 11/13 = 0.8462    O12th = 12/13 = 0.9231

Corresponding standard normal values:
O1 = -1.43    O2 = -1.02    O3 = -0.73    O11 = 1.02    O12 = 1.43

Solve for: O4 to O10
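
The areas and standard normal values for the shop sales illustration can be generated with Python's standard library. This sketch assumes the i-th ordered observation is assigned the area i/(n + 1), as in the values shown for 1/13 and 2/13, and converts each area to a z-value with the inverse of the standard normal cumulative distribution. Small differences in the second decimal place against table look-ups are possible because of rounding.

```python
from statistics import NormalDist

sales = [699, 836, 915, 921, 978, 983, 996, 1037, 1085, 1119, 1208, 1223]
ordered_sales = sorted(sales)
n = len(ordered_sales)

# Area (proportion under the curve) assigned to each ordered observation.
areas = [i / (n + 1) for i in range(1, n + 1)]

# Standard normal value whose cumulative area equals each proportion.
standard_normal = NormalDist()  # mean 0, standard deviation 1
z_values = [standard_normal.inv_cdf(area) for area in areas]

for sale, area, z in zip(ordered_sales, areas, z_values):
    print(f"{sale:>5}  area = {area:.4f}  z = {z:+.2f}")
```

Plotting the ordered sales against these z-values gives the normal probability plot; points close to a straight line suggest approximately normal data.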

THE NORMAL PROBABILITY PLOT

The Normal Probability Plot
- Consists of points lying on or close to an imaginary straight line that rises from the lower left corner of the graph to the upper right corner if the data are normally or approximately normally distributed.
- Conversely, data that do not follow the normal model consist of points deviating from this line in some patterned way.

PPT12: HYPOTHESIS TESTING

HYPOTHESIS TESTING (Confirmatory Data Analysis)

The Null Hypothesis (H0)
- it is a statement of no difference, no relationship or no significant effect.
- is generally tested through a non-directional test since it states no difference, no effect or no relationship.

The Alternative Hypothesis (H1)
- it is a statement of difference, with relationship or with significant effect.

Illustration:
Suppose a Statistics professor read an article which states that the overall mean grade in Statistics of the students in the University Belt is 81. Furthermore, suppose that, for a sample of students, the average grade in Statistics in their university is 84. Can the professor conclude that the students in their university are better than average?

Hypothesis testing refers to the formal procedures used by


statisticians to accept or reject statistical hypotheses.
• A scientist might want to know whether the earth is warming up.
• A physician would like to know whether a new medication will
lower a person’s blood pressure.

• An educator wish to see whether a new teaching technique is


better than a traditional one. (Flipped Classroom vs. Traditional
approach, or Virtual Classroom vs. Classroom Setting)
• A retail merchant might want to know whether the public refers a
certain color in a new line of fashion.
• Automobile manufacturers are interested in determining whether
car balloons will reduce the severity of injuries caused by
accidents.
• A computer expert would like to know the user friendliness of a
certain software

HYPOTHESIS TESTING
- Is a decision-making process for evaluating claims about a
population.
- In hypothesis testing, the researcher must define:
• the population under study;
• state the particular hypotheses that will be investigated;
• give the significance level;
• select a sample from the population;
• collect the data;
• perform the calculations required for the statistical test, and
finally reach a conclusion.

The three methods used in test hypotheses:


In Panel A, the points lie close to an imaginary straight line rising to 1. The traditional method (classical method)
the right. Such observation leads to conclude that the data in Panel A 2. The P-value method
is approximately normal. 3. The confidence interval method

In Panel B, the points are nonlinear where the points seem to rise STATISTICAL HYPOTHESIS
more steeply at first and then less steeply in the end. This pattern is • A Statistical Hypothesis is a statement, assertion, conjecture, or
indicative of the elongated tail of a left skewed data in Table 3.2.2. claim about the nature of a population. It is the basic object of any
experimental inquiry.
On the other hand, Panel C, the non-linear pattern of the points seems • A Statistical Hypothesis is an assumption or statement, which
to rise less steeply at first and then more steeply. The steepness at the may or may not be true, concerning one or more populations.
right is indicative of the elongated right tail of skewed data in Table
3.2.2 Consider the following:
1. At one time it was thought that, the middle children are said to be
TO CONSTRUCT THE NORMAL PROBABILITY PLOT insecure compared to the first or last child.
Perform the Inverse Normal Scores Transformation 2. Female had inferior intelligence compared to male.
 By converting the ordered data set x1, x2, x3, … xn to the 3. The mean life span of man is 50 years.
corresponding standardized normal quantile values O1, O2, O3, 4. The average daily income of a family is P345.
…, On, where Oi is the value below which is the proportion i/(n + 5. Researchers at George Washington University and the National
1) of the area under the standard normal curve. Institutes of Health claim that approximately 75% of the people

\
believe tranquilizers work very well to make a person calmer and
more relaxed.
6. Records of a certain hospital showed that the distribution of length
of stay of its patients is normal with a mean of 11.5 days and a
standard deviation of 2 days.

TEST OF STATISTICAL HYPOTHESIS

Statistics offers varied tools and techniques that help the researcher
draw valid and reliable inferences or generalizations about the
population on the basis of the sample, known as INFERENTIAL
STATISTICS.

Goal of Hypotheses Testing
The goal of Hypotheses Testing is not to question the computed value
of the sample statistic but to make a judgment about the
difference between the sample statistic and the hypothesized
population parameter.

Use of Hypotheses Testing
Hypotheses testing enables a researcher to generalize to a population from
relatively small samples. In many instances, a researcher can only rely
on the information provided by a part of the population.

BASIC DEFINITIONS IN HYPOTHESES TESTING
Statistic – a function of the random sample, that is, a value based on the
observations that is used to make the decision in favor of the null
hypothesis or alternative hypothesis, e.g., sample mean, standard
deviation, variance, proportion, t-score, z-score, etc.
Parameter – is a numerical characteristic of the population, e.g.,
population mean, population standard deviation, population variance,
population proportion, etc. It is usually unknown and estimated only by a
corresponding statistic computed from the sample data.
Population or universe – is a complete set of all possible
observations, values, elements, or objects under consideration.
Sample – is a representative part of the population.
Null hypothesis (H0) – is known as a statement of no difference, no
relationship, or no significant effect hypothesis. It implies neutrality and
objectivity.
Alternative hypothesis (Ha) – is the opposite of the null hypothesis. It
specifies that a difference, a relationship, or a significant effect
exists; stated without a direction, it is non-directional. This hypothesis
is also called a predictive hypothesis.
Predictive hypothesis – specifies that one group is better than the
other, and is therefore sometimes directional.

• The null hypothesis, denoted by H0, is usually the hypothesis
that sample observations result purely from chance.
• The alternative hypothesis, denoted by H1 or Ha, is the
hypothesis that sample observations are influenced by some
nonrandom cause.
• In writing H0 in equation or symbol form, the symbol for the
parameter (μ) is usually used, e.g., H0: μ1 – μ2 = 0. It should be noted
that the sample means X̄1 and X̄2 are only estimates of the population
means, and since we would like to generalize findings later to a larger
population, we use the parameter symbols.

Consider the following statements:
1. The mean life span of man is 50 years.
2. The average daily income of a family is P345.
3. Researchers at George Washington University and the National
Institutes of Health claim that approximately 75% of the people
believe tranquilizers work very well to make a person more calm
and relaxed.
4. Records of a certain hospital showed that the distribution of length
of stay of its patients is normal with a mean of 11.5 days and a
standard deviation of 2 days.

DIRECTIONAL AND NON-DIRECTIONAL TEST OF HYPOTHESIS
• One-Tailed Test – is used when the hypotheses are directional or
if the direction or the nature of the difference is stated. This type of
test of hypotheses is also called a directional test. This means
that inequalities like less than (<) or greater than (>) are used for
the alternative hypothesis.
• Two-Tailed Test – is used when the hypotheses are non-directional
or if the direction of the difference is not stated. This type
of test of hypothesis is also called a non-directional test. The equal
sign is used for the null hypothesis and the not-equal sign for the
alternative hypothesis.

• A one-tailed test is used when the critical region (rejection
region) is located at only one extreme of the distribution or range of
values for the test statistic.
• A two-tailed test is used when the critical region (rejection
region) is located at both sides of the distribution or range of
values for the test statistic.

HYPOTHESIS TESTING-TRADITIONAL METHOD

Situation A
A medical researcher is interested in finding out whether a new
medication will have any undesirable side effects. The researcher is
particularly concerned with the pulse rate of the patients who take the
medication. Will the pulse rate increase, decrease, or remain
unchanged after a patient takes the medication? If the population
mean pulse rate is 82 beats per minute, write the H0 and H1.

Situation B
A chemist invents an additive to increase the life of an automobile
battery. If the mean lifetime of the automobile battery without the
additive is 36 months, formulate the H0 and Ha.

Situation C
A contractor wishes to lower air-conditioning bills by using a special type
of insulation in houses. If the average monthly air-conditioning bill is
P3,900, formulate the H0 and Ha about air-conditioning costs with the use
of insulation.

Exercise: State the null and alternative hypotheses for each
conjecture
a) A researcher thinks that if expectant mothers use vitamin pills, the
birth weight of the babies will increase. The average birth weight
of the population is 8.6 pounds.
b) An engineer hypothesized that the mean number of defects can
be decreased in a manufacturing process of compact disks by
using robots instead of humans for certain tasks. The mean
number of defective disks per 1000 is 18.
c) A psychologist feels that playing soft music during a test will
change the results of the tests. The psychologist is not sure
whether the grades will be higher or lower. In the past, the mean
of the scores was 73.

STATISTICAL TEST

Statistical Test uses the data obtained from a sample to make a
decision about whether the null hypothesis should be rejected. The
numerical value obtained from a statistical test is called the test value.

Illustrations:
1. A medical researcher is interested in finding out whether a new
medication will have any undesirable side effects. The researcher
is particularly concerned with the pulse rate of the patients who
take the medication. Will the pulse rate increase, decrease, or
remain unchanged after a patient takes the medication? Suppose
that the mean rate for the population under study is 82 beats per
minute, what would be the null and alternative hypotheses?

H0: The mean pulse rate of the patients who took the medication
is 82 beats per minute.
H1: The mean pulse rate of the patients who took the medication
is not 82 beats per minute.

OR

H0: There is no significant difference between the mean pulse
rates of the population under study and the patients who take the
medication.
H1: There is a significant difference between the mean pulse rates
of the population under study and the patients who take the
medication.

2. Consider an experiment involving two groups: an experimental
group and a control group. The researcher would like to test
whether the treatment (virtual learning) will improve the students’
achievement in mathematics in the experimental group. The
same treatment is not given to the control group. The
hypotheses can be stated in various ways:

a) H0: There will be no significant difference in the achievement
between the group that will be exposed to virtual learning and the
group that will not be exposed to the same.
H1: The achievement of the group that will be exposed to virtual
learning will differ from that of the other.

b) H0: There will be no significant effect of the virtual learning on the
achievement of the students.
H1: Virtual learning will have a positive effect on the achievement
of the students.

c) H0: The achievement of the students will not relate to the virtual
learning conducted on them.
H1: There will be a positive relationship between the achievement
of the students and the virtual learning they will be exposed to.

Exercises: Tell whether each of the following is null or alternative,
directional or non-directional, and one-tailed or two-tailed:
1. There is a significant difference between the use of Sunsilk and
Pantene shampoo.
2. There is no significant difference between the use of shampoo
alone and the use of shampoo + conditioner.
3. There is no significant effect on the achievement of the students
in Mathematics between a terror professor and a cool professor.
4. The use of the Filipino language in teaching mathematics has
a better effect than the use of pure English.
5. μ1 – μ2 = 0
6. μ1 – μ2 ≠ 0
7. μ1 > μ2
8. μ1 < μ
9. The new educational program has a poor effect on the
achievement of the elementary pupils.
10. The use of multimedia as an instructional material in classroom
teaching is highly significant towards student behavior inside the
classroom.

SOME FACTS ABOUT REJECTING OR NOT REJECTING THE
NULL HYPOTHESIS
• Some researchers say that a hypothesis test can have one of two
outcomes: you accept the null hypothesis or you reject the null
hypothesis.
• Many statisticians, however, take issue with the notion of
"accepting the null hypothesis." Instead, they say: you reject the
null hypothesis or you fail to reject the null hypothesis.

Distinction between “Acceptance” and “Failure to Reject”
• Acceptance – implies that the null hypothesis is true
• Failure to Reject – implies that the data are not sufficiently
persuasive for us to prefer the alternative hypothesis over the null
hypothesis

- The given hypothesis is tested with the help of the sample data.
- A simple random sample has the full freedom of giving any value
to its statistic.
- The sample is not aware of our plans.
- We decide about our hypothesis on the basis of the sample
statistic.
- If the sample does not support the null hypothesis, we reject it on
a probability basis and accept the alternative hypothesis.
- If the sample does not oppose the hypothesis, the hypothesis is
accepted. But here ‘accept’ does not mean the acceptance of the null
hypothesis; it only means that the sample has not strongly
opposed it.
- That is, “not opposed” does not mean that the sample has
strongly supported the hypothesis.
- The support of the sample in favor of the hypothesis cannot be
established.
- When the hypothesis is rejected, it is rejected with a high
probability.
- The acceptance of a hypothesis (H0) merely implies that there
is insufficient statistical evidence to believe otherwise.
- A critical region is a set of values of the test statistic that is
chosen before the experiment to define the conditions under
which the null hypothesis will be rejected.
- The significance level of a test is the maximum value of the
probability of rejecting the null hypothesis when in fact it is true.

THE LEVEL OF SIGNIFICANCE AND DECISION RULE
Decision Rule – specifies the critical value for the sample findings.
Each hypothesis is tested at a chosen level of significance.

Common levels of significance, called α-values (alpha values), are:
1%, 5%, and 10%.

NOTE:
• The alternative hypothesis determines whether the test is
directional (one-tailed test) or non-directional (two-tailed
test).

TYPE I AND TYPE II ERROR
• Type I error – is when we reject the null hypothesis when it is
true.
• Type II error – is when we accept or fail to reject the null
hypothesis when it is false (that is, when the alternative hypothesis
is true).
• The probability of committing a Type II error is called Beta,
and is often denoted by β.
• The probability of not committing a Type II error is called the
Power of the test.
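The significance level, β, and power ideas from this part of the notes can be sketched numerically in Python. This is only an illustration under stated assumptions: the helper names `critical_z` and `beta_right_tailed` are hypothetical, the hypothesized mean of 81 and the alternative mean of 84 borrow the numbers of the Statistics professor illustration, and the population standard deviation (9) and sample size (36) are made up for the example.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def critical_z(alpha, tails=2):
    """Positive critical z value for a one- or two-tailed test."""
    return STD_NORMAL.inv_cdf(1 - alpha / tails)


def beta_right_tailed(mu0, mu1, sigma, n, alpha):
    """P(Type II error) for H0: mu = mu0 vs H1: mu > mu0.

    Assumes a z test with known sigma and that the true mean is mu1.
    """
    standard_error = sigma / sqrt(n)
    # H0 is rejected when the sample mean exceeds this cutoff.
    cutoff = mu0 + critical_z(alpha, tails=1) * standard_error
    # beta = probability the sample mean stays below the cutoff
    # even though the true mean is mu1.
    return NormalDist(mu1, standard_error).cdf(cutoff)


# Critical values for the common significance levels.
for alpha in (0.01, 0.05, 0.10):
    one_tail = critical_z(alpha, tails=1)
    two_tail = critical_z(alpha, tails=2)
    print(f"alpha={alpha:.2f}  one-tailed z={one_tail:.3f}  two-tailed z=±{two_tail:.3f}")

# Hypothetical numbers: sigma = 9 and n = 36 are assumptions.
beta = beta_right_tailed(mu0=81, mu1=84, sigma=9, n=36, alpha=0.05)
power = 1 - beta
print(f"beta={beta:.3f}  power={power:.3f}")
```

Choosing a smaller α moves the cutoff further into the tail, which lowers the chance of a Type I error but raises β for the same sample size.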

Type I and Type II Error
Two Possibilities of Committing an Error:
• If H0 is true and the test statistic falls in the rejection region, a
Type I error is committed.
• If H0 is false and the test statistic falls in the acceptance region, a
Type II error is committed.
• To represent the probability of committing the two types of error:
α = probability of making a Type I error
β = probability of making a Type II error
• α, the level of significance, is the more commonly used of the two
symbols.

CRITICAL VALUES OF Z

Critical Values – divide the possible values of the test statistic into two
regions called the rejection region and the non-rejection region
(acceptance region).

GUIDELINES IN HYPOTHESIS TESTING:
1. State the null and alternative hypotheses (H0 and H1).
2. Choose the level of significance.
3. Determine the critical region.
4. Choose the statistical test appropriate to test the hypothesis.
5. Compute the value of the statistical test.
6. Make a decision.
Reject H0 if the test statistic has a value in the critical region;
otherwise, do not reject H0.
7. Interpret and discuss the result.

STEPS IN HYPOTHESIS TESTING

Step 1. Formulate the null hypothesis H0.
H0: μ = μ0
Choose an appropriate alternative hypothesis.
H1: μ < μ0 (one-tailed test)
μ > μ0 (one-tailed test)
μ ≠ μ0 (two-tailed test)

The critical/rejection region for each alternative hypothesis, respectively:
• lies entirely in the left tail of the distribution
• lies entirely in the right tail of the distribution
• lies in both tails of the distribution

Step 2. Specify the level of significance to be used.
Step 3. Determine the critical region.
Step 4. Select an appropriate test statistic and determine the critical
value of the test statistic.

INFERENCES FROM TWO SAMPLES
(1) Inferences About Two Proportions
(2) Inferences About Two Means: Independent Samples
(3) Inferences from Matched Pairs
(4) Comparing Variation in Two Samples

SUMMARY OF HYPOTHESIS TESTS

STATISTICAL TEST FORMULAS:
A. Test for a single mean when the population variance (σ²) is
known and the sample size is at least 30 (n ≥ 30).
B. Test for a single mean when the population variance (σ²) is
unknown and the sample size is at least 30 (n ≥ 30).
C. Test for a single mean when the population variance is
unknown and the sample size is less than 30 (n < 30).
D. Hypothesis tests about the difference between two
population means for large (n1, n2 > 30) and independent
samples.
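The formulas for statistical tests A to D listed in this section did not survive in these notes. As a hedged reference, the standard textbook forms of these test statistics are sketched below; they are assumed, not confirmed, to match the original slides. Here \bar{x} is the sample mean, \mu_0 the hypothesized mean, \sigma the population standard deviation, s the sample standard deviation, and n the sample size.

```latex
\begin{align*}
% A. single mean, sigma known, n >= 30
\text{A.}\quad z &= \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} \\[4pt]
% B. single mean, sigma unknown, n >= 30 (s estimates sigma)
\text{B.}\quad z &= \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \\[4pt]
% C. single mean, sigma unknown, n < 30 (t distribution, n - 1 degrees of freedom)
\text{C.}\quad t &= \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \qquad df = n - 1 \\[4pt]
% D. difference between two means, large independent samples
\text{D.}\quad z &= \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}
                        {\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
\end{align*}
```

Under H0: μ1 – μ2 = 0, the term (\mu_1 - \mu_2) in D is zero, so the numerator reduces to the difference between the two sample means.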
