
UNIT-1

Introduction to Statistics. Statistics is a branch of mathematics that deals with the collection, analysis
and interpretation of data. ... In layman's terms, data in statistics can be any set of information that
describes a given entity. An example of data can be the ages of the students in a given class.

Definition of Statistics:- the practice or science of collecting and analyzing numerical data in large
quantities, especially for the purpose of inferring proportions in a whole from those in a representative
sample.

Definition of statistics. 1: a branch of mathematics dealing with the collection, analysis, interpretation,
and presentation of masses of numerical data. 2 : a collection of quantitative data

Limitations of Statistics

The important limitations of statistics are:

(1) Statistical laws are true only on average. Statistics are aggregates of facts, so a single observation
is not a statistic. Statistics deals with groups and aggregates only.

(2) Statistical methods are best applicable to quantitative data.

(3) Statistics cannot be applied to heterogeneous data.

(4) If sufficient care is not exercised in collecting, analyzing and interpreting the data, statistical results
might be misleading.

(5) Only a person who has an expert knowledge of statistics can handle statistical data efficiently.

(6) Some errors are possible in statistical decisions. In particular, inferential statistics involves certain
errors. We do not know whether an error has been committed or not.

Scope and importance of Statistics:

1. Statistics and planning: Statistics is indispensable to planning in the modern age, which is termed
“the age of planning”. Almost all over the world, governments are resorting to planning for economic
development.

2. Statistics and economics: Statistical data and techniques of statistical analysis have proved immensely
useful in solving economic problems, such as wages, prices, time series analysis and demand analysis.

3. Statistics and business: Statistics is an indispensable tool of production control. Business executives
are relying more and more on statistical techniques for studying the needs and desires of their valued
customers.

4. Statistics and industry: In industry, statistics is widely used in quality control. In production
engineering, statistical tools such as inspection plans and control charts are used to find out whether
the product conforms to the specifications or not.

5. Statistics and mathematics: Statistics and mathematics are intimately related; recent advancements in
statistical techniques are the outcome of wide applications of mathematics.

6. Statistics and modern science: In medical science, the statistical tools for collection, presentation and
analysis of observed facts relating to the causes and incidence of diseases, and to the results of applying
various drugs and medicines, are of great importance.

7. Statistics, psychology and education: In education and psychology, statistics has found wide application,
such as determining the reliability and validity of a test, factor analysis, etc.

8. Statistics and war: In war, the theory of decision functions can be of great assistance to military
personnel in planning “maximum destruction with minimum effort.”

Difference between Classification and Tabulation

The primary difference between classification and tabulation is that classification is the process of
sorting data into groups, whereas tabulation is the act of presenting data in tabular form for better
interpretation.

After the collection of data is completed, it is prepared for analysis. As the data is raw, it needs to be
transformed so that it is appropriate for analysis. The form of the data strongly influences the result of
the analysis, so the data must be prepared properly to get reliable results. There are various steps of
data preparation, which include editing, coding, classification, tabulation, graphical representation and
so on.

To a layperson, classification and tabulation may seem the same, but they are different: the former is
a means of sorting data for further analysis, while the latter is used to present data.

Comparison Chart

BASIS FOR COMPARISON      CLASSIFICATION                          TABULATION

1. Meaning                Classification is the process of        Tabulation is a process of
                          grouping data into different            summarizing data and presenting it
                          categories, on the basis of nature,     in a compact form, by putting data
                          behavior, or common characteristics.    into a statistical table.

2. Order                  After data collection                   After classification

3. Arrangement            Attributes and variables                Columns and rows

4. Purpose                To analyse data                         To present data

5. Bifurcates data into   Categories and sub-categories           Headings and sub-headings

Definition of Classification

Classification refers to a process wherein data is arranged into classes or groups, based on the
characteristic under consideration and the resemblance of observations. Classification puts the data in
a condensed form: it removes unnecessary details, which makes the data easier to comprehend.

Data collected for the first time is raw data, so it is arranged in a haphazard manner that does not
provide a clear picture. Classification reduces the large volume of raw data into homogeneous groups,
i.e. data having common characteristics or nature are placed in one group, and thus the whole data is
bifurcated into a number of groups. There are four types of classification:

Qualitative Classification or Ordinal Classification

Quantitative Classification

Chronological or Temporal Classification

Geographical or Spatial Classification

Definition of Tabulation

Tabulation refers to a logical data presentation, wherein raw data is summarized and displayed in a
compact form, i.e. in statistical tables. In other words, it is a systematic arrangement of data in columns
and rows that represents data in a concise and attractive way. One should follow the guidelines given
below for tabulation:

A serial number should be allotted to the table, in addition to a self-explanatory title.

The statistical table is required to be divided into four parts, i.e. Box Head, Stub, Caption and Body. The
complete upper part of the table that contains columns and sub-columns, along with the caption, is the
Box Head. The left part of the table, giving the description of rows, is called the Stub. The part of the
table that contains numerical figures and other content is its Body.

Length and Width of the table should be perfectly balanced.

Presentation of data should be such that it takes less time and labor to make comparison between
various figures.

Footnotes, explaining the source of data or any other thing, are to be presented at the bottom of the
table.

Key Differences Between Classification and Tabulation

The main differences between classification and tabulation are discussed in the points given
below:

The process of arranging data into different categories, on the basis of nature, behavior, or common
characteristics is called classification. A process of condensing data and presenting it in a compact form,
by putting data into statistical table, is called tabulation.

Classification of data is done after data collection process is completed. On the other hand, tabulation
follows classification.

Data classification is based on similar attributes and variables of the observations. Conversely, in
tabulation the data is arranged in rows and columns, in a systematic way.

Classification of data is performed with the objective of analysing data in order to draw inferences.
Tabulation, by contrast, aims at presenting data so that various figures can be compared easily.

In classification, data is bifurcated into categories and sub-categories while in tabulation data is divided
into headings and sub-headings.
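
The two steps can be illustrated with a minimal Python sketch. The ages, the class width of 5 and the
helper names classify and tabulate are assumptions made only for this illustration; they do not come
from the notes. The data is first classified into age classes and then tabulated in rows and columns.

```python
from collections import Counter

# Hypothetical raw data: ages of students in a class (illustrative values only).
ages = [18, 21, 19, 25, 22, 18, 30, 27, 21, 19, 24, 29]

def classify(age, width=5, start=15):
    """Classification: assign an observation to a class interval such as '15-19'."""
    lower = start + ((age - start) // width) * width
    return f"{lower}-{lower + width - 1}"

# Classification step: group observations into homogeneous classes.
class_counts = Counter(classify(a) for a in ages)

def tabulate(counts):
    """Tabulation: present the classified data in rows and columns."""
    lines = [f"{'Age class':<10}{'Students':>9}"]
    for label in sorted(counts):
        lines.append(f"{label:<10}{counts[label]:>9}")
    lines.append(f"{'Total':<10}{sum(counts.values()):>9}")
    return "\n".join(lines)

print(tabulate(class_counts))
```

The first column plays the role of the stub, the header row that of the box head, and the counts
form the body of the table.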

Conclusion

Once data has been collected and verified for homogeneity and consistency, it needs to be summarized
and presented in a clear and compact manner that highlights the relevant features of the data. Both
classification and tabulation enhance the readability and attractiveness of data, by presenting it in a
manner that looks more appealing to the eyes.

A Basic Review of Statistics Definitions and Concepts

Population: The universe of all items or events under study.

Sample and sampling: A sample is a portion of the population used for statistical analysis. Sampling is
the process by which items are selected from the population. Sample statistics, if they are unbiased,
are economical ways to draw inferences about the larger population. The requirement here is that the
sample drawn should be representative of the population and large enough to give a reliable picture of it.

Probability sampling: A selection process in which every item in the population has a known, non-zero
probability of being selected.

Random sample: A sampling process whereby every item in the population has an equal chance of being
chosen and placed in the sample. Lack of a random sample may result in the estimated statistic(s) being
biased.

Variable: An attribute that can take on different values. Variables are the characteristics of a sample
to which statistical analysis is applied. There can be categorical or numerical variables. For example,
a person's religion would be a categorical variable whereas that person's disposable income would be a
numerical variable.

Parameter: A numerical measure describing a particular characteristic of a population. Since we typically
do not study entire populations, parameters are often unobservable and must be estimated. An unbiased
parameter estimate is one whose expected value equals the true population parameter.

Statistic: A numerical measure that describes some property of a sample. A statistic is obtained
from a sample and is used to estimate the corresponding population parameter. We hope the statistic
computed from the sample is close to the value we would obtain if we could measure the whole population.
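
As a rough illustration of the terms population, random sample, parameter and statistic, the sketch
below simulates a population and compares the population mean with the mean of one random sample. The
income figures, the population size and the sample size of 200 are assumptions chosen for the example.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated disposable incomes (illustrative only).
population = [random.gauss(30_000, 5_000) for _ in range(10_000)]

# Parameter: computed from the whole population (usually unobservable in practice).
population_mean = statistics.fmean(population)

# Random sample: every item has an equal chance of being chosen.
sample = random.sample(population, k=200)

# Statistic: computed from the sample and used to estimate the parameter.
sample_mean = statistics.fmean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```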

Published sources of data: Published data comes from either a primary or a secondary source. Primary
data is data collected first-hand by the analyst. Secondary data is data that was originally collected by
someone else and is taken from existing sources.

Experimental data: Data about a variable that has been collected by allowing only one (or a selected)
group of variables to change. All other variables are held constant. Experimental data is typically seen in
the hard sciences. Non-experimental data is typically seen in the social sciences where it is impossible to
"hold everything else constant." 

Survey data: Data collected from the responses of a group of participants.

Frame data: Data collected using a pre-specified list establishing the guidelines that will be used in
assembling the sample from the population.  Frames should be selected so that the resulting sample will
represent the population.

Bar chart: A chart made from categorical data in which the heights of bars represent the frequency (or
relative frequency aka percent) of membership in each value of the variable. Unlike a histogram, the
width of the bars carries no meaning.

Histogram: A graph made from quantitative data in which the range of the data is divided into intervals
called bins, and then bars are constructed above each bin such that the heights of the bars represent the
frequency or relative frequency of data in the particular bin. Unlike a bar chart, the width of the bars is
an important characteristic of the graph.
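
The binning behind a histogram can be sketched as follows. The data values, the range 0 to 10 and the
choice of five equal-width bins are assumptions for illustration, and the function histogram_counts is
a hypothetical helper, not a library call. Values are assumed to lie inside the stated range.

```python
# Hypothetical quantitative data (illustrative values only).
data = [2.1, 3.4, 3.9, 4.0, 5.5, 6.2, 6.8, 7.1, 8.9, 9.5]

def histogram_counts(values, low, high, n_bins):
    """Count how many values fall into each of n_bins equal-width bins on [low, high]."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The top edge is placed in the last bin so that `high` itself is counted.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts

counts = histogram_counts(data, low=0, high=10, n_bins=5)

# Text rendering: one row per bin, bar length equal to the frequency.
for i, count in enumerate(counts):
    print(f"[{2 * i:>2}, {2 * i + 2:>2})  {'#' * count}")
```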

Box and whisker plot: A plot that incorporates the median and upper and lower quartiles to graphically
display the data range. It is also particularly useful for displaying outliers when they are present in the
data.
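
The numbers a box and whisker plot is drawn from can be computed with the standard library. The data
below is hypothetical, and the rule that flags points more than 1.5 times the interquartile range beyond
the quartiles is a common convention assumed here; the notes themselves do not specify an outlier rule.

```python
import statistics

# Hypothetical data with one suspiciously large value (illustrative only).
data = [1, 2, 3, 4, 5, 6, 7, 8, 50]

# Lower quartile, median and upper quartile.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the quartiles are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, median, q3, outliers)
```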

Time series plot: The plot of a specified variable over time.


Cross-sectional data: Data compared at one point in time. Comparisons can be intra-data or with a
benchmark data point.

Probability: The mathematical likelihood a particular outcome will occur.

Probability distribution: A scaling of possible event outcomes based upon their likelihood (probability)
of occurring and described by a probability function.

Discrete probability distribution: A probability distribution in which the variable can take only certain
values in any particular interval (such as only whole number values, for example).

Continuous probability distribution: A probability distribution in which the variable can take any value
within the range of possible values.

Symmetrical probability distribution: A probability function that has a vertical line of symmetry creating
left/right mirror images. The most well-known example is the bell-shaped Normal distribution that is
fully described by its mean and standard deviation.

Left-skewed probability distribution: A set of data values in which the mean is generally less than the
median. The left tail of the distribution is longer than the right tail of the distribution.

Right-skewed probability distribution: A set of data values in which the mean is generally greater than
the median. The right tail of the distribution is longer than the left tail of the distribution.

Central Limit Theorem: The statistical law that states that regardless of the shape of the distribution of
the individual values in the population, as the sample size gets larger, the sampling distribution of the
mean can be approximated by a normal distribution.

Degrees of freedom: The number of independent data values available to estimate the population's
standard deviation. The degrees of freedom equal the number of observations in the sample (N) minus
the number of parameters to be estimated (K).
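
The Central Limit Theorem defined above can be checked informally by simulation. The sketch below
assumes an exponential population with mean 1 (a right-skewed distribution chosen only for illustration)
and uses a hypothetical helper, sample_means. As the sample size grows, the sample means cluster more
tightly around the population mean.

```python
import random
import statistics

random.seed(0)  # fixed seed for reproducibility

def sample_means(sample_size, n_samples=2_000):
    """Means of repeated samples drawn from a right-skewed (exponential) population."""
    return [
        statistics.fmean([random.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]

results = {}
for n in (2, 30, 200):
    means = sample_means(n)
    results[n] = (statistics.fmean(means), statistics.stdev(means))
    print(f"n={n:>3}  mean of means={results[n][0]:.3f}  spread of means={results[n][1]:.3f}")
```

Plotting a histogram of the means for each sample size would show the shape moving from skewed
towards the familiar bell curve.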

Student's t-distribution: A family of curves each of which is a symmetrical bell-shaped distribution that
has greater area in the tails than the normal probability distribution. Each distribution will be defined by
its degrees of freedom. As the degrees of freedom increase the t-distribution approaches that of the
normal distribution. 

Z-value: A statistic generated from a normal probability distribution. It is a standardized value in that it
divides the difference between an observation and the mean value by the standard deviation of the
observations.
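
A minimal sketch of the Z-value calculation follows. The observations are hypothetical, and the use of
the population form of the standard deviation (pstdev) is an assumption; with a sample, the sample
standard deviation would be used instead.

```python
import statistics

# Hypothetical observations (illustrative values only).
observations = [62, 70, 74, 78, 86]

mean = statistics.fmean(observations)
std_dev = statistics.pstdev(observations)  # assumption: population standard deviation

def z_value(x):
    """Standardize x: distance from the mean in units of standard deviation."""
    return (x - mean) / std_dev

print(z_value(86), z_value(62))
```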

Pie chart: A circular graph in which wedge-shaped slices represent proportions of the total.

Pareto chart: A bar chart that displays the count of each item as a number or percentage in descending
order from left to right. A cumulative line is usually added to show the running percentage, which sums
to 100%.
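
The numbers behind a Pareto chart can be prepared as in the sketch below. The defect categories and
counts are hypothetical. Categories are sorted from the largest count to the smallest, which is the usual
Pareto convention, and a running percentage is accumulated alongside.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical defect log (illustrative categories and counts only).
defects = ["scratch"] * 50 + ["dent"] * 30 + ["crack"] * 15 + ["stain"] * 5

counts = Counter(defects).most_common()  # sorted by count, largest first
total = sum(count for _, count in counts)
cumulative = list(accumulate(count for _, count in counts))
cumulative_pct = [100 * c / total for c in cumulative]

for (label, count), pct in zip(counts, cumulative_pct):
    print(f"{label:<8}{count:>4}{pct:>8.1f}%")
```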

Frequency: The number or percent occurrence of a particular outcome out of N trials.

Frequency table: A grouping of data into mutually exclusive classes showing the number of observations
in each class. Relative frequency classes are derived from a frequency table by computing the
percentage of the total observations made up by each class.
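
A frequency table and its relative frequencies can be built in a few lines. The survey responses are
hypothetical values used only to show the calculation.

```python
from collections import Counter

# Hypothetical survey responses (illustrative values only).
responses = ["A", "B", "A", "C", "B", "A", "A", "C", "B", "A"]

frequency = Counter(responses)                       # observations per class
n = sum(frequency.values())
relative = {k: v / n for k, v in frequency.items()}  # share of the total

for label in sorted(frequency):
    print(f"{label}  {frequency[label]:>3}  {relative[label]:.0%}")
```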

Joint frequency distribution: A table consisting of paired responses for two variables. 

Scatter diagram: A graph that plots the coordinates from two series of data points. In a typical scatter
diagram the X axis (the horizontal axis) represents the units of one variable while the Y axis (the vertical
axis) represents the units of the second variable. Scatter diagrams can reveal patterns among data.

Mean: A measure of central tendency. It is computed by summing all data values and dividing by the
number of data values summed. In this context the mean (average) is an ex post number. It is computed
after-the-fact. If the observations include all the values in a population the average is referred to as a
population mean. If the values used in the computation only include those from a sample, the result is
referred to as a sample mean.

Expected mean: A measure of central tendency. All data values are weighted by their probability of
occurring and then summed. The expected mean is an ex ante calculation (sometimes referred to as a
weighted mean where the probabilities are the weights). The expected mean can be from a population
or from a sample. Typically it is computed from a sample. The expected mean is also referred to as an
expected value.
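
The difference between the ex post mean and the ex ante expected mean can be shown with a short
sketch. The outcomes, the probabilities attached to them and the observed values are all assumptions
made up for this example.

```python
# Hypothetical outcomes (returns in percent) and their assumed probabilities.
outcomes = [-5.0, 5.0, 15.0]
probabilities = [0.2, 0.5, 0.3]

# The probabilities must sum to 1 for the weighting to make sense.
assert abs(sum(probabilities) - 1.0) < 1e-9

# Ex ante: weight each outcome by its probability, then sum.
expected_mean = sum(p * x for p, x in zip(probabilities, outcomes))

# Ex post: plain average of values that were actually observed.
observed = [4.0, -2.0, 10.0, 8.0]
simple_mean = sum(observed) / len(observed)

print(expected_mean, simple_mean)
```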

Median: A center value that divides the data array into two halves. The median is not affected by
extreme observation values in the data set.

Mode: The value in the data set that occurs most frequently. Some data sets may have more than one
mode if two different values tie for the most frequently occurring value. For example, a distribution of
values may be bi-modal in nature.
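
The sketch below compares the mean, median and mode on a small hypothetical data set that contains one
extreme value and two values tied for most frequent. It uses only Python's standard statistics module.

```python
import statistics

# Hypothetical data with one extreme value (illustrative only).
values = [2, 3, 3, 4, 5, 5, 6, 100]

mean = statistics.mean(values)        # pulled upward by the extreme value 100
median = statistics.median(values)    # middle of the sorted data
modes = statistics.multimode(values)  # every value tied for most frequent

print(mean, median, modes)
```

The mean is far above most of the data, while the median stays near the centre, and the two modes
show that this data set is bi-modal.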

Population variance: The average of the squared differences of the data values from the mean value of
the observations, i.e. the sum of the squared differences divided by the N observations.

Sample variance: The sum of the squared differences of the data values from the mean value of
observations where this sum is divided by the number of observations (N) minus 1. We divide by N – 1
to correct for a bias produced in the sample variance when the number of observations is small.

Sample standard deviation: The square root of the sample variance. The sample standard deviation
represents the typical distance from the mean to an observation in the data. The sample standard
deviation is a measure of risk.

Sample coefficient of variation: The ratio obtained by dividing the sample standard deviation by the
sample mean. This calculation is useful when two different data sets have different means and standard
deviations.  For two independent data sets we typically choose the data set with the lower coefficient of
variation—less variation per unit of expected value.
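
The variance, standard deviation and coefficient of variation definitions can be put side by side in
code. The two sets of yearly returns are hypothetical, and sample_stats is a helper name introduced only
for this sketch.

```python
import math

def sample_stats(values):
    """Return (mean, population variance, sample variance, sample std dev, CV)."""
    n = len(values)
    mean = sum(values) / n
    squared_diffs = sum((x - mean) ** 2 for x in values)
    population_variance = squared_diffs / n    # divide by N
    sample_variance = squared_diffs / (n - 1)  # divide by N - 1
    sample_std = math.sqrt(sample_variance)
    cv = sample_std / mean                     # coefficient of variation
    return mean, population_variance, sample_variance, sample_std, cv

# Hypothetical yearly returns (percent) for two data sets, illustrative only.
fund_a = [8, 10, 12, 10, 10]
fund_b = [2, 18, 10, 25, -5]

stats_a = sample_stats(fund_a)
stats_b = sample_stats(fund_b)
print(f"A: mean={stats_a[0]:.1f}  std={stats_a[3]:.2f}  CV={stats_a[4]:.2f}")
print(f"B: mean={stats_b[0]:.1f}  std={stats_b[3]:.2f}  CV={stats_b[4]:.2f}")
```

Both data sets have the same mean here, so the lower coefficient of variation of the first set
reflects less variation per unit of expected value.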

Point estimate: A single statistic (number) that is determined from a sample. It is used to estimate the
corresponding population parameter.

Sampling error: The difference between a sample statistic and the corresponding population parameter
that occurs due to random chance in which items happen to be sampled.

Confidence interval: An interval computed from a sample that is expected to contain the population
parameter with a given level of confidence.
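
One common way to compute a confidence interval for a population mean is sketched below. It assumes
the normal approximation, which is reasonable for larger samples; for small samples the Student's
t-distribution is the usual choice. The sample values and the function name are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    mean = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value from the standard normal distribution.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_error, mean + z * std_error

# Hypothetical sample of 40 measurements (illustrative values only).
sample = [48, 52, 50, 49, 51, 53, 47, 50] * 5

low_95, high_95 = mean_confidence_interval(sample, 0.95)
low_99, high_99 = mean_confidence_interval(sample, 0.99)
print(f"95%: ({low_95:.2f}, {high_95:.2f})   99%: ({low_99:.2f}, {high_99:.2f})")
```

A higher level of confidence gives a wider interval around the same point estimate.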

Null hypothesis: The belief that a population parameter is equal to a specific value. The null can be
rejected via statistical inference.

Alternative hypothesis: The hypothesis that is accepted when the test result leads the researcher to
reject the null hypothesis, with a pre-specified level of confidence. The null and alternative hypotheses
are mutually exclusive states.
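
As an illustration of testing a null hypothesis about a population mean, the sketch below runs a
two-sided z-test. The normal approximation, the 5% significance level, the sample values and the
function name z_test are all assumptions for this example; other tests may be more appropriate for
small samples.

```python
import math
import statistics

def z_test(sample, null_mean, alpha=0.05):
    """Two-sided z-test of H0: population mean == null_mean (normal approximation)."""
    n = len(sample)
    std_error = statistics.stdev(sample) / math.sqrt(n)
    z = (statistics.fmean(sample) - null_mean) / std_error
    p_value = 2 * (1 - statistics.NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha  # True means "reject H0"

# Hypothetical sample of 40 measurements (illustrative values only).
sample = [48, 52, 50, 49, 51, 53, 47, 50] * 5

print(z_test(sample, null_mean=50))  # null value equal to the sample mean
print(z_test(sample, null_mean=45))  # null value far from the sample mean
```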

Correlation: The strength of linear association between two variables. Correlation is not causality. A
causal relationship exists when the independent variable is the underlying contributing determinant of
the dependent variable. A causal relationship may be suggested by correlation; however, correlation is
not proof that a causal relationship exists.

Correlation coefficient: A numerical measure of the sign and strength of the linear association between
two variables. The correlation coefficient will range between -1.00 (negative correlation) and +1.00
(positive correlation).
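
A minimal sketch of the correlation coefficient (the Pearson form) follows. The function name and the
small data series are assumptions for illustration.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical series: one moves exactly with x, the other exactly against it.
x = [1, 2, 3, 4]
print(correlation(x, [2, 4, 6, 8]))  # perfect positive association
print(correlation(x, [8, 6, 4, 2]))  # perfect negative association
```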

Linear regression: A statistical method in which a straight line is "fit" to a scatter of point coordinates so
as to determine an estimated intercept and slope (the regression coefficients). Once estimated the
intercept and slope allow the value of the dependent variable to be obtained from the value of an
independent variable. Multiple linear regression uses two or more independent variables to explain a
dependent variable.

Regression residual: The difference between an observed value of the dependent variable and the
corresponding estimated value from the regression model. Small residuals mean the model leads to
more accurate predictions than large residuals. The least squares regression model is one in which the
sum of squared residuals is minimized.
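
A least squares line with one independent variable can be fitted directly from the definitions. The
data points and the function name fit_line are hypothetical; the sketch estimates the intercept and
slope and then computes the residuals.

```python
def fit_line(xs, ys):
    """Least squares estimates of the intercept and slope of y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical observations (illustrative values only).
xs = [1, 2, 3, 4, 5]
ys = [2.2, 4.1, 6.1, 7.9, 10.2]

intercept, slope = fit_line(xs, ys)

# Residual: observed value minus the value estimated by the fitted line.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(f"intercept={intercept:.2f}  slope={slope:.2f}")
```

With an intercept in the model, the least squares residuals sum to zero up to rounding, which is
a quick check that the fit was computed correctly.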

Rate of return: The difference between the amount received at the end of the period and the dollar
amount invested at the beginning of the period, divided by the amount invested. Stocks can have return
both from capital appreciation and from dividends. Bonds can have return both from capital appreciation
and from interest payments. Rates of return are typically computed on a yearly basis. Annual rates of
return can compound over time. An annualized rate of return can be positive or negative. Rates of
return are subject to variation (risk). This risk can be measured by the security's standard deviation in an
isolated setting. If we replace an actual ending period dollar amount with an expected ending period
dollar amount we have what is termed an expected rate of return.
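
A rate of return is the ending amount minus the beginning amount, divided by the beginning amount. The
sketch below applies that calculation and shows how two yearly returns compound. The prices, the
dividend and the function name are hypothetical.

```python
def rate_of_return(beginning_value, ending_value, income=0.0):
    """Return for one period: (ending value + income - beginning value) / beginning value."""
    return (ending_value + income - beginning_value) / beginning_value

# Hypothetical stock: bought at 100, worth 108 a year later, paid a dividend of 2.
one_year = rate_of_return(100.0, 108.0, income=2.0)

# Compounding two yearly returns of +10% and -5%.
two_year = (1 + 0.10) * (1 - 0.05) - 1

print(f"{one_year:.2%}  {two_year:.2%}")
```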

The "market" and return on the market: When we say the "market" in finance we are referring to the
overall market for stocks. Since this market contains many stocks it is completely diversified. The only
risk associated with the market is non-diversifiable or "market risk." The return on the market (say for
one year) is measured using one of the many stock indexes. Perhaps the most popular is the S&P 500
stock index. This index contains 500 stocks from a broad cross-section of U.S. companies. The return on
the market is computed as [Ending index value – Beginning index value]/Beginning index value. The market
return is important because it is the benchmark against which other individual stock returns are judged.

Diversification: The effect that reduces portfolio risk if the securities making up the portfolio are not
perfectly positively correlated. Returns across securities tend to moderate each other over time, thereby
reducing the volatility of the portfolio relative to that of any one security held in isolation. A broad
market index will be completely diversified and will demonstrate only non-diversifiable or market risk.
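
The effect of correlation on portfolio risk can be sketched with the standard two-security variance
formula. The weights, standard deviations and correlation values are hypothetical, and the function
name portfolio_std is introduced only for this example.

```python
import math

def portfolio_std(w1, sd1, sd2, rho):
    """Standard deviation of a two-security portfolio with weights w1 and 1 - w1."""
    w2 = 1.0 - w1
    variance = (w1 * sd1) ** 2 + (w2 * sd2) ** 2 + 2 * w1 * w2 * rho * sd1 * sd2
    return math.sqrt(variance)

# Hypothetical securities with standard deviations of 20% and 30%, held 50/50.
for rho in (1.0, 0.5, 0.0, -0.5):
    print(f"correlation={rho:+.1f}  portfolio std={portfolio_std(0.5, 0.20, 0.30, rho):.4f}")
```

Only with perfect positive correlation does the portfolio risk equal the weighted average of the two
standard deviations; any lower correlation gives a smaller figure.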
