
1.

Define and characterize the main concepts used in data analysis (population, sample,
observation, variables, etc.)
The activity of data analysis can be defined as a succession of processing and interpretation operations, performed on primary information regarding phenomena and processes from the economic-social reality and based on a great variety of specific methods and techniques, for the purpose of deepening the knowledge of the behavior of these phenomena and processes.
The quantitative and qualitative information regarding the phenomena and processes studied expresses a multitude of concrete states and evolutions from the investigated reality and is the result of a laborious process of observation, measurement and evaluation, a process in which a series of norms, principles, methodologies and tools specific to the measurement process intervene. The information obtained from the investigated reality, following observation and measurement processes, is known as data. Data represent the raw, empirical material which is the basis of all decisions in any field of activity, and the quality of these decisions depends directly on the quality of the data.
The stochastic phenomenon is that observable phenomenon, whose particular manifestations are
uncertain, but which show a certain regularity of the forms of manifestation, a certain connection between
these forms of manifestation. The information regarding a certain phenomenon under study, information
necessary to analyze the behavior of that phenomenon, is the result of a measurement process. This
process represents, in fact, an action of assigning numerical values for the characteristics of that
phenomenon.
By measurement process is understood all the activities of assigning numerical values for the
characteristics of the analyzed phenomenon. The selection of the units that will be effectively subjected to
the measurement and registration process is made according to very precise criteria and rules, rigorously
substantiated from a statistical-mathematical point of view. The values taken by the characteristics of the
units studied through this process make up the so-called sample of observations.
Data represent quantitative and qualitative expressions of certain phenomena and processes from the surrounding reality. One of the fundamental concepts of data analysis, to which the definition of many of the usual concepts of this discipline is related, is that of statistical population.
The general population or community is represented by the set of all actual or conceptual measurements that are of interest to the researcher or experimenter.
The sample represents a subset of measurements selected from a population, a subset of the statistical
population subjected to scientific investigation.
The variable represents an abstraction of the set of possible values that a characteristic of a certain
phenomenon can record

2. What are variables and how are they classified

The variable represents an abstraction of the set of possible values that a characteristic of a certain
phenomenon can record. The variety of economic and social phenomena and the different ways of
expressing their characteristics make the variables through which these characteristics are described have
a different nature. Like the characteristics of the populations, according to their nature, the variables can
be of two types: qualitative variables and quantitative variables.
In the data analysis, there is a need for differentiated treatment of qualitative and quantitative data
because there are substantial differences between these types of data, both in terms of approach and
interpretation, and in terms of the methods and techniques used in the analysis. For these reasons, a clear
distinction is made between qualitative variables and quantitative variables.
Qualitative variables are variables that differ in type, refer to non-numerical properties of elementary
units belonging to a population and cannot be expressed numerically.
Quantitative variables are variables that differ in size, refer to numerical properties of elementary units
in a population and are expressed in numerical units: length, weight, value, etc. Depending on the nature
of the values they take, the variables are divided into two categories: discrete variables and continuous
variables.

Discrete variables are variables that can take a limited, finite set of values and are also called categorical variables. The values taken by discrete variables are called alternatives, categories, variants or modalities. Typically, qualitative variables are discrete variables. However, some quantitative variables can also be discrete.
Continuous type variables are variables that can take values from a continuous range. Basically, the set
of possible values of the continuous type variables is an infinite set. As a rule, qualitative variables are not
continuous type variables.

3.What is the measurement scale and what are the main types of measurement scales used in data
analysis

Measurement is a process by which numbers or symbols are associated with the characteristics or properties of objects or subjects, which are the object of the study.
The assignment of numbers or symbols for the characteristics or properties of certain objects is made
based on the observance of predetermined rules and by using specific procedures. For example, if the
object of the study is represented by individuals who are potential buyers of a particular product, then the
characteristics for which it is necessary to assign numbers or symbols may be: age, income, sex,
profession, etc.
The measurement of the characteristics or properties of objects or subjects is always characterized
by a certain specificity, determined by the nature of the measured characteristic, and implies, necessarily,
the existence of benchmarks, reference systems, known as a scale. As a fundamental element of the
process of measuring the characteristics of economic phenomena and processes, the scale can be defined
as follows.
A scale represents an appropriate standard, which determines how values are assigned to
variables; to define a measurement scale is equivalent to:
• to establish a set of possible values of the variable, also called a selection space;
• specify the rules according to which symbols are assigned for the elements of a given reality, that is,
define a structure of the selection space.
Depending on the nature of the variables expressed with their help, there are four types of scales, which are defined in the following. Like the measurement process itself, the scale or reference system is specific to the nature of the measured characteristic. From this point of view, there are four types of measurement scales: the nominal scale, the ordinal scale, the interval scale and the ratio scale. The first two types of scales are non-metric scales, and the last two are metric scales.

4. Define and characterize the nominal scale and the ordinal scale. Highlight the possible
operations on these types of scales

              The nominal scale is a non-metric scale, based on which the values of the variables are defined
by means of non-numeric symbols. Measuring variables on a nominal scale is equivalent to the process of
coding variables. Even if numbers are used for coding, these numbers are, however, purely conventional.
The nominal scale is a non-metric scale, whereby the possible values of the measured characteristics are
assigned symbols without numerical relevance, depending on the nature of these values.
The nominal scale is used to measure characteristics whose values are qualitative, non-
quantifiable. The values that such characteristics can take are known as categories or alternatives. The
variables measured on the nominal scale are called nominal variables and are variables whose form of
expression is attributive type and which can be used only to establish the belonging to a certain class of
the entity described by means of the variable. A special class of nominal type variables is represented by
binary variables, which are variables that can take only two non-numeric type values.
Nominal type variables are discrete variables and can only be used for qualitative classification purposes,
the non-numerical nature of these variables making their use impossible for comparisons, hierarchies or
ordering. In the case of measurement on a nominal scale, the values that the characteristics subject to the measurement can take, that is, the categories or alternatives, are assigned symbols of a non-numerical nature.
On the nominal scale, two different values of the measured characteristic are highlighted by two different
symbols. The elements of the nominal scale, its "divisions", are represented by the symbols assigned to
the values of the studied characteristic, or, more precisely, by the categories of that characteristic. The
nominal scale is represented by the set of these symbols. For example, the sets: {"male", "female"}, {"industry", "agriculture", "construction", ...}, {"worker", "peasant", "intellectual"}, represent nominal type scales used to measure characteristics such as sex, field of activity, social category, profession.
What is characteristic of the nominal scale is the fact that the studied subjects cannot be compared in
terms of the value that they register with the characteristic measured on this scale. Based on the values
recorded on the nominal scale it is not possible to say which subject is "better positioned" from the point
of view of the studied characteristic or, even less, "to what extent" one subject is better positioned than
another.
Also on this scale, the characteristics can be assigned numbers, except that these numbers do not have the proper meaning of numbers, having practically the same role as the symbols. Both the symbols themselves and the numbers with a symbol role, assigned to the characteristics on this measurement scale, only serve to classify the subjects into certain groups or to count the number of subjects in each category, and cannot be used in any type of numerical calculation. By means of the values measured on the nominal scale, the subjects are differentiated only from the point of view of belonging to a certain class or category. This means that using the nominal scale to measure the characteristics measurable on this scale generates classes or categories of subjects.
For the characteristics measured on the nominal scale, only a limited number of statistical indicators can be calculated, which represent, in fact, counts of the symbols appearing on the nominal scale. These indicators are the mode and the frequency. In the case of the characteristics measured on the nominal scale, the frequency distribution can also be highlighted.
In a data analysis, the nominal variables can be represented by a series of variables such as: sex, social
category, family type, profession, brand of a product, etc. The only invariant transformation of the
nominal scale is represented by the recoding operation, this operation not affecting the belonging to a
certain class of the values measured on this scale.

5. Define and characterize the ordinal scale and the ratio scale. Highlight the possible operations
on these types of scales

The ordinal scale is a non-metric scale, similar to the nominal scale, i.e. a coding scale, with the distinction that on this scale it is possible to order the values of the variables. This scale is mainly used to measure consumer preferences. The ordinal scale allows the classification of the values of a variable according to their rank, but the differences between the ranks are not relevant and have no meaning. This type of scale does not allow establishing the degree to which the characteristics of two different entities differ (by how much more or less).
The ordinal scale is a non-metric scale, whereby the possible values of the characteristics are
assigned order numbers or ranks, depending on the position of these values in a hierarchy.
The variables measured on this scale are called ordinal variables; they are qualitative variables of discrete type and they cannot be expressed in a properly numerical form. As examples of ordinal variables we can mention: the income category (small, medium, high), the level of education (elementary, middle, higher), consumers' preference for a particular product (very strong, strong, weak, very weak, none), the quality level of a product or service (inferior, medium, superior), the economic state (recession, stagnation, expansion), etc.
The ordinal scale is used if the characteristic of the subjects under analysis determines a differentiation of the subjects from the point of view of the position that each of them occupies in a hierarchy, in an ordering, that is, if the characteristic takes ordinal-type values. The values that the characteristics measured on the ordinal scale can take are ordinal values or notes, also known as ranks.
These values are assigned either order numbers or symbols that highlight a certain order of the
characteristic values.
On the ordinal scale, two different values of a characteristic are highlighted through two different ranks, that is, through two different positions within the hierarchy. The elements of the ordinal scale, its "divisions", are represented by the numbers or symbols used to represent the ranks, that is, the possible positions in the respective ordering. The ordinal scale is represented by the set of these numbers or symbols.
Although the values of ordinal characteristics are not actual numbers, they do, however, differentiate the
position of one subject from another subject, "say something" about this position. The values of a
characteristic measured on the ordinal scale allow only the ordering of the subjects from the point of view
of this characteristic, determining a hierarchy of the subjects or objects.
By means of the values that the characteristics measured on the ordinal scale can take, the individuals
differ from each other only in terms of rank, of the place they occupy in the hierarchy generated by the
ordinal scale. This means that the use of the ordinal scale to measure the measurable characteristics on
this scale generates hierarchies, orderings of the subjects.
Measurement on the ordinal scale allows comparisons between subjects from the point of view of the measured characteristic, but these comparisons refer only to how one subject "is located" relative to another, without being able to say "to what extent" the subjects differ from each other according to the respective characteristic. The differences between two successive values on the ordinal scale cannot be considered equal, and they do not determine an equal spacing between individuals, so that it cannot be stated, for example, that the subject in first place is "three times better" than the subject located in third place.
For the characteristics measured on the ordinal scale, a series of statistical indicators can be calculated, such as: the mode, the median, the rank correlation coefficient, the frequency. Also, for the ordinal type characteristics it is possible to highlight the frequency distribution. In this context, it is important to specify that the mean and the differences between the values of ordinal variables are irrelevant and have no informational or logical meaning.
The only invariant transformation of the ordinal scale is the translation, that is, the transformation that maintains the order of the values of a variable. Analytically, this type of invariant transformation of the ordinal scale can be defined as: y = a + x, where a is a constant, positive or negative, which gives the direction and magnitude of the translation of the ordinal scale values, values represented by x.
The ratio type scale is the scale that has all the properties of the interval type scale but, in addition, has a natural, non-conventional origin, which cannot be changed. It is a metric scale, on which the values are expressed in numerical form, but, unlike the interval type variables, these values are defined in relation to a certain origin. The origin of the scale indicates the absence of the property or characteristic. In addition to the previous scales, the ratio of values is defined on this scale, meaning that one can compare how many times one value is greater than another. The ratio scale is a metric scale, through which the possible values that the measured characteristics can take are assigned numbers defined in relation to a predetermined origin. The ratio scale is invariant up to a positive proportional transformation, that is, up to the transformation: y = ax. The variables measured on the ratio scale are called ratio type variables and they are quantitative variables. With these variables, all operations defined for numeric variables are allowed. As examples of ratio type variables we can mention: price, income, age, salary, profit, sales volume, number of buyers, etc.

6. What are the main ways of representing (matrix) information in data analysis. Define and
exemplify each of these modes

In order to ensure a more convenient and efficient manipulation, the data used in data analysis are
represented in a specific form, called the matrix form. This form of data representation offers both the
advantage of a simple and clear structuring of the data, as well as the advantage of offering the possibility
of generalizing the concept of data set.

In most cases of data analysis, the matrix is the entity that defines and, at the same time, contains all the
information, all the data, subject to the analysis process. In principle, the primary data are represented in
the data analysis in three main matrix forms: observation matrices, contingency matrices or tables and
proximity matrices or tables.
2.3.1 Observation matrices
An observation matrix is a rectangular table in which the lines represent the objects subject to measurement, and the columns represent the characteristics of the objects. The elements of the table represent values recorded in the measurement process for the characteristics of the objects subject to measurement. These values also bear the generic name of scores. The observation matrices are also called "object x characteristic" type matrices.
For a data analysis in which the number of objects under analysis is T, and the number of characteristics
of the objects is n, the observation matrix has the following form:
X = ( x11  x12 ...  x1n
      x21  x22 ...  x2n
      ...
      xT1  xT2 ...  xTn )
where an element xij represents the value recorded for the j-th characteristic of the object i. A line i of the observation matrix X defines an object Oi and represents the values recorded by this object for the n characteristics it possesses. A column j of the observation matrix X represents the values recorded by characteristic j on the set of all T objects under analysis. Usually, in data analysis, each line of the observation matrix X is called an observation and each column of this matrix is called a variable.
 
In many situations, information about all the characteristics of all the objects under analysis cannot be
obtained. If the data defining the objects are not complete, the observation matrix defined above bears the
name of the observation matrix with omitted values.
2.3.2 Contingency matrices
These are rectangular tables of size m x n, used to represent the data regarding the relative or absolute frequencies recorded, on a set of objects, by the values of two discrete type variables, the first variable, denoted by u, having m possible values, and the second variable, denoted by v, having n possible values. The lines of a contingency matrix represent the possible values of the first discrete variable, and the columns of this matrix represent the possible values of the second discrete variable. In data analysis, contingency matrices are also called "modalities x modalities" type matrices.
An element represents the frequency, absolute or relative, of the objects for which the first variable takes the value ui and the second variable takes the value vj. This element shows for how many objects the two analyzed variables simultaneously take the values ui and vj.
2.3.3 Proximity matrices
These are square matrices of size n x n, used to represent data on the similarity or dissimilarity of some objects. The order of the proximity matrices is determined by the number of objects under study. The elements of a proximity matrix represent similarity coefficients, dissimilarity coefficients or distances. An element xij of this matrix measures the degree of closeness between object i and object j.
Proximity matrices are also called "object x object" type matrices and are used in classification problems using cluster techniques and in multidimensional scaling problems.
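
As an illustration, a minimal Python sketch (assuming numpy and scipy are available; all numbers are made up) of the three matrix forms described above:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# observation matrix: T = 4 objects (lines) x n = 2 characteristics (columns)
X = np.array([[1.70, 65.0],
              [1.82, 80.0],
              [1.65, 58.0],
              [1.78, 75.0]])

# contingency matrix: absolute frequencies of two discrete variables,
# the first with m = 2 modalities, the second with n = 3 modalities
F = np.array([[12,  5, 3],
              [ 7, 10, 8]])

# proximity matrix: pairwise Euclidean distances between the 4 objects
D = squareform(pdist(X))   # 4 x 4, symmetric, null diagonal

print(X.shape, F.shape, D.shape)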

7. Define the main (one-dimensional) indicators with which the central tendency or location or
position is synthesized (including calculation relationships and properties). Show that the mean is
an optimal synthesis for a set of observations

One of the most important and relevant measures for describing the values of a characteristic is that
represented by the central tendency. The main aim of the measurement of the central tendency is to
determine a magnitude that synthesizes, to summarize, the multitude of values represented by the
observations made on some variables, in terms of their magnitude. It is obvious that, in order to be
relevant, the size used to measure the central tendency must be a kind of "center of gravity" of the
available observations, the values of the observations being distributed around this size.

From a geometrical point of view, determining a measure for expressing the central tendency is equivalent to finding a vector that has the same sense and the same direction as the vector whose components are all equal to one and which is as close as possible to the vector of observations. In this sense, it can be said that, in the case of the Euclidean metric, the magnitude that optimally expresses the central tendency is the arithmetic mean.
The central tendency can be highlighted by means of statistical indicators, among which the most important are: the mean, the median and the mode. Each of these indicators expresses, in one way or another, more or less suggestively, the level of the analyzed characteristic across the objects.
The mean - is obtained by dividing the sum of the individual values by the population or sample size.
The median - is the value that, within the statistical series, separates the population into two equal parts. It does not have a formula as simple as that of the mean; moreover, a median value itself exists only if the number n is odd, when there is, in fact, a middle individual (the (n + 1)/2-th one) whose value is the median. If n is even, we take the individuals of rank n/2 and n/2 + 1, with values, say, xi and xi+1, and the median can be any value in the range (xi, xi+1); usually the arithmetic mean of the two values is taken.
The mode - is used only when working with frequencies, being the value taken with the highest frequency. One can also speak of relative modal values when the frequencies of several non-neighboring classes exceed those of their immediate vicinity; in this case we deal with bimodal or plurimodal series.
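
A minimal Python sketch of the three indicators, using the standard statistics module on a small made-up sample:

import statistics as st

sample = [3, 7, 7, 2, 9, 7, 4, 2]

mean_value   = st.mean(sample)     # sum of the individual values / sample size
median_value = st.median(sample)   # n is even here, so the mean of the two middle values is taken
mode_value   = st.mode(sample)     # the value with the highest frequency (7)

print(mean_value, median_value, mode_value)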

8. Define the main indicators (one-dimensional) with which the variability is synthesized (including
calculation relationships and properties)

Another important measure for synthesizing the values of a characteristic is the variability that characterizes the observations of the variable, the scattering, the dispersion of these values. A synthetic indicator used to measure and express the variability of the values of a characteristic is the variance.
The variability that characterizes the set of observations made on a certain characteristic is highlighted by the differences that exist between the values that the characteristic registers on the set of subjects, by the magnitude of the variations of the characteristic values from one subject to another. Variability is important both informationally and as the context in which the relevance of the mean can be judged. The lower the variability of a set of observations, the more the mean is a synthesis, a summary, appropriate and relevant for the set of observations. On the other hand, the greater the variability, the less the mean can be considered a relevant synthetic expression of the observed values. Therefore, it can be said that the greater or lesser confidence we can give to the mean as a magnitude that summarizes the observed values depends on the magnitude of the variability of these values. This means that in order to have a measure of the relevance of the mean, it is necessary to establish a measure of the variability.
The variance is directly proportional to the magnitude of the variation of the measured characteristic values, that is, to the amount of information contained in the observations available for data analysis. With the previous notations, the variance of the variable xi, denoted by Si^2, is determined by the following formula:

Si^2 = (1/(T - 1)) * Σ(t=1..T) (xti - x̄i)^2

Specifically, the variance represents the sum of the squared deviations of the individual values from the mean that falls, on average, on each individual value, that is, on each observation made on the variable. As a result of the fact that variability may or may not exist, the variance, as a measure of this variability, is always a non-negative measure.

Starting from the way the variance measures the variability and from the importance that this variability has in data analysis, one can assert that, in a certain sense, the variance represents a measure of the information contained in the analyzed data.
A major deficiency of the variance, as an indicator for measuring the variability, of the amount of information contained in the primary data, is related to the fact that the variances of two characteristics or two variables expressed in different units of measure cannot be compared. The comparison of variances is only possible if the measurements of the characteristics are expressed in the same units of measurement. Also in this regard, there is another important shortcoming of the variance: it is an unscaled measure. Although the size of the variance is bounded below, having a lower bound represented by the value zero, which indicates the lack of variability, that is, constancy, it is not bounded above, it does not have an upper bound.
Another difficult problem, which arises in connection with the variance, is that the units of measure in which it is expressed are different from the units of measure of the characteristic whose variability it measures. In fact, the variance is measured in units of measurement that represent squares of the units of measurement of the observations made on the considered characteristic. This feature of the variance creates a series of difficulties related to the concrete interpretation of the size of this indicator of the variation.
Due to the lack of significance of the units of measurement of the variance, another indicator, derived from the variance and represented by the square root of the variance, is used to measure the variation. This indicator is known as the standard deviation.
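
A minimal numpy sketch of the variance with the 1/(T - 1) factor used in this text, and of the standard deviation derived from it (made-up values):

import numpy as np

x = np.array([4.0, 7.0, 6.0, 5.0, 8.0])

variance = np.var(x, ddof=1)    # sum of squared deviations from the mean, divided by T - 1
std_dev  = np.sqrt(variance)    # expressed in the same units as x, unlike the variance

print(variance, std_dev)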

9. Define the simple variance, the total variance and the generalized variance. Deduce and interpret the
generalized variance. Show that the generalized variance is equal to the determinant of the covariance
matrix

The simple variance is a measure of the deviation from the mean, the squared deviation from the mean: Vs = Σ(i=1..n) (xi - x̄)^2. The total variance measures the variability that characterizes the observations of a set of variables and is defined as the sum of the individual variances of the variables: Vt = Σ Si^2.
The total variance offers an overall picture of the global variability that characterizes the analyzed observations, but it measures this variability only in the individual sense, not taking into account the common, simultaneous variability of the observations, that is, the generalized variability.
The generalized variance measures the variability that characterizes the observations of the set of variables, both from an individual point of view and from the point of view of simultaneity, of the informational interaction of the variables. The generalized variance corresponding to the space of observations of 2 considered (centered) variables is given by the relation: Vg = ((1/(T - 1)) * |x1| * |x2| * sin Φ)^2.
It can be shown that the generalized variance is equal to the determinant of the covariance matrix S corresponding to the variables under study, that is: Vg = |S|.
The generalized variance is an extremely important measure of the total variability, formed both as a result of the individual variability that characterizes the variables and as a result of the common variability that reflects the interaction of the variables.
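
A minimal numpy sketch (made-up data) checking numerically that the generalized variance equals the determinant of the covariance matrix, while the total variance is the sum of the individual variances, i.e. the trace of the same matrix:

import numpy as np

X = np.array([[2.0, 3.1],
              [4.0, 4.9],
              [6.0, 7.2],
              [8.0, 8.8]])          # T = 4 observations of 2 variables

S  = np.cov(X, rowvar=False)        # covariance matrix with the 1/(T - 1) convention
Vt = np.trace(S)                    # total variance = sum of the individual variances
Vg = np.linalg.det(S)               # generalized variance = |S|

print(Vt, Vg)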

10. Define the main indicators (one-dimensional) with which the links are synthesized (including
calculation relationships and properties)

The intensity and meaning of the linear connection or association between two characteristics of objects
or individuals represents another important measure that can be used in the numerical synthesis of data.

The measure of the linear type association can be expressed by correlating the simultaneous variations, or the covariations, of two characteristics on a set of objects or individuals. The basic quantity used to express simultaneous variations is the covariance:

sij = (1/(T - 1)) * Σ(t=1..T) (xti - x̄i) * (xtj - x̄j)
Covariance is a measure of the simultaneous variation of two variables, being, in absolute value, all the larger as the variations of the two variables around their means are closer in magnitude, showing a certain proportionality on the set of studied subjects. Covariance is considered to be a numerical expression of the degree of association of two characteristics due to the fact that, in all cases where two variables are significantly related, a variation in one sense of one of them will cause a proportional variation in the same sense (in the case of a direct link) or in the opposite sense (in the case of an inverse link) of the other variable.
A scaled measure of the degree of linear association between two variables, which eliminates some deficiencies of the covariance as an indicator of the linear type association, is represented by the Pearson correlation coefficient:

rij = sij / (si * sj)

The Pearson correlation coefficient is a scaled magnitude: -1 ≤ rij ≤ 1.


A null value of the correlation coefficient shows the absence of a linear type link between the two variables, while an absolute value equal to one shows a perfect linear connection, which is direct if the value is equal to 1 and inverse if the value is equal to -1.
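
A minimal numpy sketch of the covariance and of the Pearson correlation coefficient for two made-up variables:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov_xy = np.cov(x, y, ddof=1)[0, 1]                        # covariance with the 1/(T - 1) factor
r_xy   = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))  # Pearson coefficient, scaled in [-1, 1]

print(cov_xy, r_xy, np.corrcoef(x, y)[0, 1])               # the last value should match r_xy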

11. Define and interpret the correlation and the correlation coefficient

Correlation is a statistical method used to determine the relationships between two or more variables.
There are several types of correlations, both parametric and nonparametric.
The correlation coefficient is a quantitative value that describes the relationship between two or more variables. It varies in the interval [-1, +1], where the extreme values indicate a perfect relation between the variables, while 0 means a total lack of linear relation. A more adequate interpretation of the obtained values is made by comparing the obtained result with certain preset values in the correlation tables, depending on the number of subjects, the type of connection and the desired significance threshold.

12. Define profile, chronological and panel data. Exemplify each of the three types

Profile data represent information obtained through static measurements, carried out on the characteristics of some units of a population, at the same moment in time. An observation in the context of profile data is represented by the value or values of a single entity, of a single unit in the population. The number of observations coincides in this case with the number of units observed and recorded. Data of this type do not incorporate, in the meaning they carry, the influence of time on the formation of the characteristics at the population level, nor the meaning of the passage of time, neither explicitly nor implicitly. Profile data refer to the state that the units (individuals, firms, households, etc.) have at a certain moment. Examples: data on the individual monthly salaries of the workers of a company, data on the average population of the countries of the world in a given year, etc.

Time series data represent the information obtained through measurements of a dynamic nature, performed on a unit of a population at successive moments or time intervals. The time intervals for which measurements are made can be: hours, months, years, etc. This type of data refers to the evolution in time of the state of an individual, a household, etc. Data of this type can be interval data or moment data. Interval data refer to variables that are flow type quantities, and moment data refer to variables that are stock type quantities.
Panel type data are obtained through mixed measurements, static and dynamic, performed on the same units of a population at successive moments or time intervals.
These data can be imagined as representing "mixed information cuts", transversal and longitudinal, in relation to the time axis. The observations are made simultaneously on all the units, at each moment, and repeatedly over time.
Example: family budgets, recorded for several families from a sample over several successive periods.

13. Define observational and experimental data. Exemplify each category

Experimental data are data obtained through the organization of controlled experiments, in which the influences of the factors on the effect are controlled directly, by fixing precise combinations of influences. They are specific only to some research fields, those fields in which the specific experiments needed to obtain these data can be organized. These areas are: physics, chemistry, biology, etc.
These data are laboratory data, the laboratory meaning a series of special conditions, restrictions and specific measuring instruments. In the economic-social field, experimentation is either completely impossible, or possible only very rarely and in a very restrictive and costly way.
Observational (non-experimental) data are obtained through the free observation of the evolution of the studied phenomena and processes, without the direct intervention of the investigator on the conditions in which this evolution takes place. Obtaining these data is the result of passive observation and recording. The intervention of the person making the measurements is of an ex post type, taking place after the real phenomena and processes have occurred. Observational data are specific to the economic-social domain.

14. What are the main types of preliminary data transformations? Interpret the quantities resulting
from these transformations and mention their properties

Prior to their use in data analysis, the original data are subjected to a preliminary processing, which can be performed through the centering operation or through the data standardization operation.
The data centering operation consists in replacing the value of each observation with a new value, represented by the deviation of the original value from the mean of the initial data. Due to the fact that the sum of the deviations of the original values from their mean is always null, the data centering operation causes the centered variables to have null mean. In this case, the variance of a variable is proportional to the square of the length of the vector composed of the observations of that variable, and the standard deviation is proportional to the length of the same vector.
For centered variables, the covariance is proportional to the scalar product of the vectors representing the observations of the two variables, and the correlation coefficient between two such variables is the ratio between the scalar product of the vectors representing the observations on the variables and the product of the lengths of these vectors.
The operation of standardizing the values of a variable consists in replacing the value of each observation with a new value representing the ratio between the centered value of the respective observation and the standard deviation of the respective variable. The standardized variables also have null arithmetic mean; in addition, their variance is equal to one, and the covariance is scaled in the interval [-1, 1] (cov = 1: perfect direct linear association between the two variables; cov = 0: no linear association between the two variables; cov = -1: perfect inverse linear association). One consequence of this property is that, for these variables, the covariances are precisely the Pearson correlation coefficients.

As with the centered variables, the covariance is proportional to the scalar product of the vectors representing the observations of the two variables, and the correlation coefficient between two such variables is identical to the covariance, being proportional to the scalar product of the vectors representing the observations on the variables.
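
A minimal numpy sketch of the two preliminary transformations, checking that the centered variables have null mean and that the standardized variables have null mean and unit variance:

import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 14.0],
              [3.0, 13.0],
              [4.0, 19.0]])                    # T = 4 observations of n = 2 variables

Xc = X - X.mean(axis=0)                        # centering: deviations from the column means
Z  = Xc / X.std(axis=0, ddof=1)                # standardization: centered values / standard deviations

print(Xc.mean(axis=0))                         # ~0 for both columns
print(Z.mean(axis=0), Z.var(axis=0, ddof=1))   # ~0 means, variances equal to 1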

15. Define the main types of matrices used in data analysis (cross-product, covariance, correlation).
Highlight the connection relationships between these types of matrices

The most important matrices used in data analysis are the following:


The matrix of centered observations - can be obtained as the difference between the observation matrix X and the matrix X̄ whose columns contain the means of the n variables: Xc = X - X̄.
The matrix of standardized observations - can be obtained as the product between the centered observation matrix Xc and the inverse of the diagonal matrix V whose diagonal elements are the standard deviations of the n variables: Z = Xc * V^-1, a generic element of Z being zti = (xti - x̄i) / si.
The cross-product matrix - can be determined both for the original variables and for the centered and standardized variables. For the original variables, the cross-product matrix is obtained as the product between the transposed matrix X^t and the matrix X: C = X^t * X.
If the variables are centered, the cross-product matrix is Cc = Xc^t * Xc, its generic element being of the form Σ(t=1..T) (xti - x̄i) * (xtj - x̄j).
The covariance matrix - is one of the most commonly used matrices in data analysis, most data analysis techniques presupposing the calculation of this matrix. For the situation in which the number of analyzed variables is equal to n, the covariances of any 2 variables can be arranged in the form of a square, symmetrical matrix of dimension n x n, called the covariance matrix S, whose elements are:
Sii = (1/(T - 1)) * |xi - x̄i|^2 and Sij = (1/(T - 1)) * (xi - x̄i)^t * (xj - x̄j).
As noted above, the covariance matrix of the original variables can be written with the help of the cross-product matrix of the centered variables: S = (1/(T - 1)) * Cc.
The correlation matrix - is an important matrix in data analysis, because a series of data analysis methods and techniques base their procedures on the spectral analysis of this matrix. Its diagonal elements are equal to 1, and a generic element has the form:
rij = (xi - x̄i)^t * (xj - x̄j) / (|xi - x̄i| * |xj - x̄j|).
The correlation matrix of the original variables can be written with the help of the cross-product matrix of the standardized variables: R = (1/(T - 1)) * Z^t * Z.
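
A minimal numpy sketch (random made-up data) verifying the relationships stated above, S = Cc / (T - 1) for the centered data and R = Z^t * Z / (T - 1) for the standardized data:

import numpy as np

X = np.random.default_rng(0).normal(size=(50, 3))   # T = 50 observations, n = 3 variables
T = X.shape[0]

Xc = X - X.mean(axis=0)                  # centered observation matrix
Z  = Xc / X.std(axis=0, ddof=1)          # standardized observation matrix

S = Xc.T @ Xc / (T - 1)                  # covariance matrix from the centered cross product
R = Z.T @ Z / (T - 1)                    # correlation matrix from the standardized cross product

print(np.allclose(S, np.cov(X, rowvar=False)))       # True
print(np.allclose(R, np.corrcoef(X, rowvar=False)))  # True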

16. What is the analysis of the principal components. Highlight five categories of problems that can
be solved using the principal component analysis techniques

PCA (principal component analysis) is a multidimensional analysis method that aims at detecting new variables, called principal components and expressed in the form of linear combinations of the original variables, these new variables being of maximum variability. The technique on which PCA is based consists in calculating the projections of each point of the initial space, defined by the original variables subjected to the analysis, on the axes of a new space, whose dimension is significantly smaller.
From the point of view of the principles on which data analysis is based, only the informationally significant linear combinations are of interest. As the number of linear combinations that can be formed with the original variables is very large, a sorting of them is required; this sorting is done by defining a criterion that must be at the basis of the decision to retain or eliminate a certain linear combination. In the PCA framework, this criterion is based on the magnitude of the variance of each linear combination and can be formulated as follows: the linear combinations with small variance are eliminated, and the linear combinations with maximum variance are retained for study.

The main problems that can be solved with the help of PCA techniques fall into two main categories: simplifying the structure of the causal dependence and reducing the dimensionality of the causal space.
By the first category of problems, simplifying the structure of the causal dependence, is understood obtaining a causal space of smaller dimension, which allows a simpler and more suggestive representation of the objects. Following this simplification, it is much easier to highlight the causal morphology and to express an adequate structure of the objects. A special case of problem solved by PCA is the elimination of informational redundancies. If a scientific investigation has as its object a large initial causal space, it is very difficult to deduce and express a structure of dependence that clearly shows the net contribution of the analyzed variables to the formation of the variability of the whole causal space. The correlation of the causal variables has a complicated structure of dependence, a redundant structure that includes certain informational overlaps of the influences of the causal variables. Because the initial causal structure is complicated and includes many overlaps, it generates difficulties regarding the clear understanding of the causal relationships and the formulation of pertinent conclusions. In this case it is possible to renounce the inclusion in the analysis of the information that is not significant.
The second category of problems solved by PCA is the reduction of dimensionality. At the basis of PCA is the idea that the axes of the initial coordinate system are not always the most suitable, considering that there may be another, more relevant way of representing the objects. This new mode of representation can be obtained by considering a new representation space, which defines new characteristics of the objects. The new characteristics are called principal components, and the values recorded by them are called scores. The problem of representation in a smaller space is known as the problem of dimensionality reduction. For this reason, PCA is also known as a dimensionality reduction technique. For example, dimensionality reduction can be explained as follows: consider a two-dimensional graph in which some values are illustrated; dimensionality reduction in this case means the transition from a two-dimensional system to a one-dimensional system, that is, the values must be represented along a line. As a result of this operation, a new entity is obtained, which can be interpreted as representing a new data feature. The content of this new entity is more relevant than the one contained in the previous observations.
The concrete problems solved by reducing the dimensionality are:
1) selection of the influence variables - because not all influence variables have the same importance in the formation of the characteristics, it is necessary that the variables be subjected to a filtering process, a process by which some variables are eliminated and others kept, according to their significance. In addition to the process of filtering the independent variables according to their importance in the analysis, there is frequently a need to group the independent variables according to their influence.
2) simplification of mathematical models - there are many reasons that make it difficult and uncomfortable to retain a large number of variables within a mathematical model of analysis or prediction, and which lead to the need for a certain simplification of the model from this point of view. First of all, the significance of each variable in a model that includes too large a number of variables is greatly diminished. Secondly, obtaining the information needed to estimate a model that contains a large number of variables would imply a prohibitive effort and cost. Thirdly, such a model usually contains variables that are strongly intercorrelated. Fourthly, a large number of variables retained in a model would raise serious problems related to the complexity of the calculations.
3) elimination of informational redundancies.
4) visualization of complex causal relationships - through PCA the necessary conditions can be created so that even objects that are characterized by a large number of variables can be represented graphically.
5) compression and restoration of data in computing - after eliminating the informational redundancies, the new representation of the objects ensured by applying PCA techniques is accompanied by a negligible loss of information, so that in the storage and transmission of information the new features of the objects can be used instead of the original ones. Because their number is much smaller, the handling of the information can be done with a lower consumption of resources.

17. Interpret the logic of principal component analysis (including geometrically)

The logic of PCA is based on the fundamental idea that certain transformations can be applied to the initial observations, transformations that ensure the maximization of the variance for certain variables and the minimization of the variance for other variables. Thus the logical-informational significance of some variables is accentuated and that of other variables is diminished. Maximizing the variance of some variables and emphasizing the significance of these variables in relation to the others is all the more relevant as there is a stronger link between the original variables.
In order to highlight the way in which the principal components can be deduced, that is, the new variables that are supposed to conserve the variability that initially characterized the causal space and which are uncorrelated, we proceed to successive rotations of the 2 initial axes, measuring the variance that characterizes the 2 variables for each position of the axis system modified by rotation. Due to the fact that the axis system is rotated by a certain number of degrees, the coordinates of the 2 variables change accordingly and thus the variance in the new coordinates is different.
Observations:
- the rotation of the initial axes by a certain angle, in order to maximize the variance along an axis, does not change the position or the configuration of the points corresponding to the original observations; it only changes their coordinates with respect to the new axes;
- the new axes resulting from the rotation that maximizes the variance along the first axis define 2 new variables, called principal components, which have null mean;
- the two new variables, the principal components, are linear combinations of the original variables and are uncorrelated with each other;
- the coordinates of the observations with respect to the new variables are the projections of the initial observations on the new axes and are called scores of the principal components;
- the 2 principal components fully conserve the total variance corresponding to the original variables, i.e. the sum of the variances of the 2 principal components is equal to the sum of the variances of the 2 original variables;
- the first principal component has maximum variance, taking the maximum possible part of the total variance of the original characteristics.

18. Define the principal components and mention their properties

The principal components are abstract vector variables, defined in the form of linear combinations of the original variables, and which satisfy 2 fundamental properties:
- they are uncorrelated two by two, and the sum of the squared coefficients that define the linear combination corresponding to a principal component is equal to one;
- the first principal component is a normalized linear combination whose variance is maximum; the second principal component is a normalized linear combination uncorrelated with the first principal component and which has the maximum possible variance, but smaller than that of the first component, and so on.
From a geometrical point of view, the variables called principal components define a new space of the objects, in which the following properties, relevant for the definition of PCA, hold:
- the axes of the new space are orthogonal two by two and correspond to the new variables called principal components;
- the coordinates of the objects in the new space, that is, the projections of the objects on its axes, are evaluations of the objects in relation to the new variables and are called scores of the principal components or principal scores;
- theoretically, the number of principal components is equal to the number of original variables; however, not all principal components have a significant informational content, so the least informationally significant ones are eliminated;
- the principal components are linear combinations of maximum variance of the original variables;
- the principal components are ordered according to the magnitude of their variance, the first being the principal component with maximum variance and the last being the principal component with minimum variance;
- the principal components are uncorrelated two by two;
- the sum of the variances of the principal components coincides with the sum of the variances of the original variables, i.e. the principal components take over in full the variability contained in the original variables.
With the help of the principal components, a simpler and clearer structure of the dependence between the original variables can be defined, and therefore one that is easier to interpret.

19. Formulate the mathematical model of principal component analysis, define and interpret its
defining quantities

In order to formulate the mathematical model underlying PCA, we will consider that the initial causal space under investigation is defined by a number of n explanatory variables denoted x1, ..., xn. These variables symbolize the characteristics of the objects under analysis, which means that each object is supposed to be characterized by n variables.
The activity of detecting the principal components can be described by means of a transformation of the following type: Ψ: R^n → R^k, where R^n and R^k are two real vector spaces, and the dimension of the second is much smaller than the dimension of the first. From a mathematical point of view, the problem of PCA is equivalent to solving the following extremum problem:
       opt Φ(x, w), over A ∈ M(n x k),
       subject to: w = A^t * x,
where opt can be min or max.
A specific way of solving the PCA problem is the maximization of the variance of the principal components, as a measure of the quantity of information expressed by each of them.
In order to define the mathematical model of PCA, we will consider that the vectors α^(i) represent the columns of a matrix A of dimension n x n, of the form: A = (α^(1) α^(2) ... α^(n)).
Also, we will assume that x is the vector whose coordinates are the original variables x1, x2, ..., xn and that w is the vector whose coordinates are the principal components w1, w2, ..., wn. Under these conditions, the linear combinations that define the principal components can be written as:
       wi = α1^(i)*x1 + ... + αn^(i)*xn, i = 1, ..., n,
or, in matrix form:
       w = A^t * x.
Based on these notations, the mathematical model of PCA can be defined as the sequence of extremum problems:
       max Var(wi) = (α^(i))^t * Σ * α^(i), subject to: (α^(i))^t * α^(i) = 1 and wi uncorrelated with w1, ..., wi-1, for i = 1, ..., n.
The columns of the matrix A represent, in fact, the normalized eigenvectors of the covariance matrix Σ, and the variance of each principal component wi, maximal under the constraint of being uncorrelated with the previous principal components, is represented by an eigenvalue of the same covariance matrix. This way of determining the elements of the matrix A is equivalent to calculating the projections of objects of type x ∈ R^n on the linear subspace generated by the column vectors of the matrix A. We have seen previously that the principal components of the causal space defined by the original variables x1, x2, ... are defined by the linear combinations:
       wi = α1^(i)*x1 + ... + αn^(i)*xn, i = 1, ..., n,
whose weights αj^(i) are determined so as to maximize the variance of the principal component wi.

20. Illustrate how to deduce the principal components

The deduction of the principal components consists, practically, in choosing among the eigenvalues of the covariance matrix Σ the largest one and determining the components of the weight vector α which defines the respective principal component. Thus, for each eigenvalue λi of the n eigenvalues of the covariance matrix Σ, we will have one solution, that is, a vector α^(i) and thus a principal component wi.
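
A minimal numpy sketch of this deduction (random made-up data): the principal components are obtained from the eigendecomposition of the covariance matrix, the eigenvectors giving the weights and the eigenvalues the variances of the components:

import numpy as np

X = np.random.default_rng(1).normal(size=(100, 4))   # 100 observations, 4 original variables

S = np.cov(X, rowvar=False)              # covariance matrix Sigma (sample estimate)
eigvals, eigvecs = np.linalg.eigh(S)     # eigenvalues in ascending order, eigenvectors in columns

order   = np.argsort(eigvals)[::-1]      # reorder by decreasing eigenvalue
lambdas = eigvals[order]                 # variances of the principal components
A       = eigvecs[:, order]              # columns alpha^(i): weights of the components

print(np.allclose(lambdas.sum(), np.trace(S)))   # the total variance is conserved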

21. Define and justify 3 of the properties of the principal components.

The principal components have a number of extremely interesting properties, which derive from their
very definition and which are important for understanding the nature and content of these abstract
constructions. The knowledge of the properties that the principal components have is particularly
important in the process of data analysis, allowing to determine the changes induced on the principal
components and on the sizes associated with them by the transformations applied on the observations of
the original variables. Three of these are:
a. Distribution according to the normal law
If the original variables are normally distributed, the vector of the principal components w is distributed normally with mean A^t*µ and covariance matrix Λ, that is:
w ~ N(A^t*µ, Λ), where Λ is the diagonal matrix whose elements are the eigenvalues λ1, λ2, ..., λn of the covariance matrix Σ.
The normality of the n variables representing the principal components results from the fact that these are
linear combinations of the original n variables, which, by hypothesis, are normal variables.
b. Conservation of the generalized variance
The principal components w1, w2, ..., wn ensure the complete preservation of the generalized variance of the original variables x1, x2, ..., xn. This means that: VG(x) = VG(w). This property highlights the informational quality of the principal components as a re-expression of the original variables.
c. Dependence on the units of measurement
The principal components w1, w2, ..., wn and their variances depend on the units of measurement in which the original variables x1, x2, ..., xn are measured. This means that, with the change of the units of measure of the original variables, both the principal components and their variances change.
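
A minimal numpy sketch (made-up data) checking property b, the conservation of the generalized variance: the determinant of the covariance matrix of the scores of all the principal components equals that of the original variables:

import numpy as np

X = np.random.default_rng(4).normal(size=(200, 3))

S = np.cov(X, rowvar=False)
eigvals, A = np.linalg.eigh(S)             # A: eigenvectors of the covariance matrix
W = (X - X.mean(axis=0)) @ A               # scores of all the principal components

VG_x = np.linalg.det(S)
VG_w = np.linalg.det(np.cov(W, rowvar=False))
print(np.isclose(VG_x, VG_w))              # True: VG(x) = VG(w)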

22. Interpret the eigenvectors and eigenvalues of the covariance matrix

We will assume that x is the vector whose coordinates are the original variables x1, x2, ..., xn and that w is the vector whose coordinates are the principal components w1, w2, ..., wn. Under these conditions, the linear combinations that define the principal components can be written in matrix form as:
       w = A^t * x.
The columns of the matrix A represent, in fact, the normalized eigenvectors of the covariance matrix Σ, and the variance of each principal component wi, maximal under the constraint of being uncorrelated with the previous principal components, is represented by an eigenvalue of the same covariance matrix.
The problem of determining the components of the vector α defining the linear combination representing the principal component w is reduced to solving the following extremum problem with constraints:
       max α^t * Σ * α, subject to: α^t * α = 1.
The problem can be solved using the Lagrange multiplier method:
       L(α, λ) = α^t * Σ * α - λ * (α^t * α - 1); setting the derivative with respect to α to zero gives Σ * α = λ * α.
It follows that the solution α of the extremum problem is precisely one of the eigenvectors of the covariance matrix Σ, namely the one associated with the eigenvalue λ of the same matrix. Furthermore, it is observed that the maximum value of the quadratic form α^t * Σ * α is, at the extremum point α, equal to λ, respectively:
       α^t * Σ * α = α^t * λ * α = λ * α^t * α = λ.
This last relationship highlights the fact that the variance of a principal component is equal to an eigenvalue of the covariance matrix.

23. What are the principal scores and how are they determined? Why is it necessary to determine the
principal scores

In principal component analysis, the coordinates of the objects in the reduced space are also called principal scores of the objects. If we assume that p components have been retained and if we denote by Ū the matrix of dimension n x p whose columns are the eigenvectors that define the p principal components, then the matrix of scores can be determined as follows:
       W = X * Ū,
where X is the matrix of the (centered) observations. The lines of the matrix W represent the scores corresponding to the new variables, that is, the observations of the principal components. Once determined, the principal scores can be used in the analysis as a substitute for the original observations, thus simplifying the initial informational basis. In relation to this problem, the principal scores are more suitable for use in analyses because they are less affected by errors, compared to the original measurements. The fact that the principal scores are more robust in relation to the disturbances introduced by errors, that they have a certain invariance in relation to errors, causes them to become more informationally relevant than the original observations.
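
Following the same eigendecomposition approach as in the earlier sketch, a minimal numpy illustration of computing the principal scores by projecting the centered observations on the first p eigenvectors (p = 2 is chosen arbitrarily here):

import numpy as np

X = np.random.default_rng(2).normal(size=(100, 4))

Xc = X - X.mean(axis=0)                    # centered observations
S  = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]

p = 2                                      # number of retained components
U = eigvecs[:, order[:p]]                  # n x p matrix of retained eigenvectors
W = Xc @ U                                 # T x p matrix of principal scores

# the variances of the scores are the corresponding eigenvalues
print(np.allclose(np.cov(W, rowvar=False).diagonal(), eigvals[order[:p]]))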

24. What is the factor matrix (the correlation matrix between the original variables and the
principal components). How to calculate and interpret its elements

An important matrix used in the context of principal component analysis, whose elements offer premises
for interesting interpretations, is the factor matrix. We will assume that the n principal components are
represented by the vector w, and that the covariance matrix of the principal components is the diagonal
matrix Λ. We will also consider the connection between the vector of the original variables and the vector
of the principal components to be given by the relation x = A * w, where A is the matrix of the
eigenvectors of the covariance matrix Σ.
The matrix Ω is a very important matrix for principal component analysis and is known as the factor
matrix. This matrix has the form:

Ω = D^(-1) * A * Λ^(1/2), where D is the diagonal matrix of the standard deviations σ1, σ2, ..., σn of the original variables,
a generic element ωij of the factor matrix Ω being determined by the relation:

ω_ij = (√λ_j / σ_i) * α_i^(j)
The elements of the factor matrix Ω are called the intensities of the factors (factor loadings) and have a
particularly interesting interpretation from the point of view of the connection between the original
variables x1, x2, ..., xn and the principal components w1, w2, ..., wn. Thus, the element at the intersection of
row i with column j in the matrix Ω, that is, the element ω_ij = (√λ_j / σ_i) * α_i^(j), represents the
correlation coefficient between the standardized variable xi and the principal component wj. The
intensities of the factors are indicators of the extent to which the original variables participate in the
formation of the principal components or, more precisely, of the extent to which the principal components
synthesize the information contained in the original variables. The higher the value of the correlation
coefficient between an original variable and a principal component, the more appropriate and complete the
informational expression of the original variable through the respective principal component.
The factor matrix is very important because, based on the analysis of the values of its elements, a series of
partitions or clusters can be identified on the set of variables, clusters which, associated with certain
principal components, can lead to the establishment of intuitive meanings for those components. This
means that the analysis of the elements of the factor matrix Ω can allow the identification of those
original variables that are well represented by a certain principal component and, on this basis, creates the
possibility of assigning a concrete meaning to each principal component.
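
A short sketch, under the assumptions above, that builds the factor matrix Ω from the eigenvalues, the eigenvectors and the standard deviations of the original variables, and checks that its elements coincide with the correlations between the original variables and the principal components (the data are invented for illustration):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) @ np.array([[1.0, 0.8, 0.2],
                                          [0.0, 1.0, 0.5],
                                          [0.0, 0.0, 1.0]])
Sigma = np.cov(X, rowvar=False)

lam, A = np.linalg.eigh(Sigma)
order = np.argsort(lam)[::-1]
lam, A = lam[order], A[:, order]

sigma = np.sqrt(np.diag(Sigma))                   # standard deviations of the original variables

# omega_ij = sqrt(lambda_j) * alpha_i^(j) / sigma_i, i.e. Omega = D^-1 * A * Lambda^(1/2).
Omega = (A * np.sqrt(lam)) / sigma[:, None]

# Each omega_ij equals the correlation between variable x_i and component w_j.
W = (X - X.mean(axis=0)) @ A
check = np.array([[np.corrcoef(X[:, i], W[:, j])[0, 1] for j in range(3)] for i in range(3)])
print(np.allclose(Omega, check))                  # True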

25. Describe how the principal components can be interpreted in terms of concrete significance. Give an example.

The interpretation of the principal components is facilitated by the "circle of correlations" graph, which is
the projection of the unit sphere of the space F onto a plane delimited by c1 and c2, two principal
components of the space F; the coordinates of the points represent the correlation coefficients of the initial
variables with the two components considered.
Criteria for choosing the number of principal components
The efficiency of expressing the original variables through the principal components is closely related to
the degree of correlation of the original variables and, in particular, to the way in which these variables
are structured in terms of correlation. In relation to the degree of correlation of the original variables, one
can make an extremely interesting observation from the theoretical point of view and very useful from the
practical point of view. This observation refers to the fact that there is a strong link between the degree of
correlation of the original variables and the number of principal components by which the original
variables can be effectively re-expressed.
If, on the set of original variables, the existence of sub-sets of variables that are very strongly correlated
with each other, on the one hand, and very weakly correlated with the variables belonging to other sub-sets,
on the other, is clearly evident, then it can be asserted that the original variables can be re-expressed
sufficiently well by a number of principal components equal to the number of such sub-sets.
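
One possible way of drawing the circle of correlations with matplotlib is sketched below; the loadings used here are invented for illustration, whereas in practice they would be the correlations of the variables with the two retained components (the first two columns of the factor matrix). Variables whose arrows point in the same direction and reach close to the unit circle form the sub-sets discussed above and give the corresponding component its meaning.

import numpy as np
import matplotlib.pyplot as plt

# Illustrative correlations of four variables with components c1 and c2.
loadings = np.array([[0.90, 0.10], [0.85, -0.20], [-0.10, 0.80], [0.20, 0.75]])
names = ["x1", "x2", "x3", "x4"]

fig, ax = plt.subplots(figsize=(5, 5))
theta = np.linspace(0, 2 * np.pi, 200)
ax.plot(np.cos(theta), np.sin(theta), color="grey")      # unit circle
for (c1, c2), name in zip(loadings, names):
    ax.arrow(0, 0, c1, c2, head_width=0.02, length_includes_head=True)
    ax.text(c1, c2, name)
ax.axhline(0, color="grey", lw=0.5)
ax.axvline(0, color="grey", lw=0.5)
ax.set_aspect("equal")
ax.set_xlim(-1.1, 1.1)
ax.set_ylim(-1.1, 1.1)
ax.set_xlabel("component c1")
ax.set_ylabel("component c2")
ax.set_title("Circle of correlations")
plt.show()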

27. What is factorial analysis and what types of problems can be solved with its help

Factor analysis is a multivariate analysis that aims to explain the correlations manifested between a series
of variables, called indicators or tests, through a smaller number of ordered and uncorrelated factors,
called common factors. It is used to solve problems related to:
• studying the different levels of manifestation of the interdependencies between the explanatory
variables;
• detecting a simplified and clear structure of the interdependence relations existing between the
explanatory variables;
• obtaining a "cluster-izari", a classification of the explanatory variables through entities called factors;
• obtaining specific information, in the form of so-called factors, on the basis of which a synthetic
interpretation of causal relationships can be made;
• verification of hypotheses regarding the existence of a particular factorial structure or of the existence of
a certain number of common factors;
• synthesizing the common causal potential of several explanatory variables in the form of a small number
of factors.

28. General structure of the factorial analysis model

The process of factorial analysis includes the following essential steps:


• determining the minimum number of common factors by which the correlations between the indicator
variables can be explained optimally;
• performing rotations of the factors, in order to determine the factor solution in the simplest and clearest
form;
• estimating the intensities of the factors, the structure of the links, the communalities and the variances of
the unique factors;
• deduction of appropriate interpretations for common factors;
• estimating factor scores.
Of these, the problem that raises the most difficulties in performing this analysis is that of estimating the
intensity of the common factors.

29. Define and interpret the decomposition of variability in the context of factor analysis

Factor analysis aims to re-express the variability contained in the initial space in a differentiated way,
according to the role played in its formation by the common factors, on the one hand, and by the unique
factors, on the other.
By using multidimensional analysis techniques that aim to reduce dimensionality, the variability of the
n-dimensional causal space is preserved through the reduced variability induced by the common factors.
These, together with the unique factors, determine a space called the test space or factor space, whose
axes are pairwise orthogonal.
The variability that characterizes the two spaces is measured through variation or dispersion. In the data
analysis it is considered that a variable is the more significant the greater its variability.
In factor analysis, the variability of the initial causal space is considered to be a composition of
variability formed under the influence of the considered factors. From this point of view, the variance can
be divided into three important components: the communality component, the uniqueness component and
the residual or error component. Thus the variance of the indicator variable x_j is written: Var(x_j) =
Communality + Uniqueness + Residual variance. This relation defines the decomposition of the variance of
an indicator variable according to the variances of the three categories of factors that influence the
respective variable. Except for the third component of the decomposition, which is the variance of the
residual factor, the first two components cannot be assigned as the variances of the factors. They are
determined by the coefficients

that weight the variances of the factors, which means that they represent contributions of the variances of
the factors to the formation of the variance of the indicator variable.
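
In the usual notation of a factor model with p common factors, and writing h_j for the communality coefficient and u_j for the uniqueness coefficient (symbols introduced here only for illustration), the decomposition above can be written as:

\operatorname{Var}(x_j) \;=\; \underbrace{\sum_{k=1}^{p} \omega_{jk}^{2}}_{\text{Communality } h_j^{2}} \;+\; \underbrace{u_j^{2}}_{\text{Uniqueness}} \;+\; \underbrace{\sigma_{e_j}^{2}}_{\text{Residual variance}}

so that, for a standardized indicator variable, the three contributions sum to 1.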

30. What are factor scores, how they are calculated and how they are interpreted

A factor score is the value of an observation expressed in terms of a given factor, formed on the basis of
the contributions of the original variables.
The scores for a given factor are expressed as follows: f = F^t * x. In practice, expressing the T observations
made on the original variables in the form of factor scores is based on the following relation:

z_kj = Σ_(i=1..n) b_ki * x_ij,   k = 1, ..., p;  j = 1, ..., T

where z_kj represents the factor score, b_ki is the element in row k and column i of the transposed factor
matrix, and x_ij is the j-th observation made on the original variable x_i.
Considering the observation matrix X and the factor matrix F, the matrix Z defined by Z = F^t * X is called
the factor scores matrix, and it can be used in subsequent analyses instead of the original variables.
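
A minimal numpy sketch of the relation Z = F^t * X; the factor matrix F is generated randomly here purely for illustration, whereas in practice it results from the estimation of the factor model:

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 50))      # n = 4 original variables as rows, T = 50 observations as columns
F = rng.normal(size=(4, 2))       # illustrative factor matrix: n x p, with p = 2 retained factors

Z = F.T @ X                       # factor scores matrix: z_kj = sum_i b_ki * x_ij, with b = F^t
print(Z.shape)                    # (2, 50): the scores can replace the original variables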

31. Methods of estimating the factorial model

The use of factor analysis to solve specific problems also involves determining the number of common
factors that will be retained in the model. Although the decision to retain the number of factors is
subjective, there are a number of criteria:
o Percentage of coverage criterion
- The choice of the number of factors to be included in the factorial model depends on the proportion of
the common variability contained in the initial causal space that the user wishes to express through a
succession of common factors.
- An approximate estimate of this proportion, in case the number of retained factors is equal to k, can be
obtained with the formula:

p_k = (λ1 + λ2 + ... + λk) / (λ1 + λ2 + ... + λn)

where k is the number of factors retained by the model, n is the number of original variables and λi is the
eigenvalue with respect to which the common factor i is defined.
- The disadvantage is due to the fact that the quantity p_k shows the weight of the variance of the first k
principal components in the total variance, and not the weight of the variance explained by the first k
factors in the variance of the test space; this is a disadvantage because there is a difference of essence
between the principal components and the common factors.
o Kaiser's criterion
- Can be used when the factor analysis is performed on a correlation matrix, i.e., when the original
variables are assumed to be standardized. According to this criterion, the number of factors required to be
included in a factor analysis model is equal to the number of eigenvalues greater than or equal to 1.
- Its justification is that only those common factors whose variance is at least equal to the variance of the
original variables are important for the analysis; these variables, being standardized, have unit variance.
- It can be used only when working with standardized variables, and its disadvantage is that its application
tends to lead to the retention of too large a number of factors in the model.
o The criterion of "granulosity" (the scree test)
- According to this criterion, the number of factors that will be retained in the factor analysis model is
established based on a graphical analysis of the eigenvalues. The graph on which the analysis is made is
constructed by taking on the abscissa the order number of the eigenvalues and on the ordinate the values of
these eigenvalues. Thus, the graph will have the shape of a curve of negative exponential type, because
the eigenvalues are ordered by decreasing magnitude.
- The number of factors that will be retained in the model is determined by the point on the graph to the
right of which the slope of the curve becomes negligible, the order number of the eigenvalue
corresponding to this point determining the number of factors that will be retained.
- The disadvantage is that its application leads to the retention of too small a number of common factors;
in practice, however, the construction of a model with one or two common factors has the advantage that
it facilitates the graphical representation of the quantities resulting from the factor analysis, which is
useful in the interpretation phase.
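
The first two criteria can be sketched in a few lines of numpy; the eigenvalues below are invented for illustration and the 80% coverage target is an arbitrary choice:

import numpy as np

# Illustrative eigenvalues of a correlation matrix of n = 6 standardized variables.
eigvals = np.array([2.9, 1.4, 0.7, 0.5, 0.3, 0.2])

# Percentage-of-coverage criterion: smallest k whose cumulative share reaches the target.
coverage = np.cumsum(eigvals) / eigvals.sum()
k_coverage = int(np.argmax(coverage >= 0.80)) + 1

# Kaiser's criterion: keep the factors whose eigenvalue is at least 1.
k_kaiser = int(np.sum(eigvals >= 1.0))

print(coverage.round(2), k_coverage, k_kaiser)
# The "granulosity" (scree) criterion would instead plot eigvals against their rank
# and keep the factors to the left of the point where the slope becomes negligible.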

32. Define the recognition of forms and exemplify some of its applications in the economic-financial
field.

The theory of pattern (form) recognition can be defined as representing all the norms, principles, methods
and tools of analysis and decision used in order to identify the membership of forms or objects (units,
phenomena, events, actions, processes, etc.) in certain classes with a determined individuality.
It can be said that the recognition of forms sums up all the attempts to construct those models that
simulate the way in which man quantifies, analyzes, interprets and anticipates the evolutionary behavior
of phenomena and processes. From the point of view of systems theory, the recognition of the forms can
be considered as a general system in which the inputs represent the set of the characteristics of the objects
to be classified, the outputs represent the set of possible classes to which the analyzed objects can belong,
and the transfer function expresses the decision mechanism whereby a particular object is identified as
belonging to a particular class.
In the economic-social field, the theory of the recognition of the forms finds a wide use especially in the
process of data analysis and in the activity of prediction. The problem of classifying a set of objects is a
standard problem, commonly encountered in socio-economic investigation, and its approach involves the
use of methods and techniques specific to the theory of shape recognition.
Numerous problems in the field of data analysis, starting with those related to identifying the defining
characteristics of the most diverse categories of phenomena and ending with those related to the
functional delimitation, the structural hierarchy or the informational synthesis of sets of economic-social
phenomena and processes, can be approached with such techniques. The methods and techniques belonging
to the theory of shape recognition are irreplaceable in analyses that operate with large amounts of
information, where the need to essentialize and synthesize interdependencies involves a continuous process
of classification and structuring of information.
An even wider use of pattern recognition theory is found in the field of predictions. The activity of
making predictions can be regarded as a process whose characteristics are very close, even going so far as
to identify, the specific characteristics of a process of pattern recognition. The evaluation of the states that
a phenomenon belonging to a given reality may have in the future represents, in fact, a process of
recognizing those forms of evolution of the phenomenon that are most likely to occur. Moreover, both in
the activity of prediction and in the process of classification or recognition of forms, the approaches have
a predominantly probabilistic nature. On the other hand, the problem of the recognition of forms is itself a
problem of prediction in which, starting from certain characteristics of the analyzed objects, called forms,
predictions are made regarding the belonging of these objects to certain
classes. Moreover, establishing the membership of the forms in certain classes is the main purpose of
using the techniques for the recognition of forms.
Currently, the most modern methods and techniques in the field of prediction are those based on a new
class of models, specific to the outline of a new approach in the field of shape recognition theory, called
neural networks. The methods of scientific approach based on neural networks are more consistent with
the pronounced complexity and unpredictability that characterize the behavior of economic-social
phenomena and processes and offer a number of important advantages, compared to other methods and
techniques used for the same purpose.

Form recognition techniques can be used in the socio-economic field to solve problems such as: analysis
of data with high degree of heterogeneity, substantiation of the criteria for choosing development
projects, classification of decisions according to their impact on different compartments of economic-social
life, detection of specific periods from the evolution of some economic systems, establishing lending
policies in the financial-banking field, evaluating the efficiency of the promotion activities of some
products, determining the most appropriate periods for selling certain types of goods, identifying the most
profitable business areas, classification and hierarchy of economic-social entities etc.

33. Define the main concepts of shape recognition

Of the many concepts used in the theory of form recognition, three can be considered as fundamental and
defining for the essence and purposes of the theory of form recognition: form, class and classifier. The
form represents the numerical expression of the object studied in order to classify it in a certain class and
is the result of quantifying the main characteristics possessed by that object. The form or object is an
individual information entity, characterized by a n-dimensional vector, whose components define the
values of its characteristics, and which is the object of the classification or prediction process.
One of the fundamental assumptions on which the theory of shape recognition is based is that the
analyzed objects are characterized by a certain degree of heterogeneity. This means that, by default, the
existence of the possibility of defining distinct classes on the set of objects is assumed. On the other hand,
it is assumed that certain objects belonging to the analyzed set have something in common, are
characterized by a certain degree of homogeneity. The two requirements imposed on the set of analyzed
objects are known as similarity and dissimilarity.
The class, group or cluster represents a distinct subset of objects that verify the following two properties:
the objects that make up a class are homogeneous in terms of their defining characteristics; two objects
between which there are significant differences in terms of defining characteristics belong to different
classes. The class, group or cluster represents a distinct information entity with concrete meaning,
consisting of all the objects whose characteristics are identical or differ very little and which are
significantly different from the characteristics of the objects of other classes or groups.
The number of classes that make up the output set of a system for pattern recognition varies depending on
the specificity of the domain for which this system is used and the purposes pursued.
The classifier is a statistical-mathematical model that, based on the information about the characteristics
of a particular object, determines the decision to classify the object in a certain class. The classifier can be
regarded as the set of principles, rules or criteria, depending on which the analyzed objects are assigned to
one class or another. The classifier or classification criterion represents the rule or set of rules on the basis
of which the objects belonging to the analyzed set are affected or assigned to well defined classes or
groups. Depending on the nature of the rules used in the classification process, there are several
categories of classifiers: hierarchical classifiers, minimum cost classifiers, minimal distance classifiers,
Bayesian classifiers, heuristic classifiers, etc.

34. Formulate the general problem of classification

In its most general form, the classification problem can be formulated in terms of decision theory, and the
classification methods can be defined in the form of specific decision tools. We will describe below how
the classification problem can be defined as a decision problem. For this purpose, we will assume the
existence of a population of shapes or objects, denoted by Ω and defined as:

Ω = {ω1, ω2, ..., ωM}

where M represents the number of units of the analyzed population.


Each object that makes up the population Ω is defined by a number of features, called explanatory
variables. In this way, each object can be represented as an N-dimensional vector:

x = (x1, x2, ..., xN)^t
The explanatory variables, which define the characteristics of the analyzed objects, are the sizes
according to which the belonging of an object from the population to one of its classes is established.
Explanatory variables can be qualitative or quantitative. They can be measured on the four known scales,
namely nominal, ordinal, interval or ratio.
Of the elements that represent the explanatory variables, some may have lower discrimination power, and
others may have higher discrimination power. From this point of view, in the construction of the
classification algorithms, those variables that have the highest discrimination power must be selected. The
variables with the highest discrimination power, define those characteristics of the objects that allow a
stronger differentiation of the classes in which the respective objects can be grouped, and are called
descriptor variables. For a given object, the vector of values of the descriptor variables is precisely the form
associated with that object.
In relation to a manifestation or a future action, the elements of the population can be found in one of
several potential states, called states of nature. The states of nature represent physical, economic or social
conjunctions, in relation to which the set of analyzed objects is structured in the form of well-
individualized categories. The states of nature are characterized by completeness and mutual exclusivity.
This means that apart from these states of nature, there can be no other possible state of nature,
respectively, that two different states of nature can never manifest simultaneously.
The main feature of a classification problem is that, although the possible states of nature are known a
priori, in terms of their number, nature and plausibility of manifestation, and although each element of the
population is certainly found in one, and only one, of these states, it is usually not known precisely and a
priori in which of the states of nature each of the population units is found.
The general problem of classification: given a set of objects, it is required to determine the criterion or
rule that describes the membership of the objects in the classes in the form of which the respective set of
objects is structured.
Depending on the prior knowledge or ignorance of belonging to the classes of objects belonging to the
sample extracted from the population, the classification methods are divided into two broad categories:
controlled classification and uncontrolled classification. Once the classification criterion has been
established, it can still be used to make predictions regarding the belonging to a certain class of new
objects, outside the existing sample, objects whose membership is not known a priori. After the
classification criterion has been identified, and provided that the class membership of the objects in the
available sample is known, it can also be used to verify the correctness of the classification it produces,
i.e., to test the quality of the classifier. The way in which a classifier ensures the
classification of objects with known membership can be described by means of a matrix, called the
classification correctness matrix or, more simply, the classification matrix, which contains the
information needed to assess the correctness of the classification of objects.
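
A minimal sketch of how such a classification (confusion) matrix can be built and read; the class labels are invented, rows correspond to the known classes and columns to the classes assigned by the classifier:

import numpy as np

actual    = np.array([0, 0, 1, 1, 1, 2, 2, 0, 1, 2])   # known memberships
predicted = np.array([0, 1, 1, 1, 0, 2, 2, 0, 1, 1])   # classes assigned by the classifier

n_classes = 3
confusion = np.zeros((n_classes, n_classes), dtype=int)
for a, p in zip(actual, predicted):
    confusion[a, p] += 1                # row = true class, column = assigned class

accuracy = np.trace(confusion) / confusion.sum()
print(confusion)
print(f"correctly classified: {accuracy:.0%}")          # share of objects on the diagonal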

35. Define controlled and uncontrolled recognition systems

All the activities involved in a process of pattern recognition, all the information manipulated in this
context and the multitude of procedures, algorithms, methods and techniques used for this purpose, are
regarded as representing a system, called a system of pattern recognition.
There are two fundamental types of shape recognition systems: uncontrolled recognition systems and
controlled recognition systems. These two types of forms recognition systems are determined by the goals
pursued, the nature of the information they process, the specificity of the methods and instruments used,
as well as the nature of the results obtained with their help.
• The system of uncontrolled recognition
The systems of uncontrolled recognition of the forms are the systems in which there is no initial
information regarding the number of classes and the belonging of the forms to certain classes, the
construction of the classes being made progressively, as the number of analyzed forms increases, and the
number of possible classes being established only in the final phase of the recognition process.
The main characteristic of the systems of uncontrolled recognition of the forms consists in the fact that
the belonging of the analyzed objects to one class or another is not known. This means that, by default,
the number of classes is not known precisely.
The principles, procedures, methods and techniques belonging to the systems of uncontrolled recognition
of forms are known under the general name of techniques of classification, unsupervised classification or
cluster analysis.
Cluster analysis is a classification technique characterized by the fact that the assignment of the forms or
objects to clusters or groups is done progressively and without knowing the number of classes a priori,
subject to the verification of two fundamental criteria:
a. the objects or forms classified in each class should be as similar as possible in terms of certain
characteristics;
b. the objects classified in one class differ as much as possible from the objects classified in any of the
other classes.
The first criterion for assigning the shapes to classes requires that each class be as homogeneous as
possible with respect to the characteristics considered for the classification of objects. The second criterion
requires that the classes differ from one another as much as possible in terms of the classification characteristics. Depending on
the characteristics of the procedures they use, the initial assumptions they are based on and the nature of
the results obtained with their help, the cluster analysis methods fall into two broad categories:
hierarchical clustering methods and partitioning (iterative) classification methods. The first
category includes clustering methods by aggregation and clustering methods by division. For each of the
two types of clustering there are several specific procedures, among which we mention: the simple
aggregation method, the complete aggregation method, the average aggregation method, Ward's method, etc. The
second category includes a series of algorithms, among which we mention: the k-means algorithm, the
k-medoids algorithm, the CLARA algorithm, the fuzzy clustering algorithm, etc.
Regarding the results provided by the systems of uncontrolled recognition of forms, we specify that
the outputs of these systems are not reduced, as a rule, to a single and simple configuration of the analyzed
objects into classes, but include several variants of configuring the objects into classes, variants
contained in an informational entity called a cluster structure or cluster hierarchy. The cluster hierarchy allows
the researcher to choose a certain configuration of objects on classes, which means, implicitly, the choice
of a certain number of classes.
The systems of uncontrolled recognition are used more for the purposes of systematization, grouping and
informational synthesis, in the situations in which very large amounts of data are analyzed and these data
are characterized by a high degree of heterogeneity. In this sense, the techniques of uncontrolled
recognition of the forms are very useful and efficient in the activities of preliminary data analysis. The
use of cluster analysis in this phase of data analysis is important because it allows a
more efficient organization of heterogeneous data. The retrieval of information within the data set
structured using cluster analysis techniques becomes much easier, and the data can be interpreted more
consistently.
• Controlled recognition system
The systems of controlled recognition of forms are those systems in which the existence of a given number
of classes is assumed a priori, together with a set of forms, called prototypes or references, whose
membership in these classes is known. This set of shapes is represented by the sample of objects extracted from the
population under study, also known as a training set or a learning set.
The training set or the learning set is a sample of forms extracted from the studied population, forms
whose membership in the population classes is known and based on which the formal classification
criteria are deduced. Within the systems of controlled recognition of the forms, the data represented by
the training set include both information regarding the essential properties of the objects under analysis,
as well as information regarding the belonging of these objects to the existing classes. Based on this
initial information, the rules and the decision criteria for partitioning in the form of regions or classes of
the set of objects subject to the study or the space in which the characteristics of the objects take values
are deduced. In fact, in the case of such techniques, the information contained in the training set is used to
make inferences about dividing the total population into classes. Moreover, the application of the
controlled classification techniques results in a set of formal classification rules and criteria, ie a
classifier. These rules and criteria are then used to classify new forms that are not yet classified, whose
membership is unknown, that is, to make predictions about the membership of the new forms.
Usually, the initial set of forms is divided into two subsets used for different purposes: the first subset is
called the training set and contains those forms used to deduce the classification rules and criteria, ie to
construct the actual classifier; the second subset is called the prediction set and contains those forms used
to test the classifier built on the basis of the training set.
The system of controlled recognition of the forms represents the totality of the activities and the
procedures that have as purpose the deduction of criteria of sharing of a population of informational
entities (objects or variables), in the form of a known number of classes, based on the knowledge of the
characteristics and of the belonging of the elements of a sample originated from the respective population.
Unlike uncontrolled classification techniques, which are mainly based on the use of the concept of distance,
the fundamental element of controlled classification techniques is a formal model, called a classifier. In the
case of discriminant analysis, the classifier is represented by the discriminant functions or the
classification functions.

36. What is cluster analysis, what are its fundamental concepts and what are its areas of use?

Cluster analysis aims to search for and identify classes, groups or clusters within sets of objects or shapes,
so that the elements belonging to the same class are as similar as possible, and the elements belonging to
different classes are as different as possible. In other words, cluster analysis is a way of examining the
similarities and dissimilarities between objects belonging to a certain set, in order to group these objects
in the form of distinct and internally homogeneous classes.
General classification criterion: Classification of objects in classes is done in such a way as to ensure a
minimum variability within the classes and a maximum variability between classes. Through cluster
analysis each object in the analyzed set is assigned to a single class, and the set of classes is a discrete and
unorderable set.
The classes or groups in the form of which the sets of objects are structured are also called clusters. A
cluster is a subset consisting of similar objects, that is, of objects that are sufficiently similar to each other
in terms of their defining characteristics.
From a geometric point of view, viewed as sets of points in a given space, clusters can have very different
shapes, more or less regular. Thus, the shape of the clusters can be convex or concave, compact or
elongated. As a rule, cluster analyses are uncontrolled classification procedures, in which neither
the belonging of certain objects to certain classes nor the number of possible classes is known a priori.
The number of classes or clusters is variable and is established concurrently with the actual classification
activity.
Cluster analysis can be defined as representing a multitude of principles, methods and classification
algorithms, aiming to organize the data in the form of significant, relevant information structures. Cluster
analysis is an exploratory analysis, of a multidimensional type, which aims to group informational
entities, with physical or abstract nature, into classes or clusters made up of informational entities with a
high degree of similarity. From a concrete point of view, performing a classification using the methods
and techniques of cluster analysis consists in obtaining cluster solutions or partitions, represented by a
multitude of classes or clusters denoted by ω1, ω2, ..., ωk. For certain classification methods, the
classification results are represented by unique cluster solutions, while in the case of other classification
methods, such as agglomerative hierarchical classification methods, they are represented by sets of cluster
solutions, called cluster solution hierarchies or hierarchies of partitions.

In cluster analysis, the cluster hierarchies are made up of a number of T cluster solutions, each solution
containing larger and larger clusters, that is, clusters with increasing aggregation levels. A cluster
hierarchy has a structure of the following form:

H = {P0, P1, ..., P(T-1)}
where T is the number of objects, and Ki is the number of clusters in the cluster solution at level i.
In the case of agglomerative hierarchical methods, the number of clusters in the first partition is equal to
the number of objects, i.e., K0 = T. Also, the number of clusters in the partition at a given level is smaller
by 1 than the number of clusters in the partition at the lower level and greater by 1 than the number of
clusters in the partition at the higher level, respectively:

Ki = K(i-1) - 1 = K(i+1) + 1
Although the use of cluster analysis techniques is not specific to particular fields of activity, their most
frequent use is found in the field of marketing, in psychosocial investigations or in macro-social
assessments at the territorial level. In the field of marketing, the applications of cluster analysis
techniques to the study of consumer behavior stand out. These applications are aimed at evaluating the
chances of launching a new product, identifying new markets, ways of segmenting the market or
identifying the positioning of the products of different producers on the market. In determining the
positioning of different brands of a product on the market, the cluster analysis is used to classify the
brands of manufacture, according to the similarity or dissimilarity of the perceptions that consumers have
towards these brands. Based on the way the brands are classified and the characteristics of the consumers
who express their preferences, a manufacturer can identify the competing brands and specific features of
the categories of consumers who prefer the product of this manufacturer.

37. Define the purposes of cluster analysis and describe the type of information used in cluster
analysis

Cluster analysis is fundamentally different from statistical procedures, such as those aimed at verifying
significance, in that it is not based and does not imply a priori fulfillment of any specific hypothesis.
Therefore, by its very essence, cluster analysis is an important and efficient tool for exploratory analysis.
It can be said that the general purpose of cluster analysis is to create so-called taxonomies or typologies.
The construction of the typologies is based on the analysis of the similarities and differences existing
between the objects of a large data set, and involves:
• choosing an optimal number of clusters, depending on the nature of the classification problem and the
goals pursued;
• interpretation of the significance of clusters;
The results of a cluster analysis are represented either by a single cluster solution or by the cluster
hierarchies, which contain different ways of configuring objects on classes, that is, several cluster
solutions.
Cluster analysis can generally be regarded as a tool aimed at reducing sets of objects, or even variables, to
a smaller number of information entities, which are classes or clusters. In its usual sense, as a set of
methods and techniques of object classification, cluster analysis is an analysis performed in the space of
variables. Indeed, most uses of cluster analysis techniques are those that aim to classify objects, and not to
classify variables. Cluster analysis can be used both for classifying objects and for classifying variables
that define objects. Unlike the use of cluster analysis to classify objects, where the specificity is

represented by the fact that distances are evaluated for pairs of objects, in the case of using cluster
analysis for the classification of variables, the evaluation of distances is done for pairs of variables.

38. Define cluster analysis and show how to classify cluster analysis methods

Cluster analysis can be defined as representing a multitude of principles, methods and classification
algorithms, aiming to organize the data in the form of significant, relevant information structures.
The most important problem of any type of cluster analysis is that of how proximity, degree of closeness
or degree of distance, between objects and clusters can be measured. Any process of object classification
is defined in relation to a certain measure of the degree of approximation or separation between the
analyzed objects, regardless of the method or algorithm on which this process is based. This measure can
be represented by either an indicator of similarity or an indicator of dissimilarity. Each of the two
categories of indicators will be defined and analyzed further.
In general, the measurement of the degree of proximity between objects is done using two groups of
indicators, known as similarity indicators and dissimilarity indicators. The similarity and dissimilarity
indicators can be used as the informational basis in any classification process due to the fact that they can
induce an order relation on the set of pairs of objects or variables and, consequently, they can contribute to
the classification of objects or variables. The higher the value of a similarity indicator, the more similar or
closer the objects or variables for which this indicator is evaluated can be considered to be. Also, a very
small value of the similarity indicator points out that the two objects or the two variables are farther apart.
Dissimilarity indicators are numeric quantities that express how different or how far apart two objects or
two variables are. The dissimilarity indicators are also called indicators or coefficients for distinguishing
or distancing objects or variables. The higher the value of a dissimilarity indicator, the more the two
objects or the two variables for which they are calculated are different, that is, the more distanced
between them. The most important and most widely used category of dissimilarity indicators is the
distance type indicators. Unlike similarity indicators, which can be best used to express the degree of
closeness between objects with qualitative features, dissimilarity indicators are more appropriate for
measuring proximity for objects with quantitative characteristics.
The information ultimately used in cluster analysis is represented in the form of symmetric object × object
matrices, called, as the case may be, proximity matrices, similarity matrices, association matrices,
incidence matrices, dissimilarity matrices or distance matrices. Both the rows and the columns of matrices
of this type refer to the analyzed objects, so their number is equal to the number of objects under analysis.
The elements of these matrices are numerical quantities that express the proximity between the pairs of
objects that label the rows and columns of the matrices.

39. Define the concept of distance and describe some ways of evaluating distances between forms

Dissimilarity indicators are numeric quantities that express how different or how far apart two objects or
two variables are. The dissimilarity indicators are also called indicators or coefficients for distinguishing
or distancing objects or variables. The higher the value of a dissimilarity indicator, the more the two
objects or the two variables for which they are calculated are different, that is, the more distanced
between them. The most important and most widely used category of dissimilarity indicators is the
distance type indicators.
By their numerical nature, quantitative variables, that is, the variables measured on ratio, interval and,
possibly, ordinal scales, allow a more natural definition of the concept of distance. For nominal variables,
including binary variables, distances are calculated in a specific way, compatible with the nature of these
variables. For the evaluation of dissimilarities between objects whose characteristics are of quantitative
type, or between quantitative variables, several types of distances can be used, such as: the Euclidean
distance (simple, weighted or squared), the Manhattan distance, the Chebyshev distance, the Minkowski
distance, the Canberra distance, the Mahalanobis distance, the Pearson distance, the Jambu distance, etc.
• Euclidean Distance
The Euclidean distance, also known as the distance of norm type L2, is the most commonly used distance
in cluster analysis problems. It is calculated as the square root of the sum of the squares of the differences
between the coordinates of the two objects or variables for which the distance is evaluated.
The Euclidean distance expresses the proximity between objects as the distance between two points in
Euclidean space, that is, as the distance measured in a straight line.
• Manhattan Distance
The Manhattan distance, also called the rectangular distance, the "City-Block" distance or the distance of
norm type L1, is calculated as the sum of the absolute values of the differences between the coordinates of
the two objects or of the two variables analyzed. Because the differences are not raised to a power, the
Manhattan distance is more robust to the presence of aberrant (outlier) values in the data. The Manhattan
distance can also be calculated in a weighted version, the calculation being done in a similar way to the
weighted Euclidean distance. Also, the Manhattan distance can be used if the objects have characteristics
that are measured on the interval scale or the ratio scale.
• Chebyshev Distance
The Chebyshev distance, also known as the "maximum dimension" distance or the distance of norm type
L-infinity, is an absolute-value type distance and is determined as the maximum of the absolute values of
the differences between the coordinates of the objects or variables. The Chebyshev distance can be used
when one wants two objects or variables to appear as different if they differ even with respect to a single
characteristic or a single object, respectively.
• Mahalanobis Distance
The Mahalanobis distance is one of the best known, most important and most frequently used distances. It
is a generalized form of the concept of distance. The Mahalanobis distance is the only type of distance that
takes into account, in a complete way, both the degree of dispersion of the set of objects or of the set of
variables analyzed and the degree of correlation of the respective information entities. The use of the
Mahalanobis distance is recommended especially in situations where the variables that describe the
objects are correlated with each other. The Mahalanobis distance is also used in the case of controlled
classification techniques, an operational discrimination criterion having even been developed on the basis
of this distance.
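
A short numpy sketch of the distances discussed above, computed for two illustrative objects x and y; for the Mahalanobis distance the covariance matrix is estimated from an invented sample:

import numpy as np

x = np.array([2.0, 7.0, 4.0])
y = np.array([5.0, 3.0, 4.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))     # sqrt(9 + 16 + 0) = 5.0
manhattan = np.sum(np.abs(x - y))             # 3 + 4 + 0 = 7.0
chebyshev = np.max(np.abs(x - y))             # max(3, 4, 0) = 4.0

rng = np.random.default_rng(3)
sample = rng.normal(size=(100, 3))            # illustrative sample used to estimate the covariance
S_inv = np.linalg.inv(np.cov(sample, rowvar=False))
mahalanobis = np.sqrt((x - y) @ S_inv @ (x - y))

print(euclidean, manhattan, chebyshev, round(mahalanobis, 3))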

40 & 41. Formulate the general classification criterion and show how the inter- and intra-cluster
variability is evaluated (the uni / multi-dimensional case)

General classification criterion: Classification of objects in classes is done in such a way as to ensure a
minimum variability within the classes (intra) and a maximum variability between classes (inter)
The distance between two clusters is, in fact, a distance between two sets of points, which is a more
difficult distance to evaluate. As the distance between two sets of points, the distance between two
clusters can be measured using one of several possible methods. Among the proposed methods for
evaluating the distances between clusters we mention: the nearest neighbors method, the farthest
neighbors method, the average distance between pairs method, the centroid method and Ward's method.
The nearest neighbors method evaluates the distance between two clusters as the distance between two
objects, one from the first cluster, and the other from the second cluster, which are closest to each other in
terms of the distance used.
The farthest neighbors method evaluates the distance between two clusters as the distance between two
objects, one from the first cluster and the other from the second cluster, which are the most distant from
each other in the sense of the distance used.
The mean distance between pairs method evaluates the distance between two clusters as the average of
the distances between any two objects belonging to the two clusters, one to the first cluster and the other
to the second cluster.

The centroid method evaluates the distance between two clusters as the distance between the centroids of
the two clusters.
Ward's method is a method of evaluating the distance between two clusters that is based on maximizing
the degree of homogeneity of the clusters or, equivalently, on minimizing the intra-cluster variability. As a
rule, the degree of homogeneity of a cluster is considered to be higher the smaller the total sum of squares
of the intra-cluster deviations. In this sense, it can be said that the Ward distance between two clusters
measures the additional intra-cluster variability induced by merging the two clusters in the resulting
cluster configuration.

42. Evaluation of distances between clusters

A difficult problem that appears in cluster analysis is related to the need to evaluate distances between
classes or clusters. The difficulty of this problem is due to the fact that the distances between classes or
clusters are, in fact, distances between sets of objects or distances between sets of variables.
The problem of evaluating distances between clusters arises especially in the case of hierarchical cluster
analysis, in which the construction of the cluster tree is done on the basis of successive aggregation or
successive division of clusters. The merging of clusters is called amalgamation or aggregation, and the
division of clusters is called disaggregation.
Theoretically, the process of successive aggregation or disaggregation of clusters is based on the
definition of a boundary distance between clusters, called the threshold distance: the aggregation threshold
and the disaggregation threshold, respectively. In principle, the decision to merge two clusters or to divide
a cluster is made only if the distance between these clusters is smaller, respectively greater, than the fixed
boundary distance.
If, in the case of evaluating the degree of closeness or distance between two objects, things are relatively
simple, it being sufficient to calculate one of the aforementioned distances, when it is necessary to evaluate
the degree of closeness or distance between two clusters things become a little more complicated and
involve the existence of a specific evaluation method. The distance between two clusters is, in fact, a
distance between two sets of points, that is, a distance that is more difficult to evaluate.
As the distance between two sets of points, the distance between two clusters can be measured by one of
several possible methods.
Among the proposed methods for evaluating the distances between clusters we mention: the nearest
neighbors method, the farthest neighbors method, the average distance between pairs method, the centroid
method and Ward's method.
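
The following sketch evaluates the distance between two illustrative clusters with the nearest-neighbors, farthest-neighbors, average-distance and centroid methods (Ward's method is illustrated separately under question 48):

import numpy as np

cluster_a = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])
cluster_b = np.array([[4.0, 4.0], [5.0, 4.5], [4.5, 5.5]])

# Matrix of pairwise Euclidean distances between the objects of the two clusters.
diff = cluster_a[:, None, :] - cluster_b[None, :, :]
pair_dist = np.sqrt((diff ** 2).sum(axis=2))

single   = pair_dist.min()     # nearest neighbors
complete = pair_dist.max()     # farthest neighbors
average  = pair_dist.mean()    # average distance between pairs
centroid = np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0))

print(round(single, 3), round(complete, 3), round(average, 3), round(centroid, 3))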

43. Hierarchical-type cluster analysis

Hierarchical or arborescent cluster analysis is a classification method based on grouping objects by
successive aggregation into larger and larger classes of objects or by successive disaggregation into
smaller and smaller classes.
Cluster analysis implies a tree organization in groups or classes of latent structures. This organization is
done with heuristic algorithms or formal model algorithms that generate cluster structures based on
maximizing likelihood.
The result of using hierarchical cluster analysis is represented by a set of particular cluster structures,
called a classification tree or hierarchical tree.
Hierarchical cluster structures are characterized by different levels of aggregation, ranging from a
minimum to a maximum level. As the hierarchical level increases, the number of clusters decreases.
The structure of the cluster with the highest level of aggregation consists of a single cluster, which
includes all objects subject to classification. The cluster structure with the lowest level of aggregation
consists of a number of clusters equal to the number of objects to be analyzed, each cluster including a
single object.

The higher the aggregation level of the cluster structures, the more the similarities between the objects of
a cluster decrease, since a cluster at a higher level contains more objects than a cluster at a lower level.
Hierarchical classification algorithms can be divided into two broad categories:
• classification algorithms by aggregation, amalgamation or combination;
• classification algorithms by disaggregation or division.
The disaggregation algorithms construct the clusters in a descending manner, starting with all the objects
in a single cluster and continuing, through its successive division, until clusters containing a single object
are obtained.
Aggregation or amalgamation algorithms build clusters in an ascending manner, starting from clusters
containing a single object and continuing, through successive merging, to a cluster that includes all
objects.
In the case of aggregative classification procedures, in each step two entities are combined: either two
objects, or an object and a cluster, or two different clusters. At each stage of the divisive procedures, a
cluster is divided into two sub-clusters: either in the form of two clusters, or in the form of a cluster and an
object, or in the form of two objects.
The number of steps required to obtain a hierarchical cluster solution depends on the number of objects
subject to classification and is different for the two categories of hierarchical classification methods.
The processes of aggregation and disaggregation of clusters, specific to the two categories of hierarchical
classification procedures, require the use of specific methods for evaluating the distances between
clusters.

44. Simple aggregation method

In cluster analysis based on simple aggregation, the assignment of an object to a cluster is done only if that
object has a sufficient degree of similarity to at least one of the objects that already belong to the cluster.
Clustering of this type is also called minimum distance cluster analysis or MIN cluster analysis.
The simple aggregation method is based on expressing the proximity between two clusters through the
distance between the closest objects in the two clusters. The evaluation of this distance is done using the
method of the nearest neighbors.
The simple aggregation method is a method of hierarchical classification of ascendant type, which
combines in each stage of classification those two clusters for which the distance between the closest
neighbors is the smallest, compared to other pairs of clusters.

In the illustrated example, the smallest distance between the nearest neighbors among the three possible
pairs of clusters is the distance corresponding to the pair (cluster1, cluster2). As a result, cluster1 will be
merged with cluster2, resulting in a new cluster that will contain the objects of the two clusters.

45. The method of complete aggregation

This clustering method is similar to the simple aggregation method, except that the aggregation of two
clusters is done based on an aggregation distance that is the distance between the farthest objects in those
clusters. Clustering of this type is also called maximum distance cluster analysis or MAX cluster
analysis.
In the case of the complete aggregation method, the evaluation of the distances between clusters is done
using the method of the most distant neighbors. This means that the distance between two clusters is
considered to be in this case the largest distance of any two points belonging to the two clusters.
The method of complete aggregation is a method of ascending hierarchical classification, which combines
in each stage of classification those two clusters for which the distance between the most distant objects is
the smallest, in comparison with other pairs of clusters.

Figure 8.10: Clustering by the complete aggregation method

46. The method of average aggregation

The average aggregation method is a clustering method similar to the two methods mentioned above, with
the difference that the evaluation of the distance between two clusters is considered to be the average of
the distances that separate the objects belonging to the two clusters.
The aggregation of clusters using the average aggregation method is made based on the determination of
an average degree of connectivity between clusters, a degree evaluated as the average distance
corresponding to a pair of objects, the first object belonging to one cluster and the second object
belonging to the other cluster.
The method of average aggregation is a method of ascending hierarchical classification, which combines
in each stage of the classification those two clusters for which the average distance between all pairs
formed with objects in the two clusters is the smallest, in comparison with other pairs of clusters.

47. Centroid method

The centroid method is a method of ascending hierarchical classification, in which the distances between
clusters are evaluated using the centroid method. The basic idea of the centroid method is to obtain a new
cluster by combining the two existing clusters whose centers are separated by the smallest distance among
all the pairs of clusters examined for merging.
Definition: The centroid method is a method of hierarchical classification of ascending type, which
combines in each stage of classification those two clusters for which the distance between the centroids
of the two clusters is the smallest, compared to other pairs of clusters.
Two clusters are grouped into a new cluster if and only if the distance between their centroids is the
smallest of all distances between the centroids of any two clusters belonging to the available cluster
configuration. This is illustrated in the following figure.

Figure 8.11: Illustration of the centroid method

48. Ward's method

Ward's method, also known as the minimal intraclass variance method, is one of the best known and most
efficient methods of hierarchical classification by aggregation.
An object can be assigned to a cluster only if this assignment minimizes the sum of the elements on the
diagonal of the pooled within-cluster covariance matrix.
Ward's method is a method of evaluating the distance between two clusters that is based on maximizing
the degree of homogeneity of the clusters.
Definition: Ward's method is a method of ascending hierarchical classification, which combines in each
stage of classification those two clusters for which the sum of squares of deviations in the cluster resulting
from clustering is the smallest, compared to other pairs of clusters.
Ward's method is not a proper method of calculating distances between clusters, but a method of cluster
formation based on maximizing the degree of cluster homogeneity.
As a measure of the degree of homogeneity of the clusters, the sum of squared deviations is used, called
the sum of squares of intra-class deviations. The degree of homogeneity of a cluster is considered to be
higher the smaller this sum of squares of intra-class deviations.
The Ward distance is evaluated for all possible ways of merging any two clusters of the current
configuration into a single cluster.
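
A minimal sketch of the Ward criterion: for each candidate pair of clusters, the increase in the total sum of squares of intra-class deviations caused by the merge is computed, and the pair with the smallest increase is the one that would be merged (the clusters and helper functions are invented for illustration):

import numpy as np

def within_ss(cluster):
    # Sum of squared deviations of the objects from the cluster centroid.
    return ((cluster - cluster.mean(axis=0)) ** 2).sum()

def ward_cost(a, b):
    # Increase in total within-cluster sum of squares caused by merging a and b.
    return within_ss(np.vstack([a, b])) - within_ss(a) - within_ss(b)

c1 = np.array([[0.0, 0.0], [1.0, 1.0]])
c2 = np.array([[1.5, 0.5], [2.0, 1.5]])
c3 = np.array([[8.0, 8.0], [9.0, 9.5]])

costs = {"c1+c2": ward_cost(c1, c2), "c1+c3": ward_cost(c1, c3), "c2+c3": ward_cost(c2, c3)}
print(min(costs, key=costs.get), costs)      # the pair with the smallest cost is merged first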

49. Algorithm k-means

Partitional clustering is totally different from hierarchical clustering. The main difference is that the
number of clusters is known in advance, and the algorithm tries to build that previously established number
of clusters so that they are as different from each other as possible.
The k-means algorithm assumes the arbitrary choice of an integer k and then the organization of the
objects into k clusters, aiming at the maximization of the inter-cluster variance and the minimization of the
intra-cluster variance. Each cluster has a center (centroid) that represents the average of all data points in
the cluster.
The steps of the k-means algorithm:
1. Randomly select k points as cluster centers.
2. Determine the distance between each object and each centroid.
3. Assign each instance/object to the cluster whose center is closest to it, with respect to a chosen measure of similarity.
4. After all objects have been assigned, recalculate the position of each of the k centroids (the average of all instances in each cluster).

5. Repeat steps 2–4 until a certain stopping criterion is met.
The stopping (convergence) criterion can be one of the following:
- no re-allocation (or only minimal re-allocation) of the data points to different clusters;
- no change (or only minimal change) of the centroids.
Note 1: Although, in principle, the k-means clustering method produces exactly k clusters that divide the initial set of objects as distinctly as possible, the problem of estimating the optimal number of clusters leading to the best separation of objects remains open.
Note 2: A major problem of the k-means method is that, because the average has to be computed, it can be applied only to numerical data (and not, for example, to categorical data).
Advantages and disadvantages of k-means:
      • (−) Sensitive to extreme values (outliers) and to irrelevant attributes;
      • (+) Applicable to large and very large data sets;
      • (−) It requires the prior choice of the number of clusters, which sometimes proves difficult to establish.
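A minimal NumPy sketch of the steps listed above, on randomly generated data and with k chosen arbitrarily (in practice an existing implementation, such as scikit-learn's KMeans, would normally be used):

import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    # Naive k-means: random initial centers, then alternate assignment and update.
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centers = X[rng.choice(len(X), size=k, replace=False)]        # step 1
    for _ in range(n_iter):
        # steps 2-3: distance of every object to every centroid, nearest-center assignment
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # step 4: recompute each centroid as the mean of the objects assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # stopping criterion: centroids no longer move
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centers = k_means(X, k=2)
print(centers)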

50. Dendrogram (hierarchical classification tree)

The dendrogram is a tree-like diagram used to illustrate the hierarchical organization of clusters. The dendrogram is built by applying the algorithms that organize the clusters into groups, between a minimum level and a maximum level. The number of steps required for such an organization depends on the number of clusters; as we progress up the hierarchy, the similarity between clusters decreases and the number of objects they contain increases.
By cutting the dendrogram horizontally, a partition of the set of classified elements is obtained. The components of the partition are the searched classes.
A dendrogram is shown in the figure below. On the horizontal axis are the initial elements (in the order that allows drawing the tree). On the vertical axis are the distances at which objects are joined; for example, objects 4 and 6 are joined at a distance equal to 4.

Since nothing is known in a clustering problem (the number of classes in particular), evaluating the
quality of the obtained partition is a very important step. The evaluation should take into account both the
fact that, perhaps, the initial set does not have a well-defined class structure and that different methods
lead to different classes.
The usual evaluation procedures are:
• Visualizing the partition (dendrograms, profiles, projections).
• Quality indicators:
  o The divisive coefficient (DC) and the agglomerative coefficient (AC), which provide global (average) indicators.

o Silhouette indices that can be defined both globally and locally for each cluster.
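A minimal sketch (SciPy and Matplotlib, random illustrative data; the chosen linkage is arbitrary) of how a dendrogram can be drawn and then cut horizontally to obtain a partition:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])

Z = linkage(X, method='average')   # sequence of successive merges
dendrogram(Z)                      # the hierarchical classification tree
plt.ylabel('merge distance')
plt.show()

# Cutting the dendrogram horizontally at a given height gives a partition
labels = fcluster(Z, t=3.0, criterion='distance')
print(labels)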

51. How to choose the number of clusters in the case of hierarchical type classifications

Hierarchical Cluster Analysis (HCA) is a "hierarchical" grouping method in which each class is wholly
contained in another class. No prior information is required about the number of classes, and once an
individual has been assigned to a class, he or she will remain there. It is not recommended for use with
large databases with many individuals.
Hierarchical methods of class formation are characterized by the fact that the number of classes is not
known in advance, but is determined during the course, by the classification algorithm. There are two
categories of hierarchical classification algorithms, namely ascending (or aggregation) and descending
algorithms.
In the following we present the main steps of an aggregation algorithm. Suppose we have n individuals that we want to classify.
Step 1. Set n0 = n, that is, start from the finest partition, initially formed of classes containing a single individual each. From this set of individuals/classes, the two closest ones, according to the proximity index used, are selected. They will form the first group.
Step 2. Calculate a new proximity matrix containing n0 - 1 lines, corresponding to the n0 - 2 objects /
classes still ungrouped and the first group created. Based on this new matrix, two other objects are
identified, the ones closest to each other, and a new group is formed with them. At each iteration, these two entities can be either two individuals, or an individual and an already formed group, or two already constituted groups.
We then decrease n0 (n0 ← n0 − 1) and repeat Step 2 until all individuals have been grouped.
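A minimal sketch (SciPy and scikit-learn, illustrative data) of one common way to choose the number of clusters afterwards: cut the hierarchy into k classes for several values of k and compare the resulting partitions with a quality indicator such as the silhouette:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2)), rng.normal(-5, 1, (30, 2))])
Z = linkage(X, method='ward')

for k in range(2, 6):
    labels = fcluster(Z, t=k, criterion='maxclust')
    print(k, round(silhouette_score(X, labels), 3))   # higher silhouette = better separation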

52. Formulate the general problem of supervised recognition of forms and mention some areas of
use

Supervised learning techniques aim to build a model of initial data in which some of the variables are
explanatory (predictor variables) and one or more variables are response variables. Among the supervised
techniques are: canonical analysis, discrimination analysis, multiple linear regression, logistic regression.
 To evaluate the (linear) connection between two quantitative variables, we can choose to calculate the
Pearson correlation coefficient and interpret the value obtained. But if we want to evaluate the linear link
between two sets of (quantitative) variables, one possibility is to evaluate the correlation between two
linear combinations, which optimally represent the two sets of variables.
Usually, canonical analysis is used in the following context: on some individuals of the population, both objective measurements and subjective assessments (expressed quantitatively, in the form of notes) were made. For example, the individuals could be a group of companies, the objective variables could be the financial-accounting indicators, and the subjective variables could be the notes given (by a panel of specialists) to the product promotion policy, the preference of the shareholders for the assets, etc.
Discrimination analysis methods apply to a population of individuals, characterized by continuous or categorical variables, that is divided a priori into groups. In discrimination analysis, the population of individuals that have been investigated is divided into groups, and we have the observed data for these individuals. (In some situations the groups appear naturally; in others they are the result of a previous analysis.)
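For the simplest case mentioned above, the linear connection between two quantitative variables, a minimal NumPy sketch with made-up paired measurements:

import numpy as np

# Hypothetical paired measurements on the same individuals
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson correlation coefficient between the two quantitative variables
r = np.corrcoef(x, y)[0, 1]
print(r)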

53. Define the purposes of supervised recognition of forms and describe the type of information
used in supervised recognition

Discriminant analysis has two well-defined purposes, namely:


- A decision-making purpose, the most frequently encountered, which aims to build a rule for assigning individuals to a group, a rule that can then be applied in the future. This rule is constructed from the set of predictor variables observed on the individuals. A good assignment rule is one that will lead, in the future, to as few classification errors on future observations as possible.
- An explanatory purpose, which aims to discover the most relevant variables in describing the differences between the groups formed a priori.
The types of information used are objective and subjective.
Over some individuals of the population, both objective measurements and subjective assessments (expressed quantitatively, in the form of notes) were made. Therefore, the first set of variables consists of the "objective" ones, x1, x2, ..., xp; the data obtained from the n individuals form the matrix X (of size n × p). The second set of variables consists of the "subjective" ones, y1, y2, ..., yq; the data obtained from the n individuals form the matrix Y (of size n × q).

54. What are linear type classifiers. Describe the logic of linear discrimination and the
discriminated space

The classifier or classification criterion represents the rule or set of rules on the basis of which the objects
that appear in the analyzed set are affected or assigned to well defined classes or groups.
The first approach to classification problems using discriminant analysis techniques dates from 1933 and
was proposed by Fisher. Subsequently, approaches of this type have been constantly developing, and
applications based on discriminant analysis have expanded to more and more areas of activity and have
increasingly diversified.
The most numerous and most useful applications of discriminant analysis based on Fisher's criterion are found in the financial-banking field, where techniques of this type are called credit-scoring techniques; they are the most important tools for substantiating decisions regarding the granting of loans.
The discriminant analysis method proposed by Fisher is a parametric method, characterized by simplicity and robustness, which offers very useful interpretation possibilities for the analysis. The simplicity of this method derives from the fact that its use requires only the evaluation of estimates for the parameters of the population and of its classes, parameters represented by means, variances or covariances. This is a very important advantage of Fisher-type discriminant analysis in comparison, for example, with discriminant analysis techniques based on the Bayesian criterion, whose use requires knowledge of the a priori probabilities.
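A minimal sketch of linear discriminant analysis used as a classifier (scikit-learn, with made-up data standing in for, say, two financial indicators and a hypothetical good/bad-payer label):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)     # 0 = bad payer, 1 = good payer (hypothetical labels)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

new_client = [[2.5, 2.0]]
print(lda.predict(new_client))            # predicted class membership
print(lda.decision_function(new_client))  # discriminant score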

55. Define linear discriminant functions, discriminant variables and discriminant scores

The theoretical basis of Fisher's discriminant analysis is the analysis of variance. Fisher's criterion defines a way of deducing discriminant functions based on the comparative analysis of intra-group variability and inter-group variability at the level of the classes or groups of the analyzed population. The discriminant functions deduced on the basis of Fisher's criterion are also called score functions and are linear functions.
The fundamental criterion underlying the division of the set of objects into subsets is a mixed criterion,
which aims at minimizing intra-group variability and maximizing intergroup variability. The use of this
combined criterion ensures the best differentiation of the classes or groups of the population.
A discriminant function of the Fisher type is determined as a linear combination of the discriminant variables, a combination whose coefficients are the components of an eigenvector of the matrix W⁻¹B, where W is the intra-group covariance matrix and B is the inter-group covariance matrix. From this way of defining it follows, implicitly, that several discriminant functions can be identified.
The maximum possible number of discriminant functions that can be identified based on Fisher's criterion is equal to the number of distinct and strictly positive eigenvalues of the matrix W⁻¹B.
Determining the discriminant functions is equivalent to finding directions, or vectors, with respect to which the intra-group variability is minimal and the inter-group variability is maximal. These directions define the axes of the discriminant space and can be identified as linear combinations of the descriptor variables selected in the analysis.

The idea behind Fisher's criterion is to determine directions or axes such that, along them, the classes of the set differ as much as possible from each other and, at the same time, each class has as high a degree of homogeneity as possible. In other words, Fisher's criterion aims to determine directions along which the inter-group variability is as high as possible and the intra-group variability is as small as possible.
The projections of the objects on the axes defined by these directions represent new coordinates of the
objects and are called discriminant scores.
In the discriminant analysis the new directions to be identified do not have to be necessarily orthogonal,
as opposed to the analysis of the principal components in which the directions of maximum variability
must verify the orthogonality property.
The discriminant variables are linear combinations of the descriptor variables, of the form: d = β0 + βᵀx
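A minimal NumPy sketch (illustrative data) of how discriminant directions and scores could be obtained from the eigenvectors of W⁻¹B; in practice a library implementation would normally be preferred:

import numpy as np

rng = np.random.default_rng(1)
# Hypothetical observations: 3 descriptor variables, 2 classes
X = np.vstack([rng.normal(0, 1, (40, 3)), rng.normal(2, 1, (40, 3))])
y = np.array([0] * 40 + [1] * 40)

overall_mean = X.mean(axis=0)
W = np.zeros((3, 3))   # intra-group (within-class) scatter
B = np.zeros((3, 3))   # inter-group (between-class) scatter
for k in np.unique(y):
    Xk = X[y == k]
    mk = Xk.mean(axis=0)
    W += (Xk - mk).T @ (Xk - mk)
    B += len(Xk) * np.outer(mk - overall_mean, mk - overall_mean)

# Discriminant directions: eigenvectors of W^(-1) B with the largest eigenvalues
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(W) @ B)
order = np.argsort(eigvals.real)[::-1]
w1 = eigvecs.real[:, order[0]]   # first discriminant direction

scores = X @ w1                  # discriminant scores (projections of the objects on the axis)
print(scores[:5])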

56 & 57 Describe the Bayesian classifier and show how it can be used to predict the membership of
the forms. Describe the form of the Bayesian classifier in the case of normality and
homoscedasticity of classes

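A minimal sketch, assuming classes with normal distributions and a common covariance matrix (homoscedasticity): each form is assigned to the class with the largest linear discriminant score gk(x) = xᵀΣ⁻¹μk − 0.5 μkᵀΣ⁻¹μk + ln πk, where μk are the class means, Σ the pooled covariance matrix and πk the a priori class probabilities:

import numpy as np

def fit_gaussian_bayes(X, y):
    # Estimate class means, a priori probabilities and a pooled (common) covariance matrix.
    classes = np.unique(y)
    means = {k: X[y == k].mean(axis=0) for k in classes}
    priors = {k: np.mean(y == k) for k in classes}
    pooled = sum((X[y == k] - means[k]).T @ (X[y == k] - means[k]) for k in classes)
    pooled /= (len(X) - len(classes))          # homoscedasticity: one covariance for all classes
    return classes, means, priors, np.linalg.inv(pooled)

def predict(x, classes, means, priors, inv_cov):
    # Assign x to the class with the largest linear discriminant (Bayes) score.
    scores = {k: x @ inv_cov @ means[k] - 0.5 * means[k] @ inv_cov @ means[k] + np.log(priors[k])
              for k in classes}
    return max(scores, key=scores.get)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
model = fit_gaussian_bayes(X, y)
print(predict(np.array([2.5, 2.5]), *model))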
58. Describe the Fisher linear classifier and show how it can be used to predict the membership of
the shapes.

The classifiers are predictive in nature, with the purpose of establishing the membership of forms with
unknown membership in a particular class.
The informational basis needed to construct the classifiers is represented by an extended observation matrix with T + 1 lines and n + 1 columns, where the first n columns contain the actual characteristics of the forms, and the last column is associated with an additional variable, which is explained by systems of functions fi(x1, ..., xn), called classifiers.
The idea behind the Fisher classifier is to determine an axis/direction along which the classes of the Ω set differ as much as possible from each other and, at the same time, each class has as high a degree of internal homogeneity as possible. In other words, the purpose of the Fisher classifier is to find a direction along which the interclass variability is as large as possible and the intraclass variability as small as possible.
In the end, the discriminant function is a linear combination of the initial variables whose coefficients are given by eigenvectors of the product between the inverse of the intraclass covariance matrix and the interclass covariance matrix.
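For the two-class case, a minimal NumPy sketch (illustrative data) of the Fisher direction w = W⁻¹(m1 − m2) and its use for predicting membership, by comparing the projected score of a new form with the midpoint of the projected class means:

import numpy as np

rng = np.random.default_rng(3)
X1 = rng.normal(0, 1, (50, 2))   # forms known to belong to class 1
X2 = rng.normal(3, 1, (50, 2))   # forms known to belong to class 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # intraclass scatter matrix
w = np.linalg.inv(W) @ (m1 - m2)                         # Fisher discriminant direction

threshold = 0.5 * (m1 @ w + m2 @ w)   # midpoint of the projected class means

def classify(x):
    # Predict the membership of a form with unknown class.
    return 1 if x @ w > threshold else 2

print(classify(np.array([0.5, 0.2])), classify(np.array([3.2, 2.8])))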

59. Describe the Mahalanobis classifier and show how it can be used in predicting the membership
of the forms

The classifiers are predictive in nature, with the purpose of establishing the membership of forms with
unknown membership in a particular class.
The informational basis needed to construct the classifiers is represented by an extended observation matrix with T + 1 lines and n + 1 columns, where the first n columns contain the actual characteristics of the forms, and the last column is associated with an additional variable, which is explained by systems of functions fi(x1, ..., xn), called classifiers.
The first two steps of the Mahalanobis classifier are the following:
1. The mean vectors of the classes, xmed1, ..., xmedk, are estimated.
2. The covariance matrices Σ1, ..., Σk are estimated.
Then the (squared) Mahalanobis distance from a form to the centroid of each class is evaluated: D²(x, xmedk) = (x − xmedk)ᵀ Σk⁻¹ (x − xmedk).
The form is considered to belong to the class for which the smallest Mahalanobis distance is obtained.
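A minimal NumPy sketch (made-up training forms) of the steps above: estimate the class means and covariance matrices, then assign a new form to the class with the smallest Mahalanobis distance:

import numpy as np

rng = np.random.default_rng(4)
# Hypothetical training forms for two classes
classes = {1: rng.normal(0, 1, (60, 2)), 2: rng.normal(4, 2, (60, 2))}

# Steps 1-2: class mean vectors and inverse covariance matrices
means = {k: Xk.mean(axis=0) for k, Xk in classes.items()}
inv_covs = {k: np.linalg.inv(np.cov(Xk, rowvar=False)) for k, Xk in classes.items()}

def mahalanobis_sq(x, k):
    # Squared Mahalanobis distance from form x to the centroid of class k.
    d = x - means[k]
    return d @ inv_covs[k] @ d

def classify(x):
    # The form belongs to the class with the smallest Mahalanobis distance.
    return min(classes, key=lambda k: mahalanobis_sq(x, k))

print(classify(np.array([0.3, 0.1])), classify(np.array([4.5, 3.8])))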

60. Describe how to determine the predictive ability of a classifier and the matrix of the correctness
of the classification

A classifier is considered to ensure a high classification capacity if there is a good trade-off between its recognition rate and its predictive ability.
The recognition rate is the fraction of the forms in the training set that are correctly classified by the determined classifier.
The predictive ability is the fraction of the objects in the prediction set, whose class membership is assumed unknown, that are correctly classified by the classifier.
Let the error function be c: X × Y × Y → [0, ∞), with c(x, y, y) = 0 for every (x, y, f(x)) ∈ X × Y × Y, where x is a form of X, y is the real class to which it belongs and f(x) is the class into which the classifier placed it.
In general, the error function has the form c(x, y, f(x)) = 0.5 |f(x) − y| (with the two classes coded as −1 and +1).
If the form x is correctly classified, then c(x, y, f(x)) = 0; otherwise it is equal to 1. In this way the matrix of the correctness of the classification (the confusion matrix) is obtained: forming a matrix with the real classes y on the lines and the predicted classes f(x) on the columns, each cell counts the forms with that combination of real and predicted class, and the elements on the main diagonal (the correctly classified forms, with error 0) determine the predictive ability of the classifier.
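A minimal NumPy sketch (hypothetical real and predicted labels) of the matrix of the correctness of the classification and of the correctly-classified fraction given by its diagonal:

import numpy as np

# Hypothetical real classes y and classes f(x) predicted by a classifier
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

n_classes = 2
confusion = np.zeros((n_classes, n_classes), dtype=int)
for t, p in zip(y_true, y_pred):
    confusion[t, p] += 1          # lines: real class, columns: predicted class

print(confusion)
accuracy = np.trace(confusion) / confusion.sum()   # diagonal = correctly classified forms
print(accuracy)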

