data cleaning and was addressed in Section 3.2.2. Section 3.2.3 on the data cleaning
process also discussed ETL tools, where users specify transformations to correct data
inconsistencies. Attribute construction and aggregation were discussed in Section 3.4
on data reduction. In this section, we therefore concentrate on the latter three strategies.
Discretization techniques can be categorized based on how the discretization is performed, such as whether it uses class information or which direction it proceeds (i.e.,
top-down vs. bottom-up). If the discretization process uses class information, then we
say it is supervised discretization. Otherwise, it is unsupervised. If the process starts by first
finding one or a few points (called split points or cut points) to split the entire attribute
range, and then repeats this recursively on the resulting intervals, it is called top-down
discretization or splitting. This contrasts with bottom-up discretization or merging, which
starts by considering all of the continuous values as potential split-points, removes some
by merging neighborhood values to form intervals, and then recursively applies this
process to the resulting intervals.
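The top-down (splitting) strategy can be illustrated with a minimal sketch. The function name `top_down_split` and the choice of the midpoint as the split point are assumptions made purely for illustration; the text does not prescribe how split points are chosen.

```python
def top_down_split(low, high, depth):
    """Recursively split the range (low, high] into 2**depth intervals."""
    if depth == 0:
        return [(low, high)]
    # Assumed split-point rule: the midpoint of the current interval.
    mid = (low + high) / 2
    # Repeat the same procedure on each resulting interval.
    return top_down_split(low, mid, depth - 1) + top_down_split(mid, high, depth - 1)


print(top_down_split(0, 1000, 2))
# [(0, 250.0), (250.0, 500.0), (500.0, 750.0), (750.0, 1000)]
```

A bottom-up method would run in the opposite direction: start with every distinct value as its own interval and repeatedly merge neighbors.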
Data discretization and concept hierarchy generation are also forms of data reduction. The raw data are replaced by a smaller number of interval or concept labels. This
simplifies the original data and makes the mining more efficient. The resulting patterns
mined are typically easier to understand. Concept hierarchies are also useful for mining
at multiple abstraction levels.
The rest of this section is organized as follows. First, normalization techniques are
presented in Section 3.5.2. We then describe several techniques for data discretization,
each of which can be used to generate concept hierarchies for numeric attributes. The
techniques include binning (Section 3.5.3) and histogram analysis (Section 3.5.4), as
well as cluster analysis, decision tree analysis, and correlation analysis (Section 3.5.5).
Finally, Section 3.5.6 describes the automatic generation of concept hierarchies for
nominal data.
3.5.2
attributes with initially large ranges (e.g., income) from outweighing attributes with
initially smaller ranges (e.g., binary attributes). It is also useful when given no prior
knowledge of the data.
There are many methods for data normalization. We study min-max normalization,
z-score normalization, and normalization by decimal scaling. For our discussion, let A be
a numeric attribute with n observed values, \(v_1, v_2, \ldots, v_n\).
Min-max normalization performs a linear transformation on the original data. Suppose that \(min_A\) and \(max_A\) are the minimum and maximum values of an attribute, A.
Min-max normalization maps a value, \(v_i\), of A to \(v_i'\) in the range \([new\_min_A, new\_max_A]\)
by computing
\[ v_i' = \frac{v_i - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A. \tag{3.8} \]
Min-max normalization preserves the relationships among the original data values. It
will encounter an out-of-bounds error if a future input case for normalization falls
outside of the original data range for A.
Example 3.4 Min-max normalization. Suppose that the minimum and maximum values for the
attribute income are $12,000 and $98,000, respectively. We would like to map income
to the range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is
transformed to \(\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716\).
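Eq. (3.8) can be sketched in a few lines of Python. The function name `min_max_normalize` and its parameter names are illustrative assumptions, and the guard against a zero-width range is an addition not discussed in the text.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min(values), max(values)] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # Assumption: treat a constant attribute as an error rather than divide by zero.
        raise ValueError("min-max normalization is undefined when all values are equal")
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]


# The income figures from Example 3.4: minimum, the value of interest, maximum.
incomes = [12_000, 73_600, 98_000]
print([round(v, 3) for v in min_max_normalize(incomes)])  # [0.0, 0.716, 1.0]
```

Because the minimum and maximum are taken from the data passed in, a later value outside that range would map outside [new_min, new_max], which is the out-of-bounds situation described above.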
In z-score normalization (or zero-mean normalization), the values for an attribute,
A, are normalized based on the mean (i.e., average) and standard deviation of A. A value,
\(v_i\), of A is normalized to \(v_i'\) by computing
\[ v_i' = \frac{v_i - \bar{A}}{\sigma_A}, \tag{3.9} \]
where \(\bar{A}\) and \(\sigma_A\) are the mean and standard deviation, respectively, of attribute A. The
mean and standard deviation were discussed in Section 2.2, where \(\bar{A} = \frac{1}{n}(v_1 + v_2 + \cdots + v_n)\)
and \(\sigma_A\) is computed as the square root of the variance of A (see Eq. (2.6)). This
method of normalization is useful when the actual minimum and maximum of attribute
A are unknown, or when there are outliers that dominate the min-max normalization.
Example 3.5 z-score normalization. Suppose that the mean and standard deviation of the values for
the attribute income are $54,000 and $16,000, respectively. With z-score normalization,
a value of $73,600 for income is transformed to \(\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225\).
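A minimal sketch of Eq. (3.9) follows. The function names are illustrative, and dividing by n (the population variance) is an assumption, since Eq. (2.6) is not reproduced in this section.

```python
import math


def z_score(value, mean, std):
    """Normalize a single value given a known mean and standard deviation."""
    return (value - mean) / std


def z_score_normalize(values):
    """Normalize a list of values using their own mean and standard deviation."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: population variance (divide by n, not n - 1).
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        raise ValueError("z-score normalization is undefined for a constant attribute")
    return [z_score(v, mean, std) for v in values]


# Example 3.5: mean $54,000, standard deviation $16,000, value $73,600.
print(round(z_score(73_600, 54_000, 16_000), 3))  # 1.225
```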
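A variation of this z-score normalization replaces the standard deviation of Eq. (3.9)
by the mean absolute deviation of A. The mean absolute deviation of A, denoted \(s_A\), is
\[ s_A = \frac{1}{n}\left(|v_1 - \bar{A}| + |v_2 - \bar{A}| + \cdots + |v_n - \bar{A}|\right). \tag{3.10} \]
Thus, z-score normalization using the mean absolute deviation is
\[ v_i' = \frac{v_i - \bar{A}}{s_A}. \tag{3.11} \]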
The mean absolute deviation, \(s_A\), is more robust to outliers than the standard deviation,
\(\sigma_A\). When computing the mean absolute deviation, the deviations from the mean (i.e.,
\(|x_i - \bar{x}|\)) are not squared; hence, the effect of outliers is somewhat reduced.
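The variant based on the mean absolute deviation can be sketched as follows. The function names and the sample data (four small values and one outlier) are made up for illustration; the standard deviation is again assumed to be the population form.

```python
import math


def mean_absolute_deviation(values):
    """Average absolute distance of the values from their mean."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


def standard_deviation(values):
    """Population standard deviation (assumed form), shown for comparison."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def mad_normalize(values):
    """z-score normalization with the mean absolute deviation as the scale."""
    mean = sum(values) / len(values)
    s = mean_absolute_deviation(values)
    if s == 0:
        raise ValueError("normalization is undefined for a constant attribute")
    return [(v - mean) / s for v in values]


data = [10, 12, 11, 13, 100]  # hypothetical values; 100 is an outlier
# The unsquared deviations give the outlier less weight than the squared ones.
print(mean_absolute_deviation(data) < standard_deviation(data))  # True
```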
Normalization by decimal scaling normalizes by moving the decimal point of values
of attribute A. The number of decimal points moved depends on the maximum absolute
value of A. A value, \(v_i\), of A is normalized to \(v_i'\) by computing
\[ v_i' = \frac{v_i}{10^j}, \tag{3.12} \]
where j is the smallest integer such that \(\max(|v_i'|) < 1\).
Example 3.6 Decimal scaling. Suppose that the recorded values of A range from -986 to 917. The
maximum absolute value of A is 986. To normalize by decimal scaling, we therefore
divide each value by 1000 (i.e., j = 3) so that -986 normalizes to -0.986 and 917
normalizes to 0.917.
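A minimal sketch of Eq. (3.12), applied to two values whose largest magnitude is 986. The function name is illustrative, and restricting j to non-negative integers is an assumption made here for simplicity.

```python
def decimal_scaling(values):
    """Return (j, scaled values), where every scaled value has magnitude below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0  # assumption: only non-negative j is considered
    while max_abs / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]


j, scaled = decimal_scaling([-986, 917])
print(j, scaled)  # 3 [-0.986, 0.917]
```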
Note that normalization can change the original data quite a bit, especially when
using z-score normalization or decimal scaling. It is also necessary to save the normalization parameters (e.g., the mean and standard deviation if using z-score normalization)
so that future data can be normalized in a uniform manner.
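One way to keep the normalization parameters is to separate fitting from applying, as in the sketch below. The names `ZScoreParams`, `fit_z_score`, and `apply_z_score` are hypothetical, and the training values are made up for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ZScoreParams:
    """Parameters saved from the original data so future data is scaled identically."""
    mean: float
    std: float


def fit_z_score(values):
    """Compute and return the parameters from the original data."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        raise ValueError("cannot fit z-score parameters to a constant attribute")
    return ZScoreParams(mean, std)


def apply_z_score(params, values):
    """Normalize any values, old or new, with previously saved parameters."""
    return [(v - params.mean) / params.std for v in values]


training = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical original data
params = fit_z_score(training)                        # mean 5.0, std 2.0
print(apply_z_score(params, [9.0, 11.0]))             # [2.0, 3.0]
```

The future values are scaled with the saved mean and standard deviation rather than with statistics recomputed from the new data.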
3.5.3
Discretization by Binning
Binning is a top-down splitting technique based on a specified number of bins.
Section 3.2.2 discussed binning methods for data smoothing. These methods are also
used as discretization methods for data reduction and concept hierarchy generation. For
example, attribute values can be discretized by applying equal-width or equal-frequency
binning, and then replacing each bin value by the bin mean or median, as in smoothing
by bin means or smoothing by bin medians, respectively. These techniques can be applied
recursively to the resulting partitions to generate concept hierarchies.
Binning does not use class information and is therefore an unsupervised discretization technique. It is sensitive to the user-specified number of bins, as well as the presence
of outliers.
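Equal-width and equal-frequency binning, followed by replacement with bin means, can be sketched as follows. The function names and the sample prices are illustrative assumptions, not taken from this section.

```python
def equal_width_bins(values, k):
    """Partition values into k bins that each span the same width of the range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        # The maximum value is placed in the last bin instead of a (k+1)-th bin.
        idx = min(int((v - lo) / width), k - 1) if width > 0 else 0
        bins[idx].append(v)
    return bins


def equal_frequency_bins(values, k):
    """Partition sorted values into k bins holding roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


def smooth_by_bin_means(bins):
    """Replace every value in a bin by that bin's mean (empty bins are skipped)."""
    return [[sum(b) / len(b)] * len(b) for b in bins if b]


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical sample data
print(equal_width_bins(prices, 3))      # [[4, 8], [15, 21, 21], [24, 25, 28, 34]]
print(equal_frequency_bins(prices, 3))  # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_bin_means(equal_frequency_bins(prices, 3)))
```

The two partitions differ for the same k, which illustrates the sensitivity to the binning choices noted above.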
3.5.4
3.5.5
3.5.6
form a hierarchy at the schema level, a user could define some intermediate levels
manually, such as "{Alberta, Saskatchewan, Manitoba} ⊂ prairies_Canada" and
"{British Columbia, prairies_Canada} ⊂ Western_Canada."
3. Specification of a set of attributes, but not of their partial ordering: A user may
specify a set of attributes forming a concept hierarchy, but omit to explicitly state
their partial ordering. The system can then try to automatically generate the attribute
ordering so as to construct a meaningful concept hierarchy.
Without knowledge of data semantics, how can a hierarchical ordering for an
arbitrary set of nominal attributes be found? Consider the observation that since
higher-level concepts generally cover several subordinate lower-level concepts, an
attribute defining a high concept level (e.g., country) will usually contain a smaller
number of distinct values than an attribute defining a lower concept level (e.g.,
street). Based on this observation, a concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set.
The attribute with the most distinct values is placed at the lowest hierarchy level. The
lower the number of distinct values an attribute has, the higher it is in the generated concept hierarchy. This heuristic rule works well in many cases. Some local-level
swapping or adjustments may be applied by users or experts, when necessary, after
examination of the generated hierarchy.
Let's examine an example of this third method.
Example 3.7 Concept hierarchy generation based on the number of distinct values per attribute.
Suppose a user selects a set of location-oriented attributes (street, country, province_or_state,
and city) from the AllElectronics database, but does not specify the hierarchical
ordering among the attributes.
A concept hierarchy for location can be generated automatically, as illustrated in
Figure 3.13. First, sort the attributes in ascending order based on the number of distinct values in each attribute. This results in the following (where the number of distinct
values per attribute is shown in parentheses): country (15), province_or_state (365), city
(3567), and street (674,339). Second, generate the hierarchy from the top down according to the sorted order, with the first attribute at the top level and the last attribute at the
bottom level. Finally, the user can examine the generated hierarchy, and when necessary,
modify it to reflect desired semantic relationships among the attributes. In this example,
it is obvious that there is no need to modify the generated hierarchy.
Note that this heuristic rule is not foolproof. For example, a time dimension in a
database may contain 20 distinct years, 12 distinct months, and 7 distinct days of the
week. However, this does not suggest that the time hierarchy should be "year < month <
days_of_the_week," with days_of_the_week at the top of the hierarchy.
4. Specification of only a partial set of attributes: Sometimes a user can be careless
when defining a hierarchy, or have only a vague idea about what should be included
in a hierarchy. Consequently, the user may have included only a small subset of the
Figure 3.13 Automatic generation of a schema concept hierarchy based on the number of distinct
attribute values. Levels shown, from top to bottom: country (15 distinct values), province_or_state, city, street.
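The distinct-value heuristic of Example 3.7 can be sketched as a single sort. The function name is an illustrative assumption; the location counts are those given in the example, and the time counts are those of the counterexample.

```python
def order_by_distinct_values(distinct_counts):
    """Order attribute names from the top hierarchy level (fewest values) downward."""
    return sorted(distinct_counts, key=distinct_counts.get)


location = {"street": 674_339, "country": 15, "province_or_state": 365, "city": 3567}
print(order_by_distinct_values(location))
# ['country', 'province_or_state', 'city', 'street']

# The heuristic is not foolproof: for the time attributes it puts
# day_of_week at the top, which is not a meaningful hierarchy.
time = {"year": 20, "month": 12, "day_of_week": 7}
print(order_by_distinct_values(time))
# ['day_of_week', 'month', 'year']
```

The second call shows why the generated ordering should be reviewed by a user before it is adopted.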
3.5.1
($0...$1000]
    ($0...$200]: ($0...$100], ($100...$200]
    ($200...$400]: ($200...$300], ($300...$400]
    ($400...$600]: ($400...$500], ($500...$600]
    ($600...$800]: ($600...$700], ($700...$800]
    ($800...$1000]: ($800...$900], ($900...$1000]
Figure 3.12 A concept hierarchy for the attribute price, where an interval ($X...$Y] denotes the range
from $X (exclusive) to $Y (inclusive).