
CS412 Assignment 2 Ref Answer

Question 1:
Assume a base cuboid of 10 dimensions contains only three base cells: (1) (a1, b2, c3, d4, ..., d9,
d10), (2) (a1, c2, b3, d4, ..., d9, d10), and (3) (b1, c2, b3, d4, ..., d9, d10), where a_i != b_i, b_i !=
c_i, etc. The measure of the cube is count.
1, How many nonempty cuboids will a full data cube contain?
Answer: 2^10 = 1024 (each of the 10 dimensions is either kept or aggregated away).

2, How many nonempty aggregate (i.e., non-base) cells will a full cube contain?
Answer: There will be 3 * 2^10 - 6 * 2^7 - 3 = 2301 nonempty aggregate cells in the full cube.
The number of cells overlapping twice (i.e., shared by all three base cells) is 2^7, while the
number of cells overlapping once (shared by exactly two base cells) is 4 * 2^7. So the final
calculation is 3 * 2^10 - 2 * 2^7 - 1 * 4 * 2^7 - 3, which yields the result.

3, How many nonempty aggregate cells will an iceberg cube contain if the condition of the
iceberg cube is "count >= 2"?
Answer: There are in total 5 * 2^7 = 640 nonempty aggregate cells in the iceberg cube. To
calculate the result: fix the first three dimensions as (*, *, *), (a1, *, *), (*, c2, *), (*, *, b3) or
(*, c2, b3), and vary the remaining seven dimensions (each is either kept as its d value or
aggregated to *).

4, How many closed cells are in the full cube?


Answer: There are 6 closed cells in the full cube: the 3 base cells (each with count 1); (a1, *, *, d4,
..., d10): count 2; (*, c2, b3, d4, ..., d10): count 2; and (*, *, *, d4, ..., d10): count 3.
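The three counts above can be double-checked with a small brute-force Python sketch (a verification aid, not part of the original answer); it enumerates every ancestor of the three base cells and tallies the count measure:

```python
from itertools import product

# Brute-force verification of the Question 1 answers. Each cell is a 10-tuple;
# "*" marks an aggregated dimension.
base_cells = [
    ("a1", "b2", "c3") + tuple(f"d{i}" for i in range(4, 11)),
    ("a1", "c2", "b3") + tuple(f"d{i}" for i in range(4, 11)),
    ("b1", "c2", "b3") + tuple(f"d{i}" for i in range(4, 11)),
]

# Every ancestor of a base cell is obtained by replacing any subset of its
# dimensions with "*"; summing over the three base cells gives the count measure.
counts = {}
for cell in base_cells:
    for mask in product((False, True), repeat=10):
        anc = tuple("*" if star else v for star, v in zip(mask, cell))
        counts[anc] = counts.get(anc, 0) + 1

aggregates = {c: n for c, n in counts.items() if c not in base_cells}
print(len(aggregates))                                # 2301 nonempty aggregate cells
print(sum(1 for n in aggregates.values() if n >= 2))  # 640 cells in the "count >= 2" iceberg cube

# A cell is closed if no proper specialization (descendant cell) has the same count.
def has_descendant_with_same_count(cell, cnt):
    return any(other != cell and n == cnt and
               all(p == "*" or p == q for p, q in zip(cell, other))
               for other, n in counts.items())

print(sum(1 for c, n in counts.items()
          if not has_descendant_with_same_count(c, n)))  # 6 closed cells
```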

Question 2: (Half-open questions; make sure your algorithm and assumptions are correct, no need
to be very specific)
Suppose a base cuboid has the following tuples:
A   B   C   D   Count  Sales
a1  b1  c1  d1  1      6
a1  b2  c2  d1  1      4
a1  b3  c1  d2  1      2
a2  b4  c1  d2  1      10
a2  b3  c2  d3  1      12

1, Show the representative steps to demonstrate how a complete data cube (with Count and
SUM(Sales) as measures) is computed by the multiway array aggregation algorithm;
Answer (from fang2): Suppose dimensions A, B, C, D are organized into 2, 4, 2, 3 partitions
respectively, so in total there are 2*4*2*3 = 48 chunks. The cardinalities of dimensions A, B, C, D
are 2, 4, 2, 3 respectively, i.e. A and C have the smallest size, followed by D, and B has the largest
size. From the base cuboid given, we can compute the 3-D, 2-D, 1-D and apex cuboids as in the
diagram.
The chunk scan order is always first along the smallest dimension, then along the 2nd smallest
dimension, then along the 3rd smallest, and so on. For example, when computing the 3-D cuboids
from the base cuboid, we first scan chunks along the A dimension, then C, D and B, in ascending
order of dimension size. In other words, we aggregate first towards CDB, so only 1 chunk of CDB
needs to be held in memory at any one time; then we aggregate towards ADB, so only 1 row of
ADB needs to be held in memory at any one time; and so on. For the computation of the 2-D, 1-D
and apex cuboids, a similar approach is adopted, with the chunks scanned first along the smallest
dimensions.
During the computation of each cuboid, both measures (Count and SUM(Sales)) are aggregated.
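The chunk bookkeeping itself is hard to show compactly, but the simultaneous-aggregation idea can be sketched in Python. This is a much-simplified illustration, not the MultiWay algorithm proper: plain dictionaries stand in for the chunked arrays, and the memory-minimizing scan order (A, then C, D, B) is not modeled. What it does show is that a single pass over the base cuboid aggregates each cell into all 15 ancestor cuboids at once, for both measures.

```python
from itertools import combinations

# Base cuboid from Question 2: (A, B, C, D) -> (count, sales).
BASE = {
    ("a1", "b1", "c1", "d1"): (1, 6),
    ("a1", "b2", "c2", "d1"): (1, 4),
    ("a1", "b3", "c1", "d2"): (1, 2),
    ("a2", "b4", "c1", "d2"): (1, 10),
    ("a2", "b3", "c2", "d3"): (1, 12),
}
DIMS = ("A", "B", "C", "D")

# One dict per ancestor cuboid (apex, 1-D, 2-D, 3-D group-bys): 15 in total.
cuboids = {grp: {} for r in range(4) for grp in combinations(DIMS, r)}

for cell, (cnt, sal) in BASE.items():        # a single pass over the base cuboid
    for grp, cuboid in cuboids.items():      # aggregate to every ancestor cuboid at once
        key = tuple(v for d, v in zip(DIMS, cell) if d in grp)
        c, s = cuboid.get(key, (0, 0))
        cuboid[key] = (c + cnt, s + sal)

print(cuboids[()])                           # apex cuboid: {(): (5, 34)}
print(cuboids[("A",)])                       # {('a1',): (3, 12), ('a2',): (2, 22)}
print(cuboids[("B", "D")][("b3", "d2")])     # (1, 2)
```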

2, Do the same using the BUC algorithm; and


Answer (from duan9):
First we order the dimensions in descending order of cardinality: BDAC. Then the aggregation
order can be drawn in tree form:

At the beginning of the recursion, we aggregate all the dimensions to get the apex cuboid using
the two measures: count and sum of sales. Then we start partitioning the table according to the
sequence BDAC as follows:
Through this recursive aggregation and partitioning process we obtain the cuboids apex, B, BD,
BDA and BDAC. Then we traverse back (as part of the recursion) and get BDC; traversing back
further we get BA, BAC and so on.
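A minimal BUC sketch in Python (the helper names below are assumptions of this sketch, not the assignment's notation): starting from the apex, it recursively partitions the tuples in the order B, D, A, C and emits every group-by cell with its Count and SUM(Sales). With min_sup = 1 it produces the full cube; min_sup >= 2 would prune partitions exactly as the iceberg condition would.

```python
from collections import defaultdict

BASE = [  # (A, B, C, D, count, sales) from Question 2
    ("a1", "b1", "c1", "d1", 1, 6),
    ("a1", "b2", "c2", "d1", 1, 4),
    ("a1", "b3", "c1", "d2", 1, 2),
    ("a2", "b4", "c1", "d2", 1, 10),
    ("a2", "b3", "c2", "d3", 1, 12),
]
DIM_ORDER = [1, 3, 0, 2]          # column indices for B, D, A, C
DIM_NAME = {0: "A", 1: "B", 2: "C", 3: "D"}

def buc(rows, dims, cell, min_sup, out):
    count = sum(r[4] for r in rows)
    sales = sum(r[5] for r in rows)
    out[cell] = (count, sales)     # emit the current aggregate cell
    for i, d in enumerate(dims):   # expand one more dimension at a time
        parts = defaultdict(list)
        for r in rows:
            parts[r[d]].append(r)
        for val, part in parts.items():
            if sum(r[4] for r in part) >= min_sup:   # iceberg pruning point
                buc(part, dims[i + 1:], cell + ((DIM_NAME[d], val),), min_sup, out)

cube = {}
buc(BASE, DIM_ORDER, (), min_sup=1, out=cube)
print(len(cube))             # 64 nonempty cells in the full cube
print(cube[()])              # apex cell: (5, 34)
print(cube[(("B", "b3"),)])  # cell (*, b3, *, *): (2, 14)
```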

3, Do the same using the Star-Cubing algorithm.


Answer (from duan9):
First we order the dimensions as we did in BUC: BDAC. Based on the order, we have the following
computation ordering:

Then we construct a Star-Tree for the base table. Since we are computing the full cube, there are
no starred values in the star-tree; similarly, there is no compressed base table (or you can make
your own assumption and build your own compressed table).
Then we start the aggregation process by looking at one branch of the Star-Tree at a time and
aggregating simultaneously to four descendant cuboids (with shared dimensions). After the first
branch is processed, we get the following four trees:

Then the second branch is processed (we traverse back after reaching the leaf node of the tree).
Since it is a completely different branch from the first, all four trees formed in the first step are
output and destroyed, and four new trees are formed in a similar way. This traversal keeps going
until all the tree nodes have been visited.
Then we use the same approach to build BD/BD from BDC/BD, BA/BA from BAC/B, and so forth.
Finally we get all the cuboids of the full cube.
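Only the star-tree construction step is easy to show compactly. Below is a Python sketch under the same assumptions as above (dimension order B, D, A, C; no starred values because the full cube is computed); the simultaneous aggregation to the four descendant trees is not shown.

```python
BASE = [  # (A, B, C, D, count, sales) from Question 2
    ("a1", "b1", "c1", "d1", 1, 6),
    ("a1", "b2", "c2", "d1", 1, 4),
    ("a1", "b3", "c1", "d2", 1, 2),
    ("a2", "b4", "c1", "d2", 1, 10),
    ("a2", "b3", "c2", "d3", 1, 12),
]
ORDER = [1, 3, 0, 2]  # column indices for B, D, A, C

def new_node():
    return {"count": 0, "sales": 0, "children": {}}

# Insert each base tuple into a prefix tree in BDAC order; every node keeps
# the aggregated Count and SUM(Sales) of its subtree.
root = new_node()
for row in BASE:
    node = root
    node["count"] += row[4]
    node["sales"] += row[5]
    for d in ORDER:                      # descend B -> D -> A -> C
        node = node["children"].setdefault(row[d], new_node())
        node["count"] += row[4]
        node["sales"] += row[5]

# The root holds the apex measures; each root-to-leaf path is one base tuple
# in BDAC order, e.g. b3 -> d2 -> a1 -> c1.
print(root["count"], root["sales"])      # 5 34
print(sorted(root["children"]))          # ['b1', 'b2', 'b3', 'b4']
print(root["children"]["b3"]["count"])   # 2
```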

Question 3: (Open questions; the following are some possible answers)


Consider the three data cube computation algorithms exercised in Question 2, and discuss the
following:
1, For different skewness of data, discuss the relative computation efficiency of the above
algorithms in very large datasets;
Answer (from luu1):
Multiway array aggregation: Computation efficiency will be higher with less skewed data. If the
underlying data is extremely skewed, some chunks (in the dense region of the data) may be too
big to fit into memory. Also, the shared aggregate computation will be carried out over empty
cells in the non-dense part of the data, which is inefficient.
BUC: Similarly, computation efficiency will be higher with less skewed data, as evenly distributed
data provides greater opportunity for pruning.
Star-Cubing: Star-Cubing is robust against skewed data because the star-tree is generated only
from the existing tuples, and neither the tree generation nor the aggregation process depends on
the distribution of the data.

2, If the cube is sparse, one may want to compute only the iceberg cube (e.g., compute only those
cells whose support is more than one). For different skewness of data, discuss the relative
computation efficiency of the above algorithms on very large datasets.
Answer (from luu1):
Multiway array aggregation: Although the chunking technique may be able to compress a sparse
data array by using a direct index-access mechanism, if the data are very sparse, a large number
of chunks will have to be generated for a relatively small amount of meaningful data. Also, sparse
data means that the shared aggregate computation will often be carried out over empty cells,
which is inefficient.
BUC: Sparse data will generate a huge number of partitions, many of which will be pruned.
However, the large number of partitions adds computational overhead during the recursive
computation.
Star-Cubing: Star-Cubing compresses the data into a star-tree structure, removing the redundant
empty cells (and the cells below the iceberg threshold) from the aggregation process. Therefore it
is robust against sparse data.

3, The base cuboid in Question 2 has 4 dimensions. Suppose a base cuboid has 100 dimensions
(D1, D2, ..., D100); discuss how high-dimensional OLAP can be computed. Let the following
dimensions be grouped together in shell fragment cube computation: (Dm, D(m+20), D(m+40),
D(m+60), D(m+80)) for m = 1, ..., 20. Discuss how the query (a1, ?, ?, *, c5, *, ..., *) can be
computed efficiently, where "?" denotes an inquired variable and "*" denotes a "don't care" variable.
Answer (from fang2):
As given, the dimensions are grouped as follows in shell fragment computation:
(D1, D21, D41, D61, D81),
(D2, D22, D42, D62, D82),
...,
(D20, D40, D60, D80, D100).

For the sub-cube query (a1, ?, ?, *, c5, *, ..., *), D1 and D5 are instantiated dimensions, and D2
and D3 are inquired dimensions. All the other dimensions are irrelevant ("don't care"). We first
locate each relevant dimension in its shell fragment, and use the precomputed cuboids of that
fragment to obtain its TID-lists. This is illustrated in the table below; the TID-lists are made up for
illustration purposes.

Next, we need to intersect the TID-lists to answer the sub-cube query. We should first intersect the
TID-lists of the instantiated dimensions, as this greatly reduces the size of the intersection. So for
D1 and D5, we get a reduced TID-list (a1, c5): {2,3,5}. Then we further intersect this with D2 and
D3 using a top-down, depth-first strategy, discarding any empty intersections along the way.
After this, we obtain the following TID-lists: {(b1,d1):{2}, (b2,d2):{5}, (b2,d3):{3}}. These TID-lists
can be used to construct a small 2-D base cuboid on dimensions D2 and D3, from which the 2-D
cube for the two inquired dimensions is computed trivially, which answers the sub-cube query.
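A short Python sketch of this intersection step, using made-up TID-lists that are consistent with the illustration above (all TIDs here are hypothetical, as in the answer):

```python
# Hypothetical TID-lists retrieved from the precomputed shell fragments.
tid_a1 = {2, 3, 5, 7}                            # instantiated dimension D1 = a1
tid_c5 = {1, 2, 3, 5}                            # instantiated dimension D5 = c5
tid_D2 = {"b1": {1, 2}, "b2": {3, 5, 7}}         # inquired dimension D2
tid_D3 = {"d1": {2, 6}, "d2": {5}, "d3": {3}}    # inquired dimension D3

# 1. Intersect the instantiated dimensions first: this shrinks the lists fast.
inst = tid_a1 & tid_c5                           # {2, 3, 5}

# 2. Expand the inquired dimensions top-down, dropping empty intersections.
mini_base = {}
for v2, t2 in tid_D2.items():
    t = inst & t2
    if not t:
        continue                                 # prune the whole branch
    for v3, t3 in tid_D3.items():
        t23 = t & t3
        if t23:
            mini_base[(v2, v3)] = len(t23)       # count measure of the cell

print(mini_base)   # {('b1', 'd1'): 1, ('b2', 'd2'): 1, ('b2', 'd3'): 1}
# This small base cuboid on (D2, D3) is then cubed trivially to answer the query.
```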
Question 4.
A database has 5 transactions. Let min_sup = 60% and min_conf = 80%.
TID   Items bought
T100  M, O, N, K, E, Y
T200  D, O, N, K, E, Y
T300  M, A, K, E
T400  M, U, C, K, Y
T500  C, O, O, K, I, E


1, Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the efficiency of
the two mining processes.
Answer (from luu1):
Apriori:

FP-Growth:
- About data scans: Apriori needs to scan the database repeatedly, once per level, to accumulate
the supports of the k-itemset candidates and check their frequency. The FP-growth algorithm, on
the other hand, needs only 2 scans: one to identify the frequent 1-itemsets and a second to build
the FP-tree.
- About candidate generation: The Apriori algorithm can generate an exponential number of
candidate itemsets, and the self-join step of candidate generation is itself expensive. The
FP-growth algorithm does not generate any candidates.
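The frequent itemsets themselves can be cross-checked with a brute-force, level-wise enumeration in Python (a verification sketch in the spirit of Apriori, not the hand traces referenced above):

```python
from itertools import combinations

# Transactions from Question 4; min_sup = 60% of 5 transactions = 3.
transactions = [
    {"M", "O", "N", "K", "E", "Y"},
    {"D", "O", "N", "K", "E", "Y"},
    {"M", "A", "K", "E"},
    {"M", "U", "C", "K", "Y"},
    {"C", "O", "K", "I", "E"},   # the duplicate O in T500 counts only once
]
min_count = 3

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Level-wise enumeration: keep frequent k-itemsets, then join them to form
# (k+1)-item candidates.
frequent = {}
level = [frozenset({i}) for i in set().union(*transactions)]
k = 1
while level:
    level = [c for c in level if support(c) >= min_count]
    frequent.update({c: support(c) for c in level})
    level = list({a | b for a, b in combinations(level, 2) if len(a | b) == k + 1})
    k += 1

for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), frequent[itemset])
# Expected frequent itemsets (support counts):
#   1-itemsets: E:4, K:5, M:3, O:3, Y:3
#   2-itemsets: {E,K}:4, {E,O}:3, {K,M}:3, {K,O}:3, {K,Y}:3
#   3-itemset : {E,K,O}:3
```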

2, List all the association rules (with support s and confidence c) matching the following metarule,
where X is a variable representing customers, and itemi denotes variables representing items (e.g.,
"A", "B", etc.):
for all X in transaction: buys(X, item1) and buys(X, item2) -> buys(X, item3) [s, c]
Answer (from luu1):
- buys(X, E) and buys(X, O) -> buys(X, K) [60%, 100%]
- buys(X, O) and buys(X, K) -> buys(X, E) [60%, 100%]
(The remaining candidate, buys(X, E) and buys(X, K) -> buys(X, O), does not qualify: its confidence
is 3/4 = 75% < 80%.)
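These two rules, and the fact that no other instantiation of the metarule qualifies, can be verified with a short Python check (the helper code below is illustrative only):

```python
from itertools import permutations

# The 5 transactions; only individually frequent items (E, K, M, O, Y) can
# appear in a rule whose support is at least 60%.
T = [set("MONKEY"), set("DONKEY"), set("MAKE"), set("MUCKY"), set("COOKIE")]

def sup(itemset):
    return sum(1 for t in T if itemset <= t)

for a, b, c in permutations("EKMOY", 3):
    if a < b:                                   # the rule body {item1, item2} is unordered
        body, full = {a, b}, {a, b, c}
        if sup(full) / len(T) >= 0.6 and sup(full) / sup(body) >= 0.8:
            print(f"buys(X,{a}) and buys(X,{b}) -> buys(X,{c}) "
                  f"[{sup(full)/len(T):.0%}, {sup(full)/sup(body):.0%}]")
# buys(X,E) and buys(X,O) -> buys(X,K) [60%, 100%]
# buys(X,K) and buys(X,O) -> buys(X,E) [60%, 100%]
```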
