(1) What are the advantages of the cube operator compared to the operators rollup, drilldown, and pivot?
Cube generalizes rollup, drilldown, and pivot, and thus subsumes them. The benefit is that if we precompute and materialize the cube, we facilitate a whole class of data explorations (and queries). Exploration and visualization of the cube, or of subsets of group-bys from the cube, can in turn enable the detection of patterns that are otherwise hard to find.
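As an illustrative sketch (my own, not part of the original answer), the following Python snippet contrasts the group-bys produced by cube, which are all 2^n subsets of the dimensions, with the n+1 prefix group-bys a single rollup produces:

```python
from itertools import combinations

def cube_groupbys(dims):
    """All 2^n group-bys produced by CUBE (every subset of the dimensions)."""
    return {frozenset(c) for r in range(len(dims) + 1)
            for c in combinations(dims, r)}

def rollup_groupbys(dims):
    """Only the n+1 prefix group-bys produced by a single ROLLUP."""
    return {frozenset(dims[:i]) for i in range(len(dims) + 1)}

dims = ["product", "time", "geography"]
print(len(cube_groupbys(dims)))    # 8 group-bys
print(len(rollup_groupbys(dims)))  # 4 group-bys
# Every rollup group-by is contained in the cube's group-bys.
assert rollup_groupbys(dims) <= cube_groupbys(dims)
```

Materializing the cube therefore answers any rollup, drilldown, or pivot query over these dimensions without touching the base data.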
(2) Consider a data warehouse with the dimensions D1, D2, D3, D4, D5, which have 999, 99, 24, 4, and 19 values respectively. Suppose the sparsity factor of the cube is 10%. What is the estimated number of tuples in the sparse cube?
The size of the full cube is (999+1)(99+1)(24+1)(4+1)(19+1) = 250,000,000 tuples. (The +1 for each dimension accounts for the ALL value introduced by aggregation.) With a sparsity factor of 10%, the size of the sparse cube is about 250,000,000 × 0.1 = 25,000,000 tuples.
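The arithmetic can be checked with a short back-of-the-envelope script (a sketch of my own, not from the original answer):

```python
# Estimated size of the sparse cube from question (2).
# Each dimension contributes (|D| + 1) values: the extra one is the
# ALL (aggregate) value. The sparse cube keeps ~10% of the cells.
cardinalities = [999, 99, 24, 4, 19]
sparsity = 0.10

full_cube = 1
for n in cardinalities:
    full_cube *= n + 1  # +1 for the ALL value

sparse_cube = round(full_cube * sparsity)
print(full_cube)    # 250000000
print(sparse_cube)  # 25000000
```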
(3) Consider computing the full cube over the dimensions {Product, Time, Geography}, using sorting to speed up the computation of the group-bys that make up the cube. Explain with a diagram of the cube lattice how you'd order the dimensions of each group-by in order to minimize the number of sort operations required.
P = product; T = time; G = geography. Each path below corresponds to one sorted pass (shown in red in the original lattice diagram):

  PGT -> PG -> P -> {}
  GT -> G
  TP -> T

By ordering the sort attributes as shown, we can compute the entire cube in 3 sort passes of pipesort: sorting on (P, G, T) yields PGT and its prefixes PG, P, and {}; sorting on (G, T) yields GT and G; and sorting on (T, P) yields TP and T.
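As a sanity check (my own sketch, not part of the original answer): a sorted pass computes every prefix of its sort order for free, so the three passes above together cover all eight group-bys of the cube:

```python
# Each sorted pass over the fact data computes all prefix group-bys of
# its sort order. Check that the three passes cover the whole lattice.
passes = [("P", "G", "T"), ("G", "T"), ("T", "P")]

covered = {()}  # the empty group-by {} falls out of any pass
for order in passes:
    for i in range(1, len(order) + 1):
        # Group-bys are sets, so normalize each prefix to sorted order.
        covered.add(tuple(sorted(order[:i])))

all_groupbys = {(), ("P",), ("G",), ("T",),
                ("G", "P"), ("G", "T"), ("P", "T"),
                ("G", "P", "T")}
assert covered == all_groupbys
print(len(passes))  # 3 sorts suffice
```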
(4) Consider the following cube lattice, with numbers showing the estimated sizes of the group-bys, where M indicates a million tuples. If you are allowed to materialize 3 group-bys, which are the best three you'd materialize in order to optimize the evaluation of queries corresponding to the various group-bys in the cube lattice?
(a,b,c): 10M
(a,b): 8M    (b,c): 6M    (a,c): 4M
(a): 2M    (b): 4M    (c): 5M (revised to 3.5M below)
{}: 1
First, notice that there is an inconsistency in the size estimates provided: (c) has a size of 5M, whereas (a,c) has a size of 4M. This is impossible, since the size of group-bys cannot grow as you go down the lattice. Let's work with a revised estimate for (c) of 3.5M; this is the size we will use below.
Of the three group-bys we are allowed to materialize, one is taken: we must always materialize the top element of the cube lattice, for it cannot be derived from any other group-by. So, that leaves two more group-bys to choose. The following table tracks the marginal gain of each group-by, given what has been materialized in previous rounds. Initially, only abc is materialized.
The group-by that wins the greedy choice in each round is marked with an asterisk (*) in the table below.
Remember, the marginal saving for a single group-by = the sum of the marginal savings for all group-bys that can be derived from it, compared with the current cheapest way of computing each of those group-bys. E.g., the group-bys that can be derived from ab are ab (yes, you include itself), a, b, and {}.
Initially, the cheapest way to compute each of them is to use the top group-by abc, which has a cost of 10M. On materializing ab, these four group-bys can be computed at a cheaper cost of 8M. So, the marginal gain of ab = 4 group-bys × savings on each = 4 × 2M. Keep in mind, sometimes the savings on different group-bys derivable from the same group-by can be different.
Note: For simplicity, we ignore subtraction of small numbers from millions: e.g., we write 10M − 1 as just 10M.
Group-by   ab        bc        ac        a         b         c         {}
Round 1    4 × 2M    4 × 4M    4 × 6M *  2 × 8M    2 × 6M    2 × 6.5M  10M
Round 2    2 × 2M    2 × 4M *  --        2 × 2M    1 × 6M    2 × 0.5M  4M

(* = greedy winner of that round; -- = already materialized)
Thus, the two additional group-bys we should materialize according to the greedy algorithm are ac and bc.
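The two greedy rounds can be reproduced mechanically. Below is a small sketch of my own (not part of the original solution), with sizes in millions, (c) at the revised 3.5M, and {} approximated as 1 tuple:

```python
from itertools import combinations

# Group-by sizes from the lattice, in millions of tuples.
size = {"abc": 10, "ab": 8, "bc": 6, "ac": 4,
        "a": 2, "b": 4, "c": 3.5, "": 1e-6}  # {} holds ~1 tuple

def derivable(v):
    """Group-bys computable from v: all subsets of v's dimensions."""
    return {"".join(s) for r in range(len(v) + 1)
            for s in combinations(v, r)}

materialized = {"abc"}  # the top view must always be materialized
for round_no in (1, 2):  # we may pick two more group-bys
    def cheapest(g):
        # Current cheapest cost of answering g from a materialized view.
        return min(size[m] for m in materialized if g in derivable(m))
    def gain(v):
        # Marginal saving of materializing v, summed over all group-bys
        # derivable from v (including v itself).
        return sum(max(cheapest(g) - size[v], 0) for g in derivable(v))
    best = max((v for v in size if v not in materialized), key=gain)
    print(round_no, best, gain(best))  # round, winner, gain in M
    materialized.add(best)

# Greedy picks ac (gain 4 x 6M = 24M), then bc (gain 2 x 4M = 8M).
```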
(5) Consider the following transaction database.

Transaction_id   Basket of items
t1               {a,c,d,f}
t2               {a,b,d,e,g}
t3               {b,c,d,e}
t4               {a,b,c,d}

Suppose minSup = 3 and minConf = 2/3.
(a) Using the Apriori algorithm, find all itemsets that are frequent, i.e., have a support ≥ minSup.
Round 1: sup(a) = 3; sup(b) = 3; sup(c) = 3; sup(d) = 4; sup(e) = 2; sup(f) = 1; sup(g) = 1. Discard e, f, g and their supersets, as their support is < minSup = 3. Candidates for round 2 = ab, ac, ad, bc, bd, cd.
Round 2: sup(ab) = 2; sup(ac) = 2; sup(ad) = 3; sup(bc) = 2; sup(bd) = 3; sup(cd) = 3. Discard ab, ac, bc. Candidates for round 3 = abd, acd, bcd.
Round 3: sup(abd) = 2; sup(acd) = 2; sup(bcd) = 2. All discarded. No candidates remain for round 4, so we stop! The frequent itemsets are a, b, c, d, ad, bd, cd.
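As a cross-check (my own sketch, not part of the original solution), a minimal Apriori in Python over these four transactions reproduces the same frequent itemsets:

```python
# Minimal Apriori sketch over the transaction database from question (5).
transactions = [{"a", "c", "d", "f"}, {"a", "b", "d", "e", "g"},
                {"b", "c", "d", "e"}, {"a", "b", "c", "d"}]
min_sup = 3

def support(itemset):
    """Number of transactions that contain the itemset."""
    return sum(itemset <= t for t in transactions)

items = set().union(*transactions)
level = [frozenset([i]) for i in items if support({i}) >= min_sup]
frequent = []
while level:
    frequent += level
    # Join step: union pairs of frequent k-itemsets into (k+1)-candidates.
    candidates = {a | b for a in level for b in level
                  if len(a | b) == len(a) + 1}
    level = [c for c in candidates if support(c) >= min_sup]

print(sorted("".join(sorted(f)) for f in frequent))
# ['a', 'ad', 'b', 'bd', 'c', 'cd', 'd']
```

Note that this simple join, like the hand trace above, generates abd, acd, and bcd as round-3 candidates before support counting discards them.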
(b) Based on (a), find all strong association rules, i.e., association rules whose confidence ≥ minConf.
We use all frequent itemsets found in (a) above to form ARs. Singleton itemsets never contribute to non-trivial ARs. That leaves only ad, bd, cd. conf(a → d) = sup(ad)/sup(a) = 3/3 = 1.