
Understanding the Difference Between Column-Stores and OLAP Data Cubes

by smadden on July 7th, 2008

in big data

Both column-stores and data cubes are designed to provide high performance on analytical database workloads (often referred to as Online Analytical Processing, or OLAP). These workloads are characterized by queries that select a subset of tuples and then aggregate and group along one or more dimensions. For example, in a sales database, one might wish to find the sales of technology products by month and store. The SQL query to do this would look like:
SELECT month, store, COUNT(*)
FROM sales, products
WHERE productType = 'technology'
AND products.id = sales.productID
GROUP BY month, store
In this post, we study how column-stores and data cubes would evaluate this query on a sample database.
Column Store Analysis
In a column-store, this query would be answered by scanning the productType column of the products table to find the ids that have type technology. These ids would then be used to filter the productID column of the sales table to find the positions of records with the appropriate product type. Finally, these positions would be used to select data from the month and store columns for input into the GROUP BY operator. Unlike a row-store, the column-store only has to read a few columns of the sales table (which, in most data warehouses, would contain tens of columns), making it significantly faster than most commercial relational databases, which use row-based technology.
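The column-at-a-time evaluation described above can be sketched in a few lines of Python. This is a minimal illustration, not a real column-store: each column is just a Python list, and the sample data (product ids, months, stores) is invented to match the results discussed later in the post.

```python
# Sketch of column-at-a-time query evaluation over lists-as-columns.
from collections import Counter

# products table, stored column-wise (illustrative data)
product_id   = [1, 2, 3, 4, 5]
product_type = ["clothing", "food", "technology", "technology", "technology"]

# sales table, stored column-wise, sorted on (productID, month, storeID)
sales_product = [1, 1, 2, 3, 3, 4, 5]
sales_month   = ["Jan", "Feb", "Feb", "Feb", "Jun", "Feb", "Oct"]
sales_store   = [1, 1, 2, 2, 2, 3, 3]

# Step 1: scan products.productType to find matching ids
tech_ids = {pid for pid, ptype in zip(product_id, product_type)
            if ptype == "technology"}

# Step 2: scan sales.productID to find positions of matching records
positions = [i for i, pid in enumerate(sales_product) if pid in tech_ids]

# Step 3: read only the month and store columns at those positions, then group
counts = Counter((sales_month[i], sales_store[i]) for i in positions)
print(counts)
```

Note that steps 2 and 3 never touch any other column of the sales table; in a wide warehouse table, that is where the savings over a row-store come from.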
Also, if the table is sorted on some combination of the attributes used in the query (or if a materialized view or projection of the table sorted on these attributes is available), then substantial performance gains can be obtained, both from compression and from the ability to directly offset to ranges of satisfying tuples. For example, notice that the sales table is sorted on productID, then month, then storeID. Here, all of the records for a given productID are co-located, so the extraction of matching productIDs can be done very quickly using binary search or a sparse index that gives the first record of each distinct productID. Furthermore, the productID column can be effectively run-length encoded to avoid storing repeated values, which will use much less storage space. Run-length encoding will also be effective on the month and storeID columns, since for a group of records representing a specific productID, month is sorted, and for a group of records representing a given (productID, month) pair, storeID is sorted. For example, if there are 1,000,000 sales records of about 1,000 products sold by 10 stores, with sales uniformly distributed across products, months, and stores, then the productID column can be stored in 1,000 entries (one per product), the month column in 1,000 x 12 = 12,000 entries, and the storeID column in 1,000 x 12 x 10 = 120,000 entries. This compression means that the amount of data read from disk is less than 5% of its uncompressed size.
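Run-length encoding of a sorted column is simple enough to sketch directly; the encoder below is illustrative, not a real storage format:

```python
# Minimal run-length encoder for a sorted column: collapse each run of
# repeated values into a single (value, run_length) pair.
def rle_encode(column):
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([value, 1])   # start a new run
    return [(v, n) for v, n in runs]

# A sorted productID column: repeated values collapse into single runs.
column = [1, 1, 1, 2, 2, 3, 3, 3, 3]
print(rle_encode(column))  # [(1, 3), (2, 2), (3, 4)]
```

With the numbers above, the three encoded columns hold 1,000 + 12,000 + 120,000 = 133,000 entries instead of 3 x 1,000,000 raw values, roughly 4.4% of the uncompressed size, which is where the "less than 5%" figure comes from.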
Data Cube Analysis
Data cube-based solutions (sometimes referred to as MOLAP systems, for multidimensional online analytical processing) are represented by commercial products such as EssBase. They store data in array-like structures, where the dimensions of the array represent columns of the underlying tables and the values of the cells represent pre-computed aggregates over the data. A data cube on the product, store, and month attributes of the sales table, for example, would be stored in an array format as shown in the figure above. Here, the cube includes roll-up cells that summarize the values of the cells in the same row, column, or stack (x,y position). If we want to use a cube to compute the values of the COUNT aggregate, as in the query above, the cells of this cube would look like:

Here, each cell contains the count of the number of records with a given (productID, month, storeID) value. For example, there is one record with storeID=1, productID=2, and month=April. The sum fields indicate the values of the COUNT rolled up on specific dimensions; for example, looking at the lower left-hand corner of the cube for Store 1, we can see that in storeID 1, productID 1 was sold twice across all months. Thus, to answer the above query using a data cube, we first identify the subset of the cube that satisfies the WHERE clause (here, products 3, 4, and 5 are technology products; this is indicated by their dark shading in the above figure). Then, the system reads the pre-aggregated values from the sum fields for the unrestricted attributes (store and month), which gives the result that store 2 had 1 technology sale in February and 1 in June, and that store 3 had 1 technology sale in February and 1 in October.
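The cube lookup described above can be sketched as a dense 3D array of counts. The dimensions and sample sales below are illustrative (0-indexed ids, not the post's exact figure), but the query logic — restrict one dimension, aggregate over it, read out the rest — is the same:

```python
# Sketch of a COUNT data cube as a dense 3D array:
# cube[store][product][month] holds the count of matching sales records.
N_STORES, N_PRODUCTS, N_MONTHS = 3, 5, 12
cube = [[[0] * N_MONTHS for _ in range(N_PRODUCTS)] for _ in range(N_STORES)]

# Load phase: each sale increments exactly one cell.
sales = [(1, 2, 1), (1, 2, 5), (2, 3, 1), (2, 4, 9)]  # (store, product, month)
for store, product, month in sales:
    cube[store][product][month] += 1

# Query: COUNT of technology sales (say, products 2-4) by (month, store),
# summing over the restricted product dimension.
tech_products = [2, 3, 4]
result = {(m, s): sum(cube[s][p][m] for p in tech_products)
          for s in range(N_STORES) for m in range(N_MONTHS)
          if sum(cube[s][p][m] for p in tech_products) > 0}
print(result)  # {(1, 1): 1, (5, 1): 1, (1, 2): 1, (9, 2): 1}
```

A real MOLAP system would additionally pre-store the roll-up sums along each dimension, so the inner summation over tech_products would be a pre-computed read rather than a loop.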
The advantages of a data cube should be clear: it contains pre-computed aggregate values that make it a very compact and efficient way to retrieve answers for specific aggregate queries. It can be used to efficiently compute a hierarchy of aggregates; for example, the sum columns in the above cube make it very fast to compute the number of sales in a given month across all stores, or the number of sales of a particular product across the entire year in a given store. Because the data is stored in an array structure and each element is the same size, direct offsetting to particular values may be possible. However, data cubes have several limitations:
Sparsity: Looking at the above cube, most of the cells are empty. This is not simply an artifact of our sample data set being small: the number of cells in a cube is the product of the cardinalities of the dimensions in the cube. Our 3D cube with 10 stores, 1,000 products, and 12 months would have 120,000 cells, and adding a fourth dimension, such as customerID (with, say, 10,000 values), would cause the number of cells to balloon to 1.2 billion! Such high-dimensionality cubes cannot be stored without compression. Unfortunately, compression can limit performance somewhat, as direct offsetting is no longer possible. For example, a common technique is to store the cube as a table containing the values and positions of the non-empty cells, resulting in an implementation much like a row-oriented relational database!
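That sparse-cube technique can be sketched as follows, assuming the same dense 3D layout as before; the function and data here are illustrative:

```python
# Sketch of a sparse cube: keep only non-empty cells as
# (position -> value) rows, much like rows in a relational table.
def to_sparse(cube):
    """Convert a dense 3D list into {(store, product, month): count} rows."""
    rows = {}
    for s, plane in enumerate(cube):
        for p, line in enumerate(plane):
            for m, count in enumerate(line):
                if count:
                    rows[(s, p, m)] = count
    return rows

dense = [[[0, 0], [2, 0]], [[0, 1], [0, 0]]]  # 2 stores x 2 products x 2 months
print(to_sparse(dense))  # {(0, 1, 0): 2, (1, 0, 1): 1}
```

Note what is lost: in the dense array a cell's location is a simple arithmetic offset, whereas here each lookup requires a hash or search over the stored positions, which is exactly the row-store-like behavior the text warns about.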
Inflexible, limited ad-hoc query support: Data cubes work great when a cube aggregated on the dimensions of interest, using the desired aggregation functions, is available. Consider, however, what happens in the above example if the user wants to compute the average sale price rather than the count of sales, or if the user wants to include aggregates on customerID in addition to the other attributes. If no cube is available, the user has no choice but to fall back to queries on an underlying relational system. Furthermore, if the user wants to drill down into the underlying data, asking, for example, "who was the customer who bought a technology product at store 2 in February?", the cube cannot be used (one could imagine storing entire tuples, or pointers to tuples, in the cells of a cube, but, like sparse representations, this significantly complicates the representation of a cube and can lead to storage space explosions). To deal with these limitations, some cube systems support what is called HOLAP, or hybrid online analytical processing, where they will automatically redirect queries that cannot be answered with cubes to a relational system, but such queries run only as fast as the relational system that executes them.
Long load times: Computing a cube requires a complex aggregate query over all of the data in a warehouse (essentially, every record has to be read from the database). Though it is possible to incrementally update cubes as new data arrives, it is impractical to dynamically create new cubes to answer ad-hoc queries.


Summary and Discussion
Data cubes work well in environments where the query workload is predictable, so that cubes needed to answer specific queries can be pre-computed. They are inappropriate for ad-hoc queries or in situations where complex relational expressions are needed.
In contrast, column-stores provide very good performance across a much wider range of queries (all of SQL!). However, for low-dimensionality pre-computed aggregates, it is likely that a data-cube solution will outperform a column-store. For many-dimensional aggregates, the tradeoff is less clear, as sparse cube representations are unlikely to perform any better than a column-store.

Finally, it is worth noting that there is no reason cubes cannot be combined with column-stores, especially in a HOLAP-style configuration where queries not directly answerable from a cube are redirected to an underlying column-store system. That said, given that column-stores will typically get very good performance on simple aggregate queries (even if cubes are slightly faster), it is not clear whether the incremental cost of maintaining and loading an additional cube system to compute aggregates is ever worthwhile in a column-store world. Furthermore, existing HOLAP products, which are based on row-stores, are likely to be an order of magnitude or more slower than column-stores on ad-hoc queries that cannot be answered by the MOLAP system, for the same reasons discussed elsewhere in this blog.