
Differences between CUBES and Star Schema

Most of us should by now be familiar with the differences between a pure OLTP relational design and a Star Schema design. If not, here follows a brief discussion of Star Schemas.

We're not going to address MDM (Master Data Management) issues in this thread. One typical MDM issue is to ensure that Profit, Customer and Sale (all business terms) mean the same thing in all environments and all star schemas. As stated, we're not discussing that here.

Each star schema usually has one central fact table, like Sales or CampaignResponses. Linked to this fact are the dimensions (attributes about the fact). For instance, when we made a sale, it is not sufficient to know how much; we also need to know the following (a minimal schema sketch follows the list):

To whom
For what (product/s)
When
By whom
When was it shipped / received
Anything else the business might find important
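As a minimal sketch (all table and column names here are illustrative, not taken from any particular system), the fact table for such a sale might look something like this, with one foreign key per question above and the measures alongside:

-- Illustrative star-schema fact table: one row per sale,
-- one foreign key per dimension ("to whom", "for what", "when",
-- "by whom", "when shipped"), plus the numeric measures.
CREATE TABLE FactSales (
    SaleKey        BIGINT        NOT NULL PRIMARY KEY,
    CustomerKey    INT           NOT NULL,  -- to whom      -> DimCustomer
    ProductKey     INT           NOT NULL,  -- for what     -> DimProduct
    DateKey        INT           NOT NULL,  -- when (sold)  -> DimDate
    SalesPersonKey INT           NOT NULL,  -- by whom      -> DimSalesPerson
    ShipDateKey    INT           NOT NULL,  -- when shipped -> DimDate (role-playing)
    SalesAmount    DECIMAL(18,2) NOT NULL,  -- measure
    Quantity       INT           NOT NULL   -- measure
);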

When a database is properly configured as a data warehouse, it expects wide fact tables, with many foreign keys to the primary keys of the dimensions. In this way, if you are looking for a particular product's or customer's details in the fact table, the join is super-fast and the rows are retrieved efficiently. Better still is when you are running the database as a column-oriented database.

Star Schemas require that the dimensions be loaded prior to the facts, and if a dimension member is not present, there should always be a catch-all row for undefined values. In this way, there will ALWAYS be a valid foreign key to a dimension for EACH row of the fact table. This is typical of the Kimball design methodology, and it works.

Star Schemas are typically loaded by some form of ETL (Extract, Transform, Load) process, though these days, with the speed and affordability of modern disk, the approach really should be renamed ELT, since most of the transformation can be performed much more efficiently in staging tables in the data warehouse rather than externally, the way it used to be done back in the 90's.

A data mart is often a collection of one or more star schemas, and in the Kimball definition, the collection of data marts forms the basis of the data warehouse. The Inmon methodology is more of a purist approach, takes much longer to implement, and essentially states that the Data Warehouse is the source of all information, and that the data marts are then extracted from the DW and distributed. Fair enough and fantastic in principle; however, the practical reality is that most organizations are reluctant to spend the money required to enjoy the benefits of this purist approach. Let me say that where I have seen this work in large organizations, it can be a beautiful thing to study. But back in the world of the mass majority, we're probably predestined (in 2010) to go down the Kimball design route and create ourselves a data mart that will serve as a datasource for reports or dashboards.
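To make the catch-all idea concrete, here is a minimal sketch of the pattern, continuing with the illustrative FactSales names from above (a hypothetical DimCustomer, DimProduct and StagingSales are assumed): a reserved "Unknown" member in the dimension, and a load step that falls back to it so no fact row ever carries an orphaned foreign key.

-- Reserve a well-known surrogate key for "Unknown" before any facts are loaded.
CREATE TABLE DimCustomer (
    CustomerKey  INT          NOT NULL PRIMARY KEY,  -- surrogate key
    CustomerCode VARCHAR(20)  NOT NULL,              -- natural/business key
    CustomerName VARCHAR(100) NOT NULL
);

INSERT INTO DimCustomer (CustomerKey, CustomerCode, CustomerName)
VALUES (-1, 'UNKNOWN', 'Unknown customer');

-- During the (E)LT step, look up the surrogate keys from the staging rows and
-- fall back to -1 whenever a dimension member has not arrived yet.
INSERT INTO FactSales (DateKey, CustomerKey, ProductKey, SalesAmount)
SELECT s.DateKey,
       COALESCE(c.CustomerKey, -1) AS CustomerKey,
       COALESCE(p.ProductKey,  -1) AS ProductKey,
       s.SalesAmount
FROM   StagingSales s
LEFT JOIN DimCustomer c ON c.CustomerCode = s.CustomerCode
LEFT JOIN DimProduct  p ON p.ProductCode  = s.ProductCode;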

Cubes, on the other hand, have the advantage of being a multi-dimensional data object, on top of which you can place various tools that will interrogate the cube. Cubes are built using various approaches; the most well-known of these is probably Microsoft Analysis Services. A cube typically uses a Star Schema (or a relational source, if required) as its source and builds a structure with precomputed aggregates that can be sliced and diced any which way, by any dimension / attribute, to give us near-instant results (a rough SQL sketch of this idea follows the advantages list below).

The question pops up time and time again: "So why not just build a cube and all our problems will be solved?" A very good question indeed! This is where a good understanding is needed of exactly what a cube and a star schema are, how they relate, and when it is appropriate to use which.

To create the perfect cube, you would need to understand what the end users will use it for. A large manufacturing company I did some work at had me analyze their cubes, and I saw one particular cube repeated almost identically 125 times, each copy using up about 5 GB of space. The problem here was that each set of users wanted just a tiny bit of difference in the information. As can be guessed, of these 125 cubes that were being refreshed and reloaded daily, only 3 were actually being used. Sounds like a lot of redundant, unnecessary work being done. When I investigated why only 3 cubes were being used, I ran into the same answer almost every time: every time we want a new attribute or a new measure, we need to redesign the cube, test it, migrate it, and then on the next change request get severely scolded for not anticipating the change in advance.

Once, in some presentation, I read or heard (and was rather bemused) that in order to build the perfect cube, you would need to anticipate every question the business end users of the cube might have for at least the next 5 years. Well, good luck with that; I would LOVE to know if this is achievable. In fact, if you HAVE done such a thing, I would like to consider hiring you, you water-walking, dragon-slaying saviour of the information world!

Advantages of a cube:
1. Very fast response for the information you have previously designed the cube to hold.
2. The drilldown path is totally dynamic; you get clean and structured aggregations of the measures in the cube.
3. A single structure to query: no joins and no detailed knowledge of the underlying data entities are required, since it is built on a fact.
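As promised above, here is a rough SQL analogy (illustrative names again) for what "precomputed aggregates by every combination of dimensions" means. This is only an analogy to show the shape of the aggregations, not how Analysis Services physically builds or stores a cube.

-- Aggregate sales by every combination of year, product category and
-- customer region: year only, year+category, grand total, and so on.
-- A cube engine materialises and indexes aggregates like these ahead of
-- time, which is where the near-instant response comes from.
SELECT d.CalendarYear,
       p.Category,
       c.Region,
       SUM(f.SalesAmount) AS SalesAmount
FROM   FactSales f
JOIN   DimDate     d ON d.DateKey     = f.DateKey
JOIN   DimProduct  p ON p.ProductKey  = f.ProductKey
JOIN   DimCustomer c ON c.CustomerKey = f.CustomerKey
GROUP BY CUBE (d.CalendarYear, p.Category, c.Region);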

Disadvantages of a cube:
1. Not suitable for ad-hoc queries.
2. They are rigid, like ice cubes. The computed aggregations are it; there are no calculated expressions on the fly. They have to be designed into the cube (see the SQL sketch after this list).
3. No real-time ability. If you have a real-time or NRT (near-real-time) DW, the cube unfortunately is not, and it has to be rebuilt in its entirety before you can analyze the latest data.
4. Drilldown goes only to the lowest level afforded by the cube. There will be a level at which, in order to see the information that comprises a summary in the cube, you will need to leave the cube and drill through to the underlying data.
5. Size: every attribute you add to a cube increases its size, and not just a little.
6. Size of the source data: if your organization creates millions of rows per day or week, building and maintaining a cube on this information can be a nightmare. Think of a cube where the underlying data comprises several billion rows. You would probably create a cube for the current and last month, and maybe a cube for the current and previous quarter, but beyond that, it can be a pretty ghoulish task to maintain.
7. Maintenance: if you need several hours to load the daily data, you will need considerably more as your implementation matures, to keep refreshing the cubes with this growing data monster.
8. Complexity: you cannot use SQL to query cubes; you will need to use MDX (Multidimensional Expressions) or XMLA, or rely on a tool that hides these things from you. So there is often a steeper learning curve to get started with cube technology.
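To illustrate points 1 and 2, here is a small sketch of what the star schema gives you that a processed cube does not: an ad-hoc calculated measure is just another expression in the SELECT list (illustrative names, and I'm assuming the fact also carries a CostAmount column).

-- Adding GrossProfit and GrossMarginPct here is a one-line change to the
-- query; in a cube it would typically mean redesigning, reprocessing and
-- redeploying the cube.
SELECT p.Category,
       SUM(f.SalesAmount)                AS SalesAmount,
       SUM(f.SalesAmount - f.CostAmount) AS GrossProfit,
       SUM(f.SalesAmount - f.CostAmount)
         / NULLIF(SUM(f.SalesAmount), 0) AS GrossMarginPct
FROM   FactSales f
JOIN   DimProduct p ON p.ProductKey = f.ProductKey
GROUP BY p.Category;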

By now you have probably picked up that I am a greater fan of a very fast Star Schema than of any cube. Let's put this into practical reality: I designed a star schema for a customer on an MS SQL Server database, with some 20 million rows in the fact table (current stock on hand). I could have created a cube and dealt with the issues above. However, using some smart SQL and some decent indexing, I can run a report or dashboard that can interrogate any level of the product hierarchy and any level of the store hierarchy (several hundred stores) and return summary and detailed information in no more than 3 seconds. With that kind of response, I have to say that the effort of building and maintaining a cube on this data seems pointless.
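I obviously can't reproduce the customer's schema here, but the general shape of that "smart SQL plus decent indexing" approach is roughly the following sketch (all names and the index are illustrative assumptions, not the actual implementation):

-- Roll stock on hand up to a chosen level of the product and store
-- hierarchies. A covering index on the two foreign keys, including the
-- measures, keeps the star join fast.
CREATE INDEX IX_FactStockOnHand_Store_Product
    ON FactStockOnHand (StoreKey, ProductKey)
    INCLUDE (QtyOnHand, StockValue);

DECLARE @Region   VARCHAR(50) = 'North',
        @Category VARCHAR(50) = 'Beverages';

SELECT st.Region,
       st.StoreName,
       p.Category,
       p.SubCategory,
       SUM(f.QtyOnHand)  AS QtyOnHand,
       SUM(f.StockValue) AS StockValue
FROM   FactStockOnHand f
JOIN   DimStore   st ON st.StoreKey  = f.StoreKey
JOIN   DimProduct p  ON p.ProductKey = f.ProductKey
WHERE  st.Region  = @Region      -- drop the predicate to see all regions
  AND  p.Category = @Category    -- drop to roll up the whole product tree
GROUP BY st.Region, st.StoreName, p.Category, p.SubCategory;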
