CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

CHAPTER 1

Introduction to Data Warehousing

Information assets are immensely valuable to any enterprise, and because of this,
these assets must be properly stored and readily accessible when they are needed.
However, the availability of too much data makes the extraction of the most
important information difficult, if not impossible. View results from any Google search,
and you'll see that the data = information equation is not always correct; that is, too
much data is simply too much.
Data warehousing is a phenomenon that grew from the huge amount of electronic data
stored in recent years and from the urgent need to use that data to accomplish goals that go
beyond the routine tasks linked to daily processing. In a typical scenario, a large corporation
has many branches, and senior managers need to quantify and evaluate how each branch
contributes to the global business performance. The corporate database stores detailed data
on the tasks performed by branches. To meet the managers' needs, tailor-made queries can
be issued to retrieve the required data. In order for this process to work, database
administrators must first formulate the desired query (typically an aggregate SQL query)
after closely studying database catalogs. Then the query is processed. This can take a few
hours because of the huge amount of data, the query complexity, and the concurrent effects
of other regular workload queries on data. Finally, a report is generated and passed to
senior managers in the form of a spreadsheet.
Many years ago, database designers realized that such an approach is hardly feasible,
because it is very demanding in terms of time and resources, and it does not always achieve
the desired results. Moreover, a mix of analytical queries with transactional routine queries
inevitably slows down the system, and this does not meet the needs of users of either type
of query. Today's advanced data warehousing processes separate online analytical processing
(OLAP) from online transactional processing (OLTP) by creating a new information repository
that integrates basic data from various sources, properly arranges data formats, and then
makes data available for analysis and evaluation aimed at planning and decision-making
processes (Lechtenbörger, 2001).

ch01.indd1

4/21/093:23:27PM

Let's review some fields of application for which data warehouse technologies are
successfully used:
• Trade: Sales and claims analyses, shipment and inventory control, customer care
  and public relations
• Craftsmanship: Production cost control, supplier and order support
• Financial services: Risk analysis and credit cards, fraud detection
• Transport industry: Vehicle management
• Telecommunication services: Call flow analysis and customer profile analysis
• Health care service: Patient admission and discharge analysis and bookkeeping in
  accounts departments
The field of application of data warehouse systems is not only restricted to enterprises,
but it also ranges from epidemiology to demography, from natural science to education.
A property that is common to all fields is the need for storage and query tools to retrieve
information summaries easily and quickly from the huge amount of data stored in
databases or made available by the Internet. This kind of information allows us to study
business phenomena, learn about meaningful correlations, and gain useful knowledge to
support decision-making processes.

1.1 Decision Support Systems


Until the mid-1980s, enterprise databases stored only operational data, that is, data created by
business operations involved in daily management processes, such as purchase management,
sales management, and invoicing. However, every enterprise must have quick, comprehensive
access to the information required by decision-making processes. This strategic information is
extracted mainly from the huge amount of operational data stored in enterprise databases by
means of a progressive selection and aggregation process shown in Figure 1-1.

FIGURE 1-1 Information value as a function of quantity

An exponential increase in operational data has made computers the only tools suitable
for providing data for decision-making performed by business managers. This fact has
dramatically affected the role of enterprise databases and fostered the introduction of
decision support systems. The concept of decision support systems mainly evolved from two
research fields: theoretical studies on decision-making processes for organizations and
technical research on interactive IT systems. However, the decision support system concept
is based on several disciplines, such as databases, artificial intelligence, man-machine
interaction, and simulation. Decision support systems became a research field in the
mid-'70s and became more popular in the '80s.

Decision Support System

A decision support system (DSS) is a set of expandable, interactive IT techniques and
tools designed for processing and analyzing data and for supporting managers
in decision making. To do this, the system matches individual resources of managers
with computer resources to improve the quality of the decisions made.

In practice, a DSS is an IT system that helps managers make decisions or choose among
different alternatives. The system provides value estimates for each alternative, allowing the
manager to critically review the results. Table 1-1 shows a possible classification of DSSs on
the basis of their functions (Power, 2002).
From the architectural viewpoint, a DSS typically includes a model-based management
system connected to a knowledge engine and, of course, an interactive graphical user
interface (Sprague and Carlson, 1982). Data warehouse systems have been managing the
data back-ends of DSSs since the 1990s. They must retrieve useful information from a huge
amount of data stored on heterogeneous platforms. In this way, decision-makers can
formulate their queries and conduct complex analyses on relevant information without
slowing down operational systems.
System                    Description
Passive DSS               Supports decision-making processes, but it does not offer
                          explicit suggestions on decisions or solutions.
Active DSS                Offers suggestions and solutions.
Collaborative DSS         Operates interactively and allows decision-makers to
                          modify, integrate, or refine suggestions given by the system.
                          Suggestions are sent back to the system for validation.
Model-driven DSS          Enhances management of statistical, financial, optimization,
                          and simulation models.
Communication-driven DSS  Supports a group of people working on a common task.
Data-driven DSS           Enhances the access and management of time series of
                          corporate and external data.
Document-driven DSS       Manages and processes nonstructured data in many formats.
Knowledge-driven DSS      Provides problem-solving features in the form of facts, rules,
                          and procedures.

TABLE 1-1 Classification of Decision Support Systems

1.2 Data Warehousing


Data warehouse systems are probably the systems to which academic communities and
industrial bodies have been paying the greatest attention among all the DSSs. Data
warehousing can be informally defined as follows:

Data Warehousing

Data warehousing is a collection of methods, techniques, and tools used to support
knowledge workers (senior managers, directors, managers, and analysts) to conduct
data analyses that help with performing decision-making processes and improving
information resources.

The definition of data warehousing presented here is intentionally generic; it gives you
an idea of the process but does not include specific features of the process. To understand the
role and the useful properties of data warehousing completely, you must first understand the
needs that brought it into being. In 1996, R. Kimball efficiently summed up a few claims
frequently submitted by end users of classic information systems:
• "We have heaps of data, but we cannot access it!" This shows the frustration of those
  who are responsible for the future of their enterprises but have no technical tools to
  help them extract the required information in a proper format.
• "How can people playing the same role achieve substantially different results?" In midsize
  to large enterprises, many databases are usually available, each devoted to a specific
  business area. They are often stored on different logical and physical media that are
  not conceptually integrated. For this reason, the results achieved in every business
  area are likely to be inconsistent.
• "We want to select, group, and manipulate data in every possible way!" Decision-making
  processes cannot always be planned before the decisions are made. End users need
  a tool that is user-friendly and flexible enough to conduct ad hoc analyses. They
  want to choose which new correlations they need to search for in real time as they
  analyze the information retrieved.
• "Show me just what matters!" Examining data at the maximum level of detail is not
  only useless for decision-making processes, but is also self-defeating, because it
  does not allow users to focus their attention on meaningful information.
• "Everyone knows that some data is wrong!" This is another sore point. An appreciable
  percentage of transactional data is not correct or it is unavailable. It is clear that you
  cannot achieve good results if you base your analyses on incorrect or incomplete data.
We can use the previous list of problems and difficulties to extract a list of keywords
that become distinguishing marks and essential requirements for a data warehouse process, a
set of tasks that allow us to turn operational data into decision-making support information:
• accessibility to users not very familiar with IT and data structures;
• integration of data on the basis of a standard enterprise model;
• query flexibility to maximize the advantages obtained from the existing information;

• information conciseness allowing for target-oriented and effective analyses;
• multidimensional representation giving users an intuitive and manageable view
  of information;
• correctness and completeness of integrated data.
Data warehouses are placed right in the middle of this process and act as repositories
for data. They make sure that the requirements set can be fulfilled.

Data Warehouse

A data warehouse is a collection of data that supports decision-making processes.
It provides the following features (Inmon, 2005):
• It is subject-oriented.
• It is integrated and consistent.
• It shows its evolution over time and it is not volatile.

Data warehouses are subject-oriented because they hinge on enterprise-specific
concepts, such as customers, products, sales, and orders. On the contrary, operational
databases hinge on many different enterprise-specific applications.
We put emphasis on integration and consistency because data warehouses take advantage
of multiple data sources, such as data extracted from production and then stored to enterprise
databases, or even data from a third party's information systems. A data warehouse should
provide a unified view of all the data. Generally speaking, we can state that creating a data
warehouse system does not require that new information be added; rather, existing
information needs rearranging. This implicitly means that an information system should be
previously available.
Operational data usually covers a short period of time, because most transactions
involve the latest data. A data warehouse should enable analyses that instead cover a few
years. For this reason, data warehouses are regularly updated from operational data and
keep on growing. If data were visually represented, it might progress like so: A photograph
of operational data would be made at regular intervals. The sequence of photographs
would be stored to a data warehouse, and results would be shown in a movie that reveals
the status of an enterprise from its foundation until present.
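The photograph analogy can be made concrete with a small sketch. It is only an illustration under stated assumptions: the in-memory list warehouse_history, the function take_snapshot, and the branch figures are all hypothetical, and a real data warehouse would load snapshots through its ETL processes rather than by copying Python dictionaries.

```python
from datetime import date

# Hypothetical in-memory "warehouse": an append-only list of dated snapshots.
warehouse_history = []

def take_snapshot(operational_data, as_of):
    """Store a copy of the current operational state, tagged with its date."""
    snapshot = {"as_of": as_of, "rows": [dict(row) for row in operational_data]}
    warehouse_history.append(snapshot)  # snapshots are added, never overwritten

# The operational database only knows the current state of each branch.
operational = [{"branch": "Rome", "open_orders": 12}]
take_snapshot(operational, date(2009, 1, 31))

operational[0]["open_orders"] = 7  # the operational row is overwritten in place
take_snapshot(operational, date(2009, 2, 28))

# The "movie": how the value evolved over time, which the source no longer shows.
evolution = [(s["as_of"].isoformat(), s["rows"][0]["open_orders"])
             for s in warehouse_history]
print(evolution)
```

After the second snapshot the operational row holds only the latest value, while the history still holds both, which is the sense in which the warehouse shows evolution over time and is not volatile.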
Fundamentally, data is never deleted from data warehouses and updates are normally
carried out when data warehouses are offline. This means that data warehouses can be
essentially viewed as read-only databases. This satisfies the users' need for a short analysis
query response time and has other important effects. First, it affects data warehouse-specific
database management system (DBMS) technologies, because there is no need for advanced
transaction management techniques required by operational applications. Second, data
warehouses operate in read-only mode, so data warehouse-specific logical design solutions
are completely different from those used for operational databases. For instance, the most
obvious feature of data warehouse relational implementations is that table normalization
can be given up to partially denormalize tables and improve performance.
Other differences between operational databases and data warehouses are connected
with query types. Operational queries execute transactions that generally read/write a
small number of tuples from/to many tables connected by simple relations. For example,
this applies if you search for the data of a customer in order to insert a new customer order.
This kind of query is an OLTP query. On the contrary, the type of query required in data
warehouses is OLAP. It features dynamic, multidimensional analyses that need to scan a
huge amount of records to process a set of numeric data summing up the performance of an
enterprise. It is important to note that OLTP systems have an essential workload core
frozen in application programs, and ad hoc data queries are occasionally run for data
maintenance. Conversely, data warehouse interactivity is an essential property for analysis
sessions, so the actual workload constantly changes as time goes by.
The distinctive features of OLAP queries suggest adoption of a multidimensional
representation for data warehouse data. Basically, data is viewed as points in space, whose
dimensions correspond to many possible analysis dimensions. Each point represents an
event that occurs in an enterprise and is described by a set of measures relevant to
decision-making processes. Section 1.5 gives a detailed description of the multidimensional model
you absolutely need to be familiar with to understand how to model conceptual and logical
levels of a data warehouse and how to query data warehouses.
Table 1-2 summarizes the main differences between operational databases and data
warehouses.
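As a rough illustration of the multidimensional view described above, the following sketch treats each event as a point whose coordinates are dimension values and whose payload is a set of measures. The dimension names (date, product, store), the measures, and the figures are assumptions made up for this example; the function aggregate is a hypothetical helper, not part of any OLAP product.

```python
from collections import defaultdict

# Each event is a point: coordinates on the analysis dimensions plus measures.
# The dimension names and figures are made up for illustration.
events = [
    {"date": "2009-01", "product": "milk", "store": "Rome", "quantity": 10, "revenue": 15.0},
    {"date": "2009-01", "product": "milk", "store": "Milan", "quantity": 4, "revenue": 6.0},
    {"date": "2009-02", "product": "bread", "store": "Rome", "quantity": 7, "revenue": 14.0},
    {"date": "2009-02", "product": "milk", "store": "Rome", "quantity": 3, "revenue": 4.5},
]

def aggregate(events, dimensions, measure):
    """Sum one measure over all events, grouped by the chosen dimensions."""
    totals = defaultdict(float)
    for event in events:
        key = tuple(event[d] for d in dimensions)
        totals[key] += event[measure]
    return dict(totals)

# The same events can be summed up along different dimensions on demand.
by_product = aggregate(events, ["product"], "revenue")
by_store_and_date = aggregate(events, ["store", "date"], "quantity")
print(by_product)
print(by_store_and_date)
```

Because the grouping dimensions are chosen at call time, the sketch also hints at why OLAP workloads change from one analysis session to the next, while OLTP workloads stay fixed in application programs.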

NOTE For further details on the different issues related to the data warehouse process, refer to
Chaudhuri and Dayal, 1997; Inmon, 2005; Jarke et al., 2000; Kelly, 1997; Kimball, 1996;
Mattison, 2006; and Wrembel and Koncilia, 2007.

Feature           Operational Databases              Data Warehouses
Users             Thousands                          Hundreds
Workload          Preset transactions                Specific analysis queries
Access            To hundreds of records, write      To millions of records, mainly
                  and read mode                      read-only mode
Goal              Depends on applications            Decision-making support
Data              Detailed, both numeric and         Summed up, mainly numeric
                  alphanumeric
Data integration  Application-based                  Subject-based
Quality           In terms of integrity              In terms of consistency
Time coverage     Current data only                  Current and historical data
Updates           Continuous                         Periodical
Model             Normalized                         Denormalized, multidimensional
Optimization      For OLTP access to a database      For OLAP access to most of the
                  part                               database

TABLE 1-2 Differences Between Operational Databases and Data Warehouses (Kelly, 1997)

1.3 Data Warehouse Architectures


The following architecture properties are essential for a data warehouse system (Kelly, 1997):
• Separation: Analytical and transactional processing should be kept apart as much
  as possible.
• Scalability: Hardware and software architectures should be easy to upgrade as the
  data volume, which has to be managed and processed, and the number of users'
  requirements, which have to be met, progressively increase.
• Extensibility: The architecture should be able to host new applications and
  technologies without redesigning the whole system.
• Security: Monitoring accesses is essential because of the strategic data stored in
  data warehouses.
• Administerability: Data warehouse management should not be overly difficult.
Two different classifications are commonly adopted for data warehouse architectures.
The first classification, described in sections 1.3.1, 1.3.2, and 1.3.3, is a structure-oriented one
that depends on the number of layers used by the architecture. The second classification,
described in section 1.3.4, depends on how the different layers are employed to create
enterprise-oriented or department-oriented views of data warehouses.

1.3.1 Single-Layer Architecture

A single-layer architecture is not frequently used in practice. Its goal is to minimize the
amount of data stored; to reach this goal, it removes data redundancies. Figure 1-2 shows
the only layer physically available: the source layer. In this case, data warehouses are virtual.

FIGURE 1-2 Single-layer architecture for a data warehouse system

This means that a data warehouse is implemented as a multidimensional view of operational
data created by specific middleware, or an intermediate processing layer (Devlin, 1997).
The weakness of this architecture lies in its failure to meet the requirement for
separation between analytical and transactional processing. Analysis queries are submitted
to operational data after the middleware interprets them. In this way, the queries affect
regular transactional workloads. In addition, although this architecture can meet the
requirement for integration and correctness of data, it cannot log more data than sources do.
For these reasons, a virtual approach to data warehouses can be successful only if analysis
needs are particularly restricted and the data volume to analyze is huge.

1.3.2 Two-Layer Architecture


The requirement for separation plays a fundamental role in defining the typical architecture
for a data warehouse system, as shown in Figure 1-3. Although it is typically called a
two-layer architecture to highlight a separation between physically available sources and data
warehouses, it actually consists of four subsequent data flow stages (Lechtenbörger, 2001):

1. Source layer: A data warehouse system uses heterogeneous sources of data. That
   data is originally stored to corporate relational databases or legacy [1] databases, or it
   may come from information systems outside the corporate walls.

2. Data staging: The data stored to sources should be extracted, cleansed to remove
   inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one
   common schema. The so-called Extraction, Transformation, and Loading tools (ETL) can
   merge heterogeneous schemata, extract, transform, cleanse, validate, filter, and load
   source data into a data warehouse (Jarke et al., 2000). Technologically speaking, this
   stage deals with problems that are typical for distributed information systems, such
   as inconsistent data management and incompatible data structures (Zhuge et al.,
   1996). Section 1.4 deals with a few points that are relevant to data staging.

3. Data warehouse layer: Information is stored to one logically centralized single
   repository: a data warehouse. The data warehouse can be directly accessed, but it
   can also be used as a source for creating data marts, which partially replicate data
   warehouse contents and are designed for specific enterprise departments. Meta-data
   repositories (section 1.6) store information on sources, access procedures, data
   staging, users, data mart schemata, and so on.

4. Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue
   reports, dynamically analyze information, and simulate hypothetical business
   scenarios. Technologically speaking, it should feature aggregate data navigators,
   complex query optimizers, and user-friendly GUIs. Section 1.7 deals with different
   types of decision-making support analyses.

The architectural difference between data warehouses and data marts needs to be studied
closer. The component marked as a data warehouse in Figure 1-3 is also often called the
primary data warehouse or corporate data warehouse. It acts as a centralized storage system for
all the data being summed up. Data marts can be viewed as small, local data warehouses
replicating (and summing up as much as possible) the part of a primary data warehouse
required for a specific application domain.

[1] The term legacy system denotes corporate applications, typically running on mainframes or minicomputers,
that are currently used for operational tasks but do not meet modern architectural principles and current
standards. For this reason, accessing legacy systems and integrating them with more recent applications is a
complex task. All applications that use a nonrelational database are examples of legacy systems.

FIGURE 1-3 Two-layer architecture for a data warehouse system

Data Marts

A data mart is a subset or an aggregation of the data stored to a primary data
warehouse. It includes a set of information pieces relevant to a specific business area,
corporate department, or category of users.

The data marts populated from a primary data warehouse are often called dependent.
Although data marts are not strictly necessary, they are very useful for data warehouse
systems in midsize to large enterprises because
• they are used as building blocks while incrementally developing data warehouses;
• they mark out the information required by a specific group of users to solve queries;
• they can deliver better performance because they are smaller than primary data
  warehouses.
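The definition of a dependent data mart as "a subset or an aggregation" of the primary data warehouse can be sketched in a few lines. The rows, the column names, and the function build_dependent_mart are hypothetical and serve only to show the two operations, selection and summing up, side by side.

```python
from collections import defaultdict

# Hypothetical primary data warehouse content: one row per sale.
primary_warehouse = [
    {"department": "marketing", "month": "2009-01", "region": "North", "amount": 100.0},
    {"department": "marketing", "month": "2009-01", "region": "North", "amount": 50.0},
    {"department": "marketing", "month": "2009-02", "region": "South", "amount": 70.0},
    {"department": "logistics", "month": "2009-01", "region": "North", "amount": 30.0},
]

def build_dependent_mart(rows, department):
    """Select one department's rows (subset) and sum amounts by month (aggregation)."""
    totals = defaultdict(float)
    for row in rows:
        if row["department"] == department:
            totals[row["month"]] += row["amount"]
    return dict(totals)

marketing_mart = build_dependent_mart(primary_warehouse, "marketing")
print(marketing_mart)
```

The resulting mart holds two summarized rows instead of four detailed ones, which is a toy version of the size and performance argument made in the list above.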

Sometimes, mainly for organization and policy purposes, you should use a different
architecture in which sources are used to directly populate data marts. These data marts
are called independent (see section 1.3.4). If there is no primary data warehouse, this
streamlines the design process, but it leads to the risk of inconsistencies between data
marts. To avoid these problems, you can create a primary data warehouse and still have
independent data marts. In comparison with the standard two-layer architecture of
Figure 1-3, the roles of data marts and data warehouses are actually inverted. In this case,
the data warehouse is populated from its data marts, and it can be directly queried to
make access patterns as easy as possible.
The following list sums up all the benefits of a two-layer architecture, in which a data
warehouse separates sources from analysis applications (Jarke et al., 2000; Lechtenbörger, 2001):
• In data warehouse systems, good quality information is always available, even
  when access to sources is denied temporarily for technical or organizational reasons.
• Data warehouse analysis queries do not affect the management of transactions, the
  reliability of which is vital for enterprises to work properly at an operational level.
• Data warehouses are logically structured according to the multidimensional model,
  while operational sources are generally based on relational or semi-structured models.
• A mismatch in terms of time and granularity occurs between OLTP systems, which
  manage current data at a maximum level of detail, and OLAP systems, which
  manage historical and summarized data.
• Data warehouses can use specific design solutions aimed at performance
  optimization of analysis and report applications.

NOTE A few authors use the same terminology to define different concepts. In particular, those
authors consider a data warehouse as a repository of integrated and consistent, yet operational,
data, while they use a multidimensional representation of data only in data marts. According to
our terminology, this operational view of data warehouses essentially corresponds to the
reconciled data layer in three-layer architectures.

1.3.3 Three-Layer Architecture


In this architecture, the third layer is the reconciled data layer or operational data store. This layer
materializes operational data obtained after integrating and cleansing source data. As a result,
those data are integrated, consistent, correct, current, and detailed. Figure 1-4 shows a data
warehouse that is not populated from its sources directly, but from reconciled data.
The main advantage of the reconciled data layer is that it creates a common reference data
model for a whole enterprise. At the same time, it sharply separates the problems of source
data extraction and integration from those of data warehouse population. Remarkably, in
some cases, the reconciled layer is also directly used to better accomplish some operational
tasks, such as producing daily reports that cannot be satisfactorily prepared using the
corporate applications, or generating data flows to feed external processes periodically so as
to benefit from cleaning and integration. However, reconciled data leads to more redundancy
of operational source data. Note that we may assume that even two-layer architectures can
have a reconciled layer that is not specifically materialized, but only virtual, because it is
defined as a consistent integrated view of operational source data.

FIGURE 1-4 Three-layer architecture for a data warehouse system

Finally, let's consider a supplementary architectural approach, which provides a
comprehensive picture. This approach can be described as a hybrid solution between the
single-layer architecture and the two/three-layer architecture. This approach assumes that
although a data warehouse is available, it is unable to solve all the queries formulated. This
means that users may be interested in directly accessing source data from aggregate data
(drill-through). To reach this goal, some queries have to be rewritten on the basis of source
data (or reconciled data if it is available). This type of architecture is implemented in a
prototype by Cui and Widom, 2000, and it needs to be able to go dynamically back to the
source data required for queries to be solved (lineage).

NOTE Gupta, 1997a; Hull and Zhou, 1996; and Yang et al., 1997 discuss the implications of this
approach from the viewpoint of performance optimization, and in particular view materialization.

1.3.4 An Additional Architecture Classification


The scientific literature often distinguishes five types of architecture for data warehouse
systems, in which the same basic layers mentioned in the preceding paragraphs are
combined in different ways (Rizzi, 2008).
In independent data marts architecture, different data marts are separately designed and
built in a nonintegrated fashion (Figure 1-5). This architecture can be initially adopted in the
absence of a strong sponsorship toward an enterprise-wide warehousing project, or when
the organizational divisions that make up the company are loosely coupled. However, it
tends to be soon replaced by other architectures that better achieve data integration and
cross-reporting.
The bus architecture, recommended by Ralph Kimball, is apparently similar to the
preceding architecture, with one important difference. A basic set of conformed dimensions
(that is, analysis dimensions that preserve the same meaning throughout all the facts they
belong to), derived by a careful analysis of the main enterprise processes, is adopted and
shared as a common design guideline. This ensures logical integration of data marts and an
enterprise-wide view of information.
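One way to picture what "conformed" means is a check that every data mart defines a shared dimension in the same way. The sketch below is an assumption-laden toy: the dictionaries conformed_dimensions and data_marts, the attribute names, and the function nonconforming are all invented for illustration, and comparing attribute lists is only a crude stand-in for "preserving the same meaning".

```python
# Hypothetical shared design guideline: the agreed attributes of each conformed dimension.
conformed_dimensions = {
    "date": ("day", "month", "year"),
    "product": ("product", "category"),
}

# Hypothetical data mart schemata: the dimensions each mart uses and their attributes.
data_marts = {
    "sales": {"date": ("day", "month", "year"), "product": ("product", "category")},
    "shipments": {"date": ("day", "month", "year"), "product": ("product", "brand")},
}

def nonconforming(marts, conformed):
    """Return (mart, dimension) pairs whose definition differs from the shared one."""
    problems = []
    for mart_name, dimensions in marts.items():
        for dim_name, attributes in dimensions.items():
            if dim_name in conformed and attributes != conformed[dim_name]:
                problems.append((mart_name, dim_name))
    return problems

print(nonconforming(data_marts, conformed_dimensions))
```

In this toy setup the shipments mart would be flagged because its product dimension departs from the shared definition, which is the kind of drift that separates independent data marts from a bus architecture.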
In the hub-and-spoke architecture, one of the most used in medium to large contexts, there
is much attention to scalability and extensibility, and to achieving an enterprise-wide view
of information. Atomic, normalized data is stored in a reconciled layer that feeds a set of
data marts containing summarized data in multidimensional form (Figure 1-6). Users
mainly access the data marts, but they may occasionally query the reconciled layer.

FIGURE 1-5 Independent data marts architecture
FIGURE 1-6 Hub-and-spoke architecture

The centralized architecture, recommended by Bill Inmon, can be seen as a particular
implementation of the hub-and-spoke architecture, where the reconciled layer and the data
marts are collapsed into a single physical repository.
The federated architecture is sometimes adopted in dynamic contexts where preexisting
data warehouses/data marts are to be noninvasively integrated to provide a single,
cross-organization decision support environment (for instance, in the case of mergers and
acquisitions). Each data warehouse/data mart is either virtually or physically integrated
with the others, leaning on a variety of advanced techniques such as distributed querying,
ontologies, and meta-data interoperability (Figure 1-7).

FIGURE 1-7 Federated architecture

The following list includes the factors that are particularly influential when it comes to
choosing one of these architectures:
• The amount of interdependent information exchanged between organizational units
  in an enterprise and the organizational role played by the data warehouse project
  sponsor may lead to the implementation of enterprise-wide architectures, such as bus
  architectures, or department-specific architectures, such as independent data marts.
• An urgent need for a data warehouse project, restrictions on economic and human
  resources, as well as poor IT staff skills may suggest that a type of quick architecture,
  such as independent data marts, should be implemented.
• The minor role played by a data warehouse project in enterprise strategies can make
  you prefer an architecture type based on independent data marts over a hub-and-spoke
  architecture type.
• The frequent need for integrating preexisting data warehouses, possibly deployed
  on heterogeneous platforms, and the pressing demand for uniformly accessing their
  data can require a federated architecture type.

1.4 Data Staging and ETL


Now let's closely study some basic features of the different architecture layers. We will start
with the data staging layer.
The data staging layer hosts the ETL processes that extract, integrate, and clean data
from operational sources to feed the data warehouse layer. In a three-layer architecture,
ETL processes actually feed the reconciled data layer, a single, detailed, comprehensive,
top-quality data source that in its turn feeds the data warehouse. For this reason, the
ETL process operations as a whole are often defined as reconciliation. These are also the
most complex and technically challenging among all the data warehouse process phases.
ETL takes place once when a data warehouse is populated for the first time, then it occurs
every time the data warehouse is regularly updated. Figure 1-8 shows that ETL consists of
four separate phases: extraction (or capture), cleansing (or cleaning or scrubbing), transformation,
and loading. In the following sections, we offer brief descriptions of these phases.

NOTE Refer to Jarke et al., 2000; Hoffer et al., 2005; Kimball and Caserta, 2004; and English, 1999
for more details on ETL.

The scientific literature shows that the boundaries between cleansing and transforming
are often blurred from the terminological viewpoint. For this reason, a specific operation is
not always clearly assigned to one of these phases. This is obviously a formal problem, but
not a substantial one. We will adopt the approach used by Hoffer and others (2005) to
make our explanations as clear as possible. Their approach states that cleansing is
essentially aimed at rectifying data values, and transformation more specifically manages
data formats.
Chapter 10 discusses all the details of the data-staging design phase. Chapter 3 deals
with an early data warehouse design phase: integration. This phase is necessary if there are
heterogeneous sources to define a schema for the reconciled data layer, and to specifically
transform operational data in the data-staging phase.

1.4.1 Extraction

Relevant data is obtained from sources in the extraction phase. You can use static extraction
when a data warehouse needs populating for the first time. Conceptually speaking, this
looks like a snapshot of operational data. Incremental extraction, used to update data
warehouses regularly, seizes the changes applied to source data since the latest extraction.
Incremental extraction is often based on the log maintained by the operational DBMS. If a
timestamp is associated with operational data to record exactly when the data is changed or
added, it can be used to streamline the extraction process. Extraction can also be
source-driven if you can rewrite operational applications to asynchronously notify of the changes
being applied, or if your operational database can implement triggers associated with
change transactions for relevant data.
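The difference between static and timestamp-based incremental extraction can be sketched as follows. This is a minimal illustration, assuming each operational row carries a last-modified timestamp in a field we call modified; the rows and the function names extract_static and extract_incremental are hypothetical, and log-based or trigger-based extraction would look different.

```python
from datetime import datetime

# Hypothetical operational rows, each carrying a last-modified timestamp.
source_rows = [
    {"id": 1, "name": "Alice", "modified": datetime(2009, 4, 1, 9, 0)},
    {"id": 2, "name": "Bob", "modified": datetime(2009, 4, 20, 14, 30)},
    {"id": 3, "name": "Carol", "modified": datetime(2009, 4, 21, 8, 15)},
]

def extract_static(rows):
    """Return every row, as when the warehouse is populated for the first time."""
    return list(rows)

def extract_incremental(rows, last_extraction):
    """Return only the rows changed or added after the previous extraction."""
    return [row for row in rows if row["modified"] > last_extraction]

last_run = datetime(2009, 4, 15, 0, 0)
changed = extract_incremental(source_rows, last_run)
print([row["id"] for row in changed])
```

A timestamp filter of this kind cannot see rows that were deleted at the source, which is one reason the DBMS log or triggers may still be needed in practice.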
The data to be extracted is mainly selected on the basis of its quality (English, 1999). In
particular, this depends on how comprehensive and accurate the constraints implemented
in sources are, how suitable the data formats are, and how clear the schemata are.

FIGURE 1-8 Extraction, transformation, and loading

1.4.2 Cleansing

The cleansing phase is crucial in a data warehouse system because it is supposed to improve
data quality, normally quite poor in sources (Galhardas et al., 2001). The following list
includes the most frequent mistakes and inconsistencies that make data dirty:
• Duplicate data: For example, a patient is recorded many times in a hospital patient
  management system

• Inconsistent values that are logically associated: Such as addresses and ZIP codes
• Missing data: Such as a customer's job
• Unexpected use of fields: For example, a Social Security Number field could be
  used improperly to store office phone numbers
• Impossible or wrong values: Such as 2/30/2009
• Inconsistent values for a single entity because different practices were used: For
  example, to specify a country, you can use an international country abbreviation (I)
  or a full country name (Italy); similar problems arise with addresses (Hamlet Rd.
  and Hamlet Road)
• Inconsistent values for one individual entity because of typing mistakes: Such as
  Hamet Road instead of Hamlet Road
In particular, note that the last two types of mistakes are very frequent when you are
managing multiple sources and are entering data manually.
The main data cleansing features found in ETL tools are rectification and homogenization.
They use specific dictionaries to rectify typing mistakes and to recognize synonyms, as well
as rule-based cleansing to enforce domain-specific rules and define appropriate associations
between values. See section 10.2 for more details on these points.
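
Dictionary-based and rule-based cleansing can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the dictionaries, the city-to-ZIP rule, and the similarity cutoff are hypothetical, and approximate matching with the standard `difflib` module merely stands in for whatever technique a real ETL tool applies.

```python
import difflib

# Hypothetical dictionaries; real ETL tools ship far larger ones.
STREET_ABBREVIATIONS = {"rd.": "Road", "rd": "Road"}
COUNTRY_NAMES = {"I": "Italy"}
KNOWN_STREETS = ["Hamlet Road", "Main Street"]
# Hypothetical domain rule: each city is associated with one ZIP code.
CITY_TO_ZIP = {"Springfield": "62701"}

def homogenize(street):
    """Replace abbreviations with their full form, word by word."""
    return " ".join(STREET_ABBREVIATIONS.get(w.lower(), w) for w in street.split())

def rectify(street):
    """Fix typing mistakes by approximate matching against a dictionary."""
    matches = difflib.get_close_matches(street, KNOWN_STREETS, n=1, cutoff=0.8)
    return matches[0] if matches else street

def cleanse(record):
    """Apply homogenization, rectification, and the city/ZIP rule."""
    cleaned = dict(record)
    cleaned["street"] = rectify(homogenize(record["street"]))
    cleaned["country"] = COUNTRY_NAMES.get(record["country"], record["country"])
    cleaned["zip"] = CITY_TO_ZIP.get(record["city"], record["zip"])
    return cleaned

dirty = {"street": "Hamet Rd.", "city": "Springfield", "zip": "00000", "country": "I"}
clean = cleanse(dirty)
```

The order matters in this sketch: abbreviations are expanded first so that the approximate match compares like with like.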

1.4.3 Transformation
Transformation is the core of the reconciliation phase. It converts data from its operational
source format into a specific data warehouse format. If you implement a three-layer
architecture, this phase outputs your reconciled data layer. Independently of the presence of
a reconciled data layer, establishing a mapping between the source data layer and the data
warehouse layer is generally made difficult by the presence of many different, heterogeneous
sources. If this is the case, a complex integration phase is required when designing your data
warehouse. See Chapter 3 for more details.
The following points must be rectified in this phase:
• Loose texts may hide valuable information. For example, BigDeal LtD does not explicitly
  show that this is a Limited Partnership company.
• Different formats can be used for individual data. For example, a date can be saved as
  a string or as three integers.
Following are the main transformation processes aimed at populating the reconciled
data layer:
• Conversion and normalization that operate on both storage formats and units of
  measure to make data uniform
• Matching that associates equivalent fields in different sources
• Selection that reduces the number of source fields and records
When populating a data warehouse, normalization is replaced by denormalization
because data warehouse data are typically denormalized, and you need aggregation to sum
up data properly.
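
The three transformation processes listed above can be sketched as follows. This is a minimal illustration, not a prescribed design: the field names, the `FIELD_MATCHES` mapping, and the assumed `MM/DD/YYYY` string format are all hypothetical.

```python
from datetime import date, datetime

# Hypothetical matching of equivalent fields in two sources to one target name.
FIELD_MATCHES = {
    "cust_name": "customer", "client": "customer",
    "sale_date": "date", "day": "date",
}

def to_date(value):
    """Conversion: accept a date saved as a string or as three integers."""
    if isinstance(value, str):
        return datetime.strptime(value, "%m/%d/%Y").date()
    year, month, day = value
    return date(year, month, day)

def transform(record):
    """Match field names, drop unmapped fields, and make dates uniform."""
    result = {}
    for field, value in record.items():
        target = FIELD_MATCHES.get(field)
        if target is None:
            continue  # selection: this source field is not needed
        result[target] = to_date(value) if target == "date" else value
    return result

from_source_a = transform({"cust_name": "BigDeal", "sale_date": "04/05/2008", "fax": "n/a"})
from_source_b = transform({"client": "BigDeal", "day": (2008, 4, 5)})
```

Both source records end up with the same field names and the same date representation, which is the uniformity the reconciled layer needs.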


FIGURE 1-9 Example of cleansing and transforming customer data

Cleansing and transformation processes are often closely connected in ETL tools.
Figure 1-9 shows an example of cleansing and transformation of customer data: a
field-based structure is extracted from a loose text, then a few values are standardized so as
to remove abbreviations, and eventually those values that are logically associated can
be rectified.

1.4.4 Loading

Loading into a data warehouse is the last step to take. Loading can be carried out in two ways:
• Refresh: Data warehouse data is completely rewritten. This means that older data
  is replaced. Refresh is normally used in combination with static extraction to
  initially populate a data warehouse.
• Update: Only those changes applied to source data are added to the data
  warehouse. Update is typically carried out without deleting or modifying
  preexisting data. This technique is used in combination with incremental extraction
  to update data warehouses regularly.
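
The two loading modes can be contrasted with a small Python sketch. It assumes, purely for illustration, that the warehouse is an in-memory list of tuples; the function names are hypothetical.

```python
def refresh(warehouse, extracted_rows):
    """Refresh: completely rewrite the warehouse, replacing older data."""
    warehouse.clear()
    warehouse.extend(extracted_rows)

def update(warehouse, changed_rows):
    """Update: add the changes without deleting or modifying existing rows."""
    warehouse.extend(changed_rows)

warehouse = []
# Initial population: static extraction followed by a refresh.
refresh(warehouse, [("EverMore", "Shiny", "2008-04-05", 10, 25)])
# Regular maintenance: incremental extraction followed by an update.
update(warehouse, [("EverMore", "Shiny", "2008-04-06", 4, 10)])
```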

1.5 Multidimensional Model


The data warehouse layer is a vitally important part of this book. Here, we introduce a data
warehouse keyword: multidimensional. You need to become familiar with the concepts and
terminology used here to understand the information presented throughout this book,
particularly information regarding conceptual and logical modeling and designing.


Over the last few years, multidimensional databases have generated much research and
market interest because they are fundamental for many decision-making support
applications, such as data warehouse systems. The reason why the multidimensional model
is used as a paradigm of data warehouse data representation is fundamentally connected to
its ease of use and intuitiveness even for IT newbies. The multidimensional model's success
is also linked to the widespread use of productivity tools, such as spreadsheets, that adopt
the multidimensional model as a visualization paradigm.
Perhaps the best starting point to approach the multidimensional model effectively is a
definition of the types of queries for which this model is best suited. Section 1.7 offers more
details on typical decision-making queries such as those listed here (Jarke et al., 2000):
• What is the total amount of receipts recorded last year per state and per product category?
• What is the relationship between the trend of PC manufacturers' shares and quarter gains
  over the last five years?
• Which orders maximize receipts?
• Which one of two new treatments will result in a decrease in the average period of admission?
• What is the relationship between profit gained by the shipments consisting of less than
  10 items and the profit gained by the shipments of more than 10 items?
It is clear that using traditional languages, such as SQL, to express these types of queries
can be a very difficult task for inexperienced users. It is also clear that running these types of
queries against operational databases would result in an unacceptably long response time.
The multidimensional model begins with the observation that the factors affecting
decision-making processes are enterprise-specific facts, such as sales, shipments, hospital
admissions, surgeries, and so on. Instances of a fact correspond to events that occurred.
For example, every single sale or shipment carried out is an event. Each fact is described
by the values of a set of relevant measures that provide a quantitative description of
events. For example, sales receipts, amounts shipped, hospital admission costs, and
surgery time are measures.
Obviously, a huge number of events occur in typical enterprises, too many to analyze
one by one. Imagine placing them all into an n-dimensional space to help us quickly select
and sort them out. The n-dimensional space axes are called analysis dimensions, and they
define different perspectives to single out events. For example, the sales in a store chain can
be represented in a three-dimensional space whose dimensions are products, stores, and
dates. As far as shipments are concerned, products, shipment dates, orders, destinations,
and terms & conditions can be used as dimensions. Hospital admissions can be defined by
the department-date-patient combination, and you would need to add the type of operation
to classify surgery operations.
The concept of dimension gave life to the broadly used metaphor of cubes to represent
multidimensional data. According to this metaphor, events are associated with cube cells
and cube edges stand for analysis dimensions. If more than three dimensions exist, the cube
is called a hypercube. Each cube cell is given a value for each measure. Figure 1-10 shows an
intuitive representation of a cube in which the fact is a sale in a store chain. Its analysis
dimensions are store, product, and date. An event stands for a specific item sold in a specific
store on a specific date, and it is described by two measures: the quantity sold and the
receipts. This figure highlights that the cube is sparse; this means that many events did not
actually take place. Of course, you cannot sell every item every day in every store.


FIGURE 1-10 The three-dimensional cube modeling sales in a store chain: 10 packs of Shiny
were sold on 4/5/2008 in the EverMore store, totaling $25.

If you want to use the relational model to represent this cube, you could use the
following relational schema:
SALES(store, product, date, quantity, receipts)

Here, the attributes store, product, and date (underlined in the printed schema) make up the
primary key, and events are associated with tuples, such as
<'EverMore', 'Shiny', '04/05/08', 10, 25>. The constraint expressed by this
primary key specifies that two events cannot be associated with an individual store,
product, and date value combination, and that every value combination functionally
determines a unique value for quantity and a unique value for receipts. This means
that the following functional dependency² holds:
$store, product, date \rightarrow quantity, receipts$
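
One way to see what this functional dependency means in practice is to store the cube as a mapping from coordinates to measures. The sketch below is an illustrative assumption, not the book's implementation: a Python dictionary keyed by (store, product, date) admits at most one event per coordinate combination, and it stores only the events that actually occurred, which mirrors the sparsity of the cube.

```python
# Sparse cube: (store, product, date) -> (quantity, receipts)
cube = {}

def record_event(store, product, date, quantity, receipts):
    """Add an event; a second event with the same coordinates is rejected."""
    key = (store, product, date)
    if key in cube:
        raise ValueError(f"an event already exists for {key}")
    cube[key] = (quantity, receipts)

record_event("EverMore", "Shiny", "2008-04-05", 10, 25)

# Cells for events that never took place are simply absent.
missing = ("EverMore", "Shiny", "2008-04-06") in cube
```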

To avoid any misunderstanding of the term event, you should realize that the group
of dimensions selected for a fact representation singles out a unique event in the
multidimensional model, but the group does not necessarily single out a unique event in
the application domain. To make this statement clearer, consider once again the sales
example. In the application domain, one single sales event is supposed to be a customer's
purchase of a set of products from a store on a specific date. In practice, this corresponds to
a sales receipt. From the viewpoint of the multidimensional model, if the sales fact has the
product, store, and date dimensions, an event will be the daily total amount of an item
sold in a store. It is clear that the difference between both interpretations depends on sales
receipts that generally include various items, and on individual items that are generally sold
many times every day in a store. In the following sections, we use the terms event and fact to
make reference to the granularity taken by events and facts in the multidimensional model.

² The definition of functional dependency belongs to relational theory. Given relation schema
R and two attribute sets $X = \{a_1, \ldots, a_n\}$ and $Y = \{b_1, \ldots, b_m\}$, X is said to
functionally determine Y ($X \rightarrow Y$) if and only if, for every legal instance r of R and
for each pair of tuples $t_1, t_2$ in r, $t_1[X] = t_2[X]$ implies $t_1[Y] = t_2[Y]$. Here
t[X/Y] denotes the values taken in t from the attributes in X/Y. By extension, we say that a
functional dependency holds between two attribute sets X and Y when each value set of X always
corresponds to a single value set of Y. To simplify the notation, when we denote the attributes
in each set, we drop the braces.
Normally, each dimension is associated with a hierarchy of aggregation levels, often
called roll-up hierarchy. Roll-up hierarchies group aggregation level values in different ways.
Hierarchies consist of levels called dimensional attributes. Figure 1-11 shows a simple example
of hierarchies built on the product and store dimensions: products are classified into
types, and are then further classified into categories. Stores are located in cities belonging to
states. On top of each hierarchy is a fake level that includes all the dimension-related values.
From the viewpoint of relational theory, you can use a set of functional dependencies
between dimensional attributes to express a hierarchy:
$product \rightarrow type \rightarrow category$
$store \rightarrow city \rightarrow state$

In summary, a multidimensional cube hinges on a fact relevant to decision-making. It
shows a set of events for which numeric measures provide a quantitative description. Each
cube axis shows a possible analysis dimension. Each dimension can be analyzed at different
detail levels specified by hierarchically structured attributes.
The scientific literature shows many formal expressions of the multidimensional model,
which can be more or less complex and comprehensive. We'll briefly mention alternative
terms used for the multidimensional model in the scientific literature and in commercial tools.

FIGURE 1-11 Aggregation hierarchies built on the product and store dimensions


The fact and cube terms are often interchangeably used. Essentially, everyone agrees on the use
of the term dimensions to specify the coordinates that classify and identify fact occurrences.
However, entire hierarchies are sometimes called dimensions. For example, the term time
dimension can be used for the entire hierarchy built on the date attribute. Measures are
sometimes called variables, metrics, properties, attributes, or indicators. In some models,
dimensional attributes of hierarchies are called levels or parameters.

NOTE The main formal expressions of the multidimensional model in the literature were proposed
by Agrawal et al., 1995; Gyssens and Lakshmanan, 1997; Datta and Thomas, 1997;
Vassiliadis, 1998; and Cabibbo and Torlone, 1998.

The information in a multidimensional cube is very difficult for users to manage because
of its quantity, even if it is a concise version of the information stored in operational
databases. If, for example, a store chain includes 50 stores selling 1000 items, and a specific
data warehouse covers three-year-long transactions (approximately 1000 days), the number of
potential events totals $50 \times 1000 \times 1000 = 5 \times 10^7$. Assuming that each store
can sell only 10 percent of all the available items per day, the number of events totals
$5 \times 10^6$. This is still too much data to be analyzed by users without relying on
automatic tools.
You have essentially two ways to reduce the quantity of data and obtain useful
information: restriction and aggregation. The cube metaphor offers an easy-to-use and intuitive
way to understand both of these methods, as we will discuss in the following paragraphs.

1.5.1 Restriction

Restricting data means separating part of the data from a cube to mark out an analysis field.
In relational algebra terminology, this is called making selections and/or projections.
The simplest type of selection is data slicing, shown in Figure 1-12. When you slice data,
you decrease cube dimensionality by setting one or more dimensions to a specific value. For
example, if you set one of the sales cube dimensions to a value, such as store='EverMore',
this results in the set of events associated with the items sold in the EverMore store. According
to the cube metaphor, this is simply a plane of cells, that is, a data slice that can be easily
displayed in spreadsheets. In the store chain example given earlier, approximately $10^5$ events
still appear in your result. If you set two dimensions to a value, such as store='EverMore'
and date='4/5/2008', this will result in all the different items sold in the EverMore store on
April 5 (approximately 100 events). Graphically speaking, this information is stored at the
intersection of two perpendicular planes resulting in a line. If you set all the dimensions to a
particular value, you will define just one event that corresponds to a point in the
three-dimensional space of sales.
Dicing is a generalization of slicing. It poses some constraints on dimensional attributes
to scale down the size of a cube. For example, you can select only the daily sales of the food
items in April 2008 in Florida (Figure 1-12). In this way, if five stores are located in Florida
and 50 food products are sold, the number of events to examine changes to
$5 \times 50 \times 30 = 7500$.
Finally, a projection can be referred to as a choice to keep just one subgroup of measures
for every event and reject other measures.
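
Slicing, dicing, and projection can be sketched on a small in-memory cube. The representation (a dictionary keyed by store, product, and date), the sample events, and the function names are assumptions made for illustration only.

```python
from datetime import date

# Sparse cube: (store, product, date) -> measures
cube = {
    ("EverMore", "Shiny", date(2008, 4, 5)): {"quantity": 10, "receipts": 25},
    ("EverMore", "Bright", date(2008, 4, 5)): {"quantity": 3, "receipts": 9},
    ("EverMore", "Shiny", date(2008, 4, 6)): {"quantity": 7, "receipts": 17},
    ("SmartMart", "Shiny", date(2008, 4, 5)): {"quantity": 2, "receipts": 5},
}

def slice_cube(cube, store=None, product=None, day=None):
    """Slicing: set one or more dimensions to a specific value."""
    return {
        key: measures for key, measures in cube.items()
        if (store is None or key[0] == store)
        and (product is None or key[1] == product)
        and (day is None or key[2] == day)
    }

def dice(cube, predicate):
    """Dicing: keep the events whose coordinates satisfy a constraint."""
    return {key: m for key, m in cube.items() if predicate(*key)}

def project(cube, measure_names):
    """Projection: keep just one subgroup of measures for every event."""
    return {key: {n: m[n] for n in measure_names} for key, m in cube.items()}

plane = slice_cube(cube, store="EverMore")                        # a plane of cells
line = slice_cube(cube, store="EverMore", day=date(2008, 4, 5))   # a line of cells
april_shiny = dice(cube, lambda s, p, d: p == "Shiny" and d.month == 4)
quantities = project(line, ["quantity"])
```

Setting all three arguments of `slice_cube` would leave at most one event, the single point described in the text.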


FIGURE 1-12 Slicing and dicing a three-dimensional cube

1.5.2 Aggregation
Aggregation plays a fundamental role in multidimensional databases. Assume, for
example, that you want to analyze the items sold monthly for a three-year period.
According to the cube metaphor, this means that you need to sort all the cells related to the
days of each month by product and store, and then merge them into one single macro-cell.
In the aggregate cube obtained in this way, the total number of events (that is, the number
of macro-cells) is $50 \times 1000 \times 36$. This is because the granularity of the time
dimension does not depend on days any longer, but now depends on months, and 36 is the number
of months in three years. Every aggregate event will then sum up the data available in the
events it aggregates. In this example, the total amount of items sold per month and the
total receipts are calculated by summing every single value of their measures (Figure 1-13).
If you further aggregate along time, you can achieve just three events for every
store-product combination: one for every year. When you completely aggregate along the time
dimension, each store-product combination corresponds to one single event, which shows
the total amount of items sold in a store over three years and the total amount of receipts.
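
Aggregation along the time hierarchy can be sketched as a single function that regroups daily events under a coarser time level and sums their measures. The sample events and the `roll_up` name are hypothetical; summation is used because both measures in the example are additive.

```python
from collections import defaultdict
from datetime import date

# Daily events: (store, product, date) -> (quantity, receipts)
events = {
    ("EverMore", "Shiny", date(2008, 1, 2)): (15, 30),
    ("EverMore", "Shiny", date(2008, 1, 3)): (20, 40),
    ("EverMore", "Shiny", date(2008, 2, 1)): (5, 10),
    ("SmartMart", "Shiny", date(2008, 1, 2)): (20, 40),
}

def roll_up(events, time_level):
    """Merge daily cells into macro-cells, summing both measures."""
    totals = defaultdict(lambda: (0, 0))
    for (store, product, day), (quantity, receipts) in events.items():
        key = (store, product, time_level(day))
        q, r = totals[key]
        totals[key] = (q + quantity, r + receipts)
    return dict(totals)

by_month = roll_up(events, lambda d: (d.year, d.month))
by_year = roll_up(events, lambda d: d.year)
# Complete aggregation along time: one event per store-product combination.
all_time = roll_up(events, lambda d: "all")
```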


FIGURE 1-13 Time hierarchy aggregation of the quantity of items sold per product in three
stores. A dash shows that an event did not occur because no item was sold.

Daily level: quantities per store for 1/1, 1/2, and 1/3 of 2007, 2008, and 2009 (cell values
not reproduced here).

Monthly level:
                 EverMore   EvenMore   SmartMart
January 2007          200        180         150
February 2007         180        150         120
March 2007            220        180         160
...
January 2008          350        220         200
February 2008         300        200         250
March 2008            310        180         300
...
January 2009          380        200         220
February 2009         310        200         250
March 2009            300        160         280
...

Yearly level:
                 EverMore   EvenMore   SmartMart
2007                2,400      2,000       1,600
2008                3,200      2,300       3,000
2009                3,400      2,200       3,200

Complete aggregation along time:
                 EverMore   EvenMore   SmartMart
Total               9,000      6,500       7,800


FIGURE 1-14 Two cube aggregation levels. Every macro-event measure value is a sum of its component
event values.

You can aggregate along various dimensions at the same time. For example, Figure 1-14
shows that you can group sales by month, product type, and store city, and by month and
product type. Moreover, selections and aggregations can be combined to carry out an
analysis process targeted exactly to users' needs.

1.6 Meta-data
The term meta-data can be applied to the data used to define other data. In the scope of data
warehousing, meta-data plays an essential role because it specifies source, values, usage,
and features of data warehouse data and defines how data can be changed and processed at
every architecture layer. Figures 1-3 and 1-4 show that the meta-data repository is closely
connected to the data warehouse. Applications use it intensively to carry out data-staging
and analysis tasks.
According to Kelly's approach, you can classify meta-data into two partially overlapping
categories. This classification is based on the ways system administrators and end users
exploit meta-data. System administrators are interested in internal meta-data because it
defines data sources, transformation processes, population policies, logical and physical
schemata, constraints, and user profiles. External meta-data is relevant to end users.
For example, it is about definitions, quality standards, units of measure, and relevant
aggregations.


Meta-data is stored in a meta-data repository which all the other architecture components
can access. According to Kelly, a tool for meta-data management should
• allow administrators to perform system administration operations, and in particular
  manage security;
• allow end users to navigate and query meta-data;
• use a GUI;
• allow end users to extend meta-data;
• allow meta-data to be imported/exported into/from other standard tools and formats.
As far as representation formats are concerned, Object Management Group (OMG, 2000)
released a standard called Common Warehouse Metamodel (CWM) that relies on three famous
standards: Unified Modeling Language (UML), eXtensible Markup Language (XML), and XML
Metadata Interchange (XMI). Partners, such as IBM, Unisys, NCR, and Oracle, in a common
effort, created the new standard format that specifies how meta-data can be exchanged
among the technologies related to data warehouses, business intelligence, knowledge
management, and web portals.
Figure 1-15 shows an example of a dialog box displaying external meta-data related
to hierarchies in MicroStrategy Desktop of the MicroStrategy 8 tool suite. In particular,
this dialog box displays the Calling Center attribute parent attributes. Specifically, it states
that a calling center refers to a distribution center, belongs to a region, and is managed by
a manager.

FIGURE 1-15 Accessing hierarchy meta-data in MicroStrategy

NOTE See Barquin and Edelstein, 1996; Jarke et al., 2000; Jennings, 2004; and Tozer, 1999, for
a comprehensive discussion on meta-data representation and management.

1.7 Accessing Data Warehouses


Analysis is the last level common to all data warehouse architecture types. After cleansing,
integrating, and transforming data, you should determine how to get the best out of it in
terms of information. The following sections show the best approaches for end users to
query data warehouses: reports, OLAP, and dashboards. End users often use the information
stored in a data warehouse as a starting point for additional business intelligence
applications, such as what-if analyses and data mining. See Chapter 15 for more details on
these advanced applications.

1.7.1 Reports

This approach is oriented to those users who need to have regular access to the information
in an almost static way. For example, suppose a local health authority must send to its state
offices monthly reports summing up information on patient admission costs. The layout of
those reports has been predetermined and may vary only if changes are applied to current
laws and regulations. Designers issue the queries to create reports with the desired layout
and "freeze" all those in an application. In this way, end users can query current data
whenever they need to.
A report is defined by a query and a layout. A query generally implies a restriction and an
aggregation of multidimensional data. For example, you can look for the monthly receipts
during the last quarter for every product category. A layout can look like a table or a chart
(diagrams, histograms, pies, and so on). Figure 1-16 shows a few examples of layouts for the
receipts query.
A reporting tool should be evaluated not only on the basis of comprehensive report
layouts, but also on the basis of flexible report delivery systems. A report can be explicitly
run by users or automatically and regularly sent to registered end users. For example, it can
be sent via e-mail.
Keep in mind that reports existed long before data warehouse systems came to be.
Reports have always been the main tool used by managers for evaluating and planning
tasks since the invention of databases. However, adding data warehouses to the mix is
beneficial to reports for two main reasons: First, they take advantage of reliable and correct
results because the data summed up in reports is consistent and integrated. In addition,
data warehouses expedite the reporting process because the architectural separation
between transaction processing and analyses significantly improves performance.

FIGURE 1-16 Report layouts: table (top), line graph (middle), 3-D pie graphs (bottom)

1.7.2 OLAP

OLAP might be the main way to exploit information in a data warehouse. Surely it is the
most popular one, and it gives end users, whose analysis needs are not easy to define
beforehand, the opportunity to analyze and explore data interactively on the basis of the
multidimensional model. While users of reporting tools essentially play a passive role,
OLAP users are able to start a complex analysis session actively, where each step is the
result of the outcome of preceding steps. The real-time nature of OLAP sessions, the
in-depth knowledge of data they require, the complexity of the queries that can be issued,
and the need to design for users not familiar with IT make the tools in use play a crucial
role. The GUI of these tools must be flexible, easy-to-use, and effective.
An OLAP session consists of a navigation path that corresponds to an analysis process for
facts according to different viewpoints and at different detail levels. This path is turned into
a sequence of queries, which are often not issued directly, but differentially expressed with
reference to the previous query. The results of queries are multidimensional. Because we
humans have a difficult time deciphering diagrams of more than three dimensions, OLAP
tools typically use tables to display data, with multiple headers, colors, and other features to
highlight data dimensions.
Every step of an analysis session is characterized by an OLAP operator that turns the latest
query into a new one. The most common operators are roll-up, drill-down, slice-and-dice,
pivot, drill-across, and drill-through. The figures included here show different operators, and
were generated using the MicroStrategy Desktop front-end application in the MicroStrategy
8 tool suite. They are based on the V-Mall example, in which a large virtual mall sells items
from its catalog via phone and the Internet. Figure 1-17 shows the attribute hierarchies
relevant to the sales fact in V-Mall.
The roll-up operator causes an increase in data aggregation and removes a detail level
from a hierarchy. For example, Figure 1-18 shows a query posed by a user that displays
monthly revenues in 2005 and 2006 for every customer region. If you roll it up, you
remove the month detail to display quarterly total revenues per region. Rolling-up can also
reduce the number of dimensions in your results if you remove all the hierarchy details. If
you apply this principle to Figure 1-19, you can remove information on customers and
display yearly total revenues per product category as you turn the three-dimensional table
into a two-dimensional one. Figure 1-20 uses the cube metaphor to sketch a roll-up operation
with and without a decrease in dimensions.
The drill-down operator is the complement to the roll-up operator. Figure 1-20 shows that
it reduces data aggregation and adds a new detail level to a hierarchy. Figure 1-21 shows an
example based on a bidimensional table. This table shows that the aggregation based
on customer regions shifts to a new fine-grained aggregation based on customer cities.
In Figure 1-22, the drill-down operator causes an increase in the number of table dimensions
after adding customer region details.

FIGURE 1-17 Attribute hierarchies in V-Mall; arrows show functional dependencies

FIGURE 1-18 Time hierarchy roll-up

FIGURE 1-19 Roll-up removing customer hierarchy

FIGURE 1-20 Rolling-up (left) and drilling-down (right) a cube

FIGURE 1-21 Drilling-down customer hierarchy
Slice-and-dice is one of the most abused terms in data warehouse literature because it can
have many different meanings. A few authors use it generally to define the whole OLAP
navigation process. Other authors use it to define selection and projection operations based
on data. In compliance with section 1.5.1, we define slicing as an operation that reduces the
number of cube dimensions after setting one of the dimensions to a specific value. Dicing is
an operation that reduces the set of data being analyzed by a selection criterion (Figure 1-23).
Figures 1-24 and 1-25 show a few examples of slicing and dicing.
The pivot operator implies a change in layouts. It aims at analyzing an individual
group of information from a different viewpoint. According to the multidimensional
metaphor, if you pivot data, you rotate your cube so that you can rearrange cells on the
basis of a new perspective. In practice, you can highlight a different combination of
dimensions (Figure 1-26). Figures 1-27 and 1-28 show a few examples of pivoted
two-dimensional and three-dimensional tables.
The term drill-across stands for the opportunity to create a link between two or more
interrelated cubes in order to compare their data. For example, this applies if you calculate
an expression involving measures from two cubes (Figure 1-29). Figure 1-30 shows an
example in which a sales cube is drilled-across a promotions cube in order to compare
revenues and discounts per quarter and product category.

FIGURE 1-22 Drilling-down and adding a dimension

FIGURE 1-23 Slicing (above) and dicing (below) a cube

FIGURE 1-24 Slicing based on the Year='2006' predicate

FIGURE 1-25 Selection based on a complex predicate

FIGURE 1-26 Pivoting a cube

FIGURE 1-27 Pivoting a two-dimensional table
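
A drill-across can be sketched as a join of two cubes on their shared coordinates, followed by an expression that involves one measure from each. The cube contents, the (quarter, category) keys, and the derived `discount_ratio` measure are hypothetical and serve only to illustrate the idea.

```python
# Two interrelated cubes sharing the (quarter, product category) coordinates.
sales = {
    ("2006-Q1", "Books"): 1000.0,
    ("2006-Q1", "Music"): 400.0,
    ("2006-Q2", "Books"): 1200.0,
}
promotions = {
    ("2006-Q1", "Books"): 50.0,
    ("2006-Q2", "Books"): 120.0,
}

def drill_across(revenue, discount):
    """Link two cubes on common coordinates and compute a derived measure."""
    result = {}
    for key in revenue.keys() & discount.keys():
        result[key] = {
            "revenue": revenue[key],
            "discount": discount[key],
            "discount_ratio": discount[key] / revenue[key],
        }
    return result

comparison = drill_across(sales, promotions)
```

This sketch keeps only the coordinates present in both cubes; whether unmatched cells should be kept with a missing measure is a design choice that depends on the analysis.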
Most OLAP tools can perform drill-through operations, though with varying effectiveness.
This operation switches from multidimensional aggregate data in data marts to operational
data in sources or in the reconciled layer.
In many applications, an intermediate approach between static reporting and
OLAP is broadly used. This intermediate approach is called semi-static reporting. Even if
a semi-static report focuses on a group of information previously set, it gives users some
margin of freedom. Thanks to this margin, users can follow a limited set of navigation paths.
For example, this applies when you can roll up just to a few hierarchy attributes. This solution
is common, because it provides some unquestionable advantages. First, users need less skill to
use data models and analysis tools than they need for OLAP. Second, this avoids the risk that
occurs in OLAP of achieving inconsistent analysis results or incorrect ones because of any
misuse of aggregation operators. Third, if you pose constraints on the analyses allowed, you
will prevent users from unintentionally slowing down your system whenever they formulate
demanding queries.

FIGURE 1-28 Pivoting a three-dimensional table


FIGURE 1-29 Drilling across two cubes

1.7.3 Dashboards

Dashboards are another method used for displaying information stored in a data warehouse.
The term dashboard refers to a GUI that displays a limited amount of relevant data in a brief
and easy-to-read format. Dashboards can provide a real-time overview of the trends for a
specific phenomenon or for many phenomena that are strictly connected with each other.
The term is a visual metaphor: the group of indicators in the GUI are displayed like a car
dashboard. Dashboards are often used by senior managers who need a quick way to view
information. However, to conduct and display very complex analyses of phenomena,
dashboards must be matched with analysis tools.
Today, most software vendors offer dashboards for report creation and display. Figure 1-31
shows a dashboard created with MicroStrategy Dynamic Enterprise. The literature related
to dashboard graphic design has also proven to be very rich, in particular in the scope of
enterprises (Few, 2006).

FIGURE 1-30 Drilling across the sales cube (Revenue measure) and the promotions cube
(Discount measure)

FIGURE 1-31 An example of dashboards

Keep in mind, however, that dashboards are nothing but performance indicators behind
GUIs. Their effectiveness is due to a careful selection of the relevant measures, while using
data warehouse information quality standards. For this reason, dashboards should be
viewed as a sophisticated effective add-on to data warehouse systems, but not as the
primary goal of data warehouse systems. In fact, the primary goal of data warehouse
systems should always be to properly define a process to transform data into information.

1.8 ROLAP, MOLAP, and HOLAP


These three acronyms conceal three major approaches to implementing data warehouses,
and they are related to the logical model used to represent data:
• ROLAP stands for Relational OLAP, an implementation based on relational DBMSs.
• MOLAP stands for Multidimensional OLAP, an implementation based on
  multidimensional DBMSs.
• HOLAP stands for Hybrid OLAP, an implementation using both relational and
  multidimensional techniques.


The idea of adopting relational technology to store data in a data warehouse has a solid foundation if you consider the huge amount of literature written about the relational model, the broadly available corporate experience with relational database usage and management, and the top performance and flexibility standards of relational DBMSs (RDBMSs). The expressive power of the relational model, however, does not include the concepts of dimension, measure, and hierarchy, so you must create specific types of schemata so that you can represent the multidimensional model in terms of basic relational elements such as attributes, relations, and integrity constraints. This task is mainly performed by the well-known star schema. See Chapter 8 for more details on star schemata and star schema variants.
The main problem with ROLAP implementations results from the performance hit caused by costly join operations between large tables. To reduce the number of joins, one of the key concepts of ROLAP is denormalization, a conscious breach of the third normal form aimed at maximizing performance. To minimize execution costs, the other key word is redundancy, which is the result of the materialization of some derived tables (views) that store aggregate data used for typical OLAP queries.
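The two ideas above, denormalized dimension tables and redundant aggregate tables, can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module; all table names, column names, and sample rows are hypothetical and chosen only for illustration. SQLite has no native materialized views, so the aggregate is stored as an ordinary derived table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Denormalization: each hierarchy (product -> category, store -> city)
# is kept in one dimension table instead of being split into 3NF tables.
cur.executescript("""
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, product TEXT, category TEXT);
CREATE TABLE store_dim   (store_id   INTEGER PRIMARY KEY, store TEXT, city TEXT);
CREATE TABLE sales_fact (
    product_id INTEGER REFERENCES product_dim(product_id),
    store_id   INTEGER REFERENCES store_dim(store_id),
    sale_date  TEXT,
    revenue    REAL
);
""")
cur.executemany("INSERT INTO product_dim VALUES (?, ?, ?)",
                [(1, "Milk", "Dairy"), (2, "Yogurt", "Dairy"), (3, "Cola", "Drinks")])
cur.executemany("INSERT INTO store_dim VALUES (?, ?, ?)",
                [(1, "Store A", "Rome"), (2, "Store B", "Milan")])
cur.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                [(1, 1, "2009-01-05", 10.0), (2, 1, "2009-01-05", 5.0),
                 (3, 2, "2009-01-06", 8.0), (1, 2, "2009-01-06", 12.0)])

# Redundancy: store a pre-aggregated derived table so that typical
# OLAP queries need not join and scan the full fact table.
cur.execute("""
CREATE TABLE sales_by_category_city AS
SELECT p.category, s.city, SUM(f.revenue) AS revenue
FROM sales_fact f
JOIN product_dim p ON p.product_id = f.product_id
JOIN store_dim   s ON s.store_id   = f.store_id
GROUP BY p.category, s.city
""")

# A coarser query is answered from the aggregate alone, with no joins.
revenue_by_category = dict(cur.execute(
    "SELECT category, SUM(revenue) FROM sales_by_category_city GROUP BY category"))
print(revenue_by_category)
```

The price of this redundancy is that the derived table must be refreshed whenever the fact table is loaded, which the sketch does not attempt to show.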
From an architectural viewpoint, adopting ROLAP requires specialized middleware, also called a multidimensional engine, between relational back-end servers and front-end components, as shown in Figure 1-32. The middleware receives OLAP queries formulated by users in a front-end tool and turns them into SQL instructions for a relational back-end application with the support of meta-data. The so-called aggregate navigator is a particularly important component in this phase. When aggregate views exist, this component selects, from among all the alternatives, the view that can solve a specific query at the minimum access cost.
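The selection performed by an aggregate navigator can be sketched as follows. This is a simplified illustration, not the algorithm of any specific product: it assumes additive measures, uses row count as a crude proxy for access cost, and relies on a hypothetical roll-up map (ROLLS_UP_TO) standing in for the meta-data. All view names and sizes are invented.

```python
from dataclasses import dataclass

# Hypothetical meta-data: each level maps to the coarser levels derivable from it.
ROLLS_UP_TO = {
    "product": {"category"},
    "store": {"city"},
    "date": {"month"},
}

@dataclass(frozen=True)
class AggregateView:
    name: str
    group_by: frozenset
    row_count: int  # crude stand-in for access cost

def derivable_levels(group_by):
    """Levels stored in a view plus those reachable by rolling up."""
    levels = set(group_by)
    for level in group_by:
        levels |= ROLLS_UP_TO.get(level, set())
    return levels

def choose_view(query_levels, views):
    """Return the cheapest view able to answer a query on query_levels."""
    candidates = [v for v in views
                  if set(query_levels) <= derivable_levels(v.group_by)]
    if not candidates:
        raise LookupError(f"no view can answer {sorted(query_levels)}")
    return min(candidates, key=lambda v: v.row_count)

views = [
    AggregateView("sales_fact", frozenset({"product", "store", "date"}), 1_000_000),
    AggregateView("sales_by_category_store_month",
                  frozenset({"category", "store", "month"}), 20_000),
    AggregateView("sales_by_category_month", frozenset({"category", "month"}), 600),
]

best_for_month = choose_view({"category", "month"}, views)
best_for_city = choose_view({"category", "city"}, views)
print(best_for_month.name, best_for_city.name)
```

In this sketch the query on category and city cannot use the smallest view, because city cannot be derived from it, so the navigator falls back to the next cheapest alternative.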
In commercial products, different front-end modules, such as OLAP, reports, and dashboards, are generally strictly connected to a multidimensional engine. Multidimensional engines are the main components and can be connected to any relational server. Open source solutions have been recently released. Their multidimensional engines (Mondrian, 2009) are disconnected from front-end modules (JPivot, 2009). For this reason, they can be more flexible than commercial solutions when you have to create the architecture (Thomsen and Pedersen, 2005). A few commercial RDBMSs natively support features typical for multidimensional engines to maximize query optimization and increase meta-data reusability. For example, since its 8i version was made available, Oracle's RDBMS gives users the opportunity to define hierarchies and materialized views. Moreover, it offers a navigator that can use meta-data and rewrite queries without any need for a multidimensional engine to be involved.

FIGURE 1-32 ROLAP architecture

Different from a ROLAP system, a MOLAP system is based on an ad hoc logical model that can be used to represent multidimensional data and operations directly. The underlying multidimensional database physically stores data as arrays and the access to it is positional (Gaede and Günther, 1998). Grid-files (Nievergelt et al., 1984; Whang and Krishnamurthy, 1991), R*-trees (Beckmann et al., 1990) and UB-trees (Markl et al., 2001) are among the techniques used for this purpose.
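Positional access can be illustrated with a minimal sketch in which a three-dimensional cube is stored as one flat array and each cell is reached by computing its offset from the member positions. The dimensions, members, and values are hypothetical, and the row-major layout is an assumption made for the example; real multidimensional DBMSs use more elaborate structures such as those cited above.

```python
# Members of each dimension are mapped to integer positions once.
products = ["Milk", "Yogurt", "Cola"]
stores = ["Store A", "Store B"]
months = ["2009-01", "2009-02"]

def index_of(members):
    return {member: position for position, member in enumerate(members)}

product_pos = index_of(products)
store_pos = index_of(stores)
month_pos = index_of(months)

# One cell per combination of members, whether or not a sale occurred.
cells = [0.0] * (len(products) * len(stores) * len(months))

def offset(product, store, month):
    """Row-major position of a cell: no join, just arithmetic."""
    return ((product_pos[product] * len(stores) + store_pos[store])
            * len(months) + month_pos[month])

cells[offset("Milk", "Store A", "2009-01")] = 10.0
cells[offset("Milk", "Store B", "2009-01")] = 12.0
cells[offset("Cola", "Store B", "2009-02")] = 8.0

# Rolling up over stores and months for one product is a loop over positions.
milk_total = sum(cells[offset("Milk", s, m)] for s in stores for m in months)
empty_cells = sum(1 for value in cells if value == 0.0)
print(milk_total, empty_cells)
```

Note that the array reserves a cell for every combination of members, so most cells stay empty when only a few combinations correspond to actual events.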
The greatest advantage of MOLAP systems in comparison with ROLAP is that multidimensional operations can be performed in an easy, natural way, without any need for complex join operations. For this reason, MOLAP system performance is excellent. However, MOLAP system implementations have very little in common, because no multidimensional logical model standard has yet been set. Generally, they simply share the usage of optimization techniques specifically designed for sparsity management. The lack of a common standard is a problem being progressively solved. As a result, MOLAP tools are becoming more and more successful after many years of limited adoption. This success is also proven by the investments in this technology by major vendors, such as Microsoft (Analysis Services) and Oracle (Hyperion).
The intermediate architecture type, HOLAP, aims at mixing the advantages of both basic solutions. It takes advantage of the standardization level and the ability to manage large amounts of data from ROLAP implementations, and the query speed typical of MOLAP systems. HOLAP implies that the largest amount of data should be stored in an RDBMS to avoid the problems caused by sparsity, and that a multidimensional system stores only the information users most frequently need to access. If that information is not enough to solve queries, the system will transparently access the part of the data managed by the relational system. Over the last few years, important market actors such as MicroStrategy have adopted HOLAP solutions to improve their platform performance, joining other vendors already using this solution, such as Business Objects.
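The transparent fallback described for HOLAP can be sketched as follows. This is only an illustration under stated assumptions: a Python dictionary stands in for the multidimensional store holding frequently accessed cells, an in-memory SQLite table stands in for the RDBMS, and the class name, table, and data are hypothetical.

```python
import sqlite3

class HybridStore:
    """Answers from the multidimensional part when possible, else from SQL."""

    def __init__(self, hot_cells, connection):
        self.hot_cells = hot_cells      # stands in for the multidimensional store
        self.connection = connection    # stands in for the relational back end
        self.relational_reads = 0

    def revenue(self, category, month):
        key = (category, month)
        if key in self.hot_cells:
            return self.hot_cells[key]
        # Cell not held in the multidimensional part: query the RDBMS instead.
        self.relational_reads += 1
        row = self.connection.execute(
            "SELECT SUM(revenue) FROM sales WHERE category = ? AND month = ?",
            key).fetchone()
        return row[0]  # None when no matching rows exist

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, month TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("Dairy", "2009-01", 27.0),
    ("Drinks", "2009-01", 8.0),
    ("Dairy", "2008-01", 19.0),
])

# Only the most recent month is kept in the multidimensional part.
store = HybridStore({("Dairy", "2009-01"): 27.0, ("Drinks", "2009-01"): 8.0}, conn)
recent = store.revenue("Dairy", "2009-01")   # served from the hot cells
older = store.revenue("Dairy", "2008-01")    # served by the relational table
print(recent, older, store.relational_reads)
```

The caller uses the same method in both cases, which is the sense in which the access to relational data is transparent.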

1.9 Additional Issues


The issues that follow can play a fundamental role in tuning up a data warehouse system. These points involve very wide-ranging problems and are mentioned here to give you the most comprehensive picture possible.

1.9.1 Quality

In general, we can say that the quality of a process stands for the way a process meets users' goals. In data warehouse systems, quality matters not only at the level of data, but above all for the whole integrated system, because of the goals and usage of data warehouses. A strict quality standard must be ensured from the first phases of the data warehouse project.

Defining, measuring, and maximizing the quality of a data warehouse system can be very complex problems. For this reason, we mention only a few properties characterizing data quality here:
Accuracy: Stored values should be compliant with real-world ones.
Freshness: Data should not be old.
Completeness: There should be no lack of information.
Consistency: Data representation should be uniform.
Availability: Users should have easy access to data.
Traceability: Data can easily be traced back to its sources.
Clearness: Data can be easily understood.
Technically, checking for data quality requires appropriate sets of metrics (Abelló et al., 2006). The following are examples of metrics for a few of the quality properties mentioned:
Accuracy and completeness: Refers to the percentage of tuples not loaded by an ETL process, categorized on the basis of the types of problem arising. This property shows the percentage of missing, invalid, and nonstandard values of every attribute.
Freshness: Defines the time elapsed between the date when an event takes place and the date when users can access it.
Consistency: Defines the percentage of tuples that meet business rules that can be set for measures of an individual cube or many cubes, and the percentage of tuples meeting structural constraints imposed by the data model (for example, uniqueness of primary keys, referential integrity, and cardinality constraint compliance).
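The three example metrics can be computed along the following lines. This is a minimal sketch over a handful of invented rows: the field names, the rejection rules, and the set of known customers are assumptions made for illustration, not part of any standard metric definition.

```python
from collections import Counter
from datetime import date

known_customers = {10, 11}  # hypothetical dimension keys for the integrity check

extracted = [
    {"id": 1, "customer_id": 10, "amount": 25.0,
     "event": date(2009, 4, 1), "available": date(2009, 4, 2)},
    {"id": 2, "customer_id": 11, "amount": None,
     "event": date(2009, 4, 1), "available": date(2009, 4, 2)},
    {"id": 3, "customer_id": 10, "amount": -5.0,
     "event": date(2009, 4, 2), "available": date(2009, 4, 5)},
    {"id": 4, "customer_id": 99, "amount": 40.0,
     "event": date(2009, 4, 3), "available": date(2009, 4, 6)},
]

def rejection_reason(row):
    """Return why the ETL step refuses a row, or None if it is loaded."""
    if row["amount"] is None:
        return "missing"
    if row["amount"] < 0:
        return "invalid"
    return None

# Accuracy and completeness: share of tuples not loaded, by problem type.
rejected = Counter()
loaded = []
for row in extracted:
    reason = rejection_reason(row)
    if reason is None:
        loaded.append(row)
    else:
        rejected[reason] += 1
rejected_pct = {reason: 100.0 * count / len(extracted)
                for reason, count in rejected.items()}

# Freshness: days between the event and the moment users can access it.
worst_freshness_days = max((row["available"] - row["event"]).days for row in loaded)

# Consistency: share of loaded tuples with a unique key and a valid reference.
key_counts = Counter(row["id"] for row in loaded)
consistent = [row for row in loaded
              if key_counts[row["id"]] == 1 and row["customer_id"] in known_customers]
consistency_pct = 100.0 * len(consistent) / len(loaded)

print(rejected_pct, worst_freshness_days, consistency_pct)
```

In practice such figures would be tracked per load and per attribute so that trends, rather than single values, can be compared against agreed thresholds.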
Note that corporate organization plays a fundamental role in reaching data quality goals. This role can be effectively played only by creating an appropriate and accurate certification system that defines a limited group of users in charge of data. For this reason, designers must raise senior managers' awareness of this topic. Designers must also motivate management to create an accurate certification procedure specifically differentiated for every enterprise area. A board of corporate managers promoting data quality may trigger a virtuous cycle that is more powerful and less costly than any data cleansing solution. For example, you can achieve excellent results if you connect a corporate department budget to a specific data quality threshold to be reached.
An additional topic connected to the quality of a data warehouse project is related to documentation. Today most documentation is still nonstandardized. It is often issued at the end of the entire data warehouse project. Designers and implementers consider documentation a waste of time, and data warehouse project customers consider it an extra cost item. Software engineering teaches that a standard system for documents should be issued, managed, and validated in compliance with project deadlines. This system can ensure that different data warehouse project phases are correctly carried out and that all analysis and implementation points are properly examined and understood. In the medium and long term, correct documents increase the chances of reusing data warehouse projects and ensure that project know-how is maintained.

NOTE Jarke et al., 2000 have closely studied data quality. Their studies provide useful discussions on the impact of data quality problems from the methodological point of view. Kelly, 1997 describes quality goals strictly connected to the viewpoint of business organizations. Serrano et al., 2004, 2007; Lechtenbörger, 2001; and Bouzeghoub and Kedad, 2000 focus on quality standards respectively for conceptual, logical, and physical data warehouse schemata.

1.9.2 Security

Information security is generally a fundamental requirement for a system, and it should be carefully considered in software engineering at every project development stage, from requirement analysis through implementation to maintenance. Security is particularly relevant to data warehouse projects, because data warehouses are used to manage information crucial for strategic decision-making processes. Furthermore, multidimensional properties and aggregation cause additional security problems similar to those that generally arise in statistical databases, because they implicitly offer the opportunity to infer information from data. Finally, the huge amount of information exchange that takes place in data warehouses in the data-staging phase causes specific problems related to network security.
Appropriate management and auditing control systems are important for data warehouses. Management control systems can be implemented in front-end tools or can exploit operating system services. As far as auditing is concerned, the techniques provided by DBMS servers are not generally appropriate for this purpose. For this reason, you must take advantage of the systems implemented by OLAP engines. From the viewpoint of user profile-based data access, basic requirements are related to hiding whole cubes, specific cube slices, and specific cube measures. Sometimes you also have to hide cube data beyond a given detail level.
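The four profile-based requirements just listed (hiding whole cubes, slices, measures, and data beyond a given detail level) can be sketched as a simple check applied before a query runs. Everything here is hypothetical: the Profile structure, the cube and measure names, and the three-level date hierarchy are invented for illustration and do not reflect the security model of any particular OLAP engine.

```python
from dataclasses import dataclass, field

DATE_LEVELS = ["year", "month", "day"]  # coarse to fine; hypothetical hierarchy

@dataclass
class Profile:
    hidden_cubes: set = field(default_factory=set)
    hidden_measures: set = field(default_factory=set)   # {(cube, measure)}
    slice_filters: dict = field(default_factory=dict)   # {cube: {dimension: allowed members}}
    finest_date_level: str = "day"

def authorize(profile, cube, measures, date_level):
    """Raise PermissionError if the query touches anything hidden."""
    if cube in profile.hidden_cubes:
        raise PermissionError(f"cube {cube!r} is hidden")
    blocked = [m for m in measures if (cube, m) in profile.hidden_measures]
    if blocked:
        raise PermissionError(f"measures {blocked} are hidden")
    if DATE_LEVELS.index(date_level) > DATE_LEVELS.index(profile.finest_date_level):
        raise PermissionError(f"detail level {date_level!r} is too fine")

def visible_cells(profile, cube, cells):
    """Keep only the cells inside the slices the profile may see."""
    filters = profile.slice_filters.get(cube, {})
    return [cell for cell in cells
            if all(cell[dim] in allowed for dim, allowed in filters.items())]

analyst = Profile(
    hidden_cubes={"payroll"},
    hidden_measures={("sales", "margin")},
    slice_filters={"sales": {"region": {"North"}}},
    finest_date_level="month",
)

cells = [
    {"region": "North", "revenue": 10.0},
    {"region": "South", "revenue": 7.0},
]

authorize(analyst, "sales", ["revenue"], "month")  # permitted, so no exception
north_only = visible_cells(analyst, "sales", cells)
print(north_only)
```

A check of this kind does not address the inference problem mentioned earlier: a user limited to aggregates may still be able to derive hidden detail by combining several permitted queries.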

NOTE In the scientific literature there are a few works specifically dealing with security in data warehouse systems (Kirkgöze et al., 1997; Priebe and Pernul, 2000; Rosenthal and Sciore, 2000; Katic et al., 1998). In particular, Priebe and Pernul propose a comparative study on security properties of a few commercial platforms. Fernández-Medina et al., 2004 and Soler et al., 2008 discuss an approach that could be more interesting for designers. They use a UML extension to model specific security requirements for data warehouses in the conceptual design and requirement analysis phases, respectively.

1.9.3 Evolution

Many mature data warehouse implementations are currently running in midsize and large companies. The unstoppable evolution of application domains highlights dynamic features of data warehouses connected to the way information changes at two different levels as time goes by:
Data level: Even if measured data is naturally logged in data warehouses thanks to temporal dimensions marking events, the multidimensional model implicitly assumes that hierarchies are completely static. It is clear that this assumption is not very realistic. For example, a company can add new product categories to its catalog and remove others, or it can change the category to which an existing product belongs in order to meet new marketing strategies.

Schema level: A data warehouse schema can vary to meet new business domain standards, new users' requirements, or changes in data sources. New attributes and measures can become necessary. For example, you can add a subcategory to a product hierarchy to make analyses richer in detail. You should also consider that the set of fact dimensions can vary as time goes by.
Temporal problems are even more challenging in data warehouses than in operational databases, because queries often cover longer periods of time. For this reason, data warehouse queries frequently deal with different data and/or schema versions. Moreover, this point is particularly critical for data warehouses that run for a long time, because every evolution that is not completely controlled widens the gap between the real world and its database representation, eventually making the data warehouse obsolete and useless.
As far as changes in data values are concerned, different approaches have been documented in scientific literature. Some commercial systems also make it possible to track changes and query cubes on the basis of different temporal scenarios. See section 8.4 for more details on dynamic hierarchies. On the other hand, managing changes in data schemata has been explored only partially to date. No commercial tool is currently available on the market to support approaches to data schema change management.
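To make the idea of querying under different temporal scenarios concrete, the following sketch aggregates the same sales either with the category each product had when the sale occurred or with the category valid on a chosen date. It is an illustration only: the validity-interval representation, the product and category names, and the function names are assumptions, not the techniques covered in section 8.4.

```python
from datetime import date

# (product, category, valid_from, valid_to); valid_to is exclusive, None = still valid.
category_history = [
    ("Yogurt", "Dairy", date(2008, 1, 1), date(2009, 1, 1)),
    ("Yogurt", "Health", date(2009, 1, 1), None),
    ("Milk", "Dairy", date(2008, 1, 1), None),
]

sales = [
    ("Yogurt", date(2008, 6, 1), 5.0),
    ("Yogurt", date(2009, 2, 1), 7.0),
    ("Milk", date(2009, 2, 1), 10.0),
]

def category_at(product, day):
    """Category a product belonged to on a given day."""
    for name, category, start, end in category_history:
        if name == product and start <= day and (end is None or day < end):
            return category
    raise LookupError(f"no category for {product} on {day}")

def revenue_by_category(as_of=None):
    """Aggregate by the category valid at sale time, or on as_of if given."""
    totals = {}
    for product, day, revenue in sales:
        category = category_at(product, as_of if as_of is not None else day)
        totals[category] = totals.get(category, 0.0) + revenue
    return totals

historical = revenue_by_category()                      # hierarchy as it was at each sale
current = revenue_by_category(as_of=date(2009, 3, 1))   # hierarchy as it is on one date
print(historical, current)
```

The two results differ only in where the 2008 yogurt sale is counted, which shows how the same facts can yield different totals depending on the scenario chosen.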
The approaches to data warehouse schema change management can be classified in two categories: evolution (Quix, 1999; Vaisman et al., 2002; Blaschka, 2000) and versioning (Eder et al., 2002; Golfarelli et al., 2006a). Both categories make it possible to alter data schemata, but only versioning can track previous schema releases. A few approaches to versioning can create not only true versions generated by changes in application domains, but also alternative versions to use for what-if analyses (Bebel et al., 2004).
The main problem that has not been solved in this field is the creation of techniques for versioning and data migration between versions that can flexibly support queries spanning multiple schema versions. Furthermore, we need systems that can semiautomatically adjust ETL procedures to changes in source schemata. In this direction, some OLAP tools already use their meta-data to support an impact analysis aimed at identifying the full consequences of any changes in source schemata.
