CHAPTER 1
Introduction to Data Warehousing
Information assets are immensely valuable to any enterprise, and because of this, these assets must be properly stored and readily accessible when they are needed. However, the availability of too much data makes the extraction of the most important information difficult, if not impossible. View results from any Google search, and you'll see that the data = information equation is not always correct; that is, too much data is simply too much.
Data warehousing is a phenomenon that grew from the huge amount of electronic data stored in recent years and from the urgent need to use that data to accomplish goals that go beyond the routine tasks linked to daily processing. In a typical scenario, a large corporation has many branches, and senior managers need to quantify and evaluate how each branch contributes to the global business performance. The corporate database stores detailed data on the tasks performed by branches. To meet the managers' needs, tailor-made queries can be issued to retrieve the required data. In order for this process to work, database administrators must first formulate the desired query (typically an aggregate SQL query) after closely studying database catalogs. Then the query is processed. This can take a few hours because of the huge amount of data, the query complexity, and the concurrent effects of other regular workload queries on data. Finally, a report is generated and passed to senior managers in the form of a spreadsheet.
Many years ago, database designers realized that such an approach is hardly feasible, because it is very demanding in terms of time and resources, and it does not always achieve the desired results. Moreover, a mix of analytical queries with transactional routine queries inevitably slows down the system, and this does not meet the needs of users of either type of query. Today's advanced data warehousing processes separate online analytical processing (OLAP) from online transactional processing (OLTP) by creating a new information repository that integrates basic data from various sources, properly arranges data formats, and then makes data available for analysis and evaluation aimed at planning and decision-making processes (Lechtenbörger, 2001).
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1
Let's review some fields of application for which data warehouse technologies are successfully used:
• Trade: Sales and claims analyses, shipment and inventory control, customer care and public relations
• Craftsmanship: Production cost control, supplier and order support
• Financial services: Risk analysis and credit cards, fraud detection
• Transport industry: Vehicle management
• Telecommunication services: Call flow analysis and customer profile analysis
• Health care service: Patient admission and discharge analysis and bookkeeping in accounts departments
The field of application of data warehouse systems is not only restricted to enterprises, but it also ranges from epidemiology to demography, from natural science to education. A property that is common to all fields is the need for storage and query tools to retrieve information summaries easily and quickly from the huge amount of data stored in databases or made available by the Internet. This kind of information allows us to study business phenomena, learn about meaningful correlations, and gain useful knowledge to support decision-making processes.
FIGURE 1-1 Information value as a function of quantity
An exponential increase in operational data has made computers the only tools suitable for providing data for decision-making performed by business managers. This fact has dramatically affected the role of enterprise databases and fostered the introduction of decision support systems. The concept of decision support systems mainly evolved from two research fields: theoretical studies on decision-making processes for organizations and technical research on interactive IT systems. However, the decision support system concept is based on several disciplines, such as databases, artificial intelligence, man-machine interaction, and simulation. Decision support systems became a research field in the mid-'70s and became more popular in the '80s.
A decision support system (DSS) is a set of expandable, interactive IT techniques and tools designed for processing and analyzing data and for supporting managers in decision making. To do this, the system matches individual resources of managers with computer resources to improve the quality of the decisions made.
In practice, a DSS is an IT system that helps managers make decisions or choose among different alternatives. The system provides value estimates for each alternative, allowing the manager to critically review the results. Table 1-1 shows a possible classification of DSSs on the basis of their functions (Power, 2002).
From the architectural viewpoint, a DSS typically includes a model-based management system connected to a knowledge engine and, of course, an interactive graphical user interface (Sprague and Carlson, 1982). Data warehouse systems have been managing the data back-ends of DSSs since the 1990s. They must retrieve useful information from a huge amount of data stored on heterogeneous platforms. In this way, decision-makers can formulate their queries and conduct complex analyses on relevant information without slowing down operational systems.
System
Description
Passive DSS
Active DSS
Collaborative DSS
Model-driven DSS
Communication-driven DSS
Data-driven DSS
Document-driven DSS
Knowledge-driven DSS
TABLE 1-1
Data Warehousing
Data warehousing is a collection of methods, techniques, and tools used to support knowledge workers (senior managers, directors, managers, and analysts) to conduct data analyses that help with performing decision-making processes and improving information resources.
The definition of data warehousing presented here is intentionally generic; it gives you an idea of the process but does not include specific features of the process. To understand the role and the useful properties of data warehousing completely, you must first understand the needs that brought it into being. In 1996, R. Kimball efficiently summed up a few claims frequently submitted by end users of classic information systems:
• "We have heaps of data, but we cannot access it!" This shows the frustration of those who are responsible for the future of their enterprises but have no technical tools to help them extract the required information in a proper format.
• "How can people playing the same role achieve substantially different results?" In midsize to large enterprises, many databases are usually available, each devoted to a specific business area. They are often stored on different logical and physical media that are not conceptually integrated. For this reason, the results achieved in every business area are likely to be inconsistent.
• "We want to select, group, and manipulate data in every possible way!" Decision-making processes cannot always be planned before the decisions are made. End users need a tool that is user-friendly and flexible enough to conduct ad hoc analyses. They want to choose which new correlations they need to search for in real time as they analyze the information retrieved.
• "Show me just what matters!" Examining data at the maximum level of detail is not only useless for decision-making processes, but is also self-defeating, because it does not allow users to focus their attention on meaningful information.
• "Everyone knows that some data is wrong!" This is another sore point. An appreciable percentage of transactional data is not correct or it is unavailable. It is clear that you cannot achieve good results if you base your analyses on incorrect or incomplete data.
We can use the previous list of problems and difficulties to extract a list of keywords that become distinguishing marks and essential requirements for a data warehouse process, a set of tasks that allow us to turn operational data into decision-making support information:
• accessibility to users not very familiar with IT and data structures;
• integration of data on the basis of a standard enterprise model;
• query flexibility to maximize the advantages obtained from the existing information;
• information conciseness allowing for target-oriented and effective analyses;
• multidimensional representation giving users an intuitive and manageable view of information;
• correctness and completeness of integrated data.
Data warehouses are placed right in the middle of this process and act as repositories for data. They make sure that the requirements set can be fulfilled.
Data Warehouse
A data warehouse is a collection of data that supports decision-making processes. It provides the following features (Inmon, 2005):
• It is subject-oriented.
• It is integrated and consistent.
• It shows its evolution over time and it is not volatile.
Data warehouses are subject-oriented because they hinge on enterprise-specific concepts, such as customers, products, sales, and orders. On the contrary, operational databases hinge on many different enterprise-specific applications.
We put emphasis on integration and consistency because data warehouses take advantage of multiple data sources, such as data extracted from production and then stored to enterprise databases, or even data from a third party's information systems. A data warehouse should provide a unified view of all the data. Generally speaking, we can state that creating a data warehouse system does not require that new information be added; rather, existing information needs rearranging. This implicitly means that an information system should be previously available.
Operational data usually covers a short period of time, because most transactions involve the latest data. A data warehouse should enable analyses that instead cover a few years. For this reason, data warehouses are regularly updated from operational data and keep on growing. If data were visually represented, it might progress like so: A photograph of operational data would be made at regular intervals. The sequence of photographs would be stored to a data warehouse, and results would be shown in a movie that reveals the status of an enterprise from its foundation until present.
Fundamentally, data is never deleted from data warehouses and updates are normally carried out when data warehouses are offline. This means that data warehouses can be essentially viewed as read-only databases. This satisfies the users' need for a short analysis query response time and has other important effects. First, it affects data warehouse-specific database management system (DBMS) technologies, because there is no need for advanced transaction management techniques required by operational applications. Second, data warehouses operate in read-only mode, so data warehouse-specific logical design solutions are completely different from those used for operational databases. For instance, the most obvious feature of data warehouse relational implementations is that table normalization can be given up to partially denormalize tables and improve performance.
Other differences between operational databases and data warehouses are connected with query types. Operational queries execute transactions that generally read/write a
small number of tuples from/to many tables connected by simple relations. For example, this applies if you search for the data of a customer in order to insert a new customer order. This kind of query is an OLTP query. On the contrary, the type of query required in data warehouses is OLAP. It features dynamic, multidimensional analyses that need to scan a huge amount of records to process a set of numeric data summing up the performance of an enterprise. It is important to note that OLTP systems have an essential workload core "frozen" in application programs, and ad hoc data queries are occasionally run for data maintenance. Conversely, data warehouse interactivity is an essential property for analysis sessions, so the actual workload constantly changes as time goes by.
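The contrast between the two query types can be sketched with Python's built-in sqlite3 module. The customers and orders tables, their columns, and the sample rows are hypothetical and chosen only for illustration: the OLTP-style transaction reads one customer tuple and writes one order tuple, while the OLAP-style query scans all orders to sum up receipts per city.

```python
import sqlite3

# Hypothetical toy schema; table names, columns, and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         order_date TEXT, amount REAL);
    INSERT INTO customers VALUES (1, 'Smith', 'Rome'), (2, 'Jones', 'Milan');
    INSERT INTO orders VALUES (1, 1, '2008-04-05', 25.0),
                              (2, 2, '2008-04-05', 40.0),
                              (3, 1, '2008-04-06', 10.0);
""")


def insert_order(conn, customer_name, order_date, amount):
    """OLTP-style transaction: look up one customer, insert one order."""
    with conn:  # commits on success, rolls back on an exception
        row = conn.execute(
            "SELECT id FROM customers WHERE name = ?", (customer_name,)
        ).fetchone()
        if row is None:
            raise ValueError(f"unknown customer: {customer_name}")
        conn.execute(
            "INSERT INTO orders (customer_id, order_date, amount) "
            "VALUES (?, ?, ?)",
            (row[0], order_date, amount),
        )


def receipts_per_city(conn):
    """OLAP-style query: scan every order and aggregate a numeric measure."""
    return conn.execute("""
        SELECT c.city, SUM(o.amount)
        FROM orders o JOIN customers c ON c.id = o.customer_id
        GROUP BY c.city
        ORDER BY c.city
    """).fetchall()


insert_order(conn, "Jones", "2008-04-07", 5.0)
totals = receipts_per_city(conn)
```

On a table this small both queries are instantaneous; the point is only the difference in shape. The first touches a couple of tuples through a key, the second has to visit every row, which is why mixing the two workloads on one system becomes a problem at scale.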
The distinctive features of OLAP queries suggest adoption of a multidimensional representation for data warehouse data. Basically, data is viewed as points in space, whose dimensions correspond to many possible analysis dimensions. Each point represents an event that occurs in an enterprise and is described by a set of measures relevant to decision-making processes. Section 1.5 gives a detailed description of the multidimensional model you absolutely need to be familiar with to understand how to model conceptual and logical levels of a data warehouse and how to query data warehouses.
Table 1-2 summarizes the main differences between operational databases and data warehouses.
NOTE For further details on the different issues related to the data warehouse process, refer to Chaudhuri and Dayal, 1997; Inmon, 2005; Jarke et al., 2000; Kelly, 1997; Kimball, 1996; Mattison, 2006; and Wrembel and Koncilia, 2007.
Feature            Operational Databases      Data Warehouses
Users              Thousands                  Hundreds
Workload           Preset transactions
Access
Goal               Depends on applications    Decision-making support
Data
Data integration   Application-based          Subject-based
Quality            In terms of integrity      In terms of consistency
Time coverage
Updates            Continuous                 Periodical
Model              Normalized                 Denormalized, multidimensional
Optimization
TABLE 1-2
1.3.1 Single-Layer Architecture
A single-layer architecture is not frequently used in practice. Its goal is to minimize the amount of data stored; to reach this goal, it removes data redundancies. Figure 1-2 shows the only layer physically available: the source layer. In this case, data warehouses are virtual.
FIGURE 1-2 Single-layer architecture for a data warehouse system
This means that a data warehouse is implemented as a multidimensional view of operational data created by specific middleware, or an intermediate processing layer (Devlin, 1997).
The weakness of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing. Analysis queries are submitted to operational data after the middleware interprets them. In this way, the queries affect regular transactional workloads. In addition, although this architecture can meet the requirement for integration and correctness of data, it cannot log more data than sources do. For these reasons, a virtual approach to data warehouses can be successful only if analysis needs are particularly restricted and the data volume to analyze is huge.
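A minimal sketch of the virtual approach, using Python's built-in sqlite3 module, is shown here. An SQL view stands in for the middleware described above; that substitution, the sale_lines table, and the SmartMart store are assumptions made for illustration. The "warehouse" holds no data of its own, so every analysis query is executed against the operational table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical operational table (the only physical layer).
    CREATE TABLE sale_lines (store TEXT, product TEXT, sale_date TEXT,
                             quantity INTEGER, price REAL);
    INSERT INTO sale_lines VALUES
        ('EverMore',  'Shiny', '2008-04-05', 4, 2.5),
        ('EverMore',  'Shiny', '2008-04-05', 6, 2.5),
        ('SmartMart', 'Shiny', '2008-04-05', 2, 2.5);

    -- The virtual warehouse: a view, not a stored copy of the data.
    CREATE VIEW v_sales AS
        SELECT store, product, sale_date,
               SUM(quantity)         AS quantity,
               SUM(quantity * price) AS receipts
        FROM sale_lines
        GROUP BY store, product, sale_date;
""")

query = "SELECT quantity FROM v_sales WHERE store = 'EverMore'"
before = conn.execute(query).fetchone()[0]

# A new operational transaction is visible through the view at once,
# because the view is recomputed from the source on every query.
conn.execute(
    "INSERT INTO sale_lines VALUES ('EverMore', 'Shiny', '2008-04-05', 1, 2.5)"
)
after = conn.execute(query).fetchone()[0]
```

The sketch also makes both weaknesses visible: the aggregate is recomputed on the operational table at query time, and the view can never contain more history than sale_lines itself keeps.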
1. Source layer: A data warehouse system uses heterogeneous sources of data. That data is originally stored to corporate relational databases or legacy¹ databases, or it may come from information systems outside the corporate walls.
2. Data staging: The data stored to sources should be extracted, cleansed to remove inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one common schema. The so-called Extraction, Transformation, and Loading tools (ETL) can merge heterogeneous schemata, extract, transform, cleanse, validate, filter, and load source data into a data warehouse (Jarke et al., 2000). Technologically speaking, this stage deals with problems that are typical for distributed information systems, such as inconsistent data management and incompatible data structures (Zhuge et al., 1996). Section 1.4 deals with a few points that are relevant to data staging.
3. Data warehouse layer: Information is stored to one logically centralized single repository: a data warehouse. The data warehouse can be directly accessed, but it can also be used as a source for creating data marts, which partially replicate data warehouse contents and are designed for specific enterprise departments. Meta-data repositories (section 1.6) store information on sources, access procedures, data staging, users, data mart schemata, and so on.
4. Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically analyze information, and simulate hypothetical business scenarios. Technologically speaking, it should feature aggregate data navigators, complex query optimizers, and user-friendly GUIs. Section 1.7 deals with different types of decision-making support analyses.
The architectural difference between data warehouses and data marts needs to be studied closer. The component marked as a data warehouse in Figure 1-3 is also often called the primary data warehouse or corporate data warehouse. It acts as a centralized storage system for
¹ The term legacy system denotes corporate applications, typically running on mainframes or minicomputers, that are currently used for operational tasks but do not meet modern architectural principles and current standards. For this reason, accessing legacy systems and integrating them with more recent applications is a complex task. All applications that use a nonrelational database are examples of legacy systems.
FIGURE 1-3 Two-layer architecture for a data warehouse system
all the data being summed up. Data marts can be viewed as small, local data warehouses replicating (and summing up as much as possible) the part of a primary data warehouse required for a specific application domain.
Data Marts
A data mart is a subset or an aggregation of the data stored to a primary data warehouse. It includes a set of information pieces relevant to a specific business area, corporate department, or category of users.
The data marts populated from a primary data warehouse are often called dependent. Although data marts are not strictly necessary, they are very useful for data warehouse systems in midsize to large enterprises because
• they are used as building blocks while incrementally developing data warehouses;
• they mark out the information required by a specific group of users to solve queries;
• they can deliver better performance because they are smaller than primary data warehouses.
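As a rough sketch, a dependent data mart can be materialized as a filtered and aggregated copy of a primary warehouse table. The example uses Python's built-in sqlite3 module; the dw_sales and mart_cleaners tables, the category column, and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical primary data warehouse table.
    CREATE TABLE dw_sales (store TEXT, product TEXT, category TEXT,
                           sale_date TEXT, receipts REAL);
    INSERT INTO dw_sales VALUES
        ('EverMore', 'Shiny', 'cleaners', '2008-04-05', 25.0),
        ('EverMore', 'Shiny', 'cleaners', '2008-04-06', 15.0),
        ('EverMore', 'Milk',  'food',     '2008-04-05',  8.0);

    -- Dependent data mart: a subset (one category) that is also
    -- summed up (daily rows rolled up to months).
    CREATE TABLE mart_cleaners AS
        SELECT store, product,
               substr(sale_date, 1, 7) AS month,
               SUM(receipts)           AS receipts
        FROM dw_sales
        WHERE category = 'cleaners'
        GROUP BY store, product, month;
""")

mart_rows = conn.execute("SELECT * FROM mart_cleaners").fetchall()
```

The mart ends up with fewer rows than the warehouse table it was derived from, which is the source of the performance benefit listed above.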
Sometimes, mainly for organization and policy purposes, you should use a different architecture in which sources are used to directly populate data marts. These data marts are called independent (see section 1.3.4). If there is no primary data warehouse, this streamlines the design process, but it leads to the risk of inconsistencies between data marts. To avoid these problems, you can create a primary data warehouse and still have independent data marts. In comparison with the standard two-layer architecture of Figure 1-3, the roles of data marts and data warehouses are actually inverted. In this case, the data warehouse is populated from its data marts, and it can be directly queried to make access patterns as easy as possible.
The following list sums up all the benefits of a two-layer architecture, in which a data warehouse separates sources from analysis applications (Jarke et al., 2000; Lechtenbörger, 2001):
• In data warehouse systems, good quality information is always available, even when access to sources is denied temporarily for technical or organizational reasons.
• Data warehouse analysis queries do not affect the management of transactions, the reliability of which is vital for enterprises to work properly at an operational level.
• Data warehouses are logically structured according to the multidimensional model, while operational sources are generally based on relational or semi-structured models.
• A mismatch in terms of time and granularity occurs between OLTP systems, which manage current data at a maximum level of detail, and OLAP systems, which manage historical and summarized data.
• Data warehouses can use specific design solutions aimed at performance optimization of analysis and report applications.
NOTE A few authors use the same terminology to define different concepts. In particular, those authors consider a data warehouse as a repository of integrated and consistent, yet operational, data, while they use a multidimensional representation of data only in data marts. According to our terminology, this "operational view" of data warehouses essentially corresponds to the reconciled data layer in three-layer architectures.
FIGURE 1-4 Three-layer architecture for a data warehouse system
Finally, let's consider a supplementary architectural approach, which provides a comprehensive picture. This approach can be described as a hybrid solution between the single-layer architecture and the two/three-layer architecture. This approach assumes that although a data warehouse is available, it is unable to solve all the queries formulated. This means that users may be interested in directly accessing source data from aggregate data (drill-through). To reach this goal, some queries have to be rewritten on the basis of source data (or reconciled data if it is available). This type of architecture is implemented in a prototype by Cui and Widom, 2000, and it needs to be able to go dynamically back to the source data required for queries to be solved (lineage).
NOTE Gupta, 1997a; Hull and Zhou, 1996; and Yang et al., 1997 discuss the implications of this approach from the viewpoint of performance optimization, and in particular view materialization.
FIGURE 1-5 Independent data marts architecture
FIGURE 1-6 Hub-and-spoke architecture
of information. Atomic, normalized data is stored in a reconciled layer that feeds a set of data marts containing summarized data in multidimensional form (Figure 1-6). Users mainly access the data marts, but they may occasionally query the reconciled layer.
The centralized architecture, recommended by Bill Inmon, can be seen as a particular implementation of the hub-and-spoke architecture, where the reconciled layer and the data marts are collapsed into a single physical repository.
The federated architecture is sometimes adopted in dynamic contexts where preexisting data warehouses/data marts are to be noninvasively integrated to provide a single, cross-organization decision support environment (for instance, in the case of mergers and acquisitions). Each data warehouse/data mart is either virtually or physically integrated with the others, leaning on a variety of advanced techniques such as distributed querying, ontologies, and meta-data interoperability (Figure 1-7).
The following list includes the factors that are particularly influential when it comes to choosing one of these architectures:
• The amount of interdependent information exchanged between organizational units in an enterprise and the organizational role played by the data warehouse project sponsor may lead to the implementation of enterprise-wide architectures, such as bus architectures, or department-specific architectures, such as independent data marts.
• An urgent need for a data warehouse project, restrictions on economic and human resources, as well as poor IT staff skills may suggest that a type of "quick" architecture, such as independent data marts, should be implemented.
• The minor role played by a data warehouse project in enterprise strategies can make you prefer an architecture type based on independent data marts over a hub-and-spoke architecture type.
• The frequent need for integrating preexisting data warehouses, possibly deployed on heterogeneous platforms, and the pressing demand for uniformly accessing their data can require a federated architecture type.
NOTE Refer to Jarke et al., 2000; Hoffer et al., 2005; Kimball and Caserta, 2004; and English, 1999 for more details on ETL.
The scientific literature shows that the boundaries between cleansing and transforming are often blurred from the terminological viewpoint. For this reason, a specific operation is not always clearly assigned to one of these phases. This is obviously a formal problem, but not a substantial one. We will adopt the approach used by Hoffer and others (2005) to make our explanations as clear as possible. Their approach states that cleansing is essentially aimed at rectifying data values, and transformation more specifically manages data formats.
Chapter 10 discusses all the details of the data-staging design phase. Chapter 3 deals with an early data warehouse design phase: integration. This phase is necessary if there are heterogeneous sources to define a schema for the reconciled data layer, and to specifically transform operational data in the data-staging phase.
1.4.1 Extraction
Relevant data is obtained from sources in the extraction phase. You can use static extraction when a data warehouse needs populating for the first time. Conceptually speaking, this looks like a snapshot of operational data. Incremental extraction, used to update data warehouses regularly, seizes the changes applied to source data since the latest extraction. Incremental extraction is often based on the log maintained by the operational DBMS. If a timestamp is associated with operational data to record exactly when the data is changed or added, it can be used to streamline the extraction process. Extraction can also be source-driven if you can rewrite operational applications to asynchronously notify of the changes being applied, or if your operational database can implement triggers associated with change transactions for relevant data.
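The difference between static and timestamp-based incremental extraction can be sketched in a few lines of Python. The source rows, the modified field, and both function names are hypothetical; the sketch covers only the timestamp variant, not log-based or source-driven extraction.

```python
from datetime import datetime

# Hypothetical operational rows; 'modified' records when each row was
# last changed or added.
source_rows = [
    {"id": 1, "customer": "Smith", "modified": datetime(2009, 4, 20, 9, 0)},
    {"id": 2, "customer": "Jones", "modified": datetime(2009, 4, 21, 15, 30)},
    {"id": 3, "customer": "Brown", "modified": datetime(2009, 4, 22, 8, 45)},
]


def static_extraction(rows):
    """Snapshot of all operational data, used for the first population."""
    return list(rows)


def incremental_extraction(rows, last_extraction):
    """Only the rows changed or added since the latest extraction."""
    return [row for row in rows if row["modified"] > last_extraction]


snapshot = static_extraction(source_rows)
delta = incremental_extraction(source_rows, datetime(2009, 4, 21, 0, 0))
delta_ids = [row["id"] for row in delta]
```

One limitation of a pure timestamp filter is worth keeping in mind: a row that has been deleted from the source carries no timestamp to compare, so deletions go unnoticed unless they are tracked some other way, for example through the DBMS log or a trigger.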
The data to be extracted is mainly selected on the basis of its quality (English, 1999). In particular, this depends on how comprehensive and accurate the constraints implemented in sources are, how suitable the data formats are, and how clear the schemata are.
FIGURE 1-8 Extraction, transformation, and loading
1.4.2 Cleansing
The cleansing phase is crucial in a data warehouse system because it is supposed to improve data quality, which is normally quite poor in sources (Galhardas et al., 2001). The following list includes the most frequent mistakes and inconsistencies that make data "dirty":
• Duplicate data: For example, a patient is recorded many times in a hospital patient management system
• Inconsistent values that are logically associated: Such as addresses and ZIP codes
• Missing data: Such as a customer's job
• Unexpected use of fields: For example, a socialSecurityNumber field could be used improperly to store office phone numbers
• Impossible or wrong values: Such as 2/30/2009
• Inconsistent values for a single entity because different practices were used: For example, to specify a country, you can use an international country abbreviation (I) or a full country name (Italy); similar problems arise with addresses (Hamlet Rd. and Hamlet Road)
• Inconsistent values for one individual entity because of typing mistakes: Such as Hamet Road instead of Hamlet Road
In particular, note that the last two types of mistakes are very frequent when you are managing multiple sources and are entering data manually.
The main data cleansing features found in ETL tools are rectification and homogenization. They use specific dictionaries to rectify typing mistakes and to recognize synonyms, as well as rule-based cleansing to enforce domain-specific rules and define appropriate associations between values. See section 10.2 for more details on these points.
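The following Python sketch illustrates the three ideas on the examples listed above. It is not how any particular ETL tool works: the two small dictionaries, the function names, and the similarity cutoff of 0.8 are assumptions, and the standard library's difflib stands in for whatever approximate matching a real tool applies.

```python
import difflib
from datetime import datetime

# Hypothetical dictionaries; a real tool would rely on far larger ones.
SYNONYMS = {"Rd.": "Road", "I": "Italy"}
KNOWN_STREETS = ["Hamlet Road", "Market Street"]


def homogenize(value):
    """Homogenization: replace known abbreviations, token by token."""
    return " ".join(SYNONYMS.get(token, token) for token in value.split())


def rectify_street(value):
    """Rectification: map a mistyped value to its closest dictionary entry."""
    matches = difflib.get_close_matches(value, KNOWN_STREETS, n=1, cutoff=0.8)
    return matches[0] if matches else value


def is_valid_date(text):
    """Rule-based check: reject impossible dates such as 2/30/2009."""
    try:
        datetime.strptime(text, "%m/%d/%Y")
        return True
    except ValueError:
        return False


street_a = homogenize("Hamlet Rd.")
street_b = rectify_street("Hamet Road")
country = homogenize("I")
```

After cleansing, both spellings of the street collapse to the same value, so records that refer to the same address can be recognized as such. Values with no close dictionary entry are returned unchanged rather than guessed at.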
1.4.3 Transformation
Transformation is the core of the reconciliation phase. It converts data from its operational source format into a specific data warehouse format. If you implement a three-layer architecture, this phase outputs your reconciled data layer. Independently of the presence of a reconciled data layer, establishing a mapping between the source data layer and the data warehouse layer is generally made difficult by the presence of many different, heterogeneous sources. If this is the case, a complex integration phase is required when designing your data warehouse. See Chapter 3 for more details.
The following points must be rectified in this phase:
• Loose texts may hide valuable information. For example, BigDeal LtD does not explicitly show that this is a Limited Partnership company.
• Different formats can be used for individual data. For example, a date can be saved as a string or as three integers.
Following are the main transformation processes aimed at populating the reconciled data layer:
• Conversion and normalization that operate on both storage formats and units of measure to make data uniform
• Matching that associates equivalent fields in different sources
• Selection that reduces the number of source fields and records
When populating a data warehouse, normalization is replaced by denormalization because data warehouse data are typically denormalized, and you need aggregation to sum up data properly.
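Conversion, matching, and selection can be sketched together in Python. The two source systems ("crm" and "billing"), their field names, and the target field names are hypothetical; the date example follows the string versus three-integers case mentioned above.

```python
from datetime import date, datetime

# Matching: hypothetical mappings from each source's field names to the
# common names used in the reconciled layer.
FIELD_MAP = {
    "crm": {"cust_name": "customer", "birth": "birth_date"},
    "billing": {"client": "customer", "dob": "birth_date"},
}


def convert_date(value):
    """Conversion: accept 'MM/DD/YYYY' text or a (year, month, day) triple."""
    if isinstance(value, str):
        return datetime.strptime(value, "%m/%d/%Y").date()
    year, month, day = value
    return date(year, month, day)


def transform(record, source):
    """Rename mapped fields, drop unmapped ones, and unify the date format."""
    mapping = FIELD_MAP[source]
    # Selection: fields without a mapping are not carried over.
    out = {mapping[key]: val for key, val in record.items() if key in mapping}
    out["birth_date"] = convert_date(out["birth_date"]).isoformat()
    return out


from_crm = transform(
    {"cust_name": "Smith", "birth": "04/05/1970", "fax": "n/a"}, "crm"
)
from_billing = transform({"client": "Smith", "dob": (1970, 4, 5)}, "billing")
```

Both records come out with the same field names and the same date representation, which is what makes them comparable in a common schema.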
FIGURE 1-9 Example of cleansing and transforming customer data
Cleansing and transformation processes are often closely connected in ETL tools. Figure 1-9 shows an example of cleansing and transformation of customer data: a field-based structure is extracted from a loose text, then a few values are standardized so as to remove abbreviations, and eventually those values that are logically associated can be rectified.
1.4.4 Loading
Loading into a data warehouse is the last step to take. Loading can be carried out in two ways:
• Refresh: Data warehouse data is completely rewritten. This means that older data is replaced. Refresh is normally used in combination with static extraction to initially populate a data warehouse.
• Update: Only those changes applied to source data are added to the data warehouse. Update is typically carried out without deleting or modifying preexisting data. This technique is used in combination with incremental extraction to update data warehouses regularly.
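The two loading modes can be sketched with Python's built-in sqlite3 module. The dw_sales table and its rows are hypothetical, and a real loader would add error handling, batching, and index maintenance that are left out here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical warehouse table.
conn.execute(
    "CREATE TABLE dw_sales "
    "(store TEXT, product TEXT, sale_date TEXT, quantity INTEGER)"
)

INSERT_SQL = "INSERT INTO dw_sales VALUES (?, ?, ?, ?)"


def refresh(conn, snapshot):
    """Refresh: rewrite the warehouse table completely from a snapshot."""
    with conn:  # one transaction: delete and reload together
        conn.execute("DELETE FROM dw_sales")
        conn.executemany(INSERT_SQL, snapshot)


def update(conn, changes):
    """Update: append the changes, leaving preexisting rows untouched."""
    with conn:
        conn.executemany(INSERT_SQL, changes)


# Initial population (pairs with static extraction) ...
refresh(conn, [("EverMore", "Shiny", "2008-04-05", 10)])
# ... followed by a regular update (pairs with incremental extraction).
update(conn, [("EverMore", "Shiny", "2008-04-06", 7)])
row_count = conn.execute("SELECT COUNT(*) FROM dw_sales").fetchone()[0]
```

Because update only appends, the table keeps growing and preserves history, which matches the non-volatile character of warehouse data; refresh discards whatever was loaded before.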
Over the last few years, multidimensional databases have generated much research and market interest because they are fundamental for many decision-making support applications, such as data warehouse systems. The reason why the multidimensional model is used as a paradigm of data warehouse data representation is fundamentally connected to its ease of use and intuitiveness even for IT newbies. The multidimensional model's success is also linked to the widespread use of productivity tools, such as spreadsheets, that adopt the multidimensional model as a visualization paradigm.
Perhaps the best starting point to approach the multidimensional model effectively is a definition of the types of queries for which this model is best suited. Section 1.7 offers more details on typical decision-making queries such as those listed here (Jarke et al., 2000):
• What is the total amount of receipts recorded last year per state and per product category?
• What is the relationship between the trend of PC manufacturers' shares and quarter gains over the last five years?
• Which orders maximize receipts?
• Which one of two new treatments will result in a decrease in the average period of admission?
• What is the relationship between profit gained by the shipments consisting of less than 10 items and the profit gained by the shipments of more than 10 items?
It is clear that using traditional languages, such as SQL, to express these types of queries can be a very difficult task for inexperienced users. It is also clear that running these types of queries against operational databases would result in an unacceptably long response time.
The multidimensional model begins with the observation that the factors affecting decision-making processes are enterprise-specific facts, such as sales, shipments, hospital admissions, surgeries, and so on. Instances of a fact correspond to events that occurred. For example, every single sale or shipment carried out is an event. Each fact is described by the values of a set of relevant measures that provide a quantitative description of events. For example, sales receipts, amounts shipped, hospital admission costs, and surgery time are measures.
Obviously, a huge number of events occur in typical enterprises, too many to analyze one by one. Imagine placing them all into an n-dimensional space to help us quickly select and sort them out. The n-dimensional space axes are called analysis dimensions, and they define different perspectives to single out events. For example, the sales in a store chain can be represented in a three-dimensional space whose dimensions are products, stores, and dates. As far as shipments are concerned, products, shipment dates, orders, destinations, and terms & conditions can be used as dimensions. Hospital admissions can be defined by the department-date-patient combination, and you would need to add the type of operation to classify surgery operations.
The concept of dimension gave life to the broadly used metaphor of cubes to represent multidimensional data. According to this metaphor, events are associated with cube cells and cube edges stand for analysis dimensions. If more than three dimensions exist, the cube is called a hypercube. Each cube cell is given a value for each measure. Figure 1-10 shows an intuitive representation of a cube in which the fact is a sale in a store chain. Its analysis dimensions are store, product, and date. An event stands for a specific item sold in a specific store on a specific date, and it is described by two measures: the quantity sold and the receipts. This figure highlights that the cube is sparse; this means that many events did not actually take place. Of course, you cannot sell every item every day in every store.
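One simple way to picture a sparse cube in code is a Python dictionary keyed by the coordinates along the three dimensions, holding only the cells in which an event took place. This is an illustrative data structure, not how a multidimensional engine stores data. The first event is the one shown in Figure 1-10; the other two events, the Milk product, and the SmartMart store are hypothetical.

```python
# Key: (store, product, date) coordinates. Value: the measures of the event.
cube = {
    ("EverMore", "Shiny", "2008-04-05"): {"quantity": 10, "receipts": 25.0},
    # Hypothetical events added to give each dimension two values.
    ("EverMore", "Milk", "2008-04-06"): {"quantity": 3, "receipts": 4.5},
    ("SmartMart", "Shiny", "2008-04-06"): {"quantity": 2, "receipts": 5.0},
}

# Values that appear along each analysis dimension.
stores = sorted({key[0] for key in cube})
products = sorted({key[1] for key in cube})
dates = sorted({key[2] for key in cube})

# Number of cells in the full cube versus cells that hold an event.
total_cells = len(stores) * len(products) * len(dates)
density = len(cube) / total_cells

# A cell with no event is simply absent from the dictionary.
missing = cube.get(("SmartMart", "Milk", "2008-04-05"))
```

With two values per dimension the full cube has eight cells, of which only three are occupied. Real cubes are typically far sparser, which is why storing only the events that occurred matters.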
FIGURE 1-10 The three-dimensional cube modeling sales in a store chain: 10 packs of Shiny were sold on 4/5/2008 in the EverMore store, totaling $25.
If you want to use the relational model to represent this cube, you could use the following relational schema:
SALES(store, product, date, quantity, receipts)
Here, the attributes store, product, and date make up the primary key, and events are associated with tuples, such as <'EverMore', 'Shiny', '04/05/08', 10, 25>. The constraint expressed by this primary key specifies that two events cannot be associated with an individual store, product, and date value combination, and that every value combination functionally determines a unique value for quantity and a unique value for receipts. This means that the following functional dependency (footnote 2) holds:
store, product, date → quantity, receipts
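
The dependency above can be checked mechanically on any set of tuples. The following is a minimal sketch, not part of the original text: the helper name `holds_fd` is hypothetical, and only the first tuple comes from the example above, while the other two are made up for illustration.

```python
# Sample SALES tuples: (store, product, date, quantity, receipts).
# Only the first tuple appears in the text; the others are hypothetical.
SALES = [
    ("EverMore", "Shiny", "04/05/08", 10, 25),
    ("EverMore", "Shiny", "04/06/08", 4, 10),
    ("SmartMart", "Shiny", "04/05/08", 7, 17.5),
]


def holds_fd(tuples, lhs, rhs):
    """Return True if the attributes at positions lhs determine those at rhs."""
    seen = {}
    for t in tuples:
        key = tuple(t[i] for i in lhs)
        value = tuple(t[i] for i in rhs)
        # setdefault stores the first value seen for a key and returns it later.
        if seen.setdefault(key, value) != value:
            return False
    return True


# store, product, date -> quantity, receipts
fd_ok = holds_fd(SALES, lhs=(0, 1, 2), rhs=(3, 4))
```

A relational DBMS enforces the same property through the primary key constraint; the function only makes the definition explicit.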
To avoid any misunderstanding of the term event, you should realize that the group of dimensions selected for a fact representation singles out a unique event in the multidimensional model, but the group does not necessarily single out a unique event in the application domain. To make this statement clearer, consider once again the sales example. In the application domain, one single sales event is supposed to be a customer's purchase of a set of products from a store on a specific date. In practice, this corresponds to a sales receipt. From the viewpoint of the multidimensional model, if the sales fact has the product, store, and date dimensions, an event will be the daily total amount of an item sold in a store. It is clear that the difference between both interpretations depends on sales receipts that generally include various items, and on individual items that are generally sold many times every day in a store. In the following sections, we use the terms event and fact to make reference to the granularity taken by events and facts in the multidimensional model.
2 The definition of functional dependency belongs to relational theory. Given relation schema R and two attribute sets X = {a1, ..., an} and Y = {b1, ..., bm}, X is said to functionally determine Y (X → Y) if and only if, for every legal instance r of R and for each pair of tuples t1, t2 in r, t1[X] = t2[X] implies t1[Y] = t2[Y]. Here t[X/Y] denotes the values taken in t from the attributes in X/Y. By extension, we say that a functional dependency holds between two attribute sets X and Y when each value set of X always corresponds to a single value set of Y. To simplify the notation, when we denote the attributes in each set, we drop the braces.
Normally, each dimension is associated with a hierarchy of aggregation levels, often called roll-up hierarchy. Roll-up hierarchies group aggregation level values in different ways. Hierarchies consist of levels called dimensional attributes. Figure 1-11 shows a simple example of hierarchies built on the product and store dimensions: products are classified into types, and are then further classified into categories. Stores are located in cities belonging to states. On top of each hierarchy is a fake level that includes all the dimension-related values. From the viewpoint of relational theory, you can use a set of functional dependencies between dimensional attributes to express a hierarchy:
product → type → category
store → city → state
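
One way to picture these dependencies is as lookup tables: a dictionary maps each key to exactly one value, so each mapping satisfies its functional dependency by construction, and chaining two mappings gives the derived dependency product → category. The sketch below is illustrative only; the level names come from the text, while the member values and the `roll_up` helper are assumptions.

```python
# Hypothetical members; only the level names (product, type, category,
# store, city, state) come from the text.
TYPE_OF = {"Shiny": "Cleaner", "Bleachy": "Cleaner", "Milky": "Dairy"}
CATEGORY_OF = {"Cleaner": "House cleaning", "Dairy": "Food"}
CITY_OF = {"EverMore": "Miami", "SmartMart": "Orlando"}
STATE_OF = {"Miami": "Florida", "Orlando": "Florida"}


def roll_up(value, *mappings):
    """Follow a chain of level-to-level mappings, finest level first."""
    for mapping in mappings:
        value = mapping[value]
    return value


category = roll_up("Shiny", TYPE_OF, CATEGORY_OF)
state = roll_up("EverMore", CITY_OF, STATE_OF)
```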
In summary, a multidimensional cube hinges on a fact relevant to decision-making. It shows a set of events for which numeric measures provide a quantitative description. Each cube axis shows a possible analysis dimension. Each dimension can be analyzed at different detail levels specified by hierarchically structured attributes.
The scientific literature shows many formal expressions of the multidimensional model, which can be more or less complex and comprehensive. We'll briefly mention alternative terms used for the multidimensional model in the scientific literature and in commercial tools.
FIGURE 1-11 Aggregation hierarchies built on the product and store dimensions
The fact and cube terms are often used interchangeably. Essentially, everyone agrees on the use of the term dimensions to specify the coordinates that classify and identify fact occurrences. However, entire hierarchies are sometimes called dimensions. For example, the term time dimension can be used for the entire hierarchy built on the date attribute. Measures are sometimes called variables, metrics, properties, attributes, or indicators. In some models, dimensional attributes of hierarchies are called levels or parameters.
NOTE The main formal expressions of the multidimensional model in the literature were proposed by Agrawal et al., 1995; Gyssens and Lakshmanan, 1997; Datta and Thomas, 1997; Vassiliadis, 1998; and Cabibbo and Torlone, 1998.
The information in a multidimensional cube is very difficult for users to manage because of its quantity, even if it is a concise version of the information stored in operational databases. If, for example, a store chain includes 50 stores selling 1000 items, and a specific data warehouse covers three-year-long transactions (approximately 1000 days), the number of potential events totals 50 × 1000 × 1000 = 5 × 10^7. Assuming that each store can sell only 10 percent of all the available items per day, the number of events totals 5 × 10^6. This is still too much data to be analyzed by users without relying on automatic tools.
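
The arithmetic behind these two figures can be reproduced in a few lines. This is only a worked restatement of the numbers given above; the variable names are illustrative.

```python
stores, items, days = 50, 1000, 1000

# Every (store, item, day) combination is a potential event.
potential_events = stores * items * days

# If each store sells only 10 percent of the items on any given day,
# only one tenth of the potential cells actually hold an event.
items_sold_per_store_per_day = items // 10
actual_events = stores * items_sold_per_store_per_day * days
```

Here `potential_events` evaluates to 50,000,000 (5 × 10^7) and `actual_events` to 5,000,000 (5 × 10^6).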
You have essentially two ways to reduce the quantity of data and obtain useful information: restriction and aggregation. The cube metaphor offers an easy-to-use and intuitive way to understand both of these methods, as we will discuss in the following paragraphs.
1.5.1 Restriction
Restricting data means separating part of the data from a cube to mark out an analysis field. In relational algebra terminology, this is called making selections and/or projections.
The simplest type of selection is data slicing, shown in Figure 1-12. When you slice data, you decrease cube dimensionality by setting one or more dimensions to a specific value. For example, if you set one of the sales cube dimensions to a value, such as store = 'EverMore', this results in the set of events associated with the items sold in the EverMore store. According to the cube metaphor, this is simply a plane of cells, that is, a data slice that can be easily displayed in spreadsheets. In the store chain example given earlier, approximately 10^5 events still appear in your result. If you set two dimensions to a value, such as store = 'EverMore' and date = '4/5/2008', this will result in all the different items sold in the EverMore store on April 5 (approximately 100 events). Graphically speaking, this information is stored at the intersection of two perpendicular planes resulting in a line. If you set all the dimensions to a particular value, you will define just one event that corresponds to a point in the three-dimensional space of sales.
Dicing is a generalization of slicing. It poses some constraints on dimensional attributes to scale down the size of a cube. For example, you can select only the daily sales of the food items in April 2008 in Florida (Figure 1-12). In this way, if five stores are located in Florida and 50 food products are sold, the number of events to examine changes to 5 × 50 × 30 = 7500.
Finally, a projection can be referred to as a choice to keep just one subgroup of measures for every event and reject other measures.
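
The three restriction operations can be sketched on a sparse cube stored as a dictionary from coordinates to measures. This is an illustrative sketch under stated assumptions, not a description of how any OLAP tool works: the function names are hypothetical, and only the first event comes from the text.

```python
from datetime import date

# Sparse cube: (store, product, date) -> measures. Events that did not
# take place are simply absent. All rows but the first are hypothetical.
CUBE = {
    ("EverMore", "Shiny", date(2008, 4, 5)): {"quantity": 10, "receipts": 25.0},
    ("EverMore", "Milky", date(2008, 4, 5)): {"quantity": 3, "receipts": 4.5},
    ("SmartMart", "Shiny", date(2008, 4, 6)): {"quantity": 7, "receipts": 17.5},
    ("SmartMart", "Milky", date(2008, 5, 2)): {"quantity": 2, "receipts": 3.0},
}


def slice_cube(cube, store):
    """Fix the store dimension; the result is keyed by (product, date) only."""
    return {(p, d): m for (s, p, d), m in cube.items() if s == store}


def dice_cube(cube, predicate):
    """Keep the events whose coordinates satisfy an arbitrary predicate."""
    return {k: m for k, m in cube.items() if predicate(*k)}


def project(cube, measures):
    """Keep only the chosen measures of every event."""
    return {k: {name: m[name] for name in measures} for k, m in cube.items()}


evermore = slice_cube(CUBE, "EverMore")
april_shiny = dice_cube(
    CUBE, lambda s, p, d: p == "Shiny" and (d.year, d.month) == (2008, 4)
)
quantities = project(CUBE, ["quantity"])
```

Slicing drops a dimension from the key, dicing keeps the dimensionality but shrinks the set of events, and projection leaves the events untouched while discarding measures.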
1.5.2 Aggregation
Aggregation plays a fundamental role in multidimensional databases. Assume, for example, that you want to analyze the items sold monthly for a three-year period. According to the cube metaphor, this means that you need to sort all the cells related to the days of each month by product and store, and then merge them into one single macro-cell. In the aggregate cube obtained in this way, the total number of events (that is, the number of macro-cells) is 50 × 1000 × 36. This is because the granularity of the time dimension does not depend on days any longer, but now depends on months, and 36 is the number of months in three years. Every aggregate event will then sum up the data available in the events it aggregates. In this example, the total amount of items sold per month and the total receipts are calculated by summing every single value of their measures (Figure 1-13). If you further aggregate along time, you can achieve just three events for every store-product combination: one for every year. When you completely aggregate along the time dimension, each store-product combination corresponds to one single event, which shows the total amount of items sold in a store over three years and the total amount of receipts.
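
Aggregation along the time dimension amounts to grouping events by a coarser key and summing their measures. The sketch below applies this to the yearly quantities reported in Figure 1-13 and completely aggregates along time; the `aggregate` helper is a hypothetical name introduced for illustration.

```python
from collections import defaultdict

# Yearly quantities of one product per store, as reported in Figure 1-13.
YEARLY = {
    ("EverMore", 2007): 2400, ("EvenMore", 2007): 2000, ("SmartMart", 2007): 1600,
    ("EverMore", 2008): 3200, ("EvenMore", 2008): 2300, ("SmartMart", 2008): 3000,
    ("EverMore", 2009): 3400, ("EvenMore", 2009): 2200, ("SmartMart", 2009): 3200,
}


def aggregate(events, group_key):
    """Sum the measure of all events that share the same group key."""
    totals = defaultdict(int)
    for coords, quantity in events.items():
        totals[group_key(coords)] += quantity
    return dict(totals)


# Drop the year coordinate: one macro-event per store.
per_store = aggregate(YEARLY, lambda coords: coords[0])

# Number of macro-cells when days are merged into months over three years.
monthly_macro_cells = 50 * 1000 * 36
```

The resulting totals match the last table of the figure: 9,000 for EverMore, 6,500 for EvenMore, and 7,800 for SmartMart.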
FIGURE 1-13 Time hierarchy aggregation of the quantity of items sold per product in three stores. A dash shows that an event did not occur because no item was sold.

Daily level (excerpt)
Date             EverMore   EvenMore   SmartMart
1/2/2008         15         10         20
1/3/2008         20         20         25
..........
1/3/2009         20         12         20
..........

Monthly level
Month            EverMore   EvenMore   SmartMart
January 2007     200        180        150
February 2007    180        150        120
March 2007       220        180        160
..........
January 2008     350        220        200
February 2008    300        200        250
March 2008       310        180        300
..........
January 2009     380        200        220
February 2009    310        200        250
March 2009       300        160        280
..........

Yearly level
Year             EverMore   EvenMore   SmartMart
2007             2,400      2,000      1,600
2008             3,200      2,300      3,000
2009             3,400      2,200      3,200

Complete aggregation along time
                 EverMore   EvenMore   SmartMart
Total            9,000      6,500      7,800
FIGURE 1-14 Two cube aggregation levels. Every macro-event measure value is a sum of its component
event values.
You can aggregate along various dimensions at the same time. For example, Figure 1-14 shows that you can group sales by month, product type, and store city, and by month and product type. Moreover, selections and aggregations can be combined to carry out an analysis process targeted exactly to users' needs.
1.6 Meta-data
The term meta-data can be applied to the data used to define other data. In the scope of data warehousing, meta-data plays an essential role because it specifies source, values, usage, and features of data warehouse data and defines how data can be changed and processed at every architecture layer. Figures 1-3 and 1-4 show that the meta-data repository is closely connected to the data warehouse. Applications use it intensively to carry out data-staging and analysis tasks.
According to Kelly's approach, you can classify meta-data into two partially overlapping categories. This classification is based on the ways system administrators and end users exploit meta-data. System administrators are interested in internal meta-data because it defines data sources, transformation processes, population policies, logical and physical schemata, constraints, and user profiles. External meta-data is relevant to end users. For example, it covers definitions, quality standards, units of measure, and relevant aggregations.
Meta-data is stored in a meta-data repository which all the other architecture components can access. According to Kelly, a tool for meta-data management should
• allow administrators to perform system administration operations, and in particular manage security;
• allow end users to navigate and query meta-data;
• use a GUI;
• allow end users to extend meta-data;
• allow meta-data to be imported/exported into/from other standard tools and formats.
As far as representation formats are concerned, Object Management Group (OMG, 2000) released a standard called Common Warehouse Metamodel (CWM) that relies on three famous standards: Unified Modeling Language (UML), eXtensible Markup Language (XML), and XML Metadata Interchange (XMI). Partners, such as IBM, Unisys, NCR, and Oracle, in a common effort, created the new standard format that specifies how meta-data can be exchanged among the technologies related to data warehouses, business intelligence, knowledge management, and web portals.
Figure 1-15 shows an example of a dialog box displaying external meta-data related to hierarchies in MicroStrategy Desktop of the MicroStrategy 8 tool suite. In particular, this dialog box displays the parent attributes of the Calling Center attribute. Specifically, it states that a calling center refers to a distribution center, belongs to a region, and is managed by a manager.
FIGURE 1-15 Accessing hierarchy meta-data in MicroStrategy
NOTE See Barquin and Edelstein, 1996; Jarke et al., 2000; Jennings, 2004; and Tozer, 1999, for a comprehensive discussion on meta-data representation and management.
1.7.1 Reports
This approach is oriented to those users who need to have regular access to the information in an almost static way. For example, suppose a local health authority must send to its state offices monthly reports summing up information on patient admission costs. The layout of those reports has been predetermined and may vary only if changes are applied to current laws and regulations. Designers issue the queries to create reports with the desired layout and freeze all those in an application. In this way, end users can query current data whenever they need to.
A report is defined by a query and a layout. A query generally implies a restriction and an aggregation of multidimensional data. For example, you can look for the monthly receipts during the last quarter for every product category. A layout can look like a table or a chart (diagrams, histograms, pies, and so on). Figure 1-16 shows a few examples of layouts for the receipts query.
A reporting tool should be evaluated not only on the basis of comprehensive report layouts, but also on the basis of flexible report delivery systems. A report can be explicitly run by users or automatically and regularly sent to registered end users. For example, it can be sent via e-mail.
Keep in mind that reports existed long before data warehouse systems came to be. Reports have always been the main tool used by managers for evaluating and planning tasks since the invention of databases. However, adding data warehouses to the mix is beneficial to reports for two main reasons: First, they take advantage of reliable and correct results because the data summed up in reports is consistent and integrated. In addition, data warehouses expedite the reporting process because the architectural separation between transaction processing and analyses significantly improves performance.
FIGURE 1-16 Report layouts: table (top), line graph (middle), 3-D pie graphs (bottom)
1.7.2 OLAP
OLAP might be the main way to exploit information in a data warehouse. Surely it is the most popular one, and it gives end users, whose analysis needs are not easy to define beforehand, the opportunity to analyze and explore data interactively on the basis of the multidimensional model. While users of reporting tools essentially play a passive role, OLAP users are able to start a complex analysis session actively, where each step is the result of the outcome of preceding steps. The real-time properties of OLAP sessions, the in-depth knowledge of data they require, the complexity of the queries that can be issued, and the need to design for users not familiar with IT make the tools in use play a crucial role. The GUI of these tools must be flexible, easy-to-use, and effective.
An OLAP session consists of a navigation path that corresponds to an analysis process for facts according to different viewpoints and at different detail levels. This path is turned into a sequence of queries, which are often not issued directly, but differentially expressed with reference to the previous query. The results of queries are multidimensional. Because we humans have a difficult time deciphering diagrams of more than three dimensions, OLAP tools typically use tables to display data, with multiple headers, colors, and other features to highlight data dimensions.
Every step of an analysis session is characterized by an OLAP operator that turns the latest query into a new one. The most common operators are roll-up, drill-down, slice-and-dice, pivot, drill-across, and drill-through. The figures included here show different operators, and were generated using the MicroStrategy Desktop front-end application in the MicroStrategy 8 tool suite. They are based on the V-Mall example, in which a large virtual mall sells items from its catalog via phone and the Internet. Figure 1-17 shows the attribute hierarchies relevant to the sales fact in V-Mall.
FIGURE 1-17 Attribute hierarchies in V-Mall; arrows show functional dependencies
The roll-up operator causes an increase in data aggregation and removes a detail level from a hierarchy. For example, Figure 1-18 shows a query posed by a user that displays monthly revenues in 2005 and 2006 for every customer region. If you roll it up, you remove the month detail to display quarterly total revenues per region. Rolling-up can also reduce the number of dimensions in your results if you remove all the hierarchy details. If you apply this principle to Figure 1-19, you can remove information on customers and display yearly total revenues per product category as you turn the three-dimensional table into a two-dimensional one. Figure 1-20 uses the cube metaphor to sketch a roll-up operation with and without a decrease in dimensions.
The drill-down operator is the complement to the roll-up operator. Figure 1-20 shows that it reduces data aggregation and adds a new detail level to a hierarchy. Figure 1-21 shows an example based on a two-dimensional table. This table shows that the aggregation based on customer regions shifts to a new fine-grained aggregation based on customer cities. In Figure 1-22, the drill-down operator causes an increase in the number of table dimensions after adding customer region details.
FIGURE 1-20
FIGURE 1-21
Slice-and-dice is one of the most abused terms in data warehouse literature because it can have many different meanings. A few authors use it generally to define the whole OLAP navigation process. Other authors use it to define selection and projection operations based on data. In compliance with section 1.5.1, we define slicing as an operation that reduces the number of cube dimensions after setting one of the dimensions to a specific value. Dicing is an operation that reduces the set of data being analyzed by a selection criterion (Figure 1-23). Figures 1-24 and 1-25 show a few examples of slicing and dicing.
FIGURE 1-22
FIGURE 1-23 Slicing (above) and dicing (below) a cube
The pivot operator implies a change in layouts. It aims at analyzing an individual group of information from a different viewpoint. According to the multidimensional metaphor, if you pivot data, you rotate your cube so that you can rearrange cells on the basis of a new perspective. In practice, you can highlight a different combination of dimensions (Figure 1-26). Figures 1-27 and 1-28 show a few examples of pivoted two-dimensional and three-dimensional tables.
FIGURE 1-24
FIGURE 1-25
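
For a two-dimensional table, pivoting can be sketched as swapping the row and column headers without touching the cell values. The table below is hypothetical and the `pivot` helper is an illustrative name, not the operation of any specific OLAP tool.

```python
# Hypothetical two-dimensional result: year -> region -> revenue.
TABLE = {
    "2005": {"North": 100, "South": 80},
    "2006": {"North": 120, "South": 90},
}


def pivot(table):
    """Swap the row and column headers of a nested-dict table."""
    pivoted = {}
    for row, cells in table.items():
        for col, value in cells.items():
            pivoted.setdefault(col, {})[row] = value
    return pivoted


by_region = pivot(TABLE)
```

Pivoting twice returns the original layout, which reflects the fact that the operator changes only the presentation, not the data.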
FIGURE 1-26 Pivoting a cube
The term drill-across stands for the opportunity to create a link between two or more interrelated cubes in order to compare their data. For example, this applies if you calculate an expression involving measures from two cubes (Figure 1-29). Figure 1-30 shows an example in which a sales cube is drilled-across a promotions cube in order to compare revenues and discounts per quarter and product category.
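
A drill-across can be pictured as matching two cubes on the coordinates they share and then computing an expression over the paired measures. The following sketch assumes both cubes are already aggregated by quarter and product category; all values and the `drill_across` helper are hypothetical.

```python
# Hypothetical cubes keyed by (quarter, category).
REVENUE = {("Q1", "Food"): 1000.0, ("Q1", "Books"): 400.0, ("Q2", "Food"): 1200.0}
DISCOUNT = {("Q1", "Food"): 50.0, ("Q2", "Food"): 90.0, ("Q2", "Books"): 10.0}


def drill_across(left, right):
    """Pair the measures of two cubes on the coordinates they have in common."""
    return {k: (left[k], right[k]) for k in left.keys() & right.keys()}


combined = drill_across(REVENUE, DISCOUNT)

# An expression involving measures from both cubes.
discount_ratio = {k: disc / rev for k, (rev, disc) in combined.items()}
```

This sketch keeps only the coordinates present in both cubes; a tool might instead keep unmatched cells and show them with empty measures.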
Most OLAP tools can perform drill-through operations, though with varying effectiveness. This operation switches from multidimensional aggregate data in data marts to operational data in sources or in the reconciled layer.
In many applications, an intermediate approach between static reporting and OLAP is broadly used. This intermediate approach is called semi-static reporting. Even if a semi-static report focuses on a group of information previously set, it gives users some margin of freedom. Thanks to this margin, users can follow a limited set of navigation paths. For example, this applies when you can roll up just to a few hierarchy attributes. This solution is common, because it provides some unquestionable advantages. First, users need less skill to use data models and analysis tools than they need for OLAP. Second, this avoids the risk that occurs in OLAP of achieving inconsistent analysis results or incorrect ones because of any misuse of aggregation operators. Third, if you pose constraints on the analyses allowed, you will prevent users from unintentionally slowing down your system whenever they formulate demanding queries.
FIGURE 1-29 Drilling across two cubes
1.7.3 Dashboards
Dashboards are another method used for displaying information stored in a data warehouse. The term dashboard refers to a GUI that displays a limited amount of relevant data in a brief and easy-to-read format. Dashboards can provide a real-time overview of the trends for a specific phenomenon or for many phenomena that are closely connected with each other. The term is a visual metaphor: the group of indicators in the GUI are displayed like a car dashboard. Dashboards are often used by senior managers who need a quick way to view information. However, to conduct and display very complex analyses of phenomena, dashboards must be matched with analysis tools.
Today, most software vendors offer dashboards for report creation and display. Figure 1-31 shows a dashboard created with MicroStrategy Dynamic Enterprise. The literature related to dashboard graphic design has also proven to be very rich, in particular in the scope of enterprises (Few, 2006).
FIGURE 1-30 Drilling across the sales cube (Revenue measure) and the promotions cube
(Discount measure)
FIGURE 1-31 An example of dashboards
Keep in mind, however, that dashboards are nothing but performance indicators behind GUIs. Their effectiveness is due to a careful selection of the relevant measures, while using data warehouse information quality standards. For this reason, dashboards should be viewed as a sophisticated, effective add-on to data warehouse systems, but not as the primary goal of data warehouse systems. In fact, the primary goal of data warehouse systems should always be to properly define a process to transform data into information.
The idea of adopting the relational technology to store data in a data warehouse has a solid foundation if you consider the huge amount of literature written about the relational model, the broadly available corporate experience with relational database usage and management, and the top performance and flexibility standards of relational DBMSs (RDBMSs). The expressive power of the relational model, however, does not include the concepts of dimension, measure, and hierarchy, so you must create specific types of schemata so that you can represent the multidimensional model in terms of basic relational elements such as attributes, relations, and integrity constraints. This task is mainly performed by the well-known star schema. See Chapter 8 for more details on star schemata and star schema variants.
The main problem with ROLAP implementations results from the performance hit caused by costly join operations between large tables. To reduce the number of joins, one of the key concepts of ROLAP is denormalization, a conscious breach of the third normal form oriented to performance maximization. To minimize execution costs, the other keyword is redundancy, which is the result of the materialization of some derived tables (views) that store aggregate data used for typical OLAP queries.
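
Redundancy through materialization can be illustrated with a derived table that stores pre-computed monthly totals next to the detailed data. The sketch below uses Python's standard `sqlite3` module purely for convenience; the table and column names are assumptions, the sample rows are hypothetical, and SQLite has no native materialized views, so an ordinary table created from a query stands in for one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (
        store TEXT, product TEXT, date TEXT,
        quantity INTEGER, receipts REAL,
        PRIMARY KEY (store, product, date)
    );
    INSERT INTO sales VALUES
        ('EverMore', 'Shiny', '2008-04-05', 10, 25.0),
        ('EverMore', 'Shiny', '2008-04-06', 4, 10.0),
        ('SmartMart', 'Shiny', '2008-04-05', 7, 17.5);
    -- Redundant derived table that stores monthly aggregates.
    CREATE TABLE sales_by_month AS
        SELECT store, product, substr(date, 1, 7) AS month,
               SUM(quantity) AS quantity, SUM(receipts) AS receipts
        FROM sales
        GROUP BY store, product, substr(date, 1, 7);
""")

# A monthly query can now read the small derived table instead of
# scanning and aggregating the detailed one.
rows = conn.execute(
    "SELECT store, month, quantity, receipts FROM sales_by_month ORDER BY store"
).fetchall()
```

The price of this redundancy is that the derived table must be refreshed whenever the detailed data changes.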
From an architectural viewpoint, adopting ROLAP requires specialized middleware, also called a multidimensional engine, between relational back-end servers and front-end components, as shown in Figure 1-32. The middleware receives OLAP queries formulated by users in a front-end tool and turns them into SQL instructions for a relational back-end application with the support of meta-data. The so-called aggregate navigator is a particularly important component in this phase. In case of aggregate views, this component selects a view from among all the alternatives to solve a specific query at the minimum access cost.
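
The selection performed by an aggregate navigator can be sketched as follows: among the tables whose group-by levels are fine enough to derive the levels requested by a query, pick the one with the fewest rows. This is a simplified illustration under stated assumptions, with row count standing in for access cost; the view names, row counts, and helper functions are hypothetical and do not describe any specific product.

```python
# Finer level -> levels it can be rolled up to (hypothetical hierarchies).
ROLLS_UP_TO = {
    "date": {"date", "month", "year"},
    "month": {"month", "year"},
    "year": {"year"},
    "product": {"product", "type", "category"},
    "type": {"type", "category"},
    "category": {"category"},
    "store": {"store", "city", "state"},
    "city": {"city", "state"},
    "state": {"state"},
}

# Candidate tables: group-by levels and estimated row counts (hypothetical).
VIEWS = {
    "sales": ({"date", "product", "store"}, 5_000_000),
    "sales_month_type_city": ({"month", "type", "city"}, 40_000),
    "sales_month_type": ({"month", "type"}, 3_600),
}


def can_answer(view_levels, query_levels):
    """A view answers a query if every requested level derives from one of its levels."""
    return all(
        any(q in ROLLS_UP_TO[v] for v in view_levels) for q in query_levels
    )


def choose_view(query_levels):
    """Return the name of the smallest table able to answer the query."""
    candidates = [
        (rows, name)
        for name, (levels, rows) in VIEWS.items()
        if can_answer(levels, query_levels)
    ]
    return min(candidates)[1]
```

For instance, `choose_view({"year", "category"})` picks the smallest view, whereas a query that needs the state level must fall back on a view that still carries store information.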
In commercial products, different front-end modules, such as OLAP, reports, and dashboards, are generally strictly connected to a multidimensional engine. Multidimensional engines are the main components and can be connected to any relational server. Open source solutions have been recently released. Their multidimensional engines (Mondrian, 2009) are disconnected from front-end modules (JPivot, 2009). For this reason, they can be more flexible than commercial solutions when you have to create the architecture (Thomsen and Pedersen, 2005). A few commercial RDBMSs natively support features typical for multidimensional engines to maximize query optimization and increase meta-data reusability. For example, since its 8i version was made available, Oracle's RDBMS gives users the opportunity to define hierarchies and materialized views. Moreover, it offers a navigator that can use meta-data and rewrite queries without any need for a multidimensional engine to be involved.
FIGURE 1-32 ROLAP architecture
Different from a ROLAP system, a MOLAP system is based on an ad hoc logical model that can be used to represent multidimensional data and operations directly. The underlying multidimensional database physically stores data as arrays and the access to it is positional (Gaede and Günther, 1998). Grid-files (Nievergelt et al., 1984; Whang and Krishnamurthy, 1991), R*-trees (Beckmann et al., 1990), and UB-trees (Markl et al., 2001) are among the techniques used for this purpose.
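
Positional access means that the coordinates of a cell directly determine where its value is stored, with no key lookup or join. The sketch below shows the idea for a dense array in row-major order; it is an assumption-laden simplification, since real MOLAP engines add compression and sparsity handling, and the `cell_offset` helper is a hypothetical name.

```python
def cell_offset(coords, shape):
    """Row-major offset of a cell in a dense multidimensional array."""
    offset = 0
    for index, size in zip(coords, shape):
        if not 0 <= index < size:
            raise IndexError("coordinate out of range")
        offset = offset * size + index
    return offset


# Stores, products, days, as in the store chain example.
SHAPE = (50, 1000, 1000)

first = cell_offset((0, 0, 0), SHAPE)
some_cell = cell_offset((1, 2, 3), SHAPE)
last = cell_offset((49, 999, 999), SHAPE)
```

With this layout the dense array needs one slot for each of the 5 × 10^7 potential events, which is why sparsity management matters so much in practice.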
The greatest advantage of MOLAP systems in comparison with ROLAP is that multidimensional operations can be performed in an easy, natural way with MOLAP without any need for complex join operations. For this reason, MOLAP system performance is excellent. However, MOLAP system implementations have very little in common, because no multidimensional logical model standard has yet been set. Generally, they simply share the usage of optimization techniques specifically designed for sparsity management. The lack of a common standard is a problem being progressively solved. This means that MOLAP tools are becoming more and more successful after their limited implementation for many years. This success is also proven by the investments in this technology by major vendors, such as Microsoft (Analysis Services) and Oracle (Hyperion).
The intermediate architecture type, HOLAP, aims at mixing the advantages of both basic solutions. It takes advantage of the standardization level and the ability to manage large amounts of data from ROLAP implementations, and the query speed typical of MOLAP systems. HOLAP implies that the largest amount of data should be stored in an RDBMS to avoid the problems caused by sparsity, and that a multidimensional system stores only the information users most frequently need to access. If that information is not enough to solve queries, the system will transparently access the part of the data managed by the relational system. Over the last few years, important market actors such as MicroStrategy have adopted HOLAP solutions to improve their platform performance, joining other vendors already using this solution, such as Business Objects.
1.9.1 Quality
In general, we can say that the quality of a process stands for the way a process meets users' goals. In data warehouse systems, quality matters not only at the level of data, but above all for the whole integrated system, because of the goals and usage of data warehouses. A strict quality standard must be ensured from the first phases of the data warehouse project.
Defining, measuring, and maximizing the quality of a data warehouse system can be very complex problems. For this reason, we mention only a few properties characterizing data quality here:
• Accuracy: Stored values should be compliant with real-world ones.
• Freshness: Data should not be old.
• Completeness: There should be no lack of information.
• Consistency: Data representation should be uniform.
• Availability: Users should have easy access to data.
• Traceability: Data can easily be traced back to its sources.
• Clearness: Data can be easily understood.
Technically, checking for data quality requires appropriate sets of metrics (Abelló et al., 2006). In the following sections, we provide an example of the metrics for a few of the quality properties mentioned:
• Accuracy and completeness: Refers to the percentage of tuples not loaded by an ETL process and categorized on the basis of the types of problem arising. This property shows the percentage of missing, invalid, and nonstandard values of every attribute.
• Freshness: Defines the time elapsed between the date when an event takes place and the date when users can access it.
• Consistency: Defines the percentage of tuples that meet business rules that can be set for measures of an individual cube or many cubes and the percentage of tuples meeting structural constraints imposed by the data model (for example, uniqueness of primary keys, referential integrity, and cardinality constraint compliance).
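
Metrics of this kind reduce to simple counts over the loaded tuples. The sketch below computes one example for each property listed above; the sample rows, the function names, and the choice of primary key uniqueness as the consistency rule are illustrative assumptions, not metrics prescribed by the text.

```python
from datetime import date

# Hypothetical loaded tuples; None marks a missing value.
ROWS = [
    {"store": "EverMore", "product": "Shiny", "date": date(2008, 4, 5), "quantity": 10},
    {"store": "EverMore", "product": "Shiny", "date": date(2008, 4, 5), "quantity": 12},
    {"store": "SmartMart", "product": None, "date": date(2008, 4, 6), "quantity": 7},
    {"store": "SmartMart", "product": "Milky", "date": date(2008, 4, 6), "quantity": 2},
]


def missing_percentage(rows, attribute):
    """Completeness: percentage of tuples with no value for the attribute."""
    missing = sum(1 for r in rows if r[attribute] is None)
    return 100.0 * missing / len(rows)


def unique_key_percentage(rows, key):
    """Consistency: percentage of tuples whose key value occurs exactly once."""
    counts = {}
    for r in rows:
        k = tuple(r[a] for a in key)
        counts[k] = counts.get(k, 0) + 1
    ok = sum(1 for r in rows if counts[tuple(r[a] for a in key)] == 1)
    return 100.0 * ok / len(rows)


def freshness_days(event_date, available_date):
    """Freshness: days elapsed between an event and its availability to users."""
    return (available_date - event_date).days


missing_products = missing_percentage(ROWS, "product")
unique_keys = unique_key_percentage(ROWS, ("store", "product", "date"))
delay = freshness_days(date(2008, 4, 5), date(2008, 4, 7))
```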
Note that corporate organization plays a fundamental role in reaching data quality goals. This role can be effectively played only by creating an appropriate and accurate certification system that defines a limited group of users in charge of data. For this reason, designers must raise senior managers' awareness of this topic. Designers must also motivate management to create an accurate certification procedure specifically differentiated for every enterprise area. A board of corporate managers promoting data quality may trigger a virtuous cycle that is more powerful and less costly than any data cleansing solution. For example, you can achieve excellent results if you connect a corporate department budget to a specific data quality threshold to be reached.
An additional topic connected to the quality of a data warehouse project is related to documentation. Today most documentation is still nonstandardized. It is often issued at the end of the entire data warehouse project. Designers and implementers consider documentation a waste of time, and data warehouse project customers consider it an extra cost item. Software engineering teaches that a standard system for documents should be issued, managed, and validated in compliance with project deadlines. This system can ensure that different data warehouse project phases are correctly carried out and that all analysis and implementation points are properly examined and understood. In the medium and long term, correct documents increase the chances of reusing data warehouse projects and ensure project know-how maintenance.
NOTE Jarke et al., 2000 have closely studied data quality. Their studies provide useful discussions on the impact of data quality problems from the methodological point of view. Kelly, 1997 describes quality goals strictly connected to the viewpoint of business organizations. Serrano et al., 2004, 2007; Lechtenbörger, 2001; and Bouzeghoub and Kedad, 2000 focus on quality standards respectively for conceptual, logical, and physical data warehouse schemata.
1.9.2 Security
Information security is generally a fundamental requirement for a system, and it should be carefully considered in software engineering at every project development stage from requirement analysis through implementation to maintenance. Security is particularly relevant to data warehouse projects, because data warehouses are used to manage information crucial for strategic decision-making processes. Furthermore, multidimensional properties and aggregation cause additional security problems similar to those that generally arise in statistic databases, because they implicitly offer the opportunity to infer information from data. Finally, the huge amount of information exchange that takes place in data warehouses in the data-staging phase causes specific problems related to network security.
Appropriate management and auditing control systems are important for data warehouses. Management control systems can be implemented in front-end tools or can exploit operating system services. As far as auditing is concerned, the techniques provided by DBMS servers are not generally appropriate for this scope. For this reason, you must take advantage of the systems implemented by OLAP engines. From the viewpoint of user profile-based data access, basic requirements are related to hiding whole cubes, specific cube slices, and specific cube measures. Sometimes you also have to hide cube data beyond a given detail level.
NOTE In the scientific literature there are a few works specifically dealing with security in data warehouse systems (Kirkgöze et al., 1997; Priebe and Pernul, 2000; Rosenthal and Sciore, 2000; Katic et al., 1998). In particular, Priebe and Pernul propose a comparative study on security properties of a few commercial platforms. Fernández-Medina et al., 2004 and Soler et al., 2008 discuss an approach that could be more interesting for designers. They use a UML extension to model specific security requirements for data warehouses in the conceptual design and requirement analysis phases, respectively.
1.9.3 Evolution
Many mature data warehouse implementations are currently running in midsize and large companies. The unstoppable evolution of application domains highlights dynamic features of data warehouses connected to the way information changes at two different levels as time goes by:
• Data level: Even if measured data is naturally logged in data warehouses thanks to temporal dimensions marking events, the multidimensional model implicitly assumes that hierarchies are completely static. It is clear that this assumption is not very realistic. For example, a company can add new product categories to its catalog and remove others, or it can change the category to which an existing product belongs in order to meet new marketing strategies.
• Schema level: A data warehouse schema can vary to meet new business domain standards, new users' requirements, or changes in data sources. New attributes and measures can become necessary. For example, you can add a subcategory to a product hierarchy to make analyses richer in detail. You should also consider that the set of fact dimensions can vary as time goes by.
Temporal problems are even more challenging in data warehouses than in operational databases, because queries often cover longer periods of time. For this reason, data warehouse queries frequently deal with different data and/or schema versions. Moreover, this point is particularly critical for data warehouses that run for a long time, because every evolution not completely controlled causes a growing gap between the real world and its database representation, eventually making the data warehouse obsolete and useless.
As far as changes in data values are concerned, different approaches have been documented in scientific literature. Some commercial systems also make it possible to track changes and query cubes on the basis of different temporal scenarios. See section 8.4 for more details on dynamic hierarchies. On the other hand, managing changes in data schemata has been explored only partially to date. No commercial tool is currently available on the market to support approaches to data schema change management.
The approaches to data warehouse schema change management can be classified in two categories: evolution (Quix, 1999; Vaisman et al., 2002; Blaschka, 2000) and versioning (Eder et al., 2002; Golfarelli et al., 2006a). Both categories make it possible to alter data schemata, but only versioning can track previous schema releases. A few approaches to versioning can create not only true versions generated by changes in application domains, but also alternative versions to use for what-if analyses (Bebel et al., 2004).
The main problem that has not been solved in this field is the creation of techniques for versioning and data migration between versions that can flexibly support queries spanning multiple schema versions. Furthermore, we need systems that can semiautomatically adjust ETL procedures to changes in source schemata. In this direction, some OLAP tools already use their meta-data to support an impact analysis aimed at identifying the full consequences of any changes in source schemata.