
Sunday, 14 Feb 2016.

Oracle Articles

Designing Efficient ETL Processes in Oracle:

Introduction:

Data conversion, the process of migrating application data, is a major component of many
database-related projects. This process essentially entails three subprocesses: 1) Extracting data
from the source system, 2) Transforming data to the format required by the target system and
3) Loading data into the target system; hence the abbreviation ETL. Oracle offers a variety of
database features for implementing ETL processes. Additionally, there is the relatively new Oracle
Warehouse Builder, a full-featured ETL tool. There are also other tools by third-party vendors, such
as Informatica. However, despite this recent proliferation of ETL tools, many Oracle-based data
migration projects continue to use home-grown programs (mainly PL/SQL) for ETL. This is mainly
because smaller projects cannot justify the cost and learning time required for effective use of
commercial ETL tools. Besides, doing it oneself offers much more control over the process.

As with any programming project, there are efficient and (infinitely more) inefficient ways to
design and implement ETL processes. This article offers some tips for the design of efficient ETL
programs in an Oracle environment. The tips are listed in no particular order. I hope you find them
helpful in your projects.

Tip 1: Use external tables to load data from external flat files

The traditional workhorse for loading external (flat file) data into Oracle is the SQL*Loader
utility. Since version 9i, external tables offer an easier, and almost as performant, route to load
data. Essentially, external tables enable us to access data in an external file via an Oracle table.
Here's a simple example:

The following data is available in a text file called test.txt:

scott,cio
kailash,dba
jim,developer

We want to pull this data into Oracle using an external table. The first step is to create a
directory object in Oracle, which points to the operating system directory in which the above file
resides. Assume this is an operating system directory called c:\test. The SQL to make this directory
accessible from Oracle is:

create or replace directory test_dir as 'c:\test';

The user must have the create any directory privilege in order to execute this SQL. We then define
an external table based on our text file:

create table ext_test(
  name varchar2(10),
  job varchar2(10))
organization external
  (type oracle_loader
   default directory test_dir
   access parameters (fields terminated by ',')
   location (test_dir:'test.txt'))
reject limit 0;

It would take us too far afield to go into each of the clauses in the above statement. Check the
Oracle docs for further details. The relevant book is the Oracle Utilities Guide. There are also
several third-party articles available online; do a Google search on "oracle external table" to
locate some of these.

To access the external data we simply issue the following SQL statement from within Oracle:

select * from ext_test;

There are several access parameter options available to handle more complicated record formats;
check the Oracle documentation for details.
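
Because the external table can be queried like any other table, loading the file into a regular
table needs only one insert ... select statement. The following is a minimal sketch of that step.
The target table STAFF and the light clean-up applied to each column are assumptions introduced
for illustration; they are not part of the original example.

```sql
-- STAFF is a hypothetical target table, introduced only for illustration
create table staff(
  name varchar2(10),
  job  varchar2(10));

-- read the flat file through the external table and load it in one statement
insert into staff (name, job)
select
  trim(name),
  upper(trim(job))
from
  ext_test;

commit;
```

The transformation (here just trim and upper) happens inside the select, so no intermediate staging
table is needed for simple clean-ups.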

Tip 2: Use set-based operations to process data

Most ETL processes deal with large numbers of records. Each record must be cleaned at the field
level and then loaded into the target system. Programmers unfamiliar with SQL tend to use procedural
methods ("for" loops, for example) to deal with such situations. With some thought, however, it is
often possible to construct an SQL statement that is equivalent to the procedural technique. The
method using SQL should be preferred, as SQL processes the entire batch of records as a single set.
Provided the statement isn't too complex, the SQL statement will generally run faster than the
equivalent procedural code. Here's an (admittedly contrived) example to illustrate the point:

-- execute from SQL*Plus

set timing on

create table test(
  id number,
  name varchar2(128));

alter table test add constraint pk_test primary key (id);

-- procedural version: one insert per row
declare
  cursor c_test is
    select
      object_id,
      object_name
    from
      all_objects;
begin
  for r_test in c_test loop
    insert into test
      (id, name)
    values
      (r_test.object_id, lower(r_test.object_name));
  end loop;
  commit;
end;
.
/

truncate table test;

-- set-based version: a single insert for all rows
declare
begin
  insert into test (id, name)
  select
    object_id,
    lower(object_name)
  from
    all_objects;
  commit;
end;
.
/

You may need a larger number of rows to see an appreciable difference between the two methods. The
advantage of the SQL-based technique becomes apparent with greater data volumes.

Here are some tips that may help in converting procedural code to SQL:

1. Replace procedural control statements with SQL decode or case: Often a procedural statement
   such as if....then....else can be replaced by a single SQL statement that uses the decode
   function or case statement. For simple conditionals the SQL option will almost always
   outperform procedural code. The decode function is useful when your branching criterion is
   based on discrete values, for example "Male" or "Female". The case statement would be
   preferred if the criterion is based on a continuous range, e.g. any criterion that uses > or <.
   Here's an example of the two in action:

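   -- DECODE: Classify by gender

   select
     ename, decode(gender,'M','Male','F','Female','Unknown')
   from
     employees;

   -- CASE: is the current day a weekend or weekday?
   -- assumes Sunday is day 1 and Saturday is day 7; the numbering
   -- returned by the 'D' format depends on NLS_TERRITORY

   select
     case when to_char(sysdate,'D') > '1' and to_char(sysdate,'D') < '7' then
       'Weekday'
     else
       'Weekend'
     end as day_type
   from
     dual;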

2. Pivoting: A common requirement of ETL processes is to pivot a table about a particular column,
   i.e. to convert rows to columns based on values in a specific column. This is easily achieved
   using the decode function. Consider the following example: we want to flatten out data in a
   table called SALES based on the MONTH column (which contains numbers ranging from 1 to
   12). The SALES table has the following columns: REGION, YEAR, MONTH and SALES_DOLLARS
   (primary key REGION, YEAR and MONTH). The sales data is to be flattened out with each
   month's dollars appearing in a separate column: JAN_SALES, FEB_SALES .... DEC_SALES.
   Here's how this can be done:

   select
     region,
     year,
     min(decode(month,1,sales_dollars)) jan_sales,
     min(decode(month,2,sales_dollars)) feb_sales,
     min(decode(month,3,sales_dollars)) mar_sales,
     min(decode(month,4,sales_dollars)) apr_sales,
     min(decode(month,5,sales_dollars)) may_sales,
     min(decode(month,6,sales_dollars)) jun_sales,
     min(decode(month,7,sales_dollars)) jul_sales,
     min(decode(month,8,sales_dollars)) aug_sales,
     min(decode(month,9,sales_dollars)) sep_sales,
     min(decode(month,10,sales_dollars)) oct_sales,
     min(decode(month,11,sales_dollars)) nov_sales,
     min(decode(month,12,sales_dollars)) dec_sales
   from
     sales
   group by
     region,
     year;

   The grouping is required to collapse twelve records for the region / year into a single record.
   Note that you can use any aggregate function (AVG, MAX, SUM, for example) if all key columns
   other than MONTH are to be preserved in the flattened table, because each group then holds at
   most one row per month. If this isn't so, you will have to use SUM as you will be summing over
   records for a fixed month. In either case you should be sure you understand what you're getting!

3. "Unpivoting": This is the converse of the above. Here we want to move data from several
   columns into a single column, based on some criterion. Say we have a table called FLAT_SALES
   with columns REGION, YEAR, JAN_SALES, .... DEC_SALES (produced by the above query). An
   ETL process requires us to get the data in normalised format, i.e. columns REGION, YEAR,
   MONTH, SALES_DOLLARS, as in the original SALES table. Here's a query that will achieve this:

   select
     t1.region region,
     t1.year year,
     t2.r month,
     decode(t2.r,
       1, jan_sales,
       2, feb_sales,
       3, mar_sales,
       4, apr_sales,
       5, may_sales,
       6, jun_sales,
       7, jul_sales,
       8, aug_sales,
       9, sep_sales,
       10, oct_sales,
       11, nov_sales,
       12, dec_sales) sales_dollars
   from
     flat_sales t1,
     (select
        rownum r
      from
        all_objects
      where
        rownum <= 12) t2;

   The table t2 serves to produce 12 rows for each original row via a cartesian product (no joins).
   The query uses a decode on rownum to pick out the sales value for each month from the
   flattened table. Note that we could have used any table or view with 12 or more rows in the
   definition of t2. I chose ALL_OBJECTS as this view is generally available to all schemas, and
   contains a large number of rows.

4. Using built-in functions: One can achieve fairly complex data transformations using Oracle's
   character SQL functions. The nice thing about these functions is that they can be embedded in
   SQL, so you get the full advantage of set-based operations. The usual suspects include:
   LENGTH, LOWER, LPAD, REPLACE, RPAD, SUBSTR, UPPER, (L)(R)TRIM. Some "lesser known"
   functions I have found useful in my ETL efforts are:

      CHR: Returns the character associated with the specified ASCII code. This is useful when
      you want to remove non-printable characters such as carriage return (CR) and linefeed
      (LF) from input strings. You would do this using CHR in tandem with REPLACE like so:
      REPLACE(input_string, CHR(13)||CHR(10)), 13 and 10 being the ASCII codes for CR and
      LF respectively.

      INITCAP: This capitalizes the first letter in each word of the input string. Every other
      letter is lower-cased.

      INSTR: Returns the location of a specified string within the input string.

   All these functions have a range of invocation options. Check the Oracle SQL documentation for
   full details.

5. Analytics: Since version 8i, Oracle has introduced analytical extensions to SQL. These enable
   one to do a range of procedural operations using SQL (yes, you read that right: procedural
   operations using SQL). Some operations possible using analytical SQL include: subgrouping
   (different levels of grouping within a statement), ranking and other statistics over subsets of
   data, comparing values in different rows, and presenting summary and detailed data using a
   single SQL statement. Check out my article on analytic SQL to find out more.
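
The last two items can be illustrated with a couple of short queries. These are sketches only. The
table INPUT_STAGE and its column RAW_NAME are hypothetical names used to show the character
functions; the second query assumes the SALES table described under pivoting (columns REGION,
YEAR, MONTH and SALES_DOLLARS).

```sql
-- Character functions: strip CR and LF characters, trim, and tidy the case.
-- INPUT_STAGE and RAW_NAME are hypothetical names used for illustration.
select
  initcap(trim(replace(replace(raw_name, chr(13)), chr(10)))) clean_name,
  instr(raw_name, ',') comma_position
from
  input_stage;

-- Analytic functions: detail rows, a subgroup total, a ranking and a
-- comparison with the previous row, all in a single statement.
select
  region,
  year,
  month,
  sales_dollars,
  sum(sales_dollars) over (partition by region, year) year_total,
  rank() over (partition by region, year
               order by sales_dollars desc) month_rank,
  sales_dollars
    - lag(sales_dollars) over (partition by region
                               order by year, month) change_from_prev_month
from
  sales;
```

In the first query, REPLACE is applied separately to CHR(13) and CHR(10) so that stray CR or LF
characters are removed even when they do not occur as a pair. In the second, the final column is
null for the first month of each region because LAG has no earlier row to compare with.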

Tom Kyte's site is a great resource for tips on converting procedural code to SQL; try searching on
the keywords "ETL" and "analytic functions" for a start.

OK, after extolling the virtues of using SQL we have to admit that there are situations in which
procedural code becomes unavoidable. In such cases you can still take advantage of set-based
processing by using bulk binding within your procedural code. Sections of the Oracle documentation
on bulk binding are available here and here (free registration required). An introductory article
on bulk binding is available here (dbasupport.com).
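
As a rough illustration of bulk binding, here is a sketch that reworks the procedural loop from the
earlier TEST table example using BULK COLLECT and FORALL. It assumes the TEST table from that
example exists and has been truncated first. The batch size of 1000 is an arbitrary choice, and the
lower-casing loop merely stands in for row-level logic that could not be expressed in SQL.

```sql
declare
  type t_ids   is table of all_objects.object_id%type;
  type t_names is table of all_objects.object_name%type;

  l_ids   t_ids;
  l_names t_names;

  cursor c_test is
    select
      object_id,
      object_name
    from
      all_objects;
begin
  open c_test;
  loop
    -- fetch a whole batch of rows per call; 1000 is an arbitrary batch size
    fetch c_test bulk collect into l_ids, l_names limit 1000;
    exit when l_ids.count = 0;

    -- placeholder for procedural, row-level processing
    for i in 1 .. l_ids.count loop
      l_names(i) := lower(l_names(i));
    end loop;

    -- hand the whole batch to the SQL engine in a single call
    forall i in 1 .. l_ids.count
      insert into test (id, name)
      values (l_ids(i), l_names(i));
  end loop;
  close c_test;
  commit;
end;
/
```

The LIMIT clause keeps memory use bounded by processing the cursor in batches rather than loading
every row into the collections at once.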

Tip 3: TRUNCATE tables when deleting all data

Deleting all data from a table is best done using the truncate statement. A truncate runs faster
than a delete because it simply resets the table's high water mark to zero. Truncate is a very quick
operation. Indexes are also truncated along with the table. A truncate is DDL and therefore cannot
be rolled back; be sure you consider this before using truncate. Note that tables referenced by
enabled foreign key constraints cannot be truncated unless the keys are disabled first. Here's a
block of code that uses truncate to clean out the SALES table (which is referenced by enabled
foreign key constraints):

declare

  cursor c_referenced_by is
    select
      t1.constraint_name constraint_name,
      t1.table_name table_name
    from
      user_constraints t1,
      user_constraints t2
    where
      t1.constraint_type = 'R'
    and
      t1.r_constraint_name = t2.constraint_name
    and
      t2.table_name = 'SALES';

begin

  for r_referenced_by in c_referenced_by loop

    execute immediate
      'alter table ' || r_referenced_by.table_name || ' disable constraint '
      || r_referenced_by.constraint_name;

  end loop;

  execute immediate
    'truncate table sales';

end;
/
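
Note that the block above leaves the referencing foreign keys disabled. Once SALES has been
reloaded they would normally be switched back on. A possible sketch, reusing the same cursor, is
shown below. It assumes that every child row still has a matching parent row in SALES after the
reload; if not, the enable statement fails with a constraint violation.

```sql
declare

  -- same cursor as in the truncate block: foreign keys that reference SALES
  cursor c_referenced_by is
    select
      t1.constraint_name constraint_name,
      t1.table_name table_name
    from
      user_constraints t1,
      user_constraints t2
    where
      t1.constraint_type = 'R'
    and
      t1.r_constraint_name = t2.constraint_name
    and
      t2.table_name = 'SALES';

begin

  for r_referenced_by in c_referenced_by loop

    -- re-enabling validates the existing child rows against SALES
    execute immediate
      'alter table ' || r_referenced_by.table_name || ' enable constraint '
      || r_referenced_by.constraint_name;

  end loop;

end;
/
```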

Tip 4: Use direct path inserts for loading data

A direct path insert offers a quick way to load large volumes of data into a table. It does this by
bypassing the buffer cache and writing directly to datafiles. A direct path insert is done using the
insert into ... select ... idiom, together with an append hint. The append hint causes the rows to
be inserted above the table's high water mark; any free space below the high water mark is not
used. Here is the syntax for a direct path insert into SALES from SALES_STAGE:

insert /*+ append */ into
  sales
  (region,
   year,
   month,
   sales_dollars)
select
  region,
  year,
  month,
  sales_dollars
from
  sales_stage;

A direct path insert by itself is faster than a conventional insert. Just how much mileage one gets
depends on the volume of data loaded. The performance of direct path inserts can be further
enhanced by doing the following before the load:

1. Bypassing constraint checks by disabling all constraints. Remember to re-enable these after the
   load is done.
2. Suppressing redo generation by putting the table in nologging mode. If your table has a large
   number of non-unique indexes, you might also consider setting these to unusable state. This
   will suppress redo generation associated with index maintenance. The indexes can then be
   rebuilt after the load. Note that unique indexes should not be set unusable because the load
   will fail with an "ORA-26026: unique index initially in unusable state" error. Unique indexes
   can, however, be dropped before and recreated after the load. This is, in effect, what happens
   when the primary key constraint is disabled (as discussed in the previous point): the unique
   index associated with the primary key is dropped.

   An important consequence of suppressing redo is that the operation is unrecoverable. Be sure to
   coordinate such operations with your DBA so that she can schedule a backup of the database or
   relevant tablespaces after the load.

Here's a block of code that bypasses constraint checks and suppresses redo generation for the above
insert:

declare

  cursor c_constraints is
    select
      table_name,
      constraint_name
    from
      user_constraints
    where
      table_name = 'SALES';

  cursor c_indexes is
    select
      index_name
    from
      user_indexes
    where
      table_name = 'SALES'
    and
      uniqueness <> 'UNIQUE';

begin

  for r_constraints in c_constraints loop

    execute immediate
      'alter table ' || r_constraints.table_name || ' disable constraint '
      || r_constraints.constraint_name;

  end loop;

  -- optional truncate to clean out the table
  -- disable referencing constraints, if needed (see tip 3)

  execute immediate
    'truncate table sales';

  for r_indexes in c_indexes loop

    execute immediate
      'alter index ' || r_indexes.index_name || ' unusable';

  end loop;

  execute immediate
    'alter session set skip_unusable_indexes=true';

  execute immediate
    'alter table sales nologging';

  insert /*+ append */ into
    sales
    (region,
     year,
     month,
     sales_dollars)
  select
    region,
    year,
    month,
    sales_dollars
  from
    sales_stage;

  execute immediate
    'alter session set skip_unusable_indexes=false';

  for r_indexes in c_indexes loop

    execute immediate
      'alter index ' || r_indexes.index_name || ' rebuild nologging';

  end loop;

  execute immediate 'alter table sales logging';

  for r_constraints in c_constraints loop

    execute immediate
      'alter table ' || r_constraints.table_name || ' enable constraint '
      || r_constraints.constraint_name;

  end loop;

  commit;

end;
/

For small data volumes, the overhead of disabling constraints and indexes, and making the table
nologging will swamp the benefits gained. In general, the larger the load the greater the benefit of
the foregoing actions. As always, benchmark before implementation in your loads.

The above code uses a lot of dynamic SQL so it can be heavy on database resources. The use of
dynamic SQL is unavoidable because we need to perform the operations within a module (PL/SQL
block, procedure or package). However, this should not cause performance problems for business
users because ETL batch processes normally run during off-peak hours.

Tip 5: Use the MERGE command for upserts

A common requirement is the need to perform an "upsert", i.e. update a row if it exists, insert it
if it doesn't. In Oracle (versions 9i and later) this can be done in one step using the MERGE
statement. Here's an example that uses the EMP table of the SCOTT schema. The example uses a table
EMP_STAGE that holds updated and new records that are to be upserted into EMP. You need to be logged
into the SCOTT schema (or any other schema with a copy of SCOTT.EMP) in order to run the example:

-- create EMP_STAGE

create table emp_stage as select * from emp where 1=2;

-- insert update records in EMP_STAGE

insert into emp_stage values
(7369,'SMITH','CLERK',7902,to_date('17-12-1980','dd-mm-yyyy'),1800,NULL,20);

insert into emp_stage values
(7499,'ALLEN','SALESMAN',7698,to_date('20-2-1981','dd-mm-yyyy'),2200,300,30);

insert into emp_stage values
(7521,'WARD','SALESMAN',7698,to_date('22-2-1981','dd-mm-yyyy'),1250,500,30);

insert into emp_stage values
(7839,'KING','PRESIDENT',NULL,to_date('17-11-1981','dd-mm-yyyy'),9500,NULL,10);

insert into emp_stage values
(7782,'CLARK','MANAGER',7839,to_date('9-6-1981','dd-mm-yyyy'),8500,NULL,10);

-- insert new records in EMP_STAGE

insert into emp_stage values
(7940,'WEBSTER','DBA',7782,to_date('23-1-1985','dd-mm-yyyy'),7000,NULL,10);

insert into emp_stage values
(7945,'HAMILL','DEVELOPER',7782,to_date('21-5-1985','dd-mm-yyyy'),6000,NULL,10);

insert into emp_stage values
(7950,'PINCHON','ANALYST',7782,to_date('20-10-1985','dd-mm-yyyy'),6000,NULL,10);

commit;

-- MERGE records into EMP

merge into
  emp e
using
  emp_stage es
on
  (e.empno = es.empno)
when matched then
  update
  set
    e.sal = es.sal
when not matched then
  insert
    (e.empno,
     e.ename,
     e.job,
     e.mgr,
     e.hiredate,
     e.sal,
     e.comm,
     e.deptno)
  values
    (es.empno,
     es.ename,
     es.job,
     es.mgr,
     es.hiredate,
     es.sal,
     es.comm,
     es.deptno);

commit;

Three rows are inserted and five existing rows are updated by the above.

In Oracle 10g one can also add conditional clauses to the insert and update portions of the merge
statement. For example:

-- MERGE records into EMP, except for DEVELOPERS

merge into
  emp e
using
  emp_stage es
on
  (e.empno = es.empno)
when matched then
  update
  set
    e.sal = es.sal
when not matched then
  insert
    (e.empno,
     e.ename,
     e.job,
     e.mgr,
     e.hiredate,
     e.sal,
     e.comm,
     e.deptno)
  values
    (es.empno,
     es.ename,
     es.job,
     es.mgr,
     es.hiredate,
     es.sal,
     es.comm,
     es.deptno)
  where
    es.job <> 'DEVELOPER';

In this case the record for employee 7945 is not inserted.

The conditional clause can be in the update portion as well. In 10g it is also possible to delete
rows from the destination table based on conditional criteria. Check the documentation for details.
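
To give a flavour of these 10g options, here is a sketch that extends the previous statement with a
conditional update and a delete clause. The two conditions (only apply salary increases, and remove
clerks) are invented purely for illustration. Note that the delete clause only considers rows that
were updated by the merge, and its condition is evaluated against the updated values.

```sql
-- MERGE with a conditional update and a delete clause (Oracle 10g and later).
-- Both conditions below are illustrative assumptions.

merge into
  emp e
using
  emp_stage es
on
  (e.empno = es.empno)
when matched then
  update
  set
    e.sal = es.sal
  where
    es.sal > e.sal        -- only apply salary increases
  delete where
    (e.job = 'CLERK')     -- remove updated rows that belong to clerks
when not matched then
  insert
    (e.empno,
     e.ename,
     e.job,
     e.mgr,
     e.hiredate,
     e.sal,
     e.comm,
     e.deptno)
  values
    (es.empno,
     es.ename,
     es.job,
     es.mgr,
     es.hiredate,
     es.sal,
     es.comm,
     es.deptno)
  where
    es.job <> 'DEVELOPER';
```

As with any statement that deletes data, it is worth running this against a copy of EMP first and
checking the results before committing.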

Tip 6: Use heterogeneous services to access data in non-Oracle relational databases

Database links are often used to transfer small volumes of data between Oracle databases. It is less
well known that database links can also be set up between Oracle and non-Oracle databases. This is a
useful feature, as ETL processes often need to access and transfer data to Oracle from third-party
databases such as MS SQL Server. The standard way to do this is by exporting data from the
non-Oracle database to a flat file, and then importing the data into Oracle via SQL*Loader or
external tables. Oracle Heterogeneous Services provides a single-step option to achieve the
transfer. Heterogeneous Services come in two flavours:

1. Generic Heterogeneous Services: This option, which is bundled with the Oracle Server, uses
   ODBC to connect to the non-Oracle database. A tutorial on accessing SQL Server using Generic
   Heterogeneous Services is available here.
2. Transparent Gateways: These are extra-cost options that are optimised for specific databases.
   They offer better performance compared to the Generic option because they are designed to
   exploit optimisations and features specific to particular databases. The mechanics of setting
   up Transparent Gateways are quite similar to the Generic option. Check the documentation that
   comes with the specific Transparent Gateway for further details. The documentation is hidden
   away in the relevant gateway installation directory. For example, the documentation for the MS
   SQL Server gateway sits in ORACLE_HOME/tg4msql. A tutorial on accessing SQL Server using
   Transparent Gateways is available here.
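
With either flavour, once the gateway has been configured the remote data is reached through an
ordinary database link. The sketch below shows the general shape of such a transfer. Every name in
it is hypothetical: the link MSSQL_LINK, the credentials, the MSSQL_TNS alias, the remote
"customers" table and the local CUSTOMERS_STAGE table. It also assumes that the listener and
Heterogeneous Services configuration described in the tutorials has already been completed.

```sql
-- All object names and credentials here are hypothetical placeholders.
-- MSSQL_TNS is assumed to be a tnsnames.ora alias that points at the
-- already-configured Heterogeneous Services agent.
create database link mssql_link
  connect to "etl_user" identified by "etl_password"
  using 'MSSQL_TNS';

-- Once the link exists, remote data can be pulled with ordinary SQL.
-- Remote identifiers are quoted here on the assumption that the remote
-- names are not upper case.
insert into customers_stage (customer_id, customer_name)
select
  "customer_id",
  "customer_name"
from
  "customers"@mssql_link;

commit;
```

Because the transfer is a single insert ... select, the transformations discussed under Tip 2 can
be applied in the select list as the data comes across the link.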

A warning on data transfer techniques that use database links: the efficiency of the transfer
depends on your network bandwidth and the quantity of data to be transferred. I have used this
approach (with great success) to transfer small to moderate quantities of data (~100000 rows, row
size ~1000 bytes) over co-located machines within corporate networks.

Closing Remarks

ETL processes present a technical challenge as they entail complex transformations and loads of
large quantities of data within ever-shrinking time windows. As developers we need to use all the
tricks in the book to speed up our ETL procedures. In this article I've outlined some of the
techniques that I have used, with success, in several projects. I hope you find them useful in your
work.

This page last modified on: 02/14/2016 10:15:27
Copyright: Kailash Awati & Arati Apte, 2000-2015.
