
Datastage Best Practices

This section provides an overview of recommendations for standard practices.

The recommendations are categorized as follows:
- Standards
- Development guidelines
- Component usage
- Datastage data types
- Partitioning data
- Collecting data
- Sorting
- Stage specific guidelines

Standards

It is important to establish and follow consistent standards in:
- Directory structures for installation and application support directories.
- Naming conventions, especially for Datastage Project categories, stage names, and links.

All Datastage jobs should be documented with the Short Description field, as well as Annotation fields. It is the Datastage developer's responsibility to make personal backups of their work on their local workstation, using Datastage's DSX export capability. This can also be used for integration with source code control systems.

Note: A detailed discussion of these practices is beyond the scope of this Redbooks publication; speak to your Account Executive to engage IBM IPS Services.

Development guidelines

Modular development techniques should be used to maximize re-use of Datastage jobs and components:
- Job parameterization allows a single job design to process similar logic instead of creating multiple copies of the same job. The Multiple-Instance job property allows multiple invocations of the same job to run simultaneously.
- A set of standard job parameters should be used in Datastage jobs for source and target database parameters (DSN, user, password, and so on) and for the directories where files are stored. To ease re-use, these standard parameters and settings should be made part of a Designer Job Parameter Set.
- Create a standard directory structure outside of the Datastage project directory for source and target files, intermediate work files, and so forth.
- Where possible, create re-usable components such as parallel shared containers to encapsulate frequently-used logic.
Datastage Template jobs should be created with:
- Standard parameters, such as source and target file paths and database login properties
- Environment variables and their default settings
- Annotation blocks

Job Parameters should always be used for file paths, file names, and database login settings. Standardized Error Handling routines should be followed to capture errors and rejects.
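As a minimal illustration of the parameterization principle, the sketch below models a shared parameter set as a plain Python dict that is defined once and reused to build file paths and connection settings instead of hard-coding them per job. This is not DataStage code, and every name in it (DSN strings, directory layout) is an invented example:

```python
# Hypothetical sketch: a reusable "parameter set" instead of hard-coded values.
# All names (DSN format, directory layout) are illustrative assumptions.

def make_parameter_set(env):
    """Return standard job parameters for a given environment name."""
    return {
        "DB_DSN": f"DSN_{env}",
        "DB_USER": f"etl_{env}",
        "SRC_DIR": f"/data/{env}/source",
        "TGT_DIR": f"/data/{env}/target",
    }

def source_path(params, filename):
    """Build a source file path from job parameters rather than hard-coding it."""
    return f"{params['SRC_DIR']}/{filename}"

params = make_parameter_set("dev")
print(source_path(params, "customers.csv"))  # /data/dev/source/customers.csv
```

Because the environment name is the only thing that varies, the same "job design" can be pointed at development, test, or production settings without copies.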

Component usage

The following guidelines should be followed when constructing parallel jobs in IBM InfoSphere Datastage Enterprise Edition:
- Never use Server Edition components (BASIC Transformer, Server Shared Containers) within a parallel job. BASIC Routines are appropriate only for job control sequences.
- Always use parallel Data Sets for intermediate storage between jobs, unless that specific data also needs to be shared with other applications.
- Use the Copy stage as a placeholder for iterative design, and to facilitate default type conversions.
- Use the parallel Transformer stage (not the BASIC Transformer) instead of the Filter or Switch stages.
- Use BuildOp stages only when logic cannot be implemented in the parallel Transformer.

Datastage data types

The following guidelines should be followed with Datastage data types:
- Be aware of the mapping between Datastage (SQL) data types and the internal DS/EE data types.
- If possible, import table definitions for source databases using the Orchestrate Schema Importer (orchdbutil) utility.
- Leverage default type conversions using the Copy stage or across the Output mapping tab of other stages.

Partitioning data

In most cases, the default partitioning method (Auto) is appropriate. With Auto partitioning, the Information Server Engine chooses the type of partitioning at runtime based on stage requirements, degree of parallelism, and source and target systems. While Auto partitioning generally gives correct results, it might not give optimized performance. As the job developer, you have visibility into requirements, and can optimize within a job and across job flows. Given the numerous options for keyless and keyed partitioning, the following objectives form a methodology for assigning partitioning:

Objective 1
Choose a partitioning method that gives close to an equal number of rows in each partition, while minimizing overhead. This ensures that the processing workload is evenly balanced, minimizing overall run time.
Objective 2
The partition method must match the business requirements and stage functional requirements, assigning related records to the same partition if required. Any stage that processes groups of related records (generally using one or more key columns) must be partitioned using a keyed partition method.

This includes, but is not limited to: Aggregator, Change Capture, Change Apply, Join, Merge, Remove Duplicates, and Sort stages. It might also be necessary for Transformers and BuildOps that process groups of related records.

Objective 3
Unless partition distribution is highly skewed, minimize re-partitioning, especially in cluster or Grid configurations. Re-partitioning data in a cluster or Grid configuration incurs the overhead of network transport.

Objective 4
The partition method should not be overly complex. The simplest method that meets the above objectives will generally be the most efficient and yield the best performance.

Using the above objectives as a guide, the following methodology can be applied:
a) Start with Auto partitioning (the default).
b) Specify Hash partitioning for stages that require groups of related records, as follows:
   - Specify only the key column(s) that are necessary for correct grouping, as long as the number of unique values is sufficient.
   - Use Modulus partitioning if the grouping is on a single integer key column.
   - Use Range partitioning if the data is highly skewed and the key column values and distribution do not change significantly over time (the Range Map can be reused).
c) If grouping is not required, use Round Robin partitioning to redistribute data equally across all partitions. This is especially useful if the input Data Set is highly skewed or sequential.
d) Use Same partitioning to optimize end-to-end partitioning and to minimize re-partitioning:
   - Be mindful that Same partitioning retains the degree of parallelism of the upstream stage.
   - Within a flow, examine upstream partitioning and sort order and attempt to preserve them for downstream processing. This may require re-examining key column usage within stages and re-ordering stages within a flow (if business requirements permit).
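The trade-off between keyed and keyless methods can be sketched in plain Python. This is a conceptual illustration only, not DataStage engine code, and the row data is invented: hash partitioning keeps related records together for keyed stages, round robin spreads rows evenly, and a sort-merge over independently sorted partitions reconstructs one globally ordered stream (the Sort Merge collector pattern used later in this section):

```python
# Conceptual sketch of keyed vs. keyless partitioning and sort-merge
# collection. Plain Python, not DataStage engine code; the data is invented.
import heapq
from collections import defaultdict
from itertools import cycle

def hash_partition(rows, key, n):
    """Keyed partitioning: rows sharing a key value always land in the same
    partition, as Aggregator, Join, Merge, and Remove Duplicates require."""
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts

def round_robin_partition(rows, n):
    """Keyless partitioning: spreads rows as evenly as possible (Objective 1),
    at the cost of separating related rows."""
    parts = defaultdict(list)
    for p, row in zip(cycle(range(n)), rows):
        parts[p].append(row)
    return parts

def sort_merge_collect(parts, key):
    """Sort each partition independently, then merge the sorted partitions
    into a single, globally sorted stream."""
    runs = [sorted(rs, key=lambda r: r[key]) for rs in parts.values()]
    return list(heapq.merge(*runs, key=lambda r: r[key]))

rows = [{"cust": c, "amt": a}
        for c, a in [("A", 10), ("B", 20), ("A", 30), ("C", 40), ("B", 50)]]
by_key = hash_partition(rows, "cust", 2)        # all "A" rows share a partition
collected = sort_merge_collect(by_key, "cust")  # globally ordered by cust
```

Note that hash partitioning balances well only when the number of distinct key values comfortably exceeds the number of partitions, which is why the methodology above says to check that the number of unique values is sufficient.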
Note: In satisfying the requirements of this second objective, it might not be possible to choose a partitioning method that gives an almost equal number of rows in each partition.

Across jobs, persistent Data Sets can be used to retain the partitioning and sort order. This is particularly useful if downstream jobs are run with the same degree of parallelism (configuration file) and require the same partition and sort order.

Collecting data

Given the options for collecting data into a sequential stream, the following guidelines form a methodology for choosing the appropriate collector type:
a) When output order does not matter, use the Auto collector (the default).

b) Consider how the input Data Set has been sorted:
   - When the input Data Set has been sorted in parallel, use the Sort Merge collector to produce a single, globally sorted stream of rows.
   - When the input Data Set has been sorted in parallel and Range partitioned, the Ordered collector might be more efficient.
c) Use a Round Robin collector to reconstruct rows in input order for round-robin partitioned input Data Sets, as long as the Data Set has not been re-partitioned or reduced.

Sorting

Apply the following methodology when sorting in an IBM InfoSphere Datastage Enterprise Edition data flow:
a) Start with a link sort.
b) Specify only the necessary key column(s).
c) Do not use Stable Sort unless needed.
d) Use a stand-alone Sort stage instead of a link sort for options that are not available on a link sort:
   - The "Restrict Memory Usage" option is one example: if you want more memory available for the sort, you can only set it on the Sort stage, not on a link sort. The environment variable APT_TSORT_STRESS_BLOCKSIZE can also be used to set sort memory usage (in MB) per partition.
   - Other Sort-stage-only options include Sort Key Mode, Create Cluster Key Change Column, Create Key Change Column, and Output Statistics.
   - Always specify the "Datastage" Sort Utility for stand-alone Sort stages.
   - Use "Sort Key Mode = Don't Sort (Previously Sorted)" to resort a sub-grouping of a previously-sorted input Data Set.
e) Be aware of automatically-inserted sorts:
   - Set APT_SORT_INSERTION_CHECK_ONLY to verify, but not establish, the required sort order.
f) Minimize the use of sorts within a job flow.
g) To generate a single, sequential ordered result set, use a parallel Sort and a Sort Merge collector.

Stage specific guidelines

The guidelines by stage are as follows:

Transformer
Take precautions when using expressions or derivations on nullable columns within the parallel Transformer: always convert nullable columns to in-band values before using them in an expression or derivation.
Always place a reject link on a parallel Transformer to capture and audit possible rejects.

Lookup

The Lookup stage is most appropriate when the reference data is small enough to fit into available shared memory. If the Data Sets are larger than available memory resources, use the Join or Merge stage. Limit the use of database Sparse Lookups to scenarios where the number of input rows is significantly smaller than the number of reference rows (for example, 1:100 or more), or when exception processing is required.

Join
Be particularly careful to observe the nullability properties for input links to any form of Outer Join. Even if the source data is not nullable, the non-key columns must be defined as nullable in the Join stage input in order to identify unmatched records.

Aggregators
Use Hash method Aggregators only when the number of distinct key column values is small. A Sort method Aggregator should be used when the number of distinct key values is large or unknown.

Database stages
The following guidelines apply to database stages:
- Where possible, use the Connector stages or native parallel database stages for maximum performance and scalability.
- The ODBC Connector and ODBC Enterprise stages should only be used when a native parallel stage is not available for the given source or target database.
- When using Oracle, DB2, or Informix databases, use the Orchestrate Schema Importer (orchdbutil) to properly import design metadata, and take care to observe the data type mappings.
- If possible, use a SQL WHERE clause to limit the number of rows sent to a Datastage job.
- Avoid the use of database stored procedures on a per-row basis within a high-volume data flow. For maximum scalability and parallel performance, it is best to implement business rules natively using Datastage parallel components.
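The Aggregator guidance above can be illustrated with a small sketch (plain Python with invented data, not DataStage code): a hash-method aggregate keeps one accumulator per distinct key in memory, while a sort-method aggregate relies on key-sorted input and holds only the current group at a time:

```python
# Illustrative contrast of Hash-method vs. Sort-method aggregation.
# Plain Python with invented data, not DataStage engine code.
from itertools import groupby

def hash_aggregate(rows, key, value):
    """One in-memory accumulator per distinct key: memory grows with the
    number of distinct key values, so use only when that number is small."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0) + row[value]
    return totals

def sort_aggregate(sorted_rows, key, value):
    """Requires input already sorted on the key, but holds only the current
    group in memory, so it scales to a large or unknown number of keys."""
    return {k: sum(r[value] for r in grp)
            for k, grp in groupby(sorted_rows, key=lambda r: r[key])}

rows = [{"cust": "A", "amt": 10}, {"cust": "B", "amt": 20},
        {"cust": "A", "amt": 30}]
hash_totals = hash_aggregate(rows, "cust", "amt")          # {"A": 40, "B": 20}
sort_totals = sort_aggregate(
    sorted(rows, key=lambda r: r["cust"]), "cust", "amt")  # same result
```

The up-front sort is the price of the sort method's constant memory use, which is why the guideline reserves the hash method for small key domains.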
