In this DWBI Concepts original article, we test the performance of the Informatica PowerCenter 8.5 Joiner transformation against an Oracle 10g database join. The results give application developers crucial insight for making informed performance-tuning decisions.
In our previous article, we tested the performance of the ORDER BY operation in Informatica and Oracle and found that, under our test conditions, Oracle performs sorting 14% faster than Informatica. This time we will look into the JOIN operation, not only because JOIN is the single most important data set operation but also because JOIN performance gives a developer crucial data for implementing proper pushdown optimization manually. Informatica is one of the leading data integration tools in today's market; more than 4,000 enterprises worldwide rely on it to access, integrate and trust their information assets. Oracle, on the other hand, is arguably the most successful and powerful RDBMS, trusted since the 1980s across all major business domains and platforms. Each system is among the best in the technologies it supports, but when it comes to application development, developers often face the challenge of striking the right balance of operational load sharing between the two. This article will help them make that decision on an informed basis.
Which JOINs data faster? Oracle or Informatica?
As an application developer, you have the choice of either using join syntax at the database level to join your data or using a JOINER TRANSFORMATION in Informatica to achieve the same outcome. The question is: which system performs this faster?
Test Preparation
We will perform the same test with 4 different data points (data volumes) and log the results. We will start with 1 million records in the detail table and 0.1 million in the master table. Subsequently we will test with 2 million, 4 million and 6 million detail table records against 0.2 million, 0.4 million and 0.6 million master table records. Here are the details of the setup we will use:
1. Oracle 10g database as relational source and target
2. Informatica PowerCenter 8.5 as ETL tool
3. Database and Informatica set up on different physical servers running HP-UX
4. Source database tables have no constraints, no indexes, no database statistics and no partitions
5. Source database tables are not available in the Oracle shared pool before they are read
6. No session-level partitioning in Informatica PowerCenter
7. No parallel hint provided in the extraction SQL query
8. Informatica JOINER has enough cache size
We have used two Informatica PowerCenter mappings created in PowerCenter Designer. The first mapping, m_db_side_join, uses an INNER JOIN clause in the source qualifier to join the data at the database level. The second mapping, m_Infa_side_join, uses an Informatica JOINER to join the data at the Informatica level. We have executed these mappings with the different data points and logged the results. Further to the above test, we will execute the m_db_side_join mapping once again, this time with proper database-side indexes and statistics, and log the results. A sketch of the database-side join query is shown below.
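For illustration, the database-side join in the source qualifier of m_db_side_join could look like the sketch below. The table and column names (DETAIL_TBL, MASTER_TBL, MASTER_ID) are assumptions, since the article does not publish the actual test schema:

    -- Hypothetical source-qualifier override for m_db_side_join.
    -- Table and column names are placeholders, not the actual test schema.
    SELECT d.*, m.*
    FROM   detail_tbl d
           INNER JOIN master_tbl m
           ON d.master_id = m.master_id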
Result
The following graph shows the performance of Informatica and the database in terms of the time taken by each system to join the data. The average time is plotted along the vertical axis and the data points along the horizontal axis.
Data Points                 1       2       3       4
Master Table Record Count   0.1 M   0.2 M   0.4 M   0.6 M
Detail Table Record Count   1 M     2 M     4 M     6 M
Verdict
In our test environment, Oracle 10g performs the JOIN operation 24% faster than the Informatica Joiner transformation without a database index, and 42% faster with a database index.
Assumption
1. Average server load remains the same during all the experiments
2. Average network speed remains the same during all the experiments
Note
1. This data can only be used for performance comparison but cannot be used for performance benchmarking.
2. This data is only indicative and may vary in different testing conditions.
Think about a typical ETL operation often used in enterprise-level data integration: a lot of the data processing can be redirected either to the database or to the ETL tool. In general, both the database and the ETL tool are capable of performing such operations with much the same efficiency. But to achieve optimized performance, a developer must carefully consider and decide which system to trust with each individual processing task. In this article, we take a basic database operation, sorting, and put these two systems to the test to determine which does it faster, if either does at all.
Which sorts data faster? Oracle or Informatica?
As an application developer, you have the choice of either using ORDER BY at the database level to sort your data or using a SORTER TRANSFORMATION in Informatica to achieve the same outcome. The question is: which system performs this faster?
Test Preparation
We will perform the same test with different data points (data volumes) and log the results. We will start with 1 million records and double the volume for each subsequent data point. Here are the details of the setup we will use:
1. Oracle 10g database as relational source and target
2. Informatica PowerCenter 8.5 as ETL tool
3. Database and Informatica set up on different physical servers running HP-UX
4. Source database table has no constraints, no indexes, no database statistics and no partitions
5. Source database table is not available in the Oracle shared pool before it is read
6. No session-level partitioning in Informatica PowerCenter
7. No parallel hint provided in the extraction SQL query
8. The source table has 10 columns, of which the first 8 are used for sorting
9. Informatica sorter has enough cache size

We have used two Informatica PowerCenter mappings created in PowerCenter Designer. The first mapping, m_db_side_sort, uses an ORDER BY clause in the source qualifier to sort the data at the database level. The second mapping, m_Infa_side_sort, uses an Informatica sorter to sort the data at the Informatica level. We have executed these mappings with the different data points and logged the results. A sketch of the database-side query appears below.
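The source-qualifier query for m_db_side_sort could look like the following sketch; the column names are assumptions, since the article only states that the table has 10 columns of which the first 8 are sorted:

    -- Hypothetical source-qualifier override for m_db_side_sort.
    -- col1 .. col10 are placeholder column names.
    SELECT col1, col2, col3, col4, col5,
           col6, col7, col8, col9, col10
    FROM   src_table
    ORDER BY col1, col2, col3, col4,
             col5, col6, col7, col8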
Result
The following graph shows the performance of Informatica and the database in terms of the time taken by each system to sort the data. Time is plotted along the vertical axis and data volume along the horizontal axis.
Verdict
The above experiment demonstrates that, in our test environment, the Oracle database is faster than Informatica in the SORT operation by an average factor of 14%.
Assumption
1. Average server load remains the same during all the experiments
2. Average network speed remains the same during all the experiments
Note
This data can only be used for performance comparison but cannot be used for performance benchmarking. For the Informatica versus Oracle performance comparison of the JOIN operation, see the article above.
A Normalizer is an active transformation that returns multiple rows from a single source row; it returns duplicate data for single-occurring source columns. The Normalizer transformation parses multiple-occurring columns from COBOL sources, relational tables, or other sources. A Normalizer can be used to transpose data in columns to rows; in effect, it does the opposite of an Aggregator!
Example of Data Transpose using Normalizer
Think of a relational table that stores four quarters of sales by store, where we need to create a row for each sales occurrence. We can configure a Normalizer transformation to return a separate row for each quarter, like below:
The following source rows contain four quarters of sales by store:

Source Table
Store    Quarter1   Quarter2   Quarter3   Quarter4
Store1   100        300        500        700
Store2   250        450        650        850
The Normalizer returns a row for each store and sales combination. It also returns an index (GCID) that identifies the quarter number:
Target Table
Store     Sales   Quarter
Store 1   100     1
Store 1   300     2
Store 1   500     3
Store 1   700     4
Store 2   250     1
Store 2   450     2
Store 2   650     3
Store 2   850     4
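For readers who think in SQL, the same transpose can be sketched as a plain Oracle query. This is only the equivalent set operation, not the Normalizer itself, and the table name SALES_BY_STORE is an assumption:

    -- SQL equivalent of the Normalizer transpose (sketch only).
    -- SALES_BY_STORE is a hypothetical table with the source layout above.
    SELECT store, quarter1 AS sales, 1 AS quarter FROM sales_by_store
    UNION ALL
    SELECT store, quarter2, 2 FROM sales_by_store
    UNION ALL
    SELECT store, quarter3, 3 FROM sales_by_store
    UNION ALL
    SELECT store, quarter4, 4 FROM sales_by_store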
How Informatica Normalizer Works

Suppose we have the following data in source:

Name   Month   Transportation   House Rent   Food
Sam    Jan     200              1500         500
John   Jan     300              1200         300
Tom    Jan     300              1350         350
Sam    Feb     300              1550         450
John   Feb     350              1200         290
Tom    Feb     350              1400         350
and we need to transform the source data and populate it as below in the target table:

Name   Month   Expense Type   Expense
Sam    Jan     Transport      200
Sam    Jan     House rent     1500
Sam    Jan     Food           500
John   Jan     Transport      300
John   Jan     House rent     1200
John   Jan     Food           300
Tom    Jan     Transport      300
Tom    Jan     House rent     1350
Tom    Jan     Food           350

(the Feb rows follow the same pattern)
Below is the screenshot of a complete mapping that shows how to achieve this result using Informatica PowerCenter Designer.
Image: Normalization Mapping Example 1
In the Ports tab of the Normalizer, the ports will be created automatically as configured in the Normalizer tab. Interestingly, we will observe two new columns, namely GK_EXPENSEHEAD and GCID_EXPENSEHEAD. The GK field generates a sequence number starting from the value defined in the Sequence field, while GCID holds the value of the occurrence field, i.e. the column number of the input expense head. Here 1 is for FOOD, 2 is for HOUSERENT and 3 is for TRANSPORTATION.
Now the GCID tells us which expense corresponds to which field while converting columns to rows. Below is the screenshot of the expression used to handle this GCID:
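The screenshot is not reproduced here, but the decoding logic could be a simple DECODE in an output port of the downstream Expression transformation. This is a sketch; the port name O_EXPENSE_TYPE is an assumption, and the mapping of numbers to labels follows the GCID numbering stated above:

    -- Hypothetical output port O_EXPENSE_TYPE in the Expression transformation.
    -- GCID_EXPENSEHEAD comes from the Normalizer: 1 = Food, 2 = House rent, 3 = Transport.
    DECODE(GCID_EXPENSEHEAD,
           1, 'Food',
           2, 'House rent',
           3, 'Transport')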
Different Types of Informatica Partitions

We can define the following partition types: Database partitioning, Hash auto-keys, Hash user keys, Key range, Pass-through, Round-robin.
Partition points: Partition points mark the thread boundaries in a pipeline and divide it into stages. When we add a partition point, we increase the number of pipeline stages by one. Increasing the number of partitions or partition points increases the number of threads. We cannot create partition points at Source instances or at Sequence Generator transformations.

Number of partitions: A partition is a pipeline stage that executes in a single thread. If we purchase the Partitioning option, we can set the number of partitions at any partition point. When we add partitions, we increase the number of processing threads, which can improve session performance. We can define up to 64 partitions at any partition point in a pipeline. When we increase or decrease the number of partitions at any partition point, the Workflow Manager increases or decreases the number of partitions at all partition points in the pipeline, so the number of partitions remains consistent throughout the pipeline. The Integration Service runs the partition threads concurrently.

Partition types: The Integration Service creates a default partition type at each partition point. If we have the Partitioning option, we can change the partition type. The partition type controls how the Integration Service distributes data among partitions at partition points.

Database partitioning: The Integration Service queries the database system for table partition information and reads partitioned data from the corresponding nodes in the database.

Pass-through: The Integration Service processes data without redistributing rows among partitions. All rows in a single partition stay in that partition after crossing a pass-through partition point. Choose pass-through partitioning when we want to create an additional pipeline stage to improve performance but do not want to change the distribution of data across partitions.

Round-robin: The Integration Service distributes data evenly among all partitions. Use round-robin partitioning where we want each partition to process approximately the same number of rows, i.e. load balancing.

Hash auto-keys: The Integration Service uses a hash function to group rows of data among partitions, using all grouped or sorted ports as a compound partition key. We may need to use hash auto-keys partitioning at Rank, Sorter and unsorted Aggregator transformations.

Hash user keys: The Integration Service uses a hash function to group rows of data among partitions, and we define the ports that make up the partition key.

Key range: The Integration Service distributes rows of data based on a port or set of ports that we define as the partition key. For each port, we define a range of values; the Integration Service uses the key and the ranges to send rows to the appropriate partition. Use key range partitioning when the sources or targets in the pipeline are partitioned by key range.

We cannot create a partition key for hash auto-keys, round-robin or pass-through partitioning. Partition points can be added, deleted or edited on the Partitions view of the Mapping tab in the session properties in Workflow Manager. The PowerCenter Partitioning option increases PowerCenter performance through parallel data processing; it provides a thread-based architecture and automatic data partitioning that optimize parallel processing on multiprocessor and grid-based hardware environments.
Lookups are cached by default in Informatica. This means that, by default, Informatica brings the entire lookup table from the database server to the Informatica server as part of the lookup-cache-building activity during a session run. If the lookup table is huge, this can take quite some time. Now consider this scenario: what if you are looking up the same table several times, using different lookups in different mappings? Do you want to spend the time of building the lookup cache again and again for each lookup? Of course not! Just use the persistent cache option. Yes, a lookup cache can be either non-persistent or persistent. The Integration Service saves or deletes the lookup cache files after a successful session run depending on whether the lookup cache is marked persistent or not.
Where and when shall we use a persistent cache?

Suppose we have a lookup with the same lookup condition and return/output ports, and the lookup table is used many times in multiple mappings. Let us say a Customer Dimension table is used in many mappings to populate the surrogate key in the fact tables based on their source-system keys. If we cache the same Customer Dimension table multiple times in multiple mappings, that will definitely affect the SLA loading timeline.

So the solution is to use a Named Persistent Cache.
In the first mapping, we create the Named Persistent Cache file by setting three properties in the Properties tab of the Lookup transformation:

Lookup cache persistent: To be checked, i.e. a Named Persistent Cache will be used.
Cache File Name Prefix: user_defined_cache_file_name, i.e. the Named Persistent Cache file name that will be used in all the other mappings using the same lookup table. Enter the prefix name only; do not enter .idx or .dat.
Re-cache from lookup source: To be checked, i.e. the Named Persistent Cache file will be rebuilt or refreshed with the current data of the lookup table.

Next, in all the mappings where we want to reuse the already built Named Persistent Cache, we set two properties in the Properties tab of the Lookup transformation:

Lookup cache persistent: To be checked, i.e. the lookup will use a Named Persistent Cache that is already saved in the Cache Directory; if the cache file is not there, the session will not fail, it will simply create the cache file instead.
Cache File Name Prefix: user_defined_cache_file_name, i.e. the Named Persistent Cache file name that was defined in the mapping where the persistent cache file was created.
Note:
If there is any Lookup SQL Override, the SQL statement in all the lookups must match exactly; even an extra blank space will fail the session that is using the already built persistent cache file. So if the incoming source data volume is high, the lookup table's data volume that needs to be cached is also high, and the same lookup table is used in many mappings, then the best way to handle the situation is a one-time build of a named persistent cache, reused everywhere. A sketch of such an override is shown below.
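As an illustration of how strict this matching is, an override like the one below would have to be character-for-character identical in every lookup that shares the cache file. The table and column names are assumptions drawn from the Customer Dimension example above:

    -- Hypothetical Lookup SQL Override shared by every lookup reusing the cache.
    -- Even a single extra space in this text breaks persistent cache reuse.
    SELECT customer_key, source_customer_id
    FROM   dim_customer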
It is possible to implement aggregation without an Aggregator transformation, and in many cases doing so brings a performance benefit (especially if the input is already sorted, or when you know the input data will not violate the order, for example when you are loading daily data and want to aggregate it by day). Remember that Informatica holds all the rows in the Aggregator cache for the aggregation operation. This takes time and cache space, and it also voids Informatica's normal row-by-row processing. By replacing the Aggregator with an Expression, we reduce the cache space requirement and restore row-by-row processing. The mapping below shows how to do this.
Image: Aggregation with Expression and Sorter 1
Sorter (SRT_SAL) Ports Tab

I am showing a sorter here just to illustrate the concept. If you already have sorted data from the source, you need not use it, thereby increasing the performance benefit.

Expression (EXP_SAL) Ports Tab
Image: Expression Ports Tab
Expression Properties
Sorter (SRT_SAL1) Ports Tab

A sketch of the variable-port logic inside EXP_SAL follows.
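Since the screenshots are not reproduced here, below is a minimal sketch of how EXP_SAL could compute a running SUM(SAL) per DEPTNO with variable ports, written as port := expression for readability. The port names are assumptions; variable ports evaluate top to bottom on each row, so V_PREV_DEPTNO still holds the previous row's key when V_SAL is evaluated, and the last row of each DEPTNO group (isolated downstream by the second sorter) carries the final aggregate:

    -- Hypothetical ports in EXP_SAL, evaluated in this order for every row.
    V_SAL         := IIF(DEPTNO = V_PREV_DEPTNO, V_SAL + SAL, SAL)  -- running sum, reset on a new dept
    V_PREV_DEPTNO := DEPTNO                                         -- remember the key for the next row
    O_SUM_SAL     := V_SAL                                          -- output port: running sum so far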
This is how we can implement aggregation without using the Informatica Aggregator transformation. Hope you liked it!