Informatica V9 Sizing Guide

Overview of Document

This document shows average sizing for V9 Installs at 3 different levels. The first is the size of installed elements on the file system. The second is the runtime footprint of general V9 services for all users. The last is the additional overhead in memory and disk of an individual user’s running mappings.

The mappings' contribution to disk/memory usage is usually the most critical and the most difficult to estimate without specific details. The details below can be used as a basis for scaling calculations based on the number of concurrent mappings submitted to the server, the transforms used in each mapping, and the input data size in number of rows and columns.

Base On-Disk Install Size

A typical Server Platform Install will take about 3.2 GB of disk. This does not include disk usage for the reference data items listed in Appendix 1 and 2, nor does it take account of the database usage of a typical set of reference data items. The numbers below are for a full set of all these content elements on the file system. Depending on an individual customer's usage, Appendix 1 and 2 may be used to estimate more exact disk sizing.

Server Platform Install Size    3.2 GB
Identity Reference Data         600 MB
Address Reference Data          4 GB
Reference Table Data            3 GB

These figures total 10.8 GB; this guide works with a rounded-up base figure of about 12 GB. This does not include additional customer reference data or increases in Address Reference data, which is amended as country postal authorities add additional data.
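As a minimal sketch of the arithmetic behind the base figure, the following sums the four on-disk components listed above. The dictionary and function names are illustrative only and are not part of any Informatica tooling; the sizes are the ones quoted in this section.

```python
# On-disk components in GB, taken from the figures listed in this section.
INSTALL_COMPONENTS_GB = {
    "Server Platform Install": 3.2,
    "Identity Reference Data": 0.6,  # 600 MB
    "Address Reference Data": 4.0,
    "Reference Table Data": 3.0,
}


def base_disk_gb(components=INSTALL_COMPONENTS_GB):
    """Sum the on-disk footprint of the given components, in GB."""
    return sum(components.values())


total_gb = base_disk_gb()
print(f"Listed components: {total_gb:.1f} GB")
```

A customer who installs only part of the content can pass a reduced dictionary, for example with the Address Reference Data entry replaced by the sum of the selected files from Appendix 1.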

General Runtime Memory Sizing

The sizing below is for an average running server with no disk/memory intensive mappings and no loaded content:

No    Service Name       Virtual Set    Working Set
1.    Admin Console      773K           133K
2.    MRS                1288K          407K
3.    Mapping Service    978K           254K
4.    Analyst Tool       702K           79K

This table shows the average sizes of the 4 V9 services in a typical configuration. The Virtual Set is the total virtual memory used by the service, and the Working Set is the portion that is physically resident in memory.

Address Validation Reference Data

This data is loaded globally for all users. The customer's configuration dictates which AV reference files are loaded. The Address Validation file size guide (Appendix 1) may be used to estimate memory usage. The average size in memory of each loaded element is approximately the same as its disk footprint.

For example, if a user runs a mapping that uses the following reference file:

United States       Batch/Interactive  533 MB

It would be expected that the process memory size will grow by approximately 533MB. Note that this is a one-off memory cost that lasts for the lifetime of the server and is shared by all mappings run in that lifetime. For performance reasons, the loaded Address Validation data is not unloaded even when there are no current users.
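The rule that in-memory size is roughly equal to on-disk size can be sketched as a simple lookup and sum. The table below holds only the three United States entries from Appendix 1; the names `AV_FILE_SIZES_MB` and `av_memory_mb` are hypothetical helpers for illustration.

```python
# On-disk sizes in MB for Address Validation reference files (from Appendix 1).
AV_FILE_SIZES_MB = {
    ("United States", "Batch/Interactive"): 533,
    ("United States", "GeoCoding"): 422,
    ("United States", "FastCompletion"): 380,
}


def av_memory_mb(loaded_files, sizes=AV_FILE_SIZES_MB):
    """Estimate server memory growth in MB for the loaded AV files.

    Assumes, as this guide states, that each loaded file occupies about
    as much memory as it does on disk.
    """
    return sum(sizes[key] for key in loaded_files)


print(av_memory_mb([("United States", "Batch/Interactive")]))
print(av_memory_mb(list(AV_FILE_SIZES_MB)))
```

Because the data stays loaded for the lifetime of the server, this figure should be counted once per server rather than once per mapping.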

User Created Mapping Memory and Disk footprint

This section is split into 3 types of Data Quality component. The Standard elements don’t incur any additional costs in memory or disk usage beyond their standard running size. The Dynamic components are of 2 different types: reference data based transforms, which hold reference table lookup structures in memory, and Dynamic transforms, which can include items like third party engines, sort space or B-tree storage. The Dynamic transforms use both memory and disk, and that usage can vary considerably depending on the data being processed.

Standard DQ Transformations

Comparison Transformation

Decision Transformation

Merge Transformation

None of these transforms has dynamic memory or disk usage that varies with the size of the data being processed. All these components are referred to as passive since they process data rows in small batches and send them to the next component in the mapping immediately.

Reference Data Based Transformations

Case Convertor Transformation

Labeller Transformation

Parser Transformation

Standardiser Transformation

These transforms are all based around usage of reference data. While they are all passive in that they process data immediately, they have initialisation costs that increase memory based on configuration. This memory usage makes them dynamic based on the transform's configuration, but not dynamic based on the number of rows presented for processing.

While the reference data is managed in a database for editing, at runtime it’s held in memory for performance. To optimise throughput, this in-memory storage is designed for speed rather than space efficiency. The current list of available reference tables is around 3.5K, so a list of tables and in-memory sizes is not included. Each transform will have its own copy of the in-memory reference data. To enable sizing, the customer should take the number of bytes in each column of the reference table, total them across the columns, and multiply by the number of rows. This result multiplied by 1.3 will give an approximate guide to the in-memory footprint.

For example, a reference table with 10K rows and 6 columns with an average byte count per column of 25 will give 10000 * 6 * 25 * 1.3, approximately 2MB of runtime memory usage. This runtime memory cost is for the lifetime of the mapping. All in-memory reference tables are freed when the mapping is finished.
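The reference table rule of thumb above can be written as a short function. This is a sketch of the guideline only: the function name is hypothetical, and the 1.3 factor is the overhead multiplier quoted in this section.

```python
def reference_table_memory_bytes(rows, columns, avg_bytes_per_column, overhead=1.3):
    """Approximate in-memory size of ONE copy of a reference table.

    rows * columns * avg_bytes_per_column is the raw data volume;
    `overhead` is the 1.3 multiplier quoted in this guide.
    """
    return rows * columns * avg_bytes_per_column * overhead


# The example from the text: 10K rows, 6 columns, 25 bytes per column.
one_table = reference_table_memory_bytes(rows=10_000, columns=6, avg_bytes_per_column=25)
print(f"One table: {one_table / 1e6:.2f} MB")

# Each transform holds its own copy, so three transforms that use the
# same table cost three times as much.
print(f"Three transforms: {3 * one_table / 1e6:.2f} MB")
```

Since every transform keeps a private copy, the multiplier that matters in practice is the number of transforms that reference a table, not the number of distinct tables.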

Dynamic DQ Transformations

All the following components have dynamic memory and disk usage. These components are referred to as active; in general they store large numbers of rows internally for block processing, and their memory/disk requirements increase in line with the volume of input rows and the number of corresponding columns per row.

Address Validator Transformation

This component is treated in the General Runtime Memory Sizing section as it affects all users as soon as the first mapping is run.

Association Transformation

This component makes extensive use of B-tree file-based storage. Each column used in the association has its own B-tree, and a general B-tree is used to store all the input data rows. The Informatica B-tree is space efficient but not compressed, so the general sizing guideline is as follows:

Each association column B-tree takes, on disk, the total volume of data in that column plus 20 bytes per input row. The worked example later in this document applies this as rows * (key bytes + 20).

The general storage cache takes the size of the input data set plus 10 bytes per row as its on-disk runtime cost.

An internal memory map of association IDs and rows will be no larger than 20 bytes * the number of rows.
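The association guideline can be sketched as two small functions. This is an interpretation rather than a definitive formula: it assumes the per-row figures (20 bytes per key column entry, 10 bytes per stored row) are overheads added to the data volume, which is how the worked example later in this document treats the key columns. All function and parameter names are hypothetical, and the inputs below are made-up illustrative values.

```python
def association_disk_bytes(rows, key_widths, row_width, key_overhead=20, row_overhead=10):
    """Estimate on-disk B-tree usage for an association.

    key_widths: byte width of each association key column (one B-tree each).
    row_width:  byte width of a full input row held in the general store.
    The overheads are ASSUMED to be per-row additions to the data volume.
    """
    key_btrees = sum(rows * (width + key_overhead) for width in key_widths)
    general_store = rows * (row_width + row_overhead)
    return key_btrees + general_store


def association_memory_bytes(rows, bytes_per_row=20):
    """Upper bound on the in-memory map of association IDs to rows."""
    return rows * bytes_per_row


# Illustrative input: 1M rows, two 10-byte keys, 100-byte rows.
disk = association_disk_bytes(rows=1_000_000, key_widths=[10, 10], row_width=100)
memory = association_memory_bytes(rows=1_000_000)
print(f"Disk: {disk / 1e6:.0f} MB, memory: {memory / 1e6:.0f} MB")
```

Setting `row_overhead=0` reproduces the simpler general storage calculation, which counts only the raw row data.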

Sorter Based Transforms

Consolidation Transformation

Key Generator Transformation

These transforms all contain standard Informatica sort transforms. Currently they are all set to auto, an internal configuration that attempts to give the transform as much memory as possible without affecting system performance. When a user wants more explicit control, the sort transform can be set with a limit on the maximum amount of main memory it can use to sort data. The on-disk temp size will grow with the input, as all data rows must be stored by the sort transform.

Match Transformation

The match transform makes use of 2 different types of B-tree depending on its configuration. When a user has configured both a set of pass-through ports and Identity matching, both types will be used. In general it can be assumed that the B-tree storage will not significantly exceed the disk space the data would occupy if it were stored outside the B-tree on the file system.

Worked Sizing for US based Customer

Because any individual customer will have problem-specific requirements, the following example shows how the data in this document may be applied to create more accurate sizing estimates. The example shows the sizing for both disk and memory for a 4-user DQ installation using US Address Validation, US Identity Matching and US Reference Dictionaries. While this number of users may be small, the variable elements of disk/memory usage only magnify when multiple users concurrently run disk- and memory-intensive transforms. The transforms that have individual requirements per mapping run are indicated in the document.

Base Server Disk Requirements    12 GB (Calculation shown above)
Base Memory Requirements         2 GB (Calculation shown above)

The assumption here is that a mapping without disk/memory sensitive components will add little beyond the standard footprint. This will not be true of very complex mappings.

User 1 Running a matching mapping Dual Source Identity with Source1 containing 1M rows and Source2 containing 100K rows, 6 columns with 25 bytes per column, and 20 columns of pass-through data with 25 bytes per column

This mapping will have 2 sorters from the key generation phase, 1 B-tree from matching, 1 B-tree from Identity, and internal memory usage for Identity and clustering.

Disk Usage

B-tree 1 Identity = 1100000 * 6 * 25 = 165MB

B-tree 2 Pass-through = 1100000 * 20 * 25 = 550MB

Memory Usage = 10MB of internal storage for the large number of transforms used for matching
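The two B-tree figures for User 1 follow the match transform guideline that B-tree storage is roughly the raw size of the data it holds. A minimal sketch of that arithmetic follows; `btree_disk_bytes` is a hypothetical helper name, and sorter temp space is not included because this guide gives no formula for it.

```python
def btree_disk_bytes(rows, columns, bytes_per_column):
    """Approximate B-tree disk usage as the raw data volume it stores."""
    return rows * columns * bytes_per_column


# Dual source: both inputs end up in the B-trees.
total_rows = 1_000_000 + 100_000

identity_btree = btree_disk_bytes(total_rows, columns=6, bytes_per_column=25)
passthrough_btree = btree_disk_bytes(total_rows, columns=20, bytes_per_column=25)

print(f"Identity B-tree:     {identity_btree / 1e6:.0f} MB")
print(f"Pass-through B-tree: {passthrough_btree / 1e6:.0f} MB")
print(f"Total disk:          {(identity_btree + passthrough_btree) / 1e6:.0f} MB")
```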

User 2 Running an AV mapping Single Source with Source1 containing 1M rows

This mapping will have minimal transforms but will load all of the US Address Validation reference data.

United States       Batch/Interactive  533 MB
United States       GeoCoding          422 MB
United States       FastCompletion     380 MB

Total Disk added = 0

Memory Usage = 533 + 422 + 380 = 1.3 GB

User 3 Running Standardisation Single Source with Source1 containing 10M rows

This mapping will have minimal transforms but will load 10 dictionaries to standardise

Assume each dictionary has 10K rows with 5 columns and 25 bytes average per column

Total Disk added = 0

Memory Usage

10000 * 5 * 25 * 1.3 = 1.6 MB per dictionary

Total Memory = 16MB

User 4 Running Association Single Source with Source1 containing 10M rows and association running across 8 groups

This mapping will not have other matching transforms and will source data directly from a single table. Each association key column will have a 10 byte key, and there will be 10 additional columns of row data, each 50 bytes wide.

Each key column B-tree will take 10M * (10 + 20) = 300MB

General storage will take 10M * ((8 * 10) + (10 * 50)) = 5.8GB

Total Disk = 300MB * 8 columns + 5.8GB = 8.2GB

Total Memory = 10M * 20 = 200MB

Total Additional Memory/Disk used by the 4 concurrently running mappings

Disk = 165MB + 550MB + 8200MB = 8915MB

Memory = 10MB + 1300MB + 16MB + 200MB = 1526MB
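The concurrent totals can be reproduced by summing the per-user figures from the worked example. The sketch below uses the rounded values exactly as they appear in this section; the `MAPPINGS` list and `concurrent_totals` function are illustrative names only.

```python
# Per-mapping additional usage in MB, using the rounded figures from the
# worked example in this section.
MAPPINGS = [
    {"user": "User 1 (matching)", "disk_mb": 165 + 550, "memory_mb": 10},
    {"user": "User 2 (address validation)", "disk_mb": 0, "memory_mb": 1300},
    {"user": "User 3 (standardisation)", "disk_mb": 0, "memory_mb": 16},
    {"user": "User 4 (association)", "disk_mb": 8200, "memory_mb": 200},
]


def concurrent_totals(mappings):
    """Return (disk_mb, memory_mb) added by mappings that run at the same time."""
    disk = sum(m["disk_mb"] for m in mappings)
    memory = sum(m["memory_mb"] for m in mappings)
    return disk, memory


disk_mb, memory_mb = concurrent_totals(MAPPINGS)
print(f"Additional disk: {disk_mb} MB, additional memory: {memory_mb} MB")
```

These totals are in addition to the base server requirements, so peak sizing should add them to the base disk and memory figures for the installation.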

Summary

The data in this document estimates the standard disk and memory footprint of the V9 server. In addition, the 2 tables shown at the end of the document allow a user to minimise the on-disk footprint of the install if this is required. The example sizing at the bottom of the document shows how to estimate a mapping's contribution to disk/memory by analysing the composition of the mapping and each transform's contribution to disk/memory usage. The example also shows the importance of factoring in the number of concurrent users and their likely usage when defining the total peak requirements of an individual installation.

Appendix 1

Address Validation Reference Data with On Disk size

Largest 50 files

United States       Batch/Interactive  533 MB
United Kingdom      FastCompletion     501 MB
United States       GeoCoding          422 MB
United States       FastCompletion     380 MB
United Kingdom      Batch/Interactive  306 MB
France              FastCompletion     210 MB
France              Batch/Interactive  153 MB
Argentina           FastCompletion     120 MB
Brazil              FastCompletion     104 MB
Germany             FastCompletion     102 MB
Germany             Batch/Interactive  99 MB
United Kingdom      Supplementary      94.5 MB
Italy               FastCompletion     92.9 MB
Argentina           Batch/Interactive  90 MB
Canada              FastCompletion     83.1 MB
India               FastCompletion     83.1 MB
India               Batch/Interactive  80 MB
Germany             GeoCoding          73.5 MB
Brazil              Batch/Interactive  73.3 MB
Italy               Batch/Interactive  66 MB
Canada              Batch/Interactive  61.8 MB
United Kingdom      GeoCoding          51.8 MB
Sweden              FastCompletion     49 MB
Mexico              FastCompletion     48.5 MB
Australia           FastCompletion     44.6 MB
Russian Federation  FastCompletion     44.3 MB
Mexico              Batch/Interactive  42.8 MB
Australia           Batch/Interactive  40.9 MB
Russian Federation  Batch/Interactive  40.5 MB
France              GeoCoding          39.7 MB
Portugal            FastCompletion     38.8 MB
Italy               GeoCoding          36.6 MB
Netherlands         FastCompletion     35.5 MB
Canada              GeoCoding          32.7 MB
China               FastCompletion     28.4 MB
Netherlands         Batch/Interactive  27.8 MB
Sweden              Batch/Interactive  27.4 MB
Spain               GeoCoding          25.6 MB
Australia           GeoCoding          25.4 MB
Spain               FastCompletion     23.7 MB
Chile               FastCompletion     23.4 MB
Netherlands         GeoCoding          22.7 MB
Portugal            Batch/Interactive  22.5 MB
China               Batch/Interactive  21.4 MB
Finland             GeoCoding          18.8 MB
Switzerland         FastCompletion     18.2 MB
Sweden              GeoCoding          17.8 MB
Chile               Batch/Interactive  16.8 MB
Belgium             FastCompletion     16.1 MB
Spain               Batch/Interactive  15.4 MB

The full list can be found at: http://www.addressdoctor.com/en/support/countrydownloadv5.asp

Appendix 2

Identity Based Matching Reference Data with On Disk Size

IM_japan_i.zip          86,222,167
IM_japan.zip            86,222,153
IM_japan_r.zip          15,754,935
IM_gaelic.zip           9,237,372
IM_canada.zip           8,933,319
IM_international.zip    5,303,974
IM_chinese_s.zip        4,955,588
IM_south_africa.zip     4,260,152
IM_uk.zip               4,241,637
IM_ireland.zip          4,241,357
IM_new_zealand.zip      4,200,805
IM_australia.zip        4,153,252
IM_usa.zip              4,134,750
IM_arabic_m.zip         3,893,388
IM_indonesia.zip        3,494,046
IM_cyrillic.zip         3,022,104
IM_arabic_r.zip         2,980,176
IM_singapore.zip        2,505,578
IM_india.zip            2,321,418
IM_chinese_t.zip        2,189,993
IM_aml.zip              2,083,153
IM_greek_l.zip          2,057,442
IM_switzerland.zip      2,028,497
IM_france.zip           1,950,898
IM_philippines.zip      1,896,332
IM_luxembourg.zip       1,812,614
IM_belgium.zip          1,696,864
IM_germany.zip          1,604,137
IM_brasil.zip           1,596,925
IM_portugal.zip         1,596,786
IM_korean_r.zip         1,588,819
IM_italy.zip            1,554,842
IM_turkey.zip           1,552,887
IM_hk_r.zip             1,542,915
IM_sweden.zip           1,528,272
IM_netherlands.zip      1,476,954
IM_taiwan_r.zip         1,473,532
IM_denmark.zip          1,473,231
IM_slovakia.zip         1,458,393
IM_malaysia.zip         1,447,577
IM_thai_r.zip           1,443,929
IM_spain.zip            1,438,526
IM_chinese_r.zip        1,431,129
IM_colombia.zip         1,414,047
IM_argentina.zip        1,413,962
IM_indo_chin_r.zip      1,410,620
IM_chile.zip            1,400,965
IM_peru.zip             1,389,800
IM_vietnam_r.zip        1,379,744
IM_puerto_rico.zip      1,372,143
IM_mexico.zip           1,344,656
IM_thai.zip             1,279,607
IM_finland.zip          1,273,884
IM_norway.zip           1,273,795
IM_poland.zip           1,261,906
IM_greek.zip            1,247,548
IM_hungary.zip          1,205,908
IM_estonia.zip          1,092,791
IM_korean.zip           821,290
IM_ofac.zip             759,006
IM_hebrew.zip           754,978
IM_chinese_i.zip        544,844
IM_arabic.zip           297,401