Informatica PowerCenter
(Version 8.1.1)
Informatica PowerCenter Unstructured Data Guide Version 8.1.1 September 2006 Copyright (c) 1998-2006 Informatica Corporation. All rights reserved. Printed in the USA.

This software and documentation contain proprietary information of Informatica Corporation and are provided under a license agreement containing restrictions on use and disclosure and are also protected by copyright law. Reverse engineering of the software is prohibited. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation.

Use, duplication, or disclosure of the Software by the U.S. Government is subject to the restrictions set forth in the applicable software license agreement and as provided in DFARS 227.7202-1(a) and 227.7702-3(a) (1995), DFARS 252.227-7013(c)(1)(ii) (OCT 1988), FAR 12.212(a) (1995), FAR 52.227-19, or FAR 52.227-14 (ALT III), as applicable.

The information in this document is subject to change without notice. If you find any problems in the documentation, please report them to us in writing. Informatica Corporation does not warrant that this documentation is error free.

Informatica, PowerCenter, PowerCenterRT, PowerCenter Connect, PowerCenter Data Analyzer, PowerMart, SuperGlue, Metadata Manager, Informatica Data Quality and Informatica Data Explorer are trademarks or registered trademarks of Informatica Corporation in the United States and in jurisdictions throughout the world. All other company and product names may be trade names or trademarks of their respective owners.

Portions of this software and/or documentation are subject to copyright held by third parties, including without limitation: Copyright DataDirect Technologies, 1999-2002. All rights reserved. Copyright Sun Microsystems. All Rights Reserved. Copyright RSA Security Inc. All Rights Reserved. Copyright Ordinal Technology Corp. All Rights Reserved.
Informatica PowerCenter products contain ACE (TM) software copyrighted by Douglas C. Schmidt and his research group at Washington University and University of California, Irvine, Copyright (c) 1993-2002, all rights reserved.

Portions of this software contain copyrighted material from The JBoss Group, LLC. Your right to use such materials is set forth in the GNU Lesser General Public License Agreement, which may be found at http://www.opensource.org/licenses/lgpl-license.php. The JBoss materials are provided free of charge by Informatica, as-is, without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose.

Portions of this software contain copyrighted material from Meta Integration Technology, Inc. Meta Integration is a registered trademark of Meta Integration Technology, Inc.

This product includes software developed by the Apache Software Foundation (http://www.apache.org/). The Apache Software is Copyright (c) 1999-2005 The Apache Software Foundation. All rights reserved.

This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit and redistribution of this software is subject to terms available at http://www.openssl.org. Copyright 1998-2003 The OpenSSL Project. All Rights Reserved.

The zlib library included with this software is Copyright (c) 1995-2003 Jean-loup Gailly and Mark Adler.

The Curl license provided with this Software is Copyright 1996-2004, Daniel Stenberg, <Daniel@haxx.se>. All Rights Reserved.

The PCRE library included with this software is Copyright (c) 1997-2001 University of Cambridge. Regular expression support is provided by the PCRE library package, which is open source software, written by Philip Hazel. The source for this library may be found at ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre.

InstallAnywhere is Copyright 2005 Zero G Software, Inc. All Rights Reserved.
Portions of the Software are Copyright (c) 1998-2005 The OpenLDAP Foundation. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted only as authorized by the OpenLDAP Public License, available at http://www.openldap.org/software/release/license.html. This Software is protected by U.S. Patent Numbers 6,208,990; 6,044,374; 6,014,670; 6,032,158; 5,794,246; 6,339,775 and other U.S. Patents Pending. DISCLAIMER: Informatica Corporation provides this documentation as is without warranty of any kind, either express or implied, including, but not limited to, the implied warranties of non-infringement, merchantability, or use for a particular purpose. The information provided in this documentation may include technical inaccuracies or typographical errors. Informatica could make improvements and/or changes in the products described in this documentation at any time without notice.
Table of Contents
Preface
    New Features and Enhancements
    About This Book
        Document Conventions
    Other Informatica Resources
        Visiting Informatica Customer Portal
        Visiting the Informatica Web Site
        Visiting the Informatica Developer Network
        Visiting the Informatica Knowledge Base
        Obtaining Technical Support
Index
Preface
Welcome to PowerCenter, the Informatica software product that delivers an open, scalable data integration solution addressing the complete life cycle for all data integration projects including data warehouses, data migration, data synchronization, and information hubs. PowerCenter combines the latest technology enhancements for reliably managing data repositories and delivering information resources in a timely, usable, and efficient manner.

The Unstructured Data Option extends the PowerCenter enterprise data integration capabilities to provide access to unstructured and semi-structured data formats. With this option, you can seamlessly access, integrate, and deliver enterprise data currently locked in documents and industry-specific data formats.
New Features and Enhancements

Launch ContentMaster Studio from the PowerCenter Designer. You can launch ContentMaster Studio from the Tools menu in the PowerCenter Designer.
Select available ContentMaster parsers when creating the transformation. When you create an Unstructured Data transformation, you select a parser from a list of the available parsers in ContentMaster.
Specify output type as file. You can specify file or buffer as the Unstructured Data transformation output type.
Document Conventions
This guide uses the following formatting conventions:
If you see / It means
The word or set of words are especially emphasized.
Emphasized subjects.
This is the variable name for a value you enter as part of an operating system command.
This is generic text that should be replaced with user-supplied values.
The following paragraph provides additional facts.
The following paragraph provides suggested uses.
The following paragraph notes situations where you can overwrite or corrupt data, unless you follow the specified procedure.
This is a code example.
This is an operating system command you enter from a prompt to run a task.
Other Informatica Resources

Informatica Customer Portal
Informatica web site
Informatica Developer Network
Informatica Knowledge Base
Informatica Technical Support
support@informatica.com for technical inquiries
support_admin@informatica.com for general customer service requests
WebSupport requires a user name and password. You can request a user name and password at http://my.informatica.com.
North America / South America
Informatica Corporation Headquarters
100 Cardinal Way
Redwood City, California 94063
United States

Europe / Middle East / Africa
Informatica Software Ltd.
6 Waltham Park
Waltham Road, White Waltham
Maidenhead, Berkshire SL6 3TN
United Kingdom
Standard Rate
Belgium: +32 15 281 702
France: +33 1 41 38 92 26
Germany: +49 1805 702 702
Netherlands: +31 306 022 797
United Kingdom: +44 1628 511 445

Asia / Australia
Informatica Business Solutions Pvt. Ltd.
Diamond District
Tower B, 3rd Floor
150 Airport Road
Bangalore 560 008
India
Toll Free
Australia: 00 11 800 4632 4357
Singapore: 001 800 4632 4357
Standard Rate
India: +91 80 4112 5738
Chapter 1
Overview
Installation and Configuration
Configuring Data Transformations in ContentMaster
Configuring an Unstructured Data Transformation
Error Messages
Overview
The PowerCenter Unstructured Data Option adds powerful data transformation capabilities to PowerCenter workflows. It lets you read data in unstructured and semi-structured formats, transform the data, and write it to a target. For example, you can use the Unstructured Data Option to read the data in the following formats:
Microsoft Word, Excel, PowerPoint, PDF, WordPerfect, Star Office, ASCII reports, HTML, undocumented binaries, RPG, and ANSI
Industry-specific formats, such as HL7, ACORD, FIXML, SWIFT, PL1, MVR, ASTM, EDI-X12, EDIFACT, and XML standards
AFP, PostScript, and DJDE

The Unstructured Data Option provides the following benefits:

Allows greater visibility into all enterprise data.
Eliminates the need for hand coding to access unstructured and semi-structured data.
Enables you to conform to mandated industry data formats, which helps ensure regulatory compliance.
The Unstructured Data Option includes the Unstructured Data transformation, which integrates with the Itemfield ContentMaster data transformation system. It lets you run ContentMaster services in PowerCenter.

Complete the following steps to use the Unstructured Data transformation:
1. In ContentMaster Studio, configure a data transformation and publish it as a ContentMaster service.
2. In the PowerCenter Designer, configure a mapping that uses the Unstructured Data transformation to activate the ContentMaster service.
3. In the Workflow Manager, configure and run a workflow that uses the mapping.
Installation and Configuration

Install and configure ContentMaster Engine 4.0.6. For more information about installing and configuring ContentMaster Engine 4.0.6, see Getting Started with ContentMaster and ContentMaster Studio User's Guide.
Install the Integration Service and ContentMaster Engine on the same machine.
In CMConfig.xml, locate the JVMLocation element. Change its value to the path of the JVM installed with PowerCenter. For example, on Solaris you can change the value to the following:
<JVMLocation>PowerCenter8.1.0/java/jre/lib/sparc/client</JVMLocation>
The following table shows the JVM path for each Integration Service platform:
Platform    JVM Directory Path
AIX         <PowerCenter8.1.0>/java/jre/bin/classic
HP          <PowerCenter8.1.0>/java/jre/lib/PA_RISC2.0/server
Linux       <PowerCenter8.1.0>/java/jre/lib/i386/client
Solaris     <PowerCenter8.1.0>/java/jre/lib/sparc/client
For more information about configuring JVM in ContentMaster, see the ContentMaster Studio Workbench User Guide.
Go to the ContentMaster installation directory. In the setEnv.csh or setEnv.sh environment file, add the following entry:
setenv CMJAVA_PATH <JVMLocation>:<PowerCenter8.1.1>/java/jre/lib/<OS>
Where <JVMLocation> is the directory specified in the JVM location setting and <PowerCenter8.1.1> is the Integration Service installation directory.
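The entry above can be assembled from its three parts. The following Python sketch is only an illustration: the function name `cmjava_path_entry` and the sample directories are hypothetical, and the value used for <OS> (`sparc`) is an assumption, since this guide does not list the valid <OS> values.

```python
def cmjava_path_entry(jvm_location: str, powercenter_dir: str, os_dir: str) -> str:
    """Build the csh-style CMJAVA_PATH entry shown in this guide.

    jvm_location    -- value of the JVM location setting
    powercenter_dir -- Integration Service installation directory
    os_dir          -- value for the <OS> placeholder (assumed, not documented here)
    """
    # Drop a trailing slash so the joined path has no doubled separator.
    lib_dir = f"{powercenter_dir.rstrip('/')}/java/jre/lib/{os_dir}"
    return f"setenv CMJAVA_PATH {jvm_location}:{lib_dir}"


# Hypothetical installation under /opt/pc on Solaris.
print(cmjava_path_entry("/opt/pc/java/jre/lib/sparc/client", "/opt/pc", "sparc"))
```

The output line uses `setenv`, which is csh syntax, so it matches the setEnv.csh form of the entry.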
Verify that the JVM location settings are the same as the settings in the following tables, based on your operating system. The following table shows the JVM location settings for 32-bit operating systems:
Operating System    JVM Location Settings
AIX                 <JVMLocation><PowerCenter8.1.1>/java/jre/bin/classic</JVMLocation>
HP-UX               <JVMLocation><PowerCenter8.1.1>/java/jre/lib/PA_RISC2.0/server</JVMLocation>
Linux               <JVMLocation><PowerCenter8.1.1>/java/jre/lib/i386/client</JVMLocation>
Solaris             <JVMLocation><PowerCenter8.1.1>/java/jre/lib/sparc/client</JVMLocation>
The following table shows the JVM location settings for 64-bit operating systems:
Operating System    JVM Location Settings
HP-PARISC           <JVMLocation><PowerCenter8.1.1>/java/jre/lib/PA_RISC2.0W/server</JVMLocation>
HP-IPF              <JVMLocation><PowerCenter8.1.1>/java/jre/lib/IA64W/hotspot</JVMLocation>
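The verification step can be scripted. The sketch below is a minimal illustration in Python, not part of the product: it assumes that CMConfig.xml contains a single JVMLocation element somewhere in the document, and the wrapper elements in the sample (`CMConfig`, `General`) and the `/opt/pc` directory are hypothetical.

```python
import xml.etree.ElementTree as ET

# JVM subdirectories taken from the two tables above, keyed by operating system.
JVM_SUBDIRS = {
    "AIX": "java/jre/bin/classic",
    "HP-UX": "java/jre/lib/PA_RISC2.0/server",
    "Linux": "java/jre/lib/i386/client",
    "Solaris": "java/jre/lib/sparc/client",
    "HP-PARISC": "java/jre/lib/PA_RISC2.0W/server",
    "HP-IPF": "java/jre/lib/IA64W/hotspot",
}


def expected_jvm_location(powercenter_dir: str, operating_system: str) -> str:
    """Return the JVM location that the tables prescribe for an operating system."""
    return f"{powercenter_dir.rstrip('/')}/{JVM_SUBDIRS[operating_system]}"


def jvm_location_matches(config_xml: str, powercenter_dir: str, operating_system: str) -> bool:
    """Check the JVMLocation value in the given XML text against the tables."""
    root = ET.fromstring(config_xml)
    # find() searches descendants only, so handle a root-level element separately.
    element = root if root.tag == "JVMLocation" else root.find(".//JVMLocation")
    if element is None or element.text is None:
        return False
    # Ignore stray whitespace around the configured path.
    return element.text.strip() == expected_jvm_location(powercenter_dir, operating_system)


# Hypothetical configuration fragment; the real file layout may differ.
sample = "<CMConfig><General><JVMLocation>/opt/pc/java/jre/lib/i386/client </JVMLocation></General></CMConfig>"
print(jvm_location_matches(sample, "/opt/pc", "Linux"))
```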
Configuring Data Transformations in ContentMaster

Parser. Converts any data format to XML.
Serializer. Converts XML to any format.
Transformer. Modifies the data in any format.
During the PowerCenter session, the Unstructured Data transformation runs a ContentMaster service to parse the data. For example, an Unstructured Data transformation can activate a ContentMaster parser service, which transforms binary or text inputs to XML. Similarly, an Unstructured Data transformation can activate a ContentMaster serializer service, which transforms XML to other data formats. By chaining a parser and a serializer in a single PowerCenter mapping or in sequential mappings, you can use ContentMaster to transform any data format to any other data format through XML.
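The parser-then-serializer chain can be pictured with a small stand-in. The Python sketch below does not call ContentMaster and does not use any ContentMaster API; the functions `parse_csv_to_xml`, `serialize_xml_to_json`, and `transform` are hypothetical helpers that only mimic the pattern of converting a source format to XML and then XML to a target format.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def parse_csv_to_xml(text: str) -> str:
    """Stand-in for a parser service: delimited text in, XML out."""
    root = ET.Element("rows")
    for record in csv.DictReader(io.StringIO(text)):
        row = ET.SubElement(root, "row")
        for field, value in record.items():
            # Column headers become element names, so they must be valid XML names.
            ET.SubElement(row, field).text = value
    return ET.tostring(root, encoding="unicode")


def serialize_xml_to_json(xml_text: str) -> str:
    """Stand-in for a serializer service: XML in, another format (JSON) out."""
    root = ET.fromstring(xml_text)
    rows = [{child.tag: child.text for child in row} for row in root]
    return json.dumps(rows)


def transform(text: str) -> str:
    """Chain the two steps, with XML as the intermediate format."""
    return serialize_xml_to_json(parse_csv_to_xml(text))


print(transform("name,city\nAda,London\n"))
```

Because both steps meet at XML, either side can be swapped for a different format without changing the other.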
To configure and publish a data transformation in ContentMaster:
1. In the PowerCenter Designer, click Tools > ContentMaster Studio to launch ContentMaster Studio.
2. In ContentMaster Studio, configure a workspace containing the data transformation that you want to run.
3. Use the Workspace > Publish as Service command to publish the transformation as a ContentMaster service. This lets ContentMaster Engine run the transformation outside ContentMaster Studio.
For more information about creating and publishing ContentMaster services, see Getting Started with ContentMaster and ContentMaster Studio User's Guide.
Configuring an Unstructured Data Transformation

File. Select File when you want to pass the full path of the file that contains the source or target data.
Buffer. Select Buffer when you want to read data from the source and pass it to the Unstructured Data transformation. Or, select Buffer when you want to pass data from the Unstructured Data transformation to the target.
For example, you want to read customer name and address data from an Excel file and create an XML file that contains this data. You can create an Unstructured Data transformation with the input as buffer and the output as file. The Integration Service reads the data from the Excel file. The Unstructured Data transformation parses the data and sends the data to an XML file. The Integration Service then writes the XML file to the target.

When you select File as the output type, the PowerCenter Designer creates the Unstructured Data transformation with two input ports. When you select Buffer as the output type, the PowerCenter Designer creates the Unstructured Data transformation with one input port. An Unstructured Data transformation always contains one output port.

You cannot change the output type after you create the transformation. For more information about editing the Unstructured Data transformation, see "Editing an Unstructured Data Transformation."

Figure 1-1 shows an Unstructured Data transformation with File output type and another with Buffer output type:
Figure 1-1. Unstructured Data Transformation Examples (panels: File Output Mode, Buffer Output Mode)
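The port layout that the Designer creates for each output type can be summarized in a few lines. This Python sketch only restates the rule from this section; the function name `port_counts` and the dictionary keys are hypothetical and are not PowerCenter identifiers.

```python
def port_counts(output_type: str) -> dict:
    """Return the number of ports created for a given output type."""
    kind = output_type.strip().lower()
    if kind == "file":
        input_ports = 2  # File output type: two input ports
    elif kind == "buffer":
        input_ports = 1  # Buffer output type: one input port
    else:
        raise ValueError(f"output type must be file or buffer, got {output_type!r}")
    # The transformation always has exactly one output port.
    return {"input_ports": input_ports, "output_ports": 1}


print(port_counts("File"))
print(port_counts("Buffer"))
```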
To create an Unstructured Data transformation:
1. In the Mapping Designer or Transformation Developer, click Transformation > Create.
2. Select Unstructured Data Transformation as the transformation type.
3. Enter a name for the transformation.
4. Click Create. The UDO Transformation dialog box appears.
5. Configure the following properties:

   Property       Required/Optional
   Input Type     Required
   Output Type    Required

6. Click OK.
Editing an Unstructured Data Transformation

1. Double-click the title bar of the transformation.
2. On the Ports tab, optionally edit the name, datatype, and precision of the input and output ports. Datatype must be String or Binary.
3. The following example shows the Metadata Extensions tab for the Unstructured Data transformation:
   [Figure: Metadata Extensions tab]
4. Click OK.
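The port rules in this section (String or Binary datatypes, exactly one output port) can be checked before a session runs. The following Python sketch is a hypothetical pre-flight check, not a PowerCenter feature; the function `check_ports` is an assumed name, and the returned strings reuse message text from the Error Messages section.

```python
VALID_DATATYPES = {"String", "Binary"}


def check_ports(input_datatypes: list, output_datatypes: list) -> list:
    """Return the messages for every port rule that the given layout breaks."""
    problems = []
    # Rule: the transformation always has exactly one output port.
    if len(output_datatypes) != 1:
        problems.append("Unstructured Data transformation can only have one output port.")
    # Rule: every port datatype must be String or Binary.
    if any(datatype not in VALID_DATATYPES for datatype in input_datatypes):
        problems.append("Datatype of the input port is invalid.")
    if any(datatype not in VALID_DATATYPES for datatype in output_datatypes):
        problems.append("Datatype of the output port is invalid.")
    return problems


print(check_ports(["String", "Binary"], ["String"]))
print(check_ports(["Integer"], ["String", "String"]))
```

An empty list means the layout passes these checks; the sketch deliberately leaves out the input port count, which depends on the type selected when the transformation was created.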
Error Messages
In the event of an error, the Unstructured Data transformation may write the following error messages to the PowerCenter session log.

UDT_50000  ContentMaster service <service name> not found.
Cause:     You have not published the data transformation as a ContentMaster service.
Action:    In ContentMaster Studio, publish the service.

UDT_50001  Output filename is empty.
Cause:     The output type is designated as file, but the Integration Service did not create a filename. The Integration Service creates the output file name from the input port data. It cannot create a filename if the input data is not valid.
Action:    Ensure that the input data is valid.

UDT_50002  Output precision <precision> is less than output length <length>.
Cause:     The output of the ContentMaster service is larger than the precision of the output port.
Action:    Increase the precision of the output port.

UDT_50003  Error threshold reached.
Cause:     The number of row errors reached the allowable limit.
Action:    Review the session and workflow logs and correct the errors.

UDT_50004  Content Master Engine initialization failed: <status.error code> <status.description>.
Cause:     There may be a ContentMaster installation problem.
Action:    Review the installation and configuration or re-install.

UDT_50005  Failed to process data: <data>.
Cause:     An error occurred in ContentMaster while processing the data.
Action:    Run the session again.

UDT_50006  Invalid input type: <input type>.
Cause:     The Unstructured Data transformation is not configured correctly.
Action:    Enter file or buffer as the input type for the transformation.
Invalid output type: <output type>.
Cause:     The Unstructured Data transformation is not configured correctly.
Action:    You must recreate the transformation using a valid output type. Use file or buffer as the output type.

Failed to flush the output.
Cause:     An error occurred while flushing the output.
Action:    Check system resources. Run the session again.

Could not get metadata extensions for the Unstructured Data transformation.
Cause:     Metadata is either inconsistent or not present.
Action:    Make sure the transformation name, input types, and output types are correct.
           or
           Recreate the transformation.

Unstructured Data transformation can only have one output port.
Cause:     There is an incorrect number of output ports for the Unstructured Data transformation. The Unstructured Data transformation can have only one output port.
Action:    Define exactly one output port.

UDT_50011  Unstructured Data transformation can only have one input port.
Cause:     There is an incorrect number of buffer input ports for the Unstructured Data transformation.
Action:    Define exactly one buffer input port and exactly one output port.

UDT_50012  Datatype of the input port is invalid.
Cause:     The datatype of the input port is not String or Binary.
Action:    Assign a String or Binary datatype.

UDT_50013  Datatype of the output port is invalid.
Cause:     The datatype of the output port is not String or Binary.
Action:    Assign a String or Binary datatype.

UDT_50014  There must be exactly two input ports in the widget.
Cause:     There is an incorrect number of file input ports for the Unstructured Data transformation. For file input, there must be two input ports.
Action:    Edit the transformation to create a second input port.
ItemField engine initialization failed.
Cause:     There may be an Itemfield installation problem.
Action:    Review the session and workflow logs for additional errors. Review the installation and configuration, or re-install.

Could not get all three required metadata extensions.
Cause:     Metadata is either inconsistent or not present.
Action:    Make sure the transformation name, input types, and output types are correct.
           or
           Recreate the transformation.

Message catalog cannot be created for Unstructured Data transformation.
Cause:     The Unstructured Data Option did not install properly, or message catalog files are missing.
Action:    Reinstall the Unstructured Data Option.

License for Unstructured Data Transformation has not been enabled.
Cause:     The Unstructured Data Option license is not valid for the Unstructured Data transformation.
Action:    Contact Informatica Technical Support.
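When a session fails, the UDT codes in the session log point to the actions listed in this section. The Python sketch below is a rough illustration of that lookup: the log line layout in the sample is an assumption (this guide does not show session log output), and the function name `suggested_actions` is hypothetical. Only the codes UDT_50000 through UDT_50006 are included.

```python
import re

# Matches message codes such as UDT_50002 anywhere in the log text.
UDT_CODE = re.compile(r"\bUDT_\d{5}\b")

# Actions copied from the Error Messages section for UDT_50000 to UDT_50006.
ACTIONS = {
    "UDT_50000": "In ContentMaster Studio, publish the service.",
    "UDT_50001": "Ensure that the input data is valid.",
    "UDT_50002": "Increase the precision of the output port.",
    "UDT_50003": "Review the session and workflow logs and correct the errors.",
    "UDT_50004": "Review the installation and configuration or re-install.",
    "UDT_50005": "Run the session again.",
    "UDT_50006": "Enter file or buffer as the input type for the transformation.",
}


def suggested_actions(log_text: str) -> list:
    """Return (code, action) pairs for each distinct UDT code, in order of first appearance."""
    seen = []
    for code in UDT_CODE.findall(log_text):
        if code not in seen:
            seen.append(code)
    return [(code, ACTIONS.get(code, "See the Error Messages section.")) for code in seen]


# Hypothetical log excerpt; real session logs may be laid out differently.
log = (
    "ERROR UDT_50002: Output precision 10 is less than output length 40.\n"
    "ERROR UDT_50002: Output precision 10 is less than output length 52.\n"
    "ERROR UDT_50003: Error threshold reached.\n"
)
for code, action in suggested_actions(log):
    print(code, "->", action)
```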
Index
B
buffer input type
    description
buffer output type
    description

D
data transformations
    configuring in ContentMaster Studio

F
file input type
    description
file output type
    description
    writing filenames in an Unstructured Data Option session

I
Input Type property
    in Unstructured Data transformations
installation
    minimum system requirements for Unstructured Data Option

J
JRE
    verifying the JRE path for Unstructured Data Option
JVM
    configuring in ContentMaster

M
mappings
    configuring with Unstructured Data transformation

O
Output Type property
    in Unstructured Data transformations

S
services
    ContentMaster service types
system requirements
    Unstructured Data Option

U
Unstructured Data Option
    overview
    PowerCenter
Unstructured Data transformation
    configuring in a mapping
    creating
    editing
    upgrading
upgrading
    Unstructured Data transformation

W
workflows
    configuring for Unstructured Data Option
    running for Unstructured Data Option