
Error Tables, Work Tables and Log Tables

Besides the target table(s), MultiLoad requires the use of four special tables in order to function. They consist of two error tables (per target table), one work table (per target table), and one log table. In essence, the Error Tables will be used to store any conversion, constraint or uniqueness violations during a load. Work Tables are used to receive and sort data and SQL on each AMP prior to storing them permanently to disk. A Log Table (also called the "Logtable") is used to store successful checkpoints during load processing in case a RESTART is needed.

HINT: Sometimes a company wants all of these load support tables to be housed in a particular database. When these tables are to be stored in any database other than the user's own default database, you must give them a qualified name (<databasename>.<tablename>) in the script or use the DATABASE command to change the current database.

Where will you find these tables in the load script? The Logtable is generally identified immediately prior to the .LOGON command. Work tables and error tables can be named in the BEGIN MLOAD statement. Do not underestimate the value of these tables. They are vital to the operation of MultiLoad; without them a MultiLoad job cannot run. Now that you have had the "executive summary," let's look at each type of table individually.

Two Error Tables: Here is another place where FastLoad and MultiLoad are similar. Both require the use of two error tables per target table. MultiLoad will automatically create these tables. Rows are inserted into these tables only when errors occur during the load process. The first error table is the acquisition Error Table (ET). It contains all translation and constraint errors that may occur while the data is being acquired from the source(s). The second is the Uniqueness Violation (UV) table, which stores rows with duplicate values for Unique Primary Indexes (UPI). Since a UPI must be unique, MultiLoad can only load one occurrence of a value into the table. Any duplicate value will be stored in the UV error table. For example, you might see a UPI error that shows a second employee number "99." In this case, if the name for employee "99" is Kara Morgan, you will be glad that the row did not load, since Kara Morgan is already in the Employee table. However, if the name showed up as David Jackson, then you know that further investigation is needed, because employee numbers must be unique.

Each error table does the following:

    Identifies errors
    Provides some detail about the errors
    Stores the actual offending row for debugging

You have the option to name these tables in the MultiLoad script (shown later). Alternatively, if you do not name them, they default to ET_<target_table_name> and UV_<target_table_name>. In either case, MultiLoad will not accept error table names that are the same as target table names. Whatever names you choose, it is recommended that you standardize on a naming convention to make it easier for everyone on your team. For more details on how these error tables can help you, see the subsection in this chapter titled "Troubleshooting MultiLoad Errors."
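As a quick, hedged illustration of how you might look at these tables after a load, suppose the target table is named Employee and the default error table names were used (the table name is invented; the exact error table column layout can vary, so SELECT * is used for inspection):

    /* Did any acquisition (translation/constraint) errors occur? */
    SELECT COUNT(*) FROM ET_Employee;

    /* Did any Unique Primary Index violations occur? */
    SELECT COUNT(*) FROM UV_Employee;

    /* Inspect the offending rows and error details for debugging */
    SELECT * FROM ET_Employee;

If both counts come back zero, the load had no acquisition or uniqueness errors, and the empty error tables will be dropped automatically at the end of the job.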

Log Table: MultiLoad requires a LOGTABLE. This table keeps a record of the results from each phase of the load so that MultiLoad knows the proper point from which to RESTART. There is one LOGTABLE for each run. Since MultiLoad will not resubmit a command that has been run previously, it will use the LOGTABLE to determine the last successfully completed step.

Work Table(s): MultiLoad will automatically create one work table for each target table. This means that in IMPORT mode you could have one or more work tables. In DELETE mode, you will only have one work table, since that mode only works on one target table. The purpose of work tables is to hold two things:

    1. The Data Manipulation Language (DML) tasks
    2. The input data that is ready to APPLY to the AMPs

The work tables are created in a database using PERM space, and they can become very large. If the script uses multiple SQL statements for a single data record, the data is sent to the AMP once for each SQL statement. This replication guarantees fast performance and that no SQL statement will ever be done more than once. So, this is very important. However, there is no such thing as a free lunch: the cost is space. Later, you will see that using a FILLER field can help reduce this disk space by not sending unneeded data to an AMP. In other words, the efficiency of the MultiLoad run is in your hands.
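Putting the three kinds of support tables together, the top of a MultiLoad script typically looks something like the following sketch. The database, table, and logon names are invented for illustration; the point is the placement: the Logtable is defined before the .LOGON, and the work and error tables are (optionally) named and qualified in the .BEGIN MLOAD statement.

    .LOGTABLE util_db.Employee_Log;          /* Logtable comes before the logon  */
    .LOGON    tdpid/loaduser,password;       /* establish the MultiLoad sessions */

    .BEGIN IMPORT MLOAD
        TABLES      Employee                 /* target table                     */
        WORKTABLES  util_db.WT_Employee      /* one work table per target table  */
        ERRORTABLES util_db.ET_Employee      /* acquisition (ET) error table     */
                    util_db.UV_Employee;     /* uniqueness violation (UV) table  */

If the WORKTABLES and ERRORTABLES clauses are omitted, MultiLoad creates the tables itself (the error tables with the ET_/UV_ default names described above) in the user's default database, per the earlier HINT about qualified names.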

Supported Input Formats

Data input files come in a variety of formats, but MultiLoad is flexible enough to handle many of them. MultiLoad supports the following five format options: BINARY, FASTLOAD, TEXT, UNFORMAT and VARTEXT.

    BINARY      Each record is a 2-byte integer, n, that is followed by n bytes of data. A byte is the smallest unit of storage for Teradata.
    FASTLOAD    This format is the same as Binary, plus a marker (X'0A' or X'0D') that specifies the end of the record.
    TEXT        Each record has an arbitrary number of bytes and is followed by an end-of-record marker.
    UNFORMAT    The format for these input records is defined in the LAYOUT statement of the MultiLoad script using the components FIELD, FILLER and TABLE.
    VARTEXT     This is a variable-length text RECORD format separated by delimiters such as a comma. For this format you may only use VARCHAR, LONG VARCHAR (IBM) or VARBYTE data formats in your MultiLoad LAYOUT. Note that two delimiter characters in a row will result in a null value between them.

Figure 5-1
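To make the VARTEXT description a little more concrete, here is a hedged sketch of a LAYOUT for a comma-delimited input file, with a FILLER field for a column in the file that we do not want shipped to the AMPs. The field names and input file name are invented for the example:

    .LAYOUT Employee_Layout;
        .FIELD  Employee_No  * VARCHAR(11);   /* * means "next position in the record"        */
        .FIELD  Last_Name    * VARCHAR(20);
        .FIELD  First_Name   * VARCHAR(12);
        .FILLER Legacy_Code  * VARCHAR(10);   /* read from the file, but never sent to an AMP */

    .IMPORT INFILE employee.txt
        FORMAT VARTEXT ','                    /* comma-delimited, variable-length text        */
        LAYOUT Employee_Layout
        APPLY  Upsert_Employee;               /* DML label; a sketch appears under Phase 2    */

Notice that every field is VARCHAR, as the VARTEXT format requires, and that the FILLER column is exactly how you keep unneeded data out of the work tables, as mentioned earlier.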

MultiLoad Has Five IMPORT Phases

MultiLoad IMPORT has five phases, but don't be fazed by this! Here is the short list:

    Phase 1: Preliminary Phase
    Phase 2: DML Transaction Phase
    Phase 3: Acquisition Phase
    Phase 4: Application Phase
    Phase 5: Cleanup Phase

Let's take a look at each phase and see what it contributes to the overall load process of this magnificent utility. Should you memorize every detail about each phase? Probably not. But it is important to know the essence of each phase, because sometimes a load fails. When it does, you need to know in which phase it broke down, since the method for fixing the error to RESTART may vary depending on the phase. And if you can picture what MultiLoad actually does in each phase, you will likely write better scripts that run more efficiently.

Phase 1: Preliminary Phase

The ancient oriental proverb says, "Measure one thousand times; cut once." MultiLoad uses Phase 1 to conduct several preliminary set-up activities whose goal is to provide a smooth and successful climate for running your load. The first task is to be sure that the SQL syntax and MultiLoad commands are valid. After all, why try to run a script when the system will just find out during the load process that the statements are not useable? MultiLoad knows that it is much better to identify any syntax errors right up front. All the preliminary steps are automated; no user intervention is required in this phase.

Second, all MultiLoad sessions with Teradata need to be established. The default is the number of available AMPs. Teradata will quickly establish this number, as a factor of 16, as the basis for the number of sessions to create. The general rule of thumb for the number of sessions to use on smaller systems is the following: use the number of AMPs plus two more. For larger systems with hundreds of AMP processors, the SESSIONS option is available to lower the default. Remember, these sessions are running on your poor little computer as well as on Teradata. Each session loads the data to Teradata across the network or channel. Every AMP plays an essential role in the MultiLoad process. The AMPs receive the data blocks, hash each row and send the rows to the correct AMP. When the rows come to an AMP, it stores them in work table blocks on disk. But, lest we get ahead of ourselves, suffice it to say that there is ample reason for multiple sessions to be established.

What about the extra two sessions? Well, the first one is a control session to handle the SQL and logging. The second is a backup or alternate for logging. You may have to use some trial and error to find what works best on your system configuration. If you specify too few sessions, it may impair performance and increase the time it takes to complete load jobs. On the other hand, too many sessions will reduce the resources available for other important database activities.

Third, the required support tables are created. They are the following:

    Type of Table    Table Details
    ERRORTABLES      MultiLoad requires two error tables per target table. The first error table contains constraint violations, while the second error table stores Unique Primary Index violations.
    WORKTABLES       Work Tables hold two things: the DML tasks requested and the input data that is ready to APPLY to the AMPs.
    LOGTABLE         The LOGTABLE keeps a record of the results from each phase of the load so that MultiLoad knows the proper point from which to RESTART.

Figure 5-2

The final task of the Preliminary Phase is to apply utility locks to the target tables. Initially, access locks are placed on all target tables, allowing other users to read or write to the table for the time being. However, this lock does prevent a user from requesting an exclusive lock. Although these locks will still allow the MultiLoad user to drop the table, no one else may DROP or ALTER a target table while it is locked for loading. This leads us to Phase 2.

Phase 2: DML Transaction Phase

In Phase 2, all of the SQL Data Manipulation Language (DML) statements are sent ahead to Teradata. MultiLoad allows the use of multiple DML functions. Teradata's Parsing Engine (PE) parses the DML and generates a step-by-step plan to execute the request. This execution plan is then communicated to each AMP and stored in the appropriate work table for each target table. In other words, each AMP is going to work off the same page. Later, during the Acquisition Phase, the actual input data will also be stored in the work table so that it may be applied in Phase 4, the Application Phase. Next, a match tag is assigned to each DML request that will match it with the appropriate rows of input data. The match tags will not actually be used until the data has already been acquired and is about to be applied to the work table. This is somewhat like a student who receives a letter from the university in the summer that lists his courses, professors' names, and classroom locations for the upcoming semester. The letter is a "match tag" for the student to his school schedule, although it will not be used for several months. This matching tag for SQL and data is the reason that the data is replicated for each SQL statement using the same data record.
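To make the DML task idea concrete, here is a hedged sketch that continues the hypothetical Employee example (field names follow the LAYOUT sketched earlier). The DO INSERT FOR MISSING UPDATE ROWS option turns the pair of statements into an upsert; because the label carries two SQL statements, each input record is carried in the work table once for each statement, which is exactly the replication and match-tag behavior described above:

    .DML LABEL Upsert_Employee
        DO INSERT FOR MISSING UPDATE ROWS;       /* try the UPDATE; if no row, do the INSERT */
        UPDATE Employee
           SET Last_Name  = :Last_Name,
               First_Name = :First_Name
         WHERE Employee_No = :Employee_No;
        INSERT INTO Employee (Employee_No, Last_Name, First_Name)
        VALUES (:Employee_No, :Last_Name, :First_Name);

The colon prefix (:Employee_No and so on) tells MultiLoad to substitute the value from the corresponding LAYOUT field of each input record into the SQL.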

Phase 3: Acquisition Phase

With the proper set-up complete and the PE's plan stored on each AMP, MultiLoad is now ready to receive the INPUT data. This is where it gets interesting! MultiLoad now acquires the data in large, unsorted 64K blocks from the host and sends it to the AMPs.

At this point, Teradata does not care about which AMP receives the data block. The blocks are simply sent, one after the other, to the next AMP in line. For their part, the AMPs begin to deal with the blocks that they have been dealt. It is like a game of cards - you take the cards you have received and then play the game. You want to keep some and give some away. Similarly, the AMPs will keep some data rows from the blocks and give some away. Each AMP hashes every row on the primary index and sends it over the BYNET to the proper AMP where it will ultimately be used. But the row does not get inserted into its target table just yet. The receiving AMP must first do some preparation before that happens. Don't you have to get ready before company arrives at your house? The AMP puts all of the hashed rows it has received from other AMPs into the work tables, where it assembles them with the SQL. Why? Because once the rows are reblocked, they can be sorted into the proper order for storage in the target table. Now the utility places a load lock on each target table in preparation for the Application Phase. Of course, there is no Acquisition Phase when you perform a MultiLoad DELETE task, since no data is being acquired.
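Since the DELETE task just came up, here is a hedged sketch of what a complete MultiLoad DELETE script can look like. All names and the date condition are invented; note that the WHERE condition of a MultiLoad DELETE is typically not an equality test on the unique primary index, since the task is designed to sweep through the table a block at a time:

    .LOGTABLE util_db.Purge_Log;
    .LOGON    tdpid/loaduser,password;

    .BEGIN DELETE MLOAD TABLES Employee;     /* DELETE mode: one target table, no Acquisition Phase */

    DELETE FROM Employee
     WHERE Term_Date < DATE '2001-01-01';    /* condition on a non-primary-index column             */

    .END MLOAD;
    .LOGOFF;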

Phase 4: Application Phase

The purpose of this phase is to write, or APPLY, the specified changes to both the target tables and NUSI subtables. Once the data is on the AMPs, it is married up to the SQL for execution. To accomplish this substitution of data into SQL, the host has already attached some sequence information and five (5) match tags to each data row when sending the data. Those match tags are used to join the data with the proper SQL statement within a DML label. In addition to associating each row with the correct DML statement, match tags also guarantee that no row will be updated more than once, even when a RESTART occurs. The following five columns are the matching tags:

    MATCHING TAGS
    ImportSeq    Sequence number that identifies the IMPORT command where the error occurred
    DMLSeq       Sequence number for the DML statement involved with the error
    SMTSeq       Sequence number of the DML statement being carried out when the error was discovered
    ApplySeq     Sequence number that tells which APPLY clause was running when the error occurred
    SourceSeq    The number of the data row in the client file that was being built when the error took place

Figure 5-3

Remember, MultiLoad allows for the existence of NUSI processing during a load. Every hash-sequence sorted block from Phase 3 and each block of the base table is read only once, to reduce I/O operations and gain speed. Then, all matching rows in the base block are inserted, updated or deleted before the entire block is written back to disk, one time. This is why the match tags are so important. Changes are made based upon corresponding data and DML (SQL) according to the match tags. They guarantee that the correct operation is performed for the rows and blocks with no duplicate operations, a block at a time. And each time a table block is written to disk successfully, a record is inserted into the LOGTABLE. This permits MultiLoad to avoid starting again from the very beginning if a RESTART is needed.

What happens when several tables are being updated simultaneously? In this case, all of the updates are scripted as a multi-statement request. That means that Teradata views them as a single transaction. If there is a failure at any point of the load process, MultiLoad will merely need to be RESTARTed from the point where it failed. No rollback is required. Any errors will be written to the proper error table.

Phase 5: Clean Up Phase

Those of you reading these paragraphs who have young children or teenagers will certainly appreciate this final phase! MultiLoad actually cleans up after itself. The utility looks at the final Error Code (&SYSRC). MultiLoad believes the adage, "All is well that ends well." If the last error code is zero (0), all of the job steps have ended successfully (i.e., all has certainly ended well). This being the case, all empty error tables, work tables and the log table are dropped. All locks, both Teradata and MultiLoad, are released. The statistics for the job are generated for output (SYSPRINT) and the system count variables are set. After this, each MultiLoad session is logged off. So what happens if the final error code is not zero? Stay tuned. Restarting MultiLoad is a topic that will be covered later in this chapter.
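As a closing aside, one hedged way to see the clean-up behavior for yourself is to look for leftover support tables after the job ends. Assuming the support tables were placed in util_db as in the earlier sketches, a query like the following should come back empty after a run that finished with a zero return code, because the empty error tables, the work table and the log table have all been dropped:

    SELECT TableName
    FROM   DBC.Tables
    WHERE  DatabaseName = 'util_db'
    AND   (TableName LIKE 'ET_%' OR TableName LIKE 'UV_%' OR TableName LIKE 'WT_%');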
