Goals
Partition by
Invalid Record Indicator
Hash of batch_id
Example Table
CREATE TABLE dev_wsl_lzn.ECOM_ORD_LINE_CHRG_E(
orderno string,
ld_seq_i int,
ld_btch_i int,
..
last_mod_ctime_i bigint )
PARTITIONED BY (
nvld_rec_c string,
ld_btch_hash_i int)
Inserts
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table DEV_WSL_LZN.ECOM_ORD_LINE_CHRG_E
partition(nvld_rec_c,ld_btch_hash_i)
select
DT_1.orderno,
.
, UNIX_TIMESTAMP()
, nvld_rec_c
, cast(ld_btch_i/100 as int) as ld_btch_hash_i
from source_table DT_1
-- source_table stands in for the feed: a table, or rows landed via Flume/Storm
Effects of Partitioning
ECOM_ORD_LINE_CHRG_E
Metrics Before/After
30% reduction in HDFS reads
Before:
MapReduce Jobs Launched:
Job 0: Map: 12 Reduce: 4 Cumulative CPU: 94.33 sec HDFS Read: 404072929 HDFS Write: 464 SUCCESS
Job 1: Map: 3 Reduce: 1 Cumulative CPU: 7.0 sec HDFS Read: 1831 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 41 seconds 330 msec
OK
71511
Time taken: 47.807 seconds, Fetched: 1 row(s)
After:
MapReduce Jobs Launched:
Job 0: Map: 10 Reduce: 3 Cumulative CPU: 74.11 sec HDFS Read: 284270313 HDFS Write: 348 SUCCESS
Job 1: Map: 2 Reduce: 1 Cumulative CPU: 5.37 sec HDFS Read: 1337 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 19 seconds 480 msec
OK
71511
Time taken: 47.451 seconds, Fetched: 1 row(s)
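The query behind these measurements is not shown. As a rough sketch, a read of the following shape filters on both partition columns, so Hive can skip whole partition directories instead of scanning them. The flag value 'N' (taken here to mean "valid record") and the batch id 123456 are assumptions made for illustration.

```sql
-- Sketch only: 'N' and batch id 123456 are assumed values, not from the source.
-- nvld_rec_c and ld_btch_hash_i are partition columns, so these two filters prune partitions.
-- 1234 is cast(123456/100 as int), the same expression used at insert time.
-- ld_btch_i is a regular column and narrows to the exact batch inside the partition.
select count(*)
from dev_wsl_lzn.ECOM_ORD_LINE_CHRG_E
where nvld_rec_c = 'N'
  and ld_btch_hash_i = 1234
  and ld_btch_i = 123456;
```

Keeping the partition filter as a plain literal makes it easy to confirm the pruning in the query plan.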
Updates
HDFS doesn't support atomic updates/deletes
Accommodate by appending rows
Timestamp each row to denote modification time
ROW_NUMBER() by PK to get the latest row for each PK
Design considerations
You probably already have the row lying around, so use it!
If not, join back on PK/last mod time to get the row, then append
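The join-back case can be sketched as follows. All names here are hypothetical: ord_fnd is an append-only target with columns (ord_i, stat_c, ord_amt, last_mod_ctime_i), and ord_chg is an incoming change set that carries only the key, the timestamp of the version it supersedes, and the new value.

```sql
-- Sketch only: ord_fnd, ord_chg and every column name are hypothetical.
-- The select list must follow the column order of ord_fnd.
-- Changed columns come from the change row; unchanged columns come from the existing row.
-- unix_timestamp() stamps the new version so it sorts after the one it replaces.
insert into table ord_fnd
select
    f.ord_i,
    c.new_stat_c,
    f.ord_amt,
    unix_timestamp()
from ord_chg c
join ord_fnd f
  on  f.ord_i = c.ord_i
  and f.last_mod_ctime_i = c.last_mod_ctime_i;
```

Joining on the last mod time as well as the PK pins the exact version being superseded, so older versions of the same key are not appended again.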
Foundation Updates
Type 1 Update
Single-pass insert to foundation
Assumes all rows are in the staging work table
PRCS_C used to determine I/U/D (insert/update/delete) logic
PRCS_C appended to all tables to determine if a row is deleted
LAST_MOD_CTIME_I appended to all tables to determine sequencing
Domain/Dimensional values
If small enough, overwrite leveraging set theory
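A single-pass Type 1 load might look like the sketch below. The table names stg_wrk.cust_wrk (staging work table) and fnd.cust (foundation table) and the columns cust_i and cust_n are hypothetical, and storing the process code as the literal letters 'I', 'U' and 'D' is an assumption.

```sql
-- Sketch only: stg_wrk.cust_wrk, fnd.cust, cust_i and cust_n are hypothetical names.
-- Every staged row is appended as-is, including deletes, which arrive as rows
-- whose prcs_c marks them deleted (assumed to be 'D').
-- unix_timestamp() fills last_mod_ctime_i, the sequencing column.
insert into table fnd.cust
select
    w.cust_i,
    w.cust_n,
    w.prcs_c,
    unix_timestamp()
from stg_wrk.cust_wrk w;
```

Nothing is rewritten in place; readers resolve the current state of each key at retrieval time.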
Type 2 Upserts
Single-pass insert to foundation
Assumes
All rows are in the staging work table
Heavy lifting has been accomplished to determine effective start
Foundation Retrieval
Need to determine the latest row when retrieving
Accomplished via ROW_NUMBER() partitioned by PK, ordered by LAST_MOD_CTIME_I desc
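As a sketch of such a retrieval, assuming a hypothetical foundation table fnd.cust with PK cust_i, an attribute cust_n, and the PRCS_C and LAST_MOD_CTIME_I columns described earlier (with 'D' assumed to be the delete code):

```sql
-- Sketch only: fnd.cust, cust_i, cust_n and the 'D' delete code are assumptions.
-- pk_seq_i = 1 keeps the most recent version of each key.
-- The prcs_c filter is applied after ranking, so a key whose latest version
-- is a delete disappears instead of falling back to an older version.
select cust_i, cust_n, last_mod_ctime_i
from (
    select
        cust_i,
        cust_n,
        prcs_c,
        last_mod_ctime_i,
        row_number() over (partition by cust_i order by last_mod_ctime_i desc) as pk_seq_i
    from fnd.cust
) DT_1
where pk_seq_i = 1
  and prcs_c <> 'D';
```

Wrapping this query in a view is one way to give consumers a current-state table without exposing the row history.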
Compaction Example
use dev_sls_fnd;
-- needed for dynamic partitions
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table rgtn_new PARTITION(sls_d)
select
rgtn_i ,
..
sls_d from
(select
rgtn_i ,
..
sls_d,
ROW_NUMBER() over (partition by rgtn_i order by
last_mod_ctime_i desc ) as pk_seq_i
from dev_sls_fnd.rgtn
)DT_1 where pk_seq_i = 1;
Alter table rgtn rename to rgtn_old;