
Hive ETL Patterns

How to handle landing zone/foundation updates

Goals

Achieve performance goals
Restartability
Preserve the batching metaphor
Minimize data hops
Eliminate small HDFS files: bigger is better

Landing Zone Pattern

Overview
Table Structure
Add a column to hash the batch_id
e.g.: cast(batch_id/100 as bigint)
Add a column to provide a sequence via UNIX time
Used to identify the latest instance of a PK
Partition by
Invalid record indicator
Hash of batch_id

Example Table
CREATE TABLE dev_wsl_lzn.ECOM_ORD_LINE_CHRG_E (
orderno string,
ld_seq_i int,
ld_btch_i int,
..
last_mod_ctime_i bigint)
PARTITIONED BY (
nvld_rec_c string,
ld_btch_hash_i int)
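
The hash column keeps the partition count bounded: integer-dividing by 100 folds each run of 100 consecutive batch ids into a single ld_btch_hash_i partition. A quick illustration (batch values are examples, not from the slides):

```sql
-- Each run of 100 batch ids maps to one ld_btch_hash_i partition:
select cast(4199/100 as bigint); -- batch 4199 -> hash 41
select cast(4200/100 as bigint); -- batch 4200 -> hash 42
select cast(4299/100 as bigint); -- batch 4299 -> hash 42
```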

Inserts
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table DEV_WSL_LZN.ECOM_ORD_LINE_CHRG_E
partition(nvld_rec_c, ld_btch_hash_i)
select
DT_1.orderno,
..
, UNIX_TIMESTAMP()
, nvld_rec_c
, cast(ld_btch_i/100 as bigint) as ld_btch_hash_i
from source_table -- or a Flume/Storm feed

Landing Zone Retrieval

select DT_1.* from
(select
..
, row_number() over (partition by pk_cols order by last_mod desc) as pk_seq_i
from landing_zone_table
where
(
(nvld_rec_c = 'Y' and batch >= min_recy_batch)
OR
(nvld_rec_c = 'N' and batch >= max_valid_batch)
)
and batch_hash between cast(min_recy_batch/100 as bigint)
and cast(max_valid_batch/100 as bigint)
) DT_1 where pk_seq_i = 1;
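
Applied to the example table ECOM_ORD_LINE_CHRG_E, the retrieval might look like the sketch below. It assumes orderno plus ld_seq_i form the logical PK and that batch bounds are passed as hiveconf variables; substitute the real key columns and parameter mechanism:

```sql
select DT_1.* from
( select t.*,
    row_number() over (partition by t.orderno, t.ld_seq_i  -- assumed PK columns
                       order by t.last_mod_ctime_i desc) as pk_seq_i
  from dev_wsl_lzn.ECOM_ORD_LINE_CHRG_E t
  where ( (t.nvld_rec_c = 'Y' and t.ld_btch_i >= ${hiveconf:min_recy_batch})
       or (t.nvld_rec_c = 'N' and t.ld_btch_i >= ${hiveconf:max_valid_batch}) )
    -- batch-hash range prunes partitions before the window function runs
    and t.ld_btch_hash_i between cast(${hiveconf:min_recy_batch}/100 as bigint)
                             and cast(${hiveconf:max_valid_batch}/100 as bigint)
) DT_1 where DT_1.pk_seq_i = 1;
```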

Effects of Partitioning
ECOM_ORD_LINE_CHRG_E

Metrics: Before/After (~30% reduction in HDFS reads)

Before:
MapReduce Jobs Launched:
Job 0: Map: 12 Reduce: 4 Cumulative CPU: 94.33 sec HDFS Read: 404072929 HDFS Write: 464 SUCCESS
Job 1: Map: 3 Reduce: 1 Cumulative CPU: 7.0 sec HDFS Read: 1831 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 41 seconds 330 msec
OK
71511
Time taken: 47.807 seconds, Fetched: 1 row(s)

After:
MapReduce Jobs Launched:
Job 0: Map: 10 Reduce: 3 Cumulative CPU: 74.11 sec HDFS Read: 284270313 HDFS Write: 348 SUCCESS
Job 1: Map: 2 Reduce: 1 Cumulative CPU: 5.37 sec HDFS Read: 1337 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 19 seconds 480 msec
OK
71511
Time taken: 47.451 seconds, Fetched: 1 row(s)

Updates
HDFS doesn't support atomic updates/deletes
Accommodate by appending a new row
Timestamp each row to denote modification time
ROW_NUMBER() by PK to get the latest row per PK

Design considerations
You probably have the row lying around: use it!
If not, join back on PK/last mod time to get the row, then append

Landing Zone Updates


set hive.exec.dynamic.partition.mode=nonstrict; -- required for dynamic partition inserts
insert into table DEV_WSL_LZN.ECOM_ORD_LINE_CHRG_E partition(nvld_rec_c,ld_btch_hash_i)
select
DT_1.orderno,
DT_1.ld_seq_i,
....
UNIX_TIMESTAMP(),
DT_2.nvld_rec_c, -- updated column(s)
cast(DT_1.ld_btch_i/100 as bigint) as ld_btch_hash_i
from dev_wsl_lzn.ECOM_ORD_LINE_CHRG_E DT_1
INNER JOIN dev_wsl_etl.duplicate_fees DT_2 -- temp table to ID rows to be changed..
ON DT_1.ld_seq_i = dt_2.ld_seq_i and
DT_1.ld_btch_i = dt_2.ld_btch_i
and dt_1.last_mod_ctime_i = dt_2.last_mod_ctime_i
and dt_1.ld_btch_hash_i between cast(min_btch/100 as bigint) and cast(max_btch/100 as bigint);
-- use btch_hash to prune partitions

Foundation Updates
Type 1 Update
Single-pass insert to foundation
Assumes all rows are in the staging work table
PRCS_C is used to determine I/U/D logic
PRCS_C is appended to all tables to determine whether a row is deleted
Last_mod_ctime_i is appended to all tables to determine sequencing
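
A minimal sketch of the single-pass Type 1 merge under those assumptions. Table and column names (some_fnd_table, pk_col) are illustrative, and the foundation and staging tables are assumed to share the same column layout:

```sql
insert overwrite table dev_wsl_fnd.some_fnd_table
select DT_1.* from
( select t.*,
    row_number() over (partition by t.pk_col
                       order by t.last_mod_ctime_i desc) as pk_seq_i
  from ( select * from dev_wsl_fnd.some_fnd_table    -- current foundation rows
         union all
         select * from dev_wsl_stg.some_fnd_table_w  -- staged I/U/D rows
       ) t
) DT_1
where DT_1.pk_seq_i = 1    -- latest version of each PK wins
  and DT_1.prcs_c <> 'D';  -- drop logically deleted rows
```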

Domain/Dimensional values
If small enough, overwrite the table leveraging set theory

Domain Table Update


Example

insert overwrite table dev_wsl_fnd.ECOM_ORD_SUB_CLAS


select DT_1.* from
(select -- get updates!
t1.ORD_STYP_C,
t2.ORD_TYPE_C,
t1.ORD_CATG_C,
t2.ORD_STYP_DESC_T,
T2.LD_BTCH_I as CRTE_BTCH_I,
t2.DEST_BTCH_I as UPDT_BTCH_I
from dev_wsl_fnd.ECOM_ORD_SUB_CLAS T1
inner join dev_wsl_stg.ECOM_ORD_SUB_CLAS_W T2 on T1.ORD_STYP_C = T2.ORD_STYP_C AND T1.ORD_CATG_C = T2.ORD_CATG_C
UNION ALL
select -- NEW RECORDS
t2.ORD_STYP_C,
t2.ORD_TYPE_C,
t2.ORD_CATG_C,
t2.ORD_STYP_DESC_T,
T2.LD_BTCH_I as CRTE_BTCH_I,
t2.DEST_BTCH_I as UPDT_BTCH_I
from dev_wsl_stg.ECOM_ORD_SUB_CLAS_W T2
LEFT OUTER JOIN dev_wsl_fnd.ECOM_ORD_SUB_CLAS T1 on T2.ORD_STYP_C = T1.ORD_STYP_C AND T2.ORD_CATG_C = T1.ORD_CATG_C
WHERE T1.ORD_CATG_C IS NULL -- staging rows with no foundation match
UNION ALL
select -- rows that didn't change
t1.ORD_STYP_C,
t1.ORD_TYPE_C,
t1.ORD_CATG_C,
t1.ORD_STYP_DESC_T,
T1.CRTE_BTCH_I ,
t1.UPDT_BTCH_I
from dev_wsl_fnd.ECOM_ORD_SUB_CLAS T1
LEFT OUTER JOIN dev_wsl_stg.ECOM_ORD_SUB_CLAS_W T2 on T1.ORD_STYP_C = T2.ORD_STYP_C AND T1.ORD_CATG_C = T2.ORD_CATG_C
WHERE T2.ORD_CATG_C IS NULL -- foundation rows with no staging match
) DT_1;

Type 2 Upserts
Single-pass insert to foundation
Assumes
All rows are in the staging work table
Heavy lifting has already been done to determine the effective start
Logical primary key extended to include effective start
Last_mod_ctime_i (Unix timestamp) appended to all tables to determine sequencing of the primary key, including eff_start
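
Under those assumptions, the dedup key for a Type 2 table includes the effective-start column. A sketch with illustrative names (some_type2_table, pk_col, eff_start_d):

```sql
select DT_1.* from
( select t.*,
    row_number() over (partition by t.pk_col, t.eff_start_d  -- PK extended with effective start
                       order by t.last_mod_ctime_i desc) as pk_seq_i
  from dev_wsl_fnd.some_type2_table t
) DT_1 where DT_1.pk_seq_i = 1;
```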

Foundation Retrieval
Need to determine the latest row when retrieving
Accomplished via ROW_NUMBER() partitioned by PK, ordered by LAST_MOD_CTIME_I desc

Compact non-current rows
Insert into a shadow table where pk_seq_i = 1
Rename

Compaction Example
use dev_sls_fnd;
set hive.exec.dynamic.partition.mode=nonstrict; -- needed for dynamic partitions
insert into table rgtn_new PARTITION(sls_d)
select
rgtn_i,
..
sls_d
from
(select
rgtn_i,
..
sls_d,
ROW_NUMBER() over (partition by rgtn_i order by last_mod_ctime_i desc) as pk_seq_i
from dev_sls_fnd.rgtn
) DT_1 where pk_seq_i = 1;
Alter table rgtn rename to rgtn_old;
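
The example stops after parking the old table; presumably the shadow table is then renamed into place to finish the swap. This second statement is an assumption, not shown on the slide:

```sql
alter table rgtn_new rename to rgtn; -- assumed: promote the shadow table after parking the old one
```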
