
Hive ETL Patterns

How to handle landing zone/foundation updates

Goals

Achieve performance goals
Restartability
Preserve the batching metaphor
Minimize data hops
Eliminate small HDFS files: bigger is better

Landing Zone Pattern

Overview
Table Structure
Add a column to hash the batch_id
e.g.: cast(batch_id/100 as bigint)
Add a column to provide a sequence via UNIX time
Used to identify the latest instance of a PK
Partition by
Invalid record indicator
Hash of batch_id

Example Table
CREATE TABLE dev_wsl_lzn.ECOM_ORD_LINE_CHRG_E (
orderno string,
ld_seq_i int,
ld_btch_i int,
..
last_mod_ctime_i bigint)
PARTITIONED BY (
nvld_rec_c string,
ld_btch_hash_i int)
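
The hash column keeps the partition count bounded: integer-dividing by 100 folds each run of 100 consecutive batch ids into a single ld_btch_hash_i partition. A quick illustration (batch values are examples, not from the slides):

```sql
-- Each run of 100 batch ids maps to one ld_btch_hash_i partition:
select cast(4199/100 as bigint); -- batch 4199 -> hash 41
select cast(4200/100 as bigint); -- batch 4200 -> hash 42
select cast(4299/100 as bigint); -- batch 4299 -> hash 42
```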

Inserts
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table DEV_WSL_LZN.ECOM_ORD_LINE_CHRG_E
partition(nvld_rec_c, ld_btch_hash_i)
select
DT_1.orderno,
..
, UNIX_TIMESTAMP()
, nvld_rec_c
, cast(ld_btch_i/100 as bigint) as ld_btch_hash_i
from source_table -- or a Flume/Storm feed

Landing Zone Retrieval

select DT_1.* from
(select
..
, row_number() over (partition by pk_cols order by last_mod desc) as pk_seq_i
from landing_zone_table
where
(
(nvld_rec_c = 'Y' and batch >= min_recy_batch)
OR
(nvld_rec_c = 'N' and batch >= max_valid_batch)
)
and batch_hash between cast(min_recy_batch/100 as bigint)
and cast(max_valid_batch/100 as bigint)
) DT_1 where pk_seq_i = 1;
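
Applied to the example table ECOM_ORD_LINE_CHRG_E, the retrieval might look like the sketch below. It assumes orderno plus ld_seq_i form the logical PK and that batch bounds are passed as hiveconf variables; substitute the real key columns and parameter mechanism:

```sql
select DT_1.* from
( select t.*,
    row_number() over (partition by t.orderno, t.ld_seq_i  -- assumed PK columns
                       order by t.last_mod_ctime_i desc) as pk_seq_i
  from dev_wsl_lzn.ECOM_ORD_LINE_CHRG_E t
  where ( (t.nvld_rec_c = 'Y' and t.ld_btch_i >= ${hiveconf:min_recy_batch})
       or (t.nvld_rec_c = 'N' and t.ld_btch_i >= ${hiveconf:max_valid_batch}) )
    -- batch-hash range prunes partitions before the window function runs
    and t.ld_btch_hash_i between cast(${hiveconf:min_recy_batch}/100 as bigint)
                             and cast(${hiveconf:max_valid_batch}/100 as bigint)
) DT_1 where DT_1.pk_seq_i = 1;
```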

Effects of Partitioning
ECOM_ORD_LINE_CHRG_E

Metrics: Before/After (~30% reduction in HDFS reads)

Before:
MapReduce Jobs Launched:
Job 0: Map: 12 Reduce: 4 Cumulative CPU: 94.33 sec HDFS Read: 404072929 HDFS Write: 464 SUCCESS
Job 1: Map: 3 Reduce: 1 Cumulative CPU: 7.0 sec HDFS Read: 1831 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 41 seconds 330 msec
OK
71511
Time taken: 47.807 seconds, Fetched: 1 row(s)

After:
MapReduce Jobs Launched:
Job 0: Map: 10 Reduce: 3 Cumulative CPU: 74.11 sec HDFS Read: 284270313 HDFS Write: 348 SUCCESS
Job 1: Map: 2 Reduce: 1 Cumulative CPU: 5.37 sec HDFS Read: 1337 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 19 seconds 480 msec
OK
71511
Time taken: 47.451 seconds, Fetched: 1 row(s)

Updates
HDFS doesn't support atomic updates/deletes
Accommodate by appending a new row
Timestamp each row to denote modification time
ROW_NUMBER() by PK to get the latest row per PK

Design considerations
You probably have the row lying around: use it!
If not, join back on PK/last mod time to get the row, then append

Landing Zone Updates


set hive.exec.dynamic.partition.mode=nonstrict; -- required for dynamic partition inserts
insert into table DEV_WSL_LZN.ECOM_ORD_LINE_CHRG_E partition(nvld_rec_c,ld_btch_hash_i)
select
DT_1.orderno,
DT_1.ld_seq_i,
....
UNIX_TIMESTAMP(),
DT_2.nvld_rec_c, -- updated column(s)
cast(DT_1.ld_btch_i/100 as bigint) as ld_btch_hash_i
from dev_wsl_lzn.ECOM_ORD_LINE_CHRG_E DT_1
INNER JOIN dev_wsl_etl.duplicate_fees DT_2 -- temp table to ID rows to be changed..
ON DT_1.ld_seq_i = dt_2.ld_seq_i and
DT_1.ld_btch_i = dt_2.ld_btch_i
and dt_1.last_mod_ctime_i = dt_2.last_mod_ctime_i
and dt_1.ld_btch_hash_i between cast(min_btch/100 as bigint) and cast(max_btch/100 as bigint);
-- use btch_hash to prune partitions

Foundation Updates
Type 1 Update
Single-pass insert to foundation
Assumes all rows are in the staging work table
PRCS_C is used to determine I/U/D logic
PRCS_C is appended to all tables to determine whether a row is deleted
Last_mod_ctime_i is appended to all tables to determine sequencing
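
A minimal sketch of the single-pass Type 1 merge under those assumptions. Table and column names (some_fnd_table, pk_col) are illustrative, and the foundation and staging tables are assumed to share the same column layout:

```sql
insert overwrite table dev_wsl_fnd.some_fnd_table
select DT_1.* from
( select t.*,
    row_number() over (partition by t.pk_col
                       order by t.last_mod_ctime_i desc) as pk_seq_i
  from ( select * from dev_wsl_fnd.some_fnd_table    -- current foundation rows
         union all
         select * from dev_wsl_stg.some_fnd_table_w  -- staged I/U/D rows
       ) t
) DT_1
where DT_1.pk_seq_i = 1    -- latest version of each PK wins
  and DT_1.prcs_c <> 'D';  -- drop logically deleted rows
```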

Domain/Dimensional values
If small enough, overwrite the table leveraging set theory

Domain Table Update


Example

insert overwrite table dev_wsl_fnd.ECOM_ORD_SUB_CLAS


select DT_1.* from
(select -- get updates!
t1.ORD_STYP_C,
t2.ORD_TYPE_C,
t1.ORD_CATG_C,
t2.ORD_STYP_DESC_T,
T2.LD_BTCH_I as CRTE_BTCH_I,
t2.DEST_BTCH_I as UPDT_BTCH_I
from dev_wsl_fnd.ECOM_ORD_SUB_CLAS T1
inner join dev_wsl_stg.ECOM_ORD_SUB_CLAS_W T2 on T1.ORD_STYP_C = T2.ORD_STYP_C AND T1.ORD_CATG_C = T2.ORD_CATG_C
UNION ALL
select -- NEW RECORDS
t2.ORD_STYP_C,
t2.ORD_TYPE_C,
t2.ORD_CATG_C,
t2.ORD_STYP_DESC_T,
T2.LD_BTCH_I as CRTE_BTCH_I,
t2.DEST_BTCH_I as UPDT_BTCH_I
from dev_wsl_stg.ECOM_ORD_SUB_CLAS_W T2
LEFT OUTER JOIN dev_wsl_fnd.ECOM_ORD_SUB_CLAS T1 on T2.ORD_STYP_C = T1.ORD_STYP_C AND T2.ORD_CATG_C = T1.ORD_CATG_C
WHERE T1.ORD_CATG_C IS NULL -- staging rows with no foundation match
UNION ALL
select -- rows that didn't change
t1.ORD_STYP_C,
t1.ORD_TYPE_C,
t1.ORD_CATG_C,
t1.ORD_STYP_DESC_T,
T1.CRTE_BTCH_I ,
t1.UPDT_BTCH_I
from dev_wsl_fnd.ECOM_ORD_SUB_CLAS T1
LEFT OUTER JOIN dev_wsl_stg.ECOM_ORD_SUB_CLAS_W T2 on T1.ORD_STYP_C = T2.ORD_STYP_C AND T1.ORD_CATG_C = T2.ORD_CATG_C
WHERE T2.ORD_CATG_C IS NULL -- foundation rows with no staging match
) DT_1;

Type 2 Upserts
Single-pass insert to foundation
Assumes
All rows are in the staging work table
Heavy lifting has already been done to determine the effective start
Logical primary key extended to include effective start
Last_mod_ctime_i (Unix timestamp) appended to all tables to determine sequencing of the primary key, including eff_start
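
Under those assumptions, the dedup key for a Type 2 table includes the effective-start column. A sketch with illustrative names (some_type2_table, pk_col, eff_start_d):

```sql
select DT_1.* from
( select t.*,
    row_number() over (partition by t.pk_col, t.eff_start_d  -- PK extended with effective start
                       order by t.last_mod_ctime_i desc) as pk_seq_i
  from dev_wsl_fnd.some_type2_table t
) DT_1 where DT_1.pk_seq_i = 1;
```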

Foundation Retrieval
Need to determine the latest row when retrieving
Accomplished via ROW_NUMBER() partitioned by PK, ordered by LAST_MOD_CTIME_I desc

Compact non-current rows
Insert into a shadow table where pk_seq_i = 1
Rename

Compaction Example
use dev_sls_fnd;
set hive.exec.dynamic.partition.mode=nonstrict; -- needed for dynamic partitions
insert into table rgtn_new PARTITION(sls_d)
select
rgtn_i,
..
sls_d
from
(select
rgtn_i,
..
sls_d,
ROW_NUMBER() over (partition by rgtn_i order by last_mod_ctime_i desc) as pk_seq_i
from dev_sls_fnd.rgtn
) DT_1 where pk_seq_i = 1;
Alter table rgtn rename to rgtn_old;
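
The example stops after parking the old table; presumably the shadow table is then renamed into place to finish the swap. This second statement is an assumption, not shown on the slide:

```sql
alter table rgtn_new rename to rgtn; -- assumed: promote the shadow table after parking the old one
```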
