Real Time DW With SQL - Server PDF

5/30/2015
Architecting Real-Time Data Warehouses with SQL Server
Mark Murphy
President,
Infinity Analytics Inc.
Presenter: Mark Murphy

• NYC-based Independent Consultant
• https://www.linkedin.com/in/markmurphynyc
• http://www.infinityanalytics.com/
Traditional DW: Nightly / Weekly Data Load
[Diagram: Oracle CRM, SQL 2008 Inventory, AdvWorks 2014 and Supplier Shipping Schedules (CSV/XML) are loaded into AdventureWorks DW 2014 by a nightly ETL (SSIS / stored procs) that reloads changes.]
Real-Time DW: Continuous Data Load
[Diagram: Oracle CRM, SQL 2008 Inventory, AdvWorks 2014 and Supplier Shipping Schedules (CSV/XML) are loaded into AdventureWorks DW 2014 by a constant ETL (stored procs / CDC) that merges changes.]
Why?
• Zero data latency
• Top customers *today*
• RT Analytics
• Predictive analytics
• Recommender systems
• RT promotions
• Cool factor to hit “refresh”
New Customer Signups - *Today*
4 Architectural Components
1. XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
3. XXXXXXXXXXXXXXXXXXXXXX (CDC) XXXXXXXXXXXXXXXXXXXXXXX
Real-Time DW: Continuous Data Load
[Diagram: Oracle CRM, SQL 2008 Inventory, AdvWorks 2014 and Supplier Shipping Schedules (CSV/XML) are loaded into AdventureWorks DW 2014 by ETL stored procs that merge changes.]
End Goal
• Dimensional Model
• RT-ETL to transform 3NF / flat data structures in source systems to a dimensional model in the data warehouse.
Caveats to RTDW
• If you don’t need it – don’t do it
• Higher cost in ETL development and testing
• More moving parts – more to go wrong.
• Easier to TRUNCATE and INSERT a full table than to implement real-time update logic.
Let’s Go!
ODS – Operational Data Store
[Diagram: an ODS layer sits between the sources and AdventureWorks DW 2014, with one ODS database per source: Oracle CRM to CRM_ODS, SQL 2008 Inventory to INVENTORY_ODS, AdvWorks 2014 to ADV_WORKS_ODS, and SHIP_SCHED_ODS for the shipping schedules.]
Why ODS?
• Doesn’t touch OLTP production source systems
• Overcomes lack of CDC support of source databases
• Divides & conquers work & complexity
• Improves performance, adds flexibility
ODS Layer: Other Considerations
• One SQL database per source database / subject area (CRM_ODS, INVENTORY_ODS, ADV_WORKS_ODS, SHIP_SCHED_ODS)
• Mirror the source systems exactly (except possibly for indexes)
• Do not build reports off the ODS or give direct access to end-users.
• Even in a traditional, non-RT DW, this is a best practice
1. Operational Data Store (ODS) databases: create 1 per source
database or subject area. PUSH data into the ODS’s as often
as possible.
Now What?
[Diagram: the ODS layer (CRM_ODS, INVENTORY_ODS, ADV_WORKS_ODS, SHIP_SCHED_ODS) is populated from the sources; the step from the ODS layer to AdventureWorks DW 2014 is marked "?".]
Source to Target Mapping
[Diagram: AdventureWorks 2014 (OLTP, 3rd normal form) is mapped through a source-to-target translation view to AdventureWorksDW 2014 (DW, dimensional).]
Source View - dimGeography
rtdemo_src.dimGeography
Source View - factInternetSales
rtdemo_src.factInternetSales
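As a rough illustration of what such a source view might look like, the sketch below flattens three 3NF tables of the AdventureWorks OLTP schema into the shape of DimGeography. It assumes the view lives in the DW database, that the schema rtdemo_src already exists, and that ADV_WORKS_ODS mirrors the AdventureWorks2014 tables. The column list is abbreviated and is not the presenter's actual view.

```sql
-- Sketch only: one possible source view, not the definition shown in the talk.
-- Assumes schema rtdemo_src exists and ADV_WORKS_ODS mirrors AdventureWorks2014.
CREATE VIEW rtdemo_src.dimGeography
AS
SELECT DISTINCT
       a.City,
       sp.StateProvinceCode,
       sp.Name AS StateProvinceName,
       cr.CountryRegionCode,
       cr.Name AS EnglishCountryRegionName,
       a.PostalCode
FROM ADV_WORKS_ODS.Person.Address AS a
JOIN ADV_WORKS_ODS.Person.StateProvince AS sp
  ON sp.StateProvinceID = a.StateProvinceID
JOIN ADV_WORKS_ODS.Person.CountryRegion AS cr
  ON cr.CountryRegionCode = sp.CountryRegionCode;
```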
Re-Init Procedures
MERGE INTO <DESTINATION> AS TGT
USING <SOURCE VIEW> AS SRC
   ON SRC.<business key> = TGT.<business key>
WHEN NOT MATCHED BY TARGET THEN
    INSERT (<columns>) VALUES (<SRC columns>)
WHEN MATCHED AND (
    <some difference>
) THEN
    UPDATE SET <column> = SRC.<column>;
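To make the template concrete, here is a hedged sketch of a re-init for DimGeography. It assumes the source view rtdemo_src.dimGeography exposes the columns used below, that (City, StateProvinceCode, CountryRegionCode, PostalCode) serves as the business key, and that the procedure is called dbo.reinit_dimGeography; all three are illustrative choices. The EXCEPT comparison is one common NULL-safe way to express "<some difference>".

```sql
CREATE PROCEDURE dbo.reinit_dimGeography   -- hypothetical name
AS
BEGIN
    SET NOCOUNT ON;

    MERGE INTO dbo.DimGeography AS TGT
    USING rtdemo_src.dimGeography AS SRC
       ON  SRC.City              = TGT.City
       AND SRC.StateProvinceCode = TGT.StateProvinceCode
       AND SRC.CountryRegionCode = TGT.CountryRegionCode
       AND SRC.PostalCode        = TGT.PostalCode
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (City, StateProvinceCode, StateProvinceName,
                CountryRegionCode, EnglishCountryRegionName, PostalCode)
        VALUES (SRC.City, SRC.StateProvinceCode, SRC.StateProvinceName,
                SRC.CountryRegionCode, SRC.EnglishCountryRegionName, SRC.PostalCode)
    -- Update only when a non-key attribute really differs (NULL-safe compare).
    WHEN MATCHED AND EXISTS (
            SELECT SRC.StateProvinceName, SRC.EnglishCountryRegionName
            EXCEPT
            SELECT TGT.StateProvinceName, TGT.EnglishCountryRegionName)
    THEN
        UPDATE SET StateProvinceName        = SRC.StateProvinceName,
                   EnglishCountryRegionName = SRC.EnglishCountryRegionName;
END
```

Because unchanged rows fail the EXCEPT test, running the procedure again against the same source touches no rows.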
Re-Inits
• Are needed when:
• System is initialized
• Source system changes, need to reprocess
• System troubleshooting (failsafe)
• Theoretically, with just re-inits, you could load your data warehouse in a traditional, non-RT manner.
Problem
• 1:01 run dimGeography reinit
• 1:02 run dimCustomer reinit
• 1:05 run factInternetSales reinit
[Timeline: 1:00 through 1:05]
• What if a customer was added at 1:03, and placed an order at 1:04?
• Missing key! The 1:05 fact load sees the order, but the 1:02 dimCustomer load never saw the customer.
Database Snapshots
• Database snapshots are created instantly, as a shadow copy (for example ADV_WORKS_ODS_SNAP as a snapshot of ADV_WORKS_ODS).
• They do not store data at initial creation. Instead, they store the “before” image as changes are made.
• Can query either the snapshot or the original.
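A minimal sketch of creating and querying such a snapshot follows. The logical file name ADV_WORKS_ODS_Data and the file path are assumptions; the real logical name can be read from sys.database_files in the ODS database, and a database with several data files needs one entry per file.

```sql
-- Assumption: the ODS has a single data file with logical name ADV_WORKS_ODS_Data.
CREATE DATABASE ADV_WORKS_ODS_SNAP
ON ( NAME = ADV_WORKS_ODS_Data,
     FILENAME = 'D:\Snapshots\ADV_WORKS_ODS_SNAP.ss' )   -- hypothetical path
AS SNAPSHOT OF ADV_WORKS_ODS;
GO

-- The snapshot stays frozen while the ODS keeps receiving changes,
-- so these two counts can drift apart over time.
SELECT COUNT(*) AS snapshot_rows FROM ADV_WORKS_ODS_SNAP.Sales.Customer;
SELECT COUNT(*) AS live_rows     FROM ADV_WORKS_ODS.Sales.Customer;
```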
Re-inits in Practice
• So source views/re-inits should be pointed to the ODS snapshots.
• Re-init procedures will re-sync the DW data based on the frozen version of the source.
• Good code/validation exercise as well.
• So, a correct re-init procedure will insert/update ZERO rows on the second run off the same snapshot.
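One way to sketch that validation is to fingerprint the target before and after a second run. The procedure name dbo.reinit_dimGeography is hypothetical. Note the limitation: a checksum only detects changed data, so it will not catch a re-init that rewrites rows with identical values; counting affected rows (for example with an OUTPUT clause on the MERGE) would be stricter.

```sql
DECLARE @before int, @after int;

EXEC dbo.reinit_dimGeography;   -- hypothetical proc; first run may insert/update
SELECT @before = CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM dbo.DimGeography;

EXEC dbo.reinit_dimGeography;   -- second run against the same snapshot
SELECT @after = CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM dbo.DimGeography;

IF ISNULL(@before, 0) <> ISNULL(@after, 0)
    RAISERROR('Re-init is not idempotent: DimGeography changed on the second run.', 16, 1);
```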
ODS Layer and ODS Snapshots
[Diagram: each ODS database has a snapshot (CRM_ODS_SNAP, INVENTORY_ODS_SNAP, ADV_WORKS_ODS_SNAP, SHIP_SCHED_ODS_SNAP); the ETL re-init SPs read from the snapshots and load AdventureWorks DW 2014.]
LSNs (Log Sequence Numbers)
• Binary way of representing the exact transaction order of the database.
• Example: 0x0000002D000000480001
Reading & Writing LSNs

• LSN of a snapshot can be queried: sys.sp_cdc_dbsnapshotLSN
• Our REINIT_ALL procedure will store this LSN
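Storing the LSN might look like the sketch below. The table dbo.etl_hwm and its columns are hypothetical, and the LSN is supplied as a variable (using the example value from the LSN slide) rather than fetched, because the parameter list of sys.sp_cdc_dbsnapshotLSN is not given in the slides.

```sql
-- Hypothetical high-water-mark table: one row per ODS database.
CREATE TABLE dbo.etl_hwm (
    source_db  sysname      NOT NULL PRIMARY KEY,
    last_lsn   binary(10)   NOT NULL,
    updated_at datetime2(3) NOT NULL DEFAULT SYSUTCDATETIME()
);
GO

-- LSN reported for the snapshot; the literal is only an example value.
DECLARE @snapshot_lsn binary(10) = 0x0000002D000000480001;

MERGE INTO dbo.etl_hwm AS TGT
USING (SELECT N'ADV_WORKS_ODS' AS source_db, @snapshot_lsn AS last_lsn) AS SRC
   ON SRC.source_db = TGT.source_db
WHEN NOT MATCHED BY TARGET THEN
    INSERT (source_db, last_lsn) VALUES (SRC.source_db, SRC.last_lsn)
WHEN MATCHED THEN
    UPDATE SET last_lsn = SRC.last_lsn, updated_at = SYSUTCDATETIME();
```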
2. Re-init Processes: build a re-init stored proc for each dim and fact, sourced from ODS snapshots. PULL from the source views into the DW dims/facts. Store the snapshot LSNs as the starting point.
Now What (part 2)?
[Diagram: the ODS layer (CRM_ODS, INVENTORY_ODS, ADV_WORKS_ODS, SHIP_SCHED_ODS) is populated from the sources; the step from the ODS layer to AdventureWorks DW 2014 is marked "?".]
Incremental Algorithm
Look up the last LSN from the high-water mark (HWM) table (old)
Get the new latest LSN from the ODS (new)
Begin Transaction
Process all dimensions incrementally (old,new)
Process all facts incrementally (old,new)
Update the HWM
Commit Transaction
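The steps above could be sketched as a driver procedure like the one below. Everything named dbo.* here is hypothetical: the HWM table dbo.etl_hwm (source_db, last_lsn) and the per-table incremental procedures that accept an (old, new) LSN pair. The newest LSN is read from the ODS's cdc.lsn_time_mapping table, which is where CDC records processed LSNs.

```sql
CREATE PROCEDURE dbo.run_incremental   -- hypothetical driver
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;   -- any error rolls back the whole transaction

    DECLARE @old_lsn binary(10), @new_lsn binary(10);

    -- Last LSN already applied to the DW (old).
    SELECT @old_lsn = last_lsn
    FROM dbo.etl_hwm
    WHERE source_db = N'ADV_WORKS_ODS';

    -- Newest LSN that CDC has processed in the ODS (new).
    SELECT @new_lsn = MAX(start_lsn)
    FROM ADV_WORKS_ODS.cdc.lsn_time_mapping;

    -- Nothing to do if no HWM exists yet or no new changes arrived.
    IF @old_lsn IS NULL OR @new_lsn IS NULL OR @new_lsn <= @old_lsn
        RETURN;

    BEGIN TRANSACTION;
        -- Dimensions first, then facts, all for the same LSN range.
        EXEC dbo.incr_dimGeography      @old_lsn, @new_lsn;
        EXEC dbo.incr_dimCustomer       @old_lsn, @new_lsn;
        EXEC dbo.incr_factInternetSales @old_lsn, @new_lsn;

        UPDATE dbo.etl_hwm
        SET last_lsn = @new_lsn
        WHERE source_db = N'ADV_WORKS_ODS';
    COMMIT TRANSACTION;
END
```

Because the HWM update sits inside the same transaction as the loads, a failure leaves both the DW and the HWM at the previous consistent point.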
CDC
• Use Change Data Capture (CDC) to pull all changes to a table for a given LSN range.
• Requires SQL Server Enterprise Edition
• CDC Tutorial (Pinal Dave): https://www.simple-talk.com/sql/learn-sql-server/introduction-to-change-data-capture-%28cdc%29-in-sql-server-2008/
CDC Primer
• Source table: Sales.SalesOrderHeader
• Change table: cdc.Sales_SalesOrderHeader_CT
• ...and functions...
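For orientation, enabling CDC so that the change table and functions above exist might look like this. It is a sketch that assumes it runs in the ODS database with sufficient permissions and that SQL Server Agent is running, since the capture job depends on it.

```sql
USE ADV_WORKS_ODS;
GO
EXEC sys.sp_cdc_enable_db;
GO
EXEC sys.sp_cdc_enable_table
     @source_schema        = N'Sales',
     @source_name          = N'SalesOrderHeader',
     @role_name            = NULL,   -- no gating role in this sketch
     @supports_net_changes = 1;      -- also generates the net-changes function
GO
-- The default capture instance is <schema>_<table>, hence this change table name.
SELECT TOP (10) * FROM cdc.Sales_SalesOrderHeader_CT;
```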
Reading from CDC
• cdc.fn_cdc_get_all_changes_Sales_SalesOrderHeader(@startlsn, @endlsn, @row_filter_option)
• cdc.fn_cdc_get_net_changes_Sales_SalesOrderHeader(@startlsn, @endlsn, @row_filter_option)
• SELECT FROM cdc.Sales_SalesOrderHeader_CT directly
• Add an OPTION (OPTIMIZE FOR UNKNOWN) to CDC function queries if the performance is poor.
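A hedged example of reading net changes is shown below. It assumes it runs in the ODS database, that the capture instance Sales_SalesOrderHeader was created with net-changes support, and that at least one change has been captured. For brevity it reads the whole valid range; an incremental load would pass its own LSN pair instead.

```sql
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn(N'Sales_SalesOrderHeader');
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

SELECT __$operation,          -- 1 = delete, 2 = insert, 4 = update
       SalesOrderID,
       CustomerID,
       OrderDate,
       SubTotal
FROM cdc.fn_cdc_get_net_changes_Sales_SalesOrderHeader(@from_lsn, @to_lsn, N'all')
OPTION (OPTIMIZE FOR UNKNOWN);
```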
Incremental Procs
• For each source table, read from the CDC functions to see what’s changed
in the requested LSN range.
• Store the results in temp tables.
• Join the tables together, mimicking the structure of the source views.
• Merge into the fact/dim, just like the re-inits.
• Appendix A: joining two tables when they’re updated in different LSN ranges.
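The sketch below shows the shape of one incremental procedure for a deliberately simple case: DimCurrency, fed from the single source table Sales.Currency, so no join step is needed. It assumes CDC is enabled on Sales.Currency in ADV_WORKS_ODS with the default capture instance and net-changes support, and that the procedure is created in the DW database. The procedure name is hypothetical.

```sql
CREATE PROCEDURE dbo.incr_dimCurrency   -- hypothetical name
    @old_lsn binary(10),                -- last LSN already applied
    @new_lsn binary(10)                 -- last LSN to apply now
AS
BEGIN
    SET NOCOUNT ON;

    -- The range is (old, new], so start one LSN past the old high-water mark.
    DECLARE @from_lsn binary(10) = sys.fn_cdc_increment_lsn(@old_lsn);

    -- Stage what changed in the requested LSN range.
    SELECT __$operation AS op, CurrencyCode, Name
    INTO #currency_changes
    FROM ADV_WORKS_ODS.cdc.fn_cdc_get_net_changes_Sales_Currency(@from_lsn, @new_lsn, N'all');

    -- Merge into the dimension, just like the re-init.
    -- Deletes (op = 1) are ignored here; dimension rows are kept.
    MERGE INTO dbo.DimCurrency AS TGT
    USING (SELECT CurrencyCode, Name
           FROM #currency_changes
           WHERE op <> 1) AS SRC
       ON SRC.CurrencyCode = TGT.CurrencyAlternateKey
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CurrencyAlternateKey, CurrencyName)
        VALUES (SRC.CurrencyCode, SRC.Name)
    WHEN MATCHED AND SRC.Name <> TGT.CurrencyName THEN
        UPDATE SET CurrencyName = SRC.Name;
END
```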
ASIDE: Catch-up algorithm
• If the DW is behind by one hour, should it catch up all in one transaction, or break it up into smaller pieces?
• The former is much easier.
• The latter is more difficult, but provides more accurate timestamps, such as for Type-II dimensions.
• Need to loop through the ODS’s cdc.lsn_time_mapping table, processing one time slice at a time (e.g. 5 minutes)
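A possible shape for the sliced catch-up loop is sketched here. The table dbo.etl_hwm and the procedure dbo.run_incremental_range (which would process one LSN range and advance the HWM in its own transaction) are hypothetical names.

```sql
DECLARE @old_lsn    binary(10),
        @target_lsn binary(10),
        @slice_end  datetime,
        @slice_lsn  binary(10);

SELECT @old_lsn = last_lsn
FROM dbo.etl_hwm
WHERE source_db = N'ADV_WORKS_ODS';

SELECT @target_lsn = MAX(start_lsn)
FROM ADV_WORKS_ODS.cdc.lsn_time_mapping;

WHILE @old_lsn < @target_lsn
BEGIN
    -- The slice ends 5 minutes after the first unprocessed transaction.
    SELECT @slice_end = DATEADD(MINUTE, 5, MIN(tran_end_time))
    FROM ADV_WORKS_ODS.cdc.lsn_time_mapping
    WHERE start_lsn > @old_lsn;

    -- Last LSN that committed inside that slice.
    SELECT @slice_lsn = MAX(start_lsn)
    FROM ADV_WORKS_ODS.cdc.lsn_time_mapping
    WHERE start_lsn > @old_lsn
      AND tran_end_time <= @slice_end;

    IF @slice_lsn IS NULL BREAK;   -- defensive: nothing usable in the mapping table

    EXEC dbo.run_incremental_range @old_lsn, @slice_lsn;   -- hypothetical
    SET @old_lsn = @slice_lsn;
END
```

Anchoring each slice to the first unprocessed transaction means idle periods are skipped rather than looped over in empty 5-minute steps.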
3. Incremental Processes: build an incremental stored proc for each dim and fact. Use CDC functions to populate temp tables that mimic source views. PULL data incrementally on demand. Transactionally store new HWM.
Agent Job
Why not SSIS?
• You could, but if the source and target are both SQL Server, it is easier and faster to work directly with T-SQL.
• If you do, use transactions!
• The push process into the ODS’s might be perfect for SSIS.
http://msbitips.blogspot.com/
Why not Change Tracking?

• Only stores key values, not data values
• No way to recreate history, which might be needed.
• Mechanics of re-inits and incrementals to precise LSN values wouldn’t be possible.
Real-Time Aggregates
• Create indexed views for RT aggregates on facts, with/without joins to dimensions
• Are kept up to date automatically
• SQL Server 2014 allows for updateable columnstore indexes as well
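As an illustration, a daily sales aggregate over FactInternetSales could be materialized as follows. The view and index names are made up for this sketch, and the session is assumed to have the SET options that indexed views require (ANSI_NULLS, QUOTED_IDENTIFIER and so on).

```sql
CREATE VIEW dbo.vInternetSalesByDay     -- hypothetical name
WITH SCHEMABINDING
AS
SELECT f.OrderDateKey,
       SUM(f.SalesAmount) AS SalesAmount,
       COUNT_BIG(*)       AS OrderLines   -- required when the view uses GROUP BY
FROM dbo.FactInternetSales AS f
GROUP BY f.OrderDateKey;
GO

-- The unique clustered index is what materializes the view; from then on
-- SQL Server maintains it as part of every write to the fact table.
CREATE UNIQUE CLUSTERED INDEX CIX_vInternetSalesByDay
    ON dbo.vInternetSalesByDay (OrderDateKey);
```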
Indexed Views
• Will degrade performance of INSERTs/UPDATEs to the fact table, so make sure they’re worthwhile to add.
• Be careful when updating referenced dimensions: indexed view foreign key references can deadlock.
• Use sp_getapplock to single-thread write access to the fact table and referenced dimensions.
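A minimal sketch of serializing writers with an application lock follows. The resource name and the 5-second timeout are arbitrary choices, and dbo.run_incremental stands in for whatever writes to the fact table and its dimensions.

```sql
DECLARE @rc int;

BEGIN TRANSACTION;

-- Every writer asks for the same named lock, so only one runs at a time.
EXEC @rc = sys.sp_getapplock
     @Resource    = N'rtdw_fact_writer',   -- arbitrary name shared by all writers
     @LockMode    = N'Exclusive',
     @LockOwner   = N'Transaction',        -- released at COMMIT or ROLLBACK
     @LockTimeout = 5000;                  -- milliseconds

IF @rc < 0
BEGIN
    -- Negative return codes mean the lock was not granted.
    ROLLBACK TRANSACTION;
    RAISERROR('Could not acquire the writer lock.', 16, 1);
    RETURN;
END

EXEC dbo.run_incremental;   -- hypothetical writer

COMMIT TRANSACTION;
```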
OLAP
• SSAS Cubes may also be able to be updated frequently.
• Multiple partitions, sliced by time:

• ROLAP Current Day
• MOLAP
Monitoring/Alerting
• All ETL operations should be logged and timed.
• Logger should commit even if overall transaction is rolled back.
• If incremental job fails 10 times, slow it down/turn it off.
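One way (among several) to keep log rows when the ETL transaction rolls back is to buffer them in a table variable, since table variables are not undone by ROLLBACK. The table dbo.etl_log and the procedure dbo.run_incremental are hypothetical names for this sketch.

```sql
DECLARE @log TABLE (
    logged_at datetime2(3)   NOT NULL,
    step      sysname        NOT NULL,
    message   nvarchar(4000) NULL
);

BEGIN TRY
    BEGIN TRANSACTION;
    INSERT INTO @log VALUES (SYSUTCDATETIME(), N'incremental', N'started');
    EXEC dbo.run_incremental;   -- hypothetical ETL step
    INSERT INTO @log VALUES (SYSUTCDATETIME(), N'incremental', N'finished');
    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    -- Rows already in @log survive the rollback.
    INSERT INTO @log VALUES (SYSUTCDATETIME(), N'incremental', ERROR_MESSAGE());
END CATCH;

-- Persist the buffered rows outside the ETL transaction.
INSERT INTO dbo.etl_log (logged_at, step, message)
SELECT logged_at, step, message
FROM @log;
```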
Statistics
• Won’t ever be up to date for the latest data. On the fact table, SQL 2008/2012 will give a cardinality estimate of 1 if the date range is past the statistics HWM.
• Trace flags 2389, 2390, 4139 in SQL 2012 deal with this “Ascending Key” problem
• SQL 2014 is supposed to be better, but it has an issue where it may guess that there are 9% of overall rows since the statistics HWM.
• See milossql.wordpress.com (“beyond histogram” articles)
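For reference, the trace flags can be switched on instance-wide or per query, roughly as below. This is a sketch: the date key value is just an example, 4139 is left out because it depends on the build, and whether any of the flags helps should be verified against real query plans.

```sql
-- Instance-wide until the next restart; requires sysadmin.
DBCC TRACEON (2389, 2390, -1);

-- Or scoped to a single query against the newest data.
SELECT COUNT(*)
FROM dbo.FactInternetSales
WHERE OrderDateKey >= 20150530      -- example value past the statistics HWM
OPTION (QUERYTRACEON 2389, QUERYTRACEON 2390);
```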
Caching
• Turn off caching on the reporting server to always have live data.
Performance Considerations
• Need to tune RT ETL so that it doesn’t have any inefficiencies. Measure in milliseconds, not seconds.
• Use sp_WhoIsActive to see what’s running
• MUST have Read Committed Snapshot Isolation (RCSI) enabled on the DW database.
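Enabling and checking RCSI might look like this, assuming the DW database is named AdventureWorksDW2014. WITH ROLLBACK IMMEDIATE disconnects open transactions, so it belongs in a maintenance window.

```sql
ALTER DATABASE AdventureWorksDW2014
SET READ_COMMITTED_SNAPSHOT ON
WITH ROLLBACK IMMEDIATE;

-- Verify the setting.
SELECT name, is_read_committed_snapshot_on
FROM sys.databases
WHERE name = N'AdventureWorksDW2014';
```

With RCSI on, report queries read the last committed version of each row instead of blocking behind the ETL's writes.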
Process
• Have RT replication flowing into DEV/QA/Prod
• Keep the incremental process working in all 3!
• For a new RTDW, build in parallel to an existing DW, so you can reconcile the two.
[Diagram: AW replicates into AW_ODS in DEV, QA and Prod; in Prod, the legacy DW and the new RT DW run side by side.]
4. Test early & test often. Make sure RT data is flowing into DEV
and QA. Tune ETL, statistics, aggregates and user queries
against a live system with RCSI enabled.
4 Architectural Components - Review

1. Operational Data Store (ODS) databases: create 1 per source database or
subject area. PUSH data into the ODS’s as often as possible.
2. Re-init Processes: build a re-init stored proc for each dim and fact, sourced
from ODS snapshots. PULL from the source views into the DW dims/facts.
Store the snapshot LSNs as the starting point.
3. Incremental Processes: build an incremental stored proc for each dim and fact.
Use CDC functions to populate temp tables that mimic source views. PULL data
incrementally on demand. Transactionally store new HWM.
4. Test early & test often. Make sure RT data is flowing into DEV and QA. Tune
ETL, statistics, aggregates and user queries against a live system with RCSI
enabled.
More Information
• Code/slides at: http://www.infinityanalytics.com/
• https://www.linkedin.com/in/markmurphynyc
• mark@infinityanalytics.com
Appendix A: Joining two tables with CDC

1. Get net changes from A into #TEMP_A
2. Get net changes from B into #TEMP_B
3. Examine the join key between A and B. Search for any records in #TEMP_A whose matching B record is missing from #TEMP_B. If there are missing records, then:
   a. Look to the future changes in the CDC _CT table for the *before* images of future modifications, but only where the record isn’t inserted in the future, after this LSN range.
   b. Finally, look to the base table for records that are still missing, but only where the record isn’t inserted in the future, after this LSN range.
4. Repeat step 3 in the other direction: records in #TEMP_B whose matching A record is missing from #TEMP_A.
5. Now continue processing, using #TEMP_A and #TEMP_B.
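The following is a sketch of steps 1 to 3 and 5 for one direction only, taking A = Sales.SalesOrderDetail and B = Sales.SalesOrderHeader joined on SalesOrderID. It assumes the code runs in the ODS database, that both tables have default capture instances with net-changes support, and that deleted rows can be ignored; the procedure name and column choice are illustrative. Step 4 (the reverse direction) would follow the same pattern.

```sql
CREATE PROCEDURE dbo.join_detail_to_header   -- hypothetical name
    @from_lsn binary(10),                    -- first LSN of the range
    @new_lsn  binary(10)                     -- last LSN of the range
AS
BEGIN
    SET NOCOUNT ON;

    -- Steps 1 and 2: net changes for A (detail) and B (header).
    SELECT SalesOrderID, SalesOrderDetailID, OrderQty
    INTO #TEMP_A
    FROM cdc.fn_cdc_get_net_changes_Sales_SalesOrderDetail(@from_lsn, @new_lsn, N'all')
    WHERE __$operation <> 1;

    SELECT SalesOrderID, CustomerID, OrderDate
    INTO #TEMP_B
    FROM cdc.fn_cdc_get_net_changes_Sales_SalesOrderHeader(@from_lsn, @new_lsn, N'all')
    WHERE __$operation <> 1;

    -- Step 3a: for headers missing from #TEMP_B, take the earliest *before*
    -- image recorded after the range (1 = delete, 3 = update before-image),
    -- skipping keys that are inserted after the range.
    INSERT INTO #TEMP_B (SalesOrderID, CustomerID, OrderDate)
    SELECT x.SalesOrderID, x.CustomerID, x.OrderDate
    FROM (
        SELECT ct.SalesOrderID, ct.CustomerID, ct.OrderDate,
               ROW_NUMBER() OVER (PARTITION BY ct.SalesOrderID
                                  ORDER BY ct.__$start_lsn, ct.__$seqval) AS rn
        FROM cdc.Sales_SalesOrderHeader_CT AS ct
        WHERE ct.__$start_lsn > @new_lsn
          AND ct.__$operation IN (1, 3)
          AND EXISTS     (SELECT 1 FROM #TEMP_A AS a WHERE a.SalesOrderID = ct.SalesOrderID)
          AND NOT EXISTS (SELECT 1 FROM #TEMP_B AS b WHERE b.SalesOrderID = ct.SalesOrderID)
    ) AS x
    WHERE x.rn = 1
      AND NOT EXISTS (SELECT 1 FROM cdc.Sales_SalesOrderHeader_CT AS ins
                      WHERE ins.SalesOrderID = x.SalesOrderID
                        AND ins.__$operation = 2
                        AND ins.__$start_lsn > @new_lsn);

    -- Step 3b: headers still missing are read from the base table,
    -- again skipping keys that are inserted after the range.
    INSERT INTO #TEMP_B (SalesOrderID, CustomerID, OrderDate)
    SELECT h.SalesOrderID, h.CustomerID, h.OrderDate
    FROM Sales.SalesOrderHeader AS h
    WHERE EXISTS     (SELECT 1 FROM #TEMP_A AS a WHERE a.SalesOrderID = h.SalesOrderID)
      AND NOT EXISTS (SELECT 1 FROM #TEMP_B AS b WHERE b.SalesOrderID = h.SalesOrderID)
      AND NOT EXISTS (SELECT 1 FROM cdc.Sales_SalesOrderHeader_CT AS ins
                      WHERE ins.SalesOrderID = h.SalesOrderID
                        AND ins.__$operation = 2
                        AND ins.__$start_lsn > @new_lsn);

    -- Step 5: continue processing with the completed pair of temp tables.
    SELECT a.SalesOrderID, a.SalesOrderDetailID, a.OrderQty, b.CustomerID, b.OrderDate
    FROM #TEMP_A AS a
    JOIN #TEMP_B AS b ON b.SalesOrderID = a.SalesOrderID;
END
```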