5/30/2015

Architecting Real-Time Data Warehouses with SQL Server
Mark Murphy
President,
Infinity Analytics Inc.

Presenter: Mark Murphy


• NYC-based Independent Consultant

• https://www.linkedin.com/in/markmurphynyc

• http://www.infinityanalytics.com/


Traditional DW: Nightly / Weekly Data Load

[Diagram: the source systems (Oracle CRM, SQL 2008 Inventory, AdvWorks 2014, and supplier shipping schedules in CSV/XML) feed AdventureWorks DW 2014 through a nightly ETL (SSIS/stored procs) that reloads changes.]

Real-Time DW: Continuous Data Load

[Diagram: the source systems (Oracle CRM, SQL 2008 Inventory, AdvWorks 2014, and supplier shipping schedules in CSV/XML) feed AdventureWorks DW 2014 through a constant ETL (stored procs/CDC) that merges changes.]


Why?
• Zero data latency
• Top customers *today*

• RT Analytics
• Predictive analytics
• Recommender systems
• RT promotions

• Cool factor to hit “refresh”

New Customer Signups - *Today*


4 Architectural Components
1. XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
2. XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
3. XXXXXXXXXXXXXXXXXXXXXX (CDC) XXXXXXXXXXXXXXXXXXXXXXX
4. XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX


Real-Time DW: Continuous Data Load

[Diagram: the source systems (Oracle CRM, SQL 2008 Inventory, AdvWorks 2014, and supplier shipping schedules in CSV/XML) feed AdventureWorks DW 2014 through ETL stored procs that merge changes.]

End Goal
• Dimensional Model

• RT-ETL to transform 3NF / flat data structures in source systems to a dimensional model in the data warehouse.


Caveats to RTDW
• If you don’t need it – don’t do it
• Higher cost in ETL development and testing
• More moving parts – more to go wrong.

• Easier to TRUNCATE and INSERT a full table than to implement real-time update logic.

Let’s Go!


ODS – Operational Data Store

[Diagram: an ODS layer sits between the sources and AdventureWorks DW 2014, with one ODS database per source: Oracle CRM → CRM_ODS, SQL 2008 Inventory → INVENTORY_ODS, AdvWorks 2014 → ADV_WORKS_ODS, plus SHIP_SCHED_ODS.]

Why ODS?
• Doesn’t touch OLTP production source systems

• Overcomes lack of CDC support of source databases

• Divides & Conquers work & complexity

• Improves performance, adds flexibility

ODS Layer: Other Considerations
• One SQL Database per source database / subject area (CRM_ODS, INVENTORY_ODS, ADV_WORKS_ODS, SHIP_SCHED_ODS)

• Mirror the source systems exactly (except possibly for indexes)

• Do not build reports off the ODS or give direct access to end-users.

• Even in a traditional, non-RT DW, this is a best practice

4 Architectural Components
1. Operational Data Store (ODS) databases: create 1 per source
database or subject area. PUSH data into the ODS’s as often
as possible.
2. XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

3. XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

4. XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX


Now What?

[Diagram: the ODS layer (CRM_ODS, INVENTORY_ODS, ADV_WORKS_ODS, SHIP_SCHED_ODS) is populated from the sources; the path from the ODS layer into AdventureWorks DW 2014 is marked “?”.]

Source to Target Mapping

Source: AdventureWorks 2014 (OLTP – 3rd Normal Form)
Target: AdventureWorksDW 2014 (DW – Dimensional)
Translation View: maps source to target


Source View - dimGeography

rtdemo_src.dimGeography

Source View - factInternetSales

rtdemo_src.factInternetSales
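
The slides showed these views as screenshots, so their actual definitions are not recoverable here. As a rough illustration only, a source view that flattens the 3NF address tables into the shape of a geography dimension might look like the sketch below. It assumes the view is created in the DW database, that the rtdemo_src schema already exists there, and that the ODS mirrors the AdventureWorks 2014 Person tables; the column list is a guess at what the dimension needs, not the author's view.

```sql
-- Sketch only: a translation view from 3NF source tables to a dimensional shape.
-- Assumes ADV_WORKS_ODS mirrors AdventureWorks 2014 and rtdemo_src already exists.
CREATE VIEW rtdemo_src.dimGeography
AS
SELECT DISTINCT
       a.City,
       sp.StateProvinceCode,
       sp.Name              AS StateProvinceName,
       cr.CountryRegionCode,
       cr.Name              AS EnglishCountryRegionName,
       a.PostalCode
FROM ADV_WORKS_ODS.Person.Address       AS a
JOIN ADV_WORKS_ODS.Person.StateProvince AS sp
  ON sp.StateProvinceID = a.StateProvinceID
JOIN ADV_WORKS_ODS.Person.CountryRegion AS cr
  ON cr.CountryRegionCode = sp.CountryRegionCode;
```

Keeping the joins in a view means the re-init procedure only has to MERGE from one rowset whose columns already line up with the target.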


Re-Init Procedures
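MERGE INTO <DESTINATION> AS TGT
USING <SOURCE VIEW> AS SRC
    ON SRC.<Business Key> = TGT.<Business Key>

-- Key is in the source view but not yet in the destination
WHEN NOT MATCHED BY TARGET THEN
    INSERT (<columns>) VALUES (<source columns>)

-- Key is in both, but at least one attribute differs
WHEN MATCHED AND (
    <some difference>
)
THEN UPDATE SET <column> = SRC.<column>;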

Re-Inits
• Are needed when:
• System is initialized
• Source system changes, need to reprocess
• System troubleshooting (failsafe)

• Theoretically, with just re-inits, you could load your data warehouse in a traditional, non-RT manner.


Problem
• 1:01 run dimGeography reinit
• 1:02 run dimCustomer reinit
• 1:05 run factInternetSales reinit

• What if a customer was added at 1:03, and placed an order at 1:04?

• Missing Key! The dimCustomer reinit at 1:02 ran before the customer existed, but the factInternetSales reinit at 1:05 picks up the order.

Database Snapshots
• Database snapshots are created instantly, as a shadow copy (e.g. ADV_WORKS_ODS_SNAP for ADV_WORKS_ODS).

• They do not store data at initial creation. Instead, they store the “before” image as changes are made.
• Can query either the snapshot or the original.
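
A minimal sketch of creating and querying such a snapshot follows. The logical file name (ADV_WORKS_ODS_Data) and the snapshot file path are assumptions; a real ODS needs one entry per data file of the source database.

```sql
-- Sketch only: logical file name and path are assumed, not taken from the slides.
CREATE DATABASE ADV_WORKS_ODS_SNAP
ON ( NAME = ADV_WORKS_ODS_Data,                      -- logical name of the source data file
     FILENAME = N'D:\Snapshots\ADV_WORKS_ODS_SNAP.ss' )
AS SNAPSHOT OF ADV_WORKS_ODS;
GO

-- The snapshot is queried like any other database; it shows the data
-- as of the moment the snapshot was created.
SELECT COUNT(*) AS OrderCount
FROM ADV_WORKS_ODS_SNAP.Sales.SalesOrderHeader;
GO

-- Drop the snapshot once the re-init has finished.
DROP DATABASE ADV_WORKS_ODS_SNAP;
```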


Re-inits in Practice
• So source views/re-inits should be pointed to the ODS Snapshots.

• Re-init procedures will re-synch the DW data based on the frozen version of the source.

• Good code/validation exercise as well.

• So, a correct reinit procedure will insert/update ZERO rows on the second run off the same snapshot


ODS Snapshots

[Diagram: each ODS database in the ODS layer (CRM_ODS, INVENTORY_ODS, ADV_WORKS_ODS, SHIP_SCHED_ODS) has a snapshot (CRM_ODS_SNAP, INVENTORY_ODS_SNAP, ADV_WORKS_ODS_SNAP, SHIP_SCHED_ODS_SNAP); the ETL re-init SP’s read from the snapshots and load Adv Works DW 2014.]

LSNs
• A log sequence number (LSN) is a binary value that identifies a position in the database’s transaction log, so it represents the exact transaction order of the database.

• Example: 0x0000002D000000480001


Reading & Writing LSNs


• LSN of a snapshot can be queried: sys.sp_cdc_dbsnapshotLSN

• Our REINIT_ALL procedure will store this LSN
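
One way to hold the stored LSN is a small high-water-mark (HWM) table in the DW, one row per ODS. The sketch below is an assumption about how that might look: the table name dbo.ETL_HWM and its columns are hypothetical, and the LSN value is the example literal from the previous slide, standing in for the value read from the snapshot.

```sql
-- Hypothetical HWM table: one row per ODS database.
CREATE TABLE dbo.ETL_HWM (
    ods_name   sysname      NOT NULL PRIMARY KEY,
    last_lsn   binary(10)   NOT NULL,              -- LSNs are 10-byte binary values
    updated_at datetime2(3) NOT NULL DEFAULT SYSUTCDATETIME()
);
GO

-- Placeholder value; in practice this is the LSN obtained for the snapshot.
DECLARE @snapshot_lsn binary(10) = 0x0000002D000000480001;

-- Upsert the starting point for the incremental process.
MERGE dbo.ETL_HWM AS tgt
USING (SELECT N'ADV_WORKS_ODS' AS ods_name, @snapshot_lsn AS last_lsn) AS src
   ON tgt.ods_name = src.ods_name
WHEN MATCHED THEN
    UPDATE SET last_lsn = src.last_lsn, updated_at = SYSUTCDATETIME()
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ods_name, last_lsn) VALUES (src.ods_name, src.last_lsn);
```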

4 Architectural Components
1. Operational Data Store (ODS) databases: create 1 per source database or
subject area. PUSH data into the ODS’s as often as possible.

2. Re-init Processes: build a re-init stored proc for each dim
and fact, sourced from ODS snapshots. PULL from the
source views into the DW dims/facts. Store the snapshot
LSNs as the starting point.
3. XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

4. XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX


Now What (part 2)?

[Diagram: the ODS layer (CRM_ODS, INVENTORY_ODS, ADV_WORKS_ODS, SHIP_SCHED_ODS) is populated from the sources; the path from the ODS layer into AdventureWorks DW 2014 is again marked “?”.]

Incremental Algorithm
Look up the last LSN from the high-water mark (HWM) table (old)

Get the new latest LSN from the ODS (new)

Begin Transaction
Process all dimensions incrementally (old,new)
Process all facts incrementally (old,new)
Update the HWM
Commit Transaction
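
The algorithm above can be sketched in T-SQL as a driver script run in the DW database. All object names here are hypothetical (dbo.ETL_HWM with columns ods_name and last_lsn, dbo.INCR_ALL_DIMS, dbo.INCR_ALL_FACTS); the only system object used is the ODS’s cdc.lsn_time_mapping table, whose highest start_lsn is the latest LSN that CDC has captured.

```sql
-- Sketch only: dbo.ETL_HWM, dbo.INCR_ALL_DIMS and dbo.INCR_ALL_FACTS are hypothetical.
SET XACT_ABORT ON;   -- a runtime error dooms the whole transaction

DECLARE @old_lsn binary(10), @new_lsn binary(10);

-- Last LSN already processed (old)
SELECT @old_lsn = last_lsn
FROM dbo.ETL_HWM
WHERE ods_name = N'ADV_WORKS_ODS';

-- Latest LSN captured by CDC in the ODS (new)
SELECT @new_lsn = MAX(start_lsn)
FROM ADV_WORKS_ODS.cdc.lsn_time_mapping;

-- Nothing to do if there is no HWM yet or no new changes
IF @old_lsn IS NULL OR @new_lsn IS NULL OR @new_lsn <= @old_lsn
    RETURN;

BEGIN TRY
    BEGIN TRANSACTION;

    -- The procs are assumed to treat @old_lsn as exclusive and @new_lsn as inclusive
    EXEC dbo.INCR_ALL_DIMS  @old_lsn = @old_lsn, @new_lsn = @new_lsn;
    EXEC dbo.INCR_ALL_FACTS @old_lsn = @old_lsn, @new_lsn = @new_lsn;

    UPDATE dbo.ETL_HWM
    SET last_lsn = @new_lsn
    WHERE ods_name = N'ADV_WORKS_ODS';

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    THROW;   -- re-raise so the Agent job step reports the failure
END CATCH;
```

Because the HWM update commits in the same transaction as the dimension and fact changes, a failed run leaves both the data and the HWM at the old LSN, and the next run simply retries the same range.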


CDC
• Use Change Data Capture (CDC) to pull all changes to a table for a given LSN range.

• Requires SQL Server Enterprise Edition (from SQL Server 2016 SP1 onward, Standard Edition supports CDC as well)

• CDC Tutorial (Pinal Dave): https://www.simple-talk.com/sql/learn-sql-server/introduction-to-change-data-capture-%28cdc%29-in-sql-server-2008/

CDC Primer
• Source table: Sales.SalesOrderHeader

• Change table created by CDC: cdc.Sales_SalesOrderHeader_CT

• …and functions to query the changes
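
For reference, the change table and functions come into existence when CDC is enabled on the database and then on the table. A minimal sketch, assuming it is run in the ODS database and that no gating role is wanted:

```sql
-- Run in the ODS database that receives the replicated source tables.
USE ADV_WORKS_ODS;
GO

-- Enable CDC for the database (creates the cdc schema and metadata tables).
EXEC sys.sp_cdc_enable_db;
GO

-- Enable CDC for one table. The default capture instance name is
-- <schema>_<table>, i.e. Sales_SalesOrderHeader.
EXEC sys.sp_cdc_enable_table
     @source_schema        = N'Sales',
     @source_name          = N'SalesOrderHeader',
     @role_name            = NULL,   -- no gating role (an assumption for this sketch)
     @supports_net_changes = 1;      -- needs a primary key or unique index on the table
GO
```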


Reading from CDC


• cdc.fn_cdc_get_all_changes_Sales_SalesOrderHeader(@startlsn, @endlsn, N'all')

• cdc.fn_cdc_get_net_changes_Sales_SalesOrderHeader(@startlsn, @endlsn, N'all')

• SELECT FROM cdc.Sales_SalesOrderHeader_CT directly

• Add an OPTION (OPTIMIZE FOR UNKNOWN) to CDC function queries if the performance is poor.
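
A short example of reading every change for the full available range, assuming CDC is enabled on Sales.SalesOrderHeader with the default capture instance name and the query runs in the ODS database:

```sql
USE ADV_WORKS_ODS;

-- Full range currently available for this capture instance
DECLARE @startlsn binary(10) = sys.fn_cdc_get_min_lsn(N'Sales_SalesOrderHeader');
DECLARE @endlsn   binary(10) = sys.fn_cdc_get_max_lsn();

-- __$operation: 1 = delete, 2 = insert, 4 = update (after image)
SELECT __$start_lsn, __$operation,
       SalesOrderID, CustomerID, OrderDate, TotalDue
FROM cdc.fn_cdc_get_all_changes_Sales_SalesOrderHeader(@startlsn, @endlsn, N'all')
ORDER BY __$start_lsn, __$seqval
OPTION (OPTIMIZE FOR UNKNOWN);
```

Both LSN bounds are inclusive, which is why an incremental run normally starts one LSN past the stored high-water mark.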

Incremental Procs
• For each source table, read from the CDC functions to see what’s changed
in the requested LSN range.

• Store the results in temp tables.

• Join the tables together, mimicking the structure of the source views.

• Merge into the fact/dim, just like the re-inits.

• Appendix A: joining two tables when they’re updated in different LSN ranges.
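
The steps above can be sketched for the simplest case, a target fed by a single source table. The procedure and target names (dbo.INCR_factOrderHeader, dbo.factOrderHeader) and the column list are hypothetical, and the sketch assumes the procedure lives in the DW database and that CDC was enabled with net-changes support. A multi-table dimension or fact would load one temp table per source table and join them before the MERGE.

```sql
-- Sketch only: procedure, target table and column list are hypothetical.
CREATE PROCEDURE dbo.INCR_factOrderHeader
    @old_lsn binary(10),   -- last LSN already processed (the HWM)
    @new_lsn binary(10)    -- latest LSN to process
AS
BEGIN
    SET NOCOUNT ON;

    -- CDC ranges are inclusive, so start just after the old HWM
    DECLARE @from_lsn binary(10) = sys.fn_cdc_increment_lsn(@old_lsn);

    -- 1. Net changes for the range, stored in a temp table
    --    (__$operation: 1 = delete, 2 = insert, 4 = update)
    SELECT __$operation AS op, SalesOrderID, CustomerID, OrderDate, TotalDue
    INTO #soh
    FROM ADV_WORKS_ODS.cdc.fn_cdc_get_net_changes_Sales_SalesOrderHeader(
             @from_lsn, @new_lsn, N'all')
    OPTION (OPTIMIZE FOR UNKNOWN);

    -- 2. Merge into the target, just like the re-init
    MERGE INTO dbo.factOrderHeader AS TGT
    USING #soh AS SRC
       ON SRC.SalesOrderID = TGT.SalesOrderID
    WHEN MATCHED AND SRC.op = 1 THEN
        DELETE
    WHEN MATCHED THEN
        UPDATE SET TGT.CustomerID = SRC.CustomerID,
                   TGT.OrderDate  = SRC.OrderDate,
                   TGT.TotalDue   = SRC.TotalDue
    WHEN NOT MATCHED BY TARGET AND SRC.op <> 1 THEN
        INSERT (SalesOrderID, CustomerID, OrderDate, TotalDue)
        VALUES (SRC.SalesOrderID, SRC.CustomerID, SRC.OrderDate, SRC.TotalDue);
END;
```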


ASIDE: Catch-up algorithm


• If the DW is behind by one hour, should it catch up all in one transaction, or break it up into smaller pieces?

• Former is much easier.

• Latter is more difficult, but provides more accurate timestamps, such as for Type-II dimensions.
• Need to loop through the ODS’s cdc.lsn_time_mapping table, processing one time slice at a time (e.g. 5 minutes)
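
A rough sketch of the sliced catch-up, run in the DW database. It buckets the rows of cdc.lsn_time_mapping into 5-minute windows by commit time and processes one window per call. The names dbo.ETL_HWM and dbo.INCR_ALL are hypothetical; dbo.INCR_ALL is assumed to process the range (old, new] in its own transaction and to update the HWM.

```sql
-- Sketch only: dbo.ETL_HWM and dbo.INCR_ALL are hypothetical names.
DECLARE @old_lsn binary(10), @slice_end binary(10);

SELECT @old_lsn = last_lsn
FROM dbo.ETL_HWM
WHERE ods_name = N'ADV_WORKS_ODS';

-- One row per 5-minute window that contains changes past the HWM:
-- the highest LSN committed inside that window.
SELECT MAX(start_lsn) AS slice_end_lsn
INTO #slices
FROM ADV_WORKS_ODS.cdc.lsn_time_mapping
WHERE start_lsn > @old_lsn
GROUP BY DATEDIFF(MINUTE, '20000101', tran_end_time) / 5;

WHILE 1 = 1
BEGIN
    -- Next slice boundary after the current position (NULL when finished)
    SET @slice_end = (SELECT MIN(slice_end_lsn)
                      FROM #slices
                      WHERE slice_end_lsn > @old_lsn);
    IF @slice_end IS NULL BREAK;

    EXEC dbo.INCR_ALL @old_lsn = @old_lsn, @new_lsn = @slice_end;

    SET @old_lsn = @slice_end;
END;
```

Windows with no committed changes produce no row in #slices, so the loop never spins on an empty range.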

4 Architectural Components
1. Operational Data Store (ODS) databases: create 1 per source database or subject
area. PUSH data into the ODS’s as often as possible.

2. Re-init Processes: build a re-init stored proc for each dim and fact, sourced from
ODS snapshots. PULL from the source views into the DW dims/facts. Store the
snapshot LSNs as the starting point.

3. Incremental Processes: build an incremental stored proc for
each dim and fact. Use CDC functions to populate temp tables
that mimic source views. PULL data incrementally on demand.
Transactionally store new HWM.
4. XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX


Agent Job

Why not SSIS?


• You could, but if the source and target are both SQL Server, it is easier and faster to work directly with T-SQL.
• If you do, use transactions!

• The push process into the ODS’s might be perfect for SSIS.

(Image source: http://msbitips.blogspot.com/)


Why not Change Tracking?


• Only stores key values, not data values

• No way to recreate history, which might be needed.

• Mechanics of re-inits and incrementals to precise LSN values wouldn’t be possible.

Real-Time Aggregates

• Create Indexed views for RT aggregates on facts, with/without joins to dimensions
• Are kept up to date automatically

• SQL Server 2014 allows for updateable (clustered) columnstore indexes as well
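
As an illustration, an indexed view that keeps daily sales totals current might look like the sketch below. It assumes the AdventureWorksDW 2014 sample schema (dbo.FactInternetSales with non-nullable OrderDateKey and SalesAmount columns); the view and index names are hypothetical.

```sql
-- Sketch only: view and index names are hypothetical.
-- Indexed views require SCHEMABINDING, two-part table names, and COUNT_BIG(*)
-- whenever GROUP BY is used.
CREATE VIEW dbo.vSalesByOrderDate
WITH SCHEMABINDING
AS
SELECT f.OrderDateKey,
       SUM(f.SalesAmount) AS TotalSalesAmount,
       COUNT_BIG(*)       AS RowCnt
FROM dbo.FactInternetSales AS f
GROUP BY f.OrderDateKey;
GO

-- The unique clustered index is what materializes the view.
CREATE UNIQUE CLUSTERED INDEX IX_vSalesByOrderDate
    ON dbo.vSalesByOrderDate (OrderDateKey);
GO

-- NOEXPAND makes the optimizer read the materialized rows directly.
SELECT OrderDateKey, TotalSalesAmount
FROM dbo.vSalesByOrderDate WITH (NOEXPAND)
ORDER BY OrderDateKey DESC;
```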


Indexed Views
• Will degrade performance of INSERTs/UPDATEs to the fact table, so
make sure they’re worthwhile to add.

• Be careful when updating referenced dimensions: indexed view and foreign key references can deadlock.
• Use the sp_getapplock procedure to single-thread write access to the fact table and referenced dimensions.
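
A minimal sketch of that pattern: every writer takes the same application lock at the start of its transaction, so only one writer at a time touches the fact table and its dimensions. The resource name, timeout, and error number are arbitrary choices for the example.

```sql
BEGIN TRANSACTION;

DECLARE @rc int;

-- All writers must request the same resource name for this to serialize them.
EXEC @rc = sp_getapplock
     @Resource    = N'RTDW_FactInternetSales_Write',  -- hypothetical lock name
     @LockMode    = N'Exclusive',
     @LockOwner   = N'Transaction',                   -- released at COMMIT/ROLLBACK
     @LockTimeout = 10000;                            -- wait up to 10 seconds

-- Return values below zero mean the lock was not granted.
IF @rc < 0
BEGIN
    ROLLBACK TRANSACTION;
    THROW 50001, N'Could not acquire the fact-table write lock.', 1;
END;

-- ... MERGE into the dimensions, then into the fact table ...

COMMIT TRANSACTION;   -- also releases the application lock
```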

OLAP
• SSAS Cubes may also be able to be updated frequently.

• Multiple partitions, sliced by time:


• ROLAP Current Day

• MOLAP


Monitoring/Alerting
• All ETL operations should be logged
and timed.
• Logger should commit even if overall
transaction is rolled back.

• If incremental job fails 10 times, slow it down/turn it off.
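
One common way to make log entries survive a rollback is to buffer them in a table variable, since table variables are not affected by transaction rollback, and write them to the log table afterwards. This is a sketch of that technique rather than the presenter's logger; dbo.ETL_LOG is a hypothetical table and the RAISERROR line only simulates a failure.

```sql
-- Hypothetical log table.
CREATE TABLE dbo.ETL_LOG (
    log_id    int IDENTITY(1,1) PRIMARY KEY,
    logged_at datetime2(3)   NOT NULL,
    message   nvarchar(4000) NOT NULL
);
GO

-- Table variables keep their rows when a transaction rolls back.
DECLARE @log TABLE (logged_at datetime2(3) NOT NULL, message nvarchar(4000) NOT NULL);

BEGIN TRY
    BEGIN TRANSACTION;

    INSERT INTO @log VALUES (SYSUTCDATETIME(), N'Incremental load started');

    -- ... ETL work would go here; simulate a failure for the example ...
    RAISERROR(N'Simulated ETL failure', 16, 1);

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    INSERT INTO @log VALUES (SYSUTCDATETIME(), N'Failed: ' + ERROR_MESSAGE());
END CATCH;

-- Persist the buffered entries outside the rolled-back transaction.
INSERT INTO dbo.ETL_LOG (logged_at, message)
SELECT logged_at, message
FROM @log;
```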

Statistics
• Statistics won’t ever be up to date for the latest data. In SQL 2008/2012, a query on the fact table gets a cardinality estimate of 1 if the date range is past the statistics HWM.

• Trace flags 2389, 2390, 4139 in SQL 2012 to deal with this “Ascending Key” problem

• SQL 2014 is supposed to be better, but it has an issue where it may guess that there are 9% of overall rows since the statistics HWM.
• =>milossql.wordpress.com (“beyond histogram” articles)


Caching
• Turn off caching on the reporting server to always have live data.

Performance Considerations
• Need to tune RT ETL so that it doesn’t have any inefficiencies. Measure in milliseconds, not seconds.
• Use sp_WhoIsActive to see what’s running

• MUST have Read Committed Snapshot Isolation (RCSI) enabled on the DW database.
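
Enabling RCSI is a single database option. A minimal sketch, assuming the DW database is named AdventureWorksDW2014; note that ROLLBACK IMMEDIATE disconnects other sessions, so run it in a maintenance window.

```sql
-- With RCSI on, readers see the last committed row version instead of
-- blocking behind the ETL's write locks.
ALTER DATABASE AdventureWorksDW2014
SET READ_COMMITTED_SNAPSHOT ON
WITH ROLLBACK IMMEDIATE;

-- Verify the setting.
SELECT name, is_read_committed_snapshot_on
FROM sys.databases
WHERE name = N'AdventureWorksDW2014';
```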


Process
• Have RT replication flowing into DEV/QA/Prod (AW → AW_ODS DEV, AW_ODS QA, AW_ODS Prod)

• Keep the incremental process working in all 3!

• For a new RTDW, build in parallel to an existing DW, so you can reconcile the two: DW Prod (Legacy) and DW Prod (new RT).

4 Architectural Components
1. Operational Data Store (ODS) databases: create 1 per source database or subject area.
PUSH data into the ODS’s as often as possible.

2. Re-init Processes: build a re-init stored proc for each dim and fact, sourced from ODS
snapshots. PULL from the source views into the DW dims/facts. Store the snapshot LSNs as
the starting point.

3. Incremental Processes: build an incremental stored proc for each dim and fact. Use CDC
functions to populate temp tables that mimic source views. PULL data incrementally on
demand. Transactionally store new HWM.

4. Test early & test often. Make sure RT data is flowing into DEV
and QA. Tune ETL, statistics, aggregates and user queries
against a live system with RCSI enabled.


4 Architectural Components - Review


1. Operational Data Store (ODS) databases: create 1 per source database or
subject area. PUSH data into the ODS’s as often as possible.
2. Re-init Processes: build a re-init stored proc for each dim and fact, sourced
from ODS snapshots. PULL from the source views into the DW dims/facts.
Store the snapshot LSNs as the starting point.
3. Incremental Processes: build an incremental stored proc for each dim and fact.
Use CDC functions to populate temp tables that mimic source views. PULL data
incrementally on demand. Transactionally store new HWM.
4. Test early & test often. Make sure RT data is flowing into DEV and QA. Tune
ETL, statistics, aggregates and user queries against a live system with RCSI
enabled.

More Information
• Code/slides at: http://www.infinityanalytics.com/

• https://www.linkedin.com/in/markmurphynyc

• mark@infinityanalytics.com


Appendix A: Joining two tables with CDC


1. Get net changes from A into #TEMP_A
2. Get net changes from B into #TEMP_B
3. Examine the join key between A and B. Search for any records that are in A but missing in B. If there are missing records, then…
   1. Look to B’s CDC _CT change table for the *before* images of future modifications, -BUT- only where the record isn’t inserted in the future, after this LSN range.
   2. Finally, look to the base table for records that are still missing, -BUT- only where the record isn’t inserted in the future, after this LSN range.
4. Repeat step 3 in the other direction, this time looking for records that are in B but missing in A.
5. Now continue processing, using #TEMP_A and #TEMP_B.
