
Why Date BETWEEN FromDate AND ToDate is a dangerous join criteria
(http://kejser.org/modeling/why-date-between-fromdate-and-todate-is-a-dangerous-join-criteria/)
2012-06-29 | Data Modeling (http://kejser.org/category/modeling/)
Tags: Data Warehouse (http://kejser.org/tag/data-warehouse/), Dimensional Models (http://kejser.org/tag/dimensional-models/), Histogram (http://kejser.org/tag/histogram/), Historizing (http://kejser.org/tag/historizing/), OLTP (http://kejser.org/tag/oltp/), Query Plans (http://kejser.org/tag/query-plans/), Statistics (http://kejser.org/tag/statistics/)
I have been meaning to write this blog post for some time, and the discussion about
Data Vault finally prompted me to do it.
Sometimes, you find yourself in situations where you have to join a table that has a
structure like this:
CREATE TABLE TemporalTracking (
    SomeKey INT
    , FromDate DATETIME
    , ToDate DATETIME
    , <more columns>
)

The join criteria is expressed as:

FROM <OtherTable> OT
INNER JOIN TemporalTracking T
ON OT.SomeTimeColumn BETWEEN T.FromDate AND T.ToDate
AND OT.SomeKey = T.SomeKey

Or, more commonly, this variant with a semi-open interval (which avoids matching a row into two intervals when its timestamp falls exactly on a boundary):
FROM <OtherTable> OT
INNER JOIN TemporalTracking T
ON OT.SomeTimeColumn >= T.FromDate
AND OT.SomeTimeColumn < T.ToDate
AND OT.SomeKey = T.SomeKey

Data models that promote these types of joins are very dangerous to relational optimizers, and you have to step carefully when executing queries with many of these joins. Let us have a look at why this is so.

Temporal Join

To illustrate the issue with this query pattern, let me create a very simple test setup that you can experiment with. Use this script to generate the two tables:
CREATE TABLE SmallTable (
    SK INT NOT NULL
    , BusinessKey INT
    , FromDate DATETIME
    , ToDate DATETIME
    , SomeColumn INT)

/* Create the 1M row join table */
INSERT SmallTable WITH (TABLOCK)
SELECT reps.n * 100000 + k.n - 1
    , k.n - 1
    , DATEADD(month, reps.n - 1, '2000-01-01')
    , DATEADD(month, reps.n, '2000-01-01')
    , reps.n * k.n % 15
FROM fn_nums(100000) k
CROSS JOIN fn_nums(10) reps

CREATE UNIQUE INDEX CIX ON SmallTable (BusinessKey, FromDate)

CREATE UNIQUE INDEX IX_SK ON SmallTable (SK)

CREATE STATISTICS Stat_FromTo ON SmallTable (BusinessKey, FromDate, ToDate)

UPDATE STATISTICS SmallTable WITH FULLSCAN

SELECT n AS RowID
    , n % 100000 AS BusinessKey
    , n % 1000 AS OtherKey
    , DATEADD(month, n % 5, '2000-01-01') AS TranDate
INTO BigTable
FROM fn_nums(10000000)

(See my utility functions (http://blog.kejser.org/2011/04/26/utility-functions-fn_convert_to_base-and-fn_nums/) for fn_nums.)
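
If you do not want to install the utility functions, a minimal stand-in for fn_nums is easy to sketch. This is my approximation, not the version from the linked post, and you may need to invoke it schema-qualified as dbo.fn_nums:

CREATE FUNCTION dbo.fn_nums (@max BIGINT)
RETURNS TABLE
AS
RETURN
    -- Returns n = 1..@max. Two cross joins of sys.all_objects yield far
    -- more than the 10M rows needed for the BigTable build above.
    SELECT TOP (@max) n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_objects a
    CROSS JOIN sys.all_objects b
    CROSS JOIN sys.all_objects c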
SmallTable above is a temporal tracker table with 10 temporal records per BusinessKey. BigTable is about 400 MB in size and does not use temporal tracking, but it has a TranDate column that determines which temporal records in SmallTable match it.

Let us now try to execute a reporting query where we ask for an aggregate over BigTable, joining it up to its matching temporal keys in SmallTable:

SELECT S.SomeColumn, B.OtherKey, COUNT(RowID)
FROM BigTable B
INNER JOIN SmallTable S
    ON B.BusinessKey = S.BusinessKey
    AND B.TranDate >= S.FromDate
    AND B.TranDate < S.ToDate
GROUP BY S.SomeColumn, B.OtherKey

Some quick statistics about this query execution on my laptop:

CPU time: 25547 ms
Logical I/O operations: 50762 (no physical)
Memory Grant: 370 MB
Rows Returned: 2600

Nothing overly suspicious yet (though the memory grant seems rather high). Let's just have a quick look at the query plan:
(http://kejserbi.files.wordpress.com/2012/06/image19.png)
That is a pretty big misestimate, isn't it? And that is the crux of the issue: it is immensely hard for a query optimizer to accurately predict that the join on the temporal table will lead to one and only one row (unless you have a temporally aware database, of course).

But misestimates are not the full story; there is also a higher CPU cost involved in doing this join. At the CPU instruction level, more work needs to be done to find records matching an interval than to do a straight comparison of two values.

Now, you can imagine what happens if you have a data model that has very long chains of these joins. As you probably know, query misestimates (and the risk of bad query plans) typically grow exponentially with the number of tables being joined. Having a data model that almost guarantees poor estimates even in a two-join setup can quickly lead to interesting tuning challenges.
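
If you want to reproduce the measurements above yourself, a minimal sketch is to wrap the aggregate query in the standard statistics switches and run it with the actual execution plan enabled (your numbers will of course vary with hardware and SQL Server version):

SET STATISTICS TIME ON
SET STATISTICS IO ON

-- Run with the actual execution plan enabled and compare the estimated
-- vs. actual row counts on the join operator.
SELECT S.SomeColumn, B.OtherKey, COUNT(RowID)
FROM BigTable B
INNER JOIN SmallTable S
    ON B.BusinessKey = S.BusinessKey
    AND B.TranDate >= S.FromDate
    AND B.TranDate < S.ToDate
GROUP BY S.SomeColumn, B.OtherKey

SET STATISTICS TIME OFF
SET STATISTICS IO OFF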
Going Kimball Again
There is a very good reason Kimball recommends integer keys for Type 2 dimensions instead of the temporal join you just saw.

Let us change BigTable into a Kimball representation instead by pre-joining like this:
SELECT RowID
    , S.SK
    , B.OtherKey
    , B.TranDate
INTO BigTableKimball
FROM BigTable B
INNER JOIN SmallTable S
    ON B.BusinessKey = S.BusinessKey
    AND B.TranDate >= S.FromDate
    AND B.TranDate < S.ToDate

We can now rewrite the previous aggregate query to this:

SELECT S.SomeColumn, B.OtherKey, COUNT(RowID)
FROM BigTableKimball B
INNER JOIN SmallTable S
    ON B.SK = S.SK /* Kimball style Type 2 join */
GROUP BY S.SomeColumn, B.OtherKey

Let us first have a look at the query plan for the Kimball style join:

(http://kejserbi.files.wordpress.com/2012/06/image20.png)

Same query plan shape, but look at the difference in estimates vs. actuals! We are spot on here.

Comparing the Kimball style Type 2 join with the temporal join, we get:

Measurement      Temporal Join    Kimball Type 2
CPU Time         25547 ms         10844 ms
Logical I/O      50762            50762
Memory Grant     370 MB           315 MB
Rows Returned    2600             2600
Misestimate      3x               None

Summary

In this blog entry, I have shown you why temporal-style joins can be dangerous to query optimizers. While it is not always possible to avoid them, extreme care should be taken if you choose to include them as the only way to access your data model.
Comments
Pingback: How Vertical Partitioning and Deep Joins Kill Parallelism | Thomas Kejser's Database Blog (http://blog.kejser.org/2012/07/16/how-vertical-partitioning-and-deep-joins-kill-parallelism/)

Pingback: Modeling Dimensions with History Tracked, Generic Attributes | Thomas Kejser's Database Blog (http://blog.kejser.org/2012/07/06/modeling-dimensions-with-history-tracked-generic-attributes/)
cteveret (http://gravatar.com/cteveret), 2 years ago:

Thanks again Thomas for these additional tips. I appreciate you taking the time to share your experiences. We will have additional projects this year attempting to query this database, so these tips will help. I checked out the Thomas Christensen forum and there is a lot to digest :-). Still working on it and have learned a good deal already!

Todd Everett (http://gravatar.com/cteveret), 2 years ago:
Great post Thomas, thanks for taking the time. I work frequently with a database designed using a vault model concept. This data warehouse is built and extended using an architectural strategy based upon the idea that a small set of business data in scope of a given operational system functional enhancement project must be warehoused, as it is already being worked upon and might be useful someday to those who might want to report on it. The architecture also requires that a full history is retained for every data element in every entity. The vault model has been very helpful to us in meeting these requirements, given its exclusive use of M to M relationships and its retention of a history of every column change in its satellite tables. The Link tables and the Satellite tables both rely on bi-temporal relationships to relate entities to each other and retain a relationship history over time, and also to maintain a row history for each entity over time.

While it works really well for folding in new entities and data elements without impacting existing process (you just create a new hub or link, or add a new satellite), we often struggle to get good plans for range scan queries. The plans are hard to read as there are so many tables involved (every entity has at least 1 hub/link, one satellite (often 3 or 4), and one point-in-time satellite to make it easy to join the current row without a subselect). Often the optimizer times out generating the plan if we are joining in 10 or more composite entities. I can usually re-write queries to get only the data required, and often use cross applies to join the hub/link tables to the satellite (which only helps for current state queries where the matching bi-temporal child is always the most recent one; thankfully that is almost always what is needed), but it is a lot of work. And a 20 entity monster query written to retrieve data as of some time in the past (like in your example) never finishes. I was at a loss to understand why this was the case, and your post has really helped me grasp the trade-offs in query optimization we have in using this approach.

Ultimately our architecture calls for the warehouse to be like a distribution center: it feeds data marts and isn't intended to be queried. But in reality our business partners don't want to fund additional development for a data mart, and our operations partners don't want to manage the extra databases, as each means more backups, more maintenance, etc. You have given me some good tools and understanding to help explain to our partners why, if they want easy and fast reporting under our current architecture, we need the extra funding to build a data mart. Usually what I encounter is the belief that indexing and the nolock hint solves all. The test script you provided is a great example to show how an inability for the optimizer to develop accurate estimates given all these bi-temporal relationships is a root problem that won't always be easily addressed by indexes and query hints.

I don't want to get into a Kimball vs. Inmon / Linstedt battle, as every architecture optimizes certain objectives at the expense of others, and none is good or bad absent of those objectives. I have been trying to learn all I can about all the different approaches, what objectives they maximize and what trade-offs they have. Your blog has been a really good source of learning material and I look forward to your continued posts.

Thomas Kejser, 2 years ago:
Hi Tod

Thanks for the reply. Your experience with Vault reflects mine.

If your data model consistently creates these types of issues for you, isn't there a point where you have to ask the obvious question?

It is indeed true that Vault allows the addition of sources fast, but at what cost further down the delivery chain? You might find the discussion in the Thomas Christensen forum informative (see my links in the previous post). There, I describe how the history tracking you are looking for can be done without requiring the Vault model.

The forum also contains a fascinating discussion about the tradeoffs you make with Vault and exactly which benefits are claimed. After reading it, I hope you might revise your stance that every model has BOTH good and bad sides.
cteveret (http://thesqlda.wordpress.com), 2 years ago:
Thanks Thomas, I will check these discussions out!

Thomas Kejser, 2 years ago:
I forgot to share some advice, Todd (and sorry for misspelling your name).

If you do find yourself in Vault land with no way back, there are a few tricks you can play to narrow the search space. Disclaimer: this will NOT generate the best queries, though in my experience it makes the optimiser create a "good enough" plan in more reasonable time.

First of all: add OPTION (LOOP JOIN) to the queries. While it IS possible to perform BETWEEN queries with hash joins, the cost (and risk) of getting the join order wrong is too high.

Second: use FORCESEEK hints on all tables. This again narrows the search space and will avoid expensive spools. If you are standard Vault indexed (especially if clustered on all join keys and from-dates), you should have a fully indexed path through the join tree.
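
Applied to the aggregate query from the post, the two hints would look roughly like this (a sketch only; whether it actually helps depends on your indexes):

-- Sketch: LOOP JOIN and FORCESEEK narrow the optimizer's search space.
-- This is damage control, not a fix for the underlying estimates.
SELECT S.SomeColumn, B.OtherKey, COUNT(RowID)
FROM BigTable B
INNER JOIN SmallTable S WITH (FORCESEEK)
    ON B.BusinessKey = S.BusinessKey
    AND B.TranDate >= S.FromDate
    AND B.TranDate < S.ToDate
GROUP BY S.SomeColumn, B.OtherKey
OPTION (LOOP JOIN)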
There are some additional tricks you can play to unfold history in a structured manner, but they are the subject for a full blog entry.

May the force be with you; you will likely need it.
Adam Machanic (http://sqlblog.com), 2 years ago:
You could have achieved even better results more easily by properly indexing BigTable ((BusinessKey, TranDate) INCLUDE (OtherKey, RowId)) and using a FORCESEEK hint on BigTable in the original query. (At least on my end, this made that query 25% faster than your revised version. It's too bad that the QO requires a hint for that.)

I'm not sure how realistic this example is. I'm used to seeing temporal questions like "give me all of the rows that were active between date X and date Y", whereas yours is "give me everything in the database". If you're only looking for active rows based on a narrow date range with respect to what's in the database (which I think is much more common), you're not going to have this issue. (Again, assuming appropriate indexes and somewhat thoughtful query construction.)
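
For reference, the covering index described above would be created along these lines (a sketch; the index name is illustrative):

-- Sketch of the covering index Adam describes; the name is made up.
CREATE INDEX IX_BigTable_BusinessKey_TranDate
ON BigTable (BusinessKey, TranDate)
INCLUDE (OtherKey, RowID)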
Thomas Kejser, 2 years ago:
Your first comment is exactly to my point Adam: optimizers don't deal well with this construct, and you often have to hint them if you go down this path. You could index it for that one join, but that won't do you much good if you have to bi-temporal join to another table too.

I agree that these queries often have the form of "slide me into this date range", which helps immensely (but still gives you estimates that are way off when you join multiple tables together). Take this as an example:

CREATE TABLE Product (BK, From, To, GroupBK)
CREATE TABLE Group (BK, From, To, CategoryBK)
CREATE TABLE Category (BK, From, To)

Asking the question "what did this product look like at a certain date?" is easy here (but estimates are way off already). But asking "show me the history of all products by Group and Category" is quite tough, and the chance of getting the wrong join strategy and confusing the optimizer becomes significant even on a small table.

The query I used in this blog entry is to illustrate the tradeoff between storing dimensions as bi-temporal relations to facts (as, for example, Data Vault recommends) and storing them as materialized surrogate keys. You really don't want to serve up fact data to ETL tools or end users in this format if you can avoid it.
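
For concreteness, the "history of all products by Group and Category" question would require a chained interval join along these lines (a sketch against the toy schema above, assuming From/To column pairs on each table):

-- Sketch: chained bi-temporal (interval overlap) join over the toy schema.
-- Every hop repeats the interval predicate, compounding the misestimates.
SELECT P.BK AS ProductBK, G.BK AS GroupBK, C.BK AS CategoryBK
    , P.[From], P.[To]
FROM Product P
INNER JOIN [Group] G
    ON P.GroupBK = G.BK
    AND P.[From] < G.[To] AND P.[To] > G.[From]
INNER JOIN Category C
    ON G.CategoryBK = C.BK
    AND G.[From] < C.[To] AND G.[To] > C.[From]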
peteads (http://peteadshead.wordpress.com), 2 years ago:
In the original example, also consider the impact/cost of maintaining statistics on the changing dates for ToDate as your data evolves.