
CSE 544: Relational Operators, Sorting

Wednesday, 5/12/2004

Relational Algebra

• Operates on relations, i.e. sets
– Later: we discuss how to extend this to bags
• Five operators:
– Union: ∪
– Difference: -
– Selection: σ
– Projection: Π
– Cartesian Product: ×
• Derived or auxiliary operators:
– Intersection, complement
– Joins (natural, equi-join, theta join, semi-join)
– Renaming: ρ

1. Union and 2. Difference

• R1 ∪ R2
• Example:
ActiveEmployees ∪ RetiredEmployees
• R1 – R2
• Example:
AllEmployees – RetiredEmployees

What about Intersection ?

• It is a derived operator
• R1 ∩ R2 = R1 – (R1 – R2)
• Also expressed as a join (will see later)
• Example
UnionizedEmployees ∩ RetiredEmployees
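The identity R1 ∩ R2 = R1 – (R1 – R2) can be checked with ordinary Python sets. This is only a sketch: the employee names and the helper `intersection_via_difference` are made up for illustration.

```python
# Hypothetical relations, reduced to a single attribute (the name).
unionized = {"Ann", "Bob", "Cho"}
retired = {"Bob", "Cho", "Dee"}

def intersection_via_difference(r1, r2):
    """Compute R1 ∩ R2 using only set difference: R1 - (R1 - R2)."""
    only_in_r1 = r1 - r2      # tuples of R1 that are not in R2
    return r1 - only_in_r1    # what remains of R1 must also be in R2

result = intersection_via_difference(unionized, retired)
print(sorted(result))
```

The inner difference keeps what is only in R1; removing that from R1 leaves exactly the tuples shared with R2.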

3. Selection

• Returns all tuples which satisfy a condition
• Notation: σc(R)
• Examples
σSalary > 40000 (Employee)
σname = “Smith” (Employee)
• The condition c can use the comparisons =, <, ≤, >, ≥, <>

4. Projection

• Eliminates columns, then removes duplicates
• Notation: Π A1,…,An (R)
• Example: project social-security number and names:
Π SSN, Name (Employee)
Output schema: Answer(SSN, Name)
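A minimal sketch of selection and projection under set semantics, assuming a relation is represented as a list of dicts. The sample rows and the function names `select` and `project` are illustrative, not part of the lecture.

```python
# Hypothetical Employee instance; the values are made up.
employee = [
    {"SSN": "111", "Name": "Smith", "Salary": 50000},
    {"SSN": "222", "Name": "Jones", "Salary": 30000},
    {"SSN": "333", "Name": "Smith", "Salary": 45000},
]

def select(relation, condition):
    """σ_c(R): keep the tuples for which condition(t) is true."""
    return [t for t in relation if condition(t)]

def project(relation, attributes):
    """Π_A(R): keep only the listed attributes, then drop duplicate rows."""
    seen = set()
    result = []
    for t in relation:
        row = tuple(t[a] for a in attributes)
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result

high_paid = select(employee, lambda t: t["Salary"] > 40000)
names = project(employee, ["Name"])
print(high_paid)
print(names)
```

Projecting on Name alone collapses the two Smith rows into one, which is the duplicate removal step of the set-based projection.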

5. Cartesian Product

• Each tuple in R1 with each tuple in R2
• Notation: R1 × R2
• Example:
Employee × Dependents
• Very rare in practice; mainly used to express joins

Cartesian Product Example

Employee
Name SSN
John 999999999
Tony 777777777

Dependents
EmployeeSSN Dname
999999999 Emily
777777777 Joe

Employee × Dependents
Name SSN EmployeeSSN Dname
John 999999999 999999999 Emily
John 999999999 777777777 Joe
Tony 777777777 999999999 Emily
Tony 777777777 777777777 Joe
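The example table can be reproduced with a short sketch, assuming tuples are plain Python tuples whose field order follows the schemas above. The helper name `cartesian_product` is an illustration only.

```python
from itertools import product

# Instances copied from the example: Employee(Name, SSN), Dependents(EmployeeSSN, Dname).
employee = [("John", "999999999"), ("Tony", "777777777")]
dependents = [("999999999", "Emily"), ("777777777", "Joe")]

def cartesian_product(r1, r2):
    """R1 × R2: concatenate each tuple of r1 with each tuple of r2."""
    return [t1 + t2 for t1, t2 in product(r1, r2)]

rows = cartesian_product(employee, dependents)
for row in rows:
    print(row)
```

The result has |R1| · |R2| tuples, which is why the product is rarely used on its own.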

Renaming

• Changes the schema, not the instance
• Notation: ρ B1,…,Bn (R)
• Example:
ρLastName, SocSocNo (Employee)
Output schema: Answer(LastName, SocSocNo)

Renaming Example

Employee
Name SSN
John 999999999
Tony 777777777

ρLastName, SocSocNo (Employee)
LastName SocSocNo
John 999999999
Tony 777777777

Natural Join

• Notation: R1 |×| R2
• Meaning: R1 |×| R2 = ΠA(σC(R1 × R2))
• Where:
– The selection σC checks equality of all common attributes
– The projection eliminates the duplicate common attributes

Natural Join Example

Employee
Name SSN
John 999999999
Tony 777777777

Dependents
SSN Dname
999999999 Emily
777777777 Joe

Employee |×| Dependents =
ΠName, SSN, Dname(σSSN=SSN2(Employee × ρSSN2, Dname(Dependents)))

Name SSN Dname
John 999999999 Emily
Tony 777777777 Joe
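The definition ΠA(σC(R1 × R2)) can be sketched directly, assuming each relation is a list of dicts keyed by attribute name. The function `natural_join` is a hypothetical helper; merging the two dicts plays the role of the projection, since a common attribute is stored only once.

```python
def natural_join(r1, r2):
    """R1 |x| R2 as product, then selection on common attributes, then projection."""
    result = []
    for t1 in r1:                      # product: every pair (t1, t2)
        for t2 in r2:
            common = t1.keys() & t2.keys()
            # selection: the pair must agree on all common attributes
            if all(t1[a] == t2[a] for a in common):
                # projection: merged dict keeps each common attribute once
                result.append({**t1, **t2})
    return result

employee = [{"Name": "John", "SSN": "999999999"},
            {"Name": "Tony", "SSN": "777777777"}]
dependents = [{"SSN": "999999999", "Dname": "Emily"},
              {"SSN": "777777777", "Dname": "Joe"}]

joined = natural_join(employee, dependents)
for row in joined:
    print(row)
```

This nested-loop formulation mirrors the algebraic definition and says nothing about how an engine would actually execute the join.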

Natural Join

• R =
A B
X Y
X Z
Y Z
Z V

• S =
B C
Z U
V W
Z V

• R |×| S =
A B C
X Z U
X Z V
Y Z U
Y Z V
Z V W

Natural Join

• Given the schemas R(A, B, C, D), S(A, C, E), what is the schema of R |×| S ?
• Given R(A, B, C), S(D, E), what is R |×| S ?
• Given R(A, B), S(A, B), what is R |×| S ?

Theta Join

• A join that involves a predicate
• R1 |×| θ R2 = σ θ (R1 × R2)
• Here θ can be any condition: =, <, ≠, ≤, >, ≥

Equi-join

• A theta join where θ is an equality
• R1 |×| A=B R2 = σ A=B (R1 × R2)
• Example:
Employee |×| SSN=SSN Dependents
• Most useful join in practice
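A small sketch of R1 |×| θ R2 = σ θ (R1 × R2), assuming tuples are Python tuples and θ is passed in as a function of the two tuples. The name `theta_join` and the sample data are illustrative assumptions.

```python
def theta_join(r1, r2, theta):
    """σ_θ(R1 × R2): keep the concatenated pairs for which theta holds."""
    return [t1 + t2 for t1 in r1 for t2 in r2 if theta(t1, t2)]

# Employee(Name, SSN) and Dependents(SSN, Dname), positions as listed.
employee = [("John", "999999999"), ("Tony", "777777777")]
dependents = [("999999999", "Emily"), ("777777777", "Joe")]

# Equi-join: θ is the equality Employee.SSN = Dependents.SSN.
equi = theta_join(employee, dependents, lambda e, d: e[1] == d[0])

# Any other comparison works the same way, here ≠.
not_equal = theta_join(employee, dependents, lambda e, d: e[1] != d[0])

print(equi)
print(not_equal)
```

Unlike the natural join, the equi-join keeps both copies of the join attribute in the output.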

Semijoin

• R |× S = Π A1,…,An (R |×| S)
• Where A1, …, An are the attributes in R
• Example:
Employee |× Dependents

Semijoins in Distributed Databases

• Semijoins are used in distributed databases

[Figure: Employee(SSN, Name, ...) and Dependents(SSN, Dname, Age) are stored at different sites, connected by a network]

Employee |×| ssn=ssn (σ age>71 (Dependents))

T = Π SSN σ age>71 (Dependents)
R = Employee |× T
Answer = R |×| Dependents
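The first two steps of the semijoin plan (computing T and R) can be sketched as follows, assuming relations are lists of dicts. The data, the helper `semijoin`, and the choice of SSN as the only shared attribute are assumptions made for illustration.

```python
# Hypothetical instances held at two different sites.
employee = [
    {"SSN": "1", "Name": "Ann"},
    {"SSN": "2", "Name": "Bob"},
    {"SSN": "3", "Name": "Cho"},
]
dependents = [
    {"SSN": "1", "Dname": "Eve", "Age": 80},
    {"SSN": "2", "Dname": "Flo", "Age": 12},
    {"SSN": "1", "Dname": "Gus", "Age": 75},
]

def semijoin(r, s, attr):
    """R |x S on one shared attribute: tuples of r that have a match in s."""
    keys = {t[attr] for t in s}
    return [t for t in r if t[attr] in keys]

# T = Π SSN (σ age>71 (Dependents)): only distinct SSNs cross the network.
t = [{"SSN": ssn} for ssn in sorted({d["SSN"] for d in dependents if d["Age"] > 71})]

# R = Employee |x T: only employees that can contribute to the answer.
r = semijoin(employee, t, "SSN")

print(t)
print(r)
```

The point of the plan is that T and R are typically much smaller than Dependents and Employee, so less data is shipped between sites.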

Complex RA Expressions

[Figure: expression tree. Root: Π name. Joins, top to bottom: buyer-ssn=ssn, pid=pid, seller-ssn=ssn. Subexpressions: Π ssn over σname=fred, Π pid over σname=gizmo. Leaves: Person, Purchase, Person, Product]

Operations on Bags

A bag = a set with repeated elements
Relational engines work on bags, not sets!
All operations need to be defined carefully on bags
• {a,b,b,c}∪{a,b,b,b,e,f,f}={a,a,b,b,b,b,b,c,e,f,f}
• {a,b,b,b,c,c} – {b,c,c,c,d} = {a,b,b}
• σC(R): preserve the number of occurrences
• ΠA(R): no duplicate elimination
• Cartesian product, join: no duplicate elimination
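Bag union and bag difference can be tried out with `collections.Counter` from the Python standard library, which stores a multiplicity per element. The helper names `bag_union` and `bag_difference` are illustrative; this is a sketch of the bag semantics, not of how an engine stores bags.

```python
from collections import Counter

def bag_union(r, s):
    """Bag union: multiplicities add up."""
    return r + s

def bag_difference(r, s):
    """Bag difference: multiplicities subtract; results at or below zero vanish."""
    return r - s

u = bag_union(Counter("abbc"), Counter("abbbeff"))
d = bag_difference(Counter("abbbcc"), Counter("bcccd"))

print(sorted(u.elements()))
print(sorted(d.elements()))
```

An element that occurs only in the second bag cannot appear in the difference, because its count in the first bag is zero.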

Logical Operators in the Bag Algebra

• Union, intersection, difference
• Selection σ
• Projection Π (on bags)
• Join
• Duplicate elimination δ
• Grouping γ
• Sorting τ

Example

SELECT city, count(*)
FROM sales
GROUP BY city
HAVING sum(price) > 100

Relational Algebra plan, from the root down:
Π city, c
σ p > 100
γ city, sum(price) → p, count(*) → c (result: T(city, p, c))
sales
Physical Operators

SELECT P.buyer
FROM Purchase P, Person Q
WHERE P.buyer=Q.name AND
Q.city=‘seattle’ AND
Q.phone > ‘5430000’

Query plan, from the root down:
Π buyer
σ City=‘seattle’ ∧ phone>‘5430000’
|×| Buyer=name (Simple Nested Loops)
Purchase (Table scan), Person (Index scan)

Query Plan:
• logical tree
• implementation choice at every node
• scheduling of operations.

Some operators are from relational algebra, and others (e.g., scan) are not.

Architecture of a Database Engine

SQL query
→ Parse Query
→ Select Logical Plan (produces the logical plan)
→ Select Physical Plan (produces the physical plan)
→ Query Execution

Selecting the logical plan and the physical plan together make up query optimization.
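A physical plan can be sketched as a pipeline of Python generators, one per operator. This is a simplification under stated assumptions: the tables are in-memory lists, the index scan on Person is replaced by a plain rescan, and all names (`table_scan`, `nested_loop_join`, `plan`) are hypothetical.

```python
# Hypothetical instances of Purchase(buyer, product) and Person(name, city, phone).
purchase = [{"buyer": "Ann", "product": "gizmo"},
            {"buyer": "Bob", "product": "widget"}]
person = [{"name": "Ann", "city": "seattle", "phone": "5551234"},
          {"name": "Bob", "city": "portland", "phone": "5559999"}]

def table_scan(table):
    """Physical operator: read every row of a table in storage order."""
    for row in table:
        yield row

def nested_loop_join(outer, make_inner, predicate):
    """Simple nested loops: rescan the inner input once per outer row."""
    for o in outer:
        for i in make_inner():
            if predicate(o, i):
                yield {**o, **i}

def plan():
    joined = nested_loop_join(
        table_scan(purchase),
        lambda: table_scan(person),      # assumption: rescan instead of index scan
        lambda p, q: p["buyer"] == q["name"],
    )
    selected = (t for t in joined
                if t["city"] == "seattle" and t["phone"] > "5430000")
    return [t["buyer"] for t in selected]   # Π buyer

buyers = plan()
print(buyers)
```

Each operator pulls rows from the one below it, so the join, the selection, and the projection run interleaved rather than one after another.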

Cost Parameters

In database systems the data is on disks, not in main memory
The cost of an operation = total number of I/Os
Cost parameters:
• B(R) = number of blocks for relation R
• T(R) = number of tuples in relation R
• V(R, a) = number of distinct values of attribute a
• When a is a key, V(R,a) = T(R)
• When a is not a key, V(R,a) can be anything up to T(R)

Cost Parameters

• Clustered table R:
– Blocks consist only of records from this table
– B(R) ≈ T(R) / blockSize, where blockSize is the number of records per block
• Unclustered table R:
– Its records are placed on blocks with other tables
– When R is unclustered: B(R) ≈ T(R)

Cost

Cost of an operation = number of disk I/Os needed to:
– read the operands
– compute the result

Cost of writing the result to disk is not included on the following slides

Question: the cost of sorting a table with B blocks ?
Answer:

Scanning Tables

• The table is clustered:
– Table-scan: if we know where the blocks are
– Index scan: if we have a sparse index to find the blocks
• The table is unclustered
– May need one read for each record

Sorting While Scanning

• Sometimes it is useful to have the output sorted
• Three ways to scan it sorted:
– If there is a primary or secondary index on it, use it during scan
– If it fits in memory, sort there
– If not, use multi-way merge sort

Cost of the Scan Operator

• Clustered relation:
– Table scan:
• Unsorted: B(R)
• Sorted: 3B(R)
– Index scan
• Unsorted: B(R)
• Sorted: B(R) or 3B(R)
• Unclustered relation
– Unsorted: T(R)
– Sorted: T(R) + 2B(R)
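The scan costs can be collected into one small function. This is a sketch, and it rests on an assumption the slide does not spell out: the "B(R) or 3B(R)" case is read here as B(R) when the index already delivers the wanted order and 3B(R) otherwise. The function name and parameters are hypothetical.

```python
def scan_cost(B, T, clustered, want_sorted, index_gives_order=False):
    """Estimated I/Os for scanning a relation with B blocks and T tuples.

    index_gives_order is an assumption of this sketch: True when an index
    scan returns the tuples in the requested order, so no sort is needed.
    """
    if clustered:
        if not want_sorted or index_gives_order:
            return B            # one read per block
        return 3 * B            # read, write runs, read again to merge
    if not want_sorted:
        return T                # may need one read for each record
    return T + 2 * B            # read records, then write and re-read runs

# Example with made-up sizes: 1,000 blocks holding 20,000 tuples.
print(scan_cost(1000, 20000, clustered=True, want_sorted=False))
print(scan_cost(1000, 20000, clustered=True, want_sorted=True))
print(scan_cost(1000, 20000, clustered=False, want_sorted=True))
```

The 3B(R) figure assumes the sort finishes in one run-creation pass plus one merge pass, with the final write not counted.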

Sorting

• Problem: sort 1 GB of data with 1MB of RAM.
• Where we need this:
– Data requested in sorted order (ORDER BY)
– Needed for grouping operations
– First step in sort-merge join algorithm
– Duplicate removal
– Bulk loading of B+-tree indexes.

2-Way Merge-sort: Requires 3 Buffers in RAM

• Pass 1: Read 1MB, sort it, write it.
• Pass 2, 3, …, etc.: merge two runs, write them

[Figure: two runs of length L are read from disk into the main-memory buffers INPUT 1 and INPUT 2, merged through the OUTPUT buffer, and written back to disk as runs of length 2L]

Two-Way External Merge Sort

• Assume block size is B = 4KB
• Step 1 → runs of length L = 1MB
• Step 2 → runs of length L = 2MB
• Step 3 → runs of length L = 4MB
• ......
• Step 10 → runs of length L = 1GB (why ?)
Need 10 iterations over the disk data to sort 1GB

Can We Do Better ?

• Hint:
We have 1MB of main memory, but only used 12KB

Cost Model for Our Analysis

• B: Block size ( = 4KB)
• M: Size of main memory ( = 1MB)

For later use (won’t need now):
• N: Number of records in the file
• R: Size of one record

External Merge-Sort

• Phase one: load M bytes in memory, sort
– Result: runs of length M bytes ( 1MB )

[Figure: the file is read from disk into M bytes of main memory, M/R records at a time, and the sorted runs are written back to disk]

Phase Two

• Merge M/B – 1 runs into a new run (250 runs)
• Result: runs of length M (M/B – 1) bytes (250MB)

[Figure: runs on disk feed the buffers Input 1, Input 2, ..., Input M/B in M bytes of main memory; the merged run is written to disk through the Output buffer]

Phase Three

• Merge M/B – 1 runs into a new run
• Result: runs of length M (M/B – 1)^2 bytes (250*250MB = 62.5GB – larger than the file)

[Figure: same buffer layout as in Phase Two]

Need 3 iterations over the disk data to sort 1GB
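The phases can be simulated in memory to see how the number of passes depends on the fan-in. A sketch only: Python lists stand in for runs on disk, `run_size` plays the role of M, `fan_in` the role of M/B – 1, and the function name is hypothetical. `heapq.merge` from the standard library does the multi-way merge.

```python
import heapq

def external_merge_sort(records, run_size, fan_in):
    """Return (sorted_records, passes); lists stand in for runs on disk."""
    if run_size < 1 or fan_in < 2:
        raise ValueError("need run_size >= 1 and fan_in >= 2")
    # Phase one: cut the input into memory-sized chunks and sort each one.
    runs = [sorted(records[i:i + run_size])
            for i in range(0, len(records), run_size)]
    passes = 1
    # Later phases: merge up to fan_in runs at a time until one run is left.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
        passes += 1
    return (runs[0] if runs else []), passes

data = [9, 3, 7, 1, 8, 2, 6, 4, 5, 0]
two_way, two_way_passes = external_merge_sort(data, run_size=2, fan_in=2)
multi_way, multi_way_passes = external_merge_sort(data, run_size=2, fan_in=5)
print(two_way_passes, multi_way_passes)
```

With the same initial runs, raising the fan-in from 2 to 5 merges all five runs in a single pass, which is the effect the larger memory buys.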

Cost of External Merge Sort

• Number of passes: 1 + ⌈log_{M/B−1}(Size/M)⌉
• How much data can we sort with 10MB RAM?
– 1 pass → 10MB data
– 2 passes → 25GB data (M/B = 2500)
• Can sort everything in 2 or 3 passes !

External Merge Sort

• The xsort tool in the XML toolkit sorts using this algorithm
• Can sort 1GB of XML data in about 8 minutes
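The pass count and the capacity figures can be checked numerically. This sketch assumes decimal units (1MB = 10^6 bytes, 4KB = 4000 bytes) so that M/B comes out as 250 and 2500 as on the slides, and it uses a fan-in of M/B – 1; both function names are hypothetical.

```python
def merge_sort_passes(size, M, B):
    """Passes over the data: one run-creation pass plus the merge passes."""
    fan_in = M // B - 1
    if fan_in < 2:
        raise ValueError("memory too small to merge")
    runs = -(-size // M)          # ceiling of size / M: runs after phase one
    passes = 1
    while runs > 1:
        runs = -(-runs // fan_in) # each merge pass divides the run count
        passes += 1
    return passes

def max_sortable(M, B, passes):
    """Largest input (in bytes) that can be sorted in the given number of passes."""
    return M * (M // B - 1) ** (passes - 1)

MB, KB, GB = 10**6, 10**3, 10**9   # assumption: decimal units

print(merge_sort_passes(1 * GB, 1 * MB, 4 * KB))
print(max_sortable(10 * MB, 4 * KB, 1))
print(max_sortable(10 * MB, 4 * KB, 2))
```

The two-pass capacity comes out slightly under 25GB because the sketch reserves one buffer for output, whereas the slide rounds the fan-in up to M/B.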
