3. Selection
• Returns all tuples which satisfy a condition
• Notation: σc(R)
• Examples:
σSalary > 40000 (Employee)
σname = “Smith” (Employee)
• The condition c can use the comparisons =, <, ≤, >, ≥, <>

4. Projection
• Eliminates columns, then removes duplicates
• Notation: Π A1,…,An (R)
• Example: project social-security number and names:
Π SSN, Name (Employee)
Output schema: Answer(SSN, Name)
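
A minimal Python sketch of selection and projection under set semantics may make the two definitions concrete. The relation is modelled as a list of dictionaries; the third row and the helper names `select` and `project` are assumptions made for illustration, not part of the slides.

```python
# Hypothetical Employee relation; the third row is invented so that
# projection on Name has a duplicate to remove.
EMPLOYEE = [
    {"SSN": "999999999", "Name": "John", "Salary": 50000},
    {"SSN": "777777777", "Name": "Tony", "Salary": 30000},
    {"SSN": "555555555", "Name": "John", "Salary": 45000},
]


def select(relation, condition):
    """sigma_c(R): keep the tuples for which condition(tuple) is true."""
    return [row for row in relation if condition(row)]


def project(relation, attributes):
    """Pi_{A1..An}(R): keep only the listed columns, then drop duplicates."""
    seen = set()
    result = []
    for row in relation:
        key = tuple(row[a] for a in attributes)
        if key not in seen:  # set semantics: each output tuple appears once
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result


high_paid = select(EMPLOYEE, lambda row: row["Salary"] > 40000)
names = project(EMPLOYEE, ["Name"])
```

Here `names` contains John only once, which is the "then removes duplicates" step of projection.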
5. Cartesian Product
• Each tuple in R1 with each tuple in R2
• Notation: R1 × R2
• Example: Employee × Dependents
• Very rare in practice; mainly used to express joins

Cartesian Product Example
Employee
Name  SSN
John  999999999
Tony  777777777

Dependents
EmployeeSSN  Dname
999999999    Emily
777777777    Joe

Employee × Dependents
Name  SSN        EmployeeSSN  Dname
John  999999999  999999999    Emily
John  999999999  777777777    Joe
Tony  777777777  999999999    Emily
Tony  777777777  777777777    Joe
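
The example can be reproduced with a short Python sketch. It uses `itertools.product` from the standard library; the function name `cartesian_product` and the tuple layout are assumptions for illustration. The last line shows how a join can be expressed as a product followed by a selection.

```python
from itertools import product

employee = [("John", "999999999"), ("Tony", "777777777")]    # (Name, SSN)
dependents = [("999999999", "Emily"), ("777777777", "Joe")]  # (EmployeeSSN, Dname)


def cartesian_product(r1, r2):
    """R1 x R2: concatenate every tuple of r1 with every tuple of r2."""
    return [t1 + t2 for t1, t2 in product(r1, r2)]


pairs = cartesian_product(employee, dependents)

# A join written as a product followed by the selection SSN = EmployeeSSN.
joined = [row for row in pairs if row[1] == row[2]]
```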
Renaming
• Changes the schema, not the instance
• Notation: ρ B1,…,Bn (R)
• Example:
ρLastName, SocSocNo (Employee)
Output schema: Answer(LastName, SocSocNo)

Employee
Name  SSN
John  999999999
Tony  777777777

ρLastName, SocSocNo (Employee)
LastName  SocSocNo
John      999999999
Tony      777777777
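
As a small sketch of "changes the schema, not the instance", the function below returns new attribute names alongside the very same rows. Representing a relation as a separate schema list and row list is an assumption made for this example.

```python
def rename(schema, rows, new_names):
    """rho_{B1..Bn}(R): replace the attribute names, leave the tuples untouched."""
    if len(new_names) != len(schema):
        raise ValueError("rename needs exactly one new name per attribute")
    return list(new_names), rows


schema = ["Name", "SSN"]
rows = [("John", "999999999"), ("Tony", "777777777")]
new_schema, new_rows = rename(schema, rows, ["LastName", "SocSocNo"])
```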
Natural Join
• R =
A  B
X  Y
X  Z
Y  Z
Z  V

• S =
B  C
Z  U
V  W
Z  V

• R |×| S =
A  B  C
X  Z  U
X  Z  V
Y  Z  U
Y  Z  V
Z  V  W

Natural Join
• Given the schemas R(A, B, C, D), S(A, C, E), what is the schema of R |×| S ?
• Given R(A, B, C), S(D, E), what is R |×| S ?
• Given R(A, B), S(A, B), what is R |×| S ?
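
One way to check the example and explore the three questions is a small natural-join sketch in Python. The function name and the schema-plus-rows representation are assumptions for illustration; the data for R and S is taken from the example above.

```python
def natural_join(schema_r, rows_r, schema_s, rows_s):
    """R |x| S: match tuples on every attribute the two schemas share."""
    common = [a for a in schema_r if a in schema_s]
    extra = [a for a in schema_s if a not in common]
    out_schema = list(schema_r) + extra
    out_rows = []
    for r in rows_r:
        r_map = dict(zip(schema_r, r))
        for s in rows_s:
            s_map = dict(zip(schema_s, s))
            # With no common attributes this test is always true,
            # so the join degenerates to a Cartesian product.
            if all(r_map[a] == s_map[a] for a in common):
                out_rows.append(tuple(r) + tuple(s_map[a] for a in extra))
    return out_schema, out_rows


R = [("X", "Y"), ("X", "Z"), ("Y", "Z"), ("Z", "V")]
S = [("Z", "U"), ("V", "W"), ("Z", "V")]
schema, rows = natural_join(["A", "B"], R, ["B", "C"], S)
```

Calling the function with schemas that share no attributes, or that share all of them, shows the two boundary cases: a Cartesian product and an intersection, respectively.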
Semijoin
• R |× S = Π A1,…,An (R |×| S)
• Where A1, …, An are the attributes in R
• Example: Employee |× Dependents

Semijoins in Distributed Databases
• Semijoins are used in distributed databases
• Employee(SSN, Name, ...) and Dependents(SSN, Dname, Age) are stored at different sites connected by a network
• Query: Employee |×| ssn=ssn (σ age>71 (Dependents))
• Evaluation:
T = Π SSN σ age>71 (Dependents)
R = Employee |× T
Answer = R |×| σ age>71 (Dependents)
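
The three-step evaluation can be sketched in Python as follows. All rows are hypothetical (the slides give only the schemas), and the two "sites" are simply two lists. The final step joins with the age-filtered dependents so that the result equals Employee |×| ssn=ssn (σ age>71 (Dependents)).

```python
# Hypothetical data; only the schemas come from the slides.
employee = [("111", "Ann"), ("222", "Bob"), ("333", "Cy")]  # (SSN, Name)
dependents = [                                              # (SSN, Dname, Age)
    ("111", "Dora", 80),
    ("111", "Ed", 30),
    ("333", "Flo", 40),
]


def semijoin(r, keys, key_index=0):
    """R |x T: the tuples of r whose key value appears in the key set."""
    return [row for row in r if row[key_index] in keys]


# Site holding Dependents: select, then ship only the SSN column.
old = [d for d in dependents if d[2] > 71]
T = {d[0] for d in old}

# Site holding Employee: reduce Employee before sending it over the network.
R = semijoin(employee, T)

# Final join of the reduced Employee with the selected dependents.
answer = [e + d[1:] for e in R for d in old if e[0] == d[0]]
```

Only the SSN set T and the reduced relation R cross the network, rather than the full Employee table.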
Complex RA Expressions
[Figure: expression tree. Node labels from root to leaves: Π name; buyer-ssn=ssn; pid=pid; seller-ssn=ssn; Π ssn, Π pid; σname=fred, σname=gizmo; sales]

Operations on Bags
A bag = a set with repeated elements
Relational Engines work on bags, not sets !
All operations need to be defined carefully on bags
• {a,b,b,c} ∪ {a,b,b,b,e,f,f} = {a,a,b,b,b,b,b,c,e,f,f}
• {a,b,b,b,c,c} – {b,c,c,c,d} = {a,b,b}
• σC(R): preserve the number of occurrences
• ΠA(R): no duplicate elimination
• Cartesian product, join: no duplicate elimination
• Grouping γ
• Sorting τ
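
Python's `collections.Counter` behaves like a bag, so the union and difference examples can be checked directly: `+` adds multiplicities and `-` subtracts them, dropping anything that would fall to zero or below. The helper `project_bag` and its sample rows are assumptions for illustration.

```python
from collections import Counter


def bag(items):
    """Build a bag (multiset) from an iterable of elements."""
    return Counter(items)


# Bag union adds the multiplicities of each element.
union = bag("abbc") + bag("abbbeff")

# Bag difference subtracts multiplicities and never goes below zero.
difference = bag("abbbcc") - bag("bcccd")


def project_bag(rows, index):
    """Pi_A on bags: keep one column, with no duplicate elimination."""
    return Counter(row[index] for row in rows)


# Hypothetical rows (name, pid): 'fred' appears twice and stays twice.
buyers = project_bag([("fred", 1), ("fred", 2), ("gizmo", 3)], 0)
```

Note that d does not appear in the difference: an element absent from the left bag cannot be produced by subtracting.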
Physical Operators
Query Plan:
• logical tree
• implementation choice at every node
• scheduling of operations.

SELECT P.buyer
FROM Purchase P, Person Q
WHERE P.buyer=Q.name AND
      Q.city='seattle' AND
      Q.phone > '5430000'

[Figure: query plan tree. buyer (projection) at the root; below it σ City='seattle' ∧ phone>'5430000'; below that the join Buyer=name (Simple Nested Loops) over Purchase (Table scan) and Person (Index scan).]

Some operators are from relational algebra, and others (e.g., scan) are not.

Architecture of a Database Engine
[Figure: SQL query → Parse Query → Select Logical Plan → Select Physical Plan → Query Execution. Labels: Logical plan, Query optimization, Physical plan.]
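
To illustrate the idea of a physical plan, the following Python sketch wires generator-based operators into the plan for the Purchase/Person query. It is a simplification under stated assumptions: the rows are hypothetical, both tables are read with a table scan (the index scan on Person is not modelled), and the operator names are invented for this example.

```python
# Hypothetical rows; only the attribute names come from the query.
PURCHASE = [
    {"buyer": "ann", "product": "gizmo"},
    {"buyer": "bob", "product": "widget"},
]
PERSON = [
    {"name": "ann", "city": "seattle", "phone": "5551234"},
    {"name": "bob", "city": "seattle", "phone": "5400000"},
    {"name": "cat", "city": "portland", "phone": "5559999"},
]


def table_scan(table):
    """Physical operator with no relational-algebra counterpart: read every row."""
    yield from table


def nested_loop_join(outer, inner_table, predicate):
    """Simple nested loops: rescan the inner table once per outer row."""
    for o in outer:
        for i in table_scan(inner_table):
            if predicate(o, i):
                yield {**o, **i}


def select(rows, predicate):
    """Pass through only the rows that satisfy the predicate."""
    return (row for row in rows if predicate(row))


def project(rows, attribute):
    """Keep a single attribute of each row (bag semantics, no deduplication)."""
    return (row[attribute] for row in rows)


plan = project(
    select(
        nested_loop_join(
            table_scan(PURCHASE), PERSON, lambda p, q: p["buyer"] == q["name"]
        ),
        lambda row: row["city"] == "seattle" and row["phone"] > "5430000",
    ),
    "buyer",
)
result = list(plan)
```

Because every operator is a generator, rows flow through the plan one at a time; nothing runs until `list(plan)` pulls from the root.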
Cost Parameters
In database systems the data is on disks, not in main memory
The cost of an operation = total number of I/Os
Cost parameters:
• B(R) = number of blocks for relation R
• T(R) = number of tuples in relation R
• V(R, a) = number of distinct values of attribute a

Cost Parameters
• Clustered table R:
– Blocks consist only of records from this table
– B(R) ≈ T(R) / blockSize
• Unclustered table R:
– Its records are placed on blocks with other tables
– When R is unclustered: B(R) ≈ T(R)
• When a is a key, V(R,a) = T(R)
• When a is not a key, V(R,a) ≤ T(R)
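
The parameters can be computed on a toy relation with the sketch below. It reads blockSize as the number of tuples that fit in one block, which is an interpretation of the formula above; the function names and sample rows are likewise assumptions.

```python
import math


def blocks_clustered(num_tuples, tuples_per_block):
    """B(R) for a clustered table: tuples are packed into R's own blocks."""
    return math.ceil(num_tuples / tuples_per_block)


def blocks_unclustered(num_tuples):
    """Estimate for an unclustered table: about one block read per tuple."""
    return num_tuples


def distinct_values(rows, index):
    """V(R, a): number of distinct values in the column at position index."""
    return len({row[index] for row in rows})


# Hypothetical Employee rows (Name, SSN); SSN is a key, Name is not.
rows = [("John", "999999999"), ("Tony", "777777777"), ("John", "555555555")]
T = len(rows)
v_ssn = distinct_values(rows, 1)
v_name = distinct_values(rows, 0)
```

With 10,000 tuples and 100 tuples per block, the clustered estimate is 100 block reads against 10,000 for the unclustered case.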
Sorting
• Problem: sort 1 GB of data with 1MB of RAM.
• Where we need this:
– Data requested in sorted order (ORDER BY)
– Needed for grouping operations
– First step in sort-merge join algorithm
– Duplicate removal

2-Way Merge-sort: Requires 3 Buffers in RAM
• Pass 1: Read 1MB, sort it, write it.
• Pass 2, 3, …, etc.: merge two runs, write the merged run
[Figure: runs of length L are read through two input buffers, merged, and written through one OUTPUT buffer as runs of length 2L.]
We have 1MB of main memory, but only use 12KB

• Step 2 → runs of length L = 2MB
• Step 3 → runs of length L = 4MB
• ......
• Step 10 → runs of length L = 1GB (why ?)
Need 10 iterations over the disk data to sort 1GB
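
A short calculation shows how the run length grows. The sketch treats 1GB as 1024MB and assumes each merge pass exactly doubles the run length; under that reading, ten doublings take 1MB runs to 1GB. The function name is an assumption for illustration.

```python
def two_way_merge_schedule(file_mb, memory_mb):
    """Run length in MB after the initial sort and after each merge pass."""
    lengths = [memory_mb]  # initial pass: sorted runs the size of main memory
    while lengths[-1] < file_mb:
        # Merging two runs of length L produces one run of length 2L.
        lengths.append(min(2 * lengths[-1], file_mb))
    return lengths


schedule = two_way_merge_schedule(1024, 1)
merge_passes = len(schedule) - 1  # number of doublings needed
```

Since each merge pass reads and writes the whole file, the number of doublings is what drives the I/O cost.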
Phase Two
• Merge M/B – 1 runs into a new run (250 runs)
• Result: runs of length M (M/B – 1) bytes (250MB)

Phase Three
• Merge M/B – 1 runs into a new run
• Result: runs of length M (M/B – 1)² bytes (250*250MB = 62.5GB, larger than the file)

[Figure, both phases: runs on disk are read into buffers Input 1, Input 2, ..., Input M/B; the merged run is written through Output back to disk. M bytes of main memory.]
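
The run lengths can be checked numerically with the sketch below. It assumes M = 1MB and a block size B = 4KB; the 4KB figure is an assumption, chosen because three buffers then occupy 12KB. The exact fan-in is M/B – 1 = 255, which the slides round to 250. `heapq.merge` from the standard library stands in for the multi-way merge step itself.

```python
import heapq


def run_lengths(memory_bytes, block_bytes, phases):
    """Run length in bytes after each phase of a multi-way merge sort."""
    fan_in = memory_bytes // block_bytes - 1  # one buffer is kept for output
    return [memory_bytes * fan_in ** k for k in range(phases)]


def merge_runs(runs):
    """Merge any number of sorted runs into a single sorted run."""
    return list(heapq.merge(*runs))


MB = 1024 * 1024
lengths = run_lengths(1 * MB, 4 * 1024, 3)  # phases one, two and three
merged = merge_runs([[1, 4], [2, 5], [3, 6]])
```

After phase three the run length exceeds 60GB, so a 1GB file is fully sorted well before that point.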