
Table of Contents
1. Introduction
2. Using PostgreSQL
3. Data Organization
4. Getting Data Out
5. Data Aggregation
6. Integrity Constraints
7. Joining Tables
8. Indexing & Performance
9. Postgres-Specific Types
10. Database Administration
11. Appendix

What this book is about


In this book you will learn about SQL (Structured Query Language), which is the
language your database speaks for reading, inserting and updating data.
The focus of this book is on the PostgreSQL database, because:
It's open source and not controlled by any particular entity.
It's stable and has many years of development behind it.
It's full of features that other open-source databases lack.
It's popular, and big players in the Rails world use it (Heroku).
In addition to SQL, you will also learn some database administration basics to
keep everything running smoothly. The goal is not to make you an expert, but
proficient enough that you will be able to identify and fix the most common issues.

Using PostgreSQL
Interacting with PostgreSQL directly
There are many ways you can interact with your database; sometimes the best
way is to send commands directly. This is useful when troubleshooting or testing
out a new query.

Using psql
The psql utility gives you access to your database via a command-line
interface, similar to irb. To open a connection to the database using the default
user (postgres), use this command:
psql -U postgres

In Ubuntu that may not work out of the box so try this instead:
sudo su postgres
psql

Now you should see a prompt that looks something like this:
postgres=#

This means postgres is waiting for you to start typing commands: either a
postgres-specific command or a SQL query. Postgres commands are preceded
by a backslash character. For example, to get the help menu type \? and to exit
type \q .
Here is a table with the most useful commands:
Command       Description
\l            List databases
\dt           List tables
\dn           List schemas
\d table_name Show table details
\q            Quit psql
\s            Show history
\du           List users (known as roles in postgres)
\x            Display results vertically (similar to \G in mysql)
\password     Change password

Other useful commands include dumping the output from a query to a file. For
example:
SELECT * FROM users \g /tmp/users.txt

Note: You want to practice on your own local database to avoid causing any
problems. Refer to the Appendix to learn how to install and setup PostgreSQL.
It's good to know that there is a ~/.psqlrc file you can use to enable certain
options every time you start psql. For example, in mine I have \x auto to let
postgres choose the best output format depending on the situation. Check out the
help menu (with \? ) for other options that you may find useful.

Using pgAdmin
If you like graphical tools then pgAdmin might be for you. If you have a Debian-based system you should be able to install pgAdmin like this:
apt-get install pgadmin3

To connect to a new server click on the plug icon or use the menu option File ->
Add server. At a minimum you need to provide:
A name to identify this server
Hostname or ip address
User and password
Now you can right-click on it and select Connect; this will show all the databases
in your server. You can navigate by expanding the trees (click on the + symbol).
You will find your tables under Schemas. You will also see entries for things you
might not be familiar with (like domains or collations); don't worry about those.

If you prefer a web-based tool then you could try pgweb.

Interacting with PostgreSQL from Ruby


The pg gem
Using the pg gem we can talk to postgres by issuing SQL queries. First we need
to install the gem:
gem install pg

Then we need to connect to the database, execute our query and output the
results.
Here is an example:
require 'pg'

OUTPUT_FORMAT = "%7d | %-16s | %-8s | %s"
OUTPUT_FIELDS = ['pid', 'application_name', 'state', 'query']

# Establish a database connection
conn = PG.connect(dbname: 'postgres', user: 'postgres')

# Our query, ready to go!
query = "SELECT * FROM pg_stat_activity"

# Execute the query
conn.exec(query) do |result|
  result.each { |row| puts OUTPUT_FORMAT % row.values_at(*OUTPUT_FIELDS) }
end

This example will connect to our local database using the postgres user, then
execute a query and nicely output the results.
Note: if you get an authentication error when trying out this code you may
need to set up host-based authentication (instructions are in the database
administration chapter).
The connect method can take other options if you need them; here is a list: host,
port, user, password, dbname, connect_timeout.
The complete documentation for pg is available here:

http://deveiate.org/code/pg/index.html

ActiveRecord
ActiveRecord is what is known as an ORM (Object-Relational Mapping). An ORM
maps database tables to classes and objects in our application. It does this by
generating SQL queries for us and running them on the database on our behalf.
Once everything is set up correctly, getting all the users from our database is as
easy as User.all .
ActiveRecord is very popular because it plays a central role in the Ruby on Rails
framework.
Alternative ORM frameworks for Ruby include: Sequel and ROM.

Using ActiveRecord without Rails


Sometimes you may be working on a Ruby app that requires access to the
database but doesn't need to be a full web app. Since ActiveRecord is
contained inside a separate gem we can use it for our non-Rails project.
Here is an example:
require 'active_record'
require 'pg'

# Connection setup
ActiveRecord::Base.establish_connection(
  adapter: 'postgresql',
  database: 'postgres',
  user: 'postgres'
)

# Define a model
class User < ActiveRecord::Base
end

# Print user with id = 1
p User.find(1)

This example assumes that we have a users table and one user with an id of 1.

Using ORMs vs raw SQL


ORM frameworks are great, even though they have some detractors. They save
a lot of work and allow you to express your queries in an object-oriented way. The
main disadvantages you will find are performance issues and limitations in the API
when trying to get your perfect query. Sometimes you may have to drop in a bit of
raw SQL to get what you want with the best performance.

Data Organization
Before we can insert any data into our database we need to understand how data
is organized in PostgreSQL (and in relational databases in general).
Data is stored in rows, and each row is composed of columns. A group of rows
using the same columns is called a table. And finally, tables are grouped into
databases.

After learning about data types we will see how we can create our own tables.
Right now let's see how we can explore what tables are available on our system.
Open a database connection via psql and type \dt ; this will list all the tables. If
you have loaded the example data (instructions in the Appendix) you should
see something like this:
        List of relations
 Schema |   Name    | Type  |  Owner
--------+-----------+-------+----------
 public | countries | table | postgres
 public | users     | table | postgres

In postgres there is also the concept of schemas, which are just namespaces for a
group of tables. By default all tables are created in the public schema. You can
get a list of schemas in the current database by using the \dn command in psql.
If you create tables in another schema you will have to access them using this
format: schema.table . You can avoid this if you add your schema to the schema
search path.
To view the current search path:
SHOW search_path;

To create a new schema:


CREATE SCHEMA my_new_schema;

To change the search path:


SET search_path TO my_new_schema, public;

When changing the search_path, new tables created during the current psql
session will be created under the schema that's first on the list. In this case
my_new_schema .

Data Types
Every column in our database can hold one specific data type. There are many
data types available in PostgreSQL. Here is a table covering the most useful
ones:
Type         Valid values
integer      -2147483648 to +2147483647
bigint       -9223372036854775808 to +9223372036854775807
decimal      up to 131072 digits before the decimal point; up to 16383 digits after the decimal point
serial       integer type column with auto increment
text         a string of unlimited length
bool         t, true, f, false
timestamp    date and time
timestamptz  date and time with timezone
time         time of day (no date)
For the full list: http://www.postgresql.org/docs/9.4/static/datatype.html

Creating Tables
Now that we know how PostgreSQL is structured we are ready to create our own
tables. We can do this using CREATE TABLE.
This is the general form for CREATE TABLE :
CREATE TABLE <table_name> (<columns>);

Let's create a testing table you can play around with:


CREATE TABLE testing
(id serial, name text, age integer, score decimal, active bool);

Once we send this query to the database via psql we should be able to see our
new table listed when we do \dt .
       List of relations
 Schema |  Name   | Type  |  Owner
--------+---------+-------+----------
 public | testing | table | postgres

To get more details about our new table we can use \d <table_name> . This will
give us all the columns in this table and their data types. It will also list indexes
and constraints if we have any (covered later in the book).
 Column |  Type   |                      Modifiers
--------+---------+------------------------------------------------------
 id     | integer | not null default nextval('testing_id_seq'::regclass)
 name   | text    |
 age    | integer |
 score  | numeric |
 active | boolean |

Altering Tables
After we have created some tables we may want to change them. For example,
we may want to add a new column or rename an existing table.
An ALTER TABLE query is what we need in those cases. Here are some
examples.
Adding a column:
ALTER TABLE <table_name> ADD COLUMN <column_name> <column_type>;

Renaming a column:
ALTER TABLE <table_name> RENAME COLUMN <column_name> TO <new_column_name>;

Renaming a table:
ALTER TABLE <table_name> RENAME TO <new_table_name>;

Changing a column's type:


ALTER TABLE <table_name> ALTER COLUMN <column_name> TYPE <new_type>;
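For example, a sketch using the testing table from earlier (note that changing a type can rewrite the table, and incompatible types may need a USING clause):

-- widen the age column from integer to bigint
ALTER TABLE testing ALTER COLUMN age TYPE bigint;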

Inserting Data
A database isn't too useful unless we put data in it, so let's add some data!
This is what an INSERT query looks like:
INSERT INTO <table> VALUES (<values>);

We need to provide a value for all the columns in the correct order. For an auto-increment field like id we can just say default and postgres will do the right
thing.
For example, we can add a new user to our database like this:
INSERT INTO users VALUES (default, 'Peter', 'peter@gmail.com', 30, 10);

If you just want to set certain fields you can use this syntax:
INSERT INTO <table> (<column list>) VALUES (<values>);
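For example, a hypothetical insert that only sets name and email; the remaining columns fall back to their defaults or NULL:

-- 'Ana' and the email address are made-up illustration values
INSERT INTO users (name, email) VALUES ('Ana', 'ana@example.com');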

Getting Data Out


The Select Query
To be able to ask the database for data we need to know two things: the columns
we want and the table where we are pulling data from. In SQL we can express
those using a SELECT statement.
This is the general form for a SELECT statement:
SELECT <columns> FROM <table>

If we want all the columns we can use an asterisk * , but this is generally
discouraged. Often we don't need every column and we can just fetch the ones
we want.
For example, to get all the country names and their id we can run this query:
SELECT id, name FROM countries;

The output looks like this:


 id |      name
----+-----------------
  1 | United Kingdom
  2 | Egypt
  3 | Spain
  4 | France
  5 | Italy
... more rows omitted for brevity

Filtering with WHERE


Most of the time we will want to filter the rows we get back, so that instead of
getting all the data in our table we get only what we need. For example, we may
want to build a query that returns all the clients with unpaid invoices.
The database can do this filtering of results for us when we use the WHERE
keyword in conjunction with a SELECT query.
This is the general form for a WHERE query:
SELECT <columns> FROM <table> WHERE <condition>

Conditions are very similar to those you would use in Ruby; the main difference
is that you use a single equals sign to test for equality.
Here are some examples:
SELECT <columns> FROM <table> WHERE <column> > <value>
SELECT <columns> FROM <table> WHERE <column> < <value>
SELECT <columns> FROM <table> WHERE <column> = <value>
SELECT <columns> FROM <table> WHERE <column> != <value>

You can also check for multiple conditions at the same time by using AND :
SELECT * FROM users
WHERE age > 30 AND name != 'Curtis';

Beyond WHERE
The WHERE clause is nice, but there are related keywords that can supplement it:
BETWEEN, IN and LIKE. Let's see an example of each.
Find rows in a certain range:
... WHERE age BETWEEN 0 AND 100

Looking for a list of values:


... WHERE id IN (3,4,5)

Find strings that start with a certain letter (case-sensitive):


... WHERE name LIKE 'C%'

Limiting Our Output


Even with a WHERE clause we can still get thousands of rows back; in some
cases we may just want a sample of those rows.
We can do this using the LIMIT keyword. We want to have this at the very end of
our query.
SELECT <columns> FROM <table> WHERE <condition> LIMIT <amount>;
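For example, a sample query against the example users table (illustrative values):

SELECT id, name FROM users WHERE age > 30 LIMIT 5;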

Order!
Another common database operation is sorting. In SQL we use ORDER BY and a
column name to get sorted results.
This is the general form for ORDER BY:
SELECT <columns> FROM <table> ORDER BY <column> <ASC/DESC>

The last part, ASC (ascending) / DESC (descending), is optional. By default
results are sorted in ascending order.
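For example, a sketch listing the oldest users in the example data first:

SELECT name, age FROM users ORDER BY age DESC;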

Changing Data
Updates
This is what an update query looks like:
UPDATE <table> SET <column> = <new_value> WHERE <condition>

If you want to add to the current value (on integer columns) you can do this:
UPDATE <table> SET <column> = <column> + <new_value> WHERE <condition>
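For example, a hypothetical query that increments one user's age by 1 ('Peter' comes from the earlier insert example):

UPDATE users SET age = age + 1 WHERE name = 'Peter';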

Deletes
DELETE FROM <table> WHERE <condition>

When you run a DELETE query from psql you will get a report of the number of
rows that were deleted. For example, if the query didn't delete any records you will
see:
DELETE 0

Note: Pay special attention when writing a DELETE query. Don't miss that
WHERE clause or you will end up deleting the whole table!

Data Aggregation
The database is not limited to just getting rows out; it can also aggregate data for
us. For example, calculating the sum of all the values in one column.

Min, max, sum, avg


We have a number of aggregate functions available in postgres; to use a function
we need to tell it which column it should be applied to.
Here are some examples.
Minimum value:
SELECT min(age) FROM users;

Maximum value:
SELECT max(age) FROM users;

Sum of all the values:


SELECT sum(paid_amount) FROM orders;

Calculate the average:


SELECT avg(paid_amount) FROM orders;

Counting Rows
Counting rows is another kind of aggregation we may want to do; in PostgreSQL
this operation is slow. In a Rails app we can mitigate this by using a counter cache
(which is an extra column on the table we want a count for).
This will give us the total amount of users in our database.
SELECT count(id) FROM users;

We can combine this with a WHERE clause to only count those rows that meet
certain criteria.
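For example, a sketch that only counts users over 30 in the example data:

SELECT count(id) FROM users WHERE age > 30;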

Group By
Grouping is a very interesting operation; with grouping we can (for example) get a
summary of the values for some column in our database.
This query will tell us how many countries we have from every continent:
SELECT
count(id) as count,
continent
FROM countries
GROUP BY continent
ORDER BY count DESC;

The output looks like this:


 count |   continent
-------+---------------
    10 | Europe
     2 | South America
     2 | Asia
     1 | Africa
(4 rows)

Isn't that cool? One more thing: to filter results by the grouped column you will
need to use the HAVING keyword.
For example, to only see continents which have more than 5 countries:
SELECT
count(id) as count,
continent
FROM countries
GROUP BY continent
HAVING count(id) > 5
ORDER BY count DESC;

This will only return Europe for our example database, since it's the only one with
more than 5 countries listed.

Be aware of the error


One common error that you will see when trying to build a GROUP BY query is
the following:
ERROR: column "tags.created_at" must appear in the GROUP BY
clause or be used in an aggregate function

The problem here is that there are multiple values for the created_at column; the
database doesn't know which one to pick, so instead of just picking one at random it
throws that error.
You need to tell the database how to handle this via an aggregate function. For
example, we may want the maximum value for this column, which in the case of a
time column means the newest one.
This query should fix the error:
SELECT content, max(created_at) FROM tags GROUP BY content;

Integrity Constraints
Over time our data will grow and unless we take good care of it there is a good
chance that it will become inconsistent. Using constraints we can define a few
rules that will help keep our data under control.
Note: constraints are different from application-level validations.

Not NULL, Default Values and Unique Rows
Having NULL values all over your database is not a very pleasant experience. We
can avoid this by forcing our columns to never be NULL.
ALTER TABLE <table_name> ALTER COLUMN <column> SET NOT NULL;

If we try to insert a null value after enabling the constraint we will get this error:
ERROR: null value in column "password" violates not-null constraint

Another way to deal with this issue is to set default values. For example, setting
the default as 0 for an integer column may be a good idea.
ALTER TABLE <table_name> ALTER COLUMN <column> SET DEFAULT <default_value>
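For example, a sketch using the age column of the example users table:

ALTER TABLE users ALTER COLUMN age SET DEFAULT 0;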

One more thing that constraints can help us avoid is having duplicate data. We
can accomplish this by declaring a UNIQUE constraint.
ALTER TABLE <table_name> ADD CONSTRAINT <constraint_name> UNIQUE (<columns>)
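For example, a hypothetical constraint that prevents two users from sharing an email address (the constraint name is just a convention):

ALTER TABLE users ADD CONSTRAINT unique_email UNIQUE (email);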

Check Constraints
Another type of constraint is the CHECK constraint. CHECK constraints allow us to
validate data before it makes its way into the database. For example, if we have
an age column we may want to make sure that it can't be negative.
ALTER TABLE users ADD CONSTRAINT positive_age CHECK (age > 0);

If we check the table details now using \d users we should see our new
constraint listed:
Check constraints:
    "positive_age" CHECK (age > 0)

Joining Tables
Table relationships are the reason we call PostgreSQL a relational database.
Relations are not explicitly declared in our database (other than via foreign
keys, which we will talk about later).
What they allow us to do is to separate data into different tables (for example: a user
has many orders and an order has many products).
A note on semantics: traditional SQL literature uses the term relation to refer
to what we call tables.

Joining Data
To put together data from two or more related tables we will have to use joins.
There are different types of joins we can use. For example, the most basic join is
performed via a WHERE clause.
Here is the syntax:
SELECT * FROM <table1, table2, tableN>
WHERE <table1.id = table2.table1_id>

Suppose we store our users' country in a separate table. Using this query we can
retrieve the actual country name for each user, instead of the country id.
SELECT users.name, countries.name as country
FROM users, countries
WHERE users.country_id = countries.id;

This is what the output looks like:


   name   |    country
----------+-----------------
 Curtis   | Hungary
 Albertha | Italy
 Brionna  | Portugal
 Camron   | Egypt
 Wanda    | Japan
 Palma    | Sweden
 Cortez   | Sweden
 Kennith  | Italy
 Ima      | United Kingdom
 Beth     | Japan
 Alanna   | France
 Kevon    | Chile
 Jordi    | Spain
... more rows omitted for brevity

Another type of JOIN is the INNER JOIN. This join will return rows that match the
condition in both tables. Here is the same example but using the JOIN clause.
The output is the same because the implicit WHERE join and the INNER JOIN are
equivalent here.

SELECT users.name, countries.name as country
FROM users
JOIN countries ON users.country_id = countries.id;

Other types of joins include: LEFT OUTER JOIN , RIGHT OUTER JOIN , FULL OUTER JOIN
and the CROSS JOIN (which joins every row in one table with every row in the
other; this produces 600 rows for our example users table!).
SELECT * FROM users CROSS JOIN countries;

The main difference between the types of joins is in which rows are returned.
Most of the time your standard INNER JOIN will be what you want.
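For example, a LEFT OUTER JOIN sketch: unlike the INNER JOIN above, it keeps every user, showing NULL as the country for users whose country_id has no match:

SELECT users.name, countries.name AS country
FROM users
LEFT OUTER JOIN countries ON users.country_id = countries.id;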

Types of Associations
There are 3 different ways that a pair of tables can be related:
One-to-one
One-to-many
Many-to-many
The one-to-one relationship is probably the least common of the three. Every
row has exactly one corresponding row in the other table. In ActiveRecord we
use belongs_to for the model with the foreign key and has_one for the
associated model.
The one-to-many association is more common and we have an example of it in
the schema included with this book. Users belong to one country, but countries
have many users.

In ActiveRecord terms:
Country -> has_many :users
User -> belongs_to :country

Finally, the many-to-many association. This type of association requires an
intermediate table which is composed of two foreign keys (the post_tags table in
the image below).
Example: A post has many tags, and a tag has many posts.

In ActiveRecord terms:
Post -> has_and_belongs_to_many :tags
Tag -> has_and_belongs_to_many :posts

The alternative to has_and_belongs_to_many (often abbreviated as HABTM) is using
has_many :through . The difference is that the latter allows us to define a join
model with validations, callbacks and extra attributes.

Generating a Diagram of your Database


In a Rails project you can end up with many tables, and visualizing how they are
related is not easy. The rails-erd project has you covered.
To install rails-erd simply add it to your Gemfile:
gem 'rails-erd', group: :development

Then run bundle install and once it's installed you can run it with bundle exec
erd to generate a visualization of your database schema.
Loading application in 'country-data'...
Generating entity-relationship diagram for 2 models
Diagram saved to 'erd.pdf'.

This is what the output looks like for a simple app with 2 models:

Foreign Key Constraints


The foreign key is the column that contains the id of the associated row in another
table. For example, if we have a books table and an authors table, the foreign key
will be an author_id column on books (each book references its author).
Foreign key constraints help us enforce referential integrity. For example, if
we have a many-to-many join table it wouldn't make much sense to allow it to have
records with no associated rows (like an orders table referencing products that
don't exist).
Here is how you can add a new foreign key constraint:
ALTER TABLE <table>
ADD CONSTRAINT <constraint_name>
FOREIGN KEY (<column>)
REFERENCES <other_table> (<other_column>);

Another thing we can do is RESTRICT a row deletion (for example: prevent
deleting a product if an order for it still exists) or CASCADE (for example: delete all
the user's comments when deleting the user).
Here is how you can add a RESTRICT constraint (CASCADE is very similar):
ALTER TABLE users
ADD CONSTRAINT valid_country
FOREIGN KEY(country_id) REFERENCES countries(id)
ON DELETE RESTRICT;

Indexing & Performance


Once your database starts growing everything will seem to slow down. But fear
not: there are a few things you can do to speed up your database, like adding
indexes (and sometimes removing them too).

Why do we need indexes?


When you ask your database to filter by a certain value (using WHERE) it has to
read every single row; this is called a table scan or sequential scan.
Imagine you have tens of thousands of rows: looking for the specific rows that
have 'David' in the name column is going to take a while.
Indexes are a database feature that exists to deal with this problem. An index is a
separate data structure (a B-tree by default) that makes it a lot faster to find specific
records. A B-tree is similar to a binary tree but with multiple nodes per branch instead
of just two. This is what it looks like:

You don't need to worry about the details, but it's still good to know how it works.
Other types of index available in postgres are: Hash, GiST and GIN.
Most of the time you just want B-tree indexes, which are the default. For
advanced data types (like array, hstore and JSON) you will need a GIN index. A
GIN index supports fields with multiple values.

Adding and Removing Indexes


What indexes do we have?
There are two ways to look up the active indexes in your database. The first is by
using \di in psql, which will list all the indexes in the current schema. The second
is by inspecting table details using \d <table_name> .


Here is an example:
Table "public.users"
Column | Type | Modifiers
-----------+-----------------------------+----------------------------id | integer | not null default nextval(...
name | character varying |
age | integer |
email | character varying |
country_id | integer |
created_at | timestamp without time zone | not null
updated_at | timestamp without time zone | not null
Indexes:
"users_pkey" PRIMARY KEY, btree (id)

Notice the Indexes section at the end. Every line tells us the name of the index,
the type of index and the indexed column(s).

Adding new indexes


Now that we know what indexes we have we may want to add a new one. Of
course you may be wondering what indexes you need; pghero can help us with
that.
Pghero is a series of SQL functions and views that are very helpful for database
administration. You can find pghero here: https://github.com/ankane/pghero.sql
Note: in addition to the raw SQL queries there is also a web version of
pghero.
To install pghero:

CREATE SCHEMA admin_views;
SET search_path TO admin_views, public;
--- Copy the text from this url and paste into psql:
--- https://raw.githubusercontent.com/ankane/pghero.sql/master/install.sql
SET search_path TO public, admin_views;

Now you can use the pghero queries, for example:


SELECT * FROM pghero_missing_indexes;

Unfortunately you won't be able to see it in action with our little test database,
since it requires at least 10,000 rows and some database activity. If you have this
problem in your own database you will have to investigate what the most
popular queries are to see if an index could speed them up (this is covered in the
database administration section).
Once you know what you need you can create the index like this:
CREATE INDEX <index_name> ON <table> (<columns>);
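For example, a hypothetical index on the email column of the example users table (the index name is just a convention):

-- index_users_on_email is an illustrative name
CREATE INDEX index_users_on_email ON users (email);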

Please be aware that creating an index will block all write operations on the table.
Creating an index on a large table can take a long time so you may want to
schedule a maintenance period.
There is also an option to create an index without blocking writes, using the
CONCURRENTLY option.
CREATE INDEX CONCURRENTLY <index_name> ON <table> (<columns>);

If you choose that option, be aware that the operation can be interrupted and
leave behind an invalid index; just drop the index and start again if that happens. You
will also have increased server load during the operation.

Index selectivity
One concept we need to be familiar with when thinking about indexes is index
selectivity. The selectivity of an index is the number of different values the
candidate column can have divided by the total number of rows. The perfect
selectivity (1) is achieved with unique fields.


Selectivity helps us evaluate how efficient a B-Tree index will be. For example, if
we have a gender column where the only values are male and female that gives
us very low selectivity, therefore making an index for this column less efficient
because we would have to scan about half the table. We also need to be aware of
the overhead that a new index adds to our writes (insert/update/delete).
You can use the following query to find the selectivity of a column:
SELECT count(distinct <column>)::decimal / count(<column>) as selectivity
FROM <table>;

We are looking for a selectivity ratio of at least 0.80. If it's lower than that we don't
want to create an index in most cases.
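For example, a sketch that measures the selectivity of country_id in the example users table (with only a handful of countries the ratio will be low, so a plain B-tree index there would not help much):

SELECT count(distinct country_id)::decimal / count(country_id) AS selectivity
FROM users;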

Removing indexes
Indexes are cool and all, but having too many of them can be disastrous for your
performance. Every time you insert or update a record the indexes have to be
updated too.
This is how you drop an index:
DROP INDEX <index_name>;

Good candidates for removal are low-usage indexes (you can find these with
pghero_unused_indexes or by inspecting the pg_stat_user_indexes table) and
low-selectivity indexes.

Explain it to me!
Slow database queries can make your application perform very poorly. The good
news is that there are ways to analyze and find out which queries are slow
and why.
If we want to analyze a query we have to prepend EXPLAIN to it. Postgres
will not run this query, but it will give us the query plan. This plan is composed of
the operations postgres needs to perform to execute the query and get our data ready.
Postgres also calculates an estimated cost and an estimated number of rows
that need to be read.
Here is the output of explain for a simple select query (EXPLAIN SELECT *
FROM users):
                      QUERY PLAN
-------------------------------------------------------
 Seq Scan on users  (cost=0.00..1.40 rows=40 width=92)
(1 row)

The explain output is composed of what postgres calls nodes. Every node is an
operation on the database: scanning a table, scanning an index, sorting, filtering
output by a where clause, etc.
Every operation has an estimated cost, rows and width (average size in bytes
of every row). To optimize a query we want to look for high-cost operations.
Using EXPLAIN ANALYZE we can get more accurate results, since the query will
actually be run and real stats can be gathered.
This is the query plan for the same query, but this time using EXPLAIN ANALYZE :
                          QUERY PLAN
--------------------------------------------------------------
 Seq Scan on users  (cost=0.00..1.40 rows=40 width=92)
   (actual time=0.009..0.066 rows=40 loops=1)
 Planning time: 0.277 ms
 Execution time: 0.150 ms
(3 rows)

A great site that can help analyze your EXPLAIN queries is explain.depesz.com.
There is a self-hosted version of this site on github if you are worried about
sharing your query data with an external entity.

The site will parse your query and use color-coding to indicate the slowest parts.

Postgres-Specific Types
PostgreSQL is a very powerful database, to the point that it can substitute for what
many people would use a NoSQL database for (schema-less data).
In this chapter we are going to explore the following postgres-specific data types:
arrays, hstore and JSON.

The Array type


The array type in postgres is very interesting because it can substitute for an entire
table in a one-to-many relationship. It can also be indexed using a GIN index,
which will give us fast queries.

Creating Array Columns


To add a column with array as type:
ALTER TABLE testing ADD COLUMN tags text[];

This will create a tags array which will contain values of type text. Arrays can be
composed of any postgres built-in type.

Inserting & Querying Array Values


We need some special syntax for inserting arrays; here is an example:
INSERT INTO testing (id, tags) VALUES (1, array['a', 'b']);

Query
When doing a query for an array field you have to use another array to compare
against. There are two main conditions we can check for: exact equality and
inclusion.
Example: find by equality
SELECT * FROM testing WHERE tags = array['a', 'b'];

Example: contains operator
SELECT * FROM testing WHERE tags @> array['a'];

Update

To update an array column we have a few options: replacing the entire array,
changing a single value or appending extra values.
Example: using the concatenation operator
UPDATE testing SET tags = tags || array['f', 'g'] WHERE age = 1;

Delete
The array_remove function is the easiest way to delete an element from an array
field. You can see it working using this query (which doesn't change any data):
SELECT array_remove(ARRAY[1,2,3,2], 2)

Then for the real delete:


UPDATE testing SET tags = array_remove(tags, 'b') where age = 1;

Indexing arrays
As I mentioned before, it's possible to index arrays; here is the syntax for that:
CREATE INDEX <index_name> ON <table> USING gin(<column>);
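For example, a sketch for the tags array we added to the testing table earlier (the index name is illustrative):

CREATE INDEX index_testing_on_tags ON testing USING gin(tags);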

Array functions
There are some array-specific functions we can use to help us. For example,
using the unnest function we can treat an array field like a temporary table /
regular rows.
SELECT unnest(tags) FROM testing WHERE tags is NOT NULL;

Gives us:
 unnest
--------
 a
 c
 b
(3 rows)

One practical use for unnest is to aggregate array values across the table. This
query gives us a count of every tag we have:
SELECT
unnest(tags),
count(*)
FROM testing
GROUP BY unnest(tags)
ORDER BY 2 DESC;

Hstore: Key-value Pairs
The hstore type allows us to store key -> value pairs in a column. Both the key
and the value are expected to be strings.
Since hstore is implemented as an extension you have to enable it:
CREATE EXTENSION hstore;

CRUD operations
Creating a table with an hstore field:
CREATE TABLE products (id serial, attributes hstore);

Querying
The main operations we may want to do with hstore fields are:
Getting the value for a specific key
Find all rows that have a specific key defined
Find all rows which have a specific key/value pair defined
Operation         Description
hstore -> key     Get value
hstore ? key      Contains key
hstore @> hstore  Contains key/value pair

Examples (using the products table created above):
SELECT attributes -> 'brand' FROM products;

SELECT id FROM products WHERE attributes ? 'brand';

SELECT * FROM products WHERE attributes @> 'brand=>apple'::hstore;

Update & Delete


This query will add a new key or update it if it already exists:
UPDATE products SET attributes = attributes || 'c=>20'::hstore;

You can delete an hstore key like this:
UPDATE products SET attributes = delete(attributes, 'c');

If you need more info about the hstore type you can find the official documentation
here: http://www.postgresql.org/docs/9.4/static/hstore.html

The JSON Type


In postgres we can store JSON and query it. Since version 9.4 two types of json
columns are available: JSON and JSONB. The main difference is that JSON will
just store json as a string (without parsing it) which makes insertion of new data
faster.
If you use the JSONB format postgres will pre-process your json input into a binary
format, which makes it faster to query and work with, but a bit slower to add
new data. In addition, JSONB supports indexing (using a GIN index), while the
regular JSON type does not.
The main problem with the JSON/JSONB types is that it's pretty hard to manipulate
individual values (there is no concatenation operator like with array or hstore).
You can find the official documentation here:
http://www.postgresql.org/docs/9.4/static/datatype-json.html

JSON operations
Here are a couple examples of using a JSONB column.
Adding a JSONB column:
ALTER TABLE testing ADD COLUMN json JSONB;

Setting a json column (replaces current values):


UPDATE testing
SET json = '{"test": true, "secret": 123}'::jsonb
WHERE age = 100;

Getting a value:
SELECT json -> 'test' FROM testing;

Finding rows that have a certain key/value pair set:


SELECT *
FROM testing
WHERE json @> '{"total": 500}'::jsonb;

Note: if you want to update a single value you will need some custom
functions, you can find them here:
https://gist.github.com/matugm/f12c5f28d40d83d65a2f#file-json-update-sql
There is no built-in delete operation for json columns.

Using Postgres Arrays in Rails


In modern versions of Rails we can use these advanced column types without
having to use raw SQL queries. We can create a migration with an array column
like this:
create_table :books do |t|
  t.string 'title'
  t.integer 'isbn'
  t.string 'tags', array: true
end

add_index :books, :tags, using: 'gin'

Then we can simply query using where:


Book.where("tags @> array[?]", search_tags)

Database Administration
Once your application is up and running there are some tasks you need to do to
keep it in top condition.

User roles & access control


Having control over when and how our data is accessed is important if you care
about your data at all. Postgres uses the term roles to refer to database
users. It is important to note that database users have no relation to operating
system users.

Listing users
In psql we can use \du to get a list of users.
                             List of roles
 Role name |                   Attributes                   | Member of
-----------+------------------------------------------------+-----------
 postgres  | Superuser, Create role, Create DB, Replication | {}

In this output we can see that we only have one user (postgres) and that it has
Superuser privileges. Ideally you will want to have a low-privileged user for your
application to use, which will give you better security in case your application
server is compromised or the secrets are leaked.
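For example, a minimal sketch of such a low-privileged setup (the role name and password are hypothetical):

-- 'app' and the password are illustrative values
CREATE USER app WITH PASSWORD 'example-password';
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO app;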

Creating & deleting users


This will create a user with the specified password.
CREATE USER rails WITH PASSWORD 'jo45alkf';

To delete a user we can use this query:


DROP ROLE david;

Here is the official documentation if you need more details:


http://www.postgresql.org/docs/9.4/static/sql-createrole.html

Host Based Authentication

The pg_hba.conf file in postgres allows you to configure who can connect to the
database and how. There are a number of authentication methods you can
configure. One that I find very useful in a development setup is the trust method.
The trust method will allow you to log in without a password as long as the
connection is coming from the configured IP address. To enable this, add or
uncomment this line in your pg_hba file:
host all all 127.0.0.1/32 trust

Then reload your postgres service, using sudo service postgresql reload or sudo
systemctl reload postgresql .

Here is a list of common authentication methods:

Name      Description
trust     Always allow access (no password required)
reject    Always deny access
md5       Require an md5-hashed password
password  Require a plaintext password
peer      Allow access if the operating system user matches the database user name

Check out the documentation for the full list:


http://www.postgresql.org/docs/9.4/static/auth-pg-hba-conf.html

Configuration
Config files & config info
To find the location of your configuration file you can use the following query:
SHOW config_file;

Another interesting query is this one, which will give you a count of the current
configuration parameters grouped by source.
SELECT count(*), source FROM pg_settings GROUP BY source ORDER BY count;

Output:
 count |        source
-------+----------------------
     1 | environment variable
     2 | client
    12 | override
    19 | configuration file
   205 | default
(5 rows)

In my case you can see that I'm running mostly default values, with exactly 19
parameters coming from the configuration file.

Configuration Tuning
There are a few parameters in the postgres configuration file that we can tweak
for better performance. Note that the ideal numbers depend on your hardware and
use case.
Setting               Recommended value
work_mem              Used for in-memory sorting. Normal value: 4-16 MB
shared_buffers        25% of total memory
effective_cache_size  50% of total memory
checkpoint_segments   How many segments to wait before a checkpoint. A good starting value is 8 (default is 3)
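On postgres 9.4 and later you can also change these settings without editing the file by hand; here is a sketch, assuming a modest 8MB work_mem (new sessions pick it up after a reload):

-- '8MB' is an illustrative value; tune it for your hardware
ALTER SYSTEM SET work_mem = '8MB';
SELECT pg_reload_conf();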

Logging
Reading the logs can be useful when we are trying to troubleshoot our database
or just doing a regular health check. In Ubuntu the logs can be found under
/var/log/postgresql by default. If you are on a different distribution then a pretty
good guess is that the logs are under /var/lib/pgsql/data/pg_log .


There are some logging-related settings you can change. You can find them
using this query:
SELECT name, setting, short_desc FROM pg_settings WHERE name LIKE 'log_%';

Database statistics
If you are wondering how your database is doing, there are some queries you
can run to gather information.

Active queries
Using this simple query you can get a list of the current activity on your database.
Watch out for long-running queries consuming too many resources!
SELECT application_name, query_start, state, query FROM pg_stat_activity;

Output:
-[ RECORD 1 ]----+------------------------------------------------------
application_name | bin/rails
query_start      | 2015-08-15 02:08:32.775124+02
state            | idle
query            | SELECT 1
-[ RECORD 2 ]----+------------------------------------------------------
application_name | psql
query_start      | 2015-08-15 19:50:53.504786+02
state            | active
query            | SELECT application_name,query_start,state,query FROM

In this case we can see an idle query from rails. This query is used to check that
the database connection is alive, so we don't need to worry about it.
Here is the relevant source code from ActiveRecord:
https://github.com/rails/rails/blob/v4.1.0/activerecord/lib/active_record/connection_adapters/postgresql_adapter.rb#L587
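If you do spot a runaway query, you can cancel it by its pid (12345 here is a hypothetical pid taken from the pg_stat_activity output):

-- 12345 is a placeholder pid; pg_terminate_backend kills the backend entirely
SELECT pg_cancel_backend(12345);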

Table stats
A good thing about postgres is that it keeps a lot of metadata about its
operations. With this query we can get per-table stats (like the number of inserts
and deletes):
SELECT * FROM pg_stat_user_tables;

That query will give you a lot of information, but it may be hard to visualize,
especially if you have a lot of tables. Since this is just a regular table we can
query and filter it, so I wrote this complicated-looking query which prints a
summary of all your tables:
SELECT *,
  round(100 * (
    float4(total_reads) /
    float4(total_writes + total_reads))::numeric, 2
  ) AS read_percent,
  round(100 * (
    float4(total_writes) /
    float4(total_writes + total_reads))::numeric, 2
  ) AS write_percent
FROM
  (SELECT relname,
     n_tup_ins AS inserts,
     n_tup_upd AS updates,
     n_tup_del AS deletes,
     (n_tup_ins + n_tup_upd + n_tup_del) AS total_writes,
     (seq_scan + idx_scan) AS total_reads
   FROM pg_stat_user_tables) AS t;

Table size
If you want to check out the disk space your tables are using, pghero has a
function to help you:
SELECT * FROM pghero_relation_sizes WHERE type = 'table';

Note: for this to work you must have installed pghero as instructed in the
Adding new indexes section.
Alternatively, you can use a bash script to get the size of all your database files.
For this script to work you must be inside the base directory, which is under the
data directory.

One easy way to find your data directory is by using this command: pg_lsclusters
(Debian/Ubuntu only) or ps -U postgres -f .
du -sh * | while read SIZE OID; do echo "$SIZE `oid2name -q | tr -s ' ' | grep -E "^ $OID " | cut -f 3 -d ' '`"; done | sort -rn | column -t

Note: this is all one command; copy & paste the whole line!

Query stats
Before we can use query stats we have to do some setup first.
1. Add to postgresql.conf:
shared_preload_libraries = 'pg_stat_statements'

2. Reload the service
Depending on your system you may need to run:
sudo service postgresql reload

or
sudo systemctl reload postgresql

3. Enable the extension
CREATE EXTENSION pg_stat_statements;

4. Get stats
Before you can get some meaningful stats you have to leave your app running
(assuming it's in production) for a few hours. Then run this query:
SELECT
  total_time AS total,
  (total_time / calls) AS avg,
  calls,
  query
FROM pg_stat_statements
ORDER BY total DESC
LIMIT 10;

This will give you a list of the slowest queries.


Documentation:
http://www.postgresql.org/docs/current/static/pgstatstatements.html

Backups
You put a lot of effort into your database, but you need to take another step to
make sure you can recover in case of a disaster. Having a backup plan is
important as you may imagine. In this section I will explain how you can create
backups from your database data and how to restore them.

Plain-text Backup
Using the pg_dump utility you can create a backup file of your database. The
format of this file is plain text; you can open it and see how it's composed of SQL
commands.
This command will generate the backup as backup-<current_date>.sql :
pg_dump -U postgres > backup-$(date +%Y-%m-%d).sql

The utility has a few useful options. For example, -s lets you get an empty copy
of your tables, which can be useful when setting up the application in a testing
environment. Using -n and -t you can dump specific schemas and tables.
To restore a pg_dump backup:
psql -U postgres -1 database_name < backup.sql

You will need to create the database if it doesn't exist, using the createdb
command.
Tip: always test your recovery procedure in advance (on a dev server with the
same postgres version) to make sure it will work when it's needed.
The only problem with pg_dump is that recovery is not very efficient for large
databases (> 1GB). In the next section we will explore another backup method.

Physical Backup
Another way to take a backup of your database is to have a copy of the data files.

You can't just copy the files directly, since there are changes happening to them on
a live server.
If you use the pg_basebackup utility it will take care of everything for you; here is
the command:
pg_basebackup -x -P -D backup$(date +%Y-%m-%d) -h localhost -U postgres

Note: you need to have the replication permission set in pg_hba.conf for this
to work.
This will produce a base.tar.gz file inside a folder named after the current date.
To restore a backup done with pg_basebackup follow these steps:
1. Stop the postgres service
2. Rename the old data folder to data.old
3. Create a new data folder; make sure it belongs to the postgres user and
has 0700 permissions
4. Change into the new data folder and copy over your base.tar.gz backup file
5. Uncompress with tar zxvf base.tar.gz
6. Start the postgres service
After recovery you will see this in the logs:
LOG: database system was interrupted; last known up at 2015-08-23
LOG: redo starts at 0/38000084
LOG: consistent recovery state reached at 0/380000A8
LOG: redo done at 0/380000A8

Note: even if you are using pg_basebackup as your main backup I would still
recommend having a plain-text backup just in case there are any issues with
the recovery process.

Appendix
Installing Postgres
In Ubuntu you can use this command:
sudo apt-get install postgresql postgresql-contrib

The installation process runs a script that creates a cluster. A cluster is
composed of the files required (like configuration and system tables) to start
postgres.
You should see a summary like this:
Creating new cluster 9.3/main ...
  config /etc/postgresql/9.3/main
  data   /var/lib/postgresql/9.3/main
  locale en_US.UTF-8
  port   5432

An important line here is the locale; you want it to be UTF-8 to avoid encoding
issues. If you run into locale issues, two commands to investigate are locale and
locale-gen .

Note: Ubuntu doesn't always ship the latest version of postgres. For example,
in 14.04 LTS the version of postgres is 9.3, which doesn't support the JSONB
data type. The postgres developers maintain an APT repository which
contains the latest version. You can find the instructions here:
https://wiki.postgresql.org/wiki/Apt

Loading the Example Data


After postgres is up and running you can load the example db provided with the
book using this command:
sudo su postgres
psql < example-data.sql

Properties of relational databases: ACID


Relational databases like postgres have a set of properties that make them
reliable. These properties are summarized by the ACID acronym. You don't need
to memorize these; just be aware that they exist and that not all databases offer
these properties.

A - Atomicity
Changes in the database happen in an atomic way, meaning that either the full
change is applied or none of it.

C - Consistency
Transactions will always leave the database in a valid state. This means (for
example) that you will never have duplicated rows if you have defined a UNIQUE
constraint.

I - Isolation
If multiple transactions run at the same time, isolation ensures that they don't
interfere with each other; the result is as if they had been executed one after
another. In postgres this is done via the MVCC
(Multiversion Concurrency Control) system.
You can learn more about MVCC here:
http://www.postgresql.org/docs/9.4/static/mvcc-intro.html

D - Durability
Once an operation is committed it stays like that, even in the event of a system
crash or software error. Postgres can do this thanks to the WAL (Write-Ahead
Log) system.
You can read more about the WAL here:
http://www.postgresql.org/docs/9.4/static/wal-intro.html

Readline Cheatsheet
Psql uses the readline library to read user input; this means that you can use a
few shortcuts to be faster.
Shortcut  Description
CTRL + a  Move the cursor to the start of the line
CTRL + e  Move the cursor to the end of the line
CTRL + w  Delete one word
CTRL + u  Delete everything on the left
CTRL + k  Delete everything on the right
CTRL + y  Restore the last thing deleted by another shortcut
CTRL + l  Clear the screen
CTRL + r  Search the command history

Tip: readline is very popular so you can use these shortcuts in other
command-line interfaces like Bash.

Active Record & SQL


This table shows common ActiveRecord methods and how they translate to SQL.
AR                            SQL
Country.all                   SELECT * FROM countries
Country.where(name: 'Spain')  SELECT * FROM countries WHERE name = 'Spain'
User.pluck(:email)            SELECT email FROM users
User.maximum(:age)            SELECT max(age) FROM users
User.all.limit(5)             SELECT * FROM users LIMIT 5
User.create(name: 'Peter')    INSERT INTO users (name) VALUES ('Peter')
User.first.destroy            DELETE FROM users WHERE id = 1

Tip: using the rails db command inside a rails project will open a psql
session.
