Database

Databases
Working group: Alberti, Boldi, Gaito, Grossi, Malchiodi, Mereghetti, Morpurgo, Rosti, Palano, Zanaboni & Jeremy Sproston (Torino)
Purpose
• Information management
• Information: hard to define, but everyone recognizes how important managing it is in many kinds of activity
• Examples: companies, banks, registry offices, universities, airlines, ...
Information system
• The set of an organization's resources devoted to managing information
• Management: acquisition, processing, storage, production
• The concept of an information system has existed for centuries (e.g., registry offices)
Computer system
• The part of the information system that manages information automatically (computers, networks, software, ...)
[Figure: the computer system drawn inside the information system]
Data and information
• Data: character strings, numbers, images, sounds, ...
• Information: conveyed by data once they are suitably interpreted
• Example:
“Paolo”, “Rossi”, 1100 (data)
first name and last name of the director, salary (interpretation)
Information and data
Representation of information:
Based on an encoding (interpreted by a program)
Data = elements of information that have no interpretation by themselves
Mario Rossi → first and last name
2334455 → student ID number
Information in computer systems
[Figure: inside the computer system, information becomes DATA through representation, and data become information again through interpretation]
• Database: a collection of data representing information of interest to a given information system
Data and applications
Data can change over time (for example, a bank account balance)
The way data are represented in a system is usually stable
The operations on data change often (for example, searches)
→ separate the data from the applications that operate on them
Problems with data management
• Traditional programs operate independently on their own copies of the data
[Figure: Prog1, Prog2 and Prog3 each work on a separate copy (copy1, copy2, copy3)]
• Problems: inconsistency, redundancy
Optimal solution
• A single data resource accessible to several programs
[Figure: Prog1, Prog2 and Prog3 all access the same data]
DBMS
Database Management System
• Software systems for managing collections of data that are large, shared and persistent, while ensuring reliability and privacy
• DBMSs must be efficient and effective
• DATABASE = a collection of data managed by a DBMS
Characteristics of DBs and DBMSs
Persistence = data are always available; they do not “live” inside a single application
Reliability = data protection; after a hardware or software failure the data can be restored (at least partially)
Privacy = different permissions depending on the user
Efficiency = acceptable response times and storage use (this depends heavily on how the data are stored)
Effectiveness = making the organization's activities easier
Logical and physical data models
DATABASE = DATA + DBMS
• Logical model: rules for structuring the data according to certain properties, plus the operations on the data
• Physical model: the representation of the logical schema by means of physical storage structures, e.g. files, lists, trees, ...
DBMS: ANSI/SPARC architecture
[Figure: three users, each with their own external schema; the external schemas rest on a single logical schema, which rests on the physical schema, which rests on the DATA]
Languages in DBMSs
• DDL: data definition language
defines the external, logical and physical schemas and the access authorizations
• DML: data manipulation language
lets you query and update the database
• In relational databases (which we will see), SQL provides both
Relational DDL and DML
Two paradigms:
Declarative → SQL (Structured Query Language)
Procedural → relational algebra
Several commercial offerings (SQL is standardized by ANSI/ISO, but in practice each product's syntax differs slightly)
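As a rough illustration of the DDL/DML split, the sketch below uses Python's built-in sqlite3 module. SQLite is an assumption made for convenience (the slides use Access), and the table is the STUDENTI(matr, nome, cognome) example from these slides.

```python
import sqlite3

# In-memory database: nothing is written to disk (assumption for the demo).
conn = sqlite3.connect(":memory:")

# DDL: define the structure (schema) of a table.
conn.execute(
    "CREATE TABLE studenti (matr INTEGER PRIMARY KEY, nome TEXT, cognome TEXT)"
)

# DML: insert, update and query rows.
conn.executemany(
    "INSERT INTO studenti (matr, nome, cognome) VALUES (?, ?, ?)",
    [(305011, "Carlo", "Rossi"), (356433, "Mario", "Verdi")],
)
conn.execute("UPDATE studenti SET nome = 'Marco' WHERE matr = 356433")
rows = conn.execute(
    "SELECT matr, nome, cognome FROM studenti ORDER BY matr"
).fetchall()
print(rows)
```

The CREATE TABLE statement changes the schema, while INSERT, UPDATE and SELECT only touch the instance.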
Designing a DB: schema of a relation
Schema = definition of the structure of the relation
It is the heading of the relation:
RelationName(Attr1, …, Attrn)
It does not change over time (except when the DB is restructured)
For example, in Dbcorsi:
Docenza(corso, docente)
Instance of a relation
Instance = the data describing the individuals that belong to the relation (the rows of the table)
It changes over time (data about individuals are added, modified or deleted)
Schema and instance of a DB
Schema = the set of the relation schemas (structure)
Instance (or state) = the values of the data in the tables (rows)
Database design
• Problem: what information should go into a database, and what links exist among its pieces?
• A conceptual schema of the database, from which the structure of the data is derived according to the logical model of the DBMS (logical schema)
• One tool for expressing conceptual schemas: Entity-Relationship (E-R) diagrams
E-R diagrams
• Entities: classes of relevant objects
• Relationships: links between entities
• Attributes: describe relevant properties of entities and relationships
[Figure: the graphical symbols for an ENTITY, a RELATIONSHIP and an attribute]
A conceptual schema
• Elementary E-R diagram for a database storing students, subjects and the related exams
[Figure: entity STUDENTE (attributes: matricola, nome, cognome, codice fiscale) linked through the relationship ESAME (attributes: data, voto) to entity MATERIA (attributes: nome, titolare, commissione; the last one only if the committee is unique)]
Identifiers
• Groups of attributes that uniquely identify the occurrences of an entity
[Figure: the STUDENTE-ESAME-MATERIA diagram again, marking one single-attribute identifier and one multi-attribute identifier]
Cardinality of relationships
• many to many (N-N):
[Figure: STUDENTE (N) ESAME (N) MATERIA]
• Each student may have taken several exams
• Each exam may have been taken by several students
• one to many (1-N):
[Figure: STUDENTE (1) ISCR (N) CdL]
• Each student is enrolled in one degree course (CdL)
• Each degree course can have several enrolled students
• one to one (1-1):
[Figure: DIRETTORE (1) DIRIGE (1) DIPART]
• Each director heads one department
• Each department has one director
The relational model
• Data are structured in tables
Table: STUDENTI(matr, nome, cognome)
Fields: the columns of the table
Records: the rows of the table
STUDENTI
matr    nome     cognome
305011  Carlo    Rossi
356433  Mario    Verdi
345553  Franco   Verdi
345434  Daniele  Rossi
Schema and instances
Schema:
STUDENTE(Matricola, Cognome, Nome, Data di Nascita),
CORSO(Codice, Titolo, Docente),
ESAME(Studente, Voto, Corso)
Instance:
STUDENTE
Matricola  Cognome  Nome   Data di Nascita
6554       Rossi    Mario  5/12/1978
8765       Neri     Paolo  3/11/1976
9283       Verdi    Luisa  12/11/1979
3456       Rossi    Maria  1/2/1978
ESAME
Studente  Voto  Corso
3456      30    04
3456      24    02
9283      28    01
6554      26    01
MATERIA (the CORSO relation of the schema)
Codice  Titolo   Docente
01      Analisi  Neri
02      Chimica  Bruni
04      Chimica  Verdi
Links between data
• The relational model is value-based: rows in different tables are linked by the values they share
[Figure: the STUDENTE, ESAME and MATERIA tables of the previous slide; each value of ESAME.Studente matches a STUDENTE.Matricola and each value of ESAME.Corso matches a MATERIA.Codice]
Integrity constraints
• Properties that instances must satisfy
• Types of constraints
intra-relational: on fields, on records, on a table
inter-relational: across several tables (referential integrity)
Examples of constraints
Studenti
Matricola  Cognome  Nome   Nascita
276545     Rossi    Maria  23/04/1968
276545     Neri     Anna   23/04/1972
788854     Verdi    Fabio  12/02/1972
Esami
Studente  Voto  Lode    Corso
276545    28    e lode  01
276545    32            02
788854    23            03
200768    30    e lode  03
Corsi
Codice  Titolo   Docente
01      Analisi  Giani
03      NULL     NULL
02      Chimica  Belli
Intra-relational constraints violated: table constraint (Matricola 276545 appears twice in Studenti), record constraint (honours, "e lode", with a Voto of 28), field constraint (Voto 32)
Inter-relational constraint violated, with table Studenti: Studente 200768 does not appear in Studenti
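The four kinds of violation on this slide can be tried out with a small sketch. It assumes SQLite (through Python's sqlite3 module) rather than Access, and it assumes that a valid Voto lies between 18 and 30 and that honours require a Voto of 30; neither rule is stated on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
CREATE TABLE studenti (
    matricola INTEGER PRIMARY KEY,                    -- table constraint: no duplicates
    cognome TEXT,
    nome TEXT
);
CREATE TABLE esami (
    studente INTEGER REFERENCES studenti(matricola),  -- inter-relational constraint
    voto INTEGER CHECK (voto BETWEEN 18 AND 30),      -- field constraint (assumed range)
    lode INTEGER DEFAULT 0,
    corso TEXT,
    CHECK (lode = 0 OR voto = 30)                     -- record constraint (assumed rule)
);
""")
conn.execute("INSERT INTO studenti VALUES (276545, 'Rossi', 'Maria')")
conn.execute("INSERT INTO studenti VALUES (788854, 'Verdi', 'Fabio')")


def rejected(sql):
    """Return True if the DBMS refuses the statement because of a constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


results = {
    "table": rejected("INSERT INTO studenti VALUES (276545, 'Neri', 'Anna')"),
    "record": rejected("INSERT INTO esami VALUES (276545, 28, 1, '01')"),
    "field": rejected("INSERT INTO esami VALUES (276545, 32, 0, '02')"),
    "referential": rejected("INSERT INTO esami VALUES (200768, 30, 0, '03')"),
    "valid": rejected("INSERT INTO esami VALUES (788854, 23, 0, '03')"),
}
print(results)
```

Declaring the constraints in the schema lets the DBMS reject bad rows itself, instead of relying on every application to check them.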
Key
• A minimal set of attributes that uniquely identifies the records of a table
STUDENTE
Matr  Nome   Cognome
301   Carlo  Rossi
302   Marco  Neri
311   Guido  Mauro
ESAME
Matricola  Cognome  Nome  Esame  Data  Voto
Foreign key and referential integrity
• Attribute(s) that form the key of another table
Studenti
Matricola  Cognome  Nome   Nascita
276545     Rossi    Maria  23/04/1968
276543     Neri     Anna   23/04/1972
788854     Verdi    Fabio  12/02/1972
Esami
Studente  Voto  Lode    Corso
276545    28            Fisica
276545    18            Analisi
788854    23            Analisi
788854    30    e lode  Algebra
Referential integrity constraint: every value of Esami.Studente must appear as a Matricola in Studenti
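A referential integrity check boils down to looking for "dangling" foreign-key values. The following sketch does this in plain Python on the data of this slide; the function name and the extra row with Studente 200768 are illustrative assumptions.

```python
# Key values of the referenced table (Studenti.Matricola).
studenti = {276545, 276543, 788854}

# Rows of Esami as (Studente, Voto, Corso); Studente is the foreign key.
esami = [
    (276545, 28, "Fisica"),
    (276545, 18, "Analisi"),
    (788854, 23, "Analisi"),
    (788854, 30, "Algebra"),
]


def dangling_rows(rows, keys):
    """Return the rows whose foreign key (first field) matches no key value."""
    return [row for row in rows if row[0] not in keys]


ok = dangling_rows(esami, studenti)
broken = dangling_rows(esami + [(200768, 30, "Analisi")], studenti)
print(ok, broken)
```

A DBMS with declared foreign keys performs the same check automatically on every insert, update and delete.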
From E-R diagrams to tables
• many to many (N-N):
[Figure: STUDENTE (matr, nome, cognome), N, ESAME (voto), N, MATERIA (id, nome)]
STUDENTE(matr, nome, cognome)
MATERIA(id, nome)
ESAME(matr, id, voto)
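The N-N translation above can be sketched as SQL table definitions. The example assumes SQLite via Python's sqlite3 module, and it assumes that the pair (matr, id) is the key of ESAME, that is, at most one recorded exam per student and subject; the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE studente (matr INTEGER PRIMARY KEY, nome TEXT, cognome TEXT);
CREATE TABLE materia (id INTEGER PRIMARY KEY, nome TEXT);
-- The relationship becomes its own table holding the keys of both entities.
CREATE TABLE esame (
    matr INTEGER REFERENCES studente(matr),
    id INTEGER REFERENCES materia(id),
    voto INTEGER,
    PRIMARY KEY (matr, id)          -- assumed key of the relationship table
);
INSERT INTO studente VALUES (1, 'Carlo', 'Rossi'), (2, 'Mario', 'Verdi');
INSERT INTO materia VALUES (10, 'Analisi'), (20, 'Chimica');
INSERT INTO esame VALUES (1, 10, 28), (1, 20, 30), (2, 10, 25);
""")

# Both directions of the relationship are "many".
exams_per_student = dict(conn.execute("SELECT matr, COUNT(*) FROM esame GROUP BY matr"))
students_per_subject = dict(conn.execute("SELECT id, COUNT(*) FROM esame GROUP BY id"))
print(exams_per_student, students_per_subject)
```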
• one to many (1-N):
[Figure: STUDENTE (matr, nome, cognome), 1, ISCR (anno), N, CdL (id, nome)]
STUDENTE(matr, nome, cognome, cdl, anno)
CdL(id, nome)
• Sometimes it is preferable to translate as in the N-N case
• one to one (1-1):
[Figure: DIRETTORE (matr, nome, cognome), 1, DIRIGE (anno), 1, DIPART (id, nome)]
DIRETTORE(matr, nome, cognome, id, nome-dip, anno)
• Sometimes it is preferable to translate as in the 1-N case
ONE POSSIBLE FORM OF THE ENTITY-RELATIONSHIP DIAGRAM
studente: PK matricola; U2 ID_studente; U1 codice fiscale; I2 nome; I1 cognome; genere; indirizzo; città residenza; CAP; data nascita; luogo nascita; facoltà; Tutore didattico
contatto: PK ID_contatto; PK,FK1,I1 matricola; tipo_contatto; dato_contatto
materia: PK codice_materia; I1 ID_materia; denominazione; N_crediti
esame: PK,FK1,I1,I2 codice_materia; PK,FK2,I3 matricola; data; domande; voto; lode
Each relationship line (studente to contatto, materia to esame, studente to esame) is labelled u:R, d:R.
NOTE
u:R and d:R indicate that the data in the "child" table are not affected when the data in the "parent" table are updated (u: update) or deleted (d: delete).
This choice can of course be changed by requiring that updates to records be propagated.
ANOTHER POSSIBLE FORM OF THE SAME ENTITY-RELATIONSHIP DIAGRAM
studente: PK matricola; U2 ID_studente; U1 codice fiscale; I2 nome; I1 cognome; genere; indirizzo; città residenza; CAP; data nascita; luogo nascita; facoltà; Tutore didattico
contatto: PK ID_contatto; tipo_contatto; dato_contatto; FK1,I1 matricola
materia: PK codice_materia; I1 ID_materia; denominazione; N_crediti
esame: PK,FK1,I1,I2 codice_materia; PK,FK2,I3 matricola; data; domande; voto; lode
Each relationship line is labelled u:R, d:R and carries the cardinality 1..*.
Now that the E-R diagram has been carefully checked, the database can be implemented physically in Access 2003.
The database is named "studente_corso.mdb", as given in class.
Forms (in Access)
They are used
• to present data in a more attractive way
• to view and edit data
• to add records
They can be
• columnar
• tabular
• datasheet
[Screenshot: Design view]
Forms (continued)
Creating forms is made easier by the Form Wizard, which offers a sequence of choices:
• Tables or queries
• Fields to include
• Layout: columnar, tabular, datasheet
• Style, with a choice of backgrounds and colour schemes
Reports
A report gives much finer control over the final output
• It can have a page header and a page footer
• It can compute totals and subtotals, and can contain charts
• It can be used for presentations or address labels
Types of report
• Columnar report
Field names are listed down the left-hand side and each field's value is listed to its right
• Tabular report
Field names are listed side by side across the top and the field values are placed beneath them
Creating a report
Creating a report is made easier by the Report Wizard, which offers a sequence of choices:
• Select the table or the query
• Group the data
• Sort the data
• Select the layout
• Choose the style and the look
Queries
• Query: extraction from a database of the information that satisfies certain properties
• Ways to express queries:
SQL: Structured Query Language
QBE: Query by Example, as in Access
Relational algebra: a formal language
...
• What they have in common is a set of operators
Operators on tables
• Set operators (all binary):
union
intersection
difference
• Relational operators:
projection (unary)
selection (unary)
(natural) join (binary)
Set operators: union
• The union of two tables with compatible attributes is a table with the same schema containing the union of their records
CICLISMO
CF    Cognome  Età
RSSX  Rossi    20
NRXY  Neri     21
VRDX  Verdi    20
CALCIO
CF    Cognome  Età
RSSY  Rossi    20
NRXY  Neri     21
VRDX  Verdi    20
CICLISMO ∪ CALCIO (fans of cycling or of football)
CF    Cognome  Età
RSSX  Rossi    20
NRXY  Neri     21
VRDX  Verdi    20
RSSY  Rossi    20
Set operators: intersection
• The intersection of two tables with compatible attributes is a table with the same schema containing the intersection of their records
CICLISMO
CF    Cognome  Età
RSSX  Rossi    20
NRXY  Neri     21
VRDX  Verdi    20
CALCIO
CF    Cognome  Età
RSSY  Rossi    20
NRXY  Neri     21
VRDX  Verdi    20
CICLISMO ∩ CALCIO (fans of both cycling and football)
CF    Cognome  Età
NRXY  Neri     21
VRDX  Verdi    20
Set operators: difference
• The difference of two tables with compatible attributes is a table with the same schema containing the records present in the first but not in the second
CICLISMO
CF    Cognome  Età
RSSX  Rossi    20
NRXY  Neri     21
VRDX  Verdi    20
CALCIO
CF    Cognome  Età
RSSY  Rossi    20
NRXY  Neri     21
VRDX  Verdi    20
CICLISMO - CALCIO (fans of cycling but not of football)
CF    Cognome  Età
RSSX  Rossi    20
CALCIO - CICLISMO (fans of football but not of cycling)
CF    Cognome  Età
RSSY  Rossi    20
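The three set operators (union, intersection, difference) map directly onto Python's built-in set type. This is only a sketch: each table is modelled as a set of (CF, Cognome, Età) tuples, which is a modelling choice of this example and not how a DBMS stores tables.

```python
# Each table is a set of tuples, so duplicate rows collapse automatically.
ciclismo = {("RSSX", "Rossi", 20), ("NRXY", "Neri", 21), ("VRDX", "Verdi", 20)}
calcio = {("RSSY", "Rossi", 20), ("NRXY", "Neri", 21), ("VRDX", "Verdi", 20)}

union = ciclismo | calcio          # fans of cycling or of football
intersection = ciclismo & calcio   # fans of both
only_ciclismo = ciclismo - calcio  # cycling but not football
only_calcio = calcio - ciclismo    # football but not cycling

for name, table in [("union", union), ("intersection", intersection),
                    ("only_ciclismo", only_ciclismo), ("only_calcio", only_calcio)]:
    print(name, sorted(table))
```

Note that difference is not symmetric: the two differences give different tables.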
Relational operators: projection
• Projecting a table produces a table with only the specified attributes, containing the same records “restricted” to those attributes
STUDENTI
Matricola  Nome   Cognome  CdL
40445      Paolo  Rossi    Fisica
43555      Piero  Bianchi  Matematica
43566      Piero  Verdi    Informatica
55655      Marco  Rossi    Lettere
π Cognome,Nome (STUDENTI): list the last names and first names of the students
Cognome  Nome
Rossi    Paolo
Bianchi  Piero
Verdi    Piero
Rossi    Marco
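A minimal sketch of projection, with the table modelled as a list of dictionaries (an assumption of this example). The helper name `project` is hypothetical; it also merges rows that become identical after the restriction, as relational algebra requires.

```python
studenti = [
    {"Matricola": 40445, "Nome": "Paolo", "Cognome": "Rossi", "CdL": "Fisica"},
    {"Matricola": 43555, "Nome": "Piero", "Cognome": "Bianchi", "CdL": "Matematica"},
    {"Matricola": 43566, "Nome": "Piero", "Cognome": "Verdi", "CdL": "Informatica"},
    {"Matricola": 55655, "Nome": "Marco", "Cognome": "Rossi", "CdL": "Lettere"},
]


def project(table, attributes):
    """Keep only the given attributes; identical restricted rows are merged."""
    seen, result = set(), []
    for row in table:
        restricted = tuple(row[a] for a in attributes)
        if restricted not in seen:
            seen.add(restricted)
            result.append(dict(zip(attributes, restricted)))
    return result


names = project(studenti, ["Cognome", "Nome"])
first_names = project(studenti, ["Nome"])
print(names)
print(first_names)
```

Projecting on Nome alone yields three rows instead of four, because the two students named Piero collapse into one row.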
Relational operators: selection
• Selecting from a table produces a table with the same attributes, containing the records that satisfy a given predicate
STUDENTI
Matricola  Nome   Cognome  CdL
40445      Paolo  Rossi    Fisica
43566      Piero  Verdi    Informatica
55655      Marco  Rossi    Lettere
σ Nome = ‘Piero’ or CdL = ‘Fisica’ (STUDENTI): which students are named ‘Piero’ or enrolled in ‘Fisica’?
Matricola  Nome   Cognome  CdL
40445      Paolo  Rossi    Fisica
43566      Piero  Verdi    Informatica
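Selection can be sketched as a filter with a predicate. As before, the list-of-dictionaries representation and the helper name `select` are assumptions of this example.

```python
studenti = [
    {"Matricola": 40445, "Nome": "Paolo", "Cognome": "Rossi", "CdL": "Fisica"},
    {"Matricola": 43566, "Nome": "Piero", "Cognome": "Verdi", "CdL": "Informatica"},
    {"Matricola": 55655, "Nome": "Marco", "Cognome": "Rossi", "CdL": "Lettere"},
]


def select(table, predicate):
    """Keep the rows for which the predicate is true; the attributes are unchanged."""
    return [row for row in table if predicate(row)]


result = select(studenti, lambda r: r["Nome"] == "Piero" or r["CdL"] == "Fisica")
print([row["Matricola"] for row in result])
```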
Relational operators: Cartesian product
• The product of two tables produces a table with the union of their attributes, containing the concatenations of their records
STUDENTI
Cognome  Corso
Rossi    Fisica
Bianchi  Fisica
CdL
Nome        Materia
Fisica      Meccanica
Matematica  Algebra
STUDENTI × CdL (all possible concatenations of records)
Cognome  Corso   Nome        Materia
Rossi    Fisica  Fisica      Meccanica
Rossi    Fisica  Matematica  Algebra
Bianchi  Fisica  Fisica      Meccanica
Bianchi  Fisica  Matematica  Algebra
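The Cartesian product can be sketched with itertools.product. The helper name `cartesian` is hypothetical, and the sketch assumes that the two tables have no attribute names in common.

```python
from itertools import product

studenti = [
    {"Cognome": "Rossi", "Corso": "Fisica"},
    {"Cognome": "Bianchi", "Corso": "Fisica"},
]
cdl = [
    {"Nome": "Fisica", "Materia": "Meccanica"},
    {"Nome": "Matematica", "Materia": "Algebra"},
]


def cartesian(left, right):
    """Concatenate every row of `left` with every row of `right`."""
    return [{**l, **r} for l, r in product(left, right)]


pairs = cartesian(studenti, cdl)
# Keeping only the rows where Corso equals Nome turns the product into a join.
matching = [row for row in pairs if row["Corso"] == row["Nome"]]
print(len(pairs), len(matching))
```

The product alone is rarely useful; it becomes meaningful once a selection keeps only the rows that belong together.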
Relational operators: join
• The join of two tables produces a table with the union of their attributes, containing the concatenations of the records that agree on the common attributes
STUDENTE
Matricola  Cognome  Nome
6554       Rossi    Mario
8765       Neri     Paolo
9283       Verdi    Luisa
3456       Rossi    Maria
ESAME
Matricola  Voto  Corso
3456       30    Analisi
3456       24    Fisica
9283       28    Fisica
STUDENTE ⋈ ESAME (exam status of the students)
Matricola  Cognome  Nome   Voto  Corso
3456       Rossi    Maria  30    Analisi
3456       Rossi    Maria  24    Fisica
9283       Verdi    Luisa  28    Fisica
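A natural join can be sketched as a nested loop that keeps the pairs of rows agreeing on the shared attributes. The helper name `natural_join` is hypothetical, and a real DBMS would normally use indexes or hashing rather than a nested loop.

```python
studente = [
    {"Matricola": 6554, "Cognome": "Rossi", "Nome": "Mario"},
    {"Matricola": 8765, "Cognome": "Neri", "Nome": "Paolo"},
    {"Matricola": 9283, "Cognome": "Verdi", "Nome": "Luisa"},
    {"Matricola": 3456, "Cognome": "Rossi", "Nome": "Maria"},
]
esame = [
    {"Matricola": 3456, "Voto": 30, "Corso": "Analisi"},
    {"Matricola": 3456, "Voto": 24, "Corso": "Fisica"},
    {"Matricola": 9283, "Voto": 28, "Corso": "Fisica"},
]


def natural_join(left, right):
    """Concatenate the rows that agree on every attribute the two tables share."""
    if not left or not right:
        return []
    common = [a for a in left[0] if a in right[0]]
    return [
        {**l, **r}
        for l in left
        for r in right
        if all(l[a] == r[a] for a in common)
    ]


joined = natural_join(studente, esame)
summary = {(r["Matricola"], r["Cognome"], r["Voto"], r["Corso"]) for r in joined}
print(sorted(summary))
```

Students without exams (6554 and 8765) do not appear in the result, since no ESAME row matches them.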
Exercise
• Consider the following database for recording CD rentals
CLIENTE(Cognome, Nome, ID-Cliente)
CD(ID-CD, Titolo, Artista)
NOLEGGIO(ID-Cliente, ID-CD, data)
CLIENTE
Cognome  Nome   ID-Cliente
Rossi    Paolo  10
Bianchi  Maria  11
Verdi    Carlo  9
NOLEGGIO
ID-Cliente  ID-CD  data
10          1      23/7/2002
9           1      11/9/2002
11          3      15/2/2003
10          2      30/3/2003
CD
ID-CD  Titolo         Artista
1      Up             REM
2      October        U2
3      Synchronicity  Police
Example queries
• Run the following queries:
a) Artist and title of the CDs rented by Mr Paolo Rossi;
b) First name and last name of the customers who have rented CDs by REM;
c) Title of the CDs rented by the customer with code 10 or by the customer with code 11.
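One possible way to work through the three queries is sketched below in SQL, run through Python's sqlite3 module. SQLite is an assumption (the course uses Access), the identifiers are adapted to id_cliente and id_cd because hyphens are awkward in SQL names, and the dates are rewritten in ISO form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cliente (cognome TEXT, nome TEXT, id_cliente INTEGER PRIMARY KEY);
CREATE TABLE cd (id_cd INTEGER PRIMARY KEY, titolo TEXT, artista TEXT);
CREATE TABLE noleggio (id_cliente INTEGER, id_cd INTEGER, data TEXT);
INSERT INTO cliente VALUES ('Rossi', 'Paolo', 10), ('Bianchi', 'Maria', 11), ('Verdi', 'Carlo', 9);
INSERT INTO cd VALUES (1, 'Up', 'REM'), (2, 'October', 'U2'), (3, 'Synchronicity', 'Police');
INSERT INTO noleggio VALUES (10, 1, '2002-07-23'), (9, 1, '2002-09-11'),
                            (11, 3, '2003-02-15'), (10, 2, '2003-03-30');
""")

# a) join all three tables, select Paolo Rossi, project on artist and title
query_a = conn.execute("""
    SELECT DISTINCT cd.artista, cd.titolo
    FROM cliente JOIN noleggio USING (id_cliente) JOIN cd USING (id_cd)
    WHERE cliente.nome = 'Paolo' AND cliente.cognome = 'Rossi'
    ORDER BY cd.titolo
""").fetchall()

# b) same join, select on the artist, project on the customer's name
query_b = conn.execute("""
    SELECT DISTINCT cliente.nome, cliente.cognome
    FROM cliente JOIN noleggio USING (id_cliente) JOIN cd USING (id_cd)
    WHERE cd.artista = 'REM'
    ORDER BY cliente.cognome
""").fetchall()

# c) CLIENTE is not needed: the customer code is already in NOLEGGIO
query_c = conn.execute("""
    SELECT DISTINCT cd.titolo
    FROM noleggio JOIN cd USING (id_cd)
    WHERE noleggio.id_cliente IN (10, 11)
    ORDER BY cd.titolo
""").fetchall()

print(query_a, query_b, query_c)
```

Each query is a combination of join, selection and projection, the operators introduced in the previous slides.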
“Simple” cases of building biotechnology DBs
Let us look in some detail at how the entity-relationship diagrams are built in the following two examples:
1. A data model for the genomes of plant-associated bacteria (PABdb)
2. A data model for a DB of protein motif fingerprints (PRINTS-S)
1 - A data model for
Comparative Genomics
Laboratory for Bioinformatics (LBI), Institute of

Computing (IC) - UNICAMP
PhD Student: Luciano Antonio Digiampietri

Advisor: João Carlos Setubal
Co-advisor: Cláudia Maria Bauzer Medeiros
Topics
Introduction
Motivation
The data model
The PABdb system
Conclusions
Future work
History
In 2002 the following genomes (plant-associated bacteria, PAB):
– Agrobacterium tumefaciens
– Mesorhizobium loti
– Ralstonia solanacearum
– Sinorhizobium meliloti
– Xanthomonas axonopodis pv. citri
– Xanthomonas campestris pv. campestris
– Xylella fastidiosa cvc
– Xylella fastidiosa Temecula1
were compared by the following people:
– M. A. Van Sluys, C. B. Monteiro-Vitorello, L. E. A. Camargo, C. F. M. Menck, A. C. R. da Silva, J. A. Ferro, M. C. Oliveira, J. C. Setubal, J. P. Kitajima, A. J. Simpson.
To help the comparison a database was created:
=> PAB database
Main author: J. P. Kitajima
Publication:
M. A. van Sluys, C. B. Monteiro-Vitorello, L. E. A. Camargo, C. F. M. Menck, A. C. R. da Silva, J. A. Ferro, M. C. Oliveira, J. C. Setubal, J. P. Kitajima, and A. J. G. Simpson. Comparative genomic analysis of plant-associated bacteria. Annual Review of Phytopathology, 40, 169-189, 2002.
This publication presents analysis results, not a database description
This work
– PAB database overhaul
• Redesign
• Repopulation (data reload)
• Inclusion of new query and visualization tools
– PAB database description (there was none)
– Results
• It is now much more flexible
– can be used as building block of larger information
systems
• Scalable
– Much easier to include more genomes
Motivation for the work
Growing number of complete genomes of bacteria:
– Today there are about 130 complete genomes
– In a few years there will be more than 1000
The genomes of several species of a genus, or indeed the genomes of several strains of the same species, have been sequenced.
This data growth has made it necessary to develop new systems and tools for comparative genomics.
– The new systems must be:
• Flexible
• Scalable
Scope
strains: Xylella fastidiosa (citrus, grape, almond, oleander)
species: Xanthomonas (axonopodis pv. citri, campestris pv. campestris, oryzae, vesicatoria)
small sets of genomes: plant-associated bacteria (Agrobacterium tumefaciens, Sinorhizobium meliloti, Xanthomonas axonopodis pv. citri, Xylella fastidiosa cvc)
large sets of genomes: all microbial genomes
Basic concepts: Replicon
Any kind of cell unit that contains genetic information
(e.g. chromosomes, plasmids and mitochondria)
[Figure: Synechocystis sp. PCC 6803, with its chromosome and the plasmids pSYSM, pSYSA and pSYSX]
Basic concepts: Homology
Homology: two genes are homologous if they
share a common ancestor.
[Figure: two pairs of homologous genes]
Basic concepts: Homology (II)
Paralogous genes are two (or more) homologous genes in the same organism.
Orthologous genes are homologous genes that belong to different organisms.
[Figure: paralogous genes within organism1; orthologous genes between organism1 and organism2]
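The two definitions can be turned into a small classification sketch. It follows the simplified definitions of this slide only, it assumes that genes assigned to the same family are homologous, and all gene names, organism names and family labels are made up for illustration.

```python
from itertools import combinations

# Hypothetical toy data: gene -> (organism, family).
genes = {
    "a1": ("organism1", "F1"),
    "a2": ("organism1", "F1"),
    "b1": ("organism2", "F1"),
    "b2": ("organism2", "F2"),
}


def classify(g1, g2):
    """Label a pair of genes using the slide's simplified definitions."""
    org1, fam1 = genes[g1]
    org2, fam2 = genes[g2]
    if fam1 != fam2:
        return "not homologous"
    return "paralogous" if org1 == org2 else "orthologous"


pairs = {(g1, g2): classify(g1, g2) for g1, g2 in combinations(sorted(genes), 2)}
for pair, label in pairs.items():
    print(pair, label)
```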
Basic concepts: gene family
gene_id   genome_id  gene_category  gene_product
Atu0324   At         III.A.1        chromosomal replication initiator protein dnaA
SMc01167  Sm         III.A.1        chromosomal replication initiator protein
Mll5581   Ml         III.A.1        chromosomal replication initiator protein dnaA
XCC0001   Xcc        III.A.1        chromosomal replication initiator
XAC0001   Xac        III.A.1        chromosomal replication initiator
PD0001    Xfpd       III.A.1        chromosomal replication initiator
XF0001    Xfcvc      III.A.1        chromosomal replication initiator
RSc3442   Rs         III.A.1        probable chromosomal replication initiator protein dnaA
Basic concepts: functional category
I - Intermediary metabolism
– Degradation
• Degradation of polysaccharides and oligosaccharides
• Degradation of small molecules
• Degradation of lipids
– Central intermediary metabolism
– Energy metabolism, carbon
– Regulatory functions
II - Biosynthesis of small molecules
III - Macromolecule metabolism
IV - Cell structure
V - Cellular processes
VI - Mobile genetic elements
VII - Pathogenicity, virulence, and adaptation
VIII - Hypothetical
Motivation queries
– Given two or more genomes, what are the genes shared between them and to what families do they belong?
– Given two or more genomes, what are the genes specific to one in relation to the others, and to what families do they belong?
– Given a gene x from an organism not in the system, does it have homologues in the system? If so, how many?
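The first two motivating queries reduce to set operations once every gene has been assigned to a family. The sketch below is not the PABdb implementation: the dictionary layout, the family identifiers F1 to F3 and the gene identifiers Atu0001, SMc00001 and XAC0002 are invented for illustration.

```python
# Hypothetical toy data: genome -> {family_id: [gene_ids in that family]}.
families_by_genome = {
    "At": {"F1": ["Atu0324"], "F2": ["Atu0001"]},
    "Sm": {"F1": ["SMc01167"], "F3": ["SMc00001"]},
    "Xac": {"F1": ["XAC0001"], "F2": ["XAC0002"]},
}


def shared_families(genomes):
    """Families that have at least one gene in every listed genome."""
    return set.intersection(*(set(families_by_genome[g]) for g in genomes))


def specific_families(target, others):
    """Families present in `target` and absent from all the other genomes."""
    elsewhere = set().union(*(families_by_genome[g] for g in others))
    return set(families_by_genome[target]) - elsewhere


def genes_in(genome, families):
    """Genes of one genome that belong to the given families."""
    return sorted(g for f in families for g in families_by_genome[genome][f])


shared_all = shared_families(["At", "Sm", "Xac"])
shared_pair = shared_families(["At", "Xac"])
only_sm = specific_families("Sm", ["At", "Xac"])
print(shared_all, shared_pair, genes_in("Sm", only_sm))
```

The third query (homologues of a gene from outside the system) needs a sequence comparison such as BLAST and cannot be answered from family tables alone.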
[Figure: genomes G1, G2, …, Gk; each genome contains replicons (R1, R2, …, Rp); each replicon contains genes; genes are grouped into families (Family1, Family2, …) and into categories]
Attributes
Attributes based on GenBank data
– Genome:
• id, strain, source, taxid, description
– Replicon:
• id, genome_id, description, sequence
– Genes:
• id, replicon_id, start_pos, end_pos, gene_synonym, orientation, product, name, gi, category
Conceptual model
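[Figure: E-R diagram with the entities Category, Genome, Replicon, Gene, Gene Family and BLAST Hits; the cardinality labels shown are 2:N, 1..N (Genome to Replicon), 1..N (Replicon to Gene), N..N (Gene to Gene Family) and 1:N]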
Tables and relationships
gene_family_tbl(family_id, gene_id, genome_id)
family_tbl(family_id, description)
gene_blast_tbl(gene_id, blast_type, blast_db, blast_order, blast_gene_id, blast_tax_id, blast_qu_cover, blast_sj_cover, blast_idty, blast_description)
gene_tbl(gene_id, gene_start_pos, gene_end_pos, replicon_id, gene_synonym, gene_orientation, gene_product, gene_name, gene_category, gene_category_sec, gene_gi)
genome_tbl(genome_id, genome_strain, genome_source, genome_taxid, genome_description, genome_pab)
replicon_tbl(replicon_id, genome_id, replicon_description, replicon_sequence)
category_tbl(categ_id, categ_description)
PABdb information system
Plant-Associated Bacteria Database

Main objectives
– management of genome data;
– comparison among genomes;
– clustering of genes in gene families and in
categories
– Allow easy inclusion of new comparison
tools
System overview
In bioinformatics, Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological
sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences. A BLAST
search enables a researcher to compare a query sequence with a library or database of sequences, and identify library
sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously
unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a
similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of
sequence. The BLAST program was designed by Eugene Myers, Stephen Altschul, Warren Gish, David J. Lipman and Webb
Miller at the NIH and was published in J. Mol. Biol. in 1990 (WIKIPEDIA)
[Figure: system overview. The user tools (local BLAST, category and family operations) work on the DBMS; data converters load structured files and FASTA files into the DBMS]
2 - Biological Databases
for Protein Sequence Analysis
Terri Attwood
School of Biological Sciences
University of Manchester, Oxford Road
Manchester M13 9PT, UK
http://www.bioinf.man.ac.uk/dbbrowser/
Main types of error in databases:
1) In annotations and in data submission:
inaccuracy,
omissions,
blunders,
inconsistency between different fields.
2) In the sequences (in the data in general):
sequencing errors,
misreading of experimental data (gels),
presence of cloning-vector insertions.
• A collection of data, …
• which are structured;
• which are indexed;
• which are periodically updated;
• which have references to other databases;
• …
• Biological databases are tightly associated with tools …
• to retrieve entries from the database;
• to update the database;
• …
• The six main database categories:
• sequences
– proteins (UniProtKB);
– nucleic acids (EMBL).
• mapping
– genes;
– chromosomes;
– …
• 3D structures (PDB)
• gene/protein expression
• function (KEGG)
• literature (PubMed), ontologies (GO), …
[Figure labels: SEQUENCES, 3D, ONTOLOGIES, LITERATURE, FUNCTION, EXPRESSION, MAPPING; example map identifiers LS125-4, R14523, CYC223]
Example of a sequence entry:
>sp|P56478|IL7_RAT
MFHVSFRYIFGIPPLILVLLPVTSSD
CHIKDKDGKAFGSVLMISINQLDKMT
GTDSDCPNNEPNFFKKHLCDDTKEAA
FLNRAARKLRQFLKMNISEEFNDHLL
RVSDGTQTLVNCTSKEEKTIKEQKKN
DPCFLKRLLREIKTCWNKILKGSI
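The sequence entry shown on this slide (>sp|P56478|IL7_RAT) is in FASTA format: a header line starting with ">" followed by the residues on one or more lines. A minimal parsing sketch follows; the function name is hypothetical, and real projects would more likely rely on an established library.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict: header (without '>') -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence lines are simply concatenated
    return records


fasta = """>sp|P56478|IL7_RAT
MFHVSFRYIFGIPPLILVLLPVTSSD
CHIKDKDGKAFGSVLMISINQLDKMT
GTDSDCPNNEPNFFKKHLCDDTKEAA
FLNRAARKLRQFLKMNISEEFNDHLL
RVSDGTQTLVNCTSKEEKTIKEQKKN
DPCFLKRLLREIKTCWNKILKGSI
"""

records = parse_fasta(fasta)
header = next(iter(records))
# UniProt-style header: database | accession | entry name
database, accession, entry_name = header.split("|")
sequence = records[header]
print(accession, entry_name, len(sequence))
```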
The stuff you have to know
• Single- & three-letter amino acid codes
– G Glycine Gly P Proline Pro
– A Alanine Ala V Valine Val
– L Leucine Leu I Isoleucine Ile
– M Methionine Met C Cysteine Cys
– F Phenylalanine Phe Y Tyrosine Tyr
– W Tryptophan Trp H Histidine His
– K Lysine Lys R Arginine Arg
– Q Glutamine Gln N Asparagine Asn
– E Glutamic Acid Glu D Aspartic Acid Asp
– S Serine Ser T Threonine Thr
• Additional codes
– B Asn/Asp Z Gln/Glu X Any amino acid
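The code table above is easy to hold in a dictionary, which makes conversions mechanical. In this sketch the three-letter forms Asx, Glx and Xaa for the ambiguity codes B, Z and X are the usual IUPAC abbreviations, which the slide itself does not list; the function name is hypothetical.

```python
ONE_TO_THREE = {
    "G": "Gly", "P": "Pro", "A": "Ala", "V": "Val", "L": "Leu", "I": "Ile",
    "M": "Met", "C": "Cys", "F": "Phe", "Y": "Tyr", "W": "Trp", "H": "His",
    "K": "Lys", "R": "Arg", "Q": "Gln", "N": "Asn", "E": "Glu", "D": "Asp",
    "S": "Ser", "T": "Thr",
    # Additional (ambiguity) codes
    "B": "Asx",  # Asn or Asp
    "Z": "Glx",  # Gln or Glu
    "X": "Xaa",  # any amino acid
}


def to_three_letter(sequence):
    """Spell out a one-letter sequence; unknown symbols are reported as Xaa."""
    return "-".join(ONE_TO_THREE.get(residue, "Xaa") for residue in sequence.upper())


print(to_three_letter("MFHVSF"))
```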
Ground rules for bioinformatics
• Don't always believe what programs tell you
– they're often misleading & sometimes wrong!
• Don't always believe what databases tell you
• Don't always believe what lecturers tell you
• In short, don't be a naive user

– when computers are applied to biology, it is vital to understand the
difference between mathematical & biological significance
– computers don’t do biology
– they only do “sums” quickly!
[Figure: a sequence, the motifs along it, and the fingerprint they form together]
Motif: A sequence motif is a characteristic sequence pattern observed in different

proteins or nucleic acids and typically associated with a particular function such
as molecular binding. A structural motif is a recurring three dimensional
arrangement of structural elements observed in different proteins.
http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/
The database is a compendium of protein "fingerprints", which are groups of conserved motifs used to characterise a protein family. The diagnostic power of a fingerprint is refined by iterative scanning of a SWISS-PROT/TrEMBL composite.
Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space.
Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs, full diagnostic potency deriving from the mutual context provided by motif neighbours.
SPRINT (Search PRINTS-S) provides an interface to the PRINTS-S database. PRINTS-S is the relational cousin of the PRINTS data bank of protein family fingerprints.
prePRINTS is an automatically generated supplement to the PRINTS database, aiming to provide increased protein family coverage.
Single motif methods: fuzzy regex (IDENTIFY), exact regex (PROSITE)
Full domain alignment methods: profiles (PROFILE LIBRARY), HMMs (Pfam)
Multiple motif methods: identity matrices (PRINTS), weight matrices (BLOCKS)
Definition of a motif repeat:
A repeat is a word long enough so that it’s unlikely to occur very often by
chance, given a random sequence.
Features of one example of a motif repeat:
The tetratricopeptide motif repeat (TPR) was first identified in the cell cycle division
proteins cdc16, cdc23, and cdc27 as a degenerate tandem repeat of 34 amino acid
residues encoding an α-helix-turn-α-helix motif (14).
Mutations within the TPR motifs of these proteins cause mitotic arrest at the metaphase-
to-anaphase transition. TPRs have now been identified in a wide diversity of organisms,
ranging from bacteria to humans.
This motif, which is known to mediate protein-protein interactions, exists in proteins of

diverse biological functions such as cell cycle regulation, neurogenesis, transcription
control, mitochondrial and peroxisomal protein transport, Rac mediated activation of
NADPH oxidase, protein kinase inhibition, and protein folding (15).
In TPR-containing proteins, this repeat motif is often present in tandem arrays of 3–16
motifs, although individual TPR motifs or blocks of TPR motifs may be dispersed
throughout the protein sequence.
From: Biophysical Journal, Volume 89, October 2005, 2640–2649

PRINTS
• Most protein families are characterised by >1 motif

– it is sensible to use them all to build a diagnostic signature
• This is the principle of fingerprints
– these offer improved diagnostic reliability by virtue of the
biological context provided by motif neighbours
• Motifs are excised from alignments by hand &
encoded as ungapped, unweighted local alignments
– residue information is augmented via iterative searches
– sequences matching all motifs that weren't in the original
alignment are added to the motifs, & the db searched again
• The process is repeated until convergence
– results are manually annotated prior to inclusion in the db
Fingerprints
• Fingerprints are groups of motifs excised from

alignments & used for iterative db searching
– no weighting scheme is used
– searches depend only on residue frequencies
– resulting scoring matrices are thus sparse
• Each motif trawls the database independently
– search results are correlated to determine which sequences
match all the motifs & which match only partially
– no information is thrown away
• Iteration refines the fingerprint & increases its potency
– fingerprints are diagnostically more powerful than regular
expressions
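The idea of unweighted, frequency-only scoring can be illustrated with a toy sketch. This is not the PRINTS algorithm: the motif segments, the query sequence and the function names are invented, and the score is simply the sum of the observed residue counts at each position.

```python
from collections import Counter

# Hypothetical ungapped motif: equal-length segments excised from an alignment.
motif = ["ACDEF", "ACDQF", "GCDEF"]


def frequency_matrix(segments):
    """One Counter per motif position: residue -> number of occurrences."""
    return [Counter(column) for column in zip(*segments)]


def score(matrix, window):
    """Sum the observed counts of the window's residues, position by position.

    Residues never seen at a position count as 0, which is why such
    matrices are sparse.
    """
    return sum(column[residue] for column, residue in zip(matrix, window))


def best_match(matrix, sequence):
    """Slide the motif along the sequence; return (best score, start position)."""
    width = len(matrix)
    return max(
        (score(matrix, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


matrix = frequency_matrix(motif)
hit = best_match(matrix, "MKACDEFLL")
print(hit)
```

In an iterative scheme, sequences that match would be added to the motif and the matrix recomputed, so the counts gradually reflect more of the family.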
What's in a sequence?
Fingerprint visualisation
• Full potency of fingerprinting is gained from the mutual
context provided by motif neighbours
• Important, as it inherently implies a biological context to
motifs matched in the correct order, with appropriate
distances between them
– results are thus biologically more meaningful than those from
single motifs
• Allows sequence identification even when parts of the
fingerprint are absent
– such matches are best visualised graphically
A fingerprinting overview
Know your family
Blocks
• Blocks are groups of motifs derived automatically from families
identified in PRINTS & InterPro
– sequences are aligned automatically & motifs are automatically
identified by searching for spaced residue triplets (e.g., AxxxVxxC)
– a block score is calculated using the BLOSUM62 matrix
– validity of blocks is confirmed with a 2nd motif-finding algorithm
– blocks found by both methods are considered reliable
• Sequences within motifs are clustered to reduce contributions to
residue frequencies from sets of closely-related sequences
– each cluster is treated as a single sequence & given a score that gives
a measure of its relatedness
– the higher the weight, the more dissimilar the segment from others in
the block, the most distant being given a score of 100
– segments <80% similar are separated by blank lines
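A spaced residue triplet such as AxxxVxxC can be searched for with a regular expression in which every "x" matches any residue. The sketch below only shows this search step, not the full Blocks procedure; the function names and the test sequence are illustrative.

```python
import re


def triplet_to_regex(pattern):
    """Turn a spaced triplet such as 'AxxxVxxC' into a regex ('x' = any residue)."""
    return "".join("." if ch == "x" else re.escape(ch) for ch in pattern)


def find_triplet(pattern, sequence):
    """Return the start positions of all (possibly overlapping) occurrences."""
    # The lookahead (?=...) matches without consuming characters, so
    # overlapping occurrences are reported as well.
    lookahead = f"(?={triplet_to_regex(pattern)})"
    return [match.start() for match in re.finditer(lookahead, sequence)]


positions = find_triplet("AxxxVxxC", "MMAKLGVTTCAAAAVCCC")
print(positions)
```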
Profiles
• Profiles are scoring tables derived from full alignments
– these define which residues are allowed at given positions
– which positions are conserved & which degenerate
– which positions, or regions, can tolerate insertions
– the scoring system is intricate, & may include evolutionary weights,
results from structural studies, & data implicit in the alignment
– variable penalties are specified to weight against INDELs occurring
in core secondary structure elements
• Within a profile, the I & M fields contain position-specific scores
for insert & match positions
– in conserved regions, INDELs aren't totally forbidden, but are
strongly impeded by large penalties defined in the DEFAULT field
– these are superseded by more permissive values in gapped regions
– the inherent complexity of profiles renders them highly potent
discriminators, but they are time-consuming to derive
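To make the role of position-specific scores and penalties concrete, here is a minimal sketch of scoring a sequence against a toy profile by dynamic programming. It assumes one match-score table and one linear gap penalty per profile position; the names `profile_score`, `gap_penalties` and the numbers used are hypothetical, and real profiles use a richer scoring scheme.

```python
def profile_score(
    profile: list[dict[str, int]],
    gap_penalties: list[int],
    seq: str,
    mismatch: int = -2,
) -> int:
    """Global alignment of seq against a profile.
    profile[i] maps residues to match scores at position i (a stand-in for M scores);
    gap_penalties[i] is the cost of an insertion or deletion at position i
    (a stand-in for position-specific INDEL penalties)."""
    n, m = len(profile), len(seq)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] - gap_penalties[i - 1]
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] - gap_penalties[0]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = S[i - 1][j - 1] + profile[i - 1].get(seq[j - 1], mismatch)
            delete = S[i - 1][j] - gap_penalties[i - 1]   # skip a profile position
            insert = S[i][j - 1] - gap_penalties[i - 1]   # extra residue in seq
            S[i][j] = max(match, delete, insert)
    return S[n][m]

toy_profile = [{"A": 5}, {"C": 5}, {"D": 5}]
# Position 2 is a permissive "gapped region"; positions 1 and 3 are conserved.
permissive = profile_score(toy_profile, [10, 1, 10], "ACGD")
strict = profile_score(toy_profile, [10, 10, 10], "ACGD")
```

With the permissive penalty the extra residue costs little, whereas with large penalties everywhere the same insertion is strongly impeded, which is the behaviour described above for conserved versus gapped regions.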
A practical approach
• A central goal is to predict protein function from sequence

• Given a newly-determined sequence, we want to know
– what is my protein?
– to what family does it belong?
– what is its function?
– & how can we explain its function in structural terms?
• By searching pattern dbs & fold libraries, we may
recognise patterns that allow us to infer relationships with
previously-characterised families & folds
• Given the variety of dbs to search, how do we use them to
build a sensible search protocol?
• Protein sequence database identity search
– e.g., for short fragments, pinpoints identical matches to probe - may identify correct reading frame
• Protein sequence database similarity search
– e.g., nrdb, OWL, SP+SPTrEMBL - identifies homologues to probe
• Protein pattern database search
– e.g., PROSITE, profiles, PRINTS, BLOCKS, Pfam - identifies family relationships or pinpoints key structural or functional sites
• Known structure: structure classification database query
– e.g., scop, CATH, FSSP - provides details of structural class, secondary structure information, ligand-binding, etc.
• No known structure: protein fold pattern library search
– e.g., threading - identifies compatible folds for the probe sequence
Why bother with pattern searches?
• Primary searches won't always allow outright diagnosis

– BLAST & FASTA are not infallible
– often can't assign mathematically significant scores
– results may be complicated by modules, domains or
compositionally-biased regions
– annotations of retrieved hits may be incorrect
• Pattern databases contain potent descriptors
– so, distant relationships missed by BLAST may be captured by
one or more of the family or functional site distillations
Structural & functional interpretation
• Running db searches often does little more than identify
a protein family
– this only scratches the surface - we still want to know what our
protein does & what it might look like
• The first step is to examine the detailed family
documentations in PROSITE, PRINTS & InterPro
– these should help to elucidate the function of the protein
• The next step is to examine the fold classification &
structure summary resources
– e.g., SCOP, CATH & PDBsum (assuming the structure is known)
Significance
• Appreciating that mathematical & biological significance are different
is crucial
• It is especially important in understanding the limitations of
– database search algorithms
– multiple sequence alignment algorithms
– pattern recognition techniques
– functional site & structure prediction tools
• Contrary to popular opinion, there is currently still
– no biologically-reliable automatic multiple alignment algorithm
– no infallible pattern-recognition technique
– no reliable gene, function or structure prediction algorithm
Conclusions
• Gene prediction, structure & function prediction are non-trivial
– structure & function prediction tools are, at best, 70% accurate
• What are the lessons for sequence analysis?
– when searching for distant homologues, several dbs should be searched
– different methods provide different perspectives
– dbs aren’t complete & their contents don’t fully overlap
• The more dbs searched, the more difficult it can be to interpret results
• The more computers are involved in automating genome annotation,
the greater the need for collaboration with biologists
• The more data we have to handle, the more rigorous we must be in
our thinking (& writing) if we are to make sense of the complexities
• We are still a long way from having reliable tools for deducing protein
function from sequence
– but with the right approach, there is hope
Standard or advanced querying of PRINTS-S
Link:
http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/
From this site, students can also obtain data to populate their own
database and practise with their own queries and reports
ER diagram for the PRINTS-S database
This diagram depicts the fingerprint data modelled using three entities: FINGERPRINT, MOTIF and SEQUENCE
(rectangular boxes). Each entity has a relationship with another, represented by a solid connecting arrow. A single
arrowhead denotes the "one" side of a relationship and a double arrowhead the "many" side: e.g., one fingerprint has many
motifs, but each motif is associated with one, and only one, fingerprint. The many-to-many relationship between the entities
fingerprint and sequence is special, as highlighted by a diamond. The diamond represents a relationship with a
property; in this case, it represents sequence assignment. Sequences are assigned an attribute that depends on their
relationship with the fingerprint: they may be true-positive (true positive), true-positive partial (true partial) or
false-positive partial (false partial). The dotted arrow represents a recursive relationship, since one fingerprint can be related
to another. Ellipses denote specific entity attributes.
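The mapping from this diagram to tables can be sketched with Python's built-in `sqlite3` module. The one-to-many link becomes a foreign key in the motif table, and the many-to-many link with a property becomes a separate table carrying the assignment status. Table and column names such as `assignment`, `motif_no` and `status`, and the sample rows, are assumptions made for this sketch; they are not the actual PRINTS-S schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fingerprint (fprint_accn TEXT PRIMARY KEY, identifier TEXT);

-- one fingerprint has many motifs; each motif belongs to exactly one fingerprint
CREATE TABLE motif (
    fprint_accn TEXT NOT NULL REFERENCES fingerprint(fprint_accn),
    motif_no    INTEGER NOT NULL,
    PRIMARY KEY (fprint_accn, motif_no)
);

CREATE TABLE sequence (seqn_id TEXT PRIMARY KEY, seqn_title TEXT);

-- many-to-many relationship with a property (the diamond in the diagram)
CREATE TABLE assignment (
    fprint_accn TEXT NOT NULL REFERENCES fingerprint(fprint_accn),
    seqn_id     TEXT NOT NULL REFERENCES sequence(seqn_id),
    status      TEXT NOT NULL
                CHECK (status IN ('true positive', 'true partial', 'false partial')),
    PRIMARY KEY (fprint_accn, seqn_id)
);

-- invented sample rows, for illustration only
INSERT INTO fingerprint VALUES ('FP_A', 'FAMILY_A');
INSERT INTO sequence VALUES ('SEQ1', 'first'), ('SEQ2', 'second'), ('SEQ3', 'third');
INSERT INTO assignment VALUES
    ('FP_A', 'SEQ1', 'true positive'),
    ('FP_A', 'SEQ2', 'true partial'),
    ('FP_A', 'SEQ3', 'true positive');
""")

# How many sequences fall into each assignment class for one fingerprint?
status_counts = conn.execute(
    "SELECT status, COUNT(*) FROM assignment "
    "WHERE fprint_accn = ? GROUP BY status ORDER BY status",
    ("FP_A",),
).fetchall()
```

Keeping the assignment status in a single column is one design choice; using separate tables per class is another valid way to model the same relationship.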
“Starter” ER diagram in Access 2003
These tables implement one possible form of the database, with the
corresponding cardinalities. Students are encouraged to explore other
solutions as well.
The reference file I have prepared for our course is: PRINTS-S.mdb
Examples of populating the tables
At ftp://ftp.bioinf.man.ac.uk/pub/prints you can download the complete database;
the following pages show the complete ER diagram, for comparison with the
“simplified” one covered in our course.
Representation of fingerprint data using a relational logical model
Dr Jane Mabey Gilsenan, Education & Research Centre, University Hospital of South Manchester, United Kingdom
In this model, a set of related data is represented as a table in which a column
represents an attribute and a row represents a set of values attributed to a relation
type. Within each table, significant attributes are highlighted: primary keys are
denoted by purple text; and foreign keys are denoted by a star.
PRINTS-S Schema
The following tables provide a more detailed description of the attributes in each entity:
Table = fingerprint
+----------------------------------+----------------------------------+-------+
| Field | Type | Length|
+----------------------------------+----------------------------------+-------+
| fprint_accn | varchar() | 15 |
| identifier | varchar() | 15 |
| motifs | int2 | 2 |
| date | date | 4 |
| up_date | date | 4 |
| family_title | text | var |
| family_doc | varchar() | 18 |
| category | text | var |
| second_accn | varchar() | 15 |
| pseudo | bool | 1 |
+----------------------------------+----------------------------------+-------+
Source: http://www.bioinf.manchester.ac.uk/dbbrowser/sprint/printss_data_model.html
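As a rough illustration, the fingerprint table above could be declared as follows using Python's built-in `sqlite3` module. The mapping of `int2` to SMALLINT and `bool` to BOOLEAN, and the choice of `fprint_accn` as primary key, are assumptions of this sketch (the key highlighting of the original schema is not visible in the listing).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE fingerprint (
    fprint_accn  VARCHAR(15) PRIMARY KEY,  -- assumed to be the key
    identifier   VARCHAR(15),
    motifs       SMALLINT,                 -- int2 in the listing
    "date"       DATE,
    up_date      DATE,
    family_title TEXT,
    family_doc   VARCHAR(18),
    category     TEXT,
    second_accn  VARCHAR(15),
    pseudo       BOOLEAN                   -- bool in the listing
)
""")

# Read the column names back to confirm that the table matches the listing.
columns = [row[1] for row in conn.execute("PRAGMA table_info(fingerprint)")]
```

SQLite treats the declared lengths as hints only, so a stricter DBMS (or Access field sizes) would be needed to enforce them.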
Table = motif
+----------------------------------+----------------------------------+-------+
| Field | Type | Length|
+----------------------------------+----------------------------------+-------+
| motif | int2 | 2 |
| repeat | varchar() | 4 |
| seqn_accn | varchar() | 15 |
| seqn_fragment | varchar() | 35 |
| start_position | int4 | 4 |
| inter_motif_dist | int4 | 4 |
| initial | bool | 1 |
| final | bool | 1 |
+----------------------------------+----------------------------------+-------+
Table = inter_motif
+----------------------------------+----------------------------------+-------+
| Field | Type | Length|
+----------------------------------+----------------------------------+-------+
| motif_a | int2 | 2 |
| motif_b | int2 | 2 |
| minimum | int4 | 4 |
| maximum | int4 | 4 |
+----------------------------------+----------------------------------+-------+
Table = seqn
+----------------------------------+----------------------------------+-------+
| Field | Type | Length|
+----------------------------------+----------------------------------+-------+
| seqn_id | varchar() | 15 |
| seqn_title | text | var |
+----------------------------------+----------------------------------+-------+
Table = seqn_detection
+----------------------------------+----------------------------------+-------+
| Field | Type | Length|
+----------------------------------+----------------------------------+-------+
| repeats | int4 | 4 |
| detection | char() | 1 |
| automatic | bool | 1 |
+----------------------------------+----------------------------------+-------+
Table = history
+----------------------------------+----------------------------------+-------+
| Field | Type | Length|
+----------------------------------+----------------------------------+-------+
| dbase_version | varchar() | 15 |
| convergence | int2 | 2 |
| hit_list_length | int2 | 2 |
| scan_method | varchar() | 12 |
+----------------------------------+----------------------------------+-------+
Table = dbase_xrefn
+----------------------------------+----------------------------------+-------+
| Field | Type | Length|
+----------------------------------+----------------------------------+-------+
| dbase_name | varchar() | 15 |
| dbase_accn | varchar() | 15 |
| dbase_id | varchar() | 30 |
+----------------------------------+----------------------------------+-------+
Table = true_partial
+----------------------------------+----------------------------------+-------+
| Field | Type | Length|
+----------------------------------+----------------------------------+-------+
| matches | int2 | 2 |
| motif_diagram | varchar() | 15 |
| automatic | bool | 1 |
+----------------------------------+----------------------------------+-------+
Table = false_partial
+----------------------------------+----------------------------------+-------+
| Field | Type | Length|
+----------------------------------+----------------------------------+-------+
+----------------------------------+----------------------------------+-------+
Table = relation
+----------------------------------+----------------------------------+-------+
| Field | Type | Length|
+----------------------------------+----------------------------------+-------+
| fprint_one | varchar() | 15 |
| relationship | text | var |
| fprint_two | varchar() | 15 |
+----------------------------------+----------------------------------+-------+
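The relation table holds the recursive link between fingerprints, so reading it usually means joining the fingerprint table to itself. The sketch below, again using Python's `sqlite3`, assumes that `fprint_one` and `fprint_two` refer to `fprint_accn`; the accession codes, identifiers and relationship text are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fingerprint (
    fprint_accn VARCHAR(15) PRIMARY KEY,
    identifier  VARCHAR(15)
);
CREATE TABLE relation (
    fprint_one   VARCHAR(15) REFERENCES fingerprint(fprint_accn),  -- assumed link
    relationship TEXT,
    fprint_two   VARCHAR(15) REFERENCES fingerprint(fprint_accn)   -- assumed link
);
-- invented sample rows, for illustration only
INSERT INTO fingerprint VALUES ('FP_A', 'FAMILY_A'), ('FP_B', 'SUBFAMILY_B');
INSERT INTO relation VALUES ('FP_B', 'subfamily of', 'FP_A');
""")

# Self-join: the fingerprint table is used twice, once per side of the relation.
pairs = conn.execute("""
SELECT a.identifier, r.relationship, b.identifier
FROM relation AS r
JOIN fingerprint AS a ON a.fprint_accn = r.fprint_one
JOIN fingerprint AS b ON b.fprint_accn = r.fprint_two
ORDER BY a.identifier, b.identifier
""").fetchall()
```

The same query can be built in the Access query designer by adding the fingerprint table twice and linking each copy to one side of the relation table.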
Table = ancestry
+----------------------------------+----------------------------------+-------+
| Field | Type | Length|
+----------------------------------+----------------------------------+-------+
| fprint_id | varchar() | 15 |
| ancestors | text | var |
+----------------------------------+----------------------------------+-------+

