
Databases

A database is a collection of related data; it acts as a repository for a collection of computerized
data files. Databases are electronic filing cabinets: a convenient and efficient method of storing vast
amounts of information. There are different types of databases, distinguished by the manner of storage
(e.g. flat-file, relational and object-oriented databases).
A database has several advantages over traditional paper-based methods of record keeping:
Compactness
Speed of operation (data can be retrieved and changed faster than a human could manage)
Up-to-date information is available on demand at any time
Accuracy and consistency of information

Terminology used in databases


An entity is a person, place, thing or condition about which data and information are
collected.
An attribute is a category of data or information that describes an entity. Each attribute is
a fact about the entity.
A data item is a specific detail of an individual entity that is stored in a database.
A record is a grouping of data items: a set of data or information that
describes a specific occurrence of an entity.
A relation (file) is the table in a database that describes an entity.
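
The terms above can be made concrete with a small sketch. The entity chosen here (a protein), the class name `ProteinRecord`, its attributes and all values are hypothetical and serve only to illustrate the vocabulary.

```python
from dataclasses import dataclass, fields


# Entity: a protein about which data are collected (hypothetical example).
# Attributes: the categories of data that describe it (the field names).
@dataclass
class ProteinRecord:
    accession: str
    name: str
    organism: str
    length: int


# Record: one specific occurrence of the entity (values are made up).
record = ProteinRecord("EX0001", "example kinase", "Homo sapiens", 350)

# Relation (table): the collection of records that describes the entity.
relation = [
    record,
    ProteinRecord("EX0002", "example protease", "Mus musculus", 212),
]

# The attribute names are shared by every record in the relation.
attribute_names = [f.name for f in fields(ProteinRecord)]

# Data item: one specific detail of one individual record.
data_item = record.organism
```

Here `attribute_names` plays the role of the column headings of a table, each element of `relation` is a row, and `data_item` is a single cell.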

Database management system (DBMS)


A DBMS is a complex set of software programs designed to manage databases based on a
variety of data models. It controls the organization, storage, management and retrieval of
data in the database. A DBMS is also known as a database manager.
Biological databases
Bioinformatics deals with the management and analysis of biological information stored in
databases. Such a database is a repository of sequences of nucleotides in DNA or RNA and of amino
acids in proteins, and it provides a centralized and homogeneous view of its contents. The database
(repository) is created and modified through a DBMS. Each data entry in the database is made
according to a definite schema, a set of pre-specified rules expressed through the data-definition
language. A graphical user interface (GUI) allows the user to access the contents of the database and
also helps in browsing through the contents of the repository. Biological
databases also have a specialized query language that helps the user in querying the contents of
the specific database. The data-definition language and the query language together form the
data model.
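
The roles of the data-definition language and the query language can be sketched with Python's built-in `sqlite3` module standing in for the DBMS. This is only an illustration under stated assumptions: the table name `sequence_entry`, its columns and all entries are invented, and real biological databases use their own schemas and query tools.

```python
import sqlite3

# An in-memory database; sqlite3 plays the role of the DBMS in this sketch.
conn = sqlite3.connect(":memory:")

# Data-definition language: the pre-specified rules every entry must follow.
conn.execute(
    """
    CREATE TABLE sequence_entry (
        accession TEXT PRIMARY KEY,
        molecule  TEXT NOT NULL CHECK (molecule IN ('DNA', 'RNA', 'protein')),
        sequence  TEXT NOT NULL
    )
    """
)

# Made-up entries, for illustration only.
entries = [
    ("EX0001", "DNA", "ATGGCCATTGTAATGGGCCGC"),
    ("EX0002", "RNA", "AUGGCCAUUGUAAUGGGCCGC"),
    ("EX0003", "protein", "MAIVMGR"),
]
conn.executemany("INSERT INTO sequence_entry VALUES (?, ?, ?)", entries)

# Query language: ask for the protein entries and their lengths.
rows = conn.execute(
    "SELECT accession, length(sequence) FROM sequence_entry "
    "WHERE molecule = ? ORDER BY accession",
    ("protein",),
).fetchall()

# An entry that violates the schema is rejected by the DBMS.
try:
    conn.execute("INSERT INTO sequence_entry VALUES ('EX0004', 'lipid', 'n/a')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The `CREATE TABLE` statement is the data-definition step and the `SELECT` statement is the query step; the failed insert shows the DBMS enforcing the rules of the schema.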
A biological database is a collection of data that is organized in such a way that its contents can
easily be accessed, managed and updated. The activity of preparing a database (figure 5.1) can
be divided into:
Collection of data in a form that can be easily accessed
Making it available to a multi-user system (always available to the user)

Classification of biological databases


Biological databases can be broadly classified into sequence and structure databases.
Sequence databases are applicable to both nucleic acid and protein sequences, whereas
structure databases are applicable only to proteins. In general, biological databases can be
classified into primary, secondary and composite databases.
Primary database
A primary database contains information on the sequence or structure alone. Examples include
GenBank and DDBJ for genome or nucleotide sequences, and Swiss-Prot and PIR for protein sequences.
In the early 1980s, sequence information started to become more abundant in the scientific
literature. Realizing this, several laboratories saw that there might be advantages to
harvesting and storing these sequences in central repositories. Thus, several primary database
projects began to evolve in different parts of the world.
Nucleic acid sequence databases
The principal DNA sequence databases are GenBank (USA), EMBL (Europe) and DDBJ (Japan), which
exchange data on a daily basis to ensure comprehensive coverage at each of the sites.
EMBL
The European Molecular Biology Laboratory (EMBL) Nucleotide Sequence
Database (http://www.ebi.ac.uk/embl/index.html) is a comprehensive collection of
primary nucleotide sequences, maintained at the European Bioinformatics Institute
(EBI).
Data are received directly from author submissions and genome sequencing groups,
as well as from the scientific literature and patent applications.
The database is produced in collaboration with DDBJ and GenBank; the participating groups each
collect a portion of the total sequence data reported worldwide, and all new and updated entries
are then exchanged between the groups on a daily basis. The three databases adhere to a set of
documented guidelines (The DDBJ/EMBL/GenBank Feature Table Definition) which regulate the
content and syntax of the database entries. These guidelines ensure that the data continue to be
made available in a format that can be exchanged efficiently between the databases, is
compatible with current bioinformatics software and reflects developments in the fields of
molecular and general biology.
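
Because entries follow an agreed syntax, they can be processed by simple programs. The sketch below assumes the EMBL flat-file convention in which each line starts with a two-letter line code (for example `ID`, `AC`, `DE`, `OS`, `SQ`) and an entry ends with `//`. The sample entry is invented and heavily simplified, and the function `parse_embl_entry` is a hypothetical helper, not part of any official tool.

```python
# A made-up, simplified entry; real entries carry many more line types,
# including the feature table.
SAMPLE_ENTRY = "\n".join(
    [
        "ID   EX000001; SV 1; linear; genomic DNA; STD; UNC; 24 BP.",
        "AC   EX000001;",
        "DE   Hypothetical example entry for illustration",
        "OS   Homo sapiens",
        "SQ   Sequence 24 BP;",
        "     atggccattg taatgggccg ctga                                    24",
        "//",
    ]
)


def parse_embl_entry(text):
    """Collect the lines of one entry by line code and join the sequence."""
    entry = {"sequence": ""}
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            break
        code, content = line[:2], line[5:].strip()
        if code == "SQ":
            # The SQ header line announces the sequence block.
            in_sequence = True
        elif in_sequence:
            # Sequence lines: keep letters, drop spaces and the base count.
            entry["sequence"] += "".join(ch for ch in content if ch.isalpha())
        elif code.strip():
            entry.setdefault(code, []).append(content)
    return entry


entry = parse_embl_entry(SAMPLE_ENTRY)
```

For real data, an established parser is preferable to a hand-written one; Biopython's `SeqIO` module, for instance, reads this format when given the format name `"embl"`.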
By the late 1990s, EMBL contained more than a million entries, representing more than 15,500
species but with model systems predominating (Homo sapiens, Caenorhabditis elegans,
Saccharomyces cerevisiae, Mus musculus and Arabidopsis thaliana together constituted
more than 50% of the resource).
The EMBL and EMBLNEW databases are stored and maintained in an ORACLE data management
system and can be searched on the Internet with the Sequence Retrieval System (SRS).
SRS links the principal protein sequence databases with motif, structure,
mapping and other special databases, and includes links to the MEDLINE facility.
EMBL can also be searched with query sequences via the EBI's Web interface to the BLAST
and FASTA programs.

A secondary database contains information derived from the primary databases. It contains
information such as structure, conserved regions, signature sequences and active-site
residues of protein families, arrived at by multiple sequence alignment of a set
of related proteins. Secondary structure databases contain the entries of the PDB in an
organized way, classified according to their structure (e.g. all-alpha
proteins, all-beta proteins). They also contain information on conserved
secondary-structure motifs of particular proteins. Examples include SCOP, CATH and
PROSITE.
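
How conserved positions can be derived from a multiple sequence alignment is shown in the minimal sketch below. It assumes the alignment has already been computed and uses the strictest possible criterion (every sequence has the same residue in a column). The sequences are made up, and real secondary databases use far more elaborate methods such as patterns, profiles and hidden Markov models.

```python
def conserved_columns(alignment):
    """Return (position, residue) for every fully conserved column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must all have the same length")
    conserved = []
    for i in range(length):
        column = {seq[i] for seq in alignment}
        # Conserved: a single residue in the column, and it is not a gap.
        if len(column) == 1 and "-" not in column:
            conserved.append((i, alignment[0][i]))
    return conserved


# Hypothetical aligned protein fragments ('-' marks a gap).
alignment = [
    "MKT-LGHSC",
    "MRTALGHSC",
    "MKTVLGHTC",
]
positions = conserved_columns(alignment)
```

Positions are counted from zero. In this example the run L, G, H at positions 4 to 6 is the kind of conserved region that a secondary database would record for the family.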
A composite database amalgamates a variety of different primary database sources,
which obviates the need to search multiple resources. Different composite databases use
different primary databases and different criteria in their search algorithms.
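
One way to picture the amalgamation step is a merge that keeps the first occurrence of each sequence. The sketch below is an assumption made for illustration: the source names, identifiers and sequences are invented, and the redundancy criterion used here (exact sequence identity, with sources taken in priority order) is only one of the possible criteria.

```python
def build_composite(sources):
    """Merge primary sources in priority order, dropping duplicate sequences."""
    composite = {}  # sequence -> (source name, identifier)
    for source_name, entries in sources:
        for identifier, sequence in entries.items():
            key = sequence.upper()  # compare sequences case-insensitively
            if key not in composite:
                composite[key] = (source_name, identifier)
    return composite


# Hypothetical primary sources, listed from highest to lowest priority.
sources = [
    ("source_A", {"A1": "MAIVMGR", "A2": "MKTLLV"}),
    ("source_B", {"B1": "maivmgr", "B2": "MSSHHH"}),
]
composite = build_composite(sources)
```

Entry `B1` duplicates `A1`, so the composite holds three sequences rather than four, and a single search of `composite` covers both sources.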

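Figure: Classification of biological databases

biological database
    sequence database
        nucleic acid
        protein
    structure database
        protein

PRIMARY DATABASES
    nucleic acid sequence: GenBank, DDBJ, EMBL
    protein sequence: PIR, MIPS, SWISS-PROT, TrEMBL, NRL-3D
    protein structure: PDB

SECONDARY DATABASES
    protein sequence: Pfam, BLOCKS, PROSITE
    protein structure: SCOP, CATH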
