
CIMAP Summer Training on Biotechnology & Bioinformatics

20th June to 20th July, 2006

Bioinformatics : Techniques and usage

Dr. Ashok Sharma Head, Bioinformatics and Co-ordinator, Bioinformatics Centre Central Institute of Medicinal and Aromatic Plants PO. CIMAP, Lucknow-226015, India. Web site: www.cimap.res.in E-mail: ashoksharma@cimap.res.in

Sequences

Biological Knowledge

Bioinformatics

Databases

Greater Biological Knowledge

Bioinformatics:
Why
What

Computational Methods
Resources and Tools

If you are one of the many biologists for whom genome databases are as comprehensible as a mass of supermarket barcodes, it is a good time to team up with a friendly bioinformaticist and join the action before it is too late.

If biologists do not adapt to the powerful computational tools needed to exploit huge data sets, they could find themselves floundering in the wake of advances in genomics.

It is predicted that the potential to integrate different levels of genomic data such as raw sequence from the human genome and those of model organisms, data on genetic variability between individuals and on gene expression in different tissues will radically change biological research. It is also agreed that small experiments driven by individual investigators will give way to a world in which multidisciplinary teams, sharing huge online data sets, emerge as key players.

Bioinformatics : a brave new world


Radical change in biological research, away from small experiments driven by individual investigators

Multidisciplinary teams sharing huge online data sets will be the key players

Era of systems biology: the ability to create mathematical models describing the function of networks of genes and proteins is just as important as traditional lab skills

Who will have competitive advantage?


Those who learn to conduct high throughput genomic analyses, and who can master the computational tools needed to exploit biological databases

Outcome of this natural selection will see many current top scientists, research groups and even whole institutes relegated to the second division

What is the solution?

In the long run, the change will come through the emergence of a new breed of biologists who are steeped in computational biology as an integral part of their education. This means that the subject must be included as a core module in all undergraduate biology courses, rather than as a specialist option. Although this is starting to happen, the availability of teachers with the appropriate expertise is still a limiting factor.

The emerging new breed

So, if the majority of biologists are not to be disenfranchised, what is the solution?
Emergence of a new breed of biologists who are steeped in computational biology as an integral part of their education.
Limiting factor: availability of teachers with the appropriate expertise.

One model solution has emerged in the U.S.A.


Funding agencies are also trying to drive change by ploughing money into initiatives that require a multidisciplinary approach and a strong computational component. The US National Institutes of Health, for instance, through its National Institute of General Medical Sciences, has created a programme of "glue grants" for integrative and collaborative approaches to research. Under this programme, the Alliance for Cellular Signaling will draw a complete map of interactions between some 1,000 proteins in two types of cells. The consortium unites traditional experimentalists with computational biologists.

Glue grants Integrative and collaborative approaches to research

US National Institutes of Health Alliance for Cellular Signaling

Complete map of interactions between some 1,000 proteins in two types of cells

Consortium unites traditional experimentalists with computational biologists.

Modern Biology and particularly Biotechnology are very much information-dependent fields. In fact, the symbiosis between information technology and biotechnology today is as intricately entwined as the two strands of the genetic material that make up the DNA helix.

- The Human Genome Project and other genome projects, such as the sequencing of bacterial and yeast genomes, have produced enormous amounts of DNA sequence data.

- Large-scale biological research involving microsequencing of proteins, 2D gel patterns of proteins and polypeptides, metabolic pathways, physical and genetic maps of organisms, cell line information, and microbial strain data has been responsible for the unprecedented growth of biological data.

- Projects such as Species-2000, the global plant checklist, information on the release of organisms into the environment, and Animal Virus Information are producing hard data at the species level in multimedia format.

The rate of growth of the biological data is estimated to be more than 200 million base pairs per year.

The database content itself is doubling in size approximately every year.

Nucleotide and protein sequences are not the only data that are accumulating rapidly. The number of characterized genes from a variety of organisms and the number of solved protein structures are also doubling every two years.

- The enormous growth of biological data and its availability in the major international databases is serving as a source of knowledge for life scientists.

- The paradigm shift in molecular biology towards data-intensive research in search of useful genes is basically due to the fact that genetic data is becoming the major driving force in drug discovery, protein engineering, design of new molecules, and other related areas.

- The large stores of biological data hold the promise of serving as the Discovery Super Highway for innovations in biotechnology, through a process of analysis and transformation of molecular and structural data into biological knowledge for prosperity.

- In the face of the challenges imposed by the growing size and complexity of biological data, a new discipline of science, known as Bioinformatics, has emerged in the recent past.

- Bioinformatics deals with the various issues related to biological data. It also covers the development of data analysis tools, modeling of biological macromolecules and their complexes, metabolic pathways, and the design of new molecules such as drugs, peptide vaccines, and proteins.

Gradually, Bioinformatics has evolved to deal with four related but still distinct problem areas, viz.:
a) Handling and management of biological data, including its organization, control, linkages, analysis, and so forth.
b) Communication among people, projects, and institutions engaged in biological research and applications. The communication may include e-mail, file transfer, remote login, computer conferencing, electronic bulletin boards, or establishment of web-based information resources.
c) Organization, access, search and retrieval of biological information, documents, and literature.
d) Analysis and interpretation of biological data through computational approaches, including visualization, mathematical modeling, and development of algorithms for highly parallel processing of complex biological structures.

Bioinformatics may be defined as a scientific discipline that encompasses all aspects of biological information, viz., acquisition, processing, storage, distribution, analysis and interpretation, and that combines the tools and techniques of mathematics, computer science, and biology with the aim of understanding the biological significance of a variety of data.

- Bioinformatics has acquired great importance due to its application in the genome projects.
- The target of decoding the three billion base pairs of human DNA has become achievable only through the use of various innovative techniques and methods evolved by Bioinformatics scientists.
- Bioinformatics has become an essential component of biotechnology-based product and process development.
- The process of drug design and development is expensive and time-consuming. The application of the tools and techniques of Bioinformatics has reduced the cost and the development cycle of drugs. This aspect has a tremendous impact on society. If a newly discovered drug is a life-saving one, then the resulting gains are not only in terms of financial savings but also in saving the lives of several million people. Major pharmaceutical and biotechnology companies have set up large R&D groups in Bioinformatics.

Bioinformatics is a multidisciplinary subject. Though only about a decade old, it has become very important for the growth of biosciences, biotechnology, and the economic prosperity of nations. Three well-identified divisions of Bioinformatics may be considered: a) Molecular Bioinformatics, b) Cellular and sub-cellular Bioinformatics, and c) Organismic and community Bioinformatics.

FUNCTIONS OF A BIOINFORMATICS CENTRE


i. The principal objective of a Bioinformatics Centre is to function as an information base in each specialty, so that scientists have ready access to computer-based information on resources and databases in their subject fields, and to build up expertise in bioinformatics in keeping with the rapid development in this area.

ii. To provide a computer-based information storage and retrieval system that collects structured information generated by research and industrial institutions in the identified fields of biotechnology, continually updates the databases, and makes the information available to users.

iii. To operate in an active network mode, in which scientists get access to the biotechnology community in the identified areas, answer requests for information in an interactive and discursive mode, and actively initiate dialogue among groups with common interests.

iv. To provide retrieval services, either online or offline, in their specialized areas and to give overall information support even in areas other than those assigned to them.

v. To provide a communication link with international databases for selective bibliographic information for the user scientist.

vi. To develop software packages and databases specific to user needs.

vii. To conduct training courses in the specialized areas periodically, to meet the special requirements of manpower development in the area and to promote awareness of the computerized storage and retrieval facility among bioscientists and information scientists.

Bioinformatics: What?
- A mixture of Biochemistry, Molecular Biology, and Computer Science
- Obtaining, storing, organizing, and analyzing biological and genetic information for understanding its activity in living organisms
- Main goal is to convert a multitude of complex data into useful information and knowledge
- Data includes gene and protein sequences, cDNA, nucleotide sequences
- Data from gene sequencing, combinatorial chemical synthesis, gene-expression investigations, pharmacogenomics, proteomic studies, and other methods of study
- Information used to build synthetic and predictive models, allowing scientists to better understand complex living systems
- Future applications in biology, chemistry, pharmaceuticals, medicine, and agriculture

What is the Role of Bioinformatics?

The role of the Bioinformatics group is to:
- Research and develop tools and systems that provide understanding and integration of genomic data across technologies
- Work with other Research Information staff to make these tools available to research scientists

What kinds of data are we interested in?


- Sequence data
- Profile data: gene expression and proteins
- Mapping data
- Function and phenotype
- Pathways

TECHNOLOGIES IN BIOINFORMATICS
Data-acquisition Systems
These are required mainly at research labs generating large amounts of data. These systems include inventory control software for tracking hundreds of thousands of reagents, gels and other materials; reagent manipulation software; robotic systems to carry out high-volume, high-precision laboratory manipulation in genome research; and sequence production software that helps improve sequencing.

TECHNOLOGIES IN BIOINFORMATICS
Data Analysis Systems
Studying sequences, predicting protein structure and comparing genomes on an extensive scale all require informatics tools. Sequence analysis software performs alignments, detects homologies, identifies coding regions and extracts features. Protein folding software is used to study how genetic information is transformed into function via proteins, whose functional specificities are determined by their 3-D shapes. Genetic mapping software systems play a key role in the analysis of genetic mapping data. Classification software extracts features from DNA sequences, places proteins into gene families and tracks protein motifs.

TECHNOLOGIES IN BIOINFORMATICS
Data-Management Systems
Various genome projects are generating information that cannot be accommodated by traditional publishing. Electronic data management and publishing systems are crucial components of genomic research.

CHALLENGES IN BIOINFORMATICS
Bioinformatics, which is the intersection of Information Technology and Mathematics with molecular biology / genetics, has created several challenges for the Computer Science community.


Information Storage

Storing huge amounts of genetic information in a form amenable to rapid access and manipulation is a great challenge. One million bases (1 Mb) occupy about one megabyte (1 MB) when each base is stored as a single character. Thus, one would require 3 gigabytes (3 GB) of computer storage to hold the entire human genome, comprising three gigabases (3 Gb). This includes nucleotide sequence data only and does not include data annotations and other information associated with the sequence data. With time, more annotations, entered either (a) by scientists as a result of laboratory findings, literature searches, data analysis, or personal communications, and/or (b) as a result of automated data analysis programs or autoannotators, will be associated with the sequence data, increasing the storage requirements significantly beyond the 3 GB for the human genome.
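The storage arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name and the two-bit packing alternative are illustrative assumptions, not something the slides specify.

```python
def genome_storage_bytes(n_bases: int, bits_per_base: int = 8) -> int:
    """Bytes needed to store n_bases at the given number of bits per base."""
    total_bits = n_bases * bits_per_base
    return (total_bits + 7) // 8  # round up to whole bytes

# Three gigabases, the figure quoted for the human genome.
HUMAN_GENOME_BASES = 3_000_000_000

# One character (8 bits) per base, as in a plain text file.
one_byte_per_base = genome_storage_bytes(HUMAN_GENOME_BASES)

# Assumption for comparison: A, C, G and T packed into 2 bits each.
two_bits_per_base = genome_storage_bytes(HUMAN_GENOME_BASES, bits_per_base=2)

print(one_byte_per_base / 1e9, "GB at one byte per base")
print(two_bits_per_base / 1e9, "GB at two bits per base")
```

Neither figure includes annotations, which is why real storage needs grow well beyond the raw sequence size.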

The Management and Integration of Biological Information




The development of database management systems is an active subspecialty of computer science. The need to organise terabytes of heterogeneous biological data in a form that is easily usable, and which employs sophisticated data visualisation capabilities, is essential for progress in modern biology.

Data Analysis Systems




Genetic data cannot be analysed efficiently without computer systems. Studying sequences, predicting protein structures, and comparing genomes on an extensive scale all require additional informatics tools, such as:

Sequence Analysis Software




Sequence analysis is so far the best-known and best-established area of bioinformatics. Performing alignments, detecting homologies, identifying coding regions, extracting features, and other computerised analyses of sequences are now performed as routine. At the same time, sequence analysis is a multi-faceted and biologically profound area of research, demanding much continued work.

Protein Folding Software

Genetic information is transformed into function via proteins, whose functional specificities are determined by their three-dimensional shapes. Prediction of protein structure from amino acid sequences is an important and challenging problem.

Map Assembly & Integration Software

Computation plays an increasingly central role in the assembly and integration of large maps composed of different kinds and combinations of data.

Comparative Genomics Tools

As the genome projects mature and large amounts of genomic information become available for a number of species, comparative genomics is emerging as an active area of study.

Gene Mining

Methods for mapping genes to their physical locations on the genome; searching for related genes; analysing the database to find families of related genes and to understand their coordinated expression; finding correlations between specific diseases and the expression of related genes.


SERENDIPITY EFFECT


One of the most exciting aspects of the information revolution is that it allows us to combine many different items of information and many different kinds of information on a scale never seen before. Large international databases, for instance, include contributions from thousands of different sources. Also, the hypertext links (Information Super Highway) between sites make it possible to draw together many different kinds of information that bear on a particular problem. These activities not only promote collaboration on a truly vast scale, they also enrich research. One important effect is the Serendipity effect: combining different datasets makes possible entirely new kinds of study. New studies inevitably lead to new and unexpected discoveries.

Computational Methods
A core set of computational approaches has emerged for dealing with the types of data that are currently shared in public databases: DNA, protein sequence, and protein structure.

Using public databases and data formats
The first key skill for biologists is to learn to use online search tools to find information. Literature searching is no longer a matter of looking up references in a printed index. You can find links to most of the scientific publications you need online. There are central databases that collect reference information so you can search dozens of journals at once. You can even set up agents that notify you when new articles are published in an area of interest. Searching the public molecular-biology databases requires the same skills as searching for literature references: you need to know how to construct a query statement that will pluck the particular needle you're looking for out of the database haystack.
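As an illustration of constructing a query statement, the sketch below builds a field-restricted search term and a search URL without sending any request. It assumes NCBI's E-utilities `esearch` endpoint and Entrez-style field tags such as `[Organism]` and `[Title]`. The slides do not name a particular search service, and the organism and title values are purely hypothetical examples.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_query(terms: dict[str, str]) -> str:
    """Join field-restricted terms with AND, e.g. 'x[Organism] AND y[Title]'."""
    return " AND ".join(f"{value}[{field}]" for field, value in terms.items())

def build_search_url(db: str, terms: dict[str, str], retmax: int = 20) -> str:
    """Return a search URL for the given database; nothing is fetched here."""
    params = {"db": db, "term": build_query(terms), "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)

# Hypothetical query: narrow by organism and by a phrase in the record title.
example_terms = {"Organism": "Mentha piperita", "Title": "limonene synthase"}
print(build_query(example_terms))
print(build_search_url("nucleotide", example_terms))
```

Restricting each term to a field is what turns a broad keyword search into a query that returns a manageable number of records.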

Sequence alignment and sequence searching
Being able to compare pairs of DNA or protein sequences and extract partial matches has made it possible to use a biological sequence as a database query. Sequence-based searching is another key skill for biologists; a little exploration of the biological databases at the beginning of a project often saves a lot of valuable time in the lab. Identifying homologous sequences provides a basis for phylogenetic analysis and sequence-pattern recognition. Sequence-based searching can be done online through web forms, so it requires no special computing skills, but to judge the quality of your search results you need to understand how the underlying sequence-alignment method works and go beyond simple sequence alignment to other types of analysis.
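To give a feel for how a pairwise method extracts partial matches, here is a minimal sketch of local alignment scoring in the style of the Smith-Waterman algorithm. The scoring values (match 2, mismatch -1, linear gap -2) are arbitrary assumptions. Database search tools use faster heuristics and substitution matrices, which this sketch does not attempt.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (Smith-Waterman, linear gaps)."""
    prev = [0] * (len(b) + 1)   # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: a local alignment may start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

# The shared core "ACG" is found even though the flanking bases differ.
print(local_alignment_score("TTACGTTT", "GGACGGG"))
```

Because the score resets to zero instead of going negative, unrelated flanking regions do not penalise a well-matching core, which is what makes the match "partial".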

Gene prediction
Gene prediction is only one of a cluster of methods for attempting to detect meaningful signals in uncharacterized DNA sequences. Until recently, most sequences deposited in GenBank were already characterized at the time of deposition. That is, someone had already gone in and, using molecular biology, genetic, or biochemical methods, figured out what the gene did. However, now that the genome projects are in full swing, there's a lot of DNA sequence out there that isn't characterized. Software for prediction of open reading frames, genes, exon splice sites, promoter binding sites, repeat sequences, and tRNA genes helps molecular biologists make sense out of this unmapped DNA.
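The simplest of these signals, an open reading frame, can be detected with a short scan. The sketch below is a deliberately naive assumption-laden example: it looks only at the three forward reading frames, treats ATG as the only start codon, and ignores the reverse strand, introns, and codon statistics that real gene predictors use.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna: str, min_codons: int = 3) -> list[tuple[int, int, str]]:
    """Return (start, end, sequence) for ATG...stop ORFs in the 3 forward frames."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None                    # look for the next ORF in this frame
    return sorted(orfs)

print(find_orfs("CCATGAAATTTTAGGG"))
```

The `min_codons` threshold (which counts the stop codon) is an illustrative knob: short ORFs occur by chance, so practical tools use much larger minimum lengths.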

Multiple sequence alignment


Multiple sequence-alignment methods assemble pairwise sequence alignments for many related sequences into a picture of sequence homology among all members of a gene family. Multiple sequence alignments aid in visual identification of sites in a DNA or protein sequence that may be functionally important. Such sites are usually conserved; that is, the same amino acid is present at that site in each one of a group of related sequences. Multiple sequence alignments can also be quantitatively analyzed to extract information about a gene family. Multiple sequence alignments are an integral step in phylogenetic analysis of a family of related sequences, and they also provide the basis for identifying sequence patterns that characterize particular protein families.
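Once a multiple alignment exists, conserved sites can be picked out column by column. The sketch below assumes the alignment is already computed and given as equal-length strings with `-` for gaps. The toy sequences are invented for illustration, and the conservation measure (fraction of sequences sharing the most common residue) is one simple choice among many.

```python
from collections import Counter

def column_conservation(alignment: list[str]) -> list[tuple[str, float]]:
    """For each column: (most common residue, fraction of sequences carrying it)."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")  # gaps are not residues
        if counts:
            residue, n = counts.most_common(1)[0]
        else:
            residue, n = "-", 0
        result.append((residue, n / len(column)))
    return result

def conserved_sites(alignment: list[str], threshold: float = 1.0) -> list[int]:
    """Indices of columns whose conservation is at least the threshold."""
    return [i for i, (_, frac) in enumerate(column_conservation(alignment))
            if frac >= threshold]

toy_alignment = ["MKV-LA", "MKVQLG", "MRVQLA"]   # hypothetical aligned sequences
print(conserved_sites(toy_alignment))
```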

Phylogenetic analysis
Phylogenetic analysis attempts to describe the evolutionary relatedness of a group of sequences. A traditional phylogenetic tree or cladogram groups species into a diagram that represents their relative evolutionary divergence. Branchings of the tree that occur furthest from the root separate individual species; branchings that occur close to the root group species into kingdoms, phyla, classes, families, genera, and so on. The information in a molecular sequence alignment can be used to compute a phylogenetic tree for a particular family of gene sequences. The branchings in phylogenetic trees represent evolutionary distance based on sequence similarity scores or on information-theoretic modeling of the number of mutational steps required to change one sequence into another. Phylogenetic analyses of protein sequence families talk not about the evolution of the entire organism but about evolutionary change in specific coding regions.
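One distance-based way to turn an alignment into a tree can be sketched as follows. This example uses the proportion of differing sites (p-distance) and UPGMA clustering, chosen here only because they are easy to follow. The slides do not prescribe a method, the sequence names are hypothetical, and branch lengths are omitted.

```python
from itertools import combinations

def p_distance(a: str, b: str) -> float:
    """Fraction of aligned, ungapped positions at which a and b differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(x != y for x, y in pairs) / len(pairs)

def upgma(seqs: dict[str, str]):
    """Return the tree topology as nested tuples of sequence names."""
    if not seqs:
        raise ValueError("need at least one sequence")
    names = list(seqs)
    trees = {i: name for i, name in enumerate(names)}   # cluster id -> subtree
    sizes = {i: 1 for i in trees}                       # cluster id -> leaf count
    dist = {(i, j): p_distance(seqs[names[i]], seqs[names[j]])
            for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(trees) > 1:
        i, j = min(dist, key=dist.get)                  # closest pair of clusters
        merged = sizes[i] + sizes[j]
        for k in trees:
            if k in (i, j):
                continue
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            # Size-weighted average distance to the merged cluster.
            dist[(k, next_id)] = (sizes[i] * d_ik + sizes[j] * d_jk) / merged
        dist = {pair: d for pair, d in dist.items()
                if i not in pair and j not in pair}
        trees[next_id] = (trees.pop(i), trees.pop(j))
        sizes[next_id] = merged
        del sizes[i], sizes[j]
        next_id += 1
    return next(iter(trees.values()))

toy = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TCGAACTA"}  # hypothetical data
print(upgma(toy))
```

The two most similar sequences are joined first, and the most divergent one attaches closest to the root, mirroring the description of branchings above.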

Extraction of patterns and profiles from sequence data


A motif is a sequence of amino acids that defines a substructure in a protein that can be connected to function or to structural stability. In a group of evolutionarily related gene sequences, motifs appear as conserved sites. Sites in a gene sequence tend to be conserved (that is, to remain the same in all or most representatives of a sequence family) when there is selection pressure against copies of the gene that have mutations at that site. Nonessential parts of the gene sequence will diverge from each other in the course of evolution, so the conserved motif regions show up as a signal in a sea of mutational noise. Sequence profiles are statistical descriptions of these motif signals; profiles can help identify distantly related proteins by picking out a motif signal even in a sequence that has diverged radically from other members of the same family.
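A profile of this kind can be sketched as a position-specific scoring matrix of log-odds values. The example below is an assumption-heavy illustration: it uses a DNA alphabet for brevity (passing a 20-letter alphabet would handle proteins the same way), a uniform background, a pseudocount of 1, and invented motif instances.

```python
import math
from collections import Counter

def build_profile(instances: list[str], alphabet: str = "ACGT",
                  pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Log-odds score for each residue at each motif position."""
    width = len(instances[0])
    if any(len(s) != width for s in instances):
        raise ValueError("motif instances must have equal length")
    background = 1.0 / len(alphabet)               # assumed uniform background
    total = len(instances) + pseudocount * len(alphabet)
    profile = []
    for column in zip(*instances):
        counts = Counter(column)
        profile.append({r: math.log2((counts[r] + pseudocount) / total / background)
                        for r in alphabet})
    return profile

def scan(profile: list[dict[str, float]], sequence: str) -> list[float]:
    """Score every window of the sequence against the profile."""
    width = len(profile)
    return [sum(col[sequence[i + k]] for k, col in enumerate(profile))
            for i in range(len(sequence) - width + 1)]

motif_instances = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]   # hypothetical instances
scores = scan(build_profile(motif_instances), "GGCGTATAATGCGC")
print(scores.index(max(scores)))   # start of the best-scoring window
```

Because each position is scored separately, a sequence that matches the motif at most but not all positions still scores well, which is how profiles detect diverged family members.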

Protein sequence analysis


The amino-acid content of a protein sequence can be used as the basis for many analyses, from computing the isoelectric point and molecular weight of the protein and the characteristic peptide mass fingerprints that will form when it's digested with a particular protease, to predicting secondary structure features and post-translational modification sites.
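As a small example of a composition-based calculation, the sketch below estimates molecular weight by summing average residue masses and adding one water molecule for the free termini. The mass table holds commonly tabulated average values in daltons; modifications, non-standard residues and the isoelectric point are outside this sketch.

```python
from collections import Counter

# Average residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.0153

def composition(protein: str) -> Counter:
    """Count of each amino acid in the sequence."""
    return Counter(protein.upper())

def molecular_weight(protein: str) -> float:
    """Approximate average molecular weight of an unmodified peptide chain."""
    counts = composition(protein)
    unknown = set(counts) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    return sum(RESIDUE_MASS[aa] * n for aa, n in counts.items()) + WATER_MASS

print(round(molecular_weight("GG"), 2))   # glycylglycine
```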

Protein structure prediction


It is a lot harder to determine the structure of a protein experimentally than it is to obtain DNA sequence data. One very active area of bioinformatics and computational biology research is the development of methods for predicting protein structure from protein sequence. Methods such as secondary structure prediction and threading can help determine how a protein might fold, classifying it with other proteins that have similar topology, but they don't provide a detailed structure model.

Protein structure property analysis
Protein structures have many measurable properties that are of interest to crystallographers and structural biologists. Protein structure validation tools are used by crystallographers to measure how well a structure model conforms to structural rules extracted from existing structures or chemical model compounds. These tools may also analyze the fitness of every amino acid in a structure model for its environment, flagging such oddities as buried charges with no countercharge or large patches of hydrophobic amino acids found on a protein surface. These tools are useful for evaluating both experimental and theoretical structure models.

Protein structure alignment and comparison


Even when two gene sequences aren't apparently homologous, the structures of the proteins they encode can be similar. New tools for computing structural similarity are making it possible to detect distant homologies by comparing structures, even in the absence of much sequence similarity.

Biochemical simulation
Biochemical simulation uses the tools of dynamical systems modeling to simulate the chemical reactions involved in metabolism. Simulations can extend from individual metabolic pathways to transmembrane transport processes and even properties of whole cells or tissues. Biochemical and cellular simulations traditionally have relied on the ability of the scientist to describe a system mathematically, developing a system of differential equations that represent the different reactions and fluxes occurring in the system. However, new software tools can build the mathematical framework of a simulation automatically from a description provided interactively by the user, making mathematical modeling accessible to any biologist who knows enough about a system to describe it according to the conventions of dynamical systems modeling.
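To show what "a system of differential equations" looks like in the smallest possible case, the sketch below integrates a single enzyme-catalysed conversion of substrate S to product P with Michaelis-Menten kinetics, dS/dt = -Vmax*S/(Km + S). The reaction, the parameter values and the fixed-step Euler method are all illustrative assumptions. Real simulation packages use more robust solvers and many coupled equations.

```python
def simulate(s0: float, vmax: float, km: float, dt: float, steps: int):
    """Euler integration of S -> P; returns a list of (time, S, P) tuples."""
    s, p = s0, 0.0
    trajectory = [(0.0, s, p)]
    for n in range(1, steps + 1):
        rate = vmax * s / (km + s)        # Michaelis-Menten reaction rate
        converted = min(rate * dt, s)     # never consume more substrate than exists
        s -= converted
        p += converted
        trajectory.append((n * dt, s, p))
    return trajectory

# Hypothetical parameters in arbitrary concentration and time units.
run = simulate(s0=10.0, vmax=1.0, km=0.5, dt=0.01, steps=1000)
t_end, s_end, p_end = run[-1]
print(f"t={t_end:.1f}  S={s_end:.3f}  P={p_end:.3f}")
```

Whatever leaves S arrives in P, so the total S + P stays constant, a simple sanity check that carries over to larger pathway models.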

Whole genome analysis


As more and more genomes are sequenced completely, the analysis of raw genome data has become a more important task. There are a number of perspectives from which one can look at genome data: for example, it can be treated as a long linear sequence, but it's often more useful to integrate DNA sequence information with existing genetic and physical map data. This allows you to navigate a very large genome and find what you want.

Primer design
Many molecular biology protocols require the design of oligonucleotide primers. Proper primer design is critical for the success of polymerase chain reaction (PCR), oligo hybridization, DNA sequencing, and microarray experiments. Primers must hybridize with the target DNA to provide a clear answer to the question being asked, but they must also have appropriate physicochemical properties; they must not self-hybridize or dimerize; and they should not have multiple targets within the sequence under investigation. There are several web-based services that allow users to submit a DNA sequence and automatically detect appropriate primers, or to compute the properties of a desired primer DNA sequence.
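Several of the primer criteria listed above can be checked with simple string operations. The sketch below is a rough illustration only: it uses the Wallace rule (2 degrees per A/T, 4 degrees per G/C) as a crude melting-temperature estimate suited to short oligos, exact string matching for target counting, and a 4-base 3' tail for the self-dimer test. The function names and thresholds are assumptions, not any service's actual procedure.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.upper().translate(COMPLEMENT)[::-1]

def gc_content(seq: str) -> float:
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def wallace_tm(seq: str) -> int:
    """Rough melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc

def count_occurrences(pattern: str, text: str) -> int:
    """Count matches of pattern in text, including overlapping ones."""
    count, pos = 0, text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)
    return count

def count_target_sites(primer: str, template: str) -> int:
    """Exact matches of the primer sequence on either strand of the template."""
    primer, template = primer.upper(), template.upper()
    return (count_occurrences(primer, template)
            + count_occurrences(primer, reverse_complement(template)))

def has_3prime_self_complementarity(primer: str, tail: int = 4) -> bool:
    """True if the 3' tail could pair with a region of another primer copy."""
    primer = primer.upper()
    return reverse_complement(primer[-tail:]) in primer

print(wallace_tm("ATGCATGC"), gc_content("ATGCATGC"))
```

A primer with more than one target site, or one that fails the self-complementarity check, would be rejected before any thermodynamic refinement.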

DNA microarray analysis


DNA microarray analysis is a relatively new molecular biology method that expands on classic probe hybridization methods to provide access to thousands of genes at once. The main tasks in microarray analysis as it's currently done include an image analysis step, in which individual spots on the array image are identified and their signal intensities are measured.

Proteomics analysis
Before they're ever crystallized and biochemically characterized, proteins are often studied using a combination of gel electrophoresis, partial sequencing, and mass spectroscopy. 2-D gel electrophoresis can separate a mixture of thousands of proteins into distinct components; the individual spots of material can be blotted or even cut from the gel and analyzed. Simple computational tools can provide some information to aid in the process of analyzing protein mixtures. It's trivial to compute molecular weight and pI from a protein sequence; by using these values in combination, sets of candidate identities can be found for each spot on a gel. It's also possible to compute, from a protein sequence, the peptide fingerprint that is created when that protein is broken down into fragments by enzymes with specific protein cleavage sites.
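The peptide fingerprint mentioned above starts with an in-silico digest. The sketch below assumes trypsin as the enzyme and its usual simplified rule: cleave after lysine (K) or arginine (R) unless the next residue is proline (P). The example sequences are made up, and missed cleavages and fragment masses are left out.

```python
def tryptic_digest(protein: str) -> list[str]:
    """Split a protein into peptides using a simplified trypsin rule."""
    protein = protein.upper()
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])   # cut after K or R
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])            # C-terminal remainder
    return peptides

print(tryptic_digest("MKWVTFRPSLLK"))
```

Each resulting peptide has a mass that can be computed from its sequence, and the list of those masses is the fingerprint compared against mass-spectrometry data.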

Databases
The internet is a powerful resource containing a large volume of data and tools to manipulate them. Unfortunately, connecting data between them can sometimes be tricky.

What is a database?
- An organized body of related information.
- A collection of information organized and presented to serve a specific purpose.
- A computerized database is an updated, organized file of machine-readable information that is rapidly searched and retrieved by computer.

A computerized database:
- is a computerized storehouse of data (records)
- allows user-defined queries
- allows extraction of specified records
- allows adding, changing, removing, and merging of records
- uses standardized formats
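The database operations listed above (user-defined queries, extraction of specified records, adding, changing and removing records) can be tried out with Python's built-in `sqlite3` module. The table layout, accession codes and record values below are invented for illustration and do not correspond to any real database entries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute(
    "CREATE TABLE sequences ("
    "accession TEXT PRIMARY KEY, organism TEXT, description TEXT, length INTEGER)"
)

# Adding records (hypothetical entries).
conn.executemany(
    "INSERT INTO sequences VALUES (?, ?, ?, ?)",
    [
        ("X0001", "Mentha piperita", "limonene synthase mRNA", 2100),
        ("X0002", "Mentha piperita", "partial EST", 420),
        ("X0003", "Artemisia annua", "amorpha-4,11-diene synthase mRNA", 1900),
    ],
)

# User-defined query: extract only the records matching given criteria.
long_mint = conn.execute(
    "SELECT accession FROM sequences "
    "WHERE organism = ? AND length > ? ORDER BY accession",
    ("Mentha piperita", 500),
).fetchall()

# Changing and removing records.
conn.execute("UPDATE sequences SET length = ? WHERE accession = ?", (450, "X0002"))
conn.execute("DELETE FROM sequences WHERE accession = ?", ("X0003",))
remaining = conn.execute("SELECT COUNT(*) FROM sequences").fetchone()[0]

print(long_mint, remaining)
```

The fixed column layout plays the role of the "standardized format": every record can be retrieved by the same query regardless of who added it.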

The ideal sequence database for computational analyses and data-mining:

- It must be complete, with minimal redundancy
- It must contain as much up-to-date information (annotation) as possible on each sequence
- All the information items must be retrievable by computer programs in a consistent manner
- It must be highly interoperable with other databases

Database Categories List

- Nucleotide Sequence Databases
- RNA sequence databases
- Protein sequence databases
- Structure Databases
- Genomics Databases (non-vertebrate)
- Metabolic and Signaling Pathways
- Human and other Vertebrate Genomes
- Human Genes and Diseases
- Microarray Data and other Gene Expression Databases
- Proteomics Resources
- Other Molecular Biology Databases
- Organelle databases
- Plant databases
- Immunological databases

Nucleotide Sequence Databases

The nucleotide sequence databases are data repositories, accepting nucleic acid sequence data from the scientific community and making it freely available. The databases strive for completeness, with the aim of recording every publicly known nucleic acid sequence. These data are heterogeneous: they vary with respect to the source of the material (e.g. genomic versus cDNA), the intended quality (e.g. finished versus single-pass sequences), the extent of sequence annotation, and the intended completeness of the sequence relative to its biological target (e.g. complete versus partial coverage of a gene or a genome). The nucleotide databases are distributed free of charge over the internet.

DDBJ, GenBank and EMBL-Bank exchange new and updated data on a daily basis to achieve optimal synchronisation. The result is that they contain exactly the same information, except for sequences that have been added in the last 24 hours.

Nucleotide sequence databases can be further subdivided into the following:
1) International Nucleotide Sequence Database Collaboration
2) Coding and non-coding DNA
3) Gene structure, introns and exons, splice sites
4) Transcriptional regulator sites and transcription factors

Nucleotide Sequence Databases

1.1. International Nucleotide Sequence Database Collaboration

Database name | Full name and/or description | URL
GenBank | An annotated collection of all publicly available nucleotide and protein sequences | http://www.ncbi.nlm.nih.gov/
EMBL Nucleotide Sequence Database | An annotated collection of all publicly available nucleotide and protein sequences | http://www.ebi.ac.uk/embl.html
DDBJ (DNA Data Bank of Japan) | An annotated collection of all publicly available nucleotide and protein sequences | http://www.ddbj.nig.ac.jp

Online databases


Primary repositories of sequence data:
- European Bioinformatics Institute (EBI)
- DNA Data Bank of Japan (DDBJ)
- GenBank, National Center for Biotechnology Information (NCBI)

Each of these databases contains equivalent information (formats vary slightly).

1.2. DNA sequences: genes, motifs and regulatory sites 1.2.1. Coding and coding DNA
ACLAME CUTG A classification of genetic mobile elements Codon usage tabulated from GenBank Deviations from the standard genetic code in various organisms and organelles Human endogenous retrovirus database Immunoglobulin, T cell receptor and MHC nucleotide sequences from human and other vertebrates http://aclame.ulb.ac.be/ http://www.kazusa.or.jp/codon/ http://www.ncbi.nlm.nih.gov/Taxonomy/ Utils/wprintgc.cgi?mode=c http://herv.img.cas.cz

Genetic Codes HERVd IMGT/LIGMDB Imprinted Gene Catalogue

http://imgt.cines.fr/cgi-bin/IMGTlect.jv

Imprinted genes and parent-of-origin effects in animals

http://www.otago.ac.nz/IGC

Islander MICdb

Pathogenicity islands and prophages in bacterial genomes Prokaryotic microsatellites

http://www.indiana.edu/islander http://www.cdfd.org.in/micas

STRBase TIGR Gene Indices

Short tandem DNA repeats database Organism-specific databases of EST and gene sequences

http://www.cstl.nist.gov/div831/strbase/ http://www.tigr.org/tdb/tgi.shtml

Transterm

Codon usage, start and stop signals

http://uther.otago.ac.nz/Transterm.html

UniGene

Unified clusters of ESTs and full-length mRNA sequences

http://www.ncbi.nlm.nih.gov/UniGene/

UniVec

Vector sequences, adapters, linkers and primers used in DNA cloning, can be used to check for vector contamination

http://www.ncbi.nlm.nih.gov/VecScreen/UniVec.html

VectorDB

Characterization and classification of nucleic acid vectors

http://genome-www2.stanford.edu/vectordb/

Xpro

Eukaryotic protein-encoding DNA sequences, both intron-containing and intron-less genes

http://origin.bic.nus.edu.sg/xpro/
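Several of the resources above, such as CUTG and Transterm, tabulate codon usage. What such a table records can be illustrated with a short Python sketch that counts the codons of a single coding sequence. The input sequence and the function name are hypothetical; real codon usage tables aggregate counts over many genes per organism.

```python
from collections import Counter


def codon_usage(cds):
    """Count codons in a coding sequence read in frame from position 0.
    Trailing bases that do not fill a whole codon are ignored."""
    cds = cds.upper().replace("U", "T")   # accept RNA as well as DNA input
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return Counter(codons)


# Hypothetical five-codon sequence: ATG GCT GCT AAA TAA
usage = codon_usage("ATGGCTGCTAAATAA")
for codon, count in sorted(usage.items()):
    print(codon, count)
```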

1.2.2. Gene structure, introns and exons, splice sites

ASAP

Alternative spliced isoforms

http://www.bioinformatics.ucla.edu/ASAP

ASD

EBI's alternative splicing database project, which includes three databases: AltSplice, AltExtron and AEdb

http://www.ebi.ac.uk/asd

ASDB

Alternative splicing database: protein products and expression patterns of alternatively-spliced genes

http://hazelton.lbl.gov/teplitski/alt

EASED

Extended alternatively spliced EST database

http://eased.bioinf.mdc-berlin.de/

EID

Exon-intron database: introns in protein-coding genes

http://mcb.harvard.edu/gilbert/EID/

ExInt

Exon-intron structure of eukaryotic genes

http://intron.bic.nus.edu.sg/exint/exint.html

HS3D

Homo sapiens splice sites dataset

http://www.sci.unisannio.it/docenti/rampone/

IDB/IEDB

Intron sequence and evolution databases

http://nutmeg.bio.indiana.edu/intron/index.html

Intronerator

Introns and alternative splicing in C.elegans and C.briggsae

http://www.cse.ucsc.edu/kent/intronerator/

SpliceDB

Canonical and non-canonical mammalian splice sites

http://genomic.sanger.ac.uk/spldb/SpliceDB.html

SpliceNest

A tool for visualizing splicing of genes from EST data

http://splicenest.molgen.mpg.de/

YIDB

Yeast nuclear and mitochondrial intron sequences

http://www.embl-heidelberg.DE/ExternalInfo/seraphin/yidb.html

1.2.3. Transcriptional regulator sites and transcription factors


ACTIVITY

Functional DNA/RNA site activity

http://util.bionet.nsc.ru/databases/activity.html

DBTBS

Bacillus subtilis promoters and transcription factors

http://dbtbs.hgc.jp/

DBTSS

A database of transcriptional start sites

http://dbtss.hgc.jp/

DPInteract

Binding sites for E.coli DNA-binding proteins

http://arep.med.harvard.edu/dpinteract

EPD

Eukaryotic promoter database

http://www.epd.isb-sib.ch

HemoPDB

Hematopoietic promoter database: transcriptional regulation in hematopoiesis

http://bioinformatics.med.ohio-state.edu/HemoPDB

HvrBase

Primate mitochondrial DNA control region sequences

http://www.hvrbase.org/

JASPAR

PSSMs for transcription factor DNA-binding sites

http://jaspar.cgb.ki.se

PLACE

Plant cis-acting regulatory DNA elements

http://www.dna.affrc.go.jp/htdocs/PLACE

PlantCARE

Plant promoters and cis-acting regulatory elements

http://intra.psb.ugent.be:8080/PlantCARE/

PlantProm

Plant promoter sequences for RNA polymerase II

http://mendel.cs.rhul.ac.uk/

Database Categories List

Nucleotide Sequence Databases
RNA sequence databases
Protein sequence databases
Structure Databases
Genomics Databases (non-vertebrate)
Metabolic and Signaling Pathways
Human and other Vertebrate Genomes
Human Genes and Diseases
Microarray Data and other Gene Expression Databases
Proteomics Resources
Other Molecular Biology Databases
Organelle databases
Plant databases
Immunological databases

RNA sequence databases


The RNA sequence databases compile complete or nearly complete ribosomal RNA sequences, as well as other specific classes of RNA sequences. Some of them contain secondary structure information. Additional information about the sequences, such as the taxonomic classification of the organism from which they were obtained and literature references, is also provided. There are databases covering 16S and 23S ribosomal RNA mutations, 5S rRNA sequences, genomic tRNA, all complete or nearly complete rRNA sequences, and more.

2. RNA sequence databases


16S and 23S rRNA Mutation Database

16S and 23S ribosomal RNA mutations

http://ribosome.fandm.edu/

5S rRNA Database

5S rRNA sequences

http://biobases.ibch.poznan.pl/5SData/

Aptamer database

Small RNA/DNA molecules binding nucleic acids, proteins

http://aptamer.icmb.utexas.edu/

ARED

AU-rich element-containing mRNA database

http://rc.kfshrc.edu.sa/ared

Mobile group II introns

A database of group II introns, self-splicing catalytic RNAs

http://www.fp.ucalgary.ca/group2introns/

European rRNA database

All complete or nearly complete rRNA sequences

http://www.psb.ugent.be/rRNA/

GtRDB

Genomic tRNA database

http://rna.wustl.edu/GtRDB

Guide RNA Database

RNA editing in various kinetoplastid species

http://biosun.bio.tu-darmstadt.de/goringer/gRNA/gRNA.html

HIV Sequence Database

HIV RNA sequences

http://hiv-web.lanl.gov/

HyPaLib

Hybrid pattern library: structural elements in classes of RNA

http://bibiserv.techfak.uni-bielefeld.de/HyPa/

IRESdb

Internal ribosome entry site database

http://ifr31w3.toulouse.inserm.fr/IRESdatabase/

miRNA Registry

Database of microRNAs (small non-coding RNAs)

http://www.sanger.ac.uk/Software/Rfam/mirna/

NCIR

Non-canonical interactions in RNA structures

http://prion.bchs.uh.edu/bp_type/

ncRNAs Database

Non-coding RNAs with regulatory functions

http://biobases.ibch.poznan.pl/ncRNA/

PLANTncRNAs

Plant non-coding RNAs

http://www.prl.msu.edu/PLANTncRNAs

Plant snoRNA DB

snoRNA genes in plant species

http://www.scri.sari.ac.uk/plant_snoRNA/

PLMItRNA

Plant mitochondrial tRNA

http://bighost.area.ba.cnr.it/PLMItRNA/

PseudoBase

Database of RNA pseudoknots

http://wwwbio.leidenuniv.nl/~Batenburg/PKB.html

RDP

Ribosomal database project: rRNA sequence data

http://rdp.cme.msu.edu

Rfam

Non-coding RNA families

http://www.sanger.ac.uk/Software/Rfam/

RISSC

Ribosomal internal spacer sequence collection

http://ulises.umh.es/RISSC

RNA Modification Database

Naturally modified nucleosides in RNA

http://medlib.med.utah.edu/RNAmods/

RRNDB

rRNA operon numbers in various prokaryotes

http://rrndb.cme.msu.edu/

Small RNA Database

Small RNAs from prokaryotes and eukaryotes

http://mbcr.bcm.tmc.edu/smallRNA

SRPDB

Signal recognition particle database

http://psyche.uthct.edu/dbs/SRPDB/SRPDB.html

Subviral RNA Database

Viroids and viroid-like RNAs

http://subviral.med.uottawa.ca/cgi-bin/home.cgi

tmRNA Website

tmRNA sequences and alignments

http://www.indiana.edu/tmrna

tmRDB

tmRNA database

http://psyche.uthct.edu/dbs/tmRDB/tmRDB.html

tRNA database

tRNA viewer and sequence editor

http://www.uni-bayreuth.de/departments/biochemie/trna/

UTRdb/UTRsite

5'- and 3'-UTRs of eukaryotic mRNAs

http://bighost.area.ba.cnr.it/srs6/

Types of protein databases


1. Protein sequence databases
2. Protein motif databases
3. Protein structure databases

[Figure: illustrative pairwise alignment of two short sequences]

Protein sequence databases


The protein databases are the most comprehensive source of information on proteins. It is necessary to distinguish between universal databases, which cover proteins from all species, and specialised data collections, which store information about specific families or groups of proteins, or about the proteins of a specific organism. Two categories of universal protein databases can be discerned: simple archives of sequence data, and annotated databases in which additional information has been added to the sequence record. The upcoming slides list databases such as:
- Primary protein sequence databases such as UniProt/Swiss-Prot
- Specialised protein sequence databases such as GOA
- Specialised protein databases such as ENZYME
- Secondary protein databases such as InterPro
- Structure databases such as PDB

3. Protein sequence databases

3.1. General sequence databases

EXProt

Sequences of proteins with experimentally verified function

http://www.cmbi.kun.nl/EXProt/

NCBI Protein database

All protein sequences: translated from GenBank and imported from other protein databases

http://www.ncbi.nlm.nih.gov/entrez

PIR

Protein information resource: a collection of protein sequence databases, part of the UniProt project

http://pir.georgetown.edu/

PIR-NREF

PIR's non-redundant reference protein database

http://pir.georgetown.edu/pirwww/pirnref.shtml

PRF

Protein research foundation database of peptides: sequences, literature and unnatural amino acids

http://www.prf.or.jp/en

Swiss-Prot

Curated protein sequence database with a high level of annotation (protein function, domain structure, modifications)

http://www.expasy.org/sprot

TrEMBL

Translations of EMBL nucleotide sequence entries: computer-annotated supplement to Swiss-Prot

http://www.expasy.org/sprot

UniProt

Universal protein knowledgebase: a database of protein sequence from Swiss-Prot, TrEMBL and PIR

http://www.uniprot.org/

3.2. Protein properties

AAindex

Physicochemical properties of amino acids

http://www.genome.ad.jp/aaindex/

ProTherm

Thermodynamic data for wild-type and mutant proteins

http://gibk26.bse.kyutech.ac.jp/jouhou/Protherm/protherm.html

3.3. Protein localization and targeting


DBSubLoc

Database of protein subcellular localization

http://www.bioinfo.tsinghua.edu.cn/dbsubloc.html

MitoDrome

Nuclear-encoded mitochondrial proteins of Drosophila

http://bighost.area.ba.cnr.it/BIG/MitoDrome

NESbase

Nuclear export signals database

http://www.cbs.dtu.dk/databases/NESbase

NLSdb

Nuclear localization signals

http://cubic.bioc.columbia.edu/db/NLSdb/

THGS

Transmembrane helices in genome sequences

http://pranag.physics.iisc.ernet.in/thgs/

TMPDB

Experimentally characterized transmembrane topologies

http://bioinfo.si.hirosaki-.ac.jp/TMPDB/

3.4. Protein sequence motifs and active sites


ASC

Active sequence collection: biologically active peptides

http://bioinformatica.isa.cnr.it/ASC/

Blocks

Alignments of conserved regions in protein families

http://blocks.fhcrc.org/

CSA

Catalytic site atlas: enzyme active sites and catalytic residues in enzymes of known 3D structure

http://www.ebi.ac.uk/thornton-srv/databases/CSA/

COMe

Co-ordination of metals etc.: classification of bioinorganic proteins (metalloproteins and some other complex proteins)

http://www.ebi.ac.uk/come

eMOTIF

Protein sequence motif determination and searches

http://motif.stanford.edu/emotif

Metalloprotein Site Database

Metal-binding sites in metalloproteins

http://metallo.scripps.edu/

O-GlycBase

O- and C-linked glycosylation sites in proteins

http://www.cbs.dtu.dk/databases/OGLYCBASE/

PhosphoBase

Protein phosphorylation sites

http://www.cbs.dtu.dk/databases/PhosphoBase/

PROMISE

Prosthetic centers and metal ions in protein active sites

http://metallo.scripps.edu/PROMISE

PROSITE

Biologically significant protein patterns and profiles

http://www.expasy.org/prosite
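PROSITE patterns are written in a compact syntax that can be translated into an ordinary regular expression. The following Python sketch handles the core elements of that syntax: `x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats, and `<` / `>` for the sequence ends. The function names and the test sequence are hypothetical, and less common constructs are deliberately not covered.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        # Split off an optional repeat count such as (3) or (2,4)
        match = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError(f"unsupported element: {element!r}")
        residues, repeat = match.groups()
        if residues == "x":
            core = "."                               # any residue
        elif residues.startswith("{") and residues.endswith("}"):
            core = "[^" + residues[1:-1] + "]"       # excluded residues
        else:
            core = residues                          # one letter or an [ABC] set
        if repeat:
            core += "{" + repeat + "}"
        parts.append(("^" if anchor_start else "") + core + ("$" if anchor_end else ""))
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    regex = re.compile("(?=" + prosite_to_regex(pattern) + ")")
    return [m.start() for m in regex.finditer(sequence)]


# N-glycosylation site pattern applied to a made-up sequence
print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_motif("N-{P}-[ST]-{P}", "MNGSAANPTQ"))
```

A lookahead is used in `find_motif` so that overlapping occurrences are reported, which a plain `finditer` over the bare expression would skip.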

3.5. Protein domain databases; protein classification

CDD

Conserved domain database: includes protein domains from Pfam, SMART and COG databases

http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml

CluSTr

Clusters of Swiss-Prot+TrEMBL proteins

http://www.ebi.ac.uk/clustr

Hits

A database of protein domains and motifs

http://hits.isb-sib.ch/

InterPro

Integrated resource of protein families, domains and functional sites

http://www.ebi.ac.uk/interpro

iProClass

Integrated protein classification database

http://pir.georgetown.edu/iproclass/

MetaFam

Database of protein family annotations

http://metafam.ahc.umn.edu/

Pfam

Protein families: multiple sequence alignments and profile hidden Markov models of protein domains

http://www.sanger.ac.uk/Software/Pfam/

PIRSF

Family/superfamily classification of whole proteins

http://pir.georgetown.edu/pirsf/

PRINTS

Hierarchical gene family fingerprints

http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/

PIR-ALN

Curated database of protein sequence alignments

http://pir.georgetown.edu/pirwww/dbinfo/piraln.html

ProClass

Protein families defined by PIR superfamilies and PROSITE patterns

http://pir.georgetown.edu/gfserver/proclass.html

ProDom

Protein domain families

http://www.toulouse.inra.fr/prodom.html

ProtoMap

Hierarchical classification of Swiss-Prot proteins

http://protomap.cornell.edu/

ProtoNet

Hierarchical clustering of Swiss-Prot proteins

http://www.protonet.cs.huji.ac.il/

SBASE

Protein domain sequences and tools

http://www.icgeb.org/sbase

SMART

Simple modular architecture research tool: signalling, extracellular and chromatin-associated protein domains

http://smart.embl-heidelberg.de/

SUPFAM

Grouping of sequence families into superfamilies

http://pauling.mbu.iisc.ernet.in/supfam

SYSTERS

Systematic re-searching and clustering of proteins

http://systers.molgen.mpg.de/

TIGRFAMs

TIGR protein families adapted for functional annotation

http://www.tigr.org/TIGRFAMs

3.6. Databases of individual protein families


AARSDB

Aminoacyl-tRNA synthetase database

http://rose.man.poznan.pl/aars/index.html

ABCdb

ABC transporters database

http://ir2lcb.cnrs-mrs.fr/ABCdb/

ASPD

Artificial selected proteins/peptides database

http://wwwmgs.bionet.nsc.ru/mgs/gnw/aspd/

BacTregulators

Transcriptional regulators of AraC and TetR families

http://www.bactregulators.org/

CSDBase

Cold shock domain-containing proteins

http://www.chemie.uni-marburg.de/~csdbase/

DExH/D Family Database

DEAD-box, DEAH-box and DExH-box proteins

http://www.helicase.net/dexhd/dbhome.htm

Endogenous GPCR List

G protein-coupled receptors; expression in cell lines

http://www.tumor-gene.org/GPCR/gpcr.html

ESTHER

Esterases and other alpha/beta hydrolase enzymes

http://www.ensam.inra.fr/esther

EyeSite

Families of proteins functioning in the eye

http://eyesite.cryst.bbk.ac.uk/

GPCRDB

G protein-coupled receptors database

http://www.gpcr.org/7tm/

Histone Database

Histone fold sequences and structures

http://research.nhgri.nih.gov/histones/

HIV Molecular Immunology Database

HIV epitopes

http://hiv-web.lanl.gov/immunology/

HIV Protease Database

HIV reverse transcriptase and protease sequences

http://hivdb.stanford.edu/

Homeobox Page

Homeobox proteins, classification and evolution

http://www.biosci.ki.se/groups/tbu/homeo.html

Homeodomain Resource

Homeodomain sequences, structures and related genetic and genomic information

http://research.nhgri.nih.gov/homeodomain

HORDE

Human olfactory receptor data exploratorium

http://bioinfo.weizmann.ac.il/HORDE/

InBase

Inteins (protein splicing elements) database: properties, sequences, bibliography

http://www.neb.com/neb/inteins.html

Kabat Database

Sequences of proteins of immunological interest

http://immuno.bme.nwu.edu/

KinG

Ser/Thr/Tyr-specific protein kinases encoded in complete genomes

http://hodgkin.mbu.iisc.ernet.in/king

Knottins

Database of knottins: small proteins with an unusual disulfide through disulfide knot

http://knottin.cbs.cnrs.fr

LGICdb

Ligand-gated ion channel subunit sequences database

http://www.pasteur.fr/recherche/banques/LGIC/LGIC.html

Lipase Engineering Database

Sequence, structure and function of lipases and esterases

http://www.led.uni-stuttgart.de/

LOX-DB

Mammalian, invertebrate, plant and fungal lipoxygenases

http://www.dkfz-heidelberg.de/spec/lox-db/

MEROPS

Database of proteolytic enzymes (peptidases)

http://www.merops.ac.uk/

MHCPEP

MHC-binding peptides

http://wehih.wehi.edu.au/mhcpep/

MPIMP

Mitochondrial protein import machinery of plants

http://millar3.biochem.uwa.edu.au/~lister/index.html

NPD

Nuclear protein database

http://npd.hgu.mrc.ac.uk/

NucleaRDB

Nuclear receptor superfamily

http://www.receptors.org/NR/

Nuclear Receptor Resource

Nuclear receptor superfamily

http://nrr.georgetown.edu/nrr/nrr.html

NUREBASE

Nuclear hormone receptors database

http://www.ens-lyon.fr/LBMC/laudet/nurebase/nurebase.html

Olfactory Receptor Database

Sequences for olfactory receptor-like molecules

http://ycmi.med.yale.edu/senselab/ordb/

ooTFD

Object-oriented transcription factors database

http://www.ifti.org/ootfd

PKR

Protein kinase resource: sequences, enzymology, genetics and molecular and structural properties

http://pkr.sdsc.edu/

PLANT-PIs

Plant protease inhibitors

http://bighost.area.ba.cnr.it/PLANT-PIs

PlantsP/PlantsT

Plant proteins involved in phosphorylation and membrane transport

http://plantsp.sdsc.edu/

Prolysis

Proteases and natural and synthetic protease inhibitors

http://delphi.phys.univ-tours.fr/Prolysis/

REBASE

Restriction enzymes and associated methylases

http://rebase.neb.com/rebase/rebase.html

Ribonuclease P Database

RNase P sequences, alignments and structures

http://www.mbio.ncsu.edu/RNaseP/home.html

RPG

Ribosomal protein gene database

http://ribosome.miyazaki-med.ac.jp/

RTKdb

Receptor tyrosine kinase sequences

http://pbil.univ-lyon1.fr/RTKdb/

S/MARt dB

Nuclear scaffold/matrix attached regions

http://smartdb.bioinf.med.uni-goettingen.de/

SDAP

Structural database of allergenic proteins and food allergens

http://fermi.utmb.edu/SDAP

SENTRA

Sensory signal transduction proteins

http://wit.mcs.anl.gov/WIT2/Sentra/HTML/sentra.html

SEVENS

7-transmembrane helix receptors (G-protein-coupled)

http://sevens.cbrc.jp/

SRPDB

Proteins of the signal recognition particles

http://bio.lundberg.gu.se/dbs/SRPDB/SRPDB.html

TrSDB

Transcription factor database

http://ibb.uab.es/trsdb

VIDA

Homologous viral protein families database

http://www.biochem.ucl.ac.uk/bsm/virus_database/VIDA.html

VKCDB

Voltage-gated potassium channel database

http://vkcdb.biology.ualberta.ca/

Wnt Database

Wnt proteins and phenotypes

http://www.stanford.edu/rnusse/wntwindow.html

Structure Databases

The number of known molecular structures is increasing very rapidly, and these structures are available through databases that hold structural information on specific classes of molecules. The subcategories in this division of molecular databases are: 1) Small molecules 2) Carbohydrates 3) Nucleic acid structure 4) Protein structure.

4. Structure Databases

4.1. Small molecules

CSD

Cambridge structural database: crystal structure information for organic and metal-organic compounds

http://www.ccdc.cam.ac.uk/prods/csd/csd.html

HIC-Up

Hetero-compound Information Centre, Uppsala

http://xray.bmc.uu.se/hicup

AANT

Amino acid-nucleotide interaction database

http://aant.icmb.utexas.edu/

Klotho

Collection and categorization of biological compounds

http://www.biocheminfo.org/klotho

LIGAND

Chemical compounds and reactions in biological pathways

http://www.genome.ad.jp/ligand/

4.2. Carbohydrates

CCSD

Complex carbohydrate structure database (CarbBank)

http://bssv01.lancs.ac.uk/gig/pages/gag/carbbank.htm

Glycan

Carbohydrate database, part of the KEGG system

http://glycan.genome.ad.jp/

GlycoSuiteDB

N- and O-linked glycan structures and biological sources

http://www.glycosuite.com/

Monosaccharide Browser

Space filling Fischer projections of monosaccharides

http://www.jonmaber.demon.co.uk/monosaccharide

SWEET-DB

Annotated carbohydrate structure and substance information

http://www.dkfz-heidelberg.de/spec2/sweetdb/

4.3. Nucleic acid structure


NDB

Nucleic acid-containing structures

http://ndbserver.rutgers.edu/

NTDB

Thermodynamic data for nucleic acids

http://ntdb.chem.cuhk.edu.hk/

RNABase

RNA-containing structures from PDB and NDB

http://www.rnabase.org/

SCOR

Structural classification of RNA: RNA motifs by structure, function and tertiary interactions

http://scor.lbl.gov/

PRODORIC NET

Prokaryotic database of gene regulation networks

http://prodoric.tu-bs.de/

PromEC

E.coli promoters with experimentally-identified transcriptional start sites

http://bioinfo.md.huji.ac.il/marg/promec

SELEX_DB

DNA and RNA binding sites for various proteins, found by systematic evolution of ligands by exponential enrichment

http://wwwmgs.bionet.nsc.ru/mgs/systems/selex/

TESS

Transcription element search system

http://www.cbil.upenn.edu/tess

TRANSCompel

Composite regulatory elements affecting gene transcription in eukaryotes

http://www.gene-regulation.com/pub/databases.html#transcompel

TRANSFAC

Transcription factors and binding sites

http://transfac.gbf.de/TRANSFAC/index.html

TRRD

Transcription regulatory regions of eukaryotic genes

http://www.bionet.nsc.ru/trrd/

4.4. Protein structure


ArchDB

Automated classification of protein loop structures

http://gurion.imim.es/archdb

ASTRAL

Sequences of domains of known structure, selected subsets and sequence-structure correspondences

http://astral.stanford.edu/

BAliBASE

A database for comparison of multiple sequence alignments

http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE2/index.html

BioMagResBank

NMR spectroscopic data for proteins and nucleic acids

http://www.bmrb.wisc.edu/

CADB

Conformational angles in proteins database

http://cluster.physics.iisc.ernet.in/cadb/

CATH

Protein domain structures database

http://www.biochem.ucl.ac.uk/bsm/cath_new

CE

3D protein structure alignments

http://cl.sdsc.edu/ce.html

CKAAPs DB

Structurally-similar proteins with dissimilar sequences

http://ckaap.sdsc.edu/

Dali

Protein fold classification using the Dali search engine

http://www.bioinfo.biocenter.helsinki.fi:8080/dali/

Decoys R Us

Computer-generated protein conformations

http://dd.stanford.edu/

DisProt

Database of Protein Disorder: information about proteins that lack fixed 3D structure in their native states

http://divac.ist.temple.edu/disprot

DomIns

Domain insertions in known protein structures

http://stash.mrc-lmb.cam.ac.uk/DomIns

DSDBASE

Native and modeled disulfide bonds in proteins

http://www.ncbs.res.in/~faculty/mini/dsdbase/dsdbase.html

DSMM

Database of simulated molecular motions

http://projects.villaosch.de/dbase/dsmm/

eF-site

Electrostatic surface of Functional site: electrostatic potentials and hydrophobic properties of the active sites

http://ef-site.protein.osaka-u.ac.jp/eF-site

FSSP

Fold classification based on structure-structure alignment of proteins, currently maintained as Dali database

http://www.ebi.ac.uk/dali/fssp

Gene3D

Precalculated structural assignments for whole genomes

http://www.biochem.ucl.ac.uk/bsm/cath_new/Gene3D/

GTD

Genomic threading database: structural annotations of complete genomes

http://bioinf.cs.ucl.ac.uk/GTD

GTOP

Protein fold predictions from genome sequences

http://spock.genes.nig.ac.jp/~genome/

Het-PDB Navi

Hetero-atoms in protein structures

http://daisy.nagahama-ibio.ac.jp/golab/hetpdbnavi.html

HOMSTRAD

Homologous structure alignment database: curated structure-based alignments for protein families

http://www-cryst.bioc.cam.ac.uk/homstrad

IMB Jena Image Library

Visualization and analysis of 3D biopolymer structures

http://www.imb-jena.de/IMAGE.html

IMGT/3Dstructure-DB

Sequences and 3D structures of vertebrate immunoglobulins, T cell receptors and MHC proteins

http://imgt3d.igh.cnrs.fr

ISSD

Integrated sequence-structure database

http://www.protein.bio.msu.su/issd

LPFC

Library of protein family core structures

http://www-smi.stanford.edu/projects/helix/LPFC

MMDB

NCBI's database of 3D structures, part of NCBI Entrez

http://www.ncbi.nlm.nih.gov/Structure

E-MSD

EBI's macromolecular structure database

http://www.ebi.ac.uk/msd

ModBase

Annotated comparative protein structure models

http://salilab.org/modbase

MolMovDB

Database of macromolecular movements: descriptions of protein and macromolecular motions, including movies

http://bioinfo.mbb.yale.edu/MolMovDB/

PALI

Phylogeny and alignment of homologous protein structures

http://pauling.mbu.iisc.ernet.in/~pali

PASS2

Structural motifs of protein superfamilies

http://ncbs.res.in/~faculty/mini/campass/pass.html

PepConfDB

A database of peptide conformations

http://202.41.70.49:8080/pepconfdb/index.htm

PDB

Protein structure databank: all publicly available 3D structures of proteins and nucleic acids

http://www.rcsb.org/pdb

PDB-REPRDB

Representative protein chains, based on PDB entries

http://www.cbrc.jp/pdbreprdb/

PDBsum

Summaries and analyses of PDB structures

http://www.biochem.ucl.ac.uk/bsm/pdbsum

SCOP

Structural classification of proteins

http://scop.mrc-lmb.cam.ac.uk/scop

Sloop

Classification of protein loops

http://www-cryst.bioc.cam.ac.uk/~sloop/

Structure Superposition Database

Pairwise superposition of TIM-barrel structures

http://ssd.rbvi.ucsf.edu/

SWISS-MODEL Repository

Database of annotated 3D protein structure models

http://swissmodel.expasy.org/repository

SUPERFAMILY

Assignments of proteins to structural superfamilies

http://supfam.org/

SURFACE

Surface residues and functions annotated, compared and evaluated: a database of protein surface patches

http://cbm.bio.uniroma2.it/surface

TargetDB

Target data from worldwide structural genomics projects

http://targetdb.pdb.org/

3D-GENOMICS

Structural annotations for complete proteomes

http://www.sbg.bio.ic.ac.uk/3dgenomics

TOPS

Topology of protein structures database

http://www.tops.leeds.ac.uk

Genomics Databases
For organisms of major interest to geneticists, there is a long history of conventionally published catalogues of genes or mutations. In the past few years, most of these have been made available in electronic form, and a variety of new databases have been developed. These databases vary greatly in the classes of data captured and in how these data are stored. This category comprises information on various genomes, such as those of humans, plants, viruses, invertebrates and microbes: 1) Genome annotation terms, ontologies and nomenclature 2) Taxonomy and identification 3) General genomics databases 4) Viral genome databases 5) Prokaryotic genome databases 6) Unicellular eukaryotes genome databases 7) Fungal genome databases 8) Invertebrate genome databases 9) Human genome databases, maps and viewers.

5. Genomics Databases (non-human)

5.1. Genome annotation terms, ontologies and nomenclature

Genew

Human gene nomenclature: approved gene symbols

http://www.gene.ucl.ac.uk/nomenclature

GO

Gene ontology consortium database

http://www.geneontology.org/

GOA

Gene ontology annotation project

http://www.ebi.ac.uk/GOA

IUBMB Nomenclature database

Nomenclature of enzymes, membrane transporters, electron transport proteins and other proteins

http://www.chem.qmul.ac.uk/iubmb

IUPAC Nomenclature database

Nomenclature of biochemical and organic compounds approved by the IUBMB-IUPAC Joint Commission

http://www.chem.qmul.ac.uk/iupac

IUPHAR-RD

The International Union of Pharmacology recommendations on receptor nomenclature and drug classification

http://www.iuphar-db.org/iuphar-rd/

PANTHER

Gene products organized by biological function

http://panther.celera.com/

SOURCE

Functional genomic resource for annotations, ontologies and expression data

http://source.stanford.edu/

UMLS

Unified medical language system

http://umlsks.nlm.nih.gov/

5.1.1. Taxonomy and Identification

ICB

gyrB database for identification and classification of bacteria

http://www.mbio.co.jp/icb

NCBI Taxonomy

Names and taxonomic lineages of all organisms in GenBank

http://www.ncbi.nlm.nih.gov/Taxonomy/

RIDOM

rRNA-based differentiation of medical microorganisms

http://www.ridom-rdna.de/

RDP

Ribosomal database project

http://rdp.cme.msu.edu

Tree of Life

Information on phylogeny and biodiversity

http://phylogeny.arizona.edu/tree/phylogeny.html

5.2. General genomics databases


COG

Clusters of orthologous groups of proteins from unicellular microorganisms

http://www.ncbi.nlm.nih.gov/COG

CORG

Comparative regulatory genomics: conserved non-coding sequence blocks

http://corg.molgen.mpg.de/

DEG

Database of essential genes from bacteria and yeast

http://tubic.tju.edu.cn/deg

EBI Genomes

EBI's collection of databases for the analysis of complete and unfinished viral, pro- and eukaryotic genomes

http://www.ebi.ac.uk/genomes

EGO

Eukaryotic gene orthologs: orthologous DNA sequences in the TIGR gene indices

http://www.tigr.org/tdb/tgi/ego/

EMGlib

Enhanced microbial genomes library: completely sequenced genomes of unicellular organisms

http://pbil.univ-lyon1.fr/emglib/emglib.html

Entrez Genomes

NCBI's collection of databases for the analysis of complete and unfinished viral, pro- and eukaryotic genomes

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome

ERGOLight

Integrated biochemical data on seven bacterial genomes: publicly available portion of the ERGO database

http://www.ergo-light.com/ERGO

FusionDB

Database of bacterial and archaeal gene fusion events

http://igs-server.cnrs-mrs.fr/FusionDB

Genome information broker

DDBJ's collection of databases for the analysis of complete and unfinished viral, pro- and eukaryotic genomes

http://gib.genes.nig.ac.jp

GOLD

Genomes online database: a listing of completed and ongoing genome projects

http://www.genomesonline.org/

TIGR Microbial Database

Lists of completed and ongoing genome projects with links to complete genome sequences

http://www.tigr.org/tdb/mdb/mdbcomplete.html

HGT-DB

Putative horizontally transferred genes in prokaryotic genomes

http://www.fut.es/~debb/HGT/

KEGG

Kyoto encyclopedia of genes and genomes: integrated suite of databases on genes, proteins, and metabolic pathways

http://www.genome.ad.jp/kegg

MBGD

Microbial genome database for comparative analysis

http://mbgd.genome.ad.jp/

ORFanage

Database of orphan ORFs (ORFs with no homologs) in complete microbial genomes

http://www.cs.bgu.ac.il/~nomsiew/ORFans

PACRAT

Archaeal and bacterial intergenic sequence features

http://www.biosci.ohio-state.edu/~pacrat

PEDANT

Results of an automated analysis of genomic sequences

http://pedant.gsf.de

TIGR Comprehensive Microbial Resource

Various data on complete microbial genomes: uniform annotation, properties of DNA and predicted proteins

http://www.tigr.org/CMR

TransportDB

Predicted membrane transporters in complete genomes, classified according to the TC classification system

http://www.membranetransport.org

WIT

What is there? Metabolic reconstruction for completely sequenced microbial genomes

http://wit.mcs.anl.gov/WIT2/

5.3. Organism-specific genomic databases

5.3.1. Viruses

HCVDB

The hepatitis C virus database

http://hepatitis.ibcp.fr/

HIV Drug Resistance Database

Mutations in HIV genes that confer resistance to anti-HIV drugs

http://resdb.lanl.gov/Resist_DB/default.htm

VirGen

Annotated and curated database for complete viral genome sequences

http://bioinfo.ernet.in/virgen/virgen.html

5.3.2. Prokaryotes

5.3.2.1. Escherichia coli

ASAP

A systematic annotation package for community analysis of E.coli and related genomes

https://asap.ahabs.wisc.edu/annotation/php/ASAP1.htm

CCDB

CyberCell database: E.coli database at U. Alberta

http://redpoll.pharmacy.ualberta.ca/CCDB

coliBase

A database for E.coli, Salmonella and Shigella

http://colibase.bham.ac.uk/

Colibri

E.coli genome database at Institut Pasteur

http://genolist.pasteur.fr/Colibri/

Essential genes in E.coli

First results of an E.coli gene deletion project

http://magpie.genome.wisc.edu/~chris/essential.html

GenoBase

E.coli genome database at Nara Institute

http://ecoli.aist-nara.ac.jp/

GenProtEC

E.coli K-12 genome and proteome database

http://genprotec.mbl.edu

PEC

Profiling of E.coli chromosome

http://shigen.lab.nig.ac.jp/ecoli/pec

EcoCyc

E.coli K-12 genes, metabolic pathways, transporters, and gene regulation

http://ecocyc.org/

EcoGene

Sequence and literature data on E.coli genes and proteins

http://bmb.med.miami.edu/EcoGene/EcoWeb/

RegulonDB

Transcriptional regulation and operon organization in E.coli

http://www.cifn.unam.mx/Computational_Genomics/regulondb/

5.3.2.2. Bacillus subtilis


BSORF

Bacillus subtilis genome database at Kyoto U.

http://bacillus.genome.ad.jp/

NRSub

Non-redundant Bacillus subtilis database at U. Lyon

http://pbil.univ-lyon1.fr/nrsub/nrsub.html

SubtiList

Bacillus subtilis genome database at Institut Pasteur

http://genolist.pasteur.fr/SubtiList/

5.3.2.3. Other bacteria

BioCyc

Pathway/genome databases for many bacteria

http://biocyc.org/

CampyDB

Database for Campylobacter genome analysis

http://campy.bham.ac.uk/

ClostriDB

Finished and unfinished genomes of Clostridium spp.

http://clostri.bham.ac.uk/

CyanoBase

Cyanobacterial genomes

http://www.kazusa.or.jp/cyano

LeptoList

Leptospira interrogans genome

http://bioinfo.hku.hk/LeptoList

MolliGen

Genomic data on mollicutes

http://cbi.labri.fr/outils/molligen/

RsGDB

Rhodobacter sphaeroides genome

http://www-mmg.med.uth.tmc.edu/sphaeroides

5.3.3. Unicellular eukaryotes

5.3.3.1. Yeast

SGD

Saccharomyces genome database

http://www.yeastgenome.org/

CYGD

MIPS Comprehensive yeast genome database

http://mips.gsf.de/proj/yeast

Génolevures

A comparison of S.cerevisiae and 14 other yeast species

http://cbi.labri.fr/Genolevures

MitoPD

Yeast mitochondrial protein database

http://bmerc-www.bu.edu/mito

SCMD

Saccharomyces cerevisiae morphological database: micrographs of budding yeast mutants

http://yeast.gi.k.u-tokyo.ac.jp/

SCPD

Saccharomyces cerevisiae promoter database

http://cgsigma.cshl.org/jian

TRIPLES

Transposon-insertion phenotypes, localization, and expression in Saccharomyces

http://ygac.med.yale.edu/triples/

YDPM

Yeast deletion project and mitochondria database

http://www-deletion.stanford.edu/YDPM/YDPM_index.html

Yeast Intron Database

Ares laboratory database of spliceosomal introns in S.cerevisiae

http://www.cse.ucsc.edu/research/compbio/yeast_introns.html

Yeast snoRNA Database

Yeast small nucleolar RNAs

http://www.bio.umass.edu/biochem/rna-sequence/Yeast_snoRNA_Database/snoRNA_DataBase.html

yMGV

Yeast microarray global viewer

http://www.transcriptome.ens.fr/ymgv/

5.3.3.2. Other unicellular eukaryotes


ApiEST-DB

EST sequences from various Apicomplexan parasites

http://www.cbil.upenn.edu/paradbs-servlet

CryptoDB

Cryptosporidium parvum genome database

http://cryptodb.org/

DictyBase

Genome information, literature and experimental resources for Dictyostelium discoideum

http://dictybase.org/

Full-Malaria

Full-length cDNA library from erythrocytic-stage Plasmodium falciparum

http://fullmal.ims.u-tokyo.ac.jp/

GeneDB

Curated database for Trypanosoma brucei, Leishmania major, S.pombe and other Sanger-sequenced genomes

http://www.genedb.org/

PlasmoDB

Plasmodium genome database

http://plasmodb.org/

TcruziDB

Trypanosoma cruzi genome database

http://tcruzidb.org/

ToxoDB

Toxoplasma gondii genome database

http://toxodb.org/

5.3.4. Plants

5.3.4.1. General plant databases

CropNet

Genome mapping in crop plants

http://ukcrop.net/

FLAGdb++

Integrative database about plant genomes

http://genoplante-info.infobiogen.fr/FLAGdb/

GénoPlante-Info

Plant genomic data from the Génoplante consortium

http://genoplante-info.infobiogen.fr/

GrainGenes

Molecular and phenotypic information on wheat, barley, rye, triticale and oats

http://wheat.pw.usda.gov or http://www.graingenes.org

Mendel

Database of plant EST and STS sequences annotated with gene family information

http://www.mendel.ac.uk/

PHYTOPROT

Clusters of (predicted) plant proteins

http://genoplante-info.infobiogen.fr/phytoprot

PlantGDB

Plant genome database: actively-transcribed plant genomic sequences

http://www.plantgdb.org/

Sputnik

Plant EST clustering and functional annotation

http://mips.gsf.de/proj/sputnik

TIGR plant repeat database

Classification of repetitive sequences in plant genomes

http://www.tigr.org/tdb/e2k1/plant.repeats

TropGENE DB

Genetic and genomic information about tropical crops: sugarcane, banana, cocoa

http://tropgenedb.cirad.fr/

5.3.4.2. Arabidopsis thaliana

ARAMEMNON

Arabidopsis thaliana membrane proteins and transporters

http://aramemnon.botanik.uni-koeln.de/

AthaMap

Genome-wide map of putative transcription factor binding sites in Arabidopsis thaliana

http://www.athamap.de/

CATMA

Complete Arabidopsis transcriptome microarray: gene sequence tags

http://www.catma.org

FLAGdb/FST

Arabidopsis thaliana T-DNA transformants

http://genoplante-info.infobiogen.fr/

MAtDB

MIPS Arabidopsis thaliana database

http://mips.gsf.de/proj/thal/db

SeedGenes

Genes essential for Arabidopsis development

http://www.seedgenes.org/

TAIR

The Arabidopsis information resource

http://www.arabidopsis.org/

5.3.4.3. Rice

BGI-RISe

Beijing genomics institute rice information system

http://rise.genomics.org.cn/

INE

Integrated rice genome explorer

http://rgp.dna.affrc.go.jp/giot/INE.html

IRIS

International rice information system: all rice data

http://www.iris.irri.org/

MOsDB

MIPS Oryza sativa database

http://mips.gsf.de/proj/rice

Oryzabase

Rice genetics and genomics

http://www.shigen.nig.ac.jp/rice/oryzabase/

RiceGAAS

Rice genome automated annotation system

http://ricegaas.dna.affrc.go.jp/

Rice PIPELINE

Unification tool for rice databases

http://cdna01.dna.affrc.go.jp/PIPE

RPD

Rice proteome database

http://gene64.dna.affrc.go.jp/RPD/

5.3.4.4. Other plants


MaizeGDB

Maize genetics and genomics database, a successor to MaizeDB and ZmDB databases

http://www.maizegdb.org/

MGI

Medicago genome initiative: ESTs, gene expression and proteomic data

http://xgi.ncgr.org/mgi

MtDB

Medicago truncatula genome

http://www.medicago.org/MtDB

SGMD

Soybean genomics and microarray database

http://psi081.ba.ars.usda.gov/SGMD/default.htm

5.3.5. Fungi

CADRE

Central Aspergillus data repository

http://www.cadre.man.ac.uk/

COGEME

Phytopathogenic fungi and oomycete EST database

http://cogeme.ex.ac.uk

MagnaportheDB

Magnaporthe grisea integrated physical/genetic map

http://www.fungalgenomics.ncsu.edu/Projects/mgdatabase/int.htm

MNCDB

MIPS Neurospora crassa database

http://mips.gsf.de/proj/neurospora/

Phytophthora Genome Consortium Database

ESTs from Phytophthora infestans and P.sojae

https://xgi.ncgr.org/pgc

5.3.6. Invertebrates

5.3.6.1. Caenorhabditis elegans


C.elegans Project

Genome sequencing data at the Sanger Institute

http://www.sanger.ac.uk/Projects/C_elegans

Intronerator

Introns and alternative splicing in C.elegans and C.briggsae

http://www.cse.ucsc.edu/~kent/intronerator/

RNAiDB

RNAi phenotypic analysis of C.elegans genes

http://www.rnai.org/

WILMA

C.elegans annotation database

http://www.came.sbg.ac.at/wilma/

WorfDB

C.elegans ORFeome

http://worfdb.dfci.harvard.edu/

WormBase

Data repository for C.elegans and C.briggsae: curated genome annotation, genetic and physical maps, pathways

http://www.wormbase.org/

5.3.6.2. Drosophila melanogaster


FlyBase

Drosophila sequences and genomic information

http://flybase.bio.indiana.edu/

GadFly

Genome annotation database of Drosophila

http://www.fruitfly.org

FlyBrain

Database of the Drosophila nervous system

http://flybrain.neurobio.arizona.edu

FlyTrap

Drosophila transgenic lines created using an intron protein trap strategy

http://flytrap.med.yale.edu/

InterActive Fly

Drosophila genes and their roles in development

http://sdb.bio.purdue.edu/fly/aimain/1aahome.htm

Drosophila microarray centre

Data and tools for Drosophila gene expression studies

http://www.flyarrays.com/fruitfly

5.3.6.3. Other invertebrates


AppaDB

A database on the nematode Pristionchus pacificus

http://appadb.eb.tuebingen.mpg.de

CnidBase

Cnidarian evolution and gene expression database

http://cnidbase.bu.edu/

Nematode.net

Parasitic nematode sequencing project

http://nematode.net/

NEMBASE

Nematode sequence and functional data database

http://www.nematodes.org

Database Categories List

Nucleotide Sequence Databases
RNA sequence databases
Protein sequence databases
Structure Databases
Genomics Databases (non-vertebrate)
Metabolic and Signaling Pathways
Human and other Vertebrate Genomes
Human Genes and Diseases
Microarray Data and other Gene Expression Databases
Proteomics Resources
Other Molecular Biology Databases
Organelle databases
Plant databases
Immunological databases

Metabolic and Signaling Pathways

The metabolic and signaling pathways category is a collection of pathway and signaling databases. Most databases in this collection describe the genome and metabolic pathways of a single organism, with some exceptions. The subcategories are:
1) Enzymes and enzyme nomenclature
2) Metabolic pathways
3) Intermolecular interactions and signaling pathways

6. Metabolic Enzymes and Pathways; Signaling Pathways

6.1. Enzymes and Enzyme Nomenclature

ENZYME

Enzyme nomenclature and properties

http://www.expasy.org/enzyme

BRENDA

Enzyme names and properties: sequence, structure, specificity, stability, reaction parameters, isolation data

http://www.brenda.uni-koeln.de

IntEnz

Integrated enzyme database and enzyme nomenclature

http://www.ebi.ac.uk/intenz

Enzyme Nomenclature

IUBMB Nomenclature Committee recommendations

http://www.chem.qmw.ac.uk/iubmb/enzyme

6.2. Metabolic Pathways


KEGG

Kyoto encyclopedia of genes and genomes: metabolic and regulatory pathways encoded in complete genomes

http://www.genome.ad.jp/kegg

MetaCyc

Metabolic pathways and enzymes from various organisms

http://metacyc.org

PathDB

Biochemical pathways, compounds and metabolism

http://www.ncgr.org/pathdb

UM-BBD

University of Minnesota biocatalysis and biodegradation database: microbial catabolism and biotransformations

http://umbbd.ahc.umn.edu/

WIT2

Integrated system for functional curation and development of metabolic models

http://wit.mcs.anl.gov/WIT2/

6.3. Intermolecular Interactions and Signaling Pathways

aMAZE

A system for the annotation, management and analysis of biochemical and signaling pathway networks

http://www.amaze.ulb.ac.be/

BIND

Biomolecular interaction network database

http://www.bind.ca

BioCarta

Online maps of metabolic and signaling pathways

http://www.biocarta.com/genes/allPathways.asp

BRITE

Biomolecular relations in information transmission and expression, part of the KEGG system

http://www.genome.ad.jp/brite

DIP

Database of interacting proteins: experimentally determined protein-protein interactions

http://dip.doe-mbi.ucla.edu

DRC

Database of ribosomal crosslinks

http://www.mpimg-berlin-dahlem.mpg.de/~ag_ribo/ag_brimacombe/drc

GeneNet

Database on gene network components

http://wwwmgs.bionet.nsc.ru/mgs/gnw/genenet

IntAct project

Protein-protein interaction data

http://www.ebi.ac.uk/intact

InterDom

Putative protein domain interactions

http://interdom.lit.org.sg

JenPep

Functional and quantitative thermodynamic data on peptide binding to immunological biomacromolecules

http://www.jenner.ac.uk/Jenpep2

MPID

MHC-peptide interaction database

http://surya.bic.nus.edu.sg/mpid

ROSPath

Reactive oxygen species (ROS) signaling pathway

http://rospath.ewha.ac.kr

STCDB

Signal transductions classification database

http://www.techfak.uni-bielefeld.de/~mchen/STCDB

STRING

Predicted functional associations between proteins

http://www.bork.embl-heidelberg.de/STRING

TRANSPATH

Gene regulatory networks and microarray analysis

http://www.biobase.de/pages/products/databases.html


Human and other Vertebrate Genomes

The human and other vertebrate genomes category is a repository of databases covering the human genome and the genomes of other vertebrates. Its subcategories are:
1) Model organisms, comparative genomics
2) Human genome databases, maps and viewers
3) Human ORFs

7. Human and other Vertebrate Genomes

7.1. Mitochondrial Genes and Proteins

AMmtDB

Metazoan mitochondrial genes

http://bighost.area.ba.cnr.it/mitochondriome

GOBASE

Organelle genome database

http://megasun.bch.umontreal.ca/gobase/gobase.html

MitoDat

Mitochondrial proteins (predominantly human)

http://www-lecb.ncifcrf.gov/mitoDat/

MitoMap

Human mitochondrial genome

http://www.mitomap.org/

MitoNuc

Nuclear genes coding for mitochondrial proteins

http://biowww.ba.cnr.it:8000/BioWWW/#MitoNuc

MITOP2

Mitochondrial proteins, genes and diseases

http://ihg.gsf.de/mitop2/

MitoProteome

Mitochondrial protein sequences encoded by mitochondrial and nuclear genes

http://www.mitoproteome.org

OGRe

Complete mitochondrial genome sequences for 200 metazoan species

http://www.bioinf.man.ac.uk/ogre

7.2. Model organisms, comparative genomics


ACeDB

C.elegans, S.pombe, and human sequences and genomic information

http://www.acedb.org/

AllGenes

Human and mouse gene, transcript and protein annotation

http://www.allgenes.org/

ArkDB

Genome databases for farm and other animals

http://www.thearkdb.org/

Cre Transgenic Database

Cre transgenic mouse lines with links to publications

http://www.mshri.on.ca/nagy/

DRESH

Human cDNA clones homologous to Drosophila mutant genes

http://www.tigem.it/LOCAL/drosophila/dros.html

Ensembl

Annotated information on eukaryotic genomes

http://www.ensembl.org/

FANTOM

Functional annotation of mouse full-length cDNA clones

http://fantom2.gsc.riken.go.jp

FREP

Functional repeats in mouse cDNAs

http://facts.gsc.riken.go.jp/FREP/

IPD-MHC Database

Non-human major histocompatibility complex sequences

http://www.ebi.ac.uk/ipd/mhc

GenetPig

Genes controlling economic traits in pig

http://www.infobiogen.fr/services/Genetpig

KOG

Eukaryotic orthologous groups of proteins

http://www.ncbi.nlm.nih.gov/COG/new/shokog.cgi

LocusLink

Curated sequences and descriptions of genetic loci

http://www.ncbi.nlm.nih.gov/LocusLink

Mouse Genome Database

Mouse genome database

http://www.informatics.jax.org/

Mouse SAGE

SAGE libraries from various mouse tissues and cell lines

http://mouse.biomed.cas.cz/sage

Mouse Targeted Mutations

Information on transgenic animals and targeted mutations

http://tbase.jax.org/

MTID

Mouse transposon insertion database

http://mouse.ccgb.umn.edu/transposon/

PEDE

Pig EST data explorer: full-length cDNA libraries and ESTs

http://pede.gene.staff.or.jp/

Rat Genome Database

Rat genetic and genomic data

http://rgd.mcw.edu/

TIGR Gene Indices

Organism-specific databases of EST and gene sequences

http://www.tigr.org/tdb/tgi.shtml

UniGene

Unified clusters of ESTs and full-length mRNA sequences

http://www.ncbi.nlm.nih.gov/UniGene/

UniSTS

Unified non-redundant view of sequence tagged sites with marker and mapping data from a variety of resources

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unists

ZFIN

Genetic, genomic and developmental data from zebrafish

http://zfin.org/

7.3. Human genome databases, maps and viewers


Ensembl

Annotated information on eukaryotic genomes

http://www.ensembl.org/

AluGene

Complete Alu map in the human genome

http://alugene.tau.ac.il/

CroW 21

Human chromosome 21 database

http://bioinfo.weizmann.ac.il/crow21/

G3-RH

Stanford G3 and TNG radiation hybrid maps

http://www-shgc.stanford.edu/RH/

GB4-RH

Genebridge4 human radiation hybrid maps

http://www.sanger.ac.uk/Software/RHserver/RHserver.shtml

GDB

Human genes and genomic maps

http://www.gdb.org/

GenAtlas

Human genes, markers and phenotypes

http://www.citi2.fr/GENATLAS/

GeneCards

Integrated database of human genes, maps, proteins and diseases

http://bioinfo.weizmann.ac.il/cards/

GeneLoc

Gene location database (formerly UDB, Unified database for human genome mapping)

http://genecards.weizmann.ac.il/geneloc/

GeneNest

Gene indices of human, mouse, zebrafish, etc.

http://genenest.molgen.mpg.de/

GenMapDB

Mapped human BAC clones

http://genomics.med.upenn.edu/genmapdb

Gene Resource Locator

Alignment of ESTs with finished human sequence

http://grl.gi.k.u-tokyo.ac.jp/

HOWDY

Human organized whole genome database

http://www-alis.tokyo.jst.go.jp/HOWDY/

HuGeMap

Human genome genetic and physical map data

http://www.infobiogen.fr/services/Hugemap

Human BAC Ends Database

Non-redundant human BAC end sequences

http://www.tigr.org/tdb/humgen/bac_end_search/bac_end_intro.html

IXDB

Physical maps of human chromosome X

http://ixdb.mpimg-berlin-dahlem.mpg.de/

NCBI RefSeq

Non-redundant DNA and protein sequence collection

http://www.ncbi.nlm.nih.gov/RefSeq/

UCSC Genome Browser

Genome assemblies and annotation

http://genome.ucsc.edu/

ParaDB

Paralogy mapping in human genomes

http://abi.marseille.inserm.fr/paradb/

RHdb

Radiation hybrid map data

http://www.ebi.ac.uk/RHdb

STACK

Sequence tag alignment and consensus knowledgebase

http://www.sanbi.ac.za/Dbases.html


Human Genes and Diseases

Human genes and diseases is a category of databases that hold information on disease-causing genes, including databases of cancer genes, human ORFs, etc. Its subcategories are:
1) Human ORFs
2) General human genetics databases
3) General polymorphism databases
4) Cancer gene databases
5) Gene-, system- or disease-specific databases

7.4. Human proteins


HPMR

Human plasma membrane receptome: protein sequences, literature, and expression database

http://receptome.stanford.edu/

HPRD

Human protein reference database: domain architecture, post-translational modifications, and disease association

http://www.hprd.org

HUNT

Human novel transcripts: annotated full-length cDNAs

http://www.hri.co.jp/HUNT

HUGE

Human unidentified gene-encoded large (>50 kDa) protein and cDNA sequences

http://www.kazusa.or.jp/huge

LIFEdb

Localization, interaction and functional assays of human proteins

http://www.dkfz.de/LIFEdb

trome, trEST and trGEN

Databases of predicted human protein sequences

ftp://ftp.isrec.isb-sib.ch/pub/databases/

8. Human Genes and Diseases

8.1. General Databases

Genetics Home Reference

A general guide on human hereditary diseases

http://ghr.nlm.nih.gov/

Homophila

Drosophila homologs of human disease genes

http://homophila.sdsc.edu/

IMGT

International immunogenetics information system: immunoglobulins, T cell receptors, MHC and RPI

http://imgt.cines.fr/

Mutation Spectra Database

Mutations in viral, bacterial, yeast and mammalian genes

http://info.med.yale.edu/mutbase/

OMIA

Online Mendelian inheritance in animals: a catalog of animal genetic and genomic disorders

http://www.angis.org.au/omia

OMIM

Online Mendelian inheritance in man: a catalog of human genetic and genomic disorders

http://www.ncbi.nlm.nih.gov/Omim/

ORFDB

Collection of ORFs that are sold by Invitrogen

http://orf.invitrogen.com/

PathBase

European mutant mice pathology database: histopathology photomicrographs and macroscopic images

http://www.pathbase.net/

PMD

Compilation of protein mutant data

http://pmd.ddbj.nig.ac.jp/

8.2. Human Mutations Databases

8.2.1. General polymorphism database

ALFRED

Allele frequencies and DNA polymorphisms

http://alfred.med.yale.edu/

BayGenomics

Genes relevant to cardiovascular and pulmonary disease

http://baygenomics.ucsf.edu/

dbSNP

Database of single nucleotide polymorphisms

http://www.ncbi.nlm.nih.gov/SNP/

FIMM

Functional molecular immunology data

http://sdmc.krdl.org.sg:8080/fimm/

HGVS Databases

A compilation of human mutation databases

http://www.hgvs.org/

HGVbase

Human genome variation database: curated human polymorphisms

http://hgvbase.cgb.ki.se/

HGMD

Human gene mutation database

http://www.hgmd.org/

IPD

Immuno polymorphism database: data on human killer-cell Ig-like receptors and human platelet antigens

http://www.ebi.ac.uk/ipd

JSNP

Japanese SNP database

http://snp.ims.u-tokyo.ac.jp/

rSNP Guide

SNPs in regulatory gene regions

http://util.bionet.nsc.ru/databases/rsnp.html

SNP Consortium database

SNP Consortium data

http://snp.cshl.org/

TopoSNP

Topographic database of non-synonymous SNPs

http://gila.bioengr.uic.edu/snp/toposnp

8.2.2. Cancer

Atlas of Genetics and Cytogenetics in Oncology and Haematology

Cancer related genes, chromosomal abnormalities in oncology and haematology, and cancer-prone diseases

http://www.infobiogen.fr/services/chromcancer/

CGED

Cancer gene expression database

http://love2.aist-nara.ac.jp/CGED

Database of Germline p53 Mutations

Mutations in human tumor and cell line p53 gene

http://www.lf2.cuni.cz/win/projects/germline_mut_p53.htm

IARC TP53 Database

Human TP53 somatic and germline mutations

http://www.iarc.fr/p53/

MTB

Mouse tumor biology database: mouse tumor types, genes, classification, incidence, pathology

http://tumor.informatics.jax.org/

Oral Cancer Gene Database

Cellular and molecular data for genes involved in oral cancer

http://www.tumor-gene.org/Oral/oral.html

RB1 Gene Mutation Database

Mutations in the human retinoblastoma (RB1) gene

http://www.d-lohmann.de/Rb/

RTCGD

Mouse retroviral tagged cancer gene database

http://rtcgd.ncifcrf.gov/

SNP500Cancer

Re-sequenced SNPs from 102 reference samples

http://snp500cancer.nci.nih.gov

SV40 Large T-Antigen Mutant Database

Mutations in SV40 large tumor antigen gene

http://bigdaddy.bio.pitt.edu/SV40/

Tumor Gene Family Databases

Cellular, molecular and biological data about genes involved in various cancers

http://www.tumor-gene.org/tgdf.html

8.2.3. Gene-, system- or disease-specific databases

ALPSbase

Autoimmune lymphoproliferative syndrome database

http://research.nhgri.nih.gov/alps/

Androgen Receptor Gene Mutations Database

Mutations in the androgen receptor gene

http://www.mcgill.ca/androgendb/

BTKbase

Mutation registry for X-linked agammaglobulinemia

http://bioinf.uta.fi/BTKbase/

CASRDB

Calcium-sensing receptor database: CASR mutations causing hypercalcemia and/or hyperparathyroidism

http://www.casrdb.mcgill.ca/

Cytokine Gene Polymorphism in Human Disease

Cytokine gene polymorphism literature database

http://bris.ac.uk/pathandmicro/services/GAI/cytokine4.htm

Collagen Mutation Database

Human type I and type III collagen gene mutations

http://www.le.ac.uk/genetics/collagen/

ERGDB

Estrogen responsive genes database

http://sdmc.lit.org.sg/ergdb/cgi-bin/explore.pl

FUNPEP

Low-complexity peptides capable of forming amyloid plaque

http://www.cmbi.kun.nl/swift/FUNPEP/gergo/

GOLD.db

Genomics of lipid-associated disorders database

http://gold.tugraz.at

tGRAP

Mutants of G-protein coupled receptors of family A

http://tinygrap.uit.no/GRAP/

HaemB

Factor IX gene mutations, insertions and deletions

http://www.kcl.ac.uk/ip/petergreen/haemBdatabase.html

HbVar

Human hemoglobin variants and thalassemias

http://globin.cse.psu.edu/globin/hbvar

Human p53/hprt, rodent lacI/lacZ databases

Mutations at the human p53 and hprt genes; rodent transgenic lacI and lacZ mutations

http://www.ibiblio.org/dnam/mainpage.html

Human PAX2 Allelic Variant Database

Mutations in human PAX2 gene

http://pax2.hgu.mrc.ac.uk/

Human PAX6 Allelic Variant Database

Mutations in human PAX6 gene

http://pax6.hgu.mrc.ac.uk/

IL2Rgbase

X-linked severe combined immunodeficiency mutations

http://research.nhgri.nih.gov/scid/

IMGT/Gene-DB

Vertebrate immunoglobulin and T cell receptor genes

http://imgt.cines.fr/cgi-bin/GENElect.jv

IMGT/HLA

Polymorphism of human MHC and related genes

http://www.ebi.ac.uk/imgt/hla/

INFEVERS

Hereditary inflammatory disorder and familial mediterranean fever mutation data

http://fmf.igh.cnrs.fr/infevers

KinMutBase

Disease-causing protein kinase mutations

http://www.uta.fi/imt/bioinfo/KinMutBase/

Lowe Syndrome Mutation Database

Phosphatidylinositol-4,5-bisphosphate 5-phosphatase mutations causing Lowe oculocerebrorenal syndrome

http://research.nhgri.nih.gov/lowe/

NCL Mutation Database

Polymorphisms in neuronal ceroid lipofuscinoses genes

http://www.ucl.ac.uk/ncl/

PAHdb

Mutations at the phenylalanine hydroxylase locus

http://www.pahdb.mcgill.ca/

PGDB

Prostate and prostatic diseases gene database

http://www.ucsf.edu/PGDB

PHEXdb

PHEX mutations causing X-linked hypophosphatemia

http://www.phexdb.mcgill.ca/

PTCH1 Mutation Database

Mutations and SNPs found in PTCH1 gene

http://www.cybergene.se/PTCH/ptchbase.html


Microarray Data and other Gene Expression Databases

Microarrays are producing massive amounts of data. These data, like genome sequence data, can be used to gain insights into underlying biological processes only if they are carefully recorded and stored in databases, where they can be queried, compared and analysed by different computer programs. A gene expression database can be regarded as consisting of three parts: the gene expression data matrix, gene annotation and sample annotation. This category therefore consists of repositories of microarray data and other gene expression data.
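The three-part view of a gene expression database can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, the gene and sample names and the expression values below are all hypothetical and do not come from any of the listed databases.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExpressionDatabase:
    """Toy model: data matrix + gene annotation + sample annotation."""
    genes: List[str]                  # row labels of the matrix
    samples: List[str]                # column labels of the matrix
    matrix: List[List[float]]         # matrix[i][j] = level of genes[i] in samples[j]
    gene_annotation: Dict[str, str]   # gene -> free-text description
    sample_annotation: Dict[str, str] # sample -> free-text description

    def expression(self, gene: str, sample: str) -> float:
        """Look up one cell of the expression matrix."""
        return self.matrix[self.genes.index(gene)][self.samples.index(sample)]

    def genes_above(self, sample: str, threshold: float) -> List[str]:
        """Query: genes whose level in one sample exceeds a threshold."""
        j = self.samples.index(sample)
        return [g for g, row in zip(self.genes, self.matrix) if row[j] > threshold]


# Hypothetical example data, for illustration only.
db = ExpressionDatabase(
    genes=["geneA", "geneB", "geneC"],
    samples=["normal", "tumour"],
    matrix=[[2.5, 2.7], [1.0, 8.1], [4.2, 0.3]],
    gene_annotation={"geneA": "kinase", "geneB": "transcription factor", "geneC": "transporter"},
    sample_annotation={"normal": "healthy tissue", "tumour": "tumour biopsy"},
)

for gene in db.genes_above("tumour", 5.0):
    print(gene, db.gene_annotation[gene], db.expression(gene, "tumour"))
```

Keeping the annotations separate from the matrix is what allows the same numeric data to be queried by gene properties or by sample properties.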

9. Microarray Data and other Gene Expression Databases


ArrayExpress

Public collection of microarray gene expression data

http://www.ebi.ac.uk/arrayexpress

Axeldb

Gene expression in Xenopus laevis

http://www.dkfz-heidelberg.de/abt0135/axeldb.htm

BodyMap

Human and mouse gene expression data

http://bodymap.ims.u-tokyo.ac.jp/

BGED

Brain gene expression database

http://love2.aist-nara.ac.jp/BGED

CleanEx

Expression reference database, linking heterogeneous expression data to facilitate cross-dataset comparisons

http://www.cleanex.isb-sib.ch/

EICO DB

Expression-based imprint candidate organiser: a database for discovery of novel imprinted genes

http://fantom2.gsc.riken.jp/EICODB/

emap Atlas

Edinburgh mouse atlas: a digital atlas of mouse embryo development and spatially-mapped gene expression

http://genex.hgu.mrc.ac.uk/

EPConDB

Endocrine pancreas consortium database

http://www.cbil.upenn.edu/EPConDB

EpoDB

Genes expressed during human erythropoiesis

http://www.cbil.upenn.edu/EpoDB/

FlyView

Drosophila development and genetics

http://pbio07.uni-muenster.de/

GeneAnnot

Revised and improved annotation of Affymetrix human gene probe sets

http://genecards.weizmann.ac.il/geneannot/

GeneNote

Human genes expression profiles in healthy tissues

http://genecards.weizmann.ac.il/genenote/

GenePaint

Gene expression patterns in the mouse

http://www.genepaint.org/Frameset.html

GeneTrap

Expression patterns in an embryonic stem library of gene trap insertions

http://www.cmhd.ca/sub/genetrap.asp

GermOnline

Expression data relevant for the mitotic and meiotic cell cycle and gametogenesis in yeast and higher eukaryotes

http://www.germonline.org/

GXD

Mouse gene expression database

http://www.informatics.jax.org/menus/expression_menu.shtml

HemBase

Genes transcribed in differentiating human erythroid cells

http://hembase.niddk.nih.gov/

HugeIndex

Expression levels of human genes in normal tissues

http://hugeindex.org/

Interferon Stimulated Gene Database

Genes induced by treatment with interferons

http://www.lerner.ccf.org/labs/williams/xchip-html.cgi

Kidney Development Database

Kidney development and gene expression

http://golgi.ana.ed.ac.uk/kidhome.html

MAGEST

Ascidian (Halocynthia roretzi) gene expression patterns

http://www.genome.ad.jp/magest

MEPD

Medaka (freshwater fish Oryzias latipes) gene expression pattern database

http://medaka.dsp.jst.go.jp/MEPD

MethDB

DNA methylation data, patterns and profiles

http://www.methdb.de/

NASCarrays

Nottingham Arabidopsis Stock Centre microarray database

http://affymetrix.arabidopsis.info

NetAffx

Public Affymetrix probesets and annotations

http://www.affymetrix.com/

PEDB

Prostate expression database: ESTs from prostate tissue and cell type-specific cDNA libraries

http://www.pedb.org/

PEPR

Public expression profiling resource: expression profiles in a variety of diseases and conditions

http://microarray.cnmcresearch.org/pgadatatable.asp

RECODE

Genes using programmed translational recoding in their expression

http://recode.genetics.utah.edu/

RefExA

Reference database for human gene expression analysis

http://www.lsbm.org/db/index_e.html

Stanford Microarray Database

Raw and normalized data from microarray experiments

http://genome-www.stanford.edu/microarray

Tooth Development Database

Gene expression in dental tissue

http://bite-it.helsinki.fi/


Proteomics Resources
Applications of Proteomics

Characterization of Protein Complexes
Protein Expression Profiling
Proteome Mining
Protein Arrays

The proteomics resources category holds databases containing proteomics information from various genomes/proteomes.

What is Proteomics?

Defined as the analysis of the entire protein complement in a given cell, tissue, or organism.

Proteomics also assesses activities, modifications, localization, and interactions of proteins in complexes.

Technology of Proteomics

Separation and Isolation of Proteins




1D and 2D PAGE

Edman Sequencing

Mass Spectrometry

Database utilization

Types of Proteomics

Protein Expression


Quantitative study of protein expression between samples that differ by some variable

Structural Proteomics


Goal is to map out the 3-D structure of proteins and protein complexes

Functional Proteomics

10. Proteomics Resources

GelBank

2D gel electrophoresis patterns of proteins from complete microbial genomes

http://gelbank.anl.gov/

PEP

Predictions for entire proteomes: summarized analyses of protein sequences

http://cubic.bioc.columbia.edu/pep/

Proteome Analysis Database

Functional classification of proteins in whole genomes

http://www.ebi.ac.uk/proteome/

RESID

Pre-, co- and post-translational protein modifications

http://www-nbrf.georgetown.edu/pirwww/dbinfo/resid.html

SWISS-2DPAGE

Annotated 2D gel electrophoresis database

http://www.expasy.org/ch2d/

Other Molecular Biology Databases

This category holds the remaining types of databases. It can be subdivided into the following divisions:
1) BioImage
2) MetaRouter
3) PubMed
4) Drugs and drug design
5) Molecular probes and primers

11. Other Molecular Biology Databases

11.1. Drugs and drug design

ANTIMIC

Database of natural antimicrobial peptides

http://research.i2r.astar.edu.sg/Templar/DB/ANTIMIC/

APD

Antimicrobial peptide database

http://aps.unmc.edu/AP/main.php

BSD

Biodegradative strain database: microorganisms that can degrade aromatic and other organic compounds

http://bsd.cme.msu.edu/

DART

Drug adverse reaction target database

http://xin.cz3.nus.edu.sg/group/drt/dart.asp

Peptaibol

Peptaibol (antibiotic peptide) sequences

http://www.cryst.bbk.ac.uk/peptaibol/welcome.html

Pharmacogenomics and Pharmacogenetics Knowledge Base

Variation in drug response based on human variation

http://www.pharmgkb.org/

TTD

Therapeutic target database

http://xin.cz3.nus.edu.sg/group/cjttd/ttd.asp

11.2. Probes

IMGT/PRIMER-DB

Immunogenetics oligonucleotide primer database

http://imgt3d.igh.cnrs.fr/PrimerDB/Query_PrDB.pl

MPDB

Information on synthetic oligonucleotides proven useful as primers or probes

http://www.biotech.ist.unige.it/interlab/mpdb.html

probeBase

rRNA-targeted oligonucleotide probe sequences, DNA microarray layouts and associated information

http://www.microbialecology.net/probebase

RTPrimerDB

Real-time PCR primer and probe sequences

http://medgen31.ugent.be/primerdatabase/index.php

VirOligo

Virus-specific oligonucleotides for PCR and hybridization

http://viroligo.okstate.edu/

11.3. Unclassified databases


PubMed

Citations and abstracts of biomedical literature

http://pubmed.gov/

BioImage

Database of multidimensional biological images

http://www.bioimage.org/

Bioinformatics Tools
BLAST (Basic Local Alignment Search Tool)
BLAST is the algorithm used by a family of five programs that align your query sequence against sequences in a molecular database. Statistical methods are applied to judge the significance of matches. Alignments (i.e. database sequences that are similar to your query sequence) are reported in order of significance, as estimated by the applied statistics.
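As a rough illustration of how hits can be ordered by significance, BLAST-style statistics are commonly summarized as an expected value (E-value): the number of alignments with at least a given score expected by chance. The sketch below assumes the widely quoted relation E = m x n x 2^(-S'), where S' is the bit score, m the query length and n the total database length. The function name and all numbers are illustrative assumptions, and real BLAST programs apply further corrections (such as effective lengths) that are not shown.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Illustrative bit scores only (not taken from any real search).
hits = {"hit_A": 60.0, "hit_B": 35.0, "hit_C": 48.5}
QUERY_LENGTH = 250            # assumed query length (residues)
DATABASE_LENGTH = 5_000_000   # assumed total database length (residues)

# Smaller E-value means a more significant match, so sort ascending.
ranked = sorted(
    hits,
    key=lambda name: expect_value(hits[name], QUERY_LENGTH, DATABASE_LENGTH),
)

for name in ranked:
    print(name, expect_value(hits[name], QUERY_LENGTH, DATABASE_LENGTH))
```

A higher bit score gives a smaller E-value, so the most significant hit is listed first.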

BLASTN

Compares a nucleotide query sequence against a nucleotide sequence database.




BLASTP

Compares an amino acid query sequence against a protein sequence database.




BLASTX

Compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.


TBLASTN

Compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).


TBLASTX

Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
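The six-frame conceptual translation used by BLASTX, TBLASTN and TBLASTX can be sketched as follows. This is a minimal illustration, assuming the standard genetic code and plain A/C/G/T input; the function names are hypothetical and this is not the code used inside BLAST itself.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Map frame labels (+1..+3, -1..-3) to conceptual translations."""
    frames = {}
    for strand, strand_seq in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


for label, protein in six_frame_translations("ATGGCCTAA").items():
    print(label, protein)
```

Three frames come from the given strand and three from its reverse complement, which is why both strands are mentioned in the program descriptions.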

CLUSTALX


Clustal X (Thompson et al. 1997) is a version of Clustal W with a graphical user interface. This programme is used for multiple sequence alignment.

Multiple Alignment

Phylogenetic Analysis


Nucleic acid and protein sequences are used to infer phylogenetic relationships. Molecular phylogeny methods construct phylogenetic trees from a given set of aligned sequences. These trees aim to reconstruct the history of successive divergences that took place during evolution between the sequences considered and their common ancestor.
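Many distance-based tree-building methods start from a matrix of pairwise distances between the aligned sequences. The sketch below computes the simplest such measure, the p-distance (proportion of differing sites, ignoring sites with a gap in either sequence). The function names and the toy alignment are illustrative assumptions, and this is not how any particular phylogeny package is implemented.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # skip sites where either sequence has a gap
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def distance_matrix(alignment):
    """Return {(name_x, name_y): p-distance} for every pair of sequences."""
    names = list(alignment)
    return {
        (x, y): p_distance(alignment[x], alignment[y])
        for x in names
        for y in names
    }


# Toy alignment, for illustration only.
alignment = {"s1": "ACGT", "s2": "ACGA", "s3": "A-GT"}
matrix = distance_matrix(alignment)
print(matrix[("s1", "s2")])
```

A tree-building step (for example a clustering method) would then take this matrix as input; that step is not shown here.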

Phylogenetic programmes
PHYLIP
PAUP
MEGA
Treeview
ODEN
PHYLOWIN
TREECON
DENDRON

Gene Identification


AAT: Analysis and Annotation Tool

FGENESH: Splice sites, protein coding exons & gene models

Genie: Gene finder based on hidden Markov models

GenScan: Identification of gene structures in genomic DNA

Grail: DNA sequence analysis tool

ORF Finder: Search for open reading frames, at NCBI
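The idea of searching for open reading frames can be illustrated with a small sketch. It assumes an ORF runs from an ATG start codon to the first in-frame stop codon (TAA, TAG or TGA) and scans only the three frames of the strand given. The function name and the `min_codons` parameter are hypothetical, and the NCBI ORF Finder offers options that are not reproduced here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples for ORFs on the given strand.

    start is the 0-based index of the ATG, end is exclusive and includes
    the stop codon, frame is 1, 2 or 3. min_codons counts the codons
    before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame + 1))
                start = None  # look for the next ORF in this frame
    return orfs


dna = "CCATGGCCTAAGG"
for start, end, frame in find_orfs(dna):
    print(frame, start, end, dna[start:end])
```

To cover the other strand, the same function can be applied to the reverse complement of the sequence.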

Protein Structure Prediction




3D-PSSM: Protein Fold Recognition

Multicoil: Predict coiled coil structures

NNPredict: Protein secondary structure prediction

PredictProtein: Sequence analysis and structure prediction

SAPS: Statistical analysis of protein sequences

Protein 3D Structure / Modelling




FUGUE: Sequence-structure homology recognition

PDB Viewer: Protein structure database

Proinformatix: Modeling oligopeptides for energetically minimized structures

SWISS-MODEL: An automated knowledge-based protein modelling server
