
Bioinformatics – An Overview

Dr. Virendrakumar (Virendra) C. Bhavsar


Professor
Dean 2003-2008
Director, Advanced Computational Research Lab. 2000-10
Faculty of Computer Science
University of New Brunswick (UNB)
Fredericton, Canada

Visiting Professor
Center for Development of Advanced Computing (C-DAC)
Pune, India

Outline
• Introduction – UNB, C-DAC, Bioinformatics
• Genome – Genes, Proteomes, Evolution

• Databases and Information Retrieval

• Sequence Alignment and Phylogenetic trees

• Protein Structure and Drug Discovery

• Proteomics and Systems Biology

• Infrastructure: UNB and C-DAC

• Research Work at the University of New Brunswick and C-DAC

• Future
University of New Brunswick (UNB)

Faculty of Computer Science


The First “Faculty of CS” in Canada

University of New Brunswick


Fredericton, New Brunswick
Canada
Oldest English Language University in Canada
Established in 1785


Fredericton and UNB



Center for Development of Advanced Computing (C-DAC)
India

History

1987

• India requires a supercomputer for weather forecasting
• The Government of the USA refuses the sale of a supercomputer to India
• The Government of India decides to launch a national initiative for the development of indigenous supercomputers

C-DAC: HPC Evolution and Road Map

[Figure: timeline of C-DAC HPC systems and targets. Milestones shown: Technology Denial, PARAM 8000, PARAM 9000, PARAM 10000, PARAM Padma, 10 TF, 100 TF, 1 PF. Years marked: 1991, 1994, 1998, 2002-03, 2007, 2010, 2012-13. Garuda grid computing: PoC Garuda (100 Mbps, 17 locations) followed by the main phase. Goals noted: a viable HPC business computing environment, a platform for the user community to interact/collaborate, and social computing with a participatory approach.]
C-DAC Centres

• Headquarters
– Pune

• Centres
– Pune
– Knowledge Park, Bangalore
– Electronics City, Bangalore
– Chennai
– Delhi
– Hyderabad
– Kolkata
– Mohali
– Mumbai
– Noida
– Thiruvananthapuram

Total Manpower is 2100 across all the centres of C-DAC

C-DAC’s Thrust Areas

• High Performance Computing & Grid Computing
– Hardware, Software, Systems, Applications, Research, Technology, Infrastructure
• Multilingual Computing
– Tools, Fonts, Products, Solutions, Research, Technology Development
• Software Technologies
– OSS, Multimedia, ICT for masses, E-Governance, Geomatics
• Professional Electronics
– Digital Broadband, Wireless Systems, Network Technologies, Power Electronics, Real-Time Systems, Embedded Systems, VLSI/ASIC Design, Agri Electronics
• Cyber Security & Cyber Forensics
– Cyber Security tools, technologies & solution development, Research & Training
• Health Informatics
– Hospital Information System, Telemedicine, Decision Support System
• Ubiquitous Computing
– RFID, Design, Development and Integration of Ubicomp System Components
• Education & Training
– e-Learning Technologies & Services

Compute Nodes
No. of Processors : 248 (Power 4 @ 1 GHz)
Aggregate Peak Computing : 1005 GFs (~1 TF)

File Servers
No. of Processors : 24 (UltraSparc-III@900MHz)
Aggregate Memory : 96 GigaBytes
Internal Storage : 0.4 TeraBytes
File System : QFS
Operating System : Solaris 8

Networks
• Primary : PARAMNet-II @ 2.5 Gbps Full Duplex
• Backup : Gigabit Ethernet @ 1 Gbps Full Duplex
• Management : 10/100 Mbps Fast Ethernet

External Storage
• Storage Array : 5 TeraBytes with 16 T3 disk arrays
• Tape Library : 12 TeraBytes - L700 (5 LTO drives)

Software
• HPCC - C-DAC’s High performance computing and communication software suite
• Compilers, Parallel Libraries and Tools

Ranked 171 in 2nd quarter end and 258 as per the latest ranking

C-DAC
Advanced Computing Training School (ACTS)
ACTS at a glance

• An outfit initiated by C-DAC R&D in 1993
• Began with a modest 20 students and has grown to over 5000 students
• Trained more than a quarter million students
• Grown from one centre in one city to 50 centres in 30 cities within India
• Over 150 crores of investment and 600 plus dedicated manpower
• Spread from India to international locations
• From one course to more than 10 courses

International Presence
Azerbaijan
Saudi Arabia
Belarus
Russia
Tajikistan
Uzbekistan
Turkmenistan
Mauritius
Ghana
Armenia
Myanmar
Tanzania
Seychelles
Lesotho

Post Graduate Diploma Programs

Post Graduate Courses


DAC : Diploma in Advanced Computing
DACA : Diploma in Advanced Computer Arts
DVLSI : Diploma in VLSI Design
WiMC : Diploma in Wireless & Mobile Computing
DSSD : Diploma in System Software Development
DGi : Diploma in Geo informatics
DISCS : Diploma in Information System & Cyber Security
DHI : Diploma in Healthcare Informatics
DLC : Diploma in Language Computing
DIVESD: Diploma in Integrated VLSI & Embedded System Design
DESD : Diploma in Embedded Systems Design
DPC : Diploma in Parallel Computing

M.Tech. Programs

Computer Science & Engineering


Software Engineering
Information Technology
VLSI
Artificial Intelligence
Grid Computing & Storage Management
Embedded Systems Design
Wireless & Network Technology
Process Control & Instrumentation
Training Programmes UNDER Tech sangam

Bioinformatics

Definitions

Bioinformatics
The creation and development of advanced information and computational techniques for solving problems in biology

High Performance Computing (HPC)
Hardware and software for high speed computations and large storage


“Bio” Introduction

Molecular Biology

Living organisms (on Earth)
- Lipids – Separate inside from outside
- Proteins – Build 3D machinery to perform biological functions
- DNA – Stores information on how to build machinery
Diagram of a cell
- Lipid membranes – provide barrier
- Protein structures – do work
- DNA nucleus – store info


Molecular Biology

Deoxyribonucleic Acid (DNA)
Composition
- Sequence of nucleotides
- Nucleotide = deoxyribose sugar + phosphate group + base

Molecular Biology - DNA
DNA: contains genetic instructions used in the development and functioning of all known living organisms with the exception of some viruses.
DNA molecules: long-term storage of information.
DNA: a set of blueprints, like a recipe or a code, since it contains the instructions needed to construct other components of cells, such as proteins and RNA molecules.
Genes: the DNA segments that contain instructions to construct these components of cells.
Other DNA sequences: serve structural purposes, or are involved in regulating the use of this genetic information.
Chemically, DNA consists of two long polymers of simple units called nucleotides, with backbones made of sugars and phosphate groups joined by ester bonds. These two strands run in opposite directions to each other and are therefore anti-parallel. Attached to each sugar is one of four types of molecules called bases. It is the sequence of these four bases along the backbone that encodes information.

Molecular Biology - DNA


- Two long polymers of simple units called nucleotides, with backbones made of sugars and phosphate groups joined by ester bonds.
- These two strands run in opposite directions to each other and are therefore anti-parallel.
- Attached to each sugar is one of four types of molecules called bases. It is the sequence of these four bases along the backbone that encodes information. This information is read using the genetic code, which specifies the sequence of the amino acids within proteins.
- The code is read by copying stretches of DNA into the related nucleic acid RNA, in a process called transcription.
- Within cells, DNA is organized into long structures called chromosomes. These chromosomes are duplicated before cells divide, in a process called DNA replication.

Molecular Biology - DNA
- DNA is organized into long structures called chromosomes.
- Chromosomes are duplicated before cells divide, in a process called DNA replication.
- Eukaryotic organisms (animals, plants, fungi, and protists) store most of their DNA inside the cell nucleus and some of their DNA in organelles, such as mitochondria or chloroplasts.
- Prokaryotes (bacteria and archaea) store their DNA only in the cytoplasm.


Molecular Biology
RNA: Ribonucleic acid (RNA)
- A long chain of nucleotide units
- Each nucleotide consists of a nitrogenous base, a ribose sugar, and a phosphate
RNA is very similar to DNA
- RNA is usually single-stranded; DNA is usually double-stranded
- RNA nucleotides contain ribose while DNA contains deoxyribose (a type of ribose that lacks one oxygen atom)
- RNA has the base uracil rather than the thymine that is present in DNA

Molecular Biology
DNA: DNA → DNA (Replication)
RNA: DNA → RNA (Transcription / Gene Expression)
Protein: RNA → Protein (Translation)
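
The first two arrows above can be illustrated on plain strings. This is a minimal sketch, not a model of the cellular machinery: it assumes upper- or lower-case input containing only A, C, G and T, and the function names are illustrative.

```python
# Watson-Crick pairing used when a strand is copied: A-T and C-G.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite (anti-parallel) strand, read in its own 5'->3' direction."""
    return dna.upper().translate(DNA_COMPLEMENT)[::-1]


def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a coding strand: same letters, with U in place of T."""
    return coding_strand.upper().replace("T", "U")


print(reverse_complement("ATGGCC"))  # GGCCAT
print(transcribe("ATGGCC"))          # AUGGCC
```

The third arrow (translation) needs the genetic code table, which the deck introduces separately.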


DNA, RNA, Proteins


• Proteins and nucleic acids (DNA, RNA) are essential components for living organisms
• DNA (gene) → Transcription → RNA → Translation → Proteins

[Figure: Chromosome → DNA → Gene 1, Gene 2, ...]

Raw biological data: nucleic acids (DNA)
Raw biological data: amino acid residues (proteins)
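Standard Genetic Code
(rows: first base; columns: second base; Ter = stop codon)

     T             C             A             G
T    TTT Phe (F)   TCT Ser (S)   TAT Tyr (Y)   TGT Cys (C)
     TTC Phe (F)   TCC Ser (S)   TAC Tyr (Y)   TGC Cys (C)
     TTA Leu (L)   TCA Ser (S)   TAA Ter       TGA Ter
     TTG Leu (L)   TCG Ser (S)   TAG Ter       TGG Trp (W)

C    CTT Leu (L)   CCT Pro (P)   CAT His (H)   CGT Arg (R)
     CTC Leu (L)   CCC Pro (P)   CAC His (H)   CGC Arg (R)
     CTA Leu (L)   CCA Pro (P)   CAA Gln (Q)   CGA Arg (R)
     CTG Leu (L)   CCG Pro (P)   CAG Gln (Q)   CGG Arg (R)

A    ATT Ile (I)   ACT Thr (T)   AAT Asn (N)   AGT Ser (S)
     ATC Ile (I)   ACC Thr (T)   AAC Asn (N)   AGC Ser (S)
     ATA Ile (I)   ACA Thr (T)   AAA Lys (K)   AGA Arg (R)
     ATG Met (M)   ACG Thr (T)   AAG Lys (K)   AGG Arg (R)

G    GTT Val (V)   GCT Ala (A)   GAT Asp (D)   GGT Gly (G)
     GTC Val (V)   GCC Ala (A)   GAC Asp (D)   GGC Gly (G)
     GTA Val (V)   GCA Ala (A)   GAA Glu (E)   GGA Gly (G)
     GTG Val (V)   GCG Ala (A)   GAG Glu (E)   GGG Gly (G)

Triplets of DNA bases, called ‘codons’, each code for one amino acid (or a stop signal)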
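
The table can be turned into a lookup and used to translate a coding sequence. The sketch below assumes the DNA is given as the coding strand, already in the correct reading frame; the one-letter string lists the amino acids in the table's T, C, A, G order, with `*` standing for Ter. Using `X` for unreadable codons is a choice made here for illustration.

```python
from itertools import product

BASES = "TCAG"
# One letter per codon, in the order TTT, TTC, TTA, TTG, TCT, ... GGG ('*' = Ter).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str, stop_at_ter: bool = True) -> str:
    """Translate a coding-strand DNA sequence, reading frame starting at position 0."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3].upper(), "X")  # 'X' marks an unknown codon
        if aa == "*" and stop_at_ter:
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGTTTGGTTAA"))  # MFG
```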

A Protein Structure
Protein 3D structure

http://anatomy.med.unsw.edu.au/cbl/research/cytoskeleton/swissprotactin.htm

The structure of the protein sequence determines the functionality

“Informatics”

FASTA formatted Sequences

FASTA: "FAST-All" alignment; it works with any alphabet, an extension of "FAST-P" (protein) and "FAST-N" (nucleotide) alignment.

Sample FASTA formatted Sequences

EST sequence (A, C, G, T)


>gi|39796586|gb|CK247430.1|CK247430 EST731067 potato callus cDNA library,
mRNA sequence
ACAAGTCACTATAGGGACATGCTTCAATTTTTTCAAAACATCTTGAATAGTACAAAGTGCACAACATACT
CCAAAAAACTGAATACATTTTCTATTGTCAATATCTATAGCCATATGACTTTCAGTGCGACCTATGCATT
CATAACTCCCGCTACCAAATCCACCATGTAGTGCTTACAACAACAAGCCTAGTGAGAACGTAAGCCTGGT
CTGGAGCCAAAAGCAAATTATGTATACTAAAAAACCCCCTGGCTAAAATGCATATCATGATTAGTAGTGA
CATT

Protein Sequence (20 different amino acids)


>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPF
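
Reading such records takes only a few lines. This is a minimal sketch, assuming each header sits on a single line beginning with `>` (on the slide the EST header is wrapped onto a second line for display) and that the sequence may span several lines. The two records in the usage example are made up for illustration.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


sample = ">seq1 demo nucleotide record\nACGT\nAC\n>seq2 demo protein record\nELRL\n"
for name, seq in parse_fasta(sample):
    print(name, len(seq))
```

For production work an established parser such as the one in BioPerl (listed among the ADAP tools later in this deck) is the safer choice.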
Biological Databases

• Genome databases – flat files or relational databases
• GenBank, EMBL, DDBJ, PDB, SWISSPROT, PIR
• Classification of biological databases:
- primary databases (GenBank, EMBL, DDBJ)
- secondary databases (SWISSPROT, PDB, PIR)

Biological databases

• Like any other database
• Data organization for optimal analysis
• Data is of different types
– Raw data (DNA, RNA, protein sequences)
– Curated data (DNA, RNA and protein annotated sequences and structures, expression data)

Biological databases - Examples

• Nucleotide Databases: Alternative Splicing, EMBL-Bank, Ensembl, Genomes Server, Genome, MOT, EMBL-Align, Simple Queries, dbSTS Queries, Parasites, Mutations, IMGT
• Genome Databases: Human, Mouse, Yeast, C.elegans, FLYBASE, Parasites
• Protein Databases: Swiss-Prot, TrEMBL, InterPro, CluSTr, IPI, GOA, GO, Proteome Analysis, HPI, IntEnz, TrEMBLnew, SP_ML, NEWT, PANDIT
• Structure Databases: PDB, MSD, FSSP, DALI
• Microarray Database: ArrayExpress
• Literature Databases: MEDLINE, Software Biocatalog, Flybase Archives
• Alignment Databases: BAliBASE, Homstrad, FSSP
PDB – Protein Data Bank

• 3D macromolecular structural data
• Data originates from NMR or X-ray crystallography techniques
• If the 3D structure of a protein is solved ... they have it

What to take home

• Databases are a collection of data
• Need to access and maintain easily and flexibly
• Biological information is vast and sometimes very redundant
• Distributed databases bring it all together with quality controls, cross-referencing and standardization
• Computers can only create data, they do not give answers
“Bioinformatics”


Premise of Bioinformatics
Gene sequences determine biological function
Genomic DNA → Amino acids → Proteins → Function

Similar composition → similar function?


- DNA sequences
- Amino acid sequences
- Protein 3-D structure
Predicting protein function
- Designer drugs
- Personalized treatments

Bioinformatics
Determining protein function
Hard way
-Biological / chemical analyses
- Determine 3D structure w/ x-ray crystallography, NMR
Easy way?
- Sequence protein / DNA → find close match in database
- Guess function based on match
- Validate guess in lab
Bioinformatics is imprecise
- Similar to data-mining
- Only suggests possible relationships
- Must validate correlation → causation

Growth of Bioinformatics
1970’s
- DNA sequencing
- Alignment w/ dynamic programming (Needleman-Wunsch, 1970; Smith-Waterman followed in 1981)
1980’s
- Sequence databases (EMBL, GenBank)
- Alignment w/ FASTA (linked lists, hashing)
1990’s
- Automatic DNA sequencing
- Alignment w/ BLAST (neighborhood words, probabilities)
- Internet & WWW
Now
- Genomics, Proteomics
Bioinformatics Topics
Sequence alignments
- Find similarity between DNA / protein (amino acid) sequences
Genome assembly
- Combining genomic fragments to form whole genome
Gene identification & annotation
- Identify and classify genes on the genome
Microarrays & gene expression analysis
- Use DNA microarray (gene chip) to measure mRNA
Protein folding
- Compute 3-D protein structure ↔ protein sequence
Phylogenetic analysis
- Find genetic relationships between sequences / species
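
Sequence alignment, the first topic listed, is classically solved by dynamic programming. The sketch below computes only the best Smith-Waterman local alignment score, with no traceback. The scoring values (match +2, mismatch -1, linear gap -2) are assumptions chosen for illustration. Real tools use substitution matrices such as BLOSUM and affine gap penalties.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere, which makes it "local".
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


print(smith_waterman_score("ACGT", "ACGT"))        # 8
print(smith_waterman_score("TTACGTT", "GGACGGG"))  # 6 (shared "ACG")
```

The table has len(a) x len(b) cells, which is why heuristic tools such as FASTA and BLAST were developed for database-scale searches.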


What Does Genomics Mean?


• “Genomics”: a science that studies the genetic
material of a species at the molecular level
• A scientific approach to identify and define the
function of genes, as well as uncover when and how
genes work together to produce traits
• “Structural Genomics” approaches (mapping) -
focus on traits controlled by one or a few genes, and
often only provide information regarding the
location of a gene or genes
• Examine the interrelationships and interactions
between thousands of genes

How do we do this?
Genome Organization

[Figure: chromosome and DNA; leaf and tuber tissue]

Genome Organization
• Proteins are building blocks for living organisms
• Proteins are derived from DNA
– Transcription – the RNA copy of the gene that codes for a protein is formed from DNA
– Translation – RNA triplets (codons) are translated into amino acids
• A gene can also be identified from its complementary DNA (cDNA); sequences read from active (expressed) genes are termed Expressed Sequence Tags (ESTs)

[Figure: Chromosome → DNA → Gene 1, Gene 2, ...]
Genome Organization
DNA

Gene 1 Gene 2 Etc.

....TATACAGCAAAATAGAAAGATCTAGTGTCCCATGGCGATGAGTCGTGTAGCTTCT….

Promoter (“Switch”)    Coding ORF (“Message”)

cDNA Collections (Libraries)

• Various tissues are collected from the plant, and messages are extracted

[Figure: leaf messages; tuber messages]
cDNA Collections (Libraries)

• The messages are “copied” to form double-stranded DNA copies (cDNA) of each message

[Figure: leaf cDNA; tuber cDNA]

• Each copy is “glued” into a piece of bacterial DNA for easier storage, handling and propagation, resulting in a collection or “library” of cDNAs for each tissue

cDNA Collections (Libraries)

• The cDNAs are then read or “sequenced”, to give the order of A’s, C’s, G’s and T’s for each
• We are left with the sequence of each gene that is active (expressed) in each cell, tissue or organ studied
• These are “Expressed Sequence Tags” or ESTs
• Using complex computer resources, these ESTs can be analyzed and compared with known sequences and proteins
• Look for messages associated with specific organs or characteristics/traits
Take Home Points

• Messages from various genes are important, as they dictate which proteins are produced
• Promoters are also important, as they dictate where a specific message and protein is produced
• “Genomics” involves the study of all of the messages produced by the various plant cells
• A lot of information needs to be organized and analyzed

Database

• Contains all the EST sequences
• Contains useful annotations
– BLAST searches
– Contig assemblies
– Transmembrane spanning regions
– Gel pictures
– EST information
Data Analysis

• Tens of thousands of ESTs available for study
• Most methods to study message distributions are low throughput AND time consuming
• “Genomics” necessitates the large scale study of gene expression

How can we do this?

Microarray Analysis

Microarray Analysis - Processing

Image Processing → Data Normalization → Analysis (Differential Gene Expression, Cluster Analysis, Pathway Analysis)

[Figure: Intensity Dependence Comparison. Log(R/G) plotted against 0.5*(Log(G) + Log(R)) for Slide3 and Slide70, each with a polynomial fit; R2 values shown: 0.6185 and 0.2014.]
Microarray Analysis - Processing

[Figure: spot signal versus background]

Microarray Analysis - Processing

• Irregular size or shape
• Irregular placement
• Low intensity
• Saturation
• Spot variance
• Background variance

[Figure: example spots: indistinguishable, saturated, bad print, misalignment, artifact]


Microarray Analysis - Processing

• Calculate numeric characteristics of each spot
• Throw out spots that do not meet minimum requirements for each characteristic
• Throw out spots that do not have minimum overall combined quality

Microarray Analysis - Data Normalization

• Normalize data to correct for variances
– Dye bias
– Location bias
– Intensity bias
– Pin bias
– Slide bias
– Control vs. non-control spots
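
The quantities on the intensity-dependence plot, Log(R/G) against 0.5*(Log(G) + Log(R)), are easy to compute per spot. The sketch below assumes base-2 logarithms and background-corrected, strictly positive intensities. It then applies global median centering, the simplest correction for dye bias. That is a deliberately simpler stand-in for the intensity-dependent (polynomial fit) correction suggested by the plot, and the intensities in the example are made up.

```python
import math
from statistics import median


def ma_values(red: list[float], green: list[float]) -> tuple[list[float], list[float]]:
    """Per-spot log ratio M = log2(R/G) and mean log intensity A = 0.5*(log2 R + log2 G)."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a


def median_center(m: list[float]) -> list[float]:
    """Shift log ratios so their median is 0 (assumes most genes are unchanged)."""
    shift = median(m)
    return [x - shift for x in m]


red = [100.0, 400.0, 1600.0]    # hypothetical Cy5 intensities
green = [100.0, 200.0, 400.0]   # hypothetical Cy3 intensities
m, a = ma_values(red, green)
print(m)                 # [0.0, 1.0, 2.0]
print(median_center(m))  # [-1.0, 0.0, 1.0]
```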
Microarray Analysis - Clustering

• Cluster genes based on expression profiles
• Gene expression across several treatments
• Hypothesis: Genes with similar function have similar expression profiles
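
A common way to compare expression profiles is Pearson correlation across treatments. The sketch below groups genes with a simple greedy rule (a gene joins the first cluster whose seed profile it correlates with above a threshold). This rule stands in for hierarchical or k-means clustering and is not necessarily the method used in the project. Gene names, values and the 0.9 threshold are made up, and profiles are assumed to be non-constant.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster_by_correlation(profiles: dict[str, list[float]],
                           threshold: float = 0.9) -> list[list[str]]:
    """Greedy grouping: the first gene of each cluster acts as its seed profile."""
    clusters: list[list[str]] = []
    for gene, profile in profiles.items():
        for cluster in clusters:
            if pearson(profiles[cluster[0]], profile) >= threshold:
                cluster.append(gene)
                break
        else:
            clusters.append([gene])  # no similar seed found: start a new cluster
    return clusters


profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],   # rising across four treatments
    "geneB": [2.0, 4.0, 6.0, 8.5],   # rising
    "geneC": [4.0, 3.0, 2.0, 1.0],   # falling
    "geneD": [8.0, 6.0, 4.5, 2.0],   # falling
}
print(cluster_by_correlation(profiles))  # [['geneA', 'geneB'], ['geneC', 'geneD']]
```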

Expression Profile Clustering


Microarray Analysis - Data Management

Project
Database
Engine

Information Processing and Handling

• Assembly and annotation of genomic data

• EST analysis and databases

• Cluster analysis of microarray data

• Comparisons of various transcriptomic methods

• Integration of sequence, transcriptomic, proteomic, metabolomic, transgenic data
Research Problems in Bioinformatics
Find genomes of all organisms
Identify and annotate all genes
Compute sequence <-> 3D structure for all proteins
Compare DNA / protein sequences for similarity
Compare families of DNA / protein sequences

Reason to be optimistic: Biology is finite…
~30,000 human genes; ~1000 protein superfamilies
…but computer speeds keep increasing


Fighting Bird Flu Virus in 3-D

Bioinformatics Infrastructure – High Performance Computing

Advances in Microprocessor Technology
1974 – 1 MHz clock
1988 – 40 MHz
2002 – 2 GHz
2009 – P4 3.0 GHz, Quad-core 2.66 GHz
Intel Montecito chip: 1.72 billion transistors
NVidia 280 series GPU: 1.4 billion transistors

- Circuit complexity doubles every 18 months → computing power at a given cost doubles every 18 months
- Processor clock rates: 40% increase/year + more instr./cycle
- DRAM access times: 10% improvement/year → caches required

Current Supercomputer – Nov 2009


Jaguar
Oak Ridge National Lab., USA
- 1.72 Petaflop/s on the Linpack benchmark (1 Petaflop/s = one quadrillion, i.e. a million billion (10**15), floating-point operations/sec)
- 2.332 Petaflop/s peak (i.e. 2332 Teraflop/s)
- Power – 1750 Watt/sq ft; ~50 million kWh per year
- Space – 4352 square feet, close to the size of an NBA basketball court (94 ft × 50 ft)

Future
• IBM Cyclops64 – supercomputer on a chip
• C-DAC initiative for 2010 – petaflop machine
• NCSA, USA 2011 – petaflop machine
• NASA, SGI and Intel Pleiades – 10 petaflops by 2012
• 1 exaflop (10**18 flops) by 2019
• Human brain neural simulations – 10 exaflops by 2025
• 2-week full weather modeling – 1 zettaflop (10**21 flops) by 2030

High Performance Computing and Networking


@
University of New Brunswick
Advanced Computational Research Lab
(ACRL) Infrastructure

ACEnet: Atlantic Computational Excellence Network

“People, Research, Excellence”

Hosting sites:

Member sites:
ACEnet

• Atlantic Canada is a distributed environment
• $30 million initiative
• Waterways make networking solutions difficult (e.g. Cabot Strait)

ACEnet

• World-class HPC facilities
• Behave as a single, regionally distributed “computational power grid”
• Create and operate sophisticated collaboration facilities to bind together geographically dispersed research communities.
ACEnet at UNB

Fundy: SUN cluster, AMD Opteron, 632 cores
ACEnet: 3324 cores
Internet connectivity > 2 Gbps at UNB
Collaboration Grid

• Collaboration gear across Atlantic Canada
– Lecture rooms equipped so ACEnet sites can share seminars and participate remotely
– ACEnet cafés at each site sharing continuous video feeds
– Desktop level collaboration equipment for personal communication
• Access Grid streams tens to hundreds of Mbps across the CANARIE network
ACEnet

Bioinformatics Research
@
University of New Brunswick
The Canadian Potato Genome Project

Collaborators

Dr. Patricia Evans (UNB), Dr. Barry Flinn (BioAtlantech), Dr. David Dekoyer (Potato Research Center), Carleton University, Nova Scotia Agricultural College

Students: Aijazuddin Syed (MCS Student), En Zhang (MCS Student), Zheng Wang (MCS Student), Marc Cooper (MCS Student), Rachita Sharma (PhD Student)

Potato
• Integral part of diet – French fries, mashed potatoes
• Provides at least 12 essential vitamins and minerals
• Fourth most important crop worldwide
• Potato has not been explored in terms of functional and biochemical traits
• Much is still unknown about how the potato genome controls development and processing/quality traits (disease resistance, stress tolerance, carbohydrate metabolism, tuber shape)
Economic Importance Of The Potato

• Integral part of the diet of a large proportion of the world’s population
• Supplies at least 12 essential vitamins and minerals
• Still much unknown regarding the control of potato development and processing/quality traits (i.e. disease resistance, stress tolerance, carbohydrate metabolism, tuber shape)

The Canadian Potato Genome Project (CPGP)

• 46% of national potato production $1 Billion/year
• Home of McCain Foods Ltd. $5.5 billion/year
• Potato Research Center (PRC) of AAFC
• Solanum Genomics International Inc./BioAtlantech
• Carleton University
• University of New Brunswick
• Nova Scotia Agricultural College (NSAC)
CPGP Goals

• CPGP targets genes associated with tuber health and tuber quality:
– Tuber Health – Late Blight and Common Scab
– Tuber Quality – Stable dry matter accumulation, cold sweetening and after-cooking darkening

[Figure: leaf and tuber; DNA with Gene 1, Gene 2, ...]

Project Description

Identification Of A Differential Gene Expression Pattern And Genes Related To Resistance In Potato Late Blight

• One of the most devastating diseases of potato worldwide
• If left unmanaged, complete destruction of crops can occur
• Attacks leaves and tubers; large necrotic lesions on leaves and dry rot that spreads through tubers; secondary bacteria and fungi often infect through late blight lesions
Late Blight Project

• Collaborative effort with AAFC Potato Research Centre
• Population of blight-sensitive and blight-resistant plants of near isogenicity
• cDNA libraries made from leaves of a blight-sensitive and a blight-resistant plant
• 2500 messages were sequenced from each library (5000 total ESTs)
• Different ESTs to be profiled for expression
• The tremendous amounts of data generated will need to be managed efficiently

Database - Sequence Info


Late Blight Project

cDNA Microarray Using SGII Clones

• hybridized with Cy3 (resistant) + Cy5 (susceptible) probes (reciprocal labelling experiments)

ANDLBRLF02345HTF.01 - Class II chitinase
ANDLBRLF01256HTF.01 - Pathogenesis-related protein P23 precursor
ANDLBRLF02041HTF.01 - Unknown protein

What Use Is All Of This Information?

• Transgenics:
- Enhance tuber quality, processing traits, disease
resistance, stress tolerance more rapidly than breeding
• Expression Assisted Selection:
- Obtain expression profiles for thousands of genes
associated with specific traits or characteristics
- Use these profiles as a baseline to compare with
the expression profiles of unknown clones; crosses
• New Protein Products :
- Identify genes encoding secreted proteins/ligands
- Test these for growth-promoting/other effects
- Express genes in batch cultures and purify proteins
Example Of Gene Use
GA-20 oxidase in potato:

GFP expression in tobacco cells

• GA-20 oxidase
knockouts with
enhanced tuber
production

• GA-20 oxidase
knockouts with
reduced tuber
sprouting

The Canadian Potato Genome Project

• Leaf and tuber cDNA: sequence the gene and build cDNA libraries [Solanum Genomics Intl. Inc (SGII)]
• EST sequence generation [National Research Council at Halifax and SGII], producing FASTA formatted EST sequence & trace files
• Microarray profiling [SGII, PRC, UNB, Ontario Cancer Institute, and NSAC]
• Bioinformatics: base-calling, clustering, BLAST, annotations, and gene expression [UNB and PRC]

Sample FASTA formatted Sequences

EST sequence
>gi|39796586|gb|CK247430.1|CK247430 EST731067 potato callus cDNA library,
mRNA sequence
ACAAGTCACTATAGGGACATGCTTCAATTTTTTCAAAACATCTTGAATAGTACAAAGTGCACAACATACT
CCAAAAAACTGAATACATTTTCTATTGTCAATATCTATAGCCATATGACTTTCAGTGCGACCTATGCATT
CATAACTCCCGCTACCAAATCCACCATGTAGTGCTTACAACAACAAGCCTAGTGAGAACGTAAGCCTGGT
CTGGAGCCAAAAGCAAATTATGTATACTAAAAAACCCCCTGGCTAAAATGCATATCATGATTAGTAGTGA
CATT

Protein Sequence
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK

Database

• Contains all the EST sequences
• Contains useful annotations
– BLAST searches
– Contig assemblies
– Transmembrane spanning regions
– Gel pictures
– EST information
Data Analysis - Bioinformatics

• Tens of thousands of ESTs available for study
• Most methods to study message distributions are low throughput AND time consuming
• “Genomics” necessitates the large scale study of gene expression
• Automation required for routine processes
– Data acquisition for potato genome annotation
– Automated protein classification with rule maintenance
– Use agents to integrate the software and primary databases in a flexible and robust way

Overview of Bioinformatics Research at UNB

[Figure: research components and data flow]
• EST sequences
• TraceScan
• Automated Data Acquisition Pipeline: Homologs, Motifs, Fingerprints, Transmembrane, and Signal sites
• Multi-Agent System for Potato Genome Annotation
• Automated Protein Classification and Rule Maintenance
TraceScan - Keywords

• Chromatogram - visual representation of the digital output produced by an automated sequencing machine. A chromatogram is drawn as a set of four overlapping waveforms, one for each nucleotide base
• Base-calling - determining the set of nucleotide bases for a DNA sequence strand from the analysis of the digital output produced by a sequencing machine
• Heterozygosity - exists in the chromatogram where a second strong peak appears beneath a primary peak. This may indicate the presence of a secondary nucleotide base at that location in the sequence
• BLAST – Basic Local Alignment Search Tool

Example of a Chromatogram
The TraceScan Software System
• Designed to investigate sequence quality, potential polymorphisms, and base heterozygosity in EST sequences.
• Relies on the combined analysis of a DNA sequence trace file, the trace chromatogram, and multiple alignment of sequence homologs.
• Allows base-calls to be substituted where superimposed peaks have been detected in the trace.
• Base-calls deemed in error can be corrected to improve sequence quality and data reliability.

TraceScan

• Visualizes DNA sequence chromatograms
• Detects overlapping trace peaks using modifications to the PHRED base-caller
• Peaks are highlighted on the user interface.
• Modifications to PHRED enable base-calls with overlapping peaks to be substituted.
• Base substitutions produce a new set of base quality scores for the sequence.
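
PHRED-style quality scores are conventionally defined as Q = -10 * log10(P), where P is the estimated probability that a base-call is wrong. The sketch below shows only that conversion. How TraceScan recomputes scores after a substitution is not described on the slides, and the example qualities are made up.

```python
import math


def error_prob_to_phred(p: float) -> float:
    """Quality score for a base-call with error probability p (0 < p <= 1)."""
    return -10.0 * math.log10(p)


def phred_to_error_prob(q: float) -> float:
    """Error probability implied by quality score q."""
    return 10.0 ** (-q / 10.0)


def expected_errors(qualities: list[float]) -> float:
    """Expected number of wrong base-calls in a read, given per-base qualities."""
    return sum(phred_to_error_prob(q) for q in qualities)


print(round(error_prob_to_phred(0.001)))           # 30
print(round(phred_to_error_prob(20), 6))           # 0.01
print(round(expected_errors([20, 20, 30, 40]), 4)) # 0.0211
```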
TraceScan

• An interface to NCBI BLAST provides sequence comparison capabilities.
• Sequences are compared using BLASTN and BLASTX.
• BLASTN alignments are analyzed in search of discrepancies that may identify base-calling errors or putative polymorphisms in the trace sequence.
• Reading frames from BLASTX results are analyzed to examine if substituted base-calls result in synonymous or non-synonymous codon substitutions.
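
Whether a substituted base-call is synonymous can be checked directly against the standard genetic code. This is a minimal sketch, assuming the codon is given in the correct reading frame as upper-case DNA and `position` is 0-based within the codon. It illustrates the idea only and is not TraceScan's implementation.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TTT, TTC, TTA, TTG, TCT, ... GGG order ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon: str, position: int, new_base: str) -> str:
    """Classify a single-base change within a codon by its effect on the amino acid."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    return "synonymous" if before == after else "non-synonymous"


print(classify_substitution("CTT", 2, "C"))  # synonymous (Leu -> Leu)
print(classify_substitution("GAT", 2, "A"))  # non-synonymous (Asp -> Glu)
```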

TraceScan System Architecture



The Automated Data Acquisition Pipeline (ADAP) - Keywords

• Hypothetical protein: the protein sequence obtained by transcribing and translating the DNA sequence. It is hypothetical because it is not known whether it is the protein that the DNA actually codes for.
• Homologs: evolutionarily related protein sequences
• Comparative genomics: a technique in which the functional traits of a protein sequence are inferred from its homologs
• Motifs: highly conserved regions of protein sequences
• Fingerprints: collections of motifs
• BLASTP: Basic Local Alignment Search Tool for protein-to-protein searches
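
Motifs are often written as short patterns that can be scanned for in a protein sequence. As an illustration (not one of the ADAP tools), the sketch below searches for the well-known N-glycosylation pattern N-{P}-[ST]-{P}: Asn, any residue except Pro, Ser or Thr, then any residue except Pro. The helper name and the 1-based reporting are choices made here.

```python
import re

# Lookahead makes the search zero-width, so overlapping motif hits are all reported.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(protein: str, pattern: re.Pattern = N_GLYCOSYLATION) -> list[tuple[int, str]]:
    """Return (1-based start position, matched residues) for every motif occurrence."""
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(protein)]


# Fragment taken from the envelope protein sample shown earlier in the deck.
print(find_motif("KTNCSNVSVV"))  # [(3, 'NCSN'), (6, 'NVSV')]
```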
Automated Data Acquisition
Pipeline (ADAP)
• Gathers data for genome annotation
• ADAP features:
– Uses comparative genomics to learn from the homologs
– New variant of BLAST, Parameter Regulated Iterative BLAST (PRI-BLAST)
– Uses 7 different analysis/search tools
– A few software design patterns are used
– Perl, MySQL, Perl-DBI, BioPerl, EMBOSS, BLASTP, SGE 5.3, and Perl-Gtk on Linux

ADAP Overview
[Figure: ADAP data flow. Legend: data flow; database interactions.]
• Input: FASTA formatted EST sequences
• Phase 1: Hypothetical protein (HP) extraction and homolog generation (output: homologs and HPs)
• Phase 2: Sequence based protein structure prediction
• Phase 3: Database search based protein family prediction
• Potato ADAP database, accessed through a Perl-MySQL database interface
Parameter Regulated Iterative BLAST (PRI-BLAST)
• A static set of BLASTP parameters (neighborhood score, E-value, fraction identical, BLOSUM matrix, etc.) is not good, since proteins evolve at different rates

• PRI-BLAST iteratively runs BLASTP on the query sequence and categorizes the query as
– a Celebrity query (C): many homologs
– an Average query (A): some homologs
– an Obscured query (O): a few or no homologs

• PRI-BLAST modules
– Rule module: decides which set of BLASTP parameters to use and halts PRI-BLAST
– Statistical module: computes the density of homologs through SQL statements
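
The idea of iterating BLASTP under changing parameters until a halting rule fires can be sketched as below. This is only an illustration under stated assumptions: the parameter sets, the min_homologs halting rule, and the run_blastp callable are all hypothetical and are not taken from PRI-BLAST itself.

```python
# Hypothetical parameter sets, ordered from strict to relaxed (illustrative values).
PARAMETER_SETS = [
    {"evalue": 1e-10, "matrix": "BLOSUM80"},
    {"evalue": 1e-3, "matrix": "BLOSUM62"},
    {"evalue": 1.0, "matrix": "BLOSUM45"},
]


def iterative_search(query, run_blastp, min_homologs=5, parameter_sets=PARAMETER_SETS):
    """Re-run a search with progressively relaxed parameters.

    run_blastp is a caller-supplied function (query, params) -> list of hits;
    it stands in for a real BLASTP invocation. The loop halts as soon as
    enough homologs are found (an assumed halting rule).
    """
    hits, used, iterations = [], None, 0
    for params in parameter_sets:
        iterations += 1
        used = params
        hits = run_blastp(query, params)
        if len(hits) >= min_homologs:
            break  # assumed rule: enough homologs, stop iterating
    return {"params": used, "hits": hits, "iterations": iterations}
```

In this sketch the choice of the next parameter set plays the role of the rule module, and counting hits plays the role of the statistical module; the real system computes homolog density through SQL statements rather than a simple count.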

Example BLASTP report


BLASTP 2.2.8 [Jan-05-2004]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer
. . . . . Nucleic Acids Res. 25:3389-3402.
Query= CK00043.5prime
(182 letters)
Database: All non-redundant GenBank CDS
translations+PDB+SwissProt+PIR+PRF excluding environmental samples
1,795,144 sequences; 592,604,613 total letters
Searching..................................................done
Score E
Sequences producing significant alignments: (bits) Value
gb|AAD46849.2| LD03471p [Drosophila melanogaster] 329 5e-90
ref|NP_651977.1| CG6773-PA [Drosophila melanogaster] >gi|7300991... 285 1e-76
ref|XP_312881.1| ENSANGP00000014751 [Anopheles gambiae] >gi|2129... 209 7e-54
gb|AAH54585.1| Unknown (protein for MGC:63980) [Danio rerio] 184 4e-46
.
.
.
>gb|AAD46849.2| LD03471p [Drosophila melanogaster] Length = 386
Score = 329 bits (1155), Expect = 5e-90
Identities = 181/182 (99%), Positives = 181/182 (99%)
Query: 1 VKRRKKTRLAFNQFIWRPDERISSKMVSLLQEIDTEHEDMVHHAALDFYGLLLATCSSDG 60
VKRRKKTRLAFNQFIWRPDERISSKMVSLLQEIDTEHEDMVHHAALDFYGLLLATCSSDG
Sbjct: 6 VKRRKKTRLAFNQFIWRPDERISSKMVSLLQEIDTEHEDMVHHAALDFYGLLLATCSSDG 65
Query: 61 SVRIFHSRKNNKALAELKGHQGPVWQVAWAHPKFGNILASCSYDRKVIVWKSTTPRDWTK 120
SVRIFHSRKNNKALAELKGHQGPVWQVAWAHPKFGNILASCSYDRKVIVWKSTTPRDWTK
Sbjct: 66 SVRIFHSRKNNKALAELKGHQGPVWQVAWAHPKFGNILASCSYDRKVIVWKSTTPRDWTK 125
Query: 121 LYEYSNHDSSVNSVDFAPSEYGLVLACASSDGSVSVLTCNTEYGVWDAKKIPNXHTIGVN 180
LYEYSNHDSSVNSVDFAPSEYGLVLACASSDGSVSVLTCNTEYGVWDAKKIPN HTIGVN
Sbjct: 126 LYEYSNHDSSVNSVDFAPSEYGLVLACASSDGSVSVLTCNTEYGVWDAKKIPNAHTIGVN 185
Query: 181 AI 182
AI
Sbjct: 186 AI 187
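
A report like the one above is usually processed programmatically. The following minimal sketch extracts the identifier, bit score, and E-value from the one-line hit summaries; the regular expression and the function name parse_hit_summaries are assumptions for illustration, and the pattern only handles E-values written like those shown in this report.

```python
import re

# One-line hit summary: identifier, free-text description, bit score, E-value.
HIT_LINE = re.compile(r"^(\w+\|\S+)\s+(.*?)\s+(\d+)\s+(\d+(?:\.\d+)?(?:e[-+]?\d+)?)$")

SAMPLE = """\
gb|AAD46849.2| LD03471p [Drosophila melanogaster] 329 5e-90
ref|NP_651977.1| CG6773-PA [Drosophila melanogaster] >gi|7300991... 285 1e-76
ref|XP_312881.1| ENSANGP00000014751 [Anopheles gambiae] >gi|2129... 209 7e-54
gb|AAH54585.1| Unknown (protein for MGC:63980) [Danio rerio] 184 4e-46
>gb|AAD46849.2| LD03471p [Drosophila melanogaster] Length = 386
Score = 329 bits (1155), Expect = 5e-90
"""


def parse_hit_summaries(report_text):
    """Return (identifier, bit_score, evalue) for each hit summary line."""
    hits = []
    for line in report_text.splitlines():
        match = HIT_LINE.match(line.strip())
        if match:  # alignment headers and score lines do not match
            hits.append((match.group(1), int(match.group(3)), float(match.group(4))))
    return hits


for identifier, score, evalue in parse_hit_summaries(SAMPLE):
    print(identifier, score, evalue)
```

Storing these tuples in a database table is one way the homolog statistics used later in the pipeline could be computed.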
motif-search-based Protein Sequence Analysis (mPSA)

• Motifs are conserved regions of protein sequences, and a fingerprint is a collection of motifs in some order

• mPSA (Phases 2 & 3) for the ADAP contains 6 mPSA tools from EMBOSS

• Phase 2: sequence-based mPSA
– secondary structure: transmembrane regions (Tmap), signal sites (Sigcleave), and general secondary structure (Garnier)
– super-secondary structure: DNA-binding sites (Helixturnhelix)

• Phase 3: database-search-based mPSA
– protein motifs from PROSITE (Patmatmotifs) and protein fingerprints from PRINTS (Pscan)
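
To make the notion of a motif search concrete, the sketch below converts a PROSITE-style pattern into a regular expression and scans a protein sequence with it. It is a simplified illustration, not how Patmatmotifs is implemented; the function names are hypothetical, and rarer PROSITE syntax (such as anchors inside brackets) is not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern such as 'N-{P}-[ST]-{P}' to a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P}: any residue except P
        element = element.replace("(", "{").replace(")", "}")   # x(2,4): repeat counts
        element = element.replace("x", ".")                     # x: any residue
        element = element.replace("<", "^").replace(">", "$")   # sequence-end anchors
        parts.append(element)
    return "".join(parts)


def find_motifs(pattern, sequence):
    """Return (0-based start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


# PROSITE N-glycosylation site pattern applied to a made-up sequence.
print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_motifs("N-{P}-[ST]-{P}", "AANGSAA"))
```

A fingerprint search extends this idea by requiring several such motifs to occur together in a given order.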

Homologues for Various Ranges of Lengths of Hyp. Proteins

[Figure: bar chart of the total number of homologues (y-axis, 0 to 10,000) against the length of the hypothetical protein (x-axis, bins of width 5 from 10-15 to 110-115)]

Shorter protein sequences have more homologs, but these can be false positives

Homologues with E<1 and E<100 for Various Ranges of Lengths of Hyp. Proteins

[Figure: percentage of homologues w.r.t. the total number of homologues (y-axis, 0% to 100%) against the length of the hypothetical protein (x-axis, bins of width 5 from 10-15 to 110-115), with two series: percentage of homologues with E<1 and percentage of homologues with E<100]

Homologs of shorter sequences have large E-values, hence we cannot use them in comparative genomics

Fraction of Query Protein Conserved: Homologs (E<1 & Length>35) vs. Homologs (E<1, Length>35, and have a Fingerprint)

[Figure: number of HSPs (y-axis, 0 to 6500) against fraction conserved (x-axis, 0.05 to 0.95), with two series: HSPs with E<1 and length>=35, and HSPs with E<1, a fingerprint (FP), and length>=35]

Generally, the structure of protein sequences is conserved if they have a sequence similarity of 35% or more; the selected region in the graph shows the useful homologs
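
The selection criteria discussed on these slides (E<1, length of at least 35, and a conserved fraction of 35% or more) can be combined in a small filter. The sketch below is an assumption-laden illustration: the dictionary field names are hypothetical, and "fraction conserved" is taken here to mean identical residues divided by query length, which the slides do not define explicitly.

```python
def useful_homologs(hits, max_evalue=1.0, min_length=35, min_fraction=0.35):
    """Keep hits that pass the E-value, length, and conservation cut-offs.

    Each hit is a dict with hypothetical keys: 'id', 'evalue',
    'query_length', and 'identities'.
    """
    kept = []
    for hit in hits:
        fraction = hit["identities"] / hit["query_length"]  # assumed definition
        if (hit["evalue"] < max_evalue
                and hit["query_length"] >= min_length
                and fraction >= min_fraction):
            kept.append({**hit, "fraction_conserved": fraction})
    return kept


example_hits = [
    {"id": "h1", "evalue": 5e-90, "query_length": 182, "identities": 181},
    {"id": "h2", "evalue": 0.5, "query_length": 20, "identities": 18},    # too short
    {"id": "h3", "evalue": 50.0, "query_length": 100, "identities": 60},  # E-value too large
    {"id": "h4", "evalue": 0.01, "query_length": 100, "identities": 20},  # poorly conserved
]
print([hit["id"] for hit in useful_homologs(example_hits)])
```

The first example hit uses the figures from the BLASTP report shown earlier (181 identities over a 182-residue query); the other three are made up to exercise each cut-off.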

Automated Protein Classification and Rule Maintenance
• Use machine-learning techniques to find rules

• Apply the rules to classify uncharacterized sequences

[Figure: categorized sequences and their related data → rule construction process → a decision tree consisting of rules; uncharacterized sequences → rule application process → newly characterized sequences]

Automated Protein Classification and Rule Maintenance
• Source data collection

• Automated rule generation

• Machine-learning algorithms and their comparison

• Automated rule maintenance

Automated Rule Generation
• C4.5 and CITree algorithms produce decision trees
• WEKA (Waikato Environment for Knowledge Analysis) will be used for analyzing the dataset (http://www.cs.waikato.ac.nz/~ml/index.html)
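
C4.5 chooses the attribute to split on by its gain ratio (information gain divided by split information). The minimal sketch below computes that quantity for one categorical attribute; the attribute values and class labels in the example are made up, and this is not WEKA's implementation.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())


def gain_ratio(attribute_values, labels):
    """Gain ratio of splitting `labels` by a categorical attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the class labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(attribute_values)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical attribute "has a transmembrane region" versus a protein class label.
has_tm = ["yes", "yes", "no", "no"]
family = ["membrane", "membrane", "soluble", "soluble"]
print(gain_ratio(has_tm, family))
```

An attribute that separates the classes perfectly, as in this toy example, gets the highest possible gain ratio and would be placed at the root of the tree.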

[Figure: flowchart of the rule generation process. Boxes: Start; Sequences and their related data; Rule Construction & Decision Tree Creation; End of Rules? (Yes/No); Rule Sieving; Is the rule qualified? (Yes/No); Update Rule Database (Rule Database); Apply rules to annotate target sequences; Update Target Sequence Database (Target Sequence Database); End]

Rule Generation process



Comparison of Algorithms
• Evaluation criteria for the machine-learning algorithms: accuracy and AUC (area under the ROC (receiver operating characteristic) curve)

• Performance analysis

[Figure: Tree generated using Weka]
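
The two evaluation criteria named above, accuracy and AUC, can be computed as in the following minimal sketch. It uses the pairwise-ranking definition of AUC (the probability that a randomly chosen positive is scored above a randomly chosen negative); the labels and scores are made-up illustrative data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Labels are 1 (positive) and 0 (negative); ties count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
predictions = [1 if s >= 0.5 else 0 for s in scores]
print(accuracy(labels, predictions), auc(labels, scores))
```

Accuracy depends on the chosen decision threshold (0.5 here), whereas AUC summarizes ranking quality across all thresholds, which is one reason to report both.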

Multi-agent Systems
• A multi-agent system consists of a number of agents that interact with one another

• In the most general case, agents act on behalf of users with different goals and motivations

• To interact successfully, they require the ability to cooperate, coordinate, and negotiate with each other, much as people do

Multi-Agent System for Potato Genome Annotation

[Figure: system diagram. Components: Local Database (NRDB, MONTH, PRINTS, PROSITE); Automated Data Acquisition Pipeline; Information Agent; Pipeline Agent; Classification Module; Web; Rule Construction Agent; Rule Database; Target Sequence Database; Database Update Agent]

Mapping Transcription Factors from a Model to a Non-Model Organism

Transcription Factor

• Group of proteins that initiate transcription
– transcriptional activators
– transcriptional repressors
• Contain DNA-binding domains
• Bind to binding-site regions (specific DNA sequences)
• Control the expression of genes
• Human genome: 2600 proteins contain DNA-binding domains

Transcription Factor Mapping

[Figure: transcription factors A, B, C in the source genome are mapped to A1, B1, C1 in the target genome]

Source genome (model organism)
• Investigated thoroughly by biologists
• Nodes: transcription factors

Target genome (non-model organism)
• Not much data available
• Nodes: predicted transcription factors

Methodology

• BLASTP is used to map transcription factors from E. coli and Bacillus subtilis to the E. coli group and the Bacillus group
• Parameter: E-value threshold, varied from 1e-5 to 10

• Not all transcription factors from one genome can be mapped to another genome

• The number of confirmed mappings between any two genomes depends on the definition of confirmed mapping used
• Compare the available transcription factors of the target genome to the predicted set of transcription factors
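
The mapping and comparison steps can be sketched as follows. This is a hedged illustration only: it assumes that a mapping is the best BLASTP hit at or below the E-value threshold, which is one possible definition of a mapping, and all identifiers and E-values in the example are made up.

```python
def map_transcription_factors(hits, evalue_threshold):
    """Map each source transcription factor to its best target hit.

    hits: iterable of (source_tf, target_protein, evalue) tuples, for
    example parsed from BLASTP output. Hits above the threshold are ignored.
    """
    best = {}
    for source_tf, target, evalue in hits:
        if evalue > evalue_threshold:
            continue
        if source_tf not in best or evalue < best[source_tf][1]:
            best[source_tf] = (target, evalue)
    return {tf: target for tf, (target, _) in best.items()}


def compare_to_known(mapping, known_target_tfs):
    """Compare predicted target transcription factors with the known set."""
    predicted, known = set(mapping.values()), set(known_target_tfs)
    return {
        "confirmed": sorted(predicted & known),
        "unconfirmed": sorted(predicted - known),
        "missed": sorted(known - predicted),
    }


# Made-up hits: (source TF, target protein, E-value).
hits = [("tfA", "t1", 1e-30), ("tfA", "t2", 1e-3), ("tfB", "t3", 0.5), ("tfC", "t4", 5.0)]
for threshold in (0.1, 1.0, 10.0):
    mapping = map_transcription_factors(hits, threshold)
    print(threshold, compare_to_known(mapping, {"t1", "t3"}))
```

Running the comparison over a range of thresholds, as in the loop above, shows how the number of confirmed mappings changes with the threshold.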


Summary of Mapping Results

• Transcription factor mapping in bacterial genomes
• The proposed method is able to map most of the transcription factors
• Transcription factor sequence motifs are well preserved
• Best E-value thresholds: 0.1 and 0.01
• The correct choice of E-value threshold can be more important than the selection of an evolutionarily closer model organism

Bioinformatics @ C-DAC

Dr. Rajendra Joshi


Group Coordinator: Bioinformatics
Scientific and Engineering Computing Group
Centre for Development of Advanced Computing
Pune - 411007
rajendra@cdac.in
http://bioinfo-portal.cdac.in

Bioinformatics Resources & Applications Facility (BRAF)
• Funded by the Department of Information Technology (DIT), Ministry of Communications and Information Technology

• Grid-enabling of numerous bioinformatics codes such as SW, BLAST, ClustalW, AMBER, and CHARMM

• As part of BRAF, the team interacted with scientists from various CSIR labs, IITs, and industries

BIOGENE: 1 TF machine
• AMD processors at 2.6 GHz (total: 204 cores, 1060.8 GF)

• 4 Sun X4600 servers (8 sockets, dual-core each), giving 64 cores

• 32 Sun X2200 servers (dual-socket, dual-core each), giving 128 cores

• Backup server: Sun X2200 (4 cores)

• Storage server: two Sun X2200 (8 cores)

• InfiniBand switch (Mellanox DDR2, 48 ports)

• Storage: 20 terabytes, RAID 5

• Tape library with autoloader

• Benchmarking completed for AMBER, CHARMM, MEME, SW, FASTA, ClustalW, and BLAST
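
The core count and peak figure quoted for BIOGENE can be checked with a short calculation. The sketch assumes 2 floating-point operations per clock cycle per core, a value that is not stated on the slide but reproduces the quoted 1060.8 GF.

```python
# (description, number of servers, sockets per server, cores per socket)
SERVERS = [
    ("Sun X4600 compute", 4, 8, 2),
    ("Sun X2200 compute", 32, 2, 2),
    ("Sun X2200 backup", 1, 2, 2),
    ("Sun X2200 storage", 2, 2, 2),
]

CLOCK_GHZ = 2.6
FLOPS_PER_CYCLE = 2  # assumption: not stated in the source

total_cores = sum(count * sockets * cores for _, count, sockets, cores in SERVERS)
peak_gflops = total_cores * CLOCK_GHZ * FLOPS_PER_CYCLE

print(total_cores)            # total cores across all servers
print(round(peak_gflops, 1))  # theoretical peak in GFLOPS
```

The total includes the backup and storage servers, which is how the slide arrives at 204 cores rather than the 192 compute cores alone.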

Using BRAF Facility

• Gipsy portal: open the URL http://gipsy.bioinfo-portal.cdac.in in a browser

• Command-line login: ssh -p 30005 gateway.cdac.in
• Help on command-line usage is available in the README file in the user's home directory.

• Helpline: braf-help@cdac.in
Bioinformatics Application Software
for High-End Clusters and Grid
Anvaya : A Workflow Environment for High Throughput Comparative Genomics

Taxo Grid : Phylogeny on Grid

iMolDock : An interface for Molecular Docking on HPC

GENOPIPE : Automated Genome Annotation Pipeline on HPC

GenomeGrid : Bioinformatics Problem Solving Environment on Grid

GIPSY : Bioinformatics Problem Solving Environment on HPC

High-throughput Workflows for Genome Analysis
Collaboration: Biotechnology and Biological Sciences Research Council (UK)
• A systems-biology-based approach for annotation of Salmonella and Mycobacterium genomes
• Establishment of a common bioinformatics pipeline for analyses of bacterial genomes, with emphasis on identification of virulence and pathogenic factors

Collaboration: Institute of Animal Health (UK)
• Genome annotation: Salmonella
– Causative agent of typhoid
– Transmitted via food contamination
– Economic losses as it affects livestock
• Annotation of 5 Salmonella genomes with a wide host range

[Figure: Food-borne disease cycle: Salmonella]

Genome Annotation via GENOPIPE


Single nucleotide polymorphism
Collaboration: University of Surrey (UK)

• Expert curation of the Mycobacterium leprae genome: causative agent of leprosy
• Development of a tool to calculate the molecular weight of metabolites
Collaboration: Oregon Health & Science University (USA)
• Collaborative project initiated with OHSU in December 2009
• Provides computational support to the experimental studies at OHSU through MD simulations on the BIOGENE cluster
• The propeptide domain of the serine protease furin acts as a pH sensor
• The phenomenon has been elucidated in silico through MD simulations
• Ten sets of simulations were performed using NAMD

[Figure: Furin complex]

Collaborations: caBIG (NIH)

• The National Cancer Institute (NCI) is involved in deployment of an integrated biomedical informatics infrastructure, the cancer Biomedical Informatics Grid (caBIG™)
• A network that will freely connect the entire cancer community
• caBIG would set up a node at C-DAC
• GARUDA grid and BRAF resources may be used

Collaborations: IIT Madras

CGMD studies on GPCR
OA1 (GPR143) – a GPCR
• Belongs to Class I GPCRs, the rhodopsin family
• 7TM receptors or heptahelical receptors
• An integral membrane glycoprotein of 404 aa
• Protein product of the ocular albinism type 1 gene
• Ocular albinism is an X-linked inherited disorder in which the eye lacks melanin pigment
• A homology-based approach along with CGMD simulation has been planned for this work

Collaboration: Jubilant Biosys
• Simulate fragment binding sites by molecular dynamics simulation methods
• Identify the most probable site of interaction of chemical fragments in the protein
• 8 large simulations of 10 ns each were carried out
• Results handed over to Jubilant

Collaboration: Nicholas Piramal

• Contract research project
• To understand protein-ligand interactions using molecular dynamics simulations
• Involves carrying out molecular dynamics simulations on very large biomolecular systems
• Helps in designing better molecules for known drug targets
• Four 20 ns molecular dynamics simulations have been carried out

Conclusion

• Biology – transforming from observational and physical experiments → computational science

• Bioinformatics – an exciting research area

• Challenges – Biology and Computer Science have different ways of working and need close collaboration

• Opportunities – new crops, personalized medicine, early diagnosis, …


Research Problems in Bioinformatics


Find genomes of all organisms
Identify and annotate all genes
Compute sequence <-> 3D structure for all proteins
Compare DNA / protein sequences for similarity
Compare families of DNA / protein sequences

Reason to be optimistic: Biology is finite…
~30,000 human genes; ~1000 protein superfamilies
…but computer speeds keep increasing

Business Opportunities

• Clinical research
• Gene therapy
• Molecular science
• Pharmaceutical companies: automated technologies to manufacture effective therapies and drugs, driven by increasing concerns about drug safety and the stringent regulations that govern clinical trials for drug discovery
• Bioinformatics platform market: growing at a very fast rate
• Global bioinformatics market: ~$8.3 billion by 2014
• Knowledge management (2009): $1.3 billion
• Bioinformatics platforms market (2014): ~$3.9 billion


Business Opportunities

Global bioinformatics market segments

• Bioinformatics platforms
– Sequence alignment platforms
– Sequence manipulation platforms
– Sequence analysis platforms
– Structural analysis platforms
• Content/Knowledge management tools
– Specialized knowledge management tools
– Generalized knowledge management tools
• Services
– Data analysis
– Sequencing services
– Database & management services
• Applications

Thank You!
