
Extraction of Subcellular Locations From Protein Databases

Julianna L. Conde Adorno, Juchang Hua, Dr. Robert F. Murphy


August 5, 2004 SURP Final Presentation

Introduction
Methods have been created for the quantitative analysis of fluorescent microscope images of a large number of proteins to determine their subcellular locations.


Project Goal
Improve the algorithm to extract the subcellular locations contained in structured protein databases.


Gene Ontology (GO)


GO is a collaborative effort to address the need for consistent descriptions of gene products in different databases.

It is organized into three parts:


Biological processes
Molecular functions
Cellular components

Directed Acyclic Graph (DAG)


The GO database is structured as a directed acyclic graph (DAG).


GO Tree
A node of the GO tree contains:
GO ID: a unique number.
GO Code: its length is 15. A GO ID may have more than one GO Code.

GO Code             GO ID
3132410000000000    0005792
3132411000000000    0019718

[Tree diagram with nodes 3132410000000000, 3132411000000000 and 3132412000000000]
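The example codes suggest that a GO Code spells out the path from the root of the tree, one digit per level, padded with zeros on the right. A minimal Python sketch under that assumption follows; the helper names `depth`, `parent` and `is_ancestor` are illustrative and are not taken from the program described in these slides.

```python
def depth(code: str) -> int:
    """Number of tree levels encoded in a code (trailing zeros are padding)."""
    return len(code.rstrip("0"))


def parent(code: str) -> str:
    """Code of the parent node: drop the last non-zero digit, keep the width."""
    return code.rstrip("0")[:-1].ljust(len(code), "0")


def is_ancestor(a: str, b: str) -> bool:
    """True if node `a` lies on the path from the root to node `b`."""
    return a != b and len(a) == len(b) and b.startswith(a.rstrip("0"))


# Assumption: digits 1-9 number the children, so 0 only ever means "no level".
node = "3132411000000000"
print(parent(node))                                   # code one level up
print(is_ancestor(parent(node), "3132412000000000"))  # sibling shares that parent
```

Under this reading, the two lower nodes of the diagram share the same parent code, and ancestry checks reduce to string prefix tests.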


DAG Structure vs. GO Tree Structure


GO Code             GO Id
3184000000000000    0005938
3184100000000000    0030863
3184110000000000    0030864
3184111000000000    0030478
3184112000000000    0030479
3185000000000000    0046858
3186000000000000    0005694
3187000000000000    0005929
3188000000000000    0000307
3189000000000000    0005737
3189100000000000    0005856
3189110000000000    0015629
3189111000000000    0030482
3189112000000000    0005884
3189112100000000    0042643
3189113000000000    0042641
3189113100000000    0042643
3189114000000000    0030864
3189114100000000    0030478
3189114200000000    0030479
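In a DAG a term can have several parents, so when the graph is unfolded into a tree the same GO Id shows up under more than one GO Code (0030864, 0030478 and 0030479 each appear twice in the listing above). A small Python sketch of grouping codes by ID, using six rows from that listing; `ROWS` and `codes_by_id` are illustrative names, not part of GOfetcher.

```python
from collections import defaultdict

# (GO Code, GO Id) pairs copied from the listing above.
ROWS = [
    ("3184110000000000", "0030864"),
    ("3184111000000000", "0030478"),
    ("3184112000000000", "0030479"),
    ("3189114000000000", "0030864"),
    ("3189114100000000", "0030478"),
    ("3189114200000000", "0030479"),
]


def codes_by_id(rows):
    """Collect every tree position (GO Code) recorded for each GO Id."""
    index = defaultdict(list)
    for code, go_id in rows:
        index[go_id].append(code)
    return dict(index)


for go_id, codes in codes_by_id(ROWS).items():
    print(go_id, codes)
```

Each of the three IDs maps to two codes, one under the 3184 branch and one under the 3189 branch.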

Methods
Protein Name -> GOfetcher -> Protein Location


Example
A protein can have more than one GO ID and GO Code.
Protein name: Actin

GO ID      GO Code
0015629    318911000000000
0030482    318911100000000
0005884    318911200000000
0030864    318912000000000
0030864    318411000000000

How does one obtain the location?



Lowest Common Ancestor (LCA)


LCA is the common parent of all the GO Codes found for each GO ID of the protein name. The input for the function is a matrix that contains all the GO Codes found:

3 1 8 9 1 1 0 0 0 0 0 0 0 0 0
3 1 8 9 1 1 1 0 0 0 0 0 0 0 0
3 1 8 9 1 1 2 0 0 0 0 0 0 0 0
3 1 8 9 1 2 0 0 0 0 0 0 0 0 0
3 1 8 4 1 1 0 0 0 0 0 0 0 0 0

The first three columns contain the same numbers for all rows, so the result is 318000000000000, GO ID: 0005622 (Intracellular).
For the first four rows alone, the first five columns agree, which gives the result 318910000000000, GO ID: 0015629 (Cytoskeleton).
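The column comparison described here amounts to taking the longest prefix shared by all codes and padding it back to full width with zeros. The following Python sketch illustrates that idea with the Actin codes; the function name `lowest_common_ancestor` is illustrative, and the original program may be implemented differently.

```python
def lowest_common_ancestor(codes):
    """Keep the leading digits shared by every code and zero-pad the rest."""
    if not codes:
        raise ValueError("at least one GO Code is required")
    width = len(codes[0])
    if any(len(code) != width for code in codes):
        raise ValueError("all GO Codes must have the same length")
    shared = 0
    for column in zip(*codes):        # walk the matrix column by column
        if len(set(column)) != 1:     # first column where the rows disagree
            break
        shared += 1
    return codes[0][:shared].ljust(width, "0")


# The five Actin codes, written as prefixes padded to 15 digits.
ACTIN_CODES = [p.ljust(15, "0") for p in
               ("318911", "3189111", "3189112", "318912", "318411")]

print(lowest_common_ancestor(ACTIN_CODES))      # all five rows
print(lowest_common_ancestor(ACTIN_CODES[:4]))  # first four rows only
```

With all five rows only the prefix 318 is shared; with the first four rows the shared prefix is 31891.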

Possible outputs for the LCA


GO ID      GO Term / Output
0005576    Extracellular
0009986    Cell Surface
0016020    Membrane
0005623    Cell (other component)
0005622    Intracellular
0005694    Chromosome
0005737    Cytoplasm
0016023    Cytoplasmic Vesicle
0005856    Cytoskeleton
0005829    Cytosol
0005783    ER
0005768    Endosome
0005793    ER Golgi intermediate compartment
0005794    Golgi Apparatus
0005739    Mitochondrion
0005840    Ribosome
0005634    Nucleus
0005635    Nuclear Membrane
0005730    Nucleolus
0005654    Nucleoplasm
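Since the output is limited to these twenty terms, a node that is not itself on the list has to be reported as one of its ancestors. The slides do not say how this is done; one plausible approach, sketched below in Python purely as an assumption, is to walk up the tree until a listed GO ID is reached. The names `pad`, `CODE_TO_ID`, `OUTPUT_TERMS` and `nearest_output_term` are hypothetical, and the sample entries are taken from the GO Code listing and the output list.

```python
def pad(prefix: str, width: int = 16) -> str:
    """Right-pad a code prefix with zeros to the full code width."""
    return prefix.ljust(width, "0")


# A few (GO Code -> GO ID) rows from the GO Code listing.
CODE_TO_ID = {
    pad("3189"): "0005737",
    pad("31891"): "0005856",
    pad("318911"): "0015629",
    pad("3189111"): "0030482",
}

# Subset of the allowed outputs.
OUTPUT_TERMS = {"0005737": "Cytoplasm", "0005856": "Cytoskeleton"}


def nearest_output_term(code: str):
    """Climb toward the root until a node with an allowed output term is found."""
    while code.strip("0"):                       # stop once every digit is zero
        go_id = CODE_TO_ID.get(code)
        if go_id in OUTPUT_TERMS:
            return OUTPUT_TERMS[go_id]
        code = code.rstrip("0")[:-1].ljust(len(code), "0")  # move to the parent
    return None


print(nearest_output_term(pad("3189111")))
```

Here 0030482 and its parent 0015629 are not on the output list, so the walk stops two levels up at 0005856 (Cytoskeleton).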


Methods (continuation)
Protein name -> GO Terms -> GO IDs -> GO Codes -> LCA
LCA -> GO Code -> GO ID -> GO Term
The GO Term obtained is the result.


Results
Program Version    Running Time*
New                5-15 seconds
Old                3-15 minutes

* For retrieving the location of one protein name



Results: SLIF Protein Names

Program Version    No GO ID       No Location Found    Location
New                11 (21.2 %)    10 (19.2 %)          31 (59.6 %)
Old                27 (51.9 %)    5 (9.6 %)            20 (38.5 %)
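The percentages are each count divided by the number of SLIF protein names, which works out to 52 for both program versions. A short Python check of that arithmetic; the dictionary keys and the `shares` helper are illustrative names only.

```python
def shares(counts):
    """Express each count as a percentage of the total, to one decimal place."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


new = {"No GO ID": 11, "No location found": 10, "Location": 31}
old = {"No GO ID": 27, "No location found": 5, "Location": 20}

print(sum(new.values()), shares(new))
print(sum(old.values()), shares(old))
```

Both rows sum to 52, and the computed shares match the percentages on the slide.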


Results: 3T3 Protein Names

Program Version    No GO ID       No Location    Location
New                25 (58.1 %)    1 (2.3 %)      17 (39.6 %)
Old                22 (51.2 %)    6 (14.0 %)     15 (34.8 %)


Confusion Matrix for SLIF Protein Names

Row and column labels, in order: Nucleus, Membrane, ER, Cytoskeleton, Cytoplasm, Intracellular, Extracellular, N/A

8
0 0 0 0 0 0 5

0 0

0 0 0

0 0 0 0

0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 1 4 0 14

0
0 0 0 0 0 7

1
0 0 0 0 0

3
0 0 0 4

0
0 0 0

2
0 1

0
2
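A confusion matrix of this kind counts, for every pair of location labels, how many protein names received the first label from one program version and the second label from the other. A minimal Python sketch with made-up sample labels follows; which version is on the rows is an assumption here, and `confusion_matrix` is an illustrative name.

```python
from collections import Counter


def confusion_matrix(old_labels, new_labels, classes):
    """Rows: label from the old version; columns: label from the new version."""
    pairs = Counter(zip(old_labels, new_labels))
    return [[pairs[(row, col)] for col in classes] for row in classes]


# Hypothetical labels for four protein names (not data from the slides).
classes = ["Nucleus", "Cytoskeleton", "N/A"]
old = ["Nucleus", "N/A", "Cytoskeleton", "N/A"]
new = ["Nucleus", "Nucleus", "Cytoskeleton", "N/A"]

for row in confusion_matrix(old, new, classes):
    print(row)
```

Entries on the diagonal are protein names for which both versions agree; off-diagonal entries in the N/A row are names that only the other version could place.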


Confusion Matrix for 3T3 Protein Names

Row and column labels, in order: Nucleus, Membrane, ER, Cytoskeleton, Mitochondria, Ribosome, Cytoplasm, Intracellular, Extracellular, N/A

4 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 2
0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 2
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 2 2 0 2
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0
5 0 1 2 2 2 2 0 0 13

Conclusions
The LCA algorithm is faster and more accurate than the old version.
Higher recall
Agreement with previous results
Friendlier interface


Future Work
With this program we can compare the subcellular location determined by the image analysis with those extracted from protein databases. This comparison can reveal whether the description obtained from the analysis of the images is consistent or not, and, for example, can identify proteins that were mis-localized due to tagging artifacts.

Acknowledgments
Dr. Robert F. Murphy Murphy Lab Group
Juchang Hua

NSF

