
2009 International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing

Proteogenomic Mapping for Structural Annotation of Prokaryote Genomes

Nan Wang
School of Computing
University of Southern Mississippi
Hattiesburg, MS 39406, USA
Nan.Wang@usm.edu

Shane Burgess, Mark Lawrence
Department of Basic Sciences
Mississippi State University
Mississippi State, MS 39762, USA
burgess@cvm.msstate.edu

Susan Bridges
Department of Computer Science and Engineering
Mississippi State University
Mississippi State, MS 39762, USA
bridges@cse.msstate.edu

Abstract-Structural annotation of genomes is one of the major goals of genomics research. The most popular tools for structural annotation of genomes are computational pipelines. It is well known that these computational methods have a number of shortcomings, including false identifications and incorrect identification of gene boundaries. Proteomic data can be used to confirm the identification of genes identified by computational methods and to correct mistakes. A proteogenomic mapping method has been developed which uses peptides identified from mass spectrometry for structural annotation of genomes. Spectra are matched against both a protein database and the genome database translated in all six reading frames. Those peptides that match the genome but not the protein database potentially represent novel protein coding genes or annotation errors. These short, experimentally derived peptides are used to discover potential novel protein coding genes, called expressed Protein Sequence Tags (ePSTs), by aligning the peptides to the genomic DNA and extending the translation in the 3' and 5' directions. In this paper, an enhanced pipeline has been designed and developed for discovering and evaluating potential novel protein coding genes: 1) a distance-based outlier detection method for validating peptides identified from MS/MS, 2) proteogenomic mapping for discovery of potential novel protein coding genes, and 3) collection of evidence from a number of sources and automatic evaluation of potential novel protein coding genes using machine learning techniques such as Neural Network, Support Vector Machine, Naïve Bayes, etc.

Keywords-peptide validation; outlier detection; ePST; expressed Protein Sequence Tags; proteogenomic mapping; neural network; support vector machine; naïve Bayes; Bayesian network; potential genes; target decoy strategy

978-0-7695-3739-9 2009. U.S. Government Work Not Protected by U.S. Copyright. DOI 10.1109/IJCBS.2009.126

I. INTRODUCTION

The explosion in genomic sequencing has led to the availability of a large number of genomes over the last decades, and these genome sequences are now publicly available [1]. However, the genome sequences by themselves are of little use. True value is derived from the genome sequence only after the genes have been identified (structural annotation) and the function of the protein product has been determined (functional annotation). After a genome has been sequenced, gene prediction programs are used to predict genes. Most computational programs for structural annotation in bacteria are based on features in the nucleic acid sequence. GeneMark [2] and Glimmer [3] are popular gene finding tools, and both are based on Hidden Markov models (HMMs) at the nucleic acid sequence level. Although these tools are widely used, they are known to have a number of shortcomings, including failing to identify genes that exist, false identifications, and incorrect identification of gene boundaries [4]. Because of the high false identification rate obtained for prediction of short genes, these algorithms usually use a somewhat arbitrary length cutoff and are therefore particularly ineffective at identifying novel short genes.

Mass spectrometry [5] is a popular technique for detection of proteins in biological samples. It provides direct molecular evidence of the existence of the protein in the living cell. Proteins are typically identified using mass spectrometry by computationally matching experimental mass spectra against theoretical spectra derived from a protein database. However, several groups have recently reported the use of mass spectral data to identify
genes on the genome [4]. This process was named proteogenomic mapping by Jaffe et al. [1]. We have designed algorithms that use experimental protein data from mass spectrometry to find novel protein coding genes on bacterial genomes. Mass spectrometry cannot be used to identify proteins directly. Instead, the proteins are cleaved into peptides by enzymes such as trypsin that cut the protein in specific places. The peptides are subjected to mass spectrometry and identified by matching the resulting spectra against a database of theoretical spectra. Once the peptides have been identified, they are mapped to the parent protein. Peptide identifications based on mass spectrometry are extremely noisy and represent a mixture of true and false identifications. PepOut [6], a distance-based outlier detection method, is used for peptide validation. In this paper, we describe a proteomics-based method for identifying open reading frames that are missed by computational algorithms. Mass spectrometry based identification of peptides and proteins from biological samples provides evidence for the expression of the genome sequence at the protein level. This proteogenomic mapping method uses proteomics both to confirm computationally predicted genes and to identify novel gene models.

II. METHOD

In this paper, we describe our proteogenomic mapping pipeline, a set of computational tools that automates the structural annotation of bacterial genomes. The workflow is shown in Figure 1. There are four major steps in the pipeline: peptide validation by PepOut [6], proteogenomic mapping for ePST generation, feature collection from a variety of resources, and ePST evaluation by machine learning algorithms.

A. Proteogenomic mapping pipeline

To discover potential novel protein coding genes, a proteogenomic mapping pipeline was developed, as shown in Figure 1. Biological samples are trypsin digested to peptides. These samples are run in an LC ESI-MS/MS mass spectrometer, which generates mass spectra for the samples. The mass spectra are then searched against the protein database and the genome database translated in six reading frames. Those peptides that match the genome but not the protein database potentially represent novel protein coding genes or annotation errors, and are called unique peptides. The peptide identifications generated from the protein and genome database search results are validated by PepOut. The unique peptides are used for discovering potential protein coding genes by proteogenomic mapping, and these potential protein coding genes are called ePSTs (expressed Protein Sequence Tags). These ePSTs are further used for correcting annotation or sequencing errors, and some might be novel protein coding genes. To discover these unique peptides, the proteogenomic mapping workflow requires a sequenced genome, the existing protein models for the genome, and a proteomics dataset that is specific to the prokaryotic genome under study.

Figure 1. Flowchart of proteogenomic mapping pipeline.

B. Discovery of potential novel protein coding genes

Unique peptides are segments of expressed protein sequences and can be used to discover potential protein coding genes in the genome. Figure 2 illustrates the process used to identify potential novel genes or to correct predicted genes. Peptides with high confidence scores from PepOut [6] are used to discover expressed protein sequence tags (ePSTs). The genome sequence is translated in all six reading frames, and the validated unique peptides are mapped to the translated sequence and are assumed to represent a segment of an expressed gene. The next task is to find the beginning and end of the gene. The peptide match is extended downstream to find an in-frame stop codon. To find the start codon, we first locate an in-frame upstream stop codon. The start position of a potential protein coding gene should be located between the in-frame stop codon of an upstream gene and the beginning of the peptide. In our method, we use the first start codon between the upstream in-frame stop codon and the beginning of the peptide as the start codon of the potential protein coding gene. If there is no start codon between the in-frame upstream stop codon and the peptide, the beginning of the peptide is used as the start position of the potential protein coding gene. The ePSTs generated using the process described above are considered potential novel protein coding genes or extensions to predicted genes and are evaluated further as described in the next sections.
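
The ePST generation procedure of Section II.B (six-frame translation, peptide mapping, downstream extension to a stop codon, and upstream search for the first start codon) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes exact string matching of the peptide against the translated frames, the standard genetic code, and the start codons ATG, GTG and TTG (the paper does not state which start codons are accepted). The function name generate_epsts and the fields of the returned records are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Assumption: common bacterial start codons; the paper does not list them.
START_CODONS = {"ATG", "GTG", "TTG"}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, frame):
    """Translate one reading frame; codons with ambiguous bases become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(frame, len(dna) - 2, 3))


def generate_epsts(genome, peptide):
    """Map a peptide onto all six reading frames and extend each hit to an ePST."""
    genome = genome.upper()
    epsts = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            protein = translate(seq, frame)
            hit = protein.find(peptide)
            while hit != -1:
                # Extend downstream to the first in-frame stop (or sequence end).
                end = protein.find("*", hit + len(peptide))
                if end == -1:
                    end = len(protein)
                # Nearest in-frame stop upstream of the peptide (-1 if none).
                upstream_stop = protein.rfind("*", 0, hit)
                # Fallback: the peptide start is the ePST start, no start codon.
                start, start_codon = hit, "-"
                for aa in range(upstream_stop + 1, hit):
                    codon = seq[frame + 3 * aa: frame + 3 * aa + 3]
                    if codon in START_CODONS:
                        start, start_codon = aa, codon
                        break
                epsts.append({
                    "strand": strand,
                    "frame": frame,
                    "aa_start": start,   # residue index within the frame
                    "aa_end": end,       # exclusive; index of the stop codon
                    "start_codon": start_codon,
                    "sequence": protein[start:end],
                })
                hit = protein.find(peptide, hit + 1)
    return epsts


if __name__ == "__main__":
    # Toy genome: stop, one codon, ATG, one codon, peptide GEL, one codon, stop.
    for epst in generate_epsts("CCTAAGCTATGAAAGGTGAACTGCCGTGAAT", "GEL"):
        print(epst)
```

Coordinates are reported as residue indices within the translated frame; converting them back to genomic coordinates, and merging several peptides that fall on the same ePST, are left out of this sketch. The "-" value for a missing start codon mirrors the StartCodon convention used in Table I.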

Figure 2. The process used to generate ePSTs from peptide sequences generated from tandem mass spectra.

C. Potential gene feature collection

Due to noise from a variety of sources, including the collection of biological samples, errors associated with mass spectrometry, sequencing errors, etc., many of the potential new genes discovered by proteogenomic mapping may be false positive identifications. The most common method used for validating potential novel protein-coding genes predicted from expressed protein evidence is real-time PCR. However, real-time PCR is an expensive, labor-intensive process and cannot be used for a large number of potential candidate genes. Therefore, there is a need to collect evidence of the coding potential of ePSTs and to develop machine learning methods for automatic evaluation. Biologists provided us with a list of potentially useful features for evaluating ePSTs. In this section, we describe the feature information recommended by biologists for ePST evaluation and how we derive this information. Table I lists the features we extract for each ePST and use to train our machine learning models. Note that some of these entries correspond to more than one "feature." For example, for each homology search, when there is a match we also find the length of the match and the percent identity. Values of 0 were used for both the length of the match and the percent identity if there is no match. The ePSTs with their features are then ready to be validated.

TABLE I. FEATURES USED FOR EVALUATING THE IDENTIFICATION OF POTENTIAL NOVEL GENES

Column Name              Description
ePST_length              Number of amino acids in the ePST sequence.
ePST_probability         Probability that the ePST is a true identification, based on
                         the probability of the peptides used to generate the ePST.
Number_peptides          Number of peptides matched to the ePST.
Coverage                 Percentage of amino acids in the ePST covered by peptides.
                         Multiple peptides matching a single ePST may overlap. Amino
                         acids matching several peptides are counted only one time.
StartCodon               The start codon for the ePST. A value of "-" means no start
                         codon.
NewStartCodon            The start codon suggested by RBSFinder.
RBS                      Pattern of the ribosomal binding site of the ePST generated
                         by RBSFinder.
CDD                      Yes if the ePST has a conserved domain identified by CDD.
Homology protein match   Yes if the ePST has a homologous match in the protein
                         database for a related organism, no otherwise. Additional
                         information: hit length, percent identity.
Homology DNA match       Yes if the ePST has a homologous match in the genome
                         database for a related organism, no otherwise. Additional
                         information: hit length, percent identity.

D. Evaluation of potential new genes

Supervised machine learning algorithms such as decision tree, support vector machine and neural network are used to train a model that classifies ePSTs as true or false novel genes. To train a model, a subset of ePSTs covering all observed feature combinations is collected, and the dataset is labeled by human experts as true or false novel genes.

III. RESULTS AND DISCUSSION

A. Experimental datasets

Our experiments were done on Mannheimia haemolytica. M. haemolytica has been the most commonly isolated species. Dr. Sarah Highlander at the Baylor College of Medicine directed genome sequencing of M. haemolytica. A list of 2434 predicted genes with annotated names and unique COG (clusters of orthologous groups) numbers is available at the Baylor College of Medicine M. haemolytica genome web page. Currently, Dr. Highlander has organized a multi-institutional effort to annotate the M. haemolytica genome to standardize its gene ontology and make the data more useful to the BRD.

B. Results and discussions

The SEQUEST [42] search algorithm is used for identifying peptide-spectra matches. M. haemolytica mass spectra are searched against a protein database provided by Dr. Highlander and a genome

database translated in six reading frames. Preprocessing of the evaluation dataset was required because our features include a combination of real numbers and categorical data. Numerical features were discretized based on the experts' suggestions. The feature "ePST length" was discretized into three bins: 0-50 aa, 51-100 aa, and > 100 aa, where aa is the number of amino acids. The feature ePST_probability was discretized into two bins: 0-70% and 71%-100%. The feature peptide_coverage was discretized into two bins: < 30% or ≥ 30%. The feature ePST matches is divided into two categories: S (single match) or M (multiple matches). In the results, 3812 peptides were identified by PepOut with p > 0.5, and 3496 ePSTs were generated from these 3812 peptides. Software was developed to select a training subset covering all combinations of feature values that occur in the dataset, where 10 examples are selected for each possible combination. This resulted in a training set consisting of 117 of the 3496 ePSTs generated. These 117 ePSTs were labeled as T (true gene) or F (false gene) by two experts in bacterial genomics (Dr. Bindu Nanduri and Dr. Mark Lawrence of the College of Veterinary Medicine). Of the 117 training samples, 47 are positives and 70 are negatives. Weka [7], a data mining toolkit, is used for the experiments. Table II shows the accuracy comparison of machine learning algorithms such as tree-based algorithms, support vector machine, rule-based algorithms, and neural network.

TABLE II. COMPARISON OF MACHINE LEARNING ALGORITHMS

Classifier                      TP Rate   FP Rate   ROC Area
Tree Classifier (ID3)           0.855     0.14      0.906
Tree Classifier (J48)           0.803     0.216     0.886
MultilayerPerceptron            0.838     0.158     0.933
SMO                             0.838     0.172     0.833
NaïveBayes                      0.863     0.12      0.95
Rule-Based Classifier (NNge)    0.889     0.11      0.89
IBk                             0.812     0.147     0.923

The high accuracy of these models indicates that the biologists are using a relatively simple and consistent set of rules to do the classification. Each of these methods has its strengths and weaknesses. Decision tree methods provide a clear visual picture of how the features are used to classify the ePSTs. We tested two types of function-based classifiers: back propagation neural networks and support vector machines. Although the neural network classifier had the highest accuracy of the classifiers tested, the resulting model is difficult to understand. The support vector machine resulted in lower accuracy than other classifiers. Rule-based classifiers such as NNge (Non-Nested Generalized Exemplars) find a set of rules that can be used for classification. However, the resulting rules can be quite complex and difficult to understand. Instance-based learning algorithms such as nearest neighbor find the training instance closest to the given test instance and predict the same class as this training instance. These algorithms may result in high accuracy but do not provide any information about the knowledge domain.

IV. CONCLUSIONS

Mass spectrometry provides an independent and complementary method for protein detection and allows one to confirm computationally annotated proteins. Here we developed the proteogenomic mapping pipeline to improve the structural annotation of genomes, correct and confirm genome annotations, and discover novel protein coding genes. Machine learning algorithms can be used to predict the confidence in identification of novel genes.

ACKNOWLEDGMENT

The project was supported by the National Research Initiative of the USDA Cooperative State Research, Education and Extension Service, grant number #2006-35600-17688. It was partially funded by a grant from the National Science Foundation (EPS-0556308-06040293).

REFERENCES

[1] J. D. Jaffe, H. C. Berg, and G. M. Church, "Proteogenomic mapping as a complementary method to perform genome annotation," Proteomics, vol. 4, no. 1, Jan 2004, pp. 59-77.
[2] A. V. Lukashin and M. Borodovsky, "GeneMark.hmm: new solutions for gene finding," Nucl. Acids Res., vol. 26, no. 4, February 15, 1998, pp. 1107-1115.
[3] A. L. Delcher, D. Harmon, S. Kasif, O. White, and S. L. Salzberg, "Improved microbial gene identification with GLIMMER," Nucleic Acids Res, vol. 27, no. 23, Dec 1, 1999, pp. 4636-41.
[4] D. E. Kalume, S. Peri, R. Reddy, J. Zhong, M. Okulate, N. Kumar, and A. Pandey, "Genome annotation of Anopheles gambiae using mass spectrometry-derived data," BMC Genomics, vol. 6, no. 128, 2005, p. 128.
[5] M. Mann and A. Pandey, "Use of mass spectrometry-derived data to annotate nucleotide and protein sequence databases," Trends Biochem Sci, vol. 26, no. 1, Jan 2001, pp. 54-61.
[6] N. Wang, C. Yuan, D. Wu, S. Burgess, and S. Bridges, "PepOut: Distance-based Outlier Detection Model for Improving MS/MS Peptide Identification Confidence," MCBIOS 2009, Oklahoma, OK, Feb 23-24.
[7] S. R. Garner, "Weka: The Waikato environment for knowledge analysis," in Proc. New Zealand Computer Science Research Students Conference, University of Waikato, Hamilton, New Zealand, 1995, pp. 57-64.
