Introduction
One of the major surprises that has resulted from the
analysis of large numbers of complete bacterial genomes
over the past decade is that the gene content is highly variable, even among sets of closely related genomes. For any
set of genomes analyzed, the core genome is the set of
genes found in all genomes (i.e. the intersection of the
gene sets on individual genomes) and the pangenome is
the set of genes found on at least one of the genomes (i.e.
the union of the sets of genes on individual genomes). For
example, Welch et al. (2002) compared three genomes of
Escherichia coli for which the mean number of genes was
4769. They found only 2996 genes in the core genome
and 7638 genes in the pangenome. This illustrates a general point that has now been observed with many groups of
bacteria: the core genome is always substantially smaller
than the mean genome size and the pangenome is always
substantially larger.
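The definitions above translate directly into set operations. The following is a minimal sketch, assuming each genome is represented as a set of gene family labels; the function name and the toy data are hypothetical and are not taken from the study.

```python
def core_and_pan(genomes):
    """Return (core, pan) for a list of genomes, each given as a set of gene families.

    core = intersection of all gene sets (families found in every genome)
    pan  = union of all gene sets (families found in at least one genome)
    """
    if not genomes:
        raise ValueError("need at least one genome")
    core = set.intersection(*genomes)
    pan = set.union(*genomes)
    return core, pan


# Hypothetical toy data: three genomes with gene families labelled by letter.
toy = [{"a", "b", "c", "d"}, {"a", "b", "d", "e"}, {"a", "b", "f"}]
core, pan = core_and_pan(toy)
mean_size = sum(len(g) for g in toy) / len(toy)
print(len(core), mean_size, len(pan))
```

In this toy case the core is smaller than the mean genome size and the pangenome is larger, which is the pattern described for the E. coli genomes.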
A more recent analysis of 17 E. coli genomes
(Rasko et al. 2008) finds that the core genome is reduced
to about 2200 genes. The number of genes in the core
genome, which we will call Gcore(n), depends on the number of genomes in the sample, n. In the E. coli example,
Gcore(n) continues to decrease slowly with n, which makes
it difficult to estimate the limiting size of the core genome
that would be reached if very large numbers of genomes
were included. The number of genes in the pangenome,
Gpan(n), is about 13000 for the 17 E. coli genomes and
Table 1. Comparison of properties of subsets of genomes of Bacilli. Abbreviations are as follows: ng, number of genomes in taxonomic group; Ngenes, average number of genes per genome; G0, average number of gene families per genome; Gcore, number of clusters of gene families in core genome; Gpan, number of clusters in pangenome; dprot, average number of amino acid substitutions per site expected since the most recent common ancestor of the taxa group.

Taxonomic group             ng    Ngenes   G0
Staphylococcus aureus       15    2668     2212
Streptococcus pyogenes      13    1853     1593
Streptococcus pneumoniae    14    2129     1787
Listeria spp.               9     2888     2330
Bacillus cereus group       20    5419     4274
Staphylococcus spp.         22    2620     2185
Streptococcus spp.          50    2000     1693
Bacillaceae 1               38    4684     3718
Lactobacillales             31    2152     1736
Bacilli                     172   2984     2417
we present only one set of clusters obtained with conservative clustering parameters, but similar results were obtained for more relaxed clustering parameters as well.
Genome Phylogeny
Gene families present in all genomes and having
exactly one gene per genome were used to build a
genome phylogeny. The translated amino acid sequences
of genes from 55 such single-copy clusters (Table S2)
were aligned using MUSCLE v3.7 (Edgar 2004) with parameter -diags. Each cluster was aligned independently; alignments were then concatenated and input into
PhyML v3.0.1 (Guindon et al. 2010) for phylogenetic
tree construction using default parameters. The multiple
sequence alignment and phylogenetic trees are archived
on TreeBASE (URL http://bit.ly/palfvN). The
complete tree is shown in Figure S1. From within this tree
we selected subsets of genomes for further analysis, as
shown in Fig. 1 and Fig. S1. The selected clades are characterized by relatively long internal branches separating
them from other parts of the tree and are strongly supported by the approximate likelihood ratio test (aLRT),
obtained from the program PhyML (Fig. S1). The clades
span multiple taxonomic levels, including species-specific
groups (Bacillus cereus group, Staphylococcus aureus,
Streptococcus pneumoniae, and Streptococcus pyogenes),
genus-specific groups (Staphylococcus spp., Listeria spp.,
and Streptococcus spp.), and higher-level groups: Family Bacillaceae 1 (Ludwig et al. 2009), Order Lactobacillales, and Class Bacilli. The definitions of taxonomic groups like species and genus are far from settled within the Bacteria, and the gene trees may not be
fully consistent with each other because of horizontal
gene transfer, so our intention is not to produce a fully
resolved tree for these genomes. However, the phylogenetic tree clearly shows that the selected groups are monophyletic and, except for the Lactobacillales, nearly ultrametric, making them particularly suitable for the evolutionary analyses presented here.
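The concatenation step described above (independent per-cluster alignments joined into one supermatrix) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each cluster alignment is already available as a dictionary mapping a genome identifier to its aligned sequence, and the function name and toy sequences are hypothetical.

```python
def concatenate_alignments(alignments):
    """Join per-cluster alignments into one concatenated alignment.

    alignments: list of dicts {genome_id: aligned_sequence}. Every dict must
    contain the same genome ids (single-copy clusters present in all genomes),
    and all sequences within one dict must have the same aligned length.
    """
    if not alignments:
        raise ValueError("need at least one alignment")
    genomes = sorted(alignments[0])
    parts = {g: [] for g in genomes}
    for aln in alignments:
        if set(aln) != set(genomes):
            raise ValueError("every cluster must contain exactly one gene per genome")
        if len({len(seq) for seq in aln.values()}) != 1:
            raise ValueError("sequences within a cluster must be aligned to equal length")
        for g in genomes:
            parts[g].append(aln[g])
    return {g: "".join(chunks) for g, chunks in parts.items()}


# Hypothetical toy input: two clusters, two genomes.
toy_alignments = [
    {"g1": "MK-", "g2": "MKL"},
    {"g1": "AA", "g2": "A-"},
]
supermatrix = concatenate_alignments(toy_alignments)
print(supermatrix)
```

The checks enforce the two conditions stated in the text: each family is present in all genomes with exactly one gene per genome, and each cluster is aligned before concatenation.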
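Table 1 (continued)

Taxonomic group             Ngenes/G0   Gcore   Gcore/G0   Gpan    Gpan/G0   dprot
Staphylococcus aureus       1.21        1532    0.69       5522    2.50      0.003
Streptococcus pyogenes      1.16        1043    0.65       4096    2.57      0.006
Streptococcus pneumoniae    1.19        1141    0.64       5509    3.08      0.008
Listeria spp.               1.24        1718    0.74       4430    1.90      0.057
Bacillus cereus group       1.27        1854    0.43       18801   4.40      0.068
Staphylococcus spp.         1.20        994     0.45       9473    4.34      0.330
Streptococcus spp.          1.18        487     0.29       16583   9.79      0.355
Bacillaceae 1               1.26        929     0.25       32260   8.68      0.426
Lactobacillales             1.24        356     0.21       16357   9.42      0.899
Bacilli                     1.23        143     0.06       97860   40.49     1.136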
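\[
G(k|n) = M \, \frac{n!}{k!\,(n-k)!} \, \frac{\Gamma(k+\alpha)\,\Gamma(n-k+\rho)\,\Gamma(\alpha+\rho)}{\Gamma(\alpha)\,\Gamma(\rho)\,\Gamma(n+\alpha+\rho)} \tag{4}
\]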
For numerical evaluation of this result, it is useful to note
that the ratio of gamma functions can be written as
\[
\frac{\Gamma(k+\alpha)}{\Gamma(\alpha)} = (k-1+\alpha) \cdots \alpha = \prod_{i=0}^{k-1} (i+\alpha) \tag{5}
\]
and equivalent formulae apply for the other ratios. The
number of clusters in the core genome is the number of
families present in every genome:
\[
G_{core}(n) = G(n|n) = M \, \frac{\Gamma(n+\alpha)\,\Gamma(\alpha+\rho)}{\Gamma(\alpha)\,\Gamma(n+\alpha+\rho)} \tag{6}
\]
G(0|n) is the number of possible types of families that are
not observed in any of the sampled genomes. The number
of clusters in the pangenome is the number of families that
are present in at least one of the sampled genomes:
\[
G_{pan}(n) = M - G(0|n) = M - M \, \frac{\Gamma(n+\rho)\,\Gamma(\alpha+\rho)}{\Gamma(\rho)\,\Gamma(n+\alpha+\rho)} \tag{7}
\]
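The FMG expressions can be evaluated numerically with log-gamma functions, which avoids overflow for large n. The sketch below assumes that Equation (4) has the beta-binomial form G(k|n) = M C(n,k) Γ(k+α)Γ(n−k+ρ)Γ(α+ρ) / [Γ(α)Γ(ρ)Γ(n+α+ρ)], where α and ρ denote the scaled gain and loss rates. The symbol names, function names, and parameter values are assumptions made for illustration; this is not the authors' R code.

```python
from math import comb, exp, lgamma


def fmg_spectrum(k, n, M, alpha, rho):
    """Expected number of gene families present in exactly k of n genomes (FMG)."""
    log_ratio = (
        lgamma(k + alpha) + lgamma(n - k + rho) + lgamma(alpha + rho)
        - lgamma(alpha) - lgamma(rho) - lgamma(n + alpha + rho)
    )
    return M * comb(n, k) * exp(log_ratio)


def fmg_core(n, M, alpha, rho):
    """Core genome size: families present in all n genomes, G(n|n)."""
    return fmg_spectrum(n, n, M, alpha, rho)


def fmg_pan(n, M, alpha, rho):
    """Pangenome size: M minus the families seen in none of the n genomes."""
    return M - fmg_spectrum(0, n, M, alpha, rho)


# Hypothetical parameter values, chosen only to exercise the functions.
M, alpha, rho, n = 10000.0, 0.05, 0.5, 10
total = sum(fmg_spectrum(k, n, M, alpha, rho) for k in range(n + 1))
print(fmg_core(n, M, alpha, rho), fmg_pan(n, M, alpha, rho), total)
```

Two sanity checks follow from the formulas: the spectrum summed over k = 0..n recovers M, and for n = 1 both the core and the pangenome reduce to Mα/(α+ρ).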
This completes the results for the FMG case. To obtain the IMG results, we note that the overall rate of insertion of all types of gene family is u = Ma. It is useful to
define the parameter θ = 2Ne u = 2Ne Ma = Mα. We take
the limit M → ∞ and a → 0 keeping u constant, or equivalently, α → 0 keeping θ constant. The results for the IMG
model depend on θ and ρ. These are the same parameters
used by Baumdicker et al. (2010). By taking the IMG limit
in Equations (2), (4), (6) and (7), we obtain
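\[
G_0 = \theta / \rho \tag{8}
\]
\[
G(k|n) = \frac{\theta}{k} \, \frac{n \cdots (n-k+1)}{(n-1+\rho) \cdots (n-k+\rho)} \tag{9}
\]
\[
G_{core}(n) = \theta \, \frac{(n-1)!}{(n-1+\rho) \cdots \rho} \tag{10}
\]
\begin{align}
G_{pan}(n) &= M - M \, \frac{(n-1+\rho) \cdots \rho}{(n-1+\alpha+\rho) \cdots (\alpha+\rho)} \notag \\
&= M - M \prod_{k=0}^{n-1} \frac{1}{1 + \alpha/(k+\rho)} \notag \\
&\approx M - M \left( 1 - \alpha \sum_{k=0}^{n-1} \frac{1}{k+\rho} \right) \notag \\
&= \theta \sum_{k=0}^{n-1} \frac{1}{k+\rho} \tag{11}
\end{align}
The third line keeps only terms of first order in α, which becomes exact in the limit α → 0 at fixed θ = Mα.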
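The IMG results are simple enough to evaluate with elementary arithmetic. The sketch below assumes the forms G(k|n) = (θ/k) · n⋯(n−k+1) / [(n−1+ρ)⋯(n−k+ρ)], Gcore(n) = G(n|n), and Gpan(n) = θ Σ_{k=0}^{n−1} 1/(k+ρ), with θ and ρ the scaled gain and loss rates. Function names and parameter values are hypothetical, chosen only for illustration.

```python
def img_spectrum(k, n, theta, rho):
    """Expected number of families present in exactly k of n genomes (IMG, coalescent)."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    value = theta / k
    for j in range(k):
        # Builds n(n-1)...(n-k+1) / [(n-1+rho)...(n-k+rho)] one factor at a time.
        value *= (n - j) / (n - 1 - j + rho)
    return value


def img_core(n, theta, rho):
    """Core genome size, G(n|n)."""
    return img_spectrum(n, n, theta, rho)


def img_pan(n, theta, rho):
    """Pangenome size, theta * sum_{k=0}^{n-1} 1/(k + rho)."""
    return theta * sum(1.0 / (k + rho) for k in range(n))


# Hypothetical parameter values.
theta, rho, n = 200.0, 0.1, 12
spectrum = [img_spectrum(k, n, theta, rho) for k in range(1, n + 1)]
mean_genome_size = sum(k * g for k, g in zip(range(1, n + 1), spectrum)) / n
print(img_core(n, theta, rho), img_pan(n, theta, rho), mean_genome_size)
```

Under these assumed forms the spectrum sums to the pangenome size, and the mean number of families per genome recovered from the spectrum equals θ/ρ, which gives a quick consistency check on any implementation.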
(18)
(19)
(1 e )n + (1 e ) (21)
(1 e )
Model Fitting
Fitting to the core and pangenome sizes
When calculating Gcore(n) and Gpan(n) for the real
data, the result depends on the order in which the genomes are
considered. For each of the data sets in Table 1, we considered many random permutations of the ng genomes and
calculated the core and pangenome curves for n increasing from 1 to ng. If the results from the many permutations are plotted on one graph, this generates a spread
of points, shown as the shaded area in Figures 3 and 4.
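The permutation procedure can be sketched as follows. This is an illustrative implementation under the assumption that each genome is a set of gene family labels; the function names, the number of permutations, and the toy data are hypothetical and do not reproduce the authors' R functions.

```python
import random


def permutation_curves(genomes, n_perm=1000, seed=0):
    """Core and pangenome curves for random orderings of the genomes.

    Returns two lists of curves; curve[i] is the size after adding i+1 genomes.
    """
    rng = random.Random(seed)
    core_curves, pan_curves = [], []
    for _ in range(n_perm):
        order = list(genomes)
        rng.shuffle(order)
        core, pan = set(order[0]), set(order[0])
        core_sizes, pan_sizes = [], []
        for genome in order:
            core &= genome  # families present in every genome so far
            pan |= genome   # families present in at least one genome so far
            core_sizes.append(len(core))
            pan_sizes.append(len(pan))
        core_curves.append(core_sizes)
        pan_curves.append(pan_sizes)
    return core_curves, pan_curves


def mean_curve(curves):
    """Average the curves position by position."""
    return [sum(column) / len(column) for column in zip(*curves)]


# Hypothetical toy data.
toy = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "e", "f"}]
core_curves, pan_curves = permutation_curves(toy, n_perm=50, seed=1)
print(mean_curve(core_curves), mean_curve(pan_curves))
```

Every permutation ends at the same core and pangenome size for n = ng; only the intermediate points differ, which is what produces the spread of points.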
The mean values for Gcore(n) and Gpan(n) can be obtained
by averaging over the results for the different permutations. However, a more direct way to obtain these mean
functions is to use the gene family frequency spectrum,
G(k|ng), which is a function of the full data set and does
Gmean
core (n) =
(23)
k=n
(24)
Gmean
pan (n) =
(25)
k=1
core (n) +
2ng n=1
!1/2
ng
2
mean
(26)
G perm
pan (n) G pan (n)
n=1
where the superscript theory denotes any one of the theoretical models described above. For each model, we imposed a constraint on the parameters such that the mean
                                  RMS(theory)/RMS(data)
                      RMS         Star Tree       Coalescent
Taxonomic Group       (data)      D+E     2D      D+E     2D
Staph. aureus         126         0.27    0.15    1.03    0.06
Strep. pyogenes       109         0.23    0.21    0.74    0.03
Strep. pneumoniae     105         0.28    0.25    0.93    0.05
Listeria spp.         140         0.12    0.11    0.38    0.03
B. cereus group       493         0.36    0.29    1.21    0.08
Staph. spp.           335         0.26    0.23    0.99    0.03
Strept. spp.          333         1.72    1.72    1.18    0.21
Bacillaceae 1         703         1.26    1.26    1.30    0.20
Lactobacillales       387         1.28    1.28    0.72    0.11
Bacilli               1972        1.81    1.81    0.63    0.10
Table 3. Fitted parameters for the best fitting model, having two classes of dispensable gene families (2D) on a coalescent tree. The columns fslow and ffast are the fractions of gene families in each rate category. Extrapolations to large numbers of genomes were performed to predict the number of new gene families (Gnew(k)) and core clusters (Gcore(k)) detected when k genomes are sequenced. 1% Gnew is the predicted number of additional genome sequences required before the number of new families falls to less than 1% of the mean genome size.

Taxonomic group             θ1     ρ1      θ2       ρ2      fslow   ffast   Gnew(100)   1% Gnew   Gcore(100)   Gcore(1000)
Staphylococcus aureus       161    0.080   53326    260.2   0.91    0.09    150         2159      1331         1106
Streptococcus pyogenes      135    0.097   9507     48.4    0.88    0.12    66          558       852          681
Streptococcus pneumoniae    131    0.088   12547    42.7    0.84    0.16    90          668       950          776
Listeria spp.               155    0.074   6861     28.6    0.90    0.10    55          274       1430         1205
Bacillus cereus group       643    0.179   146941   215.5   0.84    0.16    474         3240      1456         963
Staphylococcus spp.         335    0.178   160825   538.3   0.86    0.14    256         6840      769          510
Streptococcus spp.          297    0.234   16950    39.7    0.75    0.25    125         981       392          228
Bacillaceae 1               925    0.328   45461    50.5    0.76    0.24    313         1202      556          261
Lactobacillales             314    0.304   16475    23.4    0.60    0.40    138         945       229          114
Bacilli                     3707   1.987   226245   406.4   0.77    0.23    484         9094      0            0
gent groups, the core is much smaller than Gslow , indicating that there is sufficient time for deletion of the slow
families as well (Fig. 5). The predicted size of the core
genome continues to slowly decrease well beyond the
number of genomes currently available in each data set
(Table 3). In the 2D models there are no essential families, so the core will eventually decrease to zero. In some
of the most diverse clades, predictions of the core genome
size for large numbers of genomes are improved if a class
of essential families is also added (i.e. 2D+E). However,
we did not feel that inclusion of this extra class was well
justified in terms of the improvement of the fit to core
and pangenome data for the available number of genomes;
therefore these results are not shown. Unfortunately, it will
always be difficult to make accurate extrapolations far beyond the range of the available data.
Fitting to the gene family frequency spectrum
We also wished to test whether the theoretical models
provide a good prediction of the shape of the gene family
frequency spectrum, G(k|ng). For this, we chose the parameters by minimizing the χ² function:
\[
\chi^2 = \sum_{k=1}^{n_g} \frac{\left( G^{theory}(k|n_g) - G^{data}(k|n_g) \right)^2}{G^{theory}(k|n_g)} \tag{28}
\]
This function is weighted such that the theory has to fit the
data across the full range of k, whereas if the unweighted
least squares function is used, the model tends to fit the
high values at the extremes and ignore the low values at
intermediate k.
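The effect of the weighting can be seen in a small numerical sketch. It assumes the weighted function is the sum over k of (Gtheory − Gdata)² / Gtheory; the function names and the three-point toy spectrum are hypothetical.

```python
def chi_square(theory, data):
    """Weighted misfit: each squared residual is divided by the theoretical value."""
    if len(theory) != len(data):
        raise ValueError("spectra must have the same length")
    return sum((t - d) ** 2 / t for t, d in zip(theory, data))


def sum_of_squares(theory, data):
    """Unweighted least-squares misfit, for comparison."""
    if len(theory) != len(data):
        raise ValueError("spectra must have the same length")
    return sum((t - d) ** 2 for t, d in zip(theory, data))


# Hypothetical U-shaped spectrum: large values at the extremes, small in the middle.
theory = [1000.0, 10.0, 1000.0]
data = [1050.0, 20.0, 1000.0]
weighted = chi_square(theory, data)        # 2.5 from k=1, 10.0 from k=2
unweighted = sum_of_squares(theory, data)  # 2500 from k=1, 100 from k=2
print(weighted, unweighted)
```

In the weighted function the intermediate point contributes most of the misfit, whereas in the unweighted sum the same point is nearly invisible next to the residual at the extreme. This is why the weighted form forces a fit across the full range of k.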
When fit to the gene family frequency spectrum for
the B. cereus group, the 2D model on the coalescent returns a good prediction, whereas the D+E model on the
coalescent does not adequately explain the shape of this
curve (Fig. 6a). Fig. 6b shows the predicted core and
pangenome curves using the 2D model with parameters
obtained from fitting the frequency spectrum in Fig. 6a.
This shows that the 2D model on the coalescent fits all
these aspects of the data simultaneously.
The B. cereus example is typical of the other data sets
G(k|n) =
(29)
a=1
and for k = 0,
\[
p_c(0) = \left[ l(t_a) + r(t_a)\,p_a(0) \right] \left[ l(t_b) + r(t_b)\,p_b(0) \right] \tag{32}
\]
Once these probabilities have been calculated for every
node, Equation (29) gives the gene family frequency spectrum. If there is more than one category of dispensable
family, then the spectrum is the sum of the spectra for the
two classes. If there is a class of essential families, then a
constant Gess is added to G(n|n).
Using the fixed tree for the Bacillaceae 1 in Fig. 1,
we calculated G(k|ng) for the 2D+E model. The parameters were chosen by optimizing the χ² function in Equation (28). The model predicts the most important peaks
and dips in the gene family frequency spectrum in a way
that cannot be done with the coalescent model. Having obtained G(k|ng), it is possible to obtain the prediction for
the core and pangenome curves on the same tree by using Equations (22)–(25) in the same way as is done for
the observed spectrum. For the Bacillaceae 1 example,
the resulting theoretical curves from the fixed tree and the
coalescent are almost indistinguishable from one another
(Fig. 7b). Moving from the gene family frequency spectrum to the core and pangenome curves smooths out some
of the irregularities that arise on the fixed tree. Thus the
spectrum is more sensitive to the shape of the tree than are
the core and pangenome curves.
Software
Many of the formulae described herein are available
as functions written in the R programming language, available at http://reric.org/work/pangenome/.
A helper script provides example usage and plotting for
each of 7 functions using an example dataset. Users can
compute the gene family frequency spectrum, G(k), from
a matrix of gene clusters, and calculate G(k) using the 1D,
1D+E, 2D, and 2D+E IMG models on a coalescent. The
mean core genome (Gcore ) and pangenome (G pan ) curves
can be calculated from G(k), and users can compute permutations of these curves from the data. Additional functions allow users to calculate Gcore and G pan using the 1D,
1D+E, 2D, and 2D+E models on either a coalescent or a
star tree.
Discussion and Conclusions
The original version of the IMG introduced by
Baumdicker et al. (2010) uses a single class of dispensable
gene families plus a class of essential families, and corresponds to our model D+E on the coalescent tree. These
authors gave an example where the gene family frequency
spectrum from nine Prochlorococcus genomes was well
fitted with this model. In our data sets, however, we always found that model 2D, with two dispensable classes,
gave a much better fit than the D+E model. We note that
the essential class is a limiting case of a dispensable class
with both u and v very small. This means that the fit must
be at least as good with the 2D model as the D+E. The
2D model has an additional free parameter, but addition
of this parameter seems well justified because the ratio
RMS(theory)/RMS(data) is very much smaller (Table 2)
and because the shape of the fitting curves is clearly much
better (Figures 3, 4 & 6). Two classes of families seem
adequate for fitting the core and pangenome curves, and
also most of the gene family frequency spectra.
Not all of the clades were well fit by the 2D model,
however. For the gene family frequency spectrum of
the Bacillaceae 1 (Fig. 7), there was a noticeable improvement when a third class was added (model 2D+E).
This mathematical model corresponds to the conceptual
model described by Koonin and Wolf (2008) that includes a core of essential gene families, a shell of conserved families, and a cloud of dispensable families. The
Bacillaceae 1 is one of the most diverse data sets and
one in which there is noticeable structure in the spectrum arising from the unusual shape of the tree. The frequency spectrum contains more information than the core
and pangenome curves because it is possible to calculate
those curves from the spectrum, but not vice versa. For
this reason, more complex models may be justified for fitting the spectrum than for fitting the core and pangenomes.
For example, the frequency spectrum calculated on
the fixed tree is noticeably different from that calculated on the coalescent, although both give almost the same prediction for
the core and pangenomes. However, the original cluster
data contains a lot of information that is lost in calculating the gene family frequency spectrum. For example, in
Figures
FIG. 1. Phylogenetic tree of 172 Bacilli having complete genome sequences, annotated with the taxonomic groups used in this study. Tree
topology and branch lengths were determined by maximum likelihood analysis (using PhyML) of concatenated alignments of amino acid sequences
from 55 single-copy core genes (Table S2). The fully annotated tree is given in Fig. S1. Taxonomic groupings demarcated with stars (★) were 100%
supported by the approximate likelihood ratio test (as calculated in PhyML) and define the groups used in this study. Scale bar indicates expected
number of amino acid substitutions per site.
FIG. 2. Schematic representation of the shapes of (a) a coalescent tree, and (b) a star tree.
FIG. 3. Fitting four evolutionary models to the core and pangenome curves for 14 sequenced Streptococcus pneumoniae genomes. Models were
based on 2 different tree shapes: the star tree (blue lines) or the coalescent tree (red lines) and either 1 dispensable and 1 essential class of gene
family (D+E; dashed lines) or 2 dispensable classes (2D; solid lines). The lines for the star tree pangenome overlap. The shaded region demarcates the
maximum range observed during 10,000 permutations of the data.
FIG. 4. Fitting four evolutionary models to the core and pangenome curves for 38 sequenced Bacillaceae 1 genomes. Models were based on
2 different tree shapes: the star tree (blue lines) or the coalescent tree (red lines) and either 1 dispensable and 1 essential class of gene family (D+E;
dashed lines) or 2 dispensable classes (2D; solid lines). The lines for the star tree overlap. The shaded region demarcates the maximum range observed
during 10,000 permutations of the data.
FIG. 5. The variation in the number of gene families in the slow category (Gslow; blue bar) and the fast category (Gfast; red bar), and the total
number of gene families per genome (G0 = Gslow + Gfast) in each of the clades. The symbol shows the observed size of the core genome.
FIG. 6. Fitting the gene family frequency spectrum, G(k), for 20 sequenced genomes of the Bacillus cereus group (circles). Models were based
on the coalescent tree and either 1 dispensable and 1 essential class of gene family (D+E; dashed lines) or 2 dispensable classes (2D; solid lines). The
2D model gives a reasonable fit, whereas the D+E model does not. The lower figure shows the predictions of the 2D model for the core and pangenome
curves with the parameter values estimated from fitting the gene frequency spectrum. The shaded region demarcates the maximum range observed
during 10,000 permutations of the data.
FIG. 7. Fitting the gene family frequency spectrum, G(k), for 38 sequenced Bacillaceae 1 genomes (circles). Models were based on 2 different
tree shapes: the coalescent tree (red lines) or the inferred phylogenetic tree (green lines; Fig. S1), using 2 dispensable and 1 essential class of gene
family (2D+E). The 2D+E model on the coalescent gives an adequate smooth fit through the data. The same model calculated on the fixed tree of
Bacillaceae 1 predicts some of the tree-dependent structure in the data. The lower graph shows the (overlapping) predictions of both these models
for the core and pangenomes, using the parameters estimated from fitting the gene family frequency spectrum. The shaded region demarcates the
maximum range observed during 10,000 permutations of the data.
FIG. 8. Schematic to explain the algorithm for calculation of the gene frequency spectrum on a fixed tree.
Supplementary Table 2: Single-copy core genes used for phylogenetic analysis. Numbers in brackets indicate the
frequency (out of 172 genomes) with which the given annotation was observed.
Supplementary Figure 1: Phylogenetic tree of 172 Bacilli having complete genome sequences. Tree topology and
branch lengths were determined by maximum likelihood analysis of concatenated alignments of amino acid sequences
from 55 single-copy core genes (Table S2). Taxonomic groupings are indicated on the tree by stars and by vertical lines
adjacent to taxon names. Each grouping was monophyletic and highly supported by the approximate likelihood ratio test
(aLRT, as calculated in PhyML); values of aLRT for each taxonomic group are shown adjacent to the stars. The alignment
and tree files are archived on TreeBASE (Sanderson et al., 1994; URL http://bit.ly/qD1Ch2).