Testing The Infinitely Many Genes Model For The Evolution of The Bacterial Core Genome and Pangenome

R. Eric Collins and Paul G. Higgs
Origins Institute and Dept. of Physics and Astronomy, McMaster University, Hamilton, Ontario L8S 4M1, Canada
Any pair of closely related bacteria tends to share most, but not all, of the same genes. Usually, the number of core
genes decreases slowly as new genomes are observed, while the size of the pangenome (the set of genes found on at
least one of the genomes) increases because each genome contains new genes not observed in the others. We analyze 172
complete genomes of Bacilli to compare the properties of the pangenomes and core genomes of monophyletic subsets
of this group, and to assess the capacities of evolutionary models to predict these properties. When subdivided into
clades of different levels of diversity, including examples at the level of species, genus and higher taxonomic groups,
more phylogenetically diverse groups are found to have smaller core genomes and larger pangenomes relative to the
mean genome size for the group. The infinitely many genes (IMG) model is a population genetics model that allows
prediction of the sizes of the core and pangenomes. It is based on the assumption that each new gene can arise only
once, as would be the case for de novo gene invention or horizontal gene transfer from a highly diverse external gene
pool. We consider variants of this model that allow for one or more classes of dispensable genes with different insertion
and deletion rates, and for the possibility of a class of essential genes that must be present in all species. We show
that the predictions of the model depend on the shape of the evolutionary tree that underlies the divergence of the
genomes, and we calculate results for coalescent trees, star trees, and arbitrary phylogenetic trees of predefined fixed
branch length. In general, we show that the IMG model is useful for comparison with experimental genome data both
for species and for widely divergent groups. Software implementing many of the formulae described herein is provided
at http://reric.org/work/pangenome/.
Introduction
One of the major surprises that has resulted from the
analysis of large numbers of complete bacterial genomes
over the past decade is that the gene content is highly variable, even among sets of closely related genomes. For any
set of genomes analyzed, the core genome is the set of
genes found in all genomes (i.e. the intersection of the
gene sets on individual genomes) and the pangenome is
the set of genes found on at least one of the genomes (i.e.
the union of the sets of genes on individual genomes). For
example, Welch et al. (2002) compared three genomes of
Escherichia coli for which the mean number of genes was
4769. They found only 2996 genes in the core genome
and 7638 genes in the pangenome. This illustrates a general point that has now been observed with many groups of
bacteria: the core genome is always substantially smaller
than the mean genome size and the pangenome is always
substantially larger.

Key words: bacteria, evolution, genome, pangenome
Email: higgsp@mcmaster.ca
Phone: (905) 525-9140 x26870
Fax: (905) 546-1252
Submitted as a Research Article to Mol. Biol. Evol.
Preprint generated October 5, 2011
URL: http://reric.org/papers/pangenome.mbe.pdf
© the authors 2011

A more recent analysis of 17 E. coli genomes
(Rasko et al. 2008) finds that the core genome is reduced
to about 2200 genes. The number of genes in the core
genome, which we will call Gcore (n), depends on the number of genomes in the sample, n. In the E. coli example,
Gcore (n) continues to decrease slowly with n, which makes
it difficult to estimate the limiting size of the core genome
that would be reached if very large numbers of genomes
were included. The number of genes in the pangenome,
G pan (n), is about 13000 for the 17 E. coli genomes and
increases rapidly as new genomes are added. If G pan (n)
continues to increase with n for large n, the pangenome is
said to be open, whereas if G pan (n) tends to a maximum
limit for large n, the pangenome is said to be closed. The
E. coli pangenome appears to be open, or at least, G pan (n)
is still far from reaching a limit with the available number
of genomes. Open pangenomes have also been observed in
Streptococcus agalactiae (Tettelin et al. 2005), Prochlorococcus (Kettler et al. 2007), a combination of E. coli and
Shigella (Touchon et al. 2009), Listeria (den Bakker et al.
2010), and in a study taking representative genomes from
across the full range of bacteria (Lapierre and Gogarten,
2009). A closed pangenome was found in Clostridium difficile (Scaria et al. 2010), where it was estimated that a
sample of 26 genomes would be sufficient to capture the
entire pangenome. However, this result may depend on the
way the extrapolation to large n was done, and may not
indicate a qualitative difference between C. difficile and
other bacteria.
A quantity closely related to the core and pangenome
curves is the gene frequency spectrum G(k|n), which is
the number of genes found in k genomes after analysis of
n genomes. If G(k|ng ) is known for some current number,
ng , of available genomes, then it is possible to calculate
Gcore (n) and G pan (n) for all n ≤ ng . Typically G(k|n) has
a U-shaped distribution, with many genes found in only
1 or a small number of species, a substantial number of
core genes found in all (or nearly all) species, and rather
few genes at intermediate values of k. Examples of G(k|n)
are given in the papers cited in the previous paragraph,
and in our previous work (Collins et al. 2011). Another
quantity related to the pangenome size is the number of
new genes that are found for the first time in the nth
genome sequenced, i.e. Gnew (n) = G pan (n) − G pan (n − 1).
It was found by Tettelin et al. (2005) that both Gcore (n)
and Gnew (n) could be fitted quite well by simple functions
that decreased exponentially with n. However, no theoretical
model was given to explain why this form of fitting
function should apply. It is the main objective of our current paper to consider ways of predicting the shape of the
core and pangenome curves and the gene frequency spectrum beginning from explicit evolutionary models.
An approach that we consider to be very promising is the infinitely many genes (IMG) model, introduced
by Baumdicker et al. (2010). The name IMG is a reference to the infinitely many alleles and infinitely many
sites models, which are well known in population genetics
(Hein et al. 2005). In the infinitely many alleles model, it
is assumed that an infinite number of possible alleles could
exist at a locus; hence, each new mutation gives rise to a
new allele that did not exist previously. In the infinitely
many sites model, it is assumed that there are an infinite
number of possible sites at which a mutation could occur; hence, each new mutation occurs at a different site.
Similarly, in the IMG model, it is assumed that an infinite
number of possible genes might arise in a genome; hence,
each new gene that arises is different from all previous
genes. Baumdicker et al. (2010) consider a population of
genomes evolving according to the neutral Wright-Fisher
model. New genes arise in each genome at a constant rate.
Genes spread to new genomes when the lineages branch
as time goes forward in the Wright-Fisher model. It is
also supposed that each gene currently in a genome can
be deleted at constant rate. The model allows the functions Gcore (n), G pan (n) and G(k|n) to be calculated in an
elegant way as a function of the evolutionary parameters
(insertion and deletion rates and effective population size).
The model has not yet been tested on many data sets. Our
intention here is to test the IMG model on a variety of
sets of genomes, and to compare the results with the predictions of several alternative evolutionary models that we
derive here.
The IMG model touches on fundamental questions
about the mechanisms of bacterial genome evolution,
including the processes by which new genes arise in
genomes and the possibility of exchange of genes between
genomes by horizontal gene transfer (HGT). The fundamental assumption of the IMG model is that a given gene
can be introduced into a population of genomes only once.
There are several reasons to think that this might often be
true. The origin of a new open reading frame from a noncoding region is presumed to be rare, but if it occurs, it will
lead to a new sequence that is unlikely to have similarity
to previous genes; hence it will fit the IMG assumption. A
new gene might also be created by modification of an existing gene by a rapid burst of substitutions, by a substantial insertion or deletion, or by shuffling of domains within
the gene. If the gene becomes sufficiently divergent from
the other related sequences, it will no longer be classed as
belonging to the same gene family by the similarity criteria used for clustering. Thus a new gene will have arisen
within a single lineage. We presume that this kind of gene
origin is fairly frequent, and it is also likely to create a
unique sequence that has not been seen before. Alternatively, the new gene could arise by HGT of a gene from
outside the group of genomes being studied into one of
the genomes in the group. If the diversity of genes in the
environment is sufficiently large, it will be unlikely that
the same type of gene is inserted more than once into the
group being studied, and this will again fit the assumptions
of the IMG model. On the other hand, if the pool of genes
available for HGT contains a smaller number of genes of
high frequency, it is possible that a gene of the same type
could get inserted more than once into the set of genomes
under study. This should show up in systematic deviations
from the predictions of the IMG model.
The IMG model also assumes that the presence or absence
of a gene does not affect the fitness of the organism. The
genealogy of the genomes in the sample therefore follows the standard coalescent model for neutral evolution
(Hein et al. 2005). The IMG model does allow for the
presence of a class of essential genes that must be present
in every genome, but these essential genes do not affect
the evolution of the non-essential genes or the shape of
the coalescent tree. It is unlikely that the fairly restrictive
assumptions of the model are exactly true. Nevertheless,
we view the IMG model as a useful null model of neutral
genome evolution. Building more complex models from
this starting point should allow identification of any important additional factors. In fact, as we shall see below,
the IMG model and the variants we consider here stand up
surprisingly well to analysis of a large number of data sets
of bacterial genomes.
Genomics
Gene Clustering
Complete nucleic acid and translated proteome sequences were obtained from the NCBI Genomes database
for all 172 Bacteria from the Class Bacilli available in
April 2011 (Table S1). Only complete genome sequences
were used; no incomplete or draft genomes were included. Genes encoded on plasmids associated with each
genome sequence were also included, and were treated as
part of the genome. BLAST databases were created for
each genome with all amino acid sequences encoded by
each genome, and all-against-all searches were performed
using blastp (v. 2.221). For each pair of genomes, including self pairs, each peptide sequence was used as
a query against every peptide sequence in the paired
genome, and the process was repeated in the reverse direction. A direct link was counted between two genes if either
BLAST E-value was less than 1e-30 for searches in both
directions, and if the length of the locally aligned region
found by BLAST was longer than 70% of the length of the
longer of the two sequences. Sequences were grouped into
clusters using the single-link cluster procedure, i.e. sequences are part of the same cluster if there is a direct link
between them or if there is a chain of direct links that connects them. These clustering methods were implemented
with custom Perl scripts that are available from the authors. We have previously used these clustering methods
for analysis of other sets of genomes (Collins et al. 2011)
and we have considered the effects of changing the cutoff
E-value and the minimum length criterion. In this paper
we present only one set of clusters obtained with conservative clustering parameters, but similar results were obtained for more relaxed clustering parameters as well.
Genome Phylogeny
Gene families present in all genomes and having
exactly one gene per genome were used to build a
genome phylogeny. The translated amino acid sequences
of genes from 55 such single-copy clusters (Table S2)
were aligned using MUSCLE v3.7 (Edgar 2004) with parameter -diags. Each cluster was aligned independently; alignments were then concatenated and input into
PhyML v3.0.1 (Guindon et al. 2010) for phylogenetic
tree construction using default parameters. The multiple
sequence alignment and phylogenetic trees are archived
on TreeBASE (URL http://bit.ly/palfvN). The
complete tree is shown in Figure S1. From within this tree
we selected subsets of genomes for further analysis, as
shown in Fig. 1 and Fig. S1. The selected clades are characterized by relatively long internal branches separating
them from other parts of the tree and are strongly supported by the approximate likelihood ratio test (aLRT),
obtained from the program PhyML (Fig. S1). The clades
span multiple taxonomic levels, including species-specific
groups (Bacillus cereus group, Staphylococcus aureus,
Streptococcus pneumoniae, and Streptococcus pyogenes),
genus-specific groups (Staphylococcus spp., Listeria spp.,
and Streptococcus spp.), and higher-level groups: Family Bacillaceae 1 (Ludwig et al. 2009), Order Lactobacillales, and Class Bacilli. The definitions of taxonomic groups like species and genus are far from settled within the Bacteria, and the gene trees may not be
fully consistent with each other because of horizontal
gene transfer, so our intention is not to produce a fully
resolved tree for these genomes. However, the phylogenetic tree clearly shows that the selected groups are monophyletic and, except for the Lactobacillales, nearly ultrametric, making them particularly suitable for the evolutionary analyses presented here.

Table 1 Comparison of properties of subsets of genomes of Bacilli. Abbreviations are as follows: ng , number of genomes in taxonomic group; Ngenes , average number of genes per genome; G0 , average number of gene families per genome; Gcore , number of clusters of gene families in core genome; G pan , number of clusters in pangenome; d prot , average number of amino acid substitutions per site expected since the most recent common ancestor of the taxonomic group.

Taxonomic group             ng  Ngenes    G0  Ngenes/G0  Gcore  Gcore/G0  G pan  G pan/G0  d prot
Staphylococcus aureus       15    2668  2212       1.21   1532      0.69   5522      2.50   0.003
Streptococcus pyogenes      13    1853  1593       1.16   1043      0.65   4096      2.57   0.006
Streptococcus pneumoniae    14    2129  1787       1.19   1141      0.64   5509      3.08   0.008
Listeria spp.                9    2888  2330       1.24   1718      0.74   4430      1.90   0.057
Bacillus cereus group       20    5419  4274       1.27   1854      0.43  18801      4.40   0.068
Staphylococcus spp.         22    2620  2185       1.20    994      0.45   9473      4.34   0.330
Streptococcus spp.          50    2000  1693       1.18    487      0.29  16583      9.79   0.355
Bacillaceae 1               38    4684  3718       1.26    929      0.25  32260      8.68   0.426
Lactobacillales             31    2152  1736       1.24    356      0.21  16357      9.42   0.899
Bacilli                    172    2984  2417       1.23    143      0.06  97860     40.49   1.136

Core genome and pangenome properties

Gene clusters created by the above method may contain more than one gene from the same genome in some
cases. We call a group of paralogous genes in the same
genome a gene family. We deal with gene families, not
individual genes, in the following data analysis. Additionally, the IMG model accounts for gain and loss of
gene families but does not account for duplication within
a genome; therefore, it also makes sense to work at the
family level from the point of view of model comparison.
Table 1 shows some useful statistics with which to compare the different subsets of genomes. In this table, ng is
the number of genomes in each subset, and Ngenes is the
mean number of genes for genomes in the set. G0 is the
mean number of gene families in a genome. The mean
number of genes per family is Ngenes /G0 , which is close
to 1.2 in every case. Because this ratio is only slightly larger
than 1, most genes are in single-gene
families. We have previously considered the distribution
of sizes of families of paralogues in individual genomes
(Collins et al. 2011). There are a substantial number of
paralogues in most genomes, and there are a small number
of gene families with large numbers of genes. However,
the majority of genes in any one genome are in single-gene families, and for the purpose of measuring the core
and pangenome sizes, we do not lose much information
by working at the family level.
The numbers of gene clusters in the core and
pangenomes for each data set are shown in Table 1. It
is useful to measure these as ratios relative to the mean
number of gene families per genome. The core ratio,
Gcore /G0 , is in the range 0.6 to 0.7 for the three species-level
sets (Staphylococcus aureus, Streptococcus pyogenes and
Streptococcus pneumoniae), i.e. it is substantially less
than one, even for genomes nominally in the same species.
For the full set of Bacilli, the core ratio is only 0.06, indicating that this group is quite diverse with relatively few
genes conserved across the whole group. The pangenome
ratio, G pan /G0 , is in the range 2 to 3 for the species-level
data sets and increases to more than 40 for the full set of
Bacilli. The final column shows d prot , the mean phylogenetic distance on the protein evolution tree (Figs. 1 and
S1) from the common ancestor of the group to the tips
of the branches for each genome, measured in units of
amino acid substitutions per site. This phylogenetic distance is a measure of the amount of protein sequence evolution within each group, and should be an indicator of the
time since divergence of the group. The datasets are listed
in order of increasing d prot . It can be seen that G pan /G0
increases and Gcore /G0 decreases as d prot increases, as
we would expect if the set of genomes becomes more diverse as the time since the common ancestor increases.
The Listeria data set stands out in this table, as it has a
higher Gcore /G0 and a lower G pan /G0 than we would expect, given the observed amount of protein sequence evolution. This probably indicates that the Listeria group is
less prone to insertion and deletion of genes than most
other bacteria, although it could also be interpreted as
faster protein sequence evolution within this genus. Den
Bakker et al. (2010) also commented that although the
pangenome of Listeria is not closed, there seems to be a
very limited on-going introduction of new genetic material
from external gene pools.
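The quantities compared in Table 1 follow directly from the sets of gene-family clusters found in each genome. The following is a minimal Python sketch of that bookkeeping, not the authors' software; the function name `core_pan_summary` and the toy genomes are hypothetical.

```python
def core_pan_summary(genomes):
    """Summarize core and pangenome sizes.

    genomes maps a genome name to the set of gene-family cluster IDs it contains.
    """
    family_sets = list(genomes.values())
    core = set.intersection(*family_sets)  # families present in every genome
    pan = set.union(*family_sets)          # families present in at least one genome
    g0 = sum(len(s) for s in family_sets) / len(family_sets)  # mean families per genome
    return {
        "G0": g0,
        "Gcore": len(core),
        "Gpan": len(pan),
        "core_ratio": len(core) / g0,
        "pan_ratio": len(pan) / g0,
    }


# Hypothetical toy data: three genomes, cluster IDs are arbitrary integers.
toy = {"g1": {1, 2, 3, 4}, "g2": {1, 2, 3, 5}, "g3": {1, 2, 6, 7}}
summary = core_pan_summary(toy)
```

For the toy data the core holds 2 families and the pangenome 7, against a mean of 4 families per genome, which illustrates how the core ratio falls below one and the pangenome ratio rises above one even for a very small sample.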
Evolutionary Models
Genome Evolution on a Coalescent Tree
We will now discuss theoretical models that can be
used to interpret the core and pangenome sizes. Our main
focus in this paper will be the IMG model. However, we
will begin with a model in which there are a finite number
M of gene families, each of which can be either present
or absent from a genome. We will refer to this as the
finitely many genes (FMG) model, and we will show that
if the limit $M \to \infty$ is taken in a suitable way, the model
becomes the IMG model that was previously defined by
Baumdicker et al. (2010). This provides an alternative way
of deriving the results of the IMG that may be simpler than
the original.
We suppose that each gene family can be inserted
into a genome at a rate a if it is not already present, and
each family that is present can be deleted at a rate v. We
consider a population of Ne haploid genomes evolving according to the neutral Wright-Fisher model. Let x be the
fraction of the population that possesses one particular
family, and let $\phi(x)$ be the probability density that the family has
frequency x. The insertion and deletion problem is equivalent to the problem of recurrent mutation between two
alleles, for which the stationary distribution $\phi(x)$ is well
known (Wright, 1969):
$$\phi(x) = \frac{\Gamma(\alpha+\rho)}{\Gamma(\alpha)\,\Gamma(\rho)}\, x^{\alpha-1} (1-x)^{\rho-1} \tag{1}$$
where $\alpha = 2N_e a$ and $\rho = 2N_e v$. The mean value of x is
$\alpha/(\alpha+\rho)$; hence the mean number of families present in
one genome is
$$G_0 = \frac{M\alpha}{\alpha+\rho} \tag{2}$$
If a sample of n genomes is chosen from the population, the genealogy of the genomes is described by the
coalescent process (Kingman, 1982; Hein et al. 2005). A

typical coalescent tree (Fig. 2a) has many short branches
close to the tips and relatively few longer branches close
to the root. The time to coalescence of all the lineages is
roughly 2Ne generations, although there are large fluctuations about this average.
The probability that k of the sampled genomes contain the family, given that its frequency in the population
is x and that n genomes were sampled, is a binomial distribution:
$$g(k|x,n) = \frac{n!}{k!\,(n-k)!}\, x^{k} (1-x)^{n-k} \tag{3}$$
The gene family frequency spectrum, G(k|n), is the expected number of families that will be found in k genomes
after sampling n. This can be found by integrating over the
distribution of x, assuming that the frequencies of each of
the M possible families are independent:
$$G(k|n) = M \int_0^1 \phi(x)\, g(k|x,n)\, dx = M\, \frac{n!}{k!\,(n-k)!}\, \frac{\Gamma(k+\alpha)}{\Gamma(\alpha)}\, \frac{\Gamma(n-k+\rho)}{\Gamma(\rho)}\, \frac{\Gamma(\alpha+\rho)}{\Gamma(n+\alpha+\rho)} \tag{4}$$
For numerical evaluation of this result, it is useful to note
that the ratio of gamma functions can be written as
$$\frac{\Gamma(k+\alpha)}{\Gamma(\alpha)} = (k-1+\alpha) \cdots \alpha = \prod_{i=0}^{k-1} (i+\alpha) \tag{5}$$
and equivalent formulae apply for the other ratios. The
number of clusters in the core genome is the number of
families present in every genome:
$$G_{\mathrm{core}}(n) = G(n|n) = M\, \frac{\Gamma(n+\alpha)\,\Gamma(\alpha+\rho)}{\Gamma(\alpha)\,\Gamma(n+\alpha+\rho)} \tag{6}$$
G(0|n) is the number of possible types of families that are
not observed in any of the sampled genomes. The number
of clusters in the pangenome is the number of families that
are present in at least one of the sampled genomes:
$$G_{\mathrm{pan}}(n) = M - G(0|n) = M - M\, \frac{\Gamma(n+\rho)\,\Gamma(\alpha+\rho)}{\Gamma(\rho)\,\Gamma(n+\alpha+\rho)} \tag{7}$$
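As a numerical check on the FMG results, the frequency spectrum, core and pangenome sizes of Equations (4), (6) and (7) can be evaluated with log-gamma functions to avoid overflow. This is a minimal Python sketch, not the software provided with the paper; the function names and the parameter values (M = 10000, scaled insertion rate 0.5, scaled deletion rate 2.0) are hypothetical and chosen only for illustration.

```python
from math import comb, exp, lgamma, log


def fmg_spectrum(k, n, M, alpha, rho):
    """Equation (4): expected number of families found in exactly k of n genomes."""
    log_g = (
        log(M)
        + log(comb(n, k))
        + lgamma(k + alpha) - lgamma(alpha)
        + lgamma(n - k + rho) - lgamma(rho)
        + lgamma(alpha + rho) - lgamma(n + alpha + rho)
    )
    return exp(log_g)


def fmg_core(n, M, alpha, rho):
    """Equation (6): families present in all n genomes."""
    return fmg_spectrum(n, n, M, alpha, rho)


def fmg_pan(n, M, alpha, rho):
    """Equation (7): families present in at least one of n genomes."""
    return M - fmg_spectrum(0, n, M, alpha, rho)


# Hypothetical parameter values for illustration only.
M, alpha, rho, n = 10000, 0.5, 2.0, 10
g0 = M * alpha / (alpha + rho)  # Equation (2)
total = sum(fmg_spectrum(k, n, M, alpha, rho) for k in range(n + 1))
genes_counted = sum(k * fmg_spectrum(k, n, M, alpha, rho) for k in range(n + 1))
```

Two identities make useful sanity checks: the spectrum summed over k = 0..n recovers M, and the spectrum weighted by k recovers n times G0, because each genome carries G0 families on average.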
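This completes the results for the FMG case. To obtain the IMG results, we note that the overall rate of insertion of all types of gene family is u = Ma. It is useful to
define the parameter $\theta = 2N_e u = 2N_e M a = M\alpha$. We take
the limit $M \to \infty$ and $a \to 0$ keeping u constant, or equivalently, $\alpha \to 0$ keeping $\theta$ constant. The results for the IMG
model depend on $\theta$ and $\rho$. These are the same parameters
used by Baumdicker et al. (2010). By taking the IMG limit
in Equations (2), (4), (6) and (7), we obtain
$$G_0 = \frac{\theta}{\rho} \tag{8}$$
$$G(k|n) = \frac{\theta}{k}\, \frac{n \cdots (n-k+1)}{(n-1+\rho) \cdots (n-k+\rho)} \tag{9}$$
$$G_{\mathrm{core}}(n) = \theta\, \frac{(n-1)!}{(n-1+\rho) \cdots \rho} \tag{10}$$
$$
\begin{aligned}
G_{\mathrm{pan}}(n) &= M - M\, \frac{(n-1+\rho) \cdots \rho}{(n-1+\alpha+\rho) \cdots (\alpha+\rho)} \\
&= M - M \prod_{k=0}^{n-1} \left(1 + \frac{\alpha}{k+\rho}\right)^{-1} \\
&\approx M - M \left(1 - \alpha \sum_{k=0}^{n-1} \frac{1}{k+\rho}\right) \\
&= \theta \sum_{k=0}^{n-1} \frac{1}{k+\rho}
\end{aligned}
\tag{11}
$$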
The pangenome size continues to increase with n,
hence the pangenome is open in the IMG model. From
Equation (11), G pan (n) increases roughly as ln(n) for large
n, i.e. it increases less rapidly than linearly. In contrast,
the pangenome is closed in the FMG model because there
cannot be more than M genes (as in Equation (7)).
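The IMG results in Equations (9) to (11) are simple enough to evaluate directly. The sketch below is an illustrative Python version under assumed parameter values (scaled insertion rate 1000, scaled deletion rate 0.5); the function names are hypothetical and this is not the software distributed with the paper.

```python
def img_spectrum(k, n, theta, rho):
    """Equation (9): expected number of families in exactly k of n genomes (1 <= k <= n)."""
    value = theta / k
    for j in range(k):
        # Numerator runs n, n-1, ..., n-k+1; denominator runs n-1+rho, ..., n-k+rho.
        value *= (n - j) / (n - 1 - j + rho)
    return value


def img_core(n, theta, rho):
    """Equation (10): the k = n term of the spectrum."""
    return img_spectrum(n, n, theta, rho)


def img_pan(n, theta, rho):
    """Equation (11): theta times the sum of 1/(k + rho) for k = 0..n-1."""
    return theta * sum(1.0 / (k + rho) for k in range(n))


# Hypothetical parameter values for illustration only.
theta, rho, n = 1000.0, 0.5, 20
spectrum_total = sum(img_spectrum(k, n, theta, rho) for k in range(1, n + 1))
genes_counted = sum(k * img_spectrum(k, n, theta, rho) for k in range(1, n + 1))
```

Summing the spectrum over k = 1..n should reproduce the pangenome size, and the spectrum weighted by k should equal n times G0; both identities hold for any positive parameter values and make a convenient test of an implementation.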
So far it was assumed that the same parameters apply
for all gene families. Thus there are two parameters, $\theta$ and
$\rho$, in the IMG case. Families are assumed to be dispensable, i.e. they can be inserted or deleted without affecting
the fitness. A natural extension of the model is to consider
a second class of essential families that are present in all
genomes and cannot be inserted and deleted. This introduces an extra parameter, Gess , the number of essential
gene families. This changes the above results simply by
adding the constant Gess to the equations for G0 , G(n|n),
Gcore (n) and G pan (n). The gene family frequency spectrum G(k|n) is unaltered for k < n. A second extension of
the model is to consider two classes of dispensable families with different parameters. In this case, there are four
parameters $\theta_1$, $\theta_2$, $\rho_1$ and $\rho_2$. All the formulae for the two-class models are simply the sum of two terms of the same
form as the single class models.
Genome Evolution on a Star Tree
In the previous section, it was assumed that the
genomes evolve on a coalescent tree. The coalescent process generates an ensemble of possible trees with well-defined statistical properties. The results for Gcore (n) and
G pan (n) etc. are expectation values averaged over all possible coalescent trees. However, results for any one realization of a coalescent tree may differ substantially from
the mean. The coalescent arises in population genetics
when we consider the genealogy of individuals within a
species. Here, we want to study groups of genomes that
extend beyond the species level. In this case it is more
appropriate to consider a phylogenetic tree across species
rather than a coalescent tree within species. The process of
speciation/diversification giving rise to the genomes in the
available data is not necessarily equivalent to a coalescent.
Therefore, we wish to consider other types of tree in order
to determine the extent to which the predicted behavior of
the core and pangenomes depends on the shape of the tree.
Here we will consider the star phylogeny shown in
Figure 2b, where there is a radiation of lineages at the
root at a time t in the past. We choose the star phylogeny
for three reasons. Firstly, its shape is very different from
a coalescent tree; therefore, if the results are dependent
on the tree shape, we would expect a clear difference between these cases. Secondly, the star phylogeny is a simple case for which an analytical solution is easy to obtain. Thirdly, we are motivated by the observation of Tettelin et al. (2005) that the pangenome size appears to increase linearly with n for large n, whereas according to
Equation (11), the pangenome should increase approximately as ln(n) for large n. We will now show that if evolution occurs on a star phylogeny instead of a coalescent
tree, then the pangenome does in fact increase linearly
with n.
Consider a set of genomes evolving according to the
IMG model with a single class of dispensable gene families with deletion rate v, and overall insertion rate u as
before. Suppose that the genome at the root of the star
phylogeny is a typical genome described by this model. It
therefore contains G0 = u/v gene families. The probability that a dispensable family is retained on one branch for
time t without deletion is $e^{-vt}$. The core families are those
that are retained on all n branches. Therefore, the number
of core families is
$$G_{\mathrm{core}}(n) = \frac{u}{v}\, e^{-nvt} \quad \text{for } n \ge 2 \tag{12}$$
For a single genome, all families are in the core by definition, so Gcore (1) = G0 . The probability that a dispensable family is retained on at least one of the n branches is
one minus the probability that it is deleted on all branches.
Hence, the number of dispensable families retained since
the root is
$$G_{\mathrm{ret}} = \frac{u}{v} \left[ 1 - (1 - e^{-vt})^{n} \right] \tag{13}$$
Let Ggain be the number of families that were not present
at the root and are gained along one branch of length t. We
may write
$$\frac{dG_{\mathrm{gain}}}{dt} = u - v\, G_{\mathrm{gain}} \tag{14}$$
from which
$$G_{\mathrm{gain}} = \frac{u}{v} (1 - e^{-vt}) \tag{15}$$
There will be Ggain gene families gained on each of the n
branches. Hence the size of the pangenome is
$$G_{\mathrm{pan}}(n) = G_{\mathrm{ret}} + n\, G_{\mathrm{gain}} = \frac{u}{v} \left[ 1 - (1 - e^{-vt})^{n} \right] + \frac{nu}{v} (1 - e^{-vt}) \tag{16}$$
From this, the number of new families found for the first
time in the nth genome is
$$G_{\mathrm{new}}(n) = \frac{u}{v}\, e^{-vt} (1 - e^{-vt})^{n-1} + \frac{u}{v} (1 - e^{-vt}) \tag{17}$$
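The star-phylogeny results in Equations (12), (16) and (17) can be checked numerically. The following Python sketch is illustrative only; the function names and the values of u, v and t are hypothetical.

```python
from math import exp


def star_core(n, u, v, t):
    """Equation (12) for n >= 2; a single genome has all G0 = u/v families in its core."""
    g0 = u / v
    return g0 if n == 1 else g0 * exp(-n * v * t)


def star_pan(n, u, v, t):
    """Equation (16): families retained from the root plus families gained on n branches."""
    q = 1.0 - exp(-v * t)  # probability that a root family is deleted on one branch
    retained = (u / v) * (1.0 - q ** n)
    gained = n * (u / v) * q
    return retained + gained


def star_new(n, u, v, t):
    """Equation (17): families seen for the first time in the nth genome."""
    q = 1.0 - exp(-v * t)
    return (u / v) * (1.0 - q) * q ** (n - 1) + (u / v) * q


# Hypothetical rates and branch length for illustration only.
u, v, t = 2000.0, 1.0, 0.3
increments = [star_pan(n, u, v, t) - star_pan(n - 1, u, v, t) for n in range(2, 30)]
```

The increments of the pangenome curve agree with Equation (17), and for large n they settle at the constant (u/v)(1 − e^{−vt}), which is the linear growth of the pangenome on a star tree.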
In order to facilitate comparison between the star
phylogeny and the coalescent, it is useful to define $\theta = ut$
and $\rho = vt$. Note that the time to the root in the star is t,
whereas the typical time to the root in a coalescent tree
is $2N_e$ generations; hence $\theta$ and $\rho$ mean almost the same thing in the
two cases. However, the shape of the tree affects the way
the core and pangenome depend on $\theta$ and $\rho$. As with the
coalescent case, it is possible to add a number Gess of essential gene families, or to consider two classes of dispensable families with insertion and deletion rates $\theta_1$, $\theta_2$,
$\rho_1$ and $\rho_2$. Thus, we now have four models that we can
compare to real data: either a coalescent tree or a star phylogeny can be used, and in each case, we can consider
either one class of dispensable families and one class of
essential families, or two classes of dispensable families
with different insertion and deletion rates.
Before proceeding to data analysis, we wish to make
the connection between the star phylogeny model and the
work of Tettelin et al. (2005). These authors found that the
numbers of new and core gene families in Streptococcus
can be fitted by simple exponential decay functions:
$$G_{\mathrm{core}}(n) = \kappa_c \exp(-n/\tau_c) + \Omega_c \tag{18}$$
$$G_{\mathrm{new}}(n) = \kappa_s \exp(-n/\tau_s) + \Omega_s \tag{19}$$
Equations (18) and (19) are equivalent to those given in
the captions to Figures 2 and 3 of Tettelin et al. (2005).
We have used the notation $\Omega_s$ in Equation (19) instead of
tg($\theta$), which was used by Tettelin et al., in order to avoid
confusion with the parameter $\theta$ in the IMG model. If we
write the results from Equations (12) and (17) in terms of
$\theta$ and $\rho$, and we explicitly include the essential families
in the core, we obtain:
$$G_{\mathrm{core}}(n) = \frac{\theta}{\rho}\, e^{-n\rho} + G_{\mathrm{ess}} \quad \text{for } n \ge 2 \tag{20}$$
$$G_{\mathrm{new}}(n) = \frac{\theta}{\rho}\, \frac{e^{-\rho}}{1 - e^{-\rho}}\, (1 - e^{-\rho})^{n} + \frac{\theta}{\rho} (1 - e^{-\rho}) \tag{21}$$
It can be seen that Equation (20) is a simple exponential
decay, as in Equation (18), if we identify the parameters
as $\Omega_c = G_{\mathrm{ess}}$, $\kappa_c = \theta/\rho$, and $1/\tau_c = \rho$. Equation (21) is
also equivalent to Equation (19) if we identify the parameters $\kappa_s$, $\tau_s$ and $\Omega_s$ in a suitable way. Note that the number
of new families approaches a constant for large n, so the
pangenome increases linearly with n for large n. The exponential decay functions were originally given as empirical
fitting functions with no theoretical model to justify them.
We now see that these fitting functions are the expected
results for an IMG model on a star phylogeny. However,
in our interpretation, there are only three independent parameters, Gess , $\theta$ and $\rho$, and all six parameters in Equations (18) and (19) depend on these three.
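One way to make the identification with the exponential fitting functions concrete is to write the power of (1 − e^{−ρ}) in Equation (21) as an exponential in n. The Python sketch below works this out under that reading; the dictionary keys, function names and parameter values are hypothetical, and the mapping for the new-gene curve is our own algebra rather than a formula quoted from Tettelin et al. (2005).

```python
from math import exp, log


def tettelin_parameters(theta, rho, g_ess):
    """Map the star-tree IMG parameters onto the six constants of Equations (18) and (19)."""
    q = 1.0 - exp(-rho)  # per-branch deletion probability in scaled time
    core = {"kappa": theta / rho, "tau": 1.0 / rho, "omega": g_ess}
    new = {
        "kappa": (theta / rho) * exp(-rho) / q,
        "tau": -1.0 / log(q),  # since q**n == exp(n * log(q)) and log(q) < 0
        "omega": (theta / rho) * q,
    }
    return core, new


def star_new_scaled(n, theta, rho):
    """Equation (17) rewritten in terms of the scaled parameters."""
    q = 1.0 - exp(-rho)
    return (theta / rho) * exp(-rho) * q ** (n - 1) + (theta / rho) * q


# Hypothetical parameter values for illustration only.
theta, rho, g_ess = 600.0, 0.3, 500.0
core_fit, new_fit = tettelin_parameters(theta, rho, g_ess)
fitted_new = [new_fit["kappa"] * exp(-n / new_fit["tau"]) + new_fit["omega"] for n in range(2, 12)]
```

With these identifications the exponential form reproduces the star-tree prediction for every n, which is a direct way to confirm that all six fitting constants are functions of only three model parameters.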
Model Fitting
Fitting to the core and pangenome sizes
When calculating Gcore (n) and G pan (n) for the real
data, the result depends on which order the genomes are
considered. For each of the data sets in Table 1, we considered many random permutations of the ng genomes and
calculated the core and the pangenome curves for n increasing from 1 to ng . If the results from the many permutations are plotted on one graph, this generates a spread
of points, shown as the shaded area in Figures 3 and 4.
The mean values for Gcore (n) and G pan (n) can be obtained
by averaging over the results for the different permutations. However, a more direct way to obtain these mean
functions is to use the gene family frequency spectrum,
G(k|ng ), which is a function of the full data set and does
not depend on the permutation. Consider a family that is
present in k genomes out of ng . If n genomes are sampled,
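the probability that the family is present in all n is
$$P_{\mathrm{all}}(n,k) = \begin{cases} \dfrac{k \cdots (k-n+1)}{n_g \cdots (n_g-n+1)} & \text{if } n \le k, \\ 0 & \text{if } n > k. \end{cases} \tag{22}$$
Therefore the mean size of the core genome is
$$G^{\mathrm{mean}}_{\mathrm{core}}(n) = \sum_{k=n}^{n_g} G(k|n_g)\, P_{\mathrm{all}}(n,k) \tag{23}$$
The probability that the family is absent in all n is
$$P_{\mathrm{abs}}(n,k) = \begin{cases} \dfrac{(n_g-k) \cdots (n_g-k-n+1)}{n_g \cdots (n_g-n+1)} & \text{if } n \le n_g-k, \\ 0 & \text{if } n > n_g-k. \end{cases} \tag{24}$$
The probability that this family is in at least one of the n is
$1 - P_{\mathrm{abs}}(n,k)$. Therefore the mean size of the pangenome
is
$$G^{\mathrm{mean}}_{\mathrm{pan}}(n) = \sum_{k=1}^{n_g} G(k|n_g) \left( 1 - P_{\mathrm{abs}}(n,k) \right) \tag{25}$$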
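The mean curves of Equations (23) and (25) can be computed from the frequency spectrum alone. The Python sketch below is illustrative; it uses the fact that each ratio of falling products in Equations (22) and (24) equals a ratio of binomial coefficients, and the function name and the toy spectrum are hypothetical.

```python
from math import comb


def mean_core_pan(spectrum, n):
    """Mean core and pangenome sizes for n sampled genomes, following Equations (22) to (25).

    spectrum[k] holds G(k|ng) for k = 0..ng; the k = 0 entry is ignored.
    """
    ng = len(spectrum) - 1
    total = comb(ng, n)  # number of ways to choose n of the ng genomes
    # P_all(n, k) = C(k, n) / C(ng, n); comb returns 0 when n > k.
    core = sum(spectrum[k] * comb(k, n) / total for k in range(n, ng + 1))
    # P_abs(n, k) = C(ng - k, n) / C(ng, n); comb returns 0 when n > ng - k.
    pan = sum(spectrum[k] * (1 - comb(ng - k, n) / total) for k in range(1, ng + 1))
    return core, pan


# Hypothetical spectrum for three genomes: 4 families seen once, 1 seen twice, 2 seen in all three.
spectrum = [0, 4, 1, 2]
curve = [mean_core_pan(spectrum, n) for n in range(1, 4)]
```

For this toy spectrum the mean core sizes are 4, 7/3 and 2 and the mean pangenome sizes are 4, 17/3 and 7, which matches what an exhaustive average over all orderings of the three genomes would give.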
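We denote the core and pangenome curves for any
one particular permutation of the genomes in the data as
$G^{\mathrm{perm}}_{\mathrm{core}}(n)$ and $G^{\mathrm{perm}}_{\mathrm{pan}}(n)$. The root mean square deviation
per point between the permutation and the mean is
$$\mathrm{RMS(perm)} = \left[ \frac{1}{2 n_g} \left( \sum_{n=1}^{n_g} \left( G^{\mathrm{perm}}_{\mathrm{core}}(n) - G^{\mathrm{mean}}_{\mathrm{core}}(n) \right)^2 + \sum_{n=1}^{n_g} \left( G^{\mathrm{perm}}_{\mathrm{pan}}(n) - G^{\mathrm{mean}}_{\mathrm{pan}}(n) \right)^2 \right) \right]^{1/2} \tag{26}$$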
We define RMS(data) as the mean value of
RMS(perm) averaged over permutations. RMS(data) is a
measure of the spread of points around the mean, i.e. it is
a measure of the inherent uncertainty in the data. Table 2
gives RMS(data) for each of the data sets, calculated from
10000 random permutations. The units of RMS(data) are
gene families, i.e. this value may be compared to G0 in
Table 1.
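The permutation procedure and the RMS(data) statistic can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: each genome is represented as a set of gene family identifiers, the function names are hypothetical, and the mean curves are obtained here by averaging over the sampled permutations rather than from the frequency spectrum.

```python
import random


def curves_for_order(genomes, order):
    """Core and pangenome sizes as genomes are added in the given order."""
    core, pan = set(genomes[order[0]]), set()
    core_sizes, pan_sizes = [], []
    for i in order:
        core &= genomes[i]   # families present in every genome so far
        pan |= genomes[i]    # families present in at least one genome so far
        core_sizes.append(len(core))
        pan_sizes.append(len(pan))
    return core_sizes, pan_sizes


def rms_per_point(core_a, pan_a, core_b, pan_b):
    """Root mean square deviation per point over both curves (2*ng points)."""
    ng = len(core_a)
    ss = sum((x - y) ** 2 for x, y in zip(core_a, core_b))
    ss += sum((x - y) ** 2 for x, y in zip(pan_a, pan_b))
    return (ss / (2 * ng)) ** 0.5


def rms_data(genomes, n_perm=1000, seed=0):
    """Mean RMS deviation of random permutations from their average curves."""
    rng = random.Random(seed)
    ng = len(genomes)
    perms = []
    for _ in range(n_perm):
        order = list(range(ng))
        rng.shuffle(order)
        perms.append(curves_for_order(genomes, order))
    mean_core = [sum(c[n] for c, _ in perms) / n_perm for n in range(ng)]
    mean_pan = [sum(p[n] for _, p in perms) / n_perm for n in range(ng)]
    return sum(rms_per_point(c, p, mean_core, mean_pan)
               for c, p in perms) / n_perm


# Three toy genomes (hypothetical family identifiers).
toy = [{1, 2, 3}, {1, 2, 4}, {1, 5}]
print(curves_for_order(toy, [0, 1, 2]))
print(rms_data(toy, n_perm=200))
```

With enough permutations the averaged curves converge on the mean curves computed from the frequency spectrum, so either route can supply the reference curve.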
Each theoretical model was fitted to the mean of
the data using the Nelder-Mead least-squares optimization
routine in the open-source statistical software package R
(R Development Core Team 2011). The squared deviations for both the core and the pangenome were used in
the minimization because a good model should be able to
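fit both curves at the same time. The root mean square deviation per point is
$$\mathrm{RMS(theory)} = \left( \frac{1}{2 n_g} \left[ \sum_{n=1}^{n_g} \left( G_{\mathrm{core}}^{\mathrm{theory}}(n) - G_{\mathrm{core}}^{\mathrm{mean}}(n) \right)^2 + \sum_{n=1}^{n_g} \left( G_{\mathrm{pan}}^{\mathrm{theory}}(n) - G_{\mathrm{pan}}^{\mathrm{mean}}(n) \right)^2 \right] \right)^{1/2} \tag{27}$$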
where the superscript "theory" denotes any one of the theoretical models described above. For each model, we imposed a constraint on the parameters such that the mean number of families per genome, $G_0$, is the same in the theory as the data. This reduces the number of free parameters by one in each model and means that the theory
curve passes exactly through the mean of the data for the
n = 1 point in both the core and pangenome curves. After the parameters are chosen to minimize RMS(theory),
the ratio RMS(theory)/RMS(data) is a useful measure of
the quality of fit of a theoretical model to the data. The
smaller the ratio, the better the fit. If the ratio is less than
1, the theory deviates less from the mean than does a typical permutation, i.e. the curve falls within the spread of
points generated by the permutations. A well fitting model
should have this ratio less than 1.
Of the four models fit to each dataset, the coalescent model with two classes of dispensable gene
families (model 2D) was always the best, with an
RMS(theory)/RMS(data) ratio much less than 1 in every case
(Table 2). The 2D model is a good fit for every case
tested, unlike the original IMG model of Baumdicker et al.
(2010), which uses a coalescent model with one class of
dispensable gene families plus one class of essential families (model D+E). The fitting ratio of model D+E is more
than 1 in several data sets, and is always much worse
than that of the 2D model. This is shown clearly in the example of Streptococcus pneumoniae (Fig. 3), where the best
fitting model (2D on the coalescent) fits both the core and
pangenome curves exceptionally well. In this example, the
D+E models dramatically fail to fit the pangenome and
core genome curves simultaneously. Allowing two classes
that evolve at different rates is thus a significant improvement over using a strictly conserved category.
When comparing the D+E models on the star tree and
the coalescent, the star tree is sometimes a better fit. However, the 2D model on the coalescent always fits better
than the 2D model on the star tree, so in general there is
no reason to choose the star tree over the coalescent. In the
example of a more diverse group, Bacillaceae 1, the 2D
model on the coalescent clearly fits the pangenome better
(Fig. 4) because the star model incorrectly predicts a linearly increasing pangenome. None of the models perform
particularly well in fitting the core genome curve for this
dataset because of the shape of the underlying tree, which
we discuss in further detail below.
The fitted values for the insertion and deletion parameters ($\theta$ and $\rho$) vary greatly between the clades for the 2D model on the coalescent (Table 3). In general they are larger for the more diverse groups (i.e. those for which $d_{\mathrm{prot}}$ is larger; Table 1). If we think in population genetics terms, this is to be expected, because $\theta$ and $\rho$ depend on the effective population size $N_e$, which will be larger for more diverse groups. If we think in phylogenetic terms, this is to be expected, because the time since divergence of the more diverse groups is longer.
Table 2 Comparison of quality-of-fit for four models using root mean squared (RMS) values. RMS(data) is the RMS of 10,000 replicate permutations of the pangenome and core genome data points compared to calculated means. RMS(theory) is the RMS of the best fit line for evolutionary models based on 2 different tree shapes (star tree and coalescent tree) and either 1 dispensable and 1 essential gene class (D+E) or 2 dispensable gene classes (2D).

                                  RMS(theory)/RMS(data)
                                Star tree       Coalescent
Taxonomic group     RMS(data)   D+E     2D      D+E     2D
Staph. aureus          126      0.27    0.15    1.03    0.06
Strep. pyogenes        109      0.23    0.21    0.74    0.03
Strep. pneumoniae      105      0.28    0.25    0.93    0.05
Listeria spp.          140      0.12    0.11    0.38    0.03
B. cereus group        493      0.36    0.29    1.21    0.08
Staph. spp.            335      0.26    0.23    0.99    0.03
Strep. spp.            333      1.72    1.72    1.18    0.21
Bacillaceae 1          703      1.26    1.26    1.30    0.20
Lactobacillales        387      1.28    1.28    0.72    0.11
Bacilli               1972      1.81    1.81    0.63    0.10

The parameters $\theta_1$ and $\rho_1$ correspond to gene families with slow rates of insertion and deletion, while $\theta_2$ and $\rho_2$ correspond to families with fast rates. The mean numbers of families per genome in the two categories are $G_{\mathrm{slow}} = \theta_1/\rho_1$ and $G_{\mathrm{fast}} = \theta_2/\rho_2$. Table 3 shows the fractions of families in the two categories: $f_{\mathrm{slow}} = G_{\mathrm{slow}}/(G_{\mathrm{slow}} + G_{\mathrm{fast}})$ and $f_{\mathrm{fast}} = G_{\mathrm{fast}}/(G_{\mathrm{slow}} + G_{\mathrm{fast}})$. There is usually a relatively small fraction of fast evolving families ($f_{\mathrm{fast}}$ in the range 0.1 to 0.25 for most data sets), and the deletion rates for the fast families are very much larger than those for the slow families ($\rho_2 \gg \rho_1$). The presence of
fast evolving families, which are easily gained and lost,
means that there can be quite rapid divergence between
genomes that are closely related (such as the species level
datasets considered here). However, if the majority of families are more slowly evolving, this explains why significant numbers of conserved families are still found in the
more diverse data sets. It was previously observed that
gain and loss of families in the B. cereus group appeared
to be much faster than in the wider set of Bacilli (Hao and
Golding, 2006). This is because the fast changes are dominant when considering the short branch lengths in closely
related groups, whereas changes at slower time scales are
relevant for the longer branches, and it is not possible to
see this with a model that has only a single rate.
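As a worked check of the slow and fast fractions, the short Python sketch below recomputes them from the fitted insertion and deletion parameters reported for Staphylococcus aureus in Table 3 (161, 0.080, 53326 and 260.2). The function name `class_fractions` is an assumption introduced for illustration, and the symbols $\theta$ and $\rho$ for the insertion and deletion parameters are assumed here.

```python
def class_fractions(theta1, rho1, theta2, rho2):
    """Mean families per genome in each class and their fractions.

    G_slow = theta1 / rho1 and G_fast = theta2 / rho2, as defined in the text.
    """
    g_slow = theta1 / rho1
    g_fast = theta2 / rho2
    total = g_slow + g_fast
    return g_slow, g_fast, g_slow / total, g_fast / total


# Fitted 2D coalescent parameters for Staphylococcus aureus (Table 3).
g_slow, g_fast, f_slow, f_fast = class_fractions(161, 0.080, 53326, 260.2)
print(round(g_slow), round(g_fast), round(f_slow, 2), round(f_fast, 2))
```

The rounded fractions agree with the 0.91 and 0.09 listed for this clade, even though the fast class has insertion and deletion rates that are orders of magnitude larger than those of the slow class.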
The fitted parameters for the 2D model were used to
predict the sizes of the pangenome and core genome when
extrapolated to larger numbers of genomes (Table 3). Substantial numbers of new families are expected to be found
even if very many genomes are sequenced. For example,
the predicted Gnew is in the range of 50 to several hundred, even after sequencing 100 genomes. Hundreds or
even thousands of genomes are predicted to be required
before the number of new families falls to less than 1% of
the mean genome size. It should be remembered that the
pangenome is open according to these models, so there
will always be new families, however many genomes are
sequenced.
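The extrapolations can be illustrated with a short Python sketch. It assumes the standard coalescent expressions of the infinitely many genes model (Baumdicker et al. 2010), summed over classes: $G_{\mathrm{pan}}(n) = \sum_i \theta_i \sum_{k=0}^{n-1} 1/(\rho_i + k)$, $G_{\mathrm{new}}(n) = \sum_i \theta_i/(\rho_i + n - 1)$ and $G_{\mathrm{core}}(n) = \sum_i (\theta_i/\rho_i) \prod_{k=1}^{n-1} k/(k + \rho_i)$. These are taken to correspond to the coalescent formulas given earlier in the paper, which are not reproduced in this section, so the exact form and the function names should be treated as assumptions.

```python
def g_pan(n, classes):
    """Expected pangenome size for n genomes; classes = [(theta, rho), ...]."""
    return sum(theta / (rho + k) for theta, rho in classes for k in range(n))


def g_new(n, classes):
    """Expected number of new families contributed by the n-th genome."""
    return sum(theta / (rho + n - 1) for theta, rho in classes)


def g_core(n, classes):
    """Expected core genome size for n genomes."""
    total = 0.0
    for theta, rho in classes:
        size = theta / rho            # value for a single genome
        for k in range(1, n):
            size *= k / (k + rho)     # each added genome retains a fraction
        total += size
    return total


# Fitted 2D parameters for Staphylococcus aureus from Table 3.
s_aureus = [(161, 0.080), (53326, 260.2)]
print(g_new(100, s_aureus))    # close to the tabulated 150
print(g_core(100, s_aureus))   # within about 1% of the tabulated 1331
print(g_core(1000, s_aureus))  # within about 1% of the tabulated 1106
```

Because $G_{\mathrm{new}}(n)$ falls only as $1/n$ for large $n$, it never reaches zero, which is the sense in which the pangenome is open under these models. Small differences from Table 3 are expected because the tabulated parameters are rounded.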
Table 3 Fitted parameters for the best fitting model, having two classes of dispensable gene families (2D) on a coalescent tree. The columns fslow and ffast are the fractions of gene families in each rate category. Extrapolations to large numbers of genomes were performed to predict the number of new gene families (Gnew(k)) and core clusters (Gcore(k)) detected when k genomes are sequenced. 1% Gnew is the predicted number of additional genome sequences required before the number of new families falls to less than 1% of the mean genome size.

Taxonomic group             θ1     ρ1      θ2       ρ2      fslow  ffast  Gnew(100)  1% Gnew  Gcore(100)  Gcore(1000)
Staphylococcus aureus       161    0.080   53326    260.2   0.91   0.09   150        2159     1331        1106
Streptococcus pyogenes      135    0.097   9507     48.4    0.88   0.12   66         558      852         681
Streptococcus pneumoniae    131    0.088   12547    42.7    0.84   0.16   90         668      950         776
Listeria spp.               155    0.074   6861     28.6    0.90   0.10   55         274      1430        1205
Bacillus cereus group       643    0.179   146941   215.5   0.84   0.16   474        3240     1456        963
Staphylococcus spp.         335    0.178   160825   538.3   0.86   0.14   256        6840     769         510
Streptococcus spp.          297    0.234   16950    39.7    0.75   0.25   125        981      392         228
Bacillaceae 1               925    0.328   45461    50.5    0.76   0.24   313        1202     556         261
Lactobacillales             314    0.304   16475    23.4    0.60   0.40   138        945      229         114
Bacilli                     3707   1.987   226245   406.4   0.77   0.23   484        9094     0           0

The observed core genome size is slightly less than $G_{\mathrm{slow}}$ for the species level sets, indicating that most of the slowly evolving families are conserved in all the genomes in the closely-related groups (Fig. 5). In the more divergent groups, the core is much smaller than $G_{\mathrm{slow}}$, indicating that there is sufficient time for deletion of the slow families as well (Fig. 5). The predicted size of the core
genome continues to slowly decrease well beyond the
number of genomes currently available in each data set
(Table 3). In the 2D models there are no essential families, so the core will eventually decrease to zero. In some
of the most diverse clades, predictions of the core genome
size for large numbers of genomes are improved if a class
of essential families is also added (i.e. 2D+E). However,
we did not feel that inclusion of this extra class was well
justified in terms of the improvement of the fit to core
and pangenome data for the available number of genomes;
therefore these results are not shown. Unfortunately, it will
always be difficult to make accurate extrapolations far beyond the range of the available data.
Fitting to the gene family frequency spectrum
We also wished to test whether the theoretical models
provide a good prediction of the shape of the gene family
frequency spectrum, $G(k|n_g)$. For this, we chose the parameters by minimizing the $\chi^2$ function:
$$\chi^2 = \sum_{k=1}^{n_g} \frac{\left( G^{\mathrm{theory}}(k|n_g) - G^{\mathrm{data}}(k|n_g) \right)^2}{G^{\mathrm{theory}}(k|n_g)} \tag{28}$$
This function is weighted such that the theory has to fit the
data across the full range of k, whereas if the unweighted
least squares function is used, the model tends to fit the
high values at the extremes and ignore the low values at
intermediate k.
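The effect of the weighting can be seen in a small Python sketch. The function names and the toy U-shaped spectrum are assumptions chosen for illustration; only the form of the weighted sum follows Equation (28).

```python
def chi_square(theory, data):
    """Weighted misfit: squared residuals divided by the predicted value."""
    return sum((t - d) ** 2 / t for t, d in zip(theory, data))


def least_squares(theory, data):
    """Unweighted sum of squared residuals, shown for comparison."""
    return sum((t - d) ** 2 for t, d in zip(theory, data))


# Toy U-shaped prediction: high at the extremes, low at intermediate k.
theory = [10.0, 2.0, 10.0]
miss_at_extreme = [11.0, 2.0, 10.0]   # one family off where theory is high
miss_in_middle = [10.0, 1.0, 10.0]    # one family off where theory is low

print(least_squares(theory, miss_at_extreme), least_squares(theory, miss_in_middle))
print(chi_square(theory, miss_at_extreme), chi_square(theory, miss_in_middle))
```

Both misfits cost the same under unweighted least squares, but the weighted function penalizes the error at intermediate k five times more heavily in this toy case, which is why it forces the model to follow the low part of the spectrum.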
When fit to the gene family frequency spectrum for
the B. cereus group, the 2D model on the coalescent returns a good prediction, whereas the D+E model on the
coalescent does not adequately explain the shape of this
curve (Fig. 6a). Fig. 6b shows the predicted core and
pangenome curves using the 2D model with parameters
obtained from fitting the frequency spectrum in Fig. 6a.
This shows that the 2D model on the coalescent fits all
these aspects of the data simultaneously.
The B. cereus example is typical of the other data sets
with closely related genomes. However, we found that the
gene family frequency spectrum is more difficult to predict adequately for the more diverse sets of genomes that
we considered. For example, we found that the 2D model
on the coalescent appeared to do a reasonable job at fitting the gene family frequency spectrum for the Bacillaceae 1, but when these same parameters were used to
calculate the core and pangenome curves, the prediction
was poor (not shown). However, for this dataset alone, the
prediction was noticeably improved if we included two
dispensable classes and an essential class (2D+E) in the
model (Fig. 7a). In this case, the predictions for the core
and pangenomes using the same parameters were much
improved (Fig. 7b).
The predictions of the models on the coalescent are
averages over all possible coalescent trees, whereas the
actual data arise on one particular tree, and there is no reason to suppose that the real tree conforms to the coalescent process. We have also calculated the predicted gene
family frequency spectrum using the star phylogeny, but
the fits are noticeably worse than with the coalescent, and
we have not shown them here. We already have an estimate of the shape of the real tree calculated from the protein sequence evolution of the conserved genes (Fig. 1).
Here, we will show that it is possible to calculate the expected gene family frequency spectrum for the IMG model
on any given fixed tree. The reason for doing this is that
the spectrum is not simply a smooth curve (as seen in
Fig. 7), but has structure, such as peaks close to k = 20
and k = 30, and dips on either side of these peaks. These
are signatures of the shape of the particular tree on which
the genomes evolved. The Bacillaceae 1 group contains
the 20 genomes of the B. cereus group within it, which
are very closely related to one another in comparison to
the other members of the Bacillaceae 1 (Fig. 1). The
most likely explanation of the k = 20 peak is that there are
many gene families present in this particular group of 20
genomes that are not present in the rest of this group.
The following method can be used to calculate $G(k|n)$ on a fixed tree. We suppose branch lengths of the tree are given in some convenient units. In our case, the tree is a maximum likelihood tree derived from protein sequence evolution and the branch lengths are measured in amino acid substitutions per site. Let $a$ be a node in this fixed tree, let $t_a$ be the length of the branch leading to node $a$, and let $n_a$ be the number of genomes in the data that descend from node $a$, as shown in Figure 8. We may write
$$G(k|n) = \sum_{a=1}^{2n-1} g(t_a)\, p_a(k) \tag{29}$$
where $g(t_a)$ is the expected number of families present in $a$ that arose on the branch leading to $a$, and $p_a(k)$ is the probability that a family present at $a$ is present in $k$ out of $n_a$ genomes that descend from $a$. The sum goes over $n$ tip nodes (current genomes) and $n - 1$ internal nodes, i.e. $2n - 1$ nodes in total.
For a single class of dispensable gene families in the IMG model,
$$g(t_a) = \frac{u}{v}\left(1 - e^{-v t_a}\right) \tag{30}$$
as in Equation (15). If $a$ is the root node, then $g(t_a) = u/v$.
If $a$ is a tip node, then $p_a(1) = 1$ and $p_a(k) = 0$ for $k \ne 1$. If $a$ is an internal node, $p_a(k) = 0$ for $k > n_a$ and we can calculate $p_a(k)$ for $0 \le k \le n_a$ using the following recursion. Let the probability that a family is retained for a time $t$ be $r(t) = e^{-vt}$, and the probability that it is lost during time $t$ be $l(t) = 1 - e^{-vt}$. Let $c$ be the parent node of $a$ and $b$, as in Figure 8. Then, for $k \ge 1$, we may write
$$p_c(k) = r(t_a)\, l(t_b)\, p_a(k) + l(t_a)\, r(t_b)\, p_b(k) + r(t_a)\, r(t_b) \sum_{j=0}^{k} p_a(j)\, p_b(k-j) \tag{31}$$
and for $k = 0$,
$$p_c(0) = \left( l(t_a) + r(t_a)\, p_a(0) \right) \left( l(t_b) + r(t_b)\, p_b(0) \right) \tag{32}$$
Once these probabilities have been calculated for every node, Equation (29) gives the gene family frequency spectrum. If there is more than one category of dispensable family, then the spectrum is the sum of the spectra for the individual classes. If there is a class of essential families, then a constant $G_{\mathrm{ess}}$ is added to $G(n|n)$.
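The recursion of Equations (29) to (32) can be sketched in a few lines of Python for a single class of dispensable families on a binary tree. This is an illustrative sketch and not the authors' R code: the nested-tuple tree encoding, the function name `fixed_tree_spectrum` and the toy branch lengths are all assumptions made here.

```python
import math


def fixed_tree_spectrum(root, u, v):
    """Return [0, G(1|n), ..., G(n|n)] for one dispensable class.

    A node is (branch_length, children), where children is () for a tip
    or a pair (left, right) for an internal node. The branch length stored
    at the root is ignored.
    """
    contributions = []  # one (g(t_a), p_a) pair per node a

    def padded(p, k):
        # p_a(k) = 0 for k > n_a
        return p[k] if k < len(p) else 0.0

    def visit(node, is_root):
        t, children = node
        if not children:
            p = [0.0, 1.0]  # tip: p_a(1) = 1, all other p_a(k) = 0
        else:
            a, b = children
            pa, pb = visit(a, False), visit(b, False)
            ra, rb = math.exp(-v * a[0]), math.exp(-v * b[0])  # retention
            la, lb = 1.0 - ra, 1.0 - rb                        # loss
            n_c = (len(pa) - 1) + (len(pb) - 1)
            p = [(la + ra * pa[0]) * (lb + rb * pb[0])]        # Eq. (32)
            for k in range(1, n_c + 1):                        # Eq. (31)
                conv = sum(padded(pa, j) * padded(pb, k - j)
                           for j in range(k + 1))
                p.append(ra * lb * padded(pa, k)
                         + la * rb * padded(pb, k)
                         + ra * rb * conv)
        # Eq. (30), with g = u/v at the root
        g = u / v if is_root else (u / v) * (1.0 - math.exp(-v * t))
        contributions.append((g, p))
        return p

    n = len(visit(root, True)) - 1
    spectrum = [0.0] * (n + 1)
    for g, p in contributions:
        for k in range(1, len(p)):
            spectrum[k] += g * p[k]                            # Eq. (29)
    return spectrum


# Toy three-genome tree with hypothetical branch lengths.
tree = (None, ((0.3, ()), (0.5, ((0.1, ()), (0.2, ())))))
G = fixed_tree_spectrum(tree, u=100.0, v=2.0)
print(G)
print(sum(k * G[k] for k in range(len(G))))  # equals n * u / v
```

A useful check is that $\sum_k k\,G(k|n) = n\,u/v$: every genome carries $u/v$ families on average at stationarity, whatever the shape of the tree. For a model with two dispensable classes, the function would be called once per class and the resulting spectra added.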
Using the fixed tree for the Bacillaceae 1 in Fig. 1, we calculated $G(k|n_g)$ for the 2D+E model. The parameters were chosen by optimizing the $\chi^2$ function in Equation (28). The model predicts the most important peaks and dips in the gene family frequency spectrum in a way that cannot be done with the coalescent model. Having obtained $G(k|n_g)$, it is possible to obtain the prediction for the core and pangenome curves on the same tree by using Equations (22)-(25) in the same way as is done for the observed spectrum. For the Bacillaceae 1 example,
the resulting theoretical curves from the fixed tree and the
coalescent are almost indistinguishable from one another
(Fig. 7b). Moving from the gene family frequency spectrum to the core and pangenome curves smooths out some
of the irregularities that arise on the fixed tree. Thus the
spectrum is more sensitive to the shape of the tree than are
the core and pangenome curves.
Software
Many of the formulae described herein are available
as functions written in the R programming language, available at http://reric.org/work/pangenome/.
A helper script provides example usage and plotting for
each of 7 functions using an example dataset. Users can
compute the gene family frequency spectrum, G(k), from
a matrix of gene clusters, and calculate G(k) using the 1D,
1D+E, 2D, and 2D+E IMG models on a coalescent. The
mean core genome (Gcore ) and pangenome (G pan ) curves
can be calculated from G(k), and users can compute permutations of these curves from the data. Additional functions allow users to calculate Gcore and G pan using the 1D,
1D+E, 2D, and 2D+E models on either a coalescent or a
star tree.
Discussion and Conclusions
The original version of the IMG introduced by
Baumdicker et al. (2010) uses a single class of dispensable
gene families plus a class of essential families, and corresponds to our model D+E on the coalescent tree. These
authors gave an example where the gene family frequency
spectrum from nine Prochlorococcus genomes was well
fitted with this model. In our data sets, however, we always found that model 2D, with two dispensable classes,
gave a much better fit than the D+E model. We note that
the essential class is a limiting case of a dispensable class
with both u and v very small. This means that the fit must
be at least as good with the 2D model as the D+E. The
2D model has an additional free parameter, but addition
of this parameter seems well justified because the ratio
RMS(theory)/RMS(data) is very much smaller (Table 2)
and because the shape of the fitting curves is clearly much
better (Figures 3, 4 & 6). Two classes of families seem
adequate for fitting the core and pangenome curves, and
also most of the gene family frequency spectra.
Not all of the clades were well fit by the 2D model,
however. For the gene family frequency spectrum of
the Bacillaceae 1 (Fig. 7), there was a noticeable improvement when a third class was added (model 2D+E).
This mathematical model corresponds to the conceptual
model described by Koonin and Wolf (2008) that includes a core of essential gene families, a shell of conserved families, and a cloud of dispensable families. The
Bacillaceae 1 is one of the most diverse data sets and
one in which there is noticeable structure in the spectrum arising from the unusual shape of the tree. The frequency spectrum contains more information than the core
and pangenome curves because it is possible to calculate
those curves from the spectrum, but not vice versa. For
this reason, more complex models may be justified for fitting the spectrum than for fitting the core and pangenomes.
For example, the calculation of the frequency spectrum on the fixed tree is noticeably different from that for the coalescent, although both give almost the same prediction for the core and pangenomes. However, the original cluster data contains a lot of information that is lost in calculating the gene family frequency spectrum. For example, in the Bacillaceae 1 case, we argued that the peak close to
k = 20 arose from the group of 20 closely related genomes
within the larger set. This should mean that the cluster data
contains many gene families found in this particular group
of 20, rather than in any other random group of 20. In the
future, we intend to analyze the cluster data directly using
the IMG model. This should allow a more sensitive test of
some of the details of the IMG model that are difficult to
assess from the core and pangenome curves and from the
gene frequency spectrum.
In understanding the parameters of the models, it is
instructive to compare two clades, Listeria and Staphylococcus aureus, which have similar average genome sizes
and similar core genome sizes. The fitted values for insertion and deletion rates of slowly evolving gene families
are also similar between the two groups, with the slow families making up about
90% of the total in each clade (Fig. 5). However, the predicted rates of insertion and deletion for the fast gene families were nearly an order of magnitude lower in Listeria
than in S. aureus, indicating that the variable fraction of
the genome evolves much more slowly in Listeria. This
in turn leads to the prediction of a larger (more diverse)
pool of gene families available to S. aureus compared to
Listeria, which is surprising because the sequenced Listeria genomes include several species and are much more
phylogenetically divergent than the S. aureus genomes.
Assuming that rates of protein sequence evolution
in core genes are not drastically different between Listeria and S. aureus, explanations for the unexpectedly
small Listeria pangenome include differences in lifestyle
or mechanisms of evolution. One aspect of lifestyle that
may vary is habitat or host range. For example, a pathogen
that infects many host species may have a larger effective
population size than an obligate human pathogen, and may
require access to a larger repertoire of genes. In our example organisms, S. aureus is commensal on and an opportunistic pathogen of humans and occasionally other animals, while Listeria spp. are pathogens of humans and
various other animals, in addition to inhabiting environments like soil and water. So, the smaller pangenome size
of Listeria is not obviously attributable to a less variable
habitat or smaller populations.
An alternative explanation is that the mechanisms of
evolution might vary by clade, particularly those mechanisms involved in the acquisition or invention of novel
genes. Exogenous genes can be acquired via conjugation
(by self-propagating plasmids or transposons), transduction (by phage infection), and natural transformation (by
the uptake of free DNA from the environment). Members
of all of the clades we studied are known to host both plasmids and bacteriophages, indicating that conjugation and
transduction are both potential sources of new genes in
these genomes. However, not all of the clades are known
to become naturally competent for transformation, a complex process that requires a large complement of genes.
In particular, no known isolates of Listeria, Staphylococcus aureus, or Streptococcus pyogenes (the three clades with the lowest proportional pangenome sizes) have been observed to become naturally competent in vitro. Streptococcus pyogenes and S. aureus each carry many genes
known to be required for competence and could possibly
become competent under conditions not yet replicated in
the laboratory, but sequenced Listeria spp. carry only a
few of the required genes and are unlikely to ever become
competent in their native habitats. Additionally, just one
family of insertion sequence has been reported for Listeria
spp. compared to more than 20 families in S. aureus and
most other clades we investigated (ISFinder database,
Siguier et al., 2006). Compared to S. aureus, the small
pangenome of Listeria could be attributable to a restricted
ability to acquire exogenous genes and a decreased rate of
gene innovation due to a paucity of insertion sequences.
An important question that we found difficult to address with the methods in this paper is the comparison of
the IMG with the finitely many genes (FMG) case. The
FMG was introduced as a stepping stone to derive the
IMG limit. However, it is a valid model in itself, and we
attempted to fit the core and pangenome curves using the
FMG model. We found that the value of M (the finite number of possible gene families) is very difficult to estimate.
M must be at least as large as the observed pangenome
size, but if M is much larger than this, then the predictions
of the model depend principally on the product of M and the per-family insertion rate, and
only very slightly on the two parameters separately. Thus,
from the core and pangenome curves, we could not make
a clear distinction between the IMG model and the FMG
model with a very large but finite number of families. This
distinction would be more apparent in a direct phylogenetic analysis of the cluster data, because the FMG model
would allow certain patterns of presence and absence to
occur that would be extremely unlikely under the IMG
model.
In the phylogenetic context, the FMG corresponds to
a standard model of a two state 0/1 character, such as that
used by Hao and Golding (2006) to look at gene family
presence/absence patterns. This model allows independent
transitions from 0 to 1 in different parts of a phylogenetic
tree, which can be interpreted as independent insertions of
the same kind of family by HGT from a pool of sequences
of finite diversity. In contrast, the IMG is not a standard
model in the phylogenetic context, because it has an infinite number of possible characters, and the number of
characters actually present in the data is variable. A comparison of these two approaches on the same data would
therefore be interesting.
The distinction between the IMG and FMG is related to the question of whether the pangenome is open
or closed. It seems to us that it must almost certainly be
open in real cases. As long as there is some non-zero rate
of origination of gene families within individual lineages
or there is a high diversity of families available for HGT,
then there will always be some constant rate of origination of gene families that have not been seen before. This
means that there are at least some types of families that fit
the IMG assumption, and so the pangenome must be open.
It is also possible that there are other families that have a
high rate of repeated HGT, and which would fit an FMG
model better. These could be present at the same time as
gene families of the IMG type. But unless all families correspond to the FMG case, which seems unlikely to us, the
pangenome will be open.
An alternate way to model HGT is to suppose that
genes can be repeatedly transferred between members of
the group. In this case the rate of transfer of a gene depends on its frequency $x$ in the population, in proportion to $x(1 - x)$, because the donor must possess the
gene and the receiver must not. Baumdicker and Pfaffelhuber (2011) have recently extended the IMG to include
this kind of within-group HGT in the population genetics context on the coalescent tree. If HGT is frequent, this
changes the shape of the gene frequency spectrum substantially so that it is no longer U-shaped, but has a broad
peak at intermediate k. The gene frequency spectra for all
the data sets that we studied here were U-shaped, which
leads us to believe that within-group HGT is not a prime
factor in our data. However, we have not tested this explicitly. We also acknowledge the recent claim that the majority of gene gains in multi-gene families arise from within-group HGT rather than by gene duplication (Treangen and
Rocha, 2011). One possible way to reconcile these observations would be if genes in large families have an unusually high rate of HGT, whereas the majority of genes have
a very low rate that is consistent with the IMG assumption.
Gain of additional members of gene families is not visible
in the way we analyzed the data here because we worked
at the level of presence/absence of whole gene families.
It is also worth pointing out that it would be extremely
difficult to deal with within-group HGT in a phylogenetic
context on a fixed tree because the rate of 0 to 1 character changes in one branch would depend on the character
states in other branches, which would invalidate the usual
kind of phylogenetic recursion relations used for calculating likelihoods.
Our main reason for introducing the star phylogeny
calculations was the fact that the pangenome increases linearly with n on the star phylogeny, and this corresponds
to claims made for real data in some of the original work
with pangenomes (Tettelin et al. 2005). However, the IMG
model was not available at that time, and we have shown
here that the 2D IMG on the coalescent tree gives a better
fit than the star phylogeny. In other words, the pangenome
seems to increase logarithmically with n, and not linearly,
after all. Our interpretation of this is that real evolution occurs on a tree that is neither a star nor a coalescent. However, typical phylogenetic trees resemble a typical coalescent more than a star, in that as more genomes are added
to the sample, the length of the tip branch leading to the
last genome added gets shorter and shorter. This explains
qualitatively why the pangenome only increases logarithmically.
We will note one final interesting connection with the
star tree. The finite supragenome model is an approach
developed in a recent series of papers (Hogg et al. 2007;
Donati et al. 2010; Boissy et al. 2011) that makes predictions on the core and pangenome sizes. This model supposes that there are a finite total number of gene families
that are divided among K classes. Gene families are assumed to be present with a probability $\mu$ in each genome and absent with a probability $1 - \mu$, with a different value of $\mu$ for each class of family. The gene frequency spectrum for families in a given class is binomial, and the total gene family frequency spectrum is a sum over the binomial distributions for the different classes. The connection with the star phylogeny is that gene families present at the root of the star have a probability $e^{-vt}$ of being present in each of the tips. Therefore the frequency spectrum is binomial, with $\mu = e^{-vt}$. We tried to fit our gene family frequency spectra with the 2D star tree model, and we
found it to be considerably worse than the 2D coalescent
model; therefore we did not present these results. In the
finite supragenome model, K = 7 classes were used, and it
may be that the star phylogeny model would fit the data if
a large number of classes were introduced. We do not need
to do this however, because the IMG gives a better prediction of the frequency spectrum, using either the coalescent
or a fixed phylogenetic tree, with only 2 or 3 classes. More
importantly, our conclusion is that the pangenome is open,
because there is a significant rate of origination of new
gene families that have never existed before, and this is not
consistent with the basis of the finite supragenome model,
which starts from the assumption that the pangenome is
closed.
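The binomial connection with the star tree can be made concrete with a small Python sketch. It covers only the families that were present at the root of the star; the function name, the symbol `mu` for the per-genome presence probability, and the numerical values are assumptions chosen for illustration.

```python
from math import comb, exp


def star_root_spectrum(g_root, v, t, n):
    """Spectrum of root-derived families on a star tree with n tips.

    Each family present at the root survives to each tip independently
    with probability mu = exp(-v * t), so the number of tips carrying it
    is binomial(n, mu). Returns expected counts for k = 0..n.
    """
    mu = exp(-v * t)
    return [g_root * comb(n, k) * mu ** k * (1 - mu) ** (n - k)
            for k in range(n + 1)]


# Hypothetical values: 1000 families at the root, 5 tips, v * t = 0.5.
spec = star_root_spectrum(1000.0, v=1.0, t=0.5, n=5)
print(spec)
```

Families that arise on a tip branch of the star are found in only one genome, so under Equations (29) and (30) they would add to the k = 1 term and are not included in this sketch.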
In conclusion, we have found that variants of the infinitely many genes model make good predictions of the
sizes of the core and pangenomes and the shape of the
gene family frequency spectrum in many groups of bacterial genomes. The model has a sound basis in population genetics, and is derived from an explicit evolutionary model. Therefore it is more useful for interpretation
of experimental data than alternative approaches that are
simply empirical fitting functions or statistical models for
distributions that are not associated with an evolutionary
mechanism.
Literature Cited
Baumdicker F, Hess WR, Pfaffelhuber P. 2010. The diversity of a
distributed genome in bacterial populations. Ann. Appl. Prob.
20:1567-1606.
Baumdicker F, Hess WR, Pfaffelhuber P. 2011. The infinitely
many genes model for the distributed genome of bacteria.
PNAS forthcoming.
Baumdicker F, Pfaffelhuber P. 2011. Evolution of bacterial populations under horizontal gene transfer. Preprint, http://arxiv.org/abs/1105.5014
Boissy R, Ahmed A, Janto B et al. (16 co-authors). 2011.
Comparative supragenomic analyses among the pathogens
Staphylococcus aureus, Streptococcus pneumoniae, and
Haemophilus influenzae using a modification of the finite
supragenome model. BMC Genomics 12:187.
Collins RE, Merz H, Higgs PG. 2011. Origin and evolution of
gene families in Bacteria and Archaea. BMC Bioinformatics
(Proceedings of the RECOMB Comparative Genomics conference to appear).
den Bakker HC, Cummings CA, Ferreira V, Vatta P, Orsi RH,
Degoricija L, Barker M, Petrauskene O, Furtado MR and
Wiedmann M. 2010. Comparative genomics of the bacterial
genus Listeria: Genome evolution is characterized by limited gene acquisition and limited gene loss. BMC Genomics 11:688.
Donati C, Hiller NL, Tettelin H et al. (18 co-authors) 2010.
Structure and dynamics of the pan-genome of Streptococcus pneumoniae and closely related species. Genome Biol.
11:R107.
Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucl. Acids Res. 32:1792-1797.
Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O. 2010. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Sys. Biol. 59:307-21.
Hao W, Golding GB. 2006. The fate of laterally transferred genes: Life in the fast lane to adaptation or death. Genome Res. 16:636-643.
Hein J, Schierup MH, Wiuf C. 2005. Gene genealogies, variation and evolution: A primer in coalescent theory. Oxford University Press, Oxford.
Hogg JS, Hu FZ, Janto B, Boissy R, Hayes J, Keefe R, Post JC,
Ehrlich GD. 2007. Characterization of the Haemophilus influenzae core and supragenomes based on the complete genomic sequences of Rd and 12 clinical nontypeable strains.
Genome Biol. 8:R103.
Kettler GC, Martiny AC, Huang K, et al. (14 co-authors) 2007.
Patterns and implications of gene gain and loss in the evolution of Prochlorococcus. PLoS Genet. 3(12):e231.
Kingman JFC. 1982. The coalescent. Stoch. Proc. Appl. 13:235-248.
Koonin EV and Wolf YI. 2008. Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world.
Nucleic Acids Res. 36:6688-6719.
Lapierre P, Gogarten JP. 2009. Estimating the size of the bacterial
pan-genome. Trends in Genetics 25:107-110.
Ludwig W, Schleifer K-H, Whitman WB. Accessed 9/2011.
Revised road map to the phylum Firmicutes. URL
http://www.bergeys.org/outlines/bergeys_vol_3_roadmap_outline.pdf
R Development Core Team. 2009. R: A language and environment
for statistical computing. R Foundation for Statistical
Computing, Vienna, Austria. R version 2.10.1 (2009-12-14).
ISBN 3-900051-07-0. URL http://www.R-project.org.
Rasko DA, Rosovitz MJ, Myers GSA, et al. (13 co-authors)
2008. The pangenome structure of Escherichia coli: Comparative genomic analysis of E. coli commensal and pathogenic
isolates. J. Bacteriol. 190:6881-6893.
Sanderson MJ, Donoghue MJ, Piel W, Eriksson T. 1994. TreeBASE: a prototype database of phylogenetic analyses and an
interactive tool for browsing the phylogeny of life. Am. J. Bot.
81:183.
Scaria J, Ponnala L, Janvilisri T, Yan W, Mueller LA, and
Chang YF. 2010. Analysis of ultra-low genome conservation
in Clostridium difficile. PLoS ONE 5(12):e15147.
Siguier P, Perochon J, Lestrade L, Mahillon J, and Chandler M.
2006. ISfinder: the reference centre for bacterial insertion sequences. Nucleic Acids Res. 34:D32-D36. URL http://www-is.biotoul.fr
Tettelin H, Masignani V, Cieslewicz MJ, et al. (46 co-authors)
2005. Genome analysis of multiple pathogenic isolates of
Streptococcus agalactiae: Implications for the microbial
pangenome. Proc. Nat. Acad. Sci. USA 102:13950-13955.
Touchon M, Hoede C, Tenaillon O, et al. (41 co-authors)
2009. Organized genome dynamics in the Escherichia coli
species results in highly diverse adaptive paths. PLoS Genet.
5(1):e1000344.
Treangen TJ, Rocha EPC. 2011. Horizontal transfer, not gene
duplication, drives the expansion of gene families in prokaryotes. PLoS Genet. 7(1):e1001284.
Welch RA, Burland V, Plunkett G, et al. (19 co-authors) 2002.
Extensive mosaic structure revealed by the complete genome
sequence of uropathogenic E. coli. Proc. Nat. Acad. Sci. USA
99:17020-17024.
Wright S. 1969. Evolution and the Genetics of Populations. Vol.
II. The theory of gene frequencies. University of Chicago
Press.
Figures
FIG. 1. Phylogenetic tree of 172 Bacilli having complete genome sequences, annotated with the taxonomic groups used in this study. Tree
topology and branch lengths were determined by maximum likelihood analysis (using PhyML) of concatenated alignments of amino acid sequences
from 55 single-copy core genes (Table S2). The fully annotated tree is given in Fig. S1. Taxonomic groupings demarcated with stars (★) were 100%
supported by the approximate likelihood ratio test (as calculated in PhyML) and define the groups used in this study. Scale bar indicates expected
number of amino acid substitutions per site.
FIG. 2. Schematic representation of the shapes of (a) a coalescent tree, and (b) a star tree.
FIG. 3. Fitting four evolutionary models to the core and pangenome curves for 14 sequenced Streptococcus pneumoniae genomes. Models were
based on 2 different tree shapes: the star tree (blue lines) or the coalescent tree (red lines) and either 1 dispensable and 1 essential class of gene
family (D+E; dashed lines) or 2 dispensable classes (2D; solid lines). The lines for the star tree pangenome overlap. The shaded region demarcates the
maximum range observed during 10,000 permutations of the data.
FIG. 4. Fitting four evolutionary models to the core and pangenome curves for 38 sequenced Bacillaceae 1 genomes. Models were based on
2 different tree shapes: the star tree (blue lines) or the coalescent tree (red lines) and either 1 dispensable and 1 essential class of gene family (D+E;
dashed lines) or 2 dispensable classes (2D; solid lines). The lines for the star tree overlap. The shaded region demarcates the maximum range observed
during 10,000 permutations of the data.
FIG. 5. The variation in the number of gene families in the slow category (G_slow; blue bar) and the fast category (G_fast; red bar), and the total
number of gene families per genome (G_0 = G_slow + G_fast) in each of the clades. The symbol shows the observed size of the core genome.
FIG. 6. Fitting the gene family frequency spectrum, G(k), for 20 sequenced genomes of the Bacillus cereus group (circles). Models were based
on the coalescent tree and either 1 dispensable and 1 essential class of gene family (D+E; dashed lines) or 2 dispensable classes (2D; solid lines). The
2D model gives a reasonable fit, whereas the D+E model does not. The lower figure shows the predictions of the 2D model for the core and pangenome
curves with the parameter values estimated from fitting the gene frequency spectrum. The shaded region demarcates the maximum range observed
during 10,000 permutations of the data.
FIG. 7. Fitting the gene family frequency spectrum, G(k), for 38 sequenced Bacillaceae 1 genomes (circles). Models were based on 2 different
tree shapes: the coalescent tree (red lines) or the inferred phylogenetic tree (green lines; Fig. S1), using 2 dispensable and 1 essential class of gene
family (2D+E). The 2D+E model on the coalescent gives an adequate smooth fit through the data. The same model calculated on the fixed tree of
Bacillaceae 1 predicts some of the tree-dependent structure in the data. The lower graph shows the (overlapping) predictions of both these models
for the core and pangenomes, using the parameters estimated from fitting the gene family frequency spectrum. The shaded region demarcates the
maximum range observed during 10,000 permutations of the data.
FIG. 8. Schematic to explain the algorithm for calculation of the gene frequency spectrum on a fixed tree.
Supplementary Material Captions

Supplementary Table 1: The list of taxa used in the study. The list is archived, with additional links to taxonomic
databases, on TreeBASE (Sanderson et al., 1994) at the following URL: http://bit.ly/pS075b
Supplementary Table 2: Single-copy core genes used for phylogenetic analysis. Numbers in brackets indicate the
frequency (out of 172 genomes) with which the given annotation was observed.
Supplementary Figure 1: Phylogenetic tree of 172 Bacilli having complete genome sequences. Tree topology and
branch lengths were determined by maximum likelihood analysis of concatenated alignments of amino acid sequences
from 55 single-copy core genes (Table S2). Taxonomic groupings are indicated on the tree by stars and by vertical lines
adjacent to taxon names. Each grouping was monophyletic and highly supported by the approximate likelihood ratio test
(aLRT, as calculated in PhyML); values of aLRT for each taxonomic group are shown adjacent to the stars. The alignment
and tree files are archived on TreeBASE (Sanderson et al., 1994; URL http://bit.ly/qD1Ch2).
