
Human Molecular Genetics, 2010, Vol. 19, No. 15, 2927–2935
doi:10.1093/hmg/ddq198
Advance Access published on May 12, 2010

Genomic and geographic distribution of SNP-defined runs of homozygosity in Europeans


Michael Nothnagel1,*, Timothy Tehua Lu1, Manfred Kayser2 and Michael Krawczak1

1Institute of Medical Informatics and Statistics, Christian-Albrechts University, 24105 Kiel, Germany and 2Department
of Forensic Molecular Biology, Erasmus University Medical Center, 3015 GE Rotterdam, The Netherlands
Received March 10, 2010; Revised April 28, 2010; Accepted May 6, 2010

INTRODUCTION
The availability of high-density panels of genetic polymorphisms has led to the recent discovery of extended regions of
autozygosity in the human genome. At the genotype level,
these regions present as sizeable stretches, or runs, of homozygosity (ROH) (1). Increased levels of autozygosity have
long been implicated as a cause of the higher prevalence of
recessive diseases in small and isolated populations. Initially,
ROH analysis was thus endorsed successfully as a means to
map recessive disease genes (2–7), but ROHs may also be
useful for disease gene identification under other genetic
models (8,9), may be indicative of selective sweeps (10,11)
and should be interesting from a human population genetics
point of view.
An abundance of ROHs was first demonstrated in Europeans using short tandem repeat markers (1). Subsequent
analyses of single nucleotide polymorphisms (SNPs) in large
European, Asian and African samples revealed a wide-spread
occurrence of ROHs comprising >1 Mb, but the same studies
also highlighted population differences in terms of the cumulative ROH length and a positive correlation between ROH
number and the population-specific level of consanguinity
(12–14). In addition, ROHs were found to preferentially
occur in regions of decreased recombination activity. Wang
et al. (15) observed a non-uniform distribution of ROHs on
chromosome 22 in 11 population isolates. At a more localized
level, McQuillan et al. (16) found that the cumulative ROH
length per genome was larger in two isolated than in two non-isolated European populations, but no significant difference was
seen between outbred populations from rural and urban areas
of Scotland. These results suggested that prolonged isolation
and a reduced population size may play a crucial role in the
formation of ROHs, as would be predicted by population genetics theory. When focusing upon regions of low linkage disequilibrium (LD) in the genomes of North Americans of

*To whom correspondence should be addressed at: Institute of Medical Informatics and Statistics, Christian-Albrechts University, Brunswiker Str. 10,
D-24105 Kiel, Germany. Tel: +49 4315973181; Fax: +49 4315973193; Email: nothnagel@medinfo.uni-kiel.de

© The Author 2010. Published by Oxford University Press. All rights reserved.
For Permissions, please email: journals.permissions@oxfordjournals.org

Downloaded from http://hmg.oxfordjournals.org at University of Sydney on July 18, 2010

The availability of high-density panels of genetic polymorphisms has led to the discovery of extended
regions of apparent autozygosity in the human genome. At the genotype level, these regions present as sizeable stretches, or runs, of homozygosity (ROH). Here, we investigated both the genomic and the geographic
distribution of ROHs in a large European sample of individuals originating from 23 subpopulations. The
genomic ROH distribution was found to be characterized by a pattern of highly significant non-uniformity
that was virtually identical in all subpopulations studied. Some 77 chromosomal regions contained ROHs
at considerable frequency, thereby forming ROH islands that were not explicable by high linkage disequilibrium alone. At the geographic level, the number and cumulative length of ROHs followed a prominent South
to North gradient in agreement with expectations from European population history. The individual ROH
length, in contrast, showed only minor and unsystematic geographic variation. While our findings are thus
consistent with a larger effective population size in Southern than in Northern Europe, combined with a
higher historic population density and mobility, they also indicate that the patterns of meiotic recombination
in humans must have been very similar throughout the continent. Extending previous reports of a strong
correlation between geography and identity-by-state, our data show that the genomic identity-by-descent patterns of Europeans are also clinal. As a consequence, the planning, design and interpretation of ROH-based
genetic studies must take sample origin into account in order for such studies to be sensible and valid.


Table 1. Summary of weighted ROH statistics in 23 European subpopulations

Code    Sample size   Latitude   Longitude   Weighted ROH number   Median-weighted ROH length
                                             per individual,       per individual,
                                             mean ± SD             mean ± SD (Mb)
NO      52            59.36      5.28        42.16 ± 6.72          1.31 ± 0.08
SE      46            59.51      17.38       41.49 ± 6.37          1.27 ± 0.07
FI      47            60.10      24.56       48.04 ± 7.34          1.30 ± 0.08
IR      35            53.19      −6.15       40.14 ± 5.02          1.28 ± 0.08
UK      194           51.30      −0.07       38.51 ± 6.30          1.30 ± 0.09
DK      59            55.40      12.34       40.11 ± 6.26          1.30 ± 0.09
NE      280           51.55      4.28        38.79 ± 6.83          1.30 ± 0.08
NG      494           54.14      10.04       40.49 ± 6.25          1.31 ± 0.08
SG      489           48.37      10.89       38.39 ± 6.09          1.31 ± 0.08
AU      50            47.16      11.23       36.42 ± 6.39          1.29 ± 0.09
SW      133           46.31      6.37        37.64 ± 5.89          1.29 ± 0.08
FR      50            45.46      4.50        36.89 ± 6.71          1.28 ± 0.10
PG      16            38.43      −9.08       34.15 ± 7.56          1.26 ± 0.08
S1      81            40.25      −3.42       37.70 ± 5.89          1.30 ± 0.08
S2      47            41.20      2.10        36.37 ± 6.94          1.29 ± 0.10
I1      106           41.53      13.68       35.59 ± 6.19          1.30 ± 0.09
I2      49            43.37      13.30       34.60 ± 5.11          1.28 ± 0.07
YU      55            44.49      20.30       36.88 ± 5.72          1.31 ± 0.08
GR      51            40.38      22.27       33.69 ± 5.22          1.28 ± 0.11
HU      17            47.27      19.06       33.68 ± 5.25          1.25 ± 0.07
RO      12            44.25      26.07       32.55 ± 4.48          1.17 ± 0.06
PO      49            52.15      21.01       41.45 ± 5.98          1.29 ± 0.07
CZ      45            50.04      14.28       39.21 ± 5.11          1.27 ± 0.06
Total   2457                                 38.74 ± 6.60          1.30 ± 0.08

Given are the number of samples after data cleaning (21), the geographical location of the sampling sites (subpopulations), and the subpopulation-specific mean ±
SD of the weighted number of ROHs and of the median-weighted ROH length per individual.

Various definitions of ROHs have been proposed in the past
(12,13,15,18–20). Here, we applied the widely used ROH
definition implemented in PLINK (20) to a set of genome-wide SNP data that had previously formed the basis of comprehensive
population genetic analyses (21). The data included
quality-controlled genotypes, at 304 250 autosomal SNPs, of
2457 unrelated individuals originating from 23 sampling sites
in Europe (henceforth termed subpopulations; Table 1). We
characterized both the genomic and the geographic distribution
of SNP-defined ROHs in these individuals and sought possible explanations of the observed ROH patterns.

RESULTS
Genomic ROH distribution

Figure 1. Distribution of ROHs in European genomes. For each SNP, the
ROH frequency per SNP in the overall sample is plotted at its physical
location. ROHs that occurred in more than 50% of individuals were identified
on chromosomes 3, 4 and 14 (Table 3).

European descent, Nalls et al. (17) observed a 14% decrease in
the frequency of ROHs and a 24% decrease in the cumulative
ROH length per individual over a time-span of just one
century, and suggested panmixia and an increased effective
population size as the likely causes of these changes.

Using the default ROH definition of PLINK (see Materials and
Methods), the genome-wide median number of individuals for
whom a given SNP was included in an ROH (henceforth
termed ROH count per SNP) was found to be 17, with an
inter-quartile range (IQR) of 7–47 (corresponding to a
median ROH frequency per SNP of 0.7% of individuals,
IQR: 0.3–1.9%). Most SNPs fell into an ROH at least once.
Intriguingly, the genomic distribution of the ROH frequency
per SNP deviated substantially from uniformity. Thus, ROHs
were clearly abundant in particular genomic regions (Fig. 1),
the location of which was virtually identical in all subpopulations studied. This non-uniformity was highly significant
when assessed by a χ² goodness-of-fit test of the binned
average ROH count per SNP in the overall sample
(Table 2). Considering genomic regions of ≥50 adjacent


Table 1 (continued). Subpopulation (sampling site) by code

NO      Norway (Førde)
SE      Sweden (Uppsala)
FI      Finland (Helsinki)
IR      Ireland
UK      UK (London)
DK      Denmark (Copenhagen)
NE      Netherlands (Rotterdam)
NG      Germany I (Kiel)
SG      Germany II (Augsburg)
AU      Austria (Tyrol)
SW      Switzerland (Lausanne)
FR      France (Lyon)
PG      Portugal
S1      Spain I
S2      Spain II (Barcelona)
I1      Italy I
I2      Italy II (Marches)
YU      Former Yugoslavia
GR      Northern Greece
HU      Hungary
RO      Romania
PO      Poland (Warsaw)
CZ      Czech Republic (Prague)
Total   All 23 subpopulations

Table 2. Statistical analysis of the genomic ROH distribution

Bin size (Mb)   χ²             df     P-value
1               558199.8639    2696   <10⁻¹⁰⁰
5               53382.4379     555    <10⁻¹⁰⁰
10              15142.9161     284    <10⁻¹⁰⁰
20              4768.6176      146    <10⁻¹⁰⁰
50              764.6029       62     <10⁻¹⁰⁰
100             239.1148       35     3.1 × 10⁻³²
250             98.7624        21     4.8 × 10⁻¹²

SNPs, a total of 10 regions could be identified in which all
SNPs had an ROH frequency per SNP ≥30% in the European
population (Table 3) and some 77 regions in which the ROH
frequency per SNP exceeded 10% (see Supplementary
Material, Table S1). At least two of these ROH islands on
chromosome 1 (at positions 48.8–51.1 Mb and 170.1–171.7 Mb)
appear to coincide with the ROH clusters previously identified in the Scottish population (16), although
no information as to the exact limits of the respective
regions was reported. We also identified a very common
ROH island on chromosome 4 that has been described for
European samples before (12). Notably, three ROH islands
were found in our study to be present in more than 50% of
individuals (Table 3). These ROH islands were located on
chromosomes 3, 4 and 14. We also noted that the lactase
(LCT) gene on chromosome 2q21, known to have been
subject to strong selection in Europe (22), is located in an
ROH island.
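The island definition used above (runs of at least 50 adjacent SNPs whose ROH frequency per SNP reaches a given threshold) can be illustrated with a minimal Python sketch. The function names, the input layout (per-SNP ROH counts in chromosomal order) and the inclusive threshold are assumptions made for illustration only; this is not the code used in the study.

```python
def roh_frequency(roh_counts, n_individuals):
    """Convert per-SNP ROH counts into per-SNP ROH frequencies (fractions)."""
    return [count / n_individuals for count in roh_counts]


def find_roh_islands(frequencies, threshold, min_snps=50):
    """Return (start, end) index pairs (end exclusive) of runs of adjacent SNPs
    whose ROH frequency is at or above `threshold` and that contain at least
    `min_snps` SNPs. The inclusive comparison is an assumption."""
    islands = []
    run_start = None
    for i, freq in enumerate(frequencies):
        if freq >= threshold:
            if run_start is None:
                run_start = i          # a new candidate run begins here
        else:
            if run_start is not None and i - run_start >= min_snps:
                islands.append((run_start, i))
            run_start = None           # the run (if any) has ended
    # Close a run that extends to the last SNP.
    if run_start is not None and len(frequencies) - run_start >= min_snps:
        islands.append((run_start, len(frequencies)))
    return islands


# Toy data: 10 individuals, and a reduced minimum island size of 3 SNPs.
counts = [0, 1, 4, 5, 6, 1, 0, 7, 8, 0]
freqs = roh_frequency(counts, n_individuals=10)
islands = find_roh_islands(freqs, threshold=0.3, min_snps=3)
print(islands)

# Median ROH count per SNP of 17 among 2457 individuals, as a percentage.
median_pct = round(100 * roh_frequency([17], 2457)[0], 1)
print(median_pct)
```

In the toy data only the first run is long enough to count as an island; the second run of two SNPs is discarded. The last two lines repeat the conversion from the median ROH count per SNP (17 of 2457 individuals) to the corresponding percentage.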
ROHs appear to be a phenomenon predominantly of large
chromosomes (Fig. 2). Thus, the chromosome-wise median
ROH frequency per SNP in the overall sample was strongly
correlated with both the length of the chromosome (Spearman's r = 0.915; P = 2.6 × 10⁻⁹) and the number of SNPs
on the chromosome (r = 0.919; P = 1.6 × 10⁻⁹), but
not with the chromosome-specific SNP density (r = 0.097;
P = 0.666).
We next investigated whether the occurrence of ROHs was
explicable in terms of reduced single-marker gene diversity
and/or a heterozygote deficit relative to Hardy–Weinberg
expectation. To this end, we performed a logistic regression
analysis of the ROH frequency per SNP, using the gene diversity of the SNP and the local estimate of fixation index F as
covariates. Logistic regression analysis was carried out separately for all but the three largest subpopulations (DE1, DE2
and NE), for which the computer memory requirements
turned out to be prohibitive. In all 20 subpopulations considered, the contribution of the two covariates was highly significant (P < 10⁻¹⁰⁰ for most subpopulations), with
decreasing gene diversity and increasing F resulting in an
increased ROH frequency per SNP (Table 4). However,
when we assessed the ability of the regression models to
predict the inclusion of a given SNP into an ROH, the
ensuing area-under-curve (AUC) values were found to range
from 0.53 to 0.56 in different subpopulations (Table 4),
which is in fact only marginally better than random classification (AUC = 0.50).
The definition of an ROH does not only depend upon the
properties of single markers but takes adjacent SNPs simultaneously into account. We therefore correlated the average
single-marker gene diversity, taken over a sliding 1 Mb
window, with the average ROH frequency per SNP in that
window. Although an unambiguously negative correlation
emerged (genome-wide Pearson's r = −0.268 ± 0.018;
range of subpopulation-specific r-values: −0.230 to
−0.293), the size of the observed correlation implied that
gene diversity explains only about 7% (coefficient of determination r² = 0.072) of the variation in ROH frequency
per SNP.
We also investigated the effects of LD upon the definition of
ROHs. To this end, we correlated the average ROH frequency
per SNP, taken over a sliding 1 Mb window, with the average
of the squared genotypic correlation coefficient g² within this
window. The average of the genome-wide subpopulation-specific Pearson correlation coefficients was 0.453 ± 0.024,
with a range of 0.382 (PG) to 0.503 (S2). Thus, about 20% (i.e.
0.453² = 0.205) of the variation in ROH frequency per SNP
could be explained by the extent of LD in the vicinity of a
given marker. A correlation between LD and ROH prevalence
became particularly apparent for the three genomic regions
(on chromosomes 3, 4 and 14) with the highest ROH frequency per SNP (Fig. 3). However, increased LD in the vicinity of a given SNP was neither necessary nor sufficient for
SNPs to be included in an ROH.
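The two "variance explained" figures quoted above follow directly from squaring the reported correlation coefficients. The short Python sketch below merely repeats that arithmetic; the helper name `variance_explained` is an illustrative assumption.

```python
def variance_explained(r):
    """Coefficient of determination implied by a correlation coefficient r."""
    return r * r


gene_diversity_r = -0.268  # reported genome-wide Pearson's r (gene diversity vs. ROH frequency)
ld_r = 0.453               # reported average Pearson's r (g-squared vs. ROH frequency)

r2_gene_diversity = round(variance_explained(gene_diversity_r), 3)
r2_ld = round(variance_explained(ld_r), 3)
print(r2_gene_diversity, r2_ld)
```

The sign of r is irrelevant here: a negative correlation of −0.268 and a positive one of the same size would account for the same share of the variation.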
Geographical pattern of ROH distribution in Europe
LD can act as a potential confounder in comparative ROH
analyses of different populations because the local level of
LD determines the effective number of SNPs used for ROH
definition. When characterizing the geographic distribution
of ROHs, we therefore weighted individual ROHs by their
internal level of LD, approximated by one minus the
average of the pair-wise squared genotypic correlation coefficient g² (see Materials and Methods).
The weighted ROH number per individual ranged from 10.5
to 60.4 in the overall sample, with all subpopulation-specific
IQRs falling between 25 and 55 (see Supplementary Material,
Fig. S1). The subpopulation average of the weighted ROH
number per individual varied between 32.55 (standard error,
SE: 1.3) in the Romanians and 48.0 (SE: 1.1) in the Finns
(Table 1). Similarly, the subpopulation average of the cumulative weighted ROH length per individual ranged from
49.7 Mb (SE: 2.3 Mb) in the Romanians to 81.5 Mb (SE:
2.2 Mb) in the Finns. Of the 2457 individuals analysed, 40
(1.6%) exhibited a cumulative weighted ROH length
≥100 Mb (3.3% of the human genome). These individuals originated from South Germany (10), North Germany and
Norway (6 each), Italy I, Spain I, Finland and The Netherlands
(3 each), Portugal (2), and from Austria, Denmark, the UK and
former Yugoslavia (1 each). As a consequence, particularly
high proportions of samples from Finland (6.4%), Norway
(11.5%) and Portugal (12.5%) were found to have at least
100 Mb of their genome located in ROHs. Twelve individuals
(0.5% of the total) had weighted ROHs comprising ≥150 Mb


Each chromosome was divided into bins of equal size. The average ROH count
per SNP per bin in the overall sample was subjected to a χ² goodness-of-fit test
over all chromosomes. The last line (bin size 250 Mb) contains the result of a
test for uniformity between chromosomes. df: degrees of freedom.


Table 3. Regions of at least 50 SNPs with high ROH frequency per SNP (ROH islands)

Chr.   Location (kb)               Size (kb)    No. SNPs   Mean (range) ROH          Known genes
                                                           frequency per SNP (%)
14     65,754.607–66,956.534       1,201.927    106        68.5 (61.0–68.8)          ATP6V1D, EIF2S1, FAM71D, GPHN, MPP5, PLEK2 etc.
4      33,305.316–34,167.260       861.944      58         62.2 (60.2–62.7)          AK093205, BC036345
3      50,382.348–51,835.857       1,453.509    101        55.6 (51.4–56.7)          DOCK3, CACNA2D2, HEMK1, KIAA0809, RAD54L2, TEX264, VPRBP etc.
12     110,249.612–111,461.573     1,211.961    90         42.9 (40.5–43.3)          ACAD10, ALDH2, ATXN2, BRAP, MAPKAPK5, PTPN11, TMEM116, TRAFD1 etc.
1      35,023.369–36,505.444       1,482.075    101        38.3 (33.2–39.0)          EIF2C3, KIAA0319L, NCDN, PSMB2, SFPQ, TEKT2, TRAP2E, ZMYM4 etc.
5      129,845.818–131,423.014     1,577.196    105        35.9 (30.9–37.4)          ACSL6, CDC42SE2, FNIP1, HINT1, LYRM7, RAPGEF6
11     47,998.479–49,391.209       1,392.730    114        32.9 (30.5–35.1)          FOLH1, OR4A47, OR4B1, OR4C3, OR4C45, OR4S1, OR4X1, OR4X2, PTPRJ
16     65,360.598–66,845.475       1,484.877    72         31.7 (30.2–32.2)          CBFB, CTCF, LCAT, NAE1, NFATC3, LRRC36, RANBP10 etc.
16     46,391.563–46,826.430       434.867      55         30.7 (30.2–31.0)          ABCC11, ABCC12, BC048130
10     74,211.870–75,086.795       874.925      59         33.6 (30.0–34.2)          ANXA7, CCDC109A, ECD, P4HA1, PPP3CB, TTC18, USP54 etc.

Given are the mean and the range of the ROH frequency per SNP in the overall sample, both taken over all SNPs in the respective region.
Table 4. Relationship between single-marker gene diversity and heterozygote
deficit, respectively, and ROH frequency per SNP
Subpopulation

Figure 2. Chromosome-specific ROH distribution in Europeans. Filled (open)
circle: Median (mean) ROH count per SNP in the overall sample. Bold line:
inter-quartile range of the ROH count per SNP in the overall sample.

(i.e. 5% of the genome), while four individuals (0.2% of the
total; originating from Denmark, North Germany and
the UK) had weighted ROHs comprising ≥200 Mb (6.7% of
the genome). In contrast, the subpopulation average of the
median-weighted ROH length per individual was found to
vary much less, ranging from 1.17 Mb in the Romanians
(SE: 0.02 Mb) to 1.31 Mb in the North and South Germans
(SE: 0.00 Mb), Norwegians and individuals from
ex-Yugoslavia (SE: 0.01 Mb) (Table 1). All individuals had
median-weighted ROH lengths between 1 and 2 Mb (see Supplementary Material, Fig. S1). Similar results were obtained
for the subpopulation average of the mean weighted ROH
length per individual (data not shown).
The subpopulation average of the weighted ROH number
per individual showed a strong and highly significant correlation with the latitude (Pearson's r = 0.84, P = 4.3 × 10⁻⁷;
Fig. 4), but not the longitude of the corresponding sampling
site (r = 0.05, P = 0.8). A similar, albeit less pronounced

Table 4 (continued)

Subpopulation   Gene diversity           Fixation index F           AUC
                P-value      OR          P-value          OR
AU              <10⁻¹⁰⁰      0.377       <10⁻¹⁰⁰          1.576     0.55
DK              <10⁻¹⁰⁰      0.321       <10⁻¹⁰⁰          1.629     0.55
FI              <10⁻¹⁰⁰      0.465       <10⁻¹⁰⁰          1.884     0.54
YU              <10⁻¹⁰⁰      0.367       <10⁻¹⁰⁰          1.527     0.55
FR              <10⁻¹⁰⁰      0.319       <10⁻¹⁰⁰          1.442     0.55
GR              <10⁻¹⁰⁰      0.376       <10⁻¹⁰⁰          1.894     0.55
HU              <10⁻¹⁰⁰      0.287       <10⁻¹⁰⁰          1.553     0.56
IR              <10⁻¹⁰⁰      0.331       <10⁻¹⁰⁰          1.691     0.55
IT1             <10⁻¹⁰⁰      0.371       <10⁻¹⁰⁰          1.457     0.55
IT2             <10⁻¹⁰⁰      0.349       8.1 × 10⁻³⁸      1.225     0.55
NO              <10⁻¹⁰⁰      0.440       <10⁻¹⁰⁰          1.916     0.54
PO              <10⁻¹⁰⁰      0.318       <10⁻¹⁰⁰          1.532     0.55
PG              <10⁻¹⁰⁰      0.412       <10⁻¹⁰⁰          1.958     0.56
RO              <10⁻¹⁰⁰      0.287       <10⁻¹⁰⁰          1.575     0.56
SE              <10⁻¹⁰⁰      0.330       <10⁻¹⁰⁰          1.538     0.55
SG              <10⁻¹⁰⁰      0.428       <10⁻¹⁰⁰          1.739     0.54
S1              <10⁻¹⁰⁰      0.339       <10⁻¹⁰⁰          1.855     0.55
S2              <10⁻¹⁰⁰      0.320       <10⁻¹⁰⁰          1.098     0.55
SW              <10⁻¹⁰⁰      0.304       1.1 × 10⁻⁹       1.935     0.56
CZ              <10⁻¹⁰⁰      0.344       <10⁻¹⁰⁰          1.046     0.55
UK              <10⁻¹⁰⁰      0.377       2.4 × 10⁻³       1.576     0.55
Mean                         0.354                        1.604     0.550
SD                           0.049                        0.261     0.005

Odds-ratios (OR) and P-values are from a logistic regression analysis of the
ROH frequency per SNP, using single-marker gene diversity and local fixation
index F as covariates. AUC, area-under-curve (for details, see text).

trend was observed for the subpopulation average of the cumulative weighted ROH length per individual (latitude: r = 0.61,
P = 1.8 × 10⁻³; longitude: r = −0.14, P = 0.5). Nevertheless, since the Finnish are known to be genetically quite distinct from other Europeans, and because some of the
Norwegian sampling sites included in our study (e.g. Førde)
also may have represented genetic isolates, it remained possible that the above correlations hinged mainly on a few
founder populations from the northern fringes of the continent.
However, exclusion of the Finnish and/or Norwegian samples
from our analysis hardly changed the observed correlation
between weighted ROH number and latitude (without


Figure 4. Geographic distribution of weighted ROHs in European genomes. White dots mark the location of the 23 sampling sites where individuals were
recruited into subpopulations (as defined in the text). (A) Subpopulation average of the weighted ROH number per individual; (B) subpopulation average
of the median-weighted ROH length (Mb) per individual. Contour maps were derived through spline interpolation.

FI: r = 0.85, P = 7.3 × 10⁻⁷; without NO: r = 0.86, P =
4.0 × 10⁻⁷; without FI and NO: r = 0.83, P = 3.5 × 10⁻⁶).
Similarly, the correlation with longitude remained non-significant upon the exclusion of the two subpopulations,
and the geographic distribution of the cumulative ROH
length was found to be similarly robust. Furthermore, when
those 40 individuals with a cumulative weighted ROH
length ≥100 Mb were excluded from the analysis in order to
avoid effects of recent cryptic inbreeding, the correlations
also hardly changed (for the weighted ROH number: latitude
r = 0.86, P = 1.8 × 10⁻⁷, longitude r = 0.09, P = 0.7; for
the cumulative weighted ROH length: latitude r = 0.69, P =
2.6 × 10⁻⁴, longitude r = −0.03, P = 0.9). In contrast to
the weighted number and the cumulative weighted length of
ROHs, the subpopulation average of the median-weighted
ROH length per individual showed only little and insignificant
correlation with latitude (r = 0.27, P = 0.2; Fig. 4) and longitude (r = −0.27; P = 0.2). Finally, a systematic modification
of the parameters used for ROH definition, in particular of the
number of SNPs required per ROH, the minimum ROH length
and the gap size allowed within ROHs, turned out to leave our
results largely unchanged (data not shown).
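The latitude correlation and the robustness check that drops the Finnish and Norwegian samples can be retraced from the subpopulation means in Table 1. The following Python sketch is illustrative only: it takes the latitude values exactly as printed in the table (an assumption about how they were used), the helper `pearson_r` is hypothetical, and the resulting coefficients need not agree with the reported ones to the last digit.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's product-moment correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# (code, latitude, mean weighted ROH number per individual), copied from Table 1.
table1 = [
    ("NO", 59.36, 42.16), ("SE", 59.51, 41.49), ("FI", 60.10, 48.04),
    ("IR", 53.19, 40.14), ("UK", 51.30, 38.51), ("DK", 55.40, 40.11),
    ("NE", 51.55, 38.79), ("NG", 54.14, 40.49), ("SG", 48.37, 38.39),
    ("AU", 47.16, 36.42), ("SW", 46.31, 37.64), ("FR", 45.46, 36.89),
    ("PG", 38.43, 34.15), ("S1", 40.25, 37.70), ("S2", 41.20, 36.37),
    ("I1", 41.53, 35.59), ("I2", 43.37, 34.60), ("YU", 44.49, 36.88),
    ("GR", 40.38, 33.69), ("HU", 47.27, 33.68), ("RO", 44.25, 32.55),
    ("PO", 52.15, 41.45), ("CZ", 50.04, 39.21),
]

# Correlation over all 23 subpopulations.
r_all = pearson_r([row[1] for row in table1], [row[2] for row in table1])

# Robustness check: exclude the Finnish and Norwegian samples.
subset = [row for row in table1 if row[0] not in ("FI", "NO")]
r_subset = pearson_r([row[1] for row in subset], [row[2] for row in subset])

print(round(r_all, 2), round(r_subset, 2))
```

Both coefficients remain strongly positive, which is the qualitative point made in the text: the gradient does not hinge on the two northernmost subpopulations.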


Figure 3. ROH frequency, local linkage disequilibrium and gene diversity per SNP in selected chromosomal regions in the North German (NG) subpopulation.
Regions were selected from the top of Table 3. Vertical gray dashed lines: region limits. Green horizontal bars: extent of individual ROHs. Black ticks: physical
location of analysed SNP. Green line: ROH frequency per SNP. Red line: average genotypic correlation within bins of approximately 200 kb (marked by gray
ticks). Blue line: gene diversity per SNP.


For spatial autocorrelation analysis, the great circle distances
between sampling sites were classified into 200 km intervals,
ranging from 0 to 1800 km. While between 3 and 49 subpopulation pairs fell into each class, the remaining 24 pairs that were
>1800 km apart were combined into a single residual class.
The subpopulation averages of both the weighted ROH
number and the cumulative weighted ROH length per individual showed significant and positive spatial autocorrelation at
small distances and significant but negative spatial autocorrelation at large distances (Fig. 5A and B). The subpopulation
average of the median-weighted ROH length per individual
showed a similar albeit non-significant trend (Fig. 5C).
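The distance classes described above can be sketched in a few lines of Python. The haversine formula, the mean Earth radius of 6371 km, decimal-degree input and the handling of class boundaries are all assumptions made for illustration; the text does not state how the great circle distances were actually computed.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def distance_class(distance_km, width_km=200.0, max_km=1800.0):
    """Index of the distance class: 0 for 0-200 km, 1 for 200-400 km, and so on.
    All distances above `max_km` share one residual class (index 9 by default)."""
    n_regular = int(max_km // width_km)
    if distance_km > max_km:
        return n_regular
    # A distance of exactly max_km is kept in the last regular class.
    return min(int(distance_km // width_km), n_regular - 1)


# Two hypothetical sites one degree of latitude apart (roughly 111 km).
d = great_circle_km(50.0, 10.0, 51.0, 10.0)
print(round(d, 1), distance_class(d))
```

Each pair of sampling sites would be assigned to a class in this way, and the autocorrelation statistic would then be evaluated separately within each class.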

DISCUSSION
At the level of the individual genome, the distribution of
SNP-defined ROHs was found in our study to be highly structured in all of the European subpopulations analysed. This


Figure 5. Spatial autocorrelograms of three characteristics of weighted ROHs
in European genomes. (A) Subpopulation average of the weighted ROH
number per individual; (B) subpopulation average of the cumulative weighted
ROH length (Mb) per individual; (C) subpopulation average of the median-weighted ROH length (Mb) per individual. Solid diamonds: P < 0.05; open
diamonds: not significant.

structure could not be explained solely by reduced marker
gene diversity or a localized heterozygote deficit. Although
both factors likely contribute to the formation of ROHs,
their impact was found to be small to moderate. This is not
surprising given that the inference of ROHs is based upon
features of multiple adjacent SNPs. As was demonstrated
here, increased regional LD contributes significantly to the
occurrence of ROHs but is not sufficient to explain their
presence.
While some evidence for a non-uniform distribution of
ROHs has been reported before, we were able to show that
this deviation from uniformity is geographically ubiquitous
and highly significant. Thus, while McQuillan et al. found
some ROHs on chromosome 1 to be relatively frequent in
the Scottish population (16), we observed that many of these
ROHs are not specific to the Scots, but are instead common
throughout Europe. However, the most prominent European
ROH on chromosome 1, located at positions 35.0–36.5 Mb,
apparently occurred only once in the Scottish sample.
Notably, we identified substantially more ROHs than Auton
et al. (12) who observed only a single ROH in their European
samples, located on chromosome 4 and overlapping with an
ROH identified in our study as well. Moreover, in their
entire collection of 3845 individuals of European, East
Asian, South Asian and Mexican origin, the same authors
identified only 39 different ROHs. In all likelihood,
however, the apparent discrepancy in ROH number between
their study and ours is attributable to a somewhat dissimilar
ROH definition employed by the two projects which, among
other parameters, differed in terms of the minimum allele frequency required and the type of inter-marker distance used
(genetic or physical).
We have shown for the first time that both the number and
the expanse of ROHs in individual human genomes, when
weighted by the local level of LD, are strongly correlated
with the latitude of the sample origin in Europe. This result
corroborates earlier findings of a continent-wide decrease in
human genetic diversity with increasing latitude (21,23,24).
It should be noted, however, that the earlier studies were
based upon the identity-by-state of homologous chromosomes,
not their identity-by-descent, and were therefore less specific
to human genealogy than our in-depth analysis of ROHs. In
any case, the consistently observed correlation in Europe
between genetic structure and latitude, but not longitude,
appears readily explicable in terms of human population
history. Thus, all three major migration episodes in Europe
are known to have followed a South to North gradient: (i)
the initial occupation by hunter-gatherers during the Palaeolithic, (ii) the post-glacial re-expansion during the Mesolithic
and (iii) the influx of farmers from the South-East during the
Neolithic. Apart from clinal migration, however, a high level
of autozygosity in Northern Europe could also have resulted
from a lower historic population density and/or a lower level
of individual mobility in this part of the continent, both of
which would have favoured the formation of ROHs as well.
Unfortunately, the geographic resolution of our data is not sufficient to allow any reliable discrimination between these
alternative explanatory scenarios. However, because of their
ubiquitous occurrence in Europe, it seems likely that at least
the point in time when the common ROH islands identified


MATERIALS AND METHODS


SNP genotyping
The genome-wide SNP data used in the present study have
been described in detail elsewhere (21). In brief, 2514 individuals from 23 different sampling sites in Europe were originally
genotyped for 500 568 SNPs, using the GeneChip Human
Mapping 500K Array Set (Affymetrix). The samples were
either population-based controls (25–29), originated from
population-representative cohorts (30), or were randomly
selected healthy volunteers (often blood donors). European
migrants from non-European regions were not included in
the analysis. Stringent quality control served to ensure the
non-relatedness of individuals and the representativeness of
samples for the respective sampling sites (i.e. genetic outliers
were removed prior to the analysis). SNPs were required to be
autosomal, to be polymorphic in at least one subpopulation, to
lack any significant deviation from Hardy–Weinberg equilibrium (P ≥ 0.05) in all subpopulations, to have a call-rate
>90% in all six genotyping centres involved and to possess
an rs number. These criteria left 2457 individuals (97.6%)
and 304 250 SNPs (60.8%) for the analysis.
ROH definition
We employed the default ROH definition of PLINK v1.06 (20)
(i.e. ROH length ≥1 Mb, ≥100 SNPs per ROH, ≥1 SNP per
50 kb within ROHs and a gap size ≤1 Mb within ROHs).
ROH screening was carried out adopting the default window
options (i.e. a 5 Mb window, 50 SNPs per window, at
most one of which was heterozygous, ≤5 missing SNPs and
a proportion of overlapping windows that must be homozygous ≥0.05). To adjust for the effects of LD upon ROH definition, previously suggested measures (16) of the extent of
ROHs in the human genome, namely their number, cumulative
length and median length per individual, were weighted by
one minus the average squared genotypic correlation coefficient g², taken over all marker pairs within the ROH (see
below).
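The LD weighting described above can be illustrated with a small Python sketch. Genotypes are assumed to be coded as 0/1/2 allele counts per individual, and the way the weights enter the summary measures (summing the weights for the weighted ROH number, summing weight × length for the cumulative weighted length) is one plausible reading rather than a documented procedure. All function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation of two genotype vectors (both must be polymorphic)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def roh_weight(genotypes_by_snp):
    """One minus the average squared genotypic correlation over all SNP pairs
    inside an ROH. `genotypes_by_snp` holds one genotype vector per SNP,
    each covering the same individuals of the subpopulation."""
    g2 = [pearson_r(a, b) ** 2 for a, b in combinations(genotypes_by_snp, 2)]
    return 1.0 - sum(g2) / len(g2)


def weighted_summary(rohs):
    """`rohs` is a list of (length_mb, weight) pairs for one individual.
    Returns the weighted ROH number and the cumulative weighted length
    (assumed here to be the sum of weights and of weight * length)."""
    weighted_number = sum(weight for _, weight in rohs)
    weighted_length = sum(length * weight for length, weight in rohs)
    return weighted_number, weighted_length


snp_a = [0, 1, 2, 1]
snp_b = [1, 0, 1, 2]   # uncorrelated with snp_a in this toy sample
w_independent = roh_weight([snp_a, snp_b])
w_redundant = roh_weight([snp_a, snp_a])
print(w_independent, w_redundant)
print(weighted_summary([(1.5, 1.0), (2.0, 0.5)]))
```

An ROH spanning mutually uncorrelated SNPs keeps its full weight, whereas an ROH made up of perfectly correlated (redundant) SNPs is weighted down to zero.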
Statistical analysis of the genomic ROH distribution
The genomic distribution of ROHs was analysed statistically
as follows. Each chromosome was divided into bins of given
size. Bins were only included into the analysis if they contained at least one SNP, which implied the exclusion of centromeric regions. The distribution of the mean ROH count per
SNP per bin was then tested for uniformity using a χ²
goodness-of-fit test with a number of degrees of freedom
equal to the number of bins minus one. Spearman's correlation
coefficient was used to quantify the impact of chromosome
length, SNP number and SNP density (i.e. average inter-marker distance) per chromosome upon the chromosome-wise
average of the ROH frequency per SNP. Correlation coefficients were calculated and tested for a significant difference
from zero using the cor and cor.test functions in R
v2.10.1 (31), respectively.
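The goodness-of-fit statistic for the binned data can be sketched as follows. The sketch assumes that the expected value under uniformity is the same in every bin and equals the average of the observed bin values; the function name is hypothetical, and the study's actual implementation is not described beyond the paragraph above.

```python
def chi_square_uniformity(bin_values):
    """Pearson chi-square statistic of observed bin values against an equal
    expected value in every bin, with degrees of freedom = bins - 1."""
    expected = sum(bin_values) / len(bin_values)
    statistic = sum((observed - expected) ** 2 / expected for observed in bin_values)
    degrees_of_freedom = len(bin_values) - 1
    return statistic, degrees_of_freedom


# Toy example: mean ROH count per SNP in three bins.
stat, df = chi_square_uniformity([5.0, 10.0, 15.0])
print(stat, df)
```

A P-value would then be read from the upper tail of the χ² distribution with the stated degrees of freedom, for instance with `scipy.stats.chi2.sf(stat, df)` if SciPy is available.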
ROH islands were identified as runs of adjacent SNPs with
an ROH frequency per SNP above a given threshold. The
potential impact upon ROH formation of single-marker gene
diversity and local heterozygote deficit relative to Hardy–Weinberg
expectation, measured by the F statistic, was assessed
by logistic regression analysis. To this end, the ROH frequency
per SNP was modelled as a logistic function of both covariates.
Regression analyses were carried out separately for each subpopulation. A receiver-operator-characteristic curve was constructed using
100 equidistant values of the ROH frequency per SNP, and
the corresponding AUC was determined by linear interpolation.
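As a generic illustration of how an AUC can be obtained from a receiver-operator-characteristic curve sampled at 100 equidistant thresholds and joined by straight lines, consider the Python sketch below. Applying the thresholds to a predicted score and using binary labels are assumptions; the exact definition of thresholds and outcome in the original analysis is not specified beyond the sentence above.

```python
def roc_auc(scores, labels, n_thresholds=100):
    """Area under an ROC curve evaluated at `n_thresholds` equidistant
    thresholds between the smallest and largest score, with linear
    interpolation (trapezoidal rule) between the sampled points.
    `labels` must contain both 0s and 1s."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    lowest, highest = min(scores), max(scores)
    points = [(0.0, 0.0)]  # threshold above every score: nothing called positive
    for k in range(n_thresholds):
        threshold = lowest + (highest - lowest) * k / (n_thresholds - 1)
        true_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        false_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((false_pos / n_neg, true_pos / n_pos))
    points.sort()  # order by false-positive rate, then true-positive rate
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


auc_perfect = roc_auc([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
auc_chance = roc_auc([0.1, 0.1, 0.9, 0.9], [0, 1, 0, 1])
print(auc_perfect, auc_chance)
```

A classifier that separates the two classes completely reaches an AUC of 1, whereas scores that carry no information about the class give 0.5, the reference value for random classification.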
The correlation between gene diversity and ROH frequency
per SNP was further analysed using averages of these values
taken over a sliding window of 1 Mb, moved along chromosomes in steps of 250 kb. Pearson's correlation coefficients

in our study first arose must have predated the differentiation
of the European subpopulations.
The intuitive appeal of the above conclusions notwithstanding, some local variation to the theme still seems obvious. It
may be argued, for example, that both Portugal and Spain
do not follow the generally observed South to North gradient
of ROH density (Fig. 4A) which could, among others, reflect a
recent influx from North Africa not present in other regions of
the continent. Furthermore, the sparse resolution of our data
also implies that any inference about the fringes of
Europe should be made with some caution (Fig. 4). For
example, the apparently reduced number and length of
ROHs predicted for Eastern Europe and the North-West of
the Iberian Peninsula may well represent interpolation artefacts because no sampling points of our study were located
in these regions.
In contrast to both the number and cumulative length of
ROHs, their average length and the location of ROH
islands were found to be remarkably similar in different subpopulations, suggesting that the latter must have been characterized by very similar genome-wide patterns of meiotic
recombination. Our results also indicate that recent population
growth (say 100 200 years ago) seems to have played only a
minor role in shaping the genomic distribution of ROHs in
Europeans, not the least because the variance-effective population size in Europe has grown much slower than the
census size. Therefore, the recent observation of an apparently
rapid decline in autozygosity among North Americans of
mixed European descent (17) may be more likely to reflect
the contribution of different European source populations
migrating into the USA at different times, rather than progressing urbanization.
In view of the increasing interest in deep-sequencing and
the shift in focus of genetic epidemiology from common to
rare variants as a likely cause of human complex disease, a
detailed understanding of the inheritance patterns prevalent
in different European subpopulations will become increasingly
important. One way in which rare variants can be inferred as
potentially causative of disease would be through the homozygosity of patients from presumably outbred populations.
However, the wide-spread occurrence of ROHs in contemporary Europeans requires such evaluations to be made in comparison to healthy controls, and our data highlight that at
least some matching for geographic origin will be required
for future ROH-based genetic studies to be both sensible and
valid.

2933

2934

Human Molecular Genetics, 2010, Vol. 19, No. 15

between averages were calculated separately for each subpopulation.


Since gametic phase information was lacking in our data
set, pair-wise LD was approximated by the squared genotypic
correlation coefficient g², rather than the squared allelic correlation coefficient r², and estimated using PLINK v1.06 (20)
with the --r2 option. We considered only pairs of markers
that were no further apart than 1 Mb and were separated by
no more than 100 SNPs. The correlation between LD and
ROH frequency per SNP was analysed using average values
taken over a sliding window of 1 Mb, moved along chromosomes in steps of 250 kb. Pearson's correlation coefficients
were again calculated separately for each subpopulation.
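
The squared genotypic correlation and the marker-pair filter described above can be sketched as follows. This is an illustrative Python reimplementation, not the PLINK code. Genotypes are assumed to be coded as allele counts (0, 1, 2) with NaN for missing calls, and "separated by no more than 100 SNPs" is read here as an index difference of at most 100, which is only one possible interpretation. All function names are hypothetical.

```python
import numpy as np

def genotypic_r2(g1, g2):
    """Squared Pearson correlation between the genotype codes of two SNPs."""
    g1 = np.asarray(g1, dtype=float)
    g2 = np.asarray(g2, dtype=float)
    # Use only individuals genotyped at both markers
    ok = ~(np.isnan(g1) | np.isnan(g2))
    a, b = g1[ok], g2[ok]
    # The correlation is undefined for monomorphic markers
    if a.size < 2 or a.std() == 0 or b.std() == 0:
        return float("nan")
    r = np.corrcoef(a, b)[0, 1]
    return float(r * r)

def eligible_pairs(positions_bp, max_dist_bp=1_000_000, max_snps_apart=100):
    """Yield index pairs (i, j), i < j, that satisfy both distance limits.

    positions_bp must be sorted in ascending order along one chromosome.
    """
    n = len(positions_bp)
    for i in range(n):
        for j in range(i + 1, min(n, i + max_snps_apart + 1)):
            if positions_bp[j] - positions_bp[i] > max_dist_bp:
                break  # positions are sorted, so later markers are further away
            yield i, j
```

Because no phase is needed, the measure can be computed directly from unphased genotype calls, which is the reason given in the text for using it in place of the allelic r².
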

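The sliding-window averaging and the subsequent correlation analysis can be illustrated by the sketch below, written in Python for brevity (the original analysis used R). It assumes that windows are anchored at the first SNP position on a chromosome and that empty windows are skipped. Neither detail is stated in the text, and the function names are hypothetical.

```python
import numpy as np

def window_means(pos_bp, values, window_bp=1_000_000, step_bp=250_000):
    """Mean of per-SNP values in windows of window_bp, moved in steps of step_bp."""
    pos_bp = np.asarray(pos_bp)
    values = np.asarray(values, dtype=float)
    # Window starts run from the first to the last SNP position (assumption)
    starts = np.arange(pos_bp.min(), pos_bp.max() + 1, step_bp)
    means = []
    for s in starts:
        in_window = (pos_bp >= s) & (pos_bp < s + window_bp)
        means.append(values[in_window].mean() if in_window.any() else np.nan)
    return np.array(means)

def windowed_pearson(pos_bp, x, y, window_bp=1_000_000, step_bp=250_000):
    """Pearson correlation between the window averages of two per-SNP quantities."""
    mx = window_means(pos_bp, x, window_bp, step_bp)
    my = window_means(pos_bp, y, window_bp, step_bp)
    # Drop windows that contain no SNPs
    ok = ~(np.isnan(mx) | np.isnan(my))
    return float(np.corrcoef(mx[ok], my[ok])[0, 1])
```

Since consecutive windows overlap by 750 kb, the window averages are not independent, which should be kept in mind when judging the significance of such a coefficient.
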
Statistical analysis of the geographic ROH distribution

R software v2.9.2 (31) was used for statistical analysis and for
creating graphs. The akima R package v0.5-2 (32) was used
for gridded bivariate cubic interpolation using splines (33).
The significance of the correlation of certain ROH characteristics with either longitude or latitude was assessed by a two-sided test at the 5% level, as implemented in the cor.test
function of the R stats package. Data on European geographic
boundaries were obtained from http://www.oceanteacher.org/.
Graphs were edited with Adobe Illustrator CS2. Spatial autocorrelation was analysed and correlograms were generated
using PASSAGE v1.1 (34).

SUPPLEMENTARY MATERIAL
Supplementary Material is available at HMG online.

ACKNOWLEDGEMENTS
All sample donors are gratefully acknowledged for their participation. We thank the following colleagues for their help
and support: P. Arp, M. Balascakova, C. Becker, A. van
Belkum, J. Bertranpetit, L.A. Bindoff, R. Borup, S. Brauer,
A. Caliebe, J. Chambers, D. Comas, G. Eckstein, H. von Eller-Eberstein, F.C. Nielsen, S. Freitag-Wolf, U. Gether, C. Gieger,
E. Haastrup, A. Hofman, G. Holmlund, W. van IJken,
M. Jhamai, O. Junge, K. King, E. Knipers, J. Kooner,
A. Kouvatsi, O. Lao, J. Laven, P. Lichtner, J. Lindemans,
M. Macek, T. Meitinger, I. Mollet, V. Mooser, P. Nurnberg,
J. Palo, W. Parson, R. Ploski, F. Rivadeneira, A. Ruther,
A. Sajantila, R. van Schaik, C. Schjerling, S. Schreiber,
E. Sijbrands, M. Simoons, B. Stricker, A. Tagliabracci, A.G.
Uitterlinden, H. Ullum, P. Vollenweider, G. Waeber,
D. Waterworth, T. Werge and H.-E. Wichmann. We also
thank M. Wittig for helpful discussions.
Conflict of Interest statement. None declared.

FUNDING
This work was supported by the Netherlands Forensic Institute
(to M.Ka.); by Affymetrix Inc. (to M.Ka., M.Kr.); by the
German Federal Ministry of Education and Research
(BMBF) through the National Genome Research Network
NGFNplus (01GS0809 to M.Kr., M.N.) and the German
Research Foundation (DFG)/BMBF through the Excellence
Cluster Inflammation at Interfaces (to M.N.). This study
received additional support by a grant from the Netherlands
Genomics Initiative (NGI)/Netherlands Organization for
Scientific Research (NWO) within the framework of the Forensic Genomics Consortium Netherlands (FGCN; www.forensicgenomics.nl/) (to M.Ka.). None of the funding organizations had any influence on the design, conduct, or conclusions of the study.

REFERENCES
1. Broman, K.W. and Weber, J.L. (1999) Long homozygous chromosomal
segments in reference families from the centre d'Etude du polymorphisme
humain. Am. J. Hum. Genet., 65, 1493–1500.
2. Hildebrandt, F., Heeringa, S.F., Ruschendorf, F., Attanasio, M., Nurnberg,
G., Becker, C., Seelow, D., Huebner, N., Chernin, G., Vlangos, C.N. et al.
(2009) A systematic approach to mapping recessive disease genes in
individuals from outbred populations. PLoS Genet., 5, e1000353.
3. Lander, E.S. and Botstein, D. (1987) Homozygosity mapping: a way to
map human recessive traits with the DNA of inbred children. Science,
236, 1567–1570.
4. Miano, M.G., Jacobson, S.G., Carothers, A., Hanson, I., Teague, P.,
Lovell, J., Cideciyan, A.V., Haider, N., Stone, E.M., Sheffield, V.C. et al.
(2000) Pitfalls in homozygosity mapping. Am. J. Hum. Genet., 67,
1348–1351.
5. Seelow, D., Schuelke, M., Hildebrandt, F. and Nurnberg, P. (2009)
HomozygosityMapper - an interactive approach to homozygosity
mapping. Nucleic Acids Res., 37, W593–W599.
6. Wang, S., Haynes, C., Barany, F. and Ott, J. (2009) Genome-wide
autozygosity mapping in human populations. Genet. Epidemiol., 33,
172–180.
7. Woods, C.G., Cox, J., Springell, K., Hampshire, D.J., Mohamed, M.D.,
McKibbin, M., Stern, R., Raymond, F.L., Sandford, R., Malik Sharif, S.
et al. (2006) Quantification of homozygosity in consanguineous
individuals with autosomal recessive disease. Am. J. Hum. Genet., 78,
889–896.
8. Jiang, H., Orr, A., Guernsey, D.L., Robitaille, J., Asselin, G., Samuels,
M.E. and Dube, M.P. (2009) Application of homozygosity haplotype
analysis to genetic mapping with high-density SNP genotype data. PLoS
ONE, 4, e5280.
9. Miyazawa, H., Kato, M., Awata, T., Kohda, M., Iwasa, H., Koyama, N.,
Tanaka, T., Huqun Kyo, S., Okazaki, Y. et al. (2007) Homozygosity
haplotype allows a genomewide search for the autosomal segments shared
among patients. Am. J. Hum. Genet., 80, 1090–1102.
10. Rosenberg, N.A. and Jakobsson, M. (2008) The relationship between
homozygosity and the frequency of the most frequent allele. Genetics,
179, 2027–2036.
11. Sabeti, P.C., Reich, D.E., Higgins, J.M., Levine, H.Z., Richter, D.J.,
Schaffner, S.F., Gabriel, S.B., Platko, J.V., Patterson, N.J., McDonald,
G.J. et al. (2002) Detecting recent positive selection in the human genome
from haplotype structure. Nature, 419, 832–837.
12. Auton, A., Bryc, K., Boyko, A.R., Lohmueller, K.E., Novembre, J.,
Reynolds, A., Indap, A., Wright, M.H., Degenhardt, J.D., Gutenkunst,
R.N. et al. (2009) Global distribution of genomic diversity underscores
rich complex history of continental human populations. Genome Res., 19,
795–803.
13. Gibson, J., Morton, N.E. and Collins, A. (2006) Extended tracts of
homozygosity in outbred human populations. Hum. Mol. Genet., 15,
789–795.
14. Li, L.H., Ho, S.F., Chen, C.H., Wei, C.Y., Wong, W.C., Li, L.Y., Hung,
S.I., Chung, W.H., Pan, W.H., Lee, M.T. et al. (2006) Long contiguous
stretches of homozygosity in the human genome. Hum. Mutat., 27,
1115–1121.
15. Wang, H., Lin, C.H., Service, S., Chen, Y., Freimer, N. and Sabatti, C.
(2006) Linkage disequilibrium and haplotype homozygosity in population
samples genotyped at a high marker density. Hum. Hered., 62, 175–189.
16. McQuillan, R., Leutenegger, A.L., Abdel-Rahman, R., Franklin, C.S.,
Pericic, M., Barac-Lauc, L., Smolej-Narancic, N., Janicijevic, B., Polasek,
O., Tenesa, A. et al. (2008) Runs of homozygosity in European
populations. Am. J. Hum. Genet., 83, 359–372.
17. Nalls, M.A., Simon-Sanchez, J., Gibbs, J.R., Paisan-Ruiz, C., Bras, J.T.,
Tanaka, T., Matarin, M., Scholz, S., Weitz, C., Harris, T.B. et al. (2009)
Measures of autozygosity in decline: globalization, urbanization, and its
implications for medical genetics. PLoS Genet., 5, e1000415.
18. Curtis, D., Vine, A.E. and Knight, J. (2008) Study of regions of extended
homozygosity provides a powerful method to explore haplotype structure
of human populations. Ann. Hum. Genet., 72, 261–278.
19. MacLeod, I.M., Meuwissen, T.H., Hayes, B.J. and Goddard, M.E. (2009)
A novel predictor of multilocus haplotype homozygosity: comparison
with existing predictors. Genet. Res., 91, 413–426.
20. Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A.,
Bender, D., Maller, J., Sklar, P., de Bakker, P.I., Daly, M.J. et al. (2007)
PLINK: a tool set for whole-genome association and population-based
linkage analyses. Am. J. Hum. Genet., 81, 559–575.
21. Lao, O., Lu, T.T., Nothnagel, M., Junge, O., Freitag-Wolf, S., Caliebe, A.,
Balascakova, M., Bertranpetit, J., Bindoff, L.A., Comas, D. et al. (2008)
Correlation between genetic and geographic structure in Europe. Curr.
Biol., 18, 1241–1248.
22. Beja-Pereira, A., Luikart, G., England, P.R., Bradley, D.G., Jann, O.C.,
Bertorelle, G., Chamberlain, A.T., Nunes, T.P., Metodiev, S., Ferrand, N.
et al. (2003) Gene-culture coevolution between cattle milk protein genes
and human lactase genes. Nat. Genet., 35, 311–313.
23. Heath, S.C., Gut, I.G., Brennan, P., McKay, J.D., Bencko, V., Fabianova,
E., Foretova, L., Georges, M., Janout, V., Kabesch, M. et al. (2008)
Investigation of the fine structure of European populations with
applications to disease association studies. Eur. J. Hum. Genet., 16,
1413–1429.
24. Novembre, J., Johnson, T., Bryc, K., Kutalik, Z., Boyko, A.R., Auton, A.,
Indap, A., King, K.S., Bergmann, S., Nelson, M.R. et al. (2008) Genes
mirror geography within Europe. Nature, 456, 98–101.
25. Hofman, A., Breteler, M.M., van Duijn, C.M., Krestin, G.P., Pols, H.A.,
Stricker, B.H., Tiemeier, H., Uitterlinden, A.G., Vingerling, J.R. and
Witteman, J.C. (2007) The Rotterdam Study: objectives and design
update. Eur. J. Epidemiol., 22, 819–829.
26. Hofman, A., Grobbee, D.E., de Jong, P.T. and van den Ouweland, F.A.
(1991) Determinants of disease and disability in the elderly: the
Rotterdam Elderly Study. Eur. J. Epidemiol., 7, 403–422.
27. Kayser, M., Liu, F., Janssens, A.C., Rivadeneira, F., Lao, O., van Duijn,
K., Vermeulen, M., Arp, P., Jhamai, M.M., van Ijcken, W.F. et al.
(2008) Three genome-wide association studies and a linkage analysis
identify HERC2 as a human iris color gene. Am. J. Hum. Genet., 82,
411–423.
28. Krawczak, M., Nikolaus, S., von Eberstein, H., Croucher, P.J., El
Mokhtari, N.E. and Schreiber, S. (2006) PopGen: population-based
recruitment of patients and controls for the analysis of complex
genotype-phenotype relationships. Community Genet., 9, 55–61.
29. Lowel, H., Doring, A., Schneider, A., Heier, M., Thorand, B., Meisinger,
C. and Group, M.K.S. (2005) The MONICA Augsburg surveys - basis for
prospective cohort studies. Gesundheitswesen, 67, S13–S18.
30. Nelson, M.R., Bryc, K., King, K.S., Indap, A., Boyko, A.R., Novembre, J.,
Briley, L.P., Maruyama, Y., Waterworth, D.M., Waeber, G. et al. (2008)
The Population Reference Sample, POPRES: a resource for population,
disease, and pharmacological genetics research. Am. J. Hum. Genet., 83,
347–358.
31. R Development Core Team. (2009) R Foundation for Statistical
Computing. Vienna, Austria.
32. Akima, H., Gebhardt, A., Petzoldt, T. and Maechler, M. (2009)
Interpolation of irregularly spaced data. R package version 0.5-2.
http://CRAN.R-project.org/package=akima.
33. Akima, H. (1996) Algorithm 761: scattered-data surface fitting that has
the accuracy of a cubic polynomial. ACM Trans. Math. Software, 22,
362–371.
34. Rosenberg, M.S. (2001) Pattern Analysis, Spatial Statistics, and
Geographic Exegesis, Version 1.1, A.S.U. Department of Biology,
Tempe, AZ, USA.
