
Downloaded from http://hmg.oxfordjournals.org at University of Sydney on July 18, 2010

Human Molecular Genetics, 2010, Vol. 19, No. 15

doi:10.1093/hmg/ddq198

Advance Access published on May 12, 2010

2927–2935

Genomic and geographic distribution of SNP-defined runs of homozygosity in Europeans

Michael Nothnagel 1, Timothy Tehua Lu 1, Manfred Kayser 2 and Michael Krawczak 1

1 Institute of Medical Informatics and Statistics, Christian-Albrechts University, 24105 Kiel, Germany and 2 Department of Forensic Molecular Biology, Erasmus University Medical Center, 3015 GE Rotterdam, The Netherlands

Received March 10, 2010; Revised April 28, 2010; Accepted May 6, 2010

The availability of high-density panels of genetic polymorphisms has led to the discovery of extended regions of apparent autozygosity in the human genome. At the genotype level, these regions present as sizeable stretches, or ‘runs’, of homozygosity (ROH). Here, we investigated both the genomic and the geographic distribution of ROHs in a large European sample of individuals originating from 23 subpopulations. The genomic ROH distribution was found to be characterized by a pattern of highly significant non-uniformity that was virtually identical in all subpopulations studied. Some 77 chromosomal regions contained ROHs at considerable frequency, thereby forming ‘ROH islands’ that were not explicable by high linkage disequilibrium alone. At the geographic level, the number and cumulative length of ROHs followed a prominent South to North gradient in agreement with expectations from European population history. The individual ROH length, in contrast, showed only minor and unsystematic geographic variation. While our findings are thus consistent with a larger effective population size in Southern than in Northern Europe, combined with a higher historic population density and mobility, they also indicate that the patterns of meiotic recombination in humans must have been very similar throughout the continent. Extending previous reports of a strong correlation between geography and identity-by-state, our data show that the genomic identity-by-descent patterns of Europeans are also clinal. As a consequence, the planning, design and interpretation of ROH-based genetic studies must take sample origin into account in order for such studies to be sensible and valid.

INTRODUCTION

The availability of high-density panels of genetic polymorphisms has led to the recent discovery of extended regions of autozygosity in the human genome. At the genotype level, these regions present as sizeable stretches, or ‘runs’, of homozygosity (ROH) (1). Increased levels of autozygosity have long been implicated as a cause of the higher prevalence of recessive diseases in small and isolated populations. Initially, ROH analysis was thus endorsed successfully as a means to map recessive disease genes (2–7), but ROHs may also be useful for disease gene identification under other genetic models (8,9), may be indicative of selective sweeps (10,11) and should be interesting from a human population genetics point of view. An abundance of ROHs was first demonstrated in Europeans using short tandem repeat markers (1). Subsequent analyses of single nucleotide polymorphisms (SNPs) in large European, Asian and African samples revealed a wide-spread occurrence of ROHs comprising >1 Mb, but the same studies also highlighted population differences in terms of the cumulative ROH length and a positive correlation between ROH number and the population-specific level of consanguinity (12–14). In addition, ROHs were found to preferentially occur in regions of decreased recombination activity. Wang et al. (15) observed a non-uniform distribution of ROHs on chromosome 22 in 11 population isolates. At a more localized level, McQuillan et al. (16) found that the cumulative ROH length per genome was larger in two isolated than in two non-isolate European populations, but no significant difference was seen between outbred populations from rural and urban areas of Scotland. These results suggested that prolonged isolation and a reduced population size may play a crucial role in the formation of ROHs, as would be predicted by population genetics theory. When focusing upon regions of low linkage disequilibrium (LD) in the genomes of North Americans of


Table 1. Summary of weighted ROH statistics in 23 European subpopulations

| Subpopulation (sampling site) | Code | Sample size | Latitude | Longitude | Weighted ROH number per individual, mean ± SD | Median weighted ROH length per individual, mean ± SD (Mb) |
|---|---|---|---|---|---|---|
| Norway (Førde) | NO | 52 | 59.36 | 5.28 | 42.16 ± 6.72 | 1.31 ± 0.08 |
| Sweden (Uppsala) | SE | 46 | 59.51 | 17.38 | 41.49 ± 6.37 | 1.27 ± 0.07 |
| Finland (Helsinki) | FI | 47 | 60.10 | 24.56 | 48.04 ± 7.34 | 1.30 ± 0.08 |
| Ireland | IR | 35 | 53.19 | −6.15 | 40.14 ± 5.02 | 1.28 ± 0.08 |
| UK (London) | UK | 194 | 51.30 | −0.07 | 38.51 ± 6.30 | 1.30 ± 0.09 |
| Denmark (Copenhagen) | DK | 59 | 55.40 | 12.34 | 40.11 ± 6.26 | 1.30 ± 0.09 |
| Netherlands (Rotterdam) | NE | 280 | 51.55 | 4.28 | 38.79 ± 6.83 | 1.30 ± 0.08 |
| Germany I (Kiel) | NG | 494 | 54.14 | 10.04 | 40.49 ± 6.25 | 1.31 ± 0.08 |
| Germany II (Augsburg) | SG | 489 | 48.37 | 10.89 | 38.39 ± 6.09 | 1.31 ± 0.08 |
| Austria (Tyrol) | AU | 50 | 47.16 | 11.23 | 36.42 ± 6.39 | 1.29 ± 0.09 |
| Switzerland (Lausanne) | SW | 133 | 46.31 | 6.37 | 37.64 ± 5.89 | 1.29 ± 0.08 |
| France (Lyon) | FR | 50 | 45.46 | 4.50 | 36.89 ± 6.71 | 1.28 ± 0.10 |
| Portugal | PG | 16 | 38.43 | −9.08 | 34.15 ± 7.56 | 1.26 ± 0.08 |
| Spain I | S1 | 81 | 40.25 | −3.42 | 37.70 ± 5.89 | 1.30 ± 0.08 |
| Spain II (Barcelona) | S2 | 47 | 41.20 | 2.10 | 36.37 ± 6.94 | 1.29 ± 0.10 |
| Italy I | I1 | 106 | 41.53 | 13.68 | 35.59 ± 6.19 | 1.30 ± 0.09 |
| Italy II (Marches) | I2 | 49 | 43.37 | 13.30 | 34.60 ± 5.11 | 1.28 ± 0.07 |
| Former Yugoslavia | YU | 55 | 44.49 | 20.30 | 36.88 ± 5.72 | 1.31 ± 0.08 |
| Northern Greece | GR | 51 | 40.38 | 22.27 | 33.69 ± 5.22 | 1.28 ± 0.11 |
| Hungary | HU | 17 | 47.27 | 19.06 | 33.68 ± 5.25 | 1.25 ± 0.07 |
| Romania | RO | 12 | 44.25 | 26.07 | 32.55 ± 4.48 | 1.17 ± 0.06 |
| Poland (Warsaw) | PO | 49 | 52.15 | 21.01 | 41.45 ± 5.98 | 1.29 ± 0.07 |
| Czech Republic (Prague) | CZ | 45 | 50.04 | 14.28 | 39.21 ± 5.11 | 1.27 ± 0.06 |
| Total | | 2457 | | | 38.74 ± 6.60 | 1.30 ± 0.08 |

Given are the number of samples after data cleaning (21), the geographical location of the sampling sites (‘subpopulations’), the subpopulation-specific mean ± SD of the weighted number of ROHs and of the median-weighted ROH length per individual.


Figure 1. Distribution of ROHs in European genomes. For each SNP, the ROH frequency per SNP in the overall sample is plotted at its physical location. ROHs that occurred in more than 50% of individuals were identified on chromosomes 3, 4 and 14 (Table 3).

European descent, Nalls et al. (17) observed a 14% decrease in the frequency of ROHs and a 24% decrease in the cumulative ROH length per individual over a time-span of just one century, and suggested panmixia and an increased effective population size as the likely causes of these changes.

Various definitions of ROHs have been proposed in the past (12,13,15,18–20). Here, we applied the widely used ROH definition implemented in PLINK (20) to a set of genome-wide SNP data that had previously formed the basis of comprehensive population genetic analyses (21). The data included quality-controlled genotypes, at 304 250 autosomal SNPs, of 2457 unrelated individuals originating from 23 sampling sites in Europe (henceforth termed ‘subpopulations’; Table 1). We characterized both the genomic and the geographic distribution of SNP-defined ROHs in these individuals and sought possible explanations of the observed ROH patterns.

RESULTS

Genomic ROH distribution

Using the default ROH definition of PLINK (see Materials and Methods), the genome-wide median number of individuals for whom a given SNP was included in an ROH (henceforth termed ‘ROH count per SNP’) was found to be 17, with an inter-quartile range (IQR) of 7–47 (corresponding to a median ‘ROH frequency per SNP’ of 0.7% of individuals, IQR: 0.3–1.9%). Most SNPs fell into an ROH at least once. Intriguingly, the genomic distribution of the ROH frequency per SNP deviated substantially from uniformity. Thus, ROHs were clearly abundant in particular genomic regions (Fig. 1), the location of which was virtually identical in all subpopulations studied. This non-uniformity was highly significant when assessed by a χ² goodness-of-fit test of the binned average ROH count per SNP in the overall sample (Table 2). Considering genomic regions of 50 adjacent


Table 2. Statistical analysis of the genomic ROH distribution

| Bin size (Mb) | χ² | df | P-value |
|---|---|---|---|
| 1 | 558199.8639 | 2696 | <10⁻¹⁰⁰ |
| 5 | 53382.4379 | 555 | <10⁻¹⁰⁰ |
| 10 | 15142.9161 | 284 | <10⁻¹⁰⁰ |
| 20 | 4768.6176 | 146 | <10⁻¹⁰⁰ |
| 50 | 764.6029 | 62 | <10⁻¹⁰⁰ |
| 100 | 239.1148 | 35 | 3.1 × 10⁻³² |
| 250 | 98.7624 | 21 | 4.8 × 10⁻¹² |

Each chromosome was divided into bins of equal size. The average ROH count per SNP per bin in the overall sample was subjected to a χ² goodness-of-fit test over all chromosomes. The last line (bin size 250 Mb) contains the result of a test for uniformity between chromosomes. df: degrees of freedom.

SNPs, a total of 10 regions could be identified in which all SNPs had an ROH frequency per SNP ≥30% in the European population (Table 3) and some 77 regions in which the ROH frequency per SNP exceeded 10% (see Supplementary Material, Table S1). At least two of these ‘ROH islands’ on chromosome 1 (at positions 48.8–51.1 Mb, and 170.1–171.7 Mb) appear to coincide with the ROH clusters previously identified in the Scottish population (16), although no information as to the exact limits of the respective regions was reported. We also identified a very common ROH island on chromosome 4 that has been described for European samples before (12). Notably, three ROH islands were found in our study to be present in more than 50% of individuals (Table 3). These ROH islands were located on chromosomes 3, 4 and 14. We also noted that the lactase (LCT) gene on chromosome 2q21, known to have been subject to strong selection in Europe (22), is located in an ROH island.

ROHs appear to be a phenomenon predominantly of large chromosomes (Fig. 2). Thus, the chromosome-wise median ROH frequency per SNP in the overall sample was strongly correlated with both the length of the chromosome (Spearman’s r = 0.915; P = 2.6 × 10⁻⁹) and the number of SNPs on the chromosome (r = 0.919; P = 1.6 × 10⁻⁹), but not with the chromosome-specific SNP density (r = 0.097; P = 0.666).

We next investigated whether the occurrence of ROHs was explicable in terms of reduced single-marker gene diversity and/or a heterozygote deficit relative to Hardy–Weinberg expectation. To this end, we performed a logistic regression analysis of the ROH frequency per SNP, using the gene diversity of the SNP and the local estimate of fixation index F as covariates. Logistic regression analysis was carried out separately for all but the three largest subpopulations (DE1, DE2 and NE), for which the computer memory requirements turned out to be prohibitive. In all 20 subpopulations considered, the contribution of the two covariates was highly significant (P < 10⁻¹⁰⁰ for most subpopulations), with decreasing gene diversity and increasing F resulting in an increased ROH frequency per SNP (Table 4). However, when we assessed the ability of the regression models to predict the inclusion of a given SNP into an ROH, the ensuing area-under-curve (AUC) values were found to range from 0.53 to 0.56 in different subpopulations (Table 4), which is in fact only marginally better than random classification (AUC = 0.50).

The definition of an ROH does not only depend upon the properties of single markers but takes adjacent SNPs simultaneously into account. We therefore correlated the average single-marker gene diversity, taken over a sliding 1 Mb window, with the average ROH frequency per SNP in that window. Although an unambiguously negative correlation emerged (genome-wide Pearson’s r = −0.268 ± 0.018; range of subpopulation-specific r-values: −0.230 to −0.293), the size of the observed correlation implied that gene diversity explains only 7% (coefficient of determination r² = 0.072) of the variation in ROH frequency per SNP.

We also investigated the effects of LD upon the definition of ROHs. To this end, we correlated the average ROH frequency per SNP, taken over a sliding 1 Mb window, with the average of the squared genotypic correlation coefficient g² within this window. The average of the genome-wide subpopulation-specific Pearson correlation coefficients was 0.453 ± 0.024, with a range of 0.382 (PG) to 0.503 (S2). Thus, 20% (i.e. 0.453² = 0.205) of the variation in ROH frequency per SNP could be explained by the extent of LD in the vicinity of a given marker. A correlation between LD and ROH prevalence became particularly apparent for the three genomic regions (on chromosomes 3, 4 and 14) with the highest ROH frequency per SNP (Fig. 3). However, increased LD in the vicinity of a given SNP was neither necessary nor sufficient for SNPs to be included in an ROH.
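The ‘ROH island’ criterion used earlier in this section (runs of at least 50 adjacent SNPs whose ROH frequency per SNP all reach a given threshold) can be illustrated with a minimal Python sketch. The function name, its arguments and the toy data are hypothetical and serve illustration only; this is not the code used in the study.

```python
from typing import List, Tuple


def find_roh_islands(freqs: List[float], threshold: float,
                     min_snps: int = 50) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index pairs of maximal runs of adjacent
    SNPs whose ROH frequency is >= threshold and that contain >= min_snps SNPs.

    `freqs` holds one ROH frequency per SNP, ordered by physical position
    on a single chromosome (an assumption of this sketch).
    """
    islands = []
    run_start = None
    for i, f in enumerate(freqs):
        if f >= threshold:
            if run_start is None:
                run_start = i          # a new candidate run begins here
        else:
            if run_start is not None and i - run_start >= min_snps:
                islands.append((run_start, i - 1))
            run_start = None           # run interrupted by a low-frequency SNP
    # close a run that extends to the end of the chromosome
    if run_start is not None and len(freqs) - run_start >= min_snps:
        islands.append((run_start, len(freqs) - 1))
    return islands


# Toy data: a 60-SNP stretch at 35% qualifies; a 20-SNP stretch at 40% is too short.
toy = [0.05] * 10 + [0.35] * 60 + [0.02] * 5 + [0.40] * 20
print(find_roh_islands(toy, threshold=0.30))  # [(10, 69)]
```

Lowering the threshold from 0.30 to 0.10 in such a scan corresponds to moving from the 10 regions of Table 3 to the larger set of regions listed in the Supplementary Material.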

Geographical pattern of ROH distribution in Europe

LD can act as a potential confounder in comparative ROH analyses of different populations because the local level of LD determines the effective number of SNPs used for ROH definition. When characterizing the geographic distribution of ROHs, we therefore weighted individual ROHs by their internal level of LD, approximated by one minus the average of the pair-wise squared genotypic correlation coefficient g² (see Materials and Methods).

The weighted ROH number per individual ranged from 10.5 to 60.4 in the overall sample, with all subpopulation-specific IQRs falling between 25 and 55 (see Supplementary Material, Fig. S1). The subpopulation average of the weighted ROH number per individual varied between 32.55 (standard error, SE: 1.3) in the Romanians and 48.0 (SE: 1.1) in the Finns (Table 1). Similarly, the subpopulation average of the cumulative weighted ROH length per individual ranged from 49.7 Mb (SE: 2.3 Mb) in the Romanians to 81.5 Mb (SE: 2.2 Mb) in the Finns. Of the 2457 individuals analysed, 40 (1.6%) exhibited a cumulative weighted ROH length ≥100 Mb (3.3% of the human genome). These individuals originated from South Germany (10), North Germany and Norway (6 each), Italy I, Spain I, Finland and The Netherlands (3 each), Portugal (2), and from Austria, Denmark, the UK and former Yugoslavia (1 each). As a consequence, particularly high proportions of samples from Finland (6.4%), Norway (11.5%) and Portugal (12.5%) were found to have at least 100 Mb of their genome located in ROHs. Twelve individuals (0.5% of the total) had weighted ROHs comprising ≥150 Mb


Table 3. Regions of at least 50 SNPs with high ROH frequency per SNP (‘ROH islands’)


| Chr. | Location/size (kb) | No. SNPs | Mean (range) ROH frequency per SNP (%) | Known genes |
|---|---|---|---|---|
| 14 | 65,754.607–66,956.534/1,201.927 | 106 | 68.5 (61.0–68.8) | ATP6V1D, EIF2S1, FAM71D, GPHN, MPP5, PLEK2 etc. |
| 4 | 33,305.316–34,167.260/861.944 | 58 | 62.2 (60.2–62.7) | AK093205, BC036345 |
| 3 | 50,382.348–51,835.857/1,453.509 | 101 | 55.6 (51.4–56.7) | DOCK3, CACNA2D2, HEMK1, KIAA0809, RAD54L2, TEX264, VPRBP etc. |
| 12 | 110,249.612–111,461.573/1,211.961 | 90 | 42.9 (40.5–43.3) | ACAD10, ALDH2, ATXN2, BRAP, MAPKAPK5, PTPN11, TMEM116, TRAFD1 etc. |
| 1 | 35,023.369–36,505.444/1,482.075 | 101 | 38.3 (33.2–39.0) | EIF2C3, KIAA0319L, NCDN, PSMB2, SFPQ, TEKT2, TRAP2E, ZMYM4 etc. |
| 5 | 129,845.818–131,423.014/1,577.196 | 105 | 35.9 (30.9–37.4) | ACSL6, CDC42SE2, FNIP1, HINT1, LYRM7, RAPGEF6 |
| 11 | 47,998.479–49,391.209/1,392.730 | 114 | 32.9 (30.5–35.1) | FOLH1, OR4A47, OR4B1, OR4C3, OR4C45, OR4S1, OR4X1, OR4X2, PTPRJ |
| 16 | 65,360.598–66,845.475/1,484.877 | 72 | 31.7 (30.2–32.2) | CBFB, CTCF, LCAT, NAE1, NFATC3, LRRC36, RANBP10 etc. |
| 16 | 46,391.563–46,826.430/434.867 | 55 | 30.7 (30.2–31.0) | ABCC11, ABCC12, BC048130 |
| 10 | 74,211.870–75,086.795/874.925 | 59 | 33.6 (30.0–34.2) | ANXA7, CCDC109A, ECD, P4HA1, PPP3CB, TTC18, USP54 etc. |

Given are the mean and the range of the ROH frequency per SNP in the overall sample, both taken over all SNPs in the respective region.


Figure 2. Chromosome-specific ROH distribution in Europeans. Filled (open) circle: Median (mean) ROH count per SNP in the overall sample. Bold line: inter-quartile range of the ROH count per SNP in the overall sample.

(i.e. 5% of the genome), while four individuals (0.2% of the total; originating from Denmark, North Germany and the UK) had weighted ROHs comprising ≥200 Mb (6.7% of the genome).

In contrast, the subpopulation average of the median-weighted ROH length per individual was found to vary much less, ranging from 1.17 Mb in the Romanians (SE: 0.02 Mb) to 1.31 Mb in the North and South Germans (SE: 0.00 Mb), Norwegians and individuals from ex-Yugoslavia (SE: 0.01 Mb) (Table 1). All individuals had median-weighted ROH lengths between 1 and 2 Mb (see Supplementary Material, Fig. S1). Similar results were obtained for the subpopulation average of the mean weighted ROH length per individual (data not shown).

The subpopulation average of the weighted ROH number per individual showed a strong and highly significant correlation with the latitude (Pearson’s r = 0.84, P = 4.3 × 10⁻⁷; Fig. 4), but not the longitude of the corresponding sampling site (r = 0.05, P = 0.8). A similar, albeit less pronounced

Table 4. Relationship between single-marker gene diversity and heterozygote deficit, respectively, and ROH frequency per SNP

| Subpopulation | Gene diversity: P-value | Gene diversity: OR | Fixation index F: P-value | Fixation index F: OR | AUC |
|---|---|---|---|---|---|
| AU | <10⁻¹⁰⁰ | 0.377 | <10⁻¹⁰⁰ | 1.576 | 0.55 |
| DK | <10⁻¹⁰⁰ | 0.321 | <10⁻¹⁰⁰ | 1.629 | 0.55 |
| FI | <10⁻¹⁰⁰ | 0.465 | <10⁻¹⁰⁰ | 1.884 | 0.54 |
| YU | <10⁻¹⁰⁰ | 0.367 | <10⁻¹⁰⁰ | 1.527 | 0.55 |
| FR | <10⁻¹⁰⁰ | 0.319 | <10⁻¹⁰⁰ | 1.442 | 0.55 |
| GR | <10⁻¹⁰⁰ | 0.376 | <10⁻¹⁰⁰ | 1.894 | 0.55 |
| HU | <10⁻¹⁰⁰ | 0.287 | <10⁻¹⁰⁰ | 1.553 | 0.56 |
| IR | <10⁻¹⁰⁰ | 0.331 | <10⁻¹⁰⁰ | 1.691 | 0.55 |
| IT1 | <10⁻¹⁰⁰ | 0.371 | <10⁻¹⁰⁰ | 1.457 | 0.55 |
| IT2 | <10⁻¹⁰⁰ | 0.349 | 8.1 × 10⁻³⁸ | 1.225 | 0.55 |
| NO | <10⁻¹⁰⁰ | 0.440 | <10⁻¹⁰⁰ | 1.916 | 0.54 |
| PO | <10⁻¹⁰⁰ | 0.318 | <10⁻¹⁰⁰ | 1.532 | 0.55 |
| PG | <10⁻¹⁰⁰ | 0.412 | <10⁻¹⁰⁰ | 1.958 | 0.56 |
| RO | <10⁻¹⁰⁰ | 0.287 | <10⁻¹⁰⁰ | 1.575 | 0.56 |
| SE | <10⁻¹⁰⁰ | 0.330 | <10⁻¹⁰⁰ | 1.538 | 0.55 |
| SG | <10⁻¹⁰⁰ | 0.428 | <10⁻¹⁰⁰ | 1.739 | 0.54 |
| S1 | <10⁻¹⁰⁰ | 0.339 | <10⁻¹⁰⁰ | 1.855 | 0.55 |
| S2 | <10⁻¹⁰⁰ | 0.320 | <10⁻¹⁰⁰ | 1.098 | 0.55 |
| SW | <10⁻¹⁰⁰ | 0.304 | 1.1 × 10⁻⁹ | 1.935 | 0.56 |
| CZ | <10⁻¹⁰⁰ | 0.344 | <10⁻¹⁰⁰ | 1.046 | 0.55 |
| UK | <10⁻¹⁰⁰ | 0.377 | 2.4 × 10⁻³ | 1.576 | 0.55 |
| Mean | | 0.354 | | 1.604 | 0.550 |
| SD | | 0.049 | | 0.261 | 0.005 |

Odds-ratios (OR) and P-values are from a logistic regression analysis of the ROH frequency per SNP, using single-marker gene diversity and local fixation index F as covariates. AUC, area-under-curve (for details, see text).

trend was observed for the subpopulation average of the cumulative weighted ROH length per individual (latitude: r = 0.61, P = 1.8 × 10⁻³; longitude: r = −0.14, P = 0.5). Nevertheless, since the Finnish are known to be genetically quite distinct from other Europeans, and because some of the Norwegian sampling sites included in our study (e.g. Førde) also may have represented genetic isolates, it remained possible that the above correlations hinged mainly on a few founder populations from the northern fringes of the continent. However, exclusion of the Finnish and/or Norwegian samples from our analysis hardly changed the observed correlation between weighted ROH number and latitude (without


Figure 3. ROH frequency, local linkage disequilibrium and gene diversity per SNP in selected chromosomal regions in the North German (NG) subpopulation. Regions were selected from the top of Table 3. Vertical gray dashed lines: region limits. Green horizontal bars: extent of individual ROHs. Black ticks: physical location of analysed SNP. Green line: ROH frequency per SNP. Red line: average genotypic correlation within bins of approximately 200 kb (marked by gray ticks). Blue line: gene diversity per SNP.


Figure 4. Geographic distribution of weighted ROHs in European genomes. White dots mark the location of the 23 sampling sites where individuals were recruited into ‘subpopulations’ (as defined in the text). (A) Subpopulation average of the weighted ROH number per individual; (B) subpopulation average of the median-weighted ROH length (Mb) per individual. Contour maps were derived through spline interpolation.

FI: r = 0.85, P = 7.3 × 10⁻⁷; without NO: r = 0.86, P = 4.0 × 10⁻⁷; without FI and NO: r = 0.83, P = 3.5 × 10⁻⁶).

Similarly, the correlation with longitude remained non-significant upon the exclusion of the two subpopulations, and the geographic distribution of the cumulative ROH length was found to be similarly robust. Furthermore, when those 40 individuals with a cumulative weighted ROH length ≥100 Mb were excluded from the analysis in order to avoid effects of recent cryptic inbreeding, the correlations also hardly changed (for the weighted ROH number: latitude r = 0.86, P = 1.8 × 10⁻⁷, longitude r = 0.09, P = 0.7; for the cumulative weighted ROH length: latitude r = 0.69, P = 2.6 × 10⁻⁴, longitude r = −0.03, P = 0.9). In contrast to the weighted number and the cumulative weighted length of ROHs, the subpopulation average of the median-weighted ROH length per individual showed only little and insignificant correlation with latitude (r = 0.27, P = 0.2; Fig. 4) and longitude (r = −0.27; P = 0.2). Finally, a systematic modification of the parameters used for ROH definition, in particular of the number of SNPs required per ROH, the minimum ROH length and the gap size allowed within ROHs, turned out to leave our results largely unchanged (data not shown).
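The robustness check described above (recomputing the latitude correlation after leaving out the Finnish and/or Norwegian samples) can be sketched in a few lines of Python. The latitudes and subpopulation means are copied from Table 1 and taken at face value as decimal numbers, which is an assumption of this sketch; the function names are hypothetical, and no P-values are computed.

```python
import math
from typing import Dict, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# code -> (latitude, mean weighted ROH number per individual), from Table 1
TABLE1: Dict[str, Tuple[float, float]] = {
    "NO": (59.36, 42.16), "SE": (59.51, 41.49), "FI": (60.10, 48.04),
    "IR": (53.19, 40.14), "UK": (51.30, 38.51), "DK": (55.40, 40.11),
    "NE": (51.55, 38.79), "NG": (54.14, 40.49), "SG": (48.37, 38.39),
    "AU": (47.16, 36.42), "SW": (46.31, 37.64), "FR": (45.46, 36.89),
    "PG": (38.43, 34.15), "S1": (40.25, 37.70), "S2": (41.20, 36.37),
    "I1": (41.53, 35.59), "I2": (43.37, 34.60), "YU": (44.49, 36.88),
    "GR": (40.38, 33.69), "HU": (47.27, 33.68), "RO": (44.25, 32.55),
    "PO": (52.15, 41.45), "CZ": (50.04, 39.21),
}


def latitude_correlation(data: Dict[str, Tuple[float, float]],
                         exclude: Sequence[str] = ()) -> float:
    """Correlate latitude with mean weighted ROH number, skipping `exclude`."""
    kept = [v for code, v in data.items() if code not in exclude]
    return pearson([lat for lat, _ in kept], [roh for _, roh in kept])


for excluded in [(), ("FI",), ("NO",), ("FI", "NO")]:
    r = latitude_correlation(TABLE1, exclude=excluded)
    print(f"excluded={excluded or 'none'}: r = {r:.2f}")
```

Because the coordinates are used exactly as printed, small deviations from the published coefficients would not be surprising.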


Figure 5. Spatial autocorrelograms of three characteristics of weighted ROHs in European genomes. (A) Subpopulation average of the weighted ROH number per individual; (B) subpopulation average of the cumulative weighted ROH length (Mb) per individual; (C) subpopulation average of the median-weighted ROH length (Mb) per individual. Solid diamonds: P < 0.05; open diamonds: not significant.

For spatial autocorrelation analysis, the great circle distances between sampling sites were classified into 200 km intervals, ranging from 0 to 1800 km. While between 3 and 49 subpopulation pairs fell into each class, the remaining 24 pairs that were >1800 km apart were combined into a single residual class. The subpopulation averages of both the weighted ROH number and the cumulative weighted ROH length per individual showed significant and positive spatial autocorrelation at small distances and significant but negative spatial autocorrelation at large distances (Fig. 5A and B). The subpopulation average of the median-weighted ROH length per individual showed a similar albeit non-significant trend (Fig. 5C).
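The distance classification underlying the autocorrelograms can be sketched as follows. The haversine formula and the mean Earth radius of 6371 km are assumptions of this sketch (the text does not state how great circle distances were computed), the function names are hypothetical, and the three example sites use approximate decimal-degree coordinates that are not taken from Table 1.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine great circle distance between two points given in decimal degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_class(d_km: float, width: float = 200.0, upper: float = 1800.0) -> int:
    """Index of the distance class: 0..8 for 200 km intervals up to 1800 km,
    9 for the single residual class of pairs more than 1800 km apart."""
    n_regular = int(upper // width)
    if d_km > upper:
        return n_regular
    return min(int(d_km // width), n_regular - 1)


# Approximate coordinates for illustration only (not the values of Table 1).
sites = {"Kiel": (54.32, 10.13), "Rotterdam": (51.92, 4.48), "Helsinki": (60.17, 24.94)}
for (name_a, a), (name_b, b) in combinations(sites.items(), 2):
    d = great_circle_km(*a, *b)
    print(f"{name_a}-{name_b}: {d:.0f} km, class {distance_class(d)}")
```

With 23 sampling sites there are 23 × 22 / 2 = 253 subpopulation pairs to be distributed over the distance classes in this way.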

DISCUSSION

At the level of the individual genome, the distribution of SNP-defined ROHs was found in our study to be highly structured in all of the European subpopulations analysed. This structure could not be explained solely by reduced marker gene diversity or a localized heterozygote deficit. Although both factors likely contribute to the formation of ROHs, their impact was found to be small to moderate. This is not surprising given that the inference of ROHs is based upon features of multiple adjacent SNPs. As was demonstrated here, increased regional LD contributes significantly to the occurrence of ROHs but is not sufficient to explain their presence.

While some evidence for a non-uniform distribution of ROHs has been reported before, we were able to show that this deviation from uniformity is geographically ubiquitous and highly significant. Thus, while McQuillan et al. found some ROHs on chromosome 1 to be relatively frequent in the Scottish population (16), we observed that many of these ROHs are not specific to the Scots, but are instead common throughout Europe. However, the most prominent European ROH on chromosome 1, located at positions 35.0–36.5 Mb, apparently occurred only once in the Scottish sample. Notably, we identified substantially more ROHs than Auton et al. (12) who observed only a single ROH in their European samples, located on chromosome 4 and overlapping with an ROH identified in our study as well. Moreover, in their entire collection of 3845 individuals of European, East Asian, South Asian and Mexican origin, the same authors identified only 39 different ROHs. In all likelihood, however, the apparent discrepancy in ROH number between their study and ours is attributable to a somewhat dissimilar ROH definition employed by the two projects which, among other parameters, differed in terms of the minimum allele frequency required and the type of inter-marker distance used (genetic or physical).

We have shown for the first time that both the number and the expanse of ROHs in individual human genomes, when weighted by the local level of LD, are strongly correlated with the latitude of the sample origin in Europe. This result corroborates earlier findings of a continent-wide decrease in human genetic diversity with increasing latitude (21,23,24). It should be noted, however, that the earlier studies were based upon the identity-by-state of homologous chromosomes, not their identity-by-descent, and were therefore less specific to human genealogy than our in-depth analysis of ROHs. In any case, the consistently observed correlation in Europe between genetic structure and latitude, but not longitude, appears readily explicable in terms of human population history. Thus, all three major migration episodes in Europe are known to have followed a South to North gradient: (i) the initial occupation by hunter-gatherers during the Palaeolithic, (ii) the post-glacial re-expansion during the Mesolithic and (iii) the influx of farmers from the South-East during the Neolithic. Apart from clinal migration, however, a high level of autozygosity in Northern Europe could also have resulted from a lower historic population density and/or a lower level of individual mobility in this part of the continent, both of which would have favoured the formation of ROHs as well. Unfortunately, the geographic resolution of our data is not sufficient to allow any reliable discrimination between these alternative explanatory scenarios. However, because of their ubiquitous occurrence in Europe, it seems likely that at least the point in time when the common ROH islands identified


in our study first arose must have predated the differentiation of the European subpopulations.

The intuitive appeal of the above conclusions notwithstanding, some local variation to the theme still seems obvious. It may be argued, for example, that both Portugal and Spain do not follow the generally observed South to North gradient of ROH density (Fig. 4A) which could, among other things, reflect a recent influx from North Africa not present in other regions of the continent. Furthermore, the sparse resolution of our data also implies that any inference about the fringes of Europe should be made with some caution (Fig. 4). For example, the apparently reduced number and length of ROHs predicted for Eastern Europe and the North-West of the Iberian Peninsula may well represent interpolation artefacts because no sampling points of our study were located in these regions.

In contrast to both the number and cumulative length of ROHs, their average length and the location of ‘ROH islands’ were found to be remarkably similar in different subpopulations, suggesting that the latter must have been characterized by very similar genome-wide patterns of meiotic recombination. Our results also indicate that recent population growth (say 100–200 years ago) seems to have played only a minor role in shaping the genomic distribution of ROHs in Europeans, not least because the variance-effective population size in Europe has grown much more slowly than the census size. Therefore, the recent observation of an apparently rapid decline in autozygosity among North Americans of mixed European descent (17) may be more likely to reflect the contribution of different European source populations migrating into the USA at different times, rather than progressing urbanization.

In view of the increasing interest in deep-sequencing and the shift in focus of genetic epidemiology from common to rare variants as a likely cause of human complex disease, a detailed understanding of the inheritance patterns prevalent in different European subpopulations will become increasingly important. One way in which rare variants can be inferred as potentially causative of disease would be through the homozygosity of patients from presumably outbred populations. However, the wide-spread occurrence of ROHs in contemporary Europeans requires such evaluations to be made in comparison to healthy controls, and our data highlight that at least some matching for geographic origin will be required for future ROH-based genetic studies to be both sensible and valid.

MATERIALS AND METHODS

SNP genotyping

The genome-wide SNP data used in the present study have been described in detail elsewhere (21). In brief, 2514 individuals from 23 different sampling sites in Europe were originally genotyped for 500 568 SNPs, using the GeneChip Human Mapping 500 k Array Set (Affymetrix). The samples were either population-based controls (25–29), originated from population-representative cohorts (30), or were randomly selected healthy volunteers (often blood donors). European migrants from non-European regions were not included in


the analysis. Stringent quality control served to ensure the non-relatedness of individuals and the representativeness of samples for the respective sampling sites (i.e. ‘genetic outliers’ were removed prior to the analysis). SNPs were required to be autosomal, to be polymorphic in at least one subpopulation, to lack any significant deviation from Hardy–Weinberg equilibrium (P ≥ 0.05) in all subpopulations, to have a call-rate >90% in all six genotyping centres involved and to possess an rs number. These criteria left 2457 individuals (97.6%) and 304 250 SNPs (60.8%) for the analysis.
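The SNP inclusion criteria listed above amount to a conjunction of simple per-SNP conditions, which can be written down as a small Python predicate. The record layout, field names and function name are hypothetical, and the comparison operators (P ≥ 0.05, call rate > 90%) are this sketch's reading of the thresholds; it is not the pipeline used in the study.

```python
from dataclasses import dataclass
from typing import Dict, Optional

AUTOSOMES = {str(c) for c in range(1, 23)}  # chromosomes 1-22


@dataclass
class SnpQc:
    """Hypothetical per-SNP quality-control summary."""
    chromosome: str
    rs_id: Optional[str]
    maf_by_subpop: Dict[str, float]        # minor allele frequency per subpopulation
    hwe_p_by_subpop: Dict[str, float]      # Hardy-Weinberg test P-value per subpopulation
    call_rate_by_centre: Dict[str, float]  # genotype call rate per genotyping centre


def passes_qc(snp: SnpQc, hwe_alpha: float = 0.05, min_call_rate: float = 0.90) -> bool:
    """True if the SNP satisfies all five inclusion criteria."""
    return (
        snp.chromosome in AUTOSOMES                                    # autosomal
        and snp.rs_id is not None and snp.rs_id.startswith("rs")       # has an rs number
        and any(maf > 0.0 for maf in snp.maf_by_subpop.values())       # polymorphic somewhere
        and all(p >= hwe_alpha for p in snp.hwe_p_by_subpop.values())  # no HWE deviation
        and all(cr > min_call_rate for cr in snp.call_rate_by_centre.values())
    )


# Hypothetical example record
example = SnpQc("7", "rs123", {"A": 0.20, "B": 0.0}, {"A": 0.40, "B": 0.90},
                {"centre1": 0.99, "centre2": 0.95})
print(passes_qc(example))  # True
```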

ROH definition

We employed the default ROH definition of PLINK v1.06 (20) (i.e. ROH length ≥1 Mb, ≥100 SNPs per ROH, ≥1 SNP per 50 kb within ROHs and a gap size ≤1 Mb within ROHs). ROH screening was carried out adopting the default window options (i.e. a 5 Mb window, 50 SNPs per window, at most one of which was heterozygous, ≤5 missing SNPs and a proportion of overlapping windows that must be homozygous ≥0.05). To adjust for the effects of LD upon ROH definition, previously suggested measures (16) of the extent of ROHs in the human genome, namely their number, cumulative length and median length per individual, were weighted by one minus the average squared genotypic correlation coefficient g², taken over all marker pairs within the ROH (see below).
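The LD weight described above, one minus the average squared genotypic correlation over all marker pairs within an ROH, can be sketched in Python as follows. Genotypes are assumed to be coded as minor allele counts (0, 1, 2) across the individuals of a subpopulation, and monomorphic SNPs are assigned g² = 0; both choices, as well as the function names, are assumptions of this sketch rather than details given in the text.

```python
from itertools import combinations
from typing import Sequence


def squared_genotypic_correlation(a: Sequence[int], b: Sequence[int]) -> float:
    """Squared Pearson correlation g^2 between the genotype codes of two SNPs."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    if saa == 0 or sbb == 0:
        return 0.0  # correlation undefined for a monomorphic SNP; treated as no LD
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return (sab * sab) / (saa * sbb)


def roh_ld_weight(snp_genotypes: Sequence[Sequence[int]]) -> float:
    """Weight of one ROH: 1 minus the mean g^2 over all SNP pairs inside it.

    `snp_genotypes[i]` holds the genotypes of SNP i across all individuals.
    """
    pairs = list(combinations(snp_genotypes, 2))
    if not pairs:
        return 1.0  # a single SNP carries no pairwise LD information
    mean_g2 = sum(squared_genotypic_correlation(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_g2


# Toy examples with four individuals each
independent = [[0, 0, 2, 2], [0, 2, 0, 2]]  # uncorrelated SNPs, weight 1
redundant = [[0, 1, 2, 1], [0, 1, 2, 1]]    # perfectly correlated SNPs, weight 0
print(roh_ld_weight(independent), roh_ld_weight(redundant))
```

Under this reading, an ROH that sits entirely within a block of strong LD contributes little to an individual's weighted ROH number, whereas an ROH spanning largely independent markers counts almost fully.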

Statistical analysis of the genomic ROH distribution

The genomic distribution of ROHs was analysed statistically as follows. Each chromosome was divided into bins of given size. Bins were only included in the analysis if they contained at least one SNP, which implied the exclusion of centromeric regions. The distribution of the mean ROH count per SNP per bin was then tested for uniformity using a χ² goodness-of-fit test with a number of degrees of freedom equal to the number of bins minus one. Spearman’s correlation coefficient was used to quantify the impact of chromosome length, SNP number and SNP density (i.e. average inter-marker distance) per chromosome upon the chromosome-wise average of the ROH frequency per SNP. Correlation coefficients were calculated and tested for a significant difference from zero using the cor and cor.test functions in R v2.10.1 (31), respectively. ‘ROH islands’ were identified as runs of adjacent SNPs with an ROH frequency per SNP above a given threshold. The potential impact upon ROH formation of single-marker gene diversity and local heterozygous deficit relative to Hardy–Weinberg expectation, measured by the F statistic, was assessed by logistic regression analysis. To this end, the ROH frequency per SNP was modelled as a logistic function of both covariates. Regression analyses were carried out separately for each subpopulation. A receiver-operator-characteristic curve was constructed using 100 equidistant values of the ROH frequency per SNP, and the corresponding AUC was determined by linear interpolation. The correlation between gene diversity and ROH frequency per SNP was further analysed using averages of these values taken over a sliding window of 1 Mb, moved along chromosomes in steps of 250 kb. Pearson’s correlation coefficients between averages were calculated separately for each subpopulation.
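Two of the steps described above lend themselves to a short illustration: the χ² statistic for uniformity across bins and the identification of ROH islands. This is a sketch under stated assumptions. The function names are hypothetical, the uniform expectation is taken to be the mean of the per-bin values, and only the test statistic and its degrees of freedom are returned, since the Python standard library has no χ² distribution function for the P value.

```python
def chi_square_uniformity(bin_values):
    """Chi-square goodness-of-fit statistic against a uniform expectation.

    Returns (statistic, degrees_of_freedom), with df = number of bins - 1.
    Assumes at least one bin has a non-zero value.
    """
    k = len(bin_values)
    expected = sum(bin_values) / k
    statistic = sum((obs - expected) ** 2 / expected for obs in bin_values)
    return statistic, k - 1


def find_roh_islands(roh_freq_per_snp, threshold):
    """Return (first_index, last_index) for each run of adjacent SNPs whose
    ROH frequency lies strictly above `threshold`; SNPs are in map order."""
    islands = []
    start = None
    for i, freq in enumerate(roh_freq_per_snp):
        if freq > threshold:
            if start is None:
                start = i          # a new island opens here
        elif start is not None:
            islands.append((start, i - 1))
            start = None
    if start is not None:          # island running to the chromosome end
        islands.append((start, len(roh_freq_per_snp) - 1))
    return islands
```

For example, `find_roh_islands([0.1, 0.3, 0.4, 0.1, 0.5], 0.2)` yields two islands, one spanning the second and third SNP and one consisting of the last SNP alone. In practice a minimum island size might be imposed in addition; the text does not specify one.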

Since gametic phase information was lacking in our data set, pair-wise LD was approximated by the squared genotypic correlation coefficient g², rather than the squared allelic correlation coefficient r², and estimated using PLINK v1.06 (20) with the --r2 option. We considered only pairs of markers that were no further apart than 1 Mb and were separated by no more than 100 SNPs. The correlation between LD and ROH frequency per SNP was analysed using average values taken over a sliding window of 1 Mb, moved along chromosomes in steps of 250 kb. Pearson’s correlation coefficients were again calculated separately for each subpopulation.
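The two quantities used here can be sketched in a few lines: the squared genotypic correlation between two markers, computed from unphased genotypes coded as minor-allele counts (0, 1 or 2), and the averaging of a per-SNP value over a 1 Mb window moved in 250 kb steps. The function names and the treatment of missing genotypes (coded `None` and dropped pairwise) are assumptions for illustration, and the code is not a reimplementation of PLINK.

```python
def squared_genotypic_correlation(geno_a, geno_b):
    """Squared Pearson correlation of two genotype vectors coded 0/1/2.

    Individuals with a missing genotype (None) at either marker are dropped.
    Returns NaN if fewer than two individuals remain or a marker is invariant.
    """
    pairs = [(a, b) for a, b in zip(geno_a, geno_b)
             if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return float("nan")
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return float("nan")
    return cov * cov / (var_a * var_b)


def sliding_window_means(positions_bp, values, window_bp=1_000_000,
                         step_bp=250_000):
    """Average `values` over windows [start, start + window_bp) along one
    chromosome; windows begin at the first SNP and empty windows are skipped."""
    means = []
    if not positions_bp:
        return means
    start, last = positions_bp[0], positions_bp[-1]
    while start <= last:
        in_window = [v for p, v in zip(positions_bp, values)
                     if start <= p < start + window_bp]
        if in_window:
            means.append((start, sum(in_window) / len(in_window)))
        start += step_bp
    return means
```

Running `sliding_window_means` once on the per-SNP LD values and once on the per-SNP ROH frequencies gives two aligned series whose Pearson correlation can then be computed per subpopulation.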

Statistical analysis of the geographic ROH distribution

R software v2.92 (31) was used for statistical analysis and for creating graphs. The akima R package v0.5-2 (32) was used for gridded bivariate cubic interpolation using splines (33). The significance of the correlation of certain ROH characteristics with either longitude or latitude was assessed by a two-sided test at the 5% level, as implemented in the cor.test function of the R stats library. Data on European geographic boundaries were obtained from http://www.oceanteacher.org/. Graphs were edited with Adobe Illustrator CS2. Spatial autocorrelation was analysed and correlograms were generated using PASSAGE v1.1 (34).
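The test of an ROH characteristic against latitude or longitude can be illustrated as follows. Note the substitution: the analysis used R's cor.test, whose default is a t-based test of Pearson's correlation, whereas this sketch uses a two-sided permutation test so that it runs with the Python standard library alone. Function names, the number of permutations and the fixed seed are all assumptions.

```python
import random


def pearson_r(x, y):
    """Pearson correlation coefficient; both inputs must vary."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation P value for the correlation between x and y.

    y is shuffled relative to x; the P value is the share of permutations
    whose |r| is at least as large as the observed |r| (with add-one
    correction so that the estimate is never exactly zero).
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs(pearson_r(x, y_perm)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

With one value per sampling site, for instance the site latitude in `x` and the mean cumulative ROH length of the site in `y`, a P value below 0.05 would correspond to significance at the 5% level used above.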

SUPPLEMENTARY MATERIAL

ACKNOWLEDGEMENTS

All sample donors are gratefully acknowledged for their participation. We thank the following colleagues for their help and support: P. Arp, M. Balascakova, C. Becker, A. van Belkum, J. Bertranpetit, L.A. Bindoff, R. Borup, S. Brauer, A. Caliebe, J. Chambers, D. Comas, G. Eckstein, H. von Eller-Eberstein, F.C. Nielsen, S. Freitag-Wolf, U. Gether, C. Gieger, E. Haastrup, A. Hofman, G. Holmlund, W. van IJken, M. Jhamai, O. Junge, K. King, E. Knipers, J. Kooner, A. Kouvatsi, O. Lao, J. Laven, P. Lichtner, J. Lindemans, M. Macek, T. Meitinger, I. Mollet, V. Mooser, P. Nürnberg, J. Palo, W. Parson, R. Ploski, F. Rivadeneira, A. Rüther, A. Sajantila, R. van Schaik, C. Schjerling, S. Schreiber, E. Sijbrands, M. Simoons, B. Stricker, A. Tagliabracci, A.G. Uitterlinden, H. Ullum, P. Vollenweider, G. Waeber, D. Waterworth, T. Werge and H.-E. Wichmann. We also thank M. Wittig for helpful discussions.

Conflict of Interest statement. None declared.

FUNDING

This work was supported by the Netherlands Forensic Institute (to M.Ka.); by Affymetrix Inc. (to M.Ka., M.Kr.); by the German Federal Ministry of Education and Research (BMBF) through the National Genome Research Network NGFNplus (01GS0809 to M.Kr., M.N.) and the German Research Foundation (DFG)/BMBF through the Excellence Cluster ‘Inflammation at Interfaces’ (to M.N.). This study received additional support by a grant from the Netherlands Genomics Initiative (NGI)/Netherlands Organization for Scientific Research (NWO) within the framework of the Forensic Genomics Consortium Netherlands (FGCN; www.forensicgenomics.nl/) (to M.Ka.). None of the funding organizations had any influence on the design, conduct, or conclusions of the study.

REFERENCES

1. Broman, K.W. and Weber, J.L. (1999) Long homozygous chromosomal segments in reference families from the centre d’Etude du polymorphisme humain. Am. J. Hum. Genet., 65, 1493–1500.

2. Hildebrandt, F., Heeringa, S.F., Ruschendorf, F., Attanasio, M., Nurnberg, G., Becker, C., Seelow, D., Huebner, N., Chernin, G., Vlangos, C.N. et al. (2009) A systematic approach to mapping recessive disease genes in individuals from outbred populations. PLoS Genet., 5, e1000353.

3. Lander, E.S. and Botstein, D. (1987) Homozygosity mapping: a way to map human recessive traits with the DNA of inbred children. Science, 236, 1567–1570.

4. Miano, M.G., Jacobson, S.G., Carothers, A., Hanson, I., Teague, P., Lovell, J., Cideciyan, A.V., Haider, N., Stone, E.M., Sheffield, V.C. et al. (2000) Pitfalls in homozygosity mapping. Am. J. Hum. Genet., 67, 1348–1351.

5. Seelow, D., Schuelke, M., Hildebrandt, F. and Nurnberg, P. (2009) HomozygosityMapper—an interactive approach to homozygosity mapping. Nucleic Acids Res., 37, W593–W599.

6. Wang, S., Haynes, C., Barany, F. and Ott, J. (2009) Genome-wide autozygosity mapping in human populations. Genet. Epidemiol., 33, 172–180.

7. Woods, C.G., Cox, J., Springell, K., Hampshire, D.J., Mohamed, M.D., McKibbin, M., Stern, R., Raymond, F.L., Sandford, R., Malik Sharif, S. et al. (2006) Quantification of homozygosity in consanguineous individuals with autosomal recessive disease. Am. J. Hum. Genet., 78, 889–896.

8. Jiang, H., Orr, A., Guernsey, D.L., Robitaille, J., Asselin, G., Samuels, M.E. and Dube, M.P. (2009) Application of homozygosity haplotype analysis to genetic mapping with high-density SNP genotype data. PLoS ONE, 4, e5280.

9. Miyazawa, H., Kato, M., Awata, T., Kohda, M., Iwasa, H., Koyama, N., Tanaka, T., Huqun Kyo, S., Okazaki, Y. et al. (2007) Homozygosity haplotype allows a genomewide search for the autosomal segments shared among patients. Am. J. Hum. Genet., 80, 1090–1102.

10. Rosenberg, N.A. and Jakobsson, M. (2008) The relationship between homozygosity and the frequency of the most frequent allele. Genetics, 179, 2027–2036.

11. Sabeti, P.C., Reich, D.E., Higgins, J.M., Levine, H.Z., Richter, D.J., Schaffner, S.F., Gabriel, S.B., Platko, J.V., Patterson, N.J., McDonald, G.J. et al. (2002) Detecting recent positive selection in the human genome from haplotype structure. Nature, 419, 832–837.

12. Auton, A., Bryc, K., Boyko, A.R., Lohmueller, K.E., Novembre, J., Reynolds, A., Indap, A., Wright, M.H., Degenhardt, J.D., Gutenkunst, R.N. et al. (2009) Global distribution of genomic diversity underscores rich complex history of continental human populations. Genome Res., 19, 795–803.

13. Gibson, J., Morton, N.E. and Collins, A. (2006) Extended tracts of homozygosity in outbred human populations. Hum. Mol. Genet., 15, 789–795.

14. Li, L.H., Ho, S.F., Chen, C.H., Wei, C.Y., Wong, W.C., Li, L.Y., Hung, S.I., Chung, W.H., Pan, W.H., Lee, M.T. et al. (2006) Long contiguous stretches of homozygosity in the human genome. Hum. Mutat., 27, 1115–1121.

15. Wang, H., Lin, C.H., Service, S., Chen, Y., Freimer, N. and Sabatti, C. (2006) Linkage disequilibrium and haplotype homozygosity in population samples genotyped at a high marker density. Hum. Hered., 62, 175–189.

16. McQuillan, R., Leutenegger, A.L., Abdel-Rahman, R., Franklin, C.S., Pericic, M., Barac-Lauc, L., Smolej-Narancic, N., Janicijevic, B., Polasek, O., Tenesa, A. et al. (2008) Runs of homozygosity in European populations. Am. J. Hum. Genet., 83, 359–372.

17. Nalls, M.A., Simon-Sanchez, J., Gibbs, J.R., Paisan-Ruiz, C., Bras, J.T., Tanaka, T., Matarin, M., Scholz, S., Weitz, C., Harris, T.B. et al. (2009) Measures of autozygosity in decline: globalization, urbanization, and its implications for medical genetics. PLoS Genet., 5, e1000415.

18. Curtis, D., Vine, A.E. and Knight, J. (2008) Study of regions of extended homozygosity provides a powerful method to explore haplotype structure of human populations. Ann. Hum. Genet., 72, 261–278.

19. MacLeod, I.M., Meuwissen, T.H., Hayes, B.J. and Goddard, M.E. (2009) A novel predictor of multilocus haplotype homozygosity: comparison with existing predictors. Genet. Res., 91, 413–426.

20. Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A., Bender, D., Maller, J., Sklar, P., de Bakker, P.I., Daly, M.J. et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet., 81, 559–575.

21. Lao, O., Lu, T.T., Nothnagel, M., Junge, O., Freitag-Wolf, S., Caliebe, A., Balascakova, M., Bertranpetit, J., Bindoff, L.A., Comas, D. et al. (2008) Correlation between genetic and geographic structure in Europe. Curr. Biol., 18, 1241–1248.

22. Beja-Pereira, A., Luikart, G., England, P.R., Bradley, D.G., Jann, O.C., Bertorelle, G., Chamberlain, A.T., Nunes, T.P., Metodiev, S., Ferrand, N. et al. (2003) Gene-culture coevolution between cattle milk protein genes and human lactase genes. Nat. Genet., 35, 311–313.

23. Heath, S.C., Gut, I.G., Brennan, P., McKay, J.D., Bencko, V., Fabianova, E., Foretova, L., Georges, M., Janout, V., Kabesch, M. et al. (2008) Investigation of the fine structure of European populations with applications to disease association studies. Eur. J. Hum. Genet., 16, 1413–1429.

24. Novembre, J., Johnson, T., Bryc, K., Kutalik, Z., Boyko, A.R., Auton, A., Indap, A., King, K.S., Bergmann, S., Nelson, M.R. et al. (2008) Genes mirror geography within Europe. Nature, 456, 98–101.

25. Hofman, A., Breteler, M.M., van Duijn, C.M., Krestin, G.P., Pols, H.A., Stricker, B.H., Tiemeier, H., Uitterlinden, A.G., Vingerling, J.R. and Witteman, J.C. (2007) The Rotterdam Study: objectives and design update. Eur. J. Epidemiol., 22, 819–829.

26. Hofman, A., Grobbee, D.E., de Jong, P.T. and van den Ouweland, F.A. (1991) Determinants of disease and disability in the elderly: the Rotterdam Elderly Study. Eur. J. Epidemiol., 7, 403–422.

27. Kayser, M., Liu, F., Janssens, A.C., Rivadeneira, F., Lao, O., van Duijn, K., Vermeulen, M., Arp, P., Jhamai, M.M., van Ijcken, W.F. et al. (2008) Three genome-wide association studies and a linkage analysis identify HERC2 as a human iris color gene. Am. J. Hum. Genet., 82, 411–423.

28. Krawczak, M., Nikolaus, S., von Eberstein, H., Croucher, P.J., El Mokhtari, N.E. and Schreiber, S. (2006) PopGen: population-based recruitment of patients and controls for the analysis of complex genotype–phenotype relationships. Community Genet., 9, 55–61.

29. Lowel, H., Doring, A., Schneider, A., Heier, M., Thorand, B., Meisinger, C. and Group, M.K.S. (2005) The MONICA Augsburg surveys—basis for prospective cohort studies. Gesundheitswesen, 67, S13–S18.

30. Nelson, M.R., Bryc, K., King, K.S., Indap, A., Boyko, A.R., Novembre, J., Briley, L.P., Maruyama, Y., Waterworth, D.M., Waeber, G. et al. (2008) The Population Reference Sample, POPRES: a resource for population, disease, and pharmacological genetics research. Am. J. Hum. Genet., 83, 347–358.

31. R Development Core Team. (2009) R Foundation for Statistical Computing. Vienna, Austria.

32. Akima, H., Gebhardt, A., Petzoldt, T. and Maechler, M. (2009) Interpolation of irregularly spaced data. R package version 0.5-2. http://CRAN.R-project.org/package=akima.

33. Akima, H. (1996) Algorithm 761: scattered-data surface fitting that has the accuracy of a cubic polynomial. ACM Trans. Math. Software, 22, 362–371.

34. Rosenberg, M.S. (2001) Pattern Analysis, Spatial Statistics, and Geographic Exegesis, Version 1.1, A.S.U. Department of Biology, Tempe, AZ, USA.