Vignette for the package rehh (version 2.0.4)

13/02/2019

This vignette describes how the R package REHH can be applied to perform whole genome scans for footprints of selection using statistics related to the Extended Haplotype Homozygosity (EHH) (Sabeti et al. 2002).

The current implementation of the package needs biallelic genetic markers as input. Typically, albeit not necessarily, these will be SNPs.

The rehh package is currently available for most platforms (Linux, MS Windows and MacOSX) from the CRAN repository https://cran.r-project.org/ and may be installed using a standard procedure. Once the package has been successfully installed on your system, it can be loaded by:

library(rehh)

1 Input Files

The package rehh requires as input:

1. a haplotype data file (in three possible formats) for each population of interest (see section 1.1)
2. a SNP information file (see section 1.2)

Important Note: For a given chromosome, SNPs are assumed to be ordered in the same way in the haplotype and SNP information files.

For illustration purposes, example files that originate from a previously published study on the Creole cattle breed from Guadeloupe (CGU) (Gautier and Naves 2011) are provided in the package and can be copied in the working directory with the command:

make.example.files()

Throughout this vignette, this command is assumed to have been run so that the example files are in the working directory.

1.1 Haplotype data file

Three haplotype input file formats are supported:

1. a standard haplotype format. Each line represents a haplotype (the first element being an haplotype identifier) with SNP genotype in columns as in the example file bta12_cgu.hap containing 280 haplotypes (identifier 1 to 280) with 1424 SNPs each (see section 1.3.1).
2. a (transposed) format with haplotypes in columns and SNPs in rows as in the example file bta12_cgu.thap. This format is similar to the one produced by the phasing program SHAPEIT2 (O’Connell et al. 2014) (see section 1.3.2).
3. the output file format from the phasing program fastPHASE (Scheet and Stephens 2006) as in the bta12_hapguess_switch.out example file. Note that this file format allows to include haplotypes from several populations (i.e., if the -u fastPHASE option was used) (see section 1.3.3).

By default alleles are assumed to be coded as 0 (missing data), 1 (ancestral allele) or 2 (derived allele). Other encodings can be recoded into this format, using the SNP information data file (see section 1.2) and the recode.allele option of the function data2haplohh() (see section 1.3).

1.2 SNP information file

The SNP information file should contain columns without header as in the map.inp example:

head(read.table("map.inp"))
>         V1 V2     V3 V4 V5
> 1 F0100190  1 113642  T  A
> 2 F0100220  1 244699  C  G
> 3 F0100250  1 369419  G  C
> 4 F0100270  1 447278  A  T
> 5 F0100280  1 487654  T  A
> 6 F0100290  1 524507  C  G

For each SNP the five columns correspond to:

1. name/identifier
2. chromosome (or scaffold) of origin
3. position on the chromosome (or scaffold). Note that it is up to the user to choose either physical or genetic map positions (if available).
4. ancestral allele (as coded in the haplotype input file)
5. derived allele (as coded in the haplotype input file)

The fourth and fifth columns (allele coding) should be always filled in, although the corresponding information is only of relevance when the recode.allele option of the function data2haplohh() is used (see section 1.3). In that case, the allele encoding in the haplotype file will be replaced by either 1 (ancestral) if the allele is found in the fourth column of the info file, 2 (derived) if it is found on the fifth column and 0 (missing data) otherwise. The information about ancestral or derived status is relevant only for within population tests (based on iHS). If one is interested exclusively in cross-population tests (based on Rsb or XP-EHH), assignment of the two alleles to the fourth and fifth column may be done at random.

The data2haplohh() function converts data files into an R object of class haplohh subsequently used by several functions of the rehh package. The following main options are available to recode alleles or select SNPs (based on Minor Allele Frequency or percentage of missing data) and haplotypes (based on percentage of missing data):

1. Allele recoding option: This option is activated with recode.allele=TRUE and in this case recodes the haplotype. The allele coding of the haplotype file is replaced by 0 (missing data), 1 (ancestral allele) or 2 (derived allele) according to the whether the allele coding is given at the fourth or fifth column in the corresponding row of the SNP information file.
2. Discard haplotypes with a high amount of missing data. If haplotypes contain missing data (which typically doesn’t occurr since most phasing programs allow imputing missing genotypes), it is possible to discard those with a fraction of less than min_perc_geno.hap of SNPs genotyped. By default min_perc_geno.hap=100 meaning that only completely phased haplotypes are retained.
3. Discard SNPs with a high amount of missing data. It is possible to discard individual SNPs genotyped on a fraction of less than min_perc_geno.snp of haplotypes. By default min_perc_geno.snp=100 meaning that only fully genotyped SNPs are retained.
4. Discard SNPs with a low Minor Allele Frequency. It is possible to discard SNPs with a MAF below min_maf. This is generally not recommended and by default min_maf=0 meaning that all SNPs are retained.

The arguments min_perc_geno.hap, min_perc_geno.snp and min_maf are evaluated in this order. More details about the different arguments of the function are available in the documentation accessible using the command:

?data2haplohh

1.3.1 Example 1: reading haplotype file in standard format

In this example, the haplotype input file bta12_cgu.hap (in a standard haplotype format) and SNP information input file map.inp are converted into an haplohh object named hap. Because the map file contains information for SNPs mapping to multiple chromosomes we have to specify that the haplotype input file is about chromosome 12 by setting the option chr.name=12. Allele recoding is activated (recode.allele=TRUE) to recode alleles given in the haplotype file as nucleotides to 0, 1 or 2.

hap<-data2haplohh(hap_file="bta12_cgu.hap",map_file="map.inp",
recode.allele=TRUE,chr.name=12)
> Map file seems OK: 1424  SNPs declared for chromosome 12
> Standard rehh input file assumed
> Alleles are being recoded according to map file as:
>   0 (missing data), 1 (ancestral allele) or 2 (derived allele)
> Discard Haplotype with less than  100 % of genotyped SNPs
> Discard SNPs genotyped on less than  100 % of haplotypes
> Data consists of 280 haplotypes and 1424 SNPs

If no value is specified for the chr.name argument and more than one chromosome is detected in the map file, the function asks interactively which chromosome to choose:

hap<-data2haplohh(hap_file="bta12_cgu.hap",map_file="map.inp",
recode.allele=TRUE) 
> More than one chromosome name in Map file: map.inp
> Which chromosome should be considered among:
> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
> 1:
12
> Map file seems OK: 1424  SNPs declared for chromosome 12
> Standard rehh input file assumed
> Alleles are being recoded according to map file as:
>   0 (missing data), 1 (ancestral allele) or 2 (derived allele)
> Discard Haplotype with less than  100 % of genotyped SNPs
> Discard SNPs genotyped on less than  100 % of haplotypes
> Data consists of 280 haplotypes and 1424 SNPs

Finally, as an example of an error message, the following message is prompted if the number of SNPs for the chromosome in the info file does not correspond to the one in the haplotype file (for instance when a wrong chromosome is specified):

hap<-data2haplohh(hap_file="bta12_cgu.hap",map_file="map.inp",
recode.allele=TRUE,chr.name=18)
> Map file seems OK: 1123  SNPs declared for chromosome 18
> Standard rehh input file assumed
> The number of snp in the haplotypes 1424  is not equal
> to the number of snps declared in the map file 1123
> Error in data2haplohh(hap_file = "bta12_cgu.hap", map_file = "map.inp", : Conversion stopped

1.3.2 Example 2: reading haplotype file in transposed format (SHAPIT2–like)

In this example, the haplotype input file bta12_cgu.thap (transposed format) and the SNP information input file map.inp are converted to an haplohh object named hap. Setting haplotype.in.columns=TRUE informs the function that the haplotype file is in transposed format:

hap<-data2haplohh(hap_file="bta12_cgu.thap",map_file="map.inp",haplotype.in.columns=TRUE,
recode.allele=TRUE,chr.name=12)
> Map file seems OK: 1424  SNPs declared for chromosome 12
> Haplotype are in columns with no header
> Alleles are being recoded according to map file as:
>   0 (missing data), 1 (ancestral allele) or 2 (derived allele)
> Discard Haplotype with less than  100 % of genotyped SNPs
> Discard SNPs genotyped on less than  100 % of haplotypes
> Data consists of 280 haplotypes and 1424 SNPs

1.3.3 Example 3: reading haplotype file in fastPHASE output format

In this example, the fastPHASE output file bta12_hapguess_switch.out and the SNP information input file map.inp are converted into a haplohh object named hap. As explained above we use the option chr.name=12. Because haplotypes originated here from several populations (the -u fastPHASE option was used), we specify the population of interest (in our example the 280 haplotypes from the CGU population, see above) using the option popsel=7 (7 corresponding to the code of CGU in the example fastPHASE input file).

hap<-data2haplohh(hap_file="bta12_hapguess_switch.out",map_file="map.inp",
recode.allele=TRUE,popsel=7,chr.name=12)
> Map file seems OK: 1424  SNPs declared for chromosome 12
> Looks like a FastPHASE haplotype file
> Haplotypes originate from  8  different populations in the fastPhase output file
> Alleles are being recoded according to map file as:
>   0 (missing data), 1 (ancestral allele) or 2 (derived allele)
> Discard Haplotype with less than  100 % of genotyped SNPs
> Discard SNPs genotyped on less than  100 % of haplotypes
> Data consists of 280 haplotypes and 1424 SNPs

If no value is specified for the popsel argument and more than one population is detected in the fastPHASE output file, the function asks interactively which population to chose:

hap<-data2haplohh(hap_file="bta12_hapguess_switch.out",map_file="map.inp",
recode.allele=TRUE,chr.name=12)
> Map file seems OK: 1424  SNPs declared for chromosome 12
> Looks like a FastPHASE haplotype file
> Haplotypes originate from  8  different populations in the fastPhase output file
> Chosen pop. is not in the list of pop. number: 1 2 3 4 5 6 7 8
> Which population should be considered among: 1 2 3 4 5 6 7 8
> 1:
7
> Map file seems OK: 1424  SNPs declared for chromosome 12
> Looks like a FastPHASE haplotype file
> Haplotypes originate from  8  different populations in the fastPhase output file
> Alleles are being recoded according to map file as:
>   0 (missing data), 1 (ancestral allele) or 2 (derived allele)
> Discard Haplotype with less than  100 % of genotyped SNPs
> Discard SNPs genotyped on less than  100 % of haplotypes
> Data consists of 280 haplotypes and 1424 SNPs

2 Computing EHH, EHHS and their “integrals” iHH and iES: the calc_ehh(), calc_ehhs() and scan_hh() functions

2.1 Definition and computation

2.1.1 The (allele-specific) Extended Haplotype Homozygosity (EHH)

For a given core allele (either ancestral or derived) at a focal SNP, the (allele–specific) extended haplotype homozygosity (EHH) is defined as the probability that two randomly chosen chromosomes (carrying the core allele considered) are identical by descent (as assayed by homozygosity at all SNPs) over a given surrounding chromosomal region (Sabeti et al. 2002). The EHH aims at measuring to which extent an extended haplotype is transmitted without mutation and recombination. It is computed for a given core allele $$a$$ (ancestral or derived) of a focal SNP $$s$$ over the chromosomal stretch extending to some SNP $$t$$: $$$\mathrm{EHH}_{s,t}^a=\frac{1}{n_{a}(n_a-1)}\sum\limits_{k=1}^{K^a_{s,t}}n_k(n_k-1) \tag{2.1}$$$ where $$n_a$$ represents the number of chromosomes carrying the core allele $$a$$, $$K^a_{s,t}$$ represents the number of different extended haplotypes that can be discerned among these chromosomes from SNP $$s$$ to SNP $$t$$, and $$n_k$$ refers to the number of chromosomes pertaining to the $$k$$-th such extended haplotype (it yields $$n_a=\sum\limits_{k=1}^{K^a_{s,t}}n_k$$).

2.1.2 The integrated EHH (iHH)

By definition, irrespective of the allele considered, EHH starts at 1, and decays monotonically to 0 with increasing distance from the focal SNP. For a given core allele, the integrated EHH (iHH) is defined as the area under the EHH curve with respect to map position (Voight et al. 2006). In rehh, iHH is computed using the trapezoid method. In practice, the integral is often truncated if the value of EHH reaches a certain lower threshold, e.g. 0.05.

2.1.3 The (site-specific) Extended Haplotype Homozygosity (EHHS)

An extended homozygosity can be defined for the whole set of chromosomes of a sample. In this case, the quantity is aimed to reflect the probability that any two randomly chosen chromosomes are identical by descent over a given surrounding chromosomal region of a focal SNP. In contrast to the allele-specific EHH defined above, the chromosomes are not devided with respect to their allele at the focal SNP. In order to distinguish this quantity from that defined in the previous section, we adopt the naming by (Tang et al. 2007) as site–specific EHH, abbreviated by EHHS. Note, however that this quantity is sometimes referred to as EHH, too, and there is no agreed notation in the literature.

EHHS was used in genome scans in two versions: un-normalized by (Sabeti et al. 2007) and normalized by (Tang et al. 2007).

In line with (Sabeti et al. 2007) we define $$$\tag{2.2} \mathrm{EHHS}^{\text{Sab}}_{s,t}=\frac{1}{n_s(n_s-1)}\left(\sum\limits_{k=1}^{K_{s,t}}n_k(n_k-1)\right)$$$ where we re-use notation from above and let $$n_s$$ refer to the number of chromosomes at SNP $$s$$. If there are no missing values at that SNP, this is simply the number of chromosomes in the sample.

(Tang et al. 2007) proposed an apparently different estimator for the normalized EHHS, namely $$$\tag{2.3} \mathrm{EHHS}^{\text{Tang}}_{s,t}=\frac{1-h_{s,t}}{1-h_s}$$$ where:

1. $$h_s=\frac{n_s}{n_s-1}\left(1-\frac{1}{n_s^2}\left(\sum\limits_{k=1}^{K_{s,s}}n_k^2 \right )\right)=\frac{n_s}{n_s-1}\left(1-\frac{1}{n_s^2}\left(n_{a1}^2+n_{a2}^2\right) \right)$$ is an estimator of the focal SNP heterozygosity with $$a1$$ and $$a2$$ referring to the numbers of the two alleles at SNP $$s$$ ($$n_{a1}+n_{a2}=n_s$$).

2. $$h_{s,t}=\frac{n_s}{n_s-1}\left(1-\frac{1}{n_s^2}\left(\sum\limits_{k=1}^{K_{s,t}}n_k^2 \right )\right)$$ is an estimator of haplotype heterozygosity across the chromosome region extending from SNP $$s$$ to SNP $$t$$.

However both definitions are in fact equivalent, because it holds $$\mathrm{EHHS}^{\text{Sab}}_{s,t}=1-h_{s,t}$$ and hence $\begin{equation*} \mathrm{EHHS}^{\text{Tang}}_{s,t}=\frac{\mathrm{EHHS}^{\text{Sab}}_{s,t}}{\mathrm{EHHS}^{\text{Sab}}_{s,s}}\;. \end{equation*}$ Thus $$\mathrm{EHHS}^{\text{Tang}}_{s,t}$$ is just normalized in order to yield 1 at the focal SNP $$s$$. Note that the normalization factor depends on the frequency of the two alleles at the focal SNP and consequently is not constant over the whole data set.

Furthermore, we note that EHHS and EHH are related by $\begin{equation*} \mathrm{EHHS}^{\text{Sab}}_{s,t}=\frac{n_{a1}(n_{a1}-1)}{n_s(n_s-1)}\mathrm{EHH}_{s,t}^{a1}+\frac{n_{a2}(n_{a2}-1)}{n_s(n_s-1)}\mathrm{EHH}^{a2}_{s,t}\;. \end{equation*}$ EHHS might hence be viewed as a linear combination of the EHH’s of the two alternative focal alleles, weighted by roughly the square of the focal allele frequencies.

2.1.4 The integrated EHHS (iES)

As for the EHH (see section 2.1.2), $$EHHS^{\text{Tang}}$$ starts at 1 and decays monotonically to 0 with increasing distance from the focal SNP. For a given focal SNP, analogously to iHH, iES is defined as the integrated EHHS (Tang et al. 2007). Depending on wether un-normalized or normalized EHHS is used (respectively, $$\mathrm{EHHS}^{\text{Sab}}$$ or $$\mathrm{EHHS}^{\text{Tang}}$$), we yield two different values for iES that we denote by $$\mathrm{iES}^{\text{Sab}}$$ and $$\mathrm{iES}^{\text{Tang}}$$ respectively. As for iHH, the iES integral is computed using the trapezoid method and is often computed only for the region where EHHS lies over a given threshold (e.g., EHHS>0.05).

2.1.5 Dealing with gaps

Certain genomic regions such as centromeres are usually difficult to sequence and can give rise to large gaps between consecutive SNPs. If not accounted for, these will cause spuriously inflated values of iHH and iES. (Voight et al. 2006) applied two corrections for gaps. First, they introduced a penalty (proportional to physical distance) that resulted in any gap being greater than 20 kb to be re-scaled to this number. Second, they discarded the integration, if two consecutive SNPs with a distance greater than 200 kb were encountered. Both methods are implemented in rehh, yet turned off by default, since the corresponding numbers are likely to depend on the specific data set.

2.1.6 Dealing with missing data

In the computation of both EHH and EHHS from a focal SNP $$s$$ to a SNP $$t$$, only extended haplotypes with no missing data are considered. As a consequence, the number of extended haplotypes retained to compute these two statistics might decrease with increasing distance of $$t$$ from the focal SNP $$s$$. If the number of available extended haplotypes falls below a threshold, computation of EHH and EHHS stops. Note however that most phasing programs (such as fastPHASE or SHAPEIT2) allow to impute missing genotypes resulting in phased haplotypes with no missing data.

2.2 The function calc_ehh()

The calc_ehh() function computes EHH for both ancestral and derived alleles of a focal SNP $$s$$ relative to any other SNP $$t$$ upstream or downstream. The corresponding integral iHH of these EHH values is returned as well. The two options limehh and limhaplo allow to specify conditions to truncate computing EHH (see section 2.1.1). By default limehh=0.05 and limhaplo=2. If the border of the chromosome is reached, but EHH has not yet decayed below limehh, calculation of iHH is discarded in order to avoid border effects. This behaviour can be turned off by setting discard_integration_at_border to FALSE. To account for gaps between consecutive SNPs, two methods can be used: it is possible to set by scalegap a distance such that any one greater is re-scaled to this number. Furthermore, the option maxgap can be used to stop integration at gaps that are greater than the specified value. Again, if discard_integration_at_border is set to TRUE, no value is reported. Finally, if plotehh=TRUE, the decay of EHH for both ancestral and derived allele is plotted against the SNP map position. More details are available in the R documentation by using the command:

?calc_ehh

In the following example, EHH is computed around the SNP with name “F1205400”. Note that the haplohh_cgu_bta12 object was generated using the data2haplohh() function with the example input files (see section 1.3.1). For convenience, it is stored as an example object (accessible with the R function data) as shown below:

#example haplohh object (280 haplotypes, 1424 SNPs) see ?haplohh_cgu_bta12 for details
data(haplohh_cgu_bta12)
#computing EHH statistics for the focal SNP with name "F1205400"
#which displays a strong signal of selection
res.ehh<-calc_ehh(haplohh_cgu_bta12,mrk="F1205400")

The output contained in res.ehh is a list with four elements:

1. a matrix with EHH values along the chromosome for ancestral and derived allele of the focal SNP.
2. a matrix showing how many chromosomes contributed at each position to the calculation of ancestral resp. derived EHH. These numbers decrease with increasing distance to the focal SNP, since missing values for a genotype at a certain SNP exclude the corresponding chromosome from further calculations. If one of the thresholds, either the minimum number of chrosomsomes (limhaplo) or the minimal value of EHH (limehh), is reached, no chromosomes are evaluated any more.
3. the frequency of the ancestral allele at the focal SNP
4. a vector with ancestral and derived iHH, i.e. the integrals over the EHH curves defined by the values given in 1.
res.ehh$ehh[1:2,454:458] > F1205380 F1205390 F1205400 F1205420 F1205440 > Ancestral allele 0.2764706 0.5529412 1 0.8879552 0.6422969 > Derived allele 1.0000000 1.0000000 1 1.0000000 1.0000000 res.ehh$nhaplo_eval[1:2,454:458]
>                  F1205380 F1205390 F1205400 F1205420 F1205440
> Ancestral allele       85       85       85       85       85
> Derived allele        195      195      195      195      195
res.ehh$freq_all1 > [1] 0.3035714 res.ehh$ihh
> Ancestral allele   Derived allele
>         284429.9        2057107.4

In addition, as plotehh=TRUE by default, we obtain Figure 2.1.

2.3 The function calc_ehhs()

The calc_ehhs() function computes $$\mathrm{EHHS}^{\text{Sab}}$$ and $$\mathrm{EHHS}^{\text{Tang}}$$ around the focal SNP $$s$$ relative to each other SNP $$t$$. This function also computes the corresponding integrals $$\mathrm{iES}^{\text{Sab}}$$ and $$\mathrm{iES}^{\text{Tang}}$$ respectively. The two options limehhs and limhaplo allow to specify conditions to truncate computing EHHS (see section 2.1.3). By default limehhs=0.05 and limhaplo=2. If the border of the chromosome is reached, but EHHS has not yet decayed below limehhs, calculation of iES is discarded in order to avoid border effects. This behaviour can be turned off by setting discard_integration_at_border to FALSE. To account for gaps between consecutive SNPs, two methods can be used: it is possible to set by scalegap a distance such that any one greater is re-scaled to this number. Furthermore, the option maxgap can be used to stop integration at gaps that are greater than the specified value. Again, if discard_integration_at_border is set to TRUE, no value is reported. Finally, if plotehhs=TRUE, the decay of EHHS is plotted against SNP map position. More details are available in the R documentation by using the command:

?calc_ehhs

In the following example, the EHHS statistics are computed around the SNP with name “F1205400” on the haplohh_cgu_bta12 object already mentioned above.

data(haplohh_cgu_bta12)
res.ehhs<-calc_ehhs(haplohh_cgu_bta12,mrk="F1205400")

The output contained in res.ehhs is a list with five elements:

1. a vector with $$\mathrm{EHHS}^{\text{Sab}}$$ values along the chromosome around the focal SNP.
2. a vector with $$\mathrm{EHHS}^{\text{Tang}}$$ values along the chromosome around the focal SNP.
3. a vector showing how many chromosomes contributed at each position to the calculation of EHHS. The number decreases with increasing distance to the focal SNP, since missing values for a genotype at a certain SNP exclude the corresponding chromosome from further calculations. If one of the thresholds, either the minimum number of chrosomsomes (limhaplo) or the minimal value of EHHS (limehhs), is reached, no chromosomes are evaluated any more.
4. $$\mathrm{iES}^{\text{Sab}}$$, i.e. the integral over the EHHS curve defined by the values given in 1.
5. $$\mathrm{iES}^{\text{Tang}}$$, i.e. the integral over the EHHS curve defined by the values given in 2.
res.ehhs$EHHS_Sabeti_et_al_2007[453:459]  > F1205370 F1205380 F1205390 F1205400 F1205420 F1205440 F1205450 > 0.5017153 0.5095238 0.5347926 0.5756528 0.5654122 0.5429595 0.5386841 res.ehhs$EHHS_Tang_et_al_2007[453:459] 
>  F1205370  F1205380  F1205390  F1205400  F1205420  F1205440  F1205450
> 0.8715588 0.8851234 0.9290193 1.0000000 0.9822104 0.9432066 0.9357794
res.ehhs$nhaplo_eval[453:459]  > F1205370 F1205380 F1205390 F1205400 F1205420 F1205440 F1205450 > 280 280 280 280 280 280 280 res.ehhs$IES_Tang_et_al_2007
> [1] 1760565
res.ehhs$IES_Sabeti_et_al_2007 > [1] 936407.6 In addition, as plotehh=TRUE by default, we obtain Figure 2.2. 2.4 The function scan_hh() The scan_hh() function efficiently computes iHH for both the ancestral and derived alleles as well as $$\mathrm{iES}^{\text{Sab}}$$ and $$\mathrm{iES}^{\text{Tang}}$$ for all SNPs in the haplohh object. The options limehh, limehhs and limhaplo specify conditions to stop computing EHH and EHHS. By default limehh=limehhs=0.05 and limhaplo=2. If the border of the chromosome is reached, but EHH(S) has not yet decayed below limehh(s), calculation of iHH resp. iES is discarded in order to avoid border effects. This behaviour can be turned off by setting discard_integration_at_border to FALSE. Large “gaps” between consecutive SNPs can be caused by problems in sequencing or SNP calling and may lead to spuriously long extended haplotypes. To account for this, the option maxgap can be used to stop integration at gaps that are greater than the specified size. Again, if discard_integration_at_border is set to TRUE, no value is reported. Finally, the option threads, set by default to 1, allows to specify the number of threads to parallelize computation (parallelization being carried out over SNPs). For instance in order to scan the haplohh_cgu_bta12 object (containing data on 1424 SNPs for 280 haplotypes), one may use the following command: data(haplohh_cgu_bta12) res.scan<-scan_hh(haplohh_cgu_bta12) The resulting object res.scan is a data frame with haplohh_cgu_bta12@nsnp (the number of SNPs declared in the haplohh object) rows and seven columns yielding for each focal SNP in turn: 1. Chromosome 2. Position 3. Ancestral allele frequency 4. iHH for the ancestral allele 5. iHH for the derived allele 6. $$\mathrm{iES}^{\text{Tang}}$$ 7. $$\mathrm{iES}^{\text{Sab}}$$. As an example, the following R code provides the dimension and a segment of the res.scan data frame obtained above: dim(res.scan) > [1] 1424 7 res.scan[453:459,] > CHR POSITION freq_A iHH_A iHH_D iES_Tang_et_al_2007 > F1205370 12 28925117 0.06071429 765720.1 1121814.8 1119699 > F1205380 12 28947722 0.82500000 1477031.7 454891.9 1433143 > F1205390 12 28967990 0.90000000 1211909.5 574301.6 1204234 > F1205400 12 28993983 0.30357143 284429.9 2057107.4 1760565 > F1205420 12 29101326 0.01785714 269322.2 1086530.3 1086279 > F1205440 12 29147373 0.05714286 336600.7 1185898.6 1182946 > F1205450 12 29197279 0.04285714 528732.6 1149951.0 1148696 > iES_Sabeti_et_al_2007 > F1205370 971728.6 > F1205380 966416.3 > F1205390 954903.3 > F1205400 936407.6 > F1205420 1041943.5 > F1205440 1036274.2 > F1205450 1039850.3 Note that scan_hh() is more efficient than calc_ehh() and calc_ehhs() applied consecutively for each SNP as can be seen by running the two code snippets below: system.time(res.scan<-scan_hh(haplohh_cgu_bta12)) > user system elapsed > 0.225 0.000 0.225 foo<-function(haplo){ res.ihh=res.ies=matrix(0,haplo@nsnp,2) for(i in 1:haplo@nsnp){ res.ihh[i,]=calc_ehh(haplo,mrk=haplo@snp.name[i],plotehh=FALSE)$ihh
tmp=calc_ehhs(haplo,mrk=haplo@snp.name[i],plotehhs=FALSE)
res.ies[i,1]=tmp$IES_Tang_et_al_2007 res.ies[i,2]=tmp$IES_Sabeti_et_al_2007
}
list(res.ies=res.ies,res.ihh=res.ihh)
}
system.time(res.scan2<-foo(haplohh_cgu_bta12))
>    user  system elapsed
>  13.590   0.004  13.594

Nevertheless, results are the same:

identical(res.scan2$res.ihh[,1],res.scan[,4]) > [1] TRUE identical(res.scan2$res.ihh[,2],res.scan[,5])
> [1] TRUE
identical(res.scan2$res.ies[,1],res.scan[,6]) > [1] TRUE identical(res.scan2$res.ies[,2],res.scan[,7])
> [1] TRUE

3 Computing iHS, Rsb and XP-EHH: the ihh2ihs(), ies2rsb() and ies2xpehh() functions

3.1 The iHS within-population statistic

3.1.1 Definition

The abbreviation iHS refers to “integrated haplotype homozygosity score”. Let $$\mathrm{uniHS}$$ represent the un-standardized log-ratio of ancestral iHH$$_a$$ to derived iHH$$_d$$ of a certain focal SNP $$s$$: $\mathrm{uniHS}=\log\left(\frac{\mathrm{iHH}_a}{\mathrm{iHH}_d}\right)$ Following (Voight et al. 2006) we perform a standardization by setting: $\mathrm{iHS}=\frac{\mathrm{uniHS} - \mu^{p_s}_\mathrm{uniHS}}{\sigma^{p_s}_\mathrm{uniHS}}$ where $$\mu^{p_s}_\mathrm{uniHS}$$ and $$\sigma^{p_s}_\mathrm{uniHS}$$ represent the average and standard deviation of the $$\mathrm{uniHS}$$ computed over all the SNPs with a derived allele frequency $$p_s$$ similar to that of the SNP $$s$$. In practice, the derived allele frequencies are binned so that each bin contains a large enough number of SNPs (e.g., >10) to obtain reliable estimates of $$\mu^{p_s}_\mathrm{uniHS}$$ and $$\sigma^{p_s}_\mathrm{uniHS}$$.

Note that the iHS is constructed to have an approximately standard Gaussian distribution and to be comparable across SNPs regardless of their underlying allele frequencies. Hence, one may further transform iHS into $$p_\mathrm{iHS}$$ (Gautier and Naves 2011): $p_\mathrm{iHS}=-\log_{10}\left(1-2|\Phi\left(\mathrm{iHS}\right)-0.5|\right)$ where $$\Phi\left(x\right)$$ represents the Gaussian cumulative distribution function. Assuming most of the genotyped SNPs behave neutrally (i.e., the genome-wide empirical iHS distribution is a fair approximation of the neutral distribution), $$p_\mathrm{iHS}$$ might thus be interpreted as a two-sided P-value (on a $$-\log_{10}$$ scale) associated to the neutral hypothesis of no selection.

3.1.2 The function ihh2ihs()

The ihh2ihs() function computes iHS using a matrix of iHH statistics (for both the ancestral and derived alleles) as obtained by the scan_hh() function (see section 2.4). The argument minmaf allows to remove SNPs according to their MAF (by default SNPs with a MAF<minmaf=0.05 are discarded from the standardization). The argument freqbin controls the size (or number) of the allele frequency bins used to perform standardization (see section 3.1.1). More precisely, allele frequency bins are built from minmaf to 1-minmaf in steps of size freqbin (by default freqbin=0.025). If instead an integer of 1 or greater is specified, a corresponding number of equally spaced bins is created. If freqbin is set to 0, standardization is performed considering each observed frequency as a discrete frequency class, which is useful in case that there are only a few different haplotypes.

For instance, to perform a whole genome scan one might run scan_hh() on haplotype data from each chromosome and concatenate the resulting matrices before standardization. In the following example, we assume that the haplotype files are named as hap_chr_i.pop1 where the chromosome number $$i$$ goes from 1 to 29 and the SNP information file is named snp.info. The R code below then generates a matrix wg.res with $$iHH_a$$ and $$iHH_d$$ estimates for all SNPs in an appropriate format to perform standardization with the ihh2ihs() function:

for(i in 1:29){
hap_file=paste("hap_chr_",i,".pop1",sep="")
data<-data2haplohh(hap_file="hap_file","snp.info",chr.name=i)
res<-scan_hh(data)
if(i==1){wg.res<-res}else{wg.res<-rbind(wg.res,res)}
}
wg.ihs<-ihh2ihs(wg.res)

For illustration, $$iHH$$ values of a whole genome scan (Gautier and Naves 2011) are provided as example data. The following R code computes the iHS for the CGU population:

data(wgscan.cgu)
## results from a genome scan (44,057 SNPs) see ?wgscan.eut and ?wgscan.cgu for details
ihs.cgu<-ihh2ihs(wgscan.cgu)

The resulting object ihs.cgu is a list with two elements:

1. a matrix of SNP iHS and corresponding $$p_\mathrm{iHS}$$ (P-values in a $$-\log_{10}$$ scale assuming the iHS are normally distributed under the neutral hypothesis). For instance, the first rows of the ihs.cgu$iHS data frame are displayed below using the R command: head(ihs.cgu$iHS)
>          CHR POSITION        iHS -log10(p-value)
> F0100190   1   113642 -0.5582992       0.2390952
> F0100220   1   244699  0.2723337       0.1049282
> F0100250   1   369419  0.4810736       0.2003396
> F0100270   1   447278  1.0618710       0.5401640
> F0100280   1   487654  0.8184060       0.3839181
> F0100290   1   524507 -0.3897024       0.1569189
1. a matrix summarizing the allele frequency bins. For instance, the first rows of the ihs.cgu$frequency.class data frame are displayed below: head(ihs.cgu$frequency.class)
>              #mrk mean(log(iHHA/iHHD)) sd(log(iHHA/iHHD))
> 0.05 - 0.075 1635            0.7286087          0.6457742
> 0.075 - 0.1  1316            0.5804760          0.5556798
> 0.1 - 0.125  1478            0.4710504          0.5079392
> 0.125 - 0.15 1593            0.3720585          0.4708235
> 0.15 - 0.175 1078            0.3263215          0.4524270
> 0.175 - 0.2  1325            0.2721166          0.4533404

3.1.3 Manhattan plot of the results: the function ihsplot()

The function ihsplot() draws a Manhattan plot of the Whole Genome Scan results returned by the function ihh2ihs(). Various options are available to modify the graphical display (see ?ihsplot). Figure 3.1 was drawn using the following R code:

layout(matrix(1:2,2,1))
ihsplot(ihs.cgu,plot.pval=TRUE,ylim.scan=2,main="iHS (CGU cattle breed)")

3.2 The Rsb pairwise population statistic

3.2.1 Definition

The abbreviation Rsb stands for “ratio of EHHS between populations”. For a given SNP $$s$$, let $\mathrm{LRiES}^{\text{Tang}}=\log\left(\frac{\mathrm{iES}_\text{pop1}^{\text{Tang}}}{\mathrm{iES}_\text{pop2}^{\text{Tang}}}\right)$ represent the log-ratio of the $$\mathrm{iES}^{\text{Tang}}$$ values computed in the pop1 and pop2 populations (see section 2.1.4).

The Rsb for a given focal SNP is then defined as the standardized $$\mathrm{LRiES}^{\text{Tang}}$$ (Tang et al. 2007):

$\begin{equation*} \mathrm{Rsb}=\frac{\mathrm{LRiES}^{\text{Tang}} - \text{med}_{\mathrm{LRiES}^{\text{Tang}}}}{\sigma_{\mathrm{LRiES}^{\text{Tang}}}} \end{equation*}$ where $$\text{med}_{\mathrm{LRiES}^{\text{Tang}}}$$ and $$\sigma_{\mathrm{LRiES}^{\text{Tang}}}$$ represent the median and standard deviation of the $$\mathrm{LRiES}(s)^{\text{Tang}}$$ computed over all analyzed SNPs. Note that we follow (Tang et al. 2007) in using the median instead of the mean (hence in contrast to the definitions of iHS and XP-EHH). They assume that this might increase the robustness against different demographic scenarios. It should be noticed, too, that the information about ancestral/derived status of alleles at the focal SNP does not figure in the formula. Furthermore, in contrast with $$iHS$$, no binning is performed.

As iHS (see section 3.1.1), Rsb is constructed to have an approximately standard Gaussian distribution and may further be transformed into $$p_\mathrm{Rsb}$$: $\begin{equation*} p_\mathrm{Rsb}=-\log_{10}\left(1-2|\Phi\left(\mathrm{Rsb}\right)-0.5|\right) \end{equation*}$ where $$\Phi\left(x\right)$$ represents the Gaussian cumulative distribution function. Assuming most of the genotyped SNPs behave neutrally (i.e., the genome-wide empirical Rsb distribution is a fair approximation of their corresponding neutral distributions), $$p_\mathrm{Rsb}$$ might thus be interpreted as a two-sided P-value (in a $$-\log_{10}$$ scale) associated to the neutral hypothesis of no selection. Alternatively, $$p_\mathrm{Rsb}$$ might also be computed (Gautier and Naves 2011): $\begin{equation*} p\prime_\mathrm{Rsb}=-\log_{10}\left(|\Phi\left(\mathrm{Rsb}\right)|\right) \end{equation*}$ $$p\prime_\mathrm{Rsb}$$ and $$p\prime_\mathrm{Rsb}$$ might then be interpreted as a one-sided P-value (in a $$-\log_{10}$$ scale) allowing the identification of those sites displaying a significantly high extended haplotype homozygosity in population $$pop2$$ (represented in the denominator of the corresponding $$\mathrm{LRiES}$$) relatively to the $$pop1$$ reference population.

3.2.2 The function ies2rsb()

The ies2rsb() function computes Rsb using two data frames containing the iES statistics for each of the two populations considered in the same format as obtained by running the scan_hh() function (see section 2.4). In order to perform a genome-wide scan one might first run for each population scan_hh() on haplotype data from each chromosome and then concatenate the resulting matrices. In the following example, we assume that the haplotype files are named as hap_chr_i.pop1 and hap_chr_i.pop2 where $$i$$ is the chromosome number (going from 1 to 29), the suffixes pop1 and pop2 indicate the population of origin and the SNP information file is named snp.info. The R code below then generates two data frames (wg.res.pop1 and wg.res.pop2) containing the results from all SNPs in the appropriate format to compute Rsb with the ies2rsb() function:

for(i in 1:29){
hap_file=paste("hap_chr_",i,".pop1",sep="")
data<-data2haplohh(hap_file="hap_file","snp.info",chr.name=i)
res<-scan_hh(data)
if(i==1){wg.res.pop1<-res}else{wg.res.pop1<-rbind(wg.res.pop1,res)}
hap_file=paste("hap_chr_",i,".pop2",sep="")
data<-data2haplohh(hap_file="hap_file","snp.info",chr.name=i)
res<-scan_hh(data)
if(i==1){wg.res.pop2<-res}else{wg.res.pop2<-rbind(wg.res.pop2,res)}
}
wg.rsb<-ies2rsb(wg.res.pop1,wg.res.pop2)

For illustration, we take $$iES$$ values from a genome scan (Gautier and Naves 2011) provided as example data and compute for each SNP the Rsb between the CGU and EUT populations as follows:

data(wgscan.cgu) ; data(wgscan.eut)
## results from a genome scan (44,057 SNPs) see ?wgscan.eut and ?wgscan.cgu for details
cguVSeut.rsb<-ies2rsb(wgscan.cgu,wgscan.eut,"CGU","EUT")

The resulting object cguVSeut.rsb is a data frame which shows for each SNP its Rsb and corresponding P-Values assuming Rsb are normally distributed under the neutral hypothesis. The P-value might be either bilateral (default) or unilateral (specified by the method argument). The first rows of the cguVSeut.rsb data frame are displayed below using the following R command:

head(cguVSeut.rsb)
>          CHR POSITION Rsb (CGU vs. EUT) -log10(p-value) [bilateral]
> F0100190   1   113642        -0.3398574                  0.13432529
> F0100220   1   244699        -1.0566283                  0.53658299
> F0100250   1   369419        -0.1468326                  0.05390941
> F0100270   1   447278        -1.8191608                  1.16186336
> F0100280   1   487654        -0.2193069                  0.08280392
> F0100290   1   524507        -0.7941300                  0.36945032

3.2.3 Manhattan plot of the results: the function rsbplot()

The rsbplot() function draws a Manhattan plot of the Whole Genome Scan results as obtained by the function ies2rsb(). Various options are available to modify the graphical display (see ?rsbplot). As an example, Figure 3.2 below provides the output of the function rsbplot() for the Rsb computed above across the CGU and EUT populations (see section 3.2.1). It was drawn using the following R code:

layout(matrix(1:2,2,1))
rsbplot(cguVSeut.rsb,plot.pval=TRUE)

3.3 The XP-EHH pairwise population statistic

3.3.1 Definition

The XP-EHH (cross-population EHH) statistic (Sabeti et al. 2007) is similar to Rsb except that it is based on $$\mathrm{iES}^{\text{Sab}}$$ instead of $$\mathrm{iES}^{\text{Tang}}$$ (see section 2.1.4). Hence, for or a given SNP $$s$$, let $\mathrm{LRiES}^{\text{Sab}}=\log\left(\frac{\mathrm{iES}_\text{pop1}^{\text{Sab}}}{\mathrm{iES}_\text{pop2}^{\text{Sab}}}\right)$ represent the log-ratio of the $$\mathrm{iES}^{\text{Sab}}$$ values computed in the pop1 and pop2 populations (see section 2.1.4).

The XP-EHH for a given focal SNP is then defined as the standardized $$\mathrm{LRiES}^{\text{Sab}}$$ (Sabeti et al. 2007):

$\begin{equation*} \mathrm{XP-EHH}=\frac{\mathrm{LRiES}^{\text{Sab}} - \text{mean}_{\mathrm{LRiES}^{\text{Sab}}}}{\sigma_{\mathrm{LRiES}^{\text{Sab}}}} \end{equation*}$ where $$\text{mean}_{\mathrm{LRiES}^{\text{Sab}}}$$ and $$\sigma_{\mathrm{LRiES}^{\text{Sab}}}$$ represent the mean and standard deviation of $$\mathrm{LRiES}^{\text{Sab}}$$ computed over all analyzed SNPs. As with Rsb, the information about the ancestral and derived status of alleles at the focal SNP does not figure in the formula and no binning is performed.

As with iHS (see section 3.1.1) and Rsb (see section 3.2.1), XP-EHH is constructed to have an approximately standard Gaussian distribution and may further be transformed into $$p_\mathrm{XP-EHH}$$: $\begin{equation*} p_\mathrm{XP-EHH}=-\log_{10}\left(1-2|\Phi\left(\mathrm{XP-EHH}\right)-0.5|\right) \end{equation*}$ where $$\Phi\left(x\right)$$ represents the Gaussian cumulative distribution function. Assuming most of the genotyped SNPs behave neutrally (i.e., the genome-wide empirical XP-EHH distribution is a fair approximation of their corresponding neutral distributions), $$p_\mathrm{XP-EHH}$$ might thus be interpreted as a two-sided P-value (in a $$-\log_{10}$$ scale) associated to the neutral hypothesis of no selection. Alternatively, $$p_\mathrm{XP-EHH}$$ might also be computed (Gautier and Naves 2011): $\begin{equation*} p\prime_\mathrm{XP-EHH}=-\log_{10}\left(|\Phi\left(\mathrm{XP-EHH}\right)|\right) \end{equation*}$ $$p\prime_\mathrm{XP-EHH}$$ and $$p\prime_\mathrm{XP-EHH}$$ might then be interpreted as a one-sided P-value (in a $$-\log_{10}$$ scale) allowing the identification of those sites displaying a significantly high extended haplotype homozygosity in population $$pop2$$ (represented in the denominator of the corresponding $$\mathrm{LRiES}$$) relatively to the $$pop1$$ reference population.

3.3.2 The function ies2xpehh()

The ies2xpehh() function computes XP-EHH using two data frames containing the iES statistics for each of the two populations in the format as obtained by running the scan_hh() function (see section 2.4). For instance, to perform a genome scan one might first run for each population scan_hh() in turn on haplotype data from each chromosome and concatenate the resulting matrices. In the following example, we assume that the haplotype files are named as hap_chr_i.pop1 and hap_chr_i.pop2 where $$i$$ is the chromosome number (going from 1 to 29), the suffixes pop1 and pop2 indicate the population of origin and the SNP information file is named snp.info. The R code below then generates two data frames (wg.res.pop1 and wg.res.pop2) containing the results from all SNPs in the appropriate format to compute Rsb with the ies2rsb() function:

for(i in 1:29){
hap_file=paste("hap_chr_",i,".pop1",sep="")
data<-data2haplohh(hap_file="hap_file","snp.info",chr.name=i)
res<-scan_hh(data)
if(i==1){wg.res.pop1<-res}else{wg.res.pop1<-rbind(wg.res.pop1,res)}
hap_file=paste("hap_chr_",i,".pop2",sep="")
data<-data2haplohh(hap_file="hap_file","snp.info",chr.name=i)
res<-scan_hh(data)
if(i==1){wg.res.pop2<-res}else{wg.res.pop2<-rbind(wg.res.pop2,res)}
}
wg.xpehh<-ies2xpehh(wg.res.pop1,wg.res.pop2)

For illustration, we consider the $$iES$$ values of a genome scan (Gautier and Naves 2011) provided as example data and compute for each SNP the XP-EHH between the CGU and EUT populations as follows:

data(wgscan.cgu) ; data(wgscan.eut)
## results from a genome scan (44,057 SNPs) see ?wgscan.eut and ?wgscan.cgu for details
cguVSeut.xpehh<-ies2xpehh(wgscan.cgu,wgscan.eut,"CGU","EUT")

The resulting object cguVSeut.xpehh is a data frame containing for each SNP the XP-EHH and corresponding P-value assuming XP-EHH are normally distributed under the neutral hypothesis. The P-value might be either bilateral (default) or unilateral (as specified by the method argument). The first rows of this data frame are displayed below:

head(cguVSeut.xpehh)
>          CHR POSITION XPEHH (CGU vs. EUT) -log10(p-value) [bilateral]
> F0100190   1   113642          -0.5943673                   0.2578513
> F0100220   1   244699          -0.7903997                   0.3672448
> F0100250   1   369419          -0.9273568                   0.4513142
> F0100270   1   447278          -0.3858354                   0.1551387
> F0100280   1   487654          -0.9570604                   0.4703941
> F0100290   1   524507          -0.7908863                   0.3675322

3.3.3 Manhattan plot of the results: the function xpehhplot()

The xpehhplot() draws a Manhattan plot of the Whole Genome Scan results produced by the function ies2xpehh(). Various options are available to modify the graphical display (see ?xpehhplot). As an example, Figure 3.3 provides the output of the function xpehhplot for the XP-EHH computed above across the CGU and EUT populations (see section 3.3.2). It was drawn by:

layout(matrix(1:2,2,1))
xpehhplot(cguVSeut.xpehh,plot.pval=TRUE)

3.3.4Rsb vs. XP-EHH comparison

A plot of the Rsb against XP-EHH values across the CGU and EUT populations (see section 3.2.2 and 3.3.2 respectively) is represented in Figure 3.4. Marked in red is the SNP that was used repeatedly in the examples above. The plot was generated using the following R code:

plot(cguVSeut.rsb[,3],cguVSeut.xpehh[,3],xlab="Rsb",ylab="XP-EHH",pch=".",
xlim=c(-7.5,7.5),ylim=c(-7.5,7.5))
points(cguVSeut.rsb["F1205400",3],cguVSeut.xpehh["F1205400",3],col="red")
abline(a=0,b=1,lty=2)

3.4 Visual inspection of the standardized scores distribution: the function distribplot()

The distribplot() function allows to visualize the distributions of the standardized scores (either iHS, Rsb or XP-EHH) and compare them to the standard Gaussian distribution. As an example, Figure 3.5 below provides the output the function distribplot() when considering the iHS estimates obtained for the CGU population (see section 3.1.1) using the following R code:

layout(matrix(1:2,2,1))
distribplot(ihs.cgu\$iHS[,3],xlab="iHS")

4 Visualizing haplotype structure around a focal SNP: the bifurcation.diagram() function

The function bifurcation.diagram() draws haplotype bifurcation diagrams (Sabeti et al. 2002) that visualize the decay of EHH around a focal SNP. A stark contrast of ancestral and derived bifurcation diagrams should correspond to outlier values of iHS. Within the plot, the root (focal SNP) is identified by a vertical dashed line. The diagram is bi-directional, portraying decay along both sides of the focal SNP. Moving in one direction, each marker is an opportunity for a bifurcation to occurr further differentiating between extended haplotypes. The thickness of the lines corresponds to the number of chromosomes with the same haplotype.

Several options are available to modify the aspect of the plots (see command ?bifurcation.diagram). As an illustration, Figure 4.1 shows the bifurcation diagrams for both the derived and ancestral alleles at the SNP with name “F1205400” on BTA12 CGU haplotypes. This SNP is associated with a strong signal of selection (using both iHS and Rsb statistics) and is located closely (<5kb) to a strong candidate gene involved in horn development (Gautier and Naves 2011). Figure 4.1 was obtained by the following R code:

data(haplohh_cgu_bta12)
layout(matrix(1:2,2,1))
bifurcation.diagram(haplohh_cgu_bta12,mrk_foc="F1205400",all_foc=1,nmrk_l=20,nmrk_r=20,
main="Bifurcation diagram (RXFP2 SNP on BTA12): Ancestral Allele")
bifurcation.diagram(haplohh_cgu_bta12,mrk_foc="F1205400",all_foc=2,nmrk_l=20,nmrk_r=20,
main="Bifurcation diagram (RXFP2 SNP on BTA12): Derived Allele")

5 Differences to the program hapbin

The C++ program hapbin (Maclean et al. 2015) is an alternative tool to calculate the statistics iHS and XP-EHH. The obtained values vary between hapbin and rehh. As far as we can tell, this is due to the differences in implementation listed below. The first 4 are hard-wired and cannot be remediated without changing the code while all further discrepancies can be largely eliminated by appropriate parameter settings. We refer to version 1.3.0 of hapbin.

1. Hapbin disregards the SNPs directly at the border of chromosomes.

2. If an allele of a focal SNP is present only on a single chromosome, hapbin assigns a EHH value of 1 at the focal SNP and zero otherwise. rehh assigns zero at the focal SNP, too.

3. For the standardization of iHS resp. XP-EHH, hapbin uses the estimator $$\sqrt{\frac{1}{n}\sum(x_i-\bar{x})^2}$$ for the standard deviation while rehh uses $$\sqrt{\frac{1}{n-1}\sum(x_i-\bar{x})^2}$$.

4. The bins used by hapbin for the standardization of iHS cover the whole interval ]0,1], while in rehh they span the interval [minmaf,1-minmaf[. Hapbin includes the upper endpoint into each bin, while rehh includes the lower endpoint.

5. The default number of bins is 50 in hapbin, yielding a bin width of 0.02. The default width in rehh is 0.025. Setting the number of bins in hapbin to 40 with option --bin or -b yields the latter.

6. If run in default mode, hapbin calculates EHH by (notation as in section 2.1.1) $\begin{equation*} \mathrm{EHH}^a_{s,t}=\sum_{k=1}^{K^a_{s,t}}\left(\frac{n_k}{n_a}\right)^2\;. \end{equation*}$ For a set of $$n$$ chromosomes this estimator reaches its minimum value of $$\frac{1}{n}$$ if all of them are distinct. Yet formula (2.1) used by rehh and applied by hapbin if run with option --binom or -a returns zero in this case. The difference reflects distinct sampling strategies, either with replacement or without. For increasing sample size both converge. The same holds for EHHS.

7. Integration over EHH resp. EHHS is performed by hapbin on the area between the curve spanned by these quantities and the x-axis (y=0) while rehh integrates only over the part of that area that is above the threshold set by the parameters limehh resp. limehhs, i.e. the area between the curve and the line y=threshold. This is not to be confused with the condition for truncation at left and right ends of the curve which is (for all practical purposes) handled identically by both programs. Hence, setting limehh(s) to zero in rehh and to a very small value in hapbin (zero seems to “hang” it) with parameter -c or --cutoff yields virtually identical integrations.

8. By default, the parameter discard_integration_at_border is TRUE in rehh. It has to be set to FALSE in order to conform to hapbin.

9. Large differences can arise from different handling of gaps during the integration of EHH resp. EHHS yielding iHH resp. iES. Hapbin has a parameter -s or --scale to “down-weight” large gaps by capping them to the specified value. Its default value is 20000 while the corresponding option in rehh is turned off by default, but can be set by scalegap. The option maxgap within rehh leads to a stop of the integration and if the parameter discard_integration_at_border is set to TRUE, then no value is reported. This has no counterpart in hapbin. Instead, hapbin allows to specify a maximum length of Extended Haplotypes (disabled by default) which is not possible in rehh.

References

Gautier M., Naves M., 2011 Footprints of selection in the ancestral admixture of a new world creole cattle breed. Mol Ecol 20: 3128–3143.

Maclean C. A., Hong N. P. C., Prendergast J. G. D., 2015 Hapbin: An efficient program for performing haplotype-based scans for positive selection in large genomic datasets. Mol Biol Evol 32: 3027–3029.

O’Connell J., Gurdasani D., Delaneau O., Pirastu N., Ulivi S., others, 2014 A general approach for haplotype phasing across the full spectrum of relatedness. PLoS Genet 10: e1004234.

Sabeti P. C., Reich D. E., Higgins J. M., Levine H. Z. P., Richter D. J., others, 2002 Detecting recent positive selection in the human genome from haplotype structure. Nature 419: 832–837.

Sabeti P. C., Varilly P., Fry B., Lohmueller J., Hostetter E., others, 2007 Genome-wide detection and characterization of positive selection in human populations. Nature 449: 913–918.

Scheet P., Stephens M., 2006 A fast and flexible statistical model for large-scale population genotype data: Applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet 78: 629–644.

Tang K., Thornton K. R., Stoneking M., 2007 A new approach for using genome scans to detect recent positive selection in the human genome. PLoS Biol 5: e171.

Voight B. F., Kudaravalli S., Wen X., Pritchard J. K., 2006 A map of recent positive selection in the human genome. PLoS Biol 4: e72.