# Introduction

This Vignette shows an example usecase for generating a linkage map for polyploid data in R.

# Installation

First, we need to install the package pergola. Second, we set a seed, to ensure, that this vignette is reproducable.

library(pergola)
set.seed(31415)

Next, we load the example dataset from the package.

data("simTetra")

simTetra is a simulated tetraploid backcross dataset, consisting of seven chromosomes. A small subset reveals the structure of the dataset. Four columns represent a sample (e.g. P1, P2, F1…) and each row represents a marker. The values are 0 and 1, which stands for A and B allele in that case.

simTetra[1:5, 1:12]
##      F2_001_1 F2_001_2 F2_001_3 F2_001_4 F2_002_1 F2_002_2 F2_002_3
## A001        0        1        0        0        1        0        0
## A002        0        1        0        0        1        0        0
## A003        0        1        0        0        1        0        0
## A004        0        1        0        0        0        0        0
## A005        0        1        0        0        0        0        0
##      F2_002_4 F2_003_1 F2_003_2 F2_003_3 F2_003_4
## A001        1        1        1        0        1
## A002        1        1        0        0        1
## A003        1        1        0        0        1
## A004        1        1        0        0        1
## A005        1        1        0        0        1

# Data manipulation

As the data is simulated, the markers are grouped according to their chromosomes and ordered according to linkage. Our goal is to predict these. Thus, we want to randomize the order of the markers. The dataset contains the order of the alleles within the samples (haplotypes). This should also be randomized, because it provides information about the exact number of recombinations, which is approximated later.

simTetraGen <- shuffleInput(simTetra, ploidy = 4)
simTetraGen[1:5, 1:12]
##      F2_001_1 F2_001_2 F2_001_3 F2_001_4 F2_002_1 F2_002_2 F2_002_3
## G010        1        0        1        1        1        1        1
## C007        1        0        0        0        0        0        0
## B003        1        0        1        0        1        1        0
## D009        1        0        0        0        0        0        1
## A008        1        0        0        0        0        1        1
##      F2_002_4 F2_003_1 F2_003_2 F2_003_3 F2_003_4
## G010        0        0        1        0        0
## C007        1        1        1        1        1
## B003        0        0        0        0        1
## D009        0        1        1        0        0
## A008        0        1        0        1        0

Now that the data is randomized, we can treat it like real data. In the next step we collapse four columns into one. That way we have one column per sample, without loosing information.

simTetraGen <- bases2genotypes(simTetraGen, ploidy = 4)
simTetraGen[1:5, 1:6]
##      [,1] [,2] [,3] [,4] [,5] [,6]
## G010    3    3    1    1    2    1
## C007    1    1    4    3    2    2
## B003    2    2    1    1    1    1
## D009    1    1    2    2    2    1
## A008    1    2    2    2    2    1

Finally, the data is ready to be analyzed.

# Analysis

## Calculate pairwise recombination frequencies

Now, we want to compare all markers with each other. This allows us to group similar markers together and define linkage groups. The first step is the calculation of pairwise recombination frequencies. Unlike other tools we do not use the expectation–maximization (EM) algorithm or maximum-likelihood (ML) methods. Instead, we approximate the recombination events and always expect the minimum number.

rf <- calcRec(simTetraGen, ploidy = 4)

rf is the matrix containing the recombination frequencies for all pairs of markers.

image(rf, xaxt = 'n', yaxt = 'n')
axis(side = 1, at = seq(0, 1, length.out = nrow(rf)),
labels = rownames(rf), las = 2, cex.axis = 0.8)
axis(side = 2, at = seq(0, 1, length.out = nrow(rf)),
labels = rownames(rf), las = 2, cex.axis = 0.8) 

The colors represent the different frequencies (yellow = high, red = low). We do not see any structure in the data set, except the diagonal. It is red, because each there is no recombination between a marker and itself.

Next, we want to find out which marker belongs into which linkage group. Usually the number of linkage groups is equal to the number of chromosomes. In the beginning of this vignette, we mentioned that our dataset consist of seven chromosomes. Now it is time to see, if that can be observed from the data. First, we plot a dendogram, the default type of plot in the funtion splitChr.

plotRf(rf)

The plot shows seven easily distinguishable clusters, as expected. The other implemented visualization is an image of the recombination frequency values:

plotRf(rf, plottype = "image")

Here we see seven rectangles and assume seven linkage groups. Hence, we split the data with splitChr.

split <- splitChr(rf, nchr = 7)
table(split$split) ## ## 1 2 3 4 5 6 7 ## 16 17 22 20 15 19 22 head(split) ## names split dup ## G010 G010 1 0 ## C007 C007 2 0 ## B003 B003 3 0 ## D009 D009 4 0 ## A008 A008 5 0 ## C009 C009 2 0 table() shows us, how many markers are on each chromosome. ### Additional parameters In case that single markers end up in own linkage groups, one might want to filter them out with the parameter filter = TRUE. If many markers are highly similar, we removing those duplicates is adviced. They have no added value for the linkage map and end up at the same position. ## Marker ordering Now, that we grouped our markers, we want to find the correct order of markers, within the linkage groups. Filtered markers (chromosome 0) would be ignored. split <- sortLeafs(rf, split) head(split) split$order is the global order of markers. We can visualize it with image().

image(rf[split$order, split$order], xaxt = 'n', yaxt = 'n')
axis(side = 1, at = seq(0, 1, length.out = nrow(rf)),
labels = rownames(rf)[split$order], las = 2, cex.axis = 0.8) axis(side = 2, at = seq(0, 1, length.out = nrow(rf)), labels = rownames(rf)[split$order], las = 2, cex.axis = 0.8) 

We see, that the order within the chromosomes has improved, because the red values moved closer to the diagonal.

### Example of ambiguous orders

Here we show an example, where two orders are equally good when only one neighbor is included.

set.seed(3)
ambRF <- cbind(c(0, 2, 4, 6, 8, 12),
c(2, 0, 4, 4, 7, 10),
c(4, 4, 0, 2, 4, 7),
c(6, 4, 2, 0, 4, 5),
c(8, 7, 4, 4, 0, 3),
c(12, 10, 7, 5, 3, 0)) / 100
ambsplit <- data.frame(names = LETTERS[1:6],
split = rep(1, 6),
order = 1:6)
amb1 <- sortLeafs(ambRF, ambsplit)
amb1$order ## [1] 1 2 3 4 5 6 amb2 <- c(1,2,4,3,5,6) amb3 <- c(2,1,3,4,5,6) amb4 <- 6:1 #reverse of amb1 calcSarf(ambRF, amb1$order, n = 1)
## [1] 0.15
calcSarf(ambRF, amb2, n = 1)
## [1] 0.15
calcSarf(ambRF, amb3, n = 1)
## [1] 0.15
calcSarf(ambRF, amb4, n = 1)
## [1] 0.15
calcSarf(ambRF, amb1\$order, n = 2)
## [1] 0.32
calcSarf(ambRF, amb2, n = 2)
## [1] 0.36
calcSarf(ambRF, amb3, n = 2)
## [1] 0.34
calcSarf(ambRF, amb4, n = 2)
## [1] 0.32

## Marker spacing

Finally, we create the map:

maps <- pullMap(rf, split)

We get a list, with one object per chromosome. We visualize it with plotChr().

plotChr(maps[[1]], cex = 0.6)

We can also compare two chromosome maps:

maps2 <- pullMap(rf, split, fun = "kosambi")
plotChr(maps[[1]], maps2[[1]], cex = 0.6)

The map, based on the Kosambi function is smaller, than the default Haldane function.

## Map comparison

To compare two maps on global level we use the package dendextend.

library(dendextend)
library(gclus)

We create a new map from the existing map maps2, because otherwise the two dendograms would be equal.

maps3 <- maps2
maps3[1] <- maps2[2]
maps3[2] <- maps2[1]
maps3[3] <- maps2[7]
maps3[[4]] <- rev(max(maps3[[4]]) - maps3[[4]])
maps3[6] <- maps2[3]
maps3[7] <- maps2[6]
maps3[[4]]
##       D001       D002       D003       D004       D005       D006
##   0.000000   9.641562  19.941057  28.076617  37.563857  45.440883
##       D007       D008       D009       D010       D011       D012
##  53.445072  63.411050  69.693911  78.887030  87.143073  95.414429
##       D013       D014       D015       D016       D017       D018
## 103.454019 111.493610 114.777010 120.637028 127.458702 137.955048
##       D019       D020
## 147.387404 156.503482
maps2[[4]]
##       D020       D019       D018       D017       D016       D015
##   0.000000   9.116078  18.548434  29.044780  35.866454  41.726472
##       D014       D013       D012       D011       D010       D009
##  45.009872  53.049463  61.089053  69.360409  77.616452  86.809571
##       D008       D007       D006       D005       D004       D003
##  93.092432 103.058410 111.062599 118.939625 128.426865 136.562425
##       D002       D001
## 146.861920 156.503482

We create dendogram objects from our previously created maps.

dend1 <- map2dend(maps)
plot(dend1, cex = 0.6)

One is plotted, to show what it looks like.

dend2 <- map2dend(maps3)

maketangle creates a tanglegram.

maketangle(dend1, dend2, cutheight = 500, k = 7)

We can easily observe the rearrangements that we did before. In a real data example we would not want these. We use switchChrs and swapChrs to reverse them. One map serves as model while the other one is changed.

maps <- switchChrs(map = maps, comp = maps3)
maps <- swapChrs(map = maps, comp = maps3)
dend3 <- map2dend(maps)
dend4 <- map2dend(maps3)
maketangle(dend3, dend4, cutheight = 500, k = 7)

We see, that both maps are now perfectly aligned. However, the interchromosomal order is not changed.

# Acknowledgement

I thank Jeroen Lodewijk for testing the package and providing valuable feedback about its usablity and this vignette.

# Funding

This work was supported by the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement No. 289974. INTERCROSSING

# Session Info

sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 15.10
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] dendextendRcpp_0.6.1 Rcpp_0.12.2          gclus_1.3.1
## [4] cluster_2.0.3        dendextend_1.1.2     pergola_1.0
##
## loaded via a namespace (and not attached):
##  [1] whisker_0.3-2      knitr_1.12.3       magrittr_1.5
##  [4] roxygen2_5.0.1     devtools_1.9.1     MASS_7.3-43
##  [7] colorspace_1.2-6   foreach_1.4.3      stringr_1.0.0
## [10] caTools_1.17.1     tools_3.2.2        grid_3.2.2
## [13] seriation_1.1-3    KernSmooth_2.23-15 registry_0.3
## [16] htmltools_0.3      iterators_1.0.8    gtools_3.5.0
## [19] yaml_2.1.13        digest_0.6.9       formatR_1.2.1
## [22] bitops_1.0-6       codetools_0.2-14   evaluate_0.8
## [25] memoise_0.2.1      rmarkdown_0.9.5    TSP_1.1-3
## [28] gdata_2.17.0       stringi_1.0-1      gplots_2.17.0