R/lineup2 is an R package providing tools for detecting and correcting sample mix-ups between two sets of measurements, such as between gene expression data on two tissues. It’s a revised version of lineup, to be more general and not so closely tied to the R/qtl package.

We will illustrate some of the tools in this User Guide. We will use the example data included in the package `lineup2ex`

. These are a subset of data from Broman et al. (2015) (see also Tian et al. 2015), and concern gene expression microarray data on about 500 mice in two tissues, gastrocnemius muscle (abbreviated “gastroc”) and pancreatic islets (abbreviated “islet”). While the original data included 40,000 gene expression measurements, the example data consider just a subset of 200 genes: 100 chosen to be highly correlated between the two tissues, and 100 others chosen at random.

When we load the lineup2 package, the data are immediately available, as `lineup2ex`

, which is a list containing the two tissues, each of which is a matrix, samples × genes.

We seek to identify sample mix-ups: mislabeled rows in one or the other data sets. One approach to do this is to first identify a subset of correlated columns, and then to look at the correlations between rows, for that subset of columns.

To calculate the correlations between columns, we first need to align the rows between the two matrices. We can do this with the function `align_matrix_rows()`

. There is a similar function `align_matrix_cols()`

for aligning columns.

The original matrices have 498 and 499 rows. The output of `align_matrix_rows()`

is a list with two matrices, where the original matrices are reduced to their common rows, based on their row names. The rows are also placed in the same order. Here, each matrix in the output has 497 rows, which is the number of samples in common between the two tissues.

We could use `align_matrix_cols()`

to align the columns (subsetting to the common column names and putting them in the same order), but here they are already aligned, and so no change occurs.

With the rows now aligned, we can now calculate the correlation in gene expression between the two tissues, to identify genes that will be useful for measuring sample similarity.

The function `corr_betw_matrices()`

is used to calculate the correlation between the columns in one matrix and the columns in another matrix. There are several options; here we will use `what="paired"`

to get just the correlations between the paired columns.

The result is a vector of 200 correlations, for the pairs of columns. Here is a plot of the sorted correlations.

```
par(mar=c(4.1, 5.1, 1.1, 1.1), las=1)
plot(sort(paired_corr), ylab="Correlations between column pairs, sorted", pch=16)
abline(h=0, lty=2, col="gray60")
```

Because of the way the genes were selected, there are 100 genes with between-tissue correlation scattered around 0, and then another 100 genes with correlations > 0.75. We will grab that top 100 for assessing sample similarity.

We can now calculate the similarity between the samples using these selected genes. The simplest way to do so is by again using `corr_betw_matrices()`

, but with the transposed matrices so that the samples become the columns.

```
corr_betw_samples <- corr_betw_matrices(t(gastroc[,selected_genes]),
t(islet[,selected_genes]), what="all")
```

The output is a 498 × 499 matrix, with the (i,j)th element being the correlation between the ith row of `gastroc`

and the jth row of `islet`

.

Note that there is a related function `dist_betw_matrices()`

to calculate the Euclidean or mean absolute distance between samples. There is also `dist_betw_arrays()`

for arrays with >2 dimensions. Unlike `corr_betw_matrices()`

, these functions look for distance between rows not columns.

The first thing to do with the resulting similarity matrix is to plot the values. The function `hist_self_nonself()`

plots histograms of the self-self correlations and the self-nonself correlations.

The self-self correlations are mostly > 0.7, while the self-nonself correlations are mostly < 0.7. The bimodal distribution of the self-nonself correlations is due to sex differences. Same-sex mice generally show positive correlations, while opposite-sex mice generally show negative correlations.

There are a few self-self correlations that are small and even negative. These are the sample mix-ups we are looking for. To identify the problem samples, we want to pull out the individual values, and identify the maximum values in each row and column.

You can pull out the self-self correlations with the function `get_self()`

. These are the cases where the row and column names match.

There is also a function `get_nonself()`

for pulling out the non-self values.

Note that there are 3 missing values, where a sample is present in only one of the two tissues:

```
## Mouse3475 Mouse3325 Mouse3666
## NA NA NA
```

Here are the smallest eight values. As we will see, the first five of these are the real problems.

```
## Mouse3655 Mouse3659 Mouse3598 Mouse3599 Mouse3296 Mouse3538 Mouse3353 Mouse3144
## -0.277 -0.221 0.246 0.277 0.557 0.620 0.637 0.697
```

The samples with low self-self correlation are likely mix-ups. To identify the correct labels, we can look for the largest correlation in each row and column of the similarity matrix. To do this, we use the function `get_best()`

. We use the argument `get_min=FALSE`

since we are looking for the maximum here, and we use the argument `dimension`

to control whether to look within each row (gastroc) or within each column (islet).

```
best_byrow <- get_best(corr_betw_samples, get_min=FALSE, dimension="row")
best_bycol <- get_best(corr_betw_samples, get_min=FALSE, dimension="column")
```

We can plot the self-self correlations against the maximum correlations by row and column. (There is a bit of effort to get labels on the points.)

```
par(mfrow=c(1,2), mar=c(4.1, 4.1, 1.6, 0.6), las=1)
plot(self, best_byrow, xlab="Self-self correlation", ylab="Best islet correlation",
main="Gastroc samples", pch=16, xlim=c(-0.3, 1.0), ylim=c(0.6, 1.0))
label <- best_byrow-self > 0.2
text(self[label]+0.03, best_byrow[label], names(self)[label],
adj=c(0,0.5), cex=0.8)
plot(self, best_bycol, xlab="Self-self correlation", ylab="Best gastroc correlation",
main="Islet samples", pch=16, xlim=c(-0.3, 1.0), ylim=c(0.6, 1.0))
label <- best_bycol-self > 0.3
text(self[label]+0.03, best_bycol[label], names(self)[label],
adj=c(0,0.5), cex=0.8)
label <- best_bycol-self > 0.2 & best_bycol-self < 0.3
text(self[label]-0.03, best_bycol[label], names(self)[label],
adj=c(1,0.5), cex=0.8)
```

Most samples have a large self-self correlation that matches the maximum in the row and column. There are four gastroc samples that have low self-self correlation but where there is another large value in that row, and there are five problem islet samples.

The function `get_problems()`

can be used to get a summary of potential problems. Looking by row or by column, you can pull out the samples where the best distance exceeds the self-self distance by more than some threshold.

Looking by row, and using a threshold of 0.2, we see these samples:

```
## ind self best which_best next_best which_next_best
## 1 Mouse3659 -0.221 0.886 Mouse3655 0.716 Mouse3161
## 2 Mouse3655 -0.277 0.774 Mouse3659 0.593 Mouse3309
## 3 Mouse3598 0.246 0.896 Mouse3599 0.776 Mouse3539
## 4 Mouse3599 0.277 0.866 Mouse3598 0.750 Mouse3193
## 5 Mouse3475 NA 0.713 Mouse3649 0.699 Mouse3432
```

This shows the four problems in the left panel of the above figure, and that they constitute two apparent sample swaps (Mouse3655 ↔︎ Mouse3659 and Mouse3598 ↔︎ Mouse3599).

Looking by column, again using a threshold of 0.2, we see the same results plus one more:

```
## ind self best which_best next_best which_next_best
## 1 Mouse3655 -0.277 0.886 Mouse3659 0.691 Mouse3248
## 2 Mouse3659 -0.221 0.774 Mouse3655 0.656 Mouse3558
## 3 Mouse3598 0.246 0.866 Mouse3599 0.704 Mouse3469
## 4 Mouse3599 0.277 0.896 Mouse3598 0.758 Mouse3285
## 5 Mouse3296 0.557 0.812 Mouse3295 0.626 Mouse3648
## 6 Mouse3325 NA 0.824 Mouse3142 0.787 Mouse3488
## 7 Mouse3666 NA 0.698 Mouse3354 0.694 Mouse3121
```

We again see the Mouse3655 ↔︎ Mouse3659 and Mouse3598 ↔︎ Mouse3599 pairs, and also see that Mouse3296 sample in islet looks to correspond to Mouse3295 (gastroc).

Note that the function `which_best()`

can be used to provide the sample IDs that correspond to the results of `get_best()`

; sort of like `which.max()`

vs `max()`

. This is another way to see that islet Mouse3296 looks like Mouse3295.

```
## Mouse3296
## "Mouse3295"
```

Another way to look at these problems is to plot the values in each column. We will point to the maximum in red and the self value in green.

```
samples <- sort(c(names(self)[best_bycol-self > 0.2], "Mouse3295"))
par(mfrow=c(3,2), las=1, mar=c(4.1, 4.1, 2.1, 0.6))
rn <- rownames(corr_betw_samples)
green <- "#2ecc40"
red <- "#ff4136"
for(sample in samples) {
plot(corr_betw_samples[,sample], xlab="Index", ylab="Correlation",
main=sample, pch=16, ylim=range(corr_betw_samples))
mx <- which.max(corr_betw_samples[,sample])
points(mx, corr_betw_samples[mx,sample], pch=16, col=red)
text(mx-8, corr_betw_samples[mx, sample], names(mx), adj=c(1, 0.5))
points(which(rn==sample), corr_betw_samples[sample,sample], pch=16, col=green)
}
```

From these results, we can see that Mouse3598 and Mouse3599 look to be switched in one or the other tissue, as are Mouse3655 and Mouse3659. From these data on their own, we can’t tell which tissue is the problem, but the original data include gene expression microarrays for four other tissues, and by comparisons to those tissues, it turns out that Mouse3655/Mouse3659 were mixed up in gastroc, while Mouse3598/Mouse3599 were mixed up in islet. See Broman et al. (2015), Fig. 4.

Samples Mouse3295 and Mouse3296 look correct in gastroc, and Mouse3295 looks correct in islet, but Mouse3296 islet looks like Mouse3295 gastroc. The sample labelled Mouse3296 in islet is really an unintended biological replicate of sample Mouse3295.

As further support to show that the best-correlated samples correspond to the correct labels, it can be useful to consider the second-best sample in each row and column. If the best-correlated sample is not much more correlated than the second-best, than the support is not so strong, for the best sample to be the true one.

To pull out the second-best correlated sample, use the function `get_2ndbest()`

, which has the same arguments as `get_best()`

.

```
secbest_byrow <- get_2ndbest(corr_betw_samples, get_min=FALSE, dimension="row")
secbest_bycol <- get_2ndbest(corr_betw_samples, get_min=FALSE, dimension="column")
```

We then can plot the second-best samples versus the best samples, and we will point to the problem samples. The four problem samples in gastroc, and the five problem samples in islet, are highlighted in red.

```
par(mfrow=c(1,2), mar=c(4.1, 4.1, 1.6, 0.6), las=1)
plot(best_byrow, secbest_byrow, xlab="Best islet correlation",
ylab="Second best islet correlation",
main="Gastroc samples", pch=16, xlim=c(0.5, 1), ylim=c(0.5, 1))
label <- best_byrow-self > 0.2
points(best_byrow[label], secbest_byrow[label], pch=16, col=red)
abline(0,1, lty=2)
plot(best_bycol, secbest_bycol, xlab="Best gastroc correlation",
ylab="Second best gastroc correlation",
main="Islet samples", pch=16, xlim=c(0.5, 1), ylim=c(0.5, 1))
label <- best_bycol-self > 0.2
points(best_bycol[label], secbest_bycol[label], pch=16, col=red)
abline(0,1, lty=2)
```

Values away from the diagonal have good support for the “best” sample being the correct one. This includes all of the problem samples (in red).

Note that there is `which_2ndbest()`

function, similar to `which_best()`

but pulling out the second-closest sample.

Broman KW, Keller MP, Broman AT, Kendziorski C, Yandell BS, Sen Ś, Attie AD (2015) Identification and correction of sample mix-ups in expression genetic data: A case study. G3 5:2177-2186 doi:10.1534/g3.115.019778

Tian J, Keller MP, Oler AT, Rabaglia ME, Schueler KL, Stapleton DS, Broman AT, Zhao W, Kendziorski C, Yandell BS, Hagenbuch B, Broman KW, Attie AD (2015) Identification of the bile acid transporter Slco1a6 as a candidate gene that broadly affects gene expression in mouse pancreatic islets. Genetics 201:1253-1262 doi:10.1534/genetics.115.179432