# fRLR: Fit Repeated Linear Regressions

```r
library(fRLR)
```

## Introduction

This R package fits repeated linear regressions in which some terms are shared across all of the regressions.

## An Example

Let's start with the simplest situation: we want to fit a set of regressions that differ in only one variable. Specifically, denote the response variable by $$y$$; the regressions are as follows.

$\begin{array}{ll} y&\sim x_1 + cov_1 + cov_2+\ldots+cov_m\\ y&\sim x_2 + cov_1 +cov_2+\ldots+cov_m\\ &\;\;\vdots\\ y&\sim x_n + cov_1 +cov_2+\ldots+cov_m \end{array}$

where the covariates $$cov_i,\ i=1,\ldots, m$$ are shared by all of the regressions.

## Ideas

Intuitively, we can accomplish this task with a simple loop:

```r
model = vector(mode = 'list', length = n)
for (i in 1:n)
{
  # fit the i-th regression: the i-th variable plus the shared covariates
  model[[i]] = lm(y ~ x[, i] + cov)
}
```

However, this is inefficient in our situation. In linear regression, the main goal is to estimate the parameter $$\beta$$, and the least-squares estimate is

$\hat\beta = (X'X)^{-1}X'Y$

where $$X$$ is the design matrix and $$Y$$ is the vector of observations of the response variable.

Many entries of the design matrix are identical across these regressions, and the larger $$m$$ is, the more entries are shared. So we can reduce the cost of computation by separating out the shared part of the design matrix, as sketched below.
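For concreteness, here is the direct computation in R (a minimal sketch with made-up data, not package code); in the naive loop, the full cross product $$X'X$$ is recomputed and refactorized for every regression even though only one column of $$X$$ changes:

```r
# a minimal sketch of the least-squares estimate with made-up data (not package code)
set.seed(1)
y   = rnorm(10)
x   = rnorm(10)                 # the regression-specific variable
cov = matrix(rnorm(20), 10, 2)  # the shared covariates
X   = cbind(1, x, cov)          # design matrix, intercept included
betahat = solve(crossprod(X), crossprod(X, y))  # (X'X)^{-1} X'Y
# identical to the lm() estimates; only the x column changes across regressions
coef(lm(y ~ x + cov))
```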

## Method

For the above example, the design matrix can be written as $$X=(x, cov)$$. If an intercept is included, it too is shared among the regressions, so it can be absorbed into $$cov$$ naturally. Then we have

$X'X= \left[ \begin{array}{cc} x'x & x'cov \\ cov'x & cov'cov \end{array} \right]= \left[ \begin{array}{cc} a& v'\\ v& B \end{array} \right]$

The Woodbury formula tells us

$(A+UCV)^{-1}=A^{-1}-A^{-1}U(C^{-1}+VA^{-1}U)^{-1}VA^{-1}$
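As a quick numerical sanity check of this identity (small random matrices, chosen only for illustration):

```r
# numerical check of the Woodbury identity with small random matrices
set.seed(1)
A = diag(4) + crossprod(matrix(rnorm(16), 4))  # a well-conditioned 4x4 matrix
U = matrix(rnorm(8), 4, 2)
C = diag(2)
V = matrix(rnorm(8), 2, 4)
lhs = solve(A + U %*% C %*% V)
rhs = solve(A) - solve(A) %*% U %*% solve(solve(C) + V %*% solve(A) %*% U) %*% V %*% solve(A)
max(abs(lhs - rhs))  # zero up to floating-point error
```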

Let

$A=\left[ \begin{array}{ll} a&O\\ O&B \end{array}\right],\; U=\left[ \begin{array}{ll} 1 & 0\\ O & v \end{array} \right],\; V= \left[ \begin{array}{ll} 0& v'\\ 1& O \end{array} \right]$

and $$C=I_{2\times 2}$$. Then we can apply the Woodbury formula:

$(X'X)^{-1}=(A+UCV)^{-1}=A^{-1}-A^{-1}U(C^{-1}+VA^{-1}U)^{-1}VA^{-1}$

where

$A^{-1}=\left[ \begin{array}{cc} a^{-1}&O\\ O&B^{-1} \end{array} \right]$

Carrying out the calculation and simplifying, we obtain

$(X'X)^{-1}=\left[ \begin{array}{cc} \frac{1}{a}+\frac{v'B^{-1}v}{a(a-v'B^{-1}v)} & -\frac{v'B^{-1}}{a-v'B^{-1}v}\\ -\frac{B^{-1}v}{a-v'B^{-1}v} & B^{-1}+\frac{B^{-1}vv'B^{-1}}{a-v'B^{-1}v} \end{array} \right]$

in which the top-left entry simplifies to $$\frac{1}{a-v'B^{-1}v}$$.

Notice that the matrix $$B$$ is the same for all regressions; the only regression-specific quantities are $$a$$ and $$v$$, which are very cheap to compute. Since $$B^{-1}$$ is computed only once and then reused, we can, in theory, reduce the cost of computation significantly.
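The following standalone sketch (made-up data; not the package's internal implementation) checks both the decomposition and the block-inverse formula numerically:

```r
# numerical check of the decomposition X'X = A + UCV and the block-inverse formula
# (a standalone sketch with made-up data, not the package's internal code)
set.seed(1)
x   = rnorm(10)                           # regression-specific variable
cov = cbind(1, matrix(rnorm(20), 10, 2))  # shared covariates, intercept absorbed
X   = cbind(x, cov)

a = drop(crossprod(x))   # x'x, a scalar; changes with each regression
v = crossprod(cov, x)    # cov'x; changes with each regression
B = crossprod(cov)       # cov'cov; computed once for all regressions

# check X'X = A + UCV with C = I
m1 = ncol(cov)
A = rbind(cbind(a, t(rep(0, m1))), cbind(rep(0, m1), B))
U = rbind(c(1, 0), cbind(rep(0, m1), v))
V = rbind(cbind(0, t(v)), c(1, rep(0, m1)))
max(abs(A + U %*% V - crossprod(X)))      # ~ 0

# check the block formula for (X'X)^{-1}
Binv = solve(B)
s = drop(a - t(v) %*% Binv %*% v)         # Schur complement a - v'B^{-1}v
blockinv = rbind(
  cbind(1/a + drop(t(v) %*% Binv %*% v) / (a * s), -t(v) %*% Binv / s),
  cbind(-Binv %*% v / s, Binv + Binv %*% v %*% t(v) %*% Binv / s)
)
max(abs(blockinv - solve(crossprod(X))))  # ~ 0
```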

## Test

Let's now test two simulated examples using the functions in this package: `frlr1` fits a separate regression for each single variable of interest, and `frlr2` fits one for each pair.

```r
## use fRLR package
set.seed(123)
X = matrix(rnorm(50), 10, 5)
Y = rnorm(10)
COV = matrix(rnorm(40), 10, 4)
frlr1(X, Y, COV)
#>   r r.p.value
#> 1 0 0.4380128
#> 2 1 0.7791076
#> 3 2 0.2212869
#> 4 3 0.9495018
#> 5 4 0.6729983
```

```r
## use simple loop
res = matrix(nrow = 0, ncol = 2)
for (i in 1:ncol(X))
{
  mat = cbind(X[, i], COV)
  df = as.data.frame(mat)
  model = lm(Y~., data = df)
  tmp = c(i, summary(model)$coefficients[2, 4])
  res = rbind(res, tmp)
}
res
#>     [,1]      [,2]
#> tmp    1 0.4380128
#> tmp    2 0.7791076
#> tmp    3 0.2212869
#> tmp    4 0.9495018
#> tmp    5 0.6729983
```

As we can see in the above output, the p-values for the variable of interest agree between the two methods (note that `frlr1` indexes the variables from 0, while the loop uses 1-based indices).

Similarly, we can test another example, in which each regression contains two variables of interest.

```r
set.seed(123)
X = matrix(rnorm(50), 10, 5)
Y = rnorm(10)
COV = matrix(rnorm(40), 10, 4)
idx1 = c(1, 2, 3, 4, 1, 1, 1, 2, 2, 3)
idx2 = c(2, 3, 4, 5, 3, 4, 5, 4, 5, 5)
frlr2(X, idx1, idx2, Y, COV)
#>    r1 r2 r1.p.value  r2.p.value
#> 1   1  2 0.53021406 0.895719578
#> 2   2  3 0.01812006 0.009833047
#> 3   3  4 0.29895922 0.963995969
#> 4   4  5 0.91749181 0.712075464
#> 5   1  3 0.33761507 0.210331456
#> 6   1  4 0.51074586 0.966484642
#> 7   1  5 0.12479380 0.152802911
#> 8   2  4 0.79302893 0.902402294
#> 9   2  5 0.73153760 0.663392258
#> 10  3  5 0.32367303 0.877154122
#> 11  2  3 0.01812006 0.009833047
#> 12  1  2 0.53021406 0.895719578
#> 13  3  4 0.29895922 0.963995969
#> 14  4  5 0.91749181 0.712075464
#> 15  1  3 0.33761507 0.210331456
#> 16  1  4 0.51074586 0.966484642
#> 17  1  5 0.12479380 0.152802911
#> 18  2  4 0.79302893 0.902402294
#> 19  2  5 0.73153760 0.663392258
#> 20  3  5 0.32367303 0.877154122
```

```r
res = matrix(nrow = 0, ncol = 4)
for (i in 1:length(idx1))
{
  mat = cbind(X[, idx1[i]], X[, idx2[i]], COV)
  df = as.data.frame(mat)
  model = lm(Y~., data = df)
  tmp = c(idx1[i], idx2[i],
          summary(model)$coefficients[2, 4],
          summary(model)$coefficients[3, 4])
  res = rbind(res, tmp)
}
```

Again, we obtain the same results with both methods.

## Computation Performance

The main aim of this method is to reduce computation cost. Now let's compare its speed with the simple-loop method; the following timings are for $$99\times 100/2=4950$$ linear regressions.

```r
set.seed(123)
n = 100
X = matrix(rnorm(10*n), 10, n)
Y = rnorm(10)
COV = matrix(rnorm(40), 10, 4)
## all pairs among the n variables
id = combn(n, 2)
idx1 = id[1, ]
idx2 = id[2, ]
system.time(frlr2(X, idx1, idx2, Y, COV))
#>    user  system elapsed
#>   2.048   0.109   0.965

simpleLoop <- function()
{
  res = matrix(nrow = 0, ncol = 4)
  for (i in 1:length(idx1))
  {
    mat = cbind(X[, idx1[i]], X[, idx2[i]], COV)
    df = as.data.frame(mat)
    model = lm(Y~., data = df)
    tmp = c(idx1[i], idx2[i],
            summary(model)$coefficients[2, 4],
            summary(model)$coefficients[3, 4])
    res = rbind(res, tmp)
  }
}
```

```r
system.time(simpleLoop())
#>    user  system elapsed
#>   9.275   0.011   9.413
```

On this run, `frlr2` needs roughly a tenth of the elapsed time of the naive loop.
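A single `system.time()` call can be noisy. For more reliable numbers, one could repeat the timings, for instance with the microbenchmark package (an optional extra, not part of the original comparison):

```r
# optional: repeated timings via the microbenchmark package (assumes it is installed)
library(microbenchmark)
microbenchmark(
  frlr2 = frlr2(X, idx1, idx2, Y, COV),
  loop  = simpleLoop(),
  times = 10
)
```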