Monday, 14 August 2017

Career with Maths & Stats

Thinking of a career in mathematics and statistics? Below are career information sessions for Australian students and a general overview of what research using statistics is like.

Sunday, 12 June 2016

A different perspective to one-way ANOVA

If you've done some statistics in the past, you may have come across the analysis of variance (ANOVA) for comparing the means of several treatments. If you don't remember, or never learnt it, the One-way ANOVA section below may jog your memory. In the last section, we revisit the one-way ANOVA using a notion described in Bailey (2008), which in my opinion gives an enlightening perspective on ANOVA. Bailey makes a (I think older?) version of the book available on her website and interested readers are recommended to read it for more details. For the latter perspective, you need some understanding of linear algebra and vector spaces, for which I have some notes written up here.

One-way ANOVA

Suppose we have $t$ treatments with $r_i$ replications for the $i$-th treatment so we have $n=\sum_{i=1}^t r_i$ observations. Let $Y_{ij}$ be the $j$-th observation (which is yet to be realised, hence the capitalisation) of the $i$-th treatment. Suppose for $i=1,...,t$ and $j=1,...,r_i$, we define our model as
\[ Y_{ij} = \mu + \alpha_i + \epsilon_{ij} \]
where $\mu$ is the overall mean level, $\alpha_i$ is the differential effect of the $i$-th treatment, and $\epsilon_{ij}$ is the random error in the $j$-th observation under the $i$-th treatment. We assume the errors are independent with $\epsilon_{ij} \sim N(0, \sigma^2)$, and most of the literature places either the zero-sum constraint, i.e. $\sum_{i=1}^t \alpha_i = 0$, or the corner-point constraint, i.e. $\alpha_{i_0}=0$ for some fixed $i_0$.

Define $\displaystyle\bar{Y}_{i\cdot} = \dfrac{1}{r_i}\sum_{j=1}^{r_i} Y_{ij}$ and $\displaystyle\bar{Y}_{\cdot\cdot} = \dfrac{1}{n}\sum_{i=1}^t\sum_{j=1}^{r_i} Y_{ij}$. Then note the following identity.

\begin{eqnarray}
\sum_{i=1}^t\sum_{j=1}^{r_i}  (Y_{ij} - \bar{Y}_{\cdot\cdot})^2 &=& \sum_{i=1}^t\sum_{j=1}^{r_i}   (Y_{ij} - \bar{Y}_{i\cdot})^2 + \sum_{i=1}^t  r_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2\\
\text{SST} &=& \text{SSW} + \text{SSB}
\end{eqnarray}
where SST is the total sum of squares (corrected for the mean), SSW is the sum of squares within treatment groups (also called the residual sum of squares) and SSB is the sum of squares between treatment groups (also called the treatment sum of squares). Recall that the one-way ANOVA table is then constructed as below.

\[\begin{array}{ccccc}
\hline
{\rm Source} & {\rm SS} & {\rm df} & {\rm MS} & {\rm F}\\\hline
{\rm Between/Treatment} &\displaystyle \sum_{i=1}^t  r_i(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2  & t-1& \dfrac{\text{SSB}}{t-1}&\displaystyle  \dfrac{\text{MSB}}{\text{MSW}}\\
{\rm Within/Residual} & \displaystyle \sum_{i=1}^t\sum_{j=1}^{r_i}   (Y_{ij} - \bar{Y}_{i\cdot})^2  & n - t&  \dfrac{\text{SSW}}{n-t}&\\\hline
{\rm Total} &\displaystyle \sum_{i=1}^t\sum_{j=1}^{r_i}  (Y_{ij} - \bar{Y}_{\cdot\cdot})^2 & n-1 & & \\\hline
\end{array}\]

We may then wish to test $H_0: \alpha_1 = ... = \alpha_t$ against $H_1:$ at least one $\alpha_i$ differs (i.e. not $H_0$). The test statistic is given by
\[ F = \dfrac{\text{MSB}}{\text{MSW}} \sim F_{t-1,n-t} \quad \text{ under }H_0. \]
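
As a quick illustration, here is a minimal sketch in R with simulated data (the treatment effects, replication numbers and seed are made up purely for illustration); anova() reproduces the Between/Treatment and Within/Residual rows of the table and the F statistic MSB/MSW.
set.seed(1)
ntrt <- 3                                 # number of treatments t
r <- c(4, 5, 6)                           # replications r_i per treatment
trt <- factor(rep(paste0("T", 1:ntrt), times = r))
alpha <- c(-1, 0, 1)                      # made-up treatment effects
y <- 10 + alpha[as.integer(trt)] + rnorm(sum(r))
# one-way ANOVA table: the "trt" row is Between/Treatment, "Residuals" is Within
anova(lm(y ~ trt))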


One-way ANOVA revisited

The above is how I learnt it in my undergraduate education, but there are some things I wish I had learnt differently, specifically in the way that Bailey (2008) presents the idea. These include:

  1. the index notation $Y_{ij}$ - such notation imposes a certain rigid notion of the structure of the data (in my opinion it suggests the data are organised so that replications sit within treatments, whereas we know the experiment would have randomised the units), and extending this notation to complex factorial designs makes it very cumbersome (imagine $Y_{ijklmn}$); and 
  2. the notion of vector space in the construction of the ANOVA table. 

Some notation

Bailey (2008) addresses issue 1 by denoting the set of observational units by $\Omega$; then $Y_\omega$, for $\omega \in \Omega$, is the observation on unit $\omega$, and $|\Omega| =n$. We let $\bs{y} = (Y_1,...,Y_\omega, ..., Y_n)^\top$ be the $n\times 1$ vector of data. Here we conveniently ignore the statistical convention of capitalising random variables so as not to confuse it with our matrix notation (Bailey, 2008, does capitalise it). She also denotes the set of treatments by $\mathcal{T}$, with the treatment factor $T:\Omega \rightarrow \mathcal{T}$ being the function that maps each unit to its treatment, and $|\mathcal{T}|=t$.

Furthermore, define $V=\mathbb{R}^n$, the vector space of all real vectors of length $n$. This is the vector space associated with the vector of observations and, clearly, an orthonormal basis is given by the columns of the $n \times n$ identity matrix $\bs{I}_n$. Note that the projection matrix onto $V$ (which we denote $P_V$) is simply $\bs{I}_n$.

In addition, we define some subspaces of $V$. Let $V_T\subset V$ be the subspace of vectors whose elements are constant within each treatment. E.g. suppose we have two treatments with two replications each; then the following columns $\bs{v}_1$, $\bs{v}_2$ and $\bs{v}_3$ are examples of vectors in $V_T$.
\[\begin{array}{cccccc}
\hline
\omega & Y_\omega & T(\omega) & \bs{v}_1 & \bs{v}_2 & \bs{v}_3 \\
\hline
1 & Y_1 & \text{A} & 2 & 1 & -1 \\
2 & Y_2 & \text{A} & 2 & 1 & -1 \\
3 & Y_3 & \text{B} & 5 & 1 & 1 \\
4 & Y_4 & \text{B} & 5 & 1 & 1 \\
\hline
\end{array}\]
Let $V_0 \subset V$ be the space spanned by the $n\times 1$ vector of ones, $\bs{1}_n$. Notice that $V_0 \subset V_T$; the column $\bs{v}_2$ above is an example of a vector in $V_0$. Lastly, we let $V_T^\perp$ be the orthogonal complement of $V_T$. Clearly, dim$V=n$, dim$V_T=t$ and dim$V_0 = 1$. We denote by $P_{V_T}$, $P_{V_0}$ and $P_{V_T^\perp}$ the projection matrices onto $V_T$, $V_0$ and $V_T^\perp$, respectively, and we define $W_T = V_T \cap V_0^\perp$. Note then that \[V = V_0 \oplus W_T \oplus V_T^\perp.\]
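
To make these subspaces concrete, here is a minimal sketch in R for the toy example above (two treatments with two replications each), building the three projection matrices and checking that they sum to the identity, which reflects the decomposition $V = V_0 \oplus W_T \oplus V_T^\perp$.
trt <- factor(c("A", "A", "B", "B"))                     # T(omega) for the four units
proj <- function(X) X %*% solve(crossprod(X)) %*% t(X)   # orthogonal projector onto col(X)
P_V0 <- proj(matrix(rep(1, 4)))          # projector onto V_0 = span(1_n)
P_VT <- proj(model.matrix(~ 0 + trt))    # projector onto V_T
P_WT <- P_VT - P_V0                      # projector onto W_T (V_T intersect V_0 perp)
P_VTperp <- diag(4) - P_VT               # projector onto V_T perp
# the three projectors sum to P_V = I_4
all.equal(unname(P_V0 + P_WT + P_VTperp), diag(4))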

Sum of squares

The sum of squares for a subspace $W$ of $V$ is given by
\[ {\rm SS}(W) = \bs{y}^\top\bs{P}_W\bs{y} \] 
where $\bs{P}_W$ is the projection matrix onto $W$; the associated degrees of freedom is the dimension of $W$, and the mean square for $W$ is SS$(W)/\text{dim}W$.

For example, ${\rm SS}(V) = \bs{y}^\top\bs{P}_V\bs{y} = \bs{y}^\top\bs{I}_n\bs{y} =  \bs{y}^\top\bs{y} = \sum_{\omega \in \Omega} Y_\omega^2 $. This is the total sum of squares (not corrected for the mean). 

Now consider ${\rm SS}(W_T) = \bs{y}^\top\bs{P}_{W_T}\bs{y}$. We omit the derivation here, but this is in fact equal to SSB, and the degrees of freedom $t-1$ is indeed dim$W_T$. To see this, recall that $V_0 \oplus W_T = V_T$ and that dimensions add over a direct sum, so

\begin{eqnarray} {\rm dim}V_0 + {\rm dim}W_T &=& {\rm dim}V_T\\ 1 + {\rm dim}W_T &=& t\\ {\rm dim}W_T &=& t - 1. \end{eqnarray}

Likewise, ${\rm SS}(V_T^\perp)$ equals SSW.
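
As a numerical check, here is a small self-contained sketch (the data values are made up) verifying that $\bs{y}^\top\bs{P}_{W_T}\bs{y}$ and $\bs{y}^\top\bs{P}_{V_T^\perp}\bs{y}$ reproduce the SSB and SSW reported by a standard one-way ANOVA.
trt <- factor(c("A", "A", "B", "B"))
y <- c(2.1, 1.9, 5.2, 4.8)               # made-up observations for units 1, ..., 4
proj <- function(X) X %*% solve(crossprod(X)) %*% t(X)
P_V0 <- proj(matrix(rep(1, 4)))
P_VT <- proj(model.matrix(~ 0 + trt))
P_WT <- P_VT - P_V0
P_VTperp <- diag(4) - P_VT
aov.tab <- anova(lm(y ~ trt))
all.equal(drop(t(y) %*% P_WT %*% y), aov.tab["trt", "Sum Sq"])            # SS(W_T) = SSB
all.equal(drop(t(y) %*% P_VTperp %*% y), aov.tab["Residuals", "Sum Sq"])  # SS(V_T^perp) = SSW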

So what is interesting here (for me at least!) is that the vector space associated with the vector of data is decomposed to orthogonal subspaces that are associated with the overall mean, treatments and residuals!

Sunday, 5 June 2016

Variance component estimate: ANOVA vs LMM

In this post we analyse the Dyestuff data (Davies 1947) with particular emphasis on showing the equivalence of variance component estimation via the ANOVA approach and the linear mixed model (LMM) approach. We've previously alluded in this post that
The early method of estimating variance components is the ANOVA approach, attributed to Ronald Fisher, whereby we equate the expected mean squares to their values computed from the data. However, this was limited to balanced data and only to cases where the variance structures of the random effects and residuals are a scaled identity.
We will first describe the data, provide the analysis of the data using a LMM (to be more precise, we fit a random-effects model) with parameter estimation via asreml-R and lme4 package, and then compare the variance component estimation with the ANOVA approach.

Dyestuff Data

The Dyestuff data frame provides the yield of dyestuff from 5 different preparations from each of 6 different batches of an intermediate product and is accessible from the lme4 R-package.
library(lme4)
summary(Dyestuff)
##  Batch     Yield     
##  A:5   Min.   :1440  
##  B:5   1st Qu.:1469  
##  C:5   Median :1530  
##  D:5   Mean   :1528  
##  E:5   3rd Qu.:1575  
##  F:5   Max.   :1635
You can see that this data is balanced, in the sense that the number of observations per batch is equal (5 in this case) and there is no missing data.

Analysis via linear mixed models

The trait of interest, a.k.a. the response, here is the yield. The batch effect is taken as a random effect and we assume both the batch effects and the residuals are independent and identically distributed. More specifically, \[\boldsymbol{y} = \boldsymbol{1}_{30}\mu + \boldsymbol{Z}\boldsymbol{u}_b + \boldsymbol{e}\] where $\boldsymbol{y}$ is the $30 \times 1$ vector of yield observations, $\mu$ is the overall mean effect, $\boldsymbol{u}_b$ is the $6\times 1$ vector of batch effects with associated $30\times 6$ design matrix $\boldsymbol{Z}$, and $\boldsymbol{e}$ is the $30\times 1$ vector of residuals. We assume that the batch effects and residuals are mutually independent with $\boldsymbol{u}_b \sim N(\boldsymbol{0}, \sigma^2_b\boldsymbol{I}_6)$ and $\boldsymbol{e} \sim N(\boldsymbol{0}, \sigma^2\boldsymbol{I}_{30})$.

Why is the batch effect random? This is sometimes a hard question to answer. In this case, we can say that batches are a selection of the possible batches, and hence should be modelled as a random effect. But anyhow, the focus of this post is not about the “correct” analysis. We illustrate below the variance component estimation, namely estimation of $\sigma^2_b$ and $\sigma^2$, via lme4 and asreml-R packages using REML.
mod.lme4 <- lmer(Yield ~ 1 + (1|Batch), data=Dyestuff)
library(asreml)
mod.asr <- asreml(Yield ~ 1, random=~ Batch, data=Dyestuff)
## ASReml: Sun Jun  5 21:18:58 2016
## 
##      LogLik         S2      DF      wall     cpu
##    -134.7945   3324.1954    29  21:18:58     0.0
##    -134.1034   3033.7599    29  21:18:58     0.0
##    -133.4996   2744.8688    29  21:18:58     0.0
##    -133.2019   2519.5230    29  21:18:58     0.0
##    -133.1782   2458.1253    29  21:18:58     0.0
##    -133.1779   2451.3396    29  21:18:58     0.0
##    -133.1779   2451.2500    29  21:18:58     0.0
## 
## Finished on: Sun Jun  5 21:18:58 2016
##  
## LogLikelihood Converged
print(mod.lme4, ranef.comp="Var", digits=10)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Yield ~ 1 + (1 | Batch)
##    Data: Dyestuff
## REML criterion at convergence: 319.6543
## Random effects:
##  Groups   Name        Variance
##  Batch    (Intercept) 1764.05 
##  Residual             2451.25 
## Number of obs: 30, groups:  Batch, 6
## Fixed Effects:
## (Intercept)  
##      1527.5
(out <- summary(mod.asr)$varcomp[, 'component', drop=F])
##                 component
## Batch!Batch.var   1764.05
## R!variance        2451.25
So \(\hat{\sigma}^2_b\) = 1764.05 and \(\hat{\sigma}^2\) = 2451.25.

ANOVA

We do not delve into the theory of variance component estimation in this post and just show the calculation in R below.
lm.out <- lm(Yield ~ Batch, data=Dyestuff)
(anova.out <- anova(lm.out))
## Analysis of Variance Table
## 
## Response: Yield
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## Batch      5  56358 11271.5  4.5983 0.004398 **
## Residuals 24  58830  2451.2                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So the observed mean squares for batch and error are 11271.5 and 2451.25 respectively, which we denote MS(batch) and MS(error). Note that the expected mean squares, which we denote by EMS, are EMS(error) $=\sigma^2$ and EMS(batch) $=\sigma^2 + 5\sigma^2_b$. The estimates of the variance components are then found by equating the observed MS with the EMS. So $\hat{\sigma}^2=2451.25$ and $\hat{\sigma}^2 + 5\hat{\sigma}^2_b = 11271.5$. Rearranging, we have $\hat{\sigma}^2_b = \frac{1}{5} (11271.5 - 2451.25) = 1764.05$.
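
For completeness, the same calculation can be scripted directly from the anova.out object above (a small sketch).
ms.batch <- anova.out["Batch", "Mean Sq"]       # observed MS(batch)
ms.error <- anova.out["Residuals", "Mean Sq"]   # observed MS(error)
sigma2.hat <- ms.error                          # from EMS(error) = sigma^2
sigma2.b.hat <- (ms.batch - ms.error) / 5       # from EMS(batch) = sigma^2 + 5 sigma^2_b
c(batch = sigma2.b.hat, residual = sigma2.hat)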

Thursday, 2 June 2016

Content Guide

It's tricky to navigate to my old posts, so this post serves as a content guide. If you hold the left mouse button down on the mind map below, a wheel will appear that lets you zoom in and out of the mind map.

Please note: I go back to old posts at times to update or correct them.

Sunday, 29 May 2016

A glimpse at REML for linear fixed models

Too much socialising this weekend, so I dug out one of the algebraic notes I've written, which is a thorough derivation of the REML approach to estimating the fixed effects and the residual variance in linear fixed models, as presented in this post.


Here $\bs{y}$ follows the linear model $\bs{y} = \bs{X}\bs{\tau} + \bs{e}$ with $\bs{e}\sim N(\bs{0}, \sigma^2\bs{I}_n)$, where $\bs{X}$ is an $n\times p$ full-rank design matrix and $\bs{\tau}$ is the vector of fixed effects. Consider the transformation
\[
\begin{bmatrix}
\bs{y}_1\\
\bs{y}_2
\end{bmatrix} = \begin{bmatrix}
\bs{L}_1^\top\bs{y}\\
\bs{L}_2^\top\bs{y}
\end{bmatrix} = \bs{L}^\top\bs{y}
\]
where $\bs{L}=\begin{bmatrix}
\bs{L}_1 & \bs{L}_2
\end{bmatrix}$ is a non-singular $n\times n$ matrix such that  $\bs{L}_1$ and $\bs{L}_2$ are $n\times p$ and $n\times (n-p)$ matrix such that $\bs{L}_1$ is full rank and $\bs{L}_2^\top\bs{L}_2 = \bs{I}_{n-p}$, $\bs{L}_1^\top\bs{L}_2 = \bs{0}$, $\bs{L}_1^\top\bs{X} = \bs{I}_p$ and $\bs{L}_2^\top\bs{X} = \bs{0}$.

The distribution of the transformed data is then
\begin{equation}
\begin{bmatrix} \bs{y}_1 \\ \bs{y}_2\end{bmatrix}
 = \begin{bmatrix}
  \bs{L}_1^\top\bs{y}\\
  \bs{L}_2^\top\bs{y}
  \end{bmatrix}  \sim N\left(
\begin{bmatrix} \bs{\tau} \\ {\bf
  0}\end{bmatrix},
\sigma^2 \begin{bmatrix} \bs{L}_1^\top\bs{L}_1 & {\bf 0} \\
{\bf 0} & \bs{I}_{n-p}\end{bmatrix} \right)
\end{equation}

A convenient choice for $\bs{L}_1$ is $\bs{X}(\bs{X}^\top\bs{X})^{-1} $ which implies that $\bs{L}_1^\top\bs{L}_1 = (\bs{X}^\top\bs{X})^{-1}$.

The two components of the transformation, $\bs{y}_1$ and $\bs{y}_2$, are independent and thus the above transformation provides two linear models.

The first linear model, on $\bs{y}_1$, provides the information on $\bs{\tau}$ under the transformation and hence is the basis for estimating $\bs{\tau}$. The estimate of $\bs{\tau}$ can be found via ML estimation for $\bs{y}_1$. That is, the log-likelihood function of the model for $\bs{y}_1$ is given by
\begin{equation}
\ell_1 = \ell_1(\bs{\tau}; \bs{y}_1) = -\frac{p}{2}\log(2\pi) -
\frac{1}{2}\log|\sigma^2(\bs{X}^\top\bs{X})^{-1}| - \frac{1}{2} (\bs{y}_1 - \bs{\tau})^\top(\sigma^2(\bs{X}^\top\bs{X})^{-1})^{-1}(\bs{y}_1 - \bs{\tau}).
\end{equation}

The estimate of $\bs{\tau}$ is the solution of the equation obtained by setting the following derivative to zero,
\begin{eqnarray*}
\frac{\partial \ell_1}{\partial \bs{\tau}} &=& -\frac{1}{2} \cdot 2\frac{\partial (\bs{y}_1 - \bs{\tau})}{\partial\bs{\tau}}(\sigma^2(\bs{X}^\top\bs{X})^{-1})^{-1}(\bs{y}_1-\bs{\tau})\\
&=& (\sigma^2(\bs{X}^\top\bs{X})^{-1})^{-1}(\bs{y}_1-\bs{\tau})\\
&=& \sigma^{-2}\bs{X}^\top\bs{X}(\bs{L}_1^\top\bs{y}-\bs{\tau})
\end{eqnarray*}
i.e.,
\begin{equation}
\hat{\bs{\tau}} = (\bs{X}^\top\bs{X})^{-1} \bs{X}^\top\bs{y}\label{eq:mme1}
\end{equation}


The parameter vector matches the data vector $\bs{y}_1$ in length, so that once $\bs{\tau}$ has been estimated, $\bs{y}_1$ cannot be used for estimation of $\sigma^2$ as there is no further information available.

The second linear model has a known (zero) mean and depends only on $\sigma^2$.  Because $\bs{y}_1$ cannot be used to estimate $\sigma^2$, we use the marginal distribution of $\bs{y}_2$ for estimation of $\sigma^2$. The log-likelihood of $\bs{y}_2$, referred to as the residual log-likelihood, is given by
\[
\ell_R(\sigma^2; \bs{y}_2) = -\frac{n-p}{2} \log(2\pi)
-\frac{n-p}{2} \log(\sigma^2) - \frac{1}{2\sigma^2}
\bs{y}_2^\top\bs{y}_2
\]
and hence the estimate of $\sigma^2$ based on this marginal likelihood (obtained by differentiating the marginal log-likelihood with respect to $\sigma^2$ and equating it to zero) is given by
\begin{eqnarray}
\hat{\sigma}^2_{\text{REML}} &=& \frac{1}{n-p} \bs{y}_2^\top\bs{y}_2 \nonumber \\
&=& \frac{1}{n-p} \bs{y}^\top \bs{L}_2\bs{L}_2^\top\bs{y} \nonumber \\
&=& \frac{1}{n-p} \bs{y}^\top(\bs{I}_n - \bs{P_X})\bs{y}\qquad \text{see Lemma below} \nonumber \\
&=& \frac{1}{n-p}  \bs{y}^\top(\bs{I}_n - \bs{P_X})^2\bs{y} \qquad \text{since $(\bs{I}_n - \bs{P_X})$ is idempotent} \\
&=& \frac{1}{n-p}  (\bs{y} - \bs{X}\hat{\bs{\tau}})^\top(\bs{y} - \bs{X}\hat{\bs{\tau}})\qquad \text{since $(\bs{I}_n - \bs{P_X})$ is symmetric}
\end{eqnarray}
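
To check the final expression numerically, here is a small sketch with a simulated linear model (the design, coefficients and seed are made up); $\hat{\sigma}^2_{\text{REML}}$ computed as $\bs{y}^\top(\bs{I}_n - \bs{P_X})\bs{y}/(n-p)$ should coincide with the squared residual standard error reported by lm().
set.seed(42)
n <- 20; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # made-up full-rank n x p design
y <- drop(X %*% c(2, -1, 0.5)) + rnorm(n, sd = 2)
P_X <- X %*% solve(crossprod(X)) %*% t(X)
sigma2.reml <- drop(t(y) %*% (diag(n) - P_X) %*% y) / (n - p)
fit <- lm(y ~ X - 1)                     # same model; lm() reports sigma = sqrt(RSS / (n - p))
all.equal(sigma2.reml, summary(fit)$sigma^2)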

The following lemma is something I've written up to justify the step above, but as one might notice, the proof only applies to a special case. A more general approach is to use properties of projection matrices, but the version below may be easier for certain teaching purposes.
Lemma
$$\bs{L}_2\bs{L}_2^\top = \bs{I}_n - \bs{P_X}$$
where $\bs{P_X} = \bs{X}(\bs{X}^\top\bs{X})^{-1}\bs{X}^\top$, $\bs{L}_2^\top\bs{L}_2 = \bs{I}_{n-p}$, $\bs{L}_1^\top\bs{L}_2 = \bs{0}$, $\bs{L}_1^\top\bs{X} = \bs{I}_p$ and $\bs{L}_2^\top\bs{X} = \bs{0}$.

Proof
This can be proven using properties of projection matrices, but we proceed with a different kind of proof. First, observe that \[\begin{bmatrix}
\bs{L}_1^\top\\
\bs{L}_2^\top
\end{bmatrix} \begin{bmatrix}
\bs{X} & \bs{L}_2
\end{bmatrix} =\begin{bmatrix}
\bs{L}_1^\top\bs{X} & \bs{L}_1^\top\bs{L}_2\\
\bs{L}_2^\top\bs{X} & \bs{L}_2^\top\bs{L}_2
\end{bmatrix} =\begin{bmatrix}
\bs{I}_p & \bs{0} \\
\bs{0} & \bs{I}_{n-p}
\end{bmatrix} = \bs{I}_n\]
and hence $\begin{bmatrix}
\bs{X} & \bs{L}_2
\end{bmatrix}$ is non-singular. Now
\begin{eqnarray*}
\bs{I}_n &=& \begin{bmatrix}
\bs{X} & \bs{L}_2
\end{bmatrix}\begin{bmatrix}
\bs{X} & \bs{L}_2
\end{bmatrix}^{-1}\begin{bmatrix}
\bs{X}^\top \\
\bs{L}_2^\top
\end{bmatrix}^{-1}\begin{bmatrix}
\bs{X}^\top \\
\bs{L}_2^\top
\end{bmatrix}\\
&=& \begin{bmatrix}
\bs{X} & \bs{L}_2
\end{bmatrix}\left(\begin{bmatrix}
\bs{X}^\top \\
\bs{L}_2^\top
\end{bmatrix}\begin{bmatrix}
\bs{X} & \bs{L}_2
\end{bmatrix}\right)^{-1}\begin{bmatrix}
\bs{X}^\top \\
\bs{L}_2^\top
\end{bmatrix}\\
&=& \begin{bmatrix}
\bs{X} & \bs{L}_2
\end{bmatrix}\begin{bmatrix}
\bs{X}^\top\bs{X} & \bs{X}^\top\bs{L}_2 \\
\bs{L}_2^\top\bs{X} & \bs{L}_2^\top\bs{L}_2
\end{bmatrix}^{-1}\begin{bmatrix}
\bs{X}^\top \\
\bs{L}_2^\top
\end{bmatrix}\\
&=& \begin{bmatrix}
\bs{X} & \bs{L}_2
\end{bmatrix}\begin{bmatrix}
\bs{X}^\top\bs{X} & \bs{0} \\
\bs{0} & \bs{I}_{n-p}
\end{bmatrix}^{-1}\begin{bmatrix}
\bs{X}^\top \\
\bs{L}_2^\top
\end{bmatrix}\\
&=& \begin{bmatrix}
\bs{X} & \bs{L}_2
\end{bmatrix}\begin{bmatrix}
(\bs{X}^\top\bs{X})^{-1} & \bs{0} \\
\bs{0} & \bs{I}_{n-p}
\end{bmatrix}\begin{bmatrix}
\bs{X}^\top \\
\bs{L}_2^\top
\end{bmatrix}\\
&=& \bs{X}(\bs{X}^\top\bs{X})^{-1}\bs{X}^\top + \bs{L}_2\bs{L}_2^\top.
\end{eqnarray*}
Rearranging gives $\bs{L}_2\bs{L}_2^\top = \bs{I}_n - \bs{P_X}$, as required.

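The lemma can also be checked numerically. Below is a small sketch with a made-up design matrix, constructing $\bs{L}_2$ as an orthonormal basis of the orthogonal complement of the column space of $\bs{X}$ via a QR decomposition.
set.seed(1)
n <- 8; p <- 2
X <- cbind(1, rnorm(n))                       # made-up n x p full-rank design
Q <- qr.Q(qr(X), complete = TRUE)             # n x n orthogonal matrix; first p columns span col(X)
L2 <- Q[, (p + 1):n]                          # remaining columns span the orthogonal complement
P_X <- X %*% solve(crossprod(X)) %*% t(X)
all.equal(crossprod(L2), diag(n - p))         # L2' L2 = I_{n-p}
all.equal(t(L2) %*% X, matrix(0, n - p, p))   # L2' X = 0
all.equal(L2 %*% t(L2), diag(n) - P_X)        # the lemma: L2 L2' = I_n - P_X
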

Sunday, 22 May 2016

Linear mixed model in R (mainly asreml)

This post was inspired by a recent conversation with a fellow postdoc, from what should be a regular podcast :P

Anyway!

I fitted a linear mixed model in this post but I didn't show how the model is fitted.

To recap, the model fitted was

\[ \bs{y} = \bs{1}_n\mu + \bs{Z_g}\bs{u_g} + \bs{Z_b}\bs{u_b} + \bs{Z_c}\bs{u_c} + \bs{e} \]
where $\bs{y}$ is the $288\times 1$ vector of observations of yield ordered so that rows are within columns, $\bs{u_g}$ is the $191 \times 1$ vector of variety effects with the $288 \times 191$ associated design matrix $\bs{Z_g}$, $\bs{u_b}$ is the 2$\times$1 vector of block effects with $288 \times 2$ associated design matrix $\bs{Z_b}$, $\bs{u_c}$ is the 12$\times$1 vector of column effects with $288 \times 12$ associated design matrix $\bs{Z_c}$, and $\bs{e}$ is the 288$\times$1 vector of residuals and we assume
\[\begin{bmatrix} \bs{u_g} \\ \bs{u_b}\\
\bs{u_c} \\ \bs{e} \end{bmatrix} \sim N\left(\begin{bmatrix} \bs{0} \\ \bs{0} \\ \bs{0} \\ \bs{0} \end{bmatrix},  \begin{bmatrix} \sigma^2_g\bs{I}_{191} & \bs{0} & \bs{0} & \bs{0} \\ \bs{0} & \sigma^2_b \bs{I}_2 & \bs{0}& \bs{0}  \\  \bs{0} & \bs{0} & \sigma^2_c \bs{I}_{12}& \bs{0} \\ \bs{0} &\bs{0} & \bs{0} & \bs{R} \end{bmatrix}\right). \]
For $\bs{R}$, we assume a separable autoregressive process of order one in the row and column directions, which we denote AR1$\times$AR1, so that $\bs{R} = \sigma^2 \bs{\Sigma}_c \otimes \bs{\Sigma}_r$ where $\bs{\Sigma}_c$ and $\bs{\Sigma}_r$ are 12$\times$12 and $24\times 24$ correlation matrices, respectively, each depending on a single autocorrelation parameter, $\rho_c$ and $\rho_r$ respectively. 

So how do I fit the above model?

As the title suggests, the model is fitted using R. There are a number of R packages that deal with linear mixed models. The popular ones are nlme, lme4 and the one known well in the breeding world, asreml-R. I have very little experience fitting linear mixed models using the former two, and pretty much all my model fitting is done using asreml-R.

A little about asreml-R


asreml-R fits a model by estimating the variance components using the REML approach with the average information (AI) algorithm (Johnson and Thompson 1995). This method was popularised and implemented in the stand-alone program ASReml (Gilmour et al. 1995), which is widely used, in particular by breeders. An R interface based on the core algorithm of the stand-alone program was later written, which we refer to as asreml-R (Butler et al. 2009). The asreml-R package is a powerful package for fitting linear mixed models, with one huge advantage over the competition being that, as far as I can see, it allows a lot of flexibility in the variance structures and is more intuitive to use.

The main competing R packages for fitting linear mixed models are nlme and lme4. Luis A. Apiolaza makes a comparison of these packages in his blog post dated 2011/10/17. One downside is that asreml-R requires a license, which may be provided free for educational or research purposes (the details of which I am not aware of). There is also an attempt, in a question dated 2011/11/01, to recreate the same analysis as asreml-R, although it appears spatial models (i.e. AR1$\times$AR1) cannot be easily recreated in the open-source packages. These dates are a bit old, so the current situation may well be different. 

Fitting the linear mixed model in R

So here is the code that fits the above model. The code below loads the asreml-R package, then shows the first 6 lines of the data, and lastly the model fit.

> library(asreml)
> head(asite.df)
  plot   yield Column Row    Geno Block
1    1 3.27665      1   1 Geno163     1
2    2 3.32995      1   2 Geno028     1
3    3 2.88832      1   3 Geno009     1
4    4 3.52030      1   4 Geno015     1
5    5 3.11675      1   5 Geno191     1
6    6 3.55076      1   6 Geno132     1
> asr.m1 <- asreml(yield~1, random=~ Geno + Block + Column,  
                  rcov=~ar1(Column):ar1(Row), data=asite.df)


Now, as a comparison, I would like to fit the same model in nlme or lme4, but I cannot figure out how to fit a separable AR1 structure in the column and row directions, so instead I will fit the same model except assuming $\bs{R} = \sigma^2\bs{I}_{288}$ rather than the AR1$\times$AR1 structure. 


So,
# asreml-R
> asr.m0 <- asreml(yield~1, random=~ Geno + Block + Column, data=asite.df)
# or equivalently specifying the variance structure of the residual (default is id)
> asr.m0 <- asreml(yield~1, random=~ Geno + Block + Column,  
                  rcov=~id(Column):id(Row), data=asite.df)
# or equivalently specifying the variance structure of the random effects (default is idv)
> asr.m0 <- asreml(yield~1, random=~ idv(Geno) + idv(Block) + idv(Column),  
                  rcov=~id(Column):id(Row), data=asite.df)
# lme4
> lme4.m0 <- lmer(yield ~ 1 + (1|Geno) + (1|Block) + (1|Column), data=asite.df) 
I couldn't figure out how to fit the same model in nlme using the lme function, so I've given up; if anyone can enlighten me on how to fit it using nlme, that would be great. The above models do indeed give equivalent results. Below is a comparison.
# model parameter estimates from lme4
> head(coefficients(lme4.m0)$Geno)
        (Intercept)
Geno001    3.118321
Geno002    2.170554
Geno003    2.701171
Geno004    2.843664
Geno005    3.167335
Geno006    2.994176

# model parameter estimates from asreml-R
> head(coefficients(asr.m0)$random[grep("Geno", rownames(coefficients(asr.m0)$random)), ])
Geno_Geno001 Geno_Geno002 Geno_Geno003 Geno_Geno004 Geno_Geno005 Geno_Geno006 
  0.05169620  -0.89607063  -0.36545364  -0.22296064   0.10071053  -0.07244843 
> coefficients(asr.m0)$fixed
              effect
(Intercept) 3.066625


So the above looks different, but the differences come from the fact that lme4 adds the fixed effect to its predictions while asreml-R does not. The equivalence can be easily seen from below:
> y <- coefficients(lme4.m0)$Geno[,1]
> x <- coefficients(asr.m0)$random[grep("Geno", rownames(coefficients(asr.m0)$random)), 1]
> lm(y ~ x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
      3.067        1.000  
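
A more direct way to see this is to compare against lme4's ranef(), which returns the conditional modes without the fixed effect added, and to note that coef() is simply fixef() plus ranef() for this model (a quick sketch).
# conditional modes (E-BLUPs) from lme4, without the intercept added;
# these should line up with the asreml-R random-effect predictions above
head(ranef(lme4.m0)$Geno)
# reconstructing the coefficients() output by hand: fixed intercept plus conditional modes
head(fixef(lme4.m0)["(Intercept)"] + ranef(lme4.m0)$Geno[, 1])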

The addition of the estimate of the fixed effect to the prediction of the random effects makes me wonder what lme4 would return if I had other fixed effects. Something to explore in the future perhaps.

Sunday, 15 May 2016

Analysis of crop breeding trials: spatial modelling (application)

This post continues from the brief theory explanation in this post.

I will present an analysis of a wheat breeding trial with particular emphasis in application of spatial modelling. 

So here goes!

Data: a wheat breeding trial with 288 observations. 
Trait of interest: yield 
Treatment: 191 varieties 
Trial layout: a rectangular array of 12 columns $\times$ 24 rows, i.e. a total of 288 plots. 
Design: A partial replicate design (Cullis et al. 2006) with two blocks, i.e. at most one plot of each treatment per block. Block 1 and Block 2 coincide with columns 1-6 and columns 7-12, respectively.


Given this information, we can decompose yield as

 yield = mean + Genotype + block + residual 

or in a proper model notation:

\[ \bs{y} = \bs{1}_n\mu + \bs{Z_g}\bs{u_g} + \bs{Z_b}\bs{u_b} + \bs{e} \]

where $\bs{y}$ is the $288\times 1$ vector of observations of yield ordered so that rows are within columns, $\bs{u_g}$ is the $191 \times 1$ vector of variety effects with the $288 \times 191$ associated design matrix $\bs{Z_g}$, $\bs{u_b}$ is the 2$\times$1 vector of block effects with $288 \times 2$ associated design matrix $\bs{Z_b}$, and $\bs{e}$ is the 288$\times$1 vector of residuals and we assume 

\[\begin{bmatrix} \bs{u_g} \\ \bs{u_b} \\ \bs{e} \end{bmatrix} \sim N\left(\begin{bmatrix} \bs{0} \\ \bs{0} \\ \bs{0} \end{bmatrix},  \begin{bmatrix} \sigma^2_g\bs{I}_{191} & \bs{0} & \bs{0} \\ \bs{0} & \sigma^2_b \bs{I}_2 & \bs{0} \\ \bs{0} & \bs{0} & \bs{R} \end{bmatrix}\right). \]

So what is the structure of the residual covariance matrix $\bs{R}$ here? As we have alluded to previously, the classical iid assumption $\bs{R} = \sigma^2\bs{I}_{288}$ may not be appropriate. Rather, we assume a separable autoregressive process of order one in the row and column directions, which we denote AR1$\times$AR1, so that $\bs{R} = \sigma^2 \bs{\Sigma}_c \otimes \bs{\Sigma}_r$ where $\bs{\Sigma}_c$ and $\bs{\Sigma}_r$ are 12$\times$12 and $24\times 24$ correlation matrices, respectively, each depending on a single autocorrelation parameter, $\rho_c$ and $\rho_r$ respectively. 
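
To make the separable structure concrete, here is a small sketch in R that builds $\bs{R} = \sigma^2 \bs{\Sigma}_c \otimes \bs{\Sigma}_r$ directly; the parameter values below are made up purely for illustration (in practice they are estimated for us, e.g. by asreml-R).
# AR1 correlation matrix of dimension n with autocorrelation rho
ar1_cor <- function(n, rho) rho^abs(outer(1:n, 1:n, "-"))
sigma2 <- 0.05                   # illustrative values only
rho_c <- 0.4
rho_r <- 0.6
Sigma_c <- ar1_cor(12, rho_c)    # 12 x 12 column correlation matrix
Sigma_r <- ar1_cor(24, rho_r)    # 24 x 24 row correlation matrix
# with the data ordered rows within columns, R = sigma^2 * (Sigma_c kronecker Sigma_r)
R <- sigma2 * kronecker(Sigma_c, Sigma_r)
dim(R)                           # 288 x 288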


\[\begin{array}{lcccc}
\hline
\bs{R} & \text{Res. log-likelihood} & \text{AIC} & \text{AICc} & \text{BIC}\\\hline
\text{ID}\times\text{ID} & 125.34 & -244.68 & -244.59 & -233.70\\
\text{AR1}\times\text{AR1} & 145.75 & -281.50 & -281.28 & -263.20\\\hline
\end{array}\]

The above table compares the models with the different $\bs{R}$ structures specified in the first column (note that ID$\times$ID = iid). The second column is the residual log-likelihood of the given model, the third column is the Akaike Information Criterion (AIC), the fourth column is the corrected AIC (AICc), and the last column is the Bayesian Information Criterion (BIC). A residual likelihood ratio test between these two models yields a $p$-value of $4.3\times10^{-10}$; in other words, the autocorrelation parameters in the AR1$\times$AR1 model are significant.
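
Here is a sketch of how such a comparison might be computed, assuming the two models have been fitted with asreml-R (the object and data frame names below are hypothetical) and that the residual log-likelihood is stored in the loglik component of the fitted object, as in asreml-R version 3.
# hypothetical fits of the two models compared in the table:
# asr.id  <- asreml(yield ~ 1, random = ~ Geno + Block, data = asite.df)
# asr.ar1 <- asreml(yield ~ 1, random = ~ Geno + Block,
#                   rcov = ~ ar1(Column):ar1(Row), data = asite.df)
REMLRT <- 2 * (asr.ar1$loglik - asr.id$loglik)   # residual likelihood ratio statistic
pchisq(REMLRT, df = 2, lower.tail = FALSE)       # two extra (autocorrelation) parameters
# AIC by hand: -2 * residual log-likelihood + 2 * (number of variance parameters)
-2 * asr.id$loglik + 2 * 3     # sigma^2_g, sigma^2_b, sigma^2
-2 * asr.ar1$loglik + 2 * 5    # plus rho_c and rho_r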

Indeed, the AR1$\times$AR1 model appears to provide a better fit to these data. Let's have a look at some further diagnostics.

Clockwise from the top-left panel: histogram of the E-BLUPs of the residuals, quantile-quantile plot of the E-BLUPs of the residuals, plot of the E-BLUPs of the residuals against the fitted values, and plot of the E-BLUPs of the residuals against the unit numbers.


Well, all looks okay except for ...  the residual plot! There is a clear pattern there! Let's have a further look into more diagnostics.


The plot of E-BLUPs of the residuals against Row for each Column.

Sample variogram for the AR1$\times$AR1 model. 
Now the sample variogram displays a camel-like hump along the column direction at every row lag. This is indicative of a column effect. Ideally, we would like a sample variogram that flattens out after rising from (0,0). So why does this pattern occur? Gilmour et al. (1997) explain that there could be extraneous variation and global trend due to factors such as spraying patterns, harvesting practices, and so on.

Source: Gilmour et al. (1997)
Here we seem able to identify only one source of extraneous variation, namely the column effects. Thus, we add this to the model.

yield = mean + Genotype + Block + Column + residual


\[\bs{y} = \bs{1}_n\mu + \bs{Z_g}\bs{u_g} + \bs{Z_b}\bs{u_b} + \bs{Z_c}\bs{u_c} + \bs{e}\] where $\bs{u_c}$ is the 12$\times$1 vector of column effects with associated $288\times 12$ design matrix $\bs{Z_c}$, and we assume $\bs{u_c}\sim N(\bs{0}, \sigma^2_c \bs{I}_{12})$.
Sample variogram for AR1$\times$AR1 model with random column effects. 
And here's the new variogram! Now the variogram looks a lot more like what we want!

Phew! That's it for a brief introduction to spatial modelling practices in crop breeding!
