PROC MIXED vs. lmer()

I am using R for a few years now. I like it a lot. Previously I used SAS (notably its modules SAS/BASE, SAS/STAT, SAS/IML, ...), because that is the statistical package of choice at my department. I had a hard time when I switched to R, mainly because I was thinking in the SAS way. However, I made a switch because I liked Rs' open-source nature and its vibrant community. Up to now I did not want to make any comparisons between them. I simply use R because I like it and get along well with using it. Recently I helped a student to fit a quite big model. She used SAS and PROC MIXED failed due to "out of memory". I was surprised, since SAS is known for its great stability and performance with big datasets. I was almoast sure that function lmer() in lme4 package in R will fail also, but could not resist to try it out. To my surprise, the model was fitted without problems. Perhaps I could tweak SAS to use more memory, but I did not invest time to study that option. Maybe new versions of SAS would perform better.

I used the following code in SAS 8.02:

proc mixed data=podatkiratio;
class genotype litter eage sex type year month herd hys;
model bw = genotype litter eage sex type year month;
random herd hys;

This is the code for R 2.6.0, lme4 0.99875-9:

lmer(bw ~ genotype + litter + eage + sex + type +
year + month +
(1|herd) + (1|hys),

And the results, which include also the description of the model size:

Linear mixed-effects model fit by REML
Formula: bw ~ genotype + litter + eage + sex + type + year + month + (1|herd) (1|hys)
Data: podatki
AIC BIC logLik MLdeviance REMLdeviance
152403 152899 -76149 152011 152297
Random effects:
Groups Name Variance Std.Dev.
herd (Intercept) 0.313 0.559
hys (Intercept) 0.147 0.383
Residual 0.297 0.545
number of obs: 85035, groups: herd, 346; hys, 9653

Fixed effects:
## genotype, 13 levels
## litter, 3 levels
## eage, 9 levels
## sex, 2 levels
## type, 2 levels
## year, 16 levels
## month, 12 levels

I am using SAS and R under Windows XP, running on Intel(R) Core(TM)2 CPU 6300 @ 2x1.86GHz with 2 GB RAM.

Legacy of late prof. Henderson

I came across the paper by Schaeffer "C. R. Henderson: Contributions to Predicting Genetic Merit". I did not know that paper before. It has a nice introduction that tells the history of how Henderson derived BLUP and MME. Additionally, paper describes two most important contributions of Henderson: the BLUP (and MME) and direct setup of the inverse of the numerator relationship matrix. Schaeffer also represented some open problems that are still not solved today (at least to my knowledge).


R package "entrez"

I got a request for R package "entrez" I wrote quite some time ago. I did upload the package to my old webpage, but I have not yet finnished the movement completely to the new webpage. I decided to make a sloopy move i.e. move files when I need them or when I will have more time, ... Anyway, I am glad that entrez package still works. I uploaded the source of the package as well as the ZIP packlage for easy installation on MS Windows (note that these two links will change with the new versions - if I publish them ever; however permanent link is this one).

What does the entrez package? It is a simple function that performs search for primary IDs in PubMed via Entrez utility esearch at NCBI. Examples:

## Are there any articles about Bioconductor in PubMed?
(search <- entrezSearch("Bioconductor[tiab]"))

## Get counts by year in the period 2000:2008
count <- entrezCountByYear(term="Bioconductor[tiab]", dp=2000:2008)

## Plot counts
plot(count, typ="l", xlab="Year",
ylab="Number of publications in PubMed")


Fitting Mixed-Effects Models Using the lme4 Package in R

Here is a nice presentation by Douglas Bates called "Fitting Mixed-Effects Models Using the lme4 Package in R".

Miti o prireji mesa, mleka, jajc, ... s kloniranimi in transgenimi živalmi

Marko Dolinar piše na svojem blogu "Klonov še ne bo na evropskih krožnikih". Ob številnih debatah o gensko spremenjenih organizmih, ki se v poljedeljstvu že uporabljajo, se vsake toliko časa pojavijo tudi ideje, teze, mnenja, kritike, ... o uporabi kloniranih in transgenih živalih. Klonirane živali so "popolne" kopije izvorne živali, medtem ko imajo transgene živali vstavljen gen iz druge vrste. Veliko ljudi čuti do teh tehnologij odpor. Sam sem nekje vmes, ne vidim večjih zadržkov, a vrag je vedno v podrobnostih, ki jih (še) ne poznamo. Bi pa rad povdaril, da omenjene tehnologije le niso tako super s finančnega stališča. Drži, da so že klonirali živali, drži, da so različnim živalim že vnesli gene iz drugih vrst. A je na drugi strani tudi kopica praktičnih težav ter velika težava, ki se ji po domače lepo reče "računica". Vsi omenjeni postopki so za sedaj še zelo dragi, nezanesljivi in neučinkoviti. Menim, da še lep čas ne bomo imeli na mizi mleka, mesa, jajc, ... od kloniranih in transgenih živali. Blasco je slikovito prikazal, da bomo še lep čas uporabljali klasične metode izboljševanja populacij kmetijskih živali (=selekcija) in da so mnoge moderne biotehnološke metode neprimerne za praktično uporabo v živinoreji.



I like the ICOTS8 Logo.

The description of the logo says:

"The sum of three shifted distributions forms the shape of the three headed mountain Triglav (tri=three, glav=head), the highest mountain in Slovenia and a national pride and symbol of Slovenes. The solid colors correspond to the colors of the Slovenian flag, while the green background is the color of Ljubljana and its symbol: the green dragon.

The three distributions represent students, teachers and researchers, demonstrating that learning, teaching and discovering overlap and progress in harmony."

Encyclopedia of Genetics, Genomics, Proteomics, and Informatics

Springer published new book "Encyclopedia of Genetics, Genomics, Proteomics, and Informatics". Seems great, but way to expensive for me.

Systems biology model

It is quite some time since I have found a course website by Antonio Reverter on quantitative overview of gene expression. His slides are very nice and I guess that the course must be very informative as well as fun. I simply addore his systems biology model picture, which I am shamelesly reproducing here.


Selecting and using colors in graphs

It is some time now since I came across a paper "Escaping RGBland: Selecting Colors for Statistical Graphics" by Zeileis et al. I have finnaly read (well I only parsed it since I get lost in the details about the color systems etc.) it and now it is time to try the new functions they implemeneted in the R package vcd. This is not the first work about colors in R. There is plenty of resources about this topic, sometimes even to much. At the end one does not know which approach to follow. I certainly liked the RColorBrewer package (see some examples here as well as original ColorBrewer site) as well as the work of Gentleman et al. (for example see figure 8 or at R graph gallery).

Zeileis et al. report about three functions for creation of color palletes: qualitative palette for coding categorical information via rainbow_hcl(), sequential palette via sequential_hcl() , and diverging palette via diverge_hcl(). The last two are amied at numerical and ordinal variables. The R Graphical Manual site shows 177 figures (at the time of writing) for the vcd package. I must admit that examples from vcd package did not attract me so much as those from RColorBrewer package.

Some R code to start with palletes in vcd package:
## Install the package

## Load the package

## Check the vignette
vignette(topic="hcl-colors", package="vcd")
## --> this will open a PDF file with the code and examples
Btw. there are some excellent resources about graphics (mainly in relation with R):


Cheat Sheets for Web Developers

Jacob bundled a great list of cheat sheets for Web developers. Very great resource for quick lookup for HTML, XHTML, CSS, JavaScript, mootools, jQuery, RGS colours, SEO, and WordPress.

"Giant" bull and ram

Full story ...

Full story ...


Symmetric sparse matrices with Matrix R package II

I have written about the size of created sparse matrices with Matrix R package here. After discussion with the developers of the package, I realized that one should not create sparse matrices in a way presented here i.e. setting up the matrix and then filling it is NOT OK. Instead one should either start with sparse design matrices and create the crossprods e.g.

> library(Matrix)
> tmp <- factor(rep(letters[1:3], each=4))
[1] a a a a b b b b c c c c
Levels: a b c
> (x <- as(tmp, "sparseMatrix"))
3 x 12 sparse Matrix of class "dgCMatrix"
a 1 1 1 1 . . . . . . . .
b . . . . 1 1 1 1 . . . .
c . . . . . . . . 1 1 1 1
> tcrossprod(x)
3 x 3 sparse Matrix of class "dsCMatrix"
a b c
a 4 . .
b . 4 .
c . . 4

or directly via the "triplets":

i <- as.integer(c(1, 2, 3))
j <- as.integer(c(1, 2, 3))
x <- c(4, 4, 4)
new("dsTMatrix", Dim=c(3L, 3L), i=(i - 1L), j=(j - 1L), x=x)

The above code should of course be more "automagical" but shows the point. Additionally, creating a diagonal matrix is of course easy (there is a special class for them in Matrix), but the above code works in general setting.


CFC program - Contribution, Inbreeding (F), Coancestry

It is quite some time since I downloaded the CFC program from here, see also here. However, I only tried to use it recently and found out that is quite handy. The program allows user to calculate:
  • inbreeding coefficients,
  • relationships (numerator or the "whole" relationship coefficient),
  • ancestral decomposition of inbreeding coefficients,
  • ancestral decomposition of the average coancestry,
  • optimization of matings to minimize the average inbreeding in the next generation,
  • probabilities of gene origin,
  • numerator relationship matrix, its decomposition and its inverse, and
  • recode the identifications from 1 to n (number of all individuals) in such a way that ancestors appear always before descendants.

A nice list of possible tasks for anyone working in genetics. It's GUI is very easy to use. It is a bit clumsy if you want to import the output to other program, say R. Therefore, I wrote two read.* functions that can import the output of numerator relationship matrix and its inverse into R.

read.CFCSparseA <- function(file, fileF=NULL, x, id, idI, out="dsCMatrix")
## ToDo: wrap into few lines - does it make any difference

## Read the numerator relationship matrix (A) from the CFC program
## - output in sparse format (triplets)
## file - character, name of the file with sparse matrix output from the CFC
## program (see bellow) non-inbred
## fileF - character, name of the file with inbreeding coefficients (see bellow)
## if NULL individuals are assumed
## x - data.frame, a pedigree with columns id (original identification) and
## idI (integer recoded identification)
## id - character, column name with original identification
## idI - character, column name with integer recoded identification
## out - character, "format" of the output: "triplet" (a data.frame) or
## "matrix" (a dsCMatrix)
## The file should contain only lines with the matrix elements i.e. without
## the header and footer lines. Edit the original output file to get file
## like this one (the first column is original subject identification, the
## second column is original subject identification, and the third column is
## non-zero matrix element):
## C A 0.5
## C B 0.5
## E A 0.25
## E B 0.25
## E D 0.5
## E C 0.5
## The fileF should have two columns: the first with original identification
## and the second with inbreeding coefficient of the individual:
## E 0.25
## F 0.05


## --- Check ---

## id and idI must be in the pedigree
tmp <- c(id, idI)
if(sum(tmp %in% names(x)) < 2) {
stop("wrong specification of id and/or idI column(s)")

## Output
tmp <- c("triplet", "dsCMatrix")
if(any(!(out %in% tmp))) {
stop(paste("out must be one of:", paste(tmp, collapse=", ")))

## Number of individuals
n <- as.integer(nrow(x))

## --- Import the datafile ---

tmp <- read.table(file=file)

## --- Match recoded identifications ---

tmp <- merge(tmp, x[, c(id, idI)], by.x="V1", by.y=id)
names(tmp)[names(tmp) %in% idI] <- "i"
tmp <- merge(tmp, x[, c(id, idI)], by.x="V2", by.y=id)
names(tmp)[names(tmp) %in% idI] <- "j"
names(tmp)[3] <- "x"
tmp[, 1] <- NULL
tmp[, 1] <- NULL

## --- Add diagonals = 1 + f ---

if(is.null(fileF)) { ## Assume that there is no inbreeding: diag(A) = 1
inbCoef <- x[, idI, drop=FALSE]
names(inbCoef) <- "i"
inbCoef$j <- inbCoef$i
inbCoef$x <- 1
tmp <- rbind(inbCoef[, c("x", "i", "j")], tmp)
} else { ## Get inbreeding coefficients: diag(A) = 1 + F
inbCoef <- read.table(file=fileF)[, 1:2]
names(inbCoef) <- c(id, "F")
inbCoef <- merge(inbCoef, x[, c(id, idI)], by.x=id, by.y=id, all.y=TRUE)
inbCoef[is.na(inbCoef[, "F"]), "F"] <- 0
inbCoef$j <- inbCoef[, idI]
inbCoef$F <- 1 + inbCoef[, "F"]
inbCoef[, 1] <- NULL
names(inbCoef) <- c("x", "i", "j")
tmp <- rbind(inbCoef, tmp)

## --- Return ---

if(out == "triplet") { ## Triplets: i, j, x
tmp[, c("i", "j", "x")]
} else { ## Symetric sparse matrix
as(new("dsTMatrix", Dim=c(n, n), uplo="L",
i=(tmp$i - 1L), j=(tmp$j - 1L), x=tmp$x), "dsCMatrix")

read.CFCSparseAInv <- function(file, out="dsCMatrix")
## ToDo: wrap into few lines - does it make any difference

## Read inverse of the numerator relationship matrix (A) from the CFC program
## - output in sparse format (triplets)
## file - character, name of the file with sparse matrix output from the CFC
## program
## out - character, "format" of the output: "triplet" (a data.frame) or
## "matrix" (a dsCMatrix)
## The file should contain only lines with the matrix elements i.e. without
## the header and footer lines. Edit the original output file to get file
## like this one (the first column is original subject identification, the
## second column is row index and the rest are column indexes, followed by a
## non-zero matrix element in the format columnIndex:non-zeroElement, ...):
## A, 1, 1:1.5,
## B, 2, 1:0.5, 2:1.5,
## D, 3, 3:1.5,
## F, 4, 4:1,
## C, 5, 1:-1, 2:-1, 3:0.5, 5:2.5,
## E, 6, 3:-1, 5:-1, 6:2,


## --- Check ---

tmp <- c("triplet", "dsCMatrix")
if(any(!(out %in% tmp))) {
stop(paste("out must be one of:", paste(tmp, collapse=", ")))

## --- Import and transformation ---

## A long "massage"
## 6. Split at each comma
tmp <- strsplit(split=",",
## 5. Remove the comma at the end
x=sub(pattern=",$", replacement="",
## 4. Change any colon to comma
x=gsub(pattern=":", replacement=",", fixed=TRUE,
## 3. Remove the first column (Individual ID) - ".," anything up to the first (since we use sub()) coma
x=sub(pattern=".,", replacement="",
## 2. Remove any spaces - " "
x=gsub(pattern=" ", replacement="", fixed=TRUE,
## 1. Read the file line by line
## Number of rows = number of individuals
r <- as.integer(length(tmp))
## Length of each row
n <- sapply(tmp, length)
## Number of triplets by row - first element is a row index followed by a
## column index and a non-zero value
t <- (n - 1) %/% 2
## Number of all triplets; according to my tests, inverse of A has always at
## least one element per individual i.e. 1 for individuals without any relatives
N <- sum(t)
## Remove row index values
tmp <- lapply(tmp, function(z) z[-1])

## --- Build triplets ---

ret <- matrix(nrow=3, ncol=N)
## Row index
ret[1, ] <- rep(1:length(t), times=t)
## Column index and non-zero values
end <- cumsum(t)
start <- end - t + 1
i <- 1
while(i < (r + 1)) { ## Loop over all (r) individuals
ret[2:3, start[i]:end[i]] <- tmp[[i]]
i <- i + 1
## Modify for output
ret <- as.data.frame(t(matrix(as.numeric(ret), nrow=3, ncol=N, byrow=FALSE)))
names(ret) <- c("i", "j", "x")
ret[, 1:2] <- lapply(ret[, 1:2], as.integer)

## --- Return ---

if(out == "triplet") { ## Triplets: i, j, x
} else { ## Symetric sparse matrix
as(new("dsTMatrix", Dim=c(r, r),
i=(ret$j - 1L), j=(ret$i - 1L), x=ret$x), "dsCMatrix")


Mixing Noweb and Sweave in LyX

Ed asked:
Is it possible to configure LyX so that I can create a literate article, book or report with either the "standard" LyX/NoWeb settings or the Sweave/R settings? I've spent a few hours trying to figure out how to doit, but everything I've tried only gives me one of the two options.
You want to mix Noweb and Sweave in the same file? No way, since the literate layout files (see http://cran.r-project.org/contrib/extra/lyx/) were modified for Sweave. You can also not use Noweb in one file and Sweave in another file without changing the literate layouts and preferences file.

Drawings of the Mathematics Genealogy Project

Yifan Hu from AT&T labs has nice drawings of the pedigrees from the Mathematics Genealogy Project - see here. He made them with Mathematica programe. See:

Video posnetki predstavitev na temo konjev (prehrana, zdravje, genetika)

EAAP (European Association of Animal Production) je Evropsko združenje živinorejcev in vsako leto organizira kongres. Lansko leto je bil v Dublinu in nekatere predstavitve so tudi posneli in objavili na spletu. Za rejce konj so po mojem mnenju (jih nisem gledal!) zanimivi sledeči prispevki:

  • Nutrition: health and performance on equine S. Harris (Video)
  • Genetics and environment in equine health and diseases V. Gerber (Video)
  • Application of horse genomics E. Baileys (Video)


Benefits from recent genetic progress in sheep and beef populations in UK

Amer et al. report on benefits of breeding programmes for sheep and beef cattle in UK. They evaluated the effect of 10 years of selection for a period of 20 years. They noted that observed genetic trends were lower than predicted from the theory (a common observation!). In recorded hill sheep, sheep crossing sire and sheep terminal sire breeding programmes the benefits was estimated at 5.3, 1.0, and 11.5 million GBP. If the progress is properly disseminated to the rest of the sheep (meat) sector, the benefits would be 110.8 million GBP. Benefits for terminal sire beef breeds were estimated at 4.9 million GBP, while the estimate for dual-purpose breeds wsa 18.2 million GBP for genetic gain in growth and carcass traits accounting for negative effect in calving traits. After some more "economy massage" they report that the internal rate of return for investment in sheep and beef cattle breeding programmes was 32%.

Course: Animal Breeding Strategies

Course website - Animal Breeding Strategies. Seems to be related to course material I recently found (link).

Explanation of Bayes theorem

John D Cook has written a nice explanation of Bayes theorem on rare disease cliche example - link. I did something similar in the past (An introduction to Bayesian statistics, An introduction to Bayesian statistics and MCMC methods - both in slovenian language) for my colleagues.

LyX-Sweave on MS Windows

Jay Kerns has recently notified me that for LyX-Sweave on MS Windows one needs to add the bin directory of R installation (say contrib/extra/lyx/) to the PATH system variable. He is right, since LyX can not find R binary otherwise. On MS Windows I always add R to PATH variable, since I often use R CMD check or other commands in MS Windows Command Tool - terminal. New version of INSTALL file will be updated with this information. Goosh, administration of Linux is getting better and better in comparison to MS Windows, at least for me.


Embeding feeds on your site


Program and videos from the Lush Visions Symposium - Animal Breeding Plans

Lush Visions Symposium - Animal Breeding Plans was held on April 25, 2008 at the Iowa State University, Ames, IA. Here are links to the programme and videos. Putting videos on the web is really a great thing as we can now watch conferences from home. Of course it is better to be there, but that involves costs, time, etc. I think that traking the videos and publishing them is also a great was to disseminate the work and finally, you at least know which person is behind the name.

Using optimized linear algebra libraries with R II

After installing ATLAS on my laptop, I also tried the same setup on machine in office having Intel(R) Core(TM)2 CPU 6300 @ 2x1.86GHz.

  • default: user ~6.5, system ~0, elapsed, ~6.6
  • Core2Duo: user ~0.8, system ~0.1, elapsed ~0.82
Whooa, that certainly is a gain!

Symmetric sparse matrices with Matrix R package

These tests are flawed! See here

I need to solve the following system of normal equations i.e. I want to calculate x from Ax=b, where A is a square symmetric positive definite matrix. Since I use R, I tried to do this with the new Matrix package. However, I wanted to test different approaches of building and storing a symmetric matrix.

## Load the package

## Create a 100 x 100 sparse matrix
n <- 100 test <- Matrix(data=0, nrow=n, ncol=n, sparse=TRUE)
## See the structure

I will only printout, that this is a matrix of dsCMatrix class which means that this is symmetric matrix stored in column compressed format.

## 1.3 kB

## Adding some values to the first row
test[1, 1:n] <- 1:n

Class is now dgCMatrix, since the matrix is not symmetric anymore.

## 2.4 kB

## Force it to become symmetric
test2 <- forceSymmetric(test)
## dsCMatrix --> OK
## 2.5 kB
The size practically did not change, which is good - values are stored only once.

## Adding to the first column
test[1:n, 1] <- 1:n

The class is still dgCMatrix, which means that there is no automatic checking that the matrix is
symmetric. This makes sense, since automatic checking could be a formidable task.

## 3.6 kB
That is also the reason for larger size of the object in memory.

## Convert the class
test <- as(test, "dsCMatrix")
## dsCMatrix
## 2.5 Kb
After conversion to dsCMatrix class, we again drop redundant info from one off-diagonal part of the matrix.

Now let us try some sums to see how matrices behave.

## Summation of two sparse symmetric matrices
test2 <- test + test
## dsCMatrix
## 2.5 kB
The symmetry is preserved as well as the size of the object in memory - the "identical" size is just because we summed the same object.

Now let us try what happens if one matrix has upper and other lower triangle filled.

##Summation of upper and lower "triangle"
n <- 100
test1 <- test2 <- Matrix(data=0, nrow=n, ncol=n, sparse=TRUE)
test1[1, 1:n] <- 1:n
test2[1:n, 1] <- 1:n
test1 <- forceSymmetric(test1)
test2 <- forceSymmetric(test2)
## dsCMatrix
## 2.5 kB
## dsCMatrix
## 1.3 kB
Whoops! these matrices are the same in terms of the content, but the second one is almost half the size. This is probably due to the column compressed format i.e. lower triangle is better suited for this. This is good to know!

## Sum them
test <- test1 + test2
## dsCMatrix
## 2.5
The size of the sum is influenced by the "big" matrix.

## Reverse the sum
test <- test2 + test1
## dsCMatrix
## 2.5 kb
The same result as before - it seem that the order does not matter!

## Now sum only the "small" matrix
test <- test2 + test2
## dsCMatrix
## 1.3 kB
Aha! Again lower triangle wins.

## Now sum only the "big" matrix
test <- test1 + test1
## dsCMatrix
## 2.5 kB


Using optimized linear algebra libraries with R

After reading blog posts by Yu-Sung and Gelman (see also comments) as well as recent additions to R on Debian page at RWiki, I decided to take a try with ATLAS on my laptop. I did not aspect much, since I use a poor IBM ThinkPad R50e with Intel(R) Celeron(R) M processor - 1400MHz. However, this laptop offers what I need when I move around, while I use a "hammer" in the office. I could not find the information about which instruction set (SSE, SSE2, ...) is appropriate for my processor. In order to figure this out and to evaluate the benefits I did:
  • Launched the XUbuntu 7.10 (gutsy)
  • Started R with: R --vanilla -q
  • Used the following R script i.e. setting up the matrix and calculating its crossproduct
mm <- matrix(rnorm(4 * 10^6), ncol = 10^3); system.time(crossprod(mm))
  • Tested packages: atlas3-base, (atlas3-sse, atlas3-sse-dev) and (atlas3-sse2, atlas3-sse2-dev), which is just a matter of installing and removing the packages via aptitude install or remove command
The results (output from system.time function - several calls) are:
  • standard R installation: user ~13.4, system ~0.06, elapsed ~15.0
  • atlas3-base: user ~4.1, system ~0.05, elapsed ~4.2
  • atlas3-sse: user ~3.4, system ~0.03, elapsed ~3.7
  • atlas3-ss2: user ~4.1, system ~0.03, elapsed ~4.1
It is clear that installing ATLAS is beneficial, but there is not much gain with specific instruction sets for my laptop. Nevertheless, I kept atlas3-sse since it was consistently showing the best performance.

Overview of scientific journals with Eigenfactor

I came across the Eigenfactor.org site, which (according to the description of the site) ranks and maps scientific knowledge. I like the display and the interactive browser --> go to mapping menu. When you click a field you can see which fields are related and on the right you get a list of top journals in the field. That is very nice and handy information. The bars at the top of the display lets you show more or less related fields.