New stuff in the gdata R package

I have moved quite some functions from my "testing/playground" R package ggmisc to the gdata package. Bellow is the relevant text from the NEWS file. Here are the links:
  • New function .runRUnitTestsGdata that enables run of all RUnit tests during the R CMD check as well as directly from within R.
  • Enhanced function object.size that returns the size of multiple objects. There is also a handy print method that can print size of an object in "human readable" format when options(humanReadable=TRUE) or print(x, humanReadable=TRUE).
  • New function bindData that binds two data frames into a multivariate data frame in a different way than merge.
  • New function wideByFactor that reshapes given dataset by a given factor it creates a "multivariate" data.frame.
  • New functions getYear, getMonth, getDay, getHour, getMin, and getSec for extracting the date/time parts from objects of a date/time class.
  • New function nPairs that gives the number of variable pairs in a data.frame or a matrix.


My readership

When I started with my blog I thought that it would be nice to share my thoughts, work, and new things with my friends. Recently, I added a gadget - (Whos among us) to monitor the visitors. I checked the "status" today and I was really surprised - up to now I had quite a lot of visitors from all over the Europe, USA, Canada, but also some hits from Middle and South America, Australia, Africa (Egypt and South Africa), Inida, China, ... Wow, this is much more than I ever anticipated! I guess that some visitors are just passing by, but it is still nice to have audience ;)


Genetic evaluation with uncertain paternity

The standard method for the genetic evaluation, i.e., BLUP needs some phenotype and pedigree data. Sometimes pedigree data is not accurate in a sense that there are several potential parents. Usually, this is the case with the use of multiple sires in the large groups of animals, say beef cattle or meat sheep in extensive conditions. However, we usually have only few candidates and this information can be included in the model. Robert Tempelman has a great presentation on this issue; giving an overview and presenting their research.

There are also some other very fine presentations on this and related issues in beef cattle breeding.


Make it simple

A nice example how a boring old scatterplot is much more informative than the fancy plot.


High-Performance Computing with R

Dirk has posted a new version of slides for a tutorial "Introduction to High-Performance Computing with R".

Genetski in selekcijski vidiki plodnosti ovac

Pravkar sem končal z branjem diplomske naloge Jernejke Drolec (1993) "Genetski in selekcijski vidiki plodnosti ovac". Moram priznati, da sem bil presenečen nad širino naloge in pokritostjo literature. Ker me tema zelo zanima, sem nalogo hitro prebral in bom v prihodnje najbrž še velikokrat pokukal vanjo.


gdata gains trimSum function

I was doing some drawing in R and I needed to trim some values to keep the data (x axis) in reasonable limits, but I did not want to loose that info. Therefore, I summed the values that would be trimmed. Since I was repeating this, I wrote a function and commited it to the gdata SVN package repository. It will probably take some time before new version of gdata hits the CRAN, so there is package for MS Windows and a source package. Here it goes in action:
> x <- 1:10
> trimSum(x, n=5)
[1] 1 2 3 4 45
> trimSum(x, n=5, right=FALSE)
[1] 21 7 8 9 10

> x[9] <- NA
> trimSum(x, n=5)
[1] 1 2 3 4 NA
> trimSum(x, n=5, na.rm=TRUE)
[1] 1 2 3 4 36

Rtools and Cygwin on MS Windows

Duncan Murdoch provides Rtools which ease the installation of tools that are needed to do R package development/testing on MS Windows. The Rtools is a collection of various tools. However, if you also use Cygwin on MS Windows, you can expect problems since Rtools also includes some tools from Cygwin. The problem is the version collision of fundamental Cygwin libraries. Basically, this means that you will not be able to use Cygwin and if you have C:/Cygwin/bin in the PATH envorinmental variable, you can expect problems also in the Command Prompt. You can try to fuse both "worlds" as described in the documentation, but is seems tricky. I just used the following to make my life easier:
  • Install Rtools and do not modify the PATH envorinmental variable --> this means that you will still be able to use Cygwin without problems
  • Create a BAT script (say on the Desktop) with the following content
rem --- Add RTools to the PATH ---
set PATH=c:\Programs\R\Rtools\bin;c:\Programs\R\Rtools\perl\bin;c:\Programs\R\Rtools\MinGW\bin;%PATH%

rem --- Start the Command Prompt ---
  • Start the script with the double click and you will get a working environment for R package development/testing
Btw. under Linux or Mac you can test the functionality of R package under MS Windows using the win builder at http://win-builder.r-project.org provided by Uwe Ligges


Genomic selection session on ICAR

Genomic selection is a hot topic in the area of animal breeding and genetics. Today, I came accross the presentations given at ICAR this summer. I liked all presentations, but today I would like to point out the status reports from Netherlands and New Zealand.


Uporaba seksiranega semena pri ekološkem kmetovanju

Pri sesalcih je spol potomca odvisen od kombinacije spolnih kromosomov. Ženski spol ima za par spolnih kromosomov vedno t.i. kombinacijo XX, moški spol pa ima vedno t.i. kombinacijo XY. Tako mati potomcu vedno prenese t.i. X kromosom, medtem ko lahko oče prenese t.i. X ali Y kromosom. Torej je od očeta odvisno ali po potomec moškega ali ženskega spola; če oče prenese X bo potomec ženskega spola, če pa prenese Y pa bo potomec moškega spola. V živinoreji bi si v velikih primerih želeli potomce samo določenega spola, npr. pri mlečni reji bi si želeli od boljših mater samo ženske potomce za obnovo črede, medtem ko bi od slabših mater želeli moške potomce, ki so bolj primerni za pitanje; pri prašičih bi si za pitanje nasprotno želeli samo ženske živali, saj imajo le te boljše rezultate kot kastrirani moški; pri nesni perutnini bi si prav tako želeli več ženskih živali, saj nesne kokoši niso primerne za pitanje zaradi izredne "specializacije na nesnost". Na tem mestu je takoj potrebno opozoriti, da kokoši niso sesalci in da pri kokoših matere "določijo" spol potomcev. V kolikor imamo možnost, da "preberemo" seme, lahko vplivamo na to, kakšnega spola bodo potomci. Kolikor je meni znano, je takšno seme pri govedu že na voljo - za več brskajte po spletu - link SI, link EN.

Kaj ima vse to z ekološkim kmetovanjem? Novica z Danske me je presentila, saj na Danskem na ekoloških kmetijah ni dovoljena uporaba seksiranega semena. Ne vem, kako je s tem pri nas, ampak moje mnenje je, da je to zelo čudna omejitev. Menim, da "prebiranje" semena manj "sporno" kot pa izločitev (klanje) potomcev nezaželenga spola, npr. moških telet pri mlečnem govedu.

Animal Breeding & Genomics 10

Animal Breeding & Genomics 10


Memory limit management in R

R keeps all the data in RAM. I think I read somewhere that S+ does not hold all the data in RAM, which makes S+ slower than R. On the other hand, when we have a lot of data, R chockes. I know that SAS at some "periods" keeps data (tables) on disk in special files, but I do not know the details of interfacing these files. My overall impression is that SAS is more efficient with big datasets than R, but there are also exceptions, some special packages (see this tutorial for some info) and vibrant development which to my impression takles the problem of large data in the spirit of SAS - I really do not know the details, so please bear with me.

Anyway, what can you do when you hit memory limit in R? Yesterday, I was fitting the so called mixed model using the lmer() function from the lme4 package on Dell Inspiron I1520 laptop having Intel(R) Core(TM) Duo CPU T7500 @ 2.20GHz 2.20GHz and 2046 MB of RAM. I was using MS Windows Vista. The fitting went fine, but when I wanted to summarize the returned object, I got the following error message:

> fit <- lmer(y ~ effect1 + ....) > summary(fit)
Error: cannot allocate vector of size 130.4 Mb
In addition: There were 22 warnings (use warnings() to see them)
First, I find this very odd, since I would expect that fitting the model should be much more memory consuming in comparison to summarizing the fitted object! I will ask the developers of the lme4 package, but until then I tried to find my way out.

Message "Error: cannot allocate vector of size 130.4 Mb" means that R can not get additional 130.4 Mb of RAM. That is weird since resource manager showed that I have at least cca 850 MB of RAM free. I printe the warnings using warnings() and got a set of messages saying:
> warnings()
1: In slot(from, what) <- slot(value, what) ... :
Reached total allocation of 1535Mb: see help(memory.size) ...
This did not make sense since I have 2GB of RAM. I closed all other applications and removed all objects in the R workspace instead of the fitted model object. However, that did not help. I started reading the help page of memory.size and I must confes that I did not understand or find anything usefull. However, reading the help further, I follwed to the help page of memor.limit and found out that on my computer R by default can use up to ~ 1.5 GB of RAM and that the user can increase this limit. Using the following code, helped me to solve my problem.
[1] 1535.875
> memory.limit(size=1800)
> summary(fit)


Domestication of sheep and goats

GlobalDiv posted a new newsletter, where the interesting information about the domestication of sheep and goats. I can not copy the relevant part here, but you can take a look at page 6. Basically it says that there probably was not a domestication bottleneck in sheep and goats as it probably occured in cattle. This might be also the reason, we have so much variability in sheep and goats today in comparison to cattle.


Sweave.sh plays with cacheSweave

I have added support for caching to Sweave.sh script as implemented in cacheSweave R package written by Roger D. Peng. Now, one can set caching on for chunks that are time consuming (data import, some calculations, ...) and the Sweaving process will reuse the cached objects each time they are needed. Read the details about the cacheSweave package in the package vignette. Option --cache for Sweave.sh script should also be easy to understand. However, here is a minimalist example:

n <- 10
s <- 15

Let us first simulate \Sexpr{n} values from a normal distribution and add a \Sexpr{s} sec pause to show the effect of caching.

<<simulate, cache=true>>=
x <- rnorm(n)

Now print the values:

<<print, results=verbatim>>=

Now, one can run the following command:
Sweave.sh --cache test.Rnw
and the output on the command line is:
Run Sweave and postprocess with LaTeX directly from the command line
- cache mode via cacheSweave R package

R version 2.8.0 (2008-10-20)
Copyright (C) 2008 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(package='cacheSweave'); Sweave(file='test.Rnw', driver=cacheSweaveDriver);
Loading required package: filehash
filehash: Simple key-value database (2.0 2008-08-03)
Loading required package: stashR
A Set of Tools for Administering SHared Repositories (0.3-2 2008-04-30)
Writing to file test.tex
Processing code chunks ...
1 : echo term verbatim (label=setup)
2 : echo term verbatim (label=simulate)
3 : echo term verbatim (label=print)

You can now run LaTeX on 'test.tex'
When you repeat the Sweaving process, which you more or less always do, there is no need to wait for 15 second since cacheSweave package takes the x object from the cache! Excellent job Roger!

Trimmed standard deviation

R provides you with a regular mean, i.e., sum the values and divide them by the number of values, as well as with the trimmed version. In the later, we remove a proportion of the smallest and the largest values. This can be handy to evaluate the effect of outliers on the mean. Check this out:
## Sample 100 values from standard normal distribution
x <- rnorm(n=100)
## Add an outlier
x <- c(x, 100)
## Calculate the mean
## [1] 1.018909
## Calculate the trimmed mean
mean(x, trim=0.01)
## [1] 0.04981092
I will not talk here about the effect of outliers on the mean. Above example can be used for the exploration. I wanted to do the same trick with standard deviation, but found out that this is not possible out of the box in R. I coul be wrong-. Well, I modified the mean.default function to do that. It is not perfect, but serves my purpose. Here are the result for the above data:
## [1] 9.997196
sd.trim(x, trim=0.01)
## [1] 0.9841417
And the code:
sd.trim <- function(x, trim=0, na.rm=FALSE, ...)
if(!is.numeric(x) && !is.complex(x) && !is.logical(x)) {
warning("argument is not numeric or logical: returning NA")
if(na.rm) x <- x[!is.na(x)]
if(!is.numeric(trim) || length(trim) != 1)
stop("'trim' must be numeric of length one")
n <- length(x)
if(trim > 0 && n > 0) {
if(is.complex(x)) stop("trimmed sd are not defined for complex data")
if(trim >= 0.5) return(0)
lo <- floor(n * trim) + 1
hi <- n + 1 - lo
x <- sort.int(x, partial = unique(c(lo, hi)))[lo:hi]


Version control (CVS and SVN) cleints for MS Windows

Version control is a good thing. I have some experience with CVS and SVN. Installing the clients to "talk" with the server is very simple under Linux, i.e., something like:
apt-get install cvs svn
It is quite some time since I "discovered" the TortoiseSVN, which is a client for MS Windows. It is really nice, since it integrates smoothly with the Windows Explorer (the yellow folder icon). Today, I also "discovered" that there is also TortoiseCVS. I usually do not do any development under MS Windows, but it is very handy to be able to do a change or two and commit the changes from MS Windows.

In my opinion using the open source programs (R, FireFox, ThunderBird, ...) can improve the working experience with MS Windows a lot.


PopG genetic simulation program

Teaching the concepts of population genetics can be tedious and boring if one does not have a feeling for the meaning of the "equations" and "numbers". Students at our department always have problems with these concepts. Therefore, I bundled some R scripts in the DemoPopQuanGen package, but students have a hard time installing the package and running the demos. I guess the learning curve is too steep for the majority of the students.

It seems that the PopG genetic simulation program can do more or less anything we need for the population genetics part. From now on I will use this nice program to show the concepts and I am sure students will like this more than R. Though, it is nice to be able to do some more fancy stuff in R, but one really needs at least some feeling about the programming.


Mailing lists for quantitative and/or statistical genetics

It is always good to have someone to ask some questions or to disscus relevant issues. Mailing lists or forums are a very good place for this. Unfortunatelly, there is to my knowledge no place on the net, where one would find all people working in this field in "one place". The field of quantitative and/or statistical genetics is very scattered among various "disciplines" (animal breeding, plant breeding, human genetics, ...). I somewhat monitor the following places:
Does anyone know of any other list?

P.S. If you do not want to have an email jam in your inbox, I warmly recomend you to use something like Bloglines.com, which allows you to create an e-mail to subscribe to a particular list. Then you can read the e-mails as any other RSS news.


Genetic diversity in Alpine sheep breeds - emphasis on Slovenian breeds

Dalvit et al. have published a paper "Genetic diversity in Alpine sheep breeds" in Small Ruminant Research. They attempted to study the genetic diversity --> similarity via microsatellite molecular markers in breeds of sheep kept in the region of Alps. Their study involved animals of the following breeds:
  • Italy
    • Bergamasca (TIHO, SRR paper)
    • Biellese (TIHO)
    • Schwarzbraunes Bergschaf = Black-brown mountain sheep (TIHO)
    • Tiroler Bergschaf = Tiroler mountain sheep (TIHO)
    • Schnalserschaf = ??? sheep (???)
  • Germany
  • Slovenia
My interest in this article is of course due to the Slovenian breeds of sheep: the Bovec sheep and the Jezersko-Solčava sheep breed. The main result regarding these two breeds are that Bovec, Jezersko-Solčava and Carinthian were in the same cluster according to the Reynolds' genetic distance (e.g. see this lecture). This is partly an expected result, since to my knowledge Jezersko-Solčava and Carinthian are phenotypically very similar and the origin and history of these two breeds are similar. Sometimes a name Seeländerschaf was also used for Brillenschaf, which basically means Jezersko (~ lake land) sheep. The following Fst and mean molecular coancestry (MC) values were obtained:
  • Jezersko-Solčava - Carinthian, Fst = 0.053, MC = 0.188
  • Jezersko-Solčava - Bovec, Fst = 0.056 , MC = 0.201
  • Bovec - Carinthian, Fst = 0.084, MC = 0.176
Surprisingly, the molecular coancestry showed that Jezersko-Solčava breed has a bit more similarities with Bovec than with Carinthian breed, though Fst was lower for Jezersko-Solčava - Carinthian pair. The difference for mean molecular coancestry is not large, but I would say that Jezersko-Solčava and Carinthian are phenotypically much more similar. Bovec sheep is phenotypically quite different to Jezersko-Solčava. There surely have been different ways of selection of Bovec and Jezersko-Solčava breed since the former is today used for milk production, while the later is not milked at all and used as a meat type sheep i.e. only for rearing lambs. In study of Dalvit et al. Carinthian animals were sampled from Germany. I wonder if the same results would be obtained with animals from Austria. Additionally, TIHO site "states" that there has been some introgression of White mountain (Weisses Bergschaf) breed into Carinthian breed in Germany. However, this does not mean that Carinthian breed in Germany today still has any of the "genes" from White mountain. For example, in Slovenia Romanov rams were used in some flocks of Jezersko-Solčava sheep, but those animals were never treated as Jezersko-Solčava, but as a separate breed called improved Jezersko-Solčava or JSR in short. JSR breed is today a breed with the biggest population among sheep in Slovenia. Jezersko-Solčava breeds is maintained in its "original" environment, but also in some other parts of the Slovenia.

It was also interesting that Jezersko-Solčava and Carinthian were not clustered with Bergamasca. Historical records for both breeds say that at some time in the past breeders used Bergamasca and Paduaner rams. It seems that only some introgression of Bergamasca was done. The following Fst and molecular coancestries (MC) were obtained:
  • Jezersko-Solčava - Bergamasca, Fst = 0.050, MC = 0.185
  • Carinthian - Bergamasca, Fst = 0.077, MC = 0.160
  • Jezersko-Solčava - Carinthian, Fst = 0.053, MC = 0.188
The mean molecular coancestry for the Jezersko-Solčava - Bergamasca pair was practically of the same value as for the Jezersko-Solčava - Carinthian pair, while Carinthian - Bergamasca pair had a bit lower value. I again think this is not consistent with the history of Jezersko-Solčava and Carinthian breed. Bergamasca breed was involved in both populations, but that was some time ago. I would expect more similarities between animals of Jezersko-Solčava and Carinthian breed. These peculiarities could be due to the Germans vs Austrian Carinthian population, chosen set of microsatellites, substructures etc.

P.S. Slightly different results were obtained in "another version" as published in "Best Practices Manual for sheep and goat breeding". In that version, Jezersko-Solcava and Carinthian breed has greater similarity than Jezersko-Solcava (or Carinthian) and Bovec breed.

P.S. See also the "Atlas of Alpine sheep breeds" by Feldmann et al.



Conjugate priors

John D Cook has posted a nice diagram of relationship between sampling distribution of the data i.e. the likelihood and the conjugate prior. He also posted a link to a compendium of conjugate priors by written by Fink.

Lactation curve demonstration

Here is another demo plot. This time a lactation curve is showed. Lactation curve shows how the amount of produced milk is chaning over the lactation - from parturition to drying off. In this demo I used the so called Ali&Schaeffer function as applied to some goat data. Some special "things" added to the plot:
  • a line to show the time of weaning the offspring and
  • a line to show the time of drying off.
Here is the plot, followed by the R code.

Ali and Schaeffer. 1987. Accounting for covariances among test day milk yields in dairy cows. Can. J. Anim. Sci. v67. 637-644.

## Days in milk
dimMin <- 1
dimMax <- 300
dimBy <- 0.5

## Days in milk
AliSch <- function(x, c=305)
cbind(x / c,
(x / c) * (x / c),
log(c / x),
(log(c / x)) * (log(c / x)))

x <- seq(from=dimMin, to=dimMax, by=dimBy)
XInt <- rep(1, times=length(x))
XDim <- AliSch(x=x)

## Design matrix
X <- cbind(XInt, XDim)

## Parameter estimates - from one analyis in goats
int <- 2800.222
aliSch <- c(-1678.68038, 219.29628, 53.04481, -55.26959)
b <- as.matrix(c(int, aliSch))

## Estimate
yHat <- (X %*% b) / 1000

## Plot

## 2.47
yMax <- 3
yMin <- 1.5

## Open PDF
## pdf(file="lact_curve.pdf", width=5, height=3.2, pointsize=11)

par(bty="l", # brez okvirja
mar=c(3.2, 3, .5, .5), # robovi od osi
mgp=c(2, 0.7, 0), # robovi za stevilke na osi
pty="m") # maksimalno izkoristi povrsino

plot(x=x, y=yHat, type="l", lwd=2, ylim=c(yMin, yMax),
xlab="Days in milk", ylab="Daily milk yield (kg)")

text(x=0, y=yMin - 0.15, xpd=TRUE,

xP <- c(80, 120, 145, 180, 210, 240, 265)
points(x=xP, y=yHat[which(x %in% xP)], cex=2, pch=19)
points(x=xP, y=yHat[which(x %in% xP)], cex=3, pch=21)

text(x=200, y=yMax - 0.8, labels="Test-day records")

abline(v=75, lwd=2, lty=2, col="gray")
arrows(x0=75 - 30, x1=75 + 30, y0=yMin + 0.005, y1=yMin + 0.005,
lty=1, lwd=2, code=3, length=0.15, col="gray")
text(x=75, y=yMin - 0.15, xpd=TRUE,

text(x=30, y=yMax - 0.2,
labels="Suckling period")

abline(v=275, lwd=2, lty=2, col="gray")
arrows(x0=275 - 30, x1=275 + 30, y0=yMin + 0.005, y1=yMin + 0.005,
lty=1, lwd=2, code=3, length=0.15, col="gray")
text(x=275, y=yMin - 0.15, xpd=TRUE,
labels="Drying off")

text(x=175, y=yMax - 0.2,
labels="Milking period")

## Close the plot
## dev.off()

Plot of normal distribution with shaded area

Theory of quantitative genetics much relies on the normal (Gaussian) distribution. Therefore, one would often like to plot it for presentations, class notes, etc. This is not that hard with excellent tools we have today. I played with plotting in R and came up with the following plot of normal distribution showing:
  • the density of standardized breeding values with a mean of 100 units and a standard deviation of 12 unit and
  • shaded area that roughly corresponds to upper 5 % of the distribution.
Here is the plot, followed by the R code.

## Define the mean and the standard deviation
mu <- 100
sigma <- 12

## Define the % of upper are to shade
k <- 0.05

## Define the grid
x <- (seq(-5, 5, 0.01) * sigma) + mu

## Compute the density of normal distribution over the grid
y <- dnorm(x=x, mean=mu, sd=sigma)

## Compute the "threshold" for upper K % of the distribution
t <- qnorm(p=1-k, mean=mu, sd=sigma)

## Open PDF
## pdf(file="normal.pdf", width=5, height=3, pointsize=12)

## Open Windows metafile --> good for inclusion into MS Office documents
## win.metafile(filename="normal.wmf", width=5, height=3, pointsize=12)

par(mar=c(5, 4, 1, 1) + 0.1, # c(bottom, left, top, right)
plot(y ~ x, type="l", axes=FALSE,
xlab="Standardized breeding value",
ylab="Distribution", lwd=3)
## Mark the mean and the "threshold"
abline(v=c(mu, t), lwd=2, lty=2)
abline(h=0, lwd=3)
axis(2, labels=FALSE, tick=FALSE)

## Add shaded polygon
testK <- x >= t
xK <- x[testK]
yK <- y[testK]
polygon(x=c(xK, rev(xK)), y=c(yK, rep(0, times=length(xK))),
col="black", border=NA)

## Add arrow --> this will need some manual work, i.e. modify the values
x2 <- t * 1.1
y[which(round(x) == round(x2))[1]]
arrows(x0=(mu + sigma * 3.3), y0=y[which(round(x) == t)[1]],
x1=x2, y1=y[which(round(x) == x2)[1]], lwd=2)
text(x=142, y=y[which(round(x) == t)[1]] + 0.0015, labels=paste(t, "%"))
text(x=(mu + sigma * 3.3), y=0.02, labels="Good\n(+)")
text(x=(mu - sigma * 3.3), y=0.02, labels="Bad\n(-)")

## Close the device
## dev.off()


Runnning R2WinBUGS on Mac

New Mac computers/laptops (I do not know the details) are shipped with Intel processor. This brings the possibility to install a lot of UNIX/Linux software since Mac OS X is essentially a desktop Linux - I am again stating that I do not know the details about this since I can not afford a Mac.

Why is this important? WinBUGS is a great software for running Bayesian analyses. The name WinBUGS comes from two "parts": BUGS means Bayesian analysis Using Gibbs Sampling (though also other samplers are implemented), while Win corresponds to MS Windows. Therefore, one can natively run WinBUGS only on MS Windows. I personaly like the R2WinBUGS package, which allows one to run WinBUGS (and OpenBUGS, a new incarnation of BUGS software) directly from R. This approach is to a large extent promoted by Andrew Gelman in his "ARM book". Another possibilities are BRugs (using BUGS as directly linked with R - I can not find the link to BRugs package, since it is now in CRAN Extra repository) and rbugs (support for Linux systems with OpenBugs is emphasized) and the following two packages for interfacing JAGS (Just Another Gibbs Sampler): rjags and runjags.

It is quite some time since we managed to "tweak" R2WinBUGS so that it works also on Linux like systems that have installed wine emulator. One just needs to install R, wine and WinBUGS to "C:/Program Files/WinBUGS14" and the bugs() function will work out of the box on Linux!

This week as well as before we had a long discussion about running R2WinBUGS on new Macs with Intel processors. Rodney Sparapani found out that the following is needed:
schools.sim <-  bugs(data, inits, parameters, model.file,      n.chains=3, n.iter=5000,      bugs.directory="/Applications/WinBUGS14",      working.directory=".",      WINE="/Applications/Darwine/Wine.bundle/Contents/bin/wine",      WINEPATH="/Applications/Darwine/Wine.bundle/Contents/bin/winepath") 
Some notes are needed:
  • bugs.directory="/Applications/WinBUGS14" - this seems to be a reasonable default for installation of WinBUGS on Mac, however, we will not "hardwire" this into the R2WinBUGS code since this place can vary a lot between different operation systems and locales

  • working.directory="." - this is needed for the moment, since it seems that Mac launches wine emulator as another user, who does not have a permission to write to a temporary folder created by tempdir() by R; update on 2009-03-25, setting working.directory to "." will not be needed anymore in R2WinBUGS version 2.1-13 and later

  • WINE="/Applications/Darwine/Wine.bundle/Contents/bin/wine" and WINEPATH="/Applications/Darwine/Wine.bundle/Contents/bin/winepath" are paths to wine and winepath binaries that we use in R2WinBUGS; unfortunatelly we can not find these programatically from R as it is the case in UNIX/Linux - if one cretaes links to /usr/bin or /usr/local/bin, there is no need to set this two arguments in bugs()



MathGL (library for scientific graphics) has some very cool examples. I like this one:


Financiranje rejskega programa za lisasto pasmo govedi v Italiji

Te dni se mudim v Italiji na tečaju, ki ga vodi Ignacy Misztal. Tečaj sicer organizira ANAPRI - rejsko organizacija za lisasto pasmo govedi v Italiji. Po pogovoru z gostitelji sem izvedel, da se rejski program financira v grobem na sledeči način: 60 % iz proračuna, 15 % s strani rejcev in ostalo z "aktivnostjo" same zveze. Rejci morajo na kravo na leto plačati 11 EUR. S tem prispevajo del stroškov za izvajanje kontrole in ostale dejavnosti povezane s selekcijo.


SAGE - open source mathematics software

"Sage is a free mathematics software system licensed under the GPL. It combines the power of many existing open-source packages into a common Python-based interface." The installation on my laptop with Ubuntu was very simple. I went to console/terminal and issued:
## Download
wget http://www.sagemath.org/bin/linux/32bit/sage-3.1.2-ubuntu32bit-i686-intel-i686-Linux.tar.gz
## Unpack and remove the tarball
tar xzvf sage-*.tar.gz
rm -f sage-*.tar.gz
## Move to /usr/local
sudo mv sage-* /usr/local/.
## Make link
sudo ln -s /usr/local/sage-*/sage /usr/local/bin/sage
The whole "distribution" is quite big, the tarball has 366 Mb, while the unpacked directory has 1.2 Gb.

Tag cloud in Blogger

Phydeaux3 blog has a very good solution for a tag cloud in Blogger.

sam2p: convert raster (bitmap) images to PDF, PostScript

sam2p does a great job with converting raster (bitmap) images to say a PDF or (Encapsulated) PostScript format.


SAS Macros for BUGS

Rodney Sparapani wrote some SAS macros for interfacing with BUGS. Here is a site with descriptions and links. See also "SAS now goes with Bayesian methods".


Zoja na konju

Excel and LaTeX?

If you want to use Excel table in LaTeX document you can:
  • retype it in LaTeX syntax --> not recommended, especially if a table can change
  • save as CSV and use csv2latex
  • use excel2latex addon for Excel --> this is a very handy solution
  • open Excel file in some other spreadsheet that has the ability to save as LaTeX file, e.g., Gnumeric
  • a related soultion is offered by Spreadsheet2latex
See also PC -> LaTeX Converter Quick Comparison List


Statistical Experts site

I came across the Statistical Experts site. It offers a forum, blog, and wiki. I like the design. There has been quite some "traffic" on the forum and some posts on blog. The wiki seems to be empty for the moment. I added the forum and blog to my RSS reader so that I will be able to monitor the "traffic" on these two places.

Genetic counseling

While I was studying the calculation of genotype probabilities (see related posts) I always wondered what is the situatuion with genetic counseling in humans. Up to now, I only read in the papers that genotype probabilities can be used also in the genetic counselimg, but I never really read anything about real applications. Yesterday, I came across a nice abstract, which is a bit old (from 1990), but neatly summarizes 1000 records on genetic counseling. Unfortunatelly, the paper is in French.


Drawing pedigrees

I often work with data on relatives, where relationship between the individuals is inferred just from the pedigree. Up to now I did not plotted any pedigrees, except the simple ones, where I was able to use the pen and paper easily. However, when the number of pedigree members increases, the complexity of drawing increases enormously.

When I first googled for the term "pedigree", I got a lot of useless hits, but I surely followed the link on Wikipedia, where I learned something new.
The word pedigree is a corruption of the French "pied de grue" or crane's foot, because the typical lines and split lines (each split leading to different offspring of the one parent line) resembling the thin leg and foot of a crane.
Refining the search revealed a lot of good hits. It seems that there is a lot of competition in this area. That is good. However, the programs have many different options for many different kind of uses. Some areas are: genetic counselling, display of family trees, animal and plant breeding, ... Here are the links to the sites providing some more info about these programs: link1, link2, link3, link4, link5, link6, and link7.

I am particularly interested in animal breeding where pedigrees are large and complex. Drawing such pedigrees seems to be very complicated process (link1, link2, link3). Authors of Pedigraph claim that their program is good in this regard. They patented their innovation of summarizing the pedigree in order to gain some space for pedigrees with large half-sib families. The later are quite common in animal breeding. PedNavigator is also described as a program that can handle large and complex pedigrees, but its use is more for exploratory real-time usage, which is great, but not what I need - display the pedigree in one figure. There are very nice examples posted at Pedigraph site. This is a nice showcase for the Pedigraph programme.

Exploring a bit further (especially via link1, link2, link3, and link4) I realised that there is a general framework for drawing pedigrees, which are DAGs - Directed Acyclic Graphs, the Graphviz. Reading a patent behind Pedigraph once again, I found that they also talk about the dot language - used by the Graphviz software. Are they also using the Graphviz for the drawing of the pedigrees? The display looks very similar. Pedigree Visualizer, OPediT, PyPedal also use Graphviz for drawing the pedigrees. I guess that initially the programs try to optimize the pedigree in some way and then they construct the dot file, which is passed to Graphviz to do the drawing. I guess I will use Pedigraph for large and complex pedigrees and Graphviz directly for some small ones.


SAS now goes with Bayesian methods

New versions of SAS (9.2) procedures GENMOD, LIFEREG, and PHREG enable users to use Bayesian approach for estimating model parameters. The computations are done with McMC methods. There is also a new experimental procedure MCMC. See more here. I wonder how fast these procedures are, since SAS is known to be fast and very efficient with big datasets, but see also one of my previous posts on PROC MIXED vs lmer().


Excel 2007 for statistics?

Microsoft Excel is probably one of the most used software for performing various "calculations". I agree that spreadsheets are very usefull and for some tasks MS Excel just excels. However, MS Excel has problems when it somes to statistics. There have been quite some reports about the inaccuraccy of statistical routines. For the 2007 version, McCullough and Heiser (2008)
wrote this in the abstract

Excel 2007, like its predecessors, fails a standard set of intermediate-level accuracy tests in three areas: statistical distributions, random number generation, and estimation. Additional errors in specific Excel procedures are discussed. Microsoft’s continuing inability to correctly fix errors is discussed. No statistical procedure in Excel should be used until Microsoft documents that the procedure is correct; it is not safe to assume that Microsoft Excel’s statistical procedures give the correct answer. Persons who wish to conduct statistical analyses should use some other package.

MS Excel is not the only spredsheet software. Yalta (2008) compared MS Excel with some others and wrote this in the abstract:

We provide an assessment of the statistical distributions in Microsoft® Excel versions 97 through 2007 along with two competing spreadsheet programs, namely Gnumeric 1.7.11 and OpenOffice.org Calc 2.3.0. We find that the accuracy of various statistical functions in Excel 2007 range from unacceptably bad to acceptable but significantly inferior in comparison to alternative implementations. In particular, for the binomial, Poisson, inverse standard normal, inverse beta, inverse student’s t, and inverse F distributions, it is possible to obtain results with zero accurate digits as shown with numerical examples.

McCullough (2008) contrubuted also this:

Microsoft attempted to implement the Wichmann–Hill RNG in Excel 2003 and failed; it did not just produce numbers between zero and unity, it would also produce negative numbers. Microsoft issued a patch that allegedly fixed the problem so that the patched Excel 2003 and Excel 2007 now implement the Wichmann–Hill RNG, as least according to Microsoft. We show that whatever RNG it is that Microsoft has implemented in these versions of Excel, it is not the Wichmann–Hill RNG. Microsoft has now failed twice to implement the dozen lines of code that define the Wichmann–Hill RNG.

Additionally, the graphs made in MS Excel are not really good ones. It depends a lot on the user to make a good graph. I have seen many so called chartjunk commming from Excel. Of course the blame is on users, but software should help us make things easier and not harder! Su (2008) wrote this in the abstract

The purpose of default settings in a graphic tool is to make it easy to produce good graphics that accord with the principles of statistical graphics... If the defaults do not embody these principles, then the only way to produce good graphics is to be sufficiently familiar with the principles of statistical graphics. This paper shows that Excel graphics defaults do not embody the appropriate principles. Users who want to use Excel are advised to know the principles of good graphics well enough so that they can choose the appropriate options to override the defaults. Microsoft® should overhaul the Excel graphics engine so that its defaults embody the principles of statistical graphics and make it easy for non-experts to produce good graphs.

After all this anti-Microsoft abstracts I really, really start to wonder what we get for the money we spend for the software.

Size of marker effects

Genomic selection is the current hot topic in quantitative genetics. Here I would like to show, that there is a subtle issue with marker effects that it is good to be aware of when we talk about the size of marker (gene) effects. In a nutshell, today we can collect many (several thousands) genotypes for a low amount of money. We then try to estimate the effect of markers on the phenotypic values collected on a set of individuals. There is a bunch of important details in this steps, but I would like to discuss here the issue with the size of estimates of marker effects.

Say we are looking at a loci in diploids. In our population we know that only two different variants (alleles) of a gene is present at observed loci, say a and A. Probability of the allele a is p_a and the probability of the allele A is p_A. The possible genotypes in diploids are aa, aA, and AA. The book by Falconer & MacKay (see here and here) provides a nice description how we can estimate genetic effects from such data. I will follow their setup.

We assign arbitrary values to genotypes: aa has arbitrary value -a, aA has an arbitrary value d, while AA has an arbitrary value +a. These values might be defined such that they represent deviations from some origin, say the overall mean of the data. Then the average effect of allele a:

alpha_a = -p_A * alpha,

and the average effect of allele A is:

alpha_A = p_a * alpha,

where alpha is:

alpha = a + d * (p_a - p_A),

where a and d are arbitrary values, and p_a and p_A are allele frequencies. How does this relate to estimates of marker affects? When we get estimates of marker effects, they represent the effect of average allele substitution, which is alpha. As can be seen from the above setup, this value depends on the arbitrary values a and d and allele requencies. Therefore, when we get the estimates of marker effects, we can not just say that there are some markers that have big or small effects, since the estimates depend on allele frequencies. If we assume that there is no dominance, then the alpha = a and we can say something about the size of marker effects. If there is dominance, then we need to be more carefull. I do not have any "genomic" data so I can not check this on real dataset. I would like to see the plot of marker effect estimates (on y-axis) versus allele frequency (on x-axis) in the data as well as versus the value of d * (p_a - p_A). The effect of dominance on marker estimates should be the greatest when one of the alleles is rare and/or the effect of dominance itself is big. I am for now interested in tha case with the allele frequencies.

Perhaps, we can try with some simulations. I will use R for the simulation. I did a simple simulation to quickly see how it goes. At the end of the post is used R code. The results are two graphs, each graph has four subgraphs. The first set is from simulated data with heritability of 0.5 and the second one with heritability of 0.25. There is not much to see in these pictures. No clear trends at all. Well, I could have played a lot with this, especially with the dominance deviation. But it seems that this

Read this document on Scribd: Marker effects h2 0.50

Read this document on Scribd: Marker effects h2 0.25

## Install the DemoPopQuanGen package
tmp <- c("http://gregor.gorjanc.googlepages.com",
contrib.url(repos=getOption("repos"), type=getOption("pkgType")))
install.packages("DemoPopQuanGen", contriburl=tmp, dep=TRUE)

## Function for simulation of genotypes and phenotypic values
rvalue <- function(n, p, value, mean=0, sdE=0, h2=NULL, genotype=FALSE)
## --- Setup ---

k <- length(p)
q <- 1 - p
y <- matrix(nrow=n, ncol=k)
g <- matrix(nrow=n, ncol=k)

## --- Genotype frequencies under Hardy-Weinberg equilibrium ---

P <- p * p
Q <- q * q
H <- 2 * p * q

## --- Marker effects ---
for (i in 1:k) {
g[, i] <- rdiscrete(n=n, probs=c(P[i], H[i], Q[i]))
y[, i] <- value[g[, i], i]
y <- rowSums(y)

## --- Residual variance based on heritability ---

if(!is.null(h2)) {
varG <- var(y)
sdE <- sqrt(varG / h2 - varG)

## --- Phenotype ---

if(genotype) {
y <- y - mean(y)
ret <- cbind(y + rnorm(n=n, mean=mean, sd=sdE), g - 1)
colnames(ret) <- c("y", paste("g", 1:k, sep=""))
} else {
ret <- y

## Set number of phenotyped and genotyped individuals
n <- 2000
## Set number of loci
nLoci <- 1000

## Simulate loci properties, see ?rvalueSetup
tmp <- rvalueSetup(nLoci=nLoci, p=runif(n=nLoci, min=0.05, max=0.95))

## Simulate genotypes and phenotypes, see ?rvalue
sim <- rvalue(n=n, p=tmp$p, value=tmp$value, mean=100, h2=0.5, genotype=TRUE)

## Single marker regression
b <- se <- p_value <- p2t <- numeric(length=nLoci)
for(i in 1:nLoci) {
fit <- lm(sim[, 1] ~ sim[, i+1])
p2t[i] <- mean(sim[, i+1]) / 2
b[i] <- coef(fit)[2]
se[i] <- vcov(fit)[2, 2]
p_value[i] <- summary(fit)$coefficients[2, 4]

a <- tmp$value[3, ]
d <- tmp$value[2, ]
p2 <- tmp$p
p1 <- 1 - p2
p1t <- 1 - p2t

## Plots
pdf("marker_effects.pdf", version="1.4")
par(mfrow=c(2, 2), bty="l",
mar=c(2.7, 3.0, 1, 1) + .1,
mgp=c(1.6, 0.55, 0),

## Marker effect estimates vs true marker effect (a)
plot(b ~ a,
xlab="Marker effect - a",
ylab="Marker effect estimate",
col=rgb(0, 0, 1, 0.2), pch=16)

## Marker effect estimates vs allele frequency
plot(b ~ p1t, xlim=c(0, 1),
xlab="Allele frequency",
ylab="Marker effect estimate", col=rgb(0, 0, 1, 0.2), pch=16)

## Marker effect estimates vs dominance deviation (d)
plot(b ~ d,
xlab="Dominance deviation - d",
ylab="Marker effect estimate", col=rgb(0, 0, 1, 0.2), pch=16)

## Marker effect estimates vs "allele frequency and dominance deviation"
z <- d * (p2t - p1t)
plot(b ~ z,
xlab="d * (p_a - p_A)",
ylab="Marker effect estimate", col=rgb(0, 0, 1, 0.2), pch=16)

Genomic selection

Last week I have been on a course "Whole genome association and genomic selection" by Ben Hayes. Current course webpage does not (yet) provide course materials (notes and slides), but there are materials from a previous course by Ben - look here.

In a "nutshell" the course was about usage of whole genome data in order to improve (via selection of best individuals) important traits in livestock. Today, we can collect a lot (several thousands), up to 60K for cattle for example) of genotypes with "chip technologies" for a price around 200-300$. After the cleaning of the data, a researcher would estimate effect of each genotype (just additive or maybe even the dominance effect) on the phenotype of individuals. Technically, when we get the genotype data, this data do not represent the exact (functional) genes, but just markers along the whole genome. When we estimate the effect of markers, we try to exploit the possible linkage between the markers and potential genes, which are to a large extent unknown to us. Essential issue with the success of this method is that there is enough linkage disequilibrium between markers and genes. We can achieve that with dense chips. Since there has been an intense selection and strong founder effects in populations of livestock, the re is more linkage disequilibrium and the density of chips in livestock can be lower than in humans. After the estimation of markers effects, we can calculate a genotypic value for any genotyped individual. The essence of this approach has been presented in the paper by Meuwissen et al. (2001). There has been a lot of research/work done since then! There are also quite some important technical details, but this really seems to be a new very hot topic for animal breeding and genetics.

During the course I had an opportunity to disscuss with many people and there seems to be a great intereset for genomic selection in dairy cattle. This is due to the availability of dense chips for cattle and possibility of wide exploitation of semen via artifical insemination. New Zealand, Netherlands, USA are the countries which have already implemented the genomic selection and are already starting to sell the semen of young bulls. The big advantage of genomic selection in dairy cattle is in shortened generation interval. Up to now, potential bulls were preselected from best cows accounting for the breeding values of these cows and inbreeding. Then the semen of this preselected bulls would be used to produce their daughters. Then breeding values of the bulls would be estimated from the phenotypic records (milk yield) of their daughters. Well, the calculation of the breeding values accounts for all possible information from relatives, but the accuracy of breeding values is much dependent on the number of close relatives that have phenotypes. A nice description of these techniques is in the book by Mrode. There are also other books, but the book by Mrode is a very good one for the start. The process from preselecting the bull up to the calculation of breeding values can last around five years.

Besides shortened generation intervals there is also some gain in the accuracy of estimation of genetic values. Up to now animal breeders were mainly exploiting the so called infinitesimall model, where we postulate that there are infinite number of genes and that the effect of gene is negligible. This model is clearly wrong, but has been and still is a very usefull abstraction. Its use has enabled genetic improvement of many economically important traits in agriculture. Since the infinitesimall model is only an abstraction, geneticists were and are still trying hard to find genes that have an effect on various traits. There has been quite a lot of succes in this work, especially for the traits that are under the influence of only few genes. However, for the traits that are continous in the nature there have been only few genes found even though the efforts are substantial. Genomic selection now provides another abstraction. It is again an approximation, but given the abundant genomic data, there are gains in comparison to an infinitesimall model.

With genomic selection animal breeders gain a lot of flexibility and it seems that we will face changes in the organization of breeding programmes. Genomic selection is of course aplicable also for other breeding programmes: beef cattle, sheep, goats, pigs, poultry, ...


Using Beamer with LyX-Sweave

Dan Weeks (web1, web2) provided a solution on how to use Beamer with LyX & Sweave.

Make a literate-beamer layout file for LyX!

#% Do not delete the line below; configure depends on this
# \DeclareLaTeXClass[beamer, Sweave.sty]{beamer (beamer Sweave noweb)}
# This is a copy of literate-article.layout from LyX, but changed for
# Sweave - NoWeb syntax:
# - changed noweb.sty to Sweave.sty
# - moved preamble to literate-scrap.inc

Format 2
Input beamer.layout
Input literate-scrap.inc

LyX then has a new Document Class that allows one to use the "scrap" style, and that portion of the text is sent to Sweave properly. However, there is a problem related with verbatim environment. Sweave makes a lot of use of verbatim environments, and beamer doesn't like those. You need to declare that a slide contains verbatim or you get errors. The reasonable solution is to add two ERT boxes in each slide that will contain "Sweave material". The ERT box at the beginning of the slide should contain

\frametitle{Slide title}

while the ERT box at the endo of the slide should contain



A History of Markov Chain Monte Carlo

Robert and Casella wrote a paper with a title "A History of Markov Chain Monte Carlo--Subjective Recollections from Incomplete Data--". It is a nice reading. Beeing an animal breeder I of course liked the following footnote ;)

On a humorous note, the original Technical Report of this paper was called Gibbs for Kids, which was changed because a referee did not appreciate the humor. However, our colleague Dan Gianola, an Animal Breeder at Wisconsin, liked the title. In using Gibbs sampling in his work, he gave a presentation in 1993 at the 44th Annual Meeting of the European Association for Animal Production, Arhus, Denmark. The title: Gibbs for Pigs.


LaTeX Style and BiBTeX Bibliography Formats for Biologists: TeX and LaTeX Resources

AIPL writings and presentations

I noticed that researchers at AIPL (USA) are very active with the current hot topic in animal breeding - genomic (genomewide) selection (Meuwissen et al., 2001). I asked Paul VanRaden for a reprint and he has kindly pointed me to the website of AIPL with conference proceedings and presentations. These two sites are very valubale resource for study - not only forthe topic of genomic selection, but also for dairy cattle breeding!


Password remind flaw in Joomla! 1.5.x

One of my sites uses Joomla! and the last day the password of the admin was changed. Shit! I thought that I had an intruder. Carefull inspection showed that there was nobody on the server beside me - the hacker could have gained my user account, but I remember what I did the last time so only the password was changed. Now I found out that the hacker used password remind flaw. After upgrade, I logged into the DB and updates the password with MD5 hash obtained from Paj's Home: Crypto... site.

UPDATE jos_users SET password='MD5HASH' WHERE id='???';

Alternatively, I could use this

UPDATE jos_users SET password=PASSWORD('???') WHERE id='???';

but then it is better to remove the .mysql history file!



I was searching for some papers via Google and found a PDF file with a list of very fine reference on the topic of inbreeding. I discovered that the PDF file is hosted at Consang.net. I like te site! Read the summary if you are interested in the topic. They show a nice map showing level of inbreeding in humans as reported in the literature. I am linking the picture here to make the post more interesting ;)

Beside nice map, the site has also a Google map that shows the data (by country) that were used in the creation of the above map. I only checked some data and found out that (out of those I looked at) many refernces are quite old. Therefore, I am bit skeptical about the "validity" of the map for the present time. For example, Hans Rosling showed that there have been large changes in the world in last century. Well, the "mating system" in humans probably did not change that much, but I still have some doubts that the level of inbreeding in humans is so much higher in the Near East. This could be as well just my lack of knowledge. I would be happy if anyone can teach me;) Besides this data the site hosts a fine list or references.

Google pages / sites

I use Google page creator for hosting my website. However, this "application" will be replaced by Google Sites. They have posted a very nice tutorial, which shows the capabilities. I like it. Simple and easy to set a website.


Calculation of PrP genotype and NSP type probabilities in Slovenian sheep

Paper "Calculation of PrP genotype and NSP type probabilities in Slovenian sheep" is finnaly finnished. I will send it to the journal today. This is a sort of a companion paper to the previous one where the methods were derived and tested, while in this paper we used those methods on all sheep breed in the Slovenian breeding programme for sheep. Here is the abstract:

PrP genotype probabilities in ungenotyped Slovenian sheep were calculated. Altogether 36,083 ewes and rams of various breeds were included into the analysis. PrP genotype was known for 10,504 animals. Five different PrP alleles were present in the data. Pedigree and genotype data structure differed between breeds. Iterative allelic peeling with incomplete penetrance model was used for the calculation of genotype probabilities for each animal given the genotype data of relatives. Analyses were performed for each breed separately. Additionally, NSP (National Scrapie Plan) type probabilities and the average NSP type were calculated from the genotype probabilities. Results were presented for live animals only. There were no animals with additionally identified PrP genotype or NSP type with certainty. With 95 % probability PrP genotype was additionally identified for 0.0 to 5.7 % animals of different breeds. NSP type was additionally identified with the same probability for 0.0 to 34.9 % animals of different breeds. We maintain that the low number of additional identifications was due to: a large number of alleles, intermediate allele frequencies, data structure, a uniform prior, and the use of incomplete penetrance model. Additional identifications provided some cost savings, but did not prove useful in the selection for scrapie resistance of the entire populations. The average NSP type should be used instead, since it can be calculated for all animals and encompasses all information from genotype probabilities.


Some news about gdata R package

Here are some news for gdata R package:
  • new function cbindX to column bind object with different number of rows
  • write.fwf gains new argument width, additionally several enhancements were done internally to make sure that effect of unknown values (via argument na) on column widths are properly handled
  • changes are only available via SVN and in packaged form (source package, Windows MS package) - packages will be removed after the new version of gdata will be published at CRAN

Literature on connectedness

Denis Laloe has posted a list of papers about the connectedness. See also the results for Connectedness in International Genetic Evaluation (Interbull).


PROC MIXED vs. lmer()

I am using R for a few years now. I like it a lot. Previously I used SAS (notably its modules SAS/BASE, SAS/STAT, SAS/IML, ...), because that is the statistical package of choice at my department. I had a hard time when I switched to R, mainly because I was thinking in the SAS way. However, I made a switch because I liked Rs' open-source nature and its vibrant community. Up to now I did not want to make any comparisons between them. I simply use R because I like it and get along well with using it. Recently I helped a student to fit a quite big model. She used SAS and PROC MIXED failed due to "out of memory". I was surprised, since SAS is known for its great stability and performance with big datasets. I was almoast sure that function lmer() in lme4 package in R will fail also, but could not resist to try it out. To my surprise, the model was fitted without problems. Perhaps I could tweak SAS to use more memory, but I did not invest time to study that option. Maybe new versions of SAS would perform better.

I used the following code in SAS 8.02:

proc mixed data=podatkiratio;
class genotype litter eage sex type year month herd hys;
model bw = genotype litter eage sex type year month;
random herd hys;

This is the code for R 2.6.0, lme4 0.99875-9:

lmer(bw ~ genotype + litter + eage + sex + type +
year + month +
(1|herd) + (1|hys),

And the results, which include also the description of the model size:

Linear mixed-effects model fit by REML
Formula: bw ~ genotype + litter + eage + sex + type + year + month + (1|herd) (1|hys)
Data: podatki
AIC BIC logLik MLdeviance REMLdeviance
152403 152899 -76149 152011 152297
Random effects:
Groups Name Variance Std.Dev.
herd (Intercept) 0.313 0.559
hys (Intercept) 0.147 0.383
Residual 0.297 0.545
number of obs: 85035, groups: herd, 346; hys, 9653

Fixed effects:
## genotype, 13 levels
## litter, 3 levels
## eage, 9 levels
## sex, 2 levels
## type, 2 levels
## year, 16 levels
## month, 12 levels

I am using SAS and R under Windows XP, running on Intel(R) Core(TM)2 CPU 6300 @ 2x1.86GHz with 2 GB RAM.

Legacy of late prof. Henderson

I came across the paper by Schaeffer "C. R. Henderson: Contributions to Predicting Genetic Merit". I did not know that paper before. It has a nice introduction that tells the history of how Henderson derived BLUP and MME. Additionally, paper describes two most important contributions of Henderson: the BLUP (and MME) and direct setup of the inverse of the numerator relationship matrix. Schaeffer also represented some open problems that are still not solved today (at least to my knowledge).


R package "entrez"

I got a request for R package "entrez" I wrote quite some time ago. I did upload the package to my old webpage, but I have not yet finnished the movement completely to the new webpage. I decided to make a sloopy move i.e. move files when I need them or when I will have more time, ... Anyway, I am glad that entrez package still works. I uploaded the source of the package as well as the ZIP packlage for easy installation on MS Windows (note that these two links will change with the new versions - if I publish them ever; however permanent link is this one).

What does the entrez package? It is a simple function that performs search for primary IDs in PubMed via Entrez utility esearch at NCBI. Examples:

## Are there any articles about Bioconductor in PubMed?
(search <- entrezSearch("Bioconductor[tiab]"))

## Get counts by year in the period 2000:2008
count <- entrezCountByYear(term="Bioconductor[tiab]", dp=2000:2008)

## Plot counts
plot(count, typ="l", xlab="Year",
ylab="Number of publications in PubMed")


Fitting Mixed-Effects Models Using the lme4 Package in R

Here is a nice presentation by Douglas Bates called "Fitting Mixed-Effects Models Using the lme4 Package in R".

Miti o prireji mesa, mleka, jajc, ... s kloniranimi in transgenimi živalmi

Marko Dolinar piše na svojem blogu "Klonov še ne bo na evropskih krožnikih". Ob številnih debatah o gensko spremenjenih organizmih, ki se v poljedeljstvu že uporabljajo, se vsake toliko časa pojavijo tudi ideje, teze, mnenja, kritike, ... o uporabi kloniranih in transgenih živalih. Klonirane živali so "popolne" kopije izvorne živali, medtem ko imajo transgene živali vstavljen gen iz druge vrste. Veliko ljudi čuti do teh tehnologij odpor. Sam sem nekje vmes, ne vidim večjih zadržkov, a vrag je vedno v podrobnostih, ki jih (še) ne poznamo. Bi pa rad povdaril, da omenjene tehnologije le niso tako super s finančnega stališča. Drži, da so že klonirali živali, drži, da so različnim živalim že vnesli gene iz drugih vrst. A je na drugi strani tudi kopica praktičnih težav ter velika težava, ki se ji po domače lepo reče "računica". Vsi omenjeni postopki so za sedaj še zelo dragi, nezanesljivi in neučinkoviti. Menim, da še lep čas ne bomo imeli na mizi mleka, mesa, jajc, ... od kloniranih in transgenih živali. Blasco je slikovito prikazal, da bomo še lep čas uporabljali klasične metode izboljševanja populacij kmetijskih živali (=selekcija) in da so mnoge moderne biotehnološke metode neprimerne za praktično uporabo v živinoreji.



I like the ICOTS8 Logo.

The description of the logo says:

"The sum of three shifted distributions forms the shape of the three headed mountain Triglav (tri=three, glav=head), the highest mountain in Slovenia and a national pride and symbol of Slovenes. The solid colors correspond to the colors of the Slovenian flag, while the green background is the color of Ljubljana and its symbol: the green dragon.

The three distributions represent students, teachers and researchers, demonstrating that learning, teaching and discovering overlap and progress in harmony."

Encyclopedia of Genetics, Genomics, Proteomics, and Informatics

Springer published new book "Encyclopedia of Genetics, Genomics, Proteomics, and Informatics". Seems great, but way to expensive for me.

Systems biology model

It is quite some time since I have found a course website by Antonio Reverter on quantitative overview of gene expression. His slides are very nice and I guess that the course must be very informative as well as fun. I simply addore his systems biology model picture, which I am shamelesly reproducing here.


Selecting and using colors in graphs

It is some time now since I came across a paper "Escaping RGBland: Selecting Colors for Statistical Graphics" by Zeileis et al. I have finnaly read (well I only parsed it since I get lost in the details about the color systems etc.) it and now it is time to try the new functions they implemeneted in the R package vcd. This is not the first work about colors in R. There is plenty of resources about this topic, sometimes even to much. At the end one does not know which approach to follow. I certainly liked the RColorBrewer package (see some examples here as well as original ColorBrewer site) as well as the work of Gentleman et al. (for example see figure 8 or at R graph gallery).

Zeileis et al. report about three functions for creation of color palletes: qualitative palette for coding categorical information via rainbow_hcl(), sequential palette via sequential_hcl() , and diverging palette via diverge_hcl(). The last two are amied at numerical and ordinal variables. The R Graphical Manual site shows 177 figures (at the time of writing) for the vcd package. I must admit that examples from vcd package did not attract me so much as those from RColorBrewer package.

Some R code to start with palletes in vcd package:
## Install the package

## Load the package

## Check the vignette
vignette(topic="hcl-colors", package="vcd")
## --> this will open a PDF file with the code and examples
Btw. there are some excellent resources about graphics (mainly in relation with R):


Cheat Sheets for Web Developers

Jacob bundled a great list of cheat sheets for Web developers. Very great resource for quick lookup for HTML, XHTML, CSS, JavaScript, mootools, jQuery, RGS colours, SEO, and WordPress.

"Giant" bull and ram

Full story ...

Full story ...


Symmetric sparse matrices with Matrix R package II

I have written about the size of created sparse matrices with Matrix R package here. After discussion with the developers of the package, I realized that one should not create sparse matrices in a way presented here i.e. setting up the matrix and then filling it is NOT OK. Instead one should either start with sparse design matrices and create the crossprods e.g.

> library(Matrix)
> tmp <- factor(rep(letters[1:3], each=4))
[1] a a a a b b b b c c c c
Levels: a b c
> (x <- as(tmp, "sparseMatrix"))
3 x 12 sparse Matrix of class "dgCMatrix"
a 1 1 1 1 . . . . . . . .
b . . . . 1 1 1 1 . . . .
c . . . . . . . . 1 1 1 1
> tcrossprod(x)
3 x 3 sparse Matrix of class "dsCMatrix"
a b c
a 4 . .
b . 4 .
c . . 4

or directly via the "triplets":

i <- as.integer(c(1, 2, 3))
j <- as.integer(c(1, 2, 3))
x <- c(4, 4, 4)
new("dsTMatrix", Dim=c(3L, 3L), i=(i - 1L), j=(j - 1L), x=x)

The above code should of course be more "automagical" but shows the point. Additionally, creating a diagonal matrix is of course easy (there is a special class for them in Matrix), but the above code works in general setting.