R graphics: margins are way to large

For me R has a very nice and powerfull capabilities for graphics (for example see this gallery). However, I dislike the default setting for margins and placement of axis numbers and labels. Since I always forget the setting of parameters I prefer I am adding this post. For example:
Sigma <- matrix(c(10, 10, 10, 20), nrow=2)
mu <- c(100, 100)
tmp <- mvrnorm(n=200, mu=mu, Sigma=Sigma, empirical=TRUE)
plot(tmp, xlab="X variable (unit)", ylab="Y variable (unit)")
Now compare this plot to the version I prefer much more:
par(bty="l", pty="m", mar=c(3, 3, 1, 1), mgp=c(1.75, 0.75, 0))
plot(tmp, xlab="X variable (unit)", ylab="Y variable (unit)")

Illinois long-term selection experiment for oil and protein in corn

Researchers at the University of Illinois are conducting one of the longest experiments in biology - Illinois long-term selection experiment for oil and protein in corn. The experiment started in 1896 and is still active! In esence they are selecting lines for higher or lower concentration of protein or oil in the kernel. This experiment is very important for a test of the theory of genetics, especially quantitative genetics (link1, link2, link3, link4, link5). I have seen several times the trends from this experiment and I wanted to include them in a talk I am prepairing. A brief search on the web lead me to this site with generation means by line. Bingo! This is all I needed. Bellow is a graph of trends and at the end of the post the R code used to produce the plot.

The theory states that genetic variance and consequently also the genetic gain should diminish after several generations of selection. There are experiments that confirmed that, but in the Illinois corn experiments the limit is not yet reached. Crow (2008) propose the following reasons (verbatim copy!):
  1. "The environment is continually changing so that what was formerly most fit no longer
  2. "There is an input of genetic variance from mutation, and sometimes from migration."
  3. "As intermediate-frequency alleles increase in frequency towards one, producing less variance (as p → 1, p(1 − p) → 0), others that were originally near zero become more common and increase the variance. Thus, a roughly constant variance is maintained."
  4. "There is always selection for fitness and for characters closely related to it."
First point is a bit to general, but it sure is relevant. The second point is well known and an important source of new variation (e.g. see this work in mice for some estimates of mutational variance). I am very glad I came across this paper by Crow, because I never thought about the issue that he raises in third point. To me this is very simple and obvious explanation for maintenance of genetic variance over a relative short period with selection in action. I can not say much about fourth point, but this surely is relevant, especially in animals, where inbreeding (a consequence of selection) has greater effect on fitness than in plants.

Now the R code. First I tried to use read.table(file=url(...)), but the data-file had an error - there was a typo on line 68 or 86 - I do not remember anymore. I downloaded the file, fixed the typo and used the following code:

podatki <- read.table(file="corn.txt", na.strings=".", header=TRUE)
cols <- c(rgb(red=204, blue=0, green=0, max=255),
rgb(red=0, blue=153, green=0, max=255),
rgb(red=0, blue=0, green=204, max=255),
rgb(red=204, blue=0, green=153, max=255))
par(bty="l", pty="m", mar=c(5, 4, 1, 1))
matplot(x=podatki$YR, y=podatki[, c("IHP", "ILP", "IHO", "ILO")], type="l", lty=1,
xlab="Year", ylab="Concentration (%)", col=cols[c(1, 1, 2, 2)], lwd=2)
legend("topleft", c("Protein", "Fat"), lty=1,
lwd=2, col=cols[c(1, 2)], bty="n")



ATLAS (link1, link2) is a Java based tool to manage genotypes. It is handy, but notoriouslly frustrating to install, since there are several different instructions about how to install it - I am not talking about different instructions at different places (say on the net), but in the distribution package. This is very very confusing. I have just spent more than half an hour to install it and I remember I had the same problem last time I tried to install it. The correct instructions are in the ATLAS.html file and this should be done:
  1. Create ATLAS directory in your user home directory, say "C:\Documents and Settings\GGorjan"
  2. Add the following files to this folder (Atlas.jar, atlas.ini, chr.atl, ATLASman_v1.4.pdf, ATLAS.html)
  3. Create directory ATLAS_files such as "C:\Documents and Settings\GGorjan\ATLAS\ATLAS_files" and put all figures and html files in there


Chromosome - DNA figure

National Human Genome Research Institute has a nice talking glossary of genetic terms. I especially this figure that shows a chromosome and how is it organized, i.e., DNA is wrapped with histones etc. However, there is only one chromosome and this is suboptimal if one wants to show the relationship between the genotype and DNA. Therefore I picked one version from the net (I do not remember where) and changed it a bit. I am publishing it here so that others might benefit as I did. There is plenty of place to annotate it.


R in SAS

Another "proof" that R definitely is one of mainstream statistical packages is the news that SAS will provide an interface to R via SAS/IML Studio (today known as SAS Stat Studio).


Fitting Legendre (orthogonal) polynomials in R

Frederick Novomestky packaged a series of orthogonal polynomials in the orthopolynom R package. However, his functions can not be used "directly" in a statistical model, say in lm(). There is no need to use functions from orthopolynom package, since there is a poly() function in stats package that is shipped with R. Nevertheless, I played with class of Legendre polynomials. (Wikipedia, Wolfram) This class of polynomials is very popular in my field since the introduction of so called random regression models (e.g. link1, link2, link3, ..., and some of our work), though the splines (Wikipedia, Wolfram) are making their way through (see this link and follow references). Let's go to orthopolynom package. Say we want to use Legendre polynomial of 4th order for a variable x. First we need to get the coefficients for basis functions:
(leg4coef <- legendre.polynomials(n=4, normalized=TRUE))


-0.7905694 + 2.371708*x^2

-2.806243*x + 4.677072*x^3

0.7954951 - 7.954951*x^2 + 9.280777*x^4
To use this polynomial in a model, we need to create a design matrix with sensible column names and without the intercept:
x <- 1:100
leg4 <- as.matrix(as.data.frame(polynomial.values(polynomials=leg4coef,
x=scaleX(x, u=-1, v=1))))
colnames(leg4) <- c("leg0", "leg1", "leg2", "leg3", "leg4")
leg4 <- leg4[, 2:ncol(leg4)]
Now we can use this in a model, e.g., lm(y ~ leg4). I made this whole process easier - with the functions bellow, we can simply use lm(y ~ Legendre(x=scaleX(x, u=-1, v=1), n=4)). I contacted Fred and I hope he will add some version of these functions to his package.

Update 2009-03-16: If we want that the intercept has a value of 1, we need to rescale the polynomial coefficients by multiplying with sqrt(2).
Legendre <- function(x, n, normalized=TRUE, intercept=FALSE, rescale=TRUE)
## Create a design matrix for Legendre polynomials
## x - numeric
## n - see orthopolynom
## normalized - logical, see orthopolynom
## intercept - logical, add intercept
tmp <- legendre.polynomials(n=n, normalized=normalized)
if(!intercept) tmp <- tmp[2:length(tmp)]
polynomial.values(polynomials=tmp, x=x, matrix=TRUE)

polynomial.values <- function(polynomials, x, matrix=FALSE)
## Changed copy of polynomial.vales from orthopolynom in order
## to add matrix argument
n <- length(polynomials)
if(!matrix) {
values <- vector(mode="list", length=n)
} else {
values <- matrix(ncol=n, nrow=length(x))
j <- 1
while(j <= n) {
if(!matrix) {
values[[j]] <- predict(polynomials[[j]], x)
} else {
values[, j] <- predict(polynomials[[j]], x)
j <- j + 1


Course: Selección genómica

Hayes and Gianola will have a course "Selección genómica" from 25-29 May in Valencia. See here for details.


Course: Study of resistance mechanisms in animal infectious diseases course

On behalf of Anne-Sophie Lequarré:

A new session of the well quoted course "Study of resistance mechanisms in animal infectious diseases" will be held one last time at the University of Liège, Belgium from the 16 to the 20 of March 2009 .

This one-week course gives a general introduction to the methods for the identification and exploitation of genetic factors in infectious diseases. It explores the interactions between the genetic diversity of the pathogen and the host in their particular environments. The aim is to integrate the insights of the various disciplines to get a better picture of the genetic basis of resistance to major infectious diseases in livestock.

The course can interest breeders in order to optimize the production in presence of harmful pathogens, biologists to better understand the causes and consequences of co-evolution of the host and its pathogens and of course the epidemiologists.

The course will be taught at a level commensurate with a Master of Science's degree but can also serve as an elective course for researchers wishing to better understand analyses found in the scientific literature. Speakers involved are not only renown scientists but also very good teachers.

The course is supported by EADGENE, a European Network of Excellence on Animal Disease Genomics. Registration prices are quite affordable (250 Euros requested for students or academic personnel). More information can be found at (registration dead-line 15th of February):
ourses/StudyofResistanceMechanisms/tabid/242/Default.aspx (please copy whole link into browser or go to http://www.eadgene.info, and follow links within the Training section).