2008-09-27

SAGE - open source mathematics software

"Sage is a free mathematics software system licensed under the GPL. It combines the power of many existing open-source packages into a common Python-based interface." The installation on my laptop with Ubuntu was very simple. I went to console/terminal and issued:
## Download
wget http://www.sagemath.org/bin/linux/32bit/sage-3.1.2-ubuntu32bit-i686-intel-i686-Linux.tar.gz
## Unpack and remove the tarball
tar xzvf sage-*.tar.gz
rm -f sage-*.tar.gz
## Move to /usr/local
sudo mv sage-* /usr/local/.
## Make link
sudo ln -s /usr/local/sage-*/sage /usr/local/bin/sage
The whole "distribution" is quite big, the tarball has 366 Mb, while the unpacked directory has 1.2 Gb.

Tag cloud in Blogger

Phydeaux3 blog has a very good solution for a tag cloud in Blogger.

sam2p: convert raster (bitmap) images to PDF, PostScript

sam2p does a great job of converting raster (bitmap) images to, say, PDF or (Encapsulated) PostScript format.

2008-09-19

SAS Macros for BUGS

Rodney Sparapani wrote some SAS macros for interfacing with BUGS. Here is a site with descriptions and links. See also "SAS now goes with Bayesian methods".

2008-09-16

Zoja on a horse

Excel and LaTeX?

If you want to use an Excel table in a LaTeX document, you can:
  • retype it in LaTeX syntax --> not recommended, especially if the table can change (a small example of the target LaTeX is shown after this list)
  • save it as CSV and use csv2latex
  • use the excel2latex add-on for Excel --> this is a very handy solution
  • open the Excel file in some other spreadsheet that can save as a LaTeX file, e.g., Gnumeric
  • a related solution is offered by Spreadsheet2latex
See also PC -> LaTeX Converter Quick Comparison List
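Just for reference, here is roughly what a small Excel table ends up as once it is retyped (or converted) into LaTeX; the column names and numbers are made up:

\begin{table}[ht]
  \centering
  \caption{Example table (hypothetical data)}
  \begin{tabular}{lrr}
    \hline
    Trait     & Mean & SD   \\
    \hline
    Milk (kg) & 7500 & 1200 \\
    Fat (\%)  & 4.1  & 0.4  \\
    \hline
  \end{tabular}
\end{table}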

2008-09-15

Statistical Experts site

I came across the Statistical Experts site. It offers a forum, blog, and wiki. I like the design. There has been quite some "traffic" on the forum and some posts on the blog. The wiki seems to be empty for the moment. I added the forum and blog to my RSS reader so that I can monitor the "traffic" in these two places.

Genetic counseling

While I was studying the calculation of genotype probabilities (see related posts) I always wondered what the situation with genetic counseling in humans is. Until now, I had only read in papers that genotype probabilities can also be used in genetic counseling, but I never really read anything about real applications. Yesterday, I came across a nice abstract, which is a bit old (from 1990), but neatly summarizes 1000 records on genetic counseling. Unfortunately, the paper is in French.

2008-09-14

Drawing pedigrees

I often work with data on relatives, where the relationship between individuals is inferred just from the pedigree. Until now I have not plotted any pedigrees, except simple ones, which I could easily draw with pen and paper. However, as the number of pedigree members increases, the complexity of drawing increases enormously.

When I first googled for the term "pedigree", I got a lot of useless hits, but of course I followed the link to Wikipedia, where I learned something new.
The word pedigree is a corruption of the French "pied de grue" or crane's foot, because the typical lines and split lines (each split leading to different offspring of the one parent line) resemble the thin leg and foot of a crane.
Refining the search revealed a lot of good hits. It seems that there is a lot of competition in this area. That is good. However, the programs have many different options for many different kinds of uses. Some areas are: genetic counselling, display of family trees, animal and plant breeding, ... Here are links to sites providing some more info about these programs: link1, link2, link3, link4, link5, link6, and link7.

I am particularly interested in animal breeding, where pedigrees are large and complex. Drawing such pedigrees seems to be a very complicated process (link1, link2, link3). The authors of Pedigraph claim that their program is good in this regard. They patented their innovation of summarizing the pedigree in order to gain some space for pedigrees with large half-sib families. The latter are quite common in animal breeding. PedNavigator is also described as a program that can handle large and complex pedigrees, but it is geared more towards exploratory, real-time usage, which is great, but not what I need - displaying the pedigree in one figure. There are very nice examples posted on the Pedigraph site. This is a nice showcase for the Pedigraph programme.

Exploring a bit further (especially via link1, link2, link3, and link4) I realised that there is a general framework for drawing pedigrees, which are DAGs (directed acyclic graphs): Graphviz. Reading the patent behind Pedigraph once again, I found that they also talk about the dot language used by the Graphviz software. Are they also using Graphviz for drawing the pedigrees? The display looks very similar. Pedigree Visualizer, OPediT, and PyPedal also use Graphviz for drawing pedigrees. I guess that the programs first optimize the pedigree layout in some way and then construct the dot file, which is passed to Graphviz to do the drawing. I guess I will use Pedigraph for large and complex pedigrees and Graphviz directly for some small ones.
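To get a feeling for the dot approach, here is a small sketch of my own (not taken from any of the programs above) that writes a dot file for a toy pedigree in R and leaves the drawing to Graphviz:

## Toy pedigree: individual, father, mother ("0" = unknown parent)
ped <- data.frame(id     = c("A", "B", "C", "D", "E"),
                  father = c("0", "0", "A", "A", "0"),
                  mother = c("0", "0", "B", "B", "C"),
                  stringsAsFactors = FALSE)

## Write a dot file - one edge from each known parent to the offspring
con <- file("pedigree.dot", "w")
cat("digraph pedigree {\n", file=con)
for(i in 1:nrow(ped)) {
  for(parent in c(ped$father[i], ped$mother[i])) {
    if(parent != "0")
      cat("  ", parent, " -> ", ped$id[i], ";\n", sep="", file=con)
  }
}
cat("}\n", file=con)
close(con)

## Then let Graphviz do the layout, e.g.: dot -Tpdf pedigree.dot -o pedigree.pdf

The dedicated programs of course do much more (collapsing half-sib families, reducing line crossings, mating nodes), but I guess the division of labour is the same: prepare a dot file, let Graphviz do the drawing.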

2008-09-10

SAS now goes with Bayesian methods

New versions of the SAS (9.2) procedures GENMOD, LIFEREG, and PHREG enable users to use the Bayesian approach for estimating model parameters. The computations are done with MCMC methods. There is also a new experimental procedure MCMC. See more here. I wonder how fast these procedures are, since SAS is known to be fast and very efficient with big datasets, but see also one of my previous posts on PROC MIXED vs lmer().

2008-09-09

Excel 2007 for statistics?

Microsoft Excel is probably one of the most used pieces of software for performing various "calculations". I agree that spreadsheets are very useful and for some tasks MS Excel just excels. However, MS Excel has problems when it comes to statistics. There have been quite a few reports about the inaccuracy of its statistical routines. For the 2007 version, McCullough and Heiser (2008) wrote this in the abstract:


Excel 2007, like its predecessors, fails a standard set of intermediate-level accuracy tests in three areas: statistical distributions, random number generation, and estimation. Additional errors in specific Excel procedures are discussed. Microsoft’s continuing inability to correctly fix errors is discussed. No statistical procedure in Excel should be used until Microsoft documents that the procedure is correct; it is not safe to assume that Microsoft Excel’s statistical procedures give the correct answer. Persons who wish to conduct statistical analyses should use some other package.


MS Excel is not the only spreadsheet software. Yalta (2008) compared MS Excel with some others and wrote this in the abstract:

We provide an assessment of the statistical distributions in Microsoft® Excel versions 97 through 2007 along with two competing spreadsheet programs, namely Gnumeric 1.7.11 and OpenOffice.org Calc 2.3.0. We find that the accuracy of various statistical functions in Excel 2007 range from unacceptably bad to acceptable but significantly inferior in comparison to alternative implementations. In particular, for the binomial, Poisson, inverse standard normal, inverse beta, inverse student’s t, and inverse F distributions, it is possible to obtain results with zero accurate digits as shown with numerical examples.
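If you want to check a spreadsheet yourself, R is a handy source of reference values for exactly the distributions listed above; a small sketch with arbitrary parameter values (R itself is then taken as the reference implementation):

## Reference values to compare against spreadsheet output
## (binomial, Poisson, inverse normal, inverse beta, inverse t, inverse F)
pbinom(q=2, size=10, prob=0.2)      ## P(X <= 2), Binomial(10, 0.2)
ppois(q=5, lambda=3)                ## P(X <= 5), Poisson(3)
qnorm(p=0.975)                      ## inverse standard normal
qbeta(p=0.95, shape1=2, shape2=5)   ## inverse beta
qt(p=0.975, df=10)                  ## inverse Student's t
qf(p=0.95, df1=3, df2=20)           ## inverse F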


McCullough (2008) also contributed this:

Microsoft attempted to implement the Wichmann–Hill RNG in Excel 2003 and failed; it did not just produce numbers between zero and unity, it would also produce negative numbers. Microsoft issued a patch that allegedly fixed the problem so that the patched Excel 2003 and Excel 2007 now implement the Wichmann–Hill RNG, as least according to Microsoft. We show that whatever RNG it is that Microsoft has implemented in these versions of Excel, it is not the Wichmann–Hill RNG. Microsoft has now failed twice to implement the dozen lines of code that define the Wichmann–Hill RNG.
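For reference, the "dozen lines of code" is not an exaggeration; the original Wichmann-Hill (1982) generator can be written, for example, like this (my own straightforward R transcription of the published algorithm):

## Wichmann-Hill (1982) uniform random number generator
## s1, s2, s3 are integer seeds in 1..30000
wichmannHill <- function(n, s1=123, s2=456, s3=789) {
  u <- numeric(n)
  for(i in 1:n) {
    s1 <- (171 * s1) %% 30269
    s2 <- (172 * s2) %% 30307
    s3 <- (170 * s3) %% 30323
    u[i] <- (s1/30269 + s2/30307 + s3/30323) %% 1
  }
  u
}

wichmannHill(5)

By construction the output stays between zero and one, unlike what the unpatched Excel 2003 produced.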


Additionally, the graphs made in MS Excel are not really good ones. It depends a lot on the user to make a good graph. I have seen a lot of so-called chartjunk coming from Excel. Of course the blame is on the users, but software should help us make things easier, not harder! Su (2008) wrote this in the abstract:

The purpose of default settings in a graphic tool is to make it easy to produce good graphics that accord with the principles of statistical graphics... If the defaults do not embody these principles, then the only way to produce good graphics is to be sufficiently familiar with the principles of statistical graphics. This paper shows that Excel graphics defaults do not embody the appropriate principles. Users who want to use Excel are advised to know the principles of good graphics well enough so that they can choose the appropriate options to override the defaults. Microsoft® should overhaul the Excel graphics engine so that its defaults embody the principles of statistical graphics and make it easy for non-experts to produce good graphs.


After all these anti-Microsoft abstracts I really, really start to wonder what we get for the money we spend on this software.

Size of marker effects

Genomic selection is the current hot topic in quantitative genetics. Here I would like to show that there is a subtle issue with marker effects that is good to be aware of when we talk about the size of marker (gene) effects. In a nutshell, today we can collect many (several thousand) genotypes for little money. We then try to estimate the effect of the markers on the phenotypic values collected on a set of individuals. There are many important details in these steps, but here I would like to discuss the size of the estimates of marker effects.


Say we are looking at one locus in diploids. In our population only two variants (alleles) of the gene are present at the observed locus, say a and A. The frequency of allele a is p_a and the frequency of allele A is p_A. The possible genotypes in diploids are aa, aA, and AA. The book by Falconer & MacKay (see here and here) gives a nice description of how we can estimate genetic effects from such data. I will follow their setup.


We assign arbitrary values to the genotypes: aa has the value -a, aA the value d, and AA the value +a. These values might be defined such that they represent deviations from some origin, say the overall mean of the data. Then the average effect of allele a is:


alpha_a = -p_A * alpha,

and the average effect of allele A is:

alpha_A = p_a * alpha,

where alpha is:

alpha = a + d * (p_a - p_A),

where a and d are the arbitrary values and p_a and p_A are the allele frequencies. How does this relate to estimates of marker effects? When we get estimates of marker effects, they represent the effect of an average allele substitution, which is alpha. As can be seen from the above setup, this value depends on the arbitrary values a and d and on the allele frequencies. Therefore, when we get the estimates of marker effects, we cannot simply say that some markers have big or small effects, since the estimates depend on the allele frequencies. If we assume that there is no dominance, then alpha = a and we can say something about the size of marker effects. If there is dominance, then we need to be more careful. I do not have any "genomic" data, so I cannot check this on a real dataset. I would like to see a plot of marker effect estimates (on the y-axis) versus allele frequency (on the x-axis), as well as versus the value of d * (p_a - p_A). The effect of dominance on marker estimates should be greatest when one of the alleles is rare and/or the dominance effect itself is big. For now I am interested in the case with the allele frequencies.
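A tiny numerical illustration of the point (my own numbers, just to show the dependence on allele frequency): with a fixed a and d, the same locus gives quite different average effects of allele substitution at different allele frequencies:

## alpha = a + d * (p_a - p_A): same locus, different allele frequencies
a <- 1                       ## (arbitrary) genotypic value of AA; aa has -a
d <- 0.5                     ## (arbitrary) dominance deviation of aA
p_A <- c(0.1, 0.5, 0.9)      ## frequency of allele A
p_a <- 1 - p_A               ## frequency of allele a
alpha <- a + d * (p_a - p_A) ## average effect of allele substitution
cbind(p_A, alpha)            ## gives 1.4, 1.0, and 0.6 - same a and d, different alpha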


Perhaps we can try some simulations. I will use R. I did a simple simulation to quickly see how it goes; the R code used is at the end of the post. The results are two graphs, each with four subgraphs. The first set is from data simulated with a heritability of 0.5 and the second with a heritability of 0.25. There is not much to see in these pictures - no clear trends at all. Well, I could have played a lot with this, especially with the dominance deviation. But it seems that this


[Figure (Scribd embed): Marker effects h2 0.50]

[Figure (Scribd embed): Marker effects h2 0.25]


## Install the DemoPopQuanGen package
tmp <- c("http://gregor.gorjanc.googlepages.com",
contrib.url(repos=getOption("repos"), type=getOption("pkgType")))
install.packages("DemoPopQuanGen", contriburl=tmp, dep=TRUE)

## Function for simulation of genotypes and phenotypic values
rvalue <- function(n, p, value, mean=0, sdE=0, h2=NULL, genotype=FALSE)
{
  ## --- Setup ---

  k <- length(p)
  q <- 1 - p
  y <- matrix(nrow=n, ncol=k)
  g <- matrix(nrow=n, ncol=k)

  ## --- Genotype frequencies under Hardy-Weinberg equilibrium ---

  P <- p * p
  Q <- q * q
  H <- 2 * p * q

  ## --- Marker effects ---
  for (i in 1:k) {
    g[, i] <- rdiscrete(n=n, probs=c(P[i], H[i], Q[i]))
    y[, i] <- value[g[, i], i]
  }
  y <- rowSums(y)

  ## --- Residual variance based on heritability ---

  if (!is.null(h2)) {
    varG <- var(y)
    sdE <- sqrt(varG / h2 - varG)
  }

  ## --- Phenotype ---

  if (genotype) {
    y <- y - mean(y)
    ret <- cbind(y + rnorm(n=n, mean=mean, sd=sdE), g - 1)
    colnames(ret) <- c("y", paste("g", 1:k, sep=""))
  } else {
    ret <- y
  }
  ret
}

## Set number of phenotyped and genotyped individuals
n <- 2000
## Set number of loci
nLoci <- 1000

## Simulate loci properties, see ?rvalueSetup
tmp <- rvalueSetup(nLoci=nLoci, p=runif(n=nLoci, min=0.05, max=0.95))

## Simulate genotypes and phenotypes, see ?rvalue
sim <- rvalue(n=n, p=tmp$p, value=tmp$value, mean=100, h2=0.5, genotype=TRUE)

## Single marker regression
b <- se <- p_value <- p2t <- numeric(length=nLoci)
for (i in 1:nLoci) {
  fit <- lm(sim[, 1] ~ sim[, i+1])
  p2t[i] <- mean(sim[, i+1]) / 2
  b[i] <- coef(fit)[2]
  se[i] <- sqrt(vcov(fit)[2, 2]) ## standard error = square root of the sampling variance
  p_value[i] <- summary(fit)$coefficients[2, 4]
}

a <- tmp$value[3, ]
d <- tmp$value[2, ]
p2 <- tmp$p
p1 <- 1 - p2
p1t <- 1 - p2t

## Plots
pdf("marker_effects.pdf", version="1.4")
par(mfrow=c(2, 2), bty="l",
    mar=c(2.7, 3.0, 1, 1) + .1,
    mgp=c(1.6, 0.55, 0),
    pty="m")

## Marker effect estimates vs true marker effect (a)
plot(b ~ a,
     xlab="Marker effect - a",
     ylab="Marker effect estimate",
     col=rgb(0, 0, 1, 0.2), pch=16)

## Marker effect estimates vs allele frequency
plot(b ~ p1t, xlim=c(0, 1),
     xlab="Allele frequency",
     ylab="Marker effect estimate",
     col=rgb(0, 0, 1, 0.2), pch=16)

## Marker effect estimates vs dominance deviation (d)
plot(b ~ d,
     xlab="Dominance deviation - d",
     ylab="Marker effect estimate",
     col=rgb(0, 0, 1, 0.2), pch=16)

## Marker effect estimates vs "allele frequency and dominance deviation"
z <- d * (p2t - p1t)
plot(b ~ z,
     xlab="d * (p_a - p_A)",
     ylab="Marker effect estimate",
     col=rgb(0, 0, 1, 0.2), pch=16)
dev.off()

Genomic selection

Last week I attended the course "Whole genome association and genomic selection" by Ben Hayes. The current course webpage does not (yet) provide the course materials (notes and slides), but there are materials from a previous course by Ben - look here.

In a "nutshell" the course was about usage of whole genome data in order to improve (via selection of best individuals) important traits in livestock. Today, we can collect a lot (several thousands), up to 60K for cattle for example) of genotypes with "chip technologies" for a price around 200-300$. After the cleaning of the data, a researcher would estimate effect of each genotype (just additive or maybe even the dominance effect) on the phenotype of individuals. Technically, when we get the genotype data, this data do not represent the exact (functional) genes, but just markers along the whole genome. When we estimate the effect of markers, we try to exploit the possible linkage between the markers and potential genes, which are to a large extent unknown to us. Essential issue with the success of this method is that there is enough linkage disequilibrium between markers and genes. We can achieve that with dense chips. Since there has been an intense selection and strong founder effects in populations of livestock, the re is more linkage disequilibrium and the density of chips in livestock can be lower than in humans. After the estimation of markers effects, we can calculate a genotypic value for any genotyped individual. The essence of this approach has been presented in the paper by Meuwissen et al. (2001). There has been a lot of research/work done since then! There are also quite some important technical details, but this really seems to be a new very hot topic for animal breeding and genetics.

During the course I had the opportunity to discuss with many people, and there seems to be great interest in genomic selection in dairy cattle. This is due to the availability of dense chips for cattle and the possibility of wide exploitation of semen via artificial insemination. New Zealand, the Netherlands, and the USA are countries that have already implemented genomic selection and are starting to sell the semen of young bulls. The big advantage of genomic selection in dairy cattle is the shortened generation interval. Up to now, potential bulls were preselected from the best cows, accounting for the breeding values of these cows and for inbreeding. Then the semen of these preselected bulls would be used to produce daughters, and the breeding values of the bulls would be estimated from the phenotypic records (milk yield) of those daughters. The calculation of breeding values accounts for all possible information from relatives, but the accuracy of a breeding value depends a lot on the number of close relatives that have phenotypes. A nice description of these techniques is in the book by Mrode. There are also other books, but the book by Mrode is a very good one to start with. The process from preselecting a bull to the calculation of his breeding values can last around five years.

Besides the shortened generation interval there is also some gain in the accuracy of the estimation of genetic values. Up to now animal breeders mainly exploited the so-called infinitesimal model, where we postulate that there is an infinite number of genes and that the effect of each gene is negligible. This model is clearly wrong, but has been and still is a very useful abstraction. Its use has enabled genetic improvement of many economically important traits in agriculture. Since the infinitesimal model is only an abstraction, geneticists were and still are trying hard to find genes that have an effect on various traits. There has been quite a lot of success in this work, especially for traits that are under the influence of only a few genes. However, for traits that are continuous in nature only a few genes have been found, even though the efforts are substantial. Genomic selection now provides another abstraction. It is again an approximation, but given the abundant genomic data, there are gains in comparison to the infinitesimal model.

With genomic selection animal breeders gain a lot of flexibility, and it seems that we will face changes in the organization of breeding programmes. Genomic selection is of course applicable also to other breeding programmes: beef cattle, sheep, goats, pigs, poultry, ...

2008-09-03

Using Beamer with LyX-Sweave

Dan Weeks (web1, web2) provided a solution on how to use Beamer with LyX & Sweave.

Make a literate-beamer layout file for LyX!

#% Do not delete the line below; configure depends on this
# \DeclareLaTeXClass[beamer, Sweave.sty]{beamer (beamer Sweave noweb)}
#
# This is a copy of literate-article.layout from LyX, but changed for
# Sweave - NoWeb syntax:
# - changed noweb.sty to Sweave.sty
# - moved preamble to literate-scrap.inc

Format 2
Input beamer.layout
Input literate-scrap.inc


LyX then has a new Document Class that allows one to use the "scrap" style, and that portion of the text is sent to Sweave properly. However, there is a problem related to the verbatim environment. Sweave makes heavy use of verbatim environments, and beamer doesn't like those: you need to declare that a slide contains verbatim material or you get errors. A reasonable solution is to add two ERT boxes to each slide that will contain "Sweave material". The ERT box at the beginning of the slide should contain

\begin{frame}[containsverbatim]
\frametitle{Slide title}


while the ERT box at the end of the slide should contain

\end{frame}
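Putting it together, such a slide in the exported Sweave (.Rnw) file would look roughly like this; the frame title and the code chunk are just a made-up example:

\begin{frame}[containsverbatim]
\frametitle{Slide title}

<<example, echo=TRUE>>=
x <- rnorm(100)
summary(x)
@

\end{frame}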