Gregor Gorjanc (gg): Size of marker effects

Genomic selection is the current hot topic in quantitative genetics. Here I would like to show, that there is a subtle issue with marker effects that it is good to be aware of when we talk about the size of marker (gene) effects. In a nutshell, today we can collect many (several thousands) genotypes for a low amount of money. We then try to estimate the effect of markers on the phenotypic values collected on a set of individuals. There is a bunch of important details in this steps, but I would like to discuss here the issue with the size of estimates of marker effects.

Say we are looking at a loci in diploids. In our population we know that only two different variants (alleles) of a gene is present at observed loci, say a and A. Probability of the allele a is p_a and the probability of the allele A is p_A. The possible genotypes in diploids are aa, aA, and AA. The book by Falconer & MacKay (see here and here) provides a nice description how we can estimate genetic effects from such data. I will follow their setup.

We assign arbitrary values to genotypes: aa has arbitrary value -a, aA has an arbitrary value d, while AA has an arbitrary value +a. These values might be defined such that they represent deviations from some origin, say the overall mean of the data. Then the average effect of allele a:

alpha_a = -p_A * alpha,

and the average effect of allele A is:

alpha_A = p_a * alpha,

where alpha is:

alpha = a + d * (p_a - p_A),

where a and d are arbitrary values, and p_a and p_A are allele frequencies. How does this relate to estimates of marker affects? When we get estimates of marker effects, they represent the effect of average allele substitution, which is alpha. As can be seen from the above setup, this value depends on the arbitrary values a and d and allele requencies. Therefore, when we get the estimates of marker effects, we can not just say that there are some markers that have big or small effects, since the estimates depend on allele frequencies. If we assume that there is no dominance, then the alpha = a and we can say something about the size of marker effects. If there is dominance, then we need to be more carefull. I do not have any "genomic" data so I can not check this on real dataset. I would like to see the plot of marker effect estimates (on y-axis) versus allele frequency (on x-axis) in the data as well as versus the value of d * (p_a - p_A). The effect of dominance on marker estimates should be the greatest when one of the alleles is rare and/or the effect of dominance itself is big. I am for now interested in tha case with the allele frequencies.

Perhaps, we can try with some simulations. I will use R for the simulation. I did a simple simulation to quickly see how it goes. At the end of the post is used R code. The results are two graphs, each graph has four subgraphs. The first set is from simulated data with heritability of 0.5 and the second one with heritability of 0.25. There is not much to see in these pictures. No clear trends at all. Well, I could have played a lot with this, especially with the dominance deviation. But it seems that this

Marker effects h2 0.50 - Upload a Document to Scribd

Marker effects h2 0.25 - Upload a Document to Scribd

## Install the DemoPopQuanGen package
tmp <- c("http://gregor.gorjanc.googlepages.com",           
         contrib.url(repos=getOption("repos"), type=getOption("pkgType"))) 
install.packages("DemoPopQuanGen", contriburl=tmp, dep=TRUE)  

## Function for simulation of genotypes and phenotypic values 
rvalue <- function(n, p, value, mean=0, sdE=0, h2=NULL, genotype=FALSE)
{   
  ## --- Setup ---    

  k <- length(p)
  q <- 1 - p
  y <- matrix(nrow=n, ncol=k)
  g <- matrix(nrow=n, ncol=k)

  ## --- Genotype frequencies under Hardy-Weinberg equilibrium ---

  P <- p * p
  Q <- q * q
  H <- 2 * p * q

  ## --- Marker effects ---    
  for (i in 1:k) {
    g[, i] <- rdiscrete(n=n, probs=c(P[i], H[i], Q[i]))
    y[, i] <- value[g[, i], i]
  }
  y <- rowSums(y)

  ## --- Residual variance based on heritability ---

  if(!is.null(h2)) {
    varG <- var(y)
    sdE <- sqrt(varG / h2 - varG)
  }

  ## --- Phenotype ---

  if(genotype) {
    y <- y - mean(y)
    ret <- cbind(y + rnorm(n=n, mean=mean, sd=sdE), g - 1)
    colnames(ret) <- c("y", paste("g", 1:k, sep=""))
  } else {
    ret <- y
  }
  ret
}

## Set number of phenotyped and genotyped individuals 
n <- 2000  
## Set number of loci 
nLoci <- 1000  

## Simulate loci properties, see ?rvalueSetup 
tmp <- rvalueSetup(nLoci=nLoci, p=runif(n=nLoci, min=0.05, max=0.95))  

## Simulate genotypes and phenotypes, see ?rvalue 
sim <- rvalue(n=n, p=tmp$p, value=tmp$value, mean=100, h2=0.5, genotype=TRUE)  

## Single marker regression 
b <- se <- p_value <- p2t <- numeric(length=nLoci) 
for(i in 1:nLoci) {
  fit <- lm(sim[, 1] ~ sim[, i+1])
  p2t[i] <- mean(sim[, i+1]) / 2
  b[i] <- coef(fit)[2]
  se[i] <- vcov(fit)[2, 2]
  p_value[i] <- summary(fit)$coefficients[2, 4]
}

a <- tmp$value[3, ]
d <- tmp$value[2, ]
p2 <- tmp$p
p1 <- 1 - p2  
p1t <- 1 - p2t

## Plots 
pdf("marker_effects.pdf", version="1.4") 
par(mfrow=c(2, 2), bty="l",
    mar=c(2.7, 3.0, 1, 1) + .1,
    mgp=c(1.6, 0.55, 0),
    pty="m")  

## Marker effect estimates vs true marker effect (a)
plot(b ~ a,
     xlab="Marker effect - a",
     ylab="Marker effect estimate",
     col=rgb(0, 0, 1, 0.2), pch=16)

## Marker effect estimates vs allele frequency
plot(b ~ p1t, xlim=c(0, 1),
     xlab="Allele frequency",
     ylab="Marker effect estimate", col=rgb(0, 0, 1, 0.2), pch=16)

## Marker effect estimates vs dominance deviation (d)
plot(b ~ d,
     xlab="Dominance deviation - d",
     ylab="Marker effect estimate", col=rgb(0, 0, 1, 0.2), pch=16)

## Marker effect estimates vs "allele frequency and dominance deviation" 
z <- d * (p2t - p1t)
plot(b ~ z,
     xlab="d * (p_a - p_A)",
     ylab="Marker effect estimate", col=rgb(0, 0, 1, 0.2), pch=16)
dev.off()

Gregor Gorjanc (gg)

2008-09-09

Size of marker effects

No comments: