2011-04-18

LaTeXsearch.com

Want to search over published equations? Try out LaTeXsearch.com. I wonder how useful this might be in practice?

2011-04-15

gRain: genetics example

gRain is an R package for "probability propagation in graphical independence networks, also known as probabilistic expert systems (which includes Bayesian networks as a special case)". This package caught my attention because some genetic models can be seen as graphical models, which is a broader class of models encompassing Bayesian networks or sometimes also called Directed Acyclic graphs (DAGs).

A very simple example of DAG in genetics is to draw a pedigree of a small family (say father, mother, and child) with three nodes representing their genotype (g01, g02, and g03). Nodes are logically connected, father to child (g01 --> g03) and mother to child (g02 --> g03). Assume we observe some phenotype condition in a child and if we postulate that this phenotype is affected by genotype, we draw another connection between genotype and phenotype of a child (g03 --> p03).
This kind of problems/models are often postulated and one might want to do some queries about the genotype of individuals given some data (observed phenotype and/or genotype). Can we do this in R? Yes, this can be neatly done with the gRain package.

Before digging into the code we need to pause to see how can we describe the above model? If you have read at least some parts of Wikipedia links provided above than it should be straightforward to understand the following factorization of this simple model: p(g01, g02, g03, p03) = p(g01) p(g02) p(g03|g01, g02) p(p03|g03). The terms p(g01) and p(g02) are often called prior distributions (probabilities) of genotypes in some population. Term p(g03|g01, g02) describes probability of genotype in child (g03) given that we know the genotype of parents (g01 and g02). This is actually very simple part and often referred as transmission probability If there are only two alleles (gene variants) in population (A and a) and father is AA and mother is AA then child must also be AA or p(g03|g01=AA, g02=AA) = (1, 0, 0), where number show probability for genotype AA, Aa, and aa. The last term p(p03|g03) is often called penetrance function and tells us the conditional probability of some phenotype given the genotype data. Let us assume that we are interested in some binary condition, which is under very very large influence of genotype, but expressed primarily in individuals with genotype aa. Bellow is R code for specifying such a model with some additional details skipped in the text.
## Set of genotype values
geno <- c("AA", "Aa", "aa")

## Prior allele frequencies in (founder) population
PrA <- 2/3
(Pra <- 1 - PrA)

## Prior genotype frequencies in (founder) population - according to Hardy-Weinberg's law
(gP <- c(PrA*PrA, 2*PrA*Pra, Pra*Pra))

## Genotype frequencies in children given the genotype of parents - according to Mendel's law
gM_AA_AA <- c( 1, 0, 0)
gM_AA_Aa <- c(1/2, 1/2, 0)
gM_AA_aa <- c( 0, 1, 0)

gM_Aa_AA <- c(1/2, 1/2, 0)
gM_Aa_Aa <- c(1/4, 1/2, 1/4)
gM_Aa_aa <- c( 0, 1/2, 1/2)

gM_aa_AA <- c( 0, 1, 0)
gM_aa_Aa <- c( 0, 1/2, 1/2)
gM_aa_aa <- c( 0, 0, 1)

gM <- c(gM_AA_AA, gM_AA_Aa, gM_AA_aa,
gM_Aa_AA, gM_Aa_Aa, gM_Aa_aa,
gM_aa_AA, gM_aa_Aa, gM_aa_aa)

## Set of phenotype values
phen <- c("OK", "NOK")

## Error rate (= 1 - penetrance)
e <- 0.01

## Phenotype frequencies in individual given the genotype of individual - penetrance
## OK NOK
p <- c(1-e, 0+e, ## AA
1-e, 0+e, ## Aa
0+e, 1-e) ## aa

## Graph - joint distribution
g01 <- cptable(~g01, values=gP, levels=geno)
g02 <- cptable(~g02, values=gP, levels=geno)
g03 <- cptable(~g03 + g01 + g02, values=gM, levels=geno)
p03 <- cptable(~p03 + g03, values=p, levels=phen)

## Compile model
plist <- compileCPT(list(g01, g02, g03, p03))
g <- grain(plist)
summary(g)

## Plot the model
iplot(g)

Now if we want to query our model without any observed data, but just using parameter values for this population we use the following code:

## Query model given parameter values (=prior)
querygrain(g, nodes=c("g01", "g02", "g03"))

which returns probabilities shown bellow. Not really interesting, but we should not expect much as we did not observe any data. Genotype probabilities are equal for all individuals and they are completely defined by population parameters.

$g01
g01
AA Aa aa
0.4444444 0.4444444 0.1111111

$g02
g02
AA Aa aa
0.4444444 0.4444444 0.1111111

$g03
g03
AA Aa aa
0.4444444 0.4444444 0.1111111

Now we can add some data and repeat the query again.

g <- setFinding(g, nodes="p03", states="NOK")
summary(g)
## Query model given parameter values (=prior) + some data
querygrain(g, nodes=c("g01", "g02", "g03"))

Results show probability of such finding (that phenotype condition was observed) given the defined model and parameters. Probability is not really high. Further on we get updated genotype probabilities and we can see the change due to introduction of phenotype information. Now we can claim with high confidence that child has genotype aa, while there is still quite some ambiguity about genotypes of parents as there are several possible combinations of parent genotypes that would give aa genotype in child (Aa x Aa, Aa x aa, aa x Aa, and aa x aa).

Independence network: Compiled: TRUE Propagated: TRUE
Nodes : chr [1:4] "g01" "g02" "g03" "p03"
Number of cliques: 2
Maximal clique size: 3
Maximal state space in cliques: 27
Finding:
variable state
[1,] p03 NOK
Pr(Finding)= 0.1188889

$g01
g01
AA Aa aa
0.03738318 0.64797508 0.31464174

$g02
g02
AA Aa aa
0.03738318 0.64797508 0.31464174

$g03
g03
AA Aa aa
0.03738318 0.03738318 0.92523364

This is all for the purpose of this blog post, but it should be clear that it is very easy to extend our model either with model individuals (just defined them with cptable and compile in joint model) or with other stuff (more alleles leads to more genotypes, other penetrance function, sex-linked genes, ...).

2011-04-01

Redish color of screen for evening or night work

While reading local computer magazine (Monitor) I find out about f.lux, which is program that changes the color of screen to more redish for evening or night work. They claim that this protects our eyes and sleep pattern. I like the idea. Linux support has some problems, but their is alternative that works with Linux - redshift (see here and here for installation).

Different view options for my blog

Blogger now offers several dynamic views for public blogs. Click on the following links to see them in action for my blog:
  • Flipcard (quite good overview of posts - reminds me of the memory game where cards are flipped to find a pair)
  • Mosaic (similar as Flipcard, but cards have different sizes)
  • Sidebar (more or less like standard view, but with bar on the left side)
  • Snapshot (shows only preview-snapshots of posts with images stored in Picasa and linked with Blogger)
  • Timeslide (a bit confused layout by time)
In my view these are cool views, but the help page is much nicer due to use of graphics. If you do not have much graphics in your posts you will not be that much impressed.