2012-12-28

Functions dim, nrow, and ncol for shell

Ever wanted to quickly find out the number of rows and/or columns in a file directly from the terminal? There are many ways to skin this cat. Here is what I used for the number of rows for quite a while:

wc -l filename

What about the number of columns? "Easy", just combine the head and awk commands:

head -n 1 filename | awk '{ print NF }'

Not a big problem (and there are likely better ways to do this), but it is long and tedious to type.

I got sick of typing the commands above and assembled them into three easy-to-use functions with R-like names: nrow, ncol, and dim. The functions are simply a collection of the above ideas and assume that the file is of "rectangular shape", i.e., a table, a matrix, etc.


dim ()
{
  # Print "<rows> <columns> <file>" for each file
  for FILE in "$@"; do
    NROW=$(nrow "$FILE" | awk '{ print $1 }')
    NCOL=$(ncol "$FILE" | awk '{ print $1 }')
    echo "$NROW $NCOL $FILE"
    unset NROW NCOL
  done
}
export -f dim

nrow ()
{
  # Print "<rows> <file>" for each file
  for FILE in "$@"; do
    wc -l "$FILE"
  done
}
export -f nrow

ncol ()
{
  # Print "<columns> <file>" for each file (fields in the first line, as counted by awk)
  for FILE in "$@"; do
    TMP=$(head -n 1 "$FILE" | awk '{ print NF }')
    echo "$TMP $FILE"
    unset TMP
  done
}
export -f ncol

Add these functions to your .bashrc, .profile, or something similar and you can then simply type:

nrow filename
ncol filename
dim filename

A simple test:

touch file.txt
echo "line1 with four columns" >> file.txt
echo "line2 with four columns" >> file.txt

nrow file.txt
2 file.txt

ncol file.txt
4 file.txt

dim file.txt
2 4 file.txt

2012-12-04

Recover deleted files under Linux

Here is a nice summary, but the bottom line is that the Ubuntu/Debian package testdisk includes a nice utility called photorec that can be used to search for and recover deleted files. It is not a GUI program, though!
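A minimal reminder of the workflow (just a sketch - the device name below is only an example, and photorec itself is interactive, letting you pick the partition, the file types, and a destination directory for the recovered files):

sudo apt-get install testdisk   # photorec ships with the testdisk package
sudo photorec /dev/sdb          # then follow the interactive menus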

2012-09-30

Whole-genome evaluation of complex traits using SNP, haplotype, or QTL information

This is the presentation (and the abstract below) from my talk at the local congress Genetika 2012.
Whole-genome evaluation of complex traits using SNP, haplotype, or QTL information

Abstract:

Whole-genome technologies provide rich data for the dissection of complex traits. While gene discovery is still largely limited, the data at hand can be successfully used for the evaluation of genetic merit. The aim of this work was to demonstrate, via simulation, the value of different sources of information (pedigrees, Single Nucleotide Polymorphisms – SNP, haplotypes, or Quantitative Trait Loci – QTL) for the genetic evaluation of non-phenotyped individuals in a typical animal breeding scenario. In the first step a coalescent simulation was used to create a base population with structured chromosomes, which were in the second step dropped and recombined through a pedigree of 10 generations with 50 sires per generation, 10 dams per sire, and 2 offspring per dam. Phenotypic values were simulated with different genetic architectures (QTL effects were sampled from a Gaussian or gamma distribution, with QTL having minor allele frequency below 0.3) and a heritability of 0.25. Genotypic data were available for all individuals from generation 4 onwards, while phenotypic data were available for individuals in generations 4 and 5. Genetic evaluation was based on linear mixed models with a relationship matrix between individuals. This matrix was built using pedigree, SNP, haplotype, or QTL data. Haplotypes of different lengths were considered (from 5 all the way up to 2000 SNP), with an option to account for similarities between haplotypes while building the relationship matrices. The accuracy of the different methods was assessed by the correlation between true and estimated additive genetic values for individuals in generations 6, 8, and 10. Average accuracy over ten replications for the Gaussian trait ranged over the generations from 0.45 to 0.10 with pedigree data, from 0.50 to 0.35 with SNP and haplotype data, and from 0.6 to 0.4 with QTL data. With long haplotypes the accuracies dropped considerably, but accounting for similarities between haplotypes prevented this drop. For the gamma trait the accuracies were slightly higher in generation 6 and dropped faster in the later generations with pedigree, SNP, and haplotype data due to recombination. On the other hand, accuracies were substantially higher with QTL data and quite stable over the generations (from 0.75 to 0.65), though still far from perfect (even though the QTL genotypes are known) due to estimation errors. The results demonstrate the value and limitations of genotypic information for the evaluation of additive genetic merit in animal populations.
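For illustration only (this is a generic sketch, not the code behind the talk), a SNP-based relationship matrix of the kind mentioned above can be built in R along the lines of VanRaden (2008), here with a toy genotype matrix:

## Toy genotypes: n individuals x m SNP, coded 0/1/2
set.seed(100)
n <- 10
m <- 200
M <- matrix(rbinom(n * m, size=2, prob=0.5), nrow=n, ncol=m)
p <- colMeans(M) / 2                        ## allele frequencies
W <- scale(M, center=2 * p, scale=FALSE)    ## centre each SNP column by 2p
G <- tcrossprod(W) / (2 * sum(p * (1 - p))) ## SNP-based relationship matrix
round(G[1:5, 1:5], digits=2)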

2012-09-29

Software Tools for Animal Gene Mapping

Here are some cool tools for animal genetics:

  • AGDP for the analysis of genome differences between populations
  • SNPEVG graphical tool for SNP effect viewing and graphing
  • ...



2012-09-23

Google drive drives in

I just realized that Google offers me "Google Drive". The pricing options for more than 5 GB of disk space are very competitive with Dropbox! I am using Dropbox for all my files - yes, it costs me quite some money when the bill comes, but considering that it is a "few" dollars per month for backup and synchronization, I am more than willing to pay that amount!!! Google Drive is now yet another option, but the desktop (sync folder) client does not work on Linux, so for now I will stick with Dropbox.

2012-08-11

“Livestock Conservation Genomics: Data, Tools and Trends” Summer School, Croatia, Oct 1-7, 2012

There will be an ESF GENOMIC-RESOURCES (link1, link2) Summer School “Livestock Conservation Genomics: Data, Tools and Trends” in Croatia, Oct 1-7, 2012. You can find more details here.

2012-08-09

Nice SNP tools from the Pevsner laboratory

SNPduo - compare SNPs from two individuals
SNPtrio - compare SNPs from a family trio - very cool plots and nice diagnostics
kcoeff - estimate K0, K1, and K2 coefficients from SNP data

2012-08-08

EU FP7 Livestock Genomics projects

The Quantomics group had a meeting in September 2011 and published a very nice set of presentations on EU FP7 Livestock Genomics projects.

2012-04-21

Some more study video material


2012-04-17

Simulated data for genomic selection and GWAS using a combination of coalescent and gene drop methods

Our (Hickey and Gorjanc) paper describing simulation using a combination of coalescent and gene drop methods has been published in the new open-access journal G3. This contribution is part of the Genomic Selection collection of the Genetics Society of America, which is nicely described here.

2012-04-15

Chalk letter generator

Here is a cool online application that takes the chalkboard scene with Bart Simpson and adds your text to it.

2012-04-08

Small pedigree based mixed model example

Pedigree-based mixed models (often called animal models, because they model animal performance) are the cornerstone of animal breeding and quantitative genetics. There are many programs that can be used for analyzing your data with these models (e.g., ASReml, BLUPF90, MATVEC, MiXBLUP & MiX99, Survival Kit, PEST/VCE, WOMBAT, ...). There are also R packages you can use: pedigreemm and MCMCglmm. If you want to write your own program, you can take the example code below and start from it. The code shows the essence of building the system of equations that needs to be solved, on a simple example. Note that this is meant only for demonstration purposes and small-scale analyses. In addition, the variance components are assumed known here. To understand the model for this simple example a bit better, the graphical model view is shown first.
Graphical model view of simple pedigree based mixed model example
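For reference, the system of equations assembled in the code below is the standard set of mixed model equations (Henderson's MME); in the notation of the code:

| X'X   X'Z              | | b |   | X'y |
| Z'X   Z'Z + AInv*alpha | | a | = | Z'y |

where X and Z are the design matrices for the "fixed" (group) and "random" (individual) effects, AInv is the inverse of the pedigree-based additive relationship matrix, and alpha = sigma2e / sigma2a is the variance ratio.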


The code:


options(width=200)
 
### --- Required packages ---
 
## install.packages(pkg=c("pedigreemm", "MatrixModels"))
 
library(package="pedigreemm")   ## pedigree functions
library(package="MatrixModels") ## sparse matrices
 
### --- Data ---
 
## NOTE:
## - some individuals have one or both parents (un)known
## - some individuals have phenotype (un)known
## - some individuals have repeated phenotype observations
example <- data.frame(
  individual=c( 1,   2,   2,   3,   4,   5,   6,   7,   8,   9,  10),
      father=c(NA,  NA,  NA,   2,   2,   4,   2,   5,   5,  NA,   8),
      mother=c(NA,  NA,  NA,   1,  NA,   3,   3,   6,   6,  NA,   9),
   phenotype=c(NA, 103, 106,  98, 101, 106,  93,  NA,  NA,  NA, 109),
       group=c(NA,   1,   1,   1,   2,   2,   2,  NA,  NA,  NA,   1))
 
## Variance components
sigma2e <- 1
sigma2a <- 3
(h2 <- sigma2a / (sigma2a + sigma2e))
 
### --- Setup data ---
 
## Make sure each individual has only one record in pedigree
ped <- example[!duplicated(example$individual), 1:3]
 
## Factors (this eases building the design matrix considerably)
example$individual <- factor(example$individual)
example$group      <- factor(example$group)
 
## Phenotype data
dat <- example[!is.na(example$phenotype), ]
 
### --- Setup MME ---
 
## Phenotype vector
(y <- dat$phenotype)
 
## Design matrix for the "fixed" effects (group)
(X <- model.Matrix( ~ group,          data=dat, sparse=TRUE))
 
## Design matrix for the "random" effects (individual)
(Z <- model.Matrix(~ individual - 1, data=dat, sparse=TRUE))
 
## Inverse additive relationship matrix
ped2 <- pedigree(sire=ped$father, dam=ped$mother, label=ped$individual)
TInv <- as(ped2, "sparseMatrix")
DInv <- Diagonal(x=1/Dmat(ped2))
AInv <- crossprod(sqrt(DInv) %*% TInv)
 
## Variance ratio
alpha <- sigma2e / sigma2a
 
## Mixed Model Equations (MME)
 
## ... Left-Hand Side (LHS) without pedigree prior
(LHS0 <- rBind(cBind(crossprod(X),    crossprod(X, Z)),
               cBind(crossprod(Z, X), crossprod(Z, Z))))
 
## ... Left-Hand Side (LHS) with    pedigree prior
round(
 LHS <- rBind(cBind(crossprod(X),    crossprod(X, Z)),
              cBind(crossprod(Z, X), crossprod(Z, Z) + AInv * alpha)), digits=1)
 
## ... Right-Hand Side (RHS)
(RHS <- rBind(crossprod(X, y),
              crossprod(Z, y)))
 
### --- Solutions ---
 
## Solve
LHSInv <- solve(LHS)
sol <- LHSInv %*% RHS
 
## Prediction error variance (PEV); the standard error would be sqrt(PEV)
se <- diag(LHSInv) * sigma2e
 
## Reliabilities (note that this simple formula ignores inbreeding,
## i.e., it assumes that the diagonal elements of A equal 1)
r2 <- 1 - diag(LHSInv) * alpha
 
## Accuracies (inbred individuals can get a negative reliability above,
## hence the NaN warning in the transcript below)
r <- sqrt(r2)
 
## Printout
cBind(sol, se, r, r2)

And the transcript:

R version 2.14.2 (2012-02-29)
Copyright (C) 2012 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-pc-linux-gnu (64-bit)
 
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
 
  Natural language support but running in an English locale
 
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
 
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
 
[Previously saved workspace restored]
 
> 
> options(width=200)
> 
> ### --- Required packages ---
> 
> ## install.packages(pkg=c("pedigreemm", "MatrixModels"))
> 
> library(package="pedigreemm")   ## pedigree functions
Loading required package: lme4
Loading required package: Matrix
Loading required package: lattice
 
Attaching package: ‘Matrix’
 
The following object(s) are masked from ‘package:base’:
 
    det
 
 
Attaching package: ‘lme4’
 
The following object(s) are masked from ‘package:stats’:
 
    AIC, BIC
 
> library(package="MatrixModels") ## sparse matrices
> 
> ### --- Data ---
> 
> ## NOTE:
> ## - some individuals have one or both parents (un)known
> ## - some individuals have phenotype (un)known
> ## - some indididuals have repeated phenotype observations 
> example <- data.frame(
+   individual=c( 1,   2,   2,   3,   4,   5,   6,   7,   8,   9,  10),
+       father=c(NA,  NA,  NA,   2,   2,   4,   2,   5,   5,  NA,   8),
+       mother=c(NA,  NA,  NA,   1,  NA,   3,   3,   6,   6,  NA,   9),
+    phenotype=c(NA, 103, 106,  98, 101, 106,  93,  NA,  NA,  NA, 109),
+        group=c(NA,   1,   1,   1,   2,   2,   2,  NA,  NA,  NA,   1))
> 
> ## Variance components
> sigma2e <- 1
> sigma2a <- 3
> (h2 <- sigma2a / (sigma2a + sigma2e))
[1] 0.75
> 
> ### --- Setup data ---
> 
> ## Make sure each individual has only one record in pedigree
> ped <- example[!duplicated(example$individual), 1:3]
> 
> ## Factors (this eases buliding the design matrix considerably)
> example$individual <- factor(example$individual)
> example$group      <- factor(example$group)
> 
> ## Phenotype data
> dat <- example[!is.na(example$phenotype), ]
> 
> ### --- Setup MME ---
> 
> ## Phenotype vector
> (y <- dat$phenotype)
[1] 103 106  98 101 106  93 109
> 
> ## Design matrix for the "fixed" effects (group)
> (X <- model.Matrix( ~ group,          data=dat, sparse=TRUE))
"dsparseModelMatrix": 7 x 2 sparse Matrix of class "dgCMatrix"
   (Intercept) group2
2            1      .
3            1      .
4            1      .
5            1      1
6            1      1
7            1      1
11           1      .
@ assign:  0 1 
@ contrasts:
$group
[1] "contr.treatment"
 
> 
> ## Design matrix for the "random" effects (individual)
> (Z <- model.Matrix(~ individual - 1, data=dat, sparse=TRUE))
"dsparseModelMatrix": 7 x 10 sparse Matrix of class "dgCMatrix"
   [[ suppressing 10 column names ‘individual1’, ‘individual2’, ‘individual3’ ... ]]
 
2  . 1 . . . . . . . .
3  . 1 . . . . . . . .
4  . . 1 . . . . . . .
5  . . . 1 . . . . . .
6  . . . . 1 . . . . .
7  . . . . . 1 . . . .
11 . . . . . . . . . 1
@ assign:  1 1 1 1 1 1 1 1 1 1 
@ contrasts:
$individual
[1] "contr.treatment"
 
> 
> ## Inverse additive relationship matrix
> ped2 <- pedigree(sire=ped$father, dam=ped$mother, label=ped$individual)
> TInv <- as(ped2, "sparseMatrix")
> DInv <- Diagonal(x=1/Dmat(ped2))
> AInv <- crossprod(sqrt(DInv) %*% TInv)
> 
> ## Variance ratio
> alpha <- sigma2e / sigma2a
> 
> ## Mixed Model Equations (MME)
> 
> ## ... Left-Hand Side (LHS) without pedigree prior
> (LHS0 <- rBind(cBind(crossprod(X),    crossprod(X, Z)),
+                cBind(crossprod(Z, X), crossprod(Z, Z))))
12 x 12 sparse Matrix of class "dgCMatrix"
   [[ suppressing 12 column names ‘(Intercept)’, ‘group2’, ‘individual1’ ... ]]
 
(Intercept)  7 3 . 2 1 1 1 1 . . . 1
group2       3 3 . . . 1 1 1 . . . .
individual1  . . . . . . . . . . . .
individual2  2 . . 2 . . . . . . . .
individual3  1 . . . 1 . . . . . . .
individual4  1 1 . . . 1 . . . . . .
individual5  1 1 . . . . 1 . . . . .
individual6  1 1 . . . . . 1 . . . .
individual7  . . . . . . . . . . . .
individual8  . . . . . . . . . . . .
individual9  . . . . . . . . . . . .
individual10 1 . . . . . . . . . . 1
> 
> ## ... Left-Hand Side (LHS) with    pedigree prior
> round(
+  LHS <- rBind(cBind(crossprod(X),    crossprod(X, Z)),
+               cBind(crossprod(Z, X), crossprod(Z, Z) + AInv * alpha)), digits=1)
12 x 12 sparse Matrix of class "dgCMatrix"
   [[ suppressing 12 column names ‘(Intercept)’, ‘group2’, ‘individual1’ ... ]]
 
(Intercept)  7 3  .    2.0  1.0  1.0  1.0  1.0  .    .    .    1.0
group2       3 3  .    .    .    1.0  1.0  1.0  .    .    .    .  
individual1  . .  0.5  0.2 -0.3  .    .    .    .    .    .    .  
individual2  2 .  0.2  2.8 -0.2 -0.2  .   -0.3  .    .    .    .  
individual3  1 . -0.3 -0.2  2.0  0.2 -0.3 -0.3  .    .    .    .  
individual4  1 1  .   -0.2  0.2  1.6 -0.3  .    .    .    .    .  
individual5  1 1  .    .   -0.3 -0.3  2.1  0.4 -0.4 -0.4  .    .  
individual6  1 1  .   -0.3 -0.3  .    0.4  2.1 -0.4 -0.4  .    .  
individual7  . .  .    .    .    .   -0.4 -0.4  0.8  .    .    .  
individual8  . .  .    .    .    .   -0.4 -0.4  .    1.0  0.2 -0.4
individual9  . .  .    .    .    .    .    .    .    0.2  0.5 -0.4
individual10 1 .  .    .    .    .    .    .    .   -0.4 -0.4  1.8
> 
> ## ... Right-Hand Side (RHS)
> (RHS <- rBind(crossprod(X, y),
+               crossprod(Z, y)))
12 x 1 Matrix of class "dgeMatrix"
             [,1]
(Intercept)   716
group2        300
individual1     0
individual2   209
individual3    98
individual4   101
individual5   106
individual6    93
individual7     0
individual8     0
individual9     0
individual10  109
> 
> ### --- Solutions ---
> 
> ## Solve
> LHSInv <- solve(LHS)
> sol <- LHSInv %*% RHS
> 
> ## Standard errors
> se <- diag(LHSInv) * sigma2e
> 
> ## Reliabilities
> r2 <- 1 - diag(LHSInv) * alpha
> 
> ## Accuracies
> r <- sqrt(r2)
Warning message:
In sqrt(r2) : NaNs produced
> 
> ## Printout
> cBind(sol, se, r, r2)
12 x 4 Matrix of class "dgeMatrix"
                         se         r         r2
 [1,] 104.76444809 1.901479 0.6051228  0.3661736
 [2,]  -4.65061921 1.376348 0.7356748  0.5412175
 [3,]  -2.62685846 2.432448 0.4349528  0.1891839
 [4,]  -0.77797260 1.983301 0.5821509  0.3388997
 [5,]  -4.32927399 1.979138 0.5833415  0.3402874
 [6,]   1.54997883 2.316720 0.4772421  0.2277600
 [7,]   3.18706240 2.604540 0.3630701  0.1318199
 [8,]  -5.07852788 2.508673 0.4046922  0.1637758
 [9,]  -0.94573274 3.459904       NaN -0.1533013
[10,]  -0.08765651 3.534636       NaN -0.1782121
[11,]   2.11218764 2.511203 0.4036487  0.1629323
[12,]   2.82742681 2.009539 0.5745901  0.3301538
> 
> proc.time()
   user  system elapsed 
  3.350   0.070   3.425 

2012-04-04

Regression - covariate adjustment

Linear regression is one of the key concepts in statistics [wikipedia1, wikipedia2]. However, people often confuse the meaning of the parameters of linear regression - the intercept tells us the average value of y at x=0, while the slope tells us how much change in y we can expect on average when we change x by one unit - exactly the same as in a linear function, though we work with averages here due to the noise. For example, with y = 4 + 0.3*x the average of y at x=0 is 4, and each unit increase in x increases y by 0.3 on average.

Today a colleague got confused about the meaning of adjusting (centering) the covariate (the x variable) and its effect on the parameter estimates. By shifting the x scale we also shift the point at which the intercept is estimated, while the slope stays the same: y = a + b*x = (a + 60*b) + b*(x - 60), so after subtracting 60 from x the intercept estimates the average y at x = 60 rather than at x = 0. I made the following graph to demonstrate this point in the case of a nested regression of y on x within a group factor having two levels. The R code to produce the plots is shown at the bottom.
Regression - covariate adjustment
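Before the full example, a minimal sketch of the same point with a single group (the numbers here are made up for illustration and are not the data plotted above) - centering x changes only the intercept, not the slope:

set.seed(1)
x <- rnorm(n=100, mean=60, sd=5)
y <- 4 + 0.3 * x + rnorm(n=100, sd=1)
coef(lm(y ~ x))           ## intercept estimates the average y at x = 0
coef(lm(y ~ I(x - 60)))   ## intercept now estimates the average y at x = 60; the slope is unchanged

The full code for the figure above: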
library(package="MASS")
 
x1 <- mvrnorm(n=100, mu=50, Sigma=20, empirical=TRUE)
x2 <- mvrnorm(n=100, mu=70, Sigma=20, empirical=TRUE)
 
mu1 <- mu2 <- 4
b1 <- 0.300
b2 <- 0.250
 
y1 <- mu1 + b1 * x1 + rnorm(n=100, sd=1) 
y2 <- mu2 + b2 * x2 + rnorm(n=100, sd=1) 
 
x <- c(x1, x2)
xK <- x - 60
y <- c(y1, y2)
g <- factor(rep(c(1, 2), each=100))
 
par(mfrow=c(2, 1), pty="m", bty="l")
 
(fit1n <- lm(y ~ g + x + x:g))
## (Intercept)           g2            x         g2:x  
##     3.06785      2.32448      0.31967     -0.09077  
beta <- coef(fit1n)
 
plot(y ~ x, col=c("blue", "red")[g], ylim=c(0, max(y)), xlim=c(0, max(x)), pch=19, cex=0.25)
points(x=mean(x1), y=mean(y1), pch=19)
points(x=mean(x2), y=mean(y2), pch=19)
abline(v=c(mean(x1), mean(x2)), lty=2, col="gray")
abline(h=c(mean(y1), mean(y2)), lty=2, col="gray")
 
points(x=0, y=beta["(Intercept)"],              pch=19, col="blue")
points(x=0, y=beta["(Intercept)"] + beta["g2"], pch=19, col="red")
 
z <- 0:max(x)
lines(y= beta["(Intercept)"]               +  beta["x"] * z                , x=z, col="blue")
lines(y=(beta["(Intercept)"] + beta["g2"]) + (beta["x"] + beta["g2:x"]) * z, x=z, col="red")
 
(fit2n <- lm(y ~ g + xK + xK:g))
## (Intercept)           g2           xK        g2:xK  
##    22.24824     -3.12153      0.31967     -0.09077
beta <- coef(fit2n)
 
plot(y ~ x, col=c("blue", "red")[g], ylim=c(0, max(y)), xlim=c(0, max(x)), pch=19, cex=0.25)
points(x=mean(x1), y=mean(y1), pch=19)
points(x=mean(x2), y=mean(y2), pch=19)
abline(v=c(mean(x1), mean(x2)), lty=2, col="gray")
abline(h=c(mean(y1), mean(y2)), lty=2, col="gray")
 
abline(v=60, lty=2, col="gray")
 
points(x=60, y=beta["(Intercept)"],              pch=19, col="blue")
points(x=60, y=beta["(Intercept)"] + beta["g2"], pch=19, col="red")
 
z <- 0:max(x) - 60
lines(y= beta["(Intercept)"]               +  beta["xK"] * z                 , x=z + 60, col="blue")
lines(y=(beta["(Intercept)"] + beta["g2"]) + (beta["xK"] + beta["g2:xK"]) * z, x=z + 60, col="red")