2013-12-01

Read a file line by line in R

Are you using R for data manipulation for later use with other programs, i.e., a workflow something like this:
  1. read data sets from a disk,
  2. modify the data, and
  3. write it back to a disk.
All fine, but if the data set is really big, you will soon stumble on memory issues. If the data processing is simple and you can work on chunks, say line by line, then the following might be useful:

## File
file <- "myfile.txt"
 
## Create connection
con <- file(description=file, open="r")
 
## Hopefully you know the number of lines from some other source; if not,
## count them with wc (system() returns character, hence as.integer)
com <- paste("wc -l ", file, " | awk '{ print $1 }'", sep="")
n <- as.integer(system(command=com, intern=TRUE))
 
## Loop over the file connection, one line at a time
## (scan reads numeric fields by default; use what=character() for text)
for(i in 1:n) {
  tmp <- scan(file=con, nlines=1, quiet=TRUE)
  ## do something on a line of data 
}
 
## Close the connection
close(con)
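If the number of lines is not known in advance, the loop can instead run until the connection is exhausted. A minimal sketch with readLines(), reading one line of raw text per iteration (myfile.txt is the same hypothetical file as above):

```r
## Read until EOF: readLines() on an open connection advances through
## the file and returns character(0) once no lines are left
con <- file("myfile.txt", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  ## do something with the single line of text in `line`
}
close(con)
```

This avoids the call out to wc entirely, at the cost of getting each line as raw text that you must split and convert yourself.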

2013-07-02

Parse arguments of an R script

R can also be used as a scripting tool. We just need to add a shebang in the first line of the file (script):

#!/usr/bin/Rscript

and then the R code should follow.

Often we want to pass arguments to such a script; these can be collected inside the script with the commandArgs() function. We then need to parse the arguments and do something conditional on them. I came up with a rather general way of parsing these arguments using just these few lines:

## Collect arguments
args <- commandArgs(TRUE)
 
## Default setting when no arguments passed
if(length(args) < 1) {
  args <- c("--help")
}
 
## Help section
if("--help" %in% args) {
  cat("
      The R Script
 
      Arguments:
      --arg1=someValue   - numeric, blah blah
      --arg2=someValue   - character, blah blah
      --arg3=someValue   - logical, blah blah
      --help              - print this text
 
      Example:
      ./test.R --arg1=1 --arg2="output.txt" --arg3=TRUE \n\n")
 
  q(save="no")
}
 
## Parse arguments (we expect the form --arg=value)
parseArgs <- function(x) strsplit(sub("^--", "", x), "=")
argsDF <- as.data.frame(do.call("rbind", parseArgs(args)))
argsL <- as.list(as.character(argsDF$V2))
names(argsL) <- argsDF$V1
 
## Arg1 default (note: the parsed values live in argsL, not args)
if(is.null(argsL$arg1)) {
  ## do something
}
 
## Arg2 default
if(is.null(argsL$arg2)) {
  ## do something
}
 
## Arg3 default
if(is.null(argsL$arg3)) {
  ## do something
}
 
## ... your code here ...
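The parsing step can be tried out interactively on a hand-made argument vector that mimics what commandArgs(TRUE) would return; a small sketch, using the same placeholder arg1/arg2/arg3 names as above:

```r
## Mimic commandArgs(TRUE) for: ./test.R --arg1=1 --arg2=output.txt --arg3=TRUE
args <- c("--arg1=1", "--arg2=output.txt", "--arg3=TRUE")

## Same parsing as above: strip the leading "--" and split on "="
parseArgs <- function(x) strsplit(sub("^--", "", x), "=")
argsDF <- as.data.frame(do.call("rbind", parseArgs(args)))
argsL <- as.list(as.character(argsDF$V2))
names(argsL) <- argsDF$V1

## All values arrive as character strings, so convert where needed,
## e.g. as.numeric(argsL$arg1) or as.logical(argsL$arg3)
str(argsL)
```

Note that every value comes back as a character string, so numeric and logical arguments need an explicit as.numeric()/as.logical() conversion before use.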

It is some work, but I find it pretty neat and have used it for quite a while now. I do wonder what others have come up with for this task. I hope I did not miss some very general solution.

2013-02-11

Kmečka pamet (Common sense)

Sound common sense can replace almost any level of education, but no level of education can replace sound common sense. — Arthur Schopenhauer

2013-01-27

The contribution of dominance and inbreeding depression in estimating variance components for litter size in Pannon White rabbits

Our paper on the analysis of dominance in Pannon White rabbits has been accepted and will appear in the Journal of Animal Breeding and Genetics. The abstract says:

In a synthetic closed population of Pannon White rabbits, additive (VA), dominance (VD) and permanent environmental (VPe) variance components as well as doe (bFd) and litter (bFl) inbreeding depression were estimated for the number of kits born alive (NBA), number of kits born dead (NBD) and total number of kits born (TNB). The data set consisted of 18,398 kindling records of 3883 does collected from 1992 to 2009. Six models were used to estimate dominance and inbreeding effects. The most complete model estimated VA and VD to contribute 5.5 ± 1.1% and 4.8 ± 2.4%, respectively, to total phenotypic variance (VP) for NBA; the corresponding values for NBD were 1.9 ± 0.6% and 5.3 ± 2.4%, for TNB, 6.2 ± 1.0% and 8.1 ± 3.2% respectively. These results indicate the presence of considerable VD. Including dominance in the model generally reduced VA and VPe estimates, and had only a very small effect on inbreeding depression estimates. Including inbreeding covariates did not affect estimates of any variance component. A 10% increase in doe inbreeding significantly increased NBD (bFd = 0.18 ± 0.07), while a 10% increase in litter inbreeding significantly reduced NBA (bFl = −0.41 ± 0.11) and TNB (bFl = −0.34 ± 0.10). These findings argue for including dominance effects in models of litter size traits in populations that exhibit significant dominance relationships.


Genomic evaluations using similarity between haplotypes

Our paper on using haplotypes to build relationships between individuals has been accepted and will appear in the Journal of Animal Breeding and Genetics. The abstract says:

Long-range phasing and haplotype library imputation methodologies are accurate and efficient methods to provide haplotype information that could be used in prediction of breeding value or phenotype. Modelling long haplotypes as independent effects in genomic prediction would be inefficient due to the many effects that need to be estimated and phasing errors, even if relatively low in frequency, exacerbate this problem. One approach to overcome this is to use similarity between haplotypes to model covariance of genomic effects by region or of animal breeding values. We developed a simple method to do this and tested impact on genomic prediction by simulation. Results show that the diagonal and off-diagonal elements of a genomic relationship matrix constructed using the haplotype similarity method had higher correlations with the true relationship between pairs of individuals than genomic relationship matrices built using unphased genotypes or assumed unrelated haplotypes. However, the prediction accuracy of such haplotype-based prediction methods was not higher than those based on unphased genotype information.


2013-01-25

Simulation of different strategies to implement genomic selection in plant breeding

Our submitted abstract for the VIPCA conference.

Gorjanc G, Crossa J, Dreisigacker S, Autrique E, Hickey J

Genomic information provides a rich resource for plant improvement. The greatest advantage of genomic information is achieved when it is used early in a programme. The aim of this contribution was to present, by simulation, different ways of utilizing genomic information in cross- and self-pollinated species early in the bi-parental population (BP) cycle, in the F2 generation. The simulation mimicked a real breeding programme with BP of different degrees of relationship. Simulated trait heritability was 0.5. Predictions were always within a focal BP (BPX), avoiding the effect of inflated accuracy due to population structure. The measure of interest was the correlation between true and predicted additive genetic value (accuracy). Predictions were based on training on data collected within BPX, from BP having one parent in common with BPX (BPPC), from BP having one grand-parent in common with BPX (BPGC), and from unrelated BP (BPUR). Results show that genomic predictions can be accurate early in the cycle. Accurate predictions within the BPX can be obtained with a small number of markers and moderate phenotyping of F2-derived lines. With 200 markers and 50 phenotypes accuracy was around 0.60, and it increased to 0.80 with 100 phenotypes. Use of information from relatives requires denser marker panels and more phenotypes to properly model the expanding number of linkage blocks in progressively more distantly related BP. However, these requirements can be spread over many related BP, reducing the cost per BP. For example, accuracy of 0.6 can be achieved by using 1,000 markers and 50 phenotypes from 4 BPPC and 40 BPUR, or by using 10,000 markers and 50 phenotypes from 8 BPGC and 50 BPUR. The requirement for denser marker panels increases the needed investment, which can be offset by proper use of imputation methods, making the whole procedure cost effective for real applications.

2012-12-28

Functions dim, nrow, and ncol for shell

Ever wanted to quickly find out the number of rows and/or columns in a file directly from the terminal? There are many ways to skin this cat. Here is what I used for the number of rows for quite a while:

wc -l filename

What about the number of columns? "Easy", just combine the head and awk commands:

head -n 1 filename | awk '{ print NF }'

Not a big problem (there are likely better ways to do this), but it is long and tedious to type.

I got sick of typing the commands above and assembled them into three easy-to-use functions with R-like names: nrow, ncol, and dim. The functions are simply a collection of the above ideas and assume that the file is of a "rectangular shape", i.e., a table, a matrix, etc.


dim ()
{
  for FILE in "$@"; do
    NROW=$(nrow "$FILE" | awk '{ print $1 }')
    NCOL=$(ncol "$FILE" | awk '{ print $1 }')
    echo "$NROW $NCOL $FILE"
    unset NROW NCOL
  done
}
export -f dim

nrow ()
{
  for FILE in "$@"; do
    wc -l "$FILE"
  done
}
export -f nrow

ncol ()
{
  for FILE in "$@"; do
    TMP=$(head -n 1 "$FILE" | awk '{ print NF }')
    echo "$TMP $FILE"
    unset TMP
  done
}
export -f ncol

Add these functions to your .bashrc or .profile or something similar and you can then simply type:

nrow filename
ncol filename
dim filename

A simple test:

touch file.txt
echo "line1 with four columns" >> file.txt
echo "line2 with four columns" >> file.txt

nrow file.txt
2 file.txt

ncol file.txt
4 file.txt

dim file.txt
2 4 file.txt
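For completeness, both counts can also be had in a single awk pass, without defining any functions; a sketch assuming a rectangular file, as above:

```shell
# Take the field count from the first record and the record count at
# the end of input (assumes every line has the same number of columns)
awk 'NR == 1 { nf = NF } END { print NR, nf }' file.txt
```

On the file.txt created above this prints "2 4" (without the file name), since awk only reads the file once and never needs head or wc.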