New stuff in the gdata R package

I have moved quite a few functions from my "testing/playground" R package ggmisc to the gdata package. Below is the relevant text from the NEWS file:
  • New function .runRUnitTestsGdata that enables running all RUnit tests during R CMD check as well as directly from within R.
  • Enhanced function object.size that returns the size of multiple objects. There is also a handy print method that can print the size of an object in "human readable" format when options(humanReadable=TRUE) is set or print(x, humanReadable=TRUE) is called.
  • New function bindData that binds two data frames into a multivariate data frame in a different way than merge.
  • New function wideByFactor that reshapes a given dataset by a given factor; it creates a "multivariate" data.frame.
  • New functions getYear, getMonth, getDay, getHour, getMin, and getSec for extracting the date/time parts from objects of a date/time class.
  • New function nPairs that gives the number of variable pairs in a data.frame or a matrix.
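
To give a feel for what the date/time helpers in the list above do, here is a minimal base R sketch. It is not the gdata code; the helper name datePart and the example timestamp are made up for illustration, and the only assumption is that each part can be obtained with format() and a strftime-style format string.

```r
# Hypothetical helper (not part of gdata): extract one part of a date/time
# object as a character string using a strftime-style format.
datePart <- function(x, fmt) {
  format(x, format = fmt)
}

# Example timestamp, invented for illustration
stamp <- as.POSIXct("2008-11-05 14:30:15", tz = "UTC")

parts <- c(
  year  = datePart(stamp, "%Y"),
  month = datePart(stamp, "%m"),
  day   = datePart(stamp, "%d"),
  hour  = datePart(stamp, "%H"),
  min   = datePart(stamp, "%M"),
  sec   = datePart(stamp, "%S")
)
print(parts)
```

The same helper works on Date objects, e.g. datePart(as.Date("2008-11-05"), "%Y"), since format() dispatches on the class of its argument.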


My readership

When I started my blog, I thought it would be nice to share my thoughts, work, and new things with my friends. Recently, I added a gadget (Whos among us) to monitor the visitors. I checked the "status" today and was really surprised: so far I have had quite a lot of visitors from all over Europe, the USA, and Canada, but also some hits from Central and South America, Australia, Africa (Egypt and South Africa), India, China, ... Wow, this is much more than I ever anticipated! I guess that some visitors are just passing by, but it is still nice to have an audience ;)


Genetic evaluation with uncertain paternity

The standard method for genetic evaluation, i.e., BLUP, needs phenotype and pedigree data. Sometimes the pedigree data is not accurate, in the sense that there are several potential parents. Usually, this is the case when multiple sires are used in large groups of animals, say beef cattle or meat sheep in extensive conditions. However, we usually have only a few candidates, and this information can be included in the model. Robert Tempelman has a great presentation on this issue, giving an overview and presenting their research.
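
One simple way to picture how candidate sires can enter the model is to weight each candidate by its probability of being the true sire. The sketch below is only an illustration under that assumption and is not taken from the presentation; the function name, the breeding values, and the probabilities are all hypothetical.

```r
# Hypothetical illustration: expected parent average of an animal with a
# known dam and several candidate sires, each with a probability of being
# the true sire (probabilities must sum to 1).
expectedParentAverage <- function(damBV, sireBV, sireProb) {
  stopifnot(length(sireBV) == length(sireProb),
            abs(sum(sireProb) - 1) < 1e-8)
  # Half from the dam, half from the probability-weighted average sire
  0.5 * damBV + 0.5 * sum(sireProb * sireBV)
}

# Made-up breeding values for one dam and three candidate sires
pa <- expectedParentAverage(damBV = 2,
                            sireBV = c(4, 0, -2),
                            sireProb = c(0.5, 0.25, 0.25))
print(pa)  # 1.75
```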

There are also some other very fine presentations on this and related issues in beef cattle breeding.


Make it simple

A nice example of how a boring old scatterplot can be much more informative than a fancy plot.


High-Performance Computing with R

Dirk has posted a new version of slides for a tutorial "Introduction to High-Performance Computing with R".

Genetic and selection aspects of sheep fertility

I have just finished reading the diploma thesis of Jernejka Drolec (1993), "Genetski in selekcijski vidiki plodnosti ovac" (Genetic and selection aspects of sheep fertility). I must admit that I was surprised by the breadth of the thesis and its coverage of the literature. Since the topic interests me a lot, I read the thesis quickly, and I will probably peek into it many more times in the future.


gdata gains trimSum function

I was doing some plotting in R and needed to trim some values to keep the data (x axis) within reasonable limits, but I did not want to lose that information. Therefore, I summed the values that would be trimmed. Since I was repeating this, I wrote a function and committed it to the gdata SVN package repository. It will probably take some time before the new version of gdata hits CRAN, so there is a package for MS Windows and a source package. Here it is in action:
> x <- 1:10
> trimSum(x, n=5)
[1] 1 2 3 4 45
> trimSum(x, n=5, right=FALSE)
[1] 21 7 8 9 10

> x[9] <- NA
> trimSum(x, n=5)
[1] 1 2 3 4 NA
> trimSum(x, n=5, na.rm=TRUE)
[1] 1 2 3 4 36
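
For readers without the new gdata version at hand, the behaviour shown above can be imitated with a few lines of base R. This is only a sketch written to reproduce the four outputs in the transcript; the name trimSumSketch is made up, and the actual gdata function may differ in its argument checks and details.

```r
# Hypothetical re-implementation sketch (not the gdata source):
# keep n values and put the sum of the trimmed values into the end element.
trimSumSketch <- function(x, n, right = TRUE, na.rm = FALSE) {
  N <- length(x)
  stopifnot(n >= 1, n <= N)
  if (right) {
    # Keep the first n values; the n-th holds the sum of x[n], ..., x[N]
    out <- x[seq_len(n)]
    out[n] <- sum(x[n:N], na.rm = na.rm)
  } else {
    # Keep the last n values; the first holds the sum of x[1], ..., x[N-n+1]
    k <- N - n + 1
    out <- x[k:N]
    out[1] <- sum(x[seq_len(k)], na.rm = na.rm)
  }
  out
}

x <- 1:10
print(trimSumSketch(x, n = 5))
print(trimSumSketch(x, n = 5, right = FALSE))
```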

Rtools and Cygwin on MS Windows

Duncan Murdoch provides Rtools, which eases the installation of the tools needed for R package development/testing on MS Windows. Rtools is a collection of various tools. However, if you also use Cygwin on MS Windows, you can expect problems, since Rtools also includes some tools from Cygwin. The problem is a version collision of fundamental Cygwin libraries. Basically, this means that you will not be able to use Cygwin, and if you have C:/Cygwin/bin in the PATH environment variable, you can expect problems in the Command Prompt as well. You can try to fuse both "worlds" as described in the documentation, but it seems tricky. I just used the following to make my life easier:
  • Install Rtools and do not modify the PATH environment variable --> this means that you will still be able to use Cygwin without problems
  • Create a BAT script (say on the Desktop) with the following content:
rem --- Add RTools to the PATH ---
set PATH=c:\Programs\R\Rtools\bin;c:\Programs\R\Rtools\perl\bin;c:\Programs\R\Rtools\MinGW\bin;%PATH%

rem --- Start the Command Prompt ---
cmd
  • Start the script with a double click and you will get a working environment for R package development/testing
Btw., under Linux or Mac you can test how an R package behaves under MS Windows using the win-builder at http://win-builder.r-project.org, provided by Uwe Ligges.
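
The same PATH trick as in the BAT script above can also be applied to a single running R session, leaving the system-wide PATH untouched. The following is a sketch only: the helper name prependToPath is made up, and the directories are simply the ones from the BAT script, so adjust them to your own installation.

```r
# Hypothetical helper: build a PATH string with extra directories in front.
prependToPath <- function(dirs, path = Sys.getenv("PATH"),
                          sep = .Platform$path.sep) {
  paste(c(dirs, path), collapse = sep)
}

# Directories copied from the BAT script; they are assumptions about
# where Rtools is installed.
rtoolsDirs <- c("c:\\Programs\\R\\Rtools\\bin",
                "c:\\Programs\\R\\Rtools\\perl\\bin",
                "c:\\Programs\\R\\Rtools\\MinGW\\bin")

newPath <- prependToPath(rtoolsDirs)

# Only change the PATH of this R session, and only on MS Windows
if (.Platform$OS.type == "windows") {
  Sys.setenv(PATH = newPath)
}
```

The change lasts only for the current R session, so Cygwin keeps working everywhere else.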


Genomic selection session on ICAR

Genomic selection is a hot topic in the area of animal breeding and genetics. Today, I came across the presentations given at ICAR this summer. I liked all of the presentations, but today I would like to point out the status reports from the Netherlands and New Zealand.


Use of sexed semen in organic farming

In mammals, the sex of the offspring depends on the combination of sex chromosomes. Females always have the so-called XX combination as their pair of sex chromosomes, while males always have the XY combination. The mother therefore always passes an X chromosome to the offspring, whereas the father can pass either an X or a Y chromosome. So it is the father that determines whether the offspring will be male or female: if the father passes on X, the offspring will be female, and if he passes on Y, the offspring will be male. In livestock production we would in many cases like to have offspring of one particular sex only. For example, in dairy farming we would like only female offspring from the better dams for herd replacement, while from the poorer dams we would like male offspring, which are better suited for fattening; in pigs, conversely, we would like only female animals for fattening, since they perform better than castrated males; in layer poultry we would likewise like more female animals, since laying hens are not suitable for fattening because of their extreme "specialization for egg laying". It should be pointed out right away that chickens are not mammals and that in chickens it is the mothers that "determine" the sex of the offspring. If we are able to "sort" the semen, we can influence the sex of the offspring. As far as I know, such semen is already available for cattle; for more, search the web - link SI, link EN.

What does all this have to do with organic farming? A piece of news from Denmark surprised me: on organic farms in Denmark the use of sexed semen is not allowed. I do not know how this is regulated in Slovenia, but in my opinion this is a very strange restriction. I think that "sorting" semen is less "controversial" than culling (slaughtering) offspring of the unwanted sex, e.g., male calves in dairy cattle.

Animal Breeding & Genomics 10

Animal Breeding & Genomics 10


Memory limit management in R

R keeps all the data in RAM. I think I read somewhere that S+ does not hold all the data in RAM, which makes S+ slower than R. On the other hand, when we have a lot of data, R chokes. I know that SAS at some "periods" keeps data (tables) on disk in special files, but I do not know the details of interfacing with these files. My overall impression is that SAS is more efficient with big datasets than R, but there are also exceptions: some special packages (see this tutorial for some info) and vibrant development which, to my impression, tackles the problem of large data in the spirit of SAS. I really do not know the details, so please bear with me.

Anyway, what can you do when you hit the memory limit in R? Yesterday, I was fitting a so-called mixed model using the lmer() function from the lme4 package on a Dell Inspiron I1520 laptop with an Intel(R) Core(TM) Duo CPU T7500 @ 2.20GHz and 2046 MB of RAM. I was using MS Windows Vista. The fitting went fine, but when I wanted to summarize the returned object, I got the following error message:

> fit <- lmer(y ~ effect1 + ....)
> summary(fit)
Error: cannot allocate vector of size 130.4 Mb
In addition: There were 22 warnings (use warnings() to see them)
First, I find this very odd, since I would expect fitting the model to consume much more memory than summarizing the fitted object! I will ask the developers of the lme4 package, but until then I tried to find my own way out.

The message "Error: cannot allocate vector of size 130.4 Mb" means that R cannot obtain an additional 130.4 Mb of RAM. That is weird, since the resource manager showed that I had at least about 850 MB of RAM free. I printed the warnings using warnings() and got a set of messages saying:
> warnings()
1: In slot(from, what) <- slot(value, what) ... :
Reached total allocation of 1535Mb: see help(memory.size) ...
This did not make sense, since I have 2 GB of RAM. I closed all other applications and removed all objects in the R workspace except the fitted model object. However, that did not help. I started reading the help page of memory.size, and I must confess that I did not understand it or find anything useful. However, reading further, I followed the link to the help page of memory.limit and found out that on my computer R by default can use up to ~1.5 GB of RAM and that the user can increase this limit. The following code solved my problem:
> memory.limit()
[1] 1535.875
> memory.limit(size=1800)
> summary(fit)
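
Before raising the limit, it can help to see which objects actually occupy the memory. The sketch below uses only base R (ls, object.size, gc) and should run on any platform, whereas memory.limit() is specific to MS Windows. The function name objectSizes and the demo objects are made up for illustration.

```r
# Hypothetical helper: sizes (in bytes) of all objects in an environment,
# largest first.
objectSizes <- function(env = globalenv()) {
  nms <- ls(envir = env)
  sizes <- vapply(nms,
                  function(nm) as.numeric(object.size(get(nm, envir = env))),
                  numeric(1))
  sort(sizes, decreasing = TRUE)
}

# Demo in a separate environment so the workspace is not touched
demo <- new.env()
assign("big", numeric(1e5), envir = demo)
assign("small", 1L, envir = demo)

sizes <- objectSizes(demo)
print(sizes)

# Drop what is no longer needed and ask R to release the memory
rm("big", envir = demo)
invisible(gc())
```

Removing the largest objects and calling gc() is sometimes enough to get a single summary() through without changing any limits.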


Domestication of sheep and goats

GlobalDiv posted a new newsletter with interesting information about the domestication of sheep and goats. I cannot copy the relevant part here, but you can take a look at page 6. Basically, it says that there probably was no domestication bottleneck in sheep and goats like the one that probably occurred in cattle. This might also be the reason why we have so much variability in sheep and goats today in comparison to cattle.