2011-01-28

Converting strsplit() output to a data.frame

R has a nice set of utilities for working with strings, and the paste function is surely one of them. It "glues" several strings together, with an optional separator. The following example shows how paste can be used to create a new variable in a dataset:
dat <- data.frame(x=1:5, y=letters[1:5])
(dat$z <- with(dat, paste(x, y, sep="-")))
Today I was in a situation where I only had column z and wanted to reverse the action of paste. Is there a way to do it? Not directly (AFAIK), but strsplit seems to be quite useful for this:
(tmp <- strsplit(x=dat$z, split="-"))
However, the output of strsplit is a list with one element (a character vector) per element of my column z, not one element per split component. Consequently, one cannot easily convert strsplit output back to a data.frame, as you can test yourself with:
as.data.frame(tmp)
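To make the mismatch concrete, here is a small self-contained sketch. The vector z below is made up for illustration and simply mirrors column z from above. The list has one element per string, so as.data.frame() turns each string into a column rather than a row.

```r
# Illustrative input, mirroring dat$z from the post
z <- c("1-a", "2-b", "3-c", "4-d", "5-e")

# One list element per input string, each holding that string's pieces
parts <- strsplit(z, split = "-", fixed = TRUE)
n_elements <- length(parts)    # number of input strings
n_pieces   <- lengths(parts)   # number of pieces in each string

# Each list element becomes a column, so the result is
# the transpose of the layout we actually want
bad <- as.data.frame(parts)
dim(bad)
```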
Argh. I understand that strsplit is meant to be very general (say we could have unequal number of components in one element, e.g., c("1-a-0", "1-a")), but its output is really inconvenient for transformation to a data.frame. I came up with the following solution, which seems to work nicely and is quite fast.
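# Flatten the list: the pieces of each string are stored consecutively
tmp <- unlist(strsplit(dat$z, split="-"))
cols <- c("x2", "y2")
nC <- length(cols)
# Position of the first piece of each row within tmp
ind <- seq(from=1, by=nC, length.out=nrow(dat))
for(i in seq_len(nC)) {
  dat[, cols[i]] <- tmp[ind + i - 1]
}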
Does anyone have a better (more obvious) solution?

18 comments:

Anonymous said...

How about

tmpdf <- as.data.frame(matrix(unlist(strsplit(dat$z, "-")), nrow=nrow(dat), byrow=TRUE))

colnames(tmpdf) <- c("x1", "x2")

I have found many uses for this technique (wrapping a vector in a matrix of the desired dimensions). It is quite fast, too.
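One caveat with the matrix approach is that it assumes every string splits into the same number of pieces. The sketch below wraps the idea in a helper that checks this first; the name split_to_df and its arguments are hypothetical and chosen only for illustration.

```r
# Hypothetical helper (not from any package): split a character vector
# into a data.frame, refusing input whose elements split unevenly.
split_to_df <- function(x, split, col_names) {
  parts <- strsplit(x, split = split, fixed = TRUE)
  # Every element must yield exactly one piece per requested column
  stopifnot(all(lengths(parts) == length(col_names)))
  out <- as.data.frame(
    matrix(unlist(parts), ncol = length(col_names), byrow = TRUE),
    stringsAsFactors = FALSE
  )
  names(out) <- col_names
  out
}

# Illustrative input
z <- c("1-a", "2-b", "3-c")
res <- split_to_df(z, "-", c("x2", "y2"))
```

With ragged input such as c("1-a-0", "1-a"), the stopifnot() call raises an error instead of letting matrix() misalign the pieces.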

Bernd said...

I really like Anonymous' "matrix solution". This is my way to go...

tmp <- strsplit(dat$z, split="-")
do.call(rbind, lapply(tmp, rbind))

Disgruntled PhD said...

I don't know if it will work, but typically on R Help one sees unlist being used in conjunction with strsplit.

aL3xa said...

Gregor, use regular expressions, for the love of god:

dat$z <- gsub("-", "", dat$z)

Anonymous said...

aL3xa: This doesn't do what Gregor needs at all. It just removes the hyphen from the elements of dat$z. It doesn't split the dat$z column into two.
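For what it is worth, a regular-expression route that does split the column might look like the sketch below. It assumes exactly two components separated by the first hyphen, and the vector z is made up for illustration.

```r
# Illustrative input
z <- c("1-a", "2-b", "3-c")

# Drop everything from the first hyphen onwards
x2 <- sub("-.*$", "", z)
# Drop everything up to and including the first hyphen
y2 <- sub("^[^-]*-", "", z)
```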

Scott Chamberlain said...

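A combination of the plyr and stringr packages:

library(plyr)
library(stringr)

# Works here because each value of x identifies exactly one row
ddply(dat, .(x), transform,
x2 = str_split(z, "-")[[1]][1],
y2 = str_split(z, "-")[[1]][2]
)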

aL3xa said...

Oh, my apologies. In that case, try:

cbind(dat[-3], t(as.data.frame(strsplit(dat$z, "-"))), row.names = NULL)

Anonymous said...

If the lengths of the elements of your list differ I think that you are right and that looping is necessary. For example:

dat <- data.frame(x=1:5, y=letters[1:5])
dat$z <- c("1-a-0", "2-b", "3-c-1", "4-d", "5-e-2-a")
tmp <- strsplit(dat$z, split="-")

nMax <- max(sapply(tmp, length))
dat <- cbind(dat, t(sapply(tmp, function(i) i[1:nMax])))
colnames(dat) <- c("x", "y", "z", sprintf("z%s", 1:nMax))

Best.
James.
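The padding above relies on the fact that indexing a vector past its end returns NA. A minimal sketch of just that step, with a made-up input vector:

```r
# Illustrative ragged input
z <- c("1-a-0", "2-b", "5-e-2-a")
parts <- strsplit(z, split = "-", fixed = TRUE)
nMax <- max(lengths(parts))

# Indexing beyond a vector's length yields NA, which pads short rows
padded <- t(sapply(parts, function(p) p[seq_len(nMax)]))
dim(padded)
```

Rows with fewer pieces end up with NA in the trailing columns, so the result is rectangular and can be passed to cbind() or as.data.frame().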

Anonymous said...

this seems to work as well:

tmp <- strsplit(x=dat$z, split="-")
cols <- t(sapply(tmp,c))
dat[,c("x2","y2")] <- cols

Anonymous said...

I've usually used something like this when there are the same number of pieces in each string:

tempdf <- t(data.frame(sapply(d$z, strsplit, split='-')))

colnames(tempdf) <- c('x1', 'x2')

Drunken said...

another variant...

data.frame(t(do.call("cbind", strsplit(dat$z,"-"))))

Anonymous said...

as.data.frame(do.call("rbind", tmp))

Gregor Gorjanc said...

Thank you all for the comments and ideas. It is really amazing how many responses I got! At first I thought of not posting this problem to my blog, as it takes time to write things down, but given the number of responses it was more than worthwhile. I realized that my approach is essentially the same as what many people proposed, but I took my problem too literally, since some functions in R can do a lot of that "internally". This post will surely serve me as a reference for future use!

Nirav said...

Hey, I created a small piece of code for "Converting strsplit() output to a data.frame".
Feel free to use it (I am a beginner at R, so the code is not very complex).


# Get the file list (dir_path is assumed to be defined beforehand)
setwd(dir_path)
file_list = list.files(dir_path, pattern = "\\.out$")


# Split filenames on underscores
pars = strsplit(file_list, "_", fixed = TRUE)

# Insert split filenames into a data frame
# (cols is assumed to be a character vector of column names defined beforehand)
pars_array <- array(dim=c(length(pars),length(cols)))
colnames(pars_array) <- cols

for(i in seq_along(pars)) {
for(ii in seq_along(cols)) {pars_array[i,ii] = pars[[i]][ii]}
}

pars_array = as.data.frame(pars_array)


Nirav Khimashia
nirav.khimashia@csiro.au

Max said...

Amazing how many people have the same problem with R functions! Thanks for your help guys!
BTW I find it not logical at all that strsplit is not returning a vector

Anonymous said...

Thanks James and Gregor!

I was looking for this for days. Many thanks once again

Robin Edwards said...

The text= argument in read.table is quite useful in this case - it makes the function treat vectors as if they're lines of a file being read in.

read.table(text=dat$z, sep='-')

Do I win?
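One thing to keep in mind with read.table is that it also guesses column types, unlike the strsplit-based approaches, which return character pieces. A small sketch with a made-up input vector and illustrative column names:

```r
# Illustrative input
z <- c("1-a", "2-b", "3-c")

# col.names and stringsAsFactors are standard read.table arguments;
# the column names themselves are chosen for illustration
res <- read.table(text = z, sep = "-",
                  col.names = c("x2", "y2"),
                  stringsAsFactors = FALSE)
str(res)
```

Here the first column comes back as integer rather than character, which may or may not be what you want.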

Robin Edwards said...

NB. in addition to above, a MUCH faster equivalent method if you have a large dataset is:

data.table::fread(paste(dat$z, collapse='\n'), sep='-')

Happy parsing