R has a nice set of utilities for working with strings, and the function paste is surely one of them. It can be used to "glue" several strings together with an optional separator. The following example shows how paste can be used to create a new variable in a dataset:
dat <- data.frame(x=1:5, y=letters[1:5])
(dat$z <- with(dat, paste(x, y, sep="-")))
Today I was in a situation where I only had column z and wanted to reverse the action of paste. Is there a way to do it? Not directly (AFAIK), but strsplit seems to be quite useful for this:
(tmp <- strsplit(x=dat$z, split="-"))
However, the output of strsplit is a list with one element (a character vector) for each element of my column z, not one for each split component. Consequently, one cannot easily convert strsplit output back to a data.frame, as you can test yourself with:
as.data.frame(tmp)
Argh. I understand that strsplit is meant to be very general (say, we could have an unequal number of components per element, e.g., c("1-a-0", "1-a")), but its output is really inconvenient to transform into a data.frame. I came up with the following solution, which seems to work nicely and is quite fast.
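# Flatten the list into one interleaved vector: "1" "a" "2" "b" ...
tmp <- unlist(strsplit(dat$z, split="-"))
cols <- c("x2", "y2")
nC <- length(cols)
# Positions of the first component of each string
ind <- seq(from=1, by=nC, length.out=nrow(dat))
for(i in seq_len(nC)) {
  dat[, cols[i]] <- tmp[ind + i - 1]
}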
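To see why the index arithmetic in the loop above works, here is a small self-contained sketch on the same toy data. The names parts, pieces and first_idx are only illustrative, and it assumes every string splits into exactly two pieces.

```r
# Toy data from the post
z <- paste(1:5, letters[1:5], sep = "-")

# strsplit() returns a list with one character vector per input string
parts <- strsplit(z, split = "-")

# unlist() flattens the list into an interleaved vector: "1" "a" "2" "b" ...
pieces <- unlist(parts)

# Every 2nd element starting at position 1 is the first component;
# shifting the index by one picks the second component
first_idx <- seq(from = 1, by = 2, length.out = length(z))
x2 <- pieces[first_idx]
y2 <- pieces[first_idx + 1]
```

Note that the recovered x2 is a character vector; as.integer(x2) would be needed to get the original numbers back.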
Does anyone have a better (more obvious) solution?
20 comments:
How about
tmpdf <- as.data.frame(matrix(unlist(strsplit(dat$z, "-")), nrow=nrow(dat), byrow=TRUE))
colnames(tmpdf) <- c("x1", "x2")
I have found many uses for this technique (wrapping a vector in a matrix of the desired dimensions). It is quite fast, too.
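The matrix-wrapping idea could be packaged as a small helper along these lines. This is only a sketch: split_to_df is a made-up name (not from any package), and it assumes that the separator is a fixed string and that every element splits into the same number of pieces.

```r
# split_to_df() is a hypothetical helper, not part of base R or any package
split_to_df <- function(x, split, col_names) {
  parts <- strsplit(as.character(x), split = split, fixed = TRUE)
  # The matrix trick is only valid if every string has the same number of pieces
  stopifnot(all(lengths(parts) == length(col_names)))
  # Fill row by row so that each input string becomes one row
  m <- matrix(unlist(parts), ncol = length(col_names), byrow = TRUE)
  out <- as.data.frame(m, stringsAsFactors = FALSE)
  names(out) <- col_names
  out
}

dat <- data.frame(x = 1:5, y = letters[1:5])
dat$z <- paste(dat$x, dat$y, sep = "-")
res <- split_to_df(dat$z, split = "-", col_names = c("x2", "y2"))
```

The stopifnot() check is there because matrix() would otherwise silently recycle or misalign values when the pieces do not fill the matrix evenly.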
I really like Anonymous' "matrix solution". This is my way to go...
tmp <- strsplit(dat$z, split="-")
do.call(rbind, lapply(tmp, rbind))
I don't know if it will work, but typically on R Help one sees unlist being used in conjunction with strsplit.
Gregor, use regular expressions, for the love of god:
dat$z <- gsub("-", "", dat$z)
aL3xa: This doesn't do what Gregor needs at all. It just removes the hyphen from the elements of dat$z. It doesn't split the dat$z column into two.
A combination of the plyr and stringr packages:
library(plyr)
library(stringr)
ddply(dat, .(x), transform,
  x2 = str_split(z, "-")[[1]][1],
  y2 = str_split(z, "-")[[1]][2]
)
Oh, my apologies. In that case, try:
cbind(dat[-3], t(as.data.frame(strsplit(dat$z, "-"))), row.names = NULL)
If the lengths of the elements of your list differ I think that you are right and that looping is necessary. For example:
dat <- data.frame(x=1:5, y=letters[1:5])
dat$z <- c("1-a-0", "2-b", "3-c-1", "4-d", "5-e-2-a")
tmp <- strsplit(dat$z, split="-")
nMax <- max(sapply(tmp, length))
dat <- cbind(dat, t(sapply(tmp, function(i) i[1:nMax])))
colnames(dat) <- c("x", "y", "z", sprintf("z%s", 1:nMax))
Best.
James.
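For the unequal-length case, the padding step can also be written without an anonymous function. The sketch below is one possible variant, not the method from the comment above: it uses the base R replacement function `length<-` to extend every vector to the maximum length, which fills the missing positions with NA. The names parts, padded and wide are illustrative.

```r
z <- c("1-a-0", "2-b", "3-c-1", "4-d", "5-e-2-a")
parts <- strsplit(z, split = "-", fixed = TRUE)
n_max <- max(lengths(parts))

# `length<-`(v, n_max) extends v to n_max elements, padding with NA
padded <- lapply(parts, `length<-`, n_max)

# Stack the padded vectors as rows of a character matrix, then convert
wide <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(wide) <- paste0("z", seq_len(n_max))
```

Padding first matters because rbind() on vectors of different lengths recycles the shorter ones instead of filling them with NA.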
this seems to work as well:
tmp <- strsplit(x=dat$z, split="-")
cols <- t(sapply(tmp,c))
dat[,c("x2","y2")] <- cols
I've usually used something like this when there are the same number of pieces in each string:
tempdf <- t(data.frame(sapply(d$z, strsplit, split='-')))
colnames(tempdf) <- c('x1', 'x2')
another variant...
data.frame(t(do.call("cbind", strsplit(dat$z,"-"))))
as.data.frame(do.call("rbind", tmp))
Thank you all for the comments and ideas. It is really amazing how many responses I got! At first I thought I would not post this problem to my blog, as it takes time to write things down, but given the number of responses it was more than worth it. I realized that my approach is essentially the same as what many people proposed, but I took my problem too literally, as some functions in R can do a lot of that "internally". This post will surely serve me as a reference for future use!
Hey, I created a small piece of code for "Converting strsplit() output to a data.frame".
Feel free to use (I am beginner @ R, so the code is not very complex)
# Get file list (dir_path is assumed to be defined beforehand)
setwd(dir_path)
file_list = list.files(dir_path, pattern = "\\.out$")
# Split filenames
pars = strsplit(file_list, "_", fixed = FALSE, perl = FALSE, useBytes = FALSE)
# Insert split filenames into a data frame
# (cols is assumed to be a character vector of column names defined beforehand)
pars_array <- array(dim=c(length(pars),length(cols)))
colnames(pars_array) <- cols
for(i in seq_along(pars)) {
  for(ii in seq_along(cols)) {pars_array[i,ii] = pars[[i]][ii]}
}
pars_array = as.data.frame(pars_array)
Nirav Khimashia
nirav.khimashia@csiro.au
Amazing how many people have the same problem with R functions! Thanks for your help guys!
BTW I find it not logical at all that strsplit is not returning a vector
Thanks James and Gregor!
I was looking for this for days. Many thanks once again
The text= argument in read.table is quite useful in this case - it makes the function treat vectors as if they're lines of a file being read in.
read.table(text=dat$z, sep='-')
Do I win?
NB. in addition to above, a MUCH faster equivalent method if you have a large dataset is:
data.table::fread(paste(dat$z, collapse='\n'), sep='-')
Happy parsing
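As a small illustration of the read.table(text=...) approach on the toy data from the post, the sketch below also names the columns in the same call. The column names x2 and y2 are just illustrative, and as.character() is only there in case the column is stored as a factor.

```r
z <- paste(1:5, letters[1:5], sep = "-")

out <- read.table(
  text = as.character(z),     # each element is treated as one line of input
  sep = "-",
  col.names = c("x2", "y2"),  # illustrative column names
  stringsAsFactors = FALSE
)
str(out)
```

One difference from the strsplit-based solutions is that read.table converts column types, so x2 comes back as integer rather than character. It also applies its usual defaults for quotes and comment characters, which may matter if the strings contain ' or #.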
to make the list even more complete:
https://stackoverflow.com/questions/12946883/strsplit-by-row-and-distribute-results-by-column-in-data-frame
i think you can use pluck() for a cleaner syntax. i had a similar issue and found this:
https://rstudio-education.github.io/tidyverse-cookbook/transform-lists-and-vectors.html#extract