One of the best features of R is the simple way you can use a number of different strategies to create subsets of large data tables. The basic selection mechanisms available are that you can subset a data frame by providing:
- A set of column or row indices
- A set of boolean values specifying whether individual columns or rows should be included
- A set of row or column names to include in the filtered set
Mixing these basic mechanisms with the plethora of other functions R provides to generate these data types gives you a simple way to create powerful filters to reshape your data.
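As a quick illustration, here is a small hypothetical data frame (the names `df`, `by.index`, etc. are just for this sketch) subset with each of the three mechanisms:

```r
# A small example data frame to demonstrate the three subsetting mechanisms
df <- data.frame(a=1:5, b=letters[1:5], row.names=paste("row", 1:5, sep=""))

by.index   <- df[c(1, 3), ]            # numeric row indices
by.boolean <- df[df$a > 2, ]           # logical vector applied over rows
by.name    <- df[c("row2", "row4"), ]  # row names
```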
Anyone who has used R for a significant amount of time will be painfully aware that although its syntax can be very powerful, the implementation of some of its functions is often not very efficient. I had always assumed that the core selection and filtering tools would be very efficient, though. It turns out I was wrong.
Specifically, I came across a case where I wanted to make a subselection of rows from a very large data frame (~500k rows) based on the row names. I therefore tried what seemed to be the obvious solution, which was to do a rowname-based lookup. This, it turns out, is very slow. You can see the problem in the simple example below, which selects 2500 random rows from a 1-million-row data frame. On my machine the last line takes around 20 seconds to complete, and the completion time seems to scale linearly with the number of random rows selected.
```r
my.data <- data.frame(values1=rnorm(1000000), values2=rnorm(1000000), values3=rnorm(1000000))
rownames(my.data) <- paste("row", 1:1000000, sep="")
sample(row.names(my.data), 2500) -> random.rows
my.data[random.rows,] -> filtered.data
```
The limiting step in the slow code above seems to be the actual lookup of the list of names against the rownames in the data frame. Lookups by index position do not appear to be slow, so you can work around this by doing the lookups yourself manually before running the filter. You can easily do this with the match() function, which takes a vector of strings and gives you the index position of each one in a second vector.
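A toy example makes match()'s behaviour clear: it returns the position of each element of its first argument within its second argument.

```r
# Find where "c" and "a" sit within the vector c("a", "b", "c")
positions <- match(c("c", "a"), c("a", "b", "c"))
# positions is c(3, 1)
```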
This means we can change the filtering part of the original script to be:
```r
my.data[match(random.rows, rownames(my.data)),] -> quick.filtered.data
```
…which completes in less than a second. In this case the lookup in the data frame is done based on the indices returned by match, so the match function and the lookup within match appear to be implemented much more efficiently than the core code in the data frame filtering.
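If you want to check this on your own machine, here is a minimal sketch (using a smaller data frame than in the post, and reusing its variable names) that confirms both strategies return exactly the same rows while letting you compare the timings. Absolute timings will vary by machine and R version, so measure rather than assume.

```r
# Sanity check: both lookup strategies should return identical results
n <- 100000
my.data <- data.frame(values1=rnorm(n), values2=rnorm(n))
rownames(my.data) <- paste("row", 1:n, sep="")
random.rows <- sample(row.names(my.data), 1000)

# Time the name-based lookup against the match()-based index lookup
name.time  <- system.time(name.filtered  <- my.data[random.rows,])
index.time <- system.time(index.filtered <- my.data[match(random.rows, rownames(my.data)),])
```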
There may be some added functionality in the core data frame lookup code of which I'm not aware, but it seems odd (and bad) that the natural way to do this kind of lookup is so inefficient. I'd have expected the core lookup to do what match does behind the scenes and return a result just as efficiently as our modified solution. Hopefully this can be addressed in the future, but at least for now there is a fairly simple workaround which can make things much quicker.