Managing Really Large Data Sets

For a while now I’ve been working with next-generation sequencing datasets. Each dataset consists of around 10 million mapped genome positions, and an experiment can consist of 10 or more datasets.

When analysing this data, memory usage is a major issue. Up until now our approach has been to store everything in memory as efficiently as possible, but even on machines with 12+GB of RAM we’re hitting scaling limits.

To try to alleviate this I’ve been investigating disk-based storage rather than keeping the data in RAM. I’d avoided this before because I wanted the application to be as responsive as possible and didn’t want the overhead of the extra IO.

I’ve now taken the plunge and started experimenting with disk-based storage, and my initial results are encouraging.

I chose to go with SQLite using the SQLiteJDBC connector. My initial attempt was pretty basic: a single table acting as a sort of disk-based Hashtable, and I hadn’t even got round to adding indices to speed up searches. Despite this, data access was so quick I couldn’t tell the difference from the in-memory storage, and memory usage dropped significantly.
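The single-table idea is simple enough to sketch. My code is Java using SQLiteJDBC, but the pattern is language-agnostic, so here is a minimal illustration using Python’s built-in sqlite3 module; the table and column names are invented for the example, not taken from my actual schema:

```python
import os
import sqlite3
import tempfile

# A single table used as a disk-backed "Hashtable" of mapped positions.
# Writing to a real file (not :memory:) is the whole point: the data
# lives on disk, not in RAM.
db_path = os.path.join(tempfile.mkdtemp(), "positions.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE positions "
             "(chromosome TEXT, position INTEGER, count INTEGER)")

# Bulk-load rows inside one transaction -- far faster than
# autocommitting each insert individually.
rows = [("chr1", 10000, 3), ("chr1", 20000, 7), ("chr2", 5000, 1)]
with conn:
    conn.executemany("INSERT INTO positions VALUES (?, ?, ?)", rows)

# The index turns lookups from full-table scans into B-tree searches;
# as noted above, access was fast even before this step.
conn.execute("CREATE INDEX idx_chrom_pos "
             "ON positions (chromosome, position)")

count, = conn.execute(
    "SELECT count FROM positions WHERE chromosome=? AND position=?",
    ("chr1", 20000)).fetchone()
print(count)  # -> 7
```

Only the rows you actually query are pulled into memory; everything else stays on disk, which is why the memory footprint falls so sharply.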

Since this worked out so successfully I’m now considering a major rewrite to put all of the application data into a single large SQLite database, caching in memory only the objects for the currently visible chromosome (to allow quick redraws). This is going to require some major refactoring, but it should allow the program to scale to much larger datasets and should make opening a project a virtually instantaneous operation.
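The caching scheme I have in mind is roughly the following (again sketched in Python with sqlite3 for illustration; the class and table names are hypothetical, not the planned implementation): hold one chromosome’s worth of objects in RAM and only touch the disk when the visible chromosome changes.

```python
import os
import sqlite3
import tempfile

# Set up a small on-disk database standing in for the project file.
db_path = os.path.join(tempfile.mkdtemp(), "project.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE positions "
             "(chromosome TEXT, position INTEGER, count INTEGER)")
with conn:
    conn.executemany("INSERT INTO positions VALUES (?, ?, ?)",
                     [("chr1", 100, 2), ("chr1", 200, 5), ("chr2", 300, 9)])

class ChromosomeCache:
    """Keep only the currently visible chromosome in memory for fast
    redraws; all other data stays in the SQLite file on disk."""

    def __init__(self, conn):
        self.conn = conn
        self.chromosome = None
        self.positions = {}

    def view(self, chromosome):
        # Only hit the disk when the visible chromosome changes.
        if chromosome != self.chromosome:
            cur = self.conn.execute(
                "SELECT position, count FROM positions WHERE chromosome=?",
                (chromosome,))
            self.positions = dict(cur.fetchall())
            self.chromosome = chromosome
        return self.positions

cache = ChromosomeCache(conn)
print(cache.view("chr1"))  # -> {100: 2, 200: 5}
```

Opening a project then costs only a connection to the database plus one chromosome’s worth of loading, rather than parsing every dataset up front, which is why it should feel virtually instantaneous.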


Published: August 27, 2009

Bioinformatics Computing
