Archive for the ‘Bioinformatics’ Category
Managing Really Large Data Sets
Thursday, August 27th, 2009
For a while now I’ve been working with next generation sequencing datasets. Each dataset consists of around 10 million mapped genome positions, and an experiment can consist of 10 or more datasets.
When analysing this data, memory usage is a major issue. Up until now our approach has been to try to store everything in memory as efficiently as possible, but even using machines with 12+ GB of memory we’re hitting scaling limits.
To try to alleviate this I’ve been investigating using disk based storage rather than putting the data in RAM. I’d avoided this before as I wanted the application to be as responsive as possible and didn’t want the overhead of the extra IO.
I’ve now taken the plunge and started experimenting with disk based storage, and my initial results are encouraging.
I chose to go with SQLite using the SQLiteJDBC connector. My initial attempt was pretty basic, using only a single table as a sort of disk based Hashtable, and I hadn’t even got round to adding indices to speed up searches. Despite this, data access was so quick that I couldn’t tell the difference from the in-memory storage, and the memory usage dropped significantly.
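To give a rough idea of what a single table used as a disk based hashtable can look like, here is a minimal sketch. It uses Python's built-in sqlite3 module rather than the SQLiteJDBC connector, and the class name, table name, column names and key format are all assumptions made up for the example, not the application's real schema.

```python
import sqlite3


class DiskHashtable:
    """A single-table key/value store backed by an SQLite file.

    Hypothetical illustration only: the table and column names are
    invented for this example.
    """

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        # Deliberately no PRIMARY KEY or index, mirroring a first
        # attempt that has not yet added indices.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS store (key TEXT, value BLOB)"
        )

    def put(self, key, value):
        # Delete any previous entry first so each key maps to one value.
        self.conn.execute("DELETE FROM store WHERE key = ?", (key,))
        self.conn.execute(
            "INSERT INTO store (key, value) VALUES (?, ?)", (key, value)
        )
        self.conn.commit()

    def get(self, key, default=None):
        row = self.conn.execute(
            "SELECT value FROM store WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row is not None else default

    def close(self):
        self.conn.close()
```

Something like `DiskHashtable("reads.db").put("chr1:1000", b"...")` would then write to the file instead of holding the value in RAM. Without an index every lookup is a full table scan, so adding one on `key` would be the obvious next step.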
Since this worked out so successfully I’m now considering a major rewrite to put all of the application data into a single large SQLite database, caching only the objects for the currently visible chromosome in memory (to allow quick redraws). This is going to require some major refactoring, but it should allow the program to scale to much larger datasets and should make opening a project a virtually instantaneous operation.
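The idea of keeping everything on disk and caching only the visible chromosome could be sketched roughly as follows. This is a Python illustration using the built-in sqlite3 module, not the planned implementation; the `reads` table, its columns, the index and the helper names are all hypothetical.

```python
import sqlite3


class ChromosomeCache:
    """Holds only the currently visible chromosome's reads in memory."""

    def __init__(self, conn):
        self.conn = conn
        self.chromosome = None
        self.reads = []

    def show(self, chromosome):
        # Go back to disk only when the visible chromosome changes;
        # redraws of the same chromosome reuse the in-memory list.
        if chromosome != self.chromosome:
            cursor = self.conn.execute(
                "SELECT start_pos, end_pos FROM reads "
                "WHERE chromosome = ? ORDER BY start_pos",
                (chromosome,),
            )
            self.reads = cursor.fetchall()
            self.chromosome = chromosome
        return self.reads


def build_demo_database():
    """Create a tiny in-memory database with an invented schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE reads (chromosome TEXT, start_pos INTEGER, end_pos INTEGER)"
    )
    # An index on chromosome and position keeps the per-chromosome load fast.
    conn.execute("CREATE INDEX reads_by_chromosome ON reads (chromosome, start_pos)")
    conn.executemany(
        "INSERT INTO reads VALUES (?, ?, ?)",
        [("chr1", 200, 250), ("chr1", 100, 150), ("chr2", 50, 100)],
    )
    conn.commit()
    return conn
```

With this layout, opening a project only needs to open the database file; nothing is read until a chromosome is actually displayed.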
Posted in Bioinformatics, Computing | Comments Off
Scientific Instrument Software
Friday, October 10th, 2008
As a bioinformatician I find myself spending too much of my time working around poor software supplied with scientific instruments.
I’m continually amazed that hardware which can cost hundreds of thousands of pounds is very often let down by the control and analysis software supplied with it.
I suspect that the fault for this lies almost equally between the companies producing the equipment and the scientists ordering it. The problem is that scientists only ever think about the data that an instrument can generate, and never worry about the quality of the software which will be used to collect or manipulate it. They assume that collecting the data is the difficult part and that playing around with it on a computer is easy. Increasingly this is proving not to be the case.
The companies know this too. Therefore there is very little incentive for them to do any more than the bare minimum when developing their software. However they do realise that the software can be an extra revenue source, so they often insist on using proprietary file formats so that only their software can be used to access the data – and if you want the software on other machines then you’ll have to pay for the privilege.
The biggest disappointment is that the software ages faster than the machines. I’m not sure that I’ve ever seen a major software update (free or otherwise) to a piece of scientific equipment. Once you’ve stumped up to buy the system initially you need to keep paying for your support contract to keep it running, and there’s no incentive to improve the software. If you can’t update the software and it’s not doing what you want then your only recourse is to buy a new machine.
What I have seen are perfectly good systems going to waste because of their software: array scanners being decommissioned because their software only ran on Windows 98, which meant they couldn’t be connected to the network; mass specs capable of quantitation using ICAT/iTRAQ but unable to do so because the software doesn’t support it and won’t export in a standard format to allow the use of other packages.
So what’s the answer to this problem? The only place to tackle this is during negotiations for new equipment. Things scientists should be insisting on are:
- That the software which comes with a machine is supported as part of the support contract and that bugs which are reported will be fixed (or worked round) in a defined timeframe.
- That the file format used by the equipment is documented so that independent readers/writers can be written against the spec. It would be OK to use a proprietary format as long as there is a simple and automatable way to export to a standard format.
- That the software will be guaranteed to be ported to new operating system releases within a defined timeframe as long as a support contract is in effect.
Of course no one will take any notice, and people will go ahead and spend thousands on machines with artificially limited lifespans, while people like me will spend far too long trying to work around the unnecessary limitations. The only hope I can see would be for a large institution (NIH, UK research councils) to insist that all equipment uses only documented file formats. What’s the point of having 10 year data retention policies if you can’t find any software to read the data when you get it back?
Tags: software
Posted in Bioinformatics | Comments Off