Archive for July, 2010
Where do you analyse next gen sequence data?
Tuesday, July 13th, 2010
We had an interesting discussion at the Bioinfo-core workshop at ISMB2010. The discussion centred on the best way to handle the logistics of making sequence data available to the users of a sequencing service. The problem is that the data is so big that even if you have a large central store you run the risk of users taking copies of the data and filling up all of your day-to-day storage as well.
There are a couple of options which present themselves:
- You just pass a copy of the data to the users and let them worry about where they put it. This is of course the easiest path, but will end up with users putting both data and subsequent analysis on insecure local disks on their workstations – potentially leading to big data losses when things inevitably go wrong.
- You keep all of the data on the central store and then expose this over the network (probably in a read-only fashion) to the users. This eliminates the problem of duplicated storage, but could still be problematic if the users generate large derived datasets. It will also place heavy stress on your network infrastructure, as the data will effectively be downloaded every time a user runs an analysis against it.
- You keep all of the data on the central store and provide a remote access system for analysing it. In this scenario you eliminate both the data storage and transfer problems since the data is always contained within the central facility and only views of the derived data are shown to the user. The big downside is that the core facility is then responsible for managing the computational power required by all of their users for all of their analyses, which could add significant cost and workload to the core facility.
Option 3 (central storage with remote analysis) is the most logical and efficient, but would require convincing users of a service that they should be paying up front for the costs associated with all of the future storage and analysis of their data. Given that enterprise storage is significantly more expensive than most users would expect, this could be a hard sell, but it’s probably something we should at least try.
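To make the trade-off between the three options concrete, here is a rough back-of-envelope sketch. Everything in it is an assumption made for illustration: the `Workload` fields, the function name and the example numbers are hypothetical, and the model deliberately ignores compute cost, backups and compression. It only tallies where the terabytes end up and how much crosses the network.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a sequencing service's load."""
    users: int               # number of users of the service
    raw_tb_per_user: float   # primary sequence data generated per user (TB)
    derived_tb_per_user: float  # derived data each user produces (TB)
    runs_per_user: int       # analysis runs per user that read all their raw data


def estimate_footprint(w: Workload) -> dict:
    """Estimate storage and network load (in TB) for the three options."""
    raw_total = w.users * w.raw_tb_per_user
    derived_total = w.users * w.derived_tb_per_user
    return {
        # Option 1: hand each user a copy; the facility still holds the original.
        "copy_to_users": {
            "central_tb": raw_total,
            "user_side_tb": raw_total + derived_total,
            "network_tb": raw_total,  # one-off transfer of each dataset
        },
        # Option 2: read-only network share; raw data is re-read on every run.
        "read_only_share": {
            "central_tb": raw_total,
            "user_side_tb": derived_total,
            "network_tb": raw_total * w.runs_per_user,
        },
        # Option 3: analysis runs next to the data; only views leave the facility,
        # which this sketch treats as negligible traffic.
        "remote_analysis": {
            "central_tb": raw_total + derived_total,
            "user_side_tb": 0.0,
            "network_tb": 0.0,
        },
    }


if __name__ == "__main__":
    example = Workload(users=5, raw_tb_per_user=2.0,
                       derived_tb_per_user=1.0, runs_per_user=4)
    for option, numbers in estimate_footprint(example).items():
        print(option, numbers)
```

Even with made-up numbers the pattern described above shows through: the read-only share moves the most data over the network, while remote analysis shifts all of the storage (and, not modelled here, all of the compute) onto the core facility.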
Another aspect to consider in all of this is that as the volume of publicly available sequencing data increases there are going to be many more occasions on which users will want to re-analyse public data rather than generating the data themselves. Managing the influx of this external data is also likely to be a significant challenge for core facilities. These data sets are every bit as large as those created in house, and will take just as much computational resource to analyse. It would make sense to include these sorts of data sets in the same storage and analysis system you use for your in-house data, although here you have the option of deleting the primary data should disk space become limiting.
The long-term solution to many of these problems may come from cloud computing. Having data stored in central repositories and then making these available to servers in the cloud would provide a scalable solution with the sort of cost-effective, flexible options we’re all likely to need. The problem is that at the moment the costs of storage in the cloud make this sort of system uneconomic. Cloud storage has not typically been associated with IO-bound analyses of huge datasets, but there’s no intrinsic reason why this shouldn’t be possible.
A wish for the future would be that cloud service providers recognise this potential market and cater to it. What this would require would be to make a read-only copy of some of the main data repositories (SRA, GEO, EMBL etc) permanently available to any cloud server (possibly at a small cost). Since these data repositories could be shared between users this wouldn’t be as arduous as having each user maintain their own data and would then let users simply scale their CPU demands to allow them to perform their desired analyses. This would dramatically lower the barrier preventing people getting into this area and would reduce the cost for people not able to saturate a dedicated compute cluster with their own analyses.
Posted in Bioinformatics, Computing
First impressions of iOS4 on new iPod touch
Sunday, July 4th, 2010
My original first gen iPod touch has been, without doubt, the best gadget I’ve ever purchased. It’s been used daily for almost 3 years now as an MP3 player, email client, TV, photo album, and PDA. However, the headphone socket has only played out of one ear for about 6 months, and about a month ago I lost a whole swathe of pixels at the bottom of the screen, so today I went out and bought a new one.
Initially I thought this was going to work out well. I synced the new iPod with iTunes and it said it was going to upgrade to iOS 4 and restore the settings from my old iPod – great. However, I’m now collecting a list of problems which I thought I’d enumerate here.
- Couldn’t connect to my home wifi network. The new iPod picked up the stored password from my settings, but couldn’t connect to my network. All other machines (and my old iPod) worked fine. I had a very similar problem where my MacBook Pro wouldn’t connect to my parents’ network after upgrading to Snow Leopard. Research for that problem led me to several posts which linked the problem to the use of WEP on the connection. I therefore changed the encryption to WPA and the new iPod now connects. Is this a backdoor route through which Apple hope to increase the security of home networks?
- Couldn’t connect to my email. Although my email settings were carried over, the password wasn’t. Oddly I wasn’t prompted for it but rather told to go into the mail settings to find the appropriate box in which to add it. Once I’d added the password I could connect.
- But – I’ve had some oddities with my email. I use an SSL encrypted IMAP account and I’ve had messages which wouldn’t move out of unread status no matter how many times I viewed them. I also had a ‘ghost message’ with no title and no content – but it was apparently unread and couldn’t be read or deleted. It took a reboot to get rid of that, but it’s reoccurred after some more use. It seems that I’m not alone, but the immediate fix (use the multitasking switcher to kill mail) doesn’t work for me because this iPod doesn’t support multitasking.
- Although all my media seems to have synced over OK, one of my podcasts (This Week in Tech as it happens) is duplicated on my iPod. The same episodes appear in both copies. I’ve removed and re-added the podcast from iTunes, but it just duplicates it all over again. No idea what else to do with this.
This experience has further reinforced my love-hate relationship with Apple. When their stuff works it’s great. Well thought out, nicely implemented and consistently integrated. Everything you could hope for really. The problem comes when things don’t work out properly. The ecosystem within which Apple operates is so closed that they won’t acknowledge when problems exist and you’re left with an unknown wait until a future update magically fixes things. People contacting Apple support about the IMAP problems have been talked through resetting their mail accounts and a few other fixes which Apple must know don’t work. Contrast this with open source solutions where all bug tracking is out in the open, so you can not only see that the problem exists and has been identified, but also get an official response on the best workaround and track the progress on a permanent fix. Would adopting this kind of system really cause harm to Apple?
Posted in Computing, Technology