We had an interesting discussion at the Bioinfo-core workshop at ISMB2010. The discussion centred around the best way to handle the logistics of making sequence data available to the users of a sequencing service. The problem is that the data is so big that, even if you have a large central store, you run the risk of users taking copies of the data and filling up all of your day-to-day storage as well.
There are a couple of options which present themselves:
- You just pass a copy of the data to the users and let them worry about where they put it. This is of course the easiest path, but will end up with users putting both data and subsequent analysis on insecure local disks on their workstations – potentially leading to big data losses when things inevitably go wrong.
- You keep all of the data on the central store and then expose this over the network (probably in a read-only fashion) to the users. This eliminates the problem of duplicated storage, but could still be problematic if the users generate large derived datasets. It will also place heavy stress on your network infrastructure, as the data will effectively be downloaded every time a user runs an analysis against it.
- You keep all of the data on the central store and provide a remote access system for analysing it. In this scenario you eliminate both the data storage and transfer problems since the data is always contained within the central facility and only views of the derived data are shown to the user. The big downside is that the core facility is then responsible for managing the computational power required by all of their users for all of their analyses, which could add significant cost and workload to the core facility.
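To make the trade-offs between these three options a little more concrete, here is a rough back-of-envelope sketch. Everything in it is an assumption made for illustration: the `Workload` fields, the function name, and the simplifications that the central store keeps its own copy in the first option, that every analysis in the second option reads the whole dataset over the network, that remote views in the third option transfer a negligible amount of data, and that derived datasets are ignored entirely.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of one sequencing dataset and its users."""
    dataset_tb: float       # size of the primary data, in terabytes
    users: int              # number of users working with the data
    analyses_per_user: int  # full passes over the data made by each user


def option_footprint(w: Workload) -> dict:
    """Estimate total storage and network transfer (TB) for each option.

    Derived data is ignored, so these figures are lower bounds.
    """
    return {
        # Option 1: central copy plus one full copy per user,
        # each downloaded once.
        "copy_to_users": {
            "stored_tb": w.dataset_tb * (1 + w.users),
            "transferred_tb": w.dataset_tb * w.users,
        },
        # Option 2: one central copy, but re-read over the network
        # on every analysis run.
        "network_share": {
            "stored_tb": w.dataset_tb,
            "transferred_tb": w.dataset_tb * w.users * w.analyses_per_user,
        },
        # Option 3: one central copy, analysed in place; only views
        # leave the facility (approximated here as zero transfer).
        "remote_analysis": {
            "stored_tb": w.dataset_tb,
            "transferred_tb": 0.0,
        },
    }


if __name__ == "__main__":
    example = Workload(dataset_tb=2.0, users=5, analyses_per_user=10)
    for name, cost in option_footprint(example).items():
        print(f"{name}: {cost['stored_tb']} TB stored, "
              f"{cost['transferred_tb']} TB transferred")
```

The numbers in the example are made up, but the shape of the result matches the argument above: copying multiplies storage, sharing over the network multiplies transfer, and remote analysis avoids both at the price of the facility providing the compute.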
The third option is the most logical and efficient, but it would require convincing users of a service that they should pay up front for the costs of all future storage and analysis of their data. Given that enterprise storage is significantly more expensive than most users would expect, this could be a hard sell, but it’s probably something we should at least try.
Another aspect to consider in all of this is that, as the volume of publicly available sequencing data increases, there are going to be many more occasions on which users will want to re-analyse public data rather than generate the data themselves. Managing the influx of this external data is also likely to be a significant challenge for core facilities. These data sets are every bit as large as those created in house, and will take just as much computational resource to analyse. It would make sense to include these sorts of data sets in the same storage and analysis system you use for your in-house data, although you would have the option to delete the primary data should disk space become limiting.
The long-term solution to many of these problems may come from cloud computing. Storing data in central repositories and making these available to servers in the cloud would provide a scalable solution with the sort of cost-effective, flexible options we’re all likely to need. The problem is that, at the moment, the costs of storage in the cloud make this sort of system uneconomic. Cloud storage has not typically been associated with IO-bound analyses of huge datasets, but there’s no intrinsic reason why this shouldn’t be possible.
A wish for the future would be that cloud service providers recognise this potential market and cater to it. This would require making a read-only copy of some of the main data repositories (SRA, GEO, EMBL etc) permanently available to any cloud server (possibly at a small cost). Since these data repositories could be shared between users, this wouldn’t be as arduous as having each user maintain their own data, and users could simply scale their CPU demands to perform their desired analyses. This would dramatically lower the barrier to getting into this area and would reduce the cost for people not able to saturate a dedicated compute cluster with their own analyses.