Archive for June, 2011
Want to improve your science? Get a dog.
Sunday, June 19th, 2011
Actually the dog is somewhat irrelevant – it’s what comes with it which matters. One of the side-effects of dog ownership is that you get to spend an hour or so a day out walking, which means you have an hour or so with your own thoughts and no distractions.
I’m sure everyone has experienced the situation where you go to sleep thinking about a problem, and then wake up in the morning with the answer. It’s pretty common for your brain to replay the events of the day whilst you sleep and it’s surprising how just making time to think about something can can help you to clarify and understand problems you’re working on. Doing this overnight is good as far as it goes, but can end up somewhat surreal as you effectively relinquish control of your train of thought.
The same principle applies during the day though. If your brain isn’t immediately occupied with a specific task it will tend to go back over other problems you’re working on at the moment. The problem is that in a modern office or lab setting we hardly ever have time when we’re not being presented with something to do. Even when we’re not at work it feels really unnatural to not be doing anything. There’s always something to read, watch or do so your brain never gets chance to drift back to the topics which could usefully employ it. That’s where the dog comes in.
I actually first noticed this effect when I started cycling to work. My commute by bike takes me about an hour (which is why I don’t do it all that often!), but I found that when I cycled to work I would be hugely productive for the first couple of hours of the day and find really creative solutions for things I was working on. I first put this down to having more energy from my early morning exercise, but I then found that if I put on a podcast whilst I cycled I was no more productive than if I’d driven – so it wasn’t the cycling, it was having an hour where my brain had nothing to do other than chew over the problems I was working on.
Dogs are even better in this respect in that you don’t really get a choice about whether you’re going to take them out. You’re forced to spend some time every day with no external distractions from which you can gain the attendant benefits.
Somewhat perversely I find that conferences work this way too, only they provide a double benefit. Anyone who’s ever been to a conference will know that talks fall into two categories – those which capture your attention and provide new and interesting ways of interpreting your science, and those in which you switch off and stop listening in the first five minutes. Actually I’d argue that both of these types of talk are beneficial. The first inspires you and gives you new information to process, and the second gives you some uniterrupted time to think about what you learned in the first. Many of the most intersting projects I’ve worked on have begun during poor talks in a conference, where I’ve stopped listening to the current talk and have sketched out the structure for a software package, or a theory which I could test. Granted it doesn’t always work this way. I once spent an hour in a particularly dull keynote writing a program which would rate all of the keynotes in a conference by the change in ping times on the wireless network (it’s conclusions matched surprisingly well with my own personal judegment), but if not productive at least that was creative.
My tip for improved science then is not necessarily to get a dog – although I can heartily recommend doing so – it’s to force yourself to make some time each day where you don’t have anything to think about. Your brain will thank you for it.
Posted in Bioinformatics, Computing | Comments Off
Choosing the best format for raw sequence data
Thursday, June 16th, 2011
Introduction
In the current Illumina pipeline raw sequence data is generated in qseq files, but can optionally be converted to the more standard FastQ format for use with other analysis programs. The FastQ files produced are uncompressed text files and take up a considerable amount of space in our storage system. We’ve therefore been thinking about either compressing or converting these files to save on the amount of storage they require.
At the same time we’ve also been expanding the range of compression schemes supported in FastQC which gives us a good impression of how quickly we can extract data from the different available formats and since I’ve collected some data on the storage and processing requirements for the different formats available I thought I’d share these to help inform others who may be making similar decisions.
The choices we have are to simply compress the fastq files with a standard compression scheme. The two most commonly used are gzip and bzip2. Many analysis programs support gzipped fastq files as input in addition to uncompressed files, and a few are starting to support bzip2. Bzip2 is often chosen over gzip because it compresses data more efficiently. The other choice is to put the raw data into a BAM file. This format is specifically designed to hold high throughput sequence data and uses a compression scheme which is designed to be optimal for sequence data. The BAM format was primarily designed to hold information about sequences which had been mapped to a reference sequence, but it also allows for raw sequences with no associated mapping to be stored, but with some overhead for the mapped position fields which are not used.
For the tests I used a fastq file containing around 500,000 reads of 33bp in length. The processing times were taken as the time to process the file completely with FastQC. Since the processing overhead for the QC analysis should be the same in all cases any differences will be attributable to the different amount of data needing to be read from disk, and the CPU time required for the decompression. The tests were run on a MacBook Pro laptop (so not the fastest hard drive, or the speediest CPU). FastQC uses pure java decompression code. For gzip compression this is built into the JRE and for bzip2 I used the jbzip2 library.
Results
| File type | File size | Time to process (seconds) |
|---|---|---|
| Uncompressed FastQ | 69.8MB | 14.1 |
| Gzip Compressed FastQ | 17.5MB | 11.0 |
| Bzip2 Compressed FastQ | 13.9MB | 72.1 |
| BAM | 16.3MB | 11.5 |
Conclusions
It is clear that converting your raw FastQ files to a more efficient storage format will produce significant gains in disk space usage. Reducing your storage requirements by a factor of 4-5X and actually making your processing more efficient in some cases is a win-win proposal.
From the results presented it seems clear that unless disk usage is critically important then bzip2 compression is not a viable solution. Increasing your processing time by over 500% for a 20% reduction in size does not seem to be a good trade off. I’ve also checked using the command line gzip/bzip2 decompression utilities in case the effect I saw was an artefact of the java implementations, but the size of the difference between the two was similar there as well.
Choosing between gzip and BAM is less clear. It’s probably fair to say that at the moment more analysis programs support gzipped fastq files as input than support BAM files (which is more normally seen as an output format), but this may change in the future. BAM files offer the prospect of adding in mapping data alongside your sequence data with a minimised increase in filesize which may be a benefit to some. If you’d prefer to keep your raw data separate from derived data then gzipped FastQ files would seem to be the better choice.
In our case we’re going to opt for simply gzipping our FastQ files since this seems to be a simple process which won’t affect any of our existing workflows or processing and which will return to us a significant amount of storage space.
Tags: compression, fastqc, java
Posted in Bioinformatics, Computing | Comments Off
Adding custom chromosome name mappings into SeqMonk
Saturday, June 11th, 2011
When loading data into SeqMonk the program has to try to connect the chromosome names used in your data file with those which are present in the genome one which your project is based. In many cases there won’t be an exact match between the two – many mapping programs report file names in their output so you often see names like ‘chr1.fa’ used to represent chromosome 1.
In order to match as many entries as possible SeqMonk tries a few different tricks to match up the chromosome names if it doesn’t immediately find an exact match. It will do simple manipulations such as removing ‘chr’ from the front of the name and removing things like ‘.fa’, ‘.txt’ and ‘.gz’ from the end. However in some cases it isn’t able to match up the names, and in these cases the read is rejected and you’ll get an import warning saying:
‘Couldn’t extract valid name for [some string which wasn't a chromosome name]‘
The most common offender for this message is the mitochondrion which is variously labelled as M, Mt, MT or some other minor variation. The other one we commonly see are the yeast genomes which seem to alternate between using roman and arabic numbers to describe their chromosomes.
If you get this error and need to import the data which was rejected then how do you bring this data into SeqMonk? The most obvious way is to go back to your data file and rename your chromsomes to match those in SeqMonk. Whilst this is simple to describe it’s often practically quite hard since data files are usually very large, and if they are in a non-text format (eg BAM), then modifying them isn’t at all straight forward.
A better alternative then is to tell SeqMonk in advance about any chromosome name mappings which it isn’t able to figure out for itself. There is a pretty simple way to do this, but it isn’t very obvious in the documentation and most people who hit the problem don’t seem to find it so I thought I’d raise its profile.
The way to add custom chromosome name mappings in SeqMonk is to create a text file called ‘aliases.txt’ in the folder which contains the data for the genome assembly you’re using. If you were using the mouse NCBIM37 genome for your project the file would need to go inside the [Genomes dir]/Mus musculus/NCBIM37 folder. This file is a 2-column tab delimited text file where the first column is the name you want to be able to use in your data files and the second column is the name of a chromosome in your SeqMonk genome. Both of these names must be an exact match since the aliases are not subject to the same manipulations as normal chromosome names read from a data file. There is no problem having multiple aliases pointing to the same SeqMonk chromosome, but you can’t have the same aliases pointing to more than one SeqMonk chromosome (if you do, then the last one wins).
So, and example aliases file would look like:
chro1 1 II 2 NC1234 3 c3 3 bob X
Once this file is in place and SeqMonk has been restarted (or at least the genome has been reloaded) you should be able to import data files containing any of the names in the first column into your project without seeing any warnings. This file is only required when the data is first imported, after which all the chromosome names are turned into SeqMonk names, so you don’t need to distribute the aliases file to be able to use SeqMonk projects which required that file to have their data initially imported.
Tags: SeqMonk, software
Posted in Bioinformatics | Comments Off