Choosing the best format for raw sequence data

Introduction

In the current Illumina pipeline, raw sequence data is generated in qseq files, which can optionally be converted to the more standard FastQ format for use with other analysis programs.  The FastQ files produced are uncompressed text files and take up a considerable amount of space in our storage system.  We’ve therefore been thinking about either compressing these files or converting them to another format to reduce the storage they require.

At the same time we’ve been expanding the range of compression schemes supported in FastQC, which gives us a good impression of how quickly we can extract data from the different available formats.  Since I’ve collected some data on the storage and processing requirements for these formats, I thought I’d share it to help inform others who may be making similar decisions.

We have two choices.  The first is simply to compress the FastQ files with a standard compression scheme, the two most commonly used being gzip and bzip2.  Many analysis programs support gzipped FastQ files as input in addition to uncompressed files, and a few are starting to support bzip2.  Bzip2 is often chosen over gzip because it achieves a higher compression ratio.  The second choice is to put the raw data into a BAM file.  This format is specifically designed to hold high throughput sequence data: it stores reads in a compact binary encoding and compresses the result in gzip-compatible blocks (BGZF).  The BAM format was primarily designed to hold information about sequences which had been mapped to a reference sequence.  It also allows raw sequences with no associated mapping to be stored, at the cost of some overhead for the unused mapped position fields.
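As a rough illustration of the first choice, the sketch below writes gzip and bzip2 copies of a FastQ file using only the Python standard library and reports the resulting sizes.  It is not how the figures in this post were produced; the function names and the `.gz`/`.bz2` output naming are assumptions made for the example.

```python
import bz2
import gzip
import os
import shutil


def compress_copy(src_path, dest_path, opener):
    """Write a compressed copy of src_path to dest_path; return its size in bytes."""
    with open(src_path, "rb") as src, opener(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)
    return os.path.getsize(dest_path)


def compare_sizes(fastq_path):
    """Return a dict mapping each format name to its file size in bytes.

    The compressed copies are written next to the input file
    (hypothetical naming: <input>.gz and <input>.bz2).
    """
    return {
        "uncompressed": os.path.getsize(fastq_path),
        "gzip": compress_copy(fastq_path, fastq_path + ".gz", gzip.open),
        "bzip2": compress_copy(fastq_path, fastq_path + ".bz2", bz2.open),
    }
```

Calling `compare_sizes("reads.fastq")` (a placeholder file name) leaves the original file untouched, so the three sizes can be compared before deciding which copy to keep.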

For the tests I used a FastQ file containing around 500,000 reads, each 33bp long.  The processing times were taken as the time to process the file completely with FastQC.  Since the processing overhead for the QC analysis should be the same in all cases, any differences will be attributable to the different amounts of data needing to be read from disk and to the CPU time required for decompression.  The tests were run on a MacBook Pro laptop (so neither the fastest hard drive nor the speediest CPU).  FastQC uses pure Java decompression code: for gzip this is built into the JRE, and for bzip2 I used the jbzip2 library.
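A similar comparison can be sketched outside FastQC.  The example below is a stand-in rather than the test actually run here: it uses Python's standard library decompressors instead of the Java ones, and it times only a full read of the file with no QC analysis, so absolute numbers will differ.  The `OPENERS` table and the function name are assumptions for illustration.

```python
import bz2
import gzip
import time

# Map a format label to the function that opens a file of that format.
OPENERS = {"plain": open, "gzip": gzip.open, "bzip2": bz2.open}


def time_full_read(path, fmt):
    """Read every line of path; return (line_count, elapsed_seconds).

    fmt must be one of the keys of OPENERS.
    """
    start = time.perf_counter()
    line_count = 0
    with OPENERS[fmt](path, "rb") as handle:
        for _ in handle:
            line_count += 1
    return line_count, time.perf_counter() - start
```

Running it several times per file and taking the median helps separate decompression cost from disk caching effects, since a second read of the same file is often served from memory.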

Results

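| File type              | File size (MB) | Time to process (seconds) |
|------------------------|----------------|---------------------------|
| Uncompressed FastQ     | 69.8           | 14.1                      |
| Gzip Compressed FastQ  | 17.5           | 11.0                      |
| Bzip2 Compressed FastQ | 13.9           | 72.1                      |
| BAM                    | 16.3           | 11.5                      |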

Conclusions

It is clear that converting your raw FastQ files to a more efficient storage format produces significant savings in disk space.  Reducing your storage requirements by a factor of 4-5 while actually making your processing faster in some cases is a win-win proposition.

From the results presented it seems clear that, unless disk usage is critically important, bzip2 compression is not a viable solution.  Increasing your processing time by over 500% relative to gzip for a 20% reduction in size does not seem to be a good trade-off.  I’ve also checked using the command line gzip/bzip2 decompression utilities in case the effect I saw was an artefact of the Java implementations, but the size of the difference between the two was similar there as well.
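The figures quoted in these conclusions can be rechecked from the results table.  The short sketch below copies the table values into a dictionary (the dictionary and function names are just for illustration) and derives the size reduction factors and the bzip2 versus gzip trade-off.

```python
# Figures copied from the results table: (size in MB, processing time in seconds).
RESULTS = {
    "Uncompressed FastQ": (69.8, 14.1),
    "Gzip Compressed FastQ": (17.5, 11.0),
    "Bzip2 Compressed FastQ": (13.9, 72.1),
    "BAM": (16.3, 11.5),
}


def size_reduction_factor(name, baseline="Uncompressed FastQ"):
    """How many times smaller the named format is than the baseline."""
    return RESULTS[baseline][0] / RESULTS[name][0]


def percent_change(new, old):
    """Signed percentage change going from old to new."""
    return (new - old) / old * 100.0


def bzip2_vs_gzip():
    """Return (size change %, time change %) for bzip2 relative to gzip."""
    gz_size, gz_time = RESULTS["Gzip Compressed FastQ"]
    bz_size, bz_time = RESULTS["Bzip2 Compressed FastQ"]
    return percent_change(bz_size, gz_size), percent_change(bz_time, gz_time)
```

With these numbers the reduction factors come out at roughly 4.0 for gzip, 5.0 for bzip2 and 4.3 for BAM, and bzip2 relative to gzip is about 21% smaller but about 555% slower to process.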

Choosing between gzip and BAM is less clear.  It’s probably fair to say that at the moment more analysis programs support gzipped FastQ files as input than support BAM files (which are more normally seen as an output format), but this may change in the future.  BAM files offer the prospect of adding mapping data alongside your sequence data with only a minimal increase in file size, which may be a benefit to some.  If you’d prefer to keep your raw data separate from derived data then gzipped FastQ files would seem to be the better choice.

In our case we’re going to opt for simply gzipping our FastQ files since this seems to be a simple process which won’t affect any of our existing workflows or processing and which will return to us a significant amount of storage space.
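For scripts of our own that read FastQ directly, supporting the gzipped files can be a small change.  The sketch below is one way to do it in Python, assuming simple four-line FastQ records (no wrapped sequence lines) and choosing the decompressor from a `.gz` file extension; both function names are hypothetical.

```python
import gzip


def open_fastq(path):
    """Open a FastQ file as text, decompressing on the fly if the name ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_fastq(path):
    """Yield (header, sequence, quality) tuples, assuming four-line records."""
    with open_fastq(path) as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file (or a blank line) stops the iteration
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line, ignored here
            quality = handle.readline().rstrip("\n")
            yield header, sequence, quality
```

Code that previously iterated over `read_fastq("reads.fastq")` can then be pointed at `reads.fastq.gz` (placeholder names) without any other change.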


Published: June 16, 2011
