Illumina have recently released an updated version of their downstream analysis software CASAVA. This is the analysis pipeline which runs after the sequencer has processed the raw data down to base call files and provides a variety of functionalities from creating usable base calls to alignment and variant calling. Casava 1.8 makes some major changes to previous versions so this isn’t going to just be a drop in replacement for the workflow you used to have. You’re going to have to think about this one.
As we’ve just got through the processing of our first run with Casava 1.8 I wanted to write up the problems (and in some cases solutions) we’d found both so I can remind myself later on and hopefully help other people who might end up in a similar situation.
Our existing pipeline
Our usage of the Illumina pipeline is fairly straight forward. We use it to generate sequence with qualities and in most cases we use Eland to map to a reference genome. We haven’t done any variant calling through the pipeline and RNA-Seq mapping was done externally with tophat rather than with Illumina’s own tools. We have a LIMS system which manages our samples for us and we link into the run folders created by the sequencer to present data back to the user.
A long long time ago…
In older versions of the pipeline there was a strictly hierarchical set of programs which created a structured set of folders which matched the order in which the programs were run. Each folder was named after the program which created it and the time it was run. The order of execution was:
Images (tiff) > [firecrest] > Intensities (cif) > [Bustard] > Base Calls (qseq) > [Gerald] > Filtered Base Calls (fastq) and mapped data (sorted)
In more recent times much of this processing moved into SCS on the sequencer itself so the pipeline actually started at base calls, leaving only a small script to create qseq files and then run Gerald as normal. You could still run all or part of the full pipeline if you wanted to.
There were good and bad things about this pipeline. On the plus side you had full control over what was run, and it was easy to tell when looking through a run folder exactly what had been done simply by looking at the directory structure. Configuration was achieved mostly within a single gerald config file which said what type of analysis you wanted to run and provided details of the reference genome to use. This config file could be passed to any stage in the pipeline and would pass down the levels so that with a single command you could run the whole pipeline from images to mapped data and be confident it was all going to work.
On the bad side this pipeline got quite clunky if you had to demultiplex your samples. There were also issues with file types – illumina made their own variant of fastq files which used a different quality encoding to everyone else (two different variants actually), they also used a non-standard format for their alignments which required a non-trivial conversion to a more standard format for some downstream applications. Also the final results files were named solely after the lane in which they were sequenced, so you couldn’t tell from the file name alone with run a file had come from, which was a pain when mixing files from different runs in the same analysis. Finally the output files tended to be uncompressed text files, which as outputs got bigger (especially with the advent of the HiSeq) got very big and unwieldy.
Having said all of this, the system worked and had remained largely unchanged since the dawn of the GAII, albeit with different sections moving around between the pipeline server to the IPAR (remember those?), and the control PC.
A new hope…
Casava 1.8 was Illumina’s attempt to address some of the problems which had arisen in older versions. It wasn’t just a tweak on the existing pipeline but rather a major update which fundamentally changed the way in which the pipeline was run and the types and locations of the output files. The update had been planned for many months, with Illumina putting out a list of proposed changes for comment and responding to questions about these well before the release of the new pipeline. The major changes were to be:
- The new pipeline would be based around standard file formats (fastq with sanger encoding, and BAM for alignments)
- The pipeline would have much improved support for barcoded samples
- The pipeline would allow sample annotations to be used to identify samples and help with data management
Concerns were raised about some of these proposals, but reassuring noises were made by Illumina who assured everyone that people with existing workflows would be able to transition smoothly to the new system, which would also make new analyses easier to run.
The Empire Strikes Back…
So now that Casava 1.8 has been released how much does this new pipeline make the life of the bioinformatician easier or harder? Since we’ve been working on our first Casava run I’ll go through the changes we’ve had to make and the problems we’ve hit. It is not at all unlikely that there are simple work rounds for some of our problems, but I’ve spent a long time going through the documentation and if there is an easy way around this then I haven’t yet found it.
One of the major annoyances in the new version is the lack of consistency in the config files required to process a run. Up until now there was only one config file (gerald.conf) which laid out the type of analysis required, either globally or on a per-lane basis. With the new version there are now two separate config files required along with many more command line options.
In the new casava the gerald.conf file is still present (although now called config.txt) with substantially the same options as before, but this is now supplemented by a compulsory sample sheet csv file which must be provided when processing of the run starts. The sample sheet allows you to specify a flowcell id, sample name, description and associated project name along with any barcode sequences, a reference genome, sequencing recipe, operator and a flag to say if the sample is a control. You must provide a sample sheet to complete base calling and the sample names and project names are used to name the output files from base calling. Any listed barcodes are used to split the sequences into their respective subgroups.
There are a couple of problems here. Firstly there’s nothing in the documentation which says which fields in the sample sheet are required and which are optional. It appears that the only things you actually need are the lane, sample name and project name fields. You can specify a flowcell id but it’s not validated against the run you’re processing. You can specify a reference genome, but it’s not actually used when you do an alignment – you have to specify it again when you write your gerald.conf file. The other fields seem to only be there for your records and not because the pipeline uses them.
If, like us, you already have a LIMS system for managing your samples then having to write out a blank sample sheet I don’t need is an annoyance I could do without. Also, since our LIMS needs to automatically recognise the output files from a run folder this is substantially harder to do when there is no standard default blank sample sheet. It’s going to take a lot more parsing to find the files associated with a given lane than in the previous release although since either the prefix or suffix of the files involved is controlled this is still possible.
Next there is the problem of configuration. In the old system there was a single configuration file which you presented at the top of your analysis stack and which passed down to all levels. In the new system it’s much more of a mess.
- For base calling you need to provide a sample sheet.
- For alignment you need to provide a gerald conf file
- For BAM creation and variant calling you need to specify the reference genome on the command line and do this separately for every sample.
That’s 3 different places you can set the reference genome, only 2 of which are ever used! The reference genome now also needs to be a set of fasta files which need to be in files ending with .fa (mine ended in .txt but the docs don’t say that, you just work it out when you get an error when you try to use them).
Whereas the transition from one stage of the pipeline to the next used to be handled via a recursive make, the new system now works by letting you specify a command to run after the current analysis has successfully completed. This is in some ways more flexible as it allows you to insert your own custom scripts into the workflow, but it’s also a real pain to work out what set of commands you’re going to need to run to get to the very end of the pipeline, especially as these are going to reference folder names which don’t yet exist, and whose names are dynamically generated from the data in the sample sheet. Add to this that BAM file creation is now a per-sample process and this means that configuring a full run is now a major undertaking, and one which is very fragile since the pipeline itself cannot validate the options for anything beyond the initial analysis step. Unless you have a completely standard analysis then this is going to put an end to running stuff over the weekend and being sure that it will have finished successfully when you come back on Monday!
The move to use standard formats for output files should have made life easier for users of the pipeline, but in practice it’s actually harder to work with the new system than the old one.
One change which has been introduced, probably due to the number of sequences generated by hi-seq machines, is that you now get a set of output files rather than a single file for all of the analyses. This was sometimes the case for internal formats in previous releases (eg qseq files), but not it applies to some of the final outputs. The sets of files have a common start and end with a grouping number which let you know how to recombine them. So, for a GAIIx lane you might expect to get up to 8 gzipped fastq files for each lane. You can pass an option to increase the number of sequences in each file , but if you make that limit over 16million then it breaks eland, so you’re really stuck with multiple output files. This same problem affects eland output as well with multiple gzipped export files.
Another problem with the fastq output is that unlike older versions of the pipeline the fastq files in Casava 1.8 are not filtered, so you’ll have an entry for every cluster on your flowcell. A flag in the header tells you if your sequence should be filtered, but none of the downstream programs will understand that, so you’re going to have to do it yourself, which takes away the point of having fastq files assembled for you already!
Illumina do acknowledge both of these problems in the documentation, and provide documented workrounds. Unfortunately both of the workrounds they provide are broken – they say that you can simply cat the gzipped fastq files together to combine them, but unfortunately this puts gzip headers in the middle of your file. Some decompressors (gzip itself for example) can understand this and will extract all of the data, but others (the standard java gzip extractor for example) do not, and just stop extracting data (with no warnings or errors) after the end of the first compressed section. There is also a somewhat complex grep statement provided which should filter out flagged sequences to produce a fastq file, but this was reported to create broken fastq files from some inputs, and leaves you with uncompressed data.
We ended up writing our own short Perl script which filtered, combined and re-compressed the fastq files but this (well the compressing part) is a slow process which we’re currently performing manually.
The other standard output file used in the new version is BAM files for aligned output. What wasn’t said though is that this is not the default output from eland (which is still export files, albeit now gzipped). Creating BAM output requires that you run part of the variant calling pipeline to run the sort and bam targets to turn your export files into a bam. There is a stand alone script to to this without going through variant calling, but this only produces SAM files. There is also a –bam option for Eland, but you can only set this (according to the docs at any rate) if you run Eland as a stand alone program, and not by running the actual pipeline. The documentation for this part of the pipeline is tortuous with at least 3 different way to create BAM files all with their own options and requirements.
When you do finally create a BAM you find that in order to do this you must create an arbitrarily named directory (so no chance of finding it with our LIMS), which contains a completely generically named bam file (sorted.bam), so really no way to associate this robustly with a sample, along with separate bam files for every chromosome in your reference genome, even if you’re not doing any other analysis. All to get the mapped data in an output you can use with any other downstream application.
Finally, the base calls, alignments and variant calls are now not stored in a structured heirarchy under the data folder, but get their own top level folders inside the run folder (base calls go in ‘Unaligned’, export files go in ‘Aligned’ and BAM files go wherever you tell them, with no defaults). What this means in effect is that you can only run one set of base calls and alignments if you need to be able to reliably predict where your output will go. Subsequent runs will fail if the output folder already exists. I realise many people won’t ever both reprocessing their data, but we do this all the time and this change is going to limit what we can do. We can work around it locally, but we also distribute a LIMS which has to find these files on other sites and this gets to be impractical in the new system (and impossible in the case of variant calling output).
The bioinformatician strikes back…
So what is the outcome of all of this?
In our case the outcome is that we’re going to:
- Use a standard default sample sheet for all runs so we at least get consistent directory naming
- Rewrite our LIMS to be able to recognise the new folder structure generated from our standard sample sheet
- Write a script to filter and combine the multiple fastq files into a single file we can give back to our users
- Stop using Eland for our mapping. It’s just too much of a pain to put this into the pipeline and link the results with our LIMS. We were using bowtie and tophat for secondary mappings anyway but these will take over as our primary mapping pipeline.
- Work out a nasty hack to allow our LIMS to recognise reprocessed folders – which won’t work on any site other than ours.
All in all, for our use case this update has made our pipeline significantly less functional than it used to be. I can see that there are probably people who would benefit from some of the changes here, but they are far from being the promised panacea. I’d be interested to know how many groups are actually doing any downstream analysis in Casava, as opposed to stopping after base calling and using other options. I got used to getting funny looks when I said we were using Eland, and I suspect these changes will make the users of Casava for analysis an ever rarer breed.
With luck some or all of the problems we’ve experienced will be rectified in future releases. But I think once we’ve moved away from doing anything beyond base calling in Casava then I don’t see us moving back.
Since posting this originally I’ve received some feedback which might be useful.
It appears that you can run Casava without a sample sheet at all, even though the documentation makes no mention of this, and actually says it will read one from a default location if one isn’t specified. The output directories are named after the flowcell id for the the project name, and the lane number for the samples – which is way better than using a minimal default sample sheet.
In a more recent update Illumina say they are intending to release an update to Casava (1.8.1) which will change the behaviour of the BCL converter to not include reads which failed QC by default. They will also provide an optional flag to write all of the sequence output into a single file. This should solve our problems with this part of the pipeline.