Archive for the ‘Bioinformatics’ Category

Should you buy a nanopore sequencer?

Saturday, February 18th, 2012

This morning twitter is awash with posts discussing the newly announced nanopore sequencers from Oxford Nanopore. Speculation has been rife for some time about the potential specifications of the first sequencers to be produced by the company, and it certainly appears that the company have fulfilled the expectations placed upon them.

I’m not going to go through the details of the two sequencers announced – others have done a good job of listing the specs here and here, but basically you have machines with either 512, or 2000 nanopores capable of sequencing fragments up to 40kb (but probably several kb routinely) and error rates of around 4%, mostly indels, and with promises of imminent improvements to bring this value down – all at a per-base cost similar to the best of the existing platforms.

Reading through the initial comments about this new platform my first reaction was that we have to get one (or more) of these, but after calming down and thinking about this for a bit I thought I’d have a stab at going through the use cases where this type of sequencer really makes sense.

De-dovo sequence assembly: Oh yes!

The one place where this new platform is a complete no-brainer is if you’re assembling de-novo genome sequence. Whilst Illumina sequencers can give you good coverage depth from paired end reads of around 100bp there are always regions of the genome whose repetitive nature mean that this will not provide enough context to allow a contiguous assembly. Currently you either need to start creating mate-pair libraries, which are notoriously difficult to produce, or you need to get your floor reinforced and stump up a huge amount of cash for a PacBio.  The prospect of generating reads of 10kb+ with a simple library prep should be music to your ears, and a 4% error rate with short indels should be easy to work around with a mixed assembly.

Metagenomics: Oh yes!

In the same vein as de-novo assembly the propect of longer reads should make metagenomic studies much easier.  Getting more context for your reads should allow you to distinguish between related species much more easily and assembly of mixed bacterial populations should be possible even with the slightly more limited throughput of these sequencers.

Genotyping: Possibly

I guess the main advantages of the nanopore platform for genotyping is the speed with which it can generate data.  Data collection begins almost immediately upon addition of the sample, and real-time monitoring of the data output means that you can immediately stop the run once you have observed all of the variants you are looking for.  The long read lengths should allow the illucidation of even the most complex genome re-arrangements.  The somewhat high error rates may be problematic, but if these really are mostly indels, then SNP calling might still be practical.  The per-base cost means that the current sequencers aren’t yet practical for real time diagnostic use, but future developments on this platform would seem to make this a possibility.

Epigentics: Possibly

One of the promises of nanopore sequencing was the ability to distinguish modified bases during the base calling process.  PacBio have shown that they are able to distinguish hydrox-methyl-cytosine from cytosine, and suggest that identification of methyl-cytosine is theoretically possible. In the reports I’ve seen so far Oxford Nanopore haven’t said anything concrete yet about the ability of their platform to call modified bases, but if this proves to be possible and reliable then this will become an essential bit of kit for labs working on epigentics.  The ability not only to directly read modifications directly, but to be able to put these in the context of a multi-kb fragment is truly exciting.  The addition of a hairpin structure at the end of a fragment would also allow these sequencers to read both strands of the same fragment, again providing contextual information which has so far been lacking.

It’s possible that the nanopore sequencers may still be of use to epigenetics even without the ability to read modifications directly.  Genome wide bisulphite sequencing is already being undertaken on Illumina sequencers, and should be possible on nanopore sequencers, however the bisulphite treatment itself is very harsh, and fragments the DNA sample as it modifies it, so the super-length reads able to be obtained from normal genomic DNA may be elusive once it has been modified.

 ChIP-Seq: Not really

The power of ChIP-Seq comes from the number of observations you make, not the length of those observations.  The nanopore sequencers seem to be best suited to sequencing fewer, longer fragments which would not be an advantage for ChIP-Seq.  There seems no obvious reason why short insert libraries couldn’t be sequenced on a nanopore platform, but at the moment we know very little about the overhead of starting a new sequence on the same nanopore so this may be feasible, but longer read lengths would simply reduce the resolution of the ChIP assay.  For some applications it might be interesting to monitor ChIP results in real time, and be able to halt a run once clear peaks had emerged, but in the short term I can’t see this being a good option for this type of experiment.

RNA-Seq: It depends

As with ChIP-Seq, much of the power of RNA-Seq comes from the number of observations which have been made.  To make a reasonable measurement of low-expressed transcripts then very large numbers of sequences must be generated, and the existing short read platforms will likely have an advantage in this regard for some time, so for simple quantitation of transcripts the nanopore platform may not offer huge advantages.  Where the longer read lengths of the nanopore sequencers will  be of use will be in the elucidation and quantitation of splice variants.  Current RNA-Seq protocols provide coverage of a very small part of the transcript and often do not provide enough context to determine exactly which splice variant the reads came from.  Performing relative quantitation of the splice variants of a gene is therefore not a simple process.  Longer reads from a nanopore sequencer could cover the whole length of a transcript removing all doubt about exactly which variant it was.  Whether this proves to be a useful tool for expression quantitation will depend on whether the platform is able to generate an unbiased selection of reads (or a selection with a well understood bias) to allow accurate quantitation.

So…

So do I think we should get one of these sequencers?  Heck yes! For $900 a piece for the MinIon there’s absolutely no excuse for everyone not to get one and start playing with it to see what it can do. For much of the workload we currently have it may be that this platform isn’t going to revolutionise what we do, but if nothing else it will hopefully spur on the existing manufacturers to push forward the development of their existing platforms.  In any case the scientists win.  We live in exciting times…

Tags:
Posted in Bioinformatics, Technology | No Comments »


Moving over to Casava 1.8

Friday, September 16th, 2011

Introduction

Illumina have recently released an updated version of their downstream analysis software CASAVA.  This is the analysis pipeline which runs after the sequencer has processed the raw data down to base call files and provides a variety of functionalities from creating usable base calls to alignment and variant calling.  Casava 1.8 makes some major changes to previous versions so this isn’t going to just be a drop in replacement for the workflow you used to have.  You’re going to have to think about this one.

As we’ve just got through the processing of our first run with Casava 1.8 I wanted to write up the problems (and in some cases solutions) we’d found both so I can remind myself later on and hopefully help other people who might end up in a similar situation.

Our existing pipeline

Our usage of the Illumina pipeline is fairly straight forward.  We use it to generate sequence with qualities and in most cases we use Eland to map to a reference genome.  We haven’t done any variant calling through the pipeline and RNA-Seq mapping was done externally with tophat rather than with Illumina’s own tools.  We have a LIMS system which manages our samples for us and we link into the run folders created by the sequencer to present data back to the user.

A long long time ago…

In older versions of the pipeline there was a strictly hierarchical set of programs which created a structured set of folders which matched the order in which the programs were run.  Each folder was named after the program which created it and the time it was run.  The order of execution was:

Images (tiff) > [firecrest] > Intensities (cif) > [Bustard] > Base Calls (qseq) > [Gerald] > Filtered Base Calls (fastq) and mapped data (sorted)

In more recent times much of this processing moved into SCS on the sequencer itself so the pipeline actually started at base calls, leaving only a small script to create qseq files and then run Gerald as normal.  You could still run all or part of the full pipeline if you wanted to.

There were good and bad things about this pipeline.  On the plus side you had full control over what was run, and it was easy to tell when looking through a run folder exactly what had been done simply by looking at the directory structure.  Configuration was achieved mostly within a single gerald config file which said what type of analysis you wanted to run and provided details of the reference genome to use.  This config file could be passed to any stage in the pipeline and would pass down the levels so that with a single command you could run the whole pipeline from images to mapped data and be confident it was all going to work.

On the bad side this pipeline got quite clunky if you had to demultiplex your samples.  There were also issues with file types – illumina made their own variant of fastq files which used a different quality encoding to everyone else (two different variants actually), they also used a non-standard format for their alignments which required a non-trivial conversion to a more standard format for some downstream applications.  Also the final results files were named solely after the lane in which they were sequenced, so you couldn’t tell from the file name alone with run a file had come from, which was a pain when mixing files from different runs in the same analysis.  Finally the output files tended to be uncompressed text files, which as outputs got bigger (especially with the advent of the HiSeq) got very big and unwieldy.

Having said all of this, the system worked and had remained largely unchanged since the dawn of the GAII, albeit with different sections moving around between the pipeline server to the IPAR (remember those?), and the control PC.

A new hope…

Casava 1.8 was Illumina’s attempt to address some of the problems which had arisen in older versions.  It wasn’t just a tweak on the existing pipeline but rather a major update which fundamentally changed the way in which the pipeline was run and the types and locations of the output files.  The update had been planned for many months, with Illumina putting out a list of proposed changes for comment and responding to questions about these well before the release of the new pipeline.  The major changes were to be:

  1. The new pipeline would be based around standard file formats (fastq with sanger encoding, and BAM for alignments)
  2. The pipeline would have much improved support for barcoded samples
  3. The pipeline would allow sample annotations to be used to identify samples and help with data management

Concerns were raised about some of these proposals, but reassuring noises were made by Illumina who assured everyone that people with existing workflows would be able to transition smoothly to the new system, which would also make new analyses easier to run.

The Empire Strikes Back…

So now that Casava 1.8 has been released how much does this new pipeline make the life of the bioinformatician easier or harder?  Since we’ve been working on our first Casava run I’ll go through the changes we’ve had to make and the problems we’ve hit.  It is not at all unlikely that there are simple work rounds for some of our problems, but I’ve spent a long time going through the documentation and if there is an easy way around this then I haven’t yet found it.

Config files

One of the major annoyances in the new version is the lack of consistency in the config files required to process a run.  Up until now there was only one config file (gerald.conf) which laid out the type of analysis required, either globally or on a per-lane basis.  With the new version there are now two separate config files required along with many more command line options.

In the new casava the gerald.conf file is still present (although now called config.txt) with substantially the same options as before, but this is now supplemented by a compulsory sample sheet csv file which must be provided when processing of the run starts.  The sample sheet allows you to specify a flowcell id, sample name, description and associated project name along with any barcode sequences, a reference genome, sequencing recipe, operator and a flag to say if the sample is a control.  You must provide a sample sheet to complete base calling and the sample names and project names are used to name the output files from base calling.  Any listed barcodes are used to split the sequences into their respective subgroups.

There are a couple of problems here.  Firstly there’s nothing in the documentation which says which fields in the sample sheet are required and which are optional.  It appears that the only things you actually need are the lane, sample name and project name fields.  You can specify a flowcell id but it’s not validated against the run you’re processing.  You can specify a reference genome, but it’s not actually used when you do an alignment – you have to specify it again when you write your gerald.conf file.  The other fields seem to only be there for your records and not because the pipeline uses them.

If, like us, you already have a LIMS system for managing your samples then having to write out a blank sample sheet I don’t need is an annoyance I could do without.  Also, since our LIMS needs to automatically recognise the output files from a run folder this is substantially harder to do when there is no standard default blank sample sheet.  It’s going to take a lot more parsing to find the files associated with a given lane than in the previous release although since either the prefix or suffix of the files involved is controlled this is still possible.

Next there is the problem of configuration.  In the old system there was a single configuration file which you presented at the top of your analysis stack and which passed down to all levels.  In the new system it’s much more of a mess.

  • For base calling you need to provide a sample sheet.
  • For alignment you need to provide a gerald conf file
  • For BAM creation and variant calling you need to specify the reference genome on the command line and do this separately for every sample.

That’s 3 different places you can set the reference genome, only 2 of which are ever used!  The reference genome now also needs to be a set of fasta files which need to be in files ending with .fa (mine ended in .txt but the docs don’t say that, you just work it out when you get an error when you try to use them).

Whereas the transition from one stage of the pipeline to the next used to be handled via a recursive make, the new system now works by letting you specify a command to run after the current analysis has successfully completed.  This is in some ways more flexible as it allows you to insert your own custom scripts into the workflow, but it’s also a real pain to work out what set of commands you’re going to need to run to get to the very end of the pipeline, especially as these are going to reference folder names which don’t yet exist, and whose names are dynamically generated from the data in the sample sheet.  Add to this that BAM file creation is now a per-sample process and this means that configuring a full run is now a major undertaking, and one which is very fragile since the pipeline itself cannot validate the options for anything beyond the initial analysis step.  Unless you have a completely standard analysis then this is going to put an end to running stuff over the weekend and being sure that it will have finished successfully when you come back on Monday!

Output files

The move to use standard formats for output files should have made life easier for users of the pipeline, but in practice it’s actually harder to work with the new system than the old one.

One change which has been introduced, probably due to the number of sequences generated by hi-seq machines, is that you now get a set of output files rather than a single file for all of the analyses.  This was sometimes the case for internal formats in previous releases (eg qseq files), but not it applies to some of the final outputs.  The sets of files have a common start and end with a grouping number which let you know how to recombine them.  So, for a GAIIx lane you might expect to get up to 8 gzipped fastq files for each lane.  You can pass an option to increase the number of sequences in each file , but if you make that limit over 16million then it breaks eland, so you’re really stuck with multiple output files.  This same problem affects eland output as well with multiple gzipped export files.

Another problem with the fastq output is that unlike older versions of the pipeline the fastq files in Casava 1.8 are not filtered, so you’ll have an entry for every cluster on your flowcell.  A flag in the header tells you if your sequence should be filtered, but none of the downstream programs will understand that, so you’re going to have to do it yourself, which takes away the point of having fastq files assembled for you already!

Illumina do acknowledge both of these problems in the documentation, and provide documented workrounds.  Unfortunately both of the workrounds they provide are broken – they say that you can simply cat the gzipped fastq files together to combine them, but unfortunately this puts gzip headers in the middle of your file.  Some decompressors (gzip itself for example) can understand this and will extract all of the data, but others (the standard java gzip extractor for example) do not, and just stop extracting data (with no warnings or errors) after the end of the first compressed section.  There is also a somewhat complex grep statement provided which should filter out flagged sequences to produce a fastq file, but this was reported to create broken fastq files from some inputs, and leaves you with uncompressed data.

We ended up writing our own short Perl script which filtered, combined and re-compressed the fastq files but this (well the compressing part) is a slow process which we’re currently performing manually.

The other standard output file used in the new version is BAM files for aligned output.  What wasn’t said though is that this is not the default output from eland (which is still export files, albeit now gzipped).  Creating BAM output requires that you run part of the variant calling pipeline to run the sort and bam targets to turn your export files into a bam.  There is a stand alone script to to this without going through variant calling, but this only produces SAM files.  There is also a –bam option for Eland, but you can only set this (according to the docs at any rate) if you run Eland as a stand alone program, and not by running the actual pipeline.  The documentation for this part of the pipeline is tortuous with at least 3 different way to create BAM files all with their own options and requirements.

When you do finally create a BAM you find that in order to do this you must create an arbitrarily named directory (so no chance of finding it with our LIMS), which contains a completely generically named bam file (sorted.bam), so really no way to associate this robustly with a sample, along with separate bam files for every chromosome in your reference genome, even if you’re not doing any other analysis.  All to get the mapped data in an output you can use with any other downstream application.

Finally, the base calls, alignments and variant calls are now not stored in a structured heirarchy under the data folder, but get their own top level folders inside the run folder (base calls go in ‘Unaligned’, export files go in ‘Aligned’ and BAM files go wherever you tell them, with no defaults).  What this means in effect is that you can only run one set of base calls and alignments if you need to be able to reliably predict where your output will go.  Subsequent runs will fail if the output folder already exists.  I realise many people won’t ever both reprocessing their data, but we do this all the time and this change is going to limit what we can do.  We can work around it locally, but we also distribute a LIMS which has to find these files on other sites and this gets to be impractical in the new system (and impossible in the case of variant calling output).

The bioinformatician strikes back…

So what is the outcome of all of this?

In our case the outcome is that we’re going to:

  1. Use a standard default sample sheet for all runs so we at least get consistent directory naming
  2. Rewrite our LIMS to be able to recognise the new folder structure generated from our standard sample sheet
  3. Write a script to filter and combine the multiple fastq files into a single file we can give back to our users
  4. Stop using Eland for our mapping.  It’s just too much of a pain to put this into the pipeline and link the results with our LIMS. We were using bowtie and tophat for secondary mappings anyway but these will take over as our primary mapping pipeline.
  5. Work out a nasty hack to allow our LIMS to recognise reprocessed folders – which won’t work on any site other than ours.

All in all, for our use case this update has made our pipeline significantly less functional than it used to be.  I can see that there are probably people who would benefit from some of the changes here, but they are far from being the promised panacea.  I’d be interested to know how many groups are actually doing any downstream analysis in Casava, as opposed to stopping after base calling and using other options.  I got used to getting funny looks when I said we were using Eland, and I suspect these changes will make the users of Casava for analysis an ever rarer breed.

With luck some or all of the problems we’ve experienced will be rectified in future releases.  But I think once we’ve moved away from doing anything beyond base calling in Casava then I don’t see us moving back.

Addenda…

Since posting this originally I’ve received some feedback which might be useful.

It appears that you can run Casava without a sample sheet at all, even though the documentation makes no mention of this, and actually says it will read one from a default location if one isn’t specified.  The output directories are named after the flowcell id for the the project name, and the lane number for the samples – which is way better than using a minimal default sample sheet.

In a more recent update Illumina say they are intending to release an update to Casava (1.8.1) which will change the behaviour of the BCL converter to not include reads which failed QC by default.  They will also provide an optional flag to write all of the sequence output into a single file.  This should solve our problems with this part of the pipeline.

 

Tags: , ,
Posted in Bioinformatics, Computing | Comments Off


Importing RNA-Seq data into SeqMonk

Sunday, September 4th, 2011

Introduction

Mapped RNA-Seq data coming from eukaryotes is probably the most complicated data type to import into SeqMonk due to it’s relative complexity and the abundance of options with which you are presented.  Depending on exactly what sort of information you want to know about your data different data import options will be useful, so understanding the best way to import your data can be important.

Options

Virtually all RNA-Seq data will start out as a BAM / SAM file since this is the only format currently capable of describing mapped positions which span one or more splice junctions.  The options for import are therefore the options found in the BAM / SAM parser.  The options you have are:

  • Is the data single or paired end
  • Do you want to split the reads into spliced segments
  • Do you want to import introns instead of exons

Each combination of options will produce a different set of aligned reads and will be useful for different types of analysis.  The figure below summarises the reads which are imported from a set of spliced paired end mapped data.  The colours represent alignment direction (red=forward, blue=reverse).

Choosing the correct options

The options you choose will depend on what you want to do with your data.  Typically you either want to do one of three things:

  1. Quantitatively analyse the expression of each transcript
  2. Quantitatively analyse splice variation
  3. Identify novel transcripts

If you want to quantitatively analyse expression then you need to split your reads so that separate reads are recorded for each mapped splice section.  You should then combine this with the base pair quantiation so that each spliced read is counted proportionally in the different exons it spans.  If you have paired end data then selecting paired end import will simply reverse the orientation of any reads coming from the second read in a pair.  If your data is not directional then this won’t make any difference, but if you have a directional library this will ensure that the correct orientation is assigned to each read.

If your interest is in the analysis of splice variants then the easiest way to do this is to import only the introns from your mapped data. Introns will only be imported from a read which contains an internal splice site, reads which do not splice will be ignored.  Once the introns have been imported these should be quantitated using the read position probe generator, to put a probe over every different splice junction combination and then quantitated with the exact overlap quantitation to get an exact count for each splice combination seen in the data.  If you have paired end data then it’s best to import introns as paired end data so that splice junctions from the second read match exactly (position and strand) with those from the first read.

Finally, if your interest is in seeing the extend of novel transcripts then you can import spliced mapped data as a normal BAM / SAM file.  Since the mapped region is not expected to be contiguous on the genome you will need to greatly increase the filter for largest allowed distance between the ends of the reads to avoid rejecting reads which span introns.  This type of import will join the ends of the read (or reads for paired end data) so you get the most complete view of the region of the genome spanned by your reads, but the quantitative influence of each read will depend on the size of the introns spanned, so importing data this way should only be used as a qualitative way of viewing your data.  Combining this view of your transcriptome with a more conventional spliced import in a different track should give you the most complete view of the position and extent of each of your transcripts.

 

Tags: , ,
Posted in Bioinformatics | Comments Off


Want to improve your science? Get a dog.

Sunday, June 19th, 2011

Actually the dog is somewhat irrelevant – it’s what comes with it which matters.  One of the side-effects of dog ownership is that you get to spend an hour or so a day out walking, which means you have an hour or so with your own thoughts and no distractions.

I’m sure everyone has experienced the situation where you go to sleep thinking about a problem, and then wake up in the morning with the answer. It’s pretty common for your brain to replay the events of the day whilst you sleep and it’s surprising how just making time to think about something can can help you to clarify and understand problems you’re working on.  Doing this overnight is good as far as it goes, but can end up somewhat surreal as you effectively relinquish control of your train of thought.

The same principle applies during the day though. If your brain isn’t immediately occupied with a specific task it will tend to go back over other problems you’re working on at the moment.  The problem is that in a modern office or lab setting we hardly ever have time when we’re not being presented with something to do. Even when we’re not at work it feels really unnatural to not be doing anything.  There’s always something to read, watch or do so your brain never gets chance to drift back to the topics which could usefully employ it.  That’s where the dog comes in.

I actually first noticed this effect when I started cycling to work. My commute by bike takes me about an hour (which is why I don’t do it all that often!), but I found that when I cycled to work I would be hugely productive for the first couple of hours of the day and find really creative solutions for things I was working on.  I first put this down to having more energy from my early morning exercise, but I then found that if I put on a podcast whilst I cycled I was no more productive than if I’d driven – so it wasn’t the cycling, it was having an hour where my brain had nothing to do other than chew over the problems I was working on.

Dogs are even better in this respect in that you don’t really get a choice about whether you’re going to take them out.  You’re forced to spend some time every day with no external distractions from which you can gain the attendant benefits.

Somewhat perversely I find that conferences work this way too, only they provide a double benefit.  Anyone who’s ever been to a conference will know that talks fall into two categories – those which capture your attention and provide new and interesting ways of interpreting your science, and those in which you switch off and stop listening in the first five minutes. Actually I’d argue that both of these types of talk are beneficial.  The first inspires you and gives you new information to process, and the second gives you some uniterrupted time to think about what you learned in the first. Many of the most intersting projects I’ve worked on have begun during poor talks in a conference, where I’ve stopped listening to the current talk and have sketched out the structure for a software package, or a theory which I could test.  Granted it doesn’t always work this way.  I once spent an hour in a particularly dull keynote writing a program which would rate all of the keynotes in a conference by the change in ping times on the wireless network (it’s conclusions matched surprisingly well with my own personal judegment), but if not productive at least that was creative.

My tip for improved science then is not necessarily to get a dog – although I can heartily recommend doing so – it’s to force yourself to make some time each day where you don’t have anything to think about.  Your brain will thank you for it.

Posted in Bioinformatics, Computing | Comments Off


Choosing the best format for raw sequence data

Thursday, June 16th, 2011

Introduction

In the current Illumina pipeline raw sequence data is generated in qseq files, but can optionally be converted to the more standard FastQ format for use with other analysis programs.  The FastQ files produced are uncompressed text files and take up a considerable amount of space in our storage system.  We’ve therefore been thinking about either compressing or converting these files to save on the amount of storage they require.

At the same time we’ve also been expanding the range of compression schemes supported in FastQC which gives us a good impression of how quickly we can extract data from the different available formats and since I’ve collected some data on the storage and processing requirements for the different formats available I thought I’d share these to help inform others who may be making similar decisions.

The choices we have are to simply compress the fastq files with a standard compression scheme.  The two most commonly used are gzip and bzip2.  Many analysis programs support gzipped fastq files as input in addition to uncompressed files, and a few are starting to support bzip2.  Bzip2 is often chosen over gzip because it compresses data more efficiently.  The other choice is to put the raw data into a BAM file.  This format is specifically designed to hold high throughput sequence data and uses a compression scheme which is designed to be optimal for sequence data.  The BAM format was primarily designed to hold information about sequences which had been mapped to a reference sequence, but it also allows for raw sequences with no associated mapping to be stored, but with some overhead for the mapped position fields which are not used.

For the tests I used a fastq file containing around 500,000 reads of 33bp in length.  The processing times were taken as the time to process the file completely with FastQC.  Since the processing overhead for the QC analysis should be the same in all cases any differences will be attributable to the different amount of data needing to be read from disk, and the CPU time required for the decompression.  The tests were run on a MacBook Pro laptop (so not the fastest hard drive, or the speediest CPU).  FastQC uses pure java decompression code.  For gzip compression this is built into the JRE and for bzip2 I used the jbzip2 library.

Results

File type File size Time to process (seconds)
Uncompressed FastQ 69.8MB 14.1
Gzip Compressed FastQ 17.5MB 11.0
Bzip2 Compressed FastQ 13.9MB 72.1
BAM 16.3MB 11.5

Conclusions

It is clear that converting your raw FastQ files to a more efficient storage format will produce significant gains in disk space usage.  Reducing your storage requirements by a factor of 4-5X and actually making your processing more efficient in some cases is a win-win proposal.

From the results presented it seems clear that unless disk usage is critically important then bzip2 compression is not a viable solution. Increasing your processing time by over 500% for a 20% reduction in size does not seem to be a good trade off.  I’ve also checked using the command line gzip/bzip2 decompression utilities in case the effect I saw was an artefact of the java implementations, but the size of the difference between the two was similar there as well.

Choosing between gzip and BAM is less clear.  It’s probably fair to say that at the moment more analysis programs support gzipped fastq files as input than support BAM files (which is more normally seen as an output format), but this may change in the future.  BAM files offer the prospect of adding in mapping data alongside your sequence data with a minimised increase in filesize which may be a benefit to some.  If you’d prefer to keep your raw data separate from derived data then gzipped FastQ files would seem to be the better choice.

In our case we’re going to opt for simply gzipping our FastQ files since this seems to be a simple process which won’t affect any of our existing workflows or processing and which will return to us a significant amount of storage space.

Tags: , ,
Posted in Bioinformatics, Computing | Comments Off


Adding custom chromosome name mappings into SeqMonk

Saturday, June 11th, 2011

When loading data into SeqMonk the program has to try to connect the chromosome names used in your data file with those which are present in the genome one which your project is based.  In many cases there won’t be an exact match between the two – many mapping programs report file names in their output so you often see names like ‘chr1.fa’ used to represent chromosome 1.

In order to match as many entries as possible SeqMonk tries a few different tricks to match up the chromosome names if it doesn’t immediately find an exact match.  It will do simple manipulations such as removing ‘chr’ from the front of the name and removing things like ‘.fa’, ‘.txt’ and ‘.gz’ from the end. However in some cases it isn’t able to match up the names, and in these cases the read is rejected and you’ll get an import warning saying:

‘Couldn’t extract valid name for [some string which wasn't a chromosome name]‘

The most common offender for this message is the mitochondrion which is variously labelled as M, Mt, MT or some other minor variation.  The other one we commonly see are the yeast genomes which seem to alternate between using roman and arabic numbers to describe their chromosomes.

If you get this error and need to import the data which was rejected then how do you bring this data into SeqMonk? The most obvious way is to go back to your data file and rename your chromsomes to match those in SeqMonk.  Whilst this is simple to describe it’s often practically quite hard since data files are usually very large, and if they are in a non-text format (eg BAM), then modifying them isn’t at all straight forward.

A better alternative then is to tell SeqMonk in advance about any chromosome name mappings which it isn’t able to figure out for itself. There is a pretty simple way to do this, but it isn’t very obvious in the documentation and most people who hit the problem don’t seem to find it so I thought I’d raise its profile.

The way to add custom chromosome name mappings in SeqMonk is to create a text file called ‘aliases.txt’ in the folder which contains the data for the genome assembly you’re using.  If you were using the mouse NCBIM37 genome for your project the file would need to go inside the [Genomes dir]/Mus musculus/NCBIM37 folder.  This file is a 2-column tab delimited text file where the first column is the name you want to be able to use in your data files and the second column is the name of a chromosome in your SeqMonk genome.  Both of these names must be an exact match since the aliases are not subject to the same manipulations as normal chromosome names read from a data file.  There is no problem having multiple aliases pointing to the same SeqMonk chromosome, but you can’t have the same aliases pointing to more than one SeqMonk chromosome (if you do, then the last one wins).

So, and example aliases file would look like:

chro1      1
II         2
NC1234     3
c3         3
bob        X

Once this file is in place and SeqMonk has been restarted (or at least the genome has been reloaded) you should be able to import data files containing any of the names in the first column into your project without seeing any warnings.  This file is only required when the data is first imported, after which all the chromosome names are turned into SeqMonk names, so you don’t need to distribute the aliases file to be able to use SeqMonk projects which required that file to have their data initially imported.

Tags: ,
Posted in Bioinformatics | Comments Off


Interpreting the duplicate sequence plot in FastQC

Monday, May 23rd, 2011

Background

The one analysis module which seems to elicit more questions than any other is the duplicate sequence plot. Of all of the plots which the program generates it’s probably the one which causes the most warnings / errors in otherwise nice looking data. I’m happy to admit that it’s not always immediately obvious what the plot is telling you, but it does contain some really useful information, so it’s worth getting to grips with it!

The Basics

So what does the duplicate sequence plot tell you? As with all of the plots in FastQC it’s not there to tell you if your data is good or bad, it’s there to tell you if your data looks unusual in some way.  More specifically the point of the duplicate sequences plot is to tell you to what extent you are wasting the sequencing capacity you have paid for by simply resequencing the exact same sequences over and over again.

The point of the plot is that in a diverse random library, even at relatively high coverage you shouldn’t resequence the exact same region multiple times.  If across your entire library you are sequencing each fragment multiple times then in most cases you stop learning anything new about your library.  You could effectively get the same amount of information by doing less sequencing, which either means you would benefit from a more diverse library, or you could do less sequencing (eg mixing the library with others in a multiplexed experiment).

One of the slightly confusing aspects of the plot is the scale used for the y-axis, but this scale is a result of compromises made to allow FastQC to run in a reasonable amount of memory.  In an ideal world we would analyse every sequence in a library and count how many times each one occurred.  If we did that we could put absolute numbers on the plot and this would make it easier to interpret.  However, if we had a perfectly diverse library with no duplicated sequences this would mean holding the whole library in memory at once which isn’t practical on many machines.  What happens instead is that the analysis occurs only for the first 200,000 different sequences seen.  The number of occurrences of these sequences is then tracked through the rest of the file, but any new sequences after the first 200,000 are then discarded.  Because this precludes the possibility of generating absolute numbers we plot the results on a relative scale, with the number of sequences occurring exactly once being set as 100%, and everything else being shown relative to that number.  The plot value for duplicate 1 is therefore always 100%, and higher duplication levels can produce values either higher or lower than this.

In an ideal plot for a diverse library the values for duplicate levels above 1 should quickly decay to zero and stay there.  However there are several ways in which the plot can indicate other types of problem in the library.

How things can go wrong

There are a number of different potential modes of failure which show up in the duplicate sequences plot. Here we try to provide examples of some of the most common ones and explain how they might occur.

A small number of sequences dominate an otherwise diverse library

In some cases you may find that a very small number of sequences end up comprising a large proportion of your library. The most common example of this is contamination by adapter sequence but any small source of contamination could conceivably cause a similar effect.

In this case the shape of the duplicate plot would look normal, with the unique sequences being the highest value and the rest of the plot quickly decaying to values close to zero, however the overall duplicate percentage shown at the top of the plot will be very high. The reason the plot looks normal is that it is limited to a duplicate level of 10+ so a very small number of highly duplicated sequences won’t cause the plot to peak.

The easiest way to confirm this type of problem is to look at the overrepresented sequences module results which should list the overrepresented sequences along with the proportion of the library they comprise.

Every sequence in a library occurs a large number of times

If a library has very limited diversity is then subjected to high throughput sequencing then it’s possible that every sequence seen in the library is likely to be seen many times.

In this case the plot trace will be low through all duplication levels until the 10+ bin, where the plot will spike sharply, possibly showing a relative level of many hundreds of percent relative to the unique sequences. The overall duplicate level will probably also be high (>90%).

The root cause of this type of result is low diversity in the original library.  This could be because you are hugely over-sequencing a diverse library – a PhiX control lane for example would show a result like this, or it could be that the library diversity is unexpectedly low.  The most common cause of unexpected loss of diversity is PCR over-amplification where a small subset of the full library is artificially expanded through the use of too many cycles of amplification. Such libraries can also be found from experiment types which are expected to generate limited diversity, such as those which are based around restriction sites rather than random fragmentation.

A low level of duplication is seen across a whole library

In some cases you will see a less extreme plot where the rate at which the duplicate plot falls from unique sequences is slow – showing appreciable proportions of the library with duplication levels of 4-5, and it may show a small spike in the 10+ bin.

In some cases this will simply be a less extreme version of the last situation where every sequence in the library is subjected to low level duplication, and natural variation spreads this across a somewhat broad range.

In other cases though you might be dealing with a library which has highly variable levels of duplication – and this may have a biological rather than a technical cause.  The most common type of library to produce this type of plot is an RNA-Seq library.  In this type of library it is expected that some sequences will occur very frequently, and others will be very rare.  If you want to see the very rare sequences (eg low copy number transcripts), then you will have to greatly over-sequence the most frequent sequences (eg housekeeping genes), so a high level of duplication in part of the library is unavoidable.

A final warning

One final thing to bear in mind when looking at duplicate sequence plots is that they are based on an exact match between sequences. Any sequencing errors in the library will tend to create artificial diversity in the library – making identical sequences look different because of technical errors. To mitigate against this effect in long reads only the first 50bp of each sequence are used to assess the duplication level – however if the sequence quality in a library is very poor then the duplication plot for a heavily duplicated library might be made to look perfectly normal due to the introduced errors.  You should therefore always consider the results of the sequence quality plots (especially the per-base quality plot) alongside the duplication plot to gain a realistic assessment of the true level of duplication.

 

Tags: , , ,
Posted in Bioinformatics | Comments Off


How good is ‘good enough’ for research software

Sunday, September 12th, 2010

There are two linked problems which seem to face me with every piece of software I write for research use:

  1. When is the software complete enough to write a paper on it
  2. How to manage the versions and project description

I think that although similar questions arise within software written for general use, their answers are quite different for research software.

In a general software project it’s not unreasonable to set out with an idea of what you want the software to do and to have a list of features which, once complete, would mark the project as ready for a stable release.  Following the standard conventions then works reasonably well:

  • Alpha release – some features not implemented, any feature could be subject to change
  • Beta release – all (or nearly all) features implemented but bugs may still exist
  • Stable release – all features should be complete and code should be well tested to eliminate major bugs

On our projects page we used to put our software into these categories, but soon found that they really didn’t work well for research based software.  Most crucially, in normal software development progressing from alpha – beta – stable indicates an increase in quality, but in research software it peaks at beta, and stable is probably a bad sign.  For research software we now use a system more like the following:

  • Alpha – Software is still experimental.  It’s not been widely tested so treat all derived results with a healthy dose of scepticism and double check everything.  File formats and major concepts could still be subject to radical change.
  • Beta – The way the program works is mostly stable and the released functionality has been pretty well tested and results should mostly be reliable.  Specific details are still subject to change and more functionality could be added at any point, but saved files should still continue to work.
  • Legacy – Generally used for projects which aren’t being worked on any more.  If the functionality of the last release does what you want, then great, but no new functionality will be added, although obvious bugs will be fixed.

The most successful and active project are therefore under a pattern of continuous rolling development internally to adapt to the ongoing requirements of the research.  Public releases tend to happen either when a reasonably serious bug is found, or when a bunch of new features are complete and everything in the menus actually works.

The problem with this kind of development is that no project is ever ‘finished’, which means there’s no obvious point at which to pause to say ‘It’s time to write this up as a paper’.  There’s always one more thing you want to get finished before publishing.  Since our group is a core facility rather than a research group we’re not really under pressure to produce our own papers, since our focus is to provide a service – and this makes the problem worse.  We’ve had a few projects now where we kept deferring writing a paper to the point where the project moved into the ‘legacy’ phase and it was too late.

My new resolution therefore is to be less cautious about this.  As soon as a project is producing scientifically useful results and is a state of completion where I’d be happy for other people to use it without putting warning stickers all over it we should aim to get a paper out on it.  This will both help to promote work that we’re doing and get more people involved, but may also help to focus our attention on exactly which features we want to complete before committing to the publication.

Tags: ,
Posted in Bioinformatics, Computing | Comments Off


Where do you analyse next gen sequence data?

Tuesday, July 13th, 2010

We had an interesting discussion at the Bioinfo-core workshop at ISMB2010.  The discussion centred around the best way to handle the logistics of making sequence data available to using a sequencing service.  The problem is that the data is so big that even if you have a large central store you run the risk of the users taking copies of the data and filling up all of your day to day storage as well.

There are a couple of options which present themselves:

  1. You just pass a copy of the data to the users and let them worry about where they put it.  This is of course the easiest path, but will end up with users putting both data and subsequent analysis on insecure local disks on their workstations – potentially leading to big data losses when things inevitably go wrong.
  2. You keep all of the data on the central store and then expose this over the network (probably in a read only fashion) to the users.  This eliminates the problem of duplicated storage, but could still be problematic if the users generate large derived datasets.  It will also place a heavy stress on your network infrastructure as the data will effectively be downloaded every time a user runs an analysis against it.
  3. You keep all of the data on the central store and provide a remote access system for analysing it.  In this scenario you eliminate both the data storage and transfer problems since the data is always contained within the central facility and only views of the derived data are shown to the user.  The big downside is that the core facility is then responsible for managing the computational power required by all of their users for all of their analyses, which could add significant cost and workload to the core facility.

Option 3 is the most logical and efficient, but would require convincing users of a service that they should be paying up front for the costs associated with all of the future storage and analysis of their data.  Given that enterprise storage is significantly more expensive than most users would expect then this could be a hard sell, but it’s probably something we should at least try.

Another aspect to consider in all of this is that as the volume of publicly available sequencing data increases there are going to be many more occasions on which users will want to re-analyse public data rather than generating the data themselves.  Managing the influx of this external data is also likely to be a signficant challenge for core facilities.  These data sets are every bit as large as those created in house, and will take just as much computational resource to analyse.  It would make sense to be able to include these sorts of data sets into the same storage and analysis system you use for your in-house data, albeit that you will have the option to delete the primary data should disk space become limiting.

The long term solution to many of these problems may come from cloud computing.  Having data stored in central repositories and then making these available to servers in the cloud would provide a scalable solution which can provide the sort of cost effective flexible options we’re all likely to need.  The problem is that at the moment the costs of storage in the cloud make this sort of system uneconomic.  Cloud storage has not typically been associated with the analysis of huge datasets running IO bound analyses, but there’s no intrinsic reason why this shouldn’t be possible.

A wish for the future would be that cloud service providers recognise this potential market and cater to it.  What this would require would be to make a read-only copy of some of the main data repositories (SRA, GEO, EMBL etc) permanently available to any cloud server (possibly at a small cost).  Since these data repositories could be shared between users this wouldn’t be as arduous as having each user maintain their own data and would then let users simply scale their CPU demands to allow them to perform their desired analyses.  This would dramatically lower the barrier preventing people getting into this area and would reduce the cost for people not able to saturate a dedicated compute cluster with their own analyses.

Posted in Bioinformatics, Computing | Comments Off


Mapping Bisulphite Converted Sequence

Thursday, August 27th, 2009

I’ve been thinking lately about the best way to construct a mapping pipline for large sequence datasets which have been bisulphite converted.

Bisulphite conversion is mostly used to detect DNA methylation, although other uses are also being found.  The basic principle is that treating DNA with sodium bisulphite modifies cytosine bases such that when they are copied they pair with adenine instead of guanine.  Effectively all Cs are converted to Ts.  The exception is if the Cytosinse is methyated in which case it is not modified by bisulphite.

By comparing a bisulphite converted sequence to a known reference you can therefore tell which cytosines were methylated.

The problem is that in order to search a converted sequence to a reference you need a specialised search program which understands the changes introduced by the conversion and is still able to match these to the reference in their converted form.

There are a few programs available which allow you to do this such as BSMap or mrsFast.  These provide custom aligners which can match fully or partially converted sequence to a reference, but they can still leave a number of problems.

The main problem is that many types of experiment rely on being able to count the number of times a particular region is observed as methylated and unmethylated and using the ratio of the two to infer the overall methylation status.  The problem with doing this is that if your sequence is methylated it will still contain Cs, and therefore has more information than the converted form.  In many cases this means that the unconverted form is uniquely mappable within the genome, but the converted form isn’t.  In these cases you will bias your results in favour of the methylated sequences.

Also, bisulphite treatment is often not complete.  Some unmethylated Cs will not be converted just by chance.  Being able to make an assesment of the convertion efficiency allows you to correct the results you observe.  In mammals this is simple since methyation should only be observed where a C forms part of a CG dinucleotide.  Any protected Cs which aren’t part of a CG therefore represent incomplete conversion.  In other species such as Arabidopsis though methylation can occur on any C so a different way to measure conversion efficiency is required.

Our proposal for mapping is therefore quite simple.  All mapping must be done on fully converted sequence.  If we generate a fully converted genome sequence (which requires separate conversions for forward and reverse strands), then we can completely convert our sequence tags and try to map them against the converted genome.  If we find a unique match then we can compare the actual observed tag sequence to the unconverted genome sequence to obtain the methylation status.

There are a few advantages to this approach.

Firstly it requires no specialised mapping software.  Any of the standard mapping programs will be able to do the search, albeit with reduced efficiency due to the reduced complexity of the sequence.

Mapping with fully converted sequence ensures that there is no bias towards methylated sequences.  The downside of this is that some potentially mappable sequences are excluded – but at least we’re on a level playing field.

Operations such as conversion efficiency calculation can be introduced during the methylation calling step, and it is fairly simple to introduce rules to say whether a methylated base should appear as part of a larger sequence pattern.

Posted in Bioinformatics | Comments Off