Archive for August, 2009

Mapping Bisulphite Converted Sequence

Thursday, August 27th, 2009

I’ve been thinking lately about the best way to construct a mapping pipline for large sequence datasets which have been bisulphite converted.

Bisulphite conversion is mostly used to detect DNA methylation, although other uses are also being found.  The basic principle is that treating DNA with sodium bisulphite modifies cytosine bases such that when they are copied they pair with adenine instead of guanine.  Effectively all Cs are converted to Ts.  The exception is if the Cytosinse is methyated in which case it is not modified by bisulphite.

By comparing a bisulphite converted sequence to a known reference you can therefore tell which cytosines were methylated.

The problem is that in order to search a converted sequence to a reference you need a specialised search program which understands the changes introduced by the conversion and is still able to match these to the reference in their converted form.

There are a few programs available which allow you to do this such as BSMap or mrsFast.  These provide custom aligners which can match fully or partially converted sequence to a reference, but they can still leave a number of problems.

The main problem is that many types of experiment rely on being able to count the number of times a particular region is observed as methylated and unmethylated and using the ratio of the two to infer the overall methylation status.  The problem with doing this is that if your sequence is methylated it will still contain Cs, and therefore has more information than the converted form.  In many cases this means that the unconverted form is uniquely mappable within the genome, but the converted form isn’t.  In these cases you will bias your results in favour of the methylated sequences.

Also, bisulphite treatment is often not complete.  Some unmethylated Cs will not be converted just by chance.  Being able to make an assesment of the convertion efficiency allows you to correct the results you observe.  In mammals this is simple since methyation should only be observed where a C forms part of a CG dinucleotide.  Any protected Cs which aren’t part of a CG therefore represent incomplete conversion.  In other species such as Arabidopsis though methylation can occur on any C so a different way to measure conversion efficiency is required.

Our proposal for mapping is therefore quite simple.  All mapping must be done on fully converted sequence.  If we generate a fully converted genome sequence (which requires separate conversions for forward and reverse strands), then we can completely convert our sequence tags and try to map them against the converted genome.  If we find a unique match then we can compare the actual observed tag sequence to the unconverted genome sequence to obtain the methylation status.

There are a few advantages to this approach.

Firstly it requires no specialised mapping software.  Any of the standard mapping programs will be able to do the search, albeit with reduced efficiency due to the reduced complexity of the sequence.

Mapping with fully converted sequence ensures that there is no bias towards methylated sequences.  The downside of this is that some potentially mappable sequences are excluded – but at least we’re on a level playing field.

Operations such as conversion efficiency calculation can be introduced during the methylation calling step, and it is fairly simple to introduce rules to say whether a methylated base should appear as part of a larger sequence pattern.

Posted in Bioinformatics | Comments Off


Managing Really Large Data Sets

Thursday, August 27th, 2009

For a while now I’ve been working with next generation sequencing datasets.  Each dataset consists of around 10 million mapped genome positions, and an experiment can consist of 10 or more datasets.

When analysing this data memory usage is a major issue.  Up until now our approach has been to try to store everything in memory as efficiently as possible but even using machines with 12+GB of memory we’re hitting scaling limits.

To try to alleviate this I’ve been investigating using disk based storage rather than putting the data in RAM.  I’d avoided this before as I wanted the application to be as responsive as possible and didn’t want the overhead of the extra IO.

I’ve now taken the plunge and started experimenting with disk based storage, and my initial results are encouraging.

I chose to go with SQLite using the SQLiteJDBC connector.  My inital attempt was pretty basic using only a single table as a sort of disk based Hashtable and I hadn’t even got round to adding indices to speed up searches.  Despite this the speed of data access was so quick I couldn’t tell the difference from the in-memory storage, and the memory usage dropped significantly.

Since this worked out so successfully I’m not considering doing a major rewrite to put all of the application data into a single large SQLite database and just caching objects for the currently visible chromosome in memory (to allow quick redraws).  This is going to require some major refactoring, but it should allow the program to scale to much larger datasets and should make opening a project a virtually instantaneous operation.

Posted in Bioinformatics, Computing | Comments Off