I’ve been thinking lately about the best way to construct a mapping pipline for large sequence datasets which have been bisulphite converted.
Bisulphite conversion is mostly used to detect DNA methylation, although other uses are also being found. The basic principle is that treating DNA with sodium bisulphite modifies cytosine bases such that when they are copied they pair with adenine instead of guanine. Effectively all Cs are converted to Ts. The exception is if the Cytosinse is methyated in which case it is not modified by bisulphite.
By comparing a bisulphite converted sequence to a known reference you can therefore tell which cytosines were methylated.
The problem is that in order to search a converted sequence to a reference you need a specialised search program which understands the changes introduced by the conversion and is still able to match these to the reference in their converted form.
There are a few programs available which allow you to do this such as BSMap or mrsFast. These provide custom aligners which can match fully or partially converted sequence to a reference, but they can still leave a number of problems.
The main problem is that many types of experiment rely on being able to count the number of times a particular region is observed as methylated and unmethylated and using the ratio of the two to infer the overall methylation status. The problem with doing this is that if your sequence is methylated it will still contain Cs, and therefore has more information than the converted form. In many cases this means that the unconverted form is uniquely mappable within the genome, but the converted form isn’t. In these cases you will bias your results in favour of the methylated sequences.
Also, bisulphite treatment is often not complete. Some unmethylated Cs will not be converted just by chance. Being able to make an assesment of the convertion efficiency allows you to correct the results you observe. In mammals this is simple since methyation should only be observed where a C forms part of a CG dinucleotide. Any protected Cs which aren’t part of a CG therefore represent incomplete conversion. In other species such as Arabidopsis though methylation can occur on any C so a different way to measure conversion efficiency is required.
Our proposal for mapping is therefore quite simple. All mapping must be done on fully converted sequence. If we generate a fully converted genome sequence (which requires separate conversions for forward and reverse strands), then we can completely convert our sequence tags and try to map them against the converted genome. If we find a unique match then we can compare the actual observed tag sequence to the unconverted genome sequence to obtain the methylation status.
There are a few advantages to this approach.
Firstly it requires no specialised mapping software. Any of the standard mapping programs will be able to do the search, albeit with reduced efficiency due to the reduced complexity of the sequence.
Mapping with fully converted sequence ensures that there is no bias towards methylated sequences. The downside of this is that some potentially mappable sequences are excluded – but at least we’re on a level playing field.
Operations such as conversion efficiency calculation can be introduced during the methylation calling step, and it is fairly simple to introduce rules to say whether a methylated base should appear as part of a larger sequence pattern.