Adding custom chromosome name mappings into SeqMonk

When loading data into SeqMonk the program has to try to connect the chromosome names used in your data file with those which are present in the genome on which your project is based.  In many cases there won’t be an exact match between the two: many mapping programs report file names in their output, so you often see names like ‘chr1.fa’ used to represent chromosome 1.

In order to match as many entries as possible SeqMonk tries a few different tricks to match up the chromosome names if it doesn’t immediately find an exact match.  It will do simple manipulations such as removing ‘chr’ from the front of the name and removing things like ‘.fa’, ‘.txt’ and ‘.gz’ from the end. However, in some cases it still isn’t able to match up the names; the read is then rejected and you’ll get an import warning saying:
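To make the idea concrete, here is a rough Python sketch of that kind of name matching. It is not SeqMonk’s actual code (SeqMonk is a Java program); the function names, the order in which the manipulations are tried and the case-insensitive handling are all my own assumptions for illustration.

```python
# Illustrative only: a guess at the style of matching described above,
# not SeqMonk's real implementation.

SUFFIXES = (".fa", ".txt", ".gz")  # endings mentioned in the post


def candidate_names(raw_name):
    """Return progressively simplified versions of a chromosome name."""
    name = raw_name.strip()
    candidates = [name]  # always try the exact name first

    # Strip known file endings repeatedly, so 'chr1.fa.gz' becomes 'chr1'.
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            if name.lower().endswith(suffix):
                name = name[: -len(suffix)]
                stripped = True
    candidates.append(name)

    # Finally try removing a leading 'chr'.
    if name.lower().startswith("chr"):
        candidates.append(name[3:])
    return candidates


def resolve(raw_name, genome_chromosomes):
    """Return the matching genome chromosome name, or None if nothing fits."""
    for candidate in candidate_names(raw_name):
        if candidate in genome_chromosomes:
            return candidate
    return None
```

With a genome whose chromosomes are named 1, 2 and MT, this sketch resolves ‘chr1.fa’ to 1 but returns None for ‘chrM’, which is the sort of name that ends up being rejected.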

‘Couldn’t extract valid name for [some string which wasn’t a chromosome name]’

The most common offender for this message is the mitochondrion, which is variously labelled as M, Mt, MT or some other minor variation.  The other common offenders are the yeast genomes, which seem to alternate between using roman and arabic numerals to describe their chromosomes.

If you get this error and need to import the data which was rejected then how do you bring this data into SeqMonk? The most obvious way is to go back to your data file and rename your chromosomes to match those in SeqMonk.  Whilst this is simple to describe it’s often practically quite hard since data files are usually very large, and if they are in a non-text format (e.g. BAM), then modifying them isn’t at all straightforward.

A better alternative then is to tell SeqMonk in advance about any chromosome name mappings which it isn’t able to figure out for itself. There is a pretty simple way to do this, but it isn’t very obvious in the documentation and most people who hit the problem don’t seem to find it so I thought I’d raise its profile.

The way to add custom chromosome name mappings in SeqMonk is to create a text file called ‘aliases.txt’ in the folder which contains the data for the genome assembly you’re using.  If you were using the mouse NCBIM37 genome for your project the file would need to go inside the [Genomes dir]/Mus musculus/NCBIM37 folder.  This file is a 2-column tab delimited text file where the first column is the name you want to be able to use in your data files and the second column is the name of a chromosome in your SeqMonk genome.  Both of these names must be an exact match since the aliases are not subject to the same manipulations as normal chromosome names read from a data file.  There is no problem having multiple aliases pointing to the same SeqMonk chromosome, but you can’t have the same alias pointing to more than one SeqMonk chromosome (if you do, then the last one wins).

So, an example aliases file would look like this (the two columns are separated by a tab):

chro1      1
II         2
NC1234     3
c3         3
bob        X
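If you’d rather not type the tabs by hand, a small script can write the file for you. The sketch below is just one way to do it in Python; the helper names write_aliases and read_aliases are made up for this post and are nothing to do with SeqMonk itself. The reader keeps the last entry for a repeated alias, to mirror the ‘last one wins’ behaviour described above, and skipping malformed lines is my own choice rather than something SeqMonk is documented to do.

```python
import csv
from pathlib import Path


def write_aliases(path, pairs):
    """Write (alias, seqmonk_chromosome) pairs as a 2-column tab delimited file."""
    with Path(path).open("w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(pairs)


def read_aliases(path):
    """Read an aliases file back into a dict of alias -> chromosome.

    If an alias appears more than once, the later line overwrites the
    earlier one (an assumption matching the 'last one wins' rule).
    """
    aliases = {}
    with Path(path).open(newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) != 2:
                continue  # skip blank or malformed lines (my choice)
            alias, chromosome = row
            aliases[alias] = chromosome
    return aliases
```

Calling write_aliases with a path of ‘aliases.txt’ and the pairs from the example above gives you a file you can drop into the genome folder, and read_aliases is a quick way to check that no alias has accidentally been given two different targets.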

Once this file is in place and SeqMonk has been restarted (or at least the genome has been reloaded) you should be able to import data files containing any of the names in the first column into your project without seeing any warnings.  This file is only required when the data is first imported, after which all the chromosome names are turned into SeqMonk names, so you don’t need to distribute the aliases file to be able to use SeqMonk projects which required that file to have their data initially imported.

Published: June 11, 2011
Category: Bioinformatics