Choosing the best format for raw sequence data

Introduction: In the current Illumina pipeline, raw sequence data is generated as qseq files, but it can optionally be converted to the more standard FastQ format for use with other analysis programs. The FastQ files produced are uncompressed text files and take up a considerable amount of space in our storage system. We’ve therefore been thinking…

Published: June 16, 2011

Computing

Adding custom chromosome name mappings into SeqMonk

When loading data into SeqMonk, the program has to try to connect the chromosome names used in your data file with those present in the genome on which your project is based. In many cases there won’t be an exact match between the two – many mapping programs report file names in their…

Published: June 11, 2011

Bioinformatics

Interpreting the duplicate sequence plot in FastQC

Background: The one analysis module which seems to elicit more questions than any other is the duplicate sequence plot. Of all of the plots which the program generates, it’s probably the one which causes the most warnings / errors in otherwise nice-looking data. I’m happy to admit that it’s not always immediately obvious what…

Published: May 23, 2011

Bioinformatics

How good is ‘good enough’ for research software

There are two linked problems which seem to face me with every piece of software I write for research use: (1) when is the software complete enough to write a paper on it, and (2) how to manage the versions and project description. I think that although similar questions arise within software written for general use, their answers…

Published: September 12, 2010

Computing

Where do you analyse next gen sequence data?

We had an interesting discussion at the Bioinfo-core workshop at ISMB2010. The discussion centred around the best way to handle the logistics of making sequence data available to those using a sequencing service. The problem is that the data is so big that even if you have a large central store you run the risk of…

Published: July 13, 2010

Computing

Mapping Bisulphite Converted Sequence

I’ve been thinking lately about the best way to construct a mapping pipeline for large sequence datasets which have been bisulphite converted. Bisulphite conversion is mostly used to detect DNA methylation, although other uses are also being found. The basic principle is that treating DNA with sodium bisulphite modifies cytosine bases such that when they…

Published: August 27, 2009

Bioinformatics

Managing Really Large Data Sets

For a while now I’ve been working with next generation sequencing datasets. Each dataset consists of around 10 million mapped genome positions, and an experiment can consist of 10 or more datasets. When analysing this data, memory usage is a major issue. Up until now our approach has been to try to store everything in…

Published: August 27, 2009

Computing

Scientific Instrument Software

As a bioinformatician I find myself spending too much of my time working around poor software supplied with scientific instruments. I’m continually amazed that hardware which can cost hundreds of thousands of pounds is very often let down by the control and analysis software supplied with it. I suspect that the fault for this lies…

Published: October 10, 2008

Bioinformatics

Comments closed